
Deep Learning on Relational Business Data

Written by Johannes King
Published on July 22, 2020

The year is 2020. The world is entirely dominated by Deep Learning applications. Well, not entirely... This post highlights the blank spot on the map.

We often claim that getML transfers the deep learning revolution to relational data and time series. But what precisely do we mean by that? This article sheds light on which aspects of deep learning getML applies to relational business data and explains why this indeed bears comparison with a revolution.

Prologue: Deep learning

The great public interest in artificial intelligence is mostly triggered by advancements in a specific subdomain of the field - deep learning. But what is special about deep learning? In order to answer this question, let us disentangle the terms Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). These terms are often used interchangeably although they actually represent different concepts.

AI is used to describe computer programs that are capable of performing tasks that typically require human intelligence. A typical example of such a task is playing chess, prominently associated with the triumph of Deep Blue over world champion Garry Kasparov in 1997. Interestingly, playing chess is also a good example of a task that requires AI but no ML. Why is that?

Deep Blue is based on a search algorithm that evaluates millions of possible positions, guided by hand-crafted evaluation rules and a database of known openings and endgames from past matches. But it never actually learned how to play chess. Other games like tic-tac-toe or checkers can be mastered by machines in a similar manner: by hard-coding the rules and possible states.

Enter Machine Learning

ML, on the other hand, can be thought of as one of several ways to achieve AI. Certain tasks require a computer to mimic the human process of learning. In other words, computers sometimes perform better when they are programmed to learn intelligence rather than to be intelligent in the first place.

In order to learn how to solve a certain task, computers need one special capability: to generate knowledge from examples. This is essentially the way we acquire our own knowledge as humans: we gain experience and thus learn new skills over the course of our lives. In order to learn how to distinguish different colours, we need to see examples and someone to tell us what the examples we are shown are called.

Similarly, an ML system extracts correlations from so-called training data and applies them to unknown data sets (strictly speaking, this is only true for supervised ML techniques). This allows the algorithm to solve a given task on new data that was not available during training.

Artificial intelligence > machine learning > deep learning

A typical application of ML is spam filtering: most spam detection algorithms measure word frequencies in the email text and look for differences between example data labeled “spam” and “no spam”. Thus, they learn to get progressively better at detecting spam without having been specifically programmed for that task.
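To make this concrete, here is a minimal sketch of such a word-frequency spam filter in Python, using scikit-learn. The toy emails and labels are made up purely for illustration.

```python
# A minimal sketch of a word-frequency spam filter using scikit-learn.
# The tiny example emails and labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now",
    "cheap meds, limited offer",
    "meeting rescheduled to monday",
    "please review the attached report",
]
labels = ["spam", "spam", "no spam", "no spam"]

# CountVectorizer turns each email into word frequencies; the classifier
# learns which frequencies separate the two labels.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize offer"]))       # likely "spam"
print(model.predict(["see the report monday"]))  # likely "no spam"
```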

But what about Deep Learning?

Like humans, computers can learn using a variety of approaches. DL is one of these approaches. It is a special kind of ML that has aroused great excitement in the 2010s because of its jaw-dropping precision in a range of tasks based on images, text and acoustic data. The DL revolution has led to enormous progress in autonomous driving. Examples like Google Translate or Siri demonstrate the power of DL in the field of natural language processing. DL-based systems like Pluribus and DeepMind's AlphaGo have beaten professional poker players and the world champion in Go. None of this was possible with classical, rule-based AI.

Technically, DL is based on deep neural nets. Neural nets are algorithms inspired by the human brain: connected artificial neurons, so-called nodes, can transmit signals to other nodes. A node that receives a signal processes it and signals the nodes connected to it. A neural net consists of a certain number of layers of such nodes. The increase in computing power over the last decade has enabled the creation of huge neural nets with many layers. These nets are called deep neural nets.
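As a rough sketch of what such a stack of layers looks like in code, the following toy example in plain numpy propagates a signal through several layers of nodes. The layer sizes and random weights are arbitrary and chosen only for illustration.

```python
# A minimal sketch of a deep neural net's forward pass in plain numpy:
# each layer's nodes compute weighted sums of the incoming signals and
# pass them through a non-linear activation to the next layer.
# Layer sizes and random weights are arbitrary, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 128, 64, 10]  # e.g. 784 raw pixel inputs, two hidden layers, 10 outputs

# One weight matrix and bias vector per connection between layers.
weights = [rng.normal(0, 0.1, (n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Propagate an input signal through all layers."""
    for w, b in zip(weights, biases):
        x = np.maximum(0.0, x @ w + b)  # ReLU activation
    return x

outputs = forward(rng.random(784))  # a fake "image" of 784 pixel values
print(outputs.shape)                # (10,)
```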

One of the greatest success stories of DL is image recognition. To fully grasp the importance of this achievement, let us review how image recognition works on a human level: if we want to teach children what a cat is, we use example images in picture books or the like. We would take a page with images of all sorts of animals and point out which of them show a cat until the child is able to recognise cats in unseen images (and hopefully in real life) on its own.

But how does the child actually learn this ability? One approach would be to draw the child's attention to certain features like the colour, ears, whiskers or eyes of the animal. This is analogous to image recognition algorithms before the advent of DL - they relied on so-called manual feature engineering. Data scientists had to look for meaningful features in the images to be analysed and teach their algorithms how to detect them. These approaches, however, were quite limited in performance.
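A hand-engineered image feature from the pre-DL era could be as simple as the average colour and a coarse brightness histogram. The sketch below illustrates the idea on a random array standing in for real pixel data.

```python
# A sketch of manual feature engineering on an image: hand-coded features
# such as the average colour and a coarse brightness histogram.
# The "image" here is just a random array standing in for real pixels.
import numpy as np

image = np.random.default_rng(1).integers(0, 256, size=(64, 64, 3))

def handcrafted_features(img):
    mean_rgb = img.mean(axis=(0, 1))   # average colour per channel
    brightness = img.mean(axis=2)      # grey-scale intensity per pixel
    histogram, _ = np.histogram(brightness, bins=8, range=(0, 255))
    return np.concatenate([mean_rgb, histogram])

features = handcrafted_features(image)
print(features.shape)  # (11,) - 3 colour means + 8 histogram bins
```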

Kitten v Ice Cream - source @teenybiscuit

The reason for the mediocre performance of ML systems in image recognition before DL is best illustrated by an example. Consider this image of cats and ice cream. It is not easy to tell the difference, right? Which features would you use to tell the cats apart from the ice cream? How can a computer tell them apart? Defining selective features in this case is difficult - even for the human eye. So, would it not be more efficient for the computer to find the best features without human intervention?

Exit Manual Feature Engineering

This is where the power of DL comes into play. It does not rely on manual feature engineering. Deep neural nets can automatically learn the right features from images without hand-coded rules or human domain knowledge - their flexible architectures can learn directly from raw pixel data. DL models are trained by feeding them numerous cat photos to learn from beforehand. But what does that mean concretely?

As humans, we are able to learn to detect a cat based solely on images - the raw data, so to speak. The generation of features that allow us to recognise a cat is part of the learning process. The same is true for deep neural nets. The only input we provide them during the training process are raw images. They learn the relevant features themselves. Each successive layer is able to learn more complex features in the process. The first layer might, for example, pick up the vague outline of an object, while a later one detects finer details on its surface.

A huge advantage of this approach - besides being much more convenient - is the fact that it scales with the amount of data that is fed to the network. Whereas traditional approaches based on manually engineered features typically reach a plateau at some point, the performance of deep neural nets keeps increasing. This is quite plausible given that they keep generating new and more detailed features. Besides the increase in computing power, the growing amount of available training data is the second pillar of the DL success story.

A comparison of deep learning and older learning algorithms in terms of the relationship between the amount of data and performance: with more data, deep learning performance keeps growing, while older learning algorithms plateau.

The missing piece: Relational data

While DL has been enormously successful over the last decade, it can only solve a limited set of problems. Its input data is restricted to images, text or sound. When moving from cat recognition to problems that are more relevant for business analytics, the raw input data is mostly available in a different, much more complex form: relational databases.

Relational databases capture complex relationships between entities and their attributes. A bank, for example, might store information about its customers, such as the customer ID and the closing date, in one table, information about the transactions made by those customers in another table, and interactions of its customers with the complaints office in yet another table. In order to fully describe a customer, data from all of these tables has to be taken into account. Adding more information from different sources quickly leads to quite complex data schemata.

An example data model with one population table joined with transactions and complaints
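To make this structure tangible, here is a toy version of such a data model - a population table of customers plus transactions and complaints, linked by a customer ID. All column names and values are invented for illustration.

```python
# A toy version of such a relational schema: three tables linked by a
# customer ID. All column names and values are made up for illustration.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churn":       [0, 1, 0],  # the label we later want to predict
})

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount":      [120.0, 35.5, 9.9, 250.0, 80.0, 15.0],
    "date":        pd.to_datetime([
        "2020-01-03", "2020-02-10", "2020-01-20",
        "2020-01-05", "2020-03-01", "2020-03-15",
    ]),
})

complaints = pd.DataFrame({
    "customer_id": [2, 2, 3],
    "topic":       ["fees", "card blocked", "app outage"],
})

# A full description of a customer only emerges once all tables are joined.
print(customers.merge(transactions, on="customer_id").head())
```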

But what do we want to do with all this data? A good example of an ML application is customer churn prediction: we want to know whether a customer is going to leave the bank before they actually submit the termination letter - or even before they plan on sending one. This is the very foundation of a successful customer retention strategy.

Detecting a churning customer in the database of a bank is conceptually not so different from recognising a cat in an image: by exposing the computer to a large training data set with known labels (churn/no churn), we want it to learn the characteristics of a churning customer on its own. The key to solving this task, however, is hidden in the relational data. It is impossible to detect a churning customer using only one source of information; the whole relational data schema must be taken into account. The trouble is, though, that traditional ML algorithms cannot process relational data.

Doing it the old way

How did data scientists cope with this problem in the past? Just like in image classification before the advent of DL, they did manual feature engineering, using hand-coded rules combined with human domain knowledge. In a customer churn analysis, a typical feature might be the average transaction volume over a certain period of time in the past. If it falls below a certain threshold, this might indicate that the customer is not using their account anymore and has switched to another bank. This information can be retrieved from a relational database by writing queries in SQL or scripts in languages like Python or R.

An illustration of a manual feature engineering process
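Using toy tables like the ones above, such a hand-crafted feature might look as follows in pandas. The 90-day window and the cut-off date are exactly the kind of arbitrary choices a data scientist has to make by hand.

```python
# A sketch of one hand-crafted churn feature: the average transaction
# amount per customer over an arbitrary time window. The window length
# and cut-off date are choices the data scientist has to make manually.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "churn": [0, 1, 0]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [120.0, 35.5, 9.9, 250.0, 80.0, 15.0],
    "date": pd.to_datetime(["2020-01-03", "2020-02-10", "2020-01-20",
                            "2020-01-05", "2020-03-01", "2020-03-15"]),
})

cutoff, window = pd.Timestamp("2020-03-31"), pd.Timedelta(days=90)
recent = transactions[transactions["date"].between(cutoff - window, cutoff)]

avg_amount = (recent.groupby("customer_id", as_index=False)["amount"]
                    .mean()
                    .rename(columns={"amount": "avg_amount_90d"}))

# Merge the feature onto the population table; customers without recent
# transactions get 0, which may itself be a churn signal.
features = customers.merge(avg_amount, on="customer_id", how="left").fillna(0.0)
print(features)
```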

The questions that arise from this are numerous: What is the best time window to consider? Is the most meaningful feature really the average transaction volume or maybe a single transaction with the maximum volume? What about the other tables, like the one containing information about complaints? How should it be taken into account?

In order to answer these questions, data scientists have to create not just one but hundreds or thousands of features when working on real-world applications. The tedious work of manual feature engineering is the reason why data science projects take months from the definition of a business case to a production-ready solution. Or, to quote Andrew Ng, computer scientist and co-founder of Google Brain:

"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning’ is basically feature engineering."

The problem with manual feature engineering

Besides being painful for data scientists, manual feature engineering comes with some severe problems that can jeopardize an entire ML project:

  • Data scientists are not domain experts. More often than not, domain-specific knowledge is required to find the most meaningful features in a data set. This forces data scientists to become acquainted with each field in which they want to complete a project.
  • Manual feature engineering is time-consuming. In a typical data science project, 90% of the time is spent on data preparation and feature engineering. This leaves data scientists little time for their actual job: training machine learning models that solve a real-world business task.
  • Important features will be overlooked. Especially with today's growing amounts of data, it is simply not possible for a human to go through all available data tables and think of every potentially important feature. Thus, manual feature engineering is an error-prone process.

While these problems are widely acknowledged in the data science community, there is also a consensus that the quality of the features used in a data science project has a crucial impact on the success or failure of the entire endeavour. The better the prepared features, the better the results. Or, to say it with Pedro Domingos, ML expert and professor at the University of Washington:

"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."

So, having said all of the above, wouldn’t it be valuable to have something similar to a deep neural network for relational business data? An algorithm that can process the raw data and learn the relevant features as part of the training process? Something that would make manual feature engineering obsolete?

GetML to the rescue

Similar to the advance of image recognition after the advent of DL, we believe that automating feature engineering is the next step into the future of ML on relational data. It is the most difficult step in a data science project and the one that bears the biggest potential for improvement. With getML, we created the first tool to efficiently automate feature engineering for relational data and time series.

How does this concretely affect data science projects? Let us go back to the customer churn analysis. Instead of handcrafting numerous features, data scientists can start the analysis with the raw, relational customer data. Instead of reducing the richness of the entire data set by boiling it down to a limited set of features, they can feed the data to their computers as a whole. As in DL, they can now use the whole picture as input, not only parts of it.

An illustration showing the different processes between manual feature engineering and using getML

How does that improve anything?

The consequences of using getML in data science projects are:

  • The project can be completed in a significantly shorter amount of time. Data scientists are no longer required to become domain experts; they can tackle challenges even if they are only vaguely familiar with the corresponding field. Also, the tedious manual work of writing code to construct features is eliminated. Data scientists can instead focus on the actual ML part of their work.
  • The prediction accuracy increases. Even though getML will certainly not beat a human data scientist on every imaginable task on relational data, in practice it most often does. This is due to the time constraints of most projects, combined with amounts of data too extensive for the people working on the project to fully survey. Sometimes the haystack is so large that an automated procedure is the only way to find the needle - the best features.
  • More use cases can be tested with the same resources. Potentially valuable business cases are often not pursued because the manpower to start a full-scale data science project is not at hand. GetML allows quick results for new challenges without a large upfront investment. This makes it more likely that impactful business cases are tested and brought into production.

This article has briefly introduced the main concepts behind DL in contrast to traditional ML techniques and highlighted the limitations of currently available solutions for relational business data. Most real-world data, however, comes in relational form, so a solution to this problem is imperative. With getML, we strive to provide such a solution. Since this parallels the DL revolution both conceptually and in terms of its potential impact, we think that our initial claim - that getML transfers the DL revolution to relational business data - is justified. Do you agree?