8 Machine Learning Algorithms explained in Human language

What we call “Machine Learning” is none other than the meeting of statistics and the incredible computation power available today (in terms of memory, CPUs, GPUs). This domain has become increasingly visible and important because of the digital revolution of companies, which leads to the production of massive data of different forms and types, at ever increasing rates: Big Data. On a purely mathematical level, most of the algorithms used today are already several decades old. In this article I will explain the underlying logic of 8 machine learning algorithms in the simplest possible terms.
I. Some global concepts before describing the algorithms
1. Classification and Prediction / Regression
Classification
Assigning a class / category to each of the observations in a dataset is called classification. It is done a posteriori, once the data has been collected.
Example: classifying consumers by their reasons for visiting a store in order to send them a personalized campaign.
Prediction
A prediction is made on a new observation. When it comes to a numerical variable (continuous) we speak of regression.
Example: predicting a heart attack based on data from an electrocardiogram.
2. Supervised and unsupervised learning
Supervised
You already have tags on historical data and want to classify new data according to these tags. The number of classes is known.
Example: in botany you made measurements (length of the stem, petals, …) on 100 plants of 3 different species. Each of the measurements is labeled with the species of the plant. You want to build a model that will automatically tell which species a new plant belongs to thanks to the same measurements.
Unsupervised
On the contrary, in unsupervised learning, you have no labels, no predefined classes. You want to identify common patterns in order to form homogeneous groups based on your observations.
Examples: You want to classify your customers based on their browsing history on your website, but you have not formed any groups yet and take an exploratory approach to see what they might have in common. In this case a clustering algorithm is well suited.
Some neural network algorithms will be able to differentiate between human and animal images without prior labeling.
II. Machine Learning Algorithms
We will describe 8 algorithms used in Machine Learning. The objective here is not to go into the details of the models but rather to give the reader elements of understanding on each of them.
1. “The Decision Tree”
A decision tree is used to classify future observations given a set of already labeled observations. This is the case of our botanical example, where we already have 100 observations classified into species A, B and C.
The tree begins with a root (where we still have all our observations), then comes a series of branches whose intersections are called nodes and whose ends are called leaves, each corresponding to one of the classes to predict. The depth of the tree refers to the maximum number of nodes before reaching a leaf. Each node of the tree represents a rule (example: length of the petal greater than 2.5 cm). Browsing the tree means checking a series of rules. The tree is constructed in such a way that each node corresponds to the rule (variable and threshold) that best divides the set of initial observations.
Example:
Decision tree data
Decision Tree
The tree has a depth of 2 (one node plus the root). The length of the petal is the first measure that is used because it best separates the 4 observations according to class membership (here class B).
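To make this concrete, here is a minimal sketch using scikit-learn. The measurements and species labels below are invented for illustration; the point is simply fitting a shallow tree and classifying a new plant.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [stem length (cm), petal length (cm)]
X = [[4.0, 1.2], [4.5, 1.4], [6.0, 4.1], [6.3, 4.4], [7.1, 6.0], [7.5, 6.2]]
y = ["A", "A", "B", "B", "C", "C"]  # known species labels

# Limit the depth so the tree stays readable, as in the example above
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Classify a new plant from the same measurements
print(tree.predict([[6.1, 4.0]]))  # -> most likely "B"
```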
2. “Random Forests”
As the name might suggest, the random forest algorithm is based on a multitude of decision trees.
In order to understand better the advantages and logic of this algorithm, let’s start with an example:
You are looking for a good travel destination for your next vacation. You ask your best friend for his opinion. He asks you questions about your previous trips and makes a recommendation.
You decide to ask a group of friends who ask you questions randomly. They each make a recommendation. The chosen destination is the one that has been the most recommended by your friends.
The recommendations made by your best friend and by the group of friends will both be good destination choices. But while the first recommendation method works very well for you, the second will be more reliable for other people.
This comes from the fact that your best friend, who builds a decision tree to give you a destination recommendation, knows you so well that his decision tree has over-learned about you (this is called overfitting).
Your group of friends represents the random forest of multiple decision trees, a model which, when used properly, avoids the pitfall of overfitting. How is this forest built?
Here are the main steps:
1. We draw a number of observations from the starting dataset (with replacement).
2. We take a number K of the M variables available (features), for example: only temperature and population density
3. We create a decision tree on this dataset.
4. Steps 1 to 3 are repeated N times so as to obtain N trees.
To obtain the class of a new observation, we go down the N trees. Each tree produces its own prediction; the class chosen is the one that is most represented among all the trees in the forest (majority vote / ‘ensemble’).
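A minimal sketch of the same idea with scikit-learn, reusing the invented botanical data from above: n_estimators sets the number N of trees, max_features the number K of variables drawn at each split, and the prediction is a majority vote.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[4.0, 1.2], [4.5, 1.4], [6.0, 4.1], [6.3, 4.4], [7.1, 6.0], [7.5, 6.2]]
y = ["A", "A", "B", "B", "C", "C"]

# Bootstrap sampling (drawing with replacement) is enabled by default
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)

# Each tree votes; the majority class is returned
print(forest.predict([[6.1, 4.0]]))
```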
3. “Gradient Boosting” / “XGBoost”
The gradient boosting method is used to reinforce a model that produces weak predictions, such as a decision tree (see below how we judge the quality of a model).
We will explain the principle of gradient boosting with decision trees, but it could be applied to another model.
You have a database of individuals with demographic information and past activities. You know the age of 50% of the individuals; for the other half it is unknown.
You want to predict the age of a person from his or her activities: food shopping, television, gardening, video games… You choose a decision tree as your model; in this case it is a regression tree because the value to predict is numeric.
Your first regression tree is satisfying but can be improved: it predicts, for example, that an individual is 19 years old when in fact he is 13, and that another is 55 instead of 68.
The principle of gradient boosting is to fit a new model on the difference between the predicted value and the true value to be predicted (the residual).
Age | Prediction (tree 1) | Difference | Prediction (tree 2)
13  | 19                  | -6         | 15
68  | 55                  | +13        | 63
This step is repeated N times, where N is determined by successively minimizing the error between the prediction and the true value.
The optimization method used is gradient descent, which we will not explain here. The XGBoost (eXtreme Gradient Boosting) model, created by Tianqi Chen, is one of the implementations of gradient boosting and has won over the Kaggle data scientist community with its efficiency and performance. The publication explaining the algorithm is available online.
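Here is a minimal sketch of the age example with scikit-learn's gradient boosting regressor, whose trees are fitted successively on the residuals described above. The activity features and ages are invented for illustration.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Rows: [hours of TV per week, hours of video games, hours of gardening]
X = [[20, 15, 0], [25, 20, 0], [10, 0, 6], [8, 0, 10]]
y = [13, 16, 55, 68]  # known ages (the labeled half of the database)

# n_estimators = number of successive trees; learning_rate scales each correction
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=2)
model.fit(X, y)

print(model.predict([[22, 18, 0]]))  # predicted age for an unlabeled individual
```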
4. “Genetic Algorithms”
As their name suggests genetic algorithms are based on the process of genetic evolution that has made us who we are …
More prosaically, they are mainly used when there are no starting observations and the goal is for a machine to learn as it goes, through trial and error.
These algorithms are not the most effective for a specific problem, but rather for a set of subproblems (e.g. learning balance and walking in robotics).
Let’s take a simple example: We want to find the code of a safe that is made of 15 letters: “MACHINELEARNING”
The genetic algorithm approach will be as follows:
We start from a population of 10,000 “chromosomes” of 15 letters each. We assume that the code is a word or a combination of words, for example:
“DEEP-LEARNING”
“STATISTICAL INFERENCE-“
“HUMAN MACHINE INTERFACE” etc.
We will define a method of reproduction: for example, to combine the beginning of one chromosome with the end of another.
Ex: “DEEP-LEARNING” + “STATISTICAL-INFERENCE” = “DEEP-INFERENCE”
Then we define a mutation method which makes it possible to alter an offspring that is stuck: in our case, it could be to change one of the letters randomly. Finally, we define a score that rewards some chromosome offspring over others. In our case, where the code is hidden, we can imagine a sound that the safe would make when 80% of the letters are correct and that would get louder as we approach the right code.
Our genetic algorithm will start from the initial population and combine chromosomes until the solution has been found.
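Below is a toy Python sketch of this safe-code example. The population size, mutation rate and selection scheme are arbitrary choices, the score simply counts matching letters instead of listening to the safe, and candidates are restricted to the letters A–Z for simplicity.

```python
import random
import string

TARGET = "MACHINELEARNING"          # the hidden 15-letter code
LETTERS = string.ascii_uppercase
POP_SIZE = 200

def score(candidate):
    # fraction of positions where the letter matches the hidden code
    return sum(a == b for a, b in zip(candidate, TARGET)) / len(TARGET)

def crossover(parent1, parent2):
    # reproduction: combine the beginning of one chromosome with the end of another
    cut = random.randint(1, len(TARGET) - 1)
    return parent1[:cut] + parent2[cut:]

def mutate(candidate, rate=0.05):
    # mutation: randomly change some letters of an offspring
    return "".join(random.choice(LETTERS) if random.random() < rate else c
                   for c in candidate)

# initial population of random 15-letter "chromosomes"
population = ["".join(random.choice(LETTERS) for _ in range(len(TARGET)))
              for _ in range(POP_SIZE)]

for generation in range(2000):
    population.sort(key=score, reverse=True)
    if population[0] == TARGET:
        print(f"Found {population[0]!r} at generation {generation}")
        break
    parents = population[:POP_SIZE // 2]   # keep the best half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children
```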
5. “Support Vector Machines”
Also known as “SVM” this algorithm is mainly used for classification problems even though it has been extended to regression problems (Drucker et al., 96).
Let’s take our example of ideal holiday destinations. For simplicity, consider only 2 variables to describe each city: the temperature and the population density. We can therefore represent cities in 2 dimensions.
We represent as circles the cities you liked a lot and as squares those you liked least. When you consider a new city, you want to know which group it is closest to.
SVM Optimal Plane
SVM Example
As we see in the graph on the right, there are many planes (straight lines when you only have 2 dimensions) that separate the two groups.
We will choose the line that keeps the maximum distance between the two groups. To build it we do not need all the points: it is enough to take the points at the border of their group. These points, or vectors, are called the support vectors. The planes passing through these support vectors are called support planes. The separating plane will be the one that is equidistant from the two support planes.
What if the groups are not so easily separable, for example if along one of the dimensions circles are mixed with squares or vice versa?
We then apply a transformation to these points by a function so as to be able to separate them, as in the example below:
SVM transformation example
The SVM algorithm therefore consists in looking for the optimal hyperplane while minimizing classification errors.
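A minimal sketch of the holiday-destination example with scikit-learn: the temperatures and densities are invented, 1 means a liked city and 0 a disliked one. A linear kernel looks for the widest-margin separating plane; a non-linear kernel (e.g. "rbf") applies the kind of transformation described above when the groups are not linearly separable.

```python
from sklearn.svm import SVC

# Each row: [average temperature (°C), population density (inhab./km²)]
X = [[28, 500], [26, 800], [30, 300],     # cities you liked (circles)
     [8, 9000], [12, 7000], [5, 12000]]   # cities you did not like (squares)
y = [1, 1, 1, 0, 0, 0]

svm = SVC(kernel="linear")
svm.fit(X, y)

print(svm.support_vectors_)       # the border points that define the margin
print(svm.predict([[22, 2000]]))  # classify a new city
```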
6. The “K nearest neighbors”
Pause. After 5 relatively technical models the algorithm of the K nearest neighbors will appear to you as a formality. Here’s how it works:
An observation is assigned the class of its nearest K neighbors.
“That’s it ?!” you might ask me.
Yes, that’s all. However, as the following example shows, the choice of K can matter a lot.
K nearest neighbours
We will typically try different values of K to obtain the most satisfactory separation.
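A minimal sketch with scikit-learn, reusing the invented city data from the SVM example, showing how the predicted class can change with K (in practice the features should also be scaled so that one variable does not dominate the distance).

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[28, 500], [26, 800], [30, 300], [8, 9000], [12, 7000], [5, 12000]]
y = [1, 1, 1, 0, 0, 0]

new_city = [[20, 4000]]
for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    print(k, knn.predict(new_city))  # the vote of the K nearest cities
```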
7. “Logistic Regression”
Let’s start with a reminder about linear regression. Linear regression is used to predict a numerical variable, e.g. the price of cotton, in relation to other numeric or binary variables: the number of cultivable hectares, the demand for cotton from various industries, and so on.
It is a question of finding the coefficients a1, a2, … in order to have the best estimate:
Cotton price = a1 * Number of hectares + a2 * Demand for cotton + …
Logistic regression is used in classification in the same way as the algorithms exposed so far. Once again let’s take the example of trips considering only two classes: good destination (Y = 1) and bad destination (Y = 0).
P (1): probability the city is a good destination.
P (0): probability that the city is a bad destination.
The city is represented by a number of variables, we will only consider two: the temperature and population density.
  • X = (X1: temperature, X2: population density)
We are therefore interested in building a function that gives us for a city X:
  • P (1 | X): probability that the destination is good knowing X, which is to say probability that the city checking X is a good destination.
We would like to relate this probability to a linear combination, as in a linear regression. However, the probability P(1|X) varies between 0 and 1, while we want a function that spans the whole domain of real numbers (from -infinity to +infinity).
For that we start by considering P(1|X) / (1 - P(1|X)), which is the ratio between the probability that the destination is good and the probability that it is bad.
For high probabilities this ratio approaches +infinity (for example a probability of 0.99 gives 0.99 / 0.01 = 99) and for low probabilities it approaches 0 (a probability of 0.01 gives 0.01 / 0.99 ≈ 0.0101).
We went from [0, 1] to [0, +infinity[. To extend the range of possible values to include ]-infinity, 0[ as well, we take the natural logarithm of this ratio.
It follows that we are looking for b0, b1, b2, … such that:
  • ln(P(1|X) / (1 - P(1|X))) = b0 + b1X1 + b2X2
The right-hand side represents the regression part and the natural logarithm the logistic part.
The logistic regression algorithm will therefore find the best coefficients to minimize the error between the prediction made for visited destinations and the true label (good, bad) given.
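A minimal sketch with scikit-learn on the same invented city data: the fitted intercept and coefficients play the role of b0, b1, b2 in the log-odds equation above, and predict_proba returns P(0|X) and P(1|X) for a new city.

```python
from sklearn.linear_model import LogisticRegression

X = [[28, 500], [26, 800], [30, 300], [8, 9000], [12, 7000], [5, 12000]]
y = [1, 1, 1, 0, 0, 0]  # 1 = good destination, 0 = bad destination

model = LogisticRegression()
model.fit(X, y)

print(model.intercept_, model.coef_)      # b0 and (b1, b2)
print(model.predict_proba([[22, 2000]]))  # [P(0|X), P(1|X)] for a new city
```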
8. “Clustering”
Supervised vs. Unsupervised learning. Do you remember?
Until now we have described supervised learning algorithms. Classes are known and we want to classify or predict a new observation. But what do we do when there are no predefined groups? When we are looking for patterns shared by groups of people?
This is where unsupervised learning and clustering algorithms come in.
Take the example of a company that has started its digital transformation. It has new sales and communication channels through its website and one or more associated mobile applications. In the past, it addressed its clients based on demographics and their purchase history. But how can it exploit the navigation data of its customers? Does online behavior match the classic customer segments?
These questions can motivate the use of clustering to see if major trends are emerging. This will invalidate or confirm business intuitions that you may have.
There are many clustering algorithms (hierarchical clustering, k-means, DBSCAN, …). One of the most used is the k-means algorithm. We will explain simply how it works:
Even if we do not know how the clusters will be constituted, the k-means algorithm requires us to specify the expected number of clusters in advance. Techniques exist to find the optimal number of clusters.
Consider the example of the cities. Our dataset has 2 variables, so we have 2 dimensions. After a first study we expect to have 2 clusters. We begin by randomly placing two points; they represent our starting ‘means’. Each observation is assigned to the cluster of the nearest mean. Then we compute the average of the observations in each cluster and move the means to the computed positions. We re-assign the observations to the nearest means, and so on.
Clustering K means
To ensure the stability of the groups found, it is recommended to repeat the draw of the initial ‘means’ several times, because some initial draws may give a configuration that differs from the vast majority of runs.
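A minimal sketch of k-means with scikit-learn on the invented city data: we ask for 2 clusters and repeat the random initialization several times (n_init), as recommended above.

```python
from sklearn.cluster import KMeans

X = [[28, 500], [26, 800], [30, 300], [8, 9000], [12, 7000], [5, 12000]]

kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(X)

print(kmeans.labels_)           # cluster assigned to each city
print(kmeans.cluster_centers_)  # final positions of the two 'means'
```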
Factors of Relevance and Quality of Machine Learning Algorithms
Machine learning algorithms are evaluated on the basis of their ability to correctly classify or predict both the observations that were used to train the model (training and test sets) and, above all, observations for which the label or value is known but was not used to build the model (validation set).
Proper classification implies both placing the observations in the correct group and at the same time not placing them in the wrong groups.
The chosen metric may vary depending on the intent of the algorithm and its business usage.
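A minimal sketch of this evaluation logic with scikit-learn: hold out a validation set that the model never sees during training, then compare the accuracy on both sets (a large gap is a sign of overfitting).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

print("training accuracy:  ", accuracy_score(y_train, model.predict(X_train)))
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```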
Several data factors can play a big role in the quality of the algorithms. Here are the main ones:
1. The number of observations:
  • the fewer observations there are, the more difficult the analysis,
  • but the more there are, the higher the need for computer memory and the longer the analysis.
2. The number and quality of attributes describing these observations
For example, the distance between two numeric variables (price, size, weight, light intensity, noise intensity, etc.) is easy to establish, while that between two categorical attributes (color, beauty, utility…) is more delicate;
3. The percentage of data filled in and missing
4. “Noise”: the number and “location” of dubious values (potential errors, outliers, …) or of values that do not conform to the general distribution pattern of the “examples” in their distribution space will have an impact on the quality of the analysis.
Conclusion
We have seen that machine learning algorithms serve two purposes, classifying and predicting, and that they are divided into supervised and unsupervised algorithms. There are many possible algorithms; we have covered 8 of them, including logistic regression and random forests to classify an observation, and clustering to bring out homogeneous groups from the data. We also saw that the value of an algorithm depends on the associated cost or loss function, but that its predictive power depends on several factors related to the quality and volume of the data.
I hope this article has given you some insight into what is called Machine Learning. Feel free to use the comment section to get back to me on aspects that you would like to clarify or deepen.
Deep Learning Architectures explained in Human Language

“Deep” Learning has attracted much attention in recent years. And for good reason: this subset of machine learning has stood out impressively in several research fields: facial recognition, speech synthesis, machine translation, and many others.
What these research fields have in common is that they are perceptual problems, related to our senses and our means of expression. They have long represented a real challenge for researchers because it is extremely difficult to model vision or voice by means of algorithms and mathematical formulas.
As a result, the first models that have been implemented in these fields have been constructed with a good deal of business expertise (in speech recognition: decomposition into phonemes, in machine translation: application of grammatical and syntactic rules). Years of research have been dedicated to the exploitation and processing of these non-structured data in order to derive meaning.
The problem is that these new representations of data invented by researchers have failed to generalize fully to any text, sound or image. If you used Google Translate before 2014, the year they switched to a 100% deep learning model, you will remember its obvious limitations at the time.
Deep learning works directly on raw data, without distortion or pre-aggregation. Then, thanks to a very large number of parameters that self-adjust during training, it learns the implicit links existing in the data.
Before going into details of three different algorithms * used in deep learning for different use cases, let’s start by simply defining the model at the heart of deep learning: the “neural network”.
* We also talk about different network architectures.

1. NEURAL NETWORKS



Let me begin by saying that neural networks have very little to do with the neural system and the brain. The analogy between a neuron and a one-neuron neural network is essentially graphic, insofar as there is a flow of information from one end of the network to the other.
neural network vs neuron
The first layer of a neural network is called the input layer. It is through this layer that your data will enter the network. Prior to “feeding” the network with your data, you will need to convert it to numbers if it is not already numeric.
We’ll take the example of sentiment analysis on textual data.
Let’s say you have 10,000 comments on your ecommerce website about products sold:
With your team you have labeled 1,000 of them (we’ll see that you can also rely on pre-trained neural networks) into 3 classes (satisfied | neutral | dissatisfied). These 3 classes, common in sentiment analysis, are just an example; you can actually define more. For example:
– “I loved it, very good taste”;
– “I didn’t like the packaging that much”;
– “I thought it was pretty good”
The final layer, called output layer, will provide you with the classification “satisfied / neutral / dissatisfied”.
All the layers between the input and the output layer, called “hidden” layers, are different representations of the data. A representation might be the number of words in a sentence, the number of punctuation marks (? !) in a sentence, etc. You will not have to specify these representations to the network; if they statistically help to correctly classify the sentences, the network will learn them on its own.
simple neural network

To illustrate these layers, let’s take another example: estimating the price of a home.
As we can see here, we take four input variables: the size of the house, the number of bedrooms, the postal code and the wealth level of the area. The output does not seek to classify but to predict a number: the price of the house. This is known as a regression problem.
The italicized words are examples of representations that the neural network will build from the data after having seen many examples.
The parameters of the network are updated thanks to a process called “backpropagation”. And the more hidden layers there are in the network the “deeper” it is, hence the name “Deep” Learning.
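As an illustration, here is a minimal Keras sketch of the house-price network: 4 inputs, two hidden layers of learned representations, and one output neuron predicting the price. The layer sizes are arbitrary and the training data is random, just to show the calls.

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(4,)),                    # size, bedrooms, postcode, area wealth
    keras.layers.Dense(16, activation="relu"),  # hidden layer: learned representations
    keras.layers.Dense(8, activation="relu"),   # second hidden layer
    keras.layers.Dense(1),                      # output: predicted price (regression)
])

# Backpropagation adjusts the weights to minimize the prediction error
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(100, 4)   # fake data, just to show the training call
y = np.random.rand(100, 1)
model.fit(X, y, epochs=5, verbose=0)
```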
Let us now see 3 different types of architectures of neural networks.

2. CONVOLUTIONAL NEURAL NETWORKS (CNN)


These networks are used for use cases involving images or video, which include face recognition and image classification.
For example, Baidu (the equivalent of Google in China) has set up portals powered by face recognition to let in only the company’s employees.
Snapchat and many mobile applications have leveraged the breakthroughs of deep learning and CNNs to increase their face recognition capacities in order to add extra layers on your face such as funny bunny ears and a pink nose.
convolutional neural network
The name “convolution” comes from a mathematical operation: convolution between functions.
Put simply, a convolution applies a filter to the input image; the filter’s parameters are learned during training. A learned filter will be able to detect features in an image, for example corners, and use them to classify the image as well as possible.
The image is first decomposed into its 3 channels (R, G, B), pixel by pixel; we obtain three matrices of size n x n (where n is the number of pixels per side).
Below is an example of a convolution with a 6 x 6 size matrix:
neural network convolution
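To make the operation itself concrete, here is a tiny numpy sketch: a 3 x 3 filter slides over a 6 x 6 matrix of pixel values (both invented), and at each position we take the sum of the element-wise products.

```python
import numpy as np

image = np.random.randint(0, 256, size=(6, 6))  # one 6 x 6 channel of pixel values
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                 # a simple vertical-edge filter

out = np.zeros((4, 4))                          # output size: (6 - 3 + 1) x (6 - 3 + 1)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
print(out)
```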
It is important to note two important advantages inherent in convolutional networks:
  • the network learns to recognize features in an image step by step: to recognize a face, for instance, it will first learn to recognize eyelids and pupils, and then eyes;
  • once the network has learned to recognize an element at one place in an image, it will be able to recognize it anywhere else in the picture.
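And here is a minimal Keras sketch of a small convolutional network for image classification (say 64 x 64 RGB images and 3 classes); the architecture is illustrative and not tuned for any real dataset.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),                      # RGB image: 3 channels
    keras.layers.Conv2D(32, (3, 3), activation="relu"),  # learned convolution filters
    keras.layers.MaxPooling2D((2, 2)),                   # shrink the feature maps
    keras.layers.Conv2D(64, (3, 3), activation="relu"),  # higher-level features
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(3, activation="softmax"),         # 3 image classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```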


3. RECURRENT NEURAL NETWORKS (RNN)


Recurrent neural networks are at the heart of many substantial improvements in areas as diverse as speech recognition, automatic music composition, sentiment analysis, DNA sequence analysis, machine translation.

The main difference with other neural networks is that they take into account a sequence of data, often a sequence evolving over time. For example, in the case of analyzing temporal data (time series), the network keeps in memory part or all of the observations preceding the one being analyzed.

The diagram of this network is shown here:

recurrent neural network
Instead of considering each input separately (the way a CNN would analyze image by image), the recurrent network takes into account the data it has previously processed.
Some architectures, called bidirectional, can also take into account future data. For instance when analyzing text to identify named entities (people, companies, countries, etc.) the network would need to see the words of the whole sentence.
Example:
  • “I see [Jean] Valjean has escaped you again, Javert!”
  • “I see [Jean] R. plays in this adaptation of ‘Les Misérables’.”
The beginning of the sentence (“I see [Jean]”) is not enough to identify who ‘Jean’ is.
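As an illustration, here is a minimal Keras sketch of a bidirectional recurrent network for the sentiment example from earlier: sequences of word indices in, 3 classes out. The vocabulary size and sequence length are arbitrary placeholders.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(50,)),                             # 50 word indices per comment
    keras.layers.Embedding(input_dim=10000, output_dim=32),
    keras.layers.Bidirectional(keras.layers.LSTM(32)),    # reads the sequence both ways
    keras.layers.Dense(3, activation="softmax"),          # satisfied / neutral / dissatisfied
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```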

4. AUTOENCODERS


Autoencoders are applied mainly to anomaly detection (for example to detect fraud in banking or to find faults in an industrial production line). They can also be used for dimensionality reduction (close to the objective of a Principal Component Analysis). Indeed, the goal of autoencoders is to teach the machine what constitutes “normal” data.
The architecture of our network is the following:
autoencoder
The network will therefore represent data through one or more hidden layers so that the output will be as close as possible to the input data.
The objective of recovering the same data at the output of the network is characteristic of autoencoders (analogous to the identity function f(x) = x).
The encoding and decoding stages are not, however, specific to autoencoders; they are also found in recurrent neural networks used for machine translation.
After training the network with enough data, it becomes possible to flag suspicious or anomalous observations when they deviate from this learned “normal” beyond a certain threshold.
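A minimal Keras sketch of this idea: the network is trained to reproduce its input (target = input), and new observations whose reconstruction error exceeds a chosen threshold are flagged as anomalies. The data here is random and the threshold arbitrary, just to show the mechanics.

```python
import numpy as np
from tensorflow import keras

n_features = 20
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(8, activation="relu"),  # encoder: compressed representation
    keras.layers.Dense(n_features),            # decoder: reconstruct the input
])
model.compile(optimizer="adam", loss="mse")

X_normal = np.random.rand(1000, n_features)          # stand-in for "normal" data
model.fit(X_normal, X_normal, epochs=10, verbose=0)  # target = input

# Flag observations whose reconstruction error exceeds a threshold
X_new = np.random.rand(5, n_features)
errors = np.mean((model.predict(X_new) - X_new) ** 2, axis=1)
print(errors > 0.1)  # threshold to be chosen from the training error distribution
```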

Conclusion:
We saw 3 major types of neural networks:
  • Convolution networks with applications in facial recognition and image classification;
  • Recurrent networks with applications in time series, text and voice analysis;
  • Autoencoders with applications to anomaly detection as well as dimensionality reduction.
Other architectures exist, such as GANs (generative adversarial networks), which are composed of one model that generates candidates for a given task, for example image creation, and another model that evaluates the different outputs. There is also reinforcement learning, a method used by DeepMind to train its AlphaGo and AlphaGo Zero models.
Obviously there are limits: for example, it is possible to fool a convolutional network by adding to an image a particular noise, undetectable to the human eye, that can be fatal for a model whose robustness has not been sufficiently tested. New architectures such as capsule networks have emerged to address this particular problem.
All in all it is certain that deep learning has a bright future with many business applications to come.