One of the areas that has intrigued me recently is Artificial Intelligence. In this post, I will discuss one of the latest and most fascinating specializations within Artificial Intelligence: Deep Learning.
What is Artificial Intelligence (AI)?
AI can be understood as the capability of a machine to imitate intelligent human behavior. The term Artificial Intelligence was coined by John McCarthy in 1956. The main purpose of Artificial Intelligence is to build intelligence into man-made artifacts. AI techniques include a variety of search algorithms such as breadth-first search (BFS), depth-first search (DFS), iterative deepening, and A*. Machine learning is a subfield of AI (different from search) that allows a machine to “learn” from data. Deep learning is a specific approach within machine learning that can solve problems of higher complexity both effectively and efficiently, and often outperforms other machine learning techniques.
What is Machine Learning (ML)?
Before we enter the world of machine learning, we should first ask why it is needed at all. Performance on a given task can be quantified in many ways, such as time complexity, accuracy, and precision. Humans have limitations with regard to these measures for certain problem types or dimensions. For instance, while the human mind is excellent at visual pattern recognition in 2D or 3D, it fails miserably in 200D!
A canonical definition of machine learning, given by Tom Mitchell in 1997, is: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” Learning in which performance improves with experience is called inductive learning. The basis of inductive learning goes back several centuries, and it is only recently that we have managed to develop more quantitative methods of learning that can help us solve problems.
There are three primary machine learning paradigms:
- Supervised Learning: Learns a mapping from inputs to the required outputs. It can be further classified by the type of the output variable: continuous (regression) or categorical (classification).
- Unsupervised Learning: No output is associated with the input; it is used to discover patterns in the data. The most prominent techniques are clustering and association.
- Reinforcement Learning: Falls into neither of the above classes. It is a way of learning to control the behavior of a system through feedback, typically in the form of rewards.
Machine learning provides humans with invaluable tools to greatly augment their own ability at pattern recognition and knowledge discovery. And as computer hardware continues to improve in processing power, speed, memory, and storage, the potential benefits of machine learning become even more profound.
While it may seem that machine learning is too good to be true, it does have its limitations. For instance, it has been found that as the quantity of data increases, the performance of many models plateaus after a certain point. That is, the performance of many learning algorithms simply does not scale, and the learning is not proportionate to the vast amounts of data that are being made available to us today.
Many ML techniques can work with high dimensional data, but again, at some point, as the number of dimensions continues to increase, ML begins to disappoint. Oftentimes, feature selection or dimension reduction techniques need to be employed to reduce the associated complexity. This problem is generally referred to as the “curse of dimensionality” and, according to the Encyclopedia of Machine Learning and Data Mining (https://link.springer.com/referencework/10.1007/978-0-387-30164-8), “a small increase in dimensionality generally requires a large increase in the numerosity of the data, in order to keep the same level of performance for regression, clustering, etc.” The encyclopedia entry goes on to say that there are many difficulties in ML due to this effect.
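One rough way to get a feel for this effect is to count grid cells. The sketch below assumes we split every feature axis into the same number of bins and would like at least one data point per cell; the function name and the choice of 10 bins are illustrative assumptions, not something taken from the encyclopedia entry.

```python
def cells_to_cover(bins_per_axis: int, dimensions: int) -> int:
    """Number of grid cells when every axis is split into bins_per_axis bins."""
    return bins_per_axis ** dimensions


# Assumption: 10 bins per axis. The number of cells (and so the number of
# samples needed to put one point in each cell) grows exponentially.
for d in (1, 2, 3, 10):
    print(f"{d:>2} dimensions -> {cells_to_cover(10, d):,} cells")
```

Going from 2 to 3 dimensions already multiplies the required data by 10 under this assumption, which is the kind of growth the quote describes.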
Additionally, given high dimensional data (or even medium dimensional data), most ML techniques do not do a good job of automatic feature construction or transformation. A significant portion of a data scientist’s time is dedicated (and rightly so!) to this part of an analytics project.
As mentioned above, humans are simply experts at certain types of pattern recognition, whether that be in 2D or 3D images, or in languages. We have our own limitations, of course, but here, we rock. However, some of these problems are very difficult for machines. Take for instance natural language processing (NLP), sentiment analysis, or, even better, identifying sarcasm. These are hard problems for machines!
To summarize, traditional machine learning has the following limitations:
- Cannot leverage all of the data
- Difficulty with high dimensional data
- Certain problem types are notoriously difficult
- Automatic feature construction/transformation is limited
Deep learning is a significant step in the right direction to help overcome these limitations!
Deep Learning can be considered a subset of ML. And while not exactly new (we have been toying around with the idea of deep learning for years), it was not until recently that we had enough computational power to really develop the technique. And, maybe more importantly, we had not previously faced such a pressing public need for high-quality, high-performing ML methodologies with ultra-complex data! Fortunately, with the advent of new technologies in the field of computer science, we have been able to design and develop new tools and machines that can handle the complexity associated with the algorithms.
What is Deep Learning?
Deep learning is a concept that has evolved from machine learning, with the core idea of building algorithms that mimic the human brain. Deep learning is in fact a class of artificial neural network (ANN) models. ANNs are composed of a network of “neurons” that link the input data to the output data. Each neuron loosely simulates the way that neurons in the brain process data.
In the image above, (A) is a neuron present in the brain, (B) is an artificial neuron, (C) is a connection within the nervous system and (D) is the artificial neural network that represents these connections.
Note that an artificial neural network consists of a series of layers. The first layer, starting at the left, is called the input layer; it consists of nodes that each correspond to a predictor (an input feature). The rightmost layer is called the output layer, where the output of the model is given. Any layers in between the two are called hidden layers and allow the ANN to account for tremendous complexity in the problem.
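To make the layered structure concrete, here is a minimal sketch of how values could flow from the input layer through a hidden layer to the output layer. The weights, biases, layer sizes, and the choice of a sigmoid activation are all made-up assumptions for illustration; a real network would learn its weights from data.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation=sigmoid):
    """One fully connected layer: each row of weights feeds one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Pass the inputs through every layer in order and return the outputs."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[0.7, -0.5, 0.2]], [0.05]),                                # output layer
]
output = network_forward([1.0, 2.0], layers)
print(output)
```

Each layer only ever sees the activations of the layer before it, which is why adding hidden layers lets the network build up more complex combinations of the inputs.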
Example application of an artificial neural network
Suppose we consider a 28×28 grayscale image of a handwritten digit (0, 1, …, 9). Each pixel is given a value based on how bright it is, generally ranging from 0 (black) to 1 (white). These are the 28*28=784 inputs to the network, so the first layer contains 784 neurons. The last layer contains 10 neurons, each representing one of the 10 digits (0-9). For the hidden layers, let us consider 2 layers, each with a certain number of neurons. The activations in a layer are determined by the activations in the previous layer.
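As a quick back-of-the-envelope check, we can count how many weights and biases such a fully connected network would have. The post does not fix the size of the hidden layers, so the 16 neurons per hidden layer used below is purely an assumption for illustration.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # one weight per connection, one bias per neuron
    return total


# 784 inputs, two hidden layers (16 neurons each is an assumption), 10 outputs.
layer_sizes = [28 * 28, 16, 16, 10]
print(count_parameters(layer_sizes))  # 13002
```

Even this small network has thousands of adjustable values, and almost all of them sit between the 784 inputs and the first hidden layer.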
Let us now understand how these hidden layers work. Any digit can be broken down into segments or sub-components, such as edges and loops. The hidden layers in the network perform the job of recognizing these sub-components. For each input image, the neurons in each layer activate accordingly, detect the sub-components, and output the likelihood that the image is a particular digit. The output is given by the last layer, and the neuron with the highest activation determines which digit the system assigns to the image.
The same concept can also be applied to speech recognition. Speech can be subdivided into letters, words, and phrases, where each layer corresponds to one of these sub-processes.
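The final step, picking the digit from the output layer, can be sketched in a couple of lines. The activation values below are invented for illustration and do not come from a trained network.

```python
def predict_digit(output_activations):
    """Return the index (0-9) of the output neuron with the highest activation."""
    return max(range(len(output_activations)), key=lambda i: output_activations[i])


# Made-up activations of the 10 output neurons for one input image.
example = [0.01, 0.02, 0.05, 0.80, 0.03, 0.02, 0.01, 0.03, 0.02, 0.01]
print(predict_digit(example))  # 3
```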
Now, ANNs have been around for years, but their study and use have generally been limited to only 1 or maybe 2 hidden layers. A neural network with multiple hidden layers is called a Deep Network, and learning with a deep network is called Deep Learning. Since hidden layers allow for more complex problems and thus more advanced learning, what is stopping us from just adding more and more layers and always doing deep learning?
It turns out the traditional techniques used to train artificial neural networks often fail as you increase the number of hidden layers. As a result, deep ANNs may not perform any better than shallow ones. This failure relates to different layers “learning” at different rates; in fact, some of the layers can get “stuck” and not learn at all. This problem is called the “vanishing gradient problem” or, more generally, the “unstable gradient problem”.
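A simplified sketch can show why early layers get stuck when sigmoid activations are used. By the chain rule, the gradient reaching an early layer is multiplied by one factor per later layer, and the derivative of the sigmoid is never larger than 0.25. The code below assumes one neuron per layer, a weight of 1, and a pre-activation of 0 (the sigmoid's best case); these are simplifying assumptions, not a full backpropagation implementation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z: float) -> float:
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(num_layers: int, z: float = 0.0, weight: float = 1.0) -> float:
    """Product of the per-layer chain-rule factors weight * sigmoid'(z)."""
    return (weight * sigmoid_derivative(z)) ** num_layers


for n in (1, 2, 5, 10):
    print(f"{n:>2} layers -> gradient scaled by {gradient_scale(n):.2e}")
```

Under these assumptions the gradient shrinks by at least a factor of 4 per layer, so after 10 layers it is less than a millionth of its original size and the earliest layers barely update.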
While completely fixing this is still an active area of research, significant progress has been made recently and is driving the success of deep learners for hard problems. For instance, we now know that one reason this problem exists is the use of traditional activation functions in the neurons. (An activation function is a function used to convert node inputs to a node output.) Backpropagation (the workhorse algorithm used for learning in ANNs) is highly dependent on the form of the activation functions. Traditional ANNs relied on sigmoid or hyperbolic tangent functions as activation functions. However, to be successful with a deep network, this needs to be changed.
At present, most deep learning models use rectified linear units (ReLU) as the activation within hidden layers. There are other variants in use as well (Leaky ReLU and Maxout). Deep networks often use softmax (or a linear function) as the activation function for the output layer. The table at right provides the functional form of some of these activation functions.
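For reference, here is a minimal sketch of these activation functions in plain Python. The leaky slope of 0.01 is a commonly used value but is an assumption here, and Maxout is left out because it takes the maximum over several learned linear units rather than being a fixed function of one input.

```python
import math


def relu(z: float) -> float:
    """Rectified linear unit: pass positive inputs through, zero out the rest."""
    return max(0.0, z)


def leaky_relu(z: float, slope: float = 0.01) -> float:
    """Like ReLU, but negative inputs keep a small slope (0.01 is an assumption)."""
    return z if z > 0 else slope * z


def softmax(zs):
    """Turn a list of scores into positive values that sum to 1."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.0))
print(leaky_relu(-2.0))
print(softmax([1.0, 2.0, 3.0]))
```

Unlike the sigmoid, ReLU has a derivative of exactly 1 for positive inputs, which is one reason it copes better with many hidden layers.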
These advances have allowed ANNs to capture amazing amounts of complexity in the hidden layers. While standard ANNs use 1 or 2 hidden layers, deep neural networks may use many more. With only 4 hidden layers, you’ve already allowed for an immense level of complexity; with 15 hidden layers the network becomes even more “teachable” in a sense. There is no golden rule for the number of layers to use, but this, along with other hyperparameters (e.g., the number of neurons in each layer), is something that should be tuned during cross-validation. Also, neural networks come in a variety of topologies that should be considered when modeling (e.g., feed-forward NN, recurrent NN, etc.), each with its own pros and cons, but that’ll have to wait for another blog post on a future day!
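As a rough sketch of what tuning during cross-validation could look like, the code below splits the sample indices into k folds and picks the candidate setting with the best average validation score. The function names are my own, and `score_fn` is a hypothetical callback standing in for "train a network with this many hidden layers on the training indices and score it on the validation indices".

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def select_by_cross_validation(candidates, n_samples, k, score_fn):
    """Return the candidate with the highest mean validation score."""
    def mean_score(candidate):
        scores = [score_fn(candidate, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)


# Placeholder scorer for illustration only: pretends 4 hidden layers is ideal.
def toy_score(num_hidden_layers, train, val):
    return -abs(num_hidden_layers - 4)


best = select_by_cross_validation([1, 2, 4, 15], n_samples=10, k=3, score_fn=toy_score)
print(best)  # 4
```

In practice the scorer would be replaced by real training and evaluation, and the same loop could search over the number of neurons per layer as well.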
Due to the complexity which can be modeled in the many hidden layers, deep learning models can essentially generate their own features on which the outcome will depend with only minimal guidance from the programmer. This also allows for the capacity to leverage more of the available data in a “big data” environment to train the deep learner. Since our problems are not getting easier, and the data is not getting any smaller, more effort should be invested in this promising field to explore the depths of deep learning!