Dr. Alok Aggarwal
CEO and Chief Data Scientist
Scry Analytics, California, USA
Office: +1 408 872 1078; Mobile: +1 914 980 4717
January 20, 2018
In memory of Alan Turing, Marvin Minsky and John McCarthy
Every decade seems to have its technological buzzwords: we had personal computers in the 1980s; the Internet and the World Wide Web in the 1990s; smartphones and social media in the 2000s; and Artificial Intelligence (AI) and Machine Learning in this decade. However, the field of AI is 67 years old and this is the second in a series of five articles.
The 1950-82 era saw a new field of Artificial Intelligence (AI) being born, a lot of pioneering research being done, massive hype being created, and AI going into hibernation when this hype did not materialize and the research funding dried up. Between 1983 and 2010, research funding ebbed and flowed, and research in AI continued to gather steam, although "some computer scientists and software engineers would avoid the term artificial intelligence for fear of being viewed as wild-eyed dreamers".
During the 1980s and 1990s, researchers realized that many AI solutions could be improved by using techniques from mathematics and economics such as game theory, stochastic modeling, classical numerical methods, operations research and optimization. Better mathematical descriptions were developed for deep neural networks as well as evolutionary and genetic algorithms, which matured during this period. All of this led to new sub-domains and commercial products in AI being created.
In this article, we first briefly discuss supervised learning, unsupervised learning and reinforcement learning, as well as shallow and deep neural networks, which became quite popular during this period. Next, we will discuss the following six reasons that helped AI research and development gain steam: hardware and network connectivity became cheaper and faster; parallel and distributed computing became practical; and lots of data ("Big Data") became available for training AI systems. Finally, we will discuss a few AI applications that were commercialized during this era.
Supervised learning techniques must be trained by humans using labeled data. Suppose we are given several thousand pictures of faces of dogs and cats and we would like to partition them into two groups, one containing dogs and the other cats. Rather than doing it manually, a machine learning expert writes a computer program by including the attributes that differentiate dog-faces from cat-faces (e.g., length of whiskers, droopy ears, angular faces, round eyes). After enough attributes have been included and the program checked for accuracy, the first picture is given to this "black box" program. If its output is not the same as that provided by a "human trainer" (who may be training in person or has provided a pre-labeled picture), this program modifies some of its internal code to ensure that its answer becomes the same as that of the trainer (or the pre-labeled picture). After going through several thousand such pictures and modifying itself accordingly, this black box learns to differentiate the faces of dogs from cats. By 2010, researchers had developed many algorithms that could be used inside the black box, most of which are mentioned in the Appendix, and today, some applications that commonly use these techniques include object recognition, speaker recognition and speech-to-text conversion.
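The "modify itself whenever the answer differs from the trainer's" loop described above can be sketched with a perceptron, one of the simplest supervised learners. This is only an illustration under stated assumptions: the two attributes, their values, the labels, and the function names `predict` and `train` are all hypothetical and are not taken from any particular system.

```python
# Minimal perceptron sketch: weights change only when the program's answer
# disagrees with the trainer's label. All data below is made up.

def predict(weights, bias, features):
    """Return 1 ("dog") if the weighted sum is positive, else 0 ("cat")."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train(samples, epochs=20, learning_rate=0.1):
    """samples: list of (features, label) pairs, with label 0 or 1."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for features, label in samples:
            error = label - predict(weights, bias, features)  # 0 when correct
            if error != 0:
                # Nudge the internal parameters toward the trainer's answer.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias

# Hypothetical attributes: (ear droopiness, whisker length), scaled to 0..1.
labeled = [((0.9, 0.2), 1), ((0.8, 0.3), 1), ((0.7, 0.1), 1),
           ((0.1, 0.9), 0), ((0.2, 0.8), 0), ((0.3, 0.7), 0)]
weights, bias = train(labeled)
```

Calling `predict(weights, bias, (0.85, 0.15))` then classifies a new, unlabeled picture using the learned weights. Real systems use far more attributes and far more pictures, but the correction step has the same shape.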
Unsupervised learning techniques do not require any pre-labeled data; they try to determine hidden structure from "unlabeled" data. One important use case of unsupervised learning is computing the hidden probability distribution with respect to the key attributes and explaining them, e.g., understanding the data by using its attributes and then clustering and partitioning it into "similar" groups. There are several techniques in unsupervised learning, most of which are mentioned in the Appendix. Since the data points given to these algorithms are unlabeled, their accuracy is usually hard to define. Applications that use unsupervised learning include recommender systems (e.g., if a person bought x, will the person buy y), creating cohorts of groups for marketing purposes (e.g., clustering by gender, spending habits, education, zip code), and creating cohorts of patients for improving disease management. Since k-means is one of the most common techniques, it is briefly described below:
Suppose we are given a lot of data points each having n attributes (which can be labelled as n coordinates) and we want to partition them into k groups. Since each data point has n coordinates, we can imagine these data points as being in an n-dimensional space. To begin with, the algorithm partitions these data points arbitrarily into k groups. Now, for each group the algorithm computes its centroid, which is an imaginary point with each of its coordinates being the average of the same coordinates of all the points in that group, i.e., this imaginary point's first coordinate is the average of all first coordinates of the points in this group, its second coordinate is the average of all second coordinates, and so on. Next, for each data point, it finds the centroid that is closest to that point, which yields a new partition of these data points into k new groups. The algorithm again finds the centroids of these groups and repeats these steps until it either converges or has gone through a specified number of iterations. An example in a two-dimensional space with k=2 is shown in the picture below:
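The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function names, the "point i goes to group i mod k" initial partition, the rule for empty groups, and the sample points are all assumptions made for this example.

```python
import math

def centroid(points):
    """Coordinate-wise average of a non-empty list of equal-length tuples."""
    n = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(n))

def k_means(points, k, max_iterations=100):
    """Return (assignments, centroids); assumes len(points) >= k."""
    # Arbitrary initial partition: point i goes to group i mod k.
    assignments = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iterations):
        # Recompute each group's centroid (an empty group keeps its old one).
        for g in range(k):
            members = [p for p, a in zip(points, assignments) if a == g]
            if members:
                centroids[g] = centroid(members)
        # Reassign every point to the group whose centroid is closest.
        new_assignments = [
            min(range(k), key=lambda g: math.dist(p, centroids[g]))
            for p in points
        ]
        if new_assignments == assignments:  # no point moved: converged
            break
        assignments = new_assignments
    return assignments, centroids

# Hypothetical two-dimensional data with two visually obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
groups, centers = k_means(points, k=2)
```

Here `groups` holds the group index of each point and `centers` holds the final centroids. Different initial partitions can lead to different final groups, which is why practical implementations usually try several starting points.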
Another technique, hierarchical clustering, creates hierarchical groups, which at the top level would have "super groups," each containing sub-groups, which may in turn contain sub-sub-groups, and so on. K-means clustering is often used for creating hierarchical groups as well.
Reinforcement Learning (RL) algorithms learn from the consequences of their actions, rather than from being taught by humans or by using pre-labeled data; this is analogous to Pavlov’s conditioning, in which Pavlov noticed that his dogs would begin to salivate whenever he entered the room, even when he was not bringing them food. The rules that such algorithms should obey are given upfront, and they select their actions on the basis of their past experiences and by considering new choices. Hence, they learn by trial and error in a simulated environment. At the end of each "learning session," the RL algorithm gives itself a "score" that characterizes its level of success or failure, and over time, the algorithm tries to perform those actions that maximize this score. Although IBM’s Deep Blue, which won the chess match against Kasparov, did not use Reinforcement Learning, as an example, we describe a potential RL algorithm for playing chess:
As input, the RL algorithm is given the rules of playing chess, e.g., the 8×8 board, the initial location of pieces, what each chess piece can do in one step, a score of zero if the player’s king is checkmated, a score of one if the opponent's king is checkmated, and 0.5 if only two kings are left on the board. In this embodiment, the RL algorithm creates two identical solutions, A and B, which start playing chess against each other. After each game is over, the RL algorithm assigns the appropriate scores to A and B but also keeps a complete history of the moves and countermoves made by A and B, which can be used to train A and B (individually) to play better. After playing several thousand such games in the first round, the RL algorithm uses this "self-generated" labelled data, with outcomes of 0, 0.5, and 1 for each game and all the moves played in that game, and by using learning techniques, determines the patterns of moves that led A (and similarly B) to a poor score. Hence, for the next round, it refines these solutions for A and for B and revises the play of such "poor moves," thereby improving them for the second round, and then for the third round, and so on, until the improvements from one round to another become minuscule, in which case A and B end up being reasonably well-trained solutions.
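Chess is far too large for a short example, so the sketch below applies the same self-play idea to a much smaller game that we substitute purely for illustration: a pile of stones from which players alternately take one, two or three, with the player who takes the last stone scoring 1 and the other scoring 0 (there are no draws, so the 0.5 score never occurs). As further simplifications, both players share one table of learned values instead of being kept as separate solutions A and B, and the table is updated after every game rather than after a round of several thousand games. All names and parameters here are hypothetical.

```python
import random

def legal_moves(pile):
    """A player may take 1, 2 or 3 stones, but not more than remain."""
    return [m for m in (1, 2, 3) if m <= pile]

def best_move(values, pile):
    """Pick the move with the highest learned average score (0.5 if unseen)."""
    return max(legal_moves(pile), key=lambda m: values.get((pile, m), 0.5))

def self_play(start_pile=10, games=20000, explore=0.2, seed=0):
    """Learn average scores for (pile, move) pairs from self-played games."""
    rng = random.Random(seed)
    values, counts = {}, {}
    for _ in range(games):
        pile, player, history = start_pile, 0, {0: [], 1: []}
        while pile > 0:
            # Mostly play the best known move, sometimes try a new choice.
            if rng.random() < explore:
                move = rng.choice(legal_moves(pile))
            else:
                move = best_move(values, pile)
            history[player].append((pile, move))
            pile -= move
            player = 1 - player
        winner = 1 - player  # the player who took the last stone
        for p in (0, 1):
            score = 1.0 if p == winner else 0.0
            for key in history[p]:
                # Running average of the scores that followed this move.
                counts[key] = counts.get(key, 0) + 1
                old = values.get(key, 0.5)
                values[key] = old + (score - old) / counts[key]
    return values

values = self_play()
```

After training, `best_move(values, 5)` gives the move the program has learned to prefer with five stones left. The history kept per player plays the role of the "self-generated" labelled data described above.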
In 1951, Minsky and Edmonds built the first neural network machine, SNARC (Stochastic Neural Analogy Reinforcement Computer); it successfully modeled the behavior of a rat in a maze searching for food, and as it made its way through the maze, the strength of some synaptic connections would increase, thereby reinforcing the underlying behavior, which seemed to mimic the functioning of living neurons. In general, Reinforcement Learning algorithms perform well while solving optimization problems, in game theoretic situations (e.g., in playing Backgammon or Go) and in problems where the business rules are well defined (e.g., autonomous car driving) since they can self-learn by playing against humans or against each other.
Mixed learning techniques use a combination of one or more of supervised, unsupervised and reinforcement learning techniques. Semi-supervised learning is particularly useful in cases where it is expensive or time consuming to label a large dataset, e.g., while differentiating dog-faces from cat-faces, if the database contains some images that are labeled but most of them are not. Some of their broad uses include classification, pattern recognition, anomaly detection, and clustering/grouping.
As discussed in the previous article, a one-layer perceptron network consists of an input layer, connected to one hidden layer of perceptrons, which is in turn connected to an output layer of perceptrons. A signal coming via a connection is recalibrated by the "weight" of that connection, and this weight is assigned to a connection during the "learning process". Like a human neuron, a perceptron "fires" if all the incoming signals together exceed a specified potential, but unlike in the human brain, in most such networks, signals only move from one layer to the one in front of it. The term Artificial Neural Networks (ANNs) was coined by Igor Aizenberg and colleagues in 2000 for Boolean threshold neurons but is used for perceptrons and other "neurons" of the same ilk. Examples of one-hidden-layer and eight-hidden-layer networks are given below:
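The firing rule and the layer-to-layer flow of signals can be sketched as follows. This is a minimal illustration with one hidden layer; the weights and thresholds are hand-picked for the example (they compute the exclusive-or of two inputs) rather than learned, and the function names are hypothetical.

```python
def fires(inputs, weights, threshold):
    """A perceptron outputs 1 if its weighted incoming signals exceed the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

def feed_forward(inputs, layers):
    """Pass signals forward, one layer at a time.

    layers: list of (weight_rows, thresholds), one pair per layer, where
    weight_rows[j] holds the connection weights into perceptron j.
    """
    signal = inputs
    for weight_rows, thresholds in layers:
        signal = [fires(signal, w, t) for w, t in zip(weight_rows, thresholds)]
    return signal

# Hand-picked (not learned) network with one hidden layer of two perceptrons:
# the first hidden unit acts as OR, the second as AND, and the output fires
# when OR is on but AND is off.
network = [
    ([[1, 1], [1, 1]], [0.5, 1.5]),  # hidden layer
    ([[1, -1]], [0.5]),              # output layer
]
outputs = [feed_forward([a, b], network)[0] for a in (0, 1) for b in (0, 1)]
```

Adding more `(weight_rows, thresholds)` pairs to `network` gives a deeper network; the forward pass itself does not change.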
Although multi-layer perceptrons were invented in 1965 and an algorithm for training an 8-layer network was provided in 1971 [18, 19, 20], the term Deep Learning was introduced by Rina Dechter in 1986. For our purposes, a deep learning network has more than one hidden layer.
Given below are important deep learning networks that were developed between 1975 and 2006 and are frequently used today; their description is beyond the scope of this article:
Between 1983 and 2010, hardware became much cheaper and more than 500,000 times faster; however, for many problems, one computer was still not enough to execute many machine learning algorithms in a reasonable amount of time. At a theoretical level, computer science research during 1950-2000 had shown that such problems could be solved much faster by using many computers simultaneously and in a distributed manner. However, the following fundamental problems related to distributed computing remained unresolved until 2003: (a) how to parallelize computation, (b) how to distribute data "equitably" among computers and do automatic load balancing, and (c) how to handle computer failures and interrupt computers if they go into infinite loops. In 2003, Google published its Google File System paper and followed it up in 2004 by publishing MapReduce, a framework and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. Since MapReduce was proprietary to Google, in 2006, Cutting and Cafarella (from the University of Washington but working at Yahoo) created an open source and free version of this framework called Hadoop. Also, in 2012, Spark and its resilient distributed datasets were invented, which reduced the latency of many applications when compared to MapReduce and Hadoop implementations. Today a Hadoop-Spark based infrastructure can handle 100,000 or more computers and several hundred million Gigabytes of storage.
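To give a feel for the programming model, the sketch below runs the map, shuffle and reduce phases of a word count inside a single Python process. It is not Google's or Hadoop's API: the function names and sample documents are hypothetical, and the parts that made the real frameworks valuable (distributing the work across a cluster, balancing load, recovering from failures) are deliberately left out.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    # In a real cluster, each document (or block of data) would be mapped on
    # a different machine and each key reduced independently of the others.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = map_reduce(["big data big compute", "data everywhere"])
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, both phases can be spread over many computers, which is the property the frameworks above exploit.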
In 1998, John Mashey (at Silicon Graphics) seemingly first coined the term "Big Data," which referred to the large volume, variety and velocity at which data is being generated and communicated. Since most learning techniques require lots of data (especially labelled data), the data stored in organizations’ repositories and on the World Wide Web became vital for AI. By the early 2000s, social media websites such as Facebook, Twitter, Pinterest, Yelp, and YouTube as well as weblogs and a plethora of electronic devices started generating Big Data, which set the stage for creating several "open databases" with labeled and unlabeled data (for researchers to experiment with) [72,73]. By 2010, humans had already created almost a trillion Gigabytes (i.e., one zettabyte) of data, most of which was either structured (e.g., spreadsheets, relational databases) or unstructured (e.g., text, images, audio and video files).
In 1992, IBM’s Gerald Tesauro built TD-Gammon, which was a reinforcement learning program to play backgammon; its level was slightly below that of the top human backgammon players at that time.
Alan Turing was the first to design a computer chess program, in 1953, although he "ran the program by flipping through the pages of the algorithm and carrying out its instructions on a chessboard". In 1989, the chess-playing programs HiTech and Deep Thought, developed at Carnegie Mellon University, defeated a few chess masters. In 1997, IBM’s Deep Blue became the first computer chess-playing system to beat the world champion, Garry Kasparov. Deep Blue’s success was essentially due to considerably better engineering and to evaluating 200 million positions per second.
In 1994, Adler and his colleagues at Stanford University invented a stereotactic radiosurgery-performing robot, CyberKnife, which could surgically remove tumors; it is almost as accurate as human doctors, and during the last 20 years, it has treated over 100,000 patients. In 1997, NASA built Sojourner, a small robot that could perform semi-autonomous operations on the surface of Mars.
In 1995, Wallace created A.L.I.C.E., which was based on pattern matching but had no reasoning capabilities. Thereafter, Jabberwacky (renamed Cleverbot in 2008) was created, which had web-searching and game-playing abilities but was still limited in nature. Both chatbots used improved NLP algorithms for communicating with humans.
Until the 1980s, most NLP systems were based on complex sets of hand-written rules. In the late 1980s, researchers started using machine learning algorithms for language processing. This was due to faster and cheaper hardware as well as the reduced dominance of Chomsky-based theories of linguistics. Instead, researchers created statistical models that made probabilistic decisions based on assigning weights to appropriate input features, and they also started using supervised and semi-supervised learning techniques and partially labeled data [82,83].
During the late 1990s, SRI researchers used deep neural networks for speaker recognition and achieved significant success. In 2009, Hinton and Deng collaborated with several colleagues from the University of Toronto, Microsoft, Google and IBM, and showed substantial progress in speech recognition using LSTM-based deep networks [85,86].
By 2010, several companies (e.g., TiVo, Netflix, Facebook, Pandora) had built recommendation engines using AI and started using them for marketing and sales purposes, thereby improving their revenue and profit margins.
In 1989, LeCun and colleagues provided the first practical demonstration of backpropagation; they combined convolutional neural networks (CNNs) with backpropagation in order to read "handwritten" digits. This system was eventually used to read the numbers in handwritten checks; by 1998 and the early 2000s, such networks processed an estimated 10% to 20% of all the checks written in the United States.
The year 2000 had come and gone but Alan Turing’s prediction of humans creating an AI computer remained unfulfilled [3,4], even though the Loebner Prize had been initiated in 1990 with the aim of developing such a computer. Nevertheless, substantial progress was made in AI, especially with respect to deep neural networks, which were invented in 1965 with the first algorithm for training them given in 1971 [18,19,20]; between 1983 and 2010, exemplary research done by Hinton, Schmidhuber, Bengio, LeCun, Hochreiter, and others ensured rapid progress in deep learning techniques [90,91,92,93], and some of these networks began to be used in commercial applications. Because of these techniques and the availability of inexpensive hardware and data, which made them practical, the pace of research and development picked up substantially between 2005 and 2010, which in turn led to substantial growth in AI solutions that started rivaling humans between 2011 and 2017; we will discuss such solutions in the next article, "Domains in Which AI Systems are Rivaling Humans".
|Supervised Machine Learning Techniques||Unsupervised Machine Learning Techniques|
|Minimum Message Length (decision graphs)||Clustering|
|Multilinear subspace learning||k-means|
|Naive Bayes classifier||Hierarchical clustering|
|Maximum entropy classifier||Anomaly detection techniques|
|Conditional Random Fields||Density-based techniques, e.g., k-nearest neighbor|
|Backpropagation||Density-based techniques, e.g., local outlier factors|
|Boosting||Sub-space based correlation and outlier detection|
|Bayesian statistics||Correlation-based outlier detection|
|Gaussian process regression||Special kind of support vector machines|
|Support vector machines||Replicator neural networks|
|Minimum Complexity Machines||Cluster analysis-based outlier detection|
|Random Forests||Deviations from association rules|
|Ensembles of Classifiers||Fuzzy logic based outlier detection|
|Ordinal classification||Ensemble techniques using score normalization|
|Nearest Neighbor Algorithm & Approximations||Ensemble techniques using feature bagging|
|Neural Networks (shallow and deep)||Neural Networks (shallow and deep)|
|Probably Approximately Correct (PAC) learning||Mixture models|
|Symbolic machine learning algorithms||Hebbian Learning|
|Genetic Algorithms||Generative Adversarial Networks|
|Handling imbalanced datasets||Learning latent variable models|
|Statistical relational learning||Expectation–maximization algorithm|
|Group method of data handling||Blind signal separation techniques|
|Kernel estimators||Principal components analysis|
|Learning Automata||Singular value decomposition|
|Learning Classifier Systems||Independent component analysis|
|Analytical learning||Dependent component analysis|
|Artificial neural network||Non-negative matrix factorization|
|Case-based reasoning||Low-complexity coding and decoding|
|Decision tree learning||Stationary subspace analysis|
|Inductive logic programming||Common spatial pattern recognition|