An Introduction to AI, updated

What is Artificial Intelligence (AI)?

AI is the field concerned with developing computing systems that are capable of performing tasks that humans are very good at, for example recognising objects, recognising and making sense of speech, and making decisions in a constrained environment.

Narrow AI: the field of AI where the machine is designed to perform a single task, and the machine gets very good at performing that particular task. However, once the machine is trained, it does not generalise to unseen domains. This is the form of AI that we have today, for example, Google Translate.

Artificial General Intelligence (AGI): a form of AI that can accomplish any intellectual task that a human being can do. It is often described as more conscious, making decisions in a way similar to how humans make them. AGI remains an aspiration at this moment in time, with forecasts of its arrival ranging from 2029 to 2049, or even never. It may arrive within the next 20 or so years, but it faces challenges relating to hardware, the energy consumption of today’s powerful machines, and the need to solve the catastrophic forgetting (sometimes called catastrophic memory loss) that affects even the most advanced deep learning algorithms of today.

Super Intelligence: a form of intelligence that exceeds the performance of humans in all domains (as defined by Nick Bostrom). This includes aspects such as general wisdom, problem solving, and creativity.

Classical Artificial Intelligence: algorithms and approaches that include rules-based systems and search algorithms, covering uninformed search (breadth-first, depth-first, uniform cost search) and informed search such as the A and A* algorithms. These laid a strong foundation for the more advanced approaches of today that are better suited to large search spaces and big data sets. Classical AI also includes approaches from logic, involving propositional and predicate calculus. Whilst such approaches are suitable for deterministic scenarios, the problems encountered in the real world are often better suited to probabilistic approaches.
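As an illustration of uninformed search, the following is a minimal sketch of breadth-first search. The toy graph and the function name are assumptions made for this example only; the idea is simply that the frontier is explored in first-in, first-out order, so the first path that reaches the goal uses the fewest edges.

```python
from collections import deque


def breadth_first_search(graph, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    frontier = deque([[start]])  # FIFO queue of partial paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None  # goal is not reachable from start


# Hypothetical toy graph stored as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}

path = breadth_first_search(graph, "A", "F")
print(path)
```

Swapping the queue for a stack would give depth-first search, and ordering the frontier by accumulated path cost would give uniform cost search.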

The field has been making a major impact in recent times across various sectors, including Health Care, Financial Services, Retail, Marketing, Transport, Security, Manufacturing, and Travel.

The advent of Big Data, driven by the arrival of the internet, smart mobile, and social media, has enabled AI algorithms, in particular those from Machine Learning and Deep Learning, to learn from far larger data sets and perform their tasks more effectively. This, combined with cheaper and more powerful hardware such as Graphics Processing Units (GPUs), has enabled AI to evolve into more complex architectures.

Machine Learning

Machine Learning is defined as the field of AI that applies statistical methods to enable computer systems to learn from the data towards an end goal. The term was introduced by Arthur Samuel in 1959.

Key Terms to Understand

Features / Attributes: these are used to represent the data in a form that the algorithms can understand and process. For example, features in an image may represent edges and corners.

Entropy: the amount of uncertainty in a random variable.

Information Gain: the reduction in uncertainty (entropy) about a variable that results from some prior knowledge, for example knowing the value of a feature.
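These two terms can be made concrete with a short sketch. It assumes Shannon entropy in bits, H = -sum(p * log2(p)), and measures information gain as the drop in entropy of the labels once they are grouped by a feature. The function names and the tiny data set are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy after grouping by a feature."""
    n = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
informative_feature = ["a", "a", "b", "b"]  # separates the classes perfectly
uninformative_feature = ["a", "b", "a", "b"]  # tells us nothing about the class

print(entropy(labels))
print(information_gain(labels, informative_feature))
print(information_gain(labels, uninformative_feature))
```

A feature that splits the labels into pure groups removes all of the uncertainty, while a feature that leaves each group as mixed as before removes none.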

Supervised Learning: a learning algorithm that works with data that is labelled (annotated). For example, learning to classify fruits from images that are labelled as apple, orange, lemon, etc.

Unsupervised Learning: a learning algorithm that discovers patterns hidden in data that is not labelled (annotated). An example is segmenting customers into different clusters.

Semi-supervised Learning: a learning algorithm used when only a small fraction of the data is labelled.

Active Learning: relates to a situation where algorithms can actively query a teacher for labels. It is defined by Jennifer Prendki as "... a type of semi-supervised learning (where both labeled and unlabeled data is used)...Active learning is about incrementally and dynamically labeling data during the training phase in order to allow the algorithm to identify what label would be the most informational for it to learn faster."

Loss Function: a measure of the difference between the ground truth and what the algorithm predicts. In machine learning, the objective is to minimise the loss function during training in such a way that the algorithm generalises and performs well in unseen scenarios.

Machine Learning Methods (non-exhaustive list)

Classification, Regression, and Clustering are three of the main areas of Machine Learning.

Classification is summarised by Jason Brownlee as being "about predicting a label and regression is about predicting a quantity. Classification is the task of predicting a discrete class label. Classification predictions can be evaluated using accuracy, whereas regression predictions cannot."

Regression predictive modelling is summarised by Jason Brownlee as "the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y). Regression is the task of predicting a continuous quantity. Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot."
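The two evaluation measures mentioned in these summaries can be sketched as follows. The labels and prices are made-up values, used only to show that accuracy counts exact matches between discrete labels while root mean squared error measures the distance between continuous quantities.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class label exactly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted quantities."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Classification: discrete labels (illustrative data).
true_labels = ["apple", "orange", "lemon", "apple"]
predicted_labels = ["apple", "orange", "apple", "apple"]
class_accuracy = accuracy(true_labels, predicted_labels)

# Regression: continuous quantities (illustrative data).
true_prices = [3.0, 5.0, 7.0]
predicted_prices = [2.0, 5.0, 9.0]
price_rmse = rmse(true_prices, predicted_prices)

print(class_accuracy, price_rmse)
```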

Clustering is summarised by Surya Priy as "the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them."

Reinforcement Learning: an area that deals with modelling agents in an environment that continuously rewards the agent for making the right decision. An example is an agent that is playing chess against a human being. The agent gets rewarded when it makes a right move and penalised when it makes a wrong move. Once trained, the agent can compete with a human being in a real match.
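Chess is far too large for a short example, so the sketch below swaps in a much smaller problem: a corridor of five cells in which the agent is rewarded for reaching the right-hand end. It uses tabular Q-learning, which is one common reinforcement learning algorithm and not the only one. The environment, constants, and function names are all assumptions made for illustration.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = [-1, +1]    # move left or move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (illustrative values)


def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, max_steps=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            a = rng.randrange(len(ACTIONS))  # purely exploratory behaviour
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best future value.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy action for every non-terminal cell (+1 means "move right").
greedy_policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(greedy_policy)
```

After training, the learned values favour moving right in every cell, because the discounted reward is larger the closer the agent is to the goal.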

Linear Regression: a method that models a linear relationship between a continuous target variable and one or more input variables.
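For a single input variable, the best-fitting line under ordinary least squares has a simple closed form. The sketch below assumes that setting; the data points are invented so that they lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict y for an unseen x
print(slope, intercept, prediction)
```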

Logistic Regression: a classification technique that models the log-odds (the logit) of the outcome as a linear combination of the features. Binary logistic regression deals with situations where the variable you are trying to predict has two outcomes ('0' or '1'). Multinomial logistic regression deals with situations where the predicted variable can take more than two different values.
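The relationship between the linear combination, the logit, and the predicted probability can be sketched for the binary case. The weights and bias below are hypothetical values standing in for a model that has already been trained; no fitting is performed here.

```python
import math


def sigmoid(z):
    """Map log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Map a probability back to log-odds (the inverse of sigmoid)."""
    return math.log(p / (1.0 - p))


def predict_proba(features, weights, bias):
    """Probability of class '1' for one example."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)


# Hypothetical parameters of an already-trained model.
weights, bias = [1.5, -0.5], -1.0

probability = predict_proba([2.0, 1.0], weights, bias)  # log-odds = 1.5
label = 1 if probability >= 0.5 else 0
print(probability, label)
```

Applying `logit` to the predicted probability recovers the linear combination of the features, which is exactly the quantity the model is linear in.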

k-means: an unsupervised approach that groups (or clusters) instances of data into k groups based upon their similarity to each other. An example is grouping a population of people into segments based on how similar they are.
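A minimal sketch of the k-means loop is given below: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat. To keep the result reproducible it initialises the centroids with the first k points, which is a simplifying assumption; random or smarter initialisation is more usual in practice. The two-dimensional points are invented for the example.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points; returns (centroids, clusters)."""
    centroids = list(points[:k])  # naive initialisation (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```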

K-Nearest Neighbour (KNN): a supervised Machine Learning algorithm that may be applied to both classification and regression problems. It works upon the assumption that similar items exist close to one another.
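The "similar items are close together" assumption translates directly into code. The sketch below handles the classification case with a majority vote over the k closest training examples, using squared Euclidean distance; the training points and labels are made up for illustration.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(training_data, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    neighbours = sorted(
        training_data, key=lambda item: squared_distance(item[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Illustrative training set: (features, label) pairs.
training_data = [
    ((1.0, 1.0), "apple"),
    ((1.2, 0.8), "apple"),
    ((0.9, 1.3), "apple"),
    ((5.0, 5.0), "orange"),
    ((5.2, 4.8), "orange"),
    ((4.9, 5.3), "orange"),
]

near_apples = knn_classify(training_data, (1.1, 1.0))
near_oranges = knn_classify(training_data, (5.1, 5.0))
print(near_apples, near_oranges)
```

For regression, the vote would be replaced by the mean of the neighbours' target values.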

Support Vector Machines (SVM): a classification algorithm that draws a separating hyperplane between two classes of data, chosen so that the margin between the classes is as wide as possible. Once trained, an SVM model can be used as a classifier on unseen data.

Decision Trees: an algorithm to learn decision rules inferred from the data. These rules are then followed for decision making.

Boosting and Ensemble: methods that take several weak learners that individually perform poorly and combine them into a strong learner. Adaptive boosting (AdaBoost) can be applied to both classification and regression problems and creates a strong classifier by combining multiple weak classifiers. Recent key examples include XGBoost, which proved highly successful in Kaggle competitions with tabular or structured data (see Tianqi Chen and Carlos Guestrin 2016), Light Gradient Boosting Machine (LightGBM), introduced by Microsoft in 2017, and CatBoost, introduced by Yandex in 2017.

Random Forest: belongs to the family of ensemble learning techniques and entails the creation of multiple decision trees during training, with randomness introduced while constructing each tree. The output is then the mean prediction of the individual trees (for regression) or the mode of the predicted classes (for classification). This reduces the tendency of individual trees to overfit or memorise the training data.
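The sketch below covers only two pieces of this description, and does not train any trees: drawing a bootstrap sample (one common source of the randomness) and aggregating the per-tree predictions. The function names and the example predictions are assumptions for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Sample len(rows) rows with replacement; each tree would train on its own sample."""
    return rng.choices(rows, k=len(rows))


def aggregate_classification(tree_predictions):
    """Return the most common class among the trees (the mode)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the mean of the trees' numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)
rows = [1, 2, 3, 4, 5]  # stand-ins for training examples
sample = bootstrap_sample(rows, rng)

# Hypothetical predictions from three individual trees.
forest_class = aggregate_classification(["apple", "orange", "apple"])
forest_value = aggregate_regression([2.0, 4.0, 6.0])
print(sample, forest_class, forest_value)
```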

Principal Component Analysis (PCA): a method to reduce the dimensionality of the data whilst still retaining most of the variance in the data. This is useful for getting rid of redundant information present in the data whilst preserving the directions (principal components) that explain most of the data.
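A small two-dimensional sketch shows the idea. It centres the data, builds the covariance matrix, and finds the first principal component with power iteration, which is one simple way to obtain the leading eigenvector (library implementations typically use a full eigendecomposition or SVD). The data points are invented and lie close to the line y = x.

```python
import math


def centre(points):
    """Subtract the mean from each coordinate."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    return [(p[0] - mean_x, p[1] - mean_y) for p in points]


def covariance_matrix(centred):
    n = len(centred)
    sxx = sum(x * x for x, _ in centred) / (n - 1)
    syy = sum(y * y for _, y in centred) / (n - 1)
    sxy = sum(x * y for x, y in centred) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def first_principal_component(cov, iterations=100):
    """Power iteration: returns (unit direction, variance along it)."""
    v = [1.0, 0.0]
    for _ in range(iterations):
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    cv = [cov[0][0] * v[0] + cov[0][1] * v[1],
          cov[1][0] * v[0] + cov[1][1] * v[1]]
    return v, v[0] * cv[0] + v[1] * cv[1]


points = [(2.0, 1.9), (0.0, 0.1), (-2.0, -2.0), (1.0, 1.1), (-1.0, -1.1)]
centred = centre(points)
cov = covariance_matrix(centred)
direction, variance = first_principal_component(cov)

explained_ratio = variance / (cov[0][0] + cov[1][1])
# Reduce each 2-D point to a single coordinate along the component.
projected = [x * direction[0] + y * direction[1] for x, y in centred]
print(direction, explained_ratio)
```

Because the points are almost collinear, a single coordinate per point retains nearly all of the variance, which is the sense in which the second dimension is redundant.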

Simultaneous Localisation and Mapping (SLAM): deals with methods that robots use to build a map of an unknown environment while localising themselves within it.

Evolutionary Genetic Algorithms: these are biologically inspired algorithms based on the theory of evolution. They are frequently applied to solve optimisation and search problems through bio-inspired operations, including selection, crossover, and mutation.
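The three operations can be seen together in a sketch on a deliberately simple problem, often called OneMax: evolve a bit string so that it contains as many 1s as possible. Tournament selection, single-point crossover, bit-flip mutation, and keeping the best individual each generation (elitism) are choices made for this example; many other variants exist, and all parameter values are illustrative.

```python
import random


def fitness(bits):
    """OneMax: the number of 1s in the bit string."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        elite = max(population, key=fitness)
        next_population = [elite[:]]  # elitism: the best individual survives unchanged
        while len(next_population) < pop_size:
            # Selection: the fitter of three random individuals becomes a parent.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # Crossover: splice the parents at a random cut point.
            cut = rng.randrange(1, n_bits)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness), initial_best


best, initial_best = evolve()
print(fitness(best), initial_best)
```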

Neural Networks: biologically inspired networks that extract abstract features from the data in a hierarchical fashion. Neural networks were out of favour for much of the 1980s and 1990s. It was Geoff Hinton who continued to push them and was derided by much of the classical AI community at the time. A key moment in the history of the development of Deep Neural Networks (see below for definition) was in 2012, when a team from Toronto introduced themselves to the world with the AlexNet network at the ImageNet competition. Their neural network reduced the error significantly compared to previous approaches that used hand-derived features.

Deep Learning

Deep Learning refers to the field of neural networks with several hidden layers. Such a neural network is often referred to as a deep neural network.

Several of the main types of deep neural networks used today are:

Convolutional Neural Network (CNN): A convolutional neural network is a type of neural network that uses convolutions to extract patterns from the input data in a hierarchical manner. It is mainly used on data that has spatial relationships, such as images. Convolution operations slide a kernel over the image and extract features that are relevant to the task.
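The sliding-kernel operation can be sketched without any libraries. As in most deep learning frameworks, the kernel is applied without flipping (strictly speaking a cross-correlation), with no padding and a stride of one. In a CNN the kernel values are learned; here a hand-picked vertical-edge kernel and a tiny image are assumed for illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# A 4x4 image that is dark (0) on the left and bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the dark region meets the bright region, which is the sense in which the kernel extracts an edge feature.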

Recurrent Neural Network (RNN): Recurrent Neural Networks, and in particular Long Short-Term Memory networks (LSTMs), are used to process sequential data. Time series data, for example, stock market data, speech, signals from sensors, and energy data, have temporal dependencies. LSTMs are a type of RNN that alleviates the vanishing gradient problem, giving them the ability to remember both the short term and events far back in their history.

Restricted Boltzmann Machine (RBM): a type of neural network with stochastic properties. Restricted Boltzmann Machines are trained using an approach named Contrastive Divergence. Once trained, the hidden layer is a latent representation of the input. RBMs learn a probabilistic representation of the input.

Deep Belief Network: a composition of Restricted Boltzmann Machines, with the hidden layer of each serving as the visible layer for the next. Each layer is trained before adding additional layers to the network, which helps in probabilistically reconstructing the input. The network is trained using a layer-by-layer unsupervised approach.

Variational Autoencoders (VAE): an improved version of autoencoders used for learning an optimal latent representation of the input. A VAE consists of an encoder and a decoder with a loss function. VAEs use probabilistic approaches and perform approximate inference in a latent Gaussian model.

GANs: Generative Adversarial Networks consist of two neural networks, a generator and a discriminator, that are trained against each other; both are often implemented as CNNs when working with images. The generator continuously generates data while the discriminator learns to discriminate fake from real data. This way, as training progresses, the generator gets better at generating fake data that looks real while the discriminator gets better at learning the difference between fake and real, in turn helping the generator to improve itself. Once trained, we can then use the generator to generate fake data that looks realistic. For example, a GAN trained on faces can be used to generate images of faces that do not exist and look very real.

Transformers: developed to process sequential data, in particular in the field of Natural Language Processing (NLP), for text tasks such as language translation. The model was introduced in 2017 in a paper entitled "Attention Is All You Need." The architecture of a Transformer model uses encoders and decoders along with self-attention, the capability of attending to various positions of the input sequence in order to generate a representation of the sequence. They possess an advantage over RNNs in that they do not need to process sequential data in order: in the case of a sentence, the start of the sentence does not have to be processed before the end, which allows far more of the computation to run in parallel. Well known Transformer models include Bidirectional Encoder Representations from Transformers (BERT) and the GPT variants (from OpenAI).
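The self-attention step can be sketched as scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, written here with plain lists. In a real Transformer the queries, keys, and values are produced by learned linear projections of the token embeddings; this sketch assumes identity projections and invented two-dimensional token vectors to stay short.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How strongly this position attends to every position in the sequence.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # New representation: weighted average of the value vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical token embeddings; identity projections are assumed.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention = self_attention(tokens, tokens, tokens)
print(attention[0])
```

Each position is handled independently of the others inside the loop, which is why the positions of a sequence do not have to be processed in order.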

Deep Reinforcement Learning: Deep Reinforcement Learning algorithms deal with modelling an agent that learns to interact with an environment in an optimal way. The agent continuously takes actions keeping the goal in mind, and the environment either rewards or penalises the agent for taking a good or bad action, respectively. This way, the agent learns to behave in an optimal manner so as to achieve the goal. AlphaGo from DeepMind is one of the best examples of how an agent learned to play the game of Go and was able to compete with a human being.

Capsules: still an active area of research. A CNN is known to learn representations of data that are often not interpretable. On the other hand, a Capsule network is known to extract specific kinds of representations from the input; for example, it preserves the hierarchical pose relationships between object parts. Another advantage of capsule networks is that they are capable of learning these representations with a fraction of the data that a CNN would otherwise require.

Neuroevolution: defined by Kenneth O. Stanley as follows: it "consists of trying to trigger an evolutionary process similar to the one that produced our brains, except inside a computer. In other words, Neuroevolution seeks to develop the means of evolving neural networks through evolutionary algorithms." Researchers at Uber AI Labs argued that neuroevolutionary approaches were competitive with gradient descent based Deep Learning algorithms, partly due to the reduced likelihood of being trapped in local minima. Stanley et al. stated that "Our hope is to inspire renewed interest in the field as it meets the potential of the increasing computation available today, to highlight how many of its ideas can provide an exciting resource for inspiration and hybridization to the Deep Learning, Deep Reinforcement Learning and Machine Learning communities, and to explain how Neuroevolution could prove to be a critical tool in the long-term pursuit of Artificial General Intelligence."

NeuroSymbolic AI: defined by the MIT-IBM Watson AI Lab as a fusion of AI methods that combine neural networks, which extract statistical structures from raw data files – context about image and sound files, for example – with symbolic representations of problems and logic. "By fusing these two approaches, we’re building a new class of AI that will be far more powerful than the sum of its parts. These neuro-symbolic hybrid systems require less training data and track the steps required to make inferences and draw conclusions. They also have an easier time transferring knowledge across domains. We believe these systems will usher in a new era of AI where machines can learn more like the way humans do, by connecting words with images and mastering abstract concepts."

Federated Learning: also known as collaborative learning, is defined in Wikipedia as a technique in Machine Learning that enables an algorithm to be trained across many decentralized servers (or devices) that hold data locally, without exchanging that data. Differential Privacy aims to enhance data privacy protection by measuring the privacy loss in the communication among the elements of Federated Learning. The technique may address the key challenges of data privacy and security relating to heterogeneous data, and may impact sectors such as the Internet of Things (IoT), healthcare, banking, insurance, and other areas where data privacy and collaborative learning are of key importance. It may well become a key technique in the era of 5G and Edge Computing as the AI IoT scales.
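One common way to combine locally trained models, often called federated averaging, is sketched below: each device sends only its model parameters, and the server averages them, weighting each device by how many local examples it trained on. The parameter vectors and sample counts are hypothetical, and real systems add secure communication, multiple training rounds, and often differential privacy noise, none of which is shown.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local data set size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical parameters learned locally on two devices.
client_weights = [
    [1.0, 2.0],  # device A, trained on 100 local examples
    [3.0, 4.0],  # device B, trained on 300 local examples
]
client_sizes = [100, 300]

# Only the parameters leave the devices; the raw data stays local.
global_weights = federated_average(client_weights, client_sizes)
print(global_weights)
```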