The Bias-Variance Tradeoff

To avoid extremely long and redundant blog posts, instead of writing notes on an entire chapter from Deep Learning, I will instead write about a chapter subsection or some topic I find interesting.

Today’s post is about the bias-variance tradeoff, a well-known problem in machine learning which involves minimizing two sources of error which prevent supervised learning algorithms from generalizing well to new data. We’ll discuss point estimation and statistical bias, variance (and their relationship with generalization), and consistency.

Point Estimation

From Deep Learning: “Point estimation is the attempt to provide the single ‘best’ prediction of some quantity of interest.” The quantity of interest may be a parameter, a vector of parameters, or even a function. The book uses the notation \mathbf{\hat\theta} to denote a point estimate of a parameter \mathbf{\theta}.

Now, if \{ \mathbf{x}^{(1)}, ..., \mathbf{x}^{(m)}\} is a set of independent and identically distributed (IID) data samples, any function of the data

\mathbf{\hat\theta}_m = g(\mathbf{x}^{(1)}, ..., \mathbf{x}^{(m)})

is considered a point estimate or a statistic of the data. This definition doesn’t require that g returns a good estimate of \mathbf{\theta}, or even that the range of g is the same as the allowable values of \mathbf{\theta}; in this sense, it allows the designer of the estimator a great deal of flexibility. We consider a good estimator to be a function g whose output is close to the true \mathbf{\theta} which generated the dataset \{ \mathbf{x}^{(1)}, ..., \mathbf{x}^{(m)} \}.
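As a minimal numerical sketch (my own illustration, not from the book), the sample mean is a point estimator g of the mean of the data generating distribution; the specific numbers below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# True parameter of the data generating distribution (assumed known here
# only so we can compare against the estimate).
theta = 3.0

# An IID dataset {x^(1), ..., x^(m)} drawn from N(theta, 1).
m = 10_000
data = rng.normal(loc=theta, scale=1.0, size=m)

# g(x^(1), ..., x^(m)) = sample mean: a point estimate of theta.
theta_hat = data.mean()
```

Because the dataset is random, `theta_hat` is itself a random variable; rerunning with a different seed gives a slightly different estimate.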

In this discussion, we take the frequentist view on statistics: the true parameter value \mathbf{\theta} is fixed but unknown, and the statistic \mathbf{\hat\theta} is a function of the dataset. Since the data is generated from a random process parametrized by \mathbf{\theta}, any function of the dataset is random; therefore, the estimator \mathbf{\hat\theta} is a random variable.

Point estimation can also refer to estimation of the relationship between input (explanatory) and output (dependent) variables; this is commonly referred to as function estimation. We now try to predict a variable \mathbf{y} given an input vector \mathbf{x}. We assume there is a function f(\mathbf{x}) which approximates the relationship between the two; e.g., we might assume that \mathbf{y} = f(\mathbf{x}) + \mathbf{\epsilon}, where \mathbf{\epsilon} is the part of \mathbf{y} which cannot be predicted from \mathbf{x}. In function estimation, we wish to approximate f with a model or estimator \hat f. In this sense, function estimation is the same as point estimation, in which the estimated \hat f is simply a point estimate in function space.

We can now review the most commonly studied properties of point estimators and discuss what they imply about these estimators.


Bias

Here, we are referring to statistical bias; that is, “the difference between this estimator’s expected value and the true value of the parameter being estimated” (Wikipedia). More formally, this is defined as:

\text{bias}(\mathbf{\hat\theta}_m) = \mathbb{E}[\mathbf{\hat\theta}_m] - \mathbf{\theta},

where the expectation is taken over the dataset (viewed as samples of a random variable) and \mathbf{\theta} is the true value of the estimated parameter, used to define the data generating distribution. We call an estimator unbiased if \text{bias}(\mathbf{\hat\theta}_m) = 0, implying that its expected value \mathbb{E}[\mathbf{\hat\theta}_m] = \mathbf{\theta}. An estimator is called asymptotically unbiased if \lim_{m \rightarrow \infty} \text{bias}(\mathbf{\hat\theta}_m) = 0, implying \lim_{m \rightarrow \infty} \mathbb{E}[\mathbf{\hat\theta}_m] = \mathbf{\theta}.
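A standard example of a biased estimator is the sample variance that divides by m rather than m - 1. As a quick sketch (mine, not the book's), we can approximate the expectation over datasets by averaging both estimators over many simulated datasets:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0          # true variance of the data generating distribution
m, trials = 5, 200_000

# Draw many datasets of size m and apply both variance estimators to each.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
biased = samples.var(axis=1, ddof=0)    # divides by m
unbiased = samples.var(axis=1, ddof=1)  # divides by m - 1

# bias = E[theta_hat] - theta, approximated by averaging over trials.
bias_biased = biased.mean() - sigma2    # close to -sigma2 / m = -0.8
bias_unbiased = unbiased.mean() - sigma2  # close to 0
```

The biased estimator has expectation \frac{m-1}{m}\sigma^2, so its bias, -\sigma^2/m, vanishes as m grows: it is biased but asymptotically unbiased.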

Variance and Standard Error

We may also wish to consider how much we expect our estimator to vary as a function of the data sample. Just as we may compute the expectation of an estimator to determine its bias, we may also compute its variance, given by:

\text{Var}(\mathbf{\hat\theta}_m) = \mathbb{E}\left[(\mathbf{\hat\theta}_m - \mathbb{E}[\mathbf{\hat\theta}_m])^2\right],

where the random variable is the dataset. The square root of the variance is called the standard error, denoted by \text{SE}(\mathbf{\hat\theta}).

The variance or standard error of an estimator is a measure of how much we expect the output of our estimator to vary as a function of independent resampling of data from the data generating distribution. Just as we might like an estimator to exhibit low bias, we might also want it to have low variance.

When we compute any statistic using a finite number of data samples, our estimate of the true parameter will always be uncertain, in the sense that we could have obtained different samples from the data generating distribution whose statistics would be different. Therefore, the expected degree of variation in an estimator is a source of error we would like to quantify.

The standard error of the mean is given by

\text{SE}(\hat\mu) = \sqrt{\text{Var}[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}]} = \frac{\sigma}{\sqrt{m}},

where \sigma^2 is the true variance of the samples x^{(i)}. The standard error is often estimated by using an estimate of \sigma; however, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance is an unbiased estimate of the standard error. Both tend to underestimate the true standard deviation, yet are still commonly used in practice. For large m, the square root of the unbiased estimate provides a reasonable approximation.
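We can check the \sigma / \sqrt{m} formula empirically by resampling many datasets and measuring how much the sample mean actually varies (a sketch of my own, with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, m, trials = 2.0, 100, 100_000

# Empirical standard deviation of the sample mean over many independent
# resamplings of a size-m dataset...
means = rng.normal(0.0, sigma, size=(trials, m)).mean(axis=1)
empirical_se = means.std()

# ...matches the analytic formula SE = sigma / sqrt(m) = 2 / 10 = 0.2.
analytic_se = sigma / np.sqrt(m)
```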

The standard error of the mean can be useful for machine learning experiments. We often estimate the generalization error by computing the sample mean of the error on the test dataset, where the number of data samples in the set determines this estimate’s accuracy. Using the central limit theorem, which tells us that the mean will be approximately distributed as a normal distribution, we can use the standard deviation to compute the probability that the true expectation falls in any given interval. For example, the 95% confidence interval centered on the mean \hat\mu_m is

(\hat\mu_m - 1.96 \text{SE}(\hat\mu_m),\hat\mu_m + 1.96 \text{SE}(\hat\mu_m))

under the normal distribution with mean \hat\mu_m and variance \text{SE}(\hat\mu_m)^2. In machine learning experiments, it is typical to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is less than the lower bound of the 95% confidence interval for the error of algorithm B.
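The comparison rule above is easy to implement. The sketch below (my own; the error rates are hypothetical) builds the 95% interval from per-example test errors and applies the upper-bound/lower-bound test:

```python
import numpy as np

def confidence_interval_95(errors):
    """95% confidence interval for the expected error, computed from
    per-example test errors via the standard error of the mean."""
    errors = np.asarray(errors, dtype=float)
    m = len(errors)
    mu_hat = errors.mean()
    se = errors.std(ddof=1) / np.sqrt(m)  # estimated SE of the mean
    return mu_hat - 1.96 * se, mu_hat + 1.96 * se

rng = np.random.default_rng(3)
# Hypothetical per-example 0/1 test errors for two algorithms.
errors_a = rng.binomial(1, 0.10, size=2000)  # algorithm A: ~10% error
errors_b = rng.binomial(1, 0.20, size=2000)  # algorithm B: ~20% error

lo_a, hi_a = confidence_interval_95(errors_a)
lo_b, hi_b = confidence_interval_95(errors_b)
a_better = hi_a < lo_b  # A's upper bound below B's lower bound
```

With 2000 test examples and a 10-point gap in error rate, the intervals separate cleanly; with a smaller test set or a smaller gap, they may overlap and the comparison is inconclusive.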

Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two different sources of an estimator’s error: bias measures the expected deviation from the true value of the function or parameter, and variance measures the deviation from the expected estimator value that any given sampling from the data generating distribution is likely to cause.

How do we choose between two estimators, one with more bias and one with more variance? The most common way to settle this trade-off is by using cross-validation, in which the training data is partitioned into k equally sized subsets, each of which is “held out” in turn to use as validation data in a series of training/validation “rounds”. The results from evaluating the estimator on the validation data in each round are averaged to produce an estimate of the generalization error.

Alternatively, we may use the mean squared error (MSE) to compare estimators:

\text{MSE} = \mathbb{E}[(\mathbf{\hat\theta}_m - \mathbf{\theta})^2]

= \text{Bias}(\mathbf{\hat\theta}_m)^2 + \text{Var}(\mathbf{\hat\theta}_m).

The MSE measures the overall expected deviation, in the sense of squared errors, between the estimator and the true value of the parameter \mathbf{\theta}. From the above equation, the MSE incorporates both the bias and variance in a natural way. Desirable estimators are those with small MSE, and small bias and variance components in turn.
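The bias–variance decomposition of the MSE can be verified numerically. In this sketch (my own; the 0.9 shrinkage factor is an arbitrary choice to manufacture bias), we compute the empirical MSE, bias, and variance of a deliberately biased estimator of the mean and check that they satisfy the identity:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, m, trials = 5.0, 20, 200_000

# A deliberately biased "shrinkage" estimator: 0.9 times the sample mean.
samples = rng.normal(theta, 1.0, size=(trials, m))
theta_hat = 0.9 * samples.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)
bias = theta_hat.mean() - theta  # close to 0.9 * 5 - 5 = -0.5
variance = theta_hat.var()       # close to 0.81 / m

# Check: MSE = Bias^2 + Var (holds exactly for the empirical moments).
```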

The relationship between bias and variance is closely related to the machine learning concepts of overfitting, underfitting, and capacity. When generalization error is measured by MSE, where bias and variance are meaningful components, increasing model capacity tends to lead to an increase in variance and a decrease in bias. This is illustrated in Figure 1, where we see a U-shaped generalization error curve as a function of model capacity.

Adapted from Deep Learning
Figure 1: Bias-Variance Tradeoff as a Function of Model Capacity


Consistency

So far, we’ve discussed useful properties of estimators given a training dataset of fixed size. We are usually interested in the behavior of the estimator as the size of this dataset grows. We typically want our point estimates to converge to the true value(s) of the corresponding parameter(s) as the number of data samples m in our dataset grows. More formally, we want

\text{plim}_{m \rightarrow \infty} \mathbf{\hat\theta}_m = \mathbf{\theta},

where \text{plim} denotes convergence in probability; i.e., for any \epsilon > 0, P(|\mathbf{\hat\theta}_m - \mathbf{\theta}| > \epsilon) \rightarrow 0 as m \rightarrow \infty. This condition is known as consistency (sometimes known as weak consistency).

Consistency guarantees that the bias induced by the estimator diminishes as the number of training samples grows. The reverse is not true: asymptotic unbiasedness does not imply consistency, meaning there exist unbiased estimators which are not consistent.
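Consistency of the sample mean can be illustrated directly from the definition of \text{plim}: for a fixed \epsilon, the probability of missing the true parameter by more than \epsilon shrinks as m grows. A small simulation of my own (all numbers arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, epsilon = 0.0, 0.05

def miss_prob(m, trials=4_000):
    """Estimate P(|theta_hat_m - theta| > epsilon) for the sample mean
    of m draws from N(theta, 1), by simulating many datasets."""
    means = rng.normal(theta, 1.0, size=(trials, m)).mean(axis=1)
    return np.mean(np.abs(means - theta) > epsilon)

p_small = miss_prob(10)     # large miss probability for tiny datasets
p_large = miss_prob(4_000)  # near zero once m is large
```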


Deep Learning Book: Probability and Information Theory

Thus begins my notes on the third chapter of Deep Learning, entitled Probability and Information Theory. This chapter was more exciting to read than the last, but there is a similar amount of math notation. Like the last chapter, it contains mathematics and ideas which are fundamental to the practice of deep learning.

P.S. This took way too long to write, because I was in the middle of writing it as the semester began, and have devoted little time to writing blog posts. Moving forward, I’d like to devote a small, consistent amount of time to it each week.

Chapter 3: Probability and Information Theory

Probability theory is a mathematical framework for representing uncertain statements. It provides a method for quantifying uncertainty as well as axioms for deriving new uncertain statements. In AI applications, we use this framework in two major ways: (1) the laws of probability suggest how AI systems should perform reasoning, and so we design algorithms which compute or approximate expressions derived from probability theory; (2) we can use probability theory and statistics to analyze the behavior of proposed AI systems.

This chapter is meant to ensure that readers whose background is in software engineering can understand the material in the book, by providing the fundamentals of probability as applied to scientific and engineering disciplines.

While probability theory allows us to make uncertain statements and to reason despite our uncertainty, information theory enables us to quantify the amount of uncertainty in a probability distribution.

3.1: Why Probability?

Much of computer science deals with systems which can be described with entirely deterministic mathematics. A programmer can safely assume that a CPU will execute each machine instruction without error, and errors occur so rarely in computer hardware that most software applications never have to account for them. For this reason, it may be somewhat surprising that machine learning makes heavy use of probability theory.

Machine learning must deal with uncertain quantities and sometimes with stochastic or nondeterministic processes. This uncertainty and/or stochasticity can arise from many sources. In light of this, nearly all activities require some ability to cope with the presence of uncertainty. Beyond mathematical statements which are true by definition, it’s difficult to think of any proposition (sentence) which is absolutely true, or any event that is absolutely guaranteed to happen.

There are three possible sources of uncertainty:

  1. Inherent stochasticity in the modeled system. As an example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as probabilistic. We can also create theoretical scenarios which we postulate to have random dynamics (such as a card game where we assume the cards are shuffled into a truly random order).
  2. Incomplete observability. Deterministic systems can appear stochastic when we can’t observe all the variables which drive its behavior. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door, where two doors lead to a goat while a third leads to a car. The outcome given the contestant’s choice is deterministic, but from the contestant’s point of view, the outcome is uncertain.
  3. Incomplete modeling. If we use a model which must discard some of the information we can observe, the discarded information results in uncertainty in the model’s prediction. Suppose we build a robot that can exactly observe the location of every object around it. If the robot chunks space into discrete locations when predicting the future position of these objects, this discretization makes the robot uncertain about the precise new positions of the objects: each object could be anywhere within the discrete location it was observed to occupy.

It is often more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the power to accommodate a complex rule. For example, the rule “most birds fly” is cheap to develop and broadly applicable, while a rule of the form “birds fly, except for very young birds that haven’t yet learned to fly, sick or injured birds that have lost the ability to fly, …” is expensive to develop, maintain, and communicate, and is still brittle and prone to failure.

While it is clear that we need a means of representing and reasoning about uncertainty, it is not obvious that probability theory can provide the tools we need for artificial intelligence applications. Probability theory was originally developed to analyze the frequency of events, such as drawing a certain hand of cards in a poker game. These kinds of events are often easily repeatable. When we say that an outcome has a probability of p of occurring, this means that if we repeated the experiment infinitely many times, then proportion p of the repetitions would result in that outcome. This kind of reasoning doesn’t seem to be applicable to experiments which are not repeatable. If a doctor analyzes a patient and says that the patient has a 40% chance of having the flu, this means something very different: we can’t make infinitely many replicas of the patient, nor is there any reason to believe that different replicas of the patient would produce the same symptoms yet have varying actual conditions. In this case, we use probability to represent a degree of belief, with 1 indicating absolute certainty in some event occurring and 0 indicating absolute certainty in the same event not occurring. The former type of probability (relating to the rates at which events occur) is known as frequentist probability, while the latter (related to qualitative levels of uncertainty) is known as Bayesian probability.

If we list some common sense properties that we expect reasoning about uncertainty to have, the only way to satisfy those properties is to suppose that Bayesian probabilities behave the same way as frequentist probabilities; otherwise, the two approaches to the theory would be incompatible. So, the two must share a common set of axioms.

Probability can be viewed as an extension of logic to deal with uncertain propositions. Logic provides a set of formal rules for determining which propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Probability theory gives a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.

3.2: Random Variables

A random variable is a variable that can take on different values at random. It is typically denoted as a lowercase letter in plain typeface, and the values it takes on as lowercase script letters. On its own, a random variable is simply a description of the states that the variable could possibly take on; it must be coupled with a probability distribution that specifies how likely each of these states is.

Random variables may be discrete or continuous. A discrete random variable is one which has a finite or countably infinite number of states. These can be any named states which are not considered to have a numerical value. A continuous random variable is one which is associated with a real value (and thus has an uncountably infinite number of states).

3.3: Probability Distributions

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of a number of possible states. The way we describe these distributions depends on whether the variables in question are discrete or continuous.

3.3.1: Discrete Variables and Probability Mass Functions

A probability distribution over discrete variables can be described using a probability mass function (PMF). These are typically denoted with a capital P. The PMF maps a state of a random variable to the probability of that random variable taking on that state. The probability that \text{x} = x is denoted as P(x); P(x) = 1 means the event \text{x} = x is certain, while P(x) = 0 means it is impossible.

Probability mass functions can operate on many variables at once. Such a probability distribution over many variables is known as a joint probability distribution. P(\text{x} = x, \text{y} = y) denotes the probability that \text{x} = x and \text{y} = y simultaneously. We can also write P(x, y) for brevity.

To be a PMF on a random variable \text{x}, a function P must satisfy the following properties:

  • The domain of P must be the set of all possible states of \text{x}.
  • \forall x \in \text{x}, 0 \leq P(x) \leq 1. An impossible event has probability 0, and an event that is guaranteed to happen has probability 1; these bounds cannot be exceeded.
  • \sum_{x \in \text{x}} P(x) = 1. This property is referred to as being normalized. Without this, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

Consider a single discrete random variable \text{x} with k different states. We can place a uniform distribution on \text{x}; that is, make each of the states equi-probable, by setting its PMF to:

P(\text{x} = x_i) = \frac{1}{k}

for all i. This satisfies all properties necessary to be a PMF: each probability mass \frac{1}{k} is positive and less than or equal to 1, and the sum over all k states’ probability masses equals one, so the distribution is properly normalized.

3.3.2: Continuous Variables and Probability Density Functions

In working with continuous random variables, we describe probability distributions using probability density functions (PDFs) rather than PMFs. To be a probability density function, a function p must satisfy the following properties:

  • The domain of p must be the set of all possible states of \text{x}.
  • \forall x \in \text{x}, p(x) \geq 0; we do not require p(x) \leq 1.
  • \int p(x)dx = 1.

A PDF p(x) doesn’t give the probability of a specific state directly; rather, the probability of landing inside an infinitesimally small region with volume \delta x is given by p(x)\delta x.

We can integrate the density function to find the actual probability mass of a set of points. The probability that x lies in some set \mathbb{S} is given by the integral of p(x) over that set.

For an example of a PDF corresponding to a specific probability density over a continuous random variable, consider a uniform distribution over a finite interval of the real numbers. We can accomplish this with a function u(x; a, b), where a and b are the left and right endpoints of the interval, with b > a. To ensure that there is no probability mass outside the interval, we say \forall x \not\in [a, b], u(x; a, b) = 0. Inside of [a, b], u(x; a, b) = \frac{1}{b-a}. This is clearly non-negative everywhere, and it integrates to 1.

3.4: Marginal Probability

Sometimes we know the probability distribution over a set of random variables, and we want to know the probability distribution over just a subset of them. The distribution over a subset is known as the marginal probability distribution.

Suppose we have discrete random variables \text{x} and \text{y}, and we know P(\text{x}, \text{y}). We can find P(\text{x}) with the sum rule:

\forall x \in \text{x}, P(\text{x} = x) = \sum_{y} P(\text{x} = x, \text{y} = y).

For continuous random variables, we need to use integration instead of summation:

p(x) = \int p(x, y)dy.
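For discrete variables, the sum rule amounts to summing a joint probability table along the axis of the variable being marginalized out. A tiny sketch of my own (the table entries are made up):

```python
import numpy as np

# Joint distribution P(x, y) over x in {0, 1, 2} and y in {0, 1},
# stored as a table whose entries sum to 1.
joint = np.array([[0.10, 0.20],
                  [0.15, 0.25],
                  [0.05, 0.25]])

# Sum rule: marginalize a variable out by summing over its axis.
p_x = joint.sum(axis=1)  # marginal P(x) = [0.30, 0.40, 0.30]
p_y = joint.sum(axis=0)  # marginal P(y) = [0.30, 0.70]
```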

3.5: Conditional Probability

We are often interested in the probability of some event given that some other event has happened. This is called the conditional probability. We denote the conditional probability that \text{y} = y given \text{x} = x as P(\text{y} = y | \text{x} = x). This conditional probability can be computed with the following formula:

P(\text{y} = y | \text{x} = x) = \frac{P(\text{y} = y, \text{x} = x)}{P(\text{x} = x)}.

The conditional probability is only defined when P(\text{x} = x) > 0; we can’t compute the conditional probability conditioned on an event which never happens.
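In the discrete case, the formula above is a row-wise division of the joint table by the marginal. A small sketch of my own (hypothetical numbers):

```python
import numpy as np

# Joint distribution P(x, y) as a table (rows index x, columns index y).
joint = np.array([[0.10, 0.20],
                  [0.15, 0.25],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1)  # marginal P(x); all entries > 0 here

# P(y | x) = P(x, y) / P(x), computed for every (x, y) pair at once.
p_y_given_x = joint / p_x[:, None]

# Each row of p_y_given_x is a distribution over y and must sum to 1.
```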

It’s important not to confuse conditional probability with computing what would happen if some action were taken. Computing the consequences of an action is called making an intervention query; intervention queries are the domain of causal modeling, which is beyond the scope of this book.

3.6: The Chain Rule of Conditional Probability

Any joint probability distribution over many random variables can be decomposed into conditional distributions over only a single variable:

P(\text{x}^{(1)}, ...,\text{x}^{(n)}) = P(\text{x}^{(1)})\prod_{i=2}^{n}P(\text{x}^{(i)}|\text{x}^{(1)}, ...,\text{x}^{(i-1)}).

This is known as the chain rule or product rule of probability. It follows from the definition of conditional probability we’ve defined in section 3.5.

3.7: Independence and Conditional Independence

Two random variables \text{x} and \text{y} are said to be independent if their probability distributions can be expressed as a product of two factors, one only involving \text{x} and one only involving \text{y}:

\forall x \in \text{x}, y \in \text{y}, p(\text{x} = x, \text{y} = y) = p(\text{x} = x)p(\text{y} = y).

Two random variables are said to be conditionally independent given a random variable \text{z} if the conditional probability distribution over \text{x} and \text{y} factorizes in this way for all values of \text{z}:

\forall x \in \text{x}, y \in \text{y}, z \in \text{z}, p(\text{x} = x, \text{y} = y | \text{z} = z) = p(\text{x} = x | \text{z} = z)p(\text{y} = y | \text{z} = z).

To be compact with our notation, we can write \text{x} \bot \text{y} to mean that \text{x} and \text{y} are independent, and \text{x} \bot \text{y} | \text{z} means that \text{x} and \text{y} are conditionally independent given \text{z}.

3.8: Expectation, Variance and Covariance

The expectation or expected value of some function f(x) with respect to a probability distribution P(\text{x}) is the average (or mean value) that f takes on when x is drawn from P. For discrete variables, this can be computed with a summation:

\mathbb{E}_{\text{x} \sim P}[f(x)] = \sum_x P(x)f(x),

and for continuous variables, we compute it with an integral:

\mathbb{E}_{\text{x} \sim P}[f(x)] = \int p(x)f(x)dx.

Expectations are linear:

\mathbb{E}_{\text{x}}[\alpha f(x) + \beta g(x)] =\alpha \mathbb{E}_{\text{x}}[f(x)] + \beta\mathbb{E}_{\text{x}}[g(x)],

where \alpha and \beta are not dependent on x.

The variance gives a measure of how much the values of a function of a random variable \text{x} vary as we sample different values from its probability distribution:

\text{Var}(f(x)) = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2].

When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.

The covariance gives a sense as to how two variables are linearly related to each other, as well as the scale of each variable:

\text{Cov}(f(x), g(y)) = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])].

High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, the variables tend to both be high-valued at the same time; if the sign is negative, one variable tends to be high-valued while the other is low-valued, and vice versa. The measure of correlation normalizes the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related but distinct concepts. They are related because two variables which are independent have zero covariance, and two variables that have nonzero covariance are dependent. Independence, however, is distinct from covariance. For two variables to have zero covariance, there must be no linear dependence between them. Independence is a stronger property than zero covariance, because independence also excludes nonlinear relationships. It’s possible for two variables to be dependent but have zero covariance.
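The classic example of dependence with zero covariance is y = x^2 for x symmetric around zero: y is completely determined by x, yet \text{Cov}(x, y) = \mathbb{E}[x^3] - \mathbb{E}[x]\,\mathbb{E}[x^2] = 0 by symmetry. A quick numerical check (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(6)

# x is symmetric around 0; y = x^2 is fully determined by x (dependent),
# but the relationship is purely nonlinear, so the covariance vanishes.
x = rng.normal(0.0, 1.0, size=1_000_000)
y = x ** 2

cov = np.mean((x - x.mean()) * (y - y.mean()))  # close to 0
```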

The covariance matrix of a random vector \textbf{x} \in \mathbb{R}^n is an n \times n matrix such that:

\text{Cov}(\text{\textbf{x}})_{i,j} = \text{Cov}(\text{x}_i, \text{x}_j).

The diagonal elements of the covariance give the variance of each vector element:

\text{Cov}(\text{x}_i, \text{x}_i) = \text{Var}(\text{x}_i).

3.9: Common Probability Distributions

3.9.1: Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It’s controlled by a single parameter \phi \in [0, 1], which gives the probability that the random variable is equal to 1. It has the following properties:

P(\text{x} = 1) = \phi

P(\text{x} = 0) = 1 - \phi

P(\text{x} = x) = \phi^x (1 - \phi)^{1-x}

\mathbb{E}_x [\text{x}] = \phi

\text{Var}(\text{x}) = \phi(1 - \phi)

3.9.2: Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite. It is parametrized by a vector \mathbf{p} \in [0, 1]^{k-1}, where p_i gives the probability of the i-th state, and the k-th state’s probability is given by 1 - \mathbf{1}^\top\mathbf{p}. Multinoulli distributions are often used to capture distributions over categories of objects. We don’t usually compute the expectation or variance of multinoulli-distributed random variables, since the states may not have numerical values.

The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. This is so not because they are particularly powerful, but instead because their domain is very simple; they model discrete variables for which it is feasible to enumerate all states. When dealing with continuous variables, on the other hand, there are uncountably many states, so any distribution described by a small number of parameters must be under strict restriction.

3.9.3: Gaussian Distribution

The most commonly used distribution over real numbers is the normal or Gaussian distribution:

\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).

The two parameters \mu \in \mathbb{R} and \sigma \in (0, \infty) control the normal distribution. The former gives the coordinate of the central peak of the distribution and is also its mean (\mathbb{E}[\text{x}] = \mu); the latter gives the standard deviation of the distribution, and its square, \sigma^2, gives the variance.

Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two reasons:

  1. Many distributions we want to model are close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. In practice, many complicated systems can be modeled well as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.
  2. Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. Therefore, we can think of the normal distribution as being the distribution which inserts the least amount of prior knowledge into a model.

The normal distribution generalizes to \mathbb{R}^n, where it is known as the multivariate normal distribution. It can be parametrized with a positive definite symmetric matrix \mathbf{\Sigma}:

\mathcal{N}(\mathbf{x}; \boldsymbol\mu, \boldsymbol\Sigma) = \frac{1}{\sqrt{(2\pi)^n|\boldsymbol\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^\top \boldsymbol\Sigma^{-1} (\mathbf{x}-\boldsymbol\mu)\right).

The parameter \boldsymbol\mu gives the mean of the distribution, but is now vector-valued. The parameter \boldsymbol\Sigma gives the covariance matrix of the distribution. We can often fix the covariance matrix to be diagonal, and for even greater simplicity we use an isotropic Gaussian distribution, where the covariance matrix is fixed as a scalar times the identity matrix.

3.9.4: Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution peaked around the point x = 0. To do so, we can use the exponential distribution:

p(x; \lambda) = \lambda \mathbf{1}_{\mathbf{x} \geq 0} \text{exp}(-\lambda x).

The exponential distribution uses the indicator function \mathbf{1}_{x \geq 0} to assign zero probability to all negative values of x.

A closely related distribution which allows us to place a sharp peak of probability mass at an arbitrary point \mu is the Laplace distribution:

\text{Laplace}(x; \mu, \gamma) = \frac{1}{2 \gamma} \text{exp}(- \frac{|x - \mu|}{\gamma}).

3.9.5: The Dirac Distribution and Empirical Distributions

In some cases, we want to specify that all the mass in a probability distribution is clustered around a single point. This can be accomplished by defining a probability density function using the Dirac delta function, \delta(x):

p(x) = \delta (x - \mu).

The Dirac delta function is defined such that it’s zero-valued everywhere except at 0, yet integrates to 1. The Dirac delta function is not an ordinary function that associates each value x with a real-valued output; instead, it is a different kind of mathematical object called a generalized function, which is defined in terms of its properties when integrated. We can think of this function as being the limit point of a series of functions that put less and less mass on all points other than zero. By defining p(x) to be \delta shifted by -\mu, we obtain an infinitely narrow and infinitely high peak of probability mass where x = \mu.

A common use of the Dirac delta distribution is as a component of an empirical distribution:

\hat p(\mathbf{x}) = \frac{1}{m} \sum_{i = 1}^{m} \delta (\mathbf{x} - \mathbf{x}^{(i)}),

which puts probability mass \frac{1}{m} on each of the m data points which form a dataset or collection of examples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables; in the discrete case, an empirical distribution can be conceptualized as a multinoulli distribution with a probability associated with each possible input value that is simply equal to the empirical frequency of that value in the training set.
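Operationally, sampling from the empirical distribution just means drawing training points uniformly at random with replacement, since each point carries mass \frac{1}{m}. A brief sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(8)

# A small continuous-valued dataset; its empirical distribution puts
# probability mass 1/m on each of the m points.
dataset = rng.normal(0.0, 1.0, size=50)

# Sampling from the empirical distribution = uniform draws with replacement.
resample = rng.choice(dataset, size=100_000, replace=True)

# Every drawn value is one of the original data points, and the mean of
# the empirical distribution matches the dataset's sample mean.
```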

We can think of the empirical distribution as specifying the distribution that we sample from when we train a model on a dataset. Another important perspective on the empirical distribution is that it is the probability density which maximizes the likelihood of the training data.
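In the discrete case, this frequency view is easy to make concrete; a minimal sketch (the dataset below is made up for illustration):

```python
from collections import Counter
from fractions import Fraction

def empirical_pmf(samples):
    """Empirical distribution over discrete values: each value's probability
    is simply its observed frequency in the dataset."""
    counts = Counter(samples)
    m = len(samples)
    return {value: Fraction(count, m) for value, count in counts.items()}

data = [0, 1, 1, 2, 1, 0]
pmf = empirical_pmf(data)
# the value 1 appears in 3 of the 6 examples, so pmf[1] == Fraction(1, 2)
```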

3.9.6: Mixtures of Distributions

It is common to define probability distributions by combining other, simpler probability distributions. One common way of doing so is to construct a mixture distribution, which is made up of several component distributions. On each trial, the choice of which distribution to generate a sample from is determined by sampling a component identity from a multinoulli distribution:

P(\text{x}) = \sum_i P(\text{c} = i)P(\text{x} | \text{c} = i),

where P(\text{c}) is the multinoulli distribution over such identities.

The mixture model allows us to briefly glimpse a concept that will be of much importance later: the latent variable. This is a random variable which we cannot observe directly. The component identity variable \text{c} of the mixture model provides an example. Latent variables may be related to \text{x} through a joint distribution; in this case, P(\text{x}, \text{c}) = P(\text{x} | \text{c}) P(\text{c}). The distributions P(\text{c}) (over the latent variable) and P(\text{x} | \text{c}) (relating the latent variable to the visible variable) determine the shape of the distribution P(\text{x}), though it’s possible to describe P(\text{x}) without reference to the latent variable.

A powerful and common type of mixture model is the Gaussian mixture model, in which the components p(\mathbf{\text{x}} | \text{c} = i) are Gaussians. Each component has a separately parametrized mean and covariance. Some mixtures can impose more constraints, such as sharing a single covariance matrix across the mixture components, or constraining the covariance matrices of the components to be diagonal or isotropic.

The parameters of a Gaussian mixture specify the prior probability \alpha_i = P(\text{c} = i) given to each component i. “Prior” indicates that it expresses the model’s beliefs about \text{c} before it has observed \mathbf{\text{x}}. In comparison, P(\text{c} | \mathbf{\text{x}}) is a posterior probability, since it is computed after observing \mathbf{\text{x}}. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated to any specific nonzero amount of error by a Gaussian mixture model with enough components.
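To make this concrete, the mixture density P(x) = \sum_i P(\text{c} = i) p(x | \text{c} = i) and the posterior P(\text{c} = i | x) can be computed directly for a small univariate Gaussian mixture; the weights, means, and standard deviations below are made up for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# arbitrary two-component mixture; alphas are the prior weights P(c = i)
alphas = [0.3, 0.7]
mus    = [-1.0, 2.0]
sigmas = [0.5, 1.0]

def mixture_pdf(x):
    # P(x) = sum_i P(c = i) p(x | c = i)
    return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

def posterior(i, x):
    # P(c = i | x), computed via Bayes' rule
    return alphas[i] * gaussian_pdf(x, mus[i], sigmas[i]) / mixture_pdf(x)
```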

3.10: Useful Properties of Common Functions

Certain functions arise often when working with probability distributions, especially in the context of deep learning.

One of these functions is the logistic sigmoid:

\sigma(x) = \frac{1}{1 + \text{exp}(-x)}.

This function is commonly used to produce the \phi parameter of a Bernoulli distribution, since its range is (0, 1), which lies in the valid range of values used for the \phi parameter. The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.

Another commonly used function is the softplus function:

\xi (x) = \text{log} (1 + \text{exp} (x)).

This function can be used for producing the \beta or \sigma parameters of a normal distribution, since its range is (0, \infty). The name “softplus” comes from the fact that it is a smoothed (“softened”) version of

x^+ = \text{max} (0, x).

The following properties are useful enough to be worth memorizing:

\sigma (x) = \frac{\text{exp}(x)}{\text{exp}(x) + \text{exp}(0)}

\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))

1 - \sigma (x) = \sigma(-x)

\text{log}\sigma(x) = -\xi(-x)

\frac{d}{dx}\xi(x) = \sigma(x)

\forall x \in (0, 1), \sigma^{-1}(x) = \text{log}(\frac{x}{1-x})

\forall x > 0, \xi^{-1}(x) = \text{log}(\text{exp}(x)-1)

\xi(x) = \int_{-\infty}^x \sigma(y)dy

\xi(x) - \xi(-x) = x

The function \sigma^{-1}(x) is called the logit in statistics, but this term is rarely used in machine learning.

The softplus function is intended as a smooth version of the positive part function x^+ = \text{max}(0, x), whose counterpart is the negative part function x^- = \text{max}(0, -x). To obtain a smooth analogue of the negative part, one can use \xi(-x). Just as x can be recovered from its positive and negative parts via the identity x^+ - x^- = x, x can also be recovered from the softplus as in the last equation shown above (\xi(x) - \xi(-x) = x).
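These identities are easy to check numerically; a quick sketch using Python’s math module:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # log(1 + exp(x)); log1p is the numerically stable form for moderate x
    return math.log1p(math.exp(x))

x = 1.7  # arbitrary test point
assert math.isclose(1 - sigmoid(x), sigmoid(-x))          # 1 - sigma(x) = sigma(-x)
assert math.isclose(math.log(sigmoid(x)), -softplus(-x))  # log sigma(x) = -xi(-x)
assert math.isclose(softplus(x) - softplus(-x), x)        # xi(x) - xi(-x) = x
p = sigmoid(x)
assert math.isclose(math.log(p / (1 - p)), x)             # the logit inverts the sigmoid
```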

3.11: Bayes’ Rule

It is often the case that we know P(\text{y} | \text{x}) and need to know P(\text{x} | \text{y}). If we also know P(\text{x}), we can compute the desired quantity using Bayes’ rule:

P(\text{x} | \text{y}) = \frac{P(\text{x})P(\text{y} | \text{x})}{P(\text{y})}.

Though P(\text{y}) appears in the formula, it’s usually feasible to compute P(\text{y}) = \sum_{x} P(\text{y} | \text{x}) P(\text{x}), so we don’t need to begin with knowledge of P(\text{y}).

Bayes’ rule is simple to derive from the definition of conditional probability, but it’s useful to know the name of this formula since many texts refer to it by name.
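As a worked example (with made-up numbers), here is Bayes’ rule applied to a binary diagnostic-test scenario, computing P(\text{y}) on the fly exactly as described above:

```python
# hypothetical example: x is a disease state, y is a positive test result
p_x = {'sick': 0.01, 'healthy': 0.99}          # prior P(x)
p_y_given_x = {'sick': 0.90, 'healthy': 0.10}  # likelihood P(y = positive | x)

# P(y) = sum_x P(y | x) P(x) -- so we never need to know P(y) up front
p_y = sum(p_x[x] * p_y_given_x[x] for x in p_x)

# Bayes' rule: P(x | y) = P(x) P(y | x) / P(y)
p_sick_given_positive = p_x['sick'] * p_y_given_x['sick'] / p_y
# with these numbers, a positive test leaves only about an 8.3% chance of being sick
```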

3.12: Technical Details of Continuous Variables

A proper formal understanding of continuous random variables and probability density functions requires developing probability theory in terms of a branch of mathematics called measure theory. In section 3.3.2, we saw that the probability of a continuous vector-valued \mathbf{\text{x}} lying in some set \mathbb{S} is given by the integral of p(\mathbf{x}) over the set. Some choices of the set, however, can produce paradoxes. These paradoxical sets are generally constructed making heavy use of the infinite precision of real numbers. One of the key contributions of measure theory is to provide a characterization of the set of sets we can compute the probability of without encountering paradoxes. In the book, integrations are only done over sets with relatively simple descriptions, so this aspect of measure theory never becomes relevant.

Measure theory is useful for describing theorems that apply to most points in \mathbb{R}^n but do not apply to some corner cases. Measure theory provides a rigorous way of describing that a set of points is negligibly small; we say that such a set has measure zero. It is sufficient to understand the intuition that a set of measure zero occupies no volume in the space we are measuring. For example, in \mathbb{R}^2, a line has measure zero, while a filled polygon has positive measure. Similarly, an individual point has measure zero. The union of a countably infinite number of sets of measure zero is also of measure zero (e.g., the rational numbers).

Another useful term from measure theory is almost everywhere, which describes a property which holds throughout all space except for on a set of measure zero. Since the exceptions occupy no volume, they can typically be safely ignored. Some results in probability theory which hold for all discrete values only hold “almost everywhere” for continuous values.

Another technical detail of continuous variables relates to handling continuous random variables that are determined by one another. Suppose we have two random variables, \text{x} and \text{y}, such that y = g(x), where g is an invertible, continuous, differentiable function. One might expect that p_y(y) = p_x(g^{-1}(y)), but this is not the case. (Here, the book constructs a simple example involving scalar random variables, one of which is determined by the other, which ends up violating the definition of a probability distribution.) To fix the issue, one has to account for the distortion of space introduced by the function g, which we can do by including the determinant of the Jacobian matrix – the matrix of partial derivatives with entries J_{i, j} = \frac{\partial g(\mathbf{x})_i}{\partial x_j}.

3.13: Information Theory

Information theory is a branch of mathematics which revolves around quantifying how much information is present in a signal. In the context of machine learning, we can apply information theory to continuous variables where some of the original “message length” interpretations of information theory do not apply. We use a few key ideas from information theory to characterize probability distributions or to measure similarity between them.

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. We want to quantify information in a way that formalizes this intuition.

  • Likely events should have low information content, and events that are guaranteed to happen should have no information content at all.
  • Unlikely events should have high information content.
  • Independent events should have additive information. For example, if we flip a coin twice and observe two heads, this conveys twice as much information as observing heads on a single flip, since the tosses are (pretty much) independent.

To satisfy these properties, we define the self-information of an event \text{x} = x to be

I(x) = -\text{log}P(x).

(The textbook notes here that this definition is in terms of nats, since the book always uses \text{log} to mean the natural logarithm, but other texts may use the base-2 logarithm, which gives units called bits or shannons. This is simply a rescaling of information measured in nats, or any other base.)

When \text{x} is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur (remember that continuous probability distributions need only integrate to 1).

Self-information deals only with single outcomes. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

H(x) = \mathbb{E}_{\text{x} \sim P}[I(x)] = -\mathbb{E}_{\text{x} \sim P}[\text{log}P(x)],

also denoted by H(P). The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits / nats needed on average to encode symbols drawn from a distribution P. Distributions which are nearly deterministic (outcomes are nearly certain) have low entropy, and distributions which are closer to uniform have high entropy. When x is continuous, the Shannon entropy is known as the differential entropy.
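Both quantities are straightforward to compute for discrete distributions; a short sketch, working in nats:

```python
import math

def self_information(p):
    """I(x) = -log P(x), in nats."""
    return -math.log(p)

def entropy(dist):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), in nats.
    Terms with zero probability contribute nothing."""
    return sum(-p * math.log(p) for p in dist if p > 0)

# a guaranteed event carries no information; rarer events carry more
assert self_information(1.0) == 0.0
assert self_information(0.1) > self_information(0.5)

# nearly deterministic distributions have low entropy; uniform ones have high entropy
assert entropy([0.98, 0.01, 0.005, 0.005]) < entropy([0.25] * 4)
assert math.isclose(entropy([0.25] * 4), math.log(4))
```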

If we have two separate probability distributions P(\text{x}) and Q(\text{x}) over the same random variable \text{x}, we can measure how different they are with the Kullback-Leibler (KL) divergence:

D_{\text{KL}}(P || Q) = \mathbb{E}_{\text{x} \sim P}[\text{log}\frac{P(x)}{Q(x)}] = \mathbb{E}_{x \sim P}[\text{log}P(x) - \text{log}Q(x)].

In the case of discrete variables, this is the extra amount of information needed to send a message containing symbols drawn from P when we use a code that was designed to minimize the length of messages drawn from Q.

The KL divergence has many useful properties, most notably being non-negative. The KL divergence is zero if and only if P and Q are the same distribution (in the case of discrete random variables) or equal “almost everywhere” (in the case of continuous random variables). Since the KL divergence is non-negative and measures the difference between two distributions, it is often thought of as a distance between distributions, although it is not a true distance metric because it is not symmetric (D_{\text{KL}}(P || Q) \neq D_{\text{KL}}(Q || P) for some P and Q). This asymmetry means there are real consequences to the choice of whether to use D_{\text{KL}}(P || Q) or D_{\text{KL}}(Q || P).

A closely related quantity is the cross-entropy H(P, Q) = H(P) + D_{\text{KL}} (P || Q), which is similar to the KL divergence but lacks the \mathbb{E}_{\text{x} \sim P}[\text{log}P(x)] term:

H(P, Q) = -\mathbb{E}_{\text{x} \sim P} \text{log} Q(x).

Minimizing the cross-entropy with respect to Q is the same as minimizing the KL divergence, since Q does not affect the omitted term in the definition of cross-entropy.
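The identity H(P, Q) = H(P) + D_{\text{KL}}(P || Q), the non-negativity of the KL divergence, and its asymmetry can all be checked numerically; the two distributions below are arbitrary:

```python
import math

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    # D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)]
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -E_{x~P}[log Q(x)]
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [0.1, 0.3, 0.6]

assert kl_divergence(P, P) == 0.0        # zero iff the distributions match
assert kl_divergence(P, Q) > 0.0         # non-negativity
assert not math.isclose(kl_divergence(P, Q), kl_divergence(Q, P))  # not symmetric
assert math.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))
```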

3.14: Structured Probabilistic Models

Machine learning algorithms often involve probability distributions over a very large number of random variables. These probability distributions often involve direct interactions only between relatively few variables. Using a single function to describe an entire joint probability distribution can be very inefficient, both computationally and statistically.

Instead of using a single function to represent a joint probability distribution, we can split it into many factors we multiply together. These factorizations greatly reduce the number of parameters needed to describe the distribution. Each factor uses a number of parameters that is exponential in the number of variables in the factor. This means that we can greatly reduce the cost of representing a distribution if we are able to find a factorization into distributions over fewer variables.

We can describe these kinds of factorizations using graphs. Here, we use the word “graph” in the sense of graph theory: a set of vertices which may be connected to each other via edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model or graphical model.

There are two main kinds of structured probabilistic models: directed and undirected. Both kinds of models use a graph in which each node corresponds to a random variable, and an edge connecting two random variables means that the probability distribution is able to represent direct interactions between them.

Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions. A directed model contains one factor for every random variable \text{x}_i in the distribution, and that factor consists of the conditional distribution over \text{x}_i given the parents of \text{x}_i, denoted Pa_{\mathcal{G}}(\text{x}_i):

p(\mathbf{\text{x}}) = \prod_{i} p(\text{x}_i | Pa_{\mathcal{G}}(\text{x}_i)).

Undirected models use graphs with undirected edges and represent factorizations into sets of functions. Unlike in the directed case, these functions are usually not probability distributions of any kind. Any set of nodes that are all connected to each other in \mathcal{G} is called a clique. Each clique \mathcal{C}^{(i)} in an undirected model is associated with a factor \phi^{(i)}(\mathcal{C}^{(i)}). The output of each factor must be non-negative, but there is no constraint that the factor must sum or integrate to 1 like a probability distribution.

The probability of a configuration of random variables is proportional to the product of all these factors – assignments that result in larger factor values are more likely. There is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, which is defined as the sum or integral over all states of the product of the \phi functions, in order to obtain a normalized probability distribution:

p(\mathbf{\text{x}}) = \frac{1}{Z}\prod_i \phi^{(i)}(\mathcal{C}^{(i)}).

These graphical representations of factorizations are a language for describing probability distributions. They are not mutually exclusive families of probability distributions. Being directed or undirected is not a property of a distribution; it’s a property of a particular description of a distribution, but any probability distribution can be described in both ways.
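As a sketch of the directed case, consider a hypothetical three-variable chain a → b → c, which factorizes as p(a, b, c) = p(a) p(b | a) p(c | b); the probability tables below are made up for illustration:

```python
# conditional probability tables for a chain a -> b -> c (values are illustrative)
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.7, 1: 0.3},
               1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    # p(a, b, c) = p(a) p(b | a) p(c | b)
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# summing over all 8 configurations confirms the factors define a
# normalized joint distribution (up to floating-point error)
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Each table here has at most 2 variables, so the chain needs far fewer parameters than a full table over all 8 joint configurations would.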

brian2 Spiking Neural Network Simulator Tutorial

I want to get familiar with the brian2 spiking neural network simulator; hence, this blog post. The code and Markdown-formatted comments were developed in a Jupyter notebook.

Part I: Neurons

Imports and Setup

%matplotlib inline

from brian2 import *

brian2 has a system for using quantities with physical dimensions. All the basic SI units can be used along with their standard prefixes, and some standard abbreviations:

20 * volt

1000 * amp

100 * namp

10 * nA * 5 * Mohm

Arithmetic combines units correctly: the last expression, 10 nA * 5 Mohm, evaluates to 50 mV.

A simple model

Let’s start by defining a simple neuron model. In brian2, all models are defined by systems of differential equations. Here’s a simple example:

tau = 10*ms
eqs = '''
dv/dt = (1-v)/tau : 1
'''

Equations are given by a multi-line string with one equation per line. The equations are written in standard mathematical notation, with one addition: at the end of each line, we write : unit, where unit is the SI unit of the variable being defined (a unit of 1 means the variable is dimensionless).

Let’s use this definition to create a neuron:

G = NeuronGroup(1, eqs, method='linear')

In brian2, we can only create groups of neurons using the class NeuronGroup. The first two arguments when you create one of these objects are the number of neurons and their defining differential equations.

The command run(100*ms) runs the simulation for 100ms. We can see that it has worked by printing out the value of v before and after the simulation.

print('Before v = %s' % G.v[0])
run(100*ms)
print('After v = %s' % G.v[0])

Before v = 0.0
After v = 0.99995460007

By default, all variables start with value 0. Since the differential equation is dv/dt = (1-v)/tau, we would expect after a while that v would tend towards 1, which is what we’ve observed. Specifically, we should expect to see that v has the value 1-exp(-t/tau). Let’s check this:

print('Expected value of v = %s' % (1 - exp(-100 * ms / tau)))

Expected value of v = 0.99995460007

The simulation gives the expected value!
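To connect this to what brian2 is doing internally: the simulator advances the state variables in discrete time steps (0.1 ms by default). A pure-Python sketch using forward-Euler steps shows the same convergence toward 1 (note that brian2's 'linear' method actually integrates this equation exactly rather than with Euler steps):

```python
import math

tau = 10e-3    # 10 ms, in seconds
dt = 0.1e-3    # 0.1 ms time step (brian2's default)
steps = 1000   # 100 ms of simulated time

v = 0.0
for _ in range(steps):
    v += dt * (1.0 - v) / tau  # forward-Euler step of dv/dt = (1 - v)/tau

# the exact solution starting from v = 0
analytic = 1.0 - math.exp(-steps * dt / tau)
# v and analytic agree to within the small discretization error
```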

Now, let’s look at a plot of how the value v evolves over time:

# so this code doesn't depend on previously defined variables
start_scope()

G = NeuronGroup(1, eqs, method='linear')
M = StateMonitor(G, 'v', record=0)

run(30*ms)

plot(M.t / ms, M.v[0])
xlabel('Time (ms)');
title('Voltage over Time');


We ran the simulation only for 30ms so we can see the behavior better. It looks like it’s behaving as expected, but let’s check it analytically by plotting the expected behavior as well.


G = NeuronGroup(1, eqs, method='linear')
M = StateMonitor(G, 'v', record=0)

run(30*ms)

plot(M.t/ms, M.v[0], '-b', lw=2, label='Brian')
plot(M.t/ms, 1-exp(-M.t/tau), '--r', lw=2, label='Analytic')
xlabel('Time (ms)');
legend();
title('Voltage over Time');


Clearly, the blue line (brian2 simulation) and the red line (analytic solution) coincide.

Now, we modify the equation and its parameters and see what happens in the plot of the simulation:


tau = 10*ms
eqs = '''
dv/dt = (sin(2*pi*100*Hz*t)-v)/tau : 1
'''

# Change to Euler's method since exact integration doesn't work for this differential equation
G = NeuronGroup(1, eqs, method='euler')
M = StateMonitor(G, 'v', record=0)

# Initial value for v
G.v = 5

run(100*ms)  # (duration assumed)

plot(M.t/ms, M.v[0])
xlabel('Time (ms)');
title('Voltage over Time');


Adding Spikes

So far we haven’t done anything neuronal, just played around with differential equations. Let’s start adding spiking behavior.


tau = 10*ms
eqs = '''
dv/dt = (1-v)/tau : 1
'''

# creating a single neuron which "spikes" at v = 0.8 and resets to v = 0
G = NeuronGroup(1, eqs, threshold='v > 0.8', reset='v=0', method='linear')
M = StateMonitor(G, 'v', record=0)

run(50*ms)

plot(M.t/ms, M.v[0])
xlabel('Time (ms)')
title('Spiking Neurons');


We’ve added two keywords to the NeuronGroup declaration: threshold='v > 0.8' and reset='v=0'. This means that whenever v > 0.8 we “fire” a spike and immediately reset v=0 afterwards. We can put any expression or series of statements as these arguments.

At the beginning of the simulation, the behavior is the same as before until v crosses the threshold of v > 0.8, at which point v resets to 0. You can’t see this in the figure, but internally, brian2 has registered this as a spike. Let’s take a look at this.


G = NeuronGroup(1, eqs, threshold='v > 0.8', reset='v=0', method='linear')

# creating a 'SpikeMonitor' object to record spike activity
spikemon = SpikeMonitor(G)

run(50*ms)

print('Spike times: %s' % spikemon.t[:])

Spike times: [ 16. 32.1 48.2] ms

The SpikeMonitor object takes the group whose spikes we’d like to record as its argument, and stores the spike times in the variable t. Let’s plot the spikes on top of the other figure to see that it’s getting the times right.
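The recorded spike times agree with the analytic solution: starting from v = 0, v(t) = 1 - exp(-t/tau) first crosses 0.8 at t = tau*ln(5) ≈ 16.1 ms, and because the reset returns v to 0, the later spikes repeat at roughly the same interval. A quick check:

```python
import math

tau = 10.0       # ms
threshold = 0.8

# solve 1 - exp(-t/tau) = threshold for t
t_first_spike = tau * math.log(1.0 / (1.0 - threshold))
# about 16.09 ms, matching the first recorded spike at 16 ms
# (the simulator registers the spike on the first 0.1 ms time step after the
# crossing, which is why the later spikes land at 32.1 ms and 48.2 ms)
```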


G = NeuronGroup(1, eqs, threshold='v > 0.8', reset='v=0', method='linear')

# creating both 'StateMonitor' and 'SpikeMonitor' objects
statemon = StateMonitor(G, 'v', record=0)
spikemon = SpikeMonitor(G)

run(50*ms)

plot(statemon.t/ms, statemon.v[0])
for t in spikemon.t:
    axvline(t/ms, ls='--', c='r', lw=3)
xlabel('Time (ms)')
title('Voltage Behavior and Spikes');


Let’s try a few more settings for the threshold and reset arguments.


G = NeuronGroup(1, eqs, threshold='v > 0.6', reset='v=0.1', method='linear')

# creating both 'StateMonitor' and 'SpikeMonitor' objects
statemon = StateMonitor(G, 'v', record=0)
spikemon = SpikeMonitor(G)

run(50*ms)

plot(statemon.t/ms, statemon.v[0])
for t in spikemon.t:
    axvline(t/ms, ls='--', c='r', lw=3)
xlabel('Time (ms)')
title('Voltage Behavior and Spikes');



G = NeuronGroup(1, eqs, threshold='v > 0.6', reset='v=-0.3', method='linear')

# creating both 'StateMonitor' and 'SpikeMonitor' objects
statemon = StateMonitor(G, 'v', record=0)
spikemon = SpikeMonitor(G)

run(50*ms)

plot(statemon.t/ms, statemon.v[0])
for t in spikemon.t:
    axvline(t/ms, ls='--', c='r', lw=3)
xlabel('Time (ms)')
title('Voltage Behavior and Spikes');



A common feature of neuron models is refractoriness: after a neuron fires, it becomes refractory, or inactive, for a certain duration, and cannot fire another spike until this period is over. Here’s how we do that in brian2:


# setting up equations and time scale
tau = 10*ms
eqs = '''
dv/dt = (1-v)/tau : 1 (unless refractory)
'''

# simulating one neuron with the same 'threshold' and 'reset' args, but with a 5ms refractory period
G = NeuronGroup(1, eqs, threshold='v > 0.8', reset='v=0', refractory=5*ms, method='linear')

# creating state and spike monitor objects
statemon = StateMonitor(G, 'v', record=0)
spikemon = SpikeMonitor(G)

run(50*ms)

plot(statemon.t/ms, statemon.v[0])
for t in spikemon.t:
    axvline(t/ms, ls='--', c='r', lw=3)
xlabel('Time (ms)')
title('Single Neuron Spikes with Refractoriness');


From this figure, after the first spike, we see that v stays at 0 for around 5ms before it resumes its normal behavior. To do this, we’ve done two things: added the keyword refractory=5*ms to the NeuronGroup declaration, and added (unless refractory) to the end of the definition of v in the differential equations. The first means only that the neuron cannot fire a spike during the refractory period, and the second switches off the evolution of the differential equation while the neuron is refractory.

Here’s what would happen if we didn’t include the (unless refractory). Note that we’ve also decreased the value of tau and increased the length of the refractory period to make the behavior clear.


# setting up time scale and differential equations
tau = 5*ms
eqs = '''
dv/dt = (1-v)/tau : 1
'''

# creating a single neuron with same 'threshold' and 'reset' args, and 15ms refractory period
G = NeuronGroup(1, eqs, threshold='v > 0.8', reset='v=0', refractory=15*ms, method='linear')

# setting up state and spike monitor objects
statemon = StateMonitor(G, 'v', record=0)
spikemon = SpikeMonitor(G)

run(50*ms)

plot(statemon.t/ms, statemon.v[0])
for t in spikemon.t:
    axvline(t/ms, ls='--', c='r', lw=3)
axhline(0.8, ls=':', c='g', lw=3)
xlabel('Time (ms)')
title('Single Neuron Spikes with Odd Refractory Behavior');
print("Spike times: %s" % spikemon.t[:])

Spike times: [ 8. 23.1 38.2] ms


What’s going on here? The behavior for the first spike is the same: v rises to 0.8 and then fires a spike at time 8ms before immediately resetting back to 0. Since the refractory period is now 15ms, the neuron won’t be able to spike again until time 8ms + 15ms = 23ms. Immediately after the spike, the value of v again begins to rise, since we didn’t specify (unless refractory) in the definition of dv/dt. Once it exceeds the threshold value of 0.8, however, it does not fire a spike, because the neuron is still refractory until time 23ms, at which point it fires a spike.

See the full brian2 documentation for more options and use cases with spikes.

Multiple Neurons

So far we’ve only been working with a single neuron. Let’s do something interesting with multiple neurons!


# using 100 neurons, 10ms time scale, and slightly different
# differential equation for simulation
N = 100
tau = 10*ms
eqs = '''
dv/dt = (2-v)/tau : 1
'''

# creating the neuron group w/ voltage threshold at 1 and resetting to 0
G = NeuronGroup(N, eqs, threshold='v > 1', reset='v=0', method='linear')

# instantiate each neuron with a random voltage
G.v = 'rand()'

# create spike monitor object
spikemon = SpikeMonitor(G)

run(50*ms)  # (duration assumed)

plot(spikemon.t/ms, spikemon.i, '.k')
xlabel('Time (ms)')
ylabel('Neuron index')
title('N = 100 Neuron Population Spikes');


We have a new variable N which determines the number of neurons, we’ve added the statement G.v = 'rand()' before the run (which initializes each neuron with a random value drawn uniformly from [0, 1], so each neuron does something a bit different), and we’ve plotted the spikes of each neuron by index over time.

As well as the variable SpikeMonitor.t (which gives the times of all the spikes), we’ve also used the variable SpikeMonitor.i, which gives the corresponding neuron index for each spike. We plotted a single black dot with time on the x-axis and neuron index on the y-axis. This is the standard “raster plot” used in neuroscience.


To make these multiple neurons do something more interesting, let’s introduce per-neuron parameters that don’t have a differential equation attached to them.


# number of neurons, time scale, initial voltage maximum, and duration of simulation
N = 100
tau = 10*ms
v0_max = 3.
duration = 1000*ms

# differential equations for neurons
eqs = '''
dv/dt = (v0-v)/tau : 1 (unless refractory)
v0 : 1
'''

# creating neuron group of size N with corresponding equations,
# threshold and reset at 1, 0, and 5ms refractory
G = NeuronGroup(N, eqs, threshold='v > 1', reset='v=0', refractory=5*ms, method='linear')
M = SpikeMonitor(G)

# initial voltage based on neuron index
G.v0 = 'i*v0_max/(N-1)'

run(duration)

# raster plot on the left, firing rate vs. v0 on the right
figure(figsize=(12, 4))
subplot(121)
plot(M.t/ms, M.i, '.k')
xlabel('Time (ms)')
ylabel('Neuron index')
subplot(122)
plot(G.v0, M.count/duration)
xlabel('v0')
ylabel('Firing rate (sp/s)');


The line v0 : 1 declares a new per-neuron parameter v0 with units 1 (dimensionless). The line G.v0 = 'i*v0_max/(N-1)' initializes the value of v0 for each neuron, varying from 0 up to v0_max. The symbol i, when it appears in equations like this, refers to the index of a neuron.

In this example, we’re driving the neuron towards the value v0 exponentially quickly, but we fire a spike whenever v crosses the threshold v > 1, at which point it resets to 0. The effect is that the rate at which each neuron fires spikes will be related to the value of v0. For v0 < 1 it will never fire a spike, and as v0 gets larger it will fire spikes at a higher rate. The right hand plot shows the firing rate as a function of the value of v0. This is called the “I-f curve” of this neuron model. Note that in the plot we’ve used the count variable of the SpikeMonitor. This is an array of the number of spikes each neuron in the group fired. Dividing this by the duration of the run gives the firing rate.
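For the deterministic version of this model, the I-f curve can also be derived by hand: integrating dv/dt = (v0 - v)/tau from v = 0 up to the threshold v = 1 gives a crossing time of tau*ln(v0 / (v0 - 1)) for v0 > 1, and each cycle additionally pays the 5 ms refractory period. A sketch of the predicted rate (plain Python, not brian2):

```python
import math

def predicted_rate(v0, tau=10e-3, t_refractory=5e-3):
    """Predicted firing rate in Hz for dv/dt = (v0 - v)/tau with threshold 1, reset 0."""
    if v0 <= 1.0:
        return 0.0  # v approaches v0 <= 1 and never crosses the threshold
    t_to_threshold = tau * math.log(v0 / (v0 - 1.0))
    return 1.0 / (t_refractory + t_to_threshold)

# neurons with v0 at or below the threshold never fire; the rate grows with v0
rates = [predicted_rate(v0) for v0 in (0.5, 1.5, 3.0)]
```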

Stochastic Neurons

Often when making models of neurons, we include a random element to model the effect of various forms of neural noise. In brian2, we can do this by using the symbol xi in differential equations. Strictly speaking, this symbol is a stochastic differential, but you can think of it as a Gaussian random variable with mean 0 and standard deviation 1. We do have to take into account the way stochastic differentials scale with time, which is why we multiply it by tau ** -0.5 in the equations below.

start_scope()

# number of neurons, time scale, max starting voltage, duration of simulation, and noise amplitude sigma
N = 100
tau = 10*ms
v0_max = 3.
duration = 1000*ms
sigma = 0.2

# differential equations (now with a stochastic differential for modeling neural noise)
eqs = '''
dv/dt = (v0-v)/tau+sigma*xi*tau**-0.5 : 1 (unless refractory)
v0 : 1
'''

# creating group of neurons to simulate (Euler's method is required for stochastic differential equations)
G = NeuronGroup(N, eqs, threshold='v>1', reset='v=0', refractory=5*ms, method='euler')
# creating spike monitor object on G
M = SpikeMonitor(G)

# setting v0 for each neuron based on its index
G.v0 = 'i*v0_max/(N-1)'

run(duration)

# raster plot on the left, firing rate vs. v0 on the right
figure(figsize=(12, 4))
subplot(121)
plot(M.t/ms, M.i, '.k')
xlabel('Time (ms)')
ylabel('Neuron index')
subplot(122)
plot(G.v0, M.count/duration)
xlabel('v0')
ylabel('Firing rate (sp/s)');


This simulation ends up giving the same figure as the last simulation, but with some noise added. Note how the firing rate vs. v0 curve has changed shape: instead of a sharp jump from firing at rate 0 to firing at a positive rate, it now increases in a sigmoidal (S-shaped) fashion. This is because no matter how small the driving force, the randomness may cause a neuron to fire a spike.

End of Part I

This is the end of the “Neurons” part of the tutorial. The cell below contains an additional example; let’s inspect it to find out what it accomplishes and why.


# number of neurons, time scale, etc...
N = 1000
tau = 10*ms
vr = -70*mV
vt0 = -50*mV
delta_vt0 = 5*mV
tau_t = 100*ms
sigma = 0.5*(vt0-vr)
v_drive = 2*(vt0-vr)
duration = 100*ms

# differential equations: membrane potential v and an adaptive threshold vt
eqs = '''
dv/dt = (v_drive+vr-v)/tau + sigma*xi*tau**-0.5 : volt
dvt/dt = (vt0-vt)/tau_t : volt
'''

# reset statements: reset the potential and raise the threshold
reset = '''
v = vr
vt += delta_vt0
'''

# creating group of 1000 neurons and spike monitor object
G = NeuronGroup(N, eqs, threshold='v > vt', reset=reset, refractory=5*ms, method='euler')
spikemon = SpikeMonitor(G)

# setting initial variable states
G.v = 'rand()*(vt0-vr)+vr'
G.vt = vt0

run(duration)

_ = hist(spikemon.t/ms, 100, histtype='stepfilled', facecolor='k', weights=ones(len(spikemon))/(N*defaultclock.dt))
xlabel('Time (ms)')
ylabel('Instantaneous firing rate (sp/s)');


Part II: Synapses

The Simplest Synapse

Once we have the neurons, the next step is to connect them up with synapses. We’ll start with the simplest possible type of synapse, which causes an instantaneous change in the postsynaptic neuron’s membrane potential whenever a presynaptic spike arrives.


# differential equations
eqs = '''
dv/dt = (I-v)/tau : 1
I : 1
tau : second
'''

# setting up neuron group, current, and time scales
G = NeuronGroup(2, eqs, threshold='v > 1', reset='v=0', method='linear')
G.I = [2, 0]
G.tau = [10, 100] * ms

# set up and connect a single synapse from first to second neuron
S = Synapses(G, G, on_pre='v_post += 0.2')
S.connect(i=0, j=1)

# set up state monitor object
M = StateMonitor(G, 'v', record=True)

run(100*ms)  # (duration assumed)

# plotting results of simulation
plot(M.t/ms, M.v[0], '-b', label='Neuron 0')
plot(M.t/ms, M.v[1], '-g', lw=2, label='Neuron 1')
xlabel('Time (ms)')
legend();
title('Single Excitatory Synapse');


There are a few things going on here. Let’s recap what’s going on with the NeuronGroup first. We’ve created two neurons, each of which has the same differential equation but different values for parameters I and tau. Neuron 0 has I = 2 and tau = 10 * ms, which means that it’s driven to repeatedly spike at a fairly high rate. Neuron 1 has I = 0 and tau = 100 * ms, which means that on its own (without the synapse) it won’t spike at all (the driving current I is 0). We can see this by commenting out the lines that define the synapse.


# differential equations
eqs = '''
dv/dt = (I-v)/tau : 1
I : 1
tau : second

# setting up neuron group, current, and time scales
G = NeuronGroup(2, eqs, threshold='v > 1', reset='v=0', method='linear')
G.I = [2, 0]
G.tau = [10, 100] * ms

# set up and connect a single synapse from first to second neuron
# S = Synapses(G, G, on_pre='v_post += 0.2')
# S.connect(i=0, j=1)

# set up state monitor object
M = StateMonitor(G, 'v', record=True)


# plotting results of simulation
plot(M.t/ms, M.v[0], '-b', label='Neuron 0')
plot(M.t/ms, M.v[1], '-g', lw=2, label='Neuron 1')
xlabel('Time (ms)')
legend()
title('Single Excitatory Synapse');


Next, we define the synapses: Synapses(source, targets, ...) means that we are defining a synaptic model that goes from source to target. In this case, the source and target are both the same, the group G. The syntax on_pre='v_post += 0.2' means that when a spike occurs in the presynaptic neuron (hence on_pre), it causes an instantaneous change to happen v_post += 0.2. The _post means that the value of v referred to is the post-synaptic value, and it is to be increased by 0.2. In short, this model says that whenever two neurons in G are connected by a synapse, when the source neuron fires a spike, the target neuron will have its value of v increased by 0.2.

At this point we have only defined the synapse model, but we haven’t actually created any synapses. The next line (S.connect(i=0, j=1)) creates a synapse from neuron 0 to neuron 1.

Adding a Weight

In the previous section, we hardcoded the weight of the synapse to be the value 0.2, but we often want this to be different for different synapses. We do this by introducing synapse equations.


# define differential equations
eqs = '''
dv/dt = (I-v)/tau : 1
I : 1
tau : second
'''

# defining neuron group, currents, and time scales
G = NeuronGroup(3, eqs, threshold='v > 1', reset='v=0', method='linear')
G.I = [2, 0, 0]
G.tau = [10, 100, 100] * ms

# defining synapses and their connections and weights
S = Synapses(G, G, 'w : 1', on_pre='v_post += w')
S.connect(i=0, j=[1, 2])
S.w = 'j * 0.2'

# state monitor object
M = StateMonitor(G, 'v', record=True)

# run the simulation
run(50*ms)

plot(M.t/ms, M.v[0], '-b', label='Neuron 0')
plot(M.t/ms, M.v[1], '-g', lw=2, label='Neuron 1')
plot(M.t/ms, M.v[2], '-r', lw=2, label='Neuron 2')
xlabel('Time (ms)')
legend()
title('Two Synapses with Varying Weights');


This example behaves similarly to the previous example, but now there’s a synaptic weight variable w. The string w : 1 is an equation string, exactly the same as for neurons, which defines a single dimensionless parameter w. We changed the behavior on a spike to on_pre='v_post += w' so that each synapse can behave differently depending on the value of w. To see this, we’ve made a third neuron which behaves the same as the second neuron, and connected neuron 0 to both neurons 1 and 2. We’ve set the weights via S.w = 'j * 0.2'. When i and j occur in the context of synapses, i refers to the presynaptic (source) neuron index, and j to the postsynaptic (target) neuron index. So, this will give a synaptic connection from 0 to 1 with weight 0.2 = 0.2 * 1, and from 0 to 2 with weight 0.4 = 0.2 * 2.

Introducing a Delay

So far the synapses have transmitted action potentials instantaneously, but we can also make them act with a certain delay.


# setting up differential equations
eqs = '''
dv/dt = (I-v)/tau : 1
I : 1
tau : second
'''

# setting up neuron group, driving currents, and time scales
G = NeuronGroup(3, eqs, threshold='v > 1', reset='v=0', method='linear')
G.I = [2, 0, 0]
G.tau = [10, 100, 100] * ms

# setting up synapses, connections, weights, and delay
S = Synapses(G, G, 'w : 1', on_pre='v_post += w')
S.connect(i=0, j=[1, 2])
S.w = 'j * 0.2'
S.delay = 'j * 2 * ms'

# state monitor object
M = StateMonitor(G, 'v', record=True)

# run the simulation
run(50*ms)

plot(M.t/ms, M.v[0], '-b', label='Neuron 0')
plot(M.t/ms, M.v[1], '-g', lw=2, label='Neuron 1')
plot(M.t/ms, M.v[2], '-r', lw=2, label='Neuron 2')
xlabel('Time (ms)')
legend()
title('Two Synapses with Varying Weights and Delays');


Adding delays is as simple as adding the line S.delay = 'j * 2 * ms', which gives the synapse from 0 to 1 a propagation delay of 1 * 2ms = 2ms, and the synapse from 0 to 2 a propagation delay of 2 * 2ms = 4ms.

More Complex Connectivity

So far, we’ve specified the connectivity explicitly, but for larger networks this isn’t usually possible. Instead, we specify a condition on the connectivity.


N = 10 # number of neurons
G = NeuronGroup(N, 'v : 1') # neuron group object
S = Synapses(G, G) # synapses from G to G
S.connect(condition='i != j', p=0.2) # connectivity pattern -> all to all others each with probability 0.2

We’ve created a dummy neuron group of N neurons and a dummy synapses model which doesn’t actually do anything, just to demonstrate the connectivity. The line S.connect(condition='i != j', p=0.2) will connect all pairs of neurons i to j with probability 0.2 as long as the condition i != j holds. How can we visualize this connectivity? Here is a function that will let us do so:

def visualize_connectivity(S):
    Ns = len(S.source) # number of source neurons
    Nt = len( # number of target neurons
    figure(figsize=(10, 4))
    # left-hand plot: source and target neurons with lines for synapses
    subplot(121)
    plot(zeros(Ns), arange(Ns), 'ok', ms=10) # draw source neurons
    plot(ones(Nt), arange(Nt), 'ok', ms=10) # draw target neurons
    for i, j in zip(S.i, S.j):
        plot([0, 1], [i, j], '-k') # draw lines between connected pairs
    xticks([0, 1], ['Source', 'Target'])
    ylabel('Neuron index')
    xlim(-0.1, 1.1)
    ylim(-1, max(Ns, Nt))
    title('Connectivity: Source to Target')
    # right-hand plot: each dot is a synapse in the connectivity matrix
    subplot(122)
    plot(S.i, S.j, 'ok')
    xlim(-1, Ns)
    ylim(-1, Nt)
    xlabel('Source neuron index')
    ylabel('Target neuron index')
    title('Connectivity Matrix')



On the left-hand side, we see a vertical line of circles indicating source neurons on the left, a vertical line of circles indicating target neurons on the right, and a line between each pair of neurons connected by a synapse. On the right-hand side is another way of visualizing the same connectivity; here, each black dot is a synapse, with x value corresponding to the source neuron index and y value corresponding to the target neuron index.

Let’s see how these figures change as we change the probability of a connection:


N = 10 # number of neurons
G = NeuronGroup(N, 'v : 1') # neuron group object

# try several different connectivity probabilities
for p in [0.1, 0.25, 0.5, 1.0]:
    S = Synapses(G, G) # create synapse object for neuron group
    S.connect(condition='i!=j', p=p) # create arbitrary source -> target connections (i != j) with probability p
    visualize_connectivity(S) # visualize the connectivity pattern
    suptitle('p = ' + str(p))


Let’s see what another connectivity condition looks like. This will only connect neighbouring neurons.


N = 10 # number of neurons
G = NeuronGroup(N, 'v : 1') # neuron group object

S = Synapses(G, G) # synapse object for neuron group
S.connect(condition='abs(i - j) < 4 and i != j') # connects neighbors with 3 steps or less
visualize_connectivity(S) # take a look at the connectivity pattern


We can also use the generator syntax to create connections like this more efficiently. In small examples, this doesn’t really matter, but for large numbers of neurons it can be much more efficient to specify directly which neurons should be connected than to specify just a condition.


N = 10
G = NeuronGroup(N, 'v : 1')

S = Synapses(G, G)
S.connect(j='k for k in range(clip(i - 3, 0, N_post), clip(i + 4, 0, N_post)) if i!=k')


If each source neuron is connected to precisely one target neuron, there is a special syntax that is extremely efficient. For example, one-to-one connectivity looks like this:


N = 10 # number of neurons
G = NeuronGroup(N, 'v:1') # neuron group object

S = Synapses(G, G) # synapse object for neuron group
S.connect(j='i') # one to one connectivity
visualize_connectivity(S) # visualize it


We can also do things like specifying the values of weights with a string. Let’s see an example where we assign each neuron a spatial location, and have a distance-dependent connectivity function. We visualize the weight of a synapse by the size of the marker.


N = 30 # number of neurons
neuron_spacing = 50*umetre # physical spacing between the neurons
width = N / 4.0 * neuron_spacing # width of network

# neuron has one variable x (its position)
G = NeuronGroup(N, 'x : metre')
G.x = 'i * neuron_spacing'

# all synapses are connected (except for self-connections)
S = Synapses(G, G, 'w : 1')
S.connect(condition='i != j')

# weight varies with distance
S.w = 'exp(-(x_pre - x_post) ** 2 / (2 * width ** 2))'

scatter(G.x[S.i]/um, G.x[S.j]/um, S.w*20)
xlabel('Source neuron position (um)')
ylabel('Target neuron position (um)')
title('Weighted All-to-All Connectivity');


Reversing the distance weighting function:


N = 30 # number of neurons
neuron_spacing = 50*umetre # physical spacing between the neurons
width = N / 4.0 * neuron_spacing # width of network

# neuron has one variable x (its position)
G = NeuronGroup(N, 'x : metre')
G.x = 'i * neuron_spacing'

# all synapses are connected (except for self-connections)
S = Synapses(G, G, 'w : 1')
S.connect(condition='i != j')

# weight varies with distance
S.w = '1 - exp(-(x_pre - x_post) ** 2 / (2 * width ** 2))'

scatter(G.x[S.i]/um, G.x[S.j]/um, S.w*20)
xlabel('Source neuron position (um)')
ylabel('Target neuron position (um)')
title('Weighted All-to-All Connectivity');


Just for fun, let’s add a connection with weight exp(-(x_pre - x_post) ** 2 / (2 * width ** 2)) with probability exp(-(x_pre - x_post) ** 2 / (2 * width ** 2)), for all neurons i and j where the condition i != j holds:


N = 30 # number of neurons
neuron_spacing = 50*umetre # physical spacing between the neurons
width = N / 4.0 * neuron_spacing # width of network

# neuron has one variable x (its position)
G = NeuronGroup(N, 'x : metre')
G.x = 'i * neuron_spacing'

# all synapses are connected (except for self-connections)
S = Synapses(G, G, 'w : 1')
S.connect(condition='i != j', p='exp(-(x_pre - x_post) ** 2 / (2 * width ** 2))')

# weight varies with distance
S.w = 'exp(-(x_pre - x_post) ** 2 / (2 * width ** 2))'

scatter(G.x[S.i]/um, G.x[S.j]/um, S.w*20)
xlabel('Source neuron position (um)')
ylabel('Target neuron position (um)')
title('Weighted All-to-All Connectivity');


More Complex Synapse Models: STDP

brian2’s synapse framework is very general and can do things like short-term plasticity (STP) or spike-timing-dependent plasticity (STDP). Let’s see how it works for STDP.

STDP is normally defined by an equation something like this:

\Delta w = \sum_{t_{\text{pre}}} \sum_{t_{\text{post}}} W(t_{\text{pre}} - t_{\text{post}}).

That is, the change in synaptic weight w is the sum over all presynaptic spike times t_{\text{pre}} and postsynaptic spike times t_{\text{post}} of some function W of the difference in these spike times. A commonly used function W is:

W(\Delta t) = \begin{cases} A_{\text{pre}} e^{\Delta t/\tau_{\text{pre}}} & \Delta t < 0 \\ A_{\text{post}} e^{-\Delta t/\tau_{\text{post}}} & \Delta t > 0 \end{cases}

This function looks like this:

tau_pre = tau_post = 20 * ms # time scales
A_pre = 0.01 # presynaptic trace plasticity scaling constant
A_post = -A_pre * 1.05 # postsynaptic trace plasticity scaling constant
delta_t = linspace(-50, 50, 100) * ms # 100 evenly spaced points in [-50, 50]
W = where(delta_t < 0, A_pre*exp(delta_t/tau_pre), A_post*exp(-delta_t/tau_post)) # defines STDP learning window

# plotting
plot(delta_t/ms, W)
xlabel(r'$\Delta t$ (ms)')
ylim(A_post, -A_post)
axhline(0, ls='-', c='k')
title('STDP Learning Window');


Simulating STDP directly using this equation would be very inefficient because we would have to sum over all pairs of spikes, which is physiologically unrealistic since the neuron can’t remember all of its previous spike times. It turns out there is a more efficient and physiologically plausible way to get the same effect.

We define two new variables a_{\text{pre}} and a_{\text{post}} which correspond to “traces” of pre- and postsynaptic activity, given by the following differential equations:

\tau_{\text{pre}} \frac{d}{dt} a_{\text{pre}} = -a_{\text{pre}}

\tau_{\text{post}} \frac{d}{dt} a_{\text{post}} = -a_{\text{post}}

When a presynaptic spike occurs, the presynaptic trace is updated and the weight is modified according to the rule:

a_{\text{pre}} \rightarrow a_{\text{pre}} + A_{\text{pre}}

w \rightarrow w + a_{\text{post}}

When a postsynaptic spike occurs:

a_{\text{post}} \rightarrow a_{\text{post}} + A_{\text{post}}

w \rightarrow w + a_{\text{pre}}

To see that this formulation is equivalent, you simply have to check that the equations sum linearly, and consider the following two cases: what happens if the presynaptic spike occurs before the postsynaptic spike, and vice versa. Try drawing a picture of it.
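
To make the equivalence concrete, here is a small standalone sketch (my own, not from the tutorial) that compares the trace-based update against the pair-based window W for a single pre/post spike pair:

```python
import math

tau_pre = tau_post = 20e-3   # trace time constants, in seconds
A_pre, A_post = 0.01, -0.0105

def trace_delta_w(t_pre, t_post):
    """Weight change from the trace-based rule for one pre and one post spike."""
    if t_pre < t_post:
        # pre fires first: apre jumps to A_pre, decays until the post spike,
        # and the post spike adds the decayed apre to w
        return A_pre * math.exp(-(t_post - t_pre) / tau_pre)
    else:
        # post fires first: apost jumps to A_post, decays until the pre spike,
        # and the pre spike adds the decayed apost to w
        return A_post * math.exp(-(t_pre - t_post) / tau_post)

def window_delta_w(t_pre, t_post):
    """Weight change from the pair-based window W(t_pre - t_post)."""
    dt = t_pre - t_post
    if dt < 0:
        return A_pre * math.exp(dt / tau_pre)
    return A_post * math.exp(-dt / tau_post)

# the two formulations agree for any spike-time pair
for t_pre, t_post in [(0.010, 0.020), (0.020, 0.010), (0.0, 0.050)]:
    assert abs(trace_delta_w(t_pre, t_post) - window_delta_w(t_pre, t_post)) < 1e-12
```

Because the trace equations are linear, the same argument extends to any number of spikes: each pre/post pair contributes its own exponentially weighted term.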

Now that we have a formulation which relies only on differential equations and spike events, we can turn that into brian2 code.


taupre = taupost = 20 * ms # time scales
wmax = 0.01 # max weight magnitude
Apre = 0.01 # presynaptic plasticity scaling constant
Apost = -Apre * taupre / taupost * 1.05 # postsynaptic scaling constant

# neuron group object
G = NeuronGroup(1, 'v : 1', threshold='v > 1')

# equations for weights, trace decay
synapse_eqs = '''
w : 1
dapre/dt = -apre / taupre : 1 (event-driven)
dapost/dt = -apost / taupost : 1 (event-driven)
'''

# equations for presynaptic spike
on_pre = '''
v_post += w
apre += Apre
w = clip(w + apost, 0, wmax)
'''

# equations for postsynaptic spike
on_post = '''
apost += Apost
w = clip(w + apre, 0, wmax)
'''

# creating synapses object for neuron group
S = Synapses(G, G, synapse_eqs, on_pre=on_pre, on_post=on_post)

First, when defining the synapses, we’ve given a more complicated multi-line string defining three synaptic variables (w, apre, and apost). We’ve also got a new bit of syntax, (event-driven), after the definitions of apre and apost. This means that although these two variables evolve continuously over time, brian2 should only update them at the time of an event (a spike). This is because we don’t need the values of apre and apost except at spike times, and it’s more efficient to only update them as needed.

Next, we have an on_pre= argument. The first line is v_post += w; this is the line that actually applies the synaptic weight to the target neuron membrane potential. The second line is apre += Apre which encodes the rule above. In the third line, we’re also encoding the rule above, but we’ve added one extra feature: we’ve clamped the synaptic weights between a minimum of 0 and a maximum of wmax so that the weights can’t get too large or negative. The function clip(x, low, high) does this.

Lastly, we have an on_post= argument. This gives the statements to calculate when a postsynaptic neuron fires. Note that we don’t modify v in this case, only the synaptic variables.

Let’s see how all the variables behave when a presynaptic spike arrives some time before a postsynaptic spike.


taupre = taupost = 20 * ms # time scales
wmax = 0.01 # max weight magnitude
Apre = 0.01 # presynaptic plasticity scaling constant
Apost = -Apre * taupre / taupost * 1.05 # postsynaptic scaling constant

# neuron group object
G = NeuronGroup(2, 'v:1', threshold='t > (1 + i) * 10 * ms', refractory=100*ms, method='linear')

# equations for weights, trace decay
synapse_eqs = '''
w : 1
dapre/dt = -apre / taupre : 1 (clock-driven)
dapost/dt = -apost / taupost : 1 (clock-driven)
'''

# equations for presynaptic spike
on_pre = '''
v_post += w
apre += Apre
w = clip(w + apost, 0, wmax)
'''

# equations for postsynaptic spike
on_post = '''
apost += Apost
w = clip(w + apre, 0, wmax)
'''

# creating synapses object for neuron group
S = Synapses(G, G, synapse_eqs, on_pre=on_pre, on_post=on_post, method='linear')

# connecting the first neuron to the second
S.connect(i=0, j=1)

# state monitor object
M = StateMonitor(S, ['w', 'apre', 'apost'], record=True)

# run the simulation
run(30*ms)

# plotting
figure(figsize=(4, 8))
plot(M.t/ms, M.apre[0], label='apre')
plot(M.t/ms, M.apost[0], label='apost')
plot(M.t/ms, M.w[0], label='w')
legend()
xlabel('Time (ms)');


We’ve used a trick to make neuron 0 fire a spike at time 10ms, and neuron 1 at time 20ms. Can you see how that works? The threshold t > (1 + i) * 10 * ms is first crossed at t = 10ms for neuron 0 (i = 0) and at t = 20ms for neuron 1 (i = 1), and the long refractory period ensures each neuron fires only once.

Secondly, we’ve replaced (event-driven) with (clock-driven) so you can see how apre and apost evolve over time. If we revert this change, the plots of apre and apost only show the values jumping from their initial settings to their values at the spike events, and staying there between events.

Finally, let’s verify that this formulation is equivalent to the original one.


# pre- and postsynaptic time scales, scaling constants, maximum time, and number of neurons
taupre = taupost = 20 * ms
Apre = 0.01
Apost = -Apre * taupre / taupost * 1.05
tmax = 50 * ms
N = 100

# Presynaptic neurons G spike at times from 0 to tmax
# Postsynaptic neurons H spike at times from tmax to 0
# So difference in spike times will vary from -tmax to +tmax
G = NeuronGroup(N, 'tspike : second', threshold='t > tspike', refractory=100 * ms)
H = NeuronGroup(N, 'tspike : second', threshold='t > tspike', refractory=100 * ms)
G.tspike = 'i * tmax / (N - 1)'
H.tspike = '(N - 1 - i) * tmax / (N - 1)'

# synapse equations, pre- and postsynaptic trace and weight updates
synapse_eqs = '''
w : 1
dapre/dt = -apre/taupre : 1 (event-driven)
dapost/dt = -apost/taupost : 1 (event-driven)
'''
on_pre = '''
apre += Apre
w = w + apost
'''
on_post = '''
apost += Apost
w = w + apre
'''

# synapses object from neuron group G to H
S = Synapses(G, H, synapse_eqs, on_pre=on_pre, on_post=on_post)
# one-to-one connectivity
S.connect(j='i')

# run the simulation
run(tmax + 1 * ms)

# plotting
plot((H.tspike-G.tspike)/ms, S.w)
xlabel(r'$\Delta t$ (ms)')
ylabel(r'$\Delta w$')
ylim(Apost, -Apost)
axhline(0, ls='-', c='k')
title('Pre- and Postsynaptic Spike Difference vs. Change in Synaptic Weight');


Can you see how this works? As the firing times of the neurons in G range from t = 0 to tmax, the firing times in H range in the opposite direction. At first, \Delta t = t_{\text{post}} - t_{\text{pre}} is very negative (the postsynaptic spike long precedes the presynaptic one) and gives a small negative weight change, which grows exponentially in magnitude until \Delta t = 0, at which point the sign of the weight change flips (along with the sign of \Delta t); as we move further right, the weight change is positive but exponentially decreasing in magnitude.

Word2vec: A Computationally Efficient Model for Learning Word Embeddings

word2vec: Tensorflow Tutorial

In which I walk through a tutorial on building the word2vec model of Mikolov et al. I’ll document the justification for the code I write each step of the way. Much of the notes, images, and code is copied or slightly altered from the tutorial.


from __future__ import division, absolute_import, print_function

from six.moves import urllib, xrange

import tensorflow as tf
import numpy as np
import collections, math, os, random, zipfile


This Tensorflow tutorial is meant to highlight the interesting / substantive parts of building the word2vec model with Tensorflow.

  • We motivate why we would want to represent words as vectors.
  • We look at the intuition behind the model and how it’s trained, with mathematics as needed.
  • We show a simple implementation of the model in TensorFlow.
  • We look at ways to make the naive version of the model scale better.

One can dive into the original, basic code, or the more complex version, but walking through the tutorial will help us motivate the implementation of the model in those scripts.

Let’s look at why we would want to learn word embeddings in the first place.

Motivation: Why Learn Word Embeddings?

Image and audio processing machine learning systems work with rich, high-dimensional datasets encoded as vectors (of the individual raw pixels for image data, or as power spectral density coefficients for audio data). For tasks like object or speech recognition, we know that all the information required to successfully perform the task is encoded in the data (since we humans can perform the same tasks using the raw data). On the other hand, natural language processing (NLP) systems usually treat words as discrete, atomic symbols (e.g., “cat” and “dog” may be represented by Id438 and Id327, respectively).

These encodings are arbitrary and provide no useful information to the machine learning system regarding the relationships that might exist between the individual symbols. This means a machine learning system cannot leverage what it knows about “cats” when processing data having to do with “dogs” (such as: they are both animals, four-legged, cute, …). Representing words as discrete symbols also leads to data sparsity (the “one-hot” vector representation), and usually means that we need more data in order to successfully train statistical models. Using word vector representations can overcome some of these problems.

audio / image / text representations

Vector space models (VSMs) represent / embed words in a continuous vector space in which semantically similar words are mapped to nearby points; i.e., they are embedded nearby to each other. VSMs have a long, rich history in NLP, but all methods depend in some way on the Distributional Hypothesis, which states that words which appear in the same contexts share semantic meaning. The approaches which leverage this principle can be divided into two categories: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., neural probabilistic language models).

In a nutshell, count-based methods compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count statistics down to a small, dense vector for each word. Predictive models try to predict a word from its neighbors in terms of small, dense embedding vectors (considered parameters of the model).
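
As a toy illustration of the count-based side (my own sketch, not from the tutorial), we can tally word/context co-occurrence counts over a tiny corpus; count-based methods then compress the rows of this count matrix into small, dense vectors:

```python
from collections import Counter

corpus = 'the quick brown fox jumped over the lazy dog'.split()
window = 1  # context: words within one position of the target

# count how often each (word, context) pair co-occurs within the window
cooccur = Counter()
for idx, word in enumerate(corpus):
    for ctx in range(max(0, idx - window), min(len(corpus), idx + window + 1)):
        if ctx != idx:
            cooccur[(word, corpus[ctx])] += 1

print(cooccur[('quick', 'the')])  # prints 1: 'the' occurs once next to 'quick'
```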

word2vec is a particularly computationally efficient predictive model for learning word embeddings from raw text. It comes in two flavors: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model (Sections 3.1 and 3.2 in Mikolov et al.). These models are algorithmically similar, but CBOW predicts target words from source context words (e.g., ‘mat’ from ‘the cat sits on the’), whereas the Skip-Gram model does the inverse and predicts source context words from the target words. The inversion might seem like an arbitrary choice, but statistically, it has the effect that CBOW smooths over much of the distributional information (by treating an entire context as a single observation). This typically turns out to be useful for smaller datasets. Skip-Gram, however, treats each context-target pair as a new observation, and this tends to perform better when we have larger datasets. We will focus on the Skip-Gram model for this tutorial.

Scaling up with Noise-Contrastive Training

Neural probabilistic language models (NPLMs) are traditionally trained using the maximum likelihood (ML) principle to maximize the probability of the next word w_t (t for “target”) given the previous words h (for “history”) in terms of a softmax function:

P(w_t \mid h) = \text{softmax}(\text{score}(w_t, h))
= \frac{\exp(\text{score}(w_t, h))}{\sum_{w' \text{ in vocab}} \exp(\text{score}(w', h))}


where \text{score}(w_t, h) computes the compatibility of the word w_t with the context h (a dot product is typically used). We can train this model by maximizing its log-likelihood on the training dataset:

J_{ML} = \log P(w_t \mid h)
= \text{score}(w_t, h) - \log\left(\sum_{w' \text{ in vocab}} \exp(\text{score}(w', h))\right)


This gives a properly normalized probabilistic language model. However, this method is very computationally expensive, since we have to compute and normalize each probability using the scores for all other words w' in the vocabulary V, in the current context h, at every training step.

Softmax NPLM
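
To see where that cost comes from, here is a minimal numpy sketch (toy sizes and variable names of my own choosing) of the softmax log-probability; note that the normalizing sum touches all V output vectors on every training step:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10000, 128                       # toy vocabulary size and embedding dimension
output_embeddings = rng.normal(scale=0.01, size=(V, d))
h = rng.normal(size=d)                  # representation of the context / history

def log_prob(target_id):
    # score(w, h) as a dot product, computed for ALL V words just to normalize
    scores = output_embeddings @ h
    return scores[target_id] - np.log(np.sum(np.exp(scores)))

# a properly normalized model: every log-probability is negative
assert log_prob(42) < 0.0
```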

On the other hand, for feature learning in word2vec, we don’t need a full probabilistic model. The CBOW and Skip-Gram models are instead trained using a binary classification objective (logistic regression) to discriminate the real target words w_t from k imaginary / noise words \tilde{w} in the same context. This is illustrated below for a CBOW model; for Skip-Gram, the illustration is simply inverted:


Mathematically speaking, the objective (for each example) to be maximized is:

J_{NEG} = \log Q_\theta(D = 1 \mid w_t, h) + k \, \mathbb{E}_{\tilde{w} \sim P_{\text{noise}}}[\log Q_\theta(D = 0 \mid \tilde{w}, h)],


where Q_\theta(D = 1 \text{ } | \text{ } w_t, h) is the binary logistic regression probability under the model of seeing the word w_t in the dataset D, calculated in terms of the learned embedding vectors \theta. In practice, we approximate the expectation by drawing k contrastive words from the noise distribution; i.e., we compute a Monte Carlo average.

The objective is maximized when the model assigns high probabilities to the real words and low probabilities to the noise words. This is called negative sampling, and there is good mathematical motivation for using this loss function: the updates proposed by it approximate the updates of the softmax function in the limit. It is computationally appealing since computing the loss function now only scales with the number of noise words we select, k, and not all words in the vocabulary V. This makes the model much faster to train. We will actually make use of a very similar noise-contrastive estimation loss, for which Tensorflow has the helper function tf.nn.nce_loss().
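
The shape of this objective is easy to sketch in numpy (a toy illustration with made-up scores; not TensorFlow’s actual tf.nn.nce_loss implementation): the k noise terms stand in for the expectation via Monte Carlo sampling, and only 1 + k scores enter the loss:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(score_true, scores_noise):
    """Per-example J_NEG: log sigma(s_true) + sum over k of log(1 - sigma(s_noise))."""
    return np.log(sigmoid(score_true)) + np.sum(np.log(1.0 - sigmoid(scores_noise)))

# k = 3 noise words: only 1 + k scores are needed, versus V for the full softmax
J = neg_sampling_objective(2.0, np.array([-1.0, -0.5, 0.3]))

# raising the true word's score increases the objective, as desired
assert neg_sampling_objective(3.0, np.array([-1.0, -0.5, 0.3])) > J
```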

The Skip-Gram Model

Let’s consider the dataset the quick brown fox jumped over the lazy dog. We will first form a dataset of words and the contexts in which they appear. We can define “context” in any way that makes sense, and in fact people have looked at syntactic contexts, words-to-the-left of the target, words-to-the-right of the target, and more. We stick to the standard definition and define “context” as the window of words to the left and to the right of a target word. Using a window size of one, we now have the dataset: ([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ... of (context, target) pairs.

Recall that the Skip-Gram model inverts contexts and targets, and tries to predict each context word from its target word. So, the task becomes: predict “the” and “brown” from “quick”, “quick” and “fox” from “brown”, and so on. Our dataset then becomes: (quick, the), (quick, brown), (brown, quick), (brown, fox), ... of (input, output) pairs. The objective function is defined over the whole dataset, but we typically optimize it with stochastic gradient descent (SGD) using one example (or a mini-batch of examples) at a time. Let’s look at a single step of this process.
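
Before looking at that step, the (input, output) pair construction above can be sketched directly (a standalone snippet, not the tutorial’s generate_batch function):

```python
corpus = 'the quick brown fox jumped over the lazy dog'.split()
skip_window = 1  # context: one word to the left and one to the right of the target

# skip-gram training pairs: predict each context word from its target word
pairs = []
for idx, target in enumerate(corpus):
    for ctx in range(max(0, idx - skip_window), min(len(corpus), idx + skip_window + 1)):
        if ctx != idx:
            pairs.append((target, corpus[ctx]))

print(pairs[:4])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
```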

We can imagine that, at time t, we observe the first training case above, where the goal is to predict the from quick. We select num_noise, the number of noisy / contrastive examples, by drawing samples from some noise distribution (typically the unigram distribution, P(w)). For simplicity, let’s say num_noise = 1, and we select sheep as our contrastive example. Next, we compute the loss for this pair of observed and noisy examples; the objective at time step t is thus:

J^{(t)}_\text{NEG} = \log Q_\theta(D = 1 \text{ } | \text{ the, quick}) + \log(Q_\theta(D = 0 \text{ } | \text{ sheep, quick})).


The idea is to update the embedding parameters \theta so that the objective function moves toward its maximum. We do this by deriving the gradient of the loss with respect to the embedding parameters \theta, i.e., \frac{\partial}{\partial\theta}J_{NEG}, easily done within TensorFlow. We perform an update to the embedding parameters by taking a small step in the direction of the gradient. When this process is repeated over the entire training dataset, it has the effect of “moving” the embedding vectors around for each word until the model is good at discriminating real words from noise words.
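
For intuition, one such gradient step can be sketched in numpy (toy vectors, num_noise = 1, and a hand-derived logistic-regression gradient; in the tutorial, TensorFlow derives the gradient automatically):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 8
theta_in = rng.normal(scale=0.1, size=d)     # embedding of the input word ('quick')
theta_true = rng.normal(scale=0.1, size=d)   # output vector of the true context word ('the')
theta_noise = rng.normal(scale=0.1, size=d)  # output vector of the noise word ('sheep')
lr = 0.1

def objective():
    # J_NEG for this single (true, noise) pair
    return np.log(sigmoid(theta_in @ theta_true)) + np.log(1 - sigmoid(theta_in @ theta_noise))

before = objective()
# gradient of J with respect to the input embedding (standard logistic-regression gradients)
grad = (1 - sigmoid(theta_in @ theta_true)) * theta_true \
       - sigmoid(theta_in @ theta_noise) * theta_noise
theta_in += lr * grad  # small step in the direction of the gradient (ascent)
assert objective() > before
```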

We can visualize the learned vector representations by projecting them down to two or three dimensions using, for example, the t-SNE dimensionality reduction technique. Upon inspection, these visualizations show us that the vectors capture some general and quite useful semantic information about words and their relationships to one another. It is interesting to discover that certain directions in the induced vector space specialize towards certain semantic relationships; e.g., male-female, verb tense, and even country-capital relationships between words, as illustrated below:

Semantic Relationships

This also explains why these vectors are useful as features for many standard NLP prediction tasks, such as part-of-speech tagging or named entity recognition.

Getting the Data

We’ll need to download the text data we want to work with, and then read the raw text data into a list of strings.

# download the data
url = ''

def download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

# download the file
file_ = download('', 31344016)

def read_data(filename):
    """Parse the file enclosed in the 'filename' zip file into a list of words."""
    # unzip the file
    with zipfile.ZipFile(filename) as f:
        # read the data into the 'data' variable
        data = tf.compat.as_str([0])).split()
    # return the data
    return data

words = read_data(file_)
print('data size:', len(words))

Found and verified
data size: 17005207

Building the Word Dictionary and Replacing Rare Words

Now that we’ve read in the raw text and converted it into a list of strings, we’ll need to build a dictionary mapping words to integer IDs so we can form the (input, output) pairs described above for the Skip-Gram model. We’ll also replace rare words with the token UNK, as is standard in this kind of NLP task.

# build the dictionary and replace rare words with the "UNK" token.
vocabulary_size = 50000

def build_dataset(words):
    # create counts list, set counts for "UNK" token to -1 (undefined)
    count = [['UNK', -1]]
    # add counts of the 49,999 most common tokens in 'words'
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # create the dictionary data structure
    dictionary = {}
    # give a unique integer ID to each token in the dictionary
    for word, _ in count:
        dictionary[word] = len(dictionary)
    # create a list data structure for the data
    data = []
    # keep track of the number of "UNK" token occurrences
    unk_count = 0
    # for each word in our list of words
    for word in words:
        # if it's in the dictionary, get its index
        if word in dictionary:
            index = dictionary[word]
        # otherwise, set the index equal to zero (index of "UNK") and increment the "UNK" count
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        # append its index to the 'data' list structure
        data.append(index)
    # set the count of "UNK" in the 'count' data structure
    count[0][1] = unk_count
    # invert the dictionary; it becomes (index, word) key-value pairs
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    # return the data (indices), counts, dictionary, and inverted dictionary
    return data, count, dictionary, reverse_dictionary

# build the dataset
data, count, dictionary, reverse_dictionary = build_dataset(words)
# free up some memory
del words
# print out stats
print('most common words (+UNK):', count[:10])
print('sample data:', data[:10], [reverse_dictionary[i] for i in data[:10]])

most common words (+UNK): [[‘UNK’, 418391], (‘the’, 1061396), (‘of’, 593677), (‘and’, 416629), (‘one’, 411764), (‘in’, 372201), (‘a’, 325873), (‘to’, 316376), (‘zero’, 264975), (‘nine’, 250430)]
sample data: [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156] [‘anarchism’, ‘originated’, ‘as’, ‘a’, ‘term’, ‘of’, ‘abuse’, ‘first’, ‘used’, ‘against’]

Generating Minibatches for Model Training

We define a function which allows us to generate mini-batches for training the skip-gram model.

data_index = 0

# generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    # make sure our parameters are self-consistent
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    # create empty batch ndarray using 'batch_size'
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    # create empty labels ndarray using 'batch_size'
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    # [ skip_window target skip_window ]
    span = 2 * skip_window + 1
    # create a buffer object for prepping batch data
    buffer = collections.deque(maxlen=span)
    # for each element in our calculated span, append the datum at 'data_index' and increment 'data_index' modulo the amount of data
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # loop for 'batch_size' // 'num_skips'
    for i in range(batch_size // num_skips):
         # target label at the center of the buffer
        target = skip_window
        targets_to_avoid = [skip_window]
        # loop for 'num_skips'
        for j in range(num_skips):
            # loop through all 'targets_to_avoid'
            while target in targets_to_avoid:
                # pick a random index as target
                target = random.randint(0, span - 1)
            # put it in 'targets_to_avoid'
            targets_to_avoid.append(target)
            # set the skip window in the minibatch data
            batch[i * num_skips + j] = buffer[skip_window]
            # set the target in the minibatch labels
            labels[i * num_skips + j, 0] = buffer[target]
        # add the data at the current 'data_index' to the buffer
        buffer.append(data[data_index])
        # increment 'data_index'
        data_index = (data_index + 1) % len(data)
    # return the minibatch data and corresponding labels
    return batch, labels

# get a minibatch
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)

# print out part of the minibatch to the console
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]], '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

3084 originated -> 12 as
3084 originated -> 5239 anarchism
12 as -> 6 a
12 as -> 3084 originated
6 a -> 12 as
6 a -> 195 term
195 term -> 6 a
195 term -> 2 of

Building the Computation Graph

We define some hyperparameters and build the computation graph.

# hyperparameters
batch_size = 128
embedding_size = 128 # dimension of the embedding vector
skip_window = 1 # how many words to consider to left and right
num_skips = 2 # how many times to reuse an input to generate a label

# we choose random validation dataset to sample nearest neighbors
# here, we limit the validation samples to the words that have a low
# numeric ID, which are also the most frequently occurring words
valid_size = 16 # size of random set of words to evaluate similarity on
valid_window = 100 # only pick development samples from the first 'valid_window' words
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64 # number of negative examples to sample

# create computation graph
graph = tf.Graph()

with graph.as_default():
    # input data
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # operations and variables
    # look up embeddings for inputs
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # construct the variables for the NCE loss
    nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # compute the average NCE loss for the batch.
    # tf.nce_loss automatically draws a new sample of the negative labels each time we evaluate the loss.
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                     labels=train_labels, inputs=embed, num_sampled=num_sampled, num_classes=vocabulary_size))

    # construct the SGD optimizer using a learning rate of 1.0
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # compute the cosine similarity between minibatch examples and all embeddings
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    # add variable initializer
    init = tf.initialize_all_variables()

Training the Model

# steps to train the model
num_steps = 100001

with tf.Session(graph=graph) as session:
    # we must initialize all variables before using them
    init.run()

    # loop through all training steps and keep track of loss
    average_loss = 0
    for step in xrange(num_steps):
        # generate a minibatch of training data
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # we perform a single update step by evaluating the optimizer operation (including it
        # in the list of returned values of
        _, loss_val =[optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        # print average loss every 2,000 steps
        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # the average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # computing cosine similarity (expensive!)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                # get a single validation sample
                valid_word = reverse_dictionary[valid_examples[i]]
                # number of nearest neighbors
                top_k = 8
                # computing nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "nearest to %s:" % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)

    final_embeddings = normalized_embeddings.eval()

Average loss at step 0 : 288.309387207
nearest to four: inappropriately, leninist, altruistic, unsupported, ect, taito, theist, adria,
nearest to one: nabataean, rt, rosa, jeb, jscript, thucydides, electronegativity, popularizer,
nearest to more: polluted, photographed, gec, hideous, reason, gk, cook, drying,
nearest to to: enthroned, homicides, burger, hobbit, marquise, monophyletic, nomination, stake,
nearest to time: lica, links, quicksort, meshech, warring, deadlines, fouling, lola,
nearest to can: revenge, kublai, alzheimer, teamed, inquiries, seen, madness, stencil,
nearest to UNK: holies, compensation, smoker, thatcher, modulo, sampling, aka, sway,
nearest to see: kellogg, mnemonic, stairway, stair, emilio, marcuse, rowers, progressions,
nearest to or: staffed, differs, void, dwellings, aniston, nightmare, vertical, alexandrine,
nearest to up: patched, ottoman, trivia, worshippers, virus, qin, wears, chakras,
nearest to were: reading, pornography, peripatetic, arukh, jus, spacetime, brooms, myspace,
nearest to would: ystem, redeem, informants, adp, moralistic, cru, octavio, unprofitable,
nearest to on: ella, peseta, vertebrate, electrocution, glyphs, matres, paste, aardvarks,
nearest to united: sitcoms, birthright, revisionists, maxwell, cy, intuition, inhibit, analects,
nearest to system: gazing, aino, sung, formations, neorealism, curtailed, evansville, selig,
nearest to be: medicines, ucl, danville, caesarion, astride, lethargy, vanderbilt, rudy,
Average loss at step 2000 : 113.381599131
Average loss at step 4000 : 52.6247417297
Average loss at step 6000 : 33.3770867083
Average loss at step 8000 : 23.6938661382
Average loss at step 10000 : 18.075570853

Average loss at step 92000 : 4.71373743808
Average loss at step 94000 : 4.62654836035
Average loss at step 96000 : 4.72683842146
Average loss at step 98000 : 4.6312282548
Average loss at step 100000 : 4.69500881946
nearest to four: five, six, seven, eight, three, two, nine, zero,
nearest to one: two, seven, four, six, three, five, eight, abet,
nearest to more: less, most, cook, photographed, yalta, polluted, downwards, akita,
nearest to to: will, peacocks, can, thibetanus, cegep, gollancz, would, accustomed,
nearest to time: lica, cebus, dasyprocta, hermann, teach, callithrix, cegep, jarry,
nearest to can: may, will, would, could, might, must, cannot, should,
nearest to UNK: agouti, callithrix, thibetanus, microcebus, michelob, cebus, thaler, wct,
nearest to see: akh, kellogg, microcebus, include, smoking, pdf, can, locals,
nearest to or: and, mitral, but, hbox, cegep, thaler, callithrix, thibetanus,
nearest to up: them, him, methodologies, annex, trivia, back, off, banished,
nearest to were: are, was, have, had, while, be, agouti, dasyprocta,
nearest to would: will, could, can, may, should, might, must, cannot,
nearest to on: in, at, cegep, upon, gollancz, hagbard, within, against,
nearest to united: michelob, hagbard, castrato, cellos, birthright, during, spanish, bang,
nearest to system: chs, systems, aino, evansville, thaler, campaigners, assistive, bos,
nearest to be: been, have, are, by, were, is, was, being,

Visualizing the Embeddings

Let’s plot the embedding vectors learned from the word2vec training using t-SNE.

def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    # set plot size in inches
    plt.figure(figsize=(18, 18))
    # loop through all labels
    for i, label in enumerate(labels):
        # get the embedding vectors
        x, y = low_dim_embs[i, :]
        # plot them in a scatterplot
        plt.scatter(x, y)
        # annotations
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')

    # save the figure out
    plt.savefig(filename)

try:
    # import t-SNE and matplotlib.pyplot
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    %matplotlib inline

    # create the t-SNE object with 2 components, PCA initialization, and 5000 iterations
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    # plot only so many words in the embedding
    plot_only = 500
    # fit the TSNE dimensionality reduction technique to the word vector embedding
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    # get the words associated with the points
    labels = [reverse_dictionary[i] for i in xrange(plot_only)]
    # call the plotting function
    plot_with_labels(low_dim_embs, labels)

except ImportError:
    print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")

TSNE Word Embeddings

Note: I used the t-SNE result from the TensorFlow tutorial since I had trouble getting mine into WordPress HTML.

As expected, words that are similar end up clustering near one another (his, her, its, theirs; could, should, would, must, will, may; have, had, having, being, been, be, become, became, is; …).

Analyzing Embeddings: Analogical Reasoning

Embeddings are useful for many prediction tasks in NLP. Short of training a full part-of-speech model or named-entity model, a simple way to evaluate embeddings is to use them directly to predict syntactic and semantic relationships such as king is to queen as father is to ?. This is called analogical reasoning, and the task was introduced by Mikolov et al. The dataset for this task is available here.

To see how to do the evaluation, consult the build_eval_graph() and eval() functions in tensorflow/models/embedding/

The choice of hyperparameters can strongly affect the accuracy on the task. To achieve state-of-the-art performance on this task requires training on a very large dataset, carefully tuning the hyperparameters, and making use of tricks like subsampling the data (out of scope of this tutorial).

Optimizing the Implementation

The above simple implementation showcases the flexibility of TensorFlow. For example, changing the training objective is as simple as switching out the call to tf.nn.nce_loss() for an off-the-shelf alternative such as tf.nn.sampled_softmax_loss(). If you have new ideas for a loss function, you can manually write an expression for the new objective in TensorFlow and let the optimizer compute the derivatives. This flexibility is very valuable in the exploratory phase of machine learning model development, in which we want to try out several different ideas and iterate on them quickly.

Once you have a model architecture you're happy with, it may be worth optimizing the implementation to run more efficiently and cover more data in less time. For example, the simple code above suffers from limited speed because we use Python to read data and feed it to the model – both of which require very little work on the TensorFlow back-end. If your model is seriously bottlenecked on input data, you may want to implement a custom data reading tool for your problem, as described here. In the case of the Skip-Gram model, this is already done in the more advanced TensorFlow implementation of word2vec.

If your model is no longer I/O bound but you still want better performance, you can write your own TensorFlow Ops. There is an example of this for the Skip-Gram model in TensorFlow here.


We covered the word2vec model, a computationally efficient model for learning word embeddings. We motivated the usefulness of word embeddings, discussed efficient training techniques, and gave a simple implementation of it in TensorFlow. The hope is that the tutorial showcased how TensorFlow is flexible enough for doing early experimentation easily, yet powerful enough to give you the control you need for bespoke optimized implementations.

Deep Learning Book: Linear Algebra

This chapter certainly isn’t the most glamorous, but understanding its content will be a crucial step in motivating the mathematical fundamentals of machine learning and deep learning models and algorithms.

Note: These notes are sometimes verbatim the content of the corresponding chapter in Deep Learning. I wish this weren’t the case, but there are only so many ways to consolidate or restate the mathematics.

Chapter 2: Linear Algebra

Linear algebra is a branch of mathematics which is often applied in various science and engineering disciplines. Since linear algebra is a form of continuous mathematics, computer scientists (typically more concerned with discrete mathematics for modeling computer programs) usually have little experience with it. Having an understanding of linear algebra is an important step towards working with many machine learning and (especially) deep learning algorithms. This chapter is meant to give a presentation of key aspects of linear algebra which relate to machine learning theory and practice.

The book encourages readers to skip the chapter if they are already familiar with linear algebra, but stresses reviewing key formulas, if needed, from The Matrix Cookbook. For those without any linear algebra knowledge, the authors recommend consulting other resources focused on teaching linear algebra from first principles, such as Linear Algebra. This chapter omits many topics in linear algebra that are inessential for understanding deep learning.

Chapter 2.1: Scalars, Vectors, Matrices and Tensors

There are several types of mathematical objects particular to linear algebra:

  • Scalars: A scalar is a single number (as opposed to the higher-dimensional objects that are typical of linear algebra). These are written in italics, and are usually given lowercase variable names. Upon introducing them, we specify what kind of number they are; e.g., we say “Let s \in \mathbb{R} be the slope of a line” to define a real-valued scalar.
  • Vectors: An array of numbers arranged in order, each identified by its index in that ordering. Typically, we give vectors lowercase variable names in bold, such as \textbf{x}. The elements of a vector are written in italics with a subscript; e.g., the first element of a vector \textbf{x} is x_1. If the elements of the vector belong to \mathbb{R}, and the vector has n elements, we say that \textbf{x} \in \mathbb{R}^n, where \mathbb{R}^n is the Cartesian product of \mathbb{R} n times. We can explicitly identify the elements of a vector by writing them in a column enclosed in square brackets. Intuitively, vectors identify a point in the space in which they are situated, where each element of the vector gives the coordinate along a unique axis. We can also index a set of elements of a vector by first defining the set of indices and writing the set as a subscript; e.g., S = \{2, 4, 7\}, and \textbf{x}_S = \{\textit{x}_2,\textit{x}_4,\textit{x}_7 \}.
  • Matrices: A matrix is a two-dimensional array of numbers, where each element of the array is indexed by two indices instead of one. We usually give matrices uppercase, boldface variable names like \mathbf{A}. If a matrix \mathbf{A} has real entries, m columns and n rows, we write \mathbf{A} \in \mathbb{R}^{n \times m}. We usually identify the entries of a matrix using its name in italics, and the entries are listed with a separating comma, e.g., A_{i,j} is the element of \mathbf{A} in the i-th row and j-th column of the matrix. Using “:”, we can identify all the numbers in a particular row or column of \mathbf{A}; for example, \mathbf{A}_{i,:} specifies the entire i-th row of \mathbf{A}, and \mathbf{A}_{:,j} specifies the entire j-th column of \mathbf{A}. To explicitly specify the elements of a matrix, we write them as an array enclosed by square brackets. We can also index matrix-valued expressions that are not just a single letter: f(\mathbf{A})_{i,j} gives element (i, j) of the matrix computed by applying the function f to the matrix \mathbf{A}.
  • Tensors: Sometimes, we need an array with more than two axes. Generally, an array of numbers arranged on a regular grid with a variable number of axes is called a tensor. We denote a tensor (named “A”) by the following typeface: A. We can identify an element of some tensor A by a comma-separated list of indices, just as in the matrix case (a two-dimensional tensor).

One important linear algebra operation is the transpose. The matrix transpose is the mirror image of the matrix across its main diagonal (from the upper left to bottom right entries). The transpose of a matrix \mathbf{A} is denoted as \mathbf{A}^\top, and is defined as:

(\mathbf{A}^\top)_{i,j} = \mathit{A}_{j,i}.

Vectors may be thought of as matrices which have only a single column. The transpose of a vector is then a matrix with only a single row. Sometimes we define a vector inline as a row matrix, then use the transpose operator to turn it into a standard column vector; e.g., [x_1, x_2, x_3]^\top.

A scalar may be thought of as a matrix with only a single entry; from this, it is easy to see that a scalar is its own transpose: a = a^\top.
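
A quick NumPy illustration of these transpose facts (NumPy is my choice here; the chapter itself is library-agnostic):

```python
import numpy as np

# The transpose mirrors a matrix across its main diagonal: (A^T)_{i,j} = A_{j,i}.
A = np.array([[1, 2, 3],
              [4, 5, 6]])     # shape (2, 3)
assert A.T.shape == (3, 2)
assert A.T[2, 0] == A[0, 2]   # element (i, j) of A^T equals element (j, i) of A

# A vector written inline as a row matrix becomes a column vector after transposing.
x = np.array([[1, 2, 3]])     # shape (1, 3)
print(x.T.shape)
```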

We can add matrices to each other (so long as they have the same shape) by adding their individual elements; e.g., \textbf{C} = \textbf{A} + \textbf{B}, where \textit{C}_{i,j} = \textit{A}_{i,j} + \textit{B}_{i,j}.

We can add a scalar to a matrix or multiply a matrix by a scalar by performing either operation on each of the elements of the matrix; e.g., \mathbf{D} = a \cdot \mathbf{B} + c, where \textit{D}_{i,j} = a \cdot \textit{B}_{i,j} + c.

In deep learning, we tend to use some less conventional notation. We allow the addition of a matrix and a vector, which yields another matrix: \mathbf{C} = \mathbf{A} + \mathbf{b}, where \textit{C}_{i,j} = \textit{A}_{i,j} + b_j. The vector \mathbf{b} is added to each row of the matrix. This “implicit copy” of \mathbf{b} to each of the rows of the matrix \mathbf{A} is known as broadcasting.
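
NumPy implements exactly this convention, which is a simple way to see broadcasting in action:

```python
import numpy as np

# Broadcasting: the vector b is implicitly copied to each row of A.
A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])   # shape (3, 2)
b = np.array([10., 20.])   # shape (2,)

C = A + b                  # C[i, j] = A[i, j] + b[j]
print(C)
```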

Chapter 2.2: Multiplying Matrices and Vectors

Multiplication of matrices is one of the most important operations in linear algebra. The matrix product of matrices \mathbf{A} and \mathbf{B} is a third matrix \mathbf{C}; for the product to be defined, \mathbf{A} must have the same number of columns as \mathbf{B} has rows. If \mathbf{A} has shape m \times n and \mathbf{B} has shape n \times p, then their matrix product \mathbf{C} has shape m \times p. We write the matrix product as:

\mathbf{C} = \mathbf{A}\mathbf{B}.

The product operation is given by:

C_{i,j} = \sum_k A_{i,k}B_{k,j}.

This is not just the matrix containing the product of the individual elements of the factor matrices. This operation does exist, however, and is known as the element-wise product or Hadamard product (denoted as \mathbf{A} \odot \mathbf{B}).

The dot product between two vectors \mathbf{x}, \mathbf{y} of the same size is the matrix product \mathbf{x}^\top \mathbf{y}. We can therefore think of the matrix product \mathbf{C} = \mathbf{A}\mathbf{B} as computing C_{i,j} as the dot product between row i of \mathbf{A} and column j of \mathbf{B}.
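
A small NumPy sketch contrasting the three products just defined (matrix product, Hadamard product, and dot product):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

C = A @ B   # matrix product: C[i,j] = sum_k A[i,k] * B[k,j]
H = A * B   # element-wise (Hadamard) product: H[i,j] = A[i,j] * B[i,j]

x = np.array([1., 2.])
y = np.array([3., 4.])
d = x @ y   # dot product between two vectors of the same size

print(C)
print(H)
print(d)
```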

The matrix product operation has many useful properties that make mathematical analysis of matrices convenient. Matrix multiplication is (1) distributive: \mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} +\mathbf{A}\mathbf{C}, (2) associative: \mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C}, and is not commutative in general (the condition \mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} does not always hold), unlike with scalar multiplication. The dot product between two vectors is commutative:

\mathbf{x}^\top\mathbf{y} = \mathbf{y}^\top\mathbf{x}.

The transpose of a matrix product is:

(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top\mathbf{A}^\top.

This allows us to prove vector dot product commutativity, using the fact that the value of such a product is a scalar and therefore equal to its own transpose:

\mathbf{x}^\top\mathbf{y} = (\mathbf{x}^\top\mathbf{y})^\top = \mathbf{y}^\top\mathbf{x}.

We can now write down a system of linear equations using the above developed linear algebra notation:

\mathbf{Ax} = \mathbf{b},

 where \mathbf{A} \in \mathbb{R}^{m \times n} is a known matrix, \mathbf{b} \in \mathbb{R}^m is a known vector, and \mathbf{x} \in \mathbb{R}^n is an unknown vector of variables we wish to solve for. We can think of each row of \mathbf{A} and each entry of \mathbf{b} as constraints on the values of the unknown vector \mathbf{x}. We can rewrite the system of linear equations explicitly as:

A_{1,1}x_1 + A_{1,2}x_2 + ... + A_{1,n}x_n = b_1

\vdots

A_{m,1}x_1 + A_{m,2}x_2 + ... + A_{m,n}x_n = b_m.

Chapter 2.3: Identity and Inverse Matrices

Matrix inversion is a powerful operation which allows us to analytically solve systems of linear equations for many values of the matrix \mathbf{A}.

First, we need to define the concept of an identity matrix: a matrix which does not alter any elements of any vector when we multiply that vector by that matrix. The matrix which preserves n-dimensional vectors is denoted \mathbf{I}_n, and is defined as a square matrix of shape n \times n such that:

\forall \mathbf{x} \in \mathbb{R}^n, \mathbf{I}_n \mathbf{x} = \mathbf{x}.

The identity matrix simply has zero entries everywhere except for on its main diagonal, where each entry has value 1.

The matrix inverse of \mathbf{A} is denoted as \mathbf{A}^{-1}, and is defined as:

\mathbf{A}^{-1} \mathbf{A} =\mathbf{I}_n.

We can now solve systems of linear equations as follows:

\mathbf{A} \mathbf{x} = \mathbf{b}

\mathbf{A}^{-1} \mathbf{A} \mathbf{x} = \mathbf{A}^{-1} \mathbf{b}

\mathbf{I}_n \mathbf{x} = \mathbf{A}^{-1} \mathbf{b}

\mathbf{x} = \mathbf{A}^{-1} \mathbf{b}.

This solution depends on the possibility of finding the matrix inverse \mathbf{A}^{-1}. When \mathbf{A}^{-1} does exist, there are several algorithms which can find it in closed form. In fact, the same inverse matrix can be used multiple times to solve the system of equations for different values of \mathbf{b}. However, \mathbf{A}^{-1} is primarily useful as a theoretical tool and is rarely used in practice in software.
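
NumPy makes the practical side of this caveat concrete: `np.linalg.solve` solves the system directly (e.g., via a factorization) rather than forming \mathbf{A}^{-1} explicitly, and is both faster and numerically more stable:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([3., 5.])

x_via_inverse = np.linalg.inv(A) @ b   # x = A^{-1} b, the textbook route
x = np.linalg.solve(A, b)              # preferred in practice

print(x)
assert np.allclose(x, x_via_inverse)   # both routes agree
assert np.allclose(A @ x, b)           # the solution satisfies the system
```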

Chapter 2.4: Linear Dependence and Span

In order for the matrix inverse to exist, the system of linear equations \mathbf{Ax} = \mathbf{b} must have exactly one solution for every possible value of \mathbf{b}; it’s possible that there may be zero or infinitely many solutions for some values of \mathbf{b}.

To analyze how many solutions the linear system of equations has, we can think of the columns of \mathbf{A} as specifying the different directions we may travel in from the origin, the point specified by the zero vector, and then determine how many ways there are of reaching \mathbf{b}. In this way, each element x_i of \mathbf{x} corresponds to how far to move in the direction of column i:

\mathbf{Ax} = \sum_i x_i \mathbf{A}_{:,i}.

This kind of operation is generally known as a linear combination: multiplying each vector (for some set of vectors \{\mathbf{v}^{(1)}, ..., \mathbf{v}^{(n)}\}) by a corresponding scalar coefficient and summing the results:

\sum_i c_i \mathbf{v}^{(i)}.

The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors. Therefore, \mathbf{Ax} = \mathbf{b} has a solution only if \mathbf{b} is in the span of the columns of \mathbf{A}. The span of the columns of the matrix \mathbf{A} is known as its column space or range.

So, in order for the system of equations to have a solution for all possible values of \mathbf{b} \in \mathbb{R}^m, we require that the column space of \mathbf{A} be all of \mathbb{R}^m. This requirement implies that \mathbf{A} must have at least m columns, i.e., n \geq m. This is only a necessary condition for every point to have a solution, not a sufficient one; it’s possible that some of the columns may be redundant.

This kind of redundancy is known as linear dependence. A set of vectors \{\mathbf{v}^{(1)}, ..., \mathbf{v}^{(n)}\} is linearly independent if no vector in the set is a linear combination of the other vectors in the set. Thus, for the column space of the matrix \mathbf{A} to encompass all of \mathbb{R}^m, the matrix must contain at least one set of m linearly independent columns.

For the matrix to have an inverse, we also need to make sure that the system of equations has at most one solution for each value of \mathbf{b}. To accomplish this, we need to ensure that the matrix has at most m columns; otherwise, there is more than one way of parametrizing the solution.

Altogether, this implies that the matrix must be square, i.e., m = n, and that the columns must be linearly independent (a square matrix with linearly dependent columns is called singular; otherwise it is called nonsingular). If \mathbf{A} is not square, or is square but singular, solving the equation is still possible, but we can’t use matrix inversion to find the solution.
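
A minimal NumPy check of what singularity looks like in practice (the matrix below is my own example, with its second column a multiple of the first):

```python
import numpy as np

# A square matrix with linearly dependent columns is singular and has no inverse.
A = np.array([[1., 2.],
              [2., 4.]])   # second column is 2x the first

print(np.linalg.det(A))    # determinant is zero (up to floating-point error)

try:
    np.linalg.inv(A)
except np.linalg.LinAlgError:
    print("A is singular; no inverse exists")
```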

Chapter 2.5: Norms

In machine learning, we usually measure the size of vectors using a function known as a norm. The L^p norm is defined as:

||\mathbf{x}||_p = (\sum_i |x_i|^p)^\frac{1}{p}

for p \in \mathbb{R}, p \geq 1.

Norms are functions which map vectors to non-negative real values. Intuitively, they measure the distance of any given vector from the origin. More rigorously, a norm is any function which satisfies:

  1. f(\mathbf{x}) = 0 \implies \mathbf{x} = \mathbf{0}
  2. f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}) (triangle inequality)
  3. \forall \alpha \in \mathbb{R}, f(\alpha \mathbf{x}) = |\alpha|f(\mathbf{x})

The L^2 norm, known also as the Euclidean norm, is the Euclidean distance from the origin to the point given by \mathbf{x}. It is common to measure the size of a vector using the squared L^2 norm, which is calculated as \mathbf{x}^\top \mathbf{x}. This squared norm is more convenient to work with, mathematically and computationally, since the derivative of the squared L^2 norm with respect to each element of the vector \mathbf{x} depends only on that element (whereas each derivative of the L^2 norm depends on the entire vector). Since the L^2 norm increases very slowly near the origin, for some machine learning applications (where it might be crucial to discriminate between very small, but nonzero, numbers and exact zeros) we turn to the L^1 norm, which grows at the same rate in all locations. This function is given by:

||x||_1 = \sum_i |x_i|.

Every time an entry of \mathbf{x} moves \epsilon away from the origin, the L^1 norm increases by \epsilon.

One could also measure the “size” of a vector by counting the number of nonzero elements in it. This is sometimes referred to as the “L^0” norm, but this is poor terminology, since this function does not satisfy property (3) listed above (scaling the vector by \alpha does not change the number of nonzero entries). Another commonly used norm in machine learning is the L^\infty, also known as the max norm. This is simply defined as the absolute value of the element in the vector with largest magnitude:

||\mathbf{x}||_\infty = \max_i |x_i|.

We may also want to measure the size of the matrix, which (in the context of deep learning) we can accomplish with the somewhat obscure Frobenius norm:

||A||_F = \sqrt{\sum_{i,j}A_{i,j}^2},

which is analogous to the Euclidean norm of a vector.

Chapter 2.6: Special Kinds of Matrices and Vectors

Diagonal matrices consist entirely of zeros, except along the main diagonal. A matrix \mathbf{D} is diagonal if and only if D_{i,j} = 0 for all i \neq j. The identity matrix defined above is one such example of a diagonal matrix. We can also write \text{diag}(\mathbf{v}) to denote a square diagonal matrix whose main diagonal is given by the elements of \mathbf{v}, in order from top left to bottom right. Diagonal matrices are interesting partly because multiplying vectors by them is computationally efficient: we simply scale each element x_i of the vector by the i-th entry along the matrix’s main diagonal. Inverting a diagonal matrix is also easy: the matrix inverse exists only if all the main diagonal entries are nonzero, and when this is true, the inverse is given by \text{diag}( \mathbf{v})^{-1} = \text{diag}(1/v_1, ..., 1/v_n). In ML, we often define an algorithm in terms of arbitrary matrices, but obtain a less expensive, but also less descriptive, algorithm by restricting the matrices to be diagonal.

Not all diagonal matrices need to be square; we can construct rectangular diagonal matrices. These do not have inverses, but we can still multiply with them cheaply. For a non-square diagonal matrix \mathbf{D}, the product \mathbf{Dx} will scale each element of \mathbf{x} and either (1) concatenate some zeros to the result (if \mathbf{D} is taller than it is wide), or (2) discard some of the last elements of the vector \mathbf{x} (if \mathbf{D} is wider than it is tall).
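
A short NumPy check of both cheap operations on square diagonal matrices (element-wise scaling and entry-wise inversion):

```python
import numpy as np

v = np.array([2., 3., 4.])
D = np.diag(v)                  # square diagonal matrix with v on the main diagonal
x = np.array([1., 10., 100.])

# Multiplying by a diagonal matrix just scales each element of x...
assert np.allclose(D @ x, v * x)

# ...and inverting it just inverts each diagonal entry.
assert np.allclose(np.linalg.inv(D), np.diag(1.0 / v))
```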

A symmetric matrix is any matrix which is equal to its own transpose:

\mathbf{A} = \mathbf{A}^\top.

An example of a symmetric matrix: a matrix of distance measurements \mathbf{A}, with A_{i,j} giving the distance from point i to point j. This is symmetric because distance functions are symmetric.

A unit vector is a vector with unit norm; i.e.,

||\mathbf{x}||_2 = 1.

Vectors \mathbf{x}, \mathbf{y} are orthogonal if \mathbf{x}^\top\mathbf{y} = 0. If both vectors have nonzero norm, then orthogonality means that they are at 90 degree angles to each other. If the vectors are both orthogonal and have unit norm, we call them orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal. This gives:

\mathbf{A}^\top\mathbf{A} = \mathbf{A}\mathbf{A}^\top = \mathbf{I}

\implies \mathbf{A}^{-1} = \mathbf{A}^\top,

so orthogonal matrices are useful because their inverse is cheap to compute (since the matrix transpose operation is cheap to compute).
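As a quick check of this property (a NumPy sketch of my own, not from the book), we can build an orthogonal matrix via QR decomposition and confirm its transpose is its inverse:

```python
import numpy as np

# build an orthogonal matrix from the QR decomposition of a random matrix
rng = np.random.RandomState(0)
Q, _ = np.linalg.qr(rng.randn(4, 4))

# Q^T Q = Q Q^T = I, so the inverse is just the transpose
I = np.eye(4)
assert np.allclose(Q.T.dot(Q), I)
assert np.allclose(Q.dot(Q.T), I)
assert np.allclose(np.linalg.inv(Q), Q.T)
```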

Chapter 2.7: Eigendecomposition

Often, we break down mathematical objects into constituent parts, or find properties of them that are universal, not caused by how we choose to represent them.

Similarly to how we can deduce something about the nature of an integer by decomposing it into its prime factors, we can also decompose matrices in ways that give us insight into their functional properties that is not obvious from the way we typically represent matrices (as an array of numerical elements). One of the most widely used kinds of matrix decomposition is known as eigendecomposition, in which a matrix is broken down into a set of eigenvalues and eigenvectors.

An eigenvector of a square matrix \mathbf{A} is defined as a nonzero vector \mathbf{v} such that multiplication of \mathbf{v} by \mathbf{A} simply scales \mathbf{v}:

\mathbf{Av} = \lambda\mathbf{v}.

The scalar \lambda is known as the eigenvalue corresponding to this particular eigenvector. Now, if \mathbf{v} is an eigenvector of \mathbf{A}, so is any nonzero scalar multiple of \mathbf{v}; for this reason, we typically select eigenvectors with unit norm.

Supposing that \mathbf{A} has n linearly independent eigenvectors and corresponding eigenvalues, the eigendecomposition of \mathbf{A} is given by:

\mathbf{A} = \mathbf{V}\text{diag}(\mathbf{\lambda})\mathbf{V}^{-1},

where the matrix \mathbf{V} is given by the column-wise concatenation of the n eigenvectors of \mathbf{A} (one eigenvector per column), and the vector \mathbf{\lambda} is the concatenation of the corresponding eigenvalues.

Often, we want to decompose matrices into their eigenvalues and eigenvectors. Doing so will help us analyze certain properties of the matrix. However, not all matrices can be decomposed in this way. In other cases, the decomposition exists, but involves complex-valued eigen-values and -vectors. In the context of machine learning, we usually only have to deal with a specific class of matrices which have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigen-values and -vectors:

\mathbf{A} = \mathbf{Q \Lambda Q^\top},

in which \mathbf{Q} is an orthogonal matrix composed of eigenvectors of \mathbf{A}, and \mathbf{\Lambda} is a diagonal matrix composed of the eigenvalues of \mathbf{A}. Since \mathbf{Q} is orthogonal, we can think of \mathbf{A} as scaling space by \lambda_i in direction \mathbf{v}^{(i)}.

The eigendecomposition of a real symmetric matrix must exist, but is not necessarily unique. If any two eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with the same eigenvalue, and we could pick a matrix \mathbf{Q} with these vectors instead. We usually sort the entries of \mathbf{\Lambda} in descending order; in this setting, the eigendecomposition is unique if and only if all the eigenvalues are unique.
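NumPy's `eigh` routine computes exactly this decomposition for real symmetric matrices; a small sanity check (my own illustration, not from the book):

```python
import numpy as np

rng = np.random.RandomState(1)
B = rng.randn(5, 5)
A = B + B.T  # symmetrize: A is real symmetric

# eigh returns real eigenvalues (ascending) and orthonormal eigenvectors
lam, Q = np.linalg.eigh(A)

# verify A = Q diag(lam) Q^T and that Q is orthogonal
assert np.allclose(Q.dot(np.diag(lam)).dot(Q.T), A)
assert np.allclose(Q.T.dot(Q), np.eye(5))
```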

A matrix is singular if and only if any of its eigenvalues are zero. The eigendecomposition of a matrix can be used to optimize quadratic expressions of the form f(\mathbf{x}) = \mathbf{x}^\top\mathbf{Ax} subject to ||\mathbf{x}||_2 = 1.

A matrix whose eigenvalues are all positive is called positive definite; if they are all greater than or equal to zero, it is called positive semidefinite. Negative definite and negative semidefinite matrices are defined analogously. Positive semidefinite matrices guarantee that \forall \mathbf{x}, \mathbf{x}^\top \mathbf{Ax} \geq 0, and positive definite matrices further ensure that \mathbf{x}^\top\mathbf{Ax} = 0 \implies \mathbf{x} = \mathbf{0}.

Chapter 2.8: Singular Value Decomposition

The method of Singular Value Decomposition (SVD) provides another way to factorize a matrix, but into singular vectors and singular values. SVD allows us to discover some of the same kind of information which we get from eigendecomposition, but is generally more applicable: every real matrix has a singular value decomposition, but not every real matrix has an eigendecomposition. For non-square matrices, eigendecomposition is undefined, but we may use SVD to analyze some of their properties.

The singular value decomposition of a matrix \mathbf{A} has a similar form to that of the eigendecomposition:

\mathbf{A} = \mathbf{UDV}^\top.

Suppose \mathbf{A} \in \mathbb{R}^{m \times n}. Then, \mathbf{U} is defined as an m \times m matrix, \mathbf{D} an m \times n matrix, and \mathbf{V} an n \times n matrix. Each of these matrices has a certain special structure; namely, \mathbf{U} and \mathbf{V} are both defined to be orthogonal matrices, and \mathbf{D} is defined as a diagonal matrix (not necessarily square).

The elements along the main diagonal of \mathbf{D} are the singular values of the matrix \mathbf{A}, the columns of \mathbf{U} are known as the left-singular vectors, and the columns of \mathbf{V} are known as the right-singular vectors. We can interpret the SVD of \mathbf{A} in terms of the eigendecomposition of certain functions of \mathbf{A}. In particular, the left-singular vectors of \mathbf{A} are the eigenvectors of \mathbf{AA}^\top, the right-singular vectors of \mathbf{A} are the eigenvectors of \mathbf{A}^\top\mathbf{A}, and the nonzero singular values of \mathbf{A} are the square roots of the eigenvalues of \mathbf{A}^\top\mathbf{A} and also \mathbf{AA}^\top.
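We can verify both the factorization and its relationship to eigendecomposition numerically (my own NumPy sketch, not from the book):

```python
import numpy as np

rng = np.random.RandomState(2)
A = rng.randn(4, 3)

# full SVD: U is 4x4, s holds the singular values (descending), Vt is 3x3
U, s, Vt = np.linalg.svd(A)

# reconstruct A = U D V^T with D a rectangular diagonal matrix
D = np.zeros((4, 3))
D[:3, :3] = np.diag(s)
assert np.allclose(U.dot(D).dot(Vt), A)

# nonzero singular values are square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T.dot(A))[::-1]  # sort descending
assert np.allclose(s ** 2, eigvals)
```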

As we’ll see in the next section, we can use SVD to partially generalize matrix inversion to nonsquare matrices.

Chapter 2.9: The Moore-Penrose Pseudoinverse

As mentioned in Section 2.4, the matrix inverse operation is not defined for nonsquare matrices. Now, suppose we want to make a left-inverse \mathbf{B} of a matrix \mathbf{A} so that we can solve a linear system of equations \mathbf{Ax} = \mathbf{y} by left-multiplying each side to obtain \mathbf{x} = \mathbf{By}. Depending on the problem, it may not be possible to find a unique mapping from matrices \mathbf{A} to \mathbf{B}. It may be that this equation has no solutions, or infinitely many.

The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of a matrix \mathbf{A} is defined as another matrix:

\mathbf{A}^+ = \lim_{\alpha\to 0^+} (\mathbf{A}^\top\mathbf{A} + \alpha\mathbf{I})^{-1}\mathbf{A}^{\top}.

In practice, algorithms for computing the pseudoinverse are often not based on this definition, but on the formula

\mathbf{A}^+ = \mathbf{VD}^+\mathbf{U}^\top,

where the matrices \mathbf{U}, \mathbf{D}, and \mathbf{V} are the SVD of \mathbf{A}, and the pseudoinverse \mathbf{D}^+ is given by taking the reciprocal of \mathbf{D}‘s nonzero elements, then transposing the resulting matrix.

When \mathbf{A} has more columns than rows, then solving a linear system using the pseudoinverse provides one of many possible solutions. It gives the solution \mathbf{x} = \mathbf{A}^+\mathbf{y} with minimal Euclidean norm among all possible solutions. When \mathbf{A} has more rows than columns, it is possible that there are no solutions to the linear system of equations. In that case, using the pseudoinverse gives us the \mathbf{x} for which \mathbf{Ax} is as close as possible to \mathbf{y} in terms of the Euclidean norm.
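NumPy's `pinv` implements the pseudoinverse; here is a small demonstration (my own, not from the book) of the minimal-norm property in the underdetermined case:

```python
import numpy as np

# underdetermined system: 2 equations, 3 unknowns (more columns than rows)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

# pseudoinverse solution: minimal Euclidean norm among all solutions
x = np.linalg.pinv(A).dot(y)
assert np.allclose(A.dot(x), y)

# adding any null-space vector gives another solution with larger norm
null = np.array([2.0, -1.0, 1.0])  # chosen so that A.dot(null) == 0
assert np.allclose(A.dot(null), 0)
assert np.linalg.norm(x) < np.linalg.norm(x + 0.5 * null)
```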

Chapter 2.10: The Trace Operator

The trace operator gives the sum of the elements along the main diagonal of a matrix:

Tr(\mathbf{A}) = \sum_{i} A_{i, i}.

Some matrix operations that are difficult to specify without summation notation are easily specified using matrix products and the trace operator. For example, it allows us to rewrite the Frobenius norm in an alternative, more convenient way:

||\mathbf{A}||_F = \sqrt{Tr(\mathbf{AA}^\top)}.

Writing expressions in terms of the trace operator opens up ways to manipulate the expression in terms of useful identities. For example, the trace operator is invariant to the transpose operator:

Tr(\mathbf{A}) = Tr(\mathbf{A}^\top).

The trace of a square matrix composed of many factors is invariant to moving the last factor into the first position of the factorization (if the shapes of the matrices admit the resulting matrix product):

Tr(\prod_{i=1}^{n} \mathbf{F}^{i}) = Tr(\mathbf{F}^{n} \prod_{i =1}^{n-1} \mathbf{F}^{i}).

This cyclic permutation invariance still holds even if the resulting matrix product has a different shape; e.g., for \mathbf{A} \in \mathbb{R}^{m \times n} and \mathbf{B} \in \mathbb{R}^{n \times m},

Tr(\mathbf{AB}) = Tr(\mathbf{BA}),

even though \mathbf{AB} \in \mathbb{R}^{m \times m} and \mathbf{BA} \in \mathbb{R}^{n \times n}.
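Both the cyclic-permutation identity and the trace form of the Frobenius norm are easy to check numerically (a NumPy sketch of my own, not from the book):

```python
import numpy as np

rng = np.random.RandomState(3)
A = rng.randn(2, 5)
B = rng.randn(5, 2)

# cyclic invariance: Tr(AB) = Tr(BA), even though AB is 2x2 and BA is 5x5
assert np.isclose(np.trace(A.dot(B)), np.trace(B.dot(A)))

# Frobenius norm expressed via the trace operator
fro = np.sqrt(np.trace(A.dot(A.T)))
assert np.isclose(fro, np.linalg.norm(A, 'fro'))
```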

The trace of a scalar is the scalar itself: a = Tr(a).

Chapter 2.11: The Determinant

The determinant of a square matrix (denoted \text{det}(\mathbf{A})) is a function which maps matrices to scalars in \mathbb{R}, and is equal to the product of all the eigenvalues of the matrix. We may think of the absolute value of the determinant as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, the matrix transformation contracts space completely along at least one dimension, causing it to lose all volume. If the determinant is 1, the transformation preserves volume.
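A quick numerical check of both claims (my own illustration, not from the book): the determinant equals the product of the eigenvalues, and scaling one axis by a factor scales the (absolute) determinant by that factor.

```python
import numpy as np

rng = np.random.RandomState(4)
B = rng.randn(3, 3)
A = B + B.T  # symmetric, so the eigenvalues are real

# determinant equals the product of the eigenvalues
lam = np.linalg.eigvalsh(A)
assert np.isclose(np.linalg.det(A), np.prod(lam))

# |det| measures volume scaling: stretching one axis by 2 doubles the det
S = np.diag([2.0, 1.0, 1.0])
assert np.isclose(np.linalg.det(S.dot(A)), 2.0 * np.linalg.det(A))
```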

Chapter 2.12: Principal Components Analysis

The principal components analysis (PCA) machine learning algorithm can be derived completely using only knowledge of basic linear algebra.

Let’s suppose there is a collection of m vectors \{\mathbf{x}^{(1)}, ..., \mathbf{x}^{(m)}\} in \mathbb{R}^{n} which we want to apply lossy compression to. Lossy compression is a method of storing data in a way that requires less memory at the cost of loss in precision. We want to lose as little precision as possible while gaining some non-trivial memory savings.

We can encode these points using a lower-dimensional representation. For each point \mathbf{x}^{(i)} \in \mathbb{R}^n we will find a corresponding “code vector” \mathbf{c}^{(i)} \in \mathbb{R}^l. If l is smaller than n, storing the code vectors should take less memory than would storing the original data. We will need to find an encoding function which produces the code for an input, f(\mathbf{x}) = \mathbf{c}, and a decoding function which produces the reconstructed input given its code, \mathbf{x} \approx g(f(\mathbf{x})).

PCA is defined by our choice of decoding function. To make the decoder simple, we choose to use matrix multiplication to map the code back into \mathbb{R}^n. We let g(\mathbf{c}) = \mathbf{Dc}, with \mathbf{D} \in \mathbb{R}^{n \times l} as the matrix defining the decoding function. Computing the optimal code for this decoder could be difficult; for this reason, PCA constrains the columns of \mathbf{D} to be orthogonal to each other.

As described, there may be many possible solutions to the PCA problem, since we can scale \mathbf{D}_{:,i} if we proportionally scale c_i in the opposite direction for all points. We constrain the columns of \mathbf{D} to have unit norm so that the problem has a unique solution.

First, we need to figure out how to generate the optimal code point \mathbf{c}^* for each input point \mathbf{x}. One way to do this is to minimize the distance between the point \mathbf{x} and its reconstruction g(\mathbf{c}^*), which we can measure using a norm. For PCA, we use the L^2 norm:

\mathbf{c}^{*} = \text{argmin}_{\mathbf{c}} ||\mathbf{x} - g(\mathbf{c})||_{2}.

For PCA, we may switch to the squared L^2 norm, since it is minimized by the same value of \mathbf{c} as the L^2 norm: the L^2 norm is non-negative, and the squaring operation is monotonically increasing for non-negative arguments. The function to be minimized is then:

\mathbf{c}^* = \text{argmin}_\mathbf{c}||\mathbf{x} - g(\mathbf{c})||_{2}^{2}

= (\mathbf{x} - g(\mathbf{c}))^\top(\mathbf{x} - g(\mathbf{c}))

= \mathbf{x}^\top\mathbf{x}-\mathbf{x}^\top g(\mathbf{c})-g(\mathbf{c})^\top\mathbf{x}+g(\mathbf{c})^\top g(\mathbf{c}).

We can omit the first term in this function since it does not depend on \mathbf{c}, and then substitute in the definition of g(\mathbf{c}):

\mathbf{c}^* = \text{argmin}_\mathbf{c}(-2\mathbf{x}^\top g(\mathbf{c})+g(\mathbf{c})^\top g(\mathbf{c}))

=\text{argmin}_\mathbf{c}(-2\mathbf{x}^\top \mathbf{Dc}+\mathbf{c}^\top\mathbf{D}^\top\mathbf{Dc})

=\text{argmin}_\mathbf{c}(-2\mathbf{x}^\top \mathbf{Dc}+\mathbf{c}^\top\mathbf{I}_l\mathbf{c})

=\text{argmin}_\mathbf{c}(-2\mathbf{x}^\top \mathbf{Dc}+\mathbf{c}^\top\mathbf{c}).

We can solve the above optimization problem using vector calculus (to be discussed in Chapter 4):

\nabla_\mathbf{c}(-2\mathbf{x}^\top\mathbf{Dc}+\mathbf{c}^\top\mathbf{c}) = \mathbf{0}

-2\mathbf{D}^\top\mathbf{x}+ 2\mathbf{c} = \mathbf{0}

\mathbf{c} = \mathbf{D}^\top\mathbf{x}.

So, we can encode \mathbf{x} using just a matrix-vector multiplication:

f(\mathbf{x}) = \mathbf{D}^\top\mathbf{x}.

Using another matrix multiplication, we define the PCA reconstruction operation (decoding the encoding operation):

r(\mathbf{x}) = g(f(\mathbf{x})) = \mathbf{DD}^\top\mathbf{x}.

We must also choose the encoding matrix \mathbf{D}. We again make use of the idea of minimizing the L^2 distance between inputs and reconstructions. Since we use the same matrix \mathbf{D} to decode all the points, we can no longer consider each of the points in isolation. We instead need to minimize the Frobenius norm of the matrix of errors computed over all dimensions and points:

\mathbf{D}^*=\text{argmin}_\mathbf{D} \sqrt{\sum_{i,j}(x_{j}^{(i)}-r(\mathbf{x}^{(i)})_j)^2} subject to \mathbf{D}^\top\mathbf{D} = \mathbf{I}_l.

To derive the algorithm for finding the optimal \mathbf{D}, we begin by considering the case where l = 1; in this case, \mathbf{D} is a single vector \mathbf{d}. Substituting the PCA reconstruction function into the previous equation, the problem becomes:

\mathbf{d}^* = \text{argmin}_\mathbf{d}\sum_i ||\mathbf{x}^{(i)} - \mathbf{dd}^\top\mathbf{x}^{(i)}||_{2}^{2} subject to ||\mathbf{d}||_2 = 1.

We can rearrange this equation as:

\mathbf{d}^* = \text{argmin}_\mathbf{d}\sum_i ||\mathbf{x}^{(i)} - \mathbf{d}^\top\mathbf{x}^{(i)}\mathbf{d}||_{2}^{2} subject to ||\mathbf{d}||_2 = 1

= \text{argmin}_\mathbf{d}\sum_i ||\mathbf{x}^{(i)} - \mathbf{x}^{(i)\top}\mathbf{dd}||_{2}^{2} subject to ||\mathbf{d}||_2 = 1.

It can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors, thereby lightening the notational burden. We let \mathbf{X} \in \mathbb{R}^{m \times n} be the matrix created by stacking all the vectors describing the points such that \mathbf{X}_{i,:} = \mathbf{x}^{(i)\top}. We can rewrite the problem as:

\mathbf{d}^* = \text{argmin}_\mathbf{d} ||\mathbf{X}-\mathbf{Xdd}^\top||_F^2 subject to \mathbf{d}^\top\mathbf{d} = 1.

We can simplify the Frobenius norm portion as follows:

\text{argmin}_\mathbf{d}||\mathbf{X} - \mathbf{Xdd}^\top||_F^2

= \text{argmin}_\mathbf{d} Tr((\mathbf{X} - \mathbf{Xdd}^\top)^\top (\mathbf{X} - \mathbf{Xdd}^\top))

= \text{argmin}_\mathbf{d} Tr(\mathbf{X}^\top\mathbf{X}-\mathbf{X}^\top\mathbf{Xdd}^\top-\mathbf{dd}^\top\mathbf{X}^\top\mathbf{X}+\mathbf{dd}^\top\mathbf{X}^\top\mathbf{Xdd}^\top)

= \text{argmin}_\mathbf{d} Tr(\mathbf{X}^\top\mathbf{X})-Tr(\mathbf{X}^\top\mathbf{Xdd}^\top)-Tr(\mathbf{dd}^\top\mathbf{X}^\top\mathbf{X})+Tr(\mathbf{dd}^\top\mathbf{X}^\top\mathbf{Xdd}^\top)

=\text{argmin}_\mathbf{d} (-Tr(\mathbf{X}^\top\mathbf{Xdd}^\top)-Tr(\mathbf{dd}^\top\mathbf{X}^\top\mathbf{X})+Tr(\mathbf{dd}^\top\mathbf{X}^\top\mathbf{Xdd}^\top))

= \text{argmin}_\mathbf{d}(-2Tr(\mathbf{X}^\top\mathbf{Xdd}^\top)+Tr(\mathbf{dd}^\top\mathbf{X}^\top\mathbf{Xdd}^\top))

= \text{argmin}_\mathbf{d}(-2Tr(\mathbf{X}^\top\mathbf{Xdd}^\top)+Tr(\mathbf{X}^\top\mathbf{Xdd}^\top\mathbf{dd}^\top)).

Reintroducing the constraint on \mathbf{d},

\text{argmin}_\mathbf{d}(-2Tr(\mathbf{X}^\top\mathbf{Xdd}^\top)+Tr(\mathbf{X}^\top\mathbf{Xdd}^\top\mathbf{dd}^\top)) subject to \mathbf{d}^\top\mathbf{d} = 1

= \text{argmin}_\mathbf{d}(-2Tr(\mathbf{X}^\top\mathbf{Xdd}^\top)+Tr(\mathbf{X}^\top\mathbf{Xdd}^\top)) subject to \mathbf{d}^\top\mathbf{d} = 1

= \text{argmin}_\mathbf{d}(-Tr(\mathbf{X}^\top\mathbf{Xdd}^\top)) subject to \mathbf{d}^\top\mathbf{d} = 1

= \text{argmax}_\mathbf{d}Tr(\mathbf{X}^\top\mathbf{Xdd}^\top) subject to \mathbf{d}^\top\mathbf{d} = 1

= \text{argmax}_\mathbf{d}Tr(\mathbf{d}^\top\mathbf{X}^\top\mathbf{Xd}) subject to \mathbf{d}^\top\mathbf{d} = 1.

The resulting optimization problem can be solved using eigendecomposition. The optimal \mathbf{d} is, in fact, given by the eigenvector of \mathbf{X}^\top\mathbf{X} corresponding to its largest eigenvalue.

The above derivation is particular to the case where l = 1, and recovers only the first principal component. To recover a basis of principal components, the matrix \mathbf{D} is given by the l eigenvectors corresponding to the largest eigenvalues, which can be shown via a proof by induction.
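The whole derivation can be exercised in a few lines of NumPy (my own sketch, not from the book; I assume the data has been centered, as is conventional for PCA):

```python
import numpy as np

rng = np.random.RandomState(5)
X = rng.randn(100, 6)
X = X - X.mean(axis=0)  # center the data

# D: the l eigenvectors of X^T X with the largest eigenvalues (as columns)
l = 2
lam, V = np.linalg.eigh(X.T.dot(X))  # eigenvalues in ascending order
D = V[:, ::-1][:, :l]                # top-l eigenvectors

# encode / reconstruct as derived above: f(x) = D^T x, r(x) = D D^T x
codes = X.dot(D)          # shape (100, l)
recon = codes.dot(D.T)    # shape (100, 6)

# the top-l eigenvectors beat any other choice of l orthonormal columns,
# e.g. the smallest-eigenvalue directions
err_top = np.sum((X - recon) ** 2)
D_bad = V[:, :l]
err_bad = np.sum((X - X.dot(D_bad).dot(D_bad.T)) ** 2)
assert err_top < err_bad
assert np.allclose(D.T.dot(D), np.eye(l))  # columns are orthonormal
```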

Linear algebra is a fundamental mathematical tool necessary for understanding or practicing deep learning. Next, we will learn about another key area of mathematics ubiquitous in machine learning: probability and information theory.

Handwritten Digits Recognition with Convolutional Neural Networks

Yesterday, I built a convolutional neural network model to recognize handwritten digits (0-9) as part of a Kaggle competition. I borrowed heavily from this kernel, but cleaned up the code and wrote better documentation. Here’s the code, written variously in Markdown and Python in a Jupyter notebook. To run the code, you’ll need Python 2.x, the Python packages NumPy, Pandas, Tensorflow, and Matplotlib, and the datasets for the competition. With 15,000 training iterations (as specified by this notebook code), I was able to achieve ~99% testing accuracy, which was evaluated on Kaggle’s held-out test data for the competition.

Digits Recognizer Kaggle Competition

I’ll use a convolutional neural network with dropout to classify the handwritten digits dataset.


Python software packages we use in this notebook.

import numpy as np
import pandas as pd
import tensorflow as tf

import matplotlib.pyplot as plt
import matplotlib.cm as cm

from random import choice

%matplotlib inline


Change these to determine the hyperparameter configuration of the convolutional neural network and the data-preprocessing.

# determines training / validation split
TRAIN_SIZE = 37000

# neural network learning rate (assumed value; a common default)
LEARNING_RATE = 1e-4

# minibatch size (assumed value; a common default)
BATCH_SIZE = 50

# dropout keep probability (assumed value; a common default)
DROPOUT = 0.5

# number of training epochs
EPOCHS = 15000

Data Pre-processing

The training dataset has 42,000 rows and 784 columns, where each row corresponds to a single image of a digit, and each column corresponds to the value of a pixel intensity in an image of a digit; i.e., each column is a “feature” of the digit representation. The test dataset has some 28,000 rows and again, 784 columns.

The datapoints have been flattened from shape (28, 28) to (784,), but we’ll reshape these later in order to make use of the image-filtering capacity of convolutional neural networks.

# read in training data
train = pd.read_csv('../data/train.csv')

# store training dataset as NumPy arrays
X_train = np.array(train.drop('label', axis=1))
y_train = np.array(train['label'])

# normalize pixel values
X_train = np.multiply(X_train, 1.0 / 255.0)

# function to convert labels to one-hot vectors
def dense_to_one_hot(labels_dense):
    num_labels = labels_dense.shape[0]
    index_offset = np.arange(num_labels) * 10
    labels_one_hot = np.zeros((num_labels, 10))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    return labels_one_hot

# use the function!
y_train = dense_to_one_hot(y_train).astype(np.uint8)

# split into training / validation sets (let's use a 37,000 / 5,000 split)
X_train, X_valid = X_train[:TRAIN_SIZE], X_train[TRAIN_SIZE:]
y_train, y_valid = y_train[:TRAIN_SIZE], y_train[TRAIN_SIZE:]

# get shapes of training, validation, and test datasets
print 'training data shapes:', X_train.shape, y_train.shape
print 'validation data shapes:', X_valid.shape, y_valid.shape

# get size, shape of dataset images
image_size = X_train[0,:].shape[0]
print 'image size:', image_size
image_side_length = int(np.sqrt(X_train[0,:].shape[0]))
print 'image shape:', image_side_length, 'x', image_side_length

training data shapes: (37000, 784) (37000, 10)
validation data shapes: (5000, 784) (5000, 10)
image size: 784
image shape: 28 x 28

Displaying Digit Images

A simple function (borrowed from the kernel mentioned above) to display a digit image. We need to reshape from (784,) to (28, 28) so we have the correct human-friendly representation.

# display image
def display(img):
    # (784,) => (28, 28)
    img = img.reshape((28, 28))
    plt.imshow(img, cmap=cm.binary)

# output a randomly chosen training image
display(X_train[choice(range(X_train.shape[0]))])


Building the Tensorflow Computation Graph

Tensorflow does much of its numerical computation and book-keeping outside of Python. Tensorflow allows us to build an entire graph of interacting operations, and run the whole workflow in a separate process all at once.

I define several helper functions for creating weight / bias variables and convolution and pooling operations.

Since we’re using ReLUs (rectified linear units, defined by f(x) = max(0, x)), we initialize weight vectors with a bit of positive noise to avoid “dead neurons”; i.e., units whose activations fall below zero and never become positive again.

We’ll use zero-padded convolutions so that the output is the same size as the input. The convolution stride in this case is 1 (subject to change). Generally, convolutional layers are responsible for extracting different levels of features from the input data or the output from a previous layer. In digit recognition, low-level features might consist of (among other things) edges (black to white) of various orientation, and higher-level features might consist of sub-shapes characteristic of certain digits, or even canonical or averaged representations of an entire digit class.

We use max-pooling over 2×2 blocks (subject to change). This operation is used to down-sample, or reduce the dimensionality of, the output of a previous layer. Max-pooling keeps only the maximum value from each (2×2) block of its input.
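To see exactly what this operation does, here's a small NumPy sketch of 2×2 max-pooling (my own illustration; the network itself uses TensorFlow's built-in `tf.nn.max_pool`):

```python
import numpy as np

def max_pool_2x2_np(img):
    """Down-sample a 2D array by taking the max of each 2x2 block."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    # group pixels into 2x2 blocks, then reduce each block to its max
    return img[:2*h, :2*w].reshape(h, 2, w, 2).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [0, 0, 1, 0],
                [0, 9, 0, 2]])
pooled = max_pool_2x2_np(img)
assert np.array_equal(pooled, np.array([[4, 8], [9, 2]]))
```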

# weight initialization
def weight_variable(shape):
initial = tf.truncated_normal(shape, stddev=0.1)
return tf.Variable(initial)

# bias initialization
def bias_variable(shape):
initial = tf.constant(0.1, shape=shape)
return tf.Variable(initial)

# convolution operation
def conv2d(x, W):
# using TensorFlow's conv2d operation with above mentioned parameters
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

# max-pooling operation
def max_pool_2x2(x):
# using TensorFlow's max_pool operation
return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

Any neural network can be used as a layer in a larger neural network (since a “layer” need only be a differentiable operation; this holds for entire NNs). This means that the output for one network can be used as the input for another network. This compositional approach to building neural networks is the reason for the now-popular moniker “deep learning”.

In this notebook, we’ll implement a convolutional neural network with two convolutional layers and two max-pooling layers interweaved, a fully-connected layer, and then a logistic regression layer trained on the activations output from the fully-connected layer.

Let’s define variables for NN input, output:

x = tf.placeholder('float', shape=[None, image_size])
y_ = tf.placeholder('float', shape=[None, 10])

The first layer is a convolution layer followed by max-pooling. The convolution computes 32 features for each 5×5 patch in the image (with stride 1). Thus, its weights tensor has shape [5, 5, 1, 32], where the first two dimensions are the patch size, the third is the number of input channels (1, since the image is grayscale), and the last is the number of output channels or filters. There is also a bias vector with a single component for each output channel.

To apply the layer, we reshape the input data into a 4-dimensional tensor with the first dimension corresponding to the number of images, the second and third to the image width and height, and the last to the number of input channels (On the image input, its shape becomes [37,000, 28, 28, 1]).

After the convolution operation, the max-pooling operation reduces the size of the output (of each channel) from 28×28 to 14×14 (of which there are 32, for an output tensor of shape [37,000, 14, 14, 32]).

# 1st convolution layer
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

# reshaping input (-1 corresponds to arbitrary dimensionality)
image = tf.reshape(x, [-1, 28, 28, 1])
# print image.get_shape() gives (?, 28, 28, 1)

# convolution hidden layer
h_conv1 = tf.nn.relu(conv2d(image, W_conv1) + b_conv1)

# max-pooling hidden layer
h_pool1 = max_pool_2x2(h_conv1)

# Prepare for visualization
# display 32 features in a (4 x 8) grid
layer1 = tf.reshape(h_conv1, (-1, 28, 28, 4, 8))

# reorder so the channels are in the first dimension, x and y follow.
layer1 = tf.transpose(layer1, (0, 3, 1, 4,2))
layer1 = tf.reshape(layer1, (-1, 28*4, 28*8))

The second layer has 64 features for each 5×5 input patch (again, strided by 1). Therefore, its weight tensor has shape [5, 5, 32, 64], where the first two dimensions are patch size, the third is the number of input channels, and the fourth is the number of output channels. Again, there is a bias vector which has a single component for each output channel.

# 2nd convolution layer
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

# convolution hidden layer
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

# max-pooling hidden layer
h_pool2 = max_pool_2x2(h_conv2)

# Prepare for visualization
# display 64 features in a (4 x 16) grid
layer2 = tf.reshape(h_conv2, (-1, 14, 14, 4 ,16))

# reorder so the channels are in the first dimension, x and y follow.
layer2 = tf.transpose(layer2, (0, 3, 1, 4,2))

layer2 = tf.reshape(layer2, (-1, 14*4, 14*16))

The image size has been reduced to 7×7 by the last max-pooling layer, and so the shape of the data tensor is now [37,000, 7, 7, 64]. Now, we add a fully-connected layer of 1,024 neurons to do processing on the entire current feature representation of the image.

# fully-connected layer
W_fc1 = weight_variable([7*7*64, 1024])
b_fc1 = bias_variable([1024])

# flattening output of previous max-pooling layer
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])

# fully-connected hidden layer
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

We apply dropout before the read-out layer in an effort to prevent over-fitting to training data:

# dropout
keep_prob = tf.placeholder('float')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

Finally, we attach a logistic regression layer to the output of the previous fully-connected layer:

# readout layer / softmax
W_out = weight_variable([1024, 10])
b_out = bias_variable([10])

# output (softmax)
y = tf.nn.softmax(tf.matmul(h_fc1_drop, W_out) + b_out)

We use the cross-entropy loss to evaluate network performance, and use the ADAM optimizer to minimize this loss:

# loss function
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))

# optimization function
train_step = tf.train.AdamOptimizer(LEARNING_RATE).minimize(cross_entropy)

# evaluation
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, 'float'))

To predict values from test data, the element with the highest probability will be picked from the one-hot vector representation.

# prediction function
predict = tf.argmax(y,1)

Train, Validate, and Predict

Now that we’ve defined our neural network architecture, we can train and validate our model, and compile a prediction file for the Kaggle competition.

I define a helper function for doing mini-batch-wise training (stochastic mini-batch training), which is cheaper and faster than using the entire training dataset per step of network training, and gives similar results.

epochs_completed = 0
index_in_epoch = 0
num_examples = X_train.shape[0]

# serve up data by minibatch
def next_batch():
    batch_size = BATCH_SIZE

    global X_train
    global y_train
    global index_in_epoch
    global epochs_completed

    start = index_in_epoch
    index_in_epoch += batch_size

    # when all training data have been already used, reorder it randomly   
    if index_in_epoch > num_examples:
        # finished epoch
        epochs_completed += 1
        # shuffle the data
        perm = np.arange(num_examples)
        np.random.shuffle(perm)
        X_train = X_train[perm]
        y_train = y_train[perm]
        # start next epoch
        start = 0
        index_in_epoch = batch_size
        assert batch_size <= num_examples
    end = index_in_epoch
    return X_train[start:end], y_train[start:end]

Now that all operations for every variable are defined in the TensorFlow graph, all computations will be performed outside the Python environment.

# start TensorFlow session
init = tf.initialize_all_variables()
sess = tf.InteractiveSession()
sess.run(init)

In each step of the loop, we get a minibatch of training data, which we feed into the computation graph to replace the placeholders “x”, “y_”, and “keep_prob”. We check training / validation set accuracy every once in a while on an incoming minibatch.

# visualization variables
train_accuracies, validation_accuracies, X_range = [], [], []
display_step = 1

for i in range(EPOCHS):
    # get a new training minibatch
    X_batch, y_batch = next_batch()

    # check progress on certain epochs
    if i % display_step == 0 or (i + 1) == EPOCHS:
        # get training accuracy on incoming minibatch
        training_accuracy = accuracy.eval(feed_dict={x : X_batch, y_ : y_batch, keep_prob : 1.0})

        # get validation accuracy on incoming minibatch
        validation_accuracy = accuracy.eval(feed_dict={x : X_valid[0:BATCH_SIZE], y_ : y_valid[0:BATCH_SIZE], keep_prob : 1.0})

        print('training_accuracy / validation_accuracy => %.2f / %.2f for step %d'%(training_accuracy, validation_accuracy, i))

        # record accuracies for plotting later
        train_accuracies.append(training_accuracy)
        validation_accuracies.append(validation_accuracy)
        X_range.append(i)

        # increase display step
        if i % (display_step*10) == 0 and i:
            display_step *= 10

    # train on minibatch
    sess.run(train_step, feed_dict={x : X_batch, y_ : y_batch, keep_prob: DROPOUT})

training_accuracy / validation_accuracy => 0.00 / 0.08 for step 0
training_accuracy / validation_accuracy => 0.06 / 0.08 for step 1
training_accuracy / validation_accuracy => 0.08 / 0.04 for step 2
training_accuracy / validation_accuracy => 0.12 / 0.10 for step 3
training_accuracy / validation_accuracy => 0.12 / 0.12 for step 4
training_accuracy / validation_accuracy => 0.18 / 0.10 for step 5
training_accuracy / validation_accuracy => 0.14 / 0.08 for step 6
training_accuracy / validation_accuracy => 0.12 / 0.08 for step 7
training_accuracy / validation_accuracy => 0.16 / 0.08 for step 8
training_accuracy / validation_accuracy => 0.20 / 0.12 for step 9
training_accuracy / validation_accuracy => 0.16 / 0.14 for step 10
training_accuracy / validation_accuracy => 0.36 / 0.38 for step 20
training_accuracy / validation_accuracy => 0.58 / 0.54 for step 30
training_accuracy / validation_accuracy => 0.62 / 0.56 for step 40
training_accuracy / validation_accuracy => 0.74 / 0.66 for step 50
training_accuracy / validation_accuracy => 0.80 / 0.66 for step 60
training_accuracy / validation_accuracy => 0.74 / 0.70 for step 70
training_accuracy / validation_accuracy => 0.80 / 0.74 for step 80
training_accuracy / validation_accuracy => 0.86 / 0.68 for step 90
training_accuracy / validation_accuracy => 0.78 / 0.76 for step 100
training_accuracy / validation_accuracy => 0.96 / 0.80 for step 200
training_accuracy / validation_accuracy => 0.94 / 0.86 for step 300
training_accuracy / validation_accuracy => 0.94 / 0.84 for step 400
training_accuracy / validation_accuracy => 0.94 / 0.86 for step 500
training_accuracy / validation_accuracy => 0.94 / 0.90 for step 600
training_accuracy / validation_accuracy => 0.98 / 0.90 for step 700
training_accuracy / validation_accuracy => 0.96 / 0.92 for step 800
training_accuracy / validation_accuracy => 0.94 / 0.92 for step 900
training_accuracy / validation_accuracy => 0.94 / 0.92 for step 1000
training_accuracy / validation_accuracy => 1.00 / 0.92 for step 2000
training_accuracy / validation_accuracy => 0.98 / 0.94 for step 3000
training_accuracy / validation_accuracy => 1.00 / 0.96 for step 4000
training_accuracy / validation_accuracy => 1.00 / 0.96 for step 5000
training_accuracy / validation_accuracy => 1.00 / 0.96 for step 6000
training_accuracy / validation_accuracy => 1.00 / 0.96 for step 7000
training_accuracy / validation_accuracy => 1.00 / 0.96 for step 8000
training_accuracy / validation_accuracy => 1.00 / 0.96 for step 9000
training_accuracy / validation_accuracy => 1.00 / 0.96 for step 10000
training_accuracy / validation_accuracy => 1.00 / 0.96 for step 14999

# check final accuracy on validation set
validation_accuracy = accuracy.eval(feed_dict={x : X_valid, y_ : y_valid, keep_prob : 1.0})
print 'validation_accuracy => %.4f' % validation_accuracy

plt.plot(X_range, train_accuracies,'-b', label='Training')
plt.plot(X_range, validation_accuracies,'-g', label='Validation')

plt.legend(loc='lower right', frameon=False)
plt.ylim(ymax = 1.1, ymin = 0.7)

validation_accuracy => 0.9906


When we’re happy with the output, we can predict the labels for the test dataset. The test data contains only images and no labels; otherwise, its structure is identical to the training dataset. Predicted labels will be stored in a .csv file for submission to Kaggle.

# read test data from CSV file
X_test = pd.read_csv('../data/test.csv').values
X_test = X_test.astype(np.float)

print 'test data shape:', X_test.shape

test data shape: (28000, 784)

# convert from [0:255] => [0.0:1.0]
X_test = np.multiply(X_test, 1.0 / 255.0)

# predict the labels of the test dataset. using minibatches is more resource efficient
predicted_labels = np.zeros(X_test.shape[0])
for i in range(X_test.shape[0] // BATCH_SIZE):
    predicted_labels[i * BATCH_SIZE : (i + 1) * BATCH_SIZE] = predict.eval(feed_dict={x : X_test[i * BATCH_SIZE : (i + 1) * BATCH_SIZE], keep_prob : 1.0})

print 'predicted labels shape:', len(predicted_labels)

predicted labels shape: 28000

# output random test image and prediction
random_idx = choice(range(len(X_test)))
random_img = X_test[random_idx]

print 'predicted_labels[{0}] => {1}'.format(random_idx, predicted_labels[random_idx])

# save results of test dataset prediction
np.savetxt('../submissions/submission_15000_epochs.csv', np.c_[range(1, len(X_test) + 1), predicted_labels], delimiter=',', header='ImageId,Label', comments='', fmt='%d')

predicted_labels[4533] => 2.0


Exploring the Learned Representation

Let’s take a look at the features learned by some of the network’s layers.

We plot saliency maps (what the network deems “important” in the image) for the activations of all filters of the two convolutional layers, “h_conv1” and “h_conv2”.

As you can see, the activations of the first convolutional layer are good at capturing lines and shadows, and at “passing through” all the pixels of the digit (probably for bookkeeping during the computation), whereas the activations of the second convolutional layer seem to capture more abstract features like shapes and boundaries.

layer1_grid = layer1.eval(feed_dict={x : X_test[0:3], keep_prob : 1.0})
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.imshow(layer1_grid[0], cmap=cm.seismic)


plt.imshow(layer1_grid[1], cmap=cm.seismic)


plt.imshow(layer1_grid[2], cmap=cm.seismic)


layer2_grid = layer2.eval(feed_dict={x : X_test[3:6], keep_prob : 1.0})
plt.imshow(layer2_grid[0], cmap=cm.seismic)


plt.imshow(layer2_grid[1], cmap=cm.seismic)


plt.imshow(layer2_grid[2], cmap=cm.seismic)


Thoughts on Completing an Undergraduate Program and Beginning Graduate School

I graduated this past fall from UMass Amherst with two bachelor’s degrees: one in computer science, and another in mathematics. I decided to take an extra semester, for a total of four and a half years, to (1) load up on some advanced undergraduate courses, (2) get a head start on graduate coursework, (3) start pursuing research opportunities, and (4) get two degrees instead of one with a double major (because why not?). Also, I felt that my first ~2 years in undergrad were somewhat misguided; I didn’t know how to “do college,” as demonstrated by the decent, but not great, GPA I had during that time.

I enjoyed my time at UMass, but like any undergraduate program, it has its flaws (unless otherwise specified, I’ll be talking about the CS major, since that’s where I spent the most time and effort). The one which stood out the most was probably the feeling of isolation that came with arriving as part of an enormous freshman class (especially in the computer science major). Although my professors were friendly enough, there was little opportunity to interact with them outside of attending lecture and trying to flag them down for the answer to a simple question. And, though there was a computer science building, most CS majors kept to themselves due to a lack of encouragement for teamwork or group study. I was lucky to make friends as a member of the marching band, but I graduated without making more than a half dozen acquaintances in my primary major. Now, this is in part my own fault, but the isolation seemed epidemic in the major, so I think something should be done about it.

The courses offered in the CS major were very hit-or-miss, and it was difficult to know ahead of time what you were getting yourself into by signing up for a course. I think the biggest issue was that the difficulty of the courses varied much too wildly. The discrete mathematics course for computer scientists (CS250 – Introduction to Computation) was very difficult, and it was nearly impossible to understand the motivation behind it, particularly for those sophomore CS majors who were more interested in application (software engineering and the like) than theory. Interestingly, the professor who taught it also taught the introductory data structures course (CS187 – Programming With Data Structures), which I thought was done well. Some courses were a joy to take and offered great reading materials and notes (specifically, CS590D – Algorithms for Data Science and CS697L – Deep Learning), while others were so unstructured that they seemed more like free-form meetings than lectures (CS591NR – Neural Networks, for example: I showed up to ~50% of the lectures and got an A). In several courses, I felt that I didn’t really grasp the material being taught, but was able to get A’s nonetheless due to easy programming assignments, curved scores on exams, and extra credit work – a sure sign of poor teaching.

On the other hand, most mathematics courses I took were at about the same level of difficulty (with one especially glaring exception: Abstract Algebra II, in which no one got an A), and there was a simple way to do well in them: read the book, go to lecture, be honest about what you don’t understand, do the homework, and be patient and meticulous in your work. Basically, if you want the A: work hard, ask questions, and be thoughtful. I’m glad I decided to take on mathematics as a secondary major; knowledge of calculus, probability, and linear algebra is indispensable for understanding the mechanisms behind machine / deep learning, and I’m confident that this knowledge will serve me well in many applications of CS.

In retrospect, I wish I had realized much earlier which subfield of computer science I wanted to focus on. When I first matriculated at UMass, all I knew was that I liked working with computers and wanted to make a living using them… This is obviously much too general a plan for a college education! I became interested in computational complexity, and eventually did an independent study in circuit complexity theory, mostly because I was baffled by the field (what is it useful for? who first bothered to study it?) and wanted to prove to myself that I could understand it.

I switched somewhat recently to studying machine learning, neural networks, and related subjects since I didn’t enjoy complexity theory so much, and have been voraciously reading and taking courses on this kind of material since. This was triggered in part by reading about the connectionist movement in the 1980s from a few old books (namely, I thought Mind Design II, a collection of essays and papers on artificial intelligence, philosophy, and cognitive science, was fascinating), and then realizing that AI and specifically neural networks research was alive and well today. I was also captivated by a TEDxAmherst talk on computational neuroscience by then-NYU, now-Yale researcher John Murray, and was inspired ever since to use computers to build models of biological brain function.

Altogether, I would say I enjoyed my undergraduate studies, especially after I discovered what it was that got me excited to get out of bed in the morning to study. It was somewhat of a struggle to pay my way through school while finishing two majors and doing marching band on the side, but I feel as though I’ve become much tougher as a result. Now, I hope to maintain this mental toughness while homing in on exactly what field I want to make a contribution to and build a career upon.

This coming spring, I’ll be starting towards an M.S. degree in computer science, again at UMass Amherst. I’ll be focusing on AI and machine learning courses and (hopefully) research for now, and working in various freelance programming capacities. I’ve also applied to the data science master’s track, which will allow me to take a curriculum designed to develop knowledge and practical skills in statistics, machine learning, and related areas. As you might have guessed from my recent blogging activity, I want to develop my writing skills and ramp up my writing output. Other goals include: becoming a Python power user, getting a few papers published, maintaining a great GPA, developing relationships with faculty members and fellow graduate students, and gearing up for Ph.D. program applications. I’m hoping that this master’s program will allow me to get closer to my professors and do advanced work on research projects, and that the graduate courses I choose will be more challenging and rewarding than those in my undergraduate studies.

Deep Learning Book: Introduction

As part of my goals for the new year, I’ve decided to create a blog post for each chapter in the new Deep Learning textbook after I read it. I’m really excited to read this book, and am somewhat star-struck by its authors (Ian Goodfellow, Yoshua Bengio, Aaron Courville), who are preeminent researchers in the deep learning field. I can’t wait to pick their brains, in the guise of a book! Let’s begin with Chapter 1: Introduction.

Chapter 1: Introduction

The introductory chapter begins with a history of artificial intelligence (AI). Ada Lovelace may be considered the first person to ponder intelligence in machines, in her notes on Charles Babbage’s “Analytical Engine” all the way back in 1842. Early work in AI focused on tasks that were difficult or un-intuitive for humans but straightforward for machinery; i.e., problems which can be specified by a bit of formal mathematics. The real challenge in AI turned out to be problems which are easy for humans (say, recognizing faces or performing locomotion) but hard for us to describe formally. The textbook is meant to be a solution to these intuitive but hard-to-formalize problems.

How do we solve such a problem? We wish “…to allow computers to learn from experience and to understand the world in terms of a hierarchy of concepts” (pg. 1). This removes the need for humans to provide all the knowledge the computer could need to solve a task, requiring it to gather knowledge on its own. The hierarchy of concepts allows the machine to represent complex concepts by building them out of simpler ones. We call such an approach deep learning, since, moving from simple to complex concepts, there are many layers of learned relationships.

It is straightforward to implement solutions to formal tasks on digital computers, but replicating intuitive human abilities has spawned a great deal of research and little success. Much of everyday human activity is grounded in subjective knowledge, something which was absent in early AI approaches, which often operated in sterile, formalized environments (not necessarily indicative of the “real world”). We want to leverage real-world knowledge, or data, in order to make computer algorithms behave in an intelligent way, but actually doing this is a key challenge.

In the knowledge base approach to AI, a computer reasons automatically about statements about the world, specified in a formal language, using logical inference rules. This is generally not thought of as a useful approach, since coming up with formal statements complex enough to accurately describe the world appears to be intractable.

This failure suggested discarding the hard-coded knowledge approach in favor of knowledge acquisition from raw data, typified by machine learning (ML). This framework allows computers to work with problems which are essentially situated in the real world and to perform nuanced decision making.

ML algorithms’ performance depends significantly on the representation of the data they are trained and evaluated on. For example, one can imagine representing gray-scale images by floating-point values for the corresponding pixel intensities. It is clear that selecting a good representation (say, one that captures all relevant information of the data succinctly) will lead to better predictive performance of our ML algorithm than an ill-suited representation.

We call each piece of information in the representation of the data a feature. One may perform feature engineering in order to select those features which might lead to optimal model performance, but this is often time-consuming and requires expert knowledge. On the other hand, one can use ML to learn the data representation itself; this process is known as representation learning. Learned representations often perform much better than hand-engineered ones, and a single learned representation is often useful across different AI tasks (due to its ability to explain real-world data). Moreover, learning representations is typically much less time-intensive than engineering them by hand, and requires little to no expertise.

In designing features, we want to find those which explain the factors of variation which in turn explain the observed data. These factors are generally not directly observable in the data; rather, they are the real-world forces that offer explanations for the variability of the data. For example, one might have recordings of a pendulum’s position in space, and want to predict where it will be at some future moment in time. If a learned representation were used, the features should correspond, in some way, to the real-world physical parameters governing the motion of the pendulum.

A major source of difficulty is that, typically, many factors of variation influence every single piece of data we observe. In practice, we will need to untangle these factors, and use only the ones we care about in our (learned or engineered) representation. It is often hard to extract abstract features from raw data, and when this is roughly as hard as actually solving the problem, representation learning doesn’t appear to help in any significant way.

Deep learning (DL) solves this central problem by using learned representations which are expressed in terms of simpler representations. The quintessential deep learning model is the multilayer perceptron (MLP), which can be described as a function mapping some set of inputs to outputs, and is formed by composing many simpler functions. Each function application can be thought of as transforming the input into a new representation. We can also think of depth as allowing a computer to learn a multiple-step program, where each layer of the neural network is the state of the computer’s memory after a set of parallel computation steps. Sequential steps are more powerful than a single computation, a well-known fact of complexity theory, as later instructions can refer back to the results of previous ones.

We can break down these complex neural network functions into a series of simple, nested functions, each described by a layer of a deep learning model. A datum is presented at the visible layer and passed through a series of hidden layers, which extract increasingly abstract features of the datum, and a simple ML algorithm performs classification / regression on the result of this computation. Not all the information in a layer’s activations necessarily encodes factors of variation which explain the input; some part of the representation typically stores “state” information, which helps the program understand its input in order to execute properly. We might think of this as control flow information which helps the program organize its processing.

There are two ways to measure the depth of a deep learning model: (1) We measure the number of sequential instructions to execute in order to evaluate the model, which is given by the longest path in the flow chart which describes the input / output of the DL model. This depth is dependent on what we specify as atomic steps in the flow chart; (2) For probabilistic models, depth is usually defined as the depth of the graph describing how the learned concepts are related. The system’s understanding of simple concepts can be adjusted given information about the more complex concepts. There is no clear answer as to what makes a “deep” DL architecture, but we can regard DL as the study of models that involve a greater degree of compositionality than in traditional ML.

In summary, deep learning is an artificial intelligence approach, and more specifically a machine learning approach, which currently appears to be the only viable approach to building intelligent systems situated in the real world. DL achieves power and flexibility by representing the world as a nested hierarchy of concepts, a powerful idea which is reminiscent of the way human brains are organized.

Chapter 1.1: Who Should Read This Book?

Basically, university students (undergraduate or graduate) interested in pursuing deep learning research or applications, or software engineers who want to develop a background in machine learning and statistics. As a new graduate student interested in studying ML (specifically DL) and computational neuroscience, this book is a great fit for my educational goals.

The book is broken down into three parts:

Part I: Basic mathematical tools and machine learning fundamentals.

Part II: Well-established deep learning algorithms, considered “solved technologies”.

Part III: More speculative ideas which are believed to be important for future deep learning research.

Chapter 1.2: Historical Trends in Deep Learning

There are four historical trends in DL that the book chooses to focus on:

  • Deep learning has a long and rich history, but has gone by several names, and has waxed and waned in popularity since its inception.
  • Deep learning has been more useful as training data has become more available.
  • Deep learning models have grown in size as computer infrastructure for DL has improved.
  • Deep learning has solved increasingly complex problems with increasing accuracy over time.

Chapter 1.2.1: The Many Names and Changing Fortunes of Neural Networks

There have been three major waves of deep learning development:

  1. DL as cybernetics (1940s-1950s).
  2. DL as connectionism (1980s-1990s).
  3. The current resurgence of deep learning (2006-?).

The first research into artificial neurons was conducted in 1943 by McCulloch & Pitts, and later by Donald Hebb in 1949. Both aimed to develop fundamental theories of biological learning. In 1958, Frank Rosenblatt built the Perceptron model, which was, at the time, a single artificial neuron which could be trained as a binary linear classifier. This model was quickly extended to incorporate multiple neurons and to make multi-class decisions. In 1986, Rumelhart et al. developed the backpropagation algorithm to train multi-layer neural networks, a task which was previously thought to be impossible. After eventual stagnation due to lack of computational resources and over-promising, the modern field of deep learning emerged in 2006, beginning with the work of Hinton et al. on the unsupervised pre-training of a model known as the deep belief network. This result gave researchers hope that training large neural network architectures was becoming tractable.

Early learning algorithms were meant to be computational models of biological learning; i.e., researchers wished to discover and model how the brain learns. In fact, one of the names which DL has gone by is artificial neural networks (ANNs): systems that are more or less inspired by the biological brain, but are nevertheless not realistic models of brain function. The neuroscience perspective of DL is motivated by two main ideas:

  1. The brain is an existence proof for intelligent behavior, and reverse-engineering its computational principles and duplicating its functionality is a conceptually straightforward path to intelligent machines.
  2. It would be interesting to understand the brain and the principles that govern human intelligence. Machine learning models can shed light on these questions.

Modern deep learning appeals more to multiple levels of compositionality than traditional machine learning, and for that reason, takes the neuroscientific perspective a step further.

The earliest predecessors of modern neural networks were linear models motivated from neuroscience. These models take a set of n inputs, x_1, ..., x_n, and map them to an output, y. The model would learn a set of weights w_1, ..., w_n to compute the mapping f(\textbf{x}, \textbf{w}) = w_1x_1 + ... + w_nx_n = \hat{y}, and was trained to minimize the difference y - \hat{y}.

The McCulloch-Pitts neuron was an early model of brain function. The model could recognize two different categories of inputs by testing whether f(\textbf{x}, \textbf{w}) was positive or negative; however, the model’s weights w_i had to be set by a human operator! In the 1950s, Frank Rosenblatt’s perceptron model could learn, from examples (data), the weights that defined the positive and negative categories. The adaptive linear element (ADALINE), on the other hand, was trained to predict a real-valued function, and could also learn from data. Models based on the linear function f(\textbf{x}, \textbf{w}) used by the perceptron and ADALINE are known as linear models, which are still widely used in modern machine learning practice, although their training procedures are typically much improved.
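The perceptron learning rule described above can be sketched in a few lines of NumPy. This is an illustrative toy, not code from the original sources: the variable names and the tiny dataset are my own. The classic update nudges the weights toward any misclassified example until the two categories are separated.

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Learn weights w so that sign(w . x) matches labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:  # misclassified: nudge w toward the example
                w += yi * xi
    return w

# linearly separable toy data: the class is the sign of the first coordinate
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron_train(X, y)
print(np.sign(X @ w))  # recovers the labels [1, 1, -1, -1]
```

For separable data like this, the perceptron convergence theorem guarantees the loop stops making updates after finitely many passes.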

A famous book by Minsky and Papert in 1969 proved that linear models are very limited; namely, they cannot learn to compute the logical XOR function!
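This limitation is easy to verify numerically. In this sketch (my own illustration, with illustrative names), the best linear fit to the XOR truth table, even with a bias term, predicts 0.5 for every input:

```python
import numpy as np

# XOR truth table with a bias column appended; no weight vector separates it linearly
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# best linear fit in the least-squares sense
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X @ w)  # ~[0.5 0.5 0.5 0.5]: the model predicts 0.5 for every input
```

A single hidden layer with a nonlinearity is enough to fix this, which is exactly the multilayer direction the field eventually took.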

The role of neuroscience in deep learning has diminished since we don’t have enough information about the biological brain to use it as a realistic guide. In order to understand brain function at the algorithmic level, we must be able to monitor thousands of interconnected neurons at once, a technology which doesn’t seem to be in our near future. However, neuroscience has given researchers the hope that a single DL algorithm can be used to solve many tasks across many domains. It is common for DL research laboratories to study several application areas simultaneously; e.g., computer vision and robotics.

Many rough guidelines can be drawn from neuroscience in the study of DL. For example, having many units that behave intelligently via their interactions with each other is biologically inspired. Some neural networks have been inspired by the mammalian visual systems (e.g., convolutional NNs). Most modern large-scale networks use the rectified linear unit (ReLU) as the model’s “neuron”, but we know that actual neurons compute much more complex functions. However, there is little evidence that greater realism leads to greater network performance. While neuroscience has inspired many network architectures, we do not yet know enough about biological brains to implement neuroscience-inspired learning algorithms.

Deep learning is not brain simulation. In fact, DL implementation is closer to that of applied mathematics fundamentals (i.e., linear algebra, probability, information theory, and numerical optimization) than it is to neuroscience. On the other hand, computational neuroscience is the study of how the brain works at the algorithmic level. The discipline aims to build accurate models of brain function, whereas deep learning is a powerful approach to building intelligent systems. This is typically separated from DL, but researchers often move between the two fields fluently.

In the 1980s, the second wave of neural network research emerged via the connectionism or parallel distributed processing movement. Connectionism came out of cognitive science, in which the typical symbolic processing approach was discarded in favor of actual neural implementation. The central idea of the movement was that a large number of simple computational units (say, artificial neurons), when properly networked, can behave intelligently. Key concepts from the connectionist movement include distributed representations, in which each input to a system is represented by many features, and each feature is involved in the representation of as many inputs as possible; this reduces the number of parameters needed to build learned representations. Another contribution was the backpropagation algorithm, its successful application in training multilayer neural networks, and its resulting popularity.

Hochreiter (1991) and Bengio (1994) made advances in modeling sequences with neural networks, and identified some of the fundamental mathematical difficulties in doing so. Hochreiter and Schmidhuber (1997) introduced the Long Short-Term Memory (LSTM) network as a solution to some of these difficulties. LSTMs have been successfully applied in speech recognition (e.g., Siri) and natural language processing applications.

The second wave of neural network research lasted into the mid-1990s. At this time, kernel machines and probabilistic graphical models made significant advances and began to show competitive results. So, deep learning research stagnated until Hinton’s work in 2006. However, neural networks continued to obtain good performance on some tasks, and CIFAR was able to keep deep learning research alive within the multi-university NCAP group. At this time, deep neural networks were very difficult or even intractable to train due to high computational cost.

Finally, the third wave of deep learning research came in 2006 with a breakthrough by Geoffrey Hinton, who demonstrated a procedure for efficiently training deep belief networks via greedy layer-wise pre-training. This strategy was quickly shown to be applicable to many other kinds of deep networks. The current wave of neural network research inspired the “deep learning” moniker, which emphasizes the importance of the depth of network models, and stresses that deep models can now be tractably trained.

In the current state of the art, DL models outperform competing AI systems based on other ML techniques or hand-designed functionality. The third wave began with a focus on new unsupervised training techniques and generalization from small datasets, but today, the field has migrated towards old supervised learning algorithms and large, labeled datasets.

Chapter 1.2.2: Increasing Dataset Sizes

Deep learning was used successfully in the 1990s, but was thought of more as something only an expert could use, rather than a general-purpose technology. Luckily, the amount of skill needed to use neural networks decreases as the amount of available training data increases. We can now provide learning algorithms with the resources they need to work properly. The increasing digitization of society (at least in part) fuels the growth in dataset size. The advent of “Big Data” makes machine learning tasks easier since the burden of generalizing well to new data is lightened (since we can now “observe” much more data during model training).

A rough rule of thumb: a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled training examples per category, and will match or exceed human-level performance when trained on a dataset containing at least 10 million labeled examples. Key questions in this realm now involve: How can we get similar ML performance with smaller training datasets? How can we take advantage of large numbers of unlabeled examples (semisupervised or unsupervised learning)?

Chapter 1.2.3: Increasing Model Sizes

Another reason for deep learning’s success is that we now have the computational resources to run much larger, and hence more powerful, models today than in the 1980s-90s. A small number of neurons aren’t particularly useful when trying to solve complex AI problems.

Interestingly, biological neurons aren’t especially densely connected, and modern DL models approach human-level neural connectivity (~10,000 connections per neuron). On the other hand, in terms of total number of neurons, neural network models have been very small until recently, and have doubled in size every ~2.4 years since the first multilayer perceptron was implemented. This trend predicts that DL models will have the same number of neurons as in a human brain by the year 2050 (of course, there is no telling how this trend will continue). Today’s DL models have about as many neurons as in the brains of relatively primitive vertebrates.
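The 2050 projection above is just compound doubling. Here is a back-of-the-envelope sketch; the unit counts are my own rough assumptions for illustration, not figures from the book:

```python
import math

# Back-of-the-envelope extrapolation of the model-size doubling trend
neurons_now = 1e7          # rough order of magnitude of a large circa-2016 network (assumption)
human_neurons = 8.6e10     # approximate number of neurons in a human brain
doubling_period = 2.4      # years per doubling, per the book's estimate

doublings = math.log2(human_neurons / neurons_now)
years_until_parity = doubling_period * doublings
print(round(2016 + years_until_parity))  # -> 2047 under these illustrative numbers
```

Small changes to the assumed starting size move the answer by several years, which is why the book's figure is stated as a rough prediction rather than a forecast.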

The increase in model size can be attributed in part to faster CPUs, the creation of general-purpose GPUs and the advances made with GPU technology, faster network connectivity, and better software for distributed systems.

Chapter 1.2.4: Increasing Accuracy, Complexity, and Real-World Impact

Deep learning has consistently improved in recognition and prediction tasks since the 1980s, and has been successfully applied to many disparate domains.

Modern computer vision deep learning applications can operate over large, high-resolution photos which don’t have to be cropped near the objects we want to recognize. This is a vast improvement over the pre-processing necessary for neural network vision problems studied in the 80s. Today, deep learning models are trained to discriminate between at least 1,000 different categories of objects. The ILSVRC competition is a good example of computer vision advancement: Convolutional neural networks greatly improved over the previous state of the art in 2012, and have been winning the competition each year since.

Error rates for speech recognition stagnated around the year 2000, but the introduction of deep learning caused a sudden rejuvenation of interest and vastly improved error rates. Deep learning has also had big successes in pedestrian detection, image segmentation, and traffic sign classification, all of which could be useful for the upcoming self-driving car technology.

The scale, accuracy, and complexity of problems solved by deep learning are on the rise. DL models have learned to output entire sequences of characters transcribed from images; it was previously thought that this kind of task required labeling each of the individual characters in the sequence! Recurrent neural networks (RNNs) and LSTMs are used to model relationships between sequences and other sequences, which has revolutionized machine translation.

Alex Graves et al. introduced the neural Turing machine (NTM), which learns to read and write arbitrary content to and from memory registers. This machine can learn simple programs from example program behavior, but this technology is only just beginning to be developed.

Another achievement of deep learning is its extension to reinforcement learning (RL). In RL, an agent learns how to behave in some setting via trial and error. Google DeepMind built an RL system based on deep learning which learned to play Atari games, reaching human or even super-human performance on many of them.

The recent advances in deep learning software have also been a driving factor in its surging popularity: Theano, Pylearn2, Torch, DistBelief, Caffe, MXNet, TensorFlow, and more are all recent additions.

Deep learning has also made a difference in other scientific pursuits: Convolutional neural networks for object recognition provide an interesting model of visual processing which neuroscientists may study. Generally, the technology allows other scientists or analytics-inclined professionals to do data processing and prediction at large scales.

In summary, deep learning is a machine learning approach to artificial intelligence which draws on knowledge of brains, statistics, and applied mathematics. It has seen massive growth because of more powerful computers, bigger datasets, and advances in training techniques.

Turing B-Type Machines, Reservoir Computation, Memristors, and Super-Turing Computation

Below is a summary of some concepts I have been discussing with my adviser during this semester. I wrote this very quickly (in the space of about 45 minutes), and meant this to be a quick-and-dirty reference for a meeting we had this morning. I made a few edits before posting.

Why am I posting this? I’ve been too busy to dedicate time to write an original post for this blog, but I wanted to get some writing up here. I know it’s not ideal, but this is how I will break my blogosphere silence.

This document is meant to serve as a summary of the concepts of Turing B-Type machines, reservoir computation, memristors, and super-Turing computation, how they are related and where they diverge.

Turing B-Type Machines

In Intelligent Machinery [1], a report on artificial intelligence written by Turing in 1948, Turing argues against the popular belief that machines cannot exhibit intelligence. In particular, we are interested in Sections 4 – 8, in which Turing describes A-Type and B-Type unorganized machines. An unorganized machine is one which is made up of some kind of standard components in a “rather unsystematic way”, or which is “largely random in its construction”. It is built by creating some number N of two-terminal units and, for each unit, choosing two units at random from 1, …, N whose output terminals are connected to its input terminals. All units are synchronized on a global clock, and each time step is referred to as a “moment”. In this way, Turing’s unorganized machines operate in discrete time.

Each unit may have a state drawn from {0, 1}, which at any moment is determined by the signals incoming on its terminals from the previous moment; these are multiplied together and subtracted from 1 (this is the NAND operation, which constitutes a complete Boolean basis). Turing’s A-type machines eventually exhibit periodic behavior, and this period cannot exceed 2^N moments, although the initial configuration has some bearing on how long it takes for the machine to fall into this oscillation, and the period may be highly nontrivial for large N. Turing claims his A-type machines to be the simplest model of a nervous system with a random arrangement of neurons. Turing’s B-Type machines, on the other hand, have intermediate connections which either (1) convert all signals to “1”, (2) pass all signals through with an interchange of “0” and “1”, or (3) alternate between conditions (1) and (2). This construction removes the guarantee that all machine behavior is eventually periodic.
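As a toy illustration (my own sketch, not from Turing’s report; the function names are hypothetical), here is a minimal simulation of an A-type machine built from randomly wired NAND units, which also checks the claim that the behavior becomes periodic within 2^N moments:

```python
import random

def make_a_type(n, seed=0):
    """Randomly wire n two-input NAND units: each unit's inputs are
    connected to the outputs of two units chosen at random from 1..n."""
    rng = random.Random(seed)
    return [(rng.randrange(n), rng.randrange(n)) for _ in range(n)]

def step(state, wiring):
    # Each unit's next state is the NAND of its two incoming signals
    # from the previous moment: 1 - a*b.
    return tuple(1 - state[a] * state[b] for a, b in wiring)

def find_period(state, wiring):
    """Iterate moments until a configuration repeats; with n units the
    machine must fall into a cycle of length at most 2**n."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, wiring)
        t += 1
    return t - seen[state]
```

With only five units the period is tiny; Turing’s point is that for large N the eventual oscillation can take a long time to reach and can be highly nontrivial.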

Turing was interested in performing “paper interference”, particularly with the B-Type machines, by modifying the behavior of the intermediate connections. Turing defines a relative term, “modifiable”, used to describe machines whose behavior can be altered radically by interference. He also speaks of a successor relation between machines and their modifications, as a differently behaving machine typically emerges from a nontrivial modification. Turing goes on to describe a curriculum for “educating” a machine: “…we should begin with a machine with very little capacity to carry out elaborate operations or to react in a disciplined manner to orders (taking the form of interference). Then by applying appropriate interference, mimicking education, we should hope to modify the machine until it could be relied on to produce definite reactions to certain commands.” So, starting with little capacity and gradually increasing the complexity of the machine is how Turing envisaged the process of machine learning.

Finally, Turing is interested in supplying the correct initial conditions, or the correct sequence of signals, fed to the machine via just a few independent input units. He calls this process the “organizing” of the machine; we can think of it as finding the input sequence which causes the machine to give the output we would like. This seems quite similar to programming a standard computer.


[1] Intelligent Machinery

Reservoir Computation

Reservoir computation, often seen as an extension of the artificial neural network paradigm, is built around a random, high-dimensional dynamical system (the “liquid” in the liquid state machine literature [2]) which maps low-dimensional inputs to high-dimensional outputs [1]. These outputs are typically interpreted via linear read-out functions, which are fit to the dynamical system’s output with a regression method. These “computers” can be trained in a supervised, reinforcement, or unsupervised manner. Liquid state machines and echo state networks are two subclasses of machines within this computational paradigm. It is argued that reservoir computing can replicate properties of biological computation (i.e., brain activity), yet it does not provide any explanation of how the brain works. Moreover, there is in general no way to control the behavior of the randomly instantiated nonlinear system, and no easy way to dissect what computations the dynamical system is performing.
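To make this concrete, here is a hedged sketch of an echo state network, the simplest flavor of reservoir computing: the random reservoir is fixed, and only a linear readout is fit by ridge regression. The task (one-step-ahead prediction of a sine wave) and all constants are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative task: one-step-ahead prediction of a sine wave.
T, n_res = 500, 100
u = np.sin(np.linspace(0, 20 * np.pi, T + 1))

# The "reservoir": a fixed, random dynamical system. It is never trained.
W_in = rng.uniform(-0.5, 0.5, (n_res, 1))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))  # spectral radius < 1 for fading memory

# Drive the reservoir with the low-dimensional input and record its
# high-dimensional states.
x = np.zeros(n_res)
states = []
for t in range(T):
    x = np.tanh(W_in[:, 0] * u[t] + W @ x)
    states.append(x)
X = np.array(states)

# Only the linear read-out is fit, via ridge regression, to map
# reservoir states to the next input value.
y = u[1:T + 1]
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)
pred = X @ W_out
```

After discarding an initial washout period, the linear readout tracks the target closely even though none of the recurrent weights were ever adjusted.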

[1] A Comparative Study of Reservoir Computing for Temporal Signal Processing

[2] Liquid State Machines: Motivation, Theory, and Applications


Memristors

A memristor is a two-terminal passive circuit element whose electrical resistance varies nonlinearly with the current through the device. That is, its resistance is not constant but depends on the history of current flow through the device. This is a consequence of the device’s non-volatility: it “remembers” the amount of flow in a particular direction through itself; hence the term memristor, or memory-resistor.

These devices were used in conjunction with a random “graph” of silver cation deposits on a circuit board to demonstrate an instantiation of the highly nonlinear dynamical system needed for reservoir computing. The random graph supplies the random inter-connectedness of the system, while the memristive behavior supplies the nonlinearities, namely in the form of observable metastable states, the representation of time-series inputs as high-dimensional trajectories in the system, and power-law scaling of spatiotemporal effects.

These devices are exciting because, unlike standard electronic components, they output values along a continuum rather than from a set of discrete states (as with the transistor’s {0, 1}). This seems like an interesting avenue for research, since the brain does not appear to operate in discrete states but along a continuum, and the computations occurring at neuronal membranes seem to be highly nonlinear and continuous.
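A rough sense of this “memory” can be had from a linear-drift memristor model in the spirit of Strukov et al.; the sketch below is a heavily simplified toy (the class and all constants are illustrative choices of mine, not a faithful device model):

```python
class Memristor:
    """Toy linear-drift memristor: resistance interpolates between r_on
    and r_off as an internal state w in [0, 1] drifts with the net
    charge that has flowed through the device."""

    def __init__(self, r_on=100.0, r_off=16000.0, k=1e4, w=0.5):
        self.r_on, self.r_off, self.k, self.w = r_on, r_off, k, w

    def resistance(self):
        return self.w * self.r_on + (1.0 - self.w) * self.r_off

    def apply_voltage(self, v, dt=1e-3):
        i = v / self.resistance()            # instantaneous current
        self.w += self.k * i * dt            # state drifts with charge flow
        self.w = min(1.0, max(0.0, self.w))  # state is bounded
        return i

m = Memristor()
r_start = m.resistance()
for _ in range(1000):      # sustained positive bias drives resistance down
    m.apply_voltage(1.0)
r_low = m.resistance()
for _ in range(2000):      # reversing the bias drives it back up
    m.apply_voltage(-1.0)
r_high = m.resistance()
```

The same applied voltage produces a different current depending on the device’s history, which is exactly the non-volatility described above.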


[1] Memristor – The Missing Circuit Element

[2] Emergent Criticality in Complex Turing B-Type Atomic Switch Networks

[3] Universal Memcomputing Machines

[4] Digital Memcomputing Machines

Super-Turing Computation

Super-Turing computation (or hypercomputation) refers to models of computation which can compute problems that are not Turing-computable, or which can compute problems in provably less time or space than a Turing machine. For example, if we allow real numbers as the weights of an artificial neural network, such networks can compute arbitrary functions given unlimited time, and the class P/poly when restricted to polynomial time. This, however, is due to the fact that the weights are real numbers (accessed to linear precision) which can be harnessed to decide uncomputable problems.
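The mechanism behind this result can be glimpsed with a toy version of the Siegelmann–Sontag style of encoding: an arbitrary (even uncomputable) bit string can be packed into the base-4 expansion of a single “weight” and recovered one bit per step using only simple rescaling operations. The helper names below are my own, and this is only a sketch of the encoding, not of the full construction:

```python
from fractions import Fraction

def encode(bits):
    """Pack a bit string into one 'weight' in (0, 1) via its base-4
    expansion; using only digits 1 and 3 keeps encodings well
    separated, so truncation never flips a bit."""
    w = Fraction(0)
    for i, b in enumerate(bits, start=1):
        w += Fraction(2 * b + 1, 4 ** i)
    return w

def decode(w, n):
    """Recover the first n bits: each step reads the leading base-4
    digit and rescales, i.e., linear-precision access to the weight."""
    out = []
    for _ in range(n):
        digit = int(w * 4)
        out.append((digit - 1) // 2)
        w = w * 4 - digit
    return out
```

If the advice string for a P/poly language is stored this way, a polynomial-time computation only ever touches polynomially many leading digits of the weight, which is the sense in which linear precision suffices.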

Professor Hava Siegelmann is responsible for the above result, and for showing that a number of models are super-Turing equivalent to it. For example, a number of natural dynamical systems or idealized chaotic systems can only be described within the super-Turing paradigm. The Blum-Shub-Smale model of computation uses real numbers and discontinuous, real-valued operations to achieve super-Turing computational power; Siegelmann et al. constrained this model to use only continuous operations. Another way to describe the super-Turing paradigm is not as a move from digital to analog computation, but rather from static to adaptive programs: interleaving the computation with weight adaptation on a continuum accounts for the super-Turing computational power.

It has been argued that memcomputing machines are strictly more powerful than Turing machines; i.e., that memcomputing machines are a super-Turing model of computation. This has yet to be shown, for lack of real-world experiments (digital simulations exist, but are, of course, not indicative of continuous computational power), but the inclusion of nonlinear, continuous memristive elements suggests that these machines may have access to uncomputable real numbers (granted these are admitted by nature), which could grant non-trivial time and space improvements and may allow the computation of Turing-uncomputable functions.


[1] Neural Networks and Analog Computation: Beyond the Turing Limit

[2] Computation Beyond the Turing Limit

[3] Discontinuities in Recurrent Neural Networks

[4] Universal Memcomputing Machines

How Are These Related?

Turing B-Type vs. Reservoir Computation: It has been proposed in [1] that “complex Turing B-type atomic switch networks” are an example of a Turing B-Type machine: a circuit board randomly constructed via silver cation deposition and coupled with memristive circuit elements is both a type of “unorganized machine” as Turing describes and, at the same time, the nonlinear dynamical system, or “reservoir”, involved in reservoir computing. Turing never described a method of reading the output of his organized machines, other than reading the entire state of the network of components. A natural extension to Turing’s framework would be to fit linear transformations to the output via regression, as in reservoir computation. However, a fundamental difference between the two approaches is that the dynamical system in reservoir computing is typically fixed, whereas Turing’s B-Type machines are to be organized (or at least, the inputs to the machine determine the organization).

Turing B-Type vs. Memristors: It is not clear how these concepts are related. However, it is natural to augment the operation of the Turing B-Type machine with memristor elements which supply a fading memory to the machine, exciting or inhibiting certain actions and thereby influencing organization based on the history of inputs.

Turing B-Type vs. super-Turing: Turing proposed these unorganized machines as an extension to his standard, static computation model. He emphasized the relationships between these machines and biological systems, such as the human cortex, and argued that with proper training, human brains or unorganized machines could emulate universal machines [2]. It is not clear whether he realized they could obtain super-Turing power with certain modifications (for example, using polynomial-time computable real numbers to describe the states of the components of the machine, or polynomial-time computable real numbers to organize the machine).

Reservoir Computation vs. Memristors: In [1], we see that the memristive elements involved in the reservoir supply the spatiotemporal effects that constitute a sufficiently nonlinear system for reservoir computing. These allow for the occurrence of metastable states, near-criticality, and power-law scaling of space and time effects, which is reminiscent of brain dynamics, though perhaps not explanatory of them.

Reservoir Computing and super-Turing: Again, it is unclear how these two concepts are related, but one can imagine that, when supplied with a “reservoir” or “liquid” which is continuous in time and space, one might be able to compute super-Turing functions as long as the readout units are able to capture these continuous effects.

Memristors and super-Turing: As suggested before, since memristors are intrinsically nonlinear elements and are continuous in their activation / output, it is possible that they might grant super-Turing computational power if harnessed in the right way, namely, to implement adaptive or self-organizing circuits or programs which take advantage of real-valued parameters / weights.


[1] Emergent Criticality in Complex Turing B-Type Atomic Switch Networks

[2] Intelligent Machinery

My Lens

Recently, I’ve been thinking about what it means to see the world through the eyes of another. This is a common experience, but after having tossed it around for a few days, I have a few things to say about it. In particular, from a brains / minds / computers point of view, our particular “lens” on the world, that is, the way things appear to us, is our particular program. Namely, we humans all have more or less the same equipment (a brain!), but our individually unique life experiences shape the way these brains develop and, ultimately, process. You can think of this in terms of the “nature vs. nurture” paradigm, where we are given a typically healthy brain which is more or less as well equipped as the next (nature), and our experiences and intentions allow us to change our lens in order to become better suited to the things that life might throw our way (nurture). What is really interesting is when minds are able to obtain a highly general lens, developed from salient bits of information encountered in experience, which allows for intelligent behavior in many completely new terrains. As a simple example, although a typical education lasts some 15-20 years, the demonstrable quality of life gained from this highly structured experience is obvious from the disparity between those who are afforded such an education and those who are not. At least, this is obvious on average: think just a bit about world history, and how civilizations which focus on education and / or self-development experience much greater prosperity on average.

I’m sure you’ve heard of neuroplasticity. This is what I am appealing to when I talk about the nurture aspect of minds, in that it allows us to develop certain neuronal interconnections in our brain which allow us to adapt to our experiences. This is the simplest form of “biohacking”, where one takes control of one’s environment, or input (in this case, one’s conscious experience) in order to accomplish some sort of goal, such as a high measure of intelligence or psychological well-being. I like to think of humans as behaving as though running a really complicated, perhaps stochastic, computer program. The key difference between our programs and those you might find on a conventional computer is our ability to modify ours within the feedback loop of our conscious (and sub-conscious) experiences.

In computational complexity theory, models with some mechanism of plasticity, or updates to the model’s underlying program, confer strictly greater computational power than, say, a machine whose program stays static in time. In particular, with a time bound governed by a polynomial function of the size of its input, an adaptable model can compute the complexity class P/poly – the class of problems solvable with polynomial time and non-uniform polynomial-length advice strings. In contrast, our static machine model can only compute the class P – the class of problems solvable with polynomial time – under the same restrictions. For this reason (the computational power increase), professor Hava Siegelmann termed the style of adaptable, biologically inspired computation “super-Turing”. As referred to in my previous post on Alan Turing, the Turing machine is a prime example of a model which runs a fixed program, no matter the point in time, and so, the super-Turing moniker is a fitting description of adaptable programs. Actually, professor Siegelmann’s work is partly inspired by Turing’s later writings on intelligent machinery and artificial intelligence, and her artificial recurrent neural network offering can be seen as a fulfillment of Turing’s call for a computer model which better captures the biological style of computation.

So, what does this tell us about ourselves? Our minds?

Maybe nothing, but I like to think about what these ideas imply if they accurately describe brain-like computation. For example, it is interesting to ask: can a machine with a static program compute the same problems that one with an adaptable program can? If so, can it do so in real time (that is, is there no combinatoric blowup necessarily associated with simulating an adaptable machine with a static one)? It is exciting to think about the possibility that static machines cannot do such a thing, because this would validate the quest for strong artificial intelligence in a big way. Machine learning and pattern recognition become not only interesting problem-solving techniques, but strictly more powerful methods of computation. So, the assumption that adaptable computation strictly contains static computation implies that research into models like neural networks or graphical models is extremely valuable and will eventually pay off in a big way, perhaps in a way similar to our discovery and implementation of these robust mechanical machines in the first place.

Under the converse assumption, that (achievable) adaptable computation is no more powerful than that done by classical, static computers, we still have that adaptable computation is more similar to biological processes, and therefore we can use it to better solve problems that might exist in nature. For example, taking inspiration from cell biology, designers of computational manufacturing processes (say, robotics on a manufacturing line) can increase the efficiency and robustness of mechanical solutions. By studying the way human brains use energy and process information, we can still develop programs which exhibit highly intelligent and efficient behavior, and perhaps even remove the need to hand-set a great deal of the logical operation of our machines – brains often self-organize and reinforce to perform optimally on a task, or to achieve certain harmonious electrical and chemical conditions!

One thing to keep in mind is that computation is necessarily grounded in physical reality. If you believe, like me, that everything is in a constant state of flux, and you realize that classical computers are only capable of behaving based on a fixed and finite program, then it makes perfect sense to take the intuitive leap in saying that they cannot truly capture the essence of natural information processing. Namely, in a system that is responsive and in harmony with its environment, there is the possibility for more powerful and more complete computational ability, and solutions can be found in a self-organizing, convergent, continuous, and effortless manner. Think of how the universe is; there are no discrete computation steps involved, but rather, the solution to its “computation” is simply the way the universe evolves. Is it possible to incorporate such a being-ness, a one-ness with the universe, into a machine? The recent work by Google towards a quantum description of chemistry gives me hope that incorporating more of physical reality into computation devices (in this case, quantum states) allows for better predictive capacity for quantum reality. In a way, it seems like in order to represent certain parts of reality in a given computation device, we need access to some physical phenomena that capture the essence of that part of reality. So, for human brain simulation, will static machines suffice? Or do we need to also incorporate plasticity? Or, going down another level, must we incorporate quantum effects? We must also keep in mind that the level of description that is sufficient will depend on our tolerance for inaccuracy, but it is evident that we need at least a well-approximating level of description in order for our experiments to have any sort of predictive capacity.

Anyways, this brings me back to my lens of the world. It is a little peculiar to realize that the lens of the cognitive scientist is somewhat self-referential: one sees the world and life in terms of information and computation, and importantly, in terms of consciousness. So, we end up thinking about thinking, and thinking about thinking about thinking, and so on. Though I am not a cognitive scientist per se, I do enjoy framing my world as one does, since I believe it is a very useful philosophical standpoint. Mainly, I like to abstract away my experience from what one might expect from a “normal” human experience. Why do I think in the patterns that I do? Why does the old brain often dominate newer, more complex brain structures? What does this have to do with computation, with physical reality, with evolution, with spirituality, even? Some of these questions aren’t even answerable, if not ill-posed. But, I do enjoy being a “deep thinker” – something which can be attributed to many, if not most, erroneously or not – in that I love to see where certain assumptions or viewpoints can take me. And that is why I believe it is important to have a very general lens, or a brain content / structure which is composed of many salient, related, efficient, and necessarily incomplete interconnections and, importantly, the parameters of which are continuously fine-tuned in order to best understand the world I live in.

Now, whose lens may I benefit most strongly from? It depends on your definition or desire of benefit. At this stage in my life, I find I gain the most from the writings of those who have come before me in computer science, artificial intelligence, and cognitive science. But even then, I find myself gaining plenty of generalizable knowledge from selectively choosing from the works of great physicists, mathematicians, philosophers, neuroscientists, psychologists, logicians, etc. Although I hope to make contributions to computer science one day, I know that, by having a more general understanding of the universe, any research or results I might obtain will be all the more meaningful to me, and I will be able to fit them more completely into the fabric of human knowledge.