In this post, we review the probability theory used in ML.
There are two types of uncertainty:
Model uncertainty (also called epistemic uncertainty): uncertainty resulting from the underlying hidden causes or mechanism generating the data.
Data uncertainty (also called aleatoric uncertainty): arises from intrinsic variability. For example, if we know the probability of getting a head for a coin is \(p = 0.5\), we have no model uncertainty, but we still have data uncertainty because we cannot perfectly predict the outcome.
An event is a binary state of the world that either holds or does not hold. The probability of an event \(A\) is denoted as \(P(A)\) with the property:
\[\begin{aligned} 0 \leq P(A) \leq 1 \end{aligned}\]The joint probability of 2 events \(A, B\) both happening:
\[\begin{aligned} P(A \land B) &= P(A, B) \end{aligned}\]If the two events are independent:
\[\begin{aligned} P(A, B) &= P(A)P(B) \end{aligned}\]The probability of event \(A\) or \(B\) happening:
\[\begin{aligned} P(A \lor B) &= P(A) + P(B) - P(A \land B) \end{aligned}\]Note that we cannot condition on an impossible event: \(P(A \mid B)\) is only defined when \(P(B) > 0\).
We say that event A is independent of event B if:
\[\begin{aligned} P(A, B) &= P(A)P(B) \end{aligned}\]We say that events A and B are conditionally independent given event C if:
\[\begin{aligned} P(A, B |C) &= P(A|C)P(B|C) \end{aligned}\]This is also written as:
\[\begin{aligned} A \perp B|C \end{aligned}\]Most events are not independent of each other, but many are conditionally independent once we condition on the relevant intermediate variables.
Technically, a random variable is neither random (it is deterministic) nor a variable (it is a function). But for simplicity, a random variable X can be thought of as a variable whose value is unknown until it is observed.
The set of possible values of X, denoted by \(\mathcal{X}\), is known as the sample space or state space. An event is a set of outcomes from a given sample space.
If the sample space \(\mathcal{X}\) is finite or countably infinite, then X is called a discrete random variable. We define the probability mass function (pmf) as:
\[\begin{aligned} p(x) &\equiv P(X = x)\\ \end{aligned}\] \[\begin{aligned} 0 \leq p(x) \leq 1\\ \end{aligned}\] \[\begin{aligned} \sum_{x \in \mathcal{X}}p(x) = 1 \end{aligned}\]A continuous random variable can take on an uncountably infinite number of values, but we can partition its range into a countable number of intervals.
We define the cumulative distribution function (CDF) as:
\[\begin{aligned} F_{X}(x) \equiv P(X \leq x) \end{aligned}\]And the probability of being in any interval is:
\[\begin{aligned} P(a < X \leq b) &= F_{X}(b) - F_{X}(a) \end{aligned}\]We define the PDF as the derivative of the CDF:
\[\begin{aligned} p(x) \equiv \frac{d}{dx}F_{X}(x) \end{aligned}\]Given the PDF, we can compute the continuous RV being in a finite interval as:
\[\begin{aligned} P(a < X \leq b) &= \int_{a}^{b}p(x)\ dx\\ &= F_{X}(b) - F_{X}(a) \end{aligned}\]For a small interval:
\[\begin{aligned} P(x \leq X \leq x + dx) \approx p(x)\ dx \end{aligned}\]The inverse of the CDF is known as the quantile:
\[\begin{aligned} F_{X}^{-1}(q) &= x_{q}\\ P(X \leq x_{q}) &= q \end{aligned}\]For example, \(\Phi\) is the CDF of a normal distribution, and \(\Phi^{-1}\) is the inverse CDF:
\[\begin{aligned} \Phi^{-1}(0.025) &= -1.96\\ \Phi^{-1}(0.975) &= 1.96 \end{aligned}\]
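As a quick check, scipy's `norm` exposes the Gaussian CDF and its inverse (`norm.cdf`, `norm.ppf`); a minimal sketch:

```python
# Quantiles of the standard normal via the inverse CDF (ppf in scipy).
from scipy.stats import norm

print(norm.ppf(0.025))   # ≈ -1.96
print(norm.ppf(0.975))   # ≈  1.96
print(norm.cdf(1.96))    # ≈  0.975, inverting the quantile
```

Given a joint distribution, we define the marginal distribution of a random variable as: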
\[\begin{aligned} p(X = x) &= \sum_{y}p(X = x, Y = y) \end{aligned}\]We can extend the product rule:
\[\begin{aligned} p(\textbf{x}_{1:D}) &= p(x_{1})p(x_{2}|x_{1})p(x_{3}|x_{1}, x_{2})p(x_{4}|x_{1}, x_{2}, x_{3})\cdots p(x_{D}|\textbf{x}_{1:D-1}) \end{aligned}\]In general, we say \(X_{1}, \cdots, X_{n}\) are independent if:
\[\begin{aligned} p(X_{1}, \cdots, X_{n}) &= \prod_{i = 1}^{n}p(X_{i}) \end{aligned}\]The variance of a shifted and scaled random variable:
\[\begin{aligned} V[aX + b] &= a^{2}V[X] \end{aligned}\]If the \(n\) random variables are independent:
\[\begin{aligned} V\Big[\sum_{i=1}^{n}X_{i}\Big] &= \sum_{i = 1}^{n}V[X_{i}] \end{aligned}\]Contrast this with the expectation of a product of independent variables, which factorizes into a product rather than a sum.
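A quick Monte Carlo sanity check of the sum-of-variances identity (a sketch using numpy; the particular distributions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Three independent variables with known variances 1, 9, and 4.
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(2.0, 3.0, size=n)
x3 = rng.exponential(2.0, size=n)   # variance of Exp(scale=2) is 4

s = x1 + x2 + x3
print(s.var())                           # ≈ 14
print(x1.var() + x2.var() + x3.var())    # matches the analytic sum
```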
The variance of their product (if independent):
\[\begin{aligned} V\Big[\prod_{i=1}^{n}X_{i}\Big] &= E\Big[\Big(\prod_{i}X_{i}\Big)^{2}\Big] - \Big(E\Big[\prod_{i}X_{i}\Big]\Big)^{2}\\ &= E\Big[\prod_{i}X_{i}^{2}\Big] - \Big(\prod_{i}E[X_{i}]\Big)^{2}\\ &= \prod_{i}E[X_{i}^{2}] - \prod_{i}E[X_{i}]^{2}\\ &= \prod_{i}\Big(V[X_{i}] + E[X_{i}]^{2}\Big) - \prod_{i}E[X_{i}]^{2}\\ &= \prod_{i}(\sigma_{i}^{2} + \mu_{i}^{2}) - \prod_{i}\mu_{i}^{2} \end{aligned}\]When we have two or more dependent random variables, we can compute the moments of one given another:
\[\begin{aligned} E[X] &= E_{Y}\Big[E[X|Y]\Big] \end{aligned}\]To show this for a discrete random variable:
\[\begin{aligned} E_{Y}\Big[E[X|Y]\Big] &= E_{Y}\Big[\sum_{x}x p(X=x|Y)\Big]\\ &= \sum_{y}\Big[\sum_{x}x p(X=x|Y=y)\Big]p(Y = y)\\ &= \sum_{x, y}xp(X = x, Y = y)\\ &= E[X] \end{aligned}\]Similarly, the law of total variance states that \(V[X] = E_{Y}\big[V[X|Y]\big] + V_{Y}\big[E[X|Y]\big]\). To show this:
\[\begin{aligned} V[X] &= E[X^{2}] - (E[X])^{2}\\ &= E_{Y}\Big[E[X^{2}|Y]\Big] - \Big(E_{Y}\Big[E[X|Y]\Big]\Big)^{2}\\ &= \Big(E_{Y}\Big[E[X^{2}|Y]\Big] - E_{Y}\Big[E[X|Y]^{2}\Big]\Big) + \Big(E_{Y}\Big[E[X|Y]^{2}\Big] - \Big(E_{Y}\Big[E[X|Y]\Big]\Big)^{2}\Big)\\ &= E_{Y}\Big[V[X|Y]\Big] + V_{Y}\Big[E[X|Y]\Big] \end{aligned}\]An intuitive way to think about this is with a mixture of Gaussians example:
\[\begin{aligned} p(x) = \sum_{y=1}^{K}\pi_{y}N(x|\mu_{y}, \sigma_{y}^{2}) \end{aligned}\]The variance is composed of the mean of the variances of the components plus the variance of the component means.
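A simulation sketch of this decomposition for a two-component mixture (the weights and parameters below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])       # mixture weights
mu = np.array([-2.0, 3.0])      # component means
sigma = np.array([1.0, 0.5])    # component standard deviations

# Sample: pick a component Y, then draw X | Y.
y = rng.choice(2, size=1_000_000, p=pi)
x = rng.normal(mu[y], sigma[y])

# V[X] = E_Y[V[X|Y]] + V_Y[E[X|Y]]
mean_of_var = np.sum(pi * sigma**2)
var_of_mean = np.sum(pi * mu**2) - np.sum(pi * mu)**2
print(x.var())                        # empirical variance
print(mean_of_var + var_of_mean)      # analytic decomposition
```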
Conditional probability of outcome of random variable \(X = x\) given \(Y = y\):
\[\begin{aligned} P(X = x | Y = y) &= \frac{P(X = x, Y = y)}{P(Y = y)} \end{aligned}\]Two random variables \(X, Y\) are independent if the conditional probability is:
\[\begin{aligned} P(X = x | Y = y) &= \frac{P(X = x, Y = y)}{P(Y = y)}\\[5pt] &= P(X = x) \end{aligned}\]In addition, if \(X\) and \(Y\) are independent:
\[\begin{aligned} \frac{P(X = x, Y = y)}{P(Y = y)} &= P(X = x)\\[5pt] P(X = x, Y = y) &= P(X = x) P(Y = y) \end{aligned}\]For a continuous random variable, the mean is defined as:
\[\begin{aligned} E[X] \equiv \int_{\mathcal{X}}x p(x)\ dx \end{aligned}\]If the integral is not finite, the mean is not defined.
For a discrete random variable, the mean is defined as:
\[\begin{aligned} E[X] \equiv \sum_{x \in \mathcal{X}}x p(x) \end{aligned}\]Since the mean is a linear operator:
\[\begin{aligned} E[aX + b] &= aE[X] + b\\ E\Big[\sum_{i = 1}^{n}X_{i} \Big] &= \sum_{i=1}^{n}E[X_{i}] \end{aligned}\]If the random variables are independent:
\[\begin{aligned} E\Big[\prod_{i=1}^{n} X_{i}\Big] &= \prod_{i=1}^{n}E[X_{i}] \end{aligned}\]And if \(X, Y\) are independent, conditioning on \(Y\) does not change the expectation:
\[\begin{aligned} E[X|Y = y] &= \sum_{x}x f(x|y)\\ &= \sum_{x}xf(x)\\ &= E[X] \end{aligned}\]For a function \(g\) of a random variable, \(E[g(X)] = \sum_{x}g(x)f(x)\). For example, if \(g(X) = aX\):
\[\begin{aligned} E[g(X)] &= \sum_{x}g(x)f(x)\\ &= a\sum_{x}x f(x)\\ &= aE[X] \end{aligned}\]We can also show that:
\[\begin{aligned} E[aX + b] &= aE[X] + b\\ E[g_{1}(X) + g_{2}(X)] &= E[g_{1}(X)] + E[g_{2}(X)] \end{aligned}\]Recall the law of iterated expectations:
\[\begin{aligned} E[Y] &= E_{X}\Big[ E[Y|X]\Big] \end{aligned}\]The corresponding decomposition of variance is the law of total variance, which underlies the bias-variance tradeoff.
The derivation:
\[\begin{aligned} \mathrm{Var}(Y) &= \sum_{y}(y - \mu_{Y})^{2}p(y)\\ &= \sum_{y}(y - \mu_{Y})^{2}\sum_{x}p(x, y)\\ &= \sum_{y}(y - \mu_{Y})^{2}\sum_{x}p(y|x)p(x)\\ &= \sum_{x}\sum_{y}(y - \mu_{Y})^{2}p(y|x)p(x)\\ &= \sum_{x}\sum_{y}\Big(y - E[Y|x] + E[Y|x] - \mu_{Y}\Big)^{2}p(y|x)p(x)\\ &= \sum_{x}\sum_{y}\Big(\big(y - E[Y|x]\big)^{2} + \big(E[Y|x] - \mu_{Y}\big)^{2} + 2\big(y - E[Y|x]\big)\big(E[Y|x] - \mu_{Y}\big)\Big)p(y|x)p(x)\\ &= \sum_{x}\sum_{y}\big(y - E[Y|x]\big)^{2}p(y|x)p(x) + \sum_{x}\big(E[Y|x] - \mu_{Y}\big)^{2}p(x) + 0\\ &= E_{X}[\mathrm{Var}(Y|X)] + \mathrm{Var}_{X}\Big(E[Y|X]\Big) \end{aligned}\]The cross term vanishes because \(\sum_{y}\big(y - E[Y|x]\big)p(y|x) = 0\) for every \(x\). Hence:
\[\begin{aligned} \mathrm{Var}(Y) = E_{X}\Big[\mathrm{Var}(Y|X)\Big] + \mathrm{Var}_{X}\Big[E[Y|X]\Big] \end{aligned}\]\(E[Y \vert X = x]\) can be treated as a function \(g(X)\). So we can represent covariance as an expectation:
\[\begin{aligned} \mathrm{Cov}(X, Y) &= \sum_{x}(x - \mu_{X})E[Y|X = x]p(x)\\ &= \sum_{x}(x - \mu_{X})g(X)p(x)\\ &= E_{X}\Big[(X - \mu_{X})E[Y|X]\Big] \end{aligned}\]A special case is where \(E[Y \mid X] = 0\). This implies that the covariance of \(X, Y\) is zero as well.
If the random variable \(X\) is Gaussian/normally distributed with mean \(\mu\) and variance \(\sigma^{2}\), written \(X \sim N(\mu, \sigma^{2})\), its pdf is:
\[\begin{aligned} p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \end{aligned}\]A weighted sum of (jointly) Gaussian random variables is Gaussian distributed as well:
\[\begin{aligned} X_{1} &\sim N(\mu_{1}, \sigma_{1}^{2})\\ X_{2} &\sim N(\mu_{2}, \sigma_{2}^{2})\\ Y &= a_{1}X_{1} + a_{2}X_{2}\\ Y &\sim N(\mu_{Y}, \sigma_{Y}^{2})\\ \mu_{Y} &= a_{1}\mu_{1} + a_{2}\mu_{2}\\ \sigma_{Y}^{2} &= a_{1}^{2}\sigma_{1}^{2} + a_{2}^{2}\sigma_{2}^{2} + 2a_{1}a_{2}\mathrm{Cov}(X_{1}, X_{2})\\ &= a_{1}^{2}\sigma_{1}^{2} + a_{2}^{2}\sigma_{2}^{2} + 2a_{1}a_{2}\sigma_{12} \end{aligned}\]The CDF of the Gaussian is defined by:
\[\begin{aligned} \Phi(y; \mu, \sigma^{2}) &\equiv \int_{-\infty}^{y}N(z|\mu, \sigma^{2})\ dz\\ z &= \frac{y - \mu}{\sigma}\\ \Phi(y; \mu, \sigma^{2}) &= \frac{1}{2}\Big[1 + \text{erf}\Big(\frac{z}{\sqrt{2}}\Big)\Big] \end{aligned}\]erf() is the error function:
\[\begin{aligned} \text{erf}(u) &\equiv \frac{2}{\sqrt{\pi}}\int_{0}^{u}e^{-t^{2}}\ dt \end{aligned}\]The inverse of the CDF is also known as the probit function. We can get quantiles by using the probit function. For example, to get the central 95% interval:
\[\begin{aligned} \alpha &= 0.05\\ \Big(\Phi^{-1}\big(\tfrac{\alpha}{2}\big), \Phi^{-1}\big(1 - \tfrac{\alpha}{2}\big)\Big) &= (\Phi^{-1}(0.025), \Phi^{-1}(0.975))\\ &= (-1.96, 1.96) \end{aligned}\]The PDF is defined as the derivative of the CDF:
\[\begin{aligned} p(y) &\equiv \frac{d}{dy}F(y)\\ N(y|\mu, \sigma^{2}) &\equiv \frac{1}{\sqrt{2\pi \sigma^{2}}}e^{-\frac{1}{2\sigma^{2}}(y - \mu)^{2}} \end{aligned}\]Where \(\sqrt{2\pi \sigma^{2}}\) is the normalization constant.
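As a sanity check, the formula matches scipy's implementation (a minimal sketch):

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(y, mu, sigma):
    """Evaluate N(y | mu, sigma^2) directly from the formula."""
    return np.exp(-((y - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(normal_pdf(1.0, 0.0, 2.0))          # direct formula
print(norm.pdf(1.0, loc=0.0, scale=2.0))  # scipy agrees
```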
Consider a conditional density model of the form:
\[\begin{aligned} p(y|\textbf{x};\boldsymbol{\theta}) &= N(y|f_{\mu}(\textbf{x};\boldsymbol{\theta}), f_{\sigma}(\textbf{x};\boldsymbol{\theta})^{2}) \end{aligned}\]For linear regression, we would assume the mean is a linear function of the input, and the variance is constant (homoscedastic regression):
\[\begin{aligned} p(y|\textbf{x};\boldsymbol{\theta}) &= N(y|\textbf{w}^{T}\textbf{x} + b, \sigma^{2})\\ \boldsymbol{\theta} &= (\textbf{w}, b, \sigma^{2}) \end{aligned}\]It is also possible to make the variance depend on the input (heteroskedastic regression):
\[\begin{aligned} p(y|\textbf{x};\boldsymbol{\theta}) &= N(y|\textbf{w}_{\mu}^{T}\textbf{x} + b, \sigma_{+}(\textbf{w}_{\sigma}^{T}\textbf{x}))\\ \boldsymbol{\theta} &= (\textbf{w}_{\mu}, \textbf{w}_{\sigma})\\ \sigma_{+}(a) &= \log(1 + e^{a}) \end{aligned}\]\(\sigma_{+}\) is known as the softplus function; it ensures the predicted variance is non-negative.
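A minimal numpy sketch of such a heteroskedastic predictor (the function and parameter names here are hypothetical, not from any particular library):

```python
import numpy as np

def softplus(a):
    # log(1 + e^a), computed stably via log-add-exp
    return np.logaddexp(0.0, a)

def heteroskedastic_params(x, w_mu, b, w_sigma):
    """Return (mean, scale) of p(y | x) for the model above."""
    mu = w_mu @ x + b
    sigma = softplus(w_sigma @ x)   # softplus keeps the scale positive
    return mu, sigma

# Hypothetical parameters for a 2D input.
mu, sigma = heteroskedastic_params(
    np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1, np.array([0.2, 0.4]))
print(mu, sigma)
```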
Here are a few reasons why the Gaussian distribution is so common: its two parameters (mean and variance) are easy to interpret; the central limit theorem tells us that sums of independent random variables are approximately Gaussian; it makes the fewest assumptions (has maximum entropy) among all distributions with a given mean and variance; and its simple mathematical form leads to tractable computations.
As the variance of a Gaussian distribution goes to 0, the distribution approaches a Dirac delta function:
\[\begin{aligned} \lim_{\sigma \rightarrow 0}N(y|\mu, \sigma^{2}) \rightarrow \delta(y - \mu) \end{aligned}\]The Dirac delta function is defined as:
\[\begin{aligned} \delta(x) &\equiv \begin{cases} \infty & x = 0\\ 0 & x \neq 0 \end{cases}\\ \int_{-\infty}^{\infty}\delta(x)\ dx &= 1 \end{aligned}\]If we shift the mean to \(y\):
\[\begin{aligned} \delta_{y}(x) &\equiv \begin{cases} \infty & x = y\\ 0 & x \neq y \end{cases}\\ \end{aligned}\]Note we have that:
\[\begin{aligned} \delta_{y}(x) &= \delta(x - y) \end{aligned}\]The delta function distribution satisfies the following sifting property:
\[\begin{aligned} \int_{-\infty}^{\infty}f(x)\delta(x - y)\ dx &= f(y) \end{aligned}\]Bayes' rule lets us infer an unknown/hidden quantity \(H\) given some observed data \(Y = y\):
\[\begin{aligned} P(H = h | Y = y) &= \frac{P(H = h)P(Y = y|H = h)}{P(Y = y)} \end{aligned}\]The above can be derived as follows:
\[\begin{aligned} P(h, y) &= P(y, h)\\ P(h|y)P(y) &= P(y|h)P(h)\\ P(h|y) &= \frac{P(y|h)P(h)}{P(y)} \end{aligned}\]\(P(H)\) represents what we know about the possible distribution of \(H\) before we see any data.
\(P(Y\mid H = h)\) represents the distribution over possible outcomes of \(Y\) we expect to see if \(H = h\). This is also known as the observation distribution. When evaluated at a point, \(P(Y = y\mid H = h)\) is called the likelihood. Note that the likelihood, viewed as a function of \(h\), is not a probability distribution, as it does not sum to 1.
Multiplying the likelihood by the prior gives the unnormalized joint distribution \(P(H = h, Y = y)\).
To convert the unnormalized joint distribution into a normalized distribution, we divide it by the marginal likelihood \(P(Y = y)\):
\[\begin{aligned} P(Y = y) &= \sum_{h \in \mathcal{H}}P(H = h)P(Y = y| H = h)\\ &= \sum_{h \in \mathcal{H}}P(H = h, Y = y) \end{aligned}\]Using Bayes' rule to update a distribution over unknown/hidden values \(P(H)\) given new observed data is called Bayesian inference or posterior inference.
Let \(H\) indicate whether you have contracted Covid, and let \(Y\) be the outcome of a test. Note the following measures:
\[\begin{aligned} \text{Sensitivity (True Positive Rate)} &= P(Y = 1| H = 1)\\[5pt] \text{Specificity (True Negative Rate)} &= P(Y = 0| H = 0)\\[5pt] \text{False Positive Rate} &= 1 - \text{Specificity} = P(Y = 1 \mid H = 0) \end{aligned}\]In this case, the TPR is our likelihood. The posterior probability of having Covid given a positive test is:
\[\begin{aligned} P(H = 1|Y = 1) &= \frac{P(Y = 1|H = 1)P(H = 1)}{P(Y=1|H=1)P(H = 1) + P(Y=1|H = 0)P(H = 0)}\\ &= \frac{\text{TPR} \times \text{Prior}}{\text{TPR} \times \text{Prior} + \text{FPR} \times ( 1 - \text{Prior})} \end{aligned}\]
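A small sketch of this computation (the prevalence and test characteristics below are made-up illustrative numbers):

```python
def posterior_given_positive(prior, tpr, fpr):
    """P(H = 1 | Y = 1) via Bayes' rule."""
    evidence = tpr * prior + fpr * (1 - prior)
    return tpr * prior / evidence

# With a 1% prior, even a sensitive test yields a modest posterior.
print(posterior_given_positive(prior=0.01, tpr=0.95, fpr=0.05))  # ≈ 0.16
```

Next, consider the Monty Hall problem. Let \(H_{i}\) be the hypothesis that the prize is behind door \(i\). Our prior distribution would be: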
\[\begin{aligned} P(H_{1}) &= P(H_{2})\\ &= P(H_{3})\\ &= \frac{1}{3} \end{aligned}\]Let \(Y\) be the door that the host opens. We pick door 1, so \(Y\) can only be 2 or 3 (the host never opens our door or the prize door). The likelihood \(P(Y\mid H_{i})\) depends on which door the car is behind:
\[\begin{aligned} P(Y = 2|H_{1}) &= \frac{1}{2}\\ P(Y = 3|H_{1}) &= \frac{1}{2}\\ P(Y = 2|H_{2}) &= 0\\ P(Y = 3|H_{2}) &= 1\\ P(Y = 2|H_{3}) &= 1\\ P(Y = 3|H_{3}) &= 0 \end{aligned}\]The posterior probability distribution given \(Y=3\) (the same result for \(Y=2\)):
\[\begin{aligned} P(Y = 3) &= \frac{1}{2} \times \frac{1}{3} + 1 \times \frac{1}{3} + 0 \times \frac{1}{3}\\ &= \frac{1}{6} + \frac{1}{3}\\ &= \frac{1}{2}\\[5pt] P(H_{1}|Y = 3) &= \frac{\frac{1}{2} \times \frac{1}{3}}{P(Y = 3)}\\ &= \frac{1}{3}\\[5pt] P(H_{2}|Y = 3) &= \frac{1 \times \frac{1}{3}}{P(Y = 3)}\\ &= \frac{2}{3}\\[5pt] P(H_{3}|Y = 3) &= \frac{0 \times \frac{1}{3}}{P(Y = 3)}\\ &= 0 \end{aligned}\]As you can see, the probability of winning by switching is twice the probability of winning by staying.
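A quick simulation sketch confirms the result (assuming the standard rules: the host always opens a non-prize, non-picked door):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
prize = rng.integers(0, 3, size=n)   # door hiding the car
pick = np.zeros(n, dtype=int)        # we always pick door 0

# Switching wins exactly when our initial pick was wrong,
# because the host has eliminated the other losing door.
print((prize == pick).mean())   # ≈ 1/3, staying
print((prize != pick).mean())   # ≈ 2/3, switching
```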
Probability theory is concerned with predicting a distribution over outcomes \(y\) given knowledge of the state of the world \(h\). For example, given a fair coin (\(h\)), the probability of heads is \(p(y = \text{heads} \mid h) = 0.5\). In contrast, inverse probability is concerned with inferring the state of the world \(h\) from observations of outcomes. Many such problems are ill-posed, for example inferring a 3D shape from 2D images, or understanding speech spoken by a speaker. To tackle such problems, we specify the forward model (likelihood) \(p(y \mid h)\) together with a prior \(p(h)\) that downweights implausible states.
The Bernoulli distribution is used to model binary events. Let \(Y = 1\) denote the event that a coin lands heads.
\[\begin{aligned} Y &\sim \text{Ber}(\theta)\\ \text{Ber}(y|\theta) &= \begin{cases} 1 - \theta & y = 0\\ \theta & y = 1 \end{cases}\\ p(Y = 1) &= \theta\\ p(Y = 0) &= 1 - \theta \end{aligned}\]We can also write it as:
\[\begin{aligned} \text{Ber}(y|\theta) \equiv \theta^{y}(1 - \theta)^{1 - y} \end{aligned}\]The Bernoulli distribution is a special case of the binomial distribution with \(N=1\). Suppose we observe a set of \(N\) Bernoulli trials of coin tosses, and let \(x\) be the number of heads. The distribution of \(x\) is:
\[\begin{aligned} \text{Bin}(x|N, \theta) &\equiv {N\choose x}\theta^{x}(1 - \theta)^{N - x}\\ {N\choose x} &\equiv \frac{N!}{(N-x)!x!} \end{aligned}\]Given a binary outcome variable \(y \in \{0, 1\}\) and a vector of inputs \(\textbf{x}\), we can use a conditional distribution of the form:
\[\begin{aligned} p(y|\textbf{x}; \boldsymbol{\theta}) &= \text{Ber}(y|\sigma(f(\textbf{x}; \boldsymbol{\theta}))) \end{aligned}\]And \(\sigma()\) is the sigmoid or logistic function:
\[\begin{aligned} \sigma(x) \equiv \frac{1}{1 + e^{-x}} \end{aligned}\]\(x\) is known as the log-odds:
\[\begin{aligned} \log\Big(\frac{p}{1 - p}\Big) &= \log\Big(\frac{\frac{e^{x}}{1+e^{x}}}{1 - \frac{e^{x}}{1+e^{x}}}\Big)\\ &= \log\Big(\frac{e^{x}}{1 + e^{x}}\times \frac{1 + e^{x}}{1}\Big)\\ &= \log(e^{x})\\ &= x \end{aligned}\]The inverse of the sigmoid/logistic function is the logit function:
\[\begin{aligned} \sigma^{-1}(p) &= \mathrm{log}\Big(\frac{p}{1 - p}\Big) \end{aligned}\]The logistic regression model has the form:
\[\begin{aligned} p(y|\textbf{x}; \boldsymbol{\theta}) &= \text{Ber}(y|\sigma(\textbf{w}^{T}\textbf{x} + b))\\ p(y = 1|\textbf{x}; \boldsymbol{\theta}) &= \sigma(\textbf{w}^{T}\textbf{x} + b)\\ &= \frac{1}{1 + e^{-(\textbf{w}^{T}\textbf{x} + b)}} \end{aligned}\]
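A minimal sketch of the predictive probability under this model (the weights here are hypothetical):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(x, w, b):
    """p(y = 1 | x) under logistic regression."""
    return sigmoid(w @ x + b)

w, b = np.array([2.0, -1.0]), 0.5
print(predict_proba(np.array([1.0, 1.0]), w, b))  # sigmoid(1.5) ≈ 0.82
```

The categorical distribution is a discrete probability distribution (with \(C - 1\) independent parameters):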
\[\begin{aligned} \text{Cat}(y|\boldsymbol{\theta}) &\equiv \prod_{c = 1}^{C}\theta_{c}^{I(y = c)}\\ p(y = c|\boldsymbol{\theta}) &= \theta_{c}\\ 0 \leq \theta_{c} &\leq 1\\ \sum_{c=1}^{C}\theta_{c} &= 1 \end{aligned}\]Similar to how the Bernoulli distribution is a special case of the binomial distribution, the categorical distribution is a special case of the multinomial distribution. Suppose we observe \(N\) categorical trials (rolling a \(C\)-sided die \(N\) times):
\[\begin{aligned} y_{n} &\sim \text{Cat}(. | \boldsymbol{\theta})\\ n &= 1:N \end{aligned}\]Define \(\textbf{s}\) to be a vector that counts the number of times each face shows up:
\[\begin{aligned} s_{c} &\equiv \sum_{n=1}^{N}I(y_{n} = c) \end{aligned}\]The distribution of \(\textbf{s}\) is given by the multinomial distribution:
\[\begin{aligned} \mathcal{M}(\textbf{s}|N, \boldsymbol{\theta}) &\equiv {N\choose s_{1} \cdots s_{C}}\prod_{c=1}^{C}\theta_{c}^{s_{c}}\\ {N\choose s_{1} \cdots s_{C}} &\equiv \frac{N!}{s_{1}!s_{2}!\cdots s_{C}!} \end{aligned}\]For classification, we can use a conditional model of the form:\[\begin{aligned} p(y|\textbf{x}, \boldsymbol{\theta}) &= \text{Cat}(y|f(\textbf{x}; \boldsymbol{\theta}))\\ p(y|\textbf{x}, \boldsymbol{\theta}) &= \mathcal{M}(y|1, f(\textbf{x}; \boldsymbol{\theta})) \end{aligned}\]To turn the outputs of \(f\) into a valid probability vector, we use the softmax function (aka the multinomial logit):
\[\begin{aligned} \mathcal{S}(\textbf{a}) \equiv \begin{bmatrix} \frac{e^{a_{1}}}{\sum_{c' = 1}^{C}e^{a_{c'}}}, \cdots, \frac{e^{a_{C}}}{\sum_{c' = 1}^{C}e^{a_{c'}}} \end{bmatrix} \end{aligned}\] \[\begin{aligned} 0 &\leq \mathcal{S}(\textbf{a})_{c} \leq 1\\ \sum_{c=1}^{C}\mathcal{S}(\textbf{a})_{c} &= 1 \end{aligned}\]If the linear predictor is of the form:
\[\begin{aligned} f(\textbf{x};\boldsymbol{\theta}) &= \textbf{W}\textbf{x} + \textbf{b} \end{aligned}\]Then:
\[\begin{aligned} p(y|\textbf{x};\boldsymbol{\theta}) &= \text{Cat}(y|\mathcal{S}(\textbf{W}\textbf{x} + \textbf{b}))\\ p(y = c|\textbf{x};\boldsymbol{\theta}) &= \frac{e^{a_{c}}}{\sum_{c'=1}^{C}e^{a_{c'}}} \end{aligned}\]where \(\textbf{a} = \textbf{W}\textbf{x} + \textbf{b}\). Calculating the normalized probability naively might cause an overflow:
\[\begin{aligned} p(y = c| \textbf{x}) &= \frac{e^{a_{c}}}{\sum_{c'=1}^{C}e^{a_{c'}}} \end{aligned}\]where \(\textbf{a} = f(\textbf{x};\boldsymbol{\theta})\) are the logits. To avoid numerical problems, note the following identity:
\[\begin{aligned} y &= \sum_{c=1}^{C}e^{a_{c}}\\ &= \sum_{c=1}^{C}e^{a_{c}}e^{m - m}\\ &= e^{m}\sum_{c=1}^{C}e^{a_{c}}e^{-m}\\ &= e^{m}\sum_{c=1}^{C}e^{a_{c} -m}\\ \log y &= m + \log\sum_{c=1}^{C}e^{a_{c} - m}\\ \end{aligned}\]Hence:
\[\begin{aligned} \log\sum_{c=1}^{C}e^{a_{c}} &= m + \log\sum_{c=1}^{C}e^{a_{c} - m}\\ \end{aligned}\]It is common to use \(m = \max_{c} a_{c}\), which ensures the largest exponent is zero, so the sum cannot overflow.
Defining the log-sum-exp (LSE) function to be:
\[\begin{aligned} LSE(\textbf{a}) &\equiv \log \sum_{c=1}^{C}e^{a_{c}}\\ &= m + \log\sum_{c=1}^{C}e^{a_{c} - m}\\ \end{aligned}\]And we can compute the probabilities from the logits:
\[\begin{aligned} p(y = c|\textbf{x}) &= e^{a_{c} - LSE(\textbf{a})} \end{aligned}\]This works because:
\[\begin{aligned} e^{a_{c} - \log \sum_{c'=1}^{C}e^{a_{c'}}} &= \frac{e^{a_{c}}}{e^{\log \sum_{c'=1}^{C}e^{a_{c'}}}}\\ &= \frac{e^{a_{c}}}{\sum_{c'=1}^{C}e^{a_{c'}}}\\ \end{aligned}\]
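A small sketch of the trick in numpy (scipy also provides `scipy.special.logsumexp`):

```python
import numpy as np

def log_sum_exp(a):
    """LSE(a) = m + log(sum(exp(a - m))), with m = max(a) for stability."""
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def log_softmax(a):
    # log p(y = c | x) = a_c - LSE(a); exponentiate to get probabilities.
    return a - log_sum_exp(a)

a = np.array([1000.0, 1001.0, 1002.0])   # naive exp(a) overflows here
print(np.exp(log_softmax(a)))            # ≈ [0.090, 0.245, 0.665]
```

A more robust alternative to the Gaussian in the presence of outliers is the Student t-distribution: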
\[\begin{aligned} \mathcal{T}(y|\mu, \sigma^{2}, \nu) &\propto \Bigg(1 + \frac{1}{\nu}\Big(\frac{y - \mu}{\sigma}\Big)^{2}\Bigg)^{-(\frac{\nu + 1}{2})}\\ \sigma &> 0\\ \nu &> 0 \end{aligned}\]Where \(\mu\) is the mean, \(\sigma\) is the scale parameter (not the standard deviation), and \(\nu\) is the degrees of freedom. The mean is only defined if \(\nu > 1\), and the variance is only defined if \(\nu > 2\). For \(\nu \gg 5\), the distribution rapidly approaches a Gaussian and loses its robustness properties; it is common to use \(\nu = 4\), which gives good performance in a range of problems.
\[\begin{aligned} E[X] &= \mu\\ \text{mode}(X) &= \mu\\ \text{var}(X) &= \frac{\nu\sigma^{2}}{\nu - 2} \end{aligned}\]If \(\nu = 1\), the t-distribution is known as the Cauchy or Lorentz distribution. The pdf is:
\[\begin{aligned} \mathrm{C}(x|\mu, \gamma) &= \frac{1}{\gamma\pi}\Big[1 + \Big(\frac{x - \mu}{\gamma}\Big)^{2}\Big]^{-1} \end{aligned}\]The tails are so heavy that the integral defining the mean does not converge. The half-Cauchy distribution sets \(\mu = 0\) and places all of its probability density on the positive reals:
\[\begin{aligned} \mathrm{C}_{+}(x|\gamma) &= \frac{2}{\gamma\pi}\Big[1 + \Big(\frac{x}{\gamma}\Big)^{2}\Big]^{-1} \end{aligned}\]The Laplace distribution is another distribution with heavy tails. It has the following pdf:
\[\begin{aligned} \text{Lap}(y|\mu, b) &\equiv \frac{1}{2b}e^{-\frac{|y - \mu|}{b}}\\ b &> 0 \end{aligned}\]\(b\) is a scale parameter.
\[\begin{aligned} E[X] &= \mu\\ \text{mode}(X) &= \mu\\ \text{var}(X) &= 2b^{2} \end{aligned}\]The beta distribution is defined over the interval [0, 1] and has pdf:
\[\begin{aligned} \text{Beta}(x | a, b) &\equiv \frac{1} {B(a, b)}x^{a - 1}(1 - x)^{b-1}\\ B(a, b) &\equiv \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\\ \Gamma(a) &\equiv \int_{0}^{\infty}x^{a-1}e^{-x}\ dx \end{aligned}\]We need \(a, b > 0\) to ensure the distribution is integrable (i.e., that \(B(a, b)\) exists). If \(a = b = 1\), we get the uniform distribution. If \(a, b < 1\), we get a bimodal distribution with spikes at 0 and 1. If \(a, b > 1\), the distribution is unimodal.
\[\begin{aligned} E[X] &= \frac{a}{a+b}\\ \text{mode}(X) &= \frac{a - 1}{a + b - 2}\\ \text{var}(X) &= \frac{ab}{(a+b)^{2}(a + b + 1)} \end{aligned}\]The Gamma distribution is defined for positive reals \(x > 0\) in terms of two parameters:
\[\begin{aligned} \text{Ga}(x | \text{shape} = a, \text{rate} = b) &\equiv \frac{b^{a}}{\Gamma(a)}x^{a-1}e^{-xb}\\ \text{Ga}(x | \text{shape} = a, \text{scale} = s) &\equiv \frac{1}{s^{a}\Gamma(a)}x^{a-1}e^{-\frac{x}{s}} \end{aligned}\] \[\begin{aligned} E[X] &= \frac{a}{b}\\ \text{mode}(X) &= \frac{a - 1}{b}\\ \text{var}(X) &= \frac{a}{b^{2}} \end{aligned}\]The following are special cases of the Gamma distribution:
The exponential distribution describes the times between events in a Poisson process (a process in which events occur continuously and independently at a constant average rate \(\lambda\)):
\[\begin{aligned} \text{Exp}(x|\lambda) &\equiv \text{Ga}(x| \text{shape} = 1, \text{rate} = \lambda) \end{aligned}\]The chi-squared distribution is the distribution of a sum of squared standard Gaussian random variables:
\[\begin{aligned} Z_{i} &\sim N(0, 1)\\ S &= \sum_{i=1}^{\nu}Z_{i}^{2}\\ S &\sim \chi_{\nu}^{2} \end{aligned}\] \[\begin{aligned} \chi_{\nu}^{2}(x) &\equiv \text{Ga}(x| \text{shape} = \frac{\nu}{2}, \text{rate} = \frac{1}{2})\\ \end{aligned}\]The inverse Gamma distribution: if \(X \sim \text{Ga}(\text{shape} = a, \text{rate} = b)\), then:
\[\begin{aligned} \frac{1}{X} \sim \text{IG}(a, b) \end{aligned}\]The mean exists only if \(a > 1\). Variance exists only if \(a > 2\):
\[\begin{aligned} E[X] &= \frac{b}{a - 1}\\ \text{mode}(X) &= \frac{b}{a + 1}\\ \text{var}(X) &= \frac{b^{2}}{(a-1)^{2}(a - 2)} \end{aligned}\]Suppose we have a set of \(N\) samples \(\{x^{(1)}, \cdots, x^{(N)}\}\) drawn from a distribution \(p(X)\). We can approximate the pdf using delta functions:
\[\begin{aligned} \hat{p}_{N}(x) &= \frac{1}{N}\sum_{n = 1}^{N}\delta_{x^{(n)}}(x) \end{aligned}\]And the CDF:
\[\begin{aligned} \hat{F}_{N}(x) &= \frac{1}{N}\sum_{n=1}^{N}I(x^{(n)}\leq x)\\ &= \frac{1}{N}\sum_{n=1}^{N}u_{x^{(n)}}(x)\\ u_{y}(x) &= \begin{cases} 1 & x \geq y\\ 0 & x < y \end{cases} \end{aligned}\]
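A minimal sketch of the empirical CDF:

```python
import numpy as np

def empirical_cdf(samples, x):
    """Fraction of samples <= x: the plug-in estimate of F_X(x)."""
    return np.mean(samples <= x)

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)
print(empirical_cdf(samples, 0.0))    # ≈ 0.5 for a standard normal
print(empirical_cdf(samples, 1.96))   # ≈ 0.975
```

Let \(y = f(x)\) be some deterministic transformation of \(x \sim p(x)\).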
If \(X\) is discrete:
\[\begin{aligned} P_{Y}(y) &= \sum_{x: f(x) = y}P_{X}(x) \end{aligned}\]If \(X\) is continuous:
\[\begin{aligned} F_{Y}(y) &\equiv P(Y \leq y)\\ &= P(f(X) \leq y)\\ &= P(X \in \{x|f(x)\leq y\}) \end{aligned}\]If \(f\) is invertible, we can derive the pdf by differentiating the CDF. If not, we can use numerical integration or a Monte Carlo approximation.
Suppose:
\[\begin{aligned} x &\sim \text{Unif}(0, 1)\\ y &= f(x)\\ &= 2x+1 \end{aligned}\]The function stretches and shifts the uniform probability distribution. The interval \((x, x + dx)\) gets mapped to \((y, y + dy)\), and the probability mass in these intervals is the same:
\[\begin{aligned} p(x)dx &= p(y)dy\\ p(y) &= p(x)\frac{dx}{dy} \end{aligned}\]Because a density must be non-negative, we take the absolute value:
\[\begin{aligned} P_{Y}(y) &= P_{X}(x)\Big|\frac{dx}{dy}\Big| \end{aligned}\]For any monotonic function \(f:\mathbb{R} \rightarrow \mathbb{R}\), let:
\[\begin{aligned} g &= f^{-1}\\ y &= f(x)\\ x &= g(y)\\ F_{Y}(y)&= P(f(X) \leq y)\\ &= P(X \leq f^{-1}(y))\\ &= F_{X}(f^{-1}(y))\\ &= F_{X}(g(y)) \end{aligned}\](assuming \(f\) is increasing; the absolute value below handles the decreasing case). And taking the derivative of the CDF:
\[\begin{aligned} P_{Y}(y) &\equiv \frac{d}{dy}F_{Y}(y)\\ &= \frac{d}{dy}F_{X}(x)\\ &= \frac{dx}{dy}\frac{d}{dx}F_{X}(x)\\ &= \frac{dx}{dy}P_{X}(x)\\ &= P_{X}(g(y))\frac{dx}{dy}\\ &= P_{X}(g(y))\Big|\frac{d}{dy}g(y)\Big| \end{aligned}\]The above is called the change of variables formula.
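A histogram sketch checking the formula on the earlier example \(y = 2x + 1\) with \(x \sim \text{Unif}(0, 1)\), where \(p_{Y}(y) = p_{X}(g(y))\,|dg/dy| = 1 \cdot \tfrac{1}{2}\) on \((1, 3)\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = 2 * x + 1                 # f(x) = 2x + 1, so g(y) = (y - 1) / 2

# Normalized histogram should be flat at the predicted density 0.5.
hist, edges = np.histogram(y, bins=20, range=(1.0, 3.0), density=True)
print(hist)                   # each bin ≈ 0.5
```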
For the multivariate case:
\[\begin{aligned} \boldsymbol{f}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\\ \boldsymbol{g} &= \boldsymbol{f}^{-1}\\ \boldsymbol{y} &= \boldsymbol{f}(\boldsymbol{x}) \end{aligned}\]Then:
\[\begin{aligned} P_{Y}(\boldsymbol{y}) &= P_{X}(\boldsymbol{g}(\boldsymbol{y}))|\text{det}[\boldsymbol{J}_{g}(\boldsymbol{y})]|\\ \boldsymbol{J}_{g} &= \frac{d \boldsymbol{g}(\boldsymbol{y})}{d \boldsymbol{y}^{T}} \end{aligned}\]An example in 2D is an affine transformation:
\[\begin{aligned} f(\boldsymbol{x}) &= \boldsymbol{Ax} + \boldsymbol{b}\\ \boldsymbol{A} &= \begin{bmatrix} a & b\\ c & d \end{bmatrix} \end{aligned}\]The area of the unit square changes by a factor of:
\[\begin{aligned} \mathrm{det}(\boldsymbol{A}) &= ad - bc \end{aligned}\]Another example is transforming a density from a Cartesian coordinate to polar coordinate:
\[\begin{aligned} \boldsymbol{x} &= (x_{1}, x_{2})\\ \boldsymbol{y} &= \boldsymbol{f}(x_{1}, x_{2})\\ \boldsymbol{g}(r, \theta) &= (r \mathrm{cos}(\theta), r \mathrm{sin}(\theta)) \end{aligned}\]And the Jacobian:
\[\begin{aligned} \boldsymbol{J}_{g} &= \begin{bmatrix} \frac{\partial x_{1}}{\partial r} & \frac{\partial x_{1}}{\partial \theta}\\ \frac{\partial x_{2}}{\partial r} & \frac{\partial x_{2}}{\partial \theta} \end{bmatrix}\\ &= \begin{bmatrix} \cos(\theta) & -r\sin(\theta)\\ \sin(\theta) & r\cos(\theta) \end{bmatrix}\\ |\mathrm{det}(\boldsymbol{J}_{g})| &= |r \cos^{2}(\theta) + r \sin^{2}(\theta)|\\ &= |r| \end{aligned}\] \[\begin{aligned} P_{r, \theta}(r, \theta) &= P_{x_{1}, x_{2}}(r \cos(\theta), r \sin(\theta))\ r \end{aligned}\]Suppose \(f\) is an affine function:
\[\begin{aligned} \boldsymbol{y} &= \boldsymbol{Ax} + \boldsymbol{b} \end{aligned}\]To derive the mean:
\[\begin{aligned} E[\boldsymbol{y}] &= E[\boldsymbol{Ax} + \boldsymbol{b}]\\ &= \boldsymbol{A}E[\boldsymbol{x}] + \boldsymbol{b}\\ &= \boldsymbol{A\mu} + \boldsymbol{b} \end{aligned}\]For the covariance:
\[\begin{aligned} \text{cov}[\boldsymbol{y}] &= \text{cov}(\boldsymbol{Ax}+\boldsymbol{b})\\ &= \boldsymbol{A\Sigma A}^{T} \end{aligned}\]For example, given:
\[\begin{aligned} y &= \boldsymbol{a}^{T}\boldsymbol{x}\\ &= x_{1} + x_{2}\\ \boldsymbol{a} &= \begin{bmatrix} 1\\ 1 \end{bmatrix}\\ \boldsymbol{x} &= \begin{bmatrix} x_{1}\\ x_{2} \end{bmatrix}\\ \text{var}(y) &= \text{var}(\boldsymbol{a}^{T}\boldsymbol{x})\\ &= \boldsymbol{a}^{T}\boldsymbol{\Sigma a}\\ &= \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \begin{bmatrix} 1\\ 1 \end{bmatrix}\\ &= \Sigma_{11} + \Sigma_{22} + 2\Sigma_{12}\\ &= \text{var}(x_{1}) + \text{var}(x_{2}) + 2\text{cov}(x_{1}, x_{2}) \end{aligned}\]
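A quick numpy check of the scalar case \(\text{var}(\boldsymbol{a}^{T}\boldsymbol{x}) = \boldsymbol{a}^{T}\boldsymbol{\Sigma}\boldsymbol{a}\) (the covariance matrix below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])           # known covariance matrix
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=1_000_000)

a = np.array([1.0, 1.0])
y = x @ a                                # y = x1 + x2
print(y.var())                           # empirical variance
print(a @ Sigma @ a)                     # Σ11 + Σ22 + 2Σ12 = 4.2
```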