In this post, we review the probability theory used in ML.
There are two types of uncertainty:
Model uncertainty (also called epistemic uncertainty): uncertainty resulting from the underlying hidden causes or mechanism generating the data.
Data uncertainty (also called aleatoric uncertainty): arises from intrinsic variability. For example, if we know the probability of getting a head for a coin is \(p = 0.5\), we have no model uncertainty, but we still have data uncertainty because we cannot perfectly predict the outcome.
An event is a binary state of the world that either holds or does not hold. The probability of an event \(A\) is denoted as \(P(A)\) with the property:
\[\begin{aligned} 0 \leq P(A) \leq 1 \end{aligned}\]The joint probability of 2 events \(A, B\) both happening:
\[\begin{aligned} P(A \land B) &= P(A, B) \end{aligned}\]If the two events are independent:
\[\begin{aligned} P(A, B) &= P(A)P(B) \end{aligned}\]The probability of event \(A\) or \(B\) happening:
\[\begin{aligned} P(A \lor B) &= P(A) + P(B) - P(A \land B) \end{aligned}\]Note that we cannot condition on an impossible event: \(P(A \mid B)\) is only defined when \(P(B) > 0\).
We say that event A is independent of event B if:
\[\begin{aligned} P(A, B) &= P(A)P(B) \end{aligned}\]We say that events A and B are conditionally independent given event C if:
\[\begin{aligned} P(A, B |C) &= P(A|C)P(B|C) \end{aligned}\]This is also written as:
\[\begin{aligned} A \perp B|C \end{aligned}\]Most events are not independent of each other, but many are conditionally independent once we condition on the relevant intermediate variables.
Technically, a random variable is neither random (it is deterministic) nor a variable (it is a function). But for simplicity, a random variable X can be thought of as a variable whose value is unknown until it is observed.
The set of possible values of X, denoted by \(\mathcal{X}\), is known as the sample space or state space. An event is a set of outcomes from a given sample space.
If the sample space \(\mathcal{X}\) is finite or countably infinite, then X is called a discrete random variable. We define the probability mass function (pmf) as:
\[\begin{aligned} p(x) &\equiv P(X = x)\\ \end{aligned}\] \[\begin{aligned} 0 \leq p(x) \leq 1\\ \end{aligned}\] \[\begin{aligned} \sum_{x \in \mathcal{X}}p(x) = 1 \end{aligned}\]A continuous random variable can take on an uncountably infinite number of values, but we can partition its range into a countable number of intervals.
We define the cumulative distribution function (CDF) as:
\[\begin{aligned} F_{X}(x) \equiv P(X \leq x) \end{aligned}\]And the probability of being in any interval is:
\[\begin{aligned} P(a < X \leq b) &= F_{X}(b) - F_{X}(a) \end{aligned}\]We define the PDF as the derivative of the CDF:
\[\begin{aligned} p(x) \equiv \frac{d}{dx}F_{X}(x) \end{aligned}\]Given the PDF, we can compute the continuous RV being in a finite interval as:
\[\begin{aligned} P(a < X \leq b) &= \int_{a}^{b}p(x)\ dx\\ &= F_{X}(b) - F_{X}(a) \end{aligned}\]For a small interval:
\[\begin{aligned} P(x \leq X \leq x + dx) \approx p(x)\ dx \end{aligned}\]The inverse of the CDF is known as the quantile:
\[\begin{aligned} F_{X}^{-1}(q) &= x_{q}\\ P(X \leq x_{q}) &= q \end{aligned}\]For example, \(\Phi\) is the CDF of a normal distribution, and \(\Phi^{-1}\) is the inverse CDF:
\[\begin{aligned} \Phi^{-1}(0.025) &= -1.96\\ \Phi^{-1}(0.975) &= 1.96 \end{aligned}\]
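As a quick check, scipy's `norm` exposes the Gaussian CDF and its inverse (`norm.cdf`, `norm.ppf`); a minimal sketch:

```python
# Quantiles of the standard normal via the inverse CDF (ppf in scipy).
from scipy.stats import norm

print(norm.ppf(0.025))   # ≈ -1.96
print(norm.ppf(0.975))   # ≈  1.96
print(norm.cdf(1.96))    # ≈  0.975, inverting the quantile
```

Given a joint distribution, we define the marginal distribution of a random variable as: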
\[\begin{aligned} p(X = x) &= \sum_{y}p(X = x, Y = y) \end{aligned}\]We can extend the product rule:
\[\begin{aligned} p(\textbf{x}_{1:D}) &= p(x_{1})p(x_{2}|x_{1})p(x_{3}|x_{1}, x_{2})p(x_{4}|x_{1}, x_{2}, x_{3})\cdots p(x_{D}|\textbf{x}_{1:D-1}) \end{aligned}\]In general, we say \(X_{1}, \cdots, X_{n}\) are independent if:
\[\begin{aligned} p(X_{1}, \cdots, X_{n}) &= \prod_{i = 1}^{n}p(X_{i}) \end{aligned}\]The variance of a shifted and scaled random variable:
\[\begin{aligned} V[aX + b] &= a^{2}V[X] \end{aligned}\]If the \(n\) random variables are independent:
\[\begin{aligned} V\Big[\sum_{i=1}^{n}X_{i}\Big] &= \sum_{i = 1}^{n}V[X_{i}] \end{aligned}\]Contrast this with the expectation of a product of independent variables, which factorizes into a product rather than a sum.
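A quick Monte Carlo sanity check of the sum-of-variances identity (a sketch using numpy; the particular distributions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Three independent variables with known variances 1, 9, and 4.
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(2.0, 3.0, size=n)
x3 = rng.exponential(2.0, size=n)   # variance of Exp(scale=2) is 4

s = x1 + x2 + x3
print(s.var())                           # ≈ 14
print(x1.var() + x2.var() + x3.var())    # matches the analytic sum
```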
The variance of their product (if independent):
\[\begin{aligned} V\Big[\prod_{i=1}^{n}X_{i}\Big] &= E\Big[\Big(\prod_{i}X_{i}\Big)^{2}\Big] - \Big(E\Big[\prod_{i}X_{i}\Big]\Big)^{2}\\ &= E\Big[\prod_{i}X_{i}^{2}\Big] - \Big(\prod_{i}E[X_{i}]\Big)^{2}\\ &= \prod_{i}E[X_{i}^{2}] - \prod_{i}E[X_{i}]^{2}\\ &= \prod_{i}\Big(V[X_{i}] + E[X_{i}]^{2}\Big) - \prod_{i}E[X_{i}]^{2}\\ &= \prod_{i}(\sigma_{i}^{2} + \mu_{i}^{2}) - \prod_{i}\mu_{i}^{2} \end{aligned}\]When we have two or more dependent random variables, we can compute the moments of one given another:
\[\begin{aligned} E[X] &= E_{Y}\Big[E[X|Y]\Big] \end{aligned}\]To show this for a discrete random variable:
\[\begin{aligned} E_{Y}\Big[E[X|Y]\Big] &= E_{Y}\Big[\sum_{x}x p(X=x|Y)\Big]\\ &= \sum_{y}\Big[\sum_{x}x p(X=x|Y=y)\Big]p(Y = y)\\ &= \sum_{x, y}xp(X = x, Y = y)\\ &= E[X] \end{aligned}\]Similarly, the law of total variance states that \(V[X] = E_{Y}\big[V[X|Y]\big] + V_{Y}\big[E[X|Y]\big]\). To show this:
\[\begin{aligned} V[X] &= E[X^{2}] - (E[X])^{2}\\ &= E_{Y}\Big[E[X^{2}|Y]\Big] - \Big(E_{Y}\Big[E[X|Y]\Big]\Big)^{2}\\ &= \Big(E_{Y}\Big[E[X^{2}|Y]\Big] - E_{Y}\Big[E[X|Y]^{2}\Big]\Big) + \Big(E_{Y}\Big[E[X|Y]^{2}\Big] - \Big(E_{Y}\Big[E[X|Y]\Big]\Big)^{2}\Big)\\ &= E_{Y}\Big[V[X|Y]\Big] + V_{Y}\Big[E[X|Y]\Big] \end{aligned}\]An intuitive way to think about this is with a mixture of Gaussians example:
\[\begin{aligned} p(x) = \sum_{y=1}^{K}\pi_{y}N(x|\mu_{y}, \sigma_{y}^{2}) \end{aligned}\]The variance is composed of the mean of the variances of the components plus the variance of the component means.
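A simulation sketch of this decomposition for a two-component mixture (the weights and parameters below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])       # mixture weights
mu = np.array([-2.0, 3.0])      # component means
sigma = np.array([1.0, 0.5])    # component standard deviations

# Sample: pick a component Y, then draw X | Y.
y = rng.choice(2, size=1_000_000, p=pi)
x = rng.normal(mu[y], sigma[y])

# V[X] = E_Y[V[X|Y]] + V_Y[E[X|Y]]
mean_of_var = np.sum(pi * sigma**2)
var_of_mean = np.sum(pi * mu**2) - np.sum(pi * mu)**2
print(x.var())                        # empirical variance
print(mean_of_var + var_of_mean)      # analytic decomposition
```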
Conditional probability of outcome of random variable \(X = x\) given \(Y = y\):
\[\begin{aligned} P(X = x | Y = y) &= \frac{P(X = x, Y = y)}{P(Y = y)} \end{aligned}\]Two random variables \(X, Y\) are independent if the conditional probability is:
\[\begin{aligned} P(X = x | Y = y) &= \frac{P(X = x, Y = y)}{P(Y = y)}\\[5pt] &= P(X = x) \end{aligned}\]In addition, if \(X\) and \(Y\) are independent:
\[\begin{aligned} \frac{P(X = x, Y = y)}{P(Y = y)} &= P(X = x)\\[5pt] P(X = x, Y = y) &= P(X = x) P(Y = y) \end{aligned}\]For a continuous random variable, the mean is defined as:
\[\begin{aligned} E[X] \equiv \int_{\mathcal{X}}x p(x)\ dx \end{aligned}\]If the integral is not finite, the mean is not defined.
For a discrete random variable, the mean is defined as:
\[\begin{aligned} E[X] \equiv \sum_{x \in \mathcal{X}}x p(x) \end{aligned}\]Since the mean is a linear operator:
\[\begin{aligned} E[aX + b] &= aE[X] + b\\ E\Big[\sum_{i = 1}^{n}X_{i} \Big] &= \sum_{i=1}^{n}E[X_{i}] \end{aligned}\]If the random variables are independent:
\[\begin{aligned} E\Big[\prod_{i=1}^{n} X_{i}\Big] &= \prod_{i=1}^{n}E[X_{i}] \end{aligned}\]And if \(X, Y\) are independent, conditioning on \(Y\) does not change the expectation:
\[\begin{aligned} E[X|Y = y] &= \sum_{x}x f(x|y)\\ &= \sum_{x}xf(x)\\ &= E[X] \end{aligned}\]For a function \(g\) of a random variable, \(E[g(X)] = \sum_{x}g(x)f(x)\). For example, if \(g(X) = aX\):
\[\begin{aligned} E[g(X)] &= \sum_{x}g(x)f(x)\\ &= a\sum_{x}x f(x)\\ &= aE[X] \end{aligned}\]We can also show that:
\[\begin{aligned} E[aX + b] &= aE[X] + b\\ E[g_{1}(X) + g_{2}(X)] &= E[g_{1}(X)] + E[g_{2}(X)] \end{aligned}\]Recall the law of iterated expectations:
\[\begin{aligned} E[Y] &= E_{X}\Big[ E[Y|X]\Big] \end{aligned}\]The corresponding decomposition of variance is the law of total variance, which underlies the bias-variance tradeoff.
The derivation:
\[\begin{aligned} \mathrm{Var}(Y) &= \sum_{y}(y - \mu_{Y})^{2}p(y)\\ &= \sum_{y}(y - \mu_{Y})^{2}\sum_{x}p(x, y)\\ &= \sum_{y}(y - \mu_{Y})^{2}\sum_{x}p(y|x)p(x)\\ &= \sum_{x}\sum_{y}(y - \mu_{Y})^{2}p(y|x)p(x)\\ &= \sum_{x}\sum_{y}\Big(y - E[Y|x] + E[Y|x] - \mu_{Y}\Big)^{2}p(y|x)p(x)\\ &= \sum_{x}\sum_{y}\Big(\big(y - E[Y|x]\big)^{2} + \big(E[Y|x] - \mu_{Y}\big)^{2} + 2\big(y - E[Y|x]\big)\big(E[Y|x] - \mu_{Y}\big)\Big)p(y|x)p(x)\\ &= \sum_{x}\sum_{y}\big(y - E[Y|x]\big)^{2}p(y|x)p(x) + \sum_{x}\big(E[Y|x] - \mu_{Y}\big)^{2}p(x) + 0\\ &= E_{X}[\mathrm{Var}(Y|X)] + \mathrm{Var}_{X}\Big(E[Y|X]\Big) \end{aligned}\]The cross term vanishes because \(\sum_{y}\big(y - E[Y|x]\big)p(y|x) = 0\) for every \(x\). Hence:
\[\begin{aligned} \mathrm{Var}(Y) = E_{X}\Big[\mathrm{Var}(Y|X)\Big] + \mathrm{Var}_{X}\Big[E[Y|X]\Big] \end{aligned}\]\(E[Y \vert X = x]\) can be treated as a function \(g(X)\). So we can represent covariance as an expectation:
\[\begin{aligned} \mathrm{Cov}(X, Y) &= \sum_{x}(x - \mu_{X})E[Y|X = x]p(x)\\ &= \sum_{x}(x - \mu_{X})g(X)p(x)\\ &= E_{X}\Big[(X - \mu_{X})E[Y|X]\Big] \end{aligned}\]A special case is where \(E[Y \mid X] = 0\). This implies that the covariance of \(X, Y\) is zero as well.
If the random variable \(X\) is Gaussian/normally distributed with mean \(\mu\) and variance \(\sigma^{2}\), written \(X \sim N(\mu, \sigma^{2})\), its pdf is:
\[\begin{aligned} p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \end{aligned}\]A weighted sum of (jointly) Gaussian random variables is Gaussian distributed as well:
\[\begin{aligned} X_{1} &\sim N(\mu_{1}, \sigma_{1}^{2})\\ X_{2} &\sim N(\mu_{2}, \sigma_{2}^{2})\\ Y &= a_{1}X_{1} + a_{2}X_{2}\\ Y &\sim N(\mu_{Y}, \sigma_{Y}^{2})\\ \mu_{Y} &= a_{1}\mu_{1} + a_{2}\mu_{2}\\ \sigma_{Y}^{2} &= a_{1}^{2}\sigma_{1}^{2} + a_{2}^{2}\sigma_{2}^{2} + 2a_{1}a_{2}\mathrm{Cov}(X_{1}, X_{2})\\ &= a_{1}^{2}\sigma_{1}^{2} + a_{2}^{2}\sigma_{2}^{2} + 2a_{1}a_{2}\sigma_{12} \end{aligned}\]The CDF of the Gaussian is defined by:
\[\begin{aligned} \Phi(y; \mu, \sigma^{2}) &\equiv \int_{-\infty}^{y}N(z|\mu, \sigma^{2})\ dz\\ z &= \frac{y - \mu}{\sigma}\\ \Phi(y; \mu, \sigma^{2}) &= \frac{1}{2}\Big[1 + \text{erf}\Big(\frac{z}{\sqrt{2}}\Big)\Big] \end{aligned}\]erf() is the error function:
\[\begin{aligned} \text{erf}(u) &\equiv \frac{2}{\sqrt{\pi}}\int_{0}^{u}e^{-t^{2}}\ dt \end{aligned}\]The inverse of the CDF is also known as the probit function. We can get quantiles by using the probit function. For example, to get the central 95% interval:
\[\begin{aligned} \alpha &= 0.05\\ \Big(\Phi^{-1}\big(\tfrac{\alpha}{2}\big), \Phi^{-1}\big(1 - \tfrac{\alpha}{2}\big)\Big) &= (\Phi^{-1}(0.025), \Phi^{-1}(0.975))\\ &= (-1.96, 1.96) \end{aligned}\]The PDF is defined as the derivative of the CDF:
\[\begin{aligned} p(y) &\equiv \frac{d}{dy}F(y)\\ N(y|\mu, \sigma^{2}) &\equiv \frac{1}{\sqrt{2\pi \sigma^{2}}}e^{-\frac{1}{2\sigma^{2}}(y - \mu)^{2}} \end{aligned}\]Where \(\sqrt{2\pi \sigma^{2}}\) is the normalization constant.
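As a sanity check, the formula matches scipy's implementation (a minimal sketch):

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(y, mu, sigma):
    """Evaluate N(y | mu, sigma^2) directly from the formula."""
    return np.exp(-((y - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(normal_pdf(1.0, 0.0, 2.0))          # direct formula
print(norm.pdf(1.0, loc=0.0, scale=2.0))  # scipy agrees
```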
Consider a conditional density model of the form:
\[\begin{aligned} p(y|\textbf{x};\boldsymbol{\theta}) &= N(y|f_{\mu}(\textbf{x};\boldsymbol{\theta}), f_{\sigma}(\textbf{x};\boldsymbol{\theta})^{2}) \end{aligned}\]For linear regression, we would assume the mean is a linear function of the input, and the variance is constant (homoscedastic regression):
\[\begin{aligned} p(y|\textbf{x};\boldsymbol{\theta}) &= N(y|\textbf{w}^{T}\textbf{x} + b, \sigma^{2})\\ \boldsymbol{\theta} &= (\textbf{w}, b, \sigma^{2}) \end{aligned}\]It is also possible to make the variance depend on the input (heteroskedastic regression):
\[\begin{aligned} p(y|\textbf{x};\boldsymbol{\theta}) &= N(y|\textbf{w}_{\mu}^{T}\textbf{x} + b, \sigma_{+}(\textbf{w}_{\sigma}^{T}\textbf{x}))\\ \boldsymbol{\theta} &= (\textbf{w}_{\mu}, \textbf{w}_{\sigma})\\ \sigma_{+}(a) &= \log(1 + e^{a}) \end{aligned}\]\(\sigma_{+}\) is known as the softplus function; it ensures the predicted variance is non-negative.
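A minimal numpy sketch of such a heteroskedastic predictor (the function and parameter names here are hypothetical, not from any particular library):

```python
import numpy as np

def softplus(a):
    # log(1 + e^a), computed stably via log-add-exp
    return np.logaddexp(0.0, a)

def heteroskedastic_params(x, w_mu, b, w_sigma):
    """Return (mean, scale) of p(y | x) for the model above."""
    mu = w_mu @ x + b
    sigma = softplus(w_sigma @ x)   # softplus keeps the scale positive
    return mu, sigma

# Hypothetical parameters for a 2D input.
mu, sigma = heteroskedastic_params(
    np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1, np.array([0.2, 0.4]))
print(mu, sigma)
```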
Here are a few reasons why the Gaussian distribution is so common: its two parameters (mean and variance) are easy to interpret; the central limit theorem tells us that sums of independent random variables are approximately Gaussian; it makes the fewest assumptions (has maximum entropy) among all distributions with a given mean and variance; and its simple mathematical form leads to tractable computations.
As the variance of a Gaussian distribution goes to 0, the distribution approaches a Dirac delta function:
\[\begin{aligned} \lim_{\sigma \rightarrow 0}N(y|\mu, \sigma^{2}) \rightarrow \delta(y - \mu) \end{aligned}\]The Dirac delta function is defined as:
\[\begin{aligned} \delta(x) &\equiv \begin{cases} \infty & x = 0\\ 0 & x \neq 0 \end{cases}\\ \int_{-\infty}^{\infty}\delta(x)\ dx &= 1 \end{aligned}\]If we shift the mean to \(y\):
\[\begin{aligned} \delta_{y}(x) &\equiv \begin{cases} \infty & x = y\\ 0 & x \neq y \end{cases}\\ \end{aligned}\]Note we have that:
\[\begin{aligned} \delta_{y}(x) &= \delta(x - y) \end{aligned}\]The delta function distribution satisfies the following sifting property:
\[\begin{aligned} \int_{-\infty}^{\infty}f(x)\delta(x - y)\ dx &= f(y) \end{aligned}\]Bayes' rule lets us infer an unknown/hidden quantity \(H\) given some observed data \(Y = y\):
\[\begin{aligned} P(H = h | Y = y) &= \frac{P(H = h)P(Y = y|H = h)}{P(Y = y)} \end{aligned}\]The above can be derived as follows:
\[\begin{aligned} P(h, y) &= P(y, h)\\ P(h|y)P(y) &= P(y|h)P(h)\\ P(h|y) &= \frac{P(y|h)P(h)}{P(y)} \end{aligned}\]\(P(H)\) represents what we know about the possible distribution of \(H\) before we see any data.
\(P(Y\mid H = h)\) represents the distribution over possible outcomes of \(Y\) we expect to see if \(H = h\). This is also known as the observation distribution. When evaluated at a point, \(P(Y = y\mid H = h)\) is called the likelihood. Note that the likelihood, viewed as a function of \(h\), is not a probability distribution, as it does not sum to 1.
Multiplying the likelihood by the prior gives the unnormalized joint distribution \(P(H = h, Y = y)\).
To convert the unnormalized joint distribution into a normalized distribution, we divide it by the marginal likelihood \(P(Y = y)\):
\[\begin{aligned} P(Y = y) &= \sum_{h \in \mathcal{H}}P(H = h)P(Y = y| H = h)\\ &= \sum_{h \in \mathcal{H}}P(H = h, Y = y) \end{aligned}\]Using Bayes' rule to update a distribution over unknown/hidden values \(P(H)\) given new observed data is called Bayesian inference or posterior inference.
Let \(H\) indicate whether you have contracted Covid, and let \(Y\) be the outcome of a test. Note the following measures:
\[\begin{aligned} \text{Sensitivity (True Positive Rate)} &= P(Y = 1| H = 1)\\[5pt] \text{Specificity (True Negative Rate)} &= P(Y = 0| H = 0)\\[5pt] \text{False Positive Rate} &= 1 - \text{Specificity} = P(Y = 1 \mid H = 0) \end{aligned}\]In this case, the TPR is our likelihood. The posterior probability of having Covid given a positive test is:
\[\begin{aligned} P(H = 1|Y = 1) &= \frac{P(Y = 1|H = 1)P(H = 1)}{P(Y=1|H=1)P(H = 1) + P(Y=1|H = 0)P(H = 0)}\\ &= \frac{\text{TPR} \times \text{Prior}}{\text{TPR} \times \text{Prior} + \text{FPR} \times ( 1 - \text{Prior})} \end{aligned}\]
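A small sketch of this computation (the prevalence and test characteristics below are made-up illustrative numbers):

```python
def posterior_given_positive(prior, tpr, fpr):
    """P(H = 1 | Y = 1) via Bayes' rule."""
    evidence = tpr * prior + fpr * (1 - prior)
    return tpr * prior / evidence

# With a 1% prior, even a sensitive test yields a modest posterior.
print(posterior_given_positive(prior=0.01, tpr=0.95, fpr=0.05))  # ≈ 0.16
```

Next, consider the Monty Hall problem. Let \(H_{i}\) be the hypothesis that the prize is behind door \(i\). Our prior distribution would be: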
\[\begin{aligned} P(H_{1}) &= P(H_{2})\\ &= P(H_{3})\\ &= \frac{1}{3} \end{aligned}\]Let \(Y\) be the door that the host opens. We pick door 1, so \(Y\) can only be 2 or 3 (the host never opens our door or the prize door). The likelihood \(P(Y\mid H_{i})\) depends on which door the car is behind:
\[\begin{aligned} P(Y = 2|H_{1}) &= \frac{1}{2}\\ P(Y = 3|H_{1}) &= \frac{1}{2}\\ P(Y = 2|H_{2}) &= 0\\ P(Y = 3|H_{2}) &= 1\\ P(Y = 2|H_{3}) &= 1\\ P(Y = 3|H_{3}) &= 0 \end{aligned}\]The posterior probability distribution given \(Y=3\) (the same result for \(Y=2\)):
\[\begin{aligned} P(Y = 3) &= \frac{1}{2} \times \frac{1}{3} + 1 \times \frac{1}{3} + 0 \times \frac{1}{3}\\ &= \frac{1}{6} + \frac{1}{3}\\ &= \frac{1}{2}\\[5pt] P(H_{1}|Y = 3) &= \frac{\frac{1}{2} \times \frac{1}{3}}{P(Y = 3)}\\ &= \frac{1}{3}\\[5pt] P(H_{2}|Y = 3) &= \frac{1 \times \frac{1}{3}}{P(Y = 3)}\\ &= \frac{2}{3}\\[5pt] P(H_{3}|Y = 3) &= \frac{0 \times \frac{1}{3}}{P(Y = 3)}\\ &= 0 \end{aligned}\]As you can see, the probability of winning by switching is twice the probability of winning by staying.
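A quick simulation sketch confirms the result (assuming the standard rules: the host always opens a non-prize, non-picked door):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
prize = rng.integers(0, 3, size=n)   # door hiding the car
pick = np.zeros(n, dtype=int)        # we always pick door 0

# Switching wins exactly when our initial pick was wrong,
# because the host has eliminated the other losing door.
print((prize == pick).mean())   # ≈ 1/3, staying
print((prize != pick).mean())   # ≈ 2/3, switching
```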
Probability theory is concerned with predicting a distribution over outcomes \(y\) given knowledge of the state of the world \(h\). For example, given a fair coin (\(h\)), the probability of heads is \(p(y = \text{heads} \mid h) = 0.5\). In contrast, inverse probability is concerned with inferring the state of the world \(h\) from observations of outcomes. Many such problems are ill-posed, for example inferring a 3D shape from 2D images, or understanding speech spoken by a speaker. To tackle such problems, we specify the forward model (likelihood) \(p(y \mid h)\) together with a prior \(p(h)\) that downweights implausible states.
The Bernoulli distribution is used to model binary events. Let \(Y = 1\) denote the event that a coin lands heads.
\[\begin{aligned} Y &\sim \text{Ber}(\theta)\\ \text{Ber}(y|\theta) &= \begin{cases} 1 - \theta & y = 0\\ \theta & y = 1 \end{cases}\\ p(Y = 1) &= \theta\\ p(Y = 0) &= 1 - \theta \end{aligned}\]We can also write it as:
\[\begin{aligned} \text{Ber}(y|\theta) \equiv \theta^{y}(1 - \theta)^{1 - y} \end{aligned}\]The Bernoulli distribution is a special case of the binomial distribution with \(N=1\). Suppose we observe a set of \(N\) Bernoulli trials of coin tosses, and let \(x\) be the number of heads. The distribution of \(x\) is:
\[\begin{aligned} \text{Bin}(x|N, \theta) &\equiv {N\choose x}\theta^{x}(1 - \theta)^{N - x}\\ {N\choose x} &\equiv \frac{N!}{(N-x)!x!} \end{aligned}\]Given a binary outcome variable \(y \in \{0, 1\}\) and a vector of inputs \(\textbf{x}\), we can use a conditional distribution of the form:
\[\begin{aligned} p(y|\textbf{x}; \boldsymbol{\theta}) &= \text{Ber}(y|\sigma(f(\textbf{x}; \boldsymbol{\theta}))) \end{aligned}\]And \(\sigma()\) is the sigmoid or logistic function:
\[\begin{aligned} \sigma(x) \equiv \frac{1}{1 + e^{-x}} \end{aligned}\]\(x\) is known as the log-odds:
\[\begin{aligned} \log\Big(\frac{p}{1 - p}\Big) &= \log\Big(\frac{\frac{e^{x}}{1+e^{x}}}{1 - \frac{e^{x}}{1+e^{x}}}\Big)\\ &= \log\Big(\frac{e^{x}}{1 + e^{x}}\times \frac{1 + e^{x}}{1}\Big)\\ &= \log(e^{x})\\ &= x \end{aligned}\]The inverse of the sigmoid/logistic function is the logit function:
\[\begin{aligned} \sigma^{-1}(p) &= \mathrm{log}\Big(\frac{p}{1 - p}\Big) \end{aligned}\]The logistic regression model has the form:
\[\begin{aligned} p(y|\textbf{x}; \boldsymbol{\theta}) &= \text{Ber}(y|\sigma(\textbf{w}^{T}\textbf{x} + b))\\ p(y = 1|\textbf{x}; \boldsymbol{\theta}) &= \sigma(\textbf{w}^{T}\textbf{x} + b)\\ &= \frac{1}{1 + e^{-(\textbf{w}^{T}\textbf{x} + b)}} \end{aligned}\]
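A minimal sketch of the predictive probability under this model (the weights here are hypothetical):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(x, w, b):
    """p(y = 1 | x) under logistic regression."""
    return sigmoid(w @ x + b)

w, b = np.array([2.0, -1.0]), 0.5
print(predict_proba(np.array([1.0, 1.0]), w, b))  # sigmoid(1.5) ≈ 0.82
```

The categorical distribution is a discrete probability distribution (with \(C - 1\) independent parameters):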
\[\begin{aligned} \text{Cat}(y|\boldsymbol{\theta}) &\equiv \prod_{c = 1}^{C}\theta_{c}^{I(y = c)}\\ p(y = c|\boldsymbol{\theta}) &= \theta_{c}\\ 0 \leq \theta_{c} &\leq 1\\ \sum_{c=1}^{C}\theta_{c} &= 1 \end{aligned}\]Similar to how the Bernoulli distribution is a special case of the binomial distribution, the categorical distribution is a special case of the multinomial distribution. Suppose we observe \(N\) categorical trials (rolling a \(C\)-sided die \(N\) times):
\[\begin{aligned} y_{n} &\sim \text{Cat}(. | \boldsymbol{\theta})\\ n &= 1:N \end{aligned}\]Define \(\textbf{s}\) to be a vector that counts the number of times each face shows up:
\[\begin{aligned} s_{c} &\equiv \sum_{n=1}^{N}I(y_{n} = c) \end{aligned}\]The distribution of \(\textbf{s}\) is given by the multinomial distribution:
\[\begin{aligned} \mathcal{M}(\textbf{s}|N, \boldsymbol{\theta}) &\equiv {N\choose s_{1} \cdots s_{C}}\prod_{c=1}^{C}\theta_{c}^{s_{c}}\\ {N\choose s_{1} \cdots s_{C}} &\equiv \frac{N!}{s_{1}!s_{2}!\cdots s_{C}!} \end{aligned}\]For classification, we can use a conditional model of the form:\[\begin{aligned} p(y|\textbf{x}, \boldsymbol{\theta}) &= \text{Cat}(y|f(\textbf{x}; \boldsymbol{\theta}))\\ p(y|\textbf{x}, \boldsymbol{\theta}) &= \mathcal{M}(y|1, f(\textbf{x}; \boldsymbol{\theta})) \end{aligned}\]To turn the outputs of \(f\) into a valid probability vector, we use the softmax function (aka the multinomial logit):
\[\begin{aligned} \mathcal{S}(\textbf{a}) \equiv \begin{bmatrix} \frac{e^{a_{1}}}{\sum_{c' = 1}^{C}e^{a_{c'}}}, \cdots, \frac{e^{a_{C}}}{\sum_{c' = 1}^{C}e^{a_{c'}}} \end{bmatrix} \end{aligned}\] \[\begin{aligned} 0 &\leq \mathcal{S}(\textbf{a})_{c} \leq 1\\ \sum_{c=1}^{C}\mathcal{S}(\textbf{a})_{c} &= 1 \end{aligned}\]If the linear predictor is of the form:
\[\begin{aligned} f(\textbf{x};\boldsymbol{\theta}) &= \textbf{W}\textbf{x} + \textbf{b} \end{aligned}\]Then:
\[\begin{aligned} p(y|\textbf{x};\boldsymbol{\theta}) &= \text{Cat}(y|\mathcal{S}(\textbf{W}\textbf{x} + \textbf{b}))\\ p(y = c|\textbf{x};\boldsymbol{\theta}) &= \frac{e^{a_{c}}}{\sum_{c'=1}^{C}e^{a_{c'}}} \end{aligned}\]where \(\textbf{a} = \textbf{W}\textbf{x} + \textbf{b}\). Calculating the normalized probability naively might cause an overflow:
\[\begin{aligned} p(y = c| \textbf{x}) &= \frac{e^{a_{c}}}{\sum_{c'=1}^{C}e^{a_{c'}}} \end{aligned}\]where \(\textbf{a} = f(\textbf{x};\boldsymbol{\theta})\) are the logits. To avoid numerical problems, note the following identity:
\[\begin{aligned} y &= \sum_{c=1}^{C}e^{a_{c}}\\ &= \sum_{c=1}^{C}e^{a_{c}}e^{m - m}\\ &= e^{m}\sum_{c=1}^{C}e^{a_{c}}e^{-m}\\ &= e^{m}\sum_{c=1}^{C}e^{a_{c} -m}\\ \log y &= m + \log\sum_{c=1}^{C}e^{a_{c} - m}\\ \end{aligned}\]Hence:
\[\begin{aligned} \log\sum_{c=1}^{C}e^{a_{c}} &= m + \log\sum_{c=1}^{C}e^{a_{c} - m}\\ \end{aligned}\]It is common to use \(m = \max_{c} a_{c}\), which ensures the largest exponent is zero, so the sum cannot overflow.
Defining the log-sum-exp (LSE) function to be:
\[\begin{aligned} LSE(\textbf{a}) &\equiv \log \sum_{c=1}^{C}e^{a_{c}}\\ &= m + \log\sum_{c=1}^{C}e^{a_{c} - m}\\ \end{aligned}\]And we can compute the probabilities from the logits:
\[\begin{aligned} p(y = c|\textbf{x}) &= e^{a_{c} - LSE(\textbf{a})} \end{aligned}\]This works because:
\[\begin{aligned} e^{a_{c} - \log \sum_{c'=1}^{C}e^{a_{c'}}} &= \frac{e^{a_{c}}}{e^{\log \sum_{c'=1}^{C}e^{a_{c'}}}}\\ &= \frac{e^{a_{c}}}{\sum_{c'=1}^{C}e^{a_{c'}}}\\ \end{aligned}\]
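A small sketch of the trick in numpy (scipy also provides `scipy.special.logsumexp`):

```python
import numpy as np

def log_sum_exp(a):
    """LSE(a) = m + log(sum(exp(a - m))), with m = max(a) for stability."""
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def log_softmax(a):
    # log p(y = c | x) = a_c - LSE(a); exponentiate to get probabilities.
    return a - log_sum_exp(a)

a = np.array([1000.0, 1001.0, 1002.0])   # naive exp(a) overflows here
print(np.exp(log_softmax(a)))            # ≈ [0.090, 0.245, 0.665]
```

A more robust alternative to the Gaussian in the presence of outliers is the Student t-distribution: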
\[\begin{aligned} \mathcal{T}(y|\mu, \sigma^{2}, \nu) &\propto \Bigg(1 + \frac{1}{\nu}\Big(\frac{y - \mu}{\sigma}\Big)^{2}\Bigg)^{-(\frac{\nu + 1}{2})}\\ \sigma &> 0\\ \nu &> 0 \end{aligned}\]Where \(\mu\) is the mean, \(\sigma\) is the scale parameter (not the standard deviation), and \(\nu\) is the degrees of freedom. The mean is only defined if \(\nu > 1\), and the variance is only defined if \(\nu > 2\). For \(\nu \gg 5\), the distribution rapidly approaches a Gaussian and loses its robustness properties; it is common to use \(\nu = 4\), which gives good performance in a range of problems.
\[\begin{aligned} E[X] &= \mu\\ \text{mode}(X) &= \mu\\ \text{var}(X) &= \frac{\nu\sigma^{2}}{\nu - 2} \end{aligned}\]If \(\nu = 1\), the t-distribution is known as the Cauchy or Lorentz distribution. The pdf is:
\[\begin{aligned} \mathrm{C}(x|\mu, \gamma) &= \frac{1}{\gamma\pi}\Big[1 + \Big(\frac{x - \mu}{\gamma}\Big)^{2}\Big]^{-1} \end{aligned}\]The tails are so heavy that the integral defining the mean does not converge. The half-Cauchy distribution sets \(\mu = 0\) and places all of its probability density on the positive reals:
\[\begin{aligned} \mathrm{C}_{+}(x|\gamma) &= \frac{2}{\gamma\pi}\Big[1 + \Big(\frac{x}{\gamma}\Big)^{2}\Big]^{-1} \end{aligned}\]The Laplace distribution is another distribution with heavy tails. It has the following pdf:
\[\begin{aligned} \text{Lap}(y|\mu, b) &\equiv \frac{1}{2b}e^{-\frac{|y - \mu|}{b}}\\ b &> 0 \end{aligned}\]\(b\) is a scale parameter.
\[\begin{aligned} E[X] &= \mu\\ \text{mode}(X) &= \mu\\ \text{var}(X) &= 2b^{2} \end{aligned}\]The beta distribution is defined over the interval [0, 1] and has pdf:
\[\begin{aligned} \text{Beta}(x | a, b) &\equiv \frac{1} {B(a, b)}x^{a - 1}(1 - x)^{b-1}\\ B(a, b) &\equiv \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\\ \Gamma(a) &\equiv \int_{0}^{\infty}x^{a-1}e^{-x}\ dx \end{aligned}\]We need \(a, b > 0\) to ensure the distribution is integrable (i.e., that \(B(a, b)\) exists). If \(a = b = 1\), we get the uniform distribution. If \(a, b < 1\), we get a bimodal distribution with spikes at 0 and 1. If \(a, b > 1\), the distribution is unimodal.
\[\begin{aligned} E[X] &= \frac{a}{a+b}\\ \text{mode}(X) &= \frac{a - 1}{a + b - 2}\\ \text{var}(X) &= \frac{ab}{(a+b)^{2}(a + b + 1)} \end{aligned}\]The Gamma distribution is defined for positive reals \(x > 0\) in terms of two parameters:
\[\begin{aligned} \text{Ga}(x | \text{shape} = a, \text{rate} = b) &\equiv \frac{b^{a}}{\Gamma(a)}x^{a-1}e^{-xb}\\ \text{Ga}(x | \text{shape} = a, \text{scale} = s) &\equiv \frac{1}{s^{a}\Gamma(a)}x^{a-1}e^{-\frac{x}{s}} \end{aligned}\] \[\begin{aligned} E[X] &= \frac{a}{b}\\ \text{mode}(X) &= \frac{a - 1}{b}\\ \text{var}(X) &= \frac{a}{b^{2}} \end{aligned}\]The following are special cases of the Gamma distribution:
The exponential distribution describes the times between events in a Poisson process (a process in which events occur continuously and independently at a constant average rate \(\lambda\)):
\[\begin{aligned} \text{Exp}(x|\lambda) &\equiv \text{Ga}(x| \text{shape} = 1, \text{rate} = \lambda) \end{aligned}\]The chi-squared distribution is the distribution of a sum of squared standard Gaussian random variables:
\[\begin{aligned} Z_{i} &\sim N(0, 1)\\ S &= \sum_{i=1}^{\nu}Z_{i}^{2}\\ S &\sim \chi_{\nu}^{2} \end{aligned}\] \[\begin{aligned} \chi_{\nu}^{2}(x) &\equiv \text{Ga}(x| \text{shape} = \frac{\nu}{2}, \text{rate} = \frac{1}{2})\\ \end{aligned}\]The inverse Gamma distribution: if \(X \sim \text{Ga}(\text{shape} = a, \text{rate} = b)\), then:
\[\begin{aligned} \frac{1}{X} \sim \text{IG}(a, b) \end{aligned}\]The mean exists only if \(a > 1\). Variance exists only if \(a > 2\):
\[\begin{aligned} E[X] &= \frac{b}{a - 1}\\ \text{mode}(X) &= \frac{b}{a + 1}\\ \text{var}(X) &= \frac{b^{2}}{(a-1)^{2}(a - 2)} \end{aligned}\]Suppose we have a set of \(N\) samples \(\{x^{(1)}, \cdots, x^{(N)}\}\) drawn from a distribution \(p(X)\). We can approximate the pdf using delta functions:
\[\begin{aligned} \hat{p}_{N}(x) &= \frac{1}{N}\sum_{n = 1}^{N}\delta_{x^{(n)}}(x) \end{aligned}\]And the CDF:
\[\begin{aligned} \hat{F}_{N}(x) &= \frac{1}{N}\sum_{n=1}^{N}I(x^{(n)}\leq x)\\ &= \frac{1}{N}\sum_{n=1}^{N}u_{x^{(n)}}(x)\\ u_{y}(x) &= \begin{cases} 1 & x \geq y\\ 0 & x < y \end{cases} \end{aligned}\]
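A minimal sketch of the empirical CDF:

```python
import numpy as np

def empirical_cdf(samples, x):
    """Fraction of samples <= x: the plug-in estimate of F_X(x)."""
    return np.mean(samples <= x)

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)
print(empirical_cdf(samples, 0.0))    # ≈ 0.5 for a standard normal
print(empirical_cdf(samples, 1.96))   # ≈ 0.975
```

Let \(y = f(x)\) be some deterministic transformation of \(x \sim p(x)\).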
If \(X\) is discrete:
\[\begin{aligned} P_{Y}(y) &= \sum_{x: f(x) = y}P_{X}(x) \end{aligned}\]If \(X\) is continuous:
\[\begin{aligned} F_{Y}(y) &\equiv P(Y \leq y)\\ &= P(f(X) \leq y)\\ &= P(X \in \{x|f(x)\leq y\}) \end{aligned}\]If \(f\) is invertible, we can derive the pdf by differentiating the CDF. If not, we can use numerical integration or a Monte Carlo approximation.
Suppose:
\[\begin{aligned} x &\sim \text{Unif}(0, 1)\\ y &= f(x)\\ &= 2x+1 \end{aligned}\]The function stretches and shifts the uniform probability distribution. The interval \((x, x + dx)\) gets mapped to \((y, y + dy)\), and the probability mass in these intervals is the same:
\[\begin{aligned} p(x)dx &= p(y)dy\\ p(y) &= p(x)\frac{dx}{dy} \end{aligned}\]Because a density must be non-negative, we take the absolute value:
\[\begin{aligned} P_{Y}(y) &= P_{X}(x)\Big|\frac{dx}{dy}\Big| \end{aligned}\]For any monotonic function \(f:\mathbb{R} \rightarrow \mathbb{R}\), let:
\[\begin{aligned} g &= f^{-1}\\ y &= f(x)\\ x &= g(y)\\ F_{Y}(y)&= P(f(X) \leq y)\\ &= P(X \leq f^{-1}(y))\\ &= F_{X}(f^{-1}(y))\\ &= F_{X}(g(y)) \end{aligned}\](assuming \(f\) is increasing; the absolute value below handles the decreasing case). And taking the derivative of the CDF:
\[\begin{aligned} P_{Y}(y) &\equiv \frac{d}{dy}F_{Y}(y)\\ &= \frac{d}{dy}F_{X}(x)\\ &= \frac{dx}{dy}\frac{d}{dx}F_{X}(x)\\ &= \frac{dx}{dy}P_{X}(x)\\ &= P_{X}(g(y))\frac{dx}{dy}\\ &= P_{X}(g(y))\Big|\frac{d}{dy}g(y)\Big| \end{aligned}\]The above is called the change of variables formula.
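A histogram sketch checking the formula on the earlier example \(y = 2x + 1\) with \(x \sim \text{Unif}(0, 1)\), where \(p_{Y}(y) = p_{X}(g(y))\,|dg/dy| = 1 \cdot \tfrac{1}{2}\) on \((1, 3)\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = 2 * x + 1                 # f(x) = 2x + 1, so g(y) = (y - 1) / 2

# Normalized histogram should be flat at the predicted density 0.5.
hist, edges = np.histogram(y, bins=20, range=(1.0, 3.0), density=True)
print(hist)                   # each bin ≈ 0.5
```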
For the multivariate case:
\[\begin{aligned} \boldsymbol{f}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\\ \boldsymbol{g} &= \boldsymbol{f}^{-1}\\ \boldsymbol{y} &= \boldsymbol{f}(\boldsymbol{x}) \end{aligned}\]Then:
\[\begin{aligned} P_{Y}(\boldsymbol{y}) &= P_{X}(\boldsymbol{g}(\boldsymbol{y}))|\text{det}[\boldsymbol{J}_{g}(\boldsymbol{y})]|\\ \boldsymbol{J}_{g} &= \frac{d \boldsymbol{g}(\boldsymbol{y})}{d \boldsymbol{y}^{T}} \end{aligned}\]An example in 2D is an affine transformation:
\[\begin{aligned} f(\boldsymbol{x}) &= \boldsymbol{Ax} + \boldsymbol{b}\\ \boldsymbol{A} &= \begin{bmatrix} a & b\\ c & d \end{bmatrix} \end{aligned}\]The area of the unit square changes by a factor of:
\[\begin{aligned} \mathrm{det}(\boldsymbol{A}) &= ad - bc \end{aligned}\]Another example is transforming a density from a Cartesian coordinate to polar coordinate:
\[\begin{aligned} \boldsymbol{x} &= (x_{1}, x_{2})\\ \boldsymbol{y} &= \boldsymbol{f}(x_{1}, x_{2})\\ \boldsymbol{g}(r, \theta) &= (r \mathrm{cos}(\theta), r \mathrm{sin}(\theta)) \end{aligned}\]And the Jacobian:
\[\begin{aligned} \boldsymbol{J}_{g} &= \begin{bmatrix} \frac{\partial x_{1}}{\partial r} & \frac{\partial x_{1}}{\partial \theta}\\ \frac{\partial x_{2}}{\partial r} & \frac{\partial x_{2}}{\partial \theta} \end{bmatrix}\\ &= \begin{bmatrix} \cos(\theta) & -r\sin(\theta)\\ \sin(\theta) & r\cos(\theta) \end{bmatrix}\\ |\mathrm{det}(\boldsymbol{J}_{g})| &= |r \cos^{2}(\theta) + r \sin^{2}(\theta)|\\ &= |r| \end{aligned}\] \[\begin{aligned} P_{r, \theta}(r, \theta) &= P_{x_{1}, x_{2}}(r \cos(\theta), r \sin(\theta))\ r \end{aligned}\]Suppose \(f\) is an affine function:
\[\begin{aligned} \boldsymbol{y} &= \boldsymbol{Ax} + \boldsymbol{b} \end{aligned}\]To derive the mean:
\[\begin{aligned} E[\boldsymbol{y}] &= E[\boldsymbol{Ax} + \boldsymbol{b}]\\ &= \boldsymbol{A}E[\boldsymbol{x}] + \boldsymbol{b}\\ &= \boldsymbol{A\mu} + \boldsymbol{b} \end{aligned}\]For the covariance:
\[\begin{aligned} \text{cov}[\boldsymbol{y}] &= \text{cov}(\boldsymbol{Ax}+\boldsymbol{b})\\ &= \boldsymbol{A\Sigma A}^{T} \end{aligned}\]For example, given:
\[\begin{aligned} y &= \boldsymbol{a}^{T}\boldsymbol{x}\\ &= x_{1} + x_{2}\\ \boldsymbol{a} &= \begin{bmatrix} 1\\ 1 \end{bmatrix}\\ \boldsymbol{x} &= \begin{bmatrix} x_{1}\\ x_{2} \end{bmatrix}\\ \text{var}(y) &= \text{var}(\boldsymbol{a}^{T}\boldsymbol{x})\\ &= \boldsymbol{a}^{T}\boldsymbol{\Sigma a}\\ &= \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \begin{bmatrix} 1\\ 1 \end{bmatrix}\\ &= \Sigma_{11} + \Sigma_{22} + 2\Sigma_{12}\\ &= \text{var}(x_{1}) + \text{var}(x_{2}) + 2\text{cov}(x_{1}, x_{2}) \end{aligned}\]
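A quick numpy check of the scalar case \(\text{var}(\boldsymbol{a}^{T}\boldsymbol{x}) = \boldsymbol{a}^{T}\boldsymbol{\Sigma}\boldsymbol{a}\) (the covariance matrix below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])           # known covariance matrix
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=1_000_000)

a = np.array([1.0, 1.0])
y = x @ a                                # y = x1 + x2
print(y.var())                           # empirical variance
print(a @ Sigma @ a)                     # Σ11 + Σ22 + 2Σ12 = 4.2
```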