Second recitation

Jinghan Liu

Reading guide

This chapter moves from continuous distributions, joint distributions, and marginal distributions to the least-squares idea in machine learning.
When reading a density, first ask two questions: does the normalizing constant exist, and what is the support?
The extra material is a first look at conditional expectation and projection.

Tip. Before every change of variables, check whether the map is one-to-one, whether the absolute value of the Jacobian is needed, and how the region of integration changes.

Exercise 1.5

Note

For a continuous distribution, first look at the support, then at the normalizing constant. If the density contains a parameter, first decide when the integral is finite.

Problem

Which of the following functions are density functions? If it is a density, find $C$ and the distribution function $F(x)$ .

$1$ $f(x) = \begin{cases} Cx^{-d}, & x > 1 \\ 0, & x \le 1 \end{cases}$ .

$2$ $f(x) = C e^{-x-e^{-x}}, -\infty < x < \infty$ .

Solution

$1$ For $f(x)$ to be a density, we need $\int_{-\infty}^{\infty} f(x)\mathrm{d}x = 1$ .

\int_1^{\infty} C x^{-d} \mathrm{d}x = \lim_{t \to \infty} \left[ \frac{C}{1-d} x^{1-d} \right]_1^t.

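The improper integral converges only when $1-d<0$ , that is, when $d>1$ . For $d\le 1$ it diverges (for $d=1$ the antiderivative is $C\ln x$ ), so $f$ is not a density for any $C$ . When $d>1$ the integral equals $\frac{C}{d-1}$ , so $C=d-1$ . The distribution function is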

F(x)=0,\quad x\le 1,

and

F(x)=\int_1^x (d-1)t^{-d}\mathrm{d}t=1-x^{-(d-1)},\quad x>1.

$2$ Check normalization:

\int_{-\infty}^{\infty} C e^{-x-e^{-x}} \mathrm{d}x.

Let $u=e^{-x}$ . Then $\mathrm{d}u=-e^{-x}\mathrm{d}x$ . As $x\to-\infty$ , $u\to\infty$ ; as $x\to\infty$ , $u\to0$ . Hence

\int_{\infty}^{0} C e^{-u} (-\mathrm{d}u) = C \int_0^{\infty} e^{-u} \mathrm{d}u = C.

So $C=1$ . The distribution function is

F(x) = \int_{-\infty}^x e^{-t-e^{-t}} \mathrm{d}t = \int_{e^{-x}}^{\infty} e^{-u} \mathrm{d}u = e^{-e^{-x}}, \quad -\infty < x < \infty.
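Both answers can be checked numerically. The sketch below integrates each density with a midpoint rule and compares a finite-difference derivative of $F$ with $f$ . The value `d = 3.0`, the truncation limits, and the helper names `midpoint` and `central_diff` are illustrative assumptions, not part of the exercise. The function names use the standard names of the two distributions (Pareto and standard Gumbel).

```python
import math

def midpoint(func, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (k + 0.5) * h) for k in range(n))

def central_diff(F, x, h=1e-5):
    """Symmetric finite-difference approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

d = 3.0  # assumed example value; any d > 1 gives a density

def pareto_pdf(x):
    return (d - 1) * x ** (-d) if x > 1 else 0.0

def pareto_cdf(x):
    return 1 - x ** (-(d - 1)) if x > 1 else 0.0

def gumbel_pdf(x):
    return math.exp(-x - math.exp(-x))

def gumbel_cdf(x):
    return math.exp(-math.exp(-x))

# Total mass, truncated to ranges where the neglected tails are tiny.
pareto_mass = midpoint(pareto_pdf, 1.0, 1000.0)
gumbel_mass = midpoint(gumbel_pdf, -10.0, 40.0)

# F' should agree with f at interior points.
pareto_slope_gap = abs(central_diff(pareto_cdf, 2.0) - pareto_pdf(2.0))
gumbel_slope_gap = abs(central_diff(gumbel_cdf, 0.5) - gumbel_pdf(0.5))
```

Both masses should be close to $1$ and both slope gaps close to $0$ . This is only a sanity check on the algebra, not a proof.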

Problem

Let $U$ be uniformly distributed on $(0,1)$ on some probability space. Let $F$ be a strictly increasing distribution function. Define a new random variable $Y=F^{-1}(U)$ , that is, $Y(\omega)=F^{-1}(U(\omega))$ . Prove that the distribution function of $Y$ is $F$ .

Proof

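Since $F$ is strictly increasing, $F^{-1}$ exists on the range of $F$ and is also strictly increasing. As the problem implicitly does, we assume $F$ is continuous, so that this range contains $(0,1)$ and $F^{-1}(U)$ is well defined. For any real $y$ ,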

F_Y(y)=\mathbb{P}(Y\le y) =\mathbb{P}(F^{-1}(U)\le y).

Since $F$ is strictly increasing, $F^{-1}(U)\le y$ holds if and only if $U\le F(y)$ , so

\mathbb{P}(F^{-1}(U)\le y)=\mathbb{P}(U\le F(y)).

Since $U\sim U(0,1)$ and $0\le F(y)\le 1$ ,

\mathbb{P}(U\le F(y))=F(y).

Thus $F_Y(y)=F(y)$ .
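This result is the basis of inverse transform sampling. As a minimal sketch, assume $F$ is the logistic distribution function $F(y)=1/(1+e^{-y})$ (an example chosen here because it is strictly increasing on all of $\mathbb{R}$ and has an explicit inverse). The code draws $Y=F^{-1}(U)$ and compares the empirical distribution function with $F$ at a few points. The seed, sample size, and check points are arbitrary.

```python
import math
import random

def F(y):
    """Logistic distribution function, strictly increasing on the real line."""
    return 1.0 / (1.0 + math.exp(-y))

def F_inv(u):
    """Inverse of F on (0, 1)."""
    return math.log(u / (1.0 - u))

def uniform_open(rng):
    """Draw U from (0, 1); random() may return exactly 0.0, so redraw then."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

rng = random.Random(0)
samples = [F_inv(uniform_open(rng)) for _ in range(100_000)]

def empirical_cdf(data, y):
    """Fraction of the sample that is at most y."""
    return sum(1 for s in data if s <= y) / len(data)

check_points = (-3.0, -1.0, 0.0, 0.5, 2.0)
max_gap = max(abs(empirical_cdf(samples, y) - F(y)) for y in check_points)
```

With this sample size, `max_gap` should be small, typically of order $10^{-3}$ .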

Problem

Let $(X,Y)$ be an integer-valued random vector with joint probability mass function $f(x,y)$ . Prove that for $x,y\in\mathbb{Z}$ ,

\begin{aligned} f(x, y) &= \mathbb{P}(X \ge x, Y \le y) - \mathbb{P}(X \ge x+1, Y \le y) \\ &\quad - \mathbb{P}(X \ge x, Y \le y-1) + \mathbb{P}(X \ge x+1, Y \le y-1). \end{aligned}

Then find the joint probability mass function of the minimum $X_{\min}$ and the maximum $X_{\max}$ in $r$ rolls of a fair die.

Solution

First, write

\{X \ge x, Y \le y\} = \{X = x, Y \le y\} \cup \{X \ge x+1, Y \le y\}.

The two events on the right are disjoint. Hence

\mathbb{P}(X = x, Y \le y) =\mathbb{P}(X \ge x, Y \le y)-\mathbb{P}(X \ge x+1, Y \le y).

The same argument with $y-1$ gives

\mathbb{P}(X = x, Y \le y-1) =\mathbb{P}(X \ge x, Y \le y-1)-\mathbb{P}(X \ge x+1, Y \le y-1).

Also,

\{X = x, Y \le y\} = \{X = x, Y = y\} \cup \{X = x, Y \le y-1\}.

Therefore

f(x,y)=\mathbb{P}(X=x,Y=y) =\mathbb{P}(X=x,Y\le y)-\mathbb{P}(X=x,Y\le y-1),

and substituting the two previous identities gives the desired formula.

For $r$ rolls of a fair die, the event $X_{\min}\ge i$ and $X_{\max}\le j$ means that all $r$ outcomes lie in the interval $[i,j]$ . If $1\le i\le j\le 6$ , then

\mathbb{P}(X_{\min} \ge i, X_{\max} \le j) =\left(\frac{j-i+1}{6}\right)^r.

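Apply the formula just proved with $X=X_{\min}$ and $Y=X_{\max}$ . For $1\le i<j\le 6$ the four terms are $(j-i+1)^r$ , $(j-i)^r$ , $(j-i)^r$ , and $(j-i-1)^r$ , each divided by $6^r$ (the last one is $0$ when $j=i+1$ ), so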

f(i,j)=\mathbb{P}(X_{\min}=i,X_{\max}=j) =\frac{(j-i+1)^r-2(j-i)^r+(j-i-1)^r}{6^r}.

For $1\le i=j\le 6$ , all rolls must be equal to $i$ , so

f(i,i)=\frac{1}{6^r}.

In all other cases, $f(i,j)=0$ .
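The formula can be checked exactly by enumerating all $6^r$ outcomes for a small $r$ . The sketch below does this with exact fractions. The function names and the choice `r = 4` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def pmf_formula(i, j, r):
    """Joint pmf of (min, max) of r fair die rolls, from the closed form."""
    if not (1 <= i <= j <= 6):
        return Fraction(0)
    if i == j:
        return Fraction(1, 6 ** r)
    k = j - i
    return Fraction((k + 1) ** r - 2 * k ** r + (k - 1) ** r, 6 ** r)

def pmf_bruteforce(r):
    """Joint pmf of (min, max) obtained by listing every outcome."""
    counts = {}
    for rolls in product(range(1, 7), repeat=r):
        key = (min(rolls), max(rolls))
        counts[key] = counts.get(key, 0) + 1
    return {key: Fraction(c, 6 ** r) for key, c in counts.items()}

r = 4  # assumed small example; the enumeration has 6**r outcomes
exact = pmf_bruteforce(r)
agree = all(
    pmf_formula(i, j, r) == exact.get((i, j), Fraction(0))
    for i in range(1, 7)
    for j in range(1, 7)
)
total = sum(pmf_formula(i, j, r) for i in range(1, 7) for j in range(1, 7))
```

Here `agree` compares the two tables entry by entry, and `total` sums the closed-form pmf over all pairs.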

Problem

Is the function

F(x, y) = \begin{cases} 1 - e^{-x-y}, & x, y \ge 0 \\ 0, & \text{otherwise} \end{cases}

the joint distribution function of some random vector $(X,Y)$ ? If yes, find the marginal distribution functions of $X$ and $Y$ . If not, explain why.

Solution

It is not. A joint distribution function must assign nonnegative probability to every rectangle. That is, for any $x_1<x_2$ and $y_1<y_2$ ,

\mathbb{P}(x_1 < X \le x_2, y_1 < Y \le y_2) = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0.

Take $x_1=0$ , $x_2=1$ , $y_1=0$ , and $y_2=1$ . Then

\begin{aligned} & F(1,1) - F(0,1) - F(1,0) + F(0,0) \\ &= (1 - e^{-2}) - (1 - e^{-1}) - (1 - e^{-1}) + (1 - e^0) \\ &= -1 + 2e^{-1} - e^{-2} \\ &= -(1-e^{-1})^2 < 0. \end{aligned}

This would give a negative probability for a rectangle, so the function cannot be a joint distribution function.
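The rectangle test is easy to evaluate numerically. The sketch below computes the rectangle mass for the given $F$ . For contrast it also uses the product function $(1-e^{-x})(1-e^{-y})$ , which is introduced here only as a comparison and is not part of the problem.

```python
import math

def F(x, y):
    """The candidate function from the problem."""
    return 1 - math.exp(-x - y) if x >= 0 and y >= 0 else 0.0

def F_product(x, y):
    """Comparison only: joint distribution function of two independent Exp(1) variables."""
    return (1 - math.exp(-x)) * (1 - math.exp(-y)) if x >= 0 and y >= 0 else 0.0

def rectangle_mass(G, x1, x2, y1, y2):
    """Mass that G would assign to the rectangle (x1, x2] x (y1, y2]."""
    return G(x2, y2) - G(x1, y2) - G(x2, y1) + G(x1, y1)

mass = rectangle_mass(F, 0, 1, 0, 1)                  # negative
mass_product = rectangle_mass(F_product, 0, 1, 0, 1)  # positive
```

The first value equals $-(1-e^{-1})^2$ , in agreement with the computation above. The second equals $(1-e^{-1})^2$ .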

Problem

Let $X_1$ and $X_2$ be independent random variables with the same distribution function $F(x)$ . Define

U = \max\{X_1, X_2\}, \quad V = \min\{X_1, X_2\}.

$1$ Find the distribution functions of $U$ and $V$ . $2$ Find the joint distribution function of $(U,V)$ .

Solution

$1$ For $U=\max\{X_1,X_2\}$ ,

F_U(u)=\mathbb{P}(U\le u)=\mathbb{P}(X_1\le u,X_2\le u).

By independence,

F_U(u)=\mathbb{P}(X_1\le u)\mathbb{P}(X_2\le u)=F(u)^2.

For $V=\min\{X_1,X_2\}$ ,

F_V(v)=\mathbb{P}(V\le v) =1-\mathbb{P}(V>v) =1-\mathbb{P}(X_1>v,X_2>v).

Again by independence,

F_V(v)=1-(1-F(v))^2=2F(v)-F(v)^2.

$2$ The joint distribution function is $F_{U,V}(u,v)=\mathbb{P}(U\le u,V\le v)$ . Since $V\le U$ always holds, there are two cases.

If $u\le v$ , then $U\le u$ implies $V\le u\le v$ . Hence

F_{U,V}(u,v)=\mathbb{P}(U\le u)=F(u)^2.

If $u>v$ , then

\mathbb{P}(U\le u,V\le v) =\mathbb{P}(U\le u)-\mathbb{P}(U\le u,V>v).

The event $\{U\le u,V>v\}$ is the same as $\{v<X_1\le u,\ v<X_2\le u\}$ . By independence,

\mathbb{P}(v<X_1\le u,\ v<X_2\le u)=[F(u)-F(v)]^2.

Therefore

F_{U,V}(u,v)=F(u)^2-(F(u)-F(v))^2=2F(u)F(v)-F(v)^2.

Combining the two cases,

F_{U,V}(u,v)= \begin{cases} F(u)^2, & u \le v,\\ 2F(u)F(v)-F(v)^2, & u>v. \end{cases}
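The derivation uses only independence and the common distribution function, so it can be tested exactly on a small discrete example. The sketch below assumes $X_1,X_2$ are uniform on $\{1,2,3,4\}$ (an illustrative choice) and compares the formula with a direct enumeration of the 16 outcomes.

```python
from fractions import Fraction
from itertools import product

values = [1, 2, 3, 4]            # assumed example: X_i uniform on {1, 2, 3, 4}
p = Fraction(1, len(values))

def F(t):
    """Common distribution function of X_1 and X_2."""
    return sum((p for x in values if x <= t), Fraction(0))

def joint_cdf_formula(u, v):
    """Joint distribution function of (max, min) from the two-case formula."""
    if u <= v:
        return F(u) ** 2
    return 2 * F(u) * F(v) - F(v) ** 2

def joint_cdf_bruteforce(u, v):
    """P(max <= u, min <= v) by summing over all outcomes."""
    return sum(
        (p * p for x1, x2 in product(values, repeat=2)
         if max(x1, x2) <= u and min(x1, x2) <= v),
        Fraction(0),
    )

agree = all(
    joint_cdf_formula(u, v) == joint_cdf_bruteforce(u, v)
    for u in range(0, 6)
    for v in range(0, 6)
)
```

Letting one argument grow large recovers the marginals: `joint_cdf_formula(10, v)` equals $2F(v)-F(v)^2$ and `joint_cdf_formula(u, 10)` equals $F(u)^2$ .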

Exercise 1.6

Note

For joint distribution problems, draw the region first. Many mistakes come not from the integral itself, but from limits that do not match the region.

Problem

Let $g,h:\mathbb{R}\to\mathbb{R}$ be Borel measurable functions. Suppose $X$ and $Y$ are independent discrete random variables. Without using Theorem 1.6.4, prove directly that $g(X)$ and $h(Y)$ are independent.

Proof

Let $U=g(X)$ and $V=h(Y)$ . Since $X$ and $Y$ are discrete, $U$ and $V$ are also discrete. For any values $u$ and $v$ that $U$ and $V$ can take,

\mathbb{P}(U=u,V=v) =\mathbb{P}(g(X)=u,h(Y)=v) =\mathbb{P}(X\in g^{-1}(u),Y\in h^{-1}(v)).

Here $g^{-1}(u)=\{x:g(x)=u\}$ and $h^{-1}(v)=\{y:h(y)=v\}$ . Since $X$ and $Y$ are independent,

\begin{aligned} \mathbb{P}(X \in g^{-1}(u), Y \in h^{-1}(v)) &= \sum_{x \in g^{-1}(u)} \sum_{y \in h^{-1}(v)} \mathbb{P}(X = x, Y = y) \\ &= \sum_{x \in g^{-1}(u)} \sum_{y \in h^{-1}(v)} \mathbb{P}(X = x)\mathbb{P}(Y = y) \\ &= \left( \sum_{x \in g^{-1}(u)} \mathbb{P}(X = x) \right) \left( \sum_{y \in h^{-1}(v)} \mathbb{P}(Y = y) \right) \\ &= \mathbb{P}(X \in g^{-1}(u)) \mathbb{P}(Y \in h^{-1}(v)) \\ &= \mathbb{P}(g(X) = u) \mathbb{P}(h(Y) = v) \\ &= \mathbb{P}(U = u) \mathbb{P}(V = v). \end{aligned}

Thus $g(X)$ and $h(Y)$ are independent.
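The regrouping of the double sum can be mirrored on a small example. In the sketch below, the two pmfs and the functions `g` and `h` are assumptions chosen for illustration. The joint pmf of $(U,V)$ is built from the product pmf of $(X,Y)$ and then compared with the product of the marginals.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed example distributions of the independent variables X and Y.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
pmf_y = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}

def g(x):
    return x * x      # not one-to-one, so several x map to the same u

def h(y):
    return y % 2

pmf_u = defaultdict(Fraction)
pmf_v = defaultdict(Fraction)
joint_uv = defaultdict(Fraction)

for x, px in pmf_x.items():
    pmf_u[g(x)] += px
for y, py in pmf_y.items():
    pmf_v[h(y)] += py
for x, px in pmf_x.items():
    for y, py in pmf_y.items():
        # Independence of X and Y: P(X = x, Y = y) = P(X = x) P(Y = y).
        joint_uv[(g(x), h(y))] += px * py

factorizes = all(
    joint_uv[(u, v)] == pmf_u[u] * pmf_v[v]
    for u in list(pmf_u)
    for v in list(pmf_v)
)
```

The check succeeds for the same reason as in the proof: the sum over $g^{-1}(u)\times h^{-1}(v)$ splits into a product of two sums.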

Problem

Let $X_1,X_2,X_3$ be independent positive-integer-valued random variables with probability mass functions

\mathbb{P}(X_i = x) = (1 - p_i)p_i^{x-1},\quad i = 1, 2, 3.

$1$ Prove that

\mathbb{P}(X_1 < X_2 < X_3) = \frac{(1 - p_1)(1 - p_2)p_2p_3^2}{(1 - p_2p_3)(1 - p_1p_2p_3)}.

$2$ Find $\mathbb{P}(X_1 \le X_2 \le X_3)$ .

Solution

$1$ By independence,

\mathbb{P}(X_1 < X_2 < X_3) = \sum_{x_1=1}^{\infty} \sum_{x_2=x_1+1}^{\infty} \sum_{x_3=x_2+1}^{\infty} \mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2)\mathbb{P}(X_3=x_3).

First compute the inner sum:

\sum_{x_3=x_2+1}^{\infty} (1-p_3)p_3^{x_3-1} = p_3^{x_2}.

Then

\sum_{x_2=x_1+1}^{\infty} (1-p_2)p_2^{x_2-1}p_3^{x_2} =(1-p_2)p_3\sum_{x_2=x_1+1}^{\infty}(p_2p_3)^{x_2-1} =(1-p_2)p_3\frac{(p_2p_3)^{x_1}}{1-p_2p_3}.

Substituting into the outer sum gives

\begin{aligned} \mathbb{P}(X_1 < X_2 < X_3) &= \sum_{x_1=1}^{\infty} (1-p_1)p_1^{x_1-1} \frac{(1-p_2)p_3 (p_2p_3)^{x_1}}{1-p_2p_3} \\ &= \frac{(1-p_1)(1-p_2)p_2p_3^2}{1-p_2p_3} \sum_{x_1=1}^{\infty}(p_1p_2p_3)^{x_1-1} \\ &= \frac{(1-p_1)(1-p_2)p_2p_3^2}{(1-p_2p_3)(1-p_1p_2p_3)}. \end{aligned}

$2$ For $\mathbb{P}(X_1\le X_2\le X_3)$ , change only the lower limits:

\mathbb{P}(X_1 \le X_2 \le X_3) = \sum_{x_1=1}^{\infty} \sum_{x_2=x_1}^{\infty} \sum_{x_3=x_2}^{\infty} \mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2)\mathbb{P}(X_3=x_3).

The inner sum is

\sum_{x_3=x_2}^{\infty} (1-p_3)p_3^{x_3-1} = p_3^{x_2-1}.

The second sum is

\sum_{x_2=x_1}^{\infty} (1-p_2)p_2^{x_2-1}p_3^{x_2-1} =(1-p_2)\frac{(p_2p_3)^{x_1-1}}{1-p_2p_3}.

Thus

\begin{aligned} \mathbb{P}(X_1 \le X_2 \le X_3) &= \sum_{x_1=1}^{\infty} (1-p_1)p_1^{x_1-1} \frac{(1-p_2)(p_2p_3)^{x_1-1}}{1-p_2p_3} \\ &= \frac{(1-p_1)(1-p_2)}{1-p_2p_3} \sum_{x_1=1}^{\infty}(p_1p_2p_3)^{x_1-1} \\ &= \frac{(1-p_1)(1-p_2)}{(1-p_2p_3)(1-p_1p_2p_3)}. \end{aligned}
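Both closed forms can be compared with truncated sums. The sketch below uses the inner sums computed above, $p_3^{x_2}$ and $p_3^{x_2-1}$ , and adds the remaining two sums numerically. The parameter values and the truncation point `n_max` are illustrative assumptions.

```python
def geom_pmf(p, x):
    """P(X = x) = (1 - p) p^(x - 1) for x = 1, 2, ..."""
    return (1 - p) * p ** (x - 1)

def ordered_prob(p1, p2, p3, strict, n_max=200):
    """Truncated double sum for P(X1 < X2 < X3) or P(X1 <= X2 <= X3)."""
    step = 1 if strict else 0
    total = 0.0
    for x1 in range(1, n_max):
        for x2 in range(x1 + step, n_max):
            # Inner sum over x3: P(X3 > x2) if strict, else P(X3 >= x2).
            tail = p3 ** x2 if strict else p3 ** (x2 - 1)
            total += geom_pmf(p1, x1) * geom_pmf(p2, x2) * tail
    return total

p1, p2, p3 = 0.3, 0.5, 0.6   # assumed example parameters in (0, 1)
denom = (1 - p2 * p3) * (1 - p1 * p2 * p3)
closed_strict = (1 - p1) * (1 - p2) * p2 * p3 ** 2 / denom
closed_weak = (1 - p1) * (1 - p2) / denom

gap_strict = abs(ordered_prob(p1, p2, p3, strict=True) - closed_strict)
gap_weak = abs(ordered_prob(p1, p2, p3, strict=False) - closed_weak)
```

For these parameters the neglected tail terms are far below floating-point precision, so both gaps should be at rounding level.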

Problem

Let $X_1,X_2,X_3,X_4,X_5$ be independent continuous random variables with the same distribution function $F$ . Let

I = \mathbb{P}(X_1 < X_2 < X_3 < X_4 < X_5).

Prove that $I$ does not depend on $F$ , and find its value.

Proof

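Since the variables are continuous and independent, ties have probability $0$ , so with probability $1$ the five values fall in exactly one of the $5!=120$ strict orderings. Since the variables are also identically distributed, the joint distribution of $(X_1,\dots,X_5)$ is unchanged when the indices are permuted, so each ordering has the same probability. These $120$ equal probabilities add up to $1$ , and the event $X_1<X_2<X_3<X_4<X_5$ is one of the orderings. Hence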

I=\mathbb{P}(X_1 < X_2 < X_3 < X_4 < X_5)=\frac{1}{5!}=\frac{1}{120}.

This value is constant and does not depend on the specific distribution function $F$ .
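A short simulation illustrates that the answer does not depend on $F$ . The sketch below estimates the probability for two assumed choices of $F$ , uniform on $(0,1)$ and exponential with rate $1$ . The seed and the number of trials are arbitrary.

```python
import random

def prob_increasing(draw, n_trials, rng):
    """Monte Carlo estimate of P(X1 < X2 < X3 < X4 < X5) for i.i.d. draws."""
    hits = 0
    for _ in range(n_trials):
        xs = [draw(rng) for _ in range(5)]
        if all(a < b for a, b in zip(xs, xs[1:])):
            hits += 1
    return hits / n_trials

rng = random.Random(1)
est_uniform = prob_increasing(lambda r: r.random(), 100_000, rng)
est_expo = prob_increasing(lambda r: r.expovariate(1.0), 100_000, rng)
target = 1 / 120
```

Both estimates should be close to $1/120\approx 0.0083$ , up to Monte Carlo error.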

Problem

Throw 3 points independently and uniformly on the interval $[0,1]$ . Find: (1) the distribution function of the middle point; (2) the joint density of the leftmost point and the rightmost point.

Solution

Let the three points be $X_1,X_2,X_3\sim U(0,1)$ , independent. Write the order statistics as $X_{(1)}\le X_{(2)}\le X_{(3)}$ .

$1$ For $x\in[0,1]$ , the event $X_{(2)}\le x$ means that at least two of the three points are at most $x$ . This is a binomial count with three trials and success probability $x$ . Thus

F_{(2)}(x)=\mathbb{P}(X_{(2)}\le x) =\binom{3}{2}x^2(1-x)+\binom{3}{3}x^3 =3x^2-2x^3.

For $x<0$ , the value is $0$ ; for $x>1$ , the value is $1$ .

$2$ Let $U=X_{(1)}$ and $V=X_{(3)}$ . For $0\le u\le v\le 1$ ,

\mathbb{P}(U>u,V\le v) =\mathbb{P}(\text{all three points lie in }(u,v]) =(v-u)^3.

On the other hand,

\mathbb{P}(U>u,V\le v) =\mathbb{P}(V\le v)-\mathbb{P}(U\le u,V\le v) =F_V(v)-F_{U,V}(u,v).

So $F_{U,V}(u,v)=F_V(v)-(v-u)^3$ . The term $F_V(v)$ does not depend on $u$ , so it disappears under the mixed derivative:

f_{U,V}(u,v) =\frac{\partial^2 F_{U,V}(u,v)}{\partial u\partial v} =\frac{\partial^2}{\partial u\partial v}[-(v-u)^3] =6(v-u).

Thus $f_{U,V}(u,v)=6(v-u)$ for $0\le u\le v\le 1$ , and $0$ elsewhere.
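Both parts can be sanity-checked. The sketch below integrates the joint density over the unit square with a midpoint grid, then estimates $\mathbb{P}(X_{(2)}\le 0.3)$ and $\mathbb{P}(X_{(1)}>0.2,\ X_{(3)}\le 0.7)$ by simulation. The grid size, seed, and test values are illustrative assumptions.

```python
import random

def middle_cdf(x):
    """Distribution function of the middle point on [0, 1]."""
    return 3 * x ** 2 - 2 * x ** 3

def joint_density(u, v):
    """Joint density of the leftmost point u and the rightmost point v."""
    return 6 * (v - u) if 0 <= u <= v <= 1 else 0.0

# Midpoint-grid integral of the joint density over the unit square.
n = 400
h = 1.0 / n
total_mass = h * h * sum(
    joint_density((a + 0.5) * h, (b + 0.5) * h)
    for a in range(n)
    for b in range(n)
)

rng = random.Random(4)
trials = 100_000
hits_middle = 0
hits_rect = 0
for _ in range(trials):
    low, mid, high = sorted(rng.random() for _ in range(3))
    hits_middle += mid <= 0.3
    hits_rect += (low > 0.2 and high <= 0.7)

est_middle = hits_middle / trials   # compare with middle_cdf(0.3)
est_rect = hits_rect / trials       # compare with (0.7 - 0.2) ** 3
```

The grid integral is slightly below $1$ because the cells on the diagonal are evaluated at points where the density is $0$ . The error shrinks as `n` grows.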

Extra material: basics of machine learning

Note

Least squares can be treated as algebra, but it is also a projection problem in a Hilbert space. This is a useful way to meet conditional expectation.

1. Problem and motivation

In a regression problem, we view the feature and the label as random variables $X$ and $Y$ on the same probability space. The goal is to find a prediction function $g$ so that $g(X)$ is close to the true value $Y$ .

We usually measure closeness by mean squared error. Assume $Y\in L^2$ and $\mathbb{E}[g(X)^2]<\infty$ . Then the best predictor solves

g^\ast = \mathop{\arg\min}_{g} \mathbb{E}[(Y - g(X))^2].

2. The best predictor

We show in two ways that, under mean squared error, the best predictor is the conditional expectation: $g^\ast(X)=\mathbb{E}[Y|X]$ .

View 1: completing the square

Add and subtract $\mathbb{E}[Y|X]$ :

\begin{aligned} \mathbb{E}[(Y-g(X))^2] &= \mathbb{E}\left[\left((Y-\mathbb{E}[Y|X])+(\mathbb{E}[Y|X]-g(X))\right)^2\right] \\ &= \mathbb{E}[(Y-\mathbb{E}[Y|X])^2] +2\mathbb{E}[(Y-\mathbb{E}[Y|X])(\mathbb{E}[Y|X]-g(X))] \\ &\quad + \mathbb{E}[(\mathbb{E}[Y|X]-g(X))^2]. \end{aligned}

The cross term is zero by conditioning on $X$ :

\mathbb{E}\left[ \mathbb{E}[(Y-\mathbb{E}[Y|X])(\mathbb{E}[Y|X]-g(X)) \mid X] \right] = \mathbb{E}\left[ (\mathbb{E}[Y|X]-g(X)) \underbrace{\mathbb{E}[Y-\mathbb{E}[Y|X]\mid X]}_{=0} \right] =0.

Hence the mean squared error splits into two parts:

\mathbb{E}[(Y-g(X))^2] =\mathbb{E}[(Y-\mathbb{E}[Y|X])^2] +\mathbb{E}[(\mathbb{E}[Y|X]-g(X))^2].

The first term does not depend on $g$ . The second term is nonnegative, and it equals $0$ exactly when $g(X)=\mathbb{E}[Y|X]$ a.s. So the best predictor is $g^\ast(X)=\mathbb{E}[Y|X]$ a.s.
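The decomposition can be verified exactly on a finite example. The sketch below assumes a small joint pmf for $(X,Y)$ and an arbitrary competing predictor `g_other`. Both are made up for illustration. All quantities are computed with exact fractions.

```python
from fractions import Fraction

# Assumed toy joint pmf of (X, Y): keys are (x, y), values are probabilities.
joint = {
    (0, 0): Fraction(1, 8), (0, 2): Fraction(1, 8),
    (1, 1): Fraction(1, 4), (1, 5): Fraction(1, 4),
    (2, 4): Fraction(1, 4),
}

def expect(func):
    """E[func(X, Y)] under the joint pmf."""
    return sum(p * func(x, y) for (x, y), p in joint.items())

def cond_mean(x0):
    """E[Y | X = x0]."""
    px = sum(p for (x, _), p in joint.items() if x == x0)
    return sum(p * y for (x, y), p in joint.items() if x == x0) / px

def mse(g):
    """E[(Y - g(X))^2]."""
    return expect(lambda x, y: (y - g(x)) ** 2)

def g_other(x):
    return 2 * x   # an arbitrary competing predictor

noise_part = mse(cond_mean)   # E[(Y - E[Y|X])^2]
gap_part = expect(lambda x, y: (cond_mean(x) - g_other(x)) ** 2)
mse_other = mse(g_other)
```

In this example `noise_part` is $9/4$ , `gap_part` is $3/4$ , and `mse_other` is $3$ , so the identity holds with no rounding involved.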

View 2: projection in a Hilbert space

The random variables with finite second moment form the Hilbert space $L^2(\Omega,\mathcal{F},\mathbb{P})$ , with inner product $\langle X,Y\rangle=\mathbb{E}[XY]$ . The squared distance is $\|X-Y\|^2=\mathbb{E}[(X-Y)^2]$ .

All square-integrable random variables of the form $g(X)$ form a closed subspace, denoted by $\mathcal{H}_X$ . Finding the best $g^\ast(X)$ means finding the point $\hat Y\in\mathcal{H}_X$ closest to $Y$ .

Step 1: the closest point gives an orthogonal error.

Let $e=Y-\hat Y$ . Suppose there is some $V\in\mathcal{H}_X$ with $\langle e,V\rangle\ne0$ . Consider the perturbed prediction $\hat Y+tV$ . Its squared distance from $Y$ is

f(t)=\|Y-(\hat Y+tV)\|^2=\|e-tV\|^2 =\|e\|^2-2t\langle e,V\rangle+t^2\|V\|^2.

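This quadratic has derivative $f'(0)=-2\langle e,V\rangle\ne0$ , so $t=0$ is not a minimum: for a small enough $t$ with the same sign as $\langle e,V\rangle$ , we get $f(t)<f(0)$ . Then $\hat Y+tV\in\mathcal{H}_X$ is closer to $Y$ than $\hat Y$ , contradicting the choice of $\hat Y$ . Therefore $\langle Y-\hat Y,V\rangle=0$ for all $V\in\mathcal{H}_X$ .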

Step 2: conditional expectation has this orthogonality.

Let $\hat Y=\mathbb{E}[Y|X]$ . For any $V=g(X)$ ,

\langle Y-\mathbb{E}[Y|X],g(X)\rangle =\mathbb{E}[(Y-\mathbb{E}[Y|X])g(X)] =\mathbb{E}\left[g(X)\,\mathbb{E}[Y-\mathbb{E}[Y|X]\mid X]\right] =\mathbb{E}\left[g(X)(\mathbb{E}[Y|X]-\mathbb{E}[Y|X])\right] =0.

Thus $\mathbb{E}[Y|X]$ is the orthogonal projection of $Y$ onto $\mathcal{H}_X$ , the subspace of square-integrable functions of $X$ (informally, onto the information generated by $X$ ).
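The same orthogonality holds on a data set if $\mathbb{P}$ is replaced by the empirical distribution of the sample: the conditional mean becomes the average of $y$ within each group of equal $x$ , and the residual is orthogonal to every function of $x$ . The sketch below checks this on synthetic data. The data-generating rule, the seed, and the list of test functions are assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(2)
# Assumed synthetic data: X takes values 0..3 and Y = X^2 plus Gaussian noise.
xs = [rng.randrange(4) for _ in range(2000)]
ys = [x * x + rng.gauss(0.0, 1.0) for x in xs]

# Empirical conditional mean: average of y within each group of equal x.
sums = defaultdict(float)
counts = defaultdict(int)
for x, y in zip(xs, ys):
    sums[x] += y
    counts[x] += 1
group_mean = {x: sums[x] / counts[x] for x in sums}

residuals = [y - group_mean[x] for x, y in zip(xs, ys)]

def inner(us, vs):
    """Empirical inner product: the sample average of u * v."""
    return sum(u * v for u, v in zip(us, vs)) / len(us)

test_functions = [
    lambda x: 1.0,
    lambda x: float(x),
    lambda x: float(x ** 3),
    lambda x: float(x == 2),
]
inner_products = [inner(residuals, [g(x) for x in xs]) for g in test_functions]
```

Every entry of `inner_products` should be $0$ up to floating-point rounding, whatever function of $x$ is tried.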

3. From probability to statistics: prediction with data

The formula above gives the ideal predictor

g^\ast(x)=\mathbb{E}[Y\mid X=x].

To compute it exactly, we would need the full joint distribution of $X$ and $Y$ . This is the probability-theory setting: the distribution is known, and we study what follows from it.

In statistics and machine learning, the distribution is usually unknown. We only have a finite i.i.d. sample

\mathcal{D}=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}.

From this sample we train an estimator $\hat f_{\mathcal D}(x)$ , hoping that it is close to the unknown function $f(x)=\mathbb{E}[Y\mid X=x]$ .

Since the training set $\mathcal D$ is random, the fitted model $\hat f_{\mathcal D}(x)$ is also random. So when we judge a model, we should not look at one training set alone. We also ask how it behaves over repeated samples from the same data-generating process.

4. Bias-variance decomposition

Suppose the true relation is

Y=f(x)+\epsilon,

where $f(x)=\mathbb{E}[Y\mid X=x]$ is the best possible prediction, and $\epsilon$ is noise with $\mathbb{E}[\epsilon]=0$ and $\operatorname{Var}(\epsilon)=\sigma^2$ .

At a fixed test point $x$ , let $\hat f(x)$ be the predictor trained from a random data set $\mathcal D$ . The expected test error is

\operatorname{Err}(x) =\mathbb{E}_{\mathcal D,\epsilon}\left[(Y-\hat f(x))^2\right].

Add and subtract $\mathbb{E}_{\mathcal D}[\hat f(x)]$ and $f(x)$ :

\begin{aligned} \operatorname{Err}(x) &=\mathbb{E}\left[(f(x)+\epsilon-\hat f(x))^2\right] \\ &=\mathbb{E}\left[ \Big((f(x)-\mathbb{E}[\hat f(x)]) +(\mathbb{E}[\hat f(x)]-\hat f(x)) +\epsilon\Big)^2 \right]. \end{aligned}

After expanding, the cross terms have expectation $0$ : the noise has mean $0$ , it is independent of the fitted model, and $\mathbb{E}[\mathbb{E}[\hat f(x)]-\hat f(x)]=0$ . Hence

\operatorname{Err}(x) =\underbrace{(f(x)-\mathbb{E}[\hat f(x)])^2}_{\text{bias}^2} +\underbrace{\mathbb{E}\left[(\hat f(x)-\mathbb{E}[\hat f(x)])^2\right]}_{\text{variance}} +\underbrace{\sigma^2}_{\text{irreducible error}}.

The terms have simple meanings.

The bias is $\mathbb{E}[\hat f(x)]-f(x)$ . Large bias means the fitted model is systematically off on average, usually because the model class is too limited. For example, a linear model may miss a nonlinear curve. This is the typical signature of underfitting.
The variance is $\operatorname{Var}(\hat f(x))$ . Large variance means the fitted model changes a lot when the training data changes. This is the typical signature of overfitting.
The irreducible error $\sigma^2$ is the noise in the data. No prediction rule can remove it.

Making a model more complex often lowers bias but raises variance. A good model is one that keeps the sum $\text{Bias}^2+\text{Variance}$ small.
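The trade-off can be seen in a small simulation. The sketch below makes several simplifying assumptions that are not in the text: a fixed design on a grid, the true function $f(x)=\sin(2\pi x)$ , Gaussian noise with $\sigma=0.5$ , and two deliberately extreme predictors. One predicts the overall mean of $y$ (high bias, low variance). The other copies the observed $y$ at the design point nearest to the test point (low bias, high variance).

```python
import math
import random

def f(x):
    return math.sin(2 * math.pi * x)   # assumed true regression function

sigma = 0.5                                   # assumed noise standard deviation
grid = [(i + 0.5) / 20 for i in range(20)]    # assumed fixed design points
x0 = 0.25                                     # fixed test point

def fit_constant(ys):
    """High-bias predictor: the overall mean of y, used at every x."""
    return sum(ys) / len(ys)

def fit_nearest(ys):
    """High-variance predictor: the y observed at the design point nearest x0."""
    k = min(range(len(grid)), key=lambda i: abs(grid[i] - x0))
    return ys[k]

def bias_variance(fit, n_rep, rng):
    """Estimate bias^2 and variance of fit at x0 over repeated training sets."""
    preds = []
    for _ in range(n_rep):
        ys = [f(x) + rng.gauss(0.0, sigma) for x in grid]
        preds.append(fit(ys))
    mean_pred = sum(preds) / n_rep
    bias2 = (mean_pred - f(x0)) ** 2
    var = sum((p - mean_pred) ** 2 for p in preds) / n_rep
    return bias2, var

rng = random.Random(3)
b2_const, var_const = bias_variance(fit_constant, 5000, rng)
b2_near, var_near = bias_variance(fit_nearest, 5000, rng)

# Expected test error at x0 from the decomposition: bias^2 + variance + sigma^2.
err_const = b2_const + var_const + sigma ** 2
err_near = b2_near + var_near + sigma ** 2
```

In this setup the constant predictor has squared bias near $1$ and variance near $\sigma^2/20$ , while the nearest-point predictor has almost no bias and variance near $\sigma^2$ . Neither extreme is the goal; the point is only to show the two terms moving in opposite directions.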

End-of-chapter checklist

The original problems and solutions in this chapter come from the corresponding TeX source file.
You can first read only the problem boxes, write down the key identities, and then open the proofs or solutions.
If a result uses independence, countable additivity, a change of variables, or a moment condition, it is worth marking that point explicitly.