Second recitation

Jinghan Liu

Contents
Reading guide
  • This chapter moves from continuous distributions, joint distributions, and marginal distributions to the least-squares idea in machine learning.
  • When reading a density, first ask two questions: does the normalizing constant exist, and what is the support?
  • The extra material is a first look at conditional expectation and projection.

Tip. Before every change of variables, check whether the map is one-to-one, whether the absolute value of the Jacobian is needed, and how the region of integration changes.

Exercise 1.5

Note

For a continuous distribution, first look at the support, then at the normalizing constant. If the density contains a parameter, first decide when the integral is finite.

Problem

Which of the following functions are density functions? If it is a density, find CC and the distribution function F(x)F(x).

11 f(x)={Cxd,x>10,x<1f(x) = \begin{cases} Cx^{-d}, & x > 1 \\ 0, & x < 1 \end{cases}.

22 f(x)=Cexex,<x<f(x) = C e^{-x-e^{-x}}, -\infty < x < \infty.

Solution

11 For f(x)f(x) to be a density, we need f(x)dx=1\int_{-\infty}^{\infty} f(x)\mathrm{d}x = 1.

1Cxddx=limt[C1dx1d]1t.\int_1^{\infty} C x^{-d} \mathrm{d}x = \lim_{t \to \infty} \left[ \frac{C}{1-d} x^{1-d} \right]_1^t.

The improper integral converges only when 1d<01-d<0, that is, when d>1d>1. In that case the integral is Cd1\frac{C}{d-1}. Thus C=d1C=d-1. The distribution function is

F(x)=0,x1,F(x)=0,\quad x\le 1,

and

F(x)=1x(d1)tddt=1x(d1),x>1.F(x)=\int_1^x (d-1)t^{-d}\mathrm{d}t=1-x^{-(d-1)},\quad x>1.

22 Check normalization:

Cexexdx.\int_{-\infty}^{\infty} C e^{-x-e^{-x}} \mathrm{d}x.

Let u=exu=e^{-x}. Then du=exdx\mathrm{d}u=-e^{-x}\mathrm{d}x. As xx\to-\infty, uu\to\infty; as xx\to\infty, u0u\to0. Hence

0Ceu(du)=C0eudu=C.\int_{\infty}^{0} C e^{-u} (-\mathrm{d}u) = C \int_0^{\infty} e^{-u} \mathrm{d}u = C.

So C=1C=1. The distribution function is

F(x)=xetetdt=exeudu=eex,<x<.F(x) = \int_{-\infty}^x e^{-t-e^{-t}} \mathrm{d}t = \int_{e^{-x}}^{\infty} e^{-u} \mathrm{d}u = e^{-e^{-x}}, \quad -\infty < x < \infty.
Problem

Let UU be uniformly distributed on (0,1)(0,1) on some probability space. Let FF be a strictly increasing distribution function. Define a new random variable Y=F1(U)Y=F^{-1}(U), that is, Y(ω)=F1(U(ω))Y(\omega)=F^{-1}(U(\omega)). Prove that the distribution function of YY is FF.

Proof

Since FF is strictly increasing, F1F^{-1} exists and is also strictly increasing. For any real yy,

FY(y)=P(Yy)=P(F1(U)y).F_Y(y)=\mathbb{P}(Y\le y) =\mathbb{P}(F^{-1}(U)\le y).

Applying FF to both sides of the inequality does not change the direction, so

P(F1(U)y)=P(UF(y)).\mathbb{P}(F^{-1}(U)\le y)=\mathbb{P}(U\le F(y)).

Since UU(0,1)U\sim U(0,1) and 0F(y)10\le F(y)\le 1,

P(UF(y))=F(y).\mathbb{P}(U\le F(y))=F(y).

Thus FY(y)=F(y)F_Y(y)=F(y).

Problem

Let (X,Y)(X,Y) be an integer-valued random vector with joint probability mass function f(x,y)f(x,y). Prove that for x,yZx,y\in\mathbb{Z},

f(x,y)=P(Xx,Yy)P(Xx+1,Yy)P(Xx,Yy1)+P(Xx+1,Yy1).\begin{aligned} f(x, y) &= \mathbb{P}(X \ge x, Y \le y) - \mathbb{P}(X \ge x+1, Y \le y) \\ &\quad - \mathbb{P}(X \ge x, Y \le y-1) + \mathbb{P}(X \ge x+1, Y \le y-1). \end{aligned}

Then find the joint probability mass function of the minimum XminX_{\min} and the maximum XmaxX_{\max} in rr rolls of a fair die.

Solution

First, write

{Xx,Yy}={X=x,Yy}{Xx+1,Yy}.\{X \ge x, Y \le y\} = \{X = x, Y \le y\} \cup \{X \ge x+1, Y \le y\}.

The two events on the right are disjoint. Hence

P(X=x,Yy)=P(Xx,Yy)P(Xx+1,Yy).\mathbb{P}(X = x, Y \le y) =\mathbb{P}(X \ge x, Y \le y)-\mathbb{P}(X \ge x+1, Y \le y).

The same argument with y1y-1 gives

P(X=x,Yy1)=P(Xx,Yy1)P(Xx+1,Yy1).\mathbb{P}(X = x, Y \le y-1) =\mathbb{P}(X \ge x, Y \le y-1)-\mathbb{P}(X \ge x+1, Y \le y-1).

Also,

{X=x,Yy}={X=x,Y=y}{X=x,Yy1}.\{X = x, Y \le y\} = \{X = x, Y = y\} \cup \{X = x, Y \le y-1\}.

Therefore

f(x,y)=P(X=x,Y=y)=P(X=x,Yy)P(X=x,Yy1),f(x,y)=\mathbb{P}(X=x,Y=y) =\mathbb{P}(X=x,Y\le y)-\mathbb{P}(X=x,Y\le y-1),

and substituting the two previous identities gives the desired formula.

For rr rolls of a fair die, the event XminiX_{\min}\ge i and XmaxjX_{\max}\le j means that all rr outcomes lie in the interval [i,j][i,j]. If 1ij61\le i\le j\le 6, then

P(Xmini,Xmaxj)=(ji+16)r.\mathbb{P}(X_{\min} \ge i, X_{\max} \le j) =\left(\frac{j-i+1}{6}\right)^r.

Using the formula just proved, for 1i<j61\le i<j\le 6,

f(i,j)=P(Xmin=i,Xmax=j)=(ji+1)r2(ji)r+(ji1)r6r.f(i,j)=\mathbb{P}(X_{\min}=i,X_{\max}=j) =\frac{(j-i+1)^r-2(j-i)^r+(j-i-1)^r}{6^r}.

For 1i=j61\le i=j\le 6, all rolls must be equal to ii, so

f(i,i)=16r.f(i,i)=\frac{1}{6^r}.

In all other cases, f(i,j)=0f(i,j)=0.

Problem

Is the function

F(x,y)={1exy,x,y00,otherwiseF(x, y) = \begin{cases} 1 - e^{-x-y}, & x, y \ge 0 \\ 0, & \text{otherwise} \end{cases}

the joint distribution function of some random vector (X,Y)(X,Y)? If yes, find the marginal distribution functions of XX and YY. If not, explain why.

Solution

It is not. A joint distribution function must assign nonnegative probability to every rectangle. That is, for any x1<x2x_1<x_2 and y1<y2y_1<y_2,

P(x1<Xx2,y1<Yy2)=F(x2,y2)F(x1,y2)F(x2,y1)+F(x1,y1)0.\mathbb{P}(x_1 < X \le x_2, y_1 < Y \le y_2) = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0.

Take x1=0x_1=0, x2=1x_2=1, y1=0y_1=0, and y2=1y_2=1. Then

F(1,1)F(0,1)F(1,0)+F(0,0)=(1e2)(1e1)(1e1)+(1e0)=1+2e1e2=(1e1)2<0.\begin{aligned} & F(1,1) - F(0,1) - F(1,0) + F(0,0) \\ &= (1 - e^{-2}) - (1 - e^{-1}) - (1 - e^{-1}) + (1 - e^0) \\ &= -1 + 2e^{-1} - e^{-2} \\ &= -(1-e^{-1})^2 < 0. \end{aligned}

This would give a negative probability for a rectangle, so the function cannot be a joint distribution function.

Problem

Let X1X_1 and X2X_2 be independent random variables with the same distribution function F(x)F(x). Define

U=max{X1,X2},V=min{X1,X2}.U = \max\{X_1, X_2\}, \quad V = \min\{X_1, X_2\}.

11 Find the distribution functions of UU and VV. 22 Find the joint distribution function of (U,V)(U,V).

Solution

11 For U=max{X1,X2}U=\max\{X_1,X_2\},

FU(u)=P(Uu)=P(X1u,X2u).F_U(u)=\mathbb{P}(U\le u)=\mathbb{P}(X_1\le u,X_2\le u).

By independence,

FU(u)=P(X1u)P(X2u)=F(u)2.F_U(u)=\mathbb{P}(X_1\le u)\mathbb{P}(X_2\le u)=F(u)^2.

For V=min{X1,X2}V=\min\{X_1,X_2\},

FV(v)=P(Vv)=1P(V>v)=1P(X1>v,X2>v).F_V(v)=\mathbb{P}(V\le v) =1-\mathbb{P}(V>v) =1-\mathbb{P}(X_1>v,X_2>v).

Again by independence,

FV(v)=1(1F(v))2=2F(v)F(v)2.F_V(v)=1-(1-F(v))^2=2F(v)-F(v)^2.

22 The joint distribution function is FU,V(u,v)=P(Uu,Vv)F_{U,V}(u,v)=\mathbb{P}(U\le u,V\le v). Since VUV\le U always holds, there are two cases.

If uvu\le v, then UuU\le u implies VuvV\le u\le v. Hence

FU,V(u,v)=P(Uu)=F(u)2.F_{U,V}(u,v)=\mathbb{P}(U\le u)=F(u)^2.

If u>vu>v, then

P(Uu,Vv)=P(Uu)P(Uu,V>v).\mathbb{P}(U\le u,V\le v) =\mathbb{P}(U\le u)-\mathbb{P}(U\le u,V>v).

The event {Uu,V>v}\{U\le u,V>v\} is the same as {v<X1u, v<X2u}\{v<X_1\le u,\ v<X_2\le u\}. By independence,

P(v<X1u, v<X2u)=[F(u)F(v)]2.\mathbb{P}(v<X_1\le u,\ v<X_2\le u)=[F(u)-F(v)]^2.

Therefore

FU,V(u,v)=F(u)2(F(u)F(v))2=2F(u)F(v)F(v)2.F_{U,V}(u,v)=F(u)^2-(F(u)-F(v))^2=2F(u)F(v)-F(v)^2.

So

FU,V(u,v)={F(u)2,uv,2F(u)F(v)F(v)2,u>v.F_{U,V}(u,v)= \begin{cases} F(u)^2, & u \le v,\\ 2F(u)F(v)-F(v)^2, & u>v. \end{cases}

Exercise 1.6

Note

For joint distribution problems, draw the region first. Many mistakes come not from the integral itself, but from limits that do not match the region.

Problem

Let g,h:RRg,h:\mathbb{R}\to\mathbb{R} be Borel measurable functions. Suppose XX and YY are independent discrete random variables. Without using Theorem 1.6.4, prove directly that g(X)g(X) and h(Y)h(Y) are independent.

Proof

Let U=g(X)U=g(X) and V=h(Y)V=h(Y). Since XX and YY are discrete, UU and VV are also discrete. For any values uu and vv that UU and VV can take,

P(U=u,V=v)=P(g(X)=u,h(Y)=v)=P(Xg1(u),Yh1(v)).\mathbb{P}(U=u,V=v) =\mathbb{P}(g(X)=u,h(Y)=v) =\mathbb{P}(X\in g^{-1}(u),Y\in h^{-1}(v)).

Here g1(u)={x:g(x)=u}g^{-1}(u)=\{x:g(x)=u\} and h1(v)={y:h(y)=v}h^{-1}(v)=\{y:h(y)=v\}. Since XX and YY are independent,

P(Xg1(u),Yh1(v))=xg1(u)yh1(v)P(X=x,Y=y)=xg1(u)yh1(v)P(X=x)P(Y=y)=(xg1(u)P(X=x))(yh1(v)P(Y=y))=P(Xg1(u))P(Yh1(v))=P(g(X)=u)P(h(Y)=v)=P(U=u)P(V=v).\begin{aligned} \mathbb{P}(X \in g^{-1}(u), Y \in h^{-1}(v)) &= \sum_{x \in g^{-1}(u)} \sum_{y \in h^{-1}(v)} \mathbb{P}(X = x, Y = y) \\ &= \sum_{x \in g^{-1}(u)} \sum_{y \in h^{-1}(v)} \mathbb{P}(X = x)\mathbb{P}(Y = y) \\ &= \left( \sum_{x \in g^{-1}(u)} \mathbb{P}(X = x) \right) \left( \sum_{y \in h^{-1}(v)} \mathbb{P}(Y = y) \right) \\ &= \mathbb{P}(X \in g^{-1}(u)) \mathbb{P}(Y \in h^{-1}(v)) \\ &= \mathbb{P}(g(X) = u) \mathbb{P}(h(Y) = v) \\ &= \mathbb{P}(U = u) \mathbb{P}(V = v). \end{aligned}

Thus g(X)g(X) and h(Y)h(Y) are independent.

Problem

Let X1,X2,X3X_1,X_2,X_3 be independent positive-integer-valued random variables with probability mass functions

P(Xi=x)=(1pi)pix1,i=1,2,3.\mathbb{P}(X_i = x) = (1 - p_i)p_i^{x-1},\quad i = 1, 2, 3.

11 Prove that

P(X1<X2<X3)=(1p1)(1p2)p2p32(1p2p3)(1p1p2p3).\mathbb{P}(X_1 < X_2 < X_3) = \frac{(1 - p_1)(1 - p_2)p_2p_3^2}{(1 - p_2p_3)(1 - p_1p_2p_3)}.

22 Find P(X1X2X3)\mathbb{P}(X_1 \le X_2 \le X_3).

Solution

11 By independence,

P(X1<X2<X3)=x1=1x2=x1+1x3=x2+1P(X1=x1)P(X2=x2)P(X3=x3).\mathbb{P}(X_1 < X_2 < X_3) = \sum_{x_1=1}^{\infty} \sum_{x_2=x_1+1}^{\infty} \sum_{x_3=x_2+1}^{\infty} \mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2)\mathbb{P}(X_3=x_3).

First compute the inner sum:

x3=x2+1(1p3)p3x31=p3x2.\sum_{x_3=x_2+1}^{\infty} (1-p_3)p_3^{x_3-1} = p_3^{x_2}.

Then

x2=x1+1(1p2)p2x21p3x2=(1p2)p3x2=x1+1(p2p3)x21=(1p2)p3(p2p3)x11p2p3.\sum_{x_2=x_1+1}^{\infty} (1-p_2)p_2^{x_2-1}p_3^{x_2} =(1-p_2)p_3\sum_{x_2=x_1+1}^{\infty}(p_2p_3)^{x_2-1} =(1-p_2)p_3\frac{(p_2p_3)^{x_1}}{1-p_2p_3}.

Substituting into the outer sum gives

P(X1<X2<X3)=x1=1(1p1)p1x11(1p2)p3(p2p3)x11p2p3=(1p1)(1p2)p2p321p2p3x1=1(p1p2p3)x11=(1p1)(1p2)p2p32(1p2p3)(1p1p2p3).\begin{aligned} \mathbb{P}(X_1 < X_2 < X_3) &= \sum_{x_1=1}^{\infty} (1-p_1)p_1^{x_1-1} \frac{(1-p_2)p_3 (p_2p_3)^{x_1}}{1-p_2p_3} \\ &= \frac{(1-p_1)(1-p_2)p_2p_3^2}{1-p_2p_3} \sum_{x_1=1}^{\infty}(p_1p_2p_3)^{x_1-1} \\ &= \frac{(1-p_1)(1-p_2)p_2p_3^2}{(1-p_2p_3)(1-p_1p_2p_3)}. \end{aligned}

22 For P(X1X2X3)\mathbb{P}(X_1\le X_2\le X_3), change only the lower limits:

P(X1X2X3)=x1=1x2=x1x3=x2P(X1=x1)P(X2=x2)P(X3=x3).\mathbb{P}(X_1 \le X_2 \le X_3) = \sum_{x_1=1}^{\infty} \sum_{x_2=x_1}^{\infty} \sum_{x_3=x_2}^{\infty} \mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2)\mathbb{P}(X_3=x_3).

The inner sum is

x3=x2(1p3)p3x31=p3x21.\sum_{x_3=x_2}^{\infty} (1-p_3)p_3^{x_3-1} = p_3^{x_2-1}.

The second sum is

x2=x1(1p2)p2x21p3x21=(1p2)(p2p3)x111p2p3.\sum_{x_2=x_1}^{\infty} (1-p_2)p_2^{x_2-1}p_3^{x_2-1} =(1-p_2)\frac{(p_2p_3)^{x_1-1}}{1-p_2p_3}.

Thus

P(X1X2X3)=x1=1(1p1)p1x11(1p2)(p2p3)x111p2p3=(1p1)(1p2)1p2p3x1=1(p1p2p3)x11=(1p1)(1p2)(1p2p3)(1p1p2p3).\begin{aligned} \mathbb{P}(X_1 \le X_2 \le X_3) &= \sum_{x_1=1}^{\infty} (1-p_1)p_1^{x_1-1} \frac{(1-p_2)(p_2p_3)^{x_1-1}}{1-p_2p_3} \\ &= \frac{(1-p_1)(1-p_2)}{1-p_2p_3} \sum_{x_1=1}^{\infty}(p_1p_2p_3)^{x_1-1} \\ &= \frac{(1-p_1)(1-p_2)}{(1-p_2p_3)(1-p_1p_2p_3)}. \end{aligned}
Problem

Let X1,X2,X3,X4,X5X_1,X_2,X_3,X_4,X_5 be independent continuous random variables with the same distribution function FF. Let

I=P(X1<X2<X3<X4<X5).I = \mathbb{P}(X_1 < X_2 < X_3 < X_4 < X_5).

Prove that II does not depend on FF, and find its value.

Proof

Since the variables are independent, identically distributed, and continuous, ties have probability 00. The five variables can be ordered in 5!=1205!=120 possible ways. By symmetry, each ordering has the same probability. The event X1<X2<X3<X4<X5X_1<X_2<X_3<X_4<X_5 is just one of these orderings. Hence

I=P(X1<X2<X3<X4<X5)=15!=1120.I=\mathbb{P}(X_1 < X_2 < X_3 < X_4 < X_5)=\frac{1}{5!}=\frac{1}{120}.

This value is constant and does not depend on the specific distribution function FF.

Problem

Throw 3 points independently and uniformly on the interval [0,1][0,1]. Find: (1) the distribution function of the middle point; (2) the joint density of the leftmost point and the rightmost point.

Solution

Let the three points be X1,X2,X3U(0,1)X_1,X_2,X_3\sim U(0,1), independent. Write the order statistics as X(1)X(2)X(3)X_{(1)}\le X_{(2)}\le X_{(3)}.

11 For x[0,1]x\in[0,1], the event X(2)xX_{(2)}\le x means that at least two of the three points are at most xx. This is a binomial count with three trials and success probability xx. Thus

F(2)(x)=P(X(2)x)=(32)x2(1x)+(33)x3=3x22x3.F_{(2)}(x)=\mathbb{P}(X_{(2)}\le x) =\binom{3}{2}x^2(1-x)+\binom{3}{3}x^3 =3x^2-2x^3.

For x<0x<0, the value is 00; for x>1x>1, the value is 11.

22 Let U=X(1)U=X_{(1)} and V=X(3)V=X_{(3)}. For 0uv10\le u\le v\le 1,

P(U>u,Vv)=P(all three points lie in (u,v])=(vu)3.\mathbb{P}(U>u,V\le v) =\mathbb{P}(\text{all three points lie in }(u,v]) =(v-u)^3.

On the other hand,

P(U>u,Vv)=P(Vv)P(Uu,Vv)=FV(v)FU,V(u,v).\mathbb{P}(U>u,V\le v) =\mathbb{P}(V\le v)-\mathbb{P}(U\le u,V\le v) =F_V(v)-F_{U,V}(u,v).

So FU,V(u,v)=FV(v)(vu)3F_{U,V}(u,v)=F_V(v)-(v-u)^3. Taking the mixed derivative gives

fU,V(u,v)=2FU,V(u,v)uv=2uv[(vu)3]=6(vu).f_{U,V}(u,v) =\frac{\partial^2 F_{U,V}(u,v)}{\partial u\partial v} =\frac{\partial^2}{\partial u\partial v}[-(v-u)^3] =6(v-u).

Thus f(u,v)=6(vu)f(u,v)=6(v-u) for 0uv10\le u\le v\le 1, and 00 elsewhere.

Extra material: basics of machine learning

Note

Least squares can be treated as algebra, but it is also a projection problem in a Hilbert space. This is a useful way to meet conditional expectation.

1. Problem and motivation

In a regression problem, we view the feature and the label as random variables XX and YY on the same probability space. The goal is to find a prediction function gg so that g(X)g(X) is close to the true value YY.

We usually measure closeness by mean squared error. Assume YL2Y\in L^2 and E[g(X)2]<\mathbb{E}[g(X)^2]<\infty. Then the best predictor solves

g=argmingE[(Yg(X))2].g^\ast = \mathop{\arg\min}_{g} \mathbb{E}[(Y - g(X))^2].

2. The best predictor

We show in two ways that, under mean squared error, the best predictor is the conditional expectation: g(X)=E[YX]g^\ast(X)=\mathbb{E}[Y|X].

View 1: completing the square

Add and subtract E[YX]\mathbb{E}[Y|X]:

E[(Yg(X))2]=E[((YE[YX])+(E[YX]g(X)))2]=E[(YE[YX])2]+2E[(YE[YX])(E[YX]g(X))]+E[(E[YX]g(X))2].\begin{aligned} \mathbb{E}[(Y-g(X))^2] &= \mathbb{E}\left[\left((Y-\mathbb{E}[Y|X])+(\mathbb{E}[Y|X]-g(X))\right)^2\right] \\ &= \mathbb{E}[(Y-\mathbb{E}[Y|X])^2] +2\mathbb{E}[(Y-\mathbb{E}[Y|X])(\mathbb{E}[Y|X]-g(X))] \\ &\quad + \mathbb{E}[(\mathbb{E}[Y|X]-g(X))^2]. \end{aligned}

The cross term is zero by conditioning on XX:

E[E[(YE[YX])(E[YX]g(X))X]]=E[(E[YX]g(X))E[YE[YX]X]=0]=0.\mathbb{E}\left[ \mathbb{E}[(Y-\mathbb{E}[Y|X])(\mathbb{E}[Y|X]-g(X)) \mid X] \right] = \mathbb{E}\left[ (\mathbb{E}[Y|X]-g(X)) \underbrace{\mathbb{E}[Y-\mathbb{E}[Y|X]\mid X]}_{=0} \right] =0.

So

E[(Yg(X))2]=E[(YE[YX])2]+E[(E[YX]g(X))2].\mathbb{E}[(Y-g(X))^2] =\mathbb{E}[(Y-\mathbb{E}[Y|X])^2] +\mathbb{E}[(\mathbb{E}[Y|X]-g(X))^2].

The second term is nonnegative. It is minimized when it is 00, so the best predictor is g(X)=E[YX]g^\ast(X)=\mathbb{E}[Y|X] a.s.

View 2: projection in a Hilbert space

The random variables with finite second moment form the Hilbert space L2(Ω,F,P)L^2(\Omega,\mathcal{F},\mathbb{P}), with inner product X,Y=E[XY]\langle X,Y\rangle=\mathbb{E}[XY]. The squared distance is XY2=E[(XY)2]\|X-Y\|^2=\mathbb{E}[(X-Y)^2].

All square-integrable random variables of the form g(X)g(X) form a closed subspace, denoted by HX\mathcal{H}_X. Finding the best g(X)g^\ast(X) means finding the point Y^HX\hat Y\in\mathcal{H}_X closest to YY.

Step 1: the closest point gives an orthogonal error.

Let e=YY^e=Y-\hat Y. Suppose there is some VHXV\in\mathcal{H}_X with e,V0\langle e,V\rangle\ne0. Consider the perturbed prediction Y^+tV\hat Y+tV. Its squared distance from YY is

f(t)=Y(Y^+tV)2=etV2=e22te,V+t2V2.f(t)=\|Y-(\hat Y+tV)\|^2=\|e-tV\|^2 =\|e\|^2-2t\langle e,V\rangle+t^2\|V\|^2.

This quadratic has derivative f(0)=2e,V0f'(0)=-2\langle e,V\rangle\ne0. For a small enough tt, the distance becomes smaller, contradicting the choice of Y^\hat Y. Therefore YY^,V=0\langle Y-\hat Y,V\rangle=0 for all VHXV\in\mathcal{H}_X.

Step 2: conditional expectation has this orthogonality.

Let Y^=E[YX]\hat Y=\mathbb{E}[Y|X]. For any V=g(X)V=g(X),

YE[YX],g(X)=E[(YE[YX])g(X)]=E[g(X)(E[YX]E[YX])]=0.\langle Y-\mathbb{E}[Y|X],g(X)\rangle =\mathbb{E}[(Y-\mathbb{E}[Y|X])g(X)] =\mathbb{E}\left[g(X)(\mathbb{E}[Y|X]-\mathbb{E}[Y|X])\right] =0.

Thus E[YX]\mathbb{E}[Y|X] is the orthogonal projection of YY onto the information generated by XX.

3. From probability to statistics: prediction with data

The formula above gives the ideal predictor

g(x)=E[YX=x].g^\ast(x)=\mathbb{E}[Y\mid X=x].

To compute it exactly, we would need the full joint distribution of XX and YY. This is the probability-theory setting: the distribution is known, and we study what follows from it.

In statistics and machine learning, the distribution is usually unknown. We only have a finite i.i.d. sample

D={(x1,y1),(x2,y2),,(xn,yn)}.\mathcal{D}=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}.

From this sample we train an estimator f^D(x)\hat f_{\mathcal D}(x), hoping that it is close to the unknown function f(x)=E[YX=x]f(x)=\mathbb{E}[Y\mid X=x].

Since the training set D\mathcal D is random, the fitted model f^D(x)\hat f_{\mathcal D}(x) is also random. So when we judge a model, we should not look at one training set alone. We also ask how it behaves over repeated samples from the same data-generating process.

4. Bias-variance decomposition

Suppose the true relation is

Y=f(x)+ϵ,Y=f(x)+\epsilon,

where f(x)=E[YX=x]f(x)=\mathbb{E}[Y\mid X=x] is the best possible prediction, and ϵ\epsilon is noise with E[ϵ]=0\mathbb{E}[\epsilon]=0 and Var(ϵ)=σ2\operatorname{Var}(\epsilon)=\sigma^2.

At a fixed test point xx, let f^(x)\hat f(x) be the predictor trained from a random data set D\mathcal D. The expected test error is

Err(x)=ED,ϵ[(Yf^(x))2].\operatorname{Err}(x) =\mathbb{E}_{\mathcal D,\epsilon}\left[(Y-\hat f(x))^2\right].

Add and subtract ED[f^(x)]\mathbb{E}_{\mathcal D}[\hat f(x)] and f(x)f(x):

Err(x)=E[(f(x)+ϵf^(x))2]=E[((f(x)E[f^(x)])+(E[f^(x)]f^(x))+ϵ)2].\begin{aligned} \operatorname{Err}(x) &=\mathbb{E}\left[(f(x)+\epsilon-\hat f(x))^2\right] \\ &=\mathbb{E}\left[ \Big((f(x)-\mathbb{E}[\hat f(x)]) +(\mathbb{E}[\hat f(x)]-\hat f(x)) +\epsilon\Big)^2 \right]. \end{aligned}

After expanding, the cross terms have expectation 00: the noise has mean 00, it is independent of the fitted model, and E[E[f^(x)]f^(x)]=0\mathbb{E}[\mathbb{E}[\hat f(x)]-\hat f(x)]=0. Hence

Err(x)=(f(x)E[f^(x)])2bias2+E[(f^(x)E[f^(x)])2]variance+σ2irreducible error.\operatorname{Err}(x) =\underbrace{(f(x)-\mathbb{E}[\hat f(x)])^2}_{\text{bias}^2} +\underbrace{\mathbb{E}\left[(\hat f(x)-\mathbb{E}[\hat f(x)])^2\right]}_{\text{variance}} +\underbrace{\sigma^2}_{\text{irreducible error}}.

The terms have simple meanings.

  • The bias is E[f^(x)]f(x)\mathbb{E}[\hat f(x)]-f(x). Large bias means the model is too limited on average. For example, a linear model may miss a nonlinear curve. This is underfitting.

  • The variance is Var(f^(x))\operatorname{Var}(\hat f(x)). Large variance means the fitted model changes a lot when the training data changes. This is overfitting.

  • The irreducible error σ2\sigma^2 is the noise in the data. No prediction rule can remove it.

Making a model more complex often lowers bias but raises variance. A good model is one that keeps the sum Bias2+Variance\text{Bias}^2+\text{Variance} small.

End-of-chapter checklist
  • The original problems and solutions in this chapter come from the corresponding TeX source file.
  • You can first read only the problem boxes, write down the key identities, and then open the proofs or solutions.
  • If a result uses independence, countable additivity, a change of variables, or a moment condition, it is worth marking that point explicitly.