This chapter moves from continuous distributions, joint distributions, and marginal distributions to the least-squares idea in machine learning.
When reading a density, first ask two questions: does the normalizing constant exist, and what is the support?
The extra material is a first look at conditional expectation and projection.
Tip. Before every change of variables, check whether the map is one-to-one, whether the absolute value of the Jacobian is needed, and how the region of integration changes.
Exercise 1.5
Note
For a continuous distribution, first look at the support, then at the normalizing constant. If the density contains a parameter, first decide when the integral is finite.
Problem
Which of the following functions are density functions? If it is a density, find C and the distribution function F(x).
1f(x)={Cx−d,0,x>1x<1.
2f(x)=Ce−x−e−x,−∞<x<∞.
Solution
1 For f(x) to be a density, we need
∫−∞∞f(x)dx=1.
∫1∞Cx−ddx=t→∞lim[1−dCx1−d]1t.
The improper integral converges only when 1−d<0, that is, when d>1. In that case the integral is d−1C. Thus C=d−1. The distribution function is
F(x)=0,x≤1,
and
F(x)=∫1x(d−1)t−ddt=1−x−(d−1),x>1.
2 Check normalization:
∫−∞∞Ce−x−e−xdx.
Let u=e−x. Then du=−e−xdx. As x→−∞, u→∞; as x→∞, u→0. Hence
∫∞0Ce−u(−du)=C∫0∞e−udu=C.
So C=1. The distribution function is
F(x)=∫−∞xe−t−e−tdt=∫e−x∞e−udu=e−e−x,−∞<x<∞.
Problem
Let U be uniformly distributed on (0,1) on some probability space. Let F be a strictly increasing distribution function. Define a new random variable
Y=F−1(U), that is, Y(ω)=F−1(U(ω)). Prove that the distribution function of Y is F.
Proof
Since F is strictly increasing, F−1 exists and is also strictly increasing. For any real y,
FY(y)=P(Y≤y)=P(F−1(U)≤y).
Applying F to both sides of the inequality does not change the direction, so
P(F−1(U)≤y)=P(U≤F(y)).
Since U∼U(0,1) and 0≤F(y)≤1,
P(U≤F(y))=F(y).
Thus FY(y)=F(y).
Problem
Let (X,Y) be an integer-valued random vector with joint probability mass function f(x,y). Prove that for x,y∈Z,
This would give a negative probability for a rectangle, so the function cannot be a joint distribution function.
Problem
Let X1 and X2 be independent random variables with the same distribution function F(x). Define
U=max{X1,X2},V=min{X1,X2}.
1 Find the distribution functions of U and V. 2 Find the joint distribution function of (U,V).
Solution
1 For U=max{X1,X2},
FU(u)=P(U≤u)=P(X1≤u,X2≤u).
By independence,
FU(u)=P(X1≤u)P(X2≤u)=F(u)2.
For V=min{X1,X2},
FV(v)=P(V≤v)=1−P(V>v)=1−P(X1>v,X2>v).
Again by independence,
FV(v)=1−(1−F(v))2=2F(v)−F(v)2.
2 The joint distribution function is
FU,V(u,v)=P(U≤u,V≤v). Since V≤U always holds, there are two cases.
If u≤v, then U≤u implies V≤u≤v. Hence
FU,V(u,v)=P(U≤u)=F(u)2.
If u>v, then
P(U≤u,V≤v)=P(U≤u)−P(U≤u,V>v).
The event {U≤u,V>v} is the same as {v<X1≤u,v<X2≤u}. By independence,
P(v<X1≤u,v<X2≤u)=[F(u)−F(v)]2.
Therefore
FU,V(u,v)=F(u)2−(F(u)−F(v))2=2F(u)F(v)−F(v)2.
So
FU,V(u,v)={F(u)2,2F(u)F(v)−F(v)2,u≤v,u>v.
Exercise 1.6
Note
For joint distribution problems, draw the region first. Many mistakes come not from the integral itself, but from limits that do not match the region.
Problem
Let g,h:R→R be Borel measurable functions. Suppose X and Y are independent discrete random variables. Without using Theorem 1.6.4, prove directly that g(X) and h(Y) are independent.
Proof
Let U=g(X) and V=h(Y). Since X and Y are discrete, U and V are also discrete. For any values u and v that U and V can take,
P(U=u,V=v)=P(g(X)=u,h(Y)=v)=P(X∈g−1(u),Y∈h−1(v)).
Here g−1(u)={x:g(x)=u} and h−1(v)={y:h(y)=v}. Since X and Y are independent,
Let X1,X2,X3,X4,X5 be independent continuous random variables with the same distribution function F. Let
I=P(X1<X2<X3<X4<X5).
Prove that I does not depend on F, and find its value.
Proof
Since the variables are independent, identically distributed, and continuous, ties have probability 0. The five variables can be ordered in 5!=120 possible ways. By symmetry, each ordering has the same probability. The event
X1<X2<X3<X4<X5 is just one of these orderings. Hence
I=P(X1<X2<X3<X4<X5)=5!1=1201.
This value is constant and does not depend on the specific distribution function F.
Problem
Throw 3 points independently and uniformly on the interval [0,1]. Find: (1) the distribution function of the middle point; (2) the joint density of the leftmost point and the rightmost point.
Solution
Let the three points be X1,X2,X3∼U(0,1), independent. Write the order statistics as
X(1)≤X(2)≤X(3).
1 For x∈[0,1], the event X(2)≤x means that at least two of the three points are at most x. This is a binomial count with three trials and success probability x. Thus
F(2)(x)=P(X(2)≤x)=(23)x2(1−x)+(33)x3=3x2−2x3.
For x<0, the value is 0; for x>1, the value is 1.
2 Let U=X(1) and V=X(3). For 0≤u≤v≤1,
P(U>u,V≤v)=P(all three points lie in (u,v])=(v−u)3.
On the other hand,
P(U>u,V≤v)=P(V≤v)−P(U≤u,V≤v)=FV(v)−FU,V(u,v).
So FU,V(u,v)=FV(v)−(v−u)3. Taking the mixed derivative gives
Least squares can be treated as algebra, but it is also a projection problem in a Hilbert space. This is a useful way to meet conditional expectation.
1. Problem and motivation
In a regression problem, we view the feature and the label as random variables X and Y on the same probability space. The goal is to find a prediction function g so that g(X) is close to the true value Y.
We usually measure closeness by mean squared error. Assume Y∈L2 and E[g(X)2]<∞. Then the best predictor solves
g∗=argmingE[(Y−g(X))2].
2. The best predictor
We show in two ways that, under mean squared error, the best predictor is the conditional expectation:
g∗(X)=E[Y∣X].
The second term is nonnegative. It is minimized when it is 0, so the best predictor is
g∗(X)=E[Y∣X] a.s.
View 2: projection in a Hilbert space
The random variables with finite second moment form the Hilbert space
L2(Ω,F,P), with inner product
⟨X,Y⟩=E[XY]. The squared distance is
∥X−Y∥2=E[(X−Y)2].
All square-integrable random variables of the form g(X) form a closed subspace, denoted by HX. Finding the best g∗(X) means finding the point Y^∈HX closest to Y.
Step 1: the closest point gives an orthogonal error.
Let e=Y−Y^. Suppose there is some V∈HX with ⟨e,V⟩=0. Consider the perturbed prediction Y^+tV. Its squared distance from Y is
f(t)=∥Y−(Y^+tV)∥2=∥e−tV∥2=∥e∥2−2t⟨e,V⟩+t2∥V∥2.
This quadratic has derivative f′(0)=−2⟨e,V⟩=0. For a small enough t, the distance becomes smaller, contradicting the choice of Y^. Therefore
⟨Y−Y^,V⟩=0 for all V∈HX.
Step 2: conditional expectation has this orthogonality.
Thus E[Y∣X] is the orthogonal projection of Y onto the information generated by X.
3. From probability to statistics: prediction with data
The formula above gives the ideal predictor
g∗(x)=E[Y∣X=x].
To compute it exactly, we would need the full joint distribution of X and Y. This is the probability-theory setting: the distribution is known, and we study what follows from it.
In statistics and machine learning, the distribution is usually unknown. We only have a finite i.i.d. sample
D={(x1,y1),(x2,y2),…,(xn,yn)}.
From this sample we train an estimator f^D(x), hoping that it is close to the unknown function
f(x)=E[Y∣X=x].
Since the training set D is random, the fitted model f^D(x) is also random. So when we judge a model, we should not look at one training set alone. We also ask how it behaves over repeated samples from the same data-generating process.
4. Bias-variance decomposition
Suppose the true relation is
Y=f(x)+ϵ,
where f(x)=E[Y∣X=x] is the best possible prediction, and ϵ is noise with E[ϵ]=0 and Var(ϵ)=σ2.
At a fixed test point x, let f^(x) be the predictor trained from a random data set D. The expected test error is
The bias is E[f^(x)]−f(x). Large bias means the model is too limited on average. For example, a linear model may miss a nonlinear curve. This is underfitting.
The variance is Var(f^(x)). Large variance means the fitted model changes a lot when the training data changes. This is overfitting.
The irreducible error σ2 is the noise in the data. No prediction rule can remove it.
Making a model more complex often lowers bias but raises variance. A good model is one that keeps the sum Bias2+Variance small.
End-of-chapter checklist
The original problems and solutions in this chapter come from the corresponding TeX source file.
You can first read only the problem boxes, write down the key identities, and then open the proofs or solutions.
If a result uses independence, countable additivity, a change of variables, or a moment condition, it is worth marking that point explicitly.