This chapter moves into convergence theory, the strong law of large numbers, characteristic functions, the central limit theorem, and Stein's method.
The four modes of convergence give the basic map: almost sure (a.s.), in $L^p$, in probability (P), and in distribution (D).
In each proof, keep track of where independence, moment assumptions, truncation, or Borel-Cantelli is used.
Tip. Whenever a limiting distribution appears, first ask what kind of limit it is: in probability, in distribution, or almost surely.
Exercise 4.2
Note
Keep the four modes of convergence separate: a.s., $L^p$, P, and D. When you see an arrow, first identify which mode it means.
Problem: 4.2.1
Prove the following two inequalities.
(1) (Lyapunov inequality) If $0<r<s$, then
$$(E[|X|^r])^{1/r}\le (E[|X|^s])^{1/s}.$$
(2) ($C_r$ inequality) If $r>0$, then
$$E[|X+Y|^r]\le C_r\,(E[|X|^r]+E[|Y|^r]),$$
where
$$C_r=\begin{cases}1, & 0<r<1,\\ 2^{r-1}, & r\ge 1.\end{cases}$$
Proof
(1) Let $\alpha=r/s\in(0,1)$. Since $x\mapsto x^{\alpha}$ is concave on $[0,\infty)$, Jensen's inequality gives
$$E[|X|^r]=E[(|X|^s)^{\alpha}]\le (E[|X|^s])^{\alpha}.$$
Taking the power $1/r$ on both sides, and noting $\alpha/r=1/s$, gives
$$(E[|X|^r])^{1/r}\le (E[|X|^s])^{1/s}.$$
(2) If $0<r<1$, then for all $a,b\ge 0$,
$$(a+b)^r\le a^r+b^r.$$
Therefore
$$|X+Y|^r\le (|X|+|Y|)^r\le |X|^r+|Y|^r.$$
Taking expectations gives
$$E[|X+Y|^r]\le E[|X|^r]+E[|Y|^r].$$
If $r\ge 1$, then by convexity of $x\mapsto x^r$,
$$(a+b)^r=2^r\left(\frac{a+b}{2}\right)^r\le 2^{r-1}(a^r+b^r).$$
Hence
$$|X+Y|^r\le 2^{r-1}(|X|^r+|Y|^r),$$
and the result follows after taking expectations.
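Both inequalities hold for any probability measure, including the empirical measure of a finite sample, so they can be spot-checked numerically. The sketch below is only an illustration: the helper names (`moment`, `c_r`, `check_inequalities`) and the Gaussian test data are assumptions made for this example, not part of the original problem.

```python
import random

def moment(xs, r):
    """Empirical r-th absolute moment E|X|^r under the uniform law on xs."""
    return sum(abs(x) ** r for x in xs) / len(xs)

def c_r(r):
    """Constant from the C_r inequality: 1 for 0 < r < 1, 2^(r-1) for r >= 1."""
    return 1.0 if r < 1 else 2.0 ** (r - 1)

def check_inequalities(xs, ys, r, s):
    """Return (lyapunov_ok, cr_ok) for exponents 0 < r < s."""
    slack = 1 + 1e-12  # small relative slack for floating-point rounding
    lyapunov_ok = moment(xs, r) ** (1 / r) <= moment(xs, s) ** (1 / s) * slack
    sums = [x + y for x, y in zip(xs, ys)]
    cr_ok = moment(sums, r) <= c_r(r) * (moment(xs, r) + moment(ys, r)) * slack
    return lyapunov_ok, cr_ok

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(1000)]
ys = [rng.gauss(0, 2) for _ in range(1000)]
for r, s in [(0.5, 1.0), (1.0, 2.0), (2.0, 3.5)]:
    print(r, s, check_inequalities(xs, ys, r, s))
```

A passing check is not a proof, but a failing one would point to a misread exponent or constant.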
Problem: 4.2.2
Let $\{X_n\}$ be a sequence of random variables, and let $\{c_n\}$ be a real sequence with $c_n\to c$. Prove, under each of the four modes of convergence (a.s., $L^p$, in probability, and in distribution), that
$$X_n\to X\implies c_nX_n\to cX.$$
Proof
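If $X_n\xrightarrow{\text{a.s.}}X$, then for almost every $\omega$,
$$c_nX_n(\omega)\to cX(\omega).$$
Thus $c_nX_n\xrightarrow{\text{a.s.}}cX$.
If $X_n\xrightarrow{L^p}X$, then $X\in L^p$, and $\{c_n\}$ is bounded. By the $C_r$ inequality, for a constant $C_p$ depending only on $p$,
$$|c_nX_n-cX|^p\le C_p\left(|c_n|^p|X_n-X|^p+|c_n-c|^p|X|^p\right).$$
Taking expectations, and using $\sup_n|c_n|<\infty$, $E|X_n-X|^p\to0$, and $c_n\to c$, gives
$$E[|c_nX_n-cX|^p]\to0,$$
so $c_nX_n\xrightarrow{L^p}cX$.
If $X_n\xrightarrow{P}X$, then
$$c_nX_n-cX=c_n(X_n-X)+(c_n-c)X.$$
Since $\{c_n\}$ is bounded, the first term converges to 0 in probability. The second term converges to 0 a.s., hence also in probability. Therefore
$$c_nX_n\xrightarrow{P}cX.$$
If $X_n\xrightarrow{D}X$, regard $c_n$ as a constant random variable. Then $c_n\xrightarrow{P}c$, and Slutsky's theorem gives
$$c_nX_n\xrightarrow{D}cX.$$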
Now prove Slutsky's theorem. For addition, the continuous mapping theorem gives
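$$X_n+c\xrightarrow{D}X+c.$$
Since
$$(X_n+Y_n)-(X_n+c)=Y_n-c\xrightarrow{P}0,$$
the lemma gives
$$X_n+Y_n\xrightarrow{D}X+c.$$
For multiplication, $X_n\xrightarrow{D}X$ implies that $\{X_n\}$ is tight. Thus for every $\varepsilon,\eta>0$, there exists $M>0$ such that, for all large $n$,
$$P(|X_n|>M)<\eta.$$
Hence
$$P(|X_n(Y_n-b)|>\varepsilon)\le P(|X_n|>M)+P\left(|Y_n-b|>\frac{\varepsilon}{M}\right).$$
Letting $n\to\infty$ and then $\eta\downarrow0$ gives
$$X_n(Y_n-b)\xrightarrow{P}0.$$
Also, by the continuous mapping theorem,
$$bX_n\xrightarrow{D}bX.$$
Since
$$X_nY_n-bX_n=X_n(Y_n-b)\xrightarrow{P}0,$$
the lemma gives
$$X_nY_n\xrightarrow{D}bX.$$
Finally, apply the addition part to $X_nY_n$ and $Z_n$ to get
$$X_nY_n+Z_n\xrightarrow{D}bX+c.$$
If $b\ne0$, then $x\mapsto1/x$ is continuous at $b$, so
$$\frac1{Y_n}\xrightarrow{P}\frac1b.$$
The multiplication part applied to $X_n$ and $1/Y_n$ gives
$$\frac{X_n}{Y_n}=X_n\cdot\frac1{Y_n}\xrightarrow{D}\frac Xb.$$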
Exercise 4.3
Note
Borel-Cantelli, subsequence arguments, and extreme-value estimates often appear together. Almost sure conclusions usually come from building summable bad events.
Problem: 4.3.1
Let {Xn} be independent standard normal random variables. Use the standard normal tail estimate from Chapter 3, Problem 14(1), to prove
$$P\left(\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}=\sqrt2\right)=1.$$
Proof
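For $a>0$, define
$$A_n(a)=\{X_n\ge\sqrt{2a\log n}\}.$$
By the normal tail estimate, there exist constants $C_1,C_2>0$ such that, for all large $n$,
$$C_1\frac{n^{-a}}{\sqrt{\log n}}\le P(A_n(a))\le C_2\frac{n^{-a}}{\sqrt{\log n}}.$$
If $0<a<1$, then
$$\sum_{n=2}^{\infty}P(A_n(a))=\infty.$$
The events $A_n(a)$ are independent, so the second Borel-Cantelli lemma gives
$$P(A_n(a)\ \text{i.o.})=1.$$
Thus
$$\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}\ge\sqrt{2a}\quad\text{a.s.}$$
If $a>1$, then
$$\sum_{n=2}^{\infty}P(A_n(a))<\infty.$$
By the first Borel-Cantelli lemma,
$$P(A_n(a)\ \text{i.o.})=0.$$
Hence
$$\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}\le\sqrt{2a}\quad\text{a.s.}$$
Therefore, for every $0<a<1<b$,
$$\sqrt{2a}\le\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}\le\sqrt{2b}\quad\text{a.s.}$$
Letting $a\uparrow1$ and $b\downarrow1$ along rational values gives
$$\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}=\sqrt2\quad\text{a.s.}$$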
Problem: 4.3.6
Let X1,⋯,Xn be i.i.d. uniform random variables on [0,a], where a>0. Set
Mn=max{X1,⋯,Xn}.
Prove that Mn→a both a.s. and in Lp as n→∞.
Proof
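For every $0<\varepsilon<a$,
$$P(|M_n-a|>\varepsilon)=P(M_n<a-\varepsilon)=\left(\frac{a-\varepsilon}{a}\right)^n.$$
Since
$$\sum_{n=1}^{\infty}\left(\frac{a-\varepsilon}{a}\right)^n<\infty,$$
the first Borel-Cantelli lemma implies that, almost surely, the event
$$\{|M_n-a|>\varepsilon\}$$
occurs for only finitely many $n$. Taking a countable intersection over positive rational $\varepsilon$ gives
$$M_n\xrightarrow{\text{a.s.}}a.$$
Since $0\le M_n\le a$,
$$|M_n-a|^p\le a^p.$$
Together with almost sure convergence, the dominated convergence theorem gives
$$E[|M_n-a|^p]\to0.$$
Thus
$$M_n\xrightarrow{L^p}a.$$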
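For the uniform maximum both quantities in this proof have closed forms, which makes the convergence easy to inspect. The sketch below assumes the standard fact that $(a-M_n)/a$ has a Beta$(1,n)$ law, so that $E|M_n-a|^p=a^p\,\Gamma(p+1)\Gamma(n+1)/\Gamma(n+p+1)$; this formula and the function names are additions for illustration, not taken from the original solution.

```python
import math

def tail_prob(n, a, eps):
    """P(|M_n - a| > eps) = ((a - eps) / a)^n for 0 < eps < a."""
    return ((a - eps) / a) ** n

def lp_error(n, a, p):
    """E|M_n - a|^p, using (a - M_n)/a ~ Beta(1, n); computed in log space."""
    log_val = math.lgamma(p + 1) + math.lgamma(n + 1) - math.lgamma(n + p + 1)
    return a ** p * math.exp(log_val)

a, eps, p = 2.0, 0.1, 3.0
for n in (1, 10, 100, 1000):
    print(n, tail_prob(n, a, eps), lp_error(n, a, p))
```

The tail probabilities decay geometrically in $n$, which is what makes them summable, while the $L^p$ error decays polynomially, of order $n^{-p}$.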
Problem: 4.3.7
Suppose $X_n\xrightarrow{P}X$. Prove that there exists a subsequence $\{X_{n_k}\}$ such that
$$X_{n_k}\xrightarrow{\text{a.s.}}X.$$
Proof
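Since $X_n\xrightarrow{P}X$, for each $k\in\mathbb N^*$ we may choose $n_k>n_{k-1}$ such that
$$P(|X_{n_k}-X|>2^{-k})<2^{-k}.$$
Then
$$\sum_{k=1}^{\infty}P(|X_{n_k}-X|>2^{-k})<\infty.$$
By the first Borel-Cantelli lemma, almost surely the events
$$\{|X_{n_k}-X|>2^{-k}\}$$
occur only finitely many times. Hence, for almost every $\omega$, there exists $K(\omega)$ such that for all $k\ge K(\omega)$,
$$|X_{n_k}(\omega)-X(\omega)|\le2^{-k}.$$
Therefore $X_{n_k}(\omega)\to X(\omega)$, and
$$X_{n_k}\xrightarrow{\text{a.s.}}X.$$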
Problem: 4.3.8
(1) Let $\{X_n\}$ be independent real-valued random variables with $X_n\xrightarrow{P}0$, and let $\{a_n\}$ be a positive increasing sequence with $a_n\to+\infty$. Must we have
$$\frac{X_n}{a_n}\xrightarrow{\text{a.s.}}0?$$
(2) Let $\{X_n\}$ be any sequence of real-valued random variables. Construct positive numbers $\{c_n\}$ such that
$$\frac{X_n}{c_n}\xrightarrow{\text{a.s.}}0.$$
Proof
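(1) The conclusion need not hold. For the given sequence $\{a_n\}$, define independent random variables by
$$P(X_n=a_n)=\frac1{n+1},\qquad P(X_n=0)=1-\frac1{n+1}.$$
Since $a_n\to\infty$, for every $\varepsilon>0$ and all large $n$, $a_n>\varepsilon$. Hence
$$P(|X_n|>\varepsilon)=\frac1{n+1}\to0,$$
so $X_n\xrightarrow{P}0$. But
$$P\left(\frac{X_n}{a_n}=1\right)=\frac1{n+1},\qquad\sum_{n=1}^{\infty}\frac1{n+1}=\infty.$$
By the second Borel-Cantelli lemma, almost surely
$$\frac{X_n}{a_n}=1$$
occurs infinitely often. Thus $X_n/a_n$ does not converge to 0 a.s.
(2) For each $n$, since $P(|X_n|>t)\downarrow0$ as $t\to\infty$, choose $c_n>0$ such that
$$P(|X_n|>2^{-n}c_n)<2^{-n}.$$
Let
$$A_n=\{|X_n|>2^{-n}c_n\}.$$
Then
$$\sum_{n=1}^{\infty}P(A_n)<\infty.$$
By the first Borel-Cantelli lemma, $A_n$ occurs only finitely many times. Hence, almost surely, there exists $N(\omega)$ such that for all $n\ge N(\omega)$,
$$\left|\frac{X_n}{c_n}\right|\le2^{-n}.$$
Therefore
$$\frac{X_n}{c_n}\xrightarrow{\text{a.s.}}0.$$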
Exercise 4.4
Note
Proofs of the strong law often use truncation, fourth moments, or Borel-Cantelli. Watch which moment condition controls which tail event.
Problem: 4.4.1
Let $\{X_n\}$ be nonnegative i.i.d. random variables with $E[X_1]=+\infty$. Prove that
$$\frac1n\sum_{k=1}^nX_k\xrightarrow{\text{a.s.}}+\infty.$$
Proof
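For each $M>0$, let
$$Y_k^{(M)}=X_k\wedge M.$$
Then $\{Y_k^{(M)}\}$ is still a nonnegative i.i.d. sequence, and $E[Y_1^{(M)}]<\infty$. By the strong law of large numbers,
$$\frac1n\sum_{k=1}^nY_k^{(M)}\xrightarrow{\text{a.s.}}E[Y_1^{(M)}].$$
Since $X_k\ge Y_k^{(M)}$,
$$\liminf_{n\to\infty}\frac1n\sum_{k=1}^nX_k\ge E[Y_1^{(M)}]\quad\text{a.s.}$$
Because $Y_1^{(M)}\uparrow X_1$ as $M\uparrow\infty$, the monotone convergence theorem gives
$$E[Y_1^{(M)}]\uparrow E[X_1]=+\infty.$$
Thus for every $L>0$, we can choose $M$ so that $E[Y_1^{(M)}]\ge L$. Hence
$$\liminf_{n\to\infty}\frac1n\sum_{k=1}^nX_k\ge L\quad\text{a.s.}$$
Since $L$ is arbitrary (take $L$ along the positive integers),
$$\frac1n\sum_{k=1}^nX_k\xrightarrow{\text{a.s.}}+\infty.$$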
Problem: 4.4.2
(Weierstrass approximation theorem) For every continuous function f:[0,1]→R, let Sn∼B(n,x). Prove that
Finally, prove that Sn/n does not converge to 0 a.s. Let
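$$A_n=\{X_n=2n\}.$$
Then $\{A_n\}$ are independent, and
$$\sum_{n=2}^{\infty}P(A_n)=\sum_{n=2}^{\infty}\frac1{2n\log n}=\infty.$$
By the second Borel-Cantelli lemma, $A_n$ occurs infinitely often a.s. If
$$\frac{S_n}{n}\xrightarrow{\text{a.s.}}0,$$
then
$$\frac{S_{n-1}}{n}=\frac{n-1}{n}\cdot\frac{S_{n-1}}{n-1}\xrightarrow{\text{a.s.}}0.$$
But on $A_n$,
$$\frac{S_n}{n}=\frac{S_{n-1}}{n}+2.$$
Since $A_n$ occurs infinitely often, this contradicts $S_n/n\to0$. Therefore $S_n/n$ does not converge to 0 a.s.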
Exercise 5.1
Note
For characteristic functions, independent sums correspond to products, and linear changes correspond to rescaling. Convergence in distribution can often be checked through pointwise convergence of characteristic functions.
Assume that {U,V} is independent of {X,Y}, and let
$$Z=\frac{UX+VY}{\sqrt{U^2+V^2}}.$$
Prove that if X and Y are independent N(0,1) random variables, then Z∼N(0,1). If (X,Y) is only standard bivariate normal, does the conclusion still hold?
Proof
If $X,Y$ are independent and both have distribution $N(0,1)$, then for every fixed $(u,v)\in\mathbb R^2$,
$$uX+vY\sim N(0,u^2+v^2).$$
Thus, conditional on $(U,V)=(u,v)$,
$$Z\mid(U,V)=(u,v)\sim N(0,1).$$
Equivalently, for every $t\in\mathbb R$,
$$E[e^{itZ}\mid U,V]=e^{-t^2/2}.$$
Taking expectations gives
$$E[e^{itZ}]=e^{-t^2/2},$$
so $Z\sim N(0,1)$.
If $(X,Y)$ is only standard bivariate normal and independence is not assumed, the conclusion need not hold. Let
$$\operatorname{Cov}(X,Y)=\rho\ne0,$$
and take $U=V=1$. Then
$$Z=\frac{X+Y}{\sqrt2},$$
so
$$\operatorname{Var}(Z)=\frac12\operatorname{Var}(X+Y)=\frac12(1+1+2\rho)=1+\rho\ne1.$$
Thus $Z$ is not $N(0,1)$ in general.
Problem: 5.1.3
Let
$$\phi(t)=\left(\frac{\sin t}{t}\right)^2.$$
Use a probabilistic argument to prove that, for real numbers $t_1,\dots,t_n$, the matrix
$$H_n=(\phi(t_i-t_j))_{i,j=1}^n$$
is nonnegative definite.
Proof
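Take independent random variables $X,Y\sim U[-1,1]$. Then
$$\phi_X(t)=\phi_Y(t)=\frac{\sin t}{t}.$$
Hence
$$\phi_{X+Y}(t)=\phi_X(t)\phi_Y(t)=\left(\frac{\sin t}{t}\right)^2=\phi(t).$$
Thus $\phi$ is the characteristic function of the random variable $W=X+Y$. Consequently, for all complex numbers $z_1,\dots,z_n$,
$$\sum_{i,j=1}^n\phi(t_i-t_j)z_i\bar z_j=E\left|\sum_{i=1}^nz_ie^{it_iW}\right|^2\ge0,$$
so $H_n$ is nonnegative definite.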
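The nonnegative definiteness claimed in this problem can be spot-checked by evaluating the quadratic form $\sum_{i,j}\phi(t_i-t_j)z_iz_j$ for random real vectors $z$ (real vectors suffice because the matrix is real and symmetric). This is a rough numerical sanity check with made-up sample points, not a replacement for the characteristic-function argument.

```python
import math
import random

def phi(t):
    """phi(t) = (sin t / t)^2, extended by continuity with phi(0) = 1."""
    return 1.0 if t == 0 else (math.sin(t) / t) ** 2

def quadratic_form(ts, zs):
    """sum_{i,j} phi(t_i - t_j) z_i z_j for real points ts and real weights zs."""
    return sum(
        phi(ti - tj) * zi * zj
        for ti, zi in zip(ts, zs)
        for tj, zj in zip(ts, zs)
    )

rng = random.Random(1)
ts = [rng.uniform(-5, 5) for _ in range(8)]
smallest = min(
    quadratic_form(ts, [rng.uniform(-1, 1) for _ in ts]) for _ in range(200)
)
print("smallest quadratic form seen:", smallest)
```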
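(1) If the moment generating function $M(t)=E[e^{tX_1}]$ exists, prove the tail bound
$$P(X_1\ge a)\le\inf_{t>0}\{e^{-at}M(t)\}.$$
(2) If $P(X_1=1)=P(X_1=-1)=\frac12$, prove that for every $a>0$,
$$P(S_n\ge a)\le e^{-a^2/(2n)},$$
where, as the proof below uses, $S_n=X_1+\cdots+X_n$ with i.i.d. summands.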
Proof
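(1) For every $t>0$, Markov's inequality gives
$$P(X_1\ge a)=P(e^{tX_1}\ge e^{ta})\le e^{-ta}E[e^{tX_1}]=e^{-ta}M(t).$$
Taking the infimum over $t>0$ gives the result.
(2) Apply (1) to $S_n$:
$$P(S_n\ge a)\le e^{-at}E[e^{tS_n}]=e^{-at}\left(E[e^{tX_1}]\right)^n.$$
Now
$$E[e^{tX_1}]=\frac{e^t+e^{-t}}2=\cosh t,$$
and, since $(2m)!\ge2^mm!$,
$$\cosh t=\sum_{m=0}^{\infty}\frac{t^{2m}}{(2m)!}\le\sum_{m=0}^{\infty}\frac{(t^2/2)^m}{m!}=e^{t^2/2}.$$
Thus
$$P(S_n\ge a)\le\exp\left(-at+\frac{nt^2}2\right).$$
Taking $t=a/n$ gives
$$P(S_n\ge a)\le e^{-a^2/(2n)}.$$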
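For $\pm1$ steps the tail of $S_n$ is an exact binomial sum, so the bound $e^{-a^2/(2n)}$ can be compared with the true probability. The sketch below uses the identity $S_n=2B-n$ with $B\sim\mathrm{Bin}(n,1/2)$; the function names are chosen for this illustration only.

```python
import math

def exact_tail(n, a):
    """P(S_n >= a) for a sum of n independent +/-1 fair signs, via S_n = 2B - n."""
    k_min = max(math.ceil((n + a) / 2), 0)
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

def chernoff_bound(n, a):
    """The bound exp(-a^2 / (2n)) derived above."""
    return math.exp(-a * a / (2 * n))

for n, a in [(10, 4), (100, 20), (100, 40), (1000, 100)]:
    print(n, a, exact_tail(n, a), chernoff_bound(n, a))
```

The bound is never violated, but it is not tight: it loses a polynomial factor compared with the exact tail.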
Problem: 5.1.8
A random variable $X$ is called sub-Gaussian if, for some constant $K>0$,
$$P(|X|\ge t)\le2e^{-t^2/K^2},\quad\forall t\ge0.$$
Prove:
(1) If
$$E[e^{sX}]\le e^{s^2/2},\quad\forall s\in\mathbb R,$$
then $X$ is sub-Gaussian.
(2) The moments of a sub-Gaussian random variable satisfy
$$E[|X|^p]\le(K_1\sqrt p)^p,\quad\forall p\ge1,$$
where $K_1$ is a positive constant independent of $p$. You may use Stirling's formula
$$n!\sim n^ne^{-n}\sqrt{2\pi n}.$$
Proof
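(1) For $s,t>0$, Markov's inequality gives
$$P(X\ge t)=P(e^{sX}\ge e^{st})\le e^{-st}E[e^{sX}]\le e^{-st+s^2/2}.$$
Taking $s=t$ yields
$$P(X\ge t)\le e^{-t^2/2}.$$
Applying the same argument to $-X$ gives
$$P(X\le-t)\le e^{-t^2/2}.$$
Therefore
$$P(|X|\ge t)\le2e^{-t^2/2},$$
so $X$ is sub-Gaussian with $K=\sqrt2$.
(2) By the tail integral formula,
$$E[|X|^p]=\int_0^{\infty}pt^{p-1}P(|X|>t)\,dt\le2p\int_0^{\infty}t^{p-1}e^{-t^2/K^2}\,dt.$$
With $u=t^2/K^2$,
$$E[|X|^p]\le pK^p\int_0^{\infty}u^{p/2-1}e^{-u}\,du=pK^p\,\Gamma(p/2)=2K^p\,\Gamma(p/2+1).$$
By Stirling's formula, there is a constant $C>0$ such that for all $p\ge1$,
$$\Gamma(p/2+1)\le C^pp^{p/2}.$$
Hence, using $2\le2^p$,
$$E[|X|^p]\le(K_1\sqrt p)^p$$
for a constant $K_1$ independent of $p$, for example $K_1=2CK$.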
Exercise 5.2
Note
This section looks at convergence in distribution and how independence passes to limits. The Cauchy example is a warning: without a first moment, the usual law-of-large-numbers intuition does not apply.
Problem: 5.2.2
Suppose $X_n,Y_n$ are independent, $X,Y$ are independent, and
$$X_n\xrightarrow{D}X,\qquad Y_n\xrightarrow{D}Y.$$
Prove that
$$X_n+Y_n\xrightarrow{D}X+Y.$$
Proof
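By independence,
$$\phi_{X_n+Y_n}(t)=\phi_{X_n}(t)\phi_{Y_n}(t).$$
Since $X_n\xrightarrow{D}X$ and $Y_n\xrightarrow{D}Y$, for every $t\in\mathbb R$,
$$\phi_{X_n}(t)\to\phi_X(t),\qquad\phi_{Y_n}(t)\to\phi_Y(t).$$
Because $X,Y$ are independent,
$$\phi_X(t)\phi_Y(t)=\phi_{X+Y}(t).$$
Thus
$$\phi_{X_n+Y_n}(t)\to\phi_{X+Y}(t).$$
By Lévy's continuity theorem,
$$X_n+Y_n\xrightarrow{D}X+Y.$$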
Problem: 5.2.3
Let $X_1,\dots,X_n$ be independent Cauchy random variables. Prove that
$$\frac1n\sum_{k=1}^nX_k$$
also has the Cauchy distribution.
Proof
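First compute the characteristic function of the standard Cauchy distribution. If $X$ has density
$$f(x)=\frac1{\pi(1+x^2)},$$
then
$$\phi_X(t)=\frac1\pi\int_{-\infty}^{\infty}\frac{e^{itx}}{1+x^2}\,dx.$$
For $t>0$, consider
$$g(z)=\frac{e^{itz}}{1+z^2},$$
and integrate over the upper half-plane semicircle. By Jordan's lemma, the integral over the arc tends to 0. The only pole inside the contour is $z=i$, with residue $e^{-t}/(2i)$. The residue theorem therefore gives $\phi_X(t)=\frac1\pi\cdot2\pi i\cdot\frac{e^{-t}}{2i}=e^{-t}$ for $t>0$, and by symmetry $\phi_X(t)=e^{-|t|}$ for all $t\in\mathbb R$. By independence,
$$\phi_{\frac1n\sum_{k=1}^nX_k}(t)=\left(\phi_X(t/n)\right)^n=\left(e^{-|t|/n}\right)^n=e^{-|t|}.$$
This is the characteristic function of the standard Cauchy distribution, so
$$\frac1n\sum_{k=1}^nX_k$$
is again Cauchy.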
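A quick simulation makes the statement tangible: averaging does not concentrate a Cauchy sample. The sketch below samples standard Cauchy variables by the inverse-CDF method and estimates $P(|\bar X_n|\le1)$, which equals $1/2$ for a standard Cauchy variable regardless of $n$. The sample sizes and seed are arbitrary choices for illustration.

```python
import math
import random

def cauchy_sample(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def fraction_within_one(n, trials, seed=0):
    """Monte Carlo estimate of P(|mean of n standard Cauchy draws| <= 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(cauchy_sample(rng) for _ in range(n)) / n
        hits += abs(mean) <= 1
    return hits / trials

for n in (1, 10, 50):
    print(n, fraction_within_one(n, trials=2000))
```

For a distribution with a finite mean the estimated fraction would move toward 0 or 1 as $n$ grows; here it stays near $1/2$.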
Problem: 5.2.5
Let $\phi_n(t)=\cos^nt$, $t\in\mathbb R$.
(1) Find the distribution function corresponding to the characteristic function $\phi_2(t)$.
(2) For general positive integers $n$, is $\phi_n(t)$ a characteristic function? Answer and explain.
Proof
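(1) Define a random variable $X$ by
$$P(X=-2)=\frac14,\qquad P(X=0)=\frac12,\qquad P(X=2)=\frac14.$$
Then
$$\phi_X(t)=\frac14e^{-2it}+\frac12+\frac14e^{2it}=\frac{1+\cos2t}2=\cos^2t.$$
Hence the distribution function corresponding to $\phi_2$ is
$$F_2(x)=\begin{cases}0, & x<-2,\\ \frac14, & -2\le x<0,\\ \frac34, & 0\le x<2,\\ 1, & x\ge2.\end{cases}$$
(2) For any positive integer $n$, let $Y_1,\dots,Y_n$ be i.i.d. random variables with
$$P(Y_k=1)=P(Y_k=-1)=\frac12.$$
Then
$$\phi_{Y_k}(t)=\frac12(e^{it}+e^{-it})=\cos t.$$
By independence,
$$\phi_{Y_1+\cdots+Y_n}(t)=\prod_{k=1}^n\phi_{Y_k}(t)=\cos^nt=\phi_n(t).$$
Thus $\phi_n(t)$ is a characteristic function for every positive integer $n$.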
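The identity $\phi_{Y_1+\cdots+Y_n}(t)=\cos^nt$ can be confirmed directly from the law of the sum: $Y_1+\cdots+Y_n=2k-n$ with probability $\binom nk2^{-n}$. The short sketch below (function name chosen for this example) evaluates that finite sum and compares it with $\cos^nt$.

```python
import cmath
import math

def char_fn_sign_sum(n, t):
    """E[exp(i t S)] where S = 2k - n with probability C(n, k) / 2^n."""
    return sum(
        math.comb(n, k) / 2 ** n * cmath.exp(1j * t * (2 * k - n))
        for k in range(n + 1)
    )

for n, t in [(2, 0.3), (5, 0.7), (12, 2.0)]:
    value = char_fn_sign_sum(n, t)
    print(n, t, value.real, math.cos(t) ** n)
```

For $n=2$ the weights $\binom2k/4$ are exactly the masses $1/4,1/2,1/4$ at $-2,0,2$ from part (1).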
Exercise 5.3
Note
For central limit theorem problems, first identify the centering and scaling. If the variance depends on n, compute the scale before applying a theorem.
Problem: 5.3.1
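Choose suitable sequences $\{\mu_n\}$ and $\{\sigma_n\}$ to prove
$$\frac{X_n-\mu_n}{\sigma_n}\xrightarrow{D}N(0,1).$$
(1) $X_n$ has the Poisson distribution with positive integer parameter $n$.
(2) $X_n$ has the Gamma density
$$f(x)=\frac{x^{n-1}e^{-x}}{\Gamma(n)}1_{\{x\ge0\}}.$$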
Proof
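(1) If $Y_1,\dots,Y_n$ are i.i.d. with $Y_i\sim P(1)$ (Poisson with mean 1 and variance 1), then
$$X_n':=Y_1+\cdots+Y_n\sim P(n).$$
Thus $X_n'$ and $X_n$ have the same distribution. By the i.i.d. CLT,
$$\frac{X_n'-n}{\sqrt n}\xrightarrow{D}N(0,1).$$
So take
$$\mu_n=n,\qquad\sigma_n=\sqrt n.$$
Then
$$\frac{X_n-\mu_n}{\sigma_n}\xrightarrow{D}N(0,1).$$
(2) If $Z_1,\dots,Z_n$ are i.i.d. exponential random variables with parameter 1 (mean 1 and variance 1), then
$$X_n':=Z_1+\cdots+Z_n$$
has density
$$f(x)=\frac{x^{n-1}e^{-x}}{\Gamma(n)}1_{\{x\ge0\}}.$$
Thus $X_n'$ and $X_n$ have the same distribution. By the i.i.d. CLT,
$$\frac{X_n'-n}{\sqrt n}\xrightarrow{D}N(0,1).$$
Again take
$$\mu_n=n,\qquad\sigma_n=\sqrt n.$$
This gives
$$\frac{X_n-\mu_n}{\sigma_n}\xrightarrow{D}N(0,1).$$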
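With the centering $\mu_n=n$ and scale $\sigma_n=\sqrt n$, the Poisson case can be checked against exact probabilities. The sketch below compares $P(X_n\le n+z\sqrt n)$ for $X_n\sim\text{Poisson}(n)$ with $\Phi(z)$; the helper names are illustrative, and the pmf is summed in log space only to avoid overflow.

```python
import math

def poisson_cdf(n, x):
    """P(X <= x) for X ~ Poisson(n), summing the pmf computed in log space."""
    k_max = math.floor(x)
    if k_max < 0:
        return 0.0
    return sum(
        math.exp(-n + k * math.log(n) - math.lgamma(k + 1))
        for k in range(k_max + 1)
    )

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def standardized_cdf_gap(n, z):
    """|P((X_n - n)/sqrt(n) <= z) - Phi(z)| for X_n ~ Poisson(n)."""
    return abs(poisson_cdf(n, n + z * math.sqrt(n)) - normal_cdf(z))

for n in (25, 100, 400):
    print(n, [round(standardized_cdf_gap(n, z), 4) for z in (-1.0, 0.0, 1.0)])
```

The gaps shrink slowly as $n$ grows, consistent with a rate of order $n^{-1/2}$ for lattice distributions.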
Problem: 5.3.3
Let $X_1,\dots,X_n$ be i.i.d. random variables with
$$P(X_1=1)=P(X_1=-1)=\frac12.$$
Prove that
$$\frac{\sqrt3}{n^{3/2}}\sum_{k=1}^nkX_k\xrightarrow{D}N(0,1).$$
Proof
Let
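$$Y_{n,k}=kX_k,\qquad1\le k\le n.$$
Then $\{Y_{n,k}\}_{k=1}^n$ are independent, with
$$E[Y_{n,k}]=0,\qquad\operatorname{Var}(Y_{n,k})=k^2.$$
Set
$$B_n^2=\sum_{k=1}^n\operatorname{Var}(Y_{n,k})=\sum_{k=1}^nk^2=\frac{n(n+1)(2n+1)}6.$$
Since $B_n\asymp n^{3/2}$, for every $\varepsilon>0$ and all $n$ large enough,
$$|Y_{n,k}|=k\le n<\varepsilon B_n,\qquad1\le k\le n.$$
Hence
$$\sum_{k=1}^nE\left[Y_{n,k}^2;\,|Y_{n,k}|>\varepsilon B_n\right]=0,$$
so the Lindeberg condition holds. By the Lindeberg-Feller CLT,
$$\frac{\sum_{k=1}^nkX_k}{B_n}\xrightarrow{D}N(0,1).$$
Since
$$\frac{B_n}{n^{3/2}}=\sqrt{\frac{(n+1)(2n+1)}{6n^2}}\longrightarrow\frac1{\sqrt3},$$
Slutsky's theorem gives
$$\frac{\sqrt3}{n^{3/2}}\sum_{k=1}^nkX_k\xrightarrow{D}N(0,1).$$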
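The two numerical facts driving this proof are that $\sqrt3\,B_n/n^{3/2}\to1$ and that the largest summand is negligible, $n/B_n\to0$. Both are deterministic and easy to tabulate; the function names below are chosen for this illustration.

```python
import math

def b_n(n):
    """B_n = sqrt(sum_{k=1}^n k^2), the standard deviation of sum k X_k."""
    return math.sqrt(sum(k * k for k in range(1, n + 1)))

def scale_ratio(n):
    """sqrt(3) * B_n / n^(3/2), which should approach 1."""
    return math.sqrt(3) * b_n(n) / n ** 1.5

def max_term_ratio(n):
    """max_k |k X_k| / B_n = n / B_n, which should approach 0."""
    return n / b_n(n)

for n in (10, 100, 1000, 10000):
    print(n, scale_ratio(n), max_term_ratio(n))
```

Once `max_term_ratio(n)` drops below a given $\varepsilon$, every truncated second moment in the Lindeberg sum is exactly zero.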
Exercise 5.5
Note
Slutsky's theorem lets us replace a random error by its constant probability limit. The first thing to check is whether the added or multiplied factor converges in probability to a constant.
Problem: 5.5.12
Slutsky's theorem says: if random variables {Xn}, {Yn}, and {Zn} satisfy
$$X_n\xrightarrow{D}X,\qquad Y_n\xrightarrow{P}b,\qquad Z_n\xrightarrow{P}c,$$
where $X$ is a random variable and $b,c$ are constants, then
$$X_nY_n+Z_n\xrightarrow{D}bX+c.$$
Use Slutsky's theorem to answer the following questions.
1 Let {Xn} be i.i.d., with E[X1]=0 and finite second moment. Let
where {Bk}, {εk}, and {ηk} are mutually independent, and
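$$P(B_k=1)=2^{-k},\qquad P(\varepsilon_k=\pm1)=P(\eta_k=\pm1)=\frac12.$$
Then $X_k$ has the required distribution. Set
$$T_n=\sum_{k=1}^n\varepsilon_k,\qquad R_n=\sum_{k=1}^nB_k(2^k\eta_k-\varepsilon_k).$$
Then
$$\sum_{k=1}^nX_k=T_n+R_n.$$
Since
$$\sum_{k=1}^{\infty}P(B_k=1)=\sum_{k=1}^{\infty}2^{-k}<\infty,$$
the first Borel-Cantelli lemma shows that, almost surely, $\{B_k=1\}$ occurs only finitely many times. Hence $R_n$ is eventually constant a.s., and
$$\frac{R_n}{\sqrt n}\xrightarrow{\text{a.s.}}0.$$
By the CLT,
$$\frac{T_n}{\sqrt n}\xrightarrow{D}N(0,1).$$
Slutsky's theorem gives
$$\frac1{\sqrt n}\sum_{k=1}^nX_k=\frac{T_n}{\sqrt n}+\frac{R_n}{\sqrt n}\xrightarrow{D}N(0,1).$$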
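(3) Let
$$T_n=\frac{S_n-n}{\sqrt n},\qquad U_n=\frac{S_n}n.$$
By the CLT,
$$T_n\xrightarrow{D}N(0,1).$$
By the weak law,
$$U_n\xrightarrow{P}1.$$
Also, since $S_n^{3/2}-n^{3/2}=n^{3/2}(U_n^{3/2}-1)$ and $T_n=\sqrt n\,(U_n-1)$,
$$\frac{2\left(S_n^{3/2}-n^{3/2}\right)}{3n}=T_n\cdot\frac23\cdot\frac{U_n^{3/2}-1}{U_n-1}.$$
Define
$$g(u)=\frac23\cdot\frac{u^{3/2}-1}{u-1}\quad(u\ne1),\qquad g(1)=1.$$
Then $g$ is continuous at $u=1$, so
$$g(U_n)\xrightarrow{P}1.$$
Slutsky's theorem gives
$$\frac{2\left(S_n^{3/2}-n^{3/2}\right)}{3n}\xrightarrow{D}N(0,1).$$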
Exercise 5.4
Note
This section moves into stronger limit theorems and Stein's method. When reading the proofs, separate weak convergence, moment bounds, and integrability.
Problem: 5.4.1
Let X1,X2,… be i.i.d. with
$$P(X_1=1)=P(X_1=-1)=\frac12.$$
Prove that for every $\delta>0$,
$$\frac1{n^{1/2+\delta}}\sum_{k=1}^nX_k\xrightarrow{\text{a.s.}}0.$$
Proof
Chebyshev's inequality alone only reaches δ>1/2, so we use higher moments. Let m be a positive integer to be chosen later, and set
Since the Xi are independent and EXi=0, a term vanishes if some index appears an odd number of times. Thus, in every nonzero term, the number of distinct indices is at most m. Hence there is a constant Cm, depending only on m, such that
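$$ES_n^{2m}\le C_mn^m.$$
Therefore, by Markov's inequality applied to $S_n^{2m}$,
$$P\left(\frac{|S_n|}{n^{1/2+\delta}}>\varepsilon\right)\le\frac{C_m}{\varepsilon^{2m}}n^{-2m\delta}.$$
Choose $m$ so that
$$2m\delta>1.$$
Then
$$\sum_{n=1}^{\infty}P\left(\frac{|S_n|}{n^{1/2+\delta}}>\varepsilon\right)<\infty.$$
By Borel-Cantelli,
$$P\left(\frac{|S_n|}{n^{1/2+\delta}}>\varepsilon\ \text{i.o.}\right)=0.$$
Thus, for every fixed $\varepsilon>0$, almost surely there is $N(\omega)$ such that for $n\ge N(\omega)$,
$$\frac{|S_n|}{n^{1/2+\delta}}\le\varepsilon.$$
Letting $\varepsilon$ run over the positive rationals gives
$$\frac{S_n}{n^{1/2+\delta}}\xrightarrow{\text{a.s.}}0.$$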
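The moment bound $ES_n^{2m}\le C_mn^m$ can be checked exactly for $\pm1$ steps, because $S_n=2k-n$ with probability $\binom nk2^{-n}$. The sketch below (the function name is an illustrative choice) computes even moments from that law; for $m=2$ the exact value is $3n^2-2n$.

```python
import math

def even_moment(n, m):
    """Exact E[S_n^(2m)] for a sum of n independent fair +/-1 signs."""
    total = sum(math.comb(n, k) * (2 * k - n) ** (2 * m) for k in range(n + 1))
    return total / 2 ** n

for n in (10, 100, 1000):
    print(n, even_moment(n, 2) / n ** 2, even_moment(n, 3) / n ** 3)
```

The printed ratios stay bounded as $n$ grows, which is all the Borel-Cantelli step needs.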
Problem: 5.4.4
Let {Xk} be i.i.d. random variables with
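$$EX_1=0,\qquad\operatorname{Var}(X_1)=1,\qquad E|X_1|^3<\infty.$$
Use the Lindeberg replacement method to prove the CLT convergence rate
$$\sup_{t\in\mathbb R}\left|P\left(\frac1{\sqrt n}\sum_{k=1}^nX_k\le t\right)-\Phi(t)\right|=O(n^{-1/8}).$$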
Here Φ(t) is the standard normal distribution function.
Proof
Set
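$$S_n=\sum_{k=1}^nX_k,\qquad W_n=\frac{S_n}{\sqrt n}.$$
Let $Y_1,\dots,Y_n$ be i.i.d. standard normal random variables, independent of $X_1,\dots,X_n$, and set
$$Z_n=\frac1{\sqrt n}\sum_{k=1}^nY_k.$$
Then $Z_n\sim N(0,1)$, so
$$P(Z_n\le t)=\Phi(t).$$
Fix $\varepsilon>0$. Choose a smooth function $f_{t,\varepsilon}\in C^3(\mathbb R)$ such that
$$1_{\{x\le t\}}\le f_{t,\varepsilon}(x)\le1_{\{x\le t+\varepsilon\}},$$
and
$$\|f_{t,\varepsilon}^{(3)}\|_{\infty}\le C\varepsilon^{-3},$$
where $C$ is independent of $t,\varepsilon,n$.
We estimate
$$|Ef_{t,\varepsilon}(W_n)-Ef_{t,\varepsilon}(Z_n)|.$$
Replace $X_k$ by $Y_k$ one at a time. Let
$$T_k=\frac1{\sqrt n}(Y_1+\cdots+Y_{k-1}+X_{k+1}+\cdots+X_n).$$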
Then Tk is independent of Xk and Yk. Taylor expansion gives
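For the other direction, choose a smooth function $g_{t,\varepsilon}$ such that
$$1_{\{x\le t-\varepsilon\}}\le g_{t,\varepsilon}(x)\le1_{\{x\le t\}},\qquad\|g_{t,\varepsilon}^{(3)}\|_{\infty}\le C\varepsilon^{-3}.$$
The same Lindeberg replacement argument gives
$$|Eg_{t,\varepsilon}(W_n)-Eg_{t,\varepsilon}(Z_n)|\le C\varepsilon^{-3}n^{-1/2}.$$
Therefore
$$P(W_n\le t)\ge Eg_{t,\varepsilon}(W_n)\ge Eg_{t,\varepsilon}(Z_n)-C\varepsilon^{-3}n^{-1/2}.$$
Also,
$$Eg_{t,\varepsilon}(Z_n)\ge P(Z_n\le t-\varepsilon)=\Phi(t-\varepsilon).$$
So, since the standard normal density is bounded,
$$\Phi(t)-P(W_n\le t)\le\Phi(t)-\Phi(t-\varepsilon)+C\varepsilon^{-3}n^{-1/2}\le C\varepsilon+C\varepsilon^{-3}n^{-1/2}.$$
Combining the two bounds, for every $t\in\mathbb R$,
$$|P(W_n\le t)-\Phi(t)|\le C\varepsilon+C\varepsilon^{-3}n^{-1/2}.$$
Take
$$\varepsilon=n^{-1/8}.$$
Then both terms are of order $n^{-1/8}$, and
$$|P(W_n\le t)-\Phi(t)|\le Cn^{-1/8}.$$
Therefore
$$\sup_{t\in\mathbb R}\left|P\left(\frac1{\sqrt n}\sum_{k=1}^nX_k\le t\right)-\Phi(t)\right|=O(n^{-1/8}).$$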
Remark
The main idea is to approximate the indicator function by a smooth function, and then replace $X_i$ by normal variables $Y_i$ one at a time. Since $X_i$ and $Y_i$ have the same first two moments, the first- and second-order Taylor terms cancel. Only the third-order remainder remains. The smoothing error is $O(\varepsilon)$, and the replacement error is $O(\varepsilon^{-3}n^{-1/2})$. Balancing the two, $\varepsilon=\varepsilon^{-3}n^{-1/2}$, gives $\varepsilon=n^{-1/8}$ and hence the bound.
Problem: 5.4.5
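Let $\{X_n\}$ be i.i.d. random variables with
$$EX_1=0,\qquad EX_1^2=1,$$
and assume that for all $l\ge3$,
$$E|X_1|^l<\infty.$$
Set
$$S_n=X_1+\cdots+X_n.$$
Let $H_k(x)$ be the $k$-th Hermite polynomial, defined by
$$H_0=1,\qquad(-1)^kH_k(x)\phi(x)=\phi^{(k)}(x),$$
where $\phi$ is the standard normal density. Prove that
$$\lim_{n\to\infty}E\left[H_k\left(\frac{S_n}{\sqrt n}\right)\right]=0,\quad\forall k\ge1.$$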
Proof
Let
$$W_n=\frac{S_n}{\sqrt n}.$$
First prove that, for every fixed positive integer $j$,
$$\lim_{n\to\infty}EW_n^j=EZ^j,$$
where $Z\sim N(0,1)$.
Expand:
$$EW_n^j=n^{-j/2}\sum_{i_1,\dots,i_j=1}^nE(X_{i_1}\cdots X_{i_j}).$$
Since $EX_1=0$ and the $X_i$ are independent, a term is 0 if some index appears exactly once.
Thus, in a nonzero term, every appearing index appears at least twice. If there are $r$ distinct indices, then
$$r\le\frac j2.$$
If $r<j/2$, the total contribution of these terms is at most
$$O(n^r)\,n^{-j/2}=o(1).$$
Therefore the limit can only come from r=j/2. This requires j to be even and every appearing index to appear exactly twice. Let j=2m. The number of pairings is
The idea is to first show that the fixed moments of $S_n/\sqrt n$ converge to the moments of a standard normal variable. In the moment expansion, because $EX_i=0$, only terms with paired indices survive in the limit. Those are exactly the normal moments. Since $H_k$ is a polynomial, moment convergence gives $EH_k(S_n/\sqrt n)\to EH_k(Z)$. Finally, Hermite polynomials satisfy $EH_k(Z)=0$ under the standard normal law for $k\ge1$.
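As a concrete instance, take $X_i=\pm1$ with probability $1/2$ each (an assumed example distribution; it has mean 0, variance 1, and all moments). Then $E[H_k(S_n/\sqrt n)]$ is a finite binomial sum, and for $k=4$ one gets exactly $EW_n^4-6EW_n^2+3=-2/n$. The sketch below builds $H_k$ from the three-term recurrence $H_{j+1}(x)=xH_j(x)-jH_{j-1}(x)$, which matches the definition via $\phi^{(k)}$.

```python
import math

def hermite(k, x):
    """Probabilists' Hermite polynomial H_k(x) via H_{j+1} = x H_j - j H_{j-1}."""
    if k == 0:
        return 1.0
    h_prev, h = 1.0, x
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

def expected_hermite(k, n):
    """Exact E[H_k(S_n / sqrt(n))] when S_n is a sum of n fair +/-1 signs."""
    return sum(
        math.comb(n, i) / 2 ** n * hermite(k, (2 * i - n) / math.sqrt(n))
        for i in range(n + 1)
    )

for n in (10, 100, 1000):
    print(n, [round(expected_hermite(k, n), 6) for k in (1, 2, 3, 4, 6)])
```

The $k=4$ column decays like $-2/n$, while $k=1,2,3$ vanish exactly by symmetry and the variance normalization.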
Problem: 5.4.8
(Stein's method) Prove that
X∼N(0,1)
if and only if for every bounded continuous function g with bounded continuous derivative g′,
E[Xg(X)]=E[g′(X)].
Hint: for $Z\sim N(0,1)$ and bounded continuous $h$, construct
$$g_0(x)=e^{x^2/2}\int_{-\infty}^xe^{-y^2/2}\left(h(y)-Eh(Z)\right)dy.$$
Proof
First prove necessity. If X∼N(0,1), its density is
for every bounded continuous h. Thus X and Z have the same distribution, and
X∼N(0,1).
This proves the equivalence.
Remark
The main point is the basic Stein characterization
X∼N(0,1)⟺E[Xg(X)]=E[g′(X)]
for a large enough class of test functions g.
The necessary part comes from the special identity for the normal density,
ϕ′(x)=−xϕ(x),
which lets us turn E[Xg(X)] into E[g′(X)] by integration by parts.
For sufficiency, we want to prove that X and a standard normal Z have the same distribution. It is enough to show that for every bounded continuous h,
Eh(X)=Eh(Z).
The constructed function g0 solves the Stein equation
g0′(x)−xg0(x)=h(x)−Eh(Z).
Putting g0 into the assumed identity gives
Eh(X)=Eh(Z).
So X must be standard normal.
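The characterization can be probed numerically with a single test function. With $g(x)=\sin x$ (bounded, with bounded derivative), the standard normal law gives $E[Z\sin Z]=E[\cos Z]=e^{-1/2}$, whereas a different law with the same mean and variance, here the uniform law on $[-\sqrt3,\sqrt3]$ chosen as an example, violates the identity. The trapezoid-rule helper below is an illustrative device, not part of the original argument.

```python
import math

def expectation(f, density, lo, hi, steps=20000):
    """Trapezoid-rule approximation of the integral of f(x) * density(x) on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f(x) * density(x)
    return total * h

def normal_density(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

A = math.sqrt(3)  # uniform on [-A, A] has mean 0 and variance 1

def uniform_density(x):
    return 1 / (2 * A)

def stein_gap(density, lo, hi):
    """E[X g(X)] - E[g'(X)] for g = sin under the given density."""
    lhs = expectation(lambda x: x * math.sin(x), density, lo, hi)
    rhs = expectation(math.cos, density, lo, hi)
    return lhs - rhs

print("normal :", stein_gap(normal_density, -10.0, 10.0))
print("uniform:", stein_gap(uniform_density, -A, A))
```

A nonzero gap for one admissible $g$ is already enough to rule out the standard normal law.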
There are similar Stein characterizations for other classical distributions, such as the exponential and Poisson distributions. For example:
(Stein characterization of the exponential distribution) Let $\lambda>0$, and let $W$ be a continuous random variable supported on $(0,\infty)$ with density $q$. Under suitable regularity conditions, prove that
$$W\sim\mathrm{Exp}(\lambda)$$
if and only if for every $f\in C_c^1(0,\infty)$,
$$Ef'(W)=\lambda Ef(W).$$
Extension: From the χ² distribution to the Wishart distribution
Reading map
Probability already has plenty of named distributions. Adding one or two more is not the main issue. The useful path is this:
normal sample⟹orthogonal decomposition⟹sum of squares / sum of outer products.
In one dimension, this path gives the $\chi^2$ distribution and explains why $\bar X$ and $s^2$ are independent. In several dimensions, what should replace it?
1. One dimension: a sum of squares loses one direction
First recall a small fact from the course notes. If
$$Z_1,\dots,Z_\nu\overset{\text{iid}}{\sim}N(0,1),$$
then
$$\sum_{i=1}^{\nu}Z_i^2\sim\chi_\nu^2.$$
This is the squared length of a standard normal vector in $\mathbb R^\nu$. Equivalently, $\chi_\nu^2$ is the law of the sum of $\nu$ independent squared standard normals, or of the square of a random radius.
Now recall the normal-sample theorem. If
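$$X_1,\dots,X_n\overset{\text{iid}}{\sim}N(\mu,\sigma^2),\qquad s^2=\frac1{n-1}\sum_{i=1}^n(X_i-\bar X)^2,$$
then
$$\bar X\perp\!\!\!\perp s^2,\qquad\frac{(n-1)s^2}{\sigma^2}\sim\chi_{n-1}^2.$$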
The number n−1 is not random decoration. The vector
$$(X_1-\mu,\dots,X_n-\mu)$$
lives in an $n$-dimensional space, but the sample mean uses the special direction
$$\operatorname{span}\{(1,\dots,1)\}.$$
After subtracting $\bar X$, the residual vector
$$(X_1-\bar X,\dots,X_n-\bar X)$$
is orthogonal to that direction, so it lives in an $(n-1)$-dimensional subspace. Its squared length, after dividing by $\sigma^2$, has the $\chi_{n-1}^2$ distribution.
Two Gaussian facts are being used here. First, a standard normal vector is unchanged in distribution under orthogonal rotations. Second, for Gaussian vectors, orthogonal components are independent, not merely uncorrelated. The second fact is special to the Gaussian setting.
So the geometric source of n−1 is simple: estimating the mean uses one direction in the sample space.
2. What changes in p dimensions?
Now replace each scalar observation by a p-dimensional vector. We often use p for dimension, especially in high-dimensional problems. Let
$$Y_1,\dots,Y_\nu\overset{\text{iid}}{\sim}N_p(0,\Sigma).$$
For vectors, the natural analogue of a square is not $Y_i^2$, but the outer product
$$Y_iY_i^\top.$$
Thus the matrix version of a sum of squares is
$$W=\sum_{i=1}^{\nu}Y_iY_i^\top.$$
We write
$$W\sim W_p(\Sigma,\nu),$$
and say that $W$ has a Wishart distribution with scale matrix $\Sigma$ and $\nu$ degrees of freedom.
This is the same idea as the $\chi_\nu^2$ distribution. When $p=1$, each $Y_i$ is just a scalar with $Y_i\sim N(0,\sigma^2)$, so
$$W=\sum_{i=1}^{\nu}Y_i^2=\sigma^2\sum_{i=1}^{\nu}Z_i^2\sim\sigma^2\chi_\nu^2.$$
In this sense, the Wishart distribution is a matrix-valued extension of the $\chi^2$ distribution.
For example, when $p=2$,
$$W=\begin{pmatrix}\sum_iY_{i1}^2 & \sum_iY_{i1}Y_{i2}\\ \sum_iY_{i1}Y_{i2} & \sum_iY_{i2}^2\end{pmatrix}.$$
The diagonal entries record sums of squares in each coordinate. The off-diagonal entries record cross-products between coordinates. A $\chi^2$ variable keeps only length; a Wishart matrix also keeps the relationships between directions.
3. Main theorem: the sample covariance matrix is Wishart
Let
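$$X_1,\dots,X_n\overset{\text{iid}}{\sim}N_p(\mu,\Sigma),$$
and define the sample mean vector and sample covariance matrix by
$$\bar X=\frac1n\sum_{i=1}^nX_i,\qquad S=\frac1{n-1}\sum_{i=1}^n(X_i-\bar X)(X_i-\bar X)^\top.$$
Then
$$(n-1)S\sim W_p(\Sigma,n-1),\qquad\bar X\perp\!\!\!\perp S.$$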
If n−1<p, this distribution is singular: the rank of (n−1)S is at most n−1, so the matrix cannot be positive definite. The construction above still makes sense. The usual density on the positive definite cone only applies when the degrees of freedom are large enough, typically ν>p−1.
This is exactly the multivariate version of
$$\bar X\perp\!\!\!\perp s^2,\qquad\frac{(n-1)s^2}{\sigma^2}\sim\chi_{n-1}^2.$$
In one dimension, after removing the sample mean, the residual sum of squares is $\chi^2$ (up to the factor $\sigma^2$). In several dimensions, after removing the sample mean vector, the residual sum of outer products is Wishart.
One-dimensional normal sample | Multivariate normal sample
square $(X_i-\bar X)^2$ | outer product $(X_i-\bar X)(X_i-\bar X)^\top$
sum of squares | sum of outer products
$\chi_{n-1}^2$ | $W_p(\Sigma,n-1)$
$\bar X\perp\!\!\!\perp s^2$ | $\bar X\perp\!\!\!\perp S$
4. Proof: rotate the sample space
We do not start from the Wishart density. The density is useful, but if it is the first thing you see, Wishart may look like a pile of determinants and traces. Orthogonal decomposition is a better first view.
Put the data into an n×p matrix
$$X=\begin{pmatrix}X_1^\top\\ \vdots\\ X_n^\top\end{pmatrix},\qquad1_n=(1,\dots,1)^\top.$$
Choose an $n\times n$ orthogonal matrix $H$ whose first row is
$$\frac1{\sqrt n}1_n^\top.$$
Define the standardized data matrix
$$Z=(X-1_n\mu^\top)\Sigma^{-1/2}.$$
The rows of $Z$ are independent $N_p(0,I_p)$ random vectors. Left multiplication by an orthogonal matrix only rotates the sample-index direction, so
$$U=HZ$$
again has independent $N_p(0,I_p)$ rows. Write the $j$-th row as $u_j^\top$, where $u_j\in\mathbb R^p$.
The first row is exactly the mean direction:
$$u_1^\top=\frac1{\sqrt n}1_n^\top Z=\sqrt n\,(\bar X-\mu)^\top\Sigma^{-1/2}.$$
Thus $u_1$ contains the information in $\bar X$.
The remaining rows $u_2^\top,\dots,u_n^\top$ are residual directions. Let
$$P_0=\frac1n1_n1_n^\top,\qquad P_1=I_n-P_0.$$
Here $P_0$ is the projection onto the mean direction, and $P_1$ is the projection onto its orthogonal complement. Since the first row of $H$ is $1_n^\top/\sqrt n$,
The third line uses P11n=0: the residual projection removes any constant mean direction.
The last line is a sum of $n-1$ independent outer products of $N_p(0,\Sigma)$ vectors. Therefore
$$(n-1)S\sim W_p(\Sigma,n-1).$$
Also, $\bar X$ depends only on $u_1$, while $S$ depends only on $u_2,\dots,u_n$. These vectors are independent, so
$$\bar X\perp\!\!\!\perp S.$$
This already contains the main idea of Cochran's theorem. Here is the same argument in the projection-matrix form used in multivariate statistics.
Why the degrees of freedom are n-1, not n-p
Subtracting Xˉ removes one direction in the sample-index space, namely (1,…,1). It does not remove p directions. Each remaining residual direction is still a full p-dimensional vector. Hence the Wishart degrees of freedom are n−1.
5. Cochran's theorem: a projection still gives Wishart
The more general form is this. Above we only projected away the mean direction. Cochran's theorem says that any symmetric idempotent projection cuts out a Wishart piece from a normal data matrix.
Theorem: Cochran's theorem
Let
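$$z_1,\dots,z_m\overset{\text{iid}}{\sim}N_p(0,\Sigma),\qquad Z=\begin{pmatrix}z_1^\top\\ \vdots\\ z_m^\top\end{pmatrix}.$$
If $P$ is an $m\times m$ symmetric idempotent matrix and $r=\operatorname{rank}(P)$, then
$$Z^\top PZ\sim W_p(\Sigma,r),\qquad Z^\top(I_m-P)Z\sim W_p(\Sigma,m-r),$$
and the two random matrices are independent.
More generally, if $P_1,\dots,P_k$ are pairwise orthogonal symmetric idempotent matrices and $\sum_{a=1}^kP_a=I_m$, then
$$Z^\top P_aZ\sim W_p(\Sigma,\operatorname{rank}(P_a)),\qquad a=1,\dots,k,$$
and these matrices are independent.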
Proof
It is enough to prove the statement for one projection P. Since P is symmetric and idempotent, it is the orthogonal projection onto an r-dimensional subspace. Hence there is an orthogonal matrix H such that
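$$P=H^\top\begin{pmatrix}I_r & 0\\ 0 & 0\end{pmatrix}H,\qquad I_m-P=H^\top\begin{pmatrix}0 & 0\\ 0 & I_{m-r}\end{pmatrix}H.$$
Set $Y=HZ$. Left multiplication by an orthogonal matrix only rotates the sample-index direction, so the rows of $Y$ are still independent $N_p(0,\Sigma)$ vectors. Split the rows as
$$Y=\begin{pmatrix}Y_1\\ Y_2\end{pmatrix},\qquad Y_1\in\mathbb R^{r\times p},\quad Y_2\in\mathbb R^{(m-r)\times p}.$$
Then
$$Z^\top PZ=Y_1^\top Y_1,\qquad Z^\top(I_m-P)Z=Y_2^\top Y_2.$$
The blocks $Y_1$ and $Y_2$ use disjoint normal rows, so they are independent. By the definition of the Wishart distribution, $Y_1^\top Y_1\sim W_p(\Sigma,r)$ and $Y_2^\top Y_2\sim W_p(\Sigma,m-r)$.
For the sample covariance matrix, take
$$P=I_n-\frac1n1_n1_n^\top.$$
This is a projection matrix with rank $n-1$, and $P1_n=0$. Therefore
$$(n-1)S=X^\top PX=(X-1_n\mu^\top)^\top P(X-1_n\mu^\top)\sim W_p(\Sigma,n-1).$$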
This gives a compact proof that the sample covariance matrix has a Wishart distribution. It is shorter than rotating the sample space by hand, but the geometry is the same: the projection splits the sample-index space, and each piece contributes a sum of outer products.
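The algebraic identity behind this step, $\sum_i(X_i-\bar X)(X_i-\bar X)^\top=X^\top PX$ with $P=I_n-\frac1n1_n1_n^\top$, holds for any data matrix and can be verified on a small example. The sketch below uses plain nested lists and made-up data; it checks the identity only, not the distributional claim.

```python
def centered_outer_sum(rows):
    """sum_i (x_i - xbar)(x_i - xbar)^T for a list of p-dimensional rows."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) for b in range(p)]
        for a in range(p)
    ]

def centering_projection(n):
    """P = I_n - (1/n) 1 1^T, the projection that removes the mean direction."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

def projection_form(rows):
    """X^T P X computed entry by entry."""
    n, p = len(rows), len(rows[0])
    P = centering_projection(n)
    return [
        [
            sum(rows[i][a] * P[i][j] * rows[j][b] for i in range(n) for j in range(n))
            for b in range(p)
        ]
        for a in range(p)
    ]

data = [[1.0, 2.0], [3.0, 0.0], [5.0, 7.0]]  # n = 3 observations, p = 2
print(centered_outer_sum(data))
print(projection_form(data))
```

The trace of `centering_projection(n)` is $n-1$, which is the rank of the projection and the Wishart degrees of freedom.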
6. A note on the Wishart density
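If $\nu>p-1$ and $W$ is positive definite, the Wishart density is
$$f(W)=\frac{|W|^{(\nu-p-1)/2}\exp\left(-\frac12\operatorname{tr}(\Sigma^{-1}W)\right)}{2^{\nu p/2}\,|\Sigma|^{\nu/2}\,\Gamma_p(\nu/2)},$$
where $\Gamma_p$ denotes the multivariate gamma function.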
This formula is useful, but it is not the friendliest first encounter with Wishart. Its derivation needs a Jacobian calculation on the cone of positive definite matrices, which is rather technical, so we leave it aside here.
Summary
In a one-dimensional normal sample, projecting away the mean direction leaves $n-1$ independent Gaussian residual directions, and their squared length gives $\chi_{n-1}^2$. The Wishart theorem is the vector version of the same statement: square becomes outer product, variance becomes covariance matrix, and $\bar X\perp\!\!\!\perp s^2$ becomes $\bar X\perp\!\!\!\perp S$. Cochran's theorem replaces the mean direction by a general orthogonal projection.
End-of-chapter check
The original problems and solutions in this chapter come from the corresponding TeX source files.
You can first read only the problem boxes, write down the main identities, and then open the proof or solution.
If a conclusion uses independence, countable additivity, a change-of-variables formula, or a moment condition, it is worth marking that point explicitly.