Fifth recitation

Jieyang Hu

Reading guide

This chapter moves into convergence theory, the strong law of large numbers, characteristic functions, the central limit theorem, and Stein's method.
The four modes of convergence give the basic map: almost sure (a.s.), in $L^p$, in probability ($P$), and in distribution ($D$).
In each proof, keep track of where independence, moment assumptions, truncation, or Borel-Cantelli is used.

Tip. Whenever a limiting distribution appears, first ask what kind of limit it is: in probability, in distribution, or almost surely.

Exercise 4.2

Note

Keep the four modes of convergence separate: a.s., $L^p$ , $P$ , and $D$ . When you see an arrow, first identify which mode it means.

Problem: 4.2.1

Prove the following two inequalities.

(1) (Lyapunov inequality) If $0<r<s$ , then

\bigl(\mathbb{E}[|X|^r]\bigr)^{1/r} \leq \bigl(\mathbb{E}[|X|^s]\bigr)^{1/s}.

(2) ( $C_r$ inequality) If $r>0$ , then

\mathbb{E}[|X+Y|^r] \leq C_r\bigl(\mathbb{E}[|X|^r]+\mathbb{E}[|Y|^r]\bigr),

where

C_r= \begin{cases} 1, & 0<r<1,\\ 2^{r-1}, & r\geq 1. \end{cases}

Proof

(1) Let $\alpha=r/s\in(0,1)$ . Since $x\mapsto x^\alpha$ is concave on $[0,\infty)$ , Jensen's inequality gives

\mathbb{E}[|X|^r] =\mathbb{E}\bigl[(|X|^s)^\alpha\bigr] \leq \bigl(\mathbb{E}[|X|^s]\bigr)^\alpha.

Taking the power $1/r$ on both sides gives

\bigl(\mathbb{E}[|X|^r]\bigr)^{1/r} \leq \bigl(\mathbb{E}[|X|^s]\bigr)^{1/s}.

(2) If $0<r<1$, then for all $a,b\geq 0$,

(a+b)^r\leq a^r+b^r.

Therefore

|X+Y|^r\leq (|X|+|Y|)^r\leq |X|^r+|Y|^r.

Taking expectations gives

\mathbb{E}[|X+Y|^r]\leq \mathbb{E}[|X|^r]+\mathbb{E}[|Y|^r].

If $r\geq 1$, then by convexity of $x\mapsto x^r$ on $[0,\infty)$, for all $a,b\geq 0$,

(a+b)^r =2^r\left(\frac{a+b}{2}\right)^r \leq 2^{r-1}(a^r+b^r).

Hence

|X+Y|^r\leq 2^{r-1}\bigl(|X|^r+|Y|^r\bigr),

and the result follows after taking expectations.
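
Both inequalities can be sanity-checked numerically. The sketch below is only an illustration: the four-point distribution and the grid of test values are hypothetical choices, not part of the problem. It computes $(\mathbb{E}|X|^r)^{1/r}$ for increasing $r$ and the pointwise gap $C_r(|a|^r+|b|^r)-|a+b|^r$.

```python
import itertools

def lp_norm(values, probs, r):
    """(E|X|^r)^(1/r) for a finite discrete distribution."""
    return sum(p * abs(v) ** r for v, p in zip(values, probs)) ** (1.0 / r)

def c_r(r):
    """Constant C_r from the C_r inequality."""
    return 1.0 if r < 1 else 2.0 ** (r - 1)

def c_r_gap(a, b, r):
    """C_r (|a|^r + |b|^r) - |a + b|^r, which should be nonnegative."""
    return c_r(r) * (abs(a) ** r + abs(b) ** r) - abs(a + b) ** r

# Hypothetical example distribution (an assumption for illustration only).
values = [-3.0, 0.5, 2.0, 7.0]
probs = [0.1, 0.4, 0.3, 0.2]

# Lyapunov: the map r -> (E|X|^r)^(1/r) is nondecreasing.
norms = [lp_norm(values, probs, r) for r in (0.5, 1.0, 2.0, 4.0)]

# C_r inequality, checked pointwise on a grid of (a, b) and a few exponents.
grid = [x / 4 for x in range(-20, 21)]
min_gap = min(
    c_r_gap(a, b, r)
    for a, b in itertools.product(grid, grid)
    for r in (0.3, 1.0, 2.5)
)

print(norms)
print(min_gap)
```

The gap is zero in the equality cases (for example $a=b$ when $r\geq 1$), so a value of `min_gap` at rounding level is expected.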

Problem: 4.2.2

Let $\{X_n\}$ be a sequence of random variables, and let $\{c_n\}$ be a real sequence with $c_n\to c$ . Prove, under each of the four modes of convergence a.s., $L^p$ , in probability, and in distribution, that

X_n\to X \Longrightarrow c_nX_n\to cX.

Proof

If $X_n\xrightarrow{\text{a.s.}}X$ , then for almost every $\omega$ ,

c_nX_n(\omega)\to cX(\omega).

Thus $c_nX_n\xrightarrow{\text{a.s.}}cX$ .

If $X_n\xrightarrow{L^p}X$ , then $X\in L^p$ , and $\{c_n\}$ is bounded. By the $C_r$ inequality, for a constant $C_p$ depending only on $p$ ,

|c_nX_n-cX|^p \leq C_p\bigl(|c_n|^p|X_n-X|^p+|c_n-c|^p|X|^p\bigr).

Taking expectations gives

\mathbb{E}[|c_nX_n-cX|^p]\to 0,

so $c_nX_n\xrightarrow{L^p}cX$ .

If $X_n\xrightarrow{P}X$ , then

c_nX_n-cX=c_n(X_n-X)+(c_n-c)X.

Since $\{c_n\}$ is bounded, the first term converges to $0$ in probability. The second term converges to $0$ a.s., hence also in probability. Therefore

c_nX_n\xrightarrow{P}cX.

If $X_n\xrightarrow{D}X$ , regard $c_n$ as a constant random variable. Then $c_n\xrightarrow{P}c$ , and Slutsky's theorem gives

c_nX_n\xrightarrow{D}cX.

Problem: 4.2.3

Prove that, as $n\to\infty$ ,

X_n\xrightarrow{P}0 \quad\Longleftrightarrow\quad \mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}\right]\to 0.

Proof

Suppose $X_n\xrightarrow{P}0$ . For every $\varepsilon>0$ ,

\begin{aligned} \mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}\right] &= \mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}; |X_n|<\varepsilon\right] + \mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}; |X_n|\geq \varepsilon\right] \\ &\leq \varepsilon+\mathbb{P}(|X_n|\geq \varepsilon). \end{aligned}

Letting $n\to\infty$ gives

\limsup_{n\to\infty}\mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}\right]\leq \varepsilon.

Then let $\varepsilon\downarrow 0$ . Hence

\mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}\right]\to 0.

Conversely, if

\mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}\right]\to 0,

then for every $\varepsilon>0$ ,

\mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}\right] \geq \mathbb{E}\!\left[\frac{|X_n|}{1+|X_n|}; |X_n|\geq \varepsilon\right] \geq \frac{\varepsilon}{1+\varepsilon}\mathbb{P}(|X_n|\geq \varepsilon).

Thus $\mathbb{P}(|X_n|\geq \varepsilon)\to 0$ , so $X_n\xrightarrow{P}0$ .
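
The bounded functional $\mathbb{E}[|X_n|/(1+|X_n|)]$ detects convergence in probability even when no moment converges. As a hedged illustration, take the hypothetical family with $\mathbb{P}(X_n=n)=1/n$ and $\mathbb{P}(X_n=0)=1-1/n$ (this example is an assumption, not from the problem): then $\mathbb{E}|X_n|=1$ for all $n$, while the functional equals $1/(n+1)$.

```python
def bounded_metric(values, probs):
    """E[|X| / (1 + |X|)] for a finite discrete distribution."""
    return sum(p * abs(v) / (1.0 + abs(v)) for v, p in zip(values, probs))

def example_law(n):
    """Hypothetical X_n: equals n with probability 1/n, and 0 otherwise."""
    return [float(n), 0.0], [1.0 / n, 1.0 - 1.0 / n]

def mean_abs(values, probs):
    """E|X| for a finite discrete distribution."""
    return sum(p * abs(v) for v, p in zip(values, probs))

for n in (10, 100, 1000):
    vals, probs = example_law(n)
    # The first quantity tends to 0; the second stays equal to 1.
    print(n, bounded_metric(vals, probs), mean_abs(vals, probs))
```

So $X_n\xrightarrow{P}0$ here although $X_n$ does not converge to $0$ in $L^1$.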

Problem: 4.2.4

Let random variable sequences $\{X_n\}$ and $\{Y_n\}$ satisfy $X_n\xrightarrow{D}X$ and $Y_n\xrightarrow{P}c$ , where $X$ is a random variable and $c$ is a constant. Prove:

(1) $X_n+Y_n\xrightarrow{D}X+c$.

(2) $X_nY_n\xrightarrow{D}cX$, and if $c\neq 0$, then

\frac{X_n}{Y_n}\xrightarrow{D}\frac{X}{c}.

Proof

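The first statement and the first part of the second statement are special cases of Slutsky's theorem, whose full form is $X_nY_n+Z_n\xrightarrow{D}bX+c$ whenever $X_n\xrightarrow{D}X$, $Y_n\xrightarrow{P}b$, and $Z_n\xrightarrow{P}c$. For the sum, take the multiplier sequence identically $1$ and the additive sequence equal to $Y_n$; for the product, take the additive sequence identically $0$. Hence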

X_n+Y_n\xrightarrow{D}X+c,\qquad X_nY_n\xrightarrow{D}cX.

If $c\neq 0$ , the function $x\mapsto 1/x$ is continuous at $c$ , so

\frac{1}{Y_n}\xrightarrow{P}\frac{1}{c}.

Applying Slutsky's theorem to $\{X_n\}$ and $\{1/Y_n\}$ gives

\frac{X_n}{Y_n}=X_n\cdot\frac{1}{Y_n}\xrightarrow{D}\frac{X}{c}.

The full form of Slutsky's theorem is as follows. Suppose

X_n\xrightarrow{D}X,\qquad Y_n\xrightarrow{P}b,\qquad Z_n\xrightarrow{P}c,

where $X$ is a random variable and $b,c$ are constants. Then

X_nY_n+Z_n\xrightarrow{D}bX+c.

In particular,

X_n+Z_n\xrightarrow{D}X+c,\qquad X_nY_n\xrightarrow{D}bX,

and if $b\neq 0$ ,

\frac{X_n}{Y_n}\xrightarrow{D}\frac{X}{b}.

Proof

First prove a standard lemma: if

U_n-V_n\xrightarrow{P}0,\qquad V_n\xrightarrow{D}V,

then $U_n\xrightarrow{D}V$ .

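Indeed, let $x$ be a continuity point of the distribution function $F_V$ of $V$, and let $\varepsilon>0$ be such that $x\pm\varepsilon$ are also continuity points of $F_V$. Since $F_V$ has at most countably many discontinuities, such $\varepsilon$ can be chosen arbitrarily small. Then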

\{V_n\leq x-\varepsilon\}\cap\{|U_n-V_n|\leq\varepsilon\} \subset \{U_n\leq x\},

and

\{U_n\leq x\} \subset \{V_n\leq x+\varepsilon\}\cup\{|U_n-V_n|>\varepsilon\}.

Therefore

\mathbb{P}(V_n\leq x-\varepsilon)-\mathbb{P}(|U_n-V_n|>\varepsilon) \leq \mathbb{P}(U_n\leq x),

and

\mathbb{P}(U_n\leq x) \leq \mathbb{P}(V_n\leq x+\varepsilon)+\mathbb{P}(|U_n-V_n|>\varepsilon).

Letting $n\to\infty$ gives

F_V(x-\varepsilon)\leq \liminf_{n\to\infty}\mathbb{P}(U_n\leq x) \leq \limsup_{n\to\infty}\mathbb{P}(U_n\leq x) \leq F_V(x+\varepsilon).

Letting $\varepsilon\downarrow 0$ along values for which $x\pm\varepsilon$ are continuity points of $F_V$, and using continuity at $x$, gives

\mathbb{P}(U_n\leq x)\to F_V(x).

Thus $U_n\xrightarrow{D}V$ .

Now prove Slutsky's theorem. For addition, the continuous mapping theorem gives

X_n+c\xrightarrow{D}X+c.

Since

(X_n+Z_n)-(X_n+c)=Z_n-c\xrightarrow{P}0,

the lemma gives

X_n+Z_n\xrightarrow{D}X+c.

For multiplication, $X_n\xrightarrow{D}X$ implies that $\{X_n\}$ is tight. Thus for every $\eta>0$ there exists $M>0$ such that, for all large $n$,

\mathbb{P}(|X_n|>M)<\eta.

Hence

\mathbb{P}(|X_n(Y_n-b)|>\varepsilon) \leq \mathbb{P}(|X_n|>M) + \mathbb{P}\!\left(|Y_n-b|>\frac{\varepsilon}{M}\right).

For each fixed $\varepsilon>0$, letting $n\to\infty$ and then $\eta\downarrow 0$ gives

X_n(Y_n-b)\xrightarrow{P}0.

Also, by the continuous mapping theorem,

bX_n\xrightarrow{D}bX.

Since

X_nY_n-bX_n=X_n(Y_n-b)\xrightarrow{P}0,

the lemma gives

X_nY_n\xrightarrow{D}bX.

Finally, apply the addition part to $X_nY_n$ and $Z_n$ to get

X_nY_n+Z_n\xrightarrow{D}bX+c.

If $b\neq 0$ , then $x\mapsto 1/x$ is continuous at $b$ , so

\frac{1}{Y_n}\xrightarrow{P}\frac{1}{b}.

The multiplication part applied to $X_n$ and $1/Y_n$ gives

\frac{X_n}{Y_n}=X_n\cdot\frac{1}{Y_n}\xrightarrow{D}\frac{X}{b}.

Exercise 4.3

Note

Borel-Cantelli, subsequence arguments, and extreme-value estimates often appear together. Almost sure conclusions usually come from building summable bad events.

Problem: 4.3.1

Let $\{X_n\}$ be independent standard normal random variables. Use the standard normal tail estimate from Chapter 3, Problem 14(1), to prove

\mathbb{P}\!\left(\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}=\sqrt{2}\right)=1.

Proof

For $a>0$ , define

A_n(a)=\left\{X_n\geq \sqrt{2a\log n}\right\}.

By the normal tail estimate, there exist constants $C_1,C_2>0$ such that, for all large $n$ ,

C_1\frac{n^{-a}}{\sqrt{\log n}} \leq \mathbb{P}(A_n(a)) \leq C_2\frac{n^{-a}}{\sqrt{\log n}}.

If $0<a<1$ , then

\sum_{n=2}^{\infty}\mathbb{P}(A_n(a))=\infty.

The events $A_n(a)$ are independent, so the second Borel-Cantelli lemma gives

\mathbb{P}(A_n(a)\ \text{i.o.})=1.

Thus

\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}\geq \sqrt{2a} \qquad\text{a.s.}

If $a>1$ , then

\sum_{n=2}^{\infty}\mathbb{P}(A_n(a))<\infty.

By the first Borel-Cantelli lemma,

\mathbb{P}(A_n(a)\ \text{i.o.})=0.

Hence

\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}\leq \sqrt{2a} \qquad\text{a.s.}

Therefore, for every $0<a<1<b$ ,

\sqrt{2a} \leq \limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}} \leq \sqrt{2b} \qquad\text{a.s.}

Letting $a\uparrow 1$ and $b\downarrow 1$ gives

\limsup_{n\to\infty}\frac{X_n}{\sqrt{\log n}}=\sqrt{2} \qquad\text{a.s.}
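
The tail estimate from Chapter 3 is not reproduced here. The sketch below assumes it has the usual Mills-ratio form $\frac{x}{1+x^2}\varphi(x)\leq \mathbb{P}(X\geq x)\leq \frac{\varphi(x)}{x}$ for $x>0$, where $\varphi$ is the standard normal density, and checks it against the exact tail. It also evaluates $n^{a}\sqrt{\log n}\,\mathbb{P}(A_n(a))$, which under that estimate stays bounded away from $0$ and $\infty$ (its limit is $1/(2\sqrt{\pi a})$).

```python
import math

def normal_tail(x):
    """P(X >= x) for X ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def density(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mills_bounds(x):
    """Assumed form of the tail estimate: (lower, upper) bounds for x > 0."""
    return x / (1.0 + x * x) * density(x), density(x) / x

def scaled_tail(n, a):
    """n^a * sqrt(log n) * P(X_n >= sqrt(2 a log n))."""
    x = math.sqrt(2.0 * a * math.log(n))
    return n**a * math.sqrt(math.log(n)) * normal_tail(x)

for n in (10**2, 10**4, 10**6):
    # Compare with the limiting constant 1 / (2 sqrt(pi a)) for a = 1.
    print(n, scaled_tail(n, 1.0), 1.0 / (2.0 * math.sqrt(math.pi)))
```

With $\mathbb{P}(A_n(a))$ comparable to $n^{-a}/\sqrt{\log n}$, the series over $n$ diverges for $0<a<1$ and converges for $a>1$, which is exactly the dichotomy used in the proof.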

Problem: 4.3.6

Let $X_1,\cdots,X_n$ be i.i.d. uniform random variables on $[0,a]$ , where $a>0$ . Set

M_n=\max\{X_1,\cdots,X_n\}.

Prove that $M_n\to a$ both a.s. and in $L^p$ as $n\to\infty$ .

Proof

For every $0<\varepsilon<a$ ,

\mathbb{P}(|M_n-a|>\varepsilon) =\mathbb{P}(M_n<a-\varepsilon) =\left(\frac{a-\varepsilon}{a}\right)^n.

Since

\sum_{n=1}^{\infty}\left(\frac{a-\varepsilon}{a}\right)^n<\infty,

the first Borel-Cantelli lemma implies that, almost surely, the event

|M_n-a|>\varepsilon

occurs for only finitely many $n$. Taking a countable intersection over positive rational $\varepsilon$ gives

M_n\xrightarrow{\text{a.s.}}a.

Since $0\leq M_n\leq a$ ,

|M_n-a|^p\leq a^p.

Together with almost sure convergence, the dominated convergence theorem gives

\mathbb{E}[|M_n-a|^p]\to 0.

Thus

M_n\xrightarrow{L^p}a.

Problem: 4.3.7

Suppose $X_n\xrightarrow{P}X$ . Prove that there exists a subsequence $\{X_{n_k}\}$ such that

X_{n_k}\xrightarrow{\text{a.s.}}X.

Proof

Set $n_0=0$. Since $X_n\xrightarrow{P}X$, for each $k\geq 1$ we may choose $n_k>n_{k-1}$ such that

\mathbb{P}(|X_{n_k}-X|>2^{-k})<2^{-k}.

Then

\sum_{k=1}^{\infty}\mathbb{P}(|X_{n_k}-X|>2^{-k})<\infty.

By the first Borel-Cantelli lemma, almost surely only finitely many of the events

|X_{n_k}-X|>2^{-k}

occur. Hence, for almost every $\omega$, there exists $K(\omega)$ such that for all $k\geq K(\omega)$,

|X_{n_k}(\omega)-X(\omega)|\leq 2^{-k}.

Therefore $X_{n_k}(\omega)\to X(\omega)$ , and

X_{n_k}\xrightarrow{\text{a.s.}}X.

Problem: 4.3.8

(1) Let $\{X_n\}$ be independent real-valued random variables with $X_n\xrightarrow{P}0$, and let $\{a_n\}$ be a positive increasing sequence with $a_n\to+\infty$. Must we have

\frac{X_n}{a_n}\xrightarrow{\text{a.s.}}0?

(2) Let $\{X_n\}$ be any sequence of real-valued random variables. Construct positive numbers $\{c_n\}$ such that

\frac{X_n}{c_n}\xrightarrow{\text{a.s.}}0.

Proof

(1) The conclusion need not hold. For the given sequence $\{a_n\}$ , define independent random variables by

\mathbb{P}(X_n=a_n)=\frac{1}{n+1}, \qquad \mathbb{P}(X_n=0)=1-\frac{1}{n+1}.

Since $a_n\to\infty$ , for every $\varepsilon>0$ and all large $n$ , $a_n>\varepsilon$ . Hence

\mathbb{P}(|X_n|>\varepsilon)=\frac{1}{n+1}\to 0,

so $X_n\xrightarrow{P}0$ . But

\mathbb{P}\!\left(\frac{X_n}{a_n}=1\right)=\frac{1}{n+1}, \qquad \sum_{n=1}^{\infty}\frac{1}{n+1}=\infty.

Since the events $\{X_n=a_n\}$ are independent, the second Borel-Cantelli lemma shows that, almost surely,

\frac{X_n}{a_n}=1

occurs infinitely often. Thus $X_n/a_n$ does not converge to $0$ a.s.

(2) For each $n$, since $\mathbb{P}(|X_n|>t)\downarrow 0$ as $t\to\infty$, choose $c_n>0$ such that

\mathbb{P}(|X_n|>2^{-n}c_n)<2^{-n}.

Let

A_n=\{|X_n|>2^{-n}c_n\}.

Then

\sum_{n=1}^{\infty}\mathbb{P}(A_n)<\infty.

By the first Borel-Cantelli lemma, $A_n$ occurs only finitely many times. Hence, almost surely, there exists $N(\omega)$ such that for all $n\geq N(\omega)$ ,

\left|\frac{X_n}{c_n}\right|\leq 2^{-n}.

Therefore

\frac{X_n}{c_n}\xrightarrow{\text{a.s.}}0.

Exercise 4.4

Note

Proofs of the strong law often use truncation, fourth moments, or Borel-Cantelli. Watch which moment condition controls which tail event.

Problem: 4.4.1

Let $\{X_n\}$ be nonnegative i.i.d. random variables with $\mathbb{E}[X_1]=+\infty$ . Prove that

\frac1n\sum_{k=1}^{n}X_k\xrightarrow{\text{a.s.}}+\infty.

Proof

For each $M>0$ , let

Y_k^{(M)}=X_k\wedge M.

Then $\{Y_k^{(M)}\}$ is still a nonnegative i.i.d. sequence, and $\mathbb{E}[Y_1^{(M)}]<\infty$ . By the strong law of large numbers,

\frac1n\sum_{k=1}^{n}Y_k^{(M)} \xrightarrow{\text{a.s.}} \mathbb{E}[Y_1^{(M)}].

Since $X_k\geq Y_k^{(M)}$ ,

\liminf_{n\to\infty}\frac1n\sum_{k=1}^{n}X_k \geq \mathbb{E}[Y_1^{(M)}] \qquad\text{a.s.}

Because $Y_1^{(M)}\uparrow X_1$ , the monotone convergence theorem gives

\mathbb{E}[Y_1^{(M)}]\uparrow \mathbb{E}[X_1]=+\infty.

Thus for every $L>0$ , we can choose $M$ so that $\mathbb{E}[Y_1^{(M)}]\geq L$ . Hence

\liminf_{n\to\infty}\frac1n\sum_{k=1}^{n}X_k\geq L \qquad\text{a.s.}

Since $L$ is arbitrary,

\frac1n\sum_{k=1}^{n}X_k\xrightarrow{\text{a.s.}}+\infty.

Problem: 4.4.2

(Weierstrass approximation theorem) Let $f:[0,1]\to\mathbb{R}$ be continuous. By considering $S_n\sim B(n,x)$, prove that

\lim_{n\to+\infty}\sup_{0\leq x\leq 1} \left| f(x)-\sum_{k=0}^{n}f\!\left(\frac{k}{n}\right)\binom{n}{k}x^k(1-x)^{n-k} \right|=0.

Proof

For fixed $x\in[0,1]$ , let $S_n\sim B(n,x)$ . Then

\mathbb{P}(S_n=k)=\binom{n}{k}x^k(1-x)^{n-k},

\sum_{k=0}^{n}f\!\left(\frac{k}{n}\right)\binom{n}{k}x^k(1-x)^{n-k} = \mathbb{E}\!\left[f\!\left(\frac{S_n}{n}\right)\right].

It is enough to prove

\sup_{0\leq x\leq 1} \left| \mathbb{E}\!\left[f\!\left(\frac{S_n}{n}\right)\right]-f(x) \right|\to 0.

Since $f$ is continuous on $[0,1]$ , it is uniformly continuous. Given $\varepsilon>0$ , choose $\delta>0$ such that $|u-v|<\delta$ implies

|f(u)-f(v)|<\varepsilon.

Let

M=\sup_{0\leq y\leq 1}|f(y)|.

Then

\begin{aligned} \left|\mathbb{E}\!\left[f\!\left(\frac{S_n}{n}\right)\right]-f(x)\right| &\leq \mathbb{E}\!\left[\left|f\!\left(\frac{S_n}{n}\right)-f(x)\right|; \left|\frac{S_n}{n}-x\right|<\delta\right] \\ &\quad+ \mathbb{E}\!\left[\left|f\!\left(\frac{S_n}{n}\right)-f(x)\right|; \left|\frac{S_n}{n}-x\right|\geq\delta\right] \\ &\leq \varepsilon+2M\mathbb{P}\!\left(\left|\frac{S_n}{n}-x\right|\geq\delta\right). \end{aligned}

By Chebyshev's inequality,

\mathbb{P}\!\left(\left|\frac{S_n}{n}-x\right|\geq\delta\right) \leq \frac{\operatorname{Var}(S_n/n)}{\delta^2} = \frac{x(1-x)}{n\delta^2} \leq \frac{1}{4n\delta^2}.

Thus

\sup_{0\leq x\leq 1} \left| \mathbb{E}\!\left[f\!\left(\frac{S_n}{n}\right)\right]-f(x) \right| \leq \varepsilon+\frac{M}{2n\delta^2}.

Letting $n\to\infty$ and then $\varepsilon\downarrow 0$ proves the result.
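
The polynomial $\mathbb{E}[f(S_n/n)]$ in the proof is the $n$-th Bernstein polynomial of $f$. A minimal sketch of the approximation follows; the test function $f(x)=|x-\tfrac12|$ and the evaluation grid are hypothetical choices made only for illustration.

```python
import math

def bernstein(f, n, x):
    """E[f(S_n / n)] for S_n ~ B(n, x): the n-th Bernstein polynomial of f at x."""
    return sum(
        f(k / n) * math.comb(n, k) * x**k * (1.0 - x) ** (n - k)
        for k in range(n + 1)
    )

def sup_error(f, n, grid_size=201):
    """Maximum of |B_n f - f| over an evenly spaced grid in [0, 1]."""
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in xs)

def f(x):
    # Hypothetical test function: continuous, not differentiable at 1/2.
    return abs(x - 0.5)

errors = {n: sup_error(f, n) for n in (10, 40, 160)}
print(errors)
```

For this $f$ the grid error shrinks slowly, roughly like $n^{-1/2}$, which is consistent with the Chebyshev-based bound $\varepsilon+M/(2n\delta^2)$ being far from sharp but sufficient for uniform convergence.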

Problem: 4.4.3

Let $X_1,\cdots,X_n$ be i.i.d. random variables with $\mathbb{E}[X_1]=0$ and $\mathbb{E}[X_1^4]<\infty$ . Without using the strong law directly, prove that

\frac1n\sum_{k=1}^{n}X_k\xrightarrow{\text{a.s.}}0.

Proof

Set

S_n=\sum_{k=1}^{n}X_k.

Since $\mathbb{E}[X_1]=0$ , expanding the fourth moment and using independence gives

\mathbb{E}[S_n^4] = n\mathbb{E}[X_1^4] + 6\binom{n}{2}\bigl(\mathbb{E}[X_1^2]\bigr)^2 =O(n^2).

Thus there is a constant $C>0$ such that, for all $n$ ,

\mathbb{E}[S_n^4]\leq Cn^2.

By Markov's inequality,

\mathbb{P}(|S_n|>n\varepsilon) \leq \frac{\mathbb{E}[S_n^4]}{n^4\varepsilon^4} \leq \frac{C}{n^2\varepsilon^4}.

Therefore

\sum_{n=1}^{\infty}\mathbb{P}(|S_n|>n\varepsilon)<\infty.

By the first Borel-Cantelli lemma,

\mathbb{P}(|S_n|>n\varepsilon\ \text{i.o.})=0.

Since $\varepsilon>0$ is arbitrary,

\frac{S_n}{n}\xrightarrow{\text{a.s.}}0.
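
The fourth-moment identity can be verified exactly in a special case. The sketch below assumes fair $\pm1$ signs (so $\mathbb{E}[X_1^2]=\mathbb{E}[X_1^4]=1$), which is only one instance of the hypotheses, and compares the exact value of $\mathbb{E}[S_n^4]$ with $n\mathbb{E}[X_1^4]+6\binom{n}{2}(\mathbb{E}[X_1^2])^2$.

```python
import math
from fractions import Fraction

def fourth_moment_signs(n):
    """Exact E[S_n^4] when X_k = +1 or -1 with probability 1/2 each."""
    total = Fraction(0)
    for k in range(n + 1):  # k = number of +1 steps, so S_n = 2k - n
        total += Fraction(math.comb(n, k), 2**n) * (2 * k - n) ** 4
    return total

def formula(n, m4=1, m2=1):
    """n E[X^4] + 6 C(n, 2) (E[X^2])^2, with m4 = E[X^4] and m2 = E[X^2]."""
    return n * m4 + 6 * math.comb(n, 2) * m2**2

def markov_bound(n, eps):
    """Bound E[S_n^4] / (n eps)^4 on P(|S_n| > n eps) used in the proof."""
    return float(fourth_moment_signs(n)) / (n * eps) ** 4

for n in (1, 2, 5, 20):
    print(n, fourth_moment_signs(n), formula(n))
```

In this case $\mathbb{E}[S_n^4]=3n^2-2n\leq 3n^2$, so the Markov bound is at most $3/(n^2\varepsilon^4)$, a summable sequence.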

Problem: 4.4.4

Let $\{X_n\}$ be independent exponential random variables with parameter $1$ .

(1) Prove that $(X_1\cdots X_n)^{1/n}$ converges a.s., and find the limit.

(2) Find the limiting distribution of

\frac{n}{\frac1{X_1}+\cdots+\frac1{X_n}}.

Proof

(1) Let $Y_n=\log X_n$ . Since $X_n\sim\mathrm{Exp}(1)$ ,

\mathbb{E}[|Y_1|]<\infty, \qquad \mathbb{E}[Y_1]=\int_{0}^{\infty}(\log x)e^{-x}\,dx=-\gamma,

where $\gamma$ is the Euler-Mascheroni constant. By the strong law,

\frac1n\sum_{k=1}^{n}Y_k\xrightarrow{\text{a.s.}}-\gamma.

Therefore

(X_1\cdots X_n)^{1/n} = \exp\!\left(\frac1n\sum_{k=1}^{n}Y_k\right) \xrightarrow{\text{a.s.}}e^{-\gamma}.

(2) Let

Z_k=\frac1{X_k}.

Then $Z_k\geq 0$ and $\{Z_k\}$ are i.i.d. Also,

\mathbb{E}[Z_1]=\int_{0}^{\infty}\frac1x e^{-x}\,dx=+\infty.

By Problem 4.4.1,

\frac1n\sum_{k=1}^{n}Z_k\xrightarrow{\text{a.s.}}+\infty.

Hence

\frac{n}{\frac1{X_1}+\cdots+\frac1{X_n}} = \left(\frac1n\sum_{k=1}^{n}Z_k\right)^{-1} \xrightarrow{\text{a.s.}}0.

Its limiting distribution is therefore the degenerate distribution $\delta_0$ .
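
A quick simulation makes both limits plausible. This is a sketch under stated assumptions: the sample size and the seed are arbitrary, and the exponential variables are generated by inversion. The geometric mean should be close to $e^{-\gamma}\approx 0.56$, while the harmonic mean drifts toward $0$ only slowly.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def means_exp(n, seed=0):
    """Return (geometric mean, harmonic mean) of n simulated Exp(1) variables."""
    rng = random.Random(seed)
    log_sum = 0.0
    inv_sum = 0.0
    for _ in range(n):
        x = 0.0
        while x <= 0.0:                  # redraw in the negligible case x == 0
            x = -math.log(1.0 - rng.random())  # Exp(1) by inversion
        log_sum += math.log(x)
        inv_sum += 1.0 / x
    return math.exp(log_sum / n), n / inv_sum

gm, hm = means_exp(200_000)
print(gm, math.exp(-EULER_GAMMA))
print(hm)
```

The harmonic mean never exceeds the geometric mean, so the second printed value is always the smaller one; how small it is for a given $n$ depends on the simulated sample.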

Problem: 4.4.5

The interval $[0,1]$ is divided into $n$ disjoint subintervals with lengths $p_1,p_2,\cdots,p_n$ . Define the entropy of the partition by

h=-\sum_{i=1}^{n}p_i\log p_i.

Let $X_1,X_2,\cdots,X_m$ be independent uniform random variables on $[0,1]$ . Let $Z_m(i)$ be the number of $X_1,\cdots,X_m$ that fall in the $i$ -th interval, and define

R_m=\prod_{i=1}^{n}p_i^{Z_m(i)}.

Prove that, as $m\to\infty$ ,

\frac{\log R_m}{m}\xrightarrow{\text{a.s.}}-h.

Proof

For each $k$ , define

Y_k=\sum_{i=1}^{n}(\log p_i)\mathbf{1}_{\{X_k\text{ falls in interval }i\}}.

Then $\{Y_k\}$ are i.i.d., and

\mathbb{P}(Y_k=\log p_i)=p_i,\qquad 1\leq i\leq n.

Thus

\mathbb{E}[Y_1]=\sum_{i=1}^{n}p_i\log p_i=-h.

Also,

\log R_m=\sum_{i=1}^{n}Z_m(i)\log p_i=\sum_{k=1}^{m}Y_k.

By the strong law,

\frac{\log R_m}{m} = \frac1m\sum_{k=1}^{m}Y_k \xrightarrow{\text{a.s.}} \mathbb{E}[Y_1] =-h.

Problem: 4.4.7

Let $\{X_k:k\geq 2\}$ be independent random variables such that

\mathbb{P}(X_k=2k)=\mathbb{P}(X_k=-2k)=\frac{1}{2k\log k}, \qquad \mathbb{P}(X_k=0)=1-\frac{1}{k\log k}.

Set

S_n=X_2+\cdots+X_n.

Prove that

\frac{S_n}{n}\xrightarrow{P}0, \qquad \frac{S_n}{n(n-1)}\xrightarrow{\text{a.s.}}0,

but

\frac{S_n}{n}

does not converge to $0$ a.s.

Proof

First,

\mathbb{E}[X_k]=0.

Also,

\mathbb{E}[X_k^2] =4k^2\cdot\frac1{k\log k} =\frac{4k}{\log k}.

Therefore

\operatorname{Var}\!\left(\frac{S_n}{n}\right) = \frac1{n^2}\sum_{k=2}^{n}\mathbb{E}[X_k^2].

Since

\sum_{k=2}^{n}\frac{k}{\log k} \leq \sum_{k\leq \sqrt n}\frac{k}{\log 2} + \sum_{k>\sqrt n}\frac{2k}{\log n} = O\!\left(\frac{n^2}{\log n}\right),

we get

\operatorname{Var}\!\left(\frac{S_n}{n}\right) =O\!\left(\frac1{\log n}\right)\to 0.

By Chebyshev's inequality,

\frac{S_n}{n}\xrightarrow{P}0.

For almost sure convergence under the larger normalization, the same estimate gives

\operatorname{Var}(S_n) = \sum_{k=2}^{n}\mathbb{E}[X_k^2] = O\!\left(\frac{n^2}{\log n}\right).

For every $\varepsilon>0$ ,

\mathbb{P}\!\left(\left|\frac{S_n}{n(n-1)}\right|>\varepsilon\right) \leq \frac{\operatorname{Var}(S_n)}{\varepsilon^2n^2(n-1)^2} = O\!\left(\frac1{n^2\log n}\right).

Hence

\sum_{n=2}^{\infty} \mathbb{P}\!\left(\left|\frac{S_n}{n(n-1)}\right|>\varepsilon\right) <\infty.

By the first Borel-Cantelli lemma,

\frac{S_n}{n(n-1)}\xrightarrow{\text{a.s.}}0.

Finally, prove that $S_n/n$ does not converge to $0$ a.s. Let

A_n=\{X_n=2n\}.

Then $\{A_n\}$ are independent, and

\sum_{n=2}^{\infty}\mathbb{P}(A_n) = \sum_{n=2}^{\infty}\frac1{2n\log n} =\infty.

By the second Borel-Cantelli lemma, $A_n$ occurs infinitely often a.s. If

\frac{S_n}{n}\xrightarrow{\text{a.s.}}0,

then

\frac{S_{n-1}}{n} = \frac{n-1}{n}\cdot\frac{S_{n-1}}{n-1} \xrightarrow{\text{a.s.}}0.

But on $A_n$ ,

\frac{S_n}{n}=\frac{S_{n-1}}{n}+2.

Since $A_n$ occurs infinitely often, this contradicts $S_n/n\to 0$ . Therefore $S_n/n$ does not converge to $0$ a.s.

Exercise 5.1

Note

For characteristic functions, independent sums correspond to products, and linear changes correspond to rescaling. Convergence in distribution can often be checked through pointwise convergence of characteristic functions.

Problem: 5.1.1

The density of $X$ is

f(x)=\frac12e^{-|x|},\qquad -\infty<x<\infty.

Find the characteristic function of $X$ .

Proof

\begin{aligned} \phi_X(t) &= \mathbb{E}[e^{itX}] = \frac12\int_{-\infty}^{\infty}e^{itx-|x|}\,dx \\ &= \int_{0}^{\infty}e^{-x}\cos(tx)\,dx = \frac1{1+t^2}. \end{aligned}
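
The closed form can be checked by numerical integration. The sketch below approximates $\int_0^\infty e^{-x}\cos(tx)\,dx$ with a midpoint rule on a truncated interval; the truncation point and step count are arbitrary assumptions chosen so that the error is negligible.

```python
import math

def laplace_cf_numeric(t, upper=40.0, steps=200_000):
    """Midpoint-rule approximation of the integral of e^{-x} cos(t x) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(-x) * math.cos(t * x)
    return h * total

def laplace_cf_exact(t):
    """Closed form 1 / (1 + t^2) derived above."""
    return 1.0 / (1.0 + t * t)

for t in (0.0, 0.5, 1.0, 3.0):
    print(t, laplace_cf_numeric(t), laplace_cf_exact(t))
```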

Problem: 5.1.2

Assume that $\{U,V\}$ is independent of $\{X,Y\}$ , and let

Z=\frac{UX+VY}{\sqrt{U^2+V^2}}.

Prove that if $X$ and $Y$ are independent $N(0,1)$ random variables, then $Z\sim N(0,1)$ . If $(X,Y)$ is only standard bivariate normal, does the conclusion still hold?

Proof

If $X,Y$ are independent and both have distribution $N(0,1)$ , then for every fixed $(u,v)\in\mathbb{R}^2$ ,

uX+vY\sim N(0,u^2+v^2).

Thus, conditional on $(U,V)=(u,v)$ ,

Z\mid (U,V)=(u,v)\sim N(0,1).

Equivalently, for every $t\in\mathbb{R}$ ,

\mathbb{E}[e^{itZ}\mid U,V]=e^{-t^2/2}.

Taking expectations gives

\mathbb{E}[e^{itZ}]=e^{-t^2/2},

so $Z\sim N(0,1)$ .

If $(X,Y)$ is only standard bivariate normal and independence is not assumed, the conclusion need not hold. Let

\operatorname{Cov}(X,Y)=\rho\neq 0,

and take $U=V=1$ . Then

Z=\frac{X+Y}{\sqrt2},

\operatorname{Var}(Z) = \frac12\operatorname{Var}(X+Y) = \frac12(1+1+2\rho) =1+\rho\neq 1.

Thus $Z\not\sim N(0,1)$ in general.

Problem: 5.1.3

Let

\phi(t)=\left(\frac{\sin t}{t}\right)^2.

Use a probabilistic argument to prove that, for real numbers $t_1,\cdots,t_n$ , the matrix

H_n=\bigl(\phi(t_i-t_j)\bigr)_{i,j=1}^{n}

is nonnegative definite.

Proof

Take independent random variables $X,Y\sim U[-1,1]$ . Then

\phi_X(t)=\phi_Y(t)=\frac{\sin t}{t}.

Hence

\phi_{X+Y}(t)=\phi_X(t)\phi_Y(t) = \left(\frac{\sin t}{t}\right)^2 =\phi(t).

Thus $\phi$ is the characteristic function of the random variable $X+Y$ .

For any complex numbers $c_1,\cdots,c_n$ ,

\begin{aligned} \sum_{i,j=1}^{n}c_i\overline{c_j}\phi(t_i-t_j) &= \sum_{i,j=1}^{n}c_i\overline{c_j} \mathbb{E}\!\left[e^{i(t_i-t_j)(X+Y)}\right] \\ &= \mathbb{E}\!\left[\left|\sum_{j=1}^{n}c_j e^{it_j(X+Y)}\right|^2\right]\\ &\geq 0. \end{aligned}

Therefore $H_n$ is nonnegative definite.
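
As a numerical illustration of the conclusion, the sketch below evaluates the quadratic form $\sum_{i,j}c_i\overline{c_j}\phi(t_i-t_j)$ for randomly drawn complex vectors. The points $t_i$, the seed, and the number of trials are hypothetical choices; a random check of this kind is supporting evidence, not a proof.

```python
import math
import random

def phi(t):
    """(sin t / t)^2, extended by continuity with phi(0) = 1."""
    return 1.0 if t == 0.0 else (math.sin(t) / t) ** 2

def quadratic_form(ts, cs):
    """sum_{i,j} c_i conj(c_j) phi(t_i - t_j); real up to rounding error."""
    total = 0j
    for ti, ci in zip(ts, cs):
        for tj, cj in zip(ts, cs):
            total += ci * cj.conjugate() * phi(ti - tj)
    return total

rng = random.Random(1)
ts = [rng.uniform(-5.0, 5.0) for _ in range(6)]  # hypothetical points t_1..t_6

forms = []
for _ in range(200):
    cs = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(6)]
    forms.append(quadratic_form(ts, cs))

print(min(q.real for q in forms))
```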

Problem: 5.1.5

Let $X_1,X_2,\cdots,X_n$ be independent random variables, and set

Y_n=X_1^2+X_2^2+\cdots+X_n^2.

(1) If $X_i\sim N(i,1)$, find the characteristic function of $Y_n$.

(2) If $X_i\sim N(1,1)$ for all $i$, and if $N\sim P(\lambda)$ is independent of all $X_i$, find the characteristic function of $Y_N$ (with the convention $Y_0=0$).

Proof

If $X\sim N(\mu,1)$ , then

\begin{aligned} \mathbb{E}[e^{itX^2}] &= \frac1{\sqrt{2\pi}} \int_{\mathbb{R}} \exp\!\left(itx^2-\frac{(x-\mu)^2}{2}\right)\,dx\\ &= \frac1{\sqrt{1-2it}} \exp\!\left(\frac{i\mu^2t}{1-2it}\right). \end{aligned}

(1) By independence,

\phi_{Y_n}(t) = \prod_{k=1}^{n}\mathbb{E}[e^{itX_k^2}] = (1-2it)^{-n/2} \exp\!\left(\frac{it}{1-2it}\sum_{k=1}^{n}k^2\right).

Thus

\phi_{Y_n}(t) = (1-2it)^{-n/2} \exp\!\left(\frac{it}{1-2it}\cdot\frac{n(n+1)(2n+1)}6\right).

(2) In this case,

\phi_{X_1^2}(t) = (1-2it)^{-1/2} \exp\!\left(\frac{it}{1-2it}\right).

Conditional on $N=m$ ,

\phi_{Y_N\mid N=m}(t)=\phi_{X_1^2}(t)^m.

Therefore

\phi_{Y_N}(t) = \mathbb{E}[\phi_{X_1^2}(t)^N] = \exp\{\lambda(\phi_{X_1^2}(t)-1)\}.

That is,

\phi_{Y_N}(t) = \exp\!\left\{\lambda\left((1-2it)^{-1/2} \exp\!\left(\frac{it}{1-2it}\right)-1\right)\right\}.

Problem: 5.1.7

Let $X_1,\cdots,X_n$ be i.i.d., and let

S_n=X_1+\cdots+X_n.

(1) If the moment generating function $M(t)=\mathbb{E}[e^{tX_1}]$ exists, prove the tail bound

\mathbb{P}(X_1\geq a) \leq \inf_{t>0}\{e^{-at}M(t)\}.

(2) If $\mathbb{P}(X_1=1)=\mathbb{P}(X_1=-1)=\frac12$, prove that for every $a>0$,

\mathbb{P}(S_n\geq a)\leq e^{-a^2/(2n)}.

Proof

(1) For every $t>0$ , Markov's inequality gives

\mathbb{P}(X_1\geq a) = \mathbb{P}(e^{tX_1}\geq e^{ta}) \leq e^{-ta}\mathbb{E}[e^{tX_1}] = e^{-ta}M(t).

Taking the infimum over $t>0$ gives the result.

(2) Apply the argument of (1) to $S_n$: for every $t>0$,

\mathbb{P}(S_n\geq a) \leq e^{-at}\mathbb{E}[e^{tS_n}] = e^{-at}\bigl(\mathbb{E}[e^{tX_1}]\bigr)^n.

Now

\mathbb{E}[e^{tX_1}] = \frac{e^t+e^{-t}}2 = \cosh t,

and

\cosh t = \sum_{m=0}^{\infty}\frac{t^{2m}}{(2m)!} \leq \sum_{m=0}^{\infty}\frac{(t^2/2)^m}{m!} = e^{t^2/2}.

Thus

\mathbb{P}(S_n\geq a) \leq \exp\!\left(-at+\frac{nt^2}{2}\right).

Taking $t=a/n$ gives

\mathbb{P}(S_n\geq a)\leq e^{-a^2/(2n)}.
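
For fair signs the tail of $S_n$ is a binomial tail, so the bound can be compared with the exact probability. The sketch below does this for a few values of $n$; the chosen values of $n$ and $a$ are arbitrary and serve only as an illustration.

```python
import math

def sign_sum_tail(n, a):
    """Exact P(S_n >= a) for S_n a sum of n independent fair +1/-1 signs."""
    # S_n = 2k - n with k ~ Binomial(n, 1/2), so S_n >= a iff k >= (n + a) / 2.
    k_min = max(math.ceil((n + a) / 2), 0)
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2**n

def chernoff_bound(n, a):
    """The bound exp(-a^2 / (2n)) obtained with t = a / n."""
    return math.exp(-a * a / (2.0 * n))

for n, a in ((10, 4), (50, 10), (100, 30)):
    print(n, a, sign_sum_tail(n, a), chernoff_bound(n, a))
```

The exact tail is noticeably smaller than the bound for moderate $a$, which is typical: the exponential bound captures the right rate in the exponent but not the polynomial prefactor.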

Problem: 5.1.8

A random variable $X$ is called sub-Gaussian if, for some constant $K>0$ ,

\mathbb{P}(|X|\geq t)\leq 2e^{-t^2/K^2}, \qquad \forall t\geq 0.

Prove:

(1) If

\mathbb{E}[e^{sX}]\leq e^{s^2/2}, \qquad \forall s\in\mathbb{R},

then $X$ is sub-Gaussian.

(2) The moments of a sub-Gaussian random variable satisfy

\mathbb{E}[|X|^p]\leq (K_1\sqrt p)^p, \qquad \forall p\geq 1,

where $K_1$ is a positive constant independent of $p$ . You may use Stirling's formula

n!\sim n^n e^{-n}\sqrt{2\pi n}.

Proof

(1) For $s,t>0$ , Markov's inequality gives

\mathbb{P}(X\geq t) = \mathbb{P}(e^{sX}\geq e^{st}) \leq e^{-st}\mathbb{E}[e^{sX}] \leq e^{-st+s^2/2}.

Taking $s=t$ yields

\mathbb{P}(X\geq t)\leq e^{-t^2/2}.

Applying the same argument to $-X$ gives

\mathbb{P}(X\leq -t)\leq e^{-t^2/2}.

Therefore

\mathbb{P}(|X|\geq t)\leq 2e^{-t^2/2},

so $X$ is sub-Gaussian.

(2) By the tail integral formula,

\mathbb{E}[|X|^p] = \int_{0}^{\infty}pt^{p-1}\mathbb{P}(|X|>t)\,dt \leq 2p\int_{0}^{\infty}t^{p-1}e^{-t^2/K^2}\,dt.

With $u=t^2/K^2$ ,

\mathbb{E}[|X|^p] \leq pK^p\int_{0}^{\infty}u^{p/2-1}e^{-u}\,du = pK^p\Gamma(p/2) = 2K^p\Gamma(p/2+1).

By Stirling's formula, there is a constant $C>0$ such that for all $p\geq 1$ ,

\Gamma(p/2+1)\leq C^p p^{p/2}.

Hence

\mathbb{E}[|X|^p]\leq (K_1\sqrt p)^p

for a constant $K_1$ independent of $p$ .

Exercise 5.2

Note

This section looks at convergence in distribution and how independence passes to limits. The Cauchy example is a warning: without a first moment, the usual law-of-large-numbers intuition does not apply.

Problem: 5.2.2

Suppose $X_n,Y_n$ are independent, $X,Y$ are independent, and

X_n\xrightarrow{D}X,\qquad Y_n\xrightarrow{D}Y.

Prove that

X_n+Y_n\xrightarrow{D}X+Y.

Proof

By independence,

\phi_{X_n+Y_n}(t)=\phi_{X_n}(t)\phi_{Y_n}(t).

Since $X_n\xrightarrow{D}X$ and $Y_n\xrightarrow{D}Y$ , for every $t\in\mathbb{R}$ ,

\phi_{X_n}(t)\to\phi_X(t), \qquad \phi_{Y_n}(t)\to\phi_Y(t).

Because $X,Y$ are independent,

\phi_X(t)\phi_Y(t)=\phi_{X+Y}(t).

Thus

\phi_{X_n+Y_n}(t)\to\phi_{X+Y}(t).

By Lévy's continuity theorem,

X_n+Y_n\xrightarrow{D}X+Y.

Problem: 5.2.3

Let $X_1,\cdots,X_n$ be independent standard Cauchy random variables. Prove that

\frac1n\sum_{k=1}^{n}X_k

also has the Cauchy distribution.

Proof

First compute the characteristic function of the standard Cauchy distribution. If $X$ has density

f(x)=\frac1{\pi(1+x^2)},

then

\phi_X(t) = \frac1\pi\int_{-\infty}^{\infty}\frac{e^{itx}}{1+x^2}\,dx.

For $t>0$ , consider

g(z)=\frac{e^{itz}}{1+z^2},

and integrate over the upper half-plane semicircle. By Jordan's lemma, the integral over the arc tends to $0$ . The only pole inside the contour is $z=i$ , with residue

\operatorname{Res}(g,i) = \lim_{z\to i}\frac{e^{itz}}{z+i} = \frac{e^{-t}}{2i}.

The residue theorem gives

\int_{-\infty}^{\infty}\frac{e^{itx}}{1+x^2}\,dx = 2\pi i\cdot\frac{e^{-t}}{2i} = \pi e^{-t}.

Thus

\phi_X(t)=e^{-t},\qquad t>0.

Since $f$ is even,

\phi_X(t) = \frac1\pi\int_{-\infty}^{\infty}\frac{\cos(tx)}{1+x^2}\,dx,

so $\phi_X$ is even. Hence, for $t<0$ ,

\phi_X(t)=\phi_X(-t)=e^{t}.

Together with $\phi_X(0)=1$ ,

\phi_X(t)=e^{-|t|},\qquad t\in\mathbb{R}.

Therefore

\phi_{X_k/n}(t)=\phi_{X_k}\!\left(\frac{t}{n}\right)=e^{-|t|/n}.

By independence,

\phi_{\frac1n\sum_{k=1}^{n}X_k}(t) = \prod_{k=1}^{n}\phi_{X_k/n}(t) = \left(e^{-|t|/n}\right)^n = e^{-|t|}.

This is the characteristic function of the standard Cauchy distribution, so

\frac1n\sum_{k=1}^{n}X_k

is again Cauchy.
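
A simulation shows the failure of averaging. For a standard Cauchy variable, $\mathbb{P}(|X|\leq 1)=\tfrac12$, and by the result above the same holds for the sample mean of any size. The sketch below estimates this probability for several $n$; the seed and the number of repetitions are arbitrary assumptions.

```python
import math
import random

def cauchy_sample(rng):
    """Standard Cauchy draw by inversion: tan(pi (U - 1/2)) with U uniform on [0, 1)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def fraction_within_one(n, reps=10_000, seed=0):
    """Fraction of simulated sample means of size n that land in [-1, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        mean = sum(cauchy_sample(rng) for _ in range(n)) / n
        if abs(mean) <= 1.0:
            hits += 1
    return hits / reps

fractions_by_n = {n: fraction_within_one(n) for n in (1, 10, 50)}
print(fractions_by_n)
```

Each estimate should be near $0.5$ regardless of $n$: the sample mean does not concentrate, in contrast with the finite-mean case.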

Problem: 5.2.5

Let $\phi_n(t)=\cos^n t$ , $t\in\mathbb{R}$ .

(1) Find the distribution function corresponding to the characteristic function $\phi_2(t)$.

(2) For general positive integers $n$, is $\phi_n(t)$ a characteristic function? Answer and explain.

Proof

(1) Define a random variable $X$ by

\mathbb{P}(X=-2)=\frac14,\qquad \mathbb{P}(X=0)=\frac12,\qquad \mathbb{P}(X=2)=\frac14.

Then

\phi_X(t)=\frac14e^{-2it}+\frac12+\frac14e^{2it}=\cos^2t.

Hence the distribution function corresponding to $\phi_2$ is

F_2(x)= \begin{cases} 0, & x<-2,\\ \frac14, & -2\leq x<0,\\ \frac34, & 0\leq x<2,\\ 1, & x\geq 2. \end{cases}

(2) For any positive integer $n$, let $Y_1,\cdots,Y_n$ be i.i.d. random variables with

\mathbb{P}(Y_k=1)=\mathbb{P}(Y_k=-1)=\frac12.

Then

\phi_{Y_k}(t)=\frac12(e^{it}+e^{-it})=\cos t.

By independence,

\phi_{Y_1+\cdots+Y_n}(t) = \prod_{k=1}^{n}\phi_{Y_k}(t) = \cos^n t = \phi_n(t).

Thus $\phi_n(t)$ is a characteristic function for every positive integer $n$ .
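
The identification can be confirmed directly: the law of $Y_1+\cdots+Y_n$ is a shifted binomial, and its characteristic function should agree with $\cos^n t$. The helper names in this sketch are hypothetical.

```python
import cmath
import math

def sign_sum_pmf(n):
    """Law of Y_1 + ... + Y_n for independent fair +1/-1 signs, as {value: probability}."""
    return {2 * k - n: math.comb(n, k) / 2**n for k in range(n + 1)}

def cf_from_pmf(pmf, t):
    """E[exp(i t X)] for a finite discrete law given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

print(sign_sum_pmf(2))  # masses at -2, 0, 2, matching the answer to part (1)
for n in (1, 2, 5):
    t = 0.7
    print(n, cf_from_pmf(sign_sum_pmf(n), t), math.cos(t) ** n)
```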

Exercise 5.3

Note

For central limit theorem problems, first identify the centering and scaling. If the variance depends on $n$ , compute the scale before applying a theorem.

Problem: 5.3.1

Choose suitable sequences $\{\mu_n\}$ and $\{\sigma_n\}$ to prove

\frac{X_n-\mu_n}{\sigma_n}\xrightarrow{D}N(0,1).

(1) $X_n$ has the Poisson distribution with positive integer parameter $n$.

(2) $X_n$ has the Gamma density

f(x)=\frac{x^{n-1}e^{-x}}{\Gamma(n)}\mathbf{1}_{x\geq 0}.

Proof

(1) If $Y_1,\cdots,Y_n$ are i.i.d. with $Y_i\sim P(1)$ , then

X_n':=Y_1+\cdots+Y_n\sim P(n).

Thus $X_n'$ and $X_n$ have the same distribution. By the i.i.d. CLT,

\frac{X_n'-n}{\sqrt n}\xrightarrow{D}N(0,1).

So take

\mu_n=n,\qquad \sigma_n=\sqrt n.

Then

\frac{X_n-\mu_n}{\sigma_n}\xrightarrow{D}N(0,1).

(2) If $Z_1,\cdots,Z_n$ are i.i.d. exponential random variables with parameter $1$, then

X_n':=Z_1+\cdots+Z_n

has density

f(x)=\frac{x^{n-1}e^{-x}}{\Gamma(n)}\mathbf{1}_{x\geq 0}.

Thus $X_n'$ and $X_n$ have the same distribution. By the i.i.d. CLT,

\frac{X_n'-n}{\sqrt n}\xrightarrow{D}N(0,1).

Again take

\mu_n=n,\qquad \sigma_n=\sqrt n.

This gives

\frac{X_n-\mu_n}{\sigma_n}\xrightarrow{D}N(0,1).

Problem: 5.3.3

Let $X_1,\cdots,X_n$ be i.i.d. random variables with

\mathbb{P}(X_1=1)=\mathbb{P}(X_1=-1)=\frac12.

Prove that

\frac{\sqrt3}{n^{3/2}}\sum_{k=1}^{n}kX_k\xrightarrow{D}N(0,1).

Proof

Let

Y_{n,k}=kX_k,\qquad 1\leq k\leq n.

Then $\{Y_{n,k}\}_{k=1}^{n}$ are independent, with

\mathbb{E}[Y_{n,k}]=0,\qquad \operatorname{Var}(Y_{n,k})=k^2.

Set

B_n^2=\sum_{k=1}^{n}\operatorname{Var}(Y_{n,k}) = \sum_{k=1}^{n}k^2 = \frac{n(n+1)(2n+1)}6.

Since $B_n\asymp n^{3/2}$, we have $n/B_n\to 0$. Hence for every $\varepsilon>0$ and all sufficiently large $n$,

|Y_{n,k}|=k\leq n<\varepsilon B_n,\qquad 1\leq k\leq n.

Hence

\sum_{k=1}^{n} \mathbb{E}\!\left[Y_{n,k}^2;\ |Y_{n,k}|>\varepsilon B_n\right]=0,

so the Lindeberg condition holds. By the Lindeberg-Feller CLT,

\frac{\sum_{k=1}^{n}kX_k}{B_n}\xrightarrow{D}N(0,1).

Since

\frac{B_n}{n^{3/2}} = \sqrt{\frac{(n+1)(2n+1)}{6n^2}} \longrightarrow \frac1{\sqrt3},

Slutsky's theorem gives

\frac{\sqrt3}{n^{3/2}}\sum_{k=1}^{n}kX_k\xrightarrow{D}N(0,1).
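
The limit can also be seen through characteristic functions, since $\mathbb{E}[e^{iskX_k}]=\cos(sk)$ for fair signs. The sketch below evaluates the exact characteristic function of the normalized sum as a finite product and compares it with $e^{-t^2/2}$; the values of $n$ and $t$ are arbitrary choices for illustration.

```python
import math

def weighted_sum_cf(n, t):
    """Characteristic function at t of sqrt(3) n^{-3/2} sum_k k X_k, for fair +1/-1 signs X_k."""
    scale = math.sqrt(3.0) / n**1.5
    prod = 1.0
    for k in range(1, n + 1):
        prod *= math.cos(scale * k * t)  # factor contributed by k X_k
    return prod

for n in (10, 100, 2000):
    for t in (1.0, 2.0):
        print(n, t, weighted_sum_cf(n, t), math.exp(-0.5 * t * t))
```

The agreement improves as $n$ grows, in line with the Lindeberg-Feller conclusion.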

Exercise 5.5

Note

Slutsky's theorem lets us replace a random error by its constant probability limit. The first thing to check is whether the added or multiplied factor converges in probability to a constant.

Problem: 5.5.12

Slutsky's theorem says: if random variables $\{X_n\}$ , $\{Y_n\}$ , and $\{Z_n\}$ satisfy

X_n\xrightarrow{D}X,\qquad Y_n\xrightarrow{P}b,\qquad Z_n\xrightarrow{P}c,

where $X$ is a random variable and $b,c$ are constants, then

X_nY_n+Z_n\xrightarrow{D}bX+c.

Use Slutsky's theorem to answer the following questions.

(1) Let $\{X_n\}$ be i.i.d., with $\mathbb{E}[X_1]=0$ and finite second moment. Let

\overline X=\frac1n\sum_{k=1}^{n}X_k.

Prove that

\frac{\sum_{k=1}^{n}X_k} {\sqrt{\sum_{k=1}^{n}(X_k-\overline X)^2}} \xrightarrow{D}N(0,1).

(2) Let $\{X_n\}$ be independent and satisfy

\mathbb{P}(X_n=\pm 2^n)=\frac1{2^{n+1}}, \qquad \mathbb{P}(X_n=\pm 1)=\frac12-\frac1{2^{n+1}}.

Prove that

\frac1{\sqrt n}\sum_{k=1}^{n}X_k\xrightarrow{D}N(0,1).

(3) Let $\{X_n\}$ be i.i.d. with

\mathbb{E}[X_1]=\operatorname{Var}(X_1)=1.

Set

S_n=\sum_{k=1}^{n}X_k.

Prove that

\frac{S_n^{3/2}-n^{3/2}}{\frac32 n}\xrightarrow{D}N(0,1).

Proof

(1) Let $\sigma^2=\operatorname{Var}(X_1)=\mathbb{E}[X_1^2]$, and assume $\sigma>0$ so that the ratio is meaningful. By the CLT,

\frac{\sum_{k=1}^{n}X_k}{\sigma\sqrt n}\xrightarrow{D}N(0,1).

Also,

\frac1n\sum_{k=1}^{n}(X_k-\overline X)^2 = \frac1n\sum_{k=1}^{n}X_k^2-\overline X^{\,2}.

By the weak law of large numbers,

\frac1n\sum_{k=1}^{n}X_k^2\xrightarrow{P}\mathbb{E}[X_1^2]=\sigma^2, \qquad \overline X\xrightarrow{P}0.

Therefore

\frac1n\sum_{k=1}^{n}(X_k-\overline X)^2\xrightarrow{P}\sigma^2, \qquad \frac{\sigma}{\sqrt{\frac1n\sum_{k=1}^{n}(X_k-\overline X)^2}} \xrightarrow{P}1.

Slutsky's theorem gives

\frac{\sum_{k=1}^{n}X_k} {\sqrt{\sum_{k=1}^{n}(X_k-\overline X)^2}} = \frac{\sum_{k=1}^{n}X_k}{\sigma\sqrt n} \cdot \frac{\sigma}{\sqrt{\frac1n\sum_{k=1}^{n}(X_k-\overline X)^2}} \xrightarrow{D}N(0,1).

(2) We may write

X_k=(1-B_k)\varepsilon_k+B_k2^k\eta_k,

where $\{B_k\}$ , $\{\varepsilon_k\}$ , and $\{\eta_k\}$ are mutually independent, and

\mathbb{P}(B_k=1)=2^{-k}, \qquad \mathbb{P}(\varepsilon_k=\pm1)=\mathbb{P}(\eta_k=\pm1)=\frac12.

Then $X_k$ has the required distribution. Set

T_n=\sum_{k=1}^{n}\varepsilon_k, \qquad R_n=\sum_{k=1}^{n}B_k(2^k\eta_k-\varepsilon_k).

Then

\sum_{k=1}^{n}X_k=T_n+R_n.

Since

\sum_{k=1}^{\infty}\mathbb{P}(B_k=1) = \sum_{k=1}^{\infty}2^{-k} <\infty,

the first Borel-Cantelli lemma shows that, almost surely, $\{B_k=1\}$ occurs for only finitely many $k$. Hence $R_n$ is eventually constant a.s., and

\frac{R_n}{\sqrt n}\xrightarrow{\text{a.s.}}0.

By the CLT,

\frac{T_n}{\sqrt n}\xrightarrow{D}N(0,1).

Slutsky's theorem gives

\frac1{\sqrt n}\sum_{k=1}^{n}X_k = \frac{T_n}{\sqrt n}+\frac{R_n}{\sqrt n} \xrightarrow{D}N(0,1).
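The claim that the decomposition reproduces the required law can be checked exactly. The sketch below enumerates the values of $(B_k,\varepsilon_k,\eta_k)$ with rational arithmetic; the helper names `law_of_X` and `target_law` are illustrative.

```python
from fractions import Fraction
from itertools import product

def law_of_X(k):
    """Exact law of X_k = (1 - B_k) * eps_k + B_k * 2**k * eta_k."""
    p_B = {1: Fraction(1, 2 ** k), 0: 1 - Fraction(1, 2 ** k)}
    p_sign = {1: Fraction(1, 2), -1: Fraction(1, 2)}
    law = {}
    for (b, pb), (e, pe), (h, ph) in product(p_B.items(), p_sign.items(), p_sign.items()):
        x = (1 - b) * e + b * 2 ** k * h
        law[x] = law.get(x, 0) + pb * pe * ph
    return law

def target_law(k):
    """Law required in the problem statement."""
    big = Fraction(1, 2 ** (k + 1))
    small = Fraction(1, 2) - big
    return {2 ** k: big, -(2 ** k): big, 1: small, -1: small}

for k in range(1, 8):
    assert law_of_X(k) == target_law(k)
```

Because the comparison uses `Fraction`, the equality is exact rather than approximate.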

(3) Let

T_n=\frac{S_n-n}{\sqrt n}, \qquad U_n=\frac{S_n}{n}.

By the CLT,

T_n\xrightarrow{D}N(0,1).

By the weak law,

U_n\xrightarrow{P}1.

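Also, on the event $\{S_n>0\}$, whose probability tends to $1$ because $U_n\xrightarrow{P}1$ (and with the last quotient read as its limit $\frac32$ when $U_n=1$),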

\frac{S_n^{3/2}-n^{3/2}}{\frac32 n} = T_n\cdot\frac23\cdot\frac{U_n^{3/2}-1}{U_n-1}.

Define

g(u)=\frac23\cdot\frac{u^{3/2}-1}{u-1}\quad (u\neq 1), \qquad g(1)=1.

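Since the derivative of $u^{3/2}$ at $u=1$ equals $\frac32$, the function $g$ is continuous at $u=1$, and the continuous mapping theorem gives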

g(U_n)\xrightarrow{P}1.

Slutsky's theorem gives

\frac{S_n^{3/2}-n^{3/2}}{\frac32 n}\xrightarrow{D}N(0,1).
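The algebraic identity behind part (3) is easy to test numerically. The following sketch plugs in a few made-up positive values of $S_n$ and compares both sides; `lhs`, `rhs`, and the sample values are illustrative only.

```python
def g(u):
    """g(u) = (2/3)(u^{3/2} - 1)/(u - 1), extended by g(1) = 1."""
    if u == 1.0:
        return 1.0
    return (2.0 / 3.0) * (u ** 1.5 - 1.0) / (u - 1.0)

def lhs(s, n):
    """(S_n^{3/2} - n^{3/2}) / ((3/2) n) for a given value s > 0 of S_n."""
    return (s ** 1.5 - n ** 1.5) / (1.5 * n)

def rhs(s, n):
    """T_n * g(U_n) with T_n = (s - n)/sqrt(n) and U_n = s/n."""
    return (s - n) / n ** 0.5 * g(s / n)

for s, n in [(90.0, 100), (100.0, 100), (123.5, 100), (9800.0, 10000)]:
    assert abs(lhs(s, n) - rhs(s, n)) < 1e-9 * max(1.0, abs(lhs(s, n)))

# g is close to 1 near u = 1, which is what Slutsky's theorem needs.
print(g(0.99), g(1.0), g(1.01))
```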

Exercise 5.4

Note

This section moves into stronger limit theorems and Stein's method. When reading the proofs, separate weak convergence, moment bounds, and integrability.

Problem: 5.4.1

Let $X_1,X_2,\dots$ be i.i.d. with

\mathbb{P}(X_1=1)=\mathbb{P}(X_1=-1)=\frac12.

Prove that for every $\delta>0$ ,

\frac1{n^{1/2+\delta}}\sum_{k=1}^{n}X_k \xrightarrow{\text{a.s.}}0.

Proof

Chebyshev's inequality alone only reaches $\delta>1/2$ , so we use higher moments. Let $m$ be a positive integer to be chosen later, and set

S_n=\sum_{k=1}^{n}X_k.

For $\varepsilon>0$ , Markov's inequality gives

\mathbb{P}\left(\left|\frac{S_n}{n^{1/2+\delta}}\right|>\varepsilon\right) = \mathbb{P}\left(\left|\frac{S_n}{n^{1/2+\delta}}\right|^{2m}>\varepsilon^{2m}\right) \leq \frac{\mathbb{E}|S_n|^{2m}}{\varepsilon^{2m}n^{m+2m\delta}}.

We estimate $\mathbb{E}|S_n|^{2m}$ . Expanding,

\mathbb{E}S_n^{2m} = \sum_{i_1,\dots,i_{2m}=1}^{n} \mathbb{E}(X_{i_1}\cdots X_{i_{2m}}).

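Since the $X_i$ are independent and $X_i^{\ell}=X_i$ for odd $\ell$ (so $\mathbb{E}X_i^{\ell}=0$), a term vanishes if some index appears an odd number of times. Thus, in every nonzero term, each distinct index appears at least twice, so the number of distinct indices is at most $m$. The number of such index tuples is at most $C_mn^m$, and each term has absolute value at most $1$ because $|X_i|=1$. Hence there is a constant $C_m$, depending only on $m$, such that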

\mathbb{E}S_n^{2m}\leq C_m n^m.

Therefore

\mathbb{P}\left(\left|\frac{S_n}{n^{1/2+\delta}}\right|>\varepsilon\right) \leq \frac{C_m}{\varepsilon^{2m}}\,n^{-2m\delta}.

Choose $m$ so that

2m\delta>1.

Then

\sum_{n=1}^{\infty} \mathbb{P}\left(\left|\frac{S_n}{n^{1/2+\delta}}\right|>\varepsilon\right) <\infty.

By Borel-Cantelli,

\mathbb{P}\left( \left|\frac{S_n}{n^{1/2+\delta}}\right|>\varepsilon \ \text{i.o.} \right)=0.

Thus, for every fixed $\varepsilon>0$ , almost surely there is $N(\omega)$ such that for $n\geq N(\omega)$ ,

\left|\frac{S_n}{n^{1/2+\delta}}\right|\leq \varepsilon.

Letting $\varepsilon$ run over the positive rationals gives

\frac{S_n}{n^{1/2+\delta}}\xrightarrow{\mathrm{a.s.}}0.
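The moment bound $\mathbb{E}S_n^{2m}\leq C_mn^m$ can be checked exactly for fair signs, since $S_n=2K-n$ with $K$ binomial. The sketch below is illustrative: the function names are made up, and the choice $C_m=(2m-1)!!$ is one admissible constant (the Gaussian moment), not something stated in the proof.

```python
from math import comb

def even_moment(n, m):
    """Exact E[S_n^(2m)] for S_n a sum of n independent fair +-1 signs."""
    # S_n = 2K - n with K ~ Binomial(n, 1/2).
    total = sum(comb(n, k) * (2 * k - n) ** (2 * m) for k in range(n + 1))
    return total // 2 ** n  # exact division: the moment is an integer

def double_factorial_odd(m):
    """(2m-1)!! = 1 * 3 * ... * (2m-1)."""
    out = 1
    for j in range(1, 2 * m, 2):
        out *= j
    return out

for n in (1, 5, 20, 50):
    assert even_moment(n, 1) == n                  # variance
    assert even_moment(n, 2) == 3 * n * n - 2 * n  # fourth moment
    for m in range(1, 5):
        assert even_moment(n, m) <= double_factorial_odd(m) * n ** m
```

With this bound, taking any integer $m>1/(2\delta)$ makes the series in the Borel-Cantelli step converge.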

Problem: 5.4.4

Let $\{X_k\}$ be i.i.d. random variables with

\mathbb{E}X_1=0,\qquad \operatorname{Var}(X_1)=1,\qquad \mathbb{E}|X_1|^3<\infty.

Use the Lindeberg replacement method to prove the CLT convergence rate

\sup_{t\in\mathbb{R}} \left| \mathbb{P}\left(\frac1{\sqrt n}\sum_{k=1}^{n}X_k\leq t\right) -\Phi(t) \right| = O(n^{-1/8}).

Here $\Phi(t)$ is the standard normal distribution function.

Proof

Set

S_n=\sum_{k=1}^{n}X_k, \qquad W_n=\frac{S_n}{\sqrt n}.

Let $Y_1,\dots,Y_n$ be i.i.d. standard normal random variables, independent of $X_1,\dots,X_n$ , and set

Z_n=\frac1{\sqrt n}\sum_{k=1}^{n}Y_k.

Then $Z_n\sim N(0,1)$ , so

\mathbb{P}(Z_n\leq t)=\Phi(t).

Fix $\varepsilon>0$ . Choose a smooth function $f_{t,\varepsilon}\in C^3(\mathbb R)$ such that

\mathbf{1}_{\{x\leq t\}} \leq f_{t,\varepsilon}(x) \leq \mathbf{1}_{\{x\leq t+\varepsilon\}},

and

\|f_{t,\varepsilon}^{(3)}\|_\infty\leq C\varepsilon^{-3},

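where $C$ is independent of $t,\varepsilon,n$. For instance, take $f_{t,\varepsilon}(x)=\psi((x-t)/\varepsilon)$ for a fixed $C^3$ function $\psi$ that equals $1$ on $(-\infty,0]$, equals $0$ on $[1,\infty)$, and decreases in between; each derivative then costs a factor $\varepsilon^{-1}$.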

We estimate

\left| \mathbb{E}f_{t,\varepsilon}(W_n) - \mathbb{E}f_{t,\varepsilon}(Z_n) \right|.

Replace $X_k$ by $Y_k$ one at a time. Let

T_k= \frac1{\sqrt n} \left( Y_1+\cdots+Y_{k-1} + X_{k+1}+\cdots+X_n \right).

Then $T_k$ is independent of $X_k$ and $Y_k$ . Taylor expansion gives

f_{t,\varepsilon}\left(T_k+\frac{X_k}{\sqrt n}\right) = f_{t,\varepsilon}(T_k) + \frac{X_k}{\sqrt n}f_{t,\varepsilon}'(T_k) + \frac{X_k^2}{2n}f_{t,\varepsilon}''(T_k) + R_{k,X},

with

|R_{k,X}| \leq \frac{\|f_{t,\varepsilon}^{(3)}\|_\infty}{6} \frac{|X_k|^3}{n^{3/2}}.

Similarly,

f_{t,\varepsilon}\left(T_k+\frac{Y_k}{\sqrt n}\right) = f_{t,\varepsilon}(T_k) + \frac{Y_k}{\sqrt n}f_{t,\varepsilon}'(T_k) + \frac{Y_k^2}{2n}f_{t,\varepsilon}''(T_k) + R_{k,Y},

and

|R_{k,Y}| \leq \frac{\|f_{t,\varepsilon}^{(3)}\|_\infty}{6} \frac{|Y_k|^3}{n^{3/2}}.

Because

\mathbb{E}X_k=\mathbb{E}Y_k=0, \qquad \mathbb{E}X_k^2=\mathbb{E}Y_k^2=1,

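and $T_k$ is independent of $X_k,Y_k$, the first- and second-order terms cancel after taking expectations. The remainders are controlled by $\mathbb{E}|X_k|^3<\infty$ and $\mathbb{E}|Y_k|^3<\infty$; absorbing these third moments into the constant $C$, we get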

\left| \mathbb{E}f_{t,\varepsilon}\left(T_k+\frac{X_k}{\sqrt n}\right) - \mathbb{E}f_{t,\varepsilon}\left(T_k+\frac{Y_k}{\sqrt n}\right) \right| \leq C\varepsilon^{-3}n^{-3/2}.

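Since $T_1+X_1/\sqrt n=W_n$, $T_n+Y_n/\sqrt n=Z_n$, and $T_k+Y_k/\sqrt n=T_{k+1}+X_{k+1}/\sqrt n$, the differences telescope. Summing over $k=1,\dots,n$ gives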

\left| \mathbb{E}f_{t,\varepsilon}(W_n) - \mathbb{E}f_{t,\varepsilon}(Z_n) \right| \leq C\varepsilon^{-3}n^{-1/2}.

Therefore

\mathbb{P}(W_n\leq t) \leq \mathbb{E}f_{t,\varepsilon}(W_n) \leq \mathbb{E}f_{t,\varepsilon}(Z_n) + C\varepsilon^{-3}n^{-1/2}.

Since

f_{t,\varepsilon}(x)\leq \mathbf{1}_{\{x\leq t+\varepsilon\}},

we have

\mathbb{E}f_{t,\varepsilon}(Z_n) \leq \mathbb{P}(Z_n\leq t+\varepsilon) = \Phi(t+\varepsilon).

Thus

\mathbb{P}(W_n\leq t)-\Phi(t) \leq \Phi(t+\varepsilon)-\Phi(t) + C\varepsilon^{-3}n^{-1/2}.

Since the standard normal density is bounded,

\Phi(t+\varepsilon)-\Phi(t)\leq C\varepsilon.

Hence

\mathbb{P}(W_n\leq t)-\Phi(t) \leq C\varepsilon+C\varepsilon^{-3}n^{-1/2}.

For the other direction, choose a smooth function $g_{t,\varepsilon}$ such that

\mathbf{1}_{\{x\leq t-\varepsilon\}} \leq g_{t,\varepsilon}(x) \leq \mathbf{1}_{\{x\leq t\}}, \qquad \|g_{t,\varepsilon}^{(3)}\|_\infty\leq C\varepsilon^{-3}.

The same Lindeberg replacement argument gives

\left| \mathbb{E}g_{t,\varepsilon}(W_n) - \mathbb{E}g_{t,\varepsilon}(Z_n) \right| \leq C\varepsilon^{-3}n^{-1/2}.

Therefore

\mathbb{P}(W_n\leq t) \geq \mathbb{E}g_{t,\varepsilon}(W_n) \geq \mathbb{E}g_{t,\varepsilon}(Z_n) - C\varepsilon^{-3}n^{-1/2}.

Also,

\mathbb{E}g_{t,\varepsilon}(Z_n) \geq \mathbb{P}(Z_n\leq t-\varepsilon) = \Phi(t-\varepsilon).

Hence

\Phi(t)-\mathbb{P}(W_n\leq t) \leq \Phi(t)-\Phi(t-\varepsilon) + C\varepsilon^{-3}n^{-1/2} \leq C\varepsilon+C\varepsilon^{-3}n^{-1/2}.

Combining the two bounds, for every $t\in\mathbb R$ ,

\left| \mathbb{P}(W_n\leq t)-\Phi(t) \right| \leq C\varepsilon+C\varepsilon^{-3}n^{-1/2}.

Take

\varepsilon=n^{-1/8}.

Then

\left| \mathbb{P}(W_n\leq t)-\Phi(t) \right| \leq Cn^{-1/8}.

Therefore

\sup_{t\in\mathbb R} \left| \mathbb{P}\left(\frac1{\sqrt n}\sum_{k=1}^{n}X_k\leq t\right) - \Phi(t) \right| = O(n^{-1/8}).
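For one concrete law the Kolmogorov distance can be computed exactly and compared with $n^{-1/8}$. The sketch below uses fair $\pm1$ signs, which is only an example chosen here; the function names are illustrative. The rate $n^{-1/8}$ from the replacement argument is not sharp (the Berry-Esseen theorem gives $n^{-1/2}$), so the computed distances fall well below it.

```python
import math
from math import comb

def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def kolmogorov_distance(n):
    """sup_t |P(S_n/sqrt(n) <= t) - Phi(t)| for S_n a sum of n fair +-1 signs."""
    dist, cdf = 0.0, 0.0
    for k in range(n + 1):
        t = (2 * k - n) / math.sqrt(n)        # atom of S_n / sqrt(n)
        # Between atoms the CDF is flat and Phi is increasing, so the
        # supremum is reached at an atom: either the left limit or the value.
        dist = max(dist, abs(cdf - Phi(t)))   # left limit of the CDF at t
        cdf += comb(n, k) / 2 ** n
        dist = max(dist, abs(cdf - Phi(t)))   # value of the CDF at t
    return dist

for n in (16, 100, 400):
    print(n, kolmogorov_distance(n), n ** -0.125)
```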

Remark

The main idea is to approximate the indicator function by a smooth function, and then replace $X_i$ by normal variables $Y_i$ one at a time. Since $X_i$ and $Y_i$ have the same first two moments, the first- and second-order Taylor terms cancel. Only the third-order remainder remains. The smoothing error is $O(\varepsilon)$ , and the replacement error is $O(\varepsilon^{-3}n^{-1/2})$ . Taking $\varepsilon=n^{-1/8}$ gives the bound.
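The balance between the two error terms can be made concrete. The sketch below sets the constant $C$ to $1$, which is an assumption made only for illustration, and compares the choice $\varepsilon=n^{-1/8}$ with two other exponents.

```python
def smoothing_bound(eps, n):
    """eps + eps^(-3) * n^(-1/2): the final bound with the constant C set to 1."""
    return eps + eps ** -3 * n ** -0.5

for n in (10 ** 2, 10 ** 4, 10 ** 8):
    eps = n ** -0.125
    # With eps = n^(-1/8) the two error terms coincide, so the bound is 2 n^(-1/8).
    assert abs(smoothing_bound(eps, n) - 2 * n ** -0.125) < 1e-12
    # Columns: n, balanced choice, eps too small, eps too large.
    print(n, smoothing_bound(eps, n),
          smoothing_bound(n ** -0.25, n), smoothing_bound(n ** -0.05, n))
```

A smaller $\varepsilon$ inflates the replacement error, and a larger $\varepsilon$ inflates the smoothing error, so for large $n$ both alternatives give a worse bound than the balanced choice.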

Problem: 5.4.5

Let $\{X_n\}$ be i.i.d. random variables with

\mathbb{E}X_1=0,\qquad \mathbb{E}X_1^2=1,

and assume that for all $l\geq 3$ ,

\mathbb{E}|X_1|^l<\infty.

Set

S_n=X_1+\cdots+X_n.

Let $H_k(x)$ be the $k$ -th Hermite polynomial, defined by

H_0=1,\qquad (-1)^kH_k(x)\phi(x)=\phi^{(k)}(x),

where $\phi$ is the standard normal density. Prove that

\lim_{n\to\infty} \mathbb{E}\left[ H_k\left(\frac{S_n}{\sqrt n}\right) \right] =0,\qquad \forall k\geq 1.

Proof

Let

W_n=\frac{S_n}{\sqrt n}.

First prove that, for every fixed positive integer $j$ ,

\lim_{n\to\infty}\mathbb{E}W_n^j = \mathbb{E}Z^j,

where $Z\sim N(0,1)$ .

Expand:

\mathbb{E}W_n^j = n^{-j/2} \sum_{i_1,\dots,i_j=1}^{n} \mathbb{E}(X_{i_1}\cdots X_{i_j}).

Since $\mathbb{E}X_1=0$ and the $X_i$ are independent, a term is $0$ if some index appears exactly once.

Thus, in a nonzero term, every appearing index appears at least twice. If there are $r$ distinct indices, then

r\leq \frac j2.

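For a fixed $r<j/2$, there are $O(n^r)$ such terms, each bounded by a constant depending only on $j$ (by Hölder's inequality and $\mathbb{E}|X_1|^j<\infty$), so their total contribution is at most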

O(n^r)n^{-j/2}=o(1).

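Therefore the limit can only come from $r=j/2$. This requires $j$ to be even and every appearing index to appear exactly twice. Let $j=2m$. Each pairing of the $2m$ positions can be filled with distinct indices in $n(n-1)\cdots(n-m+1)\sim n^m$ ways, which cancels the factor $n^{-m}$ in the limit. The number of pairings is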

(2m-1)!!,

and each contribution has expectation

\mathbb{E}X_1^2\cdots \mathbb{E}X_m^2=1.

Hence

\lim_{n\to\infty}\mathbb{E}W_n^{2m} = (2m-1)!!.

If $j$ is odd, no $r=j/2$ case exists, so

\lim_{n\to\infty}\mathbb{E}W_n^j=0.

These are exactly the moments of $Z\sim N(0,1)$ . Thus

\lim_{n\to\infty}\mathbb{E}W_n^j = \mathbb{E}Z^j.

Since $H_k(x)$ is a degree $k$ polynomial, write

H_k(x)=\sum_{j=0}^{k}a_jx^j.

By moment convergence,

\lim_{n\to\infty}\mathbb{E}H_k(W_n) = \sum_{j=0}^{k}a_j\lim_{n\to\infty}\mathbb{E}W_n^j = \sum_{j=0}^{k}a_j\mathbb{E}Z^j = \mathbb{E}H_k(Z).

Finally compute $\mathbb{E}H_k(Z)$ . Since $Z\sim N(0,1)$ ,

\mathbb{E}H_k(Z) = \int_{-\infty}^{\infty}H_k(x)\phi(x)\,dx.

By the definition of the Hermite polynomial,

H_k(x)\phi(x)=(-1)^k\phi^{(k)}(x).

Therefore

\mathbb{E}H_k(Z) = (-1)^k\int_{-\infty}^{\infty}\phi^{(k)}(x)\,dx.

For $k\geq 1$ ,

\int_{-\infty}^{\infty}\phi^{(k)}(x)\,dx = \phi^{(k-1)}(\infty)-\phi^{(k-1)}(-\infty) =0.

Hence

\mathbb{E}H_k(Z)=0,\qquad k\geq 1.

Thus

\lim_{n\to\infty} \mathbb{E}\left[ H_k\left(\frac{S_n}{\sqrt n}\right) \right] =0,\qquad \forall k\geq 1.
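For fair $\pm1$ signs, an example law chosen here for illustration, the expectation $\mathbb{E}H_k(S_n/\sqrt n)$ can be computed exactly. The sketch below builds $H_k$ from the standard three-term recurrence $H_{j+1}(x)=xH_j(x)-jH_{j-1}(x)$, which agrees with the definition above; all function names are illustrative.

```python
from fractions import Fraction
from math import comb

def hermite(k):
    """Coefficients [a_0, ..., a_k] of H_k, via H_{j+1}(x) = x H_j(x) - j H_{j-1}(x)."""
    prev, cur = [Fraction(1)], [Fraction(0), Fraction(1)]
    if k == 0:
        return prev
    for j in range(1, k):
        nxt = [Fraction(0)] + cur          # x * H_j
        for i, a in enumerate(prev):
            nxt[i] -= j * a                # minus j * H_{j-1}
        prev, cur = cur, nxt
    return cur

def moment_W(n, j):
    """Exact E[(S_n/sqrt(n))^j] for fair +-1 signs; odd moments vanish by symmetry."""
    if j % 2 == 1:
        return Fraction(0)
    total = sum(comb(n, k) * (2 * k - n) ** j for k in range(n + 1))
    return Fraction(total, 2 ** n * n ** (j // 2))

def mean_hermite(n, k):
    """Exact E[H_k(S_n/sqrt(n))]."""
    return sum(a * moment_W(n, j) for j, a in enumerate(hermite(k)))

for n in (10, 100, 1000):
    print(n, mean_hermite(n, 4), mean_hermite(n, 6))
```

In this example $\mathbb{E}H_4(W_n)=-2/n$ and $\mathbb{E}H_6(W_n)=16/n^2$, so the convergence to $0$ is visible with an explicit rate.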

Remark

The idea is to first show that each moment of fixed order of $S_n/\sqrt n$ converges to the corresponding moment of a standard normal variable. In the moment expansion, because $\mathbb{E}X_i=0$, only terms with paired indices survive in the limit, and their count gives exactly the normal moments. Since $H_k$ is a polynomial, moment convergence gives $\mathbb{E}H_k(S_n/\sqrt n)\to\mathbb{E}H_k(Z)$. Finally, Hermite polynomials satisfy $\mathbb{E}H_k(Z)=0$ under the standard normal law for $k\geq 1$.

Problem: 5.4.8

(Stein's method) Prove that

X\sim N(0,1)

if and only if for every bounded continuous function $g$ with bounded continuous derivative $g'$ ,

\mathbb{E}[Xg(X)]=\mathbb{E}[g'(X)].

Hint: for $Z\sim N(0,1)$ and bounded continuous $h$ , construct

g_0(x) = e^{x^2/2} \int_{-\infty}^{x} e^{-y^2/2}\bigl(h(y)-\mathbb{E}h(Z)\bigr)\,dy.

Proof

First prove necessity. If $X\sim N(0,1)$ , its density is

\phi(x)=\frac1{\sqrt{2\pi}}e^{-x^2/2}.

Since

\phi'(x)=-x\phi(x),

we have

\mathbb{E}[Xg(X)] = \int_{-\infty}^{\infty}xg(x)\phi(x)\,dx = -\int_{-\infty}^{\infty}g(x)\phi'(x)\,dx.

Integrating by parts,

-\int_{-\infty}^{\infty}g(x)\phi'(x)\,dx = -[g(x)\phi(x)]_{-\infty}^{\infty} + \int_{-\infty}^{\infty}g'(x)\phi(x)\,dx.

The boundary term is $0$ because $g$ is bounded and $\phi(x)\to 0$ . Hence

\mathbb{E}[Xg(X)]=\mathbb{E}[g'(X)].

Now prove sufficiency. Suppose that for every bounded continuous $g$ with bounded continuous $g'$ ,

\mathbb{E}[Xg(X)]=\mathbb{E}[g'(X)].

We show that $X\sim N(0,1)$ .

Let $Z\sim N(0,1)$ . For any bounded continuous $h$ , define

g_0(x) = e^{x^2/2} \int_{-\infty}^{x} e^{-y^2/2}\bigl(h(y)-\mathbb{E}h(Z)\bigr)\,dy.

Since

\int_{-\infty}^{\infty} e^{-y^2/2}\bigl(h(y)-\mathbb{E}h(Z)\bigr)\,dy =0,

we may also write

g_0(x) = -e^{x^2/2} \int_{x}^{\infty} e^{-y^2/2}\bigl(h(y)-\mathbb{E}h(Z)\bigr)\,dy.

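Using the first formula for $x\leq 0$ and the second for $x\geq 0$, the standard normal tail estimate $\int_x^\infty e^{-y^2/2}\,dy\leq e^{-x^2/2}/x$ for $x>0$ shows that $g_0(x)$ and $xg_0(x)$ are bounded. Hence $g_0$ is bounded and continuous, and $g_0'$, computed next, is also bounded and continuous.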

Differentiating,

g_0'(x) = x e^{x^2/2} \int_{-\infty}^{x} e^{-y^2/2}\bigl(h(y)-\mathbb{E}h(Z)\bigr)\,dy + h(x)-\mathbb{E}h(Z).

Thus

g_0'(x)=xg_0(x)+h(x)-\mathbb{E}h(Z),

that is, $g_0$ solves the Stein equation

g_0'(x)-xg_0(x)=h(x)-\mathbb{E}h(Z).

Using the assumption with $g=g_0$ ,

\mathbb{E}[Xg_0(X)]=\mathbb{E}[g_0'(X)].

Hence

\mathbb{E}\bigl[g_0'(X)-Xg_0(X)\bigr]=0.

By the Stein equation,

g_0'(X)-Xg_0(X)=h(X)-\mathbb{E}h(Z).

Therefore

\mathbb{E}h(X)-\mathbb{E}h(Z)=0,

that is,

\mathbb{E}h(X)=\mathbb{E}h(Z)

for every bounded continuous $h$ . Thus $X$ and $Z$ have the same distribution, and

X\sim N(0,1).

This proves the equivalence.
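The identity can be tested numerically for a specific test function. The sketch below takes $g=\sin$, an illustrative choice, and evaluates both sides for a standard normal variable by the trapezoid rule; it then shows that a fair $\pm1$ sign, which is not standard normal, violates the identity. The helper `normal_expectation` and its truncation interval are assumptions of this sketch.

```python
import math

def normal_expectation(f, lo=-12.0, hi=12.0, steps=24000):
    """E[f(Z)] for Z ~ N(0,1), by the trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * f(x) * math.exp(-0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)

g = math.sin          # bounded, with bounded derivative cos
g_prime = math.cos

lhs_normal = normal_expectation(lambda x: x * g(x))   # E[Z g(Z)]
rhs_normal = normal_expectation(g_prime)              # E[g'(Z)]
expected = math.exp(-0.5)                             # E[cos Z] = exp(-1/2)

# A fair +-1 sign X is not standard normal, and the identity fails for it.
lhs_sign = 0.5 * (1 * g(1.0) + (-1) * g(-1.0))        # E[X g(X)] = sin(1)
rhs_sign = 0.5 * (g_prime(1.0) + g_prime(-1.0))       # E[g'(X)] = cos(1)

print(lhs_normal, rhs_normal, expected)
print(lhs_sign, rhs_sign)
```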

Remark

The main point is the basic Stein characterization

X\sim N(0,1) \quad\Longleftrightarrow\quad \mathbb{E}[Xg(X)]=\mathbb{E}[g'(X)]

for a large enough class of test functions $g$ .

The necessary part comes from the special identity for the normal density,

\phi'(x)=-x\phi(x),

which lets us turn $\mathbb{E}[Xg(X)]$ into $\mathbb{E}[g'(X)]$ by integration by parts.

For sufficiency, we want to prove that $X$ and a standard normal $Z$ have the same distribution. It is enough to show that for every bounded continuous $h$ ,

\mathbb{E}h(X)=\mathbb{E}h(Z).

The constructed function $g_0$ solves the Stein equation

g_0'(x)-xg_0(x)=h(x)-\mathbb{E}h(Z).

Putting $g_0$ into the assumed identity gives

\mathbb{E}h(X)=\mathbb{E}h(Z).

So $X$ must be standard normal.

There are similar Stein characterizations for other classical distributions, such as the exponential and Poisson distributions. For example:

(Stein characterization of the exponential distribution) Let $\lambda>0$ , and let $W$ be a continuous random variable supported on $(0,\infty)$ with density $q$ . Under suitable regularity conditions, prove that

W\sim \operatorname{Exp}(\lambda)

if and only if for every $f\in C_c^1(0,\infty)$ ,

\mathbb{E}f'(W)=\lambda\mathbb{E}f(W).
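The forward direction of the exponential identity is an integration by parts with vanishing boundary terms, and it can be checked numerically. In the sketch below the rate `LAM`, the test function `f`, and the quadrature helper are all illustrative choices, not part of the exercise.

```python
import math

LAM = 1.7  # an arbitrary rate chosen for illustration

def f(x):
    """A C^1 test function supported in [1, 3], a compact subset of (0, inf)."""
    return ((x - 1.0) * (3.0 - x)) ** 2 if 1.0 < x < 3.0 else 0.0

def f_prime(x):
    """Derivative of f; it vanishes at both endpoints, so f is C^1."""
    if not 1.0 < x < 3.0:
        return 0.0
    u = (x - 1.0) * (3.0 - x)
    return 2.0 * u * (4.0 - 2.0 * x)

def exp_expectation(func, lam, lo=1.0, hi=3.0, steps=20000):
    """E[func(W)] for W ~ Exp(lam), by the trapezoid rule over the support of func."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * func(x) * lam * math.exp(-lam * x)
    return total * h

lhs = exp_expectation(f_prime, LAM)        # E f'(W)
rhs = LAM * exp_expectation(f, LAM)        # lambda * E f(W)
print(lhs, rhs)
```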

Extension: From the χ² distribution to the Wishart distribution

Reading map

Probability already has plenty of named distributions. Adding one or two more is not the main issue. The useful path is this:

\text{normal sample} \quad\Longrightarrow\quad \text{orthogonal decomposition} \quad\Longrightarrow\quad \text{sum of squares / sum of outer products}.

In one dimension, this path gives the $\chi^2$ distribution and explains why $\bar X$ and $s^2$ are independent. In several dimensions, what should replace it?

1. One dimension: a sum of squares loses one direction

First recall a small fact from the course notes. If

Z_1,\ldots,Z_\nu\stackrel{\mathrm{iid}}{\sim}N(0,1),

then

\sum_{i=1}^{\nu}Z_i^2\sim \chi^2_\nu.

This is the squared length of a standard normal vector in $\mathbb R^\nu$ . Equivalently, $\chi^2_\nu$ is the sum of $\nu$ independent squared standard normals, or the square of a random radius.

Now recall the normal-sample theorem. If

X_1,\ldots,X_n\stackrel{\mathrm{iid}}{\sim}N(\mu,\sigma^2), \qquad s^2=\frac1{n-1}\sum_{i=1}^{n}(X_i-\bar X)^2,

then

\bar X\perp\!\!\!\perp s^2, \qquad \frac{(n-1)s^2}{\sigma^2}\sim \chi^2_{n-1}.

The number $n-1$ is not an arbitrary convention. The vector

(X_1-\mu,\ldots,X_n-\mu)

lives in an $n$ -dimensional space, but the sample mean uses the special direction

\operatorname{span}\{(1,\ldots,1)\}.

After subtracting $\bar X$ , the residual vector

(X_1-\bar X,\ldots,X_n-\bar X)

is orthogonal to that direction, so it lives in an $(n-1)$ -dimensional subspace. Its squared length, after scaling by $\sigma^2$ , has the $\chi^2_{n-1}$ distribution.

Two Gaussian facts are being used here. First, a standard normal vector is unchanged in distribution under orthogonal rotations. Second, the projections of a standard normal vector onto orthogonal subspaces are independent, not merely uncorrelated. The second fact is special to the Gaussian setting.

So the geometric source of $n-1$ is simple: estimating the mean uses one direction in the sample space.

2. What changes in p dimensions?

Now replace each scalar observation by a $p$ -dimensional vector. We often use $p$ for dimension, especially in high-dimensional problems. Let

Y_1,\ldots,Y_\nu\stackrel{\mathrm{iid}}{\sim}N_p(0,\Sigma).

For vectors, the natural analogue of a square is not $Y_i^2$ , but the outer product

Y_iY_i^\top.

Thus the matrix version of a sum of squares is

W=\sum_{i=1}^{\nu}Y_iY_i^\top.

We write

W\sim W_p(\Sigma,\nu),

and say that $W$ has a Wishart distribution with scale matrix $\Sigma$ and $\nu$ degrees of freedom.

This is the same idea as the $\chi^2_\nu$ distribution. When $p=1$ , each $Y_i$ is just a scalar with $Y_i\sim N(0,\sigma^2)$ , so

W=\sum_{i=1}^{\nu}Y_i^2 =\sigma^2\sum_{i=1}^{\nu}Z_i^2 \sim \sigma^2\chi^2_\nu.

In this sense, the Wishart distribution is a matrix-valued extension of the $\chi^2$ distribution.

For example, when $p=2$ ,

W= \begin{pmatrix} \sum_i Y_{i1}^2 & \sum_i Y_{i1}Y_{i2}\\ \sum_i Y_{i1}Y_{i2} & \sum_i Y_{i2}^2 \end{pmatrix}.

The diagonal entries record sums of squares in each coordinate. The off-diagonal entries record cross-products between coordinates. A $\chi^2$ variable keeps only length; a Wishart matrix also keeps the relationships between directions.
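A small worked instance of the $p=2$ case, with made-up data, shows how the entries arise. The helper names below are illustrative.

```python
def outer(y):
    """Outer product y y^T of a vector y, as a list of rows."""
    return [[a * b for b in y] for a in y]

def sum_of_outer_products(ys):
    """W = sum_i y_i y_i^T for a list of vectors of equal length p."""
    p = len(ys[0])
    W = [[0.0] * p for _ in range(p)]
    for y in ys:
        for r, row in enumerate(outer(y)):
            for c, val in enumerate(row):
                W[r][c] += val
    return W

ys = [(1.0, 2.0), (0.5, -1.0), (-2.0, 0.0)]   # made-up data: nu = 3, p = 2
W = sum_of_outer_products(ys)

# Diagonal: sums of squares per coordinate. Off-diagonal: sums of cross-products.
assert W[0][0] == sum(y[0] ** 2 for y in ys)
assert W[0][1] == W[1][0] == sum(y[0] * y[1] for y in ys)

# For any direction a, a^T W a is the sum of squared projections, hence >= 0.
a = (1.0, -1.0)
quad = sum(a[r] * W[r][c] * a[c] for r in range(2) for c in range(2))
print(W, quad)
```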

3. Main theorem: the sample covariance matrix is Wishart

Let

X_1,\ldots,X_n\stackrel{\mathrm{iid}}{\sim}N_p(\mu,\Sigma),

and define the sample mean vector and sample covariance matrix by

\bar X=\frac1n\sum_{i=1}^{n}X_i, \qquad S=\frac1{n-1}\sum_{i=1}^{n}(X_i-\bar X)(X_i-\bar X)^\top.

Then

(n-1)S\sim W_p(\Sigma,n-1), \qquad \bar X\perp\!\!\!\perp S.

If $n-1<p$, this distribution is singular: the rank of $(n-1)S$ is at most $n-1$, so the matrix cannot be positive definite. The construction above still makes sense, but the usual density on the positive definite cone applies only when $\Sigma$ is positive definite and the degrees of freedom satisfy $\nu>p-1$.

This is exactly the multivariate version of

\bar X\perp\!\!\!\perp s^2, \qquad \frac{(n-1)s^2}{\sigma^2}\sim\chi^2_{n-1}.

In one dimension, after removing the sample mean, the residual sum of squares is $\chi^2$ . In several dimensions, after removing the sample mean vector, the residual sum of outer products is Wishart.

One-dimensional normal sample	Multivariate normal sample
square $(X_i-\bar X)^2$	outer product $(X_i-\bar X)(X_i-\bar X)^\top$
sum of squares	sum of outer products
$\chi^2_{n-1}$	$W_p(\Sigma,n-1)$
$\bar X\perp\!\!\!\perp s^2$	$\bar X\perp\!\!\!\perp S$

4. Proof: rotate the sample space

We do not start from the Wishart density. The density is useful, but if it is the first thing you see, Wishart may look like a pile of determinants and traces. Orthogonal decomposition is a better first view.

Put the data into an $n\times p$ matrix

X= \begin{pmatrix} X_1^\top\\ \vdots\\ X_n^\top \end{pmatrix}, \qquad \mathbf 1_n=(1,\ldots,1)^\top.

Choose an $n\times n$ orthogonal matrix $H$ whose first row is

\frac1{\sqrt n}\mathbf 1_n^\top.

Assume $\Sigma$ is positive definite, so that $\Sigma^{-1/2}$ exists, and define the standardized data matrix

Z=(X-\mathbf 1_n\mu^\top)\Sigma^{-1/2}.

The rows of $Z$ are independent $N_p(0,I_p)$ random vectors. Left multiplication by an orthogonal matrix only rotates the sample-index direction, so

U=HZ

again has independent $N_p(0,I_p)$ rows. Write the $j$ -th row as $u_j^\top$ , where $u_j\in\mathbb R^p$ .

The first row is exactly the mean direction:

u_1^\top =\frac1{\sqrt n}\mathbf 1_n^\top Z =\sqrt n\,(\bar X-\mu)^\top\Sigma^{-1/2}.

Thus $u_1$ contains the information in $\bar X$ .

The remaining rows $u_2^\top,\ldots,u_n^\top$ are residual directions. Let

P_0=\frac1n\mathbf 1_n\mathbf 1_n^\top, \qquad P_1=I_n-P_0.

Here $P_0$ is the projection onto the mean direction, and $P_1$ is the projection onto its orthogonal complement. Since the first row of $H$ is $\mathbf 1_n^\top/\sqrt n$ ,

P_1 =H^\top \begin{pmatrix} 0&0\\ 0&I_{n-1} \end{pmatrix} H.

Now compute the residual sum of outer products:

\begin{aligned} (n-1)S &=(X-\mathbf 1_n\bar X^\top)^\top(X-\mathbf 1_n\bar X^\top)\\ &=X^\top P_1X\\ &=(X-\mathbf 1_n\mu^\top)^\top P_1(X-\mathbf 1_n\mu^\top)\\ &=\Sigma^{1/2}Z^\top P_1Z\Sigma^{1/2}\\ &=\Sigma^{1/2}\left(\sum_{j=2}^{n}u_ju_j^\top\right)\Sigma^{1/2}\\ &=\sum_{j=2}^{n}(\Sigma^{1/2}u_j)(\Sigma^{1/2}u_j)^\top. \end{aligned}

The third line uses $P_1\mathbf 1_n=0$ : the residual projection removes any constant mean direction.

The last line is a sum of $n-1$ independent outer products of $N_p(0,\Sigma)$ vectors. Therefore

(n-1)S\sim W_p(\Sigma,n-1).

Also, $\bar X$ depends only on $u_1$ , while $S$ depends only on $u_2,\ldots,u_n$ . These vectors are independent, so

\bar X\perp\!\!\!\perp S.
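The algebraic core of the rotation argument, $X^\top P_1X=\sum_{j\geq2}v_jv_j^\top$ with $V=HX$, can be verified on a small made-up data set. The sketch below uses a Helmert matrix as one concrete choice of $H$; this choice and all helper names are assumptions of the example, since the proof only needs some orthogonal $H$ with the stated first row.

```python
import math

def helmert(n):
    """An n x n orthogonal matrix whose first row is (1, ..., 1)/sqrt(n)."""
    H = [[1.0 / math.sqrt(n)] * n]
    for j in range(1, n):
        scale = 1.0 / math.sqrt(j * (j + 1))
        # j entries equal to scale, then -j*scale, then zeros: orthogonal to the ones vector.
        H.append([scale] * j + [-j * scale] + [0.0] * (n - j - 1))
    return H

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

X = [[1.0, 2.0], [0.0, 1.5], [3.0, -1.0], [2.0, 2.0], [-1.0, 0.5]]  # made-up data: n = 5, p = 2
n = len(X)
xbar = [sum(col) / n for col in zip(*X)]
centered = [[x - m for x, m in zip(row, xbar)] for row in X]
scatter = matmul(transpose(centered), centered)      # (n-1) S

H = helmert(n)
V = matmul(H, X)                                     # rotated rows v_1, ..., v_n
rotated = matmul(transpose(V[1:]), V[1:])            # sum over j >= 2 of v_j v_j^T

print(scatter)
print(rotated)
```

The first rotated row equals $\sqrt n\,\bar X^\top$, and the remaining $n-1$ rows reproduce the residual sum of outer products.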

This already contains the main idea of Cochran's theorem. Section 5 restates the same argument in the projection-matrix form used in multivariate statistics.

Why the degrees of freedom are n-1, not n-p

Subtracting $\bar X$ removes one direction in the sample-index space, namely $(1,\ldots,1)$ . It does not remove $p$ directions. Each remaining residual direction is still a full $p$ -dimensional vector. Hence the Wishart degrees of freedom are $n-1$ .

5. Cochran's theorem: a projection still gives Wishart

The more general form is this. Above we only projected away the mean direction. Cochran's theorem says that any symmetric idempotent projection cuts out a Wishart piece from a normal data matrix.

Theorem: Cochran's theorem

Let

z_1,\ldots,z_m\stackrel{\mathrm{iid}}{\sim}N_p(0,\Sigma), \qquad Z= \begin{pmatrix} z_1^\top\\ \vdots\\ z_m^\top \end{pmatrix}.

If $P$ is an $m\times m$ symmetric idempotent matrix and $r=\operatorname{rank}(P)$ , then

Z^\top PZ\sim W_p(\Sigma,r), \qquad Z^\top(I_m-P)Z\sim W_p(\Sigma,m-r),

and the two random matrices are independent.

More generally, if $P_1,\ldots,P_k$ are pairwise orthogonal symmetric idempotent matrices and $\sum_{a=1}^kP_a=I_m$ , then

Z^\top P_aZ\sim W_p(\Sigma,\operatorname{rank}(P_a)), \qquad a=1,\ldots,k,

and these matrices are independent.

Proof

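We prove the statement for one projection $P$; the general case is the same, with $H$ chosen so that its rows split into orthonormal bases of the ranges of $P_1,\ldots,P_k$. Since $P$ is symmetric and idempotent, it is the orthogonal projection onto an $r$-dimensional subspace. Hence there is an orthogonal matrix $H$ such that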

P = H^\top \begin{pmatrix} I_r&0\\ 0&0 \end{pmatrix} H, \qquad I_m-P = H^\top \begin{pmatrix} 0&0\\ 0&I_{m-r} \end{pmatrix} H.

Set $Y=HZ$ . Left multiplication by an orthogonal matrix only rotates the sample-index direction, so the rows of $Y$ are still independent $N_p(0,\Sigma)$ vectors. Split the rows as

Y= \begin{pmatrix} Y_1\\ Y_2 \end{pmatrix}, \qquad Y_1\in\mathbb R^{r\times p},\quad Y_2\in\mathbb R^{(m-r)\times p}.

Then

Z^\top PZ=Y_1^\top Y_1, \qquad Z^\top(I_m-P)Z=Y_2^\top Y_2.

The blocks $Y_1$ and $Y_2$ use disjoint normal rows, so they are independent. By the definition of the Wishart distribution, $Y_1^\top Y_1\sim W_p(\Sigma,r)$ and $Y_2^\top Y_2\sim W_p(\Sigma,m-r)$ .

For the sample covariance matrix, take

P=I_n-\frac1n\mathbf 1_n\mathbf 1_n^\top.

This is a projection matrix with rank $n-1$ , and $P\mathbf 1_n=0$ . Therefore

(n-1)S =X^\top PX =(X-\mathbf 1_n\mu^\top)^\top P(X-\mathbf 1_n\mu^\top) \sim W_p(\Sigma,n-1).

This gives a compact proof that the sample covariance matrix has a Wishart distribution. It is shorter than rotating the sample space by hand, but the geometry is the same: the projection splits the sample-index space, and each piece contributes a sum of outer products.
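The properties of the centering projection used above can be checked directly. The sketch below verifies idempotence, orthogonality, and the rank (computed as the trace, which equals the rank for a projection), and confirms that the two pieces add up to $Z^\top Z$ for a made-up matrix `Z`. All names are illustrative.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def close(A, B, tol=1e-12):
    """Entrywise comparison of two matrices of the same shape."""
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

n = 6
P0 = [[1.0 / n] * n for _ in range(n)]        # projection onto span{1_n}
P1 = [[(1.0 if r == c else 0.0) - P0[r][c] for c in range(n)] for r in range(n)]
zero = [[0.0] * n for _ in range(n)]

assert close(matmul(P0, P0), P0) and close(matmul(P1, P1), P1)   # idempotent
assert close(matmul(P0, P1), zero)                               # pairwise orthogonal
rank_P0 = sum(P0[i][i] for i in range(n))     # trace = rank for a projection
rank_P1 = sum(P1[i][i] for i in range(n))

Z = [[0.3, -1.2], [1.1, 0.4], [-0.7, 0.9], [2.0, -0.5], [0.0, 1.3], [-1.4, -0.2]]  # made-up rows
Zt = [list(col) for col in zip(*Z)]
piece0 = matmul(Zt, matmul(P0, Z))            # Z^T P0 Z
piece1 = matmul(Zt, matmul(P1, Z))            # Z^T P1 Z
total = matmul(Zt, Z)                         # Z^T Z
pieces_sum = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(piece0, piece1)]
print(rank_P0, rank_P1)
```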

6. A note on the Wishart density

If $\nu>p-1$ and $\Sigma$ is positive definite, then $W$ is almost surely positive definite, and the Wishart density on the cone of positive definite matrices is

f(W) = \frac{ |W|^{(\nu-p-1)/2} \exp\left\{-\frac12\operatorname{tr}(\Sigma^{-1}W)\right\} }{ 2^{\nu p/2}|\Sigma|^{\nu/2}\Gamma_p(\nu/2) }, \qquad W>0,

where the multivariate Gamma function is

\Gamma_p(a) = \pi^{p(p-1)/4} \prod_{j=1}^{p}\Gamma\left(a-\frac{j-1}{2}\right).

This formula is useful, but it is not the friendliest first encounter with Wishart. Its derivation needs a Jacobian calculation on the cone of positive definite matrices, which is rather technical, so we leave it aside here.

Summary

In a one-dimensional normal sample, projecting away the mean direction leaves $n-1$ independent Gaussian residual directions, and their squared length gives $\chi^2_{n-1}$ . The Wishart theorem is the vector version of the same statement: square becomes outer product, variance becomes covariance matrix, and $\bar X\perp s^2$ becomes $\bar X\perp S$ . Cochran's theorem replaces the mean direction by a general orthogonal projection.

End-of-chapter check

The original problems and solutions in this chapter come from the corresponding TeX source files.
You can first read only the problem boxes, write down the main identities, and then open the proof or solution.
If a conclusion uses independence, countable additivity, a change-of-variables formula, or a moment condition, it is worth marking that point explicitly.