Third recitation

Jieyang Hu

Reading guide

This chapter covers discrete random variables, expectation, conditional expectation, the probabilistic method, and a midterm review.
Linearity, indicator decompositions, and conditional distributions appear again and again.
The parts on random walks, Catalan numbers, and distribution functions are worth reading slowly.

Tip. When a random quantity looks complicated, first try to write it as a sum of indicator variables. Then decide whether independence is needed.

Exercise 2.1

Note

This section is mainly about probability mass functions, mixture distributions, and symmetry. Condition first, then sum. Independence is often used to pass symmetry from the marginals to the sum.

Problem: 2.1.1

Roll a fair die first. After seeing the result, toss that many fair coins. Let $X$ be the number of heads. Find the probability mass function of $X$ .

Proof

Let $N$ be the die result. Then the possible values of $X$ are $0,1,\cdots,6$ . Since $N\ge 1$ , conditioning on $N$ gives, for $x=0,1,\cdots,6$ ,

\begin{aligned} \mathbb{P} (X=x) &=\sum_{n=\max\{x,1\}}^6 \mathbb{P} (X=x \mid N=n) \mathbb{P} (N=n)\\ &=\frac{1}{6}\sum_{n=\max\{x,1\}}^6 \binom{n}{x}2^{-n}. \end{aligned}
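As a sanity check, the conditioning step can be carried out exactly with rational arithmetic. The sketch below is only an illustration; the function name `heads_pmf` and the `sides` parameter are not from the problem. Because the die shows at least $1$ , the sum for $x=0$ starts at $n=1$ .

```python
from fractions import Fraction
from math import comb

def heads_pmf(sides=6):
    """Exact pmf of X: roll a fair die with `sides` faces, then toss that many fair coins."""
    pmf = {}
    for x in range(sides + 1):
        total = Fraction(0)
        # The die shows n in {1, ..., sides}; the event X = x needs n >= x.
        for n in range(max(x, 1), sides + 1):
            total += Fraction(comb(n, x), 2**n) * Fraction(1, sides)
        pmf[x] = total
    return pmf

pmf = heads_pmf()
for x, prob in pmf.items():
    print(x, prob)
```

The values sum to $1$ , which is a quick way to catch an off-by-one in the summation range.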

Problem: 2.1.2

The number of new posts in a unit time interval follows a Poisson distribution with parameter $\lambda$ . The numbers of new posts in disjoint time intervals are independent. Find the probability of seeing $k$ new posts in two unit intervals.

Proof

Let $X$ be the number of posts in two unit intervals. Then

\mathbb{P} (X=k) =\sum_{m+n=k}\frac{\lambda^m}{m!}e^{-\lambda } \frac{\lambda^n}{n!}e^{-\lambda } =\frac{\lambda ^k e^{-2\lambda }}{k!}\sum_{m=0}^k \binom{k}{m} =\frac{(2\lambda )^k e^{-2\lambda }}{k!}.

Problem: 2.1.3

Let $X_1,\cdots,X_n$ be independent discrete random variables, each symmetric about $0$ , meaning that $X_i$ and $-X_i$ have the same probability mass function. Prove that for every $x\in\mathbb{R}$ ,

\mathbb{P}(S_n \ge x) = \mathbb{P}(S_n \le -x),

where

S_n = X_1 + \cdots + X_n.

If independence is removed, does the conclusion still hold? Explain.

Proof

\begin{aligned} \mathbb{P} (S_n\geq x) &=\sum_{x_{1}+\cdots +x_n\geq x }\mathbb{P} (X_1= x_1,\cdots, X_n=x_n)\\ &= \sum_{x_{1}+\cdots +x_n\geq x }\mathbb{P} (X_1= x_1)\cdots\mathbb{P} ( X_n=x_n)\\ &= \sum_{x_{1}+\cdots +x_n\geq x }\mathbb{P} (X_1= -x_1)\cdots\mathbb{P} ( X_n=-x_n)\\ &=\sum_{x_{1}+\cdots +x_n\leq -x }\mathbb{P} (X_1= x_1,\cdots, X_n=x_n)\\ &=\mathbb{P} (S_n \leq -x). \end{aligned}

Without independence, the conclusion can fail. Take $n=2$ and let

\mathbb{P}\bigl((X_1,X_2)=(-1,0)\bigr) =\mathbb{P}\bigl((X_1,X_2)=(0,-1)\bigr) =\mathbb{P}\bigl((X_1,X_2)=(1,1)\bigr) =\frac{1}{3}.

Then both $X_1$ and $X_2$ are uniform on $\{-1,0,1\}$ , so each is symmetric about $0$ . But

\mathbb{P}(S_2\geq 2)=\frac{1}{3}\neq 0=\mathbb{P}(S_2\leq -2).

Problem: 2.1.4

Let $X$ have probability mass function $\mathbb{P}(X=x_k)=p_k$ , $k=1,2,\cdots,n$ . The Shannon entropy is

H(X)=-\sum_{k=1}^n p_k\ln p_k.

For fixed $n$ , which distribution of $X$ maximizes $H(X)$ ?

Proof

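Since $\ln$ is concave, Jensen's inequality gives

H(X)=\sum_{k:\,p_k>0} p_k\ln \frac{1}{p_k} \leq \ln\left(\sum_{k:\,p_k>0} p_k\cdot \frac{1}{p_k}\right) \leq \ln n,

where terms with $p_k=0$ contribute $0$ by the convention $0\ln 0=0$ .

Because $\ln$ is strictly concave, equality holds if and only if $p_1=p_2=\cdots=p_n=\frac{1}{n}$ , that is, when $X$ is uniform on its $n$ possible values.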

Exercise 2.2

Note

For expectation problems, common tools are generating functions, tail-sum formulas, and indicator variables. For higher moments, first try breaking the random variable into Bernoulli indicators.

Problem: 2.2.1

For $X\sim B(n,p)$ , find $\mathbb{E}[X^3]$ .

Proof

Write $X=\sum_{k=1}^n I_k$ , where $I_1,\cdots,I_n$ are i.i.d. indicator random variables with $\mathbb{P}(I_k=1)=p$ . Then

\begin{aligned} \mathbb{E} (X^3) &= \mathbb{E} \left( \left( \sum_{k=1}^n I_k \right)^3 \right) \\ &= n\mathbb{E} (I_1^3) +6\binom{n}{2}\mathbb{E} (I_1 ^{2} I_2) +6\binom{n}{3}\mathbb{E} (I_1 I_2 I_3)\\ &=n(n-1)(n-2)p^3 +3n(n-1)p^{2} +np. \end{aligned}
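The closed form can be compared with the defining sum $\sum_k k^3\,\mathbb{P}(X=k)$ . This is a minimal check in exact arithmetic; the function names are illustrative, and `p` is assumed to be a `Fraction`.

```python
from fractions import Fraction
from math import comb

def third_moment_direct(n, p):
    """E[X^3] for X ~ B(n, p), summed directly over the pmf."""
    return sum(
        Fraction(k**3) * comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
    )

def third_moment_formula(n, p):
    """E[X^3] from the indicator expansion."""
    return n * (n - 1) * (n - 2) * p**3 + 3 * n * (n - 1) * p**2 + n * p

p = Fraction(1, 3)
for n in range(1, 8):
    print(n, third_moment_direct(n, p), third_moment_formula(n, p))
```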

Problem: 2.2.2

Let the discrete random variable $X$ have probability mass function

f(x) = \begin{cases} \dfrac{1}{x(x+1)}, & x = 1, 2, \cdots, \\ 0, & \text{otherwise}. \end{cases}

For which real numbers $\alpha$ is the $\alpha$ -th moment $\mathbb{E}[X^\alpha]$ finite?

Proof

If $\alpha<1$ , then

\mathbb{E} [X^\alpha ] =\sum_{k=1}^{\infty}\frac{1}{k^{1-\alpha }(k+1)} \leq \sum_{k=1}^{\infty} \frac{1}{k^{2-\alpha } }<+\infty .

If $\alpha\ge 1$ , then

\mathbb{E} [X^\alpha ]\geq \sum_{k=1}^{\infty} \frac{1}{k+1}=+\infty .

Thus $\mathbb{E}[X^\alpha]<\infty$ exactly when $\alpha<1$ .

Problem: 2.2.3

Let $X$ be the total number of intersections crossed by a self-driving ride-hailing car in one day, and suppose

\mathbb{P}(X = k) = (1-p)^{k-1}p,\quad 0<p<1,\quad k=1,2,\cdots.

The traffic lights at different intersections work independently, and the car meets a red light at each intersection with probability $p$ .

$1$ Find the expectation and variance of the total number of intersections crossed.

$2$ Find the expected number of red lights met in one day.

Proof

Let $q=1-p$ . From

\sum_{k=0}^\infty q^k=\frac{1}{1-q},

differentiating gives

\sum_{k=1}^\infty kq^{k-1}=\frac{1}{(1-q)^2},\qquad \sum_{k=2}^\infty k(k-1)q^{k-2}=\frac{2}{(1-q)^3}.

Thus

\mathbb{E}[X]=\sum_{k=1}^\infty kpq^{k-1} =p\cdot \frac{1}{(1-q)^2}=\frac{1}{p}.

Also

\mathbb{E}[X(X-1)] =\sum_{k=2}^\infty k(k-1)pq^{k-1} =pq\sum_{k=2}^\infty k(k-1)q^{k-2} =\frac{2q}{p^2}.

Therefore

\mathbb{E}[X^2] =\mathbb{E}[X(X-1)]+\mathbb{E}[X] =\frac{2q}{p^2}+\frac{1}{p} =\frac{2-p}{p^2},

and

\operatorname{Var}(X) =\mathbb{E}[X^2]-\mathbb{E}[X]^2 =\frac{2-p}{p^2}-\frac{1}{p^2} =\frac{1-p}{p^2}.

Let $Y$ be the number of red lights met in one day. Given $X=n$ , we have $Y\mid X=n\sim B(n,p)$ , so

\mathbb{E}[Y\mid X=n]=np.

Therefore

\mathbb{E}[Y]=\mathbb{E}\bigl(\mathbb{E}[Y\mid X]\bigr)=p\mathbb{E}[X]=1.

Remark

Here $X$ is a geometric random variable with parameter $p$ . One may directly use

\mathbb{E}[X]=\frac{1}{p},\qquad \operatorname{Var}(X)=\frac{1-p}{p^2}.

Problem: 2.2.4

For a nonnegative integer-valued random variable $X$ , prove that

\mathbb{E}[X] = \sum_{n=0}^{\infty} \mathbb{P}(X > n).

Proof

Since $X$ is nonnegative and integer-valued,

X=\sum_{n=0}^{\infty}\mathbf{1}_{\{X>n\}}.

Every term of the sum is nonnegative, so expectation and summation can be exchanged (monotone convergence). Taking expectations gives

\mathbb{E}[X] =\sum_{n=0}^{\infty}\mathbb{E}\bigl[\mathbf{1}_{\{X>n\}}\bigr] =\sum_{n=0}^{\infty}\mathbb{P}(X>n).

Remark

For a nonnegative real-valued random variable $X$ , use

\lfloor X\rfloor \leq X \leq \lceil X\rceil.

Applying the previous result to $\lfloor X\rfloor$ and $\lceil X\rceil$ gives

\mathbb{E}[\lfloor X\rfloor] =\sum_{n=0}^{\infty}\mathbb{P}(\lfloor X\rfloor>n) =\sum_{n=1}^{\infty}\mathbb{P}(X\geq n),

and

\mathbb{E}[\lceil X\rceil] =\sum_{n=0}^{\infty}\mathbb{P}(\lceil X\rceil>n) =\sum_{n=0}^{\infty}\mathbb{P}(X>n).

Thus

\sum_{n=1}^{\infty}\mathbb{P}(X\geq n) \leq \mathbb{E}[X] \leq \sum_{n=0}^{\infty}\mathbb{P}(X>n).

This is a useful way to estimate an expectation by tail probabilities.

Problem: 2.2.6

The random graph model $G(n,p)$ has vertex set $V=\{1,2,\cdots,n\}$ . Each pair of vertices is connected by an edge with probability $p$ , independently of all other pairs. The degree $D_i$ of vertex $i$ is the number of edges incident to $i$ .

$1$ Find the distribution and expectation of $D_i$ .

$2$ Let $X$ be the number of triangles in $G(n,p)$ . Find $\mathbb{E}[X]$ and $\operatorname{Var}(X)$ .

Proof

For a fixed vertex $i$ , the degree $D_i$ counts how many of the other $n-1$ possible edges appear. Hence

D_i\sim B(n-1,p),\qquad \mathbb{E}[D_i]=(n-1)p.

Let $\mathcal{T}$ be the set of all triangles. For each $T\in\mathcal{T}$ , let $I_T$ be the indicator that triangle $T$ appears. Then

X=\sum_{T\in\mathcal{T}}I_T,\qquad |\mathcal{T}|=\binom{n}{3}.

Therefore

\mathbb{E}[X]=\sum_{T\in\mathcal{T}}\mathbb{E}[I_T]=\binom{n}{3}p^3.

For the variance,

\operatorname{Var}(X) =\sum_{T\in\mathcal{T}}\operatorname{Var}(I_T) +2\sum_{T<S}\operatorname{Cov}(I_T,I_S).

Here

\operatorname{Var}(I_T)=p^3(1-p^3).

If two different triangles have no common edge, their edge sets are independent, so the covariance is $0$ . If they share one edge, then

\mathbb{E}[I_TI_S]=p^5,\qquad \operatorname{Cov}(I_T,I_S)=p^5-p^6=p^5(1-p).

The number of unordered pairs of triangles sharing an edge is

\binom{n}{2}\binom{n-2}{2}=6\binom{n}{4}.

Thus

\operatorname{Var}(X) =\binom{n}{3}p^3(1-p^3)+12\binom{n}{4}p^5(1-p).
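For small $n$ the two formulas can be checked by enumerating every graph on $n$ vertices and weighting it by its probability. The sketch below does this in exact arithmetic; the function names are illustrative, `p` is assumed to be a `Fraction`, and the enumeration is only practical for $n\le 5$ or so.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def triangle_moments(n, p):
    """Exact E[X] and Var(X) for the triangle count in G(n, p), by enumerating all graphs."""
    pairs = list(combinations(range(n), 2))
    triangles = list(combinations(range(n), 3))
    mean = Fraction(0)
    second = Fraction(0)
    for present in product([0, 1], repeat=len(pairs)):
        edges = {pair for pair, bit in zip(pairs, present) if bit}
        m = len(edges)
        weight = p**m * (1 - p) ** (len(pairs) - m)
        count = sum(
            1 for a, b, c in triangles
            if (a, b) in edges and (a, c) in edges and (b, c) in edges
        )
        mean += weight * count
        second += weight * count**2
    return mean, second - mean**2

def triangle_formulas(n, p):
    """The closed forms derived above."""
    mean = comb(n, 3) * p**3
    var = comb(n, 3) * p**3 * (1 - p**3) + 12 * comb(n, 4) * p**5 * (1 - p)
    return mean, var

p = Fraction(1, 3)
print(triangle_moments(4, p), triangle_formulas(4, p))
```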

Exercise 2.3

Note

The probabilistic method usually starts with a random construction, then uses an expectation to prove existence. The final conclusion is deterministic; randomness is only a proof tool.

Problem: 2.3.1

Daniel Bernoulli described a diffusion model in 1769. Bottle A contains $n$ red balls, and bottle B contains $n$ blue balls. At each step, choose one ball from each bottle and exchange the two balls. Find the expected number of red balls in bottle A after $k$ steps.

Proof

For each ball that started in bottle A, let $p_k$ be the probability that it is still in bottle A after the $k$ -th exchange. Then $p_0=1$ , and

p_{k+1}=\frac{n-1}{n}p_k+\frac{1}{n}(1-p_k).

Solving the recursion gives

p_k=\frac{1}{2}\left[\left(\frac{n-2}{n}\right)^k+1\right].

Label the $n$ balls originally in bottle A. Let $I_i$ be the indicator that the $i$ -th such ball is in bottle A after $k$ exchanges. If $N=I_1+\cdots+I_n$ , then

\mathbb{E}(N) =\sum_{i=1}^n\mathbb{E}(I_i) =np_k =\frac{n}{2}\left[\left(\frac{n-2}{n}\right)^k+1\right].
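The answer can also be checked against the Markov chain for the number $r$ of red balls in bottle A. In the sketch below the transition probabilities are an assumption read off from the model: $r$ drops by one when a red ball leaves A and a blue ball leaves B (probability $r^2/n^2$ ), and rises by one when a blue ball leaves A and a red ball leaves B (probability $(n-r)^2/n^2$ ). The function names are illustrative.

```python
from fractions import Fraction

def expected_red_chain(n, k):
    """E[number of red balls in A after k exchanges], from the exact distribution of r."""
    dist = {n: Fraction(1)}
    for _ in range(k):
        new = {}
        for r, w in dist.items():
            down = Fraction(r * r, n * n)        # red leaves A, blue enters
            up = Fraction((n - r) ** 2, n * n)   # blue leaves A, red enters
            stay = 1 - down - up
            for r2, q in ((r - 1, down), (r + 1, up), (r, stay)):
                if q:
                    new[r2] = new.get(r2, Fraction(0)) + w * q
        dist = new
    return sum(r * w for r, w in dist.items())

def closed_form(n, k):
    """The formula (n/2) * (((n-2)/n)^k + 1)."""
    return Fraction(n, 2) * (Fraction(n - 2, n) ** k + 1)

for k in range(6):
    print(k, expected_red_chain(5, k), closed_form(5, k))
```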

Problem: 2.3.2

Let $G=(V,E)$ be a finite graph. For a vertex set $W$ and an edge $e\in E$ , define

\mathbf{1}_W(e)= \begin{cases} 1, & e \text{ connects } W \text{ and } W^c,\\ 0, & \text{otherwise}. \end{cases}

Let

N_W=\sum_{e\in E}\mathbf{1}_W(e).

Use the probabilistic method to prove that there exists $W\subset V$ such that $N_W\ge |E|/2$ .

Proof

Choose each vertex independently with probability $\frac12$ , and let $W$ be the chosen set. For a fixed edge, the probability that its two endpoints are separated by $W$ and $W^c$ is $2\cdot\frac12(1-\frac12)=\frac12$ . Hence

\mathbb{E}(N_W) =\sum_{e\in E}\mathbb{E}(\mathbf{1}_W(e)) =\frac{|E|}{2}.

Therefore at least one choice of $W$ satisfies $N_W\ge |E|/2$ .
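On a small graph the argument can be seen directly: averaging the cut size over all $2^{|V|}$ subsets gives exactly $|E|/2$ , so the largest cut is at least that. The graph below is an arbitrary example chosen for illustration, and `cut_sizes` is a hypothetical helper.

```python
from itertools import product

def cut_sizes(num_vertices, edges):
    """Cut size N_W for every subset W, encoded as a 0/1 vector over the vertices."""
    sizes = []
    for side in product([0, 1], repeat=num_vertices):
        sizes.append(sum(1 for u, v in edges if side[u] != side[v]))
    return sizes

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
sizes = cut_sizes(5, edges)
average = sum(sizes) / len(sizes)
print(average, max(sizes))
```

The uniform average over subsets is the same as the expectation under independent fair coin flips per vertex, which is the random construction used in the proof.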

Problem: 2.3.3

A box contains $n$ balls labeled $1,2,\cdots,n$ . Choose $k$ balls uniformly without replacement, and let $X$ be the sum of their labels. Find the expectation and variance of $X$ .

Proof

For each $i=1,2,\cdots,n$ , let

I_i=\mathbf{1}_{\{\text{ball }i\text{ is chosen}\}}.

Then

X=\sum_{i=1}^n iI_i.

Each ball is chosen with probability $\frac{k}{n}$ , so

\mathbb{E}(X)=\sum_{i=1}^n i\mathbb{E}(I_i) =\frac{k}{n}\sum_{i=1}^n i =\frac{k(n+1)}{2}.

For the variance,

\operatorname{Var}(X) =\sum_{i=1}^n i^2\operatorname{Var}(I_i) +2\sum_{1\leq i<j\leq n}ij\operatorname{Cov}(I_i,I_j).

We have

\operatorname{Var}(I_i) =\frac{k}{n}\left(1-\frac{k}{n}\right) =\frac{k(n-k)}{n^2}.

For $i\ne j$ ,

\mathbb{P}(I_i=1,I_j=1) =\frac{\binom{n-2}{k-2}}{\binom{n}{k}} =\frac{k(k-1)}{n(n-1)}.

Thus

\operatorname{Cov}(I_i,I_j) =\frac{k(k-1)}{n(n-1)}-\frac{k^2}{n^2} =-\frac{k(n-k)}{n^2(n-1)}.

Therefore

\begin{aligned} \operatorname{Var}(X) &=\frac{k(n-k)}{n^2}\sum_{i=1}^n i^2 -\frac{2k(n-k)}{n^2(n-1)}\sum_{1\leq i<j\leq n}ij\\ &=\frac{k(n-k)}{n^2}\cdot \frac{n(n+1)(2n+1)}{6} -\frac{k(n-k)}{n^2(n-1)} \left[\left(\sum_{i=1}^n i\right)^2-\sum_{i=1}^n i^2\right]\\ &=\frac{k(n-k)(n+1)}{12}. \end{aligned}
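Both formulas can be verified by listing every $k$ -subset of the labels, since all subsets are equally likely. This is a brute-force sketch with an illustrative function name; it is only meant for small $n$ .

```python
from fractions import Fraction
from itertools import combinations

def label_sum_moments(n, k):
    """Exact mean and variance of the label sum over all k-subsets of {1, ..., n}."""
    sums = [sum(c) for c in combinations(range(1, n + 1), k)]
    mean = Fraction(sum(sums), len(sums))
    var = Fraction(sum(s * s for s in sums), len(sums)) - mean**2
    return mean, var

n, k = 7, 3
print(label_sum_moments(n, k))
print(Fraction(k * (n + 1), 2), Fraction(k * (n - k) * (n + 1), 12))
```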

Problem: 2.3.6

Let $\mathbf{v}_1,\mathbf{v}_2,\cdots,\mathbf{v}_n\in\mathbb{R}^n$ satisfy $|\mathbf{v}_i|\le 1$ for all $i$ . Let

\mathbf{w}=\sum_{i=1}^n p_i\mathbf{v}_i,\quad p_i\in[0,1].

Use the probabilistic method to prove that there exist $\varepsilon_i\in\{0,1\}$ such that

\left|\sum_{i=1}^n \varepsilon_i\mathbf{v}_i-\mathbf{w}\right| \le \frac{\sqrt n}{2}.

Proof

Choose $\varepsilon_i\in\{0,1\}$ independently, with $\mathbb{P}(\varepsilon_i=1)=p_i$ . Consider

X:=\left|\sum_{i=1}^n \varepsilon_i \mathbf{v}_i-\mathbf{w}\right|^2 =\sum_{i=1}^n(\varepsilon_i-p_i)^2|\mathbf{v}_i|^2 +2\sum_{1\le i<j\le n}(\varepsilon_i-p_i)(\varepsilon_j-p_j)\mathbf{v}_i\cdot \mathbf{v}_j.

Taking expectations, the cross terms vanish because the $\varepsilon_i$ are independent and $\mathbb{E}[\varepsilon_i-p_i]=0$ :

\begin{aligned} \mathbb{E}[X] &=\sum_{i=1}^n \mathbb{E}[(\varepsilon_i-p_i)^2]|\mathbf{v}_i|^2 +2\sum_{1\leq i<j\leq n} \mathbb{E}[(\varepsilon_i-p_i)(\varepsilon_j-p_j)]\mathbf{v}_i\cdot \mathbf{v}_j\\ &=\sum_{i=1}^n \mathbb{E}[(\varepsilon_i-p_i)^2]|\mathbf{v}_i|^2\\ &=\sum_{i=1}^n p_i(1-p_i)|\mathbf{v}_i|^2\\ &\le \frac{n}{4}. \end{aligned}

The last step uses $p_i(1-p_i)\le\frac14$ and $|\mathbf{v}_i|\le1$ . So for at least one choice of the $\varepsilon_i$ , we have $X\le n/4$ , which means

\left|\sum_{i=1}^n \varepsilon_i \mathbf{v}_i-\mathbf{w}\right| \le \frac{\sqrt n}{2}.

Exercise 2.4

Note

Here it is useful to understand conditional expectation first in the discrete case: it is the average value after some information is given. It also satisfies linearity, positivity, and the tower property.

Problem: 2.4.1

Prove the following properties of conditional expectation:

$1$ $\mathbb{E}[aY+bZ\mid X]=a\mathbb{E}[Y\mid X]+b\mathbb{E}[Z\mid X]$ for all $a,b\in\mathbb{R}$ .

$2$ If $Y\ge0$ , then $\mathbb{E}[Y\mid X]\ge0$ .

$3$ $\mathbb{E}[1\mid X]=1$ .

$4$ If $X$ and $Y$ are independent, then $\mathbb{E}[Y\mid X]=\mathbb{E}[Y]$ .

$5$ $\mathbb{E}[Yg(X)\mid X]=g(X)\mathbb{E}[Y\mid X]$ , whenever both sides are well-defined.

Proof

For any $x$ with $\mathbb{P}(X=x)>0$ , by definition

\mathbb{E}[Y\mid X=x]=\sum_y y\mathbb{P}(Y=y\mid X=x).

It is enough to prove each statement for every such $x$ .

$1$

\begin{aligned} \mathbb{E}[aY+bZ\mid X=x] &=\sum_{y,z}(ay+bz)\mathbb{P}(Y=y,Z=z\mid X=x)\\ &=a\sum_{y,z}y\mathbb{P}(Y=y,Z=z\mid X=x) +b\sum_{y,z}z\mathbb{P}(Y=y,Z=z\mid X=x)\\ &=a\mathbb{E}[Y\mid X=x]+b\mathbb{E}[Z\mid X=x]. \end{aligned}

$2$ If $Y\ge0$ , then

\mathbb{E}[Y\mid X=x]=\sum_y y\mathbb{P}(Y=y\mid X=x)\ge0.

$3$

\mathbb{E}[1\mid X=x]=1.

$4$ If $X$ and $Y$ are independent, then

\mathbb{P}(Y=y\mid X=x)=\mathbb{P}(Y=y),

\mathbb{E}[Y\mid X=x]=\sum_y y\mathbb{P}(Y=y)=\mathbb{E}[Y].

$5$ Given $X=x$ , the value $g(X)=g(x)$ is constant. Hence

\mathbb{E}[Yg(X)\mid X=x] =\mathbb{E}[Yg(x)\mid X=x] =g(x)\mathbb{E}[Y\mid X=x].

These identities hold for every relevant $x$ , so the desired statements follow.

Problem: 2.4.2

Let $X$ and $Y$ be independent Poisson random variables with parameters $\lambda_1$ and $\lambda_2$ . Find $\mathbb{E}[X\mid X+Y]$ .

Proof

Let $S=X+Y$ . For $0\le k\le n$ ,

\begin{aligned} \mathbb{P}(X=k\mid S=n) &=\frac{\mathbb{P}(X=k,Y=n-k)}{\mathbb{P}(S=n)}\\ &=\frac{\dfrac{\lambda_1^k e^{-\lambda_1}}{k!} \dfrac{\lambda_2^{n-k}e^{-\lambda_2}}{(n-k)!}} {\dfrac{(\lambda_1+\lambda_2)^n e^{-(\lambda_1+\lambda_2)}}{n!}}\\ &=\binom{n}{k} \left(\frac{\lambda_1}{\lambda_1+\lambda_2}\right)^k \left(\frac{\lambda_2}{\lambda_1+\lambda_2}\right)^{n-k}. \end{aligned}

Thus, conditional on $S=n$ , the random variable $X$ is binomial with parameters $n$ and $\dfrac{\lambda_1}{\lambda_1+\lambda_2}$ . Therefore

\mathbb{E}[X\mid S=n] =n\frac{\lambda_1}{\lambda_1+\lambda_2}.

Equivalently,

\mathbb{E}[X\mid X+Y] =\frac{\lambda_1}{\lambda_1+\lambda_2}(X+Y).

Problem: 2.4.3

Let $X,Y$ be discrete random variables with means $0$ , variances $1$ , and covariance $\rho$ . Prove that

\mathbb{E}[\max\{X^2,Y^2\}]\leq 1+\sqrt{1-\rho^2}.

Proof

Observe that

\max\{X^2,Y^2\} =\frac{X^2+Y^2+|X^2-Y^2|}{2} =\frac{X^2+Y^2+|X-Y||X+Y|}{2}.

Thus

\mathbb{E}[\max\{X^2,Y^2\}] =1+\frac{1}{2}\mathbb{E}[|X-Y||X+Y|].

By the Cauchy-Schwarz inequality,

\mathbb{E}[|X-Y||X+Y|] \leq \sqrt{\mathbb{E}[(X-Y)^2]\mathbb{E}[(X+Y)^2]}.

Also, since $X$ and $Y$ have mean $0$ , variance $1$ , and covariance $\rho$ ,

\mathbb{E}[(X-Y)^2]=\operatorname{Var}(X-Y)=2(1-\rho),

and

\mathbb{E}[(X+Y)^2]=\operatorname{Var}(X+Y)=2(1+\rho).

Therefore

\mathbb{E}[|X-Y||X+Y|] \leq \sqrt{2(1-\rho)\cdot2(1+\rho)} =2\sqrt{1-\rho^2}.

Substituting this into the expression for $\mathbb{E}[\max\{X^2,Y^2\}]$ gives

\mathbb{E}[\max\{X^2,Y^2\}] \leq 1+\sqrt{1-\rho^2}.

Problem: 2.4.5

The conditional variance of $Y$ given $X$ , denoted $\operatorname{Var}(Y\mid X)$ , is usually defined as the variance of the conditional distribution $Y\mid X$ . Using the usual formula

\operatorname{Var}(Y)=\mathbb{E}[Y^2]-\mathbb{E}[Y]^2,

we may define it directly by

\operatorname{Var}(Y\mid X) =\mathbb{E}[Y^2\mid X]-\mathbb{E}[Y\mid X]^2.

Prove that

\operatorname{Var}(Y) =\mathbb{E}[\operatorname{Var}(Y\mid X)] +\operatorname{Var}(\mathbb{E}[Y\mid X]).

Proof

By definition,

\mathbb{E}[\operatorname{Var}(Y\mid X)] =\mathbb{E}[\mathbb{E}[Y^2\mid X]] -\mathbb{E}[\mathbb{E}[Y\mid X]^2].

Also,

\mathbb{E}[\mathbb{E}[Y^2\mid X]]=\mathbb{E}[Y^2], \qquad \mathbb{E}[\mathbb{E}[Y\mid X]]=\mathbb{E}[Y].

Hence

\mathbb{E}[\operatorname{Var}(Y\mid X)] =\mathbb{E}[Y^2]-\mathbb{E}[\mathbb{E}[Y\mid X]^2].

On the other hand,

\operatorname{Var}(\mathbb{E}[Y\mid X]) =\mathbb{E}[\mathbb{E}[Y\mid X]^2] -\mathbb{E}[\mathbb{E}[Y\mid X]]^2 =\mathbb{E}[\mathbb{E}[Y\mid X]^2]-\mathbb{E}[Y]^2.

Adding the two identities gives

\mathbb{E}[\operatorname{Var}(Y\mid X)] +\operatorname{Var}(\mathbb{E}[Y\mid X]) =\mathbb{E}[Y^2]-\mathbb{E}[Y]^2 =\operatorname{Var}(Y).

Problem: 2.4.8

The 2024 Nobel Prize in Physics was awarded to Hopfield and Hinton for foundational work related to machine learning with artificial neural networks. Building on the idea of Hopfield networks, Hinton introduced Boltzmann machines. Given weights $w_{ij}=w_{ji}$ with $w_{ii}=0$ , define an $n$ -dimensional random vector

X=(X_1,\cdots,X_n)

taking values in $\{0,1\}^n$ by

\mathbb{P}(X=x)=\frac{1}{Z_n}\exp\left\{ \sum_{1\leq i<j\leq n} w_{ij}x_ix_j+\sum_{1\leq i\leq n} b_ix_i \right\},

where the partition function is

Z_n=\sum_{x\in \{0,1\}^n}\exp\left\{ \sum_{1\leq i<j\leq n} w_{ij}x_ix_j+\sum_{1\leq i\leq n} b_ix_i \right\}.

Let $X^{(k)}$ be the vector $X$ with its $k$ -th coordinate removed. Prove that

\mathbb{E}[X_k\mid X^{(k)}] =\frac{\exp\left\{ b_k+\sum_{i\neq k} w_{ki}X_i \right\}} {1+\exp\left\{ b_k+\sum_{i\neq k} w_{ki}X_i \right\}}.

Proof

Fix $x^{(k)}$ , and write

\eta=b_k+\sum_{i\neq k}w_{ki}x_i.

When $X^{(k)}=x^{(k)}$ is fixed, all factors in the joint probability that do not depend on $x_k$ can be absorbed into a constant $C$ . Thus

\mathbb{P}(X_k=x_k,X^{(k)}=x^{(k)}) =C\exp\{x_k\eta\},\qquad x_k=0,1.

Therefore

\mathbb{P}(X_k=1\mid X^{(k)}=x^{(k)}) =\frac{Ce^\eta}{C+Ce^\eta} =\frac{e^\eta}{1+e^\eta}.

Since $X_k$ only takes the values $0$ and $1$ ,

\mathbb{E}[X_k\mid X^{(k)}=x^{(k)}] =\mathbb{P}(X_k=1\mid X^{(k)}=x^{(k)}) =\frac{e^\eta}{1+e^\eta}.

This gives

\mathbb{E}[X_k\mid X^{(k)}] =\frac{\exp\left\{ b_k+\sum_{i\neq k} w_{ki}X_i \right\}} {1+\exp\left\{ b_k+\sum_{i\neq k} w_{ki}X_i \right\}}.
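A small numerical check makes the cancellation concrete: the conditional mean is a ratio of two unnormalized weights, so $Z_n$ never has to be computed. In the sketch below the weights `w` and biases `b` are arbitrary illustrative values and the function names are hypothetical.

```python
import math
from itertools import product

def energy_exponent(x, w, b):
    """The exponent sum_{i<j} w_ij x_i x_j + sum_i b_i x_i."""
    n = len(x)
    pair = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return pair + sum(b[i] * x[i] for i in range(n))

def conditional_mean(k, rest, w, b):
    """E[X_k | X^(k) = rest] from the joint weights; the constant Z_n cancels."""
    x1 = rest[:k] + (1,) + rest[k:]
    x0 = rest[:k] + (0,) + rest[k:]
    a1 = math.exp(energy_exponent(x1, w, b))
    a0 = math.exp(energy_exponent(x0, w, b))
    return a1 / (a0 + a1)

def logistic_formula(k, rest, w, b):
    """e^eta / (1 + e^eta) with eta = b_k + sum_{i != k} w_ki x_i."""
    x = rest[:k] + (0,) + rest[k:]
    eta = b[k] + sum(w[k][i] * x[i] for i in range(len(x)) if i != k)
    return math.exp(eta) / (1 + math.exp(eta))

w = [[0.0, 0.5, -1.0], [0.5, 0.0, 2.0], [-1.0, 2.0, 0.0]]  # symmetric, zero diagonal
b = [0.1, -0.3, 0.7]
for rest in product([0, 1], repeat=2):
    print(rest, conditional_mean(0, rest, w, b), logistic_formula(0, rest, w, b))
```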

Exercise 2.5

Note

For inequality problems, first identify the tool: Cauchy, Jensen, Markov, Chebyshev, or Chernoff. Equality cases are often important too.

Problem: 2.5.1

Consider the simple random walk on the line

S_n=\sum_{k=1}^n X_k,\quad S_0=0,

where

P(X_i=1)=p,\quad P(X_i=-1)=1-p,\quad 0<p<1.

Find $E(S_n)$ , $\operatorname{Var}(S_n)$ , $\operatorname{Cov}(S_m,S_n)$ , and $E[S_n\mid S_m]$ .

Proof

First note that

E(X_1)=p-(1-p)=2p-1,\qquad \operatorname{Var}(X_1)=1-(2p-1)^2=4p(1-p).

Therefore

E(S_n)=\sum_{k=1}^n E(X_k)=n(2p-1),

and

\operatorname{Var}(S_n)=\sum_{k=1}^n\operatorname{Var}(X_k)=4np(1-p).

By independence,

\operatorname{Cov}(S_m,S_n) =\sum_{i=1}^m\sum_{j=1}^n \operatorname{Cov}(X_i,X_j) =\sum_{k=1}^{m\wedge n}\operatorname{Var}(X_k) =4p(1-p)(m\wedge n).

Now compute the conditional expectation.

If $n\ge m$ , then

S_n=S_m+\sum_{k=m+1}^n X_k,

and the latter sum is independent of $S_m$ . Hence

E[S_n\mid S_m]=S_m+(n-m)(2p-1).

If $n\le m$ , then conditional on $S_m$ , the variables $X_1,\dots,X_m$ are exchangeable. Thus

E[X_1\mid S_m]=\cdots=E[X_m\mid S_m].

Also,

\sum_{k=1}^m E[X_k\mid S_m] =E\left[\sum_{k=1}^m X_k\mid S_m\right] =E[S_m\mid S_m] =S_m.

Since the $m$ conditional expectations are equal and sum to $S_m$ ,

E[X_k\mid S_m]=\frac{S_m}{m},\qquad 1\le k\le m.

Therefore

E[S_n\mid S_m] =\sum_{k=1}^n E[X_k\mid S_m] =\frac{n}{m}S_m.

In summary,

E[S_n\mid S_m]= \begin{cases} \dfrac{n}{m}S_m, & n\le m,\\[6pt] S_m+(n-m)(2p-1), & n\ge m. \end{cases}

Problem: 2.5.2

In an election with two candidates, every vote goes to exactly one candidate. Suppose the final result is $\alpha$ votes for $A$ and $\beta$ votes for $B$ , with $\alpha\ge\beta$ . All counting orders are equally likely.

$1$ Find the probability that the two candidates are tied at some time during the count.

$2$ Prove that the probability that $A$ is never behind $B$ during the count is

\frac{\alpha-\beta+1}{\alpha+1}.

Proof

As in the standard ballot problem, build a random walk:

X_i= \begin{cases} 1, & \text{the }i\text{-th vote is for }A,\\ -1, & \text{the }i\text{-th vote is for }B, \end{cases} \qquad S_k=\sum_{i=1}^k X_i.

After $k$ votes have been counted, $S_k$ is the number of votes by which $A$ leads $B$ . Each counting order corresponds to a path from $(0,0)$ to $(\alpha+\beta,\alpha-\beta)$ , and all such paths are equally likely. The total number of paths is

N_{\alpha+\beta}(0,\alpha-\beta)=\binom{\alpha+\beta}{\alpha}.

$1$ The event that the vote counts are tied at some time means that the path returns to the $x$ -axis after the start.

If $\alpha=\beta$ , the endpoint is on the $x$ -axis, so the probability is $1$ .

If $\alpha>\beta$ , the complement is that the path never returns to the $x$ -axis, meaning $A$ is always strictly ahead. By the ballot theorem,

\#\{\text{paths from }(0,0)\text{ to }(\alpha+\beta,\alpha-\beta) \text{ that never return to the }x\text{-axis}\} =\frac{\alpha-\beta}{\alpha+\beta}N_{\alpha+\beta}(0,\alpha-\beta).

Thus

P(\text{a tie occurs}) =1-\frac{\alpha-\beta}{\alpha+\beta} =\frac{2\beta}{\alpha+\beta}.

This also gives $1$ when $\alpha=\beta$ .

$2$ The event that $A$ is never behind $B$ is the event $S_k\ge0$ for all $k$ .

Add one upward step to the front of each such path. This gives a path from $(0,0)$ to $(\alpha+\beta+1,\alpha-\beta+1)$ that never returns to the $x$ -axis after the start. Conversely, removing the first step recovers the original path. This is a bijection.

By the ballot theorem,

\#\{\text{paths where }A\text{ is never behind }B\} =\frac{\alpha-\beta+1}{\alpha+\beta+1} N_{\alpha+\beta+1}(0,\alpha-\beta+1).

Since

N_{\alpha+\beta+1}(0,\alpha-\beta+1)=\binom{\alpha+\beta+1}{\alpha+1},

we get

\#\{\text{paths where }A\text{ is never behind }B\} =\frac{\alpha-\beta+1}{\alpha+\beta+1}\binom{\alpha+\beta+1}{\alpha+1} =\frac{\alpha-\beta+1}{\alpha+1}\binom{\alpha+\beta}{\alpha}.

Dividing by the total number $\binom{\alpha+\beta}{\alpha}$ gives

P(\text{$A$ is never behind $B$}) =\frac{\alpha-\beta+1}{\alpha+1}.
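Both answers can be confirmed by enumerating all counting orders for small $\alpha,\beta$ . The sketch below treats an order as the set of positions at which a vote for $A$ is counted; the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations

def ballot_probabilities(alpha, beta):
    """Return (P(a tie occurs after the start), P(A is never behind)) by enumeration."""
    total = alpha + beta
    orders = tie = never_behind = 0
    for a_positions in combinations(range(total), alpha):
        a_set = set(a_positions)
        lead = 0
        tied = False
        behind = False
        for i in range(total):
            lead += 1 if i in a_set else -1
            if lead == 0:
                tied = True
            if lead < 0:
                behind = True
        orders += 1
        tie += tied
        never_behind += not behind
    return Fraction(tie, orders), Fraction(never_behind, orders)

alpha, beta = 5, 3
print(ballot_probabilities(alpha, beta))
print(Fraction(2 * beta, alpha + beta), Fraction(alpha - beta + 1, alpha + 1))
```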

Problem: 2.5.3

Let $S_n$ be the simple symmetric random walk on the line, with $S_0=0$ . Let

T=\min\{n\ge1:S_n=0\}

be the first return time to the origin. Prove that

P(T=2n)=\frac{1}{2n-1}\binom{2n}{n}2^{-2n},

and determine for which $\alpha$ we have $E[T^\alpha]<\infty$ .

Note. You may use Stirling's formula: $n!\sim n^n e^{-n}\sqrt{2\pi n}$ .

Proof

Clearly $T$ can only be even. Let

A_n^+=\{T=2n,\ X_1=1\},\qquad A_n^-=\{T=2n,\ X_1=-1\}.

By symmetry,

P(T=2n)=P(A_n^+)+P(A_n^-)=2P(A_n^+).

Count the paths in $A_n^+$ . If $T=2n$ and the first step is to $1$ , then

S_1,S_2,\cdots,S_{2n-1}>0,\qquad S_{2n}=0.

Reading this path backwards gives a path from $(0,0)$ to $(2n-1,1)$ that does not return to the $x$ -axis after the start. This is a bijection. By the ballot theorem,

\#\{\text{paths in }A_n^+\} =\frac{1}{2n-1}N_{2n-1}(0,1) =\frac{1}{2n-1}\binom{2n-1}{n}.

Each path of length $2n$ has probability $2^{-2n}$ . Hence

\begin{aligned} P(T=2n) &=2\cdot \frac{1}{2n-1}\binom{2n-1}{n}2^{-2n} \\ &=\frac{1}{2n-1}\binom{2n}{n}2^{-2n}. \end{aligned}

Now consider $E[T^\alpha]$ . By Stirling's formula,

\binom{2n}{n}\sim \frac{4^n}{\sqrt{\pi n}}.

Thus

P(T=2n) =\frac{1}{2n-1}\binom{2n}{n}2^{-2n} \sim \frac{1}{2\sqrt{\pi}}n^{-3/2}.

Therefore

E[T^\alpha] =\sum_{n=1}^{\infty}(2n)^\alpha P(T=2n) \asymp \sum_{n=1}^{\infty}n^{\alpha-3/2}.

The series $\sum n^{\alpha-3/2}$ converges if and only if

\alpha-\frac32<-1,

that is,

\alpha<\frac12.

Hence

E[T^\alpha]<\infty \iff \alpha<\frac12.
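The formula for $P(T=2n)$ can be checked by propagating the walk step by step and removing the mass that returns to $0$ . This is a sketch in exact arithmetic; the function name and the dictionary-based state are illustrative choices.

```python
from fractions import Fraction
from math import comb

def first_return_pmf(max_n):
    """P(T = 2n) for n = 1..max_n, by tracking walks that have not yet returned to 0."""
    alive = {0: Fraction(1)}  # position -> probability of being there with no return so far
    pmf = {}
    for step in range(1, 2 * max_n + 1):
        nxt = {}
        for pos, w in alive.items():
            for d in (-1, 1):
                nxt[pos + d] = nxt.get(pos + d, Fraction(0)) + w / 2
        returned = nxt.pop(0, Fraction(0))  # mass hitting 0 for the first time at this step
        if step % 2 == 0:
            pmf[step // 2] = returned
        alive = nxt
    return pmf

pmf = first_return_pmf(6)
for n, prob in pmf.items():
    print(n, prob, Fraction(comb(2 * n, n), (2 * n - 1) * 4**n))
```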

Problem: 2.5.4

A particle moves on a cycle with nodes labeled $0,1,\cdots,m$ . At each step it moves to a neighboring node clockwise or counterclockwise with equal probability. Starting from node $0$ , the particle moves until all nodes $1,2,\cdots,m$ have been visited.

$1$ Prove that the particle visits all nodes $1,2,\cdots,m$ with probability $1$ .

$2$ Find the probability that node $i$ is the last node visited, for $1\le i\le m$ .

Proof

$1$ Fix $i\in\{1,2,\cdots,m\}$ , and let $A_i=\{\text{the particle eventually visits node }i\}$ . For each $r\ge1$ , let

B_r=\{\text{during steps }(r-1)m+1,\,(r-1)m+2,\cdots,rm, \text{ node }i\text{ is never visited}\}.

No matter where the particle is at time $(r-1)m$ , there is a choice of direction that reaches $i$ in at most $m$ steps. The probability of following that particular route is at least $2^{-m}$ . Hence

P(B_r\mid \text{all results before time }(r-1)m)\le 1-2^{-m}.

Thus, for every $N\ge1$ ,

P(B_1\cap\cdots\cap B_N)\le (1-2^{-m})^N.

If $A_i^c$ occurs, then node $i$ is missed in every block of length $m$ . Therefore, for every $N\ge1$ ,

A_i^c\subseteq B_1\cap\cdots\cap B_N.

P(A_i^c)\le P(B_1\cap\cdots\cap B_N)\le (1-2^{-m})^N.

Letting $N\to\infty$ gives $P(A_i)=1$ . This holds for every $i=1,2,\cdots,m$ . Since there are only finitely many nodes,

P\Bigl(\bigcap_{i=1}^m A_i\Bigr)=1.

So the particle visits all nodes with probability $1$ .

$2$ Let

p_i=P(\text{the last node visited is }i),\qquad 1\le i\le m.

By part (1), $\sum_{i=1}^m p_i=1$ .

For $2\le i\le m-1$ , condition on the first step:

\begin{aligned} p_i &=\frac12 P(\text{after first going to }1,\text{ the last node visited is }i)\\ &\quad+\frac12 P(\text{after first going to }m,\text{ the last node visited is }i). \end{aligned}

If the first step is to $1$ and the last node visited is $i$ , then node $0$ must be visited again before reaching $i$ ; otherwise the particle cannot cross $i$ to visit the other side. Thus this event has the same form as the original problem, after relabeling

1\mapsto0,\quad 2\mapsto1,\quad \cdots,\quad m\mapsto m-1,\quad 0\mapsto m.

Therefore

P(\text{after first going to }1,\text{ the last node visited is }i)=p_{i-1}.

Similarly,

P(\text{after first going to }m,\text{ the last node visited is }i)=p_{i+1}.

Hence

p_i=\frac{p_{i-1}+p_{i+1}}{2},\qquad 2\le i\le m-1.

So $p_1,\cdots,p_m$ form an arithmetic progression. By symmetry around node $0$ , $p_1=p_m$ . An arithmetic progression with equal endpoints is constant, so

p_1=p_2=\cdots=p_m.

Since their sum is $1$ ,

p_i=\frac1m,\qquad 1\le i\le m.

Exercise 2.6

Note

This section repeatedly uses tail control: turn a probability bound into an expectation bound, then close it by summing or integrating.

Problem: 2.6.1

Let $G_1,G_2$ be probability generating functions, and let $0\le \alpha\le1$ . Prove that $G_1G_2$ and $\alpha G_1+(1-\alpha)G_2$ are also probability generating functions. Is

\frac{G(\alpha s)}{G(\alpha)}

also a probability generating function?

Proof

Write

G_i(s)=\sum_{n=0}^{\infty}p_n^{(i)}s^n,\qquad p_n^{(i)}\ge0,\qquad \sum_{n=0}^{\infty}p_n^{(i)}=1,\quad i=1,2.

Then

G_1(s)G_2(s) =\sum_{n=0}^{\infty} \left(\sum_{k=0}^n p_k^{(1)}p_{n-k}^{(2)}\right)s^n.

The coefficients are nonnegative, and their sum is

\sum_{n=0}^{\infty}\sum_{k=0}^n p_k^{(1)}p_{n-k}^{(2)} =\left(\sum_{n=0}^{\infty}p_n^{(1)}\right) \left(\sum_{n=0}^{\infty}p_n^{(2)}\right) =1.

So $G_1G_2$ is a probability generating function.

Also,

\alpha G_1(s)+(1-\alpha)G_2(s) =\sum_{n=0}^{\infty} \left(\alpha p_n^{(1)}+(1-\alpha)p_n^{(2)}\right)s^n.

The coefficients are nonnegative and sum to

\sum_{n=0}^{\infty} \left(\alpha p_n^{(1)}+(1-\alpha)p_n^{(2)}\right) =\alpha+(1-\alpha)=1.

So $\alpha G_1+(1-\alpha)G_2$ is also a probability generating function.

Now write

G(s)=\sum_{n=0}^{\infty}p_ns^n.

When $G(\alpha)>0$ ,

\frac{G(\alpha s)}{G(\alpha)} =\sum_{n=0}^{\infty}\frac{p_n\alpha^n}{G(\alpha)}s^n.

The coefficients are nonnegative and sum to

\sum_{n=0}^{\infty}\frac{p_n\alpha^n}{G(\alpha)} =\frac{G(\alpha)}{G(\alpha)} =1.

Thus it is still a probability generating function. In particular, this always works for $\alpha\in(0,1]$ . If $\alpha=0$ , the expression is meaningful only when $G(0)>0$ ; in that case it is identically $1$ , so it is still a probability generating function.

Problem: 2.6.3

Let $X$ have the geometric distribution with parameter $p$ , $0<p<1$ :

\mathbb{P}(X=k)=(1-p)^{k-1}p,\quad k=1,2,\cdots.

Let $Y$ be a nonnegative integer-valued random variable with probability generating function $G(s)$ , and suppose $Y$ is independent of $X$ . Prove that

\mathbb{P}(X>Y)=G(1-p).

Proof

By the law of total probability and independence,

\mathbb{P}(X>Y) =\sum_{n=0}^{\infty}\mathbb{P}(X>n,Y=n) =\sum_{n=0}^{\infty}\mathbb{P}(X>n)\mathbb{P}(Y=n).

For the geometric distribution,

\mathbb{P}(X>n) =\sum_{k=n+1}^{\infty}(1-p)^{k-1}p =(1-p)^n.

Therefore

\mathbb{P}(X>Y) =\sum_{n=0}^{\infty}(1-p)^n\mathbb{P}(Y=n) =G(1-p).

Problem: 2.6.4

Prove that

G(x,y,z,w)=\frac{1}{8}(xyzw+xy+yz+zw+xw+yw+xz+1)

is the joint generating function of four random variables that are pairwise independent and three-wise independent, but not mutually independent.

Proof

All coefficients of

G(x,y,z,w)=\frac{1}{8}(xyzw+xy+yz+zw+xw+yw+xz+1)

are nonnegative, and the sum of the coefficients is $1$ . Hence it is indeed the joint generating function of some four-dimensional random vector.

For the marginal generating function,

G_X(x)=G(x,1,1,1)=\frac{1+x}{2}.

The other three marginals are the same. For two variables,

G_{X,Y}(x,y)=G(x,y,1,1) =\frac{(1+x)(1+y)}{4} =G_X(x)G_Y(y).

By symmetry, any two of the four variables are independent.

For three variables,

G_{X,Y,Z}(x,y,z)=G(x,y,z,1) =\frac{(1+x)(1+y)(1+z)}{8} =G_X(x)G_Y(y)G_Z(z).

Again by symmetry, any three variables are independent.

If all four variables were mutually independent, their joint generating function would be

G_X(x)G_Y(y)G_Z(z)G_W(w) =\frac{(1+x)(1+y)(1+z)(1+w)}{16}.

This is not equal to $G(x,y,z,w)$ ; for instance, the product contains the monomial $x$ with coefficient $\frac{1}{16}$ , while $G$ has no $x$ term. Hence the four variables are not mutually independent.
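The same conclusion can be read off from the joint distribution. Each monomial of $G$ corresponds to one point of $\{0,1\}^4$ with probability $\frac18$ , and independence of a group of coordinates means their joint marginal is uniform, because each coordinate is Bernoulli with parameter $\frac12$ . The names `SUPPORT`, `marginal`, and `is_uniform` below are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

# Exponent vectors (x, y, z, w) of the eight monomials of G; each has coefficient 1/8.
SUPPORT = [
    (1, 1, 1, 1), (1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1),
    (1, 0, 0, 1), (0, 1, 0, 1), (1, 0, 1, 0), (0, 0, 0, 0),
]
PMF = {point: Fraction(1, 8) for point in SUPPORT}

def marginal(coords):
    """Joint pmf of the selected coordinates."""
    out = {}
    for point, prob in PMF.items():
        key = tuple(point[c] for c in coords)
        out[key] = out.get(key, Fraction(0)) + prob
    return out

def is_uniform(coords):
    """True if the selected coordinates are uniform on {0,1}^len(coords)."""
    m = marginal(coords)
    target = Fraction(1, 2 ** len(coords))
    return all(m.get(v, Fraction(0)) == target
               for v in product([0, 1], repeat=len(coords)))

for r in (1, 2, 3, 4):
    print(r, all(is_uniform(c) for c in combinations(range(4), r)))
```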

Extra material

Note

The characterization of distribution functions is mostly analytic. The random-walk examples are more about recursions and stopping times. It is fine to read the two parts separately.

Notes left from class

Theorem: characterization of distribution functions

Let $F:\mathbb{R}\to\mathbb{R}$ be a function. Then $F$ is the distribution function of some random variable if and only if it satisfies the following three properties:

Monotonicity: if $x_1<x_2$ , then $F(x_1)\le F(x_2)$ .
Right-continuity: for every $x\in\mathbb{R}$ , $\lim_{y\to x^+}F(y)=F(x)$ .
Normalization: $\lim_{x\to-\infty}F(x)=0$ and $\lim_{x\to+\infty}F(x)=1$ .

Proof

Necessity is omitted. We prove sufficiency.

Assume $F$ satisfies the three properties. Let $U\sim U(0,1)$ and define

X=\inf\{t\in\mathbb{R}:F(t)\ge U\}.

Since $\lim_{x\to-\infty}F(x)=0$ and $\lim_{x\to+\infty}F(x)=1$ , the set above is nonempty and bounded below whenever $0<U<1$ , which holds with probability $1$ . So $X$ is well-defined almost surely.

For any $x\in\mathbb{R}$ , if $U\le F(x)$ , then $x\in\{t:F(t)\ge U\}$ , so $X\le x$ . Hence

\{U\le F(x)\}\subseteq\{X\le x\}.

Conversely, if $X\le x$ , then for every $n\ge1$ there exists some $t_n<x+\frac1n$ such that $F(t_n)\ge U$ . Since $F$ is nondecreasing,

U\le F(t_n)\le F\left(x+\frac1n\right).

Letting $n\to\infty$ and using right-continuity gives $U\le F(x)$ . Thus

\{X\le x\}\subseteq\{U\le F(x)\}.

Therefore

\{X\le x\}=\{U\le F(x)\},

and

\mathbb{P}(X\le x)=\mathbb{P}(U\le F(x))=F(x).

So $F$ is the distribution function of $X$ .
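For a step distribution function the construction $X=\inf\{t:F(t)\ge U\}$ reduces to a search over the jump points. The sketch below assumes $F$ is given by its jump locations and the values it takes there; `quantile_sampler` and the example data are hypothetical, and the input `u` is assumed to lie in $(0,1]$ .

```python
import bisect

def quantile_sampler(points, cdf_values):
    """Return the map u -> inf{t : F(t) >= u} for a step distribution function F.

    `points` are the jump locations in increasing order and `cdf_values[i]`
    is F(points[i]); F is constant between jumps, 0 before points[0],
    and cdf_values[-1] is 1.
    """
    def sample(u):
        # bisect_left finds the first index i with cdf_values[i] >= u.
        i = bisect.bisect_left(cdf_values, u)
        return points[i]
    return sample

points = [-1.0, 0.5, 2.0]
cdf_values = [0.2, 0.7, 1.0]
sample = quantile_sampler(points, cdf_values)
print(sample(0.2), sample(0.5), sample(0.95))
```

Feeding `sample` with uniform random numbers then produces draws with distribution function $F$ , which is the content of the sufficiency proof.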

Good problems

Problem: absorption time of a simple random walk

Let $\{S_n\}_{n\ge0}$ be a simple symmetric random walk on the state space $\{0,1,\dots,L\}$ , where $0$ and $L$ are absorbing states. If $S_0=1$ , find the probability that the walk is absorbed exactly at time $n$ , for $n\ge1$ .

Proof

Let the absorption time be

\tau=\inf\{n\ge0:S_n\in\{0,L\}\}.

For $x=1,2,\dots,L-1$ , set

P(x,n)=\mathbb{P}(S_n=x,\ \tau>n).

Then

P(x,n)=\frac12P(x-1,n-1)+\frac12P(x+1,n-1),

with boundary conditions

P(0,n)=P(L,n)=0,

and initial condition

P(x,0)=\mathbf{1}_{\{x=1\}}.

Because the boundary values are zero, expand in sine functions:

P(x,n)=\sum_{m=1}^{L-1}a_m(n)\sin\frac{m\pi x}{L}.

Substituting this into the recursion gives

a_m(n)=a_m(n-1)\cos\frac{m\pi}{L}.

Hence

a_m(n)=a_m(0)\cos^n\frac{m\pi}{L}.

The initial condition gives

a_m(0)=\frac{2}{L}\sin\frac{m\pi}{L}.

Thus

P(x,n)=\sum_{m=1}^{L-1} \frac{2}{L}\sin\frac{m\pi}{L} \cos^n\frac{m\pi}{L} \sin\frac{m\pi x}{L}.

Therefore

\mathbb{P}(\tau>n)=\sum_{x=1}^{L-1}P(x,n),

and

\mathbb{P}(\tau=n)=\mathbb{P}(\tau>n-1)-\mathbb{P}(\tau>n).

This gives the desired probability in terms of the explicit expression above.
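
The formula can be sanity-checked numerically. The sketch below uses hypothetical function names and covers only the start $S_0=1$ . It evaluates $\mathbb{P}(\tau>n)$ from the sine expansion and compares it with direct iteration of the recursion for $P(x,n)$ .

```python
import math


def survival_sine(L, n):
    """P(tau > n) for S_0 = 1, summing the sine expansion of P(x, n) over x."""
    total = 0.0
    for m in range(1, L):
        theta = m * math.pi / L
        a0 = (2.0 / L) * math.sin(theta)  # a_m(0) for the start at 1
        sine_sum = sum(math.sin(theta * x) for x in range(1, L))
        total += a0 * math.cos(theta) ** n * sine_sum
    return total


def survival_dp(L, n):
    """P(tau > n) for S_0 = 1, iterating the recursion with zero boundary values."""
    P = [0.0] * (L + 1)
    P[1] = 1.0
    for _ in range(n):
        new = [0.0] * (L + 1)  # new[0] and new[L] stay 0 (absorbed mass is dropped)
        for x in range(1, L):
            new[x] = 0.5 * P[x - 1] + 0.5 * P[x + 1]
        P = new
    return sum(P[1:L])


def absorption_pmf(L, n, survival=survival_sine):
    """P(tau = n) = P(tau > n - 1) - P(tau > n), for n >= 1."""
    return survival(L, n - 1) - survival(L, n)
```

For example, `absorption_pmf(5, 1)` should be close to $\frac12$ , since the only way to be absorbed at time $1$ from state $1$ is to step to $0$ .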

Remark

If the walk starts from any $i\in\{1,2,\dots,L-1\}$ , only the initial condition changes:

P(x,0)=\mathbf{1}_{\{x=i\}}.

The rest of the argument is the same.

Problem: distribution of the first run of successes

Toss a coin repeatedly and independently. Each toss is heads with probability $p$ and tails with probability $q=1-p$ . Let $N$ be the number of tosses needed to see $m$ consecutive heads for the first time. Find the generating function of $N$ .

Proof

Let

P_n=\mathbb{P}(N=n),\qquad n\ge0.

Clearly,

P_n=0\quad(n<m),\qquad P_m=p^m.

For $n>m$ , classify by the first tail among the first $m$ tosses. This gives the recursion

P_n=q\sum_{k=1}^m p^{k-1}P_{n-k},\qquad n>m.

Let

G(z)=\sum_{n=m}^{\infty}P_nz^n.

Summing the recursion gives

G(z)-p^mz^m =q\sum_{n=m+1}^{\infty}\sum_{k=1}^m p^{k-1}P_{n-k}z^n.

Rearranging,

G(z)-p^mz^m =qz\bigl(1+pz+\cdots+(pz)^{m-1}\bigr)G(z).

Thus

G(z)=\frac{p^mz^m}{1-qz\bigl(1+pz+\cdots+(pz)^{m-1}\bigr)}.

Using

1+pz+\cdots+(pz)^{m-1} =\frac{1-(pz)^m}{1-pz},

we get

G(z)=\frac{(pz)^m(1-pz)}{1-z+qp^mz^{m+1}}.

This is the generating function of $N$ .
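
As a numerical check, the recursion for $P_n$ can be run directly and its truncated power series compared with the closed form. This is a sketch with hypothetical function names. The truncation level `n_max` is an assumption and must be large enough for the tail $\sum_{n>n_{\max}}P_nz^n$ to be negligible.

```python
def run_pmf(p, m, n_max):
    """P_n = P(N = n) for n = 0..n_max, from the first-tail recursion."""
    q = 1.0 - p
    P = [0.0] * (n_max + 1)
    P[m] = p ** m
    for n in range(m + 1, n_max + 1):
        # First tail at toss k (1 <= k <= m), then the experiment restarts.
        P[n] = q * sum(p ** (k - 1) * P[n - k] for k in range(1, m + 1))
    return P


def G_series(p, m, z, n_max=400):
    """Truncated series sum_{n <= n_max} P_n z^n."""
    return sum(P_n * z ** n for n, P_n in enumerate(run_pmf(p, m, n_max)))


def G_closed(p, m, z):
    """Closed form (pz)^m (1 - pz) / (1 - z + q p^m z^(m+1))."""
    q = 1.0 - p
    return (p * z) ** m * (1 - p * z) / (1 - z + q * p ** m * z ** (m + 1))
```

For $p=\frac12$ and $m=2$ the recursion gives $P_2=\frac14$ and $P_3=P_4=\frac18$ , matching a direct count of the sequences HH, THH, and TTHH or HTHH.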

Midterm review

Note

For review, organize the material by counting, distribution calculations, conditional probability, expectation and variance, and random walks. Check both the answers and the conditions under which each method applies.

A useful review path is: concepts, homework problems, and then common hard points. In other words, start from the book's definitions, then go through the problem types in the homework, and finally review the methods used repeatedly in class.

1. Basic concepts

The easiest points to lose are often not from the hardest computations. They come from unclear definitions or not knowing which property to use. At minimum, you should be able to explain the following items in your own words and recognize when a problem is asking for them:

The three parts of a probability space: sample space, event space, and probability measure.
Random variables, distribution functions, and the basic properties of distribution functions.
Discrete and continuous random variables, and the relation among probability mass functions, density functions, and distribution functions.
Joint distributions, marginal distributions, conditional distributions, and independence for two-dimensional random variables.
The definitions and basic properties of expectation, variance, covariance, and correlation.
The meaning of conditional expectation, and the idea of "condition first, then take expectation."
Basic facts about common distributions: Bernoulli, binomial, geometric, and Poisson.

Remark

Do not stop at "this looks familiar." Try to say the definitions without looking at the notes. In particular, be able to say clearly what independence, conditional expectation, and a distribution function mean.

Here are two examples.

Problem: Fall 2024, problem 1

Toss two fair coins. Write down the three parts of the probability space in detail, and explain why there exist two independent random variables on it.

Solution

Take

\Omega=\{HH,HT,TH,TT\},

where the first letter records the first coin and the second letter records the second coin. Let

\mathcal{F}=2^\Omega,

and define the probability measure $P$ by

P(\{\omega\})=\frac14,\qquad \omega\in\Omega.

This gives the probability space $(\Omega,\mathcal{F},P)$ .

Now define

X=\mathbf{1}_{\{\text{the first coin is heads}\}},\qquad Y=\mathbf{1}_{\{\text{the second coin is heads}\}}.

Then $X$ and $Y$ both take values in $\{0,1\}$ , and

P(X=1)=P(Y=1)=\frac12.

Moreover,

P(X=i,Y=j)=\frac14=P(X=i)P(Y=j),\qquad i,j\in\{0,1\}.

Thus $X$ and $Y$ are independent.
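
The same verification can be done by enumerating the four sample points. The sketch below is illustrative only. It encodes $\Omega$ , the uniform measure, and the two indicators, then checks the product rule for every pair $(i,j)$ . The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space {HH, HT, TH, TT} with the uniform measure.
omega = ["".join(w) for w in product("HT", repeat=2)]
P = {w: Fraction(1, 4) for w in omega}


def X(w):
    """Indicator that the first coin is heads."""
    return 1 if w[0] == "H" else 0


def Y(w):
    """Indicator that the second coin is heads."""
    return 1 if w[1] == "H" else 0


def prob(event):
    """Probability of the event {w : event(w) is true}."""
    return sum((P[w] for w in omega if event(w)), Fraction(0))


def are_independent():
    """Check P(X=i, Y=j) = P(X=i) P(Y=j) for all i, j in {0, 1}."""
    return all(
        prob(lambda w: X(w) == i and Y(w) == j)
        == prob(lambda w: X(w) == i) * prob(lambda w: Y(w) == j)
        for i in (0, 1)
        for j in (0, 1)
    )
```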

Problem: Fall 2019, problem 2

Give a probability space on $[0,1]$ . For

A_1=[a_1,b_1],\qquad A_2=[a_2,b_2],

when are $A_1$ and $A_2$ independent?

Solution

Take

\Omega=[0,1],\qquad \mathcal{F}=\mathcal{B}([0,1]),\qquad P=\mu,

where $\mu$ is the Borel probability measure given by interval length on $[0,1]$ . Thus for every closed interval $[a,b]\subset[0,1]$ ,

P([a,b])=b-a.

Let

\ell_1=b_1-a_1,\qquad \ell_2=b_2-a_2.

Then $A_1$ and $A_2$ are independent if and only if

P(A_1\cap A_2)=P(A_1)P(A_2)=\ell_1\ell_2.

Assume without loss of generality that $a_1\le a_2$ .

(1) If $b_1\le a_2$ , then the two intervals intersect in at most one point, so

P(A_1\cap A_2)=0.

Independence then holds if and only if $\ell_1\ell_2=0$ , meaning that at least one interval degenerates to a point.

(2) If $a_2<b_1$ and $b_2\le b_1$ , then $A_2\subset A_1$ , so

P(A_1\cap A_2)=P(A_2)=\ell_2.

Independence requires

\ell_2=\ell_1\ell_2.

Thus either $\ell_2=0$ or $\ell_1=1$ . In words, either $A_2$ is a single point or $A_1=[0,1]$ .

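(3) If $a_2<b_1<b_2$ , the intervals overlap and $A_2\not\subset A_1$ . Let

x=a_2-a_1,\qquad y=b_1-a_2,\qquad z=b_2-b_1,\qquad w=1-(x+y+z).

Then $x\ge0$ and $y,z>0$ , and $w\ge0$ because $x+y+z=b_2-a_1\le1$ . Moreover,

\ell_1=x+y,\qquad \ell_2=y+z,\qquad P(A_1\cap A_2)=y.

Independence means

y=(x+y)(y+z)=y(x+y+z)+xz=y(1-w)+xz,

that is,

yw=xz.

If $x=0$ , then $A_1\subset A_2$ and the condition forces $w=0$ , so $\ell_2=1$ and $A_2=[0,1]$ . If $x>0$ , the condition can still hold. For example, for $A_1=[0,\frac12]$ and $A_2=[\frac14,\frac34]$ we have $x=y=z=w=\frac14$ , and indeed $P(A_1\cap A_2)=\frac14=P(A_1)P(A_2)$ .

In summary, with $a_1\le a_2$ , two closed intervals in this probability space are independent in exactly the following situations: at least one of them has probability $0$ or $1$ (it is a single point or the whole interval $[0,1]$ ), or the intervals overlap partially with $x,y,z>0$ and $yw=xz$ .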

2. Homework review

The homework problems are the most important review material. Many exam questions appear in similar, sometimes almost identical, forms, so every homework problem should be worked through carefully.

Problems worth reviewing include:

If $X,Y$ are independent, $X\sim \mathrm{Poisson}(\lambda_1)$ , and $Y\sim \mathrm{Poisson}(\lambda_2)$ , find

\mathbb{E}[X\mid X+Y].

If $N\sim\mathrm{Poisson}(\lambda)$ , and conditional on $N$ we toss a coin $N$ times, with $X$ the number of heads, find

\mathbb{E}[N\mid X].

Let

S_n=\sum_{k=1}^n X_k,\qquad S_0=0,

be a simple random walk on the line, where

\mathbb{P}(X_i=1)=p,\qquad \mathbb{P}(X_i=-1)=1-p,\qquad 0<p<1.

For $m\le n$ , find

\operatorname{Cov}(S_n,S_m) \quad\text{and}\quad \operatorname{Var}(S_n\mid S_m).
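
To check a hand-derived answer to the first problem in this list, the conditional expectation can be computed directly from the definition. This is a sketch with hypothetical function names. It evaluates $\mathbb{E}[X\mid X+Y=s]$ for one value of $s$ at a time and does not replace the derivation.

```python
import math


def poisson_pmf(lam, k):
    """P(Z = k) for Z ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def conditional_mean_X(lam1, lam2, s):
    """E[X | X + Y = s] for independent X ~ Poisson(lam1), Y ~ Poisson(lam2)."""
    # Joint weights P(X = k, Y = s - k); their sum is P(X + Y = s).
    weights = [poisson_pmf(lam1, k) * poisson_pmf(lam2, s - k) for k in range(s + 1)]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)
```

Evaluating `conditional_mean_X(2.0, 3.0, s)` for a few values of `s` shows how the answer depends on $s$ and on the two parameters, which is a useful guide when guessing the general form.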

Methods to know:

Cauchy's inequality, especially for estimates, proving variances are nonnegative, and controlling expectations.
Markov's inequality, and the basic idea of controlling tail probabilities by expectations.
The law of total expectation.
Writing a random variable as a sum of indicator variables, which is useful for expectation, variance, and sometimes higher moments.

3. Examples

1. Generating functions and moment generating functions

Generating functions are useful for several reasons:

An ordinary generating function encodes the distribution of a nonnegative integer-valued random variable as one function.
Derivatives at $z=1$ (or at $t=0$ for a moment generating function) give expectation, variance, and higher moments.
For independent random variables, the generating function or moment generating function of the sum is the product of the individual ones.
Generating functions are especially useful for recursions and first-hitting or first-occurrence times.
When a moment generating function exists, it can often characterize the distribution and makes it easier to compare moments.

Problem: symmetric random walk and Catalan numbers

Let $\{S_n\}_{n\ge0}$ be the simple symmetric random walk on the line, with $S_0=0$ . At each step it moves right with probability $\frac12$ and left with probability $\frac12$ . Find

\mathbb{P}(S_1\ge0,S_2\ge0,\dots,S_{2n}\ge0,S_{2n}=0).

Solution

Let

C_n:=\#\{(S_1,\dots,S_{2n}):S_i\ge0,\ 1\le i\le2n,\ S_{2n}=0\}, \qquad C_0=1.

Then the desired probability is

\mathbb{P}(S_1\ge0,S_2\ge0,\dots,S_{2n}\ge0,S_{2n}=0) =\frac{C_n}{2^{2n}},

because every path of length $2n$ has probability $2^{-2n}$ .

Suppose the path first returns to $0$ at time $2k$ , where $1\le k\le n$ . Then $S_i\ge1$ for $1\le i\le 2k-1$ . Shifting this first part down by $1$ gives a path starting at $0$ , of length $2k-2$ , staying nonnegative and ending at $0$ . There are $C_{k-1}$ such paths. After the first return, the remaining $2(n-k)$ steps form another path of the same type, giving $C_{n-k}$ choices. Hence

C_n=\sum_{k=1}^n C_{k-1}C_{n-k},\qquad n\ge1.

Introduce the generating function

F(z):=\sum_{n=0}^{\infty}C_nz^n.

The recursion gives

\begin{aligned} F(z) &=1+\sum_{n=1}^{\infty}\sum_{k=1}^n C_{k-1}C_{n-k}z^n\\ &=1+zF(z)^2. \end{aligned}

Therefore

zF(z)^2-F(z)+1=0.

Solving,

F(z)=\frac{1-\sqrt{1-4z}}{2z},

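where we choose the root that stays bounded as $z\to0$ , so that $F(0)=C_0=1$ . Expanding $\sqrt{1-4z}$ by the binomial series and comparing coefficients of $z^n$ gives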

C_n=\frac{1}{n+1}\binom{2n}{n}.

Thus

\mathbb{P}(S_1\ge0,S_2\ge0,\dots,S_{2n}\ge0,S_{2n}=0) =\frac{1}{2^{2n}}\cdot\frac{1}{n+1}\binom{2n}{n}.
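
A short check of the three descriptions of $C_n$ : the recursion, the closed form, and a brute-force count over all $2^{2n}$ paths. The function names are illustrative, and the brute-force count is only practical for small $n$ .

```python
from itertools import product
from math import comb


def catalan_recursive(n_max):
    """C_0..C_{n_max} from C_n = sum_{k=1}^n C_{k-1} C_{n-k}."""
    C = [1] + [0] * n_max
    for n in range(1, n_max + 1):
        C[n] = sum(C[k - 1] * C[n - k] for k in range(1, n + 1))
    return C


def catalan_closed(n):
    """C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def count_paths(n):
    """Count +-1 paths of length 2n that stay >= 0 and end at 0."""
    count = 0
    for steps in product((1, -1), repeat=2 * n):
        s = 0
        stays_nonnegative = True
        for step in steps:
            s += step
            if s < 0:
                stays_nonnegative = False
                break
        if stays_nonnegative and s == 0:
            count += 1
    return count
```

Dividing `count_paths(n)` by `2 ** (2 * n)` gives the probability in the problem statement.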

Problem: Spring 2025, problem 6

Let $X_n$ be a nonconstant random variable taking values in $\{0,1,\dots,2n\}$ . Its generating function $G(z)=\mathbb{E}[z^{X_n}]$ is a polynomial of degree $2n$ and satisfies the Lee--Yang property: all zeros of $G$ lie on the unit circle.

Give an example of a random variable whose generating function has the Lee--Yang property.
Prove that for every nonnegative integer $m$ ,

\mathbb{E}\bigl[(X_n-n)^{2m+1}\bigr]=0.

Let

X_n^*:=\frac{X_n-\mathbb{E}[X_n]}{\sqrt{\operatorname{Var}(X_n)}}.

Prove that

1\le \mathbb{E}\bigl[(X_n^*)^4\bigr]<3.

Solution

(i) A standard example is

X_n\sim\mathrm{Bin}\!\left(2n,\frac12\right).

Then

G(z)=\left(\frac{1+z}{2}\right)^{2n},

and all zeros are at $z=-1$ , which lies on the unit circle.

(ii) Let

Y:=X_n-n.

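Since the coefficients of $G$ are real and all roots lie on the unit circle, the nonreal roots come in conjugate pairs $e^{\pm i\theta_k}$ . The only possible real roots are $\pm1$ . Here $z=1$ is not a root because $G(1)=1$ , and $z=-1$ must have even multiplicity because the degree is $2n$ . We can therefore write

G(z)=\lambda\prod_{k=1}^n(z^2-a_kz+1), \qquad a_k=2\cos\theta_k\in[-2,2),

where $\lambda=\mathbb{P}(X_n=2n)>0$ is the leading coefficient.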

Then

M_Y(t):=\mathbb{E}[e^{tY}] =e^{-nt}G(e^t) =\lambda\prod_{k=1}^n(e^t+e^{-t}-a_k).

The right-hand side is an even function of $t$ . Hence $M_Y(t)$ is even, and for every nonnegative integer $m$ ,

\mathbb{E}[Y^{2m+1}] =M_Y^{(2m+1)}(0)=0.

Thus

\mathbb{E}\bigl[(X_n-n)^{2m+1}\bigr]=0.

(iii) Taking $m=0$ in part (ii) gives

\mathbb{E}[X_n]=n.

So $Y=X_n-\mathbb{E}[X_n]$ . Let

M_Y(t)=\mathbb{E}[e^{tY}],\qquad f(t)=\log M_Y(t).

Since all odd moments vanish,

M_Y(t) =1+\frac{\mathbb{E}[Y^2]}{2}t^2 +\frac{\mathbb{E}[Y^4]}{24}t^4+o(t^4),

and therefore

f(t) =\frac{\mathbb{E}[Y^2]}{2}t^2 +\frac{\mathbb{E}[Y^4]-3(\mathbb{E}[Y^2])^2}{24}t^4 +o(t^4).

On the other hand, let $c_k:=2-a_k\in(0,4]$ . Then

f(t)=\sum_{k=1}^n \log\!\left(c_k+t^2+\frac{t^4}{12}+o(t^4)\right) +\log\lambda.

For each $k$ ,

\begin{aligned} \log\!\left(c_k+t^2+\frac{t^4}{12}+o(t^4)\right) &=\log c_k+\log\!\left(1+\frac{t^2}{c_k}+\frac{t^4}{12c_k}+o(t^4)\right)\\ &=\log c_k+\frac{t^2}{c_k} +\left(\frac{1}{12c_k}-\frac{1}{2c_k^2}\right)t^4 +o(t^4). \end{aligned}

Thus

f(t) =C+\sum_{k=1}^n \left[ \frac{t^2}{c_k} +\left(\frac{1}{12c_k}-\frac{1}{2c_k^2}\right)t^4 \right]+o(t^4),

where $C$ is a constant. Since $0<c_k\le4<6$ ,

\frac{1}{12c_k}-\frac{1}{2c_k^2} =\frac{c_k-6}{12c_k^2}<0.

Equivalently,

f^{(4)}(0) =24\sum_{k=1}^n \left(\frac{1}{12c_k}-\frac{1}{2c_k^2}\right) =\sum_{k=1}^n\left(\frac{2}{c_k}-\frac{12}{c_k^2}\right)<0.

So the coefficient of $t^4$ in $f(t)$ is negative. Hence

\mathbb{E}[Y^4]-3(\mathbb{E}[Y^2])^2<0,

\mathbb{E}[Y^4]<3(\mathbb{E}[Y^2])^2.

After standardization,

\mathbb{E}\bigl[(X_n^*)^4\bigr] =\frac{\mathbb{E}[Y^4]}{(\mathbb{E}[Y^2])^2}<3.

On the other hand, by Jensen's inequality, or by Cauchy's inequality,

\mathbb{E}\bigl[(X_n^*)^4\bigr] \ge \bigl(\mathbb{E}[(X_n^*)^2]\bigr)^2 =1.

Therefore

1\le \mathbb{E}\bigl[(X_n^*)^4\bigr]<3.
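
Parts (ii) and (iii) can be checked exactly on the example from part (i). The sketch below assumes $X_n\sim\mathrm{Bin}(2n,\frac12)$ and uses rational arithmetic. The function names are hypothetical, and the check covers only this one example, not the general statement.

```python
from fractions import Fraction
from math import comb


def central_moment(n, r):
    """E[(X - n)^r] for X ~ Bin(2n, 1/2), as an exact fraction."""
    return sum(
        Fraction(comb(2 * n, k), 4 ** n) * (k - n) ** r for k in range(2 * n + 1)
    )


def standardized_fourth_moment(n):
    """E[(X*)^4] = E[(X - n)^4] / (E[(X - n)^2])^2."""
    return central_moment(n, 4) / central_moment(n, 2) ** 2
```

For $n=1$ the variable $X_1-1$ takes the values $-1,0,1$ with probabilities $\frac14,\frac12,\frac14$ , so $\mathbb{E}[Y^2]=\mathbb{E}[Y^4]=\frac12$ and the standardized fourth moment is $2$ , inside $[1,3)$ .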

2. Simple random walk and common variants

First make sure you know the random-walk material in the notes. The earlier homework explanations are also worth reviewing carefully. In particular, know Theorems 2.5.2 through 2.5.5 in the textbook. In an exam, the reflection principle and the ballot theorem may be used directly.

Problem: Fall 2024, problem 5

In an election with two candidates, every ballot goes to exactly one of them. Suppose the final count is $\alpha$ votes for $T$ and $\beta$ votes for $H$ , with $\alpha\ge\beta$ . If the ballots are counted in a random order, find the probability that $T$ is never more than one vote behind $H$ during the count.

Solution

As before, build a random walk:

X_i= \begin{cases} 1, & \text{the }i\text{-th vote is for }T,\\ -1, & \text{the }i\text{-th vote is for }H, \end{cases} \qquad S_k=\sum_{i=1}^k X_i.

After $k$ votes, $S_k$ is the number of votes by which $T$ leads $H$ . The condition in the problem is

S_k\ge -1,\qquad 1\le k\le \alpha+\beta.

Now add one extra vote for $T$ to the beginning of every counting order. The new election has $\alpha+1$ votes for $T$ and $\beta$ votes for $H$ , and the new path satisfies

1+S_k\ge0.

Thus the original problem is the same as asking that, in the new election, $T$ is never behind $H$ .

Conversely, in any new counting order where $T$ is never behind $H$ , the first vote must be for $T$ . Removing that first vote recovers a valid order for the original problem. This is a bijection.

By the ballot theorem,

\frac{(\alpha+1)-\beta+1}{(\alpha+1)+1} \binom{\alpha+\beta+1}{\alpha+1} =\frac{\alpha-\beta+2}{\alpha+2} \binom{\alpha+\beta+1}{\alpha+1}

is the number of valid new orders. The total number of original counting orders is

\binom{\alpha+\beta}{\alpha}.

Therefore the desired probability is

\begin{aligned} &\frac{\dfrac{\alpha-\beta+2}{\alpha+2} \binom{\alpha+\beta+1}{\alpha+1}} {\binom{\alpha+\beta}{\alpha}}\\ &=\frac{\alpha-\beta+2}{\alpha+2} \cdot\frac{\alpha+\beta+1}{\alpha+1}. \end{aligned}

That is,

P(\text{$T$ is at most one vote behind $H$ during the count}) =\frac{(\alpha+\beta+1)(\alpha-\beta+2)} {(\alpha+1)(\alpha+2)}.
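
For small $\alpha$ and $\beta$ the formula can be compared with a brute-force count over all counting orders. This is a sketch with hypothetical function names. It enumerates the positions of the $\alpha$ votes for $T$ and checks $S_k\ge-1$ along each order.

```python
from fractions import Fraction
from itertools import combinations


def prob_brute(alpha, beta):
    """Fraction of counting orders in which T is never more than one vote behind."""
    total_votes = alpha + beta
    good = 0
    orders = 0
    for t_positions in combinations(range(total_votes), alpha):
        t_set = set(t_positions)
        s = 0
        ok = True
        for i in range(total_votes):
            s += 1 if i in t_set else -1
            if s < -1:
                ok = False
                break
        good += ok
        orders += 1
    return Fraction(good, orders)


def prob_formula(alpha, beta):
    """(alpha + beta + 1)(alpha - beta + 2) / ((alpha + 1)(alpha + 2)), for alpha >= beta."""
    return Fraction((alpha + beta + 1) * (alpha - beta + 2), (alpha + 1) * (alpha + 2))
```

For $\alpha=\beta=2$ there are $6$ orders and only HHTT fails, so both functions return $\frac56$ .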

3. Probability and other subjects

Problem: Spring 2025, problem 5

Here is an example where probability meets linear algebra. Let $X_n=(X_{ij})$ be an $n\times n$ matrix whose $n^2$ entries are independent symmetric Bernoulli random variables:

\mathbb{P}(X_{ij}=0)=\mathbb{P}(X_{ij}=1)=\frac12.

Define $p_n=\mathbb{P}(\det(X_n)\text{ is odd})$ . Compute $p_2$ and $p_3$ , then guess and prove a general formula for $p_n$ .

Solution

The key observation is that an integer is odd if and only if it is nonzero modulo $2$ . Hence

\det(X_n)\text{ is odd} \quad\Longleftrightarrow\quad \det(X_n)\not\equiv0\pmod2.

So, if we view $X_n$ as a matrix over $\mathbf{F}_2=\{0,1\}$ , the problem becomes

p_n=\mathbb{P}(X_n\text{ is invertible over }\mathbf{F}_2).

Compute this by checking whether the rows are linearly independent. Let

R_1,R_2,\dots,R_n\in\mathbf{F}_2^n

be the rows of $X_n$ . Since the entries are independent and equally likely to be $0$ or $1$ , each row is uniform on $\mathbf{F}_2^n$ , and the rows are independent.

The probability that the first row is nonzero is

\frac{2^n-1}{2^n}=1-2^{-n}.

If the first $k$ rows are linearly independent, their span has $2^k$ vectors. Thus the conditional probability that row $k+1$ lies outside this span is

\frac{2^n-2^k}{2^n}=1-2^{k-n}.

Therefore

p_n=\prod_{k=0}^{n-1}(1-2^{k-n}) =\prod_{j=1}^n(1-2^{-j}).

In particular,

p_2=\left(1-\frac12\right)\left(1-\frac14\right)=\frac38,

and

p_3=\left(1-\frac12\right)\left(1-\frac14\right)\left(1-\frac18\right)=\frac{21}{64}.

The general formula is

p_n=\prod_{j=1}^n\left(1-\frac{1}{2^j}\right).
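
The values of $p_2$ and $p_3$ can be confirmed by enumerating all $2^{n^2}$ matrices and testing invertibility over $\mathbf{F}_2$ . In this sketch each row is stored as an $n$-bit integer and row reduction uses XOR. The function names are hypothetical, and the enumeration is only feasible for small $n$ .

```python
from fractions import Fraction
from itertools import product


def invertible_mod2(rows, n):
    """Gaussian elimination over F_2; rows are n-bit integers."""
    rows = list(rows)
    rank = 0
    for col in range(n):
        pivot = None
        for r in range(rank, n):
            if (rows[r] >> col) & 1:
                pivot = r
                break
        if pivot is None:
            return False  # no pivot in this column, so the rank is below n
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(n):
            if r != rank and (rows[r] >> col) & 1:
                rows[r] ^= rows[rank]  # addition in F_2 is XOR
        rank += 1
    return True


def p_brute(n):
    """P(det is odd) by enumerating all 0/1 matrices of size n."""
    good = sum(invertible_mod2(rows, n) for rows in product(range(2 ** n), repeat=n))
    return Fraction(good, 2 ** (n * n))


def p_formula(n):
    """prod_{j=1}^n (1 - 2^{-j})."""
    result = Fraction(1)
    for j in range(1, n + 1):
        result *= 1 - Fraction(1, 2 ** j)
    return result
```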

Problem: Fall 2020, problem 4

Let $S_n$ be the symmetric group, the set of all bijections from $\{1,2,\cdots,n\}$ to itself. Choose $\sigma$ uniformly from $S_n$ . Let the number of fixed points be

X(\sigma)=\left|\{k:\sigma(k)=k\}\right|,

and let the number of transpositions be

Y(\sigma)=\left|\{(i,j):i<j,\ \sigma(i)=j,\ \sigma(j)=i\}\right|.

Give the probability space in detail.
Are $X$ and $Y$ independent?
Find the probability mass function of $X$ .
Compute $\mathbb{E}[Y]$ .

Solution

(1) Take

\Omega=S_n,\qquad \mathcal{F}=2^{S_n},\qquad P(A)=\frac{|A|}{n!}\quad(A\subset S_n).

A sample point is a permutation $\sigma$ , and $X,Y$ are random variables on this probability space.

(2) For $n\ge2$ , $X$ and $Y$ are not independent. Indeed,

P(X=n)=\frac1{n!}>0,

and

P(Y>0)>0,

because, for example, the permutation $(1\ 2)$ , which swaps $1$ and $2$ and fixes every other point, has $Y=1$ . But if $X=n$ , then $\sigma$ must be the identity permutation, so $Y=0$ . Hence

P(X=n,Y>0)=0\ne P(X=n)P(Y>0).

Thus $X$ and $Y$ are not independent.

(3) For $k=0,1,\dots,n$ , first choose which $k$ points are fixed. This can be done in

\binom{n}{k}

ways. The remaining $n-k$ points must form a derangement. Let $D_m$ be the number of derangements of $m$ elements. Then

P(X=k)=\frac{\binom{n}{k}D_{n-k}}{n!}, \qquad k=0,1,\dots,n.

By inclusion-exclusion,

D_m=m!\sum_{j=0}^m\frac{(-1)^j}{j!}.

Therefore

P(X=k)=\frac{1}{k!}\sum_{j=0}^{n-k}\frac{(-1)^j}{j!}, \qquad k=0,1,\dots,n.

This is the probability mass function of $X$ .

(4) For each $1\le i<j\le n$ , define

I_{ij}=\mathbf{1}_{\{\sigma(i)=j,\ \sigma(j)=i\}}.

Then

Y=\sum_{1\le i<j\le n} I_{ij}.

By linearity of expectation,

\mathbb{E}[Y] =\sum_{1\le i<j\le n}\mathbb{E}[I_{ij}] =\sum_{1\le i<j\le n}P(\sigma(i)=j,\sigma(j)=i).

For a fixed pair $i<j$ , there are $(n-2)!$ permutations with $\sigma(i)=j$ and $\sigma(j)=i$ . Hence

P(\sigma(i)=j,\sigma(j)=i) =\frac{(n-2)!}{n!} =\frac{1}{n(n-1)}.

Thus

\mathbb{E}[Y] =\binom{n}{2}\frac{1}{n(n-1)} =\frac12.
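
Parts (3) and (4) can be checked by enumerating $S_n$ for small $n$ . The sketch below uses 0-based labels and hypothetical function names. It compares the brute-force distribution of $X$ with the derangement formula and computes $\mathbb{E}[Y]$ exactly.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def fixed_points(sigma):
    """X(sigma): number of k with sigma(k) = k."""
    return sum(1 for k, image in enumerate(sigma) if image == k)


def swapped_pairs(sigma):
    """Y(sigma): number of pairs i < j with sigma(i) = j and sigma(j) = i."""
    return sum(1 for i, j in enumerate(sigma) if i < j and sigma[j] == i)


def pmf_X_brute(n):
    """[P(X = 0), ..., P(X = n)] by enumerating all n! permutations."""
    counts = [0] * (n + 1)
    for sigma in permutations(range(n)):
        counts[fixed_points(sigma)] += 1
    return [Fraction(c, factorial(n)) for c in counts]


def pmf_X_formula(n, k):
    """P(X = k) = (1/k!) * sum_{j=0}^{n-k} (-1)^j / j!."""
    return Fraction(1, factorial(k)) * sum(
        Fraction((-1) ** j, factorial(j)) for j in range(n - k + 1)
    )


def mean_Y_brute(n):
    """E[Y] by enumerating all n! permutations."""
    total = sum(swapped_pairs(sigma) for sigma in permutations(range(n)))
    return Fraction(total, factorial(n))
```

Note that the formula assigns $P(X=n-1)=0$ , as it should: a permutation cannot have exactly $n-1$ fixed points.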

End-of-chapter checklist

The original problems and solutions in this chapter come from the corresponding TeX source file.
You can first read only the problem boxes, write down the key identities, and then open the proofs or solutions.
If a result uses independence, countable additivity, a change of variables, or a moment condition, it is worth marking that point explicitly.