Concept index

Hanzhe Li, Jinghan Liu, Jieyang Hu

This page collects definitions and tools that are used repeatedly in the main text but are not restated every time. Each entry keeps only the most common checks and formulas, so it can be used as a quick reference.

Basic modeling and distribution functions

Definition: probability space

A probability space is a triple $(\Omega,\mathcal F,\mathbb P)$ . Here $\Omega$ is the sample space, $\mathcal F$ is the event space, and $\mathbb P:\mathcal F\to[0,1]$ satisfies $\mathbb P(\Omega)=1$ and countable additivity. In problems, first identify what the outcomes are, what the events are, and how probabilities are assigned.

Definition: sigma-algebra

A collection $\mathcal F$ is a $\sigma$ -algebra on $\Omega$ if $\Omega\in\mathcal F$ , and if it is closed under complements and countable unions. By De Morgan's laws, it is also closed under countable intersections. It tells us which sets are allowed to have probabilities.

Tool: constructing a probability space from a random experiment

For a finite or countable model, write it in three steps:

Sample space $\Omega$ : list all possible outcomes.
Probability $\mathbb P$ : state whether outcomes are equally likely, or give the weights.
Random variable $X$ : map each outcome to a number.

This avoids mixing up the outcome of the random experiment with the value of a random variable.
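
As a minimal sketch of these three steps, the following Python fragment models two fair dice. The names `omega`, `prob`, `X`, and `law` are illustrative choices, not notation from the main text; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Step 1: sample space, all ordered outcomes of two dice.
omega = list(product(range(1, 7), repeat=2))

# Step 2: probability, here equally likely outcomes.
prob = {w: Fraction(1, len(omega)) for w in omega}

# Step 3: random variable, a function of the outcome (the sum of the faces).
def X(w):
    return w[0] + w[1]

def law(rv):
    """Distribution of rv: value -> probability, summed over outcomes."""
    dist = {}
    for w, p in prob.items():
        dist[rv(w)] = dist.get(rv(w), Fraction(0)) + p
    return dist

dist_X = law(X)
```

Here `omega` holds outcomes such as `(3, 4)`, while `dist_X` is indexed by values of $X$ such as `7`; keeping the two objects separate is the point of the three-step layout.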

Definition: random variable

A random variable is a measurable function $X:\Omega\to\mathbb R$ from the sample space to the real line. Many random variables can be defined on the same probability space. In many problems, the clean order is to write $\Omega$ and $\mathbb P$ first, then define $X(\omega)$ .

Definition: independence

A family of events $\{A_i:i\in I\}$ is mutually independent if for every finite set of distinct indices $i_1,\dots,i_k$ ,

\mathbb P(A_{i_1}\cap\cdots\cap A_{i_k}) =\prod_{j=1}^k \mathbb P(A_{i_j}).

Pairwise independence only checks the case $k=2$ , and is strictly weaker than mutual independence.

Definition: distribution function

The distribution function of a random variable $X$ is $F(x)=\mathbb P(X\le x)$ . It is nondecreasing, right-continuous, and satisfies

\lim_{x\to-\infty}F(x)=0,\qquad \lim_{x\to+\infty}F(x)=1.

Point masses are jumps:

\mathbb P(X=x)=F(x)-F(x-).

Tool: checking whether a function is a distribution function

To check that a function $F$ is a distribution function, usually verify:

$F$ is nondecreasing.
$F$ is right-continuous.
$\lim_{x\to-\infty}F(x)=0$ .
$\lim_{x\to+\infty}F(x)=1$ .

If $F,G$ are distribution functions and $0\leq\lambda\leq 1$ , then

\lambda F+(1-\lambda)G

is also a distribution function.

Tool: constructing a random variable from a distribution function

If $U\sim U[0,1]$ , the inverse transform construction

X=F^{-1}(U),\qquad F^{-1}(u)=\inf\{x:F(x)\ge u\}

gives a random variable with distribution function $F$ .
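
The generalized inverse can be sketched in Python for a small discrete law. The values and weights below are hypothetical, chosen only for illustration; `quantile` implements $F^{-1}(u)=\inf\{x:F(x)\ge u\}$ by searching the cumulative sums.

```python
from bisect import bisect_left
from fractions import Fraction
import random

# Hypothetical discrete law: values with their probabilities.
values = [0, 1, 2]
probs = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]

# Cumulative values F(0), F(1), F(2) of the distribution function.
cum = []
total = Fraction(0)
for p in probs:
    total += p
    cum.append(total)

def quantile(u):
    """Generalized inverse F^{-1}(u) = inf{x : F(x) >= u}, for 0 <= u <= 1."""
    # bisect_left returns the first index i with cum[i] >= u.
    return values[bisect_left(cum, u)]

def sample(rng):
    """Draw X = F^{-1}(U) with U uniform on [0, 1)."""
    return quantile(rng.random())
```

At a jump of $F$, for example $u=F(0)=1/5$, the infimum picks the smaller value, which is why `bisect_left` rather than `bisect_right` is used in this sketch.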

Conditional expectation, indicators, and second moments

Tool: tail-sum formula

If $X$ is a nonnegative integer-valued random variable, then

\mathbb E X=\sum_{n=0}^{\infty}\mathbb P(X>n).

If $X\geq 0$ is a general nonnegative random variable, then

\mathbb E X=\int_0^\infty \mathbb P(X>t)\,dt

in the extended sense, allowing the value $+\infty$ .
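
A quick exact check of the integer-valued formula, using one roll of a fair die as a hypothetical example (the names `pmf` and `tail` are illustrative):

```python
from fractions import Fraction

# Hypothetical example: X is one roll of a fair six-sided die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Direct computation of the mean.
mean_direct = sum(k * p for k, p in pmf.items())

def tail(n):
    """P(X > n)."""
    return sum(p for k, p in pmf.items() if k > n)

# Tail-sum computation: sum over n >= 0 of P(X > n); terms vanish for n >= 6.
mean_tail = sum(tail(n) for n in range(0, 6))
```

Both computations give $7/2$; the tail-sum version adds $6/6+5/6+\cdots+1/6$.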

Tool: conditioning

For a mixture distribution or a multi-stage experiment, first choose a variable $Y$ that simplifies the structure. Then use

\mathbb P(A)=\sum_y \mathbb P(A\mid Y=y)\mathbb P(Y=y), \qquad \mathbb E X=\mathbb E[\mathbb E(X\mid Y)].

In the continuous case, replace sums by integrals.

Definition: conditional expectation

\mathbb E[X\mid\mathcal F]

is the average prediction of $X$ once the information $\mathcal F$ (here a sub-$\sigma$-algebra of the event space) is given. In the discrete case, one can think of $\mathcal F$ as dividing the sample space into conditional blocks; the conditional expectation is the average on each block. A common formula is the law of total expectation, a special case of the tower property:

\mathbb E X=\mathbb E[\mathbb E(X\mid Y)].

Tool: indicator decomposition

Counting random variables are often written as

N=\sum_i I_i.

Then

\mathbb E N=\sum_i\mathbb E I_i,

and the variance can be computed by

\operatorname{Var}(N)=\sum_i\operatorname{Var}(I_i) +2\sum_{i<j}\operatorname{Cov}(I_i,I_j).

This is useful for counting adjacency relations, local structures, and numbers of appearances.
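
As an illustration, take $N$ to be the number of fixed points of a uniformly random permutation of $n$ elements, a standard example that is assumed here rather than taken from the main text. The sketch below compares a brute-force computation with the indicator formulas, using $I_i=\mathbf 1\{\text{position } i \text{ is fixed}\}$.

```python
from fractions import Fraction
from itertools import permutations

n = 4  # small enough to enumerate all n! permutations
perms = list(permutations(range(n)))
p_each = Fraction(1, len(perms))

# Direct computation from the full distribution of N.
counts = [sum(1 for i in range(n) if s[i] == i) for s in perms]
mean_direct = sum(c * p_each for c in counts)
var_direct = sum(c * c * p_each for c in counts) - mean_direct ** 2

# Indicator computation.
p1 = Fraction(1, n)              # P(I_i = 1)
p2 = Fraction(1, n * (n - 1))    # P(I_i = 1, I_j = 1) for i != j
mean_ind = n * p1
var_i = p1 * (1 - p1)            # Var(I_i)
cov_ij = p2 - p1 * p1            # Cov(I_i, I_j)
n_pairs = n * (n - 1) // 2       # number of pairs i < j
var_ind = n * var_i + 2 * n_pairs * cov_ij
```

Both routes give mean $1$ and variance $1$. The indicator route needs only $\mathbb P(I_i=1)$ and $\mathbb P(I_i=1,I_j=1)$, not the full distribution of $N$.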

Tool: linearity of covariance

Covariance is linear in each argument. For example,

\operatorname{Cov}(aX+bY,Z) =a\operatorname{Cov}(X,Z)+b\operatorname{Cov}(Y,Z).

If $X,Y$ are independent and have finite second moments, then

\operatorname{Cov}(X,Y)=0.

For sample means, centered variables, and projection residuals, covariance linearity often gives the answer in one line.

Tool: higher even-moment method and Markov's inequality

The $2m$-th moment method rewrites a tail event in terms of a higher even power. If $m\geq 1$ and $\mathbb E|X|^{2m}<\infty$, then for every $a>0$ Markov's inequality gives

\mathbb P(|X|\ge a) =\mathbb P(|X|^{2m}\ge a^{2m}) \le \frac{\mathbb E|X|^{2m}}{a^{2m}}.

When $m=1$ and $\mathbb E X=0$ , this becomes Chebyshev's inequality:

\mathbb P(|X|\ge a)\le \frac{\operatorname{Var}(X)}{a^2}.

This is often used to prove convergence in probability. If

\mathbb E|X_n-c|^{2m}\to 0,

then

X_n\xrightarrow{P}c.

The usual move is to write the target difference as $X_n-c$ and bound an even moment. If the second moment is too weak, try the fourth, sixth, or another higher even moment.
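
To see how much a higher even moment can help, the following sketch compares the $m=1$ and $m=2$ bounds with the exact tail for a centered binomial variable. The parameters `n` and `a` are hypothetical, and the moments are computed directly from the probability mass function.

```python
from math import comb

n, a = 100, 20  # hypothetical sample size and deviation level
pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]  # law of S ~ Binomial(n, 1/2)
mean = n / 2

def central_abs_moment(r):
    """E|S - E S|^r computed from the probability mass function."""
    return sum(abs(k - mean) ** r * p for k, p in enumerate(pmf))

# Exact tail probability P(|S - E S| >= a).
exact = sum(p for k, p in enumerate(pmf) if abs(k - mean) >= a)

# Markov bounds with the 2m-th moment, for m = 1 (Chebyshev) and m = 2.
bound_m1 = central_abs_moment(2) / a ** 2
bound_m2 = central_abs_moment(4) / a ** 4
```

With these parameters the Chebyshev bound is $25/400=0.0625$, and the fourth-moment bound is roughly $0.0116$. Both are valid upper bounds on the exact tail, and the fourth-moment bound is the tighter one here.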

Characteristic functions and independence

Definition: characteristic function

The characteristic function of a random variable $X$ is

\varphi_X(t)=\mathbb E e^{itX}.

It always exists, and $\varphi_X(0)=1$ . A distribution is uniquely determined by its characteristic function, so this tool is well suited to independent sums and limiting distributions.

Tool: characteristic function of an independent sum

If $X,Y$ are independent, then

\varphi_{X+Y}(t)=\varphi_X(t)\varphi_Y(t).

More generally, a sum of independent random variables corresponds to a product of characteristic functions. For limits of independent sums, first write the characteristic function of each term, then study the product.
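
A small numerical check of the product rule, assuming for illustration that $X$ and $Y$ are independent fair dice (the function names are hypothetical):

```python
import cmath
from itertools import product

# Hypothetical example: X and Y are independent fair dice.
faces = range(1, 7)

def cf_die(t):
    """Characteristic function E exp(itX) of one fair die."""
    return sum(cmath.exp(1j * t * k) for k in faces) / 6

def cf_sum(t):
    """Characteristic function of X + Y, computed from the joint law."""
    return sum(cmath.exp(1j * t * (x + y)) for x, y in product(faces, faces)) / 36

# Largest discrepancy between the two sides over a few test points.
max_gap = max(abs(cf_sum(t) - cf_die(t) ** 2) for t in [-2.0, -0.5, 0.0, 0.3, 1.7])
```

The value `max_gap` is zero up to floating-point rounding, since the joint law of independent dice puts mass $1/36$ on each pair.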

Tool: testing independence by the joint characteristic function

The joint characteristic function is

\varphi_{X,Y}(s,t)=\mathbb E e^{i(sX+tY)}.

If

\varphi_{X,Y}(s,t)=\varphi_X(s)\varphi_Y(t) \quad\text{for all }s,t,

then $X$ and $Y$ are independent. Knowing only

\varphi_{X+Y}(t)=\varphi_X(t)\varphi_Y(t)

does not in general imply independence of $X$ and $Y$, because it only checks the diagonal $s=t$ of the joint characteristic function.
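
A standard counterexample, added here as an illustration: let $X$ have the standard Cauchy distribution, so that $\varphi_X(t)=e^{-|t|}$, and take $Y=X$. Then

\varphi_{X+Y}(t)=\varphi_{2X}(t)=\varphi_X(2t)=e^{-2|t|}=\varphi_X(t)\varphi_Y(t),

although $X$ and $Y$ are clearly not independent.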

Tool: convergence of characteristic functions

If, for every $t\in\mathbb R$,

\varphi_{X_n}(t)\to \varphi(t),

and the limit $\varphi$ is continuous at $0$, then $\varphi$ is the characteristic function of some random variable $X$ (Lévy's continuity theorem), and

X_n\xrightarrow{d}X.

In particular, if the limit is

e^{-t^2/2},

then the limiting distribution is $N(0,1)$ .
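
As a sketch of how this is typically used, assume $X_i$ are i.i.d. with mean $0$, variance $1$, and common characteristic function $\varphi$, and write $S_n=\sum_{i=1}^n X_i$. The second-order expansion $\varphi(t)=1-t^2/2+o(t^2)$ as $t\to0$ gives, for each fixed $t$,

\varphi_{S_n/\sqrt n}(t) =\left[\varphi\left(\frac{t}{\sqrt n}\right)\right]^n =\left[1-\frac{t^2}{2n}+o\left(\frac1n\right)\right]^n \to e^{-t^2/2}.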

Convergence in distribution and test functions

Definition: convergence in distribution

X_n\xrightarrow{d}X

is equivalent to

F_n(x)\to F(x)

at every continuity point $x$ of the distribution function of $X$ . It is also equivalent to

\mathbb E h(X_n)\to \mathbb E h(X)

for every bounded continuous function $h$ . When using distribution functions, take limits directly only at continuity points.

Tool: Skorohod representation

If $X_n\xrightarrow{d}X$, then one can construct, on a suitable common probability space, copies with the same distributions,

\widetilde X_n\stackrel d=X_n,\qquad \widetilde X\stackrel d=X,

such that

\widetilde X_n\to\widetilde X\quad a.s.

This can turn a weak convergence problem into an almost sure convergence problem. However, the theorem only concerns the constructed copies; it does not say that the original $X_n$ converge almost surely.

Tool: independence is preserved under almost sure limits

If $X_n\to X$ a.s., $Y_n\to Y$ a.s., and $X_n,Y_n$ are independent for each $n$ , then $X,Y$ are independent. One proof uses bounded continuous test functions. For any bounded continuous $f,g$ ,

\mathbb E f(X_n)g(Y_n) =\mathbb E f(X_n)\mathbb E g(Y_n),

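and since $f,g$ are bounded and continuous, the dominated convergence theorem lets both sides pass to the limit. This gives $\mathbb E f(X)g(Y)=\mathbb E f(X)\mathbb E g(Y)$ for all such $f,g$, which characterizes independence.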

Limit theorem toolbox

Tool: law of large numbers

If $X_i$ are i.i.d. and $\mathbb E|X_1|<\infty$ , then

\frac1n\sum_{i=1}^n X_i\xrightarrow{P}\mathbb E X_1.

Use this to replace a sample average by the theoretical mean. Before applying it, check independence, identical distribution, and the first moment condition.

Tool: central limit theorem

If $X_i$ are i.i.d., $\mathbb E X_i=0$ , and $\operatorname{Var}(X_i)=1$ , then

\frac1{\sqrt n}\sum_{i=1}^n X_i\xrightarrow{d}N(0,1).

In the general case, center first and divide by the standard deviation. Before applying it, check the mean, variance, and i.i.d. assumptions.

Tool: Slutsky's theorem

If

X_n\xrightarrow{d}X,\qquad Y_n\xrightarrow{P}c,

then

X_nY_n\xrightarrow{d}cX,\qquad X_n+Y_n\xrightarrow{d}X+c.

In particular, if the denominator converges in probability to $1$ , then

\frac{X_n}{Y_n}\xrightarrow{d}X.

This is commonly used for random normalizations and negligible error terms.
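
For instance, in a standard application (stated under the added assumptions that $X_i$ are i.i.d. with mean $\mu$ and variance $\sigma^2\in(0,\infty)$, that $\bar X_n$ is the sample mean, and that $S_n$ is the sample standard deviation, so $S_n\xrightarrow{P}\sigma$), write

\frac{\sqrt n(\bar X_n-\mu)}{S_n} =\frac{\sqrt n(\bar X_n-\mu)}{\sigma}\cdot\frac{\sigma}{S_n} \xrightarrow{d}N(0,1).

The first factor converges in distribution to $N(0,1)$ by the central limit theorem, and the second converges in probability to $1$.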

Tool: method of moments

If all moments converge to the moments of a distribution that is uniquely determined by its moments, then convergence in distribution follows. For a standard normal variable $N$ , odd moments are $0$ , and

\mathbb E[N^{2k}]=(2k-1)!!.

When using this method, say why the target distribution is determined by its moments. It is not enough to write only "the moments converge."

Triangular arrays

Definition: normalized sum in a triangular array

For sums whose entries change with each row,

\sum_{k=1}^n Y_{n,k},

we often write

B_n^2=\sum_{k=1}^n\operatorname{Var}(Y_{n,k}).

The normalized object is

\frac1{B_n}\sum_{k=1}^n (Y_{n,k}-\mathbb E Y_{n,k}).

First compute $B_n^2$ , then check the relevant central limit theorem condition.

Tool: Lindeberg condition

If the entries are independent within each row and centered ($\mathbb E Y_{n,k}=0$), and for every $\varepsilon>0$

\frac1{B_n^2}\sum_{k=1}^n \mathbb E\left[ Y_{n,k}^2\mathbf 1_{\{|Y_{n,k}|>\varepsilon B_n\}} \right]\to0,

then a central limit theorem holds: $\frac1{B_n}\sum_{k=1}^n Y_{n,k}\xrightarrow{d}N(0,1)$. The steps are:

First compute $B_n^2$ .
Then write the Lindeberg term.
Control it using tail integrability or a stronger moment condition.

Tool: third-moment criterion for Lindeberg

If

\frac1{B_n^3}\sum_{k=1}^n \mathbb E|Y_{n,k}|^3\to0,

then the Lindeberg condition holds. Indeed, on $|Y_{n,k}|>\varepsilon B_n$ ,

Y_{n,k}^2\le \frac{|Y_{n,k}|^3}{\varepsilon B_n}.

This is a common quick check in textbooks.
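
A typical case, given here as an illustrative sketch: suppose the $Y_{n,k}$ are centered, uniformly bounded by a constant $M$ (so $|Y_{n,k}|\le M$), and $B_n\to\infty$. Then $\mathbb E|Y_{n,k}|^3\le M\,\mathbb E Y_{n,k}^2=M\operatorname{Var}(Y_{n,k})$, so

\frac1{B_n^3}\sum_{k=1}^n \mathbb E|Y_{n,k}|^3 \le \frac{M}{B_n^3}\sum_{k=1}^n\operatorname{Var}(Y_{n,k}) =\frac{M}{B_n}\to0.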

Advanced tools: tail bounds and concentration

Tool: Paley-Zygmund and the second moment lower bound

If $X\geq 0$ and $0<\theta<1$ , then

\mathbb P(X\ge \theta \mathbb E X) \ge (1-\theta)^2\frac{(\mathbb E X)^2}{\mathbb E[X^2]}.

Letting $\theta\downarrow0$ gives the second moment method:

\mathbb P(X>0)\ge \frac{(\mathbb E X)^2}{\mathbb E[X^2]}.

This is useful when you want to prove that some structure appears at least once. Usually $X$ is the number of appearances; compute $\mathbb E X$ and control $\mathbb E X^2$ .
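
The second bound can also be seen directly by a standard Cauchy-Schwarz argument, sketched here under the assumption $0<\mathbb E[X^2]<\infty$. Since $X\ge0$,

\mathbb E X=\mathbb E\left[X\mathbf 1_{\{X>0\}}\right] \le \sqrt{\mathbb E[X^2]\,\mathbb P(X>0)}.

Squaring both sides and dividing by $\mathbb E[X^2]$ gives $\mathbb P(X>0)\ge(\mathbb E X)^2/\mathbb E[X^2]$.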

Tool: event version of the second moment method

Let

B_n=\bigcup_{i=1}^{m_n} A_{n,i},\qquad \mu_n=\sum_{i=1}^{m_n}\mathbb P(A_{n,i}).

Let $\gamma_n$ collect only the dependent pairs, written $i\sim j$:

\gamma_n=\sum_{i\sim j}\mathbb P(A_{n,i}\cap A_{n,j}).

In many counting problems, $\mu_n\to\infty$ and $\gamma_n=o(\mu_n^2)$ imply $\mathbb P(B_n)\to1$. This is a common template in random graphs and random structures.

Tool: Chernoff-Cramer bound

If the moment generating function

M_X(s)=\mathbb E e^{sX}

is finite in the relevant range, write

\Psi_X(s)=\log M_X(s).

For $s>0$ , exponential Markov gives

\mathbb P(X\ge \beta) \le \exp\{-s\beta+\Psi_X(s)\}.

Thus one usually writes

\mathbb P(X\ge \beta) \le \inf_{s>0}\exp\{-s\beta+\Psi_X(s)\}.

This is the starting point of many exponential tail bounds: write the moment generating function, then optimize over $s$ .
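
The two steps can be sketched numerically for a binomial variable. The parameters are hypothetical, and the optimization over $s$ is done by a crude grid search rather than in closed form.

```python
from math import comb, exp, log

n, p, beta = 100, 0.5, 70  # hypothetical: S ~ Binomial(n, p), threshold beta

def psi(s):
    """Log moment generating function of S: n * log(1 - p + p * e^s)."""
    return n * log(1 - p + p * exp(s))

def chernoff(s):
    """Exponential Markov bound on P(S >= beta) for a fixed s > 0."""
    return exp(-s * beta + psi(s))

# Crude optimization: minimize over a grid of s in (0, 3].
grid = [k / 1000 for k in range(1, 3001)]
bound = min(chernoff(s) for s in grid)

# Exact tail probability for comparison.
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(beta, n + 1))
```

In this example the exponent is minimized at $s=\log(7/3)$, which the grid approximates closely. Any single $s>0$ already gives a valid bound, so the optimization only sharpens the constant in the exponent.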

Definition: sub-Gaussian random variable

Let $\mu=\mathbb E X$ . If there is a $\nu>0$ such that for all $s\in\mathbb R$ ,

\Psi_{X-\mu}(s)\le \frac{\nu s^2}{2},

then $X$ is called sub-Gaussian with parameter $\nu$ . A typical tail bound is

\mathbb P(|X-\mu|\ge \beta) \le 2\exp\left(-\frac{\beta^2}{2\nu}\right).

Bounded variables, normal variables, and many independent sums have this square-exponential tail behavior.

Tool: Hoeffding-type bound for weighted sums

Suppose $X_i$ are independent and $X_i\in \mathrm{s}\mathcal G(\nu_i)$ . Let

S=\sum_{i=1}^n w_iX_i,\qquad V=\sum_{i=1}^n w_i^2\nu_i.

Then $S$ is again sub-Gaussian, with parameter $V$, and

\mathbb P(|S-\mathbb ES|\ge \beta) \le 2\exp\left(-\frac{\beta^2}{2V}\right).

Use this for independent weighted sums, random signs, and deviations of empirical averages. The main step is computing the variance proxy $V$ correctly.
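
As a sketch, take the $X_i$ to be independent random signs, which are sub-Gaussian with $\nu_i=1$ because $\log\cosh s\le s^2/2$. The weights below are hypothetical, and the exact tail is computed by enumerating all sign patterns.

```python
from itertools import product
from math import exp

# Hypothetical weights; X_i are independent random signs (+1 or -1, each with
# probability 1/2), which are sub-Gaussian with nu_i = 1.
w = [3, 2, 2, 1, 1, 1, 1, 1]
V = sum(wi ** 2 for wi in w)  # variance proxy: sum of w_i^2 * nu_i with nu_i = 1

def exact_tail(beta):
    """P(|S| >= beta) by enumerating all 2^n sign patterns (E S = 0 here)."""
    hits = sum(
        1
        for signs in product((-1, 1), repeat=len(w))
        if abs(sum(wi * si for wi, si in zip(w, signs))) >= beta
    )
    return hits / 2 ** len(w)

def hoeffding(beta):
    """Two-sided sub-Gaussian bound 2 * exp(-beta^2 / (2 V))."""
    return 2 * exp(-beta ** 2 / (2 * V))
```

Comparing `exact_tail(beta)` with `hoeffding(beta)` for several values of `beta` shows how conservative the bound is for a given weight vector. The large weight $3$ contributes $9$ of the total $V=22$, which illustrates why the variance proxy has to use $w_i^2$.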

Definition: sub-exponential random variable

Let $\mu=\mathbb E X$ . If there are $\nu,\alpha>0$ such that for $|s|<1/\alpha$ ,

\Psi_{X-\mu}(s)\le \frac{\nu s^2}{2},

then $X$ is called sub-exponential with parameters $(\nu,\alpha)$ . Its one-sided tail bound is

\mathbb P(X-\mu\ge \beta)\le \begin{cases} \exp\left(-\dfrac{\beta^2}{2\nu}\right),&0<\beta\le \nu/\alpha,\\ \exp\left(-\dfrac{\beta}{2\alpha}\right),&\beta>\nu/\alpha. \end{cases}

Small deviations look sub-Gaussian; large deviations become exponential.

Tool: Bernstein-type bound for bounded variables

Let $X_1,\dots,X_n$ be independent, with $\mu_i=\mathbb E X_i$ , $\operatorname{Var}(X_i)=\sigma_i^2$ , and

|X_i-\mu_i|\le c.

For $S_n=\sum_iX_i$ and $V=\sum_i\sigma_i^2$ , a common one-sided Bernstein-type bound is

\mathbb P(S_n-\mathbb ES_n\ge \beta)\le \begin{cases} \exp\left(-\dfrac{\beta^2}{4V}\right),&0<\beta\le V/c,\\ \exp\left(-\dfrac{\beta}{4c}\right),&\beta>V/c. \end{cases}

For a two-sided bound, apply the same inequality to $-X_i$ as well. This is often much sharper than Chebyshev for sums of independent bounded variables.

Tool: Borel-Cantelli template

If

\sum_{n=1}^{\infty}\mathbb P(A_n)<\infty,

then

\mathbb P(A_n\ \text{i.o.})=0.

This is often used to prove that a bad event happens only finitely many times, which then gives an eventual almost sure bound. If the events $A_n$ are independent and $\sum_n\mathbb P(A_n)=\infty$ , the second Borel-Cantelli lemma gives $\mathbb P(A_n\ \text{i.o.})=1$ .

Quick reference

To prove convergence in probability: first try a higher even-moment method or Chebyshev.
To prove convergence in distribution to a normal law: first try CLT plus Slutsky.
For triangular arrays: check Lindeberg or the third-moment criterion.
For independent sums: consider characteristic functions.
To prove a nonnegative count is positive: try Paley-Zygmund or a second moment lower bound.
For exponential tail bounds: write the moment generating function and try Chernoff-Cramer.
For independent weighted sums: check whether a Hoeffding-type bound applies.
For sums of independent bounded variables: consider a Bernstein-type bound.
For tail probabilities of a maximum: first try a union bound, then combine it with Chernoff-Cramer, Hoeffding, or Bernstein.
For eventual almost sure statements: consider Borel-Cantelli.
For counting problems: write the count as a sum of indicator variables.
For limits of distribution functions: take limits directly only at continuity points.
For limits of expectations when you only have convergence in distribution: consider Skorohod representation or uniform integrability.

Reading warning

In probability, many uses of "obvious" rely on countable additivity, monotone convergence, independence, moment conditions, or the assumptions of a limit theorem. When reading a proof, it is better to mark these conditions step by step.