What happens if we toss a fair coin infinitely many times?

Thu, Mar 18, 2021 · 4-minute read · probability


Suppose we have a fair coin with two faces, Head (H) and Tail (T). Intuitively, one would say that as we toss the coin more and more times, the relative frequency of H (or T) gets closer to one half. If you have taken any statistics course, you might know that the above statement can be proved directly by the strong law of large numbers1 (SLLN). The SLLN is so handy that I used to apply it to any problem fitting its conditions without questioning what was really happening in the problem itself. So when I read the proof of the coin-tossing problem in Probability and Measure2, a basic but fundamental proof taking a perspective I had not come across in my previous statistics classes, I felt very inspired and gained a better statistical understanding of the problem. In this post, I will summarize the idea and present the proof.
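Before any theory, here is a quick numerical sketch of that intuition (my own illustration, not part of the proof below): simulate a long run of fair coin tosses and watch the running relative frequency of heads settle near one half.

```python
# Simulation of the intuition: the running relative frequency of heads
# drifts toward 1/2 as the number of tosses grows (illustration only, not a proof).
import random

random.seed(0)                                    # fixed seed for reproducibility
n_tosses = 100_000
heads = 0
for n in range(1, n_tosses + 1):
    heads += random.randint(0, 1)                 # 1 = H, 0 = T, each with probability 1/2
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>6}: relative frequency of H = {heads / n:.4f}")
```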

1. Prelude

Let \(1\) indicate H and \(0\) indicate T, and let \(w\) be a sequence of infinitely many tosses of a coin. Then \(w\) can be represented by a nonterminating dyadic expansion (similar to a binary expansion)3 \(w=\sum_{n=1}^{\infty}\frac{d_n(w)}{2^n}\), where \(d_n(w)\) takes values in \(\{0,1\}\) and can be understood as the result of the \(n\)-th toss in the sequence \(w\). For example, \(w=0.01101…\) means that \((d_1(w)=0, d_2(w)=1, d_3(w)=1, d_4(w)=0, d_5(w)=1,…)\), which is equivalent to the sequence THHTH\(…\). Nonterminating means that for a value with two expansion representations, we use the representation that does not end in \(0\)s. Take \(0.5\) as an example: it can be written as \(0.1\) or as \(0.0111…\)4, and we choose \(0.0111…\).
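A small sketch of this correspondence (the helper name `dyadic_digits` is my own, not from the book): given \(w\in(0,1]\), the digits \(d_1(w),\ldots,d_n(w)\) can be peeled off by repeated doubling, with a strict comparison so that the nonterminating expansion is produced.

```python
# Hypothetical helper: the first n dyadic digits d_1(w), ..., d_n(w) of w in (0, 1],
# following the nonterminating convention (e.g. 0.5 -> 0, 1, 1, 1, ... rather than 1, 0, 0, ...).
def dyadic_digits(w: float, n: int) -> list[int]:
    digits = []
    for _ in range(n):
        w *= 2
        if w > 1:            # strict ">" keeps the expansion nonterminating
            digits.append(1)
            w -= 1
        else:
            digits.append(0)
    return digits

print(dyadic_digits(0.5, 5))    # [0, 1, 1, 1, 1] -> T H H H H, the 0.0111... representation
print(dyadic_digits(1 / 3, 8))  # [0, 1, 0, 1, 0, 1, 0, 1]; floating point only approximates w
```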

Up to this point, the statement that the relative frequency of 1 (or H) approaches one half can be translated into the statement \(P(\big[w:\underset{n\rightarrow \infty}{\lim}\frac{\sum_{i=1}^n d_i(w)}{n}=\frac{1}{2}\big])=1\), which has the same form as the target probability in the SLLN, except that here \(d_i(w)\) has a more concrete definition than \(X_i\). Before starting the proof, let's first look at some useful definitions and a lemma.

2. Notation

Let \(N=\big[w:\underset{n\rightarrow \infty}{\lim}\frac{\sum_{i=1}^n d_i(w)}{n}=\frac{1}{2}\big]\) and \(s_n(w)=\sum_{i=1}^n[2d_i(w)-1]\).

Then it’s equivalent to write \(N=\big[w:\underset{n\rightarrow \infty}{\lim}\frac{s_n(w)}{n}=0\big]\).

We prove \(P(N)=1\) by showing that \(P(N^c)=0\).

3. Negligible set

Define a set \(A\) to be negligible if for each positive \(\epsilon\) there exists a finite or countable collection \(I_1, I_2,…\) of intervals (they may overlap) satisfying \(A \subset \cup_k I_k\) and \(\sum_k|I_k|<\epsilon\). A finite or countable union of negligible sets is negligible.
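As a concrete example (my own, not from the book): any countable set \(A=\{x_1,x_2,\ldots\}\subset(0,1]\) is negligible, since for each positive \(\epsilon\) we can take \(I_k=\big(x_k-\frac{\epsilon}{2^{k+2}},\ x_k+\frac{\epsilon}{2^{k+2}}\big)\), which gives \(A\subset\cup_k I_k\) and \(\sum_k|I_k|=\sum_k\frac{\epsilon}{2^{k+1}}=\frac{\epsilon}{2}<\epsilon\). In particular, the set of rationals in \((0,1]\) is negligible.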

4. Lemma

\(P(\big[w: |\frac{s_n(w)}{n}|\geq \epsilon\big])\leq \frac{3}{n^2\epsilon^4}\). The proof of this lemma employs Markov's inequality; it is not the focus here, and the details can be found in the book2.
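For completeness, here is a sketch of the argument as I understand it (the full computation is in the book). Write \(r_i(w)=2d_i(w)-1\in\{-1,1\}\), so that \(s_n=\sum_{i=1}^n r_i\). Expanding \(s_n^4\) and integrating over \((0,1]\), the only surviving terms are those of the form \(r_i^4\) and \(r_i^2r_j^2\) (terms with an unmatched index integrate to \(0\)), which gives \(E[s_n^4]=n+3n(n-1)\leq 3n^2\). Applying Markov's inequality to \(s_n^4\), \(P(\big[w:|\frac{s_n(w)}{n}|\geq\epsilon\big])=P(\big[w:s_n^4(w)\geq n^4\epsilon^4\big])\leq\frac{E[s_n^4]}{n^4\epsilon^4}\leq\frac{3}{n^2\epsilon^4}\).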

5. Proof

Now we are ready for the proof. Let \(A_n=\big[w: |n^{-1}s_n(w)|\geq \epsilon_n\big]\); then by the lemma, \(P(A_n)\leq \frac{3}{n^2\epsilon_n^4}\). We fix a positive sequence \(\{\epsilon_n\}\) going to \(0\) slowly enough that the series \(\sum_n\epsilon_n^{-4}n^{-2}\) converges (take \(\epsilon_n=n^{-1/10}\), for example, so that the terms are of order \(n^{-8/5}\)), and therefore \(\sum_nP(A_n)\leq \sum_n \frac{3}{n^2\epsilon_n^4}<\infty\).

Next we prove \(P(N^c)=0\) by showing that \(N^c\) can be covered by intervals whose total length can be made arbitrarily small.

  • If for some \(m\), \(w\) lies in \(A_n^c\) for all \(n\geq m\), then \(|n^{-1}s_n(w)|<\epsilon_n\) for all \(n\geq m\); since \(\epsilon_n\rightarrow 0\), it follows that \(w\in N\).
    In other words, for each \(m\), \(\cap_{n=m}^{\infty}A_n^c\subset N\), which is the same thing as \(N^c \subset \cup_{n=m}^{\infty}A_n\).
  • Since \(\sum_nP(A_n)< \infty\), we can find an \(m\) such that \(\sum_{n=m}^{\infty}P(A_n)< \epsilon\) (a small numerical sketch of choosing such an \(m\) appears after the proof).
  • Therefore, \(P(N^c)\leq\sum_{n=m}^{\infty}P(A_n)< \epsilon\), and since \(\epsilon\) is arbitrary, \(P(N^c)=0\).

\(\blacksquare\)
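Here is the numerical sketch promised above (hypothetical numbers, just to make the choice of \(m\) concrete): with \(\epsilon_n=n^{-1/10}\) the lemma's bound is \(P(A_n)\leq 3n^{-8/5}\), and an integral comparison bounds the tail by \(\sum_{n\geq m}3n^{-8/5}\leq 3m^{-8/5}+5m^{-3/5}\), so for any target \(\epsilon\) a suitable \(m\) exists.

```python
# Hypothetical numerical sketch: with epsilon_n = n**(-1/10), the lemma gives
# P(A_n) <= 3 * n**(-8/5).  For a target epsilon, find an m whose tail bound is below it.
def tail_bound(m: int) -> float:
    """Bound on sum_{n >= m} 3*n**(-8/5): first term plus the integral comparison 5*m**(-3/5)."""
    return 3 * m ** (-8 / 5) + 5 * m ** (-3 / 5)

target = 0.01
m = 1
while tail_bound(m) >= target:
    m += 1
print(m, tail_bound(m))   # smallest m (under this crude bound) whose tail falls below the target
```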

6. Remark

To be more accurate, since \(s_n(w)\) is a step function on the interval \((0,1]\), each \(A_n\) is a finite disjoint union of intervals, \(A_n=\cup_kI_{nk}\), with \(P(A_n)=\sum_k|I_{nk}|\). Therefore \(\cup_{n=m}^{\infty}A_n\) is a countable union of intervals, \(\cup_{n=m}^{\infty}\cup_kI_{nk}\), with \(\sum_{n=m}^{\infty}\sum_k|I_{nk}|=\sum_{n=m}^{\infty}P(A_n)< \epsilon\). The intervals \(I_{nk}\ (n\geq m, k\geq 1)\) provide a covering of \(N^c\), so \(N^c\) is negligible.
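A concrete check of this remark for small \(n\) (my own sketch; `prob_An` is a hypothetical name): since \(s_n(w)\) is constant on each dyadic interval \(((k-1)/2^n, k/2^n]\), we can enumerate those intervals, keep the ones where \(|s_n/n|\geq\epsilon\), and compare the total length with the lemma's bound.

```python
# Sketch: for fixed n, s_n(w) is constant on each dyadic interval ((k-1)/2^n, k/2^n],
# so A_n = [|s_n/n| >= eps] is a finite disjoint union of such intervals and
# P(A_n) is simply the total length of the intervals we keep.
from itertools import product

def prob_An(n: int, eps: float) -> float:
    total = 0.0
    for digits in product((0, 1), repeat=n):      # one tuple per dyadic interval of length 2**-n
        s_n = sum(2 * d - 1 for d in digits)
        if abs(s_n / n) >= eps:
            total += 2.0 ** (-n)
    return total

n, eps = 10, 0.5
print(prob_An(n, eps), 3 / (n ** 2 * eps ** 4))   # exact P(A_n) vs. the lemma's bound
```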

In section 5, the step from \(\sum_nP(A_n)<\infty\) to \(P(N^c)=0\) can also be done by using the first Borel-Cantelli lemma. It is encouraged to use such well-established theorems when solving problems, but in this post I mainly wanted to show the basic but inspiring concepts. The book2 covers some fabulous examples of using the first and second Borel-Cantelli lemmas, and I intend to write a post about them in the near future.
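For reference (stated from memory, so check the book for the precise form): the first Borel-Cantelli lemma says that if \(\sum_n P(A_n)<\infty\), then \(P(\limsup_n A_n)=0\). Since the first bullet in section 5 shows \(N^c\subset\cup_{n=m}^{\infty}A_n\) for every \(m\), we have \(N^c\subset\cap_{m}\cup_{n=m}^{\infty}A_n=\limsup_n A_n\), and \(P(N^c)=0\) follows at once.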


  1. the strong law of large numbers: if the random variables \(X_1, X_2,…\) are independent and identically distributed with \(E(X_n)=m\), and \(T_n=X_1+\cdots+X_n\), then \(P(\lim_n\frac{T_n}{n}=m)=1\). ↩︎

  2. Probability and Measure ↩︎ ↩︎ ↩︎

  3. similar in the sense that they have the same form \(\sum_{n=1}^{\infty}\frac{d_n(w)}{2^n}\); the difference is that in the dyadic expansion here, \(d_n(w)\) is a step function of \(w\) when \(n\) is fixed ↩︎

  4. \(0.5=1\times 2^{-1}=0.1\) (in binary)

    \(0.5=\sum_{n=2}^{\infty}2^{-n}=0\times 2^{-1}+1\times 2^{-2}+1\times 2^{-3}+1\times 2^{-4}+…=0.0111…\) (in binary, nonterminating) ↩︎