Learning With Errors

a theory student blog

Discrepancy: definitions and Spencer's six standard deviations

This is the first in a series of (probably at least 3) blog posts about discrepancy minimization. There are already many expositions of this topic, and it is not clear that the world needs yet another, but here it is :). In this post, I'll introduce the basic definitions, give one simple upper bound and one simple lower bound, and then prove Spencer's famous (and elegant) “six standard deviations” theorem. I won't focus on mathematical context, but at the end I'll give pointers to some resources on the topic.


Suppose we have $n$ items and a set of subsets of $[n]$, $\mathcal{S} = S_1,\ldots,S_m$ with $S_j \subseteq [n]$ for all $j \in [m]$. We want to assign each item $i \in [n]$ a sign or “color” $x_i \in \{\pm 1\}$, with the goal that the coloring of each set is as balanced as possible---formally, this is called minimizing the discrepancy. For a coloring $x \in \{\pm 1\}^n$, the discrepancy or imbalance of $S \subseteq [n]$ is given by \[ \disc(x, S) = \left|\sum_{i \in S} x_i \right|. \] We define the discrepancy of $\cS$ to be the discrepancy of the worst set in $\cS$ under the most balanced coloring, \[ \disc(\cS) = \min_{x \in \{\pm 1\}^n} \max_{S_j \in \cS}~ \disc(x,S_j). \]

It can also be convenient to think of this as a matrix problem. Consider the incidence matrix $A$ of $\cS$, the $m \times n$ matrix with \[ A_{ji} = \begin{cases} 1 & i \in S_j\\ 0 & \text{otherwise}. \end{cases} \] Then letting $a_j$ be the 0/1 indicator vector for set $S_j$, or the $j$th row of $A$, the discrepancy of $x \in \{\pm 1\}^n$ on $S_j$ is equal to $|\iprod{a_j,x}|$, and the discrepancy of $\cS$ is equivalent to \[ \disc(\cS) = \min_{x \in \{\pm 1\}^n} \|Ax\|_{\infty}. \]
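To make these definitions concrete, here is a tiny brute-force computation of $\disc(\cS)$ straight from the incidence matrix (a Python/numpy sketch with a made-up example system; enumerating all $2^n$ colorings is of course only feasible for small $n$):

```python
import itertools

import numpy as np

def disc(A: np.ndarray) -> int:
    """Brute-force disc(S) = min over colorings x in {±1}^n of ||Ax||_inf.

    A is the m x n 0/1 incidence matrix of the set system; we simply
    enumerate all 2^n sign vectors, so keep n small.
    """
    n = A.shape[1]
    return min(
        int(np.abs(A @ np.array(x)).max())
        for x in itertools.product([-1, 1], repeat=n)
    )

# Three sets over four items: S1 = {1,2}, S2 = {2,3}, S3 = {1,3,4}.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 1, 1]])
print(disc(A))
```

For this instance the answer is $1$: the third set has odd size, so no coloring balances it exactly, while $x = (+1,-1,+1,-1)$ achieves imbalance at most $1$ on every set.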

Some cursory upper and lower bounds

A priori, it might not be clear what $\disc(\cS)$ should be. Some set systems have discrepancy zero---consider for example a “planted” set system, in which the rows of the incidence matrix $A$ are chosen to be orthogonal to some coloring $x\in \{\pm 1\}^n$. On the other hand, naively we could have set systems with discrepancy as large as $n$.
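For instance, here is one way to build such a planted system in code (a sketch with made-up parameters): fix a coloring up front and assemble each set from equally many $+1$- and $-1$-colored items, so that every row of $A$ is orthogonal to the planted coloring:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 5
x = np.array([1] * 5 + [-1] * 5)   # the planted coloring

plus, minus = np.where(x == 1)[0], np.where(x == -1)[0]

# Each set takes k items of each color, so <a_j, x> = k - k = 0.
A = np.zeros((m, n), dtype=int)
for j in range(m):
    k = rng.integers(1, 6)
    A[j, rng.choice(plus, size=k, replace=False)] = 1
    A[j, rng.choice(minus, size=k, replace=False)] = 1

print(np.abs(A @ x).max())  # 0: the planted coloring has zero discrepancy
```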

Uniformly Random Coloring

It is not very hard to see that $n$ is an egregious upper bound (when $m = |\cS|$ is not too large). Let's consider $x \in \{\pm 1\}^n$ chosen uniformly at random. For each $S_j \in \cS$, by Hoeffding's inequality, we have that \[ \Pr\left[\left|\sum_{i \in S_j} x_i \right| \ge t \sqrt{n} \right] \le 2\exp\left(-\frac{t^2}{2}\right). \] So, if we set $t = 2\sqrt{2\log m}$, we can afford a union bound over the sets: \[ \Pr[\|Ax\|_{\infty} \ge t\sqrt{n}] \le \sum_{j \in [m]} \Pr\left[\left|\sum_{i \in S_j} x_i \right| \ge t\sqrt{n} \right] \le m \cdot 2\exp(-4 \log m) = O\left(\frac{1}{m^3}\right). \] So for any set system $\cS$, the random coloring gives the improved bound, \begin{equation} \disc(\cS) \le O(\sqrt{n\log m}).\label{randomub} \end{equation}
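Here is a quick numerical sanity check of this bound (a sketch; the random set system and the parameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 500
A = rng.integers(0, 2, size=(m, n))   # a random set system
x = rng.choice([-1, 1], size=n)       # a uniformly random coloring

imbalance = np.abs(A @ x).max()              # ||Ax||_inf
threshold = 2 * np.sqrt(2 * n * np.log(m))   # t*sqrt(n) with t = 2*sqrt(2 log m)
print(imbalance, "<=", round(threshold, 1))
```

The threshold is the $t\sqrt{n}$ from the calculation above; the failure probability is $O(1/m^3)$, so the sampled imbalance should clear it comfortably.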

A Lower Bound

We'll see that the upper bound we got from the random coloring is almost tight. Our lower bound will come from the set system defined by the Hadamard matrix $H$---for us, it will be enough to know that the Hadamard matrix is a symmetric matrix with entries in $\{\pm 1\}$, the first column (and therefore row) of $H$ is the all-1's vector $\vec{1}$, and the columns are mutually orthogonal so that $HH^{\top} = n\cdot \Id$.

The basic idea for the lower bound is that, because $H$ has large eigenvalues, it cannot map any Boolean vector into a ball of small radius, giving us large discrepancy. But $H$ has negative entries, so in the process of translating it into a valid incidence matrix with entries in $\{0,1\}$ we have to make sure we didn't introduce a small eigenvalue in the direction of some $x \in \{\pm 1\}^n$.

The proof is only a couple of lines. Let $J$ be the all-$1$s matrix. We can define the set system $\cS$ with incidence matrix $A = \frac{1}{2}(J+H)$---this is valid, because $A$ is a $0/1$ matrix. Now, for any $x \in \{\pm 1\}^n$, we have that \begin{align*} n \cdot \|Ax\|_{\infty}^2 ~\ge~\|Ax\|_2^2 ~=~ x^{\top}A^{\top} A x ~=~\frac{1}{4} x^{\top} \left(H^{\top} H + J^{\top} J + J^{\top} H + H^{\top} J\right)x. \end{align*} To simplify, we can observe that $H^{\top}H = n\cdot \Id$ and that $J^{\top} J = n\cdot J$. Also, because the first row/column of $H$ is equal to $\vec{1}$, and because the rows of $H$ are orthogonal, $H\vec{1} = n \cdot e_1$, and so $J^{\top} H = n \cdot \vec{1}e_1^{\top}$ and $H^{\top} J = n \cdot e_1\vec{1}^{\top}$. So, \begin{align*} x^{\top} A^{\top}Ax ~=~ \frac{n}{4} \cdot x^{\top}\left( \Id + J + e_1 \vec{1}^{\top} + \vec{1}e_1^{\top}\right) x ~=~ \frac{n}{4}\left(n + \iprod{x,\vec{1}}^2 + 2\cdot x_1 \cdot \iprod{x,\vec{1}}\right). \end{align*} Because $A$'s first row is $\vec{1}$, if $|\iprod{x,\vec{1}}| > \sqrt{n}$ then $\|Ax\|_{\infty} \ge |\iprod{x,\vec{1}}| > \sqrt{n}$. Otherwise, if $|\iprod{x,\vec{1}}| \le \sqrt{n}$, plugging in to the above we have that \begin{align*} n\cdot \|Ax\|_{\infty}^2 ~\ge~ x^{\top} A^{\top} A x ~\ge~ \frac{n}{4}\left(n - 2\sqrt{n}\right) \quad \implies \quad \|Ax\|_{\infty} = \Omega(\sqrt{n}). \end{align*} So for the Hadamard set system $\cS$, \begin{equation} \disc(\cS) = \Omega(\sqrt{n}).\label{lb} \end{equation}
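We can sanity-check this bound for a small Sylvester-type Hadamard matrix by brute force over all $2^n$ colorings (a sketch; $n = 8$ keeps the enumeration cheap):

```python
import itertools

import numpy as np

def sylvester(k: int) -> np.ndarray:
    """Symmetric ±1 Hadamard matrix of size 2^k via the Sylvester construction."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = sylvester(3)
assert np.array_equal(H @ H.T, n * np.eye(n, dtype=int))  # H H^T = n*Id

A = (np.ones((n, n), dtype=int) + H) // 2   # the 0/1 incidence matrix (J + H)/2
best = min(
    np.abs(A @ np.array(x)).max()
    for x in itertools.product([-1, 1], repeat=n)
)
print(best)  # the argument above guarantees best >= sqrt(n - 2*sqrt(n))/2
```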

Spencer's Theorem

It turns out that the lower bound (\ref{lb}) is in fact tight up to constant factors, and the logarithmic factor in the upper bound (\ref{randomub}) can be removed. In 1985, Spencer proved the following result:

Theorem 1 For any set system $\cS$ on $[n]$ with $|\cS| = m \ge n$, \[ \disc(\cS) \le O\left(\sqrt{n\log\frac{2m}{n}}\right); \] in particular, for $m = n$, $\disc(\cS) \le 6\sqrt{n}$.

Maybe improving on the logarithmic factor in (\ref{randomub}) does not seem like a big deal, but a priori it is not obvious that one should be able to get a bound better than the random assignment. Also, Spencer's proof is extremely elegant (though nonconstructive). We'll prove it below, but we'll assume that $m = n$, and we'll be sloppy with constants (so we'll get an upper bound of $O(\sqrt{n})$ instead of $6\sqrt{n}$).

The gist. The proof is by induction---given $\cS$ and $[n]$, Spencer shows that there exists some partial coloring $y \in \{-1,0,1\}^n$ which colors in at least a constant fraction of the elements with low discrepancy, so that $\|y\|_1 \ge c\cdot n$ and $\|Ay\|_{\infty} \le C\sqrt{n}$ for constants $c,C$. This fact is then applied inductively to the uncolored items for $O(\log n)$ steps, until all of the items are colored; since at most $(1-c)^t \cdot n$ items remain uncolored after $t$ steps, the total discrepancy is at most $\sum_{t=0}^{O(\log n)} C\cdot \sqrt{(1-c)^t\cdot n} = O(\sqrt{n})$, since $1-c < 1$.

The reason such a partial coloring $y$ must exist is that the map $x \mapsto Ax$ does not spread points out too much---by the pigeonhole principle, one can show that there are at least $2^{\Omega(n)}$ distinct points $x \in \{\pm 1\}^n$ whose images land in a common ball of radius $O(\sqrt{n})$. Since there are so many such points, there must exist two of them, $x_1,x_2$, at large Hamming distance, so that their difference has many nonzero entries, and the partial coloring is given by $y = \frac{1}{2}(x_1-x_2)$.

Now for the more formal proof. We will prove the following lemma, which we will then apply inductively:

Lemma 2 Let $m,n \in \N$ with $n \le m$, and let $A$ be an $m \times n$ matrix with $0/1$ entries. Then there exist universal constants $c_1,c_2 > 0$ and a vector $y \in \{-1,0,1\}^n$ such that $\|y\|_1 \ge c_1 \cdot n$ and \[ \|Ay\|_{\infty} \le c_2\cdot \sqrt{n\log\frac{m}{n}}. \]

Proof: Let $\cB_{\infty}(r,p)$ denote the ball of radius $r$ around $p$, where distance is measured in $\ell_\infty$. We'll show that there must exist some point $q \in \R^m$ such that \[ \Pr_{x\sim\{\pm 1\}^n}\left[ Ax \in \cB_{\infty}\left(\sqrt{2n\log\frac{m}{n}},~ q\right)\right] \ge 2^{-cn}, \] for some constant $c < 1$. Our strategy will be to use the pigeonhole principle. We'll identify a set $B \subset \R^m$ with $|B| \le 2^{\epsilon n}$ for some $\epsilon < 1$, so that for uniformly chosen $x \in \{\pm 1\}^n$, there exists a point in $B$ close to $Ax$ with probability at least $\frac{1}{2}$; because $|B| \le 2^{\epsilon n}$, there must be some $q \in B$ which is close to the images $Ax$ of at least $2^{(1-\epsilon)n - 1}$ of the points $x$. To find such a $B$, which contains few points but is close to $Ax$ with constant probability over uniform $x \in \{\pm 1\}^n$, we'll choose $B$ to be a discretization of the set of points in $\R^m$ that do not have too many entries of large magnitude. The way that we define “large magnitude” will depend partly on the standard deviation of the entries of $Ax$, and partly on wanting to keep $B$ relatively small.

Define the function $f:\{\pm 1\}^n \to \Z^m$ so that \[ f(x) = \left\lceil \frac{1}{\sqrt {2n\log \frac{m}{n}}} Ax\right\rceil. \] In words, $f$ rounds each coordinate of $Ax/\sqrt{2n\log \frac{m}{n}}$ up to the nearest integer.

We next identify some small subset $B \subset \Z^m$ which contains a large fraction of the image of $f$, which will imply that there must exist some $q \in \sqrt{2n\log\frac{m}{n}}\cdot B$ which is close to $Ax$ for many $x \in \{\pm 1\}^n$. Set $\kappa_t = 2^{t+2} \left(\frac{m}{n}\right)^{-t^2}$, and define $B \subset \Z^m$ to be the set of points in which, for every $t \ge 1$, at most a $\kappa_t$-fraction of coordinates have magnitude at least $t$, \[ B = \left\{(b_1,\ldots,b_m) \in \Z^m ~|~ |\{b_i ~s.t.~ |b_i| \ge t\}| \le \kappa_t \cdot m \text{ for all } t \ge 1\right\}. \] For uniformly chosen $x$, by Hoeffding's inequality, \[ \Pr\left[ |\iprod{x,a_j}| \ge t \sqrt{2n\log\frac{m}{n}} \right]\le 2\left(\frac{m}{n}\right)^{-t^2}. \] And so the expected number of $j \in [m]$ for which $|\iprod{x,a_j}| \ge t\sqrt{2n\log\frac{m}{n}}$ is at most \[ \E\left[\sum_{j\in[m]} \Ind\left(\left|\iprod{x,a_j}\right| \ge t\sqrt{2n\log\frac{m}{n}}\right) \right] \le m \cdot 2\left(\frac{m}{n}\right)^{-t^2}, \] And by Markov's inequality \begin{align} \Pr\left[\sum_{j\in[m]} \Ind\left(\left|\iprod{x,a_j}\right| \ge t\sqrt{2n\log\frac{m}{n}}\right) \ge \kappa_t \cdot m\right] &\le \frac{2m\left(\frac{m}{n}\right)^{-t^2}}{\kappa_t \cdot m} = \frac{1}{2^{t+1}}.\label{eq:cond} \end{align} Now, the coordinates of $f(x) = \lceil (2n\log\frac{m}{n})^{-1/2} \cdot Ax\rceil$ are bounded in magnitude by roughly $\sqrt{n}$, so the event $f(x) \not\in B$ requires that the event in (\ref{eq:cond}) occur for some integer $1 \le t \le \sqrt{n}$, and by a union bound, \begin{align*} \Pr[f(x) \not \in B] ~=~ \Pr\left[\exists t ~s.t.~ \sum_{j \in [m]} \Ind\left(|\iprod{x,a_j}| \ge t\sqrt{2n\log\frac{m}{n}}\right) \ge \kappa_t \cdot m\right] ~\le~ \sum_{t=1}^{\sqrt{n}} \frac{1}{2^{t+1}} ~\le~ \frac{1}{2}. \end{align*}

At the same time, the size of $B$ can be bounded with some meticulous but uncomplicated counting arguments---we won't reproduce them at full resolution here, but the basic idea is that a point in $B$ has at most an $\alpha_t \le \kappa_t$ fraction of entries of magnitude $t$. So for any valid sequence $\alpha = (\alpha_1,\ldots,\alpha_n)$ with $\alpha_t \le \kappa_t$ for all $t$, there are at most \[ \prod_{t=1}^{n}2^{\alpha_t m} \cdot \binom{\left(1 - \sum_{s < t}\alpha_s\right)\cdot m}{\alpha_t \cdot m} \] such points (the binomial factor chooses the positions of the magnitude-$t$ entries, and the factor $2^{\alpha_t m}$ their signs), and then summing over all valid $\alpha$, \begin{align*} |B| &\le \sum_{\alpha} \prod_{t=1}^{n}2^{\alpha_t m} \cdot \binom{\left(1 - \sum_{s < t}\alpha_s\right)\cdot m}{\alpha_t \cdot m}. \end{align*} After applying a number of rearrangements and approximations, by our choice of $\kappa_t$'s one can conclude that \[ |B| \le 2^{cn}, \] for some constant $c < 1$.

Since $|B| \le 2^{cn}$ but $f(x) \in B$ for at least $2^{n-1}$ of the points $x$, it follows that there must exist some $q \in B$ such that $f$ maps at least $2^{(1-c)n -1}$ of the $x \in \{\pm 1\}^n$ to $q$. And whenever $f(x) = q$, by the definition of $f$ we have $Ax \in \cB_{\infty}\left(\sqrt{2n\log \frac{m}{n}},~ \sqrt{2n\log \frac{m}{n}}\cdot q\right)$, since the rounding moves each coordinate by less than $1$ in the scaled space. So we have that \[ \Pr\left[Ax \in \cB_{\infty}\left(\sqrt{2n\log\frac{m}{n}},~ \sqrt{2n\log\frac{m}{n}}\cdot q\right)\right] \ge 2^{-cn - 1}. \]

The proof is now complete if we observe that any subset $C \subset \{\pm 1\}^n$ of size $2^{\Omega(n)}$ must contain two points at Hamming distance $\Omega(n)$. This follows from a theorem of Kleitman, but it is not hard to see directly. The idea is that, for any fixed point $p \in C$, the number of points at Hamming distance at most $\epsilon n$ from $p$ is \[ \sum_{k=0}^{\epsilon n} \binom{n}{k} \le 2^{H(\epsilon)n}, \] where $H(\cdot)$ is the binary entropy function (this is a standard upper bound on a partial sum of binomial coefficients, valid for $\epsilon \le \frac{1}{2}$). So if $|C| > 2^{H(\epsilon)\cdot n}$, it must contain at least two points at Hamming distance greater than $\epsilon n$.
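This partial-sum bound is easy to check numerically (a small sketch):

```python
from math import comb, log2

def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2 (1-p)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

n = 40
for eps in (0.1, 0.25, 0.5):
    radius = int(eps * n)
    # Size of a Hamming ball of radius eps*n in the n-dimensional hypercube.
    ball = sum(comb(n, k) for k in range(radius + 1))
    assert ball <= 2 ** (binary_entropy(eps) * n)   # valid for eps <= 1/2
print("bound verified for n =", n)
```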

Since there are at least $2^{(1-c)n - 1}$ points $x \in \{\pm 1\}^n$ with $Ax \in \cB_{\infty}\left(\sqrt{2n\log\frac{m}{n}},~ \sqrt{2n\log\frac{m}{n}}\cdot q\right)$, there must be two such points $x_1,x_2$ at Hamming distance at least $H^{-1}(1-c)\cdot n$, so that $\|x_1-x_2\|_1 \ge 2H^{-1}(1-c)\cdot n$, while $\|Ax_1 - Ax_2\|_{\infty} \le \left\|Ax_1 - \sqrt{2n\log\tfrac{m}{n}}\cdot q\right\|_{\infty} + \left\|\sqrt{2n\log\tfrac{m}{n}}\cdot q - Ax_2\right\|_{\infty} \le 2\sqrt{2n\log\frac{m}{n}}$. Setting $y = \frac{1}{2}(x_1-x_2)$, $c_1 = H^{-1}(1-c)$ and $c_2 = \sqrt{2}$, the conclusion holds. $$\tag*{$\blacksquare$}$$
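For a tiny instance, the whole pigeonhole argument can be simulated directly (a sketch; the random instance, the seed, and the replacement of the degenerate $\log\frac{m}{n}$ factor by $1$ when $m = n$ are my own choices):

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
n = m = 14
A = rng.integers(0, 2, size=(m, n))
scale = np.sqrt(2 * n)   # stand-in for sqrt(2 n log(m/n)), which degenerates at m = n

# Bucket every coloring x by its rounded image f(x) = ceil(Ax / scale).
buckets: dict = {}
for signs in itertools.product([-1, 1], repeat=n):
    x = np.array(signs)
    key = tuple(int(v) for v in np.ceil(A @ x / scale))
    buckets.setdefault(key, []).append(x)

# Pigeonhole: take the fullest bucket, then two colorings in it that are
# far apart in Hamming distance; half their difference is a partial coloring.
big = max(buckets.values(), key=len)
x1 = big[0]
x2 = max(big, key=lambda z: int(np.abs(x1 - z).sum()))
y = (x1 - x2) // 2   # entries in {-1, 0, 1}

print(int(np.abs(y).sum()), int(np.abs(A @ y).max()))
```

Since $x_1$ and $x_2$ round to the same point, every coordinate of $A(x_1-x_2)$ has magnitude less than $\sqrt{2n}$, so the partial coloring $y$ automatically satisfies $\|Ay\|_{\infty} < \sqrt{2n}/2$; the counting argument is what guarantees that $\|y\|_1$ is also large.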

Now, we are ready to prove Spencer's Theorem.

Proof: We apply Lemma 2 recursively, coloring in at least a $c_1$-fraction of the remaining items each time. After each partial coloring, we update $A$ by removing the columns corresponding to colored items, so that, writing $\gamma = 1 - c_1 < 1$, at the $t$th step there are at most $\gamma^t\cdot n$ columns left in $A$. There can be at most $\log n/\log \frac{1}{\gamma} = O(\log n)$ rounds of partial coloring, and in the $t$th round we incur a discrepancy of at most $c_2\sqrt{\gamma^t n \log\frac{m}{\gamma^t n}}$ in each set. Thus the total discrepancy is at most \begin{align*} \disc(\cS) &\le \sum_{t=0}^{O(\log n)}c_2\sqrt{\gamma^t n \log\frac{m}{\gamma^t n}}\\ &\le c_2\sqrt{n}\cdot \sum_{t=0}^{\infty}\gamma^{t/2}\left(\log\frac{m}{n} + t\log \frac{1}{\gamma} \right)^{1/2}. \end{align*} And since $(x+y)^{1/2} \le x^{1/2} + y^{1/2}$ for $x,y \ge 0$, this is \begin{align*} &\le O\left(\sqrt{n\log\frac{m}{n}}\right)\cdot \sum_{t=0}^{\infty}\gamma^{t/2} + O\left(\sqrt{n}\right)\cdot\sum_{t=0}^{\infty} \gamma^{t/2}\cdot t^{1/2}\\ &\le O\left(\sqrt{n\log\frac{m}{n}}\right) + O\left(\sqrt{n}\right) ~\le~ O\left(\sqrt{n\log\frac{2m}{n}}\right), \end{align*} and the conclusion follows. $$\tag*{$\blacksquare$}$$
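As a sanity check on this geometric series, here is a numerical evaluation with illustrative constants (my own choices, not ones extracted from the proof):

```python
from math import log, sqrt

c1 = 0.5   # fraction of items left uncolored per round (illustrative)
n, m = 1024, 4096

# Accumulate the per-round discrepancy sqrt(n_t * log(m / n_t)) as n_t shrinks.
rounds = []
remaining = n
while remaining >= 1:
    rounds.append(sqrt(remaining * log(m / remaining)))
    remaining *= c1

total = sum(rounds)
first = sqrt(n * log(m / n))
print(round(total / first, 2))
```

The printed ratio is a small constant: later rounds contribute geometrically less, so the first round's $\sqrt{n\log\frac{m}{n}}$ term dominates up to a constant factor.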

More sources

There are many very good expositions of discrepancy results. For this post I heavily relied on:
  • Joel Spencer's 1985 paper with the six standard deviations result.
  • Nikhil Bansal's book chapter about algorithmic discrepancy. Full text available here at the time of writing.
These references contain pointers to other good resources, especially books, which give a more detailed account of the mathematical/historical context of discrepancy minimization.

Also, see the proof in Alon and Spencer's “The Probabilistic Method” which is based on entropy and is really clean.