$\newcommand{\qed}{\tag*{\blacksquare}} \renewcommand\qedsymbol{\blacksquare} \newcommand{\1}{\mathbb{1}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \newcommand{\x}{\times} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathcal{N}} \newcommand{\E}{\mathop{\mathbb{E}}} \renewcommand{\bar}{\overline} \renewcommand{\epsilon}{\varepsilon} \newcommand{\bmqty}[1]{\begin{bmatrix}#1\end{bmatrix}}$

# Discrepancy: a constructive proof via random walks in the hypercube

By

In this post, I'll give two constructive/algorithmic discrepancy upper bounds. The first, by Beck and Fiala, applies to sparse set systems. The second, by Lovett and Meka, improves on the Beck-Fiala result and also matches the guarantees of Spencer's theorem.

### Discrepancy Minimization

Recall that, given a system of subsets of $[n]$, $\cS = S_1,\ldots,S_m \subseteq [n]$, the discrepancy of a coloring $x \in \{\pm 1\}^n$ on $\cS$ is defined to be $\disc(x,\cS) = \max_{S_j \in \cS} \left|\sum_{i \in S_j} x_i\right|.$ In the previous post, we proved Spencer's theorem, which says that for any $\cS$, $\min_{x}\disc(x,\cS) \le O(\sqrt{n\log\frac{m}{n}})$.

The natural associated algorithmic task is discrepancy minimization---given $\cS$, we want to compute $x^* = \argmin_{x \in \{\pm 1\}^n} \disc(x,\cS).$ Spencer's theorem guarantees that some $x$ achieving $\disc(x,\cS) \le O(\sqrt{n\log\frac{m}{n}})$ always exists, but the proof does not provide a natural algorithm for finding the discrepancy minimizer $x^*$. Actually, finding the minimizer $x^*$ is NP-hard.

Theorem 1 (Charikar-Newman-Nikolov) Given a set system $\cS$ with $O(n)$ sets, it is NP-hard to distinguish whether $\disc(\cS) = 0$ or $\disc(\cS) = \Omega(\sqrt{n})$.

Still, it turns out that Spencer's theorem can be made algorithmic---there are efficient algorithms for computing a coloring with discrepancy $O(\sqrt{n})$. The first such algorithm was given by Bansal in 2010, and it was based on semidefinite programming. Later, in 2012, Lovett and Meka gave a simplified and slightly more general version of Bansal's result. The Lovett-Meka algorithm uses some ideas from Bansal's algorithm, but it does not rely on SDPs, using instead only linear algebra and properties of random vectors.

I think it is more natural to see the Lovett-Meka result after seeing the simpler result of Beck and Fiala for the special case when $\cS$ is sparse, and so I will give a brief account of that algorithm first.

### Sparse set systems and Beck-Fiala

Suppose that we have a set system $\cS$ which is sparse, so that every item is in at most $t$ sets. In this case, we can get the following specialized bound:

Theorem 2 (Beck-Fiala) If $\cS = S_1,\ldots,S_m$ is a set system with $S_j \subseteq [n] ~ \forall j \in [m]$, and each $i\in[n]$ is only included it at most $t$ sets of $\cS$, then there is an algorithm that computes a coloring $x \in \{\pm 1\}^n$ with $\disc(x,\cS) \le 2t - 1.$

Beck and Fiala also conjectured that one could obtain a bound of $\disc(\cS) \le O(\sqrt{t})$ for this setting---the Beck-Fiala conjecture is a major open problem in discrepancy theory. Proof: The proof is algorithmic---we'll start with the fractional coloring $x_0 = \vec{0}$, and update $x$ iteratively until we reach an integral point in $\{\pm 1\}^n$, arguing that we cannot do too much damage along the way.

The algorithm is as follows. At step $k$ of the algorithm, say we have the fractional coloring $x_k \in [-1,1]^n$. We keep track of the “live” items, or items for which $|x_i| < 1$. We also keep track of the “dangerous” sets: a set is called dangerous if it contain more than $t$ live items.

Claim 1 At step $k$, if there are $n_k$ live items, then there can be at most $n_k - 1$ dangerous sets.

This is true because each dangerous set has at least $t+1$ live items, but the maximum degree of each item is $t$, and so if we restrict the incidence matrix $A$ to the rows corresponding to dangerous sets, there are at most $n_k\cdot t$ nonzero entries, and therefore there can be at most $\lfloor\frac{t\cdot n_k}{t+1}\rfloor \le n_k-1$ dangerous sets.

So, if we let $A_k$ be the restriction of the incidence matrix to live columns and dangerous rows in the $k$th step, $A_k$ is not full rank, so there must always exist some vector $y_k \in \R^{n_k}$ which is orthogonal to all rows of $A_k$, and furthermore we can find $y_k$ efficiently.

Let $z_k$ be the natural extension of $y_k$ to the space of non-live items (so that $z_k(i) = y_k(i)$ if $i$ is live and $0$ otherwise). We perform the update $x_{k+1} = x_k + \alpha \cdot z_k,$ where $\alpha \in \R_+$ is chosen to be the largest number so that $x_{k+1} \in [-1,1]^n$. In other words, we start with $\alpha = 0$, and grow $\alpha$ until at least one of the entries of $x_{k+1}$ hits $1$ or $-1$. Thus the number of live items decreases by at least one, and the discrepancy of every dangerous set is $0$.

Now we only have to argue that once a set $S_j$ is no longer dangerous, its discrepancy can never grow larger than $2t-1$. If $S_j$ stopped being dangerous at step $k'$, $S_j$ had at most $t$ live items in $x_{k'}$ and $\iprod{x_{k'},a_j} = 0$. In the worst case each live $i \in S_j$ can go from $x_{k'}(i) = 1-\epsilon_i$ to $x(i) = -1$, so a bound of $2t$ on the final discrepancy of $S_j$ is easy. To get $2t-1$, we just notice that because the total discrepancy of $S_j$ was $0$ at step $k'$, the sum over the live $i \in S_j$ must be integral, and for live $|x_{k'}(i)| < 1$, so this gives us a lower bound of $\left|\sum \epsilon_i\right| \ge 1$. $$\tag*{\blacksquare}$$

### Constructive Spencer via guided random walks

As mentioned above, the first algorithmic proof of Spencer's result was given by Bansal in 2010. The proof was a little bit similar to the Beck-Fiala algorithm, in that it starts with the fractional coloring $x_0 = 0$, and makes updates to $x$ iteratively until hitting some integral coloring, bounding the error incurred along the way. The extreme point of departure is the manner in which the iterative updates to $x$ are chosen. Instead of choosing some arbitrary direction orthogonal to the dangerous sets, Bansal's algorithm uses a semidefinite program to take a random step---the semidefinite program makes sure that this random walk will make progress without violating discrepancy constraints too much. This makes the proof non-constructive, since to argue the feasibility of the SDP, Bansal relied on Spencer's result.

In 2012, Lovett and Meka simplified Bansal's approach. They removed the semidefinite programming step, returning to linear algebraic arguments reminiscent of the Beck-Fiala proof. Instead of using the SDP to guide the random walk, they argue that so long as there are not too many integral vertices, there is a high-dimensional subspace of $\R^n$ in which the random walk can proceed without violating the discrepancy constraints by too much, and then they take a random step in this subspace. This gives a truly constructive proof of Spencer's result.

Ignoring variations in the constants chosen, the main theorem of the paper is the following:

Theorem 3 Suppose that for $\lambda \in \R^m$ with $\lambda \ge 0$, $$\sum_j \exp\left(-\frac{\lambda_j^2}{32}\right) \le \frac{n}{16}.\label{cond}$$ Then for any starting point $x \in [-1,1]^n$, there exists a partial coloring $x' \in [-1,1]^n$ with at least $n/2$ entries of $x$ having magnitude $1$ and $|\langle x - x', a_j\rangle | \le \lambda_j \sqrt{|S_j|}$ for all $j \in [m]$, and an algorithm that finds such an $x'$ with probability at least $1/10$.

First let's see that this implies Spencer's result. We'll apply the theorem recursively, like Spencer does, $T = O(\log n)$ times. We'll start with the coloring $x_0 = 0$. At the $t$th iteration of the algorithm, say we have $n_t \le n/2^{t-1}$ items uncolored. For all of $j \in [m]$, we will set $\lambda^{(t)}_j = \sqrt{32\log{\frac{m}{n_t}} + \log 16}$ (it's easy to check that this satisfies the condition of the theorem). Then we use the algorithm to find $x_t = x'$. Letting $a_j$ be the 0/1 indicator vector for $S_j$, by the triangle inequality and the guarantees of the theorem, \begin{align*} \disc(x_T, S_j) ~=~ |\langle x_T, a_j\rangle| ~\le~ \sum_{t=0}^T |\langle x_{t} - x_{t+1}, a_j \rangle| ~&\le ~\sum_{t=0} \lambda_j^{(t)}\sqrt{|S_j^{(t)}|}, \end{align*} And since there are at most $n/2^{t-1}$ active items in $S_j$ at timestep $t$, \begin{align*} &\le \sum_{t=0}\sqrt{n_t\log\frac{m}{n_t}} ~\le~ O(\sqrt{n\log m/n}), \end{align*} where the last inequality follows because the sequence $\frac{n_t\log(m/n_t)}{n\log(m/n)}$ decays at least as fast as $2^{-t}\log 2^t$. So, this recovers Spencer's result.

Remark 1 (Sparse set systems) The algorithms of Bansal and of Lovett and Meka can be generalized to give an upper bound of $O(\sqrt{t}\log n)$ for $t$-sparse set systems. If the $\lambda_j$'s are set to $\lambda_j = c \cdot \sqrt{\frac{t}{|S_j|}}$ for some constant $c$, then by Markov's inequality and by the sparsity of $\cS$ there are at most $2^{-k}n$ sets with $|S_j| \in [2^ktn,2^{k+1}tn]$, and so $\sum_{j}\exp\left(-\frac{\lambda_j^2}{32}\right) \le \sum_{k=0}^{\infty} \frac{n}{2^k}\cdot \exp\left( \frac{-c^2}{2^{k+1}\cdot 32}\right),$ which meets condition (\ref{cond}) of the theorem if $c$ is chosen properly, so the conclusion follows.

Now, we will prove the theorem.

Main idea: Just as Beck and Fiala do, we'll start with some point $x_0$, and update $x$ iteratively, fixing $x(i)$ the moment that $|x(i)| = 1$. We will differ in our updates---we redefine a set to be dangerous when we come close to violating the constraint $|\iprod{x_t, a_j}| \ge \lambda_j\sqrt{|S_j|}$. So unlike Beck-Fiala, by default we start with no dangerous sets, and we add sets to the dangerous list when they become too imbalanced.

Just like Beck-Fiala, we will only make updates orthogonal to the dangerous sets. Our updates will take the form of a random walk in the non-dangerous subspace. The trick will be to argue that by our condition (\ref{cond}) and by properties of Gaussian random walks, with reasonable probability the rank of the dangerous subspace does not become too large as long as there are still many live items to color in.

Proof: The algorithm is as follows: Set the step size $\gamma = 1/100n^2$, and the safety margin $\delta = \gamma \cdot 10\log n$. Initialize the set of non-live items $D_v = \emptyset$ (notationally this is more convenient than keeping track of live items), and initialize the set of dangerous constraints $D_S$. Initialize the starting coloring $x_0 = x$ and the starting subspace $V_0 = \R^n$.

1. For $k = 1,\ldots, K= 8/\gamma^2$:
1. Sample the random vector $g_k$ by sampling $g \sim \cN(0, \Id)$ and projecting $g$ into the subspace $V_k$.
2. Take a random step by setting $x_k = x_{k-1} + \gamma \cdot g_k$.
3. For any $i \in [n]$ such that $|x_k(i)| \ge 1 - \delta$, add $i$ to $D_v$.
4. For any $j \in [m]$ such that $|\langle x_k - x_0, a_j\rangle| \ge \lambda_i\sqrt{|S_j|} - \delta$, add $S_j$ to $D_S$.
5. Set $V_{k+1}$ to be the subspace orthogonal to all $e_i$ for $i \in D_v$ and orthogonal to all $a_i$ for $S_i \in D_S$.
2. If $|x_K(i)| \ge 1-\delta$, set $x'(i) = \sgn(x_K(i))$. Otherwise, set $x'(i) = x_K(i)$.

Each of these steps can be done in polynomial time. Now, for proving correctness, there are several concerns: can we always assume $x_k \in [-1,1]^n$ and $|\iprod{x_k,a_j}|\le \lambda_j\sqrt{|S_j|}$, or does step (b) ever make us jump out of the box? Does the rounding in step 5 change the discrepancy of sets by too much? Will the algorithm ever get stuck in a place where we can't make progress (i.e. $V_k = \emptyset$) before coloring at least $n/2$ items?

The first two concerns are easy to take care of, so here are informal arguments. Since the Gaussian steps have small magnitude $\gamma$, and since we have a reasonable safety margin $\delta$ away from violating any constraint, the probability that we ever violate hard constraints in step (b) is polynomially small. The small safety margin also ensures that with high probability, the rounding we perform in step (c) cannot change the discrepancy of any set by more than $n\delta = O(1/\log n)$ over the course of the entire algorithm.

It remains to argue that with probability at least $1/10$, we won't get stuck before we will color at least $n/2$ items. We'll use a couple of (relatively standard) properties of Gaussian projections:

Claim 2 If $u\in \R^n$ and $g\in \R^n$ is a vector with i.i.d. entries $g_i \sim \cN(0,\sigma^2)$, then $\iprod{g,u}\sim \cN(0,\sigma^2\|u\|_2^2)$.

Claim 3 If $g\in \R^n$ is a vector with i.i.d. entries $g_i \sim \cN(0,\sigma^2)$, and $g'$ is the orthogonal projection of $g$ into a subspace $S \subseteq \R^n$, then $\E[\|g'\|_2^2] = \sigma^2\cdot \dim(S)$.

Claim 4 If $u\in \R^n$ and $g\in \R^n$ is a vector with i.i.d. entries $g_i\sim\cN(0,\sigma^2)$, and $g'$ is the orthogonal projection of $g$ into a subspace $S \subseteq \R^n$, then $\iprod{g',u}\sim \cN(0,\alpha\|u\|_2^2)$ where $\alpha \le \sigma^2$.

The proof of the first claim follows from the additive property of Gaussians. The second and third claims can be proven using the first claim, by considering an orthogonal projection matrix into the subspace $S$.

Now, we are equipped to prove the rest of the theorem. We first relate the progress of the algorithm, as measured by the variance of the Gaussian steps, to the dimension of $V_k$. By the independence of the gaussians $g_k$, have that \begin{align*} \E[\|x_{K} - x_0\|_2^2] ~=~ \E\left[\left\|\gamma \cdot \sum_{k=1}^K g_k\right\|_2^2\right] &= \gamma^2 \cdot \sum_{k=1}^K \E\left[\left\|g_k\right\|_2^2\right]\\ &= \gamma^2 \sum_{k=1}^K \E[\dim(V_k)] \qquad \\ &\ge~ \gamma^2 K \cdot \E[\dim(V_K)], \end{align*} where the second line follows from Claim 2 and the last line is because the dimension of $V_k$ decreases with $k$. Since $\dim(V_K) \ge n - |D_v| - |D_S|$, \begin{align*} &\ge 8 \E[n - |D_v| - |D_S|]. \end{align*} On the other hand, $\E[\|x_{K} - x_0\|_2^2] \le 2n$, because we stop moving in the direction of items once the coordinate magnitude is close to $1$. Thus, \begin{align} 2n &\ge 8 (n - \E[|D_S|] - \E[|D_v|],\nonumber\\ \E[|D_v|] &\ge \frac{3}{4}n - \E[|D_S|],\label{dsbd} \end{align}

Now, all that remains for us to do is argue that the expected number of dangerous sets is not too large---if you like, we are arguing that we don't arrive at $V_k = \emptyset$ before coloring in enough vertices. Recall a set $S_j$ is in $D_S$ only if for some $k$, $|\langle x_0 - x_k, a_j\rangle| \ge \lambda_i\sqrt{|S_j|} - \delta$. Since we only move orthogonal to $a_j$ for $S_j \in D_S$, it suffices to count the number of $S_j$ which are dangerous at the final step when $k = K$.

Let $J \subseteq [m]$ be the set of $j \in [m]$ for which $\lambda_j\sqrt{|S_j|} \ge 2\delta$. By Claim 3, $x_0 - x_K$ is a Gaussian vector supported on $\R^n$ with expected square norm at most $K \cdot \gamma^2 \le 8$, and $a_j$ is a vector of norm $\sqrt{S_j}$. Now, although $x_0-x_K$ is not exactly the orthogonal projection of a Gaussian vector into a subspace of $\R^n$, we can more or less apply Claim 4 to11. If we want to be rigorous, we should break up $x_0 - x_K$ into the independent Gaussian increments $x_k - x_{k+1}$ and apply Claim 4 to each of them, then look at their sum. conclude that $\langle x_0-x_K, a_j\rangle$ is distributed as a Gaussian with variance at most $8|S_j|$. Therefore for sets $j \in J$, $\Pr\left[|\langle x_0 - x_K, a_j\rangle| \ge \lambda_j\sqrt{|S_j|} - \delta\right] \le \Pr\left[|\langle x_0 - x_K, a_j\rangle| \ge \frac{1}{2}\lambda_j\sqrt{|S_j|}\right] \le 2\exp\left(-\frac{\lambda_j^2}{32}\right).$ By condition (\ref{cond}) of the theorem, there are at most $n/16$ sets in $[m]\setminus J$, or sets with $\lambda_j\sqrt{|S_j|} < 2\delta$, since for each such set $\exp(-\lambda_j^2/32) \ge n^{O(1/n^4)} \approx 1$. So the expected number of sets for which $|\langle x_0 - x_K, a_j\rangle| \ge \lambda_j\sqrt{|S_j|} -\delta$ is at most \begin{align*} \E[|D_S|] \le \frac{n}{16} + \sum_{j\in J} \Pr\left[|\langle x_0 - x_K, a_j\rangle| \ge \frac{1}{2}\lambda_j\sqrt{|S_j|} \right] \le \frac{n}{16} + 2\cdot \sum_{j \in J} \exp\left(-\frac{\lambda_i^2}{32}\right) \le \frac{3}{16}n, \end{align*} where for the last inequality we apply condition (\ref{cond}) of the theorem.

Now, plugging back into (\ref{dsbd}), \begin{align*} \E[|D_v|] \ge \left(\frac{3}{4} - \frac{3}{16}\right)n = \frac{9}{16}n. \end{align*} Let $p$ be the probability that $|D_v|$ has fewer than $n/2$ colored items. Since there can be at most $n$ colored items, $\frac{9}{16}n \le \E[|D_v|] < (1-p) \cdot n + p\cdot \frac{n}{2} = \left(1-\frac{p}{2}\right)n.$ From this we have that $p < 7/8$, and so by the union bound the algorithm succeeds with probability at least $1/8 - o(1)$. $$\tag*{\blacksquare}$$

### More sources

The paper of Lovett and Meka, as well as the previously mentioned chapter of Nikhil Bansal are good resources. The original algorithmic result can be found in this paper of Bansal.