Learning With Errors
Discrepancy: a constructive proof via random walks in the hypercube
Tselil Schramm, 2017-01-03
http://learningwitherrors.org/2017/01/03/discrepancy-constructive-rw
<script type="text/javascript">
// javascript for toggling sidenotes
function toggle_display_nojump(id) {
event.preventDefault();
var e = document.getElementById(id);
if(e.style.display == 'block')
e.style.display = 'none';
else
e.style.display = 'block';
return false; // prevent default action of jumping to anchor
}
</script>
<div style='display:none;'><script type='math/tex'> \newcommand{\disc}{\mathrm{disc}}
\newcommand{\sgn}{\mathrm{sign}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\iprod}[1]{\langle #1 \rangle}
\newcommand{\Iprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\E}{\mathop{\mathbb{E}}}
\newcommand{\cN}{\mathcal{N}}
\newcommand{\cB}{\mathcal{B}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\Id}{\mathrm{Id}}
\newcommand{\Tr}{\mathop{Tr}}
\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\Ind}{\mathbbm{1}}
</script></div>
<p>
In this post, I'll give two constructive/algorithmic discrepancy upper bounds. The first, by Beck and Fiala, applies to sparse set systems. The second, by Lovett and Meka, improves on the Beck-Fiala result and also matches the guarantees of Spencer's theorem.
<p>
<!--more-->
<h3 class='tex'>Discrepancy Minimization</h3> Recall that, given a system of subsets of $[n]$, $\cS = S_1,\ldots,S_m \subseteq [n]$, the discrepancy of a coloring $x \in \{\pm 1\}^n$ on $\cS$ is defined to be \[ \disc(x,\cS) = \max_{S_j \in \cS} \left|\sum_{i \in S_j} x_i\right|. \] In the previous post, we proved Spencer's theorem, which says that for any $\cS$, $\min_{x}\disc(x,\cS) \le O(\sqrt{n\log\frac{m}{n}})$.
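To make the definition concrete, here is a tiny Python helper (my own illustration, not from the original posts) that evaluates $\disc(x,\cS)$ for a coloring over a set system given as lists of indices:

```python
def disc(x, sets):
    """Discrepancy of the coloring x (a sequence of +-1 values) on a
    family of sets, each given as a list of indices into x."""
    return max(abs(sum(x[i] for i in S)) for S in sets)

# Three items, two sets: {0, 1} is perfectly balanced, {0, 2} is not.
print(disc([1, -1, 1], [[0, 1], [0, 2]]))  # -> 2
```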
<p>
The natural associated algorithmic task is <em>discrepancy minimization</em>---given $\cS$, we want to compute \[ x^* = \argmin_{x \in \{\pm 1\}^n} \disc(x,\cS). \] Spencer's theorem guarantees that some $x$ achieving $\disc(x,\cS) \le O(\sqrt{n\log\frac{m}{n}})$ always exists, but the proof does not provide a natural algorithm for finding the discrepancy minimizer $x^*$. Actually, finding the minimizer $x^*$ is NP-hard.
<blockquote><b>Theorem 1 (Charikar-Newman-Nikolov)</b> <em> Given a set system $\cS$ with $O(n)$ sets, it is NP-hard to distinguish whether $\disc(\cS) = 0$ or $\disc(\cS) = \Omega(\sqrt{n})$. </em></blockquote>
<p>
<p>
Still, it turns out that Spencer's theorem <em>can</em> be made algorithmic---there are efficient algorithms for computing a coloring with discrepancy $O(\sqrt{n})$. The first such algorithm was given by Bansal in 2010, and it was based on semidefinite programming. Later, in 2012, Lovett and Meka gave a simplified and slightly more general version of Bansal's result. The Lovett-Meka algorithm uses some ideas from Bansal's algorithm, but it does not rely on SDPs, using instead only linear algebra and properties of random vectors.
<p>
I think it is more natural to see the Lovett-Meka result after seeing the simpler result of Beck and Fiala for the special case when $\cS$ is sparse, and so I will give a brief account of that algorithm first.
<p>
<h3 class='tex'>Sparse set systems and Beck-Fiala</h3>
<p>
Suppose that we have a set system $\cS$ which is <em>sparse</em>, so that every item is in at most $t$ sets. In this case, we can get the following specialized bound:
<blockquote><b>Theorem 2 (Beck-Fiala)</b> <em> If $\cS = S_1,\ldots,S_m$ is a set system with $S_j \subseteq [n] ~ \forall j \in [m]$, and each $i\in[n]$ is included in at most $t$ sets of $\cS$, then there is an algorithm that computes a coloring $x \in \{\pm 1\}^n$ with \[ \disc(x,\cS) \le 2t - 1. \] </em></blockquote>
<p>
Beck and Fiala also conjectured that one could obtain a bound of $\disc(\cS) \le O(\sqrt{t})$ for this setting---the Beck-Fiala conjecture remains a major open problem in discrepancy theory.
<p>
<em>Proof:</em> The proof is algorithmic---we'll start with the fractional coloring $x_0 = \vec{0}$, and update $x$ iteratively until we reach an integral point in $\{\pm 1\}^n$, arguing that we cannot do too much damage along the way.
<p>
The algorithm is as follows. At step $k$ of the algorithm, say we have the fractional coloring $x_k \in [-1,1]^n$. We keep track of the “live” items, i.e. items for which $|x_k(i)| < 1$. We also keep track of the “dangerous” sets: a set is called dangerous if it contains more than $t$ live items.
<blockquote><b>Claim 1</b> <em> At step $k$, if there are $n_k$ live items, then there can be at most $n_k - 1$ dangerous sets. </em></blockquote>
<p>
This is true because each dangerous set has at least $t+1$ live items, but each item appears in at most $t$ sets, so if we restrict the incidence matrix $A$ to the rows corresponding to dangerous sets and the columns corresponding to live items, there are at most $n_k\cdot t$ nonzero entries, and therefore there can be at most $\lfloor\frac{t\cdot n_k}{t+1}\rfloor \le n_k-1$ dangerous sets.
<p>
So, if we let $A_k$ be the restriction of the incidence matrix to live columns and dangerous rows in the $k$th step, $A_k$ has at most $n_k - 1$ rows and $n_k$ columns, so there must always exist some nonzero vector $y_k \in \R^{n_k}$ which is orthogonal to all rows of $A_k$; furthermore we can find $y_k$ efficiently, e.g. by Gaussian elimination.
<p>
Let $z_k$ be the natural extension of $y_k$ to the space of non-live items (so that $z_k(i) = y_k(i)$ if $i$ is live and $0$ otherwise). We perform the update \[ x_{k+1} = x_k + \alpha \cdot z_k, \] where $\alpha \in \R_+$ is chosen to be the largest number so that $x_{k+1} \in [-1,1]^n$. In other words, we start with $\alpha = 0$, and grow $\alpha$ until at least one of the entries of $x_{k+1}$ hits $1$ or $-1$. Thus the number of live items decreases by at least one, and since $z_k$ is orthogonal to the rows corresponding to dangerous sets, the discrepancy of every dangerous set remains $0$.
<p>
Now we only have to argue that once a set $S_j$ is no longer dangerous, its discrepancy can never grow larger than $2t-1$. If $S_j$ stopped being dangerous at step $k'$, then $S_j$ had at most $t$ live items in $x_{k'}$ and $\iprod{x_{k'},a_j} = 0$. From step $k'$ on, only the live items of $S_j$ can change, and each changes by strictly less than $2$: a live item moves from some value $x_{k'}(i)$ with $|x_{k'}(i)| < 1$ to a final value in $\{\pm 1\}$. So the final discrepancy of $S_j$ is strictly less than $2t$. But the final discrepancy is also an integer, since the final coloring lies in $\{\pm 1\}^n$, and so it is at most $2t-1$. $$\tag*{$\blacksquare$}$$
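Here is a sketch of the Beck-Fiala procedure in Python with NumPy (my own illustrative implementation; it substitutes an SVD-based null-space computation and a numerical tolerance for the exact linear algebra of the proof):

```python
import numpy as np

def beck_fiala(A, tol=1e-9):
    """Round to a +-1 coloring with discrepancy at most 2t-1, where A is
    the m x n 0/1 incidence matrix and t is the maximum item degree."""
    m, n = A.shape
    t = int(A.sum(axis=0).max())             # max number of sets per item
    x = np.zeros(n)                          # fractional coloring
    live = np.abs(x) < 1 - tol
    while live.any():
        k = int(live.sum())
        # dangerous sets: strictly more than t live items
        dangerous = A[:, live].sum(axis=1) > t
        M = A[np.ix_(dangerous, live)]
        if M.shape[0] == 0:
            y = np.zeros(k)
            y[0] = 1.0                       # no constraints: any direction works
        else:
            _, s, Vt = np.linalg.svd(M, full_matrices=True)
            rank = int(np.sum(s > tol))
            y = Vt[rank]                     # unit vector orthogonal to rows of M
        z = np.zeros(n)
        z[live] = y
        # grow alpha until some live coordinate of x + alpha*z hits +-1
        pos, neg = z > tol, z < -tol
        alpha = min(np.min((1 - x[pos]) / z[pos], initial=np.inf),
                    np.min((-1 - x[neg]) / z[neg], initial=np.inf))
        x = x + alpha * z
        live = np.abs(x) < 1 - tol
    return np.where(x >= 0, 1, -1)
```

On a $0/1$ incidence matrix whose maximum item degree is $t$, the returned coloring has discrepancy at most $2t-1$, matching the theorem.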
<p>
<h3 class='tex'>Constructive Spencer via guided random walks</h3>
<p>
As mentioned above, the first algorithmic proof of Spencer's result was given by Bansal in 2010. The proof is somewhat similar to the Beck-Fiala algorithm, in that it starts with the fractional coloring $x_0 = 0$, and makes updates to $x$ iteratively until hitting some integral coloring, bounding the error incurred along the way. The main point of departure is the manner in which the iterative updates to $x$ are chosen. Instead of choosing some arbitrary direction orthogonal to the dangerous sets, Bansal's algorithm uses a semidefinite program to take a random step---the semidefinite program makes sure that this random walk will make progress without violating discrepancy constraints too much. This left the analysis partially non-constructive: to argue the feasibility of the SDP, Bansal relied on Spencer's (non-constructive) result.
<p>
In 2012, Lovett and Meka simplified Bansal's approach. They removed the semidefinite programming step, returning to linear algebraic arguments reminiscent of the Beck-Fiala proof. Instead of using the SDP to guide the random walk, they argue that so long as not too many coordinates have been fixed and not too many constraints have become dangerous, there is a high-dimensional subspace of $\R^n$ in which the random walk can proceed without violating the discrepancy constraints by too much, and then they take a random step in this subspace. This gives a truly constructive proof of Spencer's result.
<p>
Ignoring variations in the constants chosen, the main theorem of the paper is the following:
<blockquote><b>Theorem 3</b> <em> Suppose that for $\lambda \in \R^m$ with $\lambda \ge 0$, \begin{equation} \sum_j \exp\left(-\frac{\lambda_j^2}{32}\right) \le \frac{n}{16}.\label{cond} \end{equation} Then for any starting point $x \in [-1,1]^n$, there exists a partial coloring $x' \in [-1,1]^n$ with at least $n/2$ entries of $x'$ having magnitude $1$ and $|\langle x - x', a_j\rangle | \le \lambda_j \sqrt{|S_j|}$ for all $j \in [m]$, and an algorithm that finds such an $x'$ with probability at least $1/10$. </em></blockquote>
<p>
<p>
First let's see that this implies Spencer's result. We'll apply the theorem recursively, like Spencer does, $T = O(\log n)$ times. We'll start with the coloring $x_0 = 0$. At the $t$th iteration of the algorithm, say we have $n_t \le n/2^{t-1}$ items uncolored. For all $j \in [m]$, we will set $\lambda^{(t)}_j = \sqrt{32\log{\frac{16m}{n_t}}}$ (it's easy to check that this satisfies condition (\ref{cond})). Then we use the algorithm to find $x_{t+1} = x'$. Letting $a_j$ be the 0/1 indicator vector for $S_j$, and letting $S_j^{(t)}$ denote the set of live items of $S_j$ at iteration $t$, by the triangle inequality and the guarantees of the theorem, \begin{align*} \disc(x_T, S_j) ~=~ |\langle x_T, a_j\rangle| ~\le~ \sum_{t=0}^{T-1} |\langle x_{t+1} - x_{t}, a_j \rangle| ~&\le ~\sum_{t=0}^{T-1} \lambda_j^{(t)}\sqrt{|S_j^{(t)}|}. \end{align*} Since there are at most $n_t \le n/2^{t-1}$ live items at iteration $t$, this is \begin{align*} &\le O\left(\sum_{t=0}^{T-1}\sqrt{n_t\log\frac{m}{n_t}}\right) ~\le~ O(\sqrt{n\log m/n}), \end{align*} where the last inequality follows because the sequence $\frac{n_t\log(m/n_t)}{n\log(m/n)}$ decays at least as fast as $t\cdot 2^{-t}$, so the square roots are dominated by a convergent geometric-type series. So, this recovers Spencer's result.
<p>
<blockquote><b>Remark 1 (Sparse set systems)</b> <em> The algorithms of Bansal and of Lovett and Meka can be generalized to give an upper bound of $O(\sqrt{t}\log n)$ for $t$-sparse set systems. If the $\lambda_j$'s are set to $\lambda_j = c \cdot \sqrt{\frac{t}{|S_j|}}$ for a large enough constant $c$, then by Markov's inequality and the sparsity of $\cS$ (the set sizes sum to at most $tn$) there are at most $2^{-k}n$ sets with $|S_j| \in [2^k t, 2^{k+1} t)$, and so \[ \sum_{j}\exp\left(-\frac{\lambda_j^2}{32}\right) \le \sum_{k=0}^{\infty} \frac{n}{2^k}\cdot \exp\left( \frac{-c^2}{2^{k+1}\cdot 32}\right), \] which meets condition (\ref{cond}) of the theorem if $c$ is chosen properly, so the conclusion follows. </em></blockquote>
<p>
Now, we will prove the theorem.
<p>
<b>Main idea:</b> Just as Beck and Fiala do, we'll start with some point $x_0$, and update $x$ iteratively, fixing $x(i)$ the moment that $|x(i)| = 1$. We will differ in our updates---we redefine a set to be dangerous when we come close to violating the constraint $|\iprod{x_t - x_0, a_j}| \le \lambda_j\sqrt{|S_j|}$. So unlike Beck-Fiala, by default we start with no dangerous sets, and we add sets to the dangerous list when they become too imbalanced.
<p>
Just like Beck-Fiala, we will only make updates orthogonal to the dangerous sets. Our updates will take the form of a random walk in the non-dangerous subspace. The trick will be to argue that by our condition (\ref{cond}) and by properties of Gaussian random walks, with reasonable probability the rank of the dangerous subspace does not become too large as long as there are still many live items to color in.
<p>
<em>Proof:</em> The algorithm is as follows: Set the step size $\gamma = \frac{1}{100n^2}$, and the safety margin $\delta = \gamma \cdot 10\log n$. Initialize the set of non-live items $D_v = \emptyset$ (notationally this is more convenient than keeping track of live items), and initialize the set of dangerous constraints $D_S = \emptyset$. Initialize the starting coloring $x_0 = x$ and the starting subspace $V_1 = \R^n$.
<ol> <li> For $k = 1,\ldots, K= 8/\gamma^2$:
<ol> <li> Sample the random vector $g_k$ by sampling $g \sim \cN(0, \Id)$ and projecting $g$ into the subspace $V_k$. <li> Take a random step by setting $x_k = x_{k-1} + \gamma \cdot g_k$. <li> For any $i \in [n]$ such that $|x_k(i)| \ge 1 - \delta$, add $i$ to $D_v$. <li> For any $j \in [m]$ such that $|\langle x_k - x_0, a_j\rangle| \ge \lambda_j\sqrt{|S_j|} - \delta$, add $S_j$ to $D_S$. <li> Set $V_{k+1}$ to be the subspace orthogonal to all $e_i$ for $i \in D_v$ and orthogonal to all $a_j$ for $S_j \in D_S$.
</ol>
<li> If $|x_K(i)| \ge 1-\delta$, set $x'(i) = \sgn(x_K(i))$. Otherwise, set $x'(i) = x_K(i)$.
</ol>
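The loop above can be sketched in NumPy as follows (a toy rendition of mine with illustrative constants rather than the $\gamma, \delta$ of the proof; the projection onto $V_k$ is done with a QR factorization of the forbidden directions):

```python
import numpy as np

def partial_coloring(A, lam, x0, rng, gamma=0.05, delta=0.1):
    """One phase of the Lovett-Meka walk (toy constants): random-walk the
    fractional coloring inside [-1,1]^n, freezing coordinates near +-1 and
    freezing the direction a_j once constraint j is nearly tight."""
    m, n = A.shape
    x = x0.astype(float).copy()
    budget = lam * np.sqrt(A.sum(axis=1))     # lambda_j * sqrt(|S_j|)
    frozen = np.zeros(n, dtype=bool)          # D_v
    danger = np.zeros(m, dtype=bool)          # D_S
    for _ in range(int(8 / gamma**2)):
        if frozen.sum() >= n // 2:
            break                             # colored half the items
        # rows spanning the forbidden directions: e_i (frozen), a_j (dangerous)
        C = np.vstack([np.eye(n)[frozen], A[danger]])
        g = rng.standard_normal(n)
        if C.shape[0] > 0:
            Q, _ = np.linalg.qr(C.T)          # orthonormal basis containing span(C)
            g -= Q @ (Q.T @ g)                # project g onto the allowed subspace
        x += gamma * g
        frozen |= np.abs(x) >= 1 - delta
        danger |= np.abs(A @ (x - x0)) >= budget - delta
    return np.clip(x, -1.0, 1.0), frozen
```

The caller supplies the budgets $\lambda_j$; the walk tries to freeze roughly half the coordinates near $\pm 1$ while never moving along a direction that is already frozen or dangerous.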
<p>
Each of these steps can be done in polynomial time. Now, for proving correctness, there are several concerns: can we always assume $x_k \in [-1,1]^n$ and $|\iprod{x_k - x_0,a_j}|\le \lambda_j\sqrt{|S_j|}$, or does the random step in (b) ever make us jump out of the box? Does the rounding in the final step change the discrepancy of sets by too much? Will the algorithm ever get stuck in a place where we can't make progress (i.e. $V_k = \{0\}$) before coloring at least $n/2$ items?
<p>
The first two concerns are easy to take care of, so here are informal arguments. Since the Gaussian steps have small magnitude $\gamma$, and since we keep a safety margin $\delta$ away from violating any constraint, the probability that we ever violate a hard constraint in step (b) is polynomially small. The small safety margin also ensures that the rounding we perform in the final step cannot change the discrepancy of any set by more than $n\delta = O\left(\frac{\log n}{n}\right) = o(1)$ over the course of the entire algorithm.
<p>
It remains to argue that with probability at least $1/10$, we won't get stuck before coloring at least $n/2$ items. We'll use a couple of (relatively standard) properties of Gaussian projections:
<blockquote><b>Claim 2</b> <em><a name="f1"></a> If $u\in \R^n$ and $g\in \R^n$ is a vector with i.i.d. entries $g_i \sim \cN(0,\sigma^2)$, then $\iprod{g,u}\sim \cN(0,\sigma^2\|u\|_2^2)$. </em></blockquote>
<p>
<p>
<blockquote><b>Claim 3</b> <em><a name="f2"></a> If $g\in \R^n$ is a vector with i.i.d. entries $g_i \sim \cN(0,\sigma^2)$, and $g'$ is the orthogonal projection of $g$ into a subspace $S \subseteq \R^n$, then $\E[\|g'\|_2^2] = \sigma^2\cdot \dim(S)$. </em></blockquote>
<p>
<p>
<blockquote><b>Claim 4</b> <em><a name="f3"></a> If $u\in \R^n$ and $g\in \R^n$ is a vector with i.i.d. entries $g_i\sim\cN(0,\sigma^2)$, and $g'$ is the orthogonal projection of $g$ into a subspace $S \subseteq \R^n$, then $\iprod{g',u}\sim \cN(0,\alpha\|u\|_2^2)$ where $\alpha \le \sigma^2$. </em></blockquote>
<p>
The proof of the first claim follows from the additive property of Gaussians. The second and third claims can be proven using the first claim, by considering an orthogonal projection matrix into the subspace $S$.
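Claim 3 is also easy to check numerically (an illustration of mine): project standard Gaussians onto a random $k$-dimensional subspace of $\R^n$ and compare the average squared norm to $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 12, 5, 20000
# Orthonormal basis of a random k-dimensional subspace of R^n.
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
# N standard Gaussians; the projection of g has squared norm ||Q^T g||^2.
G = rng.standard_normal((N, n))
mean_sq = ((G @ Q) ** 2).sum(axis=1).mean()   # estimates E||g'||^2, which is k
```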
<p>
Now, we are equipped to prove the rest of the theorem. We first relate the progress of the algorithm, as measured by the variance of the Gaussian steps, to the dimension of $V_k$. By the independence of the Gaussians $g_k$, we have that \begin{align*} \E[\|x_{K} - x_0\|_2^2] ~=~ \E\left[\left\|\gamma \cdot \sum_{k=1}^K g_k\right\|_2^2\right] &= \gamma^2 \cdot \sum_{k=1}^K \E\left[\left\|g_k\right\|_2^2\right]\\ &= \gamma^2 \sum_{k=1}^K \E[\dim(V_k)] \qquad \\ &\ge~ \gamma^2 K \cdot \E[\dim(V_K)], \end{align*} where the second line follows from Claim <a href="#f2">3</a> and the last line is because the dimension of $V_k$ decreases with $k$. Since $\dim(V_K) \ge n - |D_v| - |D_S|$ and $\gamma^2 K = 8$, \begin{align*} \E[\|x_{K} - x_0\|_2^2] \ge 8 \E[n - |D_v| - |D_S|]. \end{align*} On the other hand, $\E[\|x_{K} - x_0\|_2^2] \le 2n$, because we stop moving in the direction of an item once its coordinate magnitude is close to $1$. Thus, \begin{align} 2n &\ge 8 \left(n - \E[|D_S|] - \E[|D_v|]\right),\nonumber\\ \E[|D_v|] &\ge \frac{3}{4}n - \E[|D_S|].\label{dsbd} \end{align}
<p>
Now, all that remains for us to do is argue that the expected number of dangerous sets is not too large---if you like, we are arguing that we don't arrive at $V_k = \{0\}$ before coloring in enough vertices. Recall that a set $S_j$ is in $D_S$ only if for some $k$, $|\langle x_0 - x_k, a_j\rangle| \ge \lambda_j\sqrt{|S_j|} - \delta$. Since we only move orthogonal to $a_j$ for $S_j \in D_S$, it suffices to count the number of $S_j$ which are dangerous at the final step when $k = K$.
<p>
Let $J \subseteq [m]$ be the set of $j \in [m]$ for which $\lambda_j\sqrt{|S_j|} \ge 2\delta$. The difference $x_0 - x_K$ is a sum of $K$ independent Gaussian steps, and $a_j$ is a vector of norm $\sqrt{|S_j|}$. Now, although $x_0-x_K$ is not exactly the orthogonal projection of a single Gaussian vector into a subspace of $\R^n$, we can more or less apply Claim <a href="#f3">4</a> to<sup><a href='#footnote1' onclick="toggle_display_nojump('footnote1');">1</a></sup><span class='sidenote' id='footnote1'><a name='footnote1' href='#footnote1'>1.</a> If we want to be rigorous, we should break up $x_0 - x_K$ into the independent Gaussian increments $x_k - x_{k+1}$ and apply Claim <a href="#f3">4</a> to each of them, then look at their sum. </span> conclude that $\langle x_0-x_K, a_j\rangle$ is distributed as a Gaussian with variance at most $K\gamma^2\cdot|S_j| = 8|S_j|$. Therefore for sets $j \in J$, \[ \Pr\left[|\langle x_0 - x_K, a_j\rangle| \ge \lambda_j\sqrt{|S_j|} - \delta\right] \le \Pr\left[|\langle x_0 - x_K, a_j\rangle| \ge \frac{1}{2}\lambda_j\sqrt{|S_j|}\right] \le 2\exp\left(-\frac{\lambda_j^2}{32}\right). \] By condition (\ref{cond}) of the theorem, there are at most $n/16$ sets in $[m]\setminus J$, i.e. sets with $\lambda_j\sqrt{|S_j|} < 2\delta$: each such set has $\lambda_j < 2\delta$, and hence contributes $\exp(-\lambda_j^2/32) \ge \exp(-\delta^2/8) \approx 1$ to the left-hand side of (\ref{cond}). So the expected number of sets for which $|\langle x_0 - x_K, a_j\rangle| \ge \lambda_j\sqrt{|S_j|} -\delta $ is at most \begin{align*} \E[|D_S|] \le \frac{n}{16} + \sum_{j\in J} \Pr\left[|\langle x_0 - x_K, a_j\rangle| \ge \frac{1}{2}\lambda_j\sqrt{|S_j|} \right] \le \frac{n}{16} + 2\cdot \sum_{j \in J} \exp\left(-\frac{\lambda_j^2}{32}\right) \le \frac{3}{16}n, \end{align*} where for the last inequality we apply condition (\ref{cond}) of the theorem.
<p>
Now, plugging back into (\ref{dsbd}), \begin{align*} \E[|D_v|] \ge \left(\frac{3}{4} - \frac{3}{16}\right)n = \frac{9}{16}n. \end{align*} Let $p$ be the probability that fewer than $n/2$ items are colored, i.e. that $|D_v| < n/2$. Since $|D_v| \le n$ always, \[ \frac{9}{16}n \le \E[|D_v|] \le (1-p) \cdot n + p\cdot \frac{n}{2} = \left(1-\frac{p}{2}\right)n. \] From this we have that $p \le 7/8$, and so by the union bound (over the polynomially small failure events above) the algorithm succeeds with probability at least $1/8 - o(1)$. $$\tag*{$\blacksquare$}$$
<p>
<h3 class='tex'>More sources</h3>
The <a href="https://arxiv.org/abs/1203.5747">paper</a> of Lovett and Meka, as well as the previously mentioned <a href="http://www.win.tue.nl/~nikhil/pubs/author%20-nikhil-2.pdf">chapter</a> of Nikhil Bansal are good resources.
The original algorithmic result can be found in <a href="https://arxiv.org/abs/1002.2259">this</a> paper of Bansal.
<p>
Discrepancy: definitions and Spencer's six standard deviations
Tselil Schramm, 2016-12-26
http://learningwitherrors.org/2016/12/26/discrepancy-spencer-six
<script type="text/javascript">
function toggle_display_nojump(id) {
event.preventDefault();
var e = document.getElementById(id);
if(e.style.display == 'block')
e.style.display = 'none';
else
e.style.display = 'block';
return false; // prevent default action of jumping to anchor
}
</script>
<div style='display:none;'><script type='math/tex'> \newcommand{\disc}{\mathrm{disc}}
\newcommand{\sgn}{\mathrm{sign}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\iprod}[1]{\langle #1 \rangle}
\newcommand{\Iprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\E}{\mathop{\mathbb{E}}}
\newcommand{\cN}{\mathcal{N}}
\newcommand{\cB}{\mathcal{B}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\Id}{\mathrm{Id}}
\newcommand{\Tr}{\mathop{Tr}}
\newcommand{\Ind}{\mathbb{I}}
</script></div> This is the first in a series of (probably at least 3) blog posts about discrepancy minimization. There are already many expositions of this topic, and it is not clear that the world needs yet another, but here it is :). In this post, I'll introduce the basic definitions, give one simple upper bound, one simple lower bound, and then prove Spencer's famous (and elegant) “six standard deviations” theorem. I won't focus on mathematical context, but at the end I will give pointers to some resources on the topic.
<!--more-->
<p>
<h3 class='tex'>Discrepancy</h3>
<p>
Suppose we have $n$ items, and a set of subsets of $[n]$, $\mathcal{S} = S_1,\ldots,S_m$ with $S_j \subseteq [n]$ for all $j \in [m]$. We want to assign each item $i \in [n]$ a sign or “color” $x_i \in \{\pm 1\}$, with the goal that the coloring of each set is as balanced as possible---formally, this is called minimizing the <em>discrepancy</em>. So for a coloring $x \in \{\pm 1\}^n$, the discrepancy or imbalance of $S \subseteq [n]$ is given by \[ \disc(x, S) = \left|\sum_{i \in S} x_i \right|. \] We define the discrepancy of $\cS$ to be the discrepancy of the worst set in $\cS$ under the most balanced coloring, \[ \disc(\cS) = \min_{x \in \{\pm 1\}^n} \max_{S_j \in \cS}~ \disc(x,S_j). \]
<p>
It can also be convenient to think of this as a matrix problem. Consider the incidence matrix $A$ of $\cS$, the $m \times n$ matrix with \[ A_{ji} = \begin{cases} 1 & i \in S_j\\ 0 & \text{otherwise}. \end{cases} \] Then letting $a_j$ be the 0/1 indicator vector for set $S_j$, or the $j$th row of $A$, the discrepancy of $x \in \{\pm 1\}^n$ on $S_j$ is equal to $|\iprod{a_j,x}|$, and the discrepancy of $\cS$ is equivalent to \[ \disc(\cS) = \min_{x \in \{\pm 1\}^n} \|Ax\|_{\infty}. \]
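In this matrix form, the discrepancy of a tiny system can even be brute-forced over all $2^n$ colorings (my own illustrative code; exponential in $n$, so only for toy examples):

```python
import itertools
import numpy as np

def disc_of_system(A):
    """min over x in {-1,1}^n of ||Ax||_inf, by exhaustive search."""
    n = A.shape[1]
    return min(int(np.abs(A @ np.array(x)).max())
               for x in itertools.product((-1, 1), repeat=n))

# A "planted" system: every row is orthogonal to x = (1, -1, 1, -1).
A = np.array([[1, 1, 1, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
print(disc_of_system(A))  # -> 0
```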
<p>
<h3 class='tex'>Some cursory upper and lower bounds</h3> A priori, it might not be clear what $\disc(\cS)$ should be. Some set systems have discrepancy zero---consider for example a “planted” set system, in which the rows of the incidence matrix $A$ are chosen to be orthogonal to some coloring $x\in \{\pm 1\}^n$. On the other hand, naively we could have set systems with discrepancy as large as $n$.
<p>
<h3 class='tex'>Uniformly Random Coloring</h3> It is not very hard to see that $n$ is an egregious upper bound (when $m = |\cS|$ is not too large). Let's consider $x \in \{\pm 1\}^n$ chosen uniformly at random. For each $S_j \in \cS$, by Hoeffding's inequality, we have that \[ \Pr\left[\left|\sum_{i \in S_j} x_i \right| \ge t \sqrt{n} \right] \le 2\exp\left(-\frac{t^2}{2}\right). \] So, if we set $t = 2\sqrt{2\log m}$, we can beat the union bound over the sets: \[ \Pr[\|Ax\|_{\infty} \ge t\sqrt{n}] \le \sum_{j \in [m]} \Pr\left[\left|\sum_{i \in S_j} x_i \right| \ge t\sqrt{n} \right] \le m \cdot 2\exp(-4 \log m) = O\left(\frac{1}{m^3}\right). \] So for any set system $\cS$, the random coloring gives the improved bound, \begin{equation} \disc(\cS) \le O(\sqrt{n\log m}).\label{randomub} \end{equation}
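A quick simulation (mine, not from the post) shows how tame a uniformly random coloring is; the threshold in the final check is a loose version of the Hoeffding-plus-union-bound above, chosen so that it essentially never trips:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 50
A = (rng.random((m, n)) < 0.5).astype(int)   # m random subsets of [n]
x = rng.choice((-1, 1), size=n)              # uniformly random coloring
worst = int(np.abs(A @ x).max())
# By Hoeffding + union bound, exceeding ~sqrt(n log m) is very unlikely;
# 4*sqrt(n ln m) ~ 79 here, far below the trivial bound of n = 100.
assert worst <= 4 * np.sqrt(n * np.log(m))
```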
<p>
<h3 class='tex'>A Lower Bound</h3> We'll see that the upper bound we got from the random coloring is almost tight. Our lower bound will come from the set system defined by the Hadamard matrix $H$---for us, it will be enough to know that the Hadamard matrix is a symmetric matrix with entries in $\{\pm 1\}$, the first column (and therefore row) of $H$ is the all-1's vector $\vec{1}$, and the columns are mutually orthogonal so that $HH^{\top} = n\cdot \Id$.
<p>
The basic idea for the lower bound is that, because $H$ has large eigenvalues, it cannot map any Boolean vector into a ball of small radius, giving us large discrepancy. But $H$ has negative entries, so in the process of translating it into a valid incidence matrix with entries in $\{0,1\}$ we have to make sure we didn't introduce a small eigenvalue in the direction of some $x \in \{\pm 1\}^n$.
<p>
The proof is only a couple of lines. Let $J$ be the all-$1$s matrix. We can define the set system $\cS$ that has an incidence matrix $A = \frac{1}{2}(J+H)$---this is valid, because $A$ is a $0/1$ matrix. Now, for any $x \in \{\pm 1\}^n$, we have that \begin{align*} n \cdot \|Ax\|_{\infty}^2 ~\ge~\|Ax\|_2^2 ~=~ x^{\top}A^{\top} A x ~=~\frac{1}{4} x^{\top} \left(H^{\top} H + J^{\top} J + J^{\top} H + H^{\top} J\right)x. \end{align*} To simplify, we can observe that $H^{\top}H = n\cdot \Id$, and that $J^{\top} J = n\cdot J$. Also, because the first row/column of $H$ is equal to $\vec{1}$, and because the rows of $H$ are orthogonal, $H\vec{1} = n \cdot e_1$, and so $JH^{\top} = n \cdot \vec{1}e_1^{\top}$. So, \begin{align*} x^{\top} A^{\top}Ax ~=~ \frac{n}{4} \cdot x^{\top}\left( \Id + J + e_1 \vec{1}^{\top} + \vec{1}e_1^{\top}\right) x ~=~ \frac{n}{4}\left(n + \iprod{x,\vec{1}}^2 + 2\cdot x_1 \cdot \iprod{x,\vec{1}}\right) \end{align*} Because $A$'s first row is $\vec{1}$ we have that $|\iprod{x,\vec{1}}| \ge \sqrt{n} \implies \|Ax\|_{\infty} \ge \sqrt{n}$. Otherwise, if $|\iprod{x,\vec{1}}| < \sqrt{n}$, plugging in to the above we have that \begin{align*} n\cdot \|Ax\|_{\infty}^2 ~\ge~ x^{\top} A^{\top} A x ~>~ \frac{n}{4}\left(n - 2\sqrt{n}\right) \quad \implies \quad \|Ax\|_{\infty} \ge \Omega(\sqrt{n}). \end{align*} So for the Hadamard set system $\cS$, \begin{equation} \disc(\cS) \ge \Omega(\sqrt{n}).\label{lb} \end{equation}
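One can check the construction directly for small $n$ (my own code): build the Sylvester-Hadamard matrix $H$ of order $8$, form $A = \frac{1}{2}(J+H)$, and brute-force the discrepancy over all $2^8$ colorings; the calculation above forces it to be at least $2$ at $n = 8$.

```python
import itertools
import numpy as np

# Sylvester construction: H_{2k} = [[H_k, H_k], [H_k, -H_k]].
H = np.array([[1]])
for _ in range(3):
    H = np.block([[H, H], [H, -H]])
n = H.shape[0]                                  # n = 8
A = (np.ones((n, n), dtype=int) + H) // 2       # the 0/1 matrix (J + H)/2

best = min(int(np.abs(A @ np.array(x)).max())
           for x in itertools.product((-1, 1), repeat=n))
assert best >= 2    # Omega(sqrt(n)) lower bound, instantiated at n = 8
```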
<p>
<h3 class='tex'>Spencer's Theorem</h3>
<p>
It turns out that in fact the lower bound from (\ref{lb}) is tight up to constant factors, and we can get rid of the logarithmic factor from the upper bound (\ref{randomub}). In the 1980's, Spencer proved the following result:
<blockquote><b>Theorem 1</b> <em> For any set system $\cS$ on $[n]$ with $|\cS| = m \ge n$, \[ \disc(\cS) \le 6\sqrt{n\log\frac{2m}{n}}. \] </em></blockquote>
<p>
Maybe improving on the logarithmic factor in (\ref{randomub}) does not seem like a big deal, but a priori it is not obvious that one should be able to get a bound better than the random assignment. Also, Spencer's proof is extremely elegant (though nonconstructive). We'll prove it below, but we'll assume that $m = n$, and we'll be sloppy with constants (so we'll get an upper bound of $O(\sqrt{n})$ instead of $6\sqrt{n}$).
<p>
<p>
<b>The gist.</b> The proof is by induction---given $\cS$ and $[n]$, Spencer shows that there exists some <em>partial coloring</em> $y \in \{-1,0,1\}^n$ with at least a constant fraction of the elements colored in and low discrepancy, so that $\|y\|_1 \ge c\cdot n$ and $\|Ay\|_{\infty} \le C\sqrt{n}$ for constants $c,C$. Then this fact is applied inductively for $\log n$ steps, until all of the items are colored, and the total discrepancy is $\sum_{t=0}^{\log n} C\cdot \sqrt{c^t\cdot n} = O(\sqrt{n})$, since $c < 1$.
<p>
The reason such a partial coloring $y$ must exist is because the map $Ax$ is not spread out enough---by the pigeonhole principle, one can show that there are at least $2^{\Omega(n)}$ distinct points $x \in \{\pm 1\}^n$ that get mapped to a ball of radius $O(\sqrt{n})$. Since there are so many points, there must exist $x_1,x_2$ in this ball that have large Hamming distance, so that their difference has many nonzero entries, and so the partial coloring is given by $y = \frac{1}{2}(x_1-x_2)$.
<p>
Now for the more formal proof. We will prove the following lemma, which we will then apply inductively:
<blockquote><b>Lemma 2</b> <em> Let $m,n \in \N$ with $n \le m$, and let $A$ be an $m \times n$ matrix with $0/1$ entries. Then there exist universal constants $c_1,c_2$ such that there always exists a vector $y \in \{-1,0,1\}^n$ so that $\|y\|_1 \ge c_1 \cdot n$ and \[ \|Ay\|_{\infty} \le c_2\cdot \sqrt{n\log\frac{m}{n}}. \] </em></blockquote>
<p>
<p>
<em>Proof:</em> Let $\cB_{\infty}(r,p)$ denote the ball of radius $r$ around $p$, where distance is measured in $\ell_\infty$. We'll show that there must exist some point $q \in \R^m$ such that \[ \Pr_{x\sim\{\pm 1\}^n}[ Ax \in \cB_{\infty}(r,q)] \ge 2^{-cn}, \] for some constant $c$. Our strategy will be to use the pigeonhole principle. We'll identify a set $B \subset \R^m$ with $|B| \le 2^{\epsilon n}$, so that for uniformly chosen $x \in \{\pm 1\}^n$, there exists a point in $B$ close to $Ax$ with probability at least $\frac{1}{2}$; because $|B| \le 2^{\epsilon n}$, there must be some $q \in B$ which is near at least $2^{(1-\epsilon)n - 1}$ of the points $Ax$. To find such a $B$, which contains few points but is close to $Ax$ with constant probability over uniform $x \in \{\pm 1\}^n$, we'll choose $B$ to be a discretization of the set of points in $\R^m$ that do not have too many entries of large magnitude. The way that we define “large magnitude” will partially depend on the standard deviation of entries of $Ax$, and partly on wanting to keep $B$ relatively small.
<p>
Define the function $f:\{\pm 1\}^n \to \Z^m$ so that \[ f(x) = \left\lceil \frac{1}{\sqrt {2n\log \frac{m}{n}}} Ax\right\rceil. \] In words, $f$ rounds each coordinate of $Ax/\sqrt{2n\log \frac{m}{n}}$ up to the nearest integer.
<p>
We next identify some small subset $B \subset \Z^m$ which contains a large fraction of the range of $f$, which will imply that there must exist some $q$ such that $\sqrt{2n\log\frac{m}{n}}\cdot q$ is close to $Ax$ for many $x \in \{\pm 1\}^n$. Define $B \subset \Z^m$ to be the set of points for which, for every $t \ge 1$, at most a $\kappa_t = 2^{t+2} (m/n)^{-t^2}$-fraction of coordinates have magnitude at least $t$, \[ B = \left\{(b_1,\ldots,b_m) \in \Z^m ~|~ |\{i ~s.t.~ |b_i| \ge t\}| \le \kappa_t \cdot m ~~\forall t \ge 1\right\}. \] For uniformly chosen $x$, by Hoeffding's inequality, \[ \Pr\left[ |\iprod{x,a_j}| \ge t \sqrt{2n\log\frac{m}{n}} \right]\le 2\left(\frac{m}{n}\right)^{-t^2}. \] And so the expected number of $j \in [m]$ for which $|\iprod{x,a_j}| \ge t\sqrt{2n\log\frac{m}{n}}$ is at most \[ \E\left[\sum_{j\in[m]} \Ind\left(\left|\iprod{x,a_j}\right| \ge t\sqrt{2n\log\frac{m}{n}}\right) \right] \le m \cdot 2\left(\frac{m}{n}\right)^{-t^2}, \] and by Markov's inequality \begin{align} \Pr\left[\sum_{j\in[m]} \Ind\left(\left|\iprod{x,a_j}\right| \ge t\sqrt{2n\log\frac{m}{n}}\right) \ge 2^{t+2} m \cdot \left(\frac{m}{n}\right)^{-t^2}\right] &\le \frac{1}{2^{t+1}}.\label{eq:cond} \end{align} Recalling that $\kappa_t = 2^{t+2} \left(\frac{m}{n}\right)^{-t^2}$, the probability that $f(x) = \lceil (2n\log\frac{m}{n})^{-1/2} \cdot Ax\rceil \not\in B$ is at most the sum of (\ref{eq:cond}) over all $t \le \sqrt{n}$ (larger $t$ never occur, since $|\iprod{x,a_j}| \le n$ always), and so by a union bound, \begin{align*} \Pr[f(x) \not \in B] ~=~ \Pr\left[\exists t ~s.t.~ \sum_{j \in [m]} \Ind\left(|\iprod{x,a_j}| \ge t\sqrt{2n\log\frac{m}{n}}\right) \ge \kappa_t \cdot m\right] ~\le~ \sum_{t=1}^{\sqrt{n}} \frac{1}{2^{t+1}} ~\le~ \frac{1}{2}. \end{align*}
<p>
At the same time, the size of $B$ can be bounded with some meticulous but uncomplicated counting arguments---we won't reproduce them at full resolution here, but the basic idea is that a point in $B$ has some $\alpha_t \le \kappa_t$ fraction of entries of magnitude $t$. So for any valid sequence $\alpha = \alpha_1,\ldots,\alpha_n$ with $\alpha_t \le \kappa_t$ for all $t$, we have at most \[ \prod_{t=1}^{n}2^{\alpha_t m} \cdot \binom{\left(1 - \sum_{s < t}\alpha_s\right)\cdot m}{\alpha_t \cdot m} \] points (the binomial coefficient chooses which coordinates have magnitude $t$, and the factor $2^{\alpha_t m}$ their signs), and then summing over all valid $\alpha$, \begin{align*} |B| &\le \sum_{\alpha} \prod_{t=1}^{n}2^{\alpha_t m} \cdot \binom{\left(1 - \sum_{s < t}\alpha_s\right)\cdot m}{\alpha_t \cdot m}. \end{align*} After applying a number of rearrangements and approximations, by our choice of $\kappa_t$'s one can conclude that \[ |B| \le 2^{cn}, \] for some constant $c < 1$.
<p>
<p>
Since $|B| \le 2^{cn}$ but $f(x) \in B$ for at least $2^{n-1}$ of the points $x \in \{\pm 1\}^n$, it follows that there must exist some $q \in B$ such that $f$ maps at least $2^{(1-c)n -1}$ of the $x \in \{\pm 1\}^n$ to $q$. If $f(x) = q$, then by definition every coordinate of $Ax$ is within $\sqrt{2n\log \frac{m}{n}}$ of the corresponding coordinate of $\sqrt{2n\log \frac{m}{n}}\cdot q$. So we have that \[ \Pr\left[Ax \in \cB_{\infty}\left(\sqrt{2n\log\frac{m}{n}}, \sqrt{2n\log\frac{m}{n}}\cdot q\right)\right] \ge 2^{-cn - 1}. \]
<p>
The proof is now complete if we observe that any subset $C \subset \{\pm 1\}^n$ with $|C| \ge 2^{cn}$ must have two points at Hamming distance at least $\Omega(n)$. This is by a theorem of Kleitman, but it is not hard to see. The idea is that, if we choose a single point $p \in C$, the number of points around it of Hamming distance at most $2\epsilon n$ is \[ \sum_{k=1}^{2\epsilon n} \binom{n}{k} \le 2^{H(2\epsilon)n}, \] where $H(\cdot)$ is the binary entropy function (this is a standard upper bound for a partial sum of binomial coefficients, valid for $\epsilon < 1/4$). So if $|C| > 2^{H(2\epsilon)\cdot n}$, it must contain at least two points of Hamming distance at least $2\epsilon n$.
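The binomial partial-sum bound used here is easy to spot-check (an illustration of mine):

```python
import math

def H2(p):
    """Binary entropy, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, eps = 40, 0.2
# Size of the Hamming ball of radius eps*n, versus the entropy bound.
ball = sum(math.comb(n, k) for k in range(int(eps * n) + 1))
assert ball <= 2 ** (H2(eps) * n)
```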
<p>
Since there are at least $2^{(1-c)n}$ points $x \in \{\pm 1\}^n$ with $Ax \in \cB_{\infty}\left(\sqrt{2n\log\frac{m}{n}},\, \sqrt{2n\log\frac{m}{n}}\cdot q\right)$, there must be two points $x_1,x_2$ such that $\|x_1-x_2\|_1 \ge 2H^{-1}(1-c)\cdot n$ and $\|Ax_1 - Ax_2\|_{\infty} \le \left\|Ax_1 - \sqrt{2n\log\tfrac{m}{n}}\cdot q\right\|_{\infty} + \left\|\sqrt{2n\log\tfrac{m}{n}}\cdot q - Ax_2\right\|_{\infty} \le 2\sqrt{2n\log\frac{m}{n}}$. Setting $y = \frac{1}{2}(x_1-x_2)$, $c_1 = H^{-1}(1-c)$ and $c_2 = \sqrt{2}$, the conclusion holds. $$\tag*{$\blacksquare$}$$
<p>
Now, we are ready to prove Spencer's Theorem.
<p>
<em>Proof:</em> We apply our lemma recursively, coloring at least a $c_1$-fraction of the remaining items in each round. After each partial coloring, we update $A$ by removing the columns corresponding to colored items, so that at the $t$th step at most $(1-c_1)^t\cdot n$ columns remain. There can be at most $O(\log n)$ rounds of partial coloring, and the $t$th round incurs a discrepancy of at most $c_2\sqrt{(1-c_1)^t n \log\frac{m}{(1-c_1)^t n}}$ in each set. Thus the total discrepancy is at most \begin{align*} \disc(\cS) &\le \sum_{t=0}^{O(\log n)}c_2\sqrt{(1-c_1)^t n \log\frac{m}{(1-c_1)^t n}}\\ &\le c_2\sqrt{n}\cdot \sum_{t=0}^{\infty}(1-c_1)^{t/2}\left(\log\frac{m}{n} + t\log \frac{1}{1-c_1} \right)^{1/2}.
\end{align*}
And since $(x+y)^{1/2} \le x^{1/2} + y^{1/2}$ for $x,y \ge 0$, this is
\begin{align*}
&\le O\left(\sqrt{n\log\frac{m}{n}}\right)\cdot \sum_{t=0}^{\infty}(1-c_1)^{t/2} + O\left(\sqrt{n}\right)\sum_{t=0}^{\infty} (1-c_1)^{t/2}\cdot t^{1/2}\\ &\le O\left(\sqrt{n\log\frac{m}{n}}\right), \end{align*} and the conclusion follows. $$\tag*{$\blacksquare$}$$
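The final step uses only that series of the form $\sum_{t \ge 0} \rho^{t/2}$ and $\sum_{t \ge 0} \rho^{t/2}\sqrt{t}$ converge to constants for any fixed $\rho \in (0,1)$; a quick numerical check (the value of $\rho$ below is an arbitrary illustrative choice):

```python
from math import sqrt

rho = 0.25  # any fixed constant in (0, 1); 0.25 is an arbitrary choice
s1 = sum(rho ** (t / 2) for t in range(1000))
s2 = sum(rho ** (t / 2) * sqrt(t) for t in range(1000))
# both partial sums have already converged to within machine precision by t = 1000
assert s1 < 3 and s2 < 3  # finite constants, so the total bound stays O(sqrt(n log(m/n)))
```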
<p>
<h2 class='tex'>More sources</h2> There are many very good expositions of discrepancy results.
For this post I heavily relied on:
<ul> <li> Joel Spencer's 1985 <a href="http://www.ams.org/journals/tran/1985-289-02/S0002-9947-1985-0784009-0/">paper</a> with the six standard deviations result.
<li> Nikhil Bansal's <a href="http://link.springer.com/chapter/10.1007/978-3-319-04696-9_6">book chapter</a> about algorithmic discrepancy.
Full text available <a href="http://www.win.tue.nl/~nikhil/pubs/author%20-nikhil-2.pdf">here</a> at the time of writing.
</ul>
These references contain pointers to other good resources, especially books, which give a more detailed account of the mathematical/historical context of discrepancy minimization.
<p>
Also, see the proof in Alon and Spencer's “The Probabilistic Method” which is based on entropy and is really clean.
<p>
</body></html>
Constructive Hardness Amplification via Uniform Direct ProductPreetum Nakkiran
2016-08-25T01:00:00+00:00
http://learningwitherrors.org/2016/08/24/uniform-direct-product<script type="text/javascript">
// javascript for toggling sidenotes
function toggle_display_nojump(id) {
event.preventDefault();
var e = document.getElementById(id);
if(e.style.display == 'block')
e.style.display = 'none';
else
e.style.display = 'block';
return false; // prevent default action of jumping to anchor
}
</script>
<div style='display:none;'><script type='math/tex'> \renewcommand\qedsymbol{$\blacksquare$}
\newcommand{\1}{\mathbb{1}}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\newcommand{\x}{\times}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\renewcommand{\bar}{\overline}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\eps}{\epsilon}
\newcommand{\bmqty}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\innp}[1]{\langle #1 \rangle}
\DeclareMathOperator{\rank}{rank}
\newcommand{\note}[1]{&&\text{(#1)}} % use this to annotate eqns in align environments.
\DeclareMathOperator*{\poly}{poly}
\newcommand{\TODO}[1][TODO]{\textcolor{red}{#1}}
\newcommand{\A}{\mathcal{A}}
\newcommand{\ox}{\otimes}
\newcommand{\bad}{\text{BAD}}
\newcommand{\good}{\text{GOOD}}
</script></div>
<p>
<em>This post was motivated by trying to understand the recent paper “Learning Algorithms from Natural Proofs”, by Carmosino-Impagliazzo-Kabanets-Kolokolova [<a href='#ref-CIKK16'>CIKK16</a>]. They crucially use the fact that several results in hardness amplification can be made constructive. In this post, we will look at the Uniform Direct Product Theorem of Impagliazzo-Jaiswal-Kabanets-Wigderson [<a href='#ref-IJKW10'>IJKW10</a>]. We will state the original theorem and algorithm of [<a href='#ref-IJKW10'>IJKW10</a>], then we will present a simpler analysis for a (weaker) non-uniform version of their algorithm, which contains some of the main ideas. </em>
<p>
For a given function $f: \{0, 1\}^n \to \{0, 1\}^\ell$, say a circuit $C$ “$\eps$-computes $f$” if $C$ computes $f$ correctly on at least an $\eps$-fraction of inputs. That is, $\Pr_x[C(x) = f(x)] \geq \epsilon$. We are interested in the following kind of direct product theorem (informally): “If function $f$ cannot be $\eps$-computed by any small circuit $C$, then the direct-product $f^{\ox k}(x_1, x_2, \dots x_k) := (f(x_1), f(x_2), \dots, f(x_k))$ cannot be computed better than roughly $\eps^k$ by any similarly small circuit.” <sup class='footnotemark'><a href='#footnote1' onclick="toggle_display_nojump('footnote1');">1</a></sup><span class='sidenote' id='footnote1'><a name='footnote1' href='#footnote1'>1.</a> If this seems trivial, consider the $k=2$ case. We want to show that if $\Pr_x[C(x) = f(x)] \leq \epsilon$ for all small circuits $C$, then $\Pr_{x, y}[C'(x, y) = (f(x), f(y))] \lesssim \eps^2$ for all similarly small circuits $C'$. This is clearly true if the circuit $C'$ operates independently on its inputs, but not as clear otherwise (e.g., the correctness of $C'$'s two outputs could be highly correlated). Indeed, proofs of the direct-product theorem take advantage of this correlation. </span>
<p>
This is usually proved<sup class='footnotemark'><a href='#footnote2' onclick="toggle_display_nojump('footnote2');">2</a></sup><span class='sidenote' id='footnote2'><a name='footnote2' href='#footnote2'>2.</a> See the last section for good references to prior proofs. </span> in the contrapositive, by showing: if there exists a circuit $C'$ that $\eps^k$-computes $f^{\ox k}$, then there exists a similarly-sized circuit $C$ that $\eps$-computes $f$. The very interesting part is that this amplification can be made fully constructive, by a simple algorithm.
<!--more-->
<blockquote><b>Theorem 1 ([<a href='#ref-IJKW10'>IJKW10</a>], and Theorem 4.1 [<a href='#ref-CIKK16'>CIKK16</a>])</b> <em> <a name="thmuniformDP"></a> Let $k \in \N, \eps > 0$. There is a (uniform) PPT algorithm $\A$ with the following guarantees:
<ul> <li> <b>Input:</b> A circuit $C'$ that $\eps$-computes $f^{\ox k}$ for some function $f:\{0,1\}^n \to \{0, 1\}^\ell$. <li> <b>Output:</b> With probability $\Omega(\eps)$, output a circuit $C$ that $(1-\delta)$-computes $f$.
</ul>
for $\delta = O(\log(1/\eps)/k)$. In particular, $(1-\delta) = \eps^{O(1/k)}$. The circuit $C$ is of size $|C'|\poly(n, k, \log(1/\delta), 1/\eps)$. </em></blockquote>
<p>
<p>
Note that we can only hope to construct the good circuit with probability $\Omega(\eps)$, since unique decoding is impossible: the circuit $C'$ may $\eps$-compute up to $(1/\eps)$ different functions $f$ (agreeing with a different function on each $\eps$-fraction of its inputs).
<p>
<h2 class='tex'>1. Uniform Version </h2>
<p>
The algorithm for Theorem <a href="#thmuniformDP">1</a> is: <div class='framed'> $\mathcal{A}(C')$:
<p>
Input: A circuit $C'$ that $\eps$-computes the direct-product $f^{\ox k}$.
<ol> <li> Pick $k$ iid random inputs $x_i \in \{0, 1\}^n$, let $\vec b = (x_1, \dots, x_k)$, and evaluate $C'(\vec b)$. <li> Pick a random subset $A \subset \{x_1, \dots, x_k\}$ of size $k/2$. Record $v := C'(\vec b)|_A$ as the answers of $C'$ on the inputs in $A$. <li> Output the circuit $C_{A, v}$ defined below (with the values $v$ on the subset $A$ hardcoded).
</ol>
</div>
<p><p>
$C_{A, v}$ is defined as the randomized circuit:
<div class='framed'> $C_{A, v}(x)$:
<p>
On input $x \in \{0, 1\}^n$, check if $x \in A$, in which case output $v|_x$ (the hardcoded value of $x$ according to $v$). Otherwise, repeat the following $T = O(\log(1/\delta)/\epsilon)$ times.
<ol> <li> Sample $(k/2 - 1)$ additional iid random strings $\{y_j\}$, each $y_j \in \{0, 1\}^n$, and let $\vec b := (x, A, \{y_j\})$ be the tuple of $k$ strings. <li> Evaluate $C'(\pi(\vec b))$ for a random permutation $\pi$ of the $k$ inputs. <li> If the answers of $C'$ restricted to $A$ agree with the hardcoded values $v$, then output $C'(\pi(\vec b))|_x$, (the answer $C'$ gave for $x$), and stop.
</ol>
Output an error if no output is produced after $T$ iterations. </div>
<p><p>
<b>Intuition:</b> Suppose the values $v$ returned when the Algorithm queries $C'(\vec b)$ are actually correct. That is, $v|_x = f(x)$ for all $x \in A$. Then, the circuit $C_{A, v}$ evaluates $C'$ on inputs $\vec b = (b_1, \dots, b_k)$ for which it knows the correct value $f(b_i)$ on half of the coordinates. So, $C_{A, v}(x)$ tries to estimate whether a random evaluation $C'(\vec b)$ is correct or not, based on whether it agrees with the known values on that subset of coordinates. The idea is that a value of $C'(\vec b)$ that is wrong on many coordinates is unlikely to pass this test. (See [<a href='#ref-IJKW10'>IJKW10</a>] for the full proof.)
<p>
Now, in the remainder of this note, we will develop and prove a simpler (weaker) version.
<p>
<h2 class='tex'>2. Symmetrizing </h2> The direct-product as defined above has a permutation symmetry: \[ f^{\ox k}(\pi(x_1, \dots x_k)) = \pi(f^{\ox k}(x_1, \dots x_k)) \] for any permutation $\pi$.
<p>
The algorithm of Theorem <a href="#thmuniformDP">1</a> strongly takes advantage of this symmetry (indeed, the algorithm would not work as promised if we omitted the random permutations).<sup class='footnotemark'><a href='#footnote3' onclick="toggle_display_nojump('footnote3');">3</a></sup><span class='sidenote' id='footnote3'><a name='footnote3' href='#footnote3'>3.</a> Consider a $C'(x_1, \dots x_k)$ that is correct if $x_1$ lies in some $\eps$-density set, and random otherwise. Without the random permutations, $C_{A, v}(x)$ will always evaluate $C'(x, \dots)$, and produce no output for $(1-\epsilon)$-fraction of inputs $x$. </span> To simplify presentation, it helps to define the direct-product $f^k$ as a function over $k$-<b>multisets</b> of inputs, instead of over $k$-tuples of inputs. Following [<a href='#ref-IJKW10'>IJKW10</a>], for the remainder of this note, we will work in the setting of $k$-multisets, and denote the $k$-multiset direct product as $f^k$. That is, $f^k$ takes as input an (unordered) $k$-multiset $B = \{x_1, x_2, \dots, x_k\}$, and returns the $k$-tuple \[f^k(\{x_1, x_2, \dots, x_k\}) := (f(x_1), f(x_2), \dots, f(x_k))\]
<p>
We consider the probability measure induced by the uniform measure over tuples. That is, “pick a random $k$-multiset of $U$” means to generate a multiset by picking $k$ iid random elements from the universe $U$, and forming the (unordered) multiset containing them.<sup class='footnotemark'><a href='#footnote4' onclick="toggle_display_nojump('footnote4');">4</a></sup><span class='sidenote' id='footnote4'><a name='footnote4' href='#footnote4'>4.</a> So for example, for $k=3$ the multiset $\{a, a, a\}$ has lower probability of being drawn than $\{a, a, b\}$ for $a \neq b$. </span>
<p>
The notion of $\eps$-computing remains the same:<sup class='footnotemark'><a href='#footnote5' onclick="toggle_display_nojump('footnote5');">5</a></sup><span class='sidenote' id='footnote5'><a name='footnote5' href='#footnote5'>5.</a> For our purposes, having a randomized circuit that $\eps$-computes $f^{\ox k}$ is essentially equivalent to having a randomized circuit that $\eps$-computes $f^k$. The proofs will extend to randomized circuits, where we say $C$ $\eps$-computes $f$ if $\Pr_{C, x}[C(x) = f(x)] \geq \eps$, taken over randomness of $C$ as well as $x$. </span> A circuit $C'(B)$ $\epsilon$-computes $f^k$ if \[\Pr_{B \sim \text{random $k$-multiset}}[C'(B) = f^k(B)] \geq \epsilon\] Note that $C'$ is allowed to give different answers for the same element in a multiset, e.g. if $C'(\{a, a, a\}) = (y_1, y_2, y_3)$, the $y_i$s may all be distinct -- we don't take advantage of this symmetry.
<p>
<h2 class='tex'>3. Oracle Version </h2> Here we present and prove a simpler version of the algorithm, in the case when we also have access to an oracle for $f$. (This can be seen as a non-uniform version).
<blockquote><b>Theorem 2</b> <em> <a name="thmoracle"></a> Let $k \in \N, \eps > 0$, and $f:\{0,1\}^n \to \{0, 1\}^\ell$. There is a PPT algorithm $\A^f$ with oracle access to $f$, with the following guarantees:
<ul> <li> <b>Input:</b> A circuit $C'$ that $\eps$-computes $f^k$. <li> <b>Output:</b> With probability $0.99$, output a circuit $C$ that $(1-\delta)$-computes $f$.
</ul>
for $\delta = O(\log(k)/(\eps k))$. The circuit $C$ is of size $|C'|\poly(n, k, \log(1/\delta), 1/\eps)$. </em></blockquote>
<p>
<p>
The idea is, in Step 2 of Algorithm $\A$, we can generate the correct values $v$ for the inputs in set $A$, by querying the oracle. That is, we set $v := f(A)$ directly, instead of using our approximate circuit $C'$. In fact, if we have a perfect oracle for $f$ we can simplify the algorithm even further.
<p>
The algorithm is: <div class='framed'> $\mathcal{A}^f(C')$:
<p>
<ol> <li> Pick $T = O(\log(k)/\epsilon)$ random $(k-1)$-multisets $A_1, \dots A_T$, each $A_i$ containing $(k-1)$ random inputs from $\{0, 1\}^n$. <li> Query the $f$-oracle, and record the values of $v_{A_i} := \{f(x): x \in A_i\}$ for all sets $A_i$. <li> Output the circuit $C_{A, v}$ defined below (with the values $v_{A_i}$ on the subsets $A_i$ hardcoded).
</ol>
</div>
<p><p>
$C_{A, v}$ is defined as the circuit:<br/>
<div class='framed'> $C_{A, v}(x)$:
<p>
For each $i = 1 \dots T = O(\log(k)/\epsilon)$:
<ol> <li> Let $B_i := \{x\} \cup A_i$. <li> Evaluate $C'(B_i)$. <li> If the answers of $C'(B_i)$ restricted to $A_i$ agree with the hardcoded values $v_{A_i} = f(A_i)$, then output $C'(B_i)|_x$, (the answer $C'$ gave for $x$), and stop.
</ol>
Output an error if no output is produced after $T$ iterations. </div>
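To make the oracle algorithm concrete, here is a toy simulation of $\A^f$ and $C_{A, v}$. Everything below is invented for illustration: the universe is the integers $0,\ldots,63$, $f$ is an arbitrary 8-bit function, and the faulty circuit $C'$ is simulated by a deterministic hash, making it entirely correct on roughly an $\eps$-fraction of multisets and garbage elsewhere.

```python
import random
import zlib

random.seed(0)

N, k, eps, T = 64, 10, 0.3, 60           # toy parameters (inputs are ints 0..N-1)
f = lambda x: (37 * x + 11) % 256        # stand-in for the hard function (8-bit output)

def C_prime(B):
    """Simulated faulty direct-product circuit on a sorted k-multiset B:
    on ~eps-fraction of multisets (chosen by a deterministic hash) every
    answer is correct; otherwise every coordinate is garbage."""
    h = zlib.crc32(str(B).encode())
    if h % 1000 < eps * 1000:
        return [f(x) for x in B]
    g = random.Random(h ^ 0xDEAD)        # deterministic garbage per multiset
    return [g.randrange(256) for _ in B]

# Algorithm A^f: hardcode T random (k-1)-multisets together with oracle values.
hardcoded = []
for _ in range(T):
    A = tuple(sorted(random.randrange(N) for _ in range(k - 1)))
    hardcoded.append((A, [f(a) for a in A]))

def C_decoded(x):
    """The circuit C_{A,v}: accept C'(B_i)|_x only if C' agrees with the
    hardcoded oracle values on all of A_i."""
    for A, v in hardcoded:
        B = tuple(sorted(A + (x,)))
        ans = C_prime(B)
        pos = B.index(x)                 # one copy of x inside the sorted multiset
        if ans[:pos] + ans[pos + 1:] == v:
            return ans[pos]
    return None                          # no output produced after T iterations

acc = sum(C_decoded(x) == f(x) for x in range(N)) / N
```

With these (arbitrary) parameters the decoded circuit recovers $f$ on essentially all inputs; shrinking $T$ or $\eps$ degrades the success rate, in line with the analysis below.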
<p><p>
<em>Proof of Theorem <a href="#thmoracle">2</a>:</em>
<p>
<b>Parameters:</b> We will have $\delta = 10000\log(k)/(\epsilon k)$ and $T = 100 \log(k)/ \epsilon$. (Think of aiming for $\delta \approx 1/k$).
<p>
We will argue that \begin{equation} \label{eqn:main} \Pr_{\A, C, x}[C_{A, v}(x) \neq f(x)] \leq \delta / 100 \end{equation} where the probability is over the randomness of algorithm $\A^f$ (the random choice of sets $A_i$), and a random input $x \in \{0, 1\}^n$. Then, by Markov's inequality, \[\Pr_{\A}\left[ \Pr_{C, x}[C_{A, v}(x) \neq f(x)] > \delta \right] \leq 1/100,\] so the algorithm $\A^f$ will produce a good circuit $C_{A, v}$ except with probability $1/100$.
<p>
In the execution of circuit $C_{A, v}(x)$, let us say “iteration $i$ fails” if Step 3 of the circuit at iteration $i$ outputs a wrong answer. That is, iteration $i$ fails if $C'(B_i)$ is correct on the $(k-1)$ values in $A_i = B_i \setminus \{x\}$, but wrong on $x$.
<p>
Consider the probability that iteration 1 fails. Notice that the distribution of $(x, A_1, B_1)$ is equivalently generated as:
<p>
<table align = center><tr><td align=center> $\{(x, A_1, B_1)\}$ </td><td align=center> $\equiv$ </td><td align=center> $\{(x, A_1, B_1)\}$ </td></tr><tr><td align=center> $A_1 \sim$ random $(k-1)$-multiset </td><td align=center> </td><td align=center> $B_1 \sim$ random $k$-multiset </td></tr><tr><td align=center> $x \in \{0, 1\}^n$ </td><td align=center> </td><td align=center> $x \in B_1$</td></tr><tr><td align=center> $B_1 := \{x\} \cup A_1$ </td><td align=center> </td><td align=center> $A_1 := B_1 \setminus \{x\}$ </td></tr></table>
<p>
That is, we can think of first sampling a random $k$-multiset $B_1$, then sampling a random $x \in B_1$. Iteration 1 only returns an output when $C'(B_1)$ has at most $1$ wrong answer (since it checks correctness on the $(k-1)$ values of $A_1$). Thus iteration 1 fails only if the random $x \in B_1$ falls on this single (out of $k$) possibly-wrong answer. So \begin{equation} \Pr_{x, A_1, B_1}[~\text{Iteration 1 fails}~] \leq \frac{1}{k} \end{equation}
<p>
Now, we just union bound: \begin{align*} \Pr[\text{error}] &= \Pr_{\A, C, x}[C_{A, v}(x) \neq f(x)]\\ &\leq \Pr[\text{no output produced after $T$ iterations, or some iteration fails}]\\ &\leq \Pr[\text{no output produced}] + T \cdot \Pr[\text{Iteration 1 fails}]\\ &\leq \Pr[\text{no output produced}] + \frac{T}{k} \end{align*}
<p>
For our choice of $T, \delta$, the second term is $\frac{T}{k} \leq \delta / 200$. We will show the first term is $\leq \delta / 200$ as well, completing the proof.
<p>
<b>Produces output w.h.p.</b>
<p>
It remains to show that the circuit $C_{A, v}$ produces an output with high probability. In Step 3 of the circuit $C_{A, v}$, notice that if $C'$ is queried on an input $B_i$ on which it is correct, then it will pass the test and output a value.
<p>
The idea is: since $C'$ is correct on $\epsilon$-fraction of inputs, if we try $T = \Omega(\log(1/\delta)/\epsilon)$ iid random inputs, we will be sure to hit a correct input, except with probability $O(\delta)$. This doesn't quite work, since the inputs $B_i$ are not iid random (they all contain the input $x$) -- but this dependence is minimal, so it still works out.
<p>
Following [<a href='#ref-IJKW10'>IJKW10</a>], it helps to think in term of this bipartite graph. Define $G$ as a biregular bipartite graph between inputs $x \in \{0, 1\}^n$, and $k$-<b>tuples</b><sup class='footnotemark'><a href='#footnote6' onclick="toggle_display_nojump('footnote6');">6</a></sup><span class='sidenote' id='footnote6'><a name='footnote6' href='#footnote6'>6.</a> Going back to tuples just to simplify the notation, so we can deal with the uniform measure. </span> $B \in (\{0, 1\}^n)^k$, with an edge $(x, B)$ if $x \in B$. We can think of the circuit $C_{A, v}(x)$ as picking up to $T$ random neighbors of $x$ in the graph $G$, until hitting an input $B$ where $C'(B)$ is correct on all $B \setminus \{x\}$. We know that $\epsilon$-fraction of $k$-tuples $B$ are correct, and in fact we will show that almost all inputs $x$ have close to $\eps$-fraction of their neighbors as correct.
<p>
<p align=center><img width=350 src="/sources/uniform-DP/graph_color.png"></p>
<p>
<blockquote><b>Lemma 3</b> <em> <a name="lemnotbad"></a> There are at most $O(\delta)$-fraction of “$\bad$” inputs $x \in \{0, 1\}^n$ for which \[\Pr_{B \in N(x)}[C'(B) \text{ is correct}] \leq \epsilon/10\] </em></blockquote>
<p>
This is sufficient to show that $\Pr[\text{no output produced}] \leq O(\delta)$, since for inputs $x$ that are not $\bad$, sampling $T = \Omega(\log(k)/\eps)$ iid neighbors of $x$ will hit a correct neighbor, except with probability $O(1/k) \leq O(\delta)$. <sup class='footnotemark'><a href='#footnote7' onclick="toggle_display_nojump('footnote7');">7</a></sup><span class='sidenote' id='footnote7'><a name='footnote7' href='#footnote7'>7.</a> $(1-\eps/10)^{T} \leq e^{-T\eps / 10} \leq 1/k \leq \delta$. </span>
<p>
It is easier to show the related property:
<blockquote><b>Lemma 4 (Mixing Lemma)</b> <em> <a name="lemmixing"></a> Let $H \subseteq \{0, 1\}^n$ be a set of inputs on the left of $G$, with density $\mu$. Then all but a $2e^{-\Omega(\mu k)}$-fraction of the tuples $B$ on the right of $G$ satisfy \[\Pr_{x \in N(B)}[x \in H] = \mu \pm \mu/2\] </em></blockquote>
<p>
<em>Proof of Lemma <a href="#lemmixing">4</a>:</em> Drawing a uniformly random tuple $B$ on the right is exactly drawing $k$ iid samples of inputs $B := (x_1, x_2, \dots, x_k)$. Then, by definition of $G$, picking a random neighbor $x \in N(B)$ is just picking a random $x \in B$. Thus, it is sufficient to show that if we draw $k$ iid inputs $x_1, x_2, \dots, x_k$, the fraction of inputs that fall in $H$ is within a multiplicative factor $(1 \pm 1/2)$ of its expectation $\mu$ (with high probability). This follows immediately from Chernoff bounds. $$\tag*{$\blacksquare$}$$
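The Chernoff step can be sanity-checked numerically; the density $\mu$, the arity $k$, and the number of trials below are arbitrary illustrative choices:

```python
import random

random.seed(1)
mu, k, trials = 0.2, 200, 2000

# Fraction of k-iid-sample "tuples" whose empirical density of H deviates
# from mu by more than a multiplicative factor of 1/2.
bad = 0
for _ in range(trials):
    hits = sum(random.random() < mu for _ in range(k))
    if abs(hits / k - mu) > mu / 2:
        bad += 1
assert bad / trials < 0.02  # exponentially small in mu*k, as the lemma claims
```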
<p>
From this, the above Lemma <a href="#lemnotbad">3</a> follows easily:
<p>
<em>Proof of Lemma <a href="#lemnotbad">3</a>:</em> Let $\bad$ be the set of “bad” inputs $x$, where $\Pr_{B \in N(x)}[C'(B) \text{ is correct}] \leq \epsilon/10$. Suppose the density of $\bad$ is $\mu$. Let us count the fraction of total edges in $G$ that go between $\bad$ and the set of correct tuples (which we call $\good$). By the mixing lemma, there is at least an $(\epsilon - 2e^{-\Omega(\mu k)})$-fraction of correct tuples $B^*$ with $\Pr_{x \in N(B^*)}[\text{$x$ is bad}] \geq \mu / 2$. So at least an $(\epsilon - 2e^{-\Omega(\mu k)}) (\mu/2)$-fraction of the edges go between the $\bad$ and $\good$ sets.
<p>
But, each bad input $x$ has at most $\eps/10$ fraction of edges into $\good$ by definition, so the fraction of $\bad \leftrightarrow \good$ edges is at most $\mu (\eps/10)$.
<p>
Thus we must have \begin{align*} (\epsilon - 2e^{-\Omega(\mu k)}) (\mu/2) &\leq \mu (\eps/10)\\ \implies \mu &\leq O(\log(1/\eps)/k) \end{align*} This gives $\mu \leq \delta/200$ for our choice of $\delta$. $$\tag*{$\blacksquare$}$$
<p>
This concludes the proof of correctness of the oracle version (Theorem <a href="#thmoracle">2</a>). $$\tag*{$\blacksquare$}$$
<p>
<h2 class='tex'>4. Closing Remarks </h2>
<p>
<ul> <li> Note that in the oracle version, we were able to output a good circuit with probability $0.99$, instead of w.p. $\Theta(\eps)$ as in the fully uniform version. This makes sense because if we have an $f$-oracle, we can “check” if our circuit is actually computing the desired $f$, so we don't run into the unique decoding problem. (Indeed, we can construct an optimal version of algorithm $\A^f$ of Theorem <a href="#thmoracle">2</a> from the algorithm $\A$ of Theorem <a href="#thmuniformDP">1</a> in a black-box way, by checking if the output circuit of $\A$ mostly agrees with $f$ on enough random inputs).
<p>
<li> There were several simplifications we made from $\A$ to $\A^f$.<br/>
(1) We queried the oracle for the hardcoded values $v$, instead of the circuit.<br/>
(2) We hardcoded $(k-1)$-multisets instead of $(k/2)$-multisets.<br/>
(3) We hardcoded $T$ iid multisets $\{A_i\}$, instead of just one multiset $A$.<br/>
Note that we could not have done (2) without also doing (3) -- otherwise there would not have been enough mixing (the circuit would fail with probability close to $\eps$). Also, (3) would not have worked in the fully uniform case ($\A$, without the oracle) -- because then all the hardcoded sets would be simultaneously correct only with very small probability.
<p>
<li> The reason Theorem <a href="#thmoracle">2</a> has suboptimal parameters (eg, compare the setting of $\delta$ to Theorem <a href="#thmuniformDP">1</a>) is because our analysis used the loose union bound, instead of using the fact that circuit $C_{A, v}$, by only outputting values that pass a test, is doing rejection-sampling on a certain conditional probability space. The tight analysis in [<a href='#ref-IJKW10'>IJKW10</a>] takes advantage of this fact.
<p>
<li> In the proof of Theorem <a href="#thmoracle">2</a>, we used a property of the graph $G$ that was essentially like an “Expander Mixing Lemma”. We may hope that if we replace $G$ with something sufficiently expander-like, we could get a derandomized direct-product theorem. Indeed, something like this is done in [<a href='#ref-IJKW10'>IJKW10</a>] (“Uniform direct product theorems: simplified, optimized, and <i>derandomized</i>”).
<p>
<li> I think the oracle version is sufficient for the applications in [<a href='#ref-CIKK16'>CIKK16</a>], since there we have query access to the function $f$ we are trying to learn/compress.
<p>
<li> For a good survey on direct-product theorems for non-uniform hardness amplification, and the related “Yao's XOR Lemma”, see [<a href='#ref-GNW11'>GNW11</a>] (which includes at least 3 different proofs of the non-uniform XOR lemma). For a clean proof of Impagliazzo's Hardcore Set theorem, which is used in some proofs of the XOR lemma, see for example Arora-Barak.
<p>
</ul>
<p>
<br><hr><h3>References</h3>
<p>
<a name='ref-CIKK16'>[CIKK16]</a> Marco L. Carmosino, Russell Impagliazzo, Valentine Kabanets, and Antonina
Kolokolova.
Learning algorithms from natural proofs.
In <em>LIPIcs-Leibniz International Proceedings in Informatics</em>,
volume 50. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
URL:
<a href="http://drops.dagstuhl.de/opus/volltexte/2016/5855/pdf/34.pdf">http://drops.dagstuhl.de/opus/volltexte/2016/5855/pdf/34.pdf</a>.
<p>
<p>
<a name='ref-GNW11'>[GNW11]</a> Oded Goldreich, Noam Nisan, and Avi Wigderson.
On Yao's XOR-Lemma.
In <em>Studies in Complexity and Cryptography. Miscellanea on the
Interplay between Randomness and Computation</em>, pages 273--301. Springer,
2011.
URL: <a href="http://www.wisdom.weizmann.ac.il/~oded/COL/yao.pdf">http://www.wisdom.weizmann.ac.il/~oded/COL/yao.pdf</a>.
<p>
<p>
<a name='ref-IJKW10'>[IJKW10]</a> Russell Impagliazzo, Ragesh Jaiswal, Valentine Kabanets, and Avi Wigderson.
Uniform direct product theorems: simplified, optimized, and
derandomized.
<em>SIAM Journal on Computing</em>, 39(4):1637--1665, 2010.
URL: <a href="http://www.cs.columbia.edu/~rjaiswal/IJKW-Full.pdf">http://www.cs.columbia.edu/~rjaiswal/IJKW-Full.pdf</a>.
<p>
New Theory BlogPreetum Nakkiran
2016-08-13T00:00:00+00:00
http://learningwitherrors.org/2016/08/13/first-post<p>We’re starting a theory student blog! The idea is, this is a collaborative blog
about theoretical computer science, where people can post about interesting
things they’re learning / have learnt. The goal is to help everyone learn from
each other, and also have a forum for student discussion.</p>
<p>Hopefully this will help make TCS concepts more accessible: Sometimes reading
the original paper is not the best / most efficient way to learn something, and
there are several perspectives or proofs of the same thing that are not
explicitly written down in the literature, but are known in the community. We
hope this blog will be a way to share this knowledge among theory students.</p>
<p>One important aspect is, we want this to feel like an informal place to learn
and discuss – think more like chalk talks than STOC talks. It’s fine to have
rough calculations, sketched figures, etc – the emphasis is on explaining
things nicely. And we want to encourage asking clarifying questions and
discussing in the comments.</p>
<h2 id="on-posts">On posts:</h2>
<ul>
<li>
<p>Posts can be about anything technical that other students may find
interesting or learn from. Anything from current research to classical
results.</p>
</li>
<li>
<p>The length and thoroughness can vary, anything from “survey of this field” to
“summary of cool paper” to “interesting technical lemma” to “something cool I
learnt this week”, etc.</p>
</li>
<li>
<p>You don’t need to be an expert on the topic to write about it (as the name
suggests, there may be some errors, but hopefully also some learning).</p>
</li>
<li>
<p>The aim is to convey interesting or useful techniques and intuition (e.g. try
not to just announce a new result without explaining the ideas behind it)</p>
</li>
</ul>
<h2 id="contributing">Contributing:</h2>
<p>Everyone is welcome (and encouraged!) to contribute – including
non-students, and generally anyone interested in TCS.</p>
<p>The easiest way is to
simply write a LaTeX or Markdown document, and email it to me (preetum [at]
berkeley).
There is also a <a href="/contributing/">harder way</a>.</p>
<p>Ideally, both readers and writers would get something out of this blog.
(Personally, I like to present topics to make sure I understand them
fully. And of course, we can have an interesting discussion about it.)</p>
<!--(In theory, the entire source code of this blog is public on Github, so you
can author a new post by compiling and pushing the appropriate files in the
appropriate places. In practice it's rather messy, but details are here).-->
<h2 id="comments-and-subreddit">Comments and Subreddit:</h2>
<p>We have comments below each post, which we encourage
people to use to discuss the post.</p>
<p>We also have the subreddit <a href="https://www.reddit.com/r/LWE">r/LWE</a>,
which we hope can be used as a more general
forum among theory students. Feel free to use this for both blog-related things
and general theory questions. Let’s see how this works.</p>
<h2 id="initial-posts">Initial Posts:</h2>
<p>We’re launching with posts on:</p>
<ul>
<li>
<p><a href="/2016/06/23/intro-sos/">Intro to the Sum-of-Squares Hierarchy</a> <br />
by Tselil Schramm.</p>
</li>
<li>
<p><a href="/2016/08/12/pseudocalibration-for-planted-clique-sos/">Pseudo-calibration for Planted Clique Sum-of-Squares Lower Bounds</a> <br />
by Pasin Manurangsi.</p>
</li>
<li>
<p><a href="/2016/07/06/deterministic-sparsification/">Deterministic Sparsification</a> <br />
by Chenyang Yuan.</p>
</li>
<li>
<p><a href="/2016/06/03/small-bias/">Simple Lower Bounds for Small-bias Spaces</a> and
<a href="/2016/05/27/fast-johnson-lindenstrauss/">Fast Johnson-Lindenstrauss</a> <br />
by Preetum Nakkiran.</p>
</li>
</ul>
<p>Thanks especially to the above people (and all future authors) for contributing.</p>
<h2 id="conclusion-and-open-questions">Conclusion and Open Questions</h2>
<p>When conceiving this blog, we had some other
ideas for things that should exist, such as a set of collaboratively-edited
pages on “How to best learn topic X”. Are people interested in contributing to
something like this? In general, any suggestions for things you would like to
see (regarding this blog, or otherwise)?</p>
<p>Feel free to use the comments section below.</p>
Pseudo-calibration for Planted Clique Sum-of-Squares Lower BoundPasin Manurangsi
2016-08-12T00:00:00+00:00
http://learningwitherrors.org/2016/08/12/pseudocalibration-for-planted-clique-sos<script type="text/javascript">
// javascript for toggling sidenotes
function toggle_display_nojump(id) {
event.preventDefault();
var e = document.getElementById(id);
if(e.style.display == 'block')
e.style.display = 'none';
else
e.style.display = 'block';
return false; // prevent default action of jumping to anchor
}
</script>
<div style='display:none;'><script type='math/tex'> \DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\poly}{poly}
\DeclareMathOperator*{\polylog}{polylog}
\DeclareMathOperator*{\polyloglog}{polyloglog}
\DeclareMathOperator*{\supp}{supp}
\DeclareMathOperator{\sos}{sos}
\DeclareMathOperator*{\opt}{OPT}
\DeclareMathOperator{\cli}{CLIQUE}
\newcommand{\cQ}{\mathcal{Q}}
\newcommand{\cG}{\mathcal{G}}
\newcommand{\cW}{\mathcal{W}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cV}{\mathcal{V}}
\newcommand{\cX}{\mathcal{X}}
\newcommand{\AM}{\mathbf{AM}}
\newcommand{\x}{\times}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\newcommand{\tE}{\tilde{\E}}
\newcommand{\f}{\frac}
\newcommand{\reg}{\text{reg}}
\newcommand{\greedy}{\text{greedy}}
\renewcommand{\bar}{\overline}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\nullset}{\emptyset}
\newcommand{\set}[1]{\{#1\}}
\newcommand{\Mod}[1]{\ (\text{mod}\ #1)}
\newcommand{\DYES}{\mathcal{D}_{YES}}
\newcommand{\DNO}{\mathcal{D}_{NO}}
\newcommand{\beq}{\begin{equation}}
\newcommand{\eeq}{\end{equation}}
\newcommand{\qed}{\mbox{}\hspace*{\fill}\nolinebreak$\square$}
\newcommand{\st}{~\text{s.t.}~}
\let\Oldforall\forall
\renewcommand{\forall}{~\Oldforall} % add some space before foralls
\renewcommand{\arraystretch}{1.5} % add more space in table
\let\Oldsum\sum
\let\Oldinf\inf
\renewcommand{\inf}{\Oldinf\limits}
\let\Oldsup\sup
\renewcommand{\sup}{\Oldsup\limits}
\newcommand{\blank}[1]{}
</script></div>
<p>
<p>
Recently, Barak, Hopkins, Kelner, Kothari, Moitra and Potechin [<a href='#ref-BHKKMP16'>BHKKMP16</a>] proved an essentially tight Sum-of-Squares lower bound for the <em>planted clique</em> problem. Their result can be divided into two main parts: coming up with the <em>pseudo-distribution</em>, and proving positivity of that pseudo-distribution. In this short blog post, we summarize the first part of the paper, which provides a general, systematic way to come up with pseudo-distributions for problems other than planted clique, without going into the details of the proof. We do not touch here on the second part, which is more technically involved, but we hope to do so in future posts.
<!--more-->
<p>
<h2 class='tex'>1. SoS Lower Bounds and the Planted Clique Problem </h2>
<p>
In this section, we provide some background for readers unfamiliar with proving Sum-of-Squares lower bounds and the planted clique problem. Readers familiar with the topic can skip this section. For SoS, we use the notation from <a href="http://learningwitherrors.org/2016/06/23/intro-sos/">Tselil's post</a> on the Sum-of-Squares hierarchy, which is also a good starting point for those unfamiliar with the SoS hierarchy.
<p>
In this post, we do not need the optimization version of the SoS hierarchy; the feasibility version suffices. Recall that, given a polynomial feasibility problem of the form \[Q = \left\{x \in \mathbb{R}^n : \forall i \in [m], g_i(x) = 0\right\},\] the degree-$2d$ Sum-of-Squares relaxation of the problem, which can be solved in $n^{O(d)}$ time, is<sup class='footnotemark'><a href='#footnote1' onclick="toggle_display_nojump('footnote1');">1</a></sup><span class='sidenote' id='footnote1'><a name='footnote1' href='#footnote1'>1.</a> Note that, in Tselil's blog, the positivity condition is written as the pseudo-moment matrix being positive semidefinite, but it is not hard to see that this is the same as requiring that $\tE[q^2] \geq 0$ for every $q$ with $\deg(q) \leq d$. </span> \begin{align} \label{eq:sos} \sos_d(Q) = \left\{\tE : \begin{array}{lr} {\tE}: \{q: \deg(q) \leq 2d\} \rightarrow \mathbb{R} \text{ is a linear operator with } \tE[1] = 1, \\ \forall q \text{ with } \deg(q) \leq d, \tE[q^2] \geq 0, \\ \forall i \in [m] \forall q \text{ with } \deg(q) \leq 2d - \deg(g_i), \tE[g_i q] = 0. \end{array} \right\}. \end{align}
<p>
Roughly speaking, if we want to show that degree-$2d$ SoS fails to certify that a polynomial feasibility problem $Q$ is infeasible, we need to come up with a degree-$2d$ pseudo-distribution $\tE$ that satisfies the conditions in (\ref{eq:sos}). For concreteness, let us consider the <em>planted clique</em> problem defined as follows.
<p>
<blockquote><b>Definition 1 (Planted Clique$(n, k)$)</b> <em> Given as an input a graph $G = (V, E)$ drawn from one of the two following distributions (each with probability $1/2$):
<ol> <li> $\cG(n, 1/2)$: the Erdős-Rényi random graph on $n$ vertices, where each edge is included with probability 1/2, <li> $\cG(n, 1/2, k)$: the planted distribution, in which a graph $G$ is first drawn from $\cG(n, 1/2)$; then $k$ vertices of $G$ are chosen uniformly at random and an edge between each pair of chosen vertices is added to $G$.
</ol>
The goal is to determine, with correctness probability $1/2 + \varepsilon$ for some constant $\varepsilon > 0$, which distribution $G$ is drawn from. </em></blockquote>
<p>
<p>
In this post, we always restrict ourselves to the case $k \gg \log n$, so that the maximum clique sizes in the two cases differ. Since the largest clique in $\cG(n, 1/2)$ is of size $O(\log n)$ with high probability, brute-force search solves the planted clique problem with high probability in $n^{O(\log n)}$ time. On the other hand, the best known polynomial-time algorithm works only when $k = \Omega(\sqrt{n})$ [<a href='#ref-AKS98'>AKS98</a>]. A natural question is of course whether the SoS Hierarchy can do any better than this.
<p>
The most widely-used formulation of planted clique in terms of polynomial feasibility, and the one used in [<a href='#ref-BHKKMP16'>BHKKMP16</a>], is to formulate it as “does $G$ have a clique of size $k$?”. For convenience, let $V = [n] = \{1, \dots, n\}$. This formulation can be written as follows.
<p>
\begin{align*} \cli_k(G) = \left\{x \in \mathbb{R}^n : \begin{array}{lr} \forall i \in [n], x_i^2 = x_i, \\ \forall (i, j) \notin E, x_ix_j = 0, \\ \sum_{i \in [n]} x_i = k \end{array} \right\} \end{align*}
<p>
When the constraints are satisfied, $x_i$ is simply a boolean indicator of whether $i$ is included in the clique. If we could solve $\cli_k(G)$ in polynomial time, we would be done: $G \sim \cG(n, 1/2, k)$ always has a clique of size $k$, whereas the maximum clique of $G \sim \cG(n, 1/2)$ is of size $O(\log n)$ w.h.p. Thus, there is always a solution in $\cli_k(G)$ for $G \sim \cG(n, 1/2, k)$ but, w.h.p., there is no feasible solution for $G \sim \cG(n, 1/2)$. Of course, solving $\cli_k(G)$ is NP-hard, so we relax it to degree-$2d$ SoS, which we can solve in $n^{O(d)}$ time.
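<p>
To make the equivalence concrete: solving $\cli_k(G)$ exactly is just the $k$-clique decision problem. The brute-force sketch below is our own illustration (not from the paper); it enumerates the possible supports of $x$, and the constraint $x_ix_j = 0$ for non-edges forces the support to be a clique.

```python
from itertools import combinations

def clique_k_feasible(n, edges, k):
    """Brute-force feasibility of CLIQUE_k(G): is there an x with
    x_i in {0,1}, x_i x_j = 0 for all non-edges, and sum_i x_i = k?
    Equivalent to: does G contain a k-clique?"""
    edge_set = {frozenset(e) for e in edges}
    for support in combinations(range(n), k):  # candidate supports of x
        # x_i x_j = 0 for (i,j) not in E forces the support to be a clique
        if all(frozenset(p) in edge_set for p in combinations(support, 2)):
            return True
    return False

# a triangle contains a 3-clique; a path on 3 vertices does not
print(clique_k_feasible(3, [(0, 1), (1, 2), (0, 2)], 3))  # True
print(clique_k_feasible(3, [(0, 1), (1, 2)], 3))          # False
```

This search takes ${n \choose k}$ time, of course; the point of the relaxation is to replace it with a convex program.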
<p>
Again, when $G \sim \cG(n, 1/2, k)$, $\sos_d(\cli_k(G))$ remains feasible. If we want to tell which distribution $G$ is drawn from by looking only at whether $\sos_d(\cli_k(G))$ is feasible, we need that, when $G \sim \cG(n, 1/2)$, $\sos_d(\cli_k(G))$ is infeasible with probability at least $\varepsilon$. The main result of [<a href='#ref-BHKKMP16'>BHKKMP16</a>] is that this is impossible. In particular, they show the following:
<p>
<blockquote><b>Theorem 2</b> <em> <a name="thmmain-clique"></a> For every $d \ll \log n$, when $k \leq n^{1/2 - O(\sqrt{d/\log n})}$ and $G$ is drawn from $\cG(n, 1/2)$, $\sos_d(\cli_k(G))$ is feasible with high probability. </em></blockquote>
<p>
<p>
In other words, Barak et al.'s result says that the SoS approach to planted clique is no better (up to the $O(\sqrt{d/\log n})$ factor in the exponent) than the known algorithm from [<a href='#ref-AKS98'>AKS98</a>].
<p>
From how $\sos_d(\cli_k(G))$ is defined, proving Theorem <a href="#thmmain-clique">2</a> boils down to finding a linear operator ${\tE}_G: \{q: \deg(q) \leq 2d\} \rightarrow \mathbb{R}$ for each graph $G$ such that, if $G = ([n], E)$ is drawn from $\cG(n, 1/2)$, the following conditions are satisfied with high probability:
<ol> <li> $\tE_G[1] = 1$, <li> $\forall i \in [n] \forall q$ with $ \deg(q) \leq 2d - 2, \tE_G[x_i^2q] = \tE_G[x_iq]$, <li> $\forall (i, j) \notin E \forall q$ with $\deg(q) \leq 2d - 2,\tE_G[x_ix_jq] = 0$, <li> $\tE_G[\sum_{i \in [n]} x_i] = k$, <li> $\forall q$ with $\deg(q) \leq d, \tE_G[q^2] \geq 0$.
</ol>
<p>
<h2 class='tex'>2. Pseudo-calibration for Planted Clique </h2>
<p>
Coming up with a degree-$2d$ pseudo-distribution $\tE_G$ with the desired properties stated in the previous section is particularly hard for planted clique, and past attempts often involved ad-hoc fixes that prevented them from getting a tight bound for large $d$. This is where Barak et al.'s so-called <em>pseudo-calibration</em> method, a systematic way to derive $\tE_G$, comes in. Since the method is more of an intuitive heuristic than a provable approach, we will be informal here. We also note that the explanation given here differs somewhat from that in [<a href='#ref-BHKKMP16'>BHKKMP16</a>], and readers should consult the full paper for a more thorough view of pseudo-calibration.
<p>
Let us take a step back and think about our algorithm for planted clique for a moment. Given $G$, we try to solve $\sos_d(\cli_k(G))$. If it is infeasible, then we know for certain that $G$ is drawn from $\cG(n, 1/2)$. Otherwise, we do not seem to gain anything. However, this is not entirely true; we actually get back $\tE_G$. One thing we can do here is to pick $f_G$ (which can depend on $G$) of degree (with respect to $x$) at most $2d$ as a test function and ask for $\tE_G[f_G]$. If the distributions of $\tE_G[f_G]$ under $G \sim \cG(n, 1/2)$ and $G \sim \cG(n, 1/2, k)$ are “very different”<sup class='footnotemark'><a href='#footnote2' onclick="toggle_display_nojump('footnote2');">2</a></sup><span class='sidenote' id='footnote2'><a name='footnote2' href='#footnote2'>2.</a> In other words, they are distinguishable in polynomial time. </span>, then we can tell apart graphs drawn from the two distributions by just looking at $\tE_G[f_G]$. Hence, not only must $\sos_d(\cli_k(G))$ be feasible with high probability when $G \sim \cG(n, 1/2)$, but the distributions of $\tE_G[f_G]$ when $G \sim \cG(n, 1/2)$ and when $G \sim \cG(n, 1/2, k)$ must also be indistinguishable in polynomial time for every test function $f_G$. An implication of this is that the expectations of $\tE_G[f_G]$ over the two distributions are roughly equal, i.e., \begin{align*} \E_{G \sim \cG(n, 1/2)} \tE_G[f_G] \approx \E_{G \sim \cG(n, 1/2, k)} \tE_G[f_G]. \end{align*}
<p>
We of course do not know what $\tE_G$ is even when $G$ is drawn from $\cG(n, 1/2, k)$, so the above equality does not tell us much yet. But recall that $\tE_G$ is our fake solution and we want it to resemble the actual solution as much as possible. Hence, a reasonable heuristic here is to try to make $\E_{G \sim \cG(n, 1/2, k)} \tE_G[f_G]$ roughly equal to $\E_{G \sim \cG(n, 1/2, k)} f_G(x_G)$ where $x_G$ denotes the actual solution, i.e., the indicator vector of the maximum clique in $G$.
<p>
For convenience, let us write $(G, x) \sim \cG(n, 1/2, k)$ to denote $G$ drawn from $\cG(n, 1/2, k)$ with $x$ the indicator vector of the planted $k$-clique. Under this notation, the aforementioned condition can be written as \begin{align*} \E_{G \sim \cG(n, 1/2, k)} \tE_G[f_G] \approx \E_{(G, x) \sim \cG(n, 1/2, k)} f_G(x). \end{align*}
<p>
Combining the above two equations, we get \begin{align} \label{eq:calib} \E_{G \sim \cG(n, 1/2)} \tE_G[f_G] \approx \E_{(G, x) \sim \cG(n, 1/2, k)} f_G(x). \end{align}
<p>
Condition (\ref{eq:calib}) is what Barak et al. call <em>pseudo-calibration</em><sup class='footnotemark'><a href='#footnote3' onclick="toggle_display_nojump('footnote3');">3</a></sup><span class='sidenote' id='footnote3'><a name='footnote3' href='#footnote3'>3.</a> In [<a href='#ref-BHKKMP16'>BHKKMP16</a>], the pseudo-calibration condition is in fact slightly stronger than stated here; equality is required instead of approximate equality. However, this does not matter, since there will be approximations in subsequent calculations anyway. </span>. As noted in the paper, this condition is quite strong. For example, for fixed $i, j \in [n]$ and $q$ with $\deg(q) \leq 2d - 2$, if we define $f$ as \begin{align*} f_G(x) = \begin{cases} 0 & \text{ if } (i, j) \in E, \\ x_ix_jq(x) & \text{ otherwise}, \end{cases} \end{align*} then $f_G(x)$ is always zero on the right hand side. Hence, $\E_{G \sim \cG(n, 1/2)} \tE_G[f_G] \approx 0$. If we assume that $\tE_G[f_G]$ is non-negative, then Condition 3 at the end of the previous section is almost immediately satisfied. In fact, as we will see next, Condition (\ref{eq:calib}) almost fully determines $\tE_G$ for every $G$.
<p>
<h3 class='tex'>2.1. From Pseudo-Calibration to Pseudo-Distribution</h3>
<p>
We will now see how to arrive at $\tE_G$ from the pseudo-calibration condition. As stated earlier, the condition is quite strong; in fact, it is so strong that it cannot hold for every $f_G$. For instance, we can pick $f_G$ to simply be the indicator function of whether $G$ has a clique of size $k$. By doing so, the left hand side of (\ref{eq:calib}) is approximately zero whereas the right hand side is one. However, we are “cheating” by picking such an $f_G$, because we do not even know how to compute this test function in polynomial time! Hence, roughly speaking, we need to restrict $f_G$ to only those that are not more “powerful” than the SoS relaxation itself.
<p>
To state the exact condition we enforce on $f_G$, let us think of $f_G(x)$ as a function $f(G, x)$ of both $G$ and $x$ where the graph $G$ is encoded naturally as a string in $\{\pm 1\}^{[n] \choose 2}$, i.e., the $(i, j)$-index of the input is $+1$ if there is an edge between $i$ and $j$ and $-1$ otherwise. Now, we can write $f$ as a polynomial on both $G$ and $x$: \begin{align*} f(G, x) = \sum_{T \subseteq {[n] \choose 2}, S \subseteq [n]} a_{(T, S)} \chi_T(G) x_S \end{align*} where $\chi_T(G)$ and $x_S$ denote $\prod_{e \in T} G_e$ and $\prod_{i \in S} x_i$ respectively, and, $a_{(T, S)}$'s are the coefficients of the polynomial. We will require the pseudo-calibration condition to hold only for $f_G$ such that each monomial depends on at most $\tau$ vertices where $\tau = O(d)$ is a truncation threshold. In other words, we only restrict ourselves to $f$ that can be written as \begin{align*} f(G, x) = \sum_{T \subseteq {[n] \choose 2}, S \subseteq [n] \atop |\cV(T) \cup S| \leq \tau} a_{(T, S)} \chi_T(G) x_S \end{align*} where $\cV(T)$ is the set of all vertices which are endpoints of edges in $T$. The intuition behind this heuristic is that, in the conditions on $\tE_G$ imposed by the SoS relaxation, each monomial involves at most $2d$ vertices because $\tE_G$ is defined only on polynomials on $x$ of degree at most $2d$. As a result, each monomial appearing in $f(G, x)$ should involve no more than $O(d)$ vertices in order to limit its “power” to be not much more than the SoS relaxation.
<p>
Now, let us use the pseudo-calibration condition to determine $\tE_G$. Fix a subset $S \subseteq [n]$ of size at most $2d$; we will compute $\tE_G[x_S]$ for the monomial $x_S$. Note that, since $\tE_G$ is linear and $\tE_G[x_i^2 q] = \tE_G[x_i q]$ for all $i \in [n]$, these $\tE_G[x_S]$'s uniquely determine $\tE_G$. By viewing $\tE_G[x_S]$ as a function of $G$, $\tE_G[x_S]$ can be written as a Fourier expansion \begin{align*} \tE_G[x_S] = \sum_{T \subseteq {[n] \choose 2}} \widehat{\tE_G[x_S]}(T) \chi_T(G). \end{align*} The final heuristic employed by Barak et al. is to enforce $\tE_G[x_S]$ to be low degree by letting $\widehat{\tE_G[x_S]}(T) = 0$ for every $T$ with $|\cV(T) \cup S| > \tau$. This heuristic makes sense since $\tE_G$ must be output by the SoS relaxation solver, which runs in $n^{O(d)}$ time; hence, $\tE_G$ cannot be too hard to compute. More importantly, as we will see shortly, this condition allows us to almost uniquely determine $\tE_G$ from the pseudo-calibration condition.
<p>
Recall that each Fourier coefficient $\widehat{\tE_G[x_S]}(T)$ is simply equal to $\E_{G \sim \cG(n, 1/2)} \tE_G[x_S \chi_T(G)].$ Plugging in the pseudo-calibration condition with $f = x_S\chi_T(G)$, this is approximately $\E_{(G, x) \sim \cG(n, 1/2, k)} [x_S\chi_T(G)]$. It is not hard to see that this expression is equal to the probability that every vertex in $\cV(T) \cup S$ is in the planted clique, which is roughly $(k/n)^{|\cV(T) \cup S|}$ when $|\cV(T) \cup S|$ is small. Indeed, we will set $\widehat{\tE_G[x_S]}(T)$ to be exactly this. In other words, the final pseudo-distribution is \begin{align*} \tE_G[x_S] = \sum_{T \subseteq {[n] \choose 2} \atop |\cV(T) \cup S| \leq \tau} \left(\frac{k}{n}\right)^{|\cV(T) \cup S|} \chi_T(G). \end{align*}
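<p>
On very small graphs, this formula can be evaluated by direct enumeration. The sketch below is our own illustration (the tiny graphs and the choice $\tau = |S|$ are ours, not from the paper). With $\tau = |S| = 2$ and $S = \{i, j\}$, only $T = \emptyset$ and $T = \{(i,j)\}$ contribute, so $\tE_G[x_ix_j] = (k/n)^2(1 + G_{ij})$, and one can watch Condition 3 emerge: the pseudo-expectation vanishes exactly when $(i,j)$ is a non-edge.

```python
from itertools import chain, combinations

def pseudo_expectation(n, G, k, S, tau):
    """tE_G[x_S] = sum over edge sets T with |V(T) ∪ S| <= tau of
    (k/n)^{|V(T) ∪ S|} * chi_T(G), where G maps each pair (i, j), i < j,
    to +1 (edge) or -1 (non-edge)."""
    pairs = list(combinations(range(n), 2))
    all_T = chain.from_iterable(combinations(pairs, r)
                                for r in range(len(pairs) + 1))
    total = 0.0
    for T in all_T:
        verts = set(S) | {v for e in T for v in e}  # V(T) ∪ S
        if len(verts) <= tau:
            chi_T = 1
            for e in T:
                chi_T *= G[e]  # product of ±1 edge signs
            total += (k / n) ** len(verts) * chi_T
    return total

n, k = 3, 2
G_edge    = {(0, 1): +1, (0, 2): -1, (1, 2): -1}  # (0,1) is the only edge
G_nonedge = {(0, 1): -1, (0, 2): -1, (1, 2): -1}  # empty graph
print(pseudo_expectation(n, G_edge, k, (0, 1), 2))     # 2*(2/3)^2 = 8/9
print(pseudo_expectation(n, G_nonedge, k, (0, 1), 2))  # 0.0
```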
<p>
It is not hard to see that $\tE_G$ indeed satisfies the pseudo-calibration condition for $f$'s of our interest. As explained right before the beginning of this subsection, this almost immediately implies that the third condition required for $\tE_G$ is satisfied; it is also pretty easy to check that the condition is indeed true (see Lemma 5.5 in the paper). Using concentration inequalities, Barak et al. also show that $\tE_G[1] = 1 \pm o(1)$ and $\tE_G[\sum_{i \in [n]} x_i] = k \pm o(1)$ (see full proof in Appendix A.2 of the paper). Note that while these two conditions are only approximately satisfied, $\tE_G$ can be scaled so that they are exactly satisfied as well. As mentioned briefly earlier, the proof of the positivity condition $\tE_G[q^2] \geq 0$ is much harder and is the paper's main technical contribution. We do not attempt to discuss it here but we will try to blog about it in the future.
<p>
<h2 class='tex'>3. Further Reading </h2>
<p>
The authors of [<a href='#ref-BHKKMP16'>BHKKMP16</a>] have given talks on the paper, some of which are available online, such as <a href="https://www.youtube.com/watch?v=ZmFOsAB7Y1k">Moitra's</a> and <a href="https://www.youtube.com/watch?v=H2C2ZdgynX4">Kothari's</a>. Barak also wrote <a href="https://windowsontheory.org/2016/04/13/bayesianism-frequentism-and-the-planted-clique-or-do-algorithms-believe-in-unicorns/">a blog post</a> about pseudo-calibration. All of these discuss pseudo-calibration in much more detail than this post does. Moitra's talk also contains a proof sketch of the positivity of the pseudo-distribution, which is not covered in this blog post.
<p>
Apart from the paper, I am not aware of the pseudo-calibration technique being used to prove new lower bounds for other problems yet. I will update this section when I come across new results based on pseudo-calibration.
<p>
<br><hr><h3>References</h3>
<p>
<a name='ref-AKS98'>[AKS98]</a> Noga Alon, Michael Krivelevich, and Benny Sudakov.
Finding a large hidden clique in a random graph.
<em>Random Struct. Algorithms</em>, 13(3-4):457--466, 1998.
<p>
<p>
<a name='ref-BHKKMP16'>[BHKKMP16]</a> Boaz Barak, Samuel B. Hopkins, Jonathan A. Kelner, Pravesh Kothari, Ankur
Moitra, and Aaron Potechin.
A nearly tight sum-of-squares lower bound for the planted clique
problem.
<em>CoRR</em>, abs/1604.03084, 2016.
<p>
Deterministic Sparsification
Chenyang Yuan
2016-07-06T00:00:00+00:00
http://learningwitherrors.org/2016/07/06/deterministic-sparsification<div style="display:none;"><script type="math/tex">
\newcommand{\paren}[1]{\left( #1 \right)}
\newcommand{\dotp}[1]{\left\langle #1 \right\rangle }
</script></div>
<p>Let $G$ be a dense graph. A sparse graph $H$ is a sparsifier of $G$ if it is an
approximation of $G$ that preserves certain properties, such as quadratic forms
of its Laplacian. This post will formally define spectral sparsification, then
present the intuition behind the deterministic construction of spectral
sparsifiers by Batson, Spielman and Srivastava [<a href="#BSS08">BSS08</a>].</p>
<!--more-->
<p>Benczúr and Karger [<a href="#BK96">BK96</a>] introduced the cut sparsifier, which
ensures that the value of all cuts in $H$ approximates that of all cuts in $G$:</p>
<blockquote>
<p><a name="def:cut-sp"></a><strong>Definition 1 (Cut Sparsification):</strong> A weighted
undirected graph $H = (V, E_H)$ is an $\epsilon$-cut sparsifier of a weighted
undirected graph $G = (V, E_G)$ if for all $S \subset V$,</p>
<script type="math/tex; mode=display">(1-\epsilon)E_G(S, V \setminus S) \le E_H(S, V \setminus S) \le (1+\epsilon)E_G(S, V \setminus S)</script>
<p>where $E_G$ and $E_H$ are the sums of edge weights crossing the cuts in $G$ and
$H$ respectively.</p>
</blockquote>
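<p>On small graphs the definition can be checked by brute force over all cuts. A
minimal sketch (our own; the triangle example and $\epsilon = 1/2$ are arbitrary
choices, not from [<a href="#BK96">BK96</a>]):</p>

```python
from itertools import combinations

def cut_value(weights, S):
    """Total weight of edges crossing the cut (S, V \\ S).
    weights maps frozenset({u, v}) -> edge weight."""
    return sum(w for e, w in weights.items() if len(e & S) == 1)

def is_cut_sparsifier(n, w_G, w_H, eps):
    """Check (1-eps) E_G(S, V\\S) <= E_H(S, V\\S) <= (1+eps) E_G(S, V\\S)
    for every nontrivial cut S."""
    for r in range(1, n):
        for S in combinations(range(n), r):
            cG, cH = cut_value(w_G, set(S)), cut_value(w_H, set(S))
            if not ((1 - eps) * cG <= cH <= (1 + eps) * cG):
                return False
    return True

# unit-weight triangle, "sparsified" by dropping edge {0,2} and reweighting
wG = {frozenset(e): 1.0 for e in [(0, 1), (1, 2), (0, 2)]}
wH = {frozenset({0, 1}): 1.5, frozenset({1, 2}): 1.5}
print(is_cut_sparsifier(3, wG, wH, 0.5))  # True
print(is_cut_sparsifier(3, wG, wH, 0.1))  # False
```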
<p>Spielman and Teng [<a href="#ST08">ST08</a>] introduced another notion of graph sparsification
with the quadratic form of the Laplacian:</p>
<blockquote>
<p><a name="def:spectral-sp"></a><strong>Definition 2 (Spectral Sparsification):</strong> A
weighted undirected graph $H = (V, E_H)$ is an $\epsilon$-spectral sparsifier
of a weighted undirected graph $G = (V, E_G)$ if for all $x \in
\mathbb{R}^{|V|}$,</p>
<script type="math/tex; mode=display">(1-\epsilon)x^T L_G x \le x^T L_H x \le (1+\epsilon) x^T L_G x</script>
<p>where $L_G$ and $L_H$ are the graph Laplacians of $G$ and $H$ respectively.</p>
</blockquote>
<p>Cut sparsifiers can be used in approximating max-flow (via the max-flow min-cut
theorem), and spectral sparsifiers are a key ingredient in solving Laplacian
linear systems in near-linear time.</p>
<!-- Add why it's important for H to be weighted? -->
<p>Note that this is stronger than cut sparsification, as we can fix $x$ to be the
indicator vectors of cuts to obtain <a href="#def:cut-sp">definition 1</a>. Also note that
this notion of sparsifiers also provides bounds on Laplacian eigenvalues, and
thus spectral sparsifiers of complete graphs are also expanders. These
sparsifiers can be constructed in a randomized manner by sampling edges
proportional to their effective resistance [<a href="#SS08">SS08</a>], but in this post we
will focus on a deterministic construction presented in [<a href="#BSS08">BSS08</a>], as
stated more precisely in the following theorem:</p>
<blockquote>
<p><a name="thm:sparsifier"></a><strong>Theorem 1:</strong> For every $d > 1$, every undirected
graph $G = (V, E)$ on $n$ vertices contains a weighted subgraph $H = (V, F,
\tilde{w})$ with $\lceil d(n-1) \rceil$ edges that satisfies:</p>
<script type="math/tex; mode=display">x^T L_G x \le x^T L_H x \le \left(\frac{d+1+2\sqrt d}{d+1-2\sqrt d}\right) x^T L_G x \quad \forall x \in \mathbb{R}^{|V|}</script>
</blockquote>
<h2 id="preliminaries">Preliminaries</h2>
<p>The Laplacian $L$ of a graph can be seen as a linear transformation relating the
flow and demand in an electrical flow on the graph where each edge has unit
resistance. Let $B \in \mathbb{R}^{m \times n}$ be the vertex-edge incidence
matrix and $W$ the diagonal matrix of edge weights; then $L = B^TWB$. $L$ also
has a pseudoinverse, which acts like an actual inverse on all vectors
$x \bot \mathbb 1$ and corresponds to solving an electrical flow on $G$.</p>
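<p>The factorization $L = B^T W B$ is easy to verify directly. A pure-Python
sketch with unit weights (so $W = I$), on a toy example of our own:</p>

```python
def laplacian_from_incidence(n, edges):
    """Build signed incidence rows b_e = 1_u - 1_v and return L = B^T B
    (unit edge weights, so W = I)."""
    rows = []
    for (u, v) in edges:
        b = [0] * n
        b[u], b[v] = 1, -1
        rows.append(b)
    # B^T B, computed entrywise
    return [[sum(b[i] * b[j] for b in rows) for j in range(n)] for i in range(n)]

# path graph 0 - 1 - 2: L = (degree matrix) - (adjacency matrix)
L = laplacian_from_incidence(3, [(0, 1), (1, 2)])
print(L)  # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```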
<p>Let $\kappa = \frac{d+1+2\sqrt d}{d+1-2\sqrt d}$. Assuming the graph is
connected, we may restrict to $x \bot \mathbf{1}$, and perform a transformation
on the condition in <a href="#thm:sparsifier">Theorem 1</a>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
& x^T L_G x \le x^T L_H x \le \kappa x^T L_G x & \forall x \bot \mathbf{1} \\
\iff & 1 \le \frac{x^T L_H x}{x^T L_G^{1/2} L_G^{1/2} x} \le \kappa & \forall x \bot \mathbf{1} \\
\iff & 1 \le \frac{y^T L_G^{-1/2} L_H L_G^{-1/2} y}{y^T y} \le \kappa & \forall y \in \text{im}(L_G)
\end{align*} %]]></script>
<p>Let $b_{e = (u, v)} = \mathbf{1}_u - \mathbf{1}_v$ be a row of the incidence matrix
$B$, let $s_e$ be the weight of edge $e$ in $E_H$, and write $A \succeq B$ when $A - B$ is
a positive semidefinite matrix. Then the above condition can be rewritten as:</p>
<p><a name="eq:sparse-approx"></a>
<script type="math/tex">\begin{align}
I \preceq \sum_{e \in E_H} L_G^{-1/2} b_e^T s_e b_e L_G^{-1/2} \preceq \kappa I
\end{align}</script></p>
<p>We then define a vector $v_e = L_G^{-1/2}b_e^T$ for each $e \in E_G$. Notice
that over all edges of $G$, the rank-1 matrices $v_ev_e^T$ sum to the
identity matrix:</p>
<script type="math/tex; mode=display">\sum_{e\in E_G} v_e v_e^T = \sum_{e\in E_G} L_G^{-1/2}b_e^T b_e L_G^{-1/2} = L_G^{-1/2} B^T B L_G^{-1/2} = L_G^{-1/2} L_G L_G^{-1/2} = I</script>
<p>Then <a href="#eq:sparse-approx">equation 1</a> can be interpreted as choosing a sparse
subset of the edges in $G$, as well as weights $s_e$, so that the matrix
obtained by summing over the edges of $H$, $\sum_{e \in E_H} s_e v_ev_e^T$, has
a low condition number (ratio between the largest and smallest eigenvalues):</p>
<script type="math/tex; mode=display">I \preceq \sum_{e \in E_H} s_e v_ev_e^T \preceq \kappa I</script>
<p>If we can find such a sparse set of edges and weights, then we have proved
<a href="#thm:sparsifier">Theorem 1</a>. In [<a href="#SS08">SS08</a>] this was done by randomly
sampling these rank-1 matrices based on the effective resistances of their
corresponding edges, using a distribution that has the identity matrix as the
expectation. Convergence is shown using a matrix concentration inequality. The
construction in [<a href="#BSS08">BSS08</a>] deterministically chooses each $v_e$ and
$s_e$, bounding the increase in $\kappa$ in each step using barrier
functions. One useful lemma for this procedure is:</p>
<blockquote>
<p><a name="lem:matrix-det"></a><strong>Lemma 1 (Matrix Determinant Lemma):</strong> If $A$ is
nonsingular and $v$ is a vector, then:</p>
<script type="math/tex; mode=display">\det(A + vv^T) = \det(A)(1 + v^TA^{-1}v)</script>
</blockquote>
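<p>The lemma is easy to sanity-check on a small matrix; a $2 \times 2$ sketch in
pure Python (the numbers are our own):</p>

```python
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

A = [[3.0, 1.0], [1.0, 2.0]]
v = [1.0, 2.0]

# left-hand side: det(A + v v^T)
A_plus = [[A[i][j] + v[i] * v[j] for j in range(2)] for i in range(2)]
lhs = det2(A_plus)

# right-hand side: det(A) * (1 + v^T A^{-1} v)
Ainv = inv2(A)
quad = sum(v[i] * Ainv[i][j] * v[j] for i in range(2) for j in range(2))
rhs = det2(A) * (1 + quad)

print(abs(lhs - rhs) < 1e-9)  # True
```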
<h2 id="main-proof">Main Proof</h2>
<p>Recall from the previous section that the main theorem that needs to be proved is:</p>
<blockquote>
<p><a name="thm:rank1approx"></a><strong>Theorem 2:</strong>
Suppose $d > 1$ and $v_1, \cdots, v_m$ are vectors in $\mathbb{R}^n$ with
<script type="math/tex">\sum_{i \le m} v_i v_i^T = I.</script>
Then there exist scalars $s_i > 0$ with $|{i: s_i \ne 0 }| \le dn$ so that
<script type="math/tex">I \preceq \sum_{i \le m} s_i v_iv_i^T \preceq \left(\frac{d+1+2\sqrt d}{d+1-2\sqrt d}\right) I</script></p>
</blockquote>
<p>This is equivalent to bounding the ratio between $\lambda_{\max}$ and
$\lambda_{\min}$ (the condition number) of the matrix $\sum_{i \le m} s_i v_iv_i^T$.</p>
<p>We start with a matrix $A = 0$, and build it by adding rank-1 updates
$s_ev_ev_e^T$. One interesting fact is that for any vector $v$, the eigenvalues
of $A$ and $A + vv^T$ interlace. Consider the characteristic polynomial of $A +
vv^T$:</p>
<script type="math/tex; mode=display">p_{A + vv^T}(x) = \det(xI - A - vv^T) = p_A(x) \paren{1 - \sum_j \frac{\dotp{v,u_j}^2}{x - \lambda_j}}</script>
<p>which can be written in terms of the characteristic polynomial of $A$ using
<a href="#lem:matrix-det">Lemma 1</a>, where the $u_j$ are the eigenvectors of $A$. Let $\lambda$ be
a zero of $p_{A + vv^T}(x)$. It can either:</p>
<ol>
<li>Be a zero of $p_A(x)$, so $\lambda$ is equal to an eigenvalue $\lambda_i$
of $A$, and the corresponding eigenvector $u_i$ is orthogonal to $v$. In this
case, this eigenvalue doesn’t move.</li>
<li>Strictly interlace with the old eigenvalues. This happens when
$p_A(\lambda) \ne 0$ and
<script type="math/tex">\sum_j \frac{\dotp{v,u_j}^2}{\lambda - \lambda_j} = 1.</script>
This can be interpreted with a physical model. Consider $n$ positive charges
arranged vertically, with the $j$-th charge’s position corresponding to the
$j$-th eigenvalue of $A$ and its charge equal to $\dotp{v, u_j}^2$. The points
where the electric potential is 1 are the new eigenvalues. Since between any
two charges the potential decreases from $+ \infty$ to $- \infty$,
there has to be a point between every two charges where the potential is 1;
thus the new eigenvalues strictly interlace the old ones.</li>
</ol>
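<p>The physical picture can be checked numerically: for a diagonal $A$, the
potential $\sum_j \dotp{v, u_j}^2 / (x - \lambda_j)$ strictly decreases from
$+\infty$ to $-\infty$ between consecutive charges, so bisection locates the new
eigenvalue in each gap. A sketch with toy numbers of our own choosing:</p>

```python
def potential(x, lams, charges):
    """sum_j charges[j] / (x - lams[j]); new eigenvalues solve potential = 1."""
    return sum(c / (x - l) for l, c in zip(lams, charges))

def root_between(a, b, lams, charges, iters=200):
    """Bisect for potential(x) = 1 on (a, b), where it falls from +inf to -inf."""
    lo, hi = a + 1e-12, b - 1e-12
    for _ in range(iters):
        mid = (lo + hi) / 2
        if potential(mid, lams, charges) > 1:
            lo = mid  # still on the +inf side of the crossing; move right
        else:
            hi = mid
    return (lo + hi) / 2

lams = [1.0, 2.0, 3.0]        # eigenvalues of a diagonal A
charges = [0.25, 0.25, 0.25]  # <v, u_j>^2 for v = (0.5, 0.5, 0.5)
new = [root_between(lams[i], lams[i + 1], lams, charges) for i in range(2)]
# each new eigenvalue lies strictly between consecutive old ones
print(all(lams[i] < new[i] < lams[i + 1] for i in range(2)))  # True
```

(The one remaining new eigenvalue lies above $\lambda_{\max}$, accounting for the increase in trace.)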
<p>To get some intuition, we see what happens when we sample $v_i$ uniformly at
random. Since $\sum_j v_jv_j^T = I$, $\mathbb{E}_v[\dotp{v, u}^2]$ is constant
for any normalized vector $u$. Therefore adding the average $v$ increases the
charges by the same amount in the physical model, causing the new eigenvalues to
all increase by the same amount. Informally, we expect all the eigenvalues to
“march forward” at similar rates with each $vv^T$ added, so $\lambda_{\max} /
\lambda_{\min}$ is bounded.</p>
<p>We construct a sequence of matrices $A^{(0)}, \cdots, A^{(q)}$ by adding rank-1
updates $t vv^T$. To bound the condition number after each update, we create two
barriers $l < \lambda_{\min}(A) < \lambda_{\max}(A) < u$ so that the eigenvalues
of $A$ lie between them. $\Phi_l(A)$ and $\Phi^u(A)$ are defined as the
potentials at the barriers respectively:</p>
<script type="math/tex; mode=display">\Phi_l(A) := \sum_i \frac{1}{\lambda_i - l}, \quad \Phi^u(A) := \sum_i \frac{1}{u - \lambda_i}</script>
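<p>A quick numerical check of the “march forward” picture: if every eigenvalue
and both barriers advance by the same $\delta$, neither potential changes. A
diagonal-case sketch with toy numbers of our own:</p>

```python
def phi_lower(eigs, l):
    # lower-barrier potential: sum_i 1 / (lambda_i - l)
    return sum(1.0 / (lam - l) for lam in eigs)

def phi_upper(eigs, u):
    # upper-barrier potential: sum_i 1 / (u - lambda_i)
    return sum(1.0 / (u - lam) for lam in eigs)

eigs = [1.0, 1.5, 2.5]
l, u, delta = 0.0, 3.0, 0.4

shifted = [lam + delta for lam in eigs]  # all eigenvalues advance by delta
print(abs(phi_lower(shifted, l + delta) - phi_lower(eigs, l)) < 1e-9)  # True
print(abs(phi_upper(shifted, u + delta) - phi_upper(eigs, u)) < 1e-9)  # True
```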
<p>The crucial step is to show that there exist a $v_i$ and a $t$ such that we can
add $t v_i v_i^T$ to $A$ while shifting each barrier by a constant without
increasing the potential at either barrier. We sketch the proof briefly;
readers can pursue the details in [<a href="#BSS08">BSS08</a>].</p>
<p>Let constants $\delta_U$ and $\delta_L$ be the amounts by which the upper and
lower barriers advance each round, and constants $\epsilon_U = \Phi^{u_0}(A^{(0)})$ and
$\epsilon_L = \Phi_{l_0}(A^{(0)})$ be the initial potentials at each
barrier. The first lemma shows that if $t$ is not too large, adding $t vv^T$ to
$A$ and shifting the upper barrier by $\delta_U$ will not increase the upper
potential $\Phi^u$.</p>
<blockquote>
<p><strong>Lemma 2 (Upper Barrier Shift):</strong> Suppose $\lambda_{\max}(A) < u$, and $v$ is
any vector. If
<script type="math/tex">U_A(v) := v^T \paren{\frac{((u + \delta_U)I - A)^{-2}}{\Phi^u(A) - \Phi^{u+ \delta_U}(A)}
+ ((u + \delta_U)I - A)^{-1}} v \le \frac{1}{t}</script>
Then:
<script type="math/tex">% <![CDATA[
\Phi^{u+ \delta_U}(A + tvv^T) \le \Phi^{u}(A) \quad \text{and} \quad
\lambda_{\max}(A + tvv^T) < u + \delta_U %]]></script></p>
</blockquote>
<p>The second lemma shows that if $t$ is not too small, adding $t vv^T$ to $A$ and
shifting the lower barrier by $\delta_L$ will not increase the lower potential
$\Phi_l$.</p>
<blockquote>
<p><strong>Lemma 3 (Lower Barrier Shift):</strong> Suppose $\lambda_{\min}(A) > l$, $\Phi_l(A)
\le 1/\delta_L$ and $v$ is any vector. If
<script type="math/tex">L_A(v) := v^T \paren{\frac{(A - (l + \delta_L)I)^{-2}}{\Phi_{l+ \delta_L}(A) - \Phi_l(A)}
- (A - (l + \delta_L)I)^{-1}} v \ge \frac{1}{t} > 0</script>
Then:
<script type="math/tex">\Phi_{l+ \delta_L}(A + tvv^T) \le \Phi_{l}(A) \quad \text{and} \quad
\lambda_{\min}(A + tvv^T) > l + \delta_L</script></p>
</blockquote>
<p>Finally, it can be shown that there exist a $t$ and a $v_i$ that satisfy the
conditions of the above lemmas.</p>
<blockquote>
<p><strong>Lemma 4 (Both Barriers):</strong> If $\lambda_{\max}(A) < u$, $\lambda_{\min}(A) >
l$, $\Phi^u(A) \le \epsilon_U$, $\Phi_l(A) \le \epsilon_L$, and $\epsilon_U$,
$\epsilon_L$, $\delta_U$, $\delta_L$ satisfy:
<script type="math/tex">0 \le \frac{1}{\delta_U} + \epsilon_U \le \frac{1}{\delta_L} - \epsilon_L,</script>
then there exists a $v_i$ and positive $t$ for which
<script type="math/tex">% <![CDATA[
L_A(v_i) \ge \frac{1}{t} \ge U_A(v_i) \quad \text{and} \quad l + \delta_L <
\lambda_{\min}(A + t v_iv_i^T) < \lambda_{\max}(A + t v_iv_i^T) < u + \delta_U %]]></script></p>
</blockquote>
<p>This is proved by an averaging argument relating the behavior of each vector
$v_i$ to the behavior of the expected vector, showing that
<script type="math/tex">\sum_{i \le m} (L_A(v_i) - U_A(v_i)) \ge 0,</script>
so there exists an $i$ for which $L_A(v_i) \ge U_A(v_i)$, and some $\frac1t$ fits
in the gap between them. Choosing the constants carefully, we can get the required
bound on the condition number.</p>
<h2 id="extension">Extension</h2>
<p>There is a similarity between <a href="#thm:rank1approx">Theorem 2</a> and the
Kadison-Singer conjecture. One formulation of it is stated below:</p>
<blockquote>
<p><a name="prop:KSC"></a><strong>Proposition 1:</strong> There are universal constants
$\epsilon, \delta > 0$ and $r \in \mathbb N$ for which the following statement
holds. If $v_1, \cdots, v_m \in \mathbb{R}^n$ satisfy $||v_i|| \le \delta$ for
all $i$ and
<script type="math/tex">\sum_{i \le m} v_i v_i^T = I</script>
then there is a partition $X_1, \cdots, X_r$ of $\{1, \cdots, m\}$ for which
<script type="math/tex">\left| \left| \sum_{i \in X_j} v_i v_i^T \right| \right| \le 1 - \epsilon</script>
for every $j = 1, \cdots, r$.</p>
</blockquote>
<p>This conjecture was positively resolved in [<a href="#MSS13">MSS13</a>], using techniques
arising from generalizing the barrier function argument used to prove
<a href="#thm:rank1approx">Theorem 2</a> to a multivariate version.</p>
<h2 id="references">References</h2>
<p><a name="BK96">[BK96]</a>
A. A. Benczúr and D. R. Karger. Approximating s-t minimum cuts in
$\tilde{O}(n^2)$ time. In <em>STOC ‘96</em>, pages 47-55, 1996.</p>
<p><a name="BSS08">[BSS08]</a>
J. Batson, D. A. Spielman and N. Srivastava. Twice-Ramanujan
Sparsifiers. Available at <a href="http://arxiv.org/abs/0808.0163">http://arxiv.org/abs/0808.0163</a>, 2008.</p>
<p><a name="MSS13">[MSS13]</a>
A. W. Marcus, D. A. Spielman and N. Srivastava. Interlacing Families
II: Mixed Characteristic Polynomials and The Kadison-Singer Problem. Available
at <a href="http://arxiv.org/abs/1306.3969">http://arxiv.org/abs/1306.3969</a>, 2013.</p>
<p><a name="SS08">[SS08]</a>
D. A. Spielman and N. Srivastava. Graph Sparsification by Effective
Resistances. In <em>STOC ‘08</em>, pages 563-568, 2008.</p>
<p><a name="ST08">[ST08]</a>
D. A. Spielman and S.-H. Teng. Spectral Sparsification of Graphs. Available at
<a href="http://arxiv.org/abs/0808.4134">http://arxiv.org/abs/0808.4134</a>, 2008.</p>
Intro to the Sum-of-Squares Hierarchy
Tselil Schramm
2016-06-23T00:00:00+00:00
http://learningwitherrors.org/2016/06/23/intro-sos<script type="text/javascript">
function toggle_display_nojump(id) {
event.preventDefault();
var e = document.getElementById(id);
if(e.style.display == 'block')
e.style.display = 'none';
else
e.style.display = 'block';
return false; // prevent default action of jumping to anchor
}
</script>
<div style='display:none;'><script type='math/tex'> \newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\sos}{\mathrm{sos}}
\newcommand{\Span}{\mathrm{span}}
\newcommand{\Id}{\mathrm{Id}}
\newcommand{\col}{\mathrm{col}}
</script></div>
<p>
This note is intended to introduce the Sum-of-Squares Hierarchy. We start with SDP relaxations, using the Goemans-Williamson Max-Cut SDP as a jumping-off point. We then discuss duality and sum-of-squares proofs. Finally, we give a non-trivial example application of the sum-of-squares hierarchy: an algorithm for finding planted sparse vectors inside random subspaces. We will give no historical context, but the final section contains pointers to other resources which give a better sense of the history, as well as alternative expositions of the same content.
<!--more-->
<p>
<h2 class='tex'>1. A Relaxation for Polynomial Optimization </h2>
<p>
Suppose we are interested in some polynomial optimization problem: \[ Q = \left\{\max_{x \in \R^n}\ p(x),\qquad s.t.\quad g_i(x) = 0 \quad \forall\ i \in [m]\right\}. \] That is, we want to maximize our objective function, the polynomial $p:\R^n\to \R$, subject to the polynomial constraints $g_1(x) = 0, g_2(x) = 0,\ldots, g_m(x) = 0$.
<p>
The problem $Q$ may be non-convex, and solving such programs is in general NP-hard (e.g., this formulation captures integer programming). A standard approach in a situation like this is to relax the problem $Q$ to a semidefinite program (SDP). Perhaps the most famous example is the Goemans-Williamson relaxation for Max-Cut:
<p>
<blockquote><b>Example 1 (Max-Cut)</b> <em> We can formulate the max cut problem on an $n$-vertex graph $G$ as a polynomial optimization problem with objective function $p(x) = \sum_{(i,j) \in E(G)} \frac{1- x_i x_j}{2}$ and the constraint polynomials $g_i(x) = x_i^2 - 1 = 0\ \forall i\in[n]$, which ensure that each $x_i = \pm 1$.
<p>
The Goemans-Williamson SDP relaxation assigns program variables $X_{ij}$ for each $i,j \in [n]$, where $X_{ij}$ is a stand-in for the monomial $x_i x_j$. The SDP then becomes \[ \left\{ \max \sum_{(i,j) \in E(G)}\tfrac{1}{2}(1 - X_{ij}), \qquad s.t. \quad X_{ii} -1= 0\quad \forall i \in [n],\quad X \succeq 0 \right\} \] where $X$ is the $n \times n$ matrix with variable $X_{ij}$ in the $(i,j)$th entry. </em></blockquote>
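<p>
As a quick numeric sanity check (a toy illustration, not part of the relaxation machinery itself), any cut $x \in \{\pm 1\}^n$ gives a feasible point $X = xx^\top$ of the Goemans-Williamson SDP whose objective equals the cut value; the graph below is an assumed 4-cycle example:

```python
import numpy as np

# A toy 4-cycle graph; its maximum cut has value 4.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
x = np.array([1, -1, 1, -1])  # a +/-1 assignment (here, an optimal cut)

# Any +/-1 assignment yields a feasible rank-one SDP solution X = x x^T.
X = np.outer(x, x)

assert np.all(np.diag(X) == 1)                  # X_ii = 1 for all i
assert np.all(np.linalg.eigvalsh(X) >= -1e-8)   # X is PSD

sdp_obj = sum(0.5 * (1 - X[i, j]) for i, j in edges)
cut_val = sum(0.5 * (1 - x[i] * x[j]) for i, j in edges)
assert sdp_obj == cut_val == 4
```

Of course, the power of the relaxation comes precisely from the feasible matrices $X$ that are <em>not</em> of this rank-one form.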
<p>
<p>
<p>
<b>The Sum-of-Squares SDP: Extending Goemans-Williamson.</b> After seeing the Goemans-Williamson Max-Cut SDP, it seems natural to apply a similar relaxation to other polynomial optimization problems. Suppose that the maximum degree of any term in $p, g_1,\ldots,g_m$ is at most $2d$. The strategy is to relax the polynomial optimization problem by replacing each monomial $\prod_{i\in S} x_i$ (for a multiset $S \subset [n]$) which appears in the program $Q$ with an SDP variable $X_S$. So for each multiset $S \subset [n]$, $|S| \le 2d$, we have an SDP variable. Then we arrange the variables into an $(n+1)^d \times (n+1)^d$ matrix $X$ in the natural way, with rows and columns indexed by every ordered multiset of at most $d$ variables: <p align=center><img height = 300 src="/sources/june2016-sos/sos-X.png"></p> Now we enforce some natural constraints:
<ul> <li> “Commutativity” or “symmetry”: If the ordered multisets $S,T,U,V \subset [n]$ are such that $S \cup T = U \cup V$ as unordered multisets, then $X_{S\cup T} = X_{U \cup V}$. This is meant to reflect the commutative property of monomials. That is, for any $x\in \R^n$ \[ \prod_{i \in S}x_i \cdot \prod_{j\in T}x_j = \prod_{k \in U}x_k \cdot \prod_{\ell \in V} x_{\ell}. \] <li> “Normalization”: we set $X_{\emptyset} = 1$. This is the “scale” of the coefficients. One way to see that this is the correct scale is to think of $X_{\emptyset}$ as the monomial multiplier of a polynomial's constant term. <li> “PSDness”: we require that $X \succeq 0$, or that $X$ is positive-semidefinite. This constraint is natural because for any point $y \in \R^n$, if we take the matrix $X=X_y$ given by setting $X_{S} = \prod_{i\in S} y_i$, the resulting matrix $X$ is PSD. The proof is that if we take the vector $\tilde{y}$ so that $\tilde{y}^{\top} = [1 \ y^\top]$, then $X_y = \tilde{y}^{\otimes d}(\tilde{y}^{\otimes d})^\top$ (where $y^{\otimes d}$ is the $d$th Kronecker power<sup><a href='#footnote1' onclick="toggle_display_nojump('footnote1');">1</a></sup><span class='sidenote' id='footnote1'><a name='footnote1' href='#footnote1'>1.</a> The Kronecker product of an $n \times m$ matrix $A$ and a $\ell \times k$ matrix $B$ is a $n\ell \times mk$ matrix $A \otimes B$, which we can naturally index by pairs so that the $(a,b),(c,d)$th entry is the product of $A_{ac}B_{bd}$. So, the Kronecker product of an $n \times 1$ vector $x$ with itself is a $n^2 \times 1$ vector whose $(i,j)$th entry is simply the product $x_ix_j$. </span> of $y$ ), and thus for any vector $v$, $v^\top X_yv = \langle v, \tilde{y}^{\otimes d}\rangle^2 \ge 0$.
<p>
</ul>
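<p>
To make the three constraints concrete, here is a small numpy sketch (for assumed parameters $n = 3$, $2d = 4$) that builds the moment matrix $X_y = \tilde{y}^{\otimes 2}(\tilde{y}^{\otimes 2})^\top$ of an actual point $y$ and checks normalization, commutativity, and PSDness:

```python
import numpy as np

np.random.seed(0)
n = 3
y = np.random.randn(n)
y_t = np.concatenate(([1.0], y))   # y-tilde = [1, y], index 0 is the "empty" slot
yd = np.kron(y_t, y_t)             # y-tilde^{(x)2}, indexed by ordered pairs
X = np.outer(yd, yd)               # the (n+1)^2 x (n+1)^2 moment matrix of y

# Normalization: the (empty, empty) entry is 1.
assert abs(X[0, 0] - 1) < 1e-9
# PSDness: all eigenvalues are (numerically) nonnegative.
assert np.all(np.linalg.eigvalsh(X) >= -1e-8)

# Commutativity: index tuples that are equal as multisets give equal entries.
# Positions of the ordered pairs (1,2) and (2,1) in the Kronecker indexing:
i12 = 1 * (n + 1) + 2
i21 = 2 * (n + 1) + 1
assert abs(X[i12, i21] - X[i21, i12]) < 1e-9
# Both encode the monomial x_1^2 x_2^2 evaluated at y:
assert abs(X[i12, i12] - y[0] ** 2 * y[1] ** 2) < 1e-9
```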
<p>
<blockquote><b>Remark 1</b> <em> Notice that any feasible solution $y \in \R^n$ for the program $Q$ yields a feasible solution to $\sos_d(Q)$: we just assign $X_S := \prod_{i\in S}y_i$, and the above arguments show that this is feasible. </em></blockquote>
<p>
<p>
One nice consequence of these constraints is that, if we evaluate the square of some degree-$d$ polynomial $q$ in the SDP monomials, the polynomial value will be non-negative! This is because, if $\hat q$ is the vector of coefficients of the polynomial $q$, then $q^2(X) = \hat{q}^\top X \hat{q} \ge 0$.
<p>
<blockquote><b>Example 2</b> <em> Consider the square polynomial $(x_1 + c \cdot x_2)^2$. We would evaluate this square in $X$ by taking the quadratic form \[ \begin{bmatrix} 0 & 1 & c \end{bmatrix} \begin{bmatrix} X_{\emptyset} & X_{\{1\}} & X_{\{2\}} \\ X_{\{1\}} & X_{\{1,1\}} & X_{\{1,2\}} \\ X_{\{2\}} & X_{\{2,1\}} & X_{\{2,2\}}\\ \end{bmatrix} \begin{bmatrix} 0\\ 1 \\ c \end{bmatrix} = X_{\{1,1\}} + c\cdot X_{\{1,2\}} + c\cdot X_{\{2,1\}} + c^2 \cdot X_{\{2,2\}}. \] </em></blockquote>
<p>
<p>
<br>
We now formalize the above definition.
<blockquote><b>Definition 1 (Sum-of-Squares Relaxation for $Q$ at degree $2d$)</b> <em> Given a polynomial optimization problem $Q$ with $\deg(p)\le 2d$ and $\deg(g_i) \le 2d\ \forall i \in [m]$, we define the <em>degree-$2d$ sum-of-squares relaxation</em> for $Q$, $\sos_{d}(Q)$.
<p>
We define a variable $X_{S}$ for each unordered multiset $S \subset [n]$ of size $|S| \le 2d$, and define the $(n+1)^d \times (n+1)^d$ matrix $X$, with rows and columns indexed by ordered multisets $U,V \subset [n]$, so that the $U,V$th entry of $X$ contains the variable $X_{U\cup V}$. Define the linear operator $\tilde{E}:\text{polynomials}_{\le 2d}\to \R$ such that $\tilde{E}[\prod_{i\in S} x_i] = X_S$ for $|S| \le 2d$. Then, \[ \sos_d(Q) = \left\{ \max \ \tilde{E}[p(x)] \quad s.t.\quad \begin{aligned} &X \succeq 0,\\ &X_{\emptyset} = 1,\\ &\tilde{E}[ g_i(x)\cdot\prod_{j\in U} x_j] = 0\quad \forall i \in [m], U \subset [n], \deg(g_i) + |U| \le 2d \end{aligned} \right\} \] </em></blockquote>
<p>
<p>
<p>
<b>Sum-of-Squares Hierarchy.</b> Earlier, we only mentioned that we must have $2d \ge \deg(p), \deg(g_i) \forall i \in [m]$. In fact, we can choose $d$ to be as large as we wish--as long as we are willing to solve an SDP with $n^{O(d)}$ variables and $n^{O(d)}$ constraints. Taking successively larger values for $d$ gives us a systematic way of adding constraints to our program, giving us a family of larger but more powerful programs as we increase the value of $d$; this is why we call the family of relaxations $\{\sos_d\}_{d = 1}^{\infty}$ the <em>sum-of-squares hierarchy</em>.
<p>
<p>
<b>How to make sense of the Sum-of-Squares SDP relaxation?</b> In the Goemans-Williamson SDP relaxation, there is a natural interpretation of the SDP as a vector program: if we view the positive semidefinite matrix solution to the SDP, $X$, according to its Cholesky decomposition $X = VV^{\top}$, then we can identify each node in the underlying graph $G$ with a unit vector corresponding to a row of the matrix $V$, and we can see that the objective function tries to push vectors corresponding to adjacent nodes apart on the unit sphere.
<p>
This geometric intuition is extremely crisp, but unfortunately it is hard to come up with an analogue of this in programs where we care about more than $2$ variables interacting at a time (i.e. when we have variables $X_S$ with $|S| \ge 3$). <em>As of now, we do not have a similar geometric understanding of general sum-of-squares SDP relaxations.</em> We can instead develop alternative ways to (partially) understand these SDP relaxations.
<p>
<p>
<b>Pseudomoments.</b> As a start, one perspective is to think of the variables $X_S$ as the “moments” of a fake distribution over solutions to the program $Q$.
<p>
If we were to actually solve the (non-convex) problem $Q$ (using some inefficient algorithm), what we would have is either a single solution $y^*\in \R^n$, or a distribution over some set of solutions $Y\subset \R^n$, which maximize $p$, so that \[ OPT(Q) = \E_{y \in Y} [p(y)]. \] We cannot expect that the solution to the relaxation $\sos_d(Q)$ comes from an <em>actual</em> distribution over feasible solutions, but our constraints ensure that it still satisfies some of the properties of actual distributions.<sup><a href='#footnote2' onclick="toggle_display_nojump('footnote2');">2</a></sup><span class='sidenote' id='footnote2'><a name='footnote2' href='#footnote2'>2.</a> Because of our constraints on $\sos_d(Q)$, the pseudoexpectation $\tilde{E}$ satisfies linearity of expectation \[ \tilde{E}[\alpha\cdot q_1(x) + \beta \cdot q_2(x)] = \alpha \cdot \tilde{E}[q_1(x)] + \beta \cdot \tilde{E}[q_2(x)]\quad \text{if} \quad \deg(q_1),\deg(q_2) \le 2d,\] and also the non-negativity of low-degree squares, \[ \tilde{E}[q(x)^2] \ge 0 \quad \text{if} \quad \deg(q)\le d. \]
<p>
</span>
For this reason we can also call the solution to $\sos_d(Q)$ a <em>pseudodistribution</em>, and that is why we use the notation \[ \tilde{E}\left[\prod_{i\in S} x_i \right] = X_S. \] In other words, we interpret the variable $X_S$ as being the <em>pseudomoment</em> of the monomial $\prod_{i\in S} x_i$ under a <em>pseudodistribution</em> over solutions to $Q$.
<p>
Thinking about the SDP solution in this way can be helpful in designing algorithms (and in proving lower bounds), but I will not discuss this perspective further here (maybe in a future post).
<p>
<h2 class='tex'>2. Sum-of-Squares Proofs </h2>
<p>
One immediate question is, why should the sum-of-squares relaxation be a good relaxation? When we design SDP algorithms for maximization problems, we want to bound \[ OPT(Q) \le OPT(\sos_d(Q)) \le \alpha \cdot OPT(Q), \] for $\alpha$ as close to $1$ as possible. Why should we expect $\alpha$ to be small?
<p>
We can give a concrete but somewhat technical answer to this question by considering the dual program: the dual program will give us a “sum-of-squares” proof of an upper bound on the primal program. In my opinion this is most easily explained via demonstration, so let's write down the dual program.
<p>
For convenience, we'll start by re-writing the primal program $\sos_d(Q)$ in a matrix-based notation. For two matrices $A,B$ of equal dimension, define the inner product $\langle A,B \rangle = \sum_{(i,j)} A_{ij} B_{ij}$. Now, define $(n+1)^d \times (n+1)^d$ matrices $P,G_1,\ldots,G_{m}$ so that $\langle P, X \rangle = \tilde{E}[p(x)]$ and $\langle G_i,X\rangle = \tilde{E}[g_i(x)]$ (where we have redefined the polynomial constraints $g_1,\ldots,g_m$ to include most of our SDP constraints: symmetry/commutativity, and $g_i(x)\cdot X_U = 0$).
<blockquote><b>Example 3</b> <em> If we have $p(x) = \sum_{i}x_i^2$, then one could choose the matrix $P$ to contain the identity in the submatrix indexed by sets of cardinality $1$, and $0$ elsewhere. </em></blockquote>
<p>
<p>
Our program can now be written as the minimization problem, \[ \sos_d'(Q) = \left\{ \min_{X\succeq 0} - \langle P,X\rangle \quad s.t.\quad \langle G_i,X\rangle = 0 \quad \forall i \in [m], \langle J_{\emptyset},X\rangle = 1 \right\}, \] where $J_{\emptyset}$ is the matrix with a single $1$ in the entry $\emptyset,\emptyset$ and zeros elsewhere, and the constraint $\langle J_\emptyset, X\rangle = 1$ enforces normalization. The optimal value of $\sos_d'(Q)$ is the negation of the optimal value of $\sos_d(Q)$.
<p>
The dual is the SDP problem \[ \sos_d^+(Q) = \left\{ \max_{y\in \R^{m+1}} y_{\emptyset} \qquad s.t.\quad \left(-P - y_{\emptyset}\cdot J_{\emptyset} + \sum_{j\in[m]}y_j \cdot G_j\right) = S \succeq 0 \right\}. \] Fixing $y^*$ to be the optimal dual point, from the dual constraints we have that \[ P = -y^*_{\emptyset} \cdot J_{\emptyset} - S + \sum_{j} y^*_j\cdot G_j. \] By duality, we have that in the optimal solution of $\sos_d^+(Q)$, \[ y_{\emptyset}^*= c + OPT(\sos'_d(Q)) = c - OPT(\sos_d(Q)) \] for some $c \ge 0$, and therefore taking $S' = S + c\cdot J_{\emptyset} \succeq 0$, \[ P = OPT \cdot J_{\emptyset} - S' + \sum_{j} y^*_j\cdot G_j. \]
<p>
We will turn this matrix equation into a polynomial equation. Let $x \in \R^n$, and let $\tilde{x} = [1 \ x^{\top}]^\top$. Now, decompose $S'$ into a sum of rank-one terms, $S' =\sum_s ss^{\top}$ (obtained, e.g., from the Cholesky factorization). We take the quadratic form of the Kronecker power of $\tilde{x}$ with the left- and right-hand sides, \begin{align*} (\tilde{x}^{\otimes d})^{\top} P (\tilde{x}^{\otimes d}) &= OPT - \sum_s \langle s, \tilde{x}^{\otimes d}\rangle^2 + \sum_{j\in[m]} y_j^* \cdot (\tilde{x}^{\otimes d})^{\top} G_j (\tilde{x}^{\otimes d}) \end{align*} and re-writing each of the above vector products as polynomials, where $q_s$ is the polynomial encoded by the vector of coefficients $s$, \begin{align*} p(x) &= OPT -\sum_s q_s(x)^2 + \sum_{j\in[m]} y_j^* \cdot g_j(x). \end{align*} This final line is a <em>sum-of-squares proof</em> that the value of $p(x)$ cannot exceed $OPT(\sos_d(Q))$ on the feasible region: any feasible point $x \in \R^n$ evaluates to $0$ for each $g_j$, and the square polynomials $q_s(x)^2$ can never contribute positively to the right-hand side. We have thus proven the following theorem:
<blockquote><b>Theorem 2</b> <em> The dual of the SDP $\sos_d(Q)$ provides a degree-$d$ sum-of-squares proof that $p(x) \le OPT(\sos_d(Q))$ for all $x$ in the feasible region of $Q$. </em></blockquote>
<p>
<p>
At the start of this section, our goal was to understand how to bound \[ OPT(Q) \le OPT(\sos_d(Q)) \le \alpha \cdot OPT(Q). \] This theorem gives us a primal-dual tool for bounding the value of $\sos_d(Q)$--if we can provide a sum-of-squares proof of degree at most $d$ that $p(x) \le \alpha\cdot OPT(Q)$, then that sum-of-squares proof is a valid dual certificate!
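<p>
Here is a toy instance of such a certificate (an illustrative example of mine, not from the original exposition). For the one-variable program $\max x$ subject to $x^2 - 1 = 0$ (so $OPT = 1$), the polynomial identity \[ x = 1 - \tfrac{1}{2}(x-1)^2 + \tfrac{1}{2}(x^2 - 1) \] is a degree-$2$ sum-of-squares proof that $p(x) = x \le 1$ on the feasible set: the $g$-term vanishes at feasible points, and the squared term only subtracts. The snippet below just verifies the identity numerically:

```python
import random

# Verify the polynomial identity x = 1 - (1/2)(x-1)^2 + (1/2)(x^2 - 1)
# at many random points; it holds for ALL x, which is what makes it a
# sum-of-squares certificate rather than a pointwise check.
random.seed(0)
for _ in range(100):
    x = random.uniform(-10, 10)
    lhs = x
    rhs = 1 - 0.5 * (x - 1) ** 2 + 0.5 * (x ** 2 - 1)
    assert abs(lhs - rhs) < 1e-9
```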
<p>
<p>
<b>Degree of the proof.</b> Notice that the dual can only use polynomials of degree at most $d$ in the sum-of-squares proof. So, suppose now that we write down two SDP relaxations for $Q$: $\sos_d(Q)$ and $\sos_{d'}(Q)$ for some $d' > d$. Then clearly, \[ OPT(\sos_{d}(Q)) \ge OPT(\sos_{d'}(Q)) \ge OPT(Q), \] since the degree-$d'$ sum-of-squares program contains more constraints than the degree-$d$ program.
<p>
In the primal, it is difficult to understand exactly what these additional constraints buy you. From the perspective of the dual, the power of these additional constraints becomes clearer: the dual now has access to sum-of-squares proofs that use polynomials of <em>higher degree</em>, and this additional power may allow the dual to prove a potentially tighter upper bound. This is still a relatively mysterious condition, but in a later post I will give some concrete examples of situations in which it helps.
<p>
<h2 class='tex'>3. Planted Sparse Vector </h2>
<p>
In this section, we give one algorithmic application: the planted sparse vector problem.
<p>
Given an $n \times d$ matrix $A$, distinguish between the following two cases:
<ul> <li> If the columns of $A$ are uniformly sampled from a $d$-dimensional subspace of $\R^n$ which contains a vector with at most $k < n/100$ nonzero entries, return YES, <li> If the columns of $A$ are sampled from a uniformly random $d$-dimensional subspace of $\R^n$, return NO with high probability.
</ul>
This is a somewhat simple variant of the problem--other variants ask you to find the sparse vector as well. The exposition for this pared-down “distinguishing” version is simpler, and gets the main ideas across.
<p>
Without loss of generality, we may apply a random rotation $R \in \R^{d\times d}$ to the columns of $A$, and then normalize by the maximum column norm, so that we work with $A \leftarrow \frac{1}{\max_{i\in[d]} \|ARe_i\|}AR$. This ensures that the columns of $A$ are roughly orthogonal and all have norm roughly $1$--for the remainder of the post we will assume that these conditions hold.
<p>
We introduce the following polynomial optimization problem $Q_{sparse}$ for the planted sparse vector problem: \[ Q_{sparse}(A) = \left\{ \max_{x \in \R^d} \| Ax \|^4_4 \qquad s.t. \qquad \|x\|^2_2 = 1 \right\} \] In other words, among coefficient vectors $x$ of unit $2$-norm, we want to find the linear combination $Ax$ of the columns of $A$ that maximizes the $4$-norm. This program picks out sparse vectors over balanced vectors: a unit vector $e_i$ with only one nonzero entry has $\|e_i\|^2_2 = \|e_i\|^4_4 = 1$, while a unit vector $v$ with all $n$ entries of the same magnitude has $\|v\|_4^4 = n \cdot (1/\sqrt{n})^4 = 1/n \ll \|v\|_2^2$.
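<p>
A back-of-the-envelope check of this $4$-norm gap, with illustrative (made-up) values of $n$ and $k$:

```python
import math

n, k = 10000, 16

# k-sparse unit vector: k entries of magnitude 1/sqrt(k).
sparse_44 = k * (1 / math.sqrt(k)) ** 4    # = 1/k
# fully balanced unit vector: n entries of magnitude 1/sqrt(n).
dense_44 = n * (1 / math.sqrt(n)) ** 4     # = 1/n

assert abs(sparse_44 - 1 / k) < 1e-12
assert abs(dense_44 - 1 / n) < 1e-15
# The gap between 1/k and 1/n is what the SDP objective exploits.
assert sparse_44 > 100 * dense_44
```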
<p>
We prove the following theorem:
<blockquote><b>Theorem 3</b> <em>(Barak-Brandao-Harrow-Kelner-Steurer-Zhao '12) <a name="thmplsp"></a> If $1/k \ge \tilde{O}(\sqrt{d^3/n^3} + 1/n)$, then $\sos_4(Q_{sparse}(A))$ solves the planted $k$-sparse vector in a random subspace problem. </em></blockquote>
<p>
<p>
We will prove this by showing that the value of the program is large in the planted case, and small in the random case. It is actually possible to prove a better tradeoff between $k,d$ and $n$, but to simplify the arguments, we prove a weaker theorem. For the full details, see [Barak-Brandao-Harrow-Kelner-Steurer-Zhao '12].
<p>
<em>Proof:</em> If the span of the columns of $A$ actually contains a $k$-sparse vector $v^*$, then $\|v^*\|_4^4$ is minimized when all nonzero entries of $v^*$ have equal magnitude. So, if we normalize $v^*$ so that $\|v^*\| = 1$, \[ \|v^*\|^4_4 \ge k\cdot \left(\frac{1}{\sqrt{k}}\right)^{4} = \frac{1}{k}. \]
<p>
We will show that in the random case, the value is bounded by a function of $n$ and $d$:
<blockquote><b>Lemma 4</b> <em><a name="lemrandomcase"></a> If $A$ has iid Gaussian entries with $\E[A_{ij}^2] = \frac{1}{n}$, then with high probability the program $\sos_4(Q_{sparse}(A))$ has optimal value $\tilde{O}(\sqrt{d^3/n^3} + 1/n)$. </em></blockquote>
<p>
Given this lemma, the proof of Theorem <a href="#thmplsp">3</a> is essentially trivial--we know that the objective value is at most $\tilde{O}(\sqrt{d^3/n^3} + 1/n)$ with high probability in the random case, and at least $1/k$ in the planted case, and so the objective value of $\sos_4(Q)$ distinguishes so long as $1/k \ge\tilde{O}(\sqrt{d^3/n^3} + 1/n)$. $$\tag*{$\blacksquare$}$$
<p>
Now, we prove the lemma, using sum-of-squares proofs to bound the objective value of the SDP in the random case. <em>Proof of Lemma <a href="#lemrandomcase">4</a>:</em> For any symmetric $d^2 \times d^2$ matrix $M$, there is a sum-of-squares proof of the following fact: \begin{align*} \left\langle x^{\otimes 2}(x^{\otimes 2})^{\top}, \ M\right\rangle &\le \left\langle x^{\otimes 2}(x^{\otimes 2})^{\top}, \ \|M\|\cdot \Id \right\rangle. \end{align*} The proof simply follows because $\|M\|\cdot \Id \succeq M$, and therefore $M = \|M\|\cdot \Id - S$ for some $S \succeq 0$; by taking the Cholesky decomposition of $S$ and using the vectors as polynomial coefficients, this gives us a sum-of-squares proof of the inequality.
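<p>
The inequality $M \preceq \|M\|\cdot\Id$ and the resulting quadratic-form bound are easy to verify numerically for a random symmetric matrix (a sanity check on the fact, not part of the proof):

```python
import numpy as np

np.random.seed(1)
d = 5
A = np.random.randn(d, d)
M = (A + A.T) / 2   # a random symmetric matrix

# Operator norm of a symmetric matrix = largest |eigenvalue|.
op_norm = np.max(np.abs(np.linalg.eigvalsh(M)))

# S = ||M|| * Id - M is PSD, so M <= ||M|| * Id.
S = op_norm * np.eye(d) - M
assert np.all(np.linalg.eigvalsh(S) >= -1e-8)

# Hence for every x: x^T M x <= ||M|| * ||x||^2.
x = np.random.randn(d)
assert x @ M @ x <= op_norm * (x @ x) + 1e-9
```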
<p>
We will use this sum-of-squares fact to bound the SDP value of our objective function. First, we re-interpret our objective function as an inner product of two matrices. Let $a_1,\ldots,a_n$ be the rows of $A$. We will re-write our objective function as a matrix inner-product: \begin{align*} \|Ax\|^4_4 &= \sum_{i}\langle a_i, x \rangle^4 = \left\langle x^{\otimes 2}(x^{\otimes 2})^{\top}, \ \sum_i (a_i \otimes a_i)(a_i \otimes a_i)^{\top}\right\rangle. \end{align*} At this point, we could apply the above trick, but unfortunately, the maximum eigenvalue of $\sum_i (a_i\otimes a_i)(a_i \otimes a_i)^{\top}$ is $\approx d/n$--much larger than our goal of $\sqrt{d^3/n^3}$. This is because at indices $(\alpha\beta,\gamma\delta)$ where $\alpha = \beta$ and $\gamma = \delta$, our matrix has positive entries, whereas most entries have a random sign. These positive entries create a large eigenvalue in the matrix.
<p>
So, we will decompose this further--we will separate the portion of the matrix with entries corresponding to even-multiplicity indices. Define $B_{\neq}$ to be the matrix $\sum_{i} (a_i \otimes a_i)(a_i \otimes a_i)^{\top}$ in which all even-multiplicity entries are zeroed out. \begin{align*} \|Ax\|^4_4 &= \left\langle x^{\otimes 2}(x^{\otimes 2})^{\top}, \ B_{\neq} \right\rangle + \sum_{\alpha,\beta \in [d]} x_{\alpha}^2x_{\beta}^2\cdot \sum_{i} a_{i}(\alpha)^2 a_{i}(\beta)^2. \end{align*} By the above arguments, there is a sum-of-squares proof that \begin{align*} \left\langle x^{\otimes 2}(x^{\otimes 2})^{\top}, \ B_{\neq}\right\rangle &\le \left\langle x^{\otimes 2}(x^{\otimes 2})^{\top}, \ \|B_{\neq}\|\cdot \Id \right\rangle\\ &= \|B_{\neq}\|\cdot \sum_{\alpha,\beta\in[d]} x_\alpha^2 x_{\beta}^2 = \|B_{\neq}\|\cdot \left(\sum_{\alpha\in[d]} x_{\alpha}^2 \right)^2. \end{align*}
<p>
For the other term, we will use an even simpler bound. Let $c_{\alpha,\beta} = \sum_{i} a_i(\alpha)^2 a_i(\beta)^2$, for convenience. Also, let $c^* = \max_{\alpha,\beta} c_{\alpha,\beta}$. The following equality, \[ \sum_{\alpha,\beta\in[d]}c_{\alpha,\beta}\cdot x_{\alpha}^2 x_{\beta}^2 = c^*\cdot \sum_{\alpha, \beta} x_{\alpha}^2 x_{\beta}^2 - \left(\sum_{\alpha,\beta} (c^* - c_{\alpha,\beta}) \cdot x_{\alpha}^2 x_{\beta}^2\right), \] is a sum-of-squares proof that \[ \sum_{\alpha,\beta\in[d]}c_{\alpha,\beta}\cdot x_{\alpha}^2 x_{\beta}^2 \le c^*\cdot \sum_{\alpha, \beta} x_{\alpha}^2 x_{\beta}^2, \] because $c^* - c_{\alpha,\beta} \ge 0$ for all $\alpha,\beta$ by definition, and thus the parenthesized term is a sum-of-squares.
<p>
Putting the two arguments together, we have a sum-of-squares proof that \[ \|Ax\|_4^4 \le \left(\|B_{\neq}\| + c^*\right)\cdot \left(\sum_{\alpha\in[d]} x_{\alpha}^2\right)^2. \] Because we have the SDP constraint that $\|x\|_2^2 = 1$, the objective value is thus bounded by \[ \tilde{E}[\|Ax\|_4^4] \le (\|B_{\neq} \|+ c^*) \cdot \tilde{E}\left[\left(\sum_{\alpha\in [d]}x_{\alpha}^2\right)^2\right] = \|B_{\neq}\| + c^*. \] The final step in the proof consists of showing that with high probability over the choice of $A$, \[ \|B_{\neq}\| \le \tilde{O}(\sqrt{d^3/n^3})\quad \text{and}\quad c^* \le \tilde{O}(1/n). \] The first fact we can prove using a matrix Chernoff bound, and the second fact we can prove using a Chernoff bound and a union bound. This concludes the proof! $$\tag*{$\blacksquare$}$$
<p>
To get the theorem with the better parameters mentioned above, Barak et al.
remove the even-multiplicity indices more carefully: they project away from the subspace containing the vectors which correlate too much with the even-multiplicity entries (whereas we just zeroed them out).
This more careful treatment lets them prove a better matrix concentration result.
<p>
<h2 class='tex'>4. Other Resources </h2> Check out the following other resources for historical details and more sum-of-squares algorithms/lower bounds:
<ul> <li> For notes about SDPs and duality, I like these notes by Lap Chi Lau:
<p>
<a href="https://cs.uwaterloo.ca/~lapchi/cs270/notes.html">https://cs.uwaterloo.ca/~lapchi/cs270/notes.html</a>
<p>
I also like these notes by Anupam Gupta and Ryan O'Donnell:
<p>
<a href="https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/">https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/</a> <li> Lecture notes from Boaz Barak on sum-of-squares:
<p>
<a href="http://www.boazbarak.org/sos/">http://www.boazbarak.org/sos/</a>
<p>
<li> Lecture notes from Massimo Lauria on sum-of-squares and other relaxations for polynomial optimization <a href="http://www.csc.kth.se/~lauria/sos14/">http://www.csc.kth.se/~lauria/sos14/</a> <li> The introduction of this paper by Barak, Kelner and Steurer: <a href="https://arxiv.org/pdf/1312.6652v1.pdf">https://arxiv.org/pdf/1312.6652v1.pdf</a>
<p>
The appendix of the paper also contains many sum-of-squares proofs of basic inequalities (e.g. Cauchy-Schwarz) that can be of use for providing good dual certificates.
</ul>
Simple Lower Bounds for Small-bias SpacesPreetum Nakkiran
2016-06-03T00:00:00+00:00
http://learningwitherrors.org/2016/06/03/small-bias<script type="text/javascript">
function toggle_display_nojump(id) {
event.preventDefault();
var e = document.getElementById(id);
if(e.style.display == 'block')
e.style.display = 'none';
else
e.style.display = 'block';
return false; // prevent default action of jumping to anchor
}
</script>
<div style='display:none;'><script type='math/tex'> \renewcommand\qedsymbol{$\blacksquare$}
\newcommand{\1}{\mathbb{1}}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\newcommand{\x}{\times}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\renewcommand{\bar}{\overline}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\bmqty}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\innp}[1]{\langle #1 \rangle}
\newcommand{\F}{\mathbb{F}}
\renewcommand{\t}{\widetilde}
\newcommand{\note}[1]{&&\text{(#1)}}
</script></div>
<p>
I was reading about PRGs recently, and I think a lemma mentioned last time (used for Johnson-Lindenstrauss lower-bounds) can give simple lower-bounds for $\epsilon$-biased spaces.
<p>
Notice:
<ul> <li> $2^n$ mutually orthogonal vectors require dimension at least $2^n$, but $2^n$ “almost orthogonal” vectors with pairwise inner-products $|\innp{v_i, v_j}| \leq \epsilon$ exist in dimension $O(n/\epsilon^2)$, by Johnson-Lindenstrauss. <li> Sampling $n$ iid uniform bits requires a sample space of size $2^n$, but $n$ $\epsilon$-biased bits can be sampled from a space of size $O(n/\epsilon^2)$.
</ul>
<p>
First, let's look at $k$-wise independent sample spaces, and see how the lower-bounds might be extended to the almost $k$-wise independent case.
<p>
<i>Note: To skip the background, just see Lemma <a href="#lemrank">1</a>, and its application in Claim <a href="#claimkeps">3</a>. </i>
<!--more-->
<p>
<h2 class='tex'>1. Preliminaries </h2> By the “size of the sample space” we mean the following: for a sample space $S$ and $\pm 1$ random variables $X_i$, we generate bits $x_1, \dots x_n$ as an instance of the r.v.s $X_i$--that is, by drawing a sample $s \in S$ and setting $x_i = X_i(s)$. We would like to have $|S| \ll 2^n$, so that we can sample from it using fewer than $n$ bits.
<p>
Also, any random variable $X$ over $S$ can be considered as a vector $\t X \in \R^{|S|}$, with coordinates $\t X[s] := \sqrt{\Pr[s]} X(s)$. This is convenient because $\innp{\t X, \t Y} = \E[XY]$.
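<p>
For instance (a toy check of this identity, with two iid uniform bits):

```python
import itertools, math

# Sample space: two iid uniform +/-1 bits, under the uniform measure.
space = list(itertools.product([1, -1], repeat=2))
pr = 1 / len(space)

# Random variables X(s) = s_0 and Y(s) = s_0 * s_1, represented as vectors
# with coordinates sqrt(Pr[s]) * X(s).
Xv = [math.sqrt(pr) * s[0] for s in space]
Yv = [math.sqrt(pr) * s[0] * s[1] for s in space]

inner = sum(a * b for a, b in zip(Xv, Yv))
expect = sum(pr * s[0] * (s[0] * s[1]) for s in space)  # E[XY] = E[s_1] = 0
assert abs(inner - expect) < 1e-12
```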
<p>
<h2 class='tex'>2. Exact $k$-wise independence </h2> <a name="seckwise"></a> A distribution $D$ on $n$ bits is <em>$k$-wise independent</em> if any subset of $k$ bits are iid uniformly distributed. Equivalently, the distribution $D : \{\pm 1\}^n \to \R_{\geq 0}$ is $k$-wise independent iff the Fourier coefficients $\hat D(S) = 0$ for all $S \neq 0, |S| \leq k$.
<p>
$n$ such $k$-wise independent bits can be generated from a seed of length $O(k \log n)$ bits, using say Reed-Solomon codes. That is, the size of the sample space is $n^{O(k)}$. This size is optimal, as the below claim shows (adapted from Umesh Vazirani's lecture notes [<a href='#ref-Vaz99'>Vaz99</a>]).
<blockquote><b>Claim 1</b> <em> <a name="claimkwise"></a> Let $D$ be a $k$-wise independent distribution on $\{\pm 1\}$ random variables $x_1, \dots, x_n$, over a sample space $S$. Then, $|S| = \Omega_k(n^{k / 2})$. </em></blockquote>
<p>
<p>
<em>Proof:</em> For subset $T \subseteq [n]$, let $\chi_T(x) = \prod_{i \in T} x_i$ be the corresponding Fourier character. Consider these characters as vectors in $\R^{|S|}$ as described above, with \[\innp{\chi_A, \chi_B} = \E_{x \sim D}[\chi_A(x)\chi_B(x)] \]
<p>
Let $J$ be the family of all subsets of size $\leq k/2$. Note that, for $A, B \in J$, the characters $\chi_A, \chi_B$ are orthogonal: \begin{align*} \innp{\chi_A, \chi_B} &= \E_{x \sim D}[\chi_A(x)\chi_B(x)]\\ &= \E_{x \sim D}[(\prod_{i \in A \cap B} x_i^2)(\prod_{i \in A \Delta B} x_i)]\\ &= \E_{x \sim D}[\chi_{A \Delta B}(x)] \note{since $x_i^2 = 1$}\\ &= 0 \note{since $|A \Delta B| \leq k$, and $D$ is $k$-wise independent} \end{align*} Here $A \Delta B$ denotes symmetric difference, and the last equality is because $\chi_{A \Delta B}$ depends on $\leq k$ variables, so the expectation over $D$ is the same as over iid uniform bits.
<p>
Thus, the characters $\{\chi_A\}_{A \in J}$ form a set of $|J|$ mutually-orthogonal vectors in $\R^{|S|}$. So we must have $|S| \geq |J| = \Omega_k(n^{k/2})$. $$\tag*{$\blacksquare$}$$
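<p>
A minimal concrete instance of this phenomenon is the classical XOR construction (my example, not from the claim itself): $3$ pairwise-independent bits from a sample space of size $4$. One can check directly that every character on $\le 2$ bits is unbiased (so the corresponding vectors are orthogonal), while the full parity is completely biased:

```python
import itertools, math

# 3 pairwise-independent +/-1 bits from a size-4 sample space:
# (x1, x2, x1*x2), with x1, x2 iid uniform.
space = [(a, b, a * b) for a, b in itertools.product([1, -1], repeat=2)]

# Every character chi_T with 1 <= |T| <= 2 has mean 0 over the space.
for T in [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]:
    mean = sum(math.prod(s[i] for i in T) for s in space) / len(space)
    assert mean == 0

# But the triple parity is constant: the bits are 2-wise, not 3-wise, independent.
triple = sum(math.prod(s[i] for i in (0, 1, 2)) for s in space) / len(space)
assert triple == 1
```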
<p>
The key observation was relating independence of random variables to linear independence (orthogonality). Similarly, we could try to relate $\epsilon$-almost $k$-wise independent random variables to almost-orthogonal vectors.
<p>
<h2 class='tex'>3. Main Lemma </h2> This result is Theorem 9.3 from Alon's paper [<a href='#ref-Alo03'>Alo03</a>]. The proof is very clean, and Section 9 can be read independently. <sup><a href='#footnote1' onclick="toggle_display_nojump('footnote1');">1</a></sup><span class='sidenote' id='footnote1'><a name='footnote1' href='#footnote1'>1.</a> Theorem 9.3 is stated in terms of lower bounding the rank of a matrix $B \in \R^{N \x N}$ where $B_{i,i} = 1$ and $|B_{i, j}| \leq \epsilon$. The form stated here follows by defining $B_{i, j} := \innp{v_i, v_j}$. </span>
<p>
<blockquote><b>Lemma 1</b> <em> <a name="lemrank"></a> Let $\{v_i\}_{i \in [N]}$ be a collection of $N$ unit vectors in $\R^d$, such that $|\innp{v_i, v_j}| \leq \epsilon$ for all $i \neq j$. Then, for $\frac{1}{\sqrt{N}} \leq \epsilon \leq 1/2$, \[d \geq \Omega\left(\frac{\log N}{\epsilon^2 \log(1/\epsilon)}\right)\] </em></blockquote>
<p>
<p>
This lower-bound on the dimension of “almost-orthogonal” vectors translates to a nearly-tight lower-bound on Johnson-Lindenstrauss embedding dimension, and will also help us below.
<p>
<h2 class='tex'>4. Small bias spaces </h2> A distribution $D$ on $n$ bits is <em>$\epsilon$-biased w.r.t. linear tests</em> (or just “$\epsilon$-biased”) if all $\F_2$-linear tests are at most $\epsilon$-biased. That is, for $x \in \{\pm 1\}^n$, the following holds for all subsets $S \subseteq [n]$: \[\left|\E_{x \sim D}[\chi_S(x)]\right| = \left|\Pr_{x \sim D}[\chi_S(x) = 1] - \Pr_{x \sim D}[\chi_S(x) = -1]\right| \leq \epsilon\] Similarly, a distribution is <em>$\epsilon$-biased w.r.t. linear tests of size $k$</em> (or “$k$-wise $\epsilon$-biased”) if the above holds for all subsets $S$ of size $\leq k$.
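<p>
For toy sample spaces, the bias of every linear test can be computed by brute force; the helper `max_bias` below is a hypothetical illustration of the definition, checked on the full cube $\{\pm 1\}^n$ (which is $0$-biased):

```python
import itertools, math

def max_bias(space):
    """Largest bias of any nonempty linear test over a uniform sample space
    of +/-1 strings (brute force; only sensible for tiny n)."""
    n = len(space[0])
    worst = 0.0
    for r in range(1, n + 1):
        for T in itertools.combinations(range(n), r):
            bias = abs(sum(math.prod(s[i] for i in T) for s in space)) / len(space)
            worst = max(worst, bias)
    return worst

# The full cube {+/-1}^3 is exactly 0-biased: every nonempty parity has mean 0.
cube = list(itertools.product([1, -1], repeat=3))
assert max_bias(cube) == 0.0
```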
<p>
There exists an $\epsilon$-biased space on $n$ bits of size $O(n / \epsilon^2)$: a set of $O(n / \epsilon^2)$ random $n$-bit strings will be $\epsilon$-biased w.h.p. Further, explicit constructions exist that are nearly optimal: the first such construction was in [<a href='#ref-NN93'>NN93</a>], and was nicely simplified by [<a href='#ref-AGHP92'>AGHP92</a>] (both papers are very readable).
<p>
These can be used to sample $n$ bits that are $k$-wise $\epsilon$-biased, from a space of size almost $O(k \log(n)/\epsilon^2)$; much better than the size $\Omega(n^k)$ required for perfect $k$-wise independence. For example<sup><a href='#footnote2' onclick="toggle_display_nojump('footnote2');">2</a></sup><span class='sidenote' id='footnote2'><a name='footnote2' href='#footnote2'>2.</a> This can be done by composing an $(n, k')$ ECC with dual-distance $k$ and an $\epsilon$-biased distribution on $k' = k\log n$ bits. Basically, use a linear construction for generating $n$ exactly $k$-wise independent bits from $k'$ iid uniform bits, but use an $\epsilon$-biased distribution on $k'$ bits as the seed instead. </span>, see [<a href='#ref-AGHP92'>AGHP92</a>] or the lecture notes [<a href='#ref-Vaz99'>Vaz99</a>].
<p>
<h3 class='tex'>4.1. Lower Bounds</h3> The best lower bound on the size of an $\epsilon$-biased space on $n$ bits seems to be $\Omega(\frac{n}{\epsilon^2 \log(1/\epsilon)})$, which is almost tight. The proofs of this in the literature (to my knowledge) work by exploiting a nice connection to error-correcting codes: Say we have a sample space $S$ under the uniform measure. Consider the characters $\chi_T(x)$ as vectors $\t \chi_T \in \{\pm 1\}^{|S|}$ defined by $\t \chi_T[s] = \chi_T(x(s))$, similar to what we did in Section <a href="#seckwise">2</a>. The set of $2^n$ vectors $\{\t \chi_T\}_{T \subseteq [n]}$ defines the codewords of a linear code of length $|S|$ and dimension $n$. Further, the Hamming weight of each codeword (the number of $-1$s in each codeword, in our context) is within $|S|(\frac{1}{2} \pm \epsilon)$, since each parity $\chi_T$ is at most $\epsilon$-biased. Thus this code has relative distance at least $\frac{1}{2} - \epsilon$, and we can use sphere-packing-type bounds from coding theory to lower-bound the codeword length $|S|$ required to achieve such a distance. Apparently the “McEliece-Rodemich-Rumsey-Welch bound” works in this case; a more detailed discussion is in [<a href='#ref-AGHP92'>AGHP92</a>, Section 7].
<p>
We can also recover this same lower bound using Lemma <a href="#lemrank">1</a> in a straightforward way.
<p>
<blockquote><b>Claim 2</b> <em> <a name="claimepsbias"></a> Let $D$ be an $\epsilon$-biased distribution on $n$ bits $x_1, \dots, x_n$, over a sample space $S$. Then, \[|S| = \Omega\left(\frac{n}{\epsilon^2 \log(1/\epsilon)}\right)\] </em></blockquote>
<p>
<em>Proof:</em> Following the proof of Claim <a href="#claimkwise">1</a>, consider the Fourier characters $\chi_T(x)$ as vectors $\t \chi_T \in \R^{|S|}$, with $\t \chi_T[s] = \sqrt{\Pr[s]} \chi_T(x(s))$. Then, for all distinct subsets $A, B \subseteq [n]$, we have \[\innp{\t \chi_A, \t \chi_B} = \E_{x \sim D}[\chi_A(x)\chi_B(x)] = \E_{x \sim D}[\chi_{A \Delta B}(x)]\] Since $D$ is $\epsilon$-biased, $\left|\E_{x \sim D}[\chi_{A \Delta B}(x)]\right| \leq \epsilon$ for all $A \neq B$. Thus, applying Lemma <a href="#lemrank">1</a> to the collection of $N = 2^n$ unit vectors $\{\t \chi_T\}_{T \subseteq [n]}$ gives the lower bound $|S| = \Omega\left(\frac{n}{\epsilon^2 \log(1/\epsilon)}\right)$. $$\tag*{$\blacksquare$}$$
<p>
This also nicely generalizes the proof of Claim <a href="#claimkwise">1</a>, to give an almost-tight lower bound on spaces that are $\epsilon$-biased w.r.t linear tests of size $k$.
<p>
<blockquote><b>Claim 3</b> <em> <a name="claimkeps"></a> Let $D$ be a distribution on $n$ bits that is $\epsilon$-biased w.r.t. linear tests of size $k$. Then, the size of the sample space is \[|S| = \Omega\left(\frac{k \log (n/k)}{\epsilon^2 \log(1/\epsilon)}\right)\] </em></blockquote>
<p>
<em>Proof:</em> As before, consider the Fourier characters $\chi_T(x)$ as vectors $\t \chi_T \in \R^{|S|}$, with $\t \chi_T[s] = \sqrt{\Pr[s]} \chi_T(x(s))$. Let $J$ be the family of all subsets $T \subseteq [n]$ of size $\leq k/2$. Then, for all distinct subsets $A, B \in J$, we have \[\left|\innp{\t \chi_A, \t \chi_B}\right| = \left|\E_{x \sim D}[\chi_{A \Delta B}(x)]\right| \leq \epsilon\] since $|A \Delta B| \leq k$, and $D$ is $\epsilon$-biased w.r.t. such linear tests. Applying Lemma <a href="#lemrank">1</a> to the collection of $|J|$ unit vectors $\{\t \chi_T\}_{T \in J}$ gives $|S| = \Omega(\frac{\log |J|}{\epsilon^2 \log(1/\epsilon)}) = \Omega(\frac{k \log (n/k)}{\epsilon^2 \log(1/\epsilon)})$, using $\log |J| = \Theta(k \log(n/k))$. $$\tag*{$\blacksquare$}$$
<p>
<i>Note: I couldn't find the lower bound given by Claim <a href="#claimkeps">3</a> in the literature, so please let me know if you find a bug or reference.
<p>
Also, these bounds do not directly imply nearly tight lower bounds for <em>$\epsilon$-almost $k$-wise independent</em> distributions (that is, distributions whose marginals on every set of $k$ variables are $\epsilon$-close to the uniform distribution, in $\ell_{\infty}$ or $\ell_{1}$ norm), essentially because of the loss incurred in moving between closeness in the Fourier domain and closeness of distributions. <sup><a href='#footnote3' onclick="toggle_display_nojump('footnote3');">3</a></sup><span class='sidenote' id='footnote3'><a name='footnote3' href='#footnote3'>3.</a> E.g., $\epsilon$-biased $\implies$ $\epsilon$-close in $\ell_{\infty}$, but $\epsilon$-close in $\ell_{\infty}$ can be up to $2^{k-1}\epsilon$-biased. And $2^{-k/2}\epsilon$-biased $\implies$ $\epsilon$-close in $\ell_{1}$, but not the other direction. </span> </i>
<p>
<br><hr><h3>References</h3>
<p>
<a name='ref-AGHP92'>[AGHP92]</a> Noga Alon, Oded Goldreich, Johan Håstad, and René Peralta.
Simple constructions of almost k-wise independent random variables.
<em>Random Structures &amp; Algorithms</em>, 3(3):289--304, 1992.
URL: <a href="http://www.tau.ac.il/~nogaa/PDFS/aghp4.pdf">http://www.tau.ac.il/~nogaa/PDFS/aghp4.pdf</a>.
<p>
<p>
<a name='ref-Alo03'>[Alo03]</a> Noga Alon.
Problems and results in extremal combinatorics, Part I.
<em>Discrete Math</em>, 273:31--53, 2003.
URL: <a href="http://www.tau.ac.il/~nogaa/PDFS/extremal1.pdf">http://www.tau.ac.il/~nogaa/PDFS/extremal1.pdf</a>.
<p>
<p>
<a name='ref-NN93'>[NN93]</a> Joseph Naor and Moni Naor.
Small-bias probability spaces: Efficient constructions and
applications.
<em>SIAM journal on computing</em>, 22(4):838--856, 1993.
URL: <a href="http://www.wisdom.weizmann.ac.il/~naor/PAPERS/bias.pdf">http://www.wisdom.weizmann.ac.il/~naor/PAPERS/bias.pdf</a>.
<p>
<p>
<a name='ref-Vaz99'>[Vaz99]</a> Umesh Vazirani.
k-wise independence and epsilon-biased k-wise independence.
1999.
URL:
<a href="https://people.eecs.berkeley.edu/~vazirani/s99cs294/notes/lec4.pdf">https://people.eecs.berkeley.edu/~vazirani/s99cs294/notes/lec4.pdf</a>.
<p>
Fast Johnson-Lindenstrauss
Preetum Nakkiran
2016-05-27T00:00:00+00:00
http://learningwitherrors.org/2016/05/27/fast-johnson-lindenstrauss<script type="text/javascript">
function toggle_display_nojump(id) {
event.preventDefault();
var e = document.getElementById(id);
if(e.style.display == 'block')
e.style.display = 'none';
else
e.style.display = 'block';
return false; // prevent default action of jumping to anchor
}
</script>
<div style='display:none;'><script type='math/tex'> \renewcommand\qedsymbol{$\blacksquare$}
\newcommand{\1}{\mathbb{1}}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\newcommand{\x}{\times}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\renewcommand{\bar}{\overline}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\bmqty}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\innp}[1]{\langle #1 \rangle}
\renewcommand{\t}{\widetilde}
</script></div>
<p>
The Johnson-Lindenstrauss (JL) Transform says that, informally, we can embed high-dimensional points into a much lower dimension, while still preserving their pairwise distances. In this post we'll start with the classical JL transform, then focus on the Fast JL Transform (FJLT) by Ailon and Chazelle [<a href='#ref-AC09'>AC09</a>], which achieves the JL embedding more efficiently (w.r.t. runtime and randomness). We'll look at the FJLT from the perspective of “preconditioning” a sparse estimator, which comes with nice intuition from Fourier duality. We conclude by mentioning more recent developments in this area (faster/sparser/derandomizeder).
<!--more-->
<p>
<b>Motivation.</b> Besides being an interesting structural result, the JL transform has algorithmic and machine-learning applications. E.g., any application that only depends on approximate pairwise distances (such as nearest-neighbors), or even on pairwise inner-products<sup><a href='#footnote1' onclick="toggle_display_nojump('footnote1');">1</a></sup><span class='sidenote' id='footnote1'><a name='footnote1' href='#footnote1'>1.</a> Because, if the norms $||x_i-x_j||$, $||x_i||$, and $||x_j||$ are approximately preserved, then so is the inner-product $\innp{x_i,x_j}$. </span> (such as kernel methods) may equivalently work with the dimensionality-reduced version of their input. This also has applications in sketching and sparsification.
<p>
<h2 class='tex'>1. Classical JL </h2> The JL embedding theorem is:
<p>
<blockquote><b>Theorem 1</b> <em> Given points $x_1, \dots, x_n \in \R^d$, and $\epsilon > 0$, there exists an embedding $f: \R^d \to \R^k$ such that \[\forall i, j: \quad (1-\epsilon) ||x_i - x_j||_2 \leq ||f(x_i) - f(x_j)||_2 \leq (1+\epsilon) ||x_i - x_j||_2 \] and $k = O(\epsilon^{-2}\log n)$. </em></blockquote>
<p>
That is, the embedding roughly preserves pairwise distances in $\ell_2$. Think of the regime $n \gg d \gg k$. Note that the target dimension $k = O(\epsilon^{-2}\log n)$ is (perhaps surprisingly) independent of the source dimension $d$, and only logarithmic in the number of points $n$.
<p>
In fact, a random linear map works as an embedding (w.h.p.). This is established by the following lemma.
<p>
<blockquote><b>Lemma 2</b> <em> <a name="lemjl"></a> For any $\delta > 0$, set $k = O(\epsilon^{-2}\log(1/\delta))$, and let $A \in \R^{k \x d}$ be a random matrix with iid normal $\N(0, 1/k)$ entries. Then \[\forall x \in \R^d: \quad \Pr_{A}\left[~ ||Ax||_2^2 \in (1 \pm \epsilon)||x||_2^2 ~\right] \geq 1- \delta\] </em></blockquote>
<p>
That is, a random matrix preserves the norm of vectors with good probability. To see that this implies the JL Theorem, consider applying the matrix $A$ on the $O(n^2)$ vectors of pairwise differences $(x_i - x_j)$. For fixed $i,j$, the lemma implies that $||A(x_i - x_j)|| \approx ||x_i - x_j||$ except w.p. $\delta$. Thus, setting $\delta = 1/n^3$ and union bounding, we have that $A$ preserves the norm of <em>all</em> differences $(x_i - x_j)$ with high probability. Letting the embedding map $f = A$, we have \[\forall i,j: \quad ||f(x_i) - f(x_j)|| = ||Ax_i - Ax_j|| = ||A(x_i - x_j)|| \approx ||x_i - x_j||\] as desired. <sup><a href='#footnote2' onclick="toggle_display_nojump('footnote2');">2</a></sup><span class='sidenote' id='footnote2'><a name='footnote2' href='#footnote2'>2.</a> Note that it was important that $f$ be linear for us to reduce preserving pairwise-distances to preserving norms. </span>
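<p>
As a quick numerical sanity check of Lemma <a href="#lemjl">2</a>, here is a pure-Python sketch (the dimensions, seed, and helper names are arbitrary choices of mine):

```python
import math
import random

def jl_matrix(k, d, rng):
    # iid N(0, 1/k) entries, as in the lemma (gauss takes the std dev)
    return [[rng.gauss(0, 1 / math.sqrt(k)) for _ in range(d)]
            for _ in range(k)]

def apply_matrix(A, x):
    # ordinary dense matrix-vector product
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

rng = random.Random(0)
d, k = 1000, 400
x = [rng.gauss(0, 1) for _ in range(d)]
Ax = apply_matrix(jl_matrix(k, d, rng), x)
ratio = sum(v * v for v in Ax) / sum(v * v for v in x)
print(ratio)  # close to 1, as the lemma predicts
```

As noted below, filling $A$ with iid $\pm\frac{1}{\sqrt{k}}$ entries instead (e.g. `rng.choice((-1, 1)) / math.sqrt(k)`) would work just as well.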
<p>
<b>Runtime.</b> The runtime of embedding a single vector with this construction (setting $\delta=1/n^3$ and $\epsilon=O(1)$) is $O(dk) = O(d \log n)$. In the Fast JL below, we will show how to do this in time almost $O(d \log d)$.
<p>
We will now prove Lemma <a href="#lemjl">2</a>. First, note that our setting is scale-invariant, so it suffices to prove the lemma for all unit vectors $x$.
<p>
As a warm-up, let $g \in \R^d$ be a random vector with entries iid $\N(0, 1)$, and consider the inner-product \[Y := \innp{g, x}\] (for a fixed unit vector $x \in \R^d$). Notice the random variable $Y$ has expectation $\E[Y] = 0$, and variance \[\E[Y^2] = \E[\innp{g,x}^2] = ||x||^2\] <b>Thus, $Y^2$ is an unbiased estimator for $||x||^2$.</b> Further, it concentrates well: assuming (wlog) that $||x||=1$, we have $Y \sim \N(0, 1)$, so $Y^2$ has constant variance (and moreover, $Y^2 - 1$ is subexponential, with subgaussian<sup><a href='#footnote3' onclick="toggle_display_nojump('footnote3');">3</a></sup><span class='sidenote' id='footnote3'><a name='footnote3' href='#footnote3'>3.</a> Subgaussian with parameter $\sigma$ basically means tail probabilities behave as a Gaussian with variance $\sigma^2$ would behave. Formally, a zero-mean random variable $X$ is “subgaussian with parameter $\sigma$” if: $\E[e^{\lambda X}] \leq e^{\sigma^2\lambda^2/2}$ for all $\lambda \in \R$. </span> tails for deviations up to a constant).
<p>
This would correspond to a random linear projection into 1 dimension, where the matrix $A$ in Lemma <a href="#lemjl">2</a> is just $A = g^T$. Then $||Ax||^2 = \innp{g, x}^2 = Y^2$. However, this estimator does not concentrate well enough (we want tail probability $\delta$ to eventually be inverse-poly, not a constant).
<p>
We can get a better estimator for $||x||^2$ by averaging many iid copies. In particular, for any iid subgaussian random variables $Z_i$, with expectation $1$ and subgaussian parameter $\sigma$, the Hoeffding bound gives \[\Pr\left[ \left|\left(\frac{1}{k}\sum_{i=1}^k Z_i \right) - 1 \right| > \epsilon \right] \leq e^{-\Omega(k\epsilon^2/\sigma^2)}\] Applying this for $Z_i = Y_i^2$ and $\sigma = O(1)$, we can set $k = O(\epsilon^{-2}\log(1/\delta))$ so the above tail probability is bounded by $\delta$.
<p>
This is exactly the construction of Lemma <a href="#lemjl">2</a>. Each row of $A$ is $\frac{1}{\sqrt{k}} g_i^T$, where $g_i \in \R^d$ is iid $\N(0, 1)$. Then \[||A x||^2 = \sum_{i=1}^k \innp{\frac{1}{\sqrt{k}} g_i, x}^2 = \frac{1}{k} \sum_{i=1}^k \innp{g_i, x}^2 = \frac{1}{k} \sum_{i=1}^k Y_i^2\]
<p>
And thus, for $||x||=1$, \[\Pr\left[ \left| ||A x||^2 - 1 \right| > \epsilon\right] \leq \delta\] as desired in Lemma <a href="#lemjl">2</a>. $$\tag*{$\blacksquare$}$$
<p>
To recap, the key observation was that if we draw $g \sim \N(0, I_d)$, then $\innp{g, x}^2$ is a “good” estimator of $||x||^2$, so we can average $O(\epsilon^{-2}\log(1/\delta))$ iid copies and get an estimator within a multiplicative factor $(1 \pm \epsilon)$ with probability $\geq 1-\delta$. Filling the transform matrix $A$ with Gaussians is clearly not necessary; any distribution with iid entries that are subgaussian would work, with the same proof as above. For example, picking each entry in $A$ as iid $\pm \frac{1}{\sqrt k}$ would work.
<p>
We can now think of different JL transforms as constructing different estimators of $||x||^2$. For example, can we draw from a distribution such that $g$ is sparse? (Not quite, but with some “preconditioning” this will work, as we see below).
<p>
<h2 class='tex'>2. Fast JL </h2>
<p>
<h3 class='tex'>2.1. First Try: Coordinate Sampling</h3> As a (bad) first try, consider the estimator that randomly samples a coordinate of the given vector $x$, scaled appropriately. That is, \[Y := \sqrt{d} ~x_j \quad\text{for uniformly random coordinate $j \in [d]$}\] Equivalently, draw a random standard basis vector $e_j$, and let $Y := \sqrt{d} \innp{e_j, x}$.
<p>
Notice that $Y^2$ has the right expectation: \[\E[Y^2] = \E_j[(\sqrt{d} x_j)^2] = d \E_j[x_j^2] = ||x||^2\] However, it does not concentrate well. The variance is $Var[Y^2] = Var[d x_j^2] = d^2 Var[x_j^2]$. If $x$ is a standard basis vector (say $x = e_1$), then this could be as bad as $Var[Y^2] \approx d$. This is bad, because it means we would need to average $\Omega(d)$ iid samples to get a sufficiently good estimator, which does not help us in reducing the dimension of $x \in \R^d$.
<p>
The bad case in the above analysis is when $x$ is very concentrated/sparse, so sampling a random coordinate of $x$ is a poor estimator of its magnitude. However, if $x$ is very “spread out”, then sampling a random coordinate would work well. For example, if all entries of $x$ are bounded by $\pm O(\sqrt{\frac{1}{d}})$, then $Var[Y^2] = O(1)$, and taking iid copies of this estimator would work. This would be nice for runtime, since randomly sampling a coordinate can be done quickly (it is not a dense inner-product).
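<p>
To see this contrast concretely, here is a small sketch (pure Python; the names and parameters are illustrative) comparing the coordinate-sampling estimator on a sparse versus a spread-out unit vector:

```python
import random

def coord_estimate(x, t, rng):
    # average of t iid copies of the estimator Y^2 = d * x_j^2, j uniform
    d = len(x)
    return sum(d * x[rng.randrange(d)] ** 2 for _ in range(t)) / t

rng = random.Random(1)
d, t = 1024, 64
sparse = [0.0] * d
sparse[0] = 1.0             # x = e_1: all mass on one coordinate
spread = [d ** -0.5] * d    # maximally spread out, ||x||_2 = 1
# Sparse case: wildly off -- each sample is 0 or d, so the estimate is a
# multiple of d/t = 16 (and usually 0, since we tend to miss coordinate 0).
print(coord_estimate(sparse, t, rng))
# Spread case: every sample is d * (1/d) = 1, so the estimate is exact.
print(coord_estimate(spread, t, rng))  # 1.0
```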
<p>
Thus, if we can (quickly) “precondition” our vector $x$ to have $||x||_{\infty} \leq O(\sqrt{\frac{1}{d}})$, we could then use coordinate sampling to achieve a fast JL embedding. We won't quite achieve this, but we will be able to precondition such that $||x||_\infty \leq O(\sqrt{\frac{\log(d/\delta)}{d}})$, as described in the next section. With this in mind, we will need the following easy claim (that with the weaker bound on $\ell_\infty$, coordinate sampling works to reduce the dimension to almost our target dimension).
<p>
<blockquote><b>Lemma 3</b> <em> <a name="lemS"></a> Let $t = \Theta(\epsilon^{-2}\log(1/\delta)\log(d/\delta))$. Let $S \in \R^{t \x d}$ be a matrix with rows $s_i := \sqrt{\frac{d}{t}}\, e_{j_i}$, where each $j_i \in [d]$ is an iid uniform index. (That is, each row of $S$ randomly samples a coordinate, scaled appropriately.) Then, for all $x$ s.t. $||x||_2=1$ and $||x||_\infty \leq O(\sqrt{\frac{\log(d/\delta)}{d}})$, we have
<p>
\[\Pr_{S}[ ||Sx||^2 \in 1 \pm \epsilon] \geq 1-\delta\] </em></blockquote>
<p>
<em>Proof:</em> \[||Sx||^2 = \sum_{i=1}^t \innp{\sqrt{\frac{d}{t}} e_{j_i}, x}^2 = \frac{1}{t}\sum_{i=1}^t d\, x_{j_i}^2\] The r.v.s $Z_i := d\, x_{j_i}^2$ are iid with mean $||x||_2^2 = 1$, and by the $\ell_\infty$ bound on $x$ they are absolutely bounded by $O(\log(d/\delta))$ (hence they also have variance $O(\log(d/\delta))$, since $\E[Z_i^2] \leq O(\log(d/\delta))\, \E[Z_i]$). So by a Bernstein-type concentration bound and our choice of $t$, \[\Pr[ |\frac{1}{t}\sum_{i=1}^t Z_i - 1| > \epsilon] \leq e^{-\Omega(\epsilon^2 t / \log(d/\delta))} \leq \delta\] $$\tag*{$\blacksquare$}$$
<p>
<h3 class='tex'>2.2. FJLT: Preconditioning with random Hadamard</h3> The main idea of FJLT is that we can quickly precondition vectors to be “smooth”, by using the Fast Hadamard Transform.
<p>
Recall the $d \x d$ Hadamard transform $H_d$ (for $d$ a power of 2) is defined recursively as \[H_1 := 1 ,\quad H_{2d} := \frac{1}{\sqrt{2}} \bmqty{H_d & H_d\\H_d & -H_d}\] More explicitly, $H_d[i,j] = \frac{1}{\sqrt{d}} (-1)^{\innp{i, j}}$ where indices $i,j \in \{0, 1\}^{\log d}$, and the inner-product is mod 2. The Hadamard transform is just like the discrete Fourier transform<sup><a href='#footnote4' onclick="toggle_display_nojump('footnote4');">4</a></sup><span class='sidenote' id='footnote4'><a name='footnote4' href='#footnote4'>4.</a> Indeed, it is exactly the Fourier transform over the group $(\Z_2)^n$. For more on Fourier transforms over abelian groups, see for example <a href="https://lucatrevisan.wordpress.com/2016/03/16/cs294-lecture-15-abelian-cayley-graphs/">Luca's notes</a>. </span> : it can be computed in time $O(d\log d)$ by recursion, and it is an orthonormal transform.
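<p>
For concreteness, the recursion above unrolls into the usual $O(d \log d)$ butterfly algorithm; here is a pure-Python sketch, with the orthonormal scaling applied at the end:

```python
import math

def fwht(x):
    """In-place fast Walsh-Hadamard transform of a list x whose
    length d is a power of 2; normalized so the map is orthonormal."""
    d = len(x)
    h = 1
    while h < d:                      # log d stages
        for i in range(0, d, 2 * h):  # each stage does d/2 butterflies
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    scale = 1 / math.sqrt(d)
    for i in range(d):
        x[i] *= scale
    return x

print(fwht([1.0, 0.0, 0.0, 0.0]))  # [0.5, 0.5, 0.5, 0.5]: first column of H_4
```

Since the transform is orthonormal, `sum(v * v for v in fwht(x))` equals $||x||_2^2$ up to floating-point error.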
<p>
Intuitively, the Hadamard transform may be useful to “spread out” vectors, since Fourier transforms take things that are sparse/concentrated in time-domain to things that are spread out in frequency domain (by time-frequency duality/Uncertainty principle). Unfortunately this won't quite work, since duality goes both ways: It will also take vectors that are already spread out and make them sparse.
<p>
To fix this, it turns out we can first randomize the signs, then apply the Hadamard transform.
<p>
<blockquote><b>Lemma 4</b> <em> <a name="lemHadamard"></a> Let $H_d$ be the $d \x d$ Hadamard transform, and let $D$ be a random diagonal matrix with iid $\pm 1$ entries on the diagonal. Then, \[\forall x \in \R^d, ||x||=1: \quad \Pr_{D}[ ||H_d D x||_\infty > \Omega(\sqrt{\frac{\log(d/\delta)}{d}}) ] \leq \delta\] </em></blockquote>
<p>
<p>
We may expect something like this to hold: randomizing the signs of $x$ corresponds to pointwise-multiplying by random white noise. White noise is spectrally flat, and multiplying by it in time-domain corresponds to convolving by its (flat) spectrum in frequency domain. Thus, multiplying by $D$ should “spread out” the spectrum of $x$. Applying $H_d$ computes this spectrum, so should yield a spread-out vector.
<p>
The above intuition seems messy to formalize, but the proof is surprisingly simple. <sup><a href='#footnote5' onclick="toggle_display_nojump('footnote5');">5</a></sup><span class='sidenote' id='footnote5'><a name='footnote5' href='#footnote5'>5.</a> This proof is presented slightly differently from the one in Ailon-Chazelle, but the idea is the same. </span>
<p>
<em>Proof:</em> Consider the first entry of $H_d D x$. Let $D = \mathrm{diag}(a_1, a_2, \dots, a_d)$ where $a_i$ are iid $\pm 1$. The first row of $H_d$ is $\frac{1}{\sqrt{d}} \bmqty{1 & 1 & \dots & 1}$, so \[(H_d D x)[1] = \frac{1}{\sqrt{d}} \sum_i a_i x_i\] Here, the $x_i$ are fixed s.t. $||x||_2=1$, and the $a_i = \pm 1$ iid. Thus we can again bound this by Hoeffding <sup><a href='#footnote6' onclick="toggle_display_nojump('footnote6');">6</a></sup><span class='sidenote' id='footnote6'><a name='footnote6' href='#footnote6'>6.</a> The following form of Hoeffding bound is useful (it follows directly from Hoeffding for subgaussian variables, but is also a corollary of Azuma-Hoeffding): For iid zero-mean random variables $Z_i$, absolutely bounded by $1$, $\Pr[|\sum c_i Z_i| > \epsilon] \leq 2\exp(-\frac{\epsilon^2}{2 \sum_i c_i^2})$. </span> (surprise), \[ \Pr[ | \sum_{i=1}^d a_i \frac{x_i}{\sqrt{d}}| > \eta] \leq e^{-\Omega(\eta^2 d / ||x||_2^2)} \] For $\eta = \Omega(\sqrt{\frac{\log(d/\delta)}{d}})$, this probability is bounded by $(\frac{\delta}{d})$. Moreover, the same bound applies for all coordinates of $H_d Dx$, since all rows of $H_d$ have the form $\frac{1}{\sqrt{d}} \bmqty{\pm 1 & \pm1 & \dots & \pm1}$. Thus, a union bound over the $d$ coordinates establishes the lemma. $$\tag*{$\blacksquare$}$$
<p>
<h3 class='tex'>2.3. The Full Fast JL Transform</h3> <i>This presentation of FJLT is due to Jelani Nelson; see the notes [<a href='#ref-Nel10'>Nel10</a>].</i>
<p>
Putting all the pieces together, the FJLT is defined as: \[A = J S H_d D\] or, \[ A: \quad \R^d \overset{D}{\longrightarrow} \R^d \overset{H_d}{\longrightarrow} \R^d \overset{S}{\longrightarrow} \R^t \overset{J}{\longrightarrow} \R^k \] where
<ul> <li> $S$: the sparse coordinate-sampling matrix of Lemma <a href="#lemS">3</a> <li> $H_d$: the $d \x d$ Hadamard transform. <li> $D$: diagonal iid $\pm 1$. <li> $J$: a dense “normal” JL matrix (iid Gaussian entries).
</ul>
For parameters
<ul> <li> $t = \Theta(\epsilon^{-2}\log(1/\delta)\log(d / \delta))$ <li> $k = \Theta(\epsilon^{-2}\log(1/\delta))$
</ul>
<p>
That is, we first precondition with the randomized Hadamard transform, then sample random coordinates (which does most of the dimensionality reduction), then finally apply a normal JL transform to get rid of the last $\log(d/\delta)$ factor in the dimension.
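<p>
Putting the pieces into code, here is a self-contained pure-Python sketch of the pipeline (the Hadamard helper is inlined so the snippet runs on its own; all names and parameter choices are illustrative, not the tuned constants):

```python
import math
import random

def fwht(x):
    # orthonormal fast Hadamard transform; len(x) must be a power of 2
    d, h = len(x), 1
    x = list(x)
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return [v / math.sqrt(d) for v in x]

def fjlt(x, t, k, rng):
    """Sketch of A = J S H_d D applied to a single vector x."""
    d = len(x)
    # D then H_d: randomize signs, then fast Hadamard (preconditioning)
    y = fwht([rng.choice((-1, 1)) * v for v in x])
    # S: sample t coordinates of y, each scaled by sqrt(d/t)
    z = [math.sqrt(d / t) * y[rng.randrange(d)] for _ in range(t)]
    # J: dense k x t "normal" JL matrix with iid N(0, 1/k) entries
    return [sum(rng.gauss(0, 1 / math.sqrt(k)) * zj for zj in z)
            for _ in range(k)]

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(256)]
y = fjlt(x, t=256, k=128, rng=rng)
ratio = sum(v * v for v in y) / sum(v * v for v in x)
print(ratio)  # close to 1 w.h.p.
```

One caveat: this snippet redraws $D$, $S$, and $J$ on every call; to embed many points consistently (as the JL theorem requires), the same random matrices must be reused across all vectors.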
<p>
<b>Correctness.</b> Since the matrix $D$ and the Hadamard transform are isometric, they do not affect the norms of vectors. Then, after the preconditioning, Lemma <a href="#lemS">3</a> guarantees that $S$ only affects norms by $(1 \pm \epsilon)$, and Lemma <a href="#lemjl">2</a> guarantees that the final step is also roughly isometric. Each of these steps fails w.p. at most $\delta$, so the final transform affects norms by at most, say, $(1\pm 3\epsilon)$, except w.p. $3\delta$. This is sufficient to establish the JL embedding.
<p>
<b>Runtime.</b> For computing a JL embedding (ie, setting $\delta = 1/n^3, \epsilon=O(1)$), the time to embed a single vector is $O(d \log d + \log^3 n)$.
<p>
<h2 class='tex'>3. Closing Remarks </h2>
<p>
<b>Optimality.</b> The target dimension given by the JL construction is known to be optimal. That is, one cannot embed $n$ points into dimension less than $k=\Omega(\epsilon^{-2}\log n)$ with distortion $\epsilon$. The first near-optimal lower bound, in [<a href='#ref-Alo03'>Alo03</a>, Section 9], works by showing upper bounds on the number of nearly-orthogonal vectors in a given dimension (so a too-good embedding of orthogonal vectors would violate this bound). A more recent, optimal bound is in [<a href='#ref-KMN11'>KMN11</a>, Section 6]. They actually show optimality of the JL Lemma (that is, optimality among linear embeddings), which works (roughly) by arguing that if the target dimension is too small, then the kernel is too big, so a random vector is likely to be very distorted.
<p>
<b>Recent Advances.</b> Note that the FJLT is fast, but is not <em>sparse</em>. We may hope that embedding a sparse vector $x$ will take time proportional to the sparsity of $x$. A major result in this area was the sparse JL construction of [<a href='#ref-KN14'>KN14</a>]; see also the notes [<a href='#ref-Nel10'>Nel10</a>]. There is also work in derandomized JL, see for example [<a href='#ref-KMN11'>KMN11</a>].
<p>
<i>I'll stop here, since I haven't read these works yet, but perhaps we will revisit this another time.<br/>
This post was derived from my talk at Berkeley theory retreat, on the theme of “theoretical guarantees for machine learning.” </i>
<p>
<br><hr><h3>References</h3>
<p>
<a name='ref-AC09'>[AC09]</a> Nir Ailon and Bernard Chazelle.
The fast Johnson-Lindenstrauss transform and approximate nearest neighbors.
<em>SIAM Journal on Computing</em>, 39(1):302--322, 2009.
URL:
<a href="https://www.cs.princeton.edu/~chazelle/pubs/FJLT-sicomp09.pdf">https://www.cs.princeton.edu/~chazelle/pubs/FJLT-sicomp09.pdf</a>.
<p>
<p>
<a name='ref-Alo03'>[Alo03]</a> Noga Alon.
Problems and results in extremal combinatorics, Part I.
<em>Discrete Math</em>, 273:31--53, 2003.
URL: <a href="http://www.tau.ac.il/~nogaa/PDFS/extremal1.pdf">http://www.tau.ac.il/~nogaa/PDFS/extremal1.pdf</a>.
<p>
<p>
<a name='ref-KMN11'>[KMN11]</a> Daniel Kane, Raghu Meka, and Jelani Nelson.
Almost optimal explicit Johnson-Lindenstrauss families.
In <em>Approximation, Randomization, and Combinatorial Optimization.
Algorithms and Techniques</em>, pages 628--639. Springer, 2011.
URL:
<a href="http://people.seas.harvard.edu/~minilek/papers/derand_jl.pdf">http://people.seas.harvard.edu/~minilek/papers/derand_jl.pdf</a>.
<p>
<p>
<a name='ref-KN14'>[KN14]</a> Daniel M. Kane and Jelani Nelson.
Sparser Johnson-Lindenstrauss transforms.
<em>Journal of the ACM (JACM)</em>, 61(1):4, 2014.
URL: <a href="https://arxiv.org/pdf/1012.1577v6.pdf">https://arxiv.org/pdf/1012.1577v6.pdf</a>.
<p>
<p>
<a name='ref-Nel10'>[Nel10]</a> Jelani Nelson.
Johnson-Lindenstrauss notes.
Technical report, MIT-CSAIL, 2010.
URL: <a href="http://web.mit.edu/minilek/www/jl_notes.pdf">http://web.mit.edu/minilek/www/jl_notes.pdf</a>.
<p>