Continuing on our journey through expander graphs, we prove the essential Cheeger inequality. We define $e(S,T)$ as the number of edges going from $S$ to $T$, and $\partial S$ as the edge boundary set $E(S, V\setminus S)$. Recall that the isoperimetric number or Cheeger constant of a graph is the infimum of the ratio $\frac{|\partial S|}{e(S,V)}$ over all subsets $S$ of size at most $|V|/2$.
We will be thinking of $d$-regular graphs, for which we have considered the notion of spectral expansion through the spectral gap of their normalized adjacency matrix $A_G$: $\lambda(G) := \max\{|\lambda_2|, |\lambda_n|\} \le \lambda$.
The Cheeger inequality gives us a way to relate the two notions of expansion.
Theorem
The Cheeger inequality states that:
$$\frac{1 - \lambda_2}{2} \le h(G) \le \sqrt{2(1 - \lambda_2)}$$
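Before diving into the proof, it can help to see the inequality numerically. Below is a minimal sketch (the choice of graph, the cycle $C_8$, and all helper names are ours) that computes $\lambda_2$ of the normalized adjacency matrix and brute-forces $h(G)$ to check both sides:

```python
# Sanity check of the Cheeger inequality on the cycle C_8 (a 2-regular graph).
import itertools
import numpy as np

n, d = 8, 2
A = np.zeros((n, n))
for v in range(n):                         # cycle edges v ~ v+1 (mod n)
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1

lam = np.sort(np.linalg.eigvalsh(A / d))   # spectrum of the normalized adjacency matrix
lam2 = lam[-2]                             # second largest eigenvalue λ2

def h_of(S):                               # h(S) = |∂S| / (d·|S|)
    S = set(S)
    boundary = sum(1 for u in S for v in range(n) if A[u, v] and v not in S)
    return boundary / (d * len(S))

h_G = min(h_of(S) for k in range(1, n // 2 + 1)
          for S in itertools.combinations(range(n), k))

assert (1 - lam2) / 2 <= h_G <= np.sqrt(2 * (1 - lam2))
print((1 - lam2) / 2, h_G, np.sqrt(2 * (1 - lam2)))   # 0.146... <= 0.25 <= 0.765...
```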
Let's prove it! We'll be working off of the wonderful [notes](https://cseweb.ucsd.edu/classes/sp21/cse291-g/) from Shachar Lovett's class, with small changes and simplifications in some parts. The proof has two parts, one for each side of the inequality.
Lower Bound
Let us first recall the Courant–Fischer (CF) theorem, which characterizes the spectrum in terms of Rayleigh coefficients. If $M$ is a symmetric matrix with eigenvalues $v_1 \le \dots \le v_n$ and corresponding eigenvectors $\psi_1, \dots, \psi_n$, then:
$$v_i = \min_{\substack{\|x\| = 1 \\ x^T \psi_j = 0,\; j < i}} x^T M x$$
Now, recall that the (normalized) Laplacian matrix of $G$ is defined as $L = I - A_G$. Let $0 = \mu_1 \le \mu_2 \le \dots \le \mu_n$ be the spectrum of $L$. We know that $\mu_2 = 1 - \lambda_2$, so it suffices to show that $\mu_2 \le 2h(G)$. By CF, we have that:
$$\mu_2 = \min_{x \perp \mathbf{1}_n} \frac{x^T L x}{x^T x}$$
So, all we have to do is find some $x \in \mathbb{R}^n$ that satisfies $\frac{x^T L x}{x^T x} \le 2h(G)$. We know that $h(G)$ is a minimum over all cuts, so let $S \subset V$ with $|S| \le n/2$ be a cut achieving $h(G) = h(S)$, and consider its indicator vector:
$$x = \mathbf{1}_S, \qquad x_v = \begin{cases} 1 & \text{if } v \in S \\ 0 & \text{otherwise} \end{cases}$$
Now we have to do some calculations:
- $x^T x = |S|$.
- $x^T L x = \sum_{(u,v) \in E} \frac{1}{d}(x_u - x_v)^2 = \frac{|\partial S|}{d}$.
Cool trick
We would be remiss if we did not stand and appreciate the power of examining Laplacians through their quadratic forms. This is a very neat trick and a powerful tool for interpreting $L$. Specifically, we can write all quadratic forms as:
$$x^T L x = \sum_{(u,v) \in E} w_{uv} (x_u - x_v)^2$$
It just so happens here that $w_{uv} = \frac{1}{d}$ for all edges.
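For the skeptical reader, here is a tiny sketch (our own construction, again on $C_8$) verifying the quadratic-form identity against a random vector:

```python
# Check x^T L x = Σ_{(u,v)∈E} (1/d)(x_u - x_v)^2 for the normalized Laplacian.
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 2
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1   # the cycle C_8 again

L = np.eye(n) - A / d                           # normalized Laplacian L = I - A_G
x = rng.standard_normal(n)

quad = x @ L @ x
edge_sum = sum((x[u] - x[v]) ** 2 / d           # one term per undirected edge
               for u in range(n) for v in range(u + 1, n) if A[u, v])
assert np.isclose(quad, edge_sum)
```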
So then $\frac{x^T L x}{x^T x} = \frac{|\partial S|}{d|S|}$. We're on the right track! Recall that $h(G) = h(S) = \frac{|\partial S|}{e(S,V)}$ for our chosen $S$, and that $e(S,V) = d|S|$, so this ratio is exactly $h(G)$. Still, we have one more constraint to satisfy: we must not forget that the $x$ we choose must be orthogonal to $\mathbf{1}_n$. To do this, we'll change $x$ a little. Define:
$$x_S(v) = \begin{cases} 1 - \frac{|S|}{n} & \text{if } v \in S \\ -\frac{|S|}{n} & \text{otherwise} \end{cases}$$
So we just subtract $\frac{|S|}{n}$ from $\mathbf{1}_S$. Now, what happens? Let's try to go through the calculations again:
- $\langle x_S, \mathbf{1}_n \rangle = |S| - \frac{|S|^2}{n} + (n - |S|)\left(-\frac{|S|}{n}\right) = 0$, so we get our orthogonality.
- $x_S^T x_S = |S|\left(1 - \frac{|S|}{n}\right)^2 + (n - |S|)\frac{|S|^2}{n^2} = \dots = \frac{|S| \cdot |S^c|}{n}$, where $S^c = V \setminus S$.
- $x_S^T L x_S = \sum_{(u,v) \in E} \frac{1}{d}(x_S(u) - x_S(v))^2 = \frac{|\partial S|}{d}$, since shifting every coordinate by the same constant leaves the differences unchanged.
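These calculations are easy to sanity-check numerically; here is a small sketch (the example set $S$ is arbitrary, and all names are ours):

```python
# Verify the three computations for the shifted indicator x_S on C_8:
# orthogonality to 1_n, x_S^T x_S = |S||S^c|/n, and x_S^T L x_S = |∂S|/d.
import numpy as np

n, d = 8, 2
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1
L = np.eye(n) - A / d

S = [0, 1, 2]                                   # an example cut with |S| <= n/2
xS = np.full(n, -len(S) / n)                    # -|S|/n outside S ...
xS[S] = 1 - len(S) / n                          # ... and 1 - |S|/n inside S

boundary = sum(A[u, v] for u in S for v in range(n) if v not in S)
assert np.isclose(xS @ np.ones(n), 0)                      # x_S ⊥ 1_n
assert np.isclose(xS @ xS, len(S) * (n - len(S)) / n)      # |S||S^c|/n
assert np.isclose(xS @ L @ xS, boundary / d)               # |∂S|/d
```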
Now what do we get if we form the Rayleigh coefficient?
$$\frac{x_S^T L x_S}{x_S^T x_S} = \frac{n|\partial S|}{d|S||S^c|}$$
Now, since $|S| \le n/2$, we have $|S^c| \ge \frac{n}{2}$, and $\frac{|\partial S|}{d|S|} = h(S) = h(G)$, meaning that we get what we were looking for:
$$\frac{x_S^T L x_S}{x_S^T x_S} \le 2h(G)$$
and that proves our lower bound!
Upper Bound
This is the harder part of the proof. There are a few moving parts. We want to prove that $h(G) \le \sqrt{2\mu_2}$. We know that $\mu_2 = \frac{x^T L x}{x^T x}$ for some $x \perp \mathbf{1}_n$, meaning that we need to show
$$h(G) \le \sqrt{2\frac{x^T L x}{x^T x}}$$
for that $x$. However, we don't know what that $x$ is, and we don't really care to know. We'll prove something (way) stronger: if we have *any* $x \perp \mathbf{1}_n$, then we can find some cut $S$ for which
$$\max\{h(S), h(S^c)\} \le \sqrt{2\frac{x^T L x}{x^T x}}$$
Of course, $h(G) \le \max\{h(S), h(S^c)\}$ for any cut $S$, so it seems we're asking for too much. But this turns out to be just enough! The other concern is how we can create a cut $S$ from some arbitrary vector $x$. In the lower bound proof we did the easier thing: we created a vector from a cut. The opposite requires some more ingenuity.
Fiedler's Algorithm
The task is really about finding sparse cuts. Fiedler's algorithm produces sparse cuts from vectors. Given a vector $x$, proceed as follows (a code sketch follows the list):

1. Sort the coordinates of $x$ as $x_{v_1} \le \dots \le x_{v_n}$.
2. For $i \in [n]$:
   - Let $S_i := \{v_i, \dots, v_n\}$.
   - Compute $M_i = \max\{h(S_i), h(S_i^c)\}$.
3. Output the cut $S_i$ with the minimum $M_i$ (the sparsest one).
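Here is the promised sketch of the sweep, assuming the graph comes as a dense 0/1 adjacency matrix of a $d$-regular graph; `h` and `fiedler_sweep` are our own names, not a standard API:

```python
import numpy as np

def h(adj, d, S):
    """Sparsity h(S) = |∂S| / (d·|S|), with S given as an array of vertex indices."""
    mask = np.zeros(adj.shape[0], dtype=bool)
    mask[S] = True
    boundary = adj[mask][:, ~mask].sum()      # edges from S to V \ S
    return boundary / (d * mask.sum())

def fiedler_sweep(adj, x):
    """Return the sweep cut S_i = {v_i, ..., v_n} minimizing max{h(S_i), h(S_i^c)}."""
    n = len(x)
    d = int(adj[0].sum())                     # regular degree
    order = np.argsort(x)                     # so that x_{v_1} <= ... <= x_{v_n}
    best_S, best_M = None, np.inf
    for i in range(1, n):                     # i >= 1 so both sides are nonempty
        S, Sc = order[i:], order[:i]
        M = max(h(adj, d, S), h(adj, d, Sc))
        if M < best_M:
            best_S, best_M = S, M
    return best_S, best_M
```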
The crux is the following lemma about the sparsity of the cut that Fiedler's algorithm produces:
Lemma
Let $x \ge 0$ and $x \ne 0$. Fiedler's algorithm produces a cut $S$ with
$$h(S) \le \sqrt{2\frac{x^T L x}{x^T x}}$$
The cut has the form $S = S_t = \{v : x_v \ge t\}$ for some $t > 0$.
This is all about understanding what Fiedler's algorithm does. Notice that we aren't tackling the full case $x \perp \mathbf{1}_n$; we are just thinking about $x \ge 0$, but massaging things will slowly get us to the right solution.
To prove the lemma, we'll use the probabilistic method. If you aren't familiar with it, all it says is this: to prove that an object with some property exists in some universe, you just have to show that the probability of choosing an object with this property is strictly positive under some distribution over the universe. For us, the objects are cuts, so we are looking for a distribution $D$ over all cuts such that:
$$\Pr_{S_i \sim D}\left[h(S_i) \le \sqrt{2\frac{x^T L x}{x^T x}}\right] > 0$$
Now, dealing with probabilities is often quite inconvenient. A classic trick is to use an averaging argument so that we deal with expectations instead: if $Z$ is a random variable with $E[Z] = 0$, then $\Pr[Z \le 0] > 0$. So we'll rephrase our goal in those terms. Let $X = e(S_i, V \setminus S_i)$, $Y = d|S_i|$, and $r = E[X]/E[Y]$. By linearity of expectation, we have that $E[X - rY] = 0$, so $\Pr[X/Y \le r] > 0$.
Thus, it suffices to show that for some distribution $D$ over cuts we have:
$$r = \frac{E[e(S_i, V \setminus S_i)]}{d \cdot E[|S_i|]} \le \sqrt{2\frac{x^T L x}{x^T x}}$$
WLOG, let's assume that $x_{v_n} = 1$, as the Rayleigh coefficient is immune to scaling. Now, for $t \in (0, 1]$, define $S_t := \{v : x_v \ge t\}$. Let's make the important observation that each $S_t$ is some $S_i$ considered by Fiedler's algorithm. Why? Because $S_i := \{v_i, \dots, v_n\}$ can be defined as all $v \in V$ for which $x_v \ge x_{v_i} =: t_i$, due to our ordering, and the sequence $\{t_i\}_{i=1}^n$ is nondecreasing in $i$. Conversely, if we pick some $t \in (t_i, t_{i+1}]$, the set $S_t$ doesn't change.
Enough prep-work! Time to see what our distribution D should be.
Making our distribution
We draw $S_t$ such that $t^2$ is uniformly drawn from $[0,1]$.
This is possibly the strangest part of the whole proof. Why would we do that? To give some intuition, think about what would happen if we drew $S_t$ so that $t$ is uniform at random from $[0,1]$. That seems like a more natural choice. Then:
$$E[d|S_t|] = d\sum_v \Pr[x_v \ge t] = d\sum_v x_v$$
Here's the problem! We want this to be our denominator, so we want the sum to equal $x^T x$. That would be the case if the final sum were $\sum_v x_v^2$, which is exactly what we get with our chosen distribution: $\Pr[x_v \ge t] = \Pr[t^2 \le x_v^2] = x_v^2$, so $E[d|S_t|] = d\, x^T x$.
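A quick Monte Carlo sketch (with an arbitrary example vector of our choosing) makes this concrete:

```python
# Check that drawing t with t² uniform on [0,1] gives E[|S_t|] = Σ_v x_v².
import numpy as np

rng = np.random.default_rng(1)
x = np.array([0.1, 0.3, 0.5, 0.8, 1.0])         # some x >= 0 with max coordinate 1

t = np.sqrt(rng.uniform(0, 1, size=200_000))    # draw t so that t² ~ Uniform[0,1]
sizes = (x[None, :] >= t[:, None]).sum(axis=1)  # |S_t| = #{v : x_v >= t} per sample
print(sizes.mean(), (x ** 2).sum())             # both ≈ 1.99
```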
The other term has some more algebra:
$$E[e(S_t, V \setminus S_t)] = \sum_{\{u,v\} \in E} \Pr[\{u,v\} \text{ is cut}] = \sum_{\{u,v\} \in E} \Pr[x_u \le t \le x_v]$$
where we order each edge so that $x_u \le x_v$. Under our distribution, $\Pr[x_u \le t \le x_v] = x_v^2 - x_u^2 = (x_v - x_u)(x_v + x_u)$, so by Cauchy–Schwarz:
$$E[e(S_t, V \setminus S_t)] \le \sqrt{\sum_{\{u,v\} \in E} (x_v - x_u)^2} \cdot \sqrt{\sum_{\{u,v\} \in E} (x_v + x_u)^2} \le \sqrt{d\, x^T L x} \cdot \sqrt{2d\, x^T x}$$
using $(x_v + x_u)^2 \le 2(x_u^2 + x_v^2)$ and the fact that each vertex appears in $d$ edges. Dividing by $E[d|S_t|] = d\, x^T x$ gives exactly $r \le \sqrt{2\frac{x^T L x}{x^T x}}$, as required.
That concludes the proof of our lemma. But again, this was for $x \ge 0$, which is impossible if we want $x \perp \mathbf{1}_n$ (and $x \ne 0$). We'll need another lemma to get past this hurdle:
Lemma
Let $x \perp \mathbf{1}_n$. There exists a vector $y \ge 0$ with support of size at most $|V|/2$ such that
$$\frac{y^T L y}{y^T y} \le \frac{x^T L x}{x^T x}$$
Additionally, every cut $S_t^y$ that Fiedler's algorithm considers on $y$ is also considered on $x$.
First, what does this give us? Consider $x \perp \mathbf{1}_n$ and let $y \ge 0$ be the vector the lemma guarantees. We run Fiedler's algorithm on $y$ and get a cut $S_t$, for which our previous lemma says that:
$$h(S_t) \le \sqrt{2\frac{y^T L y}{y^T y}} \le \sqrt{2\frac{x^T L x}{x^T x}}$$
Because $S_t$ is considered by Fiedler's algorithm on $x$ as well, and because $|S_t| \le \frac{|V|}{2}$, we are comfortably done:
$$h(S_t) = \max\{h(S_t), h(V \setminus S_t)\} \le \sqrt{2\frac{x^T L x}{x^T x}}$$
(the smaller side of a cut has the larger $h$ value, since $|\partial S_t| = |\partial(V \setminus S_t)|$)
as desired.
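Putting the pieces together numerically: the sketch below (on $C_8$ again, reusing `fiedler_sweep` from the earlier sketch) takes an eigenvector for $\mu_2$, sweeps it, and confirms the upper bound:

```python
# End-to-end check of the upper bound: sweep an eigenvector for μ2 on C_8
# and confirm max{h(S), h(S^c)} <= sqrt(2 μ2).
import numpy as np

n, d = 8, 2
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1
L = np.eye(n) - A / d

mu, psi = np.linalg.eigh(L)        # mu[0] = 0 with eigenvector 1_n; mu[1] = μ2
x = psi[:, 1]                      # an eigenvector for μ2, orthogonal to 1_n

S, M = fiedler_sweep(A, x)         # from the sketch in the Fiedler's Algorithm section
assert M <= np.sqrt(2 * mu[1]) + 1e-12
print(M, np.sqrt(2 * mu[1]))       # 0.25 <= 0.765...
```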
So the final thing to do is prove our lemma above. First, note that if we shift $x$ by a constant we can only lower its Rayleigh coefficient: because $\mathbf{1}_n$ is an eigenvector of $L$ with eigenvalue $0$, the numerator is unchanged, while the denominator can only grow since $x \perp \mathbf{1}_n$. That is,
$$\frac{(x + c\mathbf{1}_n)^T L (x + c\mathbf{1}_n)}{(x + c\mathbf{1}_n)^T (x + c\mathbf{1}_n)} \le \frac{x^T L x}{x^T x}$$
Now, let $m$ be the median of $x$ and $x' = x - m\mathbf{1}_n$. We know that $x'$ has at most $|V|/2$ positive and at most $|V|/2$ negative coordinates. Define $x^+$ and $x^-$ by taking the absolute values of the positive and negative coordinates of $x'$, respectively: $x^+_v = \max\{x'_v, 0\}$ and $x^-_v = \max\{-x'_v, 0\}$. These two vectors have support of size $\le |V|/2$ and are both nonnegative. Then we have that
$$x' = x^+ - x^-$$
Now imagine running Fiedler's algorithm on $x'$. For any $t > 0$, the cuts $\{v : x^+_v \ge t\}$ and $\{v : x^-_v \ge t\}$ are both considered. Why? It's because the coordinates of $x'$ are ordered and both $S_t$ and $V \setminus S_t$ are considered as cuts. Now we will show that either $x^+$ or $x^-$ has a Rayleigh coefficient at most $\frac{x'^T L x'}{x'^T x'}$. We can calculate:
$$x'^T L x' = \frac{1}{d}\sum_{(u,v) \in E} (x'_u - x'_v)^2 \ge \frac{1}{d}\sum_{(u,v) \in E} \left[(x^+_u - x^+_v)^2 + (x^-_u - x^-_v)^2\right] = x^{+T} L x^+ + x^{-T} L x^-$$
where the inequality holds edge by edge because $x^+$ and $x^-$ have disjoint supports, while $x'^T x' = x^{+T} x^+ + x^{-T} x^-$. Since $\frac{a+b}{c+d} \ge \min\{\frac{a}{c}, \frac{b}{d}\}$ for nonnegative numerators and positive denominators, one of the two Rayleigh coefficients is at most $\frac{x'^T L x'}{x'^T x'}$, as claimed.
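The decomposition and this mediant argument are easy to check numerically; a sketch (with our own names, on $C_8$ again) follows:

```python
# For a random x ⊥ 1_n, after subtracting the median, one of x⁺, x⁻ has a
# Rayleigh coefficient no larger than that of x'.
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 2
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1
L = np.eye(n) - A / d

def rayleigh(v):
    return (v @ L @ v) / (v @ v)

x = rng.standard_normal(n)
x -= x.mean()                        # enforce x ⊥ 1_n
x_prime = x - np.median(x)           # x' = x - m·1_n
x_plus = np.maximum(x_prime, 0)      # positive part of x'
x_minus = np.maximum(-x_prime, 0)    # negative part of x', as a nonnegative vector

assert np.allclose(x_prime, x_plus - x_minus)
assert min(rayleigh(x_plus), rayleigh(x_minus)) <= rayleigh(x_prime) + 1e-12
```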
To recap:
- The lower bound was all about Courant–Fischer and recognizing the role of the Laplacian in this proof. All we had to do was convert a cut into a vector.
- The upper bound required making a sparse cut from a vector, which is something one can do using Fiedler's algorithm. However, are we producing a sparse enough cut?
- We first prove that we do if the vector is nonnegative, using the probabilistic method and an averaging argument. Our distribution of choice is a little funky, but it makes sense if we look closely at the calculations.
- Then we extend the lemma to all $x \perp \mathbf{1}_n$ by looking at the positive and negative entries of $x$ separately.