Fourier integral operators

In this post I want to summarize the results of Hörmander’s paper “Fourier Integral Operators I”. I read this paper last summer, but at the time I did not appreciate the geometric aspects of the theory, so here I record the results for my own future reference, with a greater emphasis on the geometry.

Generalizing pseudodifferential calculus.
We start by recalling the definition of pseudodifferential calculus on $\mathbb R^n$ .

Definition. A pseudodifferential operator is an operator P of the form

$\displaystyle Pu(x) = \iint_{T^* \mathbb R^n} e^{i(x - y)\xi} a(x, y, \xi) u(y) ~dyd\xi$

acting on Schwartz space, where $dyd\xi$ is the measure induced by the symplectic structure on the cotangent bundle and a is the symbol. We also call P a quantization of a.

Pseudodifferential operators are useful in the study of elliptic PDE, essentially because if P is elliptic of symbol a, then 1/a is only singular on a compact set in each cotangent space, so if we are willing to restrict to Schwartz functions u which are bandlimited to high frequency, and we are willing to ignore the fact that $a \mapsto P$ is not quite a morphism of algebras (essentially since symbols commute but pseudodifferential operators do not), we can “approximately invert” P by quantizing 1/a.

However, this method of “approximate inversion” does not work for hyperbolic operators, essentially because the singular set of the inverse 1/a of a hyperbolic symbol a is asymptotically the cone bundle of null covectors (with respect to the Lorentz structure induced by a). To fix this problem, one defines the notion of a Fourier integral operator

$\displaystyle Pu(x) = \iint_{\mathbb R^{n + N}} e^{i\phi(x, y, \xi)} a(x, y, \xi) u(y) ~dyd\xi$

where the so-called operator phase $\phi$ is positively homogeneous of degree 1 on each fiber of $\mathbb R^{n + N} \to \mathbb R^n$ and is smooth away from the zero section. We also require that for every x, $\phi(x, \cdot, \cdot)$ has no critical point away from the zero section, and similarly for every y.

For example, the solution to the wave equation $\partial_t^2 u = \Delta u$ with initial data $u(0, \cdot) = 0$ , $\partial_t u(0, \cdot) = f$ is

$\displaystyle u(t, x) = (2\pi)^{-n} (2i)^{-1} \int_{T^* \mathbb R^n} (e^{i\phi_+(x, y, \xi)} - e^{i\phi_-(x, y, \xi)}) |\xi|^{-1} f(y) ~dyd\xi$

where $\phi_\pm(x, y, \xi) = (x - y)\xi \pm t|\xi|$ . Thus, for each fixed t, the solution map is a sum of Fourier integral operators.
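As a quick sanity check on the phases $\phi_\pm$ , here is a minimal sketch (one space dimension, $y = 0$ , standard library only) which uses central differences to confirm that each plane wave $e^{i\phi_\pm}$ solves $\partial_t^2 u = \partial_x^2 u$ . The function names and the step size h are my own choices for illustration.

```python
import cmath


def plane_wave(t, x, xi, sign):
    """e^{i phi_pm} with y = 0, where phi_pm = x*xi + sign * t * |xi|."""
    return cmath.exp(1j * (x * xi + sign * t * abs(xi)))


def wave_residual(t, x, xi, sign, h=1e-3):
    """Central-difference approximation of |(d_t^2 - d_x^2) e^{i phi_pm}|."""
    def u(tt, xx):
        return plane_wave(tt, xx, xi, sign)

    u_tt = (u(t + h, x) - 2 * u(t, x) + u(t - h, x)) / h ** 2
    u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h ** 2
    return abs(u_tt - u_xx)


# Both signs give a residual at the level of rounding error.
print(wave_residual(0.7, 0.3, 3.0, +1))
print(wave_residual(0.7, 0.3, 3.0, -1))
```

Since the wave equation is linear, any superposition of these plane waves over $\xi$ is again a solution, which is all that the integral formula is doing.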

Equivalence of phase.
Given a Fourier integral operator P, of operator phase $\phi$ and symbol a, we can isolate its Schwartz kernel

$\displaystyle P(x, y) = \int_{\mathbb R^N} e^{i\phi(x, y, \xi)} a(x, y, \xi) ~d\xi$

using Fubini’s theorem. We call P properly supported if the two projections from the support of the Schwartz kernel to the factors $\mathbb R^n$ are proper maps. Once we restrict to Fourier integral operators of proper support, there is no particular reason to keep dividing the domain of the Schwartz kernel into $(x, y)$ and so we might as well study the following class of distributions:

Definition. An oscillatory integral is a distribution of the form

$\displaystyle P(x) = \int_{\mathbb R^N} e^{i\phi(x, \xi)} a(x, \xi) ~d\xi$

where $\phi$ is a phase, thus is positively homogeneous of degree 1 and smooth away from the zero section, and a is a symbol.

In particular, the Schwartz kernel of a Fourier integral operator is an oscillatory integral.

We put an equivalence relation on phases by saying that two phases are equivalent if they induce the same set of oscillatory integrals. Let $\phi$ be a phase on $X \times \mathbb R^N$ and define the critical set

$\displaystyle C = \{(x, \xi): \partial_\xi \phi(x, \xi) = 0\}$ .

Then the differential of $\phi(\cdot, \xi)$ restricts to a map $C \to T^* X \setminus 0$ given by $(x, \xi) \mapsto (x, \partial_x \phi(x, \xi))$ , and (for a nondegenerate phase) the image of C is an immersed conic Lagrangian submanifold of the cotangent bundle. Moreover, if two phases are equivalent in a neighborhood of x, then they induce the same Lagrangian submanifold.
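To make the critical set concrete, here is a worked computation for the two phases that appeared above, treating $(x, y)$ as the base variable and t as a fixed parameter. This is only a sketch of the standard computation, not something taken from the paper.

```latex
% Pseudodifferential phase: \phi(x, y, \xi) = (x - y)\xi
\partial_\xi \phi = x - y = 0
  \quad\Longrightarrow\quad
  C = \{(x, y, \xi) : x = y\},
\qquad
(x, y, \xi) \mapsto (x, y;\ \partial_x \phi, \partial_y \phi) = (x, x;\ \xi, -\xi).

% Wave phase: \phi_\pm(x, y, \xi) = (x - y)\xi \pm t|\xi|
\partial_\xi \phi_\pm = (x - y) \pm t\,\frac{\xi}{|\xi|} = 0
  \quad\Longrightarrow\quad
  |x - y| = |t|,
\qquad
(x, y, \xi) \mapsto (x, y;\ \xi, -\xi)
  \quad\text{with}\quad x - y = \mp t\,\frac{\xi}{|\xi|}.
```

So the first Lagrangian is the conormal bundle of the diagonal, while the second lives over the light cone $|x - y| = |t|$ , matching the fact that singularities of solutions of the wave equation travel at unit speed.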

The local theory is as follows.

Theorem. Let X be an open subset of $\mathbb R^n$ , let $\phi_i$ , $i \in \{1, 2\}$ , be phases defined in neighborhoods of $(x, \theta_i) \in X \times \mathbb R^N$ which induce a Lagrangian submanifold $\Lambda$ of $T^* X$ . Then:

Let $s_i$ be the signature of the Hessian tensor $\partial_\xi^2 \phi_i(x, \xi)$ . Then $s_1 - s_2$ is a locally constant, integer-valued function.

If A is an oscillatory integral with phase $\phi_1$ and symbol $a_1$ , then there exists a symbol $a_2$ such that A is also an oscillatory integral with phase $\phi_2$ and symbol $a_2$ .

Let
$\displaystyle d_i = (\partial_\xi \phi_i)^* \delta_0$

where $\delta_0$ is the Dirac measure on $\mathbb R^N$ . Then modulo lower-order symbols, we have

$\displaystyle i^{s_1/2} a_1(x, \xi) \sqrt{d_1} = i^{s_2/2} a_2(x, \xi) \sqrt{d_2}$ (1)

on $\Lambda$ .

We may take $a_i$ to be supported in an arbitrarily small neighborhood of $\Lambda$ without affecting A modulo lower order terms.

The first three claims here are given by Theorem 3.2.1 in the paper, while the last essentially follows from the first three and an integration by stationary phase.

Cohomology of oscillatory integration.
The above theorem is fine if we have a global coordinate chart, but the formula (1) looks something like the formula relating the sections of a sheaf. Actually, since $\sqrt{d_i}$ is the formal square root of a measure, it can be viewed intrinsically as a half-density, that is, the formal square root of an unsigned volume form. This is very advantageous to us, because ultimately we want to be able to pair the oscillatory integrals we construct with elements of $L^2(X)$ (at least for symbols of order -m where m is large enough). If we do not have a canonical volume form on X, then elements of $L^2(X)$ are not functions but rather half-densities, and therefore we can pair an oscillatory integral with an element of $L^2(X)$ at least locally.

Let $\Omega^{1/2}$ be the half-density sheaf of a Lagrangian submanifold $\Lambda$ of a given symplectic manifold. We want to define a symbol to be a kind of section of $\Omega^{1/2}$ , but the dimension of integration N is not quite intrinsic to an oscillatory integral (even though in practice we will take N to be the dimension of $\Lambda$ ) and neither is the signature s of the Hessian tensor of a phase $\phi$ associated to $\Lambda$ . However, what is true is that s – N mod 2 is intrinsic, so given data $(a_j, \phi_j)$ defining an oscillatory integral in an open set $U_j \cong \mathbb R^n$ in $\Lambda$ , we let

$\displaystyle \sigma_{jk} = \frac{(s_k - N_k) - (s_j - N_j)}{2}$

which defines a continuous function $U_j \cap U_k \to \mathbb Z$ . Chasing the definition of a Čech cochain around, it follows that $\sigma$ drops to an element of the cohomology group $\sigma \in H^1(\Lambda, \mathbb Z/4)$ . We recall that since $i^4 = 1$ , $i^{\sigma_{jk}}$ is well-defined (since $\sigma_{jk} \in \mathbb Z/4$ ).

Definition. The Maslov line bundle of $\Lambda$ is the line bundle L on $\Lambda$ such that for sections $a_j$ defined on $U_j$ , we have $i^{\sigma_{jk}} a_j = a_k$ .
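The bookkeeping behind this definition is easy to get wrong, so here is a minimal sketch of it. Each chart is represented by a pair (s, N); the three charts below are hypothetical data that I made up, chosen so that s - N has the same parity everywhere. The sketch checks that $\sigma_{jk}$ satisfies the cocycle condition and that the transition factors $i^{\sigma_{jk}}$ multiply accordingly.

```python
POWERS_OF_I = (1, 1j, -1, -1j)  # i^0, i^1, i^2, i^3


def sigma(chart_j, chart_k):
    """sigma_jk = ((s_k - N_k) - (s_j - N_j)) / 2 for charts given as (s, N) pairs."""
    (s_j, n_j), (s_k, n_k) = chart_j, chart_k
    twice_sigma = (s_k - n_k) - (s_j - n_j)
    if twice_sigma % 2 != 0:
        raise ValueError("s - N must have the same parity in both charts")
    return twice_sigma // 2


def transition(chart_j, chart_k):
    """The transition factor i^{sigma_jk}, which only depends on sigma_jk mod 4."""
    return POWERS_OF_I[sigma(chart_j, chart_k) % 4]


# Hypothetical (signature, fiber dimension) data on three overlapping charts.
charts = [(1, 1), (0, 2), (-3, 3)]
a, b, c = charts
print(sigma(a, b), sigma(b, c), sigma(a, c))
print(transition(a, b) * transition(b, c) == transition(a, c))
```

The lookup table is used instead of a floating-point power so that the transition factors are exact fourth roots of unity.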

So now if we absorb a factor of $i^s$ into a, then a is honestly a section of L, and if we absorb a factor of $\sqrt{d}$ , then a is a section of $L \otimes \Omega^{1/2}$ . Moreover, L is defined independently of anything except $\Lambda$ , so we in fact have:

Theorem. Up to lower-order terms, there is an isomorphism between symbols valued in $L \otimes \Omega^{1/2}$ and oscillatory integrals whose Lagrangian submanifold is $\Lambda$ .

Canonical relations.
We now return to the case that the oscillatory integral A is the Schwartz kernel of a Fourier integral operator, which we also denote by A. Actually we will be interested in a certain kind of Fourier integral operator, and we will redefine what we mean by “Fourier integral operator” to make that precise.

Definition. Let X, Y be manifolds such that the natural symplectic forms on $T^*X, T^*Y$ are denoted by $\sigma_X, \sigma_Y$ . A canonical relation $C: Y \to X$ is a closed conic Lagrangian submanifold of $T^* Y \times T^* X \setminus 0$ with respect to the symplectic form $\sigma_X - \sigma_Y$ .

The intuition here is that if such a set C is (the graph of) a function, then C is a canonical relation iff C is a canonical transformation. We will mainly be interested in the case that C is a symplectomorphism, and thus is a canonical transformation. However, there is no harm in extending everything to the category of manifolds where the morphisms are canonical relations, or more precisely local canonical graphs (which we define below). Thus we come to the main definition of the paper:

Definition. Let $C: Y \to X$ be a canonical relation. A Fourier integral operator with respect to C is an operator $A: C^\infty_c(X) \to C^\infty_c(Y)'$ such that the Schwartz kernel of A is an oscillatory integral whose Lagrangian submanifold is C.

Definition. A local canonical graph $C: Y \to X$ is a canonical relation C such that the projection $\Pi_C: C \to T^* Y$ is an immersion.

In particular, the graph of a canonical transformation is a local canonical graph. “Locality” here means that $\Pi_C$ is an immersion; obviously it is a submersion, so the only reason that $\Pi_C$ is not a diffeomorphism (and hence C is the graph of a canonical transformation) is that $\Pi_C$ is not assumed to be injective. The reason why it is useful to restrict to the category of local canonical graphs is that in that category, we have a natural measure $\omega = \Pi_C^* \sigma_Y^n$ on C, which induces a natural isomorphism $a \mapsto a\sqrt \omega$ between functions and half-densities. Thus the symbol calculus greatly simplifies, as we can define a symbol in this case to just be a section of the Maslov sheaf. What’s annoying is that if C is a local canonical graph, then X and Y have the same dimension, making it hard to study Fourier integral operators between manifolds of different dimension.

As an application, pseudodifferential operators on manifolds have an intrinsic definition:

Definition. Suppose that X, Y have the same dimension. A pseudodifferential operator $A: C^\infty_c(X) \to C^\infty_c(Y)'$ is a Fourier integral operator whose Lagrangian submanifold is the graph of the identity.

The paper closes by discussing adjoints and products of Fourier integral operators, and showing that they map Sobolev spaces to Sobolev spaces in the usual way.

The normal bundle to the Devil’s staircase and other questions that keep me up at night

It has recently come to my attention that one can define the normal vector field to certain extremely singular “submanifolds” or “subvarieties” of a manifold or variety. I’m using scare quotes here because I’m pretty sure that nobody in their right mind would actually consider such a thing a manifold or variety. In the case of the standard Devil’s staircase (whose construction I will shortly recall) I believe that this vector field should be explicitly computable, though I haven’t been able to figure out how to do it.

Let us begin with the abstract definition of a Devil’s staircase:

A Devil’s staircase is a curve $\gamma: [0, 1] \to M$ in a surface M such that we can find local coordinates (x, y) for M around some point on the curve, such that in those coordinates we can view $\gamma$ as the parametrization of a continuous nonconstant function F such that $F'(x) = 0$ away from a set of Lebesgue measure zero.

In other words, F looks constant, except in infinitesimally small line segments where F grows too fast to be differentiable (or even absolutely continuous).

The standard Devil’s staircase is constructed from the usual Cantor set in $[0, 1]$ . To construct the Cantor set C, we start with a line segment, and split it into equal thirds. We then discard the middle third, leaving us with two equal-length line segments. We iterate this process infinitely many times. Clearly we can identify the points that we’re left with with the paths through a full infinite binary tree, so the Cantor set is uncountable[1].

The Cantor set comes with a natural probability measure, called the Cantor measure. One can define it by flipping a fair coin every time you split the interval into thirds. If you flip to heads, you move to the left segment; if you flip to tails, you move to the right segment. After infinitely many coin flips, you’ve ended up at a point in the Cantor set. Thinking of the Cantor set as a subset of $[0, 1]$ , you can define the cdf F of the Cantor measure, called the Cantor function:

Choose a random number $0 \leq P \leq 1$ using the Cantor measure. If $x < y$ are real numbers, then the Cantor function F is defined by declaring that $F(y) - F(x)$ is the probability that $x < P \leq y$ . The standard Devil’s staircase is the graph of the Cantor function.
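Both the coin-flip description and the Cantor function are easy to compute with. Below is a minimal sketch, assuming we only ever look at finitely many flips (followed by heads forever) and finitely many ternary digits; the names `cantor_point` and `cantor_function` and the `depth` parameter are my own.

```python
from fractions import Fraction


def cantor_point(flips):
    """Point of the Cantor set reached by the given flips, followed by heads forever.

    flips is a sequence of 0 (heads, go left) and 1 (tails, go right).
    """
    return sum(Fraction(2 * flip, 3 ** (k + 1)) for k, flip in enumerate(flips))


def cantor_function(x, depth=40):
    """Approximate F(x) by reading off the first `depth` ternary digits of x."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)  # next ternary digit of x
        x -= digit
        if digit == 1:  # x is in the closure of a removed middle third, where F is constant
            return value + scale
        value += scale * (digit // 2)  # ternary digit 2 becomes binary digit 1
        scale /= 2
    return value


point = cantor_point([1, 0, 1])  # tails, heads, tails
print(point, cantor_function(point))
```

In other words, F turns the ternary digits 0 and 2 of a point of the Cantor set into the binary digits 0 and 1, so F evaluated at the point reached by a sequence of flips is the binary number spelled out by those flips.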

It is easy to see that the standard Devil’s staircase is an abstract Devil’s staircase. First, the length of an interval in the nth stage of the Cantor set construction is $3^{-n}$ and there are $2^n$ such intervals; it follows that the Cantor set has length at most $(2/3)^n$ . Since n was arbitrary, the Cantor set has Lebesgue measure zero. Outside the Cantor set, F is locally constant, so $F' = 0$ there. Finally, since F is the cdf of a probability measure on $[0, 1]$ with no atoms, it is a continuous surjective map $F: [0, 1] \to [0, 1]$ .
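The counting argument can be checked directly with exact rational arithmetic. This is just a sketch for small n, with function names of my own choosing:

```python
from fractions import Fraction


def cantor_stage(n):
    """Closed intervals (a, b) remaining after n rounds of removing middle thirds."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        next_intervals = []
        for a, b in intervals:
            third = (b - a) / 3
            next_intervals.append((a, a + third))  # left third
            next_intervals.append((b - third, b))  # right third
        intervals = next_intervals
    return intervals


def total_length(intervals):
    """Sum of the lengths of a list of disjoint intervals."""
    return sum(b - a for a, b in intervals)


for n in range(5):
    stage = cantor_stage(n)
    print(n, len(stage), total_length(stage))
```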

The Devil’s staircase is extremely useful as a counterexample, as it is about as singular as a curve of bounded variation can be, so heuristically, if we want to know if we can carry out some operation on curves of bounded variation, then it should suffice to check on Devil’s staircases.

Let me now construct the normal bundle to the standard Devil’s staircase[2]. For every smooth vector field X on $[0, 1]^2$ , we define $\int_{[0, 1]^2} X ~d\omega = \int_{\{u \leq 0\}} \nabla \cdot X$ , where $u(x, y) = y - F(x)$ , so that $\{u \leq 0\}$ is the region under the staircase. Then $X \mapsto \int_{[0, 1]^2} X ~d\omega$ can be shown to be bounded in the $L^\infty$ norm, so it extends to every continuous vector field on $[0, 1]^2$ and hence defines a covector-valued Radon measure $\omega$ by the Riesz-Markov representation theorem. On the other hand, the divergence theorem says that if an open set $U$ has a smooth boundary, then $\int_U \nabla \cdot X$ is the integral of the normal part of X to $\partial U$ . In other words, integrating against $d\omega$ should represent “integrating the part of the vector field which is normal to the Devil’s staircase”.

We can take the total variation $|\omega|$ of $\omega$ , and by the Lebesgue differentiation theorem[3], one can show that the 1-form $\alpha(x) = \lim_{r \to 0} \omega(B(x, r))/|\omega|(B(x, r))$ exists for $|\omega|$ -almost every x. But $|\omega|$ is the Hausdorff length measure on the Devil’s staircase, and the Devil’s staircase can be shown to have length 2, yet the parts which are horizontal just have length 1. Therefore $\alpha(x)$ must be defined for some x which is not in the horizontal part of the Devil’s staircase. Sharpening $\alpha$ , we obtain the normal vector field to the Devil’s staircase.
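The claim that the staircase has length 2 can at least be made plausible numerically. The sketch below (my own illustration, not a proof) computes the length of the polygon inscribed in the graph of F with vertices at the endpoints of the stage-n intervals: F is constant on each removed gap, and over each of the $2^n$ remaining intervals it rises by $2^{-n}$ across a width of $3^{-n}$ .

```python
import math


def inscribed_length(n):
    """Length of the polygon through the graph of F at the stage-n interval endpoints."""
    flat = 1 - (2 / 3) ** n  # total width of the removed gaps, where F is constant
    steep = 2 ** n * math.hypot(3.0 ** -n, 2.0 ** -n)  # 2^n chords of width 3^-n, rise 2^-n
    return flat + steep


for n in (0, 1, 5, 10, 40):
    print(n, inscribed_length(n))
```

The lengths increase toward 2 as n grows: the flat part contributes 1 in the limit, and the chords over the Cantor set contribute the other 1, even though the Cantor set has Lebesgue measure zero.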

To see that the sharp of $\alpha$ is really worthy of being called a normal vector field, we first observe that it has length 1 by definition, and second observe that for every vector field X, $\int_\gamma (X, \alpha) ~ds = \int_\gamma (X, \alpha) ~d|\omega| = \int_\gamma X ~d\omega$ where $ds$ is arc length. So pairing against $\alpha$ and then integrating against arc length is integrating the part of the vector field which is normal to the staircase.

The Lebesgue differentiation theorem is far from constructive. So what is the normal vector field to the Devil’s staircase? There should be some nice way to compute the normal field over some point P in the Cantor set in terms of how “dense” the Cantor set is at that point, say in terms of the $(\log 2/\log 3)$ -dimensional Hausdorff measure of small balls around P. That, in turn, should be computable in terms of the infinite binary string which defines P. But I don’t know how to do that. I’d love to talk about this problem with you, if you do have an idea.

[1] In fact the Cantor set is a totally disconnected and perfect, compact metrizable space, which characterizes it up to homeomorphism. We could also characterize it up to homeomorphism as the initial object in the category of compact metrizable spaces modulo automorphism.

[2] Actually, the reason that I started looking into this stuff is that I needed to define a normal bundle to extremely singular closed submanifolds of general manifolds. If one wants a definition that does not require a choice of trivialization of the tangent bundle or Riemannian metric, I think one needs the notion of a “bundle-valued Radon measure”. More on that soon…if my definition works.

[3] One needs to use a more general Lebesgue differentiation theorem to do this. In particular, one needs to use the Besicovitch covering lemma in the proof. This raises an interesting question, since the Besicovitch covering lemma has an, apparently, combinatorial constant, which I will call the Besicovitch number. Is there a nice way to compute the Besicovitch number of a Riemannian manifold? Some cute algorithm maybe?

Some common beginners’ proof errors

Recently, both through grading proofs and trying to teach some new math majors how to write proofs, I’ve had the opportunity to see a lot of invalid proofs. I want to record here some of the more common errors that invalidate an argument.

Compilation errors. When grading a huge stack of problem sets, I kind of feel like a compiler. I go through each argument and stop once I run into an error. If I can guess what the author meant to say, I throw a warning (i.e. take off points) but continue reading; otherwise, I crash (i.e. mark the proof wrong).

So by a compilation error, I just mean an error in which the student had an argument which is probably valid in their heads, but when they wrote it down, they mixed up quantifiers, used an undefined term, wrote something syntactically invalid, or similar. I believe that these are the most common errors. Here are some examples of sentences in proofs that I would consider as committing a compilation error:

For a conic section $C$ , $\Delta = b^2 - 4ac$ .

Here the variables $\Delta,a,b,c$ are undefined at the time that they are used, while the variable $C$ is never used after it is bound. From context, I can guess that $\Delta$ is supposed to be the discriminant of $C$ , and $C$ is supposed to be the conic section in the $(x,y)$ -plane cut out by the equation $ax^2 + bxy + cy^2 + dx + ey + f = 0$ where $(a,b,c,d,e,f)$ are constants. So this isn’t too egregious but it is still an error, and in more complicated arguments could potentially be a serious issue.

There is another problem with this example. We use “For” without a modifier “For every” or “For some”. Does just one conic section satisfy the equation $\Delta = b^2 - 4ac$ , or does every conic section satisfy this equation? Of course, the author meant that every conic section satisfies this equation, and in fact probably meant this equation to be a definition of $\Delta$ . So this compilation error can be fixed by instead writing:

Let $C$ be the conic section in the $(x,y)$ -plane defined by the equation $ax^2 + bxy + cy^2 + dx + ey + f = 0$ . Then let $\Delta = b^2 - 4ac$ be the discriminant of $C$ .
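In the compiler analogy, the fixed sentence is the version in which every name is bound before it is used. Here is a small sketch of the same idea in code; the function names and the returned strings are my own, and the classification by the sign of $\Delta$ is the standard one for nondegenerate conics.

```python
def discriminant(a, b, c):
    """Discriminant of the conic a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0."""
    return b * b - 4 * a * c


def conic_type(a, b, c):
    """Classify a nondegenerate conic by the sign of its discriminant."""
    delta = discriminant(a, b, c)
    if delta < 0:
        return "ellipse"
    if delta == 0:
        return "parabola"
    return "hyperbola"


print(conic_type(1, 0, 1))   # x^2 + y^2 - 1 = 0
print(conic_type(1, 0, 0))   # x^2 - y = 0
print(conic_type(1, 0, -1))  # x^2 - y^2 - 1 = 0
```

A Python interpreter would reject the original sentence for the same reason a grader does: `delta = b * b - 4 * a * c` raises a `NameError` if a, b, c were never defined.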

Here’s another compilation error:

Let $V$ be a $3$ -dimensional vector space. For every $x, y \in V$ , define $f(x, y) = xy$ .

Here the author probably means that $f(x, y)$ is the cross-product, or the wedge product, or the polynomial product, or the tensor product, or some other kind of product, of $x,y$ . But we don’t know which product it is! Indeed, $V$ is just some three-dimensional vector space, so it doesn’t come with a product structure. We could fix this by writing, for example:

Let $V = \mathbb R^3$ , and for every $x, y \in V$ , define $f(x, y) = x \times y$ for the cross product of $x,y$ .

We have seen that compilation errors are usually just caused by sloppiness. That doesn’t mean that compilation errors can’t point to a more serious problem with one’s proof: they could, for example, obscure a key step in the argument which is actually fatally flawed. Arguably, this is the case with Shinichi Mochizuki’s infamous incorrect proof of Szpiro’s Conjecture. However, I think that most beginners can avoid compilation errors by making sure that they define every variable before using it, are never ambiguous about whether they mean “for every” or “for some”, and are otherwise very careful in their writing. And beginners should avoid using symbol-soup whenever possible, in favor of the English language. If you ever write something like

Suppose that $f: ~\forall \varepsilon > 0 \exists \delta > 0 \forall(x,y:|x-y| < \delta) (|f(x) - f(y)| < \varepsilon)$ .

I will probably take off points, even though I can, in principle, parse what you’re trying to say. The reason is that you could just as well write

Suppose that $f: A \to \mathbb R$ is a function, and for every $\varepsilon > 0$ we can find a $\delta > 0$ such that for any $x, y \in A$ such that $|x - y| < \delta$ , $|f(x) - f(y)| < \varepsilon$ .

which is much easier to read.

Edge-case errors. An edge-case error is an error in a proof, where the proof manages to cover every case except for a single special case where the proof fails. These errors are also often caused by sloppiness, but are more likely to actually be a serious flaw in an argument than a compilation error. They also tend to be a lot harder to detect than compilation errors. Here’s an example:

Let $f: X \to Y$ be a function. Then there is some $y \in Y$ in the image of $f$ .

Do you see the problem? Don’t read ahead until you try to find it for a few minutes.

Okay, first of all, if you read ahead without trying to find the problem, shame on you; second of all, if you’ve written something like this, don’t feel shame, because it’s a common mistake. The issue, of course, is that $X$ is allowed to be the empty set, in which case $f$ is the infamous empty function into $Y$ .

Most of the time the empty function isn’t too big of an issue, but it can come up sometimes. For example, the fact that the empty function exists means that arguably $0^0 = 1$ , which is problematic because it means that the function $x \mapsto 0^x$ is not continuous (since if $x > 0$ then $0^x = 0$ ).
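One way to convince yourself that the empty function really exists, and that it is the reason for $0^0 = 1$ , is to count functions by brute force. The sketch below is my own illustration: it lists every tuple of values that a function from a finite domain to a finite codomain could take.

```python
from itertools import product


def count_functions(domain, codomain):
    """Count functions domain -> codomain by listing every possible tuple of values."""
    return sum(1 for _ in product(codomain, repeat=len(domain)))


print(count_functions([1, 2], ["a", "b", "c"]))  # 3^2 functions
print(count_functions([], ["a", "b", "c"]))      # only the empty function
print(count_functions([], []))                   # 0^0: still the empty function
print(count_functions([1], []))                  # 0^1: no functions at all
```

Python’s integer arithmetic agrees with this count: `0 ** 0` evaluates to 1.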

Here’s an example from topology:

Let $X$ be a connected space and let $x_1,x_2 \in X$ . Then let $\gamma$ be a path from $x_1$ to $x_2$ .

In practice, most spaces we care about are quite nice — manifolds, varieties, CW-complexes, whatever. In such spaces, if they are connected we can find a path between any two points. However, this is not true in general, and the famous counterexample is the topologist’s sine curve. The point is that it’s very important to make sure you get your assumptions right — if you wrote this in a proof there’s a good chance it would cause the rest of the argument to fail, unless you had an additional assumption that the space $X$ did in fact have the property that connected implied path-connected.

In general, a good strategy to avoid errors like the above error is to beware of the standard counterexamples of whatever area of math you are currently working in, and make sure none of them can sneak past your argument! One way to think about this is to imagine that you are Alice, and Bob is handing you the best counterexamples he can find for your argument. You can only beat Bob if you can demonstrate that none of his counterexamples actually work.

Let me also give an example from my own work.

Let $X$ be a bounded subset of $\mathbb R$ . Then the supremum of $X$ exists and is an element of $\mathbb R$ .

It sure looks like this statement is true, since $\mathbb R$ is a complete order. But in fact, $X$ could be the empty set, in which case every real number is an upper bound on $X$ and so $\sup X = -\infty$ . In most cases, the reaction would be “So what? It’s just an edge case error.” But actually, in my case, I later discovered that the thing I was trying to prove was only interesting in the case that $X$ was the empty set, in which case this step of the argument immediately fails. A month later, I’m still not sure what to do to get around this issue, though I have some ideas.
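Programming languages run into exactly the same edge case. In Python, `max([])` raises a `ValueError` unless the caller says what the answer for the empty set should be. Here is a minimal sketch for finite sets of reals (where the supremum is just the maximum), with a helper name of my own:

```python
import math


def sup_finite(xs):
    """Supremum of a finite set of reals, with the convention sup(empty set) = -infinity."""
    return max(xs, default=-math.inf)


print(sup_finite([1.0, 3.0, 2.0]))
print(sup_finite([]))
```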

Fatal errors. These are errors which immediately cause the entire argument to fail. If they can be patched, so much the better, but unlike the other two types of errors that can usually be worked around, a fatal error often cannot be repaired.

The most common fatal error I see in beginners’ proofs is the circular argument, as in the below example:

We claim that every vector space is finite-dimensional. In fact, if $\{v_1, \dots, v_n\}$ is a basis of the vector space $V$ , then $\dim V = n < \infty$ , which proves our claim.

If you read a standard textbook on linear algebra, they will certainly assume that given a vector space $V$ , you can find a basis $\{v_1, \dots, v_n\}$ of $V$ . But in fact, such a finite basis only exists if, a priori, $V$ is finite-dimensional! So all the student here has managed to prove is that if $V$ is a finite-dimensional vector space, then $V$ is a finite-dimensional vector space… not very interesting.

(This is not to say that there are no almost-circular arguments which do prove something nontrivial. Induction is a form of this, as is the closely related “proof by a priori estimate” technique used in differential equations. But if one looks closely at these arguments, one will see that they are not, in fact, circular.)

The other kind of fatal error is similar: there’s some sneaky assumption used in the proof, which isn’t really an edge case assumption. I have blogged about an insidious such assumption, namely the continuum hypothesis. In general, these assumptions often are related to edge-case issues, but may even happen in the generic case, as you mentally make an assumption that you forget to keep track of. Here is another example, also from measure theory:

Let $X$ be a Banach space and let $F: [0, 1] \to X$ be a bounded measurable function. Then we can find a sequence $(F_n)$ of simple measurable functions such that $F_n \to F$ almost everywhere pointwise and $(F_n)$ is Cauchy in mean, so we define $\int_0^1 F(x) ~dx = \lim_{n \to \infty} \int_0^1 F_n(x) ~dx$ .

This definition looks like the standard definition of an integral in any measure theory course. However, without a stronger assumption on $X$ , it’s just nonsense. For one thing, we haven’t shown that the definition of $\int_0^1 F(x) ~dx$ doesn’t depend on the choice of $(F_n)$ . That can be fixed. What cannot be fixed is that $(F_n)$ might not exist at all! This can happen if $X$ is not separable, since an almost-everywhere limit of simple functions takes values in a separable subspace off a null set, and then the definition of the integral is nonsense.

This sort of fatal error is particularly tricky to deal with when one is first learning a more general version of a familiar theory. Most undergraduates are familiar with linear algebra, and the fact that every finite-dimensional vector space has a basis. In particular, every element of a vector space can be written uniquely in terms of a given basis. So when one first learns about finite abelian groups, they might be tempted to write:

Let $G$ be a finite abelian group, and let $g_1, \dots, g_n$ be a minimal set of generators of $G$ . Then for every $x \in G$ there are unique $x_1, \dots, x_n \in \mathbb Z$ such that $x = x_1g_1 + \cdots + x_n g_n$ .

In fact, the counterexample here is $G = \mathbb Z/2$ , $n = 1$ , $g_1 = 1$ , and $x = 0$ , because we can write $0 = 0g_1 = 2g_1 = 4g_1 = \cdots$ . So, when generalizing a theory, one does need to be really careful that they haven’t “imported” a false theorem to higher generality! (There’s no shame in making this mistake, though; I think that many mathematicians tried to prove Fermat’s Last Theorem but committed this error by assuming that unique factorization, which holds in everyone’s favorite ring $\mathbb Z$ , would also hold in certain rings where it in fact fails.)
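Playing the role of Bob, here is a brute-force search for the counterexample in a cyclic group. This is only a sketch for the case of a single generator of $\mathbb Z/m$ , and the function name and the range of coefficients are arbitrary choices of mine.

```python
def representations(x, generator, modulus, coefficient_range):
    """All integers k in coefficient_range with k * generator congruent to x mod modulus."""
    return [k for k in coefficient_range if (k * generator - x) % modulus == 0]


# G = Z/2, g_1 = 1, x = 0: every even coefficient works, so uniqueness fails.
print(representations(0, 1, 2, range(-4, 5)))
```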

What the hell is a Christoffel symbol?

I have to admit that I’ve gone a long time without really understanding the physical interpretation of the Christoffel symbols of a connection. In fact, there is an interpretation that, in the special case of the Christoffel symbols for the Levi-Civita connection in polar coordinates on Euclidean space, could be understood by me at age 16, after I took an intro physics class (though I definitely wouldn’t understand the relativistic or Yang-Millsy stuff). Here I want to record it. As usual, I’m pretty sure that everything here is very well-known, but I want to write it all down for my own intuition.

Let D be the covariant derivative of a connection on a vector bundle E. Given a coordinate frame e, one defines the Christoffel symbols by $D_j e_k = {\Gamma^i}_{jk} e_i$ . Here and always we use Einstein’s convention.

The Levi-Civita connection. Suppose E is the tangent bundle of spacetime and D is the Levi-Civita connection of the metric. Then for any free-falling particle with velocity v and acceleration a, one has the relativistic form of Newton’s first law of motion $a^k + {\Gamma^k}_{ij} v^iv^j = 0$ , which to mathematicians is more popularly known as the geodesic equation. It says that the “acceleration” in the coordinate frame e is entirely due to the fact that e itself is an accelerated frame.

Viewing $\Gamma^k$ as a bilinear form, we can rewrite Newton’s first law as $a^k = -\Gamma^k(v, v)$ , which now resembles Newton’s second law with unit mass. Indeed, the acceleration of the particle is given exactly by a quantity $-\Gamma^k(v, v) e_k$ which can be reasonably interpreted as a “force”. For example, one could consider the case that the spatial origin is a particle P which is orbiting around a point. If one believes that P really is “inertial”, then they will measure a fictitious force — the centrifugal force — acting on all objects. In general relativity, moreover, I think that the notion of “inertial” is ill-defined. In this case, if v is timelike then $\Gamma^k(v, v)$ is the acceleration due to gravity. In particular these fictitious forces all scale linearly with mass, because the geodesic equation does not have a mass factor and so we need to cancel out the factor of mass in the law $F = ma$ .
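For the Euclidean plane in polar coordinates, the only nonzero Christoffel symbols are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$ , so the geodesic equation reads $\ddot r = r\dot\theta^2$ and $\ddot\theta = -2\dot r\dot\theta/r$ . The $r\dot\theta^2$ term is the familiar centrifugal acceleration. As a sketch (the integrator, step count and initial data are my own choices), we can integrate these equations numerically and check that the “accelerated” path in polar coordinates is just a straight line in Cartesian coordinates.

```python
import math


def geodesic_rhs(state):
    """Right-hand side of a^k = -Gamma^k(v, v) in polar coordinates on the plane."""
    r, theta, dr, dtheta = state
    return (dr, dtheta, r * dtheta ** 2, -2.0 * dr * dtheta / r)


def rk4_step(state, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    def shifted(k, scale):
        return tuple(s + scale * ki for s, ki in zip(state, k))

    k1 = geodesic_rhs(state)
    k2 = geodesic_rhs(shifted(k1, h / 2))
    k3 = geodesic_rhs(shifted(k2, h / 2))
    k4 = geodesic_rhs(shifted(k3, h))
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def integrate(state, t_final, steps=1000):
    """Integrate the geodesic equation from time 0 to t_final."""
    h = t_final / steps
    for _ in range(steps):
        state = rk4_step(state, h)
    return state


# Start at Cartesian (1, 0) with Cartesian velocity (0, 1): r = 1, theta = 0, dr = 0, dtheta = 1.
r, theta, _, _ = integrate((1.0, 0.0, 0.0, 1.0), t_final=1.0)
print(r * math.cos(theta), r * math.sin(theta))  # close to the point (1, 1)
```

The particle feels no real force; all of the “acceleration” that shows up in $(r, \theta)$ comes from the Christoffel symbols of the coordinate frame.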

It will be convenient to go to another level of abstraction and view $\Gamma: T_pM \otimes T_pM \to T_pM$ as a bilinear map valued in the tangent space. In other words it is tempting to think of $\Gamma$ as a section of $T^*M \otimes T^*M \otimes TM$ . This of course presupposes that M has a trivial tangent bundle, since the Christoffel symbols are only defined locally. Putting our doubts aside, this is equivalent to thinking of $\Gamma$ as a section of $T^*M \otimes \text{End } TM$ .

Connections on G-bundles. Let me remind you that if G is a Lie group, then a G-bundle is a bundle of representations of G. Thus we can view quotients of G and its Lie algebra $\mathfrak g$ as both subsets of End E, whenever E is a G-bundle. By a gauge transformation of a G-bundle E one means a section of End E which is in fact a section of G. Thus gauge transformations act on E (and so also on End E, etc.)

If E is a G-bundle, by a covariant derivative on E I mean a covariant derivative whose Christoffel symbols $\Gamma$ are not just sections of $T^*M \otimes \text{End } E$ but in fact are sections of $T^*M \otimes \mathfrak g$ . (Briefly, the Christoffel symbols are $\mathfrak g$ -valued 1-forms.) In this case, if we have two covariant derivatives D, D’ which lie in the same orbit of the gauge transformations, we call D, D’ gauge-equivalent. We tend to think of covariant derivatives of G-bundles (modulo gauge-equivalence) as describing physical theories.

For example, consider the trivial U(1)-bundle E. This is the trivial line bundle equipped with the canonical action of U(1) on the complex numbers. A covariant derivative D on E is defined by locally giving Christoffel symbols which are $\mathfrak u(1)$ -valued 1-forms — in other words, imaginary 1-forms. A gauge transformation, then, is defined by adding an imaginary exact 1-form to the Christoffel symbols. We interpret the Christoffel symbols A as (i times) potentials for the electromagnetic field. In fact, one can take the exterior derivative of A and obtain a closed 2-form $F = dA$ , which one can view as the Faraday tensor. The fact that one can add an exact 1-form to A is exactly the gauge invariance of the Maxwell equation $*d*dA = j$ where j is the current 1-form.
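Gauge invariance of $F = dA$ is just the statement that $d(d\chi) = 0$ , and it is easy to test numerically. The sketch below works in the plane with real-valued potentials (that is, dropping the factor of i), and the particular potential A and gauge function $\chi$ are hypothetical examples of my own.

```python
import math


def curl(A, x, y, h=1e-4):
    """F_xy = d_x A_y - d_y A_x at (x, y), by central differences. A returns (A_x, A_y)."""
    dAy_dx = (A(x + h, y)[1] - A(x - h, y)[1]) / (2 * h)
    dAx_dy = (A(x, y + h)[0] - A(x, y - h)[0]) / (2 * h)
    return dAy_dx - dAx_dy


def potential(x, y):
    """A hypothetical potential A = (-y, x) / 2, whose field strength is F_xy = 1."""
    return (-y / 2, x / 2)


def gauge_transformed(x, y):
    """A + d(chi) for the hypothetical gauge function chi(x, y) = sin(x) * y**2."""
    a_x, a_y = potential(x, y)
    return (a_x + math.cos(x) * y ** 2, a_y + 2 * math.sin(x) * y)


print(curl(potential, 0.3, 0.7))
print(curl(gauge_transformed, 0.3, 0.7))
```

The two potentials differ at every point, but up to discretization error they produce the same Faraday tensor.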

So what is D in the case of electromagnetism? It acts on sections as $D_i = \partial_i + A_i$ . So for a function u (i.e. a section of the trivial bundle E) on M, $(D_i - \partial_i)u$ weights u according to the strength of the electromagnetic potential. This is mainly interesting when u is a constant function, in which case $Du = uA$ is the potential rescaled by u.

I think that the takeaway here is: the Christoffel symbols are a fictitious and local V-valued 1-form, where V is some vector bundle ( $V = \mathfrak g$ or $V = TM \otimes T^*M$ above). In any particular case they should have a nice physical interpretation but I don’t think one can interpret the Maxwell-Yang-Mills case and the Levi-Civita case as one and the same.

Linear algebra done dubiously

A book that has been a contentious topic of discussion is Linear Algebra Done Right, by Axler. The reason, at least ostensibly[1], is that Axler’s treatment avoids the discussion of determinants. For the critics’ part, Axler himself seems to play this up, marketing the book as a revolutionary treatment where determinants are not discussed. Apparently, Sergei Treil found this marketing so offensive that he wrote a competing textbook known as Linear Algebra Done Wrong.

I do not quite buy the hype here. There’s a whole chapter on determinants in Axler’s book, which even includes a discussion of Jacobian determinants. Axler just doesn’t use determinants to prove the three main theorems of intermediate linear algebra over an algebraically closed field $\overline K$ , namely the fact that every linear operator has an eigenvalue, that every linear operator has a unique Jordan canonical form, and the Cayley-Hamilton theorem. In all of these cases, one could prove the theorem using determinants, but there’s no good reason to, since there is a perfectly reasonable structure theory of linear operators over $\overline K$ which does not mention determinants, and it gives fairly easy and conceptual proofs to all three theorems.

(I don’t think Axler’s book is perfect, for the record. Most annoyingly, he doesn’t seem to clearly distinguish between theorems that are valid over general $\overline K$ and theorems that are specifically valid over $\overline K = \mathbb C$ , which is the case for most of the results in the latter half of the book, except for one single chapter about the structure theory over $\mathbb C$ . But I do think that a lot of the angry comments I’ve seen about the book on Reddit and elsewhere, which mainly focus on the issue of determinants, are just totally out to lunch.)

Anyways, it occurred to me today that the way I like to think about linear algebra involves neither determinants nor Axler’s structure theory, but is rather a complex-analytic version of linear algebra. I don’t think it essentially uses complex analysis though, and could probably be adapted to general $\overline K$ .

The point is to consider the resolvent $R(z) = (T - z)^{-1}$ of the linear operator T acting on a vector space V of dimension n, which is a rational map from $\mathbb P^1$ to a space of matrices. Clearly an eigenvalue is a pole of R, and the number of poles equals the number of zeroes (this is clearly true when $\overline K = \mathbb C$ , but I suspect it is true for arbitrary $\overline K$ ). Since R has a zero of order n at $\infty$ , T must have n eigenvalues. (If $\overline K = \mathbb C$ , Rouche’s theorem even gives a bound on the size of the eigenvalues, and a way to compute approximations to the eigenvalues.)

Now we have n eigenvalues $z_1, \dots, z_n$ counted with multiplicity. If $\overline K = \mathbb C$ , we may consider loops $\gamma_j$ around the puncture of $\mathbb C \setminus \{z_j\}$ and define $P_j = -\frac{z_j}{2\pi i} \int_{\gamma_j} R(z) ~dz$ and similarly $N_j = -\frac{1}{2\pi i} \int_{\gamma_j} (z - z_j)R(z) ~dz.$ (The minus signs appear because $R(z) = (T - z)^{-1}$ rather than $(z - T)^{-1}$ ; here $\gamma_j$ is oriented counterclockwise and encloses no eigenvalue other than $z_j$ .) It is now a straightforward consequence of Cauchy’s integral formula that $N_j$ is nilpotent and we have $T = \sum_j P_j + N_j$ . Furthermore, if $V_j$ is the image of $P_j$ , then T acts on $V_j$ as $T = z_j + N_j$ , and we have a direct sum decomposition $V = \bigoplus_j V_j$ . That implies that $T = \sum_j P_j + N_j$ is the Jordan canonical form of T. Let me leave the details to these notes of Knill. I would be very interested to see if an argument like this can be used in the general case of $\overline K$ an algebraically closed field, possibly by replacing $\gamma_j$ by the generator of some algebraic analogue of the fundamental group of the “open subscheme” (if that makes any sense) $\overline K \setminus \{z_j\}$ , and replacing the differential form $(z - z_j)R(z) ~\frac{dz}{2\pi i}$ with some sort of algebraic analogue of cohomology.
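Here is a minimal numerical sketch of the contour integrals, for 2-by-2 matrices only and in plain Python. It is an illustration under my own conventions rather than part of the argument: I integrate $(z - T)^{-1}$ (the negative of the resolvent R(z) above) over a counterclockwise circle, and the function names `contour_integral`, `riesz_projection` and `nilpotent_part` are made up for this example. The first function returns the spectral projection onto the generalized eigenspace of $z_j$ (so $P_j$ is $z_j$ times it), the second returns $N_j$ .

```python
import cmath

def resolvent2(T, z):
    """Return (z - T)^{-1} for a 2x2 matrix T, via the explicit 2x2 inverse."""
    a, b = z - T[0][0], -T[0][1]
    c, d = -T[1][0], z - T[1][1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def contour_integral(T, center, weight, radius=0.5, steps=400):
    """Approximate (1/(2 pi i)) * integral of weight(z) (z - T)^{-1} dz over the
    counterclockwise circle |z - center| = radius, using the trapezoid rule."""
    total = [[0j, 0j], [0j, 0j]]
    for k in range(steps):
        w = radius * cmath.exp(2j * cmath.pi * k / steps)  # w = z - center
        z = center + w
        res = resolvent2(T, z)
        # dz = i w dtheta, so dz / (2 pi i) contributes w / steps per node.
        for r in range(2):
            for s in range(2):
                total[r][s] += weight(z) * res[r][s] * w / steps
    return total

def riesz_projection(T, z_j):
    """Spectral projection onto the generalized eigenspace of z_j."""
    return contour_integral(T, z_j, lambda z: 1.0)

def nilpotent_part(T, z_j):
    """Nilpotent part N_j of T on the generalized eigenspace of z_j."""
    return contour_integral(T, z_j, lambda z: z - z_j)

# Distinct eigenvalues 2 and 3: projection onto the eigenline for z_j = 2.
proj = riesz_projection([[2, 1], [0, 3]], 2)
# A Jordan block with eigenvalue 2: the nilpotent part is T - 2.
nil = nilpotent_part([[2, 1], [0, 2]], 2)
```

The radius 0.5 is chosen so that the circle encloses exactly one eigenvalue in both test matrices; for other matrices it would have to be adjusted by hand.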

It remains to prove the Cayley-Hamilton theorem. (This proof, which was shown to me by Charles Pugh, is what got me thinking about linear algebra in this fashion in the first place.) Recall that the Cayley-Hamilton theorem says that if p is the characteristic polynomial of T, so that the zeroes of p are the eigenvalues of T counted with multiplicity, then p(T) = 0. This is obviously true if T is diagonalizable.

Now, the set of diagonalizable matrices is dense, because for example it includes the set of matrices with distinct eigenvalues, which is a generic set. On the other hand, the set Z of matrices with the Cayley-Hamilton property is closed, since the coefficients of p depend continuously on T and hence so does p(T). Since Z is closed and contains a dense set, we conclude that Z is the space of all $n \times n$ matrices over $\overline K$ . This argument ostensibly works over $\overline K = \mathbb C$ , but with a little work, it also holds for arbitrary $\overline K$ , because we may use the Zariski topology.
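As a sanity check of the statement (not of the density argument), here is a small sketch in exact rational arithmetic that tests Cayley-Hamilton on a matrix which is not diagonalizable. The characteristic polynomial is computed by the Faddeev-LeVerrier recursion, which uses only traces; that choice, and the helper names `char_poly` and `evaluate_at_matrix`, are mine and not taken from any of the books discussed above.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

def char_poly(A):
    """Coefficients [c_0, ..., c_n] of the monic characteristic polynomial
    sum_k c_k z^k of A, via the Faddeev-LeVerrier recursion (traces only)."""
    n = len(A)
    coeffs = [Fraction(0)] * n + [Fraction(1)]
    M = identity(n)
    for k in range(1, n + 1):
        AM = matmul(A, M)
        c = -sum(AM[i][i] for i in range(n)) / k
        coeffs[n - k] = c
        M = [[AM[i][j] + (c if i == j else 0) for j in range(n)]
             for i in range(n)]
    return coeffs

def evaluate_at_matrix(coeffs, A):
    """Horner evaluation of the polynomial with these coefficients at A."""
    n = len(A)
    result = [[Fraction(0)] * n for _ in range(n)]
    for c in reversed(coeffs):
        result = matmul(result, A)
        for i in range(n):
            result[i][i] += c
    return result

# A 2x2 Jordan block for eigenvalue 2, plus a simple eigenvalue 3.
A = [[Fraction(x) for x in row] for row in [[2, 1, 0], [0, 2, 0], [0, 0, 3]]]
p = char_poly(A)                  # coefficients of (z - 2)^2 (z - 3)
p_of_A = evaluate_at_matrix(p, A)
```

For this matrix the polynomial is $(z - 2)^2(z - 3) = z^3 - 7z^2 + 16z - 12$ , and `p_of_A` comes out as the zero matrix even though A has a nontrivial Jordan block.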

This would be a pretty horrible way to teach linear algebra, but maybe one could simplify it so that it’s not so horrible.

[1] Axler has a signature, and quite clear and amicable, writing style, unlike most older textbooks. How much of the actual debate here is just Bourbakists in shambles?

Much ado about large cardinals

Lately, with Peter Scholze’s MathOverflow post about Grothendieck universes and the Isabelle/HOL implementation of schemes, it seems that in the sphere of online math there has been a somewhat renewed interest in when large cardinals make proving theorems easier. (Specifically, it is not necessary that one actually needs the large cardinals to prove the theorem — only that it makes the proof easier!) So I thought it would be fun to look through some old homework of mine and see if I could find an example where if I had allowed myself the use of a large cardinal, my life would have been easier. I found an example from when I took a course in C*-algebras a few years ago.

Let X be a locally compact Hausdorff space. By a compactification of X we mean an open dense embedding $X \to Y$ where Y is a compact Hausdorff space. By Alexandroff’s theorem, X always has a compactification, but in general if X is not compact then X may have multiple compactifications. We consider the category Comp X of compactifications of X equipped with continuous surjections which preserve X; the Alexandroff compactification is the final object of Comp X.

The Stone–Čech theorem. The category Comp X has an initial object.

One may show that the initial object of Comp X is $\text{Spec } C_b(X)$ where $C_b(X)$ is the Banach space of bounded continuous functions on X with its supremum norm, and the functor Spec is taken in the sense of C*-algebras (thus Spec A consists of maximal closed ideals equipped with the Zariski-Jacobson topology). This proof is presumably inoffensive to anyone who accepts ZFC (and offensive to anyone who does not, since one needs Zorn’s lemma to show that $C_b(X)$ has a maximal ideal in general — and ZF alone cannot prove that Comp X has an initial object).

However, for the purposes of the result I was trying to prove, I needed a proof of the Stone–Čech theorem that did not rely on the existence of $\text{Spec } C_b(X)$ , or else my argument would have been circular. To do this, one proceeds as follows. If $Z \to Y$ is a morphism in Comp X, then since X is dense in Z, the underlying continuous surjection $Z \to Y$ is completely determined by its behavior on X, but it is also the identity on X. Therefore Comp X is a poset category. Let $\mathcal C$ be a chain in Comp X; then $\mathcal C$ is an inverse system of topological spaces, and if C is the inverse limit of $\mathcal C$ , then one can show that there is a closed embedding $C \to \prod \mathcal C$ . Since $\prod \mathcal C$ is a compact Hausdorff space by Tychonoff’s theorem, so is C. Taking the inverse limits of the open dense embeddings $X \to Y$ , where $Y \in \mathcal C$ , we obtain an open dense embedding $X \to C$ , so C is an upper bound of $\mathcal C$ in Comp X.

At this point, one may proceed in two ways. Working in ZFC, it is only valid to apply Zorn’s lemma if Comp X is equivalent to a small category, but $\text{Comp } \mathbb N$ is a large category. To see that Comp X is equivalent to a small category, it suffices to show that there is a cardinal $\kappa$ such that every compactification of X has at most $\kappa$ points; then for every compactification Y of X, one can find a compactification Z of X such that $Y \cong Z$ in Comp X, and the set-theoretic rank of Z is at most $\kappa$ , and so Comp X is a subset of the set $V_\kappa$ . Furthermore, if Y is a compactification of X and $y \in Y$ , then, since X is dense in Y, by the boolean prime ideal theorem there is an ultrafilter U on the set Open X of open subsets of X such that $\lim U = y$ . Since Y is Hausdorff, it follows that y is the UNIQUE limit of U, but some cardinal arithmetic can be used to show that if $\lambda$ is the cardinality of X, then there are at most $2^{2^\lambda}$ ultrafilters on Open X (since elements of an ultrafilter on Open X are open subsets of X), so the cardinality of Y is at most $2^{2^\lambda}$ . Therefore we may let $\kappa = 2^{2^{\lambda}}$ .

Okay, that was stupid. We can also proceed by large cardinals. The following argument feels much more conceptual to me:

Definitions. Let $\delta > \aleph_0$ be a regular cardinal. We say that $\delta$ is an inaccessible cardinal if for every cardinal $\lambda < \delta$ , $2^\lambda < \delta$ . We say that $\delta$ is a hyperinaccessible cardinal if $\delta$ is an inaccessible cardinal and there is an increasing chain of inaccessible cardinals $\delta_\alpha$ such that $\lim_\alpha \delta_\alpha = \delta$ .

Let $\delta$ be a hyperinaccessible cardinal and suppose that $\text{card }X < \delta$ . Then there are inaccessible cardinals $\text{card }X < \kappa < \kappa' < \delta$ . If $X \in V_\kappa$ and Y is a compactification of X, then Y can be obtained as an extension of the Alexandroff compactification by splitting nets, but $V_\kappa$ is a Grothendieck universe and so the topology of X can be already probed by nets in $V_\kappa$ ; therefore $Y \in V_\kappa$ . Therefore $\text{Comp } X \subseteq V_\kappa$ is a small category in $V_{\kappa'}$ , so X has a Stone–Čech compactification $\beta X$ with $\text{card } \beta X < \kappa' < \delta$ .

This argument looks verbose, but only because I have written out the details; I think in practice I would just say that if X lies underneath an inaccessible cardinal $\kappa$ , then enough nets to probe the topology of X are also under $\kappa$ , so every compactification is as well.

Sundry facts about pseudodifferential operators

In this blog post I will just record some things I’ve been trying to learn about lately, largely just so I can have a place to collect my thoughts. Most of this is in Hörmander’s monograph on differential operators, and is motivated by trying to understand Vasy’s method and Atiyah-Singer index theory.

Pseudodifferential operators on manifolds.

Let us recall that a symbol on an open subset X of $\mathbb R^d$ is by definition a smooth function on the cotangent bundle of X (for which certain seminorms are finite). This was curious to me: you can motivate it by saying that a symbol is an observable and the cotangent bundle is “phase space” in the sense that a point $(x, \xi) \in T^*X$ consists of a position x and a momentum $\xi$ , but why should the momentum live in a cotangent space and not the fiber of some other vector bundle? When we quantize a symbol a, defining an operator a(D) by formally substituting the differential operator $D = -i\nabla$ in place of the momentum, we by definition obtain a pseudodifferential operator. Now let $\kappa: X \to Y$ be a diffeomorphism, and introduce the pushforward symbol $\kappa_* a(y, \eta) = e^{-iy\eta} a(x, D) e^{iy\eta}$ , where $y = \kappa(x)$ and the operator $a(x, D)$ acts on the function $x \mapsto e^{i\kappa(x)\eta}$ . This is the “right” definition in the sense that $(\kappa_*a(y, D)u)(\kappa(x)) = a(x, D)(u \circ \kappa)(x)$ .

If a is a symbol of order m, then $\kappa_* a(y, \eta) = a(\kappa^{-1}(y), \kappa'(\kappa^{-1}(y))^t \eta)$ modulo symbols of order m – 1. But $\kappa'(x)$ is invariantly defined as an isomorphism of tangent bundles $\kappa'(x): TX \to TY$ , so its transpose should be an isomorphism $\kappa'(x)^t: T^*Y \to T^*X$ of the dual bundles. This only makes sense if $\eta \in T^*_yY$ is a covector at y.
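A one-dimensional worked example may make the formula concrete (a sketch under my own choice of symbols; the coefficient b is hypothetical). Take $d = 1$ and $a(x, \xi) = b(x)\xi^2$ , so $a(x, D) = b(x) D_x^2$ with $D_x = -i\partial_x$ . The chain rule gives

$\displaystyle D_x^2 (u \circ \kappa)(x) = \kappa'(x)^2 (D_y^2 u)(\kappa(x)) - i\kappa''(x) (D_y u)(\kappa(x)),$

so that, writing $x = \kappa^{-1}(y)$ ,

$\displaystyle \kappa_* a(y, \eta) = b(x) \kappa'(x)^2 \eta^2 - i b(x) \kappa''(x) \eta.$

The first term is $a(x, \kappa'(x)^t \eta)$ , and the second term is a symbol of order 1 = m – 1, which is exactly the error allowed by the formula.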

The above paragraphs are totally obvious, and yet puzzled me for the past three years, until last week when I sat down and decided to work out the details for myself.

The consequence is that we cannot define the symbol of a pseudodifferential operator invariantly. Rather, we declare that a pseudodifferential operator A has the property that for every chart $\kappa: X \to Y$ and every pair of cutoffs $\phi, \psi$ on Y, the operator $\phi \circ \kappa_* \circ A \circ \kappa^* \circ \psi$ is a pseudodifferential operator on Y (in the sense that it is the quantization of a symbol on Y; here the pushforward $\kappa_*$ is defined to be the inverse of the pullback $\kappa^*$ ). Since Y is an open subset of $\mathbb R^d$ this makes sense.

Previously we have discussed pseudodifferential operators on manifolds M. These can be viewed more abstractly as acting on sections of the trivial line bundle $M \times \mathbb C$ . However, in geometry one frequently has to deal with sections of more general vector bundles over M. For example, a 1-form is a section of the cotangent bundle. If E, F are vector bundles over M of rank r, s respectively, one may define the Hom-bundle Hom(E, F), which locally is isomorphic to the matrix bundle $M \times \mathbb C^{s \times r}$ . Then a pseudodifferential operator from sections of E to sections of F is nothing more than a linear map which, after trivialization of E and F, looks like an $s \times r$ matrix of pseudodifferential operators on M. The principal symbol of such an operator sends the cotangent bundle of M into the Hom-bundle Hom(E, F).

Wavefront sets.

In this section we will impose that all pseudodifferential operators have Schwartz kernels K such that the projections of supp K are both proper maps. Modulo the space $\Psi^{-\infty}$ of pseudodifferential operators of order $-\infty$ , this assumption is no loss of generality. Under this assumption, the top-order term of a symbol — that is, the principal symbol — satisfies the pushforward formula $\kappa_* a(y, \eta) = a(\kappa^{-1}(y), \kappa'(\kappa^{-1}(y))^t \eta)$ , so the principal symbol is well-defined as an element of $S^m/S^{m-1}$ (here $S^\ell$ is the $\ell$ th symbol class). The principal symbol encodes important information about the nature of the operator; for example we have:

Definition. An elliptic pseudodifferential operator of order m is one whose principal symbol is $\sim |\xi|^m$ near infinity of each cotangent space.

The important property is that if A is an elliptic pseudodifferential operator, then A is also invertible modulo the quantization $\Psi^{-\infty}$ of $S^{-\infty}$ . For example the Laplace-Beltrami operator is elliptic on Riemannian manifolds since its symbol is $\xi^2$ ; since the quadratic form induced by a Lorentzian metric is not positive-definite, it follows that on Lorentzian manifolds, the Laplace-Beltrami operator is not elliptic. Since a Lorentzian Laplace-Beltrami operator is really just the d’Alembertian, whose symbol is $\xi^2 - \tau^2$ , this should be no surprise.

Recall that a conic set in a vector space is a set which is closed under multiplication by positive scalars. A conic set in a vector bundle, then, is one which is conic in every fiber.

Definition. Let a be the principal symbol of a pseudodifferential operator A of order m. We say that A is noncharacteristic near $(x_0, \xi_0) \in T^*M$ if there is a conic neighborhood of $(x_0, \xi_0)$ wherein $a(x, \xi) \sim |\xi|^m$ near infinity. Otherwise, we say that $(x_0, \xi_0)$ is a characteristic point. The set of characteristic points is denoted Char A and the set of noncharacteristic points is denoted Ell A.

Thus a pseudodifferential operator A is noncharacteristic at $(x, \xi)$ if in a neighborhood of x, A is elliptic when restricted to the direction $\xi$ . By definition, Char A is closed, so we may make the following definition.

Definition. Let u be a distribution. The wavefront set WF(u) is the intersection of all sets Char A, where A ranges over pseudodifferential operators such that $Au \in C^\infty$ .

Then WF(u) is a closed conic subset of the cotangent bundle $T^*M$ , and its projection to M is exactly the singular support ss(u). Indeed, $x \notin ss(u)$ iff for every pseudodifferential operator A in a sufficiently small neighborhood of x, $Au \in C^\infty$ ; in other words no matter how hard we try, we cannot force u to become singular without differentiating it away from x. The wavefront set also remembers the direction in which this singularity happens; by elliptic invertibility, it will not happen in a direction that A is noncharacteristic.

For example, the only way that $u(x, y) = \delta_{y = 0}$ can be made smooth is by cutting u off away from $\{(x, y): y = 0\}$ , which can be done by pseudodifferential operators of order 0 which, along the x-axis, are elliptic in the x-direction but cannot be elliptic in the y-direction.
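One can check this example by hand using the equivalent Fourier-side description of the wavefront set, which is not the definition I gave above, so treat this as a sketch: $(x_0, \xi_0) \notin WF(u)$ iff for some cutoff $\chi$ with $\chi(x_0) \neq 0$ , the Fourier transform of $\chi u$ decays rapidly in a conic neighborhood of $\xi_0$ . For $u = \delta_{y = 0}$ on $\mathbb R^2$ and a cutoff $\chi$ , with $(\xi, \eta)$ dual to $(x, y)$ ,

$\displaystyle \widehat{\chi u}(\xi, \eta) = \int_{\mathbb R} \chi(x, 0) e^{-ix\xi} ~dx,$

which is a Schwartz function of $\xi$ and is independent of $\eta$ . In any closed cone avoiding the $\eta$ -axis we have $|\xi| \gtrsim |(\xi, \eta)|$ , so the decay is rapid there, while along $(0, \eta)$ there is no decay at all. Hence $WF(u) = \{(x, 0; 0, \eta): \eta \neq 0\}$ : the singularity sits on the x-axis and points in the y-direction.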

Pseudotransport equations.

Hyperbolic operators are meant to generalize the transport equation $(\partial_t - \partial_x)u(t, x) = 0$ . Let us therefore begin by studying the “pseudotransport” equation $(\partial_t + a(t, x, D_x))u(t, x) = 0$ .

We assume that $t \mapsto a(t, x, D_x)$ is uniformly bounded in $S^1$ and continuous in $C^\infty$ , and the real part of a is uniformly bounded from below. Then we have the energy estimate

$\displaystyle \frac{1}{2} \int_0^T ||e^{-\lambda t} u(t)||_{H^s}^p \lambda~dt \leq ||u(0)||_{H^s}^p$

valid for any $s \in \mathbb R$ and $\lambda$ large enough depending on s. Applying the Hahn-Banach theorem we conclude that for every initial data in $H^s$ we can find $u \in C^0([0, \infty) \to H^s)$ which solves the pseudotransport equation. In particular, given Schwartz initial data, it follows that u is smooth.

Now fix initial data $\phi \in H^s$ and assume that the principal symbol exists and is imaginary. (This forces the transport operator to be real and of order 1.) Let q be a symbol of order 0 on space, with principal symbol $q_0$ . If in fact Q(D) is a pseudodifferential operator on spacetime such that at time 0, Q(0) = q, and Q(t, D) commutes with $\partial_t + a(t, x, D_x)$ , then Qu solves the pseudotransport equation. (Actually, we will find Q so that $[Q(t), \partial_t + a(t, x, D_x)]$ is a pseudodifferential operator of order $-\infty$ ; this is good enough.) In particular if $q\phi \in C^\infty_0$ then WF(u) is contained in Char Q, and WF(u) should be the intersection of all such sets Char Q.

To compute WF(u), let $ia_0$ be the principal symbol of a(D) and suppose that $Q \sim \sum_j Q_j$ , where $Q_0$ is principal, is given. Then the principal symbol of $[\partial_t + a(t, x, D_x), Q(t, x, D)]$ is the Poisson bracket

$\displaystyle \{\tau + a_0(t, x, \xi), Q_0(t, x, \xi)\} = (\partial_t + H_{a_0})Q_0$

where $H_p$ is the Hamilton vector field of a symbol p. By inducting on j, we can use this computation to compute $Q_j$ and conclude that modulo an error term of order $-\infty$ , we can choose Q to be invariant along the Hamiltonian flow $\psi$ given by the Hamiltonian $a_0$ . That is, if $F_tu(0) = u(t)$ , then $WF \circ F_t = \psi_t \circ WF$ . This result is a sort of “propagation of singularities” for the pseudotransport equation, which generalizes the fact that the transport equation acts on Dirac masses by transporting them, as expected.
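To see this in the simplest case, here is the computation for the honest transport equation $(\partial_t - \partial_x)u = 0$ (a sketch; the sign convention $H_p = \partial_\xi p ~\partial_x - \partial_x p ~\partial_\xi$ is an assumption on my part). Here $a(D) = -\partial_x$ has symbol $-i\xi$ , so $a_0(\xi) = -\xi$ and

$\displaystyle H_{a_0} = -\partial_x, \qquad \psi_t(x, \xi) = (x - t, \xi), \qquad (\partial_t + H_{a_0})Q_0 = (\partial_t - \partial_x)Q_0 = 0.$

The last equation is solved by $Q_0(t, x, \xi) = q_0(x + t, \xi)$ , which is constant along $\psi_t$ . On the other hand the solution with initial data $\phi$ is $u(t, x) = \phi(x + t)$ , so a Dirac mass at $x_0$ at time 0 is a Dirac mass at $x_0 - t$ at time t, in agreement with $WF \circ F_t = \psi_t \circ WF$ .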

Solving the hyperbolic Cauchy problem.

Let X be a manifold that represents “spacetime”. A priori we may not have a Lorentzian metric to work with, so instead we fix a function $\phi$ that is a “time coordinate”. The level surfaces of $\phi$ can be viewed as “spacelike hypersurfaces” in X.

Throughout we will let $X_0 = \{\phi = 0\}$ and $X_+ = \{\phi > 0\}$ denote the present and future, respectively.

Definition. A hyperbolic operator is a differential operator P of principal symbol p and order m such that $p(x, d\phi(x)) \neq 0$ and for every $(x, \xi) \in T^*X$ such that $\xi$ is not in the span of $d\phi(x)$ , there are m distinct $\tau \in \mathbb R$ such that $p(x, \xi + \tau d\phi(x)) = 0$ .

Since P is a differential operator, $p(x, \cdot)$ is a homogeneous polynomial of degree m. To make sense of the condition, let me restrict to the case that $X = \mathbb R^2$ with its usual Riemannian metric and $\phi$ is the projection onto the t-axis. Then after rotating the first coordinate so that $\xi$ is a covector dual to the x-axis, the condition says that given $(x, t, \xi)$ we can find exactly m real numbers $\tau$ such that $p(x, t, \xi, \tau) = 0$ . In the case of the d’Alembertian, we have $p(x, t, \xi, \tau) = \xi^2 - \tau^2$ , and indeed given $\xi$ we can set $\tau = \pm \xi$ .
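The root-counting condition is easy to test mechanically for second-order symbols in two variables. The following sketch compares the d’Alembertian symbol $\xi^2 - \tau^2$ with the Laplacian symbol $\xi^2 + \tau^2$ ; the function names are hypothetical, and only the quadratic case is handled.

```python
import math

def real_roots(a, b, c):
    """Distinct real roots of a*tau^2 + b*tau + c = 0, assuming a != 0."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    if disc == 0:
        return [-b / (2 * a)]
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

def wave_symbol_roots(xi):
    # p(xi, tau) = xi^2 - tau^2, viewed as a quadratic polynomial in tau
    return real_roots(-1.0, 0.0, xi * xi)

def laplace_symbol_roots(xi):
    # p(xi, tau) = xi^2 + tau^2, viewed as a quadratic polynomial in tau
    return real_roots(1.0, 0.0, xi * xi)
```

For any nonzero $\xi$ the first function returns the two distinct roots $\pm\xi$ , as hyperbolicity of order m = 2 demands, while the second returns no real roots at all, which is the elliptic situation.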

To state the initial-value problem with initial data in the “initial-time slice” $X_0$ , let v be a vector field such that $v\phi = 1$ , so v points “forward in time”. The action of v is “differentiating with respect to time”. Note that this hypothesis prevents $\phi$ from degenerating.

Theorem (solving the hyperbolic Cauchy problem). Let P be a hyperbolic operator of order m with smooth coefficients, Y a precompact open submanifold of X, and $s \geq 0$ . Assume we are given an inhomogeneous term $f \in H^s_{loc}(X_+)$ satisfying $f|X_0 = 0$ and initial data $\psi_j \in H^{s + m - 1 - j}_{loc}(X_0)$ , j < m. Then there is $u \in H^{s + m - 1}_{loc}(X)$ supported in $\overline X_+$ such that Pu = f in $X_+ \cap Y$ and $v^ju = \psi_j$ in $X_0 \cap Y$ .

The proof is in Chapter 23.2 of Hörmander. The idea is to first prove uniqueness of solutions. By compactness, we may cover Y with finitely many charts U which are isomorphic to open subsets of Minkowski spacetime in which level sets of $\phi$ are spacelike hypersurfaces and orbits of v are worldlines. Since Minkowski spacetime has an honest-to-god time coordinate, the hyperbolicity hypothesis allows us to factor the principal symbol p into first-order factors, and hence factor P into pseudotransport operators on U, at least modulo a lower-order error. We may then apply the solution of the Cauchy problem for pseudotransport operators to solve the Cauchy problem for Pu = f in each chart U, and since there were only finitely many, uniqueness allows us to stitch the local solutions together into a global solution.

The proof outlined in the above paragraph is motivated by the special case when P is the d’Alembertian, which already appears in Chapter 2 of Evans. In that proof, one first observes that the Cauchy problem for the transport equation has an explicit solution. Then one reduces to the case that spacetime is two-dimensional, in which case there is an explicit factorization of P into transport operators, namely $P = (\partial_x - \partial_t)(\partial_x + \partial_t)$ .
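Written out, the two-dimensional reduction looks as follows (a sketch of the homogeneous case only; the names g, F, G are mine). The cross terms cancel because constant-coefficient operators commute:

$\displaystyle (\partial_x - \partial_t)(\partial_x + \partial_t)u = \partial_x^2 u - \partial_t^2 u.$

If Pu = 0, then $v = (\partial_x + \partial_t)u$ solves the transport equation $(\partial_x - \partial_t)v = 0$ , so $v(t, x) = g(x + t)$ for some function g. Solving the second transport equation $(\partial_x + \partial_t)u = g(x + t)$ then gives

$\displaystyle u(t, x) = F(x - t) + G(x + t), \qquad G' = g/2,$

which is d’Alembert’s decomposition into left-moving and right-moving waves.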

Propagation of singularities, part I.

To study the propagation of singularities we need to recall some symplectic geometry. Let Q be a pseudodifferential operator on X and q its principal symbol. Then the Hamilton vector field $H_q$ induces a flow on $T^*X$ which preserves q.

Definition. The bicharacteristic flow of a pseudodifferential operator Q of principal symbol q is the flow of $H_q$ on $q^{-1}(0)$ . A bicharacteristic of Q is an orbit of the bicharacteristic flow.

The intuition for the bicharacteristic flow is that its projection to X is “lightlike”, at least if Q is the d’Alembertian.

Theorem (Hörmander’s propagation of singularities). Let P be a pseudodifferential operator of order m such that the Schwartz kernel of P has proper support, and the principal symbol of P is real. Then for every distribution u with Pu = f, WF(u) – WF(f) is invariant under the bicharacteristic flow of P.

By definition of the wavefront set, for every distribution u, WF(u) – WF(Qu) is contained in Char Q. But if Q is a differential operator, then Char Q is exactly the “characteristic variety” $q^{-1}(0)$ , which is exactly the variety where the bicharacteristic flow of Q is defined. Therefore we can ask that WF(u) – WF(Qu) be invariant under the bicharacteristic flow.

If P is a hyperbolic operator of principal symbol p, then the solutions $\tau$ of the equation $p(x, \xi + \tau d\phi(x)) = 0$ are all real and distinct, and modulo lower-order terms this can be used to enforce that the coefficients of p are real. We phrase this more simply by saying that the principal symbol of every hyperbolic operator is real.

A partial converse to the reality of principal symbols of hyperbolic operators holds. If Q is a differential operator, then its principal symbol q is a homogeneous polynomial on each cotangent space. Fixing a particular cotangent space, we can write $q(\xi) = \sum_\alpha c_\alpha \xi^\alpha$ where $\alpha$ ranges over all multiindices of order m and $c_\alpha \in \mathbb R$ . In order that the characteristic variety of Q have more than one real point, there must be some $c_\alpha$ positive and some negative. But this is exactly the situation of the d’Alembertian, whose principal symbol is $q(\xi, \tau) = \xi^2 - \tau^2$ .

Thus, while the propagation of singularities theorem only assumes that the principal symbol is real, if the operator P is (for example) elliptic or parabolic, then the conclusion of the theorem is degenerate in the sense that the characteristic variety only has a single real point, so that WF(u) – WF(f) is invariant under EVERY group action on the characteristic variety, not just the bicharacteristic flow.

The interpretation of the propagation of singularities theorem is that P is something like the d’Alembertian, in which case p is something like a Lorentzian metric. The bicharacteristic flow is a flow on the characteristic bundle, which is the space whose points $(x, \xi)$ consist of a position x and a lightlike momentum $\xi$ . Therefore the projection of any bicharacteristic to X consists of a worldline. Thus, if the initial data is something like a Dirac mass at x, then the Dirac mass travels along the worldline containing x.
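For the d’Alembertian on $\mathbb R^2$ this picture can be computed explicitly (a sketch; the flow parameter s and the sign convention for $H_p$ are my own choices). With $p(x, t, \xi, \tau) = \xi^2 - \tau^2$ we get

$\displaystyle H_p = 2\xi \partial_x - 2\tau \partial_t, \qquad (x(s), t(s), \xi(s), \tau(s)) = (x_0 + 2\xi s, t_0 - 2\tau s, \xi, \tau).$

On the characteristic variety we have $\tau = \pm\xi$ with $\xi \neq 0$ , so the projection of a bicharacteristic to spacetime is a line with

$\displaystyle \frac{dx}{dt} = -\frac{\xi}{\tau} = \mp 1,$

that is, a light ray. A Dirac mass at $x_0$ therefore travels along the two light rays through $x_0$ .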

To prove the propagation of singularities theorem, we need a propagation estimate. Recall that if A is a pseudodifferential operator, then WF(A) denotes the microsupport of A; that is, the complement of the largest conic set on which A has order $-\infty$ .

Theorem (propagation estimate). Let U be an open conic set, and let $A, B, B_1 \in \Psi^0(X)$ . Let P be a pseudodifferential operator of real principal symbol p and order m.
For every N > 0 and $s \in \mathbb R$ there is C > 0 such that for every distribution u and every inhomogeneous term f with Pu = f,

$\displaystyle ||Au||_{H^{s+m-1}} \leq C||B_1 f||_{H^s} + C||Bu||_{H^{s+m-1}} + C||u||_{H^{-N}}$

given that the following criteria are met:

The projection of U is precompact in X.
For every $(x, \xi) \in U$ , if $p(x, \xi) = 0$ , then $H_p$ and the radial vector field $\xi\partial_\xi$ are linearly independent at $(x, \xi)$ .
WF(A) and WF(B) are contained in U, while $WF(1 - B_1) \cap U = \emptyset$ .
For every trajectory $(x(t), \xi(t))$ of $H_p$ with $(x(0), \xi(0)) \in WF(A)$ , there is T < 0 such that for every $T \leq t \leq 0$ , $(x(t), \xi(t)) \in U$ and $(x(T), \xi(T)) \in Ell(B)$ .

The term $C||u||_{H^{-N}}$ is an error term created by the use of pseudodifferential operators and is not interesting. The operator $B_1$ is a cutoff which microlocalizes the problem to a neighborhood of the conic set U. We are interested in WF(u) – WF(f), so we want $WF(B_1) \cap WF(f) = \emptyset$ and $B_1|U = 1$ . Actually, since we only care about the complement of WF(f), we might as well take f Schwartz, in which case we can take $B_1 = 1$ and simplify the propagation estimate to

$\displaystyle ||Au||_{H^{s+m-1}} \leq C||f||_{H^s} + C||Bu||_{H^{s+m-1}} + \text{error terms}.$

The interesting point here is the relationship between the operators A and B. We can optimize the propagation estimate by assuming that WF(B) = Ell B. This is because we really desperately want B to be elliptic on its microsupport, so that it does not introduce any new singularities. Under the assumption WF(B) = Ell B, B is a microlocalization to WF(B), and if $(x, \xi) \in WF(A)$ , then $(x, \xi)$ got to WF(A) after passing through WF(B). The point is that if u has a singularity at $(x, \xi) \in WF(A)$ , then (if the regularity exponent s is taken large enough) $||Au||_{H^{s+m-1}} = \infty$ , but we assumed f Schwartz, so this implies $||Bu||_{H^{s+m-1}} = \infty$ , so that if we traveled back along the bicharacteristic flow $(x(t), \xi(t))$ from $(x, \xi)$ for long enough, we would see that u already had a singularity at some time $(x(T), \xi(T))$ with T < 0.

Moreover, the propagation estimate is time-reversible in the sense we can replace T < 0 with -T > 0. Thus the bicharacteristic flow neither creates nor destroys singularities in the distribution u. This readily implies the propagation of singularities theorem.

The proof of the propagation estimate is quite technical, and this post is meant as more of a conceptual discussion, so I will omit it.

Elliptic regularity implies that compact genera are finite

A few years ago I took a PDE course. We were learning about something to do with elliptic pseudodifferential operators and the speaker drew a commutative diagram on the board and said, “You see, this comes from a short exact sequence –” and the whole room started laughing in discomfort. The speaker then remarked that Craig Evans himself would ban him from teaching analysis if word of the incident ever leaked, which might have something to do with why I have not disclosed the speaker’s name 🥵

Before recently, I found topology to be quite a scary area of math. It is still very much my weakest suit, but I should like to have some amount of competency with it. I have since come around to the viewpoint that cohomology is just a clever gadget for counting solutions of PDE. This has made the pill a little easier to swallow, and makes the previous anecdote all the more awkward.

As part of my ventures into trying to learn topology, in this post I will give a proof that the genus of any compact Riemann surface is finite. I am confident that this proof is not original, because it’s sort of the obvious proof if an analyst trying to prove this fact just followed their nose, but it seems a lot more natural to me than the proof in Forster, so let’s do this.

[Since the time of writing, I have made some corrections to incorrect or confusing statements. Thanks to Sarah Griffith for pointing these out!]

Let us start with some generalities. Fix a compact Riemann surface ${X}$ , references to which we will suppress when possible. Let

$\displaystyle 0 \rightarrow A \rightarrow B \rightarrow C \rightarrow 0$

be a short exact sequence of sheaves. In our case, the sheaves will be sheaves of Fréchet spaces on ${X}$ , which might not be homologically kosher, but that won’t cause any real issues. Then we get a long exact sequence in cohomology

$\displaystyle 0 \rightarrow H^0(A) \rightarrow H^0(B) \rightarrow H^0(C) \rightarrow H^1(A) \rightarrow H^1(B) \rightarrow H^1(C) \rightarrow \cdots.$

If $B$ is a fine sheaf, i.e. it has partitions of unity subordinate to every open cover, then ${H^1(B) = 0}$ and the long exact sequence collapses to the exact sequence

$\displaystyle 0 \rightarrow H^0(A) \rightarrow B(X) \rightarrow C(X) \rightarrow H^1(A) \rightarrow 0.$

In particular, the morphism of sheaves ${B \rightarrow C}$ induces a bounded linear map ${T: B(X) \rightarrow C(X)}$ such that ${H^0(A)}$ is the kernel of ${T}$ and ${H^1(A)}$ is the cokernel of ${T}$ . Now, if ${T}$ is a Fredholm operator, then its index ${k}$ satisfies

$\displaystyle k = \text{dim } H^0(A) - \text{dim } H^1(A).$
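To get a feel for the index before applying it to ${\overline \partial}$ , here is a minimal finite-dimensional sketch. It is only an illustration: the helper names `rank` and `index` and the sample matrix `T` are hypothetical and not part of the argument. For a linear map between finite-dimensional spaces, the index (dimension of the kernel minus dimension of the cokernel) depends only on the dimensions of the domain and codomain, by rank-nullity.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        # look for a row at or below position r with a nonzero entry in this column
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def index(matrix):
    """dim ker T - dim coker T for the map Q^n -> Q^m given by an m x n matrix."""
    m, n = len(matrix), len(matrix[0])
    rk = rank(matrix)
    dim_kernel = n - rk      # rank-nullity
    dim_cokernel = m - rk    # codimension of the image
    return dim_kernel - dim_cokernel

# Hypothetical example: a rank-1 map Q^3 -> Q^2.
T = [[1, 2, 3],
     [2, 4, 6]]
print(rank(T), index(T))
```

In finite dimensions the index is always $n - m$ , whatever the map; the point of the Fredholm condition is that the same difference still makes sense when both spaces are infinite-dimensional.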

Let ${\mathcal O}$ denote the sheaf of holomorphic functions on ${X}$ and ${\overline \partial}$ the Cauchy-Riemann operator. Let ${\mathcal E}$ denote the sheaf of smooth functions on ${X}$ ; since ${X}$ has enough partitions of unity, ${\mathcal E}$ is a fine sheaf. The maps ${\overline \partial: \mathcal E(U) \rightarrow \mathcal E(U)}$ , for ${U \subseteq X}$ open, induce a short exact sequence of sheaves of Fréchet spaces

$\displaystyle 0 \rightarrow \mathcal O \rightarrow \mathcal E \rightarrow \mathcal E \rightarrow 0$

and hence an exact sequence in cohomology

$\displaystyle 0 \rightarrow \mathbf C \rightarrow \mathcal E(X) \rightarrow \mathcal E(X) \rightarrow H^1(\mathcal O) \rightarrow 0.$

Here we used Liouville’s theorem. On the other hand, the dimension of ${H^1(\mathcal O)}$ is by definition the genus ${g}$ of ${X}$ . Therefore, if ${k}$ is the Fredholm index of ${\overline \partial}$ , then

$\displaystyle g = 1 - k.$

It remains to show that ${k}$ is well-defined and finite; that is, ${\overline \partial}$ is Fredholm. This is a standard elliptic regularity argument, which I will now recall. We first fix a volume form ${dV}$ on ${X}$ , which exists since ${X}$ is an orientable surface. This induces an ${L^2}$ norm on ${X}$ , namely

$\displaystyle ||u||_{L^2}^2 = \int_X |u|^2 ~dV.$

Unfortunately the usual Sobolev notation ${H^s}$ clashes with the notation for cohomology, so let me use ${W^s}$ to denote the completion of ${\mathcal E}$ under the norm

$\displaystyle ||u||_s = \sum_{|\alpha| \leq s} ||\partial^\alpha u||_{L^2}$

where ${\alpha}$ ranges over multiindices. Then ${W^0 = L^2}$ and ${\overline \partial}$ maps ${W^1 \rightarrow W^0}$ . The kernel of ${\overline \partial}$ is finite-dimensional (since it is isomorphic to ${\mathbf C}$ , by Liouville’s theorem and Weyl’s lemma), so to deduce that ${\overline \partial}$ is Fredholm as an operator ${W^1 \rightarrow W^0}$ it suffices to show that the cokernel of ${\overline \partial}$ is finite-dimensional.

We first claim the elliptic regularity estimate

$\displaystyle ||u||_1 \leq C ||f||_0 + C ||u||_0$

for any smooth functions $u,f$ which satisfy ${\overline \partial}u = f$ . By definition of the Sobolev norm, writing $u' = \partial u$ , we have

$\displaystyle ||u||_1 = ||u||_0 + ||u'||_0 + ||f||_0.$

Without loss of generality, we may assume that ${u}$ is smooth. Then we can write ${u = v + w}$ where ${v}$ and ${\overline w}$ are holomorphic. In particular, ${u' = v'}$ and ${f = \overline \partial w}$ , so

$\displaystyle ||u||_1 = ||u||_0 + ||v'||_0 + ||f||_0.$

The only troublesome term here is ${v'}$ . Taking a Cauchy estimate, we see that

$\displaystyle |v'(z)| \leq ||v||_{L^\infty} \leq C||v||_{L^2} = C||v||_0.$

But ${X}$ is compact, so has finite volume; therefore

$\displaystyle ||v'||_0 = ||v'||_{L^2} \leq C||v||_{L^\infty} \leq C||v||_0 \leq C||u||_0.$

This gives the desired bound.

Let ${u_n}$ be a sequence in ${W^1}$ with ${f_n = \overline \partial u_n \in W^0}$ , and assume that the ${f_n}$ are Cauchy in ${W^0}$ . Without loss of generality we may assume that ${u_n \in K^\perp}$ where ${K}$ is the kernel of ${\overline \partial}$ . If the ${u_n}$ are not bounded in ${W^1}$ , we may replace them with ${u_n/||u_n||_1}$ , and thus assume that they are in fact bounded. By the Rellich-Kondrachov theorem (which says that the natural map ${W^1 \rightarrow W^0}$ is compact), we may therefore assume that the ${u_n}$ are Cauchy in ${W^0}$ . But then

$\displaystyle ||u_n - u_m||_1 \leq C ||f_n - f_m||_0 + C ||u_n - u_m||_0$

so the ${u_n}$ are Cauchy in ${W^1}$ . Therefore the ${u_n}$ converge in ${K^\perp}$ , hence the ${f_n}$ converge in the image ${Z}$ of ${\overline \partial}$ , since ${\overline \partial}$ gives an isomorphism ${K^\perp \rightarrow Z}$ . Therefore ${Z}$ is closed.

If one applies integration by parts to ${\overline \partial}$ , the fact that $X$ has no boundary implies that for any $f,g$ ,

$\displaystyle \langle \overline \partial f, g\rangle = \int_X \overline \partial f \overline g ~dV = -\int_X f \overline{\partial g} ~dV = -\langle f, g'\rangle$

and thus $\overline \partial^* = -\partial$ . Since $Z$ is closed, the dual of the cokernel of ${\overline \partial}$ is the kernel $L$ of $-\partial$ ; by the Rellich-Kondrachov theorem, the unit ball of $L$ is compact and therefore $L$ is finite-dimensional. By the Hahn-Banach theorem, this implies that the cokernel of ${\overline \partial}$ is finite-dimensional. Therefore ${k}$ , and hence ${g}$ , is finite.

A PDE-analytic proof of the fundamental theorem of algebra

The fundamental theorem of algebra is one of the most important theorems in mathematics, being core to algebraic geometry and complex analysis. Unraveling the definitions, it says:

Fundamental theorem of algebra. Let $f$ be a polynomial over $\mathbf C$ of degree $d$ . Then the equation $f(z) = 0$ has $d$ solutions $z$ , counting multiplicity.

Famously, most proofs of the fundamental theorem of algebra are complex-analytic in nature. Indeed, complex analysis is the natural arena for such a theorem to be proven. One has to use the fact that $\mathbf R$ is a real closed field, but since there are lots of real closed fields, one usually defines $\mathbf R$ in a fundamentally analytic way and then proves the intermediate value theorem, which shows that $\mathbf R$ is a real closed field. One can then proceed by tricky algebraic arguments (using, e.g. Galois or Sylow theory), or appeal to a high-powered theorem of complex analysis. Since the fundamental theorem is really a theorem about algebraic geometry, and complex analysis sits somewhere between algebraic geometry and PDE analysis in the landscape of mathematics (and we need some kind of analysis to get the job done; purely algebro-geometric methods will not be able to distinguish $\mathbf R$ from another field $K$ such that $-1$ does not have a square root in $K$ ) it makes a lot of sense to use complex analysis.

But, since complex analysis sits between algebraic geometry and PDE analysis, why not abandon all pretense of respectability (that is to say, algebra; analysis is not a field worthy of the respect of a refined mathematician) and give a PDE-analytic proof? Of course, this proof will end up “looking like” multiple complex-analytic proofs, and indeed it is basically the proof by Liouville’s theorem dressed up in a trenchcoat (and in fact, gives Liouville’s theorem, and probably some other complex-analytic results, as a byproduct). In a certain sense (effectiveness) this proof is strictly inferior to the proof by the argument principle, and in another certain sense (respectability) this proof is strictly inferior to algebraic proofs. However, it does have the advantage of being easy to teach to people working in very applied fields, since it uses only the machinery of PDE analysis, rather than fancy results such as Liouville’s theorem or the Galois correspondence.

The proof
By induction, it suffices to prove that if $f$ is a polynomial with no zeroes, then $f$ is constant. So suppose that $f$ has no zeroes, and introduce $g(z) = 1/f(z)$ . As usual, we want to show that $g$ is constant.

Since $f$ is a polynomial, it does not decay at infinity, so $g(\infty)$ is finite. Therefore $g$ can instead be viewed as a function on the sphere, $g: S^2 \to \mathbf C$ , by stereographic projection. Also by stereographic projection, one can cover the sphere by two copies of $\mathbf R^2$ , one centered at the south pole that misses only the north pole, and one centered at the north pole that only misses the south pole. Thus one can define the Laplacian, $\Delta = \partial_x^2 + \partial_y^2$ , in each of these coordinates; it remains well-defined on the overlaps of the charts, so $\Delta$ is well-defined on all of $S^2$ . (In fancy terminology, which may help people who already know ten different proofs of the fundamental theorem of algebra but will not enlighten anyone else, we view $S^2$ as a Riemannian manifold under the pushforward metric obtained by stereographic projection, and consider the Laplace-Beltrami operator of $S^2$ .)

Recall that a function $u$ is called harmonic provided that $\Delta u = 0$ . We claim that $g$ is harmonic. The easiest way to see this is to factor $\Delta = 4\partial\overline \partial$ where $2\partial = \partial_x - i\partial_y$ . Then $\overline \partial u = 0$ exactly if $u$ has a complex derivative, by the Cauchy-Riemann equations. There are other ways to see this, too, such as using the mean-value property of harmonic functions and computing the antiderivative of $g$ . In any case, the proof is just calculus.
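As a sanity check on the “just calculus” claim, here is a small numerical sketch using finite differences. The sample polynomial $f(z) = z^2 + 1$ , the step size, and the test point are all assumptions chosen for illustration; since this $f$ does have zeroes (at $\pm i$ ), the check is only made at a point away from them.

```python
def g(z):
    """Reciprocal of the sample polynomial f(z) = z**2 + 1 (a hypothetical choice)."""
    return 1 / (z * z + 1)

def laplacian(func, z, h=1e-3):
    """Five-point finite-difference approximation of (d_x^2 + d_y^2) func at z."""
    return (func(z + h) + func(z - h) + func(z + 1j * h) + func(z - 1j * h)
            - 4 * func(z)) / (h * h)

def dbar(func, z, h=1e-3):
    """Central-difference approximation of (d_x + i d_y)/2 applied to func at z."""
    dx = (func(z + h) - func(z - h)) / (2 * h)
    dy = (func(z + 1j * h) - func(z - 1j * h)) / (2 * h)
    return (dx + 1j * dy) / 2

z0 = 0.5 + 0.25j  # a point away from the zeroes +i and -i of f

print(abs(dbar(g, z0)))       # small: g satisfies the Cauchy-Riemann equations near z0
print(abs(laplacian(g, z0)))  # small: g is harmonic near z0
print(laplacian(lambda z: abs(z) ** 2, z0))  # close to 4: |z|^2 is not harmonic
```

The last line is there for contrast: $|z|^2 = x^2 + y^2$ has Laplacian 4, so the finite-difference stencil does detect non-harmonic functions.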

So $g$ is a harmonic function on the compact connected manifold $S^2$ ; by the extreme value theorem, $g$ has (or more precisely, its real and imaginary parts have) a maximum. By the maximum principle of harmonic functions (which is really just the second derivative test — being harmonic generalizes the notion of having zero second derivative), it follows that $g$ is equal to its maximum, so is constant. (In fancy terminology, we view $g$ as the canonical representative of the zeroth de Rham cohomology class of $S^2$ using the Hodge theorem.)

Internalizing tricks: the Heine-Borel theorem

I think that in analysis, the most important results are the tricks, not the theorems. I figure most analysts could prove any of the theorems in Rudin or Pugh at will, not because they have the results memorized, but because they know the tricks.

So it’s really important to internalize tricks! Here’s an example of how we could take apart a proof of the Heine-Borel theorem that every closed bounded set is compact, and internalize some of the tricks in it.

The proof we want to study is as follows.

Step 1. We first prove that [0, 1] is compact. Let $(x_n)$ be a sequence in [0, 1] that we want to show has a convergent subsequence. Let $x_{n_1} = x_1$ and let $I_1 = [0, 1]$ .

Step 2. Suppose by induction that we are given $I_1, \dots, I_J$ such that $I_j$ is a subinterval of $I_{j-1}$ of half length, infinitely many terms of $(x_n)$ lie in $I_j$ , and $x_{n_j} \in I_j$ . By the pigeonhole principle, since there are infinitely many points of $(x_n)$ in $I_J$ , if we divide $I_J$ into left and right closed subintervals of equal length, one of those two subintervals has infinitely many points of $(x_n)$ as well. So let that subinterval be $I_{J+1}$ and let $x_{n_{J+1}}$ be the first point of $(x_n)$ after $x_{n_J}$ in $I_{J+1}$ .

Step 3. After the induction completes we have a subsequence $(x_{n_j})$ of $(x_n)$ . By construction, $x_{n_j} \in I_j$ and $I_{j+1}$ is half of $I_j$ , so both $x_{n_j}$ and $x_{n_{j+1}}$ lie in $I_j$ , which has length $2^{1-j}$ ; hence $|x_{n_j} - x_{n_{j+1}}| \leq 2^{1-j}$ . Since these bounds are summable, $(x_{n_j})$ is a Cauchy sequence, so it converges in $\mathbf R$ , say $x_{n_j} \to x$ .

Step 4. Since x is a limit of a sequence in [0, 1], and [0, 1] is closed, $x \in [0, 1]$ . Therefore $(x_n)$ has a convergent subsequence. So [0, 1] is compact.

Step 5. Now let $K = [0, 1]^n$ be a box. We claim that K is compact. To see this, let $(x_m)$ be a sequence in K. If n = 1, then $(x_m)$ has a convergent subsequence by Steps 1 through 4.

Step 6. Suppose by induction that $[0, 1]^{n-1}$ is compact. Then we can write $x_m = (y_m, z_m)$ where $y_m \in [0, 1]^{n-1}$ , $z_m \in [0, 1]$ . So there is a convergent subsequence $(y_{m_k})$ . Now $(z_{m_k})$ has a convergent subsequence $(z_{m_{k_j}})$ , and then $(x_{m_{k_j}})$ is a convergent subsequence. So K is compact.

Step 7. Now let K be closed and bounded. So there is a box $L = [-R, R]^n$ such that $K \subseteq L$ . Without loss of generality, assume that R = 1.

Step 8. Since K is a closed subset of the compact set L, K is compact.
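The bisection in Steps 2 and 3 can be sketched in code, with one loud caveat: a finite sample never has “infinitely many” points in a subinterval, so this sketch replaces the pigeonhole step by “keep the half containing at least as many sample points”. That substitution, the function name `bisect_subsequence`, and the sample sequence $\sin^2(n)$ are all assumptions made for illustration; the real proof is nonconstructive at exactly this point.

```python
import math

def bisect_subsequence(xs, steps):
    """Choose indices n_1 < n_2 < ... with x_{n_j} in nested intervals I_j of length 2**(1 - j)."""
    lo, hi = 0.0, 1.0                   # I_1 = [0, 1]
    candidates = list(range(len(xs)))   # indices of sample points in the current interval
    chosen = [candidates[0]]            # n_1 is the first index
    for _ in range(steps):
        mid = (lo + hi) / 2
        left = [i for i in candidates if xs[i] <= mid]
        right = [i for i in candidates if xs[i] >= mid]
        # Finite stand-in for the pigeonhole step: keep the fuller half.
        if len(left) >= len(right):
            candidates, hi = left, mid
        else:
            candidates, lo = right, mid
        # Next term: first sample point in the new interval after the last chosen one.
        later = [i for i in candidates if i > chosen[-1]]
        if not later:
            break  # the finite sample ran out of points
        chosen.append(later[0])
    return chosen

xs = [math.sin(n) ** 2 for n in range(1, 2001)]  # a hypothetical sequence in [0, 1]
idx = bisect_subsequence(xs, steps=6)
sub = [xs[i] for i in idx]
print(idx)
print(sub)
```

Consecutive entries of `sub` lie in a common interval whose length halves at every step, which is the Cauchy estimate of Step 3 in miniature.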

Let’s look at the tricks used at each stage:

Step 1. We want to show that an arbitrary closed and bounded set is compact. This sounds quite hard, as such sets can be nasty; however, it is often the case that if you can prove a special case of the theorem, the general theorem follows. Since [0, 1] is the prototypical example of a compact set, and is much nicer than e.g. Cantor dust in 26 dimensions, we first try to prove the Heine-Borel theorem on [0, 1].

Step 2. Here we use the informal principle that compactness is equivalent to path-finding in an infinite binary tree. That is, compactness requires us to make infinitely many choices, which is exactly the same thing as finding a path through an infinitely large tree, where we will have to choose whether to go left or right infinitely many times. Ideally every time we choose whether we go left or right, we will cut down on the complexity of the problem by half. Here the “complexity” is the size of the interval we’re looking at. This notion of “compactness” is ubiquitous in analysis, combinatorics, and logic. It is the deepest part of the proof of the Heine-Borel theorem, and is known as Koenig’s lemma.

Step 2 has another key idea to it. We need to make infinitely many choices, so we make infinitely many choices using induction. In general when traversing a graph, inducting on the length of the path so far will come in handy. If you don’t know which way to go, the pigeonhole principle and other nonconstructive tricks will also be highly useful here.

Step 3. Compactness gave us a subsequence, but we don’t know what the limit is. But to prove that a sequence converges without referring to an explicit limit, instead show that it is Cauchy. Actually, here we are forced to do this, because the argument of Step 2 could’ve been carried out over the rational numbers, yet the conclusion of the Heine-Borel theorem is false there. So this step could also be interpreted as make sure to use every hypothesis; here the hypothesis that we are working over the reals is key.

Step 4. Make sure to use every hypothesis; up to this point we’ve only used that [0, 1] is bounded, not closed.

Step 5. Here we again reason that if you can prove a special case of the theorem, the general theorem follows.

Step 6. Here n is an arbitrary natural number, so we prove a theorem about every natural number using induction. This is especially nice because the idea behind this proof was to build up the class of compact sets iteratively, initializing with the unit interval; at every stage of this induction we also get a unit box.

This trick can be viewed as a special case of if you can prove a special case of the theorem, the general theorem follows: indeed, proving a theorem for every natural number would require infinitely many cases to be considered, but here there are just two, the base case and the inductive case. The inductive case was really easy, so the thing we are really interested in is the base case.

Step 7. Here we abstract away unnecessary parameters using symmetry. The parameter R is totally useless because topological notions don’t care about scaling. However, we do have a box, and it would be nice if it was a unit box because we just showed that unit boxes are compact. So we might as well forget about R and just assume it’s 1.

Step 8. Once again we make sure to use every hypothesis; the boundedness got us inside a box, so the closedness must be used to finish the proof.