Expanding Subspace Minimization (ESM) Theorem
Let \(\mathbf{x}_0 \in \mathbb{R}^n\). Given a set of H-conjugate directions \(\{ \mathbf{p}_i \}_0^{k - 1}\) and iterates \(\mathbf{x}_k = \mathbf{x}_{k-1} + \alpha_{k - 1} \mathbf{p}_{k-1}\) generated by exact line searches, the Expanding Subspace Minimization theorem states that,
\begin{align*}
\mathbf{r}_k^T \mathbf{p}_i = 0 \: \forall \: i \in [\![ 0, k - 1 ]\!] \\
\end{align*}
and \(\mathbf{x}_k\) is the minimizer of:
\begin{align*}
f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T H \mathbf{x} + c^T \mathbf{x} \\
\end{align*}
\begin{align*}
\text{over the set } \{\mathbf{x} \mid \mathbf{x} = \mathbf{x}_0 + \mathrm{span}\{\mathbf{p}_0, \mathbf{p}_1,...\mathbf{p}_{k-1}\}\} \\
\end{align*}
The new residual \(\mathbf{r}_k\) is orthogonal to all the search directions used in the previous iterations (provided that the search directions are H-conjugate).
Proof:
We will show that \(\tilde{\mathbf{x}} = \mathbf{x}_0 + \sigma_0^*\mathbf{p}_0 + \dots + \sigma_{k-1}^*\mathbf{p}_{k-1}\) minimizes \(f(\mathbf{x})\) over \(\{\mathbf{x} \mid \mathbf{x} = \mathbf{x}_0 + \mathrm{span}\{\mathbf{p}_0, \mathbf{p}_1,...\mathbf{p}_{k-1}\}\}\) if and only if \(\mathbf{r}(\tilde{\mathbf{x}})^T \mathbf{p}_i = 0, \: i = 0, \dots, k - 1\).
Let,
\begin{align*}
h(\sigma) = f(\mathbf{x}_0 + \sigma_0\mathbf{p}_0 + \dots + \sigma_{k-1}\mathbf{p}_{k-1}) \\
\end{align*}
\begin{align*}
\sigma = (\sigma_0, \sigma_1, ..., \sigma_{k-1})^T \\
\end{align*}
\(h(\sigma)\) is a strictly convex quadratic in \(\sigma\) (H is positive definite and the H-conjugate directions are linearly independent). Therefore, at its unique minimizer \(\sigma^*\),
\begin{align*}
\frac{\partial h (\sigma^*)}{\partial \sigma_i} = 0, \quad i = 0, \dots, k - 1 \\
\end{align*}
By the Chain Rule, and noting that \(\partial \tilde{\mathbf{x}} / \partial \sigma_i = \mathbf{p}_i\),
\begin{align*}
\nabla f(\tilde{\mathbf{x}})^T \frac{\partial \tilde{\mathbf{x}}}{\partial \sigma_i} = 0 \\
\end{align*}
\begin{align*}
\nabla f(\mathbf{x}_0 + \sigma_0^*\mathbf{p}_0 + \dots + \sigma_{k-1}^*\mathbf{p}_{k-1})^T \mathbf{p}_i = 0, \: i = 0,...,k-1 \\
\end{align*}
\begin{align*}
\nabla f(\tilde{\mathbf{x}})^T \mathbf{p}_i = 0 \\
\end{align*}
and since \(\mathbf{r}(\mathbf{x}) = \nabla f(\mathbf{x}) = H\mathbf{x} + c\),
\begin{align*}
\mathbf{r}(\tilde{\mathbf{x}})^T \mathbf{p}_i = 0, \: i = 0,..., k-1 \\
\end{align*}
Hence, from the above, we can conclude that \(\tilde{\mathbf{x}}\) minimizes \(f(\mathbf{x})\) over \(\{\mathbf{x} \mid \mathbf{x} = \mathbf{x}_0 + \mathrm{span}\{\mathbf{p}_0, \mathbf{p}_1,...\mathbf{p}_{k-1}\}\}\) if and only if \(\mathbf{r}(\tilde{\mathbf{x}})^T \mathbf{p}_i = 0, \: i = 0, \dots, k - 1\).
Now, we will use induction to show that \(\mathbf{x}_k\) satisfies \(\mathbf{r}_k^T \mathbf{p}_i = 0, \: i = 0, \dots, k - 1\), and therefore \(\mathbf{x}_k = \tilde{\mathbf{x}}\).
For \(k = 1\), \(\mathbf{x}_1 = \mathbf{x}_0 + \alpha_0 \mathbf{p}_0\) minimizes \(f(\mathbf{x})\) along \(\mathbf{p}_0\) which implies \(\mathbf{r}_1^T \mathbf{p}_0 = 0\).
Let our induction hypothesis hold true for \(\mathbf{r}_{k-1}\), that is, \(\mathbf{r}_{k-1}^T \mathbf{p}_i = 0, \: \forall \: i = 0, \dots , k-2\). We will prove the orthogonality relationships for \(\mathbf{r}_{k}\). We know that,
\begin{align*}
\mathbf{r}_{k} = \mathbf{r}_{k - 1} + \alpha_{k - 1} H \mathbf{p}_{k - 1} \\
\end{align*}
\begin{align*}
\implies \mathbf{p}_{k-1}^T \mathbf{r}_{k} = \mathbf{p}_{k-1}^T \mathbf{r}_{k - 1} + \alpha_{k - 1} \mathbf{p}_{k - 1}^T H \mathbf{p}_{k - 1} \\
\end{align*}
\begin{align*}
\implies \mathbf{r}_{k}^T \mathbf{p}_{k-1} = 0 \text{ after substituting the value of } \alpha_{k - 1} = - \frac{\mathbf{p}_{k - 1}^T \mathbf{r}_{k - 1}}{\mathbf{p}_{k - 1}^T H \mathbf{p}_{k - 1}} \\
\end{align*}
For the rest of the vectors \(\mathbf{p}_i, \: i = 0, \dots, k - 2\),
\begin{align*}
\mathbf{r}_{k} = \mathbf{r}_{k - 1} + \alpha_{k - 1} H \mathbf{p}_{k - 1} \\
\end{align*}
\begin{align*}
\implies \mathbf{p}_{i}^T \mathbf{r}_{k} = \mathbf{p}_{i}^T \mathbf{r}_{k - 1} + \alpha_{k - 1} \mathbf{p}_{i}^T H \mathbf{p}_{k - 1} \\
\end{align*}
\begin{align*}
\implies \mathbf{r}_{k}^T \mathbf{p}_{i} = \mathbf{r}_{k - 1}^T \mathbf{p}_{i} + \alpha_{k - 1} \mathbf{p}_{i}^T H \mathbf{p}_{k - 1} \\
\end{align*}
The first term on the Right Hand Side vanishes due to our induction hypothesis. The second term vanishes due to the H-conjugacy of the \(\mathbf{p}_{i}\) vectors. Therefore,
\begin{align*}
\mathbf{r}_{k}^T \mathbf{p}_{i} = 0 \: \forall \: i \in [\![ 0, k - 2 ]\!] \\
\end{align*}
Combining both of the above results, we get,
\begin{align*}
\mathbf{r}_{k}^T \mathbf{p}_{i} = 0 \: \forall \: i \in [\![ 0, k - 1 ]\!] \\
\end{align*}
and therefore, \(\mathbf{x}_k\) is the minimizer of:
\begin{align*}
f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T H \mathbf{x} + c^T \mathbf{x} \\
\end{align*}
over \(\{\mathbf{x} \mid \mathbf{x} = \mathbf{x}_0 + \mathrm{span}\{\mathbf{p}_0, \mathbf{p}_1,...\mathbf{p}_{k-1}\}\}\). Hence, proved.
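To make the theorem concrete, here is a small numerical sketch (purely illustrative, not part of the proof): we build a random symmetric positive definite \(H\) and a random \(c\), construct H-conjugate directions from arbitrary generators by Gram-Schmidt conjugation, take exact line-search steps, and check both conclusions of the theorem. The matrix, vectors, and variable names below are assumptions made only for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # illustrative symmetric positive definite matrix
c = rng.standard_normal(n)

# H-conjugate directions built from arbitrary (random) generators u_i
U = rng.standard_normal((n, n))  # columns u_0, ..., u_{n-1}
P = np.zeros((n, n))
for i in range(n):
    p = U[:, i].copy()
    for k in range(i):           # Gram-Schmidt conjugation against previous directions
        p -= (U[:, i] @ H @ P[:, k]) / (P[:, k] @ H @ P[:, k]) * P[:, k]
    P[:, i] = p

x = np.zeros(n)                  # x_0
r = H @ x + c                    # r_0 = grad f(x_0)
for k in range(n):
    p = P[:, k]
    alpha = -(p @ r) / (p @ H @ p)                            # exact line-search step
    x = x + alpha * p                                         # x_{k+1}
    r = H @ x + c                                             # r_{k+1}
    assert np.allclose(P[:, :k + 1].T @ r, 0.0, atol=1e-8)    # r_{k+1} is orthogonal to p_0, ..., p_k
    Pk = P[:, :k + 1]
    sigma = np.linalg.solve(Pk.T @ H @ Pk, -Pk.T @ c)         # direct minimizer over x_0 + span{p_0, ..., p_k}
    assert np.allclose(x, Pk @ sigma, atol=1e-8)              # ... which coincides with x_{k+1}

print("ESM verified numerically")
```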
Connection to Krylov Subspaces
The Expanding Subspace Minimization (ESM) Theorem proves two things: the next iterate \(\mathbf{x}_i\) minimizes our objective \(f(\mathbf{x})\) over the entire subspace \(\mathbf{x}_0 + \underset{k}{\mathrm{span}}\{ \mathbf{p}_k \}_0^{i - 1}\); and the next residual \(\mathbf{r}_i\) is orthogonal to all the previous search directions \(\{ \mathbf{p}_k \}_0^{i - 1}\), that is,
\begin{align*}
\mathbf{r}_i^T \mathbf{p}_k = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!] \\
\end{align*}
Let us denote this subspace as \(\mathcal{D}_i = \underset{k}{\mathrm{span}}\{ \mathbf{p}_k \}_0^{i - 1}\).
In other words, the residual \(\mathbf{r}_i\) is orthogonal to the entire subspace \(\mathcal{D}_i\), and therefore, any vector contained in \(\mathcal{D}_i\) will be orthogonal to \(\mathbf{r}_i\)!
\begin{align*}
\mathbf{r}_i \perp \mathcal{D}_i \\
\end{align*}
The image below demonstrates this fact for \(\mathcal{D}_2\).
Here we analyze a few more fundamental points.
In plain English, the first part of the ESM theorem says that, starting from \(\mathbf{x}_0\), we enlarge the expanding subspace \(\mathcal{D}_i\) by one dimension in each step, and with that extra degree of freedom of movement the iterate gets closer to the solution. As soon as \(\mathcal{D}_i\) becomes \(\mathbb{R}^n\), the error is driven to zero and we have reached our destination!
Since we have mentioned in the previous section that the directions \(\mathbf{p}\) are built from the linearly independent set of vectors \(\mathbf{u}\), that is,
\begin{align*}
\mathbf{p}_i = \mathbf{u}_i + \sum_{k = 0}^{i - 1} \beta_{ik} \mathbf{p}_k \\
\end{align*}
it follows that the space spanned by \(\{ \mathbf{p}_k \}_0^{i - 1}\) is the same as the space spanned by \(\{ \mathbf{u}_k \}_0^{i - 1}\). That is, \(\mathcal{D}_i = \underset{k}{\mathrm{span}}\{ \mathbf{u}_k \}_0^{i - 1}\). Since \(\mathbf{r}_i \perp \mathcal{D}_i\), the next residual \(\mathbf{r}_i\) is also orthogonal to all the previous generator vectors \(\{ \mathbf{u}_k \}_0^{i - 1}\) lying in the space \(\mathcal{D}_i\)! Mathematically, first transpose the above equation,
\begin{align*}
\mathbf{p}_i^T = \mathbf{u}_i^T + \sum_{k = 0}^{i - 1} \beta_{ik} \mathbf{p}_k^T \\
\end{align*}
Let \(i < j\), that is, \(i \in [\![ 0, j - 1 ]\!]\). Then,
\begin{align*}
\mathbf{p}_i^T \mathbf{r}_j = \mathbf{u}_i^T \mathbf{r}_j + \sum_{k = 0}^{i - 1} \beta_{ik} \mathbf{p}_k^T \mathbf{r}_j \\
\end{align*}
Using the Expanding Subspace Minimization theorem (\(\mathbf{p}_i^T \mathbf{r}_j = 0\) and \(\mathbf{p}_k^T \mathbf{r}_j = 0\), since \(i, k < j\)),
\begin{align*}
0 = \mathbf{u}_i^T \mathbf{r}_j + 0 \\
\end{align*}
\begin{align*}
\implies \mathbf{r}_j^T \mathbf{u}_i = 0 \: \forall \: i \in [\![ 0, j - 1 ]\!] \\
\end{align*}
Hence, the residual \(\mathbf{r}_j\) is orthogonal to every earlier generator vector \(\mathbf{u}_i\).
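A quick sketch of this fact (again with an illustrative, randomly generated SPD \(H\); here the coordinate axes \(\mathbf{e}_i\) play the role of the generators \(\mathbf{u}_i\)):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # illustrative SPD matrix
c = rng.standard_normal(n)

U = np.eye(n)                    # generators u_i = e_i (coordinate axes)
P = np.zeros((n, n))
for i in range(n):
    p = U[:, i].copy()
    for k in range(i):           # Gram-Schmidt conjugation
        p -= (U[:, i] @ H @ P[:, k]) / (P[:, k] @ H @ P[:, k]) * P[:, k]
    P[:, i] = p

x = np.zeros(n)
r = H @ x + c                    # r_0
for j in range(1, n + 1):
    p = P[:, j - 1]
    x += -(p @ r) / (p @ H @ p) * p                       # exact line search along p_{j-1}
    r = H @ x + c                                         # r_j
    assert np.allclose(U[:, :j].T @ r, 0.0, atol=1e-8)    # r_j is orthogonal to u_0, ..., u_{j-1}

print("r_j is orthogonal to every earlier generator u_i")
```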
Now comes the core idea of Conjugate Gradient: instead of using the coordinate axes \(\{ \mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_n \}\) as the generator vectors \(\{ \mathbf{u}_0, \mathbf{u}_1, \dots, \mathbf{u}_{n - 1} \}\) for the conjugated construction of our search directions \(\mathbf{p}_i\), we will use the negative residuals/gradients \(\{ - \mathbf{r}_0, - \mathbf{r}_1, \dots, - \mathbf{r}_{n - 1} \}\) as the generator vectors! That is, \(\mathbf{u}_i = - \mathbf{r}_i\). This is precisely why we have emphasized the term Conjugate(d) Gradients: the residuals/gradients are themselves not H-conjugate, but the n H-conjugate directions that we need for our algorithm are constructed out of them.
While this choice may seem arbitrary, we will see how it leads to the efficiency and other desirable attributes of the CG algorithm. Since \(\mathbf{p}_0 = \mathbf{u}_0 = -\mathbf{r}_0 = -\nabla f(\mathbf{x}_0)\), the first direction of the Conjugate Gradient process is also the direction of Steepest Descent.
We stated before that,
\begin{align*}
\implies \mathbf{r}_j^T \mathbf{u}_i = 0 \: \forall \: i \in [\![ 0, j - 1 ]\!] \\
\end{align*}
Replace \(\mathbf{u}_i = -\mathbf{r}_i\) in the above and we shall get,
\begin{align*}
\implies - \mathbf{r}_j^T \mathbf{r}_i = 0 \: \forall \: i \in [\![ 0, j - 1 ]\!] \\
\end{align*}
(Dropping the negative sign, and replacing the indices \(i\) by \(k\) and \(j\) by \(i\) with no difference)
\begin{align*}
\implies \mathbf{r}_i^T \mathbf{r}_k = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!] \\
\end{align*}
This means the next residual/gradient is orthogonal to all the previous residuals/gradients. Since the generator vectors \(\mathbf{u}_i\) have to be linearly independent, the choice \(\mathbf{u}_i = -\mathbf{r}_i\) is legitimate: the mutual orthogonality of the (nonzero) residuals \(\mathbf{r}_i, \: i \in [\![ 0, n - 1 ]\!]\), guarantees their linear independence. Contrast this with Steepest Descent, where many residuals/gradients repeat directions and hence are not linearly independent.
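Here is a minimal sketch of that claim (with the same kind of illustrative random SPD \(H\) as before): the generators are taken to be \(-\mathbf{r}_i\), each new direction is conjugated against all previous ones exactly as in the Gram-Schmidt construction above (no shortcut yet), and the residuals come out mutually orthogonal.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # illustrative SPD matrix
c = rng.standard_normal(n)

x = np.zeros(n)
r = H @ x + c                    # r_0
R, P = [r.copy()], []
for _ in range(n):
    if np.linalg.norm(r) < 1e-12:
        break
    p = -r.copy()                # generator u_i = -r_i
    for pk in P:                 # full Gram-Schmidt conjugation against all previous directions
        p -= ((-r) @ H @ pk) / (pk @ H @ pk) * pk
    P.append(p)
    x += -(p @ r) / (p @ H @ p) * p   # exact line search
    r = H @ x + c
    R.append(r.copy())

Rmat = np.column_stack(R)
G = Rmat.T @ Rmat                # Gram matrix of all residuals
assert np.allclose(G - np.diag(np.diag(G)), 0.0, atol=1e-8)   # off-diagonal entries vanish
print("residuals are mutually orthogonal")
```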
Therefore, the space spanned by the directions \(\{ \mathbf{p}_k \}_0^{i - 1}\) is equal to the space spanned by the residuals \(\{ \mathbf{r}_k \}_0^{i - 1}\) (since \(\underset{k}{\mathrm{span}}\{ \mathbf{p}_k \}_0^{i - 1} = \underset{k}{\mathrm{span}}\{ \mathbf{u}_k \}_0^{i - 1}\) and \(\mathbf{u}_k = -\mathbf{r}_k\)).
\begin{align*}
\mathcal{D}_i = \underset{k}{\mathrm{span}}\{ \mathbf{p}_k \}_0^{i - 1} = \underset{k}{\mathrm{span}}\{ \mathbf{u}_k \}_0^{i - 1} = \underset{k}{\mathrm{span}}\{ \mathbf{r}_k \}_0^{i - 1} \\
\end{align*}
Further, let us develop one more statement:
\begin{align*}
\mathbf{p}_i^T \mathbf{r}_i = \mathbf{u}_i^T \mathbf{r}_i + \sum_{k = 0}^{i - 1} \beta_{ik} \mathbf{p}_k^T \mathbf{r}_i \\
\end{align*}
\begin{align*}
\implies \mathbf{p}_i^T \mathbf{r}_i = \mathbf{u}_i^T \mathbf{r}_i = - \mathbf{r}_i^T \mathbf{r}_i \\
\end{align*}
We know that,
\begin{align*}
\mathbf{r}_i = \mathbf{r}_{i - 1} + \alpha_{i - 1} H \mathbf{p}_{i - 1} \\
\end{align*}
Since \(\mathbf{r}_i \perp \mathcal{D}_i\), the (nonzero) residual \(\mathbf{r}_i\) lies outside \(\mathcal{D}_i\), while \(\mathbf{r}_{i - 1} \in \mathcal{D}_i\); therefore, the vector \(H \mathbf{p}_{i - 1}\) must have a component outside \(\mathcal{D}_i\). Moreover, \(\mathbf{p}_{i - 1} \in \mathcal{D}_i \implies H \mathbf{p}_{i - 1} \in H \mathcal{D}_i\). Therefore, each subspace \(\mathcal{D}_{i + 1} = \underset{k}{\mathrm{span}}\{ \{ \mathbf{r}_k \}_0^{i - 1}, \mathbf{r}_i \}\) is constructed by taking the span of the union of \(\mathcal{D}_i\) and \(H \mathcal{D}_i\).
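A small check of this growth pattern (illustrative random SPD setup, as in the earlier sketches): after two steps, the residual \(\mathbf{r}_2\) is nonzero and orthogonal to \(\mathcal{D}_2\), so it cannot lie in \(\mathcal{D}_2\), yet it does lie in the span of \(\mathcal{D}_2 \cup H \mathcal{D}_2\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # illustrative SPD matrix
c = rng.standard_normal(n)

# two steps of the conjugated-residual construction (u_i = -r_i)
x = np.zeros(n)
r = H @ x + c
P, R = [], [r.copy()]
for _ in range(2):
    p = -r.copy()
    for pk in P:
        p -= ((-r) @ H @ pk) / (pk @ H @ pk) * pk
    P.append(p)
    x += -(p @ r) / (p @ H @ p) * p
    r = H @ x + c
    R.append(r.copy())

D_i = np.column_stack(P)                       # basis of D_2 = span{p_0, p_1}
D_plus_HD = np.column_stack([D_i, H @ D_i])    # spans D_2 together with H D_2
r_i = R[2]                                     # r_2

def in_span(v, B):
    """True if v lies (numerically) in the column space of B."""
    coeff, *_ = np.linalg.lstsq(B, v, rcond=None)
    return np.linalg.norm(B @ coeff - v) <= 1e-7 * (np.linalg.norm(v) + 1.0)

assert np.linalg.norm(r_i) > 1e-10               # r_2 is not zero ...
assert np.allclose(D_i.T @ r_i, 0.0, atol=1e-9)  # ... and is orthogonal to D_2, so it is not in D_2
assert in_span(r_i, D_plus_HD)                   # yet it lies in span(D_2 U H D_2)
print("D_{i+1} grows out of D_i and H D_i")
```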
Starting from the first subspace,
\begin{align*}
\mathcal{D}_1 = \text{span}\{ \mathbf{p}_0 \} \\
\end{align*}
The second subspace is,
\begin{align*}
\mathcal{D}_2 = \text{span}\{ \mathbf{p}_0 \} + H \text{span}\{ \mathbf{p}_0 \} \\
\end{align*}
\begin{align*}
\mathcal{D}_2 = \text{span}\{ \mathbf{p}_0, H \mathbf{p}_0 \} \\
\end{align*}
The third subspace is,
\begin{align*}
\mathcal{D}_3 = \text{span}\{ \mathbf{p}_0, H \mathbf{p}_0 \} + H \text{span}\{ \mathbf{p}_0, H \mathbf{p}_0 \} \\
\end{align*}
\begin{align*}
\mathcal{D}_3 = \text{span}\{ \mathbf{p}_0, H \mathbf{p}_0, H (H \mathbf{p}_0) \} \\
\end{align*}
and therefore, by recursion, the i-th subspace is,
\begin{align*}
\mathcal{D}_i = \text{span}\{ \mathbf{p}_0, H \mathbf{p}_0, H (H \mathbf{p}_0), H (H (H \mathbf{p}_0)), \dots, H^{i - 1} \mathbf{p}_0\} \\
\end{align*}
\begin{align*}
\implies \mathcal{D}_i = \text{span}\{ \mathbf{p}_0, H \mathbf{p}_0, H^2 \mathbf{p}_0, H^3 \mathbf{p}_0, \dots, H^{i - 1} \mathbf{p}_0\} \\
\end{align*}
Since \(\mathcal{D}_i = \underset{k}{\mathrm{span}}\{ \mathbf{p}_k \}_0^{i - 1} = \underset{k}{\mathrm{span}}\{ \mathbf{r}_k \}_0^{i - 1}\) by our choice of generators, the same argument gives,
\begin{align*}
\implies \mathcal{D}_i = \text{span}\{ \mathbf{r}_0, H \mathbf{r}_0, H^2 \mathbf{r}_0, H^3 \mathbf{r}_0, \dots, H^{i - 1} \mathbf{r}_0\} \\
\end{align*}
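As a quick numerical confirmation of this identity (a sketch with an illustrative random SPD \(H\), not anything prescribed above), the first few conjugated-residual directions indeed span the same space as \(\{\mathbf{r}_0, H \mathbf{r}_0, H^2 \mathbf{r}_0\}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 7
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # illustrative SPD matrix
c = rng.standard_normal(n)

x = np.zeros(n)
r0 = H @ x + c
r = r0.copy()
P = []
for _ in range(3):               # build p_0, p_1, p_2 with u_i = -r_i
    p = -r.copy()
    for pk in P:
        p -= ((-r) @ H @ pk) / (pk @ H @ pk) * pk
    P.append(p)
    x += -(p @ r) / (p @ H @ p) * p
    r = H @ x + c

i = len(P)
D_i = np.column_stack(P)                                                      # span{p_0, ..., p_{i-1}}
K_i = np.column_stack([np.linalg.matrix_power(H, j) @ r0 for j in range(i)])  # span{r_0, H r_0, ..., H^{i-1} r_0}

def in_span(v, B):
    coeff, *_ = np.linalg.lstsq(B, v, rcond=None)
    return np.linalg.norm(B @ coeff - v) <= 1e-7 * (np.linalg.norm(v) + 1.0)

assert all(in_span(K_i[:, j], D_i) for j in range(i))   # K_i is contained in D_i
assert all(in_span(D_i[:, j], K_i) for j in range(i))   # D_i is contained in K_i
print("D_i equals the Krylov subspace K_i")
```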
This subspace has a special name: the Krylov Subspace, denoted by \(\mathcal{K}_i\). This particular subspace, obtained by choosing the generator vectors \(\mathbf{u}_i\) to be the negative residuals/gradients \(-\mathbf{r}_i\), is in fact the core of what makes the Conjugate Gradient method so efficient. Why is that? If we look carefully, we can observe that \(\mathcal{D}_{i + 1}\) is built from \(\mathcal{D}_i\) and \(H \mathcal{D}_i\) by construction. Since \(\mathbf{r}_{i + 1} \perp \mathcal{D}_{i + 1}\), as proven by the ESM Theorem, we can write,
\begin{align*}
\mathbf{r}_{i + 1} \perp \mathcal{D}_{i + 1} \\
\end{align*}
\begin{align*}
\mathbf{r}_{i + 1} \perp \left( \mathcal{D}_i \bigcup H \mathcal{D}_i \right) \\
\end{align*}
resulting in,
\begin{align*}
\mathbf{r}_{i + 1} \perp \mathcal{D}_i \\
\end{align*}
and,
\begin{align*}
\mathbf{r}_{i + 1} \perp H \mathcal{D}_i \\
\end{align*}
which is a funny result. Simplifying the above expression with \(\mathcal{D}_i = \underset{k}{\mathrm{span}}\{ \mathbf{p}_k \}_0^{i - 1} = \mathrm{span}\{ \mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_{i - 1} \}\),
\begin{align*}
\mathbf{r}_{i + 1} \perp H \mathrm{span}\{ \mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_{i - 1} \} \\
\end{align*}
\begin{align*}
\mathbf{r}_{i + 1} \perp \mathrm{span}\{ H \mathbf{p}_0, H \mathbf{p}_1, H \mathbf{p}_2, \dots, H \mathbf{p}_{i - 1} \} \\
\end{align*}
\begin{align*}
\mathbf{r}_{i + 1}^T H \mathbf{p}_k = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!] \\
\end{align*}
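A quick numerical sanity check of this relation (same illustrative random SPD setup as in the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 7
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # illustrative SPD matrix
c = rng.standard_normal(n)

x = np.zeros(n)
r = H @ x + c
P, R = [], [r.copy()]
for _ in range(3):               # produce p_0, p_1, p_2 and r_1, r_2, r_3
    p = -r.copy()
    for pk in P:
        p -= ((-r) @ H @ pk) / (pk @ H @ pk) * pk
    P.append(p)
    x += -(p @ r) / (p @ H @ p) * p
    r = H @ x + c
    R.append(r.copy())

i = 2
r_next = R[i + 1]                                 # r_{i+1}
vals = [r_next @ H @ P[k] for k in range(i + 1)]  # r_{i+1}^T H p_k for k = 0, ..., i
assert np.allclose(vals[:-1], 0.0, atol=1e-8)     # H-conjugate to p_0, ..., p_{i-1}
assert abs(vals[-1]) > 100 * max(abs(v) for v in vals[:-1])   # but not to p_i
print("r_{i+1} is H-conjugate to every previous direction except p_i")
```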
What this is saying is that \(\mathbf{r}_{i + 1}\) is H-conjugate to all the previous search directions except \(\mathbf{p}_i\) (\(\mathbf{r}_{i + 1}\) is also orthogonal to all the previous search directions as shown above, which is the reason this fact seems funny). Recall that we said in the previous section, to construct a new direction \(\mathbf{p}_{i + 1}\), we want to take a linearly independent generator \(\mathbf{u}_{i + 1}\) and peel off any of its components that are not H-conjugate to the set of directions \(\{ \mathbf{p}_k \}_0^i\).
\begin{align*}
\mathbf{p}_{i + 1} = \mathbf{u}_{i + 1} + \sum_{k = 0}^{i} \beta_{(i + 1),k} \mathbf{p}_k \\
\end{align*}
Since our generator directions are the residuals/gradients, replace \(\mathbf{u}_{i + 1} = - \mathbf{r}_{i + 1}\),
\begin{align*}
\mathbf{p}_{i + 1} = - \mathbf{r}_{i + 1} + \sum_{k = 0}^{i} \beta_{(i + 1),k} \mathbf{p}_k \\
\end{align*}
BUT! We just proved that \(\mathbf{r}_{i + 1}^T H \mathbf{p}_k = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!]\)! This means \(\mathbf{r}_{i + 1}\) is already H-conjugate to the directions \(\{ \mathbf{p}_k \}_0^{i - 1}\), so there is nothing to peel off in those directions: the corresponding coefficients \(\beta_{(i + 1),k}\) vanish for \(k < i\). The only direction with which \(\mathbf{r}_{i + 1}\) is not H-conjugate is \(\mathbf{p}_i\). This greatly simplifies our job: we simply retain the \(\mathbf{p}_i\) term and drop all the other terms (as they are unnecessary):
\begin{align*}
\mathbf{p}_{i + 1} = - \mathbf{r}_{i + 1} + \beta_{(i + 1),i} \mathbf{p}_i \\
\end{align*}
Now we simply rename \(\beta_{(i + 1),i}\) as \(\beta_{i + 1}\) and voila! Out comes our final form of a new H-conjugate direction:
\begin{align*}
\mathbf{p}_{i + 1} = - \mathbf{r}_{i + 1} + \beta_{i + 1} \mathbf{p}_i \\
\end{align*}
To calculate \(\beta_{i + 1}\), transpose the above equation and post-multiply by \(H \mathbf{p}_i\),
\begin{align*}
\mathbf{p}_{i + 1}^T H \mathbf{p}_i = - \mathbf{r}_{i + 1}^T H \mathbf{p}_i + \beta_{i + 1} \mathbf{p}_i^T H \mathbf{p}_i \\
\end{align*}
Setting \(\mathbf{p}_{i + 1}^T H \mathbf{p}_i = 0\) (the conjugacy we require) and solving for \(\beta_{i + 1}\),
\begin{align*}
\implies \beta_{i + 1} = \frac{\mathbf{r}_{i + 1}^T H \mathbf{p}_i}{\mathbf{p}_i^T H \mathbf{p}_i} \\
\end{align*}
Using \(\mathbf{r}_{i + 1} = \mathbf{r}_{i} + \alpha_{i} H \mathbf{p}_{i}\), pre-multiply by \(\mathbf{r}_{i + 1}^T\) and use the residual orthogonality \(\mathbf{r}_{i + 1}^T \mathbf{r}_{i} = 0\),
\begin{align*}
\mathbf{r}_{i + 1}^T \mathbf{r}_{i + 1} = \mathbf{r}_{i + 1}^T \mathbf{r}_{i} + \alpha_{i} \mathbf{r}_{i + 1}^T H \mathbf{p}_{i} \\
\end{align*}
\begin{align*}
\implies \mathbf{r}_{i + 1}^T \mathbf{r}_{i + 1} = \alpha_{i} \mathbf{r}_{i + 1}^T H \mathbf{p}_{i} \\
\end{align*}
And, pre-multiplying the same relation by \(\mathbf{p}_{i}^T\), then using \(\mathbf{p}_{i}^T \mathbf{r}_{i + 1} = 0\) and \(\mathbf{p}_{i}^T \mathbf{r}_{i} = -\mathbf{r}_{i}^T \mathbf{r}_{i}\),
\begin{align*}
\mathbf{p}_{i}^T \mathbf{r}_{i + 1} = \mathbf{p}_{i}^T \mathbf{r}_{i} + \alpha_{i} \mathbf{p}_{i}^T H \mathbf{p}_{i} \\
\end{align*}
\begin{align*}
\implies 0 = - \mathbf{r}_{i}^T \mathbf{r}_{i} + \alpha_{i} \mathbf{p}_{i}^T H \mathbf{p}_{i} \\
\end{align*}
\begin{align*}
\implies \mathbf{r}_{i}^T \mathbf{r}_{i} = \alpha_{i} \mathbf{p}_{i}^T H \mathbf{p}_{i} \\
\end{align*}
Substituting these two identities into the expression for \(\beta_{i + 1}\) (after multiplying its numerator and denominator by \(\alpha_{i}\)),
\begin{align*}
\beta_{i + 1} = \frac{\alpha_{i} \mathbf{r}_{i + 1}^T H \mathbf{p}_i}{\alpha_{i} \mathbf{p}_i^T H \mathbf{p}_i} \\
\end{align*}
\begin{align*}
\implies \beta_{i + 1} = \frac{\mathbf{r}_{i + 1}^T \mathbf{r}_{i + 1}}{\mathbf{r}_{i}^T \mathbf{r}_{i}} \\
\end{align*}
Now we are onto something! We neither need to store all the previous directions in memory, nor do we need them to calculate each new direction. This discovery firmly established the CG algorithm as one of the mainstays of modern scientific computing.
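Putting everything together, here is a minimal CG sketch for \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T H \mathbf{x} + c^T \mathbf{x}\), i.e. for solving \(H\mathbf{x} = -c\); the SPD matrix \(H\) and vector \(c\) are illustrative random data, and only the current residual and the most recent direction are kept in memory. The two equivalent expressions for \(\beta_{i + 1}\) derived above are compared at every step.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # illustrative SPD matrix
c = rng.standard_normal(n)

x = np.zeros(n)
r = H @ x + c                    # r_0 = grad f(x_0)
p = -r                           # p_0: the steepest-descent direction
for _ in range(n):
    Hp = H @ p
    alpha = -(p @ r) / (p @ Hp)          # exact line search
    x = x + alpha * p
    r_new = r + alpha * Hp               # r_{i+1} = r_i + alpha_i H p_i
    if np.linalg.norm(r_new) < 1e-10:
        break
    beta_conj = (r_new @ Hp) / (p @ Hp)          # beta from the conjugacy condition
    beta_cheap = (r_new @ r_new) / (r @ r)       # beta from the residual norms
    assert np.isclose(beta_conj, beta_cheap)     # the two expressions agree
    p = -r_new + beta_cheap * p                  # next H-conjugate direction
    r = r_new

assert np.allclose(H @ x + c, 0.0, atol=1e-8)    # H x = -c has been solved
print("CG solved Hx = -c while storing only the latest direction")
```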
As a final step, let us prove one more important thing: the direction \(\mathbf{p}_{i + 1}\), generated from just \(\mathbf{r}_{i + 1}\) and \(\mathbf{p}_i\) at the i-th step, is indeed H-conjugate to the entire set \(\{ \mathbf{p}_k \}_0^{i - 1}\) that no longer appears in the formula. Assume that the set up to the i-th iterate is already H-conjugate.
First, we transpose the following equation:
\begin{align*}
\mathbf{p}_{i + 1} = - \mathbf{r}_{i + 1} + \beta_{i + 1} \mathbf{p}_i \\
\end{align*}
\begin{align*}
\mathbf{p}_{i + 1}^T = - \mathbf{r}_{i + 1}^T + \beta_{i + 1} \mathbf{p}_i^T \\
\end{align*}
Post-multiply with \(H \mathbf{p}_{k}\), \(k \in [\![ 0, i - 1 ]\!]\),
\begin{align*}
\mathbf{p}_{i + 1}^T H \mathbf{p}_{k} = - \mathbf{r}_{i + 1}^T H \mathbf{p}_{k} + \beta_{i + 1} \mathbf{p}_i^T H \mathbf{p}_{k} \\
\end{align*}
Using \(\mathbf{r}_{i + 1}^T H \mathbf{p}_k = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!]\) and the fact that the set \(\{ \mathbf{p}_k \}_0^{i}\) is H-conjugate, both terms on the right-hand side vanish. Therefore,
\begin{align*}
\mathbf{p}_{i + 1}^T H \mathbf{p}_{k} = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!] \\
\end{align*}
Hence, proved.
What we have done so far achieves both goal #2 and goal #1. Before presenting the Conjugate Gradient algorithm in the next section, let us summarise the most important facts from this section:
\begin{align*}
\mathbf{r}_i^T \mathbf{p}_k = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!] \\
\end{align*}
\begin{align*}
\mathbf{r}_i^T \mathbf{r}_k = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!] \\
\end{align*}
\begin{align*}
\mathbf{p}_{i}^T H \mathbf{p}_{k} = 0 \: \forall \: k \in [\![ 0, i - 1 ]\!] \\
\end{align*}
\begin{align*}
\mathbf{p}_i^T \mathbf{r}_i = \mathbf{u}_i^T \mathbf{r}_i = - \mathbf{r}_i^T \mathbf{r}_i \\
\end{align*}
\begin{align*}
\mathcal{D}_i = \mathcal{K}_i = \mathrm{span}\{ \mathbf{r}_0, \mathbf{r}_1, \mathbf{r}_2, \dots, \mathbf{r}_{i - 1} \} = \mathrm{span}\{ \mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_{i - 1} \} \\
= \text{span}\{ \mathbf{r}_0, H \mathbf{r}_0, H^2 \mathbf{r}_0, H^3 \mathbf{r}_0, \dots, H^{i - 1} \mathbf{r}_0\} \\
\end{align*}
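To close the section, here is a compact sketch (same illustrative random SPD setup as before) that runs the two-term CG recurrence for a few steps and spot-checks the orthogonality and conjugacy identities above; the Krylov identity was already checked in the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)      # illustrative SPD matrix
c = rng.standard_normal(n)

x = np.zeros(n)
r = H @ x + c
p = -r
R, P = [r.copy()], [p.copy()]
for _ in range(4):               # a few steps of the compact recurrence
    Hp = H @ p
    alpha = -(p @ r) / (p @ Hp)
    x += alpha * p
    r_new = r + alpha * Hp
    p = -r_new + (r_new @ r_new) / (r @ r) * p
    r = r_new
    R.append(r.copy())
    P.append(p.copy())

i = 4
Rm, Pm = np.column_stack(R), np.column_stack(P)
assert np.allclose(Pm[:, :i].T @ Rm[:, i], 0.0, atol=1e-8)       # r_i is orthogonal to p_k, k < i
assert np.allclose(Rm[:, :i].T @ Rm[:, i], 0.0, atol=1e-8)       # r_i is orthogonal to r_k, k < i
assert np.allclose(Pm[:, :i].T @ H @ Pm[:, i], 0.0, atol=1e-7)   # p_i is H-conjugate to p_k, k < i
assert np.isclose(Pm[:, i] @ Rm[:, i], -(Rm[:, i] @ Rm[:, i]))   # p_i^T r_i = -r_i^T r_i
print("summary identities verified")
```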