PRECONDITIONING THE MODIFIED CONJUGATE GRADIENT METHOD

In this paper, the convergence analysis of the conventional conjugate Gradient method was reviewed. And the convergence analysis of the modified conjugate Gradient method was analysed with our extension on preconditioning the algorithm. Convergence of the algorithm is a function of the condition number of M -1 A. Again, this work broadens our understanding that the modified CGM yields the exacts result after n-iterations, and further proves that the CGM algorithm is quicker if there are duplicated eigenvalues. Given infinite floating point precision, the number of iterations required to compute an exact solution is at most the number of distinct eigenvalues. It was discovered that the modified CGM algorithm converges more quickly when eigenvalues are clustered together than when they are irregularly distributed between a given interval. The effectiveness of a preconditioner is determined by the condition number of the matrix and occasionally by its clustering of eigenvalues. For large scale application, CGM should always be used with a pre-conditioner to improve convergence.


INTRODUCTION
Optimization theory is aimed at solving problem under investigation with a high degree of precision and under a highly restrictive operation time so as to minimize computing cost.It is necessary to choose a computational scheme that can meet these requirements (Otunta and Ibiejugba, 1991).The desire to construct a suitable and implementable algorithm has motivated the research investigation contained in this work.In this paper we seek to improve the convergence rate of the Modified Conjugate Gradient Method by preconditioning the algorithm.

The Conventional Conjugate Gradient method (CGM)
The conventional conjugate method (CGM) was originally developed by Hestenes and Stiefel (1952) as a method of solution for linear systems.Fletcher and Reeves (1964) built the necessary underlying theory for a successful application of the method to quadratic functional and developed its convergence properties.
To this end we defined quadratic functional as: Where A is an n x n symmetric positive definite operator on the Hilbert space H, and a is vector in H.The steps in CGM algorithm are describe as follows (see Omolehin et al, 2006).Step 1: The first element H X ∈ 0 of the sequence is guessed, while the remaining members of the sequence are computed with the aid of step 2 to 4.
Step 2: where 0 P is the descent direction, g o is the gradient of 0 ) ( x x and x f = Step 3: Step 4: If , for some ί, terminate the sequence, else set ί = ί+1 We state the following theorem because it will give an understanding to the analysis of the convergence rate of the conventional conjugate gradient method (see Omolehin et al, 2006).
Theorem 1 (statement only): The convergence rate of GM algorithm for quadratic functional remains stable if = m / M where m and M are the smallest and largest eigen values of the control operator A respectively.(See proof in Omolehin et al, 2006)

Convergence Rate of Conventional CGM Algorithm
To fully understand this work it will be necessary to show the convergence rate of the conventional CGM Algorithm (See Omolehin et al, 2006).Recall the quadratic functional where f 0 is constant, H is a Hilbert space, x is a n x n dimensional vector in H, a positive definite constant matrix operator.
Theorem 2: The law of convergence of the CGM algorithm is given as This establishes the convergence rate of the conventional CGM algorithm.(see proof in Omolehin et al, 2006).

The Modified Conjugate Gradient Method
In our previous work, Omorogbe and Osagiede (2008a) on the general convergence of the steepest descent method, the number of matrix-vector products per iteration can be reduced to one by using a recurrence to find the residual: Here, the conjugate gradient is simply the method of conjugate direction where the search direction are constructed by conjugation of the residuals (i.e by setting µ ί =r ί ).The residual worked for steepest descent in our previous work Omorogbe and Osagiede (2008a), and it will even worked better for the conjugate gradient method.It has the property that it's orthogonal to the search direction i.e d ί r j = 0, ί<j (by A-orthogonal of d-vectors) (8) So, it's guaranteed always to produce a new, linearly independent search direction unless the residual is zero, in which case the problem is feasible.
As we shall see, there is an even better reason to choose, the residual.Let us consider the implication of this choice, because the search vectors are built from the residuals and the subspace span {r 0 , r ί , r i-1 } is equal to D i.As each residual is orthogonal to the previous search directions, it is also orthogonal to the previous residuals Interestingly, Eq(7) shows that each new residual r ί is just a linear combination of the previous residual and Ad ί-1, recalling that , this fact implies that each new subspace D ί+1 is formed from the union of the previous subspace D ί and the subspace Ad ί .Hence, D ί = span {d 0 , Ad 0 , A 2 d 0 , ., A ί-1 d 0 } = span {r 0 , Ar 0 , A 2 r 0 , .., A ί-1 r 0 } According to Shewchuk (1994), this supspace is called krylov subspace created by repeatedly applying a matrix to a vector.It has a fascinating property; because Ad ί is included in D ί+1 , the fact that the preceding residual r ί+1 is orthogonal to D ί+1 by using Gram-Schmidt conjugation r ί+1 is already A-orthogonal to all previous directions except d ί .The process of generating the set of Aorthogonal search directions {d ί } is called conjugate Gram-Schmidt process (Gilbert and Nocedal, 1992).In the context of this paper, It follows that the Gram-Schmidt constant are: Simplifying this expression and taking inner product of r ί and eq (7) Clearly, most of the β ίj term have disappeared.It is no longer necessary to store old search vectors to ensure the A-orthogonality of new search vectors.This major advance is what makes the modified conjugate gradient as important an algorithm as it is because both the iteration are reduced from 0(n 2 ) to 0(m), where mn is the number of zero entries of A (Gilbert and Nocedal, 1992).Henceforth, we shall use the abbreviation Putting everything together, the modified conjugate gradient algorithm is given below When the algorithm reaches the minimum point, the residual becomes zero, and if (*) is evaluated on iteration later, a division by zero will result.Then, STOP.
The above algorithm of the modified CGM is clearly an improvement on the modified steepest descent method as well as algorithm 1 of the conventional conjugate method.
The performance of the modified conjugate gradient method is demonstrated in Fig 1.

Convergence Analysis of The Modified Conjugate Method.
Normally CGM is complete after n-Iterations.However in practice, accumulated floating point roundoff error courses the residual to gradually lose accuracy, and cancellation error causes the search vectors to lose A-orthogonality.This convergence analysis is important because the modified CGM algorithm is used for large class of problems that is not feasible to run even in niterations.The analysis is done using picking perfect polynomials (Omorogbe and Osagiede, 2008b).

Pick perfect polynomials
We have seen that, each step of the modified CGM algorithm, the value e ί is chosen from e 0 + D ί , where Using Krylov subspaces, for a fixed ί, the error term has the form The coefficient j are related to the value ί and ί , but the precise relationship is that CGM algorithm closes the j coefficients that minimize ǁe ί ǁ A .
x 0 x The expression in parentheses above can be expressed as a polynomial.Let P ί ( ) be a polynomial of degree ί, P ί can take either a scalar or a matrix as its argument, and will evaluate to the same; i.e If P 2 ( ) = 2 2 + 1, then P 2 (A) = 2A 2 +1.This feasible notation comes in handy such that Then, we can express the error term as e ί = Ρ ί (A)e 0 If we require that Ρ ί (0) = 1 the modified CGM chooses this polynomials when it chooses the j coefficients.Let's examine the effect of applying this polynomial to e 0 As in the analysis of the steepest descent in our earlier work Omorogbe and Osagiede (2008b), this expresses e 0 as a linear combination of orthogonal unit eigen vectors and we find that e ί = ξ j P ί ( j )V j Ae ί = ξ j P ί ( j ) j V j Implies The performance of the modified CGM is illustrated in Figure 2 2 above, the convergence of the modified CGM after i-iterations depends on how close a polynomial P ί of degree ί can be to zero on each eigenvalue, given the constraint that P ί (0) = 1.The CGM algorithm finds the polynomial that minimizes this expression, but convergence is only as good as the convergence of the least eigenvectors.Letting E(A) be the set of eigenvalues of A. we have Fig2 illustrated, for several values of i, the p ί that minimizes this expression from our illustration with eigen values 2 and 7.There is only one polynomial of degree zero that satisfied P 0 (0) =1, and that is P 0 ( ) =1, graphed into Fig2 (a).The optimal polynomial of degree one is P 1 ( )=1-2x/9 as graphed in Fig 2 (b).Note that P 1 (2) = 5/9 and P 1 (7) = -5/9, and so the energy norm of the error term after one iteration of the CGM is no greater than 5/9 it initial value.Figure 2 (c) shows that, after two iterations, Equations ( * ) evaluates to zero.This is because the polynomial of degree two can be fit to the three points P 2 (0) =1, P 2 (2) = 0 and P 2 (7) = 0.In general, a polynomial of degree n can fit n+ 1 point, and thereby accommodate n separate eigen values.
The foregoing discussion reinforces our understanding that the modified CGM yields the exact result after n iterations; and further proves that the modified CGM is quicker if there are duplicated eigen values, given infinite floating-point precision, the number of iterations required to compute an exact solution is at most the number of distinct eigenvalues.We also find that modified CGM converges more quickly when eigenvalues are clustered together than when they are irregularly distributed between min and max , because it is easier for the algorithm to choose a polynomial that makes equation ( 10) small.

Chebyshev Polynomials.
A useful approach is to minimize equation (10) over the range [ min , max ] rather than at a finite number of points.The polynomials that accomplish this are based on Chebyshev polynomials (Gilbert and Nocedal, 1992).
The Chebyshev polynomial of degree i is T i ( ) =1/2 [( + √ 2 -1) i +( -√ 2 -1) i ] The Chebyshev polynomials have the property that |T i ( )| ≤ 1 on the domain [-1,1] and further that T i ( ) is maximum on the domain [-1,1] among all such polynomials (Gilbert and Nocedal, 1992).Equation ( 10 The second addend inside the square brackets converges to zero as ί increases, so it is common to express the convergence of CGM with the weaker inequality The first step of the modified CGM is identical to a step on the steepest descent method.Setting ί = 1 in equation ( 11), we obtain the convergence result for the steepest descent method of our earlier work (Omorogbe and Osagiede, 2008b): This is just the polynomial case illustrated in Figure 2(b).However in practice CGM usually converges faster than equation ( 12) would suggest, because of good eigenvalue distribution or good starting points.Comparing equation ( 12) of the modified CGM and equation ( 13) of the modified steepest descent method, it is clear that the convergence of the modified CGM is much quicker than that of modified steepest descent method as well as the conventional CGM algorithm.But it is not necessarily true that every iteration of CGM enjoys faster convergence, for example, the first equation of CGM is an iteration of steepest descent the factor 2 in equation ( 12) allows CGM a little slow for these poor iterations.Preconditioning the Modified Conjugate Gradient Method Preconditioning is a technique for improving the condition number of a matrix.(Gilbert and Nocedal, 1992), Suppose that M is a symmetric matrix that approximates A. but is easier to invert.
We can solve AX = b indirectly by solving , or if the eigenvalues of M -1 A are better clustered than those of A, we can iteratively solve equation ( 14) more quickly that the original problem.M -1 A is not generally symmetric nor definite, even if M and A are.
We can prevent this difficulty, because for every symmetric, positive -definite M there is matrix E that has the property that EE T = M, (the matrix E can be obtained by Cholesky factorization).The matrices M -1 A and E -1 AE -T have the same eigenvalues.This is true because if V is an eigenvector of M -1 A with eigenvalue , then E T V is an eigenvector of E -1 AE -T with eigenvalue : The system Ax = b can be transformed into the problem E -1 AE -T X = E -1 b.Then X -E T X, is solved first for x 0 , then for x i .Since E -1 AE T is symmetric and positive -definite, x can be found by steepest descent method or conjugate Gradient method (CGM).The system is called the transformed preconditioned conjugate Gradient method (Sluis and Vorst, 1986): This method has the undesirable characteristics that E must be computed.However, a few careful variable substitution can eliminate E. Set r i = E -1 r i = E T d ί , and use the identities X i = E T x ί and E -T E -1 = M -1 .We derive the untransformed preconditioned conjugate Gradient method: The matrix E does not appear in these equations; only M -1 is needed.By the same means, it is possible to derive a preconditioned steepest Descent method that does not use E.
The effectiveness of a preconditioned M is determined by the condition number of M -1 A, and occasionally by its clustering of eigenvalues.The problem remains of finding a pre conditioner that approximate A well enough to improve convergence enough to make up for the cost of computing the product M -1 r ί once per iteration.It is necessary to explicitly compute M or M -1 ; it is only necessary to be able to compute the effect of applying M -1 to a vector.Within this constraint there is surprisingly rich supply of possibilities.
Intuitively, pre-conditioning is an attempt to stretch the quadratic form to make it appear more spherical, so that the eigenvalues are close to each other.A perfect pre conditioner is M = A; for this pre conditioner, M -1 A has a condition number of one, and the quadratic form is perfectly spherical, so solution takes only one iteration.Unfortunately, the preconditioning step is solving the system MX = b, but this is not a very expedient pre conditioner.
The simplest pre conditioner is a diagonal matrix whose diagonal entries are identical to those of A . the process of applying this pre conditioner is refers to as diagonal preconditioning or Jacobi preconditioning (Gilbert and Nocedal, 1992).This is equivalent to scaling the quadratic form along the coordinate axes.By comparison the perfect pre conditioner M = A scales the quadratic form along its eigenvector axes.A diagonal matrix is trivial to invert, when it is clear that the condition number has improved.This improvement is much more beneficial for systems where n 2.
A more elaborate pre conditioner is incomplete Cholesky preconditioning (Gilbert and Nocedal, 1992).Cholesky factorization is a technique for factorizing a matrix A into the form LL T , where L is a triangular matrix.Incomplete Cholesky factorization is a variant in which little or no fill is allowed; A is approximated by the product LL T , where L might be restricted to have the same pattern of nonzero elements as A; other element of L are thrown away.To use LL T as a pre conditioner, the solution to L L = z is computed by back substitution ( the inverse of LL T is never explicitly computed).Unfortunately, incomplete Cholesky preconditioning is not always stable (Gilbert and Nocedal, 1992).

The Modified Conjugate Gradients on the Normal Equations
The modified CGM can be used to solve system where A is not symmetric, not positivedefinite, and even not square.A solution to the least squares problem Min AX -b 2 (15) can be found by setting the derivative of the expression (15) to zero: ) If A is square and nonsingular, the solution to equation ( 16) is the solution to Ax = b.If A is not square, and Ax = b is over constrained, that is, has more linear independent equations than variables, then there may or may not be solution to Ax = b, but it is always possible to find a value of x that minimizes expression (15), the sum of squares of the error of each linear equation (Omorogbe and Osagiede, 2008b).
A T A is symmetric and positive (for any x, x T A T Ax =|| Ax = || 2 ≥0).If Ax = b is a constrained, then A T A is nonsingular, and methods like Steepest Descent and CGM can be used to solve equation ( 16).The only problem in doing so is that the condition number of A T A is the square of that of A. So convergence is significantly slower (Omorogbe and Osagiede, 2008b). x

PRECONDITIONING THE MODIFIED CONJUGATE GRADIENT METHOD 105
An important technical point is that the matrix A T A is never formed explicitly, be cause it is less sparse than A. Instead A T A is multiplied by d, first we find the product Ad, and then A T Ad.Also, numerical stability is improved if the value d T A T Ad in (2) of Algorithm 2 is computed by taking the inner product of Ad with itself.

CONCLUSION
The effectiveness of a pre conditioner M is determined by the condition number of M -1 A, and occasionally by its clustering of eigenvalues (Omorogbe and Osagiede, 2008b).The problem remains of finding a pre conditioner that approximates A well enough to improve convergence of the CGM to make up for the cost of computing the product M -1 r i once per iteration (Omorogbe and Osagiede, 2008b).It is necessary to explicitly compute M or M -1 ; it is only necessary to be able to compute the effect of applying M -1 to a vector (Omorogbe and Osagiede, 2008a).Within this constraint, there is a surprisingly rich supply of possibilities as earlier discussed (Omorogbe and Osagiede, 2008b).However, it is concluded that for large -scale application, CGM should always be used with a pre conditioner to improve convergence.
Institute of Education, Ekehuan Campus, University of Benin, Benin City, Nigeria. A. A. Osagiede, Department of Mathematics, Faculty of Physical Sciences, University of Benin, Benin City, Nigeria.

Figure 2 :
Figure 2: The performance of the modified CGM algorithm the oscillating properties of Chebyshev polynomials with the domain min ≤ ≤ max .The denominator enforces our requirement that P ί (0) = 1.The numerator has a maximum value on the interval between min and max so, from equation (10) we have,

Fig. 3 :
Fig. 3: Illustration of convergence of the modified CGM as a function of condition number. k