A New Exact State Reconstruction Strategy for Conjugate Gradient Methods with Arbitrary Preconditioners

Abstract

With growing numbers of nodes in large-scale parallel computers, the likelihood of unanticipated node failures increases. Furthermore, global reduction operations become major bottlenecks due to their limited parallel scalability. The Preconditioned Conjugate Gradient (PCG) method, an important iterative solver for large sparse linear systems, faces both challenges. The negative impact of global reduction operations on scalability can be reduced by using a preconditioner that significantly reduces the number of iterations, by overlapping communication with computation (communication-hiding variants of PCG), or by reducing synchronization points (communication-avoiding variants of PCG). However, efficient algorithm-based resilience to unanticipated node failures that does not impact the convergence of the solver has so far been studied only for a single scalable variant of PCG, and not for arbitrary preconditioners. To address both challenges in combination, we present variants of standard PCG and communication-hiding PCG that are resilient to node failures. By exploiting algorithm-specific properties of PCG, the overhead of storing redundant information during the failure-free phase can be made very small. Efficient recovery from multiple node failures is based on adapting an exact state reconstruction (ESR) strategy. Existing ESR strategies are not applicable to all preconditioners, as they require the explicit availability of the preconditioner matrix. We extend the ESR approach to work efficiently with arbitrary preconditioners for both standard PCG and communication-hiding PCG methods. Experiments on the Vienna Scientific Cluster (VSC) illustrate very low runtime overheads compared to the non-resilient methods.
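For orientation, the following is a minimal sketch of textbook, non-resilient PCG in plain Python. It is not the resilient or communication-hiding variant presented in the paper, and it contains no ESR logic. The names `pcg`, `dot`, `matvec`, and `apply_prec` are hypothetical and chosen for illustration only. The preconditioner is passed as a function that is applied to a vector, so no explicit preconditioner matrix is needed, and each call to `dot` marks a point where a distributed implementation would perform a global reduction.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(u: Vector, v: Vector) -> float:
    # In a distributed solver, every dot product is a global reduction.
    return sum(a * b for a, b in zip(u, v))


def matvec(A: Matrix, x: Vector) -> Vector:
    # Dense matrix-vector product; a real solver would use a sparse format.
    return [dot(row, x) for row in A]


def pcg(
    A: Matrix,
    b: Vector,
    apply_prec: Callable[[Vector], Vector],
    tol: float = 1e-10,
    max_iter: int = 100,
) -> Tuple[Vector, int]:
    """Textbook PCG for a symmetric positive definite A (illustrative only)."""
    x = [0.0] * len(b)
    r = list(b)            # residual r = b - A x for the initial guess x = 0
    z = apply_prec(r)      # preconditioner is applied, never formed explicitly
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(max_iter):
        # Stop once the relative residual norm is below the tolerance.
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_prec(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

A Jacobi preconditioner, for example, can be supplied as a function that divides each residual entry by the corresponding diagonal entry of `A`. Any other preconditioner fits the same interface as long as it can be applied to a vector.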

Authors
  • Mayer, Viktoria
  • Gansterer, Wilfried
Shortfacts
Category
Paper in Conference Proceedings or in Workshop Proceedings (Poster)
Event Title
2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Divisions
Theory and Applications of Algorithms
Subjects
Parallel Data Processing
Event Location
San Francisco
Event Type
Workshop
Event Dates
May 27-31, 2024
Date
May 2024