Science World Journal

Understanding error log event sequence for failure analysis

Nentawe Gurumdimma, Desmond Bala Bisandu


Large-scale parallel systems have evolved to the point where they are widely employed for mission-critical applications. Failure is a commonplace feature of these systems and cannot be treated as an exception, so anticipating and accommodating failure occurrences is crucial to their design. System state is mostly captured through logs, which contain the “health” information of the system; a proper understanding of these error logs is therefore essential for failure analysis. In this paper we design an approach that seeks similarities in the patterns of log events that lead to failures. Our experiment shows that several root causes of soft lockup failures can be traced through the logs. We capture the behavior of failure-inducing patterns and find that the log patterns of failure and non-failure sequences are dissimilar.
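
To make the idea of comparing log event patterns concrete, the following is a minimal sketch, not the authors' method: the abstract does not specify the similarity measure, so cosine similarity over event-type counts is assumed here purely for illustration. The event labels and the function names `event_counts` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def event_counts(sequence):
    """Count how often each event type occurs in one log sequence."""
    return Counter(sequence)


def cosine_similarity(a, b):
    """Cosine similarity of two event-count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical event-type labels, invented for illustration only.
failure_a = ["mem_err", "io_timeout", "mem_err", "soft_lockup"]
failure_b = ["mem_err", "mem_err", "io_timeout", "soft_lockup"]
normal = ["login", "job_start", "job_end", "login"]

# Two sequences preceding a failure versus a failure and a non-failure sequence.
sim_failures = cosine_similarity(event_counts(failure_a), event_counts(failure_b))
sim_mixed = cosine_similarity(event_counts(failure_a), event_counts(normal))
```

In this toy data the two failure sequences contain the same events in the same proportions, so their similarity is close to 1, while the failure and non-failure sequences share no event types and score 0. Count vectors ignore event order; an order-aware measure would be needed if the sequence of events matters.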

Keywords: Failure Sequences; Cluster; Error Logs; HPC; Similarity