Understanding error log event sequence for failure analysis
Due to the evolvement of large-scale parallel systems, they are mostly employed for mission critical applications. The anticipation and accommodation of failure occurrences is crucial to the design. A commonplace feature of these large-scale systems is failure, and they cannot be treated as exception. The system state is mostly captured through the logs. The need for proper understanding of these error logs for failure analysis is extremely important. This is because the logs contain the “health” information of the system. In this paper we design an approach that seeks to find similarities in patterns of these logs events that leads to failures. Our experiment shows that several root causes of soft lockup failures could be traced through the logs. We capture the behavior of failure inducing patterns and realized that the logs pattern of failure and non-failure patterns are dissimilar.
Keywords: Failure Sequences; Cluster; Error Logs; HPC; Similarity