Real Time Failure Prediction by Mining Execution Logs in Large Scale Cloud Environments

Abstract

A large population of users is affected by a sudden slowdown or shutdown of an enterprise application. System administrators and analysts spend a considerable amount of time dealing with functional and performance bugs. These problems are particularly hard to detect and diagnose in most computer systems, since a huge amount of system-generated supportability data (counters, logs, etc.) needs to be analyzed. Hence, timely identification of significant changes in application behavior is very important to prevent negative impact on the service. In that light, many research efforts have proposed automated frameworks for anomaly detection and system failure prediction in distributed environments. While many frameworks for detecting anomalous behavior exist, very few models dedicated to system failure prediction have been proposed. Moreover, most of those works predict failure events without actually pinpointing the problem instances or bugs responsible for the failure report. Further, the design of most of these frameworks depends heavily on the logging instrumentation of the particular cloud platform, which makes porting them challenging. In our work, we propose a system failure prediction framework for large-scale distributed systems that learns bug-wise signatures. The key novelty of our framework is its ability to raise failure alerts a priori while pinpointing the problem scenario. The design of our framework is flexible in that it leverages abstractions defined on the event and performance logs generated by wide-scale cloud platforms. The high prediction FPR attests to the power of this model to alert the corresponding support engineers so that, once notified, they can troubleshoot the system and use their domain knowledge to triage the failure.

Publication
Under Review, 2020