Parallel title, etc.: Automated Fault Localization in Large-Scale Computing Systems
General note: This dissertation presents two scalable, automated approaches that simplify fault localization in large-scale computing systems by treating localization as anomaly detection in system behavior. Both approaches capture system behavior at all times by obtaining function call traces, and identify anomalous behavior through automatic analysis of the collected traces. To find anomalies scalably and automatically, they assume that processes in typical distributed software systems behave similarly, and they treat violations of the assumed similarity as anomalies. The first approach, outlier-detection-based localization, localizes faults by assuming that the target system consists of distributed processes with similar behaviors. Specifically, once a failure occurs, it identifies anomalous processes and functions by comparing the failure traces with one another and finding outliers among them. Traces are compared by their function-execution times. By finding outliers based on these times, this approach can localize faults such as performance bugs, deadlocks, and livelocks. The second approach, model-based localization, localizes faults by assuming that all processes behave similarly to how they behaved in the past. Using traces collected during normal operation, it derives an execution model that estimates the call probability of each function. Once a failure occurs, it finds anomalous processes and function calls by comparing the failure traces against the derived model. Two cases are considered anomalous: 1) a high-probability function is not called, and 2) a low-probability function is called. Because it pinpoints such functions, this approach is especially effective at localizing program logic bugs. Experimental studies on real-world, large-scale environments indicate the effectiveness of the proposed techniques. The outlier-detection-based localization found, almost fully automatically, the causes of several nondeterministic failures in a distributed cluster middleware running on a 129-node production cluster. The model-based localization also substantially simplified the localization of a failure that occurred in a three-site, 78-node Grid environment.
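The two approaches summarized in the note above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the median/MAD outlier score, the per-trace (unconditional) call probability, and the thresholds 3.5, 0.9, and 0.1 are all assumptions chosen for illustration.

```python
from statistics import median


def find_outlier_processes(times_by_process, threshold=3.5):
    """Flag processes whose execution time for one function is a robust outlier.

    Assumption: a modified z-score based on the median absolute deviation
    (MAD) stands in for whatever outlier test the dissertation uses.
    """
    values = list(times_by_process.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All typical processes agree exactly; any deviation is an outlier.
        return sorted(p for p, v in times_by_process.items() if v != med)
    return sorted(
        p for p, v in times_by_process.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


def build_call_model(normal_traces):
    """Estimate each function's call probability from normal-operation traces.

    Assumption: probability = fraction of traces in which the function
    appears at least once.
    """
    counts = {}
    for trace in normal_traces:
        for fn in set(trace):
            counts[fn] = counts.get(fn, 0) + 1
    total = len(normal_traces)
    return {fn: c / total for fn, c in counts.items()}


def find_call_anomalies(model, failure_trace, high=0.9, low=0.1):
    """Return (high-probability functions not called, low-probability functions called)."""
    called = set(failure_trace)
    missing = sorted(fn for fn, p in model.items() if p >= high and fn not in called)
    unexpected = sorted(fn for fn in called if model.get(fn, 0.0) <= low)
    return missing, unexpected


# Hypothetical data: execution time in seconds of one function on five processes.
times = {"p0": 1.0, "p1": 1.1, "p2": 0.9, "p3": 1.0, "p4": 9.0}
outliers = find_outlier_processes(times)

# Hypothetical traces: nine ordinary runs plus one that needed a retry.
normal = [["init", "send", "recv"]] * 9 + [["init", "send", "recv", "retry"]]
model = build_call_model(normal)
missing, unexpected = find_call_anomalies(model, ["init", "send", "retry", "abort"])
```

With this made-up data, the first function singles out `p4`, whose execution time is far from the others, and the second reports `recv` as a high-probability function that was not called and `abort` and `retry` as low-probability functions that were called.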
identifier:oai:t2r2.star.titech.ac.jp:50091984
Partner institution / database: National Institute of Informatics: Institutional Repositories DataBase (IRDB) (institutional repository)