We consider the scenario in which engineers analyze system logs sent by users in order to troubleshoot failures. Users typically run into trouble when the system executes a defective program path, which produces a sequence of log templates that we call a "trouble event." The engineers' goal is to explore the nature of unknown trouble events and to determine whether a system log contains any known trouble event. The main challenges are that logs from different tasks are interleaved in the system log and that large-scale system services generate a wide variety of logs. These factors make troubleshooting extremely time-consuming, because engineers must check how each line of the system log relates to the others. In this thesis, we first propose a new troubleshooting framework, template-pattern-event, which reduces the complexity of the system log by aggregating logs that represent the same system behavior into a single pattern. Second, we propose the Template Clustering algorithm, which learns patterns from interleaved system log data. Third, we introduce the Event Trace algorithm, which identifies the positions of trouble events in the system log. With the proposed framework, the troubleshooting process becomes simpler and more efficient.
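To make the notion of a trouble event in an interleaved log concrete, the following minimal sketch locates a known sequence of templates inside a log whose lines come from several tasks. It is an illustration only and not the Event Trace algorithm of this thesis: it assumes each log line has already been mapped to a template identifier, and it uses a plain greedy ordered-subsequence match. The function name `find_event_positions` and the template identifiers are hypothetical.

```python
from typing import List, Optional, Sequence


def find_event_positions(log_templates: Sequence[str],
                         event: Sequence[str]) -> Optional[List[int]]:
    """Return the line indices at which `event` occurs as an ordered subsequence.

    Lines belonging to other tasks are skipped, which is what makes the
    match tolerant of interleaving. Returns None if the event is incomplete.
    This is a hypothetical helper, not the thesis's Event Trace algorithm.
    """
    positions: List[int] = []
    next_needed = 0  # index of the next event template still to be matched
    for index, template in enumerate(log_templates):
        if next_needed < len(event) and template == event[next_needed]:
            positions.append(index)
            next_needed += 1
    return positions if next_needed == len(event) else None


# Hypothetical template IDs: "A", "B", "C" form a known trouble event,
# while "T1", "T2", "T7" are templates emitted by unrelated tasks.
log = ["T1", "A", "T7", "B", "T2", "C", "T1"]
known_event = ["A", "B", "C"]
positions = find_event_positions(log, known_event)
print(positions)
```

A greedy match of this kind reports only the leftmost occurrence and cannot tell apart two tasks that emit the same templates, which is one reason grouping templates into patterns before tracing events is attractive.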