Performance Analysis and Estimation of the Spark Streaming Stream-Data Processing Architecture

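Spark Streaming processes data at a coarse granularity, handling one micro-batch at a time, so some latency is unavoidable: data must accumulate to a certain amount before it is processed in one pass, and this delay is a consequence of the framework's design. How to tune Spark Streaming's operating parameters is therefore worth careful study. Put simply, both running time and memory usage must be optimized, but rerunning the program for every tuning attempt is very time-consuming. This thesis therefore analyzes and estimates the DStreamGraph inside the Spark Streaming framework and, with respect to increasing parallelism, reducing serialization overhead, and choosing a reasonable batch processing time, evaluates which parameter settings suit a given processing job without repeated trial runs. The study proposes a formula-based estimation model for transformation parameters that effectively analyzes and estimates the batch interval. With this model, developers can quickly and accurately find a suitable batch interval, sparing later tuning work the tedium of repeatedly restarting and testing the program. The result can also serve as a reference for choosing a Spark Streaming batch interval that keeps latency within the order of seconds.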

Keywords

Real-time processing; Spark Streaming; batch duration; micro-batching

Parallel Abstract

Because Spark Streaming handles data in a coarse-grained model (processing one micro-batch at a time), delays are inevitable. Data is processed only after a certain amount has been collected, which adds to the processing delay; such delays stem from the design of the framework. It is therefore worth considering how to tune the operational parameters of Spark Streaming. Put simply, we must optimize both processing time and memory use, but running the program every time a setting is tuned is very time-consuming. This study therefore analyzes and estimates the DStreamGraph within the Spark Streaming framework. With a view to increasing parallelism, reducing the workload of serialization and deserialization, and securing a reasonable batch-processing time, it determines an appropriate parameter configuration for a given operation so that repeated tuning runs are unnecessary. The study presents a formula-based estimation model for transformation parameters that is effective in analyzing and estimating the batch interval. With the model, developers can accurately and quickly find a suitable batch interval, avoiding repeated restarts and tests of the program during subsequent tuning. It also serves as guidance for setting the batch interval in Spark Streaming so as to limit the delay to within the one-second range.
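The idea of estimating a batch interval instead of finding it by repeated runs can be illustrated with a small sketch. It does not reproduce the estimation model of this study, which the abstract does not state. It assumes a simple linear cost model (a fixed per-batch overhead plus a constant cost per record) and applies the usual stability condition that a batch must finish processing within one batch interval. The function name and all three parameters are hypothetical.

```python
def min_stable_batch_interval(arrival_rate, per_record_cost, fixed_overhead):
    """Return the smallest batch interval T (seconds) that stays stable.

    Assumed cost model (an illustration, not the thesis's formula):
      records per batch   n = arrival_rate * T
      processing time     p = fixed_overhead + per_record_cost * n
    Stability requires p <= T, which rearranges to
      T >= fixed_overhead / (1 - arrival_rate * per_record_cost).
    """
    utilization = arrival_rate * per_record_cost
    if utilization >= 1.0:
        # Records arrive faster than they can be processed, so no
        # batch interval can keep up under this model.
        raise ValueError("no stable batch interval: utilization >= 1")
    return fixed_overhead / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical workload: 1000 records/s, 0.5 ms per record,
    # 0.2 s of fixed scheduling overhead per batch.
    interval = min_stable_batch_interval(1000, 0.0005, 0.2)
    print(f"estimated minimum batch interval: {interval:.3f} s")
```

Under this model the per-record cost and the fixed overhead would have to be measured once, after which candidate intervals can be compared on paper rather than by restarting the streaming job.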


