An In-Memory Query System for Scientific Data

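As computing power keeps growing and data volumes keep rising, the limited I/O bandwidth has not scaled proportionally. The widening performance gap between the two has made the traditional post-simulation data processing method a performance bottleneck. In-situ computing and query-driven data analysis are therefore important techniques for shortening data movement paths. We implemented an indexing system that combines bitmap indexing, spatial data reorganization, distributed shared memory, and location-aware parallel execution, and evaluated it experimentally on a NERSC supercomputer using two real scientific simulation datasets. The results show that, compared with traditional query systems that rely on parallel file systems, our system achieves a speedup of more than 10x.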

Keywords

Indexing; scientific data

Parallel Abstract

The growing gap between compute performance and I/O bandwidth, coupled with increasing data volumes, has resulted in a bottleneck for the traditional post-simulation data processing method. Hence in-situ computing and query-driven data analysis are important techniques for minimizing data movement. By taking advantage of the growing memory capacity on supercomputers, we developed an in-memory query system for scientific data analysis. Our approach is a combination of bitmap indexing, spatial data layout reorganization, distributed shared memory, and location-aware parallel execution. Our evaluations on a NERSC supercomputer using two real scientific datasets showed that we can aggregate the memory capacity of thousands of compute nodes to analyze a 750 GB simulation dataset without transferring data to remote nodes or storage systems. Compared with traditional solutions based on out-of-core parallel file systems, we achieve a speedup of more than 10x. Therefore, our system can support interactive queries and serve as a vehicle for steering simulations.

Parallel Keywords

In-situ computing; query-driven analysis; indexing; scientific data; distributed shared memory

