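As the computing power of modern machines keeps growing while the limited I/O bandwidth cannot scale in proportion to the rising data volumes, the widening performance gap between the two has brought the traditional post-simulation data processing method to a performance bottleneck. In-situ computing and query-driven data analysis are therefore important techniques for shortening the data movement path. We implemented an indexing system that combines bitmap indexing, spatial data reorganization, distributed shared memory, and location-aware parallel execution, and we evaluated it on a NERSC supercomputer, as a real environment, using two real scientific simulation datasets. The results show that, compared with traditional query systems that rely on parallel storage file systems, our system achieves a performance improvement of more than 10x.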
The growing gap between compute performance and I/O bandwidth, coupled with increasing data volumes, has created a bottleneck in the traditional post-simulation data processing method. Hence in-situ computing and query-driven data analysis are important techniques for minimizing data movement. By taking advantage of the growing memory capacity of supercomputers, we developed an in-memory query system for scientific data analysis. Our approach combines bitmap indexing, spatial data layout reorganization, distributed shared memory, and location-aware parallel execution. Our evaluations on a NERSC supercomputer using two real scientific datasets showed that we can aggregate the memory capacity of thousands of compute nodes to analyze a 750 GB simulation dataset without transferring data to remote nodes or storage systems. Compared with traditional solutions based on out-of-core parallel file systems, we achieve a speedup of more than 10x. Therefore, our system can support interactive queries and serve as a vehicle for steering simulations.
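To make the bitmap indexing component concrete, the following is a minimal single-node sketch of a binned bitmap index answering a range query. It is an illustration under stated assumptions, not the system described above: the function names (`build_bitmap_index`, `query_ge`, `bitmap_to_rows`), the bin edges, and the sample values are all hypothetical, and the sketch omits compression, spatial reorganization, and distribution across nodes.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin (hypothetical helper, not the paper's API).

    Each bitmap is a Python int; bit i is set when values[i] falls in that bin.
    Bin b covers [bin_edges[b-1], bin_edges[b]); bin 0 is everything below
    bin_edges[0] and the last bin is everything at or above bin_edges[-1].
    """
    bitmaps = [0] * (len(bin_edges) + 1)
    for i, v in enumerate(values):
        b = bisect_right(bin_edges, v)
        bitmaps[b] |= 1 << i
    return bitmaps


def query_ge(bitmaps, bin_edges, threshold):
    """Return a bitmap of rows with value >= threshold.

    Assumption: threshold is exactly one of bin_edges, so the query can be
    answered from the index alone by OR-ing whole bins, without touching
    the raw data.
    """
    first_bin = bin_edges.index(threshold) + 1
    result = 0
    for bm in bitmaps[first_bin:]:
        result |= bm
    return result


def bitmap_to_rows(bitmap):
    """Decode a bitmap into the sorted list of row positions whose bit is set."""
    rows = []
    i = 0
    while bitmap:
        if bitmap & 1:
            rows.append(i)
        bitmap >>= 1
        i += 1
    return rows


# Illustrative data only: five cell values and three bin edges.
values = [0.5, 2.5, 1.2, 3.7, 2.0]
bin_edges = [1.0, 2.0, 3.0]
index = build_bitmap_index(values, bin_edges)
hits = bitmap_to_rows(query_ge(index, bin_edges, 2.0))
```

Because the query is resolved with bitwise OR over precomputed bitmaps, the raw values are never scanned. A threshold that falls inside a bin would additionally require checking the raw values of that one boundary bin, which this sketch leaves out.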