  • Thesis

Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives

Advisor: 王科植

Abstract


Distribution-based data representation is an emerging and promising approach to handling large-scale scientific datasets. Such approaches typically transform a scientific dataset into many distributions, each computed from a small number of samples. Most existing parallel algorithms focus on efficiently fitting a single distribution to many input samples. That strategy may not suit large-scale scientific data processing, because it does not utilize computing resources well in this setting. Histograms and Gaussian mixture models (GMMs) are the most popular distribution representations for scientific datasets. We therefore propose multi-set histogram and GMM modeling algorithms for large-scale scientific data processing. Our algorithms are built on data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases in scientific data processing.
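To make the multi-set idea concrete, the following is a minimal sketch of how many small histograms might be built in a single pass using data-parallel-style steps (map, sort, reduce-by-key, scatter). It is not the thesis's algorithm: the function name `multi_set_histograms`, its parameters, the fixed bin range shared by all sets, and the clamping rule are assumptions made for illustration, and plain Python lists stand in for a real primitives library.

```python
from itertools import groupby


def multi_set_histograms(samples, set_ids, n_sets, n_bins, lo, hi):
    """Build one histogram per set via map -> sort -> reduce-by-key -> scatter.

    Hypothetical helper for illustration; assumes every set shares the
    same value range [lo, hi] and the same number of bins.
    """
    width = (hi - lo) / n_bins

    # Map: turn each (set id, value) pair into one flat key,
    # key = set_id * n_bins + bin_index.
    def to_key(set_id, value):
        b = int((value - lo) / width)
        b = min(max(b, 0), n_bins - 1)  # clamp so value == hi lands in the last bin
        return set_id * n_bins + b

    keys = [to_key(s, v) for s, v in zip(set_ids, samples)]

    # Sort: bring equal keys next to each other.
    keys.sort()

    # Reduce-by-key: the count for a key is the length of its run.
    counts = [(k, sum(1 for _ in run)) for k, run in groupby(keys)]

    # Scatter: write the counts into a dense n_sets x n_bins table.
    hist = [[0] * n_bins for _ in range(n_sets)]
    for k, c in counts:
        hist[k // n_bins][k % n_bins] = c
    return hist


# Two sets of three samples each, two bins over [0, 1].
samples = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8]
set_ids = [0, 0, 0, 1, 1, 1]
hist = multi_set_histograms(samples, set_ids, n_sets=2, n_bins=2, lo=0.0, hi=1.0)
print(hist)
```

Because every sample from every set goes through the same four steps, the amount of parallel work scales with the total number of samples rather than with the size of any one set, which is the motivation the abstract gives for modeling many distributions together.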

