
Estimation on Server Population with MLE-based CMR (Capture-Mark-Recapture)

Advisor: 黃寶儀

Abstract


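The quality and smoothness of a video service depend on the scale and provisioning of its content delivery network (CDN). As demand for video services has grown in recent years, the number of servers in CDNs has increased markedly to keep up with clients. Given how widely Twitch is used, we consider long-term, continuous study of its CDN architecture an important topic. Motivated by the similarity between the addition and retirement of CDN servers and the births and deaths in an animal population, we apply capture-mark-recapture, a method for estimating wildlife population size, to estimate the total number of servers in the CDN from a small sample of network traffic at each sampling occasion. In our previously published AINTEC paper (2021), the Cormack-Jolly-Seber (CJS) model estimated the total number of servers relatively accurately. However, the probabilistic assumptions of the traditional CJS model remain restrictive. Because it needs many sampling intervals before its estimates converge, the model is limited to offline estimation. In addition, it assumes that every individual has the same capture and survival probabilities, which does not match how servers in Twitch's CDN serve traffic. We therefore introduce a CJS capture-mark-recapture model that accounts for heterogeneity and is based on maximum likelihood estimation (MLE) to address both issues. This model can assign different parameters to each server, and it estimates all parameters of the CJS probability model jointly through MLE. Giving every server its own parameters, however, makes the probability model overly complex, so we cluster servers by their service patterns and let servers in the same cluster share parameters, which reduces the number of model parameters. Testing on a data set collected in May 2021, we find that the MLE-based CJS model does perform better in online estimation, whereas heterogeneity and server clustering did not effectively improve estimation accuracy in our experiments. We analyze the reasons in detail by examining the estimates for each cluster.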

Parallel Abstract


The quality and continuity of video services such as Twitch depend on the scale and health of their content distribution networks (CDNs). Due to the growing demand for video services, the number of servers in these CDNs has increased rapidly to feed videos to clients. Given the widespread use of Twitch, we consider continuous surveying of its CDN an important subject of study. Inspired by Capture-Mark-Recapture (CMR), a methodology widely used to estimate animal populations, we developed a system that continuously observes the size of Twitch's CDN (i.e., the number of servers) with lightweight probing. According to our previous research published at AINTEC, the Cormack-Jolly-Seber (CJS) model can estimate the CDN size at each sample time with relatively low error. Nevertheless, the assumptions of the traditional CJS model are still restrictive. Because it takes many sampling intervals to converge, the model can only estimate the server population offline. Besides, it assumes that all servers share the same capture and survival rates, which does not match the server behavior observed in Twitch's CDN. Therefore, we introduce a maximum-likelihood-estimation-based (MLE-based) CJS model with heterogeneity to address these two issues. It not only allows different parameters for each server but also co-estimates all parameters in the CJS probability model. With per-server parameters, the resulting MLE model becomes too complex, and thus we try server clustering to reduce the parameter space. Using a data set collected in May 2021, we find that the MLE-based CJS model indeed performs better in online estimation. Heterogeneity and server clustering, on the other hand, do not improve the estimation accuracy. We examine the estimation results in each group to identify the detailed reasons for these worse results.
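
As a rough illustration of the capture-mark-recapture idea described above, the sketch below uses Chapman's bias-corrected form of the two-sample Lincoln-Petersen estimator. This is a deliberately simpler, swapped-in technique and not the CJS or MLE-based model of this work: it assumes a closed population (no servers added or retired between samples) and equal capture probability for every server, which are the kinds of assumptions that open-population models such as CJS and the heterogeneity extension aim to relax. The function name and the server identifiers are hypothetical.

```python
def chapman_estimate(first_sample, second_sample):
    """Estimate population size from two capture occasions.

    Illustrative only: Chapman's bias-corrected Lincoln-Petersen estimator,
        N_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1,
    where n1 and n2 are the numbers of distinct servers seen in each
    sample and m is the number seen in both.
    """
    marked = set(first_sample)      # servers "marked" on the first occasion
    caught = set(second_sample)     # servers observed on the second occasion
    n1 = len(marked)
    n2 = len(caught)
    m = len(marked & caught)        # recaptures: seen on both occasions
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical data: 40 servers seen in each probe, 20 of them seen in both.
probe_1 = [f"server-{i}" for i in range(0, 40)]
probe_2 = [f"server-{i}" for i in range(20, 60)]
estimate = chapman_estimate(probe_1, probe_2)
```

The fewer servers that reappear in the second probe, the larger the estimated population, which is why a small sample per occasion can still bound the total CDN size.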

