Federated Principal Component Analysis

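In recent years, the arrival of the big-data era and breakthroughs in computing hardware have allowed machine learning methods to achieve excellent results in many fields. However, as data-collection technology advances, the resulting datasets are often too large to store on a single device; moreover, in many settings the data is inherently collected by, and stored on, separate devices (e.g., IoT sensors). Distributed machine learning techniques have emerged in response. In 2016, Google proposed the concept of federated learning, which goes beyond the existing distributed framework by also emphasizing data privacy and protection, because in many cases data holders do not want their data to be shared or leaked. In this setting, we propose federated principal component analysis (Federated PCA). PCA is a highly versatile data analysis tool that applies to many problems in dimensionality reduction, data preprocessing, and visualization. We consider a setting with many workers and one central master, in which each worker holds its own data and sends only its model, never its data, to the central master. In this setting, our Federated PCA preserves data privacy while still obtaining good model results.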

Keywords

Principal component analysis; federated learning; distributed learning; large-scale machine learning

Abstract (English)

In recent years, the advent of the big-data era and breakthroughs in computing hardware have allowed machine learning methods to be applied successfully in many fields. However, as data-collection technology advances, the resulting datasets are often too large to be stored on a single device. Moreover, in many situations the data itself is collected and stored on distributed devices (for example, IoT sensors). Distributed machine learning techniques have emerged in response. In 2016, Google[1] proposed the concept of federated learning, which, in addition to the existing distributed framework, emphasizes data privacy and protection, because in many cases data holders do not want their data to be shared or leaked. In this setting, we propose federated principal component analysis (Federated PCA). We chose PCA as our target model because it is a ubiquitous data analysis tool that is often used for linear dimension reduction. We consider a setting with many workers and one central master, in which each worker holds its own data and passes only its model, not its data, to the central master. In this setting, our Federated PCA preserves data privacy while maintaining good model performance.