A Study on Applying Density Functions to Compare the Consistency of Data Quality

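Data quality has long attracted attention: a web search engine returns more than one billion related pages, which shows that data quality has become an important topic in practice. In many practical applications, one encounters two independent databases sampled from the same survey population. Without a shared linking variable, the records cannot be joined the way tables in a relational database are joined on a key, so consistency between variables cannot be checked by one-to-one record matching. This study therefore proposes comparing the shape of the probability density function of the observed data. Depending on the attributes of the variables, we look for a suitable probability distribution function in the one-dimensional and the multidimensional case, use the estimated distribution functions as the basis for comparing the two independent datasets, and compute the overlap coefficient between them to judge their degree of consistency and agreement, so that the variables can be used with more confidence. Based on the worked example in this study, an applied analysis of industrial innovation survey data and Industry, Commerce and Service Census data, we recommend running the consistency comparison on qualitative variables that do not change over time, which gives better comparison results than quantitative variables.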

Keywords

Data quality; probability density function; overlap coefficient; linking variable

Parallel Abstract

The data quality problem has received sustained attention. An internet search engine returns more than one billion related web pages, so data quality has clearly become an important issue in real life. In many practical applications, one works with two independent databases sampled from the same survey population. Without a linking variable, the data cannot be merged as in a relational database, and data consistency cannot be checked by one-to-one matching. In this study, we approach the problem from the viewpoint of the probability density function. According to the attributes of the variables, we find appropriate one-dimensional and multidimensional probability distribution functions. We then use the estimated probability distribution functions to calculate the overlap coefficient between the corresponding variables of the two independent datasets. Finally, we judge the extent of data consistency, which makes the variables more reliable to use. From the practical analysis of the industrial innovation survey and the Industry, Commerce and Service Census data in the example of this study, we suggest using discrete variables that do not vary over time for the comparison, which gives better results than continuous variables.

Parallel Keywords

Data quality; Probability density function; Overlap coefficient; Linking variable
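
The abstract describes comparing two unlinked datasets through an overlap coefficient computed from estimated distributions, without giving the formulas. The following is a minimal sketch of that idea, not the authors' procedure. It assumes the common definition of the overlap coefficient as the shared area under two distributions: for a qualitative variable, the sum over categories of the smaller of the two relative frequencies; for a quantitative variable, the overlap of two fitted normal densities (the normal fit is an assumption made here for simplicity). The function names and the sample values are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def categorical_overlap(sample_a, sample_b):
    """Overlap coefficient for a qualitative variable.

    Sums, over all observed categories, the smaller of the two
    relative frequencies. 1.0 means identical distributions,
    0.0 means no shared category.
    """
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    categories = set(freq_a) | set(freq_b)
    # Counter returns 0 for a category missing from one sample.
    return sum(min(freq_a[c] / n_a, freq_b[c] / n_b) for c in categories)


def normal_overlap(sample_a, sample_b):
    """Overlap coefficient for a quantitative variable.

    Assumption: each sample is summarized by a fitted normal
    distribution; the overlap is the shared area under both densities.
    """
    fit_a = NormalDist.from_samples(sample_a)
    fit_b = NormalDist.from_samples(sample_b)
    return fit_a.overlap(fit_b)


# Hypothetical values standing in for the same variables
# recorded in two independent sources.
survey_region = ["north", "north", "south", "east", "south", "north"]
census_region = ["north", "south", "south", "east", "east", "north"]
survey_revenue = [12.1, 15.3, 9.8, 14.2, 11.0, 13.5]
census_revenue = [11.7, 16.0, 10.4, 13.9, 12.2, 14.8]

region_ovl = categorical_overlap(survey_region, census_region)
revenue_ovl = normal_overlap(survey_revenue, census_revenue)
print(f"region overlap:  {region_ovl:.3f}")
print(f"revenue overlap: {revenue_ovl:.3f}")
```

Neither function needs a record-level key, which is the point of the approach: only the two marginal distributions are compared. A different density estimate (for example a kernel estimate or another parametric family) could replace the normal fit without changing the structure.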

