Combining Latent Semantic Indexing and Information Granulation for Data Mining

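Data mining emerged mainly to address the problems raised by large volumes of data; its main goal is to extract meaningful patterns from them. Many data sets, however, are multidimensional, sparse, and imbalanced. These three characteristics significantly affect current data mining techniques, so such data must be preprocessed before mining. Sparse, high-dimensional data can be preprocessed with Latent Semantic Indexing (LSI), which is commonly used to reduce term-document matrices, because the Singular Value Decomposition (SVD) that LSI applies handles high-dimensional and sparse data effectively. For imbalanced data, Information Granulation (IG) converts majority-class data elements with similar properties into information granules, which raises the proportion of minority-class data and thus effectively addresses the imbalance problem. LSI and IG can therefore serve as data preprocessing steps in the data mining process. This study combines LSI and IG for data mining, aiming to reduce attribute dimensions and the number of records and to handle imbalanced data effectively. The results show that applying LSI reduces the attribute dimensions; applying LSI+IG reduces both the attribute dimensions and the number of records; and applying IG+LSI reduces the number of records and the dimensions of the information granules' sub-attributes. All three methods reduce the time required for data analysis. For imbalanced data, the study finds that reducing attribute dimensions with LSI alone is unsuitable because it lowers the classification accuracy of the minority class. Applying LSI+IG or IG+LSI to imbalanced data effectively improves minority-class classification accuracy, provided that information granulation is performed separately on the majority-class and minority-class data. The study also finds that IG+LSI classifies better than LSI+IG: when LSI is applied to an imbalanced data set first, information is lost and classification accuracy drops, especially for the minority class. Besides improving minority-class accuracy, IG+LSI also improves majority-class accuracy, provided that the majority-class data are reduced by a large enough margin.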
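The SVD-based reduction step described in the abstract can be illustrated with a minimal sketch. It assumes a records-by-attributes matrix and a plain rank-k truncation with NumPy; the function name `lsi_reduce`, the toy matrix, and the choice of k are hypothetical and are not taken from the thesis.

```python
import numpy as np

def lsi_reduce(X, k):
    """Project the rows of X onto the top-k singular directions (rank-k LSI)."""
    X = np.asarray(X, dtype=float)
    # Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the k largest singular values; each record becomes a k-dimensional vector.
    X_reduced = U[:, :k] * s[:k]
    return X_reduced, Vt[:k]

# Hypothetical sparse records-by-attributes matrix (toy data, not from the thesis).
X = np.array([
    [1, 0, 0, 2, 0, 0],
    [0, 3, 0, 0, 0, 1],
    [2, 0, 0, 4, 0, 0],
    [0, 0, 5, 0, 0, 0],
], dtype=float)

X_reduced, components = lsi_reduce(X, k=2)
print(X.shape, "->", X_reduced.shape)
```

Here six sparse attributes are replaced by two dense ones. Choosing k trades compression against lost information, which is consistent with the abstract's warning that applying LSI alone can hurt minority-class accuracy.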

Keywords

Latent Semantic Indexing; Information Granulation; Data Mining

Parallel Abstract

With the rapid growth of information, data mining has developed to discover useful patterns in huge amounts of data. Enterprise data are usually multidimensional, sparse, and imbalanced, and these features significantly affect data mining. Data preprocessing, which can reduce data size and remove noise and outliers, has therefore become an essential task in data mining. By using Singular Value Decomposition, Latent Semantic Indexing (LSI) can effectively process multidimensional and sparse data, so such data can be preprocessed with LSI to reduce data dimensions and records. For imbalanced data, Information Granulation (IG) transforms majority-class data that share similar properties into information granules, raising the ratio of the minority class and thereby addressing the imbalance problem. LSI and IG can therefore be taken as the first stage of data preprocessing in the data mining process. This thesis combines LSI with IG for data mining in order to reduce the size and dimensions of data and to resolve the problems caused by imbalanced data. The results show that applying LSI effectively reduces the dimensions of data, applying LSI+IG effectively reduces both the dimensions and the size of data, and applying IG+LSI effectively reduces the sub-attributes (generated in the IG process) and the size of data. Moreover, all three data reduction methods reduce the computational time of analysis. For imbalanced data, the computational results indicate that LSI alone is not suitable for preprocessing, whereas preprocessing with LSI+IG or IG+LSI improves the classification accuracy of the minority class. This thesis concludes that classification results improve most when IG+LSI is adopted.
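The granulation idea can be sketched as follows. The abstract does not say how granules are formed or represented, so this sketch makes two labeled assumptions: similar records are grouped with a plain k-means (a stand-in, not the thesis's method), and each granule is described by a minimum and a maximum per attribute. The names `kmeans` and `granulate` and the toy data are hypothetical.

```python
from math import dist

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic, evenly spaced initial centers."""
    k = min(k, len(points))
    step = max(len(points) // k, 1)
    centers = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def granulate(points, k):
    """Replace similar records by granules of (min, max) sub-attributes."""
    labels = kmeans(points, k)
    granules = []
    for c in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == c]
        granule = []
        for col in zip(*members):
            granule.extend([min(col), max(col)])  # two sub-attributes per attribute
        granules.append(granule)
    return granules

# Hypothetical imbalanced toy data: 8 majority records, 2 minority records.
majority = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2)]
minority = [(3.0, 0.5), (3.1, 0.4)]

majority_granules = granulate(majority, k=2)
ratio_before = len(minority) / (len(minority) + len(majority))
ratio_after = len(minority) / (len(minority) + len(majority_granules))
print(len(majority_granules), ratio_before, ratio_after)
```

Under these assumptions, granulating only the majority class shrinks eight records to two granules and raises the minority share from 0.2 to 0.5. Each granule carries four sub-attributes instead of two attributes, which suggests why a dimension reduction step after granulation (IG+LSI) can be useful.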

Parallel Keywords

Information Granulation; Latent Semantic Indexing; Data Mining; Imbalance Data; Data Preprocessing

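Cited By

巫冠霆 (2007). Value-Based Data Mining [Master's thesis, National Taipei University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0006-2208200719354600

李家隆 (2007). Applying Information Granulation Techniques to Equipment Failure Analysis [Master's thesis, National Taipei University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0006-0907200714542400

孫昆正 (2012). A Study on Developing Imbalanced Semantic Classification [Master's thesis, Chaoyang University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-2712201314041510

薛仱芸 (2014). A Study on Improving the Classification Performance for Manipulated Online Reviews [Master's thesis, Chaoyang University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-0905201416542666

許智翔 (2016). Detecting Network Intrusions Based on Local Kernel Principal Component Analysis [Master's thesis, Chaoyang University of Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0078-1108201714034011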
