
A Self-Learning, Deep-Learning-Based Information Retrieval System Designed for Unordered Datasets

Self-improvement Ad-hoc Ranking System for unordered dataset based on Deep Learning

Advisor: 林宗男

Abstract


In recent years, with the development of modern information systems, the amount of information has grown exponentially, making it increasingly important to let users find the data they need quickly and conveniently. Models for various kinds of information retrieval have been proposed and have demonstrated excellent search performance in general web-search settings. However, such research focuses on the ordered datasets common in daily life. Ordered data is data whose overall meaning depends on the order of its words; most documents on the Internet, as long as they are stored as text, are ordered data. Unordered data, by contrast, keeps its overall meaning even when its words are rearranged. In our dataset, the raw data are images, and we extract the text with optical character recognition (OCR), so the contextual relationships among the extracted words cannot be guaranteed; this makes it an unordered dataset.

In this thesis, we design a system specifically for information retrieval on unordered datasets. All of the system's training data can be generated from user activity logs, so it can collect data and automatically learn user behavior. To handle retrieval on unordered data, we split each query into individual terms to avoid being misled by contextual information. Each term, together with a document, is fed into a trained deep neural network (DNN) that scores the relevance between the document and that term. Summing the scores of all terms in the query gives the query-document score, and all documents are ranked by this score to produce the system output.

In our experiments, we use our own search system to produce training and test data, and use statistical models to generate simulated labels that mimic real user behavior. Measured by NDCG, our system outperforms a model designed for ordered datasets (Conv-KNRM) by 2-3 percentage points.

To train the neural model more effectively, we also adopt a new output design. Such networks are usually trained on pairs of one query and one document; we instead train on one query paired with two documents. Under the same training time, this scheme was shown experimentally to converge faster and perform better than the original design.

We also trained the system starting from a weaker model (tf-idf) and later introduced a stronger model (BM25) as a new simulation of user behavior. After training, the system's results fully match BM25's search results, showing that the system can refine its search results from user click logs.
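The per-term scoring and ranking scheme described above can be sketched as follows. The trained DNN itself is abstracted behind a `score_term` callback; the `tf_stub` scorer used here is a hypothetical stand-in (simple term frequency) so that the ranking logic is runnable, not the network from the thesis.

```python
from typing import Callable, Dict, List, Tuple

def rank_documents(
    query_terms: List[str],
    documents: Dict[str, List[str]],
    score_term: Callable[[str, List[str]], float],
) -> List[Tuple[str, float]]:
    """Score each (term, document) pair independently, sum the per-term
    scores into a query-document score, and sort documents by it."""
    results = []
    for doc_id, doc_terms in documents.items():
        # Terms are scored independently, so word order in the query
        # (and in the OCR-extracted document) cannot mislead the model.
        total = sum(score_term(t, doc_terms) for t in query_terms)
        results.append((doc_id, total))
    return sorted(results, key=lambda x: x[1], reverse=True)

# Hypothetical stub scorer standing in for the trained neural network.
def tf_stub(term: str, doc_terms: List[str]) -> float:
    return float(doc_terms.count(term))

docs = {
    "d1": ["deep", "learning", "retrieval", "retrieval"],
    "d2": ["optical", "character", "recognition"],
}
ranking = rank_documents(["retrieval", "learning"], docs, tf_stub)
print(ranking)  # d1 outscores d2
```

Because the score is a sum over independent term contributions, the same machinery works whether or not the underlying documents preserve word order.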

Parallel Abstract (English)


Due to the development of modern information systems, the amount of information grows exponentially, and making it easy for users to find the data they need is becoming ever more critical. Models for various types of information retrieval have been proposed recently and demonstrate excellent search capabilities in general web-search environments. However, such research focuses on the ordered datasets that are common in daily life. Ordered data is data whose overall meaning depends on the order of its words; most articles on the Internet, as long as they are stored in text form, are ordered data. Unordered data keeps its overall meaning even if the order of its words is changed. For the dataset we have, the original data are pictures, and we use optical character recognition (OCR) to extract the text, so the contextual relationship between the extracted words cannot be guaranteed: this is an unordered dataset.

In this thesis, we have designed a system specifically for retrieving information on unordered datasets. Moreover, our system's training data can be generated from user activity records, so we can automatically collect data and learn from user activities. To solve the search problem on unordered datasets, we separate each word in each query to avoid misleading contextual information. We then input a term and a document into the trained deep neural network (DNN) model, which scores the relevance of the document to that term. The sum of the scores of all terms belonging to the query is the relevance score of the query and the document, and all documents ranked by this score form the system output.

In the experiments, we use the search system we operate to produce training and test data, and use statistical models to generate simulated labels that mimic real user behavior. Measured by the NDCG metric, our score is 2-3% higher than that of a model designed for ordered datasets (Conv-KNRM).

To make neural-model training more effective, we also used a new output design. In the past, such a neural network was usually trained on one query and one document as a pair; we instead used one query and two documents as a training example. This training method proved to converge faster and perform better than the original design in experiments under the same training time.

Besides, we started training from a weaker model (a basic tf-idf model) and later added a more robust model (BM25) as new user behavior. After training, the system completely matches the search results of BM25, proving that this system can be optimized by user click logs.
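The pairwise setup (one query with two documents per training example) can be illustrated with a margin ranking loss. The network is abstracted away here, and the hinge formulation below is a standard assumption for pairwise learning-to-rank, not necessarily the exact objective used in the thesis.

```python
def pairwise_hinge_loss(score_pos: float, score_neg: float,
                        margin: float = 1.0) -> float:
    """Loss for one training triple (query, preferred doc, other doc).

    Zero once the preferred document outscores the other by at least
    `margin`; otherwise the loss grows with the violation, so training
    pushes the two scores apart rather than fitting each query-document
    pair in isolation."""
    return max(0.0, margin - (score_pos - score_neg))

# Preferred document already outscores the other by more than the margin.
print(pairwise_hinge_loss(2.5, 0.8))  # 0.0
# Scores in the wrong order: positive loss proportional to the violation.
print(pairwise_hinge_loss(0.3, 1.1))
```

Comparing two documents per example yields a gradient signal on every step in which the model ranks the pair incorrectly, which is one common explanation for why pairwise training converges faster than pointwise training in the same wall-clock time.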
