Researchers in both academia and industry routinely conduct literature reviews. These reviews can be time-consuming, especially when the topic of interest is hard to define, such as "big data", and there are no well-defined keywords to search for. The goal of this research is to help researchers conduct this type of literature review by automating the process with natural language processing (NLP) algorithms and machine learning predictive models. Suppose a researcher starts by manually labeling a set of relevant papers; by training a model on these labeled papers, NLP algorithms can then help retrieve related papers from the literature more quickly. We focus on searching the operations management (OM) literature for papers on "behavioral big data" (BBD), an ambiguous concept. This work extends that of Mach (2019), who used manually selected, predefined terms as features on 368 papers collected from three leading OM journals. Lee et al. (2019) improved on that approach by using term frequencies (TF-IDF) as features, chosen algorithmically rather than by domain knowledge. In contrast to Mach (2019), and similarly to Lee et al. (2019), we use TF-IDF to generate our features. In addition to TF-IDF, we use two state-of-the-art embedding techniques, the Universal Sentence Encoder and BERT, to embed our documents' text as feature vectors. We then combine these features with various machine learning and deep learning models, including logistic regression, random forest, a deep neural network, an LSTM (Long Short-Term Memory) network, and a CNN (Convolutional Neural Network). Because most of our data are unlabeled documents, we also extend our experiments to semi-supervised learning. In our experiments, we find that among the various combinations of features and models, the LSTM performs best under the supervised learning scenario in terms of precision and recall. Recall is the more important criterion in our setting, because our aim is to measure how many truly relevant documents the algorithmic solution actually captures. Our results also show that unlabeled data improve performance not only on the test set from the same journal but also on a second journal (MSOM). However, when applied to a third journal (POM), performance did not improve, a typical instance of "dataset shift". Our work shows that using NLP for literature reviews of ambiguous concepts can provide a useful automated solution when the training and test data are sufficiently similar. We suggest that further improvements might come from better automatic PDF parsing, studying other semi-supervised learning methods, training models on separate sections of a paper, and customizing the models' loss function to handle class imbalance.
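As a concrete illustration of the supervised baseline described above, the following minimal sketch builds TF-IDF features and fits a logistic regression with scikit-learn. The toy documents, the balanced class weighting, and the 50/50 split are illustrative assumptions, not the thesis's exact experimental setup.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split

    # Toy stand-ins for the papers' full texts; 1 = BBD-related, 0 = not.
    texts = [
        "consumer clickstream data in online retail operations",
        "a queueing model for hospital capacity planning",
        "social media activity data for demand forecasting",
        "inventory policies under stochastic lead times",
    ]
    labels = [1, 0, 1, 0]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, random_state=42, stratify=labels)

    # Map each document to a TF-IDF-weighted term-frequency vector.
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # class_weight="balanced" is one simple counter to label imbalance.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train_vec, y_train)

    y_pred = clf.predict(X_test_vec)
    print("precision:", precision_score(y_test, y_pred, zero_division=0))
    print("recall:   ", recall_score(y_test, y_pred, zero_division=0))

Recall, the criterion emphasized above, is reported alongside precision so that the share of truly relevant papers retrieved is visible directly.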
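The embedding-based features can be produced in a similar spirit. The sketch below loads the public Universal Sentence Encoder module from TF-Hub and maps each document to a 512-dimensional vector; the example texts are placeholders, and the analogous BERT variant is omitted for brevity.

    import tensorflow_hub as hub

    # Public Universal Sentence Encoder module on TF-Hub (v4).
    use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    docs = ["first paper abstract ...", "second paper abstract ..."]
    embeddings = use(docs)      # one 512-dimensional vector per document
    print(embeddings.shape)     # (2, 512)

The resulting vectors can be fed to the same downstream classifiers as the TF-IDF features.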
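For the sequence models, a minimal Keras sketch of an LSTM classifier of the kind reported to perform best is shown below. The vocabulary cap, sequence length, and layer widths are assumptions, and precision and recall are tracked directly as training metrics.

    import tensorflow as tf

    VOCAB_SIZE = 20000  # assumed vocabulary cap
    SEQ_LEN = 500       # assumed per-document token cap

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(SEQ_LEN,)),                # integer token ids
        tf.keras.layers.Embedding(VOCAB_SIZE, 128),      # learned word vectors
        tf.keras.layers.LSTM(64),                        # sequence summary
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(BBD-related)
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        # Track the two criteria emphasized above, recall in particular.
        metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
    )
    model.summary()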
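The semi-supervised setup can take many forms, and the abstract does not pin down the exact scheme used. As one hedged example only, not necessarily this thesis's method, scikit-learn's SelfTrainingClassifier pseudo-labels unlabeled documents whenever the base model is confident enough and retrains on the enlarged set.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    texts = [
        "behavioral data from online platforms",     # labeled: BBD
        "cutting stock optimization heuristics",     # labeled: not BBD
        "mobile app usage logs and dynamic pricing", # unlabeled
        "supply chain contract design",              # unlabeled
    ]
    labels = np.array([1, 0, -1, -1])  # scikit-learn convention: -1 = unlabeled

    X = TfidfVectorizer().fit_transform(texts)

    base = LogisticRegression(max_iter=1000)
    # Pseudo-label unlabeled documents whose predicted probability
    # exceeds the threshold, then refit on the enlarged training set.
    semi = SelfTrainingClassifier(base, threshold=0.75)
    semi.fit(X, labels)

    print(semi.predict(X[2:]))  # predictions for the formerly unlabeled docs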