A Study on Applying Image Recognition and Text Mining to Document Writing Assistance

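With information technology advancing rapidly, demand for information systems and software keeps growing, and sound system analysis and the software requirements specification (SRS) written during it are especially important to system development. However, system analysts differ in experience, so the specifications they write vary in quality. Because software specifications are also unstructured, browsing previously completed documents one by one for reference invites omissions caused by human judgment, and searching them costs considerable time and labor. This study takes the software requirements specification documents of an information company in Taichung and applies optical character recognition (OCR) and text mining techniques to them. OCR is first used to recognize the text; after conversion, the document content is stored in text content databases at three levels: title, chapter, and paragraph content. The content at each of the three levels is segmented with Jieba, after which Latent Dirichlet Allocation (LDA) and Term Frequency-Inverse Document Frequency (TF-IDF) are used to extract keywords and build a hierarchical query index, and the content is built into a document writing auxiliary lexicon. By querying the title, chapter, and content indexes, a user passes through three stages of filtering and one stage of ranking, after which suitable content is recommended as a reference for writing a specific chapter of a document. Furthermore, when new documents are to be added to the index and the lexicon, they only need to go through the same process, with timely manual assistance along the way, so the lexicon can be expanded at minimal cost. The results show that this study built an OCR misrecognition replacement and correction rule base, a custom stop word list, custom word segmentation rules, and a document writing auxiliary lexicon, and developed a complete document processing and writing assistance process. The study therefore makes five contributions: a feasible process for document processing and writing assistance, a document recommendation mechanism, a document writing auxiliary lexicon, a mechanism for handling OCR text misrecognition, and a custom stop word list and word segmentation rule base. In the future, the research process of this study could also be applied widely in other fields.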
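
The preprocessing and keyword extraction steps described above can be illustrated with a minimal sketch. It is not the study's implementation: the rule table `OCR_FIXES`, the `STOP_WORDS` set, the function names, and the sample sentences are all hypothetical, whitespace splitting stands in for Jieba segmentation, and only the TF-IDF part is shown (LDA is omitted).

```python
import math
from collections import Counter

# Hypothetical OCR correction rules: misrecognized string -> intended string.
OCR_FIXES = {"systern": "system", "requirernent": "requirement"}
# Hypothetical custom stop word list.
STOP_WORDS = {"of", "the", "and"}


def correct_ocr(text, fixes=OCR_FIXES):
    """Apply each replacement rule to the raw OCR output."""
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text


def tfidf_keywords(docs, top_k=3):
    """Return the top_k TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        # Term frequency, ignoring stop words.
        tf = Counter(t for t in doc if t not in STOP_WORDS)
        total = sum(tf.values())
        scores = {
            t: (count / total) * math.log(n_docs / df[t])
            for t, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Whitespace splitting is a stand-in for Jieba word segmentation.
docs = [
    correct_ocr("systern requirement specification of the system").split(),
    ["database", "design", "specification"],
    ["test", "plan", "and", "specification"],
]
keywords = tfidf_keywords(docs, top_k=2)
print(keywords[0])  # ['system', 'requirement']
```

A term such as "specification", which appears in every sample document, scores zero and drops out of the keyword list, which is the behavior that makes TF-IDF useful for building a discriminating index.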

Keywords

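document writing auxiliary lexicon; optical character recognition; text mining; Latent Dirichlet Allocation; Term Frequency-Inverse Document Frequency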

Parallel Abstract

With the rapid development of information technology, demand for information systems and software has increased. Good system analysis and a good software requirements specification are very important for system development. However, system analysts with different levels of experience write software specifications of different quality. Because software specifications are unstructured, browsing previously completed documents as a reference easily leads to omissions caused by differences in personal cognition and experience. Furthermore, when there are many reference documents, searching for the required information takes a lot of time. This research uses optical character recognition (OCR) and text mining technology to analyze the software requirements specification documents of an information company in Taichung. Optical character recognition and Jieba are used for text recognition and word segmentation, respectively. Next, the content of the documents is divided into text content databases at three levels: title, chapter, and content. Latent Dirichlet Allocation (LDA) and Term Frequency-Inverse Document Frequency (TF-IDF) are used to extract keywords and build hierarchical query indexes. By querying the hierarchical index of title, chapter, and content, the user can find appropriate content as a reference when writing a specific chapter of a software specification document. Furthermore, when a new document needs to be added to the lexicon database, the auxiliary lexicon for document writing can be expanded at minimal cost through the above process and human assistance. In this study, OCR replacement and correction rules, a custom stop word list, custom word segmentation rules, and a document writing auxiliary lexicon were built. This research also developed a complete process for document processing and writing assistance.
This research makes the following contributions. First, two rule bases are built, one for OCR text misrecognition and one for the custom stop word list. Second, a document writing auxiliary lexicon and a recommendation mechanism for document reference content are built. Finally, this study proposes a feasible solution for document conversion and segmentation, reference lexicon construction, and writing assistance.
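
The query over the hierarchical index of title, chapter, and content can be sketched as a filter per level followed by a ranking step. This is only an illustration under stated assumptions: the `Paragraph` record, the `recommend` function, and the sample entries are hypothetical, and the ranking criterion (number of content keywords shared with the query) is an assumption, since the abstract does not say how results are ordered.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    title_keys: set    # keywords indexed for the document title
    chapter_keys: set  # keywords indexed for the chapter heading
    content_keys: set  # keywords indexed for the paragraph text
    text: str


def recommend(index, title_q, chapter_q, content_q, top_k=3):
    """Filter by title, chapter, and content keywords, then rank the survivors."""
    # Level 1: keep entries whose title keywords overlap the title query.
    hits = [p for p in index if p.title_keys & title_q]
    # Level 2: keep entries whose chapter keywords overlap the chapter query.
    hits = [p for p in hits if p.chapter_keys & chapter_q]
    # Level 3: keep entries whose content keywords overlap the content query.
    hits = [p for p in hits if p.content_keys & content_q]
    # Ranking (assumed criterion): more shared content keywords come first.
    hits.sort(key=lambda p: len(p.content_keys & content_q), reverse=True)
    return [p.text for p in hits[:top_k]]


# Hypothetical index entries for illustration only.
index = [
    Paragraph({"payroll", "system"}, {"functional", "requirements"},
              {"login", "report"}, "Users log in to view reports."),
    Paragraph({"payroll", "system"}, {"functional", "requirements"},
              {"login", "password", "role"}, "Login requires a password and a role."),
    Paragraph({"inventory", "system"}, {"interface"},
              {"login", "password"}, "Inventory login screen."),
]
results = recommend(index, {"payroll"}, {"requirements"}, {"login", "password"})
print(results)
```

Filtering on the coarse levels first keeps the ranking step small, because only paragraphs from matching documents and chapters are ever compared against the content query.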

Parallel Keywords

document writing auxiliary lexicon; optical character recognition; text mining; Latent Dirichlet Allocation; Term Frequency-Inverse Document Frequency
