Analyzing Modifications of Management's Discussion and Analysis in Annual Reports with Natural Language Processing Methods

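Aim: Management's Discussion and Analysis (MD&A) is one of the important items in the 10-K annual report, and year-over-year changes to the MD&A text have been used in many studies, including firm performance evaluation and stock price prediction. However, the preprocessing steps for MD&A modifications, namely extracting the MD&A from 10-K reports and removing unwanted text from the extracted MD&A, still rely on traditional text analysis methods, which negatively affects the analysis of MD&A modifications. In addition, existing representations of MD&A modifications cannot fully account for text semantics and are usually expressed as numerical values, rarely showing what was actually modified. Methods: This study builds a natural language processing framework (EPSC) to analyze MD&A modifications. It comprises item extraction, item prettification, sentence-level document differences based on semantic changes (SDDSC), and clustering to explore trends in MD&A modifications. EPSC addresses the limitations of prior research in item extraction, item prettification, and the presentation of MD&A modifications, and applies advanced natural language processing techniques to improve the analysis of MD&A modifications. EPSC consists of four steps. The first step uses a Conditional Random Field (CRF) for item extraction from 10-K annual reports. The second step uses a Bi-directional Long Short-Term Memory (Bi-LSTM) model for item prettification of 10-K annual reports. The third step uses our SDDSC algorithm to present detailed year-over-year MD&A modifications. The fourth step uses the K-Means clustering algorithm to identify trends in MD&A modifications within an industry. Results: Our experimental results show that Bi-LSTM outperforms the other models on item prettification. Our SDDSC can present detailed information on MD&A modifications under different thresholds of text semantic similarity. In addition, the K-Means algorithm successfully identifies trends in MD&A modifications within an industry, and each trend is presented by the five sentences most similar to the cluster center. Conclusions: This study adopts advanced natural language processing techniques to improve the analysis of MD&A modifications. Moreover, our EPSC provides more detailed information about changes to the MD&A text, offering valuable information to researchers and investors. In future work, we hope to add manually annotated data for item extraction to improve model performance, and to convert our SDDSC into a non-recursive algorithm to remove the depth limit of recursion and improve execution efficiency.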
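The idea of a threshold-controlled, sentence-level comparison can be illustrated with a small sketch. This is not the SDDSC algorithm itself, whose details are not given in this abstract. It is a greedy, non-recursive matcher, and it assumes a bag-of-words cosine similarity as a crude stand-in for a real semantic similarity measure. The names `embed`, `cosine`, `sentence_diff`, and the default threshold of 0.6 are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a semantic encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sentence_diff(old_sents, new_sents, threshold=0.6):
    """Label sentences as unchanged, modified, added, or removed.

    Each new sentence is greedily paired with its most similar unmatched
    old sentence; the pair counts as a match only if the similarity
    reaches the threshold (an assumed, illustrative value).
    """
    old_vecs = [embed(s) for s in old_sents]
    matched_old = set()
    report = []
    for new in new_sents:
        vec = embed(new)
        best_i, best_sim = None, 0.0
        for i, old_vec in enumerate(old_vecs):
            if i in matched_old:
                continue
            sim = cosine(vec, old_vec)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_i is not None and best_sim >= threshold:
            matched_old.add(best_i)
            label = "unchanged" if old_sents[best_i] == new else "modified"
            report.append((label, old_sents[best_i], new))
        else:
            report.append(("added", None, new))
    # Old sentences that were never matched have been dropped this year.
    for i, old in enumerate(old_sents):
        if i not in matched_old:
            report.append(("removed", old, None))
    return report


old = [
    "Revenue increased 5% due to higher sales volume.",
    "We face risks from currency fluctuations.",
]
new = [
    "Revenue decreased 3% due to lower sales volume.",
    "We opened two new distribution centers.",
]
for label, before, after in sentence_diff(old, new, threshold=0.6):
    print(label, "|", before, "->", after)
```

Raising the threshold makes the matcher stricter: a sentence pair that counts as "modified" at a low threshold is reported as one removal plus one addition at a higher one, which is one way different thresholds can yield different views of the same revision.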

Keywords

10-K Reports; Management's Discussion and Analysis; Modifications of Management's Discussion and Analysis; Natural Language Processing; CRF; Bi-LSTM; BERT

Parallel Abstract

Aim: Management's Discussion and Analysis (MD&A) is an important item in 10-K reports. In particular, the text changes of MD&A across years, also known as MD&A modifications, have been applied in research such as firm performance evaluation and stock price prediction. However, the preprocessing routine for MD&A modifications, which includes extracting MD&A from 10-K reports and removing unwanted text chunks from the extracted MD&A, still relies on traditional text analysis approaches that lower the quality of the analysis. In addition, existing representations of MD&A modifications cannot fully consider text semantics and are usually aggregated into numerical values, lacking detailed information about what has been modified. Methods: We develop a natural language processing framework for analyzing MD&A modifications through item Extraction, item Prettification, Sentence-level document differences based on text semantic changes, and trend exploration with Clustering (EPSC). EPSC addresses the limitations of item extraction, item prettification, and the display of MD&A modifications by applying advanced NLP techniques. EPSC is composed of four steps. The first step (i.e., E Step) is item extraction using a Conditional Random Field (CRF). The second step (i.e., P Step) is item prettification using a character-level Bi-directional Long Short-Term Memory (Bi-LSTM), a deep learning-based sequence labeling model. The third step (i.e., S Step) provides detailed year-over-year MD&A modifications using our novel Sentence-level Document Difference based on Semantic Changes algorithm (SDDSC). The fourth step (i.e., C Step) identifies MD&A modification trends in a given industry using the K-Means clustering algorithm. Results: Our experimental results show that Bi-LSTM performs better than our baseline models for item prettification. We also show that our SDDSC can identify detailed MD&A modifications under different text semantic similarity thresholds. In addition, we can successfully identify potential patterns and trends in a given industry using the K-Means clustering algorithm, listing the top five sentences that represent each pattern or trend. Conclusions: We adopt advanced NLP techniques that improve the analysis of MD&A modifications. In addition, our EPSC provides more detailed, semantics-aware information about what has changed in MD&A, giving more valuable insight to both researchers and investors. In the future, we hope to increase the size of the annotated dataset for item prettification to improve model performance, and to rewrite our SDDSC as a non-recursive algorithm to remove the recursion depth limit and increase its efficiency.
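The clustering step, which reports the sentences that best represent each trend, can be sketched as follows. This is a minimal illustration under stated assumptions and not the thesis implementation: it uses a plain Lloyd's K-Means seeded with the first k points, cosine similarity to the centroid for ranking, and tiny hand-made 2-D vectors in place of real sentence embeddings. The function names `kmeans` and `top_sentences` and all sample data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm, seeded with the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids


def top_sentences(sentences, vectors, labels, centroids, top_n=5):
    """Per cluster, the top_n member sentences most similar to the centroid."""
    result = {}
    for c, centroid in enumerate(centroids):
        scored = [
            (cosine(v, centroid), s)
            for s, v, lab in zip(sentences, vectors, labels)
            if lab == c
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        result[c] = [s for _, s in scored[:top_n]]
    return result


# Hypothetical modified sentences with toy 2-D "embeddings".
sentences = [
    "Added discussion of supply chain disruption.",
    "Expanded disclosure on interest rate risk.",
    "New paragraph on supplier delays.",
    "Updated sensitivity to interest rate changes.",
]
vectors = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]

labels, centroids = kmeans(vectors, k=2)
for cluster, reps in top_sentences(sentences, vectors, labels, centroids).items():
    print(cluster, reps)
```

In practice the vectors would come from a sentence encoder and the number of clusters would need to be chosen for the industry at hand; both choices are outside what this abstract specifies.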

Parallel Keywords

10-K Reports; MD&A; MD&A Modifications; Natural Language Processing; CRF; Bi-LSTM; BERT

