
Conversion and Application of Text-Tagging Formats

On transformations between text-tagging formats

Advisor: 項潔

Abstract


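Many digital humanities studies rely on term tags in texts, and a number of text-tagging tools are already available. Because each tool is good at tagging different kinds of terms, this thesis aims to use several tools together. The tools use different formats, however, so they cannot be combined directly; conversion between formats is unavoidable. To this end, this thesis analyzes what information text-tagging formats contain, classifies that information, and defines a new text-tagging format, STAML, to store it. STAML then serves as the intermediary language for converting between different text-tagging formats, and the conversion program is implemented on a web platform. With the STAML format and its conversion program, this thesis makes it possible to use these text-tagging tools together, in the hope of making digital humanities research go more smoothly.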

Parallel Abstract


Tagging named entities in a text is often an essential part of preparing the text for use in digital humanities research. Although several text-tagging tools are available to researchers, each tool is designed for a specific purpose, and the tagging formats they use often differ. Consequently, text tagged with one tool cannot be reused by another person with a different tool. In this thesis we propose an approach to integrating the different text-tagging formats produced by different tools. We introduce the Simple Text-Annotation Markup Language (STAML), which serves as an intermediary representation between different tagging formats. Through STAML, texts tagged using one format can be used in another tagging tool without disrupting the existing annotations. STAML and web-based conversion programs are implemented for several common tagging formats for Chinese-language texts, such as those used by MARKUS (a popular tagging tool), THDL, and TEI.
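The idea of converting through an intermediary representation can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the STAML syntax, so the `Document` and `Annotation` classes, the two toy formats (`<label>span</label>` and `[[span|label]]`), and all function names below are hypothetical and are not the actual STAML, MARKUS, THDL, or TEI formats.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # offset into the plain text (inclusive)
    end: int    # offset into the plain text (exclusive)
    label: str  # tag type, e.g. "person"


@dataclass
class Document:
    """Hypothetical intermediary form: plain text plus standoff annotations."""
    text: str
    annotations: list


# Toy format A (an assumption): flat, non-nested <label>span</label> tags.
TAG = re.compile(r"<(\w+)>(.*?)</\1>")


def parse_angle(source):
    """Read toy format A into the intermediary form."""
    parts, annotations = [], []
    pos = 0     # current offset in the plain text being built
    cursor = 0  # current offset in the tagged source
    for m in TAG.finditer(source):
        parts.append(source[cursor:m.start()])
        pos += m.start() - cursor
        span = m.group(2)
        annotations.append(Annotation(pos, pos + len(span), m.group(1)))
        parts.append(span)
        pos += len(span)
        cursor = m.end()
    parts.append(source[cursor:])
    return Document("".join(parts), annotations)


def write_bracket(doc):
    """Write the intermediary form as toy format B: [[span|label]]."""
    out, cursor = [], 0
    for a in sorted(doc.annotations, key=lambda a: a.start):
        out.append(doc.text[cursor:a.start])
        out.append(f"[[{doc.text[a.start:a.end]}|{a.label}]]")
        cursor = a.end
    out.append(doc.text[cursor:])
    return "".join(out)


def convert(source):
    """Format A to format B, always passing through the intermediary form."""
    return write_bracket(parse_angle(source))
```

With this layout, supporting a new format means writing one reader and one writer against the intermediary form, rather than a separate converter for every pair of formats.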

References


[1] European Research Council, “MARKUS.” http://dh.chinese-empires.eu/beta/index.html. [Online; accessed 12-June-2016].
[2] 杜協昌, “詞夾子系統 [Term Clipper system].” http://dev.digital.ntu.edu.tw/DADH-2015/ch-clipper.html. [Online; accessed 12-June-2016].
[3] 謝育平, “同位詞夾子: 主題式分類詞庫萃取演算法 [Appositional term clipper: an algorithm for extracting thematically classified lexicons],” 2010.
[4] Text Encoding Initiative Consortium, “TEI: Text Encoding Initiative.” http://www.tei-c.org/. [Online; accessed 12-June-2016].

Cited by


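趙叡 (2017). A Parallel Text Reading System: The Case of the Three Commentaries on the Spring and Autumn Annals [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU201702088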
