Title

光學字元辨識古籍之全文轉置經驗:以明人文集為例

Translated Titles

Full Text Conversion Experience in Optical Character Recognition of Ancient Books: An Example of Ming Dynasty Literati Collections

DOI

10.6575/JILA.202012_(97).0003

Authors

林巧敏(Chiao-Min Lin);蔡瀚緯(Han-Wei Tasi)

Key Words

光學字元辨識 ; 全文資料庫 ; 特藏古籍 ; 古籍數位化 ; 數位典藏 ; Optical character recognition ; Full-Text database ; Old rare books ; Ancient book digitization ; Digital archive

PublicationName

圖資與檔案學刊

Volume or Term/Year and Month of Publication

No. 97 (2020/12/15)

Page #

76 - 117

Content Language

Traditional Chinese

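Chinese Abstract (English translation)

With advances in information technology and the demand from digital humanities research for full-text content analysis, using optical character recognition (OCR) to convert textual content into full text can facilitate full-text retrieval and content mining. To assess the feasibility of converting the full text of ancient books with OCR software, this study ran empirical tests on ancient texts to examine how well OCR performs on them and which factors affect the recognition rate. Forty Ming dynasty literary collections were selected for analysis. The results show that both page layout and image quality affect the OCR recognition rate; in particular, overly crowded text layouts and poor image quality hinder OCR processing. The study further identifies six common patterns of misrecognized character forms, which can serve as a reference for collecting institutions planning full-text conversion of similar editions of ancient books.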

English Abstract

Driven by advances in information technology and the need of digital humanities research for content analysis, optical character recognition (OCR) is used to convert document contents into full text, which facilitates full-text search and content exploration. To assess the feasibility of using OCR software to convert the full text of ancient books, this study ran empirical tests on ancient texts to examine the effectiveness of OCR and the factors that affect the recognition rate. The study selected 40 Ming Dynasty literary collections with differing layouts and glyphs for analysis. The results show that both page layout and image quality affect the OCR recognition rate: an overly crowded layout or a poor-quality image hinders recognition. The study also identifies six common types of misrecognized glyphs, which collecting institutions can use as a reference when planning full-text conversion of similar ancient books.

Topic Category Humanities > Library and Information Science