THE HONG KONG CANTONESE CORPUS: DESIGN AND USES

The Hong Kong Cantonese Corpus (HKCC) was built to give researchers and language learners access to a body of naturally occurring talk drawn from everyday conversations among speakers of Cantonese in Hong Kong. In this paper, we describe the origin, rationale, design principles and uses of the HKCC. In particular, we focus on the following aspects of the corpus: (1) data collection procedures; (2) transcription and orthographic conventions; (3) encoding schemes; (4) segmentation and POS tagging; and (5) potential uses of the corpus and future directions.

Keywords

Speech corpus; Conversation; Cantonese; Naturally occurring talk; Corpus design

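Parallel Abstract

The Hong Kong Cantonese Corpus was constructed to provide naturally occurring language material from everyday conversation for linguistic research and Cantonese learning. This paper introduces the conception, motivation, design and applications of the corpus. The discussion covers: (1) the principles and process of data collection; (2) transcription rules; (3) the coding system; (4) word segmentation and part-of-speech tagging; and (5) applications of the corpus and directions for future development.

Parallel Keywords

Spoken-language corpus; Everyday conversation; Cantonese; Natural language material; Corpus design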
