An Audio-Retrieval-Assisted Annotation Tool Based on Query-by-Example Algorithms

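Data annotation is the process of labeling data such as speech, images, and text. It is essential in artificial intelligence and machine learning, where labeled data is used to train statistical models to understand content and produce appropriate results. Manual labeling, however, is costly in time and labor, so an assisted annotation system that reduces these costs would be valuable. For audio data in particular, an audio retrieval tool that can search for and locate sound segments could greatly shorten labeling time. Among query-by-acoustic-example and query-by-semantic-example approaches, Audio Fingerprinting, proposed by Shazam and Musiwave, lets users look up a song with a short clip recorded in the environment. This thesis applies the audio fingerprinting method to an assisted annotation system and, through a series of experiments across various environments and comparisons with other methods, quantifies and analyzes the system's retrieval performance and noise robustness. It also presents an interactive user interface used for usability testing and for collecting feedback. Compared with a conventional manual annotation interface, the system shortens users' annotation time by 35% without loss of annotation quality, although retrieval accuracy currently averages about 80% and leaves room for improvement.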

Keywords

audio retrieval; query by example; audio fingerprinting; data annotation

Parallel Abstract

Data annotation is the process of labeling images, videos, audio, and text. It is critical in artificial intelligence (AI) and machine learning (ML), where labeled data is used to train statistical models to understand the input and react appropriately. However, manual labeling is costly in time and labor, so it is worthwhile to build an assisted annotation tool that reduces this cost. For audio data in particular, an audio retrieval tool can locate the segments that match a query and help label them quickly. In previous work on content-based audio retrieval, query-by-acoustic-example (QBAE) and query-by-semantic-example (QBSE) are two classic approaches. Within QBAE, a well-known algorithm called Audio Fingerprinting (AF), proposed by Shazam [1] and Musiwave [2], allows users to search for a desired song with a short query recorded in the environment. In this thesis, we implement the AF method to construct an assisted annotation system and conduct a set of experiments to validate its feasibility. The proposed system is called QBEAT (Query-by-example Annotation Tool). Through quantitative analysis under different environments and a comparison with cross-correlation (a conventional method in audio retrieval), we assess the noise robustness and retrieval performance of QBEAT. In addition, an interactive user interface is built for usability testing and for gathering feedback from participants. Compared with manual annotation interfaces, the proposed system shortens labeling time without loss of labeling performance, although there is still room to improve the accuracy of audio retrieval.
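To make the cross-correlation baseline mentioned above concrete, the following is a minimal sketch of locating a query clip inside a longer recording by normalized cross-correlation. It is an illustration under stated assumptions, not the implementation evaluated in this thesis: the function name `locate_query`, the use of plain Python lists of samples, and the per-window normalization are all choices made here for clarity.

```python
import math


def locate_query(recording, query):
    """Return (offset, score) of the window in `recording` that best matches `query`.

    Hypothetical helper for illustration only. The score is the normalized
    cross-correlation (cosine similarity) between the query and each
    equally long window of the recording, so it lies in [-1, 1].
    """
    m = len(query)
    if m == 0 or m > len(recording):
        raise ValueError("query must be non-empty and no longer than the recording")

    q_norm = math.sqrt(sum(x * x for x in query))
    best_offset, best_score = 0, float("-inf")

    # Slide the query over every position where it fully overlaps the recording.
    for offset in range(len(recording) - m + 1):
        window = recording[offset:offset + m]
        w_norm = math.sqrt(sum(x * x for x in window))
        if q_norm == 0.0 or w_norm == 0.0:
            score = 0.0  # a silent window or query carries no match information
        else:
            dot = sum(w * q for w, q in zip(window, query))
            score = dot / (w_norm * q_norm)
        # Strict comparison keeps the earliest offset when scores tie.
        if score > best_score:
            best_offset, best_score = offset, score

    return best_offset, best_score
```

Calling `locate_query(recording, query)` returns the sample offset of the best match together with its similarity score. This brute-force form costs time proportional to the product of the recording and query lengths, which is one reason a sketch like this would only serve as a point of comparison for small inputs.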