
Multimodal Video Similarity Detection and Localization Using Audio-Visual Features

Advisor: 武士戎

Abstract


In the digital age, with the upsurge in video content on platforms such as YouTube, the need for efficient video similarity localization (VSL) systems has become paramount. With its ability to temporally align similar video segments, VSL addresses the challenges posed by the sheer volume of video data and the increasing instances of copyright infringement. Current VSL methods often focus on either visual or auditory features but rarely capitalize on the synergy between the two. This research introduces a Multimodal Video Similarity Localization (MVSL) pipeline that seamlessly integrates audio and visual features to enhance performance on similarity localization tasks. The study contributes a novel deep learning pipeline and a specially curated Video Similarity Localization Dataset (VSLD), and demonstrates the effectiveness of the approach on the VSLD. The MVSL system begins with video preprocessing and audio-visual feature extraction, then progresses to similarity mapping and temporal alignment. The experimental results showcase outstanding performance compared with other studies, with auditory features playing a decisive role in VSL. This work lays a foundation for the next generation of video content analysis tools, enabling automated, scalable, and precise video similarity detection that holds immense value for content creators and digital media platforms.
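The abstract names four stages (preprocessing, audio-visual feature extraction, similarity mapping, temporal alignment) but gives no implementation details. The sketch below is a minimal, hypothetical illustration of the last two stages using only NumPy: fuse, similarity_map, and localize are invented names, the arrays stand in for real audio and visual embeddings produced by unspecified extractors, and the diagonal-scan alignment is a generic technique rather than the thesis's actual method.

```python
# Hypothetical sketch of the similarity-mapping and temporal-alignment stages.
# Placeholder arrays replace real per-second audio/visual embeddings; only
# NumPy is required. None of this is taken from the thesis itself.
import numpy as np

def fuse(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Concatenate L2-normalized audio and visual features per time step."""
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8)
    v = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + 1e-8)
    return np.concatenate([a, v], axis=1)

def similarity_map(query: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Cosine similarity between every query and reference time step."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    r = reference / (np.linalg.norm(reference, axis=1, keepdims=True) + 1e-8)
    return q @ r.T  # shape: (len(query), len(reference))

def localize(sim: np.ndarray, threshold: float = 0.8, min_len: int = 3):
    """Longest above-threshold run along any diagonal of the similarity map,
    i.e. the best temporally aligned segment pair; returns
    (query_start, query_end, ref_start, ref_end) or None."""
    n_q, n_r = sim.shape
    best = None  # (run_length, query_start, ref_start)
    for offset in range(-(n_q - 1), n_r):      # each diagonal = one time shift
        hits = np.diagonal(sim, offset=offset) >= threshold
        run = 0
        for i, hit in enumerate(hits):
            run = run + 1 if hit else 0
            if run >= min_len and (best is None or run > best[0]):
                start = i - run + 1
                if offset >= 0:                # element i lies at sim[i, i+offset]
                    best = (run, start, start + offset)
                else:                          # element i lies at sim[i-offset, i]
                    best = (run, start - offset, start)
    if best is None:
        return None
    length, q0, r0 = best
    return q0, q0 + length, r0, r0 + length

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder "embeddings": 60 one-second steps for the reference video,
    # 20 for the query, with 64-dim audio and 128-dim visual features.
    ref_a, ref_v = rng.normal(size=(60, 64)), rng.normal(size=(60, 128))
    qry_a, qry_v = rng.normal(size=(20, 64)), rng.normal(size=(20, 128))
    # Plant a shared 10-second segment: query seconds 5-15 copy reference 30-40.
    qry_a[5:15], qry_v[5:15] = ref_a[30:40], ref_v[30:40]
    sim = similarity_map(fuse(qry_a, qry_v), fuse(ref_a, ref_v))
    print(localize(sim))  # -> (5, 15, 30, 40)
```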
