以圖像階層式主題推薦附歌詞的歌曲之研究

Exploiting hierarchical topics of images to recommend songs with lyrics

Advisor: 鄭卜壬

Abstract


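Photographs are an important means of preserving memories and recording moments of life. They often carry a wide range of emotions, and pairing the visual experience with an auditory one can heighten the emotion a photo conveys. People today frequently share photos on social networks, and automatically pairing an image with a song can make the photo richer and more engaging.

The first paper to pose the image-song matching problem appeared in 2017. However, the dataset and method it proposed have several shortcomings. First, the dataset contains many unreasonable image-song pairs. Second, the paper processes images with object detection and passes the detection results to the downstream training network; because object detection has a non-negligible error rate, these errors accumulate.

In addition, we argue that each image has hierarchical topics: besides a broad main topic, an image can be further classified into a subtopic under that topic. An image should therefore be matched not to a single song but to an ordered list of songs: songs matching the subtopic first, then songs matching the main topic, and finally songs from other topics.

To address these problems, we build a hierarchical image-song matching dataset. We use the 6,000 most popular hashtags on Instagram as the topics for collecting images, and use the tags attached to the photos returned by Flickr searches on each topic as the basis for deriving subtopics. After establishing the main topics and subtopics, we collect images from two platforms, Flickr and Google image search, and collect the corresponding songs.

In this thesis we use lyrics as the information representing a song and focus on the image-text matching method. The proposed method consists of three main steps: extracting image features, extracting lyric features, and matching the image features with the lyric features. Previous image-text matching models cannot handle hierarchical matching, whereas our model can match songs to images organized by hierarchical topics. Experimental results show that our model achieves good accuracy and handles hierarchical topic matching effectively.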

Parallel Abstract


Photos are an important medium for keeping memories and recording life. A photo usually expresses some feeling, and combining vision with hearing enhances and strengthens the feelings it conveys. People often post their photos on social networks; if we can match the photos posted on social media with songs, they become more expressive and more interesting. The first work on the image-song matching problem was proposed in 2017. However, this work has several drawbacks. First, the dataset contains many unreasonable matching pairs. Second, it uses object detection to extract image representations, which causes detection errors to propagate to the subsequent networks. Additionally, we think an image has hierarchical topics: each image has a topic and a specific tag. Hence, an image should match not just one song; it should first match songs with the same tag, then songs with the same topic. To solve these problems, we create a hierarchical image-song matching dataset. We crawl image data from Flickr and Google image search by topic and tag, and collect the corresponding songs. In this work, we use lyrics as the information of a song and concentrate on matching rather than feature extraction. Our method has three main steps: first, obtain the image representation; second, obtain the lyric representation; finally, match the image with the lyrics. We propose two methods for this task and obtain good results.
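
The ordering described in the abstract (same tag first, then same topic, then everything else) can be illustrated as a sort by graded relevance. This is only a minimal sketch of the target ranking, not the thesis's matching model, which works from image and lyric features. The `Song` type, the `relevance` and `rank_songs` functions, and the example labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Song:
    title: str
    topic: str  # broad main topic (hypothetical label)
    tag: str    # subtopic under the main topic (hypothetical label)


def relevance(image_topic: str, image_tag: str, song: Song) -> int:
    """Graded relevance: 2 = same topic and tag, 1 = same topic only, 0 = other."""
    if song.topic == image_topic and song.tag == image_tag:
        return 2
    if song.topic == image_topic:
        return 1
    return 0


def rank_songs(image_topic: str, image_tag: str, songs: list[Song]) -> list[Song]:
    # sorted() is stable, so songs with equal relevance keep their input order.
    return sorted(
        songs,
        key=lambda s: relevance(image_topic, image_tag, s),
        reverse=True,
    )


# Assumed example: an image labeled with topic "travel" and tag "beach".
songs = [
    Song("song_other", "food", "dessert"),
    Song("song_topic", "travel", "mountain"),
    Song("song_tag", "travel", "beach"),
]
ranked = rank_songs("travel", "beach", songs)
print([s.title for s in ranked])
```

A graded target of this kind is also what a ranking metric would need in order to reward a model for placing same-tag songs above same-topic songs.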

