
Recommendation on Commercial Intention in Dialogs (對話過程廣告標的推薦之研究)

Advisor: Hsin-Hsi Chen (陳信希)

Abstract


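Instant messaging is currently one of the most popular Internet applications. Users converse by typing natural language or various symbols into the system, and during a conversation the system displays, at random, advertisement links that are unrelated to the dialog. The main goal of this thesis is to build a dialog analysis system so that the displayed advertisement links are highly relevant to what the participants are discussing, thereby increasing users' interest in the ads and the likelihood that they click them. The thesis uses the Yahoo! online directory as the data-comparison source and classifies each dialog into one of the fourteen Yahoo! Directory categories, such as Recreation & Sports, Art, and Science. The system weights each term in a dialog by how often it occurs in the Yahoo! Directory documents; depending on the model, it also takes into account the term's frequency within the dialog or within each category, which yields a method similar to traditional TFIDF. After terms are extracted, the system considers their properties, such as part of speech (verb, noun) and term length, according to the model's parameter design, before deciding whether to include them in the weight computation. The system also uses hypernyms, hyponyms, and synonyms when computing term weights. The thesis further applies an expansion procedure to the downloaded Yahoo! Directory to produce different data sources. The expansion is based on the node description files attached to the original directory structure, which contain related web pages, titles, and brief descriptions. We believed this information to be important for computing weights, and the experiments confirm it. For convenience, we refer to all of these data sources as corpora. With the best model and parameters proposed in this thesis, using nouns and the expanded corpus achieves an F-value of 90%; that is, out of one hundred dialogs the model correctly determines the category of ninety, from which ads of the relevant category are then retrieved. The thesis also proposes a dedicated statistic called hit speed: the round of a dialog at which the dialog's category is guessed correctly. Based on the current results, we are confident that with the best model the advertisement category can be guessed correctly by the time a dialog is halfway through, so that meaningful advertisement links can be delivered. We also develop a decision tree that determines whether a term is a new term and effectively extracts its definition; it can further single out transliterated place names and personal names. Finally, we summarize the experimental results, explain how to implement a complete instant messaging system that recommends ads through dialog analysis, and raise related issues and applications for future research.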
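The category-weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the exact weighting formula, the function names `build_category_weights` and `classify_dialog`, and the toy corpus are all assumptions made for illustration. Each term is weighted by its document frequency within a category, damped by how many categories contain it (a TFIDF-like scheme), and a dialog is assigned to the highest-scoring category.

```python
import math
from collections import Counter


def build_category_weights(corpus):
    """Compute a TFIDF-like weight for each (category, term) pair.

    corpus: dict mapping a category name to a list of documents,
            where each document is a list of terms.
    The formula below is an illustrative assumption, not the thesis's.
    """
    n_categories = len(corpus)
    # Per-category document frequency: in how many documents a term occurs.
    df = {
        cat: Counter(term for doc in docs for term in set(doc))
        for cat, docs in corpus.items()
    }
    # In how many categories each term occurs at all.
    cat_count = Counter(term for counts in df.values() for term in counts)

    weights = {}
    for cat, counts in df.items():
        n_docs = len(corpus[cat]) or 1  # avoid division by zero
        weights[cat] = {
            term: (c / n_docs) * math.log(1 + n_categories / cat_count[term])
            for term, c in counts.items()
        }
    return weights


def classify_dialog(terms, weights):
    """Score a dialog (a list of terms) against every category.

    Returns (best_category, scores). Ties resolve to the first category.
    """
    tf = Counter(terms)
    scores = {
        cat: sum(n * w.get(term, 0.0) for term, n in tf.items())
        for cat, w in weights.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy corpus with two of the directory categories.
toy_corpus = {
    "Recreation & Sports": [["basketball", "game"], ["tennis", "game"]],
    "Science": [["physics", "experiment"], ["chemistry", "experiment"]],
}
toy_weights = build_category_weights(toy_corpus)
best, _ = classify_dialog(["basketball", "game", "tonight"], toy_weights)
print(best)
```

Restricting `terms` to nouns before calling `classify_dialog`, or adding hypernyms and synonyms to the term list, would be the natural places to plug in the other model parameters the abstract mentions.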

Parallel Abstract


Instant messaging is among the most popular applications on the Internet. Users communicate with each other by typing text or symbols in natural language. While a conversation is in progress, advertisement links unrelated to it appear at random. Our goal is to build a dialog analysis system that proposes advertisement links highly relevant to the dialog content, and thereby increases the click rate of those links. The proposed model uses the Yahoo! Directory tree as the data-comparison source and classifies each dialog into one of the 14 categories of the Yahoo! Directory, such as Recreation & Sports, Art, and Science. The system weights the terms of a dialog according to their document frequency in the Yahoo! Directory tree. A TFIDF-like variant, which also takes into account term frequency in the dialog and in each category, is evaluated as well. To improve the data source, we develop an expansion algorithm that expands the original Yahoo! Directory tree with its accompanying HTML files, which store related web pages together with their titles, links, and snippets. The experimental results show that this expansion improves performance. For convenience, we refer to these data sources as corpora. Under the best setting of the model parameters, using nouns together with the expanded corpus gives the best result, an F-value of 90%. This gives us confidence that we can correctly guess the commercial intention of 90 dialogs out of a given set of 100. In addition, a dedicated statistic, hit speed, is proposed to measure how early in a dialog our system can retrieve the correct commercial category and provide relevant ad links. So far we are confident of doing so by the middle round of a given conversation. We also define a decision tree that identifies new terms in dialogs and retrieves their definitions. After some refinement, it also yields transliterated geographical names and personal names. Finally, we report detailed results, describe how our models can be used to implement an effective commercial recommendation system for IM applications, and discuss some topics for future research.
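The hit speed statistic can be made concrete with a small sketch. The abstract does not give a formal definition, so the reading used here is an assumption: hit speed is taken as the first round at which a classifier, given all terms seen so far, returns the correct category. The names `hit_speed` and `toy_classify` are hypothetical, and the keyword rule stands in for whatever classification model is actually used.

```python
def hit_speed(turns, gold_category, classify):
    """Return the 1-based round at which the category is first guessed correctly.

    turns:         list of turns, each a list of terms.
    gold_category: the correct category label for the dialog.
    classify:      callable taking a list of terms and returning a label.
    Returns None if the category is never guessed correctly.
    Assumption: the classifier sees all terms accumulated up to each round.
    """
    seen = []
    for round_no, turn in enumerate(turns, start=1):
        seen.extend(turn)
        if classify(seen) == gold_category:
            return round_no
    return None


def toy_classify(terms):
    """Hypothetical stand-in classifier based on a single keyword."""
    return "Recreation & Sports" if "basketball" in terms else "Science"


dialog = [
    ["hi"],
    ["how", "are", "you"],
    ["basketball", "tonight"],
    ["sure"],
]
round_hit = hit_speed(dialog, "Recreation & Sports", toy_classify)
# Fraction of the dialog elapsed before the first correct guess.
relative = round_hit / len(dialog) if round_hit is not None else None
print(round_hit, relative)
```

Dividing by the dialog length, as in `relative`, is one way to compare dialogs of different lengths against the "middle round" claim; a stricter variant would require the prediction to stay correct in every later round.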

