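Recognizing textual entailment (RTE) is the task of identifying the entailment (inference) relation between two sentences, and it can benefit many natural language processing applications. This thesis presents an in-depth study of Chinese textual entailment recognition. We first review the current state of RTE research worldwide, including related evaluation campaigns and competitions, commonly used resources and tools, and mainstream recognition methods. Next, we implement a baseline RTE system using common linguistic features in order to observe the key difficulties of the task and the phenomena specific to Chinese. We then address two problems observed while evaluating the baseline system, namely lexical similarity and contradiction between sentences, propose improvements for them, and integrate these improvements into a new Chinese textual entailment recognition system. Compared with the baseline, the improved system achieves significantly higher accuracy in recognizing the various entailment relations, as well as greater robustness, on the datasets of two NTCIR RITE international competitions. Based on these results, we also suggest possible future directions for Chinese textual entailment recognition.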

Historically, information that apprises us of daily events has been provided by mass media sources, specifically news media. Today, social media services such as Twitter provide an enormous amount of user-generated data that has great potential to contain informative, news-related content. For this content to be useful, however, we must find a way to filter out noise and capture only the information that, based on its content similarity to news media, may be considered useful or valuable. Even after noise is removed, the remaining data still poses a problem of information overload. A person cannot process huge amounts of information all at once, so the most valuable information must be prioritized for consumption. To achieve this prioritization, the information must be ranked in order of estimated importance. The temporal prevalence of a particular topic in news media is one significant factor of importance and may be considered the media focus of the topic. The topic’s temporal prevalence in social media, specifically Twitter, indicates user interest and may be considered its user attention. Furthermore, the interaction among the social media users who mention the topic indicates the strength of the community discussing it and may be considered the user interaction. We propose an unsupervised method called SociRank, which identifies news topics that are prevalent in both social and news media and then ranks these topics, taking into account media focus, user attention, and user interaction as measures of importance.

The Internet is now more than a commodity and has become an invaluable service for organizations, companies, and everyday users. As it continues to grow enormously, attackers keep creating new methods to prey on vulnerable users. Securing and protecting user data is now a matter of high importance, since many attacks are commonly deployed through malicious websites. Many commercial enterprise solutions are costly and require a sophisticated infrastructure to deploy. Additionally, these solutions often rely on vendors to constantly provide signatures or blacklists to keep the system up to date. As a result, detecting malware infection is often complex. Client honeypots have become a popular choice among researchers who aim to detect and analyze drive-by-download attacks. These systems crawl websites and detect whether malware or malicious code is present on them. The tools are readily available and are relatively easy to deploy and maintain. Over the years, letting users manage their own defense systems has proved inefficient because of performance issues and the complexity of maintaining these solutions individually. In this thesis, we propose a solution to keep networks behind a proxy server secure. Client honeypots feed the proxy server with newly found malicious websites, and the proxy server consults a database of blocked URLs and domains to filter users' web access. Clients connect to the proxy server, which is coupled with an Internet Content Adaptation Protocol (ICAP) system. The ICAP system serves an HTML page when clients visit potentially malicious websites.
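
The filtering step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `Blocklist` class, the `report` and `filter_request` names, the in-memory sets standing in for the database, and the block-page HTML are all assumptions, and no real proxy or ICAP exchange is modelled.

```python
from urllib.parse import urlsplit

# Hypothetical block page; the abstract only says that an HTML page is served.
BLOCK_PAGE = "<html><body><h1>Access blocked</h1></body></html>"


class Blocklist:
    """In-memory stand-in for the database of blocked URLs and domains."""

    def __init__(self):
        self.domains = set()  # every URL under these hosts is blocked
        self.urls = set()     # individual pages, stored as host + path

    @staticmethod
    def _host(url):
        # urlsplit lower-cases the hostname; URLs without a scheme yield "".
        return urlsplit(url).hostname or ""

    @staticmethod
    def _key(url):
        parts = urlsplit(url)
        return (parts.hostname or "") + (parts.path or "/")

    def report(self, url, whole_domain=False):
        """Record a malicious URL, e.g. one found by a client honeypot."""
        host = self._host(url)
        if not host:
            return  # ignore entries that are not absolute URLs
        if whole_domain:
            self.domains.add(host)
        else:
            self.urls.add(self._key(url))

    def is_blocked(self, url):
        host = self._host(url)
        if not host:
            return False
        # A blocked domain also blocks all of its subdomains.
        labels = host.split(".")
        for i in range(len(labels)):
            if ".".join(labels[i:]) in self.domains:
                return True
        # Query strings are ignored when matching individual pages.
        return self._key(url) in self.urls


def filter_request(blocklist, url):
    """Return the block page for a blocked URL, or None to let it through."""
    return BLOCK_PAGE if blocklist.is_blocked(url) else None
```

Matching on parent domains means one honeypot report can cover every subdomain of a malicious host, at the cost of possibly over-blocking shared hosting domains.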

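Daily life generates large amounts of data, and many researchers hope to analyze the collected data to improve existing conditions or to replace human labor, for example by forecasting economic trends or identifying diseases. As computing speed and storage capacity have grown rapidly, how to analyze data well has become an important issue. Such data are often complex and high-dimensional, which makes the analysis more difficult. As a data analysis technique, principal component analysis (PCA) can effectively reduce the dimensionality of data while retaining as much of its characteristic information as possible. In this thesis, we use principal components to re-represent four datasets: 8OX, colon cancer gene, breast cancer gene, and wine recognition. For visualization, we use MATLAB to present two-dimensional and three-dimensional experimental results. Finally, we discuss one caveat in using PCA and a feasible solution to it.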
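
The dimensionality reduction described above can be illustrated with a small sketch. It is written in Python rather than the MATLAB used in the thesis, and it extracts the leading components by power iteration with deflation, which is a choice made here for self-containment and not necessarily the procedure the thesis follows. All function names are hypothetical.

```python
import math


def center(data):
    """Subtract the per-feature mean from every row."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def power_iteration(matrix, iters=500):
    """Approximate the dominant eigenpair of a symmetric matrix.

    The fixed start vector is an assumption of this sketch; it fails if it
    happens to be orthogonal to the dominant eigenvector.
    """
    d = len(matrix)
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j]
                     for i in range(d) for j in range(d))
    return eigenvalue, v


def pca(data, k):
    """Return the top-k (variance, direction) pairs and the projected data."""
    centered = center(data)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        eigenvalue, v = power_iteration(cov)
        components.append((eigenvalue, v))
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(r[j] * v[j] for j in range(d)) for _, v in components]
                 for r in centered]
    return components, projected
```

Calling `pca(data, 2)` or `pca(data, 3)` gives the coordinates one would plot for a two- or three-dimensional view of the data.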

Fast and efficient power restoration algorithms are becoming necessary for current and future electrical smart grids. In light of this, we propose a Multi-Agent System (MAS) approach for automatic restoration in power distribution networks. Agents in our MAS are categorized into Generator Agents (GA), Zone Agents (ZA), and one Data Base Agent (DBA). GAs have been implemented with negotiation capabilities in order to minimize the cost of the post-restoration configuration. Moreover, as electrical demand fluctuates on an hourly basis, a Least-Square Boosting technique is used for short-term forecasting of electrical demand. This prediction is incorporated into the restoration algorithm in order to obtain a capacity-based restoration solution. The proposed method has been evaluated on two distribution networks. The forecasting methodology and restoration process are demonstrated in detail through several experiments.

Haplotypes consist of blocks of single nucleotide polymorphisms (SNPs). As a unit of inheritance, haplotypes are widely used for association studies and candidate gene studies. However, obtaining these blocks of SNPs through in vitro methods is both time-consuming and expensive. In silico studies instead try to infer haplotypes from genotypic data. This thesis uses a genetic algorithm (i.e. a heuristic approach) guided by two genetic models, namely Hardy-Weinberg equilibrium and linkage disequilibrium. These are assessed statistically by maximum likelihood estimates and normalized mutual information, respectively. The technique generates an adequate solution in polynomial time to an inherently NP-hard problem. The results show that our algorithm achieves better accuracy than a genetic algorithm that uses only Hardy-Weinberg equilibrium.
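
The two genetic models named above can be illustrated with a short sketch. It shows the textbook Hardy-Weinberg genotype frequencies for a biallelic locus and a mutual-information score between the alleles at two loci. Normalizing by the geometric mean of the two entropies is an assumption, since the abstract does not say which normalization is used, and the function names are hypothetical. The genetic algorithm itself is not shown.

```python
import math
from collections import Counter


def hwe_genotype_frequencies(p):
    """Expected genotype frequencies at a biallelic locus under
    Hardy-Weinberg equilibrium, given the frequency p of allele A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalized_mutual_information(locus_a, locus_b):
    """Mutual information between the alleles observed at two loci,
    divided by sqrt(H(A) * H(B)) (normalization assumed for this sketch).

    locus_a[i] and locus_b[i] are the alleles carried by haplotype i.
    """
    h_a, h_b = entropy(locus_a), entropy(locus_b)
    if h_a == 0 or h_b == 0:
        return 0.0  # a monomorphic locus carries no information
    h_ab = entropy(list(zip(locus_a, locus_b)))
    return (h_a + h_b - h_ab) / math.sqrt(h_a * h_b)
```

A score near 1 means the alleles at the two loci are almost always inherited together (strong linkage disequilibrium), while a score near 0 means they look independent.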

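Classification is one of the most important techniques in data mining. Through classification, we can process known data, discover the rules hidden in it, and later use those rules to make predictions about unknown data. It is widely applied in daily life. In medicine, for example, the technique can uncover hidden rules in patients' genetic features, and these rules can then be applied to other patients, which speeds up the medical workflow and gives doctors additional evidence for diagnosis. Data mining is therefore an important technique and discipline, and with the arrival of Big Data it is all the more necessary for extracting the meaningful information hidden in data. In this thesis, we study the binary and multi-class AdaBoost (Adaptive Boosting) methods. Each sample is first assigned a weight, and multiple weak classifiers are then trained by changing the sample weights. After training, the weak classifiers are combined into one strong classifier, which can be used to make predictions about unknown data. We report experimental results of the AdaBoost algorithm on the colon cancer, breast cancer, 8OX, and Iris datasets.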
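
The weight-update loop described above can be sketched for the binary case as follows. This is a minimal illustration, not the thesis code: using decision stumps as the weak classifiers is an assumption (the abstract does not name the weak learner), labels are taken to be +1 and -1, and the function names are hypothetical. The multi-class variant is not shown.

```python
import math


def train_stump(X, y, w):
    """Find the threshold stump with the lowest weighted error.

    Returns (error, feature_index, threshold, polarity).
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity


def adaboost_fit(X, y, rounds=10):
    """Train `rounds` weak classifiers; return a list of (alpha, stump)."""
    n = len(X)
    w = [1.0 / n] * n  # every sample starts with the same weight
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Strong classifier: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

A one-dimensional set such as `X = [[1], [2], [3], [4], [5], [6]]` with labels `[1, 1, -1, -1, 1, 1]` cannot be separated by any single stump, which makes it a convenient toy case for seeing how the combined vote differs from its individual weak classifiers.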
