
Locality-Sensitive Hashing for Sentence Retrieval Applied to Example-Based Machine Translation

應用地點敏感散列於基於實例的機器翻譯進行實例檢索

Advisor: 蘇豐文

Abstract


Nowadays, as information technologies become increasingly necessary for analyzing large volumes of data, computational processes that emphasize the data rather than a set of predefined rules result in more scalable and flexible systems. Machine translation systems under the example-based machine translation (EBMT) paradigm are a good example of an outcome obtained from the analysis of a large volume of data rather than from the predefinition of grammatical translation rules. The EBMT paradigm is based on the analogy principle: two sentences annotated with a similar grammatical structure will preserve that grammatical similarity after being translated into some target language. Therefore, an arbitrary new sentence can be translated by looking up a previously translated sentence with a similar grammatical structure. The goal of this research is to present the details of an implementation of the Locality-Sensitive Hashing (LSH) scheme as an approach for building an indexing mechanism for retrieving sentences in the EBMT framework. A data set consisting of thousands of sentences was downloaded from the Open American National Corpus (OANC) project and parsed with the Stanford CoreNLP parser. The sentences were then transformed into vectors in Euclidean space, using part-of-speech (POS) tags as the mapping unit, to yield a data set that can simulate an EBMT example database. The LSH scheme is used as an indexing mechanism for querying an example database designed around the analogy principle. Finally, Structured String-Tree Correspondences were used to guide the translation process between a new input sentence and a previously translated sentence with a similar grammatical structure retrieved from the example database. Section 2 introduces the theory behind the EBMT framework and the LSH scheme in order to provide a basis for understanding the implementations explained in the subsequent sections.
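The mapping from sentences to Euclidean vectors via POS tags described above can be sketched as a simple bag-of-tags count. This is an illustrative reconstruction, not the thesis's actual code: the tag inventory below is an assumed subset of the Penn Treebank tags that Stanford CoreNLP produces, and the example tagging is hypothetical.

```python
# Assumed subset of the Penn Treebank POS tag inventory; the full thesis
# implementation would use every tag emitted by Stanford CoreNLP.
POS_TAGS = ["DT", "NN", "NNS", "VBD", "VBZ", "JJ", "IN", "PRP"]

def sentence_to_vector(pos_tags):
    """Map a POS-tagged sentence to a fixed-length count vector.

    Each coordinate counts the occurrences of one POS tag, so sentences
    with similar grammatical structure land near each other in
    Euclidean space.
    """
    vec = [0.0] * len(POS_TAGS)
    for tag in pos_tags:
        if tag in POS_TAGS:
            vec[POS_TAGS.index(tag)] += 1.0
    return vec

# "The dog chased a cat" tags as DT NN VBD DT NN:
v = sentence_to_vector(["DT", "NN", "VBD", "DT", "NN"])
```

Two sentences such as "The dog chased a cat" and "A boy kicked the ball" yield identical vectors under this mapping, which is exactly the grammatical similarity the analogy principle relies on.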
The objective of Section 2 is to uncover the theory behind the LSH scheme in order to grasp the theoretical guidelines used for the implementation of an EBMT example database based on LSH. Section 3 describes how the parameters of the LSH algorithm are chosen so that a given query can be answered efficiently. A sample query set, selected at random from the data set, is used to analyze the average search cost and to estimate the best parameters for building an index structure able to handle any further query. Section 4 presents the implementation details of the LSH scheme for indexing the examples of a bilingual database in the EBMT framework. The theoretical background introduced in Section 2 guides the construction of a set of hash functions used as indexes to store each data point of the data set in a set of hash tables. Section 5 explains how to generate a structure tree for a set of translated sentences and then apply the same method to a new input sentence. The LSH scheme is implemented to generate an index structure for finding previously translated sentences with a similar grammatical structure, and Structured String-Tree Correspondences then represent the association between a pair of translated sentences in order to guide the translation of the input sentence.
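The hash-table construction outlined above can be illustrated with a minimal p-stable LSH index for Euclidean vectors. This is a generic sketch of the standard technique, not the thesis's implementation; the class name and the default parameter values (number of tables, hashes per table, bucket width) are assumptions chosen for readability, whereas the thesis tunes such parameters empirically in Section 3.

```python
import math
import random
from collections import defaultdict

class EuclideanLSH:
    """Minimal p-stable LSH index for Euclidean vectors (illustrative parameters)."""

    def __init__(self, dim, num_tables=4, hashes_per_table=3, bucket_width=2.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        # Each table holds k hash functions h(v) = floor((a.v + b) / w),
        # where a has i.i.d. Gaussian entries and b is uniform in [0, w).
        self.tables = []
        for _ in range(num_tables):
            funcs = [([rng.gauss(0, 1) for _ in range(dim)],
                      rng.uniform(0, bucket_width))
                     for _ in range(hashes_per_table)]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, v):
        # Concatenating k hash values sharpens bucket selectivity.
        return tuple(math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
                     for a, b in funcs)

    def insert(self, v, label):
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, v)].append((v, label))

    def query(self, v):
        """Return candidate neighbors found in any table's matching bucket."""
        candidates = {}
        for funcs, buckets in self.tables:
            for vec, label in buckets.get(self._key(funcs, v), []):
                candidates[label] = vec
        return candidates
```

Using several tables raises the probability that at least one bucket captures a true near neighbor, at the cost of extra memory, which is exactly the trade-off the parameter-selection step in Section 3 addresses.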


