Automatic Detection and Language Identification of Multilingual Documents
by Lui et al.
This work drops the usual monolingual assumption and addresses language identification (LID) for documents that contain more than one language. (Most related work assumes large documents and a small number of candidate languages.) The method learns to label multilingual documents from monolingual training data alone, since no standard corpus of multilingual documents is available.
Learning algorithms applied to language identification fall into two general categories:
- Bayesian classifiers: Markov, naive Bayes, compressive methods
- nearest-prototype classifiers: variants differ in the distance measure used, e.g. rank-order statistics, information-theoretic measures, string kernels, etc.
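To make the nearest-prototype idea concrete, here is a minimal sketch: each language gets a prototype byte n-gram frequency vector built from monolingual text, and a document is assigned to the language whose prototype is closest. The training strings and the choice of cosine similarity are illustrative assumptions, not the paper's exact setup.

```python
from collections import Counter
import math

def byte_ngrams(text: str, n: int = 2) -> Counter:
    """Frequency distribution over byte n-grams of a string."""
    b = text.encode("utf-8")
    return Counter(bytes(b[i:i + n]) for i in range(len(b) - n + 1))

def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Toy monolingual training data (hypothetical, for illustration only).
training = {
    "en": "the quick brown fox jumps over the lazy dog in the end",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}
# One prototype per language: its byte-bigram distribution.
prototypes = {lang: byte_ngrams(txt) for lang, txt in training.items()}

def identify(text: str) -> str:
    """Nearest-prototype classification: pick the most similar language."""
    grams = byte_ngrams(text)
    return max(prototypes, key=lambda lang: cosine(grams, prototypes[lang]))
```

Swapping `cosine` for a rank-order or information-theoretic distance recovers the other variants listed above.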
The task is cast as multilabel classification: a document may carry several language labels at once.
Each document is represented as a frequency distribution over byte n-gram sequences. Features are selected using information gain, in combination with a naive Bayes classifier.
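Information gain scores a feature by how much knowing its presence reduces uncertainty about the language label. A minimal sketch, assuming binary presence/absence features over hypothetical toy documents (the paper's actual selection also weighs multilingual vs. monolingual gain):

```python
import math
from collections import Counter

def entropy(counts) -> float:
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def info_gain(docs, labels, feature) -> float:
    """IG of a presence/absence feature w.r.t. the language labels.

    docs   -- list of sets of n-grams, one set per document
    labels -- parallel list of language labels
    """
    h = entropy(list(Counter(labels).values()))       # H(label)
    present = [l for d, l in zip(docs, labels) if feature in d]
    absent  = [l for d, l in zip(docs, labels) if feature not in d]
    h_cond = 0.0                                      # H(label | feature)
    for subset in (present, absent):
        if subset:
            h_cond += (len(subset) / len(docs)) * \
                      entropy(list(Counter(subset).values()))
    return h - h_cond

# Toy example: "th" appears only in English documents.
docs = [{"th", "e "}, {"th"}, {"de", "er"}, {"de"}]
labels = ["en", "en", "de", "de"]
```

Here `info_gain(docs, labels, "th")` is 1 bit (the feature fully determines the label), while an unseen n-gram scores 0.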
A generative mixture model is used to calculate the probability of the i-th token under each of a set of language labels. Inferring the labels uses the Expectation-Maximization (EM) algorithm, and the model is related to Labeled Latent Dirichlet Allocation. In training, \phi = P(w_i | z_i = j, \mathbf{z}, \mathbf{w}) is estimated by MLE from monolingual data. For a language l and document D, maximizing P(l|D) is equivalent (under a uniform prior over languages) to maximizing P(D|l).
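A minimal EM sketch under stated assumptions: the per-language token probabilities `phi` are fixed (as if MLE-estimated from monolingual data, with a small smoothing constant for unseen tokens), and EM alternates between computing per-token language posteriors (E-step) and re-estimating the document's language proportions (M-step). The vocabulary and probabilities below are hypothetical.

```python
# Hypothetical phi[l][w] = P(w | l), as if estimated by MLE from
# monolingual training data; EPS smooths unseen tokens.
phi = {
    "en": {"the": 0.4, "dog": 0.3, "runs": 0.3},
    "de": {"der": 0.4, "hund": 0.3, "rennt": 0.3},
}
EPS = 1e-4

def em_mixture(tokens, langs=("en", "de"), iters=20):
    """EM over one document: returns language proportions theta and
    per-token posteriors resp[i][l] = P(z_i = l | w_i, theta)."""
    theta = {l: 1.0 / len(langs) for l in langs}  # uniform init
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each language for each token.
        resp = []
        for w in tokens:
            scores = {l: theta[l] * phi[l].get(w, EPS) for l in langs}
            z = sum(scores.values())
            resp.append({l: s / z for l, s in scores.items()})
        # M-step: proportions = average responsibility over tokens.
        theta = {l: sum(r[l] for r in resp) / len(tokens) for l in langs}
    return theta, resp

theta, resp = em_mixture(["the", "dog", "der", "hund"])
```

On this half-English, half-German token list, `theta` converges to roughly 0.5 per language and each token's posterior concentrates on its source language.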
Experiments use ALTW2010, with 10,000 bilingual documents, and WikipediaMulti, a dataset generated from Wikipedia. WikipediaMulti is constructed by randomly selecting a number of languages and joining monolingual sections from each into one synthetic document.
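The synthetic construction above can be sketched as follows; the section pool, the cap of three languages per document, and the helper name are illustrative assumptions, not the paper's exact recipe.

```python
import random

# Hypothetical pool of monolingual sections, keyed by language code.
sections = {
    "en": ["An English paragraph about cats.", "Another English section."],
    "de": ["Ein deutscher Absatz ueber Katzen.", "Noch ein Abschnitt."],
    "fr": ["Un paragraphe en francais.", "Une autre section."],
}

def make_multilingual_doc(rng: random.Random, max_langs: int = 3):
    """Randomly pick 1..max_langs languages, join one section from each,
    and return the document text with its gold label set."""
    k = rng.randint(1, max_langs)
    langs = rng.sample(sorted(sections), k)
    text = "\n".join(rng.choice(sections[l]) for l in langs)
    return text, set(langs)

rng = random.Random(0)  # seeded for reproducibility
doc, gold = make_multilingual_doc(rng)
```

The gold label set is known by construction, which is what makes evaluation possible despite the lack of naturally annotated multilingual corpora.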