Automatic Detection and Language Identification of Multilingual Documents
by Lui et al.
This work removes the usual monolingual assumption and addresses language identification (LID) for documents that contain more than one language. (Most related work deals with a small number of large documents.) The method learns to handle multilingual documents from monolingual training data, since no standard corpus of multilingual documents is available.
Background
Learning algorithms applied to language identification fall into two general categories:
 Bayesian classifiers: Markov models, naive Bayes, compression-based methods
 nearest-prototype classifiers: these vary in the distance measure used, e.g. rank-order statistics, information-theoretic measures, string kernels, etc.
Methodology
The task is framed as multilabel classification.
A document is represented as a frequency distribution over byte n-gram sequences. Features are selected using information gain, combined with a naive Bayes classifier.
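A minimal sketch of the two ideas above: counting byte n-grams as the document representation, and scoring a binary feature (n-gram present/absent) by information gain against a language label. The function names and the binary present/absent simplification are my own illustration, not the paper's exact feature-selection pipeline.

```python
import math
from collections import Counter

def byte_ngrams(text, n=2):
    """Frequency distribution over byte n-grams of a document."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def information_gain(df_pos, df_neg, n_pos, n_neg):
    """IG of a binary feature for one language vs. the rest.

    df_pos / df_neg: documents containing the n-gram in the positive /
    negative class; n_pos / n_neg: class sizes.
    """
    def H(p):  # binary entropy
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    total = n_pos + n_neg
    present = df_pos + df_neg
    absent = total - present
    h_present = H(df_pos / present) if present else 0.0
    h_absent = H((n_pos - df_pos) / absent) if absent else 0.0
    return H(n_pos / total) - (present / total) * h_present - (absent / total) * h_absent
```

For example, an n-gram that appears in every document of one language and never elsewhere gets the maximum gain of 1 bit for a balanced two-class split.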
A generative mixture model is used to compute the probability of the i-th token under a set of language labels. Inferring labels involves the Expectation Maximization (EM) algorithm and Labeled Latent Dirichlet Allocation. In training, \phi = P(w_i | z_i = j, z, w) is estimated by maximum likelihood. For a language l and document D, maximizing P(l|D) is equivalent to maximizing P(D|l) (by Bayes' rule, assuming a uniform prior over languages).
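The EM inference step can be sketched as follows: given per-language token probabilities phi (estimated from monolingual data), EM alternates between assigning each token a responsibility under each language and re-estimating the document's mixture weights over languages. This is a toy illustration of the mixture inference, not the paper's full Labeled LDA sampler; `em_mixture_weights` and its fixed iteration count are assumptions for the sketch.

```python
import numpy as np

def em_mixture_weights(phi, token_ids, n_iter=50):
    """Infer a document's mixture weights over languages via EM.

    phi: (L, V) array, phi[j, w] = P(w | z = j) per language (held fixed);
    token_ids: integer token indices of the document.
    """
    L = phi.shape[0]
    theta = np.full(L, 1.0 / L)        # start from a uniform prior
    probs = phi[:, token_ids]          # (L, N): P(w_i | z = j) per token
    for _ in range(n_iter):
        # E-step: responsibility of each language for each token
        r = theta[:, None] * probs
        r /= r.sum(axis=0, keepdims=True)
        # M-step: mixture weights = average responsibility
        theta = r.sum(axis=1) / r.shape[1]
    return theta
```

With two languages whose token distributions barely overlap, a document drawn from one of them drives its weight toward 1, while a half-and-half document keeps weights near 0.5 each.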
Experiment
Experiments are run on ALTW2010, with 10,000 bilingual documents, and on WikipediaMulti, which is also generated from Wikipedia. The latter dataset is constructed by randomly selecting a number of languages and joining a section from each into one document.
