Today was a big day, the culmination of months of reading, researching, and rehearsing. All for an hour’s presentation. Thankfully, it’s over with – I’ve never been a fan of public speaking. Here are a few highlights:
Aligned data is important for machine translation. It’s used to train the models that decide how to convert one language into another. The larger the volume of accurate alignments we can feed into our machine translation systems, the better the parameter estimations and the more accurate the results. Sentence alignment is that all-important first step in machine translation. Before words & phrases can be aligned, bilingual texts need to be broken down into bite-sized chunks.
Fortunately, there’s a very high correlation between the lengths of a text and its translation. For example, Gale and Church report a correlation of .991 between the character lengths of English & German paragraphs. This allows us to use Bayes’ theorem to estimate how likely each possible alignment point is, given only the lengths of the sentences involved. A high level of accuracy (~96%) can be achieved by looking solely at character lengths and ignoring the actual words themselves.
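To make the length-based idea concrete, here’s a rough sketch in Python of a Gale-and-Church-style aligner. It’s a simplification, not their program: the function names are mine, the constants (`C`, `S2`, and the `PRIORS` table) are the values as I remember them from the paper and should be treated as illustrative, and I normalise by the mean of the two lengths so that insertions and deletions don’t divide by zero.

```python
import math

# Illustrative constants (assumptions for this sketch, not tuned values).
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the length difference per character
# Prior probability of each bead type: (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def length_cost(len_src, len_tgt):
    """-log of the two-sided Gaussian tail probability of the length gap."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # 2 * (1 - Phi(|delta|)) equals erfc(|delta| / sqrt(2)).
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300))

def align(src, tgt):
    """Return beads [(src_indices, tgt_indices), ...] with minimum total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                # Bayes: -log P(bead) - log P(length gap | bead)
                c = cost[i][j] - math.log(prior) + length_cost(len_src, len_tgt)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

Calling `align(source_sentences, target_sentences)` with two lists of strings returns the index pairs of each aligned group. Notice that the words never get looked at, only `len()` of each sentence.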
An even higher level of accuracy is achievable if we incorporate lexical information into the mix. By modeling sentence lengths with a Poisson distribution instead of a Gaussian and implementing various methods of search pruning, we can narrow down the search space so that we’re only spending computing time on alignments that are likely to be correct. By combining a modified sentence-length-based model with a modified version of IBM’s Model 1, error rates of less than 1% can be achieved. Some results are even better than hand-aligned data. Not bad!
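Here’s a small sketch of the two scores that get combined in this kind of approach. Big caveats: this is plain textbook Poisson and plain IBM Model 1 (without the constant factor), not Moore’s modified versions, and the function names, the `ratio` parameter, the `t_table` dictionary, and the `floor` value are all hypothetical stand-ins of mine.

```python
import math

def poisson_length_logprob(len_src, len_tgt, ratio):
    """log P(len_tgt | len_src) under a Poisson with mean len_src * ratio.

    ratio is assumed to be mean target length / mean source length (in words),
    and len_src is assumed to be at least 1.
    """
    mean = len_src * ratio
    return len_tgt * math.log(mean) - mean - math.lgamma(len_tgt + 1)

def model1_logprob(src_words, tgt_words, t_table, floor=1e-9):
    """IBM Model 1 log-score of tgt_words given src_words (constant dropped).

    t_table maps (target_word, source_word) to a translation probability;
    pairs that are missing fall back to a small floor value.
    """
    src = ["<NULL>"] + list(src_words)  # Model 1 allows aligning to an empty word
    logp = -len(tgt_words) * math.log(len(src))
    for t in tgt_words:
        logp += math.log(sum(t_table.get((t, s), floor) for s in src))
    return logp

def combined_score(src_words, tgt_words, ratio, t_table):
    """Sum of the length score and the lexical score for one candidate pair."""
    return (poisson_length_logprob(len(src_words), len(tgt_words), ratio)
            + model1_logprob(src_words, tgt_words, t_table))
```

The pruning part isn’t shown here. The rough idea is to skip any candidate pair whose length score alone is already hopeless, so the more expensive lexical score only runs on plausible alignments.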
Sources:
- W. Gale and K. Church. A Program for Aligning Sentences in Bilingual Corpora. Association for Computational Linguistics, 1991.
- R. Moore. Fast and Accurate Sentence Alignment of Bilingual Corpora. Proceedings, 5th Conference of the Association for Machine Translation in the Americas, Tiburon, California. Springer-Verlag, Heidelberg, Germany, 2002, pp. 135-144.