Composite engines in Globalese 2
Greg Horváth
Not all language pairs are created equal. Anyone who has experience with Statistical Machine Translation (SMT) knows it is always easier to get good results from an English to Spanish engine than from, say, a French to Japanese one.
The concept of composite engines makes its debut in Globalese 2.0. Every Globalese engine now includes a phrase-based and a hierarchical part. These reflect two different approaches in SMT. The two parts generate two different translation candidates for the same source sentence. Thanks to quality estimation algorithms, a well-calibrated composite engine is able to choose the better candidate, sentence by sentence, automatically.
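To make the selection step concrete, here is a minimal sketch in Python. It is not Globalese code: the function name pick_candidates and the estimate callback are hypothetical, and the estimator is assumed to return a higher number for a better candidate. It only illustrates the idea of scoring both candidates and keeping the better one, sentence by sentence.

```python
from typing import Callable, List, Tuple

# Hypothetical estimator type: maps (source, candidate) to a quality score,
# where a higher score is assumed to mean a better translation.
Estimator = Callable[[str, str], float]


def pick_candidates(
    sources: List[str],
    phrase_based: List[str],
    hierarchical: List[str],
    estimate: Estimator,
) -> List[Tuple[str, str, float]]:
    """Return (component, translation, score) for each source sentence."""
    chosen = []
    for src, pb, hier in zip(sources, phrase_based, hierarchical):
        pb_score = estimate(src, pb)
        hier_score = estimate(src, hier)
        # Keep whichever candidate scores higher; ties go to phrase-based.
        if hier_score > pb_score:
            chosen.append(("hierarchical", hier, hier_score))
        else:
            chosen.append(("phrase-based", pb, pb_score))
    return chosen
```

Because the decision is made per sentence, one translated file can mix output from both components.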
With composite engines, Globalese can provide better results for all language pairs. The phrase-based and the hierarchical components are both capable of generating good translation candidates. The hard task is figuring out which one is better for a given source sentence, even if we know that, generally speaking, one approach suits language combination A better and the other yields better results for combination B. So how do we know?
Enter quality estimation models
Quality Estimation Models (QEMs) were introduced in Globalese 1.5 and have been further refined with each subsequent release. They assign an estimated score (from 0% to 100%, as in any familiar CAT tool) to every sentence Globalese translates. Project managers can use these scores to improve time and resource planning. QEMs are also used for setting a quality threshold that lets only the really usable machine-translated content back into the translated files.
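The threshold idea can be sketched in a few lines of Python. This is an illustration only, not how Globalese implements it: the function names apply_threshold and share_above, and the tuple layout (source, MT output, score in percent), are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Assumed layout: (source sentence, MT output, estimated score in percent).
Scored = Tuple[str, str, float]


def apply_threshold(
    scored: List[Scored], threshold_percent: float
) -> List[Tuple[str, Optional[str]]]:
    """Keep MT output only where the estimated score reaches the threshold."""
    kept = []
    for source, mt_output, score in scored:
        if score >= threshold_percent:
            kept.append((source, mt_output))
        else:
            # Below the threshold: leave the target empty for a translator.
            kept.append((source, None))
    return kept


def share_above(scored: List[Scored], threshold_percent: float) -> float:
    """Fraction of segments at or above the threshold, useful for planning."""
    if not scored:
        return 0.0
    hits = sum(1 for _, _, score in scored if score >= threshold_percent)
    return hits / len(scored)
```

A project manager could use a figure like share_above to estimate how much of a file still needs translation from scratch.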
What about the rest of the MT output? The good or bad news (depending on who’s asking) is: there will always be work for translators or post-editors (depending on who’s reading). Globalese is a tool for increasing translation productivity, not for forcing your seasoned translators into early retirement. (Make sure you tell them!)
With the introduction of composite engines in Globalese 2.0, the role of QEMs is more important than ever. A well-calibrated composite engine can make the right decisions when cherry-picking from translation candidates. This also means that no engine can be used without a quality estimation model, which is a change in Globalese 2.0.
The secret of calibrating an engine well lies in choosing the right asset. The same guidelines apply as when selecting a tuning set: the asset should not be part of the corpora used for training the phrase table(s) in the engine, but it should still be relevant to the content to be translated.
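One way to check the first guideline before calibrating is to filter out any segment of the candidate asset that also occurs in the training corpora. The Python sketch below is a rough illustration under assumptions: the helper names normalize and held_out are hypothetical, both inputs are assumed to be lists of (source, target) pairs, and matching is done on the source side after simple case and whitespace normalization. Relevance to the translatable content still has to be judged by a person.

```python
from typing import List, Tuple

SegmentPair = Tuple[str, str]  # assumed layout: (source, target)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def held_out(
    asset: List[SegmentPair], training: List[SegmentPair]
) -> List[SegmentPair]:
    """Return the asset segments whose source does not occur in training."""
    seen = {normalize(source) for source, _ in training}
    return [pair for pair in asset if normalize(pair[0]) not in seen]
```

If the filtered list is much shorter than the original asset, the asset overlaps heavily with the training data and is probably a poor choice for calibration.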
What lies ahead
Composite engines are just one milestone in the evolution of Globalese. They are not the ultimate goal, the holy grail, or the best thing since sliced bread. But they are an important milestone, one that lets us focus on the next addition that will bring Globalese 3 a step closer to being the machine translation tool we want it to become.