Augmented in-domain engines
Greg Horváth
In the past, LSPs and content owners with a need for MT would often struggle when building engines, because they wouldn’t have the required volumes of specific corpora to train successful engines. To tackle this, Globalese 3.1 introduces the concept of core and auxiliary corpora.
The small corpus struggle
To train a working MT engine, a training corpus of fewer than 100,000 segment pairs is rarely enough, and even 100,000 is just the bare minimum.
In the past, many MT users found it a struggle to put a decent training corpus together. Knowing that the more relevant the training corpus was, the better the engine was expected to perform, they tried to build small but specific engines, including only in-domain data to make sure the style and terminology of the engine would align with those of the client or project.
The problem with this approach very often was that the client- or product-specific translation memories would only amount to 30, 50, or 70 thousand segment pairs, but seldom to 100 thousand or more. Therefore, even if the engine was using appropriate terminology, its output would lack fluency and coherence.
To combat this, MT users could often only resort to adding out-of-domain translation memories. This was a trade-off, because fluency would improve, but the engine would get biased. Client- or product-specific words would be suppressed and often completely disappear from the engine output.
Solution: a balance between core corpora and auxiliary corpora
The auxiliary corpus can be:
- Your own TMs that you have in the same language combination.
- Publicly available corpora, from OPUS, the Tilde MODEL Corpus or elsewhere.
- Corpora you obtain from TAUS, META-SHARE or elsewhere.
As long as it is a well-maintained corpus, the auxiliary corpus helps the engine boost its “linguistic” capabilities. (Of course, this is not linguistic knowledge in the academic sense, but something the machine learns about how the source and target languages work.)
How to use it
Simply mark corpora as core when creating/editing an engine. Globalese will take care of the rest during training.
How it works
- Globalese makes sure the core vocabulary is kept and does not get “lost” (i.e. overshadowed by the presumably larger auxiliary corpus).
- The auxiliary corpus is filtered, resulting in a training corpus that contains only entries that are more related to the core corpus.
- At the end of the training phase, the engine is tuned further on the core corpus.
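To make the filtering step more concrete, here is a minimal sketch of one way an auxiliary corpus could be narrowed down to the entries most related to a core corpus. This is not Globalese's actual algorithm, which is not described here. The function names, the word-overlap score, and the `threshold` and `max_pairs` parameters are all assumptions made for illustration.

```python
# Illustrative sketch only: scores each auxiliary segment pair by how many of
# its source-side words also occur in the core corpus, then keeps the best ones.
# The scoring method and all names here are assumptions, not Globalese's API.

def tokenize(text):
    """Naive lowercase whitespace tokenizer (a real system would do more)."""
    return text.lower().split()

def core_vocabulary(core_pairs):
    """Collect the set of source-side words seen in the core corpus."""
    vocab = set()
    for src, _tgt in core_pairs:
        vocab.update(tokenize(src))
    return vocab

def relevance(src, vocab):
    """Fraction of a segment's tokens that also appear in the core vocabulary."""
    tokens = tokenize(src)
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)

def filter_auxiliary(core_pairs, aux_pairs, threshold=0.5, max_pairs=None):
    """Keep auxiliary pairs scoring at least `threshold`, most relevant first.

    `max_pairs` optionally caps the size of the filtered corpus, so a very
    large auxiliary corpus does not translate into a very long training run.
    """
    vocab = core_vocabulary(core_pairs)
    scored = [(relevance(src, vocab), (src, tgt)) for src, tgt in aux_pairs]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    pairs = [pair for _score, pair in kept]
    return pairs if max_pairs is None else pairs[:max_pairs]

if __name__ == "__main__":
    core = [
        ("press the power button", "core target 1"),
        ("the printer is offline", "core target 2"),
    ]
    aux = [
        ("press the red button", "aux target 1"),
        ("the parliament adopted the resolution", "aux target 2"),
        ("the printer button", "aux target 3"),
    ]
    for src, tgt in filter_auxiliary(core, aux):
        print(src, "|", tgt)
```

In this toy example, the parliamentary sentence shares only the word “the” with the core corpus and is dropped, while the two segments about buttons and printers are kept. A production system would likely use a stronger relevance measure than word overlap, but the principle is the same: the size of the auxiliary corpus you supply matters less than how much of it resembles the core.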
How much auxiliary corpus should I add to my engine?
As much as you like. You can add 20 million segments if you want. During training, Globalese will create a filtered training corpus anyway, so you don’t end up training an engine for weeks.
Can I still create big generic engines?
Yes, that is still possible. If you don’t make a distinction between core and auxiliary corpus, Globalese will assume everything is equally important, and therefore use all of the corpora for training.