Stock corpora for training Machine Translation engines
Greg Horváth
Since the introduction of core and auxiliary corpora in version 3.1, we have seen successful and less successful MT engines trained in Globalese. The successful ones usually have ample and well-maintained core corpora (which we renamed to ‘master’ in version 3.5 to resonate more with CAT tool users), have plenty of auxiliary corpora to use as the foundation, and are used to translate texts that come from the domain they were trained in. The less successful ones may lack sufficient meaningful core corpora or sufficient auxiliary corpora, or are occasionally used for a different domain than that of their core corpora.
(Side note: the same engine will produce astonishing BLEU, TER etc. scores for content from its own domain, and terrible results when used to translate texts from a domain it barely knows. The question “what is the BLEU score of my engine?” is meaningless, unless you are comparing engines trained on different platforms from the same corpora and translating the same text.)
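To make the side note concrete, here is a minimal sketch of a simplified, BLEU-style score applied to two made-up outputs of the same imaginary engine. Everything in it is an assumption for illustration: the function names, the sentences, and the simplification to unigrams and bigrams. It is not how Globalese evaluates engines, and for real measurements a maintained implementation such as sacreBLEU is the better choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(hypothesis, reference, max_n=2):
    """Simplified sentence-level score in the spirit of BLEU.

    Clipped n-gram precision for n = 1..max_n, combined by geometric mean
    and multiplied by a brevity penalty. Illustrative only.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each n-gram count by how often it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical outputs of one engine trained on technical manuals.
in_domain = bleu_like(
    "press the reset button on the control panel",
    "press the reset button on the main control panel",
)
out_of_domain = bleu_like(
    "the defendant pressed the control of appeal",
    "the defendant lodged an appeal against the ruling",
)
print(f"in-domain: {in_domain:.2f}, out-of-domain: {out_of_domain:.2f}")
```

The same scoring function gives the in-domain sentence a much higher score than the legal one, which is why a single score says little about an engine unless the test text and the training corpora are held constant.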
While we cannot help our users magically double the size of their master TMs overnight, we try our best to help them have a solid foundation to build on. Any neural MT engine, be it generic or trained in a specific domain, needs a certain amount of foundation corpora to ‘learn’ the languages. (The exact value of ‘certain’ is debated and probably will always be.) We have also seen users either turning to us for help, or uploading the same publicly available corpora time and time again.
The advantages of using Globalese stock corpora
Globalese 3.5 now offers the possibility to use stock corpora for training engines. (Note that this feature is only available in cloud-based systems.) We believe we have a few sound reasons for implementing this feature:
- It saves users the struggle of downloading massive files from the internet, splitting them into chunks and uploading them to Globalese. (A win already for everyone who’s ever been through this.)
- We keep these corpora updated, so when a new version comes out, we will make sure it is updated in the corpus repository as well. (You don’t have to change anything in the engine. Any time you retrain it, the latest stock corpora will be used.)
- We are actively seeking new and better ways to improve the filtering of these corpora to give our users a better training foundation — and we don’t just mean regular expressions, but also putting AI to work.
- In the future, this will also save you training time — keep an eye on our release notes.
Auxiliary corpora will continue to be used the same way as before, i.e. they will be filtered according to closeness in domain to the master corpora. In this respect, stock corpora are auxiliary corpora, only from a different source.
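The idea of filtering auxiliary segments by closeness to the master corpora can be sketched as follows. This is only a toy illustration under stated assumptions: it measures closeness as the cosine similarity of word counts, and the function names, stop-word list, threshold and sample sentences are all hypothetical. The actual filtering used in Globalese is not described here and is likely to be considerably more sophisticated.

```python
import math
from collections import Counter

# Assumption: a tiny stop-word list so that function words do not dominate.
STOPWORDS = {"the", "a", "an", "and", "on", "of", "before"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words (naive on purpose)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def term_counts(segments):
    """Count how often each token occurs across a list of segments."""
    counts = Counter()
    for segment in segments:
        counts.update(tokenize(segment))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_domain(master_segments, auxiliary_segments, threshold=0.2):
    """Keep auxiliary segments whose vocabulary is close to the master corpus."""
    master_profile = term_counts(master_segments)
    return [
        segment
        for segment in auxiliary_segments
        if cosine_similarity(Counter(tokenize(segment)), master_profile) >= threshold
    ]


# Made-up sample data: a technical master corpus and mixed auxiliary segments.
master = [
    "tighten the hydraulic pump valve",
    "replace the pump seal before testing the valve",
]
auxiliary = [
    "check the valve on the hydraulic pump",
    "the committee adopted the annual budget",
    "she ordered coffee and cake",
]
kept = filter_by_domain(master, auxiliary)
print(kept)
```

With this sample data only the segment about the hydraulic pump survives, because it is the only one that shares content words with the master corpus. Stock corpora would pass through the same kind of selection as any other auxiliary corpus.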
Where do stock corpora come from?
Our stock corpora are based on corpora that are publicly available on the internet. Some of them come from manually maintained translation memories, while others are automatically aligned. (We ourselves do not crawl the internet to create automatically aligned corpora.) We run these corpora through various pipelines before making them available in Globalese.
So what’s the use of auxiliary corpora now that they are provided by Globalese as stock corpora?
First of all, keep in mind that we cannot provide stock corpora for every language pair, so in a number of cases you’ll still have to provide all the training material.
Secondly, users are still encouraged to use their own TMs as auxiliary corpora, for the simple reason that they may be of higher quality than some of the stock corpora.
Why can’t I pick the stock corpora I want to use?
Easy: Globalese does the filtering for you, based on your master corpora.
What can I do if there are no stock corpora for my language combination?
Again, easy: just let us know!
Should I change anything in my existing engines?
If you have engines where you have been using corpora from popular sites such as OPUS, chances are that those corpora are available as stock. Just edit the engine and see if you can tick the Use stock corpora checkbox. If not sure, just ask!
Once you have removed the corpora that you (or we) uploaded from all of your engines, you can delete them, as you will never need them again.