Machine Translation: the Right Expectations, the Right Priorities

Machine Translation today is a real productivity service. The performance data MT services deliver has made this conclusion obvious, and many organizations with the right characteristics have now decided to adopt the technology to support their workflows, cut costs and save resources.

Will the rollout of a Neural Machine Translation solution lead you into the space age? Oh, well – yes it will. But it is important to manage our expectations.

So what should you expect from introducing the technology into your workflows? Well, it all depends on how you set out to change your world: start small and expect modest gains; start large and see results earlier. Adjust your expectations to how much ammunition you have.

Machine Translation and the right rollout conditions

“It is always important not to expect very radical improvements in cost and throughput when starting out on small projects,” warns Globalese CEO Gábor Bessenyei in an interview he gave together with CrossLang’s Luc Meertens to the TAUS blog recently. Small projects will not produce giant results. Gábor cited an exemplary rollout by a Turkish company: “This company managed to double productivity to 5,000 or 6,000 words a day – about twice the rate of the human-only process. But we should be very careful not to believe or spread stories about ten-fold productivity figures from MT projects.” More often than not, however, end clients and LSPs alike do tend to expect too much from an MT deployment. Gábor is certain it is not only an issue of not understanding the technology well enough and expecting too much based on very limited corpora; the problem often lies in overestimating the benefits of an MT deployment.

Win some, lose some: take good care of your ecosystem

“End clients think they can save a lot of money, while at the same time they don’t have the right compensation package in place to pay either their LSPs or their translators. It’s very important that there is an ecosystem process in place whereby everyone can see how they benefit from automation. For example, translators ought to be able to see a benefit from the introduction of MT – such as applying a lower word rate but being able to work much faster. Rolling out a new pricing model should be done very carefully.” Pricing schemes and compensation packages are an absolute must – as Andrew Joscelyne, the author of the TAUS article, puts it: “The vision of transparency that Gabor pictures is real. We can track translation throughput on a real-time basis and share the reporting with translators and clients. We would whole-heartedly agree that this type of business intelligence is part-and-parcel of the paradigm shift that NMT is taking the industry through these days.”

Gábor believes the technological change driven by MT technology is happening fast. He thinks, however, that while the technology may render traditional generic translation obsolete in many contexts (e.g. travel guides, menus, other B2C content), it will also create ample opportunities, especially in the areas of content management and quality assurance. Read the full article on TAUS here.

Meet us at GALA Munich!

We will be at GALA Munich – meet us between the 25th and the 27th: check in below and we will be waiting for you at our booth at the designated time!

Live Webinar: Deploying Neural Machine Translation in the CIS

[lead]How has Neural Machine Translation (NMT) changed the world for the countries of the CIS?[/lead] Neural Machine Translation has changed the landscape for many languages and regions. In the era of Statistical and Rule-based Machine Translation, output for many languages spoken in CIS countries was of very moderate quality. The application of MT for these languages remained rather theoretical until the rollout of the more robust Neural Machine Translation technology.

In this live webinar hosted by GALA, Gábor Bessenyei, CEO of the Globalese Neural Machine Translation system, and Mikhail Gilin, Head of QAD and R&D at TransLink, will not only provide you with an overview of the technology, but will also discuss the pros and cons of Neural MT. They will share the experience they have gained during the implementation of NMT in the region, speak about the selection process and criteria for MT tools, and explain how they have integrated NMT into the daily workflow of TransLink. They will also discuss the impact of the technology on the different actors of the translation ecosystem, such as post-editors, PMs and customers. Join the webinar to see first-hand a live deployment of Neural Machine Translation in the workflow of a leading LSP.

Time and date:
11:00 EST (17:00 CEST)

Breaking the terminology barrier in Neural Machine Translation

[lead]One challenge Neural Machine Translation technology faces today stems from the very same thing which makes it so amazing and effective. Let's see how Globalese solves the Terminology Problem with the help of AIDA.[/lead]  
The end of the second act of the opera Aida in the Verona Arena in July 2011. – AIDA, Automated In-Domain Adaptation is probably not as grandiose, but probably similarly spectacular for terminology-savvy users of Neural Machine Translation. Photo by Jakub Hałun, CC BY-SA 4.0

Neural Machine Translation was an amazing breakthrough from many points of view. It improved the overall quality of machine translation compared to pre-neural times, and it provided, for the first time, truly usable and sound-quality output for the language industry. It also opened up opportunities for languages like Japanese, Chinese or Russian, which had performed poorly with Statistical MT technology.

The downside of the Neural Machine Translation revolution: terminology

As with every groundbreaking invention, NMT technology also has its limitations. One of the major issues with Neural is handling terminology, and the challenge stems from the very thing that makes NMT so exciting. With Statistical MT, users could provide a terminology list that the system could safely rely on during translation; in the NMT world, there is no direct way to supply a master terminology for the translation process. Technically, you can of course feed a glossary to an engine as part of the training corpora, but it will not act the way you would expect: the engine will not prioritize the translations in the glossary over the content in the rest of the training data. In NMT technology, there is currently no way to influence the terminology translation directly during the machine translation process.

Are you a content owner or an LSP? Give Globalese a go now and grow your business with the power of Neural MT! Click here and start your free trial now!

That doesn’t mean that developers haven’t made attempts to solve this issue. One of the solutions we have seen from many MT providers is terminology replacement based on a glossary after the machine translation phase. While it certainly sounds promising, the results are unfortunately not always that encouraging. The problem is that you run a considerable risk of losing grammatical information during the replacement process – just imagine the problems a changed gender of a word can cause in German. In better cases, you will have to spend many hours of editing to fish out the problematic bits. In worse cases, you end up with output of limited usability that leaves you, your clients and your translators disappointed.
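To see why post-translation replacement is risky, here is a minimal sketch (not any particular vendor's implementation) of naive glossary substitution. The glossary entry and sentence are invented for illustration: the engine produced a masculine article agreeing with "Tisch", and the blind swap to the feminine "Tafel" breaks the agreement.

```python
# Minimal sketch of naive post-translation glossary replacement,
# illustrating how it can break grammatical agreement in German.
# The glossary and sentence are hypothetical examples.

GLOSSARY = {"Tisch": "Tafel"}  # preferred term: Tisch (masc.) -> Tafel (fem.)

def replace_terms(mt_output: str, glossary: dict) -> str:
    """Swap each glossary source term for its target term, word by word."""
    return " ".join(glossary.get(w, w) for w in mt_output.split())

mt_output = "Stellen Sie den Tisch in die Ecke ."
print(replace_terms(mt_output, GLOSSARY))
# -> "Stellen Sie den Tafel in die Ecke ."
# "Tafel" is feminine, so the article should be "die", not "den":
# the replacement silently produced ungrammatical output.
```

Fixing such agreement errors afterwards is exactly the hours of editing described above.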

Introducing automated in-domain adaptation (AIDA)

Globalese answers this challenge by introducing its proprietary technology, automated in-domain adaptation, which provides an as-yet unparalleled improvement. So what is this all about? Using automated in-domain adaptation, you, as a Globalese user, can mark content from the training data of an engine as the most important in-domain content. For example, if you have a Translation Memory (TM) of medical device documentation, you can mark it as the master TM. Globalese will analyze the content of the master TM(s) and extend the engine only with similar and related training data from the auxiliary TMs. Additionally, the engine will be tuned based on the master TM. The result is a highly customized engine focusing on the content of the master TM.
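Globalese's actual selection algorithm is proprietary, but the general idea of keeping only auxiliary material that resembles the master TM can be sketched with a crude token-overlap similarity. All segment texts and the threshold below are invented for illustration.

```python
# Illustrative sketch only: select auxiliary segments that resemble the
# master TM, using Jaccard token overlap as a stand-in for a real
# similarity measure. Not Globalese's actual algorithm.

def tokens(text: str) -> set:
    # lowercase and strip basic punctuation for a crude comparison
    return set(text.lower().replace(".", "").replace(",", "").split())

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_auxiliary(master_segments, aux_segments, threshold=0.2):
    """Keep auxiliary segments whose best match against any master
    segment reaches the similarity threshold."""
    master_sets = [tokens(s) for s in master_segments]
    return [seg for seg in aux_segments
            if max(jaccard(tokens(seg), m) for m in master_sets) >= threshold]

master = ["Insert the catheter into the device port.",
          "Sterilize the device before each use."]
aux = ["Clean the device port after use.",            # related -> kept
       "The quarterly revenue grew by ten percent."]  # unrelated -> dropped
print(filter_auxiliary(master, aux))
# -> ['Clean the device port after use.']
```

A production system would use a far more robust similarity measure, but the principle is the same: the master TM defines the domain, and auxiliary data is admitted only if it fits.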

Maxing out terminological accuracy and keeping quality

The result of this process is an engine where the wording and style of the master TM take priority over the rest of the training data, even where terms compete. This way, you can reach a maximum level of terminology accuracy without losing grammatical information or degrading the overall language quality. Naturally, the cleaner and more up-to-date your master TM is in the relevant topic or domain, the better the overall quality will be. This innovative Globalese solution to the terminology barrier of Neural MT technology paves the way to even better optimized workflows, which means content owners and Language Service Providers can save considerable time and resources in post-editing output.


Stock corpora for training Machine Translation engines

Since the introduction of core and auxiliary corpora in version 3.1, we have seen both successful and less successful MT engines trained in Globalese. The successful ones usually have ample and well-maintained core corpora (which we have renamed to 'master' in version 3.5 to resonate more with CAT tool users), have plenty of auxiliary corpora to use as the foundation, and are used to translate texts that come from the domain they were trained in. The less successful ones may not have enough meaningful core corpora or enough auxiliary corpora, or are occasionally used for a different domain than the core corpora.

(Side note: the same engine will produce astonishing BLEU, TER etc. scores for content from its own domain, and terrible results when used to translate texts from a domain it barely knows. The question "what is the BLEU score of my engine?" is meaningless, unless you are comparing engines trained on different platforms from the same corpora and translating the same text.)

While we cannot help our users magically double the size of their master TMs overnight, we try our best to help them build on a solid foundation. Any neural MT engine, be it generic or trained in a specific domain, needs a certain amount of foundation corpora to 'learn' the languages. (The exact value of 'certain' is debated and probably always will be.) We have also seen users either turning to us for help, or uploading the same publicly available corpora time and time again.
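The side note about BLEU can be made concrete with a toy calculation. The sketch below is a simplified BLEU (unigram and bigram precision with a brevity penalty; real BLEU uses up to 4-grams and smoothing), and the sentences are invented, but it shows the point: the same hypothesis scores completely differently depending on the test set, so a score is a property of an engine plus a test set, never of the engine alone.

```python
# Simplified BLEU sketch: the score depends on the test set, not just
# the engine. Uses only unigram+bigram precision (real BLEU: 4-grams).
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 2) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the device must be sterilized before use"        # hypothetical MT output
in_domain_ref = "the device must be sterilized before use"
out_domain_ref = "quarterly revenue grew by ten percent this year"
print(round(bleu(hyp, in_domain_ref), 2))   # 1.0 on in-domain text
print(round(bleu(hyp, out_domain_ref), 2))  # 0.0 on out-of-domain text
```

The numbers are extreme on purpose, but the lesson carries over to real evaluations: compare engines only on the same held-out text.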

The advantages of using Globalese stock corpora

Globalese 3.5 now offers the possibility to use stock corpora for training engines. (Note that this feature is only available in cloud-based systems.) We believe we have a few sound reasons for implementing this feature:
  • It saves users the struggle of downloading massive files from the internet, splitting them into chunks and uploading them to Globalese. (A win already for everyone who's ever been through this.)
  • We keep these corpora updated, so when a new version comes out, we will make sure it is updated in the corpus repository as well. (You don't have to change anything in the engine. Any time you retrain it, the latest stock corpora will be used.)
  • We are actively seeking new and better ways to improve the filtering of these corpora to give our users a better training foundation — and we don't just mean regular expressions, but also putting AI to work.
  • In the future, this will also save you training time — keep an eye on our release notes.
Auxiliary corpora will continue to be used the same way as before, i.e. they will be filtered according to closeness in domain to the master corpora. In this respect, stock corpora are auxiliary corpora, only from a different source.

Where do stock corpora come from?

Our stock corpora are based on corpora publicly available on the internet. Some of them come from manually maintained translation memories, while others are automatically aligned. (We ourselves do not crawl the internet to create automatically aligned corpora.) We run these corpora through various pipelines before making them available in Globalese.

So what's the use of auxiliary corpora now that they are provided by Globalese as stock corpora?

First of all, keep in mind that we cannot provide stock corpora for every language pair, so in a number of cases you'll still have to provide all the training material. Secondly, users are still encouraged to use their own TMs as auxiliary corpora, for the simple reason that they may be of higher quality than some of the stock corpora.

Why can't I pick the stock corpora I want to use?

Easy: Globalese does the filtering for you, based on your master corpora.

What can I do if there are no stock corpora for my language combination?

Again, easy: just let us know!

Should I change anything in my existing engines?

If you have engines where you have been using corpora from popular sites such as OPUS, chances are that those corpora are available as stock. Just edit the engine and see if you can tick the Use stock corpora checkbox. If you're not sure, just ask! Once you have removed the corpora you (or we) uploaded from all of your engines, you can delete them — you will never need them again.