Breaking the terminology barrier in Neural Machine Translation

[lead]One of the challenges Neural Machine Translation faces today stems from the very thing that makes it so effective. Let's see how Globalese solves the terminology problem with the help of AIDA.[/lead]
The end of the second act of the opera Aida in the Verona Arena in July 2011. – AIDA, Automated In-Domain Adaptation is probably not as grandiose, but probably similarly spectacular for terminology-savvy users of Neural Machine Translation. Photo by Jakub Hałun, CC BY-SA 4.0

Neural Machine Translation was an amazing breakthrough in many respects. It improved the overall quality of machine translation compared to pre-neural times. It provided, for the first time, truly usable and sound-quality output for the language industry. It also opened up opportunities for languages like Japanese, Chinese or Russian, which had performed poorly with Statistical MT technology.

The downside of the Neural Machine Translation revolution: terminology

As with every groundbreaking invention, NMT technology has its limitations. One of the major issues is handling terminology, and this challenge stems from the very thing that makes NMT so exciting. With statistical MT, users could provide a terminology list that the system would reliably apply during translation. In the NMT world, there is no direct way to impose a master terminology on the translation process. Technically, you can of course feed a glossary to an engine as part of the training corpora, but it will not behave the way you would expect: the engine will not prioritize the glossary translations over the content in the rest of the training data. In current NMT technology, there is simply no way to directly influence terminology during the machine translation process.

Are you a content owner or an LSP? Give Globalese a go now and grow your business with the power of Neural MT! Click here and start your free trial now!

That doesn’t mean that developers haven’t made attempts to solve this issue. One solution we have seen from many MT providers is terminology replacement based on a glossary after the machine translation phase. While this certainly sounds promising, the results are unfortunately not always encouraging. The problem is that you run a considerable risk of losing grammatical information during the replacement process. Just imagine the problems a changed gender of a word can cause in German. In better cases, you will have to spend many hours of editing to fish out the problematic bits. In worse cases, you end up with output of limited usability that leaves you, your clients and your translators disappointed.
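To see why post-MT glossary replacement breaks, consider a minimal sketch of the mechanism. This is an illustrative toy, not any vendor's actual implementation: terms are swapped into the finished MT output as plain strings, so the surrounding articles and adjectives keep the grammar of the original word. The German sentences and the `apply_glossary` helper below are hypothetical examples.

```python
import re

def apply_glossary(mt_output: str, glossary: dict) -> str:
    """Naively replace glossary terms in raw MT output, word by word."""
    for old, new in glossary.items():
        # Whole-word string substitution -- no awareness of gender,
        # case or number, which is exactly where this approach fails.
        mt_output = re.sub(rf"\b{re.escape(old)}\b", new, mt_output)
    return mt_output

# Forcing the feminine noun "Pumpe" over the neuter "Ventil" leaves
# the neuter article and adjective ending behind:
raw = "Das defekte Ventil muss ersetzt werden."
print(apply_glossary(raw, {"Ventil": "Pumpe"}))
# -> "Das defekte Pumpe muss ersetzt werden."  (ungrammatical German)
```

The replacement itself succeeds, but the sentence around it is now wrong, which is the editing burden described above.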

Introducing automated in-domain adaptation (AIDA)

Globalese answers this challenge by introducing its proprietary technology, automated in-domain adaptation. This technology delivers an improvement unmatched so far. So what is this all about? With automated in-domain adaptation, a Globalese user can mark content from an engine's training data as the most important in-domain content. For example, if a user has a Translation Memory (TM) of medical device documentation, it can be marked as the master TM. Globalese analyzes the content of the master TM(s) and extends the engine only with similar and related training data from the auxiliary TMs. Additionally, the engine is tuned based on the master TM. The result is a highly customized engine focused on the content of the master TM.
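The general idea of selecting auxiliary data by similarity to a master TM can be sketched in a few lines. To be clear: Globalese's actual AIDA algorithm is proprietary and not shown here; the bag-of-words cosine filter, the `select_in_domain` function and the sample segments below are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def bow(text: str) -> Counter:
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_in_domain(master_segments, aux_segments, threshold=0.2):
    """Keep only auxiliary segments that resemble the master TM profile."""
    profile = bow(" ".join(master_segments))
    return [s for s in aux_segments if cosine(bow(s), profile) >= threshold]

master = ["insert the catheter into the vein",
          "sterilize the device before use"]
aux = ["sterilize the syringe before the procedure",   # medical: kept
       "invoice total includes applicable taxes"]      # finance: dropped
print(select_in_domain(master, aux))
```

A production system would use far more robust similarity measures, but the principle is the same: the master TM defines the domain, and unrelated auxiliary content is filtered out rather than allowed to dilute the engine.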

Maxing out terminological accuracy and keeping quality

The result of this process is an engine where the wording and style of the master TM take priority over the rest of the training data, even where terms compete. This way, you can reach maximum terminology accuracy without losing grammatical information or decreasing the overall language quality. Naturally, the cleaner and more up-to-date your master TM is in the relevant topic or domain, the better the overall quality will be. This innovative Globalese solution to the terminology barrier of Neural MT paves the way to better optimized workflows, meaning content owners and Language Service Providers can save considerable time and resources on post-editing.


About MT Engine Quality

Machine Translation (MT) is increasingly becoming part of the standard translation workflow. However, to use MT as a productivity tool for increasing the profitability of projects and decreasing delivery time, it is essential to utilize high-quality MT engines. This post summarizes the most important factors influencing MT engine quality, focusing on Statistical Machine Translation (SMT) among the existing MT technologies.

Relevance

First of all: there are no good or bad MT engines. This may sound strange, but it is true, because the quality of an SMT engine can only be measured in relation to the particular translation project it is used in. You can have a perfect engine for translating medical device documentation, but the same engine will perform poorly in an ERP software interface translation project. The reason is very simple: an SMT engine can generate translations only for the content it is trained on. This is the same scenario as with Translation Memories: a TM with automotive content will not help you in your healthcare marketing translation project. Therefore, you should train different SMT engines for your different projects, and always apply the right engine to the right project to achieve good results.

Volume

This is probably the best-known influencing factor of SMT quality. It is essential to have as much bilingual and monolingual content as possible as a basis for SMT engine training, because the engine will use this to generate the translations. However, there is another, less well-known factor: volume itself is not everything. Adding new corpora can in some cases even lower the quality of the output, if the content covers a different domain or style than the project. The reason is simple: because of the statistical approach, irrelevant content will mislead your engine, so adding more volume only helps if it is relevant to your project. Less is in many cases more, and you should always add only relevant content to your engines.

Content type

When running SMT for your projects, you should always keep in mind that SMT performs differently for different content types. Documents with a controlled source and shorter sentences, such as technical documentation or user interfaces, are very good candidates for SMT. On the other hand, if you run SMT on documents with an uncontrolled source (like blog comments), or on documents with very long and complicated sentences (like legal texts, or marketing texts where you transcreate more than you translate), the result can be disappointing.

Resource quality

The 'garbage in, garbage out' rule applies to SMT too. Engines based on low-quality Translation Memories and incorrect segmentation or alignment will inherently produce low-quality MT. Therefore, always be careful about what content you add to your engine.

Expectation

Last, but not least: quality is also a question of expectation. SMT can be a useful productivity tool, but you should not expect the machine to replace human translators. SMT is like an advanced TM which helps you generate translations where TMs do not return any fuzzy matches. The output will not be perfect in many cases, but it can still be useful for your translators. Depending on your corpora, projects and language pairs, you can expect a 5% to 50% productivity gain with SMT.
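To put that 5% to 50% range in concrete terms, here is a back-of-the-envelope calculation. The daily throughput and project size below are illustrative assumptions, not benchmarks from the article.

```python
WORDS_PER_DAY = 2500      # assumed human translation throughput
PROJECT_WORDS = 100_000   # assumed project size

# A productivity gain of g means effective throughput rises to
# WORDS_PER_DAY * (1 + g), shortening the project accordingly.
for gain in (0.05, 0.25, 0.50):
    days = PROJECT_WORDS / (WORDS_PER_DAY * (1 + gain))
    print(f"{gain:.0%} gain -> {days:.1f} translator-days")
# Baseline without SMT: 100,000 / 2,500 = 40.0 translator-days.
```

Even the low end of the range shaves days off a project of this size; the high end cuts it by a third.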
This article originally appeared in the September issue of Dragosfer, the newsletter of Dragoman Ltd.