About MT Engine Quality
Gábor Bessenyei
Machine Translation (MT) is increasingly becoming part of the standard translation workflow. However, to use MT as a productivity tool that increases the profitability of projects and shortens delivery time, it is essential to use high-quality MT engines. This post summarizes the most important factors that influence MT engine quality, focusing on Statistical Machine Translation (SMT) among the existing MT technologies.
First of all: there are no good or bad MT engines. This may sound strange, but it is true, because the quality of an SMT engine can only be measured in relation to the particular translation project it is used in. You can have a perfect engine for translating medical device documentation, but the same engine will perform poorly if you use it to translate an ERP software interface. The reason is very simple: an SMT engine can generate good translations only for the kind of content it was trained on. This is the same scenario as with Translation Memories (TMs): a TM with automotive content will not help you in your healthcare marketing translation project. Therefore, you should train different SMT engines for your different projects, and you should always apply the right engine to the right project to achieve good results.
The volume of training content is probably the best-known factor influencing SMT quality. It is essential to have as much bilingual and monolingual content as possible as a basis for SMT engine training, because the engine will use this content to generate its translations. However, there is another factor which is not so well known: volume is not everything. Adding new corpora can, in some cases, even lower the quality of the output if the content covers a different domain or style than the project. The reason is simple: because of the statistical approach, irrelevant content will mislead your engine, so adding more volume only helps if it is relevant to your project. In many cases, less is more, and you should add only relevant content to your engines.
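The idea of adding only relevant content can be sketched in a few lines of Python. This is a minimal illustration, not a method described in this post: it assumes that the share of a segment's words that also occur in a sample of the project is a usable stand-in for "relevance", and the function names and the 0.5 threshold are hypothetical choices made for the example.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def relevance(segment, project_vocab):
    """Share of a segment's tokens that also occur in the project sample."""
    tokens = tokenize(segment)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in project_vocab) / len(tokens)

def select_relevant(candidates, project_sample, threshold=0.5):
    """Keep only candidate source segments that look in-domain.

    The threshold is an arbitrary value for illustration (assumption).
    """
    project_vocab = set()
    for sentence in project_sample:
        project_vocab.update(tokenize(sentence))
    return [s for s in candidates if relevance(s, project_vocab) >= threshold]

# A small sample of the project content (medical device documentation).
project_sample = [
    "Connect the infusion pump to the power supply.",
    "Check the pump alarm before starting the infusion.",
]

# Candidate corpus segments: one in-domain, one marketing sentence.
candidates = [
    "Disconnect the pump from the power supply.",
    "Our new summer collection is finally here!",
]

print(select_relevant(candidates, project_sample))
```

With this sample, only the pump sentence is kept; the marketing sentence shares no vocabulary with the project and is left out. A real selection step would need a far larger project sample and a more robust similarity measure, but the principle is the same: filter before you train.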
When running SMT for your projects, you should always keep in mind that SMT performs differently for different content types. Documents with controlled source text and shorter sentences, such as technical documentation or user interfaces, are very good candidates for SMT. On the other hand, if you run SMT on documents with uncontrolled source text (like blog comments) or on documents with very long and complicated sentences (like legal texts, or marketing texts where you have more to transcreate than to translate), the result can be disappointing.
The ‘garbage in, garbage out’ rule applies to SMT too. Engines based on low-quality Translation Memories or on incorrectly segmented or aligned content will inherently produce low-quality MT. Therefore, always be careful about what content you add to your engine.
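As a rough illustration of such a check, the following Python sketch removes translation pairs that are likely to do more harm than good before training. The three filters (empty segments, suspicious length ratio, exact duplicates) and the ratio limit of 3.0 are assumptions chosen for the example, not rules taken from a specific SMT toolkit.

```python
def clean_segments(pairs, max_length_ratio=3.0):
    """Drop translation pairs that are likely to hurt engine training.

    `max_length_ratio` is an illustrative limit (assumption), not a
    recommended value for any particular language pair.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # An empty source or target gives the engine nothing to learn from.
        if not source or not target:
            continue
        # Very different lengths often point to a misaligned pair.
        longer = max(len(source), len(target))
        shorter = min(len(source), len(target))
        if longer / shorter > max_length_ratio:
            continue
        # Exact duplicates add no new information.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        kept.append((source, target))
    return kept

pairs = [
    ("Press the start button.", "Drücken Sie die Starttaste."),
    ("Press the start button.", "Drücken Sie die Starttaste."),  # duplicate
    ("Warning", ""),                                             # empty target
    ("OK", "Wenn das Gerät nicht startet, wenden Sie sich an den Kundendienst."),  # misaligned
]

print(clean_segments(pairs))
```

Only the first pair survives in this example. Checks like these catch the obvious garbage; they do not replace a review of translation quality itself.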
Last but not least, quality is also a question of expectations. SMT can be a useful productivity tool, but you should not expect the machine to replace human translators. SMT is like an advanced TM that helps you generate translations where TMs do not return any fuzzy matches. In many cases the output will not be perfect, but it can still be useful for your translators. Depending on your corpora, your projects and your language pairs, you can expect a productivity gain of 5% to 50% with SMT.
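To get a feeling for what this range means for delivery time, here is a small worked calculation in Python. It assumes that the productivity figure translates directly into more words processed per day, and the project size (100,000 words) and baseline throughput (2,500 words per day) are made-up numbers for illustration only.

```python
def delivery_days(word_count, words_per_day, productivity_gain=0.0):
    """Working days needed when throughput rises by `productivity_gain`.

    A gain of 0.05 means 5% more words per day (assumed interpretation).
    """
    return word_count / (words_per_day * (1.0 + productivity_gain))

# Hypothetical project: 100,000 words at a baseline of 2,500 words per day.
for gain in (0.0, 0.05, 0.50):
    days = delivery_days(100_000, 2_500, gain)
    print(f"gain {gain:.0%}: {days:.1f} days")

# gain 0%: 40.0 days
# gain 5%: 38.1 days
# gain 50%: 26.7 days
```

Under these assumptions, the low end of the range saves about two working days on a 40-day project, while the high end saves about thirteen.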