Author - Greg Horváth

Globalese 3.7 released

The major update in this release is the switch of the underlying neural network architecture from a recurrent neural network (RNN) to the Transformer. The result is simply better translations from engines trained from this release onward.

Further improvements

  • CAT tool plugins now also have access to generic stock engines.
  • Faster file transfers through the XTM connector.
  • Faster loading of projects connected to remote CAT tool projects.

Fixes

  • Delivering remote CAT tool files after pretranslation failed in certain cases.
  • User authentication via the API failed.
  • URLs were truncated in translations.
  • Special characters were missing from translations.

Globalese 3.6 released

Starting with version 3.6, Globalese cloud users can take advantage of ready-to-use generic stock engines for the most common language combinations. This addition aims to provide a quick solution in scenarios where a custom engine cannot yet be trained because there is not enough in-domain training material. At the time of writing, stock engines are available for the following language combinations:
  • English <> French
  • English <> German
  • English <> Hungarian
  • English <> Italian
  • English <> Polish
  • English <> Portuguese
  • English <> Spanish
  • German <> Polish
Please contact Support for queries about further language combinations.

Changes

  • New projects can no longer be created without an engine.
  • Projects can be created using generic stock engines. API users should first query the list of engines for the specific language combination to find out whether a stock engine is available (see the sketch after this list).
  • Engine segment pair counts reflect the real count after training.
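
For API users, that check could look something like the minimal sketch below. The endpoint path, query parameters, property names and authentication scheme are illustrative assumptions rather than the documented Globalese API; the "ready" flag mirrors the Boolean property described under Improvements below. Consult the API reference for the real shapes.

    import requests

    # Hypothetical base URL and token -- substitute your own instance and
    # credentials, and check the Globalese API reference for the actual
    # endpoint paths, parameter names and auth scheme.
    BASE_URL = "https://example.globalese.com/api/v2"
    HEADERS = {"Authorization": "Bearer your-api-token"}

    def find_stock_engine(source_lang, target_lang):
        """Return the first ready stock engine for a language pair, or None."""
        response = requests.get(
            f"{BASE_URL}/engines",
            params={"source": source_lang, "target": target_lang},
            headers=HEADERS,
        )
        response.raise_for_status()
        for engine in response.json():
            # "stock" and "ready" are assumed property names.
            if engine.get("stock") and engine.get("ready"):
                return engine
        return None

    engine = find_stock_engine("en", "de")
    if engine:
        print(f"Stock engine {engine['id']} can be used right away.")
    else:
        print("No stock engine available; a custom engine is needed.")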

Improvements

  • A range of new API endpoints to help manage engines.
  • Boolean "ready" property in engine API responses to indicate whether an engine can be immediately used for translation.
  • Generic training and translation quality improvements.
  • More helpful API error messages.

Fixes

  • No API warnings if deleted groups were specified in the request payload.
  • Engine cloning not working in specific circumstances.
  • New users not receiving welcome e-mails.
  • Better handling of tags adjacent to numbers.
  • A sentence-final tag would always be followed by an underscore in the translation.
  • If a sentence started with a tag, the first word was lowercased.
  • Translate button enabled in situations where the file cannot actually be translated.

Globalese 3.5 released

The major addition in this release is the ability to use stock corpora as the foundation for training engines in the cloud. Read more about stock corpora in the blog post below.

Changes

  • Core corpora have been renamed to master corpora.
  • It is now mandatory for engines to have a certain volume of master corpora as well as a certain number of segment pairs in total.

Improvements

  • Better handling of dates and numerical information.
  • Ability to train engines with no locale (e.g. 'en') even if all corpora are marked with a certain locale (e.g. 'en-us').
  • Terms and Conditions are always available.
  • Better error messages for failing file uploads.
  • An engine being edited can now only be saved (and marked as Changed) if there are actual changes.

Fixes

  • Training progress went over 100%. (We had too many GPUs working for us... now we're using them to mine Bitcoin instead. Just kidding!)
  • Translations containing a redundant tag at the end of some segments

Stock corpora for training Machine Translation engines

Since the introduction of core and auxiliary corpora in version 3.1, we have seen successful and less successful MT engines trained in Globalese. The successful ones usually have ample and well-maintained core corpora (which we have renamed to 'master' in version 3.5 to resonate more with CAT tool users), have plenty of auxiliary corpora to use as the foundation, and are used to translate texts that come from the domain they were trained in. The less successful ones may not have enough meaningful core corpora or enough auxiliary corpora, or are occasionally used for a different domain than the core corpora.

(Side note: the same engine will produce astonishing BLEU, TER etc. scores for content from its own domain, and terrible results when used to translate texts from a domain it barely knows. The question "what is the BLEU score of my engine?" is therefore meaningless, unless you are comparing engines trained on different platforms from the same corpora and translating the same text.)

While we cannot help our users magically double the size of their master TMs overnight, we try our best to help them have a solid foundation to build on. Any neural MT engine, be it generic or trained in a specific domain, needs a certain amount of foundation corpora to 'learn' the languages. (The exact value of 'certain' is debated and probably always will be.) We have also seen users either turning to us for help, or uploading the same publicly available corpora time and time again.

The advantages of using Globalese stock corpora

Globalese 3.5 now offers the possibility to use stock corpora for training engines. (Note that this feature is only available in cloud-based systems.) We believe we have a few sound reasons for implementing this feature:
  • It saves users the struggle of downloading massive files from the internet, splitting them into chunks and uploading them to Globalese. (A win already for everyone who's ever been through this.)
  • We keep these corpora updated, so when a new version comes out, we will make sure it is updated in the corpus repository as well. (You don't have to change anything in the engine. Any time you retrain it, the latest stock corpora will be used.)
  • We are actively seeking new and better ways to improve the filtering of these corpora to give our users a better training foundation — and we don't just mean regular expressions, but also putting AI to work.
  • In the future, this will also save you training time — keep an eye on our release notes.
Auxiliary corpora will continue to be used the same way as before, i.e. they will be filtered according to closeness in domain to the master corpora. In this respect, stock corpora are auxiliary corpora, only from a different source.
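
To make "closeness in domain" a little more concrete, here is a deliberately simplified sketch of how auxiliary segments could be scored against master corpora. Globalese's actual filtering pipeline is not public; plain TF-IDF cosine similarity (via scikit-learn) stands in here purely as an illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def filter_auxiliary(master_segments, auxiliary_segments, keep_ratio=0.5):
        """Toy domain filter: keep the auxiliary segments most similar
        to the master corpus as a whole."""
        vectorizer = TfidfVectorizer()
        # Fit on both corpora so they share a single vocabulary.
        vectorizer.fit(master_segments + auxiliary_segments)
        master_centroid = vectorizer.transform([" ".join(master_segments)])
        aux_vectors = vectorizer.transform(auxiliary_segments)
        scores = cosine_similarity(aux_vectors, master_centroid).ravel()
        # Rank auxiliary segments by similarity and keep the top fraction.
        ranked = sorted(zip(scores, auxiliary_segments), reverse=True)
        keep_n = int(len(auxiliary_segments) * keep_ratio)
        return [segment for _, segment in ranked[:keep_n]]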

Where do stock corpora come from?

Our stock corpora are based on corpora publicly available on the internet. Some of them come from manually maintained translation memories, while others are automatically aligned. (We ourselves do not crawl the internet to create automatically aligned corpora.) We run these corpora through various pipelines before making them available in Globalese.

So what's the use of my own auxiliary corpora now that Globalese provides stock corpora?

First of all, keep in mind that we cannot provide stock corpora for every language pair, so in a number of cases you'll still have to provide all the training material. Secondly, users are still encouraged to use their own TMs as auxiliary corpora, for the simple reason that they may be of higher quality than some of the stock corpora.

Why can't I pick the stock corpora I want to use?

Easy: Globalese does the filtering for you, based on your master corpora.

What can I do if there are no stock corpora for my language combination?

Again, easy: just let us know!

Should I change anything in my existing engines?

If you have engines where you have been using corpora from popular sites such as OPUS, chances are that those corpora are available as stock. Just edit the engine and see if you can tick the Use stock corpora checkbox. If unsure, just ask! Once you have removed the corpora that you (or we) uploaded from all of your engines, you can delete them; you will never need them again.

Globalese 3.3 released

Globalese 3.3 introduces the concept of engine health. The idea behind it is to prevent old engines from hanging around when you could increase their output quality simply by retraining them on the same corpora. As always, we are trying to keep it simple. Whenever a future Globalese release adds a significant (read: measurable) increase in engine quality, Globalese will gently prompt users to retrain engines that are going "stale" (starting with the oldest ones). If a major change is introduced that affects engine quality more dramatically, users will see a firm warning. In both cases, old engines can still be used to translate files. Other improvements:
  • SDLXLIFF files with segment comments supported.
  • Safer password hashing algorithm implemented.
Bugs fixed:
  • File names starting with a non-ASCII character generated translation errors.
  • Groups could not be deleted.
  • Smartcat files could not be delivered after translating them on the Globalese instance.
  • Segment count mismatch after merging two or more corpora.
  • New engines could not be saved if all available corpora were selected.
  • Various training and translation-related issues.

Globalese 3.2 released

Globalese 3.2 is all about version 2.1 of the Globalese API. The new endpoints introduced in v2.1 let users automate corpus management as well as engine creation and training. Paginated results are also available for listing queries (see the sketch after the list below). Other improvements include:
  • Improved corpus pre-processing guarantees better engine training results.
  • UI improvements.
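
Walking a paginated listing might look like the sketch below. The endpoint, the "page" parameter and the response shape are again assumptions for illustration, not the documented v2.1 API.

    import requests

    BASE_URL = "https://example.globalese.com/api/v2"
    HEADERS = {"Authorization": "Bearer your-api-token"}

    def list_all_corpora():
        """Collect every corpus by walking the paginated listing."""
        corpora, page = [], 1
        while True:
            response = requests.get(
                f"{BASE_URL}/corpora",
                params={"page": page},
                headers=HEADERS,
            )
            response.raise_for_status()
            batch = response.json()
            if not batch:
                break  # an empty page means we have walked past the end
            corpora.extend(batch)
            page += 1
        return corpora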

Globalese 3.1 released

Globalese 3.1 introduces the concept of augmented in-domain engines trained from core and auxiliary corpora. Minor improvements:
  • Better feedback for misconfigured SmartCAT and Memsource connectors.
  • Improved user guidance when training small engines.
  • Users can now see engine IDs and group IDs directly on an engine page to help set up pretranslation in Memsource.
  • Improved 404 pages for non-existing or removed resources.
  • A lot of minor issues have been fixed since the release of Globalese 3.0.

Augmented in-domain engines

In the past, LSPs and content owners with a need for MT would often struggle when building engines, because they wouldn't have the required volumes of specific corpora to train successful engines. To tackle this, Globalese 3.1 introduces the concept of core and auxiliary corpora.

The small corpus struggle

To train a working MT engine, a training corpus of less than 100,000 segment pairs is rarely enough. And that is just the bare minimum. In the past, many MT users found it a struggle to put a decent training corpus together. Knowing that the more relevant the training corpus was, the better the engine was expected to perform, they tried to build small but specific engines, including only in-domain data to make sure the style and terminology of the engine would align with those of the client or project. The problem with this approach very often was that the client- or product-specific translation memories would only amount to 30, 50, or 70 thousand segment pairs, but seldom to 100 thousand or more. Therefore, even if the engine was using appropriate terminology, its output would lack fluency and coherence. To combat this, MT users could often only resort to adding out-of-domain translation memories. This was a trade-off, because fluency would improve, but the engine would get biased. Client- or product-specific words would be suppressed and often completely disappear from the engine output.

Solution: a balance between core corpora and auxiliary corpora

The core corpus is the client- or product-specific translation material whose style and terminology you want the engine to follow. The auxiliary corpus can be:
  • Your own TMs that you have in the same language combination.
  • Publicly available corpora, from OPUS, the Tilde MODEL Corpus or elsewhere.
  • Corpora you obtain from TAUS, META-SHARE or elsewhere.
As long as it is a well-maintained corpus, the auxiliary corpus helps the engine boost its "linguistic" capabilities. (Of course, this is not linguistic knowledge in the academic sense, but something the machine learns about how the source and target languages work.)

How to use it

Simply mark corpora as core when creating or editing an engine. Globalese will take care of the rest during training.

How it works

  1. Globalese makes sure the core vocabulary is kept and does not get "lost" (i.e. overshadowed by the supposedly larger auxiliary corpus).
  2. The auxiliary corpus is filtered, resulting in a training corpus that contains only entries that are more related to the core corpus.
  3. At the end of the training phase, the engine is tuned further on the core corpus.
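
In rough but runnable form, the three steps could be sketched as follows. Every component here is a toy stand-in invented for illustration and does not correspond to actual Globalese internals.

    def build_vocabulary(segments):
        """Toy vocabulary: the set of whitespace-separated tokens."""
        return {token for segment in segments for token in segment.split()}

    def train_augmented_engine(core, auxiliary, keep_ratio=0.5):
        # 1. Derive the vocabulary from the core corpus first, so core
        #    terminology cannot be crowded out by the larger auxiliary data.
        core_vocab = build_vocabulary(core)

        # 2. Keep only the auxiliary entries that overlap most with the
        #    core vocabulary (a crude stand-in for domain filtering).
        def overlap(segment):
            tokens = set(segment.split())
            return len(tokens & core_vocab) / max(len(tokens), 1)
        filtered = sorted(auxiliary, key=overlap, reverse=True)
        filtered = filtered[: int(len(auxiliary) * keep_ratio)]

        # 3. Train on core plus filtered auxiliary, then fine-tune on core
        #    alone; returned here as a plan rather than a trained model.
        return {"train_on": core + filtered, "fine_tune_on": core}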

How much auxiliary corpus should I add to my engine?

It doesn't matter. You can add 20 million segments if you want. During training, Globalese will create a filtered training corpus anyway, so you don't end up training an engine for weeks.

Can I still create big generic engines?

Yes, that is still possible. If you don't make a distinction between core and auxiliary corpus, Globalese will assume everything is equally important, and therefore use all of the corpora for training.