Author - Greg Horváth

Globalese 3.9.1 released

Fixes

  • Users with only project access in a group could not see the group a project belongs to.
  • Not all Memsource TMs were listed when retrieving translation memories from Memsource.
  • Project creator user missing from project Log page.
  • When editing an engine, corpora were not sorted alphabetically (this was a regression).

Minor improvements

  • 4 to 8 per cent increase on average in training speeds.

Globalese 3.9 released

What's new

We've introduced a quick training option that saves training time when engines receive only minor updates and therefore don't need to be fully retrained.

Read more about this new functionality here.

Notable fixes

  • Spacing issues after tags followed by punctuation marks.
  • Unicode characters were sometimes replaced by text in the translation.
  • Project log could not be displayed.
  • Translate button active for projects without an engine.

Globalese 3.8.4 released

Minor improvements

  • Groups on the Corpora, Engines and Projects pages are now listed alphabetically.
  • The total number of currently existing engines is always displayed on the About page.
  • Project files can be sorted by status and last translation date & time.
  • Memsource connector updated to support changes in the Memsource API that affect TM filtering when retrieving TMs from Memsource.

Fixes

  • When editing a project that had a stock engine chosen, no selected engine was displayed.
  • Non-admin users could not edit projects.
  • Spaces left after translation in Japanese target texts.
  • Conversion of colons to double-byte characters fixed for Japanese.
  • URLs inside tags were truncated.
  • When deleting an engine, corpora that were not deletable were offered for deletion, causing an error.
  • Stock+ engines with a non-green health indicator didn't become healthy even after retraining.

Globalese 3.8.3 released

Globalese 3.8.3 fixes the following issues:

  • Some term bases could not be retrieved from Memsource
  • XTM term bases could not be filtered by target language
  • URL translation issues
  • Stock corpora were automatically added when creating a new engine even if there were no stock corpora for the language pair

Globalese 3.8 released

The major highlight of Globalese 3.8 is stock+ engines. Stock+ engines are customised stock engines, i.e. pre-trained stock engines extended with the user's own corpora. Note: stock+ engines are only available in the cloud environment.

Improvements

  • A Select all/Deselect all option has been added to the Corpora and Engines listing pages.
  • It is now possible to filter engines by status on the Engines page.
  • Better experience on smaller screens (i.e. mobile devices).

Further changes

  • The number of individual corpora that an engine may contain has been limited to 500.
  • The Engines page now shows master and auxiliary corpora segment counts in two distinct columns.
  • The Engines page now shows whether an engine has been built using stock corpora or a stock engine.
  • An Engine page now shows whether the particular engine uses stock corpora or a stock engine.
  • Master and auxiliary corpora now appear in two distinct tables on Engine pages.
  • The last trained version is shown on Engine pages.

Notable fixes

  • TBX parsing issues.
  • Not being able to create new users.
  • Could not delete resources containing apostrophes in their names.
  • Engines' Log pages not showing any entries.
  • Translation files' Log pages not displaying.
  • Authentication via the API not working.
  • System administrators unable to change their passwords.
  • Projects using XTM connector could not be created.
  • XML parsing errors during translation.

Globalese 3.7 released

The major update in this release is the switch of the underlying neural network model from RNN (Recurrent Neural Network) to TNN (Transformer Neural Network). The result is simply better translations obtained from engines trained from this release onward.

Further improvements

  • CAT tool plugins now also have access to generic stock engines.
  • Faster file transfers through the XTM connector.
  • Faster loading of projects connected to remote CAT tool projects.

Fixes

  • Delivering remote CAT tool files after pretranslation failed in certain cases.
  • User authentication via the API failed.
  • URLs truncated in translations.
  • Missing special characters in translations.

Globalese 3.6 released

Starting with version 3.6, Globalese cloud users can quickly take advantage of ready-to-use generic stock engines for the most common language combinations. This addition aims to provide a quick solution in scenarios where a custom engine cannot yet be trained because there is not enough in-domain training material. At the time of writing, stock engines are available for the following language combinations:
  • English <> French
  • English <> German
  • English <> Hungarian
  • English <> Italian
  • English <> Polish
  • English <> Portuguese
  • English <> Spanish
  • German <> Polish
Please contact Support for queries about further language combinations.

Changes

  • New projects can no longer be created without an engine.
  • Projects can be created using generic stock engines. API users should first query the list of engines for the specific language combination to find out if a stock engine is available or not.
  • Engine segment pair counts reflect the real count after training.

Improvements

  • A range of new API endpoints to help manage engines.
  • Boolean "ready" property in engine API responses to indicate whether an engine can be immediately used for translation.
  • Generic training and translation quality improvements.
  • More helpful API error messages.
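
For API users, the check described above (look for a stock engine for the language combination, then confirm it can be used right away) might look roughly like the sketch below. It works on an already-parsed list of engine records rather than calling the API. Only the "ready" property comes from these release notes. The other field names ("id", "source", "target", "stock") and the sample data are assumptions made up for illustration and may not match the actual API responses.

```python
from typing import Optional

# Hypothetical sample of parsed engine records. Only "ready" is a documented
# property; "id", "source", "target" and "stock" are assumed names.
ENGINES = [
    {"id": 1, "source": "en", "target": "de", "stock": False, "ready": True},
    {"id": 2, "source": "en", "target": "de", "stock": True, "ready": True},
    {"id": 3, "source": "en", "target": "fr", "stock": True, "ready": False},
]


def pick_stock_engine(engines: list, source: str, target: str) -> Optional[dict]:
    """Return the first ready stock engine for the language pair, or None."""
    for engine in engines:
        if (
            engine.get("source") == source
            and engine.get("target") == target
            and engine.get("stock")   # assumed flag marking stock engines
            and engine.get("ready")   # documented boolean: usable immediately
        ):
            return engine
    return None


engine = pick_stock_engine(ENGINES, "en", "de")
if engine is None:
    print("No ready stock engine for this language combination.")
else:
    print("Stock engine found:", engine["id"])
```

A project would then be created with the returned engine. If nothing is returned, a custom engine has to be chosen or trained first, since new projects can no longer be created without an engine.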

Fixes

  • No API warnings if deleted groups were specified in the request payload.
  • Engine cloning not working in specific circumstances.
  • New users not receiving welcome e-mails.
  • Better handling of tags adjacent to numbers.
  • A sentence-last tag would always be followed by an underscore in the translation.
  • If a sentence starts with a tag, the first word gets lowercased.
  • Translate button enabled in situations where the file cannot actually be translated.

Globalese 3.5 released

The major addition in this release is the ability to use stock corpora as the foundation for training engines in the cloud. Read more about stock corpora in this blog post.

Changes

  • Core corpora have been renamed to master corpora.
  • It is now mandatory for engines to have a certain volume of master corpora as well as a certain number of segment pairs in total.

Improvements

  • Better handling of dates and numerical information
  • Ability to train engines with no locale (e.g. 'en') even if all corpora are marked with a certain locale (e.g. 'en-us')
  • Terms and Conditions are always available
  • Better error messages for failing file uploads
  • An engine being edited can now only be saved (and marked as Changed) if there are actual changes
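
To illustrate the locale improvement above: an engine without a locale (e.g. 'en') can accept corpora marked with any locale of the same language (e.g. 'en-us'). The sketch below shows one way such matching could work. It is not the Globalese implementation. The function names are hypothetical, and the rule that an engine with a locale accepts only corpora with that same locale is an assumption.

```python
def normalise(code: str) -> str:
    """Lowercase a language code and use '-' as the separator."""
    return code.strip().replace("_", "-").lower()


def base_language(code: str) -> str:
    """Return the language part of a code, e.g. 'en' for 'en-us'."""
    return normalise(code).split("-")[0]


def corpus_matches_engine(engine_lang: str, corpus_lang: str) -> bool:
    """Decide whether a corpus language is acceptable for an engine language."""
    engine_norm = normalise(engine_lang)
    if "-" not in engine_norm:
        # Engine has no locale: any locale of the same language is accepted.
        return base_language(corpus_lang) == engine_norm
    # Assumption: an engine with a locale requires exactly that locale.
    return normalise(corpus_lang) == engine_norm


print(corpus_matches_engine("en", "en-us"))
print(corpus_matches_engine("en", "de-de"))
```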

Fixes

  • Training progress went over 100% (had too many GPUs working for us... now we're using them to mine Bitcoin instead (just kidding!))
  • Translations containing a redundant tag at the end of some segments

Stock corpora for training Machine Translation engines

Since the introduction of core and auxiliary corpora in version 3.1, we have seen successful and less successful MT engines trained in Globalese. The successful ones usually have ample and well-maintained core corpora (which we renamed to 'master' in version 3.5 to resonate more with CAT tool users), have plenty of auxiliary corpora to use as the foundation, and are used to translate texts that come from the domain they were trained in. The less successful ones may not have enough meaningful core corpora or enough auxiliary corpora, or are occasionally used for a different domain than that of the core corpora.

(Side note: the same engine will produce astonishing BLEU, TER etc. scores for content from its own domain, and terrible results when used to translate texts from a domain it barely knows. The question "what is the BLEU score of my engine?" is meaningless, unless you are comparing engines trained on different platforms from the same corpora and translating the same text.)

While we cannot help our users magically double the size of their master TMs overnight, we try our best to help them have a solid foundation to build on. Any neural MT engine, be it generic or trained in a specific domain, needs a certain amount of foundation corpora to 'learn' the languages. (The exact value of 'certain' is debated and probably always will be.) We have also seen users either turning to us for help, or uploading the same publicly available corpora time and time again.

The advantages of using Globalese stock corpora

Globalese 3.5 now offers the possibility to use stock corpora for training engines. (Note that this feature is only available in cloud-based systems.) We believe we have a few sound reasons for implementing this feature:
  • It saves users the struggle of downloading massive files from the internet, splitting them into chunks and uploading them to Globalese. (A win already for everyone who's ever been through this.)
  • We keep these corpora updated, so when a new version comes out, we will make sure it is updated in the corpus repository as well. (You don't have to change anything in the engine. Any time you retrain it, the latest stock corpora will be used.)
  • We are actively seeking new and better ways to improve the filtering of these corpora to give our users a better training foundation — and we don't just mean regular expressions, but also putting AI to work.
  • In the future, this will also save you training time — keep an eye on our release notes.
Auxiliary corpora will continue to be used the same way as before, i.e. they will be filtered according to closeness in domain to the master corpora. In this respect, stock corpora are auxiliary corpora, only from a different source.

Where do stock corpora come from?

Our stock corpora are based on corpora that are publicly available on the internet. Some of them come from manually maintained translation memories, while others are automatically aligned. (We ourselves do not crawl the internet to create automatically aligned corpora.) We run these corpora through various pipelines before making them available in Globalese.

So what's the use of auxiliary corpora now that they are provided by Globalese as stock corpora?

First of all, keep in mind that we cannot provide stock corpora for every language pair, so in a number of cases you'll still have to provide all the training material. Secondly, users are still encouraged to use their own TMs as auxiliary corpora, for the simple reason that they may be of higher quality than some of the stock corpora.

Why can't I pick the stock corpora I want to use?

Easy: Globalese does the filtering for you, based on your master corpora.

What can I do if there are no stock corpora for my language combination?

Again, easy: just let us know!

Should I change anything in my existing engines?

If you have engines where you have been using corpora from popular sites such as OPUS, chances are that those corpora are available as stock. Just edit the engine and see if you can tick the Use stock corpora checkbox. If you are not sure, just ask! Once you have removed the corpora you (or we) uploaded from all of your engines, you can delete them; you will never need them again.

Globalese 3.3 released

Globalese 3.3 introduces the concept of engine health. The idea behind it is to prevent cases where you have old engines hanging around, when you could increase their output quality simply by retraining them on the same corpora. As always, we are trying to keep it simple. Whenever we release a new Globalese version in the future that adds a significant (read: measurable) increase in engine quality, Globalese will prompt users gently to retrain engines that are going "stale" (starting with the oldest ones). If a major change is introduced that affects engine quality more dramatically, users will see a firm warning. In both cases, old engines can still be used to translate files.

Other improvements:
  • SDLXLIFF files with segment comments supported.
  • Safer password hashing algorithm implemented.

Bugs fixed:
  • File names starting with a non-ASCII character will no longer generate translation errors.
  • Groups could not be deleted.
  • Smartcat files could not be delivered after translating them on the Globalese instance.
  • Segment count mismatch after merging two or more corpora.
  • New engines could not be saved if all available corpora were selected.
  • Various training and translation-related issues.