Customizing the NMT model

Cloud Translation - Advanced API lets you customize the Google Neural Machine Translation (NMT) model without writing code. This means you can tailor a custom model to your domain-specific content and produce more accurate translations than the default Google NMT model would.

The NMT model covers a large number of language pairs and does well with general-purpose text. Where a custom model excels is in handling specific, niche vocabularies. Customizing the NMT model lets you get the right translation of domain-specific terminology that matters to you.

Suppose you run a specialized reporting service that has the opportunity to expand into new countries. Those markets require that your time-sensitive content be translated correctly in real time, including specialized terminology. Instead of having to hire bilingual staff or contract with specialist translators, both of which come at a high price, you can create and refine a custom model to do the job in real time at a much lower cost.

Data preparation

To train a custom model, you supply matching pairs of segments in the source and target languages. These are pairs of words or phrases that mean the same thing in the language you want to translate from and the language you want to translate to. The closer in meaning your segment pairs are, the better your model can work.

While putting together the dataset of matching segment pairs, start with the use case:

Match data to your problem domain

You're training a custom translation model because you need a model that fits a particular linguistic domain. Make sure your segment pairs do the best possible job of covering the vocabulary, usage, and grammatical quirks of your industry or area of focus. Find documents that contain typical usages you'd find in the translation tasks you want accomplished, and make sure your parallel phrases match as closely in meaning as you can arrange. Of course, sometimes languages don't map perfectly in vocabulary or syntax, but try to capture the full diversity of semantics you expect to encounter in use if that's possible. You're building on top of a model that already does a pretty good job with general-purpose translation. Your examples are the special last step that makes custom models work for your use case in particular, so make sure they're relevant and representative of usage you expect to see.

Capture the diversity of your linguistic space

It's tempting to assume that the way people write about a specific domain is uniform enough that a small number of text samples translated by a small number of translators should be sufficient to train a model that works well for anyone else writing about that domain. But we're all individuals, and we each bring our own personality to the words we write. A training dataset with segment pairs from a broad selection of authors and translators is more likely to give you a model that's useful for translating writing from a diverse organization. In addition, consider the variety of segment lengths and structures; a dataset where all the segments are the same size or share a similar grammatical structure won't produce a good custom model that captures all the possibilities.
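
One rough way to check the variety of segment lengths is to count how many segments fall into each word-length bucket. The following is a minimal sketch, not part of Cloud Translation - Advanced API; the function name `length_buckets`, the bucket size, and the sample sentences are assumptions made for illustration.

```python
from collections import Counter


def length_buckets(segments, bucket_size=5):
    """Count segments per word-length bucket: 0-4 words, 5-9 words, and so on."""
    counts = Counter()
    for segment in segments:
        n_words = len(segment.split())
        start = (n_words // bucket_size) * bucket_size
        counts[(start, start + bucket_size - 1)] += 1
    # Sort by bucket so the distribution reads from short to long segments.
    return dict(sorted(counts.items()))


# Hypothetical source segments, used only to exercise the function.
sources = [
    "Revenue rose.",
    "Net income fell 3% year over year.",
    "The board approved a dividend of two dollars per share, payable in June.",
]

for (low, high), count in length_buckets(sources).items():
    print(f"{low}-{high} words: {count} segment(s)")
```

If nearly all segments land in a single bucket, the dataset may not reflect the range of lengths you expect to translate.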

Source your data

After you've established what data you need, you need to find a way to source it. You can begin by taking into account all the data your organization collects. You might find that you're already collecting the data you would need to train a translation model. In case you don't have the data you need, you can obtain it manually or outsource it to a third-party provider.

Keep humans in the loop

If it's at all feasible, make sure a person who understands both languages well has validated that the segment pairs match up correctly and represent understandable, accurate translations. A common mistake like misaligning the rows of your training data spreadsheet can yield translations that sound like nonsense. High-quality data is the most important thing you can provide to Cloud Translation - Advanced API to get a model that's usable for your business.
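
A simple automated check can point a human reviewer at the pairs most likely to be misaligned. The sketch below flags pairs where one side is empty or where the two sides differ greatly in length. It is a heuristic only and does not replace review by a bilingual person; the function name `suspicious_pairs` and the `max_ratio` threshold are assumptions chosen for illustration.

```python
def suspicious_pairs(pairs, max_ratio=3.0):
    """Return (index, source, target) for pairs that look misaligned.

    A pair is flagged when either side is empty or when the longer side has
    more than max_ratio times as many characters as the shorter side.
    """
    flagged = []
    for index, (source, target) in enumerate(pairs):
        source, target = source.strip(), target.strip()
        if not source or not target:
            flagged.append((index, source, target))
            continue
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_ratio:
            flagged.append((index, source, target))
    return flagged


# Hypothetical English-Spanish pairs; the second looks shifted by one row.
pairs = [
    ("Good morning", "Buenos días"),
    ("Yes", "El informe trimestral se publica mañana"),
    ("Thank you", ""),
]

for index, source, target in suspicious_pairs(pairs):
    print(f"row {index}: {source!r} <-> {target!r}")
```

Legitimate translations can differ a lot in length, so treat flagged rows as candidates for review rather than as errors.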

Keep in mind fairness in developing segment pairs

A core principle underpinning Google's ML products is human-centered machine learning, an approach that promotes responsible AI practices, including fairness. The goal of fairness in ML is to understand and prevent unjust or prejudicial treatment of people related to race, income, sexual orientation, religion, gender, and other characteristics historically associated with discrimination and marginalization, when and where they manifest in algorithmic systems or algorithmically aided decision-making. You can read more in our guide and in these fair-aware notes:

Clean up messy data

You may make mistakes when preprocessing data, and some mistakes can really confuse a custom model. In particular, look for the following data issues that you can fix:

Data processing

Cloud Translation - Advanced API stops parsing your data input file when:

Cloud Translation - Advanced API ignores errors for problems it cannot detect, such as:

For automatic data splitting, Cloud Translation - Advanced API performs additional processing (see Dataset division):

Dataset division

Your dataset of segment pairs is divided into three subsets, for training, validation and testing:

If you don't manually specify how your dataset is split between these functions as described in Preparing your training data, and if your dataset contains fewer than 100,000 segment pairs, then Cloud Translation - Advanced API automatically uses 80% of your content documents for training, 10% for validating, and 10% for testing. If your data is larger than that, you must explicitly specify how it is split. Manual splitting gives you more control over the process, not only letting you determine the split percentages, but also letting you specify particular sets in which to include particular segment pairs.
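
To illustrate what a manual split involves, the following sketch partitions a list of segment pairs locally using the same 80/10/10 proportions as the automatic split. It only divides the data; how you tell the service which pairs belong to which set is covered in Preparing your training data. The function name `split_pairs`, its parameters, and the fixed random seed are assumptions for illustration.

```python
import random


def split_pairs(pairs, train=0.8, validation=0.1, seed=0):
    """Shuffle segment pairs and split them into train, validation, and test sets.

    The test set receives whatever remains after the train and validation
    portions are taken, so no pair is dropped or duplicated.
    """
    shuffled = list(pairs)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_validation],
        "test": shuffled[n_train + n_validation:],
    }


# Hypothetical placeholder pairs standing in for real segment pairs.
example_pairs = [(f"source {i}", f"target {i}") for i in range(1000)]
subsets = split_pairs(example_pairs)
for name, subset in subsets.items():
    print(name, len(subset))
```

A manual split like this also lets you pin specific pairs to a specific set, for example by removing them from the list before shuffling and appending them to the test set afterward.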

Importing data

After you've decided whether a manual or automatic split of your data is right for you, there are two ways to add data:

  • You can import data as a tab-separated values (TSV) file containing source and target segments, one segment pair per line.

  • You can import data as a TMX file, a standard format for providing segment pairs to automatic translation model tools (see Prepare training data for more about the TMX format). If a TMX file contains invalid XML tags, Cloud Translation - Advanced API ignores them. If the TMX file contains XML or TMX errors, such as a missing end tag or <tmx> element, Cloud Translation - Advanced API ends processing. It also ends processing and returns an error if it skips more than 1024 invalid elements.
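
The two formats can be sketched with the Python standard library. The first function below writes segment pairs as TSV text, one pair per line, and rejects segments that would break that layout. The second checks only that a TMX document parses as XML and has a <tmx> root element, which are the kinds of errors described above. The function names and sample data are assumptions for illustration, and the sample TMX follows the general TMX layout rather than any requirement specific to Cloud Translation - Advanced API; see Prepare training data for the exact format.

```python
import xml.etree.ElementTree as ET


def to_tsv(pairs):
    """Render segment pairs as TSV text: source, a tab, target, one pair per line."""
    lines = []
    for source, target in pairs:
        for text in (source, target):
            # A tab or newline inside a segment would corrupt the row layout.
            if "\t" in text or "\n" in text:
                raise ValueError(f"segment contains a tab or newline: {text!r}")
        lines.append(f"{source}\t{target}")
    return "\n".join(lines) + "\n"


def tmx_is_well_formed(tmx_text):
    """Return True if the text parses as XML and its root element is <tmx>."""
    try:
        root = ET.fromstring(tmx_text)
    except ET.ParseError:
        return False
    return root.tag == "tmx"


# Hypothetical sample data.
pairs = [("Good morning", "Buenos días"), ("Thank you", "Gracias")]

sample_tmx = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          creationtool="example" creationtoolversion="1.0"
          adminlang="en" o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="es"><seg>Buenos días</seg></tuv>
    </tu>
  </body>
</tmx>"""

print(to_tsv(pairs), end="")
print(tmx_is_well_formed(sample_tmx))
print(tmx_is_well_formed(sample_tmx.replace("</tmx>", "")))
```

Running a check like this before importing can catch a missing end tag locally instead of after an upload fails.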

    Preliminary evaluation of your custom model

    After your model is trained, you receive a summary of your model performance. Click the Train tab to view a detailed analysis. The Train tab shows the BLEU scores of your custom model and of the standard Google NMT model, along with the BLEU score performance gain from using the custom model.

    The higher the BLEU score, the better the translations your model can give you for segments that are similar to your training data. Scores in the range 30-40 are considered good. For a detailed explanation of BLEU scores, see The BLEU translation quality metric.
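
To give a feel for what a BLEU score measures, here is a simplified sentence-level sketch: clipped n-gram precision combined with a brevity penalty, scaled to 0-100. It is not the implementation that Cloud Translation - Advanced API uses. Standard BLEU is computed over a whole test set with n-grams up to length 4; this sketch assumes a single reference, whitespace tokenization, and a `max_n` of 2 so that short examples produce nonzero scores. The function names and sample sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU on a 0-100 scale for one candidate and one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each n-gram count by how often it appears in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
custom_score = simple_bleu("the cat sat on the mat", reference)
base_score = simple_bleu("the cat sat on a mat", reference)
print(f"custom: {custom_score:.2f}  base: {base_score:.2f}  gain: {custom_score - base_score:.2f}")
```

The gain printed at the end mirrors the performance gain described above: the custom model's score minus the base model's score on the same segments.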

    There are other evaluation metrics that are often more reliable than the BLEU score. For information about those evaluation options, see Evaluate translation models.

    Debugging

    Debugging a custom model is more about debugging the data than the model itself. If your model is not translating the way you intend, check your data to see where it can be improved.

    Testing

    Even if your BLEU score looks okay, it's a good practice to check the model yourself to make sure its performance matches your expectations. If your training and test data are drawn from the same incorrect set of samples, the scores might be excellent even if the translation is nonsense. Add some examples as input on the Predict tab and compare the results from the custom model with the Google NMT base model. You might notice that your model comes up with the same predictions as the base model, especially on short segments or if you have a smaller training set, since the base model is already pretty good for a wide variety of use cases. In that case, try longer or more complex segments. However, if all of your segments come back identical to the predictions from the base model, it might indicate a data problem.
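
If you collect the outputs of both models for the same inputs, a few lines of code can tell you how often they agree. The sketch below does not call any API; it assumes you have already gathered the translations, for example by copying them from the Predict tab. The function name `identical_fraction` and the sample outputs are hypothetical.

```python
def identical_fraction(custom_outputs, base_outputs):
    """Return the fraction of segments where both models gave the same translation."""
    if len(custom_outputs) != len(base_outputs):
        raise ValueError("both lists must cover the same input segments")
    if not custom_outputs:
        return 0.0
    same = sum(
        custom.strip() == base.strip()
        for custom, base in zip(custom_outputs, base_outputs)
    )
    return same / len(custom_outputs)


# Hypothetical translations of the same four inputs by each model.
custom_outputs = [
    "Los ingresos netos cayeron un 3 %",
    "Buenos días",
    "El rendimiento del bono subió",
    "La junta aprobó el dividendo",
]
base_outputs = [
    "El ingreso neto cayó un 3 %",
    "Buenos días",
    "El retorno del bono aumentó",
    "El consejo aprobó el dividendo",
]

print(f"{identical_fraction(custom_outputs, base_outputs):.0%} identical")
```

Some overlap is expected on short segments; a result close to 100% across long, domain-specific segments is the pattern that might indicate a data problem.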

    If there's a mistake that you're particularly worried about your model making (like a translation mistake that might be costly in money or reputation), make sure your test set or procedure covers that case adequately for you to feel safe using your model in everyday tasks.

    What's next

    For details about how to create your own dataset and custom model, see Prepare training data.