What is MTPE?

A general misunderstanding of what Machine Translation Post-Editing (MTPE) really is, and how it relates to revision, can be considered the common denominator of the problems faced by both individual translators and Language Service Providers (LSPs), wrote Isabella Massardo in ‘Multilingual’ in December 2015.

So how do we at ITI understand MTPE and make use of it?

Back in 2014, while involved in an eBay MTPE project and following the international discussions on the subject, we tried to illustrate an idea drawn from our experience: although MT helps produce translated content more easily and quickly, MTPE is more than just editing. Treating MTPE as merely a brief manual revision of automatically generated translation before delivery to the end user would never have let us achieve the results mentioned by eBay’s Senior Director of Machine Translation and Geo Expansion, Hassan Sawaf: “As we’ve rolled out our machine translation capabilities, and even before a lot of the education and outreach we plan to do, we’ve quickly increased the number of Russian users we see using these features by 50%”.

For Russian, which is an inflectional language with a complex structure and high morphological demands, every single MT segment has to undergo a multi-stage processing sequence with several linguists working on it in turns.

  • First of all, right at the beginning of the project, we developed, together with the client, a comprehensive guideline on the language-specific conventions of MTPE, and made sure that each member of the team was familiar with it and adhered to it. This guide was continuously updated during the project itself. The key point of this stage was communication.
  • We processed each MT segment with our usual 4-step workflow, modified for the project: MT editor, 2nd editor, proofreader, and QA specialist performing automated checks.
  • Resources for each of these 4 steps were tested through a special procedure, adapted to the project specifics. (The content we worked on was meant not for human users, but for the MT engine’s training.)

In the end, this ride was even more complex than a standard 4-step TEP+QA localization cycle. So why did we do it? Why bother complicating time-tested processes and getting paid less, instead of “translating from scratch” as usual?

The answer is, MT is not as black as it is sometimes painted.

Judging from our experience with such major accounts as eBay, Cisco and Dell, we at ITI do believe that MT is good. But it is certainly not yet capable of replacing human translation, so if our goal is the client’s satisfaction, there’s always a job for human experts.

We especially doubt that any MT engine can be trained well enough to produce near-final-quality translation for Russian and other morphologically complex languages.

Worth a try?

Back to the eBay case, we think the 50% increase in the number of Russian users was achieved mostly because the content was translated. And although MT is not a universal remedy, implementing it played the key role in the success of this particular case.

In many other cases, though, it’s still better to have no translation than a poor one (which is what raw MT output usually is). However, if and when you do have machine-translated text, you need to choose what to do next.

Types of MTPE Workflow

Besides the MT output quality, the labor cost of post-editing (PE) is affected by the expected end quality of the content. To reach “publishable” quality similar to “high-quality human translation and revision”, full PE is usually recommended.

For a lower standard, often referred to as “good enough” or “fit for purpose” (as per TAUS), light PE may suffice; it aims to make the MT output “simply understandable”. However, in our 5+ years of MTPE practice we’ve never faced an actual project with light MTPE demands. On the contrary, those of our clients who utilise MT tend to have some of the highest quality expectations. This is probably because they put so much effort into MT deployment, which includes engine training, MT output evaluation, analytics and statistics, not to mention the actual PE work for each language involved. Consequently, the highest quality is expected.


MT, like most automation technologies, aims to save time and money. Depending on the initial quality of the MT engine output and expected target text quality level, the discounts may vary.

As mentioned above, we haven’t seen an acceptable MT engine output for Russian yet (and that is probably also why we haven’t received any requests for light MTPE so far). Therefore, an average discount is usually within the range of 15-25% of the basic rate for the full MTPE cycle, whereas it can amount to 26-40% for light PE.
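To make the arithmetic concrete, here is a minimal sketch that turns the discount ranges quoted above into a price range for a project. The function name, the per-word base rate and the word count are hypothetical and chosen only for illustration; real rates depend on the language pair, the content and the measured MT quality.

```python
# Discount ranges quoted in the text, as fractions of the base rate.
DISCOUNT_RANGES = {
    "full": (0.15, 0.25),   # full MTPE cycle
    "light": (0.26, 0.40),  # light PE
}


def mtpe_price_range(base_rate, word_count, pe_type):
    """Return (lowest, highest) project price for the given PE type.

    base_rate is a hypothetical per-word rate for translation from scratch.
    """
    min_discount, max_discount = DISCOUNT_RANGES[pe_type]
    full_price = base_rate * word_count
    lowest = full_price * (1 - max_discount)   # largest discount applied
    highest = full_price * (1 - min_discount)  # smallest discount applied
    return round(lowest, 2), round(highest, 2)


if __name__ == "__main__":
    # Hypothetical project: 0.10 per word, 10,000 words.
    print("full MTPE:", mtpe_price_range(0.10, 10_000, "full"))
    print("light PE: ", mtpe_price_range(0.10, 10_000, "light"))
```

With these assumed numbers, a job that would cost 1,000 when translated from scratch lands between 750 and 850 for full MTPE and between 600 and 740 for light PE.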

Several metrics are used to evaluate MT quality and to calculate discounts accordingly.

Metrics and Statistics

Though we speak about automation, human judgment is the benchmark for assessing automatic metrics, as humans are the end-users of the translation output.

Banerjee et al. (2005) highlight five attributes that a good automatic metric must possess: correlation, sensitivity, consistency, reliability and generality. Any good metric must correlate well with human judgment. It must be consistent, giving similar results on similar text with the same MT system. It must be sensitive to differences between MT systems, and reliable, in that MT systems that score similarly are expected to perform similarly. Finally, the metric must be general: it should work with different text domains, in a wide range of scenarios and MT tasks.

Let’s take a brief look at the automatic metrics for evaluating machine translation.

BLEU, or bilingual evaluation understudy, was one of the first metrics to report high correlation with human judgments of quality, and it is currently one of the most popular. The central idea behind the metric is that “the closer a machine translation is to a professional human translation, the better it is”. The metric calculates scores for individual segments, usually sentences, then averages these scores over the whole corpus for a final score. No other machine translation metric has yet significantly outperformed BLEU with respect to correlation with human judgment across language pairs.
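The idea can be sketched in a few lines of Python. This is a simplified, single-reference, sentence-level variant written for illustration: it combines clipped n-gram precisions with a brevity penalty, and it uses no smoothing, so any missing n-gram order gives a score of zero. The function names are our own, and production tools differ in tokenisation, smoothing and aggregation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one tokenised candidate and one reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    # Penalise candidates that are shorter than the reference.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    mt = "the cat sat on the mat".split()
    human = "the cat sat on a mat".split()
    print(sentence_bleu(mt, human, max_n=2))
```

A score of 1.0 means the MT segment is identical to the reference, and lower values mean more post-editing was needed to get from one to the other.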

The NIST metric, whose name comes from the US National Institute of Standards and Technology, is based on the BLEU metric, with some alterations. Where BLEU gives equal weight to each n-gram, NIST also calculates how informative a particular n-gram is. That is to say, when a correct n-gram is found, the rarer that n-gram is, the more weight it is given. NIST also differs from BLEU in its calculation of the brevity penalty.
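A minimal sketch of the information weighting idea follows. It assumes the formulation usually given for NIST, where the weight of an n-gram is the base-2 logarithm of the count of its (n-1)-word prefix divided by the count of the full n-gram in the reference data (for single words, the prefix count is the total number of words). The function names and the toy corpus are illustrative, and the full NIST score is not reproduced here.

```python
import math
from collections import Counter


def ngram_counts(sentences, n):
    """Count all n-grams over a list of tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def information_weights(reference_sentences, n):
    """Rarer n-grams get larger weights: log2(count(prefix) / count(n-gram))."""
    grams = ngram_counts(reference_sentences, n)
    if n == 1:
        total_words = sum(grams.values())
        return {g: math.log2(total_words / c) for g, c in grams.items()}
    prefixes = ngram_counts(reference_sentences, n - 1)
    return {g: math.log2(prefixes[g[:-1]] / c) for g, c in grams.items()}


if __name__ == "__main__":
    refs = [s.split() for s in
            ["the cat", "the dog", "the bird", "a cat"]]
    for gram, weight in sorted(information_weights(refs, 1).items()):
        print(gram, round(weight, 3))
```

In this toy corpus the frequent word “the” receives a lower weight than the rare word “dog”, so matching “dog” contributes more to the score.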

Word error rate (WER) is based on the number of words that differ between a piece of machine-translated text and a reference translation. A related metric is the position-independent word error rate (PER), which allows for re-ordering of words and sequences of words between the translated text and the reference translation.
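Both rates can be sketched as follows. WER is computed here as the word-level edit distance divided by the reference length. PER uses one common bag-of-words formulation, so word order is ignored. Whitespace tokenisation and the function names are simplifying assumptions, and the reference is assumed to be non-empty.

```python
from collections import Counter


def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions."""
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def wer(hypothesis, reference):
    """Word error rate: word-level edit distance over reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / len(ref)


def per(hypothesis, reference):
    """Position-independent error rate: compares bags of words only."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    reordered = "on the mat the cat sat"
    print("WER:", wer(reordered, reference))
    print("PER:", per(reordered, reference))
```

For the reordered sentence in the usage example, PER is zero because every word is present, while WER is not, which is exactly the difference between the two metrics.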

The METEOR metric is designed to address some of the deficiencies inherent in the BLEU metric. It is based on the weighted harmonic mean of unigram precision and unigram recall. METEOR also includes other features, such as synonym matching: for example, rendering the word “good” in the reference as “well” in the translation counts as a match. The metric also includes a stemmer, which lemmatises words and matches on the lemmatised forms.
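The weighted harmonic mean at the core of the metric can be sketched as below. The sketch uses the recall-heavy weighting of the original METEOR formulation, F = 10PR / (R + 9P), with exact word matches only. Stemming, synonym matching and the word-order penalty of the full metric are deliberately left out, and the function name is our own.

```python
from collections import Counter


def meteor_fmean(hypothesis, reference):
    """Recall-weighted harmonic mean of unigram precision and recall.

    Simplified: exact matches only, no stemming, synonyms or chunk penalty.
    """
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Recall is weighted nine times more heavily than precision.
    return 10 * precision * recall / (recall + 9 * precision)


if __name__ == "__main__":
    print(meteor_fmean("the cat sat", "the cat sat on the mat"))
```

Because recall dominates, a translation that omits half of the reference is scored close to its recall of 0.5, even though every word it does contain is correct.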

LEPOR was proposed as a combination of many evaluation factors, including existing ones (precision, recall) and modified ones (sentence-length penalty and n-gram based word order penalty). An enhanced version of the metric, hLEPOR, uses the harmonic mean to combine the sub-factors. Furthermore, its authors designed a set of parameters to tune the weights of the sub-factors according to different language pairs.

The easiest metric to understand without special knowledge of math or statistics is edit distance. It quantifies how dissimilar two strings (or words) are by counting the minimum number of operations required to transform one string into the other.
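As an illustration, here is a minimal sketch of the most common variant, the Levenshtein distance, which counts single-character insertions, deletions and substitutions. The normalised similarity helper and its name are our own additions, shown only as one possible way to compare raw MT output with its post-edited version.

```python
def edit_distance(source, target):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(previous[j] + 1,     # delete from source
                               current[j - 1] + 1,  # insert into source
                               substitution))
        previous = current
    return previous[-1]


def similarity(source, target):
    """Normalise the distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(source, target) / longest


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(round(similarity("kitten", "sitting"), 3))
```

For an inflected language such as Russian, even a correct word choice with the wrong ending shows up here: “красный” and “красная” differ by two substitutions.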

Other evaluation methods are used as well. For instance, besides edit distance, precision, recall and word order are used to measure lexical similarity.

Apart from that, major CAT and MT brands gather and analyze MT statistics from the Big Data perspective. As a recent article by Memsource states, “French, Portuguese, Spanish and English machine translation engines have the highest rates of potential MT leverage. English to French stands out with more than 20% of the translations a complete match to MT suggestions, and almost 90% of segments having at least some coherence with the MT. In comparison, Russian, Polish and Korean have a much lower leverage rates, below 40% or even 20% fuzzy matches and 5% complete matches. The difference is probably due to the morphological typology of the languages.”

Moreover, according to recent primary research by CSA, LSPs with an aggressive MT implementation approach actually grew 3.5 times faster than their more cautious competitors, with most of this growth coming from professional human language services.

These statistics may be helpful when deciding whether to implement MT.

In summary, next time you receive a piece of “incoherent Google translated mess” to work on, you can do more than judge its quality harshly: you can quantify the actual effort, that is, the difference between the MT output and the end result. And who knows, it may well turn out to be worthwhile from a long-term post-editing perspective.

Want to try Machine Translation Post Editing, or learn more about its benefits? Give it a chance! E-mail us at


eBay’s Machine Translation Tools Break Down Borders in Russia

The state of post-editing

MT Post-editing Guidelines

Evaluation of machine translation

Machine and Professional Human Translations

Fast-Growing LSPs Turn to Machine Translation