Machine Translation: Evaluation

Motivation for MT evaluation

Evaluation scale

| score | adequacy | fluency |
|---|---|---|
| 5 | all meaning | flawless English |
| 4 | most meaning | good |
| 3 | much meaning | non-native |
| 2 | little meaning | dis-fluent |
| 1 | no meaning | incomprehensible |

Annotation tool

Disadvantages of manual evaluation

Automatic translation evaluation

Recall and precision on words

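In the example, the system output has 6 words, the reference translation has 7 words, and 3 of the output words are correct.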

$$\text{precision} = \frac{\text{correct}}{\text{output-length}} = \frac{3}{6} = 50\%$$

$$\text{recall} = \frac{\text{correct}}{\text{reference-length}} = \frac{3}{7} \approx 43\%$$

$$\text{f-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = 2 \times \frac{0.5 \times 0.43}{0.5 + 0.43} \approx 46\%$$
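The three formulas above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `word_prf` and the two example sentences are assumptions, chosen only so that the counts match the 3 correct words, 6 output words and 7 reference words used above.

```python
from collections import Counter


def word_prf(output, reference):
    """Word-level precision, recall and f-score against one reference."""
    out_words = output.split()
    ref_words = reference.split()
    if not out_words or not ref_words:
        return 0.0, 0.0, 0.0
    # Multiset intersection: each reference word can be matched at most once.
    correct = sum((Counter(out_words) & Counter(ref_words)).values())
    precision = correct / len(out_words)
    recall = correct / len(ref_words)
    if correct == 0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example sentences (not taken from the original figure).
p, r, f = word_prf(
    "Israeli officials responsibility of airport safety",
    "Israeli officials are responsible for airport security",
)
print(f"precision={p:.0%} recall={r:.0%} f-score={f:.0%}")
# prints: precision=50% recall=43% f-score=46%
```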

Recall and precision: shortcomings

| metric | system A | system B |
|---|---|---|
| precision | 50% | 100% |
| recall | 43% | 100% |
| f-score | 46% | 100% |

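Word-level precision and recall do not capture wrong word order: system B scores 100% on all three metrics even if its words appear in a different order than in the reference.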

BLEU

$$\text{BLEU} = \min \left( 1, \frac{\text{output-length}}{\text{reference-length}} \right) \cdot \left( \prod_{i=1}^{4} \text{precision}_i \right)^{\frac{1}{4}}$$

BLEU: an example

| metric | system A | system B |
|---|---|---|
| precision (1gram) | 3/6 | 6/6 |
| precision (2gram) | 1/5 | 4/5 |
| precision (3gram) | 0/4 | 2/4 |
| precision (4gram) | 0/3 | 1/3 |
| brevity penalty | 6/7 | 6/7 |
| BLEU | 0% | 52% |
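The BLEU formula as written in this section (one reference, brevity penalty as a length ratio, unweighted geometric mean of 1- to 4-gram precisions) can be sketched as follows. The function names and the example sentences are assumptions: the sentences were chosen only because they reproduce the n-gram counts in the table. Production BLEU implementations work at corpus level and differ in details, so treat this as an illustration of the formula and nothing more.

```python
from collections import Counter
from math import prod


def ngrams(words, n):
    """All contiguous n-grams of a word list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_precision(out_words, ref_words, n):
    """Share of output n-grams that also occur in the reference (clipped)."""
    out_counts = Counter(ngrams(out_words, n))
    ref_counts = Counter(ngrams(ref_words, n))
    total = sum(out_counts.values())
    if total == 0:
        return 0.0
    matched = sum((out_counts & ref_counts).values())
    return matched / total


def bleu(output, reference, max_n=4):
    """Sentence-level BLEU following the formula in this section.

    Assumes a single, non-empty reference translation.
    """
    out_words, ref_words = output.split(), reference.split()
    brevity_penalty = min(1.0, len(out_words) / len(ref_words))
    precisions = [ngram_precision(out_words, ref_words, n)
                  for n in range(1, max_n + 1)]
    return brevity_penalty * prod(precisions) ** (1 / max_n)


# Hypothetical sentences that yield the counts shown in the table.
reference = "Israeli officials are responsible for airport security"
system_a = "Israeli officials responsibility of airport safety"
system_b = "airport security Israeli officials are responsible"
print(round(bleu(system_a, reference), 2))  # 0.0
print(round(bleu(system_b, reference), 2))  # 0.52
```

System A scores 0 because a single zero n-gram precision zeroes the whole product.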

NIST

NEVA

WAFT

$$\text{WAFT} = 1 - \frac{d + s + i}{\max(l_r, l_c)}$$
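The section does not define the symbols in this formula. The sketch below assumes that d, s and i are the numbers of deleted, substituted and inserted words in a minimal word-level edit sequence, and that l_r and l_c are the lengths of the reference and the candidate translation. Under that reading, d + s + i is the word-level edit distance. The function names are illustrative.

```python
def edit_distance(candidate, reference):
    """Minimal number of word deletions, substitutions and insertions
    needed to turn `candidate` into `reference` (both are word lists)."""
    prev = list(range(len(reference) + 1))
    for i, c_word in enumerate(candidate, start=1):
        curr = [i]
        for j, r_word in enumerate(reference, start=1):
            cost = 0 if c_word == r_word else 1
            curr.append(min(prev[j] + 1,          # delete c_word
                            curr[j - 1] + 1,      # insert r_word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def waft(candidate, reference):
    """WAFT under the assumption that d + s + i is the edit distance."""
    cand_words, ref_words = candidate.split(), reference.split()
    longest = max(len(ref_words), len(cand_words))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - edit_distance(cand_words, ref_words) / longest


print(waft("a b c", "a x c"))  # one substitution over three words
```

Because the edit distance never exceeds the longer of the two lengths, the score stays between 0 and 1.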

TER

$$\hbox{TER} = \frac{\hbox{number of edits}}{\hbox{avg. number of ref. words}}$$
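A simplified sketch of the TER ratio follows. Full TER also counts moving a contiguous block of words (a shift) as a single edit. The sketch leaves shifts out and counts only insertions, deletions and substitutions, so it can overestimate the true TER. It also assumes that edits are measured against the closest reference. The names `word_edits` and `simplified_ter` are illustrative.

```python
def word_edits(hyp, ref):
    """Minimal insertions, deletions and substitutions turning the word
    list `hyp` into the word list `ref` (no shifts)."""
    row = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        diag, row[0] = row[0], i
        for j, r in enumerate(ref, start=1):
            # `diag` holds the previous row's value at column j - 1.
            diag, row[j] = row[j], min(row[j] + 1,       # delete h
                                       row[j - 1] + 1,   # insert r
                                       diag + (h != r))  # match/substitute
    return row[-1]


def simplified_ter(hypothesis, references):
    """Edits to the closest reference divided by the average reference
    length. Assumes at least one non-empty reference."""
    hyp = hypothesis.split()
    refs = [reference.split() for reference in references]
    edits = min(word_edits(hyp, ref) for ref in refs)
    avg_ref_len = sum(len(ref) for ref in refs) / len(refs)
    return edits / avg_ref_len


print(simplified_ter("a b", ["a c b"]))  # one insertion over three words
```

Unlike the accuracy-style scores above, this is an error rate: lower is better, and values above 1 are possible when the hypothesis is much longer than the references.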

HTER

METEOR

Evaluation of evaluation metrics

An automatic metric is judged by how well its scores correlate with manual evaluation.
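One common way to quantify such a correlation is the Pearson coefficient between per-system metric scores and averaged human judgements. The sketch below is only an illustration: the section does not say which coefficient is used, and the score lists are made-up placeholder values, not measured results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / sqrt(var_x * var_y)


# Placeholder values for four hypothetical MT systems (not real data).
metric_scores = [0.21, 0.35, 0.28, 0.40]  # e.g. an automatic metric
human_scores = [2.5, 3.8, 3.0, 4.1]       # e.g. mean adequacy on the 1-5 scale
print(f"correlation: {pearson(metric_scores, human_scores):.2f}")
```

A value close to 1 would mean that the metric ranks systems much as human annotators do.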

EuroMatrix

EuroMatrix II

Round-trip translation

Factored translation models I

Tree-based translation models

Synchronous phrase grammar

Parallel tree-bank

Syntactic rules extraction

Hybrid systems of machine translation

Hybrid SMT+RBMT

Computer-aided Translation

Translation memory

Questions: examples