Machine Translation: Statistical approach

Back to Introduction, Rule-based systems or forward to Evaluation.


SMT scheme (IBM)



Parallel corpora I

Parallel corpora II

Translation memories

Comparable corpora

Sentence alignment

P alignment
0.89 1:1
0.0099 1:0, 0:1
0.089 2:1, 1:2
0.011 2:2

Church-Gale alignment


Comparison of sentence aligners

Probability and statistics basics for SMT

Zipf’s law


$r$ rank, $f$ = word freq., $c$ = constant; $r \times f = c$

Probability distribution


Binomial distribution



Conditional probability


Bayes’s rule

$$p(x|y) = \frac{p(y|x) . p(x)}{p(y)}$$

SMT noisy channel principle

Claude Shannon (1948), self-correcting codes transfered through noisy channels based on information about the original data and errors made in the channels.

Used for MT, ASR, OCR. Optical Character Recognition is erroneous but we can estimate what was damaged in a text (with a language model); errors l$\leftrightarrow$1$\leftrightarrow$I, rn$\leftrightarrow$m etc.

$$e^* = \arg\max_e p(e|f)$$ $$= \arg\max_e \frac{p(e)p(f|e)}{p(f)}$$ $$= \arg\max_e p(e)p(f|e)$$

Reformulation of MT formula

$$e^* = \arg\max_e \sum_{m=1}^M \lambda_m h_m (e, f)$$

(Och et al., 2004)

SMT components

Language model

Translation model

Decoding algorithm

Language models

Noam Chomsky, 1969:

But it must be recognized that the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term.

Fred Jelinek, 1988, IBM:

Anytime a linguist leaves the group the recognition rate goes up.

What is it good for?

What is the probability of utterance of $s$?

I go to home vs. I go home

What is the next, most probable word?

Ke snídani jsem měl celozrnný …

{chléb $>$ pečivo $>$ zákusek $>$ mléko $>$ babičku}

Language models probability of a sentence

Probability of word sequences

$p_{LM}(\hbox{včera jsem jel do Brna})$

$p_{LM}(\hbox{včera jel do Brna jsem})$

$p_{LM}(\hbox{jel jsem včera do Brna})$

Chomsky was wrong

Colorless green ideas sleep furiously


Furiously sleep ideas green colorless

LM assigns higher p to the 1st! (Mikolov, 2012)

LM for fluency and WSD

N-gram models

Generating random text

To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have Every enter now severally so, let. (unigrams)

Sweet prince, Falstaff shall die. Harry of Monmouth’s grave. This shall forbid it should be branded, if renown made it empty. (trigrams)

Can you guess the author of the original text?


N-grams: chain rule

$W = w_1, w_2, \cdots, w_n$

$$p(W) = \prod_{i=1}^n p(w_n | w_1 \dots w_{n-1})$$

$$= p(w_1)p(w_2|w_1)p(w_3|w_1,w_2)\dots p(w_n|w_1\dots w_{n-1})$$

Markov’s chain, process, assumption

Since $p$ is not available for longish word sequences, we limit the history to $m$ words using Markov’s assumption

$p(w_n|w_1 \dots w_{n-1}) \simeq p(w_n|w_{n-m+1} \dots w_{n-2} w_{n-1})$

For guessing a following word it is sufficient to consider ($m-1$) preceding words. $m$ is order of a model.

Usually 3- to 5-grams.

Maximum likelihood estimation

$$p(w_3|w_1,w_2) = \frac{count(w_1, w_2, w_3)}{\sum_w count(w_1, w_2, w)}$$

(the, green, *): 1,748× in EuroParl

w count p(w)
paper 801 0.458
group 640 0.367
light 110 0.063
party 27 0.015
ecu 21 0.012

Maximum likelihood principle

$$ P(\mbox{chléb}|\mbox{ke snídani jsem měl celozrnný}) $$ $$= {{|\mbox{ke snídani jsem měl celozrnný chléb}|} \over{|\mbox{ke snídani jsem měl celozrnný}|}}$$

LM quality

We need to compare quality of various LMs.

2 approaches: extrinsic and intrinsic evaluation.

A good LM should assign a higher probability to a good (looking) text than to an incorrect text. For a fixed testing text we can compare various LMs.

Extrinsic evaluation


$E(X) = -\sum_{i=1}^n {p(x_i) \log_2 p(x_i)}$


$$H(p, q) = -\sum_x p(x), \log q(x)$$

$$\begin{align*} H(p_{LM}) &= -\frac{1}{n} \log p_{LM}(w_1, w_2, \dots w_n)\
&= - \frac{1}{n} \sum_{i=1}^{n} \log p_{LM}(w_i|w_1, \dots w_{i-1}) \end{align*}$$

Corresponds to a measure of uncertainty of a probability distribution when predicting true data. The lower the better.

The second formula is a Monte Carlo estimate of the true cross entropy.

A good LM should reach entropy close to a real entropy of language. That can not be measured but quite reliable estimates do exist, e.g. Shannon’s game.


$$PP = 2^{H(p_{LM})}$$

$$PP(W) = p(w_1 w_2 \dots w_n)^{-{1\over N}}$$

A good LM should not waste $p$ for improbable phenomena. The lower entropy, the better $\rightarrow$ the lower perplexity, the better.

Minimizing probabilities = minimizing perplexity.

Perplexity, example

Perplexity is weighted branching factor.

Shannon’s game

Probability distribution for a next letter in a text depends on previous letters.

Some letters bear more information than the others (usually at the beginning of words).

When averaged, we have an estimation of entropy / perplexity of language.

What influences LM quality?

Large LM - n-gram counts

How many unique n-grams are in a corpus?

order types singletons %
unigram 86,700 33,447 (38,6%)
bigram 1,948,935 1,132,844 (58,1%)
trigram 8,092,798 6,022,286 (74,4%)
4-gram 15,303,847 13,081,621 (85,5%)
5-gram 19,882,175 18,324,577 (92,2%)

Taken from Europarl with 30 mil. tokens.

Zero frequency, OOV, rare words

Language models smoothing

We need to distinguish $p$ also for unseen data.

$$\forall w. p(w) > 0$$

The issue is more serious for high-order models.

Smoothing: amend ML of n-grams to expected counts in any data (different corpora).


Add-one smoothing (Laplace)

Maximum likelihood estimation assigns $p$ based on

$$p = \frac{c}{n}$$

Add-one smoothing uses

$$p = \frac{c+1}{n+v}$$

where $v$ is amount of all possible n-grams. All possible n-grams might outnumber real n-grams by several magnitudes.

Europarl has 139,000 types → 19 billion possible bigrams. In fact it has only 53 mil. tokens. How many bigrams?

This smoothing overvalues unseen n-grams.

Add-$\alpha$ smoothing

We won’t add 1, but $\alpha$. This can be estimated for the smoothing to be the most just and balanced.

$$p = \frac{c+\alpha}{n+\alpha v}$$

$\alpha$ can be obtained experimentally: we can try several different values and find the best one using perplexity.

Usually it is very small (0.000X).

Deleted estimation

We can find unseen n-grams in another corpus. N-grams contained in one of them and not in the other help us to estimate general amount of unseen n-grams.

E.g. bigrams not occurring in a training corpus but present in the other corpus million times (given the amount of all possible bigrams equals 7.5 billions) will occur approx. $$\frac{10^6}{7.5\times 10^9} = 0.00013\times$$

Good-Turing smoothing

We need to amend occurrence counts (frequencies) of n-grams in a corpus in such a way they correspond to general occurrence in texts. We use frequency of frequencies: number of n-grams which occur n×

We use frequency of hapax legomena (singletons in data) to estimate unseen data.

$$r^* = (r+1)\frac{N_{r+1}}{N_r}$$

Especially for n-grams not in our corpus we have $$r^*_0 = (0+1)\frac{N_1}{N_0} = 0.00015$$

where $N_1 = 1.1\times 10^6$ a $N_0 = 7.5\times 10^9$ (Europarl).

Missing f-of-f are smoothed.

Example of Good-Turing smoothing (Europarl)

$c$ FF $c^*$
0 7,514,941,065 0.00015
1 1,132,844 0.46539
2 263,611 1.40679
3 123,615 2.38767
4 73,788 3.33753
5 49,254 4.36967
6 35,869 5.32929
8 21,693 7.43798
10 14,880 9.31304
20 4,546 19.54487


Comparing smoothing methods (Europarl)

method perplexity
add-one 382.2
add-$\alpha$ 113.2
deleted est. 113.4
Good-Turing 112.9

Close vs. open vocabulary tasks

Interpolation and back-off

Previous methods treated all unseen n-grams the same. Consider trigrams

nádherná červená řepa
nádherná červená mrkev

Despite we don’t have any of these in our training data, the former trigram should be more probable.

We will use probability of lower order models, for which we have necessary data:

červená řepa
červená mrkev
nádherná červená


$$p_I(w_3|w_1 w_2) = \lambda_1 p(w_3) \times \lambda_2 p(w_3|w_2) \times \lambda_3 p(w_3|w_1 w_2)$$

If we have enough data we can trust higher order models more and assign a higher significance to corresponding n-grams.

$p_I$ is probability distribution, thus this must hold: $$\forall\lambda_n:0 \le \lambda_n \le 1$$ $$\sum_n \lambda_n = 1$$

Stupid back-off

Cache language models

$$p(w|h) = \lambda p(w_i|w_{i-2}w_{i-1}) + (1-\lambda) {c(w \in h) \over |h|}$$

Compressing language models

Extrinsic evaluation: sentence completion

Model Acc
Human 90
simple 4-gram 34
smoothed 3-gram 36
simple character model 38
smoothed 4-gram 39
LSA probability 49
RNN 59

Sentence completion

The stage lost a fine XXX, even as science lost an acute reasoner, when he became a specialist in crime.
a) linguist b) hunter c) actor d) estate e) horseman

What passion of hatred can it be which leads a man to XXX in such a place at such a time.
a) lurk b) dine c) luxuriate d) grow e) wiggle

My heart is already XXX since i have confided my trouble to you.
a) falling b) distressed c) soaring d) lightened e) punished

My morning’s work has not been XXX, since it has proved that he has the very strongest motives for standing in the way of anything of the sort.
a) invisible b) neglected c) overlooked d) wasted e) deliberate

That is his XXX fault, but on the whole he’s a good worker.
a) main b) successful c) mother’s d) generous e) favourite

Agreement completion challenge

Údajně existuje studiová session, co jste nahrávalX s Iggy Popem.
ZtratilX téměř 5 kol a vyjíždělX pln odhodlání na dvanáctém místě.
Ahojte všechny, dnes jsme vstávalX brzy, páč se jelX na houby.
Sice jsem vyrůstalX v Plzni, ale Budějovice, to je moje první láska.
Pokud prodejce zaniklX a jeho živnost převzalX jiná firma, máte právo uplatnit reklamaci u této.

Neural network language models

Feed-forward model

Feed-forward NM

Recurrent Neural Network



Word embeddings

Skip gram and CBOW





From Morpho-syntactic Regularities in Continuous Word Representations: A Multilingual Study (Nicolai, Cherry, Kondrak, 2015)

Embeddings in MT

Long short-term memory

Translation models

Lexical translation

Standard lexicon does not contain information about frequency of translations of individual meanings of words.

key → klíč, tónina, klávesa

How often are the individual translations used in translations?

key → klíč (0.7), tónina (0.18), klávesa (0.11)

probability distribution $p_f$: $$\sum_e p_f(e) = 1$$ $$\forall e: 0 \le p_f(e) \le 1$$

Word alignment, alignment function

Translations differ in number of words and in word order. SMT uses alignment function $$a:j\rightarrow i$$ where $j$ is position in TS, $i$ in SS.

i: 1 2 3 4 5
the castle is very old
ten hrad je velmi starý
j: 1 2 3 4 5

$a$ is function, therefore for each word $w_e$ from the target sentence there is exactly one word $w_f$ from the source sentence.

Word alignment: examples

IBM model 1

$p_f$ not available. We split translation into several steps: generative modeling.

$$p(e, a|f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j|f_{a(j)})$$

where $e = (e_1, \dots e_{l_e})$ is target sentence, $f =(f_1, \dots f_{l_f})$ source sentence, $l_e$ is target sentence length, $l_f$ source sentence length, $\epsilon$ is normalizing constant, $(l_f + 1)^{l_e}$ is number of all possible alignments whereas $l_f$ is increased by $1$ due to NULL, $t$ is probability translation function.

Translation probability

For $p(e, a|f)$ we need $t$ for all words (pairs).

Parallel corpora with aligned sentences!

Word-level alignment in corpora (CEMAT) not available. Word-alignment by expectation-maximization (EM) algorithm.

EM algorithm

EM algorithm

Translation probability in EM

$$t(w_e|w_f) = \frac{\sum_{(e,f)} c(w_e|w_f; e, f)}{\sum_{w_e} \sum_{(e, f)} c(w_e|w_f; e, f)}$$

EM algorithm - initialization


EM algorithm - final phase


IBM models

IBM-1 does not take context into account, cannot add and skip words. Each of the following models adds something more to the previous.


In IBM-1, all translations in different word order are of the same probability.

$p(e|f) = \sum_a p(e, a|f)$

IBM-2 adds explicit model of alignment: alignment probability distribution: $$a(i|j, l_w, l_f)$$ where $i$ is source word position, $j$ target word position.

IBM-2 – 2 steps of translation

Translation is split to two steps. In the first, lexical units are translated and in the second, words are rearranged according to the model.



The first step is the same as in IBM-1, $t(e|f)$ is used. Function $a$ with probability distribution $a$ is in the opposite direction to the translation direction. Both distributions are combined to formula for IBM-2:

$$p(e, a|f) = \epsilon \prod_{j=1}^{l_e} t(e_j|f_{a(j)}) a(a(j)|j, l_e, l_f)$$

$$p(e|f) = \sum_a p(e, a|f)$$ $$= \epsilon \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j|f_i) a(i|j, l_e, l_f)$$


Models 1 and 2 don’t consider situations where one word is translated to two and more words, or is not translated at all. IBM-3 solves this with fertility, which is modelled with a probability distribution $$n(\phi|f)$$

For each source word $f$, the distribution $n$ expresses, to how many words $f$ is usually translated.

$n(0|\hbox{a})$ = 0.999
$n(1|\hbox{king})$ = 0.997
$n(2|\hbox{steep})$ = 0.25

NULL token insertion

If we want to translate properly to a target language which uses words without translational equivalents we have to tackle with inserting NULL token.

$n(x|NULL)$ is not used since NULL insertion depends on a sentence length.

We add another step NULL insertion to the translation process. We use $p_1$ and $p_0 = 1 - p_1$, where $p_1$ stands for probability of NULL token insertion after any word in a sentence.



IBM-3 – distortion

The last step is almost the same as the 2. step in IBM-2 and is modelled with distortion probability distribution: $$d(j|i, l_e, l_f)$$

which models positions in an opposite direction: for each source word at position $i$ it models position $j$ of a target word.




The problem of distortion lies in sparse data for long sentences. IBM-4 introduces relative distortion, where rearrangements of word positions depend on particular previous words. It comes from an assumption that we translate in phrases, which are moved together, or some rearrangement are more frequent (English: ADJ SUB → French SUB ADJ etc.).


solves other shortcomings of the previous models. E.g. it takes care of situations where two different source words could get to one target position (tracking of free positions).

Word alignment matrix


Word alignment issues

Word alignment

Word alignment

Phrase-base Translation Model


Phrases not linguistically, but statistically motivated. German am is seldom translated with single English to. Cf. (fun (with (the game)))

Advantages of PBTM

Phrase-based model

Translation probability $p(f|e)$ is split to phrases:

$$ p(\bar{f}_1^I|\bar{e}1^I) = \prod{i=1}^I \phi(\bar{f}_i|\bar{e}i) d(start_i - end{i-1} - 1) $$

Sentence $f$ is split to $I$ phrases $\bar{f}_i$, all segmentations are of the same probability. Function $\phi$ is translation probability for phrases. Function $d$ is distance-based reordering model. $\hbox{start}_i$ is position of the first word of phrase from sentence $f$, which is translated to $i$-th phrase of sentence $e$.

Distance-based reordering model

$$d(x) = \alpha^{abs(x)}$$

Reordering, example


phr trans move d
1 1-3 at beg. 0
2 6 skip 4-5 +2
3 4-5 back over 4-6 -3
4 7 skip 6 +1

Building phrase matrix

Re-use word alignments from EM algorithm.
Look for consistent phrases.

Phrase $\bar{f}$ and $\bar{e}$ are consistent with alignment $A$ if all words $f_1, \dots f_n$ in phrase $\bar{f}$, which have alignment in $A$, are aligned with words $e_1, \dots e_n$ in phrase $\bar{e}$ and vice versa.


Phrase extraction


Extracted phrases

phr1 phr2
michael michael
assumes geht davon aus / geht davon aus
that dass / , dass
he er
will stay bleibt
in the im
house haus
michael assumes michael geht davon aus / michael geht davon aus ,
assumes that geht davon aus , dass
assumes that he geht davon aus , dass er
that he dass er / , dass er
in the house im haus
michael assumes that michael geht davon aus , dass

Phrase translation probability estimation

$$\phi(\bar{f}|\bar{e}) = \frac{\mbox{count}(\bar{e},\bar{f})}{\sum_{\bar{f_i}}\mbox{count}(\bar{e},\bar{f_i})} $$

Phrase-based model of SMT

$$e^* = \mbox{argmax}_e \prod_{i=1}^I \phi(\bar{f}_i|\bar{e}_i) ; d(start_i-end_{i-1}-1)$$ $$\prod_{i=1}^{|{\bf e}|} p_{LM}(e_i|e_1…e_{i-1})$$


Given a model $p_{LM}$ and translation model $p(f|e)$ we need to find a translation with the highest probability but from exponential number of all possible translations.

Heuristic search methods are used.

It is not guaranteed to find the best translation.

Errors in translations are caused by

Example of noise-induced errors (Google Translate)

Phrase-wise sentence translation


In each step of translation we count preliminary values of probabilities from the translation, reordering and language models.

Search space of translation hypotheses


Exponential space of all possible translations →
limit this space!


Tools for SMT

Regular events

Annual evaluation of SMT quality. Test set preparing, manual evaluation events, etc.

Neural network machine translation

NN models in MT


Encoder-decoder framework


Summary vector for sentences


Problem with long sentences


Problem: any sentence is compressed to a fixed size vector.
Solution: bidirectional RNN + learning attention.

Bidirectional RNN


Attention mechanism

A neural network with a single hidden layer, a single scalar output


Alignment from attetion


More details

Alignment with co-occurrence statistics

$$Dice = \frac{2 f_{xy}}{f_x + f_y}$$

$$logDice = 14 + \log_2 D$$