Machine Translation: Rule-based systems

Rule-based Machine Translation

Knowledge-based Machine Translation

KBMT classification

The only types of MT until the 1990s.

Direct translation

MT with interlingua

Rosetta

It should be stressed that the isomorphy and not the interlinguality is the primary characteristic of our approach.

Two sentences are considered translations of each other if they have the same semantic derivation trees, i.e. corresponding syntactic derivation trees.

             Rosetta2        Rosetta3      Rosetta4
release      1985            1988          1991
speed        1-3 words/sec   ?             ?
dictionary   5,000           90,000        ?
SL           NL, EN          NL, EN        NL, EN
TL           NL, EN          NL, EN, ES    NL, EN, ES

KBMT-89

[Figure: KBMT-89 scheme]

Nirenburg, Sergei. Knowledge-based machine translation. Machine Translation 4.1 (1989): 5-24.

Transfer translation

PC translator (LangSoft)

Systran

Interlingua vs. transfer

Source language analysis

Tokenization

Obstacles of tokenization

Scriptio continua

Thai

What is a word?

Tokenization

Sentence segmentation

Obstacles of sentence segmentation

Morphological level

Morphology

Morphological level

Morphological analysis

Morphological tags, tagset

DEMO: tag list

BNC tags

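# Tag frequency list from the first 10,000 lines of a vertical corpus file (VERT).
head -n 10000 VERT |
grep -v "^<" |   # drop structural markup lines such as <s> or <doc>
cut -f3 |        # keep the third tab-separated column (the tag column in this file)
sort |
uniq -c |        # count identical tags
sort -rn         # most frequent tags first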
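
The same frequency list can be sketched in Python. This is only a minimal sketch: it assumes a vertical file with one token per line, tab-separated columns with the tag in the third column, and structural lines starting with `<`. The function name `count_tags` and the sample lines are made up for illustration.

```python
from collections import Counter

def count_tags(lines, column=2):
    """Count the values of one tab-separated column, skipping markup lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue  # structural markup such as <s> or </s>
        fields = line.split("\t")
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts

# Hypothetical sample in vertical format: word, lemma, tag.
sample = [
    "<s>",
    "The\tthe\tAT0",
    "dog\tdog\tNN1",
    "runs\trun\tVVZ",
    "</s>",
    "<s>",
    "A\ta\tAT0",
    "cat\tcat\tNN1",
    "</s>",
]

tag_counts = count_tags(sample)
# Print the tags from most to least frequent, like `sort -rn` does.
for tag, n in tag_counts.most_common():
    print(n, tag)
```

For a real corpus, the list of lines would be replaced by an open file handle, possibly limited to the first 10,000 lines with `itertools.islice`.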

Morphological polysemy

Morphological disambiguation

Statistical disambiguation

Rule-based disambiguation

Morphological segmentation

Guesser

Morphological disambiguation—example

word         analyses   disambiguation
Pravidelné   k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1, k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1, k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1, k2eAgNnSc5d1, ... (+ 5)   k2eAgNnSc1d1
krmení       k2eAgMnPc1d1, k2eAgMnPc5d1, k1gNnSc1, k1gNnSc4, k1gNnSc5, k1gNnSc6, k1gNnSc3, k1gNnSc2, k1gNnPc2, k1gNnPc1, k1gNnPc4, k1gNnPc5   k1gNnSc1
je           k5eAaImIp3nS, k3p3gMnPc4, k3p3gInPc4, k3p3gNnSc4, k3p3gNnPc4, k3p3gFnPc4, k0   k5eAaImIp3nS
pro          k7c4   k7c4
správný      k2eAgMnSc1d1, k2eAgMnSc5d1, k2eAgInSc1d1, k2eAgInSc4d1, k2eAgInSc5d1, ... (+ 18)   k2eAgInSc4d1
růst         k5eAaImF, k1gInSc1, k1gInSc4   k1gInSc4
důležité     k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1, k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1, k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1, k2eAgNnSc5d1, ... (+ 5)   k2eAgNnSc1d1

(Columns are separated by three or more spaces. The example sentence: Pravidelné krmení je pro správný růst důležité. = Regular feeding is important for proper growth.)
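
The tags in the table are attribute-value strings: each lowercase letter names an attribute and the character after it gives its value. A minimal Python sketch of decoding such tags and of one rule-based filtering step follows. The attribute names in `ATTRIBUTES` are an assumption based on how this kind of tagset is usually documented, and `decode_tag` and `filter_by` are hypothetical helper names, not part of any tagger.

```python
import re

# Attribute letters and their usual interpretation (an assumption, see lead-in).
ATTRIBUTES = {
    "k": "part of speech",
    "e": "negation",
    "g": "gender",
    "n": "number",
    "c": "case",
    "d": "degree",
    "a": "aspect",
    "m": "verb form",
    "p": "person",
}

PAIR = re.compile(r"([a-z])([A-Za-z0-9])")

def decode_tag(tag):
    """Split a tag such as 'k1gInSc4' into {'k': '1', 'g': 'I', 'n': 'S', 'c': '4'}."""
    pairs = PAIR.findall(tag)
    if "".join(attr + value for attr, value in pairs) != tag:
        raise ValueError(f"not a sequence of attribute-value pairs: {tag!r}")
    return dict(pairs)

def filter_by(analyses, **required):
    """Keep only the analyses whose attributes match all required values."""
    return [
        tag for tag in analyses
        if all(decode_tag(tag).get(attr) == value for attr, value in required.items())
    ]

# Analyses of "růst" copied from the table above.
rust = ["k5eAaImF", "k1gInSc1", "k1gInSc4"]

# The preposition "pro" (k7c4) requires case 4 in the phrase it governs,
# so a simple rule can discard every analysis that is not in case 4.
remaining = filter_by(rust, c="4")
print(remaining)
```

With these inputs the filter leaves a single analysis, `k1gInSc4`, which matches the disambiguation column. A real rule-based disambiguator needs many such rules and a way to decide which word a preposition governs.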

Universal POS tags

TAG    Meaning
VERB   verbs (all tenses and modes)
NOUN   nouns (common and proper)
PRON   pronouns
ADJ    adjectives
ADV    adverbs
ADP    adpositions (prepositions and postpositions)
CONJ   conjunctions
DET    determiners
NUM    cardinal numbers
PRT    particles or other functional words
X      other: foreign words, typos, abbreviations
.      punctuation

Mappings exist for approx. 25 languages (those with treebanks).
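
Such a mapping is essentially a lookup table from a language-specific tagset to the universal tags. A minimal sketch in Python, assuming Penn Treebank tags as the source tagset. The dictionary below is a small illustrative subset written for this example, not the official mapping file, and `to_universal` is a hypothetical helper name.

```python
# Illustrative subset of a Penn Treebank -> universal tag mapping (an assumption).
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "PRP": "PRON", "PRP$": "PRON",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "IN": "ADP",
    "CC": "CONJ",
    "DT": "DET",
    "CD": "NUM",
    "RP": "PRT", "TO": "PRT",
    "FW": "X",
    ".": ".", ",": ".", ":": ".",
}

def to_universal(tag):
    """Map a fine-grained tag to a universal tag; unknown tags fall back to X."""
    return PENN_TO_UNIVERSAL.get(tag, "X")

# Hand-tagged example sentence (word, Penn tag).
tagged = [("The", "DT"), ("dog", "NN"), ("runs", "VBZ"), ("fast", "RB"), (".", ".")]
universal = [(word, to_universal(tag)) for word, tag in tagged]
print(universal)
```

The mapping is many-to-one, so distinctions such as tense or number are lost. That loss is the price of a tagset that is comparable across languages.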

Guessing POS from grammemes

EN     CZ                  meaning
-s                         3rd person, sing., present simple
-ed    -al, -l, -en.       past tense
-ing   -(ov)ání            present continuous
-en    -en(.)              past participle
-s     -y, -i, -ové, -a    plural
-’s    ov(o, a, y)         possession
-er    -ší                 comparative
-est   nej-, -ší           superlative
you    -’s                 pronoun

A problem: myší, west, fotbal, … → myšám, wer, fotbala, božit
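
The problem is easy to reproduce with a naive suffix guesser. The following Python sketch uses the English suffixes from the table above. The rule order, the labels, and the function name `guess` are choices made for this illustration only.

```python
# English suffix rules from the table, checked longest first.
EN_SUFFIX_RULES = [
    ("est", "superlative"),
    ("ing", "present continuous"),
    ("ed", "past tense"),
    ("en", "past participle"),
    ("er", "comparative"),
    ("'s", "possession"),
    ("s", "plural or 3rd person sing."),
]

def guess(word):
    """Return (suffix, label) for the first matching suffix rule, or None."""
    for suffix, label in EN_SUFFIX_RULES:
        # Require a non-empty stem so that "s" alone is not analysed.
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix, label
    return None

for word in ["fastest", "walked", "dog's", "west", "bed", "dog"]:
    print(word, guess(word))
```

The guesser handles "fastest" and "walked" as intended, but it also labels "west" as a superlative and "bed" as a past tense form. A lexicon of known stems, or statistics from a tagged corpus, is needed to block such analyses.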

Brill’s tagger

Problems of MA, POSes

Morphology—summary

Lexical level

Dictionaries in MT

Polysemy in dictionaries

Smooth sense transitions


[Figures: log, log chair, chair]

Polysemy on several levels

Meaning representation

[Figure: semantic types]

Semantic network—WordNet

VerbaLex

Word sense disambiguation

WSD: deep methods

WSD: shallow methods

Granularity: cat

WordNet

Granularity: oko

Granularity: dát

VerbaLex lists 32 (!) senses (non-reflexive variants).

Granularity: malý

Granularity for MT

The granularity of translation dictionaries may be enough: a word $w$ has exactly as many senses as it has equivalents in a dictionary.
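
Under this view, counting senses reduces to counting dictionary equivalents. A minimal Python sketch, assuming a tiny hand-made English-Czech dictionary. The entries are illustrative and not taken from any real lexicon, and `sense_count` is a hypothetical helper name.

```python
# Toy English-Czech translation dictionary (hand-made for illustration).
DICTIONARY = {
    "log": ["kláda", "záznam", "logaritmus"],
    "chair": ["židle", "předseda"],
    "trout": ["pstruh"],
}

def sense_count(word, dictionary=DICTIONARY):
    """Number of senses of a word = number of its distinct equivalents."""
    return len(set(dictionary.get(word, [])))

for w in ["log", "chair", "trout", "unknown"]:
    print(w, sense_count(w))
```

The count depends on the target language: the same word may have a different number of equivalents in another dictionary, so the sense inventory is relative to the language pair.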

What is the most polysemous word in English?

Answers from Wordnik and PDEV.

Lexica: summary

Kilgarriff, Adam. I don’t believe in word senses. Computers and the Humanities 31.2 (1997): 91–113.

Syntactic level

Syntactic analysis

Context-free grammar

Grammars

Types of analyses

Why syntactic analysis?

Syntactic ambiguity

Partial syntactic polysemy—garden path

…cognitive plausibility of parsing.

Phrase structure

Example

S   -> NP VP
VP  -> ADV V | V ADV
NP  -> DET N
DET -> the | a | an
N   -> cat | dog
...

Analyse: the dog runs fast (bottom-up and top-down)
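
The top-down analysis can be sketched as a small backtracking parser in Python. The grammar is the one above, but the rules `V -> runs` and `ADV -> fast` are assumptions added here so that the example sentence is covered; the original leaves them elided. The function names are made up for this sketch, and the bottom-up variant is not shown.

```python
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "VP":  [["ADV", "V"], ["V", "ADV"]],
    "NP":  [["DET", "N"]],
    "DET": [["the"], ["a"], ["an"]],
    "N":   [["cat"], ["dog"]],
    "V":   [["runs"]],   # assumption: not listed in the grammar above
    "ADV": [["fast"]],   # assumption: not listed in the grammar above
}

def parse(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` derives tokens[start:end]."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must match the next token exactly.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, start, [])

def expand(symbol, rhs, tokens, pos, children):
    """Match the right-hand side left to right, collecting child trees."""
    if not rhs:
        yield (symbol, children), pos
        return
    for child, nxt in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, nxt, children + [child])

def parse_sentence(sentence):
    """Return all parse trees of S that cover the whole sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

trees = parse_sentence("the dog runs fast")
print(trees)
```

The parser starts from `S` and expands rules until it reaches words, trying the alternative `VP -> ADV V` first and backtracking to `VP -> V ADV` when "runs" does not match `ADV`. This simple scheme would loop forever on a left-recursive grammar, which this grammar is not.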

Phrasal tree

Constituency (phrasal structure)

Dependency structure

Dependency tree

Dependency

Hybrid trees

Hybrid tree I

Hybrid tree (SET)

Evaluation of parsing quality

Transfer translation

Transfer scheme

Example of transfer rules I

Example of transfer rules II

From Arturo Trujillo, Translation Engines: Techniques for Machine Translation.

Writing rules

You like her. vs. Ella te gusta.
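
In this pair the arguments swap roles: the English object becomes the Spanish subject, and the English subject becomes an indirect object. A minimal Python sketch of such a structural transfer rule follows. The dictionary-based sentence representation, the two tiny lexicons, and the function names are all hypothetical and written only for this example.

```python
# Hypothetical toy lexicons for the two argument positions.
SUBJECT_ES = {"her": "ella", "him": "él"}   # English object -> Spanish subject pronoun
CLITIC_ES = {"you": "te", "I": "me"}        # English subject -> Spanish indirect-object clitic

def transfer_like(source):
    """Map an English 'like' structure to a Spanish 'gustar' structure."""
    if source["pred"] != "like":
        raise ValueError("this rule only handles the predicate 'like'")
    return {
        "pred": "gustar",
        "subj": SUBJECT_ES[source["obj"]],   # the liked entity becomes the subject
        "iobj": CLITIC_ES[source["subj"]],   # the liker becomes the indirect object
    }

def generate_es(target):
    """Linearise the target structure; a 3rd person singular subject is assumed."""
    return f"{target['subj'].capitalize()} {target['iobj']} gusta."

english = {"pred": "like", "subj": "you", "obj": "her"}
spanish = generate_es(transfer_like(english))
print(spanish)
```

A word-for-word (direct) translation cannot produce this output, because the rule has to operate on the analysed structure rather than on the word order.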

Transfer syntax

Classes of rules

Semantic level / analysis

Semantic roles

  Dítě     škádlí  lvíče.
  AG/SUBJ  V       PAT/OBJ

  A child (SUBJ)    teases (V) a lion cub (OBJ).
  A lion cub (SUBJ) teases (V) a child (OBJ).

Errors propagated from below

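zatímco trhal prsty svého pstruha

(George R. R. Martin, Hostina pro vrány [A Feast for Crows, Czech translation])

Two readings: "while he tore his trout apart with his fingers" (prsty as instrumental) vs. "while he tore the fingers of his trout" (prsty as accusative).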

FrameNet

FrameNet: Closure

An Agent manipulates a Fastener to open or close a Containing_object (e.g. coat, jar). Sometimes an Enclosed_region or a Container_portal may be expressed. Since the Manipulator is syntactically omissible, many verbs in this frame incorporate the Fastener.

Mary closed her coat with a belt.

Prague Dependency TreeBank 2.0

PDT layers

TectoMT

TectoMT: a simple block

English negative particles → verb attributes

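sub process_document {
  my ($self,$document) = @_;

  # Each bundle holds the trees that belong to one sentence.
  foreach my $bundle ($document->get_bundles()) {
    # Analytical (surface-syntax) tree of the English source sentence.
    my $a_root = $bundle->get_tree('SEnglishA');

    foreach my $a_node ($a_root->get_descendants) {
      # Take the first effective parent of the node.
      my ($eff_parent) = $a_node->get_eff_parents;
      # A negative particle ("not", "n't") under a verb is marked as
      # auxiliary, so it can later be folded into the verb's attributes.
      if ($a_node->get_attr('m/lemma')=~/^(not|n't)$/
          and $eff_parent->get_attr('m/tag')=~/^V/ ) {
        $a_node->set_attr('is_aux_to_parent',1);
      }
    }
  }
}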

Tecto-align

Analysis in RBMT

Synthesis in RBMT, issues

Rule-based systems: conclusion