Statistical Machine Translation

Statistical Machine Translation
Hicham El-Zein and Jian Li
University of Waterloo
May 20, 2015

Overview
1 Introduction
2 Challenges in Machine Translation
3 Classical Machine Translation
4 Statistical MT: The Noisy Channel Model; The IBM Translation Models
5 Phrase-Based Translation

Introduction
Machine Translation is the problem of automatically translating from one language (the source language) to another (the target language). It is one of the oldest problems in Artificial Intelligence and Computer Science, and one with huge impact and implications.

Challenges in Machine Translation
Lexical ambiguity: a word can have distinct meanings. For example:
book the flight vs. read the book
the box was in the pen vs. the pen was on the table
Different word orders. English word order is subject - verb - object; Japanese word order is subject - object - verb:
English: The dog saw the cat.
Japanese: The dog the cat saw.

Challenges in Machine Translation
Syntactic structure is not preserved across translations.
English: The bottle floated into the cave.
Spanish: La botella entró a la cueva flotando. (The bottle entered the cave floating.)
Floated, a verb in English, is translated to an adverbial (flotando) in Spanish; into, a preposition, is translated to the main verb entered (entró).

Challenges in Machine Translation
Syntactic ambiguity causes problems. For example, the sentence Call me a cab. has two different meanings.
Pronoun resolution:
The computer outputs the data, it is fast.
The computer outputs the data, it is stored in ASCII.
It can refer to the computer or to the data, and we get a different translation for each possibility.

Classical Machine Translation
We will give a high-level description of classical machine translation systems. These systems are rule-based.

Direct Machine Translation
Translation is done word by word, with very little analysis of the source text.
It relies on a large bilingual dictionary: for each word in the source language, the dictionary specifies a set of rules for translating that word.
After the words are translated, simple reordering rules are applied (e.g., move adjectives after nouns when translating from English to French).

Direct Machine Translation
The lack of any analysis of the source language in direct machine translation causes problems, for example:
It is difficult or impossible to capture long-range reordering.
Words are translated without any disambiguation of their syntactic role.

Transfer-Based Approaches
Translation is done in three phases:
Analysis: analyze the source-language sentence; for example, build a syntactic analysis of it.
Transfer: convert the source-language parse tree to a target-language parse tree.
Generation: convert the target-language parse tree to an output sentence.

Transfer-Based Approaches
The parse trees involved can vary from shallow analyses to much deeper ones. The transfer rules might look quite similar to the rules in direct translation systems, but they can now operate on syntactic structures, which makes it easier to handle long-distance reordering.

Interlingua-Based Translation
Translation is done in two phases:
Analysis: analyze the source-language sentence into a (language-independent) representation of its meaning.
Generation: convert the meaning representation into an output sentence.

Interlingua-Based Translation
Advantage: to build a translation system that translates between k languages, we only need to develop k analysis and generation systems. With a transfer-based system, we'd need to develop O(k^2) sets of translation rules.
Disadvantage: what would a language-independent representation look like?

Interlingua-Based Translation
How do we represent different concepts in a unified language? Different languages break down concepts in quite different ways:
German has two words for wall: one for an internal wall, one for a wall that is outside.
Japanese has two words for brother: one for an elder brother, one for a younger brother.
Spanish has two words for leg: one for a human's leg, and the other for an animal's leg or the leg of a table.
A unified language may be the intersection of all languages, but that doesn't seem very satisfactory.

Introduction to Statistical Machine Translation
Motivation: parallel corpora are available in several language pairs.
Basic idea: use a parallel corpus as a training set of translation examples.
Examples: the IBM work on French-English translation used the Canadian Hansards, the proceedings of the Canadian parliament in English and French (1.7 million sentences of 30 words or less in length); Europarl is another such corpus.
The idea goes back to Warren Weaver (1949), who suggested applying statistical and cryptanalytic techniques to translation.

Introduction to Statistical Machine Translation
"... one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." (Warren Weaver, 1949, in a letter to Norbert Wiener)

The Noisy Channel Model
The noisy channel model is a framework used in spell checkers, question answering, speech recognition, and machine translation. It is most familiar from spell checking, but it also gives a simple machine translation model.
Goal: a translation system from a source language (e.g., French) to a target language (e.g., English), f → e.

The Noisy Channel Model
We want a model p(e | f) which estimates the conditional probability of any English sentence e given the French sentence f, using the training corpus to set the parameters.
A noisy channel model has two components:
p(e), the language model
p(f | e), the translation model
By Bayes' rule,
p(e | f) = p(e, f) / p(f) = p(e) p(f | e) / p(f)
and therefore
argmax_e p(e | f) = argmax_e p(e) p(f | e)

More about the Noisy Channel Model
The language model p(e) could be a trigram model, estimated from any English data (a parallel corpus is not needed to estimate its parameters).
The translation model p(f | e) is trained from a parallel corpus of French/English pairs.
Note: the translation model is backwards; the language model can make up for deficiencies of the translation model.
Challenge: how to build p(f | e).
Challenge: how to find argmax_e p(e) p(f | e).

Example from the Koehn and Knight tutorial
Translation from Spanish to English; candidate translations of Que hambre tengo yo, ranked by p(Spanish | English) alone:
What hunger have: p(s | e) = 0.000014
Hungry I am so: p(s | e) = 0.000001
I am so hungry: p(s | e) = 0.0000015
Have i that hunger: p(s | e) = 0.000020
...

Example from the Koehn and Knight tutorial
With p(Spanish | English) × p(English):
What hunger have: p(s | e) p(e) = 0.000014 × 0.000001
Hungry I am so: p(s | e) p(e) = 0.000001 × 0.0000014
I am so hungry: p(s | e) p(e) = 0.0000015 × 0.0001
Have i that hunger: p(s | e) p(e) = 0.000020 × 0.00000098
...
The language model term now pushes the fluent candidate I am so hungry to the top.
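
To make the arithmetic above concrete, here is a minimal Python sketch (the candidate strings and probabilities are taken from the table above; the helper name best_translation is our own, not part of the tutorial):

    # Rerank candidate translations under the noisy channel model:
    # pick the English sentence e maximizing p(s|e) * p(e).
    candidates = [
        # (English candidate, p(s|e), p(e))
        ("What hunger have",   0.000014,  0.000001),
        ("Hungry I am so",     0.000001,  0.0000014),
        ("I am so hungry",     0.0000015, 0.0001),
        ("Have i that hunger", 0.000020,  0.00000098),
    ]

    def best_translation(cands):
        # argmax_e p(s|e) * p(e)
        return max(cands, key=lambda c: c[1] * c[2])

    e, p_s_given_e, p_e = best_translation(candidates)
    print(e)  # "I am so hungry": its product 1.5e-10 beats all the others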

The IBM Translation Models
IBM Model 1
IBM Model 2
EM Training of Models 1 and 2

The IBM Translation Models
Key ideas in the IBM translation models:
alignment variables
translation parameters, t(f_i | e_j)
alignment parameters, q(j | i, l, m)
The EM algorithm is an iterative algorithm for training the q and t parameters. Once the parameters are trained, we can recover the most likely alignments on our training examples.
Nowadays the original IBM models are rarely (if ever) used for translation, but they are still used for recovering alignments.

IBM Model: Alignment
How do we model p(f | e)?
Assume the English sentence e has l words e_1 ... e_l and the French sentence f has m words f_1 ... f_m.
An alignment a identifies which English word each French word originated from.
Example:
English: the dog barks
French: le chien aboie
An alignment: a_1 = 1, a_2 = 2, a_3 = 3

IBM Model: Alignment
How do we model p(f | e)?
Assume the English sentence e has l words e_1 ... e_l and the French sentence f has m words f_1 ... f_m.
An alignment a identifies which English word each French word originated from.
Formally, an alignment a is {a_1, ..., a_m}, where each a_j ∈ {0, ..., l}. There are (l + 1)^m possible alignments.

IBM Model: Alignment - Example
l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
One possible alignment is {2, 3, 4, 5, 6, 6, 6}.
Another (bad) alignment is {1, 1, 1, 1, 1, 1, 1}.

Alignments in the IBM Models
Define models for the alignment distribution p(a | e, m) and the translation distribution p(f | a, e, m).
Goal:
p(f, a | e, m) = p(a | e, m) p(f | a, e, m)
p(f | e, m) = Σ_a p(f, a | e, m) = Σ_a p(a | e, m) p(f | a, e, m)

An example alignment
French: le conseil a rendu son avis, et nous devons à présent adopter un nouvel avis sur la base de la première position.
English: the council has stated its position, and now, on the basis of the first position, we again have to give our opinion.
Alignment (English word - French word): the - le, council - conseil, has - à, stated - rendu, its - son, position - avis, , - ,, and - et, now - présent, , - NULL, on - sur, the - le, basis - base, of - de, the - la, first - première, position - position, , - NULL, we - nous, again - NULL, have - devons, to - a, give - adopter, our - nouvel, opinion - avis

A By-Product: Most Likely Alignments
Once we have a model p(f, a | e, m) = p(a | e, m) p(f | a, e, m), we can also calculate
p(a | f, e, m) = p(f, a | e, m) / p(f | e, m) = p(f, a | e, m) / Σ_a p(f, a | e, m)
For a given (f, e) pair, we can also compute the most likely alignment,
a* = argmax_a p(a | f, e, m)

IBM Model 1: Alignments
In IBM Model 1 all alignments a are equally likely:
p(a | e, m) = 1 / (l + 1)^m
This is a major simplifying assumption.

IBM Model 1: Translation Probabilities
Next step: find an estimate for p(f | a, e, m).
In Model 1, this is
p(f | a, e, m) = Π_{j=1}^{m} t(f_j | e_{a_j})

IBM Model 1: Translation Probabilities - Example
l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
Alignment a = {2, 3, 4, 5, 6, 6, 6}
p(f | a, e, m) = t(le | the) × t(programme | program) × t(a | has) × t(ete | been) × t(mis | implemented) × t(en | implemented) × t(application | implemented)

IBM Model 1: The Generative Process
To generate a French string f from an English string e:
Step 1: pick an alignment a with probability 1 / (l + 1)^m.
Step 2: pick the French words with probability
p(f | a, e, m) = Π_{j=1}^{m} t(f_j | e_{a_j})
The final result:
p(f, a | e, m) = p(a | e, m) p(f | a, e, m) = 1 / (l + 1)^m × Π_{j=1}^{m} t(f_j | e_{a_j})
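
To make the generative story concrete, here is a minimal Python sketch of the Model 1 joint probability (the function name model1_prob and the dictionary encoding of t are our own illustrative choices, not part of the lecture):

    def model1_prob(f, e, a, t):
        """IBM Model 1: p(f, a | e, m) = 1/(l+1)^m * prod_j t(f_j | e_{a_j}).
        f, e: lists of words; a: list with a[j] in 0..len(e), where
        position 0 is the special NULL word; t: dict mapping
        (french_word, english_word) -> probability."""
        l, m = len(e), len(f)
        e_null = ["NULL"] + e            # position 0 is NULL
        p = (1.0 / (l + 1)) ** m         # uniform alignment probability
        for j in range(m):
            p *= t[(f[j], e_null[a[j]])]
        return p

With e = "And the program has been implemented".split(), f = "Le programme a ete mis en application".split() and a = [2, 3, 4, 5, 6, 6, 6], this computes exactly the product shown on the previous slide, scaled by 1/(l+1)^m.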

An Example Lexical Entry
For the English word position:
t(position | position) = 0.7567
t(situation | position) = 0.0548
t(mesure | position) = 0.0282
t(vue | position) = 0.0169
t(point | position) = 0.0125
t(attitude | position) = 0.0109
...

IBM Model 2
Only difference: we now introduce alignment, or distortion, parameters:
q(i | j, l, m) = probability that the j-th French word is connected to the i-th English word, given that the sentence lengths of e and f are l and m respectively.
Define
p(a | e, m) = Π_{j=1}^{m} q(a_j | j, l, m)
where a = {a_1, ..., a_m}. This gives
p(f, a | e, m) = Π_{j=1}^{m} q(a_j | j, l, m) t(f_j | e_{a_j})

IBM Model 2: Example
l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
a = {2, 3, 4, 5, 6, 6, 6}
p(a | e, 7) = q(2 | 1, 6, 7) × q(3 | 2, 6, 7) × q(4 | 3, 6, 7) × q(5 | 4, 6, 7) × q(6 | 5, 6, 7) × q(6 | 6, 6, 7) × q(6 | 7, 6, 7)

IBM Model 2: Example
l = 6, m = 7
e = And the program has been implemented
f = Le programme a ete mis en application
Alignment a = {2, 3, 4, 5, 6, 6, 6}
p(f | a, e, 7) = t(le | the) × t(programme | program) × t(a | has) × t(ete | been) × t(mis | implemented) × t(en | implemented) × t(application | implemented)

IBM Model 2: The Generative Process
To generate a French string f from an English string e:
Step 1: pick an alignment a = {a_1, a_2, ..., a_m} with probability
Π_{j=1}^{m} q(a_j | j, l, m)
Step 2: pick the French words with probability
p(f | a, e, m) = Π_{j=1}^{m} t(f_j | e_{a_j})
The final result:
p(f, a | e, m) = p(a | e, m) p(f | a, e, m) = Π_{j=1}^{m} q(a_j | j, l, m) t(f_j | e_{a_j})

Recovering Alignments
If we have distributions q and t, we can easily recover the most likely alignment for any sentence pair.
Given a sentence pair e_1, e_2, ..., e_l, f_1, f_2, ..., f_m, define for j ∈ {1, ..., m}
a_j = argmax_{a ∈ {0, ..., l}} q(a | j, l, m) t(f_j | e_a)
Because the model factorizes over positions, each a_j can be computed independently.
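
Since each a_j is an independent argmax, recovery takes only a few lines of Python (a sketch; the dictionary encodings of q and t below are our own assumptions):

    def recover_alignment(f, e, q, t):
        """q maps (a, j, l, m) -> q(a | j, l, m); t maps
        (f_word, e_word) -> t(f | e). Returns the alignment a_1 ... a_m,
        with 0 standing for the special NULL word."""
        l, m = len(e), len(f)
        e_null = ["NULL"] + e                    # position 0 = NULL
        return [max(range(l + 1),
                    key=lambda a: q[(a, j, l, m)] * t[(f[j - 1], e_null[a])])
                for j in range(1, m + 1)]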

EM Training - the Parameter Estimation Problem
Input to the parameter estimation algorithm: (e^(k), f^(k)) for k = 1 ... n. Each e^(k) is an English sentence, each f^(k) is a French sentence.
Output: parameters t(f | e) and q(j | i, l, m).
The key challenge: we do not have alignments on our training examples.

Parameter Estimation if the Alignments are Observed
An example where alignments are observed in the training data:
e^(100) = And the program has been implemented
f^(100) = Le programme a ete mis en application
a^(100) = {2, 3, 4, 5, 6, 6, 6}
The training data is (e^(k), f^(k), a^(k)) for k = 1 ... n: each e^(k) is an English sentence, each f^(k) is a French sentence, each a^(k) is an alignment.
The maximum-likelihood parameter estimates in this case are:
t_ML(f | e) = Count(e, f) / Count(e)
q_ML(j | i, l, m) = Count(j | i, l, m) / Count(i, l, m)

Algorithm
Input: a training corpus (f^(k), e^(k), a^(k)) for k = 1 ... n, where |f^(k)| = |a^(k)| = m_k.
Output: t_ML(f | e) = c(e, f) / c(e) and q_ML(j | i, l, m) = c(j | i, l, m) / c(i, l, m).

Algorithm
set all counts c(...) = 0
for k = 1 ... n
  for i = 1 ... m_k, for j = 0 ... l_k:
    c(e_j^(k), f_i^(k)) ← c(e_j^(k), f_i^(k)) + δ(k, i, j)
    c(e_j^(k)) ← c(e_j^(k)) + δ(k, i, j)
    c(j | i, l, m) ← c(j | i, l, m) + δ(k, i, j)
    c(i, l, m) ← c(i, l, m) + δ(k, i, j)
where δ(k, i, j) = 1 if a_i^(k) = j, and 0 otherwise.
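
A direct Python transcription of this counting algorithm might look as follows (a sketch; the data layout is our own assumption: each training triple is (f, e, a), with a[i-1] giving the English position aligned to French position i):

    from collections import defaultdict

    def ml_estimates(corpus):
        """corpus: list of (f, e, a) triples; a[i-1] in 0..len(e) is the
        English position aligned to French position i (0 = NULL)."""
        c_fe = defaultdict(float)    # c(e, f), keyed (f, e) to match t(f | e)
        c_e = defaultdict(float)     # c(e)
        c_jilm = defaultdict(float)  # c(j | i, l, m)
        c_ilm = defaultdict(float)   # c(i, l, m)
        for f, e, a in corpus:
            l, m = len(e), len(f)
            e_null = ["NULL"] + e
            for i in range(1, m + 1):
                j = a[i - 1]                       # observed alignment
                c_fe[(f[i - 1], e_null[j])] += 1
                c_e[e_null[j]] += 1
                c_jilm[(j, i, l, m)] += 1
                c_ilm[(i, l, m)] += 1
        # take ratios of counts to get the ML estimates
        t = {k: v / c_e[k[1]] for k, v in c_fe.items()}
        q = {k: v / c_ilm[k[1:]] for k, v in c_jilm.items()}
        return t, q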

Parameter Estimation with the EM Algorithm
The algorithm is quite similar to the algorithm used when alignments are observed. The only two differences:
The algorithm is iterative. We start with some initial (e.g., random) choice for the q and t parameters. At each iteration we compute counts based on the training data with our current parameter estimates, then re-estimate our parameters with these counts, and iterate.
δ(k, i, j) is now computed as
δ(k, i, j) = q(j | i, l_k, m_k) t(f_i^(k) | e_j^(k)) / Σ_{j'=0}^{l_k} q(j' | i, l_k, m_k) t(f_i^(k) | e_{j'}^(k))

Algorithm
Input: a training corpus (f^(k), e^(k)) for k = 1 ... n, where |f^(k)| = m_k. (Alignments are not observed.)
Initialization: initialize the t(f | e) and q(j | i, l, m) parameters (e.g., to random values).

Algorithm
for s = 1 ... S
  set all counts c(...) = 0
  for k = 1 ... n
    for i = 1 ... m_k, for j = 0 ... l_k:
      c(e_j^(k), f_i^(k)) ← c(e_j^(k), f_i^(k)) + δ(k, i, j)
      c(e_j^(k)) ← c(e_j^(k)) + δ(k, i, j)
      c(j | i, l, m) ← c(j | i, l, m) + δ(k, i, j)
      c(i, l, m) ← c(i, l, m) + δ(k, i, j)
    where δ(k, i, j) = q(j | i, l_k, m_k) t(f_i^(k) | e_j^(k)) / Σ_{j'=0}^{l_k} q(j' | i, l_k, m_k) t(f_i^(k) | e_{j'}^(k))
  Re-calculate the parameters:
  t(f | e) = c(e, f) / c(e),  q(j | i, l, m) = c(j | i, l, m) / c(i, l, m)
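
The EM version only changes how counts are accumulated: δ becomes a posterior rather than an indicator. A compact Python sketch, assuming random initialization and a fixed number of iterations S (the encodings of t and q are our own, matching the counting sketch above):

    import random
    from collections import defaultdict

    def em_model2(corpus, S=10):
        """corpus: list of (f, e) sentence pairs, no alignments observed."""
        t = defaultdict(random.random)   # t(f | e), keyed (f_word, e_word)
        q = defaultdict(random.random)   # q(j | i, l, m), keyed (j, i, l, m)
        for _ in range(S):
            c_fe, c_e = defaultdict(float), defaultdict(float)
            c_jilm, c_ilm = defaultdict(float), defaultdict(float)
            for f, e in corpus:
                l, m = len(e), len(f)
                e_null = ["NULL"] + e
                for i in range(1, m + 1):
                    # E-step: delta(k, i, j), the posterior over English positions
                    scores = [q[(j, i, l, m)] * t[(f[i - 1], e_null[j])]
                              for j in range(l + 1)]
                    total = sum(scores)
                    for j in range(l + 1):
                        d = scores[j] / total
                        c_fe[(f[i - 1], e_null[j])] += d
                        c_e[e_null[j]] += d
                        c_jilm[(j, i, l, m)] += d
                        c_ilm[(i, l, m)] += d
            # M-step: re-estimate the parameters from the expected counts
            t = defaultdict(float, {k: v / c_e[k[1]] for k, v in c_fe.items()})
            q = defaultdict(float, {k: v / c_ilm[k[1:]] for k, v in c_jilm.items()})
        return t, q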

Details of the Algorithm
The log-likelihood function:
L(t, q) = Σ_{k=1}^{n} log p(f^(k) | e^(k)) = Σ_{k=1}^{n} log Σ_a p(f^(k), a | e^(k))
The maximum-likelihood estimates are
argmax_{t,q} L(t, q)
The EM algorithm will converge to a local maximum of the log-likelihood function.

Phrase-Based Translation
Overview:
Learning phrases from alignments
A phrase-based model
Decoding in phrase-based models

Learning Phrases from Alignments
The first stage in training a phrase-based model is extraction of a phrase-based lexicon. A phrase-based lexicon pairs strings in one language with strings in another language:
nach Kanada - in Canada
zur Konferenz - to the conference
Morgen - tomorrow
...
We need to capture the probability distribution t(e | s), where e is a phrase in the target language and s is a phrase in the source language.

Learning Phrases from Alignments
For example:
English: Mary did not slap the green witch
Spanish: Maria no daba una bofetada a la bruja verde
Some (not all) phrase pairs extracted from this example: (Maria, Mary), (no, did not), (no daba una bofetada, did not slap).
We'll see how to do this using alignments from the IBM models.

Learning Phrases from Alignments
IBM Model 2 defines two distributions:
t(s_i | e_j), where s_i is a word in the source language and e_j is a word in the target language;
q(i | j, l, m), the probability that the i-th word in the source language aligns to the j-th word in the target language.
A useful by-product: once we've trained the model, for any (f, e) pair we can calculate the most likely alignment under the model:
a* = argmax_a p(a | f, e, m) = argmax_a Π_{i=1}^{l} q(a_i | i, l, m) t(s_{a_i} | e_i)

Learning Phrases from Alignments
[Alignment grid: English words Mary, did, not, slap, the, green, witch (rows) against Spanish words Maria, no, daba, una, bofetada, a, la, bruja, verde (columns); each Spanish word carries exactly one alignment point, and slap carries three (daba una bofetada).]
Every Spanish word is aligned to exactly one English word. The alignment is often noisy, and we need a many-to-many relation, not a one-to-many relation.

Learning Phrases from Alignments
Step 1: train IBM Model 2 for p(s | t), and come up with the most likely alignment for each (s, t) pair.
Step 2: train IBM Model 2 for p(t | s), and come up with the most likely alignment for each (t, s) pair.
We now have two alignments; take their intersection as a starting point.

Learning Phrases from Alignments
Alignment from p(s | t): [grid over the same sentence pair; each Spanish word aligned to exactly one English word, with slap carrying three alignment points.]
Alignment from p(t | s): [grid over the same sentence pair; each English word aligned to exactly one Spanish word, with slap carrying a single alignment point.]

Learning Phrases from Alignments
The intersection of the two alignments is a very reliable starting point:
[Intersection grid: the six alignment points on which the p(s | t) and p(t | s) alignments agree, over Mary did not slap the green witch / Maria no daba una bofetada a la bruja verde.]

Learning Phrases from Alignments
Only explore alignment points in the union of the p(s | t) and p(t | s) alignments.
Add one alignment point at a time.
Only add alignment points which align a word that currently has no alignment.
At first, restrict ourselves to alignment points that are neighbours of current alignment points; later, consider other alignment points.

Learning Phrases from Alignments
The final alignment, created by taking the intersection of the two alignments and then adding new points using the growing heuristics:
[Final alignment grid: as before, but the English word the now carries two alignment points, so the result is many-to-many.]
Note that the alignment is no longer many-to-one: potentially multiple Spanish words can be aligned to a single English word, and vice versa.

Learning Phrases from Alignments
A phrase pair consists of a sequence of words s from the source language paired with a sequence of words e from the target language.
A phrase pair (s, e) is consistent if:
There is at least one word in s aligned to a word in e.
There are no words in s aligned to words outside e.
There are no words in e aligned to words outside s.
(Mary did not, Maria no) is consistent; (Mary did, Maria no) is not consistent. We extract all consistent pairs from the training example, as the sketch below illustrates.
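
The three consistency conditions are easy to check directly. A minimal Python sketch, assuming the alignment is given as a set of (source position, target position) points and that spans are inclusive (the function name and encoding are ours):

    def consistent(A, s_start, s_end, e_start, e_end):
        """True iff the source span [s_start, s_end] and the target span
        [e_start, e_end] form a consistent phrase pair under the
        alignment A, a set of (source_pos, target_pos) points."""
        linked = False
        for (i, j) in A:
            inside_s = s_start <= i <= s_end
            inside_e = e_start <= j <= e_end
            if inside_s and inside_e:
                linked = True            # at least one internal link
            elif inside_s or inside_e:
                return False             # a link escapes the box
        return linked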

Learning Phrases from Alignments
For any phrase pair (s, e) extracted from the training data, we can calculate
t(e | s) = Count(s, e) / Count(s)

Phrase-based Models: Definitions
A phrase-based model consists of:
A phrase-based lexicon, consisting of entries (s, e) where each entry has a score g(s, e) = log t(e | s).
A trigram language model.
A distortion parameter η, typically negative.

Definitions
Given a sentence s in the source language, a derivation y is a finite sequence of phrases p_1, ..., p_L. The length L can be any positive integer.
Each phrase p_i = (s, t, σ_1 ... σ_m) translates the span starting at word s and ending at word t in the source sentence into the target-language words σ_1 ... σ_m.
A derivation y for an input sentence is valid iff:
Each word in the source sentence is translated exactly once.
For all k ∈ {1, ..., L − 1}, |t(p_k) + 1 − s(p_{k+1})| ≤ d, where d is a parameter of the model (the distortion limit); also |1 − s(p_1)| ≤ d.

Example
German: wir mussen auch diese kritik ernst nehmen
y = (1, 3, we must also), (7, 7, take), (4, 5, this criticism), (6, 6, seriously)
y = (1, 2, we must), (7, 7, take), (3, 3, also), (4, 5, this criticism), (6, 6, seriously)

Scoring Derivations
The optimal translation under the model for a source-language sentence s is the valid derivation with the highest score. The score of a derivation is defined as
f(y) = h(y) + Σ_{k=1}^{L} g(p_k) + Σ_{k=0}^{L−1} η |t(p_k) + 1 − s(p_{k+1})|
where h(y) is the log-probability of the target sentence under the trigram language model, and by convention t(p_0) = 0.
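
A sketch of this scoring function in Python (the callables h and g and the derivation encoding are our own assumptions; h is taken to return the trigram log-probability so that the three terms add):

    def derivation_score(y, h, g, eta):
        """y: list of phrases (s, t, words); h: trigram LM log-probability
        of a word list; g: lexicon score of a phrase; eta: distortion
        parameter (typically negative)."""
        words = [w for (_, _, ws) in y for w in ws]
        score = h(words)                           # language model term
        score += sum(g(p) for p in y)              # phrase scores
        prev_end = 0                               # convention: t(p_0) = 0
        for (s, t, _) in y:
            score += eta * abs(prev_end + 1 - s)   # distortion penalty
            prev_end = t
        return score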

Example
wir mussen auch diese kritik ernst nehmen
y = (1, 3, we must also), (7, 7, take), (4, 5, this criticism), (6, 6, seriously)

Decoding Algorithm
Finding the optimal derivation is an NP-hard problem, so we use a heuristic: beam search.
Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is an optimization of best-first search, which always expands the node with the highest score; beam search reduces the memory requirements by keeping only the nodes with the highest scores as candidates.

Decoding Algorithm - Representing the Search Graph
A state is a tuple (e_1, e_2, b, r, α) where:
e_1, e_2 are the last two English words of the partial translation,
b is a bit-string of length n recording which source words have been translated,
r is an integer specifying the end-point of the last phrase in the state,
α is the state score.
The initial state is (*, *, 0^n, 0, 0).

Decoding Algorithm - Representing the Search Graph
A state q = (e_1, e_2, b, r, α) can be followed by a phrase p = (s, t, σ_1 ... σ_m) if:
p does not intersect b, and
the distortion limit is not violated: |r + 1 − s(p)| ≤ d.
The resulting state is q' = (e'_1, e'_2, b', r', α'), where:
e'_1 = σ_{m−1}, e'_2 = σ_m
b' = b ∪ {s, ..., t}
r' = t
α' = α + g(p) + Σ_{i=1}^{m} log q(σ_i | σ_{i−2}, σ_{i−1}) + η |r + 1 − s(p)|
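
A sketch of the transition function (our own encoding choices, not the lecture's: the coverage bit-string b is represented as a frozenset of covered source positions, and lm(w, u, v) is assumed to return log q(w | u, v) under the trigram model):

    def follow(state, p, g, lm, eta, d):
        """Expand decoder state with phrase p = (s, t, words), or return
        None if p overlaps the coverage set or breaks the distortion limit."""
        (e1, e2, b, r, alpha) = state
        (s, t, words) = p
        span = set(range(s, t + 1))
        if span & b:                          # phrase intersects coverage
            return None
        if abs(r + 1 - s) > d:                # distortion limit violated
            return None
        alpha2 = alpha + g(p) + eta * abs(r + 1 - s)
        u, v = e1, e2
        for w in words:                       # trigram LM contribution
            alpha2 += lm(w, u, v)
            u, v = v, w
        # u, v now hold the last two target words of the new state
        return (u, v, frozenset(b | span), t, alpha2)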

Algorithm
Now that the states are well defined, we use beam search to find an approximate answer.

The End