Dialect Recogni-on Using a Phone GMM Supervector Based SVM Kernel

Similar documents
Title: 4-Way-Stop Wait-Time Prediction Group members (1): David Held

Using Perceptual Context to Ground Language

Closure duration analysis of incomplete stop consonants due to stop-stop interaction

Closure duration analysis of incomplete stop consonants due to stop-stop interaction

Bayesian Methods: Naïve Bayes

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

36 th SINGAPORE OPEN WINDSURFING CHAMPIONSHIP

BALLGAME: A Corpus for Computational Semantics

PREDICTING the outcomes of sporting events

Common Core Standards for Language Arts. An Alignment with Santillana Español (Serie Amigos) Grades K-5

Agriculture Zone Winter Replicate Count 2007/08

Mistral : open source biometric platform.

Unit 4: Inference for numerical variables Lecture 3: ANOVA

Two Machine Learning Approaches to Understand the NBA Data

Sentiment Analysis Symposium - Disambiguate Opinion Word Sense via Contextonymy

Optimize Road Closure Control in Triathlon Event

Ivan Suarez Robles, Joseph Wu 1

Harbor Threat Detection, Classification, and Identification

2017 RELIABILITY STUDY STYLE INSIGHTS

FACE DETECTION. Lab on Project Based Learning May Xin Huang, Mark Ison, Daniel Martínez Visual Perception

Title: Modeling Crossing Behavior of Drivers and Pedestrians at Uncontrolled Intersections and Mid-block Crossings

Statistical Machine Translation

Undergraduate Course: Linguistics and the Gaelic Language

A computer program that improves its performance at some task through experience.

What s cooking at IBM Research?

Vectors in the City Learning Task

1 WCF World Ranking Rules

Rules for the Mental Calculation World Cup 2018

Machine Learning an American Pastime

Nature Neuroscience: doi: /nn Supplementary Figure 1. Visual responses of the recorded LPTCs

C est à toi! Level Two, 2 nd edition. Correlated to MODERN LANGUAGE CURRICULUM STANDARDS DEVELOPING LEVEL

ID-3 Triathlon Road Closure Control Zonghuan Li and Mason Chen

INTRODUCTION TO PATTERN RECOGNITION

شركة شباب لتدریب ریاضة الكابویرا. Capacity-building and PSS activities for children and youth in Za atari through live music, sport and play

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Universal Style Transfer via Feature Transforms

ALPINE LEVEL I SKIING

Elementary Physical Education Level K-2 - Unit : LOCOMOTOR Skills

Examples of Carter Corrected DBDB-V Applied to Acoustic Propagation Modeling

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

LONGITUDINAL AIR VELOCITY CONTROL IN A ROAD TUNNEL DURING A FIRE EVENT

November Fest was the theme of the DCV event for its clients

4. Learning Pedestrian Models

Twitter Analysis of IPL cricket match using GICA method

Predicting Horse Racing Results with Machine Learning

CS145: INTRODUCTION TO DATA MINING

Investigation of Gait Representations in Lower Knee Gait Recognition

18 :. : (2 : (4 ( : ) ( : ) ( :) (2 ( : ) ( :) ( : ) (4-1 : (1 : (3-2 ( : ) ( : ) ( : ) (1 ( : ) ( : ) ( : ) (3-3 ( : ) ( : ) ( : ) ( : ) ( : ) ( :) (

Owl Canyon Corridor Project Overview and Summary

Major League Baseball Offensive Production in the Designated Hitter Era (1973 Present)

Tennis Plots: Game, Set, and Match

Evaluating and Classifying NBA Free Agents

SIMULTANEOUS RECORDINGS OF VELOCITY AND VIDEO DURING SWIMMING

Acknowledgement: Author is indebted to Dr. Jennifer Kaplan, Dr. Parthanil Roy and Dr Ashoke Sinha for allowing him to use/edit many of their slides.

2014 STATE OF THE SYSTEM REPORT

PUERTO PORTALS DRAGON WINTER SERIES Club de Regatas Puerto Portals

Prior Design for Dependent Dirichlet Processes: An Application to Marathon Modeling

GROUND REACTION FORCE DOMINANT VERSUS NON-DOMINANT SINGLE LEG STEP OFF

Intersec ons. Alignment. Chapter 3 Design Elements. Standards AASHTO & PennDOT: As close to 90 as possible, but a minimum of 60.

OOA LA 09A E2020 COURSE OUTLINE

THE SONAR OF DOLPHINS BY WHITLOW W.L. AU

Kernewek Kemmyn. Dr Ken George a-barth Kesva an Taves Kernewek. an 30ves a vis Gwynngala 2006

Lab Report Outline the Bones of the Story

Gas Pressure and Distance The Force of the Fizz Within, By Donell Evans and Russell Peace

Total Morphological Comparison Between Anolis oculatus and Anolis cristatellus

Special Topics: Data Science

(b) Express the event of getting a sum of 12 when you add up the two numbers in the tosses.

EXPEDITION ADVENTURE PART 2: HIGHER RESOLUTION RANGE SEISMIC IMAGING TO LOCATE A SUNKEN PIRATE SHIP OFF ILE ST MARIE.

Can Next-Generation 3D TLC NAND Extend Enterprise Applications?

Sequence Tagging. Today. Part-of-speech tagging. Introduction

Recognition of Tennis Strokes using Key Postures

Europass website activity report 2014 (Iceland, Icelandic)

Minimum Mean-Square Error (MMSE) and Linear MMSE (LMMSE) Estimation

1. Use the diagrams below to investigate the pelvis and scapula models and identify anatomical structures. Articulated Pelvis

Statistical Assessment of the Glare Issue - Human and Natural Elements

Racetrack Betting: Do Bettors Understand the Odds?

Essen,al Probability Concepts

THEORY OF TRAINING, THEORETICAL CONSIDERATIONS WOMEN S RACE WALKING

SOUTH CAROLINA ELECTRIC & GAS COMPANY COLUMBIA, SOUTH CAROLINA

Comparison of Wind Turbines Regarding their Energy Generation.

Statistics for Process Validation Training Master Handbook

PAN AMERICAN GYMNASTICS UNION

THE MAGICAL PROPERTY OF 60 KM/H AS A SPEED LIMIT? JOHN LAMBERT,

Appendix C. Corridor Spacing Research

Chapter 13. Factorial ANOVA. Patrick Mair 2015 Psych Factorial ANOVA 0 / 19

Tournaments. Dale Zimmerman University of Iowa. February 3, 2017

The Pennsylvania System of School Assessment. Reading Item and Scoring Sampler SUPPLEMENT Grade 4

Anthropogenic Noise and the Marine Environment

World Tenpin Bowling Association WTBA Congress December 2015 Agenda item 12.2 Legislative session Chapter 2-6

Europass website activity report 2017 (Norway, Norwegian)

Europass website activity report 2016 (Norway, Norwegian)

Estimating a Toronto Pedestrian Route Choice Model using Smartphone GPS Data. Gregory Lue

Lecture 10: Generation

RESILIENT INFRASTRUCTURE June 1 4, 2016

Single Phase Pressure Drop and Flow Distribution in Brazed Plate Heat Exchangers

Where the Swell Begins

QT2 Triathlon Calculator

Equivalent SDOF Systems to Simulate MDOF System Behavior

8.G Bird and Dog Race

Bouncing Ball A C T I V I T Y 8. Objectives. You ll Need. Name Date

Transcription:

Dialect Recogni-on Using a Phone GMM Supervector Based SVM Kernel Fadi Biadsy, Julia Hirschberg, Michael Collins Columbia University, NY, USA Sep 28 st, 2010 1

Mo-va-on: Why Study Dialect Recogni-on? To improve Automa-c Speech Recogni-on (ASR) Model adapta-on: Pronuncia-on, Acous-c, Morphological, Language models Build even a dialect specific ASR Discover differences between dialects To infer speaker s regional origin for Forensic speaker profiling Speech to speech transla-on Annota-ons for Broadcast News Monitoring Spoken dialogue systems adapt TTS systems Charisma-c speech 2

Case Study: Arabic Dialects (by Arab Atlas)

Corpora Dialect # Speakers Test 20% 30s* test cuts Corpus Gulf 976 801 (Appen Pty Ltd, 2006a) Iraqi 478 477 (Appen Pty Ltd, 2006b) Levan-ne 985 818 (Appen Pty Ltd, 2007) For testing: (25% female mobile, 25% female landline, 25% male mobile, 25 % male landline) Egyptian: Training: CallHome Egyp-an, Tes-ng: CallFriend Egyp-an Dialect # Training Speakers # 120 speakers 30s* cuts Corpora Egyp-an 280 1912 (Canavan and Zipperlen, 1996) (Canavan et al., 1997) 4 *Exactly 30s

Baselines I. Standard PRLM A trigram phonotac-c model per dialect II. III. Standard GMM UBM: Front End: 13D PLP features per frame Each frame is spliced together with four preceding and four succeeding frames followed by LDA 40D CMVN 2048 Gaussians ML trained on equal number of frames from each dialect Dialect Models are MAP adapted with 5 itera-ons (similar to Torres Carrasquillo et al., 2008) GMM UBM with fmllr (Biadsy et al., 2010) 5

Baselines Approach EER (%) PRLM 17.7 GMM UBM 15.3 GMM UBM fmllr 11.0% 6

Hypothesis in current work Rely on the hypothesis that dialects differ in the realiza-on of certain phonemes (Al Tamimi & Ferragne, 2005)

General Idea Compare uferances at the phone-c level 8

Current Approach Build a GMM UBM for each phone type Extract GMM Supervectors at the level of phones Design a kernel func-on that computes similarity between pairs of uferances Train an SVM classifier for each pair of dialects 9

Current Approach Build a GMM UBM for each phone type Extract GMM Supervectors at the level of phones Design a kernel func-on that computes similarity between pairs of uferances Train an SVM classifier for each pair of dialects 10

Front End Using IBM s Ahla System (Soltau et al. 2009): 13D PLP features per frame Each frame is spliced together with four preceding and four succeeding frames followed by LDA 40D CMVN fmllr adapta-on using hypothesized CD phones 11

Phone GMM UBM Run a phone recognizer on all data Extract frames aligned to each phone type Train a GMM UBM for every phone type Using frames from all dialects 12

Phone GMM UBM /aa/ /b/ /z/ 13 34 Arabic phones 34 Phone GMM UBMs

Current Approach Build a GMM UBM for each phone type Extract GMM Supervectors at the level of phones Design a kernel func-on that computes similarity between pairs of uferances Train an SVM classifier for each pair of dialects 14

Step 1 Given an uferance U: Acoustic frames: Front-End Phones: (e.g.) /aa/ /r/ Phone Recognizer 15

Step 2 MAP Adapta-on of each Phone Instance Given a phone instance acous-c frames: MAP /r/ I. MAP adapt the phone GMM UBM using the phone acous-c frames II. Stack all the Gaussian means and phone durapon V k =[µ 1, µ 2,., µ N, duration] in one supervector i.e., summarize the acoustic-phonetic characteristics of each phone in one vector 16 *Similar to (Campbell et al., 2006) but at the phone level

Steps Given an uferance U: Acoustic frames: Front-End Phones: (e.g.) φ 1 (e.g., /aa/) φ n Phone Recognizer Phone GMM-UBMs: MAP adapted GMMs MAP Adaptation Phone GMM- Supervectors: v 1 v n Phone GMM- Supervectors Sequence of tuples: 17

Classifica-on Task Dis-nguish between pairs of dialects given a sequence of tuples Classifier choice: SVM has been shown to model well supervector like representa-on (e.g., Campbell et al., 2006) We need a kernel func-on that computes the similarity between a pair of uferances and U a U b 18

Phone GMM Supervector Based Kernel S Ua Let be the sequence of tuples of uferance S Ub Let be the sequence of tuples of uferance U a U b Sum of RBF kernels between every pair of Supervectors of phone instances with the same type across the two uferances s a l A m A t y a A l s a l A m a l y k u m 19

Dialect Recogni-on Compute a kernel matrix using our kernel func-on for each pair of dialects Train an SVM classifier using this kernel matrix for the pair of dialects During tes-ng, given an uferance U: 1. Construct the sequence of tuples S U 2. Compute the kernel value with every support vector 3. The sign of this func-on is our hypothesized dialect class 20

Evalua-on 34 phone GMM UBMs are Maximum Likelihood trained with 100 Gaussians each We segment the training speakers in our corpora to 30s cuts Using all training cuts, train an SVM classifier for each pair of the 4 dialects (6 classifiers) Use SVMs that es-mate posterior probabili-es (Wu et al., 2004) Use the posterior as the detec-on score to plot DET curves 21

Results and Baseline comparison Approach EER (%) PRLM 17.7 GMM UBM 15.3 GMM UBM fmllr 11.0% 22

Results and Baseline comparison Approach EER (%) PRLM 17.7 GMM UBM 15.3 GMM UBM fmllr 11.0% Kernel 4.9% 23

Comparison to (Torres Carrasquillo et al., 2008) GMM UBM based model discrimina-vely trained with SDC features Eigen channel compensa-on and VTLN Back end classifier EER 7.0% on 3 Arabic dialects Our approach on exactly the same segments as in (Torres Carrasquillo et al., 2008) EER 6.4% 24

Conclusions Modeling the differences between dialects at the phone-c level is very effec-ve New Approach: Supervector representa-on at the phone level New Phone GMM UBM Supervector based Kernel func-on Significantly outperforms: PRLM, GMM UBM, GMM UBM fmllr To our knowledge, represents new state of the art performance for Arabic 25

Future Work Test this approach on shorter uferances (3s and 10s) Try this approach on dialects/accents of other languages: English accents (American English and Indian English) American English Dialects Portuguese Dialects Missing components: VTLN NAP channel compensa-on (need to modify to accommodate for short context supervectors) 26

Thank You! Acknowledgments: P. Torres Carrasquillo and N. Chen for providing us with the segmenta-on IBM T. J. Watson Speech Team 27