The document discusses challenges and techniques for conversational speech translation. It notes the confluence of factors enabling this, including steady progress in machine translation quality and technological leaps in automatic speech recognition using deep learning. A goal is to support open-domain conversations between Skype users speaking different languages. Key challenges include the gap between how people speak conversationally versus how data is formatted for training, and ensuring low latency for consumers. Techniques discussed include adapting machine translation to the conversational domain through specialized data selection, and using TrueText to bridge the gap between raw automatic speech recognition output and the input machine translation expects.
2. Why now?
Confluence of factors:
Steady progress in MT quality over the last few years
• Using huge amounts of data
Technological Leap in ASR
• Deep Learning (DNNs) – 33+% WER reduction over GMMs (Seide et al. 2011)
From an average WER of 30% down to 20%, in English
Relative WER reduction now above 42%
• More robust to noise, speaker variation, accents
Skype
• A global platform to put speech translation in the hands of hundreds of millions of users
3. Skype Translator: Goal
• To support open-domain conversations between Skype users in different parts of the world, speaking different languages
10. The Challenges
• The gulf between speech and text
• It’s not enough to just chain a really good ASR system with a really good MT system
• How people talk to each other is not how they write
• Building really good conversational ASR and MT systems
• Significant changes in the data we use to train the ASR and MT systems.
• The gap between technology demo and consumer product
• Producing models with shippable latency
• Interesting problems one encounters with real consumers
11. How people really speak
What the person thought they said:
Yeah. I guess it was worth it.
Ja. Ich denke, es hat sich gelohnt.
はい。私はそれの価値があったと思います。
What they actually said:
Yeah, but um, but it was you know, it was, I guess, it was worth it.
Ja, aber ähm, aber es war, weißt du, es war, ich denke, es hat sich gelohnt.
はい、ええと、あなたが知っている、だったが、推測すると、それはそれの価値があった
けど。
Disfluency removal
More than just removing “um” and “ah”
12. Disfluencies in Conversational Speech
um no i mean yes but you know i am i've never done it myself have you done that uh yes
Disfluency types:
• Pause fillers
• Discourse markers
• Repetition
• Corrections (“speech repairs”)
Cleaned up:
Yes.
But, I’ve never done it myself.
Have you done that?
Yes?
13. Disfluencies in Conversational Speech
um no i mean yes but you know i am i've never done it myself have you done that uh yes
Yes.
But, I’ve never done it myself.
Have you done that?
Yes?
Need to:
1. Segment
2. Remove disfluencies
3. Punctuate
4. Add case
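The four steps above can be sketched in miniature (a toy Python approximation using regular expressions; the production system uses CRF classifiers for these stages, and the function names and filler lists here are ours):

```python
import re

# toy lists of pause fillers and discourse markers
FILLERS = r"\b(?:um|uh|ah|er|you know|i mean)\b"

def remove_simple_disfluencies(text: str) -> str:
    """Strip fillers/markers and collapse immediate word repetitions."""
    text = re.sub(FILLERS, " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # "but but" -> "but" (single-word repetitions only)
    return re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text)

def punctuate_and_case(sentences):
    """Capitalize each segment and add a final period (toy)."""
    return [s[0].upper() + s[1:] + "." for s in sentences if s]

raw = "um but uh but it was you know it was i guess it was worth it"
cleaned = remove_simple_disfluencies(raw)
# cleaned == "but it was it was i guess it was worth it"
```

Note that the multi-word repair (“it was … it was”) survives; that is exactly the kind of complex disfluency the later parser-based stage handles.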
14. Without TrueText
• um no i mean yes but i am i've never done it myself have you done that uh yes
Translate ↓
• um i いという意味ないが、私は知っている私はそれをやったことがない自分をした、ええとはい
16. Missing punctuation: catastrophic effects
Questions
¿vas ahora? → are you going now?
vas ahora → go now
Negation
no es mi segundo → it is not my second
no. es mi segundo → no. it’s my second
Seriously embarrassing
tienes una hija ¿no? es muy preciosa → you have a daughter right? is very beautiful
tienes una hija no es muy preciosa → you have a daughter is not very beautiful
17. Accents / wrong characters: changes in meaning
Accented words (sound-alikes)
• Written in different forms, with different meanings
• But pronounced the same
Si los vinos mendocinos son muy famosos → If the wines from Mendoza are very famous
Sí los vinos mendocinos son muy famosos → Yes, the wines from Mendoza are very famous
Misrecognized words/characters (sound-alikes)
你经常在没有听完的时候就睡着了吗 → Do you often fall asleep without listening to it?
你经常在没有听完的时候就睡着了嘛 → You often fall asleep without listening to it.
18. How people say things
Here’s what we need to recognize and translate
• He ain't my choice. But, hey, we hated the last guy.
• We're going to hit it and quit it.
• Boy, that story gets better every time you hear it.
• I swear to God I am done with guys like that.
Unfortunately, a lot of our MT training data looks like this:
• Mr President, Commissioner, Mr Sacconi, ladies and gentlemen, as the PPE-DE's coordinator for
regional policy, I want to stress that some very important points are made in this resolution.
• I am therefore calling for integrated policies, all-encompassing policies that we can adapt to society,
which must listen to our recommendations and comply with them.
19. Data mismatch & scarcity
Training data mismatch
• MT training is clearly mismatched
• ASR training data is a mixed bag
Data scarcity
• Traditional data sources (govt, news, web) not well matched
• Not a lot of parallel conversational data (for MT)
• Not a lot of transcribed conversational data (for ASR)
20. ASR: word errors, missing vocab
ASR vocab issues – e.g. names
Hi Arul → Hi Aaron
I went skiing at Snoqualmie pass → I went skiing at snow call me pass
ASR errors
How do we minimize the impact of misrecognized words?
22. The Challenges
• Conversational speaking style
• Open domain
Key enabler: dramatic ASR improvements from using Deep Neural
Networks
Where to get training data?
• US English: DARPA Switchboard (2000h) is a great start; but no comparable corpus for other
languages
• Use “found” captioned speech.
Many thousands of hours of speech used for English system
23. Training Data: Audio w/ Fluent Transcripts
• Disfluent (what we want): Well I uh started this this project while I was a student uh grad student at uh Stan- Stanford
• Fluent (what we get): I started this project while I was a grad student at Stanford
Recreate disfluent training material
25. ASR/MT Mismatch
Significant data mismatch between ASR output (even
when cleaned) and MT:
• He ain't my choice. But, hey, we hated the last guy.
• We're going to hit it and quit it.
vs.
• Mr President, Commissioner, Mr Sacconi, ladies and gentlemen, as the PPE-DE's coordinator for
regional policy, I want to stress that some very important points are made in this resolution.
But where do we get parallel conversational data?
Example: Movie subtitles
26. Data Selection
• Sample “in-domain” (“in-register”?) data from our en-fr parallel
data store
• Leverage the fact that the data pool does not match the target domain
• Use monolingual conversational data as seed (“in-domain”): CallHome, SWBD
• Use Cross-Entropy Difference method (Moore & Lewis, 2010) against a very large parallel corpus (for ENU-FRA, many hundreds of millions of sentences)
• Train on combination of subtitle and DA data
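The Moore-Lewis selection step can be sketched as follows (a toy illustration with add-one-smoothed unigram LMs and invented two-sentence corpora; the real systems use far larger language models, corpora, and a selection threshold):

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Add-one-smoothed unigram model built from a list of sentences."""
    counts = Counter(w for s in corpus for w in s.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    words = sentence.split()
    return -sum(math.log2(lm(w)) for w in words) / len(words)

# seed: monolingual conversational data (stand-ins for CallHome/SWBD lines)
seed = ["yeah i guess it was worth it", "have you done that before"]
# pool: the big general-domain parallel store (English side shown)
pool = ["i guess you have done it", "the committee adopted the resolution"]

in_domain, general = unigram_lm(seed), unigram_lm(pool)
# Moore-Lewis score: H_in(s) - H_gen(s); lower = more conversational
scored = sorted(pool, key=lambda s: cross_entropy(in_domain, s) - cross_entropy(general, s))
```

Sentences at the head of `scored` would be kept for training; the Europarl-style sentence correctly falls to the bottom.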
28. Pipeline: Speech Recognition → Speech Correction → Translation → Text to Speech
Raw ASR Output:
um no i mean yes but i am i've never done it myself did users before uh I will ask go deep to help me
Customization and Personalization:
um no i mean yes but i am i've never done it myself did users before uh I will ask gurdeep to help me
Lattice Rescoring:
um no i mean yes but i am i've never done it myself did you use yours before uh I will ask gurdeep to help me
Disfluency Removal:
no i mean yes but I am i've never done it myself did you use yours before uh I will ask gurdeep to help me
Segmentation, Punctuation and True Casing:
Yes.
But I’ve never done it myself.
Did you use yours before?
I will ask Gurdeep to help me.
Translation (French):
Oui.
Mais je ne l'ai jamais fait moi-même.
Avez-vous utilisé le vôtre avant ?
Gurdeep va demander de l'aide.
31. Personalization and Customization
Diagram: the Client talks to the Skype Translator Service, which draws on User Profiles and Object Stores in Cloud Storage; Speech Recognition uses Customized Language Models, and Machine Translation uses Customized Models (CLM).
32. Personalized Names Handling
• Name recognition is a well-known problem in large-vocabulary ASR
• Supporting high-recall name recognition usually compromises WER
• We deploy a high-precision approach to contact-name recognition using personalized name lists
• Personalized names can be recognized in any context
• Examples:
• Hello Ignacio, how are you doing today?
• I will meet Arul Menezes for lunch tomorrow.
Diagram: the client supplies contact names to build a customized LM, which Speech Recognition uses alongside the generic LM.
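The idea can be illustrated with a toy sketch (the greeting templates, the fuzzy-matching repair step, and all names here are invented for illustration; the shipped system instead compiles the contact-name grammar into the ASR engine and runs it in parallel with the main LM):

```python
import difflib

# toy restrictive grammar: common greetings with a name placeholder
GREETING_TEMPLATES = [
    "hi <NAME>",
    "hello <NAME> how are you",
    "i will ask <NAME> to help me",
]

def compile_grammar(contacts):
    """Expand the templates with this call's contact names."""
    return [t.replace("<NAME>", c.lower())
            for t in GREETING_TEMPLATES for c in contacts]

def correct_names(asr_hypothesis, contacts, cutoff=0.6):
    """Prefer a grammar string when the ASR hypothesis is close to one."""
    match = difflib.get_close_matches(
        asr_hypothesis, compile_grammar(contacts), n=1, cutoff=cutoff)
    return match[0] if match else asr_hypothesis

fixed = correct_names("hi aaron", ["Arul", "Gurdeep"])
# fixed == "hi arul"
```

Because the grammar is tiny and call-specific, this stays high-precision: hypotheses far from any greeting pass through untouched.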
34. Some early experiments: The error cascade
Speech Recognition → 1-best ASR → Translation Engine
Proposed solutions:
• Feed n-best list of ASR output to MT
• Use speech lattice directly as input to MT (e.g. Matusov et al. 2005, Lavie et al. 2004, Dyer et al. 2008)
• Confusion network decoding (e.g. Bertoldi et al. 2007; Bertoldi and Federico 2005)
35. Lattice rescoring
Rescoring the ASR lattice with:
• A much bigger LM (100x larger than the first pass)
• MT-specific features
• Tuned weights
Results:
• WER reduction: 1-2% absolute
• BLEU improvement: 1-2% absolute
Cherry-picked examples:
Ref: what do you use yours for mostly
ASR: do users for mostly
Rescored: do you use yours for mostly
Ref: but we're in a subdivision
ASR: but where in a subdivision
Rescored: but we're in a subdivision
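Rescoring can be sketched over an n-best list (each entry a hypothesis with its first-pass ASR score), standing in for the lattice paths; the “big LM” and the weights below are toy stand-ins of ours, not the production models:

```python
def rescore(nbest, big_lm, weight_asr=1.0, weight_lm=0.8):
    """Pick the hypothesis maximizing a weighted sum of the
    first-pass ASR score and the second-pass LM score."""
    return max(nbest, key=lambda h: weight_asr * h[1] + weight_lm * big_lm(h[0]))[0]

# toy "100x bigger" LM: rewards frequent conversational bigrams
GOOD_BIGRAMS = {("do", "you"), ("you", "use"), ("yours", "for")}

def big_lm(text):
    words = text.split()
    hits = sum(1 for b in zip(words, words[1:]) if b in GOOD_BIGRAMS)
    return hits - 0.1 * len(words)  # crude log-probability proxy

nbest = [
    ("do users for mostly", -3.0),           # first-pass winner
    ("do you use yours for mostly", -3.4),   # lower ASR score, better English
]
best = rescore(nbest, big_lm)
# best == "do you use yours for mostly"
```

With `weight_lm=0` the first-pass winner is kept, which is how tuning the weights trades off ASR confidence against the bigger LM.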
37. Final translation (Japanese):
はい。
しかし、私はそれを自分自身を行ってきたこと。
あなたは前にあなたを使用しましたか。
私は私を助けるための Gurdeep が要求されます。
Compare translating the raw ASR output directly:
um 意味ないはい、私です私はそれをやったことがない自分はユーザー、ええと私は手伝って深く要求されます前に
The cleaned input to MT, after Segmentation, Punctuation and True Casing:
Yes.
But I’ve never done it myself.
Did you use yours before?
I will ask Gurdeep to help me.
39. Segmentation and disfluency removal interact with each other
Input: um no i mean yes but i am i've never done it myself have you done that uh yes
Path A: Segmentation first, then disfluency removal
Segments:
um no
i mean yes
but i am
i've never done it myself
have you done that
uh yes
Result: disfluent fragments survive
No.
I mean yes.
I am.
i've never done it myself.
have you done that?
Yes?
Path B: Simple disfluency removal, then segmentation, then complex disfluency removal
After simple disfluency removal and segmentation:
no ,, yes
but , i am , i've never done it myself
have you done that
yes
After complex disfluency removal:
Yes.
I've never done it myself.
Have you done that?
Yes?
40. CRF-based Classifiers for annotation
P(y|F) = (1/Z(W, F)) exp( Σ_k λ_k G_k(y, F) )
Stages: Simple Disfluency → Segmentation and Punctuation → Complex Disfluency
Segmentation and Disfluency Removal for Conversational Speech Translation.
Hany Hassan, Lee Schwartz, Dilek Hakkani-Tür, and Gokhan Tur. INTERSPEECH 2014.
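The distribution above can be evaluated by brute force for a tiny example (the feature functions, weights and label set are toy choices of ours; real CRF training and inference use dynamic programming over linear chains, not enumeration):

```python
import itertools
import math

def crf_prob(y, F, features, weights, labels):
    """P(y|F) = exp(sum_k w_k G_k(y,F)) / Z(w,F), with Z summed
    over every possible label sequence for the words F."""
    def score(seq):
        return math.exp(sum(w * g(seq, F) for w, g in zip(weights, features)))
    Z = sum(score(seq) for seq in itertools.product(labels, repeat=len(F)))
    return score(tuple(y)) / Z

FILLERS = {"um", "uh"}
# toy feature functions G_k: count positions tagged "correctly"
g0 = lambda seq, F: sum(1 for w, t in zip(F, seq) if w in FILLERS and t == "DISFLUENT")
g1 = lambda seq, F: sum(1 for w, t in zip(F, seq) if w not in FILLERS and t == "FLUENT")

F = ["um", "yes"]
LABELS = ["DISFLUENT", "FLUENT"]
p = crf_prob(("DISFLUENT", "FLUENT"), F, [g0, g1], [2.0, 2.0], LABELS)
```

The normalizer Z makes the four possible label sequences sum to probability one, with the intuitively correct tagging getting most of the mass.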
41. Sentence Unit Boundary Detection
• CRF Classifier: L2 regularization, feature cut-off = 2
• Lexical features
• Brown clusters: group semantically related words based on context
• POS tags trained on conversational data (another CRF classifier)
• Pause durations from the speech signal
• Phrase-translation-table n-grams
• Features on a window of two words on each side
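The window features can be sketched as a simple extractor (the feature-key naming is ours; Brown-cluster, POS and pause-duration features would be added to the same dictionary):

```python
def boundary_features(words, i, window=2):
    """Lexical features at position i: the tokens in a +/- 2 window,
    keyed by relative offset and padded with <s> at the edges."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        feats[f"w[{off:+d}]"] = words[j] if 0 <= j < len(words) else "<s>"
    return feats

words = "i've never done it myself have you done that".split()
f = boundary_features(words, 4)  # features around "myself"
```

A CRF trained on such features can learn, for instance, that “myself have” spanning the window is strong evidence for a sentence boundary after position 4.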
42. Disfluency Removal, Punctuation insertion
and TrueCaser
•CRF Classifiers: L2 Regularization, Features Cut-off=2
•Lexical Features
•Brown Clusters
•POS tags trained on conversational data (another CRF classifier)
•Features on a window of two words on each side
43. Example of Complex Disfluency Removal
but , i’m , I’ve never done that before.
49. Cmdline parameters (ex of API usage)
Usage:
CmdLineSpeechTranslate.exe ClientId ClientSecret FilePath SrcLanguage TargetLanguage
Example:
CmdLineSpeechTranslate.exe ClientId ClientSecret helloworld.wav en-us es-es
Source: 1 of 8 spoken languages
Target: 1 of 50+ spoken languages
50. S2S in the Schools
Bilingual Mystery Skype
Deaf/Hard of Hearing Students
53. Deaf and Hard of Hearing Students
• In Seattle Public Schools, Jean Rogers’ (Chief Audiologist) and Liz Hayden’s
(Teacher of the Deaf) idea:
• Use Skype Translator with the “mainstreamed” deaf and hard of hearing kids
As we all know, the idea of being able to speak naturally with someone who doesn’t understand your language has been a long-held dream, whether we’re talking about the biblical story of the Tower of Babel or 20th-century sci-fi such as the Star Trek universal translator or Douglas Adams’s Babel fish.
About this time last year, we set ourselves the goal of turning this age-old dream into reality. We realized that there was a confluence of factors that, taken together, gave us the opportunity to make this happen. We in the MT field have been making steady progress on MT quality, both with better algorithms and by applying ever greater amounts of data, such that our best MT systems today are really quite good. At the same time, the ASR field has seen a technological leap over the last few years, with the use of DNNs leading to dramatically lower error rates. And finally, we at Microsoft Research felt we had a golden opportunity not just to do a great technology demo, but to actually put this in the hands of hundreds of millions of users through Skype.
In order to achieve this we faced a number of challenges, which is what I will be talking about for the next hour.
On one hand: Help Skype users
On the other hand: have a generalized speech translation system and service for 3rd parties.
There are three main challenges we need to work through to build a realistic S2S system:
The gulf between speech and text
It’s not enough to just chain a really good ASR system with a really good MT system
How people talk to each other is not how they write
Building really good conversational ASR and MT systems
Significant changes in the data we use to train the ASR and MT systems.
The gap between technology demo and consumer product
Plugging into Skype
Interesting problems one encounters with real consumers
[READ SLIDE] So, if we take the raw ASR output and just throw it at MT, it doesn’t work so well. We need components to process the ASR, remove disfluencies, etc and make it more palatable to MT. Likewise we need to adapt MT to handle this kind of input
So let’s take a closer look at the different types of disfluencies – first you have uh, your, um, fillers, then, you know, I mean, your discourse markers and and and repetition, and finally correct--, I mean, speech repairs, where people go back and repeat, I mean, change what they said.
In this example, here the speaker changed no to yes, and “I am” to “I have”.
Another thing that is missing is punctuation -- If we don’t get punctuation right, we are risking a lot more than just word salad. You may know the example “Let’s eat grandma”, where a missing comma could lead to a tragic outcome. When translating, the problem gets worse.
In the case of Spanish (and possibly other languages), we have an additional problem with accented words [READ SLIDE]
In addition to *how* people speak (genre), there’s also a big difference in *what* the talk about (domain).
Now let’s take a closer look at the Speech Correction component, which helps bridge the gap between spoken and written text.
[READ SLIDE] Adapting MT to ASR starts with building a good baseline conversationally-oriented MT system
Let’s listen to what the user said.
Very good, fluent English.
But if we read the English transcript, it is hard to understand, not to mention to translate.
Missing sentence boundaries, punctuation, casing and disfluency removal
First, since we know who is talking to whom, we can build user profiles for both parties, which enables us to do a better job in both recognition and translation.
Personalization and customization play a crucial role in open-domain S2S, since people are usually talking about broadly different things.
For example, here we can recognize the person name “Gurdeep” rather than “go deep”.
Good customization and personalization is crucial for the open-domain Skype Translator: people can talk about their planned vacation to Colombia with a Spanish speaker one day, which needs different vocabulary than talking to a Chinese supplier the next day about product plans.
Open-domain conversational S2S can benefit from customizing and personalizing the models according to the users’ profiles.
We can use users’ profiles to create customized models that fit their topics and vocabulary.
Describe the diagram above.
Currently we use this infrastructure for contact-name recognition.
One of the issues with ASR is that the vocab cannot include every possible person name (or place name etc). Expanding the vocab drastically to include millions of names can compromise WER, because names may be misrecognized in place of regular words.
However, in Skype Translator, we found that when the system didn’t recognize the caller’s or callee’s name at the start of a call, it often derailed the entire conversation.
So we opted for a surgical fix for now while we investigate more broad-based options. What we’ve done is added a very small restrictive grammar comprised of common greetings etc., but with placeholders for names. At the start of a call we dynamically compile the contact names for the current caller and callee into this grammar, and our ASR engine can use this grammar in parallel with its broad-based regular LM.
Instead of using the one-best output from ASR, we can use the n-best in lattice format.
Lattice rescoring lets us obtain many possible alternatives from the speech recognizer and then use a very large LM to score them.
In the early days we were fixated on WER and its effect on BLEU, and on the error cascade: if we pipeline multiple error-prone components together, the errors multiply. This is a well-studied problem that other researchers have worked on for a number of years.
One approach that’s been tried is to take the ASR lattice directly as input to the MT decoder. This has been studied by many groups and is conceptually elegant, but the implementation is quite complex. A lattice representation allows an MT system to arbitrate between multiple ambiguous hypotheses from upstream processing so that the best translation can be produced.
A simplification is decoding over a confusion network, where the ASR confusables are compactly encoded as a “word sausage”. This is very easy to decode in MT because it affects mostly just the phrase-lookup portion, leaving the rest of the decoding untouched except for some extra features. We found that the MT portion of this worked well. However collapsing an ASR-lattice into a confusion network is an ill-defined operation which can result in some nasty artifacts in the confusion network such as a proliferation of epsilon arcs.
After some experimentation with N-best rescoring and confusion networks, we decided to try a couple of different things.
Confusion network: A Confusion Network (CN), also known as a sausage, is a weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes.
Speech lattice: compactly maintains a set of candidate hypotheses as alternative paths through a weighted directed graph.
We decided that before we plunged into full-fledged MT decoding over lattices, we would first try simply rescoring the ASR lattice with a much bigger LM, adding some extra MT-friendly features and tuning model weights.
We discovered we could get very good ASR and end-to-end BLEU gains, at which point we decided not to bother with decoding over lattices in the MT decoder itself
Now we have an “almost” perfect transcription that matches what the user said, improved and customized as well. Are we ready to send this to MT?
Not yet. Disfluency handling should be done before we translate. But actually, disfluency removal and punctuation need to be done concurrently.
Finally, we come to segmentation into sentence units, punctuation and casing, which produces input ready for a state-of-the-art MT system to deliver reliable translation.
Traditionally this problem has been solved by first doing segmentation and then disfluency removal.
But there is an interaction between segmentation and disfluency handling.
If segmentation is done first, disfluency removal loses the chance to make a better correction, and we’re left with numerous disfluent fragments.
On the other hand, complex disfluency removal (speech repairs) needs sentence boundaries, so you can’t do that first either.
What we did is split the difference: we do some simple disfluency removal first, then segmentation, then complex disfluency removal.
Conditional random fields (CRFs) are a class of statistical modelling method often applied in pattern recognition and machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account; e.g., the linear chain CRF popular in natural language processing predicts sequences of labels for sequences of input samples.
First remove simple disfluencies, then segment, then remove complex disfluencies
First two stages are CRF sequence taggers
Complex disfluency handling:
Uses metadata annotated by previous stages
Using iterative parsing (NLPwin parser)
Needs sentence units
Brown clustering is a hard hierarchical agglomerative clustering problem based on distributional information. It is typically applied to text, grouping words into clusters that are assumed to be semantically related by virtue of their having been embedded in similar contexts.
In complex disfluency removal we take advantage of the NLPwin parser, which was built for the MS Word grammar checker and so is robust to ungrammatical input. For example, here we have a repeated subtree that is linguistically similar, so the first subtree is removed. We also look for constituents that appear to be disconnected from other parts of the tree. When we spot a disfluency we remove it and reparse the resulting sentence, because the removal of the disfluency could change the entire parse. We remove errors one by one and stop when we have no more edits or we hit a limit on the number of parses.
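The remove-and-reparse loop can be sketched as follows (the detector below is a stub that only spots a repeated adjacent word pair; the real system uses the NLPwin parser to find linguistically similar repeated subtrees and disconnected constituents):

```python
def remove_complex_disfluencies(tokens, find_disfluency, max_passes=5):
    """Delete one disfluent span per pass, then re-run detection on the
    edited sentence, stopping when nothing is found or a limit is hit."""
    for _ in range(max_passes):
        span = find_disfluency(tokens)  # (start, end) or None
        if span is None:
            break
        tokens = tokens[:span[0]] + tokens[span[1]:]
    return tokens

def repeated_pair(tokens):
    """Stub detector: for a repeated pair X Y X Y, mark the first X Y."""
    for i in range(len(tokens) - 3):
        if tokens[i:i + 2] == tokens[i + 2:i + 4]:
            return (i, i + 2)
    return None

out = remove_complex_disfluencies("it was it was worth it".split(), repeated_pair)
# out == ["it", "was", "worth", "it"]
```

Re-running detection after each deletion mirrors the reparse step: an earlier edit can expose or dissolve a later disfluency, so the loop must iterate rather than edit everything at once.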
Lots of numbers here. Looking at the last column, which is what we care about, here are the takeaways:
Sentence breaking based on speaker pauses is a bad idea (lose 0.5 BLEU points)
CRF sentence breaking by itself adds about 0.5 BLEU (vs no breaking at all i.e. translate the full utterance)
Disfluency removal by itself adds about 1.3 BLEU
Doing both gives you about 2 BLEU points and doing the split before/after adds another 0.5
This is Vinny, who participated in the Mystery Skype session we had with the schools in Beijing. Vinny’s deaf, so it was wonderful for him to participate in these calls with his classmates. Even when he was unable to hear the response back from the students, he could read the translation of what they were saying.
Although the use here demonstrates the use of the technology with deaf or hard of hearing students, it’s not much of a stretch to adapt the technology, since the components already exist, to hearing students that speak other languages. In fact, it could be used in that manner now. We haven’t tested it in this scenario…yet.