Human-Machine Voice Interfaces: Toward Human Performance
Roberto Pieraccini, Director of Advanced Conversational Technology "at some company somewhere in the US"
TAL Conference, Torino, 21 January 2014

Human-machine voice communication

Kubrick & Clarke's prediction for 2001 (made in 1969)

The reality in 2001
Design: Jonathan Bloom. Realization: Peter Krogh. From an idea by Roberto Pieraccini.

What has happened since then?
- In many cases the technological reality has surpassed the prediction:
  - Internet, Web, Wikipedia, social networks
  - PCs, tablets, smartphones, wireless
  - Genomics, personalized medicine, brain research
  - Big data, quantum computing
- BUT NOT FOR THE DISTINCTLY HUMAN FUNCTIONS:
  - Vision, voice, general intelligence, and "common sense"

Approximate amount of training data per recognition task:

TASK                        APPROXIMATE AMOUNT OF DATA (HOURS)
RESOURCE MANAGEMENT         30
ATIS                        40
WALL STREET JOURNAL         100
MEETING SPEECH              100
SWITCHBOARD                 300
BROADCAST NEWS              10,000
Siri, Google voice search   Unknown, but practically unlimited

60 years of history of computers that understand speech
[Timeline figure: von Kempelen's speaking machine (1769), Radio Rex (1920), Dudley's Voder (1936), Bell Labs AUDREY (1952), dynamic time warping, the statistical approach, the start of the speech technology industry, and SIRI and Google Voice Search (2011)]

Siri and Google Voice Search: are we at a turning point?
- Speech recognition is available, for the first time, to the majority of the population
  - Siri, Google Voice Search, Google Translate
- The use of unlimited quantities of data produces measurable improvements
- Google, Apple, and Amazon are investing massively in speech recognition and natural language research
- An unprecedented number of job openings in the speech field
- A new wave of enthusiasm for artificial intelligence
  - Google hires Ray Kurzweil, acquires Geoffrey Hinton's "deep learning" company, buys a D-Wave quantum computer for AI research, and keeps acquiring robotics companies. Facebook announces a new "AI lab" in New York; IBM follows suit.

A movie we have already seen?
- AI has "died" about four times in fifty years because of unfounded enthusiasm and big promises left unkept, and two of those times involved neural networks (Yann LeCun, NYU, now head of the new AI Lab at Facebook)

Bill Gates had predicted it...
Bill Gates, 1 October 1997: "In this 10-year time frame, I believe that we'll not only be using the keyboard and the mouse to interact, but during that time we will have perfected speech recognition and speech output well enough that those will become a standard part of the interface."
[Slide collage of further Bill Gates predictions, dated 25 June 1998, 24 March 1999, 10 March 2000, 28 July 2003, 25 February 2004, 14 September 2005, 14 October 2005, and 9 June 2011, each placing mainstream speech recognition a few years away.]

Recognizing speech is ... HARD!
- Despite everything, human performance is still unbeatable
- ...and we expect machines to perform just as well
- PROBLEMS:
  - Background noise and reverberation
  - Variation in vocal characteristics, in accent, and in language use
  - Limited vocabulary
  - Simultaneous speakers

Speech recognition and background noise
Digit recognition accuracy (AURORA-2), from Hirsch and Pearce, ISCA ITRW ASR2000.

Table 1: Word accuracy as a percentage for test set A in multi-condition training (only the low-SNR rows and the averages survive in this transcript; the set-A noise conditions are Subway, Babble, Car, and Exhibition):

SNR/dB         Subway   Babble   Car      Exhibition   Average
5              88.36    87.55    87.80    87.60        87.82
0              66.90    62.15    53.44    64.36        61.71
-5             26.13    27.18    20.58    24.34        24.55
Avg. 0-20 dB   88.75    87.95    86.52    88.03        87.81

Table 2: Word accuracy as a percentage for test set B in multi-condition training:

SNR/dB         Restaurant   Street   Airport   Train-station   Average
clean          98.68        98.52    98.39     98.49           98.52
20             96.87        97.58    97.44     97.01           97.22
15             95.30        96.31    96.12     95.53           95.81
10             91.96        94.35    93.29     92.87           93.11
5              83.54        85.61    86.25     83.52           84.73
0              59.29        61.34    65.11     56.12           60.46
-5             25.51        27.60    29.41     21.07           25.89
Avg. 0-20 dB   85.39        87.03    87.64     85.01           86.27

Table 3: Word accuracy as a percentage for test set C (MIRS-filtered channels) in multi-condition training:

SNR/dB   Subway(MIRS)   Street(MIRS)   Average
clean    98.50          98.58          98.54
20       97.30          96.55          96.92
15       96.35          95.53          95.94
10       93.34          92.50          92.92
5        82.41          82.53          82.47

New words

The cocktail-party effect

Source separation
From: "Audio Alchemy: Getting Computers to Understand Overlapping Speech", J. R. Hershey, P. A. Olsen, S. J. Rennie, A. Aaron, Scientific American, April 2011.
Speaker-masking algorithm. Mixed speech of four talkers:
Speaker 1: "Lay white at K 5 again." / Speaker 2: "Bin blue by M zero now." / Speaker 3: "Set green in M 7 please." / Speaker 4: "Lay green with S 7 please."
Separation by speaker masking recovers each of the four utterances.

Reverberation
- Close talking: baseline error rate
- 2 meters: twice as many errors
- 4 meters: four times as many errors
From: "Sub-band temporal modulation envelopes and their normalization for automatic speech recognition in reverberant environments", X. Lu, M. Unoki, S. Nakamura, Computer Speech and Language, July 2011.

Voice interface architecture
FRONT-END (from speech to features) -> SEARCH (from features to words) -> LANGUAGE UNDERSTANDING (from words to meaning) -> DIALOG (from meaning to actions)
Example: "I want to fly to San Francisco leaving from New York in the morning" is decoded into words, mapped to the meaning request(flight) origin(NYC) destination(SFO) time(morning), and answered with the dialog action "What date do you want to leave?"
The SEARCH stage relies on acoustic models (representations of speech units derived from data) and language models (representations of sequences of words derived from data).

The front end
Even today, the front ends of commercial recognizers use relatively simple spectral quantization techniques. Today we know that the human auditory apparatus uses much more sophisticated representation mechanisms.
A front end like the human one?

The acoustic models
The unbeatable Hidden Markov Models. The statistical independence assumption made by Markov models now limits the ability to improve performance and to fully exploit the enormous quantity of available data (Wegmann, Morgan, Cohen, 2013).

Do you remember the "template" technique?
[Figure: time alignment of the word "seven" against a stored template, 0.0 to 1.0 seconds]

A brief history of speech recognition mechanisms

The power of computers keeps increasing exponentially: Moore's Law.

The return of "templates"
From "Exemplar-Based Processing for Speech Recognition", Sainath et al., IEEE Signal Processing Magazine, 2012. See also Van Compernolle et al. (University of Leuven, Belgium); Nguyen and Zweig (Microsoft Research); Sainath, Ramabhadran, Nahamoo, Kanevsky, et al. (IBM Research).
If millions of templates are used, the performance obtained is comparable with that of Markov models: the power of empirical statistics against parametric statistics.

Neural networks in speech recognition
- Speech/non-speech classification (Morgan, 1983)
- Speech event classification (Makino, 1983)
- Recurrent ANNs (Fallside, Robinson, 1989)
- Time-Delay Neural Networks (Alex Waibel et al., 1989)
- Hybrid HMM/ANN (Morgan, Bourlard, 1989)
- Hidden Control Neural Networks (Esther Levin, 1990)
After the initial attempts to use them to recognize speech directly, neural networks have been employed essentially as models of the statistical distribution within the states of the Markov models.

The return of neural networks
[Diagrams: a network with a single hidden layer between input and output layers, and then a "deep neural network" with seven hidden layers]

Encouraging results with "deep neural networks"

And what about language understanding and dialog?

Understanding: conceptual models (CHRONUS, Pieraccini, Levin, 1991)
[Diagram: concepts as the states of a Markov model, with concept transition probabilities P(C_j|C_i), within-concept state transitions P(S_j|S_i), and trigram word emission probabilities P(w_t|w_{t-1}, w_{t-2}, S)]

Automatic learning of statistical grammars
Transcribed and annotated utterances feed both a language model for speech recognition and a statistical semantic classifier:

TRANSCRIPTION                                                                     ANNOTATION
want to cancel the account                                                        CANCEL_ACCOUNT
cancel service                                                                    CANCEL_ACCOUNT
I cant send a particular message to a certain group of people                     CANNOT_SEND_RECEIVE_EMAIL
cancellation of the service                                                       CANCEL_ACCOUNT
I need to setup my email                                                          EMAIL_SETUP
they registered my modem in from my internet and I need to get my email address   EMAIL_SETUP
my emails are not been received at the address I sent it to                       CANNOT_SEND_RECEIVE_EMAIL
...

Continuous learning
D. Suendermann, J. Liscombe, and R. Pieraccini: "How to Drink from a Fire Hose: One Person Can Annoscribe 693 Thousand Utterances in One Month". In Proc. of SIGDIAL 2010, 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Tokyo, Japan, September 2010.

Dialog: finite-state control models

Machine-learning models of the dialog function
- Introduction of "reinforcement learning" models based on MDPs (Markov Decision Processes) (Levin, Pieraccini, 2001)
- Introduction of POMDPs (Partially Observable Markov Decision Processes) (Williams, Young, 2006)
- Learning the dialog function is still an academic problem

"Online" dialog optimization
D. Suendermann and R. Pieraccini: "One Year of Contender: What Have We Learned about Assessing and Tuning Industrial Spoken Dialog Systems?" In Proc. of the Workshop on Future Directions and Needs in the Spoken Dialog Community at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montreal, Canada, June 2012.
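The Contender method cited above routes live calls at random among competing versions of a dialog component and shifts traffic toward the best performer once reliable statistics accumulate. Below is a minimal sketch of that idea as an epsilon-greedy selector; the contender names, reward scheme, and automation rates are invented for illustration, not taken from the paper:

```python
import random

class Contender:
    """One competing dialog strategy, with running reward statistics."""
    def __init__(self, name):
        self.name = name
        self.calls = 0
        self.total_reward = 0.0

    def mean_reward(self):
        return self.total_reward / self.calls if self.calls else 0.0

def pick(contenders, epsilon=0.1, rng=random):
    """Epsilon-greedy routing: mostly exploit the best-scoring
    contender, occasionally explore a random one."""
    if rng.random() < epsilon:
        return rng.choice(contenders)
    return max(contenders, key=Contender.mean_reward)

def record(contender, reward):
    """Log one call's outcome (e.g. 1.0 = automated, 0.0 = escalated)."""
    contender.calls += 1
    contender.total_reward += reward

# Simulate traffic over two hypothetical call-flow variants
rng = random.Random(0)
flows = [Contender("call-flow A"), Contender("call-flow B")]
true_automation = {"call-flow A": 0.60, "call-flow B": 0.75}
for _ in range(10000):
    c = pick(flows, rng=rng)
    record(c, 1.0 if rng.random() < true_automation[c.name] else 0.0)
best = max(flows, key=Contender.mean_reward)
```

The paper's system periodically recalculates routing probabilities from the observed data rather than using a fixed epsilon; the sketch only shows the explore/exploit structure.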
[Table 1 of the excerpted paper lists example Contenders (problem capture, cable box reboot, outage prediction, order, on demand, input source troubleshooting, account lookup, troubleshooting paths I and II, computer monitor instruction, opt in), their call volumes (from roughly 9,600 to 13.5 million calls), and the resulting gains in automated calls per month and in reward, together with pie charts of the traffic splits among competing call-flows.]
∆At is calculated by multiplying the observed difference in automation rate ∆A with the number of monthly calls hitting the Contender.
Conclusion of the excerpted paper: we have seen that the use of Contenders (a method to assess and tune arbitrary components of industrial spoken dialog systems) can be very beneficial in multiple respects. Applications can self-correct as soon as reliable data becomes available, without additional manual analysis and intervention. Moreover, performance can increase substantially in applications implementing Contenders.
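The gain arithmetic quoted above (∆At equals the observed automation-rate difference ∆A times the monthly call volume hitting the Contender) can be stated directly; the figures in the example are invented placeholders, not values from the paper's table:

```python
def automated_calls_gained(delta_automation_rate, monthly_calls):
    """Delta-A_t: observed automation-rate improvement multiplied by
    the monthly call volume hitting the Contender."""
    return delta_automation_rate * monthly_calls

# Hypothetical example: a 0.3-percentage-point win on 10 million calls/month
gain = automated_calls_gained(0.003, 10_000_000)  # roughly 30,000 calls
```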
Conclusions
- Renewed enthusiasm for human-machine voice interaction, generated by mass-market applications
- Still far from human performance, especially in non-optimal acoustic conditions
- The search for new solutions, and the revisiting of some old ideas, could narrow the gap
- Limited progress in understanding and in dialog management
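Returning to the statistical semantic classifier described earlier: the slide's own transcription/annotation table is enough data to train a toy version. Below is a minimal bag-of-words naive-Bayes sketch with add-one smoothing, offered as an assumed stand-in for that kind of classifier, not the actual production system:

```python
from collections import Counter, defaultdict
import math

# Training pairs taken verbatim from the slide's transcription/annotation table
DATA = [
    ("want to cancel the account", "CANCEL_ACCOUNT"),
    ("cancel service", "CANCEL_ACCOUNT"),
    ("I cant send a particular message to a certain group of people",
     "CANNOT_SEND_RECEIVE_EMAIL"),
    ("cancellation of the service", "CANCEL_ACCOUNT"),
    ("I need to setup my email", "EMAIL_SETUP"),
    ("they registered my modem in from my internet and I need to get my email address",
     "EMAIL_SETUP"),
    ("my emails are not been received at the address I sent it to",
     "CANNOT_SEND_RECEIVE_EMAIL"),
]

def train(data):
    """Count class priors and per-class word frequencies."""
    priors = Counter()
    words = defaultdict(Counter)
    vocab = set()
    for text, label in data:
        priors[label] += 1
        for w in text.lower().split():
            words[label][w] += 1
            vocab.add(w)
    return priors, words, vocab

def classify(text, priors, words, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word | class),
    with add-one smoothing over the vocabulary."""
    total = sum(priors.values())
    best, best_score = None, -math.inf
    for label in priors:
        n = sum(words[label].values())
        score = math.log(priors[label] / total)
        for w in text.lower().split():
            score += math.log((words[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

priors, words, vocab = train(DATA)
intent = classify("I want to cancel my service", priors, words, vocab)
```

On this tiny training set, "I want to cancel my service" lands on CANCEL_ACCOUNT because "cancel" and "service" dominate the score; a real deployment would train on hundreds of thousands of annotated utterances, as the "fire hose" paper describes.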