The workshop is now over. Videos and slides for the talks and keynotes are available through the links in the schedule below. There is also a YouTube Playlist for all talks.
Oral Session 1
|11:10||Overview of the 6th CHiME Challenge [YouTube] [Slides]
(1Center for Language and Speech Processing, Johns Hopkins University; 2Brooklyn College, City University of New York; 3University of Sheffield, UK; 4Inria, France)
|11:35||The IOA Systems for CHiME-6 Challenge [Paper] [YouTube] [Slides]
(1Key Laboratory of Speech Acoustics & Content Understanding, Institute of Acoustics, CAS, China; 2University of Chinese Academy of Sciences, Beijing, China)
|11:55||The OPPO System for CHiME-6 Challenge [Paper] [YouTube] [Slides]
(Beijing OPPO telecommunications corp., ltd., Beijing, China)
|12:15||The Qdreamer Systems for CHiME-6 Challenge [Paper] [YouTube] [Slides]
(1Qdreamer Research, Suzhou, JiangSu, P.R. China; 2School of Computer Science and Technology, Harbin Institute of Technology, Harbin, P.R. China)
Oral Session 2
|13:55||The USTC-NELSLIP Systems for CHiME-6 Challenge [Paper] [YouTube] [Slides]
(1University of Science and Technology of China, Hefei, Anhui, P. R. China; 2Georgia Institute of Technology, Atlanta, Georgia, USA; 3Northwestern Polytechnical University, Shanxi, P. R. China)
|14:20||The CW-XMU System For CHiME-6 Challenge [Paper] [YouTube] [Slides]
(1CloudWalk Technology Co., Ltd.; 2Xiamen University)
|14:40||The Academia Sinica Systems of Speech Recognition and Speaker Diarization for the CHiME-6 Challenge [Paper] [YouTube] [Slides]
(1Institute of Information Science, Academia Sinica, Taiwan; 2Research Center for Information Technology Innovation, Academia Sinica, Taiwan)
|14:55||LEAP Submission to CHiME-6 ASR Challenge [Abstract] [YouTube] [Slides]
(Learning and Extraction of Acoustic Patterns (LEAP) Lab, Indian Institute of Science, Bangalore, 560012)
Oral Session 3
|16:40||The STC System for the CHiME-6 Challenge [Paper] [YouTube] [Slides]
(1STC-innovations Ltd; 2ITMO University, Saint Petersburg, Russia)
|17:05||Towards a speaker diarization system for the CHiME 2020 dinner party transcription [Paper] [YouTube] [Slides]
(1Paderborn University, Department of Communications Engineering, Paderborn, Germany; 2Toshiba Cambridge Research Laboratory, Cambridge, United Kingdom; 3Toshiba Corporation Corporate R&D Center, Kawasaki, Japan; 4Toshiba China R&D Center, Beijing, China)
|17:25||The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge [Paper] [YouTube] [Slides]
(Center for Language and Speech Processing, Johns Hopkins University)
|17:45||CUNY Speech Diarization System for the CHiME-6 Challenge [Abstract] [YouTube] [Slides]
(1The Graduate Center, City University of New York; 2Brooklyn College, City University of New York)
|18:05||BUT System for CHiME-6 Challenge [Paper] [YouTube] [Slides]
(Brno University of Technology, Faculty of Information Technology, IT4I Centre of Excellence, Czechia)
|18:25||Toshiba’s Speech Recognition System for the CHiME 2020 Challenge [Paper] [YouTube] [Slides]
(1Toshiba Cambridge Research Laboratory, Cambridge, United Kingdom; 2Toshiba Corporation Corporate R&D Center, Kawasaki, Japan; 3Toshiba China R&D Center, Beijing, China)
Diarization, the missing link in Speech Technologies
The amount of unlabeled speech data that is available enormously outweighs the labeled data, and there is great potential in using this data to improve the performance of current speech recognition systems and related technologies. A primary goal of research in this domain is to automatically compute labels for the unlabeled data with an acceptable level of accuracy for downstream applications. One such task is to answer the question "who spoke when" in a recording, identifying regions containing speech and assigning a speaker identity label to each utterance. This labeling, called speaker diarization, is not typically the final task for applications, but is often a missing link in a pipeline that can boost the performance of automatic speech recognition and speaker and language identification systems. In this talk, I will guide you through a journey of this missing link. We will start with a brief discussion of the key components that comprise the state-of-the-art systems: a voice activity detector, speaker embeddings, and scoring and clustering techniques. Next, we will demonstrate the aspects in which current systems fail and propose new alternatives to attain better performance. We will address overlap detection, resegmentation, and speaker turn detection, among others. In addition, we will give some insights into the newest solutions, such as end-to-end approaches. Then, we will go beyond diarization and explore the positive impact of including a diarization stage in speech and speaker recognition systems. Finally, we will discuss the influence of diarization on other fields such as cognitive science and linguistics.
Dr. Leibny Paola Garcia Perera (PhD 2014, University of Zaragoza, Spain) joined Johns Hopkins University after extensive research experience in academia and industry, including highly regarded laboratories at Agnitio and Nuance Communications. She led a team of 20+ researchers from four of the best laboratories worldwide in far-field speech diarization and speaker recognition, under the auspices of the JHU summer workshop 2019 in Montreal, Canada. She was also a researcher at Tec de Monterrey, Campus Monterrey, Mexico for 10 years. She was a Marie Curie researcher for the Iris project during 2015, exploring assistive technology for children with autism in Zaragoza, Spain. She was a visiting scholar at Georgia Institute of Technology (2009) and Carnegie Mellon (2011). Recently, she has been working on children’s speech, including child speech recognition and diarization in day-long recordings. She is also part of the JHU CHiME-5, CHiME-6, SRE18 and SRE19 teams. Her interests include diarization, speech recognition, speaker recognition, machine learning and language processing.
Solving Cocktail Party Problem – From Single Modality to Multi-Modality
The cocktail party problem is one of the difficult problems that must still be solved to enable high-accuracy speech recognition in everyday environments. In this talk, I will introduce our recent attempts to attack this problem with a focus on multi-channel multi-modal approaches.
Dr. Dong Yu, IEEE Fellow, is a distinguished scientist and vice general manager at Tencent AI Lab. Prior to joining Tencent in 2017, he was a principal researcher at the Microsoft speech and dialog research group. His research, which focuses on statistical speech recognition and processing, has been recognized by the prestigious IEEE Signal Processing Society 2013 and 2016 best paper awards and has been widely cited.
Dr. Dong Yu is currently serving as the vice chair of the IEEE Speech and Language Processing Technical Committee (SLPTC). He has served as a member of the IEEE SLPTC (2013-2018), a distinguished lecturer of APSIPA (2017-2018), an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2011-2015), an associate editor of the IEEE Signal Processing Magazine (2008-2011), and a member of the organizing and technical committees of many conferences and workshops.