The 6th CHiME Speech Separation and Recognition Challenge

Track 2 / Software

We provide software baselines for array synchronization, speech enhancement, speech activity detection, speaker diarization, and speech recognition systems. All systems are integrated as a Kaldi CHiME-6 recipe.


  • The main script is similar to the one in track 1, which executes array synchronization, data preparation, data augmentation, feature extraction, GMM training, data cleaning, and chain model training. In track 2, the main script additionally trains a speech activity detector (SAD) on the CHiME-6 data and a diarization model on the VoxCeleb data. Participants are allowed to use the VoxCeleb data in addition to the CHiME-5/6 data, because it is necessary for building a diarization system with adequate performance (see the instructions page for details).
  • After training, the main script calls the inference script under local/, which performs speech enhancement, SAD, speaker diarization, and speech recognition with the trained models. You can also execute the inference script independently with your own SAD, diarization, and ASR models, or with the pre-trained models downloaded from here.
  1. Array synchronization to generate the new CHiME-6 audio data (stage 0)
    • This stage first downloads the array synchronization tool and then generates audio files that are synchronized across arrays, together with the corresponding JSON files. Note that this requires sox v14.4.2, which is installed via miniconda in ./local/. Details of the array synchronization are given in the Array synchronization section.
  2. Data, dictionary, and language model preparation (stages 1 to 3)
    • Prepare Kaldi format data directories, lexicon, and language models
    • Language model: maximum entropy based 3-gram
       data/srilm/best_3gram.gz ->
    • Vocabulary size: 127,712
       $ wc -l data/lang/words.txt
       127712 data/lang/words.txt
  3. Data augmentation (stages 4 to 7)
    • In these stages, we augment and fix the training data. Point source noises are extracted from the CHiME-6 corpus. For training, we use 400k utterances from the array microphones, their augmented versions, and all utterances from the worn microphones.
    • To keep the system simple, we do not include enhanced speech data in the training data.
  4. Feature extraction (stage 8)
    • We extract 13-dimensional MFCC features for the GMM-HMM systems.
  5. GMM training (stages 9 to 13)
    • Stages 9 to 13 train monophone and triphone models. They will be used for cleaning training data and generating lattices for training the chain model.
  6. Data cleaning (stage 14)
    • This stage performs data cleanup for training data by using the GMM model.
  7. Chain model training (stage 15)
    • We use a factorized time delay neural network (TDNN-F) adapted from SWBD recipe 7q
    • You can also download a pretrained chain ASR model using: wget
    • Once it is downloaded, extract it using: tar -xvzf 0012_asr_v1.tar.gz and copy the contents of the exp/ directory to your exp/ directory.
  8. SAD training
    • We use a TDNN+LSTM model trained on the CHiME-6 data, using alignments obtained from a GMM system.
    • You can also download a pretrained SAD model using: wget
    • Once it is downloaded, extract using: tar -xvzf 0012_sad_v1.tar.gz and copy the contents of the exp/ directory to your exp/ directory.
  9. Diarization training
    • An x-vector DNN for diarization is trained with the VoxCeleb data. The training script is adapted from the Kaldi VoxCeleb v2 recipe.
    • PLDA model is trained with the CHiME-6 data.
    • You can also download a pretrained diarization model using: wget
    • Once it is downloaded, extract it using: tar -xvzf 0012_diarization_v1.tar.gz and copy the contents of the exp/ directory to your exp/ directory.
  10. Decoding and scoring (stage 16)
    • In track 2, only the raw audio is given, without segment or speaker information. The inference script under local/ therefore has to perform the whole pipeline: speech enhancement -> speech activity detection -> speaker diarization -> decoding and scoring.

    • [1] Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). The Fifth ‘CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines. Proc. Interspeech 2018, 1561-1565.
    • [2] Manohar, V., Chen, S. J., Wang, Z., Fujita, Y., Watanabe, S., & Khudanpur, S. (2019, May). Acoustic Modeling for Overlapping Speech Recognition: JHU CHiME-5 Challenge System. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6665-6669)
    • [3] Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M., & Khudanpur, S. (2018, September). Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. In Interspeech (pp. 3743-3747).

Array synchronization

The new array synchronisation baseline is available on GitHub. The array synchronisation compensates for two separate issues: audio frame-dropping (which affects the Kinect devices only) and clock-drift (which affects all devices). It operates in two stages:

  1. Frame-dropping is compensated by inserting zeros into the signals where samples have been dropped. These locations were detected by comparing the Kinect audio with an uncorrupted stereo audio signal recovered from the video AVI files that were recorded (but not made publicly available). The frame-drop locations have been precomputed and stored in the file chime6_audio_edits.json, which is then used to drive the synchronisation software.

  2. Clock-drift is estimated by comparing each device’s signal to the session’s ‘reference’ binaural recording (the binaural mic of the speaker with the lowest ID number). Specifically, cross-correlation is used to estimate delays between the device and the reference at regular intervals throughout the recording session (performed using code from the CHiME-5 synchronization baseline). A relative speed-up or slow-down is then approximated by a linear fit through these estimates, and the signal is synchronised to the reference using a sox command that adjusts its speed accordingly. This adjustment is typically very subtle, i.e., less than 100 ms over the 2.5-hour recording session. Note that the approach failed for devices S01_U02 and S01_U05, which appear to have temporarily changed speed during the recording session and required a piece-wise linear fit. The adjustments for clock-drift compensation have been precomputed, and the parameters that drive the sox commands are stored in chime6_audio_edits.json.
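The linear fit used for clock-drift compensation can be illustrated with a minimal sketch. It assumes the delay estimates are already available (the values below are synthetic), fits a line through them by ordinary least squares, and turns the slope into a speed factor of the kind that could be passed to the sox speed effect. The function names and the sign convention for the delay are assumptions made for this illustration and are not taken from the synchronisation tool, whose actual parameters are the ones stored in chime6_audio_edits.json.

```python
def fit_line(times, delays):
    """Ordinary least-squares fit: delays ~ slope * times + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(delays) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, delays))
    slope = cov / var_t
    return slope, mean_d - slope * mean_t


def speed_factor(slope):
    """Speed factor that cancels a drift of `slope` seconds per second.

    Assumed convention: delay = device time minus reference time, so a
    positive slope means the device recording runs long and must be sped up.
    """
    return 1.0 + slope


# Synthetic delay estimates taken every 10 minutes over a 2.5-hour session:
# a fixed 20 ms offset plus a drift that reaches 90 ms at the end.
session_len = 9000.0
times = [600.0 * k for k in range(16)]
delays = [0.020 + 0.090 * t / session_len for t in times]

slope, offset = fit_line(times, delays)
factor = speed_factor(slope)
print(f"drift: {slope * 1e6:.1f} us/s, offset: {offset * 1e3:.1f} ms, speed: {factor:.6f}")
```

For the two devices that changed speed mid-session, the same fit would have to be applied separately to each piece of the recording.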

Note that after frame-drop and clock-drift compensation, the wav files generated for each device will have slightly different durations. For each session, the device signals can safely be truncated to the duration of the shortest signal across devices, but this step is not performed by the synchronisation tool.
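As a rough sketch of the frame-drop compensation and of the optional truncation step, the code below re-inserts zeros at known drop locations and then cuts all device signals to the shortest length. The (position, count) representation of a drop is a hypothetical format chosen for this example; it is not the layout of chime6_audio_edits.json.

```python
def insert_zeros(samples, drops):
    """Re-insert dropped samples as zeros.

    `drops` is a list of (position, count) pairs (hypothetical format):
    `count` samples were lost just before index `position` of `samples`.
    """
    out = []
    prev = 0
    for pos, count in sorted(drops):
        out.extend(samples[prev:pos])
        out.extend([0] * count)
        prev = pos
    out.extend(samples[prev:])
    return out


def truncate_to_shortest(signals):
    """Cut every device signal to the length of the shortest one."""
    n = min(len(s) for s in signals.values())
    return {name: s[:n] for name, s in signals.items()}


repaired = insert_zeros([1, 2, 3, 4], [(2, 3)])
session = truncate_to_shortest({"U01": repaired, "U02": [5, 6, 7, 8, 9]})
print(repaired)
print({name: len(s) for name, s in session.items()})
```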

Finally, the CHiME-5 transcript JSON files are processed to fit the new alignment. In the new version, utterances have the same start and end time on every device.

Speech enhancement

Unlike track 1, track 2 provides only the following BeamformIt-based speech enhancement method, because guided source separation (GSS) driven by diarization output risks degrading performance. (Note that track 2 cannot use the speech segment information for each speaker at all.)

  1. WPE-based dereverberation and a weighted delay-and-sum beamformer (BeamformIt), applied to the reference array.

    • [4] Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., & Juang, B. H. (2010). Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Transactions on Audio, Speech, and Language Processing, 18(7), 1717-1731.
    • [5] Anguera, X., Wooters, C., & Hernando, J. (2007). Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2011-2022
    • [6] Drude, L., Heymann, J., Boeddeker, C., & Haeb-Umbach, R. (2018, October). NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing. In Speech Communication; 13th ITG-Symposium (pp. 1-5)
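The idea behind a weighted delay-and-sum beamformer can be sketched in a few lines. This is a generic illustration only, not BeamformIt's implementation: it assumes that integer per-channel delays and channel weights are already known, whereas a real beamformer has to estimate them from the signals.

```python
def delay_and_sum(channels, delays, weights):
    """Weighted delay-and-sum over equal-length channels.

    delays[i] is the (assumed known) number of samples by which channel i
    lags the reference; weights are assumed to sum to 1.
    """
    n = len(channels[0])
    out = [0.0] * n
    for ch, d, w in zip(channels, delays, weights):
        for t in range(n):
            src = t + d  # read ahead to undo the lag of this channel
            if 0 <= src < n:
                out[t] += w * ch[src]
    return out


# The same pulse arrives at sample 3 on the reference channel and two
# samples later on the second channel.
ch0 = [0, 0, 0, 1, 0, 0, 0, 0]
ch1 = [0, 0, 0, 0, 0, 1, 0, 0]
enhanced = delay_and_sum([ch0, ch1], delays=[0, 2], weights=[0.5, 0.5])
print(enhanced)
```

After alignment the pulse adds coherently, while uncorrelated noise on the channels would be averaged down.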

Speech activity detection

  • This speech activity detection is based on neural networks with statistics pooling for long-context [7].
  • It was trained on the train_worn_u400k data, which consists of 1) the CHiME-6 worn microphone utterances and 2) 400k randomly picked array microphone utterances.
  • We generate the speech activity labels using an HMM-GMM system trained on the train_worn_simu_u400k data, which consists of 1) the CHiME-6 worn microphone utterances perturbed with various room impulse responses generated by a room simulator and 2) 400k randomly picked array microphone utterances.
  • Neural network architecture:
    • Input feature: 40-dimensional MFCC features.
    • 5 TDNN layers and 2 layers of statistics pooling (see [7] for the statistics pooling).
    • The overall context of the network is about 1 s, with about 0.8 s of left context and 0.2 s of right context.
    • The network is trained with a cross-entropy objective to predict the speech / non-speech labels.
  • Simple Viterbi decoding on an HMM with duration constraints of 0.3s for speech and 0.1s for silence is used to get speech activity labels for the test data recordings.
  • How to check SAD results?
    • The SAD decoding script under local/segmentation/ outputs a segments file. Use the script under steps/segmentation/ with the generated segments file and a dummy utt2spk file (with the spkid set to the uttid itself) to get an RTTM file. Then run the scoring tool with the options -1 -c 0.25 -r <ref-rttm> -s <hyp-rttm> to obtain the SAD results.
  • For simplicity, the baseline system performs SAD only for the U06 array. Exploring multi-array fusion techniques for SAD/diarization/ASR is a part of the challenge.
  • [7] Ghahremani, P., Manohar, V., Povey, D., & Khudanpur, S. (2016, September). Acoustic Modelling from the Signal Domain Using CNNs. In Interspeech (pp. 3434-3438).
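Viterbi decoding with minimum-duration constraints, as used above to turn frame-level network outputs into speech segments, can be sketched as follows. This is an illustration rather than the recipe's decoder: it assumes a 10 ms frame shift, uses synthetic per-frame log-likelihoods, and imposes the minimum durations by expanding each class into a chain of states, which is one common construction. The last run of the recording is allowed to be shorter than the minimum.

```python
NEG_INF = float("-inf")


def viterbi_min_duration(loglik, min_frames):
    """Best label sequence in which every run of class c lasts at least
    min_frames[c] frames. loglik[t][c] is the log-likelihood of class c
    at frame t."""
    num_classes = len(min_frames)
    # State (c, k): inside a run of class c, k frames after it started,
    # with k capped at min_frames[c] - 1.
    states = [(c, k) for c in range(num_classes) for k in range(min_frames[c])]
    index = {s: i for i, s in enumerate(states)}

    def predecessors(c, k):
        preds = []
        if k > 0:
            preds.append(index[(c, k - 1)])  # still completing the minimum run
        if k == min_frames[c] - 1:
            preds.append(index[(c, k)])  # minimum reached: may stay
        if k == 0:
            # A new run can start only after another class has finished
            # its own minimum duration.
            preds.extend(index[(o, min_frames[o] - 1)]
                         for o in range(num_classes) if o != c)
        return preds

    preds_of = [predecessors(c, k) for c, k in states]
    score = [loglik[0][c] if k == 0 else NEG_INF for c, k in states]
    back = []
    for frame in loglik[1:]:
        new_score, pointers = [], []
        for s, (c, _) in enumerate(states):
            best = max(preds_of[s], key=lambda p: score[p])
            new_score.append(score[best] + frame[c])
            pointers.append(best)
        score = new_score
        back.append(pointers)

    state = max(range(len(states)), key=lambda s: score[s])
    labels = [states[state][0]]
    for pointers in reversed(back):
        state = pointers[state]
        labels.append(states[state][0])
    labels.reverse()
    return labels


# Class 0 = silence (min 0.1 s), class 1 = speech (min 0.3 s), 100 frames/s.
min_frames = [round(0.1 * 100), round(0.3 * 100)]
SIL, SPE = [0.0, -2.0], [-2.0, 0.0]  # synthetic frame scores
# 0.2 s silence, a 0.05 s speech-like blip, 0.2 s silence, 0.4 s speech, 0.2 s silence
loglik = [SIL] * 20 + [SPE] * 5 + [SIL] * 20 + [SPE] * 40 + [SIL] * 20
labels = viterbi_min_duration(loglik, min_frames)
print("speech frames:", sum(labels))
```

In this toy input the 0.05 s blip is shorter than the minimum speech duration, so it is absorbed into the surrounding silence, while the 0.4 s speech region is kept.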

Speaker diarization

  • An x-vector system with 5-layer TDNN is trained with the VoxCeleb data.
  • PLDA is trained with the CHiME-6 data (train_worn_simu_u400k).
  • Diarization is performed given the segment files obtained by SAD.
  • Agglomerative hierarchical clustering (AHC) is performed given the number of speakers (=4). The number of speakers in the CHiME-6 is always 4 in every session, and we can use this prior information in the system.
  • The reference RTTM is always the one converted from the original JSON files during data preparation (--stage 1).
  • The diarization result is output as an RTTM file.
  • Diarization error rate (DER) and Jaccard error rate (JER) are computed using dscore (as used in the DIHARD II challenge).
  • Note that we fixed a bug, updated the reference RTTM, and removed the introduction part. These changes are reflected in the latest Kaldi recipe.

                Dev. DER (%)   Dev. JER (%)   Eval. DER (%)   Eval. JER (%)
    Baseline    63.42          70.83          68.20           72.54
  • Note 1: in a future analysis, we may introduce a UEM file to exclude the redacted utterances.

  • Note 2: for simplicity, the baseline system performs SAD only for the U06 array. Exploring multi-array fusion techniques for SAD/diarization/ASR is a part of the challenge.

    • [8] Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., Manohar, V., Dehak, D., Povey, D., Watanabe, S., & Khudanpur, S. (2018). Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge. In Interspeech (pp. 2808-2812).
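Agglomerative hierarchical clustering with a known number of speakers can be sketched as below. The sketch assumes that a symmetric matrix of pairwise similarity scores between segments is already available (in the baseline these would come from PLDA scoring of x-vectors) and uses average linkage; both the toy scores and the linkage choice are assumptions of this example, not a description of the Kaldi implementation.

```python
def ahc(similarity, num_clusters):
    """Average-linkage agglomerative clustering on a symmetric similarity
    matrix; merging stops when `num_clusters` clusters remain."""
    clusters = [[i] for i in range(len(similarity))]

    def linkage(a, b):
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        x, y = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]  # merge the most similar pair
        del clusters[y]

    labels = [0] * len(similarity)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy stand-in for segment embeddings: eight 1-D points from four "speakers".
points = [0.0, 0.1, 5.0, 5.2, 10.0, 10.1, 15.0, 15.3]
similarity = [[-abs(a - b) for b in points] for a in points]
speaker_labels = ahc(similarity, num_clusters=4)  # 4 speakers per session
print(speaker_labels)
```

Fixing the number of clusters to 4 replaces the score threshold that would otherwise be needed as a stopping criterion.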

Decoding and scoring

  • First, the RTTM files obtained by speaker diarization are converted to the Kaldi data directories, e.g., dev_beamformit_dereverb_diarized_hires/ and eval_beamformit_dereverb_diarized_hires/

  • We perform two-stage decoding, which refines the i-vector extraction based on the first-pass decoding result to make decoding more robust to noisy speech.

  • We prepare the scoring script for both development and evaluation sets for the submission based on


  • The language model weight and insertion penalty are optimized based on the development set.

  • Multi-speaker scoring (local/) is performed to produce the concatenated minimum-permutation word error rate (cpWER):
    1. First concatenate all texts across utterances per speaker for both reference and hypothesis files.

    2. Compute WER between the reference and all speaker permutations of the hypothesis (4 * 3 * 2 = 24 permutations).

    3. Pick the best permutation result. (Note that this result was updated on Feb. 24 and Mar. 10 using the latest Kaldi recipe and fixing the errors pointed out in this thread and this thread.)

                        Dev. WER (%)   Eval. WER (%)
      Kaldi baseline    84.25          77.94
    4. We also provide the detailed errors per utterance in local/ --stage 3 and local/ --stage 4 for analysis purposes.

    5. The final results appear in the log as follows:

      best LM weight: 11
      best insertion penalty weight: 0.5
      Dev:  %WER 84.25 [ 49610 / 58881, 1937 ins, 34685 del, 12988 sub ]
      Eval: %WER 77.94 [ 42971 / 55132, 1086 ins, 30839 del, 11046 sub ]
  • We are preparing other evaluation metrics for analysis purposes at the submission stage (they do not affect the ranking).

  • For simplicity, the baseline system performs SAD only for the U06 array. Exploring multi-array fusion techniques for SAD/diarization/ASR is a part of the challenge.

    • [9] Peddinti, V., Chen, G., Manohar, V., Ko, T., Povey, D., & Khudanpur, S. (2015, December). Jhu aspire system: Robust LVCSR with TDNNs, ivector adaptation and RNN-LMs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 539-546).
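The cpWER computation described in the steps above can be sketched as follows: concatenate each speaker's utterances, compute the word-level edit distance between every reference/hypothesis speaker pair, and keep the speaker permutation with the fewest total errors. The function names and the toy transcripts are made up for this illustration, and the sketch assumes the reference and the hypothesis contain the same number of speakers; it is not the scoring script itself.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cp_wer(ref_utts, hyp_utts):
    """ref_utts / hyp_utts map a speaker label to that speaker's utterances
    in time order. Assumes the same number of speakers on both sides."""
    refs = [" ".join(utts).split() for utts in ref_utts.values()]
    hyps = [" ".join(utts).split() for utts in hyp_utts.values()]
    # Pairwise distances are computed once and reused for every permutation.
    dist = [[edit_distance(r, h) for h in hyps] for r in refs]
    errors = min(sum(dist[i][p[i]] for i in range(len(refs)))
                 for p in permutations(range(len(hyps))))
    return errors / sum(len(r) for r in refs)


ref = {"P01": ["hello there", "how are you"], "P02": ["fine thanks"]}
hyp = {"spk1": ["fine thank"], "spk2": ["hello there how you"]}
score = cp_wer(ref, hyp)
print(f"cpWER: {100 * score:.2f} %")
```

With four speakers there are only 24 permutations, so exhaustive search over the precomputed distance matrix is cheap.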

Usage (especially for Kaldi beginners)

  • These instructions are almost the same as for track 1.
  1. Download Kaldi, compile the Kaldi tools, and install BeamformIt for beamforming, Phonetisaurus for constructing a lexicon using grapheme-to-phoneme conversion, SRILM for language model construction, and miniconda and Nara WPE for dereverberation. For SRILM, you need to download the source (srilm.tgz) first.
     git clone
     cd kaldi/tools
     make -j                             # the "-j" option parallelizes the build
     ./extras/      # BeamformIt
     ./extras/           # Get source from first
     ./extras/   # G2P
     ./extras/       # Miniconda for several Python packages including Nara WPE, audio synchronization, and GSS
     ./extras/             # Nara WPE
  2. Compile Kaldi (-j 10 means the number of jobs is 10. Please change this number based on your environment accordingly).
     cd ../src
     make depend -j 10
     make -j 10
  3. Move to the CHiME-6 track 2 baseline in the Kaldi egs/ directory.
     cd ../egs/chime6/s5_track2
  4. Specify the model and CHiME-5 root paths in the main script. You can also specify the CHiME-6 root directory, which is generated in the array synchronization stage (stage 0).
     chime5_corpus=<your CHiME-5 path>
     chime6_corpus=<desired CHiME-6 path>
  5. Execute

    We suggest using the following command to save the main log file:

     nohup ./ > run.log

    If your experiments have failed or you want to resume your experiments at some stage, you can use the following command (this example is to rerun GMM experiments from stage 9):

     ./ --stage 9
  6. If you have your own enhanced speech data for test, you can perform your own enhancement by using local/

  7. You can find the resulting diarization error rates (DERs) and word error rates (WERs) in the following files:
    • DER
    • WER


  • This script simply picks the U06 array to output a single DER/WER result per session. You can use multi-array processing, system combination, or any other technique to fuse the multi-array information into a single DER/WER result.
  • Make sure that you are using the correct version of Kaldi (TODO).
  • During scoring, we filter the tags ([noise], [inaudible], [laughs], and [redacted]), and normalize the filler hmm, i.e., sed -e 's/\/hmm/g; s/\/hmm/g; s/\/hmm/g;'. See local/wer_output_filter. The final scoring script will be released when the test data is released.
  • The WER can differ for every run and every machine due to random initialization and to machine-specific issues. The difference can be up to several percent absolute.