The 6th CHiME Speech Separation and Recognition Challenge

Track 2 / Instructions

In order to reach a broad audience, we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to allow systems to be broadly comparable, there are some guidelines that we expect participants to follow.

Which information can I use?

You can use the following annotations for training, development, and evaluation:

  • the session labels,
  • the synchronization ground truth,
  • the map of the recording environment showing the positions of the Kinects.

For training and development, you can use the full-length recordings of all Kinects and all binaural microphones, as well as the start and end times of all utterances and the corresponding speaker labels. Note that the start and end times may not be fully accurate. For training, you are also allowed to use the Voxceleb corpus.

For evaluation, you are allowed to use the full-length recordings of all Kinects for each session. The binaural microphone recordings cannot be used.

Note also that the dimensions in the maps are not fully accurate and heights are not provided.

Which information shall I not use?

Manual modification of the data or the annotations (e.g., manual refinement of the utterance start and end times) is forbidden.

All parameters should be tuned on the training set or the development set. Modifications of the development set are allowed, provided that its size remains unchanged and these modifications do not induce the risk of inadvertently biasing the development set toward the particular speakers or acoustic conditions in the evaluation set. For instance, enhancing the signals, applying “unbiased” transformations (e.g., fMLLR) or automatically refining the utterance start and end times is allowed. Augmenting the development set by generating simulated data, applying biased signal transformations (e.g., systematically increasing intensity/pitch), or selecting a subset of the development set is forbidden. In case of doubt, please ask us ahead of the submission deadline.

Can I use different data or annotations?

You are entirely free in the development of your system.

In particular, you can modify the training, development, and evaluation data listed in the “Which information can I use?” section above by:

  • automatically resynchronizing the signals with respect to each other,
  • processing the signals by means of speech enhancement or “unbiased” signal transformations (e.g., fMLLR),

and you can also modify the provided training data by:

  • automatically refining the utterance start and end times (e.g., by automatic speech activity detection; see the sketch below),
  • processing the signals by means of other transformations,
  • generating simulated data based on the Voxceleb corpus, on the provided binaural or Kinect signals and on artificially generated impulse responses and noises,

provided that these modifications are fully automatic (no manual reannotation) and they rely on the provided signals only (no external speech, impulse response, or noise dataset other than Voxceleb). The results obtained using those modifications will be taken into account in the final WER ranking of all systems.
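
For concreteness, here is a minimal sketch of what "automatically refining the utterance start and end times" could look like: a simple energy-based speech activity detector run around the provided annotations. The file name, margin, frame settings, and threshold below are illustrative assumptions only, and this is not part of the official baseline or any required recipe.

    # Minimal sketch of one permitted, fully automatic modification: refining the
    # provided utterance start/end times with a simple energy-based speech
    # activity detector. File name, margin, frame settings, and threshold are
    # illustrative assumptions, not part of the official baseline.
    import numpy as np
    import soundfile as sf

    def refine_boundaries(wav_path, start, end, margin=0.5,
                          frame_len=0.025, hop=0.010, threshold_db=-35.0):
        """Shrink [start, end] (in seconds) to the first/last frame whose
        log-energy lies within `threshold_db` dB of the segment's peak."""
        audio, sr = sf.read(wav_path)
        if audio.ndim > 1:                       # keep the first channel only
            audio = audio[:, 0]
        # Search within the annotated segment plus a small margin on each side.
        lo = max(0, int((start - margin) * sr))
        hi = min(len(audio), int((end + margin) * sr))
        segment = audio[lo:hi]

        flen, fhop = int(frame_len * sr), int(hop * sr)
        n_frames = max(1, 1 + (len(segment) - flen) // fhop)
        energy = np.array([np.sum(segment[i * fhop:i * fhop + flen] ** 2)
                           for i in range(n_frames)])
        log_e = 10.0 * np.log10(energy + 1e-12)
        active = np.where(log_e > log_e.max() + threshold_db)[0]
        if active.size == 0:                     # nothing detected: keep original times
            return start, end
        new_start = (lo + active[0] * fhop) / sr
        new_end = (lo + active[-1] * fhop + flen) / sr
        return new_start, new_end

    # Hypothetical usage on one annotated utterance:
    # new_start, new_end = refine_boundaries("S02_U01.CH1.wav", 12.40, 15.87)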

You may even use external speech, impulse response, or noise data taken from publicly available or in-house datasets. However, you should still report the results of your system using only the official challenge data, so that enough information is available to understand where the performance gains obtained by your system come from. The results obtained using external data will not be taken into account in the final WER ranking of all systems.

Can I use a different recogniser or overall system?

Again, you are entirely free in the development of your system.

In particular, you can:

  • include a single-channel or multichannel enhancement front-end,
  • use other acoustic features,
  • modify the acoustic model architecture or the training criterion,
  • modify the lexicon and the language model,
  • use any rescoring technique.

The results obtained using those modifications will be taken into account in the final ranking of all systems. Note that, depending on the chosen baseline (conventional or end-to-end) and the modifications made, your system will be ranked within either category A or B. If the outputs of the acoustic model remain frame-level tied phonetic targets, the lexicon and language model are unchanged compared to the conventional ASR baseline, and rescoring techniques (if any) are based on this lexicon and this language model (e.g., MBR or DLM based on acoustic features only), then it will be ranked within category A. Otherwise, e.g., if you used end-to-end ASR, you modified the lexicon or the language model, or you used rescoring techniques that implicitly modify the language model (e.g., DLM based on linguistic features), it will be ranked within category B. In case of doubt, please ask us ahead of the submission deadline.

Which results should I report?

For every tested system, you should report two diarization error rates (DERs), two Jaccard error rates (JERs), and two word error rates (WERs), all in %, namely:

  • the DER on the development set
  • the DER on the evaluation set
  • the JER on the development set
  • the JER on the evaluation set
  • the WER on the development set
  • the WER on the evaluation set.

For instance, here are the DERs, JERs and WERs (%) achieved by the diarization baseline followed by the category A ASR baseline.

Baseline       Development set          Evaluation set
               DER    JER    WER        DER    JER    WER
Category A     63.42  70.83  84.25      68.20  72.54  77.94
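
For orientation only, the sketch below illustrates how figures of this kind are defined; it is not the official scoring pipeline, and submitted results must be computed with the challenge scoring scripts. The toy DER assumes frame-level speaker labels with the reference-to-hypothesis speaker mapping already applied and ignores the scoring collar; the toy WER is a plain Levenshtein word error rate.

    # Toy definitions of DER and WER, for illustration only (not the official
    # scoring). Both return fractions; multiply by 100 to match the table above.

    def toy_der(ref, hyp):
        """ref/hyp: equal-length frame-level speaker labels, None = silence.
        DER = (missed speech + false alarm + speaker confusion) / reference speech.
        Assumes the reference-to-hypothesis speaker mapping is already applied
        and ignores the scoring collar."""
        missed = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
        false_alarm = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
        confusion = sum(1 for r, h in zip(ref, hyp)
                        if r is not None and h is not None and r != h)
        ref_speech = sum(1 for r in ref if r is not None)
        return (missed + false_alarm + confusion) / max(1, ref_speech)

    def toy_wer(ref_words, hyp_words):
        """WER = (substitutions + deletions + insertions) / len(ref_words),
        computed as a Levenshtein distance over word sequences."""
        d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
        for i in range(len(ref_words) + 1):
            d[i][0] = i                                # deletions only
        for j in range(len(hyp_words) + 1):
            d[0][j] = j                                # insertions only
        for i in range(1, len(ref_words) + 1):
            for j in range(1, len(hyp_words) + 1):
                cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[-1][-1] / max(1, len(ref_words))

    # Example: one substitution and one deletion over a 4-word reference -> WER = 0.5
    # toy_wer("the cat sat down".split(), "the hat sat".split())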


The experimental comparison of all systems should provide enough information to understand where the performance gains obtained by your best system come from. For instance, in the case when your diarization system can be split into a front end, a segmentation stage, an embedding stage and a clustering stage, please report the results obtained by your system when replacing each component by the corresponding baseline component. Similarly, in the case when your ASR system can be split into a front end, an acoustic model and a language model, please report the results obtained by your system when replacing each component by the corresponding baseline component. More generally, if your system is made of multiple blocks, we encourage you to separately evaluate and report the influence of each block on performance.

Ultimately, only the results of each team’s best system on the evaluation set will be taken into account in the final ranking of all systems according to WER, but the other metrics (DER and JER) will also be used for our analysis. The best system is taken to be the one that performs best on the development set.

Finally, you will be asked to provide the recognised transcriptions for the development and evaluation data for that system.