Track 1 / Instructions
In order to reach a broad audience, we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to allow systems to be broadly comparable, there are some guidelines that we expect participants to follow.
Which information can I use?
You can use the following annotations for training, development, and evaluation:
- the session and location labels,
- the synchronization ground truth,
- the start and end times of all utterances,
- the corresponding speaker labels,
- the map of the recording environment showing the positions of the Kinects.
For training and development, you can use the full-length recordings of all Kinects and all binaural microphones.
For evaluation, you are allowed to use, for a given utterance, the full-length recordings of all Kinects for that session. In other words, you are not limited to the past context or to the immediate context surrounding each utterance. The binaural microphone recordings for that session can also be used, albeit for array synchronization only; they shall not be used for enhancement or recognition.
Note that the start and end times may not be fully accurate, since they were manually annotated on the recordings of the binaural microphones worn by the speaker. Note also that the dimensions in the maps are not fully accurate and heights are not provided.
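For illustration only, the time offset between a Kinect channel and a binaural channel can be estimated by cross-correlation, as in the minimal sketch below (the function name, variable names, and sampling rate are illustrative assumptions, not part of the challenge baseline):

```python
import numpy as np
from scipy.signal import correlate

def estimate_lag(ref, sig, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref`,
    both mono numpy arrays sampled at `fs` Hz, via cross-correlation."""
    xcorr = correlate(sig, ref, mode="full")
    # The peak of the cross-correlation gives the lag in samples.
    lag_samples = int(np.argmax(np.abs(xcorr))) - (len(ref) - 1)
    return lag_samples / fs

# Illustrative usage: align one Kinect channel to a binaural reference
# (crude circular shift, for illustration only).
# lag = estimate_lag(binaural_channel, kinect_channel, fs=16000)
# kinect_aligned = np.roll(kinect_channel, -int(round(lag * 16000)))
```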
Which information shall I not use?
Manual modification of the data or the annotations (e.g., manual refinement of the utterance start and end times) is forbidden.
All parameters should be tuned on the training set or the development set. Modifications of the development set are allowed, provided that its size remains unchanged and these modifications do not risk inadvertently biasing the development set toward the particular speakers or acoustic conditions in the evaluation set. For instance, enhancing the signals, applying “unbiased” transformations (e.g., fMLLR), or automatically refining the utterance start and end times is allowed. Augmenting the development set by generating simulated data, applying biased signal transformations (e.g., systematically increasing intensity/pitch), or selecting a subset of the development set is forbidden. In case of doubt, please ask us ahead of the submission deadline.
Can I use different data or annotations?
You are entirely free in the development of your system.
In particular, you can modify the training, development, and evaluation data listed in the “Which information can I use?” section above by:
- automatically resynchronizing the signals with respect to each other,
- processing the signals by means of speech enhancement or “unbiased” signal transformations (e.g., fMLLR),
- automatically refining the utterance start and end times (e.g., by automatic speech activity detection; see the sketch at the end of this section),
and you can also modify the provided training data by:
- processing the signals by means of other transformations,
- generating simulated data based on the provided binaural or Kinect signals and on artificially generated impulse responses and noises,
provided that these modifications are fully automatic (no manual reannotation) and rely on the provided signals only (no external speech, impulse response, or noise data). The results obtained using those modifications will be taken into account in the final WER ranking of all systems.

You may even use external speech, impulse response, or noise data taken from publicly available or in-house datasets. However, you should still report the results of your system using only the official challenge data, so that enough information is available to understand where the performance gains obtained by your system come from. The results obtained using external data will not be taken into account in the final WER ranking of all systems.
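As an example of the automatic refinement of start and end times mentioned above, the following minimal sketch tightens the annotated boundaries with a simple energy-based speech activity detector (the function name, window sizes, and threshold are illustrative assumptions; any fully automatic method is acceptable under the rules above):

```python
import numpy as np

def refine_boundaries(x, fs, start, end, margin=0.5,
                      win=0.025, hop=0.010, thresh_db=-35.0):
    """Tighten annotated utterance boundaries (`start`/`end`, in seconds)
    within +/- `margin` seconds, using frame-wise log energy on the mono
    signal `x` sampled at `fs` Hz. All parameter values are illustrative."""
    lo = max(0, int((start - margin) * fs))
    hi = min(len(x), int((end + margin) * fs))
    seg = x[lo:hi].astype(float)
    wlen, hlen = int(win * fs), int(hop * fs)
    if len(seg) <= wlen:
        return start, end  # segment too short: keep the annotation
    # Log energy of each frame, relative to the loudest frame in the segment.
    energy = np.array([np.sum(seg[i:i + wlen] ** 2) + 1e-12
                       for i in range(0, len(seg) - wlen, hlen)])
    log_e = 10.0 * np.log10(energy / energy.max())
    active = np.nonzero(log_e > thresh_db)[0]
    if active.size == 0:
        return start, end  # no frame above threshold: keep the annotation
    new_start = lo / fs + active[0] * hop
    new_end = lo / fs + active[-1] * hop + win
    return new_start, new_end
```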
Can I use a different recogniser or overall system?
Again, you are entirely free in the development of your system.
In particular, you can:
- include a single-channel or multichannel enhancement front-end,
- use other acoustic features,
- modify the acoustic model architecture or the training criterion,
- modify the lexicon and the language model,
- use any rescoring technique.
The results obtained using those modifications will be taken into account in the final ranking of all systems. Note that, depending on the chosen baseline (conventional or end-to-end) and the modifications made, your system will be ranked within either category A or B. If the outputs of the acoustic model remain frame-level tied phonetic targets, the lexicon and language model are unchanged compared to the conventional ASR baseline, and any rescoring techniques are based on this lexicon and this language model (e.g., MBR or a DLM based on acoustic features only), then your system will be ranked within category A. Otherwise, e.g., if you use end-to-end ASR, modify the lexicon or the language model, or use rescoring techniques that implicitly modify the language model (e.g., a DLM based on linguistic features), it will be ranked within category B. In case of doubt, please ask us ahead of the submission deadline.
Which results should I report?
For every tested system, you should report two word error rates (WERs, in %), namely:
- the WER on the development set,
- the WER on the evaluation set.
For instance, here are the WERs (%) achieved by the two ASR baselines.
| Baseline | Development set WER (%) | Evaluation set WER (%) |
|---|---|---|
| Category A | | |
| Category B | | |
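For reference, WER is the number of word substitutions, deletions, and insertions divided by the number of reference words. The following self-contained sketch computes it with a standard Levenshtein alignment over word sequences (illustrative only; it implements the standard definition, not necessarily the exact challenge scoring):

```python
def wer(ref_words, hyp_words):
    """Word error rate in %: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    n, m = len(ref_words), len(hyp_words)
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[n][m] / max(n, 1)

# e.g., wer("the cat sat".split(), "the cat sat down".split())
# -> 33.33...% (one insertion against three reference words)
```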
The experimental comparison of all systems should provide enough information to understand where the performance gains obtained by your best system come from. For instance, if your ASR system can be split into a front end, an acoustic model, and a language model, please report the results obtained when replacing each component by the corresponding baseline component. More generally, if your system is made of multiple blocks, we encourage you to separately evaluate and report the influence of each block on performance.
Ultimately, only the results of each team’s best system on the evaluation set will be taken into account in the final ranking of all systems. The best system is taken to be the one that performs best on the development set.
For that system, you should report 12 WERs: one for every development/evaluation session and every location.
Finally, you will be asked to provide the recognised transcriptions for the development and evaluation data for that system.