Interspeech 2015

Interspeech 2015 was held at the ICD in Dresden, Germany. I attended because I had two papers published there. It was my first time at an international conference. The first part of this post is mostly photos, all taken with an LG Flex2 and compressed with TinyPNG, which saved about 78% of the space. The second part contains my notes on some of the papers.


Airplane in Shanghai Pudong:
Foggy Moscow:


In front of the church:

Interspeech 2015

Our group:
Photo with Tomas Mikolov:
Standup banquet:

Some papers

Neural Network

Most of these papers focus on novel architectures:

Long Short-Term Memory based Convolutional Recurrent Neural Networks for Large Vocabulary Speech Recognition

  • important work that combines a CNN with an LSTM
  • considerable improvement on ASR tasks, but the corpus (HKUST Mandarin speech) isn't a standard one

Multi-softmax Deep Neural Network for Semi-Supervised Training

  • it shows that a deep neural network can be seen as two parts:
    a feature extractor (the hidden layers) and a classifier (the output layer)
  • different tasks can share the same hidden layers while using different output layers
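
To make the shared-layers idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the layer sizes, the random untrained weights, and the task names `task_a` and `task_b` are all assumptions for illustration. It only shows how two softmax heads can sit on top of one shared feature extractor.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """Affine layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def random_layer(n_in, n_out, rng):
    """Untrained layer with small random weights and zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

class MultiSoftmaxNet:
    """Shared hidden layers plus one softmax output layer per task."""

    def __init__(self, n_in, n_hidden, head_sizes, seed=0):
        rng = random.Random(seed)
        self.hidden = [random_layer(n_in, n_hidden, rng),
                       random_layer(n_hidden, n_hidden, rng)]
        # head_sizes maps a (hypothetical) task name to its class count.
        self.heads = {task: random_layer(n_hidden, n_out, rng)
                      for task, n_out in head_sizes.items()}

    def features(self, x):
        """Feature extraction: the part shared by every task."""
        for weights, bias in self.hidden:
            x = relu(dense(x, weights, bias))
        return x

    def predict(self, x, task):
        """Classification: the task-specific output layer."""
        weights, bias = self.heads[task]
        return softmax(dense(self.features(x), weights, bias))

net = MultiSoftmaxNet(n_in=4, n_hidden=8,
                      head_sizes={"task_a": 3, "task_b": 5})
x = [0.1, -0.2, 0.3, 0.4]
p_a = net.predict(x, "task_a")  # 3-class posterior
p_b = net.predict(x, "task_b")  # 5-class posterior
```

Both calls run the same `features` code on the input; only the final affine layer and softmax differ per task.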

Transferring Knowledge from a RNN to a DNN

  • it reports experiments with soft targets, as proposed by Hinton
  • the experiments are described in detail
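
As a rough illustration of Hinton-style knowledge transfer, the sketch below trains nothing; it only computes the loss a student would minimize against a teacher's softened output distribution (usually called soft targets). The logits, the temperature `T = 2.0`, and the three-class setup are made-up values, and the paper's actual RNN/DNN configuration may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values flatten the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) for two discrete distributions."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)

# Hypothetical outputs for one frame (3 classes), before the softmax.
teacher_logits = [4.0, 1.0, 0.5]   # e.g. from the RNN teacher
student_logits = [2.5, 1.5, 0.2]   # e.g. from the DNN student
T = 2.0                            # assumed temperature

# Soft targets keep the teacher's relative confidence in the wrong classes.
soft_targets = softmax(teacher_logits, T)
student_probs = softmax(student_logits, T)
soft_loss = cross_entropy(soft_targets, student_probs)

# For comparison: the usual loss against a one-hot (hard) label.
hard_targets = [1.0, 0.0, 0.0]
hard_loss = cross_entropy(hard_targets, softmax(student_logits))
```

The soft loss is smallest when the student reproduces the teacher's whole distribution, not just its top class.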


Denoising autoencoder-based speaker feature restoration for utterances of short duration

  • uses an autoencoder to enhance i-vectors extracted from short audio
  • text information is taken into account; without it, stand-alone performance is poor

Migrating i-Vectors Between Speaker Recognition Systems Using Regression Neural Networks

  • addresses the problem of migrating i-vectors between different i-vector systems (trained on different datasets)
  • quite strangely, a one-layer network gives the best result
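
A one-layer regression network with a linear output is just an affine map between the two i-vector spaces. The toy sketch below fits such a map by batch gradient descent. Everything in it is assumed for illustration: the 3-to-2 dimensions, the synthetic paired vectors, and the learning rate. Real i-vectors have hundreds of dimensions, and the paper's training setup is not reproduced here.

```python
import random

def apply_linear_map(W, b, x):
    """One affine layer with no nonlinearity: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj
            for row, bj in zip(W, b)]

def mean_squared_error(W, b, sources, targets):
    total = 0.0
    for x, y in zip(sources, targets):
        pred = apply_linear_map(W, b, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(sources)

def fit_linear_map(sources, targets, lr=0.3, epochs=500):
    """Fit targets ~ W x + b with batch gradient descent on squared error."""
    n, n_in, n_out = len(sources), len(sources[0]), len(targets[0])
    W = [[0.0] * n_in for _ in range(n_out)]
    b = [0.0] * n_out
    for _ in range(epochs):
        grad_W = [[0.0] * n_in for _ in range(n_out)]
        grad_b = [0.0] * n_out
        for x, y in zip(sources, targets):
            pred = apply_linear_map(W, b, x)
            for j in range(n_out):
                err = pred[j] - y[j]
                grad_b[j] += err / n
                for i in range(n_in):
                    grad_W[j][i] += err * x[i] / n
        for j in range(n_out):
            b[j] -= lr * grad_b[j]
            for i in range(n_in):
                W[j][i] -= lr * grad_W[j][i]
    return W, b

# Synthetic stand-ins for paired i-vectors from a source and a target system.
rng = random.Random(0)
true_W = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]
true_b = [0.1, -0.4]
sources = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
targets = [apply_linear_map(true_W, true_b, x) for x in sources]

initial_error = mean_squared_error([[0.0] * 3, [0.0] * 3], [0.0, 0.0],
                                   sources, targets)
W, b = fit_linear_map(sources, targets)
final_error = mean_squared_error(W, b, sources, targets)
```

Because the synthetic relationship here is exactly linear, one layer is enough by construction; the sketch says nothing about why one layer also won on real data.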

DNN in Speaker Verification

Locally-Connected and Convolutional Neural Networks for Small Footprint Speaker Recognition

  • it shows that, with the same number of parameters, an LCN gives better results
  • they propose that the max operation is better than the mean operation for d-vectors
    • this contrasts with my own experiments; the reason is that different models are used (PLDA/cosine similarity)
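
For reference, the two pooling options can be sketched as follows. A d-vector is obtained by pooling per-frame hidden activations over an utterance. The frame values below are made up, and only cosine scoring is shown (PLDA is left out), so this illustrates the operations rather than reproducing either experiment.

```python
import math

def mean_pool(frames):
    """Average each activation dimension over all frames."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def max_pool(frames):
    """Take the maximum of each activation dimension over all frames."""
    return [max(col) for col in zip(*frames)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical last-hidden-layer activations, one row per frame.
enroll_frames = [[0.2, 0.9, 0.1], [0.4, 0.7, 0.0], [0.3, 0.8, 0.2]]
test_frames = [[0.1, 0.8, 0.3], [0.5, 0.6, 0.1]]

# Verification score under each pooling choice.
score_mean = cosine_similarity(mean_pool(enroll_frames), mean_pool(test_frames))
score_max = cosine_similarity(max_pool(enroll_frames), max_pool(test_frames))
```

Which pooling wins can depend on the scoring back end, so results from a cosine setup and a PLDA setup are not directly comparable.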

A Unified Deep Neural Network for Speaker and Language Recognition

  • it is interesting that a DNN trained on Switchboard data gives rise to large gains
  • bottleneck features show consistent performance improvements over the DNN-based system, which also agrees with the result reported in Multi-softmax Deep Neural Network for Semi-Supervised Training


Representing Nonspeech Audio Signals through Speech Classification Models

  • one of the best student papers
  • it seems novel, but in fact it can be seen as a combination of a mean operation on the input layer and a random forest classifier
  • overall, using classifiers trained on speech data to extract features for nonspeech data is a novel idea

A Universal VAD Based on Jointly Trained Deep Neural Networks

  • one of the best student papers
  • combines VAD with a speech enhancement network
  • joint training

Text-Informed Speech Enhancement with Deep Neural Networks

  • combines TTS with speech enhancement
  • the main drawback is that it assumes the text is given, while in reality we can only use the ASR result