Interspeech 2015 was held at the ICD in Dresden, Germany. I took part in this conference because two of my papers were published there; it was my first time attending an international conference. The first part of this post contains lots of photos, taken with an LG Flex2 and compressed with TinyPNG (saving about 78% of the space). The second part contains my analysis of some papers.
Airport
Airplane at Shanghai Pudong:
Foggy Moscow:
Dresden
In front of the church:
Interspeech 2015
Our group:
Photo with Tomas Mikolov:
Banquet:
Standing banquet:
Presentation:
Some papers
Neural Network
Most of these papers focus on novel architectures:
Long Short-Term Memory based Convolutional Recurrent Neural Networks for Large Vocabulary Speech Recognition
- important work: combines a CNN with an LSTM (see the sketch below)
- considerable results on ASR tasks, though the corpus (HKUST Mandarin speech) is not a standard benchmark
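To make the idea concrete, here is a minimal sketch of a convolutional layer feeding an LSTM in PyTorch. The layer sizes, the 40-dimensional filterbank input, and the 3000-class senone output are my own illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvLSTM(nn.Module):
    """Toy CNN-over-features followed by an LSTM, in the spirit of the paper."""
    def __init__(self, n_mels=40, n_classes=3000):
        super().__init__()
        # 2-D convolution over the (time, frequency) plane of the filterbanks
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),  # pool along frequency only
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 2),
                            hidden_size=512, batch_first=True)
        self.out = nn.Linear(512, n_classes)

    def forward(self, x):           # x: (batch, time, n_mels)
        x = x.unsqueeze(1)          # -> (batch, 1, time, n_mels)
        x = self.conv(x)            # -> (batch, 32, time, n_mels // 2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.lstm(x)
        return self.out(x)          # per-frame senone logits

logits = ConvLSTM()(torch.randn(2, 100, 40))
print(logits.shape)  # torch.Size([2, 100, 3000])
```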
Multi-softmax Deep Neural Network for Semi-Supervised Training
- shows that a deep neural network can be seen as two parts: a feature extractor (the hidden layers) and a classifier (the output layer)
- different tasks can share the same hidden layers while using separate output layers, as in the sketch below
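A minimal PyTorch sketch of this view: a shared trunk with one output layer per task. The layer sizes and the two hypothetical head sizes are illustrative only.

```python
import torch
import torch.nn as nn

class MultiSoftmaxDNN(nn.Module):
    """Shared hidden layers (feature extractor) with one softmax head per task."""
    def __init__(self, n_in=440, n_hidden=1024, head_sizes=(3000, 4000)):
        super().__init__()
        self.shared = nn.Sequential(          # shared by all tasks
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
        )
        # one output (classifier) layer per task
        self.heads = nn.ModuleList(nn.Linear(n_hidden, n) for n in head_sizes)

    def forward(self, x, task):
        return self.heads[task](self.shared(x))   # logits for the chosen task

model = MultiSoftmaxDNN()
x = torch.randn(8, 440)
print(model(x, task=0).shape, model(x, task=1).shape)
```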
Transferring Knowledge from a RNN to a DNN
- gives experiments using the soft targets (knowledge distillation) proposed by Hinton; a sketch of the loss follows this list
- detailed experiments
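A minimal sketch of the distillation loss with Hinton-style soft targets. The temperature T=2 and the 3000-class output are arbitrary choices of mine.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student posteriors."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

loss = distillation_loss(torch.randn(8, 3000), torch.randn(8, 3000))
print(loss.item())
```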
i-Vector
Denoising autoencoder-based speaker feature restoration for utterances of short duration
- uses a denoising autoencoder to enhance i-vectors extracted from short utterances (see the sketch below)
- text information is taken into consideration; without it, the stand-alone performance is poor
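A sketch of the idea, assuming paired short-utterance/full-utterance i-vectors for training. The 400-dimensional i-vectors, the single hidden layer, and the dummy data are my assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical setup: paired i-vectors from short cuts and from the full
# utterance; the autoencoder learns to restore the full-length i-vector.
ivec_dim = 400
dae = nn.Sequential(
    nn.Linear(ivec_dim, 512), nn.Tanh(),
    nn.Linear(512, ivec_dim),
)
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)

short_ivec = torch.randn(64, ivec_dim)   # "noisy" input (short utterance)
full_ivec = torch.randn(64, ivec_dim)    # target (full utterance), dummy data here
opt.zero_grad()
loss = nn.functional.mse_loss(dae(short_ivec), full_ivec)
loss.backward()
opt.step()
```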
Migrating i-Vectors Between Speaker Recognition Systems Using Regression Neural Networks
- addresses the problem of migrating i-vectors between systems trained on different datasets
- it is quite strange that a one-layer network leads to the best result; since a one-layer network is essentially a linear regression, this suggests the mapping between the two systems is close to linear (sketch below)
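Because a one-layer network without a nonlinearity is just a linear map, the migration can be sketched as ordinary least squares. The dimensions and synthetic data below are placeholders.

```python
import numpy as np

# Stand-in data: paired i-vectors from system A and system B for the same
# utterances; a single linear layer (least squares) maps A -> B.
rng = np.random.default_rng(0)
ivecs_a = rng.standard_normal((1000, 400))
ivecs_b = ivecs_a @ rng.standard_normal((400, 400)) * 0.05  # pretend relation

W, *_ = np.linalg.lstsq(ivecs_a, ivecs_b, rcond=None)  # the "one-layer network"
migrated = ivecs_a @ W
print(np.mean((migrated - ivecs_b) ** 2))
```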
DNN in Speaker Verification
Locally-Connected and Convolutional Neural Networks for Small Footprint Speaker Recognition
- shows that, with the same number of parameters, an LCN leads to better results
- proposes that a max operation works better than a mean operation when aggregating frame-level activations into a d-vector (compare the sketch below)
- this contrasts with my own experiments; the likely reason is that we use different scoring backends (PLDA vs. cosine similarity)
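A sketch of the two aggregation options over frame-level activations. The embedding size and the dummy data are assumptions of mine.

```python
import torch

# frame_emb: per-frame activations from the last hidden layer of the speaker
# network, shape (n_frames, emb_dim); dummy data for illustration
frame_emb = torch.randn(200, 256)

d_vector_mean = frame_emb.mean(dim=0)        # classic d-vector averaging
d_vector_max = frame_emb.max(dim=0).values   # max pooling, as the paper prefers

# the scoring backend matters too: cosine similarity vs. PLDA can flip rankings
cos = torch.nn.functional.cosine_similarity(d_vector_mean, d_vector_max, dim=0)
print(cos.item())
```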
A Unified Deep Neural Network for Speaker and Language Recognition
- it is interesting that a DNN trained on Switchboard data gives rise to large gains
- bottleneck features show consistent improvements over the DNN-based system, which is also consistent with the result in "Multi-softmax Deep Neural Network for Semi-Supervised Training" above (see the bottleneck sketch below)
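For reference, a minimal sketch of a bottleneck DNN: a narrow hidden layer whose activations are reused as features downstream. The 64-dimensional bottleneck and the other sizes are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """DNN with a narrow hidden layer; its activations serve as bottleneck features."""
    def __init__(self, n_in=440, n_out=3000, bn_dim=64):
        super().__init__()
        self.front = nn.Sequential(nn.Linear(n_in, 1024), nn.ReLU(),
                                   nn.Linear(1024, bn_dim))   # bottleneck
        self.back = nn.Sequential(nn.ReLU(), nn.Linear(bn_dim, n_out))

    def forward(self, x):
        bn = self.front(x)          # bottleneck features, reusable downstream
        return self.back(bn), bn

logits, bn_feats = BottleneckDNN()(torch.randn(8, 440))
print(bn_feats.shape)  # torch.Size([8, 64])
```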
Others
Representing Nonspeech Audio Signals through Speech Classification Models
- one of the best student papers
- seems novel, but it can in fact be seen as the combination of a mean operation over the input features and a random forest classifier (sketched below)
- still, using classifiers trained on speech data to extract features for nonspeech data is a novel intuition
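A sketch of that reading of the method: mean-pool per-frame speech posteriors into a clip-level feature and feed a random forest. The shapes and random data are stand-ins, and scikit-learn's RandomForestClassifier substitutes for whatever forest the authors actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical: per-frame senone posteriors from a speech DNN for each clip,
# here replaced by random data of shape (n_clips, n_frames, n_senones)
posteriors = rng.random((100, 300, 500))
labels = rng.integers(0, 5, size=100)        # nonspeech event classes

clip_features = posteriors.mean(axis=1)      # mean pooling over frames
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(clip_features, labels)
print(clf.score(clip_features, labels))
```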
A Universal VAD Based on Jointly Trained Deep Neural Networks
- one of the best student papers
- combines a VAD network with a speech enhancement network
- the two networks are trained jointly (see the sketch below)
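A minimal sketch of joint training, assuming an MSE enhancement loss plus a cross-entropy VAD loss computed on the enhanced features. The architecture and losses are my guess at the general scheme, not the paper's exact setup.

```python
import torch
import torch.nn as nn

feat_dim = 257  # e.g. magnitude spectrogram bins
enhance = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                        nn.Linear(512, feat_dim))          # enhancement net
vad = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                    nn.Linear(128, 1))                     # VAD on enhanced frames

noisy = torch.randn(32, feat_dim)              # dummy noisy frames
clean = torch.randn(32, feat_dim)              # dummy clean targets
speech = torch.randint(0, 2, (32, 1)).float()  # dummy speech/non-speech labels

enhanced = enhance(noisy)
loss = (nn.functional.mse_loss(enhanced, clean)            # enhancement loss
        + nn.functional.binary_cross_entropy_with_logits(vad(enhanced), speech))
loss.backward()                                # gradients flow through both nets
```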
Text-Informed Speech Enhancement with Deep Neural Networks
- combines TTS with speech enhancement (one possible conditioning scheme is sketched below)
- the main drawback is that it assumes the text is given, while in reality we can only use the ASR result
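One simple way such text conditioning could look, purely my speculation rather than the paper's architecture: concatenate a reference synthesized from the text with the noisy features before enhancement.

```python
import torch
import torch.nn as nn

feat_dim = 257
# enhancement net conditioned on a (hypothetical) TTS-synthesized clean reference
net = nn.Sequential(nn.Linear(2 * feat_dim, 512), nn.ReLU(),
                    nn.Linear(512, feat_dim))

noisy = torch.randn(32, feat_dim)     # noisy spectrogram frames (dummy)
tts_ref = torch.randn(32, feat_dim)   # frames synthesized from the text (dummy)
enhanced = net(torch.cat([noisy, tts_ref], dim=-1))
print(enhanced.shape)  # torch.Size([32, 257])
```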