Important Papers:
- Connectionist Temporal Classification (the book)
- But I found Graves's thesis easier to follow; e.g., the definitions of alpha and beta in the book didn't make sense to me.
- A few of Alex Graves's papers (here, here, here)
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Baidu's CTC-based production system)
- Flat Start Training of CD-CTC-SMBR LSTM RNN Acoustic Models
- A very good explanation of the math by Andrew Gibiansky: http://andrew.gibiansky.com/blog/machine-learning/speech-recognition-neural-networks/
- A comprehensive explanation of CTC on Distill.
- Attention-based seq2seq models:
- End-to-End Attention-Based Large Vocabulary Speech Recognition
- Work from Bengio’s group
- Listen, Attend and Spell by William Chan (his thesis)
- A very good presentation by Markus Nussbaum-Thom.
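Since the definitions of alpha and beta tripped me up, here is a minimal sketch of the CTC forward (alpha) recursion from Graves et al. (2006), written in plain numpy with raw probabilities rather than log-space for readability. The blank index and variable names are my own choices, not from any of the papers above:

```python
import numpy as np

BLANK = 0  # index of the blank symbol (assumption: class 0 is blank)

def ctc_forward(y, labels):
    """Forward (alpha) recursion of CTC.

    y      : (T, K) array of frame-level softmax outputs, y[t, k] = p(k | frame t)
    labels : target label sequence without blanks, e.g. [1, 2]
    Returns p(labels | y), summing over all alignments that collapse to labels.
    """
    # Extended sequence l' = blank, l1, blank, l2, ..., blank  (length 2L+1)
    ext = [BLANK]
    for l in labels:
        ext += [l, BLANK]
    T, S = y.shape[0], len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, BLANK]   # paths may start with a blank...
    alpha[0, 1] = y[0, ext[1]]  # ...or directly with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                      # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]             # advance one position
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]             # skip the blank between
                                                     # two distinct labels
            alpha[t, s] = a * y[t, ext[s]]

    # Valid paths end in the last label or the trailing blank.
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]
```

The beta recursion is the mirror image (run backwards from the end of the sequence), and the product alpha * beta at each (t, s) gives the posterior used in the gradient. Real implementations (e.g. warp-ctc) work in log-space to avoid underflow.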
Unsorted:
- http://www.isca-speech.org/archive/Interspeech_2017/pdfs/0233.PDF
- https://arxiv.org/pdf/1707.07167.pdf
- Wav2letter: https://www.openreview.net/pdf?id=BkUDvt5gg
- http://publications.idiap.ch/downloads/papers/2017/Palaz_THESIS_2016.pdf
- http://ttic.uchicago.edu/~llu/pdf/liang_ttic_slides.pdf
- https://arxiv.org/pdf/1709.07814.pdf
Important Implementations:
- EESEN: https://github.com/srvk/eesen
- Stanford-CTC: https://github.com/amaas/stanford-ctc
- Warp-CTC from Baidu: https://github.com/baidu-research/warp-ctc
- Mozilla’s implementation
- Neon’s implementation
- Kaldi-ctc
- https://github.com/zzw922cn/Automatic_Speech_Recognition
For reference, here are some papers on the hybrid approach:
- Acoustic Modeling using Deep Belief Networks
- http://www.cs.toronto.edu/~asamir/papers/SPM_DNN_12.pdf
- A good Kaldi tutorial.