- Connectionist Temporal Classification <- the book
- But I found Graves's thesis easier to follow; e.g., the definitions of alpha and beta in the book don't make sense to me.
- A few of Alex Graves's papers (here, here, here).
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Baidu's production system, based on CTC)
- Flat Start Training of CD-CTC-SMBR LSTM RNN Acoustic Models
- A very good explanation of the math by Andrew Gibiansky: http://andrew.gibiansky.com/blog/machine-learning/speech-recognition-neural-networks/
- A comprehensive explanation of CTC on Distill.
- Attention-based seq2seq model:
- End-to-End Attention-Based Large Vocabulary Speech Recognition
- Work from Bengio's group
- Listen, Attend and Spell by William Chan (his thesis)
- Very good presentation by Markus Nussbaum-Thom.
- Wav2Letter: https://www.openreview.net/pdf?id=BkUDvt5gg
- EESEN: https://github.com/srvk/eesen
- Stanford-CTC: https://github.com/amaas/stanford-ctc
- Warp-CTC from Baidu: https://github.com/baidu-research/warp-ctc
- Mozilla's implementation
- Neon's implementation
For reference, here are some papers on the hybrid approach: