If you follow CMUSphinx news, you may have noticed that the Sourceforge guys have started to distribute data through BitTorrent (link).
That's a great move. One of the issues in ASR is the lack of machine power for training. To give a blunt example, it's possible to squeeze out extra performance just by searching for the best training parameters, and each candidate setting means another training run. Not to mention that a lot of modern training techniques take a long time to run.
I do recommend that all of you help the effort. Again, I am not involved at all; I just feel that it is a great cause.
Of course, things in ASR are never easy, so I want to raise two subtle points about the whole distributed approach to training.
Improvement over the years?
The first question you may ask: does that mean ASR can be like a project such as SETI, one that would automatically improve over the years? Not yet. ASR still has its own unique challenges.
The major one, as I see it, is how we can incrementally grow the amount of phonetically-balanced transcribed audio. Note that it is not just audio, but transcribed audio. Meaning: someone needs to listen to the audio, spending 5-10 times real time to write down what it really says, word by word. All these transcriptions then need to be cleaned up and put into a certain format.
This is what Voxforge tries to achieve, and it's not a small undertaking. Of course, compared to the speed of development in the industry, the progress is still too slow. The last time I heard, Google was training their acoustic model with 38000 hours of data. The WSJ corpus is a toy task compared to that.
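To put rough numbers on the transcription cost, here is a back-of-the-envelope sketch in Python. It uses only the two figures mentioned above (5-10 times real time, and 38000 hours of audio). The function names and the 2000 working hours per person-year are my own assumptions for illustration, not anyone's actual estimate.

```python
def transcription_hours(audio_hours, realtime_factor):
    """Human hours needed to transcribe `audio_hours` of audio
    when transcription takes `realtime_factor` times real time."""
    return audio_hours * realtime_factor


def person_years(human_hours, hours_per_year=2000):
    """Convert human hours to person-years.
    ASSUMPTION: about 2000 working hours per person per year."""
    return human_hours / hours_per_year


AUDIO_HOURS = 38000  # the figure quoted above for Google's training data

for factor in (5, 10):  # the 5-10x real-time range quoted above
    effort = transcription_hours(AUDIO_HOURS, factor)
    print(f"{factor}x real time: {effort:,} human hours, "
          f"about {person_years(effort):.0f} person-years")
```

Under these assumptions, fully hand-transcribing that much audio lands somewhere around 95 to 190 person-years of work, which is why the amount of transcribed data, rather than machine power, looks like the harder problem.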
Now, thinking this way, let's say we want to build the best recognizer through open source. What is the bottleneck? I bet the answer doesn't lie in machine power; whether we have enough transcribed data would be the key. So that's something to ponder.
(Added Dec 27, 2012: on the initial amount of data, Nickolay corrected me, saying that the amount of data Sphinx has is already on the order of 10000 hours. That includes "librivox recordings, transcribed podcasts, subtitled videos, real-life call recordings, voicemail messages".
So it does sound like Sphinx has an amount of data that rivals commercial companies. I am very interested to see how we can train an acoustic model with that amount of data.)