The Grand Janitor Blog V3 – Page 22 – Speech Recognition, Artificial Intelligence, and Random Musing of Arthur Chan

Wednesday Speech-related Links

Post author By grandjanitor
Post date April 4, 2013

CMUSphinx:

CMUSphinx on Kindle Touch (cmusphinx.org Yay!)

Business

Why Carl Icahn’s Buying a Stake in Nuance

This is indeed a big development for the ASR industry, because it makes for a rather constant revenue stream compared to sales of software or professional services.

Arthur

Sphinx on Kindle

http://cmusphinx.sourceforge.net/2013/03/speech-recognition-on-kindle-touch-with-cmusphinx/

Love the current trend that Sphinx is everywhere.

C++ sphinx3 sphinxbase sphinxtrain

Grand Janitor’s Blog February and March Summary

Post author By grandjanitor
Post date April 2, 2013

I wasn’t very productive in blogging over the last two months. Here are a couple of worthy blog posts and news items you might find interesting.

A look on Sphinx3’s initialization

sphinxbase 0.8 and SphinxTrain 1.08

Landscape of Open Source Speech Recognition Software (Part II)

Good ASR Training System

C++ vs C

GJB also reached the milestone of 100 posts. Thanks for your support!

Newsworthy:

Google Buys Neural Net Startup, Boosting Its Speech Recognition, Computer Vision Chops

Future Windows Phone speech recognition revealed in leaked video

Google Keep

Feel free to connect with me on Plus, LinkedIn and Twitter.

Arthur

DNS Dragon dragon TV Python siri

GJB Wednesday Speech-related Links/Commentaries (DragonTV, Siri vs Xiao i Robot, Coding with Voice)

Post author By grandjanitor
Post date March 27, 2013

Apple Appears In Court In China To Defend Against Siri Patent Infringement Claim (Techcrunch)

ZhiZhen (智臻網絡科技), a company from Shanghai, is suing Apple for infringing its patents. (The original Shanghai Daily article) According to the news, ZhiZhen had already developed the engine for Xiao i Robot (小i機械人) back in 2006. There is a video from 8 months ago (as below).

Technically, it is quite possible that a Siri-like system could have been built in 2006. (Take a look at Olympus/Ravenclaw.) Of course, the Siri-like interface you see here was certainly built after the advent of the smartphone (which, by my definition, means after the iPhone was released). So overall, it’s a bit hard to say who is right.

Of course, when interpreting news from China, it’s tempting to use slightly different logic. In the TC article, the OP (Etherington) suggested that the whole lawsuit could be state-orchestrated. It could be related to Beijing’s recent attack on Apple.

I don’t really buy the OP’s argument. Apple is constantly sued in China (and all over the world). It is hard to link the two events together.

Dragon TV brings speech recognition to Panasonic’s 2013 Smart TVs (DigitalVersus)

This is definitely not the Siri for TV.

Oh well, Siri is not just speech recognition. There is also smart interpretation at the sentence level: scheduling, making appointments, doing the right search. Those by themselves are challenges. In fact, I believe Nuance only provides the ASR engine for Apple. (Can’t find the link, I read it from Matthew Siegler.)

In the TV scenario, what annoys users most is probably switching channels and searching for programs. If I built a TV, I would also eliminate any set-top boxes. (So cable companies would hate me a lot.)

Looking at the technology profiles of all the big companies, Apple seems to own all the technologies needed. It also takes quite a lot of design (with taste) to realize such a device.

Using Python to code by Voice

Here is an interesting look at how ASR can be used in coding. Some notes/highlights:

The speaker, Tavis Rudd, had RSI 2 years ago. After a climbing accident, he decided to code using voice instead. Now that his RSI has recovered, he claims he still uses voice 40-60% of the time.
2000 voice commands, which are not necessarily English words. The speaker used Dragonfly to control Emacs on Windows.
How do variables work? It turns out most variables are actually English phrases. There are specific commands to get these phrases delimited by different characters.
The speaker said “it’s not very hard” for others to repeat. I believe there will be some amount of customization. It took him around 3 months. That’s pretty much how much time a solution engineer needs to tune an ASR system.
The best language to program in by voice: Lisp.
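
To make the point about variables concrete, here is a minimal sketch of how a dictated English phrase could be turned into an identifier with different delimiters. The function name `join_phrase` and the style names are my own inventions for illustration. This is not Rudd's command set and it does not use the Dragonfly API; it only shows the text transformation that such a command would have to do after recognition.

```python
def join_phrase(phrase: str, style: str) -> str:
    """Turn a dictated English phrase into an identifier.

    The style names ("snake", "dash", "camel", "pascal") are hypothetical
    labels chosen for this sketch, not commands from the talk.
    """
    words = phrase.lower().split()
    if not words:
        raise ValueError("empty phrase")
    if style == "snake":
        # words delimited by underscores
        return "_".join(words)
    if style == "dash":
        # words delimited by hyphens, as is common in Lisp
        return "-".join(words)
    if style == "camel":
        # first word lower case, the rest capitalized, no delimiter
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if style == "pascal":
        # every word capitalized, no delimiter
        return "".join(w.capitalize() for w in words)
    raise ValueError(f"unknown style: {style}")
```

With this sketch, the phrase "acoustic model path" becomes `acoustic_model_path` in the "snake" style and `acousticModelPath` in the "camel" style.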

One more thing. Rudd also believes it will be very tough to do the same thing with CMUSphinx.

Ah…… models, models, models.

Earlier on Grand Janitor’s Blog

Some quick notes on what a “Good training system” should look like: (link).

GJB reaches the 100th post! (link)

Arthur

Uncategorized

Tuesday’s Links (Meetings and more)

Post author By grandjanitor
Post date March 26, 2013

Geeky:

Is Depression Really Biochemical (AssertTrue)

Meetings are Mutexes (Vivek Haldar)

So true. It doesn’t count all the time you spend preparing for a meeting.

Exhaustive Testing is Not a Proof of Correctness

True, but hey. Writing regression tests is never a bad thing. If you rely only on your brain for testing, it is bound to fail one way or another.

Apple:

Apple’s iPhone 5 debuts on T-Mobile April 12 with $99 upfront payment plan
iWatchHumor (DogHouseDiaries)

Yahoo:

Yahoo The Marissa Mayer Turnaround

Out of all the commentaries on Marissa Mayer’s reign, I think Jean-Louis Gassée’s goes straight to the point, and it is the one I agree with most. You cannot use a one-size-fits-all policy. So WFH is not always appropriate either.

Management:

The Management-free Organization

sphinxtrain Thought training training scripts

Good ASR Training System

The term “speech recognition” is a misnomer.

Why do I say that? I have explained this point in an old article, “Do We Have True Open Source Dictation?”, which I wrote back in 2005. To recap, a speech recognition system consists of a Viterbi decoder, an acoustic model and a language model. You could have a great recognizer but bad accuracy if the models are bad.

So how does that relate to you, a developer/researcher of ASR? The answer is that ASR training tools and processes usually become a core asset in your inventory. In fact, I can tell you that when I need to work on acoustic model training, I need to spend full time on it, and it’s one of the most absorbing things I have done.

Why is that? When you look at the development cycles of all the tasks in making an ASR system, training is the longest. With the wrong tool, it is also the most error-prone. As an example, just take a look at the Sphinx forum. You will find that the majority of non-Sphinx4 questions are related to training, like “I can’t find the path of a certain file” or “the whole thing just got stuck in the middle”.

Many first-time users complain with frustration (and occasionally disgust) about why it is so difficult to train a model. The frustration probably stems from the perception that “Shouldn’t it be well-defined?” The answer is again no. In fact, how a model should be built (or even which model should be built) is always subject to change. It’s also one of the two subfields in ASR, at least IMO, which are still creative and exciting in research. (The other one: noisy speech recognition.) What an open source software suite like Sphinx provides is a standard recipe for everyone.

That said, is there something we can do better for an ASR training system? There is a lot, I would say. Here are some suggestions:

A training experiment should be created, moved and copied with ease,
A training experiment should be exactly repeatable given the input is exactly the same,
The experimenter should be able to verify the correctness of an experiment before an experiment starts.

Ease of Creation of an Experiment

You can think of a training experiment as a recipe …… but not exactly. When we read a recipe and implement it again, we humans make mistakes.

But hey! We are working with computers. Why do we need to fix small things in the recipe at all? So in a computer experiment, what we are shooting for is an experiment which can be easily created and moved around.

What does that mean? It basically means there should be no executables which are hardwired to one particular environment. There should also be no hardware/architecture assumptions in the training implementations. If there are any, they should be hidden.
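
One way to picture "nothing hardwired" is an experiment whose paths are all stored relative to its own directory. The sketch below is only an illustration under my own assumptions: the file name `experiment.json`, its `paths` key and the function `load_experiment` are hypothetical, and none of this is how SphinxTrain actually stores its setup.

```python
import json
from pathlib import Path


def load_experiment(exp_dir):
    """Read a hypothetical experiment.json and resolve its paths.

    Every path in the file must be relative, so that the whole directory
    can be copied or moved to another machine and still work.
    """
    exp_dir = Path(exp_dir).resolve()
    config = json.loads((exp_dir / "experiment.json").read_text())
    resolved = {}
    for name, rel in config["paths"].items():
        p = Path(rel)
        if p.is_absolute():
            # An absolute path ties the experiment to one environment.
            raise ValueError(f"{name}: absolute path {rel!r} is not allowed")
        resolved[name] = exp_dir / p
    return resolved
```

Because everything is resolved against the experiment directory at load time, copying the directory gives you a second experiment whose paths point into the copy, with no hand editing of the recipe.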

Repeatability of an Experiment

Similar to the previous point, should we allow differences when running a training experiment? The answer should be no. So one trick you hear from experienced experimenters is that you should keep the seed of your random number generators. This avoids minute differences between different runs of an experiment.
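
The seed trick can be sketched in a few lines. The function `initial_means` below is a toy stand-in I made up for any step of training that draws random numbers (random initialization, for example). It is not taken from any Sphinx tool.

```python
import random


def initial_means(num_gaussians, dim, seed):
    """Draw toy initial mean vectors from a generator with a recorded seed."""
    # A private generator, so that other code calling the global
    # random functions cannot change what this experiment draws.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_gaussians)]
```

Calling `initial_means(4, 3, seed=1234)` twice gives exactly the same numbers, which is what makes the run repeatable as long as the seed is stored together with the rest of the experiment settings.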

Here someone might ask: shouldn’t we allow a small difference between experiments? We are essentially running a physical experiment.

I think that’s a valid approach. But to be conscientious, you would then want to run a given experiment many times to calculate an average. That is my problem with this way of thinking: it is slower to repeat an experiment. For example, what if you see that your experiment has a 1% absolute drop? Do you look into it? Or do you just chalk it up to noise? Once you allow yourself to not repeat an experiment exactly, there are tons of questions you have to ask.
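
To show what "run it many times and average" costs you, here is a rough sketch. The two-standard-deviation rule and the function name `looks_like_noise` are my own assumptions for illustration, and a rule like this is a crude check rather than a proper significance test.

```python
from statistics import mean, stdev


def looks_like_noise(baseline_wers, new_wer, num_sigmas=2.0):
    """Crude check: is new_wer within num_sigmas of the repeated baseline runs?"""
    mu = mean(baseline_wers)
    sigma = stdev(baseline_wers)  # sample standard deviation, needs >= 2 runs
    return abs(new_wer - mu) <= num_sigmas * sigma
```

Notice that you need several baseline runs before you can even ask the question. With made-up word error rates of 20.1, 20.4, 19.8, 20.3 and 20.0 from five non-repeatable runs, a new result of 21.1 falls outside the band and a result of 20.5 falls inside it. With exactly repeatable runs, none of this bookkeeping is needed.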

Verifiability of an Experiment

Running an experiment sometimes takes days. How do you make sure that it is running correctly? I would say you should first make sure that trivial issues such as missing paths, missing models, or incorrect settings are screened out and corrected.

One of my bosses used to make a strong point of asking me to verify input paths every single time. This is a good habit and it pays dividends. Can we do similar things in our training systems?
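
A training system could do this kind of check for us before any real computation starts. The following is a minimal sketch under my own assumptions: the function name `preflight` and the idea of passing a name-to-path mapping are hypothetical, not an existing SphinxTrain feature.

```python
from pathlib import Path


def preflight(required_paths):
    """Return a list of problems; an empty list means the inputs look sane."""
    problems = []
    for name, path in required_paths.items():
        p = Path(path)
        if not p.exists():
            problems.append(f"{name}: missing ({p})")
        elif p.is_file() and p.stat().st_size == 0:
            # An empty dictionary or file list usually means an earlier
            # step failed silently.
            problems.append(f"{name}: empty file ({p})")
    return problems
```

The idea is to call it with every input the experiment will touch and to refuse to start if the returned list is not empty. It takes seconds, while finding the same missing file on the second day of a run costs two days.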

Apply it on Open Source

What I mentioned above is highly influenced by my experience in the field. I have personally found that sites which have great infrastructure for transferring experiments between developers are the strongest and fastest growing.

To put all these ideas into open source would mean a very different development paradigm. For example, do we want to have a centralized experiment database which everyone shares? Do we want to put common resources such as existing parameterized inputs (such as MFCC) somewhere in common for everyone? Should we integrate the retrieval of these inputs into our experiment recipe?

Those are important questions. In a way, I think they are the most important type of question we should ask in open source, because despite much volunteer effort, the performance of open source models still lags behind that of commercial models. I believe it is an issue of methodology.

Arthur

Apple

Monday’s Links (Brain-Computer Interface, Apple and more)

Post author By grandjanitor
Post date March 25, 2013

Geeky:

How to Write Six Important Papers a Year without Breaking a Sweat: The Deep Immersion Approach to Deep Work
It’s Like They’re Reading My Mind (Slate)

Apple:

Apple Buys Indoor Mapping Company WifiSLAM (LA Times)
How Apple Invites Facile Analysis (Business Insiders)
So long, break-even (Horace Dediu)

After big channels picked up Richards’ story:

Startups have a sexism problem

Fun:

R2-D2 Day …… for real!

DNN EnglishCentral Nexiwave Voci wfst

The 100th Post: Why The Grand Janitor’s Blog?

Post author By grandjanitor
Post date March 23, 2013

Since I decided to revamp The Grand Janitor’s Blog last December, it has been 100 posts. (I cheat a bit, so “not since then”.)

It’s funny to describe time with the number of articles you write. In blogging though, that makes complete sense.

I have started several blogs in the past. Only 2 of them survive (Cumulomanic and “Start-Up Employees 333 weeks”, both in Chinese). When you cannot maintain your blog for more than 50 posts, your blog just dies, or simply disappears into oblivion.

Yet I made it. So here’s an important question to ask: what makes me keep on?

I believe the answer is very simple. There are no bloggers so far who work on the niche of speech recognition: none on automatic speech recognition (ASR) systems, even though there has been much progress. None on engines, even though much work has been done in open source. None on applications, even though great projects such as Simon are out there.

Nor is there discussion on how open source speech recognition can be applied to the commercial world, even though dozens of companies are now based on Sphinx (e.g. my employer Voci, EnglishCentral and Nexiwave), and they are filling the startup space.

How about how the latest technologies such as deep neural networks (DNN) and weighted finite state transducers (WFST) would affect us? I can see them in academic conferences, journals or sometimes tradeshows…… but not in a blog.

But blogging, as we all know, is probably the most prominent way people get news these days.

And news about speech recognition, once you understand it, is fascinating.

The only blog which comes close is Nickolay’s blog: nsh. When I was trying to recover as a speech recognition programmer, nsh was a great help. So thank you, Nick, thank you.

But there is only one nsh. There are still a lot of fascinating things to talk about…… Right?

So that is probably the reason why I keep on working: I want to build something I want myself, a kind of information hub on speech recognition technology, commercial/open source, applications/engines, theory/implementations, the ideals/the realities.

I want to bring my unique perspective: I was in academia, in industrial research and now in the startup world so I know quite well people’s mindsets in each group.

I also want to connect with all of you. We are working on one of the most exciting technologies in the world. Not everyone understands that. It will take time for all of us to explain to our friends and families what speech recognition can really do and why it matters.

In any case, I hope you enjoy this blog. Feel free to connect with me on Plus, LinkedIn and Twitter.

Arthur

C++

C++ vs C

I have been mainly a C programmer. Because of work though, I have been working with many codebases which are written in C++.

Many programmers will tell you C++ is a necessary evil. I agree. Using C to emulate object-oriented features such as polymorphism, inheritance or even the idea of objects is not easy. It also easily confuses novice programmers.

So why does C++ frustrate many programmers then? I guess my major complaint is that its standard has been evolving and many compilers cannot catch up with the latest version.

For example, it’s very hard for gcc 4.7 to compile code which can be compiled by gcc 4.2. Chances are that some of the language features used are outdated and will generate compiler errors.

On the other hand, C exhibits much greater stability across compilers. If you look at the C portion of the triplet (PocketSphinx, SphinxTrain, Sphinxbase), i.e. 99% of the code, most of it just compiles across different generations of gcc. This makes things easier to maintain.

Arthur