stylometric analysis – The Grand Janitor Blog V3

Just read a story about how machine learning can be used to identify anonymous hackers or crackers. I am actually not too surprised by that capability. According to the article, the accuracy is currently around 66% to 80%, which I take to be the detection rate. There is a 5,000-word minimum, which is an attempt to make the word distribution estimate robust. (A backoff strategy comes to mind immediately...)
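To make the backoff idea concrete, here is a minimal sketch of one simple variant, usually called "stupid backoff": use the bigram relative frequency when the word pair has been seen, and otherwise fall back to a discounted unigram frequency. The article does not say which features or estimator the researchers actually used, so the function names, the 0.4 discount, and the toy sentence are all assumptions made purely for illustration.

```python
from collections import Counter


def train(tokens):
    """Count unigrams and adjacent-word bigrams from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def backoff_score(prev, word, unigrams, bigrams, alpha=0.4):
    """Score `word` following `prev`, backing off to unigrams if needed.

    The result is a relative-frequency score, not a normalized
    probability. `alpha` is an assumed discount for the backed-off case.
    """
    if bigrams[(prev, word)] > 0:
        # Seen bigram: use its relative frequency directly.
        return bigrams[(prev, word)] / unigrams[prev]
    # Unseen bigram: fall back to a discounted unigram frequency.
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total


# Toy text; a real stylometric model would need far more data.
tokens = "the cat sat on the mat".split()
unigrams, bigrams = train(tokens)

seen = backoff_score("the", "cat", unigrams, bigrams)    # bigram was observed
unseen = backoff_score("cat", "the", unigrams, bigrams)  # bigram never observed
print(seen, unseen)
```

With very little text, most word pairs are unseen, so nearly every score comes from the backed-off branch. That is one way to see why a minimum word count matters for a robust estimate.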

The final limitation of this method is that the text has to be in English. That... I don't think it's such a big deal. No one can communicate effectively if they mix multiple languages in their text. If they do, it probably means the mixture of languages is a kind of language in itself.

Thinking deeper, this will probably prompt intelligent hackers to speak less in public forums. They will instead use a secure channel to establish a connection for obtaining information.

Will hackers be deterred by this new method? I guess not. Many in the hacker community are well aware that the latest machine learning methods can detect their presence. For example, the use of IRC is already one particular signature that can be detected on a network.

In any case, the topic of how NLP can be applied to infosec always fascinates me. I hope I can work on it someday.

Arthur