Better yet, java should match both j2ee and Java, but not java script. (j2ee and java are synonyms, and did you notice the space in java script?)
Now it’s getting interesting. How do you do that?
We ran into this problem last year @Belong.co. We noticed that people talk about the same terms in multiple ways. Big apple could be either a big apple or New York. Luckily for us, we had some context. When our documents talk about Python, 99.99% of the time they mean the programming language, not the animal.
But this didn’t simplify our problem. Java and j2ee are the same thing for us, but java script is not. So how do you extract this information from millions of documents?
As you can imagine, we wrote regex-based code. For 1 million documents and 2K keywords, the code took 24 hours to run. And life was good :)
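To make the cost concrete, here is a minimal sketch of what a regex-based pass might look like. It is an illustration, not our production code: the `KEYWORDS` map and the `extract_with_regex` function are hypothetical names made up for this example.

```python
import re

# Hypothetical keyword map: each surface form points to a canonical name.
KEYWORDS = {"java": "Java", "j2ee": "Java", "python": "Python"}

def extract_with_regex(text, keywords):
    """Run one regex search per keyword, so the work grows with len(keywords)."""
    found = []
    for term, canonical in keywords.items():
        # \b keeps "java" from matching inside "javascript".
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(canonical)
    return found

print(extract_with_regex("I know J2EE and Python", KEYWORDS))
```

Every document is scanned once per keyword, so going from 2K to 10K+ keywords multiplies the work accordingly. Note also that `\bjava\b` still matches the first word of java script, so word boundaries alone do not solve the synonym problem.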
But soon we expanded to millions of documents with 10K+ keywords, and the same code was now going to take 10+ days to run. So we set out to find a better way.
Turns out, the Aho-Corasick algorithm can search for all keywords simultaneously in one pass over the document. Now that is something.
I wrote a custom implementation based on the trie data structure to suit our use case. It worked quite well. The keyword extraction process takes 15 minutes with this algorithm, down from 10+ days with the regex-based approach.
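The idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual implementation: it is a plain trie scan that prefers the longest match at word boundaries, without the failure links of a full Aho-Corasick automaton. The names `build_trie`, `extract_keywords`, and the `"_end"` marker are assumptions made for this example.

```python
def build_trie(keywords):
    """Build a character trie; the '_end' key marks a complete keyword
    and stores its canonical name."""
    root = {}
    for term, canonical in keywords.items():
        node = root
        for ch in term.lower():
            node = node.setdefault(ch, {})
        node["_end"] = canonical
    return root

def extract_keywords(text, trie):
    """Scan the text left to right, keeping the longest keyword that
    starts and ends at a word boundary."""
    text_l = text.lower()
    found = []
    i, n = 0, len(text_l)
    while i < n:
        # Only start a match at the beginning of a word.
        if i > 0 and text_l[i - 1].isalnum():
            i += 1
            continue
        node, j, best = trie, i, None
        while j < n and text_l[j] in node:
            node = node[text_l[j]]
            j += 1
            # Accept only if the keyword ends at a word boundary.
            if "_end" in node and (j == n or not text_l[j].isalnum()):
                best = (node["_end"], j)
        if best:
            found.append(best[0])
            i = best[1]  # jump past the matched keyword
        else:
            i += 1
    return found

keywords = {"java": "Java", "j2ee": "Java",
            "java script": "JavaScript", "python": "Python"}
trie = build_trie(keywords)
print(extract_keywords("I like Java, J2EE and java script.", trie))
```

The key point is that the number of keywords affects only the size of the trie, not the number of passes over the document. Because the longest match wins, java script is reported as JavaScript instead of being mistaken for Java.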
This approach is really useful because it also helps with term expansion. Say you want to replace RC car with Remote Control car in a product catalogue. Or say you want to extract Electrocardiogram as ECG. Both are easily doable.
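Replacement can reuse the same trie idea: instead of collecting matches, write the replacement text into the output. The sketch below is a rough illustration under the same assumptions as before (longest match, word boundaries, case-insensitive); `replace_keywords` and the sample mapping are hypothetical names for this example.

```python
def replace_keywords(text, mapping):
    """Replace each keyword with its mapped form in one left-to-right scan."""
    # Build a character trie; the "_end" key stores the replacement text.
    trie = {}
    for term, replacement in mapping.items():
        node = trie
        for ch in term.lower():
            node = node.setdefault(ch, {})
        node["_end"] = replacement

    out, i, n = [], 0, len(text)
    while i < n:
        # A match may only begin at the start of a word.
        at_word_start = i == 0 or not text[i - 1].isalnum()
        node, j, best = trie, i, None
        while at_word_start and j < n and text[j].lower() in node:
            node = node[text[j].lower()]
            j += 1
            # Keep the longest keyword that ends at a word boundary.
            if "_end" in node and (j == n or not text[j].isalnum()):
                best = (node["_end"], j)
        if best:
            out.append(best[0])
            i = best[1]
        else:
            out.append(text[i])  # no keyword here, copy the character
            i += 1
    return "".join(out)

mapping = {"RC car": "Remote Control car", "Electrocardiogram": "ECG"}
print(replace_keywords("Toy RC car for sale", mapping))
```

Text that matches no keyword is copied through unchanged, so the output differs from the input only where a whole keyword was found.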
If you know someone who works on Entity recognition or NER or NLP or Word2vec, please share this blog with them. This library has been really useful for us in these areas. I am sure it would be useful to others also.