Say you have a document and you want to know if it talks about python (a term you care for)
-------------------------------------------------------------------------
Document: I am a python developer.
Term: python
-------------------------------------------------------------------------
You want to check if the document contains the word python or not. So you open the document, press ctr+f and search for ‘python’. And you find it :)
Now say you have 100 such terms: [python, java, github, medium, etc.]
You will open the document with a simple python code. Loop through each term, and see if the term is present or not.
Now say java should match Java but not javascript.
Better yet, java should match j2ee and Java both, but not java script. (j2ee and java are synonyms, and did you notice the space in java script?)
Now it’s getting interesting. How do you do that?
We ran into this problem last year @Belong.co. We noticed that people talk about the same terms in multiple ways. Big apple could be either a big appleor New York. Luckily for us, we had some context. When our documents talk about Python, they 99.99 % of the times mean the programming language, not the animal.
But this didn’t simplify our problem. Java and j2ee are the same thing for us, but not java script. So how to extract this information from millions of documents?
As you can imagine we wrote a regex based code. For 1 million documents and 2K keywords the code took 24 hours to run. And life was good :)
But soon we expanded to multi million documents with 10K+ keywords. And the same code was now going to take 10+ days to run. So we set out to find a better way.
Turns out, Aho Corasick algorithm can simultaneously search all keywords in one pass over the document. Now that is something.
Demo of flashtext on a sample input.
I wrote a custom implementation based on Trie data structure to suit our use case. It worked quite well. The keyword extraction process takes 15 mins with this algorithm. Down from 10+ days with the regex based approach.
Now Keyword extraction was working well. So I also added the capability to replace keyword with synonyms within the document.
Say you want to replace ‘New Delhi’ with ‘NCR region’ in a document.
Input: I live in New Delhi.
Output: I live in NCR region.
We were able to take advantage of this library in multiple projects. That’s when we decided to open source it. So here is a link to the code :) https://github.com/vi3k6i5/flashtext
It’s really simple to use: [Python code coming up]
>>> from flashtext.keyword import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('Python')
>>> keyword_processor.extract_keywords('I work on python and j2ee')
# output: ['Python', 'Java']
And keyword replacement:
>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> keyword_processor.replace_keywords('I live in New Delhi.')
# output: 'I live in NCR region.'
This is really useful because it helps in term expansion. Say you want to replace RC caras Remote Control carin product catalogue. Or say you want to extract Electrocardiogram as ECG. Both are easily doable.
If you know someone who works on Entity recognition or NER or NLP or Word2vec, please share this blog with them. This library has been really useful for us in these areas. I am sure it would be useful to others also.