KDnuggets Home » News » 2017 » Sep » Opinions, Interviews » Search Millions of Documents for Thousands of Keywords in a Flash ( 17:n34 )

Search Millions of Documents for Thousands of Keywords in a Flash


We present a python library called FlashText that can search or replace keywords / synonyms in documents in O(n) – linear time.



By Vikash Singh, Belong.co.

Say you have a document and you want to know if it talks about python (a term you care for)

-------------------------------------------------------------------------
Document: I am a python developer.
Term: python
-------------------------------------------------------------------------

You want to check if the document contains the word python or not. So you open the document, press ctr+f and search for ‘python’. And you find it :)

Now say you have 100 such terms: [python, java, github, medium, etc.]

You will open the document with a simple python code. Loop through each term, and see if the term is present or not.

-------------------------------------------------------------------------
Open (document) 
for each term in terms: 
    if term is present in document: print(term)
-------------------------------------------------------------------------

Now say you have a 100 documents. Well you can open each document in a loop. Per document you search for each term in the document.

-------------------------------------------------------------------------
for document in documents: 
    Open (document)
    for each term in terms:
        if term is present in document: print(term)
-------------------------------------------------------------------------

Now say java should match Java but not javascript.
Better yet, java should match j2ee and Java both, but not java script.
(j2ee and java are synonyms, and did you notice the space in java script?)

Now it’s getting interesting. How do you do that?

We ran into this problem last year @Belong.co. We noticed that people talk about the same terms in multiple ways. Big apple could be either a big appleor New York. Luckily for us, we had some context. When our documents talk about Python, they 99.99 % of the times mean the programming language, not the animal.

But this didn’t simplify our problem. Java and j2ee are the same thing for us, but not java script. So how to extract this information from millions of documents?

As you can imagine we wrote a regex based code. For 1 million documents and 2K keywords the code took 24 hours to run. And life was good :)

But soon we expanded to multi million documents with 10K+ keywords. And the same code was now going to take 10+ days to run. So we set out to find a better way.

I asked around in my office and Vinay suggested I should take a look at Trie dictionary based approach. Suresh suggested Aho Corasick algorithm. Got similar suggestions on Stack overflow.

Turns out, Aho Corasick algorithm can simultaneously search all keywords in one pass over the document. Now that is something.

Demo of flashtext on a sample input.

I wrote a custom implementation based on Trie data structure to suit our use case. It worked quite well. The keyword extraction process takes 15 mins with this algorithm. Down from 10+ days with the regex based approach.

-------------------------------------------------------------------------
Input: I love j2ee. Keyword: j2ee=>Java 
# Which is basically saying j2ee means Java Output: ['Java']
-------------------------------------------------------------------------

Now Keyword extraction was working well. So I also added the capability to replace keyword with synonyms within the document.

Say you want to replace ‘New Delhi’ with ‘NCR region’ in a document.
Input: I live in New Delhi.
Output: I live in NCR region.

We were able to take advantage of this library in multiple projects. That’s when we decided to open source it. So here is a link to the code :) https://github.com/vi3k6i5/flashtext

It’s really simple to use: [Python code coming up]

-------------------------------------------------------------------------
$ pip install flashtext
>>> from flashtext.keyword import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('Python')
>>> keyword_processor.extract_keywords('I work on python and j2ee')
# output: ['Python', 'Java']

And keyword replacement:

>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> keyword_processor.replace_keywords('I live in New Delhi.')
# output: 'I live in NCR region.'
-------------------------------------------------------------------------

This is really useful because it helps in term expansion. Say you want to replace RC car as Remote Control car in product catalogue. Or say you want to extract Electrocardiogram as ECG. Both are easily doable.


If you know someone who works on Entity recognition or NER or NLP or Word2vec, please share this blog with them. This library has been really useful for us in these areas. I am sure it would be useful to others also.

Cheers :)

Original. Reposted with permission.

Bio: Vikash Singh is a data scientist at belong.co, dealing with large volumes of text and multiple projects based on word embeddings.

Related: