Text Clustering: Quick Insights from Unstructured Data, Part 2

We will build this in a modular way and expose the functionality as an API, so that it can serve as a plug-and-play model without disrupting existing systems.



By Vivek Kalyanarangan.

Text Clustering

This two-part series explores text clustering and how to get insights from unstructured data. The aim is a tool that is powerful and robust enough for industrial use. The first part covers the motivation; the second part covers the implementation.

This post is the second part of the series. We will build the tool in a modular way so that it can be applied to any dataset. We will also expose the functionality as an API, so that it can serve as a plug-and-play model without disrupting existing systems.

If you are in a hurry, you can find the full code for the project on my GitHub page.

Here is a sneak peek of what the final output will look like:

Text Clustering API

Installations

  • Anaconda distribution of Python 2.7 – download from here
  • Flask Python package – after installing Anaconda, go to the command prompt and type
    pip install flask
  • Flasgger Python package – after installing Anaconda and Flask, go to the command prompt and type
    pip install flasgger

The tools are now ready. Download the code from here to get started with the setup.

Running

Unzip the contents, open the command prompt, and type

python CLAAS_public.py

This starts a server, and you can now access the tool at http://localhost:8180/apidocs/index.html
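To give a feel for what sits behind such a page, here is a minimal sketch of a Flask endpoint that accepts a CSV upload and a column name. It is not the code from CLAAS_public.py: the route /inspect and the form field names dataset and col are hypothetical, chosen only for illustration, and the sketch is written for Python 3 with pandas installed.

```python
import io

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)


# Hypothetical route and field names; the real tool may use different ones.
@app.route("/inspect", methods=["POST"])
def inspect_upload():
    uploaded = request.files.get("dataset")  # the uploaded CSV file
    column = request.form.get("col")         # name of the text column
    if uploaded is None or not column:
        return jsonify(error="dataset and col are required"), 400

    # Read the upload into a DataFrame and check that the column exists.
    frame = pd.read_csv(io.BytesIO(uploaded.read()))
    if column not in frame.columns:
        return jsonify(error="column not found: " + column), 400

    return jsonify(rows=int(len(frame)), column=column)
```

Calling app.run(port=8180) would serve this sketch locally on the same port the tool uses. A real endpoint would run the clustering on the uploaded data instead of only reporting the row count.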

Workflow

Unguided Clustering

This is where the actual KMeans clustering happens.

  1. It takes a CSV file as input. You also provide the name of the column that contains the unstructured text and the number of clusters
  2. Once you click the “Try it Out” button, the inputs are sent to the API
  3. The API cleans the text, applies TF-IDF vectorization and runs the clustering
  4. Once it’s done, it returns a download link to the output file, which has an additional column appended with the cluster numbers
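The steps above can be sketched with pandas and scikit-learn. This is a rough illustration for Python 3 rather than the project's actual code: the function names clean_text and cluster_texts, the output column name cluster, the cleaning rule (lowercase, letters only) and the English stop-word list are all assumptions.

```python
import re

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text):
    """Lowercase the text and keep only letters (an assumed cleaning rule)."""
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    return re.sub(r"\s+", " ", text).strip()


def cluster_texts(frame, text_column, n_clusters, random_state=0):
    """Return a copy of frame with a 'cluster' column of KMeans labels."""
    cleaned = frame[text_column].fillna("").map(clean_text)
    # Turn each document into a TF-IDF weighted bag-of-words vector.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(cleaned)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    result = frame.copy()
    result["cluster"] = model.fit_predict(vectors)
    return result


# Tiny made-up dataset standing in for the uploaded CSV.
docs = pd.DataFrame({"text": [
    "Patient reports chest pain and shortness of breath",
    "Severe chest pain radiating to the left arm",
    "Invoice overdue, payment reminder sent",
    "Payment received for the overdue invoice",
]})
clustered = cluster_texts(docs, "text", n_clusters=2)
print(clustered)
```

With a real file, pd.read_csv would replace the inline DataFrame, and clustered.to_csv("clustered.csv", index=False) would produce the file behind the download link.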

Guided Clustering

This technique is a little more straightforward.

  1. It takes two files as input: one with the data to be clustered and the other with predefined keywords
  2. It also takes the name of the text column, as in unguided clustering
  3. As output, it adds one column for each keyword given
  4. Each of these columns is TRUE if a document contains that keyword and FALSE if it doesn’t

This shows which keywords are present or absent in each document, and therefore which documents contain signals from the keywords and which do not.
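The keyword flagging can be sketched in a few lines of pandas. Again this is an illustration under assumptions, not the project's code: the function name flag_keywords is hypothetical, the keyword file is assumed to hold one keyword per row in a column named keyword, and matching is a case-insensitive substring test (so "pain" would also match "painful").

```python
import pandas as pd


def flag_keywords(frame, text_column, keywords):
    """Return a copy of frame with one True/False column per keyword."""
    result = frame.copy()
    texts = result[text_column].fillna("").astype(str).str.lower()
    for keyword in keywords:
        # Plain substring match, not a regular expression.
        result[keyword] = texts.str.contains(keyword.lower(), regex=False)
    return result


# Made-up stand-ins for the two input files.
docs = pd.DataFrame({"text": ["Chest pain since morning", "Invoice overdue", None]})
keywords = pd.DataFrame({"keyword": ["pain", "invoice"]})

flagged = flag_keywords(docs, "text", keywords["keyword"])
print(flagged)
```

Rows with missing text are treated as empty documents, so every keyword column is False for them.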

Conclusion

That is all for this two-part series on text clustering. Good enough to get started, right? It was an amazing experience writing this series. See you in the next one. Have fun!

Original. Reposted with permission.

Bio: Vivek Kalyanarangan works as a data scientist looking at problems in the healthcare domain.
