Text Clustering: Quick Insights from Unstructured Data, Part 2
We will build this in a modular way and also focus on exposing the functionalities as an API so that it can serve as a plug and play model without any disruptions to the existing systems.
By Vivek Kalyanarangan.
In this two-part series, we explore text clustering and how to get insights from unstructured data, with the aim of building something powerful and industrial strength. The first part covers the motivation. The second part covers the implementation.
This post is the second part of the two-part series on getting insights from unstructured data using text clustering. We will build the tool in a modular way so that it can be applied to any dataset. We will also expose its functionality as an API so that it can serve as a plug-and-play model without disrupting existing systems.
- Text Clustering: How to get quick insights from Unstructured Data – Part 1: The Motivation
- Text Clustering: How to get quick insights from Unstructured Data – Part 2: The Implementation
If you are in a hurry, you can find the full code for the project on my GitHub page.
Here is a sneak peek at what the final output will look like.
Installations
- Anaconda distribution of Python 2.7: download from here
- flask Python package: after installing Anaconda, open a command prompt and type
pip install flask
- flasgger Python package: after installing Anaconda and flask, open a command prompt and type
pip install flasgger
You now have the tools you need. Download the code from here to get started with the setup.
Running
Unzip the contents, open the command prompt and type
python CLAAS_public.py
This starts a server, and you can now access the tool at http://localhost:8180/apidocs/index.html
Workflow
Unguided Clustering
This is where the actual KMeans clustering happens.
- It takes a CSV file as input, along with the name of the column that contains the unstructured text and the number of clusters
- Once you click the “Try it Out” button, the inputs are sent to the API
- The API performs the text cleaning, TF-IDF vectorization, and clustering
- Once it is done, it returns a download link to the output file, which has an additional column appended with the cluster numbers
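The steps above can be sketched with pandas and scikit-learn, both of which ship with Anaconda. This is an illustration under assumptions, not the code behind the API: the cleaning rule, the `cluster_num` column name, the `notes` column, and the sample rows are all made up for the example.

```python
import re

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text):
    """Lowercase the text and keep only letters and single spaces.

    This is an assumed cleaning step; the real tool may clean differently.
    """
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    return re.sub(r"\s+", " ", text).strip()


def cluster_column(df, text_col, n_clusters, random_state=0):
    """Return a copy of df with a cluster number appended for each row."""
    cleaned = df[text_col].fillna("").map(clean_text)
    # Turn each document into a TF-IDF weighted bag-of-words vector.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(cleaned)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    out = df.copy()
    out["cluster_num"] = model.fit_predict(tfidf)  # hypothetical column name
    return out


if __name__ == "__main__":
    # Made-up sample data standing in for the uploaded CSV.
    sample = pd.DataFrame({
        "notes": [
            "Patient reports chest pain",
            "Chest pain and shortness of breath",
            "Invoice payment overdue",
            "Payment received for invoice",
        ]
    })
    print(cluster_column(sample, "notes", n_clusters=2))
```

With a real file, `pd.read_csv` would replace the hand-built DataFrame and `to_csv` would write the result that the download link points to.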
Guided Clustering
This technique is a little more straightforward.
- It takes two files as input, one with the data to be clustered and the other with predefined keywords
- It also takes the name of the text column, as in unguided clustering
- As output, it adds one column for each keyword given
- Each column is TRUE if a document contains that word, FALSE if it doesn’t
This shows the presence or absence of each keyword across the documents, indicating which documents carry signals from the keywords and which do not.
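As a rough illustration of the keyword flags described here, the sketch below uses only the standard library. The function name, the `notes` column, and the sample data are hypothetical, and matching on whole words while ignoring case is an assumption; the actual tool may match differently.

```python
import csv
import io
import re


def flag_keywords(rows, text_col, keywords):
    """Return copies of rows with one True/False column added per keyword.

    Assumption: a keyword counts as present only when it appears as a
    whole word, ignoring case.
    """
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    flagged = []
    for row in rows:
        new_row = dict(row)  # leave the input row untouched
        for kw, pattern in patterns.items():
            new_row[kw] = pattern.search(row[text_col]) is not None
        flagged.append(new_row)
    return flagged


if __name__ == "__main__":
    # Made-up stand-ins for the data file and the keyword file.
    data_csv = io.StringIO(
        "id,notes\n"
        "1,Patient reports chest pain\n"
        "2,Invoice payment overdue\n"
    )
    keywords = ["pain", "invoice"]
    for row in flag_keywords(csv.DictReader(data_csv), "notes", keywords):
        print(row)
```

Whole-word matching avoids flagging "painful" for the keyword "pain"; if substring hits are wanted instead, the `\b` anchors can be dropped.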
Conclusion
That is all for this two-part series on text clustering. Good enough to get started, right? It was an amazing experience writing this series. See you in the next one. Have fun!
Original. Reposted with permission.
Bio: Vivek Kalyanarangan works as a data scientist looking at problems in the Healthcare domain.