Creating a simple text classifier using Google Colaboratory

Google Colaboratory is Google’s latest contribution to AI: a Jupyter-like environment in which users can write and run Python code from a Chrome browser. In this article I share a method, and the code, for creating a simple binary text classifier using scikit-learn within the Google Colaboratory environment.



By Sudipto Dasgupta, Flipkart.

The Platform: Google has scored another hit with Colaboratory, its in-house data science platform that is freely available for anyone to use. It offers many of the benefits of Jupyter and more: free GPU time, easy code sharing and storage, no software installation, coding from a Chrome browser, and compatibility with Python and modules such as scikit-learn. This is a truly great step in making AI and data science accessible to all.

The Context: I work for an e-commerce organization, where mis-shipments are a routine part of the business. When information about a mis-shipment is received in the system, a team of experts reads the comments the customer has entered against each case to determine how to investigate it. Because open text fields are difficult to control, customers are free to post messages that may not be actionable, or sometimes even understandable. Reading comments takes up a considerable amount of time, and given previous labelling information, the junk comments can easily be identified by a text classification algorithm. Below is a simple classifier which can generate labels with a high level of accuracy, given sufficient training data and a balanced label distribution.

Binary Text Classifier

Loading the corpus: The training data consists of two columns, the first containing the comments and the second the labels (0 and 1). First, we load the data into the Colab environment with the following code:

from google.colab import files
import pandas as pd
import io

# Upload the training file and read it into a DataFrame
uploaded = files.upload()
df = pd.read_csv(io.StringIO(uploaded['data_train.csv'].decode('utf-8')), header=None)

Executing this block (known as a cell in Colab) generates an upload widget through which the training data can be uploaded. Once this operation is complete, the columns are given name references:

raw_text = df[0]  # customer comments
y = df[1]         # labels (0 or 1)

 

Pre-Processing:
Though there are several different methods for classifying text, the one I have used involves the NLTK Python package.

Stemming involves reducing a derived word to its base form. For example, the word ‘fish’ is the root of words such as ‘fishing’, ‘fished’, and ‘fisher’. Martin Porter’s algorithm is a popular stemming tool, and an implementation is available in NLTK. Stopwords are words that do not add much meaning to a sentence from a feature-extraction point of view; words such as ‘after’, ‘few’, and ‘right’ are frequently ignored by search engines, and a list of common English stopwords ships with NLTK. I have imported the Porter stemmer and the stopword list from NLTK using the following commands.

 

import nltk

# Download the stopword list; the Porter stemmer itself needs no separate download
nltk.download('stopwords')
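As a quick illustration of what these two tools do, here is a minimal sketch (not part of the classifier pipeline) showing the Porter stemmer collapsing inflected forms and the stopword list filtering out low-information words:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Inflected forms reduce to the same stem
print([ps.stem(w) for w in ['fishing', 'fished']])   # ['fish', 'fish']

# Stopwords such as 'few' and 'after' are dropped before feature extraction
sentence = 'few items were returned after delivery'
print([w for w in sentence.split() if w not in stop_words])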

 

Another pre-processing step was carried out using the regular expressions (‘re’) module. This involved removing whitespace, tabs, and punctuation, and finally converting all text to lowercase. This step is commonly known as normalization. A function named ‘pre_process’ was created to apply all these steps in a single call to any text or block of text.

# Text pre-processing function
stop_words = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

import re
def pre_process(txt):
    z = re.sub("[^a-zA-Z]", " ", str(txt))       # keep letters only
    z = re.sub(r'[^\w\d\s]', ' ', z)             # replace any remaining punctuation with spaces
    z = re.sub(r'\s+', ' ', z)                   # collapse runs of whitespace and tabs
    z = re.sub(r'^\s+|\s+?$', '', z.lower())     # trim the ends and convert to lowercase
    # drop stopwords and stem every remaining term
    return ' '.join(ps.stem(term)
        for term in z.split()
        if term not in set(stop_words)
    )

Now let us look at what this function does by applying it to the first five comments in the corpus.

## Testing whether the pre-processing works
processed = raw_text.apply(pre_process)
print('Original Comment\n', raw_text.head(), '\n\nTransformed Comment\n', processed.head())

 

Running the cell generates the following output, in which you can compare the corresponding lines to see the combined effect of the pre-processing steps.

###Original Comment###

Item is not same as shown in pciture

cmbissue :- cust called for the return and ...

Different item

Dial color is white  I have ordered blackItem ...

issue with quality and price tag missing

 

###Transformed Comment###

item shown pcitur

cmbissucust call return refund due discript ...

differ item

dial color white order blackitem receivvari...

issuqualiti price tag miss

 

Tokenization: Let’s consider the sentence ‘How are you?’. Obviously, programs don’t understand words; they only understand characters. So if a bag-of-words model is brought into play, the sentences ‘How are you?’ and ‘are How you?’ are the same. However, the bigrams for the two sentences would be different. Bigrams are a subset of n-grams, which are contiguous sequences of items such as characters, syllables, or words. N-grams are highly popular not only in NLP but also in other fields such as DNA sequencing! The bigrams are:

‘How are you?’  ----  ‘How are’ , ‘are you’

‘are How you?’  ----  ‘are How’ , ‘How you’
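These pairs are easy to generate programmatically; a minimal sketch using NLTK’s bigrams helper (illustrative only, not part of the classifier):

import nltk

# The unordered bags of words are identical, but the bigram sequences differ
print(list(nltk.bigrams('How are you'.split())))   # [('How', 'are'), ('are', 'you')]
print(list(nltk.bigrams('are How you'.split())))   # [('are', 'How'), ('How', 'you')]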

The bigram generation code is:

# Creating unigram & bigram vectors
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_ngrams = vectorizer.fit_transform(processed)

The term frequency (tf) counts the occurrences of each n-gram in each training example. This is down-weighted by the inverse document frequency (idf), ensuring that words and word pairs distinctive to either class receive higher weights, while n-grams common across the corpus receive lower weights.
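To get a feel for what the vectorizer produced, you can inspect the resulting sparse matrix; a quick sketch (the method for listing feature names varies by scikit-learn version):

# Number of comments x number of unigram/bigram features
print(X_ngrams.shape)

# A sample of the learned unigram and bigram features
# (on older scikit-learn versions, use vectorizer.get_feature_names() instead)
print(vectorizer.get_feature_names_out()[:10])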

Creating the classifier: Once the features have been generated by the previous block of code, the next step is to fit a model to the data. The data is split in an 80/20 ratio and modelled using (binary) logistic regression.

# Train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_ngrams, y, test_size=0.2, stratify=y)

# Running the classifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)

 

At this point one can use a metric such as the F1 score to evaluate model performance on the held-out test set.
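For example, a minimal evaluation sketch on the 20% held-out split could look like this:

from sklearn.metrics import f1_score

# Predict on the held-out split and score the positive class (label 1)
y_pred = clf.predict(X_test)
print('F1 score:', f1_score(y_test, y_pred))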

The machine has learnt!

 


Next, we need a wrapper function to separate the pearls from the junk.

# Wrapper function
def pearl_or_junk(message):
    if clf.predict(vectorizer.transform([pre_process(message)])):
        return 'pearl'
    else:
        return 'junk'

That’s it! The wrapper function can be applied to a column of comments to classify whether each one is useful or not. An example file can be passed through it using the following code:

# Upload a test file, label every comment, and download the result
uploaded = files.upload()
test_data = pd.read_csv(io.StringIO(uploaded['test_file.csv'].decode('utf-8')), header=None)
test_data[1] = test_data[0].apply(pearl_or_junk)
test_data.to_csv('newfile.csv', index=None, header=False)
files.download('newfile.csv')

The code adds a column with the labels ‘pearl’ and ‘junk’ against every comment, and the resulting file, ‘newfile.csv’, is automatically downloaded to your system.

What next? As I stated earlier, this is a simple piece of code, and it has plenty of room for improvement. If there is class imbalance in the training data, the model will tend to predict the dominant class; a resampling technique such as SMOTE may help address that, as sketched below.
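A rough sketch of that idea, assuming the imbalanced-learn (imblearn) package is available in the Colab environment:

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Oversample the minority class in the training split only,
# then refit the same kind of model on the balanced data
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
clf_balanced = LogisticRegression().fit(X_resampled, y_resampled)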

There are several NLP techniques, such as lemmatization, that can improve the pre-processing (a sketch follows below). Increasing the n-gram range to 3 or 4 usually increases the load on the machine, and in my case it just gave me a message that it was too tired to handle such complicated stuff. FYI, the unigram-plus-bigram vectorizer generated over 36,000 features in a sparse matrix; imagine what will happen if you increase N.
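As one example, NLTK’s WordNet lemmatizer could replace or complement the Porter stemmer inside pre_process; a minimal sketch, assuming the WordNet data has been downloaded:

import nltk
nltk.download('wordnet')   # some NLTK versions also require nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Treating the tokens as verbs maps inflected forms back to a dictionary word
print(lemmatizer.lemmatize('fished', pos='v'))   # 'fish'
print(lemmatizer.lemmatize('fishing', pos='v'))  # 'fish'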

The modelling was super simple. Model stacking may help increase the accuracy, and other algorithms such as Naïve Bayes or a support vector classifier are known to handle this kind of sparse text data well; a sketch of swapping them in follows below. I did not try a neural network, but that may also give better results if the training data is sufficiently large. Google Colab provides free GPU time too, so that may be worth a try.
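For instance, fitting a multinomial Naïve Bayes or a linear SVC against the same TF-IDF features takes only a few lines (a sketch, not something benchmarked here):

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Fit each alternative model on the same features and compare F1 scores
for alt_clf in (MultinomialNB(), LinearSVC()):
    alt_clf.fit(X_train, y_train)
    print(type(alt_clf).__name__, f1_score(y_test, alt_clf.predict(X_test)))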


Though I created this code in Jupyter, moving it to Colaboratory was a breeze. Thank you, Google!

 

Tags: NLP; Machine Learning; Regression;

 

Bio:

Sudipto Dasgupta is currently working as a Specialist – Process Design for Flipkart India Pvt. Ltd., the largest e-commerce organization in India. He has 15+ years of experience in Business Analytics in domains such as software, market research, education and supply chain. He is an experienced Six Sigma Master Black Belt and project management professional (PMP) with an educational background in Mathematics and Statistics. He has an active interest in the Data Sciences.
