Naive Bayes from Scratch using Python only – No Fancy Frameworks

We provide a complete, step-by-step Pythonic implementation of Naive Bayes. Keeping in mind the mathematical and probabilistic difficulties we usually face when diving deep into the algorithmic insights of ML algorithms, this post should be ideal for beginners.



Training the NB Model on the Training Dataset

And yes, that’s it!
Just four functions and we are all set to train our NB model on any text dataset, with any number of class labels!
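Here is a minimal sketch of what such a four-function NaiveBayes class might look like; the method names and internals below are illustrative assumptions, not the exact code from the post:

```python
import numpy as np
from collections import defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes for text classification (illustrative sketch)."""

    def __init__(self, unique_classes):
        self.classes = unique_classes  # all distinct class labels

    def _add_to_bow(self, example, class_index):
        # Function 1: tokenize on whitespace and update this class's word counts
        for token in example.split():
            self.bow_dicts[class_index][token] += 1

    def train(self, dataset, labels):
        # Function 2: build one bag-of-words per class and estimate class priors
        self.bow_dicts = [defaultdict(int) for _ in self.classes]
        self.priors = np.zeros(len(self.classes))
        for idx, c in enumerate(self.classes):
            class_examples = [ex for ex, lab in zip(dataset, labels) if lab == c]
            self.priors[idx] = len(class_examples) / len(dataset)
            for ex in class_examples:
                self._add_to_bow(ex, idx)
        # Totals needed for Laplace-smoothed likelihoods at test time
        self.class_word_counts = [sum(d.values()) for d in self.bow_dicts]
        self.vocab_size = len(set().union(*self.bow_dicts))

    def _example_log_prob(self, example):
        # Function 3: log prior + sum of Laplace-smoothed log likelihoods per class
        scores = np.log(self.priors)
        for idx in range(len(self.classes)):
            denom = self.class_word_counts[idx] + self.vocab_size + 1
            for token in example.split():
                count = self.bow_dicts[idx].get(token, 0) + 1
                scores[idx] += np.log(count / denom)
        return scores

    def test(self, test_set):
        # Function 4: pick the class with the highest posterior score per example
        return np.array([self.classes[np.argmax(self._example_log_prob(ex))]
                         for ex in test_set])
```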

If you are curious to know what the training data actually looks like…
It’s the 20 Newsgroups dataset, consisting of newsgroup posts on 20 topics. It has 20 classes, but for the time being we will train our NB model on just four categories: [‘alt.atheism’, ‘comp.graphics’, ‘sci.med’, ‘soc.religion.christian’]. The code works perfectly well for training against all 20 categories too.

Training Dataset

You might be wondering why the “Training Labels” column is in numeric form rather than in its original string form. It’s just that every string label has been mapped to its own unique integer. Even if this is unclear to you at the moment, just take it that the dataset has been provided with its labels in numeric form. Simple!
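For the curious, here is one way such a string-to-integer mapping could be done (sklearn’s dataset loader already returns numeric targets, so this step is purely illustrative):

```python
import numpy as np

string_labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'comp.graphics']

# np.unique returns the sorted distinct labels plus, with return_inverse=True,
# the index of each original label within that distinct set
classes, numeric_labels = np.unique(string_labels, return_inverse=True)
print(classes)         # ['alt.atheism' 'comp.graphics' 'sci.med']
print(numeric_labels)  # [0 1 2 1]
```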

So before we start training an NB model, let’s load this dataset.
We will load it from sklearn (Python’s ML framework), but remember: we still coded NB itself from scratch!
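The loading step might look roughly like this, using sklearn’s built-in fetcher for the 20 Newsgroups dataset (the variable names are my own):

```python
from sklearn.datasets import fetch_20newsgroups

# Restrict the dataset to the four categories mentioned above
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
train_data = newsgroups_train.data      # list of raw post texts
train_labels = newsgroups_train.target  # numeric labels, 0..3
```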

Woohoo! Let’s actually begin training!
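Assuming the NaiveBayes class sketched earlier, the training call might be as simple as:

```python
import numpy as np

# Instantiate with the distinct class labels, then fit on the training set
nb = NaiveBayes(np.unique(train_labels))
nb.train(train_data, train_labels)
```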

Aaaand training is complete!

Milestone #3 Achieved!

Testing Using the Trained NB Model

So now that we have trained our NB model, let’s move on to testing!
Loading the test set…
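Again via sklearn’s fetcher, reusing the same four categories; a sketch:

```python
from sklearn.datasets import fetch_20newsgroups

newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
test_data = newsgroups_test.data      # raw post texts
test_labels = newsgroups_test.target  # numeric ground-truth labels
```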

Testing on the test examples loaded above, using our trained NB model…
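With the sketched class, prediction and accuracy might be computed as:

```python
import numpy as np

predictions = nb.test(test_data)
accuracy = np.mean(predictions == test_labels)  # fraction of correct predictions
print(f"Test set accuracy: {accuracy:.2%}")
```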

Wow! A pretty good accuracy of ~93% ✌️
See, now you realize NB is not that naïve!

Milestone #4 Achieved!

Proving That the Code for the NaiveBayes Class Is Absolutely Generic!

As I mentioned in the beginning, the code we have written is generic. So let’s use the same code on a different dataset, with different class labels, to prove its “genericity”!

The other text dataset consists of movie reviews and their sentiments, and looks something like this:

Here is the link to this dataset from Kaggle

Training an NB model on this dataset and testing its accuracy…
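A sketch of that step; the file name and column names here are hypothetical placeholders to adjust to the actual Kaggle CSV layout:

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names -- adapt to the real Kaggle CSV
reviews = pd.read_csv('train.csv')
train_data = reviews['review'].values       # raw review texts
train_labels = reviews['sentiment'].values  # sentiment labels

# Exactly the same programming interface as before
nb_reviews = NaiveBayes(np.unique(train_labels))
nb_reviews.train(train_data, train_labels)
```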

See how the same NaiveBayes code works like a charm on different datasets, with the same programming interface!

Let’s run our trained model on the Kaggle test set and upload the predictions to Kaggle to see how well our NB performs there!
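A sketch of producing a submission file; the column names and submission format below are assumptions and should be adapted to the competition’s spec:

```python
import pandas as pd

# Hypothetical test file; predict, then write a Kaggle-style submission CSV
test_df = pd.read_csv('test.csv')
test_predictions = nb_reviews.test(test_df['review'].values)

submission = pd.DataFrame({'id': test_df['id'], 'sentiment': test_predictions})
submission.to_csv('nb_submission.csv', index=False)
```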

A screenshot of the Kaggle results shows quite a good accuracy of 80%:

Kaggle Prediction Results

Milestone #5 Achieved!

So that’s all for this blog post, aaand you are now technically an “NB Guru”! Cheers!

The upcoming post will include:

  • Unfolding Naïve Bayes from Scratch! Take-3: Implementation of Naive Bayes using scikit-learn (Python’s Holy Grail of Machine Learning!)

Until then, stay tuned!

If you have any thoughts, comments, or questions, feel free to comment below or connect with me on LinkedIn.

If you liked my article, please clap as many times as you enjoyed reading it (it fuels me to write more in-depth data science blogs) and feel free to share it with your friends!

Bio: Aisha Javed is a data science enthusiast interested in Deep Learning, Machine Learning, NLP, and Kaggle.

Original. Reposted with permission.
