A Naive Bayes classification example

I have collected Forbes' 2017 billionaire profiles (2043 articles) and the New York Times' notable deaths of 2016 (364 articles). These articles are about individuals. Here I develop a simple machine learning program that classifies the articles with a Naive Bayes classifier, using Python/NLTK.

Let us use words as features. A word is considered here to be a sequence of English letters only. Each feature records whether a word is present or not, e.g., does the article contain the word 'death'?
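A minimal sketch of how such presence features could be built as NLTK-style feature dictionaries; extract_words, make_features and vocabulary are illustrative names, not necessarily those used in the actual code:

```python
import re

def extract_words(text):
    # A word here is a run of English letters only, lower-cased.
    return {w.lower() for w in re.findall(r"[A-Za-z]+", text)}

def make_features(text, vocabulary):
    # Binary presence features: does the article contain this word?
    words = extract_words(text)
    return {"contains({})".format(w): (w in words) for w in vocabulary}
```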

Thirty articles are taken as training data, fifteen from each of the two categories. All the remaining articles are used as test data. All words from the training data are used to build the features, and the program classified the test data with 94.66% accuracy.
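Training and scoring with NLTK's Naive Bayes classifier might look roughly like the sketch below. It reuses the helpers above; billionaire_texts and death_texts are hypothetical variable names for the two corpora, assumed to be lists of article strings loaded elsewhere.

```python
import nltk

# Assumption: billionaire_texts and death_texts are lists of article strings.
train_texts = [(t, "billionaire") for t in billionaire_texts[:15]] + \
              [(t, "death") for t in death_texts[:15]]
test_texts  = [(t, "billionaire") for t in billionaire_texts[15:]] + \
              [(t, "death") for t in death_texts[15:]]

# The all-words vocabulary: every word seen in the training articles.
vocabulary = set().union(*(extract_words(t) for t, _ in train_texts))

train_set = [(make_features(t, vocabulary), label) for t, label in train_texts]
test_set  = [(make_features(t, vocabulary), label) for t, label in test_texts]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```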

The accuracy is good, but there is always room for improvement. One widely used method in natural language processing is to exclude stop words from consideration. However, this no-stop-words method reduced the accuracy to 91.29%. What does that mean? Are the stop words crucial in this data?
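The no-stop-words variant only changes which words go into the vocabulary; a sketch, assuming NLTK's English stop word list and the all-words vocabulary from above:

```python
from nltk.corpus import stopwords

# May require a one-time nltk.download("stopwords").
stop_words = set(stopwords.words("english"))

# No-stop-words variant: drop stop words from the all-words vocabulary
# before building the presence features.
vocabulary_no_stop = {w for w in vocabulary if w not in stop_words}
```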

What about using only stop words to generate features? The result was 99.79% accuracy, an impressive result. The stop words list contains only 153 words. I then selected around 1000 of the most common English words, which should essentially be a superset of the stop words. This new method, call it only-common-words, classified 99.75% of the test data correctly.
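Both variants simply swap in a different, fixed vocabulary; a sketch, where "common_words.txt" is a hypothetical file name since the post does not say where the list of common words came from:

```python
from nltk.corpus import stopwords

# Only-stop-words variant: the feature vocabulary is the stop word list itself.
vocabulary_stop_only = set(stopwords.words("english"))

# Only-common-words variant: restrict features to ~1000 frequent English words.
with open("common_words.txt") as f:
    vocabulary_common = {line.strip().lower() for line in f if line.strip()}
```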

I also tried the stemmed forms of words for building features and got only 97.48% accuracy: good, but it could not outperform the only-stop-words method.
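For the stemming variant, the features are built from stemmed forms instead of surface forms; a sketch using NLTK's Porter stemmer (shown as one option, since the post does not say which stemmer was used) and the extract_words helper from above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def extract_stems(text):
    # Build the feature vocabulary from stemmed word forms.
    return {stemmer.stem(w) for w in extract_words(text)}
```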

Machine learning is about learning from examples, so what about adding more of them? I moved ten articles from the test data to the training data and repeated this a few more times. Here is the result (a rough sketch of the loop follows the table):

Feature method      Accuracy (%) by number of training articles
                    30       40       50       60       70       80
Only stop words     99.79    99.79    99.87    99.87    99.87    99.91
Only common words   99.75    99.79    99.92    99.91    99.91    99.96
Stemming            97.48    98.27    99.19    99.49    99.61    99.70
All words           94.66    96.83    97.96    98.76    98.93    99.48
No stop words       91.29    94.00    96.39    97.49    97.82    98.45
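The incremental experiment could be sketched as below; labelled is an assumed list of (text, label) pairs covering both corpora, vocabulary is whichever feature vocabulary is being tested, and per-category balancing of the moved articles is omitted for brevity.

```python
import nltk

def evaluate(train_texts, test_texts, vocabulary):
    # Train and score one configuration; reuses make_features from above.
    train_set = [(make_features(t, vocabulary), lbl) for t, lbl in train_texts]
    test_set  = [(make_features(t, vocabulary), lbl) for t, lbl in test_texts]
    clf = nltk.NaiveBayesClassifier.train(train_set)
    return nltk.classify.accuracy(clf, test_set)

# Grow the training pool from 30 to 80 articles in steps of ten.
for n_train in range(30, 90, 10):
    acc = evaluate(labelled[:n_train], labelled[n_train:], vocabulary)
    print(n_train, round(acc * 100, 2))
```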

The detailed output is available on GitHub.

The only-common-words method with 80 training articles produced a very high accuracy of 99.96%: out of 2327 test articles, it misclassified only one! Here is a visual comparison of all the experiments.

Which feature set is relevant? Only-common-words has the advantage of using a limited number of features (at most 1000). Only-stop-words uses even fewer (at most 153). All-words, no-stop-words, and stemming use a large number of features, and that number keeps growing as more training examples are added. Will the accuracy of only-stop-words or only-common-words drop as the number of training examples increases?

The Python/NLTK code is available on GitHub.