News Classifier with Naive Bayes in Python

Photo by AbsolutVision on Unsplash

Jupyter Notebook

See the Jupyter Notebook for the concepts we’ll cover on building machine learning models, and my Medium profile for other Data Science articles and tutorials.

Classification with Naive Bayes

The Naive Bayes algorithm is named after the Bayes probability theorem. The algorithm aims to calculate the probability that an unknown sample belongs to each possible class, predicting the most likely class.

This type of prediction is called statistical classification because it is wholly based on probabilities. The classification is also called naive because it assumes that the value of an attribute in a given class is independent of the values of the other attributes, which simplifies the calculations involved.

What is classification?

Classification consists of finding, through machine learning, a model or function that describes different data classes. The purpose of classification is to automatically label new instances with a particular category by applying the model or function learned during training. This model is based on the values of the attributes of the training instances.

There are several classifiers available: KNN, Naive Bayes, Decision Trees… Classification can be used in several types of problems, such as:

  • Spam detection
  • Automatic organization of emails by category
  • Identification of pages with adult content
  • Detection of expressions and sentiment

Naive Bayes Classifier

The Naive Bayes classifier is a probabilistic classifier based on applying Bayes’ theorem under the assumption of independence between attributes. In simple terms, Naive Bayes assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class of the target variable.

It is called naive because it assumes that each variable contributes independently, which does not always hold in practice.

For example, a fruit may be considered an apple if it is red, round, and about 7 cm in diameter. A Naive Bayes classifier assumes that each attribute contributes independently to the probability that the fruit is an apple, regardless of the presence or absence of the other characteristics.

The Naive Bayes model is easy to build and particularly useful for large data sets. Besides being simple, Naive Bayes can hold its own against far more sophisticated classification methods; in various classification problems we can achieve better results with Naive Bayes than with artificial neural networks.

Naive Bayes Applications

No machine learning algorithm is better than the others in every situation. Each algorithm has advantages and disadvantages, and its performance is directly related to the business problem we are trying to solve and to our data set.

Naive Bayes can outperform neural networks on one problem, while neural networks perform better on another. The important thing is to learn as much as possible about machine learning algorithms, try them out, and achieve the best possible result.

• Multi-class prediction

A fairly common use case is multi-class prediction, i.e., when we have more than two classes to predict. For example, in sentiment analysis we need to predict whether a particular user is neutral, negative, positive, or expresses some other sentiment; in this case we have a few classes, and Naive Bayes can be used for these situations.

• Text classification, spam filtering, and sentiment analysis

We can apply Naive Bayes to all three. Naive Bayes achieves excellent performance in many situations, especially when the data volume is not very large.

• Real-Time Predictions

When we need to collect real-time data from Twitter, Facebook, or any other data source and apply real-time classification, Naive Bayes performs well. This is because its probabilistic calculations are fast compared to those of other algorithms.

So, if we have to perform real-time online classification, Naive Bayes is an excellent option.

• Recommendation System

Naive Bayes is also useful when we need to find patterns in the data and, from those patterns, make recommendations to new users.

Naive Bayes at Scikit-learn

There are three main types of Naive Bayes models in Python’s Scikit-learn library:

1. Gaussian

The Gaussian algorithm assumes that the features follow a normal (Gaussian) distribution. Therefore, if our continuous features are approximately normally distributed, we can apply Gaussian Naive Bayes.
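As an illustration (a minimal sketch, not part of the original notebook), Gaussian Naive Bayes on scikit-learn’s built-in iris dataset, whose features are continuous, would look like this:

# Gaussian Naive Bayes sketch (illustrative, using the built-in iris dataset)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y = True)    # four continuous features per flower
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
gnb = GaussianNB().fit(X_train, y_train)    # estimates a mean and variance per feature and class
print(gnb.score(X_test, y_test))    # accuracy on the held-out split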

2. Multinomial

This algorithm is used for discrete counts. For example, in a text classification problem we need to count how often a word occurs in a document; from a statistical perspective, this is the number of times a value x is observed over n trials. For this, we can use the Multinomial Naive Bayes classifier.

3. Bernoulli

The Bernoulli model is helpful when the feature vectors are binary, i.e., contain only 0s and 1s. An application would be text classification where we only want to know whether or not each word occurs in the document, i.e., 1 (the word occurs) or 0 (the word does not occur); this is typically a binary Bernoulli setup.
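As a hedged illustration of the Bernoulli case (the tiny corpus and labels below are invented for the example), we could binarize the word occurrences with CountVectorizer(binary = True) and feed them to BernoulliNB:

# Bernoulli Naive Bayes sketch (illustrative; the toy corpus and labels are made up)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ['free prize now', 'meeting agenda attached', 'win a free prize', 'project meeting tomorrow']
labels = [1, 0, 1, 0]    # 1 = spam, 0 = not spam

binary_vect = CountVectorizer(binary = True)    # 1 if the word occurs in the document, 0 otherwise
X = binary_vect.fit_transform(docs)
bnb = BernoulliNB().fit(X, labels)
print(bnb.predict(binary_vect.transform(['free prize inside'])))    # expected: [1], i.e., spam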

Probability Theory

To understand Bayes’ theorem, we need to take a step back and understand the theory of probability and what conditional probability is about.

Probability is the study of experiments whose results cannot be predicted even under very similar conditions.

We study probability in order to estimate the chances of a given situation or fact occurring; in other words, we study probability to indicate the likelihood of an event.

• Random Experiment

An experiment is considered random when repeating it may produce different results. An example is flipping a coin with two distinct faces, heads and tails. The toss is unpredictable, as there is no way to know which face will land up.

• Sample Space

The sample space (S) is the set of all possible outcomes. In the case of a coin toss, the sample space is S = {heads, tails}, because these are the only two possible results of this random experiment. Depending on the type of experiment, the sample space can be vast.

• Event

A fact or situation whose occurrence we are measuring is called an event, and any subset of the sample space is considered an event. For example, when we flip a coin three times, one possible event is the sequence E = {heads, tails, heads}.

• Probability Ratio

The probability of an event is given by the ratio between the outcomes in which the event occurs and its sample space. This ratio is a fraction equal to the number of elements in the event (numerator) over the number of elements in the sample space (denominator).
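To make the ratio concrete, here is a small illustrative snippet (not from the original notebook) that counts favorable outcomes over the whole sample space for three coin flips:

# Probability ratio sketch (illustrative): P(event) = favorable outcomes / total outcomes
from itertools import product

sample_space = list(product(['heads', 'tails'], repeat = 3))    # all 8 results of 3 coin flips
event = [outcome for outcome in sample_space if outcome.count('heads') == 2]
print(len(event), '/', len(sample_space), '=', len(event) / len(sample_space))    # 3 / 8 = 0.375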

This is the basis of probability theory. There are variations that extend the concept of probability, but the general rules rest on the fact that a probability is a value between 0 and 1, and they are what we use when applying Bayes’ theorem. To understand Bayes’ theorem, we first need to understand the conditional probability of simultaneous events.

Conditional Probability

Conditional probability is the basic concept behind Bayes’ theorem. In the example figure, the white square represents the space of possibilities, event A is the area with a yellow background, and event B is the area with a pink background. With the total space and the two events, we can calculate the probability corresponding to a single event:

In the figure, event A covers 3 of the 16 possibilities, so the probability of A as an isolated event is 3/16.

In the same way, we can find the probability of B, another isolated event; in this case, the chance is 4/16 = 25%.

Here we calculated the individual probability of each event, but what we actually want is a conditional probability, that is, the probability of A given the occurrence of B (and vice versa). In other words, we consider the likelihood of one event given the occurrence of another.

When we calculate the conditional probability, the formula changes a little. P of A given B, i.e., the probability of A given the occurrence of B, is broken into two parts: in the numerator we have the probability of A and B together (the intersection), divided by the individual probability of B.

Looking at the figure, the probability of A and B (the intersection) is 1/16, and the probability of event B is 4/16.

We can also do the opposite and calculate the probability of B given the occurrence of A, as shown below.
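Plugging in the numbers from the figure (assuming, as described above, that A covers 3 of the 16 squares, B covers 4, and their intersection covers 1), the two conditional probabilities work out to:

  • P(A|B) = P(A and B) / P(B) = (1/16) / (4/16) = 0.25
  • P(B|A) = P(A and B) / P(A) = (1/16) / (3/16) ≈ 0.33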

In many real-life situations, to find the probability of an event we have to consider the occurrence of another event. This is the relationship between a conditional probability and its inverse: the likelihood of a hypothesis given observed evidence, and the likelihood of the evidence given the hypothesis.

Bayes’ theorem is often described as one of the first attempts to formalize statistical inference, reportedly motivated in part by a desire to argue for God’s existence mathematically. Whatever its motivation, the theorem became one of the foundations of statistics.

Bayes’ Rule

Bayes’ rule shows how to update a priori probabilities, in light of new evidence, to obtain a posteriori probabilities.

The Bayes rule formula is P(c|x) = P(x|c) * P(c) / P(x). It involves the a priori probabilities P(x) and P(c), and the conditional probabilities P(c|x), c given x, and P(x|c), x given c.

Bayes’ theorem provides a way to calculate the posterior probability. Considering the formula of Bayes’ theorem, we have:

  • P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
  • P(c) is the prior probability of the class.
  • P(x|c) is the likelihood, i.e., the probability of the predictor given the class.
  • P(x) is the prior probability of the predictor.

What the algorithm does in practice is learn these probabilities. During training, the algorithm learns the probabilities that make up the formula and, in the end, reports the likelihood of a given data point belonging to each of the classes on which we performed the training.

How does Bayes’s theorem work?

In probability theory, Bayes’ theorem shows the relationship between conditional probability and its inverse. Here we have a set of training data on “Weather” and the corresponding target variable “Play,” that is, according to the weather, what is the probability of a player playing sport or not.

We have two events: the weather and playing sports. We need to classify whether athletes will play or not based on the weather conditions. Since one event is conditioned on the other, we use conditional probability to build Bayes’ theorem, which relates a conditional probability to its inverse. First, we convert the dataset into a frequency table:

Then we create probability tables to find the probability of each occurrence and each combination, that is, the number of occurrences of an event divided by the total number of possibilities.

Each class in the training set has its probability calculated. Most of the time, we work with only two categories, such as:

  • whether or not a consumer purchases a product based on demographic characteristics
  • whether or not a player will practice a sport given the weather condition

The calculation is done by dividing the number of instances of a given class by the number of cases in the training set.

Finally, we used the Bayes theorem equation to calculate the posterior probability for each class. The class with the highest posterior probability is the result of the prediction.

1st question: will players play sport if the weather is sunny? The first step is to find the probabilities we need to answer that question.

  • P (Yes|Sun) = P (Sun|Yes) * P (Yes) / P (Sun)

Translating a Problem into Bayes’ Theorem

We start with the probability P(Yes|Sun), i.e., the probability of playing given that the day is sunny. To find it, we take its reverse, the likelihood of a sunny day given that the player practices the sport, P(Sun|Yes), multiply it by the probability of playing, P(Yes), and divide by the probability of a sunny day, P(Sun).

So we have:

  • P (Sun| Yes) = 3/9 = 0.33
  • P (Sun) = 5/14 = 0.36
  • P (Yes) = 9/14 = 0.64

And replace in the formula:

P (Yes|Sun) = P (Sun|Yes) * P (Yes) / P (Sun) = 0.33 * 0.64 / 0.36 = 0.60

So, we’ll have a value of 0.60. As this is a relatively high probability, we can classify the result as Yes — players will practice sport if the weather is sunny.
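As a quick sanity check (a minimal sketch, not part of the original notebook), the same calculation can be reproduced in a few lines of Python using the counts assumed above (3 sunny days among the 9 “Yes” days, and 5 sunny days and 9 “Yes” days out of 14 records):

# Bayes' theorem by hand for the weather example (counts taken from the frequency table above)
p_sun_given_yes = 3 / 9    # P(Sun|Yes)
p_sun = 5 / 14             # P(Sun)
p_yes = 9 / 14             # P(Yes)
p_yes_given_sun = p_sun_given_yes * p_yes / p_sun
print(round(p_yes_given_sun, 2))    # 0.6, so we predict "Yes"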

Multinomial News Classifier

The goal of the algorithm is to build classification models. We have an input dataset, and we want to classify that data into groups. Because this is a supervised task, we give the algorithm the input data and the possible classes based on historical data, train the model, and finally present a new data set to make predictions, that is, to classify the data according to the available labels.

When our variables are discrete, that is, counts with a finite set of possibilities, Multinomial Naive Bayes is the most suitable choice. It is widely used for word counts in text classification, which are exactly such a finite set of discrete possibilities.

The multinomial distribution normally requires integer counts. However, in practice, fractional counts such as tf-idf (term frequency-inverse document frequency) also work to convert the occurrences of a word into a frequency table.

What happens is the conversion of the record table into a frequency table; that is, we count how many times each combination appears. Then we calculate the probabilities used in Bayes’ theorem from the frequency table and generate a result.

Multinomial Naive Bayes — Scikit-Learn

The Multinomial Naive Bayes classifier is suitable for classification with discrete variables (for example, word counts for text classification). The multinomial distribution normally requires integer counts; however, in practice, fractional counts such as tf-idf also work.

News Classifier

Let’s classify whether a news article belongs to one category or another. In this case, we’ll need to count the words within the text.

# Imports
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

We will use the 20 Newsgroups data set for our classifier, a famous data set for categorizing news; it comes ready to use, so we don’t have to worry about collecting data.

We import CountVectorizer to count the words, put them in a matrix, and generate the frequency table; Pipeline to create the composite classifier; TfidfTransformer to generate the tf-idf frequency table; and finally GridSearchCV to optimize the model.

The dataset has 20 different categories, i.e., 20 classes. Processing all of them would take a long time because the classifier would have to work on a huge data set. Therefore, to speed up processing, let’s limit it to 4 categories:

Create a lean list of categories

# Defining categories
# (using only 4 out of 20 available so the ranking process is faster)
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

Training

We call fetch_20newsgroups to collect a training subset restricted to the categories we just defined, shuffled randomly.

20 Newsgroups is a great data set in which every news item is classified into one of 20 categories. So that we don’t have to build a classifier for all 20 classes, we’ll build one for the 4 classes defined above, collecting the records through the fetch_20newsgroups function:

# Training
twenty_train = fetch_20newsgroups(subset = 'train', categories = categories, shuffle = True, random_state = 42)

We call target_names to view the 4 categories.

# Class
twenty_train.target_names
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

Get the length of the dataset:

len(twenty_train.data)
2257

View the first records in the set:

print("\n".join(twenty_train.data[0].split("\n")[:3]))
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton

View the target variable:

print(twenty_train.target_names[twenty_train.target[0]])
comp.graphics

Scikit-learn stores the labels as an array of integers for speed:

twenty_train.target[:10]
array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2], dtype=int64)

When scikit-learn loaded the dataset through fetch_20newsgroups, it created an array of integers representing the target classes.

Below we have the classes of the first 10 records, i.e., the classes of the first 10 news articles.

for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med

Bag of Words

From here, we build our bag of words: we process the dataset to produce the final bag and present it to the algorithm.

# Tokenizing
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
count_vect.vocabulary_.get(u'algorithm')
X_train_counts.shape
(2257, 35788)

The output is the matrix of occurrence counts. We use CountVectorizer to perform the tokenizing, breaking each document into individual words (tokens) and counting how often each one appears. CountVectorizer builds this count matrix, and we call fit_transform to fit it on our input dataset and transform the data in one step.

The TfidfTransformer then applies the tf-idf transformation, building a frequency table stored as a sparse matrix.

# From occurrences to frequencies - Term Frequency times Inverse Document Frequency (Tfidf)
tf_transformer = TfidfTransformer(use_idf = False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
(2257, 35788)

# Full tf-idf (term frequency weighted by inverse document frequency), used to train the model below
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
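As a side note (an equivalent shortcut, not used in this notebook), scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in a single step:

# One-step alternative (illustrative): TfidfVectorizer = CountVectorizer + TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
X_train_tfidf_alt = tfidf_vect.fit_transform(twenty_train.data)    # comparable to X_train_tfidf above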

From here, we can build our model by passing the sparse tf-idf matrix and the target variable as parameters:

# Creating the Multinomial Model
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

Let’s see which category the model assigns to these two new snippets of news:

# Predictions
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

Note that for the two new documents we also apply the vectorization with count_vect.transform and the frequency transformation with tfidf_transformer. After all, the model was trained on the tf-idf representation (the frequency table) built from the word occurrences in the training set, so any new text must be transformed the same way.

If we do the test now, we need to deliver the dataset in the same format — we can’t deliver the dataset in a completely different format.

Therefore, we first transform the raw documents in the docs_new object with count_vect.transform and tfidf_transformer.transform so that predict receives data in the appropriate format.

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics

The first news item was classified as religion, and the second as computer graphics. So we processed the data, built our model, and applied it to new data; we also had to apply the same transformations to ensure the data was in the same format used during training.

Creating a Pipeline — Composite Classifier

Let’s create the composite classifier. In the previous cells we went step by step; now we will do everything in a single command with a Pipeline:

# vectorizer => transformer => classifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])

We vectorize to count the occurrences, let the tfidf step generate the frequency table, and create the model. Then we fit the pipeline text_clf:

# Fit
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

We calculate the accuracy as the mean of the predictions that match the observed values; in other words, 83% accuracy, which is an excellent result.

# Model Accuracy
twenty_test = fetch_20newsgroups(subset = 'test', categories = categories, shuffle = True, random_state = 42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
0.8348868175765646
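For reference, the same number can be obtained with the accuracy helper from the metrics module we already imported; the two calculations are equivalent:

# Equivalent accuracy calculation using scikit-learn's metrics module
metrics.accuracy_score(twenty_test.target, predicted)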

To examine this accuracy in more detail, we print the classification_report:

# Metrics
print(metrics.classification_report(twenty_test.target, predicted, target_names = twenty_test.target_names))
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502

The classification_report metrics are calculated from the values of the confusion matrix. To see where the model confuses classes, we print the confusion_matrix:

# Confusion Matrix
metrics.confusion_matrix(twenty_test.target, predicted)
array([[192,   2,   6, 119],
       [  2, 347,   4,  36],
       [  2,  11, 322,  61],
       [  2,   2,   1, 393]], dtype=int64)

To optimize the parameters of our model, we use GridSearchCV. We define the parameter grid we want to test in the parameters object, then instantiate GridSearchCV, passing text_clf, the pipeline that bundles everything we created, and the parameter grid.

# Parameters for GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),}
# GridSearchCV
gs_clf = GridSearchCV(text_clf, parameters, n_jobs = -1)

GridSearchCV applies cross-validation, splitting the training dataset several times, fitting the model on each split, and selecting the best possible estimator.
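As a hedged aside (not in the original notebook; gs_clf_explicit is just an illustrative name): recent scikit-learn versions use 5-fold cross-validation by default, and we can make the number of folds and the scoring metric explicit:

# Making the cross-validation settings explicit (illustrative variant of the grid search above)
gs_clf_explicit = GridSearchCV(text_clf, parameters, cv = 5, scoring = 'accuracy', n_jobs = -1)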

Now let’s introduce the training dataset to this model:

# Fit
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

Let’s perform a test by passing a new piece of news and seeing the predicted category for it.

# Test
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]
'soc.religion.christian'

We get a score of about 93%, a result much better than the 83% we initially achieved.

# Score
gs_clf.best_score_
0.9349999999999999
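Note that best_score_ is the cross-validation score computed on the 400 training documents used for the search, so it is not a strict like-for-like comparison with the 83% accuracy measured on the test set. As an optional extra step (not in the original notebook; predicted_gs is just an illustrative name), we could score the tuned model on the same held-out test set:

# Evaluating the tuned grid-search model on the held-out test set (illustrative)
predicted_gs = gs_clf.predict(docs_test)
np.mean(predicted_gs == twenty_test.target)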

Finally, we can see which parameters were used to build the best estimator.

# Parameters used
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
clf__alpha: 0.01
tfidf__use_idf: True
vect__ngram_range: (1, 2)

With that, we have a news classifier with an excellent level of accuracy using Multinomial Naive Bayes! And there we have it. I hope you have found this helpful. Thank you for reading.

