Clickbait Detection with Machine Learning: A Complete Python Tutorial

Hey devs! 👋 Ever wondered how to build a real-world NLP classifier? Today, we're diving into clickbait detection using scikit-learn, TF-IDF, and Random Forest. I'll walk you through the entire process, from data prep to deployment on Hugging Face.


This content originally appeared on DEV Community and was authored by Deviprasad Shetty

Why Clickbait Detection Matters

In the age of social media, clickbait wastes readers' time and can spread misinformation. As developers, we can build tools to combat it. My model reaches 91.45% accuracy on a held-out test set, using a dataset of 32,000 headlines.

Dataset & Setup

We're using the Clickbait Dataset from Kaggle. The classes are balanced: 16K clickbait headlines and 16K real news headlines.

pip install pandas scikit-learn matplotlib seaborn joblib huggingface_hub

Step 1: Data Loading & Preprocessing

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data and drop rows with missing values
df = pd.read_csv("clickbait_data.csv")
df.dropna(inplace=True)

# Rename the target column and map 0/1 to readable labels
df.rename(columns={'clickbait': 'label'}, inplace=True)
df['label'] = df['label'].map({0: 'real', 1: 'clickbait'})

print(f"Dataset shape: {df.shape}")
print(df.head())

Step 2: Train-Test Split

Stratified split to maintain class balance:

X_train, X_test, y_train, y_test = train_test_split(
    df['headline'], df['label'], 
    test_size=0.2, 
    random_state=42, 
    stratify=df['label']
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
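To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of a stratified split. It only illustrates the idea and is not scikit-learn's implementation; the helper name `stratified_split` and the toy headlines are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(items, labels, test_size=0.2, seed=42):
    """Split items so each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)

    # Group item indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        # Shuffle within each class, then carve off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(items[i], labels[i]) for i in train_idx]
    test = [(items[i], labels[i]) for i in test_idx]
    return train, test


# Toy data (hypothetical): 10 clickbait and 10 real headlines.
headlines = [f"headline {i}" for i in range(20)]
labels = ["clickbait"] * 10 + ["real"] * 10

train, test = stratified_split(headlines, labels, test_size=0.2)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(train_counts)
print(test_counts)
```

Because the split happens per class, both subsets keep the same 50/50 balance as the full toy dataset.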

Step 3: Feature Extraction with TF-IDF

Convert text to numerical features:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f"Feature matrix shape: {X_train_vec.shape}")
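If you're curious what the vectorizer computes, here is a simplified pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as its defaults. It skips stop-word removal and `max_features`, and it tokenizes on whitespace only, so its numbers will not match `TfidfVectorizer` exactly. The function name `tfidf_vectors` and the toy headlines are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: how many documents contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed IDF: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        # Scale each document vector to unit length.
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors


docs = [
    "you will not believe this trick",
    "government announces new budget",
    "this trick saves money",
]
vocab, vectors = tfidf_vectors(docs)

# Show the non-zero weights of the first headline.
first = dict(zip(vocab, vectors[0]))
print({t: round(w, 3) for t, w in first.items() if w > 0})
```

In the first headline, "believe" appears in only one document while "trick" appears in two, so "believe" ends up with the larger weight.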

Step 4: Model Training

Random Forest for robust classification:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
model.fit(X_train_vec, y_train)

print("Model trained! ✅")
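How does the forest turn 200 trees into one answer? According to the scikit-learn documentation, the predicted class is the one with the highest mean probability estimate across the trees. Here is a tiny sketch of that averaging step; the per-tree probabilities are made up for illustration, and `forest_predict` is a hypothetical helper, not a scikit-learn function.

```python
def forest_predict(tree_probas, classes):
    """Average per-tree class probabilities and return the top class."""
    n_trees = len(tree_probas)
    mean_proba = [
        sum(tree[i] for tree in tree_probas) / n_trees
        for i in range(len(classes))
    ]
    best = max(range(len(classes)), key=lambda i: mean_proba[i])
    return classes[best], mean_proba


classes = ["clickbait", "real"]

# Hypothetical probability outputs from three trees for one headline.
tree_probas = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.7, 0.3],
]

label, mean_proba = forest_predict(tree_probas, classes)
print(label, [round(p, 3) for p in mean_proba])
```

One tree leans toward "real" here, but the averaged probabilities still favor "clickbait".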

Step 5: Evaluation

Check performance on the test set:

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = model.predict(X_test_vec)
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
print(classification_report(y_test, y_pred))

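# Confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()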

Results:

  • Accuracy: 91.45%
  • Macro F1: 0.91
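To show how accuracy and macro F1 fall out of a confusion matrix, here is a small pure-Python sketch. The counts below are illustrative only and are NOT the confusion matrix from this tutorial; `scores_from_confusion` is a hypothetical helper.

```python
def scores_from_confusion(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))
    accuracy = correct / total

    f1_scores = []
    for k in range(n_classes):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n_classes)) - tp  # column minus diagonal
        fn = sum(cm[k]) - tp                               # row minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)

    # Macro F1 is the unweighted mean of the per-class F1 scores.
    macro_f1 = sum(f1_scores) / n_classes
    return accuracy, macro_f1


# Illustrative counts only (not the tutorial's actual results).
cm = [
    [90, 10],  # true clickbait: 90 predicted clickbait, 10 predicted real
    [20, 80],  # true real: 20 predicted clickbait, 80 predicted real
]
accuracy, macro_f1 = scores_from_confusion(cm)
print(f"Accuracy: {accuracy:.4f}, Macro F1: {macro_f1:.4f}")
```

On a balanced dataset like this one, accuracy and macro F1 tend to land close together, which matches the results reported above.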

Testing on Real Headlines

test_headlines = [
    "You won't believe what this celebrity did!",
    "New study reveals surprising health benefits",
    "10 hacks to boost your productivity"
]

predictions = model.predict(vectorizer.transform(test_headlines))
for text, pred in zip(test_headlines, predictions):
    print(f"'{text}' → {pred}")

Deploy to Hugging Face

Save and upload the model:

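import os
import joblib
from huggingface_hub import HfApi

# Save locally
joblib.dump(model, "clickbait_detector.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

# Upload both files; read the token from the HF_TOKEN environment
# variable instead of hardcoding it in the script
api = HfApi(token=os.environ["HF_TOKEN"])
repo_id = "Devishetty100/clickbait-detector"
api.create_repo(repo_id=repo_id, exist_ok=True)  # no-op if the repo already exists
for filename in ["clickbait_detector.pkl", "tfidf_vectorizer.pkl"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id=repo_id,
    )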

Usage in Production

from huggingface_hub import hf_hub_download
import joblib

# Load from HF
model_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="clickbait_detector.pkl")
vectorizer_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="tfidf_vectorizer.pkl")

model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)

# Predict
def detect_clickbait(headline):
    features = vectorizer.transform([headline])
    return model.predict(features)[0]

print(detect_clickbait("Shocking truth about coffee!"))
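In a real service, headlines often arrive in batches and some of them are blank. Here is a rough sketch of a batch helper that handles both cases. The function name `detect_clickbait_batch` is hypothetical, and the two stub classes are stand-ins that only mimic the `transform`/`predict` interface so the snippet runs on its own; in practice you would pass the `vectorizer` and `model` loaded with joblib.

```python
def detect_clickbait_batch(headlines, vectorizer, model):
    """Classify many headlines in one call; blank inputs map to None."""
    cleaned = [h.strip() if isinstance(h, str) else "" for h in headlines]
    keep = [i for i, h in enumerate(cleaned) if h]

    results = [None] * len(headlines)
    if keep:
        # Vectorize and predict once for the whole batch.
        features = vectorizer.transform([cleaned[i] for i in keep])
        for i, pred in zip(keep, model.predict(features)):
            results[i] = pred
    return results


# Stand-ins (hypothetical) that mimic the transform/predict interface.
class StubVectorizer:
    def transform(self, texts):
        return [t.lower() for t in texts]


class StubModel:
    def predict(self, features):
        return ["clickbait" if "shocking" in f else "real" for f in features]


results = detect_clickbait_batch(
    ["Shocking truth about coffee!", "   ", "Parliament passes budget bill"],
    StubVectorizer(),
    StubModel(),
)
print(results)
```

Predicting the whole batch in one call avoids running the vectorizer once per headline, and returning `None` for blanks keeps the output aligned with the input list.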

Next Steps & Improvements

While this model performs well, here are some ideas for future enhancements (these are suggestions, not planned features):

  • Try BERT or other transformers for better accuracy
  • Add multilingual support
  • Build a web API with FastAPI
  • Integrate into browser extensions

Feel free to fork the notebook and experiment!

What do you think? Have you built similar classifiers? Share your projects in the comments!

🔗 Kaggle Notebook | HF Model | Demo Space


