CS458 Natural Language Processing
Self-study 6
Logistic Regression
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
Objective:
Understand how logistic regression is used for classification tasks in NLP, such as sentiment analysis or spam detection.
Concepts to Cover:
Binary classification and decision boundary
Sigmoid function and loss function (Log-Loss)
Feature engineering (TF-IDF, word embeddings)
Optimization techniques (Gradient Descent)
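The concepts above can be tied together in a few lines of plain Python. The following is a minimal sketch, not a production implementation: the toy data, the learning rate, and the iteration count are assumptions chosen for illustration. It applies the sigmoid to a linear score, measures the log-loss, and minimizes it with batch gradient descent; the decision boundary is the point where the score is zero.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical toy data: one feature per example, label 1 when the feature is large.
X = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0          # weight and bias, initialised at zero
learning_rate = 0.1      # assumed step size for this sketch
losses = []

for _ in range(2000):
    probs = [sigmoid(w * x + b) for x in X]
    losses.append(log_loss(y, probs))
    # Gradient of the average log-loss: mean of (p - y) * x for w, mean of (p - y) for b.
    grad_w = sum((p - t) * x for p, t, x in zip(probs, y, X)) / len(X)
    grad_b = sum(p - t for p, t in zip(probs, y)) / len(X)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

boundary = -b / w        # the score w*x + b is zero here, so the probability is 0.5
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in X]
print(f"w={w:.3f}, b={b:.3f}, boundary={boundary:.3f}, final loss={losses[-1]:.4f}")
```

With text data the single feature x becomes a vector of TF-IDF values or embedding dimensions, and w becomes a weight vector, but the update rule keeps the same form.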
Practical Tasks:
Spam Classification:
Sentiment Analysis on IMDB Reviews:
Preprocess text (tokenize, remove stop words, vectorize)
Train logistic regression on positive/negative reviews
Compare performance with and without feature selection
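One possible way to set up the "with and without feature selection" comparison is sketched below. The six reviews are a hypothetical stand-in for the IMDB data, and the choice of chi-squared scoring with k=4 is an assumption made for illustration; other selection methods would fit the task equally well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for IMDB reviews (1 = positive, 0 = negative).
reviews = [
    "a wonderful film with brilliant acting",
    "brilliant story and wonderful music",
    "I loved the acting, a wonderful experience",
    "a terrible film with awful acting",
    "awful story and terrible pacing",
    "I hated the pacing, a terrible experience",
]
labels = [1, 1, 1, 0, 0, 0]

# Baseline: every TF-IDF feature is passed to the classifier.
baseline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)

# Variant: keep only the k features with the highest chi-squared score.
selected = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SelectKBest(chi2, k=4),
    LogisticRegression(max_iter=1000),
)

baseline.fit(reviews, labels)
selected.fit(reviews, labels)

n_all = len(baseline.named_steps["tfidfvectorizer"].vocabulary_)
n_kept = int(selected.named_steps["selectkbest"].get_support().sum())
print("features without selection:", n_all)
print("features with selection:", n_kept)
```

On the real dataset, both pipelines would be fitted on the training split and scored on the same held-out test split, so that any difference in accuracy can be attributed to the selection step.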
# Step 1: Import the necessary libraries
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Step 2: Load the dataset (using two categories for binary classification)
categories = ['sci.space', 'rec.autos']
newsgroups = fetch_20newsgroups(subset='all',
                                categories=categories,
                                remove=('headers', 'footers', 'quotes'))
# The data is stored in newsgroups.data (list of text documents)
# and the target labels in newsgroups.target
# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    newsgroups.data,
    newsgroups.target,
    test_size=0.2,
    random_state=42
)
# Step 4: Preprocess the text using TF-IDF vectorization
# This converts text data into numerical feature vectors.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Step 5: Initialize and train the Logistic Regression model
# Increase max_iter if convergence warnings occur.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)
# Step 6: Predict on the test set
y_pred = clf.predict(X_test_tfidf)
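The link between predict, the sigmoid, and the decision boundary can be checked directly. The following sketch uses hypothetical two-feature data rather than the TF-IDF matrix. For a binary problem with default settings, it shows that applying the sigmoid to decision_function reproduces predict_proba, and that predict corresponds to thresholding that probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-feature data; in the slides the features are TF-IDF values.
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7],
              [0.8, 0.2], [0.9, 0.1], [0.7, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.decision_function(X)             # w . x + b for each row
probs = clf.predict_proba(X)[:, 1]            # P(class 1 | x)
manual_probs = 1.0 / (1.0 + np.exp(-scores))  # sigmoid applied to the scores

same_probs = np.allclose(probs, manual_probs)
same_labels = np.array_equal(clf.predict(X), (probs >= 0.5).astype(int))
print("sigmoid(scores) matches predict_proba:", same_probs)
print("predict matches probs >= 0.5:", same_labels)
```

Moving the threshold away from 0.5 trades precision against recall, which can be useful when one kind of error is more costly than the other.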
# Step 7: Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
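To read the printed report, it helps to see how the headline metrics follow from the four cells of a binary confusion matrix. The sketch below uses hypothetical labels, not the newsgroups results, and lays the matrix out the way scikit-learn does: rows are true labels, columns are predicted labels.

```python
# Hypothetical true and predicted labels (1 = positive class, 0 = negative class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Count the four cells of the binary confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Same layout that scikit-learn prints: rows are true labels, columns are predictions.
matrix = [[tn, fp],
          [fn, tp]]

accuracy = (tp + tn) / len(y_true)                   # share of all predictions that are right
precision = tp / (tp + fp)                           # share of predicted positives that are right
recall = tp / (tp + fn)                              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print("Confusion matrix:", matrix)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```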
Thank You