1 of 11

CS458 Natural Language Processing

Self-study 6

Logistic Regression

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 11

Logistic Regression

Objective:

Understand how logistic regression is used for classification tasks in NLP, such as sentiment analysis or spam detection.

Concepts to Cover:

Binary classification and decision boundary

Sigmoid function and loss function (Log-Loss)

Feature engineering (TF-IDF, word embeddings)

Optimization techniques (Gradient Descent)
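The three core pieces above (sigmoid, log-loss, gradient descent) can be sketched in a few lines of NumPy. The toy AND-style dataset, learning rate, and iteration count below are illustrative choices, not part of the course material:

```python
import numpy as np

def sigmoid(z):
    # Squash any real value into (0, 1) -- the model's probability output
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    # Binary cross-entropy (log-loss); clip to avoid log(0)
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy linearly separable data: label is 1 only when both features are 1
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 0.5

for _ in range(2000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of log-loss w.r.t. weights
    grad_b = np.mean(p - y)          # gradient w.r.t. bias
    w -= lr * grad_w                 # gradient descent update
    b -= lr * grad_b

print(f"final log-loss: {log_loss(y, sigmoid(X @ w + b)):.3f}")
```

After training, thresholding the sigmoid output at 0.5 gives the decision boundary: points with predicted probability above 0.5 are assigned the positive class.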

3 of 11

Logistic Regression

Practical Tasks:

Spam Classification:

  • Use a dataset (e.g., SMS Spam Collection)
  • Convert text into numerical features using TF-IDF
  • Train a logistic regression model using sklearn
  • Evaluate using accuracy, precision, recall, and F1-score
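The four steps above can be sketched end to end with sklearn. The tiny hand-written corpus below is an illustrative stand-in for the SMS Spam Collection; with the real dataset you would load the messages and labels from the downloaded file and evaluate on a held-out split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative stand-in for the SMS Spam Collection (1 = spam, 0 = ham)
texts = [
    "WINNER!! Claim your free prize now",
    "URGENT: you have won a cash prize, call now",
    "Free entry in a weekly prize draw, text WIN",
    "Are we still meeting for lunch today?",
    "Can you send me the lecture notes?",
    "I'll be home by eight, see you then",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feeding a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Evaluating on the training data here purely for illustration
preds = model.predict(texts)
print("Accuracy :", accuracy_score(labels, preds))
print("Precision:", precision_score(labels, preds))
print("Recall   :", recall_score(labels, preds))
print("F1-score :", f1_score(labels, preds))
```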

4 of 11

Logistic Regression

Practical Tasks:

Sentiment Analysis on IMDB Reviews:

  • Preprocess text (remove stopwords, tokenize, vectorize)
  • Train logistic regression on positive/negative reviews
  • Compare performance with and without feature selection
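The with/without feature selection comparison can be sketched as two sklearn pipelines, one with a chi-squared SelectKBest step inserted between vectorization and the classifier. The six toy reviews and the choice of k=10 are illustrative stand-ins for the IMDB dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative stand-in for IMDB reviews (1 = positive, 0 = negative)
reviews = [
    "a wonderful film, the acting was wonderful",
    "wonderful plot and gripping, loved it",
    "an excellent, wonderful movie",
    "terrible pacing and a terrible, boring story",
    "awful dialogue, terrible film overall",
    "a dull, terrible and disappointing movie",
]
sentiment = [1, 1, 1, 0, 0, 0]

# Pipeline without feature selection
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline with chi-squared feature selection (keep the 10 best features)
selected = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("select", SelectKBest(chi2, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

for name, model in [("baseline", baseline), ("chi2 top-10", selected)]:
    model.fit(reviews, sentiment)
    acc = model.score(reviews, sentiment)  # training accuracy, illustration only
    print(f"{name}: {acc:.2f}")
```

On a real corpus, compare the two pipelines on a held-out test set: feature selection often keeps accuracy close to the baseline while shrinking the model considerably.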

5 of 11

Logistic Regression

# Step 1: Import the necessary libraries
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

6 of 11

Logistic Regression

# Step 2: Load the dataset (using two categories for binary classification)
categories = ['sci.space', 'rec.autos']
newsgroups = fetch_20newsgroups(subset='all',
                                categories=categories,
                                remove=('headers', 'footers', 'quotes'))
# The data is stored in newsgroups.data (list of text documents)
# and the target labels in newsgroups.target

7 of 11

Logistic Regression

# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    newsgroups.data,
    newsgroups.target,
    test_size=0.2,
    random_state=42
)

8 of 11

Logistic Regression

# Step 4: Preprocess the text using TF-IDF vectorization
# This converts text data into numerical feature vectors.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

9 of 11

Logistic Regression

# Step 5: Initialize and train the Logistic Regression model
# Increase max_iter if convergence warnings occur.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

# Step 6: Predict on the test set
y_pred = clf.predict(X_test_tfidf)

10 of 11

Logistic Regression

# Step 7: Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

11 of 11

Thank You