1 of 18

CS458 Natural Language Processing

Self-study 5: Naive Bayes, Text Classification & Sentiment Analysis

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 18

Naive Bayes

1. Scikit-learn

Install Scikit-learn: pip install scikit-learn

  1. Available Naive Bayes Models (brief usage sketches of GaussianNB and BernoulliNB follow this list):
    • GaussianNB: For continuous features assumed to be normally distributed within each class.
    • MultinomialNB: For text classification and other count-based data.
    • BernoulliNB: For binary/boolean (presence/absence) features.
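
For reference, a minimal sketch of the two variants not used in the main example, on made-up toy arrays (the data below is purely illustrative):

import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

# Continuous features -> GaussianNB (per-class normal distribution for each feature)
X_cont = np.array([[1.2, 0.7], [0.9, 1.1], [3.4, 2.8], [3.1, 3.0]])
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 1.0]]))

# Binary features -> BernoulliNB (models presence/absence of each feature)
X_bin = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))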

Example Code:

from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

3 of 18

Naive Bayes

# Sample data

texts = ["I love programming", "Python is great", "I hate bugs", "Debugging is hard"]

labels = [1, 1, 0, 0]

# Vectorize the text

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(texts)

# Train/test split

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

4 of 18

Naive Bayes

# Train Naive Bayes model

model = MultinomialNB()

model.fit(X_train, y_train)

# Predict and evaluate

predictions = model.predict(X_test)

print("Predictions:", predictions)

5 of 18

Naive Bayes

2. NLTK

Install NLTK: pip install nltk

Example Code:

import nltk

from nltk.classify import NaiveBayesClassifier

from nltk.classify.util import accuracy

# Sample data

training_data = [
    ({"love": True, "programming": True}, "positive"),
    ({"hate": True, "bugs": True}, "negative"),
    ({"debugging": True, "hard": True}, "negative"),
]

6 of 18

Naive Bayes

# Train the classifier

classifier = NaiveBayesClassifier.train(training_data)

# Test the classifier

test_data = [{"love": True, "python": True}, {"hate": True, "debugging": True}]

for test in test_data:
    print(f"Classification for {test}: {classifier.classify(test)}")

# Display accuracy

print("Accuracy:", accuracy(classifier, training_data))

7 of 18

Naive Bayes

3. Custom Implementation

Example Code:

import numpy as np

# Sample data

X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]])

y = np.array([0, 1, 1, 0])

# Calculate probabilities

def train_naive_bayes(X, y):
    classes = np.unique(y)
    priors = {cls: np.mean(y == cls) for cls in classes}
    likelihoods = {cls: np.mean(X[y == cls], axis=0) for cls in classes}
    return priors, likelihoods

priors, likelihoods = train_naive_bayes(X, y)

print("Priors:", priors)

print("Likelihoods:", likelihoods)

8 of 18

Text Classification

1. Scikit-learn

Key Features:

  • Supports Naive Bayes, Logistic Regression, SVM, etc.
  • Easy-to-use API for preprocessing and model evaluation.

Example Code:

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import MultinomialNB

# Sample data

texts = ["I love programming", "Python is amazing", "Debugging is hard", "I hate bugs"]

labels = [1, 1, 0, 0]

9 of 18

Text Classification

# Vectorization

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(texts)

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Train classifier

clf = MultinomialNB()

clf.fit(X_train, y_train)

# Predict

predictions = clf.predict(X_test)

print(predictions)

10 of 18

Text Classification

2. NLTK

Example Code:

import nltk

from nltk.classify import NaiveBayesClassifier

# Helper: NLTK expects feature dictionaries, so represent each sentence
# as a bag of word-presence features rather than a raw string
def word_features(text):
    return {word: True for word in text.lower().split()}

# Training data
training_data = [
    (word_features("I love programming"), "positive"),
    (word_features("Python is amazing"), "positive"),
    (word_features("Debugging is hard"), "negative"),
    (word_features("I hate bugs"), "negative"),
]

# Train classifier
classifier = NaiveBayesClassifier.train(training_data)

# Test data
test_data = word_features("I love Python")
print("Classification:", classifier.classify(test_data))

11 of 18

Text Classification

3. TensorFlow/Keras

Key Features:

  • Handles large-scale data with advanced deep learning models.
  • Pretrained embeddings like Word2Vec, GloVe, and BERT.

Example Code:

import numpy as np

import tensorflow as tf

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Embedding, LSTM, Dense

# Sample data

texts = ["I love programming", "Python is amazing", "Debugging is hard", "I hate bugs"]

labels = [1, 1, 0, 0]

12 of 18

Text Classification

# Preprocess text (tokenization and padding)

tokenizer = tf.keras.preprocessing.text.Tokenizer()

tokenizer.fit_on_texts(texts)

X = tokenizer.texts_to_sequences(texts)

X = tf.keras.preprocessing.sequence.pad_sequences(X, maxlen=5)

# Create model

model = Sequential([
    Embedding(input_dim=50, output_dim=16, input_length=5),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

# Compile and train model

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, np.array(labels), epochs=5, verbose=1)
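
A short inference sketch for the trained model (illustrative only, since four sentences are far too little data for a meaningful LSTM):

new_seq = tokenizer.texts_to_sequences(["I love Python"])
new_seq = tf.keras.preprocessing.sequence.pad_sequences(new_seq, maxlen=5)
print(model.predict(new_seq))  # probability of the positive class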

13 of 18

Text Classification

4. PyTorch

Key Features:

  • Supports advanced models like RNNs, CNNs, and Transformers.
  • Pretrained models available via torchtext or Hugging Face Transformers.

Example Code:

import torch

from torchtext.legacy.data import Field, TabularDataset, BucketIterator

from torchtext.data.utils import get_tokenizer

# Define fields

TEXT = Field(tokenize=get_tokenizer("basic_english"), lower=True)

LABEL = Field(sequential=False, use_vocab=False)

# Example training and processing

# Define your model as per your needs
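
As a starting point for the "define your model" step, a minimal sketch of an embedding-plus-linear classifier; the layer sizes here are illustrative assumptions, not a prescribed architecture:

import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, text):
        # text: (seq_len, batch) tensor of token indices from the TEXT field
        embedded = self.embedding(text).mean(dim=0)  # average word embeddings
        return self.fc(embedded)                     # class scores (logits)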

14 of 18

Sentiment Analysis

1. VADER (Valence Aware Dictionary and sEntiment Reasoner)

Installation:

pip install vaderSentiment

Key Features:

  • Simple to use.
  • Optimized for social media and informal text.

Example Code:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

text = "I love programming! It's amazing!"

score = analyzer.polarity_scores(text)

print(score)
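
VADER's compound score summarizes overall polarity in [-1, 1]; the thresholds below follow the commonly cited convention from the VADER documentation:

# Map the compound score to a coarse sentiment label
if score['compound'] >= 0.05:
    label = "positive"
elif score['compound'] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print("Label:", label)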

15 of 18

Sentiment Analysis

2. Hugging Face Transformers

Installation:

pip install transformers

Key Features:

  • Uses transformer-based models like BERT, GPT-2, etc.
  • Fine-tuning capabilities for custom datasets.

Example Code:

from transformers import pipeline

classifier = pipeline('sentiment-analysis')

result = classifier("I love programming in Python!")

print(result)
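
Because the default checkpoint can change between library versions, it is usually safer to pin a model explicitly; a sketch assuming the widely used SST-2 DistilBERT checkpoint:

classifier = pipeline('sentiment-analysis',
                      model='distilbert-base-uncased-finetuned-sst-2-english')
print(classifier(["I love programming in Python!", "I hate bugs"]))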

16 of 18

Sentiment Analysis

3. AllenNLP

Installation:

pip install allennlp

Key Features:

  • Deep learning models for sentiment classification.
  • Easy-to-use APIs.

Example Code:

from allennlp.predictors import Predictor

predictor = Predictor.from_path("https://allennlp.s3.amazonaws.com/models/sentiment-analysis-bert.tar.gz")

result = predictor.predict(sentence="I love this!")

print(result)

17 of 18

Sentiment Analysis

4. PyTorch

Installation:

pip install torch

Key Features:

  • Deep learning-based approach.
  • Flexibility to build custom models for sentiment analysis.

Example Code:

import torch

from torchtext.legacy.data import Field, TabularDataset, BucketIterator

from torchtext.legacy import data

TEXT = Field(tokenize='spacy', batch_first=True)

LABEL = Field(sequential=False, use_vocab=False)

# Define fields, model, and other necessary steps as per your dataset
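
To make the stub concrete, one possible sketch of loading a CSV and batching it with the imported classes; the file name and column layout are assumptions for illustration:

# Assumes a hypothetical data.csv with a text column and an integer label column
fields = [("text", TEXT), ("label", LABEL)]
dataset = TabularDataset(path="data.csv", format="csv", fields=fields, skip_header=True)
TEXT.build_vocab(dataset)

iterator = BucketIterator(dataset, batch_size=2, sort_key=lambda ex: len(ex.text))
for batch in iterator:
    print(batch.text.shape, batch.label)  # (batch, seq_len) since batch_first=True
    break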

18 of 18

Thank You