Depersonalize
Team Guide
ASSOCIATE PROF. SYED MUJIB RAHAMAN
Team Members
Diksha Patro B.
319136410032
Satwika G.
319136410039
Meghana G.
319136410045
Manaswini A.
319136410008
Progress till now
Progress
Analyzed k-anonymity.
Review 1
Title & Abstract
Review 2
Analysis & Design Document
Review 3
Implementation
Progress
Went into detail on the system requirements for k-anonymity.
Challenges
We were stuck on how to differentiate our approach from the general implementation.
Progress
1. Added two more layers, beginning with an initial SelectKBest algorithm to select important attributes.
2. Adding these layers made the process less manual and more automated.
Challenges
1. Were asked to add pseudocode.
2. Were asked to provide a benchmark dataset and a dataset description.
Review 4
PPT + rough documentation
Progress
1. Implemented the feedback given in Review 3.
2. Added pseudocode and a dataset description.
3. Completed rough documentation.
4. Added testing.
Review 5
Internal
Progress
Updated the UI to accept the dataset.
ABSTRACT
Depersonalize is a privacy-preserving network security project that aims to protect the privacy of individuals using an anonymization method called k-anonymity.
Data anonymization is the practice of protecting private or confidential information by deleting or encoding the identifiers that link individuals to stored data. It aims to make each individual record indistinguishable from a group of records using the techniques of generalization and suppression.
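As a minimal sketch of the two techniques (the field names and values here are illustrative, not from the project dataset), suppression deletes an identifier outright while generalization replaces a precise value with a broader one:

```python
# Illustrative example of suppression and generalization on one record.
# Field names and values are hypothetical, not from the project dataset.
record = {"name": "Alice", "age": 34, "zip": "530013"}

# Suppression: delete a direct identifier entirely.
record["name"] = "*"

# Generalization: replace an exact age with its decade range.
decade = record["age"] // 10 * 10
record["age"] = f"{decade}-{decade + 9}"  # 34 -> "30-39"

# Generalization: truncate a zip code to its prefix.
record["zip"] = record["zip"][:3] + "***"  # "530013" -> "530***"

print(record)  # {'name': '*', 'age': '30-39', 'zip': '530***'}
```

After these steps the record can no longer be traced to one person, only to the group of 30-39-year-olds in the 530*** area.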
TECHNOLOGIES
1. Python with Pandas, Matplotlib and other libraries
2. JSON and PHP
INTRODUCTION
People have the right to keep their privacy inviolate, free from infringement by third parties. For social networks, we use k-anonymity to preserve the privacy of the individuals whose data is stored on the server.
Turning a dataset into a k-anonymous (and possibly l-diverse or t-close) dataset is a complex problem, and finding the optimal partition into k-anonymous groups is NP-hard.
We use SelectKBest to select the quasi-identifiers and then anonymize those quasi-identifiers, depersonalizing the dataset as a whole.
MOTIVATION
Privacy is a fundamental human right. It concerns the protection of an individual's personal information from unauthorized use and disclosure, and is often described as a condition of independence in which individuals can safeguard their own personal integrity.
As such, people have the right to keep their privacy inviolate, free from infringement by third parties. Privacy can be defined as freedom from unwarranted intrusion into personal affairs, especially by governmental authorities or by others; it requires that people be able to act as individuals and conduct their own affairs without interference.
PROBLEM STATEMENT
Data privacy is a discipline that seeks to protect data from unauthorised access, theft, or loss. This is especially important in social networks, which are more prone to hacking and to data being exposed on the dark web.
It is critical to keep data confidential and secure by practising good data management and preventing unauthorised access that could result in data loss, alteration, or theft.
REQUIREMENT ANALYSIS
Functional Requirements
Non-Functional Requirements
SOFTWARE REQUIREMENT SPECIFICATION
Software Requirements
Hardware Requirements
UML DIAGRAMS
USE CASE DIAGRAM
ACTIVITY DIAGRAM
SEQUENCE DIAGRAM
CLASS DIAGRAM
Project process overview
Dataset retrieval
SelectKBest algorithm -> Selecting “important” attributes
K-anonymity -> Anonymize “important” attributes
“Depersonalized” data
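The four steps above can be sketched as two small functions. This is a simplified sketch under assumptions: `select_important_attributes` and `depersonalize` are placeholder names, and the bucket-width generalization stands in for the project's actual anonymization step:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

def select_important_attributes(X, y, k=5):
    """Step 2: score every attribute against the target and keep the top k."""
    selector = SelectKBest(score_func=mutual_info_regression, k=k)
    selector.fit(X, y)
    return list(X.columns[selector.get_support(indices=True)])

def depersonalize(df, important, bucket=2):
    """Step 3 (sketch): generalize the selected attributes by coarsening
    their numeric (label-encoded) values into buckets of the given width."""
    out = df.copy()
    for col in important:
        out[col] = out[col] // bucket * bucket
    return out
```

Step 1 (dataset retrieval) is a `pd.read_csv` call, and step 4 is simply the DataFrame that `depersonalize` returns.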
SelectKBest Overview
SelectKBest Algorithm
[Diagram] SelectKBest depends on a scoring_function and generates the important features. In our case the scoring function is mutual_info_regression, which requires the dataset and a target variable as input and generates a score for each feature; the highest-scoring features become the "important" features.
SelectKBest works by retaining the first k features of X with the highest scores.
Ref: medium.com
SelectKBest Algorithm steps
Target Variable in SelectKBest
Now, the question is: how do we select this target variable for different types of data?
In general, the target variable is chosen based on the question we want to answer or the problem we want to solve.
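One detail worth noting: mutual_info_regression treats the target as continuous. When the target is categorical (like a binary outcome column), scikit-learn also provides mutual_info_classif, which plugs into SelectKBest the same way. A small sketch on synthetic data (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a dataset with a binary target.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "informative": rng.integers(0, 2, 200),   # fully determines the target
    "noise": rng.normal(size=200),            # unrelated to the target
})
y = X["informative"]

# mutual_info_classif is the classification counterpart of mutual_info_regression.
selector = SelectKBest(score_func=mutual_info_classif, k=1)
selector.fit(X, y)
print(X.columns[selector.get_support(indices=True)].tolist())  # ['informative']
```

The scorer ranks `informative` far above `noise`, so SelectKBest keeps it, which is exactly the behaviour we rely on when picking quasi-identifiers.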
SelectKBest implementation
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv('Healthcare.csv')
# Encode categorical variables using label encoding
le = LabelEncoder()
for column in data.columns:
    data[column] = le.fit_transform(data[column])
X = data[['Pregnancies','Glucose', 'BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction', 'Age','Outcome']]
y = data['Outcome']
# specify the value of k
k = len(X.columns)
selector = SelectKBest(score_func=mutual_info_regression, k=k)
selector.fit(X, y)
top_features = selector.scores_
top_features_index = selector.get_support(indices=True)
features = X.columns[top_features_index]
scores = top_features[top_features_index]
for i, feature in enumerate(X.columns[top_features_index]):
    print(f'Feature {i+1}: {feature}, Score: {top_features[top_features_index[i]]}')
plt.bar(features, scores)
plt.xlabel('Feature')
plt.ylabel('Score')
plt.title('Feature Selection Scores for k=' + str(k))
plt.show()
Output
Feature 1: Pregnancies, Score: 0.0084942637476777
Feature 2: Glucose, Score: 0.08644442667100716
Feature 3: BloodPressure, Score: 0.018855057618854865
Feature 4: SkinThickness, Score: 0.03266666458992251
Feature 5: Insulin, Score: 0.025863325344521737
Feature 6: BMI, Score: 0.04779110084319793
Feature 7: DiabetesPedigreeFunction, Score: 0.01708609909554948
Feature 8: Age, Score: 0.041768434554806166
Feature 9: Outcome, Score: 0.6773941863649848
DATASET DESCRIPTION
https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-datapoints
E-commerce (electronic commerce) is the activity of buying or selling products electronically on online services or over the Internet. E-commerce draws on technologies such as mobile commerce, electronic funds transfer, supply chain management, Internet marketing, online transaction processing, electronic data interchange (EDI), inventory management systems, and automated data collection systems. E-commerce is in turn driven by the technological advances of the semiconductor industry, and is the largest sector of the electronics industry.
BigBasket is the largest online grocery supermarket in India. It was launched around 2011 and has been expanding its business ever since. Although new competitors such as Blinkit have gained a foothold in the country, BigBasket has not lost ground, thanks to its ever-expanding user base and the broader shift to online buying.
2. K-Anonymity
What is K-Anonymity?
K-anonymity is a privacy-preserving technique designed to protect the privacy of individuals by ensuring that no individual in a dataset can be uniquely identified based on their attribute values.
Why are we choosing k-anonymity?
Relatively simple to understand and implement.
Ensures that individuals cannot be uniquely identified based on their attribute values.
Can be applied to a wide range of data types: tabular, categorical, and continuous.
Can be applied to both structured and unstructured data.
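The property itself is easy to state as a check: every combination of quasi-identifier values must occur at least k times. A small verification helper (a sketch for illustration, not part of the project code):

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Return True if every quasi-identifier combination appears >= k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Example: ages generalized into two buckets, zip fully generalized.
df = pd.DataFrame({
    "age": ["20-30", "20-30", "30-40", "30-40", "30-40"],
    "zip": ["530***"] * 5,
})
print(is_k_anonymous(df, ["age", "zip"], k=2))  # True: smallest group has 2 rows
print(is_k_anonymous(df, ["age", "zip"], k=3))  # False: the 20-30 group has only 2
```

A check like this can be run after any generalization step to confirm the chosen k actually holds.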
k-anonymity steps
K-anonymity implementation
import pandas as pd
# Load the dataset
data = pd.read_csv('Healthcare.csv')
# Define k
k = 2
# Anonymize the Age feature by creating age groups
age_groups = pd.cut(data['Age'], bins=[20, 30, 40, 50, 60, 70, 80, 90, 100], labels=['20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100'])
data['Age'] = age_groups
# Anonymize the Glucose feature by creating glucose level groups
glucose_groups = pd.cut(data['Glucose'], bins=[0, 100, 200, 300, 400, 500], labels=['0-100', '100-200', '200-300', '300-400', '400-500'])
data['Glucose'] = glucose_groups
# Anonymize the Pregnancies feature by creating pregnancy groups
pregnancy_groups = pd.cut(data['Pregnancies'], bins=[0, 5, 10, 15, 20, 25], labels=['0-5', '5-10', '10-15', '15-20', '20-25'])
data['Pregnancies'] = pregnancy_groups
# Anonymize the Outcome feature
data['Outcome'] = data['Outcome'].apply(lambda x: '*' if x == 0 else '**')
# Save the anonymized data to a new file
data.to_csv('anonymized_data.csv', index=False)
Input
Output
Top 5 features:
1. brand (1.3518)
2. category (0.9775)
3. sub_category (0.7965)
4. type (0.5966)
5. sale_price (0.4202)
Output
INTEGRATED CODE
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv('BigBasket.csv')
# Encode categorical variables using label encoding
le = LabelEncoder()
for column in data.columns:
    data[column] = le.fit_transform(data[column])
X = data[['index','product','category','sub_category','brand','sale_price','market_price','type','rating','description']]
y = data['brand']
# Use SelectKBest with mutual information to find the top 5 features
selector = SelectKBest(score_func=mutual_info_regression, k=5)
selector.fit(X, y)
# Print the top 5 features and their mutual information scores
top_features = selector.scores_
top_features_index = selector.get_support(indices=True)
feature_names = []
scores = []
for i, feature in enumerate(X.columns[top_features_index]):
    feature_names.append(feature)
    scores.append(top_features[top_features_index[i]])
# Sort the features based on their scores in descending order
sorted_features = sorted(zip(feature_names, scores), key=lambda x: x[1], reverse=True)
print("Top 5 features:")
for i in range(5):
    print(f'{i+1}. {sorted_features[i][0]} ({sorted_features[i][1]:.4f})')
# Define k-anonymity rules for the top 5 features
k_anonymity_rules = {
    'index': None,
    'product': 2,
    'category': 3,
    'sub_category': 3,
    'brand': 2
}
# Apply k-anonymity to the top 5 features based on the defined rules
for feature in sorted_features[:5]:
    feature_name = feature[0]
    k = k_anonymity_rules.get(feature_name, None)
    if k is not None:
        X[feature_name] = X[feature_name] // k * k
# Save the anonymized data to a new CSV file
X.to_csv('BigBasket_anonymized.csv', index=False)
# Plot the feature selection scores
plt.bar(feature_names, scores)
# Add labels and title
plt.xlabel('Feature')
plt.ylabel('Score')
plt.title('Feature Selection Scores')
# Show the plot
plt.show()
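One caveat about the generalization step above: `// k * k` coarsens the label-encoded values into buckets of width k, but it does not by itself guarantee that every bucket contains at least k records, so k-anonymity still needs to be verified on the output. A small demonstration of the gap (synthetic codes, not the BigBasket data):

```python
import pandas as pd

# Bucketing with // k * k does not guarantee k-anonymity:
# a bucket at the edge of the value range may hold fewer than k records.
codes = pd.Series([0, 1, 2, 3, 4])  # label-encoded values
k = 2
bucketed = codes // k * k           # -> 0, 0, 2, 2, 4
sizes = bucketed.value_counts()
print(sizes.min())                  # 1 -> the bucket "4" contains a single record
```

In practice this means the anonymized file should be checked group by group, and under-sized groups either merged into wider buckets or suppressed.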
TESTING
Testing is the process of evaluating a system or application to identify errors, defects, or gaps in requirements. It is an essential part of the software development life cycle to ensure that the software is functional, reliable, and secure.
There are two types of testing:
1. Black box testing
2. White box testing
BLACK BOX TESTING
Black box testing is a method of testing where the tester focuses on the inputs and outputs of the system under test without considering its internal workings or code. It is a high-level testing approach that focuses on the functionality of the software system and is usually performed by the testing team.
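For this project, a black box test exercises only the system's inputs and outputs: feed in exact ages, check that only ranges come out, without looking at how the generalization is coded. A sketch using `unittest` (the function below is a stand-in for the project's age-generalization step, with the same bins as the implementation):

```python
import unittest
import pandas as pd

def anonymize_age(df):
    """System under test: generalizes Age into decade ranges, as in the project code."""
    bins = [20, 30, 40, 50, 60, 70, 80, 90, 100]
    labels = ['20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100']
    out = df.copy()
    out['Age'] = pd.cut(out['Age'], bins=bins, labels=labels)
    return out

class BlackBoxAgeTest(unittest.TestCase):
    """Checks only inputs and outputs, never the internal implementation."""
    def test_exact_ages_are_hidden(self):
        df = pd.DataFrame({'Age': [25, 47, 83]})
        result = anonymize_age(df)
        self.assertEqual(list(result['Age']), ['20-30', '40-50', '80-90'])

if __name__ == '__main__':
    unittest.main(argv=['blackbox'], exit=False)  # run without exiting the interpreter
```

The same pattern applies to the other generalized columns: assert on the output file's values, not on the code that produced them.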
Here are some types of black box testing:
FUNCTIONAL TESTING
White Box Testing
White box testing is a method of testing that involves testing the internal structure of the software system, including its code, architecture, and design. It is a low-level testing approach that focuses on the implementation of the software system and is usually performed by developers or testing engineers who have access to the source code.
Here are some types of white box testing:
Data flow testing
Data flow testing is a white box testing technique that aims to identify potential defects or issues by analyzing how data flows through the software code. For the provided code, data flow testing scenarios could include: tracing the `data` DataFrame from `read_csv` through label encoding to the anonymized output, confirming that every column is numeric after encoding, and checking that each variable (such as `top_features` and `sorted_features`) is defined before it is used and used after it is defined.
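As a sketch of one such check, the following traces a single column through the encode-then-generalize flow, asserting each intermediate state (the data is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Data flow check: follow one column from raw values, through label encoding,
# to the generalized (bucketed) values, asserting each intermediate state.
raw = pd.DataFrame({'brand': ['A', 'B', 'A', 'C']})

le = LabelEncoder()
encoded = le.fit_transform(raw['brand'])  # definition -> use: the encoding step
assert set(encoded) == {0, 1, 2}          # every raw value became a numeric code

k = 2
bucketed = encoded // k * k               # use -> use: the generalization step
assert set(bucketed) <= {0, 2}            # codes collapsed into buckets of width 2

print(bucketed.tolist())                  # [0, 0, 0, 2]
```

Each assertion pins down the state of the data at one point in its flow, which is exactly what data flow testing is after.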
Thank you!