
Depersonalize

Team Guide

ASSOCIATE PROF. SYED MUJIB RAHAMAN


Team Members

Diksha Patro B.

319136410032

Satwika G.

319136410039

Meghana G.

319136410045

Manaswini A.

319136410008


Progress till now

Review 1: Title & Abstract

Progress: Analyzed k-anonymity.

Review 2: Analyzing & Designing Document

Progress: Studied the system requirements for k-anonymity in detail.

Challenges: Were stuck on how to differentiate our work from the generic implementation.

Review 3: Implementation

Progress:

  1. Added two further layers, starting with a SelectKBest step to select the important attributes.
  2. Adding these layers made the process less manual and more automated.

Challenges:

  1. Were asked to add pseudocode.
  2. Were asked to provide a benchmark dataset and a dataset description.

Review 4: PPT + Rough documentation

Progress:

  1. Implemented the feedback given in Review 3.
  2. Added pseudocode and a dataset description.
  3. Completed the rough documentation.
  4. Added testing.

Review 5: Internal

Progress: Added a UI for accepting the dataset.


ABSTRACT

Depersonalize is a privacy-preserving network security project that protects the privacy of individuals using an anonymization method called k-anonymity.

Data anonymization is the practice of protecting private or confidential information by deleting or encoding the identifiers that link individuals to the stored data. It aims to make each individual record indistinguishable within a group of records using the techniques of generalization and suppression.
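As a toy illustration (not taken from the project code; the column names and values are made up), generalization coarsens a value into a range while suppression masks part of it:

```python
import pandas as pd

# Made-up records: 'zip' and 'age' stand in for identifying attributes
df = pd.DataFrame({
    "zip": ["530016", "530017", "530045"],
    "age": [23, 27, 41],
})

# Generalization: replace exact ages with coarse ranges
df["age"] = pd.cut(df["age"], bins=[0, 30, 60, 100],
                   labels=["0-30", "30-60", "60-100"])

# Suppression: mask the trailing digits of the ZIP code
df["zip"] = df["zip"].str[:3] + "***"

print(df)
```

After these two steps the first two records become identical on both attributes, which is exactly the indistinguishability that k-anonymity aims for.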


TECHNOLOGIES

1. Python with Pandas, Matplotlib and other libraries

2. JSON and PHP


INTRODUCTION

People have the right to keep their privacy inviolate and free from infringement by third parties. To preserve the privacy of individuals whose data is stored on social network servers, we use k-anonymity.

Transforming a dataset into a k-anonymous (and possibly l-diverse or t-close) dataset is a complex problem, and finding the optimal partition into k-anonymous groups is NP-hard.

We use SelectKBest to select the quasi-identifiers, then pass the dataset on to anonymize those quasi-identifiers and thereby anonymize the dataset as a whole.


MOTIVATION

Privacy is a fundamental human right. It concerns the protection of an individual's personal information from unauthorised use and disclosure. Privacy is often described as a condition of independence in which individuals can be trusted to safeguard their own personal integrity.

As such, people have the right to keep their privacy inviolate and free from infringement by third parties. Privacy can be defined as freedom from unwarranted intrusion into personal affairs, especially by governmental authorities or by others. It allows people to act as individuals and conduct their own affairs without interference.


PROBLEM STATEMENT

Data privacy is the discipline of protecting data from unauthorised access, theft, or loss. This matters especially in social networks, which are more prone to hacking and to data being exposed on the dark web.

It is critical to keep data confidential and secure by practising good data management and preventing unauthorised access that could result in data loss, alteration, or theft.


REQUIREMENT ANALYSIS

Functional Requirements

  • Data Selection
  • K-Anonymity Implementation
  • Privacy Assessment
  • Data Visualization

Non - Functional Requirements

  • Performance
  • Scalability
  • Security
  • Reliability


SOFTWARE REQUIREMENT SPECIFICATION


Software Requirements

  • Operating System: Windows 8
  • Python IDLE and libraries

Hardware Requirements

  • RAM: 4 GB
  • Processor: 64-bit, 1.4 GHz minimum per core


UML DIAGRAMS


USE CASE DIAGRAM


ACTIVITY DIAGRAM


SEQUENCE DIAGRAM


CLASS DIAGRAM


Project process overview

  1. Dataset retrieval
  2. SelectKBest algorithm -> select the “important” attributes
  3. K-anonymity -> anonymize the “important” attributes
  4. “Depersonalized” data


  1. SelectKBest


SelectKBest Overview

The SelectKBest algorithm depends on a scoring function and generates scores for the candidate features. In our case the scoring function is mutual_info_regression, which takes the dataset and a target variable as input and generates a score for each feature; the highest-scoring features are the “important” ones.

SelectKBest works by retaining the first k features of X with the highest scores.
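This retention behaviour can be seen directly on synthetic data (a self-contained sketch, not the project dataset): with four random features and a target built almost entirely from feature 1, SelectKBest keeps the highest-scoring columns:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                       # four candidate features
y = 3 * X[:, 1] + rng.normal(scale=0.1, size=200)   # target driven by feature 1

selector = SelectKBest(score_func=mutual_info_regression, k=2)
selector.fit(X, y)

print(selector.scores_)                    # one mutual-information score per feature
print(selector.get_support(indices=True))  # indices of the 2 highest-scoring features
```

Feature 1 receives by far the largest score, so its index always appears in the selected set.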


SelectKBest Algorithm steps

  1. Load the dataset and select the features and target variable you want to use.
  2. Initialize a SelectKBest object with the desired scoring function and number of features to select.
  3. Fit the SelectKBest object to the input features and target variable.
  4. Use the .scores_ and .get_support() attributes of the SelectKBest object to obtain the scores and indices of the top k features.
  5. Print or use the top k features as desired.


Target Variable in SelectKBest

Now, the question is: how do we select the target variable for different types of data?

  1. The target variable is chosen based on the question we want to answer or the problem we want to solve.
  2. The other features/attributes are chosen based on this target variable.


SelectKBest implementation

import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('Healthcare.csv')

# Encode categorical variables using label encoding
le = LabelEncoder()
for column in data.columns:
    data[column] = le.fit_transform(data[column])

X = data[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
          'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']]
y = data['Outcome']

# Specify the value of k (keep all features so every score can be inspected)
k = len(X.columns)

selector = SelectKBest(score_func=mutual_info_regression, k=k)
selector.fit(X, y)

top_features = selector.scores_
top_features_index = selector.get_support(indices=True)
features = X.columns[top_features_index]
scores = top_features[top_features_index]

for i, feature in enumerate(X.columns[top_features_index]):
    print(f'Feature {i+1}: {feature}, Score: {top_features[top_features_index[i]]}')

# Plot the feature selection scores
plt.bar(features, scores)
plt.xlabel('Feature')
plt.ylabel('Score')
plt.title('Feature Selection Scores for k=' + str(k))
plt.show()


Output

Feature 1: Pregnancies, Score: 0.0084942637476777

Feature 2: Glucose, Score: 0.08644442667100716

Feature 3: BloodPressure, Score: 0.018855057618854865

Feature 4: SkinThickness, Score: 0.03266666458992251

Feature 5: Insulin, Score: 0.025863325344521737

Feature 6: BMI, Score: 0.04779110084319793

Feature 7: DiabetesPedigreeFunction, Score: 0.01708609909554948

Feature 8: Age, Score: 0.041768434554806166

Feature 9: Outcome, Score: 0.6773941863649848


DATASET DESCRIPTION

https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-datapoints

E-commerce (electronic commerce) is the activity of electronically buying or selling of products on online services or over the Internet. E-commerce draws on technologies such as mobile commerce, electronic funds transfer, supply chain management, Internet marketing, online transaction processing, electronic data interchange (EDI), inventory management systems, and automated data collection systems. E-commerce is in turn driven by the technological advances of the semiconductor industry, and is the largest sector of the electronics industry.

BigBasket is the largest online grocery supermarket in India. It was launched around 2011 and has been expanding its business ever since. Although new competitors such as Blinkit have gained a foothold in the country, BigBasket has not lost ground, thanks to its ever-expanding customer base and the broader shift to online buying.


2. K-Anonymity


What is K-Anonymity?

K-anonymity is a privacy-preserving technique that protects individuals by ensuring that every record in a dataset is indistinguishable from at least k − 1 other records on its quasi-identifying attributes, so that no individual can be uniquely identified by their attribute values.


Why are we choosing k-anonymity?

  • Relatively simple to understand and implement.
  • Ensures that individuals cannot be uniquely identified by their attribute values.
  • Can be applied to a wide range of data types: tabular, categorical, and continuous.
  • Can be applied to both structured and unstructured data.


k-anonymity steps

  1. Identify the sensitive attributes in the data set using SelectKBest.
  2. Determine the value of k.
  3. Obscure the sensitive attributes in the data set.
  4. Implement a mechanism for accessing the original data when necessary.
  5. Test the data set to ensure that it meets the k-anonymity criteria.
  6. Regularly review and update the data set to ensure that it continues to meet the k-anonymity criteria as the data set changes over time.
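Step 5 can be sketched as a simple check (a minimal sketch, not the project's code; the helper names are ours): a table is k-anonymous when its smallest equivalence class over the quasi-identifiers contains at least k rows.

```python
import pandas as pd

def min_group_size(df, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers, observed=True).size().min())

def is_k_anonymous(df, quasi_identifiers, k):
    return min_group_size(df, quasi_identifiers) >= k

# Hypothetical generalized table: each (Age, Glucose) combination occurs twice
table = pd.DataFrame({
    "Age":     ["20-30", "20-30", "30-40", "30-40"],
    "Glucose": ["0-100", "0-100", "100-200", "100-200"],
})

print(is_k_anonymous(table, ["Age", "Glucose"], k=2))  # True
```

The same check re-run after every dataset update covers step 6 as well.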


K-anonymity implementation

import pandas as pd

# Load the dataset
data = pd.read_csv('Healthcare.csv')

# Define k
k = 2

# Anonymize the Age feature by creating age groups
age_groups = pd.cut(data['Age'],
                    bins=[20, 30, 40, 50, 60, 70, 80, 90, 100],
                    labels=['20-30', '30-40', '40-50', '50-60',
                            '60-70', '70-80', '80-90', '90-100'])
data['Age'] = age_groups

# Anonymize the Glucose feature by creating glucose-level groups
glucose_groups = pd.cut(data['Glucose'],
                        bins=[0, 100, 200, 300, 400, 500],
                        labels=['0-100', '100-200', '200-300', '300-400', '400-500'])
data['Glucose'] = glucose_groups

# Anonymize the Pregnancies feature by creating pregnancy groups
pregnancy_groups = pd.cut(data['Pregnancies'],
                          bins=[0, 5, 10, 15, 20, 25],
                          labels=['0-5', '5-10', '10-15', '15-20', '20-25'])
data['Pregnancies'] = pregnancy_groups

# Anonymize the Outcome feature by suppression
data['Outcome'] = data['Outcome'].apply(lambda x: '*' if x == 0 else '**')

# Save the anonymized data to a new file
data.to_csv('anonymized_data.csv', index=False)


Input


Output

Top 5 features:

1. brand (1.3518)

2. category (0.9775)

3. sub_category (0.7965)

4. type (0.5966)

5. sale_price (0.4202)


Output


INTEGRATED CODE

import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('BigBasket.csv')

# Encode categorical variables using label encoding
le = LabelEncoder()
for column in data.columns:
    data[column] = le.fit_transform(data[column])

X = data[['index', 'product', 'category', 'sub_category', 'brand', 'sale_price',
          'market_price', 'type', 'rating', 'description']]
y = data['brand']

# Use SelectKBest with mutual information to find the top 5 features
selector = SelectKBest(score_func=mutual_info_regression, k=5)
selector.fit(X, y)

# Collect the top 5 features and their mutual information scores
top_features = selector.scores_
top_features_index = selector.get_support(indices=True)
feature_names = []
scores = []
for i, feature in enumerate(X.columns[top_features_index]):
    feature_names.append(feature)
    scores.append(top_features[top_features_index[i]])

# Sort the features by score in descending order
sorted_features = sorted(zip(feature_names, scores), key=lambda x: x[1], reverse=True)
print("Top 5 features:")
for i in range(5):
    print(f'{i+1}. {sorted_features[i][0]} ({sorted_features[i][1]:.4f})')

# Define k-anonymity rules for the top 5 features
k_anonymity_rules = {
    'index': None,
    'product': 2,
    'category': 3,
    'sub_category': 3,
    'brand': 2
}

# Apply k-anonymity to the top 5 features based on the defined rules:
# rounding each encoded value down to a multiple of k generalizes it
# into a bucket shared with neighbouring values
for feature in sorted_features[:5]:
    feature_name = feature[0]
    k = k_anonymity_rules.get(feature_name, None)
    if k is not None:
        X[feature_name] = X[feature_name] // k * k

# Save the anonymized data to a new CSV file
X.to_csv('BigBasket_anonymized.csv', index=False)

# Plot the feature selection scores
plt.bar(feature_names, scores)
plt.xlabel('Feature')
plt.ylabel('Score')
plt.title('Feature Selection Scores')
plt.show()


TESTING

Testing is the process of evaluating a system or application to identify errors, defects, or gaps in requirements. It is an essential part of the software development life cycle to ensure that the software is functional, reliable, and secure.

There are two types of testing:

  1. Black box testing
  2. White box testing


BLACK BOX TESTING

Black box testing is a method of testing where the tester focuses on the inputs and outputs of the system under test without considering its internal workings or code. It is a high-level testing approach that focuses on the functionality of the software system and is usually performed by the testing team.

Here are some types of black box testing:

  1. Functional testing
  2. Integration testing
  3. Regression testing
  4. Usability testing
  5. Load testing
  6. Security testing
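A functional (black-box) test exercises only inputs and outputs. As a hedged sketch, assume a hypothetical wrapper `anonymize_age` around the pd.cut generalization from the implementation slide; the test never inspects its internals:

```python
import pandas as pd

def anonymize_age(data):
    """System under test (hypothetical wrapper): generalizes Age into ranges."""
    out = data.copy()
    out["Age"] = pd.cut(out["Age"], bins=[20, 30, 40, 50],
                        labels=["20-30", "30-40", "40-50"])
    return out

# Black-box check: feed a known input, inspect only the output
sample = pd.DataFrame({"Age": [25, 33, 47]})
result = anonymize_age(sample)

assert list(result["Age"].astype(str)) == ["20-30", "30-40", "40-50"]
assert 25 not in result["Age"].tolist()  # no exact age survives anonymization
print("functional test passed")
```

The same pattern extends to the other generalized columns: assert on the output ranges, never on how they were produced.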


FUNCTIONAL TESTING



White Box Testing

White box testing is a method of testing that involves testing the internal structure of the software system, including its code, architecture, and design. It is a low-level testing approach that focuses on the implementation of the software system and is usually performed by developers or testing engineers who have access to the source code.

Here are some types of white box testing:

  1. Statement Coverage Testing
  2. Branch Coverage Testing
  3. Condition Coverage Testing
  4. Path Coverage Testing
  5. Loop Coverage Testing
  6. Data Flow Testing
  7. Control Flow Testing


Data flow testing

Data flow testing is a white box testing technique that aims to identify potential defects or issues by analyzing the flow of data through the software code. In the provided code, some examples of potential data flow testing scenarios could include:

  • Checking whether the encoded categorical variables are correctly passed to the SelectKBest function and the feature selection algorithm.
  • Analyzing the data flow of the k-anonymity rules to ensure that the top 5 features are correctly identified and anonymized based on the defined rules.
  • Examining the data flow of the feature selection scores to ensure that they are correctly calculated and plotted in the bar chart.
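These checks can be written as assertions that follow the data from one definition to its use (a sketch on a toy frame, not the real dataset):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the BigBasket data
data = pd.DataFrame({
    "brand":      ["a", "b", "a", "c", "b", "a"],
    "category":   ["x", "x", "y", "y", "x", "y"],
    "sale_price": [10, 20, 10, 30, 20, 10],
})

# Def-use check 1: every column leaving the encoder is numeric
le = LabelEncoder()
encoded = data.apply(lambda col: le.fit_transform(col))
assert all(pd.api.types.is_integer_dtype(encoded[c]) for c in encoded.columns)

# Def-use check 2: the encoded frame reaches the selector and
# yields exactly one score per column
selector = SelectKBest(score_func=mutual_info_regression, k=2)
selector.fit(encoded, encoded["brand"])
assert len(selector.scores_) == encoded.shape[1]
print("data flow checks passed")
```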



Thank you!


Glossary/References