EDA UNIT-WISE 2 M QUESTIONS

UNIT1: Exploratory Data Analysis Fundamentals

Python Code Questions

Write Python code using the pandas library to load a CSV file named data.csv into a DataFrame.
Assuming a DataFrame df with a column 'age', write code to calculate the mean and median of this column.
Given a DataFrame df with columns 'city' and 'sales', write code to group the data by 'city' and sum the sales for each city.
Write code using the seaborn library to create a scatter plot from a DataFrame df with columns 'x' and 'y'.
Given a DataFrame df with a column 'price' containing missing values (NaN), write code to fill these missing values with the mean of the 'price' column.
Write Python code to find the number of unique values in a column named 'product_category' from a DataFrame df.
A DataFrame df has a column 'date' in string format. Write code to convert this column to a proper datetime format.
Write code using the pandas library to drop any duplicate rows from a DataFrame named df.
Given a DataFrame df with a numerical column 'income', write code to create a histogram using matplotlib.pyplot.
Write Python code to apply descriptive statistics (like count, mean, min, max, etc.) to all numerical columns in a DataFrame named df.

Descriptive Questions

What is the main purpose of Exploratory Data Analysis (EDA), and how does it differ from classical data analysis?
Explain the difference between numerical and categorical data. Give one example of each.
What are the key characteristics of the Ratio and Nominal measurement scales? Give a real-world example for each.
Describe what it means to "make sense of data" in EDA. What types of questions would an analyst ask themselves during this process?
Why is data visualization considered a crucial part of EDA?
Name two software tools widely used for EDA and briefly explain a key advantage of each.
What is the significance of the median over the mean when a dataset contains outliers?
Explain the difference between univariate and bivariate analysis.

Match the Case

Match the EDA step with its most suitable action.

EDA Step	Action
A. Getting Started	1. Visualizing the relationship between two variables.
B. Making Sense of Data	2. Calculating the mean and median of a variable.
C. Univariate Analysis	3. Loading the dataset into a DataFrame.
D. Bivariate Analysis	4. Exploring the data through visualizations and summary statistics.

Match the data type or scale with its appropriate characteristic.

Data Type / Scale	Characteristic
A. Categorical Data	1. Has an ordered sequence but no meaningful zero point.
B. Ratio Scale	2. Represents qualities or groups.
C. Interval Scale	3. Has a true zero point, allowing for ratios.
D. Numerical Data	4. Represents measurable quantities.

Solutions

Python Code Solutions

import pandas as pd

df = pd.read_csv('data.csv')

mean_age = df['age'].mean()

median_age = df['age'].median()

sales_by_city = df.groupby('city')['sales'].sum()

import seaborn as sns

sns.scatterplot(x='x', y='y', data=df)

df['price'] = df['price'].fillna(df['price'].mean())

num_unique = df['product_category'].nunique()

import pandas as pd

df['date'] = pd.to_datetime(df['date'])

df.drop_duplicates(inplace=True)

import matplotlib.pyplot as plt

plt.hist(df['income'])

plt.show()

descriptive_stats = df.describe()

Descriptive Solutions

EDA is the process of summarizing and visualizing data to understand its key characteristics and identify patterns, often without a predefined hypothesis. Classical data analysis is a formal, confirmatory process that starts from a specific hypothesis and uses statistical tests to decide whether the data support or contradict it.
Numerical data represents measurable quantities (e.g., age, height), while categorical data represents qualities or groups (e.g., gender, country).
The Ratio scale has a true zero point, allowing for meaningful ratios (e.g., height, weight). The Nominal scale categorizes data without any order (e.g., car colors, gender).
"Making sense of data" means developing an intuitive understanding of the dataset through exploration. An analyst would ask questions like, "What do these variables represent?" and "Are there any patterns or anomalies here?"
Data visualization is crucial in EDA because it allows analysts to quickly spot patterns, trends, outliers, and anomalies that would be difficult to find in raw data or summary statistics alone.
Two software tools for EDA are Python (with libraries like Pandas and Seaborn) and R. Python is versatile and fits well into end-to-end data science pipelines, while R is known for its strong statistical and graphical capabilities.
The median is a more robust measure of central tendency than the mean because it is far less affected by extreme values or outliers, making it a better representation of the center of a skewed dataset.
Univariate analysis examines a single variable to understand its distribution (e.g., a histogram of a variable). Bivariate analysis examines the relationship between two variables (e.g., a scatter plot showing the relationship between age and income).
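The effect of an outlier on the mean and the median can be checked with a small example. This is a minimal sketch; the income values are hypothetical and chosen only so that the last one is an obvious outlier.

```python
import pandas as pd

# Hypothetical incomes; the last value is a deliberate outlier.
income = pd.Series([30_000, 32_000, 35_000, 38_000, 1_000_000])

mean_income = income.mean()      # pulled upward by the single outlier
median_income = income.median()  # middle value of the sorted data

print("Mean:", mean_income)
print("Median:", median_income)
```

Here the mean is 227000 while the median stays at 35000, which is much closer to what a typical value in this sample looks like.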

Match the Case Solutions

A matches 3, B matches 4, C matches 2, D matches 1.
A matches 2, B matches 3, C matches 1, D matches 4.

UNIT2: Exploratory Data Analysis Visual Aids and Case Study

Python Code to Write:-

1. Write Python code using the pandas library to load a CSV file named emails.csv into a DataFrame.

2. Assuming a DataFrame df has a column 'email_length', write code using matplotlib.pyplot to create a histogram of this column.

3. You have a DataFrame df with columns 'sender' and 'date'. Write code to count the number of emails sent by each person and display the top 5 senders.

4. Given a DataFrame df with columns 'date' and 'word_count', write Python code using seaborn to create a scatter plot to visualize the relationship between these two variables.

5. A DataFrame df contains a timestamp column. Write code to convert this column to a proper datetime format.

6. Write code using the pandas library to remove any duplicate rows from a DataFrame named df.

7. You have a DataFrame df with a column 'email_subject'. Write code to find and count the number of unique subjects.

8. Given a DataFrame df with a column 'category', write code to create a bar chart showing the number of emails in each category using seaborn.

9. Write Python code to apply descriptive statistics (mean, median, etc.) to a numerical column named 'email_length' in a DataFrame df.

10. A DataFrame df has columns 'date' and 'sentiment_score'. Write code to create a line chart to show the trend of the average sentiment score over time.

Descriptive Questions:-

11. Explain the difference between a histogram and a bar chart in terms of the data they represent and their primary use in EDA.

12. What is the purpose of data cleansing in the EDA process, specifically in a personal email case study? Give two examples of cleansing tasks.

13. Why is it important to choose the best chart for your data? What could be the consequence of choosing a misleading visualization?

14. Describe one technical requirement for using the seaborn library in Python for visualization.

15. What is data refactoring and how does it differ from data transformation? Provide a simple example of refactoring in the email case study.

16. In the context of EDA for personal emails, what is a key challenge when loading the dataset, and how can it be addressed?

17. Explain how descriptive statistics can provide initial insights in an email case study. Mention two specific statistics and what they might tell you.

18. Why is a line chart suitable for visualizing the trend of emails sent per month?

Match the Case:-

19. Match the visualization with its most suitable use case in EDA.

Visualization	Use Case
A. Histogram	1. Comparing email counts from different senders.
B. Scatter Plot	2. Displaying the distribution of email lengths.
C. Line Chart	3. Showing the correlation between email length and word count.
D. Bar Chart	4. Visualizing the trend of emails sent per month.

20. Match the EDA step with its correct description.

EDA Step	Description
A. Data Loading	1. Finding and removing duplicate emails from a dataset.
B. Data Cleansing	2. Changing a date string to a proper datetime object.
C. Data Transformation	3. Reading a CSV or text file into a DataFrame.
D. Data Analysis	4. Calculating the mean word count per email.

Solutions

1. Python Code to Write

import pandas as pd

df = pd.read_csv('emails.csv')

2. Python Code to Write

import matplotlib.pyplot as plt

plt.hist(df['email_length'])

plt.show()

3. Python Code to Write

sender_counts = df['sender'].value_counts()

top_5_senders = sender_counts.head(5)

print(top_5_senders)

4. Python Code to Write

import seaborn as sns

import matplotlib.pyplot as plt

sns.scatterplot(x='date', y='word_count', data=df)

plt.show()

5. Python Code to Write

df['timestamp'] = pd.to_datetime(df['timestamp'])

6. Python Code to Write

df.drop_duplicates(inplace=True)

7. Python Code to Write

unique_subjects = df['email_subject'].nunique()

print(unique_subjects)

8. Python Code to Write

import seaborn as sns

import matplotlib.pyplot as plt

sns.countplot(x='category', data=df)

plt.show()

9. Python Code to Write

descriptive_stats = df['email_length'].describe()

print(descriptive_stats)

10. Python Code to Write

import seaborn as sns

import matplotlib.pyplot as plt

df_agg = df.groupby('date')['sentiment_score'].mean().reset_index()

sns.lineplot(x='date', y='sentiment_score', data=df_agg)

plt.show()

11. Descriptive Solution

A histogram is used to show the distribution of a continuous numerical variable by grouping data into bins. A bar chart is used to show the frequency of discrete categorical data. For instance, a histogram shows the distribution of email lengths, while a bar chart shows the number of emails per day of the week.
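The contrast described above can be sketched side by side. The DataFrame below is hypothetical sample data (the column name 'weekday' is an assumption made for this illustration); the point is only that the histogram bins a continuous column while the bar chart counts categories.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'email_length': [120, 340, 560, 80, 900, 450, 300, 150],
    'weekday': ['Mon', 'Tue', 'Mon', 'Wed', 'Tue', 'Mon', 'Fri', 'Wed'],
})

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: a continuous variable is grouped into numeric bins.
counts, bin_edges, _ = ax_hist.hist(df['email_length'], bins=4)
ax_hist.set_title('Distribution of email length')

# Bar chart: one bar per discrete category.
day_counts = df['weekday'].value_counts()
ax_bar.bar(day_counts.index, day_counts.values)
ax_bar.set_title('Emails per weekday')

plt.show()
```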

12. Descriptive Solution

Data cleansing is the process of detecting and correcting or removing corrupt, inaccurate, or irrelevant records from a dataset. In an email case study, this involves tasks like:

Removing duplicate emails that may exist due to syncing or backup processes.
Handling missing values in fields like sender, subject, or date.
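The two cleansing tasks above could look roughly like this in pandas. The sample rows and the '(no subject)' placeholder are assumptions for illustration; whether to drop or fill a missing field depends on the analysis.

```python
import pandas as pd

# Hypothetical email records: one exact duplicate and two missing fields.
df = pd.DataFrame({
    'sender': ['a@example.com', 'a@example.com', None, 'b@example.com'],
    'subject': ['Hello', 'Hello', 'Invoice', None],
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-03'],
})

# Task 1: remove exact duplicate rows (e.g. from syncing or backups).
df = df.drop_duplicates()

# Task 2: handle missing values.
# Fill a missing subject with a placeholder, then drop rows with no sender.
df = df.assign(subject=df['subject'].fillna('(no subject)'))
df = df.dropna(subset=['sender'])

print(df)
```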

13. Descriptive Solution

Choosing the best chart ensures that the visual representation accurately reflects the underlying data and the insights you wish to convey. A poor choice could lead to misleading conclusions; for example, using a line chart for categorical data might imply a trend or order that doesn't exist, confusing stakeholders.

14. Descriptive Solution

A primary technical requirement for using seaborn is that it is a Python library built on top of matplotlib. Therefore, you must have both Python and the seaborn library (and its dependencies) installed in your environment before you can use it to generate plots.

15. Descriptive Solution

Data refactoring is the process of restructuring the dataset's format to make it more useful for analysis, without changing its core information. This is different from data transformation, which alters the data itself (e.g., converting a column's data type). An example of refactoring in an email case study would be pivoting a table to change the orientation of the data for easier analysis.
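As a rough illustration of the pivoting example, the sketch below reshapes a long table of monthly email counts into a wide one. The table and its column names ('month', 'sender', 'emails') are hypothetical; the values are unchanged, only the layout differs.

```python
import pandas as pd

# Hypothetical long-format table: one row per (month, sender) pair.
counts = pd.DataFrame({
    'month': ['2023-01', '2023-01', '2023-02', '2023-02'],
    'sender': ['alice', 'bob', 'alice', 'bob'],
    'emails': [10, 4, 7, 9],
})

# Pivot to wide format: one row per month, one column per sender.
wide = counts.pivot(index='month', columns='sender', values='emails')

print(wide)
```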

16. Descriptive Solution

A key challenge is that personal email data is often stored in unstructured or semi-structured formats like .mbox or .eml files, which are not readily readable by standard data analysis libraries. This can be addressed by using specialized parsers or libraries designed to handle these file types to extract the necessary information into a structured format like a DataFrame.
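One possible way to do this for .mbox files is Python's standard-library mailbox module. The sketch below is not the only approach; the helper name mbox_to_dataframe and the choice of headers to keep are assumptions made for illustration.

```python
import mailbox

import pandas as pd


def mbox_to_dataframe(path):
    """Read basic headers from an mbox file into a DataFrame."""
    # create=False raises an error instead of creating an empty file
    # when the path does not exist.
    box = mailbox.mbox(path, create=False)
    rows = []
    for message in box:
        rows.append({
            'sender': message['From'],
            'subject': message['Subject'],
            'date': message['Date'],  # raw header string, not yet parsed
        })
    box.close()
    return pd.DataFrame(rows, columns=['sender', 'subject', 'date'])
```

A call such as mbox_to_dataframe('emails.mbox') (a hypothetical file name) would return one row per message; the 'date' column could then be converted to a datetime type as a separate step.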

17. Descriptive Solution

Descriptive statistics provide a quick numerical summary of the dataset. For an email case study:

The mean email length could give an average size, but the median might be a better indicator if there are a few very long emails (outliers).
The standard deviation of email lengths could indicate how varied the email sizes are.
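These statistics can be computed directly in pandas. The email lengths below are hypothetical, with one very long email included to show how the mean and median diverge.

```python
import pandas as pd

# Hypothetical email lengths; the last email is unusually long.
df = pd.DataFrame({'email_length': [100, 120, 130, 150, 5000]})

mean_length = df['email_length'].mean()      # average size
median_length = df['email_length'].median()  # typical size, robust to outliers
std_length = df['email_length'].std()        # spread of the sizes

print("Mean:", mean_length)
print("Median:", median_length)
print("Std dev:", std_length)
```

In this sample the mean is 1100 but the median is only 130, and the large standard deviation signals that the lengths vary widely.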

18. Descriptive Solution

A line chart is suitable because the data on the x-axis (months) is sequential and ordered. The line connecting the data points visually represents the trend or change in email volume over time, making it easy to spot seasonal patterns or periods of high or low activity.
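A minimal sketch of such a chart, assuming a DataFrame with one row per email and a datetime 'date' column (the dates below are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical send dates, one row per email.
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2023-01-05', '2023-01-20', '2023-02-03',
        '2023-03-11', '2023-03-15', '2023-03-30',
    ])
})

# Count emails per calendar month; the result is ordered by month.
monthly = df.groupby(df['date'].dt.to_period('M')).size()

# Plot the monthly counts as a line to show the trend over time.
plt.plot(monthly.index.astype(str), monthly.values, marker='o')
plt.xlabel('Month')
plt.ylabel('Emails sent')
plt.show()
```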

19. Match the Case Solution

A matches 2.
B matches 3.
C matches 4.
D matches 1.

20. Match the Case Solution

A matches 3.
B matches 1.
C matches 2.
D matches 4.
