EDA Step | Action |
A. Getting Started | 1. Visualizing the relationship between two variables. |
B. Making Sense of Data | 2. Calculating the mean and median of a variable. |
C. Univariate Analysis | 3. Loading the dataset into a DataFrame. |
D. Bivariate Analysis | 4. Exploring the data through visualizations and summary statistics. |
Data Type / Scale | Characteristic |
A. Categorical Data | 1. Has an ordered sequence but no meaningful zero point. |
B. Ratio Scale | 2. Represents qualities or groups. |
C. Interval Scale | 3. Has a true zero point, allowing for ratios. |
D. Numerical Data | 4. Represents measurable quantities. |
1. Write Python code using the pandas library to load a CSV file named emails.csv into a DataFrame.
2. Assuming a DataFrame df has a column 'email_length', write code using matplotlib.pyplot to create a histogram of this column.
3. You have a DataFrame df with columns 'sender' and 'date'. Write code to count the number of emails sent by each person and display the top 5 senders.
4. Given a DataFrame df with columns 'date' and 'word_count', write Python code using seaborn to create a scatter plot to visualize the relationship between these two variables.
5. A DataFrame df contains a timestamp column. Write code to convert this column to a proper datetime format.
6. Write code using the pandas library to remove any duplicate rows from a DataFrame named df.
7. You have a DataFrame df with a column 'email_subject'. Write code to find and count the number of unique subjects.
8. Given a DataFrame df with a column 'category', write code to create a bar chart showing the number of emails in each category using seaborn.
9. Write Python code to apply descriptive statistics (mean, median, etc.) to a numerical column named 'email_length' in a DataFrame df.
10. A DataFrame df has columns 'date' and 'sentiment_score'. Write code to create a line chart to show the trend of the average sentiment score over time.
11. Explain the difference between a histogram and a bar chart in terms of the data they represent and their primary use in EDA.
12. What is the purpose of data cleansing in the EDA process, specifically in a personal email case study? Give two examples of cleansing tasks.
13. Why is it important to choose the best chart for your data? What could be the consequence of choosing a misleading visualization?
14. Describe one technical requirement for using the seaborn library in Python for visualization.
15. What is data refactoring and how does it differ from data transformation? Provide a simple example of refactoring in the email case study.
16. In the context of EDA for personal emails, what is a key challenge when loading the dataset, and how can it be addressed?
17. Explain how descriptive statistics can provide initial insights in an email case study. Mention two specific statistics and what they might tell you.
18. Why is a line chart suitable for visualizing the trend of emails sent per month?
19. Match the visualization with its most suitable use case in EDA.
Visualization | Use Case |
A. Histogram | 1. Comparing email counts from different senders. |
B. Scatter Plot | 2. Displaying the distribution of email lengths. |
C. Line Chart | 3. Showing the correlation between email length and word count. |
D. Bar Chart | 4. Visualizing the trend of emails sent per month. |
20. Match the EDA step with its correct description.
EDA Step | Description |
A. Data Loading | 1. Finding and removing duplicate emails from a dataset. |
B. Data Cleansing | 2. Changing a date string to a proper datetime object. |
C. Data Transformation | 3. Reading a CSV or text file into a DataFrame. |
D. Data Analysis | 4. Calculating the mean word count per email. |
1. Python Code to Write
import pandas as pd
df = pd.read_csv('emails.csv')
2. Python Code to Write
import matplotlib.pyplot as plt
plt.hist(df['email_length'])
plt.show()
3. Python Code to Write
sender_counts = df['sender'].value_counts()
top_5_senders = sender_counts.head(5)
print(top_5_senders)
4. Python Code to Write
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='date', y='word_count', data=df)
plt.show()
5. Python Code to Write
df['timestamp'] = pd.to_datetime(df['timestamp'])
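Real email exports often contain a few timestamps that cannot be parsed. A minimal sketch, assuming you want those rows marked as missing rather than raising an error, passes errors="coerce" to pd.to_datetime. The sample values below are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: the last value is deliberately not a valid date.
df = pd.DataFrame({
    "timestamp": ["2024-01-05 09:30:00", "2024-02-11 17:45:10", "not a date"],
})

# errors="coerce" turns unparseable entries into NaT instead of raising.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

# Count how many entries could not be converted.
unparsed = df["timestamp"].isna().sum()
print(f"Unparsed timestamps: {unparsed}")
```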
6. Python Code to Write
df.drop_duplicates(inplace=True)
7. Python Code to Write
unique_subjects = df['email_subject'].nunique()
print(unique_subjects)
8. Python Code to Write
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='category', data=df)
plt.show()
9. Python Code to Write
descriptive_stats = df['email_length'].describe()
print(descriptive_stats)
10. Python Code to Write
import seaborn as sns
import matplotlib.pyplot as plt
df_agg = df.groupby('date')['sentiment_score'].mean().reset_index()
sns.lineplot(x='date', y='sentiment_score', data=df_agg)
plt.show()
11. Descriptive Solution
A histogram is used to show the distribution of a continuous numerical variable by grouping data into bins. A bar chart is used to show the frequency of discrete categorical data. For instance, a histogram shows the distribution of email lengths, while a bar chart shows the number of emails per day of the week.
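The distinction can be seen in the numbers each chart is built from. The sketch below is illustrative only: the DataFrame contents and the 'weekday' column are invented. It bins a continuous column the way a histogram does and counts categories the way a bar chart does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "email_length": [120, 340, 95, 410, 150, 800, 230, 60],
    "weekday": ["Mon", "Tue", "Mon", "Wed", "Tue", "Mon", "Fri", "Wed"],
})

# Histogram logic: a continuous variable is grouped into numeric bins.
bin_counts, bin_edges = np.histogram(df["email_length"], bins=4)

# Bar chart logic: each distinct category gets its own count.
category_counts = df["weekday"].value_counts()

print("Bin edges:", bin_edges)
print("Counts per bin:", bin_counts)
print(category_counts)
```

The bin edges are numeric and adjacent, so the order of the bars carries meaning. The categories have no numeric spacing, so their bars can be reordered freely.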
12. Descriptive Solution
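Data cleansing is the process of detecting and correcting or removing corrupt, inaccurate, or irrelevant records from a dataset. In an email case study, this involves tasks like removing duplicate emails and dropping or repairing records with missing or malformed fields (for example, an empty sender or an unparseable date).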
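Two cleansing tasks that commonly apply to email data, removing duplicate rows and dropping rows with a missing sender, can be sketched as follows. The sample rows are hypothetical, and which columns count as required is an assumption that depends on the dataset.

```python
import pandas as pd

# Hypothetical sample data: one exact duplicate and one missing sender.
df = pd.DataFrame({
    "sender": ["ann@example.com", "ann@example.com", "bob@example.com", None],
    "email_subject": ["Hello", "Hello", "Invoice", "Spam"],
})

# Task 1: remove exact duplicate rows.
deduplicated = df.drop_duplicates()

# Task 2: drop rows whose sender is missing.
cleaned = deduplicated.dropna(subset=["sender"])

print(f"Rows before: {len(df)}, after cleansing: {len(cleaned)}")
```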
13. Descriptive Solution
Choosing the best chart ensures that the visual representation accurately reflects the underlying data and the insights you wish to convey. A poor choice could lead to misleading conclusions; for example, using a line chart for categorical data might imply a trend or order that doesn't exist, confusing stakeholders.
14. Descriptive Solution
Seaborn is a Python library built on top of matplotlib, so the primary technical requirement is a working Python environment with seaborn and its dependencies (matplotlib, pandas, and numpy) installed before you can use it to generate plots.
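One way to check this requirement before plotting is to ask Python whether each package can be found. This is a small sketch using the standard library. The helper name is_installed is hypothetical, not part of seaborn.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located for import."""
    return importlib.util.find_spec(package_name) is not None


# Packages the plotting examples rely on.
required = ["seaborn", "matplotlib", "pandas"]
status = {name: is_installed(name) for name in required}

for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```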
15. Descriptive Solution
Data refactoring is the process of restructuring the dataset's format to make it more useful for analysis, without changing its core information. This is different from data transformation, which alters the data itself (e.g., converting a column's data type). An example of refactoring in an email case study would be pivoting a table to change the orientation of the data for easier analysis.
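The pivoting example can be sketched with pandas as below. The column names and values are invented for illustration. The values themselves stay the same and only the layout changes from long to wide.

```python
import pandas as pd

# Hypothetical long-format table: one row per sender and month.
long_df = pd.DataFrame({
    "sender": ["ann", "ann", "bob", "bob"],
    "month": ["2024-01", "2024-02", "2024-01", "2024-02"],
    "email_count": [5, 3, 2, 7],
})

# Pivot to wide format: one row per sender, one column per month.
wide_df = long_df.pivot(index="sender", columns="month", values="email_count")

print(wide_df)
```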
16. Descriptive Solution
A key challenge is that personal email data is often stored in unstructured or semi-structured formats like .mbox or .eml files, which are not readily readable by standard data analysis libraries. This can be addressed by using specialized parsers or libraries designed to handle these file types to extract the necessary information into a structured format like a DataFrame.
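For .mbox files, Python's standard mailbox module is one such parser. The following is a minimal sketch, not a complete loader: the helper name mbox_to_dataframe, the chosen header fields, and the tiny sample file are all assumptions made for illustration.

```python
import mailbox
import os
import tempfile

import pandas as pd


def mbox_to_dataframe(path):
    """Extract a few header fields from each message in an mbox file."""
    rows = []
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            rows.append({
                "sender": message["From"],
                "date": message["Date"],
                "email_subject": message["Subject"],
            })
    finally:
        box.close()
    return pd.DataFrame(rows, columns=["sender", "date", "email_subject"])


# Build a tiny throwaway mbox file so the sketch runs end to end.
raw = (
    "From ann@example.com Mon Jan  1 10:00:00 2024\n"
    "From: ann@example.com\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "Subject: Hello\n"
    "\n"
    "First message body.\n"
    "\n"
    "From bob@example.com Tue Jan  2 11:30:00 2024\n"
    "From: bob@example.com\n"
    "Date: Tue, 02 Jan 2024 11:30:00 +0000\n"
    "Subject: Invoice\n"
    "\n"
    "Second message body.\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    mbox_path = os.path.join(tmp_dir, "sample.mbox")
    with open(mbox_path, "w", newline="\n") as handle:
        handle.write(raw)
    emails = mbox_to_dataframe(mbox_path)

print(emails)
```

The resulting date column still holds strings, so converting it with pd.to_datetime would be a natural next step.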
17. Descriptive Solution
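Descriptive statistics provide a quick numerical summary of the dataset. For an email case study, the mean email length indicates the typical size of a message, while the median, when it differs noticeably from the mean, signals that a few unusually long or short emails are skewing the distribution.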
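As an illustration of what two such statistics can reveal, the sketch below compares the mean and median of a small, invented set of email lengths in which one message is far longer than the rest.

```python
import pandas as pd

# Hypothetical email lengths: four ordinary messages and one very long one.
lengths = pd.Series([100, 120, 110, 130, 5000], name="email_length")

mean_length = lengths.mean()      # pulled upward by the outlier
median_length = lengths.median()  # unaffected by the single long email

print(f"Mean: {mean_length}, median: {median_length}")
```

A mean far above the median suggests a right-skewed distribution, which is a cue to inspect the longest emails before drawing conclusions from the average.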
18. Descriptive Solution
A line chart is suitable because the data on the x-axis (months) is sequential and ordered. The line connecting the data points visually represents the trend or change in email volume over time, making it easy to spot seasonal patterns or periods of high or low activity.
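Before such a chart can be drawn, the emails have to be counted per month. A minimal sketch, assuming a 'date' column that is already in datetime format and using invented sample dates:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-03",
        "2024-03-11", "2024-03-12", "2024-03-30",
    ]),
})

# Convert each date to its month, count emails per month, and keep
# the months in chronological order for plotting.
monthly_counts = df["date"].dt.to_period("M").value_counts().sort_index()

print(monthly_counts)
```

With matplotlib installed, monthly_counts.plot(kind="line") would draw the trend directly from this series.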
19. Match the Case Solution
A-2, B-3, C-4, D-1
20. Match the Case Solution
A-3, B-1, C-2, D-4