Pandas
Lecture 03
1
CMSC 320
INTRODUCTION TO DATA SCIENCE
– FARDINA FATHMIUL ALAM
(fardina@umd.edu)
“A Python Library for Manipulating Tabular Data”
Importance
2
These skills are essential for effective data analysis, collaboration, and handling data in real-world projects.
Topics to Cover
Basic Concept Review: Table / Tabular Data
4
In tabular data, information is organized into rows and columns.
Tabular data is crucial for both Pandas and SQL
Provides a structured and organized way to represent, manipulate and visualize data.
5
Several operations are common to both SQL and Pandas:
select, join, aggregate, etc.
6
1. SELECT/SLICING
Select only some of the rows, or some of the columns, or a combination
7
ID | age | wgt_kg | hgt_cm |
1 | 12.2 | 42.3 | 145.1 |
2 | 11.0 | 40.8 | 143.8 |
3 | 15.6 | 65.3 | 165.3 |
4 | 35.1 | 84.2 | 185.8 |
Only columns ID and age:
ID | age |
1 | 12.2 |
2 | 11.0 |
3 | 15.6 |
4 | 35.1 |
Only rows with wgt > 41:
ID | age | wgt_kg | hgt_cm |
1 | 12.2 | 42.3 | 145.1 |
3 | 15.6 | 65.3 | 165.3 |
4 | 35.1 | 84.2 | 185.8 |
Both (columns ID and age, rows with wgt > 41):
ID | age |
1 | 12.2 |
3 | 15.6 |
4 | 35.1 |
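The three selections above can be sketched in pandas (a minimal example; the data mirrors the slide's table):

```python
import pandas as pd

# The example table from the slide
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "age": [12.2, 11.0, 15.6, 35.1],
    "wgt_kg": [42.3, 40.8, 65.3, 84.2],
    "hgt_cm": [145.1, 143.8, 165.3, 185.8],
})

only_cols = df[["ID", "age"]]                     # only columns ID and age
only_rows = df[df["wgt_kg"] > 41]                 # only rows with wgt > 41
both = df.loc[df["wgt_kg"] > 41, ["ID", "age"]]   # both at once
print(both)
```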
2. AGGREGATE/REDUCE
Combine values across a column into a single value
8
ID | age | wgt_kg | hgt_cm |
1 | 12.2 | 42.3 | 145.1 |
2 | 11.0 | 40.8 | 143.8 |
3 | 15.6 | 65.3 | 165.3 |
4 | 35.1 | 84.2 | 185.8 |
SUM:
73.9 | 232.6 | 640.0 |
MAX:
35.1 | 84.2 | 185.8 |
SUM(wgt_kg^2 - hgt_cm):
14167.66 |
What about ID/Index column?
Usually not meaningful to aggregate across it
May need to explicitly add an ID column
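A minimal sketch of these aggregates in pandas, reproducing the slide's numbers (the ID column is excluded, since aggregating it is not meaningful):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "age": [12.2, 11.0, 15.6, 35.1],
    "wgt_kg": [42.3, 40.8, 65.3, 84.2],
    "hgt_cm": [145.1, 143.8, 165.3, 185.8],
})

# Reduce each data column to a single value
sums = df[["age", "wgt_kg", "hgt_cm"]].sum()
maxes = df[["age", "wgt_kg", "hgt_cm"]].max()

# Aggregates can also combine columns, as in SUM(wgt_kg^2 - hgt_cm)
combined = (df["wgt_kg"] ** 2 - df["hgt_cm"]).sum()
```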
3. MAP
Apply a function to every row, possibly creating more or fewer columns
9
ID | Address |
1 | College Park, MD, 20742 |
2 | Washington, DC, 20001 |
3 | Silver Spring, MD, 20901 |
ID | City | State | Zipcode |
1 | College Park | MD | 20742 |
2 | Washington | DC | 20001 |
3 | Silver Spring | MD | 20901 |
Variations that allow one row to generate multiple rows in the output are sometimes called “flatmap”.
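One way to sketch this map in pandas is with the vectorized string methods, splitting each address into three new columns (the split on ", " assumes the addresses are formatted exactly as on the slide):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Address": ["College Park, MD, 20742",
                "Washington, DC, 20001",
                "Silver Spring, MD, 20901"],
})

# Map each Address row to three new columns
parts = df["Address"].str.split(", ", expand=True)
parts.columns = ["City", "State", "Zipcode"]
result = pd.concat([df[["ID"]], parts], axis=1)
print(result)
```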
4. GROUP BY
Group tuples together by column/dimension
10
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
By ‘A’:
A = foo
ID | B | C |
1 | 3 | 6.6 |
3 | 4 | 3.1 |
4 | 3 | 8.0 |
7 | 4 | 2.3 |
8 | 3 | 8.0 |
A = bar
ID | B | C |
2 | 2 | 4.7 |
5 | 1 | 1.2 |
6 | 2 | 2.5 |
4. GROUP BY
Group tuples together by column/dimension
11
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
By ‘B’:
B = 1
ID | A | C |
5 | bar | 1.2 |
B = 2
ID | A | C |
2 | bar | 4.7 |
6 | bar | 2.5 |
B = 3
ID | A | C |
1 | foo | 6.6 |
4 | foo | 8.0 |
8 | foo | 8.0 |
B = 4
ID | A | C |
3 | foo | 3.1 |
7 | foo | 2.3 |
4. GROUP BY
Group tuples together by column/dimension
12
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
By ‘A’, ‘B’:
A = bar, B = 1
ID | C |
5 | 1.2 |
A = bar, B = 2
ID | C |
2 | 4.7 |
6 | 2.5 |
A = foo, B = 3
ID | C |
1 | 6.6 |
4 | 8.0 |
8 | 8.0 |
A = foo, B = 4
ID | C |
3 | 3.1 |
7 | 2.3 |
5. GROUP BY AGGREGATE
Compute one aggregate per group
13
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
Group by ‘B’, Sum on C:
B = 1
ID | A | C |
5 | bar | 1.2 |
Sum (C) | 1.2 |
B = 2
ID | A | C |
2 | bar | 4.7 |
6 | bar | 2.5 |
Sum (C) | 7.2 |
B = 3
ID | A | C |
1 | foo | 6.6 |
4 | foo | 8.0 |
8 | foo | 8.0 |
Sum (C) | 22.6 |
B = 4
ID | A | C |
3 | foo | 3.1 |
7 | foo | 2.3 |
Sum (C) | 5.4 |
5. GROUP BY AGGREGATE
Final result usually seen as a table
14
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
Group by ‘B’, Sum on C:
B = 1: Sum (C) = 1.2
B = 2: Sum (C) = 7.2
B = 3: Sum (C) = 22.6
B = 4: Sum (C) = 5.4
B | SUM(C ) |
1 | 1.2 |
2 | 7.2 |
3 | 22.6 |
4 | 5.4 |
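The final table above can be reproduced with a one-line group-by aggregate in pandas (a minimal sketch using the slide's data):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "A": ["foo", "bar", "foo", "foo", "bar", "bar", "foo", "foo"],
    "B": [3, 2, 4, 3, 1, 2, 4, 3],
    "C": [6.6, 4.7, 3.1, 8.0, 1.2, 2.5, 2.3, 8.0],
})

# Split by B, sum C within each group, combine into one result
result = df.groupby("B")["C"].sum()
print(result)
```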
6. UNION / INTERSECTION / DIFFERENCE
Set operations – only if the two tables have identical attributes/columns
15
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
ID | A | B | C |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
U
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
Similarly, Intersection (returns the rows common to both tables) and Set Difference / EXCEPT / MINUS (returns the rows that are in one table but not in the other) manipulate tables as sets.
The way IDs are treated (whether as unique or non-unique identifiers) can lead to different behaviors in these operations
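A minimal sketch of these set operations in pandas (the difference trick assumes the first table has no duplicate rows of its own):

```python
import pandas as pd

t1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "A": ["foo", "bar", "foo", "foo"],
                   "B": [3, 2, 4, 3],
                   "C": [6.6, 4.7, 3.1, 8.0]})
t2 = pd.DataFrame({"ID": [5, 6, 7, 8],
                   "A": ["bar", "bar", "foo", "foo"],
                   "B": [1, 2, 4, 3],
                   "C": [1.2, 2.5, 2.3, 8.0]})

# Union: stack the rows (drop_duplicates makes it a true set union)
union = pd.concat([t1, t2], ignore_index=True).drop_duplicates()

# Intersection: an inner merge on all shared columns keeps only common rows
intersection = pd.merge(t1, t2)

# Difference (t1 minus t2): append t2 twice, then drop every duplicated row,
# which removes all rows that appear in t2
diff = pd.concat([t1, t2, t2]).drop_duplicates(keep=False)
```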
7. MERGE OR JOIN
Combine rows/tuples across two tables if they have the same key
16
ID | A | B |
1 | foo | 3 |
2 | bar | 2 |
3 | foo | 4 |
4 | foo | 3 |
ID | C |
1 | 1.2 |
2 | 2.5 |
3 | 2.3 |
5 | 8.0 |
ID | A | B | C |
1 | foo | 3 | 1.2 |
2 | bar | 2 | 2.5 |
3 | foo | 4 | 2.3 |
⨝
What about IDs not present in both tables?
Often need to keep them around
Can “pad” with NaN
7. MERGE OR JOIN
Combine rows/tuples across two tables if they have the same key
Outer joins can be used to “pad” IDs that don’t appear in both tables
Three variants: LEFT, RIGHT, FULL
SQL Terminology – pandas has these operations as well
17
ID | A | B |
1 | foo | 3 |
2 | bar | 2 |
3 | foo | 4 |
4 | foo | 3 |
ID | C |
1 | 1.2 |
2 | 2.5 |
3 | 2.3 |
5 | 8.0 |
ID | A | B | C |
1 | foo | 3 | 1.2 |
2 | bar | 2 | 2.5 |
3 | foo | 4 | 2.3 |
4 | foo | 3 | NaN |
5 | NaN | NaN | 8.0 |
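A minimal sketch of both joins in pandas, using the tables from the slide; `pd.merge` with `how="outer"` produces the NaN-padded result shown above:

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3, 4],
                     "A": ["foo", "bar", "foo", "foo"],
                     "B": [3, 2, 4, 3]})
right = pd.DataFrame({"ID": [1, 2, 3, 5],
                      "C": [1.2, 2.5, 2.3, 8.0]})

inner = pd.merge(left, right, on="ID", how="inner")  # only IDs present in both (1-3)
outer = pd.merge(left, right, on="ID", how="outer")  # IDs 4 and 5 padded with NaN
print(outer)
```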
⟗
Next
Pandas
18
Class Colab:
Dataset:
What is Pandas
Why Pandas?
"Pandas": Short for "Python Data Analysis Library," used for data analysis.
Written by Wes McKinney: Started in 2008 to get a high-performance, flexible tool to perform quantitative analysis on financial data.
Pandas First Steps: install and import
20
$ conda install pandas
OR
$ pip install pandas
!pip install pandas
import pandas as pd
After installing pandas, it's good practice to check the installed version to ensure the installation succeeded:
print(pd.__version__)
“pd” → alias (in Python): an alternate name for referring to the same thing.
Core components of Pandas:
Series & DataFrames
21
Basic Pandas Workflow
Pandas: Data Table Representation
Data in Pandas is stored in two fundamental data types:
22
Notes: Both behave very similarly to NumPy arrays, so you can use them as input to anything in NumPy, SciPy, or scikit-learn. The main difference from NumPy arrays is indexing.
One-dimensional labeled array (like a column).
Create a Series
Series
A one-dimensional array of data with an index
24
Series Functions
Pandas can do anything with a Series that you can do with an array
25
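A minimal sketch of creating a Series and using it both by label and like an array (the values and index here are made up for illustration):

```python
import pandas as pd

# A Series: one-dimensional data plus an index
s = pd.Series([42.3, 40.8, 65.3, 84.2],
              index=["a", "b", "c", "d"], name="wgt_kg")

val = s["b"]       # label-based access
avg = s.mean()     # array-style aggregation works too
doubled = s * 2    # element-wise arithmetic, index preserved
```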
2. Dataframe
Two-dimensional labeled data structure/table (like a spreadsheet or SQL table).
Most commonly used Pandas object.
Create a DataFrame
Dataframes
A collection of series organized by columns.
27
[Figure: a DataFrame is labeled along both rows and columns]
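A minimal sketch of building a DataFrame from a dict of columns (the column names and values are made up for illustration):

```python
import pandas as pd

# A DataFrame is essentially a dict of columns; each column is a Series
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "Chicago"],
})

ages = df["age"]   # a single column is a Series
shape = df.shape   # (rows, columns)
```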
Dataframe Functions
Important functions:
28
Basic Operations: Reading Data
# Read CSV file
df = pd.read_csv('data.csv')
# Read Excel file (both XLS and XLSX), using the file path as input
df = pd.read_excel('data.xlsx')
# Read text file; the sep argument sets the separator ("\s" is a regex matching a single whitespace character)
df = pd.read_csv("diabetes.txt", sep="\s")
# Read Json file
df = pd.read_json("diabetes.json")
Basic Operations: Outputting data
# into a CSV file (to_csv writes to disk and returns None, so don't assign the result back to df)
df.to_csv("diabetes_out.csv", index=False)
For a Series s: s.to_csv("filename.csv")
# into an Excel file
df.to_excel("diabetes_out.xlsx", index=False)
# into a text file
df.to_csv('diabetes_out.txt', header=df.columns, index=None, sep=' ')
# into a JSON file
df.to_json("diabetes_out.json")
Example: Loading a DataFrame from files
31
Reading data from a CSV file
32
df = pd.read_csv('dataset.csv')
Load a CSV file into a Pandas DataFrame:
*** Older pandas versions let you pass squeeze=True to read_csv to convert a single-column result to a Series; this parameter was removed in pandas 2.0, so in modern code use pd.read_csv(...).squeeze("columns") instead.
Viewing and understanding DataFrames using pandas
After reading tabular data into a DataFrame, you will usually want a quick look at it. You can either view a small sample of the dataset or a summary of the data in the form of summary statistics.
Example:
Use DataFrame.head() and DataFrame.tail() to view the top (first) and bottom (last) few rows of the frame respectively.
Basic Operations: Exploring Data
# Display first 5 rows
df.head()
# Display last 3 rows (set the number of rows through the n argument; the default is 5)
df.tail(n=3)
# Data types
df.dtypes
# Summary statistics of all numeric columns (count, mean, standard deviation, range, and quartiles)
df.describe()
# Shape (rows, columns)
df.shape
# Column names
df.columns
# Overview of columns and types
df.info()
Get the number of rows and columns
df.shape[0] # Get the number of rows only
df.shape[1] # Get the number of columns only
Basic Operations: Selecting Data
# Select column
df['Age']
# Select multiple columns
df[['Name', 'City']]
# Select rows by position
df.iloc[0]    # first row
df.iloc[1:3]  # rows 1-2
# Select rows by condition
df[df['Age'] > 30]
Example: How to access a column
dataframe['column_name']: Access a column by its name.
36
E.g.: df['age'], where df is your DataFrame.
Arithmetic Operations
Can perform arithmetic operations (e.g., addition, subtraction, multiplication, division) on entire columns or between columns.
E.g.:
df['new_column'] = df['column1'] + df['column2']
df['new_column'] = df['column1'] - df['column2']
E.g # Modify column
df['Age'] = df['Age'] + 1
37
Applying Functions Directly to the dataframe
You can apply different functions directly to a DataFrame column:
Syntax: dataframe[column_name].function()
38
Applying Functions Directly to the dataframe
Aggregation operations are used to compute summary statistics or single values from multiple values in a dataset. E.g:
39
E.g.: df["age"].sum()
Try some other aggregation operation!
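A minimal sketch trying a few of these aggregations on a single column (the age values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35, 40]})

total = df["age"].sum()     # 130
avg = df["age"].mean()      # 32.5
oldest = df["age"].max()    # 40
n = df["age"].count()       # 4 non-NA values
```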
Filtering in Pandas
Two types:
1. Column Filtering (Selecting Columns): select specific columns, or filter columns based on names/conditions, to focus on specific subsets of your data. Method: basic selection, df[['col1']] or df[['col1', 'col2']].
2. Row Filtering (Selecting Rows): extract specific rows based on conditions applied to one or more columns, making it easier to work with relevant subsets of data. Method: Boolean indexing (the most common approach).
# Sample DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NY', 'LA', 'Chicago']
})
# Method 1: Select specific columns
df_cols = df[['name', 'age']] # Keeps only 'name' and 'age'
2. Filtering Rows: How to Filter Rows Based on Condition(s)
Filter rows where a column meets a condition:
42
E.g.: Filter rows where Age is greater than 25:
dataframe[boolean condition]
df[df['Age'] > 25]
Known as “Boolean Indexing”: use boolean indexing to filter data based on conditions, e.g. filtered_df = df[df['Value'] > 20]
Q: What about multiple conditions?
(Boolean Indexing)
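A minimal sketch of combining conditions in boolean indexing, reusing the sample DataFrame from above; note the & / | operators and the parentheses around each condition:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "Chicago"],
})

# Combine conditions with & (and) / | (or); each condition needs its own parentheses
both = df[(df["age"] > 25) & (df["city"] == "Chicago")]   # only Charlie
either = df[(df["age"] > 25) | (df["city"] == "NY")]      # all three rows
```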
Count Unique Values
Basic Usage: Count occurrences of each unique value in a column:
Returns the frequency of each unique value in a Series, sorted by default in descending order.
Use value_counts() after filtering when you want to see/count how many times each unique value appears in a Series (a single column).
You can also Count unique values in a filtered subset
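A minimal sketch of value_counts(), on its own and on a filtered subset (the sample data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "LA", "NY", "Chicago", "NY", "LA"],
    "age": [25, 30, 35, 28, 41, 22],
})

# Frequency of each unique value, sorted in descending order by default
counts = df["city"].value_counts()

# Counts within a filtered subset: filter first, then count
filtered_counts = df[df["age"] > 24]["city"].value_counts()
```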
Key Aggregation Methods Supported by Pandas
E.g: Get Statistics
Method | Action | Works On |
.sum() | Sum of values | DataFrame/Series |
.mean() | Mean (average) | DataFrame/Series |
.count() | Non-NA values | DataFrame/Series |
.size() | All rows (including NA) | GroupBy only |
.min()/.max() | Minimum/Maximum | DataFrame/Series |
.std()/.var() | Standard deviation/Variance | DataFrame/Series |
.describe() | Summary statistics (count, mean, std, etc.) | DataFrame/Series |
Applying Aggregation Functions Directly to a DataFrame
We can use different key statistical methods directly on a DataFrame or Series.
(Next Topic)
More Advanced: Filtering Data & Apply Statistical Functions
We can filter rows in a DataFrame based on a condition and then apply statistical functions to a specific column:
46
df[boolean_condition][column_name].statistics_function()
(here boolean_condition is an expression such as df['Age'] > 25, not a column name)
Try: Calculate the mean of the 'Number' column where the 'Age' is greater than 25
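One possible solution sketch for this exercise (the 'Age' and 'Number' values here are made up, since the slide's dataset isn't shown):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 28, 31, 24, 40],
    "Number": [10, 20, 30, 40, 50],
})

# Filter the rows first, then apply the statistic to one column of the result
result = df[df["Age"] > 25]["Number"].mean()   # mean of 20, 30, 50
```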
Grouping Data (Groupby)
Purpose: The groupby() operation splits the data into groups based on a column/criteria, applies a function (like sum, mean, etc.) to each group, and combines the results.
48
df.groupby(grouping_column)[target_column_name].statistics_function()
Three steps: Split (into groups) → Apply (a function to each group) → Combine (the results).
Common Usage:
grouped = df.groupby('column_name')
grouped = df.groupby(['col1', 'col2'])
df.groupby(grouping_column)[target_column_name].statistics_function()
Example:
df.groupby('category')['value'].mean()
Example: Grouping Data (Groupby)
50
1. group by “order”
2. Apply “Sum” to each group
3. Combine the result
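The three steps above can be sketched as follows (the 'order' and 'value' columns are made up, since the slide's dataset isn't shown):

```python
import pandas as pd

df = pd.DataFrame({
    "order": ["A", "B", "A", "B", "A"],
    "value": [10, 20, 30, 40, 50],
})

# 1. split by "order", 2. apply sum to each group, 3. combine the results
totals = df.groupby("order")["value"].sum()
```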
Group By - More Examples (1)
Group By - More Complicated Examples (2)
Pandas Label/Position-based indexing and slicing: loc, iloc
53
https://pandas.pydata.org/docs/user_guide/indexing.html
loc: Used for label-based indexing.
Select rows and columns by label (the actual name of the row or column).
iloc: Used for integer-location based indexing.
Select rows and columns by position (using row/column numbers, starting from 0).
Pandas Label/Position-based indexing and slicing: loc, iloc
54
# Access rows with index labels 2 to 4 (label slices are inclusive) and columns 'A' and 'B'
df.loc[2:4, ['A', 'B']]
# Access rows at positions 2 to 4 (position slices exclude the end, 5) and columns at positions 0 and 1
df.iloc[2:5, [0, 1]]
https://pandas.pydata.org/docs/user_guide/indexing.html
Example: loc, iloc
55
https://pandas.pydata.org/docs/user_guide/indexing.html
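A runnable sketch of the two calls above; with a default integer index, df.loc[2:4] (inclusive of label 4) and df.iloc[2:5] (exclusive of position 5) happen to select the same three rows, which is exactly the kind of subtlety that makes it worth knowing which one you are using:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 11, 12, 13, 14, 15],
                   "B": [20, 21, 22, 23, 24, 25],
                   "C": [30, 31, 32, 33, 34, 35]})  # default integer index 0..5

by_label = df.loc[2:4, ["A", "B"]]      # loc: label slice INCLUDES 4 → 3 rows
by_position = df.iloc[2:5, [0, 1]]      # iloc: position slice EXCLUDES 5 → same 3 rows here
```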
When in doubt, visualize
You can visualize what we just did by plotting the dataframes.
conda install matplotlib
56
Pandas: Advantages
57
References
58
More References
CheatSheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf
59
The End