Pandas
Lecture 03
1
CMSC 320�
INTRODUCTION TO �DATA SCIENCE
– FARDINA FATHMIUL ALAM
(fardina@umd.edu)
�
“A Python Library for Manipulating Tabular Data”
Topics to Cover
2
Chapter 4 in : https://ffalam.github.io/CMSC320TextBook/chapter4/Chapter_4_1.html
These skills are essential for effective data analysis, collaboration, and handling data in real-world projects.
Tabular Data: Key Concept
Tabular structured format is essential for both Pandas and SQL.
1. SELECT/SLICING
Select only some of the rows, or some of the columns, or a combination
5
ID | age | wgt_kg | hgt_cm |
1 | 12.2 | 42.3 | 145.1 |
2 | 11.0 | 40.8 | 143.8 |
3 | 15.6 | 65.3 | 165.3 |
4 | 35.1 | 84.2 | 185.8 |
ID | age |
1 | 12.2 |
2 | 11.0 |
3 | 15.6 |
4 | 35.1 |
Only columns
ID and Age
Only rows with wgt > 41
Both with wgt > 41
ID | age | wgt_kg | hgt_cm |
1 | 12.2 | 42.3 | 145.1 |
3 | 15.6 | 65.3 | 165.3 |
4 | 35.1 | 84.2 | 185.8 |
ID | age |
1 | 12.2 |
3 | 15.6 |
4 | 35.1 |
2. AGGREGATE/REDUCE
Combine values across a column into a single value
6
ID | age | wgt_kg | hgt_cm |
1 | 12.2 | 42.3 | 145.1 |
2 | 11.0 | 40.8 | 143.8 |
3 | 15.6 | 65.3 | 165.3 |
4 | 35.1 | 84.2 | 185.8 |
SUM
SUM(wgt_kg^2 - hgt_cm)
73.9 | 232.6 | 640.0 |
MAX
35.1 | 84.2 | 185.8 |
14167.66 |
What about ID/Index column?
Usually not meaningful to aggregate across it
May need to explicitly add an ID column
3. MAP
Apply a function to every row, possibly creating more or fewer columns
7
ID | Address |
1 | College Park, MD, 20742 |
2 | Washington, DC, 20001 |
3 | Silver Spring, MD, 20901 |
Variations that allow one row to generate multiple rows in the output (sometimes called “flatmap”)
ID | City | State | Zipcode |
1 | College Park | MD | 20742 |
2 | Washington | DC | 20001 |
3 | Silver Spring | MD | 20901 |
4. GROUP BY
Group tuples together by column/dimension
8
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
ID | B | C |
1 | 3 | 6.6 |
3 | 4 | 3.1 |
4 | 3 | 8.0 |
7 | 4 | 2.3 |
8 | 3 | 8.0 |
ID | B | C |
2 | 2 | 4.7 |
5 | 1 | 1.2 |
6 | 2 | 2.5 |
A = foo
A = bar
By ‘A’
4. GROUP BY
Group tuples together by column/dimension
9
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
By ‘B’
ID | A | C |
5 | bar | 1.2 |
B = 1
ID | A | C |
2 | bar | 4.7 |
6 | bar | 2.5 |
ID | A | C |
3 | foo | 3.1 |
7 | foo | 2.3 |
ID | A | C |
1 | foo | 6.6 |
4 | foo | 8.0 |
8 | foo | 8.0 |
B = 3
B = 2
B = 4
4. GROUP BY
Group tuples together by column/dimension
10
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
By ‘A’, ‘B’
ID | C |
5 | 1.2 |
A = bar, B = 1
ID | C |
2 | 4.7 |
6 | 2.5 |
ID | C |
3 | 3.1 |
7 | 2.3 |
ID | C |
1 | 6.6 |
4 | 8.0 |
8 | 8.0 |
A = foo, B = 3
A = bar, B = 2
A = foo, B = 4
5. GROUP BY AGGREGATE
Compute one aggregate per group
11
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
Group by ‘B’
Sum on C
ID | A | C |
5 | bar | 1.2 |
B = 1
ID | A | C |
2 | bar | 4.7 |
6 | bar | 2.5 |
ID | A | C |
3 | foo | 3.1 |
7 | foo | 2.3 |
ID | A | C |
1 | foo | 6.6 |
4 | foo | 8.0 |
8 | foo | 8.0 |
B = 3
B = 2
B = 4
Sum (C) |
1.2 |
B = 1
B = 3
B = 2
B = 4
Sum (C) |
22.6 |
Sum (C) |
7.2 |
Sum (C) |
5.4 |
5. GROUP BY AGGREGATE
Final result usually seen as a table
12
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
Group by ‘B’
Sum on C
Sum (C) |
1.2 |
B = 1
B = 3
B = 2
B = 4
Sum (C) |
22.6 |
Sum (C) |
7.2 |
Sum (C) |
5.4 |
B | SUM(C ) |
1 | 1.2 |
2 | 7.2 |
3 | 22.6 |
4 | 5.4 |
6. UNION / INTERSECTION / DIFFERENCE
Set operations – only if the two tables have identical attributes/columns
13
13
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
ID | A | B | C |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
U
ID | A | B | C |
1 | foo | 3 | 6.6 |
2 | bar | 2 | 4.7 |
3 | foo | 4 | 3.1 |
4 | foo | 3 | 8.0 |
5 | bar | 1 | 1.2 |
6 | bar | 2 | 2.5 |
7 | foo | 4 | 2.3 |
8 | foo | 3 | 8.0 |
Similarly Intersection (returns the rows that are common to both tables) and Set Difference EXCEPT/MINUS (returns the rows that are in one table but not in the other)manipulate tables as Sets
The way IDs are treated (whether as unique or non-unique identifiers) can lead to different behaviors in these operations
7. MERGE OR JOIN
Combine rows/tuples across two tables if they have the same key
14
ID | A | B |
1 | foo | 3 |
2 | bar | 2 |
3 | foo | 4 |
4 | foo | 3 |
ID | C |
1 | 1.2 |
2 | 2.5 |
3 | 2.3 |
5 | 8.0 |
ID | A | B | C |
1 | foo | 3 | 1.2 |
2 | bar | 2 | 2.5 |
3 | foo | 4 | 2.3 |
⨝
What about IDs not present in both tables?
Often need to keep them around
Can “pad” with NaN
7. MERGE OR JOIN
Combine rows/tuples across two tables if they have the same key
Outer joins can be used to ”pad” IDs that don’t appear in both tables
Three variants: LEFT, RIGHT, FULL
SQL Terminology – pandas has these operations as well
15
ID | A | B |
1 | foo | 3 |
2 | bar | 2 |
3 | foo | 4 |
4 | foo | 3 |
ID | C |
1 | 1.2 |
2 | 2.5 |
3 | 2.3 |
5 | 8.0 |
ID | A | B | C |
1 | foo | 3 | 1.2 |
2 | bar | 2 | 2.5 |
3 | foo | 4 | 2.3 |
4 | foo | 3 | NaN |
5 | NaN | NaN | 8.0 |
⟗
Pandas First Steps: install and import
16
$ conda install pandas
OR
$ pip install pandas
!pip install pandas
import pandas as pd
After installing pandas, it's good practice to check the installed version to ensure
print(pd.__version__)
“pd” → alias (in Python): an alternate name for referring to the same thing .
Common Pandas Tasks and Workflow
Class Colab:
Dataset:
|
|
|
|
|
Pandas Key Data Structures:
Core components of Pandas: Series & DataFrames
Name Age Score
Alice 20 85
Bob 21 90
Pandas stores data in two key structures:
Key Idea:
Supports common operations: sum(), count(), unique(), etc.
Columns → accessed by name �Rows by index → less common
# Read CSV file
df = pd.read_csv("data.csv")
# Read Excel file (XLS / XLSX)
df = pd.read_excel("data.xlsx") # file path as input
# Read text file (whitespace-separated)
df = pd.read_csv("diabetes.txt", sep="\s") # sep = separator (\s = whitespace)
# Read JSON file
df = pd.read_json("diabetes.json")
Key idea: Pandas can load data from multiple file formats using the file path as input.
Reading Data
# Write DataFrame to CSV
df.to_csv("diabetes_out.csv", index=False)
# Write Series to CSV
s.to_csv("series_out.csv")
# Write DataFrame to Excel (XLS / XLSX)
df.to_excel("diabetes_out.xlsx", index=False)
# Write DataFrame to text file (space-separated)
df.to_csv("diabetes_out.txt", header=True, index=False, sep=" ")
# Write DataFrame to JSON
df.to_json("diabetes_out.json")
Key idea: Use to_*() methods to save Pandas objects (Series or DataFrames) to different file formats.
Outputting data
Reading data from a CSV file
20
df = pd.read_csv('dataset.csv')
Load a CSV file into a Pandas DataFrame:
*** If the data contains only one column and you specify squeeze=True, pandas will convert the result to a Series.
Basic Operations: Exploring Data
# Display First 5 rows df.head() # Display Last 3 rows df.tail(n=3) [specify the number of rows through the n argument (the default value is 5)]. | # Data types df.dtypes # Summary statistics df.describe() prints the summary statistics of all numeric columns, such as count, mean, standard deviation, range, and quartiles of numeric columns. | # Shape (rows, columns) df.shape # Column names df.columns df.info() # Overview of columns and types |
Get the number of rows and columns
df.shape[0] # Get the number of rows only
df.shape[1] # Get the number of columns only
Basic Pandas Operations: Selecting & Modifying Data
# Select a column (Series)
df["Age"]
# Select multiple columns (DataFrame)
df[["Name", "City"]]
# Select rows by index
df.iloc[0] # first row
df.iloc[1:3] # rows 1–2
# Select rows by condition
df[df["Age"] > 30]
Access pattern:
dataframe["column_name"] → access a column (e.g., df["age"])
Arithmetic Operations on Columns
# Create new columns
df["new_column"] = df["column1"] + df["column2"]
df["new_column"] = df["column1"] - df["column2"]
# Modify an existing column
df["Age"] = df["Age"] + 1
Key idea: Pandas supports vectorized operations; math is applied to entire column
Applying Functions Directly to the dataframe
You can apply different functions directly to a DataFrame column:
Syntax: dataframe[column_name].function()
23
E.g.: df[“age”].sum()
Filtering in Pandas
1. Column Filtering (Selecting Columns): Select specific columns by name using basic selection, e.g.df[['col1']] or df[['col1', 'col2']]. |
|
2. Row Filtering (Selecting Rows): Select rows based on conditions on column values using Boolean Indexing (Most common one). E.g. df[df['col'] > value]. |
|
Two Types:
Focus on specific subsets of your data
2. Filtering Rows: How to Filter Rows based on Condition (s)
Extract specific rows based on conditions applied to one or more columns, making it easier to work with relevant subsets of data.
Filter rows where a column meets a condition (Boolean Indexing):E.g.: Filter rows where age is greater than 25.
25
# Sample DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NY', 'LA', 'Chicago']
})
# Method 1: Select specific columns
df_cols = df[['name', 'age']]
# Keeps only 'name' and 'age'
dataframe [boolean condition]
df [df[‘Age’]>29]
Known as “Boolean Indexing”: Use boolean indexing to filter data based on conditions. ex: filtered_df = df[df['Value'] > 20]
Q: What about multiple conditions?
Counting & Aggregating Values in Pandas
E.g: Get Statistics
Count unique values (frequency)
Works on a Series (single column); Can be applied after filtering
Expected output:
Method | Action | Works On |
.sum() | Sum of values | DataFrame/Series |
.mean() | Mean (average) | DataFrame/Series |
.count() | Non-NA values | DataFrame/Series |
.size() | All rows (including NA) | GroupBy only |
.min()/.max() | Minimum/Maximum | DataFrame/Series |
.std()/.var() | Standard deviation/Variance | DataFrame/Series |
.describe() | Summary statistics (count, mean, std, etc.) | DataFrame/Series |
Applying Aggregation Functions Directly to a DataFrame
We can use different key statistical methods directly on a DataFrame or Series.
(Next Topic)
More Advanced: Filtering Data & Apply Statistical Functions
Filter rows, then apply a statistical function to a column:
df[df[condition]][column].statistics_function()
Example: Find the Mean of 'Number' where Age > 25
df[df['Age'] > 25]['Number'].mean()
Filter rows → select column → apply statistic
Grouping Data (Groupby)
Split data into groups, apply a function, and combine results
df.groupby('group_column_name')['target_column_name'].statistics_function()
Example: Average score per category
df.groupby('Category')['Score'].mean()
29
split → apply → combine
GroupBy: Common Usage Patterns
Group by a single column
Group by multiple columns
Group + aggregation (most common)
[column_name] selects a specific column from each group (recommended but optional).
or
Remember: groupby() groups rows first; aggregation happens after selecting a column.
Example: Grouping Data (Groupby)
30
1. group by “order”
2. Apply “Sum” to each group
3. Combine the result
Group By - More Examples (1)
Group By - More Complicated Examples (2)
Pandas Indexing: loc vs iloc
loc : label-based
df.loc[row_label, column_label]
df.loc[1, 'Age'] # Selects the value in row 1 and column 'Age'
df.loc[2:4, ['A', 'B']] # Access rows with labels from index 2 to 4 and columns 'A' and 'B'
iloc:position-based
df.iloc[row_index, column_index]
df.iloc[1, 1] # Selects the value in the second row, second column
df.iloc[2:5, [0, 1]] #Access rows from index 2 to 4 and columns at positions 0 and 1
Remember
Same idea, different way to index.
Visualize DataFrames
Pandas dataframes can be visualized using Matplotlib.
Install Matplotlib using conda install matplotlib command.
pip/conda install matplotlib
Visualization helps in understanding the dataframe content.
Plotting dataframes is a useful way to see results.
01
02
03
04
Pandas: Advantages
35
References
The End
More Helpful references:
CheatSheet: ttps://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf
36