1 of 36

Pandas

Lecture 03

1

CMSC 320�

INTRODUCTION TO �DATA SCIENCE

– FARDINA FATHMIUL ALAM

(fardina@umd.edu)

�

“A Python Library for Manipulating Tabular Data”

2 of 36

Topics to Cover

Python: It's a user-friendly language for data analysis.
Git: Helps manage code and data changes when working with a team.
Pandas: Simplifies data cleaning and manipulation.
Databases: Needed for storing and retrieving data efficiently.

2

Chapter 4 in : https://ffalam.github.io/CMSC320TextBook/chapter4/Chapter_4_1.html

What is Pandas
Tabular data mindset
Core operations (select, filter, group)
Series & DataFrames and many more….

These skills are essential for effective data analysis, collaboration, and handling data in real-world projects.

3 of 36

4 of 36

Tabular Data: Key Concept

Effective data analysis and visualization.
Pandas and SQL share operations like select, join, and aggregates.

Tabular structured format is essential for both Pandas and SQL.

5 of 36

1. SELECT/SLICING

Select only some of the rows, or some of the columns, or a combination

5

ID	age	wgt_kg	hgt_cm
1	12.2	42.3	145.1
2	11.0	40.8	143.8
3	15.6	65.3	165.3
4	35.1	84.2	185.8

ID	age
1	12.2
2	11.0
3	15.6
4	35.1

Only columns

ID and Age

Only rows with wgt > 41

Both with wgt > 41

ID	age	wgt_kg	hgt_cm
1	12.2	42.3	145.1
3	15.6	65.3	165.3
4	35.1	84.2	185.8

ID	age
1	12.2
3	15.6
4	35.1

6 of 36

2. AGGREGATE/REDUCE

Combine values across a column into a single value

6

ID	age	wgt_kg	hgt_cm
1	12.2	42.3	145.1
2	11.0	40.8	143.8
3	15.6	65.3	165.3
4	35.1	84.2	185.8

SUM

SUM(wgt_kg^2 - hgt_cm)

73.9	232.6	640.0

MAX

35.1	84.2	185.8

14167.66

What about ID/Index column?

Usually not meaningful to aggregate across it

May need to explicitly add an ID column

7 of 36

3. MAP

Apply a function to every row, possibly creating more or fewer columns

7

ID	Address
1	College Park, MD, 20742
2	Washington, DC, 20001
3	Silver Spring, MD, 20901

Variations that allow one row to generate multiple rows in the output (sometimes called “flatmap”)

ID	City	State	Zipcode
1	College Park	MD	20742
2	Washington	DC	20001
3	Silver Spring	MD	20901

8 of 36

4. GROUP BY

Group tuples together by column/dimension

8

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	B	C
1	3	6.6
3	4	3.1
4	3	8.0
7	4	2.3
8	3	8.0

ID	B	C
2	2	4.7
5	1	1.2
6	2	2.5

A = foo

A = bar

By ‘A’

9 of 36

4. GROUP BY

Group tuples together by column/dimension

9

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

By ‘B’

ID	A	C
5	bar	1.2

B = 1

ID	A	C
2	bar	4.7
6	bar	2.5

ID	A	C
3	foo	3.1
7	foo	2.3

ID	A	C
1	foo	6.6
4	foo	8.0
8	foo	8.0

B = 3

B = 2

B = 4

10 of 36

4. GROUP BY

Group tuples together by column/dimension

10

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

By ‘A’, ‘B’

ID	C
5	1.2

A = bar, B = 1

ID	C
2	4.7
6	2.5

ID	C
3	3.1
7	2.3

ID	C
1	6.6
4	8.0
8	8.0

A = foo, B = 3

A = bar, B = 2

A = foo, B = 4

11 of 36

5. GROUP BY AGGREGATE

Compute one aggregate per group

11

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

Group by ‘B’

Sum on C

ID	A	C
5	bar	1.2

B = 1

ID	A	C
2	bar	4.7
6	bar	2.5

ID	A	C
3	foo	3.1
7	foo	2.3

ID	A	C
1	foo	6.6
4	foo	8.0
8	foo	8.0

B = 3

B = 2

B = 4

Sum (C)
1.2

B = 1

B = 3

B = 2

B = 4

Sum (C)
22.6

Sum (C)
7.2

Sum (C)
5.4

12 of 36

5. GROUP BY AGGREGATE

Final result usually seen as a table

12

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

Group by ‘B’

Sum on C

Sum (C)
1.2

B = 1

B = 3

B = 2

B = 4

Sum (C)
22.6

Sum (C)
7.2

Sum (C)
5.4

B	SUM(C )
1	1.2
2	7.2
3	22.6
4	5.4

13 of 36

6. UNION / INTERSECTION / DIFFERENCE

Set operations – only if the two tables have identical attributes/columns

13

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0

ID	A	B	C
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

U

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

Similarly Intersection (returns the rows that are common to both tables) and Set Difference EXCEPT/MINUS (returns the rows that are in one table but not in the other)manipulate tables as Sets

The way IDs are treated (whether as unique or non-unique identifiers) can lead to different behaviors in these operations

14 of 36

7. MERGE OR JOIN

Combine rows/tuples across two tables if they have the same key

14

ID	A	B
1	foo	3
2	bar	2
3	foo	4
4	foo	3

ID	C
1	1.2
2	2.5
3	2.3
5	8.0

ID	A	B	C
1	foo	3	1.2
2	bar	2	2.5
3	foo	4	2.3

⨝

What about IDs not present in both tables?

Often need to keep them around

Can “pad” with NaN

15 of 36

7. MERGE OR JOIN

Combine rows/tuples across two tables if they have the same key

Outer joins can be used to ”pad” IDs that don’t appear in both tables

Three variants: LEFT, RIGHT, FULL

SQL Terminology – pandas has these operations as well

15

ID	A	B
1	foo	3
2	bar	2
3	foo	4
4	foo	3

ID	C
1	1.2
2	2.5
3	2.3
5	8.0

ID	A	B	C
1	foo	3	1.2
2	bar	2	2.5
3	foo	4	2.3
4	foo	3	NaN
5	NaN	NaN	8.0

⟗

16 of 36

Pandas First Steps: install and import

Pandas is an easy package to install. Open up your terminal program (shell or cmd) and install it using either of the following commands:

For jupyter notebook users, you can run this cell:

The ! at the beginning runs cells as if they were in a terminal.

To import pandas we usually import it with a shorter name since it's used so much:

16

$ conda install pandas

OR

$ pip install pandas

!pip install pandas

import pandas as pd

Installation: https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html

After installing pandas, it's good practice to check the installed version to ensure

print(pd.__version__)

“pd” → alias (in Python): an alternate name for referring to the same thing .

17 of 36

Common Pandas Tasks and Workflow

https://colab.research.google.com/drive/1i2IJTAodNi3RVYuoGM5INU4bHOiN-eBU?usp=sharing

Class Colab:

https://drive.google.com/file/d/1au5qJnRR57pOeAt4aefzxwsmG02JyaR_/view?usp=sharing

Dataset:

Loading data: CSV, Excel, databases
Inspecting data: First few rows, column names, data types
Cleaning data Handle missing values Fix data types Remove duplicates
Filtering & sorting Select rows/columns Apply conditions
Aggregation: Mean, median, count, group by categories

18 of 36

Pandas Key Data Structures:

Core components of Pandas: Series & DataFrames

The Series is a single column of data

1D labeled array
Index + values
Represents one variable (one column)
Example: [72, 85, 90, 88]

The DataFrame holds the entire tabular dataset

2D labeled table
Rows + columns
Each column is a Series

Example:

Name Age Score

Alice 20 85

Bob 21 90

Pandas stores data in two key structures:

Key Idea:

Series → building block
DataFrame → collection of Series

Supports common operations: sum(), count(), unique(), etc.

Columns → accessed by name �Rows by index → less common

19 of 36

# Read CSV file

df = pd.read_csv("data.csv")

# Read Excel file (XLS / XLSX)

df = pd.read_excel("data.xlsx") # file path as input

# Read text file (whitespace-separated)

df = pd.read_csv("diabetes.txt", sep="\s") # sep = separator (\s = whitespace)

# Read JSON file

df = pd.read_json("diabetes.json")

Key idea: Pandas can load data from multiple file formats using the file path as input.

Reading Data

# Write DataFrame to CSV

df.to_csv("diabetes_out.csv", index=False)

# Write Series to CSV

s.to_csv("series_out.csv")

# Write DataFrame to Excel (XLS / XLSX)

df.to_excel("diabetes_out.xlsx", index=False)

# Write DataFrame to text file (space-separated)

df.to_csv("diabetes_out.txt", header=True, index=False, sep=" ")

# Write DataFrame to JSON

df.to_json("diabetes_out.json")

Key idea: Use to_*() methods to save Pandas objects (Series or DataFrames) to different file formats.

Outputting data

20 of 36

Reading data from a CSV file

With CSV files, all you need is a single line to load in the data:

20

df = pd.read_csv('dataset.csv')

Load a CSV file into a Pandas DataFrame:

*** If the data contains only one column and you specify squeeze=True, pandas will convert the result to a Series.

21 of 36

Basic Operations: Exploring Data

# Display First 5 rows df.head() # Display Last 3 rows df.tail(n=3) [specify the number of rows through the n argument (the default value is 5)].	# Data types df.dtypes # Summary statistics df.describe() prints the summary statistics of all numeric columns, such as count, mean, standard deviation, range, and quartiles of numeric columns.	# Shape (rows, columns) df.shape # Column names df.columns df.info() # Overview of columns and types

Get the number of rows and columns

df.shape[0] # Get the number of rows only

df.shape[1] # Get the number of columns only

22 of 36

Basic Pandas Operations: Selecting & Modifying Data

# Select a column (Series)

df["Age"]

# Select multiple columns (DataFrame)

df[["Name", "City"]]

# Select rows by index

df.iloc[0] # first row

df.iloc[1:3] # rows 1–2

# Select rows by condition

df[df["Age"] > 30]

Access pattern:

dataframe["column_name"] → access a column (e.g., df["age"])

Arithmetic Operations on Columns

# Create new columns

df["new_column"] = df["column1"] + df["column2"]

df["new_column"] = df["column1"] - df["column2"]

# Modify an existing column

df["Age"] = df["Age"] + 1

Key idea: Pandas supports vectorized operations; math is applied to entire column

23 of 36

Applying Functions Directly to the dataframe

You can apply different functions directly to a DataFrame column:

Syntax: dataframe[column_name].function()

23

E.g.: df[“age”].sum()

24 of 36

Filtering in Pandas

1. Column Filtering (Selecting Columns): Select specific columns by name using basic selection, e.g.df[['col1']] or df[['col1', 'col2']].	Feature selection for modeling Reduce memory by dropping unused columns
2. Row Filtering (Selecting Rows): Select rows based on conditions on column values using Boolean Indexing (Most common one). E.g. df[df['col'] > value].	Analyze subsets of data Clean data (remove outliers, duplicates)

Two Types:

Focus on specific subsets of your data

25 of 36

Filtering Columns:Filtering a Pandas DataFrame by column values

2. Filtering Rows: How to Filter Rows based on Condition (s)

Extract specific rows based on conditions applied to one or more columns, making it easier to work with relevant subsets of data.

Filter rows where a column meets a condition (Boolean Indexing):E.g.: Filter rows where age is greater than 25.

25

# Sample DataFrame

df = pd.DataFrame({

'name': ['Alice', 'Bob', 'Charlie'],

'age': [25, 30, 35],

'city': ['NY', 'LA', 'Chicago']

})

# Method 1: Select specific columns

df_cols = df[['name', 'age']]

# Keeps only 'name' and 'age'

dataframe [boolean condition]

df [df[‘Age’]>29]

Known as “Boolean Indexing”: Use boolean indexing to filter data based on conditions. ex: filtered_df = df[df['Value'] > 20]

Q: What about multiple conditions?

26 of 36

Counting & Aggregating Values in Pandas

E.g: Get Statistics

Count unique values (frequency)

Use when working with categorical data
Returns counts sorted by default (descending)�df["column"].value_counts()

Works on a Series (single column); Can be applied after filtering

Expected output:

Method	Action	Works On
.sum()	Sum of values	DataFrame/Series
.mean()	Mean (average)	DataFrame/Series
.count()	Non-NA values	DataFrame/Series
.size()	All rows (including NA)	GroupBy only
.min()/.max()	Minimum/Maximum	DataFrame/Series
.std()/.var()	Standard deviation/Variance	DataFrame/Series
.describe()	Summary statistics (count, mean, std, etc.)	DataFrame/Series

27 of 36

Applying Aggregation Functions Directly to a DataFrame

We can use different key statistical methods directly on a DataFrame or Series.

(Next Topic)

28 of 36

More Advanced: Filtering Data & Apply Statistical Functions

Filter rows, then apply a statistical function to a column:

df[df[condition]][column].statistics_function()

Step 1: Filter rows that satisfy the condition
Step 2: Select a specific column (Series)
Step 3: Apply a statistic (mean, sum, max, etc.)

Example: Find the Mean of 'Number' where Age > 25

df[df['Age'] > 25]['Number'].mean()

Filter rows → select column → apply statistic

29 of 36

Grouping Data (Groupby)

Split data into groups, apply a function, and combine results

df.groupby('group_column_name')['target_column_name'].statistics_function()

Step 1: Split rows into groups based on unique values in group_column
Step 2: Select a column from each group
Step 3: Apply a statistic (mean, sum, count, etc.) to each group

Example: Average score per category

df.groupby('Category')['Score'].mean()

29

split → apply → combine

GroupBy: Common Usage Patterns

Group by a single column

df.groupby('column_name')

Group by multiple columns

df.groupby(['col1', 'col2'])

Group + aggregation (most common)

[column_name] selects a specific column from each group (recommended but optional).

df.groupby('category')['value'].mean() → mean of one column (value)

or

df.groupby('category').mean() → mean of all numeric columns

Remember: groupby() groups rows first; aggregation happens after selecting a column.

30 of 36

Example: Grouping Data (Groupby)

30

1. group by “order”

2. Apply “Sum” to each group

3. Combine the result

31 of 36

Group By - More Examples (1)

Group the rows of the DataFrame into two categories based on whether the Values column is greater than equal 15 or not, and calculate the sum of the Values for each group.

Output:

32 of 36

Group By - More Complicated Examples (2)

Find the maximum Values for each Category, but only include rows where Values are less than 25.

Step 1: Filter rows where `Values` < 25

Step 2: Group by `Category` and calculate the maximum `Values`
Output:

33 of 36

Pandas Indexing: loc vs iloc

loc : label-based

Select rows and columns by labels (names)
Slicing is inclusive

df.loc[row_label, column_label]

df.loc[1, 'Age'] # Selects the value in row 1 and column 'Age'

df.loc[2:4, ['A', 'B']] # Access rows with labels from index 2 to 4 and columns 'A' and 'B'

iloc:position-based

Select rows and columns by integer positions
Slicing is end-exclusive

df.iloc[row_index, column_index]

df.iloc[1, 1] # Selects the value in the second row, second column

df.iloc[2:5, [0, 1]] #Access rows from index 2 to 4 and columns at positions 0 and 1

Remember

loc → labels / names
iloc → numbers / positions

Same idea, different way to index.

34 of 36

Visualize DataFrames

Pandas dataframes can be visualized using Matplotlib.

Install Matplotlib using conda install matplotlib command.

pip/conda install matplotlib

Visualization helps in understanding the dataframe content.

Plotting dataframes is a useful way to see results.

01

02

03

04

35 of 36

Pandas: Advantages

A powerful python library for data tasks like analysis and cleaning.
Simplifies data representation for better understanding.
Cleans messy datasets for readability and relevance.
Increases productivity by minimizing writing.
Offers extensive features for easy data analysis.

35

36 of 36

References

The End

pandas documentation

https://pandas.pydata.org/pandas-docs/stable/index.html

pandas: Input/output

https://pandas.pydata.org/pandas-docs/stable/reference/io.html

pandas: DataFrame

https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

pandas: Series

https://pandas.pydata.org/pandas-docs/stable/reference/series.html

pandas: Plotting

https://pandas.pydata.org/pandas-docs/stable/reference/plotting.html

More Helpful references:

https://github.com/TirendazAcademy/PANDAS-TUTORIAL/blob/main/01-What-is-Pandas.ipynb
Here's a good playlist by Kevin Markham, a data science instructor, on how to do specific things in Pandas: https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y
Python Data Science Handbook by Jake VanderPlas
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

CheatSheet: ttps://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf

36

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0

ID	A	B	C
1	foo	3	6.6
2	bar	2	4.7
3	foo	4	3.1
4	foo	3	8.0
5	bar	1	1.2
6	bar	2	2.5
7	foo	4	2.3
8	foo	3	8.0