1 of 36

Pandas

Lecture 03

1

CMSC 320

 

INTRODUCTION TO �DATA SCIENCE

– FARDINA FATHMIUL ALAM

           (fardina@umd.edu)

“A Python Library for Manipulating Tabular Data”

2 of 36

Topics to Cover

  • Python: It's a user-friendly language for data analysis.
  • Git: Helps manage code and data changes when working with a team.
  • Pandas: Simplifies data cleaning and manipulation.
  • Databases: Needed for storing and retrieving data efficiently.

2

Chapter 4 in : https://ffalam.github.io/CMSC320TextBook/chapter4/Chapter_4_1.html

  • What is Pandas
  • Tabular data mindset
  • Core operations (select, filter, group)
  • Series & DataFrames and many more….

These skills are essential for effective data analysis, collaboration, and handling data in real-world projects.

3 of 36

4 of 36

Tabular Data: Key Concept

  • Effective data analysis and visualization.
  • Pandas and SQL share operations like select, join, and aggregates.

Tabular structured format is essential for both Pandas and SQL.

5 of 36

1. SELECT/SLICING

Select only some of the rows, or some of the columns, or a combination

5

ID

age

wgt_kg

hgt_cm

1

12.2

42.3

145.1

2

11.0

40.8

143.8

3

15.6

65.3

165.3

4

35.1

84.2

185.8

ID

age

1

12.2

2

11.0

3

15.6

4

35.1

Only columns

ID and Age

Only rows with wgt > 41

Both with wgt > 41

ID

age

wgt_kg

hgt_cm

1

12.2

42.3

145.1

3

15.6

65.3

165.3

4

35.1

84.2

185.8

ID

age

1

12.2

3

15.6

4

35.1

6 of 36

2. AGGREGATE/REDUCE

Combine values across a column into a single value

6

ID

age

wgt_kg

hgt_cm

1

12.2

42.3

145.1

2

11.0

40.8

143.8

3

15.6

65.3

165.3

4

35.1

84.2

185.8

SUM

SUM(wgt_kg^2 - hgt_cm)

73.9

232.6

640.0

MAX

35.1

84.2

185.8

14167.66

What about ID/Index column?

Usually not meaningful to aggregate across it

May need to explicitly add an ID column

7 of 36

3. MAP

Apply a function to every row, possibly creating more or fewer columns

7

ID

Address

1

College Park, MD, 20742

2

Washington, DC, 20001

3

Silver Spring, MD, 20901

Variations that allow one row to generate multiple rows in the output (sometimes called “flatmap”)

ID

City

State

Zipcode

1

College Park

MD

20742

2

Washington

DC

20001

3

Silver Spring

MD

20901

8 of 36

4. GROUP BY

Group tuples together by column/dimension

8

ID

A

B

C

1

foo

3

6.6

2

bar

2

4.7

3

foo

4

3.1

4

foo

3

8.0

5

bar

1

1.2

6

bar

2

2.5

7

foo

4

2.3

8

foo

3

8.0

ID

B

C

1

3

6.6

3

4

3.1

4

3

8.0

7

4

2.3

8

3

8.0

ID

B

C

2

2

4.7

5

1

1.2

6

2

2.5

A = foo

A = bar

By ‘A’

9 of 36

4. GROUP BY

Group tuples together by column/dimension

9

ID

A

B

C

1

foo

3

6.6

2

bar

2

4.7

3

foo

4

3.1

4

foo

3

8.0

5

bar

1

1.2

6

bar

2

2.5

7

foo

4

2.3

8

foo

3

8.0

By ‘B’

ID

A

C

5

bar

1.2

B = 1

ID

A

C

2

bar

4.7

6

bar

2.5

ID

A

C

3

foo

3.1

7

foo

2.3

ID

A

C

1

foo

6.6

4

foo

8.0

8

foo

8.0

B = 3

B = 2

B = 4

10 of 36

4. GROUP BY

Group tuples together by column/dimension

10

ID

A

B

C

1

foo

3

6.6

2

bar

2

4.7

3

foo

4

3.1

4

foo

3

8.0

5

bar

1

1.2

6

bar

2

2.5

7

foo

4

2.3

8

foo

3

8.0

By ‘A’, ‘B’

ID

C

5

1.2

A = bar, B = 1

ID

C

2

4.7

6

2.5

ID

C

3

3.1

7

2.3

ID

C

1

6.6

4

8.0

8

8.0

A = foo, B = 3

A = bar, B = 2

A = foo, B = 4

11 of 36

5. GROUP BY AGGREGATE

Compute one aggregate per group

11

ID

A

B

C

1

foo

3

6.6

2

bar

2

4.7

3

foo

4

3.1

4

foo

3

8.0

5

bar

1

1.2

6

bar

2

2.5

7

foo

4

2.3

8

foo

3

8.0

Group by ‘B’

Sum on C

ID

A

C

5

bar

1.2

B = 1

ID

A

C

2

bar

4.7

6

bar

2.5

ID

A

C

3

foo

3.1

7

foo

2.3

ID

A

C

1

foo

6.6

4

foo

8.0

8

foo

8.0

B = 3

B = 2

B = 4

Sum (C)

1.2

B = 1

B = 3

B = 2

B = 4

Sum (C)

22.6

Sum (C)

7.2

Sum (C)

5.4

12 of 36

5. GROUP BY AGGREGATE

Final result usually seen as a table

12

ID

A

B

C

1

foo

3

6.6

2

bar

2

4.7

3

foo

4

3.1

4

foo

3

8.0

5

bar

1

1.2

6

bar

2

2.5

7

foo

4

2.3

8

foo

3

8.0

Group by ‘B’

Sum on C

Sum (C)

1.2

B = 1

B = 3

B = 2

B = 4

Sum (C)

22.6

Sum (C)

7.2

Sum (C)

5.4

B

SUM(C )

1

1.2

2

7.2

3

22.6

4

5.4

13 of 36

6. UNION / INTERSECTION / DIFFERENCE

Set operations – only if the two tables have identical attributes/columns

13

13

ID

A

B

C

1

foo

3

6.6

2

bar

2

4.7

3

foo

4

3.1

4

foo

3

8.0

ID

A

B

C

5

bar

1

1.2

6

bar

2

2.5

7

foo

4

2.3

8

foo

3

8.0

U

ID

A

B

C

1

foo

3

6.6

2

bar

2

4.7

3

foo

4

3.1

4

foo

3

8.0

5

bar

1

1.2

6

bar

2

2.5

7

foo

4

2.3

8

foo

3

8.0

Similarly Intersection (returns the rows that are common to both tables) and Set Difference EXCEPT/MINUS (returns the rows that are in one table but not in the other)manipulate tables as Sets

The way IDs are treated (whether as unique or non-unique identifiers) can lead to different behaviors in these operations

14 of 36

7. MERGE OR JOIN

Combine rows/tuples across two tables if they have the same key

14

ID

A

B

1

foo

3

2

bar

2

3

foo

4

4

foo

3

ID

C

1

1.2

2

2.5

3

2.3

5

8.0

ID

A

B

C

1

foo

3

1.2

2

bar

2

2.5

3

foo

4

2.3

What about IDs not present in both tables?

Often need to keep them around

Can “pad” with NaN

15 of 36

7. MERGE OR JOIN

Combine rows/tuples across two tables if they have the same key

Outer joins can be used to ”pad” IDs that don’t appear in both tables

Three variants: LEFT, RIGHT, FULL

SQL Terminology – pandas has these operations as well

15

ID

A

B

1

foo

3

2

bar

2

3

foo

4

4

foo

3

ID

C

1

1.2

2

2.5

3

2.3

5

8.0

ID

A

B

C

1

foo

3

1.2

2

bar

2

2.5

3

foo

4

2.3

4

foo

3

NaN

5

NaN

NaN

8.0

16 of 36

Pandas First Steps: install and import

  • Pandas is an easy package to install. Open up your terminal program (shell or cmd) and install it using either of the following commands:

  • For jupyter notebook users, you can run this cell:
    • The ! at the beginning runs cells as if they were in a terminal.

  • To import pandas we usually import it with a shorter name since it's used so much:

16

$ conda install pandas

OR

$ pip install pandas

!pip install pandas

import pandas as pd

After installing pandas, it's good practice to check the installed version to ensure

print(pd.__version__)

“pd”alias (in Python): an alternate name for referring to the same thing .

17 of 36

Common Pandas Tasks and Workflow

Class Colab:

Dataset:

  • Loading data: CSV, Excel, databases
  • Inspecting data: First few rows, column names, data types
  • Cleaning data
    • Handle missing values
    • Fix data types
    • Remove duplicates
  • Filtering & sorting
    • Select rows/columns
    • Apply conditions
  • Aggregation: Mean, median, count, group by categories

18 of 36

Pandas Key Data Structures:

Core components of Pandas: Series & DataFrames

  1. The Series is a single column of data
    1. 1D labeled array
    2. Index + values
    3. Represents one variable (one column)
    4. Example: [72, 85, 90, 88]

  1. The DataFrame holds the entire tabular dataset
    1. 2D labeled table
    2. Rows + columns
    3. Each column is a Series
  2. Example:

Name Age Score

Alice 20 85

Bob 21 90

Pandas stores data in two key structures:

Key Idea:

  • Series → building block
  • DataFrame → collection of Series

Supports common operations: sum(), count(), unique(), etc.

Columns → accessed by name �Rows by index → less common

19 of 36

# Read CSV file

df = pd.read_csv("data.csv")

# Read Excel file (XLS / XLSX)

df = pd.read_excel("data.xlsx") # file path as input

# Read text file (whitespace-separated)

df = pd.read_csv("diabetes.txt", sep="\s") # sep = separator (\s = whitespace)

# Read JSON file

df = pd.read_json("diabetes.json")

Key idea: Pandas can load data from multiple file formats using the file path as input.

Reading Data

# Write DataFrame to CSV

df.to_csv("diabetes_out.csv", index=False)

# Write Series to CSV

s.to_csv("series_out.csv")

# Write DataFrame to Excel (XLS / XLSX)

df.to_excel("diabetes_out.xlsx", index=False)

# Write DataFrame to text file (space-separated)

df.to_csv("diabetes_out.txt", header=True, index=False, sep=" ")

# Write DataFrame to JSON

df.to_json("diabetes_out.json")

Key idea: Use to_*() methods to save Pandas objects (Series or DataFrames) to different file formats.

Outputting data

20 of 36

Reading data from a CSV file

  • With CSV files, all you need is a single line to load in the data:

20

df = pd.read_csv('dataset.csv')

Load a CSV file into a Pandas DataFrame:

*** If the data contains only one column and you specify squeeze=True, pandas will convert the result to a Series.

21 of 36

Basic Operations: Exploring Data

# Display First 5 rows

df.head()

# Display Last 3 rows

df.tail(n=3)

[specify the number of rows through the n argument (the default value is 5)].

# Data types

df.dtypes

# Summary statistics

df.describe()

prints the summary statistics of all numeric columns, such as count, mean, standard deviation, range, and quartiles of numeric columns.

# Shape (rows, columns)

df.shape

# Column names

df.columns

df.info() # Overview of columns and types

Get the number of rows and columns

df.shape[0] # Get the number of rows only

df.shape[1] # Get the number of columns only

22 of 36

Basic Pandas Operations: Selecting & Modifying Data

# Select a column (Series)

df["Age"]

# Select multiple columns (DataFrame)

df[["Name", "City"]]

# Select rows by index

df.iloc[0] # first row

df.iloc[1:3] # rows 1–2

# Select rows by condition

df[df["Age"] > 30]

Access pattern:

dataframe["column_name"] → access a column (e.g., df["age"])

Arithmetic Operations on Columns

# Create new columns

df["new_column"] = df["column1"] + df["column2"]

df["new_column"] = df["column1"] - df["column2"]

# Modify an existing column

df["Age"] = df["Age"] + 1

Key idea: Pandas supports vectorized operations; math is applied to entire column

23 of 36

Applying Functions Directly to the dataframe

You can apply different functions directly to a DataFrame column:

Syntax: dataframe[column_name].function()

23

E.g.: df[“age”].sum()

24 of 36

Filtering in Pandas

1. Column Filtering (Selecting Columns):

Select specific columns by name using basic selection, e.g.df[['col1']] or df[['col1', 'col2']].

  • Feature selection for modeling
  • Reduce memory by dropping unused columns

2. Row Filtering (Selecting Rows):

Select rows based on conditions on column values using Boolean Indexing (Most common one). E.g. df[df['col'] > value].

  • Analyze subsets of data
  • Clean data (remove outliers, duplicates)

Two Types:

Focus on specific subsets of your data

25 of 36

  1. Filtering Columns:Filtering a Pandas DataFrame by column values

2. Filtering Rows: How to Filter Rows based on Condition (s)

Extract specific rows based on conditions applied to one or more columns, making it easier to work with relevant subsets of data.

Filter rows where a column meets a condition (Boolean Indexing):E.g.: Filter rows where age is greater than 25.

25

# Sample DataFrame

df = pd.DataFrame({

'name': ['Alice', 'Bob', 'Charlie'],

'age': [25, 30, 35],

'city': ['NY', 'LA', 'Chicago']

})

# Method 1: Select specific columns

df_cols = df[['name', 'age']]

# Keeps only 'name' and 'age'

dataframe [boolean condition]

df [df[‘Age’]>29]

Known as “Boolean Indexing”: Use boolean indexing to filter data based on conditions. ex: filtered_df = df[df['Value'] > 20]

Q: What about multiple conditions?

26 of 36

Counting & Aggregating Values in Pandas

E.g: Get Statistics

Count unique values (frequency)

  • Use when working with categorical data
  • Returns counts sorted by default (descending)�df["column"].value_counts()

Works on a Series (single column); Can be applied after filtering

Expected output:

Method

Action

Works On

.sum()

Sum of values

DataFrame/Series

.mean()

Mean (average)

DataFrame/Series

.count()

Non-NA values

DataFrame/Series

.size()

All rows (including NA)

GroupBy only

.min()/.max()

Minimum/Maximum

DataFrame/Series

.std()/.var()

Standard deviation/Variance

DataFrame/Series

.describe()

Summary statistics (count, mean, std, etc.)

DataFrame/Series

27 of 36

Applying Aggregation Functions Directly to a DataFrame

We can use different key statistical methods directly on a DataFrame or Series.

(Next Topic)

28 of 36

More Advanced: Filtering Data & Apply Statistical Functions

Filter rows, then apply a statistical function to a column:

df[df[condition]][column].statistics_function()

  • Step 1: Filter rows that satisfy the condition
  • Step 2: Select a specific column (Series)
  • Step 3: Apply a statistic (mean, sum, max, etc.)

Example: Find the Mean of 'Number' where Age > 25

df[df['Age'] > 25]['Number'].mean()

Filter rows → select column → apply statistic

29 of 36

Grouping Data (Groupby)

Split data into groups, apply a function, and combine results

df.groupby('group_column_name')['target_column_name'].statistics_function()

  • Step 1: Split rows into groups based on unique values in group_column
  • Step 2: Select a column from each group
  • Step 3: Apply a statistic (mean, sum, count, etc.) to each group

Example: Average score per category

df.groupby('Category')['Score'].mean()

29

split → apply → combine

GroupBy: Common Usage Patterns

Group by a single column

  • df.groupby('column_name')

Group by multiple columns

  • df.groupby(['col1', 'col2'])

Group + aggregation (most common)

[column_name] selects a specific column from each group (recommended but optional).

  • df.groupby('category')['value'].mean() → mean of one column (value)

or

  • df.groupby('category').mean() → mean of all numeric columns

Remember: groupby() groups rows first; aggregation happens after selecting a column.

30 of 36

Example: Grouping Data (Groupby)

30

1. group by “order”

2. Apply “Sum” to each group

3. Combine the result

31 of 36

Group By - More Examples (1)

  • Group the rows of the DataFrame into two categories based on whether the Values column is greater than equal 15 or not, and calculate the sum of the Values for each group.
  • Output:

32 of 36

Group By - More Complicated Examples (2)

  • Find the maximum Values for each Category, but only include rows where Values are less than 25.
  • Step 1: Filter rows where `Values` < 25

  • Step 2: Group by `Category` and calculate the maximum `Values`
  • Output:

33 of 36

Pandas Indexing: loc vs iloc

loc : label-based

  • Select rows and columns by labels (names)
  • Slicing is inclusive

df.loc[row_label, column_label]

df.loc[1, 'Age'] # Selects the value in row 1 and column 'Age'

df.loc[2:4, ['A', 'B']] # Access rows with labels from index 2 to 4 and columns 'A' and 'B'

iloc:position-based

  • Select rows and columns by integer positions
  • Slicing is end-exclusive

df.iloc[row_index, column_index]

df.iloc[1, 1] # Selects the value in the second row, second column

df.iloc[2:5, [0, 1]] #Access rows from index 2 to 4 and columns at positions 0 and 1

Remember

  • loclabels / names
  • ilocnumbers / positions

Same idea, different way to index.

34 of 36

Visualize DataFrames

Pandas dataframes can be visualized using Matplotlib.

Install Matplotlib using conda install matplotlib command.

pip/conda install matplotlib

Visualization helps in understanding the dataframe content.

Plotting dataframes is a useful way to see results.

01

02

03

04

35 of 36

Pandas: Advantages

  • A powerful python library for data tasks like analysis and cleaning.
  • Simplifies data representation for better understanding.
  • Cleans messy datasets for readability and relevance.
  • Increases productivity by minimizing writing.
  • Offers extensive features for easy data analysis.

35

36 of 36

References

The End

  • pandas documentation

  • pandas: Input/output

  • pandas: DataFrame

  • pandas: Series

More Helpful references:

CheatSheet: ttps://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf

36