Machine Learning with Large Datasets (Fall 2025)
Recitation 1
August 29th, 2025
Special thanks to Tian Li and Prof. Heather Miller, who developed some of the material covered today.
Agenda
1. Brief Introduction to Google Colab
2. Spark Basics
3. Programming Demo
Homework Setup
Programming Assignments
Written Assignments
More info on Course Website - https://10605.github.io/
To-Dos for students
Registration and Uploading Notebooks
Exporting the Homework
Registration
Please only choose the community edition and NOT a cloud provider.
Login
Creating a Cluster
Installing Required Packages
Uploading/Importing the Homework
Interacting with Notebooks
Exporting the Homework
Important notes about clusters
PySpark
PySpark (local installation)
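To follow along locally instead of on Colab or Databricks, a minimal sketch of a local setup (assuming Python 3 and pip are available; the app name is arbitrary):

# One-time install:
#   pip install pyspark

from pyspark.sql import SparkSession

# "local[*]" runs Spark in-process using all available cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("recitation1") \
    .getOrCreate()

# The examples below assume this SparkContext, conventionally named sc.
sc = spark.sparkContext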
Spark Architecture
Execution of a Spark Program
Spark APIs
RDDs vs DataFrames
RDDs
DataFrames
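As a rough contrast between the two APIs (a sketch; the toy data and column names are made up for illustration):

# RDD: an untyped distributed collection manipulated with functions.
rdd = sc.parallelize([("alice", 34), ("bob", 29)])
ages = rdd.map(lambda row: row[1])

# DataFrame: named columns and declarative queries, optimized by Spark.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.select("age").show()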
Transformations vs Actions
Example: Map vs. Reduce
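A minimal sketch of the distinction, assuming a SparkContext sc: map only records a lazy transformation, while reduce is an action that returns a single value to the driver.

nums = sc.parallelize([1, 2, 3, 4])

# map is a transformation: recorded lazily, returns a new RDD.
squares = nums.map(lambda x: x * x)

# reduce is an action: triggers execution, returns a value to the driver.
total = squares.reduce(lambda a, b: a + b)  # 1 + 4 + 9 + 16 = 30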
Narrow vs Wide Transformations
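As a sketch of the difference (assuming sc): filter is narrow because each output partition depends on a single input partition, while reduceByKey is wide because records with the same key must be shuffled to the same partition.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: no data movement between partitions.
narrow = pairs.filter(lambda kv: kv[1] > 1)

# Wide: a shuffle moves matching keys to the same partition.
wide = pairs.reduceByKey(lambda x, y: x + y)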
How can we create an RDD?
What can you do when you have an RDD?
A small cheat-sheet with some nice examples can be found here.
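A minimal sketch of the two common creation routes (assuming sc; the file path is a placeholder, not a real dataset):

# From an in-memory Python collection:
rdd1 = sc.parallelize([1, 2, 3, 4])

# From an external text file, one element per line (placeholder path):
rdd2 = sc.textFile("data/input.txt")

# Once you have an RDD, you chain transformations and finish with an action:
print(rdd1.map(lambda x: x + 1).collect())  # [2, 3, 4, 5]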
Interacting with RDDs
Lambda vs Regular Functions
Lambda Functions
Regular Function
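A small sketch showing that the two styles are interchangeable as arguments to map (assuming sc):

rdd = sc.parallelize(["bob", "alice", "bill"])

# Lambda: anonymous and concise, best for one-line logic.
upper1 = rdd.map(lambda name: name.upper())

# Regular function: named and reusable, easier to test and document.
def to_upper(name):
    return name.upper()

upper2 = rdd.map(to_upper)
assert upper1.collect() == upper2.collect()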
Spark Execution
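One way to see lazy execution (a sketch, assuming sc): toDebugString prints the recorded lineage before any work happens, and only the action at the end triggers computation.

rdd = sc.parallelize(range(8)).map(lambda x: x * 2).filter(lambda x: x > 4)

# Nothing has run yet; Spark has only recorded the lineage (the DAG).
print(rdd.toDebugString().decode())

# collect() is an action: it executes the whole pipeline.
print(rdd.collect())  # [6, 8, 10, 12, 14]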
Cache/Persistence
Reference Link - https://data-flair.training/blogs/apache-spark-rdd-persistence-caching/
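A hedged sketch of caching (assuming sc; MEMORY_AND_DISK is just one of several storage levels):

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# cache() keeps computed partitions in memory after the first action,
# so later actions reuse them instead of recomputing the lineage.
rdd.cache()
rdd.count()  # computes and caches
rdd.sum()    # served from the cache

# persist() lets you pick the storage level explicitly.
rdd2 = sc.parallelize(range(1000)).persist(StorageLevel.MEMORY_AND_DISK)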
Example
What does the following code snippet do?
peopleRdd = sc.parallelize(["bob", "alice", "bill"])
words_filter = peopleRdd.filter(lambda x: 'i' in x)
filtered = words_filter.collect()
The first line creates the RDD.
The second line does nothing yet, since filter is a lazy transformation.
The third line executes the DAG, since collect is an action; the driver receives ['alice', 'bill'].
Example #2
What does the following code snippet do on the driver node?
peopleRdd = sc.parallelize(["bob", "alice", "bill"])
peopleRdd.foreach(print)
peopleRdd.take(2)
On the driver: nothing. Why?
foreach is an action that returns nothing, so it is eagerly executed on the executors rather than the driver. Any calls to print therefore write to the stdout of the worker nodes and are not visible in the stdout of the driver node.
What about when take is called? Where will the array of people end up?
When an action returns a result, it returns it to the driver node, so ['bob', 'alice'] ends up on the driver.
Example #3
What does the following code snippet do on the driver node?
peopleRdd = sc.parallelize(["bob", "alice", "bill"])
newRDD = peopleRdd.foreach(print)
print(type(newRDD))
newRDD = peopleRdd.map(print)
print(type(newRDD))
newRDD.take(2)
newRDD = peopleRdd.map(lambda x: "Dr. " + x)
newRDD.take(2)
<class 'NoneType'>
foreach is an action that returns nothing, so the returned value has type NoneType.
<class 'pyspark.rdd.PipelinedRDD'>
[None, None]
map is a transformation, and it returns another RDD whose values are the results of the mapped function. Since print() returns nothing, take(2) yields [None, None].
['Dr. bob', 'Dr. alice']
map is a transformation, so it returns a new RDD with "Dr. " added before each name.
Example #4
What does the following code snippet output?
data = ["Project Gutenberg's", "Alice's Adventures in Wonderland", "Project Gutenberg's"]
rdd = sc.parallelize(data)
rdd_flattened = rdd.flatMap(lambda x: x.split(" "))
result = rdd_flattened.collect()
# result = ["Project", "Gutenberg's", "Alice's", "Adventures", "in", "Wonderland", "Project", "Gutenberg's"]
Example #5
What does reduced_rdd look like?
word_pairs = [
    ("apple", 1), ("banana", 1), ("apple", 1), ("apple", 1), ("banana", 1)
]
rdd = sc.parallelize(word_pairs)
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
# Result: [("apple", 3), ("banana", 2)]
Example #6
What happens here?
word_pairs = [
    ("apple", 1), ("banana", 1), ("apple", 1), ("apple", 1), ("banana", 1)
]
rdd = sc.parallelize(word_pairs)
grouped_rdd = rdd.groupByKey()
result = grouped_rdd.mapValues(list)
# Result: [("apple", [1, 1, 1]), ("banana", [1, 1])]
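As a follow-up comparison (a sketch, not a benchmark): reduceByKey typically moves less data than groupByKey, because it combines values within each partition before the shuffle, whereas groupByKey shuffles every (key, value) pair and aggregates afterwards.

rdd = sc.parallelize([("apple", 1), ("banana", 1), ("apple", 1),
                      ("apple", 1), ("banana", 1)])

counts_fast = rdd.reduceByKey(lambda x, y: x + y)   # combine before shuffle
counts_slow = rdd.groupByKey().mapValues(sum)       # shuffle everything first

assert sorted(counts_fast.collect()) == sorted(counts_slow.collect())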
FAQs/Helpful Debugging Instructions
Appendix
Group By Key Example - PySpark