10-405/10-605

Machine Learning with Large Datasets

Homework Setups

Tian Li

01/17

Homework Overview

Register for the free Community Edition of Databricks
Import the IPython notebook file we provide
Configure the environment according to the instructions in the writeup (create a cluster, install a third-party package, and start running the notebook)
Submit your solution to Gradescope (see the writeup)

Registration

*Make sure to choose the community edition*

When logging in, still choose the Community Edition

Import Lab Files

Installing Third-party Packages

“nose”
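Once the package is installed, a quick check from a notebook cell can confirm that it is visible to Python. The following is a minimal sketch, not part of the official writeup; the helper name `is_installed` is hypothetical, and it only uses the standard library.

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be located by the import system."""
    return importlib.util.find_spec(package_name) is not None

# "nose" is the package named on this slide.
if is_installed("nose"):
    print("nose is available")
else:
    print("nose is missing; install it on the cluster as described in the writeup")
```

If the check reports that the package is missing, the cluster may not have finished installing it yet, or the notebook may be attached to a different cluster.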

Creating a Cluster

Choose the default Spark version

Use Python 3

Notes about Clusters

Spark version: default on Databricks (2.4.4); Python 3
It may take a while to launch the cluster (e.g., 20 seconds)
The cluster status should be ‘active’ for it to be functional
The community edition only allows for one cluster, which is essentially a single machine

When you start a second notebook, either delete the current cluster and create a new one, or attach to and activate the existing (terminated) cluster
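The version notes above can be checked from a notebook cell. This is a rough sketch under stated assumptions: the helper names `python_is_3` and `spark_version_matches` are hypothetical, and the code assumes that a notebook attached to a running cluster exposes a predefined `spark` session, which it looks up defensively so the cell also runs elsewhere.

```python
import sys

def python_is_3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3."""
    return version_info[0] == 3

def spark_version_matches(version, expected_prefix="2.4."):
    """Return True when a Spark version string starts with the expected prefix."""
    return version.startswith(expected_prefix)

print("Python 3:", python_is_3())

# Assumption: `spark` exists only in a notebook attached to a running cluster,
# so look it up instead of referencing it directly.
session = globals().get("spark")
if session is not None:
    print("Spark version:", session.version)
    print("Matches 2.4.x:", spark_version_matches(session.version))
else:
    print("No `spark` session found; attach the notebook to an active cluster.")
```

If no session is found, the cluster is most likely terminated or the notebook is not attached to it.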

Interact with Notebooks