
M0: Setup & Centralized Computing

Deadline: Wed, Jan. 29, 2025 11:59PM ET

Latest handout version: CS1380:2025:M0

Gear up: M0 Video

This milestone combines several goals: (1) to set up and familiarize you with the environment and infrastructure used for these assignments; (2) to refresh (and confirm) your background in developing systems with the languages and libraries used in this course; and (3) to start thinking about modularizing and testing the system throughout the development process.

By the end of this assignment you will be familiar with the basics of JavaScript, shell scripting, stream processing, Docker containers, deployment to AWS, and performance characterization—all of which will be useful for the rest of the project.

Key Expectations & Checklist

Background & Context

Setting up your Environment

A Centralized Search Engine

Testing & Production Environments

Cloud Deployment on AWS

Lab-only Extensions and Extra Credit

Usage Scenarios: Familiarize Yourself with the Project

Results and Reflections

Submission, Tips, FAQ, and Feedback

Key Expectations & Checklist

The table below contains the tasks comprising this milestone, along with a rough estimate of the lines of code they require (JavaScript, shell, configuration, etc.) and their rough difficulty relative to each other. You should interpret these as order-of-magnitude estimates, as they depend on your familiarity and background with the relevant languages and environments. Make sure you read the entire handout before starting, and refer back to this list frequently as you make progress.

Tasks (with order-of-magnitude lines of code and relative difficulty):

  • T1: set up and familiarize yourself with the codebase (O(LoC): 0, difficulty: 0.3)
  • T2: implement 5 JS components (O(LoC): 200, difficulty: 2)
  • T3: implement 1 shell component (O(LoC): 1, difficulty: 0.2)
  • T4: implement at least 8 test cases (O(LoC): 30, difficulty: 0.5)
  • T5: deploy the solution on the cloud (AWS EC2 nodes) (O(LoC): 10, difficulty: 0.5)
  • T6: confirm correctness locally, on the cloud, and on Gradescope (O(LoC): 1, difficulty: 0.1)
  • T7: characterize throughput locally, on the cloud, and on Gradescope (O(LoC): 5, difficulty: 0.3)
  • T8: document your findings in the README.md and package.json (O(LoC): 10, difficulty: 0.1)

Rough total: 300 LoC

The extras below are for students taking the lab (CS1385 and CS2385) or students interested in extra credit, calculated as a percentage bonus over the base score.

Extras: Lab and Extra-Credit (with extra credit as a percentage of the base score):

  • E1: implement 3 additional JS components (EC: 10% x3)
  • E2: implement TF-IDF (EC: 20%)

Background & Context

Fundamentally, a search engine for the web consists of three subcomponents: a crawler that downloads pages, an indexer that builds an inverted index over their content, and a query component that retrieves relevant pages from that index.

For this milestone, we will be developing the simplest possible non-distributed subcomponents using a combination of shell and JavaScript code—enough to get a good sense of these three components. We will also deploy our implementation on real cloud infrastructure, test our solution, characterize its performance, and reflect on some of the design decisions.

The observable behavior (e.g., results, output, etc.) of this non-distributed implementation will be used to characterize the correctness and performance of the final distributed search engine.

The following online resources might be useful for this assignment:

Please skim the entire assignment before diving into these resources, and only study them on an as-needed basis. In any case, make sure that you use version control—ideally Git and GitHub—and that you commit and push frequently.

Setting up your Environment

The stencil code for all milestones is in a single repository.

 

git clone https://github.com/brown-cs1380/stencil

Once you have cloned the repo, you need to set up the appropriate development environment. The development environment assumes a standard Unix/Linux environment; Windows and Mac OS X users can easily use the Docker container that has been set up. We recommend using the standardized Docker container for the following reasons: (1) you avoid incompatibilities between the software used for this class and other software on your machine; (2) in the unfortunate event that something goes wrong, you can simply wipe the container and restart it from a known checkpoint; (3) the container image works in every environment, including the grading infrastructure; (4) the teaching team can reproduce and understand errors from students within the same environment; and (5) containerization simplifies deployment in a real cloud environment such as AWS.

To set up and use Docker, follow the instructions from the Docker website. Then download the course image and follow the instructions in the README file — this will provide you with an appropriate base environment:

docker pull ghcr.io/brown-cs1380/container:main

This image has been set up with an appropriate version of a Linux distribution (Ubuntu 22.04.3 LTS), userspace utilities (GNU Coreutils v8.32-4), the ShellCheck linting tool (v0.8), a JavaScript engine (Node v20.9.0), the Node package manager (npm v10.1), and basic support for editing, versioning, the AWS CLI, and checkpointing.

If you're not planning on using Docker—likely because you are already running a Unix or Unix-based environment such as Linux, MacOS, or WSL with appropriate dependencies installed—simply clone the M0 repository from GitHub and then download the necessary dependencies:

git clone https://github.com/brown-cs1380/stencil

apt install curl

curl -fsSL https://deb.nodesource.com/setup_20.x | bash -

apt install nodejs shellcheck git zip build-essential


cd stencil

npm install # Install the distribution library for future milestones

cd non-distribution && npm install # Install dependencies for this milestone

These npm dependencies are the only third-party JavaScript dependencies allowed for completing the course project.

Task 1: Set up and familiarize yourself with the codebase

Read the entire handout and then move on to the scenarios at the end of the handout.

A Centralized Search Engine

The core search engine can be written as a single loop, which reads a URL from a set of URLs and (1) downloads the page, extracting further URLs and additional content, and (2) analyzes the content to compute inverted indices and add them to the global index. The figure below shows an overview of the system—with the components in red being the core implementation focus of this milestone.

At the top level, the engine script reads URLs in a loop and performs the following subtasks: (1) it crawls each URL, extracting its URLs and text, and writes the text to content.txt and new URLs to urls.txt for further crawling, and (2) it indexes the content, morphing it into an inverted index and merging that with the global index.txt (see below). Note that the inverted index has a special structure, mapping sorted terms to URLs (sorted by weight), so merging indices requires maintaining a few different invariants. The engine stops either when it sees the keyword stop as a URL or when it has seen all visited URLs—i.e., when the length of visited.txt is greater than or equal to the length of urls.txt. Finally, the component q allows users to query the inverted index.

The entire system operates in a streaming fashion, with elements at the various stages (e.g., web pages, documents, terms) being processed concurrently by different processing units. For example, one element might be downloaded while another is going through link extraction while yet another is getting indexed — and the user can still use the tool to extract results from the indices. The indexing component applies some basic natural-language processing to extract and store terms, bigrams, and trigrams from these pages, along with pointers back to the pages in which they appear.
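To make this streaming style concrete, each JavaScript component in c/ can be structured as a small filter that reads lines from standard input and writes transformed lines to standard output, so that the shell scripts can chain components together with pipes. The snippet below is only an illustrative skeleton using Node built-ins; the names and the placeholder transformation are not the stencil's actual interfaces:

#!/usr/bin/env node
// Illustrative stdin-to-stdout filter skeleton for one pipeline stage.
const readline = require('readline');

const rl = readline.createInterface({ input: process.stdin });

rl.on('line', (line) => {
  // Placeholder transformation; a real component would stem terms,
  // extract URLs, merge indices, etc.
  process.stdout.write(line.toLowerCase() + '\n');
});

Written this way, a component can also be exercised in isolation by piping a small input file into it from the command line.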

The crawl.sh and index.sh scripts run components in the non-distribution/c/ (for code) directory — for example, getURLs.js and invert.sh: crawl.sh downloads pages and index.sh gradually builds the inverted index. All these scripts read from and write to various data files in the non-distribution/d/ (for data) directory. Various tests showing the structure of intermediate data streams exist in the non-distribution/t/ directory. An additional query.js script can be used (when implemented) to query the inverted index and return relevant pages.

Here are the components needed to complete the implementation:

  1. stem.js: replace individual words with their root (i.e., their stem)
  2. getText.js: convert an incoming HTML stream to text
  3. getURLs.js: identify and extract URLs
  4. process.sh: harmonize text (e.g., convert to lowercase, remove stopwords)
  5. merge.js: merge the inverted index of the current page with the global inverted index
  6. query.js: identify relevant pages using 1–3 search terms

The merge.js implementation requires significant attention to ensure all the expected invariants of the inverted index hold.
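As a rough illustration of those invariants (and only an illustration: the authoritative on-disk format is whatever the stencil and its tests in non-distribution/t/ expect), a merge over an in-memory representation that maps each term to a list of { url, freq } postings might look like this:

// Illustrative merge of a per-page index into the global index; the data
// layout here is an assumption, not the stencil's actual format.
function mergeIndices(globalIndex, pageIndex) {
  for (const [term, postings] of Object.entries(pageIndex)) {
    const existing = globalIndex[term] || [];
    for (const { url, freq } of postings) {
      const hit = existing.find((p) => p.url === url);
      if (hit) {
        hit.freq += freq; // same page seen again: accumulate its frequency
      } else {
        existing.push({ url, freq });
      }
    }
    existing.sort((a, b) => b.freq - a.freq); // URLs stay sorted by weight
    globalIndex[term] = existing;
  }
  return globalIndex;
}

// When writing the result back out, emit terms in sorted order, e.g.,
// Object.keys(globalIndex).sort(), to preserve the sorted-terms invariant.

Keeping the postings sorted at merge time makes queries cheap, since the most relevant URLs for a term are always at the front of its list.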

Tasks 2 & 3: Implement 5 JavaScript programs and 1 shell script

Implement all six components mentioned above.

Testing & Production Environments

Characterizing the correctness and performance of the system being developed is important: tests aimed at correctness increase our confidence that the system we're developing operates according to its specification and help catch bugs early; tests aimed at performance characterize how fast the system processes its workload. Tests need to be executed in all environments — first in your local development environment and then on the Amazon cloud.

For end-to-end correctness testing and performance characterization, we have set up a few testing corpora: a tiny web graph, a small collection of full books, a larger collection of book summaries, and a web of quotes.

Correctness: We have provided some tests for some of these components; you will need to provide additional tests for these components, as well as tests for the other components you develop. Run the linter continuously—ensuring that every time you commit code it passes both the linter checks and all tests invoked via npm test.

Task 4: Implement a minimum of 8 tests

Implement at least one test in JavaScript for each component above, and add a couple of additional tests covering edge cases in the more complex components.
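For reference, a correctness test can be as simple as feeding a component a small known input and asserting on its output. The sketch below is hypothetical (the component path, input format, and expected stems are assumptions); mirror the structure of the stencil's existing tests in t/ rather than this exact shape:

// Hypothetical end-to-end test for a stdin/stdout component.
const assert = require('assert');
const { execSync } = require('child_process');

// Feed a tiny, known input to the component and compare against the expected output.
const input = 'cats\ndogs\n';
const output = execSync('node c/stem.js', { input }).toString();

// The expected stems depend on the stemmer you use; adjust accordingly.
assert.strictEqual(output.trim(), 'cat\ndog');
console.log('stem.js basic test passed');

Edge-case tests can reuse the same structure with inputs such as empty documents, punctuation-only text, or heavily repeated terms.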

Performance:  There are two key metrics we are interested in, namely (1) throughput, the number of "elements" processed per time unit, and (2) latency, the time it takes to process one element. The exact elements and time units depend on the specific component being characterized—we will characterize the crawling, indexing, and query subsystems independently, on both the local deployment and on AWS (see below).

Throughput is typically calculated by measuring the time it takes to complete a large number of tasks and then dividing the number of tasks by that time. For example, if it took 30 minutes to download 300 URLs, then the system's throughput is 300/30 = 10 URLs per minute, or 300/(30*60) ≈ 0.17 URLs per second. The crawler's throughput is the number of pages it downloads per second, the indexer's throughput is the number of pages it indexes per second, and the query throughput is the number of queries it answers per second.
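One lightweight way to gather these numbers is to time a batch of operations and divide. The sketch below is illustrative and not part of the stencil; the query command, its arguments, and the workload are assumptions you should adapt to your own setup:

// Illustrative timing harness: run a batch of queries, then report
// throughput (queries per second) and average latency (ms per query).
const { execSync } = require('child_process');

const queries = ['distributed systems', 'search engine', 'inverted index'];
const runs = 30;

const start = process.hrtime.bigint();
for (let i = 0; i < runs; i++) {
  execSync(`node query.js "${queries[i % queries.length]}"`, { stdio: 'ignore' });
}
const elapsedSec = Number(process.hrtime.bigint() - start) / 1e9;

console.log(`throughput: ${(runs / elapsedSec).toFixed(2)} queries/sec`);
console.log(`average latency: ${((elapsedSec / runs) * 1000).toFixed(1)} ms/query`);

The same pattern applies to the crawler and indexer: time a batch of page downloads or indexing runs and divide the count by the elapsed time.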

Cloud Deployment on AWS

This section briefly summarizes how to get started with Amazon Web Services. It is based on Amazon's "Getting Started" documentation but specialized to Brown's CS1380. The goal is to deploy your project on Amazon EC2 instances — both the centralized components of this milestone and additional milestones in the future.

With a new AWS account, you get 750 free hours per month of Linux, RHEL, or SLES t2.micro / t3.micro instance use. You will be billed for AWS instances that go beyond those limits — so you will want to terminate them when they aren’t in direct use. Make sure you read the entire section before launching any instances — including Terminating your AWS Instances, so that you do not incur any additional costs.

Creating an account: Go to AWS and click Create an AWS Account. You can use your Brown email — you'll be asked to enter some information, including payment info. At the very end, make sure you pick the Free tier.

Launching an instance: To launch a new EC2 instance, go to the AWS console, search for EC2 (or click Services at the top left, then Compute, then EC2), and choose Launch Instance. Pick one of the instance types — e.g., Micro (t2.micro) or Small (m3.small) — and an operating-system image — e.g., Ubuntu 22.04, such as ami-005fc0f236362e99f (64-bit x86) — and confirm that your choices are free-tier eligible. Create your own .pem key pair (e.g., named cs1380) so you can connect to these instances, and save it on your computer. Then create a new security group and enable SSH, HTTP, and HTTPS. Finally, Launch your instance.

Setting up API tools & Logging in: After launching an instance, you can connect via (1) a web-based ssh client, (2) your own ssh key and client (this requires pushing your key to the remote instance following the instructions, or using the web-based ssh client and adding your key to ~/.ssh/authorized_keys), or (3) EC2 Instance Connect. Options #2 and #3 require setting up the AWS CLI tools on your machine (these are already set up in the course's Docker image). You can also create your own Docker image that sets up your node and starts it up.

Option #1 is far from ideal for long-running system administration tasks, but might be fine for a quick confirmation that everything is working as expected: select the EC2 instance that you created and choose "Connect". Select "EC2 Instance Connect" (note the IP address, e.g., 3.15.40.40, which changes every time you restart the instance). Choose "Connect". A window opens, and you are connected to your instance.

Post-login: Once logged in, and assuming an Ubuntu image, first run sudo apt update, then sudo apt install nodejs npm git vim, and then git clone <github-repo> to fetch your latest changes from GitHub. You can then deploy your system on this instance and connect it to nodes running on other instances. For long-running jobs, you might want to use GNU screen or tmux so that they stay up and running even after you log out of the instance. Automating all these tasks with a script will accelerate the development and evaluation of your project.

Stopping and terminating instances: Note that you will be billed for AWS instances that are alive past the maximum total of 750 hours per month (per member of the team). To avoid charges, you can (1) stop (pause) instances, which freezes their contents and frees their computational resources at negligible cost while still allowing you to restart them, or (2) terminate them entirely when they are not in direct use.

To terminate an instance, log into your AWS Management Console, locate the instance in your list on the Instances page (likely under "Recently visited" too), right-click on it, click "Terminate", and when prompted for confirmation click "Yes, Terminate". Amazon EC2 then begins terminating the instance. As soon as the instance status changes to shutting down or terminated, you stop incurring charges for that instance. Similarly, you should delete EBS volumes (or any other resources) that you are no longer using.

Task 5: Deploy the solution on the cloud (AWS EC2 nodes)

Deploy to one EC2 node and run all tests to confirm that nothing broke by accident.

Lab-only Extensions and Extra Credit

This section is only for students taking the lab (i.e., CS1385 and CS2380), or students who are interested in extra credit.

Additional components: Rewrite the following components in JavaScript by creating .js files with the same names and calling them from the corresponding .sh files below:

Confirm the original tests pass, and create 3 additional tests for the JavaScript implementation.

E1–3: Implement 3 additional components in JavaScript

Implement all the components mentioned above in JavaScript instead of Shell script.

TF-IDF metrics and tests: The current inverted index ranks pages based on a simple frequency metric; a more interesting metric, such as TF-IDF, could yield significantly better results. TF-IDF combines the term frequency (tf) and inverse document frequency (idf) of each term, and can be used to rank documents by relevance to that term: tf(t, d) is the number of occurrences of term t in document d divided by the total number of terms in d; idf(t) is the logarithm of the total number of documents divided by the number of documents containing t; and tf-idf(t, d) = tf(t, d) × idf(t).

For instance, if pizza accounts for 5 of the 100 terms on a page (tf = 0.05) and appears in 10 of the 1,000 pages in the corpus (idf = log10(1000/10) = 2), then the TF-IDF metric of pizza for that page is 0.05 × 2 = 0.1.

 

Incorporating TF-IDF across the entire pipeline requires updating the value constantly, as new documents appear. This is easier if all JavaScript components are combined into a single process, to allow (1) using more complex data structures, and (2) sharing them across various stages.

Supporting complex end-to-end TF-IDF testing requires carefully creating corpora that contain a known structure and known instances of certain terms. Create a synthetic multi-document corpus with known frequencies of certain terms to ensure your ranking works as expected.
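As a minimal sketch of the metric itself (with illustrative names and a toy corpus, not tied to the stencil's interfaces), the computation and a synthetic test corpus might look like this:

// Minimal TF-IDF sketch over a toy in-memory corpus (illustrative only).
function tfidf(term, doc, corpus) {
  const words = doc.toLowerCase().split(/\s+/);
  const tf = words.filter((w) => w === term).length / words.length;
  const docsWithTerm = corpus.filter((d) =>
    d.toLowerCase().split(/\s+/).includes(term)).length;
  const idf = Math.log10(corpus.length / (docsWithTerm || 1));
  return tf * idf;
}

// A synthetic corpus with known term frequencies makes the expected
// rankings easy to compute by hand in end-to-end tests.
const corpus = ['pizza pizza pasta', 'pasta salad salad', 'salad pizza bread'];
console.log(tfidf('pizza', corpus[0], corpus)); // the pizza-heavy page ranks highest

Once the expected scores can be computed by hand, end-to-end tests can assert on the exact ranking your pipeline produces for such a corpus.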

E2: Implement TF-IDF

Implement document ranking based on TF-IDF, and test it extensively using custom corpora created synthetically.

Usage Scenarios: Familiarize Yourself with the Project

Scenarios are a series of steps that will acquaint you with a particular component, offering an opportunity to understand its underlying key concepts before (or without) implementing it. For M0, you need to complete the following scenarios:

  1. Execute all tests (which will fail, since the corresponding functionality has not been implemented yet), study their inputs and outputs, and confirm that they all make sense.

  2. Peruse the available source code of the entire engine using your editor.

  3. Make a few modifications (e.g., add a few print statements) to see how the components interact, and confirm that the log matches your understanding.

  4. Add a few tests for query.js — what is your expectation for certain terms on one of the available corpora?

Results and Reflections

Your solution must run across three environments: (1) locally (your dev environment, in package.json), (2) on the cloud (the aws environment in package.json), and (3) on Gradescope (the gs entry in package.json). To complete the assignment, collect correctness results (i.e., run all your tests) across all environments, and collect performance results for your three subsystems (crawler, indexer, and query) on one of the corpora above (documenting which one it is) on your laptop and on the cloud (i.e., only two environments, not Gradescope). See package.json for the full set of details you need to gather.

T6 & T7: Characterize correctness and throughput across environments

Confirm your solution works locally, on the cloud, and on Gradescope, and characterize the throughput of the crawler, indexer, and query subsystems in two environments.

Correctness and performance will be crucial later too, when the distributed version is complete.

As part of your submission, complete the README.md markdown file with the following structure, along with answers to the following prompts:


# M0: Setup & Centralized Computing

> Add your contact information below and in `package.json`.

* name: `<first last?>`

* email: `<email@brown.edu?>`

* cslogin: `<cslogin?>`

## Summary

> Summarize your implementation, including the most challenging aspects; remember to update the `report` section of the `package.json` file with the total number of hours it took you to complete M0 (`hours`), the total number of JavaScript lines you added, including tests (`jsloc`), the total number of shell lines you added, including for deployment and testing (`sloc`).

My implementation consists of `<how-many?>` components addressing T1--8. The most challenging aspect was `<challenging>` because `<explain-why>`.

## Correctness & Performance Characterization

> Describe how you characterized the correctness and performance of your implementation.

To characterize correctness, we developed `<number of tests>` tests that cover the following cases: `<summarize>`.

*Performance*: The throughput of various subsystems is described in the `"throughput"` portion of package.json. The characteristics of my development machines are summarized in the `"dev"` portion of package.json.

## Wild Guess

> How many lines of code do you think it will take to build the fully distributed, scalable version of your search engine? Add that number to the `"dloc"` portion of package.json, and justify your answer below.


T8: Document your findings in the README.md file

Complete all missing information in the README.md and package.json files.

Submission, Tips, FAQ, and Feedback

Consult this page for instructions on how to submit on Gradescope, debugging tips, frequently asked questions, and feedback options.