CS229: Machine Learning
Project Suggestions, Autumn 2013
Below is a list of suggested projects. If you see a project that interests you, contact that person directly to see if there's a chance to work together. Please do not blanket email everyone listed here.
Title: The Role of Media Coverage in Discovering Drug Side Effects
Description: The FDA maintains an Adverse Event Reporting System (AERS) where doctors can report adverse events occurring to patients taking certain drugs. The hope is that this information can be used to quickly identify rare side effects or drug-drug interactions that were not detected during the small-scale clinical trials required to approve the drug. However, these reports are very biased: doctors only submit reports when they believe that the prescribed drugs are causing an adverse event, and this belief can be influenced by many external factors such as media hype. This project aims to understand whether media coverage of drugs on average harms or helps the submission of accurate adverse event reports (and thereby predictions).
Students will have to scrape news articles or other forms of media to approximate the amount of media coverage for drugs in the AERS dataset (available online). We will be examining Bayesian network models of causality.
Contact: Hamsa Sridhar (firstname.lastname@example.org), Mohsen Bayati (email@example.com)
Title: Predicting conversational likability on anonymous chat networks
Description: Chatous is a text-based, 1-on-1 anonymous chat network that has seen 2.5 million unique visitors from over 180 different countries. Users can create a profile that contains a screen name, age, gender, location, and a short free-form "about me" field. After clicking the "new chat" button, users are matched up with one another in a text-based conversation. Interactions on Chatous include exchanging messages, sending/accepting a friend request, reporting an abusive user, and ending a conversation. In our dataset, we store all user profile information (and changes made to the profile), all actions taken by users on the site, as well as conversation content (in particular, conversation length and words used).
An interesting research project would be to predict user "quality", that is, a user's general conversation tendency or likability as judged by the community as a whole. In particular, identifying users of poor quality is important because they rarely have long conversations and are overrepresented in the matching queue, thus affecting a large set of users. Using a user's profile information and past conversations, can we predict which users tend to be good conversationalists and which users generally engage in short or zero-length conversations?
Other potential ideas:
- Predicting identity changes. Frequent changes to a user's profile information on Chatous, especially age/gender, suggest that the user is lying about their identity. Given a set of users, can we use information from their current profile and past interactions to predict whether or not they will tend to change key profile elements in the future?
- Evaluating validity of user reports. On Chatous, user moderation of the community is important for flagging abusive or spammy users. However, the tendency of users to report varies widely, and we have many false positives (reports that are unwarranted, or people who are reported simply because they are on the platform a lot). Can we automatically determine the accuracy of a user report (based on the reporting user's behavior, the reported user's behavior, and the total number of reports both users have sent/received)?
Data:
- Two weeks of user interaction on our platform: ~80,000 users and ~8 million conversations
- Graph structure consisting of users as nodes and conversations as weighted edges (with conversation length as weight)
- Additional metadata for each edge: the person who disconnected the conversation, time started, time finished, and whether a "friendship" exists between the two users
- User profiles consisting of screen name, age, gender, location, and "about me" (including all changes to a user's profile)
- List of user reports (person reported, person reporting, conversation length, and all associated metadata)
- Word vectors consisting of the words each user has used in a conversation
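As a rough illustration of the graph structure described above, the sketch below builds a weighted user graph from conversation records and computes each user's average conversation length as a crude likability proxy. The record layout `(user_a, user_b, length)`, the function names, and the toy values are all hypothetical; this is not the actual Chatous data format.

```python
from collections import defaultdict

def build_graph(conversations):
    """Build an undirected weighted graph: graph[u][v] = total conversation length."""
    graph = defaultdict(lambda: defaultdict(int))
    for user_a, user_b, length in conversations:
        graph[user_a][user_b] += length
        graph[user_b][user_a] += length
    return graph

def mean_conversation_length(conversations):
    """Average conversation length per user (a crude proxy, not a quality model)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for user_a, user_b, length in conversations:
        for user in (user_a, user_b):
            totals[user] += length
            counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}

# Hypothetical toy records: (user_a, user_b, conversation length in messages).
conversations = [("u1", "u2", 40), ("u1", "u3", 0), ("u2", "u3", 2), ("u1", "u2", 10)]
graph = build_graph(conversations)
averages = mean_conversation_length(conversations)
```

Here a user such as `u3`, whose conversations are all very short, ends up with a low average; a real project would replace this proxy with a learned model over profile and conversation features.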
Title: 2D visualization of high-dimensional molecular data in cancer datasets
Description: New single-cell technologies have enabled the creation of rich, high-dimensional molecular datasets, providing thousands of individual cell measurements from resected tumors. These datasets come in the form of large matrices, in which each column is a random variable of interest (up to 40 variables), and each row is an individual cell. While the organization of tumor cells into various patterns and subsets is of great interest in the cancer community, there is a dearth of effective tools for visualizing these ~40-dimensional datasets in a human-digestible (2- or 3-dimensional) format.
Nonlinear embedding algorithms have been proffered as effective tools for obtaining a low-dimensional projection of high-dimensional data. A nonlinear embedding preserves structure in the high-dimensional data better than linear or spectral methods, but existing training algorithms have runtime quadratic in the number of points N. This bottleneck has recently been addressed by formulating the optimization as an N-body problem and using fast multipole methods (FMMs) to approximate the gradient in linear time (Max Vladymyrov and Miguel A. Carreira-Perpinan, UC Merced). In this project, we will use this method to visualize leukemia tumors.
Title: Recovering audio from handwritten documents
Description: When someone writes with a pen (or pencil), both the table and the pen vibrate due to sound in the environment. The path the pen takes on the paper behaves like the groove in a record, and encodes a time trace of those vibrations. It should therefore be possible to examine a handwritten document using a microscope, and recover an audio recording from when the document was written. Approximate physics calculations suggest that sound at conversational volumes should be recoverable.
The proposed class-project-sized task would be to apply machine vision algorithms to light microscopy images of pen traces to recover a simple known audio signal played at high volume. If this is successful, then we can talk about recovering the background audio from the signing of the Declaration of Independence... :)
I have a collaborator in a microscopy lab who can help generate the data. Generating the dataset will probably require an afternoon's work at the beginning of the project though. I do a lot of work with machine learning and vision, and have been excited about this project idea for a while, so I should be able to provide solid advising. You should be able to write code using scientific Python (better) or MATLAB (worse). Preference goes to groups that might be interested in turning the project into a publication, even after the class ends (obviously depending on results and enjoyment).
Also, this will be really fun.
Contact: Jascha Sohl-Dickstein (firstname.lastname@example.org)
Title: Building fast performance models for loop-free 64-bit x86 code sequences
Description: STOKE is an aggressive low-level compiler which generates complex assembly-level optimizations using random search (http://tinyurl.com/cymt7jt, http://tinyurl.com/lkh7vl3). Part of this process includes estimating the performance properties of short (< 100 line) 64-bit x86 instruction sequences. In this project you will help train a model for doing so accurately (x86 is an extremely complex CISC architecture) and efficiently (the tool should be able to sustain a throughput of 1M proposals/second).
Contact: Eric Schkufza (email@example.com)
Title: Making Deep Learning Methods for Text Classification Accessible to the World
Description: We are building a website that will allow people to easily classify text documents using deep learning algorithms. Users will be able to train and test on their own datasets and share the results with others. We already have easy integration with Twitter and a
Contact: Richard Socher (firstname.lastname@example.org)
Title: Classifying News for Topic and Sentiment
Description: We are interested in building classifiers that will label news articles for both topic and sentiment using deep learning. We would like to compare the performance of deep learning algorithms against more conventional algorithms (Naive Bayes, SVM, MaxEnt).
Prerequisites: Java and Python (required), MATLAB (preferable)
Contact: Rebecca Weiss (email@example.com)
Title: Interpreting building energy data
Description: The Stanford Y2E2 building has 2375 sensors that report building energy system status and performance every minute, and the data are readily available from a database. We are interested in statistical analysis of system performance over time and diagnostic interpretation of data with respect to historical patterns, commonsense physics and building design specifications.
Contact: John Kunz (firstname.lastname@example.org)
Title: Sentiment analysis on personal email archives
Description: We have built an open source system called Muse (Memories Using Email, http://mobisocial.stanford.edu/muse) which lets people use their personal email archives for reflection, story-telling and extracting the interesting moments of their lives. It will be very useful to build robust ways of categorizing emails according to various sentiments (happiness, grief, anger, surprise, etc), as well as according to categories associated with different types of memories (family, life events, vacations, etc). This needs fairly sophisticated “sentiment” analysis. Our email infrastructure does all the work of fetching the email, cleaning it, constructing an address book, identifying named entities, etc. so the students would only need to focus on the algorithms for automatic sentiment and category identification.
Contact: Sudheendra Hangal, email@example.com
Prerequisites: Java (required)
Title: Predicting the efficacy of cancer drugs using molecular characteristics
Description: Various recent experimental studies have involved assaying the sensitivity of a range of cancer cell lines to an array of anti-cancer therapeutics. Alongside these sensitivity measurements high dimensional molecular characterisation of the cell lines is typically available, for example including gene expression measurements, copy number variation and genetic mutations. The existing analyses of these datasets involve simple per drug regressions, such as elastic net. While these methods are able to pick out the strongest signals in the data, they suffer from not taking advantage of similarity between drugs. Focusing on the Cancer Cell Line Encyclopedia, this project will investigate whether “multitask” regression methods, potentially using side information about the drugs, are able to outperform per drug regressions.
Contact: David A. Knowles (firstname.lastname@example.org)
Title: Multitask clustering of genes in different tissue types
Description: An important task in computational biology is learning biologically meaningful clusters (groups) of genes from data. A cluster might correspond to a particular pathway or function. We are interested in whether allowing the clustering to vary in different cell/tissue types is beneficial. This will entail developing novel methodology, which will require understanding material beyond the core CS229 syllabus, so we are looking for enthusiastic students.
Title: Recognition and characterization of sleep microarchitecture features in sleep recordings
Description: We have approximately 8,000 nocturnal sleep recordings from random individuals and several thousand more from individuals with various sleep disorders such as sleep apnea, narcolepsy, REM behavior disorder, insomnia, and Restless Leg Syndrome/Periodic Leg Movements during Sleep (RLS-PLMS). This is a unique resource, as we know a lot about these individuals and the data is deeply annotated by sleep technicians and sleep doctors (the event file, with time stamps). These recordings include the following channels of physiological data: electroencephalogram (EEG, brain waves, 2 channels, measuring arousals and sleep staging based on K complexes, sleep spindles, slow waves, and sawtooth waves), electrooculogram (EOG, eye movements, 2 channels, for Rapid Eye Movement sleep), chin electromyogram (EMG, measuring muscle tone, especially the lack of it during REM sleep), electrocardiogram (ECG, heart beats), breathing (oral and nasal thermistors measuring the flow of hot air), breathing effort (abdominal belts reflecting the effort of the chest when trying to breathe), oxygen saturation (decreases a few % during apnea), and right and left leg muscle tone (measuring PLMs, a feature of RLS). We have several engineers and doctors working on extracting selected features to conduct a genetic analysis on them (whole genome available), and also to use them diagnostically. One works on narcolepsy (already using machine learning), another on REM behavior disorder (acting out your dreams, excessive muscle tone during REM sleep, abnormal ECG changes across sleep stages, a precursor of Parkinson's disease), a third on PLM during sleep (completed), and a fourth on sleep spindles. We are seeking students to work with these engineers and doctors to use machine learning to score apneas, hypopneas with arousals, hypopneas with oxygen desaturation, hypopneas with both, K complexes, and sawtooth waves. We could take 2 students on two projects.
The data is ready, and supervision will be ensured by the PI and this group of engineers.
Contact: Emmanuel Mignot (email@example.com)
Title: Using Deep Learning to Predict Depth from Single Images (computer vision/deep learning)
Description: We have collected a stereo vision dataset consisting of image and depth pairs taken in outdoor environments. The goal is to train a deep network that is able to predict a depth map given a single still image. This is a task humans are capable of doing based on our high level understanding of depth cues from a structured world; this allows us to easily segment out key regions in an image. With a deep network capable of predicting depth, we will use it in an attempt to improve object recognition and detection accuracy.
Contact: Brody Huval (firstname.lastname@example.org)
Title: Deep Learning for Speech Recognition (potential joint projects with CS224N)
Description: Applying deep neural networks to speech recognition is one of the great recent successes of deep learning in industry and academia. At Stanford, we have a state-of-the-art speech recognition system working on a corpus of hundreds of hours of telephone conversations. We are currently investigating both linguistic and machine learning improvements to the existing system. There are several potential projects in this space, which we can adapt to best suit your background. Successful projects can convert into independent research studies in Prof Ng’s lab in subsequent quarters.
Prerequisites: Python or MATLAB. Linux with some bash scripting experience. Nice to have: some background in NLP and/or neural networks
Contact: Andrew Maas (email@example.com)
Title: Predicting Enterprise Customer Metrics
Description: Enterprise databases contain vast amounts of structured and text data. There are enormous opportunities in building machine learning tools to help managers understand their business better. In this project, you can work with an early-stage startup on building such models with actual enterprise data. Techniques include sparse regression and classification, as well as estimating model uncertainty. You will have the opportunity to work closely with the startup during your project, with possibilities for employment later.
Prerequisites: Python. Nice to have: some background in statistics, enterprise data, or Amazon EC2
Contact: Andrew Maas (firstname.lastname@example.org)
Title: Deep Learning Optimization Improvements
Description: Deep neural networks critically rely on optimization algorithms to find a good setting for the millions of parameters in the network. Recently, several new approaches to optimization were introduced for deep learning. These techniques have strong connections to the convex optimization literature. In this project, you can implement recently introduced approaches on a challenging deep learning classification task. Successful projects can continue as independent research in Prof Ng’s group in subsequent quarters.
Contact: Andrew Maas (email@example.com)
Title: Machine Learning for Android Malware Classification
Description: Everyone remembers the age when Internet worms and viruses plagued personal computers. While anti-virus programs and better OS security tools have helped mitigate this problem, the Android smartphone platform is experiencing a similar wave of malware. Apps have shown up that steal IMEI numbers, send premium SMS texts, and install keyloggers.
Static analysis techniques have achieved success in identifying malicious applications. The analyses are used to find information flows through the application that are possibly malicious. However, these techniques are only as good as the specifications that they take as input. Recent literature has attempted to apply machine learning to mine specifications in creative ways.
The goal of this project is to apply machine learning techniques to classify flows found by the static analysis tools into benign and malicious flows. We plan to explore natural language features of the application description and features of the discovered information flows to accurately distinguish benign flows from malicious flows. For more details about the problem, please refer to the paper: https://www.usenix.org/system/files/conference/usenixsecurity13/sec13-paper_pandita.pdf
Skills: Java required, MATLAB preferred.
Contact: Osbert Bastani (firstname.lastname@example.org), Saswat Anand (email@example.com)
Title: Composer Style Attribution
Description: The Josquin Research Project in the Stanford Music Department has a large collection of music from the early Renaissance (http://josquin.stanford.edu). We are looking to collaborate on a CS229 project to study ML techniques suitable for composer identification within this data set of symbolic music. In particular, 336 works are attributed to Josquin des Prez, the most famous composer of his time (http://en.wikipedia.org/wiki/Josquin_des_Prez). We have identified 50 works through manuscript sources that are nearly certain to be by Josquin, with the other ~300 works ranging from likely to be by Josquin to known misattributions to Josquin. As part of the project we have already digitized the works of several other composers to serve as controls. See a related past project for CS 229: http://cs229.stanford.edu/proj2008/LebarChangYu-ClassifyingMusicalScoresByComposer.pdf
Title: How to prevent another Financial Crisis on Wall Street.
Description: There was a meltdown on Wall Street during 2008-2010, partially caused by credit derivatives. One of the major reasons for the crash was CDO-s. For example, Exhibit 3 shows the magnitude of the CMBS/CDO crash here: http://goo.gl/UfmYoR . CDO-s received AAA ratings from the rating agencies even before disclosing their assets, and subsequently a large portion of them defaulted. CDO-s are normally registered offshore and there is little data available. However, there is another product called CMBS that has lots of data available, and it’s not an evil product by design. What a subset of the CDO-s (called CRE-CDO-s) did was to pool the junior tranches of CMBS, thereby creating systemic risk. What this project aims to do is fix this practice by using machine learning. In particular, we can go down to the asset level and automatically classify the risk of each asset.
Data: A simple way to get an overview of the data is to go to http://www.sec.gov/edgar/searchedgar/companysearch.html and type in COMM 2013 to get prospectuses like this one: http://goo.gl/aun8uD. There is also an aggregate data download available.
Essentially, Wall Street firms pool together $1bn of loans every month and do an offering similar to an IPO to issue bonds that pay interest and are repaid after 10 years. A Facebook IPO, on the other hand, issues stock, promises future dividends, and relies on expectations of future growth. The current pay and current financials make the CMBS IPO a good candidate for statistical analysis; the most important fields are likely the loan-to-value ratio (LTV) and the debt service coverage ratio (DSCR).
Analysis: First project (simple). The dependent variable is the interest rate and the independent variables are all the fields in the data file. Loans are originated in a competitive market, where riskier loans pay more, so you can back out the riskier attributes of loans by looking at the interest rate that borrowers are paying.
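The simple project above amounts to a regression of the interest rate on loan attributes. The following is a minimal sketch using ordinary least squares in plain Python. The data is synthetic, the two feature columns (LTV and DSCR) and the coefficients used to generate the rates are assumptions made for illustration, and a real analysis would use the fields in the actual data file.

```python
import random

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares fit of y on rows via the normal equations (intercept prepended)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

# Synthetic stand-in data (NOT real CMBS loans): each row is (LTV, DSCR).
rng = random.Random(0)
loans = [(rng.uniform(0.5, 0.8), rng.uniform(1.1, 2.0)) for _ in range(200)]
# Assumed toy relationship: higher LTV raises the rate, higher DSCR lowers it.
rates = [3.0 + 4.0 * ltv - 0.5 * dscr + rng.gauss(0, 0.01) for ltv, dscr in loans]

intercept, ltv_coef, dscr_coef = fit_ols(loans, rates)
```

A positive fitted coefficient on an attribute suggests the market charges more for it, which is the sense in which riskier attributes can be "backed out" of the rates.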
Advanced project: About 20% of the loans from 2005 IPO-s defaulted; however, some had better recovery than others. Combining the defaults with interest rates, one can determine which attributes were indications of risk ex post. The hypothesis is that these variables are now priced higher. Anecdotally, hotel loans were perceived as less risky in 2005, lots of them subsequently defaulted, and today hotels are paying a higher interest rate, all else equal.
Skills: MATLAB required, Excel required, SQL a plus.
Contact: Keith Siilats (firstname.lastname@example.org)
Title: Machine Learning JS
Contact: Andrej Karpathy (email@example.com)
Title: Structural Patterns in Translation
Description: An approach to translation proposed by DeKai Wu, called "Inversion Transduction Grammar", is based on the idea that the phrases in a translation correspond to phrases of the original, but possibly in a different order. We assume that the words label the terminals in a binary tree and the translation results from reordering the daughters of some of the nodes. We do not have a grammar, so we pick some tree that works. Notice that some translations could not be achieved this way. For example, the translations of the words 1, 2, 3, 4 cannot be reordered in this way to give 2, 4, 1, 3. The proposed project involves examining gold-standard translations, say in the Europarl database, to reveal statistical and qualitative information about trees that do and do not fit the model, as well as the reordering patterns that are found in pairs of languages with greater or lesser degrees of grammatical difference. So, we might compare French-English with German-French.
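To make the reordering constraint concrete, the sketch below checks whether a given word order can be produced by swapping the daughters of nodes in some binary tree. It uses a greedy shift-reduce pass that merges adjacent spans whenever they cover a contiguous range of original positions; the function name is hypothetical and this is an illustrative check, not the project's tooling.

```python
from itertools import permutations

def itg_reachable(perm):
    """Return True if perm (a permutation of 1..n) can be produced by
    swapping the children of nodes in some binary tree over 1..n."""
    stack = []  # each entry is a (low, high) range of contiguous values
    for value in perm:
        stack.append((value, value))
        # Greedily merge the top two spans while they form one contiguous range.
        while len(stack) >= 2:
            (lo1, hi1), (lo2, hi2) = stack[-2], stack[-1]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2))]
            else:
                break
    return len(stack) == 1

# Enumerate the orderings of four words that no binary tree can produce.
unreachable = [p for p in permutations(range(1, 5)) if not itg_reachable(p)]
```

For four words, `unreachable` contains only 2, 4, 1, 3 and its mirror case 3, 1, 4, 2; every other ordering fits the model.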
Contact: Martin Kay (firstname.lastname@example.org)
Title: On-road vehicle detection using stereo vision
Description: On-road vehicle detection is closely related to automobile applications such as driver assistance or even autonomous driving. While the current mainstream technologies rely on expensive, active sensors such as radar or lidar, we would like to use a pair of stereo cameras to detect vehicles. We have collected long sequences of stereo videos, together with accurate, frame-synchronized GPS+IMU data. Your goal would be to build an algorithm to reliably detect vehicles using this existing dataset. You will need to experiment with different stereo matching algorithms to retrieve depth information, and also need to perform detection and tracking to find objects like cars, using both depth and color information.
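For context on how stereo matching yields depth: once a matching algorithm has produced a disparity for a pixel in a rectified stereo pair, depth follows from the standard relation Z = f * B / d. The sketch below applies that relation; the function name and camera parameters are hypothetical and are not taken from the dataset.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means infinitely far)")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical camera parameters, chosen only for illustration.
depth = depth_from_disparity(disparity_px=35.0, focal_length_px=700.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, small matching errors matter most for distant vehicles, where disparities are only a few pixels.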
Prerequisites: Python required. Some knowledge of computer vision is a plus
Contact: Tao Wang (email@example.com)
Title: Synthesizing images with out-of-plane transformations using stereo images
Description: In machine learning research, it is a common practice to apply 2D perspective distortions to image data during training, not only for better robustness against noise, but also to increase the amount of training data. However, it is hard to tune the parameters for such transformations, and the resulting images often look uncanny. Using a pair of stereo cameras, one should in principle be able to reconstruct the depth information, and thus synthesize images with 3D transformations that are more realistic. We have collected long sequences of stereo videos, with cameras on two extreme sides of a car. We would like to explore ways to significantly expand our dataset by synthesizing images that ‘would have been captured’ by cameras more towards the center of the car.
Prerequisites: Python or MATLAB. Some knowledge of computer vision is a plus
Contact: Tao Wang (firstname.lastname@example.org)
Title: Collecting Lane Marking Labels Using Google Maps
Description: We have collected long sequences of stereo videos, together with accurate, frame-synchronized GPS+IMU data. We would like to label the lane markings in these videos with minimal human effort. One possible way is to label the lane markings on the satellite images in Google Maps, and project these labels onto the camera frames. You will build such a labelling tool and help us evaluate its performance.
Prerequisites: Python or MATLAB. Some knowledge of computer vision is a plus
Contact: Tao Wang (email@example.com)
Title: Identify $100 Billion in Yearly Gas Savings from Driver Behavior
Description: Americans spend ~$500 Billion on gasoline every year. An individual's gas usage is fairly sensitive to her driving habits and can be doubled by driving 15 miles per hour faster on highways, or accelerating more quickly in stop-and-go traffic. This project is to learn the precise relationship between driving behavior and gas usage, and predict feasible driving behavior that is 20% more efficient over the same routes. This would result in $100B in yearly savings for America's drivers. The dataset includes real-time high-frequency accelerometer, heading, speed, odometer, and gas usage.
Title: Identify a Car's Driver from Driving Behavior
Description: We hypothesize that an individual's driving behavior is an identifying characteristic, similar to handwriting. This project involves supervised learning of a vehicle's characteristic accelerometer/heading/speed. An ambitious student may be able to use unsupervised clustering to detect when vehicles have multiple drivers.
Title: Phishing email/web detection (Proofpoint, Inc.)
Description: Proofpoint (Nasdaq: PFPT), which recently went public, is the leader in "security-as-a-service". One service the company provides to top global organizations is defense against targeted attacks and phishing. Phishing is the act of maliciously attempting to extract personal information such as username/password combinations to bank accounts. Currently, the two major attack vectors are email and web pages. Phishing works because it appears legitimate. Your goal is to develop an algorithm that will correctly distinguish phishing from legitimate and spam email and/or web pages. You will need to create a data set with ham, spam and phish labels. With the data set, you will be able to experiment with ML classifiers and improve them specifically for phishing. Due to the nature of email and web browsing, your classifier should be able to complete the task in real time and adapt to new attacks. NOTE: participants in this project will have the opportunity to use live data within Proofpoint's systems, working with senior members of Proofpoint's R&D team. This project has immediate, real-world applicability.