
CSED516 2019au Project Suggestions

Note: This list may be updated continuously; check back frequently.

These are suggestions for the CSED 516 class project.  You can come up with your own project, choose an idea below, or start from an idea below and modify it.  It's OK if several people choose the same project; it will be fun to compare the approaches and the results.  If you choose a project suggested by Amperity, I will put you in touch with a person at that company.

Benchmarking Ideas

Here you will choose your favorite big data system X (where X = Spark on AWS, Spark on Azure, Snowflake, Redshift, BigQuery, …) and then test one of the following.

Sampling

Suppose you have to compute a query, say join-group-by-aggregate.  The query takes too long, and/or costs too much, to run on system X.  Instead, you want to sample from your base tables, run the query on the sample, and return an estimate of the answer.  (This is called Approximate Query Processing, or AQP.)  In this project you would explore X's sampling capabilities.  What sampling primitives does it support?  Are they efficient?  Are they guaranteed to be uniform?  (Sometimes database systems "cheat" and, for efficiency, sample entire blocks instead of individual records; how does X work?)  How effective is sampling in answering your big join-group-by-aggregate query?  Does it save any money?
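As a concrete illustration, here is a minimal sketch of the experiment in PySpark.  The table names R(x,y) and S(y,z), the 1% sampling fraction, and the use of DataFrame.sample are all illustrative assumptions; whether the system's sampling primitive is truly uniform (rows vs. whole blocks) is exactly what the project should examine.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aqp-sampling").getOrCreate()

R = spark.table("R")   # R(x, y) -- hypothetical large base table
S = spark.table("S")   # S(y, z) -- hypothetical large base table

# Exact (expensive) answer: join, group by, aggregate.
exact = R.join(S, "y").groupBy("x").agg(F.sum("z").alias("total"))

# Approximate answer: sample each base table independently, run the same
# query, and scale the SUM up by the inverse of the sampling fractions.
# Independent samples of both join inputs give an unbiased but high-variance
# estimate; measuring the error and the cost savings is the point of the project.
f = 0.01
approx = (R.sample(fraction=f, seed=1)
            .join(S.sample(fraction=f, seed=2), "y")
            .groupBy("x")
            .agg((F.sum("z") / (f * f)).alias("estimated_total")))

exact.show()
approx.show()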

Skew

Distributed query processing becomes much harder when there is skew.  Study how system X copes with skew in query processing.  For example, when joining two tables R(x,y) Join S(y,z), what happens if y is skewed in R or in S?  Some outcomes to expect: (a) the query runs out of memory and crashes, (b) the query doesn't crash but takes significantly longer, or (c) it runs as efficiently as on non-skewed data.  Extend this to more complex queries: multiway joins, and/or added selection predicates.
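One possible way to set up such an experiment, sketched in PySpark with illustrative sizes, a made-up skew recipe, and a hypothetical output path:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-test").getOrCreate()

n = 100_000_000            # illustrative size; grow it until system X struggles
# R(x, y): the join key y is heavily skewed -- pow(rand, 20) pushes most rows
# toward y = 0, so one partition/reducer ends up with most of the work.
R = (spark.range(n)
          .withColumnRenamed("id", "x")
          .withColumn("y", (F.pow(F.rand(seed=1), 20) * 1000).cast("long")))

# S(y, z): small, uniform dimension table.
S = (spark.range(1000)
          .withColumnRenamed("id", "y")
          .withColumn("z", F.rand(seed=2)))

# The query under test: does it crash, slow down dramatically, or behave
# as if the data were uniform?
result = R.join(S, "y").groupBy("y").count()
result.write.mode("overwrite").parquet("/tmp/skew_test_out")   # force execution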

Comparing X to Y

Compare system X to system Y.  For example, you may want to compare the performance of Spark on AWS with the performance of Spark on Azure, or compare SQL on Redshift with SQL on Snowflake and/or SQL on Azure.  Find some benchmarks to do the comparison.  For SQL, the best-known benchmark is TPC-H (and there are other TPC variants), as well as the newer Join Order Benchmark (which uses IMDB data).  If you choose Spark, look for benchmarks for Spark (or design/adapt your own).  Feel free to adapt any benchmark as you see fit, e.g. simplify it or drop queries that would take more effort to run on X or on Y.

ML in Big Data

A typical ML algorithm runs gradient descent, or stochastic gradient descent, until convergence (or for a fixed number of iterations).  One gradient descent step can be expressed in SQL or in Spark as an aggregate query (with or without group by).  Evaluate the effectiveness of running gradient descent on a big data system, e.g. in SQL or in Spark, and compare it to the standard approach of using Python for gradient descent.  Which one scales better?  (One expects system X, which is usually distributed, to scale better.)  Which one is faster?  How does this depend on the size of the data?  (Perhaps Python is faster for small training data but crashes when the training data reaches a certain size.)  Your goal here should be to express a very simple ML algorithm (e.g. linear regression or logistic regression) using a SQL query: one step of gradient descent requires computing a sum over the data, so write that sum as a SQL query, update the model, then repeat the gradient descent step, as sketched below.  Some references to (much more advanced) usages of SQL for ML tasks are here and here and here.
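To make the idea concrete, here is a toy, self-contained sketch of single-feature linear regression where every gradient step is one SQL aggregate query.  It uses sqlite3 only so that the snippet runs on its own; the table name train(x, y), the learning rate, and the iteration count are assumptions, and in the actual project the same SQL would be issued against system X over large data.

import sqlite3

# Toy model y ≈ w*x + b trained by batch gradient descent, one SQL query per step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (x REAL, y REAL)")
conn.executemany("INSERT INTO train VALUES (?, ?)",
                 [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)])

w, b, lr = 0.0, 0.0, 0.5
for step in range(2000):
    # One gradient step = one aggregate over the training data:
    #   dL/dw = AVG((w*x + b - y) * x),   dL/db = AVG(w*x + b - y)
    grad_w, grad_b = conn.execute(
        "SELECT AVG((? * x + ? - y) * x), AVG(? * x + ? - y) FROM train",
        (w, b, w, b)).fetchone()
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")   # should be close to w=2, b=1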

Benchmarks Suggested by RelationalAI:

Benchmarks (datasets and queries)

- TPC-H

- TPC-DS

- LDBC SNB

- LDBC Graph analytics

- IMDB JOB

- Public BI benchmark (Boncz)

- TPC-C

- TPC-CH

- Various microbenchmarks: iiBench, transaction benchmarks

- ACID properties of all systems (pretty hard)

Systems:

- Warehouses/analytics (Snowflake, Spark SQL etc)

- More general relational (PostgreSQL, HyPer, SQL Server etc)

- Graph databases (TigerGraph, Neo4J, Virtuoso etc)

RelationalAI is particularly interested to see:

1) How a specialized system does compared to a general system on a specialized benchmark (e.g. LDBC SNB on TigerGraph/Neo4J vs Snowflake)

2) How a specialized system performs on something it is not considered suitable for (e.g. transaction throughput on a warehouse).


Projects suggested by Amperity

Apply the Two Tower Model to record blocking.

Contact person: Aria at Amperity, aria@amperity.com

"Blocking" is the first stage of deduplication, when the set of records is partitioned into groups (blocks) so that the actual all-pairs matching can be done only within each block.  Blocking is traditionally done using a blocking key.  In this project you will use embeddings and the two towers method to replace key-based blocking.  The hope is that the embedding will improve recall over the static blocking key approach.  The two tower method has been used by Facebook and by Google for image search, but not for blocking.

Some details: the training data consists of labeled pairs, i.e. every pair of data items is labeled as a match or non-match.  An embedding of the data into R^d is learned such that the cosine similarity of the embeddings is a good predictor of a match.  The system then uses this embedding in the blocking phase of record matching; a rough sketch follows.
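The sketch below is a minimal two-tower setup in PyTorch.  The input feature dimension, the network architecture, the use of a single shared tower, and the BCE-on-scaled-cosine loss are all assumptions for illustration, not a prescribed design.

import torch
import torch.nn as nn

class Tower(nn.Module):
    """Maps a record's feature vector to an L2-normalized embedding in R^d."""
    def __init__(self, in_dim=64, d=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

# Simplest (symmetric) setting: both records go through the same tower; two
# separate towers would be used if the two sides have different schemas.
tower = Tower()
opt = torch.optim.Adam(tower.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(a, b, label):
    """a, b: [batch, in_dim] features for record pairs; label: 1 = match, 0 = non-match."""
    cos = (tower(a) * tower(b)).sum(dim=-1)    # cosine similarity in [-1, 1]
    loss = bce(5.0 * cos, label)               # scaled similarity used as the logit (one common choice)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random tensors standing in for featurized record pairs.
a, b = torch.randn(256, 64), torch.randn(256, 64)
label = torch.randint(0, 2, (256,)).float()
print(train_step(a, b, label))

At blocking time, the learned embeddings of all records would be indexed (for example with an approximate nearest-neighbor library), and each record's block would be its nearest neighbors in embedding space.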

Data: bibliography datasets, e.g. arxiv data or use Amperity's data generator.

Tools: PyTorch; training should be done on GPUs, e.g. using Google's Colab https://colab.research.google.com

Distributed Union Find

Contact person: Yan Yan <yanyan@amperity.com>, Joe Christianson <joe@amperity.com>

The classic Union-Find algorithm https://en.wikipedia.org/wiki/Disjoint-set_data_structure is very efficient when run sequentially, but scales poorly to very large datasets.  Design and test a distributed version of Union-Find.  Aim to scale roughly linearly up to 10-20 servers on large datasets, i.e. when run on 10 servers it should be 8-9 times faster than on one server.  It's OK to design an approximate version; e.g. the algorithm may fail to union some clusters because they happened to be placed on different servers.  One possible starting point is sketched below.
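One standard way to approximate Union-Find at scale is iterative minimum-label propagation over the edge list (i.e. computing connected components).  The PySpark sketch below is only a starting point, not Amperity's method; the input path, the column names a and b, and the fixed iteration count are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dist-union-find").getOrCreate()

# edges(a, b): each row is a "union(a, b)" request over long node ids.
edges = spark.read.parquet("/path/to/edges").select("a", "b")   # hypothetical input

# Start with every node labeled by itself.
nodes = (edges.select(F.col("a").alias("node"))
              .union(edges.select(F.col("b").alias("node")))
              .distinct()
              .withColumn("label", F.col("node")))

# Make the edge list symmetric so labels can flow in both directions.
sym = edges.union(edges.select(F.col("b").alias("a"), F.col("a").alias("b")))

# Repeatedly assign each node the minimum label among itself and its neighbors.
# A fixed iteration budget keeps this simple but may under-merge (an approximate
# result, as allowed above); a real implementation would check for convergence
# and/or use smarter schemes such as large-star/small-star.
for _ in range(20):
    neighbor_min = (sym.join(nodes, sym["a"] == nodes["node"])
                       .groupBy("b")
                       .agg(F.min("label").alias("nlabel")))
    nodes = (nodes.join(neighbor_min, nodes["node"] == neighbor_min["b"], "left")
                  .select(F.col("node"),
                          F.least(F.col("label"),
                                  F.coalesce(F.col("nlabel"), F.col("label"))).alias("label")))

nodes.write.mode("overwrite").parquet("/path/to/components")    # node -> component label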

Data: synthetic (Yan Yan will provide)

Tools: Spark (suggestion)

Determine Bogus Data from Cardinality Information.

Contact person: Yan Yan <yanyan@amperity.com>

Sometimes customers give their correct information, and sometimes they fill in bogus values.  For example, when asked for their email address, they may provide the correct value, or may simply say name@email.com.  When a field has low cardinality (e.g. most values are name@email.com), this is a signal that the field may be bogus.  Explore ways to use the cardinality information to classify a field, or a particular data value in a field, as bogus or real (a simple starting point is sketched below).  This is an open-ended project, and Yan Yan will assist in providing data and guidance.
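A simple first signal is the relative frequency of each value within a field: placeholder values like name@email.com tend to dominate.  The PySpark sketch below illustrates this; the input path, the email column, and the 1% threshold are all assumptions, and the project would go well beyond a fixed cutoff.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bogus-values").getOrCreate()
customers = spark.read.parquet("/path/to/customers")   # hypothetical input

# For a field of interest, count how often each value occurs.  A single value
# accounting for a large share of all rows (far more than any "real" value
# plausibly could) is a candidate bogus/placeholder value.
total = customers.count()
suspicious = (customers.groupBy("email")
                       .count()
                       .withColumn("share", F.col("count") / F.lit(total))
                       .filter(F.col("share") > 0.01)          # illustrative threshold
                       .orderBy(F.desc("share")))
suspicious.show(20, truncate=False)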

Data: needs to be real data (Yan Yan)

Tools: ???

Use Advanced Labeling functions in Snorkel.

Contact person: Joe Christianson <joe@amperity.com>

The Snorkel system https://hazyresearch.github.io/snorkel/ allows users to generate training data by replacing manual labels with simple, hand-written classification functions.  Snorkel takes the often conflicting labels produced by these functions and reconciles them, generating a clean training set that can be used in any generic machine learning system.  In this project you will explore the use of Snorkel for record matching.  A direct use of Snorkel would consist of writing some simple functions that, given a pair of data items, classify them as match or non-match (see the sketch below).  A more advanced use of Snorkel would be to write more sophisticated functions which predict what the clusters will look like, e.g. "4-5 records, with 1 single value for LastName, 1-2 values for email address, and 1-3 values for phone".  The main effort will consist of using Snorkel for record matching and deduplication, but more advanced usages are encouraged.
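As a rough idea of what pair-level labeling functions could look like, here is a sketch roughly following the labeling-function API of the pip-installable snorkel package (0.9-style); the column names (email_a, name_a, ...), the rules themselves, and the df_pairs DataFrame of candidate record pairs are made-up placeholders, so check the documentation of the Snorkel version you actually install.

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

# Each labeling function sees one candidate pair of records (one row of a
# pandas DataFrame with hypothetical columns name_a, name_b, email_a, email_b)
# and votes MATCH, NON_MATCH, or ABSTAIN.
@labeling_function()
def lf_same_email(pair):
    return MATCH if pair.email_a == pair.email_b else ABSTAIN

@labeling_function()
def lf_different_last_name(pair):
    return NON_MATCH if pair.name_a.split()[-1] != pair.name_b.split()[-1] else ABSTAIN

# Apply the noisy, possibly conflicting functions to the candidate pairs and
# let Snorkel's LabelModel reconcile them into probabilistic training labels.
df_pairs = pd.read_parquet("/path/to/candidate_pairs.parquet")   # hypothetical input
applier = PandasLFApplier([lf_same_email, lf_different_last_name])
L_train = applier.apply(df_pairs)
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train)
probs = label_model.predict_proba(L_train)   # training labels for a downstream matcher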

Data: can be synthetic (Joe and Yan Yan will provide)

Tools: Snorkel.  (Check whether it makes sense to run it in the cloud.)