Week 1: Relational Database Management Systems (review)
- No readings. Welcome to class!
Week 2: Parallel shared-nothing DBMSs
- DeWitt and Gray, Parallel Database Systems: The Future of High Performance Database Systems, Communications of the ACM. 1992. [pdf].
- Skim over the introduction, then only read Section 2 Basic Techniques...
- Suggested discussion points in your review. What is the consensus architecture for parallel database systems? Which type of scaleup is applicable to machine learning tasks on large training data?
- Paraschos Koutris, Semih Salihoglu, Dan Suciu: Algorithmic Aspects of Parallel Data Processing, Foundations and Trends in Databases 2018. [pdf].
- Section 2: read 2.1, and 2.2.2
- Section 3: read 3.1.1, 3.1.2, skim over 3.1.3, then read 3.3
- Suggested discussion points: what is the significance of Brent’s theorem? What is the main challenge in computing a join on a distributed cluster? What is the degree threshold for a parallel join?
- Optional (no review): Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). [pdf]. Additional information about Redshift can also be found on the AWS website: https://aws.amazon.com/redshift/.
Week 3: Parallel Query Processing
- Paraschos Koutris, Semih Salihoglu, Dan Suciu: Algorithmic Aspects of Parallel Data Processing, Foundations and Trends in Databases 2018. [pdf].
- Read only the following sections:
- 4.1.1 (takeaway: the Hypercube Algorithm)
- 4.1.2 (takeaway: how to compute the optimal shares)
- 4.1.3 (takeaway: quick way to compute optimal shares for a uniform db)
- 4.1.4 (takeaway: quick way to compute the optimal load for a non-uniform db)
- 4.1.6 (takeaway: make sure you understand well example 4.5)
- 4.2.1 (takeaway: the AGM bound; its application to multi-round parallel computation); skip the last part on the Connection to the Quasi-packing Number.
- Suggested discussion points: what is the trade-off between computing a query in one round or multiple rounds of communication? What speedup does the HyperCube algorithm achieve: linear? Sublinear? Explain. Why is Broadcast Join a special case of the HyperCube algorithm?
- Optional (no review): Teradata Aster Database. [pdf].
Week 4: MapReduce, Successors, Legacy
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04. [pdf].
- D. DeWitt and M. Stonebraker. Mapreduce – a major step backward. In Database Column (Blog), 2008. [pdf].
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive - a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005. [pdf].
- Read sections 1, 2, and skim through section 4 (focus on the optimizations)
- Suggested discussion topics:
- How do these three papers fit gogether? What is the big story that they are telling?
- What are some advantages/disadvantages of MapReduce compared to parallel databases? How does MapReduce address skew? How does MapReduce address worker failures?
- Why was MapReduce needed in Facebook? Why was it insufficient? What are some limitations of HiveSQL? Why does it only support only INSERT OVERWRITE? What are some of the optimizations in Hive?
Week 5: Snowflake
- Dageville et al, The Snowflake Elastic Data Warehouse. SIGMOD Conference 2016: 215-226. [pdf].
- Read sections 1,2,3, skim over 4, and read sec. 6
- Suggested discussion topics:
- What is elasticity, why is it important, and how is it supported in Snowflake?
- How is data storage handled in Snowflake, and why? What would have been the alternatives?
- How are worker failures handled in Snowflake? How does this compare to MapReduce?
- How does snowflake handle semistructure data?
Week 6: Spark, MLSystems
- M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. [pdf].
- MLlib: Machine Learning in Apache Spark Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar Journal of Machine Learning Research, 17 (34), Apr. 2016. [pdf] and also online documentation available here. Make sure to click on "MLLib Guide".
- [Optional paper - Read only if you want] Spark SQL: Relational Data Processing in Spark Michael Armbrust, Reynold Xin, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali Ghodsi, Matei Zaharia. ACM SIGMOD Conference 2015, May. 2015. [pdf].
Week 7: Algorithms for Big Data, Stream Processing
- Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. [pdf]
- Chapter 3 Similar Items, read only sections 3.1 through 3.4.
- Suggested discussion points: what are some applications of the Jaccard similarity? What is the purpose of the minhash signature? What is the purpose of LSH applied to minhash signatures?
- Chapter 4 Streaming Algorithms:
- Section 4.1
- Section 4.2
- Section 4.3
- Section 4.4
- Sections 4.5.1-4.5.3
Suggested discussion points: pick your own. Note that some of these topics were discussed in class; the only new topic is the hyperloglog algorithm.
[Optional] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 12 (August 2015), 1792-1803. [pdf].
[Optional] Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. 2003. Aurora: a new model and architecture for data stream management. The VLDB Journal 12, 2 (August 2003), 120-139. [pdf].
Week 8: Column-store DBMSs
- Da Yan, Yingyi Bu, Yuanyuan Tian and Amol Deshpande (2017), "Big Graph Analytics Platforms", Foundations and Trends in Databases: Vol. 7: No. 1-2, pp 1-195.
- The Design and Implementation of Modern Column-Oriented Database Systems Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden. Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280) [pdf].
- Sections 1, 2, 4 (read 4.1, 4.4., 4.5, skim over the others and skim Section 3).
- [Optional] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis: Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB 3(1): 330-339 (2010). [pdf].