Data Science Topics
SV: Statistics & Visualization
SV I: Statistics Basics
SV II: Visualization Basics
SV III: Lying with statistics
DM: Data Mining / Machine Learning
DM I: Decision Tree 1 (C4.5)
DM II: Clustering (K-MEANS)
DM III: Classification 1 (SVM)
DM IV: Frequent Item Sets (Apriori)
DM V: Maximum Likelihood (EM)
DM VI: Graph Mining (PageRank)
DM VII: AdaBoost
DM VIII: Classification 2 (kNN)
DM IX: Classification 3 (Naive Bayes)
DM X: Optimization: Gradient Descent
DM XI: Evaluation
SML: Systems for machine learning
SML I: Hazy
SML II: MAD Skills & MLbase
SML III: Graph Processing
DI: Data Integration & CrowdSourcing
DI I: Intro to Data Integration
DI II: Data Wrangling
DI III: CrowdSourcing Overview & Quality Control
DI IV: Entity Resolution
DI V: Declarative Crowd-Sourcing
AF: Analytic Frameworks, Storage & Databases
AF I: Map/Reduce - Basics
AF II: Map/Reduce Extensions
AF III: R & Julia
AF IV: Languages for Hadoop
AF V: Spark
AF VI: Scope & Reef
AF VII: NoSQL
AF VIII: Other NoSQL Systems
AF IX: Column Databases
SV: Statistics & Visualization
SV I: Statistics Basics
- Law of large numbers
- Conditional probability
- Bayes’ law
- Hypothesis testing
- Hypothesis pitfall
SV II: Visualization Basics
- Yi, Ji Soo and Kang, Youn ah and Stasko, John and Jacko, Julie: Toward a Deeper Understanding of the Role of Interaction in Information Visualization, IEEE Transactions on Visualization and Computer Graphics, 2007
- Dominique Brodbeck, Riccardo Mazza, Denis Lalanne: Interactive Visualization - A Survey, Book Chapter Human Machine Interaction, 2009
SV III: Lying with statistics
- Misleading scales
- The importance of average (mean) and percentiles
- Lying with maps
- Darrell Huff: How to lie with statistics, 1954
- Mark Monmonier: Lying with Maps, Statistical Science, 2005
DM: Data Mining / Machine Learning
Reading for whole class:
Wu, Xindong and Kumar, Vipin and Ross Quinlan, J. and Ghosh, Joydeep and Yang, Qiang and Motoda, Hiroshi and McLachlan, Geoffrey J. and Ng, Angus and Liu, Bing and Yu, Philip S. and Zhou, Zhi-Hua and Steinbach, Michael and Hand, David J. and Steinberg, Dan: Top 10 algorithms in data mining, Knowledge and Information Systems, Volume 14, Issue 1, 2008
DM I: Decision Tree 1 (C4.5)
DM II: Clustering (K-MEANS)
- How to choose k?
- Impact of distance measure
- Weaknesses
- Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii, Scalable K-Means++, In International Conference on Very Large Data Bases (VLDB), 2012
DM III: Classification 1 (SVM)
- Convey intuition behind SVM
- Trade-offs
- Tutorial on how to use LibSVM
- Chih-Chung Chang, Chih-Jen Lin, LibSVM: A Library for Support Vector Machines, Transactions on Intelligent Systems and Technology (TIST), 2013
- Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, LIBSVM: A Practical Guide to Support Vector Classification, 2013
- Bryan Catanzaro, Narayanan Sundaram, Kurt Keutzer, Fast Support Vector Machine Training and Classification on Graphics Processors, Proceedings of the 25th International Conference on Machine Learning (ICML ’08), 2008
DM IV: Frequent Item Sets (Apriori)
- Applications of frequent itemsets
- Association rules
- Adrian Kügel, Enno Ohlebusch: A space efficient solution to the frequent string mining problem for many databases, Data Mining and Knowledge Discovery, 2008
- Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: A frequent-pattern tree approach, SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000
- Ramesh C. Agarwal, Charu C. Aggarwal, V. V. V. Prasad: A tree projection algorithm for generation of frequent itemsets, Journal of Parallel and Distributed Computing, 2000
- Yongxin Tong, Lei Chen, Yurong Cheng, Philip S. Yu, Mining Frequent Itemsets over Uncertain Databases, Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 11, 2012
DM V: Maximum Likelihood (EM)
DM VI: Graph Mining (PageRank)
- Why is PageRank important
- PageRank definition
- Hadoop PageRank implementation
- Pregel PageRank definition
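As a starting point for the PageRank definition topic, the following is a minimal single-machine sketch of power-iteration PageRank. The function name `pagerank`, the toy graph, the damping factor of 0.85, and the fixed iteration count are all illustrative assumptions; it is not the Hadoop or Pregel formulation covered in class.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [outgoing neighbours]}."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                # Each page splits its damped rank equally among its out-links.
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: D links to C but nothing links to D.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_graph)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

The ranks form a probability distribution, so they sum to 1; a distributed implementation would replace the inner loops with one message-passing round per iteration.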
DM VII: AdaBoost
DM VIII: Classification 2 (kNN)
DM IX: Classification 3 (Naive Bayes)
DM X: Optimization: Gradient Descent
- Gradient Descent
- Stochastic Gradient Descent
- Parallel Implementation
- Hogwild
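To make the difference between the first two topics concrete, here is a small sketch that fits a line y = w*x + b by minimising mean squared error, once with full-batch gradient descent and once with stochastic updates. The data, learning rates, epoch counts, and function names are assumptions chosen for illustration; the stochastic version is sequential and does not attempt the lock-free parallelism of Hogwild.

```python
import random


def batch_gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Update (w, b) once per epoch using the gradient over all samples."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def stochastic_gradient_descent(xs, ys, lr=0.01, epochs=2000, seed=0):
    """Update (w, b) after every single sample, visiting samples in random order."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = w * x + b - y
            w -= lr * 2 * err * x
            b -= lr * 2 * err
    return w, b


# Noise-free toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w_gd, b_gd = batch_gradient_descent(xs, ys)
w_sgd, b_sgd = stochastic_gradient_descent(xs, ys)
print(w_gd, b_gd, w_sgd, b_sgd)
```

Because the toy data is noise-free, both variants settle on the same line; with noisy data a constant-step stochastic run would keep fluctuating around the optimum.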
DM XI: Evaluation
- Fixed train/test split
- k-fold cross-validation
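The k-fold idea can be sketched in a few lines: split the sample indices into k folds and use each fold once as the test set. The helper name `k_fold_indices` and the contiguous, unshuffled fold layout are assumptions made for brevity; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so averaging the k test scores uses all of the data for evaluation without ever testing on a sample the model was trained on.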
SML: Systems for machine learning
SML I: Hazy
SML II: MAD Skills & MLbase
- Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, Arun Kumar: The MADlib Analytics Library or MAD Skills, the SQL. PVLDB 5(12): 1700-1711 (2012)
- Tim Kraska, Ameet Talwalkar, John Duchi, Rean Griffith, Michael J. Franklin, Michael Jordan, MLbase: A Distributed Machine-learning System, CIDR, Asilomar, California, 2013
SML III: Graph Processing
- GraphLab Programming Model and System
- Pregel Programming Model and System
- Comparison
- Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein, Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, In International Conference on Very Large Data Bases (VLDB), 2012
- Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski: Pregel: A System for Large-Scale Graph Processing, In the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010
DI: Data Integration & CrowdSourcing
DI I: Intro to Data Integration
- What is data integration
- Example algorithms for schema alignment, entity resolution (e.g., similarity), and fusion
- WebTables
- Schema Extraction for Tabular Data on the Web
- Luna Dong and Divesh Srivastava, Big Data Integration, Tutorial in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013
- Lise Getoor, Ashwin Machanavajjhala, Entity Resolution: Theory, Practice & Open Challenges, PVLDB 5(12): 2018-2019 (2012)
- Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang: WebTables: exploring the power of tables on the web. PVLDB 1(1): 538-549 (2008)
- Marco D. Adelfio, Hanan Samet, Schema Extraction for Tabular Data on the Web, In International Conference on Very Large Data Bases (VLDB), 2013
DI II: Data Wrangling
- What is data wrangling
- Data Wrangler
- Potter’s Wheel
- Research challenges
- Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer: Wrangler: Interactive Visual Specification of Data Transformation Scripts, CHI 2011
- Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck, Paolo Buono, Research directions in data wrangling: Visualizations and transformations for usable and credible data, Information Visualization, 2011
- Vijayshankar Raman, Joseph M. Hellerstein: Potter’s Wheel: An Interactive Data Cleaning System, VLDB 2001
DI III: CrowdSourcing Overview & Quality Control
- What is crowdsourcing
- Quality control techniques
- CrowdDB & Qurk
- Overview of Amazon Mechanical Turk
- AnHai Doan, Michael Franklin, Donald Kossmann, Tim Kraska, Crowdsourcing Applications and Platforms: A Data Management Perspective (Tutorial), In International Conference on Very Large Data Bases (VLDB), 2013
- Panagiotis G. Ipeirotis, Praveen K. Paritosh, Managing Crowdsourced Human Computation: A Tutorial, Tutorial at the 20th International World-Wide Web Conference (WWW), 2011
- Panagiotis G. Ipeirotis, Foster Provost, Jing Wang, Quality Management on Amazon Mechanical Turk, KDD-HCOMP, 2010
DI IV: Entity Resolution
- What is Entity Resolution
- Automatic ML Algorithms
- CrowdSourcing Approaches
- Jiannan Wang, Guoliang Li, Tim Kraska, Michael J. Franklin, Jianhua Feng, Leveraging Transitive Relations for Crowdsourced Joins, In the Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013
- Jiannan Wang, Tim Kraska, Michael J. Franklin, Jianhua Feng, CrowdER: Crowdsourcing Entity Resolution, VLDB, Istanbul, Turkey, 2012
- Steven Whang, Peter Lofgren, Hector Garcia-Molina, Question Selection for Crowd Entity Resolution, In the Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013
DI V: Declarative Crowd-Sourcing
- TurKit
- CrowdDB
- Qurk
- Scoop
- Comparison between systems
- Greg Little, Lydia B. Chilton, Max Goldman, Robert C. Miller, TurKit: Human Computation Algorithms on Mechanical Turk, UIST '10 Proceedings of the 23rd annual ACM symposium on User interface software and technology, 2010
- Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, Reynold Xin: CrowdDB: answering queries with crowdsourcing, In the Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, 2011
- Adam Marcus, Eugene Wu, David Karger, Samuel Madden, Robert Miller, (Qurk) Human-Powered Sort and Joins, VLDB, Istanbul, Turkey, 2012
- Aditya Parameswaran, Neoklis Polyzotis, Answering queries using humans, algorithms, and databases, CIDR, 2011
- Aditya Parameswaran, Anish Das Sarma, Hector Garcia-Molina, Neoklis Polyzotis, Jennifer Widom, Human-Assisted Graph Search: It’s Okay to Ask Questions, In International Conference on Very Large Data Bases (VLDB), 2010
- Hyunjung Park, Jennifer Widom, Query Optimization over Crowdsourced Data, In International Conference on Very Large Data Bases (VLDB), 2013
AF: Analytic Frameworks, Storage & Databases
AF I: Map/Reduce - Basics
- Hadoop
- Amazon Elastic MapReduce Tutorial
- Critique on Map/Reduce
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, In OSDI, 2004.
- M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, MapReduce and Parallel DBMSs: Friends or Foes?, Communications of the ACM, vol. 53, iss. 1, pp. 64-71, 2010
- A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, A comparison of approaches to large-scale data analysis, In SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on Management of data, New York, NY, USA, 2009
- http://www.umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
AF II: Map/Reduce Extensions
- Scientific Workloads
- Iterative extension
- Adaptive Indexes
- Kai Ren, YongChul Kwon, Magdalena Balazinska, Bill Howe, Hadoop’s Adolescence: An Analysis of Hadoop Usage in Scientific Workloads, In International Conference on Very Large Data Bases (VLDB), 2013
- Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst, HaLoop: Efficient Iterative Data Processing on Large Clusters, Proc. VLDB Endow., 3:285-296, September 2010.
- Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jorg Schad, Only Aggressive Elephants are Fast Elephants, Proc. VLDB Endow., 2012.
AF III: R & Julia
- R/Matlab introduction
- Julia (http://julialang.org/)
- SciDB
- SciDB R integration
- Brown, Paul G., Overview of SciDB: large scale array storage, processing and analysis, In SIGMOD ’10: Proceedings of the SIGMOD international conference on Management of Data, Indianapolis, Indiana, USA., 2010
- Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, John McPherson, Ricardo: Integrating R and Hadoop, In SIGMOD ’10: Proceedings of the SIGMOD international conference on Management of Data, Indianapolis, Indiana, USA., 2010
AF IV: Languages for Hadoop
- Pig Latin (Tutorial + System Overview)
- Hive (Tutorial + System Overview)
- Differences between Pig and Hive
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD '08 Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, Raghotham Murthy: Hive – A Petabyte Scale Data Warehouse Using Hadoop, In ICDE, 2010
AF V: Spark
- Intro to Spark (Tutorial)
- Spark Streaming
- Shark Overview (with examples)
- Architecture of Spark
- Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. April 2012. Best Paper Award and Honorable Mention for Community Award.
- Reynold Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica, Shark: SQL and Rich Analytics at Scale. Technical Report UCB/EECS-2012-214. November 2012.
- Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica, Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. HotCloud 2012. June 2012.
AF VI: Scope & Reef
- Scope stack
- Query compilation in Scope (PeriScope)
- Reef
- Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, Jingren Zhou: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets, Proceedings of the VLDB Endowment Volume 1 Issue 2, 2008
- Jingren Zhou, Nicolas Bruno, Ming-chuan Wu, Paul Larson, Ronnie Chaiken, Darren Shakib: SCOPE: Parallel Databases Meet MapReduce, The VLDB Journal, 2012.
- Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, and Lidong Zhou: Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE, in Proc. of the 2012 OSDI Conference, 2012
- Byung-Gon Chun, Chris Douglas, Shravan Narayanamurthy, Josh Rosen, Tyson Condie, Sergiy Matusevych, Raghu Ramakrishnan, Russell Sears, Carlo Curino, Brandon Myers, Sriram Rao, Markus Weimer, REEF: Retainable Evaluator Execution Framework, In the Proceedings of the VLDB Endowment, Vol. 6, No. 12, 2013
AF VII: NoSQL
- What is NoSQL
- Overview of Systems
- ACID vs BASE (Eventual Consistency)
- The Google Stack
- GFS / Colossus
- BigTable
- Megastore
- Spanner
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, SOSP’03, 2003
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI, 2006
- Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh: Megastore: Providing Scalable, Highly Available Storage for Interactive Services, Proceedings of the Conference on Innovative Data system Research (CIDR) (2011), pp. 223-234
- James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford: Spanner: Google's Globally-Distributed Database, In the Proceedings of OSDI'12: Tenth Symposium on Operating System Design and Implementation, 2012
AF VIII: Other NoSQL Systems
- Consistent Hashing
- Vector Clocks
- Dynamo / Cassandra
- CouchDB
- F1
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels: Dynamo: Amazon’s Highly Available Key-value Store, SOSP’07, 2007
- Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, Himani Apte: F1: A Distributed SQL Database That Scales, VLDB 2013
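For the consistent hashing topic listed above, a minimal sketch of a hash ring with virtual nodes may help. The class name `HashRing`, the node names, the choice of MD5, and 100 virtual nodes per server are illustrative assumptions and do not reproduce the partitioning scheme of Dynamo or Cassandra.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: each node owns many points (virtual nodes) on the ring."""

    def __init__(self, nodes, replicas=100):
        ring = []
        for node in nodes:
            for i in range(replicas):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._owners = [node for _, node in ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # A key belongs to the first ring point clockwise from its hash (wrapping around).
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[idx]


keys = [f"key-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b"])  # node-c has left the cluster
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(len(moved), "of", len(keys), "keys changed owner")
```

Only keys that were stored on the departed node change owner, which is the property that distinguishes consistent hashing from a plain hash-modulo-N assignment.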
AF IX: Column Databases
- What are column stores
- Vertica architecture
- Hardware optimization
- Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, Stan Zdonik: C-Store: A Column Oriented DBMS. VLDB, pages 553-564, 2005
- The Vertica Analytic Database: C-Store 7 Years Later http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf
- Max Heimel, Michael Saecker, Holger Pirk, Stefan Manegold, Volker Markl: Hardware-Oblivious Parallelism for In-Memory Column-Stores, VLDB 2013