Why take this topics course? Data science is an exciting new multidisciplinary field founded on statistics, data analysis and machine learning. It aims to extract knowledge and insights from big and small data. It’s advancing almost every aspect of our lives, from healthcare, politics, economics and entertainment to national security. It’s spawning new startups and technologies, such as targeted advertising, fraud detection, self-driving cars, search and recommender systems, stock forecasting and health diagnostic tools. It’s even being used to build new tools to target and nudge the US electorate. It’s a hot and controversial field in computer science.

What will you learn? In this seminar (it’s not a course) you will hear from leading researchers at Dartmouth extracting new insights from data to advance their respective fields. You will also read and present research papers on key areas in data science. This seminar will also include a number of programming assignments that seek to reinforce concepts and computational methods widely used in data science. The programming assignments will use the pydata stack: the Python open data science stack.  The seminar will also include a group project. We aim to have fun!

Why should or shouldn’t you do this seminar? There will be no formal lectures, and a large amount of self-learning and hacking will be required. We intend to use most X-hours to preview weekly programming assignments. It might end up as a time sink. Note that there is little math in this seminar; it’s mostly focused on hacking using the pydata stack. While advanced seminars such as this are geared more toward research and graduate students, undergraduates are very welcome.

Prerequisites: Python and Machine Learning or Professor Campbell’s approval.

Seminar Speakers

Andrew Campbell (Computer Science) -- Behavioral sensing (week 1)

Jim Haxby (Psychological and Brain Sciences)  -- Computational neuroscience (week 2)

V.S. Subrahmanian (Computer Science) --  Terrorism (week 3)

Saeed Hassanpour (Biomedical Data Science) -- Precision medicine (week 4)

Pino Audia  (Tuck) -- Workplace networks (week 5)

Eugene Santos  (Thayer) -- Human behavior dynamics (week 6)

Olga Zhaxybayeva (Biological Sciences) -- Computational genomics (week 7)

Hany Farid  (Computer Science) -- Digital forensics (week 8)

Brendan Nyhan (Government) -- Fake news (week 9)

Class information

Location: 107 Reed Hall.

Time: 2A Tuesday 2.25-4.15 pm, Thursday 2.25-4.15 pm, Wednesday-X 4.35-5.25 pm

Seminar leader: Andrew T. Campbell

Office hours: Monday 4-5 pm, Sudikoff 147

Lab hours: Tuesday 5:30-8:30 pm, Sudikoff 148

Unofficial TA: Rui Wang

Coursework and grading 

Reading/critiques: 20%

Presentations: 10%

Programming assignments: 40%

Group project: 15%

Class participation: 15%

Writing critiques

You are required to read the papers presented in class and write a critique of each. A critique should be at least 1 page long, but can be longer. It should include the following:

Critiques are due at 11.59 pm the day before the speaker's talk or the paper presentations, so that you come to class with good knowledge of the paper and are ready to contribute to the discussion of its contributions, pros and cons.

Please read this: Keshav, S. (2007). How to read a paper. ACM SIGCOMM Computer Communication Review, 37(3), 83-84.

And check out How to give a good presentation

Grading Critiques

There are approximately 3 critiques per week. Every 2 weeks I randomly select one paper and grade the critique; that is, I am grading 1/6 of your critiques every two weeks. The critiques serve two purposes in my mind:

Programming assignments

Each student will complete programming assignments that reinforce ideas and techniques in data science. The tentative list of assignments is as follows:

  1. Data acquisition
  2. Statistical tests: confidence interval, t-tests, ANOVA, correlation. MPG dataset and Hanover climate dataset.
  3. Linear regression: training and interpretations. Bike rental dataset
  4. Classification – credit card fraud detection
  5. Twitter: topic models and sentiment analysis
  6. Deep learning and TensorFlow
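To give a flavor of the kind of computation behind assignments 2 and 3, here is a minimal sketch of a two-sample t statistic, a Pearson correlation and a least-squares line fit. It is an illustration only, not an assignment solution: the sample values are made up (they are not the MPG, Hanover climate or bike rental data), the function names are hypothetical, and it uses only the Python standard library to stay self-contained, whereas the assignments themselves use the pydata stack.

```python
import math
import statistics

# Made-up illustrative samples (NOT taken from any course dataset).
group_a = [21.0, 22.5, 19.8, 24.1, 23.3, 20.7]
group_b = [18.2, 19.9, 17.5, 20.3, 18.8, 19.1]


def welch_t(x, y):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    standard_error = math.sqrt(var_x / len(x) + var_y / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / standard_error


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


print("Welch t statistic:", welch_t(group_a, group_b))
print("Pearson r:", pearson_r(group_a, group_b))
print("Fitted (slope, intercept):", fit_line([1, 2, 3, 4], [3, 5, 7, 9]))
```

In practice, libraries in the pydata stack provide these computations along with p-values and diagnostics; writing them out by hand once simply makes it easier to interpret what those libraries report.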

Submission of programming assignments

Your assignments will use jupyter notebooks that we provide when the assignment is handed out. Please submit your notebook and data files to Canvas.

Assignments will be previewed during the Wednesday X-hour and are due the following Tuesday at 11.59 pm. You have one 24-hour extension you can use.

Resources

PyData stack tools

How to read a paper

How to give a good presentation

Statistics

Group projects

All project material is here.

Week 1 -- StudentLife and tensors

Introduction to data science seminar

Tuesday: Speaker: Andrew Campbell (Computer Science): “Future of mental health sensing on college campuses”

Paper associated with talk: Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T. Campbell. "StudentLife: Assessing Mental Health, Academic Performance and Behavioral Trends of College Students using Smartphones." In Proceedings of the ACM Conference on Ubiquitous Computing. 2014.

Wednesday-X: Rui Wang: Assignment preview

Thursday: Paper Presentations

Rui Wang, et al. “Tracking Depression Dynamics in College Students using Mobile Phone and Wearable Sensing” submitted to UbiComp 2018 (Presenter: Weichen. [presentation_slides])

Martín Abadi, et al. “TensorFlow: a system for large-scale machine learning”. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 265-283. (Presenter: Shayan. [presentation_slides])

  

 [Jim Haxby gave a future looking talk on computational neuroscience, 05/04/18]

Week 2 -- Computational neuroscience and data processing

Tuesday: Paper Presentations

Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). “The hadoop distributed file system”. In Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on (pp. 1-10)  (Presenter: Marissa and Varun. [presentation_slides])

Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113 (Presenter: Kizito Masaba. [presentation_slides])

Wednesday-X: Rui Wang: Assignment preview

Thursday: Speaker: Jim Haxby (Psychological and Brain Sciences): “Deriving a model of shared fine-scale structure in human cortical information spaces”

Paper associated with talk: Haxby, J. V., Connolly, A. C., & Guntupalli, J. S. (2014). Decoding neural representational spaces using multivariate pattern analysis. Annual review of neuroscience, 37, 435-456.

 

 [VS Subrahmanian talking about malicious actors, 10/04/18]

Week 3 -- Malicious actors and ML

Tuesday: Speaker:  V.S. Subrahmanian (Computer Science), “Bots, Socks, and Vandals: Identifying Malicious Actors on Social Platforms"

Paper associated with talk: Kumar S, Cheng J, Leskovec J, Subrahmanian VS. An army of me: Sockpuppets in online discussion communities. Proceedings of the 26th International Conference on World Wide Web 2017 Apr 3 (pp. 857-866). International World Wide Web Conferences.

Wednesday-X: Rui Wang: Assignment preview

Thursday: Paper Presentations

Guyon, Isabelle, and André Elisseeff. "An introduction to variable and feature selection." Journal of machine learning research 3.Mar (2003): 1157-1182. (Presenters: Cara, Anja. [presentation_slides])

Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to solve real world classification problems?" J. Mach. Learn. Res. 15.1 (2014): 3133-3181. (Presenters: Srinath, Rui Liu. [presentation_slides])

 

 [Saeed Hassanpour discusses results from work on predicting behavioural risk from Instagram data using deep models 19/04/18]

Week 4 -- Precision medicine and social media intelligence

Tuesday: Paper Presentations

Che Z, Kale D, Li W, Bahadori MT, Liu Y. Deep computational phenotyping. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2015 Aug 10 (pp. 507-516). ACM. (Presenters: Anthony, Daniele.  [presentation_slides])

Caballero Barajas KL, Akella R. Dynamically modeling patient's health state from electronic medical records: A time series approach. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2015 Aug 10 (pp. 69-78). ACM. (Presenters: Jeremy, Paul.  [presentation_slides])

Wednesday-X: Rui Wang: Assignment preview

Thursday: Speaker: Saeed Hassanpour (Biomedical Data Science): “Using Artificial Intelligence and Social Media to Predict Behavioral Health Risks"

Kumar N, Tafe LJ, Higgins JH, Peterson JD, de Abreu FB, Deharvengt SJ, Tsongalis GJ, Amos CI, Hassanpour S. Identifying Associations between Somatic Mutations and Clinicopathologic Findings in Lung Cancer Pathology Reports. Methods of Information in Medicine. 2018 Feb;57(01):63-73.

Week 5 -- Workplace networks

Tuesday: Paper Presentations

Lin, C.Y., Wu, L., Wen, Z., Tong, H., Griffiths-Fisher, V., Shi, L. and Lubensky, D., 2012. Social network analysis in enterprise. Proceedings of the IEEE, 100(9), pp.2759-2776. (Presenters: Casey, Sameed  [presentation_slides])

McLarnon, M.J. and Rothstein, M.G., 2013. Development and initial validation of the Workplace Resilience Inventory. Journal of Personnel Psychology, 12(2). (Presenters: Juan, Jun  [presentation_slides])

Wednesday-X: Rui Wang: Assignment preview

Thursday: Speaker: Pino Audia  (Tuck) “Workplace Networks”.

Week 6 -- Human behavioral dynamics

Tuesday: Speaker: Eugene Santos (Thayer) “Analysis of a Computational Framework to Capture Commander’s Decision-Making Process”

Paper associated with talk: “A Contextual Decision-Making Framework

Wednesday-X: Rui Wang: Assignment preview

Thursday:  No Paper Presentations. Work on project proposals.

Week 7 -- Computational genomics

Tuesday: Paper Presentations

Mehrotra, Abhinav, Sandrine R. Müller, Gabriella M. Harari, Samuel D. Gosling, Cecilia Mascolo, Mirco Musolesi, and Peter J. Rentfrow. "Understanding the role of places and activities on mobile phone interaction and usage patterns." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, no. 3 (2017): 84. (Presenters: Xi, Steven  [presentation_slides])

Jeong, Hayeon, Heepyung Kim, Rihun Kim, Uichin Lee, and Yong Jeong. "Smartwatch Wearing Behavior Analysis: A Longitudinal Study." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, no. 3 (2017): 60. (Presenters: Chongyang, Deepti [presentation_slides])

Wednesday-X: Project Pitches

Thursday: Speaker: Olga Zhaxybayeva (Biological Sciences) “Molecular Archaeology of Bacterial Genomes”.

Smith, L.M., 1993. “The future of DNA sequencing”. Science, 262(5133), pp.530-532.

Week 8 -- Data & You

Tuesday: Projects

Thursday: Speaker: Hany Farid  (Computer Science), “Data & You”

Dressel, J. and Farid, H., 2018. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1).

Week 9 -- Fake news

Tuesday: Projects

Wednesday 10 am-1 pm: Show_and_tell

Wednesday-X: Speaker: Brendan Nyhan (Government), “Fake news”

Guess, A., Nyhan, B. and Reifler, J., 2018. Selective Exposure to Misinformation: Evidence from the consumption of fake news during the 2016 US presidential campaign.

Thursday: no class

Week 10 -- Project presentations

Tuesday: Final presentation 2.15-5.15 pm