Data Science Notes

By: Lev Lazinskiy

Notes from the EdX course Python for Data Science.


Useful Links

Data Sets


The goal of data science is to take data and provide actionable insights.

There are five main phases:

        Acquire Data

        Prepare Data

        Analyze Data

        Report Data

        Act on Data

The biggest risk is making it seem like your data tells a clear story when it actually does not. This can significantly damage your credibility if the story is later shown to be false.

The zip function can be used to work through two lists at the same time.
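A small sketch of zip walking two lists in step. The list names and values here are made up for illustration.

```python
# Two parallel lists (illustrative data, not from the course).
names = ["Ada", "Grace", "Linus"]
scores = [90, 85, 77]

# zip pairs up the i-th element of each list.
paired = list(zip(names, scores))

for name, score in zip(names, scores):
    print(f"{name}: {score}")
```

Note that zip stops at the end of the shorter list, so extra items in the longer list are silently ignored.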

Tuples are like lists, but immutable. This is especially important in parallel computing, where data that cannot change is safe to share.
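A quick way to see immutability in practice is to try assigning to a tuple element. The variable names below are illustrative only.

```python
point = (3, 4)

# Assigning to an element of a tuple raises TypeError.
try:
    point[0] = 10
    mutated = True
except TypeError:
    mutated = False

# Because a tuple of numbers cannot change, it is hashable
# and can be used as a dictionary key (a list cannot).
distances = {point: 5.0}
```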


Data Science Library




All built on top of numpy


pandas Series

        1D array

        Many types

        Valid argument to numpy


pandas DataFrame

        2D labeled structure

        Dictionary of series objects
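The two structures described above match the pandas Series and DataFrame. The sketch below assumes numpy and pandas are installed; the weather values and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

days = ["mon", "tue", "wed"]

# A Series is a labeled 1D array; it can hold numbers, strings, and other types.
temps = pd.Series([21.5, 19.0, 23.1], index=days, name="temp_c")
sky = pd.Series(["sunny", "rainy", "cloudy"], index=days, name="sky")

# A Series can be passed directly to numpy functions.
mean_temp = np.mean(temps)

# A DataFrame can be built from a dictionary of Series that share an index.
rain = pd.Series([0.0, 4.2, 1.1], index=days, name="rain_mm")
weather = pd.DataFrame({"temp_c": temps, "rain_mm": rain, "sky": sky})

print(weather)
```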

Data Visualization 

Machine Learning

Machine learning means learning from data without being explicitly programmed. It can be used to find hidden patterns and support data-driven decisions.


        Classification -> Predict Category

        Regression -> Predict Numeric Value

                Linear Regression

                        Capture relationship between input/output

                        Use least squares method to find regression line

        Cluster Analysis -> Organize similar Items into groups

                There is no “correct” clustering

                Can be used to assign new data samples to a segment

                Can also be used to identify outliers in new data samples

                There are no labels, so results require further analysis


                K-means Clustering

                        algorithm used for cluster analysis

                        Choosing a value for K

                                Visualization (scatter) can be used to make sense of input data

                                Application Dependent

                                Data Driven

        Association Analysis -> find rules to capture associations between items
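As a rough illustration of association analysis, the sketch below counts how often items appear together in shopping baskets. The measures "support" and "confidence" are commonly used for this but are not named in these notes, so treat them as an assumption; the baskets are made-up data.

```python
from collections import Counter
from itertools import combinations

# Illustrative transactions (not from the course).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

# How many baskets contain each item, and each pair of items.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def confidence(antecedent, consequent):
    """Share of baskets with `antecedent` that also contain `consequent`."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

# Support: share of all baskets that contain both items.
support_bread_butter = pair_counts[("bread", "butter")] / len(baskets)
rule_confidence = confidence("bread", "butter")
```

A rule such as "bread -> butter" is usually kept only if both numbers are above thresholds chosen for the application.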



Supervised Learning

        Target is provided

        “Labeled” Data

        Classification + Regression are supervised
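Linear regression is a supervised technique: each input comes with a known numeric target. Below is a minimal sketch of the least squares method for a single input variable, using only the standard library. The function name fit_line and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Labeled data: inputs with known targets (here exactly y = 2x + 1).
hours = [1.0, 2.0, 3.0, 4.0]
scores = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(hours, scores)
predicted = slope * 5.0 + intercept  # predict the target for a new input
```

In practice a library routine would normally be used instead; this version only shows what the method computes.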


Unsupervised Learning

        Target is not known, or not available

        “Unlabeled” data

        Cluster + Association are usually unsupervised
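K-means is a typical unsupervised method: it groups points using only their positions, with no labels. The sketch below is a simplified pure-Python version. It assumes the first K points are used as starting centroids (real implementations usually pick them randomly), and the function names and data are hypothetical.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, iterations=10):
    """Return (centroids, assignments) after a fixed number of iterations."""
    # Simplistic initialization: take the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Two visibly separate groups of 2D points (illustrative data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
```

The cluster numbers themselves carry no meaning; interpreting what each group represents still requires further analysis.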


        Python Library for Machine Learning


Natural Language Processing


        Understand and derive meaning from human language

        This is hard because human language is ambiguous


NLTK

        Natural language toolkit


Tokenization

        Split text into words
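Splitting text into words can be sketched with a regular expression from the standard library. This is not how NLTK's own tokenizers work; it is a simplified stand-in, and the function name tokenize is hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Time flies like an arrow; fruit flies like a banana.")
print(tokens)
```

The sample sentence also shows why language is hard: "flies" and "like" play different roles in each half, which a plain word split cannot tell apart.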

Naive Bayes Classifier

        Simple classifier based on conditional probabilities

        Estimates the probability that each feature appears in a category

        Once trained, collects “votes” for all features and finds the most probable label
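The idea above can be sketched in pure Python for word features. This version assumes add-one (Laplace) smoothing and sums log probabilities as the "votes"; both are common choices but are not stated in these notes. The function names and the tiny training set are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (words, label). Returns the counts that make up the model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(words, model):
    """Return the label with the highest total log probability."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior for the label
        label_total = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing avoids a zero probability for unseen pairs.
            p = (word_counts[label][word] + 1) / (label_total + len(vocabulary))
            score += math.log(p)  # each feature adds its "vote"
        scores[label] = score
    return max(scores, key=scores.get)

# Illustrative training data (not from the course).
model = train([
    (["great", "fun", "movie"], "pos"),
    (["loved", "great", "acting"], "pos"),
    (["boring", "slow", "movie"], "neg"),
    (["awful", "boring", "plot"], "neg"),
])

label_a = classify(["great", "acting"], model)
label_b = classify(["boring", "plot"], model)
```

The classifier is called "naive" because it treats every feature as independent of the others given the label.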