Data Science Notes

By: Lev Lazinskiy

Notes from the edX course Python for Data Science.

Tools

Useful Links

Data Sets

Books

The goal of data science is to take data and provide actionable insights.

There are five main phases:

        Acquire Data

        Prepare Data

        Analyze Data

        Report Data

        Act on Data

The biggest risk is making it seem like your data tells a clear story when it does not. This can significantly damage your credibility if the claim is later shown to be false.

The zip function can be used to work through two lists at the same time.
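
A minimal sketch of zip, using made-up names and scores (the variable names are illustrative, not from the course):

```python
# Two parallel lists; the data is invented for illustration.
names = ["Ada", "Grace", "Alan"]
scores = [92, 88, 79]

# zip pairs elements by position and stops at the shorter input.
pairs = list(zip(names, scores))

# Walk through both lists at the same time.
for name, score in zip(names, scores):
    print(f"{name}: {score}")

# A common follow-up: build a lookup table from the pairs.
lookup = dict(zip(names, scores))
```

Because zip stops at the shorter input, lists of unequal length silently lose their trailing items.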

Tuples are immutable sequences, similar to lists. Immutability is especially important in parallel computing, where shared data should not be modified.
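
A short sketch of what immutability means in practice (the values and names are illustrative):

```python
point = (3, 4)

# Item assignment on a tuple raises TypeError; the tuple is left unchanged.
try:
    point[0] = 10
except TypeError as exc:
    error_message = str(exc)

# Because tuples of immutable values are hashable, they can be dictionary keys.
distances = {(0, 0): 0.0, (3, 4): 5.0}

# "Changing" a tuple means building a new one.
moved = (point[0] + 1, point[1])
```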

Pandas

Data Science Library

Transformation

Visualization

Statistics

All built on top of numpy

Series

        1D array

        Many types

        Valid argument to NumPy functions

DataFrame

        2D labeled structure

        Dictionary of series objects
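
The Series and DataFrame points above can be sketched as follows, assuming pandas and numpy are installed. The weather data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a labeled 1D array.
temps = pd.Series([21.5, 23.0, 19.5], index=["mon", "tue", "wed"], name="temp_c")

# A Series can be passed directly to numpy functions.
mean_temp = np.mean(temps)

# A DataFrame behaves like a dictionary of Series that share an index.
df = pd.DataFrame({
    "temp_c": temps,
    "rain_mm": pd.Series([0.0, 2.5, 7.0], index=["mon", "tue", "wed"]),
})

# Selecting one column gives back a Series.
column = df["temp_c"]

# Boolean filtering keeps only the rows where the condition holds.
wet_days = df[df["rain_mm"] > 0]
```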

Data Visualization

http://www.visualisingdata.com/ 

Machine Learning

Machine learning means learning from data rather than being “explicitly” programmed. It can be used to find hidden patterns and support data-driven decisions.

Categories

        Classification -> Predict Category

        Regression -> Predict Numeric Value

                Linear Regression

                        Capture relationship between input/output

                        Use least squares method to find regression line

        Cluster Analysis -> Organize similar Items into groups

                There is no “correct” clustering

                Can be used to assign new data samples to a segment

                Can also be used to identify outliers in new data samples

                There are no labels, so results require further analysis

                

                K-means Clustering

                        algorithm used for cluster analysis

                        Choosing value for K

                                Visualization (scatter) can be used to make sense of input data

                                Application Dependent

                                Data Driven

        Association Analysis -> find rules to capture associations between items
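
The least squares method listed under Linear Regression above can be sketched in plain Python for a single input variable. The function name fit_line and the sample data are assumptions made for illustration; in practice a library such as scikit-learn would normally do this.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Predict a numeric value for a new input.
prediction = slope * 5.0 + intercept
```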

        

Supervised

        Target is provided

        “Labeled” Data

        Classification + Regression are supervised

Unsupervised

        Target is not known, or not available

        “Unlabeled” data

        Cluster + Association are usually unsupervised
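
As an unsupervised example, here is a minimal sketch of k-means in plain Python. The function names, the sample points, and the choice to use the first k points as starting centroids are all assumptions made to keep the example deterministic; real implementations usually pick starting centroids more carefully.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign_cluster(point, centroids):
    """Return the index of the centroid closest to the point."""
    distances = [squared_distance(point, c) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, k, iterations=10):
    # Assumption: naive initialization from the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[assign_cluster(p, centroids)].append(p)
        # Step 2: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters

# Made-up 2D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)

# A new sample is placed in the segment of its nearest centroid.
new_sample_cluster = assign_cluster((2.0, 2.0), centroids)
```

No labels are involved at any point. Which group counts as "cluster 0" is arbitrary, so the results still need interpretation.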

Scikit-learn

        Python Library for Machine Learning

        

Natural Language Processing

NLP

        Understand and derive meaning from human language

        This is hard because human language is ambiguous

nltk

        Natural language toolkit

Tokenization

        Split text into words
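
A minimal sketch of tokenization using only the standard library. This is a simple regular expression, not the NLTK tokenizer, and the tokenize helper is a hypothetical name. NLTK's own word_tokenize handles punctuation and contractions differently and needs extra data downloaded.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())

tokens = tokenize("The movie wasn't great, but the acting was superb!")
```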

Naive Bayes Classifier

        Simple classifier based on conditional probabilities

        Estimates the probability that each feature appears in a category

        Once trained, collects “votes” for all features and finds the most probable label
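
The idea can be sketched in plain Python as a bag-of-words classifier. The train and classify helpers, the tiny training set, and the use of add-one smoothing are assumptions made for illustration; this is not the NLTK implementation.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count how often each word (feature) appears under each label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(words, model):
    """Return the label with the highest total log-probability."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        # Start from the prior probability of the label.
        score = math.log(count / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in words:
            if word in vocabulary:  # words never seen in training are ignored
                # Each word adds its "vote"; add-one smoothing avoids log(0).
                score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up, already tokenized training data.
examples = [
    (["great", "acting", "superb"], "pos"),
    (["loved", "great", "story"], "pos"),
    (["boring", "plot", "awful"], "neg"),
    (["awful", "acting", "terrible"], "neg"),
]

model = train(examples)
positive_guess = classify(["great", "story"], model)
negative_guess = classify(["awful", "plot"], model)
```

Summing log-probabilities rather than multiplying raw probabilities keeps the scores from underflowing on longer texts.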