Data Science Notes
By: Lev Lazinskiy
Notes from the EdX course Python for Data Science.
The goal of data science is to take data and provide actionable insights.
There are five main phases
Act on Data
The biggest risk is making it seem like your data is telling a clear story when it actually is not. This can significantly damage your credibility if the story is later shown to be false.
The zip function can be used to work through two lists at the same time.
Tuples are like lists, but immutable. This is especially important in parallel computing.
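A minimal sketch of both points above, using made-up sample data (the names and scores are hypothetical): zip pairs up two lists element by element, and a tuple rejects in-place modification.

```python
# Walk through two lists in parallel with zip.
names = ["alice", "bob", "carol"]   # hypothetical sample data
scores = [90, 85, 77]

pairs = list(zip(names, scores))    # each element is a (name, score) tuple
for name, score in pairs:
    print(name, score)

# Tuples are immutable: item assignment raises TypeError.
point = (1, 2)
try:
    point[0] = 10
    mutated = True
except TypeError:
    mutated = False
```

If the two lists differ in length, zip stops at the end of the shorter one.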
Data Science Library
All built on top of numpy
Valid argument to numpy
2D labeled structure
Dictionary of series objects
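A short sketch of the points above, assuming the library these notes describe is pandas and the two structures are its Series and DataFrame. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Two labeled 1D Series sharing the same index (hypothetical data).
heights = pd.Series([1.60, 1.75, 1.82], index=["ann", "ben", "cy"])
weights = pd.Series([55.0, 72.0, 80.0], index=["ann", "ben", "cy"])

# A Series can be passed directly to NumPy functions.
root = np.sqrt(weights)

# A DataFrame is a 2D labeled structure; here it is built from a dict of Series.
df = pd.DataFrame({"height_m": heights, "weight_kg": weights})
print(df.shape)
print(df.loc["ben", "weight_kg"])
```

Each dictionary key becomes a column label, and the shared Series index becomes the row labels.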
Learning from data, without being “explicitly” programmed, can be used to find hidden patterns and support data-driven decisions.
Classification -> Predict Category
Regression -> Predict Numeric Value
Capture relationship between input/output
Use least squares method to find regression line
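As a rough illustration of the least squares idea above, here is a sketch for a single input variable, written in plain Python. The function name fit_line and the sample points are assumptions made for this example; in practice a library routine would usually be used.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The line found this way minimizes the sum of squared vertical distances between the points and the line.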
Cluster Analysis -> Organize similar Items into groups
There is no “correct” clustering
Can be used to assign new data samples to a segment
Can also be used to identify outliers in new data samples
There are no labels, so results require further analysis
Algorithm used for cluster analysis
Choosing a value for K
Visualization (scatter) can be used to make sense of input data
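A minimal sketch of cluster analysis, assuming the algorithm meant above is k-means (the notes mention choosing K). K is fixed at 2 here, and the starting centroids and sample points are picked by hand for illustration. Real implementations usually pick starting centroids randomly and stop once assignments no longer change.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        centroids = new_centroids
    labels = [min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels

# Hypothetical 2D samples forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

A new sample can be placed in a segment by finding its nearest centroid, and a sample far from every centroid is a candidate outlier.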
Association Analysis -> find rules to capture associations between items
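A small sketch of how a rule such as "bread -> milk" can be scored. Support and confidence are two commonly used measures; they are not named in these notes, and the transactions below are invented for illustration.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction also containing consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
print(rule_support, rule_confidence)
```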
Target is provided
Classification + Regression are supervised
Target is not known, or not available
Cluster + Association are usually unsupervised
Python Library for Machine Learning
Understand and derive meaning from human language
This is hard because human language is ambiguous
Natural Language Toolkit (NLTK)
Split text into words
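A rough sketch of splitting text into words. This uses a simple regular expression as a stand-in and is not NLTK's tokenizer. NLTK ships its own tokenizers (for example nltk.word_tokenize), which need extra data to be downloaded first. The function name tokenize is an assumption made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Human language is ambiguous, isn't it?")
print(tokens)
```

Even this small example hints at the ambiguity: whether "isn't" should be one token or two is a design decision, and different tokenizers answer it differently.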
Naive Bayes Classifier
Simple classifier based on conditional probabilities
Estimates the probability that each feature appears in a category
Once trained, collects “votes” for all features and finds the most probable label
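The idea above can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation. It assumes word-presence features and add-one smoothing, and only the words present in the input cast "votes". The functions train and predict and the tiny training set are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (words, label). Count how often each word appears per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in samples:
        label_counts[label] += 1
        for w in set(words):
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def predict(model, words):
    """Return the label with the highest summed log probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # prior probability of the label
        for w in set(words) & vocab:
            # P(word present | label) with add-one smoothing (an assumption here).
            score += math.log((word_counts[label][w] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
samples = [
    (["great", "fun", "movie"], "pos"),
    (["loved", "great", "acting"], "pos"),
    (["boring", "slow", "movie"], "neg"),
    (["awful", "boring", "plot"], "neg"),
]
model = train(samples)
label_a = predict(model, ["great", "acting"])
label_b = predict(model, ["boring", "plot"])
print(label_a, label_b)
```

Working with log probabilities turns the product of many small numbers into a sum, which avoids numeric underflow.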