Midterm Review
DATA 100 Summer 2019
Ishaan, Manana
Plan
PCA - Principal Component Analysis
Principal Component Analysis, abbreviated as PCA, is a method of reducing high dimensional data to lower dimensions. As the name implies, it looks at which components (or features) are the most important (or principal) when we are analysing a set of data
What is the overall goal of PCA?
PCA has two primary goals: a) To capture as much of the total variance in the data with as few components as possible
b) To reduce high dimensional data to low dimensional data (expressed as principal components) such that the original matrix can still be reconstructed from the principal components
PCA (contd.)
Use PCA when you believe the data are low rank and you are still exploring the dataset. PCA can also be used to visually identify clusters of similar observations in high-dimensional data
The rank of a matrix is the dimension of the subspace spanned by its columns (Bourbaki, Algebra, p. 359)
Don’t worry if this definition doesn’t make sense to you; there are other ways of thinking about rank. For me, rank is the smallest number of columns needed so that every other column can be written as a linear combination of them
PCA (contd.)
What is the rank of this matrix?
Distance (m) | Time (s) | Distance (km) | Time (min) | Speed (m/s) |
150 | 150 | 0.15 | 2.5 | 1 |
120 | 60 | 0.12 | 1 | 2 |
240 | 150 | 0.24 | 2.5 | 1.6 |
PCA (contd.)
What is the rank of this matrix?
Distance (m) | Time (s) | Distance (km) | Time (min) | Speed (m/s) | Speed (km/min) |
150 | 150 | 0.15 | 2.5 | 1 | 0.06 |
120 | 60 | 0.12 | 1 | 2 | 0.12 |
240 | 150 | 0.24 | 2.5 | 1.6 | 0.096 |
PCA (contd.)
The rank of the first matrix is 3. The 3rd column is a scalar multiple of the 1st column and the 4th column is a scalar multiple of the 2nd column, so neither adds to the rank. The 5th column cannot be expressed as a linear combination of the previous 4 columns, which leaves 3 linearly independent columns (the 1st, 2nd, and 5th)
The rank of the second matrix is also 3. Even though the 6th column can’t be represented as a linear combination of the first 4 columns, it is a scalar multiple of the 5th column. Hence, including this column in the matrix does not increase its rank
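If you want to check the two rank answers numerically, here is a minimal sketch using NumPy. It assumes the third-row speed is 240 m / 150 s = 1.6 m/s (which is what 0.096 km/min corresponds to); the variable names are just for illustration.

```python
import numpy as np

# Rows are the three observations; columns follow the second table:
# distance (m), time (s), distance (km), time (min), speed (m/s), speed (km/min)
X = np.array([
    [150.0, 150.0, 0.15, 2.5, 1.0, 0.06],
    [120.0,  60.0, 0.12, 1.0, 2.0, 0.12],
    [240.0, 150.0, 0.24, 2.5, 1.6, 0.096],
])

rank_first = np.linalg.matrix_rank(X[:, :5])          # first matrix: columns 1 to 5
rank_second = np.linalg.matrix_rank(X)                # second matrix: all six columns
rank_distance = np.linalg.matrix_rank(X[:, [0, 2]])   # metres and kilometres only

print(rank_first, rank_second, rank_distance)
```

The two distance columns on their own have rank 1, which is the "scalar multiple" argument from the answer in numeric form.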
PCA (contd.)
PCA is based on the idea that a small number of orthogonal components account for most of the variance observed in the data. Focusing on these components allows us to reduce the number of dimensions while still capturing most or all of the variance in the data. If the first few components do not capture most of the variance, you should not use PCA
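One way to check whether the first few components capture most of the variance is to look at the squared singular values of the centered data. The sketch below uses made-up data (5 columns driven mostly by 2 hidden factors); the data, the seed, and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 5 columns, built from 2 hidden factors plus a little noise
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 5))

# Center each column, then take the singular values (returned in descending order)
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)

# Share of total variance captured by each component, and the running total
variance_ratio = s**2 / np.sum(s**2)
cumulative = np.cumsum(variance_ratio)

print(np.round(variance_ratio, 3))
print(np.round(cumulative, 3))
```

If the running total is already close to 1 after two or three components, the data are approximately low rank and PCA is a reasonable tool; if the shares are spread evenly, it is not.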
PCA (contd.)
Because these components may be correlated, and therefore each component adds little or no variance to the overall data. In comparison, PCA produces components that are all orthogonal and linearly uncorrelated with one another
PCA (contd.)
No, not quite. Instead, PCA creates new components from linear combinations of the pre-existing columns, such that the new components capture maximum variance and are orthogonal to one another
As per lecture slides:
“Step 0: Center the data matrix by subtracting the mean of each attribute column.
PCA (contd.)
Step 1: Find a linear combination of attributes, represented by a unit vector v, that summarizes each row of the data matrix as a scalar.
Steps 2+: To find k principal components, choose the vector for each one in the same manner as the first, but ensure each one is orthogonal to all previous ones.”
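The steps quoted above can be sketched in NumPy. This is only an illustration on made-up data, and it assumes that v is chosen to maximize the variance of the row summaries (consistent with the goal of PCA stated earlier). It uses np.linalg.svd as a shortcut to get the directions instead of an explicit search.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 rows, 3 correlated columns
X = rng.normal(size=(100, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])

# Step 0: center the data matrix by subtracting each column mean
Xc = X - X.mean(axis=0)

# Step 1 and Steps 2+: the rows of Vt are unit vectors, each orthogonal to the others
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]
summary = Xc @ v1          # one scalar per row of the data matrix

# Orthonormality check: Vt @ Vt.T should be the identity matrix
gram = Vt @ Vt.T

# Compare v1 against 1000 random unit vectors: none should give a larger variance
trials = rng.normal(size=(1000, 3))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = np.var(Xc @ trials.T, axis=0).max()

print(np.var(summary), best_random)
```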
PCA (contd.)
We first decompose the data using SVD. As per lecture slides:
“Singular value decomposition (SVD) describes a matrix decomposition:
PCA (contd.)
Almost ready to start answering questions, just a few more things to clarify
https://piazza.com/class/jwyiam0g7rq6rb?cid=142
PCA (contd.)
Time to actually get the Principal Components
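A minimal sketch of getting the principal components from the SVD, assuming the decomposition has the usual form X = U Σ Vᵀ (the formula itself is not reproduced in these notes) and that X has already been centered. The data and names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))        # hypothetical data matrix
Xc = X - X.mean(axis=0)             # center first

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal components: U scaled by the singular values, which equals Xc projected onto V
pcs = U * s                         # same as U @ np.diag(s)
pcs_alt = Xc @ Vt.T

# Keep the first k components and map them back to get a rank-k approximation of Xc
k = 2
first_k = pcs[:, :k]
approx = first_k @ Vt[:k]

# The components are uncorrelated: their covariance matrix is diagonal
cov = np.cov(pcs, rowvar=False)

print(first_k[:3])
```

Using all the components reproduces the centered matrix exactly (pcs @ Vt), which is the "can still be reconstructed" goal; keeping only k of them trades a little accuracy for fewer dimensions.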
Any questions before we get to practice problems?
Final pieces of advice
Break and Attendance Form
RegEx - Regular Expressions
Ordinary Characters: A, o, s, t, 7, 9, !, z
E.g. r'''flower''' will match ---------> flower
r'''kit-kat''' will match ----------> kit-kat
RegEx - Special Characters
Special Characters (metacharacters) have a “magic power”: they match or signal a pattern other than the literal character itself.
Here are some examples of special characters:
$ ------> indicates the end of the line
? ------> matches the preceding character 0 or 1 times
If you want to get rid of the “magic power” and match only the character you should use a backslash before the special character: \+ ; \$ ; \?
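A short sketch of these rules using Python's built-in re module. The test strings are made up for illustration.

```python
import re

# Ordinary characters match themselves
ordinary = re.findall(r'kit-kat', 'one kit-kat, two kit-kat')

# $ anchors the match to the end of the line: only the final "cat" qualifies
end_anchor = re.findall(r'cat$', 'cat concat')

# ? makes the preceding character optional (0 or 1 times)
optional = re.findall(r'colou?r', 'color colour colr')

# A backslash removes the "magic power" so $, ? and + are matched literally
escaped = re.findall(r'\$\d+\?', 'Is it $5? Or $10?')
literal_plus = re.findall(r'1\+1', '1+1=2')

print(ordinary, end_anchor, optional, escaped, literal_plus)
```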
RegEx - Let’s Read!
RegEx - More Examples
Examples are from Spring 2018 midterm
EDA and String Methods with RegEx
String Handling Methods for Series https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str
Some examples:
Series.str.lower() ------> converts the strings in the series to lowercase
Series.str.replace(pattern, repl) -----> replaces each match of the pattern in the strings with another string
More methods: Series.str.count(pattern); Series.str.extract(pattern); Series.str.findall(pattern); etc.
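A quick sketch of these string methods on a small, made-up Series. Note that regex=True is passed explicitly to str.replace, since the default for that argument differs between pandas versions, and that str.extract needs a capture group in the pattern.

```python
import pandas as pd

s = pd.Series(['Data 100', 'STAT 134', 'cs 61A'])   # hypothetical course names

lower = s.str.lower()                                # lowercase every string
replaced = s.str.replace(r'\d+', '#', regex=True)    # replace each run of digits
digit_counts = s.str.count(r'\d')                    # number of digits per string
numbers = s.str.extract(r'(\d+)')                    # first captured group, as a DataFrame
words = s.str.findall(r'[A-Za-z]+')                  # list of all matches per string

print(lower.tolist())
print(replaced.tolist())
```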
EDA and Data Cleaning in Pandas
Missing Values - Replace if possible; Sometimes we fill in with a certain value
E.g. Remember from HW3 df.fillna(0)
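A minimal sketch of counting and filling missing values, using a made-up table (this is not the HW3 data). Filling with 0 and filling with the column mean are shown side by side because the right choice depends on what the missing value means.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing score
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [90.0, np.nan, 70.0]})

n_missing = df['score'].isna().sum()                   # how many values are missing
filled_zero = df.fillna(0)                             # fill every missing value with 0
filled_mean = df['score'].fillna(df['score'].mean())   # fill with the column mean instead

print(n_missing)
print(filled_mean.tolist())
```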
Joins - usually join foreign key of one table with the primary key of another
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
pd.merge(left_table, right_table, how = 'inner', left_on = 'left_column_name', right_on = 'right_column_name')
left_table.merge(right_table, how = 'inner', left_on = 'left_column_name', right_on = 'right_column_name')
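Here is a small sketch of both call styles on two made-up tables, where major_id in the first table plays the role of the foreign key and id in the second is the primary key. The table and column names are hypothetical.

```python
import pandas as pd

students = pd.DataFrame({'sid': [1, 2, 3], 'major_id': [10, 20, 30]})
majors = pd.DataFrame({'id': [10, 20], 'major': ['Data Science', 'Statistics']})

# Inner join: keeps only students whose major_id has a match in majors
inner = pd.merge(students, majors, how='inner', left_on='major_id', right_on='id')

# Left join: keeps every student; unmatched rows get NaN in the columns from majors
left = students.merge(majors, how='left', left_on='major_id', right_on='id')

print(inner)
print(left)
```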
Pandas
df.loc[row_label, column_label] - selects by label (think of the pandas index as labels for rows!)
df.loc[df['sepal_length'] > 5, 'sepal_width'] - a boolean condition can be used in place of the row labels
df.iloc[row_position, column_position] - selects by integer position
df.iloc[4, 2] - the positions range from 0 to length-1
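A short sketch contrasting the two indexers on a tiny, made-up table with string row labels, so that labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {'sepal_length': [5.1, 4.9, 6.3, 5.8], 'sepal_width': [3.5, 3.0, 3.3, 2.7]},
    index=['a', 'b', 'c', 'd'],   # hypothetical row labels
)

by_label = df.loc['b', 'sepal_width']                      # row labelled 'b'
by_mask = df.loc[df['sepal_length'] > 5, 'sepal_width']    # rows where the condition holds
by_position = df.iloc[2, 1]                                # third row, second column

print(by_label, by_position)
print(by_mask)
```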
Pandas
Groupby - returns a GroupBy object (for either a Series or a DataFrame!). GroupBy objects can be aggregated (.agg) or filtered (.filter)
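A minimal sketch of aggregating and filtering a GroupBy object. The table and its column names are made up for illustration and are not taken from any exam.

```python
import pandas as pd

df = pd.DataFrame({
    'major': ['CS', 'CS', 'Stat', 'Stat', 'Math'],
    'salary': [100, 120, 90, 110, 80],
})

grouped = df.groupby('major')

# .agg collapses each group to one value per group
means = grouped['salary'].agg('mean')

# .filter keeps or drops whole groups; here, only majors with at least 2 rows survive
big_groups = grouped.filter(lambda g: len(g) >= 2)

print(means)
print(big_groups)
```

Note the difference in output shape: .agg returns one row per group, while .filter returns the original rows of the groups that pass the test.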
Pandas
Fall 2018 Final
‘c_name’ - company name; ‘m_name’ - major name
Visualizations (quantitative)
Notebook Demo if there is time!
Histograms - plt.hist() - for quantitative variables - total area is 1 when plotted as a density (density=True)
Density plots (KDE plots) - for quantitative variables - “smoothed histograms”
Scatter plots - plt.scatter() - for two quantitative variables (warning: overplotting!)
Hex Plots - “histograms in 2D” - for two quantitative variables
Density plots in 2D - for two quantitative variables
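To see what "total area is 1" means for a histogram, here is a sketch that computes the bar heights with np.histogram, which accepts the same bins and density arguments as plt.hist(). The data are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
values = rng.normal(loc=50, scale=10, size=1000)   # hypothetical quantitative variable

# Raw counts per bin versus density-normalized heights
counts, edges = np.histogram(values, bins=20)
density, _ = np.histogram(values, bins=20, density=True)

# Area of a histogram = sum of (bar height * bar width)
widths = np.diff(edges)
area_counts = np.sum(counts * widths)
area_density = np.sum(density * widths)

print(counts.sum(), area_counts, area_density)
```

With raw counts the bar heights sum to the number of observations; only the density version has total area 1.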
Visualizations (categorical)
Bar Plots - plt.bar(), sns.barplot() - for categorical variables
Box Plots - for comparing the distributions of a quantitative variable across the levels of a categorical variable - e.g. the distribution of income for male vs female
Pie Charts - for categorical variables
Visualizations
SQL
Relations - tables; Records - rows; Attribute or field - column
SELECT attribute names
FROM relation(s)
WHERE …. LIKE / IN… (filters the records in the relation)
GROUP BY attribute name
HAVING logical clause (filters the groups)
ORDER BY column name DESC (default is ascending)
LIMIT integer;
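The clause order above can be tried out with Python's built-in sqlite3 module. This is a sketch on a made-up grades table, with the keywords written in their standard two-word forms (GROUP BY, ORDER BY); the table, columns, and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # temporary in-memory database
conn.execute('CREATE TABLE grades (student TEXT, course TEXT, score INTEGER)')
conn.executemany('INSERT INTO grades VALUES (?, ?, ?)', [
    ('ana', 'DATA 100', 90), ('bo', 'DATA 100', 70),
    ('cy', 'DATA 8', 85), ('di', 'DATA 8', 95),
    ('ed', 'STAT 134', 60),
])

query = """
SELECT course, AVG(score) AS avg_score
FROM grades
WHERE course LIKE 'DATA%'      -- filters records before grouping
GROUP BY course
HAVING COUNT(*) >= 2           -- filters groups after grouping
ORDER BY avg_score DESC
LIMIT 1;
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

WHERE removes the STAT 134 record before any grouping happens, while HAVING is applied to the groups that remain.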