1 of 40

Midterm Review

DATA 100 Summer 2019

Ishaan, Manana

2 of 40

Plan

  • PCA

  • RegEx

  • EDA, Pandas, SQL, Visualizations, Sampling...

3 of 40

PCA - Principal Component Analysis

  1. What does PCA stand for? What does it actually mean?

Principal Component Analysis, abbreviated as PCA, is a method for reducing high-dimensional data to lower dimensions. As the name implies, it looks at which components (or features) are the most important (or principal) when we are analyzing a dataset.

  2. What is the overall goal of PCA?

PCA has two primary goals:

a) To capture as much of the total variance in the data with as few components as possible.

b) To reduce high-dimensional data to low-dimensional data (expressed as principal components) such that the original matrix can still be reconstructed from the principal components.

4 of 40

PCA (contd.)

  3. When should PCA be used?

When you believe the data are low rank and you are still exploring the dataset. PCA can be used to visually identify clusters of similar observations in high-dimensional data.

  • What does rank mean? (should know this from Math 54/EE 16A/etc.)

The rank of a matrix is the dimension of the subspace spanned by its columns (Bourbaki, Algebra, p. 359)

Don’t worry if this definition doesn’t make sense to you; there are other ways of thinking about rank. For me, the rank is the minimum number of columns needed to describe all the other columns as linear combinations of them.

5 of 40

PCA (contd.)

What is the rank of this matrix?

Distance (m) | Time (s) | Distance (km) | Time (min) | Speed (m/s)
150          | 150      | 0.15          | 2.5        | 1
120          | 60       | 0.12          | 1          | 2
240          | 150      | 0.24          | 2.5        | 1.6

6 of 40

PCA (contd.)

What is the rank of this matrix?

Distance (m) | Time (s) | Distance (km) | Time (min) | Speed (m/s) | Speed (km/min)
150          | 150      | 0.15          | 2.5        | 1           | 0.06
120          | 60       | 0.12          | 1          | 2           | 0.12
240          | 150      | 0.24          | 2.5        | 1.6         | 0.096

7 of 40

PCA (contd.)

The rank of the first matrix is 3. The 3rd column is a scalar multiple of the 1st column and the 4th column is a scalar multiple of the 2nd, so neither adds to the rank; the 5th column cannot be expressed as a linear combination of the previous 4 columns, so the three independent columns give rank 3.

The rank of the second matrix is also 3. Even though the 6th column can’t be represented as a linear combination of the first 4 columns, it is a scalar multiple of the 5th column. Hence, including this column in the matrix does not increase its rank.
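As a sanity check, both matrices can be rebuilt from the raw distance and time columns (so every derived column is exactly consistent) and their ranks verified numerically with NumPy:

```python
import numpy as np

# Raw columns: distance in meters, time in seconds.
d_m = np.array([150.0, 120.0, 240.0])
t_s = np.array([150.0, 60.0, 150.0])

# First matrix: the unit conversions and the speed are derived columns.
A = np.column_stack([d_m, t_s, d_m / 1000, t_s / 60, d_m / t_s])

# Second matrix: adds Speed (km/min), itself a scalar multiple of Speed (m/s).
B = np.column_stack([A, (d_m / 1000) / (t_s / 60)])

print(np.linalg.matrix_rank(A))  # 3
print(np.linalg.matrix_rank(B))  # 3 -- the extra column adds no rank
```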

8 of 40

PCA (contd.)

  4. What is the underlying principle?

PCA is based on the idea that a small number of orthogonal components often account for most of the variance observed in the data, so focusing on these components allows us to reduce the number of dimensions while still capturing most of the variance. If the first few components do not capture most of the variance, you should not use PCA.

9 of 40

PCA (contd.)

  5. Why don’t we just pick the components/features/columns in the dataset with the greatest variance?

Because those columns may be correlated with one another, so each additional column contributes little or no new variance to the overall data. In contrast, the components PCA produces are all orthogonal and linearly uncorrelated with one another.

10 of 40

PCA (contd.)

  6. So does PCA just discard components with low variance?

No, not quite. Instead, PCA creates new components as linear combinations of the pre-existing columns, such that the new components capture maximum variance and are orthogonal to one another.

  7. How do we perform PCA (in theory)?

As per lecture slides:

“Step 0: Center the data matrix by subtracting the mean of each attribute column.

11 of 40

PCA (contd.)

Step 1: Find a linear combination of attributes, represented by a unit vector v, that summarizes each row of the data matrix as a scalar.

  • v gives a one-dimensional projection of the data, the first principal component.
  • Unit vector v is chosen to minimize the sum of squared distances between each point and its projection onto v.

Steps 2+: To find k principal components, choose the vector for each one in the same manner as the first, but ensure each one is orthogonal to all previous ones.”

12 of 40

PCA (contd.)

  8. How do we perform PCA (in practice)?

We first decompose the data using SVD. As per lecture slides:

“Singular value decomposition (SVD) describes a matrix decomposition:

  • X = UΣVᵀ (or XV = UΣ) where U and V are orthonormal and Σ is diagonal.
  • If X has rank r, then there will be r non-zero values on the diagonal of Σ.
  • The values in Σ, called singular values, are ordered from greatest to least.
  • The columns of U are the left singular vectors.
  • The columns of V are the right singular vectors.”

13 of 40

PCA (contd.)

Almost ready to start answering questions; just a few more things to clarify

  • Before performing SVD, we must center the data matrix. Why? Check out

https://stats.stackexchange.com/questions/69157/why-do-we-need-to-normalize-data-before-principal-component-analysis-pca

  • What are left and right singular vectors? Check out Stephanie’s explanation at

https://piazza.com/class/jwyiam0g7rq6rb?cid=142

  • Don’t worry too much about singular values; just know that the sum of the squares of all the singular values in Σ is the total variance

14 of 40

PCA (contd.)

Time to actually get the Principal Components

  • The first k principal components are the first k columns of XV
  • Since X = UΣVᵀ, we can also say that XV = UΣ, and as such the first k principal components are also the first k columns of UΣ
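The full recipe (center, SVD, project) can be sketched in NumPy. The toy data below is invented for illustration: three columns where the third is nearly a linear combination of the first two, so the data is close to rank 2.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
X = np.column_stack([X, X @ np.array([1.0, -2.0]) + 0.01 * rng.normal(size=100)])

# Step 0: center the data matrix by subtracting each column's mean.
Xc = X - X.mean(axis=0)

# SVD: Xc = U @ np.diag(s) @ Vt, with singular values s sorted descending.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# First k principal components: first k columns of Xc @ V, equivalently U @ diag(s).
k = 2
pcs = Xc @ Vt.T[:, :k]
assert np.allclose(pcs, (U * s)[:, :k])

# Fraction of total variance captured by the first k components
# (total variance is the sum of the squared singular values).
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"variance captured by first {k} PCs: {explained:.4f}")
```

Because the third column is almost determined by the first two, the first two components capture nearly all the variance, which is exactly the situation where PCA is useful.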

Any questions before we get to practice problems?

15 of 40

16 of 40

17 of 40

18 of 40

19 of 40

Final pieces of advice

  • Make sure to read the entire question carefully, and underline words or terms you think are relevant
  • Pace yourself so you have enough time to attempt all the questions on the exam
  • Specifically my (Ishaan’s) opinion: if there’s a coding question, you may want to skip it and come back to it later
  • Make sure you’ve eaten and are well rested, good luck!

20 of 40

Break and Attendance Form

21 of 40

RegEx - Regular Expressions

  • Regular expressions are used for defining search patterns.
  • Essentially, they are collections of characters!
  • Regular expressions can contain both ordinary and special characters.

Ordinary Characters: A, o, s, t, 7, 9, !, z

E.g. r'''flower''' will match ---------> flower

r'''kit-kat''' will match ----------> kit-kat

22 of 40

RegEx - Special Characters

Special Characters (metacharacters) have a “magic power”: they match or signal patterns other than their literal character.

Here are some examples of special characters:

+ ------> matches the preceding character 1 or more times

$ ------> indicates the end of the line

? ------> matches the preceding character 0 or 1 times

If you want to get rid of the “magic power” and match only the literal character, you should use a backslash before the special character: \+ ; \$ ; \?
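These behaviors can be checked in Python's re module; the sample strings below are made up:

```python
import re

assert re.search(r"ab+c", "abbbc")    # + : one or more 'b's
assert not re.search(r"ab+c", "ac")   # ...so zero 'b's does not match
assert re.search(r"ab?c", "ac")       # ? : zero or one 'b'
assert re.search(r"c$", "abc")        # $ : 'c' at the end of the line
assert re.search(r"1\+1", "1+1=2")    # \+ : a literal plus sign
assert not re.search(r"1\+1", "111")  # no literal '+' anywhere in '111'
assert re.search(r"\$5", "costs $5")  # \$ : a literal dollar sign
print("all patterns behave as described")
```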

23 of 40

24 of 40

RegEx - Let’s Read!

  • r'''^.*$'''

  • r'''(male|female)\.$'''

  • r'''^M*.*?(na){2}'''
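One way to check your reading of these three patterns, using re on invented sample strings:

```python
import re

# r'''^.*$''' -- anchors around "any characters": matches any single line.
assert re.fullmatch(r"^.*$", "literally anything, even 123!")
assert re.fullmatch(r"^.*$", "")  # including the empty string

# r'''(male|female)\.$''' -- 'male.' or 'female.' at the end of the line.
assert re.search(r"(male|female)\.$", "The subject is female.")
assert not re.search(r"(male|female)\.$", "female subjects")  # no trailing '.'

# r'''^M*.*?(na){2}''' -- any number of leading M's, then as few characters
# as possible, then 'na' twice in a row ('nana').
assert re.match(r"^M*.*?(na){2}", "banana")
assert re.match(r"^M*.*?(na){2}", "MMM banana")
assert not re.match(r"^M*.*?(na){2}", "bana")  # only one 'na'
print("all three patterns read as described")
```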

25 of 40

RegEx - More Examples

Examples are from Spring 2018 midterm

26 of 40

RegEx - More Examples

27 of 40

RegEx - More Examples

28 of 40

EDA and String Methods with RegEx

String Handling Methods for Series https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str

Some examples:

Series.str.lower() ------> converts the strings in the Series to lowercase

Series.str.replace(pattern, repl) ------> replaces occurrences of the pattern in each string with another string

More methods: Series.str.count(pattern); Series.str.extract(pattern); Series.str.findall(pattern); etc.
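A quick illustration of these methods on a toy Series (values invented):

```python
import pandas as pd

s = pd.Series(["Kit-Kat", "FLOWER", "banana"])

print(s.str.lower().tolist())                         # ['kit-kat', 'flower', 'banana']
print(s.str.replace("-", " ", regex=False).tolist())  # ['Kit Kat', 'FLOWER', 'banana']
print(s.str.count(r"[aeiou]").tolist())               # [2, 0, 3] -- lowercase vowels only
print(s.str.findall(r"an").tolist())                  # [[], [], ['an', 'an']]
```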

29 of 40

EDA and Data Cleaning in Pandas

Missing Values - Replace if possible; sometimes we fill them in with a certain value

E.g. remember df.fillna(0) from HW3

Joins - usually join foreign key of one table with the primary key of another

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

pd.merge(left_table, right_table, how='inner', left_on='left_column_name', right_on='right_column_name')

left_table.merge(right_table, how='inner', left_on='left_column_name', right_on='right_column_name')
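A sketch of such a join on two hypothetical tables (the table and column names are invented): 'orders' carries a foreign key into 'customers', whose primary key is 'cust_id'.

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ana", "Bo", "Cy"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "cust": [1, 1, 3]})

# Inner join: keep only rows whose keys appear in both tables.
merged = orders.merge(customers, how="inner", left_on="cust", right_on="cust_id")
print(merged[["order_id", "name"]])
# Bo (cust_id 2) placed no orders, so he drops out of the inner join.
```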

30 of 40

31 of 40

Pandas

df.loc[row_label, column_label] - selects by label (think of the pandas index as labels for the rows!)

df.loc[df['sepal_length'] > 5, 'sepal_width']

df.iloc[row_position, column_position] - selects by position

df.iloc[4, 2] - the positions range from 0 to length-1
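The same distinction on a tiny three-row stand-in for the iris table (values invented):

```python
import pandas as pd

df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 6.3],
     "sepal_width": [3.5, 3.0, 2.9],
     "species": ["setosa", "setosa", "virginica"]},
    index=["a", "b", "c"],  # labels, not positions
)

# loc: boolean filter on the rows, then a column selected by its label.
wide = df.loc[df["sepal_length"] > 5, "sepal_width"]
print(wide.tolist())  # [3.5, 2.9]

# iloc: purely positional -- third row, second column.
print(df.iloc[2, 1])  # 2.9
```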

32 of 40

Pandas

Groupby - returns a groupby object (for either a Series or a DataFrame!). Groupby objects can be aggregated (.agg) or filtered (.filter)
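A short sketch of .agg vs .filter on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.3, 6.1, 6.5],
})

# .agg collapses each group to one row (here, the per-species mean).
means = df.groupby("species")["sepal_length"].agg("mean")
print(means)

# .filter keeps the *original* rows of groups passing a condition.
big = df.groupby("species").filter(lambda g: len(g) >= 3)
print(big)  # only the three virginica rows survive
```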

33 of 40

Pandas

Fall 2018 Final

‘c_name’ - company name; ‘m_name’ - major name

34 of 40

35 of 40

Visualizations (quantitative)

Notebook Demo if there is time!

Histograms - plt.hist() - for quantitative variables - total area is 1 when normalized (density=True)

Density plots (KDE plots) - for quantitative variables - “smoothed histograms”

Scatter plots - plt.scatter() - for two quantitative variables (warning: overplotting!)

Hex Plots - “histograms in 2D” - for two quantitative variables

Density plots in 2D - for two quantitative variables
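A quick numerical check of the "total area is 1" property: np.histogram with density=True applies the same normalization that plt.hist(..., density=True) draws, and the bar areas sum to 1 (data invented):

```python
import numpy as np

data = np.random.default_rng(0).normal(size=1000)
density, edges = np.histogram(data, bins=20, density=True)

# Each bar's area is height * bin width; the areas must total 1.
area = float((density * np.diff(edges)).sum())
print(round(area, 6))  # 1.0
```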

36 of 40

Visualizations (categorical)

Bar Plots - plt.bar(), sns.barplot() - for categorical variables

Box Plots - for comparing the distributions of a quantitative variable across the values of a categorical variable - e.g. the distribution of income for males vs. females

Pie Charts - for categorical variables

37 of 40

Visualizations

38 of 40

SQL

Relations - tables; Records - rows; Attribute or field - column

SELECT attribute names

FROM relation(s)

WHERE … LIKE / IN … (filters the records in the relation)

GROUP BY attribute name

HAVING logical clause (filters the groups)

ORDER BY column name DESC (default is ascending)

LIMIT integer;
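The clause order above can be exercised end to end with Python's built-in sqlite3 module on a made-up relation:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE animals (name TEXT, kind TEXT, weight REAL)")
con.executemany("INSERT INTO animals VALUES (?, ?, ?)", [
    ("Rex", "dog", 30.0), ("Mia", "cat", 4.0),
    ("Bo", "dog", 25.0), ("Zed", "cat", 5.0), ("Ava", "dog", 8.0),
])

rows = con.execute("""
    SELECT kind, COUNT(*) AS n, AVG(weight) AS avg_w
    FROM animals
    WHERE weight > 4.5          -- filters records: Mia (4.0) is dropped
    GROUP BY kind               -- one group per kind
    HAVING COUNT(*) >= 2        -- filters groups: the lone cat group is dropped
    ORDER BY avg_w DESC         -- default order would be ascending
    LIMIT 5;
""").fetchall()
print(rows)  # [('dog', 3, 21.0)]
```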

39 of 40

40 of 40