1 of 56

Announcements

  • HW1 is due tonight. Make sure to validate and submit!
  • HW2 and next vitamin will be out today.

2 of 56

DS100: Fall 2018

Lecture 7 (Josh Hug): Working With Text

  • Cleaning Text
  • Extracting From Text with Split
  • Regular Expressions
  • Restaurant Data Case Study

3 of 56

Goals For Today

Goals For Today: Working With Text Data

  • Cleaning with native Python string functions.
  • Extracting data from text.
    • Using split.
    • Using regular expressions.
  • Manipulating strings inside Series using Pandas string functions.
    • Seen via case study on restaurant data.

4 of 56

Cleaning Text with Python String Methods

5 of 56

A Joining Problem

join

???

6 of 56

Resolving Our Problem through Python String Functions

See 07-text.ipynb

7 of 56

Resolving Our Problem through Python String Functions

Our goal: Canonicalization

  • Replace each string with a unique representation.
  • Feels very “hacky”, but this is just how it goes.

Tools used so far:

  • Slicing: str[:-7]
  • Replacement: str.replace('&', 'and')
  • Deletion: str.replace(' ', '')
  • Transformation: str.lower()
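A minimal sketch of how the four tools above might be chained into a single canonicalization step. The function name `canonicalize` and the sample county names are assumptions made for illustration, and the slice assumes every name ends in " County" (7 characters); the steps in 07-text.ipynb may differ.

```python
def canonicalize(name):
    """Hypothetical helper: map a county name to one canonical form."""
    name = name[:-7]                 # slicing: drop the trailing " County"
    name = name.replace('&', 'and')  # replacement
    name = name.replace(' ', '')     # deletion: remove every space
    return name.lower()              # transformation

print(canonicalize("Lewis & Clark County"))   # lewisandclark
print(canonicalize("Lac Qui Parle County"))   # lacquiparle
```

Two names that canonicalize to the same string can then be joined on that string.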

8 of 56

How We Used Python String Functions

9 of 56

Challenge: Create a Sequence of Steps That Works for Both

???

See 07-text.ipynb for solution.

10 of 56

Extracting From Text Using Split

11 of 56

Extracting Date Information

Suppose we want to extract times and dates from webserver logs that look like the following:

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

There are existing libraries that do most of the work for us, but let’s try to do it from scratch.

  • Will do together, just a little bit at a time.

12 of 56

Extracting Date Information

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

One possible solution:
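The code for this slide did not survive in the text, so the following is only a sketch of one way to do it with split, not necessarily the version shown in lecture. The variable names are assumptions.

```python
line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

# Keep only the text between the square brackets.
timestamp = line.split('[')[1].split(']')[0]

# Peel off one field at a time, carrying the remainder along in `rest`.
day, month, rest = timestamp.split('/')
year, hour, minute, rest = rest.split(':')
seconds, time_zone = rest.split(' ')

print(day, month, year, hour, minute, seconds, time_zone)
# 26 Jan 2014 10 47 58 -0800
```

Each split depends on the exact delimiters in this log format, which is why the approach feels brittle.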

13 of 56

Regular Expression Basics

14 of 56

Extracting Date Information

Earlier we saw that we can hack together code that uses split to extract info:

An alternate approach is to use a so-called “regular expression”:

  • Implementation provided in the re library built into Python.
  • We’ll spend some time working up to expressions like shown below.

15 of 56

String Matching Example (Reference)

Identifying C2H2 Zinc Fingers in an amino acid sequence:

  • GPCGGWCAASCGGPYACGGWAGYHAGWHWAH

How do you tell a Zinc Finger? If you see a subsequence with:

  • C
  • 2 to 4 amino acids
  • C
  • 3 more amino acids
  • One of LIVMFYWCX
  • 8 more amino acids
  • H
  • 3 to 5 more amino acids
  • H

Amino acid sequence

Zinc Finger subsequence

16 of 56

Introducing the Regular Expression

Regular expressions: A notation for specifying a set of strings.

Example: C.{2,4}C.{3}[LIVMFYWCX].{8}H.{3,5}H

  • C, then 2 to 4 of any letter, then C, then 3 of any letter, then one character out of the set LIVMFYWCX, then 8 of any letter, then H, then 3 to 5 of any letter, then H.

All major programming languages support regular expressions. Python code:

import re

seq = "GPCGGWCAASCGGPYACGGWAGYHAGWHWAH"
pattern = "C.{2,4}C.{3}[LIVMFYWCX].{8}H.{3,5}H"
re.findall(pattern, seq)  # returns ['CAASCGGPYACGGWAGYHAGWH']

17 of 56

Introducing the Regular Expression

Regular expressions: A notation for specifying a set of strings.

Example 2: [0-9]{3}-[0-9]{2}-[0-9]{4}

  • 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.

import re

text = "My social security number is 456-76-4295 bro."
pattern = "[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)  # returns ['456-76-4295']

18 of 56

Regular Expression Syntax

The four basic operations for regular expressions.

  • Can technically do anything with just these basic four (albeit tediously).

operation               order  example    matches          does not match
concatenation           3      AABAAB     AABAAB           every other string
or                      4      AA|BAAB    AA, BAAB         every other string
closure (zero or more)  2      AB*A       AA, ABBBBBBA     AB, ABABA
parenthesis             1      A(A|B)AAB  AAAAB, ABAAB     every other string
                               (AB)*A     A, ABABABABA     AA, ABBA
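The rows of the table can be checked in Python. This is a small sketch that uses `re.fullmatch`, which succeeds only when the entire string matches; the helper name `matches` and the extra non-matching strings for the "every other string" rows are my own picks.

```python
import re

# (pattern, strings that should match, strings that should not)
rows = [
    ("AABAAB",    ["AABAAB"],         ["AAB", "AABAABA"]),
    ("AA|BAAB",   ["AA", "BAAB"],     ["AAB", "AABAAB"]),
    ("AB*A",      ["AA", "ABBBBBBA"], ["AB", "ABABA"]),
    ("A(A|B)AAB", ["AAAAB", "ABAAB"], ["AAB", "AABAAB"]),
    ("(AB)*A",    ["A", "ABABABABA"], ["AA", "ABBA"]),
]

def matches(pattern, s):
    """True if the whole string s is described by pattern."""
    return re.fullmatch(pattern, s) is not None

for pattern, good, bad in rows:
    assert all(matches(pattern, s) for s in good)
    assert not any(matches(pattern, s) for s in bad)
```

Note that `re.findall` and `re.search` look for the pattern anywhere inside a string, so they would report a hit for AB*A inside ABABA even though the full string is not a match.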

19 of 56

Regular Expression Syntax

AB*: A then zero or more copies of B: A, AB, ABB, ABBB

(AB)*: Zero or more copies of AB: ABABABAB, ABAB, AB, and the empty string

20 of 56

Regex101.com (or the online tutorial regexone.com)

There are a ton of nice resources out there to experiment with regular expressions (e.g. regex101.com, regexone.com, Sublime Text, Python, etc.).

I recommend trying out regex101.com, which provides a visually appealing and easy to use platform for experimenting with regular expressions.

  • Example: https://regex101.com/r/1SREie/1

21 of 56

Puzzle: Use regex101.com to test! Or tinyurl.com/reg913z

Give a regular expression that matches moon, moooon, etc. Your expression should match any even number of os except zero (i.e. don’t match mn).

22 of 56

Puzzle Solution

Solution to puzzle on previous slide: moo(oo)*n

23 of 56

Regular Expression Puzzle: https://tinyurl.com/reg913x

Give a regex that matches muun, muuuun, moon, moooon, etc. Your expression should match any even number of us or os except zero (i.e. don’t match mn).

24 of 56

Puzzle Solution

Solution to puzzle on previous slide: m(uu(uu)*|oo(oo)*)n

  • Note: m(uu(uu)*)|(oo(oo)*)n is not correct! OR must be in parentheses!

25 of 56

Order of Operations in Regexes

m(uu(uu)*|oo(oo)*)n

  • Matches strings starting with m and ending with n, with either of the following in the middle:
    • uu(uu)*
    • oo(oo)*
  • Match examples: muun, muuuun, moon, moooon

m(uu(uu)*)|(oo(oo)*)n

  • Matches either of the following:
    • m followed by uu(uu)*
    • oo(oo)* followed by n
  • Match examples: muu, muuuu, oon, oooon

In regexes, | comes last.
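To see the precedence rule in action, here is a short sketch comparing the two expressions with `re.fullmatch`. The names `grouped`, `ungrouped`, and `matches` are illustrative only.

```python
import re

grouped   = r"m(uu(uu)*|oo(oo)*)n"     # | is confined inside the parentheses
ungrouped = r"m(uu(uu)*)|(oo(oo)*)n"   # | splits the whole expression in two

def matches(pattern, s):
    """True if the whole string s is described by pattern."""
    return re.fullmatch(pattern, s) is not None

for s in ["muun", "moon", "muu", "oon", "mn"]:
    print(s, matches(grouped, s), matches(ungrouped, s))
# muun True False
# moon True False
# muu False True
# oon False True
# mn False False
```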

26 of 56

Expanded Regular Expression Syntax

27 of 56

Expanded Regex Syntax

These additional operations confer no additional power to regexes.

  • For every regex in this expanded syntax, there is a regex in the basic syntax.
  • Ex. [A-E]+ is just shorthand for (A|B|C|D|E)(A|B|C|D|E)*
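As a spot check (not a proof) of the shorthand claim, the sketch below compares the two forms on every string of length 0 to 3 over the letters A through F. The test alphabet and the length limit are arbitrary choices of mine.

```python
import re
from itertools import product

shorthand = r"[A-E]+"
basic     = r"(A|B|C|D|E)(A|B|C|D|E)*"

alphabet = "ABCDEF"  # F is included so that some strings must be rejected
for length in range(0, 4):
    for letters in product(alphabet, repeat=length):
        s = "".join(letters)
        same = bool(re.fullmatch(shorthand, s)) == bool(re.fullmatch(basic, s))
        assert same, s  # the two patterns must agree on every string

print("Both patterns agree on all test strings.")
```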

operation                    example            matches            does not match
wildcard                     .U.U.U.            CUMULUS, JUGULUM   SUCCUBUS, TUMULTUOUS
character class              [A-Za-z][a-z]*     word, Capitalized  camelCase, 4illegal
at least 1                   m(oo)+n            moon, moooon       mn, mon
between a and b occurrences  m[aeiou]{1,2}m     mem, maam, miem    mm, mooom, meme

28 of 56

More Regular Expression Examples

regex                          matches                                does not match
.*SPB.*                        RASPBERRY, CRISPBREAD                  SUBSPACE, SUBSPECIES
[0-9]{3}-[0-9]{2}-[0-9]{4}     231-41-5121, 573-57-1821               231415121, 57-3571821
[a-z]+@([a-z]+\.)+(edu|com)    horse@pizza.com, horse@pizza.food.com  frank_99@yahoo.com, hug@cs

29 of 56

Expanded Regex Puzzle: https://tinyurl.com/reg913w

Challenge: Give a regular expression for any lowercase string that has a repeated vowel (e.g. noon, peel, festoon, looop, etc.).

30 of 56

Expanded Regex Puzzle Solution

Challenge: Give a regular expression for any lowercase string that has a repeated vowel (e.g. noon, peel, festoon, looop, etc.).

  • [a-z]*(aa|ee|ii|oo|uu)[a-z]*

31 of 56

Expanded Regex Syntax Puzzle: https://tinyurl.com/reg913v

Challenge: Give a regular expression for any string that contains both a lowercase letter and a number.

Click “run tests” to test your regex.

32 of 56

Expanded Regex Syntax Puzzle Solution

Challenge: Give a regular expression for any string that contains both a lowercase letter and a number.

  • (.*[0-9].*[a-z].*)|(.*[a-z].*[0-9].*)
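A quick sketch for trying the solution outside regex101. The test strings and the helper name `matches` are my own; the two halves of the pattern cover "digit before letter" and "letter before digit".

```python
import re

pattern = r"(.*[0-9].*[a-z].*)|(.*[a-z].*[0-9].*)"

def matches(s):
    """True if the whole string s contains a lowercase letter and a digit."""
    return re.fullmatch(pattern, s) is not None

print(matches("a1"))        # True  (letter, then digit)
print(matches("7 apples"))  # True  (digit, then letter)
print(matches("ABC123"))    # False (no lowercase letter)
print(matches("hello"))     # False (no digit)
```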

33 of 56

Limitations of Regular Expressions

Writing regular expressions is like writing a program.

  • Need to know the syntax well.
  • Can be easier to write than to read.
  • Can be difficult to debug.

Regular expressions are sometimes jokingly referred to as a “write only language”.

Regular expressions are terrible at certain types of problems. Examples:

  • Anything involving counting (same number of instances of a and b).
  • Anything involving complex structure (palindromes).
  • Parsing highly complex text structure.

"Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems." - Jamie Zawinski (Source)

34 of 56

Email Address Regular Expression (a probably bad idea)

The regular expression for email addresses (for the Perl programming language):

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? 
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

35 of 56

Even More Regular Expression Syntax

Suppose you want to match one of our special characters like . or [ or ]

  • In these cases, you must “escape” the character using the backslash.
  • You can think of the backslash as meaning “take this next character literally”.

operation                   example   matches                 does not match
built-in character classes  \w+       fawef                   this person
                            \d+       231231                  423 people
character class negation    [^a-z]+   PEPPERS3982, 17211!↑å   porch, CLAmS
escape character            cow\.com  cow.com                 cowscom
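A small sketch of what escaping changes, using the cow.com row from the table. Writing patterns as raw strings (the `r"..."` prefix) keeps Python from interpreting the backslash itself. `re.escape` is a standard library helper, not something covered in the slides, that inserts the backslashes for you.

```python
import re

unescaped = r"cow.com"    # . is a wildcard here: any single character
escaped   = r"cow\.com"   # \. means a literal period

print(bool(re.fullmatch(unescaped, "cowscom")))  # True
print(bool(re.fullmatch(escaped, "cowscom")))    # False
print(bool(re.fullmatch(escaped, "cow.com")))    # True

# Built-in classes: \w+ cannot cross the space, so the full string fails.
print(bool(re.fullmatch(r"\w+", "this person")))  # False

# re.escape builds the escaped pattern from a literal string.
print(re.escape("cow.com"))  # cow\.com
```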

36 of 56

Regular Expressions Puzzle: tinyurl.com/reg913a

Create a regular expression that matches the red portion below.

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

37 of 56

Regular Expressions Puzzle Solution: tinyurl.com/reg913a

Create a regular expression that matches the red portion below.

  • \[.*\] (must include escape backslashes!)

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
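The same solution can be tried in Python on this log line. The sketch below also shows why the backslashes matter: without them, [.*] is a character class that matches a single . or * character. The variable names are assumptions.

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

with_escapes = re.findall(r"\[.*\]", line)
print(with_escapes)  # ['[26/Jan/2014:10:47:58 -0800]']

# Without the backslashes the brackets define a character class,
# so every individual period in the line is returned instead.
without_escapes = re.findall(r"[.*]", line)
print(without_escapes)
```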

38 of 56

Attendance Quiz: yellkey.com/child

Create a regular expression that matches anything inside of angle brackets <>, but none of the string outside of angle brackets. Answer at: yellkey.com/child

  • Example: <div><td valign="top">Moo</td></div>
  • Moo should not match because it is not between < and >.
  • Note: This is equivalent to the problem of matching HTML tags.

39 of 56

Even More Regular Expression Features

A few additional common regex features are listed below.

  • Won’t discuss these in class, but might come up in discussion or hw.
  • There are even more out there!

For the best guide you’ll ever read on regex in Python: https://docs.python.org/2/howto/regex.html

operation             example  matches             does not match
beginning of line     ^ark     ark two, ark o ark  dark
end of line           ark$     dark, ark o ark     ark two
non-greedy qualifier  5.*?5    5005, 55            5005005 (5.*5 would match this!)
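A brief sketch of these three features in Python. `re.search` looks for the pattern anywhere in the string, so the anchors are what pin it to the start or the end; the variable names are illustrative.

```python
import re

# ^ anchors the match to the beginning, $ to the end.
print(re.search(r"^ark", "ark two") is not None)  # True
print(re.search(r"^ark", "dark") is not None)     # False
print(re.search(r"ark$", "dark") is not None)     # True
print(re.search(r"ark$", "ark two") is not None)  # False

# Greedy .* grabs as much as it can; non-greedy .*? stops at the first 5.
text = "5005005"
greedy = re.findall(r"5.*5", text)
non_greedy = re.findall(r"5.*?5", text)
print(greedy)      # ['5005005']
print(non_greedy)  # ['5005']
```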

40 of 56

Regular Expression in Python

(and Regex Groups)

41 of 56

re.findall in Python

In Python, re.findall(pattern, text) will return a list of all matches.

import re

text = "My social security number is 456-76-4295 bro, or actually maybe it's 456-67-4295."
pattern = "[0-9]{3}-[0-9]{2}-[0-9]{4}"
m = re.findall(pattern, text)
print(m)

['456-76-4295', '456-67-4295']

42 of 56

re.sub in Python

In Python, re.sub(pattern, repl, text) will return text with all instances of pattern replaced by repl.

import re

text = '<div><td valign="top">Moo</td></div>'
pattern = "<[^>]+>"
cleaned = re.sub(pattern, '', text)
print(cleaned)

Moo

43 of 56

Regular Expression Groups

Earlier we used parentheses to specify the order of operations.

Parentheses have another meaning:

  • Every set of parentheses specifies a so-called “group”.
  • Regular expression matchers (e.g. re.findall, regex101.com) will return matches organized by groups. In Python, returned as tuples.

import re

s = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"
matches = re.findall(pattern, s)

[('03', '04', '53', 'Horse awakens.'),
 ('03', '05', '14', 'Horse goes back to sleep.')]

44 of 56

Regex Puzzle

Fill in the regex below so that after code executes, day is “26”, month is “Jan”, and year is “2014”.

  • See 07-text.ipynb or https://tinyurl.com/reg913s.

log[0]:

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

pattern = "YOUR REGEX HERE"
matches = re.findall(pattern, log[0])
day, month, year = matches[0]

45 of 56

Regex Puzzle (One Possible Solution)

Fill in the regex below so that after it executes, day is “26”, month is “Jan”, and year is “2014”.

  • Fun question: What happens if you remove the right bracket?

log[0]:

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

pattern = r"\[(.+)\/(.+)\/([^:]+).*\]"
matches = re.findall(pattern, log[0])
day, month, year = matches[0]

46 of 56

Extracting Date Information

With a little more work, we can do something similar and extract day, month, year, hour, minutes, seconds, and time zone all in one regular expression.

  • Derivation is left as an exercise for you guys.
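For checking your own answer, here is one possible pattern, offered as a sketch rather than the official solution. It assumes the timestamp always has the shape day/month/year:hour:minute:second zone inside square brackets; the variable names are illustrative.

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

# One group per field: digits for numbers, \w+ for the month name,
# and everything up to the closing bracket for the time zone.
pattern = r"\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]"

fields = re.findall(pattern, line)[0]
day, month, year, hour, minute, seconds, time_zone = fields

print(fields)
# ('26', 'Jan', '2014', '10', '47', '58', '-0800')
```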

47 of 56

Case Study on Restaurants

(Feat. Pandas String Methods)

48 of 56

Series.str in Pandas

In an earlier lecture, we saw that the .str attribute of the Series class allows us to do handy string manipulations.

List comprehension: Creates a list of string lengths.

  • [len(s) for s in some_series]

.str attribute: Creates a Series of string lengths.

  • some_series.str.len()

As we’ll see, .str has many more capabilities!

49 of 56

Observations

unclean or degraded floors walls or ceilings

moderate risk food holding temperature

inadequate and inaccessible handwashing facilities

unapproved or unmaintained equipment or utensils

inadequately cleaned or sanitized food contact surfaces

wiping cloths not clean or properly stored or inadequate sanitizer

improper food storage

foods not protected from contamination

moderate risk vermin infestation

high risk food holding temperature

unclean nonfood contact surfaces

food safety certificate or food handler card not available

unclean or unsanitary food contact surfaces

inadequate food safety knowledge or lack of certified food safety manager

improper storage of equipment utensils or linens

low risk vermin infestation

permit license or inspection report not posted

improper cooling methods

unclean hands or improper use of gloves

improper or defective plumbing

54 of 56

Features

Cleanliness                   'clean|sanit'
High Risk                     'high risk'
Vermin                        'vermin'
Surfaces                      'wall|ceiling|floor|surface'
Human                         'hand|glove|hair|nail'
Permits and Certification     'permit|certif'
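A sketch of how these patterns might become boolean feature columns with `Series.str.contains`, which treats its argument as a regular expression by default. The column names and the four sample descriptions (taken from the Observations list) are assumptions; the notebook's actual code may differ. Requires pandas.

```python
import pandas as pd

violations = pd.Series([
    "unclean or degraded floors walls or ceilings",
    "moderate risk vermin infestation",
    "high risk food holding temperature",
    "unclean hands or improper use of gloves",
])

# Hypothetical column names mapped to the regex for each feature.
feature_patterns = {
    "is_clean":     r"clean|sanit",
    "is_high_risk": r"high risk",
    "is_vermin":    r"vermin",
    "is_surface":   r"wall|ceiling|floor|surface",
    "is_human":     r"hand|glove|hair|nail",
    "is_permit":    r"permit|certif",
}

# One boolean column per feature: True where the description matches.
features = pd.DataFrame({
    name: violations.str.contains(pattern)
    for name, pattern in feature_patterns.items()
})
print(features)
```

Note that 'clean' also matches inside "unclean", which is what we want here since both describe a cleanliness violation.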

55 of 56

Example of a Discovery

56 of 56

Summary

Today we saw many different string manipulation tools.

  • There are many many more!
  • With just this basic set of tools, you can do most of what you’ll need.

basic python     re           pandas
                 re.findall   df.str.findall
str.replace      re.sub       df.str.replace
str.split        re.split     df.str.split
'ab' in str      re.search    df.str.contains
len(str)                      df.str.len
str[1:4]                      df.str[1:4]