Announcements
DS100: Fall 2018
Lecture 7 (Josh Hug): Working With Text
Goals For Today
Goals For Today: Working With Text Data
Cleaning Text with Python String Methods
A Joining Problem
join
???
Resolving Our Problem through Python String Functions
See 07-text.ipynb
Resolving Our Problem through Python String Functions
Our goal: Canonicalization
Tools used so far:
Slicing | str[:-7] |
Replacement | str.replace('&', 'and') |
Deletion | str.replace(' ', '') |
Transformation | str.lower() |
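A minimal sketch of how the four tools above might be chained into a single canonicalization step. The function name `canonicalize` and the two sample names are hypothetical, and the slice assumes every name ends in the 7-character suffix " County"; the notebook's own solution may differ.

```python
def canonicalize(name):
    """Reduce a name to a canonical form using only string methods."""
    name = name[:-7]                  # slicing: drop the assumed trailing " County"
    name = name.replace('&', 'and')   # replacement
    name = name.replace(' ', '')      # deletion
    return name.lower()               # transformation

# Two hypothetical spellings of the same county.
a = canonicalize("Lewis & Clark County")
b = canonicalize("lewis and clark County")
print(a, b, a == b)
```

Once both tables store the canonical form, an ordinary equality join on that column can line up rows whose original spellings differed.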
How We Used Python String Functions
Challenge: Create a Sequence of Steps That Works for Both
???
See 07-text.ipynb for solution.
Extracting From Text Using Split
Extracting Date Information
Suppose we want to extract times and dates from webserver logs that look like the following:
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
There are existing libraries that do most of the work for us, but let’s try to do it from scratch.
Extracting Date Information
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
One possible solution:
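The slide's code is not reproduced in this text. The following is a sketch of one way to do it with nothing but `str.split`, assuming every log line has exactly the bracketed timestamp layout shown above; the variable names are illustrative.

```python
line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

# Isolate the text between the square brackets.
timestamp = line.split('[')[1].split(']')[0]

# Peel off one field at a time, using the separator that follows it.
day, month, rest = timestamp.split('/')
year, hour, minute, rest = rest.split(':')
second, timezone = rest.split(' ')

print(day, month, year, hour, minute, second, timezone)
```

This works, but each step silently depends on the previous one, so a line with a slightly different layout raises an unpacking error rather than being skipped.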
Regular Expression Basics
Extracting Date Information
Earlier we saw that we can hack together code that uses split to extract info:
An alternate approach is to use a so-called “regular expression”:
String Matching Example (Reference)
Identifying C2H2 Zinc Fingers in an amino acid sequence:
How do you tell a Zinc Finger? If you see a subsequence with:
Amino acid sequence
Zinc Finger subsequence
Introducing the Regular Expression
Regular expressions: A notation for specifying a set of strings.
Example: C.{2,4}C.{3}[LIVMFYWCX].{8}H.{3,5}H
All major programming languages support regular expressions. Python code:
import re
seq = "GPCGGWCAASCGGPYACGGWAGYHAGWHWAH"
pattern = "C.{2,4}C.{3}[LIVMFYWCX].{8}H.{3,5}H"
re.findall(pattern, seq) # returns ['CAASCGGPYACGGWAGYHAGWH']
Introducing the Regular Expression
Regular expressions: A notation for specifying a set of strings.
Example 2: [0-9]{3}-[0-9]{2}-[0-9]{4}
text = "My social security number is 456-76-4295 bro."
pattern = "[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text) # returns ['456-76-4295']
Regular Expression Syntax
The four basic operations for regular expressions.
operation | order | example | matches | does not match |
concatenation | 3 | AABAAB | AABAAB | every other string |
or | 4 | AA|BAAB | AA BAAB | every other string |
closure (zero or more) | 2 | AB*A | AA ABBBBBBA | AB ABABA |
parenthesis | 1 | A(A|B)AAB | AAAAB ABAAB | every other string |
parenthesis | 1 | (AB)*A | A ABABABABA | AA ABBA |
Regular Expression Syntax
AB*: A then zero or more copies of B: A, AB, ABB, ABBB, ...
(AB)*: Zero or more copies of AB: ABABABAB, ABAB, AB, and the empty string
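The table rows can be checked directly in Python. This sketch uses `re.fullmatch`, which succeeds only when the entire string is matched, so it mirrors the "matches / does not match" columns; the helper name `matches` is illustrative.

```python
import re

def matches(pattern, s):
    """True if the whole string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None

# Closure binds tighter than concatenation: AB*A is A, then B*, then A.
closure = [matches("AB*A", s) for s in ["AA", "ABBBBBBA", "AB", "ABABA"]]

# Parentheses change what the star applies to.
grouped = [matches("(AB)*A", s) for s in ["A", "ABABABABA", "AA", "ABBA"]]

# | binds loosest: AA|BAAB means "AA" or "BAAB".
either = [matches("AA|BAAB", s) for s in ["AA", "BAAB", "AABAAB"]]

print(closure, grouped, either)
```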
Regex101.com (or the online tutorial regexone.com)
There are a ton of nice resources out there to experiment with regular expressions (e.g. regex101.com, regexone.com, Sublime Text, Python, etc).
I recommend trying out regex101.com, which provides a visually appealing and easy to use platform for experimenting with regular expressions.
Puzzle: Use regex101.com to test! Or tinyurl.com/reg913z
Give a regular expression that matches moon, moooon, etc. Your expression should match any even number of os except zero (i.e. don’t match mn).
Puzzle Solution
Solution to puzzle on previous slide: moo(oo)*n
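A quick way to convince yourself the solution is right is to run it against a few candidates. This sketch uses `re.fullmatch` so that a candidate counts only if the whole word matches; the candidate list is made up for illustration.

```python
import re

pattern = "moo(oo)*n"
candidates = ["mn", "mon", "moon", "mooon", "moooon"]

# Keep only the words the pattern matches in full.
accepted = [w for w in candidates if re.fullmatch(pattern, w)]
print(accepted)
```

Only the words with a positive, even number of os survive.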
Regular Expression Puzzle: https://tinyurl.com/reg913x
Give a regex that matches muun, muuuun, moon, moooon, etc. Your expression should match any even number of us or os except zero (i.e. don’t match mn).
Puzzle Solution
Solution to puzzle on previous slide: m(uu(uu)*|oo(oo)*)n
Order of Operations in Regexes
In regexes | comes last.
m(uu(uu)*|oo(oo)*)n
Match examples: muun, muuuun, moon, moooon
m(uu(uu)*)|(oo(oo)*)n
Match examples: muu, muuuu, oon, oooon
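The difference between the two patterns is easy to see by running both against the same words. A small sketch using `re.fullmatch`; the variable names and word list are illustrative.

```python
import re

grouped = "m(uu(uu)*|oo(oo)*)n"      # the | is confined to the parentheses
ungrouped = "m(uu(uu)*)|(oo(oo)*)n"  # the | splits the whole pattern in two

words = ["muun", "moon", "muu", "oon"]
grouped_hits = [w for w in words if re.fullmatch(grouped, w)]
ungrouped_hits = [w for w in words if re.fullmatch(ungrouped, w)]

print(grouped_hits)
print(ungrouped_hits)
```

Because | has the lowest precedence, the second pattern reads as "m followed by us" or "os followed by n", which is why it accepts muu and oon but not muun or moon.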
Expanded Regular Expression Syntax
Expanded Regex Syntax
These additional operations confer no additional power to regexes.
operation | example | matches | does not match |
wildcard | .U.U.U. | CUMULUS JUGULUM | SUCCUBUS TUMULTUOUS |
character class | [A-Za-z][a-z]* | word Capitalized | camelCase 4illegal |
at least 1 | m(oo)+n | moon moooon | mn mon |
between a and b occurrences | m[aeiou]{1,2}m | mem maam miem | mm mooom meme |
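The rows of this table can be verified mechanically. The sketch below encodes each row as a pattern with its "matches" and "does not match" strings and checks them with `re.fullmatch`; the dictionary layout and helper name are just one convenient arrangement.

```python
import re

# pattern -> (strings that should match, strings that should not)
table = {
    r".U.U.U.": (["CUMULUS", "JUGULUM"], ["SUCCUBUS", "TUMULTUOUS"]),
    r"[A-Za-z][a-z]*": (["word", "Capitalized"], ["camelCase", "4illegal"]),
    r"m(oo)+n": (["moon", "moooon"], ["mn", "mon"]),
    r"m[aeiou]{1,2}m": (["mem", "maam", "miem"], ["mm", "mooom", "meme"]),
}

def whole_match(pattern, s):
    """True if pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

results = {
    pattern: (
        [whole_match(pattern, s) for s in should_match],
        [whole_match(pattern, s) for s in should_not],
    )
    for pattern, (should_match, should_not) in table.items()
}

for pattern, (yes, no) in results.items():
    print(pattern, yes, no)
```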
More Regular Expression Examples
regex | matches | does not match |
.*SPB.* | RASPBERRY CRISPBREAD | SUBSPACE SUBSPECIES |
[0-9]{3}-[0-9]{2}-[0-9]{4} | 231-41-5121 573-57-1821 | 231415121 57-3571821 |
[a-z]+@([a-z]+\.)+(edu|com) | horse@pizza.com horse@pizza.food.com | frank_99@yahoo.com hug@cs |
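As a check on the last row, the sketch below runs the email pattern over the four addresses from the table. It is written as a raw string so that `\.` reaches the regex engine unchanged; the variable names are illustrative.

```python
import re

email = r"[a-z]+@([a-z]+\.)+(edu|com)"
addresses = ["horse@pizza.com", "horse@pizza.food.com",
             "frank_99@yahoo.com", "hug@cs"]

# Keep the addresses the pattern matches in full.
valid = [a for a in addresses if re.fullmatch(email, a)]
print(valid)
```

frank_99@yahoo.com is rejected because `[a-z]+` allows neither the underscore nor the digits, and hug@cs is rejected because the pattern requires at least one dot followed by edu or com.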
Expanded Regex Puzzle: https://tinyurl.com/reg913w
Challenge: Give a regular expression for any lowercase string that has a repeated vowel (e.g. noon, peel, festoon, looop, etc).
Expanded Regex Puzzle Solution
Challenge: Give a regular expression for any lowercase string that has a repeated vowel (e.g. noon, peel, festoon, looop, etc).
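The slide's own answer is not reproduced in this text. One possible answer, assuming "repeated vowel" means the same vowel twice in a row, is sketched below; the test words beyond those in the challenge are made up.

```python
import re

# Any lowercase letters, then a doubled vowel, then any lowercase letters.
pattern = r"[a-z]*(aa|ee|ii|oo|uu)[a-z]*"

words = ["noon", "peel", "festoon", "looop", "cat", "Noon", "moan"]
hits = [w for w in words if re.fullmatch(pattern, w)]
print(hits)
```

"Noon" is rejected because of the uppercase letter, and "moan" because its two vowels are different.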
Expanded Regex Syntax Puzzle: https://tinyurl.com/reg913v
Challenge: Give a regular expression for any string that contains both a lowercase letter and a number.
Click “run tests” to test your regex.
Expanded Regex Syntax Puzzle Solution
Challenge: Give a regular expression for any string that contains both a lowercase letter and a number.
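The slide's own answer is not reproduced in this text. One possible answer is sketched below: since the letter and the digit can come in either order, the pattern spells out both orders with |. It assumes single-line input, because `.` does not match a newline by default.

```python
import re

# Either a lowercase letter somewhere before a digit, or a digit before a letter.
pattern = r".*([a-z].*[0-9]|[0-9].*[a-z]).*"

samples = ["a1", "1a", "ABC9x", "abc", "123", "A1"]
hits = [s for s in samples if re.fullmatch(pattern, s)]
print(hits)
```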
Limitations of Regular Expressions
Writing regular expressions is like writing a program.
Regular expressions are sometimes jokingly referred to as a “write-only language”.
Regular expressions are terrible at certain types of problems. Examples:
"Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems." - Jamie Zawinski (Source)
Email Address Regular Expression (a probably bad idea)
The regular expression for email addresses (for the Perl programming language):
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? 
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)
Even More Regular Expression Syntax
Suppose you want to match one of our special characters like . or [ or ]
operation | example | matches | does not match |
built-in character classes | \w+ \d+ | fawef 231231 | this person 423 people |
character class negation | [^a-z]+ | PEPPERS3982 17211!↑å | porch CLAmS |
escape character | cow\.com | cow.com | cowscom |
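A short sketch of the three features in this table, using the table's own strings. Patterns containing a backslash are written as raw strings (`r"..."`) so Python passes the backslash through to the regex engine.

```python
import re

# Built-in classes: \w is a word character, \d is a digit.
words = re.findall(r"\w+", "this person")
numbers = re.findall(r"\d+", "423 people")

# Negated class: runs of anything that is NOT a lowercase letter.
not_lower = re.findall(r"[^a-z]+", "CLAmS")

# An unescaped . is a wildcard; \. matches only a literal period.
unescaped = bool(re.fullmatch(r"cow.com", "cowscom"))
escaped = bool(re.fullmatch(r"cow\.com", "cowscom"))

print(words, numbers, not_lower, unescaped, escaped)
```

This is why "this person" and "CLAmS" sit in the "does not match" column: each pattern matches pieces of the string but not the string as a whole.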
Regular Expressions Puzzle: tinyurl.com/reg913a
Create a regular expression that matches the red portion below.
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
Regular Expressions Puzzle Solution: tinyurl.com/reg913a
Create a regular expression that matches the red portion below.
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
Attendance Quiz: yellkey.com/child
Create a regular expression that matches anything inside of angle brackets <>, but none of the string outside of angle brackets. Answer at: yellkey.com/child
Regex101 link: https://tinyurl.com/reg913
Even More Regular Expression Features
A few additional common regex features are listed below.
For the best guide you’ll ever read on regex in Python: https://docs.python.org/2/howto/regex.html
operation | example | matches | does not match |
beginning of line | ^ark | ark two ark o ark | dark |
end of line | ark$ | dark ark o ark | ark two |
non-greedy qualifier | 5.*?5 | 5005 55 | 5005005 |
The greedy 5.*5 would match 5005005 in full.
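The anchors and the non-greedy qualifier can be tried out as follows. This sketch uses `re.search` for the anchors, since they constrain where a match may sit inside the string, and `re.findall` to compare greedy and non-greedy matching. With single-line strings like these, "line" and "string" coincide.

```python
import re

# ^ pins the match to the beginning, $ to the end.
starts = [bool(re.search(r"^ark", s)) for s in ["ark two", "ark o ark", "dark"]]
ends = [bool(re.search(r"ark$", s)) for s in ["dark", "ark o ark", "ark two"]]

# .* grabs as much as it can; .*? grabs as little as it can.
greedy = re.findall(r"5.*5", "5005005")
lazy = re.findall(r"5.*?5", "5005005")

print(starts, ends, greedy, lazy)
```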
Regular Expression in Python
(and Regex Groups)
re.findall in Python
In Python, re.findall(pattern, text) will return a list of all matches.
import re
text = "My social security number is 456-76-4295 bro, or actually maybe it’s 456-67-4295."
pattern = "[0-9]{3}-[0-9]{2}-[0-9]{4}"
m = re.findall(pattern, text)
print(m)
['456-76-4295', '456-67-4295']
re.sub in Python
In Python, re.sub(pattern, repl, text) will return text with all instances of pattern replaced by repl.
text = '<div><td valign="top">Moo</td></div>'
pattern = "<[^>]+>"
cleaned = re.sub(pattern, '', text)
print(cleaned)
Moo
Regular Expression Groups
Earlier we used parentheses to specify the order of operations.
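Parentheses have another meaning: each parenthesized part of the pattern is a group, and re.findall returns the text matched by each group.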
s = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"
matches = re.findall(pattern, s)
[('03', '04', '53', 'Horse awakens.'),
('03', '05', '14', 'Horse goes back to sleep.')]
Regex Puzzle
Fill in the regex below so that after the code executes, day is “26”, month is “Jan”, and year is “2014”.
log[0]:
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
pattern = "YOUR REGEX HERE"
matches = re.findall(pattern, log[0])
day, month, year = matches[0]
Regex Puzzle (One Possible Solution)
Fill in the regex below so that after it executes, day is “26”, month is “Jan”, and year is “2014”.
log[0]:
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
pattern = r"\[(.+)/(.+)/([^:]+).*\]"
matches = re.findall(pattern, log[0])
day, month, year = matches[0]
Extracting Date Information
With a little more work, we can do something similar and extract day, month, year, hour, minutes, seconds, and time zone all in one regular expression.
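A sketch of what such a pattern might look like, with one group per field. This is one possible pattern, not necessarily the one used in the notebook, and it assumes the timestamp layout of the sample line.

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

# One group per field: day/month/year:hour:minute:second timezone
pattern = r"\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]"

# With several groups, findall returns one tuple of group texts per match.
day, month, year, hour, minute, second, timezone = re.findall(pattern, line)[0]
print(day, month, year, hour, minute, second, timezone)
```

Using `\d+` and `\w+` instead of `.+` makes each group say what kind of text it expects, so the pattern is less likely to match something unintended.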
Case Study on Restaurants
(Feat. Pandas String Methods)
Series.str in Pandas
In an earlier lecture, we saw that the .str attribute of the Series class allows us to do handy string manipulations.
List comprehension: Creates a list of string lengths.
.str attribute: Creates a Series of string lengths.
As we’ll see, .str has many more capabilities!
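The code for the two approaches is not reproduced in this text. A minimal sketch of the comparison, assuming pandas is installed and using a small made-up Series of violation descriptions named `descriptions`:

```python
import pandas as pd

# A hypothetical Series of strings; the lecture's real data lives in the notebook.
descriptions = pd.Series([
    "improper food storage",
    "improper cooling methods",
    "low risk vermin infestation",
])

# List comprehension: a plain Python list of string lengths.
lengths_list = [len(s) for s in descriptions]

# .str accessor: a Series of string lengths, aligned with the original index.
lengths_series = descriptions.str.len()

print(lengths_list)
print(lengths_series)
```

The numbers are the same either way; the `.str` version keeps the result as a Series, so it can be assigned straight back into a DataFrame column.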
Observations
unclean or degraded floors walls or ceilings
moderate risk food holding temperature
inadequate and inaccessible handwashing facilities
unapproved or unmaintained equipment or utensils
inadequately cleaned or sanitized food contact surfaces
wiping cloths not clean or properly stored or inadequate sanitizer
improper food storage
foods not protected from contamination
moderate risk vermin infestation
high risk food holding temperature
unclean nonfood contact surfaces
food safety certificate or food handler card not available
unclean or unsanitary food contact surfaces
inadequate food safety knowledge or lack of certified food safety manager
improper storage of equipment utensils or linens
low risk vermin infestation
permit license or inspection report not posted
improper cooling methods
unclean hands or improper use of gloves
improper or defective plumbing
Features
Cleanliness | 'clean|sanit' |
High Risk | 'high risk' |
Vermin | 'vermin' |
Surfaces | 'wall|ceiling|floor|surface' |
Human | 'hand|glove|hair|nail' |
Permits and Certification | 'permit|certif' |
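A sketch of how these patterns might be turned into boolean feature columns with `Series.str.contains`, which treats its argument as a regular expression by default. It assumes pandas is installed. The five descriptions are taken from the observations list above, while the names `violations`, `features`, and `flags` are illustrative and not from the notebook.

```python
import pandas as pd

violations = pd.Series([
    "unclean or degraded floors walls or ceilings",
    "high risk food holding temperature",
    "low risk vermin infestation",
    "unclean hands or improper use of gloves",
    "permit license or inspection report not posted",
])

features = {
    "Cleanliness": "clean|sanit",
    "High Risk": "high risk",
    "Vermin": "vermin",
    "Surfaces": "wall|ceiling|floor|surface",
    "Human": "hand|glove|hair|nail",
    "Permits and Certification": "permit|certif",
}

# One boolean column per feature: does the description contain the pattern?
flags = pd.DataFrame({
    name: violations.str.contains(pattern)
    for name, pattern in features.items()
})
print(flags)
```

Short stems such as 'sanit' and 'certif' are deliberate: they catch "sanitized", "sanitizer", and "unsanitary", or "certificate" and "certified", with a single pattern.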
Example of a Discovery
Summary
Today we saw many different string manipulation tools.
basic python | re | pandas |
| re.findall | df.str.findall |
str.replace | re.sub | df.str.replace |
str.split | re.split | df.str.split |
'ab' in str | re.search | df.str.contains |
len(str) | | df.str.len |
str[1:4] | | df.str[1:4] |