Regular Expressions
Using string methods and regular expressions to work with textual data
Data 100, Summer 2020 @ UC Berkeley
Josh Hug
LECTURE 8
Goals For Today
Goals For Today: Working With Text Data
String Canonicalization
Goal 1: Joining Tables with Mismatched Labels
A Joining Problem
[Figure: two tables to be joined whose county-name labels do not match]
To join our tables we’ll need to canonicalize the county names.
Canonicalizing County Names
Canonicalization:
Can be done slightly better, but not by much.
Tools used:
Replacement | str.replace('&', 'and') |
Deletion | str.replace(' ', '') |
Transformation | str.lower() |
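The three tools above can be chained into one helper. This is only a sketch: the function name canonicalize_county and the sample county names are hypothetical, and removing "county" and "parish" is an assumed extra step that may not suit every dataset.

```python
def canonicalize_county(name):
    """Reduce a county name to a bare lowercase key (hypothetical helper)."""
    return (
        name.lower()                # transformation: ignore capitalization
            .replace(' ', '')       # deletion: drop spaces
            .replace('&', 'and')    # replacement: spell out ampersands
            .replace('.', '')       # deletion: drop periods, as in "St."
            .replace('county', '')  # deletion (assumed step): drop the word "county"
            .replace('parish', '')  # deletion (assumed step): drop the word "parish"
    )

# Two spellings of the same county now produce the same join key.
print(canonicalize_county("Lewis & Clark County"))         # lewisandclark
print(canonicalize_county("lewis and clark"))              # lewisandclark
print(canonicalize_county("St. John the Baptist Parish"))  # stjohnthebaptist
```

Because the substrings are removed anywhere in the name, a name that happens to contain "county" or "parish" inside another word would also be altered, which is one reason this approach is only approximately right.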
Extracting From Text Using Split
Goal 2: Extracting Date Information
Suppose we want to extract times and dates from web server logs that look like the following:
There are existing libraries that do most of the work for us, but let’s try to do it from scratch.
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
One possible solution:
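A minimal sketch of a split-based solution, assuming every log line has the same layout as the example. The variable names are illustrative, not necessarily those used in lecture.

```python
line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

# Keep the text between the first '[' and the ']' that follows it.
timestamp = line.split('[')[1].split(']')[0]   # '26/Jan/2014:10:47:58 -0800'

# Peel off one field at a time, carrying the remainder along in `rest`.
day, month, rest = timestamp.split('/')        # rest = '2014:10:47:58 -0800'
year, hour, minute, rest = rest.split(':')     # rest = '58 -0800'
second, time_zone = rest.split(' ')

print(day, month, year, hour, minute, second, time_zone)
# 26 Jan 2014 10 47 58 -0800
```

Each split relies on the exact position of a delimiter, so a line with a different layout raises an error or silently yields the wrong fields.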
Regular Expression Basics
Extracting Date Information
Earlier we saw that we can hack together code that uses split to extract info:
An alternate approach is to use a so-called “regular expression”:
Regular Expressions
A formal language is a set of strings, typically described implicitly.
A regular language is a formal language that can be described by a regular expression (which we will define soon).
Example: [0-9]{3}-[0-9]{2}-[0-9]{4}
3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.
import re
text = "My social security number is 123-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)  # ['123-45-6789']
The language of SSNs is described by this regular expression.
Regex101.com (or the online tutorial regexone.com)
There are a ton of nice resources out there to experiment with regular expressions (e.g. regex101.com, regexone.com, Sublime Text, Python, etc.).
I recommend trying out regex101.com, which provides a visually appealing and easy to use platform for experimenting with regular expressions.
Regular Expression Syntax
The four basic operations for regular expressions.
operation | order | example | matches | does not match |
concatenation | 3 | AABAAB | AABAAB | every other string |
or | 4 | AA|BAAB | AA, BAAB | every other string |
closure (zero or more) | 2 | AB*A | AA, ABBBBBBA | AB, ABABA |
parentheses | 1 | A(A|B)AAB | AAAAB, ABAAB | every other string |
 | | (AB)*A | A, ABABABABA | AA, ABBA |
Regular Expression Syntax
AB*: A then zero or more copies of B: A, AB, ABB, ABBB, ...
(AB)*: Zero or more copies of AB: AB, ABAB, ABABABAB, ...
Matches the empty string!
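The table entries can be checked in Python. The sketch below uses re.fullmatch, which succeeds only when the entire string matches the pattern; the helper name in_language is hypothetical.

```python
import re

def in_language(pattern, s):
    """Return True if the whole string s matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, s) is not None

print(in_language(r"AB*A", "ABBBBBBA"))      # True: A, six Bs, A
print(in_language(r"AB*A", "ABABA"))         # False: * applies only to the B
print(in_language(r"(AB)*A", "ABABABABA"))   # True: four copies of AB, then A
print(in_language(r"(AB)*A", "ABBA"))        # False
print(in_language(r"AA|BAAB", "BAAB"))       # True: the second alternative
print(in_language(r"(AB)*", ""))             # True: zero copies of AB
```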
Puzzle: Use regex101.com to test! Or tinyurl.com/reg913z
Give a regular expression that matches moon, moooon, etc. Your expression should match any even number of os except zero (i.e. don’t match mn).
Puzzle Solution
Solution to puzzle on previous slide: moo(oo)*n
Regular Expression moo(oo)*n: https://tinyurl.com/reg913m
Give a regex that matches muun, muuuun, moon, moooon, etc. Your expression should match any even number of us or os except zero (i.e. don’t match mn).
Puzzle Solution
Solution to puzzle on previous slide: m(uu(uu)*|oo(oo)*)n
Order of Operations in Regexes
m(uu(uu)*|oo(oo)*)n
Match examples:
muun
muuuun
moon
moooon
m(uu(uu)*)|(oo(oo)*)n
Match examples:
muu
muuuu
oon
oooon
In regexes, | comes last: without the outer parentheses, the second pattern means m(uu(uu)*) or (oo(oo)*)n.
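A short check of the two patterns, again using re.fullmatch so that only whole-string matches count. The variable names grouped and ungrouped are illustrative.

```python
import re

grouped = r"m(uu(uu)*|oo(oo)*)n"      # the | is confined to the parentheses
ungrouped = r"m(uu(uu)*)|(oo(oo)*)n"  # the | splits the entire pattern in two

results = {}
for word in ["muun", "moooon", "muu", "oon", "mn"]:
    results[word] = (
        re.fullmatch(grouped, word) is not None,
        re.fullmatch(ungrouped, word) is not None,
    )
    print(word, results[word])
# muun (True, False)
# moooon (True, False)
# muu (False, True)
# oon (False, True)
# mn (False, False)
```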
More Advanced Regular Expressions Syntax
Expanded Regex Syntax
operation | example | matches | does not match |
any character (except newline) | .U.U.U. | CUMULUS JUGULUM | SUCCUBUS TUMULTUOUS |
character class | [A-Za-z][a-z]* | word Capitalized | camelCase 4illegal |
at least one | jo+hn | john joooooohn | jhn jjohn |
zero or one | joh?n | jon john | any other string |
repeated exactly {a} times | j[aeiou]{3}hn | jaoehn jooohn | jhn jaeiouhn |
repeated from a to b times: {a,b} | j[ou]{1,2}hn | john juohn | jhn jooohn |
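The rows of this table can be verified mechanically. The sketch below stores a few (pattern, string, expected) triples taken from the table and compares them with what re.fullmatch reports; the list name checks is illustrative.

```python
import re

# (pattern, candidate string, does the table say it matches?)
checks = [
    (r".U.U.U.", "CUMULUS", True),
    (r".U.U.U.", "SUCCUBUS", False),
    (r"[A-Za-z][a-z]*", "Capitalized", True),
    (r"[A-Za-z][a-z]*", "camelCase", False),
    (r"jo+hn", "joooooohn", True),
    (r"jo+hn", "jhn", False),
    (r"joh?n", "jon", True),
    (r"j[aeiou]{3}hn", "jaoehn", True),
    (r"j[aeiou]{3}hn", "jaeiouhn", False),
    (r"j[ou]{1,2}hn", "juohn", True),
    (r"j[ou]{1,2}hn", "jooohn", False),
]

# Each entry is True when Python agrees with the table.
agreement = [
    (re.fullmatch(pattern, s) is not None) == expected
    for pattern, s, expected in checks
]
print(all(agreement))   # True
```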
More Regular Expression Examples
regex | matches | does not match |
.*SPB.* | RASPBERRY CRISPBREAD | SUBSPACE SUBSPECIES |
[0-9]{3}-[0-9]{2}-[0-9]{4} | 231-41-5121 573-57-1821 | 231415121 57-3571821 |
[a-z]+@([a-z]+\.)+(edu|com) | horse@pizza.com horse@pizza.food.com | frank_99@yahoo.com hug@cs |
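As an illustration, the email row can be tested directly. Note that this simple pattern rejects frank_99@yahoo.com only because the underscore and digits fall outside [a-z]; it is not a realistic email validator.

```python
import re

email = r"[a-z]+@([a-z]+\.)+(edu|com)"

candidates = ["horse@pizza.com", "horse@pizza.food.com",
              "frank_99@yahoo.com", "hug@cs"]
matched = {s: re.fullmatch(email, s) is not None for s in candidates}

for s, ok in matched.items():
    print(s, ok)
# horse@pizza.com True
# horse@pizza.food.com True
# frank_99@yahoo.com False
# hug@cs False
```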
Expanded Regex Puzzle: https://tinyurl.com/reg913w
Challenge: Give a regular expression for any lowercase string that has a repeated vowel (e.g. noon, peel, festoon, looop, etc.).
Expanded Regex Puzzle Solution
Challenge: Give a regular expression for any lowercase string that has a repeated vowel (e.g. noon, peel, festoon, looop, etc.): [a-z]*(aa|ee|ii|oo|uu)[a-z]*
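A quick check of this solution in Python. The words pal and Noon are extra test cases chosen here for illustration: the first has no doubled vowel and the second is not all lowercase.

```python
import re

pattern = r"[a-z]*(aa|ee|ii|oo|uu)[a-z]*"

words = ["noon", "peel", "festoon", "looop", "pal", "Noon"]
matched = {w: re.fullmatch(pattern, w) is not None for w in words}
print(matched)
# {'noon': True, 'peel': True, 'festoon': True, 'looop': True, 'pal': False, 'Noon': False}
```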
Expanded Regex Syntax Puzzle: https://tinyurl.com/reg913v
Challenge: Give a regular expression for any string that contains both a lowercase letter and a number.
Click “run tests” to test your regex.
Expanded Regex Syntax Solution
Challenge: Give a regular expression for any string that contains both a lowercase letter and a number: (.*[0-9].*[a-z].*)|(.*[a-z].*[0-9].*)
Click “run tests” to test your regex.
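The solution lists both possible orders (digit first or letter first) because the basic syntax has no way to say "both, in any order". A small check, with test strings chosen here for illustration:

```python
import re

pattern = r"(.*[0-9].*[a-z].*)|(.*[a-z].*[0-9].*)"

strings = ["a1", "1a", "ABC123x", "abc", "123", "ABC1"]
matched = {s: re.fullmatch(pattern, s) is not None for s in strings}
print(matched)
# {'a1': True, '1a': True, 'ABC123x': True, 'abc': False, '123': False, 'ABC1': False}
```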
Limitations of Regular Expressions
Writing regular expressions is like writing a program.
Regular expressions are sometimes jokingly referred to as a “write only language”.
Regular expressions are terrible at certain types of problems. Examples:
"Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems." - Jamie Zawinski (Source)
Email Address Regular Expression (a probably bad idea)
The regular expression for email addresses (for the Perl programming language):
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ 
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? 
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)
Even More Regular Expression Syntax
Suppose you want to match one of our special characters like . or [ or ]. Put a backslash (the escape character) in front of it: \. or \[ or \]
operation | example | matches | does not match |
built-in character classes | \w+ \d+ | fawef 231231 | this person 423 people |
character class negation | [^a-z]+ | PEPPERS3982 17211!↑å | porch CLAmS |
escape character | cow\.com | cow.com | cowscom |
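The rows above can be tried out as follows. This is a sketch using re.findall and re.fullmatch; the variable names are illustrative.

```python
import re

# \w matches letters, digits, and underscore; \d matches digits.
words = re.findall(r"\w+", "this person")    # the space splits it into two matches
digits = re.findall(r"\d+", "423 people")
print(words)    # ['this', 'person']
print(digits)   # ['423']

# [^a-z] matches any single character that is NOT a lowercase letter.
no_lower_ok = re.fullmatch(r"[^a-z]+", "PEPPERS3982") is not None
no_lower_bad = re.fullmatch(r"[^a-z]+", "CLAmS") is not None
print(no_lower_ok, no_lower_bad)             # True False

# An escaped dot matches only a literal period; a bare dot matches any character.
escaped = re.fullmatch(r"cow\.com", "cowscom") is not None
unescaped = re.fullmatch(r"cow.com", "cowscom") is not None
print(escaped, unescaped)                    # False True
```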
Regular Expressions Puzzle: tinyurl.com/reg913a
Create a regular expression that matches the bracketed timestamp (shown in red on the original slide) in the log line below.
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
Regular Expressions Puzzle Solution: tinyurl.com/reg913a
Create a regular expression that matches the bracketed timestamp (shown in red on the original slide) in the log line below: \[.*\]
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
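Running \[.*\] on this log line works because the line contains only one pair of square brackets. The sketch below also shows, on a made-up string, how the greedy .* behaves when there are several bracketed pieces, and one possible alternative that uses a negated character class.

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

bracketed = re.findall(r"\[.*\]", line)
print(bracketed)   # ['[26/Jan/2014:10:47:58 -0800]']

# Made-up example with two bracketed pieces.
two = "[a] and [b]"
greedy = re.findall(r"\[.*\]", two)         # .* runs to the LAST ']'
bounded = re.findall(r"\[[^\]]*\]", two)    # [^\]]* stops at the first ']'
print(greedy)      # ['[a] and [b]']
print(bounded)     # ['[a]', '[b]']
```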
Quiz
Create a regular expression that matches anything inside of angle brackets <>, but none of the string outside of angle brackets.
Regex101 link: https://tinyurl.com/reg913
Even More Regular Expression Features
A few additional common regex features are listed below.
The official guide is good! https://docs.python.org/3/howto/regex.html
operation | example | matches | does not match |
beginning of line | ^ark | ark two ark o ark | dark |
end of line | ark$ | dark ark o ark | ark two |
lazy version of zero or more *? | 5.*?5 | 5005 55 | 5005005 |
5.*5 would match 5005005!
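A sketch of these three features with re.findall, which searches anywhere in the string, so the anchors are what pin a match to the start or end.

```python
import re

start_hit = re.findall(r"^ark", "ark two")    # 'ark' is at the beginning
start_miss = re.findall(r"^ark", "dark")      # 'ark' is present, but not at the start
end_hit = re.findall(r"ark$", "dark")         # 'ark' is at the end
end_miss = re.findall(r"ark$", "ark two")
print(start_hit, start_miss, end_hit, end_miss)
# ['ark'] [] ['ark'] []

greedy = re.findall(r"5.*5", "5005005")       # as much as possible
lazy = re.findall(r"5.*?5", "5005005")        # as little as possible
print(greedy)   # ['5005005']
print(lazy)     # ['5005']
```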
Regular Expressions in Python
(and Regex Groups)
re.findall in Python
In Python, re.findall(pattern, text) will return a list of all matches.
import re
text = "My social security number is 456-76-4295 bro, or actually maybe it’s 456-67-4295."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
m = re.findall(pattern, text)
print(m)
['456-76-4295', '456-67-4295']
re.sub in Python
In Python, re.sub(pattern, repl, text) will return text with all instances of pattern replaced by repl.
text = '<div><td valign="top">Moo</td></div>'
pattern = r"<[^>]+>"
cleaned = re.sub(pattern, '', text)
print(cleaned)
Moo
Raw Strings in Python
Note: When specifying a pattern, we strongly suggest using “raw strings”.
For more information see “The Backslash Plague” under https://docs.python.org/3/howto/regex.html.
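A small illustration of why raw strings help. It uses \b, the regex token for a word boundary, which is not covered elsewhere in these slides and is chosen here only because Python also treats "\b" in an ordinary string as a backspace character.

```python
import re

# In an ordinary string "\b" is one character (backspace);
# in a raw string it stays as two characters: backslash and b.
plain_len, raw_len = len("\b"), len(r"\b")
print(plain_len, raw_len)   # 1 2

text = "a cat scattered"
raw_hits = re.findall(r"\bcat\b", text)       # regex sees \b: word boundary
plain_hits = re.findall("\bcat\b", text)      # regex sees backspace characters
escaped_hits = re.findall("\\bcat\\b", text)  # works, but is harder to read
print(raw_hits, plain_hits, escaped_hits)
# ['cat'] [] ['cat']
```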
Regular Expression Groups
Earlier we used parentheses to specify the order of operations.
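Parentheses have another meaning: each parenthesized part of the pattern is a group, and re.findall returns the text captured by each group.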
s = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"
matches = re.findall(pattern, s)
[('03', '04', '53', 'Horse awakens.'),
('03', '05', '14', 'Horse goes back to sleep.')]
Regex Puzzle
Fill in the regex below so that after the code executes, day is “26”, month is “Jan”, and year is “2014”.
log[0]:
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
pattern = r"YOUR REGEX HERE"
matches = re.findall(pattern, log[0])
day, month, year = matches[0]
Regex Puzzle (One Possible Solution)
Fill in the regex below so that after the code executes, day is “26”, month is “Jan”, and year is “2014”.
log[0]:
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
pattern = r"\[(\d{2})/(\w{3})/(\d{4})"
matches = re.findall(pattern, log[0])
day, month, year = matches[0]
Extracting Date Information
With a little more work, we can do something similar and extract day, month, year, hour, minutes, seconds, and time zone all in one regular expression.
You will also see code that uses re.search instead of re.findall.
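One way this could look, as a sketch rather than the lecture's own solution: the pattern below is an assumed extension of the day/month/year regex, and it uses re.search, which returns a match object for the first match (or None) instead of a list.

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

# One group per field: day/month/year:hour:minute:second, then the time zone.
pattern = r"\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-]\d{4})\]"

match = re.search(pattern, line)
if match is not None:
    # .groups() returns the captured text of every group as a tuple.
    day, month, year, hour, minute, second, time_zone = match.groups()
    print(day, month, year, hour, minute, second, time_zone)
# 26 Jan 2014 10 47 58 -0800
```

Checking for None before unpacking matters with re.search, since a line that does not fit the pattern produces no match object at all.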
Case Studies on Police Data and Restaurant Data
See lec08-working-with-text.ipynb
Summary
Today we saw many different string manipulation tools.
basic python | re | pandas |
| re.findall | df.str.findall |
str.replace | re.sub | df.str.replace |
str.split | re.split | df.str.split |
'ab' in str | re.search | df.str.contains |
len(str) | | df.str.len |
str[1:4] | | df.str[1:4] |
Even More Regex Syntax (Bonus)
Optional (but Handy) Regex Concepts
These regex features aren’t going to be on an exam, but they are useful:
(19|20)\d\d # year (group 1)
[- /.] # separator
(0[1-9]|1[012]) # month (group 2)
[- /.] # separator
(0[1-9]|[12][0-9]|3[01]) # day (group 3)
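A commented pattern like the one above can be used as written by passing the re.VERBOSE flag, which tells Python to ignore whitespace and # comments inside the pattern (except inside a character class, so the space in [- /.] is kept). The sample sentence is made up for illustration.

```python
import re

date_pattern = re.compile(r"""
    (19|20)\d\d                 # year (group 1 captures only the 19 or 20)
    [- /.]                      # separator
    (0[1-9]|1[012])             # month (group 2)
    [- /.]                      # separator
    (0[1-9]|[12][0-9]|3[01])    # day (group 3)
""", re.VERBOSE)

match = date_pattern.search("Lecture recorded on 2020-07-01.")
print(match.group(0))   # 2020-07-01
print(match.groups())   # ('20', '07', '01')
```

Because the first pair of parentheses wraps only 19|20, group 1 holds the century digits rather than the full year; match.group(0) gives the whole matched date.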