Regular expressions
www.jtleek.com/advdatasci16
Announcements
Working with text
Find the BCM centers
library(readxl)
kg = read_excel("1000genomes.xlsx", sheet = 4, skip = 1)
table(kg$Center)
grep("BCM", kg$Center)
grepl("BCM", kg$Center)
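The two base R calls differ only in what they return. A minimal sketch, assuming a small made-up vector `centers` as a stand-in for the real Center column (the values are hypothetical, not taken from the spreadsheet):

```r
# Hypothetical stand-in for the Center column; values are made up for illustration.
centers = c("BCM", "WUGSC", "BCM,BGI", "BI", NA)

# grep() returns the positions of the elements that match
idx = grep("BCM", centers)

# grepl() returns one TRUE/FALSE per element (FALSE for NA),
# which makes it convenient for subsetting
hit = grepl("BCM", centers)

# value = TRUE returns the matching strings instead of their positions
vals = grep("BCM", centers, value = TRUE)

print(idx)
print(hit)
print(vals)
```

Use grepl() when you need a logical mask of the same length as the input, and grep() when you want indices or the matching values.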
The same with stringr
library(stringr)
str_detect(kg$Center, "BCM")[1:40]
str_subset(kg$Center, "BCM")
vignette("stringr")
Literals: nuclear
But text is more complicated
We need a way to express
- whitespace/word boundaries
- sets of literals
- the beginning and end of a line
- alternatives (“war” or “peace”)
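Word boundaries are on the list above, so here is a short sketch of them using base R. The example strings are made up, and `perl = TRUE` is passed only to make the `\b` syntax explicit:

```r
# Made-up example strings
x = c("war and peace", "the warden left", "peaceful protest", "software")

# A bare literal matches anywhere, including inside "warden" and "software"
anywhere = grepl("war", x)

# \\b marks a word boundary, so "war" must stand alone as a word
whole_word = grepl("\\bwar\\b", x, perl = TRUE)

# Alternatives are grouped with parentheses and separated by |
either = grepl("\\b(war|peace)\\b", x, perl = TRUE)

print(anywhere)
print(whole_word)
print(either)
```

Only the first string contains "war" or "peace" as a whole word; "peaceful" does not count because "peace" is followed by another letter.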
Beginning of line with ^
x = c("i think we all rule for participating",
      "i think i have been outed",
      "i think this will be quite fun actually",
      "it will be fun, i think")

str_detect(x, "^i think")
End of line with $
x = c("well they had something this morning",
      "then had to catch a tram home in the morning",
      "dog obedience school in the morning",
      "this morning I'll go for a run")

str_detect(x, "morning$")
Character list with []
x = c('Name the worst thing about Bush!',
      'I saw a green bush',
      'BBQ and bushwalking at Molonglo Gorge',
      'BUSH!!')

str_detect(x, "[Bb][Uu][Ss][Hh]")
Sets of letters and numbers
x = c('7th inning stretch',
      '2nd half soon to begin. OSU did just win.',
      '3am - cant sleep - too hot still.. :(',
      '5ft 7 sent from heaven')

str_detect(x, "^[0-9][a-zA-Z]")
Negative classes
x = c('are you there?',
      '2nd half soon to begin. OSU did just win.',
      '6 and 9',
      'dont worry... we all die anyway!')

str_detect(x, "[^?.]$")
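The caret does two different jobs, which is easy to mix up: outside brackets it anchors to the start of the line, inside brackets (as the first character) it negates the class. A small sketch in base R with made-up strings:

```r
# Made-up example strings
x = c("are you there?", "6 and 9", "dont worry!")

# ^ outside brackets: the line must start with a digit
starts_with_digit = grepl("^[0-9]", x)

# ^ inside brackets: the first character must be anything except a digit
starts_with_non_digit = grepl("^[^0-9]", x)

# Negated class plus $: the last character is neither ? nor .
no_final_stop = grepl("[^?.]$", x)

print(starts_with_digit)
print(starts_with_non_digit)
print(no_final_stop)
```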
. means anything
x = c('its stupid the post 9-11 rules',
      'NetBios: scanning ip 203.169.114.66',
      'Front Door 9:11:46 AM',
      'Sings: 0118999881999119725...3 !')

str_detect(x, "9.11")
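Because . matches any single character, matching an actual period takes a little extra work. A sketch of three common ways to do it in base R, using made-up strings:

```r
# Made-up example strings
x = c("9-11", "9:11", "9.11", "9911", "911")

# Unescaped dot: any one character may sit between 9 and 11
any_char = grepl("9.11", x)

# Escaped dot: only a literal period matches (the backslash is doubled in R strings)
escaped = grepl("9\\.11", x)

# A dot inside brackets is also literal
bracketed = grepl("9[.]11", x)

# fixed = TRUE turns off regex interpretation entirely
fixed_match = grepl("9.11", x, fixed = TRUE)

print(any_char)
print(escaped)
```

Note that "911" never matches: the dot requires exactly one character, so the pattern needs at least four characters.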
| means or
x = c('Not a whole lot of hurricanes.',
      'We do have floods nearly every day',
      'hurricanes swirl in the other direction',
      'coldfire is STRAIGHT!')

str_detect(x, "flood|earthquake|hurricane|coldfire")
Detecting phone numbers
x = c('206-555-1122','206-332','4545','test')

phone = "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

str_detect(x, phone)
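The parentheses in the phone pattern are capture groups, so the same pattern can pull out the area code, exchange, and line number rather than just detect a match. A minimal sketch using base R's regexec() and regmatches(); the second input string is a made-up addition to show a match inside longer text:

```r
phone = "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# The second element is a made-up example, not from the slide
x = c("206-555-1122", "call 410 555 0199 now", "206-332", "test")

# regexec() records the position of the full match and of each group;
# regmatches() turns those positions into strings, one list element per input
m = regmatches(x, regexec(phone, x))

# For a matching string: full match first, then the three groups
parts = m[[1]]

# Non-matching strings give a zero-length vector
matched = lengths(m) > 0

print(parts)
print(matched)
```

The pattern is not anchored with ^ and $, so it also finds a number embedded in a sentence. In stringr, str_match() serves a similar purpose and returns the groups as columns of a character matrix.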
Can this get ridiculous? You bet!
Like really ridiculous
A nice tutorial
Regex lab