1 of 19

Regular expressions

www.jtleek.com/advdatasci16

2 of 19

Announcements

  • New swirl modules (complete by Wednesday):
    • Grouping_and_Chaining_with_dplyr
    • Googlesheets
  • For lab this week
    • You should be very close to having a collected data set
    • Please come with a step-by-step analysis plan even if you haven’t done it yet

3 of 19

Working with text

4 of 19

Find the BCM centers

library(readxl)

kg = read_excel("1000genomes.xlsx", sheet = 4, skip = 1)
table(kg$Center)

grep("BCM", kg$Center)

grepl("BCM", kg$Center)
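
The two calls differ only in what they return. A minimal sketch, assuming a small made-up vector `centers` standing in for kg$Center (the values are illustrative, not taken from the spreadsheet):

```r
# Hypothetical stand-in for kg$Center; values are made up for illustration
centers = c("BCM", "WUGSC", "BCM", "BI", "SC")

# grep() returns the positions of the matching elements
idx = grep("BCM", centers)                  # 1 3

# grepl() returns one TRUE/FALSE per element, handy for subsetting or counting
hit = grepl("BCM", centers)                 # TRUE FALSE TRUE FALSE FALSE

# value = TRUE returns the matching strings instead of their positions
vals = grep("BCM", centers, value = TRUE)   # "BCM" "BCM"

# Summing the logical vector counts the matches
n_hits = sum(hit)                           # 2
```

Use grepl() when a logical mask is needed (for example inside kg[...]) and grep() when positions are needed.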

5 of 19

The same with stringr

library(stringr)

str_detect(kg$Center, "BCM")[1:40]
str_subset(kg$Center, "BCM")

vignette("stringr")
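
As a rough guide, the stringr functions map onto the base R calls as sketched here. The vector `centers` is a hypothetical stand-in for kg$Center with made-up values. Note that stringr takes the string first and the pattern second, the reverse of grep().

```r
library(stringr)

# Hypothetical stand-in for kg$Center; values are made up for illustration
centers = c("BCM", "WUGSC", "BCM", "BI", "SC")

det = str_detect(centers, "BCM")   # logical vector, like grepl()
sub = str_subset(centers, "BCM")   # matching strings, like grep(value = TRUE)
pos = str_which(centers, "BCM")    # matching positions, like grep()
```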

6 of 19

Literals: nuclear
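
A literal pattern matches that exact run of characters wherever it occurs. A small sketch with made-up strings, using base R's grepl():

```r
# Made-up example strings
x = c("nuclear power", "Nuclear family", "thermonuclear", "unclear")

# The literal "nuclear" is case-sensitive and ignores word boundaries:
# it misses "Nuclear" but matches inside "thermonuclear"
lit = grepl("nuclear", x)   # TRUE FALSE TRUE FALSE
```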

7 of 19

But text is more complicated

We need a way to express

- whitespace/word boundaries

- sets of literals

- the beginning and end of a line

- alternatives (“war” or “peace”)
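
Each item in the list above has its own bit of regex syntax. A compact sketch on made-up strings, using base R's grepl() (the patterns are illustrative choices, not the only way to write them):

```r
# Made-up example strings
x = c("war and peace", "warm weather", "peace at last", "a cold war")

whole_word = grepl("\\bwar\\b", x)   # word boundaries:   TRUE FALSE FALSE TRUE
char_set   = grepl("[pw]ea", x)      # set of literals:   TRUE TRUE TRUE FALSE
at_start   = grepl("^peace", x)      # beginning of line: FALSE FALSE TRUE FALSE
at_end     = grepl("war$", x)        # end of line:       FALSE FALSE FALSE TRUE
either     = grepl("war|peace", x)   # alternatives:      TRUE TRUE TRUE TRUE
```

The last result is all TRUE because "warm" contains the literal "war"; the word-boundary version is what excludes it.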

8 of 19

Beginning of line with ^

x = c("i think we all rule for participating",
      "i think i have been outed",
      "i think this will be quite fun actually",
      "it will be fun, i think")

str_detect(x, "^i think")

9 of 19

End of line with $

x = c("well they had something this morning",
      "then had to catch a tram home in the morning",
      "dog obedience school in the morning",
      "this morning I'll go for a run")

str_detect(x, "morning$")

10 of 19

Character list with []

x = c('Name the worst thing about Bush!',
      'I saw a green bush',
      'BBQ and bushwalking at Molonglo Gorge',
      'BUSH!!')

str_detect(x, "[Bb][Uu][Ss][Hh]")
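
Listing both cases of every letter works but gets long. One alternative, sketched here with stringr's regex() modifier and its ignore_case argument, gives the same answer on these strings:

```r
library(stringr)

x = c("Name the worst thing about Bush!",
      "I saw a green bush",
      "BBQ and bushwalking at Molonglo Gorge",
      "BUSH!!")

# Character lists: each position accepts either case
by_class = str_detect(x, "[Bb][Uu][Ss][Hh]")

# Same idea with a case-insensitive match on the literal "bush"
by_flag = str_detect(x, regex("bush", ignore_case = TRUE))

# Both are TRUE for every string, since each contains some casing of "bush"
same = identical(by_class, by_flag)
```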

11 of 19

Sets of letters and numbers

x = c('7th inning stretch',
      '2nd half soon to begin. OSU did just win.',
      '3am - cant sleep - too hot still.. :(',
      '5ft 7 sent from heaven')

str_detect(x, "^[0-9][a-zA-Z]")

12 of 19

Negative classes

x = c('are you there?',
      '2nd half soon to begin. OSU did just win.',
      '6 and 9',
      'dont worry... we all die anyway!')

str_detect(x, "[^?.]$")

13 of 19

. means anything

x = c('its stupid the post 9-11 rules',
      'NetBios: scanning ip 203.169.114.66',
      'Front Door 9:11:46 AM',
      'Sings: 0118999881999119725...3 !')

str_detect(x, "9.11")
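
Because . matches any character, "9.11" also matches "9-11", "9:11" and "9911". To match an actual period, the dot has to be escaped or the pattern treated as a fixed string. A sketch on the same strings, using stringr's fixed():

```r
library(stringr)

x = c("its stupid the post 9-11 rules",
      "NetBios: scanning ip 203.169.114.66",
      "Front Door 9:11:46 AM",
      "Sings: 0118999881999119725...3 !")

any_char = str_detect(x, "9.11")          # TRUE TRUE TRUE TRUE
escaped  = str_detect(x, "9\\.11")        # FALSE TRUE FALSE FALSE
literal  = str_detect(x, fixed("9.11"))   # FALSE TRUE FALSE FALSE
```

Only the IP address contains the literal text "9.11" (inside "169.114").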

14 of 19

| means or

x = c('Not a whole lot of hurricanes.',
      'We do have floods nearly every day',
      'hurricanes swirl in the other direction',
      'coldfire is STRAIGHT!')

str_detect(x, "flood|earthquake|hurricane|coldfire")
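
One thing to watch when combining | with ^ or $: the anchor attaches only to the alternative next to it unless the alternatives are grouped with parentheses. A small sketch on the same strings, using base R's grepl():

```r
x = c("Not a whole lot of hurricanes.",
      "We do have floods nearly every day",
      "hurricanes swirl in the other direction",
      "coldfire is STRAIGHT!")

# Reads as: (line starts with "hurricane") OR ("flood" anywhere)
ungrouped = grepl("^hurricane|flood", x)     # FALSE TRUE TRUE FALSE

# Parentheses make ^ apply to both alternatives
grouped = grepl("^(hurricane|flood)", x)     # FALSE FALSE TRUE FALSE
```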

15 of 19

Detecting phone numbers

x = c('206-555-1122', '206-332', '4545', 'test')

phone = "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

str_detect(x, phone)
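
The parentheses in the pattern also capture the three parts of the number. A minimal sketch using stringr's str_match() to pull them out, with the same x and phone as above:

```r
library(stringr)

x = c("206-555-1122", "206-332", "4545", "test")
phone = "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# Only the first string has the full 3-3-4 digit layout
det = str_detect(x, phone)        # TRUE FALSE FALSE FALSE

# str_match() returns a character matrix: the full match in column 1,
# then one column per parenthesized group; non-matching rows are NA
m = str_match(x, phone)

area_code = m[1, 2]               # "206"
exchange  = m[1, 3]               # "555"
line_no   = m[1, 4]               # "1122"
```

The pattern is not anchored, so it would also match a phone number embedded in a longer string; add ^ and $ if the whole string must be a phone number.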

16 of 19

Can this get ridiculous? You bet!

17 of 19

Like really ridiculous

18 of 19

A nice tutorial

19 of 19

Regex lab