1 of 28

Dr. Harpreet Singh

Head, PG Department of Bioinformatics,

Hans Raj Mahila Maha Vidyalaya,

Jalandhar, Punjab, India

e-module

Pattern Matching and Regular Expressions in Python

2 of 28

Pattern

An arrangement of characters in a specific order is called a pattern. Patterns exist in various types of data, including biological data (such as DNA and protein sequences).

A conserved pattern among various sequences (DNA/protein) is known as a motif. Motifs help biomolecules to attain a specific structure and/or to perform a particular task.

3 of 28

Importance of patterns in Biology

By comparing related sequences and looking for those amino acids that remain the same in all of the members in the family, we can predict the sites that might be essential for function.

Protein Motifs/domains

Start/Stop codons

Transcription factor binding sites

Cutting sites for restriction enzymes

Posttranslational modification sites

Degenerate PCR primer sites

Runs of mononucleotides

Repetitive elements

Ligand binding sites

Enzyme active sites

Targeting sites

4 of 28

Motifs example 1: Enzyme active sites

To catalyze a reaction, an enzyme will bind to one or more reactant molecules, known as its substrates. The active site consists of the enzyme's amino acids that form temporary bonds with the substrate, known as the binding site, and the amino acids that catalyze the reaction of that substrate.

5 of 28

Motifs example 2: Ligand‐binding sites

A binding site is a region on a protein molecule where ligands (small molecules or ions) can form a chemical bond. Ligand binding often plays a structural or functional role, for example, in stabilization, catalysis, modulation of enzymatic activity, or signal transmission.

6 of 28

Motifs example 3: Cleavage sites

A cleavage site is the location on a protein molecule where peptide bonds are broken by hydrolysis. For instance, in human digestion, proteins in food are broken down into smaller peptide chains by digestive enzymes. Many viruses also produce their proteins initially as a single polypeptide chain, which is then cleaved into individual protein chains.

7 of 28

Motifs example 4: Posttranslational modification sites

Some amino acids in a protein can undergo chemical modification, produced in most cases by an enzyme, after the protein's synthesis or during its life in the cell. This modification usually results in a change of the protein's function, whether in terms of its action, half‐life, or cellular localization.

8 of 28

Motifs example 5: Targeting sites

Within a cell, the localization of a protein is essential for its proper functioning, but the production site of a protein is often different from the place of action. Protein targeting signals, such as nuclear or mitochondrial localization signals, can be encoded within the polypeptide chain to allow a protein to be directed to the correct location for its function.

9 of 28

Example of a simple functional site

  • An example of a simple functional site is the N‐glycosylation site, which is a posttranslational modification where a carbohydrate is attached to a hydroxyl or other functional group of a protein molecule. The sequence motif representing this site can be indicated by N‐X‐S/T. The first amino acid is asparagine (N), the second amino acid can be any of the 20 amino acids (X), and the third amino acid is either serine (S) or threonine (T).
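The N‐X‐S/T motif described above translates almost directly into a regular expression. The following is a minimal sketch, not a validated site predictor: the peptide is a made-up string used only for illustration, and "." stands in for X as in the description above. A stricter pattern that excludes proline at position X, as is commonly done, would be N[^P][ST].

```python
import re

# N-X-S/T as a regular expression: N, then any one residue, then S or T.
n_glyc = re.compile(r"N.[ST]")

# Hypothetical peptide, invented for this illustration only.
peptide = "MKNLSAAQNGTWPNAC"

# finditer yields one match object per non-overlapping hit.
sites = [(m.start(), m.group()) for m in n_glyc.finditer(peptide)]
print(sites)  # [(2, 'NLS'), (8, 'NGT')]
```

Positions are 0-based, as usual in Python; add 1 to obtain residue numbers.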

10 of 28

Regular Expressions: Definition

A regular expression is a pattern that matches strings or pieces of strings. It may also be defined as a sequence of characters that specifies a search pattern in text. In short, regular expressions summarize a text pattern.

As discussed previously, regular expressions can be used to locate domains in proteins and sequence patterns in DNA such as CpG islands, repeats, and restriction enzyme or nuclease recognition sites. There are even biological databases devoted to protein domains, like PROSITE.

11 of 28

Components of regular expressions: Meta-characters

Meta-characters are characters that have a special meaning in the context of the regular expressions. The common meta-characters are listed below:

. ^ $ * + ? { [ ] \ | ( )

. (dot): Matches any character, except the new line: “ATT.T” will match “ATTCT”, “ATTFT” but not “ATTTCT”.

^ (caret): Matches the beginning of the string: “^AUG” will match “AUGAGC” but not “AAUGC”. Used as the first character inside a set ([...]), it means “complement”.

$ (dollar): Matches the end of the string or just before a new line at the end of the string: “UAA$” will match “AGCUAA” but not “ACUAAG”.

* (star): Matches 0 or more repetitions of the preceding token: “AT*” will match “AAT”, “A”, but not “TT”.

+ (plus): The resulting REGEX will match 1 or more repetitions of the preceding REGEX: “AT+” will match “ATT”, but not “A”.

? (question mark): The resulting REGEX matches 0 or 1 repetitions of the preceding RE. “AT?” will match either “A” or “AT”.

(...): Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. To match the literals "(" or ")", use \( or \), or enclose them inside a character class: [(] [)].
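The behaviour of these meta-characters can be checked interactively. The sketch below reuses the example strings from the list above; the helper name found is an assumption introduced here for brevity and is not part of the re module.

```python
import re

def found(pattern, text):
    """Return True if the pattern occurs anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(r"ATT.T", "ATTCT"))    # True: "." accepts any single character
print(found(r"ATT.T", "ATTTCT"))   # False
print(found(r"^AUG", "AUGAGC"))    # True: "^" anchors at the start
print(found(r"^AUG", "AAUGC"))     # False
print(found(r"UAA$", "AGCUAA"))    # True: "$" anchors at the end
print(found(r"UAA$", "ACUAAG"))    # False

# Quantifiers are greedy: they take as many repetitions as they can.
print(re.search(r"AT*", "ATTTC").group())   # ATTT
print(re.search(r"AT+", "ATTTC").group())   # ATTT
print(re.search(r"AT?", "ATTTC").group())   # AT
```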

12 of 28

Components of regular expressions: Meta-characters

(?:...): A non-capturing version of regular parentheses. The substring matched by the group cannot be retrieved after performing a match.

{n}: Exactly n copies of the previous REGEX will match: “(ATTG){3}” will match “ATTGATTGATTG” but not “ATTGATTG”.

{m,n}: The resulting REGEX will match from m to n repetitions of the preceding REGEX: “(AT){3,5}” will match “ATATTATATAT” but not “ATATTATAT”. Without m, it will match from 0 repetitions. Without n, there is no upper limit on the number of repetitions.

[] (square brackets): Indicates a set of characters. “[A-Z]” will match any uppercase letter and “[a-z0-9]” will match any lowercase letter or digit. Most meta-characters are not active inside REGEX sets: “[AT*]” will match “A”, “T” or “*”. A ^ as the first character inside a set matches the complement of the set: “[^R]” will match any character but “R”.

"\" (backslash): Used to escape reserved characters (to match characters like “?”, “*”). Since Python also uses backslash as the escape character, you should pass a raw string to express the pattern.

| (vertical bar): As in logic, it reads as “or”. Any number of REGEX can be separated by “|”. “A|T” will match “A” or “T”; in “AT” it matches each of the two characters separately.
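The repetition counts, sets, alternation and escaping described above can be tried out as follows. This is a small sketch; all sequences are short made-up strings, apart from the ATTG example taken from the list above.

```python
import re

# {n}: exactly three copies of the group are required.
print(bool(re.search(r"(ATTG){3}", "ATTGATTGATTG")))   # True
print(bool(re.search(r"(ATTG){3}", "ATTGATTG")))       # False

# {m,n}: between 3 and 5 copies; the match is as long as possible.
print(re.search(r"(?:AT){3,5}", "GGATATATATCC").group())   # ATATATAT

# [^...]: anything that is not A, C, G or T.
print(re.findall(r"[^ACGT]", "ACGTNACGRT"))   # ['N', 'R']

# |: any of the three stop codons (no reading frame is taken into account).
print(re.findall(r"TAA|TAG|TGA", "ATGTAACCTGATAG"))   # ['TAA', 'TGA', 'TAG']

# \: escape a meta-character to match it literally; note the raw string.
print(re.findall(r"\*", "A*T*"))   # ['*', '*']
```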

13 of 28

Components of regular expressions: Metasymbols

Metasymbols are special sequences made of “\” followed by a character. Their special meanings are described below:

Name      Description
\number   The contents of the group of the same number, starting from 1
\A        Only at the start of the string
\b        The empty string, only at the beginning or end of a word
\B        The empty string, only when it is not at the beginning or end of a word
\d        Any decimal digit (as [0-9])
\D        Any non-digit (as [^0-9])
\s        Any whitespace character (as [ \t\n\r\f\v])
\S        Any non-whitespace character (as [^ \t\n\r\f\v])
\w        Any alphanumeric character (as [a-zA-Z0-9_])
\W        Any non-alphanumeric character (as [^a-zA-Z0-9_])
\Z        Only at the end of the string
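A few of these metasymbols in action. This is a sketch on made-up strings; the variable name line and its content are assumptions chosen only for illustration.

```python
import re

line = "seq_01  ATGC 42"   # hypothetical record: name, sequence, number

print(re.findall(r"\d+", line))   # ['01', '42']
print(re.findall(r"\w+", line))   # ['seq_01', 'ATGC', '42']
print(re.split(r"\s+", line))     # ['seq_01', 'ATGC', '42']

# \b: only words that start with AT are reported, not the AT inside CAT.
print(re.findall(r"\bAT\w*", "AT CAT ATG"))   # ['AT', 'ATG']

# \1: the same character as captured by group 1, i.e. a doubled character.
print(re.search(r"(\w)\1", "ACGGT").group())   # GG
```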

14 of 28

THE PYTHON: RE MODULE

The re module provides functions such as compile, search, findall, match, and others.

These functions are used to process a text using a pattern built with the REGEX syntax.

A basic search works like this:

>>> import re

>>> mo = re.search('hello', 'Hello world, hello Python!')

15 of 28

Searching, Matching and Splitting using RE Module

The Python re module can be used in a number of ways: to search for, match, or replace a pattern, or to split a string based on a given pattern. Common uses of re are described below:

16 of 28

re.search

The search function from the re module requires a pattern as its first argument and, as its second argument, the string in which the pattern will be searched. In the example above, the pattern is the literal string “hello”, so the capitalized “Hello” at the start of the string does not match. When a match is found, this function returns a match object (called mo in this case) with information about the first match. If there is no match, it returns None. A match object can be queried with the methods shown here:

>>> mo.group()

'hello'

>>> mo.span()

(13, 18)

group() returns the string matched by the REGEX, while span() returns a tuple containing the (start, end) positions of the match (that is, the (13, 18) returned by mo.span()).
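Because search returns None when nothing matches, it is safer to test the result before calling group() or span(). A minimal sketch, using the same text as above plus a second pattern ("goodbye") added here only to show the no-match case:

```python
import re

text = "Hello world, hello Python!"

for pattern in ("hello", "goodbye"):
    mo = re.search(pattern, text)
    if mo is None:
        print(pattern, "-> no match")
    else:
        # start() and end() are the two halves of span().
        print(pattern, "->", mo.group(), mo.span(), text[mo.start():mo.end()])
```

Calling mo.group() on a None result would raise an AttributeError, which is why the check comes first.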

17 of 28

re.search

This result is very similar to what the index method returns:

>>> 'Hello world, hello Python!'.index('hello')

13

The advantage of search is that it accepts a REGEX instead of a plain string. For example, we would like to match both “Hello” and “hello”:

>>> import re

>>> mo = re.search('[Hh]ello', 'Hello world, hello Python!')

The first match is now:

>>> mo.group()

'Hello'

18 of 28

re.findall

To find all the matches, and not just the first one, use findall:

>>> re.findall("[Hh]ello","Hello world, hello Python,!")

['Hello', 'hello']
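The same call works on biological sequences. The sketch below uses a made-up DNA string containing two copies of GAATTC; re.finditer, which is not covered above, is used because findall returns only the matched text and not the positions.

```python
import re

seq = "ATGCGCGAATTCGGGAATTCTT"   # hypothetical sequence for illustration

# findall returns the matched substrings.
print(re.findall("GAATTC", seq))   # ['GAATTC', 'GAATTC']

# finditer returns match objects, so the positions are available too.
print([m.start() for m in re.finditer("GAATTC", seq)])   # [6, 14]

# Matches do not overlap: the scan resumes after the end of each match.
print(re.findall("AA", "AAAA"))   # ['AA', 'AA']
```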

19 of 28

re.match

The match function works like search, but it looks only at the start of a string. When the pattern is not found, it returns None:

>>> mo = re.match("hello", "Hello world, hello Python!")

>>> print(mo)

None

As with search, when the pattern is found, it returns a match object:

>>> mo = re.match("Hello", "Hello world, hello Python!")

>>> mo

<re.Match object; span=(0, 5), match='Hello'>

This match object can be queried as before:

>>> mo.group()

'Hello'

>>> mo.span()

(0, 5)
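The difference between match and search is easy to see on a short sequence. A small sketch with a made-up coding sequence:

```python
import re

seq = "ATGGCCTAA"   # hypothetical sequence: start codon, one codon, stop codon

print(bool(re.match("ATG", seq)))     # True: ATG is at the start
print(bool(re.match("TAA", seq)))     # False: TAA is present, but not at the start
print(bool(re.search("TAA", seq)))    # True: search scans the whole string
print(bool(re.search("^TAA", seq)))   # False: "^" makes search behave like match here
```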

20 of 28

Compiling a Pattern

A pattern can be compiled (converted to an internal representation) to speed up the search. This step is not mandatory but recommended for large amounts of text.

Let’s see findall with a regular pattern and then with a “compiled” pattern (rgx):

>>> re.findall("[Hh]ello","Hello world, hello Python,!")

['Hello', 'hello']

>>> rgx = re.compile("[Hh]ello")

>>> rgx.findall("Hello world, hello Python,!")

['Hello', 'hello']
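A compiled pattern is most useful when the same REGEX is applied to many strings. A minimal sketch, assuming a small list of lines invented for this example:

```python
import re

rgx = re.compile("[Hh]ello")   # compile once

lines = ["Hello world", "say hello", "goodbye"]   # hypothetical input

# Reuse the compiled object for every line.
hits = [line for line in lines if rgx.search(line)]
print(hits)          # ['Hello world', 'say hello']

# The original pattern text stays available on the compiled object.
print(rgx.pattern)   # [Hh]ello
```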

21 of 28

Groups

Sometimes you need to match more than one pattern; this can be done by grouping. Groups are marked by a set of parentheses (“()”). Groups can be “capturing” (“named” or “unnamed”) or “non-capturing”.

A “capturing” group is used when you need to retrieve the contents of a group.

The contents of the groups are retrieved with the groups() method of the match object. Don’t confuse group() with groups(): as seen earlier, group() returns the string matched by the whole REGEX.
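The following sketch shows unnamed, named and non-capturing groups side by side. The header string and its field layout are invented for illustration and do not follow any particular database format.

```python
import re

header = ">db|ID001|GENE_X"   # hypothetical header line

# Two unnamed capturing groups and one named group called "name".
mo = re.search(r">(\w+)\|(\w+)\|(?P<name>\w+)", header)

print(mo.group())         # >db|ID001|GENE_X   (the whole match)
print(mo.groups())        # ('db', 'ID001', 'GENE_X')
print(mo.group(2))        # ID001
print(mo.group("name"))   # GENE_X

# A non-capturing group takes part in the match but is not reported.
mo2 = re.search(r"(?:AT)+(G+)", "ATATGGC")
print(mo2.groups())       # ('GG',)
```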

22 of 28

re.split: Splitting a string using regular expression

The split() function of the built-in re module splits a string at every match of a regular expression and returns the pieces as a list.

Example1:

import re

myvar = 'sky1cloud3blue333red'

# Split wherever one or more digits occur (the raw string keeps \d intact)
result = re.split(r'\d+', myvar)

print(result)

The output will be

['sky', 'cloud', 'blue', 'red']

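Example2 (for comparison, the built-in string method split(), which splits on a fixed separator rather than on a REGEX):

mystring = "Hey, I love Python, Python is easy"

x = mystring.split(",")

print(x)

The output will be

['Hey', ' I love Python', ' Python is easy']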
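A REGEX separator is helpful when the delimiter is not uniform. The sketch below uses a made-up record with mixed separators and also shows the optional maxsplit argument of re.split, which limits the number of splits.

```python
import re

record = "geneA; geneB,geneC ;  geneD"   # hypothetical, inconsistently delimited

# Split on a comma or semicolon together with any surrounding whitespace.
print(re.split(r"\s*[;,]\s*", record))   # ['geneA', 'geneB', 'geneC', 'geneD']

# maxsplit=1 stops after the first split.
print(re.split(r"\d+", "sky1cloud3blue333red", maxsplit=1))   # ['sky', 'cloud3blue333red']
```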

23 of 28

re.sub: Replacing Patterns

re.sub

sub(rpl, str[, count=0]): Replaces with rpl the portions of the string (str) that match the REGEX to which it applies. The third parameter, which is optional, indicates the maximum number of replacements to make. Its default value is zero, which means that all occurrences are replaced. It is very similar to the string method replace, except that the text to be replaced is located by a REGEX instead of being given as a fixed string.

Example1:

import re

regex = re.compile("(?:GC){3,}")

seq="ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"

print ("Before:",seq)

print ("After:",regex.sub("",seq))

Note: Instead of re.compile("(?:GC){3,}") we can also use re.compile("(GC){3,}"). The first one is the non-capturing version. Both give the same result in this case.
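The effect of the count parameter, and the difference from the string method replace, can be seen on a shorter sequence. This is a sketch; the sequence is made up so that the results are easy to check by eye.

```python
import re

regex = re.compile(r"(?:GC){3,}")
seq = "AAGCGCGCTTGCGCGCGCAA"   # hypothetical: one run of 3 GCs, one run of 4 GCs

print(regex.sub("", seq))            # AATTAA          (both runs removed)
print(regex.sub("", seq, count=1))   # AATTGCGCGCGCAA  (only the first run removed)
print(regex.sub("-", seq))           # AA-TT-AA        (runs replaced by a marker)

# str.replace only knows the fixed text GCGCGC, so one GC of the longer run is left over.
print(seq.replace("GCGCGC", ""))     # AATTGCAA
```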

24 of 28

re.subn: Replacing Patterns

re.subn

subn(rpl, str[, count=0]): This has the same function as sub, except that instead of returning only the new string, it returns a tuple with two elements: the new string and the number of replacements made. This function is used when, in addition to replacing a pattern in a string, it is necessary to know how many replacements have been made.

Example1:

import re

regex = re.compile("(?:GC){3,}")

seq="ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"

print ("Before:",seq)

print ("After:",regex.subn("",seq))
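Since subn returns a tuple, it is often convenient to unpack it into two variables. A minimal sketch with the same sequence as in the example above; the variable names new_seq and n are chosen here for illustration.

```python
import re

regex = re.compile(r"(?:GC){3,}")
seq = "ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"

# Unpack the (new string, number of replacements) tuple.
new_seq, n = regex.subn("", seq)

print("Sequence without GC runs:", new_seq)
print("Number of runs removed:", n)
print("Bases removed:", len(seq) - len(new_seq))
```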

25 of 28

Cleaning Up a Sequence

It is very common to find a file with sequences in a non-standard format, for example with numbers and whitespace mixed into the sequence.

The following code reads a text file with a sequence in such a format and returns only the sequence, without any unwanted (number or whitespace) characters:

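import re

# Match a space, a digit, a newline, or a tab
regex = re.compile(r' |\d|\n|\t')

seq = ''

# Strip the unwanted characters from each line and keep the rest
with open('pMOSBlue.txt') as handle:
    for line in handle:
        seq += regex.sub('', line)

print(seq)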
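The same clean-up can be tried without a file. The sketch below works on a few in-memory lines that are invented for illustration (they are not the content of pMOSBlue.txt) and uses a character set, [\s\d]+, as an alternative way of writing "whitespace or digit".

```python
import re

# Hypothetical lines with position numbers and blocks of ten bases.
raw_lines = [
    "        1 atgcatgcat gcatgcatgc\n",
    "       21 ttgacctgaa\n",
]

cleaner = re.compile(r"[\s\d]+")   # runs of whitespace and/or digits

# Remove the unwanted characters from every line and join what is left.
seq = "".join(cleaner.sub("", line) for line in raw_lines)
print(seq)   # atgcatgcatgcatgcatgcttgacctgaa
```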

26 of 28

Combining

27 of 28

References

28 of 28

Thanks