Dr. Harpreet Singh
Head, PG Department of Bioinformatics,
Hans Raj Mahila Maha Vidyalaya,
Jalandhar, Punjab, India
e-module
Pattern Matching and Regular Expressions in Python
Pattern
An arrangement of characters in a specific order is called a pattern. Patterns exist in various types of data, including biological data (such as DNA and protein sequences).
A conserved pattern among various sequences (DNA/protein) is known as a motif. Motifs help biomolecules to attain a specific structure and/or to perform a particular task.
Importance of patterns in Biology
By comparing related sequences and looking for those amino acids that remain the same in all of the members in the family, we can predict the sites that might be essential for function.
Protein Motifs/domains
Start/Stop codons
Transcription factor binding sites
Cutting sites for restriction enzymes
Posttranslational modification sites
Degenerate PCR primer sites
Runs of mononucleotides
Repetitive elements
Ligand binding sites
Enzyme active sites
Targeting sites
Motifs example 1: Enzyme active sites
To catalyze a reaction, an enzyme will bind to one or more reactant molecules, known as its substrates. The active site consists of the enzyme's amino acids that form temporary bonds with the substrate, known as the binding site, and the amino acids that catalyze the reaction of that substrate.
Motifs example 2: Ligand-binding sites
A binding site is a region on a protein molecule where ligands (small molecules or ions) can form a chemical bond. Ligand binding often plays a structural or functional role, for example, in stabilization, catalysis, modulation of enzymatic activity, or signal transmission.
Motifs example 3: Cleavage sites
The location on a protein molecule where peptide bonds are broken down by hydrolysis. For instance, in human digestion, proteins in food are broken down into smaller peptide chains by digestive enzymes. Many viruses also produce their proteins initially as a single polypeptide chain which is then cleaved into individual protein chains.
Motifs example 4: Posttranslational modification sites
Some amino acids in a protein can undergo chemical modification, produced in most cases by an enzyme after its synthesis or during its life in the cell. This change usually results in a change of the protein function, whether in terms of its action, half-life, or its cellular localization.
Motifs example 5: Targeting sites
Within a cell, the localization of a protein is essential for its proper functioning, but the production site of a protein is often different from the place of action. Protein targeting signals, such as nuclear or mitochondrial localization signals, can be encoded within the polypeptide chain to allow a protein to be directed to the correct location for its function.
Example of a simple functional site
Regular Expressions: Definition
A regular expression is a pattern that matches strings or pieces of strings. It may also be defined as a sequence of characters that specifies a search pattern in text. In short, a regular expression summarizes a text pattern.
As discussed previously, regular expressions can be used to locate domains in proteins and sequence patterns in DNA, such as CpG islands, repeats, and restriction enzyme or nuclease recognition sites. There are even biological databases devoted to protein domains, like PROSITE.
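As a minimal sketch of the idea above, the snippet below looks for the EcoRI recognition site (GAATTC) in a DNA string. The sequence and the variable names are made up for illustration and are not data from this module.

```python
import re

# Hypothetical DNA fragment used only for illustration
dna = "ATGGAATTCTTAGGCCGAATTCAA"

# EcoRI recognizes the literal sequence GAATTC;
# finditer yields one match object per occurrence
sites = [m.start() for m in re.finditer("GAATTC", dna)]
print(sites)  # [3, 16]
```

The positions are 0-based offsets into the string, which is how Python reports all match positions.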
Components of regular expressions: Meta-characters
Meta-characters are characters that have a special meaning in the context of the regular expressions. The common meta-characters are listed below:
. ^ $ * + ? { [ ] \ | ( )
. (dot): Matches any character, except the new line: “ATT.T” will match “ATTCT”, “ATTFT” but not “ATTTCT”.
^ (caret): Matches the beginning of the chain: “^AUG” will match “AUGAGC” but not “AAUGC”. Used as the first character inside a set (square brackets), it means “opposite”, that is, it negates the set.
$(dollar): Matches the end of the chain or just before a new line at the end of the chain: “UAA$” will match “AGCUAA” but not “ACUAAG”.
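* (asterisk): The resulting REGEX will match 0 or more repetitions of the preceding REGEX: “AT*” will match “A”, “AT” and “ATT”.
+ (plus): The resulting REGEX will match 1 or more repetitions of the preceding REGEX: “AT+” will match “ATT”, but not “A”.
? (question mark): The resulting REGEX matches 0 or 1 repetitions of the preceding REGEX. “AT?” will match either “A” or “AT”.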
(...): Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. To match the literals "(" or ")", use \( or \), or enclose them inside a character class: [(] [)].
(?:...): A non-capturing version of regular parentheses. It groups the enclosed REGEX, but the substring matched by the group cannot be retrieved after performing a match.
{n}: Exactly n copies of the previous REGEX will match: “(ATTG){3}” will match “ATTGATTGATTG” but not “ATTGATTG”.
{m,n}: The resulting REGEX will match from m to n repetitions of the preceding REGEX: “(AT){3,5}” will match “ATATTATATAT” (which contains three consecutive copies of “AT”) but not “ATATTATAT” (which never has more than two in a row). Without m, the lower bound is 0 repetitions. Without n, there is no upper bound.
[] (square brackets): Indicates a set of characters. “[A-Z]” will match any uppercase letter and “[a-z0-9]” will match any lowercase letter or digit. Most meta-characters lose their special meaning inside REGEX sets: “[AT*]” will match “A”, “T” or “*”. A ^ as the first character of a set matches the complement of the set: “[^R]” will match any character but “R”.
"\" (backslash): Used to escape reserved characters (to match characters like “?”, “*”). Since Python also uses backslash as the escape character, you should pass a raw string to express the pattern.
| (vertical bar): As in logic, it reads as “or”. Any number of REGEX can be separated by “|”. “A|T” will match “A” or “T”.
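The examples in this list can be checked directly with re.search. The sketch below is only a quick self-test; the checks list and its layout are illustrative choices, not part of the re module.

```python
import re

# (pattern, text, expected to match?) taken from the examples in this section
checks = [
    (r"ATT.T", "ATTCT", True),
    (r"ATT.T", "ATTTCT", False),
    (r"^AUG", "AUGAGC", True),
    (r"^AUG", "AAUGC", False),
    (r"UAA$", "AGCUAA", True),
    (r"UAA$", "ACUAAG", False),
    (r"(ATTG){3}", "ATTGATTGATTG", True),
    (r"(ATTG){3}", "ATTGATTG", False),
    (r"[^R]", "R", False),
]

# re.search returns a match object or None, so "is not None" gives a boolean
results = [(p, t, re.search(p, t) is not None, exp) for p, t, exp in checks]
for pattern, text, found, expected in results:
    print(pattern, text, found, "(expected:", str(expected) + ")")
```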
Components of regular expressions: Metasymbols
Metasymbols are special sequences made of “\” and a character. They also have special meanings, as described below:
Name      Description
\number   The contents of the group of the same number, starting from 1
\A        Only at the start of the string
\b        The empty string, only at the beginning or end of a word
\B        The empty string, only when it is not at the beginning or end of a word
\d        Any decimal digit (as [0-9])
\D        Any non-digit (as [^0-9])
\s        Any whitespace character (as [ \t\n\r\f\v])
\S        Any non-whitespace character (as [^ \t\n\r\f\v])
\w        Any alphanumeric character (as [a-zA-Z0-9_])
\W        Any non-alphanumeric character (as [^a-zA-Z0-9_])
\Z        Only at the end of the string
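A few of these metasymbols in action, as a rough sketch. The record line below is a hypothetical format invented for this example.

```python
import re

# Hypothetical record line; the layout is an assumption for illustration
line = "seq_01   length 1532 bp"

digits = re.findall(r"\d+", line)    # runs of decimal digits
words = re.findall(r"\w+", line)     # runs of word characters
fields = re.split(r"\s+", line)      # split on any run of whitespace
has_bp = re.search(r"\bbp\b", line) is not None  # 'bp' as a whole word

# \1 refers back to group 1: find a base that is immediately repeated
doubled = re.search(r"([ACGT])\1", "ACGTTGCA").group()

print(digits)   # ['01', '1532']
print(words)    # ['seq_01', 'length', '1532', 'bp']
print(fields)   # ['seq_01', 'length', '1532', 'bp']
print(has_bp)   # True
print(doubled)  # TT
```

All patterns are written as raw strings (r"...") so that the backslashes reach the regex engine unchanged.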
The Python re Module
The re module provides functions such as compile, search, findall, match, and others. These functions are used to process text using a pattern built with the REGEX syntax.
A basic search works like this:
>>> import re
>>> mo = re.search('hello', 'Hello world, hello Python!')
Searching, Matching and Splitting using RE Module
Python's re module can be used in a number of ways: to search for, match, or replace a pattern, or to split a string on a given pattern. Common uses are listed below:
re.search
The search function from the re module requires a pattern as its first argument and, as its second argument, the string in which the pattern will be searched. In this case the pattern is the literal string “hello”. When a match is found, this function returns a match object (called mo in this case) with information about the first match. If there is no match, it returns None. A match object can be queried with the methods shown here:
>>> mo.group()
'hello'
>>> mo.span()
(13, 18)
group() returns the string matched by the REGEX, while span() returns a tuple containing the (start, end) positions of the match (that is, the (13, 18) returned by mo.span()).
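The same positions are also available one at a time through start() and end(). The short sketch below additionally shows why the result should be tested against None before it is queried; the word "goodbye" is just an arbitrary pattern that is absent from the text.

```python
import re

text = "Hello world, hello Python!"
mo = re.search("hello", text)

if mo is not None:
    start, end = mo.start(), mo.end()
    print(start, end)        # the same numbers as mo.span()
    print(text[start:end])   # slicing recovers the matched text

# A pattern that is absent returns None, so test before calling group()
missing = re.search("goodbye", text)
print(missing is None)
```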
This result is very similar to what the index method returns:
>>> 'Hello world, hello Python!'.index('hello')
13
The difference is that re.search accepts a REGEX instead of a plain string. For example, suppose we would like to match both “Hello” and “hello”:
>>> import re
>>> mo = re.search('[Hh]ello', 'Hello world, hello Python!')
The first match is now:
>>> mo.group()
'Hello'
re.findall
To find all the matches, and not just the first one, use findall:
>>> re.findall("[Hh]ello","Hello world, hello Python,!")
['Hello', 'hello']
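findall returns only the matched strings. When the positions are needed as well, re.finditer yields one match object per hit. A minimal sketch using the same text:

```python
import re

text = "Hello world, hello Python,!"

# Each item from finditer is a match object, so group() and start() both work
hits = [(m.group(), m.start()) for m in re.finditer("[Hh]ello", text)]
print(hits)  # [('Hello', 0), ('hello', 13)]
```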
re.match
The match function works like search, but it looks only at the start of a string. When the pattern is not found, it returns None:
>>> mo = re.match("hello", "Hello world, hello Python!")
>>> print(mo)
None
As with search, when the pattern is found, it returns a match object:
>>> mo = re.match("Hello", "Hello world, hello Python!")
>>> mo
<re.Match object; span=(0, 5), match='Hello'>
This match object can be queried as before:
>>> mo.group()
'Hello'
>>> mo.span()
(0, 5)
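A biological use of this difference, sketched with two made-up mRNA fragments: match answers "does the sequence begin with the start codon?", while search answers "does it occur anywhere?". re.fullmatch, which requires the whole string to fit the pattern, is included as a related check.

```python
import re

# Hypothetical mRNA fragments used only for illustration
starts_with_aug = "AUGGCCUAA"
aug_inside = "GGAUGCCUAA"

begins = re.match("AUG", starts_with_aug) is not None   # AUG at position 0
begins_2 = re.match("AUG", aug_inside) is not None      # AUG is not at position 0
position = re.search("AUG", aug_inside).start()         # search finds it anyway

# fullmatch succeeds only if the entire string is made of RNA letters
valid_rna = re.fullmatch("[ACGU]+", starts_with_aug) is not None

print(begins, begins_2, position, valid_rna)  # True False 2 True
```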
Compiling a Pattern
A pattern can be compiled (converted to an internal representation) to speed up the search. This step is not mandatory but recommended for large amounts of text.
Let’s see findall with a regular pattern and then with a “compiled” pattern (rgx):
>>> re.findall("[Hh]ello","Hello world, hello Python,!")
['Hello', 'hello']
>>> rgx = re.compile("[Hh]ello")
>>> rgx.findall("Hello world, hello Python,!")
['Hello', 'hello']
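Compiling pays off mainly when the same pattern is applied many times. The sketch below compiles once and reuses the pattern object on several sequences; the sequence names and contents are hypothetical.

```python
import re

# Compile once, then reuse the same pattern object on many sequences
gc_run = re.compile("(?:GC){3,}")

# Hypothetical sequences used only for illustration
sequences = {
    "seq1": "ATGCGCGCTT",
    "seq2": "ATGCATGCAT",
    "seq3": "GCGCGCGCAA",
}

# Keep the names of the sequences that contain at least three GC repeats
with_run = [name for name, seq in sequences.items() if gc_run.search(seq)]
print(with_run)  # ['seq1', 'seq3']
```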
Groups
Sometimes you need to match more than one pattern; this can be done by grouping. Groups are marked by a set of parentheses (“()”). Groups can be “capturing” (“named” or “unnamed”) and “non-capturing.”
A “capturing” group is used when you need to retrieve the contents of a group.
The contents of all captured groups are retrieved with the groups method. Don’t confuse group with groups: as seen earlier, group returns the string matched by the REGEX.
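The difference between group and groups is easiest to see side by side. This is a sketch under assumptions: the header string below imitates a database-style identifier line and is invented for the example, as is the group name "entry".

```python
import re

# Hypothetical header line; the layout is an assumption for illustration
header = ">sp|P12345|KINASE_HUMAN"

# Two unnamed capturing groups and one named group, (?P<entry>...)
mo = re.search(r">(\w+)\|(\w+)\|(?P<entry>\w+)", header)

whole = mo.group()            # the complete match
captured = mo.groups()        # a tuple with every captured group
second = mo.group(2)          # one group, selected by number
named = mo.group("entry")     # one group, selected by name

# A non-capturing group (?:...) does not appear in groups()
only_at = re.search(r"(?:GC){2}(AT)", "TTGCGCATTT").groups()

print(whole, captured, second, named, only_at)
```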
re.split: Splitting a string using regular expression
The re module provides the split() function, which splits a string at each match of a regular expression and returns the pieces as a list.
Example 1:
import re
myvar = 'sky1cloud3blue333red'
# Raw string so that \d reaches the regex engine unchanged
result = re.split(r'\d+', myvar)
print(result)
The output will be
['sky', 'cloud', 'blue', 'red']
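Example 2 (for comparison, the built-in string method split(), which splits on a fixed delimiter rather than a REGEX):
mystring = "Hey, I love Python, Python is easy"
x = mystring.split(",")
print(x)
The output will be
['Hey', ' I love Python', ' Python is easy']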
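Where re.split goes beyond str.split is with delimiters that vary. The sketch below splits a made-up record on semicolons, commas, and whitespace in one call, and also shows the optional maxsplit argument.

```python
import re

# Hypothetical record with mixed separators (semicolon, comma+space, tab)
record = "ATG;CCA, GGT\tTAA"

# One character class covers all separators; the + merges
# consecutive separators into a single split point
codons = re.split(r"[;,\s]+", record)
print(codons)  # ['ATG', 'CCA', 'GGT', 'TAA']

# maxsplit limits the number of splits; the remainder is left untouched
first, rest = re.split(r"[;,\s]+", record, maxsplit=1)
print(first)   # ATG
```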
re.sub: Replacing Patterns
sub(rpl, str[, count=0]): Replaces the portions of the string (str) that match the REGEX with rpl. The third parameter, which is optional, indicates how many replacements to make. Its default value is zero, which means that all occurrences are replaced. It is very similar to the string method replace, except that the text to be replaced is located by a REGEX instead of being given literally.
Example 1:
import re
# Three or more consecutive copies of GC
regex = re.compile("(?:GC){3,}")
seq = "ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"
print("Before:", seq)
# Delete every GC run by replacing it with the empty string
print("After:", regex.sub("", seq))
Note: Instead of re.compile("(?:GC){3,}") we can also use re.compile("(GC){3,}"). The first one is the non-capturing version. Both will give the same result in our case.
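Two variations on sub, sketched on a shorter made-up sequence so the results are easy to follow: the optional count argument, and a function used as the replacement.

```python
import re

# Hypothetical short sequence with two GC runs
seq = "AAGCGCGCTTGCGCGCGCAA"
regex = re.compile("(?:GC){3,}")

# count=1 replaces only the first (leftmost) match
first_only = regex.sub("", seq, count=1)
print(first_only)  # AATTGCGCGCGCAA

# The replacement may also be a function that receives the match object;
# here each GC run is lowercased instead of removed
masked = regex.sub(lambda m: m.group().lower(), seq)
print(masked)      # AAgcgcgcTTgcgcgcgcAA
```

Lowercasing a region rather than deleting it keeps the sequence length unchanged, which can matter when positions are reported later.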
re.subn: Replacing Patterns
subn(rpl, str[, count=0]): This has the same function as sub, differing in that instead of returning the new string, it returns a tuple with two elements: the new string and the number of replacements made. This function is used when, in addition to replacing a pattern in a string, it’s required to know how many replacements have been made.
Example 1:
import re
regex = re.compile("(?:GC){3,}")
seq = "ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"
print("Before:", seq)
# subn returns a (new_string, number_of_replacements) tuple
print("After:", regex.subn("", seq))
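Because subn returns a tuple, it is convenient to unpack it into two names. A minimal sketch on a shorter hypothetical sequence:

```python
import re

# Hypothetical short sequence with two GC runs
seq = "AAGCGCGCTTGCGCGCGCAA"
regex = re.compile("(?:GC){3,}")

# Unpack the (new_string, number_of_replacements) tuple
new_seq, n_replaced = regex.subn("", seq)
print(new_seq)     # AATTAA
print(n_replaced)  # 2
```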
Cleaning Up a Sequence
It is very common to find a file with sequences in a non-standard format, for example with position numbers and whitespace mixed into the sequence.
The following code reads a text file with the sequence in such a format and returns only the sequence, without any unwanted (number or whitespace) characters:
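import re
# Match a space, a digit, a newline, or a tab
regex = re.compile(r' |\d|\n|\t')
seq = ''
# Strip the unwanted characters from each line and join what is left
with open('pMOSBlue.txt') as fh:
    for line in fh:
        seq += regex.sub('', line)
print(seq)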
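The same cleanup can be tried without an input file. In the sketch below the sequence block is a made-up example held in a string, and the four alternatives are folded into one character class; this is an equivalent formulation for this kind of input, not the only way to write it.

```python
import re

# Hypothetical numbered sequence block, kept in a string so the
# example runs without an input file
raw = """        1 atggcgacta ccggtacgtt
       21 ggcatcgatc aattgcc
"""

# \s covers spaces, tabs, and newlines; \d covers the position numbers
clean = re.sub(r"[\s\d]+", "", raw)
print(clean)       # atggcgactaccggtacgttggcatcgatcaattgcc
print(len(clean))  # 37
```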
Combining