Strings
Genome 559: Introduction to Statistical and Computational Genomics
Prof. William Stafford Noble
Strings
>>> "foo"
'foo'
>>> 'foo'
'foo'
Defining strings
>>> myString = "GATTACA"
>>> myString
'GATTACA'
Accessing single characters
>>> myString = "GATTACA"
>>> myString[0]
'G'
>>> myString[1]
'A'
>>> myString[-1]
'A'
>>> myString[-2]
'C'
>>> myString[7]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
Negative indices start at the end of the string and move left.
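One way to read a negative index is as an offset from the length of the string: for a string s, s[-k] refers to the same character as s[len(s) - k]. The short sketch below checks this for the example string; the names last and second_to_last are ours, chosen only for illustration.

```python
# Illustrative sketch: negative indices count back from the end of the string.
myString = "GATTACA"

last = myString[-1]             # same character as myString[len(myString) - 1]
second_to_last = myString[-2]   # same character as myString[len(myString) - 2]

print(last)                                           # A
print(second_to_last)                                 # C
print(myString[-1] == myString[len(myString) - 1])    # True
```

The most negative valid index is -len(myString), which refers to the first character.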
Accessing substrings
>>> myString = "GATTACA"
>>> myString[1:3]
'AT'
>>> myString[:3]
'GAT'
>>> myString[4:]
'ACA'
>>> myString[3:5]
'TA'
>>> myString[:]
'GATTACA'
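In a slice myString[start:end], the character at start is included and the character at end is not, so the result has end - start characters when both positions lie inside the string. A minimal sketch of that rule; the name middle is ours.

```python
# Illustrative sketch: slices include the start position and exclude the end.
myString = "GATTACA"

middle = myString[1:3]      # characters at positions 1 and 2
print(middle)               # AT
print(len(middle))          # 2, which is 3 - 1

# Omitting a bound means "from the start" or "to the end",
# so the two pieces below always rebuild the original string.
print(myString[:3] + myString[3:] == myString)   # True
```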
Special characters
>>> "He said, "Wow!""
  File "<stdin>", line 1
    "He said, "Wow!""
                 ^
SyntaxError: invalid syntax
>>> "He said, 'Wow!'"
"He said, 'Wow!'"
>>> "He said, \"Wow!\""
'He said, "Wow!"'
Escape sequence    Meaning
\\                 Backslash
\'                 Single quote
\"                 Double quote
\n                 Newline
\t                 Tab
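Each escape sequence is typed as two characters but stored as a single character. The sketch below shows a few of them in use; the example strings, including the Windows-style path, are made up for illustration.

```python
# Illustrative sketch: escape sequences inside string literals.
tab_separated = "name\tsequence"
two_lines = "GATTACA\nACGT"
quoted = "He said, \"Wow!\""
path = "C:\\data"      # hypothetical path, used only to show the backslash escape

print(len("\t"))       # 1: the two typed characters are one stored character
print(tab_separated)   # the two words separated by a tab
print(two_lines)       # printed on two separate lines
print(quoted)          # He said, "Wow!"
print(len(path))       # 7: C, :, \, d, a, t, a
```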
More string functionality
>>> len("GATTACA")
7
>>> "GAT" + "TACA"
'GATTACA'
>>> "A" * 10
'AAAAAAAAAA'
>>> "GAT" in "GATTACA"
True
>>> "AGT" in "GATTACA"
False
String methods
>>> dna = "ACGT"
>>> dna.find("T")
3
>>> "GATTACA".find("ATT")
1
>>> "GATTACA".count("T")
2
>>> "GATTACA".lower()
'gattaca'
>>> "gattaca".upper()
'GATTACA'
>>> "GATTACA".replace("G", "U")
'UATTACA'
>>> "GATTACA".replace("C", "U")
'GATTAUA'
>>> "GATTACA".replace("AT", "**")
'G**TACA'
>>> "GATTACA".startswith("G")
True
>>> "GATTACA".startswith("g")
False
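Two details of these methods are easy to miss: find reports a missing substring by returning -1 rather than raising an error, and every comparison is case-sensitive. A short sketch, using the same example sequence under a variable name of our own:

```python
# Illustrative sketch: find() returns -1 on failure, and matching is case-sensitive.
dna = "GATTACA"

print(dna.find("ATT"))    # 1, the position where the first match starts
print(dna.find("CCC"))    # -1, because "CCC" does not occur in dna

# Normalize the case first if upper- and lowercase should be treated alike.
print(dna.startswith("g"))           # False
print(dna.lower().startswith("g"))   # True
print(dna.endswith("ACA"))           # True
```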
Strings are immutable
>>> s = "GATTACA"
>>> s[3] = "C"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s = s[:3] + "C" + s[4:]
>>> s
'GATCACA'
>>> s = s.replace("G","U")
>>> s
'UATCACA'
>>> sequence = "ACGT"
>>> sequence.replace("A", "G")
'GCGT'
>>> print sequence
ACGT
>>> sequence = "ACGT"
>>> new_sequence = sequence.replace("A", "G")
>>> print new_sequence
GCGT
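Since a string cannot be changed in place, an "edit" always means building a new string from pieces of the old one. The sketch below wraps the slicing idiom in a small helper; with_char_replaced is a hypothetical name of ours, not a built-in string method.

```python
# Illustrative sketch: "changing" one character by building a new string.
def with_char_replaced(s, position, new_char):
    """Return a copy of s with the character at position swapped for new_char."""
    return s[:position] + new_char + s[position + 1:]

original = "GATTACA"
edited = with_char_replaced(original, 3, "C")

print(edited)     # GATCACA
print(original)   # GATTACA, the original string is untouched
```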
String summary
Basic string operations:
S = "AATTGG"                  # assignment; single quotes ' ' also work
s1 + s2                       # concatenate
s2 * 3                        # repeat string
s2[i]                         # get the character at position i
s2[x:y]                       # get the substring from position x up to, but not including, y
len(S)                        # get length of string
int(S)                        # convert a numeric string to an integer; float(S) gives a floating-point number
Methods:
S.upper()
S.lower()
S.count(substring)
S.replace(old, new)
S.find(substring)
S.startswith(substring), S.endswith(substring)
Printing:
print var1, var2, var3        # print multiple variables
print "text", var1, "text"    # print a combination of explicit text (strings) and variables
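The int and float conversions in the summary only succeed when the string holds the text of a number, which is the usual situation for command-line arguments. A small sketch, with made-up values standing in for what a script might read from sys.argv:

```python
# Illustrative sketch: converting numeric strings before doing arithmetic.
count_text = "42"        # hypothetical argument, e.g. a count
fraction_text = "0.25"   # hypothetical argument, e.g. a fraction

count = int(count_text)            # the integer 42
fraction = float(fraction_text)    # the floating-point number 0.25

print(count + 1)         # 43
print(fraction * 2)      # 0.5
print(count_text * 2)    # 4242, because * repeats a string
```

Calling int on a non-numeric string such as "AATTGG" raises a ValueError.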
Sample problem #1
> python dna2rna.py ACTCAGT
ACUCAGU
> python dna2rna.py actcagt
acucagu
> python dna2rna.py ACTCagt
ACUCagu
First get it working just for uppercase letters.
Two solutions
import sys
sequence = sys.argv[1]
new_sequence = sequence.replace("T", "U")
newer_sequence = new_sequence.replace("t", "u")
print newer_sequence
import sys
print sys.argv[1]
import sys
print sys.argv[1].replace("T", "U")
import sys
print sys.argv[1].replace("T", "U").replace("t", "u")
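The same logic can be wrapped in a function so that it can be tried without the command line. This is one possible arrangement, not the course's reference solution; the function name dna_to_rna is our own.

```python
# Illustrative sketch: DNA-to-RNA conversion as a reusable function.
def dna_to_rna(sequence):
    """Return sequence with every T replaced by U, preserving case."""
    return sequence.replace("T", "U").replace("t", "u")

print(dna_to_rna("ACTCAGT"))   # ACUCAGU
print(dna_to_rna("actcagt"))   # acucagu
print(dna_to_rna("ACTCagt"))   # ACUCagu
```

Chaining the two replace calls works because each call returns a new string, on which the next method is called.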
Sample problem #2
> python get-codons.py TTGCAGTCG
TTG
CAG
TCG
> python get-codons.py TTGCAGTCGATC
TTG
CAG
TCG
> python get-codons.py tcgatcgac
TCG
ATC
GAC
Solution #2
import sys
sequence = sys.argv[1]
upper_sequence = sequence.upper()
print upper_sequence[:3]
print upper_sequence[3:6]
print upper_sequence[6:9]
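For comparison, the three slices can be collected in a function and returned as a list. This sketch keeps the assumption made by the solution above, namely that only the first three codons are wanted; first_three_codons is a name we made up.

```python
# Illustrative sketch: the first three codons of a sequence, in uppercase.
def first_three_codons(sequence):
    """Return a list holding the first three codons of sequence, uppercased."""
    upper_sequence = sequence.upper()
    return [upper_sequence[:3], upper_sequence[3:6], upper_sequence[6:9]]

print(first_three_codons("TTGCAGTCG"))      # ['TTG', 'CAG', 'TCG']
print(first_three_codons("TTGCAGTCGATC"))   # ['TTG', 'CAG', 'TCG']
print(first_three_codons("tcgatcgac"))      # ['TCG', 'ATC', 'GAC']
```

Slices that run past the end of a string do not raise an error, so a sequence shorter than nine characters yields shortened or empty codons rather than an IndexError.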
Sample problem #3 (optional)
> python find-cysteine.py MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTIEEDWQRVCAYAREEFGSVDGL
70
> python find-cysteine.py MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTIEEDWQRVVAYAREEFGSVDGL
-1
Solution #3
import sys
protein = sys.argv[1]
upper_protein = protein.upper()
print upper_protein.find("C")
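The same search can be tried without the command line by wrapping it in a function. The sketch below reuses the two protein sequences from the sample runs; the function name first_cysteine and the variable names are our own.

```python
# Illustrative sketch: position of the first cysteine (C) in a protein sequence.
def first_cysteine(protein):
    """Return the 0-based position of the first C in protein, or -1 if there is none."""
    return protein.upper().find("C")

with_c = "MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTIEEDWQRVCAYAREEFGSVDGL"
without_c = "MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTIEEDWQRVVAYAREEFGSVDGL"

print(first_cysteine(with_c))      # 70
print(first_cysteine(without_c))   # -1
```

Because positions are counted from 0, a result of 70 means the cysteine is the 71st residue of the sequence.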
Reading