Strings

Genome 559: Introduction to Statistical and Computational Genomics

Prof. William Stafford Noble

Strings

  • A string is a sequence of characters (letters, digits, punctuation, spaces).
  • In Python, a string literal starts and ends with matching single or double quotes.

>>> "foo"
'foo'
>>> 'foo'
'foo'

Defining strings

  • Each string is stored in the computer's memory as an ordered sequence of characters.

>>> myString = "GATTACA"
>>> myString
'GATTACA'

Accessing single characters

  • You can access individual characters by using indices in square brackets.
  • Indices start at 0, so the last character of a 7-character string is at index 6.
  • Negative indices start at the end of the string and move left.

>>> myString = "GATTACA"
>>> myString[0]
'G'
>>> myString[1]
'A'
>>> myString[-1]
'A'
>>> myString[-2]
'C'
>>> myString[7]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
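The relationship between negative and non-negative indices can be made explicit with a small sketch. The helper name to_positive_index is hypothetical (it is not a Python built-in), and print is written with parentheses so that the example should run under both Python 2, as used in these slides, and Python 3.

```python
# Illustrative helper (not from the slides): translate a negative index
# into the equivalent non-negative one.
def to_positive_index(s, i):
    """Return the non-negative index that refers to the same character as i."""
    if i < 0:
        return len(s) + i
    return i

my_string = "GATTACA"
print(my_string[-1] == my_string[to_positive_index(my_string, -1)])  # True
print(to_positive_index(my_string, -2))  # 5
```

In other words, for a string of length 7, index -1 refers to the same character as index 6, and index -2 to the same character as index 5.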

Accessing substrings

>>> myString = "GATTACA"
>>> myString[1:3]
'AT'
>>> myString[:3]
'GAT'
>>> myString[4:]
'ACA'
>>> myString[3:5]
'TA'
>>> myString[:]
'GATTACA'
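The examples above follow one rule: a slice [start:end] includes the character at start and stops just before end. A minimal sketch of the consequences, with variable names (piece, cut) chosen only for illustration:

```python
my_string = "GATTACA"

# A slice [start:end] includes start and excludes end,
# so its length is end - start.
piece = my_string[1:3]
print(piece)        # AT
print(len(piece))   # 2

# Splitting at any position and rejoining gives back the original string.
cut = 4
print(my_string[:cut] + my_string[cut:] == my_string)  # True

# Unlike indexing, slicing past the end does not raise an error.
print(my_string[5:100])  # CA
```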

Special characters

  • The backslash is used to introduce a special character.

>>> "He said, "Wow!""
  File "<stdin>", line 1
    "He said, "Wow!""
                ^
SyntaxError: invalid syntax
>>> "He said, 'Wow!'"
"He said, 'Wow!'"
>>> "He said, \"Wow!\""
'He said, "Wow!"'

Escape sequence    Meaning
\\                 Backslash
\'                 Single quote
\"                 Double quote
\n                 Newline
\t                 Tab
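Each escape sequence is typed as two characters but stored as a single character. A short sketch that checks this with len(); the variable names are illustrative only:

```python
tabbed = "name\tlength"      # \t is a single tab character
two_lines = "GATTACA\nACGT"  # \n is a single newline character
quoted = "He said, \"Wow!\""

print(len("\t"))   # 1
print(len("\\"))   # 1
print(tabbed)      # the two words separated by a tab
print(two_lines)   # printed on two separate lines
print(quoted == 'He said, "Wow!"')  # True
```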

More string functionality

  • Length

>>> len("GATTACA")
7

  • Concatenation

>>> "GAT" + "TACA"
'GATTACA'

  • Repeat

>>> "A" * 10
'AAAAAAAAAA'

  • Substring test

>>> "GAT" in "GATTACA"
True
>>> "AGT" in "GATTACA"
False
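These operations can be combined in a single expression. As a small sketch, the following appends a run of A's to a sequence; the tail length of 10 is an arbitrary value chosen for illustration.

```python
sequence = "GATTACA"
tail_length = 10  # hypothetical value chosen for illustration

# Concatenation and repetition combined: append a run of A's.
with_tail = sequence + "A" * tail_length

print(with_tail)
print(len(with_tail))         # 17
print(sequence in with_tail)  # True
```

Repetition binds more tightly than concatenation, so "A" * tail_length is evaluated before the + is applied.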

String methods

  • In Python, a method is a function that is defined with respect to a particular object.
  • The syntax is <object>.<method>(<parameters>)

>>> dna = "ACGT"
>>> dna.find("T")
3

String methods

>>> "GATTACA".find("ATT")
1
>>> "GATTACA".count("T")
2
>>> "GATTACA".lower()
'gattaca'
>>> "gattaca".upper()
'GATTACA'
>>> "GATTACA".replace("G", "U")
'UATTACA'
>>> "GATTACA".replace("C", "U")
'GATTAUA'
>>> "GATTACA".replace("AT", "**")
'G**TACA'
>>> "GATTACA".startswith("G")
True
>>> "GATTACA".startswith("g")
False
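Two details of these methods are easy to miss: find() returns -1 when there is no match, and count() counts non-overlapping occurrences. The sketch below shows both, and then combines count() with len() to compute the fraction of G and C bases; the name gc_fraction is illustrative, and float() is used so that the division behaves the same under Python 2 and Python 3.

```python
dna = "GATTACA"

# find() returns the index of the first match, or -1 if there is none.
print(dna.find("A"))    # 1
print(dna.find("GG"))   # -1

# count() counts non-overlapping occurrences.
print("AAAA".count("AA"))  # 2

# Methods compose with len(): fraction of bases that are G or C.
gc_fraction = (dna.count("G") + dna.count("C")) / float(len(dna))
print(gc_fraction)
```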

Strings are immutable

  • A string cannot be modified in place; instead, create a new string.

>>> s = "GATTACA"
>>> s[3] = "C"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s = s[:3] + "C" + s[4:]
>>> s
'GATCACA'
>>> s = s.replace("G","U")
>>> s
'UATCACA'
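The slice-and-concatenate pattern above can be wrapped in a function. This is only a sketch: the name substitute is hypothetical, and it assumes position is a valid non-negative index.

```python
# Hypothetical helper, not part of the slides.
# Assumes 0 <= position < len(sequence).
def substitute(sequence, position, base):
    """Return a copy of sequence with the character at position replaced by base."""
    return sequence[:position] + base + sequence[position + 1:]

original = "GATTACA"
mutated = substitute(original, 3, "C")
print(mutated)   # GATCACA
print(original)  # GATTACA (unchanged)
```

The original string is never altered; the function builds and returns a new one.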

Strings are immutable

  • String methods do not modify the string; they return a new string.

>>> sequence = "ACGT"
>>> sequence.replace("A", "G")
'GCGT'
>>> print sequence
ACGT
>>> sequence = "ACGT"
>>> new_sequence = sequence.replace("A", "G")
>>> print new_sequence
GCGT

String summary

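Basic string operations:

S = "AATTGG"   # assignment; single quotes ' ' work too
s1 + s2        # concatenate
s2 * 3         # repeat string
s2[i]          # index the character at position i
s2[x:y]        # slice the substring from x up to (not including) y
len(S)         # get length of string
int(S)         # turn a numeric string such as "42" into an integer
float(S)       # turn a numeric string such as "0.5" into a floating point number

Methods:

S.upper()                # uppercase copy
S.lower()                # lowercase copy
S.count(substring)       # number of non-overlapping occurrences
S.replace(old, new)      # copy with every old replaced by new
S.find(substring)        # index of the first occurrence, or -1
S.startswith(substring)  # True if S begins with substring
S.endswith(substring)    # True if S ends with substring

Printing:

print var1, var2, var3      # print multiple variables
print "text", var1, "text"  # print a combination of explicit text (strings) and variables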
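The conversions int() and float() only work on strings that look like numbers. A brief sketch with made-up input values (the names count_text and ratio_text are illustrative); print is written with parentheses and a single argument so that it should behave the same under Python 2 and Python 3:

```python
count_text = "42"      # hypothetical input, e.g. read from the command line
ratio_text = "0.25"

count = int(count_text)      # string -> integer
ratio = float(ratio_text)    # string -> floating point number

print(count + 1)       # 43
print(ratio * 2)       # 0.5
print(count_text * 2)  # 4242 (string repetition, not arithmetic)

# int() raises ValueError for text that is not a number.
try:
    int("AATTGG")
except ValueError:
    print("not a number")
```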

Sample problem #1

  • Write a program called dna2rna.py that reads a DNA sequence from the first command line argument, and then prints it as an RNA sequence. Make sure it works for both uppercase and lowercase input.

> python dna2rna.py ACTCAGT
ACUCAGU
> python dna2rna.py actcagt
acucagu
> python dna2rna.py ACTCagt
ACUCagu

Hint: first get it working just for uppercase letters.

Two solutions

Solution 1, one step per line:

import sys

sequence = sys.argv[1]
new_sequence = sequence.replace("T", "U")
newer_sequence = new_sequence.replace("t", "u")
print newer_sequence

Solution 2, with chained method calls:

import sys

print sys.argv[1].replace("T", "U").replace("t", "u")

  • It is legal (but not always desirable) to chain together multiple methods on a single line.
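Chaining works because each call to replace() returns a new string, and the next method is then called on that result. A minimal sketch of the same conversion as a function; the name dna_to_rna is hypothetical, and print is written with parentheses so that it should run under both Python 2 and Python 3:

```python
def dna_to_rna(sequence):
    """Replace T with U and t with u, leaving all other characters as they are."""
    return sequence.replace("T", "U").replace("t", "u")

print(dna_to_rna("ACTCAGT"))  # ACUCAGU
print(dna_to_rna("actcagt"))  # acucagu
print(dna_to_rna("ACTCagt"))  # ACUCagu
```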

Sample problem #2

  • Write a program get-codons.py that reads the first command line argument as a DNA sequence and prints the first three codons, one per line, in uppercase letters.

> python get-codons.py TTGCAGTCG
TTG
CAG
TCG
> python get-codons.py TTGCAGTCGATC
TTG
CAG
TCG
> python get-codons.py tcgatcgac
TCG
ATC
GAC

Solution #2

import sys

sequence = sys.argv[1]
upper_sequence = sequence.upper()   # convert to uppercase once
print upper_sequence[:3]            # first codon
print upper_sequence[3:6]           # second codon
print upper_sequence[6:9]           # third codon
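The same three slices can be collected in a function, which also makes it easy to see what happens when the input is shorter than nine bases. This is a sketch only: the name first_three_codons is hypothetical, and the function returns the codons together instead of printing them one per line.

```python
def first_three_codons(sequence):
    """Return the first three codons of sequence in uppercase."""
    upper_sequence = sequence.upper()
    return upper_sequence[:3], upper_sequence[3:6], upper_sequence[6:9]

print(first_three_codons("tcgatcgac"))  # ('TCG', 'ATC', 'GAC')
print(first_three_codons("TTGCA"))      # ('TTG', 'CA', '')
```

Because slicing past the end of a string does not raise an error, a short input yields a partial or empty codon instead of an IndexError.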

Sample problem #3 (optional)

  • Write a program find-cysteine.py that reads a protein sequence as a command line argument and prints the location (0-based index) of the first cysteine residue, or -1 if there is none.

> python find-cysteine.py MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTIEEDWQRVCAYAREEFGSVDGL
70
> python find-cysteine.py MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTIEEDWQRVVAYAREEFGSVDGL
-1

Solution #3

import sys

protein = sys.argv[1]
upper_protein = protein.upper()
print upper_protein.find("C")
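The value printed by this solution is a 0-based index, with -1 meaning that no cysteine was found. The sketch below makes both cases explicit using short made-up peptides; the function name first_cysteine is hypothetical, and the conversion to a 1-based residue number is an optional extra that the problem does not ask for.

```python
def first_cysteine(protein):
    """Return the 0-based index of the first cysteine (C), or -1 if there is none."""
    return protein.upper().find("C")

# Short hypothetical peptides, used only for illustration.
print(first_cysteine("MKVCAY"))  # 3
print(first_cysteine("mkvcay"))  # 3
print(first_cysteine("MKVAAY"))  # -1

# Residues are often numbered from 1; add 1 only when a match was found.
index = first_cysteine("MKVCAY")
if index != -1:
    print(index + 1)  # 4
```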

Reading

  • Chapters 5 and 8 of Learning Python by Lutz.