Python – Strings and Regular Expressions
Python - Strings
var1 = 'Hello World!'
var2 = "Python Programming"
Strings
print("Hello")
print('Hello')
Output:
Hello
Hello
Assign String to a Variable
a = "Hello"
print(a)
Output: Hello
Multiline Strings
a = """Welcome to Kongu College,
Welcome civil engineers,
civil students are good students."""
print(a)
Output:
Welcome to Kongu College,
Welcome civil engineers,
civil students are good students.
String Concatenation
Example
a = "Hello"
b = "World"
c = a + b
print(c)
Output: HelloWorld
Example
a = "Hello"
b = "World"
c = a + " " + b
print(c)
Output: Hello World
Multiply on Strings
str2 = str1 * N
# Original string
a = "Geeks"
# Multiply the string and store
# it in a new string
b = a*3
# Display the strings
print(a)
print(b)
Output:
Geeks
GeeksGeeksGeeks
Initializing the original string
# Original string
a = "Geeks"
N=3
# Multiply the string and store
# it in a new string
b = a* N
# Display the strings
print(a)
print(b)
Output:
Geeks
GeeksGeeksGeeks
Copying a string multiple times in a list
a = ["str1"] * N
a will be a list that contains str1 N times.
# Initialize the list
a = ["Geeks"]
# No. of copies
N = 3
# Multiply the list and store
# the result in a new list
b = a * N
# Display the lists
print(a)
print(b)
Output:
['Geeks']
['Geeks', 'Geeks', 'Geeks']
Accessing Values in Strings:
var1 = 'Hello World!'
var2 = "Python Programming"
print("var1[0]: ", var1[0])
print("var2[1:5]: ", var2[1:5])
This will produce the following result:
var1[0]:  H
var2[1:5]:  ytho
Updating Strings:
var1 = 'Hello World!'
print("Updated String :- ", var1[:6] + 'Python')
This will produce the following result:
Updated String :-  Hello Python
1. s = "Hello"
print(s * 3)
Output:
HelloHelloHello
2. str1 = "Hello"
var = 7
str2 = str1 + var
print(str2)
Output: Error
TypeError: can only concatenate str (not "int") to str
3. str1 = "Hello"
var = 7
str2 = str1 + str(var)
print(str2)
Output:
Hello7
Escape Characters:
Backslash notation | Hexadecimal character | Description |
\a | 0x07 | Bell or alert |
\b | 0x08 | Backspace |
\f | 0x0c | Formfeed |
\n | 0x0a | Newline |
\nnn | | Octal notation, where n is in the range 0-7 |
\r | 0x0d | Carriage return |
\t | 0x09 | Tab |
\v | 0x0b | Vertical tab |
\xnn | | Hexadecimal notation, where n is in the range 0-9, a-f, or A-F |
Note: \cx, \C-x and \M-\C-x (control and meta-control characters), \e (escape, 0x1b), \s (space, 0x20) and \x on its own are often listed in this table, but Python string literals do not recognize them. Use \x1b for escape and an ordinary space character for space.
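A minimal sketch of a few escape sequences in ordinary (non-raw) string literals. The variable names are chosen only for illustration.

```python
# Each escape sequence below becomes a single character in the string.
tabbed = "Name\tDept"          # \t inserts a tab character
two_lines = "kongu\ncollege"   # \n starts a new line
letter_hex = "\x41"            # hexadecimal 41 is the letter A
letter_oct = "\101"            # octal 101 is also the letter A

print(tabbed)
print(two_lines)
print(letter_hex, letter_oct)  # A A
print(len(two_lines))          # 13, because \n counts as one character
```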
String Special Operators: Assume string variable a holds 'Hello' and variable b holds 'Python' then:
Operator | Description | Example |
+ | Concatenation - Adds values on either side of the operator | a + b will give HelloPython |
* | Repetition - Creates new strings, concatenating multiple copies of the same string | a*2 will give HelloHello |
[] | Slice - Gives the character from the given index | a[1] will give e |
[ : ] | Range Slice - Gives the characters from the given range | a[1:4] will give ell |
in | Membership - Returns True if a character exists in the given string | 'H' in a will give True |
not in | Membership - Returns True if a character does not exist in the given string | 'M' not in a will give True |
r/R | Raw String - Suppresses the actual meaning of escape characters. | print(r'\n') prints \n and print(R'\n') prints \n |
% | Format - Performs string formatting | See the next section |
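The operators in the table can be tried out directly. This short sketch uses the same values for a and b as the table; the result names are illustrative.

```python
a = "Hello"
b = "Python"

joined = a + b            # concatenation
repeated = a * 2          # repetition
one_char = a[1]           # single index
piece = a[1:4]            # range slice
has_h = "H" in a          # membership test
lacks_m = "M" not in a    # negated membership test
raw = r"\n"               # raw string: backslash and n stay as two characters

print(joined)     # HelloPython
print(repeated)   # HelloHello
print(one_char)   # e
print(piece)      # ell
print(has_h)      # True
print(lacks_m)    # True
print(raw, len(raw))  # \n 2
```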
String Formatting Operator:
Format Symbol | Conversion |
%c | character |
%s | string conversion via str() prior to formatting |
%i | signed decimal integer |
%d | signed decimal integer |
%u | unsigned decimal integer |
%o | octal integer |
%x | hexadecimal integer (lowercase letters) |
%X | hexadecimal integer (UPPERcase letters) |
%e | exponential notation (with lowercase 'e') |
%E | exponential notation (with UPPERcase 'E') |
%f | floating point real number |
%g | the shorter of %f and %e |
%G | the shorter of %f and %E |
Formatting Strings
Other supported symbols and functionality are listed in the following table:
Symbol | Functionality |
* | argument specifies width or precision |
- | left justification |
+ | display the sign |
<sp> | leave a blank space before a positive number |
# | add the octal prefix ('0o') or the hexadecimal prefix '0x' or '0X', depending on whether 'x' or 'X' was used. |
0 | pad from left with zeros (instead of spaces) |
% | '%%' leaves you with a single literal '%' |
(var) | mapping variable (dictionary arguments) |
m.n | m is the minimum total width and n is the number of digits to display after the decimal point (if applicable) |
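A short sketch of the % operator using several of the symbols and flags listed above. The values and variable names are made up for illustration.

```python
name = "Kongu"
count = 7
price = 12.5

sentence = "%s has %d departments" % (name, count)
widths = "%5d|%-5d|%05d" % (count, count, count)   # right-justified, left-justified, zero-padded
money = "%8.2f" % price                            # width 8, 2 digits after the point
bases = "%x %X %o" % (255, 255, 8)                 # hexadecimal and octal
mapped = "%(lang)s %(ver)d" % {"lang": "Python", "ver": 3}  # dictionary arguments
percent = "%d%%" % 50                              # %% gives a literal %

print(sentence)  # Kongu has 7 departments
print(widths)    #     7|7    |00007
print(money)     #    12.50
print(bases)     # ff FF 10
print(mapped)    # Python 3
print(percent)   # 50%
```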
Triple Quotes:
para_str = """this is a long string that is made up of several lines and non-printable characters such as TAB ( \t ) and they will show up that way when displayed. NEWLINEs within the string, whether explicitly given like this within the brackets [ \n ], or just a NEWLINE within the variable assignment will also show up. """
print(para_str)
Raw String:
print('C:\\nowhere')
This would print the following result:
C:\nowhere
Now let's make use of a raw string. We put the expression in r'expression' as follows:
print(r'C:\\nowhere')
This would print the following result:
C:\\nowhere
Unicode String:
print(u'Hello, world!')
This would print the following result (in Python 3 every string is Unicode, so the u prefix is optional):
Hello, world!
Built-in String Methods:
No. | Method | Description |
1 | capitalize() | Capitalizes first letter of string |
2 | center(width, fillchar) | Returns a space-padded string with the original string centered to a total of width columns |
3 | count(str, beg, end) | Counts how many times str occurs in string, or in a substring of string if starting index beg and ending index end are given |
4 | decode(encoding, errors) | Decodes the string using the codec registered for encoding. In Python 3 this is a method of bytes objects, not of str |
5 | encode(encoding, errors) | Returns encoded version of string; on error, default is to raise a ValueError unless errors is given with 'ignore' or 'replace' |
6 | endswith(suffix, beg, end) | Determines if string or a substring of string (if starting index beg and ending index end are given) ends with suffix; returns true if so, and false otherwise |
7 | startswith(str, beg, end) | Determines if string or a substring of string (if starting index beg and ending index end are given) starts with substring str; returns true if so, and false otherwise |
8 | expandtabs(tabsize) | Expands tabs in string to multiple spaces; defaults to 8 spaces per tab if tabsize not provided |
9 | find(str, beg, end) | Determines if str occurs in string, or in a substring of string if starting index beg and ending index end are given; returns index if found and -1 otherwise |
10 | rfind(str, beg, end) | Same as find(), but searches backwards in string |
11 | index(str, beg, end) | Same as find(), but raises an exception if str not found |
12 | rindex(str, beg, end) | Same as index(), but searches backwards in string |
13 | isalnum() | Returns true if string has at least 1 character and all characters are alphanumeric and false otherwise |
14 | isalpha() | Returns true if string has at least 1 character and all characters are alphabetic and false otherwise |
15 | isdigit() | Returns true if string contains only digits and false otherwise |
16 | islower() | Returns true if string has at least 1 cased character and all cased characters are in lowercase and false otherwise |
17 | isupper() | Returns true if string has at least one cased character and all cased characters are in uppercase and false otherwise |
18 | isnumeric() | Returns true if string contains only numeric characters and false otherwise |
19 | isspace() | Returns true if string contains only whitespace characters and false otherwise |
20 | istitle() | Returns true if string is properly "titlecased" and false otherwise |
21 | join(seq) | Merges (concatenates) the strings in sequence seq into one string, with the original string as separator |
22 | len(string) | Returns the length of the string (built-in function) |
23 | ljust(width, fillchar) | Returns a space-padded string with the original string left-justified to a total of width columns |
24 | rjust(width, fillchar) | Returns a space-padded string with the original string right-justified to a total of width columns |
25 | lower() | Converts all uppercase letters in string to lowercase |
26 | upper() | Converts lowercase letters in string to uppercase |
27 | lstrip() | Removes all leading whitespace in string |
28 | rstrip() | Removes all trailing whitespace of string |
29 | strip(chars) | Performs both lstrip() and rstrip() on string |
30 | maketrans(intab, outtab) | Returns a translation table to be used in translate function |
31 | translate(table) | Translates string according to translation table |
32 | max(str) | Returns the max alphabetical character from the string str (built-in function) |
33 | min(str) | Returns the min alphabetical character from the string str (built-in function) |
34 | replace(old, new, max) | Replaces all occurrences of old in string with new, or at most max occurrences if max given |
35 | split(str, num) | Splits string according to delimiter str (whitespace if not provided) and returns list of substrings; performs at most num splits if given |
36 | splitlines() | Splits string at all NEWLINEs and returns a list of each line with NEWLINEs removed |
37 | swapcase() | Inverts case for all letters in string |
38 | title() | Returns "titlecased" version of string, that is, all words begin with uppercase, and the rest are lowercase |
39 | zfill(width) | Returns original string left-padded with zeros to a total of width characters; intended for numbers, zfill() retains any sign given (less one zero) |
40 | isdecimal() | Returns true if string contains only decimal characters and false otherwise |
Built-in string methods
# capitalize()
str="kongu engineering college "# first letter is converted into capital
str1=str.capitalize()
print(str, " ", str1)
Output: kongu engineering college     Kongu engineering college
# title()
str2=str.title()# first letter of all words are changed to capital letter
print(str," ", str2)
Output: kongu engineering college     Kongu Engineering College
# center()
str="kongu"
str1=str.center(11)
print(str)
print(str1)
str1=str.center(11,"*")
print(str)
print(str1)
Output : kongu
kongu
kongu
***kongu***
# count() -- Counts how many times str occurs in string
str="kongu cse eee ece eee"
sub_str="eee"
cnt=str.count(sub_str)
#(or) #cnt=str.count("eee")
print(cnt)
cnt=str.count("civil")
print(cnt)
cnt=str.count("eee",5,15) # 5 starting index and 15 ending index
print(cnt)
cnt=str.count("eee",15)
print(cnt)
Output: 2 0 1 1
# endswith() and startswith()
str="kongu engineering college"
print(str.endswith("ege"))
print(str.endswith("ege",2,12))
print(str.endswith("ing"))
print(str.startswith("kon",2,15))
Output:
True
False
False
False
# expandtabs()
str=" kongu\tengineering\tcollege"
print(str.expandtabs())
str=" kongu\t\tengineering\t\tcollege"
print(str.expandtabs())
Output:
 kongu  engineering     college
 kongu          engineering             college
# find()
str="kongu engineering college"
ind=str.find("engineering")
print(ind)
ind1=str.find("engineering",4,10)
print(ind1)
ind1=str.find("engineering",4,20)
print(ind1)
Output: 6,-1,6
#rfind()-- Same as find(), but search backwards in string
str=" kongu engineering college"
ind=str.rfind("engineering")
print(ind)
ind=str.rfind("college")
print(ind)
ind=str.rfind("college",15)
print(ind)
Output: 7 19 19
# index()
str="kongu engineering college"
ind=str.index("engineering")
print(ind)
ind=str.index("engineering",0,25)
print(ind)
Output:
6
6
# rindex() --> searches backwards from the end
ind=str.rindex("engineering",0,25)
print(ind)
ind=str.rindex("engineering",3,25)
print(ind)
Output:
6
6
# isalnum() --> alphanumeric
str1="kongu"
print(str1.isalnum())
str1="kongu123"
print(str1.isalnum())
str2="***$$$"
print(str2.isalnum())
Output: True True False
# isalpha()
str1="kongu123"
print(str1.isalpha())
str1="kongu"
print(str1.isalpha())
str1="123"
print(str1.isalpha())
Output : False True False
# isdigit()
str1="123"
print(str1.isdigit())
str1="kec123"
print(str1.isdigit())
Output : True
False
# islower() and isupper()
islower() - Returns true if string has at least 1 cased character and all cased characters are in lowercase and false otherwise
isupper() - Returns true if string has at least one cased character and all cased characters are in uppercase and false otherwise
str="kec"
print(str.islower())
str="Kec"
print(str.islower())
str1="KEC"
print(str1.isupper())
str1="KEc"
print(str1.isupper())
Output : True False True False
# isnumeric() and isspace()
isnumeric() - Returns true if a unicode string contains only numeric characters and false otherwise
isspace() - Returns true if string contains only whitespace characters and false otherwise
#isnumeric()
str="123"
print(str.isnumeric())
str="kec123"
print(str.isnumeric())
# isspace()
str=" kongu college"
print(str.isspace())
str=" "
print(str.isspace())
Output : True False False True
# istitle()
Returns true if string is properly "titlecased" and false otherwise
str="Kongu Engineering College"
print(str.istitle())
str="Kongu engineering College"
print(str.istitle())
Output : True False
# join()
join(seq) - Merges (concatenates) the strings in sequence seq into one string, with the original string as separator
# The join() method takes all items in an iterable (list, tuple, string) and joins them into one string.
l1=["1","2","3"]
str1="kec"
new_str=str1.join(l1)
print(new_str)
l2=["kongu","kec"]
str2="college"
new_str1=str2.join(l2)
print(new_str1)
Output: 1kec2kec3
kongucollegekec
myTuple = ("John", "Peter", "Vicky")
x = "#".join(myTuple)
# (or)
str="#"
x=str.join(myTuple)
print(x)
Output : John#Peter#Vicky
# len()
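len() is a built-in function rather than a string method; it returns the number of characters in the string. A minimal sketch, with variable names chosen for illustration:

```python
short_name = "kongu"
full_name = "kongu engineering college"

print(len(short_name))  # 5
print(len(full_name))   # 25, spaces are counted too
print(len(""))          # 0 for the empty string
```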
# ljust() and rjust()
ljust(width, [fillchar]) - Returns a space-padded string with the original string left-justified to a total of width columns
rjust(width, [fillchar]) - Returns a space-padded string with the original string right-justified to a total of width columns.
str="kongu"
print(str.ljust(11)," welcome")
print(str.ljust(10,'_'),"hello")
str="kongu"
print(str.rjust(50)," welcome")
print(str.rjust(50,'_')," hello")
Output :
kongu        welcome
kongu_____ hello
                                             kongu  welcome
_____________________________________________kongu  hello
lower() - Converts all uppercase letters in string to lowercase
#lower()
str="kec"
lower_str=str.lower()
print(str, " ", lower_str)
str="Kec"
lower_str=str.lower()
print(str, " ", lower_str)
str="KEC"
lower_str=str.lower()
print(str, " ", lower_str)
Output : kec kec Kec kec KEC kec
# upper()
upper() - Converts lowercase letters in string to uppercase
str="kec"
upper_str=str.upper()
print(str, " ", upper_str)
str="Kec"
upper_str=str.upper()
print(str, " ", upper_str)
str="KEC"
upper_str=str.upper()
print(str, " ", upper_str)
Output : kec KEC Kec KEC KEC KEC
# lstrip()
lstrip() - Removes all leading whitespace in string
str=" kongu"
print(str,"welcome")
print(str.lstrip(),"welcome")
Output :
kongu welcome
kongu welcome
# rstrip()
rstrip() - Removes all trailing whitespace of string
str="kongu "
print(str," ", " welcome")
print(str.rstrip(),"welcome")
Output:
kongu welcome
kongu welcome
#strip()
strip([chars]) - Performs both lstrip() and rstrip() on string
str=" kongu "
# strip() removes the whitespace on both the left and the right
print("hello",str,"welcome")
print("hello",str.strip(),"welcome")
# equivalent: chain rstrip() and lstrip()
print("hello",str.rstrip().lstrip(),"welcome")
Output: hello  kongu  welcome
hello kongu welcome
hello kongu welcome
maketrans(intab, outtab) - Returns a translation table that maps each character of intab to the character at the same position in outtab
translate(table) - Translates string according to the translation table
# maketrans() and translate()
intab = "aeiou"
outtab = "12345"
str = "this is string example....wow!!!"
trantab = str.maketrans(intab, outtab)
print(trantab)
print (str.translate(trantab))
Output: {97: 49, 101: 50, 105: 51, 111: 52, 117: 53}
th3s 3s str3ng 2x1mpl2....w4w!!!
# max() and min()
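max() and min() are built-in functions that compare characters by their code points, so every uppercase letter sorts before every lowercase letter. A minimal sketch with illustrative values:

```python
word = "kongu"
mixed = "Kec"

print(max(word))   # u
print(min(word))   # g
print(max(mixed))  # e
print(min(mixed))  # K, because uppercase letters have smaller code points
```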
replace(old, new, [max]) - Replaces all occurrences of old in string with new, or at most max occurrences if max given
#replace()
str=" kec eee cse kec cse kec"
str1=str.replace("kec", "kongu")
print(str1)
str1=str.replace("kec", "kongu",1)
print(str1)
str1=str.replace("kec", "kongu",2)
print(str1)
Output: kongu eee cse kongu cse kongu
kongu eee cse kec cse kec
kongu eee cse kongu cse kec
split(str, num) - Splits string according to delimiter str (whitespace if not provided) and returns list of substrings; performs at most num splits if num is given
#split and splitlines
# split() will create a list of substrings
str="kec eee cse kec cse kec"
sub_strings=str.split()# delimiter is space
print(sub_strings)
print(type(sub_strings))
split_str="kec"
sub_strings=str.split(split_str)# here the delimiter is kec
print(sub_strings)
str="keceeecsekeccsekec"
sub_strings=str.split(split_str)# here the delimiter is kec
print(sub_strings)
sub_strings=str.split(split_str,2)
print(sub_strings)
sub_strings=str.split(split_str,str.count(split_str))#sub_strings=str.split(split_str,3))
print(sub_strings)
Output:
['kec', 'eee', 'cse', 'kec', 'cse', 'kec']
<class 'list'>
['', ' eee cse ', ' cse ', '']
['', 'eeecse', 'cse', '']
['', 'eeecse', 'csekec']
['', 'eeecse', 'cse', '']
# splitlines()
lines='''abc
def
ghi
jkl
mno'''
print(lines)
lines_1=lines.splitlines()
print(lines_1)
# splitlines() output
abc
def
ghi
jkl
mno
['abc', 'def', 'ghi', 'jkl', 'mno']
# swapcase()
swapcase() - Inverts case for all letters in string.
str="KEC college"
print(str.swapcase())
str="Kec coLLege"
print(str.swapcase())
Output: kec COLLEGE
kEC COllEGE
# zfill() -> zero fill
str="kongu"
print(str.zfill(10))
str="123"
print(str.zfill(10))
Output:
00000kongu
0000000123
Negative index
# negative index
str="kongu"
print(str[-1])
print(str[-2])
print(str[0:2])
print(str[0:4])
print(str[0:4:1])
print(str[0:4:2])
print(str[0:-2])# from index 0 up to, but not including, index -2
print(str[::-1])
print(str[-3::-1])
print("string",str[-4:-1:-1])# empty slice: the step is negative but start lies left of stop
# negative index Output
u
g
ko
kong
kong
kn
kon
ugnok
nok
string
Stride during slicing
Reverse skipping 3rd char
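The third value in a slice is the stride (step). The sketch below assumes that "reverse skipping 3rd char" means walking backwards and keeping every third character; the sample string is made up.

```python
letters = "abcdefghi"

every_second = letters[::2]      # step 2: indexes 0, 2, 4, 6, 8
reversed_all = letters[::-1]     # step -1: whole string backwards
reversed_third = letters[::-3]   # step -3: indexes 8, 5, 2

print(every_second)    # acegi
print(reversed_all)    # ihgfedcba
print(reversed_third)  # ifc
```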
In and not in
Ord() and chr()
# ord() (ordinal) and chr()
# ord() returns the Unicode code point of a character (0 to 127 for ASCII); chr() is the inverse
print(ord('a'))
print(ord('b'))
print(ord('A'))
print(ord('B'))
print(chr(97))
Output: 97
98
65
66
a
# in and not in
str="kongu engg college"
if "kec" in str:
    print("present")
else:
    print("not present")
if "kong" in str:
    print("present")
else:
    print("not present")
if "k" in str:
    print("present")
else:
    print("not present")
# not in
if "k" not in str:
    print("not present")
else:
    print("present")
# in and not in output
not present
present
present
present
iteration
Validate PAN number
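One possible way to validate a PAN with the string methods covered so far. This sketch assumes the usual layout of five uppercase letters, four digits and one uppercase letter; is_valid_pan is a hypothetical helper name and the sample numbers are invented.

```python
def is_valid_pan(pan):
    """Return True if pan has the assumed layout: 5 letters, 4 digits, 1 letter."""
    if len(pan) != 10 or not pan.isascii():
        return False
    letters = pan[:5] + pan[9]   # positions 0-4 and 9 must be uppercase letters
    digits = pan[5:9]            # positions 5-8 must be digits
    return letters.isalpha() and letters.isupper() and digits.isdigit()

print(is_valid_pan("ABCDE1234F"))  # True
print(is_valid_pan("ABCD1234EF"))  # False, a letter sits where a digit is expected
print(is_valid_pan("abcde1234f"))  # False, lowercase letters
print(is_valid_pan("ABCDE1234"))   # False, too short
```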
Pattern
Help() in python
The string module
String module capwords()
import string
print(string.capwords("kec"))
Output is: Kec
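Besides capwords(), the string module also provides ready-made character constants. A brief sketch; the sample phrase is illustrative.

```python
import string

lower = string.ascii_lowercase   # all lowercase ASCII letters
digits = string.digits           # the ten decimal digits
title = string.capwords("kongu engineering college")  # capitalizes each word

print(lower)   # abcdefghijklmnopqrstuvwxyz
print(digits)  # 0123456789
print(title)   # Kongu Engineering College
```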
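Regular Expression - Special sequence of characters that helps to match or find strings in another string
match() - returns a match object only if the pattern matches at the beginning of the string, otherwise None
search() - scans the whole string and returns a match object for the first match, otherwise None
sub() - replaces the matches with a replacement string
findall() - returns all matches as a list
finditer() - returns an iterator of match objects. Used to print the index of each match in the given string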
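The five functions can be compared on one sample string. A minimal sketch; the text and variable names are made up for illustration.

```python
import re

text = "kec cse 2023, kec eee 2024"

at_start = re.match("kec", text)       # matches: "kec" is at the beginning
not_at_start = re.match("cse", text)   # None: "cse" is not at the beginning
anywhere = re.search("cse", text)      # search() finds it at index 4
replaced = re.sub("kec", "kongu", text)
numbers = re.findall(r"\d+", text)
positions = [(m.start(), m.end()) for m in re.finditer("kec", text)]

print(at_start is not None)  # True
print(not_at_start)          # None
print(anywhere.start())      # 4
print(replaced)              # kongu cse 2023, kongu eee 2024
print(numbers)               # ['2023', '2024']
print(positions)             # [(0, 3), (14, 17)]
```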
Flag options
Meta characters in RE
Check if string has at least one vowel
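A possible sketch of the vowel check, using a character class together with the re.IGNORECASE flag so that uppercase vowels also count. has_vowel is a hypothetical helper name.

```python
import re

def has_vowel(text):
    """Return True if text contains at least one of a, e, i, o, u (any case)."""
    return re.search(r"[aeiou]", text, re.IGNORECASE) is not None

print(has_vowel("kongu"))   # True
print(has_vowel("KEC"))     # True, thanks to re.IGNORECASE
print(has_vowel("rhythm"))  # False
```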
Use of metacharacter * and +
Groups
Named capturing groups have the format (?P<name>…), where name is the name of the group.
Non-capturing groups have the format (?:…) and are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.
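A small sketch that combines both kinds of group. The group names user and tld are chosen for illustration; the non-capturing group in the middle does not appear in groups().

```python
import re

pattern = r"(?P<user>\w+)@(?:\w+\.)+(?P<tld>com|org|net|edu)"
m = re.match(pattern, "finin@cs.umbc.edu")

print(m.group("user"))  # finin
print(m.group("tld"))   # edu
print(m.groups())       # ('finin', 'edu'), the (?:...) part is not numbered
```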
Application of Regular Expression to extract email
Python Additional: Regular Expressions
Python’s Regular Expression Syntax
The regular expression "test" matches the string 'test', and only that string
"[abc]" matches 'a', 'b', or 'c'
"[^abc]" matches any single character except 'a', 'b', or 'c'
"(abc)+" matches 'abc', 'abcabc', 'abcabcabc', etc.
"this|that" matches 'this' and 'that', but not 'thisthat'.
"a*" matches '', 'a', 'aa', etc.
"a+" matches 'a', 'aa', 'aaa', etc.
"a?" matches '' or 'a'
"a{2,3}" matches 'aa' or 'aaa'
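These rules can be checked with re.fullmatch(), which succeeds only when the whole string matches the pattern. matches is a hypothetical helper name used for this sketch.

```python
import re

def matches(pattern, text):
    """Return True if the entire text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches("[abc]", "b"))             # True
print(matches("[^abc]", "d"))            # True
print(matches("(abc)+", "abcabc"))       # True
print(matches("this|that", "thisthat"))  # False
print(matches("a*", ""))                 # True, zero repetitions are allowed
print(matches("a+", ""))                 # False, at least one 'a' is needed
print(matches("a?", "aa"))               # False, at most one 'a'
print(matches("a{2,3}", "aaa"))          # True
print(matches("a{2,3}", "aaaa"))         # False
```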
Regular Expression Syntax
Search and Match
>>> import re
>>> pat = "a*b"
>>> re.search(pat,"fooaaabcde")
<re.Match object; span=(3, 7), match='aaab'>
>>> re.match(pat,"fooaaabcde")
>>>
Q: What’s a match object?
>>> r1 = re.search("a*b","fooaaabcde")
>>> r1.group() # group returns string matched
'aaab'
>>> r1.start() # index of the match start
3
>>> r1.end() # index of the match end
7
>>> r1.span() # tuple of (start, end)
(3, 7)
What got matched?
\w+@(\w+\.)+(com|org|net|edu)
>>> pat1 = r"\w+@(\w+\.)+(com|org|net|edu)"
>>> r1 = re.match(pat1,"finin@cs.umbc.edu")
>>> r1.group()
'finin@cs.umbc.edu'
What got matched?
>>> pat2 = r"(\w+)@((\w+\.)+(com|org|net|edu))"
>>> r2 = re.match(pat2,"finin@cs.umbc.edu")
>>> r2.group(1)
'finin'
>>> r2.group(2)
'cs.umbc.edu'
>>> r2.groups()
('finin', 'cs.umbc.edu', 'umbc.', 'edu')
What got matched?
>>> pat3 = r"(?P<name>\w+)@(?P<host>(\w+\.)+(com|org|net|edu))"
>>> r3 = re.match(pat3,"finin@cs.umbc.edu")
>>> r3.group('name')
'finin'
>>> r3.group('host')
'cs.umbc.edu'
More re functions
>>> re.split(r"\W+", "This... is a test, short and sweet, of split().")
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> re.sub('(blue|white|red)', 'black', 'blue socks and red shoes')
'black socks and black shoes'
>>> re.findall(r"\d+","12 dogs,11 cats, 1 egg")
['12', '11', '1']
Compiling regular expressions
>>> cpat3 = re.compile(pat3)
>>> cpat3
re.compile('(?P<name>\\w+)@(?P<host>(\\w+\\.)+(com|org|net|edu))')
>>> r3 = cpat3.search("finin@cs.umbc.edu")
>>> r3
<re.Match object; span=(0, 17), match='finin@cs.umbc.edu'>
>>> r3.group()
'finin@cs.umbc.edu'
Pattern object methods
Pattern objects have methods that parallel the re functions (e.g., match, search, split, findall, sub), e.g.:
>>> p1 = re.compile(r"\w+@\w+\.(?:com|org|net|edu)")   # email address
>>> p1.match("steve@apple.com").group(0)
'steve@apple.com'
>>> p1.search("Email steve@apple.com today.").group(0)
'steve@apple.com'
>>> p1.findall("Email steve@apple.com and bill@msft.com now.")
['steve@apple.com', 'bill@msft.com']
>>> p2 = re.compile(r"[.?!]+\s+")   # sentence boundary
>>> p2.split("Tired? Go to bed! Now!! ")
['Tired', 'Go to bed', 'Now', '']
Example: pig latin
The pattern
([bcdfghjklmnpqrstvwxyz]+)(\w+)
piglatin.py
import re
pat = r'([bcdfghjklmnpqrstvwxyz]+)(\w+)'
cpat = re.compile(pat)
def piglatin(string):
    return " ".join( [piglatin1(w) for w in string.split()] )
def piglatin1(word):
    """Returns the pig latin form of a word. e.g.: piglatin1("dog") => "ogday". """
    match = cpat.match(word)
    if match:
        consonants = match.group(1)
        rest = match.group(2)
        return rest + consonants + "ay"
    else:
        return word + "zay"
print(piglatin("dog"))
void add();
void main()
{
-----
add();
}
void add()
{
int a=4,b=9;
printf("%d", a+b);
}
Date & Time
TimeTuple
Current time
Getting formatted time
Getting calendar for a month
calendar.isleap()
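The topics listed above can be sketched with the standard time and calendar modules. The year and month used here are arbitrary examples.

```python
import calendar
import time

now = time.localtime()   # struct_time, the "time tuple"
stamp = time.strftime("%Y-%m-%d %H:%M:%S", now)   # formatted time
month_text = calendar.month(2024, 2)              # calendar for one month as text

print(now.tm_year, now.tm_mon, now.tm_mday)   # current year, month, day
print(stamp)
print(month_text)
print(calendar.isleap(2024))   # True
print(calendar.isleap(2023))   # False
```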
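Time - clock() Method
import time
# time.clock() was removed in Python 3.8; time.perf_counter() is used here to measure elapsed time
print(time.perf_counter())
time.sleep(20.5)
print(time.perf_counter())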