%%html
<script src="https://bits.csb.pitt.edu/preamble.js"></script>
An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.
Anti-Patterns:
frame = 0
while frame < length:
    if values[frame] < float(sys.argv[3]):
        pass  # do something
    frame += 1
Pythonic Pattern:
cutoff = float(sys.argv[3])
for value in values:
    if value < cutoff:
        pass  # do something
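When the loop body needs the position as well as the value, `enumerate` avoids going back to a manual counter. A minimal sketch, where `values` and `cutoff` are made-up stand-ins for the data and the command-line argument used above:

```python
values = [0.5, 2.0, 1.5, 0.1]   # hypothetical data
cutoff = 1.0                    # stands in for float(sys.argv[3])

# enumerate yields (index, value) pairs, so no manual counter is needed
below = [frame for frame, value in enumerate(values) if value < cutoff]
print(below)   # [0, 3]
```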
Antipattern: Not using numpy broadcasting
cnt = 0
for i in range(len(array)):
    if array[i] < cutoff:
        cnt += 1
cnt
3
Pythonic Pattern:
np.count_nonzero(array < cutoff)
3
Antipattern: Expanding generators
for i in list(range(3)):
    pass
Pythonic Pattern:
for i in range(3):
    pass
Efficiency: Use sets for membership testing... but don't keep converting from a list.
Bad
L = [1,2,3]
for i in range(10):
    if i in set(L):
        pass
Good
L = set([1,2,3])
for i in range(10):
    if i in L:
        pass
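The reason the set helps is that `in` on a set is a hash lookup (constant time on average), while `in` on a list scans the elements one by one. A small sketch of the "build once, test many times" pattern; the name `allowed` is only illustrative:

```python
allowed = {1, 2, 3}             # build the set once, outside the loop

# every membership test reuses the same set instead of rebuilding it
hits = [i for i in range(10) if i in allowed]
print(hits)   # [1, 2, 3]
```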
re
A regular expression is a way to match text patterns.
It is specified with a string and is compiled to a regular expression object that can be used for searching and other pattern-using operations.
Patterns can get pretty complicated and are not limited to exact string matches (but for now we'll stick with exact string matching since it's easy to understand).
import re
regex = re.compile('abc')
regex
re.compile(r'abc', re.UNICODE)
match and search apply the regex to the passed string.
regex = re.compile('abc')
regex.search('xyzabc')
<re.Match object; span=(3, 6), match='abc'>
match must match starting at the beginning of the string.
print(regex.match('xyzabc')) # match only succeeds at the beginning of the string
None
Advice: use search and pretend match doesn't exist to avoid confusion.
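To make the difference concrete, here is a short side-by-side comparison. It also uses `fullmatch`, which is not covered in these notes but is part of the standard `re` module and requires the entire string to match:

```python
import re

regex = re.compile('abc')

print(regex.search('xyzabc'))      # finds 'abc' anywhere in the string
print(regex.match('xyzabc'))       # None: 'abc' is not at position 0
print(regex.match('abcxyz'))       # matches at position 0
print(regex.fullmatch('abcxyz'))   # None: the whole string must match
print(regex.fullmatch('abc'))      # the whole string is 'abc'
```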
In addition to searching for a particular pattern, a regular expression can be used to extract parts of the pattern using groups.
A group is defined using parentheses.
regex = re.compile('(abc)def')
match = regex.search('xyzabcdef')
The returned Match object can be used to extract all the contents of the groups.
match.groups()
('abc',)
match.group(1)
'abc'
match.group(0) #group zero is always the whole match
'abcdef'
regex = re.compile('(a(b(c)))def')
match = regex.search('xyzabcdefg')
%%html
<div id="regroups" style="width: 500px"></div>
<script>
var divid = '#regroups';
jQuery(divid).asker({
id: divid,
question: "How many groups are in match?",
answers: ['0','1','2','3','4'],
server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();
</script>
match.groups()
('abc', 'bc', 'c')
match.group(0)
'abcdef'
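Groups are numbered by the position of their opening parenthesis, counting from the left. A small sketch that prints every group of the nested example together with its span in the searched string:

```python
import re

match = re.search('(a(b(c)))def', 'xyzabcdefg')

# group 0 is the whole match; groups 1..3 follow the opening parentheses
for i in range(len(match.groups()) + 1):
    print(i, match.group(i), match.span(i))
```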
You can compile your regular expression into a Pattern object, or you can use the re functions directly and they will compile the pattern for you automatically.
The re package caches your most recently used compiled patterns. However, if you are using a lot of regular expressions, particularly inside of loops, you should probably compile them once outside the loop and use the resulting Pattern object directly.
re.search('abc','abcxyz') # this searches the string 'abcxyz' using the regex 'abc'
<re.Match object; span=(0, 3), match='abc'>
regex = re.compile('abc')
regex.search('abcxyz') #same as above
<re.Match object; span=(0, 3), match='abc'>
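A minimal sketch of the compile-once pattern, using a made-up list of lines as input:

```python
import re

lines = ['abc 1', 'xyz 2', 'abc 3']   # hypothetical input lines

regex = re.compile('abc')             # compiled once, before the loop
matching = [line for line in lines if regex.search(line)]
print(matching)   # ['abc 1', 'abc 3']
```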
Unlike Perl, where regular expressions are distinct from string literals, in Python we specify regular expressions as strings.
String literals use the backslash (\) to escape special characters.
Regular expressions use the backslash to escape special characters.
So how would we write a regular expression that matches \x\?
%%html
<div id="reslashes" style="width: 500px"></div>
<script>
var divid = '#reslashes';
jQuery(divid).asker({
id: divid,
question: "How would you write a regular expression to match \\x\\?",
answers: ['\\x\\','\\\\x\\\\','\\\\\\x\\\\\\','\\\\\\\\x\\\\\\\\'],
server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();
</script>
firsttry = '\x\'
  File "/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/673471912.py", line 1
    firsttry = '\x\'
                    ^
SyntaxError: EOL while scanning string literal
secondtry = '\\x\\'
print(secondtry,len(secondtry))
\x\ 3
regex = re.compile(secondtry)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/3245247240.py in <module>
----> 1 regex = re.compile(secondtry)
...
error: bad escape (end of pattern) at position 2
thirdtry = '\\\\x\\\\'
print(thirdtry,len(thirdtry))
\\x\\ 5
regex = re.compile(thirdtry)
regex.search('\\x\\')
<re.Match object; span=(0, 3), match='\\x\\'>
Python lets you specify a raw string literal where backslashes aren't treated as escape characters.
Raw strings have an r before the string literal.
Use raw strings for regular expressions.
normal_str = '\\x\\'
raw_str = r'\\x\\'
print(normal_str,raw_str)
\x\ \\x\\
Slight detail: raw strings can't end with an odd number of backslashes
print(r'\x\')
  File "/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/2066912133.py", line 1
    print(r'\x\')
                 ^
SyntaxError: EOL while scanning string literal
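Putting the pieces together, the following sketch counts characters in normal and raw literals and then uses a raw pattern to match the three-character text \x\. The variable name `target` is only illustrative:

```python
import re

# '\\' in a normal literal is one backslash; in a raw literal it is two
print(len('\\'), len(r'\\'))   # 1 2

# the raw pattern r'\\x\\' is the 5-character regex \\x\\,
# which matches the 3-character text \x\
target = '\\x\\'
m = re.search(r'\\x\\', target)
print(len(target), m.span())   # 3 (0, 3)
```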
regex1|regex2
Match either regex1 or regex2
bool(re.search(r'a|b','xxxaxxx'))
True
bool(re.search(r'abc|xyz','axbycz'))
False
bool(re.search(r'abc|xyz','xxxyzxxx'))
True
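The `|` operator has the lowest precedence, so it splits the entire pattern unless parentheses limit its scope. A short sketch with a made-up spelling example (`fullmatch`, from the standard `re` module, requires the whole string to match):

```python
import re

# without parentheses, | splits the whole pattern: 'gra' or 'ey'
print(bool(re.fullmatch(r'gra|ey', 'grey')))     # False

# parentheses limit the alternation to a single character position
print(bool(re.fullmatch(r'gr(a|e)y', 'grey')))   # True
print(bool(re.fullmatch(r'gr(a|e)y', 'gray')))   # True
```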
| Pattern | Meaning |
|---|---|
| regex* | Match regex zero or more times (Kleene star) |
| regex? | Match regex one or zero times |
| regex+ | Match regex one or more times |
| regex{m} | Match regex `m` times |
| regex{m,n} | Match regex between `m` and `n` times (as many as possible) |
bool(re.search(r'a*','xxxxx'))
True
bool(re.search(r'a+','xxxxx'))
False
m = re.search(r'a+(.*)','aaba')
%%html
<div id="regreedy" style="width: 500px"></div>
<script>
var divid = '#regreedy';
jQuery(divid).asker({
id: divid,
question: "What is m.group(1)?",
answers: ['a','b','ba','ab','aba','aaba'],
server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();
</script>
Multiple matching is greedy by default and will match as much as possible. To match as few characters as possible, use *?, ??, and +?.
m1 = re.search(r'a*(.*)','aaba')
m2 = re.search(r'a+(.*)','aaba')
m1.groups(),m2.groups()
(('ba',), ('ba',))
m3 = re.search(r'a*?(.*)','aaba')
%%html
<div id="remstar3" style="width: 500px"></div>
<script>
var divid = '#remstar3';
jQuery(divid).asker({
id: divid,
question: "What is m3.group(1)?",
answers: ['aaba','aba','ba','ABBA'],
server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();
</script>
m3.groups()
('aaba',)
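Greediness matters most when a pattern is bracketed by delimiters. The sketch below uses a made-up snippet of tagged text to contrast the greedy and non-greedy forms:

```python
import re

text = '<b>bold</b> and <i>italic</i>'   # hypothetical input

# greedy: .* runs to the last '>' in the string
greedy = re.search(r'<.*>', text).group(0)
print(greedy)

# non-greedy: .*? stops at the first '>' that completes a match
lazy = re.findall(r'<.*?>', text)
print(lazy)   # ['<b>', '</b>', '<i>', '</i>']
```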
m = re.search(r'a+?(.*)','aaba')
%%html
<div id="rem1" style="width: 500px"></div>
<script>
var divid = '#rem1';
jQuery(divid).asker({
id: divid,
question: "What is m.group(1)?",
answers: ['aaba','aba','ba','ABBA'],
server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();
</script>
| Pattern | Meaning |
|---|---|
| . | Matches any character (except newline, by default) |
| ^ | Matches start of string |
| $ | Matches end of string |
bool(re.search('^abc','xyzabc'))
False
bool(re.search('abc$','xyzabc'))
True
Keep in mind that if you want to match a special character, .^$()[]|*+?{}, you need to escape it with a backslash.
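As an illustration, the sketch below matches the literal text `1+1`, first by escaping by hand and then with `re.escape`, a standard `re` function that inserts the backslashes automatically:

```python
import re

# unescaped, '+' means "one or more", so r'1+1' looks for '11', '111', ...
print(bool(re.search(r'1+1', '1+1')))     # False
print(bool(re.search(r'1\+1', '1+1')))    # True: \+ is a literal plus

# re.escape adds the needed backslashes to arbitrary literal text
pattern = re.escape('1+1=2?')
print(bool(re.fullmatch(pattern, '1+1=2?')))   # True
```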
[] specifies a set of characters
m = re.search(r'([0-9])','BST3')
m.groups()
('3',)
m = re.search(r'([cat])','garfield')
%%html
<div id="reset" style="width: 500px"></div>
<script>
var divid = '#reset';
jQuery(divid).asker({
id: divid,
question: "What's in m.group(1)?",
answers: ['Nothing','cat','c','a','t','garfield'],
server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();
</script>
The complement of the character set is taken if ^ is the first character.
r3 = re.compile(r'([^ ]*)')
m3 = r3.search('Hello World')
m3.groups()
('Hello',)
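Character sets are handy for validating sequence data. A minimal sketch, assuming a DNA alphabet of A, C, G and T (the example strings are made up):

```python
import re

dna = re.compile(r'[ACGT]+')     # one or more of the four bases

print(bool(dna.fullmatch('GATTACA')))    # True: only A, C, G, T
print(bool(dna.fullmatch('GATUACA')))    # False: 'U' is not in the set

# a leading ^ inside the brackets negates the set
print(re.findall(r'[^ACGT]', 'GATUACAN'))   # ['U', 'N']
```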
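| Pattern | Meaning |
|---|---|
| \d | Matches a decimal digit |
| \D | Matches any character that is not a decimal digit |
| \s | Matches a whitespace character |
| \S | Matches a non-whitespace character |
| \w | Matches alphanumeric characters and underscore (for ASCII text: [A-Za-z0-9_]) |
| \W | Matches non-alphanumeric characters (for ASCII text: [^A-Za-z0-9_]) |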
re.search(r'(\w+)-(\w+)','de-hyphen').groups()
('de', 'hyphen')
float_regex = re.compile(r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?')
float_regex.match('3.14159')
<re.Match object; span=(0, 7), match='3.14159'>
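To pull every number out of a longer string, `findall` is convenient, but it returns tuples when the pattern contains capturing groups. The sketch below is a variant of the float pattern that uses non-capturing groups `(?:...)`, a standard `re` feature not introduced above; the input string is made up:

```python
import re

# same shape as float_regex, but with non-capturing groups (?:...)
# so that findall returns the whole match instead of a tuple of groups
float_nc = re.compile(r'[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?')

numbers = float_nc.findall('x=3.14159 y=-2e10 z=.5 n=42')
print(numbers)   # ['3.14159', '-2e10', '.5', '42']

print(bool(float_nc.fullmatch('1.2.3')))   # False: not a single number
```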
r = re.compile(r'\d?\d.(png|jpg)')
%%html
<div id="reex1" style="width: 500px"></div>
<script>
var divid = '#reex1';
jQuery(divid).asker({
id: divid,
question: "Which string will NOT match",
answers: ['0.png','15.jpg','93png','100.jpg'],
server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();
</script>
We can extract parts of a match using groups.
m = re.search(r'(\w*)@pitt\.edu','dkoes@pitt.edu')
m.group(1)
'dkoes'
Groups can be referenced within the regular expression with \number, where number is from 1 to 99.
regex = re.compile(r'(\w+)\s+\1')
m1 = regex.search('cat cat')
m2 = regex.search('cat dog')
m1.groups(),m2
(('cat',), None)
Groups can be named with (?P<name>...)
regex = re.compile(r'(?P<last>\w+), (?P<first>\w+)')
m = regex.search('Koes, David')
print(m.group('first'),m.group('last'))
David Koes
Named groups can be referenced by name within the regular expression.
regex = re.compile(r'(?P<animal>\w+)\s+(?P=animal)')
m1 = regex.search('cat cat')
m1.groups()
('cat',)
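Named groups can also be retrieved all at once. A short sketch using the name pattern from above and the match object's `groupdict` method:

```python
import re

regex = re.compile(r'(?P<last>\w+), (?P<first>\w+)')
m = regex.search('Koes, David')

# groupdict maps each group name to the text it captured
print(m.groupdict())   # {'last': 'Koes', 'first': 'David'}

# named groups keep their positional numbers as well
print(m.group(1) == m.group('last'))   # True
```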
Compiling to a Pattern object also lets you provide some flags:
re.IGNORECASE - case insensitive matching
re.DOTALL - make the dot character match newlines
re.MULTILINE - ^ and $ will match beginning/end of lines in addition to the string
print(re.search(r'^cat$','cat\ndog'))
None
regex = re.compile(r'^cat$',re.MULTILINE)
regex.search('cat\ndog')
<re.Match object; span=(0, 3), match='cat'>
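Flags can be combined with `|`, and the module-level functions accept them as an optional last argument. A small sketch on a made-up three-line string:

```python
import re

text = 'Cat\ndog\ncat'

# MULTILINE lets ^ and $ match at line breaks; IGNORECASE ignores case
regex = re.compile(r'^cat$', re.MULTILINE | re.IGNORECASE)
print(regex.findall(text))   # ['Cat', 'cat']

# without DOTALL the dot stops at a newline; with it, the dot crosses lines
first_line = re.search(r'Cat.*', text).group(0)
everything = re.search(r'Cat.*', text, re.DOTALL).group(0)
print(first_line == 'Cat', everything == text)   # True True
```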
The same functions are available both as standalone functions in re (which take a regular expression in string form) and as methods of a Pattern object.
search - Scan through a string looking for where a regular expression produces a match, and return a Match object.
match - Return a Match object if the regular expression matches at the beginning of the string.
split - Split a string by occurrences of the pattern.
findall - Return all non-overlapping matches of the regular expression as strings.
finditer - Return an iterator yielding Match objects over all non-overlapping matches of the regular expression.
sub - Return a string obtained by substituting matches of the regular expression with a provided string.
split
re.split(r'\s+',"A bunch of spacey\nwords.")
['A', 'bunch', 'of', 'spacey', 'words.']
If matching groups are included, then the matches are included in the returned list
re.split(r'(\s+)',"A bunch of spacey\nwords.")
['A', ' ', 'bunch', ' ', 'of', ' ', 'spacey', '\n', 'words.']
findall
Returns matches as strings.
bigstr = 'abc xyz abc a x'
re.findall('abc',bigstr)
['abc', 'abc']
re.findall(r'(a)bc',bigstr)
['a', 'a']
re.findall(r'(a)b(c)',bigstr)
[('a', 'c'), ('a', 'c')]
matches = re.findall(r'(\S+)|(\S+)','x|y a|b')
%%html
<div id="reexfindall" style="width: 500px"></div>
<script>
var divid = '#reexfindall';
jQuery(divid).asker({
id: divid,
question: "What is in matches[0]",
answers: ["('x','y')","('x|y')","'x|y'","('x|y','')",'Error'],
server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();
</script>
matches
[('x|y', ''), ('a|b', '')]
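The result above comes from `|` acting as alternation, so the second group never participates. If the intent was to split each token at a literal vertical bar, escaping it gives the pairs directly. A minimal sketch:

```python
import re

# \| matches a literal vertical bar instead of acting as alternation
matches = re.findall(r'(\S+)\|(\S+)', 'x|y a|b')
print(matches)   # [('x', 'y'), ('a', 'b')]
```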
finditer
list_of_names = 'Koes, David\nKarplus, Martin\nLevitt, Michael\nWarshel, Arieh\n'
for m in re.finditer(r'(?P<last>\w+), (?P<first>\w+)',list_of_names):
    print(m.group('first'),m.group('last'))
David Koes
Martin Karplus
Michael Levitt
Arieh Warshel
sub
pdb = '''ATOM 2267 N THR A 609 4.155 42.962 60.898 1.00 9.19 N
ATOM 2268 CA THR A 609 3.520 44.246 60.575 1.00 10.78 C
ATOM 2269 C THR A 609 4.491 45.117 59.815 1.00 11.13 C
ATOM 2270 O THR A 609 5.689 44.864 59.853 1.00 9.92 O'''
print(re.sub(r' A ',' B ',pdb))
ATOM 2267 N THR B 609 4.155 42.962 60.898 1.00 9.19 N
ATOM 2268 CA THR B 609 3.520 44.246 60.575 1.00 10.78 C
ATOM 2269 C THR B 609 4.491 45.117 59.815 1.00 11.13 C
ATOM 2270 O THR B 609 5.689 44.864 59.853 1.00 9.92 O
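The replacement does not have to be a fixed string. The sketch below shows two other standard forms of `re.sub`: a replacement that refers back to named groups with `\g<name>`, and a replacement function that receives each match object. The input strings are made up for illustration:

```python
import re

names = 'Koes, David\nKarplus, Martin'

# \g<name> in the replacement refers back to a named group
swapped = re.sub(r'(?P<last>\w+), (?P<first>\w+)', r'\g<first> \g<last>', names)
print(swapped)

# the replacement can also be a function called with each match object
renumbered = re.sub(r'\d+', lambda m: str(int(m.group(0)) + 1), 'THR A 609')
print(renumbered)   # THR A 610
```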
A regular expression describes a regular language in formal language theory.
A formal language is a set of symbols and rules for constructing strings from these symbols. All programming languages are formal languages, although none are regular languages (most are described, at least approximately, by context-free grammars).
Stephen Kleene, American mathematician and inventor of regular expressions.
The following are equivalent:
This FSM and regular expression match all binary strings with an even number of zeros.
* is the Kleene star and matches zero or more copies of the preceding expression.
Finite state machines are finite. This means they cannot count arbitrarily high. For example, it is impossible to write a regular expression for balanced parentheses.
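The even-zeros language can be checked in code. The sketch below writes the two-state machine by hand and compares it against one possible regular expression for the same language; the pattern `1*(01*01*)*` is an assumption chosen for illustration and is not necessarily the expression from the lecture figure.

```python
import re
from itertools import product

def even_zeros_fsm(s):
    """Two-state machine: the state records whether the count of 0s so far is even."""
    even = True                 # start state, also the accepting state
    for ch in s:
        if ch == '0':
            even = not even     # a 0 flips the state; a 1 leaves it unchanged
    return even

# one possible regular expression for the same language (an assumption)
even_zeros_re = re.compile(r'1*(01*01*)*')

# compare the two on every binary string up to length 8
agree = all(
    even_zeros_fsm(s) == bool(even_zeros_re.fullmatch(s))
    for n in range(9)
    for s in map(''.join, product('01', repeat=n))
)
print(agree)   # True
```

Note that the machine needs only one bit of state no matter how long the input is, which is exactly why it can track parity but could not track the nesting depth of parentheses.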
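In theory, compiling a regular expression creates an FSM, and searching runs the string through the FSM, which takes time linear in the length of the string (no backtracking). Note that Python's re module is implemented with a backtracking matcher and supports features beyond regular languages (such as backreferences), so some patterns can take much longer than linear time.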
Consider the file alignment.txt. This is the saved result of a BLAST query.
!wget http://mscbio2025.csb.pitt.edu/files/alignment.txt
--2023-10-18 21:56:26-- http://mscbio2025.csb.pitt.edu/files/alignment.txt
Resolving mscbio2025.csb.pitt.edu (mscbio2025.csb.pitt.edu)... 136.142.4.139
Connecting to mscbio2025.csb.pitt.edu (mscbio2025.csb.pitt.edu)|136.142.4.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86458 (84K) [text/plain]
Saving to: ‘alignment.txt’
alignment.txt 100%[===================>] 84.43K --.-KB/s in 0.02s
2023-10-18 21:56:27 (4.52 MB/s) - ‘alignment.txt’ saved [86458/86458]
!head alignment.txt
BLASTP 2.2.28+ Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Reference for compositional score matrix adjustment: Stephen F. Altschul, John C. Wootton, E. Michael Gertz, Richa
What is the average length of the sequences returned?
How many sequences are from the pdb?
Can you extract just the subject sequences?
data = open('alignment.txt').read()
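One possible way to approach the three questions is sketched below. It runs on a tiny made-up excerpt written in the general style of BLAST text output rather than on the real file, so the line formats (`Length=`, `>` header lines containing `pdb|`, and `Sbjct` rows) are assumptions to check against alignment.txt. In a real report the query's own `Length=` line may also be picked up and would need to be excluded.

```python
import re

# a tiny made-up excerpt; the real alignment.txt may differ in detail
sample = '''>pdb|1ABC|A Chain A, hypothetical protein
Length=12

Query  1  MKVLAAGIVG  10
          MKVLA GIVG
Sbjct  3  MKVLASGIVG  12

>ref|XP_000001.1| another hypothetical protein
Length=20

Query  1  MKVLAAGIVG  10
          MKV+AAGI G
Sbjct  5  MKVIAAGIAG  14
'''

# average of all reported sequence lengths
lengths = [int(n) for n in re.findall(r'^Length=(\d+)', sample, re.MULTILINE)]
print(sum(lengths) / len(lengths))

# header lines that mention the pdb database
pdb_hits = re.findall(r'^>.*\bpdb\|', sample, re.MULTILINE)
print(len(pdb_hits))

# the sequence column of each subject row
subjects = re.findall(r'^Sbjct\s+\d+\s+(\S+)\s+\d+', sample, re.MULTILINE)
print(subjects)
```

To try it on the downloaded file, replace `sample` with the `data` string read above.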