Regular Expressions¶

10/19/2023¶

In [91]:
%%html
<script src="https://bits.csb.pitt.edu/preamble.js"></script>

Anti-Patterns¶

An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.

Antipattern: Looping with a manual index and re-parsing the argument on every iteration

In [93]:
frame = 0
while frame < length:
    if values[frame] < float(sys.argv[3]):
        pass #do something
    frame += 1

Pythonic Pattern:

In [94]:
cutoff = float(sys.argv[3])
for value in values:
    if value < cutoff:
        pass #dostuff

Antipattern: Not using numpy broadcasting

In [96]:
cnt = 0
for i in range(len(array)):
    if array[i] < cutoff:
        cnt += 1
cnt        
Out[96]:
3

Pythonic Pattern:

In [97]:
np.count_nonzero(array < cutoff)
Out[97]:
3

Antipattern: Expanding lazy iterables (such as range) into lists

In [98]:
for i in list(range(3)):
    pass

Pythonic Pattern:

In [99]:
for i in range(3):
    pass

Efficiency: Use sets for membership testing, but don't keep converting from a list.

Bad

In [100]:
L = [1,2,3]
for i in range(10):
    if i in set(L):
        pass

Good

In [101]:
L = set([1,2,3])
for i in range(10):
    if i in L:
        pass

re¶

A regular expression is a way to match text patterns.

It is specified with a string and is compiled to a regular expression object that can be used for searching and other pattern-using operations.

Patterns can get pretty complicated and are not limited to exact string matches (but for now we'll stick with exact string matching since it's easy to understand).

In [102]:
import re
regex = re.compile('abc')
regex
Out[102]:
re.compile(r'abc', re.UNICODE)

Matching vs Searching¶

match and search apply the regex to the passed string

In [103]:
regex = re.compile('abc')
regex.search('xyzabc')
Out[103]:
<re.Match object; span=(3, 6), match='abc'>

match must match starting at the beginning of the string.

In [104]:
print(regex.match('xyzabc')) # match only succeeds at the beginning of the string
None

Advice: use search and pretend match doesn't exist to avoid confusion
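
If you follow this advice, the anchor ^ (covered under Special Characters) gives you match's behavior when you need it. The sketch below compares the entry points on the same input; fullmatch is not discussed in these notes and is included only as an illustration of requiring the whole string to match.

```python
import re

regex = re.compile('abc')

# search finds the pattern anywhere in the string
found_anywhere = regex.search('xyzabc')

# match only succeeds if the pattern matches starting at position 0
found_at_start = regex.match('xyzabc')

# anchoring the pattern with ^ makes search behave like match
anchored = re.search('^abc', 'xyzabc')

# fullmatch requires the entire string to match the pattern
whole = regex.fullmatch('abc')
partial = regex.fullmatch('abcdef')

print(found_anywhere.span(), found_at_start, anchored, whole.group(0), partial)
```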

Extracting¶

In addition to searching for a particular pattern, a regular expression can be used to extract parts of the pattern using groups.

A group is defined using parentheses.

In [105]:
regex = re.compile('(abc)def')
match = regex.search('xyzabcdef')

The returned Match object can be used to extract all the contents of the groups.

In [106]:
match.groups()
Out[106]:
('abc',)

Groups¶

In [107]:
match.group(1)
Out[107]:
'abc'
In [108]:
match.group(0) #group zero is always the whole match
Out[108]:
'abcdef'
In [109]:
regex = re.compile('(a(b(c)))def')
match = regex.search('xyzabcdefg')
In [112]:
%%html
<div id="regroups" style="width: 500px"></div>
<script>

    var divid = '#regroups';
	jQuery(divid).asker({
	    id: divid,
	    question: "How many groups are in match?",
		answers: ['0','1','2','3','4'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>
In [113]:
match.groups()
Out[113]:
('abc', 'bc', 'c')
In [114]:
match.group(0)
Out[114]:
'abcdef'

Using Regular Expressions¶

You can compile your regular expression into a pattern object (re.Pattern), or you can call the re functions directly and they will compile the pattern for you automatically.

The re module caches recently compiled patterns. However, if you are using a lot of regular expressions, particularly inside of loops, you should probably compile them once outside the loop and use the resulting pattern object directly.
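
A minimal sketch of the compile-once pattern. The list of lines is made-up sample data, and the names lines and hits are only for illustration.

```python
import re

# hypothetical sample data standing in for lines read from a file
lines = ['abcxyz', 'xyz', 'xxabc']

# compile once, outside the loop
regex = re.compile('abc')

hits = []
for line in lines:
    # search returns None when there is no match, which is falsy
    if regex.search(line):
        hits.append(line)

print(hits)
```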

In [115]:
re.search('abc','abcxyz') # this searches the string 'abcxyz' using the regex 'abc'
Out[115]:
<re.Match object; span=(0, 3), match='abc'>
In [116]:
regex = re.compile('abc')
regex.search('abcxyz') #same as above
Out[116]:
<re.Match object; span=(0, 3), match='abc'>

Regular Expression Syntax¶

The Backslash Problem¶

Unlike Perl, where regular expressions are distinct from string literals, in Python we specify regular expressions as strings.

String literals use the backslash (\) to escape special characters.

Regular expressions use the backslash to escape special characters.

So how would we write a regular expression that matches \x\?

In [117]:
%%html
<div id="reslashes" style="width: 500px"></div>
<script>

    var divid = '#reslashes';
	jQuery(divid).asker({
	    id: divid,
	    question: "How would you write a regular expression to match \\x\\?",
		answers: ['\\x\\','\\\\x\\\\','\\\\\\x\\\\\\','\\\\\\\\x\\\\\\\\'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>
In [118]:
firsttry = '\x\'
  File "/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/673471912.py", line 1
    firsttry = '\x\'
                    ^
SyntaxError: EOL while scanning string literal
In [119]:
secondtry = '\\x\\'
In [120]:
print(secondtry,len(secondtry))
\x\ 3
In [121]:
regex = re.compile(secondtry)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/3245247240.py in <module>
----> 1 regex = re.compile(secondtry)

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/re.py in compile(pattern, flags)
    250 def compile(pattern, flags=0):
    251     "Compile a regular expression pattern, returning a Pattern object."
--> 252     return _compile(pattern, flags)
    253 
    254 def purge():

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/re.py in _compile(pattern, flags)
    302     if not sre_compile.isstring(pattern):
    303         raise TypeError("first argument must be string or compiled pattern")
--> 304     p = sre_compile.compile(pattern, flags)
    305     if not (flags & DEBUG):
    306         if len(_cache) >= _MAXCACHE:

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--> 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py in parse(str, flags, state)
    946 
    947     try:
--> 948         p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
    949     except Verbose:
    950         # the VERBOSE flag was switched on inside the pattern.  to be

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py in _parse_sub(source, state, verbose, nested)
    441     start = source.tell()
    442     while True:
--> 443         itemsappend(_parse(source, state, verbose, nested + 1,
    444                            not nested and not items))
    445         if not sourcematch("|"):

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py in _parse(source, state, verbose, nested, first)
    509         if this in "|)":
    510             break # end of subpattern
--> 511         sourceget()
    512 
    513         if verbose:

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py in get(self)
    254     def get(self):
    255         this = self.next
--> 256         self.__next()
    257         return this
    258     def getwhile(self, n, charset):

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py in __next(self)
    243                 char += self.decoded_string[index]
    244             except IndexError:
--> 245                 raise error("bad escape (end of pattern)",
    246                             self.string, len(self.string) - 1) from None
    247         self.index = index + 1

error: bad escape (end of pattern) at position 2
In [122]:
thirdtry = '\\\\x\\\\'
In [123]:
print(thirdtry,len(thirdtry))
\\x\\ 5
In [124]:
regex = re.compile(thirdtry)
In [125]:
regex.search('\\x\\')
Out[125]:
<re.Match object; span=(0, 3), match='\\x\\'>

Raw Strings¶

Python lets you specify a raw string literal where backslashes aren't escaped.

Raw strings have an r before the string literal.

Use raw strings for regular expressions.

In [126]:
normal_str = '\\x\\'
raw_str = r'\\x\\'
In [127]:
print(normal_str,raw_str)
\x\ \\x\\

Slight detail: raw strings can't end with an odd number of backslashes

In [128]:
print(r'\x\')
  File "/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/2066912133.py", line 1
    print(r'\x\')
                 ^
SyntaxError: EOL while scanning string literal
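
Putting the pieces together, here is a small sketch of the raw-string version of the pattern that matches \x\. The last two lines show one possible workaround for the trailing-backslash restriction (adjacent string literals are concatenated); the path-like string is just an example value.

```python
import re

# raw string: the regex engine receives \\x\\ (two escaped backslashes)
regex = re.compile(r'\\x\\')

# the three-character string \x\ written as a normal string literal
target = '\\x\\'

m = regex.search(target)
print(len(target), m.group(0) == target)

# a raw string cannot end in a single backslash, so append a normal literal
tail = r'C:\temp' '\\'
print(tail)
```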

Operators¶

regex1|regex2

Match either regex1 or regex2

In [129]:
bool(re.search(r'a|b','xxxaxxx'))
Out[129]:
True
In [130]:
bool(re.search(r'abc|xyz','axbycz'))
Out[130]:
False
In [131]:
bool(re.search(r'abc|xyz','xxxyzxxx'))
Out[131]:
True

Operators: multiple matches¶

regex* Match regex zero or more times (Kleene star)
regex? Match regex one or zero times
regex+ Match regex one or more times
regex{m} Match regex exactly `m` times
regex{m,n} Match regex between `m` and `n` times (as many as possible)
In [132]:
bool(re.search(r'a*','xxxxx'))
Out[132]:
True
In [133]:
bool(re.search(r'a+','xxxxx'))
Out[133]:
False
In [134]:
m = re.search(r'a+(.*)','aaba')
In [136]:
%%html
<div id="regreedy" style="width: 500px"></div>
<script>

    var divid = '#regreedy';
	jQuery(divid).asker({
	    id: divid,
	    question: "What is m.group(1)?",
		answers: ['a','b','ba','ab','aba','aaba'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>
In [137]:
m.groups()
Out[137]:
('ba',)
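
The counted forms {m} and {m,n} and the optional operator ? are not exercised above, so here is a brief sketch. The input strings are arbitrary examples.

```python
import re

# {3} needs exactly three copies in a row; the first such run starts at index 3
exactly_three = re.search(r'a{3}', 'aabaaaa')

# {2,3} accepts two or three copies and greedily takes three when it can
two_to_three = re.search(r'a{2,3}', 'aaaaa')

# ? makes the preceding item optional (zero or one copy)
optional_u = [bool(re.search(r'colou?r', w)) for w in ('color', 'colour', 'colr')]

print(exactly_three.span(), two_to_three.group(0), optional_u)
```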

Non-greedy Kleene¶

Multiple matching is greedy by default and will match as much as possible. To match as few characters as possible, use *?, ??, and +?.

In [138]:
m1 = re.search(r'a*(.*)','aaba')
m2 = re.search(r'a+(.*)','aaba')
In [139]:
m1.groups(),m2.groups()
Out[139]:
(('ba',), ('ba',))
In [140]:
m3 = re.search(r'a*?(.*)','aaba')
In [142]:
%%html
<div id="remstar3" style="width: 500px"></div>
<script>

    var divid = '#remstar3';
	jQuery(divid).asker({
	    id: divid,
	    question: "What is m3.group(1)?",
		answers: ['aaba','aba','ba','ABBA'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>
In [143]:
m3.groups()
Out[143]:
('aaba',)
In [144]:
m = re.search(r'a+?(.*)','aaba')
In [145]:
%%html
<div id="rem1" style="width: 500px"></div>
<script>

    var divid = '#rem1';
	jQuery(divid).asker({
	    id: divid,
	    question: "What is m.group(1)?",
		answers: ['aaba','aba','ba','ABBA'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>
In [146]:
m.group(1)
Out[146]:
'aba'
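
A common place where greediness matters is text with repeated delimiters. The sketch below uses a made-up snippet of markup to contrast the greedy and non-greedy forms.

```python
import re

# hypothetical sample text with several angle-bracket tags
text = '<b>bold</b> and <i>italic</i>'

# greedy: .* runs to the end, then backs up to the LAST '>'
greedy = re.search(r'<(.*)>', text).group(1)

# non-greedy: .*? stops at the FIRST '>' that lets the match succeed
lazy = re.search(r'<(.*?)>', text).group(1)

print(repr(greedy))
print(repr(lazy))
```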

Special Characters¶

. Matches any character (except newline, by default)
^ Matches start of string
$ Matches end of string

In [147]:
bool(re.search('^abc','xyzabc'))
Out[147]:
False
In [148]:
bool(re.search('abc$','xyzabc'))
Out[148]:
True

Keep in mind that if you want to match a special character literally, .^$()[]|*+?{} or the backslash itself, you need to escape it with a backslash.
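
A short sketch of what goes wrong without escaping, and of re.escape, which escapes every special character in a string for you. The sample strings are made up for illustration.

```python
import re

# an unescaped dot matches any character, so '3.14' also matches '3x14'
unescaped = re.search('3.14', '3x14')

# escaping the dot requires a literal '.'
escaped = re.search(r'3\.14', '3x14')

# re.escape builds a pattern that matches the given text literally
literal = '3.14 (approx)'
pattern = re.escape(literal)
found = re.search(pattern, 'pi is 3.14 (approx)')

print(bool(unescaped), escaped, found.group(0))
```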

Character Sets¶

[] specifies a set of characters

In [149]:
m = re.search(r'([0-9])','BST3')
In [150]:
m.groups()
Out[150]:
('3',)
In [151]:
m = re.search(r'([cat])','garfield')
In [152]:
%%html
<div id="reset" style="width: 500px"></div>
<script>

    var divid = '#reset';
	jQuery(divid).asker({
	    id: divid,
	    question: "What's in m.group(1)?",
		answers: ['Nothing','cat','c','a','t','garfield'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>
In [153]:
m.group(1)
Out[153]:
'a'

Character Set Complements¶

The complement of the character set is taken if ^ is the first character.

In [154]:
r3 = re.compile(r'([^ ]*)')
m3 = r3.search('Hello World')
m3.groups()
Out[154]:
('Hello',)

Predefined Character Sets¶

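\d Matches decimal digit
\D Matches any character that is not a decimal digit

\s Matches whitespace character
\S Matches non-whitespace character

\w Matches alphanumeric characters and underscore ([A-Za-z0-9_], plus Unicode letters and digits unless re.ASCII is set)
\W Matches any character not matched by \w ([^A-Za-z0-9_] under re.ASCII)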

In [155]:
re.search(r'(\w+)-(\w+)','de-hyphen').groups()
Out[155]:
('de', 'hyphen')
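
In Python 3, the predefined sets are Unicode-aware by default when the pattern is a str. The sketch below shows the difference the re.ASCII flag makes for \w; the sample word is an arbitrary choice.

```python
import re

# "naive" spelled with a diaeresis on the i (U+00EF)
word = 'na\u00efve'

# default (Unicode) behavior: the accented letter counts as a word character
unicode_match = re.search(r'\w+', word).group(0)

# with re.ASCII, \w is restricted to [A-Za-z0-9_]
ascii_match = re.search(r'\w+', word, re.ASCII).group(0)

print(unicode_match, ascii_match)
```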
In [156]:
float_regex = re.compile(r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?')
In [157]:
float_regex.match('3.14159')
Out[157]:
<re.Match object; span=(0, 7), match='3.14159'>
In [158]:
r = re.compile(r'\d?\d.(png|jpg)')
In [160]:
%%html
<div id="reex1" style="width: 500px"></div>
<script>

    var divid = '#reex1';
	jQuery(divid).asker({
	    id: divid,
	    question: "Which string will NOT match",
		answers: ['0.png','15.jpg','93png','100.jpg'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>

Groups¶

We can extract parts of a match using groups.

In [161]:
m = re.search(r'(\w*)@pitt\.edu','dkoes@pitt.edu')
m.group(1)
Out[161]:
'dkoes'

Groups can be referenced within the regular expression with \number where number is from 1 to 99

In [162]:
regex = re.compile(r'(\w+)\s+\1')
m1 = regex.search('cat cat')
m2 = regex.search('cat dog')
m1.groups(),m2
Out[162]:
(('cat',), None)

Named Groups¶

Groups can be named with (?P<name>...)

In [163]:
regex = re.compile(r'(?P<last>\w+), (?P<first>\w+)')
m = regex.search('Koes, David')
print(m.group('first'),m.group('last'))
David Koes
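
Named groups are still numbered, and all of them can be pulled out at once as a dictionary. A small sketch using the same pattern:

```python
import re

regex = re.compile(r'(?P<last>\w+), (?P<first>\w+)')
m = regex.search('Koes, David')

# groupdict returns a dict mapping each group name to its matched text
fields = m.groupdict()

# a named group can still be accessed by its position
same = m.group(1) == m.group('last')

print(fields, same)
```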

Named groups can be referenced by name within the regular expression.

In [164]:
regex = re.compile(r'(?P<animal>\w+)\s+(?P=animal)')
m1 = regex.search('cat cat')
m1.groups()
Out[164]:
('cat',)

Compiling Regular Expressions¶

Compiling to a pattern object also lets you provide some flags:

  • re.IGNORECASE - case insensitive matching
  • re.DOTALL - make the dot character match newlines
  • re.MULTILINE - ^ and $ will match beginning/end of lines in addition to the string
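
A quick sketch of the flags in use. Flags can be combined with the | operator, and they can also be passed to the module-level functions. The sample strings are made up.

```python
import re

# IGNORECASE: 'cat' matches regardless of letter case
ignore_case = re.search(r'cat', 'CAT', re.IGNORECASE)

# by default the dot does not match a newline...
no_dotall = re.search(r'a.b', 'a\nb')
# ...but it does with DOTALL
with_dotall = re.search(r'a.b', 'a\nb', re.DOTALL)

# combine flags with |
both = re.findall(r'^cat$', 'Cat\ndog\nCAT', re.IGNORECASE | re.MULTILINE)

print(ignore_case.group(0), no_dotall, with_dotall.span(), both)
```
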
In [165]:
print(re.search(r'^cat$','cat\ndog'))
None
In [166]:
regex = re.compile(r'^cat$',re.MULTILINE)
regex.search('cat\ndog')
Out[166]:
<re.Match object; span=(0, 3), match='cat'>

More Regular Expression Functions¶

The same functions are available both as standalone functions in re (which take a regular expression in string form as their first argument) and as methods of a compiled pattern object.

  • search Scan through a string looking for the first location where the regular expression matches, and return a Match object (or None).
  • match Return a Match object if the regular expression matches at the beginning of the string (or None).
  • split Split a string by occurrences of the pattern.
  • findall Return all non-overlapping matches of the regular expression as a list of strings.
  • finditer Return an iterator yielding Match objects over all non-overlapping matches of the regular expression.
  • sub Return a string obtained by substituting matches of the regular expression with a provided string.

split¶

In [167]:
re.split(r'\s+',"A bunch of   spacey\nwords.")
Out[167]:
['A', 'bunch', 'of', 'spacey', 'words.']

If the pattern contains capturing groups, then the text matched by the groups is also included in the returned list.

In [168]:
re.split(r'(\s+)',"A bunch of   spacey\nwords.")
Out[168]:
['A', ' ', 'bunch', ' ', 'of', '   ', 'spacey', '\n', 'words.']

findall¶

Returns matches as strings.

  • If no groups, returns full match
  • If single group, returns string of that group's match
  • If multiple groups, returns tuple of strings
In [169]:
bigstr = 'abc xyz abc a x'
re.findall('abc',bigstr)
Out[169]:
['abc', 'abc']
In [170]:
re.findall(r'(a)bc',bigstr)
Out[170]:
['a', 'a']
In [171]:
re.findall(r'(a)b(c)',bigstr)
Out[171]:
[('a', 'c'), ('a', 'c')]
In [172]:
matches = re.findall(r'(\S+)|(\S+)','x|y a|b')
In [173]:
%%html
<div id="reexfindall" style="width: 500px"></div>
<script>
    var divid = '#reexfindall';
	jQuery(divid).asker({
	    id: divid,
	    question: "What is in matches[0]",
		answers: ["('x','y')","('x|y')","'x|y'","('x|y','')",'Error'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>
In [174]:
matches
Out[174]:
[('x|y', ''), ('a|b', '')]
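
The result above comes from | acting as the alternation operator: the first alternative (\S+) swallows the whole token, so the second group is always empty. A sketch of the escaped version, which treats | as a literal character:

```python
import re

# unescaped | means "either alternative"; the second group never participates
unescaped = re.findall(r'(\S+)|(\S+)', 'x|y a|b')

# escaping the | matches a literal pipe between the two groups
escaped = re.findall(r'(\S+)\|(\S+)', 'x|y a|b')

print(unescaped)
print(escaped)
```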

finditer¶

In [175]:
list_of_names = 'Koes, David\nKarplus, Martin\nLevitt, Michael\nWarshel, Arieh\n'
In [176]:
for m in re.finditer(r'(?P<last>\w+), (?P<first>\w+)',list_of_names):
    print(m.group('first'),m.group('last'))
David Koes
Martin Karplus
Michael Levitt
Arieh Warshel

sub¶

In [177]:
pdb = '''ATOM   2267  N   THR A 609       4.155  42.962  60.898  1.00  9.19           N  
ATOM   2268  CA  THR A 609       3.520  44.246  60.575  1.00 10.78           C  
ATOM   2269  C   THR A 609       4.491  45.117  59.815  1.00 11.13           C  
ATOM   2270  O   THR A 609       5.689  44.864  59.853  1.00  9.92           O'''

print(re.sub(r' A ',' B ',pdb))
ATOM   2267  N   THR B 609       4.155  42.962  60.898  1.00  9.19           N  
ATOM   2268  CA  THR B 609       3.520  44.246  60.575  1.00 10.78           C  
ATOM   2269  C   THR B 609       4.491  45.117  59.815  1.00 11.13           C  
ATOM   2270  O   THR B 609       5.689  44.864  59.853  1.00  9.92           O
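
The replacement passed to sub does not have to be a fixed string. As a sketch: it can refer back to groups with \1, \2, and so on, or it can be a function that receives each Match object. The helper name double and the sample strings are made up for illustration.

```python
import re

names = 'Koes, David\nKarplus, Martin'

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', names)

# hypothetical helper: the replacement can be computed from each match
def double(m):
    return str(2 * int(m.group(0)))

doubled = re.sub(r'\d+', double, 'a1 b22')

print(swapped)
print(doubled)
```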

Some Theory¶

A regular expression describes a regular language in formal language theory.

A formal language is a set of strings built from an alphabet of symbols, usually defined by rules for constructing those strings. All programming languages are formal languages, although none are regular languages (most are described by context-free grammars).

Stephen Kleene, American mathematician and inventor of regular expressions.

Regular Languages¶

The following are equivalent:

  • A language is regular
  • A language can be described by a regular expression
  • A language can be recognized by a finite automaton (finite state machine)

Finite Automata¶

1*(01*01*)*

This FSM and regular expression match all binary strings with an even number of zeros.

* is the Kleene star and matches zero or more copies of the preceding expression.

Finite state machines are finite. This means they cannot count arbitrarily high. For example, it is impossible to write a regular expression for balanced parentheses.

In theory, compiling a regular expression creates an FSM, and searching runs the string through the FSM in time linear in the length of the string (no backtracking). Python's re module, however, uses a backtracking matcher in order to support extensions such as backreferences, so some patterns can take much longer than linear time on some inputs.
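
The even-zeros language needs only two states, so the machine can be written by hand. The sketch below is an illustration, not how re works internally: it implements the two-state FSM directly and checks that it agrees with the regular expression 1*(01*01*)* on a few sample strings. The function name even_zeros_fsm is made up.

```python
import re

def even_zeros_fsm(bits):
    """Two-state machine: the state records whether an even number of 0s was seen."""
    even = True                 # start state: zero 0s seen so far (even)
    for ch in bits:
        if ch == '0':
            even = not even     # a 0 flips between the two states
        elif ch != '1':
            return False        # reject anything outside the binary alphabet
        # a 1 leaves the state unchanged
    return even                 # accept only when we end in the "even" state

regex = re.compile(r'1*(01*01*)*')
samples = ['', '1', '0', '00', '0101', '1001011', '000']

# fullmatch is used so the whole string must be in the language
results = [(s, even_zeros_fsm(s), bool(regex.fullmatch(s))) for s in samples]
for s, fsm_ok, re_ok in results:
    print(repr(s), fsm_ok, re_ok)
```

The hand-written machine does one constant-time step per character, which is the linear-time behavior the theory promises.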

Exercise¶

Consider the file alignment.txt. This is the saved result of a BLAST query.

In [178]:
!wget http://mscbio2025.csb.pitt.edu/files/alignment.txt
--2023-10-18 21:56:26--  http://mscbio2025.csb.pitt.edu/files/alignment.txt
Resolving mscbio2025.csb.pitt.edu (mscbio2025.csb.pitt.edu)... 136.142.4.139
Connecting to mscbio2025.csb.pitt.edu (mscbio2025.csb.pitt.edu)|136.142.4.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86458 (84K) [text/plain]
Saving to: ‘alignment.txt’

alignment.txt       100%[===================>]  84.43K  --.-KB/s    in 0.02s   

2023-10-18 21:56:27 (4.52 MB/s) - ‘alignment.txt’ saved [86458/86458]

In [179]:
!head alignment.txt
BLASTP 2.2.28+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro
A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and
David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs", Nucleic
Acids Res. 25:3389-3402.


Reference for compositional score matrix adjustment: Stephen
F. Altschul, John C. Wootton, E. Michael Gertz, Richa

Answer these questions using regular expressions¶

What is the average length of the sequences returned?

How many sequences are from the pdb?

Can you extract just the subject sequences?

In [180]:
# read the whole file into one string; the with block closes the file afterwards
with open('alignment.txt') as f:
    data = f.read()