{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Regular Expressions\n", "\n", "## 10/19/2023\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Anti-Patterns\n", "\n", "*An [anti-pattern](https://en.wikipedia.org/wiki/Anti-pattern) is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anti-Pattern**s**:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import sys\n", "sys.argv.append('3.0')\n", "length = 100\n", "values = [3]*100" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "frame = 0\n", "while frame < length:\n", "    if values[frame] < float(sys.argv[3]):\n", "        pass #do something\n", "    frame += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pythonic Pattern:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "cutoff = float(sys.argv[3])\n", "for value in values:\n", "    if value < cutoff:\n", "        pass #do something" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Antipattern**: Not using numpy broadcasting" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], 
"source": [ "import numpy as np\n", "array = np.array([0,2,.5,.5,1.3])\n", "cutoff = 1.0" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cnt = 0\n", "for i in range(len(array)):\n", "    if array[i] < cutoff:\n", "        cnt += 1\n", "cnt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pythonic Pattern:" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.count_nonzero(array < cutoff)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Antipattern**: Expanding generators" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "for i in list(range(3)):\n", "    pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pythonic Pattern:" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "for i in range(3):\n", "    pass" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Efficiency**: Use sets for membership testing... 
**but** don't keep converting from a list.\n", "\n", "**Bad**" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "L = [1,2,3]\n", "for i in range(10):\n", "    if i in set(L):\n", "        pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Good**" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "L = set([1,2,3])\n", "for i in range(10):\n", "    if i in L:\n", "        pass" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# `re`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A *regular expression* is a way to match text patterns. \n", "\n", "It is specified as a string and is compiled to a regular expression object that can be used for searching and other pattern-based operations.\n", "\n", "Patterns can get pretty complicated and are not limited to exact string matches (but for now we'll stick with exact string matching since it's easy to understand)." ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "re.compile(r'abc', re.UNICODE)" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "regex = re.compile('abc')\n", "regex" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Matching vs Searching\n", "\n", "`match` and `search` apply the regex to the passed string." ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex = re.compile('abc')\n", "regex.search('xyzabc')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`match` must match starting at the beginning of the string." 
] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "print(regex.match('xyzabc')) #match succeeds only at the beginning of the string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Advice: use `search` and pretend `match` doesn't exist to avoid confusion." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Extracting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to searching for a particular pattern, a regular expression can be used to extract parts of the matched text using *groups*.\n", "\n", "A group is defined using parentheses." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "regex = re.compile('(abc)def')\n", "match = regex.search('xyzabcdef')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned match object (`re.Match`) can be used to extract the contents of all the groups." 
] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('abc',)" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.groups()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Groups" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'abc'" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.group(1)" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'abcdef'" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.group(0) #group zero is always the whole match" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "regex = re.compile('(a(b(c)))def')\n", "match = regex.search('xyzabcdefg')" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "('abc', 'bc', 'c')" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.groups()" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'abcdef'" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.group(0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Using Regular Expressions\n", "\n", "You can compile your regular expression into a pattern object (`re.Pattern`), or you can call the `re` module functions directly and they will compile the pattern for you automatically.\n", "\n", "The `re` module caches the most recently used compiled patterns. However, if you are using a lot of regular expressions, particularly inside loops, you should compile them once outside the loop and use the resulting pattern object directly." 
] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search('abc','abcxyz') # this searches the string 'abcxyz' using the regex 'abc'" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex = re.compile('abc')\n", "regex.search('abcxyz') #same as above" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Regular Expression Syntax" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Backslash Problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike Perl, where regular expressions are distinct from string literals, in Python we specify regular expressions as strings.\n", "\n", "String literals use the backslash (`\\`) to escape special characters.\n", "\n", "Regular expressions use the backslash to escape special characters.\n", "\n", "So how would we write a regular expression that matches `\\x\\`?" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "ename": "SyntaxError", "evalue": "EOL while scanning string literal (673471912.py, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/673471912.py\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m firsttry = '\\x\\'\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m EOL while scanning string literal\n" ] } ], "source": [ "firsttry = '\\x\\'" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "secondtry = '\\\\x\\\\'" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\\x\\ 3\n" ] } ], "source": [ "print(secondtry,len(secondtry))" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "ename": "error", "evalue": "bad escape (end of pattern) at position 2", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31merror\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/3245247240.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mregex\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mre\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msecondtry\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/re.py\u001b[0m in \u001b[0;36mcompile\u001b[0;34m(pattern, flags)\u001b[0m\n\u001b[1;32m 250\u001b[0m 
\u001b[0;32mdef\u001b[0m \u001b[0mcompile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpattern\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflags\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 251\u001b[0m \u001b[0;34m\"Compile a regular expression pattern, returning a Pattern object.\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 252\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_compile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpattern\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflags\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 253\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 254\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mpurge\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/re.py\u001b[0m in \u001b[0;36m_compile\u001b[0;34m(pattern, flags)\u001b[0m\n\u001b[1;32m 302\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0msre_compile\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0misstring\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpattern\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 303\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"first argument must be string or compiled pattern\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 304\u001b[0;31m \u001b[0mp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msre_compile\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpattern\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mflags\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 305\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mflags\u001b[0m \u001b[0;34m&\u001b[0m \u001b[0mDEBUG\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 306\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_cache\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>=\u001b[0m \u001b[0m_MAXCACHE\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_compile.py\u001b[0m in \u001b[0;36mcompile\u001b[0;34m(p, flags)\u001b[0m\n\u001b[1;32m 762\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misstring\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 763\u001b[0m \u001b[0mpattern\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 764\u001b[0;31m \u001b[0mp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msre_parse\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mparse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflags\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 765\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 766\u001b[0m \u001b[0mpattern\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py\u001b[0m in \u001b[0;36mparse\u001b[0;34m(str, flags, 
state)\u001b[0m\n\u001b[1;32m 946\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 947\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 948\u001b[0;31m \u001b[0mp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_parse_sub\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstate\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflags\u001b[0m \u001b[0;34m&\u001b[0m \u001b[0mSRE_FLAG_VERBOSE\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 949\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mVerbose\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 950\u001b[0m \u001b[0;31m# the VERBOSE flag was switched on inside the pattern. to be\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py\u001b[0m in \u001b[0;36m_parse_sub\u001b[0;34m(source, state, verbose, nested)\u001b[0m\n\u001b[1;32m 441\u001b[0m \u001b[0mstart\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msource\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtell\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 442\u001b[0m \u001b[0;32mwhile\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 443\u001b[0;31m itemsappend(_parse(source, state, verbose, nested + 1,\n\u001b[0m\u001b[1;32m 444\u001b[0m not nested and not items))\n\u001b[1;32m 445\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0msourcematch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"|\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;32m/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py\u001b[0m in \u001b[0;36m_parse\u001b[0;34m(source, state, verbose, nested, first)\u001b[0m\n\u001b[1;32m 509\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mthis\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m\"|)\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 510\u001b[0m \u001b[0;32mbreak\u001b[0m \u001b[0;31m# end of subpattern\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 511\u001b[0;31m \u001b[0msourceget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 512\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 513\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mverbose\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py\u001b[0m in \u001b[0;36mget\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 254\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 255\u001b[0m \u001b[0mthis\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnext\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 256\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__next\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 257\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mthis\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 258\u001b[0m \u001b[0;32mdef\u001b[0m 
\u001b[0mgetwhile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcharset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py\u001b[0m in \u001b[0;36m__next\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 243\u001b[0m \u001b[0mchar\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecoded_string\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 244\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mIndexError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 245\u001b[0;31m raise error(\"bad escape (end of pattern)\",\n\u001b[0m\u001b[1;32m 246\u001b[0m self.string, len(self.string) - 1) from None\n\u001b[1;32m 247\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mindex\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31merror\u001b[0m: bad escape (end of pattern) at position 2" ] } ], "source": [ "regex = re.compile(secondtry)" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "thirdtry = '\\\\\\\\x\\\\\\\\'" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\\\\x\\\\ 5\n" ] } ], "source": [ "print(thirdtry,len(thirdtry))" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "regex = re.compile(thirdtry)" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.search('\\\\x\\\\')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Raw Strings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python lets you specify a *raw string literal* in which backslashes are not treated as escape characters.\n", "\n", "Raw strings have an `r` before the string literal.\n", "\n", "**Use raw strings for regular expressions.**" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "normal_str = '\\\\x\\\\'\n", "raw_str = r'\\\\x\\\\'" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\\x\\ \\\\x\\\\\n" ] } ], "source": [ "print(normal_str,raw_str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slight detail: raw strings can't end with an odd number of backslashes." ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "EOL while scanning string literal (2066912133.py, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"/var/folders/c_/pwm7n7_174724g8zkkqlpr3m0000gn/T/ipykernel_58822/2066912133.py\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m print(r'\\x\\')\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m EOL while scanning string literal\n" ] } ], "source": [ "print(r'\\x\\')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Operators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`regex1|regex2` \n", "\n", "Match either regex1 or regex2" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "bool(re.search(r'a|b','xxxaxxx'))" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool(re.search(r'abc|xyz','axbycz'))" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool(re.search(r'abc|xyz','xxxyzxxx'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Operators: multiple matches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "
| Pattern | Meaning |
| --- | --- |
| `regex*` | Match regex zero or more times (Kleene star) |
| `regex?` | Match regex zero or one time |
| `regex+` | Match regex one or more times |
| `regex{m}` | Match regex exactly `m` times |
| `regex{m,n}` | Match regex between `m` and `n` times (as many as possible) |
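
A minimal sketch of the quantifiers above, assuming `re.fullmatch` (a standard `re` function these notes do not otherwise use), which succeeds only when the entire string matches. The sample strings are made up for illustration.

```python
import re

# (pattern, sample) pairs; fullmatch succeeds only if the whole sample matches
cases = [
    (r'ab*', 'a'),          # zero copies of b is allowed
    (r'ab?', 'abb'),        # at most one b, so this fails
    (r'ab+', 'a'),          # at least one b is required, so this fails
    (r'ab{2}', 'abb'),      # exactly two copies of b
    (r'ab{1,3}', 'abbbb'),  # four copies of b is more than three, so this fails
]
for pattern, sample in cases:
    print(pattern, sample, bool(re.fullmatch(pattern, sample)))

# {m,n} is greedy: search takes as many repetitions as it can, up to n
greedy = re.search(r'b{1,3}', 'abbbb').group(0)
print(greedy)  # bbb
```

Because `{m,n}` is greedy, the last search stops at three copies of `b` even though a fourth is available.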
" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool(re.search(r'a*','xxxxx'))" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool(re.search(r'a+','xxxxx'))" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "m = re.search(r'a+(.*)','aaba')" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "slideshow": { "slide_type": "notes" } }, "outputs": [ { "data": { "text/plain": [ "('ba',)" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.groups()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Non-greedy Kleene" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiple matching is greedy by default and will match as much as possible. To match as few characters as possible, use `*?`, `??`, and `+?`." ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "m1 = re.search(r'a*(.*)','aaba')\n", "m2 = re.search(r'a+(.*)','aaba')" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('ba',), ('ba',))" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m1.groups(),m2.groups()" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [], "source": [ "m3 = re.search(r'a*?(.*)','aaba')" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "m3 = re.search(r'a*?(.*)','aaba')" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "('aaba',)" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m3.groups()" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "m = re.search(r'a+?(.*)','aaba')" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "slideshow": { "slide_type": "notes" } }, "outputs": [ { "data": { "text/plain": [ "'aba'" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.group(1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Special Characters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.` Matches any character (except newline, by default)
\n", "`^` Matches start of string
\n", "`$` Matches end of string
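
A small sketch of the three special characters above. The sample strings are illustrative assumptions, not taken from the notes.

```python
import re

# . stands for exactly one character (any character except a newline by default)
dot_hit = bool(re.search(r'a.c', 'abc'))
dot_miss = bool(re.search(r'a.c', 'ac'))

# ^ pins the match to the start of the string, $ pins it to the end
starts = bool(re.search(r'^abc', 'abcxyz'))
ends = bool(re.search(r'abc$', 'abcxyz'))
whole = bool(re.search(r'^abc$', 'abc'))

print(dot_hit, dot_miss, starts, ends, whole)  # True False True False True
```

Using `^` and `$` together forces the pattern to account for the whole string.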
" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool(re.search('^abc','xyzabc'))" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool(re.search('abc$','xyzabc'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that if you want to match a special character, `.^$()[]|*+?{}`, you need to escape it with backslash. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Character Sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`[]` specifies a set of characters" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [], "source": [ "m = re.search(r'([0-9])','BST3')" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('3',)" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.groups()" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "m = re.search(r'([cat])','garfield')" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "slideshow": { "slide_type": "notes" } }, "outputs": [ { "data": { "text/plain": [ "'a'" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.group(1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Character Set Complements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *complement* of the character set is taken if `^` is the first character." ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Hello',)" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r3 = re.compile(r'([^ ]*)')\n", "m3 = r3.search('Hello World')\n", "m3.groups()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Predefined Character Sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`\\d` Matches decimal digit
\n", "`\\D` Matches any character that is not a decimal digit\n", "\n", "`\\s` Matches whitespace character
\n", "`\\S` Matches non-whitespace character\n", "\n", "`\\w` Matches alphanumeric characters and underscore `[A-Za-z0-9_]`
\n", "`\\W` Matches nonalphanumeric `[^A-Za-z0-9_]`" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('de', 'hyphen')" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search(r'(\\w+)-(\\w+)','de-hyphen').groups()" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [], "source": [ "float_regex = re.compile(r'[+-]?(\\d+(\\.\\d*)?|\\.\\d+)([eE][+-]?\\d+)?')" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float_regex.match('3.14159')" ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "r = re.compile(r'\\d?\\d.(png|jpg)')" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can extract parts of a match using groups, which are delimited by parentheses." ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'dkoes'" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = re.search(r'(\\w*)@pitt\\.edu','dkoes@pitt.edu')\n", "m.group(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Groups can be referenced within the regular expression with `\\number`, where `number` is from 1 to 99." ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('cat',), None)" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex = re.compile(r'(\\w+)\\s+\\1')\n", "m1 = regex.search('cat cat')\n", "m2 = regex.search('cat dog')\n", "m1.groups(),m2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Named Groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Groups can be named with `(?P<name>...)`" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "David Koes\n" ] } ], "source": [ "regex = re.compile(r'(?P<last>\\w+), (?P<first>\\w+)')\n", "m = regex.search('Koes, David')\n", "print(m.group('first'),m.group('last'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Named groups can be referenced by name within the regular expression using `(?P=name)`." 
] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('cat',)" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex = re.compile(r'(?P<animal>\\w+)\\s+(?P=animal)')\n", "m1 = regex.search('cat cat')\n", "m1.groups()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Compiling Regular Expressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Compiling to a RegexpObject (an `re.Pattern` in Python 3) also lets you provide some flags:\n", "\n", "* `re.IGNORECASE` - case-insensitive matching\n", "* `re.DOTALL` - make the dot character match newlines\n", "* `re.MULTILINE` - `^` and `$` will match the beginning/end of *lines* in addition to the string\n" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "print(re.search(r'^cat$','cat\\ndog'))" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(0, 3), match='cat'>" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex = re.compile(r'^cat$',re.MULTILINE)\n", "regex.search('cat\\ndog')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# More Regular Expression Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same functions are available both as standalone functions in `re` (which take a regular expression in string form) and as methods of a RegexpObject.\n", "\n", "* `search` Scan through a string looking for the first location where the regular expression produces a match, and return a MatchObject.\n", "* `match` Return a MatchObject if the regular expression matches at the *beginning* of the string.\n", "* `split` Split a string by occurrences of the pattern.\n", "* `findall` Return all non-overlapping matches of the regular expression as strings. 
\n", "* `finditer` Return an iterator yielding MatchObject instances over all non-overlapping matches of the regular expression\n", "* `sub` Return a string obtained by substituting matches of the regular expression with a provided string" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# `split`" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['A', 'bunch', 'of', 'spacey', 'words.']" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split(r'\\s+',\"A bunch of spacey\\nwords.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If matching groups are included, then the matches are included in the returned list" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['A', ' ', 'bunch', ' ', 'of', ' ', 'spacey', '\\n', 'words.']" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split(r'(\\s+)',\"A bunch of spacey\\nwords.\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# `findall`\n", "\n", "Returns matches as strings. 
\n", " * If no groups, returns full match\n", " * If single group, returns string of that group's match\n", " * If multiple groups, returns tuple of strings" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['abc', 'abc']" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigstr = 'abc xyz abc a x'\n", "re.findall('abc',bigstr)" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a', 'a']" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r'(a)bc',bigstr)" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('a', 'c'), ('a', 'c')]" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r'(a)b(c)',bigstr)" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "matches = re.findall(r'(\\S+)|(\\S+)','x|y a|b')" ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "[('x|y', ''), ('a|b', '')]" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matches" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# `finditer`" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [], "source": [ "list_of_names = 'Koes, David\\nKarplus, Martin\\nLevitt, Michael\\nWarshel, Arieh\\n'" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "David Koes\n", "Martin Karplus\n", "Michael Levitt\n", "Arieh Warshel\n" ] } ], "source": [ "for m in re.finditer(r'(?P<last>\\w+), (?P<first>\\w+)',list_of_names):\n", " print(m.group('first'),m.group('last'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# `sub`" ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ATOM 2267 N THR B 609 4.155 42.962 60.898 1.00 9.19 N \n", "ATOM 2268 CA THR B 609 3.520 44.246 60.575 1.00 10.78 C \n", "ATOM 2269 C THR B 609 4.491 45.117 59.815 1.00 11.13 C \n", "ATOM 2270 O THR B 609 5.689 44.864 59.853 1.00 9.92 O\n" ] } ], "source": [ "pdb = '''ATOM 2267 N THR A 609 4.155 42.962 60.898 1.00 9.19 N \n", "ATOM 2268 CA THR A 609 3.520 44.246 60.575 1.00 10.78 C \n", "ATOM 2269 C THR A 609 4.491 45.117 59.815 1.00 11.13 C \n", "ATOM 2270 O THR A 609 5.689 44.864 59.853 1.00 9.92 O'''\n", "\n", "print(re.sub(r' A ',' B ',pdb))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Some Theory\n", "\n", "A regular expression describes a *regular language* in formal language theory. 
\n", "\n", "A formal language is a set of symbols and rules for constructing strings from these symbols. All programming languages are formal languages, although none are regular languages (their syntax is usually described by context-free grammars).\n", "\n", "Stephen Kleene, American mathematician and inventor of regular expressions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Regular Languages\n", "\n", "The following are equivalent:\n", "\n", "* A language is regular\n", "* A language can be recognized by a regular expression\n", "* A language can be recognized by a finite automaton (finite state machine)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Finite Automata\n", "\n", "
`1*(01*01*)*`\n", "\n", "This FSM and regular expression match all binary strings with an even number of zeros.\n", "\n", "\\* is the *Kleene star* and matches zero or more copies of the preceding expression.\n", "\n", "Finite state machines are **finite**. This means they cannot count arbitrarily high. For example, it is impossible to write a regular expression for balanced parentheses.\n", "\n", "In theory, when you compile a regular expression, you are creating an FSM. When you search, the string is run through the FSM, which takes time linear in the length of the string (*no backtracking*). Python's `re` module, however, uses a backtracking matcher (needed for features such as backreferences), so some patterns can take much longer than linear time." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the file `alignment.txt`. This is the saved result of a BLAST query." ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2023-10-18 21:56:26-- http://mscbio2025.csb.pitt.edu/files/alignment.txt\n", "Resolving mscbio2025.csb.pitt.edu (mscbio2025.csb.pitt.edu)... 136.142.4.139\n", "Connecting to mscbio2025.csb.pitt.edu (mscbio2025.csb.pitt.edu)|136.142.4.139|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 86458 (84K) [text/plain]\n", "Saving to: ‘alignment.txt’\n", "\n", "alignment.txt 100%[===================>] 84.43K --.-KB/s in 0.02s \n", "\n", "2023-10-18 21:56:27 (4.52 MB/s) - ‘alignment.txt’ saved [86458/86458]\n", "\n" ] } ], "source": [ "!wget http://mscbio2025.csb.pitt.edu/files/alignment.txt" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BLASTP 2.2.28+\r\n", "Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro\r\n", "A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and\r\n", "David J. 
Lipman (1997), \"Gapped BLAST and PSI-BLAST: a new\r\n", "generation of protein database search programs\", Nucleic\r\n", "Acids Res. 25:3389-3402.\r\n", "\r\n", "\r\n", "Reference for compositional score matrix adjustment: Stephen\r\n", "F. Altschul, John C. Wootton, E. Michael Gertz, Richa\r\n" ] } ], "source": [ "!head alignment.txt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Answer these questions using regular expressions\n", "\n", "What is the average length of the sequences returned?\n", "\n", "How many sequences are from the pdb?\n", "\n", "Can you extract just the subject sequences?" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [], "source": [ "data = open('alignment.txt').read()" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }