{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# multiprocessing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "## 11/30/2023\n", "print view\n", "\n", "notebook" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# OMETs\n", "\n", "Please fill out.\n", "\n", "https://teaching.pitt.edu/omet/student-information/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Parallel Programming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several parallel programming models enabled by a variety of hardware (**multicore**, cloud computing, supercomputers, GPU).\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Threads vs. Processes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A **thread** of execution is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler. \n", "\n", "A **process** is an instance of a computer program.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Address Spaces" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A process has its own *address space*. An address space is a mapping of **virtual memory addresses** to **physical memory addresses** managed by the operating system.\n", "\n", "Address spaces prevent processes from crashing other applications or the operating system - they can only access their *own* memory.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Threads vs. Processes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import threading,time, math\n", "\n", "cnt = [0]\n", "\n", "def incrementCnt(cnt):\n", " for i in range(1000000): # a million times\n", " cnt[0] += math.sqrt(1)\n", "\n", "t1 = threading.Thread(target=incrementCnt,args=(cnt,))\n", "t2 = threading.Thread(target=incrementCnt,args=(cnt,))\n", "\n", "t1.start()\n", "t2.start()\n", "\n", "print(cnt[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do we expect to print out?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Answer" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[75813.0]\n", "[1182366.0]\n", "[1182366.0]\n" ] } ], "source": [ "cnt = [0]\n", "\n", "t1 = threading.Thread(target=incrementCnt,args=(cnt,))\n", "t2 = threading.Thread(target=incrementCnt,args=(cnt,))\n", "\n", "t1.start()\n", "t2.start()\n", "\n", "print(cnt) \n", "time.sleep(1)\n", "print(cnt)\n", "time.sleep(1)\n", "print(cnt)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Threads vs. Processes" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "import multiprocess,time\n", "\n", "def incrementCnt(cnt):\n", " for i in range(1000000): # a million times\n", " cnt[0] += math.sqrt(1)\n", " \n", "cnt = [0]\n", "\n", "p1 = multiprocess.Process(target=incrementCnt,args=(cnt,))\n", "p2 = multiprocess.Process(target=incrementCnt,args=(cnt,))\n", "\n", "p1.start()\n", "p2.start()\n", "\n", "#what do we expect when we print out cnt[0]?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Answer" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "0\n" ] } ], "source": [ "cnt = [0]\n", "p1 = multiprocess.Process(target=incrementCnt,args=(cnt,))\n", "p2 = multiprocess.Process(target=incrementCnt,args=(cnt,))\n", "\n", "p1.start()\n", "p2.start()\n", "\n", "print(cnt[0])\n", "time.sleep(3)\n", "print(cnt[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Threads/Process Objects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thread and Process Objects have the same interface \n", " * `start` - start running the `target` function with (optional) `args`\n", " * `join` - wait for thread to finish before doing anything else\n", " * `is_alive` - returns true if thread is still running" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Example" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "78226.0 True True\n", "1246141.0 False False\n" ] } ], "source": [ "t1 = threading.Thread(target=incrementCnt,args=(cnt,))\n", "t2 = threading.Thread(target=incrementCnt,args=(cnt,))\n", "\n", "t1.start() #start launches the thread to run target with args\n", "t2.start()\n", "\n", "print(cnt[0],t1.is_alive(),t2.is_alive())\n", "\n", "t1.join() #join waits for a thread to finish\n", "t2.join() #if you don't join a Process, it will become a zombie\n", "\n", "print(cnt[0],t1.is_alive(),t2.is_alive())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Synchronization\n", "\n", "When two threads update the same resource, their access to that resource needs to be gated.\n", "\n", "There are a variety of synchronization primitives provided for both threads and processes (although they usually aren't needed for processes):\n", "\n", "### Lock\n", "Can be acquired by exactly one thread (calling acquire twice from the same thread will hang). Must be released to be acquired by another thread. Basically, just wrap the *critical section* with an acquire-release pair.\n", "\n", "### RLock\n", "A *recursive* lock. This is a lock that can be acquired multiple times by the same thread (and then must be released exactly the same number times). This is especially useful for modular programming. For example, you can acquire/release a lock around a function call without working about what that function is doing with the lock.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lock Example" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2000000.0\n" ] } ], "source": [ "lock = threading.Lock()\n", "\n", "def incrementCnt(cnt):\n", " for i in range(1000000): # a million times\n", " lock.acquire() #only one thread can acquire the lock at a time\n", " cnt[0] += math.sqrt(1) #this is the CRITICAL SECTION\n", " lock.release()\n", "\n", "cnt = [0]\n", "t1 = threading.Thread(target=incrementCnt,args=(cnt,))\n", "t2 = threading.Thread(target=incrementCnt,args=(cnt,))\n", "\n", "t1.start() #start launches the thread to run target with args\n", "t2.start()\n", "\n", "t1.join() \n", "t2.join() \n", "\n", "print(cnt[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Communication" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Threads communicate through their shared address space, and use locks to protect sensitive state. Several classes provide a simplified interface for communication (these are available with processes as well, but the underlying implementation is different). \n", "\n", "### Queue\n", "`Queue.Queue` (`multiprocess.Queue` for processes) provides a simple, thread-safe, first-in-first-out queue.\n", "\n", "* `put`: put an element on the queue. **This will block if the queue has filled up**\n", "* `get`: get an element from the queue. **This will block if the queue is empty**. \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Communication\n", "\n", "### Pipe\n", "A pipe is a communication channel between processes that can send and receive messages. *Unlike a queue, it is not okay for multiple threads to write to the same end of the pipe at the same time*.\n", "\n", "`Pipe()` return a duple of `Connection` objects representing the ends of the pipe. Each connection object has the following methods:\n", "\n", "* `send`: sends data to other end of the pipe\n", "* `recv`: waits for data from other end of the pipe (unless pipe closed, then `EOFError`)\n", "* `close`: close the pipe" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Communication" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since they don't have a shared address space, it is recommended the *processes* use exclusively either `multiprocess.Queue` or `multiprocess.Pipe` to communicate.\n", "\n", "Use a pipe for communication between two processes (server-client architecture).\n", "\n", "Use a queue for one-way communication between many threads (producer-consumer architecture)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pipes" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "you sent me Hello!\n" ] } ], "source": [ "import multiprocess\n", "\n", "def chatty(conn): #this takes a Connection object representing one end of a pipe\n", " msg = conn.recv()\n", " conn.send(\"you sent me \"+msg)\n", " \n", "(c1,c2) = multiprocess.Pipe()\n", "\n", "p1 = multiprocess.Process(target=chatty,args=(c2,))\n", "p1.start()\n", "\n", "c1.send(\"Hello!\")\n", "result = c1.recv()\n", "p1.join()\n", "\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`multiprocess` supports the concept of a *pool* of workers. You initialize with the number of processes you want to run in parallel (the default is the number of CPUs on your system) and they are available for doing parallel work:\n", "\n", "* `map`: the most important function - just like the built-in map, this will map a function to an iterable and return the result, but the *mapping will be done in parallel*. Blocks until the full result is computed and the result is in the proper order.\n", "* `map_async`: map without blocking\n", "* `imap`: parallel imap - returns iterable instead of list. The `next()` method of the returned iterable will block if necessary.\n", "* `imap_unordered`: same as imap but returns values in order they are computed\n", "* `close`: close the pool to prevent additional jobs from being submitted\n", "* `join`: must call `close` first; waits for all jobs to complete" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pool Example" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]\n" ] } ], "source": [ "def f(x):\n", " return x*x\n", "\n", "pool = multiprocess.Pool(processes=4)\n", "\n", "print(pool.map(f,range(20)))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Producer/Consumer" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def dowork(inQ, outQ):\n", " val = inQ.get()\n", " outQ.put(val*val)\n", "\n", "inQ = multiprocess.Queue()\n", "outQ = multiprocess.Queue()\n", "pool = multiprocess.Pool(4, dowork, (inQ, outQ))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "inQ.put(4)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outQ.get()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bad Example" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "#What is wrong with this code?\n", "for i in range(10):\n", " inQ.put(i)\n", "\n", "while not outQ.empty():\n", " print(outQ.get())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You should not produce and consume in the same thread. If `outQ` fills up, `inQ` will fill up and the `put` will block.\n", "\n", "The `empty`/`full` methods of the Queue class are pretty much **useless** since their result depends on when they are called. Here, no values had been generated when it was called so nothing is printed.\n", "\n", "In order communicate that there are no more values, you must send a *sentinal* value." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Threads or Processes?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normally, the choice between threads and processes depends on how data will be accessed and the level of communication between parallel tasks etc.\n", "\n", "However, in Python, the answer is almost always **use multiprocessing, not threading**.\n", "\n", "Why? The CPython interpretter has a *Global Interpretter Lock*. This means that only **one** thread of python code is actually executed at any given time when using threads. With processes, each process has its own interpretter (with its own lock).\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Embarassingly Parallel\n", "\n", "Writing correct, high performance parallel code can be difficult, but some in some cases it's trivial.\n", "\n", "A problem is *embarassingly parallel* if it can be easily separated into *independent* subtasks (i.e., no need to communicate between threads) each of which does a substantial amount of computation.\n", "\n", "Fortunately this is often the case.\n", " * Apply this filter to 1000 images\n", " * Process these 5000 protein structures\n", " * Compute RMSDs of all frames in a trajectory\n", " \n", "In cases like these, using **Pools** will get you a significant speedup (by however many cores you have)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Key Concepts\n", "\n", "* You can running many things at the same time with threads/processes\n", "* This is an easy way to make things faster, but can get complicated\n", "* Use `multiprocess` not threads\n", "* Processes can communicate with Queues/Pipes\n", "* 90% of the time all you need are Pools, and they are not complicated" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Project Plans?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Project\n", "\n", " * Take a search term as an argument\n", " * Use biopython to search the NCBI protein database for entries that match this term and are in the pdb\n", " * Extract the PDB entries\n", " * For each PDB entry, use ProDy to calculate the ANM modes (this can be done in parallel)\n", " * Sort the results based on the highest fractional variance in the first mode\n", " * Print out the top ten PDBs with the fractional variance of their first three modes" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "{'6KD5',\n", " '6LU7',\n", " '6W61',\n", " '6WUU',\n", " '6WX4',\n", " '6XA4',\n", " '6XBG',\n", " '6XBH',\n", " '6XBI',\n", " '6XFN',\n", " '7AKU',\n", " '7BQY',\n", " '7EIN',\n", " '7GAV',\n", " '7GAW',\n", " '7GAX',\n", " '7GAY',\n", " '7GAZ',\n", " '7GB0',\n", " '7GB1',\n", " '7GB2',\n", " '7GB3',\n", " '7GB4',\n", " '7GB5',\n", " '7GB6',\n", " '7GB7',\n", " '7GB8',\n", " '7GB9',\n", " '7GBA',\n", " '7GBB',\n", " '7GBC',\n", " '7GBD',\n", " '7GBE',\n", " '7GBF',\n", " '7GBG',\n", " '7GBH',\n", " '7GBI',\n", " '7GBJ',\n", " '7GBK',\n", " '7GBL',\n", " '7GBM',\n", " '7GBN',\n", " '7GBO',\n", " '7GBP',\n", " '7GBQ',\n", " '7GBR',\n", " '7GBS',\n", " '7GBT',\n", " '7GBU',\n", " '7GBV',\n", " '7GBW',\n", " '7GBX',\n", " '7GBY',\n", " '7GBZ',\n", " '7GC0',\n", " '7GC1',\n", " '7GC2',\n", " '7GC3',\n", " '7GC4',\n", " '7GC5',\n", " '7GC6',\n", " '7GC7',\n", " '7GC8',\n", " '7GC9',\n", " '7GCA',\n", " '7GCB',\n", " '7GCC',\n", " '7GCD',\n", " '7GCE',\n", " '7GCF',\n", " '7GCG',\n", " '7GCI',\n", " '7GCJ',\n", " '7GCK',\n", " '7GCL',\n", " '7GCM',\n", " '7GCN',\n", " '7GCO',\n", " '7GCP',\n", " '7GCQ',\n", " '7GCR',\n", " '7GCS',\n", " '7GCT',\n", " '7GCU',\n", " '7GCV',\n", " '7GCW',\n", " '7GCX',\n", " '7GCY',\n", " '7GCZ',\n", " '7GD0',\n", " '7GD1',\n", " '7GD2',\n", " '7GD3',\n", " '7GD4',\n", " '7GD5',\n", " '7GD6',\n", " '7GD7',\n", " '7GD8',\n", " '7GD9',\n", " '7GDA',\n", " '7GDB',\n", " '7GDC',\n", " '7GDD',\n", " '7GDE',\n", " '7GDF',\n", " '7GDG',\n", " '7GDH',\n", " '7GDI',\n", " '7GDJ',\n", " '7GDK',\n", " '7GDL',\n", " '7GDM',\n", " '7GDN',\n", " '7GDO',\n", " '7GDP',\n", " '7GDQ',\n", " '7GDR',\n", " '7GDS',\n", " '7GDT',\n", " '7GDU',\n", " '7GDV',\n", " '7GDW',\n", " '7GDX',\n", " '7GDY',\n", " '7GDZ',\n", " '7GE0',\n", " '7GE1',\n", " '7GE2',\n", " '7GE3',\n", " '7GE4',\n", " '7GE5',\n", " '7GE6',\n", " '7GE7',\n", " '7GE8',\n", " '7GE9',\n", " '7GEA',\n", " '7GEB',\n", " '7GEC',\n", " '7GED',\n", " '7GEE',\n", " '7GEF',\n", " '7GEG',\n", " '7GEH',\n", " '7GEI',\n", " '7GEJ',\n", " '7GEK',\n", " '7GEL',\n", " '7GEM',\n", " '7GEN',\n", " '7GEO',\n", " '7GEQ',\n", " '7GER',\n", " '7GES',\n", " '7GET',\n", " '7GEU',\n", " '7GEV',\n", " '7GEW',\n", " '7GEX',\n", " '7GEY',\n", " '7GEZ',\n", " '7GF0',\n", " '7GF1',\n", " '7GF2',\n", " '7GF3',\n", " '7GF4',\n", " '7GF5',\n", " '7GF6',\n", " '7GF7',\n", " '7GF8',\n", " '7GF9',\n", " '7GFA',\n", " '7GFB',\n", " '7GFC',\n", " '7GFD',\n", " '7GFE',\n", " '7GFF',\n", " '7GFG',\n", " '7GFH',\n", " '7GFI',\n", " '7GFJ',\n", " '7GFK',\n", " '7GFL',\n", " '7GFM',\n", " '7GFN',\n", " '7GFO',\n", " '7GFP',\n", " '7GFQ',\n", " '7GFR',\n", " '7GFS',\n", " '7GFT',\n", " '7GFU',\n", " '7GFV',\n", " '7GFW',\n", " '7GFX',\n", " '7GFY',\n", " '7GFZ',\n", " '7GG0',\n", " '7GG1',\n", " '7GG2',\n", " '7GG3',\n", " '7GG4',\n", " '7GG5',\n", " '7GG6',\n", " '7GG7',\n", " '7GG8',\n", " '7GG9',\n", " '7GGA',\n", " '7GGB',\n", " '7GGC',\n", " '7GGD',\n", " '7GGE',\n", " '7GGF',\n", " '7GGG',\n", " '7GGH',\n", " '7GGI',\n", " '7GGJ',\n", " '7GGK',\n", " '7GGL',\n", " '7GGM',\n", " '7GGN',\n", " '7GGO',\n", " '7GGP',\n", " '7GGQ',\n", " '7GGR',\n", " '7GGS',\n", " '7GGT',\n", " '7GGU',\n", " '7GGV',\n", " '7GGW',\n", " '7GGX',\n", " '7GGY',\n", " '7GGZ',\n", " '7GH0',\n", " '7GH1',\n", " '7GH2',\n", " '7GH3',\n", " '7GH4',\n", " '7GH5',\n", " '7GH6',\n", " '7GH7',\n", " '7GH8',\n", " '7GH9',\n", " '7GHA',\n", " '7GHB',\n", " '7GHC',\n", " '7GHD',\n", " '7GHE',\n", " '7GHF',\n", " '7GHG',\n", " '7GHH',\n", " '7GHI',\n", " '7GHJ',\n", " '7GHK',\n", " '7GHL',\n", " '7GHM',\n", " '7GHN',\n", " '7GHO',\n", " '7GHP',\n", " '7GHQ',\n", " '7GHR',\n", " '7GHS',\n", " '7GHT',\n", " '7GHU',\n", " '7GHV',\n", " '7GHW',\n", " '7GHX',\n", " '7GHY',\n", " '7GHZ',\n", " '7GI0',\n", " '7GI1',\n", " '7GI2',\n", " '7GI3',\n", " '7GI4',\n", " '7GI5',\n", " '7GI6',\n", " '7GI7',\n", " '7GI8',\n", " '7GI9',\n", " '7GIA',\n", " '7GIB',\n", " '7GIC',\n", " '7GID',\n", " '7GIE',\n", " '7GIF',\n", " '7GIG',\n", " '7GIH',\n", " '7GII',\n", " '7GIJ',\n", " '7GIK',\n", " '7GIL',\n", " '7GIM',\n", " '7GIN',\n", " '7GIO',\n", " '7GIP',\n", " '7GIQ',\n", " '7GIR',\n", " '7GIS',\n", " '7GIT',\n", " '7GIU',\n", " '7GIV',\n", " '7GIW',\n", " '7GIX',\n", " '7GIY',\n", " '7GIZ',\n", " '7GJ0',\n", " '7GJ1',\n", " '7GJ2',\n", " '7GJ3',\n", " '7GJ4',\n", " '7GJ5',\n", " '7GJ6',\n", " '7GJ7',\n", " '7GJ8',\n", " '7GJ9',\n", " '7GJA',\n", " '7GJB',\n", " '7GJC',\n", " '7GJD',\n", " '7GJE',\n", " '7GJF',\n", " '7GJG',\n", " '7GJH',\n", " '7GJI',\n", " '7GJJ',\n", " '7GJK',\n", " '7GJL',\n", " '7GJM',\n", " '7GJN',\n", " '7GJO',\n", " '7GJP',\n", " '7GJQ',\n", " '7GJR',\n", " '7GJS',\n", " '7GJT',\n", " '7GJU',\n", " '7GJV',\n", " '7GJW',\n", " '7GJX',\n", " '7GJY',\n", " '7GJZ',\n", " '7GK0',\n", " '7GK1',\n", " '7GK2',\n", " '7GK3',\n", " '7GK4',\n", " '7GK5',\n", " '7GK6',\n", " '7GK7',\n", " '7GK8',\n", " '7GK9',\n", " '7GKA',\n", " '7GKB',\n", " '7GKC',\n", " '7GKD',\n", " '7GKE',\n", " '7GKF',\n", " '7GKG',\n", " '7GKH',\n", " '7GKI',\n", " '7GKJ',\n", " '7GKK',\n", " '7GKL',\n", " '7GKM',\n", " '7GKN',\n", " '7GKO',\n", " '7GKP',\n", " '7GKQ',\n", " '7GKR',\n", " '7GKS',\n", " '7GKT',\n", " '7GKU',\n", " '7GKV',\n", " '7GKW',\n", " '7GKX',\n", " '7GKY',\n", " '7GKZ',\n", " '7GL0',\n", " '7GL1',\n", " '7GL2',\n", " '7GL3',\n", " '7GL4',\n", " '7GL5',\n", " '7GL6',\n", " '7GL7',\n", " '7GL8',\n", " '7GL9',\n", " '7GLA',\n", " '7GLB',\n", " '7GLC',\n", " '7GLD',\n", " '7GLE',\n", " '7GLF',\n", " '7GLG',\n", " '7GLH',\n", " '7GLI',\n", " '7GLJ',\n", " '7GLK',\n", " '7GLL',\n", " '7GLM',\n", " '7GLN',\n", " '7GLO',\n", " '7GLP',\n", " '7GLQ',\n", " '7GLR',\n", " '7GLS',\n", " '7GLT',\n", " '7GLU',\n", " '7GLV',\n", " '7GLW',\n", " '7GLX',\n", " '7GLY',\n", " '7GLZ',\n", " '7GM0',\n", " '7GM1',\n", " '7GM2',\n", " '7GM3',\n", " '7GM4',\n", " '7GM5',\n", " '7GM6',\n", " '7GM7',\n", " '7GM8',\n", " '7GM9',\n", " '7GMA',\n", " '7GMB',\n", " '7GMC',\n", " '7GMD',\n", " '7GME',\n", " '7GMF',\n", " '7GMG',\n", " '7GMH',\n", " '7GMI',\n", " '7GMJ',\n", " '7GMK',\n", " '7GML',\n", " '7GMM',\n", " '7GMN',\n", " '7GMO',\n", " '7GMP',\n", " '7GMQ',\n", " '7GMR',\n", " '7GMS',\n", " '7GMT',\n", " '7GMU',\n", " '7GMV',\n", " '7GMW',\n", " '7GMX',\n", " '7GMY',\n", " '7GMZ',\n", " '7GN0',\n", " '7GN1',\n", " '7GN2',\n", " '7GN3',\n", " '7GN4',\n", " '7GN5',\n", " '7GN6',\n", " '7GN7',\n", " '7GN8',\n", " '7GN9',\n", " '7GNA',\n", " '7GNB',\n", " '7GNC',\n", " '7GND',\n", " '7GNE',\n", " '7GNF',\n", " '7GNG',\n", " '7GNH',\n", " '7GNI',\n", " '7GNJ',\n", " '7GNK',\n", " '7GNL',\n", " '7GNM',\n", " '7GNN',\n", " '7GNO',\n", " '7GNP',\n", " '7GNQ',\n", " '7GNR',\n", " '7GNS',\n", " '7GNT',\n", " '7GNU',\n", " '7JPY',\n", " '7JPZ',\n", " '7JQ0',\n", " '7JQ1',\n", " '7JQ2',\n", " '7JQ3',\n", " '7JQ4',\n", " '7JQ5',\n", " '7LCR',\n", " '7M2P',\n", " '7MW2',\n", " '7MW3',\n", " '7MW4',\n", " '7MW5',\n", " '7MW6',\n", " '7RVN',\n", " '7RVQ',\n", " '7RVS',\n", " '7RVX',\n", " '7S4S',\n", " '7S6W',\n", " '7SH7',\n", " '7TLT',\n", " '7TUQ',\n", " '7TV0',\n", " '7UKL',\n", " '7URQ',\n", " '7URS',\n", " '7UU6',\n", " '7UU7',\n", " '7UU8',\n", " '7UU9',\n", " '7UUA',\n", " '7UUB',\n", " '7UUC',\n", " '7UUD',\n", " '7UUE',\n", " '8AJA',\n", " '8AJL',\n", " '8BWU',\n", " '8D4P',\n", " '8DDL',\n", " '8DGU',\n", " '8DGV',\n", " '8DGW',\n", " '8DGX',\n", " '8DIB',\n", " '8DIC',\n", " '8DID',\n", " '8DIE',\n", " '8DIF',\n", " '8DIG',\n", " '8DIH',\n", " '8DII',\n", " '8DL9',\n", " '8DLB',\n", " '8DMD',\n", " '8DPR',\n", " '8DS2',\n", " '8DSU',\n", " '8DTR',\n", " '8DTT',\n", " '8DTX',\n", " '8DW9',\n", " '8DWA',\n", " '8DZ0',\n", " '8DZ1',\n", " '8DZ2',\n", " '8DZ6',\n", " '8DZA',\n", " '8DZB',\n", " '8DZC',\n", " '8E25',\n", " '8E26',\n", " '8E63',\n", " '8E6A',\n", " '8E7C',\n", " '8EL2',\n", " '8ELG',\n", " '8ELH',\n", " '8EOO',\n", " '8EUA',\n", " '8F44',\n", " '8F45',\n", " '8F46',\n", " '8GIA',\n", " '8H3G',\n", " '8H3K',\n", " '8H3L',\n", " '8H4Y',\n", " '8H51',\n", " '8H57',\n", " '8H5F',\n", " '8H5P',\n", " '8H6I',\n", " '8H6N',\n", " '8H7K',\n", " '8H7W',\n", " '8H82',\n", " '8HBK',\n", " '8HI9',\n", " '8IX3',\n", " '8J5J',\n", " '8SUZ',\n", " '8TBE',\n", " '8URB'}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from Bio import Entrez\n", "import re\n", "seqfile = 'searchresults.fasta'\n", "Entrez.email = \"dkoes@pitt.edu\"\n", "records = Entrez.read(Entrez.esearch(db='protein',term='covid AND pdb[filter]',retmax=1000))\n", "result = Entrez.efetch(db='protein',id=records['IdList'],rettype='fasta',retmode='text').read()\n", "set(re.findall(r'>pdb\\|(\\S+?)\\|',result))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 1 }