{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Text Processing with the Linux Commandline\n",
    "## 8/31/2023\n",
    "\n",
    "<a href=\"?print-pdf\">print view</a><br>\n",
    "<a href=\"bash2.ipynb\">notebook</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<script src=\"https://code.jquery.com/jquery-3.7.1.min.js\"></script>\n",
       "<script src=\"https://requirejs.org/docs/release/2.3.6/minified/require.js\"></script>\n",
       "<script src=\"https://bits.csb.pitt.edu/asker.js/lib/asker.js\"></script>\n",
       "<style>\n",
       ".reveal pre { font-size: 100%; overflow-x: auto; overflow-y: auto;}\n",
       ".reveal h1 { font-size: 2em}\n",
       ".reveal ol {display: block;}\n",
       ".reveal ul {display: block;}\n",
       ".reveal .slides>section>section.present { max-height: 100%; overflow-y: auto;}\n",
       "\n",
       ".jp-OutputArea-output { padding: 0; }\n",
       "</style>\n",
       "\n",
       "\n",
       "<script>\n",
       "$3Dmolpromise = new Promise((resolve, reject) => { \n",
       "    require(['https://3Dmol.org/build/3Dmol.js'], function(){       \n",
       "            resolve();});\n",
       "});\n",
       "require(['https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.2.2/Chart.js'], function(Ch){\n",
       " Chart = Ch;\n",
       "});\n",
       "\n",
       "$('head').append('<link rel=\"stylesheet\" href=\"https://bits.csb.pitt.edu/asker.js/themes/asker.default.css\" />');\n",
       "\n",
       "\n",
       "//the callback is provided a canvas object and data \n",
       "var chartmaker = function(canvas, labels, data) {\n",
       "  var ctx = $(canvas).get(0).getContext(\"2d\");\n",
       "     var dataset = {labels: labels,                     \n",
       "    datasets:[{\n",
       "     data: data,\n",
       "     backgroundColor: \"rgba(150,64,150,0.5)\",\n",
       "         fillColor: \"rgba(150,64,150,0.8)\",    \n",
       "  }]};\n",
       "  var myBarChart = new Chart(ctx,{type:'bar',data:dataset,options:{legend: {display:false},\n",
       "        scales: {\n",
       "            yAxes: [{\n",
       "                ticks: {\n",
       "                    min: 0,\n",
       "                }\n",
       "            }]}}});\n",
       "};\n",
       "\n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<script src=\"https://code.jquery.com/jquery-3.7.1.min.js\"></script>\n",
    "<script src=\"https://requirejs.org/docs/release/2.3.6/minified/require.js\"></script>\n",
    "<script src=\"https://bits.csb.pitt.edu/asker.js/lib/asker.js\"></script>\n",
    "<style>\n",
    ".reveal pre { font-size: 100%; overflow-x: auto; overflow-y: auto;}\n",
    ".reveal h1 { font-size: 2em}\n",
    ".reveal ol {display: block;}\n",
    ".reveal ul {display: block;}\n",
    ".reveal .slides>section>section.present { max-height: 100%; overflow-y: auto;}\n",
    "\n",
    ".jp-OutputArea-output { padding: 0; }\n",
    "</style>\n",
    "\n",
    "\n",
    "<script>\n",
    "$3Dmolpromise = new Promise((resolve, reject) => { \n",
    "    require(['https://3Dmol.org/build/3Dmol.js'], function(){       \n",
    "            resolve();});\n",
    "});\n",
    "require(['https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.2.2/Chart.js'], function(Ch){\n",
    " Chart = Ch;\n",
    "});\n",
    "\n",
    "$('head').append('<link rel=\"stylesheet\" href=\"https://bits.csb.pitt.edu/asker.js/themes/asker.default.css\" />');\n",
    "\n",
    "\n",
    "//the callback is provided a canvas object and data \n",
    "var chartmaker = function(canvas, labels, data) {\n",
    "  var ctx = $(canvas).get(0).getContext(\"2d\");\n",
    "     var dataset = {labels: labels,                     \n",
    "    datasets:[{\n",
    "     data: data,\n",
    "     backgroundColor: \"rgba(150,64,150,0.5)\",\n",
    "         fillColor: \"rgba(150,64,150,0.8)\",    \n",
    "  }]};\n",
    "  var myBarChart = new Chart(ctx,{type:'bar',data:dataset,options:{legend: {display:false},\n",
    "        scales: {\n",
    "            yAxes: [{\n",
    "                ticks: {\n",
    "                    min: 0,\n",
    "                }\n",
    "            }]}}});\n",
    "};\n",
    "\n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "\n",
    "</script>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div id=\"checkdotdot\" style=\"width: 500px\"></div>\n",
       "<script>\n",
       "\n",
       "\tjQuery('#checkdotdot').asker({\n",
       "\t    id: \"checkdotdot\",\n",
       "\t    question: \"Which command changes to the previous directory?\",\n",
       "\t\tanswers: [\"cd\", \"cd .\", \"cd ..\",\"cd /..\"],\n",
       "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
       "\t\tcharter: chartmaker})\n",
       "    \n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<div id=\"checkdotdot\" style=\"width: 500px\"></div>\n",
    "<script>\n",
    "\n",
    "\tjQuery('#checkdotdot').asker({\n",
    "\t    id: \"checkdotdot\",\n",
    "\t    question: \"Which command changes to the previous directory?\",\n",
    "\t\tanswers: [\"cd\", \"cd .\", \"cd ..\",\"cd /..\"],\n",
    "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
    "\t\tcharter: chartmaker})\n",
    "    \n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "</script>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Review\n",
    "\n",
    "<tt>ls</tt> - list files\n",
    "\n",
    "<tt>cd</tt> - change directory\n",
    "\n",
    "<tt>pwd</tt> - print working (current) directory\n",
    "\n",
    "<tt>..</tt> - special file that refers to parent directory\n",
    "\n",
    "<tt>.</tt> - the current directory\n",
    "\n",
    "<tt>cat <em>file</em></tt> - print out contents of file\n",
    "\n",
    "<tt>more <em>file</em></tt> - print contents of file with pagination"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Shortcuts\n",
    "\n",
    "`Tab` autocomplete\n",
    "\n",
    "`Ctrl-D`  EOF/logout/exit\n",
    "\n",
    "`Ctrl-A`  go to beginning of line\n",
    "\n",
    "`Ctrl-E`  go to end of line\n",
    "\n",
    "`alias new=cmd`\n",
    "\n",
    "<pre>\n",
    "make a nickname for a command\n",
    "$ alias l='ls -l'\n",
    "$ alias\n",
    "$ l\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## `.bashrc` example\n",
    "\n",
    "```\n",
    "HISTCONTROL=ignoredups\n",
    "\n",
    "#immediately append instead of at end of session, clear and re-read .bash_history\n",
    "export PROMPT_COMMAND=\"history -a; history -c; history -r\"\n",
    "#append instead of overwrite history\n",
    "shopt -s histappend\n",
    "\n",
    "export HISTSIZE=1000000\n",
    "\n",
    "# If set, Bash checks the window size after each command \n",
    "shopt -s checkwinsize\n",
    "\n",
    "alias mroe=more\n",
    "alias grpe=grep\n",
    "\n",
    "export PYTHONPATH=$PYTHONPATH:/usr/local/python\n",
    "export PATH=$PATH:$HOME/bin\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "\n",
    "## Loops\n",
    "\n",
    "```bash\n",
    "for i in x y z\n",
    "do\n",
    " echo $i\n",
    "done\n",
    "\n",
    "for file in *.txt\n",
    "do\n",
    " echo $file\n",
    "done\n",
    "```\n",
    "\n",
    "<a href=\"http://tldp.org/LDP/abs/html/loops.html\">Lots more... (TLDP)</a>"
   ]
  },
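  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A minimal sketch of the file loop above, run on throwaway data. The directory `loopdemo` and its two files are made up for illustration. Quoting the variable keeps file names that contain spaces in one piece.\n",
    "\n",
    "```bash\n",
    "# hypothetical scratch directory and files, not part of the course data\n",
    "mkdir -p loopdemo\n",
    "echo one > loopdemo/a.txt\n",
    "echo two > loopdemo/b.txt\n",
    "\n",
    "# the shell expands the glob to the matching names before the loop runs\n",
    "for file in loopdemo/*.txt\n",
    "do\n",
    " echo \"$file\"\n",
    "done\n",
    "```\n",
    "\n",
    "This should print the two paths, one per line."
   ]
  },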
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "```bash\n",
    "for i in {1..10}\n",
    "do\n",
    " echo $i\n",
    "done\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div id=\"bashloopq\" style=\"width: 500px\"></div>\n",
       "<script>\n",
       "\n",
       "\tjQuery('#bashloopq').asker({\n",
       "\t    id: \"bashloopq\",\n",
       "\t    question: \"What is the last line to print out?\",\n",
       "\t\tanswers: [\"{1..10}\",\"}\", \"9\",\"10\",\"An Error\"],\n",
       "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
       "\t\tcharter: chartmaker})\n",
       "    \n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<div id=\"bashloopq\" style=\"width: 500px\"></div>\n",
    "<script>\n",
    "\n",
    "\tjQuery('#bashloopq').asker({\n",
    "\t    id: \"bashloopq\",\n",
    "\t    question: \"What is the last line to print out?\",\n",
    "\t\tanswers: [\"{1..10}\",\"}\", \"9\",\"10\",\"An Error\"],\n",
    "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
    "\t\tcharter: chartmaker})\n",
    "    \n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "\n",
    "</script>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# I/O Redirection\n",
    "\n",
    "`>` send *standard output* to file\n",
    "\n",
    "<pre>\n",
    "$ echo Hello > h.txt\n",
    "</pre>\n",
    "\n",
    "`>>` append to file\n",
    "\n",
    "<pre>\n",
    "$ echo World >> h.txt\n",
    "</pre>\n",
    "\n",
    "`<`  send file to *standard input* of command\n",
    "\n",
    "`2>`  send *standard error* to file\n",
    "\n",
    "`>&`  send output and error to file\n",
    "\n"
   ]
  },
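  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A small sketch of the last three operators. The names `in.txt`, `err.txt`, `both.txt`, and `no_such_file` are made up for illustration, and `>&` followed by a file name is assumed to run under bash (other shells may reject it).\n",
    "\n",
    "```bash\n",
    "# hypothetical input file\n",
    "echo hello > in.txt\n",
    "\n",
    "# < feeds the file to wc on standard input, so wc prints no file name\n",
    "wc -l < in.txt\n",
    "\n",
    "# 2> captures only the error message (ls fails on purpose here)\n",
    "ls no_such_file 2> err.txt || true\n",
    "\n",
    "# >& captures both standard output and standard error in one file\n",
    "ls in.txt no_such_file >& both.txt || true\n",
    "```\n",
    "\n",
    "Afterwards `err.txt` should hold only the error message, while `both.txt` should hold the error message and the line `in.txt`."
   ]
  },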
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "```bash\n",
    "$ echo Hello > h.txt\n",
    "$ echo World >> h.txt\n",
    "$ cat h.txt\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div id=\"q1\" style=\"width: 500px\"></div>\n",
       "<script>\n",
       "\n",
       "\tjQuery('#q1').asker({\n",
       "\t    id: \"ioquestion\",\n",
       "\t    question: \"What prints out?\",\n",
       "\t\tanswers: [\"Hello\",\"World\", \"HelloWorld\", \"<Br>Hello<br>World\",\"An Error\"],\n",
       "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
       "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
       "\t\tcharter: chartmaker})\n",
       "    \n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<div id=\"q1\" style=\"width: 500px\"></div>\n",
    "<script>\n",
    "\n",
    "\tjQuery('#q1').asker({\n",
    "\t    id: \"ioquestion\",\n",
    "\t    question: \"What prints out?\",\n",
    "\t\tanswers: [\"Hello\",\"World\", \"HelloWorld\", \"<Br>Hello<br>World\",\"An Error\"],\n",
    "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
    "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
    "\t\tcharter: chartmaker})\n",
    "    \n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "</script>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "```bash\n",
    "$ echo Hello > h.txt\n",
    "$ echo World > h.txt\n",
    "$ cat h.txt\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div id=\"q2\" style=\"width: 500px\"></div>\n",
       "<script>\n",
       "\n",
       "\tjQuery('#q2').asker({\n",
       "\t    id: \"ioquestion2\",\n",
       "\t    question: \"What prints out?\",\n",
       "\t\tanswers: [\"Hello\",\"World\", \"HelloWorld\", \"<Br>Hello<br>World\",\"An Error\"],\n",
       "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
       "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
       "\t\tcharter: chartmaker})\n",
       "    \n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<div id=\"q2\" style=\"width: 500px\"></div>\n",
    "<script>\n",
    "\n",
    "\tjQuery('#q2').asker({\n",
    "\t    id: \"ioquestion2\",\n",
    "\t    question: \"What prints out?\",\n",
    "\t\tanswers: [\"Hello\",\"World\", \"HelloWorld\", \"<Br>Hello<br>World\",\"An Error\"],\n",
    "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
    "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
    "\t\tcharter: chartmaker})\n",
    "    \n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "</script>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Pipes\n",
    "\n",
    "A pipe (<tt>|</tt>) redirects the *standard output* of one program to the *standard input* of another.  It's like you typed the output of the first program into the second.  This allows us to chain several simple programs together to do something more complicated.\n",
    "<pre>\n",
    "$ echo Hello World | wc\n",
    "</pre>"
   ]
  },
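  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A sketch of a longer chain, assuming made-up input produced by `printf`. Each stage reads the standard output of the stage before it.\n",
    "\n",
    "```bash\n",
    "# printf emits three lines, sort orders them, head keeps only the first\n",
    "printf 'pear\\napple\\nfig\\n' | sort | head -n 1    # prints apple\n",
    "```\n",
    "\n",
    "No intermediate file is needed: the data flows straight from one program to the next."
   ]
  },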
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Simple Text Manipulation\n",
    "\n",
    "`cat` dump file to stdout\n",
    "\n",
    "`more` paginated output\n",
    "\n",
    "`head` show first 10 lines\n",
    "\n",
    "`tail` show last 10 lines\n",
    "\n",
    "`wc` count lines/words/characters\n",
    "\n",
    "`sort` sort file by line and print out (<tt>-n</tt> for numerical sort)\n",
    "\n",
    "`uniq` remove **adjacent** duplicates (<tt>-c</tt> to count occurances)\n",
    "\n",
    "`cut` extract fixed width columns from file\n"
   ]
  },
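  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A quick sketch of several of these commands on a made-up file. The name `nums.txt` and its contents are assumptions for illustration only.\n",
    "\n",
    "```bash\n",
    "# hypothetical data: one number per line\n",
    "printf '10\\n9\\n100\\n' > nums.txt\n",
    "\n",
    "sort nums.txt                   # text order: 10, 100, 9\n",
    "sort -n nums.txt                # numeric order: 9, 10, 100\n",
    "sort -n nums.txt | tail -n 1    # largest value: 100\n",
    "wc -l < nums.txt                # number of lines: 3\n",
    "cut -c 1-2 nums.txt             # first two characters of each line: 10, 9, 10\n",
    "```\n",
    "\n",
    "The difference between the first two commands is why <tt>-n</tt> matters whenever the data are numbers."
   ]
  },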
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<pre>\n",
    "$ cat text\n",
    "a\n",
    "b\n",
    "a\n",
    "b\n",
    "b\n",
    "$ cat text | uniq | wc\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div id=\"q3\" style=\"width: 500px\"></div>\n",
       "<script>\n",
       "\n",
       "\tjQuery('#q3').asker({\n",
       "\t    id: \"simplepipe\",\n",
       "\t    question: \"What is the first number to print out?\",\n",
       "\t\tanswers: [\"1\", \"2\",\"3\",\"4\",\"5\",\"None of the above\"],\n",
       "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
       "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
       "\t\tcharter: chartmaker})\n",
       "    \n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<div id=\"q3\" style=\"width: 500px\"></div>\n",
    "<script>\n",
    "\n",
    "\tjQuery('#q3').asker({\n",
    "\t    id: \"simplepipe\",\n",
    "\t    question: \"What is the first number to print out?\",\n",
    "\t\tanswers: [\"1\", \"2\",\"3\",\"4\",\"5\",\"None of the above\"],\n",
    "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
    "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
    "\t\tcharter: chartmaker})\n",
    "    \n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "</script>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<pre>\n",
    "$ cat text\n",
    "a\n",
    "b\n",
    "a\n",
    "b\n",
    "b\n",
    "$ cat text | sort | uniq | wc\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div id=\"q4\" style=\"width: 500px\"></div>\n",
       "<script>\n",
       "\n",
       "\tjQuery('#q4').asker({\n",
       "\t    id: \"simplepipe2\",\n",
       "\t    question: \"What is the first number to print out?\",\n",
       "\t\tanswers: [\"1\", \"2\",\"3\",\"4\",\"5\",\"None of the above\"],\n",
       "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
       "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
       "\t\tcharter: chartmaker})\n",
       "    \n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<div id=\"q4\" style=\"width: 500px\"></div>\n",
    "<script>\n",
    "\n",
    "\tjQuery('#q4').asker({\n",
    "\t    id: \"simplepipe2\",\n",
    "\t    question: \"What is the first number to print out?\",\n",
    "\t\tanswers: [\"1\", \"2\",\"3\",\"4\",\"5\",\"None of the above\"],\n",
    "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
    "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
    "\t\tcharter: chartmaker})\n",
    "    \n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "</script>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Advanced Text Manipulation\n",
    "\n",
    "<tt>grep</tt> search contents of file for expression\n",
    "\n",
    "<tt>sed</tt> stream editor - perform substitutions\n",
    "\n",
    "<tt>awk</tt> pattern scanning and processing, great for dealing with data in columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# grep\n",
    "\n",
    "Search file contents for a pattern.\n",
    "\n",
    "<tt>grep <em>pattern</em> <em>file(s)</em></tt>\n",
    " * <tt>‐r</tt> recursive search\n",
    " * <tt>‐I</tt> skip over binary files\n",
    " * <tt>‐s</tt> suppress error messages\n",
    " * <tt>‐n</tt> show line numbers\n",
    " * <tt>‐A</tt>*N* show *N* lines after match\n",
    " * <tt>‐B</tt>*N* show *N* lines before match\n"
   ]
  },
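  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A brief sketch of a few of these options. The file `words.txt`, its contents, and the name `missing.txt` are made up for illustration.\n",
    "\n",
    "```bash\n",
    "# hypothetical file with three lines\n",
    "printf 'alpha\\nbeta\\ngamma\\n' > words.txt\n",
    "\n",
    "grep -n mm words.txt              # prints 3:gamma\n",
    "grep -B1 mm words.txt             # prints beta, then gamma\n",
    "grep -s mm missing.txt || true    # -s hides the error about the missing file\n",
    "```\n",
    "\n",
    "Options can be combined, so <tt>-rIs</tt> is a common way to search a whole directory tree quietly."
   ]
  },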
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "```bash\n",
    "$ grep a text | wc\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div id=\"q5\" style=\"width: 500px\"></div>\n",
       "<script>\n",
       "\n",
       "\tjQuery('#q5').asker({\n",
       "\t    id: \"grepq\",\n",
       "\t    question: \"What is the first number to print out?\",\n",
       "\t\tanswers: [\"1\", \"2\",\"3\",\"4\",\"5\",\"None of the above\"],\n",
       "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
       "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
       "\t\tcharter: chartmaker})\n",
       "    \n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<div id=\"q5\" style=\"width: 500px\"></div>\n",
    "<script>\n",
    "\n",
    "\tjQuery('#q5').asker({\n",
    "\t    id: \"grepq\",\n",
    "\t    question: \"What is the first number to print out?\",\n",
    "\t\tanswers: [\"1\", \"2\",\"3\",\"4\",\"5\",\"None of the above\"],\n",
    "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
    "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
    "\t\tcharter: chartmaker})\n",
    "    \n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "</script>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# grep patterns\n",
    "\n",
    "Patterns are defined using *regular expressions* which we will talk more about later.  Some useful special characters.\n",
    "\n",
    "* `^pattern`  pattern must be at start of line\n",
    "* `pattern$` pattern must be at end of line\n",
    "* `.` match any character, **not** period\n",
    "* `.*` match any charcter repeated any number of times\n",
    "* `\\.` escape a special character to treat it literally (i.e., this matches period)"
   ]
  },
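  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "The special characters above can be checked on a tiny made-up file. The name `pat.txt` and its three lines are assumptions for illustration, and <tt>-c</tt> makes grep print the number of matching lines.\n",
    "\n",
    "```bash\n",
    "# hypothetical file; single quotes keep the shell from touching the pattern\n",
    "printf 'a.b\\naxb\\nba\\n' > pat.txt\n",
    "\n",
    "grep -c '^a' pat.txt      # 2 lines start with a\n",
    "grep -c 'a$' pat.txt      # 1 line ends with a\n",
    "grep -c 'a.b' pat.txt     # 2: the dot matches any character\n",
    "grep -c 'a\\.b' pat.txt    # 1: the escaped dot matches only a period\n",
    "```\n",
    "\n",
    "Quoting the pattern is a good habit, since characters such as `*` and `$` also mean something to the shell."
   ]
  },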
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# sed\n",
    "Search and replace\n",
    "\n",
    "<pre>\n",
    "sed 's/<em>pattern</em>/<em>replacement</em>/' <em>file</em>\n",
    "</pre>\n",
    "\n",
    " * <tt>‐i</tt> replace in-place (overwrites input file)\n",
    "\n"
   ]
  },
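  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A short sketch of how the substitution behaves, using made-up input piped in from `echo` rather than a file. The trailing `g` flag is standard sed but is not otherwise covered in these slides.\n",
    "\n",
    "```bash\n",
    "echo 'aaa' | sed 's/a/b/'            # baa: only the first match on the line changes\n",
    "echo 'aaa' | sed 's/a/b/g'           # bbb: the g flag replaces every match\n",
    "echo 'cat hat' | sed 's/.at$/og/'    # cat og: the pattern is a regular expression\n",
    "```\n",
    "\n",
    "Without <tt>-i</tt> the result only goes to standard output, so it is safe to experiment before overwriting anything."
   ]
  },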
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "```bash\n",
    "$ sed 's/a/b/' text | uniq | wc\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div id=\"q6\" style=\"width: 500px\"></div>\n",
       "<script>\n",
       "\n",
       "\tjQuery('#q6').asker({\n",
       "\t    id: \"sedq\",\n",
       "\t    question: \"What is the first number to print out?\",\n",
       "\t\tanswers: [\"1\", \"2\",\"3\",\"4\",\"5\",\"None of the above\"],\n",
       "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
       "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
       "\t\tcharter: chartmaker})\n",
       "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
       "\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<div id=\"q6\" style=\"width: 500px\"></div>\n",
    "<script>\n",
    "\n",
    "\tjQuery('#q6').asker({\n",
    "\t    id: \"sedq\",\n",
    "\t    question: \"What is the first number to print out?\",\n",
    "\t\tanswers: [\"1\", \"2\",\"3\",\"4\",\"5\",\"None of the above\"],\n",
    "\t\textra: [\"\",\"\",\"\",\"\",\"\",\"\"],\n",
    "        server: \"https://bits.csb.pitt.edu/asker.js/example/asker.cgi\",\n",
    "\t\tcharter: chartmaker})\n",
    "$(\".jp-InputArea .o:contains(html)\").closest('.jp-InputArea').hide();\n",
    "\n",
    "</script>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# awk\n",
    "Pattern scanning and processing language. We'll mostly use it to extract columns/fields. It processes a file line-by-line and if a condition holds runs a simple program on the line.\n",
    "\n",
    "<tt> awk '<em>optional condition</em> {<em>awk program</em>}' <em>file</em></tt>\n",
    "* <tt>-F<em>x</em></tt> make *x* the field deliminator (default whitespace)\n",
    "* <tt>NF</tt> number of fields on current line\n",
    "* <tt>NR</tt> current record number\n",
    "* <tt>\\$0</tt> full line\n",
    "* <tt>\\$<em>N</em></tt> Nth field"
   ]
  },
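  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A minimal sketch of the built-in variables on made-up data. The file `fields.txt` and its colon-separated contents are assumptions for illustration.\n",
    "\n",
    "```bash\n",
    "# hypothetical colon-separated data: two lines with 2 and 3 fields\n",
    "printf 'x:1\\ny:2:extra\\n' > fields.txt\n",
    "\n",
    "awk -F: '{print NR, NF, $1}' fields.txt    # prints 1 2 x, then 2 3 y\n",
    "awk -F: 'NF > 2 {print $0}' fields.txt     # prints y:2:extra\n",
    "awk -F: '{print $NF}' fields.txt           # last field of each line: 1, then extra\n",
    "```\n",
    "\n",
    "The awk program sits in single quotes so that the shell does not try to expand the field references itself."
   ]
  },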
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# awk\n",
    "\n",
    "```bash\n",
    "$ cat names\n",
    "id last,first \n",
    "1 Smith,Alice\n",
    "2 Jones,Bob\n",
    "3 Smith,Charlie\n",
    "```\n",
    "Try these:\n",
    "\n",
    "```bash\n",
    "$ awk '{print $1}' names\n",
    "$ awk -F, '{print $2}' names\n",
    "$ awk 'NR > 1 {print $2}' names \n",
    "$ awk '$1 > 1 {print $0}' names\n",
    "$ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Exercises\n",
    "\n",
    "```bash\n",
    "mkdir intro\n",
    "cd intro\n",
    "wget http://mscbio2025.net/files/Spellman.csv\n",
    "wget http://mscbio2025.net/files/1shs.pdb\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Questions\n",
    "\n",
    "- How many data points are in Spellman.csv?\n",
    "-  The first three letters of the systematic open reading frames are: 'Y' for yeast, the chromosome number, then the chromosome arm. In the dataset, how many ORFs from chromosome A are there?\n",
    "- How many are there from each chromosome? \n",
    "  - each chromosome arm?\n",
    "- How many data points start with a positive expression value?\n",
    "- What are the 10 data points with the highest initial expression values?\n",
    "  - Lowest?\n",
    "- How many lines are there where expression values are continuously increasing for the first 3 time steps?\n",
    "- Sorted by biggest increase?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "```bash\n",
    "wc Spellman.csv   (gives number of lines, because of header this is off by one)\n",
    "grep YA Spellman.csv |wc\n",
    "grep ^YA Spellman.csv |wc  (this is a bit better, ^ matches begining of line)\n",
    "grep ^YA -c Spellman.csv  (grep can provide the count itself)\n",
    "awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c\n",
    "awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c\n",
    "awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc\n",
    "awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n | tail\n",
    "awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n -r | tail\n",
    "awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv  |wc\n",
    "awk -F, 'NR > 1 && $3 > $2 && $4 > $3  {print $4-$2,$0}' Spellman.csv   | sort -n -k1,1\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# More\n",
    "\n",
    "- Create a pdb file from 1shs that consists of only ATOM records. \n",
    "- Create a pdb with only ATOM records from chain A. (The chain is the fifth column* of an atom record)\n",
    "- How many carbon atoms are in this file?\n",
    "- Create a pdb with only the ATOM records from chain G, but with the chain renamed to be A.\n",
    "\n",
    "\\*PDB files are actually fixed files, not space deliminated, but with this file you can ignore that distinction.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    },
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "```bash\n",
    "grep ^ATOM 1shs.pdb > newpdb.pdb (^matches beginning of line)\n",
    "grep ^ATOM 1shs.pdb | awk '$5 == \"A\" {print $0}'\n",
    "#this is UNSAFE with pdb files since there is no guarantee that fields\n",
    "#will be whitespace seperated, safer is:\n",
    "grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == \"A\" {print $0}' > newpdb.pdb\n",
    " \n",
    "grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == \"A\" {print $0}' | cut -b 78- | sort | uniq -c\n",
    "grep ^ATOM 1shs.pdb | awk '$5 == \"A\" {print $0}' | sed 's/ G / A /'\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Running Python\n",
    "\n",
    "```bash\n",
    "$ cat hi.py \n",
    "print(\"hi\")\n",
    "$ python3 hi.py\n",
    "hi\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```bash\n",
    "$ cat hi.py \n",
    "#!/usr/bin/python3\n",
    "print(\"hi\")\n",
    "$ chmod +x hi.py  <em>make the file executable</em>\n",
    "$ ls -l hi.py \n",
    "-rwxr-xr-x  1 dkoes  staff  29 Sep  3 16:05 hi.py\n",
    "$ ./hi.py \n",
    "hi\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Python Versions\n",
    "\n",
    "**python2**  Legacy python.  \n",
    "\n",
    "**python3** Released in 2008. Mostly the same as python2 but \"cleaned up\".  Breaks backwards compatibility. May need to specify explicity (`python3`). *We will be using python3*.\n",
    "\n",
    "https://wiki.python.org/moin/Python2orPython3\n",
    "\n",
    "```bash\n",
    "~$ python\n",
    "Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux\n",
    "Type \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n",
    ">>> \n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# IPython\n",
    "\n",
    "##  A powerful interactive shell\n",
    "* Tab complete commands, file names\n",
    "* Support for a number of \"shell\" commands (ls, cd, pwd, etc)\n",
    "* Supports up arrow, `Ctrl-r`\n",
    "* Persistent command history across sessions\n",
    "* Backbone of notebooks...\n",
    "\n",
    "```bash\n",
    "~$ ipython\n",
    "Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]\n",
    "Type 'copyright', 'credits' or 'license' for more information\n",
    "IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.\n",
    "\n",
    "In [1]:  \n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# ipython notebook  \n",
    "\n",
    "<strike>\n",
    "<pre>\n",
    "$ ipython notebook\n",
    "</pre>\n",
    "</strike>\n",
    "\n",
    "<pre>\n",
    "$ jupyter notebook\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now called Jupyter (not just for python) <a href=\"https://jupyter.org\">jupyter.org</a>\n",
    "\n",
    "IPython in your browser.  Save your code *and* your output.\n",
    "\n",
    "[Colab](https://colab.research.google.com/) is basically a Google hosted Jupyter notebook.\n",
    "\n",
    "Demo: running code (shift-enter), cell types, saving and exporting, kernel state"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Why Jupyter notebook?\n",
    "\n",
    "* A \"lab notebook\" for data science\n",
    "* See output as you run commands\n",
    "* Embedded figures/output\n",
    "* Easy to modify and rerun steps\n",
    "* Can embed formatted text - share code *and* reason for code\n",
    "* Can convert to multiple formats (html, pdf, raw python, even slides)\n",
    "\n",
    "[A different perspective](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/present?token=AC4w5ViEY1bIVsQHr8Z_JV3-l800VDuEpg%3A1536066747968&includes_info_params=1#slide=id.g362da58057_0_1)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}