{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Text Processing with the Linux Commandline\n", "## 8/31/2023" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review\n", "\n", "ls - list files\n", "\n", "cd - change directory\n", "\n", "pwd - print working (current) directory\n", "\n", ".. - special file that refers to parent directory\n", "\n", ". - the current directory\n", "\n", "cat file - print out contents of file\n", "\n", "more file - print contents of file with pagination" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Shortcuts\n", "\n", "`Tab` autocomplete\n", "\n", "`Ctrl-D` EOF/logout/exit\n", "\n", "`Ctrl-A` go to beginning of line\n", "\n", "`Ctrl-E` go to end of line\n", "\n", "`alias new=cmd`\n", "\n", "
\n",
    "make a nickname for a command\n",
    "$ alias l='ls -l'\n",
    "$ alias\n",
    "$ l\n",
    "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## `.bashrc` example\n", "\n", "```\n", "HISTCONTROL=ignoredups\n", "\n", "#immediately append instead of at end of session, clear and re-read .bash_history\n", "export PROMPT_COMMAND=\"history -a; history -c; history -r\"\n", "#append instead of overwrite history\n", "shopt -s histappend\n", "\n", "export HISTSIZE=1000000\n", "\n", "# If set, Bash checks the window size after each command \n", "shopt -s checkwinsize\n", "\n", "alias mroe=more\n", "alias grpe=grep\n", "\n", "export PYTHONPATH=$PYTHONPATH:/usr/local/python\n", "export PATH=$PATH:$HOME/bin\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "## Loops\n", "\n", "```bash\n", "for i in x y z\n", "do\n", " echo $i\n", "done\n", "\n", "for file in *.txt\n", "do\n", " echo $file\n", "done\n", "```\n", "\n", "Lots more... (TLDP)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "```bash\n", "for i in {1..10}\n", "do\n", " echo $i\n", "done\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# I/O Redirection\n", "\n", "`>` send *standard output* to file\n", "\n", "
\n",
    "$ echo Hello > h.txt\n",
    "
\n", "\n", "`>>` append to file\n", "\n", "
\n",
    "$ echo World >> h.txt\n",
    "
\n", "\n", "`<` send file to *standard input* of command\n", "\n", "`2>` send *standard error* to file\n", "\n", "`>&` send output and error to file\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "```bash\n", "$ echo Hello > h.txt\n", "$ echo World >> h.txt\n", "$ cat h.txt\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "```bash\n", "$ echo Hello > h.txt\n", "$ echo World > h.txt\n", "$ cat h.txt\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pipes\n", "\n", "A pipe (|) redirects the *standard output* of one program to the *standard input* of another. It's like you typed the output of the first program into the second. This allows us to chain several simple programs together to do something more complicated.\n", "
\n",
    "$ echo Hello World | wc\n",
    "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Simple Text Manipulation\n", "\n", "`cat` dump file to stdout\n", "\n", "`more` paginated output\n", "\n", "`head` show first 10 lines\n", "\n", "`tail` show last 10 lines\n", "\n", "`wc` count lines/words/characters\n", "\n", "`sort` sort file by line and print out (-n for numerical sort)\n", "\n", "`uniq` remove **adjacent** duplicates (-c to count occurrences)\n", "\n", "`cut` extract fixed width columns from file\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n",
    "$ cat text\n",
    "a\n",
    "b\n",
    "a\n",
    "b\n",
    "b\n",
    "$ cat text | uniq | wc\n",
    "
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n",
    "$ cat text\n",
    "a\n",
    "b\n",
    "a\n",
    "b\n",
    "b\n",
    "$ cat text | sort | uniq | wc\n",
    "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Advanced Text Manipulation\n", "\n", "`grep` search contents of file for expression\n", "\n", "`sed` stream editor - perform substitutions\n", "\n", "`awk` pattern scanning and processing, great for dealing with data in columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# grep\n", "\n", "Search file contents for a pattern.\n", "\n", "`grep pattern file(s)`\n", " * `-r` recursive search\n", " * `-I` skip over binary files\n", " * `-s` suppress error messages\n", " * `-n` show line numbers\n", " * `-A`*N* show *N* lines after match\n", " * `-B`*N* show *N* lines before match\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "```bash\n", "$ grep a text | wc\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# grep patterns\n", "\n", "Patterns are defined using *regular expressions*, which we will talk more about later. Some useful special characters:\n", "\n", "* `^pattern` pattern must be at start of line\n", "* `pattern$` pattern must be at end of line\n", "* `.` match any single character (**not** just a literal period)\n", "* `.*` match any character repeated any number of times (including zero)\n", "* `\\.` escape a special character to treat it literally (i.e., this matches period)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# sed\n", "Search and replace\n", "\n", "
\n",
    "sed 's/pattern/replacement/' file\n",
    "
\n", "\n", " * `-i` replace in-place (overwrites input file)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "```bash\n", "$ sed 's/a/b/' text | uniq | wc\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# awk\n", "Pattern scanning and processing language. We'll mostly use it to extract columns/fields. It processes a file line by line and, if a condition holds, runs a simple program on that line.\n", "\n", " awk 'optional condition {awk program}' file\n", "* -Fx make *x* the field delimiter (default whitespace)\n", "* NF number of fields on current line\n", "* NR current record number\n", "* \\$0 full line\n", "* \\$N Nth field" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# awk\n", "\n", "```bash\n", "$ cat names\n", "id last,first \n", "1 Smith,Alice\n", "2 Jones,Bob\n", "3 Smith,Charlie\n", "```\n", "Try these:\n", "\n", "```bash\n", "$ awk '{print $1}' names\n", "$ awk -F, '{print $2}' names\n", "$ awk 'NR > 1 {print $2}' names \n", "$ awk '$1 > 1 {print $0}' names\n", "$ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Exercises\n", "\n", "```bash\n", "mkdir intro\n", "cd intro\n", "wget http://mscbio2025.net/files/Spellman.csv\n", "wget http://mscbio2025.net/files/1shs.pdb\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Questions\n", "\n", "- How many data points are in Spellman.csv?\n", "- The first three letters of the systematic open reading frame names are: 'Y' for yeast, the chromosome number (encoded as a letter), then the chromosome arm. In the dataset, how many ORFs from chromosome A are there?\n", "- How many are there from each chromosome? 
\n", " - each chromosome arm?\n", "- How many data points start with a positive expression value?\n", "- What are the 10 data points with the highest initial expression values?\n", " - Lowest?\n", "- How many lines are there where expression values are continuously increasing for the first 3 time steps?\n", "- Sorted by biggest increase?\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "```bash\n", "wc Spellman.csv  # gives the number of lines; because of the header this is off by one\n", "grep YA Spellman.csv | wc\n", "grep ^YA Spellman.csv | wc  # a bit better: ^ matches the beginning of the line\n", "grep ^YA -c Spellman.csv  # grep can provide the count itself\n", "awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c  # per chromosome\n", "awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c  # per chromosome arm\n", "awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc  # positive initial value\n", "awk -F, 'NR > 1 {print $1,$2}' Spellman.csv | sort -k2,2 -n | tail  # 10 highest\n", "awk -F, 'NR > 1 {print $1,$2}' Spellman.csv | sort -k2,2 -n -r | tail  # 10 lowest\n", "awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv | wc  # increasing over first 3 time steps\n", "awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $4-$2,$0}' Spellman.csv | sort -n -k1,1  # sorted by increase\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# More\n", "\n", "- Create a pdb file from 1shs that consists of only ATOM records. \n", "- Create a pdb with only ATOM records from chain A. 
(The chain is the fifth column* of an atom record)\n", "- How many carbon atoms are in this file?\n", "- Create a pdb with only the ATOM records from chain G, but with the chain renamed to be A.\n", "\n", "\\*PDB files are actually fixed-width files, not space delimited, but with this file you can ignore that distinction.\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "slideshow": { "slide_type": "notes" } }, "source": [ "```bash\n", "grep ^ATOM 1shs.pdb > newpdb.pdb  # ^ matches the beginning of the line\n", "grep ^ATOM 1shs.pdb | awk '$5 == \"A\" {print $0}'\n", "#this is UNSAFE with pdb files since there is no guarantee that fields\n", "#will be whitespace separated, safer is:\n", "grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == \"A\" {print $0}' > newpdb.pdb\n", " \n", "grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == \"A\" {print $0}' | cut -b 78- | sort | uniq -c\n", "#select chain G (not A), then rename it to A\n", "grep ^ATOM 1shs.pdb | awk '$5 == \"G\" {print $0}' | sed 's/ G / A /'\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Running Python\n", "\n", "```bash\n", "$ cat hi.py \n", "print(\"hi\")\n", "$ python3 hi.py\n", "hi\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bash\n", "$ cat hi.py \n", "#!/usr/bin/python3\n", "print(\"hi\")\n", "$ chmod +x hi.py  # make the file executable\n", "$ ls -l hi.py \n", "-rwxr-xr-x 1 dkoes staff 29 Sep 3 16:05 hi.py\n", "$ ./hi.py \n", "hi\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Python Versions\n", "\n", "**python2** Legacy python. \n", "\n", "**python3** Released in 2008. Mostly the same as python2 but \"cleaned up\". Breaks backwards compatibility. May need to specify explicitly (`python3`). 
*We will be using python3*.\n", "\n", "https://wiki.python.org/moin/Python2orPython3\n", "\n", "```bash\n", "~$ python\n", "Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux\n", "Type \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n", ">>> \n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# IPython\n", "\n", "## A powerful interactive shell\n", "* Tab complete commands, file names\n", "* Support for a number of \"shell\" commands (ls, cd, pwd, etc)\n", "* Supports up arrow, `Ctrl-r`\n", "* Persistent command history across sessions\n", "* Backbone of notebooks...\n", "\n", "```bash\n", "~$ ipython\n", "Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]\n", "Type 'copyright', 'credits' or 'license' for more information\n", "IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.\n", "\n", "In [1]: \n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# ipython notebook \n", "\n", "\n", "
\n",
    "$ ipython notebook\n",
    "
\n", "
\n", "\n", "
\n",
    "$ jupyter notebook\n",
    "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now called Jupyter (not just for python) jupyter.org\n", "\n", "IPython in your browser. Save your code *and* your output.\n", "\n", "[Colab](https://colab.research.google.com/) is basically a Google hosted Jupyter notebook.\n", "\n", "Demo: running code (shift-enter), cell types, saving and exporting, kernel state" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Why Jupyter notebook?\n", "\n", "* A \"lab notebook\" for data science\n", "* See output as you run commands\n", "* Embedded figures/output\n", "* Easy to modify and rerun steps\n", "* Can embed formatted text - share code *and* reason for code\n", "* Can convert to multiple formats (html, pdf, raw python, even slides)\n", "\n", "[A different perspective](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/present?token=AC4w5ViEY1bIVsQHr8Z_JV3-l800VDuEpg%3A1536066747968&includes_info_params=1#slide=id.g362da58057_0_1)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }