"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Pipes\n",
"\n",
"A pipe (|) redirects the *standard output* of one program to the *standard input* of another. It's like you typed the output of the first program into the second. This allows us to chain several simple programs together to do something more complicated.\n",
"\n",
"$ echo Hello World | wc\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Simple Text Manipulation\n",
"\n",
"`cat` dump file to stdout\n",
"\n",
"`more` paginated output\n",
"\n",
"`head` show first 10 lines\n",
"\n",
"`tail` show last 10 lines\n",
"\n",
"`wc` count lines/words/characters\n",
"\n",
"`sort` sort file by line and print out (-n for numerical sort)\n",
"\n",
"`uniq` remove **adjacent** duplicates (-c to count occurances)\n",
"\n",
"`cut` extract fixed width columns from file\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"$ cat text\n",
"a\n",
"b\n",
"a\n",
"b\n",
"b\n",
"$ cat text | uniq | wc\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"$ cat text\n",
"a\n",
"b\n",
"a\n",
"b\n",
"b\n",
"$ cat text | sort | uniq | wc\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Advanced Text Manipulation\n",
"\n",
"grep search contents of file for expression\n",
"\n",
"sed stream editor - perform substitutions\n",
"\n",
"awk pattern scanning and processing, great for dealing with data in columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# grep\n",
"\n",
"Search file contents for a pattern.\n",
"\n",
"grep pattern file(s)\n",
" * ‐r recursive search\n",
" * ‐I skip over binary files\n",
" * ‐s suppress error messages\n",
" * ‐n show line numbers\n",
" * ‐A*N* show *N* lines after match\n",
" * ‐B*N* show *N* lines before match\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"```bash\n",
"$ grep a text | wc\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# grep patterns\n",
"\n",
"Patterns are defined using *regular expressions* which we will talk more about later. Some useful special characters.\n",
"\n",
"* `^pattern` pattern must be at start of line\n",
"* `pattern$` pattern must be at end of line\n",
"* `.` match any character, **not** period\n",
"* `.*` match any charcter repeated any number of times\n",
"* `\\.` escape a special character to treat it literally (i.e., this matches period)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# sed\n",
"Search and replace\n",
"\n",
"\n",
"sed 's/pattern/replacement/' file\n",
"
\n",
"\n",
" * ‐i replace in-place (overwrites input file)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"```bash\n",
"$ sed 's/a/b/' text | uniq | wc\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# awk\n",
"Pattern scanning and processing language. We'll mostly use it to extract columns/fields. It processes a file line-by-line and if a condition holds runs a simple program on the line.\n",
"\n",
" awk 'optional condition {awk program}' file\n",
"* -Fx make *x* the field deliminator (default whitespace)\n",
"* NF number of fields on current line\n",
"* NR current record number\n",
"* \\$0 full line\n",
"* \\$N Nth field"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# awk\n",
"\n",
"```bash\n",
"$ cat names\n",
"id last,first \n",
"1 Smith,Alice\n",
"2 Jones,Bob\n",
"3 Smith,Charlie\n",
"```\n",
"Try these:\n",
"\n",
"```bash\n",
"$ awk '{print $1}' names\n",
"$ awk -F, '{print $2}' names\n",
"$ awk 'NR > 1 {print $2}' names \n",
"$ awk '$1 > 1 {print $0}' names\n",
"$ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Exercises\n",
"\n",
"```bash\n",
"mkdir intro\n",
"cd intro\n",
"wget http://mscbio2025.net/files/Spellman.csv\n",
"wget http://mscbio2025.net/files/1shs.pdb\n",
"```\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Questions\n",
"\n",
"- How many data points are in Spellman.csv?\n",
"- The first three letters of the systematic open reading frames are: 'Y' for yeast, the chromosome number, then the chromosome arm. In the dataset, how many ORFs from chromosome A are there?\n",
"- How many are there from each chromosome? \n",
" - each chromosome arm?\n",
"- How many data points start with a positive expression value?\n",
"- What are the 10 data points with the highest initial expression values?\n",
" - Lowest?\n",
"- How many lines are there where expression values are continuously increasing for the first 3 time steps?\n",
"- Sorted by biggest increase?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"```bash\n",
"wc Spellman.csv (gives number of lines, because of header this is off by one)\n",
"grep YA Spellman.csv |wc\n",
"grep ^YA Spellman.csv |wc (this is a bit better, ^ matches begining of line)\n",
"grep ^YA -c Spellman.csv (grep can provide the count itself)\n",
"awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c\n",
"awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c\n",
"awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc\n",
"awk -F, 'NR > 1 {print $1,$2}' Spellman.csv | sort -k2,2 -n | tail\n",
"awk -F, 'NR > 1 {print $1,$2}' Spellman.csv | sort -k2,2 -n -r | tail\n",
"awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv |wc\n",
"awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $4-$2,$0}' Spellman.csv | sort -n -k1,1\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# More\n",
"\n",
"- Create a pdb file from 1shs that consists of only ATOM records. \n",
"- Create a pdb with only ATOM records from chain A. (The chain is the fifth column* of an atom record)\n",
"- How many carbon atoms are in this file?\n",
"- Create a pdb with only the ATOM records from chain G, but with the chain renamed to be A.\n",
"\n",
"\\*PDB files are actually fixed files, not space deliminated, but with this file you can ignore that distinction.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
},
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"```bash\n",
"grep ^ATOM 1shs.pdb > newpdb.pdb (^matches beginning of line)\n",
"grep ^ATOM 1shs.pdb | awk '$5 == \"A\" {print $0}'\n",
"#this is UNSAFE with pdb files since there is no guarantee that fields\n",
"#will be whitespace seperated, safer is:\n",
"grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == \"A\" {print $0}' > newpdb.pdb\n",
" \n",
"grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == \"A\" {print $0}' | cut -b 78- | sort | uniq -c\n",
"grep ^ATOM 1shs.pdb | awk '$5 == \"A\" {print $0}' | sed 's/ G / A /'\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Running Python\n",
"\n",
"```bash\n",
"$ cat hi.py \n",
"print(\"hi\")\n",
"$ python3 hi.py\n",
"hi\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"$ cat hi.py \n",
"#!/usr/bin/python3\n",
"print(\"hi\")\n",
"$ chmod +x hi.py make the file executable\n",
"$ ls -l hi.py \n",
"-rwxr-xr-x 1 dkoes staff 29 Sep 3 16:05 hi.py\n",
"$ ./hi.py \n",
"hi\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Python Versions\n",
"\n",
"**python2** Legacy python. \n",
"\n",
"**python3** Released in 2008. Mostly the same as python2 but \"cleaned up\". Breaks backwards compatibility. May need to specify explicity (`python3`). *We will be using python3*.\n",
"\n",
"https://wiki.python.org/moin/Python2orPython3\n",
"\n",
"```bash\n",
"~$ python\n",
"Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux\n",
"Type \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n",
">>> \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# IPython\n",
"\n",
"## A powerful interactive shell\n",
"* Tab complete commands, file names\n",
"* Support for a number of \"shell\" commands (ls, cd, pwd, etc)\n",
"* Supports up arrow, `Ctrl-r`\n",
"* Persistent command history across sessions\n",
"* Backbone of notebooks...\n",
"\n",
"```bash\n",
"~$ ipython\n",
"Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]\n",
"Type 'copyright', 'credits' or 'license' for more information\n",
"IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.\n",
"\n",
"In [1]: \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# ipython notebook \n",
"\n",
"\n",
"\n",
"$ ipython notebook\n",
"
\n",
"\n",
"\n",
"\n",
"$ jupyter notebook\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now called Jupyter (not just for python) jupyter.org\n",
"\n",
"IPython in your browser. Save your code *and* your output.\n",
"\n",
"[Colab](https://colab.research.google.com/) is basically a Google hosted Jupyter notebook.\n",
"\n",
"Demo: running code (shift-enter), cell types, saving and exporting, kernel state"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Why Jupyter notebook?\n",
"\n",
"* A \"lab notebook\" for data science\n",
"* See output as you run commands\n",
"* Embedded figures/output\n",
"* Easy to modify and rerun steps\n",
"* Can embed formatted text - share code *and* reason for code\n",
"* Can convert to multiple formats (html, pdf, raw python, even slides)\n",
"\n",
"[A different perspective](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/present?token=AC4w5ViEY1bIVsQHr8Z_JV3-l800VDuEpg%3A1536066747968&includes_info_params=1#slide=id.g362da58057_0_1)"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}