Text Processing with the Linux Command Line¶

8/31/2023¶

Question: Which command changes to the parent directory?

  • cd
  • cd .
  • cd ..
  • cd /..

Review¶

ls - list files

cd - change directory

pwd - print working (current) directory

.. - special file that refers to parent directory

. - the current directory

cat file - print out contents of file

more file - print contents of file with pagination
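A minimal sketch of the navigation commands above, run in a throwaway directory. The names `demo` and `sub` are made up for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
demo=$(mktemp -d)
mkdir "$demo/sub"

cd "$demo/sub"
here=$(pwd)      # path ending in /sub

cd ..            # .. is the parent directory
parent=$(pwd)

cd .             # . is the current directory, so nothing changes
same=$(pwd)

echo "$here"
echo "$parent"
```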

Shortcuts¶

Tab autocomplete

Ctrl-D EOF/logout/exit

Ctrl-A go to beginning of line

Ctrl-E go to end of line

alias new=cmd

make a nickname for a command
$ alias l='ls -l'
$ alias
$ l

.bashrc example¶

HISTCONTROL=ignoredups

#immediately append instead of at end of session, clear and re-read .bash_history
export PROMPT_COMMAND="history -a; history -c; history -r"
#append instead of overwrite history
shopt -s histappend

export HISTSIZE=1000000

# If set, Bash checks the window size after each command 
shopt -s checkwinsize

alias mroe=more
alias grpe=grep

export PYTHONPATH=$PYTHONPATH:/usr/local/python
export PATH=$PATH:$HOME/bin

Loops¶

for i in x y z
do
 echo $i
done

for file in *.txt
do
 echo $file
done

Lots more... (TLDP)

for i in {1..10}
do
 echo $i
done
Question: What is the last line to print out?

  • {1..10}
  • }
  • 9
  • 10
  • An Error
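A slightly larger loop, as a sketch: it walks over the files matched by a glob and keeps a running total. The file names and contents are hypothetical and created on the spot.

```shell
# Scratch directory with three small files (hypothetical names).
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n' > "$dir/two.txt"
printf 'ignored\n' > "$dir/notes.md"

total=0
for file in "$dir"/*.txt
do
  # wc -l < file prints only the count, with no file name after it
  n=$(wc -l < "$file")
  echo "$(basename "$file"): $n lines"
  total=$((total + n))
done
echo "total: $total"
```

Quoting "$file" keeps the loop working when a file name contains spaces. notes.md is skipped because it does not match *.txt.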

I/O Redirection¶

> send standard output to file

$ echo Hello > h.txt

>> append to file

$ echo World >> h.txt

< send file to standard input of command

2> send standard error to file

>& send output and error to file

$ echo Hello > h.txt
$ echo World >> h.txt
$ cat h.txt
Question: What prints out?

  • Hello
  • World
  • HelloWorld
  • Hello and World on separate lines
  • An Error
$ echo Hello > h.txt
$ echo World > h.txt
$ cat h.txt
Question: What prints out?

  • Hello
  • World
  • HelloWorld
  • Hello and World on separate lines
  • An Error
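The remaining operators can be sketched the same way. The file names here are made up, and a small command group stands in for a program that writes to both standard output and standard error.

```shell
# Scratch directory so the demo files are easy to throw away.
dir=$(mktemp -d)
cd "$dir"

printf 'b\n' > in.txt             # > creates (or overwrites) in.txt
printf 'a\n' >> in.txt            # >> appends a second line
sort < in.txt > sorted.txt        # < feeds in.txt to sort's standard input

# A command group that writes one line to each output stream.
{ echo "to stdout"; echo "to stderr" >&2; } > out.txt 2> err.txt

# Both streams into one file. "> file 2>&1" is the portable spelling of
# bash's ">& file".
{ echo "to stdout"; echo "to stderr" >&2; } > both.txt 2>&1

cat sorted.txt
cat both.txt
```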

Pipes¶

A pipe (|) redirects the standard output of one program to the standard input of another. It's like you typed the output of the first program into the second. This allows us to chain several simple programs together to do something more complicated.

$ echo Hello World | wc
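A few more pipes, as a sketch. The variable names are only for illustration.

```shell
# wc with no options reports lines, words, and bytes, in that order.
echo Hello World | wc

# Each option selects one count: -l lines, -w words, -c bytes.
words=$(echo Hello World | wc -w)
bytes=$(echo Hello World | wc -c)   # 11 characters plus the newline echo adds

# Pipes chain: head sees only what sort emits.
first=$(printf 'b\na\n' | sort | head -n 1)

echo "$words $bytes $first"
```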

Simple Text Manipulation¶

cat dump file to stdout

more paginated output

head show first 10 lines

tail show last 10 lines

wc count lines/words/characters

sort sort file by line and print out (-n for numerical sort)

uniq remove adjacent duplicates (-c to count occurrences)

cut extract selected columns (bytes, characters, or delimited fields) from each line

$ cat text
a
b
a
b
b
$ cat text | uniq | wc
Question: What is the first number to print out?

  • 1
  • 2
  • 3
  • 4
  • 5
  • None of the above
$ cat text
a
b
a
b
b
$ cat text | sort | uniq | wc
Question: What is the first number to print out?

  • 1
  • 2
  • 3
  • 4
  • 5
  • None of the above
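The same tools combine into a frequency count. This is a sketch on a made-up file, fruit.txt, not part of the course data.

```shell
dir=$(mktemp -d)
printf 'pear\napple\npear\nfig\napple\npear\n' > "$dir/fruit.txt"

# uniq only merges adjacent duplicates, so sort first.
# uniq -c prefixes each line with its count; sort -n -r puts the largest first.
top=$(sort "$dir/fruit.txt" | uniq -c | sort -n -r | head -n 1)
echo "$top"

# cut -b 1-3 keeps the first three bytes of every line.
prefixes=$(cut -b 1-3 "$dir/fruit.txt" | sort | uniq | wc -l)
echo "$prefixes"
```

The sort | uniq -c | sort -n -r idiom is a common way to rank values by how often they occur.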

Advanced Text Manipulation¶

grep search contents of file for expression

sed stream editor - perform substitutions

awk pattern scanning and processing, great for dealing with data in columns

grep¶

Search file contents for a pattern.

grep pattern file(s)

  • -r recursive search
  • -I skip over binary files
  • -s suppress error messages
  • -n show line numbers
  • -A N show N lines after match
  • -B N show N lines before match
$ grep a text | wc
Question: What is the first number to print out?

  • 1
  • 2
  • 3
  • 4
  • 5
  • None of the above
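A sketch of a few grep options on a made-up word list. The -c option (count matching lines) is used here as well.

```shell
dir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\ndelta\n' > "$dir/words.txt"

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'eta' "$dir/words.txt")

# -c prints the number of matching lines instead of the lines themselves.
matches=$(grep -c 'l' "$dir/words.txt")

# -A1 also prints the one line that follows each match.
context=$(grep -A1 'beta' "$dir/words.txt")

echo "$numbered"
echo "$matches"
echo "$context"
```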

grep patterns¶

Patterns are defined using regular expressions, which we will talk more about later. Some useful special characters:

  • ^pattern pattern must be at start of line
  • pattern$ pattern must be at end of line
  • . match any single character (not just a literal period)
  • .* match any character repeated any number of times
  • \. escape a special character to treat it literally (i.e., this matches period)
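Each special character can be checked against a small made-up list of names. This is a sketch, with grep -c used to count the matching lines.

```shell
dir=$(mktemp -d)
printf 'a.txt\nabtxt\nb.txt\ntxt.a\n' > "$dir/names.txt"

starts_a=$(grep -c '^a' "$dir/names.txt")        # a.txt, abtxt
ends_txt=$(grep -c 'txt$' "$dir/names.txt")      # a.txt, abtxt, b.txt
any_char=$(grep -c '^a.txt$' "$dir/names.txt")   # . matches "." or "b": a.txt, abtxt
literal=$(grep -c '^a\.txt$' "$dir/names.txt")   # \. matches only a period: a.txt
wildcard=$(grep -c '^a.*t$' "$dir/names.txt")    # a, then anything, then t: a.txt, abtxt

echo "$starts_a $ends_txt $any_char $literal $wildcard"
```

Single quotes around the pattern stop the shell from interpreting $, *, and \ before grep sees them.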

sed¶

Search and replace

sed 's/pattern/replacement/' file
  • -i replace in-place (overwrites input file)
$ sed 's/a/b/' text | uniq | wc
Question: What is the first number to print out?

  • 1
  • 2
  • 3
  • 4
  • 5
  • None of the above
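One detail worth seeing in action: without a flag, s/pattern/replacement/ changes only the first match on each line. A sketch on a made-up two-line file:

```shell
dir=$(mktemp -d)
printf 'banana\ncabana\n' > "$dir/f.txt"

# Without a flag, s/// replaces only the first match on each line.
first=$(sed 's/a/A/' "$dir/f.txt")

# A trailing g replaces every match on the line.
all=$(sed 's/a/A/g' "$dir/f.txt")

# The pattern is a regular expression, so ^ anchors to the start of the line.
anchored=$(sed 's/^b/B/' "$dir/f.txt")

echo "$first"
echo "$all"
echo "$anchored"
```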

awk¶

Pattern scanning and processing language. We'll mostly use it to extract columns/fields. It processes a file line-by-line and if a condition holds runs a simple program on the line.

awk 'optional condition {awk program}' file

  • -Fx make x the field delimiter (default whitespace)
  • NF number of fields on current line
  • NR current record number
  • $0 full line
  • $N Nth field
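A sketch of the built-in variables on a made-up three-line file. The second line has an extra field so that NF varies.

```shell
dir=$(mktemp -d)
printf 'x 1\ny 2 extra\nz 3\n' > "$dir/data.txt"

# No condition: the program runs on every line.
# NR is the line number, NF the number of fields on that line.
awk '{print NR, NF, $1}' "$dir/data.txt"

# With a condition: only lines whose second field exceeds 1 are processed.
big=$(awk '$2 > 1 {print $1}' "$dir/data.txt")

# -F: splits on colons instead of whitespace.
user=$(printf 'root:x:0\n' | awk -F: '{print $1}')

echo "$big"
echo "$user"
```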

awk¶

$ cat names
id last,first 
1 Smith,Alice
2 Jones,Bob
3 Smith,Charlie

Try these:

$ awk '{print $1}' names
$ awk -F, '{print $2}' names
$ awk 'NR > 1 {print $2}' names 
$ awk '$1 > 1 {print $0}' names
$ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c

Exercises¶

mkdir intro
cd intro
wget http://mscbio2025.net/files/Spellman.csv
wget http://mscbio2025.net/files/1shs.pdb

Questions¶

  • How many data points are in Spellman.csv?
  • The first three letters of the systematic open reading frames are: 'Y' for yeast, the chromosome number, then the chromosome arm. In the dataset, how many ORFs from chromosome A are there?
  • How many are there from each chromosome?
    • each chromosome arm?
  • How many data points start with a positive expression value?
  • What are the 10 data points with the highest initial expression values?
    • Lowest?
  • How many lines are there where expression values are continuously increasing for the first 3 time steps?
  • Sorted by biggest increase?
wc Spellman.csv   # the line count includes the header, so it is off by one
grep YA Spellman.csv | wc
grep ^YA Spellman.csv | wc   # a bit better: ^ matches the beginning of the line
grep -c ^YA Spellman.csv   # grep can provide the count itself
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c
awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n | tail
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n -r | tail
awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv  |wc
awk -F, 'NR > 1 && $3 > $2 && $4 > $3  {print $4-$2,$0}' Spellman.csv   | sort -n -k1,1

More¶

  • Create a pdb file from 1shs that consists of only ATOM records.
  • Create a pdb with only ATOM records from chain A. (The chain is the fifth column* of an atom record)
  • How many carbon atoms are in this file?
  • Create a pdb with only the ATOM records from chain G, but with the chain renamed to be A.

*PDB files are actually fixed-width files, not space delimited, but with this file you can ignore that distinction.

grep ^ATOM 1shs.pdb > newpdb.pdb   # ^ matches the beginning of the line
grep ^ATOM 1shs.pdb | awk '$5 == "A" {print $0}'
#this is UNSAFE with pdb files since there is no guarantee that fields
#will be whitespace separated, safer is:
grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' > newpdb.pdb

grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' | cut -b 78- | sort | uniq -c
grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "G" {print $0}' | sed 's/ G / A /'
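A made-up illustration of why `$5` is unsafe. The two ATOM records below are not from 1shs.pdb. They are built with printf so that each field lands in its fixed PDB column, and the second has a four-digit residue number that runs directly into the chain ID in column 22.

```shell
# Column layout of an ATOM record, with coordinates passed as preformatted strings.
fmt='ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8s%8s%8s%6s%6s          %2s\n'
pdb=$(mktemp)
printf "$fmt" 1 ' N'  ' ' MET A 1    ' ' 1.000 2.000 3.000 1.00 10.00 N  > "$pdb"
printf "$fmt" 2 ' CA' ' ' GLY A 1000 ' ' 4.000 5.000 6.000 1.00 10.00 C >> "$pdb"

# Whitespace splitting sees "A1000" as one field, so the second record is missed.
by_field=$(awk '$5 == "A" {print $0}' "$pdb" | wc -l)

# Reading column 22 directly finds both records.
by_column=$(awk 'substr($0,22,1) == "A" {print $0}' "$pdb" | wc -l)

echo "$by_field $by_column"
```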

Running Python¶

$ cat hi.py 
print("hi")
$ python3 hi.py
hi
$ cat hi.py 
#!/usr/bin/python3
print("hi")
$ chmod +x hi.py  # make the file executable
$ ls -l hi.py 
-rwxr-xr-x  1 dkoes  staff  29 Sep  3 16:05 hi.py
$ ./hi.py 
hi

Python Versions¶

python2 Legacy python.

python3 Released in 2008. Mostly the same as python2 but "cleaned up". Breaks backwards compatibility. May need to specify explicitly (python3). We will be using python3.

https://wiki.python.org/moin/Python2orPython3

~$ python
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

IPython¶

A powerful interactive shell¶

  • Tab complete commands, file names
  • Support for a number of "shell" commands (ls, cd, pwd, etc)
  • Supports up arrow, Ctrl-r
  • Persistent command history across sessions
  • Backbone of notebooks...
~$ ipython
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

ipython notebook¶

$ ipython notebook
$ jupyter notebook

Now called Jupyter (not just for python) jupyter.org

IPython in your browser. Save your code and your output.

Colab is basically a Google hosted Jupyter notebook.

Demo: running code (shift-enter), cell types, saving and exporting, kernel state

Why Jupyter notebook?¶

  • A "lab notebook" for data science
  • See output as you run commands
  • Embedded figures/output
  • Easy to modify and rerun steps
  • Can embed formatted text - share code and reason for code
  • Can convert to multiple formats (html, pdf, raw python, even slides)

A different perspective