An introduction to awk and bioawk.
Applied Computational Genomics
https://github.com/quinlan-lab/applied-computational-genomics
Aaron Quinlan
Departments of Human Genetics and Biomedical Informatics
USTAR Center for Genetic Discovery
University of Utah
quinlanlab.org
Let's grab some data to work with.
wget https://s3.amazonaws.com/bedtools-tutorials/web/cpg.bed
CpG "islands"
Now to learn some more UNIX. awk is your dear friend.
Now to learn some more UNIX. awk is your dear friend.
Awk is a programming language that is specifically designed for quickly manipulating space delimited data.
-Heng Li (http://lh3lh3.users.sourceforge.net/biounix.shtml)
awk '$1 == "chr1"' cpg.bed
Every awk program begins and ends with single quotes.
Only report annotations in cpg.bed that are for chr1
Each column is referred to by number.
Use awk to "filter" files: i.e., find lines that match your criteria
awk '$3 < $2' cpg.bed
Only report annotations in cpg.bed where the end coordinate is less than the start coordinate. How many such records do we expect?
The NR (record number) variable
awk 'NR == 100' cpg.bed
Report the 100th line in the file
Impose multiple filtering criteria with the AND ("&&") operator
awk 'NR >= 100 && NR <= 200' cpg.bed
Report the 100th through the 200th lines in the file
Getting a bit more sophisticated. Mini programming!!!
awk '(NR>=100 && NR <= 200) || ($1 == "chr22")' cpg.bed
Report lines if they are the 100th through the 200th lines in the file OR (||) they are from chr22
Quality control of BED files
# Find lines where start is greater than end.
# What should this return?
awk '$2>$3' cpg.bed
Do some computation and report the results
awk '{print $0, $3-$2}' cpg.bed
If using a print statement, you must add curly brackets between the single quotes describing the program.
Print the BED record followed by the length (end - start) of the record
$0 refers to the entire input line
By default, output is separated by a space. Prefer tabs
awk 'BEGIN{OFS="\t"}{print $0, $3-$2}' cpg.bed
or
awk -v OFS="\t" '{print $0, $3-$2}' cpg.bed
or
awk '{len=($3-$2); print $0"\t"len}' cpg.bed
or
awk -v OFS="\t" '{len=($3-$2); print $0, len}' cpg.bed
Print the BED record followed by the length (end - start) of the record. Separated by a TAB, the OFS (output field separator)
BEGIN: before anything else happens, execute what is in the BEGIN statement. Then start processing the input.
Print columns in any order
# Note that is not possible with `cut`
awk -v OFS="\t" '{print $4,$1}' cpg.bed
Compute the total number of base pairs represented by CpG islands
awk '{tot_len += $3-$2}END{print tot_len}' cpg.bed
Create a variable named "sum" whose value starts at 0, but is increased by the length ($3-$2)
of each CpG island.
END: after all the processing of each line in the file occurs, print the final value of sum.
Question: How could we compure the average number of base pairs represented by CpG islands
# This computes the total...
awk '{tot_len += $3-$2}END{print tot_len/NR}' cpg.bed
How many (whitespace-separated) columns are on each line?
awk '{print NF}' cpg.bed
NF: The number of "fields" (that is, the number of whitespace-separated values) detected for the line
Pattern matching: find CpG islands on chr22
awk '/^chr22/' cpg.bed
^ means match the beginning of the line
Pattern matching: find CpG islands on chr22 or chr9
awk '/^chr22/ || /^chr9/' cpg.bed
^ means match the beginning of the line
Bioawk: a better awk from Heng Li.
Installing bioawk
# clone the code for bioawk from github
git clone https://github.com/lh3/bioawk
# move into the bioawk code directory
cd bioawk
# compile the code into a binary/executable
make
# copy that binary to your ~/bin directory
cp bioawk ~/bin/
Named columns!!!
bioawk -t -c bed '{print $chrom, $start}' cpg.bed | head -n 3
chr1 28735
chr1 135124
chr1 327790
Let's grab some more data to work with. FASTQ
wget https://bedtools-tutorials.s3.amazonaws.com/web/toy.fastq
Named records for FASTQ. This is far easier because FASTQ records are split across four lines!!!
bioawk -c fastx '{print $seq}' toy.fastq | head -n 4
NTCGGAACATTTTTTCTTCAAAAATATGAAAAATCACCTAATTTATCTGAAAATGACATTTANNNCAGTNNNNNNNATTGGGAAAGTGCTCGATTTNCGGA
TGTAATTTACTTTGTTCAGTTAGACTCTTAATTAGACTAAAAACGGTCTCAAAAAGTATAATTTCATAATGAGACACCTTTAAAAATTCTACGTTTTTATG
AGTTTTCTCAAACACAGAAAACATATGGGAGTTTCTCAAACAATGGACAATGAGTGATCACCGATATTTGATACAAATCGACCAACTCGGCTCATATTCTC
AAACATGTCCGAGTAGCTTACAGTGTGCTCCTTTTGCTCTATCTTCCGGCTCCATTTTTTCCGTTGCTTCTCCTGCATTTGTGACAAATTCCCTGATTTTC
How long are the sequencing reads in this FASTQ file?
bioawk -c fastx '{print length($seq)}' toy.fastq | head -n 4
NTCGGAACATTTTTTCTTCAAAAATATGAAAAATCACCTAATTTATCTGAAAATGACATTTANNNCAGTNNNNNNNATTGGGAAAGTGCTCGATTTNCGGA
TGTAATTTACTTTGTTCAGTTAGACTCTTAATTAGACTAAAAACGGTCTCAAAAAGTATAATTTCATAATGAGACACCTTTAAAAATTCTACGTTTTTATG
AGTTTTCTCAAACACAGAAAACATATGGGAGTTTCTCAAACAATGGACAATGAGTGATCACCGATATTTGATACAAATCGACCAACTCGGCTCATATTCTC
AAACATGTCCGAGTAGCTTACAGTGTGCTCCTTTTGCTCTATCTTCCGGCTCCATTTTTTCCGTTGCTTCTCCTGCATTTGTGACAAATTCCCTGATTTTC