1 of 24

An introduction to awk and bioawk.

Applied Computational Genomics

https://github.com/quinlan-lab/applied-computational-genomics

Aaron Quinlan

Departments of Human Genetics and Biomedical Informatics

USTAR Center for Genetic Discovery

University of Utah

quinlanlab.org

2 of 24

Let's grab some data to work with.

wget https://s3.amazonaws.com/bedtools-tutorials/web/cpg.bed

3 of 24

CpG "islands"

4 of 24

Now to learn some more UNIX. awk is your dear friend.

5 of 24

Now to learn some more UNIX. awk is your dear friend.

Awk is a programming language that is specifically designed for quickly manipulating space delimited data.

-Heng Li (http://lh3lh3.users.sourceforge.net/biounix.shtml)

awk '$1 == "chr1"' cpg.bed

Every awk program begins and ends with single quotes.

Only report annotations in cpg.bed that are for chr1

Each column is referred to by number.

6 of 24

Use awk to "filter" files: i.e., find lines that match your criteria

awk '$3 < $2' cpg.bed

Only report annotations in cpg.bed where the end coordinate is less than the start coordinate. How many such records do we expect?

7 of 24

The NR (record number) variable

awk 'NR == 100' cpg.bed

Report the 100th line in the file

8 of 24

Impose multiple filtering criteria with the AND ("&&") operator

awk 'NR >= 100 && NR <= 200' cpg.bed

Report the 100th through the 200th lines in the file

9 of 24

Getting a bit more sophisticated. Mini programming!!!

awk '(NR>=100 && NR <= 200) || ($1 == "chr22")' cpg.bed

Report lines if they are the 100th through the 200th lines in the file OR (||) they are from chr22

10 of 24

Quality control of BED files

# Find lines where start is greater than end.

# What should this return?

awk '$2>$3' cpg.bed

11 of 24

Do some computation and report the results

awk '{print $0, $3-$2}' cpg.bed

If using a print statement, you must add curly brackets between the single quotes describing the program.

Print the BED record followed by the length (end - start) of the record

$0 refers to the entire input line

12 of 24

By default, output is separated by a space. Prefer tabs

awk 'BEGIN{OFS="\t"}{print $0, $3-$2}' cpg.bed

or

awk -v OFS="\t" '{print $0, $3-$2}' cpg.bed

or

awk '{len=($3-$2); print $0"\t"len}' cpg.bed

or

awk -v OFS="\t" '{len=($3-$2); print $0, len}' cpg.bed

Print the BED record followed by the length (end - start) of the record. Separated by a TAB, the OFS (output field separator)

BEGIN: before anything else happens, execute what is in the BEGIN statement. Then start processing the input.

13 of 24

Print columns in any order

# Note that is not possible with `cut`

awk -v OFS="\t" '{print $4,$1}' cpg.bed

14 of 24

Compute the total number of base pairs represented by CpG islands

awk '{tot_len += $3-$2}END{print tot_len}' cpg.bed

Create a variable named "sum" whose value starts at 0, but is increased by the length ($3-$2)

of each CpG island.

END: after all the processing of each line in the file occurs, print the final value of sum.

15 of 24

Question: How could we compure the average number of base pairs represented by CpG islands

# This computes the total...

awk '{tot_len += $3-$2}END{print tot_len/NR}' cpg.bed

16 of 24

How many (whitespace-separated) columns are on each line?

awk '{print NF}' cpg.bed

NF: The number of "fields" (that is, the number of whitespace-separated values) detected for the line

17 of 24

Pattern matching: find CpG islands on chr22

awk '/^chr22/' cpg.bed

^ means match the beginning of the line

18 of 24

Pattern matching: find CpG islands on chr22 or chr9

awk '/^chr22/ || /^chr9/' cpg.bed

^ means match the beginning of the line

19 of 24

Bioawk: a better awk from Heng Li.

20 of 24

Installing bioawk

# clone the code for bioawk from github

git clone https://github.com/lh3/bioawk

# move into the bioawk code directory

cd bioawk

# compile the code into a binary/executable

make

# copy that binary to your ~/bin directory

cp bioawk ~/bin/

21 of 24

Named columns!!!

bioawk -t -c bed '{print $chrom, $start}' cpg.bed | head -n 3

chr1 28735

chr1 135124

chr1 327790

22 of 24

Let's grab some more data to work with. FASTQ

wget https://bedtools-tutorials.s3.amazonaws.com/web/toy.fastq

23 of 24

Named records for FASTQ. This is far easier because FASTQ records are split across four lines!!!

bioawk -c fastx '{print $seq}' toy.fastq | head -n 4

NTCGGAACATTTTTTCTTCAAAAATATGAAAAATCACCTAATTTATCTGAAAATGACATTTANNNCAGTNNNNNNNATTGGGAAAGTGCTCGATTTNCGGA

TGTAATTTACTTTGTTCAGTTAGACTCTTAATTAGACTAAAAACGGTCTCAAAAAGTATAATTTCATAATGAGACACCTTTAAAAATTCTACGTTTTTATG

AGTTTTCTCAAACACAGAAAACATATGGGAGTTTCTCAAACAATGGACAATGAGTGATCACCGATATTTGATACAAATCGACCAACTCGGCTCATATTCTC

AAACATGTCCGAGTAGCTTACAGTGTGCTCCTTTTGCTCTATCTTCCGGCTCCATTTTTTCCGTTGCTTCTCCTGCATTTGTGACAAATTCCCTGATTTTC

24 of 24

How long are the sequencing reads in this FASTQ file?

bioawk -c fastx '{print length($seq)}' toy.fastq | head -n 4

NTCGGAACATTTTTTCTTCAAAAATATGAAAAATCACCTAATTTATCTGAAAATGACATTTANNNCAGTNNNNNNNATTGGGAAAGTGCTCGATTTNCGGA

TGTAATTTACTTTGTTCAGTTAGACTCTTAATTAGACTAAAAACGGTCTCAAAAAGTATAATTTCATAATGAGACACCTTTAAAAATTCTACGTTTTTATG

AGTTTTCTCAAACACAGAAAACATATGGGAGTTTCTCAAACAATGGACAATGAGTGATCACCGATATTTGATACAAATCGACCAACTCGGCTCATATTCTC

AAACATGTCCGAGTAGCTTACAGTGTGCTCCTTTTGCTCTATCTTCCGGCTCCATTTTTTCCGTTGCTTCTCCTGCATTTGTGACAAATTCCCTGATTTTC