1 of 42

Introduction to Unix

(part 2)

Applied Computational Genomics, Lecture 02

https://github.com/quinlan-lab/applied-computational-genomics

Aaron Quinlan

Departments of Human Genetics and Biomedical Informatics

USTAR Center for Genetic Discovery

University of Utah

quinlanlab.org

2 of 42

What do you think the tail command does?

3 of 42

We spend hours searching for patterns indicative of biological signal in genomics.

4 of 42

Often, we do this in genome files stored in FASTA format

5 of 42

Download a FASTA file

6 of 42

The cat command

(report every line in an input file or stream)

>NC_012920.1 Homo sapiens mitochondrion, complete genome

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG

GTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC

CTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTA

ATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATC

ATAACAAAAAATTTCCACCAAACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCA

AACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC

TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATA

CAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAACCAACCAAACCCC

AAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTC

ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA

GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC

AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA

ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGA

TTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACT

$ cat dna.fasta

7 of 42

The cat command: add line numbers with -n

(report every line in an input file or stream)

1 >NC_012920.1 Homo sapiens mitochondrion, complete genome

2 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG

3 GTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC

4 CTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTA

5 ATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATC

6 ATAACAAAAAATTTCCACCAAACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCA

7 AACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC

8 TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATA

9 CAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAACCAACCAAACCCC

10 AAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTC

11 ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA

12 GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC

13 AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA

14 ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGA

15 TTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACT

$ cat -n dna.fasta

8 of 42

The cat command: check for newlines -E

(report every line in an input file or stream)

>NC_012920.1 Homo sapiens mitochondrion, complete genome$

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG$

GTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC$

CTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTA$

ATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATC$

ATAACAAAAAATTTCCACCAAACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCA$

AACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC$

TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATA$

CAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAACCAACCAAACCCC$

AAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTC$

ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA$

GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC$

AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA$

ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGA$

TTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACT$

$ cat -E dna.fasta

9 of 42

The wc command

(count the lines, words, and bytes in a file or stream)

$ wc dna.fasta

15 20 1051 dna.fasta

How many lines words and bytes (characters) in dna.fasta?

$ wc -l dna.fasta

16 dna.fasta

How many lines (-l) in dna.fasta?

>NC_012920.1 Homo sapiens mitochondrion, complete genome

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG

GTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC

CTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTA

ATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATC

ATAACAAAAAATTTCCACCAAACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCA

AACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC

TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATA

CAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAACCAACCAAACCCC

AAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTC

ACATCACCCCATAAACAAATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAA

GCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGC

AATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA

ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGA

TTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACT

$ cat dna.fasta

$ wc -c dna.fasta

1051 dna.fasta

How many characters (-c) in dna.fasta?

Think: Is this the same as the number of nucleotides in the sequence?

10 of 42

I am lazy. You should be too. Shortcuts!!!

  1. Go back to past commands with the arrow keys.
  2. The history command will report the commands you have used in the past.
  3. Type the "Ctrl" "r" keys at the same time to bring up a search for commands that contain search terms. Use the arrow keys to cycle through all commands that match the search term.
  4. Use the "Tab" key for autocomplete - just like your smartphone!
  5. Type the "Ctrl" "c" keys at the same time to kill a command.

11 of 42

But what about doing

something useful?

12 of 42

Let’s imagine you are a first year graduate student studying gene regulation in human muscle development.

You will need a list of genes and transcripts in the human genome.

Let's create a directory called “muscledev” within your “home” directory.

$ mkdir muscledev

$ cd muscledev

$ curl https://home.chpc.utah.edu/~u1007787/genes.gtf > genes.gtf

$ more genes.gtf

13 of 42

The more command

(scroll through the contents of files page by page)

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

#!genebuild-last-updated 2016-06

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2

Tips:

  1. Hit the space key to move to the next page.
  2. Hit “q” to quit and return to the prompt

14 of 42

The less command

(scroll through the contents of files page by page. less is “more” , because you can go forwards and backwards)

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

#!genebuild-last-updated 2016-06

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2

$ less genes.gtf

Tips:

  • Hit the space key to move to the next page, or enter to move to the next line.
  • Hit “q” to quit and return to the prompt
  • Hit the up and down arrows to move backwards and forwards in the file, respectively

15 of 42

The less command (searching)

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

#!genebuild-last-updated 2016-06

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2

$ less genes.gtf

Searching the file:

If you type “/” (i.e., forward slash) after you have opened the file with less, this tells less that you want to search for a specific “pattern”. After typing “/”, then type the pattern you want to search for (e.g., “exon”) and then type Enter. Less will jump to the first occurrence of that pattern.

Typing “n” will take you to the next occurrence of the pattern.

Repeat ad infinitum

16 of 42

How do we put an end to this dreadful line “wrapping” behavior? (less -S)

17 of 42

The man command

man is the manual viewer; it can be used to display manual pages for options, scroll up and down, search for occurrences of specific text, and other useful functions.

LESS(1) LESS(1)

NAME

less - opposite of more

SYNOPSIS

less -?

less --help

less -V

less --version

less [-[+]aABcCdeEfFgGiIJKLmMnNqQrRsSuUVwWX~]

[-b space] [-h lines] [-j line] [-k keyfile]

[-{oO} logfile] [-p pattern] [-P prompt] [-t tag]

[-T tagsfile] [-x tab,...] [-y lines] [-[z] lines]

[-# shift] [+[+]cmd] [--] [filename]...

(See the OPTIONS section for alternate option syntax with long option

names.)

DESCRIPTION

Less is a program similar to more (1), but which allows backward movement

in the file as well as forward movement. Also, less does not have to read

$ man less

Tips:

  • One navigates the man page using the same keys (i.e., space, up, down, as “less”)

So, how do we find any documentation about OPTIONS (hint) to deal with wrapped (hint) lines in the less command?

(That is, besides google)

18 of 42

The less command -S option

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

#!genebuild-last-updated 2016-06

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972";

1 havana transcript 11869 14409 . + . gene_id "ENSG00000

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972";

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972";

1 havana exon 13221 14409 . + . gene_id "ENSG00000223972";

1 havana transcript 12010 13670 . + . gene_id "ENSG00000

1 havana exon 12010 12057 . + . gene_id "ENSG00000223972";

1 havana exon 12179 12227 . + . gene_id "ENSG00000223972";

1 havana exon 12613 12697 . + . gene_id "ENSG00000223972";

1 havana exon 12975 13052 . + . gene_id "ENSG00000223972";

1 havana exon 13221 13374 . + . gene_id "ENSG00000223972";

1 havana exon 13453 13670 . + . gene_id "ENSG00000223972";

1 havana gene 14404 29570 . - . gene_id "ENSG00000227232";

1 havana transcript 14404 29570 . - . gene_id "ENSG00000

1 havana exon 29534 29570 . - . gene_id "ENSG00000227232";

1 havana exon 24738 24891 . - . gene_id "ENSG00000227232";

1 havana exon 18268 18366 . - . gene_id "ENSG00000227232";

1 havana exon 17915 18061 . - . gene_id "ENSG00000227232";

1 havana exon 17606 17742 . - . gene_id "ENSG00000227232";

$ less -S genes.gtf

Tips:

  • Use the right and left arrows to move towards the end and beginning of a line, respectively.

So nice and tidy!!!

19 of 42

The GFF format

http://uswest.ensembl.org/info/website/upload/gff.html

20 of 42

The grep command (this is VERY useful)

(find lines in an input file or stream that match a specific pattern you are looking for)

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";

...

$ grep "exon" genes.gtf

Result:

Only lines that contain the text "exon" anywhere in the line will be returned.

21 of 42

See grep matches with --color

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";

...

$ grep --color "exon" genes.gtf

Tips:

  • Use the right and left arrows to move towards the end and beginning of a line, respectively.

22 of 42

See grep matches with --color

#!genebuild-last-updated 2016-06

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";

$ grep --color "gene" genes.gtf

By default, grep reports a pattern match anywhere it find it on the line.

What if we want to match "gene" but not things like "gene_name"?

23 of 42

Matches flanked by "whitespace" with -w

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";

1 havana gene 14404 29570 . - . gene_id "ENSG00000227232"; gene_version "5"; gene_name "WASH7P"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; havana_gene "OTTHUMG00000000958"; havana_gene_version "1";

1 mirbase gene 17369 17436 . - . gene_id "ENSG00000278267"; gene_version "1"; gene_name "MIR6859-1"; gene_source "mirbase"; gene_biotype "miRNA";

1 ensembl_havana gene 29554 31109 . + . gene_id "ENSG00000243485"; gene_version "4"; gene_name "MIR1302-2"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; havana_gene "OTTHUMG00000000959"; havana_gene_version "2";

1 havana gene 34554 36081 . - . gene_id "ENSG00000237613"; gene_version "2"; gene_name "FAM138A"; gene_source "havana"; gene_biotype "lincRNA"; havana_gene "OTTHUMG00000000960"; havana_gene_version "1";

1 havana gene 52473 53312 . + . gene_id "ENSG00000268020"; gene_version "3"; gene_name "OR4G4P"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; havana_gene "OTTHUMG00000185779"; havana_gene_version "1";

1 havana gene 62948 63887 . + . gene_id "ENSG00000240361"; gene_version "1"; gene_name "OR4G11P"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; havana_gene "OTTHUMG00000001095"; havana_gene_version "2";

1 ensembl_havana gene 69091 70008 . + . gene_id "ENSG00000186092"; gene_version "4"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000001094"; havana_gene_version "2";

$ grep --color "gene" -w genes.gtf

By default, grep reports a pattern match anywhere it find it on the line.

What if we want to match "gene" but not things like "gene_name"?

24 of 42

Report lines that lack a pattern with -v

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

$ grep "gene" -v genes.gtf

Why are these the only lines that are reported?

25 of 42

Report lines that lack a pattern with -v and -w

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

#!genebuild-last-updated 2016-06

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";

...

$ grep "gene" -v -w genes.gtf

26 of 42

Matching patterns that include quotes

1 ensembl_havana gene 113813811 113871759 . - . gene_id "ENSG00000134242"; gene_version "15"; gene_name "PTPN22"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000011936"; havana_gene_version "2";

1 havana transcript 113813811 113871712 . - . gene_id "ENSG00000134242"; gene_version "15"; transcript_id "ENST00000460620"; transcript_version "5"; gene_name "PTPN22"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000011936"; havana_gene_version "2"; transcript_name "PTPN22-002"; transcript_source "havana"; transcript_biotype "protein_coding"; havana_transcript "OTTHUMT00000033016"; havana_transcript_version "2"; tag "basic"; transcript_support_level "1";

1 havana exon 113871537 113871712 . - . gene_id "ENSG00000134242"; gene_version "15"; transcript_id "ENST00000460620"; transcript_version "5"; exon_number "1"; gene_name "PTPN22"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000011936"; havana_gene_version "2"; transcript_name "PTPN22-002"; transcript_source "havana"; transcript_biotype "protein_coding"; havana_transcript "OTTHUMT00000033016"; havana_transcript_version "2"; exon_id "ENSE00001835701"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana CDS 113871537 113871623 . - 0 gene_id "ENSG00000134242"; gene_version "15"; transcript_id "ENST00000460620"; transcript_version "5"; exon_number "1"; gene_name "PTPN22"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000011936"; havana_gene_version "2"; transcript_name "PTPN22-002"; transcript_source "havana"; transcript_biotype "protein_coding"; havana_transcript "OTTHUMT00000033016"; havana_transcript_version "2"; protein_id "ENSP00000433141"; protein_version "1"; tag "basic"; transcript_support_level "1";

$ grep --color "PTPN22" genes.gtf

What if we want to find lines with a specific gene name (note that the GFF has gene names in quotes). We need to tell grep to treat quote characters literally!

27 of 42

Matching patterns that include quotes

$ grep --color \"PTPN22\" genes.gtf

What if we want to find lines with a specific gene name (note that the GFF has gene names in quotes). We need to tell grep to treat quote characters literally!

To do so, we must "escape" them.

Note this match from our last attempt disappeared. Why?

1 ensembl_havana gene 113813811 113871759 . - . gene_id "ENSG00000134242"; gene_version "15"; gene_name "PTPN22"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000011936"; havana_gene_version "2";

1 havana transcript 113813811 113871712 . - . gene_id "ENSG00000134242"; gene_version "15"; transcript_id "ENST00000460620"; transcript_version "5"; gene_name "PTPN22"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000011936"; havana_gene_version "2"; transcript_name "PTPN22-002"; transcript_source "havana"; transcript_biotype "protein_coding"; havana_transcript "OTTHUMT00000033016"; havana_transcript_version "2"; tag "basic"; transcript_support_level "1";

1 havana exon 113871537 113871712 . - . gene_id "ENSG00000134242"; gene_version "15"; transcript_id "ENST00000460620"; transcript_version "5"; exon_number "1"; gene_name "PTPN22"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000011936"; havana_gene_version "2"; transcript_name "PTPN22-002"; transcript_source "havana"; transcript_biotype "protein_coding"; havana_transcript "OTTHUMT00000033016"; havana_transcript_version "2"; exon_id "ENSE00001835701"; exon_version "1"; tag "basic"; transcript_support_level "1";

1 havana CDS 113871537 113871623 . - 0 gene_id "ENSG00000134242"; gene_version "15"; transcript_id "ENST00000460620"; transcript_version "5"; exon_number "1"; gene_name "PTPN22"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000011936"; havana_gene_version "2"; transcript_name "PTPN22-002"; transcript_source "havana"; transcript_biotype "protein_coding"; havana_transcript "OTTHUMT00000033016"; havana_transcript_version "2"; protein_id "ENSP00000433141"; protein_version "1"; tag "basic"; transcript_support_level "1";

28 of 42

Piping Unix commands together with "|"

The sum is greater than its parts.

# find the first 10 "gene" matches

$ grep "gene" genes.gtf | head

# how many total "gene" matches?

$ grep "gene" genes.gtf | wc -l

# alphabetically sort the "gene" match lines and report the first 10 in alphabetical order

$ grep "gene" genes.gtf | sort | head

29 of 42

Piping Unix commands together with "|"

The sum is greater than its parts.

# alphabetically sort the "gene" match lines and report the first 10 in alphabetical order

$ grep "gene" genes.gtf | sort | head

command 2

command 1

command 3

"stdout"

30 of 42

How many lines contain PTPN22?

$ grep --color \"PTPN22\" genes.gtf | wc -l

316

Why does this command take so long (20-30 seconds) to complete?

31 of 42

How many exon records are there for PTPN22?

$ grep \"PTPN22\" genes.gtf | grep -w exon | wc -l

148

32 of 42

The ability to pipe together many simple tools to create something quite complex is one of UNIX's key strengths.

33 of 42

The cut command

(cut out specific columns/fields from a file or input stream)

So far, we have been working with entire lines of files. Many times we want to just explore specific subsets of information in a file. The grep command helps to isolate specific lines. The cut command helps us to extract specific columns from lines.

The cut command assumes that your file or input stream (in the case of a pipe) is tab-delimited.

34 of 42

The cut command

(cut out specific columns/fields from a file or input stream)

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

#!genebuild-last-updated 2016-06

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana";

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; tr

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript

1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript

1 havana transcript 12010 13670 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; tr

1 havana exon 12010 12057 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 12179 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 12613 12697 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 12975 13052 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 13221 13374 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 13453 13670 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana gene 14404 29570 . - . gene_id "ENSG00000227232"; gene_version "5"; gene_name "WASH7P"; gene_source "havana"; g

1 havana transcript 14404 29570 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; tr

1 havana exon 29534 29570 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

1 havana exon 24738 24891 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

1 havana exon 18268 18366 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

1 havana exon 17915 18061 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

1 havana exon 17606 17742 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

$ less -S genes.gtf

Invisible tab (\t) characters

35 of 42

The cut command identifies columns by number

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

#!genebuild-last-updated 2016-06

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana";

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; tr

1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript

1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript

1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript

1 havana transcript 12010 13670 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; tr

1 havana exon 12010 12057 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 12179 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 12613 12697 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 12975 13052 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 13221 13374 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana exon 13453 13670 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript

1 havana gene 14404 29570 . - . gene_id "ENSG00000227232"; gene_version "5"; gene_name "WASH7P"; gene_source "havana"; g

1 havana transcript 14404 29570 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; tr

1 havana exon 29534 29570 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

1 havana exon 24738 24891 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

1 havana exon 18268 18366 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

1 havana exon 17915 18061 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

1 havana exon 17606 17742 . - . gene_id "ENSG00000227232"; gene_version "5"; transcript_id "ENST00000488147"; transcript

$ less -S genes.gtf

1

2

3

4

5

6

7

8

9

The long strings of text at the end of each line are treated as a single column. Why?

Because the text within is separated by spaces, not tabs.

36 of 42

Use cut to extract solely the second column

#!genome-build GRCh38.p7

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.22

#!genebuild-last-updated 2016-06

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

$ cut -f 2 genes.gtf | head -n 20

Why were the first 5 lines reported?

How could we prevent that?

37 of 42

Let's omit the header lines first.

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

$ grep -v "#" genes.gtf | cut -f 2 | head -n 20

38 of 42

Saving results to files with >

So far, every command we have run reports the output to the screen (i.e., standard output or "stdout").

Obviously, once an analysis is working the way we want, we typically wish to save the results to a file.

We do this with the "redirect to a file operator": >

39 of 42

Saving results to files with >

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

havana

$ grep -v "#" genes.gtf | cut -f 2 | head -n 20 > column2.first20.txt

$ cat column2.first20.txt

40 of 42

The results are just a plain old file!

havana

havana

$ head -n 2 column2.first20.txt

41 of 42

Homework #01

  • Assignment
    • https://gist.github.com/arq5x/c0eb84bce2086fbfbe9184668ef87b31#file-hw1-md
  • Due at 11:59PM on January 25th.
  • Post it to a link that will be on the course github site by Sunday.
  • Name your file UNID.HW1.txt

42 of 42

Reading before next class

  • Pages 25-28 of the following textbook from Jonathan Pritchard, starting with the section entitled "Major data resources for human genetics."
  • Note that many of these concepts apply to model organism genomes
  • https://web.stanford.edu/group/pritchardlab/HGbook/Release-2023-09/HGBook-2023-09-chapters/HGBook-2023-09-23-ch1.2.pdf