Scherzer Lab wiki

[bioinformatics section]

Table of Contents:

  1. Bioinformatics lab rules
  2. Lab resources
  3. Linux/Unix basic
  4. Bioinformatics Tools
  5. Bioinformatics resources
  6. High-throughput computing resources
  7. Partners HPCC overview
  8. NGS analysis pipeline
  9. NGS resources


  1. Bioinformatics lab rules

  1. Use compressed formats (*.gz) whenever possible.
  2. Use binary formats (*.bam, *.bw) instead of raw text formats such as *.sam and *.bedGraph.
  3. Move raw sequencing files (*.fastq) to the backup disk after they have been processed.
  4. Remove intermediate or redundant files.
  5. Use soft links (e.g. ln -s) instead of making copies (e.g. cp).
  6. Xianjun’s top 15 practical tips: http://onetipperday.sterding.com/2016/02/my-15-practical-tips-for.html
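
Rules 1 and 5 above can be sketched in a few shell commands. This is a minimal illustration that runs in a scratch directory; the file and folder names (sample.bedGraph, raw/, analysis/) are made up for the example. Rule 2 is shown only as a comment because it assumes samtools is installed.

```shell
# Work in a throwaway directory so no real data is touched.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/raw" "$demo_dir/analysis"

# Hypothetical data file standing in for a large text output.
printf 'chr1\t0\t100\t1.5\n' > "$demo_dir/raw/sample.bedGraph"

# Rule 1: keep text files compressed (replaces the file with sample.bedGraph.gz).
gzip "$demo_dir/raw/sample.bedGraph"

# Rule 2 (comment only; assumes samtools is installed):
#   samtools view -b -o sample.bam sample.sam

# Rule 5: link to the file instead of copying it.
ln -s "$demo_dir/raw/sample.bedGraph.gz" "$demo_dir/analysis/sample.bedGraph.gz"

# The link resolves to the original, so the content is readable through it.
gzip -dc "$demo_dir/analysis/sample.bedGraph.gz"
```

The link takes no extra disk space, but it breaks if the original file is moved, so link to the location where the file will stay.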
  2. Lab resources

  1. Web server: panda.dipr.partners.org
     1. New users: email your HPCC user ID to Xianjun (xdong@rics.bwh.harvard.edu) to request access.
     2. Log in: ssh YOUR_PARTNER_ID@panda.dipr.partners.org
     3. Create the web folder in your home directory: mkdir public_html
     4. From your local machine or Eris, copy the file over: scp my.file YOUR_PARTNER_ID@panda.dipr.partners.org:~/public_html
     5. On panda, make the file world-readable: chmod 644 ~/public_html/my.file
     6. Your file should now be visible at http://panda.partners.org/~YOUR_PARTNER_ID
  2. HPCC cluster: eris1n2.research.partners.org (get an account first; see below)
  3. Track hub: http://panda.partners.org/~xd010/myHub/hub.txt (ask Xianjun for the password)
  4. Scherzer lab track sessions on UCSC:
     1. Enhancer: https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=sterding&hgS_otherUserSessionName=hg19_PD_enhancerall
     2. RNAseq: https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=sterding&hgS_otherUserSessionName=hg19_PD
     3. GTEx: https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=sterding&hgS_otherUserSessionName=hg19_GTEx
  5. Group reference: http://www.mendeley.com/groups/4633051/neurogenomics/
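
The web server steps above can be collected into one small helper. The sketch below only prints the commands instead of running them, so it is safe to try anywhere. The function name publish_commands is hypothetical, and using mkdir -p (so that an existing folder is not an error) is a small deviation from the steps as listed.

```shell
# Print (do not run) the commands for publishing one file on the lab web server.
# Arguments: Partners user ID, path to the local file.
publish_commands() {
  user=$1
  file=$2
  name=$(basename "$file")          # file name as it will appear on the server
  host=panda.dipr.partners.org

  echo "ssh $user@$host 'mkdir -p ~/public_html'"
  echo "scp $file $user@$host:~/public_html/"
  echo "ssh $user@$host 'chmod 644 ~/public_html/$name'"
  echo "http://panda.partners.org/~$user"
}

# Placeholder values; substitute your own ID and file.
publish_commands YOUR_PARTNER_ID my.file
```

The last line printed is the address where the uploaded file should appear.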
  3. Linux/Unix basic

  1. http://www.ee.surrey.ac.uk/Teaching/Unix/
  2. http://www.tldp.org/LDP/gs/node5.html (advanced one)
  3. https://www.bits.vib.be/index.php/training/124-linux-for-bioinformatics
  4. Bioinformatics Tools

  1. bedtools: http://bedtools.readthedocs.org/
  2. samtools: http://samtools.sourceforge.net/
  3. UCSC Jim Kent utilities:
     1. all commands: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64
     2. source: https://github.com/ENCODE-DCC/kentUtils
     3. installation instructions: http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/userApps/README
  4. ENCODE tools: https://github.com/ENCODE-DCC/
  5. RNAseq analysis: Bowtie/Tophat/Cufflinks etc.: http://ccb.jhu.edu/software.shtml
  6. GATK: https://www.broadinstitute.org/gatk/
  7. ggplot2: http://ggplot2.org/
  5. Bioinformatics resources

  1. UCSC Genome Browser
     1. About: http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html
     2. Training tutorial 1: http://www.openhelix.com/ucsc
     3. Training tutorial 2: http://bit.ly/genomebrowserYoutube
  2. Rdocumentation: http://www.rdocumentation.org/
  3. BioStars forum: https://www.biostars.org/
  4. R news and tutorials RSS: http://www.r-bloggers.com/
  6. High-throughput computing resources

  1. Eris (Partners): http://rc.partners.org/hpc
  2. Orchestra (HMS): https://wiki.med.harvard.edu/Orchestra/
  3. Odyssey (Harvard): https://rc.fas.harvard.edu/odyssey-quickstart-guide/
  4. XSEDE’s Blacklight server (Pittsburgh Supercomputing Center): https://www.xsede.org/high-performance-computing
  5. The Data Intensive Academic Grid (DIAG): http://diagcomputing.org/
  7. Partners HPCC overview

  1. First-time users guide: http://rc.partners.org/node/126
  2. Register an account here: https://rc.partners.org/eris_cluster
  3. Log in:
     1. ssh YOUR_PARTNER_ID@eris1n2.research.partners.org
     2. ssh YOUR_PARTNER_ID@eris1n3.research.partners.org
     3. ssh YOUR_PARTNER_ID@erisone.partners.org
  4. The Partners HPCC uses the LSF job scheduler: http://en.wikipedia.org/wiki/Platform_LSF
  5. Knowledge base: http://rc.partners.org/kbase/High_Performance_Computing
  6. How to submit jobs: http://rc.partners.org/node/227
  7. How to change queues/resources: http://rc.partners.org/kbase?cat_id=45&art_id=425
  8. How to log in to a specific node: http://rc.partners.org/kbase?cat_id=45&art_id=405
  9. How to mount the Eris folder: http://rc.partners.org/kbase?cat_id=47&art_id=312
  10. Need help? Send an email (hpcsupport@partners.org) or open a ticket (https://tickets.partners.org/)
  11. VPN: use pvc.partners.org/legacy in Cisco AnyConnect
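
Because the cluster runs LSF, jobs are submitted with bsub. The sketch below writes a small job script with #BSUB directives and submits it only if bsub is available; otherwise it just prints the submit command. The job name and the queue name (normal) are assumptions for illustration, not the cluster's actual configuration; run bqueues on the cluster to see the queues you can use.

```shell
# Write a small LSF job script to a temporary file.
# Job name, queue name, and core count below are placeholders.
job_file=$(mktemp)
cat > "$job_file" <<'EOF'
#!/bin/bash
# Job name (hypothetical)
#BSUB -J demo_job
# Queue name (assumption; check with bqueues)
#BSUB -q normal
# Number of cores
#BSUB -n 1
# Standard output and error files; %J expands to the job ID
#BSUB -o demo_job.%J.out
#BSUB -e demo_job.%J.err
echo "Running on $(hostname)"
EOF

# Submit only where LSF is installed; elsewhere just show the command.
if command -v bsub >/dev/null 2>&1; then
  bsub < "$job_file"
else
  echo "bsub < $job_file"
fi
```

After submitting, bjobs lists your pending and running jobs.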
  8. NGS analysis pipeline

  1. RNA-seq:
     1. on GitHub: https://github.com/sterding/RNAseq
     2. on HPCC: ~/neurogen/pipeline/RNAseq
     3. NOTE: If you edit the pipeline directly on the HPCC, please use git to track your changes. Otherwise, I prefer that you clone a copy into your home directory and push to GitHub every time you make a change. This applies to all projects.
  2. smallRNA-seq
  3. genotyping
  4. DNA-seq
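
The clone, edit, commit, push cycle described in the NOTE above can be sketched as follows. So that the example runs offline, a local bare repository stands in for the GitHub repository; in practice you would clone https://github.com/sterding/RNAseq instead. The file name pipeline_step.sh, the commit message, and the demo identity are made up for the example.

```shell
# A scratch area; remote.git is a local stand-in for the GitHub repository.
work=$(mktemp -d)
git init --quiet --bare "$work/remote.git"

# Clone a working copy (in practice: git clone https://github.com/sterding/RNAseq).
git clone --quiet "$work/remote.git" "$work/RNAseq" 2>/dev/null

# Make a change (hypothetical file), then stage and commit it.
echo 'echo "step 1"' > "$work/RNAseq/pipeline_step.sh"
git -C "$work/RNAseq" add pipeline_step.sh
git -C "$work/RNAseq" -c user.name="Demo User" -c user.email="demo@example.org" \
  commit --quiet -m "Add pipeline step"

# Push the current branch back to the remote.
git -C "$work/RNAseq" push --quiet origin HEAD

# The commit is now visible on the remote side.
git --git-dir="$work/remote.git" log --all --oneline
```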
  9. NGS resources

  1. ENCODE: https://www.encodeproject.org/
  2. modENCODE: http://www.modencode.org
  3. mouseENCODE: http://mouseencode.org
  4. Roadmap Epigenomics: http://www.roadmapepigenomics.org/
  5. FANTOM: http://fantom.gsc.riken.jp/5
  6. GTEx: http://commonfund.nih.gov/GTEx/index
  10. How to manage different R versions on Mac and the Eris cluster

We want to use the same version of R on the Mac and on the Linux-based ERIS cluster. Here is how.

On a Mac, simply download and install R. Its default install location is

/Library/Frameworks/R.framework/Resources/bin/R

You can confirm this from the symlink in /usr/local/bin:

[xdong@macbook ~]$ ls -l /usr/local/bin/R
lrwxr-xr-x 1 root admin 47 Nov  4 19:27 /usr/local/bin/R -> /Library/Frameworks/R.framework/Resources/bin/R

Different versions of R are installed side by side, with Current pointing to the active one:

[xdong@macbook ~]$ ll /Library/Frameworks/R.framework/Versions/
total 4.0K
drwxrwxr-x 6 root 204 Nov  8  2013 3.0
drwxrwxr-x 3 root 102 Jun  9 20:08 3.1
drwxrwxr-x 6 root 204 Nov  4 19:26 3.3
lrwxr-xr-x 1 root   3 Nov  4 19:26 Current -> 3.3

The default path for R libraries installed from the R console or RStudio is

/Library/Frameworks/R.framework/Resources/library/

This can be seen in R via:

> .libPaths()

[1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library"

If you would rather install packages elsewhere, add the following line to your ~/.Rprofile file:

.libPaths( "/My/path/for/Rlib" )

See: http://stackoverflow.com/questions/2615128/where-does-r-store-packages

This can also be overridden by setting the R_LIBS_USER environment variable (not tested yet).
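
The R_LIBS_USER route can be sketched in the shell as follows. R only adds a library directory to .libPaths() if the directory already exists, so the sketch creates it first. The path $HOME/Rlib/3.3 is an arbitrary example, not a lab convention, and the final check is skipped when Rscript is not installed.

```shell
# Example location for a per-version personal library (an arbitrary choice).
R_LIBS_USER="$HOME/Rlib/3.3"
export R_LIBS_USER

# R ignores library directories that do not exist, so create it first.
mkdir -p "$R_LIBS_USER"

# If R is installed, print the library search path to confirm the directory is listed.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e '.libPaths()'
fi
```

To make the setting permanent, the assignment and export lines would go into your shell startup file (for example ~/.bash_profile).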

On the Linux server the setup is similar. The ERIS cluster administrators have already installed several R versions, so all you need to do is load the R module:

$ module avail

$ module load R/3.3.0

You can then use R v3.3.0 directly, and the libraries installed by the administrators are already accessible via $R_HOME. If you install your own library from the R console, R will ask whether you want to create a personal library folder (if one does not exist yet).