1 of 20

miRNA Target Prediction

A review of existing algorithms and methods

2 of 20

Outline

  • Motivation
  • Review of some successful existing algorithms
    • miRDB (MirTarget v1 & v2)
    • TarBase (DIANA-microT)
    • RNA22 (unfinished)
  • Some thoughts

3 of 20

Motivation

  • Identifying reliable miRNA-mRNA interactions is critical for miRNA functional analysis.
  • So far, no existing tool solves the false-positive problem.
  • We have worked on miRNA for more than a year, yet we still rely on the existing tools even though we know they cause many problems.

4 of 20

MirTarget

  • Developed in 2006, and updated to v2 and v3 in 2007 and 2014, respectively. (v3 is not published yet)
  • All results are presented on miRDB
  • Several independent reviews/studies rated miRDB as the best computational tool in this field.

5 of 20

MirTarget v1

6 of 20

MirTarget v1, 2006

  • miRNA’s seed region: 2-8;
  • for each seed, scan all human 3’-UTR sequences
    • find perfect base pairing region;
    • check other species; the site must be found in >= 3 species;
  • calculate MFE of miRNA/target site duplex
    • MFE <= -13
    • the authors do not explain why “-13” was chosen
  • Local alignment between miRNA and 3’-UTR
    • mature miRNA sequence
    • 22bp downstream of binding site in 3’-UTR
    • Bases surrounding the seed sequences are important for target recognition. Thus limited seed extension was evaluated for pairing to miRNA positions 1, 9 and 10.
    • “The longest stretch of perfect matches (including positions 2–8) was considered as an extended seed for raw score calculation”.
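
The seed scan described above can be sketched in a few lines. This is a minimal illustration only, not the MirTarget implementation: the function names are hypothetical, sequences are assumed to be RNA strings over A/C/G/U, and the conservation, MFE and seed-extension steps are left out.

```python
from __future__ import annotations

# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_site(mirna: str) -> str:
    """Target-site sequence that pairs perfectly with miRNA positions 2-8."""
    seed = mirna[1:8]  # positions 2-8 (1-based) of the mature miRNA
    return reverse_complement(seed)


def find_seed_sites(mirna: str, utr: str) -> list[int]:
    """Return the 0-based start offset of every perfect seed match in a 3'-UTR."""
    site = seed_site(mirna)
    hits = []
    start = utr.find(site)
    while start != -1:
        hits.append(start)
        start = utr.find(site, start + 1)  # allow overlapping hits
    return hits


# Illustrative input: a miRNA whose positions 2-8 read AAGGCAC.
example_mirna = "UAAGGCACGCGGUGAAUGCC"
example_utr = "AAAGUGCCUUAAAGUGCCUUA"
print(seed_site(example_mirna), find_seed_sites(example_mirna, example_utr))
```

Each hit would then be passed to the conservation and MFE filters listed above.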

7 of 20

MirTarget v1, 2006

  • terminal matches
    • from the TargetScan paper, the terminal positions tend to be A/U
    • if the potential target and miRNA meet this condition, the score is increased by 1.

8 of 20

MirTarget v1, 2006

  • Score:

Weights (Wi), largest to smallest: seed conservation > limited seed extension > duplex binding stability > terminal base match

Cutoff: score >= 30

  • False positive control:

each miRNA was shuffled 100 times and targets were predicted for the shuffled sequences

e.g. miR-124 target prediction:

85 gene targets were predicted (score >30) for real miR-124;

1.7 predicted targets for one shuffled miR-124 sequence on average.
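
The shuffling control can be sketched as follows. This is an assumption-laden illustration: `predict_targets` is a hypothetical callable standing in for whatever predictor is being tested (it takes a miRNA sequence and returns a list of predicted target genes), and the helper names are not from the paper.

```python
import random


def shuffle_sequence(seq, rng):
    """Return a random permutation of seq, preserving base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def mean_shuffled_targets(mirna, predict_targets, n_shuffles=100, seed=0):
    """Average number of targets predicted across shuffled versions of mirna.

    predict_targets is a hypothetical predictor: sequence -> list of targets.
    """
    rng = random.Random(seed)
    counts = [
        len(predict_targets(shuffle_sequence(mirna, rng)))
        for _ in range(n_shuffles)
    ]
    return sum(counts) / n_shuffles


def signal_to_noise(real_count, shuffled_mean):
    """Ratio of targets for the real miRNA to the shuffled-sequence average."""
    return real_count / shuffled_mean
```

With the miR-124 numbers on this slide, `signal_to_noise(85, 1.7)` gives a ratio of about 50.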

9 of 20

MirTarget v2, 2007

  • Upgrade:
    • Integrated miRNA and mRNA expression info
    • Used SVM to predict the possible targets
      • identified more important features

10 of 20

MirTarget v2, 2007

  • Used two sets of microarray data to identify down-regulated genes.
    • Normal condition vs HCT116 Dicerex5 and DLD-1 Dicerex5 cell lines w/ miRNAs
    • 6 miRNAs: let-7c, miR-103, miR-106b, miR-141, miR-15a and miR-215.
      • 6 unique seed regions to reduce the bias of any specific miRNA sequence
    • down-regulated genes: expression level was reduced by at least 40% with P-value < 0.001 in either cell line
    • unaffected (normal) genes: expression level was at least 95%, but no more than 120% with P-value > 0.3 in both cell lines.
  • 1,401 down-regulated genes and 16,761 normal genes were identified.

*However, the specific miRNA-mRNA interactions are still unknown. This step yields only a set of mRNAs that are regulated by the 6 miRNAs.
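
The two gene classes can be expressed as a small rule. This is a sketch under assumptions: expression is given as a treated/control ratio per cell line (so "reduced by at least 40%" becomes a ratio of at most 0.60), and the function name and the "unclassified" label are hypothetical.

```python
def classify_gene(ratios, p_values):
    """Classify one gene from its per-cell-line expression ratios and P-values.

    ratios   -- treated/control expression ratio in each cell line
    p_values -- matching P-value in each cell line
    """
    pairs = list(zip(ratios, p_values))
    # Down-regulated: reduced by >= 40% with P < 0.001 in EITHER cell line.
    if any(r <= 0.60 and p < 0.001 for r, p in pairs):
        return "down-regulated"
    # Unaffected: 95%-120% of control with P > 0.3 in BOTH cell lines.
    if all(0.95 <= r <= 1.20 and p > 0.3 for r, p in pairs):
        return "unaffected"
    # Everything else falls into neither training class.
    return "unclassified"


print(classify_gene([0.55, 0.90], [0.0005, 0.2]))
print(classify_gene([1.00, 1.10], [0.5, 0.4]))
```

Genes that fall between the two thresholds are excluded from both sets, which keeps the positive and negative training examples well separated.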

11 of 20

MirTarget v2, 2007

  • After identifying the down-regulated and normal genes, the authors retrieved their 3’-UTRs from NCBI and conducted further analysis.
  • 4 major types of seed pairing:
    • seed 8: perfect complementarity on p1-8
    • seed 7a: perfect complementarity on p1-7
    • seed 7b: perfect complementarity on p2-8
    • seed 6: perfect complementarity on p2-7

* In this method, the authors ignored sequences that were predicted to contain multiple binding sites.
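
The four seed types can be told apart with a simple pairing check. The sketch below follows the definitions on this slide; the function names are hypothetical, only Watson-Crick pairs are counted, and the candidate site is assumed to be the 8-nt target segment written 5'->3' so that its last base faces miRNA position 1.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def pairing_mask(mirna, site):
    """Return a list of 8 booleans: is miRNA position p1..p8 paired?

    site is the 8-nt target segment (5'->3'); it is reversed so that
    index 0 faces miRNA position 1.
    """
    opposite = site[::-1]
    return [(m, t) in WATSON_CRICK for m, t in zip(mirna[:8], opposite)]


def seed_type(mask):
    """Map a p1..p8 pairing mask to one of the four seed types, or None."""
    if all(mask[0:8]):
        return "seed 8"   # p1-8
    if all(mask[0:7]):
        return "seed 7a"  # p1-7
    if all(mask[1:8]):
        return "seed 7b"  # p2-8
    if all(mask[1:7]):
        return "seed 6"   # p2-7
    return None


mirna = "UAAGGCACGCGG"  # illustrative sequence
print(seed_type(pairing_mask(mirna, "GUGCCUUA")))
```

The checks run from the longest match to the shortest, so a site is always assigned the most specific type it satisfies.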

12 of 20

MirTarget v2, 2007

  • Identification of training features
    • conservation of the human 3’-UTR site in mouse, rat, dog, chicken
    • terminal base: p1, p9, p10
    • MFE
    • dinucleotide counts
    • single nucleotide counts
    • position-specific base counts:
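
As an illustration of the count-based features in this list, the single-nucleotide and dinucleotide counts can be computed as below. This is a sketch only: the function names are hypothetical, and which sequence window the counts are taken over is an assumption left to the caller.

```python
from collections import Counter
from itertools import product

BASES = "ACGU"


def nucleotide_counts(seq):
    """Count each of A, C, G, U in seq (absent bases are reported as 0)."""
    counts = Counter(seq)
    return {base: counts.get(base, 0) for base in BASES}


def dinucleotide_counts(seq):
    """Count all 16 overlapping dinucleotides in seq (absent ones are 0)."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return {a + b: counts.get(a + b, 0) for a, b in product(BASES, repeat=2)}


window = "AUAUG"  # illustrative sequence
print(nucleotide_counts(window))
print(dinucleotide_counts(window)["AU"])
```

Returning a fixed set of keys keeps every feature vector the same length, which is what an SVM needs as input.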

13 of 20

MirTarget v3, 2014

  • According to the latest paper of miRDB, the authors have upgraded the algorithm to version 3, which employed CLASH data as additional training data.

14 of 20

DIANA-microT-CDS, 2012

Highlights:

  • integrates CLIP data to provide high-confidence miRNA binding sites on the target genes.
    • PAR-CLIP: true binding sites.
  • predicts binding sites not only in the 3’-UTR, but also in the CDS.
    • However, Dr. Wang (miRDB, 2014) suggested that binding sites in the CDS do not significantly repress gene expression.

15 of 20

DIANA-microT-CDS, 2012

  • MREs specified by the PAR-CLIP data are divided in two categories according to the genomic region:
    • 5075 MREs in 3’-UTRs
    • 6057 MREs in CDSs.
  • Feature extraction:
    • Alignment for putative MRE identification
      • miRNA extended seed sequence [p1–p9] and every 9 nt window on the 3’-UTR or CDS
        • minimum: 4nt pairing
        • 1 G:U is allowed if >=6nt pairing
        • 1 bulge/mismatch if >= 8nt pairing
    • Detection of binding categories
      • 64 different binding categories
        • ‘8mer+3UTR-pairing 1st:mismatch+NotA’
    • Conservation measure of the MRE sequence in CDS/3’-UTR
      • CDS: 30;
      • 3’UTR: 16.
    • Detection of significantly accessible locations within MREs
      • Sfold: 40 nt upstream and 10 nt downstream of the start of each MRE
    • flanking AU content; distance to closest 3’-UTR end; adjacent MRE distance; MFE; etc.

16 of 20

DIANA-microT-CDS, 2012

  • Feature selection
    • used stepAIC to find the model with minimum AIC
    • feature selection was performed independently for CDS and 3’-UTR
  • Training and Scoring
    • a generalized linear model is used to calculate the score.
    • for each gene, the CDS and 3’-UTR scores are linearly combined

17 of 20

RNA22, 2006 (unfinished)

Highlights:

  • Thinking outside the box
  • Pure engineering style
  • Signal processing
  • Pattern recognition
  • IBM Watson Research Center (Now: Thomas Jefferson University)
  • Unfinished review.

18 of 20

RNA22, 2006

  • 644 mature microRNA sequences
  • remove identical and near-duplicate entries
    • no two sequences have more than 90% similarity
    • 354 mature microRNA sequences passed the filtering
  • “Teiresias” algorithm was used to discover the patterns among the 354 mature miRNA sequences
    • unsupervised manner
    • e.g.: [AT][CG].TTTTT[CG]G..[AT]
    • 233,554 mature-microRNA patterns remained
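
The two preprocessing ideas on this slide can be illustrated with standard-library tools. This is a rough sketch, not RNA22 or Teiresias itself: the similarity measure (`difflib.SequenceMatcher.ratio`) and the greedy filtering order are assumptions, the function names are hypothetical, and the pattern is simply read as a regular expression over DNA letters.

```python
import re
from difflib import SequenceMatcher


def filter_near_duplicates(seqs, max_similarity=0.90):
    """Greedily keep sequences whose similarity to every kept one is <= 90%.

    Similarity here is difflib's ratio, an assumption made for illustration.
    """
    kept = []
    for seq in seqs:
        if all(
            SequenceMatcher(None, seq, other, autojunk=False).ratio()
            <= max_similarity
            for other in kept
        ):
            kept.append(seq)
    return kept


def find_pattern(sequence, pattern=r"[AT][CG].TTTTT[CG]G..[AT]"):
    """Return the 0-based start of every (possibly overlapping) pattern match."""
    # The lookahead makes each match zero-width, so overlaps are reported too.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence)]


print(filter_near_duplicates(["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC"]))
print(find_pattern("GGACGTTTTTCGAATCC"))
```

In the example pattern, a bracket such as [AT] matches either listed base and a dot matches any base, which is why it maps directly onto regular-expression syntax.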

19 of 20

20 of 20

Thank you!!!