Here’s a story illustrating a common use case for this project, from

Summary: This Big Data Spokes Project aims to speed our understanding of genetic variants that are suspected of causing disease by instantly showing researchers a 3D model of the chemical effect of each variant. Because determining these effects is currently a laborious process requiring specialized skills, it is often not done in research on genetic variants. Here, we present a study of the genetic causes of inflammatory bowel disease, and show how our resource could have been employed to make the study faster and cheaper.

A few well-known diseases are strongly associated with a single genetic variant. For example, variants in the BRCA1 and BRCA2 genes are strongly associated with breast cancer. People with these variants are much more likely to get breast cancer than the average person.

Most diseases, however, are weakly associated with many genetic variants. Inflammatory bowel disease (IBD), a chronic inflammatory condition of the gastrointestinal tract that affects about 3.7 million people in the US and Europe, is one such disease. Genome-wide association studies have revealed 163 common genetic variants that each indicate a slightly increased propensity to develop IBD. But none of these alone is a strong predictor of IBD in the way that the BRCA variants are predictors of breast cancer. Understanding these so-called multi-genic diseases is therefore a major open problem in medicine.

Even taken together, the 163 associated variants explain only a fraction of the genetic basis of IBD. So researchers at the Institute for Systems Biology, led by Anna B. Stittrich, hypothesized that rarer variants may also contribute. They analyzed five families with multiple IBD-affected family members to test the hypothesis that low-frequency variants are shared among affected family members and contribute to disease risk. They used Kaviar, a comprehensive database of human genetic variants, to determine which variants occur at low frequency in the population.
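The filtering logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the team's actual pipeline: the `Variant` record, the `shared_rare_variants` function, and the 1% frequency cutoff are all assumptions introduced here, and the population frequencies would in practice come from a resource such as Kaviar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    variant_id: str      # hypothetical label, not a real variant identifier
    frequency: float     # population allele frequency as a fraction (0.01 = 1%)
    carriers: frozenset  # IDs of sequenced family members who carry the variant


def shared_rare_variants(variants, affected, max_frequency=0.01):
    """Keep variants that are rare in the population and carried by
    every affected family member.

    The default cutoff of 1% is an assumed threshold for illustration.
    """
    affected = frozenset(affected)
    return [
        v for v in variants
        if v.frequency <= max_frequency and affected <= v.carriers
    ]
```

A variant survives the filter only if both conditions hold, so a common variant carried by the whole family and a rare variant carried by only one affected member are both discarded.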


Figure 1: Pedigree for a family with three known cases of IBD (Crohn’s disease and ulcerative colitis are types of IBD). Whole genome sequencing of these three family members plus two unaffected family members led to the hypothesis that a mutation in the TRIM11 gene is associated with IBD.

For one of the families (see Figure 1), the 163 common variants seemed to account for less of the disease risk than they did in the other families, so the researchers focused on this family to discover rarer variants that might be responsible. Whole genome sequencing suggested that the best candidate risk variant for this family was a rare missense mutation in the ubiquitin ligase TRIM11. Ubiquitin ligases are protein molecules that catalyze reactions that regulate homeostasis, cell cycle, and DNA repair pathways. “Missense” means that the mutation results in a small, but potentially significant, chemical change in the protein (specifically, an amino acid substitution). The 3-dimensional structure of TRIM11 is shown in a schematic diagram below. TRIM11’s structure is a version of a classic protein structure called a beta-sandwich, with the two sets of flat arrows forming the two slices of bread.
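To make the term “missense” concrete, here is a minimal Python sketch that classifies a single-codon change using the standard genetic code. The function name `classify_codon_change` is an assumption made for this illustration, and the codons in the usage note are generic textbook examples, not the TRIM11 variant discussed here.

```python
# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a DNA codon change by its effect on the encoded amino acid."""
    ref = CODON_TABLE[ref_codon.upper()]
    alt = CODON_TABLE[alt_codon.upper()]
    if ref == alt:
        return "synonymous"  # same amino acid, protein unchanged
    if alt == "*":
        return "nonsense"    # premature stop codon
    if ref == "*":
        return "stop-loss"   # stop codon replaced by an amino acid
    return "missense"        # one amino acid substituted for another
```

For example, `classify_codon_change("GAG", "GTG")` returns `"missense"` (glutamate replaced by valine), whereas `classify_codon_change("GAA", "GAG")` returns `"synonymous"` because both codons encode glutamate.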

To support the hypothesis that this TRIM11 variant contributed to IBD risk, Stittrich’s team wanted to do some lab experiments, but these are expensive and time-consuming, so it was best to first gather further theoretical evidence. By checking where this mutation maps to the protein sequence of TRIM11, the researchers were able to determine that it is located within a domain that was already thought to mediate protein-protein interactions. This was promising, because a disturbance in such interactions is a possible cause for disease. However, to understand this potential effect more deeply, they decided to model the effect of the mutation on the 3-dimensional structure of the protein. A member of their team, Justin Ashworth, had trained in David Baker's lab at the University of Washington and was an expert in using the Rosetta software for this work. Rosetta, he says, is “a leading standard in modeling the biophysical and energetic impacts of mutations.”

The result, illustrated in the figure below, showed that the mutation did affect the surface of the ligase at a place where it interacts with other proteins, confirming that the mutation (see arrow) could indeed disturb protein-protein interactions.

Dr. Ashworth says, “For me it was not exceedingly difficult or time-consuming to produce these models and predictions--but I've been doing specifically this for years.” For researchers who are not specialists, this approach requires “familiarity with protein data, c++, linux, and cluster systems” and would likely consume multiple work sessions over several days.

Dr. Ashworth outlined a simpler approach for obtaining “a nice quick graphic of the structural context of mutations, without scientifically precise or accurate predictions of biophysical impacts”:

  1. Acquire familiarity with any of several available protein modeling tools
  2. Query the Protein Data Bank for an appropriate protein structure to represent the gene/protein of interest
  3. Swap in the mutated amino acid
  4. Do a quick-and-dirty conformational remodel
  5. Output a graphic
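As a rough illustration of steps 2 and 3, the Python sketch below downloads a structure file and swaps in a mutated residue by editing the fixed-width ATOM records of a PDB file. It is a sketch under stated assumptions, not Dr. Ashworth's method: the function names are hypothetical, `fetch_pdb` assumes the RCSB file download URL pattern, insertion codes and alternate locations are ignored, and the side chain is simply trimmed so that a modeling tool can rebuild it in step 4.

```python
from urllib.request import urlopen

# Atoms kept when a residue is swapped; the rest of the side chain is
# dropped so that a modeling tool can rebuild it for the new residue.
BACKBONE = {"N", "CA", "C", "O"}


def fetch_pdb(pdb_id):
    """Step 2 (assumed URL pattern): download a PDB entry as a list of lines."""
    url = f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"
    with urlopen(url) as response:
        return response.read().decode("ascii", errors="replace").splitlines()


def swap_residue(pdb_lines, chain, resseq, new_resname):
    """Step 3: rename one residue and trim its side chain back to CB.

    PDB ATOM records are fixed-width: atom name in columns 13-16,
    residue name in 18-20, chain ID in 22, residue number in 23-26.
    """
    new_resname = new_resname.upper()
    keep = set(BACKBONE)
    if new_resname != "GLY":  # glycine has no CB atom
        keep.add("CB")
    out = []
    for line in pdb_lines:
        is_target = (
            line.startswith("ATOM")
            and len(line) >= 26
            and line[21] == chain
            and int(line[22:26]) == resseq
        )
        if not is_target:
            out.append(line)  # leave every other record untouched
            continue
        if line[12:16].strip() not in keep:
            continue          # drop side-chain atoms beyond CB
        out.append(line[:17] + f"{new_resname:>3s}" + line[20:])
    return out


def atom_record(serial, name, resname, chain, resseq):
    """Build a minimal ATOM record with dummy coordinates (for demos only)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}"
    )
```

Calling `swap_residue(lines, "A", 6, "VAL")` on the lines of a structure returns a new list in which residue 6 of chain A is labeled VAL and retains only its N, CA, C, O, and CB atoms. The result would still need the side-chain rebuild of step 4 before it could be rendered as a meaningful graphic.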

Although the sophistication of the Rosetta approach is valuable for studying a small number of highly promising variants in depth, this simpler approach is sufficient for screening large numbers of variants. Ashworth states that this “could be done fairly easily by anyone reasonably familiar with protein sequence and structure files, the PDB database, and protein modeling tools.” But without such familiarity, even this simpler approach would take considerable time and effort.

Our proposed resource would encapsulate steps 2-5 above into a single step that would take place with a single click as the researcher is investigating variants of interest. A likely click-through location would be the Kaviar database, where Stittrich’s team learned of the low frequency of the mutation.


Figure 2: Results of a query to the Kaviar database for the TRIM11 mutation under study show that it has a very low frequency of 0.0013% (top). Our proposed resource would allow the Kaviar user to click the link in the left column and immediately view a model of the protein structure highlighting the effect of the mutation (bottom). The user would be able to zoom in/out, rotate the model, and manipulate the amino acid residue about its rotatable bonds.

Only after obtaining such computational support for this mutation’s potential for disrupting protein structure did the researchers take the expensive step of conducting assays to test the effect of the mutation experimentally. They found that the mutated TRIM11 induced NF-κB and interferon signaling, whereas the wild-type (normal) TRIM11 repressed such signaling. Tissue localization experiments on TRIM11 further supported the hypothesis that this variant affects protein–protein interactions rather than protein folding and stability.

The next step is to study large cohorts of IBD cases using whole genome sequencing to determine the effect sizes and disease risk contribution of this variant. Ultimately this can lead to genetic tests that will help predict whether an individual will develop IBD, and may also provide a target against which drugs may be developed to help treat the disease.