Population Genetic Workflow Survey
There are a number of widely used tools to assess the demographic structure of unknown samples of genetic data. The two most widely used are PCA and (versions of) structure, but there are many other valuable tools, including admixtools, treemix, spacemix, SFA, tess3 and eems. The problem is that there are no widely accepted standards for running them, which makes analyses cumbersome. I suspect that many groups use in-house tool sets, but these either remain unpublished or have not gained traction. The goal of this survey is to figure out why that is, and to see if there is a need for a tool that streamlines the application of these methods to a single data set. Many closely related fields, such as phylogenetics (BEAST) and molecular evolution (MEGA), have sophisticated analysis tools that are widely used. Arlequin (Excoffier & Lischer 2015) implements many analyses for single loci, but currently does not implement the methods described above, and is not trivial to use on a cluster/server system.

However, most analyses (at least preliminary ones) follow the same four-step procedure. First, we start with a data set that includes both genetic data and annotations (such as location, language or subspecies). Second, in most cases only a subset of the data is analyzed, either because we want to study a region of interest or because of quality filters. Third, each analysis (such as PCA or structure) has a number of parameters that can be varied for each run. Finally, tables and figures need to be produced.
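The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sample` class, the function names, the toy genotypes and the per-population allele frequency "analysis" are all hypothetical stand-ins (a real run would call PCA, structure, etc.), and none of it is taken from an existing pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    population: str   # annotation (could also be location, language, ...)
    genotypes: list   # alternate-allele counts (0, 1, 2) per SNP; None = missing


def load_data():
    """Step 1: a data set with genetic data and annotations (toy values)."""
    return [
        Sample("s1", "A", [0, 1, 2, None]),
        Sample("s2", "A", [0, 2, 2, 1]),
        Sample("s3", "B", [2, 1, 0, 0]),
        Sample("s4", "B", [None, None, None, 1]),
    ]


def subset(samples, populations, max_missing):
    """Step 2: keep samples from chosen populations that pass a missingness filter."""
    kept = []
    for s in samples:
        missing = sum(g is None for g in s.genotypes) / len(s.genotypes)
        if s.population in populations and missing <= max_missing:
            kept.append(s)
    return kept


def allele_frequencies(samples, min_called=1):
    """Step 3: a stand-in analysis with one tunable parameter (min_called)."""
    by_pop = defaultdict(list)
    for s in samples:
        by_pop[s.population].append(s.genotypes)
    result = {}
    for pop, rows in by_pop.items():
        freqs = []
        for column in zip(*rows):  # iterate over SNPs
            called = [g for g in column if g is not None]
            if len(called) >= min_called:
                freqs.append(sum(called) / (2 * len(called)))
            else:
                freqs.append(None)
        result[pop] = freqs
    return result


def to_table(freqs):
    """Step 4: produce a tab-separated table from the analysis output."""
    lines = ["population\tsnp\tfrequency"]
    for pop in sorted(freqs):
        for i, f in enumerate(freqs[pop]):
            value = "NA" if f is None else f"{f:.2f}"
            lines.append(f"{pop}\t{i}\t{value}")
    return "\n".join(lines)


kept = subset(load_data(), populations={"A", "B"}, max_missing=0.5)
table = to_table(allele_frequencies(kept))
print(table)
```

Because each step is a separate function, changing a filter or a parameter only requires rerunning the steps downstream of it, which is the kind of bookkeeping a workflow manager automates.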

Benefits
Benefits for users / data analysis

While some standards (plink, vcf, glf) exist for the genetic data, additional data that is often useful, such as sampling location, language, ecotype or population, is not standardized, and each tool requires its own input format. Established interfaces could streamline this process. The problem gets worse because it is often necessary to investigate the impact of QC choices, analyze subsets of the data, or add new data to an analysis.
While many labs will have set up their own pipeline for this purpose, standardization has a few benefits. First, the impact of bugs may be reduced. Second, the barrier to entry is lowered. Third, alternative methods that are similar (e.g. tess3 vs. admixture vs. structure) can easily be compared, and the most useful method for a given data set can be applied.
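As a rough sketch of what a shared interface for sample annotations could look like, the Python snippet below reads one annotation table and derives two tool-specific inputs from it. The column names, sample values and the `coordinate_lines` output layout are assumptions made for illustration; only the two-column family ID / individual ID layout mirrors what PLINK's `--keep` option reads.

```python
import csv
import io

# Hypothetical shared annotation table; column names are assumptions.
ANNOTATIONS = """sample_id,family_id,population,latitude,longitude
s1,f1,Sardinian,40.1,9.0
s2,f2,Sardinian,39.9,9.1
s3,f3,Basque,43.0,-1.5
"""


def read_annotations(text):
    """Parse the annotation table into a list of dicts, one per sample."""
    return list(csv.DictReader(io.StringIO(text)))


def keep_list(rows, populations):
    """Family ID and individual ID per line, the layout PLINK's --keep reads."""
    return [
        f"{r['family_id']} {r['sample_id']}"
        for r in rows
        if r["population"] in populations
    ]


def coordinate_lines(rows, populations):
    """Hypothetical 'longitude latitude' lines for a spatial method."""
    return [
        f"{r['longitude']} {r['latitude']}"
        for r in rows
        if r["population"] in populations
    ]


rows = read_annotations(ANNOTATIONS)
print("\n".join(keep_list(rows, {"Sardinian"})))
print("\n".join(coordinate_lines(rows, {"Sardinian"})))
```

Both outputs are generated from the same rows and the same population filter, so a change in the subset propagates consistently to every downstream tool.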

Benefits for method developers

Standards allow for easier performance comparison. In the spirit of dynamic statistical comparison (https://stephenslab.github.io/dsc-wiki/), easy access to multiple methods makes it easy to investigate the behavior of new methods in a wide array of circumstances. It will also facilitate the adoption of new methods if they are easily accessible in a framework that is already in use.

Proposal
I have implemented a workflow following these steps for a paper on human fine-scale structure, and have successfully used it to apply PCA, admixture, treemix and EEMS to a number of data sets from various organisms (humans, Daphnia, chimps and spiders), both locally and on the three separate clusters I have worked on. However, the project needs substantial work (on streamlining and documentation) to be shareable, and I want to assess whether there is a perceived need.

My workflow is managed using Snakemake (http://snakemake.readthedocs.io). While Snakemake is based on Python, it allows the use of Bash, Python, R and other languages at any stage, so the framework is largely language agnostic.
A skeleton is available at https://github.com/BenjaminPeter/eems-around-the-world-draft.

Why a survey?
I am unsure if there is a need for such a pipeline. There is a necessary flexibility vs. usability trade-off, and most population geneticists may be technologically sophisticated enough that they prefer to work within their own workflow management scheme. Or, to put it in xkcd terms: [comic image not shown]
About me
I am currently a postdoc in John Novembre's lab at the University of Chicago. If you have any questions or comments, please email me at bpeter@uchicago.edu.