Response Letter.

Dear PlosGen Editors,
We appreciate the comments on the initial draft of our manuscript and here include thorough revisions addressing the major concerns of each of the two reviewers.  Below we address their specific comments and highlight changes we have made to the text including the additional analysis requested by reviewer 2.

We would like to briefly highlight the overall significance of the important findings unique to this study. The sampling is unique in its sheer size, ~6000 strains across three broad scales: local, regional, and continental. In addition, our collection spans the native Eurasian range and the recently colonized North American continent. It includes many new collections that finally anchor and filter the commonly used stock center lines.  The first result, clear isolation by distance, was unexpected in A. thaliana because the species is predominantly selfing and patchily distributed, colonizing disturbed environments.  Our multiple independent population, or patch, estimates show that migration and ~5% outcrossing can establish isolation by distance over short scales (<10km) in Eurasia and are establishing it across a broader scale in N. America (>100km). Thus we are witnessing the patterns and process of population differentiation in a species under the influence of human disturbance and can, for the first time, directly contrast the patterns of gene flow in established versus invasive environments. This paper sets a global background for further studies (including a companion submission by Bomblies, Weigel and colleagues) at the local and regional scales and empirically defines the genetic diversity in specific locations for future tests of adaptation, as directional changes in allele frequency.  We look forward to hearing your thoughts and/or additional comments on this revised manuscript.


Sincerely,

Justin Borevitz and Alexander Platt


Specific responses to reviewers' comments.

Reviewer 1

[..The strength of the study is the large volume and quality of data and the relevance of the results to the Arabidopsis evolutionary genetics community. The most interesting result is that Platt et al estimation the average selfing rate to be 0.97 in nature, lower than most would think. The primary difficulty with the current draft is that critical details of the selfing rate (s) estimation procedure are not given. Platt et al state that F[is] of genotypes was used to infer s.]   We have added details to the Methods describing exactly how the selfing rate was calculated.

[They also say that genotypes were extracted from both field collected plants and from seed. The latter often have very different F values in natural populations due to inbreeding depression. (F is higher in zygotes than adults). Are Platt et al assuming that there is no inbreeding depression? This is noteworthy given the highly selfing species often exhibit substantial inbreeding depression] We have clarified the description of the sample as well. Briefly, with the exception of 219 plants from North America, naturally fertilized seeds were collected from mature plants in the field.  These seeds were germinated in the lab and a single plant from each seed collection was kept.  A leaf from each of these plants was genotyped.  These plants were later artificially selfed and seed was sent to the stock center or is available.  The North American exceptions were genotyped directly from leaves of mature plants collected in the field.

The reviewer is quite right to raise the question of inbreeding depression.  For the most part we did not genotype plants taken directly from the field, which would have survived field germination and early growth and so could already have been subject to selection through inbreeding depression.  Any selection in the lab that might remove inbred lines is likely to be far weaker than in the field, so our estimate of the effective selfing rate (the proportion of reproductively mature plants contributing to future generations that were created mono-parentally) is conservative with regard to possible inbreeding depression.

[ Contrast isolation by distance in models where individuals are continuously distributed with patch models (say the 2D stepping stone model). The model structures are different, but can Platt et al cite specific examples where these models make divergent predictions? Perhaps in conditions for protected polymorphism? Mutation-selection balance? ] In the limit where deme size is small enough that each deme contains only zero or one plant, the 2D stepping stone model and continuous models converge.  As deme size increases the quality of the approximation falls off.  Clearly there is a trade-off here between mathematical complexity and model accuracy, and exactly where the optimum lies will depend greatly on the purpose of the modeling.  The intent of the paper was not to discourage the use of very fine-scale 2D stepping stone models so much as to emphasize the importance of real geography, the problematic nature of defining discrete population boundaries, and the general inapplicability of models with few (or even moderate numbers of) demes.  We have amended the text to try to clarify this point.

Reviewer 2

[The dataset is ambitious, and characterization of the underlying population structure of Arabidopsis is an important component of developing an appropriate null model for the genetics of this model organism. One of the key results of the paper is a convincing demonstration that outcrossing events are sufficiently common that analyses need to treat this species as sexual. This conclusion contrasts with the conventional wisdom, which often describes A. thaliana as an "obligate" selfer, which implies a complete lack of recombination. The paper is also well written, and the results are clearly conveyed. ] Thank you


[ First, I would like to see plots of pairwise values of FST/(1-FST) a la Rousset (Genetics 1997). I suspect that the data may be too noisy for reliable quantitative estimates, but it might be possible to get at least an approximate estimate of neighborhood size from the Eurasian data. ] We have included this as a supplemental figure. Unfortunately, the data are a very poor fit for the underlying model and it is impossible to make meaningful inference from this analysis. 


[ Second, it would be interesting to know if anything can be done quantitatively with the non-equilibrium dynamics in the North American data. For example, in the context of human populations, I did some simulation work in the past showing how IBD patterns spread following a demographic shift (Wilkins & Marlowe 2006 Bioessays). In that case, the idea was a transition from low migration to high migration, but it was possible to write down a simple analytic expression for the rate at which the new IBD pattern spread. This case seems to be the opposite, with a panmictic population gradually developing IBD signatures, but I am wondering if a back of the envelope calculation might help to explain the range over which IBD exists. This calculation would presumably include the approximate date of introduction of the species into North America as well as an estimate of the dispersal rate. ]

Previous work on these non-equilibrium dynamics has presented results for a simple shift in population size or migration rate (or the product of the two).  In the case of A. thaliana both of these parameters are expected to have been changing dynamically and could well still be in flux, particularly in North America.  Sadly, this far more complicated scenario does not lend itself well to back of the envelope calculations.




 

The Scale of Population Structure in Arabidopsis thaliana

Alexander Platt1,  Matthew Horton2**, Yu S. Huang1**,  Yan Li2**,  Alison E. Anastasio2, Ni Wayan Mulyati2, Jon Ågren3, Oliver Bossdorf4, Diane Byers5, Kathleen Donohue6, Megan Dunning2, Eric B. Holub7, Andrew Hudson8, Valérie Le Corre9, Olivier Loudet10, Fabrice Roux11, Norman Warthmann12, Detlef Weigel12,  Luz Rivero13, Randy Scholl13, Magnus Nordborg1,14,  Joy Bergelson2, Justin O. Borevitz2 

1Molecular and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA, 90026, USA
2Dept of Ecology and Evolution, University of Chicago, 1101 E 57th St, Chicago IL, 60637
3Department of Ecology and Evolution, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18 D, SE-752 36 Uppsala, Sweden
4Institute of Plant Sciences, University of Bern, Altenbergrain 21, CH-3013 Bern, Switzerland
5Department of Biological Sciences Campus Box 4120, Illinois State University, Normal, IL 61790-4120
6Department of Biology, 125 Science Drive, Duke University, Durham, NC 27708
7University of Warwick, Warwick Life Science, Wellesbourne, CV35 9EF, UK
8Institute of Plant Molecular Sciences, University of Edinburgh, Mayfield Rd, Edinburgh EH9 3JH, UK
9UMR Biologie et Gestion des Adventices, 17 rue Sully, B.P.86510, 21065 Dijon Cedex, France
10INRA, IJPB Genetics and Plant Breeding SGAP UR254, F-78026 Versailles, France
11Laboratoire de Génétique et Evolution des Populations Végétales, UMR-CNRS 8016, Université de Lille I, F-59655 Villeneuve d’Ascq Cedex, France
12Dept of Molecular Biology, Max Planck Institute for Developmental Biology, Spemannstrasse 35, 72076 Tübingen, Germany
13Arabidopsis Biological Resource Center, Ohio State University, 1060 Carmack Road, Columbus, OH 43210
14Gregor Mendel Institut, 1030 Vienna, Austria

Abstract

Background

The population structure of an organism reflects its evolutionary history and influences its evolutionary trajectory.  It constrains the combination of genetic diversity and reveals patterns of past gene flow.  Understanding it is a prerequisite for detecting genomic regions under selection, predicting the effect of population disturbances, or modeling gene flow. This paper examines the detailed global population structure of Arabidopsis thaliana.

Methodology/Principal Findings:

Using a set of 5,707 plants collected from around the globe and genotyped at 149 SNPs, we show that while Arabidopsis thaliana as a species self-fertilizes 97% of the time, there is considerable variation among local groups.  This level of outcrossing greatly limits observed heterozygosity but is sufficient to generate considerable local haplotypic diversity.  We also find that in its native Eurasian range Arabidopsis thaliana exhibits continuous isolation by distance at every geographic scale without natural breaks corresponding to classical notions of populations. By contrast, in North America, where it exists as an exotic species, Arabidopsis thaliana exhibits little or no population structure at a continental scale but local isolation by distance that extends hundreds of km.

Conclusions/Significance:

This suggests a pattern for the development of isolation by distance that can establish itself shortly after an organism fills a new habitat range.  It also raises questions about the general applicability of many standard population genetics models.  Any model based on discrete clusters of interchangeable individuals will be an uneasy fit to organisms like Arabidopsis thaliana which exhibit continuous isolation by distance on many scales.

 

Much of the modern field of population genetics is premised on particular models of what an organism's population structure is and how it behaves.  The classic models generally start with the idea of a single randomly mating population that has reached an evolutionary equilibrium.  Many models relax some of these assumptions, allowing for phenomena such as assortative mating, discrete sub-populations with migration, self-fertilization, and sex-ratio distortion.  Virtually all models, however, have as their core premise the notion that there exist classes of exchangeable individuals, each of which represents an identical, independent sample from that class's distribution.  For certain organisms, such as Drosophila melanogaster, these models do an excellent job of describing how populations work[1].  For other organisms, such as humans, these models can be reasonable approximations but require a great deal of care in assembling samples and can begin to break down as sampling becomes locally dense[2][3]. For the vast majority of organisms the applicability of these models has never been investigated.

When studying natural populations, reasonable models of isolation, migration, and population growth should be applied to estimate the population structure of an organism. It is also important to understand how a species' population structure has been altered by anthropogenic disturbance.  The population structures of domesticated organisms such as corn or rice are clearly and drastically influenced by human intervention and provide extreme examples of how demographic processes can influence the genetic diversity and distribution of a species[4][5][6].  There are now few organisms whose habitat range does not coincide with human activity or for whom interference in their population structure is of little concern.  The degree of impact humans have, whether on purpose or not, on the population structures of species that are not targets of domestication is unclear.

In this paper we present the results of a large scale study of the global population of Arabidopsis thaliana as an example of a natural organism that, like many others, exists in a predominantly continuous habitat that is much larger than the migration range of any individual, engages in sexual reproduction (with at least some regularity), and exists partially as a human commensal but serves no agricultural purpose.

 

Composition of sample  

We analyzed 5,707 plants collected around the globe [figure 1] with 139 SNPs spread across the genome. These plants cluster into 1,799 different haplogroups, with approximately three quarters of those haplogroups consisting of a single unique plant. Some haplogroups are represented by tens, or even hundreds, of individuals [supplemental figure 1].  One haplogroup was found over a thousand times across North America and another was found more than 200 times across the United Kingdom. The distribution of all pairwise genetic distances highlights three types of inter-plant relationships: pairs can be genetically identical (approximately 3% of all pairs in the sample, mostly pairs within North America), they can be completely unrelated given our marker resolution (approximately 85% of pairs in the sample, mostly inter-continental pairs or pairs within Eurasia), or they can show an intermediate degree of relatedness (approximately 12% of pairs in the sample, mostly pairs within North America with very few inter-continental pairs) [figure 2]. Simulations demonstrate that these intermediate relationships cannot be explained by a panmictic population and are instead consistent with a more structured population.

Heterozygosity and outcrossing.

Arabidopsis thaliana frequently reproduces by self-fertilizing and only occasionally outcrosses. The level of heterozygosity in the sample is therefore quite low compared to organisms that obligately outcross, with 95% of plants having five or fewer heterozygous loci. We estimated the outcrossing rate in each field site from the distribution of the number of heterozygous markers in each individual. As a whole, our sample selfed 97% of the time in its recent history, with the middle 50% of sites having estimates ranging from 95% to 99%.  The estimates were lower in North American sites (Wilcoxon test p-value<0.005), which had an average selfing rate of 92% and a middle 50% range from 92% to 96% [Figure 3].  Three sites had 0% selfing as their maximum likelihood estimates.  These sites included 2, 3, and 5 plants (respectively).  While the estimates are robust across loci (bootstrapping gives upper 95% confidence intervals of no more than 10% selfing for any of these sites), the small sample sizes may not be representative of the site as a whole.  Most of the material used for this analysis was taken from seeds collected in the field or from mature plants grown under lab conditions from field-collected seed.  As such there was a reduced chance for natural selection to influence the heterozygosity of the sample, as it might have done had the seeds been allowed to grow to maturity under natural conditions.  If inbreeding depression plays a significant role in A. thaliana, the heterozygosity of a cohort of mature plants would be expected to be higher than that of the seed population from which it grows.  Under these circumstances the effective selfing rate, the contribution to future gene pools from self-fertilized plants, could be somewhat lower than we estimate here.

While this level of selfing is high enough to greatly depress the individual heterozygosity of the sample, it is low enough to thoroughly mix haplotypes whenever two distinct haplotypes find themselves in close proximity. [Figure 4] shows the probability that two plants drawn from a given site are from different haplogroups. Approximately 1/5th of sites are dominated by a single haplogroup (>80%).  This includes nearly half the sites in North America but only 1/8th of Eurasian sites. The polymorphic field sites, however, are often quite variable and composed of plants with unique haplotypes.
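The per-site quantity described above can be illustrated with a minimal sketch. It assumes that the two plants are drawn without replacement and that each site is represented simply as a list of haplogroup labels; the function name and the toy labels are hypothetical and are not taken from the analysis pipeline.

```python
from collections import Counter

def prob_different_haplogroup(labels):
    """Probability that two plants drawn without replacement from one
    site belong to different haplogroups (assumed sampling scheme)."""
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two plants per site")
    counts = Counter(labels)
    # Ordered pairs of distinct plants that share a haplogroup.
    same_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same_pairs / (n * (n - 1))

# Toy site in which 8 of 10 plants share one haplogroup.
toy_site = ["h1"] * 8 + ["h2", "h3"]
print(prob_different_haplogroup(toy_site))
```

A site made up of a single haplogroup gives 0, and a site where every plant is unique gives 1.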

Isolation by distance.

Looking at measures of similarities between pairs of plants as a function of geographic distance we see striking differences in pattern between pairs of Eurasian plants and pairs of North American plants. Panels B of figures 5 and 6 show the strong broad trend of decay of genetic similarity with increasing geographic distance across Eurasia.  The fraction of differing alleles rises to saturation across the continent and the probability of finding two plants of the same haplogroup becomes negligible beyond 1000 km.  Panels A, showing effects of similar scale in North America, show extremely wide-spread haplogroups and little relation between distance and allelic similarity.  The entire negative slope of figure 6A can be explained by the distribution of haplogroups in figure 5A.  Panels C and D are the same data on a smaller geographic scale.  Panels D are similar to panels B and show that Eurasia's isolation by distance continues in a smooth manner at this level of resolution.  Panels C reveal that North American Arabidopsis thaliana does exhibit a measure of isolation by distance at this smaller scale though with a great deal more noise than in Eurasia.  Panels E and F continue this trend at a very fine scale.  Both continents exhibit isolation by distance at this level though the pattern is more pronounced in Eurasia.
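As an illustration of the kind of distance-binned summary shown in Figure 5, the sketch below computes, for every pair of plants, a great-circle distance and then the fraction of pairs in each distance bin that share a haplogroup. The use of the haversine formula, the spherical Earth radius, the function names, and the toy coordinates are all assumptions made for this example; the bin widths (150 km, 10 km, 1/2 km) are the ones given in the figure legends.

```python
from collections import defaultdict
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance on a sphere (assumed distance measure)."""
    p1, p2 = radians(lat1), radians(lat2)
    a = (sin((p2 - p1) / 2) ** 2
         + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * asin(min(1.0, sqrt(a)))

def same_haplogroup_by_distance(plants, bin_km):
    """plants: list of (lat, lon, haplogroup) tuples.
    Returns {bin_start_km: fraction of pairs sharing a haplogroup}."""
    same = defaultdict(int)
    total = defaultdict(int)
    for (la1, lo1, g1), (la2, lo2, g2) in combinations(plants, 2):
        start = int(haversine_km(la1, lo1, la2, lo2) // bin_km) * bin_km
        total[start] += 1
        same[start] += (g1 == g2)
    return {b: same[b] / total[b] for b in sorted(total)}

# Toy data: two neighbouring plants of one haplogroup and one distant plant.
toy = [(0.0, 0.0, "a"), (0.0, 0.001, "a"), (0.0, 5.0, "b")]
print(same_haplogroup_by_distance(toy, bin_km=150))
```

The same binning can be reused with the pairwise fraction of non-matching alleles in place of the haplogroup indicator to produce a summary of the kind shown in Figure 6.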

Conclusion

When a species has established itself across a broad geographic range, migrates relatively slowly, and outcrosses with reasonable frequency, isolation by distance is an inevitable outcome.  Every time a new haplotype migrates to a nearby area it recombines with the local haplotypes creating organisms of intermediate relatedness.  Occasional long-distance migration events may have only weak effects on this continuum, as crossing and back-crossing with local haplotypes would dilute the impact. Aggressively invading haplotypes and selective sweeps can, however, strongly disrupt this process.  Both can allow individual haplotypes to spread over much greater distances before being broken apart by the locally established haplotype pools. This is consistent with the pattern that has previously been identified in smaller studies of Arabidopsis thaliana within regions of Europe and Asia[7][8].

A species newly introduced to a region is expected to have a different pattern.  As the species spreads across its new range its migration events bring it to previously unoccupied areas.  Without established local haplotypes there is no recombination, no intermediate genotypes are formed, and single, un-recombined haplotypes can spread uninterrupted over great distances.  As the new range becomes filled with the species, however, isolation by distance will begin to establish itself, first on very local scales and gradually spreading out as recombination creates geographically unique haplotypes and migration and recombination between occupied areas blends them together.  These patterns are consistent with our observations.  In Eurasia, where Arabidopsis thaliana has flourished for thousands of years, it has established a strong gradient of isolation by distance.  In North America, which has been colonized in the last three hundred years[9], haplotypes are spread across the entire continent but weak isolation by distance is emerging, particularly over shorter distances.

Arabidopsis thaliana is often a human commensal in both North America and Eurasia.  The largest difference between its natural history on the two continents is that it has existed across Eurasia for thousands of years and in North America for only a couple of centuries.  Human disturbance does not appear to have radically altered its natural population structure in Eurasia and the results suggest that the disturbance in North America is transitory and that a natural form of isolation by distance will emerge over time.  This suggests that for organisms like Arabidopsis thaliana human disturbance only has a particularly large effect on population structure when established local populations are small or absent, or when an entire local gene-pool is replaced by artificial migrants.  Otherwise, even moderate human disturbance can be swamped out by natural processes.

This kind of continuous isolation by distance is a type of population structure that the field of population genetics is poorly equipped to deal with.  While there are several exceptions[10][11][12][13][14], most of population genetics theory is premised on the existence of discrete populations of exchangeable individuals.  Even the modern field of landscape genetics[15][16] is focused on finding discrete regions within continuous habitats that behave like classic populations.  Organisms like Arabidopsis thaliana, however, do not fit such models.  With continuous geographic variation the probability of observing a particular set of alleles in an organism depends on the unique location of that organism and the alleles at the next closest organism are expected to have been drawn from a slightly different distribution.  Sufficiently fine-scaled lattices of stepping-stone models may approximate many of the important features of this kind of structure, but it is not straightforward to determine the appropriate scale and having too coarse a scale may quickly degrade the numerical results (particularly for populations not at equilibrium). Hierarchical models are particularly inappropriate.  The migration rate is low compared to the outcrossing rate which very quickly (on a scale generally less than a kilometer) creates a geographic blend of alleles and extremely rich pools of local haplotypes.  There is no bifurcating process to be uncovered.  Continuous variation will require a new body of theory in order to accurately estimate effective population sizes, gene flow, recombination, and natural selection.

For researchers using Arabidopsis thaliana as a model organism for ecological and evolutionary studies this paper provides several lessons and raises several new questions.   One important point is that it is necessary to recognize that both genotype and environment are expected to vary spatially.  Any study of local adaptation or gene by environment interaction should expect to find correlations between genotypes and environments simply through spatial correlation.  Study design and analysis must take this into account and show that similarities between plants separated by a given distance within environments are greater than those at similar distances but between environments.  Another point is that in terms of genetic diversity, Arabidopsis thaliana needs to be thought of as a sexually reproducing species: the difference between outcrossing and highly selfing organisms is quantitative rather than qualitative.   Each plant in the wild may contain multiple hybrid siliques.  While the vast majority of individual seeds are self-fertilized, the outcrossing rate is sufficient to introduce considerable genetic recombination after just a few generations.  This will help make natural samples of Arabidopsis thaliana a powerful research subject for genome-wide association studies and linkage mapping[17], but create difficulties in reconstructing even fairly recent phylogeographic events such as the colonization of North America (let alone older events such as the re-colonization of Eurasia after the most recent ice age).  Future studies using higher-density marker sets will have considerably more power to address these questions.

Methods.

Collection.

The collection is described in detail at http://arabidopsis.usc.edu/Accession/.  It contains 4756 new accessions and 1201 accessions obtained from the Arabidopsis Biological Resource Center (ABRC) as a leaf from a single reference plant such that the distributed seed matches the genotype in this study. The collection spans 42 countries and four continents.  

Genotyping.

Genomic DNA was isolated using the Puregene 96-well DNA purification kit (Gentra Systems) with a modified protocol[18]. All DNA samples were normalized to 10 ng/µl and then genotyped at 149 SNPs using the Sequenom MassArray (compact) system at Sequenom (San Diego, CA) and the University of Chicago DNA sequencing facility (Chicago, IL). The primer sequences of the 149 SNPs and their physical and relative genetic distances are listed on the web (http://borevitzlab.uchicago.edu/resources/molecular-resources/snp-markers). They were selected from loci exhibiting minor allele frequencies between 25 and 30% in a set of globally-distributed DNA alignments[19] using MSQT[20].

Data cleaning.

Samples were removed if they contained excess missing genotype calls (>50 of 149), as this indicates poor quality of the genomic DNA or contamination.  Information from ten SNP assays was removed due to excess missing genotypes or heterozygous calls (>25% of sample), which is often an indicator of poorly performing genotype assays. Haplogroups containing the common lab strains Col, Ler, Ws2, and Nd were also removed to limit the chances of contamination.  Multiple samples of each were found, with suspiciously broad global distributions.
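The two filtering thresholds above can be sketched as follows. This is an illustration only: the genotype encoding (None for a missing call, "H" for a heterozygous call), the function name, the order of the two steps, and the reading that a SNP is dropped when either its missing fraction or its heterozygous fraction exceeds 25% are assumptions made for the example.

```python
MISSING = None  # hypothetical encoding of a failed genotype call
HET = "H"       # hypothetical encoding of a heterozygous call

def clean_genotypes(genotypes, max_missing_per_sample=50, max_bad_fraction=0.25):
    """genotypes: list of per-plant lists with one entry per SNP.
    Returns (kept_sample_indices, kept_snp_indices)."""
    # Step 1: drop samples with too many missing calls.
    kept_samples = [i for i, row in enumerate(genotypes)
                    if sum(call is MISSING for call in row) <= max_missing_per_sample]
    if not kept_samples:
        return [], []
    n = len(kept_samples)
    kept_snps = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in kept_samples]
        frac_missing = sum(call is MISSING for call in column) / n
        frac_het = sum(call == HET for call in column) / n
        # Step 2: keep a SNP only if both fractions are within the cutoff.
        if frac_missing <= max_bad_fraction and frac_het <= max_bad_fraction:
            kept_snps.append(j)
    return kept_samples, kept_snps

# Toy matrix (4 plants x 3 SNPs) with a stricter per-sample cutoff for brevity.
toy = [[0, 1, "H"], [0, None, "H"], [None, None, 0], [1, 1, 0]]
print(clean_genotypes(toy, max_missing_per_sample=1))
```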

Haplogroup clustering.

Each plant was assigned to a single unique haplogroup.  All plants in a haplogroup have haplotypes that are potentially identical given the number of SNPs genotyped and the accuracy of the SNP genotyping.  Clusters are defined by a modified QT-clustering[21] algorithm.  The distance function between two haplotypes is derived from the binomial probability of finding the observed number or more of marker mismatches between them given the number of observed markers.  The first haplogroup is defined by finding the central haplotype around which it is possible to form the largest haplogroup.  Haplotypes are proposed in order of their distance from the central haplotype and are included if their distance is less than 0.05 times the current size of the cluster.  Once the largest haplogroup is defined it is removed from the sample and the next largest haplogroup is defined.  This is iterated until every plant has been placed in a haplogroup. Heterozygous markers were treated as missing data.
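The overall shape of this procedure can be sketched in a simplified form. The code below is not the modified QT-clustering algorithm itself: it replaces the size-dependent inclusion rule with a fixed cutoff on the binomial tail probability, and it assumes a per-marker genotyping discordance rate. The parameter values (e = 0.01, alpha = 0.05), the function names, and the toy haplotypes are hypothetical. Missing and heterozygous calls are encoded as None.

```python
from math import comb

def tail_prob(k, n, e):
    """P(X >= k) for X ~ Binomial(n, e)."""
    return sum(comb(n, i) * e**i * (1 - e)**(n - i) for i in range(k, n + 1))

def compatible(h1, h2, e=0.01, alpha=0.05):
    """True if the mismatches between two haplotypes are plausibly
    explained by genotyping error alone (assumed rate e)."""
    shared = [(a, b) for a, b in zip(h1, h2) if a is not None and b is not None]
    mismatches = sum(a != b for a, b in shared)
    return tail_prob(mismatches, len(shared), e) >= alpha

def cluster_haplogroups(haplotypes, e=0.01, alpha=0.05):
    """Repeatedly extract the largest group that can be formed around
    any central haplotype, until every plant has been assigned."""
    remaining = list(range(len(haplotypes)))
    groups = []
    while remaining:
        best = []
        for c in remaining:
            members = [i for i in remaining
                       if compatible(haplotypes[c], haplotypes[i], e, alpha)]
            if len(members) > len(best):
                best = members
        groups.append(best)
        taken = set(best)
        remaining = [i for i in remaining if i not in taken]
    return groups

toy = [[0] * 20, [0] * 19 + [1], [1] * 20, [1] * 19 + [None]]
print(cluster_haplogroups(toy))
```

This naive version compares every candidate center against every remaining haplotype in each round, which is adequate for illustration but slow for thousands of plants.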

Diversity simulations.

To simulate the distribution of pairwise fraction of non-matching alleles we simulated a sample of 10,000 haplotypes.  For each marker in each haplotype an allele was taken from the corresponding site of an observed haplotype randomly chosen with replacement.  The simulation adjusted for production of identical haplogroups was done in the same manner, however only one representative of each haplogroup was included in the random sampling.
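The resampling scheme just described can be sketched as follows. The function names, the fixed random seed, and the toy input are hypothetical; the encoding of missing calls as None is an assumption of the example.

```python
import random

def simulate_panmictic_haplotypes(observed, n_sim=10000, seed=0):
    """For each marker of each simulated haplotype, copy the allele at
    that marker from an observed haplotype chosen at random with
    replacement."""
    rng = random.Random(seed)
    n_markers = len(observed[0])
    return [[rng.choice(observed)[m] for m in range(n_markers)]
            for _ in range(n_sim)]

def fraction_nonmatching(h1, h2):
    """Fraction of markers called in both haplotypes at which they differ."""
    shared = [(a, b) for a, b in zip(h1, h2) if a is not None and b is not None]
    if not shared:
        return None
    return sum(a != b for a, b in shared) / len(shared)

# Toy input with two observed haplotypes over three markers.
observed = [[0, 0, 0], [1, 1, 1]]
simulated = simulate_panmictic_haplotypes(observed, n_sim=5)
print(simulated)
print(fraction_nonmatching(simulated[0], simulated[1]))
```

For the simulation adjusted for identical haplogroups, the same function would be called with a list containing only one representative haplotype per haplogroup.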

Estimation of selfing rates.

Selfing rates were estimated for 88 field sites with 8 from North America.  These are all the sites for which the genotyped tissues were taken directly from field samples or plants grown from field-collected seed, and for which there were at least two haplogroups present. Estimates were derived from the inbreeding coefficient FIS[22] in each field site as implemented in[23].  The point estimates are the maximum likelihood estimators when values are constrained to be equal across field sites.
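For orientation, the standard relationship between the inbreeding coefficient and the selfing rate under a mixed-mating model at inbreeding equilibrium is F = s/(2 - s), or equivalently s = 2F/(1 + F). The sketch below simply evaluates this textbook conversion; it is not the maximum likelihood procedure implemented in [23], and the function names and the clamping of negative estimates to zero are choices made for the example.

```python
def selfing_rate_from_fis(f_is):
    """Equilibrium selfing rate implied by an inbreeding coefficient,
    s = 2F / (1 + F); negative estimates are clamped to zero."""
    if not -1.0 < f_is <= 1.0:
        raise ValueError("F_IS must lie in (-1, 1]")
    return max(0.0, 2.0 * f_is / (1.0 + f_is))

def fis_from_selfing_rate(s):
    """Equilibrium inbreeding coefficient implied by a selfing rate,
    F = s / (2 - s)."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("selfing rate must lie in [0, 1]")
    return s / (2.0 - s)

# The species-wide estimate of 97% selfing and its implied F_IS.
f = fis_from_selfing_rate(0.97)
print(round(f, 3), round(selfing_rate_from_fis(f), 3))
```

Because F rises steeply as s approaches 1, small differences in heterozygosity among highly selfing sites translate into comparatively large differences in the estimated outcrossing rate.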

1.      Kliman RM, Andolfatto P, Coyne JA, Depaulis F, Kreitman M, et al. (2000) The Population Genetics of the Origin and Divergence of the Drosophila simulans Complex Species. Genetics 156: 1913-1931.

 

2.      Marchini J, Cardon LR, Phillips MS, Donnelly P (2004) The effects of human population structure on large genetic association studies. Nat Genet 36: 512-517. doi:10.1038/ng1337

 

3.      Voight BF, Pritchard JK (2005) Confounding from cryptic relatedness in case-control association studies. PLoS Genet 1: e32. doi:10.1371/journal.pgen.0010032

 

4.      Buckler ES, Thornsberry JM, Kresovich S (2001) Molecular Diversity, Structure and Domestication of Grasses. Genetics Research 77: 213-218. doi:10.1017/S0016672301005158

 

5.      Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, et al. (2002) The genome sequence and structure of rice chromosome 1. Nature 420: 312-316. doi:10.1038/nature01184

 

6.      Rafalski A, Morgante M (2004) Corn and humans: recombination and linkage disequilibrium in two genomes of similar size. Trends in Genetics 20: 103-111. doi:10.1016/j.tig.2003.12.002

 

7. Beck JB, Schmuths H, Schaal BA (2008) Native range genetic variation in Arabidopsis thaliana is strongly geographically structured and reflects Pleistocene glacial dynamics. Molecular Ecology 17: 902-915. doi:10.1111/j.1365-294X.2007.03615.x

 

8.      Pico FX, Mendez-Vigo B, Martinez-Zapater JM, Alonso-Blanco C (2008) Natural Genetic Variation of Arabidopsis thaliana Is Geographically Structured in the Iberian Peninsula. Genetics 180: 1009-1021. doi:10.1534/genetics.108.089581

 

9.      O'Kane SL, Al-Shehbaz IA (1997) A Synopsis of Arabidopsis (Brassicaceae). Novon 7: 323-327. doi:10.2307/3391949

 

10.      Wright S (1943) Isolation by Distance. Genetics. 28: 114–138.

 

11.      Maruyama T (1972) Rate of Decrease of Genetic Variability in a Two-Dimensional Continuous Population of Finite Size. Genetics. 70: 639–651.

 

12.      Barton NH, Wilson I (1995) Genealogies and Geography. Philosophical Transactions: Biological Sciences 349: 49-59. doi:10.2307/56123

 

13.      Wilkins JF (2004) A Separation-of-Timescales Approach to the Coalescent in a Continuous Population. Genetics 168: 2227-2244. doi:10.1534/genetics.103.022830

 

14.      Knowles LL, Carstens BC (2007) Estimating a geographically explicit model of population divergence. Evolution 61: 477-493. doi:10.1111/j.1558-5646.2007.00043.x

 

15.      Guillot G, Estoup A, Mortier F, Cosson JF (2005) A Spatial Statistical Model for Landscape Genetics. Genetics 170: 1261-1280. doi:10.1534/genetics.104.033803

 

16.      Storfer A, Murphy MA, Evans JS, Goldberg CS, Robinson S, et al. (2006) Putting the 'landscape' in landscape genetics. Heredity 98: 128-142.

 

17. Atwell S et al. (in submission) Genome-wide association study of 107 phenotypes in a common set of Arabidopsis thaliana inbred lines.

 

18. Li Y (2007) Purification of Arabidopsis DNA in 96-Well Plate Using the PUREGENE DNA Purification Kit. p. 87. In: Weiner MP, Gabriel S, Stephens JC, editors. Genetic variation: a laboratory manual. Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press.

 

19.      Nordborg M, Hu TT, Ishino Y, Jhaveri J, Toomajian C, et al. (2005) The Pattern of Polymorphism in Arabidopsis thaliana. PLoS Biol 3(7): e196

 

20.      Warthmann N, Fitz J, Weigel D (2007) MSQT for choosing SNP assays from multiple DNA alignments. Bioinformatics 23: 2784-2787. doi:10.1093/bioinformatics/btm428

 

21.      Heyer LJ, Kruglyak S, Yooseph S (1999) Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Res. 9: 1106–1115.

 

22.      Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG (2005) Measures of human population structure show heterogeneity among genomic regions. Genome Research 15: 1468-1476. doi:10.1101/gr.4398405

 

23. Lewis PO, Zaykin D (2001) Genetic Data Analysis: Computer program for the analysis of allelic data. Version 1.0 (d16c). Free program distributed by the authors over the internet at http://lewis.eeb.uconn.edu/lewishome/software.html

Figure 1.  Map of collection sites around the world.  Red dots indicate sample sites.

Figure 2.  Fraction of non-matching alleles between all pairs of plants.  Solid bars are observed measurements from data.  Stacked on each other are pairs within Eurasia (blue), pairs within North America (red), and inter-continental pairs (black).  Green line is the distribution from a simulation assuming panmixia.  Yellow line is a simulation assuming global random mating but only measuring differences between unique haplotypes.  

Figure 3.  Estimated selfing rate per field site.  Individual dots are specific field sites. North American sites are in red. The curve is a smoothed kernel density.

Figure 4.  Distribution of haplogroup diversity by field site.  Probability of two plants in a field site being of different haplogroups.  Low values (red) indicate monomorphic field sites.  High values (light) indicate diverse field sites. A dynamic map will be available online at (http://arabidopsis.usc.edu/Accession/).

Figure 5.  Probability of finding two members of a haplogroup as a function of distance and continent.  Dot size shows relative (within panel) number of observations per bin.  Blue line is curve of the form y=mx+b that is best fit to the binned data.  Red line is model of exponential decay of the form y=Cexp(-λ*x) that is best fit to the binned data.  Panels A and B use 150 km bins.  Panels C and D use 10 km bins. Panels E and F use 1/2 km bins.  

Figure 6.  Pairwise distribution of non-shared alleles as a function of geographic distance and continent. Boxes show median, 25th and 75th percentile; whiskers show 9th and 91st percentile.  Shading shows relative (within panel) number of observations per bin.  Blue line is curve of the form y=mx+b that is best fit to the binned data.  Red line is model of exponential decay of the form y=K-Cexp(-λ*x) that is best fit to the binned data.  Panels A and B use 150 km bins.  Panels C and D use 10 km bins. Panels E and F use 1/2 km bins. Data in panels A and E would not converge on an exponential curve.

Supplemental Figure 1. Number of accessions per field site. Eurasian sites are in blue, North American red.

Supplemental Figure 2.  Number of accessions per region defined by size.  Eurasian (red) and North American (blue) regions defined as cells of a discrete geodesic grid of hexagons defined on four different resolutions.  Panel A has an inter-cell distance of ~1km, B ~10km, C ~100km, D ~850km.

Supplemental Figure 3.  Number of samples per distinct haplogroup.  Inset shows fraction of contribution to overall sample of each size-class.