WGS Extract Manual (for 18 Feb 2020 BETA version 2b software) 11 Nov 2020

WGS Extract Manual

Beta Edition v2b (software)
18 Feb 2020 release

WGS Extract is a tool to aid Genetic Genealogists with taking their Whole Genome Sequence (WGS) result files and extracting the information needed to submit to autosomal / X DNA segment match, segment analysis and phylogenetic tree of haplogroup sites that support 3rd party test results. Some benefits can be gained by those manipulating WGS files for health reasons but that is not our focus here.

See https://wgsextract.github.io/ (Beta v2b 18 Feb 2020) for the WGS Extract program and this manual. And Section 4 on installation and starting the program, if needed. We delve directly into how to run the tool as a start to this manual. The tool acronym is WGSE and pronounced as "wig-see" in English. We encourage this use in English language conversation.

30x WGS results are delivered in FASTQ, BAM and VCF file formats. They contain as near a complete image of your DNA as currently available. Much more detail than typically found in the RAW data files of microarray DNA tests from 23andMe or Ancestry, for example. The microarray RAW data files have only a small fraction (less than 0.01%) of your DNA data. WGSE works with the BAM files directly to extract the complete sequence data. WGS sequence testing is almost always more correct than microarray tests for the same reported values as well. Even so, there is still some error. See the Bioinformatics for Newbies document for more information on the file types available from your 30x WGS test.

The tool is the concept of City Farmer and follows on similar concepts in Felix Immanuel’s BAM Analysis Kit and the more recent Wilhelm H-O’s DNA Kit Studio for microarray files. WGSE is a small subset of the features in a Galaxy server (or usegalaxy.org / usegalaxy.eu site) but provides a more gentle, push-button capability for WGS processing. The biggest and still unique contribution is the generation of microarray data format files from the WGS test results; often more accurate and complete than that provided by the microarray test companies themselves.

WGS Extract utilizes other GNU-license tools directly; some heavily modified. Of note are the Bioinformatic Tools such as HTSLib: Samtools, BCFTools, BGZip and Faidx; as well as BWA. Also GATK, Picard, Haplogrep and yLeaf. The main program is a Python script that runs many of the common bioinformatic tools behind the scenes. In fact, if you are trying to learn more about the bioinformatic tools, simply look at the Python scripts or follow the logs to see what tool invocations are made. We capture some of these command scripts in the final section 6 Appendix: Under the Hood.

WGS Extract mainly starts with a BAM file. As needed, it will create the FASTQ and/or VCF files from the BAM. Sometimes the BAM is a subset of the WGS test result, or maybe not sorted and indexed as needed by the tools. As much as possible this is checked for and scripts adapted to hide complexity from the user. The goal is to use the tool to quickly ascertain the content of any BAM file and manipulate it to generate whatever data may be needed.

The WGS Extract release on Windows 10 includes a CygWin Win10 port of the bioinformatic tools used (and Python). Making for a quick and easy install of these tools that are usable in a Windows Powershell. See the document Bioinformatics Tools on Windows 10: Quick and Dirty for more information on how to enable this. If you follow the install instructions below for MacOS and Linux, you will automatically make the underlying bioinformatic tools available on those platforms. While BioConda would be the best resource for tools (especially the most current versions); they do not deliver on a Win10 platform. And so we have not used them at all.

The v2 release contains some bug fixes and improvements. yLeaf is now fully working on all platforms. yLeaf has been reworked to relax the standards so it regularly calls the leaf nodes in the ISOGG 2019 Y tree it works from. See City’s Facebook post for more details. An unmapped FASTQ extractor to use with CosmosID is now available. Finally, a more detailed stats section following the popular idxstats spreadsheet format from Randy is included.

While DNA Kit Studio is mainly used to manipulate the microarray RAW files, it does have an option to read VCF files. Working from standard 30x WGS VCF files is NOT sufficient to generate a correct microarray file though. And this path should not be used with that tool and others like it. VCF is just a container format. But what is in it can vary dramatically. Even the BCFtools has a simple one line command to convert any microarray TSV file into a legal WGS VCF format file. But clearly with much less information. An “all-call” RAW VCF can be useful for that tool to generate useful results but is not yet available here.

It is the unknowing use of standard but incomplete VCF files (whether from 30x WGS tests or other sources) that has led some match sites to detect and ban 3rd party generated microarray files. We are working to make all parties more aware of the real issue so proper procedures can be put in place to be assured of quality data. This is not a 30x BAM test processing issue.

Even BAM files are not all the same. 4x, 0.4x WGS or any WES BAM file result will not be sufficient to generate correct microarray files. Only the 30x or greater WGS test BAM file is complete and sufficient to generate an accurate microarray file or haplogroup call. We currently generate a user warning when attempting to use a smaller than 10x WGS BAM file to generate output data. And may tag generated files so downstream resources can reject them as needed.

The true RAW VCF or all-call gVCF generated from a 30x WGS test can be used to generate an accurate and complete microarray file. Such a VCF file is more similar to a reference genome final assembly except it still retains the diploid alleles and quality information.

This manual was frozen on 8 April 2020 and meant to remain static. The Francais patch and an issue with the MacOS Catalina release forced rewrites on 8 June. A merge back in is required.

1 Running the WGS Extract Tool

Once you have started the tool in your environment (possibly by just clicking on the command script file name), you will see a series of windows come and go during the operation. We are highlighting some key ones here that require your input before moving on.

The first pop-up is a question window about which language you wish the tool to use. Supported are English and German (Deutsch). As of March 2020, there is a patch to add Français.

Once answered, by clicking on your choice, you will get the main program dialog screen.

If you never see the language dialog, the tool will never properly start. Change window focus to find this dialog pop-up as it may get hidden. The WGSExtract program is waiting for your answer. For other, similar pop-ups that may commonly occur, see the last sub-section of this chapter.
(Note: Starting in BetaV3, this answer will be remembered and you will not be asked again. Remove the file ~/.wgselang to clear the answer.)

After selecting the language, the main dialog window will appear. This window is shown anytime the program is idle and ready to start the next processing task. The English version of this and all subsequent windows is shown here.

First thing to note is that there are Tabs across the top (introduced with the BetaV2 release). We could no longer fit everything on one screen, so there are three tabs named “Settings”, “Extract data” and “Other”. The main dialog window is shown below and marked to explain its content.

We will cover each tab of this main page in detail below.

1.1 Settings Tab

You must specify the output directory location and the BAM file here BEFORE the other tabs of the program become available (that is, the buttons in the other tabs of this main program window are clickable). You can come back to this tab and change the BAM file and output directory at any time. which may initiate new processing.

The tool will immediately perform prep-work with the selected BAM. For example, it may simply need to create a BAM Index file if one cannot be found. One of the first tasks after selecting a BAM is to determine its reference model type. If this cannot be automatically determined, a query for the reference model type for the selected BAM will be made as shown below:

If you do not know the reference genome, then exit out and select a different BAM.

Another task may be to convert the BAM from one reference genome to another; creating a file named <your BAM file name>_converted_to_hg19.bam and then a BAM index of that new BAM file with a .bai extension. Note: This conversion from one reference genome BAM to another is no longer automatic as some places need a GRCh38 BAM, and other places a HG19. Conversion is delayed until needed.

After any needed preparation or conversion tasks are done (which may take an hour or more), control will return to this main screen and some summary stats of the BAM will be filled in as shown in the second screen capture of the Settings tab shown above. Once you select a BAM source file and/or Output Directory, the button will change from “Select ..” to “Change selection”.

You can specify a location for the output files via the Output Directory button. Make sure 50 to 100 gigabytes are available there.

The conversion tasks only need happen, if at all, with your first run on a particular WGS BAM file. Subsequent tool invocations can access the converted files and make the other sections available more immediately. Any conversion tasks will place their files in the same directory as the BAM. Otherwise, all final output files from WGS Extract are put in the Output Directory. Conversion tasks that create files will start independent of setting the Output Directory.

One of the main conversion (or checking) tasks, after setting the BAM file name, is to determine the reference model used to align the sequencing data in the BAM. The code here is pretty simplistic. It looks for the size of Chromosome 1 from the reference model (that is stored in the header), and the name used for Chromosome 1 (‘1’ or ‘chr1’). One special exception is the hs37d5 model used by Dante Labs since Sep 2019. It is a cross between the HG19 and GRCh37 model format. If the reference model cannot be determined by this simple process, then the program will put up a dialog to ask what reference model is in the BAM file. Accuracy is required. If you are asked and are unsure, do not proceed until the answer is known.

Note that you do not have to supply a WGS BAM to WGS Extract. Maybe a Y and MT BAM created from an earlier run is fed back in as a reminder or to see the stats. Or the BAM supplied by FamilyTreeDNA from their BigY test can be used. If FTDNA, make sure to first “unzip” the file they supply as the BAM and BAI index are grouped together inside it.

Note: Once you select the FamilyTreeDNA BAM, you may see the following warning:

We have modified release Betav3 and later to determine the read depth on only those chromosomes contained in the BAM file. At the current time, the only possibly useful tool for a FamilyTreeDNA BigY BAM is the Y Haplogroup caller (yLeaf) beside the stats.

1.2 Extract Data Tab

The various extraction programs are all grouped together in an “Extract data” tab. Each is described further below. If no buttons are clickable, you need to go back to the “Settings” and specify a BAM file and output file directory (and possibly wait for any conversion process to finish). If only the “Generate a BAM file that contains only Y-DNA” is not clickable (as shown, greyed out), then likely the program has determined that this is a gender Female sample file and has no Y chromosome values to generate. That situation is illustrated in the screen shot below.

1.2.1 Autosomes and X Chromosome

This section generates the microarray RAW data files (pseudo, simple VCF or TSV) like put out by the major genetic genealogy test vendors (23andMe, Ancestry, FamilyTreeDNA, LivingDNA, and MyHeritage). The output files are read in by many of the above companies as well as 3rd Party sites such as GEDMatch and Geneanet; or tools such as DNA Kit Studio.

When you click on the “Generate files in several autosomal formats” button in this tab, you get the following new dialog box shown below.

Not all vendors can read all the file formats from all the companies. You have to experiment to see what works. Generally, 23andMe v3 and v5 RAW data files can be read by all other sites that accept externally sourced test result files. We have a table here to show you what works and how well.

Select which options you want to generate. The default when you open is all selected. (Betav3 just has the three recommended as the default.) A separate RAW data file will be generated for each box selected. Once you select at least one option, the “Generate the selected files” button will become active to allow you to initiate the file generation. Click “Select everything” to select all boxes. Note that the time difference to generate one versus all is minimal. It takes an hour or so to extract the variants from which each file can be quickly generated.

Otherwise, if not wanting to generate anything, simply close the window (hit the ‘x’ in the upper right) to then return to the main program screen. Note: this is the only screen that does not have a separate Close button. Fixed in the Betav3 release so a Close button is on this window also.

As to which option(s) you select, that is a longer discussion and given in more detail in a subsequent chapter (so as not to clutter up the user tutorial here). Suffice it to say, “CombinedKit” can be used only at GEDMatch and Geneanet. 23andMe v3 and v5 are the best, most accepted files with all the other sites that allow import. And have the best overlap with most of the other test site versions.

Below is a quick chart summary of the accepted formats by each company..

Test File→

Everything

23andMe

Ancestry

Family
TreeDNA

Living
DNA

MyHeritage

Nat Geo Geno

Import
Site↓

FTDNA

🆇

✅

🆇

Living

🆇

✅

🆇

MyHer

🆇

✅

🆇

GEDM

✅

Genea

✅: Recommended, Supported, ☑ Supported, 🆇: NOT Supported for import

Oddly, sites will NOT accept their own lab format (which would be optimal to match against people who tested at that site). Guess they are trying to keep users from cluttering their database with a copy of results already generated there.

Note: All sites accept X chromosome info as part of the Autosomal import. None listed here accept Y nor mitochondrial info; if part of the file format. National Geographic Genographic NextGen (2.0) Plus output is not yet available. It includes a full Autosomal panel as well as 13K+ Y SNPs. FTDNA accepts the Y SNPs from National Geographic only if provided as an internal, company-to-company transfer and not via an autosomal import.

Note: WGS Extract only maps SNPs and not InDel’s (Insertions / Deletions) from the BAM into the 23andMe file. 23andMe includes a few thousand InDels in their format that can be useful in medical diagnosis. WGS Extract has been designed for use by genetic genealogists where the InDel’s are ignored; and not as a feed mechanism into medical diagnosis sites like Promethease.com and similar. GEDMatch and others do not utilize the InDel’s nor 23andMe private variants.

The Y and MT values (variants or in-common allele values) are included in the files if in the original format from the test company. They are part of the WGS BAM file. So you can use these output files just like you would use the originals from the test company. For example, feeding the file to ytree.morleydna.com to get a Haplogroup determination from their 2016 Experimental ISOGG tree. These are the full, microarray file formats as normally provided by the test company that originated them. And in fact, with many no calls now filled in as the WGS test tends to have better coverage and accuracy than the microarray test results being mimicked in the files here.

You will get a pop-up warning if your BAM file is aligned to any reference genome other than the hs37d5 one based on HG19. This is because the hs37d5 model is the best coverage and handling of the highly dense HLA genomic region in Chromosome 6 that is also polymorphic and difficult to align short read scans too. The human leukocyte antigen (HLA) region, encompasses 7.6Mb on chromosome 6p21 and is the most gene dense region within the human genome encoding 252 expressed loci The hs37d5 model was very focused on help in that area and thus yields improved calls of SNPs there. For optimal output, consider converting your file to that reference model before generating a microarray format file. See the GATK Glossary entry on Human Genome Builds for more information.

If supplying a GRCh38 reference model mapped BAM, the tool uses PyLiftover to convert the coordinates of values to GRCh37; which the microarray test file formats are defined from. This operation can take an additional hour. Otherwise, the tool takes about an hour to generate an initial CombinedKit file, and then 5 minutes for each selected format after that. The CombinedKit file is then destroyed if you did not select that for delivery.

In the future, we may recognize if the CombinedKit file already exists and save the hour processing time by starting with that. This will allow for any easy and quick generation of any formats you missed on the first run. For this reason, we always suggest you generate the CombinedKit file. It never delays the processing and retains the file already being generated internally for possible future use.

Out of credit to Marko, this is the largest unique effort of this tool; and benefits the genetic genealogy community the most. It takes WGS test results and transforms them into the standard file formats used in the industry today. Thus providing a bridge for WGS testing into this community until the industry catches up and starts generally accepting and processing BAM files directly. Although based on the initial concept of the publicly available Extract23, this goes so much further. Marko collected all the various file formats (beside 23andMev3 that Extract23 used) and spent much effort to fully characterize and merge them. He then modified Extract23 to generate a Combined kit of all models instead of the 23andMev3 one solely in Extract23. Finally, Marko wrote the code to take the Combined Kit and extract the subset used by each format and deliver it with an appropriate header and other subtle differences they each exhibit. Included in all this is the incorporation of PyLiftover to handle non-hg19 reference model BAMs. The programmatic accuracy of the various company models incorporated in the base reference files and code is a true work of art. Although inspired by Extract23 and its initial work to variant call then match to a template file, this effort is so much more than that original, simple code base.

1.2.2 Mitochondrial DNA Processing

The main function here is to create a mitochondrial FASTA file from the aligned BAM content. This file is simply a single segment of the whole 16K+ base-pairs found in the mtDNA. The FASTA file can be imported to many other mtDNA sites and is the more universal, complete way to specify the mitochondrial sequencing results; although still dependent on an underlying genome reference model. FASTA-style formats are common for Final Assembly reference models and what is being delivered here.

Note: Previous versions could only process the mitochondrial DNA of the BAM that was in the specific rCRS format (typical of HG38 BAMs and some HG19 ones). Unfortunately, Dante Labs uses a bioinformatics pipeline that puts out the mtDNA alignment using the Yoruba model As of this Beta v2 release though, the tool can adapt appropriately to handle either reference model as input in this command. Currently, though, it generates the FASTA in whatever format has been supplied in the BAM.

Sites that accept the mitochondrial FASTA file are:

James Lick’s mtDNA Haplogroup Analysis: https://dna.jameslick.com/mthap/
(looking for no mismatches, most matches, and least extras)
Haplogrep v2 online at https://haplogrep.i-med.ac.at/app/index.html
(use open in upper left)

yFull also accepts mtDNA data for processing and placing in their mtDNA phylogenetic tree. But they prefer a BAM file input. Currently there is no button to generate that although it is being generated in other places in the tool (see Other / Haplogroups / Mitochondrial). We hope to add the BAM button option here in the near future. And learn to recognize if it exists and simply use in other places where needed. Once mtDNA BAM files can be saved, it can be then used as a source to generate a FASTA.

1.2.3 Y-DNA Processing

The main function here is simply to create a subset of your BAM that contains only the specified DNA reads. This is to dramatically reduce the size of the BAM to make it easier to upload to the other site(s) in a browser dialog. Otherwise, you often have to put your very large full BAM on some 3rd party file site that allows a URL to be generated and given to the upload site.

yFull has both a Y and MItochondrial phylogenetic tree of haplogroups. FamilyTreeDNA had traditionally been the site for both Phylogenetic trees as they offered the most extensive test available before WGS became practical.

The yFull.com site is quickly becoming the premier site for Y haplogroups and phylogenetic tree development. yFull not only accept any WGS and BigY BAM file results, but have incorporated many results from researchers testing ancient DNA samples. This has led to the discovery of a large number of new SNPs over the last few years from the combining of more WGS BAMs than from any particular test lab. While FTDNA is still the largest Y phylogenetic tree and deepest in many areas, yFull has many unique branches and structure due to this import-from-anywhere ability.

yFull and Dante have announced an arrangement where yFull can go grab your BAM file directly from Dante. But nothing has appeared yet to implement this. Hence, the need for this tool still. ySeq.net and yFull have such a working arrangement. So any BAM processing (or WGS test result) done there can be directly uploaded to yFull. But using the WGS Extract tool to create a subset file here makes it much easier to upload your needed data to yFull. It can simply be done via the web browser interface.

Nebula Genomics and FamilyTreeDNA have announced an arrangement where Nebula Genomics WGS 30x results (only) can be grabbed directly and imported into FTDNA accounts. This is expected to include SNPs extracted from the BAM for Autosomal (FamilyFinder), mitochondrial, and Y. The import of Y is expected to behave like the National Geographic Genographic import of 13,000+ Y SNPs before. So SNPs will be available to project managers and give you a deep haplogroup. But will not give access to the BigY tree or tools with the full WGS BAM result. Such access to the BigY tree but may be added later for a separate fee.

Note: Dante Labs delivers aGRCh37 model BAM. But yFull works much better with a GRC38 model BAM for Y. It is best to convert your BAM before doing the extraction supported here. That is, supply a GRC38 BAM to this tool instead of the Dante Labs delivered GRC37 one. Currently, this tool cannot do that “liftover” or “religning” task to convert the models. (It typically takes 24+ hours on a higher powered desktop computer or server.) There are several outside services that can provide this: Dante Labs themselves, ySeq.net and others. Or learn how to do it on your desktop in the BioInformatics for Newbies document. This may be added as a feature in the next release. Nebula Genomics delivers BAMs in a GRCh38 model format already. They utilize the “liftover” function to GRCh37 to make the Autosomal extract tools work properly.

The y-DNA Warehouse site is the gateway not only to James Kane’s Haplogroup-R where it spun out of but also to Alex Williamson’s Big Tree at yTree.net of R1b-P312 and below. (note: Alex has announced a possible scaling back of his Big Tree effort as he has not been able to keep up with the explosion of SNPs discovered with BigY-500, BigY-700 and WGS test results.) The y-DNA Warehouse is working on their own full phylogenetic Y tree now as well. A Y BAM extracted from here can be uploaded directly there.

1.3 Other Tab

Currently there are two categories of commands on the Other tab of the main program window. Each is described separately below.

Part of the v2a minor update saw major changes in the Other tab functionality. The “Check BAM file” was expanded to include the RAW total gigabases number. That which is used to check if Dante Labs met their stated deliverable for 30x WGS tests. The Y-DNA Haplogroup tab now works on Win10 systems. And has been enhanced on all systems to go deeper in the ISOGG tree. Often to the leaf nodes. Finally, a new “Oral Microbiome” option is added to simply create a BAM file of all the unmapped (to the human genome reference model) portions of the FASTQ/BAM file read segments. Suitable for submission to sites like cosmosid.com and mg-rast.org.

1.3.1 Check BAM File

This is essentially running the “samtools idxstats” command and creating some summary values from its output. This command is very quick to execute IF the BAM Index file is available. Creating the BAM Index file, if not there, is part of the preprocessing done when you specify the BAM file. The summary table is based on the IDXSTATS v1/v2 spreadsheet of Randy Harr.

Note: v2 of the idxstats spreadsheet which provides RAW values suitable for determining if 90 gigabases is now included with WGS Extract Betav2

1.3.2 Haplogroups

Often you will submit your Y and mtDNA BAM to yFull.com or similar to have your Haplogroup determined and your placement made in their Phylogenetic Tree of Haplogroups. We have incorporated some haplogroup calling software in this release of WGS Extract to enable a quicker, more streamlined call. The two tools are:

Haplogrep from the Medical University of Innsbruck, Austria; and
yLeaf from the Erasmus Medical center Department Genetic Identification, Netherlands

for the mitochondrial and Y DNA Haplogroup determination; respectively.

Note that the software and its database used here may not be as up-to-date as the dedicated services mentioned earlier in the Extract Data section. Haplogroups and the trees can be changing hourly every day. So use this result as a quick and dirty approach to see the result and verify your BAM file is correct. But a haplogroup that is to be confirmed with more in-depth tools like yFull.com or mitomap later on using the Extract Data section output.

Figure 4: Example output from the Mitochondrial DNA Haplogroup Caller

WARNING: The “mitochondrial DNA” Haplogroup tool is based on Haplogrep and requires a Java 8 release to be installed. You will be notified if you have no Java installed. But the system may simply error (in the command script menu) and seem to hang (with the Waiting pop-up never going away) if an older, incompatible release of Java is installed on your system. A fix we hope to get in the next update to check that a compatible release is installed before attempting to execute the command. Java is not part of the Bioinformatic tool suite.

With the version v2a minor release, the yLeaf Haplogroup caller has had some key changes. One, it is now available on the Win10 platform. (When introduced in v2, it was not yet working there.) Two, we have loosened the requirements for calling that are imposed in the original yLeaf code so haplogroups deeper in the ISOGG tree are more regularly called.

Figure 5: Example output from the Y Haplogroup caller.

Note: none of the text giving the Haplogroup, SNPs or URLs is written in a form that is selectable to copy and paste as text (in fact, no window in this program). You have to retype any text in other tools or pages. This will hopefully be fixed in a future release. For now, here are those URL’s from the above window (obtained by clicking on the buttons also):

https://yfull.com/search-snp-in-tree/
https://familytreedna.com/public/y-dna-haplotree/ followed by the letter of your main branch

So the first thing to notice in the Y Haplogroup results is that the ISOGG Y Tree is still using the YCC Long Hand haplogroup format names. Whereas most other sites (FamilyTreeDNA, yFull, etc) are using the YCC Short Hand names based on an SNP in the haplogroup. Next, the list of SNPs in the haplogroup is not complete. Some haplogroups have tens to a hundred or more SNPs. So not all the SNPs are displayed.

We go into much more detail about these result windows in a dedicated chapter on Haplogroups later in the document.

1.3.3 Oral microbiome

Basically, this command will extract the unmapped reads from your BAM file. Those marked as “asterisk” for the chromosome (or model designated area). It is a simple Samtools command to do this: “samtools view <your BAM>.bam -b -O …. But this program takes it a step further and converts the BAM back into the paired-end FASTQ files of these unaligned / unmapped read segments. These files are suitable for upload to sites like CosmosID.com (paid service but free trial) and mg-rast.org. This is new to version 2.

The command takes a few minutes to an hour to run; depending on how much of your original BAM is mapped to the Human Genome. Once done, a pop-up similar to the below appears. The URL given in this pop-up is embedded as a link in the here given.

WGS result files generally will have 90 or more percent of the sequenced DNA mapped to the Human Genome Reference Model. Sometimes as high as 98%. That remaining unmapped DNA is often attributed to bacteria. Likely in your mouth when your spit sample was taken. Not any gut bacteria or similar within the bloodstream. Hence the title “oral microbiome”.

The goal here is to extract a BAM of just these unmapped reads, which is often significantly smaller, to then pass to other tools to align and map to other DNA reference models. The size of the resulting files depends on the mapping percentage of your original BAM file. The higher your sample mapped to the human genome, the lower the amount of unmapped read segments contained in this file.

If you have a low mapping rate; especially one at 60% or lower, then the resultant FASTQ files may still be very large. Take this into account when manipulating and submitting the result to another site. (If you have a low mapping rate, the output of many functions of this tool are in question. You should seek to be re-sampled and re-sequenced.)

Note that the FASTQ file is unmapped and no longer dependent on a reference model. But, as these are the unmapped segments in the BAM, there really is no difference. So a simpler and much quicker option is to generate an unmapped-only BAM; cosmosID accepts it just as easily and it is the same size as the FASTQ files: This will be added as a preferred option in a later release.

When the author first learned of these tools he had a good laugh. His focus is on genetic genealogy which commonly promotes the ad-mixture (ethnicity) aspect of DNA testing. The tools targeted to use these unmapped BAM files are basically providing the ad-mixture (ethnicity) for your symbiotic bacteria in your mouth. Joking aside, the need for this analysis is a serious search for a possible source of illness with some taking a WGS test.

1.4 General pop-up’s

We cover here some of the basic pop-ups that will occur during processing with the tool. We already introduced you to the language pop-up setting that initiates the program. Lets cover a few more including the underlying, always there command line window.

The pop-ups, in general, will replace the main dialog window when they are there. (As do various result windows that may come but are explained under each command.)

1.4.1 Please Wait

There are many commands that may take an hour or more to run. When such a command is started, the main program dialog window as shown before is hidden. In its place is another pop-up dialog as shown here:

The pop-up will disappear and the main program window will reappear once the command is finished. A command may take 24 hours to complete. You are executing bioinformatic tools on 50 GB files so expect a wait for some operations. Consult the oft-hidden command script window for an idea of what may be running in the background.

1.4.2 Task Running / Finished

Other similar pop-up dialog boxes are:

The above is presented when doing an Autosomal RAW file extract from your BAM Data. Once finished, it presents the below:

1.4.3 Command Script Window

It is easy to lose the “focus” of the program. The persistent, initial command script window can get buried behind other windows. Which means any subsequent dialog windows may be hidden as well. So learn how to find the original command script window and bring it to the forefront and thus bring any other WGSExtract pop-up dialogs to the front also. The below is an example of the persistent, initial command script on a Windows10 machine:

This window is also useful to get the detailed status of what is currently executing. It will contain a log of the command actions over time and as they happen. Think of it as a current status and dialog history.

Occasionally the script running in the background will cause an uncaught error. That should be displayed in this command script window, if so. In which case you may need to kill this window (hit the ‘X’ or close) to bring the program to a close and restart. Try to copy and save any error reports in the command script window and email them to the developer: Often, uncaught errors are indicated by the word “Traceback”. Here is an example of an error in a command line window (that will thus cause the program to never finish):

1.4.4 Error and Warning Dialogs

Occasionally, an error or warning must be issued. Often via a pop-up dialog box. For example, if you have a directory path or filename that includes special characters or spaces, you may see the following error pop-up:

Or, for a warning, a pop-up dialog may appear like the below:

2 Autosomal RAW / VCF generation detail

(note: it is known there is repeat / overlap in this section. Working to prune it and rewrite.)

BAMs from WGS tests can be used to generate the equivalent of super-kits on steroids. Basically, 2x or more the number of SNPs that the “best possible” super-kit made just from microarray test results can do. Super-kits are made by a merger of previous microarray test results. The Super-kits on steroids is mentioned because it is unlikely anyone was able to test with all the various versions over the extended time period to create a complete superkit. And the tool here can generate any of the versions and even more accurately than the original microarray DNA test process could. (NGS Sequencing tends to have higher, more accurate call rates than the microarray tests.)

Unfortunately, most of today's match tools cannot handle super-kit files like this. Tools are setup to see only the smaller subsets of tested SNPs (or variants) from a single test or maybe two kits at best. So you may have to wait for the tools to catch up before using the full power of your available WGS tested SNPs. To date, only GEDMatch and Geneanet can receive an Everything file (super-kit) generated here.

The overlap of SNPs tested in the various kits can be quite small as it is. So decide what best file to generate and feed into each test site. Note that, oddly, most sites will not accept an import of a file in the same format as they generate. So even though you can generate the file as if you tested on their site with their microarray, they generally do not allow you to import that optimal file. Optimal because you would have the same SNPs in the file that their native test result does. We have found 23andMe v3 and v5 are the most widely accepted and best result kits to generate and load into tools that accept outside test results. v3 for just about everywhere with v5 especially good for LivingDNA as that is the same chip as their initial version 1 test kit.

Note: you might even get away with a merged v3 and v5 file. But this tool cannot generate that. Use DNA Kit Studio to create that merged file for upload from the separate v3 and v5 files generated here.

At the current time, only GEDMatch and Geneanet can accept the “Everything combined in one file” option. And even then, throws much of it out on import. Hence, for compatibility in the current market, you are given the option to generate just a single version of a single company test. Maybe, eventually, someone will build a new autosomal match system that can utilize WGS results and super-kits generated from merged microarray test results.

The autosomal extract tool here was originally based on Thomas Krahn’s Extract23. That tool as released on GitHub can only generate a 23andMe v3 file from an HG19 BAM. So the tool here has been greatly expanded to support all the variations from all the vendors over time.

This tool may appear similar to what Wilhelm H-O’s DNA Kit Studio is doing when processing and accepting RAW and VCF files and putting out similar microarray test files. But with this tool, you are starting with your BAM file which has everything. Not just the “Everything” of the merger of other microarray tests, but everything in your DNA itself. Also, the VCF worked with there is only the identified variants. You cannot simply impute or assume the non-listed values are ancestral or what the model indicates. Hence this tool more accurately generates these complete files as if you natively tested with each company.

If you are trying to compare a WGS test result with a microarray test result (from the same person in each result), then choose the same version and company as that test result. Either load the WGS generated file back into the original test company database (if possible) or load the original and your WGS generated file into GEDMatch or Geneanet for comparison there. Either way, you are looking to see that you appear as your identical twin. Possibly even just to validate your WGS test result is really yours. You can also do this to compare to a close family member (sibling, parent, 1st cousin, etc;) where the match strength is known with a tight tolerance.

Comparing your “Everything” from your WGS with an original test from one of the microarray companies may not show as close an identical match as you like (still 99.9+% but not 100%). This has to do with the match algorithms at these sites,your extra SNPs available in the “Everything” file that would not be comparing an identical file to itself, and even due to a slight variance in a few SNPs as read by the two different test techniques. In general, for those variances, the NGS test will likely be the more accurate. The match tools are not so strict about exact matches knowing this small inaccuracy in testing. They look to see a number of SNPs different before calling the match terminated.

The WGS test returns millions of SNPs for the Autosomes and X. It does not help to load SNPs that cannot possibly be matched.^[1] And only serves to confuse the matching algorithm when it sees little overlap in your results and the other test you are comparing too. So all the options are to generate a subset of your WGS test result that most closely resembles only one of the test company results at a time. You can use DNA Kit Studio to merge multiple test results into super-kits for sites that accept them.

Even the “Everything” option is a subset of what actually exists in the WGS BAM file. It is “everything" that might exist in each of the test company file versions if all were generated and merged. More like a merger of all the other formats than “everything” in your WGS result. Your WGS result still has more. Until a WGS-result Match Database comes about, this is the best you can hope to use on any existing microarray result match and analysis site. Eventually, we will see Autosomal match sites be able to read the BAM files directly; like being done for Y now.

The WGS test returns values for ALL your SNP’s; whether derived (positive for change) or ancestral (negative for change from the reference model). As do the test company microarray RAW result files. It is important to include all these values when submitting to a match database.^[2] So the processing here is making sure to deliver them all.

VCF files contain only your "derived" values. This was an issue with early use of only traditional VCF files to generate RAW microarray format files.^[3] To properly use the VCF files, one has to create a starting result file of ancestral values for all expected values in a particular company file. Then fill in what is in the VCF for found variants to override the expected values. Or possibly changing the SNP to a no-call if that is what is found in the VCF as well.

The merger of all possible SNPs from any of the test companies and versions of tests they have creates a file that is more than double the SNPs of any one particular test kit. This helps assure the best match against any other kit from any of the test companies or versions that may happen to be in that database. But, to date, none of the test companies allow a transfer in of such a large file. In fact, this is why a file of simply all your detected SNPs is not made from the BAM. The matching algorithms in the current microarray-based match sites simply cannot deal with the supersized files and references to SNPs not found in other test array results.

23andMe and Ancestry do not allow any outside result files to be imported. But they comprise the largest base of testers and the majority of entries in the other companies test databases (due to transfers). So best to include those SNPs (contained in their file formats) in any file you plan to import elsewhere. Hence why the “Everything” option exists.

2.1 Tool Testing

Marko compared personal microarray tests with 23andMeV4, 23andMeV5, AncestryV2, and FTDNAv2 with the output of the modified Extract23. Comparing identical SNPs and the few differences. Compared the files as well as after uploading and comparing on GEDMatch. Details forthcoming.

MyHeritageV1 is testing nearly the same SNPs as FTDNAv2. For similar reasons, MyHeritageV2, FTDNAv3, and LDNAv1 are very similar in SNPs to 23andMeV5. In this latter group, they are all based on the Illumina Global Screen Array. So due to the similar chips and equipment used in the microarray tests of those versions from those companies. LivingDNA v2 uses an Affymetrix chip customized to provide a mid-range SNP overlap with all other chips. The FTDNAv1 also used an Affymetrix chip but much different than what LivingDNA now uses. See the ISOGG Autosomal Comparison chart and our earlier Test Chart Comparison for a few more numbers. The latter was inspired early on by Felix Immanuels Venn Diagram.

3 Haplogroup Analysis Detail

We use this section to provide more background on the Haplogroup caller features to help initiate those to Phylogenetic Trees of Haplogroups (erroneously shortened to Haplotrees by some) a bit better. There is a glossary available for even further background. The section is introduced in the section covering Haplogroups in the Other tab above.

3.1 Haplogroup caller based on yLeaf v2.1

yLeaf v2.1 is the primary source of the Y haplogroup caller in WGS Extract. yLeaf uses a stored copy of the 2019 ISOGG tree SNP list to match variants output from the tool here. The calling is thus subject to the frequency of updating that stored list. As well as the depth of that particular tree at any point in time.

ISOGG still names their haplogroups using the YCC Long Format. Whereas most other trees are using the Short form that simply names a haplogroup by one of the defining SNPs within it. This long (or lineage, path) format can be difficult to read. And more importantly, difficult to track when tree restructuring occurs.

The Haplogroup caller puts out the lowest (in the tree, most recent in time, farthest from the root) mapped haplogroup it can find in the reference tree (ISOGG 2019). It then lists the SNPs found in the BAM file from only that haplogroup. And only a few SNPs at that; there can be tens to a hundred or more in some haplogroups.

You are then referred to other trees where you can look up those SNPs and find their equivalent branch. The biggest problem is that the trees do not necessarily name the haplogroups nor SNPs with the same name. Sometimes it is due to parallel, simultaneous development. Often due to a lack of cooperation. Here are some pitfalls you may find explained by looking at one example haplogroup.

We start with looking at a Haplogroup shown in figure 3.1 below from a few years ago. It happens to be an older, well known haplogroup deep down the R1b-P312 path. It is most commonly called L20 or sometimes R-L20.

Figure 3.1: R-L20 on the haplogroup-r.org PhyloGenetic Tree

R-L20, as named here, contains three SNPs that define it. Each SNP has 2 or more aliases. That is, different names for the same SNP. Generally, if you are in this haplogroup, or below, you are positive for change in all three of these SNPs. Most SNPs start with a letter designator of who named the SNP, followed by a sequence number the namer assigned to make it unique. Key behind each name is the chromosome and position in the chromosome that defines it. Sometimes an rsID number more formally and uniquely naming it as well.

Here is that same haplogroup in yFull’s tree:

Luckily, yFull happens to name it the same, using the R-L20 designation. In their nomenclature, they use asterisks (*) to separate different SNPs that makeup the haplogroup. They also have put the SNPs in a different order and even have an additional alias for PF129.

Notice the tabs across the top. These are representing the haplogroup names from the root (Home) down to this branch R-L20. They somewhat correspond to the similar R1b1a1b … designation given by ISOGG. In fact, the ISOGG Tree 2020 designation for this haplogroup is

R1b1a1b1a1a2b1a1.

Let’s go back to the haplogroup given in our example output. And crop to just that haplogroup name and show an over-under of the previously inline information so you can see it more clearly.

In yFull, this haplogroup is found a few branches above R-L20 already shown. In fact, here is R-Z258 two steps earlier in the tree.

But wait, the designated SNPs are listed as S372 and Z384 in the tool. yFull says this haplogroup has one SNP named Z258/S372. Ahh, S372, we have seen that before. So Z258 in yFull is the same haplogroup as R1b1a1b1a1a2b1a~ in ISOGG.

And in fact, here is the ISOGG tree as designated from the program. It itself shows the alias Z258 for SNP S372. So we get closure.

So we now see that Z558 is an alias (or equivalent name) for S372 in both trees. Just each tree chose a different alias of the same SNP to list as the primary name.

And in fact, if we start to follow the tab path in yFull and compare it to the ISOGG path, we get:

R 1 b 1 a 1 b 1 a 1 a 2 b 1 a ~

M207 M173 M343 L754 L389 P297 M269 L23 L51 P310 L151 P312 U152 L2 - -

(note that yFull has inserted an extra branch named Y482 just below R that ISOGG has not given. ISOGG lists L389 as L388. ISOGG has a branch S255 (shown as final ‘a’ above) which yFull names Z367. But between L2 and L20, yFull has a branch Z258 and then S255 / Z367.)

In fact, this Y Haplogroup call in yLeaf of S372 is actually for a tester that is well below L20 in most of the trees. The deepest ISOGG goes is L20 for this tester though. yFull is even deeper than FamilyTreeDNA for the terminal branch due to some non-FTDNA test samples included. You can get more information about what is actually in the BAM for this tester on their website.

This particular one was a bit easy. SNPs with aliases were listed in both sites. The paths were near the same and many of the same SNPs were used to name the haplogroups. There are many other areas not so easy to trace and determine. Especially when you get to SNPs in FTDNA starting with FT (their own recently discovered), ones in yFull named Y (their own recently discovered) and ISOGG not having either.

ybrowse.org is an industry-normal place that researchers register their SNPs before the more formal process of creating an rsID with the USA National Institute of Health. Key is you have to determine the location of the SNP in the chromosome to then determine if SNP names are equivalent. But the newer SNPs by the tree companies are not necessarily given with a location yet. yBrowse is run by Thomas Krahn; now of yseq.net,.

So we see SNPs can have aliases (or equivalent names). And that different trees will have different haplogroups / branch points in different places based on data they have. But there are even cases where the trees have unique SNP names not found in the other. Either because it has not been determined an equivalent name exists (so they are cross listed on both trees) or they are too new and not divulged yet. Add to this that the different trees have access to different samples and data; thus allowing them to infer different branching. And they are changing daily. Well, you can begin to see why it is a wild and wacky world in the phylogenetic tree community. It is not that anyone is necessarily more wrong or right. Just different due to their myopic views and procedures to create and maintain their tree.

For other Y Phylogenetic Trees of Haplogroups (and their SNP lists to search), see:

ISOGG Tree

(https://isogg.org/tree/)

Search ISOGG for SNP

(https://isogg.org/tree/ISOGG_YDNA_SNP_Index.html)

yFull

(https://yFull.com/tree)

Search yFull for SNP

(https://yfull.com/search-snp-in-tree/)

FamilyTreeDNA

(https://familytreedna.com/public/y-dna-haplotree/)
BigY Blocktree (requires their test to see their BigY Block tree)

Search FTDNA for SNP

(https://familytreedna.com/public/y-dna-haplotree/)

Note: Change from "View by Country" to "View by Variants" once on the page and then do a browser search for the SNP name

yTree BigTree
(https://ytree.net/)
(only R1b-P312 and below)

Search yTree for SNP

(https://www.ytree.net/SNPIndex.php)

Haplogroup-R

(http://haplogroup-r.org/tree/R.html)
(currently, only R and below but expected to expand to a full tree)

Search Rtree for SNP

(https://haplogroup-r.org/variants.html)

Historically, ISOGG’s tree has been an academic, research one where only branches developed and cited in scholarly articles were included in the tree. As opposed to yFull and FamilyTreeDNA which have programmatic tree development based on more automated analysis of large numbers of submitted BAM samples. This latter technique may be more subject to eventual human verification / intervention but as a process can add thousands of branches a month to the existing trees. These trees are constantly tuning their algorithms and manual guidance to improve the output. The ISOGG tree tends to be far less developed; especially in some fast growth areas of development like U152 or M222, than the other programmatic trees like FTDNA and yFull.

From a compiled set of comments by Thomas Krahn in Dante Labs and Nebula Genomics Customers Facebook group:

HG19 doesn't belong to a single person. For the most part the Y chromosome sequence comes from an R1b-U152 > L2 > L20 > CTS9733 > CTS7275 person, however there are still some parts of a hg G person. Those hg G parts were eliminated in hg38. … Most SNPs were properly oriented with their ancestral and derived states after analyzing the most distant A00 haplogroup NGS sequences. Most of this work was done by Greg Magoon from Full Genomes, the YFull team and myself. This is a good example of how companies can work together.

So for Y Chromosome, they were able to do what RSRS was trying to do for MItochondrial analysis. Make the mitochondrial Eve be the root (ancestral) for all SNPs and thus the sequence itself.

For more information on Y Haplogroups, see

Wikipedia (https://en.wikipedia.org/wiki/Human_Y-chromosome_DNA_haplogroup)

Eupedia (https://www.eupedia.com/genetics/phylogenetic_trees_Y-DNA_haplogroups.shtml)

H600 Project (https://h600.org/wiki/Haplogroup)

ISOGG (https://isogg.org/wiki/Portal:Y-chromosome_DNA) *
* oddly ISOGG does not have a single definition of a haplogroup nor a list of available phylogenetic trees

3.2 Mitochondrial Haplogroup caller based on Haplogrep v2

TBD section

The mitochondrial reference model has developed somewhat independent of the main Human Genome Reference model. (As has one for Y but its changes were folded into the standard Human Genome model.) For an introduction to the various models, see a Haplogrep help page titled rCRS vs RSRS vs HG19 (Yoruba). Mitomap is another good resource with their Yoruba to rCRS table informative. Although a bit dated, there is a nice detailed description of comparison on the SNPedia MitoDNA Conversion Chart (especially criticl to showing the change in 23andMe generated files).

Here is a rough summary table of the models:

Model	Year Introduced	Length (in base-pairs)	Reference Sequence	Models Used	Notes
rCRS / CRS	1999 / 1981	16,569	H2a2a1	GRCh38, HG38, (hs37d5)	3107N in rCRS
Yoruba		16,571	L3e2b1a1 (African Yoruba)	GRCh36, HG18, HG19	In Yoruba, 310, 317 and 16961 are insertions compared to rCRS. 3107 from rCRS (a deletion and N value) does not exist.
RSRS	2012	16,569	mito Eve (root)	none	523N, 524N and 3107N

3.2.1 Yoruba Reference Model warning

So you got the dreaded pop-up on the Yoruba reference model:

This comes from either the Haplogrep v2 button for mitochondrial haplogroup calling or the mtDNA FASTA file generator for transferring data to other tools.

Most of the tree sites work off of the rCRS model of the mitochondria. But some of the models delivered by WGS testing come with the Yoruba model mapping. WGS Extract issues a warning but may allow you to proceed (FASTA generation) or simply returns without further work done (Haplgrep v2 Haplogroup caller). What really needs to happen is for Yoruba reference model BAMs need to be converted to rCRS ones before processing. The tool already does other “simple” things (like creating a missing BAM Index file), so this should be added. But it is not yet available.

Easiest, if wanting to use WGS Extract's implementation of Haplogrep, is to convert the BAM from one model to the other yourself. First extract the mtDNA only to a new BAM, then create the FASTQ files from that (not an extracted FASTA mtDNA BAM), then map/align to a new model (say GRCh38). You will now have a rCRS mapped mtDNA-only BAM that you can use in WGS Extract’s call of Haplogrep. The Bioinformatics for Newbies document gives the various commands for these operations; if not then the Facebook group Files section. It is a bit of a pain today.

There is a hand mapping or liftover process described at https://www.mitomap.org/foswiki/bin/view/MITOMAP/YorubanConversion. The models are really that similar. There is even a tool to convert SNP names, which have several variations. See https://mseqdr.org/mvtool.php. Yoruba was in Build36/HG18 and not just still included by UCSC in their HG19 release. GRC37, which was based on HG19, did do the mitochondrial model update though. Hence leading to the confusion. Technically, there are even more variations: rCRS (Andrews et al., 1999), RSRS (Behar et al., 2012), chrM in hg19, chrMT in GRCh37, and chrMT in GRCh38. See https://haplogrep.i-med.ac.at/2014/09/08/rcrs-vs-rsrs-vs-hg19/ from the creators of Haplogrep. (in that paper, they claim v2 will accept Yoruba. But maybe it never got in? Haplogrep is a JAR file and we have not yet un-compiled it back to source code to look at it.)

4 Installing and Starting WGS Extract

The latest WGS Extract program and this manual are available from:

https://wgsextract.github.io/

(Beta v2b 18 Feb 2020)

WGS Extract is a Windows-based, GUI Python script written to execute the bioinformatic tools (like samtools). It has been ported and tested on Win10 (latest edition), MacOSX 10.14 Mohave and 10.15 Catalina, and Ubuntu Linux 18.04 and 20.04. It may run on other releases and other Linux systems. Although the bioinformatic tools are command-line oriented, WGS Extract requires Tcl/TK to operate. Only Desktop OS’s have this library needed to create windows, buttons, dialog pop-ups and the like. As such, WGS Extract cannot run on a headless, non-GUI, Server edition of any operating system. Including anything installed using Windows Subsystem for Linux (WSL). It has no problems executing in a Virtual Machine like VMWare Player, VirtualBox or xxxx. A future release that can be scripted and thus operate “headless” may become available. But not today.

The downloaded .zip archive file is approximately 4.7GB in size and will expand slightly once uncompressed. The majority of this space is the needed Human Genome Reference model final assembly files. Four such files at approximately 1 GB each in size. Although seemingly a large amount of space, your WGS files are likely near 100GB on their own. And you will likely need 2x that space for intermediate and final files from processing. So this program is small compared to the files it has to work on. Cloud and some network disks will not have the needed performance to execute the bioinformatic tools on your WGS files.

Separate instructions are given for the Microsoft Win10, Apple MacOS and Linux install below. This program generally requires a desktop or laptop computer to run. Tablets and smart phones do not have enough storage nor memory capacity for the bioinformatic tool runs. A multi-core processor is best. Some commands can take a few hours and upto a day or more to run. The next largest group of files (300 MB) are the HTSLib bioinformatic tools and Python release only needed for Win10 systems.

Warning: Some of the embedded tools from other developers require Java. Broad Institute’s GATK, Picard and even Haplogrep (mitochondrial haplogroup analysis). Java is more often considered a basic of any OS environment and so we do not cover it in the installation process. Nor is it checked for except before needed inside the code. Safest bet is to make sure you have a Java v8 or later release installed and working. If you type java -version at a command shell, it should return with the version of the java installed. (note: if a simple Java Runtime Environment is installed, the version is prepended by 1. so version 8 is 1.8.) If you do not have java installed, go to https://www.java.com/en/download/ to find your installation.

In all three cases, the ‘bat/.sh file you are asked to run is an initial setup and install. The WGS Extract program relies on third-party package installers and software that you likely do not have. Beside the typical BioInformatic tools like Samtools, WGS Extract needs a Python3 language interpreter. In most cases, this INSTALL requires elevated privileges (sudo). The INSTALL can be run again to verify the environment is up to date (as an UPDATE script). There is no UNINSTALL script at this time. Nor UPDATE for the WGSE program itself.

No matter the platform, the actual start command can simply be (from the root directory of the installation):

python3 ./programs/wgsextract/wgsextract.py

In the future, the START scripts will be renamed to INSTALL scripts that you run only to install. Then a simpler, quicker startup for just the main Python script will be made available. If there is more than one Python3 on your machine, you may have to select the specific one we install.

4.1 Windows 10

Download the WGS Extract program itself from the link given above. Extract and uncompress all the files into a folder where you want the executables to reside. This can be done by right-clicking on the downloaded file and selecting “Extract All …”. Usually best not to extract to your “Program Files” or a similar folder. The tool invocations cannot handle embedded spaces that are typical in Windows folder and path names. Either drop the extracted folder in “C:\users\<your user name>” or maybe “C:\users\public” if you wish the program to be available to everyone.^[4] You can also place the program in a separate disk where your WGS files may be located. The tool uses the “output directory” specified during each run for final output files (that often are in the tens of gigabytes in size). WGS Extract does use a temp folder in the install directory for intermediate files. You will need ~5 GB to install the program (Beta v2b).

To start the program, navigate into the WGSExtract folder and double click the Windows_START.bat file located there. If you prefer, make a shortcut of that file by right clicking it and selecting “create shortcut”. Then move the shortcut to your desktop for easier access and invocation; renaming it as desired. It can also be added to your Start Menu this way or pinned to the taskbar.

As the Windows installation includes all files it depends on, there is no complicated package manager and install process.

Optional: See the Bioinformatic tools on Windows 10: Quick and Dirty file for more information on setting up path variables to directly access the bioinformatic tools that come with this installation. When adding the commands given there, you can access the tools independent of using the WGS Extract program itself. Note that there are many Perl scripts from the Bioinformatic tools in the Win10 release directory. You will need to add a Perl interpreter to be able to use those scripts, if desired. Something you can find in the Windows Store.

WARNING: You will need a 64 bit machine and OS to run this tool.The Windows release includes executables of a Python3 interpreter and CygWin and MinGW ports of bioinformatics tools. While the CygWin tools are 32bit, the Python3 interpreter and MinGW tools are 64 bit. We had to include our own Python3 interpreter as we found some incompatibilities between different releases and the libraries we rely on. So it was easier to simply bundle the interpreter and libraries with this program. The interpreter and the libraries are hardwired into the program.

WGS Extract has ported the needed Bioinformatic tools to the Windows environment. Using both CygWin and MinGW. In reality, you can run in a UNIX/Linux environment shell, just as you do in MacOS and Linux. Instead of using Windows Powershell, start a BASH shell and execute the WGS Extract program from there. In either case it is using the Microsoft Windows window manager. We mention this in case you want to run some BASH shell scripts with the tool set outside the normal use of WGS Extract.

4.1.1 Uninstalling WGS Extract

Sorry to see you go. Here is how you undo everything you did to install and use WGS Extract.

remove the WGS Extract folder
if desired, edit the PATH variables (if you opted for the Quick and Dirty document)

4.2 Apple MacOS X

4.2.1 Download and Unzip

Grab the actual WGS Extract program via a browser from the link given above. Or use a Terminal command line to grab it as shown here:

curl -O http://bit.ly/3afRl6O

WARNING: do not use the “Documents” or “Downloads” directory to install the WGS Extract program. It causes problems with the permission system in Apple’s MacOS. Instead, it is best to download to the folder (and disk) where your WGS files are located. Simply plan to install and unpack there. Or install in your home directory (~). If you want the program available to other users on the machine, use sudo and unpack it in /opt.

By default, MacOS automatically extracts the files of .zip archives when downloaded in a browser. If, for some reason, it did not do that, extract the files by double clicking on the .zip file. Or use the unzip command in the Terminal.

Your ~/Download folder should now have a folder named WGSExtractBeta/. Inside that should be a WGSExtract/ folder which then contains all the files. If installing in your home directory simply drag that WGSExtract folder to your home area. Or, by a Terminal command line, execute the command
mv ~/Download/WGSExtractBeta/WGSExtract ~
You can delete the empty WGSExtractBeta/ folder in your ~/Download folder.

WARNING: we have released a patch for this version on MacOSX to replace the existing MacOS_Start.sh. Please follow the path release link to download and overlay this patch on your release BEFORE proceeding below. We have modified the instructions assuming this patch is in place and installed as described on the patch release page linked above.

With the introduction of Catalina and general security updates, Apple no longer allows their own Applescript applets to be run from outside their App Store.

With this release, we now provide an Apple Applet to install the program and another to start it. Just double click (or right click and select “Open”) the Install_MacOSX.app applet to start the installer. Or the Start_MacOSX.app to actually start the program once installed.

If you wish to start the program from the GUI somewhere else, then simply right click on the Applet in the WGS Extract folder and select to “Make Alias”. Move the alias wherever you wish to use it (your desktop, the application folder, etc). You can rename the alias as well; say the Start_MacOSX alias to WGSExtract. just click on the alias to start the applet as before.

4.2.2 Apple Applets

The first time you run an Applet, you will see the following pop-up:

Just click OK and it will not appear again. Our Applet needs your permission to run a terminal window to capture the output of the commands and show to you as a log.

If not careful, sometimes the WGS Extract program Applets lose their “execute” permission when being unzipped. If you find you do not have permission to run them, you need to add execute permission by the following Terminal commands:

chmod ugo+x Install_MacOSX.app/Contents/MacOS/applet

chmod ugo+x Start_MacOSX.app/Contents/MacOS/applet

chmod ugo+x Uninstall_MacOSX.app/Contents/MacOS/applet

Only the Applets need these permissions. Not any other scripts or files.

The rest of the instructions provide a few additional details you may need to know about.

4.2.3 Install WGSExtract

Simply click on the Install_MacOSX applet. Rename the Install_acOSX.sh files to end in .command instead. First time executing them, use Control-Click instead of click or double-click on the file in Finder. Then select Open from the pop-up menu. On the security pop-up, select Open again. Thereafter, you can simply click on the file ending in .command to execute it.

To Start The (Installer) Program via the command line:
In the Terminal command line window, type the following commands:

% cd ~/WGSExtract

% sudo -H /bin/sh Install_MacOSX.sh

The install (and uninstall) script needs elevated permissions as it is doing application installs. In the command window you will be prompted to enter the password of your account (assuming you have sudo super-user elevated privileges). If not, run the script from an account that does.

When you run the install command it will check for and likely install many additional programs. This is now fully automated but may still ask your approval to do a few things. Key to install are the Python interpreter and MacPorts package manager; along with a current Java interpreter The first two then install many dependent packages of other programs. Key being MacPorts bringing in the Bioinformatic Tools.

The Installer script can be run again later and will act as an update script to make sure your release and its dependencies are up to date. This will not happen automatically. It does not update the WGS Extract program itself; only the installed dependencies.

The install script, once finished verifying and installing all the needed applications, now simply finishes. It no longer starts the program directly as Start_MacOS.sh previously did.

4.2.4 Starting WGSExtract

Simply click on the Start_MacOSX.app applet. Rename the Start_MacOSX.sh file to end in .command instead. First time executing it, use Control-Click instead of click or double-click on the file in Finder. Then select Open from the pop-up menu. On the security pop-up, select Open again. Thereafter, you can simply click on the file to execute it.

To Start The Program in the command shell:
In the terminal command line window, type the following command:

/usr/local/bin/python3 ~/WGSExtract/programs/wgsextract/wgsextract.py

This assumes you installed WGS Extract in your home folder. Modify as appropriate if not.

4.2.5 Uninstalling WGSExtract

Simply click on the Uninstall_MacOSX.app applet. Otherwise, in the Terminal command window, type:

% cd ~/WGSExtract

% sudo -H /bin/sh Uninstall_MacOSX.sh

4.2.6 Terminal Preferences

For many reasons, we chose to have a terminal log of commands be behind the application and installer. But, much to our dismay, MacOSX does not allow an application to control some of the settings for Terminal windows. As a result, a Terminal window log will be left on your screen after every run of a script on MacOSX (as shown below).

If you wish to avoid this, you need to go into your Terminal Preferences to adjust the settings for all uses of the Terminal program. Start up the Terminal app (from the Application folder). Click on the upper bar Terminal and then Preferences to bring its popup dialog.

In that dialog, select Profiles and then Shell as shown below:

There you will see two settings. One “When the shell starts” and the other “Ask before closing”. The defaults are “Don’t close the window” and “Only if there are processes …”; respectively. You want to change these two settings to “Close if the shell …” and “Never”; respectively.

Exit out of this Preferences pop-up and you should not be bothered by lingering Terminal windows anymore (unless the program crashes; in which case you can peruse the Terminal log to find a reason for the error.

4.2.7 OLD Instructions

If you already used the patch instructions above, you can ignore the rest of this section. You are done. This material was from pre-patch and left here for posterity.

4.2.8 Install WGSExtract itself (pre-patch)

To Start The (Installer) Program: In the Terminal command line window, type the following commands:

% cd <folder path to>/WGSExtract

% sudo -h ./MacOS_START.sh

Note: When you run the above start command for the very first time, it will simply report that you need to install an additional program. You then need to follow the instructions to install that utility and run the above program again. Only to be told it is still missing something else. Keep re-running the command until it gets to the end successfully. This was not fully automated in this first Mac beta release. The next release is now fully automatic. Left here until that release is out and the install is fairly simple.

The embedded "sudo" command it needs in many places within the script may require the administrator password. By starting thel script with the sudo -H command, the embedded sudo commands will simply work and you only have to enter the password once. You can run the script without the Sudo command. But then a password will be required for the embedded Sudo commands.

The above script, once finished verifying and installing all the needed applications, will start the python interpreter on the main WGS Extract Python script. Once you see that pop-up asking for the language, you know the main install script is done. You may want to exit at this point and restart without the sudo command.

Warning: This initial Mac release has some errors in the script. It puts in a requirement to find /usr/opt/local/bin/dot which will never be found. So it loops forever. It also installs graphviz, which has many dependencies, including one on the massive Xcode developer environment. Neither of these is needed. You can either edit these errors out of the script, or simply follow the rest of this section to do everything by hand. We have already fixed this in the new release and made it 100% automatic (no longer asks you to install additional software).

4.2.9 Manual WGSExtract Install (pre-patch)

Here are the instructions to manually install in case you do not want to edit the defective install script.

4.2.9.1 Via GUI

You need to download and install a few, externally-provided software packages. If you do not install these, the first run of WGS Extract will detect them missing and simply report they need to be installed before proceeding to start. The two packages you need to install are the Python3 interpreter and the MacPorts package manager:

https://www.python.org/ftp/python/3.8.0/python-3.8.0-macosx10.9.pkg

https://distfiles.macports.org/MacPorts/MacPorts-2.6.2-10.15-Catalina.pkg

Visit the links in your browser and let them download into the Downloads directory. Visit the Downloads directory and click on each .pkg file to install each of them. You can delete the downloaded package files once installed.

Note: the above link for MacPorts is for the 10.15-Catalina version of the MacOS. If your MacOS X version is older, you may need to replace the file name with your OS, like 10.14-Mojave or similar. See the list of older releases at: https://distfiles.macports.org/MacPorts/

The link for Python is to the first in the 3.8 series. We do not specifically need 3.8; 3.7 is OK also. There may be later minor release versions available.

When the installers ask questions, just confirm with the default settings. The first installation is for the Python language interpreter needed to run the main Python script of the WGS Extract program. The second for a package manager with a large library of Linux programs that have been ported to the MacOS. MacPorts is similar to Debian Linux apt-get, Microsoft Windows Store, Apple Store, or Google Play Store. MacPorts and Homebrew are where freely distributed Unix programs are available for the Apple Mac platform.

4.2.9.2 Via Terminal

If you prefer, the above can be just as easily be downloaded and run from the Terminal command line. To open a terminal in MacOS, if not familiar, press the “command” and “space bar” buttons at the same time, type “terminal” then hit the <enter> key. Terminal is also in the Launchpad and can be added to the dock. Here are the commands to enter once in a Terminal:

cd ~

# Get Python 3 (2 is default in MacOS X)

curl -O https://www.python.org/ftp/python/3.8.0/python-3.8.0-macosx10.9.pkg

sudo -H installer -package python-3.8.0-macosx10.9.pkg -target /

# Get MacPorts for OSX Catalina 10.15

curl -O \ https://distfiles.macports.org/MacPorts/MacPorts-2.6.2-10.15-Catalina.pkg

sudo -H installer -package MacPorts-2.6.2-10.15-Catalina.pkg -target /

# Remove the install packages

rm python-3.8.0-macosx10.9.pkg MacPorts-2.6.2-10.15-Catalina.pkg

This will put the files in /usr/local/bin and similar for use by everyone on the machine. Curl and Python (v2) are defaults in MacOS. Use python3 instead of old python (2) with this program.

4.2.9.3 Common for both

If you used the GUI to install the previous packages, you need to open a Terminal now to continue with the rest. Execute the following commands in a Terminal shell to install missing tools:

python3 -m pip install pillow pyliftover

sudo -H port selfupdate

sudo -h port -q install samtools bcftools gsed htslib coreutils zip unzip

sudo -h port -q install dos2unix bash grep

If the port command (from MacPorts) is not in your PATH, simply exit the Terminal and start again. Likely you started the terminal before MacPorts was installed. Or you may simply give the fully qualified name of /opt/local/bin/port after the sudo command.

The history of commands is saved in the Terminal and can be retrieved by using the up arrow key for each previous command. As the install proceeds, answer Yes to continue with installing dependent packages it finds are needed..

The original script incorrectly was requiring the following (that you should ignore):

xcode-select --install # This is a very large code dev environment

Xcode is the large code development environment in MacOS. The command installs the compilers, linkers, debuggers and such. Have not figured out what dependency from the MacPorts list of programs requires this. Hopefully we can figure that out and remove the requirement for its installation all together. Suspect it is graphviz.

4.2.10 Starting WGSExtract (pre-patch)

To Start The Program: In the terminal command line window, type the following command:

python3 ./programs/wgsextract/wgsextract.py

This assumes you are in the top-level of the WGSExtract directory. Modify as appropriate if not.

MacOS, like other OS’s, allows you to define the default application to start files when double clicked in the GUI. The odd thing is that the Python 3.8 Launcher on the Mac you just installed is set to use the default Python2 interpreter. So you need to (a) modify the preferences in the Python Launcher to start the Python3 interpreter AND (b) associate the Python Launcher with files ending in .py. Then, in the future, you can simply start the program by clicking on its Python script in the GUI Otherwise, simply type the above line in your Terminal to start the program.

4.2.10.1 Modify the Preferences in the Python Launcher:

Go to the Launchpad and start the Python Launcher there. Or use Finder and look in the Applications for the Python3 folder to find the Launcher and start. Bring up the preferences for the Launcher (via the menu on top if it does not automatically pop up). Change the line that says “Interpreter” to /use/local/bin/python3 by typing (will not be available to select). Close out the preferences and quit the Python Launcher.

4.2.10.2 Associate Python Launcher with .py script files:

Navigate to the wgsextract.py file as shown above. Right click and select “open with ;>”. Select the Python Launcher as the application. If it does not appear, you may have to find it using “Other”. You can now start the program by clicking on its main Python file.

Make an alias of the wgsextract.py file by right clicking and selecting that option. Place the Alias on your desktop, with your BAM files, or another favorite place. This provides a quick and easy way to launch the program from the GUI from somewhere other than the buried installation directory.

If you are on an Apple MacOS, the Win10 executable directories can be removed (WGSExtract/programs/python, samtools-cygwin and samtools-mingw folders) Does not really save much space though. They are less than 2% of the overall program. If and when the bioinformatics tools are ported and available as a standalone windows package, we might be able to avoid distributing the Win10 files with the general release. Until then …

4.3 Linux

Linux support is now available in v2b 18 Feb 2020 and later. Albeit only developed and tested on Ubuntu 18 and later where the bioinformatic tools are readily available in apt-get. Note that it requires a GUI environment Linux installation. Not a headless, server installation that only supports command line shells. Most of the documents like Bioinformatics for Newbies have focused on headless, server installations. You likely need to use a VM like VirtualBox if you want a GUI Linux in a Windows environment.

To start the program, navigate into the WGSExtract folder and double click the Linux_START.sh file located there. It can be started at the command line as:
% cd <folder path to>/WGSExtract

% sudo -H ./Linux_START.sh

Like for the Mac install, it is running a package manager that requires root access. If you do not want to be typing that password into the middle of a large script dump to the command window, then you can run the program as root using sudo as shown. Once the packages are installed, the sudo is no longer needed. (But currently, the script naively simply checks if sudo is enabled and fails on error if not. Hopefully to be fixed soon.)

You can also do the package installs before you start the script. In fact, that is what a majority of the Linux start shell script is doing. Please check your script for the latest. But as of this first release, these commands below perform the package install needed. And maybe avoid the need to run the WGSExtract command as SUDO to start with.

% sudo apt-get install samtools bcftools tabix dos2unix # Need bioinfor tools

% sudo apt-get install sed coreutils zip unzip bash grep python3 # Needed Unix tools

% sudo apt-get install python3-pip python3-pydot python3-pygraphviz # Python3 packages
% sudo apt-get install python3-tk python3-pil python3-pil.imagetk # Python3 packages

% #sudo apt-get install picard-tools # see note, prefer just distribute jar?

% pip3 install pillow pyliftover

(note: removed Marko’s picard-tools in favor of simply having the picard.jar file around. Need to work with him on this. picard-tools in apt-get brings in 400MB of source code compile and library environments for all sorts of other languages like Fortran, etc. See https://broadinstitute.github.io/picard/)^[a]

If you have done the bioinformatic tool installs as part of the Bioinformatics for Newbies document, then you will already have most of the tools installed. Just missing some of the specialized Python3 packages needed to implement the core of the WGS Extract tool.

The major issue with any Linux release is dealing with all the variant distributions and package managers one might encounter. Think of Unix as Latin and each Linux variant as a different language derived from that (French, Spanish, Italian, etc). Close but not quite close enough for the exacting needs of computers. If you want to try building the Linux port for a different flavor, let us know.

The command to start the Linux program, if you have done these installs, is simply:

% python3 ./programs/wgsextract/wgsextract.py

which you can simply make an alias for or an icon of to start in your shell or window. For example, an alias for your .bashrc might be:

% alias wgsextract=’python3 <path to WGSextract>/programs/wgsextract/wgsextract.py’

If you are on a Linux release, the Win10 executable directories can be removed (python, samtools-cygwin and samtools-mingw folders inside the programs folder of WGSExtract).

4.4 General

The compressed release is around 5GB. The uncompressed (installed) release is about the same. The release is so large due to the many human genome reference models included in it. This has expanded with the Beta v2 release to make the program able to handle more alignment types in the BAMs coming out of various WGS NGS testing services. The Extract23 Autosomal RAW file extractor also needs the reference “genomes” as templates from which to make the extracted files. Combined, these reference files make up 98% of the release. In the future, we may support loading the reference genomes on demand, as needed; given most only need one.

Beta v2 release

The green boxed folders in the above release directory structure are the Win10 executables that we earlier mentioned can be removed if installed on a Mac or Linux. But, as you can see, they are small compared to the reference genomes directory.

5 Known Limitations, Licenses, Redistributed Tools

5.1 Known Limitations

Documenting some of the known issues either with the tool or the files that are input or output.

Install: The program cannot fully handle embedded spaces in folders / paths in all places. Although we are trying to recognize and report this when it occurs; as well as simply handle with quoted paths when possible, it is not always achieved..
Workaround: Install the WGSExtract folder and your WGS files to be processed in a path excluding spaces. Such as C:\users\public or maybe an external drive path like D:\WGS. The tool has been modified to try and detect this issue and report back an error if discovered.

Autosomal: Microarray DNA Test Companies (FamilyTreeDNA, MyHeritage, LivingDNA) are not expecting a file in their own format.

Workaround: Only use other formats (like 23andMe or Ancestry) to import to the sites. This is not a WGSExtract limitation but one from their sites. They do allow multiple files to be loaded. So load a 23andMe v3 file and attach it to someone in your tree. Then upload a 23andMe v5 file and attach it to that same person. You can view the match lists separately. For LivingDNA, use a 23andMe v5 for upload as that matches their v1 format closely (similar chip).

Autosomal: FamilyTreeDNA, MyHeritage and LivingDNA cannot import an Everything CombinedKIt file. Their match algorithm cannot handle the extra SNPs contained within it. Only GEDMatch seems to be able to accept it; but then throw half of them out anyway.
Workaround: do not use the Everything CombinedKit file as an import to those company sites. This is a limitation they have.

Autosomal: Many sites can not take the “Everything” file but they can take simpler mergers of multiple files.

Workaround: Use DNA Kit Studio to read multiple files output here and create a minimally merged file that may be acceptable. Especially recommended is to create a 23andMe v3 and v5 merged file and use it for upload where accepted. Check with the DNA Kit Studio documentation for what works with what sites.

Autosomal: During the first month of the WGSExtract v2 release availability (Jan 2020), there was a shipped bug that would generate an error message when hitting “Generate the selected files” and not having “Everything combined in one file” selected. Additionally with this bug, some files may have been named incorrectly.

Workaround: Download the latest release with that fixed or simply select “Everything” in addition to what you want. The tool has to do an “Everything” file on every run anyway. It is minimal extra time no matter what additional formats you select to actually write out. Say 60 minutes of main processing time and 2 minutes per format after that.

Autosomal: The hg38tohg19.py file to wrap pyliftover has a call to cat/gcat that does not include the full path. On the MacOSX, where MacPorts /opt/local/bin is not added to the shell path, this can lead to the program failing to find the executable. This wrapper is only called when asking to do a microarray file generation on an HG38/GRCh38 reference model BAM.
Workaround: Add /opt/local/bin to your executable path in the Terminal shell on MacOSX.

Autosomal and mitoFASTA: A BAM file based on the GCA*fna.gz reference model file is incorrectly characterized as an HG38 file. As a result also, the wrong reference model is chosen when needed. This happens today with all Nebula Genomics supplied CRAM files that are converted to a BAM as they are based on this reference model.
Workaround: The only possible work-around is to make a new BAM not based on that reference model. The selection and use is too ingrained in the current tool code. This has been fixed in the rewrite v3 but a fix is not yet available here. Luckily, the reference genome is only used in the mitoFASTA button to get the mitochondrial model (which is the same in both instances). But it is used in the autosomal / microarray file generation to make SNP calls and the wrong selection is problematic there.

Autosomal and mitoFASTA: The reference genomes supplied for UCSC HG models (hg19.fa.gz and hg38.fa.gz) are not the ones expected with those names and simply chosen to be correct with the program. The hg19.fa.gz supplied is missing all the alt contigs; even ones in the base primary model decided on by the reference model committee. The hg38.fa.gz is actually the hgp13.fa.gz file.
Workaround: The best workaround is to replace both files with the correct ones and then rebuild the indices. With that said, the existing models may not pose a problem for the uses of the reference models in the current tool. As the BAM is already aligned, it is only used for variant calling on an already mapped set of segments. The original models are at ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/initial/hg19.fa.gz and ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/initial/hg38.fa.gz. They are also used by ySeq and available at http://genomes.yseq.net/WGS/ref/hg19/hg19.zip and http://genomes.yseq.net/WGS/ref/hg38/hg38.fa (note: this one is not compressed). The indices are created with samtools faidx fa.gz, samtools dict fa.gz > fa.gz.dict and samtools index fa.gz. Replacing fa,gz with the appropriate reference genome. They need to be bgzip compressed as well.

MacOSX: The latest update to MacOSX Catalina 15.0.7 continues to tighten what is allowed to run on a Mac in the GateKeeper code. This latest update prevents any unsigned, compiled AppleScript files from being run at all. Catalina itself updated the default shell on MacOSX to Zsh from the traditional Bash supported in the past. They issue stern warnings for not using Zsh.
Workaround: the only workaround is to not use the AppleScript code wrappers that are provided as a convenience to Apple users. They allow icon-based, window invocation of the program. Users will have to default back to the Terminal invocation of the BASH shell scripts. For each Applescript .app folder icon their is a corresponding Bash .sh script. All the Applescript .app does is call that script (with no parameters). One can simply run the script from the Terminal using ./Install_MaxOSX.sh; for example. Ignore the Zsh warning. Unfortunately there is no way to remove it. Nor to change the shell and still have the program run on earlier versions of MacOSX.

5.2 Suggestions

As added and discussed between Randy and Marko. No particular order or priority at this time.

Note: all suggestions and changes documented in the alpha Betav3 manual available to developers. The below has been deleted here to avoid duplication and possible changes here that will not be transferred there.

5.3 Licenses and Redistributed Tools

Tool	Author / Institution	Where Used	Notes
Samtools, BCFTools, htsLib	Sanger Institute (Cambridge, et al)	Throughout (pretty much every function)	Installed with package managers at install time for MacOS and Linux. Custom CygWin/MinGW ports redistributed with Win10 release
Extract23 v0.0 (distributed with WGSExtract)	Thomas Khran, ySeq.net	Autosomal File Generator	Major, extensive expansion by Marko to include all RAW file output format variants from all companies; not just the single 23andMe v3 one. Including adding the Everything option.
Haplogrep (distributed with WGSExtract)		Mitochondrial haplogroup determination	JAR file redistributable
yLeaf v2.1 (distributed with WGSExtract)		Y haplogroup determination	Modified to relax standards so dives deeper into the tree; moderate cleanup
Picard, GATK (TBD)	Broad Institute (MIT, etc)	TBD	JAR file redistribute. Redistributed with Win10 release; in package managers in MacOS and Linux
Python3, CygWin, MinGW	Various	Everywhere (Win10 compiled versions of Unix utilities and libraries to support Win10 compiled bioinformatic tools)	Win10 release only (to support non-Unix/Linux environment and ported tools); libraries installed by user otherwise
PyLiftover (installed via Pip; Python module)	Konstantin Tretyakov	Autosomal extraction	To convert coordinates of extracted SNPs from GRCh38 to 37; when needed^[b]. Py library pulled in at install with Pip.
WGS Extract Manual	Randy Harr, Marko Farmer and other users	Documentation	Available with the tool under the same license arrangement

These are not incorporated into the tool but more general inspirations.

BAM Analysis Kit and others	Felix Chandrashakar / Immanuel, y-str.org (Using update by Teepean)	Initial idea for Haplogroup caller inclusion; and general tool flow effort	Not directly incorporated; but many tool flows and features inspired by this very early work
DNA Kit Studio	Wilhelm H-O	Manipulating microarray RAW data files (of autosomal and X data, primarily)	Not directly incorporated but inspired by what it does for microarray result files

As of this WGSExtract BetaV2b release on 18 Feb 2020, here are the ported tool versions you get with the various platforms (updated 30 May 2020):

Platform Tool	Microsoft Win10 CygWin	Microsoft Win10 minGW	MacOS X 10.14 Moj, 10.15 Cata	Linux Ubuntu 18.04	Linux Ubuntu 20.04	Latest Available @source
SAMTools	1.9	1.6-139	1.10	1.7	1.10	1.10
BCFTools	1.9	1.4 (1.9htslib)	1.10-2	1.7	1.10.2 (1.10.2-3htslib)	1.10-2
htslib	1.9	1.6-304	1.10-2	1.7.2	1.10.2-3	1.10-2
Source	WGSExtract Port	WGSExtract Port	MacPorts 2.6.2	apt-get	apt-get

There are only two places the MinGW ported tools are used. The Autosomal microarray (RAW) file format generation and the Mitochondrial FASTA file generation. There was some difficulty porting the tools to the pseudo Linux environments on Windows and getting them to work with a wide variety of WGS 30x BAM files. But clearly we need to look into recommending staying with later Ubuntu releases and find a way to get a better port to Windows.

Ubuntu 18.04 LTS is the most stable but has older versions of the key tools. While an upgrade to 19.10 can bring newer bioinformatic tools, it may introduce incompatibility with other software. By summer 2020, a new 20.04 LTS of Ubuntu should be available and stable and include the later 1.10 release of the key tools.

Old, archived release still available for posterity and error checking:

http://37.187.22.93/wgsextract/WGSExtractBeta.zip (Beta v2 29 Dec 2019)
SHA256: BFF2C3445EF914946FCCFEFEF9C68D4E93122300820E4ECF0D2963242CF19212
MD5: 0FF0A1EBADB10CA5C09DFCA3E825B2C8
(note: link is to archived copy; not the depicted URL)

(Beta v1 16 Jul 2019)

WGS Extract Manual Beta Edition v1 (20 Jun 2019)

6 Appendix: Under the Hood

To be as transparent as possible, we want to show you the example commands that are run with many of the options. As mentioned, this program function is mostly a collection of other publicly available tools that have been scripted together. In most cases, with extra tuning and modification to work with additional human reference models, on more platforms, and with more special cases. So the below shows the generic commands that are called. Not all the detailed guts -- read the python code for that. If you recommend a different approach, let us know.

Note that these commands are subject to change. Please refer to the latest release python code for what actually happens. Decisions to use pipes versus new command lines with temporary, intermediate files is sometimes arbitrary.

NOTE: Many times, long command lines are shown below. As the command line can wrap unexpectedly, depending on the width display you are viewing this document in, we start each new command line with the ‘%’ prompt. Know the prompt indicates the start of a new command line. Any line without that prompt is likely a wrap-around of the line before.

Some of these commands can be seen in the scrolling command line window or log. Others are buried in temporarily created scripts in the temp/ directory of the WGSExtract program.

In the command shown, we assume the file your.bam and your.bam.bai are in use and available in the current directory. We assume paths to all appropriate tools and stored reference genome files are known and automatically found. And that the input and output directory is set relative to the current directory. The referenced temp directory is where you put it. This is all simplified here for convenience to alleviate cluttering commands with paths. General commands are shown in blue. Important settings / reference files used are shown in red and may change depending on the context (which model of the human genome is used, for example).

The “section numbers” below match the main sections from Chapter one where the command is discussed.

1.1 Settings Tab (Select BAM)

After selecting the BAM, many internal steps are performed. A few captured in the command line window are:

% samtools view -H your.bam 1> temp/bamheader.tmp

% grep '@SQ SN:chr1 LN:.*$' -m 1 -o temp/bamheader.tmp 1> temp/refgen2.tmp

% grep '@SQ SN:1 LN:.*$' -m 1 -o temp/bamheader.tmp 1> temp/refgen3.tmp

% grep '@SQ SN:MT LN:.*$' -m 1 -o temp/bamheader.tmp 1> temp/refgen4.tmp

These are used to help try and determine the reference model used and other information.

If a companion BAM Index file is not found, then the following is immediately executed (and can delay the return of control to the main window for the user):

% samtools index your.bam

The BAM Index file is placed in the same directory as the original BAM as they must coexist together and with specific matched file names.

Similarly, if the BAM is found to not be sorted in coordinate format, it is sorted and a new BAM (and BAM Index) is created and used as the new basis for all work.

% samtools sort -o your_sorted.bam your.bam

If your original BAM was your.bam, the new one is your_sorted.bam.

A temporary script get_samtools_stats.sh is created and then destroyed to get the BAM Stats:

% samtools idxstats your.bam > temp/idxstats.csv

% samtools view your.bam | head -100000 | gawk -F ’\t’ '{sum += length($10); count++} END {print sum/count}' > temp/readlength.tmp

These results are partially displayed on the main settings tab as well as used later in the 2.4.1 Other / Check BAM File detailed settings tab. The code is based on the Average Read Depth document.

It appears long-read “nanopore” BAMs may be sorted by length of segment as the read length tends to be very low. qual.iobio.io, like its sister tool bam.iobio.io, jumps around the BAM doing random sampling. So the code here has been modified to rerun the view and head command to use 2 million samples to try and get deep enough into the file to get a better average length value.

1.2.1 Autosomes and X Chromosome

This is based on the original Extract23 but goes so much further. So we delve into the content of the modified Extract23 script that is generated and run first. Extract23 originally only generated a single 23andMe v3 file from only an HG19 BAM file. We do the same initial command but now it is based on a Combined Kit file generated by Marko that has a merge of every SNP ever used in any microarray test. And mapped to all the different human genome reference models. Once we have the variant call file for this combined kit, we can then scan and extract for the many other file formats; customizing as needed along the way.

# The -l parameter requires mpileup to exactly pick the constellations at the SNP positions without having to screen through the whole genome base by base.

% samtools mpileup -C 50 -v -l extract23/All_SNPs_GRCh37_ref.tab.gz -f reference_genomes/hs37d5.fa.gz your.bam > temp/temp_autosomes_raw.vcf.gz

% tabix -p vcf temp/temp_autosomes_raw.vcf.gz

# Now we call the SNPs from the raw mpileup data with the -m (mixed base) caller

% bcftools call -O z -V indels -m -P 0 temp/temp_autosomes_raw.vcf.gz > temp/temp_autosomes_called.vcf.gz

% tabix -p vcf temp/temp_autosomes_called.vcf.gz

# Here we annotate the SNP names (rs numbers) to each SNP position

% bcftools annotate -O z -a extract23/All_SNPs_GRCh37_ref.tab.gz -c CHROM,POS,ID temp/temp_autosomes_called.vcf.gz > temp/temp_autosomes_annotated.vcf.gz

% tabix -p vcf temp/temp_autosomes_annotated.vcf.gz

# Pick the data from the vcf and convert it into a tab delimited table.

% bcftools query -f '%ID\t%CHROM\t%POS[\t%TGT]\n' temp/temp_autosomes_annotated.vcf.gz

% sort -t $'\t' -k2,3 -V temp/temp_autosomes_result.tab > temp/temp_autosomes_result_sorted.tab

This temp_autosomes_result_sorted.tab created from your BAM is now used as a basis to match against the template from all the various companies and their versions. It is also the direct source to create the Combined Kit (super-kit on steroids) of over 2 million SNP diploid values.

On Win10 systems, the above commands use the MinGW versions of the executables as the CygWin ported versions were causing some issues. The MinGW versions are slower and an older release. See the end of this section for details on the release versions.

1.2.2 Mitochondrial DNA processing

For the only option to create a FASTA file of mtDNA for GenBank, et al.

% samtools mpileup -r MT -u -C 50 -v -f hs37d5.fa.gz your.bam | bcftools call -O z -v -m -P 0 1> temp\chrM.vcf.gz

% tabix temp\chrM.vcf.gz

% samtools faidx hs37d5.fa.gz MT | bcftools consensus temp\chrM.vcf.gz -o your_mtdna.fasta

(note: depends on the model in the BAM as to which reference genome to use. An mtDNA FASTA is specific to a reference genome just like a BAM file. If you have a Yoruba mtDNA model BAM, you are best to convert it to a rRCS model BAM before extraction. That command will be in version 3 coming soon.)

1.2.3 Y DNA Processing

For the option for yFull to generate a BAM that contains both Y and Mito DNA, a command line similar to below is executed:

% samtools view -b your.bam chrY chrM > your_chrY_and_chrM.bam

(note: the resultant BAM has the same model as the source BAM):

1.3.1 Check BAM File

See 1.1 Settings Tab (Select BAM) section above (in this appendix) where the work is actually done. This tab simply displays it.

1.3.2 Haplogroups

The haplogroup callers are extensive scripts / programs unto themselves. We simply show how we call those programs (with what parameters) and leave it to your exploration to dig into the programs further.

% python yleaf/Yleaf.py -r 3 -bam your.bam -pos yleaf/Position_files/hg19.txt -out temp/tempYleaf -r 1 -q 20 -b 90 -py python -samt samtools

% java -jar haplogrep.jar … … …

1.3.3 Oral Microbiome

The current option is to create FASTQ files of just your unmapped segments in the BAM file. This has to be done separately for each group of paired reads to recreate the paired-read FASTQ files.

% samtools view -hbf 64 -@ 2 your.bam '*' | samtools bam2fq - | bgzip your_unmapped_R1.fastq.gz

% samtools view -hbf 128 -@ 2 your.bam '*' | samtools bam2fq - | bgzip your_unmapped_R2.fastq.gz

Note that the FASTQ file is unmapped and no longer dependent on a reference model. But, as these are the unmapped segments in the BAM, there really is no difference.

Another, simpler option is to generate an unmapped BAM (cosmosID accepts it just as easily):

% samtools view your.bam '*' | bgzip your_unmapped.bam

A preferred option to be added later to the tool.

Page of

[1] None of the transfer-in companies accept any Y or mitochondrial SNPs that may be in a file. Even though they are still extracted and included here as they exist in the BAM.

[2] Traditional VCF files only have the derived SNP values listed and assume you are ancestral for everything not mentioned. So information of whether you were tested or not for possible ancestral SNPs is lost.

[3] See http://www.beholdgenealogy.com/blog/?p=2879 for one anecdotal example

[4] note: if your username has a space in it, the same issue applies as to why you should not use the “Program Files” folder for the install.

[a]+marko@bauermail.koeln

_Assigned to marko@bauermail.koeln_

[b]+marko@bauermail.koeln : Crossmap, a popular liftover tool, was issued a DCMA take down notice by UCSC in Sep 2019 due to providing the chain file. There may be an issue to redistributing it. See http://crossmap.sourceforge.net/

_Assigned to marko@bauermail.koeln_