The FASTQZ Compression Algorithm for FASTQ Files
Matt Mahoney
Dell Inc.
Mar. 18, 2012
This paper describes the FASTQZ algorithm for compressing FASTQ files, which are output by DNA sequencing machines. FASTQZ was ranked second by compression ratio in the Pistoia Alliance compression contest, while compressing faster than the leading program and using less memory. It splits the input into 4 streams which are modeled separately: headers, alignments to a reference genome, unmatched base calls, and quality scores. Each stream is compressed using a custom ZPAQ model.
The Pistoia Alliance announced a compression contest for FASTQ files in November 2011, to end on Mar. 15, 2012 with a US $15,000 prize. Contestants were required to submit open source programs published on Sourceforge under a BSD-2 license. Programs would be tested for compression ratio, compression time, decompression time, memory usage, and number of header, quality, and sequence mismatches, although the criteria for awarding the prize were left unspecified. The test environment was an Amazon Web Services m2.xlarge instance, a dual core x86-64 system running Linux with 17 GB memory and 300 GB disk space. Contestants were provided with an Amazon machine image so that they could test their programs in the same environment.
The test data was not released, but contestants were provided two similar test files, one 80 MB and one 6.3 GB, in the common Sanger variant of the FASTQ format. These were two of three files from a paired end read. In a paired read, the DNA is broken into fragments of a few hundred or a few thousand bases such that the two ends of the strands are attached to adjacent locations and replicated before sequencing. This produces complementary, but generally non-overlapping reads that are constrained to be near each other when the sequence is assembled. The machine output consists of two large files, each containing one end of every pair, and one smaller file containing all of the reads that could not be paired because their mates were too low in quality to be included.
A FASTQ file is a text file containing (possibly) hundreds of millions of “reads”. A read is described in 4 lines of text: a header, a sequence, an optional second copy of the header, and a list of quality scores. The following example shows three of 24,148,993 consecutive reads from the middle of the large file. The sequence and quality lines are each 100 characters long, so they may appear word-wrapped. In reality, they are single lines.
@SRR062634.2724179 HWI-EAS110_103327062:6:13:11133:13696/1
TGGAATCAGATGGAATCATCGAATGGACTGGAATGGAATCATTGAATGGACTCGAAAGGGATCATTATTGAATGGAGTTGAATGGAATCATCGAATGGTC
+
GGGFGGFDGGGGGGFGFGGGGGGGGGGGGGGEFGGGGFGEDGGGFGGGFEDFGCDFDG?ECCDBEECGEEC@DFDACC@B@9@B=AC@??ABBC######
@SRR062634.2724180 HWI-EAS110_103327062:6:13:11133:11572/1
ATATAGTCCATTGTACTCCCTTGCTTAAATCTGGATCCCTGCAAATAAAAACATCTTCCTGGGGAGCCCAGCCCCCCTCCCCCCCCCGCGCCCCCCCCCC
+
GGGGGGGGFGGGGEGGFGGGEGGFDGEAEGGEEEEBEEEEEEEEEEEEEEEEEEECCCCA:::?####################################
@SRR062634.2724181 HWI-EAS110_103327062:6:13:11133:5630/1
CATAGATCTCTTAAAATCACTTTTTCATAGAATATTAAGGAAATGGGTTTATTTTACACCAGCTCCTGGGTAATCCCGGTGAGCGTTGCCTTTCTTGGCC
+
GGGGGGGGGGGGGGGGGGGGGGGGGGEFGGGGGGGFGGGGGGGGGGGEGGEGGGGEEFGFGGGDGGGGGGEGEDGEEGEEECEEEEEEEE@EBAECD@?C
The first line of each read is a header beginning with “@”. The variable text in the first header indicates read number 2724179, lane 6, cell 13, x coordinate 11133, y coordinate 13696, and paired end /1. The mate will appear in the other file with the same read number, possibly different x and y, and paired end /2. Read numbers are usually consecutive starting at 1, but will skip values that don’t have mates. Those reads appear in the smaller file, also in consecutive order. In this sample, the lane number is always 6, the cell number starts at 1 and climbs to 120, and the x values are in ascending order within each cell from 1 to 20000. There are usually 10 to 20 reads before x is incremented. Within this group, the y values, which also range from 1 to 20000, occur in apparently random order. The pair is always /1 in this file, but may be either /1 or /2 in the smaller file. All numbers are written without leading zeros, so the line length is variable. Note, however, that the FASTQ format places no restrictions on the contents of the header except that it is ASCII text beginning with “@”.
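To illustrate the header structure, the variable fields could be parsed as sketched below. This fragment is only an illustration, not FASTQZ code, and the sscanf format string is our own assumption about the field layout shown in the example above.

#include <cstdio>

int main() {
  const char* line = "@SRR062634.2724179 HWI-EAS110_103327062:6:13:11133:13696/1";
  long read; int lane, cell, x, y, pair;
  // Skip the run name up to ".", read the read number, skip the instrument
  // name up to ":", then read lane:cell:x:y/pair.
  if (std::sscanf(line, "@%*[^.].%ld %*[^:]:%d:%d:%d:%d/%d",
                  &read, &lane, &cell, &x, &y, &pair) == 6)
    std::printf("read %ld lane %d cell %d x %d y %d pair %d\n",
                read, lane, cell, x, y, pair);
  return 0;
}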
The second line is the sequence. In the test files, the only letters are the bases A, C, G, T, or N meaning unknown. However, the FASTQ format again does not have this restriction. For example, ABI SOLiD sequencers will write in color space. A sequence consists of a base followed by a list of colors indicated by 0, 1, 2, 3 (or possibly 4 or 5), or “.” for unknown. A color indicates an overlapping pair of bases. However, all sequences must have the same length throughout the file in all formats.
The third line is an optional second copy of the header. It starts with “+” and can either be empty or contain a copy of the first line after the “@”. In the test files, it is always empty.
The fourth line is the quality scores. In the common Sanger variant, each character is ASCII 33 plus the Phred score. The Phred score is -10 x log10 p, where p is the probability of error in the corresponding base. In the test files, the possible scores are 0 for N or 2 through 38 for A, C, G, and T. (However, FASTQ allows 0 through 40 for any base). A Phred score of 38 means that the probability of error is about 0.00016. It is indicated by the letter G, which has an ASCII code of 33 + 38 = 71. The lowest score observed for other than N is 2 (p = 0.63). It is indicated by “#”, which has an ASCII code of 33 + 2 = 35. In the test file, any quality score of “#” is always followed by another “#” or “!” (score 0) to the end of the read. In the test files, about 40% of the quality scores are 38 (“G”), mostly early in the reads. The quality score lists must always be the same length as the read lengths throughout the file.
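As a small illustration of the Sanger encoding (a sketch, not part of FASTQZ), the following fragment converts a quality character to its Phred score and error probability:

#include <cmath>
#include <cstdio>

int main() {
  char c = 'G';                              // example quality character
  int phred = c - 33;                        // Sanger coding: ASCII = 33 + Phred
  double p = std::pow(10.0, -phred / 10.0);  // Phred = -10 * log10(p)
  std::printf("'%c' -> Phred %d, p = %.5f\n", c, phred, p);  // 'G' -> 38, p ~ 0.00016
  return 0;
}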
Some machines, such as Solexa/Illumina, produce a different format: either Phred + 64, or an alternative coding + 64 of the form -10 x log10(p/(1-p)), which can range from -5 to 40 (producing ASCII 59 through 104).
For the contest, it was assumed that the format would be restricted as described above. These assumptions apparently held in the withheld test data. We also assumed that lines were terminated with linefeeds without carriage returns, i.e. UNIX/Linux style text files.
The FASTQZ program takes a FASTQ file and possibly a reference genome as input and produces three or four compressed files: a .fxh file containing the headers, a .fxb file containing the base calls, a .fxq file containing the quality scores, and, if a reference genome is given, a .fxa file containing the alignments to the reference.
There are two compression modes, fast and slow. In fast mode, each of the output files uses a simple encoding designed to allow further compression using a context model. In slow mode, the files are compressed using custom ZPAQ models. ZPAQ compresses by predicting input bits one at a time by an array of context models, then arithmetic coding the final mix of predictions. It is based on the PAQ series of compressors, which are top ranked on many benchmarks.
If a reference genome is provided, then the compressor attempts to match each read to the reference. It encodes the position, direction, and a list of differences to the .fxa file and encodes any leftover bases to the .fxb file. The same reference genome is needed to decompress. If no reference is given during compression, then no .fxa file is produced and all of the bases are encoded.
It was found that the test data is human whole-genome DNA. For the contest, the reference genome hg19 was used.
We first describe how headers, bases, quality scores, and alignments are encoded as the final output in fast mode, or as the input to context modeling in slow mode.
Headers are encoded to the .fxh file as a string of bytes of the form (j, k, n, xxx..., 0). The string is decoded by going to the j’th character in a stored copy of the previous header (starting with j = 1), and adding k - 1 to the decimal number that ends there. After this step, copy the first n characters of this modified copy, then output any remaining characters xxx, and finally output a linefeed (ASCII 10). During compression, the program searches for the longest string of digits that differ by at most 254, then continues matching up to the first character that differs after that. In most cases, the only characters that need to be encoded literally are the y coordinate, pair end, and occasionally the last digit of the x coordinate.
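The following is a minimal sketch of this decoding step, written from the description above rather than taken from the FASTQZ source. prev holds the previously decoded header; rec holds one encoded record (j, k, n, literals..., 0).

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

std::string decodeHeader(std::string& prev, const std::vector<unsigned char>& rec) {
  std::size_t i = 0;
  int j = rec[i++];            // 1-based position where the decimal number ends
  int k = rec[i++];            // k - 1 is added to that number
  int n = rec[i++];            // copy this many characters of the modified previous header
  int carry = k - 1;           // add k - 1, with decimal carry propagating to the left
  for (int p = j - 1; p >= 0 && carry > 0 && isdigit((unsigned char)prev[p]); --p) {
    int d = prev[p] - '0' + carry;
    prev[p] = '0' + d % 10;
    carry = d / 10;
  }
  std::string out = prev.substr(0, n);    // first n characters of the modified copy
  while (i < rec.size() && rec[i] != 0)   // append literal characters up to the 0 terminator
    out += (char)rec[i++];
  prev = out;                             // becomes the stored copy for the next header
  return out;                             // the decompressor then appends a linefeed
}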
Bases are encoded to the .fxb file by first deleting all N’s, and then packing 3 or 4 bases per byte using a variable length code. The N’s can be restored because they always have a quality score of 0, and no other bases do.
The bases A, T, C, G are interpreted as the digits 1 through 4, respectively, of a base 4 number and parsed into the longest string that codes to a byte value of 255 or less. Thus, strings beginning with G, CG, or CCG are packed 3 bases to a byte, and all others 4 bases. The reason for this encoding, instead of the more obvious 4 bases per byte, is to improve compression by allowing a model to find matches in reads starting at different points that would otherwise be hidden if the starting points differed by other than a multiple of 4. The coding tends to synchronize: once two parsers are in sync, they stay that way. The encoding assigns lower digits to A and T because they occur almost twice as often as C and G in the test data, thus producing fewer 3-base codes.
The encoding produces output about 6% larger than 4 bases per byte. It uses only the byte values 64 through 255 except for the very last byte in the output. It was found experimentally to compress better than coding a partial string at the end of each read.
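The following sketch illustrates the greedy packing rule with the digit assignment A, T, C, G = 1..4 given above. It is not the FASTQZ source; in particular, FASTQZ continues packing across read boundaries, while this fragment packs a single string and flushes at the end for clarity.

#include <string>
#include <vector>

// Pack a read (with N's already removed) into 3 or 4 bases per byte.
std::vector<unsigned char> packBases(const std::string& seq) {
  auto digit = [](char b) { return b == 'A' ? 1 : b == 'T' ? 2 : b == 'C' ? 3 : 4; };
  std::vector<unsigned char> out;
  int val = 0;
  for (char b : seq) {
    int d = digit(b);
    if (val * 4 + d > 255) {                  // byte is full: emit it and start a new one
      out.push_back((unsigned char)val);
      val = 0;
    }
    val = val * 4 + d;
  }
  if (val) out.push_back((unsigned char)val); // final, possibly short, byte
  return out;
}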
Quality scores are encoded in the .fxq file as a 0 terminated string. Bytes are decoded as follows:
Recall that 71 = G is the highest and also the most common quality score (Phred = 38). The encoding allows runs of up to 55 G's to be encoded in 1 byte. It also allows groups of 3 scores in the range 35..38 or groups of 2 scores in the range 31..38 to be encoded as one byte. If the string is terminated early, then all remaining quality scores are assumed to be “#” (Phred = 2).
If a reference genome is supplied, then alignments are encoded to a .fxa file and the matched bases are deleted before encoding to the .fxb file. The reference genome used as input must be packed 4 bases per byte in MSB (most significant bit) to LSB (least significant bit) order using the coding A, C, G, T = 0..3, respectively. This is a different encoding than .fxb, but it simplifies finding complementary reversed matches. The program fapacks, part of the FASTQZ distribution, will create these files from FASTA input using the command “fapacks output *.fa” on the FASTA files of the hg19 distribution. The output is 724,327,615 bytes.
Alignments are encoded as strings of the form (m1+d*128, m2, m3, m4, p3, p2, p1, p0), where m1..m4 are the locations of the first 4 mismatches with the reference (in the range 1..n, where n is the read length), and d = 1 if the match is to the complement, i.e. in the reverse direction, swapping A with T and C with G. If there are more than 4 mismatches, then all of the bases after m4 are assumed not to match. If there are fewer than 4 mismatches, then the remaining values are stored as n+1. The mismatches are stored in ascending order. The pointer is given as 4 bytes, MSB first, to the first base of the match in the reference, which is padded at both ends (by 16K bases) as an optimization to avoid some bounds checks. This format limits read lengths to 126 and reference genomes to a little under 2^32 bases.
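A sketch of writing one such record, following the layout described above (an illustration, not the FASTQZ source):

#include <cstdio>

// m1..m4 are mismatch positions (1..n, or n+1 if unused), dir is 1 for a
// reverse complement match, and ptr is the position of the match in the
// packed reference, written MSB first.
void putAlignment(std::FILE* out, int m1, int m2, int m3, int m4,
                  int dir, unsigned ptr) {
  std::putc(m1 + dir * 128, out);
  std::putc(m2, out);
  std::putc(m3, out);
  std::putc(m4, out);
  std::putc((ptr >> 24) & 255, out);
  std::putc((ptr >> 16) & 255, out);
  std::putc((ptr >> 8) & 255, out);
  std::putc(ptr & 255, out);
}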
Reads are aligned as follows. First the reference is read into memory as packed, with padding added to the ends. Then an index is built using a hash table of 2^27 elements taking 512 MiB of memory. The genome is divided into segments of 32 bases. Each segment is hashed by interpreting the segment as a 64 bit number, multiplying by an arbitrary 64 bit constant (123456789123456789), and taking the high 32 bits as a hash. The low 27 bits of this hash are used as a starting index into the hash table. The program searches the first 8 elements for an empty one, then stores a 27 bit pointer to the segment and the 5 high bits of the hash as a checksum, an optimization that avoids unnecessary comparisons during alignment.
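A sketch of the hashing step, assuming the 32 base segment has already been packed into a 64 bit integer at 2 bits per base. Reading "high 32 bits" as bits 32..63 of the truncated 64 bit product is our assumption; this mirrors the description above rather than the exact FASTQZ code.

#include <cstdint>

const uint64_t MULT = 123456789123456789ULL;  // constant from the text

// Hash of one 32-base segment: the high 32 bits of the 64 bit product.
uint32_t segmentHash(uint64_t seg) {
  return (uint32_t)((seg * MULT) >> 32);
}

// The low 27 bits index a table of 2^27 slots; the high 5 bits are kept
// as a check value to avoid unnecessary comparisons during alignment.
uint32_t tableSlot(uint32_t h) { return h & ((1u << 27) - 1); }
uint32_t checkBits(uint32_t h) { return h >> 27; }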
Each read is aligned by computing a rolling hash of the sequence in both directions (after deleting N’s), looking for matches in the hash table and picking the best one. “Best” means the highest value of m4, breaking ties by m3, m2, and m1 respectively. If m4 is less than n/2 (50), then no match is saved and all n bases are written to the .fxb file. If a match is found, then the matched bases are deleted before encoding.
Experimentally it was found that 95.8% of the bases in the large public file could be matched to hg19, but the encoding only allowed 93% of the bases to be deleted. The hash table is 75% filled, with 6% of hashes having to be discarded during indexing. The alignment algorithm requires about 1.3 GB of memory.
In high compression mode, the 3 or 4 encoded files are further compressed using custom ZPAQ models. ZPAQ is a format for highly compressed data based on the PAQ context mixing algorithm, which is top ranked on many benchmarks. Context mixing codes bits one at a time by arithmetic coding. The high compression ratio is achieved by accurately assigning probabilities to bits by combining the predictions of many context models. The most important types of context models are:
CM (context model), which maps a context directly to a prediction and an update count; the prediction is adjusted in inverse proportion to the count, up to a specified limit.
ICM (indirect context model), which maps a context to a bit history (a 1 byte state summarizing recent bits seen in that context), and the bit history to an adaptive prediction.
ISSE (indirect secondary symbol estimator), which adjusts the prediction of an earlier component by mixing it with a constant, selecting the pair of mixing weights by a bit history.
MATCH, which looks up the most recent occurrence of the current context in a history buffer and predicts the bit that followed it, with confidence increasing with the match length.
MIX, which combines the predictions of a range of earlier components by weighted averaging in the logistic domain, selecting the weight vector by a context and adjusting the weights to reduce prediction errors.
SSE (secondary symbol estimation), which refines a prediction by interpolating in a table indexed by a context and the quantized prediction.
A typical architecture might consist of an array of CM or ICM taking contexts over various orders (i.e. hashes of the last 0, 1, 2, 3, 4 bytes) and mixing them. For stationary sources, a CM with a high count limit gives the best compression. For data with changing statistics, a fast adapting CM or ICM works better. Instead of an ICM, a chain of ISSE going from low to high order sometimes works better. A MATCH might be used for high orders. A MIX may also be followed by a low order SSE. In general, compression ratio improves with the number of components, but run time increases.
Contexts can be customized to particular data types for better compression. For example, English text compression can be improved by using contexts that begin on word boundaries and ignore upper/lower case differences. Image compression might use the high bits of adjacent pixels both horizontally and vertically.
ZPAQ stores a description of the decompression algorithm in the compressed file header. Thus, old versions of the decompressor will still work when the compressor is updated to a better algorithm. The algorithm is stored as an array of components, and as a set of instructions for computing contexts for each component. Components always take a context as input, and sometimes take predictions of earlier components as input. The output of the last component is used to arithmetic code the bit.
Contexts are computed by instructions written in a virtual machine language called ZPAQL. The public domain libzpaq API translates ZPAQL byte code into 32 bit or 64 bit x86 code and executes it during compression and decompression. ZPAQ also supports using ZPAQL for post-processing the decoded output, but FASTQZ does not use this feature. FASTQZ uses the following models for each of the four encoded files.
Headers (.fxh files) are compressed using the model described in the ZPAQL source code below. In each case, the zpaq program was used to convert the source to a byte code and insert it into the FASTQZ source code.
The source code below describes an array of 5 components: two CM, two ICM, and a final MIX. Recall that the input is a sequence of strings, which can be thought of as lines of text, where the first 3 characters have special meaning. The context for the first CM is a hash of the previous byte (i.e. order 1), the current byte in the previous line, and the column number. The next three components are similar except that they are order 2, 3, and 4 respectively. The MIX also uses the column number (starting at 0 at the beginning of each line) as context to select the weight vectors for mixing. In addition, for all ZPAQ components, contexts include the previously coded bits of the current byte. Bits are coded in MSB to LSB order.
(fxh.cfg model to compress headers)
comp 3 8 0 0 5 (H has size 2^3, M has size 2^8)
0 cm 20 128 (direct 20-bit context model with max count 128*4)
1 cm 22 128
2 icm 18 (indirect context model with 2^(18+6) bit histories)
3 icm 19
4 mix 13 0 4 24 255 (13 bit context, mix 0..0+4-1, rate 24, mask 255)
hcomp
*c=a c++ a== 0 if c=0 endif (save input in buffer M pointed to by C)
d=0 *d=0 b=c a=c hashd (context H[0] is a hash of column number)
a=*b hashd (combined with the byte above, saved in M)
b-- a=*b hashd (combined with the byte to the left (order 1))
a=*d d++ *d=a b-- a=*b hashd (context H[1] as above but order 2)
a=*d d++ *d=a b-- a=*b hashd (context H[2] as above but order 3)
a=*d d++ *d=a b-- a=*b hashd (context H[3] as above but order 4)
d++ a=c a<<= 8 *d=a (context H[4] for mixer is just the column number)
halt
post 0 end (no post-processing)
The specification for the ZPAQL source language is given in the ZPAQ specification and reference manual. Briefly, the HCOMP section of the code computes contexts; it is called once for each byte coded. These contexts are combined with the previously coded bits of the current byte for modeling by the components listed in the COMP section. A ZPAQ virtual machine consists of 32 bit registers A, B, C, D, R0...R255, an array of 32 bit values H, an array of bytes M, and a 1 bit condition flag. H is also the output. A is the input (the previously coded byte) and the destination for all arithmetic and logical operations. B and C are pointers to M (modulo the array size). D points to H. The instruction HASHD computes *D = (*D + A + 512) * 773 to update the context hash with A. Machine state except for the input A is preserved between calls.
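In ordinary C++ terms, the HASHD update of the 32 bit value *D is simply the following (a sketch of the arithmetic; ZPAQL registers wrap modulo 2^32):

#include <cstdint>

inline uint32_t hashd(uint32_t h, uint32_t a) {
  return (h + a + 512) * 773;  // *D = (*D + A + 512) * 773, modulo 2^32
}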
The model, including component parameters, was developed experimentally and tuned for best compression. In the program, M contains the last line of input. C is used to point to the next character to be written to M. It is reset to 0 at the end of each line. B is used to read from M.
Recall that base sequences are encoded in self delimiting base 4 with 3 or 4 bases per byte. Base sequences are modeled using a simple order 0..5 mix of 6 models. This model uses by far the most memory, about 1.4 GB. High memory usage is necessary in order to make use of similarities between strings that are far apart in the input. None of the other files have such long range correlations, so they do not need a lot of memory.
(fxb.cfg model to compress base calls)
comp 3 3 0 0 7 (hh hm ph pm n)
0 cm 9 255 (2 KB)
1 cm 18 255 (1 MB)
2 cm 25 255 (128 MB)
3 icm 22 (256 MB)
4 isse 23 3 (512 MB)
5 match 26 28 (256 MB hash table, 256 MB buffer)
6 mix 8 0 6 12 255 (order 0 mix of 0..0+6-1, rate 12, mask 255)
hcomp
c++ *c=a b=c a=0 (save in rotating buffer M)
d= 1 hash *d=a
b-- d++ hash *d=a
b-- d++ hash *d=a
b-- d++ hash *d=a
b-- d++ hash *d=a
halt
post 0 end
The sequence data is stationary, so slow adapting CMs give the best compression for orders 0, 1, and 2. The count limit is set to the maximum 255 x 4. For higher orders an order 3 ICM chained to an order 4 ISSE is used to save memory. Each context maps to a 1 byte pair of counters instead of a 4 byte prediction and count. The order 5 MATCH uses a 256 MB history buffer, but ideally it would be as large as the input, with a similar sized hash table index. The MIX learning rate is low (12), which is appropriate for stationary data.
The context computation code uses M as a short rotating history buffer for computing contexts and C as a pointer to its front. The HASH instruction computes A = (A + *B + 512) * 773. The hash is accumulated in A for orders 1 through 5 and stored in H through the pointer D. No context is computed for the MIX, but there is still the implied order 0 context consisting of the previously coded bits of the current byte, as with all other contexts.
Recall that high quality scores are packed 2 or 3 per byte or in runs of the maximum value, 38. The coding was designed so that higher codes correspond to higher values, in particular, by putting the newest score in the most significant bits of the code. Also, a code of 0 is used to signal the end of the sequence to the model, even if there are no trailing quality 2 (#) scores.
Quality scores can be predicted from the last several scores independently. We also know that scores start high and tend to decrease. Therefore, scores are predicted using a mix of sparse models of the last several bytes, or the high bits of those bytes, in addition to the approximate position within the read. Given a history (... q5 q4 q3 q2 q1) at position C, three context hashes are computed and modeled with slow adapting CMs to predict the next value:
H[0], a hash of q1 and the column number C divided by 8,
H[1], a hash of q2 and q3 divided by 32, and
H[2], a hash of q4 and q5 divided by 16.
The MIX also takes a context hash to select the weights for averaging: H[3], a hash of q1 divided by 8 and the approximate position within the read (C itself when C is at most 3, otherwise C/32 + 4).
(fxq.cfg model used to compress quality scores)
comp 2 12 0 0 4
0 cm 22 128
1 cm 22 128
2 cm 22 128
3 mix 14 0 3 12 255
hcomp
c++ *c=a (store input in M pointed to by C)
a== 0 if c=0 endif (reset M at newline)
(0) d=0 b=c hash *d=a a=c a>>= 3 hashd
(1) d++ a=0 b-- hash *d=a b-- a=*b a>>= 5 hashd
(2) d++ *d=0 b-- a=*b hashd b-- a=*b a>>= 4 hashd
(3) d++ a=*c a>>= 3 *d=0 hashd a=c a> 3 if a>>= 5 a+= 4 endif hashd
halt
post 0 end
Recall that alignments are described by either a 0 byte (no match) or a list of 4 mismatches in ascending order followed by a 4 byte pointer. Mismatches can be predicted by their position in the list and the approximate previous value. Pointers tend to have a uniform distribution. Thus, the low bytes at the end are essentially random. The direction bit (high bit of the first byte) is also random.
Alignments are modeled using a single, slow adapting CM. The context is the parse state (0..7) in all cases, and the high 6 bits of the previous byte when modeling the mismatches or the high byte of the pointer. In the code below, the context is not a hash. Rather it is a 20 bit context where the low 9 bits are left empty to allow the bitwise context to be inserted during modeling. (ZPAQ expands the bit-wise context to 9 bits as an optimization to reduce cache misses during table lookup).
(fxa.cfg to model reference matches)
comp 0 0 0 0 1
0 cm 20 255
hcomp
c++ b=a a== 0 if a=c a== 1 if c=0 endif endif
a=c a> 7 if c=0 endif (c=parse state, b=last byte)
a< 6 if
a=b a>>= 2 a<<= 5 a+=c
else
a=c
endif
a<<= 9 *d=a
halt
post 0 end
Two public test files were provided by the contest: a small file of about 80 MB and a large file, SRR062634_1.filt.fastq, of 6,345,444,769 bytes (about 6.3 GB).
For testing and development, it was convenient to construct a smaller test file consisting of the first million lines (250,000 reads) of the larger public test file. It had a size of 64,905,006 bytes, about 1% of the larger file. This file was first compressed using 58 combinations of existing Windows-based programs and options. The following table shows the resulting compressed sizes, listed from smallest to largest. Most programs are listed at least once with options set for maximum compression.
12,531,989 paq8px_v69 -5
13,227,449 lpaq9m -6
13,355,233 zpaq -m4
13,433,071 paq9a
13,458,075 nanozip -cc
13,462,802 ppmonstr
13,488,462 ash
13,525,283 nanozip
13,598,145 lpaq1 -6
13,648,640 bbb cfm100
13,902,686 zpaq -m3
13,982,636 bcm
14,012,471 bsc -b
14,012,471 bsc
14,075,209 ctw -d6 -n16M -f16M
14,143,994 ppmd -m256 -o8
14,231,517 zpaq -m2
14,322,919 bsc -b -m6
14,428,231 hook 256
14,498,323 ppmd -m256 -o4
14,508,672 ppmd -m256 -o16 -r1
14,533,245 ppmd
14,549,668 bsc -m3 -p
14,701,803 ppmd -m256 -o16
15,378,386 acb
15,625,687 zpaq -m1
15,964,391 ppmd -m256 -o2
16,219,269 7zip -mx=9
16,572,057 bzip2
16,611,205 comprox_ba
16,818,466 7zip
17,444,725 cabarc -m lzx21
17,790,769 sr2
18,943,331 kzip
19,408,403 compress
19,673,907 tornado -12
19,941,617 comprox_sa
20,089,975 zip -9
20,537,505 slug
20,636,394 gzip
20,636,617 zip
21,013,468 fpaq0f2
21,331,870 tornado -5
21,994,357 etincelle
24,066,560 lzop -9
26,686,218 lzrw5
26,890,950 lzrw3-a
30,159,732 fpaq0p
32,331,188 lzop
32,555,565 stz
32,621,820 fpaq0
32,638,967 lzrw3
33,484,574 lzrw2
33,767,596 tornado -1
34,579,330 lzrw1-a
34,967,551 lzrw1
43,118,827 ppp
43,387,347 flzp
64,905,006 uncompressed
Table 1. Compressed sizes of the first 250,000 reads of the large file with general purpose compressors.
The low rank of popular compressors like zip, gzip, bzip2, and 7zip should be noted. The top ranked program, paq8px_v69, required 2.26 hours to compress this small file to 12,531,989 bytes on a dual core 2.0 GHz T3200 processor with 3 GB memory under 32 bit Windows Vista. It would take over 9 days on the large file at this rate. Decompression time would be about the same.
Table 2 shows the compressed sizes of FASTQZ in fast and slow modes, with and without alignment to human genome hg19. Compression and decompression times, respectively, are shown in real seconds for the hardware described above. Modeling of the 3 or 4 files is done in parallel in separate threads.
Fast, unaligned
6,614,113 base sequences
2,601,585 headers
9,690,392 quality
18,906,090 total
Time 1.1s, 1.3s

Fast, aligned to hg19
1,464,469 alignments
460,013 base sequences
2,601,585 headers
9,690,392 quality
14,216,459 total
Time 28.5s, 9.5s

Slow, unaligned
5,565,121 base sequences
513,802 headers
6,452,417 quality
12,531,340 total
Time 19.4s, 18.5s

Slow, aligned to hg19
1,100,799 alignments
408,504 base sequences
513,802 headers
6,452,417 quality
8,475,522 total
Time 40.6s, 25.5s
Table 2. Compression results with FASTQZ on first 250,000 reads.
In slow mode without alignment, FASTQZ compresses slightly better than paq8px_v69 and 400 times faster. With alignment, FASTQZ compresses one third smaller.
In fast mode without alignment, FASTQZ compresses about 10% smaller than gzip at the default compression level. By comparison, gzip takes 11.8 seconds to compress and 1.0 seconds to decompress. FASTQZ compresses 10 times faster than gzip and decompresses only slightly slower.
This is not the best we can do. In some cases, using an existing general purpose compressor on the encoded file may do better than the models built into FASTQZ. Table 3 shows the result of testing each of the encoded files on the same 58 compressor combinations. Only the top 10 are shown, plus the popular gzip, 7zip, and bzip2. There is some room for improvement in modeling aligned base sequences and headers, but these files are small compared to the others.
Unaligned bases
5,548,287 paq8px_v69
5,565,121 FASTQZ
5,574,283 ppmonstr
5,580,899 lpaq1 -6
5,583,970 nanozip -cc
5,592,897 zpaq -m4
5,609,694 paq9a
5,622,555 bsc -p
5,622,555 bsc
5,623,486 bsc -p -m6
...
5,725,900 7zip
6,000,220 bzip2
6,006,138 gzip

Aligned bases
397,094 paq8px_v69
403,886 nanozip -cc
404,850 ppmonstr
405,341 zpaq -m4
406,806 lpaq1 -6
406,964 paq9a
408,504 FASTQZ
412,016 ppmd -m256 -o8
412,139 ppmd -m256 -o16 -r1
412,139 ppmd -m256 -o16
...
423,708 7zip
437,482 bzip2
458,088 gzip

Alignments
1,100,799 FASTQZ
1,112,050 paq8px_v69
1,161,514 nanozip -cc
1,161,644 zpaq -m4
1,166,600 ppmonstr
1,168,013 paq9a
1,190,379 bbb -cfm100
1,192,097 lpaq9m -6
1,194,857 bsc -p -m6
1,194,940 bsc -p
...
1,279,842 7zip
1,293,286 bzip2
1,366,794 gzip

Headers
476,819 paq8px_v69
497,647 lpaq9m -6
498,292 zpaq -m4
500,170 nanozip -cc
504,386 lpaq1 -6
508,414 ctw -d6 -n16M -f16M
509,481 paq9a
510,339 zpaq -m3
513,802 FASTQZ
523,348 ash
...
585,294 bzip2
633,476 7zip
740,753 gzip

Quality Scores
6,452,417 FASTQZ
6,469,353 paq8px_v69
6,531,769 zpaq -m4
6,539,805 bbb -cfm100
6,543,694 bcm
6,560,147 paq9a
6,567,513 ctw -d6 -n16M -f16M
6,571,165 nanozip -cc
6,573,128 lpaq1 -6
6,605,849 lpaq9m -6
...
7,268,779 7zip
7,344,637 bzip2
8,039,195 gzip
Table 3. Top 10 compression results plus 7zip, bzip2, and gzip for encoded 250,000 reads.
The following results were obtained on the complete file (6,345,444,769 bytes). Fast mode decompression and unaligned compression are limited by disk I/O speed. CPU process times are shown in parentheses.
Fast, unaligned
639,049,273 base sequences
251,697,610 headers
867,178,255 quality
1,757,925,138 total
Time 346s (117s CPU), 371s (111s CPU)

Fast, aligned
183,313,663 alignments
49,174,693 base sequences
251,697,610 headers
867,178,255 quality
1,351,364,221 total
Time 1348s, 620s (150s CPU)

Slow, unaligned
503,239,070 base sequences
47,861,283 headers
574,112,937 quality
1,125,213,290 total
Time 2357s, 2494s

Slow, aligned (submitted as ID 99)
105,063,319 alignments
30,852,888 base sequences
47,861,283 headers
574,112,937 quality
757,890,427 total
Time 2231s, 1552s
Table 4. FASTQZ v1.5 compression results for SRR062634_1.filt.fastq
The slow, aligned mode was submitted to the compression contest. The compression ratio was 0.1194 on this public file and 0.1166 on the withheld contest data. Run time was reported as 1218 seconds to compress and 769 seconds to decompress. Memory usage was reported as 5398224, apparently in units of 256 bytes. FASTQZ uses about 1.5 GB of memory.
The top entry in the contest by compression ratio was a submission by James Bonfield with compression ratio of 0.1141, compression time 3280 seconds, decompression time 325 seconds, and memory 13235808 (about 3.5 GB). His program uses Bowtie for alignment.
We note that most of the compressed space is for quality scores. The low bits tend to be random, and therefore not compressible. FASTQZ has an option to quantize the scores to make them more compressible. It takes a parameter q, and rounds all quality scores of 2 or higher to 2 plus a multiple of q. (It needs to preserve lower scores to distinguish N from called bases). For example, a parameter of 5 would round any score in the range 32 through 36 down to 32. Table 5 shows the effect of quantization on an older version of the program (v1.4) that used a less efficient quality score encoding.
585,117,655 q = 1 (lossless)
459,087,639 q = 2
279,217,591 q = 5
120,458,000 q = 10
72,175 q = 64 (submitted as ID 96)
Table 5. Compressed sizes of the quality scores for FASTQZ v1.4 with quantization q.
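The quantization rule described above can be sketched as follows (an illustration, not the FASTQZ source): scores of 2 or higher are rounded down to 2 plus a multiple of q, and scores below 2 are preserved so that N bases can still be recognized.

int quantize(int score, int q) {
  if (score >= 2) return 2 + (score - 2) / q * q;  // round down to 2 plus a multiple of q
  return score;                                    // preserve 0 and 1 (N bases)
}

// With q = 5, scores 32 through 36 all map to 32, as in the example above.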
Version 1.4 encoded quality scores as runs of 1 to 6 for all values. Specifically, a run of n repeats of byte x was coded as (x-33)*n. The encoded scores were compressed with a single CM rather than a mix of 3 CMs. The newer version compresses 2% smaller. Nevertheless, quantization has a much larger effect. Contest submission ID 96 is this program with q = 64, essentially discarding all of the scores. It had the top compression ratio among lossy programs, 0.0290 on the public file and 0.0287 officially.
The contest required that all submissions be open source and posted to Sourceforge at the time of submission. Although the rules might seem to hand an advantage to one's competitors, I found the opposite to be true. During the contest there was a great deal of discussion, especially between James Bonfield and me, including posting of ideas, analysis, test results, algorithms, and code to encode.ru, a message board for data compression developers. I believe that this discussion worked to our mutual benefit, even when the deadline drew near and we exchanged the lead four times in the last few days. I compared it to a bicycle race in which competitors break away and work together to gain an advantage over the peloton.
Much of the early work focused on analyzing the data. For example, splitting the file into headers, bases, and quality scores and trying various general purpose compressors or looking at n-gram statistics. It included an experiment in which sequences were reversed in an attempt (which failed) to make the data more compressible.
My original approach was to write a pure ZPAQ model. I wrote ZPAQ beginning in 2009, based on years of research on the open source PAQ project, so I was familiar with it. However, I realized that compression would be faster if the three data types were modeled separately. I had hoped to use ZPAQ’s post-processing capability to do this, but it can’t merge more than one file.
I have to credit James Bonfield for discovering that the data was human whole-genome DNA, for the idea of aligning it to a reference sequence, and for pointing out when I was using an older version of the reference. I wrote my own aligner because I wanted something simple and fast. (Using alignment actually makes the program faster because there is less data to model). In retrospect, my program probably loses compression by not handling insertions and deletions, and by not modeling correlations between quality scores and the locations of mismatches in the alignment data.