Benchmarks use the faster CRAM codecs, primarily deflate and rANS. For comparison we also include Io_lib's Scramble tool for bzip2 and lzma CRAM, and the Deez tool on one data set.
This test set is chr1 of NA12878_S1, downloaded from http://www.ebi.ac.uk/ena/data/view/ERP002490
Conversion from BAM to
All times are reported as wall-clock. These algorithms are typically CPU bound, so the CPU time is comparable.
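As a rough illustration of the wall-clock versus CPU time distinction, the sketch below times a child process both ways using only the Python standard library. The helper name `time_command` and the placeholder command are assumptions for illustration; this is not the harness used to produce the tables on this page. On Windows the child CPU fields of `os.times()` are reported as zero, so the CPU figure is only meaningful on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (wall_seconds, cpu_seconds) for the child process."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user + children_system cover CPU used by waited-for children.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


if __name__ == "__main__":
    # Placeholder command; substitute the real conversion being timed.
    wall, cpu = time_command([sys.executable, "-c", "sum(range(10**6))"])
    print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```

For a CPU-bound, single-threaded command the two numbers should be close; a large gap suggests the run was I/O bound or multi-threaded.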
Format | Size (bytes) | Encoding (s) | Decoding (s) |
---|---|---|---|
BAM | 8164228924 | 1216 | 111 |
CRAM v2 | 5716980996 | 1247 | 182 |
CRAM v3 | 4922879082 | 574 | 190 |
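To make the table easier to read at a glance, the following sketch expresses each size and encode time as a fraction of the BAM figures. The numbers are copied from the table above; the function name `relative_to_bam` is hypothetical.

```python
# Sizes in bytes and encode times in seconds, copied from the table above.
results = {
    "BAM": (8164228924, 1216),
    "CRAM v2": (5716980996, 1247),
    "CRAM v3": (4922879082, 574),
}


def relative_to_bam(results):
    """Return {format: (size_ratio, encode_time_ratio)} relative to BAM."""
    bam_size, bam_time = results["BAM"]
    return {
        name: (size / bam_size, enc / bam_time)
        for name, (size, enc) in results.items()
    }


for name, (size_ratio, time_ratio) in relative_to_bam(results).items():
    print(f"{name:8s} size={size_ratio:.1%} of BAM, encode={time_ratio:.1%} of BAM")
```

On these figures CRAM v3 is roughly 60% of the BAM size and encodes in under half the time, while CRAM v2 is roughly 70% of the BAM size at a similar encode time.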
To compare CRAM efficiency across a range of settings we chose a smaller data set, which lets us explore the parameter space more completely. MiSeq_Ecoli_DH10B_110721_PF.bam is the smallest example data set from the Deez paper, so we also include Deez here for comparison.
Format | Size (bytes) | Encoding (s) | Decoding (s) | Notes |
---|---|---|---|---|
SAM | 5579036306 | 46 | - | |
BAM | 1412001095 | 209 | 17.7 | |
CRAM v2 | 1053744556 | 183 | 26.9 | |
CRAM v3 | 869500447 | 75 | 30.8 | |
CRAM v3+bz2 | 850165878 | 124 | 45.4 | Via Scramble -j |
Deez | 870040062 | 208 | 165.0 | Via deez |
Extra decoding time for CRAM v3 is largely explained by the additional CRC checksums.
The effect of varying compression levels:
Format | Level | Size (bytes) | Encoding (s) |
---|---|---|---|
BAM | 9 | 1399448787 | 403 |
BAM | (default) | 1412001095 | 209 |
BAM | 1 | 1616365585 | 88 |
BAM | u | 4310414959 | 45 |
CRAMv3 | 9 | 862152172 | 193 |
CRAMv3 | (default) | 869500447 | 75 |
CRAMv3 | 3 | 881163788 | 68 |
CRAMv3 | 1 | 886716515 | 69 |
CRAMv3 | u | 2631665078 | 45 |
Compression level “u” is uncompressed. Note there is almost no difference in speed between CRAM level 1 and the default level. Maybe we need to make -1 more aggressively fast at the expense of ratio.
Also note that BAM -1 is slower to encode than CRAMv3 at default levels, although it will be faster to decode. That makes me wonder about how we should deal with temporary files. (Ideally with neither BAM nor CRAM compression, but LZ4 or Snappy.)
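The size versus speed trade-off between levels can be explored on a small scale with Python's standard `zlib` module, which implements deflate. This is only a sketch under stated assumptions: the input is synthetic skewed ACGT text rather than real alignment data, the helper names `make_sample` and `measure_levels` are hypothetical, and zlib level 0 is only loosely analogous to the "u" rows above. It says nothing about rANS or about the absolute numbers in the tables.

```python
import random
import time
import zlib


def make_sample(n_bytes=500_000, seed=0):
    """Synthetic low-entropy text standing in for sequence data (not real reads)."""
    rng = random.Random(seed)
    letters = rng.choices("ACGT", weights=[4, 3, 2, 1], k=n_bytes)
    return "".join(letters).encode("ascii")


def measure_levels(data, levels=(0, 1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for zlib at each level."""
    out = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must be lossless.
        assert zlib.decompress(packed) == data
        out[level] = (len(packed), elapsed)
    return out


if __name__ == "__main__":
    sample = make_sample()
    for level, (size, secs) in measure_levels(sample).items():
        print(f"level {level}: {size} bytes in {secs:.3f}s")
```

Level 0 stores the data without compressing it, so its output is slightly larger than the input; the higher levels trade encode time for size, and how much they gain depends heavily on the data.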
Format | Size (bytes) | Encoding (s) |
---|---|---|
BAM | 2005274489 | 284 |
CRAMv2 | 1046736335 | 206 |
CRAMv3 | 848319136 | 99 |
Embedded reference - no external file dependencies:
Format | Size (bytes) | Encoding (s) |
---|---|---|
CRAM v2 | 1055144162 | 187 |
CRAM v3 | 870779925 | 78 |
Note that this is almost the same as the default mode of using an external reference, as the sequence depth is high.
Non-reference encoding - all sequence bases are stored verbatim:
Format | Size (bytes) | Encoding (s) |
---|---|---|
CRAM v2 | 1140863900 | 387 |
CRAM v3 | 941857968 | 100 |
The significant speed difference between version 2.1 and 3.0 is due to improved ways of storing multi-base differences instead of requiring one CRAM feature for each base call.
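A toy model of why one feature per base call is costly compared with a single multi-base feature. This is not the CRAM encoder: the function names `per_base_features` and `stretch_features` and the tuple layout are hypothetical, chosen only to show how the number of features to encode changes.

```python
def per_base_features(bases):
    """One (position, base) feature per base call."""
    return [(pos, base) for pos, base in enumerate(bases, start=1)]


def stretch_features(bases):
    """A single (position, bases) feature covering the whole run."""
    return [(1, bases)] if bases else []


# Hypothetical 100-base read stored without a reference.
read = "ACGTTGCAAC" * 10
print(len(per_base_features(read)), "features vs", len(stretch_features(read)))
```

Each feature carries its own position and type, so encoding one per base means many more values to write and read back than a single stretch of bases.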
Human gut sample SAMEA728920 from http://www.ebi.ac.uk/ena/data/view/ERA000116. This is unmapped data, converted from FASTQ to SAM via biobambam.
Format | Size (bytes) | Encoding (s) | Decoding (s) | Notes |
---|---|---|---|---|
SAM | 1443985040 | 15 | - | I/O bound |
FASTQ.gz | 491867991 | 126 | 12.7 | |
BAM | 428540917 | 64 | 5.5 | |
CRAM v2 | 335644015 | 36 | 9.8 | |
CRAM v3 | 308745955 | 24 | 9.3 | |
CRAM v3+bz2 | 289888505 | 32 | - | Via Scramble -j |
CRAM v3+lzma | 282989638 | 105 | - | Via Scramble -Z |
CRAM v3 MAX | 281666960 | 166 | - | Via Scramble -9 -jZ (bzip2, lzma) |
Scramble was used to test bzip2, lzma, and both combined with compression level 9 for maximum compression. Since v1.4, Htslib also supports these methods.
Copyright © 2018 Genome Research Limited (reg no. 2742969) is a charity registered in England with number 1021457.