CRAM benchmarking

Benchmarks are using the faster CRAM codecs; primarily deflate and rANS. For comparison we also include “Io_lib”’s Scramble tool for bzip2 and lzma CRAM and the Deez tool on one data set.

Coordinate sorted human data

This test set is chr1 of NA12878_S1, downloaded from http://www.ebi.ac.uk/ena/data/view/ERP002490

Conversion from BAM to is achieved via htslib test_view -b or -C. Decoding times are computed using test_view -B (benchmarking mode) with performs input, uncompression and decoding only. Encoding time includes decoding of the input BAM, so subtract the BAM decoding time to get encode-only, although the difference is only minor.

All times are reported as wall-clock, although typically these algorithms are CPU bound so the cpu time is comparable.

Format	Size	Encoding(s)	Decoding(s)
BAM	8164228924	1216	111
CRAM v2	5716980996	1247	182
CRAM v3	4922879082	574	190

E.Coli: sorted, unsorted, with and without references

To compare CRAM efficiency in a variety of circumstances we chose a smaller dataset to more completely explore the parameter space. MiSeq_Ecoli_DH10B_110721_PF.bam is the smallest example data taken from the Deez paper, so we also include Deez here for comparison.

Position sorted

Format	Size	Encoding(s)	Decoding(s)	Notes
SAM	5579036306	46	-
BAM	1412001095	209	17.7
CRAM v2	1053744556	183	26.9
CRAM v3	869500447	75	30.8
CRAM v3+bz2	850165878	124	45.4	Via Scramble -j
Deez	870040062	208	165.0	Via deez

Extra decoding time for CRAM v3 is largely explained by the additional CRC checksums.

The effect of varying compression levels:

Format	Level	Size	Encoding(s)
BAM	9	1399448787	403
BAM	(default)	1412001095	209
BAM	1	1616365585	88
BAM	u	4310414959	45
CRAMv3	9	862152172	193
CRAMv3	(default)	869500447	75
CRAMv3	3	881163788	68
CRAMv3	1	886716515	69
CRAMv3	u	2631665078	45

Compression level “u” is uncompressed. Note there is almost no difference in speed between CRAM level 1 and the default level. Maybe we need to make -1 more aggressively fast at the expense of ratio.

Also note that BAM -1 is slower to encode than CRAMv3 at default levels, although it will be faster to decode. That makes me wonder about how we should deal with temporary files. (Ideally with neither BAM nor CRAM compression, but LZ4 or Snappy.)

Name sorted

Format	Size	Encoding(s)
BAM	2005274489	284
CRAMv2	1046736335	206
CRAMv3	848319136	99

Embedding & Reference-less encoding

Embedded reference - no external file dependencies:

Format	Size	Encoding(s)
CRAM v2	1055144162	187
CRAM v3	870779925	78

Note that this is almost the same as the default mode of using an external reference as the sequence depth is high.

Non-reference encoding - all sequence bases are stored verbatim:

Format	Size	Encoding(s)
CRAM v2	1140863900	387
CRAM v3	941857968	100

The significant speed difference between version 2.1 and 3.0 is due to improved ways of storing multi-base differences instead of requiring one CRAM feature for each base call.

Unmapped data

Human gut sample SAMEA728920 from http://www.ebi.ac.uk/ena/data/view/ERA000116 This is unmapped data, converted from FASTQ to SAM via biobambam.

Format	Size	Encoding(s)	Decoding(s)	Notes
SAM	1443985040	15	-	I/O bound
FASTQ.gz	491867991	126	12.7
BAM	428540917	64	5.5
CRAM v2	335644015	36	9.8
CRAM v3	308745955	24	9.3
CRAM v3+bz2	289888505	32	-	Via Scramble -j
CRAM v3+lzma	282989638	105	-	Via Scramble -Z
CRAM v3 MAX	281666960	166	-	Via Scramble -9 -jZ (bzip2, lzma)

Scramble was used to test bzip2, lzma and both combined along with compression level 9 for maximum shrinkage, although since v1.4 Htslib also supports these methods.