CRAM-ming it in

By: Ali Cranage, science writer at the Wellcome Sanger Institute

An impending flood of genomic data threatens to overwhelm scientists’ ability to store and share it. Next-generation compression seeks to CRAM it all in.

DNA – nature’s data compression format for life

DNA itself doesn’t take up much space: a whole copy of our genome, some 6.4 billion letters of DNA, is tightly packaged in almost every one of our 37 trillion cells. But when the DNA in our genome is sequenced, the information takes up about 200 GB of digital space. The data from only a few human genomes would fill a standard laptop. The Global Alliance for Genomics and Health (GA4GH) estimates that by 2025, 60 million human genomes will have been sequenced; genomics is fast becoming the biggest source of digital data on the planet.

Squeezing Sequence Data

To store and manage these vast amounts of data effectively, scientists have created specific digital file formats. SAM and BAM were the main file formats for storing genome data in the early 2000s. They use general-purpose compression methods, like ZIP. CRAM is the next generation, a compressed version of BAM. It takes advantage of the unique properties of genomic data to reduce the size needed to store a human genome by 50 per cent, bringing huge savings.
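To get a feel for the scale, the figures quoted in this article can be multiplied out. The short Python sketch below is a back-of-the-envelope calculation only: it assumes the 200 GB figure is the size of a genome before CRAM is applied (the article does not say which format that figure refers to), and the function name and decimal unit conversion are choices made for this illustration.

```python
# Figures quoted in the article; everything else here is illustrative.
GENOME_SIZE_GB = 200          # approximate size of one sequenced human genome
GENOMES_BY_2025 = 60_000_000  # GA4GH estimate
CRAM_SAVING = 0.50            # quoted reduction for a human genome

GB_PER_EB = 1_000_000_000     # decimal units: 1 exabyte = 10^9 gigabytes


def total_storage_eb(genomes: int, size_gb: float, saving: float = 0.0) -> float:
    """Return total storage in exabytes after applying a fractional saving."""
    return genomes * size_gb * (1.0 - saving) / GB_PER_EB


# Assumption: the 200 GB figure is the pre-CRAM baseline.
baseline = total_storage_eb(GENOMES_BY_2025, GENOME_SIZE_GB)
with_cram = total_storage_eb(GENOMES_BY_2025, GENOME_SIZE_GB, CRAM_SAVING)

print(f"Baseline:  {baseline:.0f} EB")
print(f"With CRAM: {with_cram:.0f} EB")
```

Under these assumptions the total halves from 12 exabytes to 6, which is where the "huge savings" come from.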

SAM, BAM and CRAM explained

SAM (Sequence Alignment/Map): Text file of information about the DNA in a genome that can be read by humans.
BAM (Binary Alignment/Map): Binary file that can be read by computers. BAM files are the compressed version of a SAM file.
CRAM (Compressed columnar file format): Column-based format that stores the differences between the stored sequence and a reference genome. CRAM files can be 30-60% smaller than their equivalent BAM files.

Based on: https://en.wikipedia.org/wiki/SAMtools
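The idea of storing only the differences from a reference genome can be illustrated with a toy example. The Python sketch below is not the CRAM algorithm: it handles single-letter substitutions only and ignores insertions, deletions, quality scores and everything else a real file has to carry. The sequences and the function names `encode_read` and `decode_read` are made up for this illustration.

```python
from typing import List, Tuple

# A difference is (offset within the read, base observed in the read).
Diff = Tuple[int, str]


def encode_read(reference: str, position: int, read: str) -> Tuple[int, int, List[Diff]]:
    """Describe a read as its position, length and mismatches against the reference."""
    window = reference[position:position + len(read)]
    diffs = [
        (i, base)
        for i, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return position, len(read), diffs


def decode_read(reference: str, position: int, length: int, diffs: List[Diff]) -> str:
    """Rebuild the original read from the reference plus the stored differences."""
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGCT"  # toy reference, invented for this example
read = "ACGTTAGACGAT"               # toy read aligned at position 4, one mismatch

encoded = encode_read(reference, 4, read)
print(encoded)  # (4, 12, [(7, 'A')])

# The read can be recovered exactly from the reference and the encoding.
assert decode_read(reference, *encoded) == read
```

Here a 12-letter read is reduced to a position, a length and a single difference. Because most of a sequenced human genome matches the reference, most reads need very few differences recorded.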

“There’s a good reason we don’t use ZIP for images — we use custom algorithms written for that task, such as JPEG or PNG,” said James Bonfield, principal software developer at the Wellcome Sanger Institute.

“This is what CRAM is doing. CRAM is a custom algorithm written to compress the BAM data to a much smaller size.”

CRAM is based on algorithms originally created by James and other researchers at the Sanger Institute and the European Bioinformatics Institute (EMBL-EBI) as part of a ‘sequence squeeze’ competition in 2011.

CRAM has now been adopted as the industry standard for genomic data compression, storage and transfer.

Global commitment to open-source

The global genomics community made the commitment to make data open and freely available at the very beginning, when the first human genome was sequenced. To make certain this continues, the tools and software to store and analyse genome data must also be available to all.

The CRAM file format is maintained by GA4GH, the international standards body for genomic data. CRAM is completely free and open-source, and has been constantly refined by the genomics community that uses it. It is interoperable with the main genomic libraries and most software, meaning scientists can get the most value from genomic datasets.

Find out more

The GA4GH are encouraging researchers worldwide to use CRAM, as part of a suite of standards. Standards allow data and algorithms to be shared between institutions, enabling collaborations and the healthcare innovations that genomics can bring.

Read more about CRAM on the GA4GH site at https://www.ga4gh.org/cram/ and https://www.ga4gh.org/news/cram-compression-for-genomics/ or join the Twitter chat on Friday 5 April 2019 at 1:30pm GMT: https://twitter.com/GA4GH