By: Ali Cranage, science writer at the Wellcome Sanger Institute
An impending flood of genomic data threatens to overwhelm scientists’ abilities to store and share it. Next-generation compression seeks to CRAM it all in.
DNA – nature’s data compression format
DNA itself doesn’t take up much space: a whole copy of our genome, some 6.4 billion letters of DNA, is tightly packaged in almost every one of our 37 trillion cells. But when the DNA in our genome is sequenced, the information takes up about 200 GB of digital space. The data from only a few human genomes would fill a standard laptop. The Global Alliance for Genomics and Health (GA4GH) estimates that by 2025, 60 million human genomes will have been sequenced; genomics is fast becoming the biggest source of digital data on the planet.
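To put those figures in perspective, here is a rough back-of-envelope sketch. It reads the article's figure as 200 gigabytes per sequenced genome and uses the GA4GH estimate of 60 million genomes; the 1 TB laptop drive and the function names are assumptions made purely for illustration.

```python
# Figures quoted in the article.
GB_PER_GENOME = 200            # digital size of one sequenced human genome
GENOMES_BY_2025 = 60_000_000   # GA4GH estimate

# Assumption for illustration only: a laptop with a 1 TB (1,000 GB) drive.
LAPTOP_GB = 1_000


def genomes_per_drive(drive_gb, genome_gb=GB_PER_GENOME):
    """How many whole genomes fit on a drive of the given size."""
    return drive_gb // genome_gb


def total_storage_eb(n_genomes, genome_gb=GB_PER_GENOME):
    """Total storage in exabytes, using decimal units (1 EB = 1e9 GB)."""
    return n_genomes * genome_gb / 1e9


print(genomes_per_drive(LAPTOP_GB))        # 5
print(total_storage_eb(GENOMES_BY_2025))   # 12.0
```

Under these assumptions, five genomes fill the laptop, and 60 million genomes come to roughly 12 exabytes before any extra compression is applied.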
Squeezing Sequence Data
To store and manage these vast amounts of data effectively, scientists have created specialised digital file formats. SAM and BAM have been the main file formats for storing genome data since the late 2000s. BAM relies on general-purpose compression methods, like ZIP. CRAM is the next generation, a more tightly compressed alternative to BAM. It takes advantage of the unique properties of genomic data to reduce the size needed to store a human genome by about 50 per cent, bringing huge savings.
SAM, BAM and CRAM explained
|SAM||Sequence Alignment/Map||Text file of information about the DNA in a genome that can be read by humans|
|BAM||Binary Alignment/Map||Binary file that can be read by computers||A BAM file is the compressed version of a SAM file|
|CRAM||Compressed columnar file format||Column-based format that stores the differences between the stored sequence and a reference genome||CRAM files can be 30-60% smaller than their equivalent BAM files|
Based on: https://en.wikipedia.org/wiki/SAMtools
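The idea of storing only the differences from a reference genome can be sketched in a few lines of Python. This is a toy illustration, not the CRAM codec: it handles single-letter substitutions only, and the function names, record layout and example sequences are all hypothetical. The real format also deals with insertions, deletions, quality scores and much more.

```python
def encode_read(reference, read, pos):
    """Describe a read as (position, length, differences from the reference).

    Each difference is an (offset, base) pair marking where the read
    departs from the reference. Substitutions only, for illustration.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return pos, len(read), diffs


def decode_read(reference, record):
    """Rebuild the original read from the reference and its stored record."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Made-up sequences: the read matches the reference from position 5,
# apart from one substituted letter.
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
read = "CGTTAGACGATAG"

record = encode_read(reference, read, 5)
print(record)                          # (5, 13, [(6, 'A')])
print(decode_read(reference, record))  # CGTTAGACGATAG
```

Because most of a person's DNA matches the reference, a record like this is usually far shorter than the read itself, which is where much of the saving comes from.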
“There’s a good reason we don’t use ZIP for images — we use custom algorithms written for that task, such as JPEG or PNG,” said James Bonfield, principal software developer at the Wellcome Sanger Institute.
“This is what CRAM is doing. CRAM is a custom algorithm written to compress the BAM data to a much smaller size.”
CRAM is based on algorithms originally created by James and other researchers at the Sanger Institute and the European Bioinformatics Institute (EMBL-EBI) as part of a ‘sequence squeeze’ competition in 2011.
CRAM has now been adopted as the industry standard for genomic data compression, storage and transfer.
Global commitment to open-source
The global genomics community made the commitment to make data open and freely available at the very beginning, when the first human genome was sequenced. To make certain this continues, the tools and software to store and analyse genome data must also be available to all.
The CRAM file format is maintained by GA4GH, the international standards body for genomic data. CRAM is completely free, open-source and has been constantly refined by the genomics community that uses it. It is interoperable with the main genomic libraries and most software, meaning scientists can get the most value from genomic datasets.
Find out more
The GA4GH are encouraging researchers worldwide to use CRAM, as part of a suite of standards. Standards allow data and algorithms to be shared between institutions, enabling collaborations and the healthcare innovations that genomics can bring.
Read more about CRAM on the GA4GH site at https://www.ga4gh.org/cram/ and https://www.ga4gh.org/news/cram-compression-for-genomics/ or join the Twitter chat on Friday 5 April 2019 at 1:30pm GMT at https://twitter.com/GA4GH