Today (1 November 2018) the Earth BioGenome Project – a mission to sequence the genomes of all life on earth – was launched to the world’s media.

Unimaginable secrets are hidden in the genomes of the known, and unknown, species on our planet. The Sanger Institute is taking a leading role in this historic undertaking, as we plan to sequence the genomes of all 66,000 eukaryotic species in the UK.

Associate Director of the Sanger Institute, Dr Julia Wilson, talks about the ambitions, scale and challenges of this remarkable endeavour.

What is the Earth BioGenome Project?

The Earth BioGenome Project is a global collaboration that aims to sequence the genomes of all eukaryote species on earth in the next 10 years. The ambition is vast. The project will transform science. We can only begin to imagine the benefits for advancing research into conservation, evolution, agriculture, biology and medicine.

Why sequence all life on Earth?

How life is divided up
Bacteria – such as MRSA and E. coli – are relatively simple, single-celled lifeforms that have no membrane-bound nucleus
Archaea – equally simple single-celled lifeforms that are seen as the oldest organisms on earth, and tend to be found in extreme environments
Eukaryota – everything else! These organisms have a nucleus with a membrane, and include animals, birds, fish, fungi and insects

All cellular life descended from a common ancestor, and genome sequences are the products of billions of years of evolution. Knowing the DNA sequences of all species will provide fundamental, transformative insights into biology.

There are an estimated 10–15 million eukaryotic species, and trillions of bacterial and archaeal species, on Earth. But only a fraction of those – about 2.3 million – are actually known. We are only just beginning to understand the full splendour of life.

So far about 15,000 species, mostly microbes, have completely or partially sequenced genomes. From this, a wealth of knowledge has emerged, enabling enormous advances in agriculture, medicine, and biology-based industries, as well as enhanced approaches for conservation.

Yet the world’s biodiversity remains largely uncharacterized. And the Earth has entered a period of unprecedented change. A new epoch – the Anthropocene – has been defined by human impact on the Earth’s geology and ecosystems. Human activity is threatening biodiversity through climate change, habitat destruction and species exploitation.

How life is divided up - the three classes of life explained

We have a responsibility to care for our increasingly compromised planet. The project will produce a complete inventory of all life on Earth and its DNA sequences, transforming our ability to monitor life as part of global conservation efforts.

It is essentially a mission to acquire knowledge of the natural world. That knowledge will form a foundation for future biotechnology.

Why now?

The family tree of life on earth. Ancestral tree courtesy of the Earth BioGenome Project

For the first time in history it is possible to efficiently sequence the genomes of all known species. In particular, recent advances in DNA sequencing technology and the arrival of long sequence reads mean that the project is now feasible.

A number of projects to sequence species for the first time are ongoing around the world. These include initiatives to sequence all birds, insects or bats. They are invaluable, but understandably fragmented and often shaped by funding limitations. Now is the time to bring everyone together, to co-ordinate DNA sequencing efforts. Joining these projects together will ensure consistency and deliver the best possible resource for future research.

How will this help with research into evolution, conservation, biodiversity and health?

The EBP has a broad set of scientific aims and outcomes. The first is to revise and reinvigorate our understanding of biology, ecosystems, and evolution. This includes understanding the evolutionary relationships between all life on Earth, discovering new species, and uncovering fundamental laws that describe and drive evolution.

The second is to enable the conservation, protection, and regeneration of biodiversity. This includes clarifying how climate change and human activity are affecting biodiversity.

Finally, the goal is to explore the potential benefits for society and human wellbeing. This encompasses discovering new medicines, enhancing control of pandemics, identifying new ways to improve agriculture, and discovering new biomaterials, energy sources and biochemicals.

What role will the Sanger Institute play?

Organisations working together to read the genomes of UK fish, birds, animals, insects and plants

The Darwin Tree of Life Project will be an inclusive consortium of UK scientists and organisations. Key organisations are: the Sanger Institute; the Natural History Museum; Royal Botanic Gardens, Kew; EMBL-EBI; Earlham Institute; and Edinburgh Genomics. Other institutes and organisations are expected to join. Together we will work to sequence all eukaryotic species in the UK, estimated at around 66,000 species.

We will also work with other countries to develop the global strategy for the EBP, and help to ensure that the benefits are shared.

What are the main challenges you can foresee?

Sample collection is a big challenge. We may need to develop new machines or drones that can travel to hard-to-reach areas, for example sea beds. It’s possible they could be developed to extract DNA and store samples too.

The Wellcome Sanger Institute has the largest biosciences data centre in Europe, capable of storing and processing genomes of all sizes and complexities

Computing will also be a challenge. Requirements for data storage and processing are large – but tractable. In terms of computing power needed, mammalian-sized long-read genome assemblies currently require about 100 processor-weeks. The later phases of the EBP will require about 10,000 simultaneous assemblies running in parallel – a scale already approached by academic supercomputing centres.
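To give a feel for those numbers, here is a rough back-of-the-envelope sketch. The two constants come from the figures quoted above; the wall-clock deadline and the assumption that work divides perfectly across processors are illustrative assumptions, not part of the EBP plan.

```python
import math

# Figures quoted in the text above.
PROCESSOR_WEEKS_PER_ASSEMBLY = 100   # mammalian-sized long-read assembly
SIMULTANEOUS_ASSEMBLIES = 10_000     # later phases of the EBP


def total_processor_weeks(assemblies, weeks_each=PROCESSOR_WEEKS_PER_ASSEMBLY):
    """Total compute for a batch of assemblies, in processor-weeks."""
    return assemblies * weeks_each


def processors_needed(assemblies, wall_clock_weeks,
                      weeks_each=PROCESSOR_WEEKS_PER_ASSEMBLY):
    """Processors needed to finish the batch within `wall_clock_weeks`.

    Assumes the work splits perfectly across processors, which real
    assemblers do not achieve, so treat the result as a lower bound.
    """
    return math.ceil(total_processor_weeks(assemblies, weeks_each) / wall_clock_weeks)


if __name__ == "__main__":
    batch = total_processor_weeks(SIMULTANEOUS_ASSEMBLIES)
    print(f"One batch: {batch:,} processor-weeks")
    # Hypothetical deadline of 4 weeks per batch (an assumption for illustration).
    print(f"Processors for a 4-week turnaround: "
          f"{processors_needed(SIMULTANEOUS_ASSEMBLIES, 4):,}")
```

Changing the hypothetical deadline shows how quickly the processor count grows as the turnaround time shrinks.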

Current tools are already capable of completing the project. But there is no doubt that genome assembly, alignment, and annotation algorithms will all need to be improved. It is a huge opportunity to develop new computational methods to maximize our understanding and use of the vast volumes of data that the project will produce.

How will you find all the species in the UK?

Finding, extracting and storing samples of all eukaryotic life in the UK is no easy task, and the Sanger will be working closely with the Natural History Museum, the Royal Botanic Gardens, Kew and other biobank repositories to fulfil the Darwin Tree of Life project

For the Darwin Tree of Life Project, we’ll be working with UK organisations that have existing, extensive sample collections – including Royal Botanic Gardens, Kew; the Natural History Museum; the Culture Collection of Algae and Protozoa and others.

New sample collection will be required too. We’ll establish a dedicated team and strategy to survey the UK – gathering samples with the quality of DNA as their primary consideration. In the Darwin Tree of Life Project, we’ll be sequencing all eukaryotes in the UK. We won’t be sequencing non-native species, for example those in UK zoos.

Efforts to sequence all bacteria and archaea are already underway, so the EBP won’t be sequencing those.

Where will you start?

How species fit into the order of life. An animal such as a red fox would be in the domain of Eukaryota, in the Canidae family and the species Vulpes vulpes

We have three starting points. First, we will sequence a representative of each of the 3,849 families of species in the UK, plus a selected subset of species of particular interest.

Second, we will sequence all eukaryotic organisms from one or more ecosystems (e.g. St Kilda, Priests Pot or Wytham Woods).

Third, we will sequence all organisms from one or more clades in the British Isles (a clade is a group of organisms that consists of a common ancestor and all its descendants, e.g. vertebrates).
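As a small illustration of the first starting point, the sketch below picks one representative species per family from a list of records. The records, the function name and the selection rule (alphabetically first species in each family) are all hypothetical placeholders; the actual criteria the project will use to choose representatives are not described here.

```python
# Hypothetical (family, species) records for illustration only.
species_records = [
    ("Canidae", "Vulpes vulpes"),       # red fox, the example used in the text
    ("Mustelidae", "Meles meles"),      # European badger
    ("Mustelidae", "Lutra lutra"),      # Eurasian otter
    ("Sciuridae", "Sciurus vulgaris"),  # red squirrel
]


def one_representative_per_family(records):
    """Return a dict mapping each family to a single species.

    Placeholder rule: take the alphabetically first species per family.
    """
    representatives = {}
    for family, species in sorted(records):
        # setdefault keeps the first species seen for each family.
        representatives.setdefault(family, species)
    return representatives


if __name__ == "__main__":
    for family, species in one_representative_per_family(species_records).items():
        print(f"{family}: {species}")
```

In practice the choice would more likely depend on sample availability and DNA quality than on alphabetical order.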

How much is it going to cost?

The current estimate, for the whole EBP, is that sequencing all eukaryotic species will cost about $4.7 billion. This cost covers sample collection, sequencing machines, data storage, analysis, visualization and dissemination, and project management. Incredibly, this cost is similar to the cost of sequencing the first human genome, which in today’s money was about $5 billion.

The Darwin Tree of Life project is estimated to cost approximately £100 million over the first five years.

Will the sequences be made public?

Yes. UK species data will be publicly released and freely available via a dedicated website. EMBL-EBI will aggregate, curate and distribute assembly and gene sequences to the scientific community via a range of services and tools including Ensembl.

The data from the whole EBP will become a permanent foundation for future scientific discovery. The project will work within international legislation to ensure that all countries can benefit from their involvement. The EBP aims to provide fair and equitable access to genome sequence data and the benefits it will bring.

Links:

News story: Genetic code of 66,000 UK species to be sequenced

News story: Launch of global effort to read genetic code of all complex life on earth

Posted by sangerinstitute

From the Wellcome Sanger Institute, a charitably funded genomic research organisation