Month: April 2018

Roesel's Bush Cricket: The trouble with crickets and their ever increasing genomes... Image: Richard Bartz, Wikimedia Commons
25 Genomes

The trouble with Crickets

By: Dan Mead, the 25th Anniversary Sequencing Project Coordinator
Date: 23/04/2018

The type of cricket (Roesel’s Bush Cricket, Bicolorana roeselii or Metrioptera roeselii) we decided to sequence is interesting because it has spread out of its traditional salt-marsh environment to the interior of the country. We want to know if this is because it has adapted to live in less saline conditions or if it’s been possible due to the increased salt spreading on roads making corridors for the crickets to move along (or a combination of both).

This was one of the first species we received (from Björn Beckmann & Peter Sutton of the Orthoptera & Allied Insects group), late in the summer of 2017. We got three, all from a field in Oxfordshire, and it turns out they’re not averse to a little cannibalism: one of them ate the back legs of its roommate before I could separate them (the third was in a separate container, although it was also missing a leg). Seeing as I was feeling a little mischievous, I named them Hannibal, Oscar and Heather (despite them all being male; I took some creative licence).

Getting the DNA from Oscar was one of the easier extractions: good yield and reasonable (although not the best) quality, certainly good enough for PacBio sequencing.

This is a Femto Pulse trace of the DNA fragment size. Here it’s mostly in the 20Kb+ range; ideally it’d be bigger, as a perfect trace has one giant peak at ~165Kb.


Later extractions also gave better DNA for the 10X sequencing, so things were going swimmingly. I’d estimated the genome size to be ~2Gb, based on the average cricket genome in the Animal Genome Size Database: quite large for an insect, but reasonable enough for this project.

Little did I know of the seemingly unending horror show that was about to befall us …

Initially things progressed as expected. The PacBio sequencing went well, producing >95Gb of data. Likewise for the 10X: we got 120Gb from that, so ~50X coverage or better for both (assuming a ~2Gb genome).

Things started to get a bit icky when the assembly first failed for PacBio, then for 10X. A PacBio miniasm assembly then came back with a revised genome size of 2.8Gb, bigger than expected but not too bad at this point, although the N50 was terrible (76Kb).

The next thing that happened was a kmer-based quality control report – this gave the genome size as 4.6Gb! We’re definitely into the realm of the unexpected now … this reduces our effective coverage to ~20X, waaay less than is needed for a decent assembly.

Finally (after running out of memory a few times) Supernova ran on the 10X data. This returned a gut-wrenching estimated genome size of 7.5Gb!
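To make the shrinking coverage concrete, here is a minimal sketch of the arithmetic, assuming coverage is simply the amount of sequence data divided by the genome size. The yields and size estimates are the figures quoted in this post (the PacBio yield is the ">95Gb" lower bound); the function and variable names are purely illustrative and not part of any pipeline we ran.

```python
# Back-of-the-envelope sequencing depth: coverage = data sequenced / genome size.
# Yields and genome size estimates are the figures quoted in this post;
# all names here are illustrative assumptions.

YIELD_GB = {"PacBio": 95.0, "10X": 120.0}

GENOME_SIZE_ESTIMATES_GB = {
    "database average (initial guess)": 2.0,
    "miniasm assembly": 2.8,
    "kmer-based QC": 4.6,
    "Supernova": 7.5,
}


def coverage(yield_gb, genome_size_gb):
    """Return the expected fold-coverage for a given data yield and genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return yield_gb / genome_size_gb


def coverage_table(yields, estimates):
    """Map each genome size estimate to the coverage achieved per platform."""
    return {
        source: {
            platform: round(coverage(yield_gb, size_gb), 1)
            for platform, yield_gb in yields.items()
        }
        for source, size_gb in estimates.items()
    }


if __name__ == "__main__":
    table = coverage_table(YIELD_GB, GENOME_SIZE_ESTIMATES_GB)
    for source, row in table.items():
        cells = ", ".join(f"{platform}: {depth}X" for platform, depth in row.items())
        print(f"{source}: {cells}")
```

On these numbers the PacBio data drops from roughly 48X at 2Gb to about 21X at 4.6Gb, and to under 13X if the 7.5Gb estimate is right.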

Combine this with the heterozygosity estimate of around 3.04% and everything looks a little wonky.

So what went wrong?

I’ve just been back to the genome size database and there is an outlier in the sizes: the camel cricket (Ceuthophilus stygius), a cave cricket from North America.

By James St. John – Ceuthophilus stygius (camel cricket) inside entrance to Great Onyx Cave (Flint Ridge, Mammoth Cave National Park, Kentucky, USA) 1, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=39945246


This beauty has a genome size of 9.55Gb!

So the question is: how likely is this to be the case (or close to it) for our cricket?

Taking all the crickets with known genome sizes from the database (there aren’t that many, just 7, one of which is the gloriously named ‘unwelcome mole cricket’ Neoscapteriscus borellii) and putting them into the phyloT tree generator and iTOL (Interactive Tree of Life) gives you this:


Brown curry mole crickets in a can: I don’t think you can buy these anymore

Sorry, that’s just a can of curried crickets, the tree looks like this:

Unwelcome mole crickets are unwelcome in NCBI apparently, there’s no taxon number so no tree entry.


From this it looks like our Tettigoniidae bush cricket pre-dates our large-genomed friend the camel cricket (a Gryllacrididae) and split from the ‘true’ crickets (the Gryllidae) a while back. But how far?

Then we used another online resource, TimeTree, to see when this split occurred. From the tree below you can see it was ~270MYA, which is a long time, plenty of time for some weird genome expansion to have happened, I guess.

Gryllidae and Gryllacrididae separated 100MY before Tettigoniidae diverged from Gryllacrididae (~172MYA).


You may have noticed that this tree is a little different. This is for two reasons:

  • It’s a simple expansion of the last shared taxon group, the Ensifera.
  • The Gryllacrididae and Tettigoniidae split from the Rhaphidophoridae, not the other way around.

Before you ask, no I don’t know why, but I assume the latter is correct as the first tree lacks all the taxon groups for an input.

The sole example of the Rhaphidophoridae taxon has a 1.55Gb genome and as this line goes back to the common ancestor of the Roesel’s cricket it could be that our initial estimate is true OR, more likely, there’s been some horrible expansion that involves (multiple?) genome duplication events.

The thing that’s really annoying is my own lack of knowledge and tendency to make (in this case stupid) assumptions. Who knew that Gryllacrididae and Gryllidae are actually more distantly related than Gryllacrididae and Tettigoniidae? Taxonomists probably, or someone who studied classics.

Anyway, we’re doing some more sequencing to get extra 10X data. Hopefully this will answer the question once and for all … stay tuned!

About the author:

Dan Mead is the 25th Anniversary Sequencing Project Coordinator for the 25 Genomes Project at the Wellcome Sanger Institute, Cambridge.

More on the 25 Genomes Project:

25 Genomes Project web page 

Human Cell Atlas, Sanger Science

New computational method reveals where genes are expressed

By: Valentine Svensson
Date: 06.04.18


SpatialDE automatically identifies sub-structures (middle), and links these to genes that depend on spatial location (right) in mouse olfactory bulb data from Ståhl et al. 2016.

In the body, cells are often considered the fundamental, atomic units. In a similar way to how atoms are structurally joined to form molecules, cells form tissues. The organization of these tissues lets different cell types work together, enabling organs in the body to perform their functions. These structures have been studied and catalogued for hundreds of years in the field of histology, using microscopes.

During the 20th century, molecular techniques enabled researchers to investigate how different genes and proteins are used in different parts of tissues, to understand how cell types collaborate in tissues. Large-scale projects such as the Protein Atlas or the Allen Brain Atlas have been systematically performing molecular measurements of individual genes and proteins in tissues.

In the last decade, tremendous advances in the scale and cost-effectiveness of molecular measurements have been made. This has led to the analysis of single-cell gene expression, i.e. which genes are switched on in a cell, which lets researchers define cell types from molecular data. Similarly, spatially defined measurements of gene expression can now be made on thousands of genes at single-cell resolution. Projects that would previously have taken hundreds of people and long timescales can now be done by individual labs, meaning more types of tissues in more conditions can be investigated.

The most powerful new high-throughput methods generate measurements of expression levels for tens of thousands of genes. At this scale, simply looking at every gene is not feasible. Typically these sorts of data have been analysed by looking only at a handful of known marker genes.

We have now developed a method that tells us if there is a relationship between genes expressed in cells, and where those cells are located.

Our SpatialDE method filters and sorts all the genes according to how certain we are that cell location matters for the expression levels. In the main data we analysed for our paper, out of close to 12,000 genes measured only 67 genes were filtered as “spatial”. By focusing on this shortlist of genes, researchers can quickly discover genes previously unknown to be related to tissue structure.
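As a toy illustration of the "filter and sort genes by spatial evidence" idea, the sketch below ranks genes by a much simpler stand-in statistic: Moran's I (a classic spatial autocorrelation measure) with a permutation test. This is not the statistical model SpatialDE uses, and it is not the SpatialDE API; the function names, the neighbourhood bandwidth and the significance threshold are all assumptions chosen for the example.

```python
# Toy spatial gene ranking using Moran's I and a permutation test.
# NOTE: a simplified stand-in for illustration only; SpatialDE itself uses a
# different statistical model. All names and defaults here are assumptions.
import math
import random


def morans_i(coords, values, bandwidth=1.5):
    """Moran's I with binary weights: spots within `bandwidth` are neighbours."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0  # constant expression carries no spatial signal
    num = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= bandwidth:
                num += dev[i] * dev[j]
                weight_sum += 1.0
    if weight_sum == 0:
        return 0.0  # no neighbouring spots at this bandwidth
    return (n / weight_sum) * (num / denom)


def rank_spatial_genes(coords, expression, n_perm=200, alpha=0.05, seed=0):
    """Return (gene, morans_i, p_value) for genes passing `alpha`, best first.

    The p-value is the fraction of random spot shuffles whose Moran's I is at
    least as large as the observed one (with a +1 correction).
    """
    rng = random.Random(seed)
    results = []
    for gene, values in expression.items():
        observed = morans_i(coords, values)
        shuffled = list(values)
        at_least_as_large = 0
        for _ in range(n_perm):
            rng.shuffle(shuffled)
            if morans_i(coords, shuffled) >= observed:
                at_least_as_large += 1
        p_value = (at_least_as_large + 1) / (n_perm + 1)
        results.append((gene, observed, p_value))
    # Sort by evidence: smallest p-value first, then largest Moran's I.
    results.sort(key=lambda r: (r[2], -r[1]))
    return [r for r in results if r[2] < alpha]


if __name__ == "__main__":
    # Hypothetical 6x6 grid of spots with one graded gene and one flat gene.
    spots = [(x, y) for x in range(6) for y in range(6)]
    toy_expression = {
        "graded_gene": [float(x) for x, _ in spots],
        "flat_gene": [1.0] * len(spots),
    }
    for gene, stat, p in rank_spatial_genes(spots, toy_expression):
        print(gene, round(stat, 3), round(p, 4))
```

A gene whose expression changes smoothly across the grid gets a high score and a small p-value, whereas a gene with constant expression is filtered out, which is the same shortlist-building behaviour described above, only with a far cruder test.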

Tissues are often divided into sub-structures, based on visual appearance or on the expression of particular proteins indicating a specific function of that sub-structure. The brain, for example, has different layers, as does skin; the thymus, on the other hand, consists of connected lobules with medullas inside.

The sub-structures are defined by different cell type compositions. For cells to have major functional differences, they need to express many genes together that are specific to the function, and this will be reflected at the whole-tissue level. We created a second method which uses this property to automatically define tissue sub-structures. In one go, researchers obtain the genes defining the regions, as well as labels for the regions themselves.

This allows researchers to zoom into the structures of the tissue. The markers allow design of downstream functional experiments to investigate which genes cause the structure and which are a consequence of the structure. The spatial labels then allow researchers to investigate the interaction between structures, the development of the structures, and how the tissue performs its function.

Relating cell types to their spatial structure and organization in tissues is a major component of the ongoing Human Cell Atlas project. But the technologies for spatial gene expression measurements are also feasible for individual labs that want to study their tissue of interest on a genomic level. With our methods, researchers can answer new questions about the relationship between genes and tissue structure that could not be addressed before, as we demonstrate in our paper.

In the long term, genomic and quantitative spatial gene expression measurements, captured and analysed by methods such as SpatialDE, may form the basis of histology and pathology in the clinic. This would allow this area of medical diagnostics to become even more powerful and personalized.

About the author:
Dr Valentine Svensson was an EMBL PhD student supervised by Sarah Teichmann at the Wellcome Sanger Institute, collaborating with Oliver Stegle at EMBL-EBI, when this work was done. He is now a postdoctoral scholar in the Division of Biology and Biological Engineering at Caltech, working with Lior Pachter on statistics for omics-based cell biology.

Related publication:
Valentine Svensson, Sarah A Teichmann and Oliver Stegle. (2018). SpatialDE: identification of spatially variable genes. Nature Methods. DOI: 10.1038/nmeth.4636
