By: Dan Mead, the 25th Anniversary Sequencing Project Coordinator
Date: 23/04/2018

The type of cricket (Roesel’s Bush Cricket, Bicolorana roeselii or Metrioptera roeselii) we decided to sequence is interesting because it has spread out of its traditional salt-marsh environment to the interior of the country. We want to know if this is because it has adapted to live in less saline conditions or if it’s been possible due to the increased salt spreading on roads making corridors for the crickets to move along (or a combination of both).

This was one of the first species we received (from Björn Beckmann & Peter Sutton of the Orthoptera & Allied Insects group), late in the summer of 2017. We got three, all from a field in Oxfordshire, and it turns out they’re not averse to a little cannibalism – one of them ate the back legs of its roommate before I could separate them (the third was in a separate container, although it was also missing a leg). Seeing as I was feeling a little mischievous, I named them Hannibal, Oscar and Heather (despite them all being male – I took some creative licence).

Getting the DNA from Oscar was one of the easier extractions: good yield and reasonable (although not the best) quality, certainly good enough for PacBio sequencing.

This is a Femto Pulse trace of the DNA fragment size. Here it’s mostly in the 20Kb+ range; ideally it’d be bigger, as a perfect trace has one giant peak at ~165Kb.

Later extractions also gave better DNA for the 10X sequencing, so things were going swimmingly. I’d estimated the genome size at ~2Gb, based on the average cricket genome in the Animal Genome Size Database: quite large for an insect, but reasonable enough for this project.

Little did I know the seemingly unending horror show that now befalls us …

Initially things progressed as expected, the PacBio sequencing went well – producing >95Gb data. Likewise for the 10X, we got 120Gb from that, so ~50X coverage for both.

Things started to get a bit icky when the assembly first failed for PacBio, then for 10X. A PacBio miniasm assembly then came back with a revised genome size of 2.8Gb, bigger than expected but not too bad at this point, although the N50 was terrible (76Kb).

The next thing that happened was a k-mer-based quality control report – this gave the genome size as 4.6Gb! We’re definitely into the realm of the unexpected now … this reduces our effective coverage to ~20X, waaay less than is needed for a decent assembly.
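To make the coverage arithmetic concrete, here is a minimal Python sketch. It assumes the usual back-of-envelope definition (mean coverage = total sequenced bases / genome size) and plugs in the yields and genome size estimates quoted so far; the function and variable names are my own, and the PacBio yield is a lower bound since the run produced ">95Gb".

```python
def coverage(yield_gb: float, genome_size_gb: float) -> float:
    """Mean coverage, taken as total sequenced bases / genome size (both in Gb)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return yield_gb / genome_size_gb


# Sequencing yields quoted in the post (PacBio is a lower bound: ">95Gb").
yields_gb = {"PacBio": 95.0, "10X": 120.0}

# Genome size estimates quoted so far, in Gb.
estimates_gb = {
    "database average": 2.0,
    "miniasm assembly": 2.8,
    "k-mer QC report": 4.6,
}

for source, size in estimates_gb.items():
    depths = ", ".join(
        f"{platform} ~{coverage(total, size):.1f}X"
        for platform, total in yields_gb.items()
    )
    print(f"{source} ({size}Gb): {depths}")
```

The same data that looked like roughly 50X against a 2Gb genome drops to the low 20s once the genome is assumed to be 4.6Gb, which is the squeeze described above.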

Finally (after running out of memory a few times) Supernova ran on the 10X data. This returned a gut-wrenching estimated genome size of 7.5Gb!

Combine this with the heterozygosity estimate of around 3.04% and everything looks a little wonky.

So what went wrong?

I’ve just been back to the genome size database and there is an outlier in the sizes – the camel cricket (Ceuthophilus stygius), which is a cave cricket from North America.

By James St. John – Ceuthophilus stygius (camel cricket) inside entrance to Great Onyx Cave (Flint Ridge, Mammoth Cave National Park, Kentucky, USA) 1, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=39945246

This beauty has a genome size of 9.55Gb!

So the question is how likely is this to be the case (or close to) for our cricket?

Taking all the crickets with known genome sizes from the database (there aren’t that many – 7 – one of which is the gloriously named ‘unwelcome mole cricket’ Neoscapteriscus borellii) and putting them into the phyloT tree generator and iTOL (Interactive Tree of Life) gives you this:

Brown curry mole crickets in a can: I don’t think you can buy these anymore

Sorry, that’s just a can of curried crickets, the tree looks like this:

Unwelcome mole crickets are unwelcome in NCBI apparently, there’s no taxon number so no tree entry.

From this it looks like our Tettigoniidae bush cricket pre-dates our large-genomed friend the camel cricket (a Gryllacrididae) and split from the ‘true’ crickets (the Gryllidae) a while back. But how far?

Then we used another online resource, TimeTree, to see when this split occurred. From the below you can see it was ~270MYA, which is a long time, plenty of time for some weird genome expansion to have happened I guess.

Gryllidae and Gryllacrididae separated 100MY before Tettigoniidae diverged from Gryllacrididae (~172MYA).

You may have noticed that this tree is a little different, this is for two reasons:

  • It’s a simple expansion of the last shared taxon group, the Ensifera.
  • The Gryllacrididae and Tettigoniidae split from the Rhaphidophoridae, not the other way around.

Before you ask, no I don’t know why, but I assume the latter is correct as the first tree lacks all the taxon groups for an input.

The sole example of the Rhaphidophoridae taxon has a 1.55Gb genome and as this line goes back to the common ancestor of the Roesel’s cricket it could be that our initial estimate is true OR, more likely, there’s been some horrible expansion that involves (multiple?) genome duplication events.

The thing that’s really annoying is my own lack of knowledge and tendency to make (in this case stupid) assumptions – who knew that Gryllacrididae and Gryllidae are actually more distantly related than Gryllacrididae and Tettigoniidae? Taxonomists probably, or someone who studied classics.

Anyway, we’re doing some more sequencing to get extra 10X data; hopefully this will answer the question once and for all … stay tuned!

About the author:

Dan Mead is the 25th Anniversary Sequencing Project Coordinator for the 25 Genomes Project at the Wellcome Sanger Institute, Cambridge.

More on the 25 Genomes Project:

25 Genomes Project web page 
