Sanger Science | 1 April 2025

Top ten steps to get your genomics data AI-ready

By Katrina Costa, Science Writer at the Wellcome Sanger Institute

Artificial intelligence (AI) continues to shape the field of genomics, but unprepared data can hamper scientific progress. Digital experts at the Wellcome Sanger Institute are optimising data workflows to fully leverage the power of AI in the Institute’s large-scale sequencing projects.

AI models thrive on large and complex datasets, so they can be an ideal tool for tackling genomic data analysis. In a recent blog, we explored how AI can enhance diverse areas of research, including protein biology, generative genomics and synthetic biology. However, AI model predictions are only as good as the data they are trained on, so biological datasets must be optimised correctly before applying these models.1 We caught up with Dr James McCafferty, Chief Information Officer at the Wellcome Sanger Institute, to discuss ten tips to help ensure your genomics data are AI-ready.

1. Clean up your data

Real-world genomics data can be messy, which can lead to messy AI predictions.

Begin by backing up the raw data and then assessing data quality. Next, clean the data by correcting errors, removing duplicate records, and handling missing values – perhaps through imputation or further investigation. Follow this up by checking for any anomalies or inconsistencies.2

Detecting these issues early can prevent misleading results.
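
To make this concrete, here is a minimal sketch of these cleaning steps using pandas. The file name and column names (coverage, gc_content) are hypothetical stand-ins for a real sample manifest.

```python
import pandas as pd

# Hypothetical sample manifest; file and column names are illustrative only.
df = pd.read_csv("sample_manifest.csv")

# Keep an untouched copy of the raw data before any cleaning.
df.to_csv("sample_manifest.raw_backup.csv", index=False)

# Remove exact duplicate records.
df = df.drop_duplicates()

# Inspect missingness per column before deciding how to handle it.
print(df.isna().sum())

# Fill a missing numeric value from the median (one possible strategy).
df["coverage"] = df["coverage"].fillna(df["coverage"].median())

# Flag simple anomalies, e.g. biologically implausible values.
anomalies = df[(df["coverage"] < 0) | (df["gc_content"] > 1)]
print(f"{len(anomalies)} suspicious records to review")
```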

2. Keep it consistent

Standardise the data formats used and address any batch effects – technical variations that can creep in from different sample processing conditions.3 This is a common problem in large-scale biological data analyses. Use batch-effect correction techniques such as ComBat, which is designed to remove technical variability while preserving biological signal.

Be sure to standardise metadata to provide the AI model with consistent inputs.
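
As a simplified illustration of the idea behind batch correction, the sketch below centres each gene within its sequencing batch using pandas. The values and batch labels are invented, and this is deliberately cruder than ComBat itself, which additionally models batch-specific variance with an empirical Bayes framework.

```python
import pandas as pd

# Toy expression matrix: rows are samples, columns are genes.
# Values and batch labels are invented for illustration.
expr = pd.DataFrame(
    {"geneA": [5.1, 5.3, 7.2, 7.0], "geneB": [2.0, 2.2, 4.1, 3.9]},
    index=["s1", "s2", "s3", "s4"],
)
batch = pd.Series(["run1", "run1", "run2", "run2"], index=expr.index)

# Centre each gene within its batch, then restore the overall gene mean.
# This removes additive batch shifts only; ComBat goes further.
corrected = expr.groupby(batch).transform(lambda x: x - x.mean()) + expr.mean()
print(corrected)
```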

3. Make it relevant

Consider how the AI model will be used.

The data need to relate directly to the AI model’s tasks and reflect the real-world biological scenarios the model will encounter. Any irrelevant data will add noise and may lead to unreliable results.

4. Structure your data

AI models rely on well-organised, machine-readable data.

Convert raw sequence reads and other unstructured data into standardised formats such as tables, FASTA files for biological sequences, or BAM files for DNA sequence alignments.
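
For example, a handful of raw sequences can be written to a standard FASTA file with Biopython. The read identifiers and sequences below are invented for illustration.

```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical unstructured reads, e.g. parsed from a legacy text dump.
raw_reads = {"read_001": "ATGCGTACGTTAG", "read_002": "GGATCCTTAGCAA"}

# Wrap each sequence in a SeqRecord with a stable identifier.
records = [
    SeqRecord(Seq(seq), id=read_id, description="")
    for read_id, seq in raw_reads.items()
]

# Write a standard FASTA file that downstream tools can consume.
SeqIO.write(records, "reads.fasta", "fasta")
```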

5. Label everything

AI models rely on labelled and annotated data to provide context and ensure reliable predictions. Genomic features such as genes and regulatory elements need to be clearly annotated and linked to relevant biological traits and health outcomes.

Combine computational and manual curation for accurate annotations and more reliable AI results.4
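
A labelled training table might look something like the sketch below, built with pandas. The variant identifiers and labels are hypothetical examples of linking genomic features to biological outcomes.

```python
import pandas as pd

# Hypothetical annotated features: each variant is linked to a gene,
# a functional annotation, and a clinical label for supervised training.
labelled = pd.DataFrame(
    {
        "variant_id": ["rs0001", "rs0002", "rs0003"],
        "gene": ["BRCA1", "TP53", "EGFR"],
        "feature_type": ["exon", "promoter", "enhancer"],
        "phenotype_label": ["pathogenic", "benign", "pathogenic"],
    }
)

# A consistent label column is what the model learns to predict.
print(labelled["phenotype_label"].value_counts())
```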


6. Ensure dataset diversity

AI models perform best and produce more generalisable results when trained on diverse datasets. This is especially true for orthogonal data – data types whose variables do not correlate with each other, so each contributes independent information.

Training on broad datasets helps avoid ‘overfitting’, which occurs when a model fits too closely to its training dataset and therefore performs poorly on new, unfamiliar data.5

Solving biological problems often relies on combining diverse data types, so avoid training on a narrow subset of biology.
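
One quick way to spot overfitting is to compare a model’s accuracy on its training data against held-out data. Here is a minimal sketch with scikit-learn, using a synthetic dataset as a stand-in for genomic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a genomics feature matrix.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# A large gap between training and held-out accuracy signals overfitting:
# the model has memorised its training set rather than learning
# patterns that generalise.
print(f"train accuracy: {model.score(X_train, y_train):.2f}")
print(f"test accuracy:  {model.score(X_test, y_test):.2f}")
```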

7. Balance your dataset

Be sure to balance data across different categories, such as healthy versus diseased samples, to avoid skewed or biased results.

Within genomics, the underrepresentation of certain populations is a significant problem that can undermine the reliability of AI models.

Correct any imbalances by adding data from external sources, generating synthetic data, upweighting any underrepresented samples to ensure fairer learning, or using data resampling techniques to adjust the data distribution.6
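
Two of these approaches – upweighting and resampling – can be sketched in a few lines with scikit-learn. The toy labels below simulate a cohort of 90 healthy and 10 diseased samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Imbalanced toy labels: 90 healthy (0), 10 diseased (1).
X = np.random.default_rng(0).normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)

# Option 1: upweight the minority class during training.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: oversample the minority class to even out the distribution.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=90, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
```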

8. Make it accessible

Take steps to ensure your data are easily accessible to others by storing the data in platforms that allow secure sharing and collaboration. This is central to the FAIR principles (Findable, Accessible, Interoperable, Reusable) of scientific data management.7

The FAIR principles ensure data are easily reusable for both machines and individuals, which enhances reproducibility.
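
In practice, this often starts with attaching machine-readable metadata to each dataset. Here is a minimal sketch, with purely illustrative field names and a hypothetical identifier rather than a formal metadata standard.

```python
import json

# A minimal machine-readable metadata record. Persistent identifiers,
# licences and controlled keywords are what help make a dataset
# findable and reusable.
metadata = {
    "identifier": "https://doi.org/10.0000/example",  # hypothetical DOI
    "title": "Example cohort variant calls",
    "licence": "CC-BY-4.0",
    "format": "VCF",
    "keywords": ["genomics", "variants", "cohort"],
    "access_url": "https://example.org/data/cohort-variants",
}

with open("dataset_metadata.json", "w") as handle:
    json.dump(metadata, handle, indent=2)
```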

9. Track its history

Keep a clear record of the data’s provenance – where the data came from and how they have been processed – alongside relevant metadata.

Tracking data provenance ensures reproducibility and provides information on data quality because multiple processing steps can introduce errors or bias. It also creates transparency.

Use version control systems, such as Git, and provenance standards such as the Open Provenance Model.
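
A lightweight provenance record can also be generated in code. The sketch below, with a hypothetical pipeline name and version, captures an input file’s checksum, the tool used, and a timestamp.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(input_path: str, tool: str, version: str) -> dict:
    """Capture where a file came from and how it was processed."""
    with open(input_path, "rb") as handle:
        checksum = hashlib.sha256(handle.read()).hexdigest()
    return {
        "input": input_path,
        "sha256": checksum,          # detects silent changes to the input
        "tool": tool,
        "tool_version": version,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical processing step; commit this record alongside the outputs.
record = provenance_record("reads.fasta", "my_qc_pipeline", "1.2.0")
print(json.dumps(record, indent=2))
```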

10. Scale up for robust AI training

AI models perform best on large, varied datasets.

For large-scale data handling, cloud-based storage solutions such as AWS for Genomics, alongside High-Performance Computing (HPC),8 can provide appropriate storage and compute resources.

To pre-process data, consider automated pipeline tools such as Snakemake, Nextflow, or Apache Spark.

Sensitive data, such as patient information, can be kept private and secure using federated learning, in which models train across data nodes without directly accessing raw data.9
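
The core of federated learning can be illustrated with federated averaging: each node shares only its locally trained model parameters, which a coordinator combines. A minimal sketch with invented numbers:

```python
import numpy as np

# Simulated model updates from three data nodes. In real federated
# learning each node trains locally on its own private data and shares
# only these parameters, never the raw records.
node_weights = [
    np.array([0.10, 0.52, 0.31]),
    np.array([0.12, 0.48, 0.35]),
    np.array([0.09, 0.55, 0.28]),
]
node_sample_counts = np.array([1200, 800, 2000])

# Federated averaging: combine updates weighted by each node's data size.
global_weights = np.average(node_weights, axis=0, weights=node_sample_counts)
print(global_weights)
```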

Here at the Sanger Institute, we are embracing advances in machine learning and AI to support our genomics research at scale.

“The potential of AI in genomics goes far beyond just processing data – it’s about uncovering new patterns, accelerating discoveries, and making research more impactful. AI can also generate new insights, which is especially important for engineering biology. At Sanger, we’re ensuring that our data are not just accessible but optimised for AI-driven insights, laying the foundation for the next generation of genomic research.”

Dr James McCafferty,
Chief Information Officer, Wellcome Sanger Institute


References

  1. Forbes. The Critical Role Of Data Quality In AI. [Last accessed: March 2025].
  2. IBM. What Is Data Cleaning? [Last accessed: March 2025].
  3. Goh WWB, et al. Why Batch Effects Matter in Omics Data, and How to Avoid Them. Trends Biotechnol. 2017; 35:498–507.
  4. Chen Z, et al. From tradition to innovation: conventional and deep learning frameworks in genome annotation. Brief Bioinform. 2024; 25:138.
  5. Amazon Web Services. What is Overfitting? [Last accessed: March 2025].
  6. Medium. Imbalanced Datasets: The Hidden Enemy of Machine Learning. [Last accessed: March 2025].
  7. Wilkinson MD, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016; 3:160018.
  8. European Commission. High Performance Computing. [Last accessed: March 2025].
  9. Yurdem B, et al. Federated learning: Overview, strategies, applications, tools and future directions. Heliyon. 2024; 10:e38137.