Data Engineering and Aggregation Lead

Are you an experienced, pragmatic and forward-thinking Data Engineering and Aggregation Lead looking to contribute towards solving real global health challenges? If so, this may be a once-in-a-lifetime opportunity, and we are looking for someone like you! Come and join us at the world-famous Wellcome Sanger Institute in the newly formed Genomic Surveillance Unit. No science background needed!

You will lead a team of data engineers supporting the aggregation, management, storage and integration of genomic surveillance data and related metadata with public health entities and research partners.

About the Role

As the Data Engineering and Aggregation Lead, your primary objective will be to take data from sequencing informatics pipelines and integrate it with multiple sources of heterogeneous data to deliver a range of high quality, analysis-ready genetic data products to third party organisations and the Analysis team via multiple channels.

In this role you will:

  • Lead a team of principal, senior, and junior data engineers.
  • Develop operational relationships with health and government organisations around the world.
  • Engage with the GSU leadership teams and other internal stakeholders to enable requirements gathering.
  • Work closely with Sanger IDS (IT shared service) and GSU to implement a data management strategy.
  • Draw on data architecture skills to harmonise the data infrastructure that supports our products.
  • Recruit and develop flexible, agile teams.

PSG Band 4
Salary per annum: 58,000 - 80,000
Full Time, Part Time, Flexible Working: Full Time
Contract Type:
Closing Date: 28 August 2022
Job Reference:

Essential Skills

About You

  • A natural leader when it comes to data management technology and people
  • Able to relate to scrum team members and motivate them to achieve great results
  • Passionate about data engineering and using technology to deliver according to organisation requirements
  • Demonstrably experienced in using data engineering tools to achieve strategic objectives
  • Flexible in your approach, role-modelling service-oriented behaviour in all you do
  • Able to interpret the needs of an organisation and translate them into actionable plans - collaborating closely with your fellow principals as well as your colleagues in the data architecture, data engineering, and software development space

Other information

About the Tech

We currently develop and run our pipelines in Apache Spark. This registry is backed up in GitLab. We are trialling the use of Prometheus and Grafana for monitoring and alerting. We use Kafka as a message broker and Prefect as an orchestration tool, primarily using its scheduler and dependency management modules. Most of our code is Python, so we follow the PEP 8 style guide. We use GitLab CI/CD for test pipelines and deployment, along with Terraform and Ansible to provision and configure services on OpenStack. We are an agile team running our Kanban board in Jira, and we’re always looking for ways to improve our way of working.


About Us

Wellcome Sanger is a world-leading genomics research institute and our work helps improve human health and understand life on Earth. The newly launched Genomic Surveillance Unit (GSU) is the first service-delivery unit within Sanger that is working towards using genomic surveillance as a practical tool for local disease control and to support countries in being pandemic-prepared. Collaboration is at the heart of what we do and we are connected to a network of researchers, clinicians and public health agencies across the globe through a number of multi-centre projects. We provide a framework for generating, integrating and sharing genetic and genomic data, and for investigating key questions about COVID-19, malaria biology and epidemiology. Have a look here to get more insight into COVID-19 genomic surveillance and its impact.

Data at Sanger is generated based on organic samples that are processed through sequencing informatics pipelines and combined with multiple heterogeneous data sources. One data stream provides an end-to-end view of the samples’ lifecycle to manage our operations while another stream delivers a range of highly available, scalable, and near real-time data products to third party organisations, public health entities and research partners. 

It is the responsibility of the Data Engineering team to ensure that all necessary data points are consolidated from multiple internal/external systems, analytics outputs, and partners using the most appropriate data architecture.  Our goal is for our product portfolio to contain datasets that are interoperable with the receiving organisation’s existing systems and processes.


Additional Information

We have adopted a hybrid model to support a balance of remote and office working. You can find out more about our inspiring campus here.

Interviews will take place virtually or on campus, depending on candidate location. This approach may vary for individuals located overseas and/or where a visa is required, and start dates will depend on a number of factors. We will be able to provide specialist advice to affected candidates.

Please apply with your CV and cover letter outlining:

  • Your suitability for the role
  • Your experience as a leader and manager in the big data space
  • Your approach to team leadership
  • Any references to prior projects you’ve been involved in that you believe are relevant for consideration