GISAID EpiCoV Downloader

Easily grabs large batches of sequences and metadata from GISAID EpiCoV.

GISAID EpiCoV Downloader is an RPA-powered software bot designed for bulk retrieval of sequences and metadata from the GISAID EpiCoV portal. It demonstrates a robust, end-to-end data pipeline that automates a traditionally manual and error-prone process.


See “Dataset Crawling” for how this project supports ViraTrend’s development.

TL;DR

GISAID EpiCoV Downloader is a software bot for bulk crawling sequences and metadata from GISAID EpiCoV. It showcases a typical data-pipeline use case powered by RPA.

Problem

"A Critical Genetic Database Under Firearrow-up-right" highlights how researchers are forced to spend hundreds of hours downloading sequences in tiny chunks due to GISAID limitations:

  • No real-time API, limited bulk download, and 10k-per-download cap

  • Dynamic popups, variable load times, and server errors disrupt workflows

  • Long prep times force manual, piecemeal downloads — error-prone and slow

Solution

An RPA bot that enables automated retrieval by:

  • Mimicking human actions in a browser (see the sketch after this list)

  • Downloading entries from a specified ID list

  • Merging batches into a single dataset

  • Normalising to the contract schema
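
As a rough illustration of the first point above, the "mimic a human in a browser" step might look like the following Selenium sketch. The portal URL, element locators ("elogin", "epassword", "form_button_submit"), and credential handling are placeholder assumptions, not the bot's actual implementation.

```python
# Sketch only: element locators are placeholders, not GISAID's real page structure;
# credentials are read from environment variables rather than hard-coded.
import os

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def login(driver: webdriver.Chrome, url: str = "https://www.epicov.org/epi3/frontend") -> None:
    """Open the EpiCoV portal and sign in the way a human operator would."""
    driver.get(url)
    wait = WebDriverWait(driver, timeout=60)  # tolerate slow, variable page loads
    wait.until(EC.presence_of_element_located((By.ID, "elogin"))).send_keys(os.environ["GISAID_USER"])
    driver.find_element(By.ID, "epassword").send_keys(os.environ["GISAID_PASS"])
    driver.find_element(By.CLASS_NAME, "form_button_submit").click()


if __name__ == "__main__":
    browser = webdriver.Chrome()
    login(browser)
```

The same wait-locate-act pattern is what lets the bot cope with popups and unpredictable load times elsewhere in the workflow.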

Outcome

  • Near real-time, reliable ingestion without manual effort

  • Faster, scalable data acquisition despite portal limits

  • Clean, normalised, deduplicated data ready for analysis

  • Reduced errors and significant time savings

Why RPA

  • Excels at repetitive, rules-based tasks

  • Interacts with legacy systems to provide an API-like overlay

  • Enables rapid prototyping with minimal or low-code effort

  • Built-in auditing for transparent, accurate action tracing

Process Flow

  1. Logs into GISAID EpiCoV.

  2. Downloads records from a specified record-ID list.

  3. Unzips and loads into a local cleansing datastore.

  4. Normalises data to the contract format.

  5. Exports to a CSV file (the end-to-end flow is sketched below).
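
The numbered steps above could be wired together roughly as follows. Each step function is a hypothetical stand-in for the bot's Selenium and datastore logic, included only to make the control flow and the 10,000-record batching concrete.

```python
# Pipeline skeleton only: each step function is a hypothetical placeholder, not a published API.
from pathlib import Path
from typing import Sequence


def login():                                                    # step 1: sign in via the browser
    raise NotImplementedError

def download_batch(driver, ids: Sequence[str]) -> Path:         # step 2: fetch one chunk as a ZIP archive
    raise NotImplementedError

def unzip_to_store(archive: Path, store: Path) -> None:         # step 3: stage records in the cleansing datastore
    raise NotImplementedError

def normalise_and_export(store: Path, out_csv: Path) -> None:   # steps 4-5: map to the contract schema, write CSV
    raise NotImplementedError


def run(record_ids: Sequence[str], out_csv: Path, batch_size: int = 10_000) -> None:
    """Drive the five-step flow, respecting the portal's 10,000-records-per-download cap."""
    driver = login()
    store = Path("staging")
    for start in range(0, len(record_ids), batch_size):
        archive = download_batch(driver, record_ids[start:start + batch_size])
        unzip_to_store(archive, store)
    normalise_and_export(store, out_csv)
```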

Features

  • Auditability: Screen recordings, logs, and staged files enable full audits (see the logging sketch below).

  • Flexibility: Bot tolerates variable download times, unresponsive UIs, and server errors.

  • Recoverability: Robust recovery resumes from the last checkpoint after major interruptions.
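
For the auditability point above, a minimal logging sketch might look like this; the log-file naming and message format are illustrative choices rather than the bot's fixed convention.

```python
# Audit-trail sketch: append every bot action to a timestamped, run-specific log file.
import logging
from datetime import datetime, timezone


def make_audit_logger(run_id: str = "") -> logging.Logger:
    """Create a logger that writes an append-only audit trail for one bot run."""
    run_id = run_id or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    logger = logging.getLogger(f"epicov_bot.{run_id}")
    handler = logging.FileHandler(f"audit_{run_id}.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


audit = make_audit_logger()
audit.info("example entry: downloaded batch 12 to staging/")  # illustrative message only
```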

Benchmarks

The bot ran overnight to download Australian GISAID EpiCoV data:

  • Volume: several GB, 200k+ records

  • Duration: ~4 hours, fully automated (no human intervention)

  • Resilience: a few server-error exceptions handled

  • Outcome: accurate, complete dataset with no duplicates or missing entries

Why is extracting data from GISAID so challenging?

Researchers often face severe friction when extracting data from GISAID. The platform lacks a real-time API, imposes limited bulk-download options, and caps each download at 10,000 records. Dynamic popups, variable page loads, and intermittent server errors frequently disrupt workflows. As a result, many researchers spend hundreds of hours performing piecemeal downloads, stitching data together manually, and dealing with avoidable errors and delays.

Related Article(s): A Critical Genetic Database Under Fire by The Economist

How does the bot automate retrieval?

The bot automates the entire retrieval workflow by mimicking human interactions in a browser, downloading entries from a defined list of record IDs, and merging batches into a single dataset. It then normalises the output to a contract schema, producing a clean, analysis-ready export. This approach delivers a practical, API-like overlay on top of a legacy web interface without requiring changes to the underlying system.
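
As a sketch of the merge-and-normalise stage, the snippet below assumes the staged batches are tab-separated metadata files and uses an illustrative column mapping for the contract schema; the real file layout and field names are project-specific.

```python
# Merge-and-normalise sketch with an assumed (not official) contract schema mapping.
from pathlib import Path

import pandas as pd

CONTRACT_COLUMNS = {              # hypothetical mapping from portal headers to the contract schema
    "Accession ID": "accession_id",
    "Virus name": "virus_name",
    "Collection date": "collection_date",
    "Location": "location",
}


def merge_and_normalise(staging_dir: Path, out_csv: Path) -> pd.DataFrame:
    """Merge every staged batch, normalise to the contract schema, deduplicate, and export."""
    frames = [pd.read_csv(p, sep="\t", low_memory=False) for p in sorted(staging_dir.glob("*.tsv"))]
    merged = pd.concat(frames, ignore_index=True)
    merged = merged.rename(columns=CONTRACT_COLUMNS)[list(CONTRACT_COLUMNS.values())]
    merged = merged.drop_duplicates(subset="accession_id")
    merged.to_csv(out_csv, index=False)
    return merged
```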

What benefits does the bot bring?

The bot enables near real-time, reliable ingestion with no manual effort. It scales to large data volumes despite portal constraints and delivers clean, normalised, deduplicated datasets. Teams benefit from reduced errors, significant time savings, and a repeatable process that keeps pace with evolving research needs.

Why use RPA for this workflow?

RPA excels at repetitive, rules-based tasks and integrates smoothly with legacy systems that lack modern interfaces. It enables rapid prototyping with minimal or low-code development and provides built-in auditing for transparent, accurate tracking of every action. In this context, RPA effectively transforms a constrained web workflow into a dependable data pipeline.

What steps does the bot follow?

The bot signs into GISAID EpiCoV, downloads records using a specified list of IDs, and unzips the results into a local cleansing datastore. It then normalises the data to the agreed contract format and exports the final dataset as a CSV file. Throughout the process, it tolerates variable load times, unresponsive UI states, and transient server errors, resuming seamlessly from the last checkpoint when needed.
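
One way to tolerate the transient failures mentioned above is a small retry wrapper around each browser action; the exception types, attempt count, and delays below are illustrative assumptions, not the bot's exact policy.

```python
# Retry sketch for transient failures (slow loads, popups, server errors).
import time
from typing import Callable, TypeVar

from selenium.common.exceptions import TimeoutException, WebDriverException

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3, base_delay: float = 30.0) -> T:
    """Run a browser action, backing off and retrying when it fails transiently."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except (TimeoutException, WebDriverException):
            if attempt == attempts:
                raise
            time.sleep(base_delay * attempt)  # back off a little longer after each failure
    raise RuntimeError("unreachable")  # for type checkers; the loop always returns or raises
```

A download step could then be invoked as `with_retries(lambda: download_batch(driver, batch))`, with checkpointing (next section) covering anything the retries cannot absorb.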

What makes the bot production-ready?

The bot is built for auditability, capturing screen recordings, logs, and staged files to support full traceability. Its flexible design handles network variability and UI inconsistencies, while robust recovery ensures progress isn’t lost after interruptions. These capabilities make it suitable for sustained, high-volume runs.
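
A minimal checkpointing sketch, assuming the bot records the last completed batch index in a small JSON file; the actual recovery mechanism may track more state, such as per-record status.

```python
# Checkpoint sketch: persist the last completed batch so a rerun resumes instead of starting over.
# The file name and JSON layout are illustrative choices, not the bot's actual format.
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")


def save_checkpoint(last_batch: int) -> None:
    """Record that a batch has been downloaded and staged successfully."""
    CHECKPOINT.write_text(json.dumps({"last_completed_batch": last_batch}))


def load_checkpoint() -> int:
    """Return the index of the next batch to process (0 on a fresh run)."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["last_completed_batch"] + 1
    return 0
```

The main download loop would call load_checkpoint() at startup to decide where to resume, and save_checkpoint(i) after each batch lands safely in the staging area.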

What performance has the bot achieved?

In an overnight run ingesting Australian GISAID EpiCoV records, the bot processed several gigabytes of data spanning more than 200,000 records in roughly four hours, fully unattended. A handful of server-error exceptions were automatically handled, and the final output was accurate, complete, and free of duplicates and missing entries.
