Why it’s hard to track coronavirus variants


Labs like this one in New Delhi, India, are sequencing the genomes of coronavirus samples and trying to detect the Omicron variant. Photo credit: T. Narayan / Bloomberg via Getty

Researchers are trying to detect Omicron, the latest worrying variant of SARS-CoV-2, by sequencing the genomes of coronaviruses that infect humans. However, monitoring by genomic sequencing can be slow and patchy, complicating the picture of how and where Omicron is spreading.

One positive development is that researchers are sequencing more SARS-CoV-2 genomes than ever before, which enabled them to recognize Omicron relatively quickly. In April 2021, around 16 months after the start of the pandemic, an online database run by the GISAID Data Science Initiative contained one million SARS-CoV-2 genome sequences. Since then, researchers have submitted another five million sequences to GISAID in about eight months, an almost ten-fold increase in the rate of submission (see “Genome Explosion”). “We are now much better able to find Omicron or some other new variant,” says Kelly Wroblewski, director of infectious diseases at the Association of Public Health Laboratories in Silver Spring, Maryland.

Genome Explosion: Line graph showing the number of SARS-CoV-2 genome sequences added to GISAID since January 2020.

Source: GISAID
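The “almost ten-fold” comparison is easiest to read as a change in pace rather than in totals. A quick back-of-the-envelope check, using only the rounded figures quoted in the text (the variable names are illustrative and not taken from GISAID):

```python
# Rounded figures quoted in the article; variable names are illustrative.
first_batch_sequences = 1_000_000   # sequences in GISAID after the first 16 months
first_batch_months = 16
second_batch_sequences = 5_000_000  # sequences added in the following 8 months
second_batch_months = 8

# Average submissions per month in each period.
early_rate = first_batch_sequences / first_batch_months
recent_rate = second_batch_sequences / second_batch_months

# How many times faster the recent period is than the early one.
speedup = recent_rate / early_rate

print(f"early rate:  {early_rate:,.0f} sequences per month")
print(f"recent rate: {recent_rate:,.0f} sequences per month")
print(f"speed-up:    {speedup:.0f}x")
```

Because the inputs are rounded, the result is only approximate, which is consistent with the “almost” in the text.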

However, researchers caution that worrying gaps remain in the sequencing data, making it difficult to interpret how a variant is moving. “The numbers are complex and there are so many caveats,” says Wroblewski. For one thing, some countries do not have the laboratory capacity to sequence the genomes of pathogens. In those places, it might look as though there aren’t any variants while the mutant viruses are actually spreading under the radar.

Sequencing rates also vary within countries, giving an uneven picture of how a variant spreads within a country’s borders. For example, 10 US states have sequenced fewer than 2% of the coronavirus samples from people who tested positive for COVID-19 in those states in the past month, according to the sequences posted to GISAID. In contrast, Wyoming, Colorado, and Vermont sequenced more than 10% of their positive cases over the same period (see “Surveillance States”).

Surveillance States: US state map showing the percentage of COVID-19 cases detected that have been sequenced in the past 30 days.

Sources: GISAID and the Rockefeller Foundation’s Pandemic Prevention Institute

But even if a location sequences many of its positive cases, variants could still slip through if testing is poor or biased. “It’s easy to sequence 100% of your cases if you only test a few people to begin with,” says Jennifer Nuzzo, an epidemiologist at Johns Hopkins University in Baltimore, Maryland. For example, some countries mainly test international travelers. Even if they sequence all of these samples, they could miss a worrying variant that is circulating domestically.

Note the data gap

Faced with surveillance challenges like these, epidemiologist Sam Scarpino and his colleagues at the Rockefeller Foundation’s Pandemic Prevention Institute in Washington DC are looking for new ways to understand the spread of variants. One method is to use a model they have developed to estimate how prevalent Omicron would have to be in a given area before public health officials would detect it, given the state of testing and sequencing in that area. In a location with little surveillance, for example, Omicron would have to be fairly common before researchers could identify it.
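The article does not describe the institute’s model, so the following is only a minimal sketch of the general idea under simple, stated assumptions: sequenced samples are drawn at random from positive cases, each one is independently the variant with probability equal to its prevalence, and testing bias and reporting delays are ignored. The function names and the sequencing volumes are hypothetical.

```python
def detection_probability(prevalence: float, n_sequenced: int) -> float:
    """Chance that at least one of n randomly sequenced positives is the variant."""
    return 1.0 - (1.0 - prevalence) ** n_sequenced


def min_detectable_prevalence(n_sequenced: int, confidence: float = 0.95) -> float:
    """Smallest prevalence at which detection_probability reaches `confidence`."""
    if n_sequenced <= 0:
        raise ValueError("need at least one sequenced sample")
    # Solve 1 - (1 - p)^n = confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_sequenced)


# Hypothetical numbers of positive samples sequenced in a reporting period.
for n in (10, 100, 1000):
    p = min_detectable_prevalence(n)
    print(f"{n:>5} sequences: variant must reach about {p:.1%} of cases")
```

Under these assumptions, a place that sequences only 10 samples would need the variant to make up roughly a quarter of its cases to have a 95% chance of seeing it at least once, whereas 100 samples bring that threshold down to about 3%.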

The team also builds timelines from the Omicron reports that are uploaded to GISAID every day, to paint a clearer picture of when the variant was detected. They order sequences by the dates on which the samples were taken, not by when they appear online in the database. The timing can be confusing, because weeks can elapse between a person testing positive for the coronavirus and their sample being sent to a genomics lab, sequenced, and then reported online and to authorities. For example, according to GISAID’s December 9 data, the first person known to be infected with Omicron was sampled in South Africa on November 8, about three weeks before the virus sequence for that sample was posted online and almost two weeks before South Africa’s first report on Omicron. More data have come in since then, and a newly posted Omicron sequence is based on a sample collected in South Africa on November 5. In contrast, barely two days passed between the sampling of the first known infected person in Spain and the sequencing (see “Sequence of events”).

Sequence of events: timeline showing the time between collecting a COVID-19 sample and submitting its sequence to a database, for locations worldwide.

Sources: GISAID and the Rockefeller Foundation’s Pandemic Prevention Institute
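Ordering by collection date instead of submission date can change which location appears to come first. The sketch below illustrates this with entirely hypothetical records; the field names are illustrative and are not GISAID’s actual schema.

```python
from datetime import date

# Entirely hypothetical records; locations, dates and field names are made up.
records = [
    {"location": "Location A", "collected": date(2021, 11, 29), "submitted": date(2021, 12, 1)},
    {"location": "Location B", "collected": date(2021, 11, 8), "submitted": date(2021, 11, 30)},
    {"location": "Location C", "collected": date(2021, 11, 20), "submitted": date(2021, 11, 26)},
]


def lag_days(record: dict) -> int:
    """Days between taking the sample and the sequence appearing in the database."""
    return (record["submitted"] - record["collected"]).days


# The same records, ordered two different ways.
by_submission = sorted(records, key=lambda r: r["submitted"])
by_collection = sorted(records, key=lambda r: r["collected"])

print("Order by submission:", [r["location"] for r in by_submission])
print("Order by collection:", [r["location"] for r in by_collection])
for r in by_collection:
    print(f'{r["location"]}: collected {r["collected"]}, lag {lag_days(r)} days')
```

In this made-up example, Location C is the first to appear in the database, but Location B holds the earliest sample; its long reporting lag hides that until the records are sorted by collection date.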

Dave Luo, a data scientist who advises Rockefeller’s Pandemic Prevention Institute, warns that this type of timeline alone cannot determine how Omicron spreads. To work that out, scientists need to compare the genetic codes of different SARS-CoV-2 sequences and create an evolutionary tree that shows how closely one virus is related to another. Genomic epidemiologists, such as those participating in the Nextstrain project, are currently performing such analyses.

All of these studies are evolving daily as new Omicron sequences arrive from around the world. One sign of how fast the field is moving is the rapid rise in genomes reported after the World Health Organization designated Omicron a variant of concern on November 26. Shortly after the agency’s announcement, 15 countries had submitted 187 Omicron genome sequences to GISAID. By December 14, 55 countries had shared 4,265 Omicron sequences. The numbers are on track to keep rising, but Luo warns that this isn’t necessarily representative of how quickly the variant is spreading. Many testing centers prioritize sequencing samples in which a simple, rapid genotyping test has detected a possible signal for Omicron: a specific marker in the gene for its spike protein. As a result, Omicron could currently be overrepresented among the SARS-CoV-2 genome sequences.
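A rough way to see how prioritized sequencing can inflate a variant’s apparent share is sketched below. It assumes, purely for illustration, that the genotyping screen flags Omicron samples perfectly and that labs sequence flagged and unflagged samples at different fixed rates; the function name and all numbers are hypothetical.

```python
def observed_share(true_share: float, seq_rate_flagged: float, seq_rate_other: float) -> float:
    """Variant's share among sequenced samples when flagged samples are sequenced
    at one rate and all other positives at another (assumes a perfect screen)."""
    flagged = true_share * seq_rate_flagged        # variant samples that get sequenced
    other = (1.0 - true_share) * seq_rate_other    # non-variant samples that get sequenced
    if flagged + other == 0:
        raise ValueError("no samples are sequenced under these rates")
    return flagged / (flagged + other)


# Hypothetical scenario: the variant is 5% of positives, half of the flagged
# samples are sequenced, but only 5% of everything else.
true_share = 0.05
biased = observed_share(true_share, seq_rate_flagged=0.50, seq_rate_other=0.05)
unbiased = observed_share(true_share, seq_rate_flagged=0.10, seq_rate_other=0.10)

print(f"true share of cases:            {true_share:.0%}")
print(f"share with equal sequencing:    {unbiased:.0%}")
print(f"share with targeted sequencing: {biased:.0%}")
```

In this hypothetical scenario the variant makes up 5% of cases but about a third of the sequences, which is the kind of overrepresentation Luo describes.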

Genomic information is biased and messy in many ways, Luo says. “We have to be careful what we take from a single data source.”