Bull World Health Organ 98(7): 495-504

PMC: PMC7375210

PMID: 32742035

Variant analysis of SARS-CoV-2 genomes

Takahiko Koyama

Daniel Platt

Laxmi Parida

IBM TJ Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, New York 10598, United States of America.


Correspondence to Takahiko Koyama (email: tkoyama@us.ibm.com).

Received 2020 Feb 22; Revised 2020 May 13; Accepted 2020 May 13.

This is an open access article distributed under the terms of the Creative Commons Attribution IGO License (http://creativecommons.org/licenses/by/3.0/igo/legalcode), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. In any reproduction of this article there should not be any suggestion that WHO or this article endorse any specific organization or products. The use of the WHO logo is not permitted. This notice should be preserved along with the article's original URL.

Abstract

Objective

To analyse genome variants of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2).

Methods

Between 1 February and 1 May 2020, we downloaded 10 022 SARS-CoV-2 genomes from four databases. The genomes were from infected patients in 68 countries. We identified variants by pairwise alignment of each genome to the reference genome NC_045512, using EMBOSS needle. Nucleotide variants in the coding regions were converted to the corresponding encoded amino acid residues. For clade analysis, we used the open source software Bayesian evolutionary analysis by sampling trees (BEAST), version 2.5.
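The alignment step can be illustrated with a minimal sketch of Needleman-Wunsch global alignment, the algorithm behind EMBOSS needle, followed by variant extraction from the aligned pair. This is not the authors' pipeline: the scoring values (match, mismatch, gap), the function names and the variant labels are assumptions chosen for illustration, and the quadratic score matrix makes this pure-Python version suitable only for short toy sequences, not for full genomes of about 30 kb.

```python
def needleman_wunsch(ref, qry, match=1, mismatch=-1, gap=-2):
    """Globally align ref and qry; return the two gapped strings.

    Scores are illustrative assumptions, not EMBOSS needle defaults.
    """
    n, m = len(ref), len(qry)
    # score[i][j] = best score for aligning ref[:i] with qry[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == qry[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and ref[i - 1] == qry[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(ref[i - 1]); b.append(qry[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(ref[i - 1]); b.append("-"); i -= 1   # base missing in query
        else:
            a.append("-"); b.append(qry[j - 1]); j -= 1   # extra base in query
    return "".join(reversed(a)), "".join(reversed(b))


def call_variants(aligned_ref, aligned_qry):
    """List differences using 1-based reference coordinates."""
    variants = []
    pos = 0  # last reference position consumed so far
    for r, q in zip(aligned_ref, aligned_qry):
        if r != "-":
            pos += 1
        if r == "-":
            variants.append(f"{pos}ins{q}")    # inserted after reference position pos
        elif q == "-":
            variants.append(f"{pos}del{r}")
        elif r != q:
            variants.append(f"{pos}{r}>{q}")
    return variants


aligned = needleman_wunsch("ACGTACGT", "ACGTATGT")
print(call_variants(*aligned))  # ['6C>T']
```

Labels such as `6C>T` follow the same position-then-change pattern as the variant names reported in the findings (for example, 3037C > T).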

Findings

We identified 5775 distinct genome variants, including 2969 missense mutations, 1965 synonymous mutations, 484 mutations in the non-coding regions, 142 non-coding deletions, 100 in-frame deletions, 66 non-coding insertions, 36 stop-gained variants, 11 frameshift deletions and two in-frame insertions. The most common variants were the synonymous 3037C > T (6334 samples), P4715L in the open reading frame 1ab (6319 samples) and D614G in the spike protein (6294 samples). We identified six major clades (basal, D614G, L84S, L3606F, D448del and G392D) and 14 subclades. Among the base changes, C > T was the most common, accounting for 1670 distinct variants.
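The distinction between synonymous, missense and stop-gained variants can be sketched as a lookup in the standard genetic code. The function name, the toy coding sequence and the label format below are hypothetical and chosen for illustration; the sketch handles only single-base substitutions inside one coding sequence and is not the authors' annotation code. The "stop-lost" branch is included only for completeness and is not one of the categories reported above.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(cds, pos, alt):
    """Classify a single-base substitution in a coding sequence.

    cds: coding sequence (DNA alphabet, length a multiple of 3)
    pos: 1-based position of the changed base within cds
    alt: the substituted base
    Returns (category, label), where label is e.g. 'D2G'
    (reference residue, codon number, new residue).
    """
    index = pos - 1
    codon_number = index // 3            # 0-based codon index
    offset = index % 3                   # position of the base inside its codon
    ref_codon = cds[3 * codon_number: 3 * codon_number + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    label = f"{ref_aa}{codon_number + 1}{alt_aa}"
    if alt_aa == ref_aa:
        return "synonymous", label
    if alt_aa == "*":
        return "stop-gained", label
    if ref_aa == "*":
        return "stop-lost", label        # not a category used in the findings
    return "missense", label


toy_cds = "ATGGATTAC"  # hypothetical sequence encoding Met-Asp-Tyr
print(classify_substitution(toy_cds, 5, "G"))  # ('missense', 'D2G')
print(classify_substitution(toy_cds, 6, "C"))  # ('synonymous', 'D2D')
print(classify_substitution(toy_cds, 9, "A"))  # ('stop-gained', 'Y3*')
```

In the toy sequence, changing GAT to GGT replaces aspartate (D) with glycine (G). This is the same residue change as D614G, although here it falls at codon 2 of a made-up sequence.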

Conclusion

We found that several variants of the SARS-CoV-2 genome exist and that the D614G clade has become the most common variant since December 2019. The evolutionary analysis indicated structured transmission, with the possibility of multiple introductions into the population.







Number of samples of severe acute respiratory syndrome coronavirus 2 from each country or territory included in sequence analysis, 2019–2020

Acknowledgements

We gratefully acknowledge the authors and the originating and submitting laboratories of the sequences from GISAID’s EpiFlu Database, GenBank, the NGDC Genome Warehouse and the National Microbiology Data Center, on which this research is based. The list of genomes is available from the data repository.²⁰ We also thank Jane Snowdon and Dilhan Weeraratne.


Competing interests:

None declared.


References

1. Coronavirus disease (COVID-19). Situation Report – 124. Geneva: World Health Organization; 2020. Available from: [cited 2020 May 28].
2. Lu R, Zhao X, Li J, Niu P, Yang B, Wu H, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lancet. 2020 February 22;395(10224):565–74. doi: 10.1016/S0140-6736(20)30251-8
3. Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song ZG, et al. A new coronavirus associated with human respiratory disease in China. Nature. 2020 March;579(7798):265–9. doi: 10.1038/s41586-020-2008-3
4. Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome. NCBI Reference Sequence: NC_045512.1. Bethesda: National Center for Biotechnology Information; 2020. Available from: [cited 2020 May 29].
5. Hoffmann M, Kleine-Weber H, Schroeder S, Krüger N, Herrler T, Erichsen S, et al. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell. 2020 April 16;181(2):271–280.e8. doi: 10.1016/j.cell.2020.02.052
6. Wang D, Hu B, Hu C, Zhu F, Liu X, Zhang J, et al. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus–infected pneumonia in Wuhan, China. JAMA. 2020. doi: 10.1001/jama.2020.1585
7. Guan WJ, Ni ZY, Hu Y, Liang WH, Ou CQ, He JX, et al.; China Medical Treatment Expert Group for Covid-19. Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med. 2020 April 30;382(18):1708–20. doi: 10.1056/NEJMoa2002032
8. Wu Z, McGoogan JM. Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention. JAMA. 2020 February 24;323(13):1239–42. doi: 10.1001/jama.2020.2648
9. Wang C, Horby PW, Hayden FG, Gao GF. A novel coronavirus outbreak of global health concern. Lancet. 2020 February 15;395(10223):470–3. doi: 10.1016/S0140-6736(20)30185-9
10. Cumulative number of reported probable cases of SARS [internet]. Geneva: World Health Organization; 2020. [cited 2020 May 29].
11. Middle East respiratory syndrome coronavirus (MERS-CoV) [internet]. Geneva: World Health Organization; 2020. [cited 2020 May 29].
12. Richardson S, Hirsch JS, Narasimhan M, Crawford JM, McGinn T, Davidson KW, et al.; and the Northwell COVID-19 Research Consortium. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area. JAMA. 2020 April 22. Epub ahead of print. doi: 10.1001/jama.2020.6775
13. Chen N, Zhou M, Dong X, Qu J, Gong F, Han Y, et al. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. Lancet. 2020 February 15;395(10223):507–13. doi: 10.1016/S0140-6736(20)30211-7
14. Yang X, Yu Y, Xu J, Shu H, Xia J, Liu H, et al. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study. Lancet Respir Med. 2020 May;8(5):475–81. doi: 10.1016/S2213-2600(20)30079-5
15. Sanjuán R, Domingo-Calap P. Mechanisms of viral mutation. Cell Mol Life Sci. 2016 December;73(23):4433–48. doi: 10.1007/s00018-016-2299-6
16. Shu Y, McCauley J. GISAID: global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 2017 March 30;22(13):30494. doi: 10.2807/1560-7917.ES.2017.22.13.30494
17. Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. NCBI Reference Sequence: NC_045512.2. Bethesda: National Center for Biotechnology Information; 2020. Available from: [cited 2020 May 19].
18. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970 March;48(3):443–53. doi: 10.1016/0022-2836(70)90057-4
19. orf1ab polyprotein [Severe acute respiratory syndrome coronavirus 2]. NCBI Reference Sequence: YP_009724389.1. Bethesda: National Center for Biotechnology Information; 2020. Available from: [cited 2020 May 29].
20. Koyama T, Platt D, Parida L. Variant analysis of SARS-CoV-2 genomes [data repository]. Meyrin: European Organization for Nuclear Research; 2020. doi: 10.5281/zenodo.3840465
21. Ward JH Jr. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58(301):236–44. doi: 10.1080/01621459.1963.10500845
22. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al.; SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020 March;17(3):261–72. doi: 10.1038/s41592-019-0686-2
23. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 October 5;215(3):403–10. doi: 10.1016/S0022-2836(05)80360-2
24. Papadopoulos JS, Agarwala R. COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics. 2007 May 1;23(9):1073–9. doi: 10.1093/bioinformatics/btm076
25. Grifoni A, Sidney J, Zhang Y, Scheuermann RH, Peters B, Sette A. A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2. Cell Host Microbe. 2020 April 8;27(4):671–680.e2. doi: 10.1016/j.chom.2020.03.002
26. Liu Z, Xiao X, Wei X, Li J, Yang J, Tan H, et al. Composition and divergence of coronavirus spike proteins and host ACE2 receptors predict potential intermediate hosts of SARS-CoV-2. J Med Virol. 2020 February 26;92(6):595–601. doi: 10.1002/jmv.25726
27. Arvestad L. alv: a console-based viewer for molecular sequence alignments. J Open Source Softw. 2018;3(31):955. doi: 10.21105/joss.00955
28. Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLOS Comput Biol. 2019 April 8;15(4):e1006650. doi: 10.1371/journal.pcbi.1006650
29. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002 July 15;30(14):3059–66. doi: 10.1093/nar/gkf436
30. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160–74. doi: 10.1007/BF02101694
31. Lyons DM, Lauring AS. Evidence for the selective basis of transition-to-transversion substitution bias in two RNA viruses. Mol Biol Evol. 2017 December 1;34(12):3205–15. doi: 10.1093/molbev/msx251
32. Li Z, Wu J, Deleo CJ. RNA damage and surveillance under oxidative stress. IUBMB Life. 2006 October;58(10):581–8. doi: 10.1080/15216540600946456
33. Koyama T, Weeraratne D, Snowdon JL, Parida L. Emergence of drift variants that may affect COVID-19 vaccine development and antibody treatment. Pathogens. 2020 April 26;9(5):324. doi: 10.3390/pathogens9050324
34. Ou J, Zhou Z, Dai R, Zhang J, Lan W, Zhao S, et al. Emergence of RBD mutations in circulating SARS-CoV-2 strains enhancing the structural stability and human ACE2 receptor affinity of the spike protein [preprint]. Cold Spring Harbor: medRxiv; 2020. doi: 10.1101/2020.03.15.991844
35. Zhao Z, Li H, Wu X, Zhong Y, Zhang K, Zhang YP, et al. Moderate mutation rate in the SARS coronavirus genome and its implications. BMC Evol Biol. 2004 June 28;4(1):21. doi: 10.1186/1471-2148-4-21