COVID-19 Sequence Uploader

COVID-19 Sequence Uploader is a platform that lets researchers upload sequence data for the COVID-19 virus (SARS-CoV-2) to a public repository. There are two ways to upload sequence data: through this website or from the command line. Genomes of COVID-19 samples uploaded this way become publicly and freely available to other researchers.

BORG/CBRC Arvados Cloud Platform

The uploader uses the Arvados Cloud platform for managing, processing, and sharing genomic and other large scientific and biomedical data. The Arvados instance is deployed on BORG/CBRC servers for testing and development.

To use the BORG/CBRC Arvados platform to manage your data, create an account and sign in to the Arvados platform.

Getting an API token

The Arvados API token is a secret key that authenticates you to the command-line tools that use the Arvados platform to manage data.

COVID-19 Sequence Uploader also requires the Arvados API token to be set as an environment variable before you run it.

Setting API token as environment variables

The Current token page, reached from the dropdown menu icon in the upper right corner of the top navigation menu of the Arvados web interface, includes a command that you can copy and paste directly into the shell. It will look something like the following:

export ARVADOS_API_TOKEN=2jv9346o396exampledonotuseexampledonotuseexes7j1ld
export ARVADOS_API_HOST=cborg.cbrc.kaust.edu.sa
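
Before starting an upload, it can help to confirm that both variables are visible to the process that will run the uploader. The following is a minimal sketch, not part of the uploader itself; the helper name `missing_arvados_vars` is hypothetical, and only the two variable names are taken from the commands above.

```python
import os

# Variable names taken from the export commands shown above.
REQUIRED_VARS = ("ARVADOS_API_TOKEN", "ARVADOS_API_HOST")


def missing_arvados_vars(env=None):
    """Return the names of required Arvados variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_arvados_vars()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("Arvados credentials are present in the environment.")
```

The check only tests for presence; it does not contact the Arvados server or validate the token.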

Using COVID-19 Web portal for uploading

To upload sequence data, follow these steps:

  1. Sign in if you have already registered. You can also sign in with your ORCID account by clicking the Orcid.org link on the sign-in page.
  2. Once you are logged in, click the Upload menu in the top menu bar.
  3. Choose the sequence files and the metadata file from the file system (or fill in the input fields in the form instead of providing a metadata file), then submit the form. Uploading and processing a sequence file takes a few minutes.
  4. Once the data is processed, click the Submissions menu in the menu bar to see your submission.
  5. Click the view link to see the details.
  6. The ID of the submission is the URL of the sequence data directory in Arvados. Click the URL to see the submitted data.
  7. You can download and view the uploaded files in the Arvados web interface.

Metadata File Format

The structure of the metadata file is described in YAML. The schema file is available here.

Metadata file fields description:

| Field Name | Required | Description |
|---|---|---|
| id | required | The subject (e.g. the FASTA/FASTQ file) that the metadata describes |
| host.host_species | required | Host species as defined in NCBITaxon, e.g. http://purl.obolibrary.org/obo/NCBITaxon_9606 for Homo sapiens |
| host.host_id | optional | Identifier for the host. If you submit multiple samples from the same host, use the same host_id for those samples |
| host.host_sex | optional | Sex of the host as defined in PATO; expected values are male or female |
| host.host_age | optional | Age of the host as a number (e.g. 50) |
| host.host_age_unit | optional | Unit of host age, e.g. http://purl.obolibrary.org/obo/UO_0000036 |
| host.host_health_status | optional | A condition or state at a particular time; must be one of the following: obo:NCIT_C115935 obo:NCIT_C3833 obo:NCIT_C25269 obo:GENEPIO_0002020 obo:GENEPIO_0001849 obo:NCIT_C28554 obo:NCIT_C37987 |
| host.host_treatment | optional | Process in which the act is intended to modify or alter host status |
| host.host_vaccination | optional | List of vaccines given to the host |
| host.ethnicity | optional | Ethnicity of the host, e.g. http://purl.obolibrary.org/obo/HANCESTRO_0010 |
| host.additional_host_information | optional | Field for additional host information |
| sample.sample_id | required | ID of the sample as defined by the submitter |
| sample.collection_date | required | Date when the sample was taken |
| sample.collection_location | required | Geographical location where the sample was collected, as a Wikidata reference, e.g. http://www.wikidata.org/entity/Q148 (China) |
| sample.collector_name | optional | Name of the person who took the sample |
| sample.collecting_institution | optional | Institute that was responsible for sampling |
| sample.specimen_source | optional | Method by which the specimen was derived, as an NCIT IRI, e.g. http://purl.obolibrary.org/obo/NCIT_C155831 (nasopharyngeal swab). There can be more than one specimen source |
| sample.sample_storage_conditions | optional | Information about storage of a specified type, e.g. frozen specimen, paraffin, fresh |
| sample.additional_collection_information | optional | Additional comments about the circumstances in which the sample was taken |
| sample.source_database_accession | optional | If the data is deposited at a public resource (e.g. GenBank, ENA), enter the accession ID here. Please use a resolvable URL (e.g. http://identifiers.org/insdc/LC522350.1#sequence) |
| virus.virus_species | required | The name of the virus species from the NCBI taxonomy database, e.g. http://purl.obolibrary.org/obo/NCBITaxon_2697049 for Severe acute respiratory syndrome coronavirus 2 |
| virus.virus_strain | optional | Name of the virus strain |
| technology.sample_sequencing_technology | optional | Technology that was used to sequence this sample (e.g. Sanger, Nanopore MinION) |
| technology.sequence_assembly_method | optional | Protocol that provides instructions on the alignment of sequencing reads to the reference genome |
| technology.sequencing_coverage | optional | Sequence coverage, defined as the average number of reads representing a given nucleotide (e.g. [100]). If multiple technologies were used, multiple float values can be submitted, e.g. [100, 20] |
| technology.additional_technology_information | optional | Field for additional technology information |
| submitter.authors | required | Name(s) of the author(s) |
| submitter.submitter_name | optional | Name of the submitter(s) |
| submitter.submitter_address | optional | Address of the submitter |
| submitter.originating_lab | optional | Name of the laboratory that took the sample |
| submitter.lab_address | optional | Address of the laboratory where the sample was taken |
| submitter.provider_sample_id | optional | |
| submitter.submitter_sample_id | optional | |
| submitter.publication | optional | Reference to a publication of this sample (e.g. DOI, PubMed ID) |
| submitter.submitter_orcid | optional | ORCID of the submitter as a full URI, e.g. https://orcid.org/0000-0002-1825-0097 |
| submitter.additional_submitter_information | optional | Field for additional submitter information |
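
The required fields listed above can be checked before a metadata file is submitted. The sketch below is illustrative only and makes several assumptions: that the dotted field names correspond to nested mappings (for example `host` containing `host_species`), that `submitter.authors` is a list, and that the `id`, sample ID, date format, and author shown are placeholders. The helper `missing_required` is hypothetical; the schema file remains the authoritative definition.

```python
import json

# Required fields as listed in the field description, in dotted notation.
REQUIRED_FIELDS = [
    "id",
    "host.host_species",
    "sample.sample_id",
    "sample.collection_date",
    "sample.collection_location",
    "virus.virus_species",
    "submitter.authors",
]


def missing_required(metadata):
    """Return the dotted names of required fields absent from a nested dict."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = metadata
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing


# Placeholder values (assumptions): id, sample_id, collection_date, authors.
# The ontology URLs are the examples given in the field description.
example = {
    "id": "placeholder",
    "host": {"host_species": "http://purl.obolibrary.org/obo/NCBITaxon_9606"},
    "sample": {
        "sample_id": "sample-001",
        "collection_date": "2020-01-01",
        "collection_location": "http://www.wikidata.org/entity/Q148",
    },
    "virus": {"virus_species": "http://purl.obolibrary.org/obo/NCBITaxon_2697049"},
    "submitter": {"authors": ["Jane Doe"]},
}

if __name__ == "__main__":
    print("Missing required fields:", missing_required(example))
    # JSON is used here because it needs no third-party library.
    print(json.dumps(example, indent=2))
```

This only tests that the required keys are present; it does not check that values are valid ontology terms.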

Sequence File Format

Sequence read files should be in FASTA or FASTQ format. The maximum size of an uploaded file is 512 MB.
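
A local pre-check can catch oversized or mis-formatted files before an upload is attempted. The following sketch is an assumption-laden illustration rather than the uploader's own validation: it reads 512 MB as 512 × 1024 × 1024 bytes and guesses the format only from the first non-blank character (`>` for FASTA, `@` for FASTQ). Both helper names are hypothetical.

```python
import os

# Limit stated in the text, interpreted here as mebibytes (assumption).
MAX_UPLOAD_BYTES = 512 * 1024 * 1024


def guess_sequence_format(path):
    """Guess 'fasta' or 'fastq' from the first non-blank line, else None."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue
            if stripped.startswith(">"):
                return "fasta"
            if stripped.startswith("@"):
                return "fastq"
            return None
    return None


def check_sequence_file(path):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if os.path.getsize(path) > MAX_UPLOAD_BYTES:
        problems.append("file is larger than 512 MB")
    if guess_sequence_format(path) is None:
        problems.append("file does not start like FASTA ('>') or FASTQ ('@')")
    return problems
```

For example, `check_sequence_file("example/sequence.fasta")` returns an empty list when the file passes both checks.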

SPARQL

To make metadata available for querying through a SPARQL endpoint, the uploader converts the metadata into RDF at upload time. It compiles a single RDF resource that is linked to external resources such as NCBITaxon, PATO, CHEBI, and Wikidata. The generated RDF file can be hosted in any triple store and queried using SPARQL.

The structure of the RDF resource is described in SHACL, and the schema file is available here.

The uploader web interface provides an interactive SPARQL query editor and example queries for the submitted data, such as listing all submissions, listing submissions for the SARS-CoV-2 virus, and showing the details of a specific submission.
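
Outside the web editor, any SPARQL endpoint can be queried over HTTP using the standard SPARQL 1.1 Protocol. The sketch below only builds such a request. The endpoint URL is a placeholder, and the query is a generic one that lists a few triples; it does not use the uploader's vocabulary, which is defined in the SHACL schema.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint URL; replace it with the address of your triple store.
ENDPOINT = "https://example.org/sparql"

# Generic query that works on any RDF dataset: list a few triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


if __name__ == "__main__":
    request = build_sparql_request(ENDPOINT, QUERY)
    print(request.full_url)
    # Sending is left commented out because the endpoint above is a placeholder:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```

The example queries in the web interface are the better starting point for real predicates and classes.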

Accessing COVID-19 Pangenome Analysis Results

Our pangenome analysis service runs the analysis over all uploaded sequences twice a day. It then syncs the latest results to BORG/CBRC's public Galaxy server.

CBRC's Galaxy instance provides a user-friendly interface for the analysis of bioscience and biomedical data, using a wide variety of tools and algorithms developed by an international team of experts and researchers.

There is a shared data library for COVID-19 pangenome analysis on our Galaxy instance. Here is the link to the latest pangenome analysis results.

Command-line tool

To get started, first install the uploader, then use the cborguploader command to upload your data.

Installation

Prepare your system

Make sure you have Python and the ability to install modules such as `pycurl` and `pyopenssl`. On Ubuntu 18.04, you can run:

sudo apt update
sudo apt install -y virtualenv git libcurl4-openssl-dev build-essential python3-dev libssl-dev

Create and enter your virtualenv

Go to the downloaded uploader directory, then create and enter a virtualenv:

virtualenv --python python3 venv
. venv/bin/activate

Note that you will need to repeat the `. venv/bin/activate` step from this directory to enter your virtualenv whenever you want to use the installed tool.

Install the tool

Once in your virtualenv, install this project:

pip3 install git+https://github.com/bio-ontology-research-group/cborguploader@master

Test the tool

Try running:

cborguploader --help

It should print some instructions about how to use the uploader.

Set Arvados API Token

Before uploading sequence files, set the environment variable ARVADOS_API_TOKEN to your Arvados API token. It will look something like the following:

export ARVADOS_API_TOKEN=2jv9346o396exampledonotuseexampledonotuseexes7j1ld

You can find the Arvados token under the Current token link in your user profile menu on the Arvados web portal.

Usage

Run the uploader with a FASTA or FASTQ file and accompanying metadata file in JSON or YAML:

cborguploader example/sequence.fasta example/metadata.yaml
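
When several samples need to be uploaded, the same command can be driven from a script. The sketch below assumes a directory in which each sequence file has a metadata file with the same name and a `.yaml` extension; that naming convention, the list of file extensions, and both helper names are assumptions made for illustration. It uses only the two positional arguments shown above.

```python
import subprocess
from pathlib import Path

# Assumed sequence file extensions; adjust to match your own files.
SEQUENCE_SUFFIXES = {".fasta", ".fa", ".fastq", ".fq"}


def build_upload_commands(directory):
    """Pair each sequence file with <stem>.yaml and build one command per pair."""
    commands = []
    for seq in sorted(Path(directory).iterdir()):
        if seq.suffix.lower() not in SEQUENCE_SUFFIXES:
            continue
        meta = seq.with_suffix(".yaml")
        if meta.exists():
            commands.append(["cborguploader", str(seq), str(meta)])
    return commands


def upload_all(directory):
    """Run the uploads one after another, stopping at the first failure."""
    for command in build_upload_commands(directory):
        subprocess.run(command, check=True)
```

Sequence files without a matching metadata file are skipped, so it is worth printing the result of `build_upload_commands` before calling `upload_all`.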

You can find the example files on the COVID-19 web uploader. Here are the links to the example files:

Workflow for Generating a Pangenome

All uploaded sequences are fed into a workflow that generates a pangenome for the virus. You can replicate this workflow yourself.

For example, download your SARS-CoV-2 sequences from GenBank into `seqs.fa`, and then run the following series of commands:

minimap2 -cx asm20 -X seqs.fa seqs.fa >seqs.paf
seqwish -s seqs.fa -p seqs.paf -g seqs.gfa
odgi build -g seqs.gfa -s -o seqs.odgi
odgi viz -i seqs.odgi -o seqs.png -x 4000 -y 500 -R -P 5
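
The four commands above can also be run from a script that stops at the first failing step. This is a minimal sketch, assuming `minimap2`, `seqwish`, and `odgi` are installed and on the PATH; the function names are hypothetical, and the arguments are copied unchanged from the commands above.

```python
import subprocess


def pangenome_commands(fasta="seqs.fa"):
    """Return the four pipeline steps as (argv, stdout_path) pairs."""
    stem = fasta.rsplit(".", 1)[0]
    paf, gfa, odgi, png = (f"{stem}.{ext}" for ext in ("paf", "gfa", "odgi", "png"))
    return [
        # minimap2 writes the alignment to stdout, so it is redirected to the PAF file.
        (["minimap2", "-cx", "asm20", "-X", fasta, fasta], paf),
        (["seqwish", "-s", fasta, "-p", paf, "-g", gfa], None),
        (["odgi", "build", "-g", gfa, "-s", "-o", odgi], None),
        (["odgi", "viz", "-i", odgi, "-o", png,
          "-x", "4000", "-y", "500", "-R", "-P", "5"], None),
    ]


def run_pipeline(fasta="seqs.fa"):
    """Run each step in order; check=True raises if a step exits with an error."""
    for argv, stdout_path in pangenome_commands(fasta):
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Calling `run_pipeline("seqs.fa")` produces `seqs.paf`, `seqs.gfa`, `seqs.odgi`, and `seqs.png` in the current directory, provided each tool succeeds.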

We have converted this pipeline into the Common Workflow Language (CWL); the sources can be found here.