COVID-19 Sequence Uploader
COVID-19 Sequence Uploader is a platform that allows researchers to upload sequence data of the COVID-19 virus (SARS-CoV-2) to a public repository. There are two ways to upload sequence data: through this website or from the command line. Uploading the genomes of COVID-19 samples makes them publicly and freely available to other researchers.
BORG/CBRC Arvados Cloud Platform
The uploader uses the Arvados Cloud platform for managing, processing, and sharing genomic and other large scientific and biomedical data. The Arvados instance is deployed on BORG/CBRC servers for testing and development.
To use the BORG/CBRC Arvados platform to manage your data, create an account and sign in to the Arvados platform.
Getting an API token
The Arvados API token is a secret key that lets users authenticate themselves when using command line tools that rely on the Arvados platform to manage data.
COVID-19 Sequence Uploader also requires the Arvados API token to be set as an environment variable before it is run.
Setting the API token as an environment variable
The Current token page, accessed from the dropdown menu icon in the upper right corner of the top navigation menu of the Arvados web interface, includes a command you can copy and paste directly into the shell. It will look something like the following:
export ARVADOS_API_TOKEN=2jv9346o396exampledonotuseexampledonotuseexes7j1ld
export ARVADOS_API_HOST=cborg.cbrc.kaust.edu.sa
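Before running a command line tool, it can help to confirm that both variables are visible to the process. The following is a minimal sketch, not part of the uploader itself; the helper name `missing_arvados_vars` is hypothetical, and only the two variable names come from the export commands above.

```python
import os

# The two variable names shown in the export commands above.
REQUIRED_VARS = ("ARVADOS_API_TOKEN", "ARVADOS_API_HOST")


def missing_arvados_vars(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the current process environment; a plain dict
    can be passed instead, which makes the check easy to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_arvados_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Arvados credentials are set.")
```

The script only checks that the variables are present; it does not verify that the token is valid for the given host.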
Using the COVID-19 web portal for uploading
To upload sequence data, follow these steps:
- Sign in if you have already registered. You can also sign in with your ORCID account by clicking the Orcid.org link on the sign-in page.
- Once you are logged in, click the Upload menu in the top menu bar.
- Choose the sequence files and the metadata file to upload from the file system (or fill in the input fields of the form instead of providing a metadata file), then submit the form. It may take a few minutes to upload and process the sequence files.
- Once the data is processed, click the Submissions menu in the menu bar to see your submission.
- Click the view link to see the details.
- The ID of the submission is the URL of the sequence data directory in Arvados. Click the URL to see the submitted data.
- You can download and view the uploaded files in the Arvados web interface.
Metadata File Format
The structure of the metadata file is described in YAML. The schema file is available here.
Metadata file fields description:
Field Name | Required | Description |
---|---|---|
id | required | The subject (e.g. the FASTA/FASTQ file) that the metadata describes |
host.host_species | required | Host species as defined in NCBITaxon, e.g. http://purl.obolibrary.org/obo/NCBITaxon_9606 for Homo sapiens |
host.host_id | optional | Identifier for the host. If you submit multiple samples from the same host, use the same host_id for those samples |
host.host_sex | optional | Sex of the host as defined in PATO; expected values are male or female |
host.host_age | optional | Age of the host as number (e.g. 50) |
host.host_age_unit | optional | Unit of host age e.g. http://purl.obolibrary.org/obo/UO_0000036 |
host.host_health_status | optional | A condition or state at a particular time, must be one of the following (obo:NCIT_C115935 obo:NCIT_C3833 obo:NCIT_C25269 obo:GENEPIO_0002020 obo:GENEPIO_0001849 obo:NCIT_C28554 obo:NCIT_C37987) |
host.host_treatment | optional | Process in which the act is intended to modify or alter host status |
host.host_vaccination | optional | List of vaccines given to the host |
host.ethnicity | optional | Ethnicity of the host e.g. http://purl.obolibrary.org/obo/HANCESTRO_0010 |
host.additional_host_information | optional | Field for additional host information |
sample.sample_id | required | Id of the sample as defined by the submitter |
sample.collection_date | required | Date when the sample was taken |
sample.collection_location | required | Geographical location where the sample was collected as wikidata reference, e.g. http://www.wikidata.org/entity/Q148 (China) |
sample.collector_name | optional | Name of the person that took the sample |
sample.collecting_institution | optional | Institute that was responsible for sampling |
sample.specimen_source | optional | Method by which the specimen was derived, as NCIT IRI, e.g. http://purl.obolibrary.org/obo/NCIT_C155831 (=nasopharyngeal swab). There can be more than one source of specimen |
sample.sample_storage_conditions | optional | Information about storage of a specified type, e.g. frozen specimen, paraffin, fresh .... |
sample.additional_collection_information | optional | Additional comments about the circumstances under which the sample was taken |
sample.source_database_accession | optional | If the data is deposited in a public resource (e.g. Genbank, ENA), enter the accession ID here. Please use a resolvable URL (e.g. http://identifiers.org/insdc/LC522350.1#sequence) |
virus.virus_species | required | The name of virus species from the NCBI taxonomy database, e.g. http://purl.obolibrary.org/obo/NCBITaxon_2697049 for Severe acute respiratory syndrome coronavirus 2 |
virus.virus_strain | optional | Name of the virus strain |
technology.sample_sequencing_technology | optional | Technology that was used to sequence this sample (e.g. Sanger, Nanopore MinION) |
technology.sequence_assembly_method | optional | Protocol which provides instructions on the alignment of sequencing reads to a reference genome |
technology.sequencing_coverage | optional | Sequence coverage, defined as the average number of reads representing a given nucleotide (e.g. [100]). If multiple technologies were used, multiple float values can be submitted, e.g. [100, 20] |
technology.additional_technology_information | optional | Field for additional technology information |
submitter.authors | required | Name(s) of the author(s) |
submitter.submitter_name | optional | Name of the submitter(s) |
submitter.submitter_address | optional | Address of the submitter |
submitter.originating_lab | optional | Name of the laboratory that took the sample |
submitter.lab_address | optional | Address of the laboratory where the sample was taken |
submitter.provider_sample_id | optional | |
submitter.submitter_sample_id | optional | |
submitter.publication | optional | Reference to publication of this sample (e.g. DOI, pubmed ID, ...) |
submitter.submitter_orcid | optional | ORCID of the submitter as a full URI, e.g. https://orcid.org/0000-0002-1825-0097 |
submitter.additional_submitter_information | optional | Field for additional submitter information |
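The required fields in the table above can be checked before submitting. The sketch below is illustrative only and is not the uploader's own validation, which is driven by the schema file. It assumes that a dotted name such as `host.host_species` corresponds to a nested mapping (`host` containing `host_species`) once the YAML file has been loaded into a Python dictionary. The helper names and the example values (including the `id` value and the date format) are placeholders.

```python
# Fields marked "required" in the table above.
REQUIRED_FIELDS = [
    "id",
    "host.host_species",
    "sample.sample_id",
    "sample.collection_date",
    "sample.collection_location",
    "virus.virus_species",
    "submitter.authors",
]


def lookup(metadata, dotted_path):
    """Follow a dotted path through nested dicts; return None if absent."""
    current = metadata
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def missing_required(metadata):
    """Return the required fields that are absent or empty."""
    return [p for p in REQUIRED_FIELDS if lookup(metadata, p) in (None, "", [])]


# Placeholder metadata, assumed to be the result of loading a YAML file.
example = {
    "id": "placeholder",
    "host": {"host_species": "http://purl.obolibrary.org/obo/NCBITaxon_9606"},
    "sample": {
        "sample_id": "sample-001",
        "collection_date": "2020-01-01",  # date format is an assumption
        "collection_location": "http://www.wikidata.org/entity/Q148",
    },
    "virus": {"virus_species": "http://purl.obolibrary.org/obo/NCBITaxon_2697049"},
    "submitter": {"authors": ["Jane Doe"]},
}

if __name__ == "__main__":
    print(missing_required(example))
```

This only tests for presence. It does not check that values are valid ontology IRIs or that they match the types defined in the schema.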
Sequence File Format
Sequence read files should be in FASTA or FASTQ format. The maximum file size allowed for upload is 512 MB.
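A file can be checked locally against these two constraints before uploading. The sketch below is an assumption-laden illustration, not the uploader's own check: it treats 512 MB as 512 × 1024 × 1024 bytes, and it guesses the format from the first non-blank character (`>` for FASTA, `@` for FASTQ), which is a common convention rather than a full format validation. The function names are hypothetical.

```python
import os

# Assumption: "512 MB" is interpreted as 512 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 512 * 1024 * 1024


def guess_sequence_format(path):
    """Guess "fasta" or "fastq" from the first non-blank line, else None."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue  # skip leading blank lines
            if stripped.startswith(">"):
                return "fasta"
            if stripped.startswith("@"):
                return "fastq"
            return None  # first content line matches neither format
    return None  # empty file


def check_sequence_file(path):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if os.path.getsize(path) > MAX_UPLOAD_BYTES:
        problems.append("file is larger than 512 MB")
    if guess_sequence_format(path) is None:
        problems.append("file does not look like FASTA or FASTQ")
    return problems
```

Compressed files are not handled by this sketch and would be reported as not looking like FASTA or FASTQ.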
SPARQL
To make the metadata available for querying through a SPARQL endpoint, the uploader converts the metadata into RDF at upload time. A single RDF resource is compiled and linked against external resources such as NCBITaxon, PATO, CHEBI and Wikidata. The generated RDF file can be hosted in any triple store and queried using SPARQL.
The structure of the RDF resource is described in SHACL, and the schema file is available here.
The uploader web interface provides an interactive SPARQL query editor and example queries for the submitted data, such as listing all submissions, listing submissions for the SARS-CoV-2 virus, and showing the details of a specific submission.
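Outside the web editor, a SPARQL endpoint can typically be queried over HTTP using the standard SPARQL protocol. The sketch below only builds such a request and does not send it. The endpoint URL is a placeholder, since the actual endpoint address is not given here, and the query is a generic triple listing rather than one of the portal's example queries, whose predicates depend on the SHACL schema.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; replace with the real SPARQL endpoint URL.
SPARQL_ENDPOINT = "https://example.org/sparql"

# Generic query that lists a few triples; it makes no assumption
# about the predicates used in the uploader's RDF.
GENERIC_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(query, endpoint=SPARQL_ENDPOINT):
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    request = build_sparql_request(GENERIC_QUERY)
    print(request.full_url)
```

Passing the resulting request to `urllib.request.urlopen` would send it; whether a given triple store returns JSON for this `Accept` header depends on the store.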
Accessing COVID-19 Pangenome Analysis Results
Our pangenome analysis service runs the analysis over all the uploaded sequences twice a day. It then syncs the latest results to BORG/CBRC's public Galaxy server.
The CBRC's Galaxy instance provides a user-friendly interface for the analysis of bioscience and biomedical data, using a wide variety of tools and algorithms developed by an international team of experts and researchers. There is a shared data library for COVID-19 pangenome analysis on our Galaxy instance. Here is the link to the latest pangenome analysis results.