3KRG in AWS

To further enable the use of the 3K RG dataset by the global rice community, we have released the primary analysis results for variant discovery on the sequencing data, now including 24 additional genomes. All data generated by the 3K RGP are now publicly available online as an Amazon Web Services (AWS) Public Data Set. You can learn more about accessing the data on the 3000 Rice Genome on AWS webpage. The manifest for the alignment files can also be used to download the files directly from AWS Public Data.

The data include alignments of the 3,024 rice genome sequences to 5 published O. sativa genomes representing 3 major cultivated groups (aus, indica, and japonica), and the raw SNP calls from these alignments. The software pipeline, along with the parameters used in the analysis, is also available at the resource. In the current analysis, over 30 million variants (SNPs and small indels) were discovered from the 3,024 sequenced accessions, representing almost 10% of the total rice genome. These variants span all of the known and predicted rice genes, as well as the potential regulatory regions surrounding these genes. More in-depth analyses of this dataset could lead to inferences about novel alleles that are causal for important agronomic traits such as higher yield and stress tolerance (to pests and diseases, and resilience to climate change).

We have also developed SNP-Seek (Alexandrov et al. 2014), which can connect to the AWS 3K RG data. It enables users to easily access particular genome regions of interest (e.g. an important gene for drought tolerance) across a selected set of accessions, visualize the region in a genome browser, and conduct further analyses on this subset of the data, as illustrated in the following examples.


Figure 1.  JBrowse in SNP-Seek displaying BAM and VCF tracks from AWS 3kRG Public

The rice genome browser is an instance of JBrowse. The BAM track displays the alignments, while the SNP Coverage track displays the depth, i.e. the number of aligned short sequences at each position. The VCF track displays the detected SNPs and indels. These tracks read the BAM and VCF files from 3K RG S3.
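To make the notion of depth concrete, the following is a minimal Python sketch that counts how many aligned reads cover each position of a region. It is an illustration only, not how JBrowse computes its coverage track: the function name `depth_per_position` and the representation of reads as simple (start, end) pairs are assumptions made for this example, and gaps or clipping inside an alignment are ignored.

```python
def depth_per_position(reads, region_start, region_end):
    """Return the read depth at each position of a region.

    reads: iterable of (start, end) pairs, 1-based and inclusive
           (a simplified stand-in for BAM alignments; hypothetical input).
    region_start, region_end: 1-based inclusive bounds of the region.
    """
    length = region_end - region_start + 1
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        s = max(start, region_start)  # clip the read to the region
        e = min(end, region_end)
        if s > e:
            continue  # read lies entirely outside the region
        diff[s - region_start] += 1
        diff[e - region_start + 1] -= 1

    # A running sum of the difference array gives the depth per position.
    depth = []
    running = 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


# Three toy reads over positions 1..6.
print(depth_per_position([(1, 4), (3, 6), (5, 5)], 1, 6))
```

In practice the read intervals would come from a BAM file; the sketch only shows what the depth value at each position means.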

Figure 2. Downloading the variant sequence of a genome region (after SNP and small indel substitution)

The variant sequence of a genome region of interest is generated by converting the VCF from 3K RG S3 into a sequence: each reference allele is replaced with the alternate allele, and sequences are inserted or deleted as read from the VCF.

The output is a modified FASTA file in which deletions are represented as gaps (-) of the same length and insertions are enclosed in brackets [].
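The conversion described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SNP-Seek implementation: the function name `variant_sequence` and the (position, REF, ALT) tuple input are hypothetical, indels are assumed to be left-anchored as is usual in VCF (the shorter allele is a prefix of the longer one), and variants that overlap an earlier one or fall outside the region are simply skipped.

```python
def variant_sequence(ref_seq, region_start, variants):
    """Apply variants to a reference region and return the marked-up sequence.

    ref_seq: reference bases of the region.
    region_start: 1-based genome coordinate of ref_seq[0].
    variants: iterable of (pos, ref, alt) with 1-based VCF-style positions
              (hypothetical simplified input, one ALT allele per record).
    Deletions become gaps (-) of the same length; insertions go in [].
    """
    out = []
    cursor = 0  # 0-based index of the next unprocessed reference base
    for pos, ref, alt in sorted(variants):
        idx = pos - region_start
        if idx < cursor or idx + len(ref) > len(ref_seq):
            continue  # overlaps a previous variant or leaves the region
        if ref_seq[idx:idx + len(ref)].upper() != ref.upper():
            raise ValueError(f"REF allele mismatch at position {pos}")
        out.append(ref_seq[cursor:idx])  # unchanged bases before the variant
        if len(ref) == len(alt):
            out.append(alt)  # substitution (SNP or same-length change)
        elif len(ref) > len(alt):
            # Deletion: keep the anchor bases, pad removed bases with gaps.
            out.append(alt + "-" * (len(ref) - len(alt)))
        else:
            # Insertion: anchor bases, then the inserted bases in brackets.
            out.append(alt[:len(ref)] + "[" + alt[len(ref):] + "]")
        cursor = idx + len(ref)
    out.append(ref_seq[cursor:])  # remaining unchanged bases
    return "".join(out)


# Toy region starting at coordinate 101: one SNP, one deletion, one insertion.
toy_variants = [(102, "C", "T"), (104, "TAC", "T"), (108, "T", "TGG")]
print(variant_sequence("ACGTACGTAC", 101, toy_variants))  # ATGT--GT[GG]AC
```

Padding deletions with gaps keeps the positions after a deletion aligned with the reference, while the brackets make inserted bases easy to strip out or count separately.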

Figure 3. Download of raw sequence, alignment (BAM), and variant (VCF) files

The Bulk Download page of SNP-Seek allows selection from a list of accessions, and the raw analysis files for all or selected varieties can then be generated. The FASTQ sequences can be downloaded from NCBI or EBI, and the BAM and VCF files from Amazon S3.