
3KRG in AWS

To further enable use of the 3K RG dataset by the global rice community, we have released primary analysis results for variant discovery on the sequencing data, with 24 additional genomes included. All data generated by the 3K RGP are now publicly available online as an Amazon Web Services (AWS) Public Data Set. You can learn more about accessing the data at the 3000 Rice Genome on AWS webpage. The data include alignments of the 3,024 rice genome sequences to 5 published O. sativa genomes representing 3 major cultivated groups (aus, indica, and japonica), and the raw SNP calls from these alignments. The software pipeline, along with the parameters used in the analysis, is also available at the resource. In the current analysis, over 30 million variants (SNPs and small indels) were discovered from the 3,024 accessions sequenced, representing almost 10% of the total rice genome. These variants span all of the known and predicted rice genes, as well as the potential regulatory regions surrounding these genes. More in-depth analyses of this dataset could lead to inferences about novel alleles underlying important agronomic traits such as higher yield, stress tolerance (to pests and diseases), and resilience to climate change.

We have also developed two web applications that connect to AWS 3K RG. They provide easy access to particular genome regions of interest (e.g. an important gene for drought tolerance) across a selected set of accessions, visualize the region in a genome browser, and support further analyses on this subset of the data. These are:

SNP-Seek (Alexandrov et al. 2014). Several features of SNP-Seek get input data directly from the AWS 3kRG Public resource, as illustrated in the following examples.

IRIC Portal JBrowse


Figure 1.  JBrowse in SNP-Seek displaying BAM and VCF tracks from AWS 3kRG Public

The rice genome browser is an instance of JBrowse. The BAM track displays the alignments, while the SNP Coverage track displays the depth, that is, the number of aligning short sequences. The VCF track displays the detected SNPs and indels. These tracks read the BAM and VCF files from 3K RG S3.

3K RG get variant sequence tool

Figure 2. Downloading the variant sequence of a genome region (after SNP and small indel substitution)

The variant sequence of a genome region of interest is generated by converting the VCF from 3K RG S3 into a sequence: each reference allele is replaced with its alternate allele, and sequences are then inserted or deleted as read from the VCF.

The output is a modified FASTA file in which deletions are represented as gaps (-) of the same length and insertions are enclosed in brackets [].
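The substitution described above can be sketched in a few lines of Python. This is only an illustration of the output convention (gaps for deletions, brackets for insertions), not the actual SNP-Seek implementation. The function name `apply_variants`, the tuple layout `(pos, ref, alt)`, and the assumptions that variants do not overlap and that every alternate allele is applied are all hypothetical choices made for the example.

```python
def apply_variants(ref_seq, region_start, variants):
    """Apply VCF-style variants to a reference sequence.

    ref_seq      -- reference bases for the region (string)
    region_start -- 1-based genome coordinate of ref_seq[0]
    variants     -- iterable of (pos, ref, alt) tuples, pos is 1-based
                    (hypothetical layout, mirroring the POS/REF/ALT VCF columns)

    Deletions become '-' gaps of the same length; inserted bases are
    enclosed in brackets []. Assumes non-overlapping variants.
    """
    out = []
    cursor = 0  # 0-based offset of the next unprocessed reference base
    for pos, ref, alt in sorted(variants):
        offset = pos - region_start
        # Skip variants that overlap an earlier one or fall outside the region.
        if offset < cursor or offset + len(ref) > len(ref_seq):
            continue
        if ref_seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        out.append(ref_seq[cursor:offset])  # unchanged bases before the variant
        if len(ref) == len(alt):
            out.append(alt)  # SNP (or same-length substitution)
        elif len(ref) > len(alt) and ref.startswith(alt):
            out.append(alt + "-" * (len(ref) - len(alt)))  # deletion as gaps
        elif len(alt) > len(ref) and alt.startswith(ref):
            out.append(ref + "[" + alt[len(ref):] + "]")  # insertion in brackets
        else:
            out.append(ref)  # complex allele: left as reference in this sketch
        cursor = offset + len(ref)
    out.append(ref_seq[cursor:])  # remaining reference bases
    return "".join(out)


# Example: one SNP, one 2-bp deletion, one 2-bp insertion.
example = apply_variants(
    "ACGTACGTAC", 101,
    [(102, "C", "T"), (104, "TAC", "T"), (108, "T", "TGG")],
)
print(example)  # ATGT--GT[GG]AC
```

Because deletions are written as gaps, every reference base keeps its own column in the output, and only the bracketed insertions add length.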

3K RG bulk data download

Figure 3. Download of raw sequences, alignment (BAM), and variants (VCF) files 

The Bulk Download page of SNP-Seek lets you select from a list of accessions and generate the raw analysis files for all or selected varieties. The FastQ sequences can be downloaded from NCBI or EBI, and the BAM and VCF files from Amazon S3.

IRRI Galaxy (Mauleon et al. 2013). This is a specialized deployment of the Galaxy bioinformatics workbench, with rice-specific shared datasets (e.g. 5 rice reference genomes, rice SNP assay platforms) and custom tools for calling SNPs from SNP assay machine output, designing rice SNP assays, and interconverting SNP data formats for various genetic and population analyses.

IRRI Galaxy Get unique ID in 3K RG accessions

Figure 4. Retrieving the list of 3KRG accessions in 3KRG S3 using IRRI Galaxy 

The unique IDs of the rice accessions stored in the 3kRG S3 bucket can be retrieved using the tool <Get List of 3K Unique ID>. The unique ID is required to unambiguously identify the accession you wish to study further.

IRRI Galaxy Get subset VCF tool


Figure 5. VCF retrieval from 3KRG S3 using IRRI Galaxy.

<Get Subset of 3K VCF> is a tool for extracting subset VCF files from the specified accessions’ VCF files stored in the 3KRG S3 bucket. The tool needs the unique ID of the accession, the reference genome to which it was aligned, and the start and end base-pair positions. Currently, the distance from the start position to the end position is limited to 60,000 bp. Once a subset VCF is extracted, it can be converted to other analysis-ready formats such as HapMap, using the <Convert VCF to Hapmap> tool, for use by other analysis software.
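For readers who already have a VCF file locally, the region extraction can be approximated with a minimal Python sketch. It is not the code behind the Galaxy tool: the function name `subset_vcf`, the chromosome label `chr01`, and the reading of the limit as `end - start <= 60000` are assumptions made for illustration.

```python
MAX_SPAN_BP = 60_000  # distance limit stated for the Galaxy tool


def subset_vcf(lines, chrom, start, end, max_span=MAX_SPAN_BP):
    """Return header lines plus records on `chrom` with start <= POS <= end.

    `lines` is any iterable of VCF text lines (for example an open file).
    Positions are 1-based and inclusive, as in the VCF POS column.
    """
    if start < 1 or end < start:
        raise ValueError("start must be >= 1 and end must be >= start")
    if end - start > max_span:
        # Assumed interpretation of the 60,000 bp limit.
        raise ValueError(f"region longer than {max_span} bp")
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # keep meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            kept.append(line)
    return kept


# Tiny in-memory example; the chromosome name "chr01" is hypothetical.
vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr01\t100\t.\tA\tG\n",
    "chr01\t5000\t.\tC\tT\n",
    "chr02\t150\t.\tG\tA\n",
]
for record in subset_vcf(vcf_lines, "chr01", 1, 1000):
    print(record, end="")
```

Scanning line by line keeps the sketch free of dependencies. For large, indexed VCF files a tabix-based query would be the more usual choice.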

VCF retrieved

IRRI Galaxy convert VCF to hapmap

Figure 6. Convert VCF to HapMap format using IRRI Galaxy 

