Examining Data on the NCBI SRA Database
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How do I work with public data in the NCBI SRA?
Objectives
Be aware that public genomic data is available
Understand how to access and download this data
In our experiments we usually generate our own genomic data, but many analyses also rely on publicly available data, whether as a reference, as a comparison for your results, or as a source of annotation. You may also want to do a full project or set of analyses using only publicly available data. Public data is a valuable, and often essential, resource for genomic data analysis.
There are many repositories for public data. Some model organisms or fields have specific databases, and there are ones for particular types of data. Two of the most comprehensive are the National Center for Biotechnology Information (NCBI) and the European Nucleotide Archive (ENA, at EMBL-EBI). In this lesson we’re working with the NCBI database, but the general process is the same for any database.
Accessing the original archived data
The sequencing dataset (from the Lenski paper) adapted for this lesson was obtained from the NCBI Sequence Read Archive (SRA), a large (>3 quadrillion basepairs as of 2014) repository for next-generation sequence data. Like many NCBI databases, it is complex, and mastering its use is beyond the scope of this lesson. Very often, as in the Lenski paper, there will be a direct link (perhaps in the supplemental information) to where the dataset can be found on the SRA. For example, the link from the Lenski paper is http://www.ncbi.nlm.nih.gov/sra?term=SRA026813
Locate the Run Accessory Table for the Lenski Dataset on the SRA
- Access the Lenski dataset from the provided link: http://www.ncbi.nlm.nih.gov/sra?term=SRA026813. You will be presented with a page for the overall SRA accession SRA026813 - this is a collection of all the experimental data.
- Click on the first entry (ZDB30). This will take you to a page for an SRX (Sequence Read eXperiment). Take a few minutes to examine some of the descriptions on the page.
- Click on the ‘All runs’ link under where it says Study. This is a description of all of the NGS datasets related to the experiment.
- Go to the top of the page and in the Total row you will see there are 37 runs, 10.15Gb data, and 16.45 Gbases of data. Click the ‘RunInfo Table’ button and save the file locally.
We are not downloading any actual sequence data here! This is only a text file that fully describes the entire dataset.
You should now have a file called SraRunTable.txt
Review the SraRunTable in a spreadsheet program
Using your choice of spreadsheet program, open the SraRunTable.txt file. If you are prompted for a file format, choose tab-delimited (.tsv).
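If you prefer to inspect the table without a spreadsheet program, the same tab-delimited file can be read with a few lines of Python. This is only a minimal sketch: the function name `read_run_table` is made up for illustration, and the inline sample uses placeholder column names and values rather than the real contents of SraRunTable.txt.

```python
import csv
import io


def read_run_table(handle):
    """Parse a tab-delimited run table into a list of dicts, one per row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


# Tiny inline sample standing in for SraRunTable.txt. The column names and
# values are hypothetical placeholders, not the real table contents.
sample = "Run\tPlatform\nRUN_A\tPLATFORM_X\nRUN_B\tPLATFORM_X\n"
rows = read_run_table(io.StringIO(sample))
print(len(rows), "rows; columns:", list(rows[0].keys()))

# With the real file (assuming it was saved in the working directory):
# with open("SraRunTable.txt", newline="") as handle:
#     rows = read_run_table(handle)
```

Reading the file this way never writes to it, so the downloaded table stays unchanged.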
Discussion
Discuss with the person next to you:
- What strain of E. coli was used in this experiment?
- What was the sequencing platform used for this experiment?
- What samples in the experiment contain paired end sequencing data?
- What other kind of data is available?
- Are you collecting this kind of information about your sequencing runs?
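Questions like these can also be answered by counting the distinct values in a column. The sketch below assumes the table has already been loaded as a list of dictionaries, one per run. The column names `Run` and `LibraryLayout` and all values are assumptions used for illustration, so check the header row of your own SraRunTable.txt for the actual names.

```python
from collections import Counter


def count_values(rows, column):
    """Count how often each value occurs in one column of the table."""
    return Counter(row[column] for row in rows)


# Hypothetical rows standing in for the real run table; the column names
# and values are placeholders, not data from the Lenski dataset.
rows = [
    {"Run": "RUN_A", "LibraryLayout": "SINGLE"},
    {"Run": "RUN_B", "LibraryLayout": "PAIRED"},
    {"Run": "RUN_C", "LibraryLayout": "PAIRED"},
]

layout_counts = count_values(rows, "LibraryLayout")
paired_runs = [row["Run"] for row in rows if row["LibraryLayout"] == "PAIRED"]

print(layout_counts)
print(paired_runs)
```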
After answering the questions, close the file without saving it, since we don’t want to make any changes. If you do save the file, make sure you save it as a plain .txt file.
Where to learn more
About the Sequence Read Archive
- You can learn more about the SRA by reading the SRA Documentation
- The best way to transfer a large SRA dataset is by using the SRA Toolkit
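As a rough illustration of how the run table connects to the SRA Toolkit, the sketch below builds one `prefetch` command per run accession. `prefetch` is the toolkit's download command. The helper name `build_prefetch_commands` and the accession values are hypothetical placeholders. The script only prints the commands and does not download anything.

```python
import shutil


def build_prefetch_commands(run_accessions):
    """Return one SRA Toolkit `prefetch` command (as an argument list) per run."""
    return [["prefetch", accession] for accession in run_accessions]


# Placeholder accessions; in practice these would come from the run
# accession column of SraRunTable.txt.
runs = ["RUN_A", "RUN_B"]
commands = build_prefetch_commands(runs)

for command in commands:
    print(" ".join(command))

# Only report whether the toolkit is installed; nothing is executed here.
if shutil.which("prefetch") is None:
    print("prefetch was not found on PATH; install the SRA Toolkit to run these commands")
```

Each argument list could be passed to `subprocess.run` once the toolkit is installed, but keep in mind that real sequence downloads can be large.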
References
Blount, Z.D., Barrick, J.E., Davidson, C.J., Lenski, R.E.
Genomic analysis of a key innovation in an experimental Escherichia coli population (2012) Nature, 489 (7417), pp. 513-518.
Paper, Supplemental materials
Data on NCBI SRA: http://www.ncbi.nlm.nih.gov/sra?term=SRA026813
Key Points
Public data repositories are a great source of genomic data.