BioMicroCenter:IlluminaDataPipeline
== Basics ==

We are currently using the Illumina Pipeline version RTA 2.8.0 / OLB 1.8.0 / CASAVA 1.7.0.

An older version of the pipeline manual (v1.0) is available [[Media:Pipeline v1.0 User Guide.pdf | HERE]].

The Genome Analyzer Pipeline Software (Pipeline) is a set of utilities designed to perform a complete data analysis of a sequencing run. It is supplied as source code and scripts. Data analysis consists of three steps: image analysis, base calling, and sequence analysis.

'''1. Image analysis''' ''(Firecrest)'': Uses the raw TIF files to locate clusters on the image, and outputs the intensity, X,Y position, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling.

'''2. Base calling''' ''(Bustard)'': Uses the cluster intensities and noise estimates to output the sequence of bases read from each cluster, along with a confidence level for each base.

'''3. Sequence analysis''' ''(Gerald and Eland)'': Allows for alignment to a reference sequence, filtering of data based on predefined criteria, and visualization of the results.

=== Base Calling ===

''Firecrest'' is the module used for image analysis. Firecrest identifies cluster positions and extracts intensities. Through image filtering, it sharpens and enhances clusters, removes background noise, and detects clusters based on morphological features in the image. Firecrest also adjusts the scale and registration of each image. Firecrest currently runs in real time alongside the sequencing process on a dedicated IPARR server as part of the 1.0 pipeline.

''Bustard'' is the module used for base calling. Bustard deconvolves the signal from the clusters and applies corrections for cross-talk, phasing, and prephasing.

* Frequency cross-talk: The Genome Analyzer uses two lasers and four filters to detect four dyes, one attached to each of the four types of nucleotide. The emission spectra of these four dyes overlap, so the four images are not independent. The frequency cross-talk is deconvolved using a frequency cross-talk matrix.
* Phasing/Prephasing: Depending on the efficiency of the fluidics and the sequencing reactions, a small number of molecules in each cluster may run ahead of (prephasing) or fall behind (phasing) the current incorporation cycle. This effect is mitigated by applying corrections during the base calling step.
* All of these corrections are based on an assumption of '''equal base frequency.''' For this reason, Illumina recommends the inclusion of the PhiX control sample in all runs (7+1). Work is currently under way at the Broad Institute to create defined spiked-in DNA templates with equal base frequency that can act as control reads; however, this has not yet been incorporated into the pipeline.

=== Alignment ===

''Generation of Recursive Analyses Linked by Dependency (GERALD)'' is the module used for sequence alignment, filtering, and data visualization. The following two alignment programs work within the GERALD module:

* ''Efficient Large-Scale Alignment of Nucleotide Databases (ELAND)'' is very fast and aligns reads with up to two errors relative to the reference within the first 32 bases. This algorithm is used for any reference larger than 100 kb.
* ''PhageAlign'' does an exhaustive alignment (all possible alignments up to arbitrary edit distances), but is slow.

=== Software Options ===

While ELAND is extremely fast, it suffers from some significant deficiencies. The largest is its lack of tolerance for errors. Read failure is typically a function of length, and it is likely that many nucleotides will have been read successfully before phasing/prephasing or some other error becomes large enough to cause read failure. However, because ELAND is an all-or-nothing algorithm, it cannot recover the usable 'short' portion of such a read. Numerous researchers have made significant efforts to create improved versions of ELAND.
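The all-or-nothing behaviour described above can be illustrated with a small sketch. This is not ELAND's actual implementation; it only assumes the rule stated in the Alignment section (at most two errors within the first 32 bases), and the function names <code>seed_mismatches</code>, <code>seed_aligns</code>, and <code>mutate</code> are hypothetical helpers introduced for illustration.

```python
SEED_LENGTH = 32     # bases considered, per the Alignment section
MAX_MISMATCHES = 2   # errors tolerated, per the Alignment section

# Hypothetical helper: substitute each base with a different one.
_SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mutate(seq, positions):
    """Return a copy of seq with a substitution error at each position."""
    bases = list(seq)
    for pos in positions:
        bases[pos] = _SWAP[bases[pos]]
    return "".join(bases)


def seed_mismatches(read, reference, offset):
    """Count mismatches between the read's seed and the reference at offset.

    Returns None when the seed would run past the end of the reference.
    """
    seed = read[:SEED_LENGTH]
    window = reference[offset:offset + len(seed)]
    if len(window) < len(seed):
        return None
    return sum(1 for a, b in zip(seed, window) if a != b)


def seed_aligns(read, reference, offset):
    """Accept the placement only if the whole seed is within tolerance."""
    count = seed_mismatches(read, reference, offset)
    return count is not None and count <= MAX_MISMATCHES


reference = "ACGT" * 10
read = reference[:36]

# Two errors anywhere in the seed are tolerated.
two_errors = seed_aligns(mutate(read, [5, 20]), reference, 0)
# Three errors at the end of the seed reject the read outright,
# even though its first 29 bases match perfectly.
late_failure = seed_aligns(mutate(read, [29, 30, 31]), reference, 0)
```

Under these assumptions <code>two_errors</code> is true and <code>late_failure</code> is false: the second read is discarded rather than aligned on its good prefix, which is the deficiency the alternative aligners try to address.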
Some of the options:

* '''MAQ''' http://maq.sourceforge.net/ - Maq maps short reads to a reference genome and calls the genotypes from the alignment. It is specifically designed for Illumina-Solexa/AB-SOLiD reads, not for 454 or capillary ones. Key facts about Maq: 1) Maq maps a repeat read randomly, and 2) it gives a probability score (mapping quality) to each alignment. More information is available on the MAQ page.
* '''Bowtie''' - The Burge Lab is currently analyzing the quality of reads mapped using Bowtie. Bowtie is believed to be MUCH faster than ELAND.
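The mapping quality that Maq attaches to each alignment is reported on a Phred-style scale, Q = -10 log10(probability that the read is wrongly placed). The following is a minimal sketch of that conversion, not code taken from Maq; the function names are hypothetical and chosen for illustration.

```python
import math


def error_prob_to_mapq(p_wrong):
    """Convert the probability that a placement is wrong into a Phred-scaled quality."""
    if not 0.0 < p_wrong <= 1.0:
        raise ValueError("p_wrong must be in the interval (0, 1]")
    return -10.0 * math.log10(p_wrong)


def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled mapping quality back into an error probability."""
    return 10.0 ** (-mapq / 10.0)


# A placement with a 1-in-1000 chance of being wrong scores about 30.
confident = error_prob_to_mapq(0.001)
# A quality of 20 corresponds to a 1% chance that the placement is wrong.
one_percent = mapq_to_error_prob(20)
```

Read this way, a randomly placed repeat read has an error probability close to 1 and therefore a quality close to 0, so such reads can be filtered downstream by applying a minimum mapping quality.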