SNP Detection on the Illumina/Solexa Platform
The Illumina Genome Analyzer system generates 30-50 gigabases of sequence per run,
in single-end or paired-end reads (typically 36bp, 50bp, or 75bp long). Currently, Illumina
platforms represent more than 50% of the next-generation sequencing (NGS) market. A few key characteristics
of Illumina/Solexa sequencing present unique challenges for SNP detection:
- Sequencing error rate. Sequencing errors on the Illumina platforms occur at rates of
0.5% to 2.5% in typical datasets, an order of magnitude higher than traditional capillary-based sequencing. The error
rate also depends on read position, with errors more prevalent toward the ends of reads.
- Volume of data. The throughput of Illumina sequencers is astonishing. Whole-genome sequencing
of a human genome requires just a few runs to achieve haploid coverage of 20-30x or more. Targeted resequencing, or
whole-genome sequencing of smaller organisms, can yield read depths of 100x-5000x per base.
- Short read alignment. Mapping the typically short sequencing reads to a reference
genome is a critical but challenging step required for virtually all analyses. Placing short sequences
onto complex reference sequences is not only difficult but also computationally intensive.
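The throughput and coverage figures above are related by simple arithmetic: average haploid coverage is the number of usable bases divided by the genome length. The following is a minimal Python sketch of that calculation. The genome size of 3.1 gigabases and the function names are assumptions made for illustration, and the estimate ignores duplicate reads and uneven coverage.

```python
import math

# Assumed approximate haploid human genome size (illustrative value).
HUMAN_GENOME_BP = 3.1e9


def haploid_coverage(total_bases, genome_size, mapped_fraction=1.0):
    """Average depth = usable (mapped) bases / genome length."""
    return total_bases * mapped_fraction / genome_size


def runs_needed(target_coverage, bases_per_run, genome_size, mapped_fraction=1.0):
    """Number of whole runs required to reach a target average coverage."""
    per_run = haploid_coverage(bases_per_run, genome_size, mapped_fraction)
    return math.ceil(target_coverage / per_run)


if __name__ == "__main__":
    for run_gb in (30, 50):
        depth = haploid_coverage(run_gb * 1e9, HUMAN_GENOME_BP)
        runs = runs_needed(30, run_gb * 1e9, HUMAN_GENOME_BP)
        print(f"{run_gb} Gb per run: about {depth:.1f}x per run, "
              f"{runs} run(s) for 30x")
```

Setting mapped_fraction below 1.0 (for example 0.8) gives a more conservative estimate, since not every read maps to the reference.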
Illumina Data File Handling
The Illumina image analysis software generates several data files per lane. The most important files are the Solexa FASTQ
files, typically named s_*_sequence.txt. Single-end (fragment) lanes have a single .txt file (e.g. s_1_sequence.txt),
while paired-end lanes have two matching .txt files (e.g. s_1_1_sequence.txt and s_1_2_sequence.txt). Typically, these
files are converted to the similar but more conventional "Sanger" FASTQ format prior to alignment.
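As an illustration of what that conversion involves, here is a minimal Python sketch of the quality-score remapping. It assumes the original Solexa encoding, in which quality characters carry an offset of 64 and scores are on the Solexa (odds-based) scale, while Sanger FASTQ uses Phred scores with an offset of 33. The function names are hypothetical. Files from later Illumina pipeline versions already use Phred scores with offset 64, in which case only the offset needs to change.

```python
import math

SOLEXA_OFFSET = 64  # assumed offset of the input quality characters
SANGER_OFFSET = 33  # offset used by Sanger FASTQ


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scale quality score to a Phred-scale score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def convert_quality_line(qualities):
    """Re-encode one quality string from Solexa to Sanger format."""
    converted = []
    for ch in qualities:
        q_solexa = ord(ch) - SOLEXA_OFFSET
        q_phred = int(round(solexa_to_phred(q_solexa)))
        converted.append(chr(q_phred + SANGER_OFFSET))
    return "".join(converted)


def convert_record(record):
    """Convert a 4-tuple (header, sequence, separator, qualities)."""
    header, sequence, separator, qualities = record
    return header, sequence, separator, convert_quality_line(qualities)


if __name__ == "__main__":
    # Made-up example record, not taken from a real lane.
    example = ("@read_1", "GATTACA", "+", "hhhh@@;")
    print("\n".join(convert_record(example)))
```

The two scales agree closely for high-quality bases and differ mainly below about Q10, so the conversion matters most for the low-quality bases that are most likely to produce false SNP calls.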
Short Read Alignment with Illumina Data
There is now an assortment of short read aligners, both commercial and freely available, for mapping Illumina/Solexa reads
to a reference sequence. One widely used tool is Maq, developed by Heng Li in the laboratory of Richard Durbin at the Wellcome
Trust Sanger Institute. Maq usually maps 70-90% of sequences in a typical Illumina lane to a large (human-sized) genome in
about 1-2 days. However, a new generation of ultra-fast short read aligners, most of which leverage the Burrows-Wheeler Transform
algorithm for indexing the reference sequence, provides a significant gain in performance. BWA (also by Heng Li) and Bowtie,
an aligner developed in the lab of Steven Salzberg at the University of Maryland, can achieve results comparable to Maq's in
just a few hours per lane.
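To give a sense of why a Burrows-Wheeler index makes lookups fast, here is a toy Python sketch that builds the transform of a short reference and counts exact occurrences of a read by backward search. It is an illustration only and not how BWA or Bowtie are implemented: real aligners use compressed indexes with precomputed occurrence tables, tolerate mismatches and gaps, and report positions rather than counts. The function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (toy, quadratic memory)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_string[:i]; real indexes precompute this.
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("GATTACATTA")
    for read in ("TTA", "ATT", "GG"):
        print(read, count_occurrences(index, read))
```

The search takes one step per base of the read regardless of reference length, which is the property that lets indexed aligners outpace approaches that scan or hash the reference for each read.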
Assembly and SNP Detection
Even after reads have been mapped to a reference genome, SNP detection remains a challenge. In regions of low sequencing
coverage or where mapping short reads is difficult, false negatives are a substantial concern. Roughly 10x-20x haploid
coverage of a given position is needed for heterozygous variants to be detected reliably. In regions of sufficient or high
sequencing coverage, false positives are the challenge. These can arise from sequencing artifacts as well as read misalignment.
Sophisticated algorithms are required to call SNPs with high sensitivity while minimizing false positive calls.
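The coverage requirement for heterozygous variants can be motivated with a simple sampling model, sketched below in Python. It assumes that each read covering a heterozygous site carries the variant allele with probability 0.5, that a call requires at least a fixed number of variant-supporting reads (min_variant_reads, a hypothetical threshold), and that depth varies from site to site as a Poisson distribution around the mean. Sequencing errors, mapping bias, and the logic of any particular SNP caller are not modeled.

```python
from math import comb, exp


def prob_detect_het(depth, min_variant_reads=2, allele_fraction=0.5):
    """P(at least min_variant_reads variant reads) at a fixed depth (binomial)."""
    p_miss = 0.0
    for k in range(min(min_variant_reads, depth + 1)):
        p_miss += (comb(depth, k)
                   * allele_fraction ** k
                   * (1 - allele_fraction) ** (depth - k))
    return 1.0 - p_miss


def prob_detect_het_at_mean_coverage(mean_depth, min_variant_reads=2,
                                     max_depth=200):
    """Average the detection probability over Poisson-distributed depth."""
    total = 0.0
    p_depth = exp(-mean_depth)  # Poisson probability of depth 0
    for depth in range(max_depth + 1):
        total += p_depth * prob_detect_het(depth, min_variant_reads)
        p_depth *= mean_depth / (depth + 1)  # step to Poisson P(depth + 1)
    return total


if __name__ == "__main__":
    for mean_depth in (5, 10, 20, 30):
        p = prob_detect_het_at_mean_coverage(mean_depth, min_variant_reads=3)
        print(f"mean coverage {mean_depth}x: P(detect het) = {p:.4f}")
```

Under this model, sensitivity rises steeply between low single-digit coverage and about 20x, and raising min_variant_reads to guard against sequencing errors pushes the required coverage higher.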