"It's very rash to say that things are beyond the scope of science."
-Francis Crick





SNP Detection on the Illumina/Solexa Platform

The Illumina Genome Analyzer system generates 30-50 gigabases of sequence per run, in single-end or paired-end reads (typically 36bp, 50bp, or 75bp long). Currently, Illumina platforms represent more than 50% of the next-generation sequencing (NGS) market. A few key characteristics of Illumina/Solexa sequencing present unique challenges for SNP detection:

  1. Sequencing error rate. Sequencing errors on the Illumina platforms occur at rates of 0.5% to 2.5% in typical datasets - an order of magnitude higher than traditional capillary-based sequencing. Read position is also correlated with sequencing error rate, with errors more prevalent toward the ends of reads.
  2. Volume of data. The throughput of Illumina sequencers is astonishing. Sequencing a whole human genome requires just a few runs to achieve haploid coverage of 20-30x or more. Targeted resequencing, or whole-genome sequencing of smaller organisms, can yield read depths of 100x-5000x per base.
  3. Short read alignment. Mapping the typically short sequencing reads to a reference genome is a critical but challenging step required for virtually all analyses. Placing short sequences onto complex reference sequences is not only difficult but also computationally intensive.
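
The coverage figures above follow from simple arithmetic: haploid coverage is the number of usable bases sequenced divided by the genome size. The sketch below illustrates this under stated assumptions: the 3.1 gigabase human genome size, the 40 gigabase run yield (the midpoint of the 30-50 gigabase range quoted above), and the function and parameter names are illustrative choices, not values from Illumina documentation.

```python
import math

# Assumed approximate haploid human genome size, in bases (illustrative).
HUMAN_GENOME_BASES = 3.1e9


def haploid_coverage(bases_sequenced, genome_size=HUMAN_GENOME_BASES):
    """Average number of times each genome position is covered."""
    return bases_sequenced / genome_size


def runs_needed(target_coverage, bases_per_run,
                genome_size=HUMAN_GENOME_BASES, usable_fraction=1.0):
    """Whole runs required to reach a target average coverage.

    usable_fraction is a hypothetical knob for the share of bases that
    survive filtering and mapping (1.0 means every base is usable).
    """
    usable_per_run = bases_per_run * usable_fraction
    return math.ceil(target_coverage * genome_size / usable_per_run)


if __name__ == "__main__":
    per_run = 40e9  # assumed yield, midpoint of 30-50 gigabases per run
    print("coverage from one run: %.1fx" % haploid_coverage(per_run))
    print("runs for 30x:", runs_needed(30, per_run))
    print("runs for 30x at 80% usable:", runs_needed(30, per_run, usable_fraction=0.8))
```

Average coverage hides a good deal of variation from position to position, so a real project would usually plan for more sequence than this estimate suggests.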

Illumina Data File Handling

The Illumina image analysis software generates several data files per lane. The most important files are the Solexa FASTQ files, typically named s_*_sequence.txt. Single-end (fragment) lanes have a single .txt file (e.g. s_1_sequence.txt), while paired-end lanes have two matching .txt files (e.g. s_1_1_sequence.txt and s_1_2_sequence.txt). Typically, these files are converted to the similar but more conventional "Sanger" FASTQ format prior to alignment.
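
As a rough illustration of that conversion, the sketch below rewrites the quality line of a FASTQ record. It assumes the input qualities are Solexa-scaled scores encoded with an ASCII offset of 64, and that the target is Phred-scaled scores with an offset of 33. Some later pipeline versions write Phred-scaled scores at offset 64, in which case only the offset shift applies, so check which encoding your files use. The function names are hypothetical.

```python
import math

SOLEXA_OFFSET = 64  # assumed ASCII offset of the input quality characters
SANGER_OFFSET = 33  # ASCII offset used by Sanger FASTQ


def solexa_to_phred(solexa_score):
    """Convert one Solexa-scaled score to the nearest Phred-scaled score."""
    return int(round(10 * math.log10(10 ** (solexa_score / 10.0) + 1)))


def convert_quality_string(solexa_quals):
    """Re-encode a Solexa quality string as a Sanger quality string."""
    return "".join(
        chr(solexa_to_phred(ord(ch) - SOLEXA_OFFSET) + SANGER_OFFSET)
        for ch in solexa_quals
    )


def convert_record(record):
    """Convert a 4-line FASTQ record given as (header, sequence, plus, quality)."""
    header, sequence, plus, quality = record
    return (header, sequence, plus, convert_quality_string(quality))


if __name__ == "__main__":
    # Made-up read for demonstration only.
    example = ("@read_1", "ACGTAC", "+", "hhhh@;")
    print("\n".join(convert_record(example)))
```

The two scales agree closely at high quality and diverge only for low scores, which is why the conversion matters most for the error-prone ends of reads.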

Short Read Alignment with Illumina Data

There is now an assortment of short read aligners, both commercial and freely available, for mapping Illumina/Solexa reads to a reference sequence. One widely used tool is Maq, developed by Heng Li in the laboratory of Richard Durbin at the Wellcome Trust Sanger Institute. Maq usually maps 70-90% of sequences in a typical Illumina lane to a large (human-sized) genome in about 1-2 days. However, a new generation of ultra-fast short read aligners, most of which leverage the Burrows-Wheeler Transform algorithm for indexing the reference sequence, provides a significant gain in performance. BWA (also by Heng Li) and Bowtie, an aligner developed in the lab of Steven Salzberg at the University of Maryland, can achieve results comparable to Maq's in just a few hours per lane.
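
To give a feel for why a Burrows-Wheeler index speeds up read placement, here is a toy sketch of the transform and of exact-match counting by backward search. It is not how BWA or Bowtie are implemented: real aligners build the index from suffix arrays, store it in compressed form, and tolerate mismatches and gaps. The function names and the naive rotation sort are illustrative only and would not scale to a genome.

```python
def bwt(text):
    """Burrows-Wheeler Transform of text, using '$' as the end marker."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_index[c] = number of characters in the text that sort before c
    first_index = {}
    for position, ch in enumerate(sorted(bwt_string)):
        first_index.setdefault(ch, position)

    low, high = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in first_index:
            return 0
        # Naive rank queries; a real index precomputes these counts.
        low = first_index[ch] + bwt_string[:low].count(ch)
        high = first_index[ch] + bwt_string[:high].count(ch)
        if low >= high:
            return 0
    return high - low


if __name__ == "__main__":
    reference = "GATTACAGATTACA"  # made-up reference sequence
    index = bwt(reference)
    for read in ("GATTACA", "CAG", "GGG"):
        print(read, count_occurrences(index, read))
```

Each step of the search costs the same regardless of reference length once the rank counts are precomputed, so the time to look up a read depends on the read length rather than on the size of the genome.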

Assembly and SNP Detection

Even after reads have been mapped to a reference genome, SNP detection remains a challenge. In regions of low sequencing coverage or where mapping short reads is difficult, false negatives are a substantial concern. Roughly 10x-20x haploid coverage of a given position is needed for heterozygous variants to be detected reliably. In regions of sufficient or high sequencing coverage, false positives are the challenge. These can arise from sequencing artifacts as well as read misalignment. Sophisticated algorithms are required to call SNPs with high sensitivity while minimizing false positive calls.
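
The link between coverage and false negatives can be illustrated with a simple binomial model. The sketch below assumes that each read covering a heterozygous site carries the variant allele with probability 0.5, and that a caller needs some minimum number of variant-supporting reads before reporting a SNP. Both the model and the thresholds are simplifying assumptions for illustration, not the method of any particular SNP caller, and they ignore sequencing error, mapping bias, and variation in depth.

```python
from math import comb


def prob_het_detected(depth, min_variant_reads, allele_fraction=0.5):
    """Probability that at least min_variant_reads of depth reads show the variant."""
    prob_missed = sum(
        comb(depth, k) * allele_fraction ** k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_variant_reads)
    )
    return 1.0 - prob_missed


if __name__ == "__main__":
    # Hypothetical caller threshold: at least 3 reads must support the variant.
    for depth in (4, 10, 20, 30):
        print("depth %2dx: P(detect) = %.4f" % (depth, prob_het_detected(depth, 3)))
```

Under this model the chance of missing a heterozygous site falls quickly as depth rises, while stricter thresholds, which help suppress false positives from sequencing errors, push the required depth higher.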

Copyright 2008 by Erudite Systems, Inc.