
Terms




    Fragment sequencing

    Also "shotgun sequencing". DNA is broken up randomly into numerous small overlapping segments. Adaptors are ligated to both ends. Fragments are preamplified, clonally amplified and sequenced from one end. Obtained reads are aligned to the reference genome or used for de-novo assembly.


  • Randomness of fragments. To obtain "random sequences" it is necessary to have "random fragments", "equal efficiency of adaptor ligation", "equal pre-amplification" and "equal clonal amplification":
    • current digestion procedures: hydrodynamic and ultrasonic shearing are generally considered random. DNase digestion and digestion with 2-bp recognition restriction enzymes are non-random, with a preference for some regions;
    • if the starting material is a collection of short fragments, then their end parts will be overrepresented;
    • adaptor ligation definitely has some sequence preference, especially for ligation of A-tailed fragments;
    • both preamplification and clonal amplification are sensitive to GC content and to secondary structure;


  • Length of fragments:
    • neither bridge amplification (Illumina) nor bead-based emulsion PCR (ePCR, SOLiD) amplifies overly long fragments. The limits are ~700bp for bridge amplification and ~300bp for bead ePCR;
    • DNA fragments should not be shorter than the read length: ~35bp for both Illumina and SOLiD;
    • shearing a DNA sample into shorter fragments increases the complexity of the library. Suppose that equal amounts of two DNA samples were sheared to mean sizes of ~500bp and ~50bp and libraries were prepared without any losses. Both libraries are suitable for Illumina sequencing, but the complexity of the first library is 10x lower than that of the second, because the same mass of DNA yields ten times fewer 500bp molecules;
    • hydrodynamic shearing is not efficient for dsDNA <1kbp;
    • ultrasonic shearing is not efficient for dsDNA <300bp;
    • ultrasonic shearing produces smaller fragments than hydrodynamic shearing (Hydroshear, nebulizer);
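The 500bp versus 50bp comparison above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming equal DNA mass in both libraries, no losses, and the approximate figure of 2 x 330 g/mol per base pair that the text uses in its library complexity estimate; the function name `molecules_per_microgram` is hypothetical.

```python
AVOGADRO = 6.022e23           # molecules per mole
BP_MASS_G_PER_MOL = 2 * 330   # approximate molar mass of one base pair of dsDNA


def molecules_per_microgram(mean_size_bp: float) -> float:
    """Number of dsDNA molecules in 1 microgram at a given mean fragment size."""
    grams = 1e-6
    molar_mass = BP_MASS_G_PER_MOL * mean_size_bp
    return grams / molar_mass * AVOGADRO


long_library = molecules_per_microgram(500)
short_library = molecules_per_microgram(50)
ratio = short_library / long_library
print(f"{ratio:.1f}")  # 10.0
```

The ratio depends only on the two mean sizes, so the absolute value assumed for the base-pair mass does not affect the 10x result.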


  • Sequence analysis:
    • a sequence cannot be aligned unambiguously if it is repeated several times in the reference genome, so any repeats longer than the read length are excluded from analysis with fragment libraries;
    • to reconstruct structural variations (insertions, deletions, inversions, duplications) it is necessary to recognize the sequence on both borders of the variation. A 35bp read is too short for this task: it is difficult to recognize and unambiguously align two fragments within it;
    • de-novo assembly results in ~500bp contigs;
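The ambiguity caused by repeats can be illustrated with a toy example. The sketch below uses exact string matching on a made-up reference, which is a strong simplification (real aligners tolerate mismatches and use indexed search); the names `find_all` and `is_uniquely_mappable` are hypothetical.

```python
from typing import List


def find_all(reference: str, read: str) -> List[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def is_uniquely_mappable(reference: str, read: str) -> bool:
    """A read is unambiguous only if it occurs exactly once."""
    return len(find_all(reference, read)) == 1


# Toy reference containing the 6-base repeat "GACGTA" twice.
reference = "TTGACGTACCCGACGTAGG"

print(find_all(reference, "GACGTA"))    # [2, 11] -> ambiguous
print(find_all(reference, "GACGTACC"))  # [2]     -> unique
```

The second read extends past the repeat into flanking sequence, which is why it maps to a single position.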



    Mate-Paired sequencing

    Also "pairwise end", "paired end", "double-barrel shotgun" sequencing. Normally, fragment length should be within some interval (for example 1.5±0.1kb).

  • Mate-Paired sequencing helps to solve two tasks:
    • mapping of repetitive sequences. Suppose that, for a particular MP-read, one end sequence is unique and the other may be mapped to a number of positions within the genome. Taken alone, the repetitive sequence can't be mapped unambiguously. Taken as part of an MP-read, it is mapped unambiguously if only one copy of the repeat is located within the fragment length interval from the unique sequence. A similar approach is used for mapping two repetitive end sequences: the known fragment length significantly restricts the possible map positions;
    • study of structural variants. An inversion is recognized as a disturbance in the orientation of the end sequences; an insertion/deletion as a significant deviation of the mapped length of the MP-read from the mean value; a translocation as the location of the end reads in unrelated positions in the genome;
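The repeat-mapping idea above can be sketched as a simple distance filter. This is an illustration only: it assumes both ends lie on the same chromosome, ignores read orientation, and uses the 1.5±0.1kb interval from the text as an example; the function name and the candidate positions are hypothetical.

```python
from typing import List


def place_repetitive_end(unique_pos: int,
                         candidate_positions: List[int],
                         min_len: int,
                         max_len: int) -> List[int]:
    """Keep only candidate positions of the repetitive end whose distance
    from the uniquely mapped end falls within the fragment length interval."""
    return [p for p in candidate_positions
            if min_len <= abs(p - unique_pos) <= max_len]


# One end maps uniquely at 10,000; the other end hits four repeat copies.
unique_pos = 10_000
candidates = [2_500, 11_480, 54_000, 120_300]

placed = place_repetitive_end(unique_pos, candidates, 1_400, 1_600)
print(placed)  # [11480]
```

If the filter returns exactly one position the repetitive end is placed unambiguously; if it returns several, the pair remains ambiguous.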


  • Fragment length variation (FLV) should be as small as possible, because:
    • the length of the region in which a repetitive sequence can be located is equivalent to the fragment length variation. The smaller the FLV, the more accurate the positioning of the repetitive sequence;
    • an insertion/deletion can be recognized only if the length of the rearrangement is larger than the FLV. The smaller the FLV, the more indels will be detected;
    • but the FLV can't be zero: gel electrophoresis has limited resolution, and DNA fragments with different sequences have slightly different mobility; in addition, the smaller the FLV, the less DNA is available for library preparation and the lower the complexity of the library.
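The structural-variant signatures and the role of the FLV can be put together in a small classifier. This is a rough sketch under several assumptions that are not in the text: end 1 is the leftmost end, the expected orientation of a normal pair is ("+", "-") (this depends on the library protocol), and "unrelated positions" is simplified to "different chromosomes". The `MatePair` type and `classify_pair` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MatePair:
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair: MatePair,
                  mean_len: int,
                  flv: int,
                  expected_orientation: Tuple[str, str] = ("+", "-")) -> str:
    """Classify one mapped mate pair by the signatures described in the text."""
    # Ends on different chromosomes: treated here as a translocation signature.
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    # Disturbed orientation of the end sequences.
    if (pair.strand1, pair.strand2) != expected_orientation:
        return "inversion"
    # Mapped length deviates from the mean by more than the FLV.
    mapped_len = abs(pair.pos2 - pair.pos1)
    if abs(mapped_len - mean_len) > flv:
        return "insertion/deletion"
    return "concordant"


normal = MatePair("chr1", 1_000, "+", "chr1", 2_500, "-")
stretched = MatePair("chr1", 1_000, "+", "chr1", 3_200, "-")
print(classify_pair(normal, mean_len=1_500, flv=100))     # concordant
print(classify_pair(stretched, mean_len=1_500, flv=100))  # insertion/deletion
```

A rearrangement shorter than `flv` passes as concordant, which is the reason a small FLV detects more indels.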


  • Different sequencing projects may have different optimal MP-fragment length.

    MP-fragment length

    shorter:
    • less initial material for a library of the same complexity;
    • less fragment length variation (FLV);

    longer:
    • fewer fragments need to be sequenced to characterize a whole genome (virtual redundancy is higher);



    Library complexity

    Complexity of the library is the number of independent DNA molecules in it. In both the Illumina and SOLiD protocols, "preamplified" libraries are used for preparation of flowcells. As a result, the same fragment may be sequenced several times.
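How often the same fragment is sequenced repeatedly can be estimated with a simple sampling model. The sketch below assumes that every independent molecule is equally likely to be read (uniform amplification, sampling with replacement), which real libraries only approximate; the function names are hypothetical.

```python
import math


def expected_unique_fragments(complexity: float, n_reads: float) -> float:
    """Expected number of distinct molecules seen after n_reads draws from
    a library of `complexity` equally likely molecules:
    complexity * (1 - (1 - 1/complexity) ** n_reads), computed stably."""
    return -complexity * math.expm1(n_reads * math.log1p(-1.0 / complexity))


def expected_duplicate_fraction(complexity: float, n_reads: float) -> float:
    """Fraction of reads that repeat an already sequenced molecule."""
    return 1.0 - expected_unique_fragments(complexity, n_reads) / n_reads


# Complexity equal to the number of reads versus 100x higher complexity.
print(f"{expected_duplicate_fraction(1e6, 1e6):.3f}")  # 0.368
print(f"{expected_duplicate_fraction(1e8, 1e6):.3f}")  # 0.005
```

Under this model, a library whose complexity only matches the number of reads loses more than a third of the reads to duplicates.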

  • Ideally, complexity should be significantly higher than the number of sequenced reads:
    • it is impractical to sequence low-complexity libraries (ChIP) too deeply;
    • a high-complexity library should be prepared for a high-coverage sequencing project;
  • It is possible to estimate complexity after preamplification. Suppose that K cycles of prePCR result in m [µg] of DNA with mean size L [kb].
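    • starting amount of DNA: $m_0 = m / 2^K$;
    • number of independent molecules: $N_0 = \frac{m_0}{M_w} N_{Avogadro} = \frac{m\,[\mu g] \cdot 10^{-6}\,[g/\mu g]}{2^K \cdot 2 \cdot 330\,[g/mol] \cdot 1000 \cdot L\,[kb]} \cdot 6 \times 10^{23}\,[mol^{-1}] \approx \frac{m}{L} \cdot 2^{-K} \times 10^{12}$;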
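The estimate above translates directly into a short function. This is a sketch that follows the text's own assumptions: perfect doubling in each prePCR cycle, 2 x 330 g/mol per base pair, and Avogadro's number rounded to 6 x 10^23. The function name and the example input values are hypothetical.

```python
AVOGADRO = 6.0e23  # mol^-1, rounded as in the text


def library_complexity(mass_ug: float, mean_size_kb: float, pcr_cycles: int) -> float:
    """Estimate the number of independent molecules before preamplification."""
    # Starting mass in grams, assuming the amount doubles in every cycle.
    m0_g = mass_ug * 1e-6 / 2 ** pcr_cycles
    # Molar mass of a dsDNA fragment of the given mean size, in g/mol.
    molar_mass = 2 * 330 * 1000 * mean_size_kb
    return m0_g / molar_mass * AVOGADRO


# Hypothetical example: 1 ug of 0.2 kb fragments after 10 cycles.
n0 = library_complexity(mass_ug=1.0, mean_size_kb=0.2, pcr_cycles=10)
print(f"{n0:.2e}")
```

Because real PCR efficiency is below 100% per cycle, the true starting amount is larger than m / 2^K, so this figure is best read as a lower-bound style estimate.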



    Read length (RL)

    Number of sequenced nucleotides. For both SOLiD and Illumina read length is the same for all clones.

    In both systems the RL may be selected by the user. In most cases it is better to have an RL larger than 25-28bp, because this range is the threshold above which most of the "good" sequences map uniquely to the genome. A further increase of the RL practically does not change throughput, slightly decreases the price per nucleotide, increases the error rate, and somewhat simplifies the analysis of repeats and structural variants. Different sequencing projects may have different optimal RL.


    long RL

    positive:
    • for analysis of structural variations;
    • for analysis of repeats;

    negative:
    • sequencing time per nucleotide increases with length;
    • sequencing quality significantly decreases with length;
    • an increase of RL may result in a lower number of "readable" clones;
    • for applications such as ChIP-seq or expression profiling, longer reads do not provide additional information;


  • the sequencing price is the sum of
    1. the price of flowcell preparation,
    2. the price of sequencing reagents,
    3. the price of machine amortization.
    An increase of read length does not change the first component, but proportionally increases the second and third.
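The effect of read length on the price per nucleotide follows from this three-component sum. The sketch below models the second and third components as a fixed cost per sequencing cycle; all numbers are arbitrary illustrative units rather than real prices, and the function and parameter names are hypothetical.

```python
def price_per_nucleotide(flowcell_price: float,
                         reagent_price_per_cycle: float,
                         amortization_per_cycle: float,
                         read_length: int,
                         n_reads: float) -> float:
    """Total run price divided by the number of sequenced nucleotides."""
    # Fixed component plus the two components that grow with read length.
    total = flowcell_price + read_length * (reagent_price_per_cycle
                                            + amortization_per_cycle)
    return total / (read_length * n_reads)


# Arbitrary units: the same flowcell sequenced to 25 and to 50 cycles.
p25 = price_per_nucleotide(1000, 50, 50, read_length=25, n_reads=1e8)
p50 = price_per_nucleotide(1000, 50, 50, read_length=50, n_reads=1e8)
print(p25 > p50)  # True
```

Only the fixed flowcell component is spread over more nucleotides, so the per-nucleotide price falls with read length but never below the per-cycle cost divided by the number of reads.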


  • the run time is the sum of
    1. time for run installation,
    2. time for sequencing.
    The first component does not depend on RL. Sequencing time per nucleotide increases with length (because more time is required to capture weak fluorescent signals).



    Redundancy and coverage

    Coverage is the percentage of the genome covered by reads.

    Redundancy (sometimes erroneously referred to as coverage) is the number of reads representing a given nucleotide in the reconstructed sequence. Mean redundancy can be calculated from the length of the genome (G), coverage (C), the number of reads (N), and the read length (L) as:

    N*L*C/G

    For example, sequencing of a genome of $3 \times 10^{9}$ bp gives $5 \times 10^{8}$ reads of 35 b. 70% of these reads were aligned to 90% of the genome. In this case:
    • length of the genome G = $3 \times 10^{9}$ bp;
    • coverage C = 0.9;
    • number of reads N = 0.7 * $5 \times 10^{8}$;
    • read length L = 35 b;
    • redundancy N*L*C/G ≈ 3.7
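The worked example can be reproduced in a few lines. This sketch applies the formula exactly as stated in the text, N*L*C/G, to the numbers of the example; the function name `mean_redundancy` is hypothetical.

```python
def mean_redundancy(n_reads: float, read_length: int,
                    coverage: float, genome_length: float) -> float:
    """Mean redundancy computed as N*L*C/G, following the formula in the text."""
    return n_reads * read_length * coverage / genome_length


aligned_reads = 0.7 * 5e8  # 70% of the 5 x 10^8 reads were aligned
redundancy = mean_redundancy(aligned_reads, read_length=35,
                             coverage=0.9, genome_length=3e9)
print(f"{redundancy:.1f}")  # 3.7
```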

    Both terms (coverage & redundancy) may be applied to the whole genome or any fragment of it.

    Redundancy is not uniform along the genome, for both combinatorial and systematic reasons. Uniform redundancy is highly desirable because it helps in the analysis of structural variations.




    SNPs

    A Single Nucleotide Polymorphism (SNP) is a DNA sequence variant of a single base pair, with the minor allele occurring in more than 1% of a given population. SNPs with a minor allele frequency ≥20% are called "common SNPs". Frequently, the term "SNP" is used in a looser sense for short allelic variants (substitutions or small insertions/deletions, indels) without any assumption about minimum allele frequencies. For example, the NCBI dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/) uses the term SNP regardless of allele frequency.
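The frequency thresholds of the strict definition can be written as a small helper. This is a sketch of the strict sense only, not of the looser dbSNP usage; the function name and the label "rare variant" for alleles at or below 1% are hypothetical choices made for illustration.

```python
def classify_by_frequency(minor_allele_frequency: float) -> str:
    """Label a single-base variant by its minor allele frequency (0.0 to 0.5)."""
    if minor_allele_frequency >= 0.20:
        return "common SNP"
    if minor_allele_frequency > 0.01:
        return "SNP"
    # Label chosen for this example; the text does not name this class.
    return "rare variant"


for maf in (0.35, 0.05, 0.005):
    print(maf, classify_by_frequency(maf))
```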




    Structural variations

    Insertion, deletion, inversion, duplication, translocation.



Second-generation sequencing
URL: http://seq.zbio.net
e-mail: soldatov@molgen.mpg.de
Last modification: 11/12/08
