
Data analysis


Pipeline modules

Module   Name of the program   Function             Address
Goat     Firecrest             Image analysis       Pipeline/Goat/goat_pipeline.py
Goat     Bustard               Base-caller          Pipeline/Goat/bustard.py
Gerald   Eland                 Sequence alignment   Pipeline/Gerald/GERALD.pl


Preliminary

  • ask Peter Marquardt for access to the machine where the analysis will be run. In my case:
       directory: /project/solexb
       name: solexb
  • ask Wei Chen which version of the analysis software is currently the best and what the path to it is. At the time of writing the best version was 0.2.0 and the path was: /project/solexa/src/SolexaPipeline-0.2.0


Login procedure


for short runs
  • log in to the molgix machine with your name and Unix password: [program: PuTTY · host name: molgix.molgen.mpg.de · port: 22 · protocol: SSH];
  • connect to the machine where the analysis will be performed ('hurtz' in my case) with the username from Peter ('solexb' in my case):
    soldatov@molgix:~> ssh -l solexb hurtz


for long runs
  • log in to the molgix machine with your name and Unix password;
  • check that there are no old screen sessions:
    soldatov@molgix:~> screen -r
  • start a screen session:
    soldatov@molgix:~> screen
    SPACE (press the space bar to dismiss the start-up message)
  • connect to the machine where the analysis will be performed:
    soldatov@molgix:~> ssh -l solexb hurtz


  • --- start some long analysis ---
  • detach the screen session (the analysis keeps running in the background):
    soldatov@molgix:~> Ctrl-a Ctrl-d


  • --- now it is possible to log out ---
  • log in to the molgix machine again;
  • reattach to the running screen session:
    soldatov@molgix:~> screen -r
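
The long-run procedure above can be wrapped in a few helper functions. This is only a minimal sketch and not part of the original setup: the function names, the session name 'solexa' and the DRY_RUN switch are illustrative assumptions. Giving the session a name with 'screen -S' makes stale sessions easier to spot in the output of 'screen -ls'.

```shell
#!/bin/sh
# Hypothetical helpers around GNU screen; all names here are illustrative.

# Print the command instead of running it when DRY_RUN=1 (handy for checking).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# List existing sessions; old, forgotten sessions show up here.
list_sessions() { run screen -ls; }

# Start a new named session; inside it, ssh to the analysis machine.
start_session() { run screen -S "$1"; }

# Reattach to a named session after logging in again.
resume_session() { run screen -r "$1"; }
```

A typical sequence would be 'start_session solexa', start the analysis, detach with Ctrl-a Ctrl-d, and later 'resume_session solexa'.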


Transfer sequencing data (images)

Transfer the folder YYMMDD_SLXA-EAS12_NNNN (here '070216_SLXA-EAS12_0034/', ~450 GB) to a USB hard disk (~1 day)
  • log in as 'solexb';
  • go to the analysis directory:
    solexb@hurtz:~> cd /project/solexb
  • create symbolic link to the current Solexa_Sequence_Analysis_Package:
    solexb@hurtz:~> ln -s /project/solexa/src/SolexaPipeline-0.2.0 Pipeline
  • create the 'data' directory:
    solexb@hurtz:~> mkdir data
  • copy the images (folder YYMMDD_SLXA-EAS12_NNNN) into the 'data' folder (~1 day). In this example the data were transferred to 'harddisk-20070220/070216_SLXA-EAS12_0034/', but it is not necessary to create an additional folder, so '070216_SLXA-EAS12_0034/' would be better
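
The directory preparation above (symbolic link plus 'data' folder) can be sketched as one repeatable function. The function name and its argument order are assumptions made for illustration; the paths in the commented example are the ones used in this text.

```shell
#!/bin/sh
# Sketch: prepare an analysis directory with a 'Pipeline' link and a 'data' folder.
# $1 = analysis directory, $2 = pipeline installation directory.
setup_analysis_dir() {
    analysis_dir=$1
    pipeline_src=$2
    if [ ! -d "$pipeline_src" ]; then
        echo "no such pipeline directory: $pipeline_src" >&2
        return 1
    fi
    mkdir -p "$analysis_dir/data" || return 1
    # -f -n: replace an existing 'Pipeline' link instead of creating a link inside it
    ln -sfn "$pipeline_src" "$analysis_dir/Pipeline"
}

# Example with the paths from this text (commented out so nothing runs by accident):
# setup_analysis_dir /project/solexb /project/solexa/src/SolexaPipeline-0.2.0
```

Because the link is replaced rather than nested, the function can be run again when a newer pipeline version becomes the recommended one.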


Download reference genome

Take the genome from the UCSC Genome Browser. In this case it is the mouse genome, file: chromFa.tar.gz
  • prepare folders (folder 'Genome' for fasta-files; folder 'mouseGenome' for 2-bits-per-base format files) for the reference genome:
    solexb@hurtz:~> cd data
    solexb@hurtz:~> mkdir Genome
    solexb@hurtz:~> mkdir mouseGenome
  • download reference genome
    solexb@hurtz:~> cd /project/solexb/data/Genome
    solexb@hurtz:~> ftp hgdownload.cse.ucsc.edu
    name: anonymous
    password:
    cd /goldenPath/mm8/bigZips/
    ls -l
    mget chromFa.tar.gz (takes ~1 h)
    quit


  • prepare 2-bits-per-base format genome files
  • unpack the *.tar.gz genome archive:
    solexb@hurtz:~> tar xvzf chromFa.tar.gz
  • this creates many folders with 1-2 files per folder:
    ./1
         ./1/chr1.fa
         ./1/chr1_random.fa

    ./2
         ./2/chr2.fa

    .....
  • prepare 2-bits-per-base format files:
    solexb@hurtz:~> cd /project/solexb/data/Genome
    solexb@hurtz:~> /project/solexb/Pipeline/Eland/squashGenome /project/solexb/data/mouseGenome */*
  • check 2-bits-per-base format files:
    solexb@hurtz:~> ls -l /project/solexb/data/mouseGenome
  • remove folders with *.fa files:
    solexb@hurtz:~> rm -ri /project/solexb/data/Genome/*/
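
Before running squashGenome, and again before removing the *.fa folders, it can help to see exactly which FASTA files sit one level below the genome folder. The following is a rough sketch under that assumption; the function names are made up for illustration, and it relies on the -mindepth/-maxdepth options found in GNU and BSD find.

```shell
#!/bin/sh
# Sketch: list the chromosome FASTA files one level below a genome directory,
# i.e. the *.fa files that the '*/*' glob passes to squashGenome.
list_fasta() {
    find "$1" -mindepth 2 -maxdepth 2 -type f -name '*.fa' | sort
}

# Count them, so the number can be compared with what ends up in the
# 2-bits-per-base output folder.
count_fasta() {
    list_fasta "$1" | wc -l | tr -d ' '
}

# Example (commented out): count_fasta /project/solexb/data/Genome
```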


Test analysis

The test analysis takes one tile (of 200) in one channel (of 8). In this example tile '50' of channel '4' was selected (parameter --tiles=s_4_0050). The test analysis is performed to check the software and to get a quick estimate of the sequencing quality.
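
To avoid typing mistakes in the --tiles parameter, the tile name can be built from the channel and tile numbers. This is a small sketch; the pattern s_<channel>_<four-digit tile> is inferred from the single example 's_4_0050' above, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: build the tile name used with --tiles from a channel and a tile number.
# Pass plain decimal numbers (50, not 0050): printf reads a leading zero as octal.
tile_name() {
    printf 's_%d_%04d\n' "$1" "$2"
}

tile_name 4 50   # prints s_4_0050
```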

Goat-run

  • perform test Goat-run:
    solexb@hurtz:~> /project/solexb/Pipeline/Goat/goat_pipeline.py --tiles=s_4_0050 --offsets=auto /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/
  • create Goat-make-file:
    solexb@hurtz:~> /project/solexb/Pipeline/Goat/goat_pipeline.py --tiles=s_4_0050 --offsets=auto --make /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/
  • check that a new Firecrest-folder appears in the Data-folder:
    solexb@hurtz:~> ls -l /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/Data

    in my case it was: C1-27_Firecrest1.8.26_06-03-2007_solexb.2
  • perform Firecrest-run -- go to Firecrest-folder and run 'make':
    solexb@hurtz:~> cd /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/Data/C1-27_Firecrest1.8.26_06-03-2007_solexb.2
    solexb@hurtz:~> make
  • check that the 'int-file' (in this case: s_4_0050_int.txt) appears in the Firecrest-folder:
    solexb@hurtz:~> ls -l /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/Data/C1-27_Firecrest1.8.26_06-03-2007_solexb.2
  • perform 'Bustard' run - go to Bustard-folder and run 'make':
    solexb@hurtz:~> cd /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/Data/C1-27_Firecrest1.8.26_06-03-2007_solexb.2/Bustard1.8.26_06-03-2007_solexb
    solexb@hurtz:~> make
  • check that the 'seq-file' (in this case: s_4_0050_seq.txt) appears in the Bustard-folder:
    solexb@hurtz:~> ls -l /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/Data/C1-27_Firecrest1.8.26_06-03-2007_solexb.2/Bustard1.8.26_06-03-2007_solexb
  • check that the content of s_4_0050_seq.txt is similar to what you expect to obtain from the sequencing.
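
The two 'check that the file appears' steps above can be done without typing the long folder names. A minimal sketch, assuming only the file names mentioned in this text (s_4_0050_int.txt, s_4_0050_seq.txt); the function name and arguments are illustrative.

```shell
#!/bin/sh
# Sketch: look for a tile's output file anywhere below a run folder.
# $1 = run folder, $2 = tile name (e.g. s_4_0050), $3 = suffix ('int' or 'seq').
# Prints the matching paths; returns non-zero when no non-empty file is found.
find_tile_output() {
    found=$(find "$1" -type f -name "${2}_${3}.txt" -size +0 | sort)
    [ -n "$found" ] || return 1
    printf '%s\n' "$found"
}

# Example (commented out):
# find_tile_output /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034 s_4_0050 seq
```

An empty file is treated as missing, since a zero-size output usually means the 'make' step did not finish.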


Sequence alignment ('GERALD' run)

  • Prepare config.txt file and put it in the /project/solexb folder:
    EMAIL_LIST soldatov@molgen.mpg.de
    EMAIL_SERVER sally
    EXPT_DIR /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/Data/C1-27_Firecrest1.8.26_06-03-2007_solexb.2/Bustard1.8.26_06-03-2007_solexb/
    WEB_DIR_ROOT
    CONTAM_DIR /project/solexb/data/mouseGenome
    READ_LENGTH 26
    ANALYSIS eland
    CONTAM_FILE contam.txt
    GENOME_FILE chr1.fa
    GENOME_DIR /project/solexb/data/mouseGenome
    ELAND_GENOME /project/solexb/data/mouseGenome
    ELAND_REPEAT /project/solexb/data/mouseGenome
    USE_BASES nYYYYYYYYYYYYYYYYYYYYYYYYYY
  • perform 'GERALD' run:
    solexb@hurtz:~> cd /project/solexb
    solexb@hurtz:~> Pipeline/Gerald/GERALD.pl config.txt --FORCE
  • the program performs "make self_test" automatically. If everything is OK, go to the GERALD-directory and run "make":
    solexb@hurtz:~> cd /project/solexb/data/harddisk-20070220/070216_SLXA-EAS12_0034/Data/C1-27_Firecrest1.8.26_06-03-2007_solexb.2/Bustard1.8.26_06-03-2007_solexb/GERALD_06-03-2007_solexb/
    solexb@hurtz:~> make
  • check that the ELAND results in the "s_4_eland_result.txt" file look acceptable.
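
The long USE_BASES line in config.txt is easy to mistype. The sketch below builds such a mask and reads single values back from the config file. It assumes the pattern seen in the example config (an 'n' for each skipped leading cycle, then one 'Y' per used base) and a simple 'KEY value' layout; both function names are hypothetical and not part of GERALD.

```shell
#!/bin/sh
# Sketch: build a USE_BASES mask.
# $1 = number of leading cycles to skip, $2 = number of bases to use.
use_bases_mask() {
    mask=""
    i=0
    while [ "$i" -lt "$1" ]; do mask="${mask}n"; i=$((i + 1)); done
    i=0
    while [ "$i" -lt "$2" ]; do mask="${mask}Y"; i=$((i + 1)); done
    printf '%s\n' "$mask"
}

# Sketch: print the value of one key from a 'KEY value' config file.
# Keys without a value (such as WEB_DIR_ROOT above) give an empty result.
config_value() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Example (commented out): one skipped cycle, bases taken from READ_LENGTH
# use_bases_mask 1 "$(config_value /project/solexb/config.txt READ_LENGTH)"
```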




Second-generation sequencing
URL: http://seq.zbio.net
e-mail: soldatov@molgen.mpg.de
Last modification: 01/12/08
