Python¥Ð¥¤¥ª?¡¡Python¥Ð¥¤¥ª/¥Ä¡¼¥ë?
4330¡¡¡¡¡¡2019-06-09 (Æü) 06:56:26

¥ê¡¼¥É¤Î»²¾ÈÇÛÎó¤Ø¤Î¥Þ¥Ã¥Ô¥ó¥°

Bowtie2ºÇ¿·ÈǤò¥¤¥ó¥¹¥È¡¼¥ë¡Ê¥¢¥Ã¥×¥Ç¡¼¥È¡Ë
¥À¥¦¥ó¥í¡¼¥É¤·¤Æunzip¤·¤Æmake

»È¤¤Êý¤ÎÎã¡¡biopapyrus bowtie2¡¢¥Þ¥Ë¥å¥¢¥ë¡¡http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

bowtie2-build --threads 16 GCF_000001405.38_GRCh38.p12_genomic.fna HG38.ix
Settings:
  Output files: "HG38.ix.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  GCF_000001405.38_GRCh38.p12_genomic.fna
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:20
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:20
bmax according to bmaxDivN setting: 773987710
Using parameters --bmax 580490783 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 580490783 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:01:24
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:17
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:30
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 3.09595e+09 (target: 580490782)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 3095950843 for bucket 1
  (Using difference cover)

¼¡¤Ë¡¢samtools¤Ç¤Þ¤È¤á¤ë¡£ÊÂÎó»ØÄê¤Ï-@ ¥¹¥ì¥Ã¥É¿ô

samtools view -@ 32 -bS SRR002322.sam > SRR002322.bam

³¤¤¤Æ¡¢¥½¡¼¥È¡£ÊÂÎó»ØÄê¤Ï-@ ¥¹¥ì¥Ã¥É

samtools sort SRR002322.bam -o SRR002322.sorted.bam -@ 32

ºÇ¸å¤Ëmpileup¡¡¤³¤ì¤ÏÊÂÎó»ØÄê¤Ï¤Ç¤­¤Ê¤¤¤é¤·¤¤

samtools mpileup -uf GCF_000001405.38_GRCh38.p12_genomic.fna SRR002322.sorted.bam | bcftools view -Ov - > SR002322.raw.bcf

¤Ê¤ª¡¢mpileup¤âÊÂÎó¼Â¹Ô¤Ç¤­¤ë¥Ñ¥Ã¥±¡¼¥¸¤¬¤¢¤Ã¤¿¤Î¤Ç¡¢¤¤¤º¤ì¥Æ¥¹¥È¡£
Sambamba 0.6.8

°äÅÁ¸¦¹Ö½¬²ñ¤Ç¤ÎÎã

¡ÖÀè¿Ê¥²¥Î¥à»Ù±ç¡×¾ðÊó²òÀϹֽ¬²ñ¤Î¤´°ÆÆâ
¢Í¾ðÊó²òÀϹֽ¬²ñ¥Ó¥Ç¥ª¡ã2018ǯÅÙ¡¡¾ðÊó²òÀϹֽ¬²ñ¡ÊÃæµé¼Ô¸þ¤±¡Ë¡ä
¢Í»ñÎÁ¡ÊGitHub¡Ë

°äÅÁ¸¦¹Ö½¬²ñ¤ÎÎã¤Ç¤Ï¡¢¡ÖTopHat2¤Î¸å·Ñ¤Ç¤¢¤ë¡×HISAT2¤ò»È¤Ã¤Æ¤¤¤ë¡£<-- ¥¢¥á¥ê¥¨¥Õ¤Î¥Ö¥í¥° HISAT2

HISAT2¡¡¤ËÛ©¤¯

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs ...

http://ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-source.zip
unzip
make

¥Ð¥¤¥Ê¥ê¤Ïmake¤·¤¿¥Ç¥£¥ì¥¯¥È¥êÃæ¤Ë¤Ç¤­¤ë¤Î¤Ç¡¢Å¬µ¹¥ê¥ó¥¯¤òÄ¥¤ë¤³¤È¡£

hisat2¤Î½èÍý¤Ï¡¢¥¤¥ó¥Ç¥Ã¥¯¥¹¤òºî¤ë(hisat2-build) ¢Í hisat2¤Ç¥Þ¥Ã¥×¡¡¤Î½ç¡£

hisat2-build s288c.fna s288c-build.fna
Settings:
  Output files: "s288c-build.fna.*.ht2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  s288c.fna
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 2279457 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 2279457 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:01
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 1.73673e+06 (target: 2279456)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Getting block 1 of 7
  Reserving size (2279457) for bucket 1
  Calculating Z arrays for bucket 1
  Entering block accumulator loop for bucket 1:
  bucket 1: 10%
  bucket 1: 20%
  bucket 1: 30%
  bucket 1: 40%
  bucket 1: 50%
  bucket 1: 60%
  bucket 1: 70%
  bucket 1: 80%
  bucket 1: 90%
  bucket 1: 100%
  Sorting block of length 1917483 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 1917484 for bucket 1
Getting block 2 of 7
  Reserving size (2279457) for bucket 2
  Calculating Z arrays for bucket 2
  Entering block accumulator loop for bucket 2:
  bucket 2: 10%
  bucket 2: 20%
  bucket 2: 30%
  bucket 2: 40%
  bucket 2: 50%
  bucket 2: 60%
  bucket 2: 70%
  bucket 2: 80%
  bucket 2: 90%
  bucket 2: 100%
  Sorting block of length 1565107 for bucket 2
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 1565108 for bucket 2
Getting block 3 of 7
  Reserving size (2279457) for bucket 3
  Calculating Z arrays for bucket 3
  Entering block accumulator loop for bucket 3:
  bucket 3: 10%
  bucket 3: 20%
  bucket 3: 30%
  bucket 3: 40%
  bucket 3: 50%
  bucket 3: 60%
  bucket 3: 70%
  bucket 3: 80%
  bucket 3: 90%
  bucket 3: 100%
  Sorting block of length 2077071 for bucket 3
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 2077072 for bucket 3
Getting block 4 of 7
  Reserving size (2279457) for bucket 4
  Calculating Z arrays for bucket 4
  Entering block accumulator loop for bucket 4:
  bucket 4: 10%
  bucket 4: 20%
  bucket 4: 30%
  bucket 4: 40%
  bucket 4: 50%
  bucket 4: 60%
  bucket 4: 70%
  bucket 4: 80%
  bucket 4: 90%
  bucket 4: 100%
  Sorting block of length 1959963 for bucket 4
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 1959964 for bucket 4
Getting block 5 of 7
  Reserving size (2279457) for bucket 5
  Calculating Z arrays for bucket 5
  Entering block accumulator loop for bucket 5:
  bucket 5: 10%
  bucket 5: 20%
  bucket 5: 30%
  bucket 5: 40%
  bucket 5: 50%
  bucket 5: 60%
  bucket 5: 70%
  bucket 5: 80%
  bucket 5: 90%
  bucket 5: 100%
  Sorting block of length 1858740 for bucket 5
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 1858741 for bucket 5
Getting block 6 of 7
  Reserving size (2279457) for bucket 6
  Calculating Z arrays for bucket 6
  Entering block accumulator loop for bucket 6:
  bucket 6: 10%
  bucket 6: 20%
  bucket 6: 30%
  bucket 6: 40%
  bucket 6: 50%
  bucket 6: 60%
  bucket 6: 70%
  bucket 6: 80%
  bucket 6: 90%
  bucket 6: 100%
  Sorting block of length 630502 for bucket 6
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 630503 for bucket 6
Getting block 7 of 7
  Reserving size (2279457) for bucket 7
  Calculating Z arrays for bucket 7
  Entering block accumulator loop for bucket 7:
  bucket 7: 10%
  bucket 7: 20%
  bucket 7: 30%
  bucket 7: 40%
  bucket 7: 50%
  bucket 7: 60%
  bucket 7: 70%
  bucket 7: 80%
 bucket 7: 90%
 bucket 7: 100%
 Sorting block of length 2148233 for bucket 7
 (Using difference cover)
 Sorting block time: 00:00:01
Returning block of 2148234 for bucket 7
Exited GFM loop
fchr[A]: 0
fchr[C]: 3766349
fchr[G]: 6086925
fchr[T]: 8404025
fchr[$]: 12157105
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 8248454 bytes to primary GFM file: s288c-build.fna.1.ht2
Wrote 3039284 bytes to secondary GFM file: s288c-build.fna.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 5399069 bytes to primary GFM file: s288c-build.fna.5.ht2
Wrote 3092708 bytes to secondary GFM file: s288c-build.fna.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor
Headers:
    len: 12157105
    gbwtLen: 12157106
    nodes: 12157106
    sz: 3039277
    gbwtSz: 3039277
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 0
    eftabSz: 0
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 759820
    offsSz: 3039280
    lineSz: 64
    sideSz: 64
    sideGbwtSz: 48
    sideGbwtLen: 192
    numSides: 63319
    numLines: 63319
    gbwtTotLen: 4052416
    gbwtTotSz: 4052416
    reverse: 0
    linearFM: Yes
Total time for call to driver() for forward index: 00:00:06

½àÈ÷½ª¤ï¤ê¡£

¥Þ¥Ã¥Ô¥ó¥°¤Ï

hisat2 -p 32 --dta-cufflinks -x s288c-build.fna -1 paired_SRR453566_1.trim.fastq -2 paired_SRR453566_2.trim.fastq -S SRR453566.sam
5115482 reads; of these:
  5115482 (100.00%) were paired; of these:
    323268 (6.32%) aligned concordantly 0 times
    4524989 (88.46%) aligned concordantly exactly 1 time
    267225 (5.22%) aligned concordantly >1 times
    ----
    323268 pairs aligned concordantly 0 times; of these:
      77115 (23.85%) aligned discordantly 1 time
    ----
    246153 pairs aligned 0 times concordantly or discordantly; of these:
      492306 mates make up the pairs; of these:
        341629 (69.39%) aligned 0 times
        135729 (27.57%) aligned exactly 1 time
        14948 (3.04%) aligned >1 times
96.66% overall alignment rate

¤Ç½ª¤ï¤ê¡£

¤³¤³¤Ç¡¢samtools¤ò»È¤Ã¤Æsam¥Õ¥¡¥¤¥ë¤òbam¥Õ¥¡¥¤¥ë¤ËÊÑ´¹¤·¡¢¤«¤Ä¡¢position¤Çsort¤¹¤ë¡£

samtools/bcftools¤Î¥¤¥ó¥¹¥È¡¼¥ë¤Ï¡¢Samtools¤Î¥µ¥¤¥È]http://www.htslib.org/download/¤«¤é¥À¥¦¥ó¥í¡¼¥É¤·¡¢

./configure
make
sudo make install

¤Ç´°Î»¡£

samtools view¤ò»È¤Ã¤Æsam¥Õ¥¡¥¤¥ëSRR453566.sam ¤òbam¥Õ¥¡¥¤¥ë SR453566.bam¤ËÊÑ´¹¡£@32¤Ï32¥¹¥ì¥Ã¥É»ÈÍÑ¡£

samtools view -@ 32 -b SRR453566.sam > SRR453566.bam
Ëô¤Ï samtools view -@ 32 -b SRR453566.sam -o SRR453566.bam

samtools sort¤ò»È¤Ã¤Æbam¥Õ¥¡¥¤¥ëSRR453566.bam ¤ò¥½¡¼¥È¤·¤ÆSRR453566.sorted.bam¤Ë½ÐÎÏ

samtools sort -@ 32 SRR453566.bam > SRR453566.sorted.bam 
   [bam_sort_core] merging from 0 files and 32 in-memory blocks...

hisat2¤Î¥É¥­¥å¥á¥ó¥È¤Ç¤Ï¸å½èÍý¤È¤·¤Æ¡¡samtools mpileup¤·¤Æbcftools view¤Çɽ¼¨

samtools mpileup -uvf s288c.fna SRR453566.sorted.bam > SRR453566.vcf
[warning] samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[mpileup] 1 samples in 1 input files

Ʊ¤¸¤³¤È¤òbcftools¤Ç¤Ï

bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam > SRR453566.vcf

¤Ê¤ª¡¢¥Ñ¥é¥á¥¿ --threads 16 ¤Ê¤É¤Ï¼ÂºÝ¤Ë¤ÏÊÂÎ󲽤θú²Ì¤¬Ìµ¤«¤Ã¤¿¡£

samtools¤Çvariant call¤Ç¤Ï

bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | bcftools call -vmO z -o <study.vcf.gz>

¤È½ñ¤¤¤Æ¤¤¤ë¡£

bcftools¤Î¥Þ¥Ë¥å¥¢¥ë¤Ç¤Ïview¤ÎÂå¤ï¤ê¤Ëcall¤ò»È¤¨¤È¤·¤Æ¤ª¤ê¡¢¶ñÂÎŪ¤ÊÃê½Ð½èÍý¤ò¤É¤¦¤¹¤ë¤«Î㤬¤¤¤¯¤Ä¤«¤¢¤ë¡£

¹¹¤Ë¡¢ HISAT2¤Ç¤Ï¡¡

bcftools view -bvcg SRR453566.vcf > SRR453566.raw.bcf¡¡

¤È¤¹¤ë¤È¤·¤Æ¤¤¤ë¤¬¡¢¥Ñ¥é¥á¥¿-bvgf¤Ï¥¨¥é¡¼¡£

²òÀâ¤ò¸«¤ë¤È¡¢ºÇ¶á¤Î¥Ð¡¼¥¸¥ç¥ó¤Ç¤Ïbcftools call¤Ë¤è¤Ã¤Æ¥³¡¼¥ë¤ò¼è½Ð¤¹¤é¤·¤¤¡£

bcftools call -O v -v -c SRR453566.vcf > SRR453566.raw.bcf

¤¬Àµ²ò¤ÎÌÏÍÍ¡£ÆÀ¤é¤ì¤¿·ë²Ì¤Ï¡ÊÀèÆ¬¥³¥á¥ó¥È¥é¥¤¥ó¤ò½ü¤¯¤È¡Ë

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SRR453566.sorted.bam
NC_001133.9     137     .       C       CT      4.4191  .       INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-37.5262   GT:PL   0/1:40,3,0
NC_001133.9     349     .       C       T       8.64911 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,0,1;MQ=60;FQ=-29.9908     GT:PL   1/1:38,3,0
NC_001133.9     11969   .       C       G       10.4247 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9905     GT:PL   1/1:40,3,0
NC_001133.9     11997   .       T       A       27.0206 .       DP=2;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9901     GT:PL   1/1:57,3,0
NC_001133.9     12361   .       G       A       4.13164 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928     GT:PL   0/1:32,3,0
NC_001133.9     12373   .       G       A       4.13164 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928     GT:PL   0/1:32,3,0

¤È¤¤¤Ã¤¿´¶¤¸¡£INFOÉôʬ¤Î²òÀâ¤Ï¥Ø¥Ã¥ÀÉôʬ¤Ë¤¢¤ëÄ̤ꡢ

##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AF2,Number=1,Type=Float,Description="Max-likelihood estimate of the first and second group ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">

¥Ð¥ê¥¢¥ó¥È¥³¡¼¥ë·ë²Ì¤ÎVCF¥Õ¥©¡¼¥Þ¥Ã¥È - mac¤Ç¥¤¥ó¥Õ¥©¥Þ¥Æ¥£¥¯¥¹

bcftools mpileup¤Ç¡¢Ê£¿ô¥µ¥ó¥×¥ë¤òʤ٤Ƥߤ롣¤³¤Î¥Ú¡¼¥¸, bcftools¤Î¥Ú¡¼¥¸

bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam  SRR453567.sorted.bam  SRR453568.sorted.bam  SRR453569.sorted.bam  SRR453570.sorted.bam  SRR453571.sorted.bam | bcftools call -Ou -mv -o SRR453566_mpileup.call.vcf

¹¹¤Ë·ë²Ì¤òÆÉ¤á¤ë¤è¤¦¤Ë¤¹¤ë¡£

bcftools view SRR453566_mpileup.call.vcf | vcf-annotate -f + > SRR453566_mpileup.call.annotate.vcf

¼¡¤Î¤è¤¦¤Ê·ë²Ì¤Ë¤Ê¤Ã¤¿¡£

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SRR453566.sorted.bam    SRR453567.sorted.bam    SRR453568.sorted.bam    SRR453569.sorted.bam    SRR453570.sorted.bam    SRR453571.sorted.bam
NC_001133.9     136     .       G       A       68      SnpGap  DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60    GT:PL   1/1:23,3,0      1/1:24,3,0      ./.:0,0,0       1/1:23,3,0      1/1:23,3,0      ./.:0,0,0
NC_001133.9     137     .       C       CT      135     PASS    INDEL;IDV=1;IMF=1;DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60  GT:PL   1/1:40,3,0      1/1:40,3,0      ./.:0,0,0       1/1:40,3,0      1/1:40,3,0      ./.:0,0,0
NC_001133.9     286     .       A       T       13.4171 MinAB;MinDP     DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60   GT:PL   ./.:0,0,0       ./.:0,0,0       1/1:38,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
NC_001133.9     305     .       C       G       10.6101 MinAB;MinDP     DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60   GT:PL   ./.:0,0,0       ./.:0,0,0       1/1:35,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
NC_001133.9     349     .       C       T       88      PASS    DP=3;VDB=0.102722;SGB=-0.471632;MQ0F=0;AC=6;AN=6;DP4=0,0,0,3;MQ=60      GT:PL   1/1:38,3,0      ./.:0,0,0       1/1:35,3,0      ./.:0,0,0       ./.:0,0,0       1/1:40,3,0
NC_001133.9     373     .       T       C       9.69791 Qual;MinAB      DP=2;SGB=-0.045183;RPB=1;MQB=1;BQB=1;MQ0F=0;ICB=0.5;HOB=0.5;AC=2;AN=4;DP4=0,1,0,1;MQ=60 GT:PL   0/1:0,3,41      ./.:0,0,0       0/1:40,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
NC_001133.9     457     .       CAAA    CAA     7.96917 Qual;MinAB;MinDP        INDEL;IDV=1;IMF=1;DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=60 GT:PL   ./.:0,0,0       ./.:0,0,0       ./.:0,0,0       ./.:0,0,0       1/1:32,3,0      ./.:0,0,0

¥µ¥ó¥×¥ë´Ö¤Ç¶¦Ä̤¹¤ëÊѰۤȸÇÍ­¤ÎÊѰۤòÃê½Ð¤¹¤ë - mac¤Ç¥¤¥ó¥Õ¥©¥Þ¥Æ¥£¥¯¥¹

¤½¤ì¤Ç¡¢bam¥Õ¥¡¥¤¥ë¤ä¡¢callÃê½Ð¤·¤¿¸å¤Îvcf¥Õ¥¡¥¤¥ë¤Ë¡¢¥¤¥ó¥Ç¥Ã¥¯¥¹ÉÕ¤±¤¬É¬Íפˤʤ뤳¤È¤¬¤¢¤ë¡£¤¿¤È¤¨¤ÐIGV¤Çɽ¼¨¤¹¤ë¤È¤­¤Ê¤É¡£

¥¤¥ó¥Ç¥Ã¥¯¥¹ÉÕ¤±¤Ïbcftools¤Ç²Äǽ¤À¤¬¡¢ÆþÎϤ¿¤ëvcf¥Õ¥¡¥¤¥ë¤¬bgzip¤µ¤ì¤Æ¤¤¤Ê¤±¤ì¤Ð¤Ê¤é¤Ê¤¤¡£gdzip¼«ÂΤÏsamtools¤Îhtslib¤ÎÃæ¤ËÆþ¤Ã¤Æ¤¤¤Æ¡¢samtools¤òmake; make install¤·¤¿¤À¤±¤Ç¤ÏÆþ¤é¤Ê¤¤¡£make all all-htslib; make install install-htslib¡¡¤Î¤è¤¦¤ËÄɲÃmake¤¹¤ëɬÍפ¬¤¢¤ë¡£

bgzip -c SRR453566.raw.bcf > SRR453566.raw.bcf.gz
bcftools index SRR453566.raw.bcf.gz

¤³¤Î¾õÂÖ¤ÇIGV¤Ë¿©¤ï¤»¤ë¤³¤È¤¬¤Ç¤­¤ë¡£

import igv
# tracks¤ÇVCF¤ò»ØÄꤷ¤Æ¤ß¤ë
ref = { 
    "id": "Saccha", 
    "name": "Saccha",
    "fastaURL": "files/s288c.fna",
    "indexURL": "files/s288c.fna.fai",
    "tracks": [ 
      { 
        "name": "My GFF", 
        "type": "annotation", 
        "format": "gff3", 
        #"displayMode": "EXPANDED", 
        "url": "files/s288c_e.gff", 
        "indexed": False, 
      },
      { 
        "name": "VCF", 
        "type": "variant", 
        "format": "vcf", 
        "displayMode": "EXPANDED", 
        "url": "files/SRR453566.raw.bcf.gz", 
        "indexed": True, 
      } 
    ] 
}
b = igv.Browser({"reference": ref})
b.show()

°äÅÁ¸¦¹Ö½¬²ñ»ñÎÁ¤Ç¤Ï¸å½èÍý¤È¤·¤ÆFeature Count¤Ç¿ô¤¨¤ë

Feature Count¤ÎÁ°½èÍý¤È¤·¤Æ¡¢GeneID¤òÉÕÍ¿¤¹¤ë¡£

°äÅÁ¸¦¤ÎÄ󶡤¹¤ëpython¥×¥í¥°¥é¥à¡¡add_gene_id¡¡¤Ç½èÍý¡£¤½¤Î¤¿¤á¤Ë¡¢¥Ñ¥Ã¥±¡¼¥¸bcbio-gff¤¬É¬ÍפʤΤǡ¢pip¤Ç¥¤¥ó¥¹¥È¡¼¥ë

pip install bcbio-gff

¤³¤ì¤Ç¡¢add_gene_id¤ò²ÔƯ¤Ç¤­¤ë¡£

python add_gene_id s288c.gff s288c_e.gff

ÆÀ¤é¤ì¤¿s288c.e.gff¤Ë¤Ïgene_id¤¬Æþ¤Ã¤Æ¤¤¤ë¤Ï¤º¡£

¤³¤ì¤òFeature Count¤¹¤ë¡£biopapyrus¤ÎÀâÌÀ¡¡¤ò°úÍѤ¹¤ë¤È¡§

featureCounts ¤Ï¥Þ¥Ã¥Ô¥ó¥°·ë²Ì¤Ç¤¢¤ë BAM ¤Þ¤¿¤Ï SAM ¥Õ¥¡¥¤¥ë¤òÆþÎϤȤ·¡¢³Æ feature Îΰè¡Êgene¡¢CDS¡¢exon¡Ë¤´¤È¤Ë¥ê¡¼¥É¤ò¥«¥¦¥ó¥È¤¹¤ë¥×¥í¥°¥é¥à¤Ç¤¢¤ë¡£featureCounts ¤Ï Subread ¥Ñ¥Ã¥±¡¼¥¸¤Ë´Þ¤Þ¤ì¤ë¥×¥í¥°¥é¥à¤Î°ì¤Ä¤Ç¤¢¤ë¡£featureCounts ¤òÍøÍѤ¹¤ë¤Ë¤Ï subread ¤ò¥¤¥ó¥¹¥È¡¼¥ë¤¹¤ëɬÍפ¬¤¢¤ë¡£ subread ¤Î¥½¡¼¥¹¥³¡¼¥É¤È¥³¥ó¥Ñ¥¤¥ë¤µ¤ì¤¿¥×¥í¥°¥é¥à¤Ï sourceforge ¤Ç¸ø³«¤µ¤ì¤Æ¤¤¤ë¡£É¬Íפ˱þ¤¸¤Æ¤É¤Á¤é¤«¤ò¥À¥¦¥ó¥í¡¼¥É¤·¤Æ»ÈÍѤ¹¤ë¡£

¤È¤¤¤¦¤³¤È¤Ç¡¢subread¤Î¥Ú¡¼¥¸¤«¤é¥À¥¦¥ó¥í¡¼¥É¡£

featureCounts -p -T 8 -t exon -g gene_id -a $GFF \ 
 -o ../featurecount/counts.txt \ 
 SRR453566.sorted.bam \
 SRR453567.sorted.bam \ 
 SRR453568.sorted.bam \ 
 SRR453569.sorted.bam \ 
 SRR453570.sorted.bam \ 
 SRR453571.sorted.bam

$GFF¤Ï»²¾È¥·¡¼¥±¥ó¥¹¤ÎGene¾ðÊó¤ò´Þ¤à¥Õ¥¡¥¤¥ë¡£featureCount (Subread) ¤Î¥Þ¥Ë¥å¥¢¥ë¤Ç¤Ï

The annotation file should be in either GTF format or a simplified annotation format (SAF) 
as shown below (columns are tab-delimited):
GeneID	Chr	Start	End	Strand
497097	chr1	3204563	3207049	-
497097	chr1	3411783	3411982	-
497097	chr1	3660633	3661579	-
...

¤È¤·¤Æ¤¢¤ê¡¢GTF¥Õ¥©¡¼¥Þ¥Ã¥È¤ÏFrequently Asked Questions: Data File Formats¤ª¤è¤ÓGTF2.2: A Gene Annotation Format (Revised Ensembl GTF)¤Ë¤è¤ë¤È

GTF stands for Gene transfer format. It borrows from GFF, but has additional 
structure that warrants a separate definition and format name.
Structure is as GFF, so the fields are: 
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Here is a simple example with 3 translated exons. Order of rows is not important.

381 Twinscan  CDS          380   401   .   +   0  gene_id "001"; transcript_id "001.1";
381 Twinscan  CDS          501   650   .   +   2  gene_id "001"; transcript_id "001.1";
381 Twinscan  CDS          700   707   .   +   2  gene_id "001"; transcript_id "001.1";
381 Twinscan  start_codon  380   382   .   +   0  gene_id "001"; transcript_id "001.1";
381 Twinscan  stop_codon   708   710   .   +   0  gene_id "001"; transcript_id "001.1";
The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space.

¹¹¤Ë¡¢GFF¥Õ¥©¡¼¥Þ¥Ã¥È¤ÏFrequently Asked Questions: Data File Formats¤Ë¤è¤ë¤È

GFF (General Feature Format) lines are based on the Sanger GFF2 specification. 
GFF lines have nine required fields that must be tab-separated. 
If the fields are separated by spaces instead of tabs, the track will not display correctly. 
For more information on GFF format, refer to Sanger's GFF page.

¤Þ¤¿¡¢¾åµ­¤Ç»²¾È¤·¤Æ¤¤¤ëSanger¤ÎGFF2/3¤ÎÀâÌÀ¤Ï¡¢GFF2¤ÈGFF3¤Ë¤¢¤ë¡£GFF2¤Ïobsolete¤È¤µ¤ì¡¢GFF3¤¬recommend¤µ¤ì¤Æ¤¤¤ë¡£¹¹¤Ë¡¢GFF3¤Ç¤Ï¥Õ¥¡¥¤¥ë¤ÎºÇ¸å¤Ë¥ª¥×¥·¥ç¥ó¤Çsequence¤¬½ñ¤±¤ë¡£

ưºî¤¹¤ë¤³¤È¤Î¥Æ¥¹¥È¤È¤·¤Æ¡¢SRR453566.sorted.bam¤Î£±¤Ä¤À¤±¤ò½èÍý¤·¤Æ¤ß¤ë¡£

featureCounts -p -T 32 -t exon -g gene_id -a s288c_e.gff -o SRR435366.counts.txt SRR453566.sorted.bam
       ==========     _____ _    _ ____  _____  ______          _____  
       =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
         =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
           ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
             ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
       ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
	  v1.6.3
//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P SRR453566.sorted.bam                           ||
||                                                                            ||
||             Output file : SRR435366.counts.txt                             ||
||                 Summary : SRR435366.counts.txt.summary                     ||
||              Annotation : s288c_e.gff (GTF)                                ||
||      Dir for temp files : ./                                               ||
||                                                                            ||
||                 Threads : 32                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file s288c_e.gff ...                                       ||
||    Features : 6761                                                         ||
||    Meta-features : 6420                                                    ||
||    Chromosomes/contigs : 17                                                ||
||                                                                            ||
|| Process BAM file SRR453566.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 5650830                                              ||
||    Successfully assigned alignments : 4568783 (80.9%)                      ||
||    Running time : 0.03 minutes                                             ||
||                                                                            ||
||                                                                            ||
|| Summary of counting results can be found in file "SRR435366.counts.txt.su  ||
|| mmary"                                                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

½ÐÎϤϡÊÀèÆ¬¡Ë

# Program:featureCounts v1.6.3; Command:"featureCounts" "-p" "-T" "32" "-t" "exon" "-g" "gene_id" "-a" "s288c_e.gff" "-o" "SRR435366.counts.txt" "SRR453566.sorted.bam"
Geneid  Chr     Start   End     Strand  Length  SRR453566.sorted.bam
gene_0001       NC_001133.9     1807    2169    -       363     0
gene_0002       NC_001133.9     2480    2707    +       228     0
gene_0003       NC_001133.9     7235    9016    -       1782    0
gene_0004       NC_001133.9     11565   11951   -       387     0
gene_0005       NC_001133.9     12046   12426   +       381     2
gene_0006       NC_001133.9     13363   13743   -       381     0
gene_0007       NC_001133.9     21566   21850   +       285     0
gene_0008       NC_001133.9     22395   22685   -       291     0
gene_0009       NC_001133.9     24000   27968   -       3969    32
gene_0010       NC_001133.9     31567   32940   +       1374    65
gene_0011       NC_001133.9     33448   34701   +       1254    119
gene_0012       NC_001133.9     35155   36303   +       1149    1155

¡ä¡ä°äÅÁ¸¦¥»¥ß¥Ê¡¼¤Î¤³¤ÎÀè¤Î½èÍý¤ÏRNAseq­£-0¡Ê°äÅÁ¸¦¥»¥ß¥Ê¡¼¤Ç¤Î¡Ëȯ¸½²òÀÏ¡Á¥«¥¦¥ó¥È¥Ç¡¼¥¿¤Î²òÀÏ

Ê̤λñÎÁ¤Ë¤è¤ë¤È¡¢¥Ñ¥¤¥×¥é¥¤¥ó¤È¤·¤Æ¡¢HISAT2¢ÍStringTie¢Í Ballgawn

Johns Hopkins¤«¤éÏÀʸ Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols volume 11, pages 1650–1667 (2016)

¤½¤ì¤Ç¡¢HISAT2¤Ç¥ê¡¼¥É¤ò¥¢¥»¥ó¥Ö¥ë¤·¤¿·ë²Ì¤Îbam¥Õ¥¡¥¤¥ë¤òStringTie¤Ç¥¢¥»¥ó¥Ö¥ë¤¹¤ë¡£

StringTie¤Î¥¤¥ó¥¹¥È¡¼¥ë¤Ï¡¢¥Þ¥Ë¥å¥¢¥ëÄ̤ꡣ¥½¡¼¥¹¤ò¥À¥¦¥ó¥í¡¼¥É¤·¡¢make¡£¥¤¥ó¥¹¥È¡¼¥ë¤Ï¼«Ê¬¤Ç/usr/local/bin/stringtie¤«¤é¥ê¥ó¥¯¤òÄ¥¤ì¤Ð½ª¤ï¤ê¡£

½èÍý¤Ï¡¢¤¢¤é¤«¤¸¤áHISAT2¤Çsorted.bam¤¬¤Ç¤­¤Æ¤¤¤ë¤È¤·¤Æ¡¢¡ÊHISAT2¤Ç--dta¤ò»ØÄꤹ¤ë¤³¤È¡Ë

stringtie -B -p 16 -o SRR453566.gtf -G s288c_e.gff SRR453566.sorted.bam

¤³¤³¤Ç¡¢-B ¤Ï½ÐÎϤòBallgown¤Ç»È¤¦¤¿¤á¤ÎÀßÄê¡¢-G s288c_e.gff ¤Ïreference annotation file (in GTF or GFF3 format)¡¢-p 16¤Ï¥¹¥ì¥Ã¥É¿ô¤Ç¡¢½ÐÎϤÏ-o SRR453566.gtf ¤Ç»ØÄê¡£

¤³¤Îstringtie¤Î½ÐÎϤò¡¢Ballgawn¤ÇÀ°Íý¤·¤Æ²Ä»ë²½¤¹¤ë¡£
A program for computing differentially expressed genes in two or more RNA-seq experiments, using the output of StringTie or Cufflinks. The Ballgown package provides functions to organize, visualize, and analyze expression measurements. Ballgown is written in R and is part of Bioconductor.

GitHub - alyssafrazee/ballgown: Bioconductor package "ballgown"

HISAT-StringTie-Ballgown ¤ò»î¤·¤Æ¤ß¤è¤¦ - Palmsonntagmorgen¡¡¤³¤ì¤Ç¤ÏÏÀʸ¤Ç¤Ïoptional¤È¤Ê¤Ã¤Æ¤¤¤Þ¤¹¤¬¡¢gffcompare¤ò»È¤Ã¤Æ¡¢À¸À®¤µ¤ì¤¿°äÅÁ»Ò¥ê¥¹¥È¤Èreference¤Î°äÅÁ»Ò¥ê¥¹¥È¤òÈæ³Ó¤¹¤ë¤³¤È¤¬¤Ç¤­¤Þ¤¹¡£

³¤­¤È¤·¤ÆHISAT-StringTie-Ballgown ¤ò»î¤·¤Æ¤ß¤è¤¦ ¤½¤Î2 - Palmsonntagmorgen


¥È¥Ã¥×   ÊÔ½¸ Åà·ë º¹Ê¬ ¥Ð¥Ã¥¯¥¢¥Ã¥× źÉÕ Ê£À½ ̾Á°Êѹ¹ ¥ê¥í¡¼¥É   ¿·µ¬ °ìÍ÷ ñ¸ì¸¡º÷ ºÇ½ª¹¹¿·   ¥Ø¥ë¥×   ºÇ½ª¹¹¿·¤ÎRSS
Last-modified: 2019-06-09 (Æü) 06:56:26 (1329d)