Python¥Ð¥¤¥ª?¡¡Python¥Ð¥¤¥ª/¥Ä¡¼¥ë?
4330¡¡¡¡¡¡2019-06-09 (Æü) 06:56:26
Bowtie2ºÇ¿·ÈǤò¥¤¥ó¥¹¥È¡¼¥ë¡Ê¥¢¥Ã¥×¥Ç¡¼¥È¡Ë
¥À¥¦¥ó¥í¡¼¥É¤·¤Æunzip¤·¤Æmake
»È¤¤Êý¤ÎÎã¡¡biopapyrus bowtie2¡¢¥Þ¥Ë¥å¥¢¥ë¡¡http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
bowtie2-build --threads 16 GCF_000001405.38_GRCh38.p12_genomic.fna HG38.ix
Settings: Output files: "HG38.ix.*.bt2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Max bucket size: default Max bucket size, sqrt multiplier: default Max bucket size, len divisor: 4 Difference-cover sample period: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void*:8, int:4, long:8, size_t:8 Input files DNA, FASTA: GCF_000001405.38_GRCh38.p12_genomic.fna Building a SMALL index Reading reference sizes Time reading reference sizes: 00:00:20 Calculating joined length Writing header Reserving space for joined string Joining reference sequences Time to join reference sequences: 00:00:20 bmax according to bmaxDivN setting: 773987710 Using parameters --bmax 580490783 --dcv 1024 Doing ahead-of-time memory usage test Passed! Constructing with these parameters: --bmax 580490783 --dcv 1024 Constructing suffix-array element generator Building DifferenceCoverSample Building sPrime Building sPrimeOrder V-Sorting samples V-Sorting samples time: 00:01:24 Allocating rank array Ranking v-sort output Ranking v-sort output time: 00:00:17 Invoking Larsson-Sadakane on ranks Invoking Larsson-Sadakane on ranks time: 00:00:30 Sanity-checking and returning Building samples Reserving space for 12 sample suffixes Generating random suffixes QSorting 12 sample offsets, eliminating duplicates QSorting sample offsets, eliminating duplicates time: 00:00:00 Multikey QSorting 12 samples (Using difference cover) Multikey QSorting samples time: 00:00:00 Calculating bucket sizes Splitting and merging Splitting and merging time: 00:00:00 Avg bucket size: 3.09595e+09 (target: 580490782) Converting suffix-array elements to index image Allocating ftab, absorbFtab Entering Ebwt loop Getting block 1 of 1 No samples; assembling all-inclusive block Sorting block of length 3095950843 for bucket 1 (Using difference cover)
¼¡¤Ë¡¢samtools¤Ç¤Þ¤È¤á¤ë¡£ÊÂÎó»ØÄê¤Ï-@ ¥¹¥ì¥Ã¥É¿ô
samtools view -@ 32 -bS SRR002322.sam > SRR002322.bam
³¤¤¤Æ¡¢¥½¡¼¥È¡£ÊÂÎó»ØÄê¤Ï-@ ¥¹¥ì¥Ã¥É
samtools sort SRR002322.bam -o SRR002322.sorted.bam -@ 32
ºÇ¸å¤Ëmpileup¡¡¤³¤ì¤ÏÊÂÎó»ØÄê¤Ï¤Ç¤¤Ê¤¤¤é¤·¤¤
samtools mpileup -uf GCF_000001405.38_GRCh38.p12_genomic.fna SRR002322.sorted.bam | bcftools view -Ov - > SR002322.raw.bcf
¤Ê¤ª¡¢mpileup¤âÊÂÎó¼Â¹Ô¤Ç¤¤ë¥Ñ¥Ã¥±¡¼¥¸¤¬¤¢¤Ã¤¿¤Î¤Ç¡¢¤¤¤º¤ì¥Æ¥¹¥È¡£
Sambamba 0.6.8
¡ÖÀè¿Ê¥²¥Î¥à»Ù±ç¡×¾ðÊó²òÀϹֽ¬²ñ¤Î¤´°ÆÆâ
¢Í¾ðÊó²òÀϹֽ¬²ñ¥Ó¥Ç¥ª¡ã2018ǯÅÙ¡¡¾ðÊó²òÀϹֽ¬²ñ¡ÊÃæµé¼Ô¸þ¤±¡Ë¡ä
¢Í»ñÎÁ¡ÊGitHub¡Ë
°äÅÁ¸¦¹Ö½¬²ñ¤ÎÎã¤Ç¤Ï¡¢¡ÖTopHat2¤Î¸å·Ñ¤Ç¤¢¤ë¡×HISAT2¤ò»È¤Ã¤Æ¤¤¤ë¡£<-- ¥¢¥á¥ê¥¨¥Õ¤Î¥Ö¥í¥° HISAT2
HISAT2¡¡¤ËÛ©¤¯
HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs ...
http://ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-source.zip unzip make
¥Ð¥¤¥Ê¥ê¤Ïmake¤·¤¿¥Ç¥£¥ì¥¯¥È¥êÃæ¤Ë¤Ç¤¤ë¤Î¤Ç¡¢Å¬µ¹¥ê¥ó¥¯¤òÄ¥¤ë¤³¤È¡£
hisat2¤Î½èÍý¤Ï¡¢¥¤¥ó¥Ç¥Ã¥¯¥¹¤òºî¤ë(hisat2-build) ¢Í hisat2¤Ç¥Þ¥Ã¥×¡¡¤Î½ç¡£
hisat2-build s288c.fna s288c-build.fna Settings: Output files: "s288c-build.fna.*.ht2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Local offset rate: 3 (one in 8) Local fTable chars: 6 Local sequence length: 57344 Local sequence overlap between two consecutive indexes: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void*:8, int:4, long:8, size_t:8 Input files DNA, FASTA: s288c.fna Reading reference sizes Time reading reference sizes: 00:00:00 Calculating joined length Writing header Reserving space for joined string Joining reference sequences Time to join reference sequences: 00:00:00 Time to read SNPs and splice sites: 00:00:00 Using parameters --bmax 2279457 --dcv 1024 Doing ahead-of-time memory usage test Passed! Constructing with these parameters: --bmax 2279457 --dcv 1024 Constructing suffix-array element generator Building DifferenceCoverSample Building sPrime Building sPrimeOrder V-Sorting samples V-Sorting samples time: 00:00:01 Allocating rank array Ranking v-sort output Ranking v-sort output time: 00:00:00 Invoking Larsson-Sadakane on ranks Invoking Larsson-Sadakane on ranks time: 00:00:00 Sanity-checking and returning Building samples Reserving space for 12 sample suffixes Generating random suffixes QSorting 12 sample offsets, eliminating duplicates QSorting sample offsets, eliminating duplicates time: 00:00:00 Multikey QSorting 12 samples (Using difference cover) Multikey QSorting samples time: 00:00:00 Calculating bucket sizes Splitting and merging Splitting and merging time: 00:00:00 Avg bucket size: 1.73673e+06 (target: 2279456) Converting suffix-array elements to index image Allocating ftab, absorbFtab Entering GFM loop Getting block 1 of 7 Reserving size (2279457) for bucket 1 Calculating Z arrays for bucket 1 Entering block accumulator loop for bucket 1: bucket 1: 10% bucket 1: 20% bucket 1: 30% bucket 1: 40% bucket 1: 50% bucket 1: 60% bucket 1: 70% bucket 1: 80% bucket 1: 90% bucket 1: 100% Sorting block of length 1917483 for bucket 1 (Using difference cover) Sorting block time: 00:00:01 Returning block of 1917484 for bucket 1 Getting block 2 of 7 Reserving size (2279457) for bucket 2 Calculating Z arrays for bucket 2 Entering block accumulator loop for bucket 2: bucket 2: 10% bucket 2: 20% bucket 2: 30% bucket 2: 40% bucket 2: 50% bucket 2: 60% bucket 2: 70% bucket 2: 80% bucket 2: 90% bucket 2: 100% Sorting block of length 1565107 for bucket 2 (Using difference cover) Sorting block time: 00:00:00 Returning block of 1565108 for bucket 2 Getting block 3 of 7 Reserving size (2279457) for bucket 3 Calculating Z arrays for bucket 3 Entering block accumulator loop for bucket 3: bucket 3: 10% bucket 3: 20% bucket 3: 30% bucket 3: 40% bucket 3: 50% bucket 3: 60% bucket 3: 70% bucket 3: 80% bucket 3: 90% bucket 3: 100% Sorting block of length 2077071 for bucket 3 (Using difference cover) Sorting block time: 00:00:01 Returning block of 2077072 for bucket 3 Getting block 4 of 7 Reserving size (2279457) for bucket 4 Calculating Z arrays for bucket 4 Entering block accumulator loop for bucket 4: bucket 4: 10% bucket 4: 20% bucket 4: 30% bucket 4: 40% bucket 4: 50% bucket 4: 60% bucket 4: 70% bucket 4: 80% bucket 4: 90% bucket 4: 100% Sorting block of length 1959963 for bucket 4 (Using difference cover) Sorting block time: 00:00:00 Returning block of 1959964 for bucket 4 Getting block 5 of 7 Reserving size (2279457) for bucket 5 Calculating Z arrays for bucket 5 Entering block accumulator loop for bucket 5: bucket 5: 10% bucket 5: 20% bucket 5: 30% bucket 5: 40% bucket 5: 50% bucket 5: 60% bucket 5: 70% bucket 5: 80% bucket 5: 90% bucket 5: 100% Sorting block of length 1858740 for bucket 5 (Using difference cover) Sorting block time: 00:00:01 Returning block of 1858741 for bucket 5 Getting block 6 of 7 Reserving size (2279457) for bucket 6 Calculating Z arrays for bucket 6 Entering block accumulator loop for bucket 6: bucket 6: 10% bucket 6: 20% bucket 6: 30% bucket 6: 40% bucket 6: 50% bucket 6: 60% bucket 6: 70% bucket 6: 80% bucket 6: 90% bucket 6: 100% Sorting block of length 630502 for bucket 6 (Using difference cover) Sorting block time: 00:00:00 Returning block of 630503 for bucket 6 Getting block 7 of 7 Reserving size (2279457) for bucket 7 Calculating Z arrays for bucket 7 Entering block accumulator loop for bucket 7: bucket 7: 10% bucket 7: 20% bucket 7: 30% bucket 7: 40% bucket 7: 50% bucket 7: 60% bucket 7: 70% bucket 7: 80% bucket 7: 90% bucket 7: 100% Sorting block of length 2148233 for bucket 7 (Using difference cover) Sorting block time: 00:00:01 Returning block of 2148234 for bucket 7 Exited GFM loop fchr[A]: 0 fchr[C]: 3766349 fchr[G]: 6086925 fchr[T]: 8404025 fchr[$]: 12157105 Exiting GFM::buildToDisk() Returning from initFromVector Wrote 8248454 bytes to primary GFM file: s288c-build.fna.1.ht2 Wrote 3039284 bytes to secondary GFM file: s288c-build.fna.2.ht2 Re-opening _in1 and _in2 as input streams Returning from GFM constructor Returning from initFromVector Wrote 5399069 bytes to primary GFM file: s288c-build.fna.5.ht2 Wrote 3092708 bytes to secondary GFM file: s288c-build.fna.6.ht2 Re-opening _in5 and _in5 as input streams Returning from HierEbwt constructor Headers: len: 12157105 gbwtLen: 12157106 nodes: 12157106 sz: 3039277 gbwtSz: 3039277 lineRate: 6 offRate: 4 offMask: 0xfffffff0 ftabChars: 10 eftabLen: 0 eftabSz: 0 ftabLen: 1048577 ftabSz: 4194308 offsLen: 759820 offsSz: 3039280 lineSz: 64 sideSz: 64 sideGbwtSz: 48 sideGbwtLen: 192 numSides: 63319 numLines: 63319 gbwtTotLen: 4052416 gbwtTotSz: 4052416 reverse: 0 linearFM: Yes Total time for call to driver() for forward index: 00:00:06
½àÈ÷½ª¤ï¤ê¡£
¥Þ¥Ã¥Ô¥ó¥°¤Ï
hisat2 -p 32 --dta-cufflinks -x s288c-build.fna -1 paired_SRR453566_1.trim.fastq -2 paired_SRR453566_2.trim.fastq -S SRR453566.sam 5115482 reads; of these: 5115482 (100.00%) were paired; of these: 323268 (6.32%) aligned concordantly 0 times 4524989 (88.46%) aligned concordantly exactly 1 time 267225 (5.22%) aligned concordantly >1 times ---- 323268 pairs aligned concordantly 0 times; of these: 77115 (23.85%) aligned discordantly 1 time ---- 246153 pairs aligned 0 times concordantly or discordantly; of these: 492306 mates make up the pairs; of these: 341629 (69.39%) aligned 0 times 135729 (27.57%) aligned exactly 1 time 14948 (3.04%) aligned >1 times 96.66% overall alignment rate
¤Ç½ª¤ï¤ê¡£
¤³¤³¤Ç¡¢samtools¤ò»È¤Ã¤Æsam¥Õ¥¡¥¤¥ë¤òbam¥Õ¥¡¥¤¥ë¤ËÊÑ´¹¤·¡¢¤«¤Ä¡¢position¤Çsort¤¹¤ë¡£
samtools/bcftools¤Î¥¤¥ó¥¹¥È¡¼¥ë¤Ï¡¢Samtools¤Î¥µ¥¤¥È]http://www.htslib.org/download/¤«¤é¥À¥¦¥ó¥í¡¼¥É¤·¡¢
./configure make sudo make install
¤Ç´°Î»¡£
samtools view¤ò»È¤Ã¤Æsam¥Õ¥¡¥¤¥ëSRR453566.sam ¤òbam¥Õ¥¡¥¤¥ë SR453566.bam¤ËÊÑ´¹¡£@32¤Ï32¥¹¥ì¥Ã¥É»ÈÍÑ¡£
samtools view -@ 32 -b SRR453566.sam > SRR453566.bam Ëô¤Ï samtools view -@ 32 -b SRR453566.sam -o SRR453566.bam
samtools sort¤ò»È¤Ã¤Æbam¥Õ¥¡¥¤¥ëSRR453566.bam ¤ò¥½¡¼¥È¤·¤ÆSRR453566.sorted.bam¤Ë½ÐÎÏ
samtools sort -@ 32 SRR453566.bam > SRR453566.sorted.bam [bam_sort_core] merging from 0 files and 32 in-memory blocks...
samtools mpileup -uvf s288c.fna SRR453566.sorted.bam > SRR453566.vcf [warning] samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future. [warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future. [mpileup] 1 samples in 1 input files
Ʊ¤¸¤³¤È¤òbcftools¤Ç¤Ï
bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam > SRR453566.vcf
¤Ê¤ª¡¢¥Ñ¥é¥á¥¿ --threads 16 ¤Ê¤É¤Ï¼ÂºÝ¤Ë¤ÏÊÂÎ󲽤θú²Ì¤¬Ìµ¤«¤Ã¤¿¡£
bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | bcftools call -vmO z -o <study.vcf.gz>
¤È½ñ¤¤¤Æ¤¤¤ë¡£
bcftools¤Î¥Þ¥Ë¥å¥¢¥ë¤Ç¤Ïview¤ÎÂå¤ï¤ê¤Ëcall¤ò»È¤¨¤È¤·¤Æ¤ª¤ê¡¢¶ñÂÎŪ¤ÊÃê½Ð½èÍý¤ò¤É¤¦¤¹¤ë¤«Î㤬¤¤¤¯¤Ä¤«¤¢¤ë¡£
¹¹¤Ë¡¢ HISAT2¤Ç¤Ï¡¡
bcftools view -bvcg SRR453566.vcf > SRR453566.raw.bcf¡¡
¤È¤¹¤ë¤È¤·¤Æ¤¤¤ë¤¬¡¢¥Ñ¥é¥á¥¿-bvgf¤Ï¥¨¥é¡¼¡£
²òÀâ¤ò¸«¤ë¤È¡¢ºÇ¶á¤Î¥Ð¡¼¥¸¥ç¥ó¤Ç¤Ïbcftools call¤Ë¤è¤Ã¤Æ¥³¡¼¥ë¤ò¼è½Ð¤¹¤é¤·¤¤¡£
bcftools call -O v -v -c SRR453566.vcf > SRR453566.raw.bcf
¤¬Àµ²ò¤ÎÌÏÍÍ¡£ÆÀ¤é¤ì¤¿·ë²Ì¤Ï¡ÊÀèÆ¬¥³¥á¥ó¥È¥é¥¤¥ó¤ò½ü¤¯¤È¡Ë
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR453566.sorted.bam NC_001133.9 137 . C CT 4.4191 . INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-37.5262 GT:PL 0/1:40,3,0 NC_001133.9 349 . C T 8.64911 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,0,1;MQ=60;FQ=-29.9908 GT:PL 1/1:38,3,0 NC_001133.9 11969 . C G 10.4247 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9905 GT:PL 1/1:40,3,0 NC_001133.9 11997 . T A 27.0206 . DP=2;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9901 GT:PL 1/1:57,3,0 NC_001133.9 12361 . G A 4.13164 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928 GT:PL 0/1:32,3,0 NC_001133.9 12373 . G A 4.13164 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928 GT:PL 0/1:32,3,0
¤È¤¤¤Ã¤¿´¶¤¸¡£INFOÉôʬ¤Î²òÀâ¤Ï¥Ø¥Ã¥ÀÉôʬ¤Ë¤¢¤ëÄ̤ꡢ
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)"> ##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)"> ##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric."> ##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)"> ##INFO=<ID=AF2,Number=1,Type=Float,Description="Max-likelihood estimate of the first and second group ALT allele frequency (assuming HWE)"> ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)"> ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads"> ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same"> ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias"> ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies"> ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3"> ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
¥Ð¥ê¥¢¥ó¥È¥³¡¼¥ë·ë²Ì¤ÎVCF¥Õ¥©¡¼¥Þ¥Ã¥È - mac¤Ç¥¤¥ó¥Õ¥©¥Þ¥Æ¥£¥¯¥¹
bcftools mpileup¤Ç¡¢Ê£¿ô¥µ¥ó¥×¥ë¤òʤ٤Ƥߤ롣¤³¤Î¥Ú¡¼¥¸, bcftools¤Î¥Ú¡¼¥¸
bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam SRR453567.sorted.bam SRR453568.sorted.bam SRR453569.sorted.bam SRR453570.sorted.bam SRR453571.sorted.bam | bcftools call -Ou -mv -o SRR453566_mpileup.call.vcf
¹¹¤Ë·ë²Ì¤òÆÉ¤á¤ë¤è¤¦¤Ë¤¹¤ë¡£
bcftools view SRR453566_mpileup.call.vcf | vcf-annotate -f + > SRR453566_mpileup.call.annotate.vcf
¼¡¤Î¤è¤¦¤Ê·ë²Ì¤Ë¤Ê¤Ã¤¿¡£
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR453566.sorted.bam SRR453567.sorted.bam SRR453568.sorted.bam SRR453569.sorted.bam SRR453570.sorted.bam SRR453571.sorted.bam NC_001133.9 136 . G A 68 SnpGap DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60 GT:PL 1/1:23,3,0 1/1:24,3,0 ./.:0,0,0 1/1:23,3,0 1/1:23,3,0 ./.:0,0,0 NC_001133.9 137 . C CT 135 PASS INDEL;IDV=1;IMF=1;DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60 GT:PL 1/1:40,3,0 1/1:40,3,0 ./.:0,0,0 1/1:40,3,0 1/1:40,3,0 ./.:0,0,0 NC_001133.9 286 . A T 13.4171 MinAB;MinDP DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60 GT:PL ./.:0,0,0 ./.:0,0,0 1/1:38,3,0 ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 NC_001133.9 305 . C G 10.6101 MinAB;MinDP DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60 GT:PL ./.:0,0,0 ./.:0,0,0 1/1:35,3,0 ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 NC_001133.9 349 . C T 88 PASS DP=3;VDB=0.102722;SGB=-0.471632;MQ0F=0;AC=6;AN=6;DP4=0,0,0,3;MQ=60 GT:PL 1/1:38,3,0 ./.:0,0,0 1/1:35,3,0 ./.:0,0,0 ./.:0,0,0 1/1:40,3,0 NC_001133.9 373 . T C 9.69791 Qual;MinAB DP=2;SGB=-0.045183;RPB=1;MQB=1;BQB=1;MQ0F=0;ICB=0.5;HOB=0.5;AC=2;AN=4;DP4=0,1,0,1;MQ=60 GT:PL 0/1:0,3,41 ./.:0,0,0 0/1:40,3,0 ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 NC_001133.9 457 . CAAA CAA 7.96917 Qual;MinAB;MinDP INDEL;IDV=1;IMF=1;DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=60 GT:PL ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 1/1:32,3,0 ./.:0,0,0
¥µ¥ó¥×¥ë´Ö¤Ç¶¦Ä̤¹¤ëÊѰۤȸÇͤÎÊѰۤòÃê½Ð¤¹¤ë - mac¤Ç¥¤¥ó¥Õ¥©¥Þ¥Æ¥£¥¯¥¹
¤½¤ì¤Ç¡¢bam¥Õ¥¡¥¤¥ë¤ä¡¢callÃê½Ð¤·¤¿¸å¤Îvcf¥Õ¥¡¥¤¥ë¤Ë¡¢¥¤¥ó¥Ç¥Ã¥¯¥¹ÉÕ¤±¤¬É¬Íפˤʤ뤳¤È¤¬¤¢¤ë¡£¤¿¤È¤¨¤ÐIGV¤Çɽ¼¨¤¹¤ë¤È¤¤Ê¤É¡£
¥¤¥ó¥Ç¥Ã¥¯¥¹ÉÕ¤±¤Ïbcftools¤Ç²Äǽ¤À¤¬¡¢ÆþÎϤ¿¤ëvcf¥Õ¥¡¥¤¥ë¤¬bgzip¤µ¤ì¤Æ¤¤¤Ê¤±¤ì¤Ð¤Ê¤é¤Ê¤¤¡£gdzip¼«ÂΤÏsamtools¤Îhtslib¤ÎÃæ¤ËÆþ¤Ã¤Æ¤¤¤Æ¡¢samtools¤òmake; make install¤·¤¿¤À¤±¤Ç¤ÏÆþ¤é¤Ê¤¤¡£make all all-htslib; make install install-htslib¡¡¤Î¤è¤¦¤ËÄɲÃmake¤¹¤ëɬÍפ¬¤¢¤ë¡£
bgzip -c SRR453566.raw.bcf > SRR453566.raw.bcf.gz bcftools index SRR453566.raw.bcf.gz
¤³¤Î¾õÂÖ¤ÇIGV¤Ë¿©¤ï¤»¤ë¤³¤È¤¬¤Ç¤¤ë¡£
import igv # tracks¤ÇVCF¤ò»ØÄꤷ¤Æ¤ß¤ë ref = { "id": "Saccha", "name": "Saccha", "fastaURL": "files/s288c.fna", "indexURL": "files/s288c.fna.fai", "tracks": [ { "name": "My GFF", "type": "annotation", "format": "gff3", #"displayMode": "EXPANDED", "url": "files/s288c_e.gff", "indexed": False, }, { "name": "VCF", "type": "variant", "format": "vcf", "displayMode": "EXPANDED", "url": "files/SRR453566.raw.bcf.gz", "indexed": True, } ] } b = igv.Browser({"reference": ref}) b.show()
Feature Count¤ÎÁ°½èÍý¤È¤·¤Æ¡¢GeneID¤òÉÕÍ¿¤¹¤ë¡£
°äÅÁ¸¦¤ÎÄ󶡤¹¤ëpython¥×¥í¥°¥é¥à¡¡add_gene_id¡¡¤Ç½èÍý¡£¤½¤Î¤¿¤á¤Ë¡¢¥Ñ¥Ã¥±¡¼¥¸bcbio-gff¤¬É¬ÍפʤΤǡ¢pip¤Ç¥¤¥ó¥¹¥È¡¼¥ë
pip install bcbio-gff
¤³¤ì¤Ç¡¢add_gene_id¤ò²ÔƯ¤Ç¤¤ë¡£
python add_gene_id s288c.gff s288c_e.gff
ÆÀ¤é¤ì¤¿s288c.e.gff¤Ë¤Ïgene_id¤¬Æþ¤Ã¤Æ¤¤¤ë¤Ï¤º¡£
¤³¤ì¤òFeature Count¤¹¤ë¡£biopapyrus¤ÎÀâÌÀ¡¡¤ò°úÍѤ¹¤ë¤È¡§
featureCounts ¤Ï¥Þ¥Ã¥Ô¥ó¥°·ë²Ì¤Ç¤¢¤ë BAM ¤Þ¤¿¤Ï SAM ¥Õ¥¡¥¤¥ë¤òÆþÎϤȤ·¡¢³Æ feature Îΰè¡Êgene¡¢CDS¡¢exon¡Ë¤´¤È¤Ë¥ê¡¼¥É¤ò¥«¥¦¥ó¥È¤¹¤ë¥×¥í¥°¥é¥à¤Ç¤¢¤ë¡£featureCounts ¤Ï Subread ¥Ñ¥Ã¥±¡¼¥¸¤Ë´Þ¤Þ¤ì¤ë¥×¥í¥°¥é¥à¤Î°ì¤Ä¤Ç¤¢¤ë¡£featureCounts ¤òÍøÍѤ¹¤ë¤Ë¤Ï subread ¤ò¥¤¥ó¥¹¥È¡¼¥ë¤¹¤ëɬÍפ¬¤¢¤ë¡£ subread ¤Î¥½¡¼¥¹¥³¡¼¥É¤È¥³¥ó¥Ñ¥¤¥ë¤µ¤ì¤¿¥×¥í¥°¥é¥à¤Ï sourceforge ¤Ç¸ø³«¤µ¤ì¤Æ¤¤¤ë¡£É¬Íפ˱þ¤¸¤Æ¤É¤Á¤é¤«¤ò¥À¥¦¥ó¥í¡¼¥É¤·¤Æ»ÈÍѤ¹¤ë¡£
¤È¤¤¤¦¤³¤È¤Ç¡¢subread¤Î¥Ú¡¼¥¸¤«¤é¥À¥¦¥ó¥í¡¼¥É¡£
featureCounts -p -T 8 -t exon -g gene_id -a $GFF \ -o ../featurecount/counts.txt \ SRR453566.sorted.bam \ SRR453567.sorted.bam \ SRR453568.sorted.bam \ SRR453569.sorted.bam \ SRR453570.sorted.bam \ SRR453571.sorted.bam
$GFF¤Ï»²¾È¥·¡¼¥±¥ó¥¹¤ÎGene¾ðÊó¤ò´Þ¤à¥Õ¥¡¥¤¥ë¡£featureCount (Subread) ¤Î¥Þ¥Ë¥å¥¢¥ë¤Ç¤Ï
The annotation file should be in either GTF format or a simplified annotation format (SAF) as shown below (columns are tab-delimited): GeneID Chr Start End Strand 497097 chr1 3204563 3207049 - 497097 chr1 3411783 3411982 - 497097 chr1 3660633 3661579 - ...
¤È¤·¤Æ¤¢¤ê¡¢GTF¥Õ¥©¡¼¥Þ¥Ã¥È¤ÏFrequently Asked Questions: Data File Formats¤ª¤è¤ÓGTF2.2: A Gene Annotation Format (Revised Ensembl GTF)¤Ë¤è¤ë¤È
GTF stands for Gene transfer format. It borrows from GFF, but has additional structure that warrants a separate definition and format name. Structure is as GFF, so the fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments] Here is a simple example with 3 translated exons. Order of rows is not important. 381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1"; 381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1"; 381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1"; 381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1"; 381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1"; The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space.
¹¹¤Ë¡¢GFF¥Õ¥©¡¼¥Þ¥Ã¥È¤ÏFrequently Asked Questions: Data File Formats¤Ë¤è¤ë¤È
GFF (General Feature Format) lines are based on the Sanger GFF2 specification. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. For more information on GFF format, refer to Sanger's GFF page.
¤Þ¤¿¡¢¾åµ¤Ç»²¾È¤·¤Æ¤¤¤ëSanger¤ÎGFF2/3¤ÎÀâÌÀ¤Ï¡¢GFF2¤ÈGFF3¤Ë¤¢¤ë¡£GFF2¤Ïobsolete¤È¤µ¤ì¡¢GFF3¤¬recommend¤µ¤ì¤Æ¤¤¤ë¡£¹¹¤Ë¡¢GFF3¤Ç¤Ï¥Õ¥¡¥¤¥ë¤ÎºÇ¸å¤Ë¥ª¥×¥·¥ç¥ó¤Çsequence¤¬½ñ¤±¤ë¡£
ưºî¤¹¤ë¤³¤È¤Î¥Æ¥¹¥È¤È¤·¤Æ¡¢SRR453566.sorted.bam¤Î£±¤Ä¤À¤±¤ò½èÍý¤·¤Æ¤ß¤ë¡£
featureCounts -p -T 32 -t exon -g gene_id -a s288c_e.gff -o SRR435366.counts.txt SRR453566.sorted.bam
========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.6.3
//========================== featureCounts setting ===========================\\ || || || Input files : 1 BAM file || || P SRR453566.sorted.bam || || || || Output file : SRR435366.counts.txt || || Summary : SRR435366.counts.txt.summary || || Annotation : s288c_e.gff (GTF) || || Dir for temp files : ./ || || || || Threads : 32 || || Level : meta-feature level || || Paired-end : yes || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || Min overlapping bases : 1 || || || || Chimeric reads : counted || || Both ends mapped : not required || || || \\===================== http://subread.sourceforge.net/ ======================// //================================= Running ==================================\\ || || || Load annotation file s288c_e.gff ... || || Features : 6761 || || Meta-features : 6420 || || Chromosomes/contigs : 17 || || || || Process BAM file SRR453566.sorted.bam... || || Paired-end reads are included. || || Assign alignments (paired-end) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Pairing up the read pairs. || || || || Total alignments : 5650830 || || Successfully assigned alignments : 4568783 (80.9%) || || Running time : 0.03 minutes || || || || || || Summary of counting results can be found in file "SRR435366.counts.txt.su || || mmary" || || || \\===================== http://subread.sourceforge.net/ ======================//
½ÐÎϤϡÊÀèÆ¬¡Ë
# Program:featureCounts v1.6.3; Command:"featureCounts" "-p" "-T" "32" "-t" "exon" "-g" "gene_id" "-a" "s288c_e.gff" "-o" "SRR435366.counts.txt" "SRR453566.sorted.bam" Geneid Chr Start End Strand Length SRR453566.sorted.bam gene_0001 NC_001133.9 1807 2169 - 363 0 gene_0002 NC_001133.9 2480 2707 + 228 0 gene_0003 NC_001133.9 7235 9016 - 1782 0 gene_0004 NC_001133.9 11565 11951 - 387 0 gene_0005 NC_001133.9 12046 12426 + 381 2 gene_0006 NC_001133.9 13363 13743 - 381 0 gene_0007 NC_001133.9 21566 21850 + 285 0 gene_0008 NC_001133.9 22395 22685 - 291 0 gene_0009 NC_001133.9 24000 27968 - 3969 32 gene_0010 NC_001133.9 31567 32940 + 1374 65 gene_0011 NC_001133.9 33448 34701 + 1254 119 gene_0012 NC_001133.9 35155 36303 + 1149 1155
¡ä¡ä°äÅÁ¸¦¥»¥ß¥Ê¡¼¤Î¤³¤ÎÀè¤Î½èÍý¤ÏRNAseq£-0¡Ê°äÅÁ¸¦¥»¥ß¥Ê¡¼¤Ç¤Î¡Ëȯ¸½²òÀÏ¡Á¥«¥¦¥ó¥È¥Ç¡¼¥¿¤Î²òÀÏ
Johns Hopkins¤«¤éÏÀʸ Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols volume 11, pages 1650–1667 (2016)
¤½¤ì¤Ç¡¢HISAT2¤Ç¥ê¡¼¥É¤ò¥¢¥»¥ó¥Ö¥ë¤·¤¿·ë²Ì¤Îbam¥Õ¥¡¥¤¥ë¤òStringTie¤Ç¥¢¥»¥ó¥Ö¥ë¤¹¤ë¡£
StringTie¤Î¥¤¥ó¥¹¥È¡¼¥ë¤Ï¡¢¥Þ¥Ë¥å¥¢¥ëÄ̤ꡣ¥½¡¼¥¹¤ò¥À¥¦¥ó¥í¡¼¥É¤·¡¢make¡£¥¤¥ó¥¹¥È¡¼¥ë¤Ï¼«Ê¬¤Ç/usr/local/bin/stringtie¤«¤é¥ê¥ó¥¯¤òÄ¥¤ì¤Ð½ª¤ï¤ê¡£
½èÍý¤Ï¡¢¤¢¤é¤«¤¸¤áHISAT2¤Çsorted.bam¤¬¤Ç¤¤Æ¤¤¤ë¤È¤·¤Æ¡¢¡ÊHISAT2¤Ç--dta¤ò»ØÄꤹ¤ë¤³¤È¡Ë
stringtie -B -p 16 -o SRR453566.gtf -G s288c_e.gff SRR453566.sorted.bam
¤³¤³¤Ç¡¢-B ¤Ï½ÐÎϤòBallgown¤Ç»È¤¦¤¿¤á¤ÎÀßÄê¡¢-G s288c_e.gff ¤Ïreference annotation file (in GTF or GFF3 format)¡¢-p 16¤Ï¥¹¥ì¥Ã¥É¿ô¤Ç¡¢½ÐÎϤÏ-o SRR453566.gtf ¤Ç»ØÄê¡£
e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab¤Ç¤¢¤ê¡¢¤³¤ì¤òBallgawn¤Ë¿©¤ï¤»¤ëɬÍפ¬¤¢¤ë¡£
¤³¤Îstringtie¤Î½ÐÎϤò¡¢Ballgawn¤ÇÀ°Íý¤·¤Æ²Ä»ë²½¤¹¤ë¡£
A program for computing differentially expressed genes in two or more RNA-seq experiments, using the output of StringTie or Cufflinks. The Ballgown package provides functions to organize, visualize, and analyze expression measurements. Ballgown is written in R and is part of Bioconductor.
GitHub - alyssafrazee/ballgown: Bioconductor package "ballgown"
HISAT-StringTie-Ballgown ¤ò»î¤·¤Æ¤ß¤è¤¦ - Palmsonntagmorgen¡¡¤³¤ì¤Ç¤ÏÏÀʸ¤Ç¤Ïoptional¤È¤Ê¤Ã¤Æ¤¤¤Þ¤¹¤¬¡¢gffcompare¤ò»È¤Ã¤Æ¡¢À¸À®¤µ¤ì¤¿°äÅÁ»Ò¥ê¥¹¥È¤Èreference¤Î°äÅÁ»Ò¥ê¥¹¥È¤òÈæ³Ó¤¹¤ë¤³¤È¤¬¤Ç¤¤Þ¤¹¡£
³¤¤È¤·¤ÆHISAT-StringTie-Ballgown ¤ò»î¤·¤Æ¤ß¤è¤¦ ¤½¤Î2 - Palmsonntagmorgen