[[Python¥Ð¥¤¥ª]]¡¡[[Python¥Ð¥¤¥ª/¥Ä¡¼¥ë]]~ &counter();¡¡¡¡¡¡&lastmod();~ *¥ê¡¼¥É¤Î»²¾ÈÇÛÎó¤Ø¤Î¥Þ¥Ã¥Ô¥ó¥° [#mbaaa438] Bowtie2ºÇ¿·ÈǤò¥¤¥ó¥¹¥È¡¼¥ë¡Ê¥¢¥Ã¥×¥Ç¡¼¥È¡Ë~ ¥À¥¦¥ó¥í¡¼¥É¤·¤Æunzip¤·¤Æmake »È¤¤Êý¤ÎÎã¡¡[[biopapyrus bowtie2:https://bi.biopapyrus.jp/rnaseq/mapping/bowtie2/]]¡¢¥Þ¥Ë¥å¥¢¥ë¡¡[[http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example]] bowtie2-build --threads 16 GCF_000001405.38_GRCh38.p12_genomic.fna HG38.ix Settings: Output files: "HG38.ix.*.bt2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Max bucket size: default Max bucket size, sqrt multiplier: default Max bucket size, len divisor: 4 Difference-cover sample period: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void*:8, int:4, long:8, size_t:8 Input files DNA, FASTA: GCF_000001405.38_GRCh38.p12_genomic.fna Building a SMALL index Reading reference sizes Time reading reference sizes: 00:00:20 Calculating joined length Writing header Reserving space for joined string Joining reference sequences Time to join reference sequences: 00:00:20 bmax according to bmaxDivN setting: 773987710 Using parameters --bmax 580490783 --dcv 1024 Doing ahead-of-time memory usage test Passed! Constructing with these parameters: --bmax 580490783 --dcv 1024 Constructing suffix-array element generator Building DifferenceCoverSample Building sPrime Building sPrimeOrder V-Sorting samples V-Sorting samples time: 00:01:24 Allocating rank array Ranking v-sort output Ranking v-sort output time: 00:00:17 Invoking Larsson-Sadakane on ranks Invoking Larsson-Sadakane on ranks time: 00:00:30 Sanity-checking and returning Building samples Reserving space for 12 sample suffixes Generating random suffixes QSorting 12 sample offsets, eliminating duplicates QSorting sample offsets, eliminating duplicates time: 00:00:00 Multikey QSorting 12 samples (Using difference cover) Multikey QSorting samples time: 00:00:00 Calculating bucket sizes Splitting and merging Splitting and merging time: 00:00:00 Avg bucket size: 3.09595e+09 (target: 580490782) Converting suffix-array elements to index image Allocating ftab, absorbFtab Entering Ebwt loop Getting block 1 of 1 No samples; assembling all-inclusive block Sorting block of length 3095950843 for bucket 1 (Using difference cover) ¼¡¤Ë¡¢samtools¤Ç¤Þ¤È¤á¤ë¡£ÊÂÎó»ØÄê¤Ï-@ ¥¹¥ì¥Ã¥É¿ô samtools view -@ 32 -bS SRR002322.sam > SRR002322.bam ³¤¤¤Æ¡¢¥½¡¼¥È¡£ÊÂÎó»ØÄê¤Ï-@ ¥¹¥ì¥Ã¥É samtools sort SRR002322.bam -o SRR002322.sorted.bam -@ 32 ºÇ¸å¤Ëmpileup¡¡¤³¤ì¤ÏÊÂÎó»ØÄê¤Ï¤Ç¤¤Ê¤¤¤é¤·¤¤ samtools mpileup -uf GCF_000001405.38_GRCh38.p12_genomic.fna SRR002322.sorted.bam | bcftools view -Ov - > SR002322.raw.bcf ¤Ê¤ª¡¢mpileup¤âÊÂÎó¼Â¹Ô¤Ç¤¤ë¥Ñ¥Ã¥±¡¼¥¸¤¬¤¢¤Ã¤¿¤Î¤Ç¡¢¤¤¤º¤ì¥Æ¥¹¥È¡£~ [[Sambamba 0.6.8:https://github.com/biod/sambamba/releases]] *°äÅÁ¸¦¹Ö½¬²ñ¤Ç¤ÎÎã [#c85616f7] [[¡ÖÀè¿Ê¥²¥Î¥à»Ù±ç¡×¾ðÊó²òÀϹֽ¬²ñ¤Î¤´°ÆÆâ:https://www.genome-sci.jp/whatsnew/event/news20180920.html]]~ ¢Í[[¾ðÊó²òÀϹֽ¬²ñ¥Ó¥Ç¥ª¡ã2018ǯÅÙ¡¡¾ðÊó²òÀϹֽ¬²ñ¡ÊÃæµé¼Ô¸þ¤±¡Ë¡ä:https://www.genome-sci.jp/lecture20181st]]~ ¢Í[[»ñÎÁ¡ÊGitHub¡Ë:https://github.com/genome-sci/python_bioinfo_2018]] °äÅÁ¸¦¹Ö½¬²ñ¤ÎÎã¤Ç¤Ï¡¢¡ÖTopHat2¤Î¸å·Ñ¤Ç¤¢¤ë¡×HISAT2¤ò»È¤Ã¤Æ¤¤¤ë¡£<-- [[¥¢¥á¥ê¥¨¥Õ¤Î¥Ö¥í¥° HISAT2:http://staffblog.amelieff.jp/entry/2016/07/01/144419]] [[HISAT2:https://ccb.jhu.edu/software/hisat2/index.shtml]]¡¡¤ËÛ©¤¯ >HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs ... http://ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-source.zip unzip make ¥Ð¥¤¥Ê¥ê¤Ïmake¤·¤¿¥Ç¥£¥ì¥¯¥È¥êÃæ¤Ë¤Ç¤¤ë¤Î¤Ç¡¢Å¬µ¹¥ê¥ó¥¯¤òÄ¥¤ë¤³¤È¡£ hisat2¤Î½èÍý¤Ï¡¢¥¤¥ó¥Ç¥Ã¥¯¥¹¤òºî¤ë(hisat2-build) ¢Í hisat2¤Ç¥Þ¥Ã¥×¡¡¤Î½ç¡£ hisat2-build s288c.fna s288c-build.fna Settings: Output files: "s288c-build.fna.*.ht2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Local offset rate: 3 (one in 8) Local fTable chars: 6 Local sequence length: 57344 Local sequence overlap between two consecutive indexes: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void*:8, int:4, long:8, size_t:8 Input files DNA, FASTA: s288c.fna Reading reference sizes Time reading reference sizes: 00:00:00 Calculating joined length Writing header Reserving space for joined string Joining reference sequences Time to join reference sequences: 00:00:00 Time to read SNPs and splice sites: 00:00:00 Using parameters --bmax 2279457 --dcv 1024 Doing ahead-of-time memory usage test Passed! Constructing with these parameters: --bmax 2279457 --dcv 1024 Constructing suffix-array element generator Building DifferenceCoverSample Building sPrime Building sPrimeOrder V-Sorting samples V-Sorting samples time: 00:00:01 Allocating rank array Ranking v-sort output Ranking v-sort output time: 00:00:00 Invoking Larsson-Sadakane on ranks Invoking Larsson-Sadakane on ranks time: 00:00:00 Sanity-checking and returning Building samples Reserving space for 12 sample suffixes Generating random suffixes QSorting 12 sample offsets, eliminating duplicates QSorting sample offsets, eliminating duplicates time: 00:00:00 Multikey QSorting 12 samples (Using difference cover) Multikey QSorting samples time: 00:00:00 Calculating bucket sizes Splitting and merging Splitting and merging time: 00:00:00 Avg bucket size: 1.73673e+06 (target: 2279456) Converting suffix-array elements to index image Allocating ftab, absorbFtab Entering GFM loop Getting block 1 of 7 Reserving size (2279457) for bucket 1 Calculating Z arrays for bucket 1 Entering block accumulator loop for bucket 1: bucket 1: 10% bucket 1: 20% bucket 1: 30% bucket 1: 40% bucket 1: 50% bucket 1: 60% bucket 1: 70% bucket 1: 80% bucket 1: 90% bucket 1: 100% Sorting block of length 1917483 for bucket 1 (Using difference cover) Sorting block time: 00:00:01 Returning block of 1917484 for bucket 1 Getting block 2 of 7 Reserving size (2279457) for bucket 2 Calculating Z arrays for bucket 2 Entering block accumulator loop for bucket 2: bucket 2: 10% bucket 2: 20% bucket 2: 30% bucket 2: 40% bucket 2: 50% bucket 2: 60% bucket 2: 70% bucket 2: 80% bucket 2: 90% bucket 2: 100% Sorting block of length 1565107 for bucket 2 (Using difference cover) Sorting block time: 00:00:00 Returning block of 1565108 for bucket 2 Getting block 3 of 7 Reserving size (2279457) for bucket 3 Calculating Z arrays for bucket 3 Entering block accumulator loop for bucket 3: bucket 3: 10% bucket 3: 20% bucket 3: 30% bucket 3: 40% bucket 3: 50% bucket 3: 60% bucket 3: 70% bucket 3: 80% bucket 3: 90% bucket 3: 100% Sorting block of length 2077071 for bucket 3 (Using difference cover) Sorting block time: 00:00:01 Returning block of 2077072 for bucket 3 Getting block 4 of 7 Reserving size (2279457) for bucket 4 Calculating Z arrays for bucket 4 Entering block accumulator loop for bucket 4: bucket 4: 10% bucket 4: 20% bucket 4: 30% bucket 4: 40% bucket 4: 50% bucket 4: 60% bucket 4: 70% bucket 4: 80% bucket 4: 90% bucket 4: 100% Sorting block of length 1959963 for bucket 4 (Using difference cover) Sorting block time: 00:00:00 Returning block of 1959964 for bucket 4 Getting block 5 of 7 Reserving size (2279457) for bucket 5 Calculating Z arrays for bucket 5 Entering block accumulator loop for bucket 5: bucket 5: 10% bucket 5: 20% bucket 5: 30% bucket 5: 40% bucket 5: 50% bucket 5: 60% bucket 5: 70% bucket 5: 80% bucket 5: 90% bucket 5: 100% Sorting block of length 1858740 for bucket 5 (Using difference cover) Sorting block time: 00:00:01 Returning block of 1858741 for bucket 5 Getting block 6 of 7 Reserving size (2279457) for bucket 6 Calculating Z arrays for bucket 6 Entering block accumulator loop for bucket 6: bucket 6: 10% bucket 6: 20% bucket 6: 30% bucket 6: 40% bucket 6: 50% bucket 6: 60% bucket 6: 70% bucket 6: 80% bucket 6: 90% bucket 6: 100% Sorting block of length 630502 for bucket 6 (Using difference cover) Sorting block time: 00:00:00 Returning block of 630503 for bucket 6 Getting block 7 of 7 Reserving size (2279457) for bucket 7 Calculating Z arrays for bucket 7 Entering block accumulator loop for bucket 7: bucket 7: 10% bucket 7: 20% bucket 7: 30% bucket 7: 40% bucket 7: 50% bucket 7: 60% bucket 7: 70% bucket 7: 80% bucket 7: 90% bucket 7: 100% Sorting block of length 2148233 for bucket 7 (Using difference cover) Sorting block time: 00:00:01 Returning block of 2148234 for bucket 7 Exited GFM loop fchr[A]: 0 fchr[C]: 3766349 fchr[G]: 6086925 fchr[T]: 8404025 fchr[$]: 12157105 Exiting GFM::buildToDisk() Returning from initFromVector Wrote 8248454 bytes to primary GFM file: s288c-build.fna.1.ht2 Wrote 3039284 bytes to secondary GFM file: s288c-build.fna.2.ht2 Re-opening _in1 and _in2 as input streams Returning from GFM constructor Returning from initFromVector Wrote 5399069 bytes to primary GFM file: s288c-build.fna.5.ht2 Wrote 3092708 bytes to secondary GFM file: s288c-build.fna.6.ht2 Re-opening _in5 and _in5 as input streams Returning from HierEbwt constructor Headers: len: 12157105 gbwtLen: 12157106 nodes: 12157106 sz: 3039277 gbwtSz: 3039277 lineRate: 6 offRate: 4 offMask: 0xfffffff0 ftabChars: 10 eftabLen: 0 eftabSz: 0 ftabLen: 1048577 ftabSz: 4194308 offsLen: 759820 offsSz: 3039280 lineSz: 64 sideSz: 64 sideGbwtSz: 48 sideGbwtLen: 192 numSides: 63319 numLines: 63319 gbwtTotLen: 4052416 gbwtTotSz: 4052416 reverse: 0 linearFM: Yes Total time for call to driver() for forward index: 00:00:06 ½àÈ÷½ª¤ï¤ê¡£ ¥Þ¥Ã¥Ô¥ó¥°¤Ï hisat2 -p 32 --dta-cufflinks -x s288c-build.fna -1 paired_SRR453566_1.trim.fastq -2 paired_SRR453566_2.trim.fastq -S SRR453566.sam 5115482 reads; of these: 5115482 (100.00%) were paired; of these: 323268 (6.32%) aligned concordantly 0 times 4524989 (88.46%) aligned concordantly exactly 1 time 267225 (5.22%) aligned concordantly >1 times ---- 323268 pairs aligned concordantly 0 times; of these: 77115 (23.85%) aligned discordantly 1 time ---- 246153 pairs aligned 0 times concordantly or discordantly; of these: 492306 mates make up the pairs; of these: 341629 (69.39%) aligned 0 times 135729 (27.57%) aligned exactly 1 time 14948 (3.04%) aligned >1 times 96.66% overall alignment rate ¤Ç½ª¤ï¤ê¡£ ¤³¤³¤Ç¡¢samtools¤ò»È¤Ã¤Æsam¥Õ¥¡¥¤¥ë¤òbam¥Õ¥¡¥¤¥ë¤ËÊÑ´¹¤·¡¢¤«¤Ä¡¢position¤Çsort¤¹¤ë¡£ samtools/bcftools¤Î¥¤¥ó¥¹¥È¡¼¥ë¤Ï¡¢[[Samtools¤Î¥µ¥¤¥È]http://www.htslib.org/download/]]¤«¤é¥À¥¦¥ó¥í¡¼¥É¤·¡¢ ./configure make sudo make install ¤Ç´°Î»¡£ samtools view¤ò»È¤Ã¤Æsam¥Õ¥¡¥¤¥ëSRR453566.sam ¤òbam¥Õ¥¡¥¤¥ë SR453566.bam¤ËÊÑ´¹¡£@32¤Ï32¥¹¥ì¥Ã¥É»ÈÍÑ¡£ samtools view -@ 32 -b SRR453566.sam > SRR453566.bam Ëô¤Ï samtools view -@ 32 -b SRR453566.sam -o SRR453566.bam samtools sort¤ò»È¤Ã¤Æbam¥Õ¥¡¥¤¥ëSRR453566.bam ¤ò¥½¡¼¥È¤·¤ÆSRR453566.sorted.bam¤Ë½ÐÎÏ samtools sort -@ 32 SRR453566.bam > SRR453566.sorted.bam [bam_sort_core] merging from 0 files and 32 in-memory blocks... ***hisat2¤Î¥É¥¥å¥á¥ó¥È¤Ç¤Ï¸å½èÍý¤È¤·¤Æ¡¡samtools mpileup¤·¤Æbcftools view¤Çɽ¼¨ [#ie39cac8] samtools mpileup -uvf s288c.fna SRR453566.sorted.bam > SRR453566.vcf [warning] samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future. [warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future. [mpileup] 1 samples in 1 input files Ʊ¤¸¤³¤È¤òbcftools¤Ç¤Ï bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam > SRR453566.vcf ¤Ê¤ª¡¢¥Ñ¥é¥á¥¿ --threads 16 ¤Ê¤É¤Ï¼ÂºÝ¤Ë¤ÏÊÂÎ󲽤θú²Ì¤¬Ìµ¤«¤Ã¤¿¡£ [[samtools¤Çvariant call:https://qiita.com/motthy/items/38951f127dce26b22ce3]]¤Ç¤Ï bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | bcftools call -vmO z -o <study.vcf.gz> ¤È½ñ¤¤¤Æ¤¤¤ë¡£ [[bcftools¤Î¥Þ¥Ë¥å¥¢¥ë:https://samtools.github.io/bcftools/bcftools.html#call]]¤Ç¤Ïview¤ÎÂå¤ï¤ê¤Ëcall¤ò»È¤¨¤È¤·¤Æ¤ª¤ê¡¢¶ñÂÎŪ¤ÊÃê½Ð½èÍý¤ò¤É¤¦¤¹¤ë¤«Î㤬¤¤¤¯¤Ä¤«¤¢¤ë¡£ ¹¹¤Ë¡¢ HISAT2¤Ç¤Ï¡¡ bcftools view -bvcg SRR453566.vcf > SRR453566.raw.bcf¡¡ ¤È¤¹¤ë¤È¤·¤Æ¤¤¤ë¤¬¡¢¥Ñ¥é¥á¥¿-bvgf¤Ï¥¨¥é¡¼¡£ ²òÀâ¤ò¸«¤ë¤È¡¢ºÇ¶á¤Î¥Ð¡¼¥¸¥ç¥ó¤Ç¤Ïbcftools call¤Ë¤è¤Ã¤Æ¥³¡¼¥ë¤ò¼è½Ð¤¹¤é¤·¤¤¡£ bcftools call -O v -v -c SRR453566.vcf > SRR453566.raw.bcf ¤¬Àµ²ò¤ÎÌÏÍÍ¡£ÆÀ¤é¤ì¤¿·ë²Ì¤Ï¡ÊÀèÆ¬¥³¥á¥ó¥È¥é¥¤¥ó¤ò½ü¤¯¤È¡Ë #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR453566.sorted.bam NC_001133.9 137 . C CT 4.4191 . INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-37.5262 GT:PL 0/1:40,3,0 NC_001133.9 349 . C T 8.64911 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,0,1;MQ=60;FQ=-29.9908 GT:PL 1/1:38,3,0 NC_001133.9 11969 . C G 10.4247 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9905 GT:PL 1/1:40,3,0 NC_001133.9 11997 . T A 27.0206 . DP=2;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9901 GT:PL 1/1:57,3,0 NC_001133.9 12361 . G A 4.13164 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928 GT:PL 0/1:32,3,0 NC_001133.9 12373 . G A 4.13164 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928 GT:PL 0/1:32,3,0 ¤È¤¤¤Ã¤¿´¶¤¸¡£INFOÉôʬ¤Î²òÀâ¤Ï¥Ø¥Ã¥ÀÉôʬ¤Ë¤¢¤ëÄ̤ꡢ ##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)"> ##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)"> ##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric."> ##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)"> ##INFO=<ID=AF2,Number=1,Type=Float,Description="Max-likelihood estimate of the first and second group ALT allele frequency (assuming HWE)"> ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)"> ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads"> ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same"> ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias"> ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies"> ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3"> ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases"> [[¥Ð¥ê¥¢¥ó¥È¥³¡¼¥ë·ë²Ì¤ÎVCF¥Õ¥©¡¼¥Þ¥Ã¥È - mac¤Ç¥¤¥ó¥Õ¥©¥Þ¥Æ¥£¥¯¥¹:http://kazumaxneo.hatenablog.com/entry/2017/06/02/140208]] bcftools mpileup¤Ç¡¢Ê£¿ô¥µ¥ó¥×¥ë¤òʤ٤Ƥߤ롣[[¤³¤Î¥Ú¡¼¥¸:https://pepper.is.sci.toho-u.ac.jp/pepper/index.php?cmd=read&page=Python%A5%D0%A5%A4%A5%AA%2F%A5%C4%A1%BC%A5%EB%2FVariant-Calling&word=pyvcf]], [[bcftools¤Î¥Ú¡¼¥¸:https://samtools.github.io/bcftools/howtos/variant-calling.html]] bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam SRR453567.sorted.bam SRR453568.sorted.bam SRR453569.sorted.bam SRR453570.sorted.bam SRR453571.sorted.bam | bcftools call -Ou -mv -o SRR453566_mpileup.call.vcf ¹¹¤Ë·ë²Ì¤òÆÉ¤á¤ë¤è¤¦¤Ë¤¹¤ë¡£ bcftools view SRR453566_mpileup.call.vcf | vcf-annotate -f + > SRR453566_mpileup.call.annotate.vcf ¼¡¤Î¤è¤¦¤Ê·ë²Ì¤Ë¤Ê¤Ã¤¿¡£ #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR453566.sorted.bam SRR453567.sorted.bam SRR453568.sorted.bam SRR453569.sorted.bam SRR453570.sorted.bam SRR453571.sorted.bam NC_001133.9 136 . G A 68 SnpGap DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60 GT:PL 1/1:23,3,0 1/1:24,3,0 ./.:0,0,0 1/1:23,3,0 1/1:23,3,0 ./.:0,0,0 NC_001133.9 137 . C CT 135 PASS INDEL;IDV=1;IMF=1;DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60 GT:PL 1/1:40,3,0 1/1:40,3,0 ./.:0,0,0 1/1:40,3,0 1/1:40,3,0 ./.:0,0,0 NC_001133.9 286 . A T 13.4171 MinAB;MinDP DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60 GT:PL ./.:0,0,0 ./.:0,0,0 1/1:38,3,0 ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 NC_001133.9 305 . C G 10.6101 MinAB;MinDP DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60 GT:PL ./.:0,0,0 ./.:0,0,0 1/1:35,3,0 ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 NC_001133.9 349 . C T 88 PASS DP=3;VDB=0.102722;SGB=-0.471632;MQ0F=0;AC=6;AN=6;DP4=0,0,0,3;MQ=60 GT:PL 1/1:38,3,0 ./.:0,0,0 1/1:35,3,0 ./.:0,0,0 ./.:0,0,0 1/1:40,3,0 NC_001133.9 373 . T C 9.69791 Qual;MinAB DP=2;SGB=-0.045183;RPB=1;MQB=1;BQB=1;MQ0F=0;ICB=0.5;HOB=0.5;AC=2;AN=4;DP4=0,1,0,1;MQ=60 GT:PL 0/1:0,3,41 ./.:0,0,0 0/1:40,3,0 ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 NC_001133.9 457 . CAAA CAA 7.96917 Qual;MinAB;MinDP INDEL;IDV=1;IMF=1;DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=60 GT:PL ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 ./.:0,0,0 1/1:32,3,0 ./.:0,0,0 [[¥µ¥ó¥×¥ë´Ö¤Ç¶¦Ä̤¹¤ëÊѰۤȸÇͤÎÊѰۤòÃê½Ð¤¹¤ë - mac¤Ç¥¤¥ó¥Õ¥©¥Þ¥Æ¥£¥¯¥¹:http://kazumaxneo.hatenablog.com/entry/2017/06/06/165329]] ¤½¤ì¤Ç¡¢bam¥Õ¥¡¥¤¥ë¤ä¡¢callÃê½Ð¤·¤¿¸å¤Îvcf¥Õ¥¡¥¤¥ë¤Ë¡¢¥¤¥ó¥Ç¥Ã¥¯¥¹ÉÕ¤±¤¬É¬Íפˤʤ뤳¤È¤¬¤¢¤ë¡£¤¿¤È¤¨¤ÐIGV¤Çɽ¼¨¤¹¤ë¤È¤¤Ê¤É¡£ ¥¤¥ó¥Ç¥Ã¥¯¥¹ÉÕ¤±¤Ïbcftools¤Ç²Äǽ¤À¤¬¡¢ÆþÎϤ¿¤ëvcf¥Õ¥¡¥¤¥ë¤¬bgzip¤µ¤ì¤Æ¤¤¤Ê¤±¤ì¤Ð¤Ê¤é¤Ê¤¤¡£gdzip¼«ÂΤÏsamtools¤Îhtslib¤ÎÃæ¤ËÆþ¤Ã¤Æ¤¤¤Æ¡¢samtools¤òmake; make install¤·¤¿¤À¤±¤Ç¤ÏÆþ¤é¤Ê¤¤¡£make all all-htslib; make install install-htslib¡¡¤Î¤è¤¦¤ËÄɲÃmake¤¹¤ëɬÍפ¬¤¢¤ë¡£ bgzip -c SRR453566.raw.bcf > SRR453566.raw.bcf.gz bcftools index SRR453566.raw.bcf.gz ¤³¤Î¾õÂÖ¤ÇIGV¤Ë¿©¤ï¤»¤ë¤³¤È¤¬¤Ç¤¤ë¡£ import igv # tracks¤ÇVCF¤ò»ØÄꤷ¤Æ¤ß¤ë ref = { "id": "Saccha", "name": "Saccha", "fastaURL": "files/s288c.fna", "indexURL": "files/s288c.fna.fai", "tracks": [ { "name": "My GFF", "type": "annotation", "format": "gff3", #"displayMode": "EXPANDED", "url": "files/s288c_e.gff", "indexed": False, }, { "name": "VCF", "type": "variant", "format": "vcf", "displayMode": "EXPANDED", "url": "files/SRR453566.raw.bcf.gz", "indexed": True, } ] } b = igv.Browser({"reference": ref}) b.show() ***°äÅÁ¸¦¹Ö½¬²ñ»ñÎÁ¤Ç¤Ï¸å½èÍý¤È¤·¤ÆFeature Count¤Ç¿ô¤¨¤ë [#qaa4366c] Feature Count¤ÎÁ°½èÍý¤È¤·¤Æ¡¢GeneID¤òÉÕÍ¿¤¹¤ë¡£ °äÅÁ¸¦¤ÎÄ󶡤¹¤ëpython¥×¥í¥°¥é¥à¡¡add_gene_id¡¡¤Ç½èÍý¡£¤½¤Î¤¿¤á¤Ë¡¢¥Ñ¥Ã¥±¡¼¥¸bcbio-gff¤¬É¬ÍפʤΤǡ¢pip¤Ç¥¤¥ó¥¹¥È¡¼¥ë pip install bcbio-gff ¤³¤ì¤Ç¡¢add_gene_id¤ò²ÔƯ¤Ç¤¤ë¡£ python add_gene_id s288c.gff s288c_e.gff ÆÀ¤é¤ì¤¿s288c.e.gff¤Ë¤Ïgene_id¤¬Æþ¤Ã¤Æ¤¤¤ë¤Ï¤º¡£ ¤³¤ì¤òFeature Count¤¹¤ë¡£[[biopapyrus¤ÎÀâÌÀ:https://bi.biopapyrus.jp/rnaseq/expression/featurecounts.html]]¡¡¤ò°úÍѤ¹¤ë¤È¡§ >featureCounts ¤Ï¥Þ¥Ã¥Ô¥ó¥°·ë²Ì¤Ç¤¢¤ë BAM ¤Þ¤¿¤Ï SAM ¥Õ¥¡¥¤¥ë¤òÆþÎϤȤ·¡¢³Æ feature Îΰè¡Êgene¡¢CDS¡¢exon¡Ë¤´¤È¤Ë¥ê¡¼¥É¤ò¥«¥¦¥ó¥È¤¹¤ë¥×¥í¥°¥é¥à¤Ç¤¢¤ë¡£featureCounts ¤Ï Subread ¥Ñ¥Ã¥±¡¼¥¸¤Ë´Þ¤Þ¤ì¤ë¥×¥í¥°¥é¥à¤Î°ì¤Ä¤Ç¤¢¤ë¡£featureCounts ¤òÍøÍѤ¹¤ë¤Ë¤Ï subread ¤ò¥¤¥ó¥¹¥È¡¼¥ë¤¹¤ëɬÍפ¬¤¢¤ë¡£ subread ¤Î¥½¡¼¥¹¥³¡¼¥É¤È¥³¥ó¥Ñ¥¤¥ë¤µ¤ì¤¿¥×¥í¥°¥é¥à¤Ï sourceforge ¤Ç¸ø³«¤µ¤ì¤Æ¤¤¤ë¡£É¬Íפ˱þ¤¸¤Æ¤É¤Á¤é¤«¤ò¥À¥¦¥ó¥í¡¼¥É¤·¤Æ»ÈÍѤ¹¤ë¡£ ¤È¤¤¤¦¤³¤È¤Ç¡¢[[subread:http://subread.sourceforge.net/]]¤Î¥Ú¡¼¥¸¤«¤é¥À¥¦¥ó¥í¡¼¥É¡£ featureCounts -p -T 8 -t exon -g gene_id -a $GFF \ -o ../featurecount/counts.txt \ SRR453566.sorted.bam \ SRR453567.sorted.bam \ SRR453568.sorted.bam \ SRR453569.sorted.bam \ SRR453570.sorted.bam \ SRR453571.sorted.bam $GFF¤Ï»²¾È¥·¡¼¥±¥ó¥¹¤ÎGene¾ðÊó¤ò´Þ¤à¥Õ¥¡¥¤¥ë¡£[[featureCount (Subread) ¤Î¥Þ¥Ë¥å¥¢¥ë:http://bioinf.wehi.edu.au/featureCounts/]]¤Ç¤Ï The annotation file should be in either GTF format or a simplified annotation format (SAF) as shown below (columns are tab-delimited): GeneID Chr Start End Strand 497097 chr1 3204563 3207049 - 497097 chr1 3411783 3411982 - 497097 chr1 3660633 3661579 - ... ¤È¤·¤Æ¤¢¤ê¡¢GTF¥Õ¥©¡¼¥Þ¥Ã¥È¤Ï[[Frequently Asked Questions: Data File Formats:https://genome.ucsc.edu/FAQ/FAQformat.html#format4]]¤ª¤è¤Ó[[GTF2.2: A Gene Annotation Format (Revised Ensembl GTF):http://mblab.wustl.edu/GTF22.html]]¤Ë¤è¤ë¤È GTF stands for Gene transfer format. It borrows from GFF, but has additional structure that warrants a separate definition and format name. Structure is as GFF, so the fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments] Here is a simple example with 3 translated exons. Order of rows is not important. 381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1"; 381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1"; 381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1"; 381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1"; 381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1"; The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space. ¹¹¤Ë¡¢GFF¥Õ¥©¡¼¥Þ¥Ã¥È¤Ï[[Frequently Asked Questions: Data File Formats:https://genome.ucsc.edu/FAQ/FAQformat.html#format3]]¤Ë¤è¤ë¤È GFF (General Feature Format) lines are based on the Sanger GFF2 specification. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. For more information on GFF format, refer to Sanger's GFF page. ¤Þ¤¿¡¢¾åµ¤Ç»²¾È¤·¤Æ¤¤¤ëSanger¤ÎGFF2/3¤ÎÀâÌÀ¤Ï¡¢[[GFF2:http://gmod.org/wiki/GFF2]]¤È[[GFF3:http://gmod.org/wiki/GFF3]]¤Ë¤¢¤ë¡£GFF2¤Ïobsolete¤È¤µ¤ì¡¢GFF3¤¬recommend¤µ¤ì¤Æ¤¤¤ë¡£¹¹¤Ë¡¢GFF3¤Ç¤Ï¥Õ¥¡¥¤¥ë¤ÎºÇ¸å¤Ë¥ª¥×¥·¥ç¥ó¤Çsequence¤¬½ñ¤±¤ë¡£ ưºî¤¹¤ë¤³¤È¤Î¥Æ¥¹¥È¤È¤·¤Æ¡¢SRR453566.sorted.bam¤Î£±¤Ä¤À¤±¤ò½èÍý¤·¤Æ¤ß¤ë¡£ featureCounts -p -T 32 -t exon -g gene_id -a s288c_e.gff -o SRR435366.counts.txt SRR453566.sorted.bam ========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.6.3 //========================== featureCounts setting ===========================\\ || || || Input files : 1 BAM file || || P SRR453566.sorted.bam || || || || Output file : SRR435366.counts.txt || || Summary : SRR435366.counts.txt.summary || || Annotation : s288c_e.gff (GTF) || || Dir for temp files : ./ || || || || Threads : 32 || || Level : meta-feature level || || Paired-end : yes || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || Min overlapping bases : 1 || || || || Chimeric reads : counted || || Both ends mapped : not required || || || \\===================== http://subread.sourceforge.net/ ======================// //================================= Running ==================================\\ || || || Load annotation file s288c_e.gff ... || || Features : 6761 || || Meta-features : 6420 || || Chromosomes/contigs : 17 || || || || Process BAM file SRR453566.sorted.bam... || || Paired-end reads are included. || || Assign alignments (paired-end) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Pairing up the read pairs. || || || || Total alignments : 5650830 || || Successfully assigned alignments : 4568783 (80.9%) || || Running time : 0.03 minutes || || || || || || Summary of counting results can be found in file "SRR435366.counts.txt.su || || mmary" || || || \\===================== http://subread.sourceforge.net/ ======================// ½ÐÎϤϡÊÀèÆ¬¡Ë # Program:featureCounts v1.6.3; Command:"featureCounts" "-p" "-T" "32" "-t" "exon" "-g" "gene_id" "-a" "s288c_e.gff" "-o" "SRR435366.counts.txt" "SRR453566.sorted.bam" Geneid Chr Start End Strand Length SRR453566.sorted.bam gene_0001 NC_001133.9 1807 2169 - 363 0 gene_0002 NC_001133.9 2480 2707 + 228 0 gene_0003 NC_001133.9 7235 9016 - 1782 0 gene_0004 NC_001133.9 11565 11951 - 387 0 gene_0005 NC_001133.9 12046 12426 + 381 2 gene_0006 NC_001133.9 13363 13743 - 381 0 gene_0007 NC_001133.9 21566 21850 + 285 0 gene_0008 NC_001133.9 22395 22685 - 291 0 gene_0009 NC_001133.9 24000 27968 - 3969 32 gene_0010 NC_001133.9 31567 32940 + 1374 65 gene_0011 NC_001133.9 33448 34701 + 1254 119 gene_0012 NC_001133.9 35155 36303 + 1149 1155 ¡ä¡ä°äÅÁ¸¦¥»¥ß¥Ê¡¼¤Î¤³¤ÎÀè¤Î½èÍý¤Ï[[RNAseq£-0¡Ê°äÅÁ¸¦¥»¥ß¥Ê¡¼¤Ç¤Î¡Ëȯ¸½²òÀÏ¡Á¥«¥¦¥ó¥È¥Ç¡¼¥¿¤Î²òÀÏ>Python¥Ð¥¤¥ª/¥Ä¡¼¥ë/RNAseq£-0¡Ê°äÅÁ¸¦¥»¥ß¥Ê¡¼¤Ç¤Î¡Ëȯ¸½²òÀÏ¡Á¥«¥¦¥ó¥È¥Ç¡¼¥¿¤Î²òÀÏ]] //================================= ***Ê̤λñÎÁ¤Ë¤è¤ë¤È¡¢¥Ñ¥¤¥×¥é¥¤¥ó¤È¤·¤Æ¡¢HISAT2¢ÍStringTie¢Í [#e7486864] ***Ê̤λñÎÁ¤Ë¤è¤ë¤È¡¢¥Ñ¥¤¥×¥é¥¤¥ó¤È¤·¤Æ¡¢HISAT2¢ÍStringTie¢Í Ballgawn [#e7486864] Johns Hopkins¤«¤éÏÀʸ [[Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols volume 11, pages 1650–1667 (2016):https://www.nature.com/articles/nprot.2016.095]] - [[StringTie:https://ccb.jhu.edu/software/stringtie/index.shtml]] is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments longer sequences that have been assembled from those reads.In order to identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software like Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.). -¥Þ¥Ë¥å¥¢¥ë¤Ï[[manual:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual]] HISAT2¤Î½ÐÎϤò»È¤¦¾ì¹ç¡¢be sure to run HISAT2 with the --dta option for alignmen~ -StringTie¤ÎHP¤Ç¾È²ñ¤µ¤ì¤Æ¤¤¤ë[[gffcompare:https://ccb.jhu.edu/software/stringtie/gff.shtml#gffcompare]]¤Ï¡¢~ The program gffcompare can be used to compare, merge, annotate and estimate accuracy of one or more GFF files (the "query" files), when compared with a reference annotation (also provided as GFF/GTF). - [[NGS¥Ç¡¼¥¿¤«¤é¿·¤¿¤ÊÃ챤òƳ½Ð¤¹¤ë¤¿¤á¤Î¥Ç¡¼¥¿²òÀÏ¥ê¥Æ¥é¥·¡¼ Åý¹çTV(togotv)¡ÃÀ¸Ì¿²Ê³Ø·ÏDB¡¦¥Ä¡¼¥ë»È¤¤Åݤ··Ï¥Á¥ã¥ó¥Í¥ë:https://togotv.dbcls.jp/ajacs2018008.html]]¤Î¡Ö¥Ç¡¼¥¿²òÀϡפιब»²¹Í¤Ë¤Ê¤ê¤½¤¦¡£ -Cufflinks²ó¤ê ¡Á °ÊÁ°¤Ïcufflinks¤ò»È¤Ã¤Æ¤¤¤¿¤é¤·¤¤ --[[¼¡À¤Â奷¡¼¥¯¥¨¥ó¥µ¡¼DRY²òÀ϶µËܤˤè¤ë°äÅÁ»Òȯ¸½²òÀÏ〜Cufflinks¡¢cummeRbund¡¢DAVID〜@AJACSa»°Åç2 Åý¹çTV(togotv)¡ÃÀ¸Ì¿²Ê³Ø·ÏDB¡¦¥Ä¡¼¥ë»È¤¤Åݤ··Ï¥Á¥ã¥ó¥Í¥ë:https://togotv.dbcls.jp/ajacs2016010.html]]¤Ïcufflinks¤ò»È¤Ã¤¿²òÀâ --[[Cufflinks | RNA-seq ¤òÍøÍѤ·¤¿°äÅÁ»Òȯ¸½²òÀÏ:https://bi.biopapyrus.jp/rnaseq/expression/cufflinks.html]] --¤½¤Î¼êÁ°¤Î³µÏÀ¡¡[[ȯ¸½Î̲òÀÏ | RNA-Seq ¤òÍøÍѤ·¤¿È¯¸½ÊÑÆ°°äÅÁ»Ò¤Î¸¡½Ð:https://bi.biopapyrus.jp/rnaseq/analysis/]] -¡Ê[[¥³¥ó¥»¥ó¥µ¥¹ÇÛÎó¤ÎºîÀ® | VCF ¤ÎÊѰ۾ðÊó¤Ë´ð¤Å¤¤¤Æ¥ê¥Õ¥¡¥ì¥ó¥¹¤Î±ö´ðÇÛÎó¤òÃÖ´¹¤·¥³¥ó¥»¥ó¥µ¥¹ÇÛÎó¤òºîÀ®ÊýË¡:https://bi.biopapyrus.jp/gwas/vcf-consensus-fasta.html]]¡Ë ¤½¤ì¤Ç¡¢HISAT2¤Ç¥ê¡¼¥É¤ò¥¢¥»¥ó¥Ö¥ë¤·¤¿·ë²Ì¤Îbam¥Õ¥¡¥¤¥ë¤òStringTie¤Ç¥¢¥»¥ó¥Ö¥ë¤¹¤ë¡£ StringTie¤Î¥¤¥ó¥¹¥È¡¼¥ë¤Ï¡¢[[¥Þ¥Ë¥å¥¢¥ë:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual]]Ä̤ꡣ¥½¡¼¥¹¤ò¥À¥¦¥ó¥í¡¼¥É¤·¡¢make¡£¥¤¥ó¥¹¥È¡¼¥ë¤Ï¼«Ê¬¤Ç/usr/local/bin/stringtie¤«¤é¥ê¥ó¥¯¤òÄ¥¤ì¤Ð½ª¤ï¤ê¡£ ½èÍý¤Ï¡¢¤¢¤é¤«¤¸¤áHISAT2¤Çsorted.bam¤¬¤Ç¤¤Æ¤¤¤ë¤È¤·¤Æ¡¢¡ÊHISAT2¤Ç--dta¤ò»ØÄꤹ¤ë¤³¤È¡Ë stringtie -B -p 16 -o SRR453566.gtf -G s288c_e.gff SRR453566.sorted.bam ¤³¤³¤Ç¡¢-B ¤Ï½ÐÎϤòBallgown¤Ç»È¤¦¤¿¤á¤ÎÀßÄê¡¢-G s288c_e.gff ¤Ïreference annotation file (in GTF or GFF3 format)¡¢-p 16¤Ï¥¹¥ì¥Ã¥É¿ô¤Ç¡¢½ÐÎϤÏ-o SRR453566.gtf ¤Ç»ØÄê¡£ -B»ØÄê¤ò¤·¤¿¤È¤¤Î½ÐÎϤϡ¢SRR453566.gtf¤Ë²Ã¤¨¤Æ¡¢ e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab ¤Ç¤¢¤ê¡¢¤³¤ì¤òBallgawn¤Ë¿©¤ï¤»¤ëɬÍפ¬¤¢¤ë¡£ ¤³¤Îstringtie¤Î½ÐÎϤò¡¢Ballgawn¤ÇÀ°Íý¤·¤Æ²Ä»ë²½¤¹¤ë¡£~ A program for computing differentially expressed genes in two or more RNA-seq experiments, using the output of StringTie or Cufflinks. The Ballgown package provides functions to organize, visualize, and analyze expression measurements. Ballgown is written in R and is part of Bioconductor. [[GitHub - alyssafrazee/ballgown: Bioconductor package "ballgown":https://github.com/alyssafrazee/ballgown]] [[HISAT-StringTie-Ballgown ¤ò»î¤·¤Æ¤ß¤è¤¦ - Palmsonntagmorgen:http://rnakato.hatenablog.jp/entry/2018/11/09/165200]]¡¡¤³¤ì¤Ç¤ÏÏÀʸ¤Ç¤Ïoptional¤È¤Ê¤Ã¤Æ¤¤¤Þ¤¹¤¬¡¢gffcompare¤ò»È¤Ã¤Æ¡¢À¸À®¤µ¤ì¤¿°äÅÁ»Ò¥ê¥¹¥È¤Èreference¤Î°äÅÁ»Ò¥ê¥¹¥È¤òÈæ³Ó¤¹¤ë¤³¤È¤¬¤Ç¤¤Þ¤¹¡£ ³¤¤È¤·¤Æ[[HISAT-StringTie-Ballgown ¤ò»î¤·¤Æ¤ß¤è¤¦ ¤½¤Î2 - Palmsonntagmorgen:http://rnakato.hatenablog.jp/entry/2018/11/26/145847]]