[[Pythonバイオ]] [[Pythonバイオ/ツール]]~
&counter();   &lastmod();~

*リードの参照配列へのマッピング [#mbaaa438]

Bowtie2最新版をインストール(アップデート)~
ダウンロードしてunzipしてmake

使い方の例 [[biopapyrus bowtie2:https://bi.biopapyrus.jp/rnaseq/mapping/bowtie2/]]、マニュアル [[http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example]]

 bowtie2-build --threads 16 GCF_000001405.38_GRCh38.p12_genomic.fna HG38.ix

 Settings:
   Output files: "HG38.ix.*.bt2"
   Line rate: 6 (line is 64 bytes)
   Lines per side: 1 (side is 64 bytes)
   Offset rate: 4 (one in 16)
   FTable chars: 10
   Strings: unpacked
   Max bucket size: default
   Max bucket size, sqrt multiplier: default
   Max bucket size, len divisor: 4
   Difference-cover sample period: 1024
   Endianness: little
   Actual local endianness: little
   Sanity checking: disabled
   Assertions: disabled
   Random seed: 0
   Sizeofs: void*:8, int:4, long:8, size_t:8
 Input files DNA, FASTA:
   GCF_000001405.38_GRCh38.p12_genomic.fna
 Building a SMALL index
 Reading reference sizes
   Time reading reference sizes: 00:00:20
 Calculating joined length
 Writing header
 Reserving space for joined string
 Joining reference sequences
   Time to join reference sequences: 00:00:20
 bmax according to bmaxDivN setting: 773987710
 Using parameters --bmax 580490783 --dcv 1024
   Doing ahead-of-time memory usage test
   Passed!  Constructing with these parameters: --bmax 580490783 --dcv 1024
 Constructing suffix-array element generator
 Building DifferenceCoverSample
   Building sPrime
   Building sPrimeOrder
   V-Sorting samples
   V-Sorting samples time: 00:01:24
   Allocating rank array
   Ranking v-sort output
   Ranking v-sort output time: 00:00:17
   Invoking Larsson-Sadakane on ranks
   Invoking Larsson-Sadakane on ranks time: 00:00:30
   Sanity-checking and returning
 Building samples
 Reserving space for 12 sample suffixes
 Generating random suffixes
 QSorting 12 sample offsets, eliminating duplicates
 QSorting sample offsets, eliminating duplicates time: 00:00:00
 Multikey QSorting 12 samples
   (Using difference cover)
   Multikey QSorting samples time: 00:00:00
 Calculating bucket sizes
 Splitting and merging
   Splitting and merging time: 00:00:00
 Avg bucket size: 3.09595e+09 (target: 580490782)
 Converting suffix-array elements to index image
 Allocating ftab, absorbFtab
 Entering Ebwt loop
 Getting block 1 of 1
   No samples; assembling all-inclusive block
   Sorting block of length 3095950843 for bucket 1
   (Using difference cover)

次に、samtoolsでまとめる。並列指定は-@ スレッド数
 samtools view -@ 32 -bS SRR002322.sam > SRR002322.bam

続いて、ソート。並列指定は-@ スレッド
 samtools sort SRR002322.bam -o SRR002322.sorted.bam -@ 32

最後にmpileup これは並列指定はできないらしい
 samtools mpileup -uf GCF_000001405.38_GRCh38.p12_genomic.fna SRR002322.sorted.bam | bcftools view -Ov - > SR002322.raw.bcf



なお、mpileupも並列実行できるパッケージがあったので、いずれテスト。~
[[Sambamba 0.6.8:https://github.com/biod/sambamba/releases]]

*遺伝研講習会での例 [#c85616f7]
[[「先進ゲノム支援」情報解析講習会のご案内:https://www.genome-sci.jp/whatsnew/event/news20180920.html]]~
⇒[[情報解析講習会ビデオ<2018年度 情報解析講習会(中級者向け)>:https://www.genome-sci.jp/lecture20181st]]~
⇒[[資料(GitHub):https://github.com/genome-sci/python_bioinfo_2018]]

遺伝研講習会の例では、「TopHat2の後継である」HISAT2を使っている。<-- [[アメリエフのブログ HISAT2:http://staffblog.amelieff.jp/entry/2016/07/01/144419]]

[[HISAT2:https://ccb.jhu.edu/software/hisat2/index.shtml]] に曰く

>HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome).  Based on an extension of BWT for graphs ...

 http://ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-source.zip
 unzip
 make

バイナリはmakeしたディレクトリ中にできるので、適宜リンクを張ること。

hisat2の処理は、インデックスを作る(hisat2-build) ⇒ hisat2でマップ の順。

 hisat2-build s288c.fna s288c-build.fna
 Settings:
   Output files: "s288c-build.fna.*.ht2"
   Line rate: 6 (line is 64 bytes)
   Lines per side: 1 (side is 64 bytes)
   Offset rate: 4 (one in 16)
   FTable chars: 10
   Strings: unpacked
   Local offset rate: 3 (one in 8)
   Local fTable chars: 6
   Local sequence length: 57344
   Local sequence overlap between two consecutive indexes: 1024
   Endianness: little
   Actual local endianness: little
   Sanity checking: disabled
   Assertions: disabled
   Random seed: 0
   Sizeofs: void*:8, int:4, long:8, size_t:8
 Input files DNA, FASTA:
   s288c.fna
 Reading reference sizes
   Time reading reference sizes: 00:00:00
 Calculating joined length
 Writing header
 Reserving space for joined string
 Joining reference sequences
   Time to join reference sequences: 00:00:00
   Time to read SNPs and splice sites: 00:00:00
 Using parameters --bmax 2279457 --dcv 1024
   Doing ahead-of-time memory usage test
   Passed!  Constructing with these parameters: --bmax 2279457 --dcv 1024
 Constructing suffix-array element generator
 Building DifferenceCoverSample
   Building sPrime
   Building sPrimeOrder
   V-Sorting samples
   V-Sorting samples time: 00:00:01
   Allocating rank array
   Ranking v-sort output
   Ranking v-sort output time: 00:00:00
   Invoking Larsson-Sadakane on ranks
   Invoking Larsson-Sadakane on ranks time: 00:00:00
   Sanity-checking and returning
 Building samples
 Reserving space for 12 sample suffixes
 Generating random suffixes
 QSorting 12 sample offsets, eliminating duplicates
 QSorting sample offsets, eliminating duplicates time: 00:00:00
 Multikey QSorting 12 samples
   (Using difference cover)
   Multikey QSorting samples time: 00:00:00
 Calculating bucket sizes
 Splitting and merging
   Splitting and merging time: 00:00:00
 Avg bucket size: 1.73673e+06 (target: 2279456)
 Converting suffix-array elements to index image
 Allocating ftab, absorbFtab
 Entering GFM loop
 Getting block 1 of 7
   Reserving size (2279457) for bucket 1
   Calculating Z arrays for bucket 1
   Entering block accumulator loop for bucket 1:
   bucket 1: 10%
   bucket 1: 20%
   bucket 1: 30%
   bucket 1: 40%
   bucket 1: 50%
   bucket 1: 60%
   bucket 1: 70%
   bucket 1: 80%
   bucket 1: 90%
   bucket 1: 100%
   Sorting block of length 1917483 for bucket 1
   (Using difference cover)
   Sorting block time: 00:00:01
 Returning block of 1917484 for bucket 1
 Getting block 2 of 7
   Reserving size (2279457) for bucket 2
   Calculating Z arrays for bucket 2
   Entering block accumulator loop for bucket 2:
   bucket 2: 10%
   bucket 2: 20%
   bucket 2: 30%
   bucket 2: 40%
   bucket 2: 50%
   bucket 2: 60%
   bucket 2: 70%
   bucket 2: 80%
   bucket 2: 90%
   bucket 2: 100%
   Sorting block of length 1565107 for bucket 2
   (Using difference cover)
   Sorting block time: 00:00:00
 Returning block of 1565108 for bucket 2
 Getting block 3 of 7
   Reserving size (2279457) for bucket 3
   Calculating Z arrays for bucket 3
   Entering block accumulator loop for bucket 3:
   bucket 3: 10%
   bucket 3: 20%
   bucket 3: 30%
   bucket 3: 40%
   bucket 3: 50%
   bucket 3: 60%
   bucket 3: 70%
   bucket 3: 80%
   bucket 3: 90%
   bucket 3: 100%
   Sorting block of length 2077071 for bucket 3
   (Using difference cover)
   Sorting block time: 00:00:01
 Returning block of 2077072 for bucket 3
 Getting block 4 of 7
   Reserving size (2279457) for bucket 4
   Calculating Z arrays for bucket 4
   Entering block accumulator loop for bucket 4:
   bucket 4: 10%
   bucket 4: 20%
   bucket 4: 30%
   bucket 4: 40%
   bucket 4: 50%
   bucket 4: 60%
   bucket 4: 70%
   bucket 4: 80%
   bucket 4: 90%
   bucket 4: 100%
   Sorting block of length 1959963 for bucket 4
   (Using difference cover)
   Sorting block time: 00:00:00
 Returning block of 1959964 for bucket 4
 Getting block 5 of 7
   Reserving size (2279457) for bucket 5
   Calculating Z arrays for bucket 5
   Entering block accumulator loop for bucket 5:
   bucket 5: 10%
   bucket 5: 20%
   bucket 5: 30%
   bucket 5: 40%
   bucket 5: 50%
   bucket 5: 60%
   bucket 5: 70%
   bucket 5: 80%
   bucket 5: 90%
   bucket 5: 100%
   Sorting block of length 1858740 for bucket 5
   (Using difference cover)
   Sorting block time: 00:00:01
 Returning block of 1858741 for bucket 5
 Getting block 6 of 7
   Reserving size (2279457) for bucket 6
   Calculating Z arrays for bucket 6
   Entering block accumulator loop for bucket 6:
   bucket 6: 10%
   bucket 6: 20%
   bucket 6: 30%
   bucket 6: 40%
   bucket 6: 50%
   bucket 6: 60%
   bucket 6: 70%
   bucket 6: 80%
   bucket 6: 90%
   bucket 6: 100%
   Sorting block of length 630502 for bucket 6
   (Using difference cover)
   Sorting block time: 00:00:00
 Returning block of 630503 for bucket 6
 Getting block 7 of 7
   Reserving size (2279457) for bucket 7
   Calculating Z arrays for bucket 7
   Entering block accumulator loop for bucket 7:
   bucket 7: 10%
   bucket 7: 20%
   bucket 7: 30%
   bucket 7: 40%
   bucket 7: 50%
   bucket 7: 60%
   bucket 7: 70%
   bucket 7: 80%
  bucket 7: 90%
  bucket 7: 100%
  Sorting block of length 2148233 for bucket 7
  (Using difference cover)
  Sorting block time: 00:00:01
 Returning block of 2148234 for bucket 7
 Exited GFM loop
 fchr[A]: 0
 fchr[C]: 3766349
 fchr[G]: 6086925
 fchr[T]: 8404025
 fchr[$]: 12157105
 Exiting GFM::buildToDisk()
 Returning from initFromVector
 Wrote 8248454 bytes to primary GFM file: s288c-build.fna.1.ht2
 Wrote 3039284 bytes to secondary GFM file: s288c-build.fna.2.ht2
 Re-opening _in1 and _in2 as input streams
 Returning from GFM constructor
 Returning from initFromVector
 Wrote 5399069 bytes to primary GFM file: s288c-build.fna.5.ht2
 Wrote 3092708 bytes to secondary GFM file: s288c-build.fna.6.ht2
 Re-opening _in5 and _in5 as input streams
 Returning from HierEbwt constructor
 Headers:
     len: 12157105
     gbwtLen: 12157106
     nodes: 12157106
     sz: 3039277
     gbwtSz: 3039277
     lineRate: 6
     offRate: 4
     offMask: 0xfffffff0
     ftabChars: 10
     eftabLen: 0
     eftabSz: 0
     ftabLen: 1048577
     ftabSz: 4194308
     offsLen: 759820
     offsSz: 3039280
     lineSz: 64
     sideSz: 64
     sideGbwtSz: 48
     sideGbwtLen: 192
     numSides: 63319
     numLines: 63319
     gbwtTotLen: 4052416
     gbwtTotSz: 4052416
     reverse: 0
     linearFM: Yes
 Total time for call to driver() for forward index: 00:00:06
準備終わり。

マッピングは
 hisat2 -p 32 --dta-cufflinks -x s288c-build.fna -1 paired_SRR453566_1.trim.fastq -2 paired_SRR453566_2.trim.fastq -S SRR453566.sam
 5115482 reads; of these:
   5115482 (100.00%) were paired; of these:
     323268 (6.32%) aligned concordantly 0 times
     4524989 (88.46%) aligned concordantly exactly 1 time
     267225 (5.22%) aligned concordantly >1 times
     ----
     323268 pairs aligned concordantly 0 times; of these:
       77115 (23.85%) aligned discordantly 1 time
     ----
     246153 pairs aligned 0 times concordantly or discordantly; of these:
       492306 mates make up the pairs; of these:
         341629 (69.39%) aligned 0 times
         135729 (27.57%) aligned exactly 1 time
         14948 (3.04%) aligned >1 times
 96.66% overall alignment rate
で終わり。

ここで、samtoolsを使ってsamファイルをbamファイルに変換し、かつ、positionでsortする。

samtools/bcftoolsのインストールは、[[Samtoolsのサイト]http://www.htslib.org/download/]]からダウンロードし、
 ./configure
 make
 sudo make install
で完了。

samtools viewを使ってsamファイルSRR453566.sam をbamファイル SR453566.bamに変換。@32は32スレッド使用。
 samtools view -@ 32 -b SRR453566.sam > SRR453566.bam
 又は samtools view -@ 32 -b SRR453566.sam -o SRR453566.bam

samtools sortを使ってbamファイルSRR453566.bam をソートしてSRR453566.sorted.bamに出力
 samtools sort -@ 32 SRR453566.bam > SRR453566.sorted.bam 
    [bam_sort_core] merging from 0 files and 32 in-memory blocks...

***hisat2のドキュメントでは後処理として samtools mpileupしてbcftools viewで表示 [#ie39cac8]

 samtools mpileup -uvf s288c.fna SRR453566.sorted.bam > SRR453566.vcf
 [warning] samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future.
 [warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
 [mpileup] 1 samples in 1 input files

同じことをbcftoolsでは
 bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam > SRR453566.vcf
なお、パラメタ --threads 16 などは実際には並列化の効果が無かった。

[[samtoolsでvariant call:https://qiita.com/motthy/items/38951f127dce26b22ce3]]では
 bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | bcftools call -vmO z -o <study.vcf.gz>
と書いている。

[[bcftoolsのマニュアル:https://samtools.github.io/bcftools/bcftools.html#call]]ではviewの代わりにcallを使えとしており、具体的な抽出処理をどうするか例がいくつかある。

更に、 HISAT2では 
 bcftools view -bvcg SRR453566.vcf > SRR453566.raw.bcf 
とするとしているが、パラメタ-bvgfはエラー。

解説を見ると、最近のバージョンではbcftools callによってコールを取出すらしい。
 bcftools call -O v -v -c SRR453566.vcf > SRR453566.raw.bcf
が正解の模様。得られた結果は(先頭コメントラインを除くと)
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SRR453566.sorted.bam
 NC_001133.9     137     .       C       CT      4.4191  .       INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-37.5262   GT:PL   0/1:40,3,0
 NC_001133.9     349     .       C       T       8.64911 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,0,1;MQ=60;FQ=-29.9908     GT:PL   1/1:38,3,0
 NC_001133.9     11969   .       C       G       10.4247 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9905     GT:PL   1/1:40,3,0
 NC_001133.9     11997   .       T       A       27.0206 .       DP=2;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9901     GT:PL   1/1:57,3,0
 NC_001133.9     12361   .       G       A       4.13164 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928     GT:PL   0/1:32,3,0
 NC_001133.9     12373   .       G       A       4.13164 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928     GT:PL   0/1:32,3,0
といった感じ。INFO部分の解説はヘッダ部分にある通り、
 ##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
 ##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
 ##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
 ##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
 ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
 ##INFO=<ID=AF2,Number=1,Type=Float,Description="Max-likelihood estimate of the first and second group ALT allele frequency (assuming HWE)">
 ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
 ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
 ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
 ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
 ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
 ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
 ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">

[[バリアントコール結果のVCFフォーマット - macでインフォマティクス:http://kazumaxneo.hatenablog.com/entry/2017/06/02/140208]]


bcftools mpileupで、複数サンプルを並べてみる。[[このページ:https://pepper.is.sci.toho-u.ac.jp/pepper/index.php?cmd=read&page=Python%A5%D0%A5%A4%A5%AA%2F%A5%C4%A1%BC%A5%EB%2FVariant-Calling&word=pyvcf]],  [[bcftoolsのページ:https://samtools.github.io/bcftools/howtos/variant-calling.html]]

 bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam  SRR453567.sorted.bam  SRR453568.sorted.bam  SRR453569.sorted.bam  SRR453570.sorted.bam  SRR453571.sorted.bam | bcftools call -Ou -mv -o SRR453566_mpileup.call.vcf

更に結果を読めるようにする。
 bcftools view SRR453566_mpileup.call.vcf | vcf-annotate -f + > SRR453566_mpileup.call.annotate.vcf

次のような結果になった。

 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SRR453566.sorted.bam    SRR453567.sorted.bam    SRR453568.sorted.bam    SRR453569.sorted.bam    SRR453570.sorted.bam    SRR453571.sorted.bam
 NC_001133.9     136     .       G       A       68      SnpGap  DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60    GT:PL   1/1:23,3,0      1/1:24,3,0      ./.:0,0,0       1/1:23,3,0      1/1:23,3,0      ./.:0,0,0
 NC_001133.9     137     .       C       CT      135     PASS    INDEL;IDV=1;IMF=1;DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60  GT:PL   1/1:40,3,0      1/1:40,3,0      ./.:0,0,0       1/1:40,3,0      1/1:40,3,0      ./.:0,0,0
 NC_001133.9     286     .       A       T       13.4171 MinAB;MinDP     DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60   GT:PL   ./.:0,0,0       ./.:0,0,0       1/1:38,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
 NC_001133.9     305     .       C       G       10.6101 MinAB;MinDP     DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60   GT:PL   ./.:0,0,0       ./.:0,0,0       1/1:35,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
 NC_001133.9     349     .       C       T       88      PASS    DP=3;VDB=0.102722;SGB=-0.471632;MQ0F=0;AC=6;AN=6;DP4=0,0,0,3;MQ=60      GT:PL   1/1:38,3,0      ./.:0,0,0       1/1:35,3,0      ./.:0,0,0       ./.:0,0,0       1/1:40,3,0
 NC_001133.9     373     .       T       C       9.69791 Qual;MinAB      DP=2;SGB=-0.045183;RPB=1;MQB=1;BQB=1;MQ0F=0;ICB=0.5;HOB=0.5;AC=2;AN=4;DP4=0,1,0,1;MQ=60 GT:PL   0/1:0,3,41      ./.:0,0,0       0/1:40,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
 NC_001133.9     457     .       CAAA    CAA     7.96917 Qual;MinAB;MinDP        INDEL;IDV=1;IMF=1;DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=60 GT:PL   ./.:0,0,0       ./.:0,0,0       ./.:0,0,0       ./.:0,0,0       1/1:32,3,0      ./.:0,0,0



[[サンプル間で共通する変異と固有の変異を抽出する - macでインフォマティクス:http://kazumaxneo.hatenablog.com/entry/2017/06/06/165329]]

それで、bamファイルや、call抽出した後のvcfファイルに、インデックス付けが必要になることがある。たとえばIGVで表示するときなど。

インデックス付けはbcftoolsで可能だが、入力たるvcfファイルがbgzipされていなければならない。gdzip自体はsamtoolsのhtslibの中に入っていて、samtoolsをmake; make installしただけでは入らない。make all all-htslib; make install install-htslib のように追加makeする必要がある。
 bgzip -c SRR453566.raw.bcf > SRR453566.raw.bcf.gz
 bcftools index SRR453566.raw.bcf.gz
この状態でIGVに食わせることができる。
 import igv
 # tracksでVCFを指定してみる
 ref = { 
     "id": "Saccha", 
     "name": "Saccha",
     "fastaURL": "files/s288c.fna",
     "indexURL": "files/s288c.fna.fai",
     "tracks": [ 
       { 
         "name": "My GFF", 
         "type": "annotation", 
         "format": "gff3", 
         #"displayMode": "EXPANDED", 
         "url": "files/s288c_e.gff", 
         "indexed": False, 
       },
       { 
         "name": "VCF", 
         "type": "variant", 
         "format": "vcf", 
         "displayMode": "EXPANDED", 
         "url": "files/SRR453566.raw.bcf.gz", 
         "indexed": True, 
       } 
     ] 
 }
 b = igv.Browser({"reference": ref})
 b.show()

***遺伝研講習会資料では後処理としてFeature Countで数える [#qaa4366c]

Feature Countの前処理として、GeneIDを付与する。

遺伝研の提供するpythonプログラム add_gene_id で処理。そのために、パッケージbcbio-gffが必要なので、pipでインストール
 pip install bcbio-gff

これで、add_gene_idを稼働できる。
 python add_gene_id s288c.gff s288c_e.gff

得られたs288c.e.gffにはgene_idが入っているはず。

これをFeature Countする。[[biopapyrusの説明:https://bi.biopapyrus.jp/rnaseq/expression/featurecounts.html]] を引用すると:
>featureCounts はマッピング結果である BAM または SAM ファイルを入力とし、各 feature 領域(gene、CDS、exon)ごとにリードをカウントするプログラムである。featureCounts は Subread パッケージに含まれるプログラムの一つである。featureCounts を利用するには subread をインストールする必要がある。
subread のソースコードとコンパイルされたプログラムは sourceforge で公開されている。必要に応じてどちらかをダウンロードして使用する。

ということで、[[subread:http://subread.sourceforge.net/]]のページからダウンロード。

 featureCounts -p -T 8 -t exon -g gene_id -a $GFF \ 
  -o ../featurecount/counts.txt \ 
  SRR453566.sorted.bam \
  SRR453567.sorted.bam \ 
  SRR453568.sorted.bam \ 
  SRR453569.sorted.bam \ 
  SRR453570.sorted.bam \ 
  SRR453571.sorted.bam

$GFFは参照シーケンスのGene情報を含むファイル。[[featureCount (Subread) のマニュアル:http://bioinf.wehi.edu.au/featureCounts/]]では
 The annotation file should be in either GTF format or a simplified annotation format (SAF) 
 as shown below (columns are tab-delimited):
 GeneID	Chr	Start	End	Strand
 497097	chr1	3204563	3207049	-
 497097	chr1	3411783	3411982	-
 497097	chr1	3660633	3661579	-
 ...
としてあり、GTFフォーマットは[[Frequently Asked Questions: Data File Formats:https://genome.ucsc.edu/FAQ/FAQformat.html#format4]]および[[GTF2.2: A Gene Annotation Format (Revised Ensembl GTF):http://mblab.wustl.edu/GTF22.html]]によると
 GTF stands for Gene transfer format. It borrows from GFF, but has additional 
 structure that warrants a separate definition and format name.
 Structure is as GFF, so the fields are: 
 <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
 
 Here is a simple example with 3 translated exons. Order of rows is not important.
 
 381 Twinscan  CDS          380   401   .   +   0  gene_id "001"; transcript_id "001.1";
 381 Twinscan  CDS          501   650   .   +   2  gene_id "001"; transcript_id "001.1";
 381 Twinscan  CDS          700   707   .   +   2  gene_id "001"; transcript_id "001.1";
 381 Twinscan  start_codon  380   382   .   +   0  gene_id "001"; transcript_id "001.1";
 381 Twinscan  stop_codon   708   710   .   +   0  gene_id "001"; transcript_id "001.1";
 The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space.

更に、GFFフォーマットは[[Frequently Asked Questions: Data File Formats:https://genome.ucsc.edu/FAQ/FAQformat.html#format3]]によると
 GFF (General Feature Format) lines are based on the Sanger GFF2 specification. 
 GFF lines have nine required fields that must be tab-separated. 
 If the fields are separated by spaces instead of tabs, the track will not display correctly. 
 For more information on GFF format, refer to Sanger's GFF page.
また、上記で参照しているSangerのGFF2/3の説明は、[[GFF2:http://gmod.org/wiki/GFF2]]と[[GFF3:http://gmod.org/wiki/GFF3]]にある。GFF2はobsoleteとされ、GFF3がrecommendされている。更に、GFF3ではファイルの最後にオプションでsequenceが書ける。

動作することのテストとして、SRR453566.sorted.bamの1つだけを処理してみる。
 featureCounts -p -T 32 -t exon -g gene_id -a s288c_e.gff -o SRR435366.counts.txt SRR453566.sorted.bam

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
	  v1.6.3

 //========================== featureCounts setting ===========================\\
 ||                                                                            ||
 ||             Input files : 1 BAM file                                       ||
 ||                           P SRR453566.sorted.bam                           ||
 ||                                                                            ||
 ||             Output file : SRR435366.counts.txt                             ||
 ||                 Summary : SRR435366.counts.txt.summary                     ||
 ||              Annotation : s288c_e.gff (GTF)                                ||
 ||      Dir for temp files : ./                                               ||
 ||                                                                            ||
 ||                 Threads : 32                                               ||
 ||                   Level : meta-feature level                               ||
 ||              Paired-end : yes                                              ||
 ||      Multimapping reads : not counted                                      ||
 || Multi-overlapping reads : not counted                                      ||
 ||   Min overlapping bases : 1                                                ||
 ||                                                                            ||
 ||          Chimeric reads : counted                                          ||
 ||        Both ends mapped : not required                                     ||
 ||                                                                            ||
 \\===================== http://subread.sourceforge.net/ ======================//
 
 //================================= Running ==================================\\
 ||                                                                            ||
 || Load annotation file s288c_e.gff ...                                       ||
 ||    Features : 6761                                                         ||
 ||    Meta-features : 6420                                                    ||
 ||    Chromosomes/contigs : 17                                                ||
 ||                                                                            ||
 || Process BAM file SRR453566.sorted.bam...                                   ||
 ||    Paired-end reads are included.                                          ||
 ||    Assign alignments (paired-end) to features...                           ||
 ||                                                                            ||
 ||    WARNING: reads from the same pair were found not adjacent to each       ||
 ||             other in the input (due to read sorting by location or         ||
 ||             reporting of multi-mapping read pairs).                        ||
 ||                                                                            ||
 ||    Pairing up the read pairs.                                              ||
 ||                                                                            ||
 ||    Total alignments : 5650830                                              ||
 ||    Successfully assigned alignments : 4568783 (80.9%)                      ||
 ||    Running time : 0.03 minutes                                             ||
 ||                                                                            ||
 ||                                                                            ||
 || Summary of counting results can be found in file "SRR435366.counts.txt.su  ||
 || mmary"                                                                     ||
 ||                                                                            ||
 \\===================== http://subread.sourceforge.net/ ======================//

出力は(先頭)
 # Program:featureCounts v1.6.3; Command:"featureCounts" "-p" "-T" "32" "-t" "exon" "-g" "gene_id" "-a" "s288c_e.gff" "-o" "SRR435366.counts.txt" "SRR453566.sorted.bam"
 Geneid  Chr     Start   End     Strand  Length  SRR453566.sorted.bam
 gene_0001       NC_001133.9     1807    2169    -       363     0
 gene_0002       NC_001133.9     2480    2707    +       228     0
 gene_0003       NC_001133.9     7235    9016    -       1782    0
 gene_0004       NC_001133.9     11565   11951   -       387     0
 gene_0005       NC_001133.9     12046   12426   +       381     2
 gene_0006       NC_001133.9     13363   13743   -       381     0
 gene_0007       NC_001133.9     21566   21850   +       285     0
 gene_0008       NC_001133.9     22395   22685   -       291     0
 gene_0009       NC_001133.9     24000   27968   -       3969    32
 gene_0010       NC_001133.9     31567   32940   +       1374    65
 gene_0011       NC_001133.9     33448   34701   +       1254    119
 gene_0012       NC_001133.9     35155   36303   +       1149    1155

>>遺伝研セミナーのこの先の処理は[[RNAseq-0(遺伝研セミナーでの)発現解析〜カウントデータの解析>Pythonバイオ/ツール/RNAseq-0(遺伝研セミナーでの)発現解析〜カウントデータの解析]]

//=================================

***別の資料によると、パイプラインとして、HISAT2⇒StringTie⇒ [#e7486864]
***別の資料によると、パイプラインとして、HISAT2⇒StringTie⇒ Ballgawn [#e7486864]

Johns Hopkinsから論文
[[Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.  Nature Protocols volume 11, pages 1650&#8211;1667 (2016):https://www.nature.com/articles/nprot.2016.095]]

- [[StringTie:https://ccb.jhu.edu/software/stringtie/index.shtml]]  is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments longer sequences that have been assembled from those reads.In order to identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software like Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.).
-マニュアルは[[manual:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual]]
HISAT2の出力を使う場合、be sure to run HISAT2 with the --dta option for alignmen~

-StringTieのHPで照会されている[[gffcompare:https://ccb.jhu.edu/software/stringtie/gff.shtml#gffcompare]]は、~
The program gffcompare can be used to compare, merge, annotate and estimate accuracy of one or more GFF files (the "query" files), when compared with a reference annotation (also provided as GFF/GTF).

- [[NGSデータから新たな知識を導出するためのデータ解析リテラシー 統合TV(togotv)|生命科学系DB・ツール使い倒し系チャンネル:https://togotv.dbcls.jp/ajacs2018008.html]]の「データ解析」の項が参考になりそう。

-Cufflinks回り 〜 以前はcufflinksを使っていたらしい
--[[次世代シークエンサーDRY解析教本による遺伝子発現解析&#12316;Cufflinks、cummeRbund、DAVID&#12316;@AJACSa三島2 統合TV(togotv)|生命科学系DB・ツール使い倒し系チャンネル:https://togotv.dbcls.jp/ajacs2016010.html]]はcufflinksを使った解説
--[[Cufflinks | RNA-seq を利用した遺伝子発現解析:https://bi.biopapyrus.jp/rnaseq/expression/cufflinks.html]]
--その手前の概論 [[発現量解析 | RNA-Seq を利用した発現変動遺伝子の検出:https://bi.biopapyrus.jp/rnaseq/analysis/]]

-([[コンセンサス配列の作成 | VCF の変異情報に基づいてリファレンスの塩基配列を置換しコンセンサス配列を作成方法:https://bi.biopapyrus.jp/gwas/vcf-consensus-fasta.html]])

それで、HISAT2でリードをアセンブルした結果のbamファイルをStringTieでアセンブルする。

StringTieのインストールは、[[マニュアル:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual]]通り。ソースをダウンロードし、make。インストールは自分で/usr/local/bin/stringtieからリンクを張れば終わり。

処理は、あらかじめHISAT2でsorted.bamができているとして、(HISAT2で--dtaを指定すること)
 stringtie -B -p 16 -o SRR453566.gtf -G s288c_e.gff SRR453566.sorted.bam
ここで、-B は出力をBallgownで使うための設定、-G s288c_e.gff はreference annotation file (in GTF or GFF3 format)、-p 16はスレッド数で、出力は-o SRR453566.gtf で指定。

-B指定をしたときの出力は、SRR453566.gtfに加えて、
        e2t.ctab
        e_data.ctab
        i2t.ctab
        i_data.ctab
        t_data.ctab
であり、これをBallgawnに食わせる必要がある。

このstringtieの出力を、Ballgawnで整理して可視化する。~
A program for computing differentially expressed genes in two or more RNA-seq experiments, using the output of StringTie or Cufflinks. The Ballgown package provides functions to organize, visualize, and analyze expression measurements. Ballgown is written in R and is part of Bioconductor.

[[GitHub - alyssafrazee/ballgown: Bioconductor package "ballgown":https://github.com/alyssafrazee/ballgown]]


[[HISAT-StringTie-Ballgown を試してみよう - Palmsonntagmorgen:http://rnakato.hatenablog.jp/entry/2018/11/09/165200]] これでは論文ではoptionalとなっていますが、gffcompareを使って、生成された遺伝子リストとreferenceの遺伝子リストを比較することができます。

続きとして[[HISAT-StringTie-Ballgown を試してみよう その2 - Palmsonntagmorgen:http://rnakato.hatenablog.jp/entry/2018/11/26/145847]]

トップ   編集 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS