Pythonバイオ? Pythonバイオ/ツール?
495   2019-06-09 (日) 06:56:26

リードの参照配列へのマッピング

Bowtie2最新版をインストール(アップデート)
ダウンロードしてunzipしてmake

使い方の例 biopapyrus bowtie2、マニュアル http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

bowtie2-build --threads 16 GCF_000001405.38_GRCh38.p12_genomic.fna HG38.ix
Settings:
  Output files: "HG38.ix.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  GCF_000001405.38_GRCh38.p12_genomic.fna
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:20
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:20
bmax according to bmaxDivN setting: 773987710
Using parameters --bmax 580490783 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 580490783 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:01:24
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:17
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:30
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 3.09595e+09 (target: 580490782)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 3095950843 for bucket 1
  (Using difference cover)

次に、samtoolsでまとめる。並列指定は-@ スレッド数

samtools view -@ 32 -bS SRR002322.sam > SRR002322.bam

続いて、ソート。並列指定は-@ スレッド

samtools sort SRR002322.bam -o SRR002322.sorted.bam -@ 32

最後にmpileup これは並列指定はできないらしい

samtools mpileup -uf GCF_000001405.38_GRCh38.p12_genomic.fna SRR002322.sorted.bam | bcftools view -Ov - > SR002322.raw.bcf

なお、mpileupも並列実行できるパッケージがあったので、いずれテスト。
Sambamba 0.6.8

遺伝研講習会での例

「先進ゲノム支援」情報解析講習会のご案内
情報解析講習会ビデオ<2018年度 情報解析講習会(中級者向け)>
資料(GitHub)

遺伝研講習会の例では、「TopHat2の後継である」HISAT2を使っている。<-- アメリエフのブログ HISAT2

HISAT2 に曰く

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs ...

http://ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-source.zip
unzip
make

バイナリはmakeしたディレクトリ中にできるので、適宜リンクを張ること。

hisat2の処理は、インデックスを作る(hisat2-build) ⇒ hisat2でマップ の順。

hisat2-build s288c.fna s288c-build.fna
Settings:
  Output files: "s288c-build.fna.*.ht2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  s288c.fna
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 2279457 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 2279457 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:01
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 1.73673e+06 (target: 2279456)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Getting block 1 of 7
  Reserving size (2279457) for bucket 1
  Calculating Z arrays for bucket 1
  Entering block accumulator loop for bucket 1:
  bucket 1: 10%
  bucket 1: 20%
  bucket 1: 30%
  bucket 1: 40%
  bucket 1: 50%
  bucket 1: 60%
  bucket 1: 70%
  bucket 1: 80%
  bucket 1: 90%
  bucket 1: 100%
  Sorting block of length 1917483 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 1917484 for bucket 1
Getting block 2 of 7
  Reserving size (2279457) for bucket 2
  Calculating Z arrays for bucket 2
  Entering block accumulator loop for bucket 2:
  bucket 2: 10%
  bucket 2: 20%
  bucket 2: 30%
  bucket 2: 40%
  bucket 2: 50%
  bucket 2: 60%
  bucket 2: 70%
  bucket 2: 80%
  bucket 2: 90%
  bucket 2: 100%
  Sorting block of length 1565107 for bucket 2
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 1565108 for bucket 2
Getting block 3 of 7
  Reserving size (2279457) for bucket 3
  Calculating Z arrays for bucket 3
  Entering block accumulator loop for bucket 3:
  bucket 3: 10%
  bucket 3: 20%
  bucket 3: 30%
  bucket 3: 40%
  bucket 3: 50%
  bucket 3: 60%
  bucket 3: 70%
  bucket 3: 80%
  bucket 3: 90%
  bucket 3: 100%
  Sorting block of length 2077071 for bucket 3
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 2077072 for bucket 3
Getting block 4 of 7
  Reserving size (2279457) for bucket 4
  Calculating Z arrays for bucket 4
  Entering block accumulator loop for bucket 4:
  bucket 4: 10%
  bucket 4: 20%
  bucket 4: 30%
  bucket 4: 40%
  bucket 4: 50%
  bucket 4: 60%
  bucket 4: 70%
  bucket 4: 80%
  bucket 4: 90%
  bucket 4: 100%
  Sorting block of length 1959963 for bucket 4
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 1959964 for bucket 4
Getting block 5 of 7
  Reserving size (2279457) for bucket 5
  Calculating Z arrays for bucket 5
  Entering block accumulator loop for bucket 5:
  bucket 5: 10%
  bucket 5: 20%
  bucket 5: 30%
  bucket 5: 40%
  bucket 5: 50%
  bucket 5: 60%
  bucket 5: 70%
  bucket 5: 80%
  bucket 5: 90%
  bucket 5: 100%
  Sorting block of length 1858740 for bucket 5
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 1858741 for bucket 5
Getting block 6 of 7
  Reserving size (2279457) for bucket 6
  Calculating Z arrays for bucket 6
  Entering block accumulator loop for bucket 6:
  bucket 6: 10%
  bucket 6: 20%
  bucket 6: 30%
  bucket 6: 40%
  bucket 6: 50%
  bucket 6: 60%
  bucket 6: 70%
  bucket 6: 80%
  bucket 6: 90%
  bucket 6: 100%
  Sorting block of length 630502 for bucket 6
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 630503 for bucket 6
Getting block 7 of 7
  Reserving size (2279457) for bucket 7
  Calculating Z arrays for bucket 7
  Entering block accumulator loop for bucket 7:
  bucket 7: 10%
  bucket 7: 20%
  bucket 7: 30%
  bucket 7: 40%
  bucket 7: 50%
  bucket 7: 60%
  bucket 7: 70%
  bucket 7: 80%
 bucket 7: 90%
 bucket 7: 100%
 Sorting block of length 2148233 for bucket 7
 (Using difference cover)
 Sorting block time: 00:00:01
Returning block of 2148234 for bucket 7
Exited GFM loop
fchr[A]: 0
fchr[C]: 3766349
fchr[G]: 6086925
fchr[T]: 8404025
fchr[$]: 12157105
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 8248454 bytes to primary GFM file: s288c-build.fna.1.ht2
Wrote 3039284 bytes to secondary GFM file: s288c-build.fna.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 5399069 bytes to primary GFM file: s288c-build.fna.5.ht2
Wrote 3092708 bytes to secondary GFM file: s288c-build.fna.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor
Headers:
    len: 12157105
    gbwtLen: 12157106
    nodes: 12157106
    sz: 3039277
    gbwtSz: 3039277
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 0
    eftabSz: 0
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 759820
    offsSz: 3039280
    lineSz: 64
    sideSz: 64
    sideGbwtSz: 48
    sideGbwtLen: 192
    numSides: 63319
    numLines: 63319
    gbwtTotLen: 4052416
    gbwtTotSz: 4052416
    reverse: 0
    linearFM: Yes
Total time for call to driver() for forward index: 00:00:06

準備終わり。

マッピングは

hisat2 -p 32 --dta-cufflinks -x s288c-build.fna -1 paired_SRR453566_1.trim.fastq -2 paired_SRR453566_2.trim.fastq -S SRR453566.sam
5115482 reads; of these:
  5115482 (100.00%) were paired; of these:
    323268 (6.32%) aligned concordantly 0 times
    4524989 (88.46%) aligned concordantly exactly 1 time
    267225 (5.22%) aligned concordantly >1 times
    ----
    323268 pairs aligned concordantly 0 times; of these:
      77115 (23.85%) aligned discordantly 1 time
    ----
    246153 pairs aligned 0 times concordantly or discordantly; of these:
      492306 mates make up the pairs; of these:
        341629 (69.39%) aligned 0 times
        135729 (27.57%) aligned exactly 1 time
        14948 (3.04%) aligned >1 times
96.66% overall alignment rate

で終わり。

ここで、samtoolsを使ってsamファイルをbamファイルに変換し、かつ、positionでsortする。

samtools/bcftoolsのインストールは、Samtoolsのサイト]http://www.htslib.org/download/からダウンロードし、

./configure
make
sudo make install

で完了。

samtools viewを使ってsamファイルSRR453566.sam をbamファイル SR453566.bamに変換。@32は32スレッド使用。

samtools view -@ 32 -b SRR453566.sam > SRR453566.bam
又は samtools view -@ 32 -b SRR453566.sam -o SRR453566.bam

samtools sortを使ってbamファイルSRR453566.bam をソートしてSRR453566.sorted.bamに出力

samtools sort -@ 32 SRR453566.bam > SRR453566.sorted.bam 
   [bam_sort_core] merging from 0 files and 32 in-memory blocks...

hisat2のドキュメントでは後処理として samtools mpileupしてbcftools viewで表示

samtools mpileup -uvf s288c.fna SRR453566.sorted.bam > SRR453566.vcf
[warning] samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[mpileup] 1 samples in 1 input files

同じことをbcftoolsでは

bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam > SRR453566.vcf

なお、パラメタ --threads 16 などは実際には並列化の効果が無かった。

samtoolsでvariant callでは

bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> | bcftools call -vmO z -o <study.vcf.gz>

と書いている。

bcftoolsのマニュアルではviewの代わりにcallを使えとしており、具体的な抽出処理をどうするか例がいくつかある。

更に、 HISAT2では 

bcftools view -bvcg SRR453566.vcf > SRR453566.raw.bcf 

とするとしているが、パラメタ-bvgfはエラー。

解説を見ると、最近のバージョンではbcftools callによってコールを取出すらしい。

bcftools call -O v -v -c SRR453566.vcf > SRR453566.raw.bcf

が正解の模様。得られた結果は(先頭コメントラインを除くと)

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SRR453566.sorted.bam
NC_001133.9     137     .       C       CT      4.4191  .       INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-37.5262   GT:PL   0/1:40,3,0
NC_001133.9     349     .       C       T       8.64911 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,0,1;MQ=60;FQ=-29.9908     GT:PL   1/1:38,3,0
NC_001133.9     11969   .       C       G       10.4247 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9905     GT:PL   1/1:40,3,0
NC_001133.9     11997   .       T       A       27.0206 .       DP=2;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9901     GT:PL   1/1:57,3,0
NC_001133.9     12361   .       G       A       4.13164 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928     GT:PL   0/1:32,3,0
NC_001133.9     12373   .       G       A       4.13164 .       DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9928     GT:PL   0/1:32,3,0

といった感じ。INFO部分の解説はヘッダ部分にある通り、

##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AF2,Number=1,Type=Float,Description="Max-likelihood estimate of the first and second group ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">

バリアントコール結果のVCFフォーマット - macでインフォマティクス

bcftools mpileupで、複数サンプルを並べてみる。このページ, bcftoolsのページ

bcftools mpileup -Ou -f s288c.fna SRR453566.sorted.bam  SRR453567.sorted.bam  SRR453568.sorted.bam  SRR453569.sorted.bam  SRR453570.sorted.bam  SRR453571.sorted.bam | bcftools call -Ou -mv -o SRR453566_mpileup.call.vcf

更に結果を読めるようにする。

bcftools view SRR453566_mpileup.call.vcf | vcf-annotate -f + > SRR453566_mpileup.call.annotate.vcf

次のような結果になった。

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SRR453566.sorted.bam    SRR453567.sorted.bam    SRR453568.sorted.bam    SRR453569.sorted.bam    SRR453570.sorted.bam    SRR453571.sorted.bam
NC_001133.9     136     .       G       A       68      SnpGap  DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60    GT:PL   1/1:23,3,0      1/1:24,3,0      ./.:0,0,0       1/1:23,3,0      1/1:23,3,0      ./.:0,0,0
NC_001133.9     137     .       C       CT      135     PASS    INDEL;IDV=1;IMF=1;DP=4;VDB=0.00992398;SGB=-0.896883;MQ0F=0;AC=8;AN=8;DP4=0,0,4,0;MQ=60  GT:PL   1/1:40,3,0      1/1:40,3,0      ./.:0,0,0       1/1:40,3,0      1/1:40,3,0      ./.:0,0,0
NC_001133.9     286     .       A       T       13.4171 MinAB;MinDP     DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60   GT:PL   ./.:0,0,0       ./.:0,0,0       1/1:38,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
NC_001133.9     305     .       C       G       10.6101 MinAB;MinDP     DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=60   GT:PL   ./.:0,0,0       ./.:0,0,0       1/1:35,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
NC_001133.9     349     .       C       T       88      PASS    DP=3;VDB=0.102722;SGB=-0.471632;MQ0F=0;AC=6;AN=6;DP4=0,0,0,3;MQ=60      GT:PL   1/1:38,3,0      ./.:0,0,0       1/1:35,3,0      ./.:0,0,0       ./.:0,0,0       1/1:40,3,0
NC_001133.9     373     .       T       C       9.69791 Qual;MinAB      DP=2;SGB=-0.045183;RPB=1;MQB=1;BQB=1;MQ0F=0;ICB=0.5;HOB=0.5;AC=2;AN=4;DP4=0,1,0,1;MQ=60 GT:PL   0/1:0,3,41      ./.:0,0,0       0/1:40,3,0      ./.:0,0,0       ./.:0,0,0       ./.:0,0,0
NC_001133.9     457     .       CAAA    CAA     7.96917 Qual;MinAB;MinDP        INDEL;IDV=1;IMF=1;DP=1;SGB=-0.045183;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=60 GT:PL   ./.:0,0,0       ./.:0,0,0       ./.:0,0,0       ./.:0,0,0       1/1:32,3,0      ./.:0,0,0

サンプル間で共通する変異と固有の変異を抽出する - macでインフォマティクス

それで、bamファイルや、call抽出した後のvcfファイルに、インデックス付けが必要になることがある。たとえばIGVで表示するときなど。

インデックス付けはbcftoolsで可能だが、入力たるvcfファイルがbgzipされていなければならない。gdzip自体はsamtoolsのhtslibの中に入っていて、samtoolsをmake; make installしただけでは入らない。make all all-htslib; make install install-htslib のように追加makeする必要がある。

bgzip -c SRR453566.raw.bcf > SRR453566.raw.bcf.gz
bcftools index SRR453566.raw.bcf.gz

この状態でIGVに食わせることができる。

import igv
# tracksでVCFを指定してみる
ref = { 
    "id": "Saccha", 
    "name": "Saccha",
    "fastaURL": "files/s288c.fna",
    "indexURL": "files/s288c.fna.fai",
    "tracks": [ 
      { 
        "name": "My GFF", 
        "type": "annotation", 
        "format": "gff3", 
        #"displayMode": "EXPANDED", 
        "url": "files/s288c_e.gff", 
        "indexed": False, 
      },
      { 
        "name": "VCF", 
        "type": "variant", 
        "format": "vcf", 
        "displayMode": "EXPANDED", 
        "url": "files/SRR453566.raw.bcf.gz", 
        "indexed": True, 
      } 
    ] 
}
b = igv.Browser({"reference": ref})
b.show()

遺伝研講習会資料では後処理としてFeature Countで数える

Feature Countの前処理として、GeneIDを付与する。

遺伝研の提供するpythonプログラム add_gene_id で処理。そのために、パッケージbcbio-gffが必要なので、pipでインストール

pip install bcbio-gff

これで、add_gene_idを稼働できる。

python add_gene_id s288c.gff s288c_e.gff

得られたs288c.e.gffにはgene_idが入っているはず。

これをFeature Countする。biopapyrusの説明 を引用すると:

featureCounts はマッピング結果である BAM または SAM ファイルを入力とし、各 feature 領域(gene、CDS、exon)ごとにリードをカウントするプログラムである。featureCounts は Subread パッケージに含まれるプログラムの一つである。featureCounts を利用するには subread をインストールする必要がある。 subread のソースコードとコンパイルされたプログラムは sourceforge で公開されている。必要に応じてどちらかをダウンロードして使用する。

ということで、subreadのページからダウンロード。

featureCounts -p -T 8 -t exon -g gene_id -a $GFF \ 
 -o ../featurecount/counts.txt \ 
 SRR453566.sorted.bam \
 SRR453567.sorted.bam \ 
 SRR453568.sorted.bam \ 
 SRR453569.sorted.bam \ 
 SRR453570.sorted.bam \ 
 SRR453571.sorted.bam

$GFFは参照シーケンスのGene情報を含むファイル。featureCount (Subread) のマニュアルでは

The annotation file should be in either GTF format or a simplified annotation format (SAF) 
as shown below (columns are tab-delimited):
GeneID	Chr	Start	End	Strand
497097	chr1	3204563	3207049	-
497097	chr1	3411783	3411982	-
497097	chr1	3660633	3661579	-
...

としてあり、GTFフォーマットはFrequently Asked Questions: Data File FormatsおよびGTF2.2: A Gene Annotation Format (Revised Ensembl GTF)によると

GTF stands for Gene transfer format. It borrows from GFF, but has additional 
structure that warrants a separate definition and format name.
Structure is as GFF, so the fields are: 
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Here is a simple example with 3 translated exons. Order of rows is not important.

381 Twinscan  CDS          380   401   .   +   0  gene_id "001"; transcript_id "001.1";
381 Twinscan  CDS          501   650   .   +   2  gene_id "001"; transcript_id "001.1";
381 Twinscan  CDS          700   707   .   +   2  gene_id "001"; transcript_id "001.1";
381 Twinscan  start_codon  380   382   .   +   0  gene_id "001"; transcript_id "001.1";
381 Twinscan  stop_codon   708   710   .   +   0  gene_id "001"; transcript_id "001.1";
The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space.

更に、GFFフォーマットはFrequently Asked Questions: Data File Formatsによると

GFF (General Feature Format) lines are based on the Sanger GFF2 specification. 
GFF lines have nine required fields that must be tab-separated. 
If the fields are separated by spaces instead of tabs, the track will not display correctly. 
For more information on GFF format, refer to Sanger's GFF page.

また、上記で参照しているSangerのGFF2/3の説明は、GFF2GFF3にある。GFF2はobsoleteとされ、GFF3がrecommendされている。更に、GFF3ではファイルの最後にオプションでsequenceが書ける。

動作することのテストとして、SRR453566.sorted.bamの1つだけを処理してみる。

featureCounts -p -T 32 -t exon -g gene_id -a s288c_e.gff -o SRR435366.counts.txt SRR453566.sorted.bam
       ==========     _____ _    _ ____  _____  ______          _____  
       =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
         =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
           ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
             ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
       ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
	  v1.6.3
//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P SRR453566.sorted.bam                           ||
||                                                                            ||
||             Output file : SRR435366.counts.txt                             ||
||                 Summary : SRR435366.counts.txt.summary                     ||
||              Annotation : s288c_e.gff (GTF)                                ||
||      Dir for temp files : ./                                               ||
||                                                                            ||
||                 Threads : 32                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file s288c_e.gff ...                                       ||
||    Features : 6761                                                         ||
||    Meta-features : 6420                                                    ||
||    Chromosomes/contigs : 17                                                ||
||                                                                            ||
|| Process BAM file SRR453566.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 5650830                                              ||
||    Successfully assigned alignments : 4568783 (80.9%)                      ||
||    Running time : 0.03 minutes                                             ||
||                                                                            ||
||                                                                            ||
|| Summary of counting results can be found in file "SRR435366.counts.txt.su  ||
|| mmary"                                                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

出力は(先頭)

# Program:featureCounts v1.6.3; Command:"featureCounts" "-p" "-T" "32" "-t" "exon" "-g" "gene_id" "-a" "s288c_e.gff" "-o" "SRR435366.counts.txt" "SRR453566.sorted.bam"
Geneid  Chr     Start   End     Strand  Length  SRR453566.sorted.bam
gene_0001       NC_001133.9     1807    2169    -       363     0
gene_0002       NC_001133.9     2480    2707    +       228     0
gene_0003       NC_001133.9     7235    9016    -       1782    0
gene_0004       NC_001133.9     11565   11951   -       387     0
gene_0005       NC_001133.9     12046   12426   +       381     2
gene_0006       NC_001133.9     13363   13743   -       381     0
gene_0007       NC_001133.9     21566   21850   +       285     0
gene_0008       NC_001133.9     22395   22685   -       291     0
gene_0009       NC_001133.9     24000   27968   -       3969    32
gene_0010       NC_001133.9     31567   32940   +       1374    65
gene_0011       NC_001133.9     33448   34701   +       1254    119
gene_0012       NC_001133.9     35155   36303   +       1149    1155

>>遺伝研セミナーのこの先の処理はRNAseq-0(遺伝研セミナーでの)発現解析〜カウントデータの解析

別の資料によると、パイプラインとして、HISAT2⇒StringTie⇒ Ballgawn

Johns Hopkinsから論文 Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols volume 11, pages 1650–1667 (2016)

それで、HISAT2でリードをアセンブルした結果のbamファイルをStringTieでアセンブルする。

StringTieのインストールは、マニュアル通り。ソースをダウンロードし、make。インストールは自分で/usr/local/bin/stringtieからリンクを張れば終わり。

処理は、あらかじめHISAT2でsorted.bamができているとして、(HISAT2で--dtaを指定すること)

stringtie -B -p 16 -o SRR453566.gtf -G s288c_e.gff SRR453566.sorted.bam

ここで、-B は出力をBallgownで使うための設定、-G s288c_e.gff はreference annotation file (in GTF or GFF3 format)、-p 16はスレッド数で、出力は-o SRR453566.gtf で指定。

このstringtieの出力を、Ballgawnで整理して可視化する。
A program for computing differentially expressed genes in two or more RNA-seq experiments, using the output of StringTie or Cufflinks. The Ballgown package provides functions to organize, visualize, and analyze expression measurements. Ballgown is written in R and is part of Bioconductor.

GitHub - alyssafrazee/ballgown: Bioconductor package "ballgown"

HISAT-StringTie-Ballgown を試してみよう - Palmsonntagmorgen これでは論文ではoptionalとなっていますが、gffcompareを使って、生成された遺伝子リストとreferenceの遺伝子リストを比較することができます。

続きとしてHISAT-StringTie-Ballgown を試してみよう その2 - Palmsonntagmorgen


トップ   編集 凍結 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS
Last-modified: 2019-06-09 (日) 06:56:26 (98d)