[[Pythonバイオ]] [[Pythonバイオ/ツール]]~
&counter();   &lastmod();~

*分析ツール [#e7c82741]

-[[Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation (Nature Biotechnology volume 28, pages 511–515 (2010)):https://www.nature.com/articles/nbt.1621]]
-[[Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms(PubMed):https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3146043/]]

-[[Third-generation sequencing fireworks at Marco Island (Nature Biotechnology volume 28, pages 426–428 (2010)):https://www.nature.com/articles/nbt0510-426]]

-[[Advancing RNA-Seq analysis (Nature Biotechnology volume 28, pages 421–423 (2010)):https://www.nature.com/articles/nbt0510-421]]

-[[RNA-Seqの数理―生成モデルによる発現量推定:アーカイブ - Wolfeyes Bioinformatics beta:http://yagays.github.io/blog/2013/07/14/mathematical-models-for-transcriptome-quantification-0/]]

-[[Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334321/pdf/nihms-366741.pdf]]


-[[BaseSpace で行う RNA-seq 入門 < TopHat/Cufflinks編> 2015_ilmn_webinar_basespace-rna_session12.pdf:https://jp.illumina.com/content/dam/illumina-marketing/apac/japan/documents/pdf/2015_ilmn_webinar_basespace-rna_session12.pdf]]

-[[RNA-Seq|次世代シーケンサー お悩みカウンセリング(2013):http://www.mss.co.jp/businessfield/bioinformatics/ngs/pop_rna-seq.html]]

-[[(ERANGE) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Method 2008:https://www.ncbi.nlm.nih.gov/pubmed/18516045]]
-[[(ERANGE) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Method 2008 (ResearchGate copy):https://www.researchgate.net/publication/5331349_Mapping_and_quantifying_mammalian_transcriptomes_by_RNA-Seq/download]]


------------------------

**Cuffdiff (Cufflinks) [#sb90c59c]

cufflinksでエラーが出る。[[HISAT2-StringTie-Cuffdiff error: BAM record error: found spliced alignment without XS attribute:https://www.biostars.org/p/222849/]](正しくは?)

対策は、HISAT2でパラメータ--dta-cufflinksを指定すること
 
***cufflinks/cuffdiff [#ed355856]
-[[cufflinks:https://github.com/cole-trapnell-lab/cufflinks]]
^[[Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks:https://www.nature.com/articles/nprot.2012.016]]
-[[binary download page:http://cole-trapnell-lab.github.io/cufflinks/install/]]

cuffdiffを使ってみる
cuffdiffを使ってみる 2019-06-29 再度やり直す必要ありそう。まずcufflinksで集計が必要かも。要チェック。

-[[Cufflinks | RNA-seq を利用した遺伝子発現解析:https://bi.biopapyrus.jp/rnaseq/expression/cufflinks.html]]

あらかじめ用意したsort済みのbamファイルSRR4535xx.tagged.sorted.bam とマップs288c_3.gffに対して
 $cuffdiff -p 32 -o ./results1 ../s288c_e.gff -L batch,chemo \
 SRR453566.tagged.sorted.bam,SRR453567.tagged.sorted.bam,SRR453568.tagged.sorted.bam \
 SRR453569.tagged.sorted.bam,SRR453570.tagged.sorted.bam,SRR453571.tagged.sorted.bam
ラベル付け-Lの都合で、ラベル2種類batch,chemoに対して、入力ファイルSRR453566,67,68をbatchに、SRR453569,70,71をchemoに割り当てた(はず)。コンマの前後は空白なし(あるとコマンド解釈で引数の切れ目と思ってしまうので) この-Lと入力ファイルの区切りによって、後の解析パターン(ソースの分類)が決まるので要注意。

 Warning: Could not connect to update server to verify current version. Please check
   at the Cufflinks website (http://cufflinks.cbcb.umd.edu).
 [10:38:24] Loading reference annotation.
 [10:38:25] Inspecting maps and determining fragment length distributions.
 [10:40:50] Modeling fragment count overdispersion.
 [10:42:08] Modeling fragment count overdispersion.
 > Map Properties:
 >	Normalized Map Mass: 3618635.35
 >	Raw Map Mass: 3746966.57
 >	Fragment Length Distribution: Empirical (learned)
 >	              Estimated Mean: 209.34
 >	           Estimated Std Dev: 99.53
 > Map Properties:
 >	Normalized Map Mass: 3618635.35
 >	Raw Map Mass: 5136465.50
 >	Fragment Length Distribution: Empirical (learned)
 >	              Estimated Mean: 213.53
 >	           Estimated Std Dev: 96.07
 > Map Properties:
 >	Normalized Map Mass: 3618635.35
 >	Raw Map Mass: 3780062.80
 >	Fragment Length Distribution: Empirical (learned)
 >	              Estimated Mean: 206.94
 >	           Estimated Std Dev: 94.02
 > Map Properties:
 >	Normalized Map Mass: 3618635.35
 >	Raw Map Mass: 2502112.93
 >	Fragment Length Distribution: Empirical (learned)
 >	              Estimated Mean: 202.52
 >	           Estimated Std Dev: 96.05
 > Map Properties:
 >	Normalized Map Mass: 3618635.35
 >	Raw Map Mass: 3127907.67
 >	Fragment Length Distribution: Empirical (learned)
 >	              Estimated Mean: 211.12
 >	           Estimated Std Dev: 94.18
 > Map Properties:
 >	Normalized Map Mass: 3618635.35
 >	Raw Map Mass: 3898613.50
 >	Fragment Length Distribution: Empirical (learned)
 >	              Estimated Mean: 213.42
 >	           Estimated Std Dev: 98.72
 [10:43:25] Calculating preliminary abundance estimates
 [10:43:25] Testing for differential expression and regulation in locus.
 > Processed 6278 loci.                         [*************************] 100%
 Performed 5732 isoform-level transcription difference tests
 Performed 0 tss-level transcription difference tests
 Performed 5732 gene-level transcription difference tests
 Performed 0 CDS-level transcription difference tests
 Performed 0 splicing tests
 Performed 0 promoter preference tests
 Performing 0 relative CDS output tests
 Writing isoform-level FPKM tracking
 Writing TSS group-level FPKM tracking
 Writing gene-level FPKM tracking
 Writing CDS-level FPKM tracking
 Writing isoform-level count tracking
 Writing TSS group-level count tracking
 Writing gene-level count tracking
 Writing CDS-level count tracking
 Writing isoform-level read group tracking
 Writing TSS group-level read group tracking
 Writing gene-level read group tracking
 Writing CDS-level read group tracking
 Writing read group info
 Writing run info

 $ cd results1
 $ ls
 bias_params.info         cds_exp.diff          genes.read_group_tracking     promoters.diff      tss_groups.count_tracking
 cds.count_tracking       cuffData.db           isoform_exp.diff              read_groups.info    tss_groups.fpkm_tracking
 cds.diff                 gene_exp.diff         isoforms.count_tracking       run.info            tss_groups.read_group_tracking
 cds.fpkm_tracking        genes.count_tracking  isoforms.fpkm_tracking        splicing.diff       var_model.info
 cds.read_group_tracking  genes.fpkm_tracking   isoforms.read_group_tracking  tss_group_exp.diff

この例では(CDSが書いていないので?)gene_exp.diffとisoforms_exp.diffが結果になっている。

gene_exp.diffを見ると
#ref(SRR453566-cuffdiff-gene-exp-diff.png);

isoforms_exp.diffを見ると
#ref(SRR453566-cuffdiff-isoforms-exp-diff.png);

元論文~
[[Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks (Nat Protoc. ; 7(3): 562&#8211;578):https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334321/pdf/nihms-366741.pdf]]~
によると、この結果は[[cummeRbund:https://www.bioconductor.org/packages/release/bioc/html/cummeRbund.html]]によって可視化できる。~
手順はたとえば
--[[CummeRbundで遺伝子発現のプロットを遺伝子ごとに並べる - Qiita:https://qiita.com/yuifu/items/cc9730f7e3a1277f6c50]]
--[[cummeRbund-example-workflow.pdf:https://www.bioconductor.org/packages/release/bioc/vignettes/cummeRbund/inst/doc/cummeRbund-example-workflow.pdf]] 後半が未完成で途中までしかない!!
--[[cummeRbund-manual.pdf:https://www.bioconductor.org/packages/release/bioc/vignettes/cummeRbund/inst/doc/cummeRbund-manual.pdf]] マニュアル風だが手順も多少意識している?

**cummeRbundを使う [#c8ee86ea]
-[[Bioconductor - cummeRbund::https://www.bioconductor.org/packages/release/bioc/html/cummeRbund.html]]
-[[CummeRbund: Visualization and Exploration of Cufflinks High-throughput Sequencing Data:https://www.bioconductor.org/packages/release/bioc/vignettes/cummeRbund/inst/doc/cummeRbund-manual.pdf]] ここに出ているプロセスに従って以下実験する。
-[[cummeRbund Example Workflow:https://www.bioconductor.org/packages/release/bioc/vignettes/cummeRbund/inst/doc/cummeRbund-example-workflow.pdf]] これは書きかけの状態のようで、後半が空白。
-[[リファレンスマニュアル:https://www.bioconductor.org/packages/release/bioc/manuals/cummeRbund/man/cummeRbund.pdf]]


cummeRbundの入力はcuffDiffの結果で、cummeRbandでの解析のためにcuffDiffでラベル付けをしてあること。そのラベルを元に、比較を取ったりする。

 データルート
    SRR453566.tagged.sorted.bam
    SRR453567.tagged.sorted.bam
    ...
    SRR453571.tagged.sorted.bam
    results1
        bias_params.info
        cds_exp.diff
        genes.read_group_tracking
        promoters.diff
        tss_groups.count_tracking
        cds.count_tracking
        isoform_exp.diff
        read_groups.info
        tss_groups.fpkm_tracking
        cds.diff
        gene_exp.diff
        isoforms.count_tracking
        run.info
        tss_groups.read_group_tracking
        cds.fpkm_tracking
        genes.count_tracking
        isoforms.fpkm_tracking
        splicing.diff
        var_model.info
        cds.read_group_tracking
        genes.fpkm_tracking
        isoforms.read_group_tracking
        tss_group_exp.diff
  ↑  このレベルをRのワーキングディレクトリにしておく。

Rを起動。
 > library(cummeRbund)    プログラムcummRbundをロード
 > cuff<-readCufflinks("results1")  ディレクトリ results1 からデータ読込んでデータベースに書き込む
 (注) > cuff <- readCufflinks('results1', gtfFile='../s288c_e.gff')はgenome=が無いとダメ。genome=に書ける値が不明
   A character string indicating to which genome build the .gtf annotations belong
   (e.g. ’hg19’ or ’mm9’)
 > cuff
 CuffSet instance with:
 	 2 samples
 	 6445 genes
 	 6445 isoforms
 	 0 TSS
 	 0 CDS
 	 0 promoters
 	 0 splicing
 	 0 relCDS
 > disp<-dispersionPlot(genes(cuff))
 > disp
 > dens<-csDensity(genes(cuff))
 > dens
 > b<-csBoxplot(genes(cuff))
 > b
 > sm<-csScatterMatrix(genes(cuff))
 > sm
 > s<-csScatter(genes(cuff),"chemo","batch",smooth=T)
  chemoとbatchでの散布図
 > s
 > dend<-csDendro(genes(cuff)) 
 > dend
 > m<-MAplot(genes(cuff),"chemo","batch")
 > m
 > v<-csVolcanoMatrix(genes(cuff))
 > v
 > mySigGeneIds<-getSig(cuff,alpha=0.05,level='genes')
 > lengthh(mySigGeneIds)
 > sigGenes<-getGenes(cuff,sigGeneIds)
 sigGenes
 CuffGeneSet instance for  3900  genes
  
 Slots:
 	 annotation
 	 fpkm
 	 repFpkm
 	 diff
 	 count
 	 isoforms	 CuffFeatureSet instance of size 3900 
 	 TSS		 CuffFeatureSet instance of size 0 
 	 CDS		 CuffFeatureSet instance of size 0 
 	 promoters		 CuffFeatureSet instance of size 3900 
 	 splicing		 CuffFeatureSet instance of size 0 
 	 relCDS		 CuffFeatureSet instance of size 3900 
 > head(annotation(genes(cuff)))
     gene_id class_code nearest_ref_id gene_short_name
 1 gene_0001       <NA>           <NA>            PAU8
 2 gene_0002       <NA>           <NA>       YAL067W-A
 3 gene_0003       <NA>           <NA>            SEO1
 4 gene_0004       <NA>           <NA>         YAL065C
 5 gene_0005       <NA>           <NA>       YAL064W-B
 6 gene_0006       <NA>           <NA>            TDA8
                     locus length coverage feature_id isoform_id seqnames
 1   NC_001133.9:1806-2169     NA       NA         NA       <NA>     <NA>
 2   NC_001133.9:2479-2707     NA       NA         NA       <NA>     <NA>
 3   NC_001133.9:7234-9016     NA       NA         NA       <NA>     <NA>
 4 NC_001133.9:11564-11951     NA       NA         NA       <NA>     <NA>
 5 NC_001133.9:12045-12426     NA       NA         NA       <NA>     <NA>
 6 NC_001133.9:13362-13743     NA       NA         NA       <NA>     <NA>
   source type start end score strand frame
 1   <NA>   NA    NA  NA    NA   <NA>  <NA>
 2   <NA>   NA    NA  NA    NA   <NA>  <NA>
 3   <NA>   NA    NA  NA    NA   <NA>  <NA>
 4   <NA>   NA    NA  NA    NA   <NA>  <NA>
 5   <NA>   NA    NA  NA    NA   <NA>  <NA>
 6   <NA>   NA    NA  NA    NA   <NA>  <NA>
 > head(fpkm(genes(cuff)))
 > head(count(genes(cuff)))
 > head(fpkm(isoforms(cuff)))
 > head(diffData(genes(cuff)))
 > head(fpkmMatrix(genes(cuff)))

興味あるgene(集合)を指定して分析する
 > myGeneIds <- c("gene_0010", "gene_0010", "gene_0011", "gene_0012", "gene_0013", "gene_0014", "gene_0015", "gene_0016", "gene_0017", "gene_0018", "gene_0019")
   興味のあるgeneidをリストしたもの。あらかじめ作ってあるとする。ここは仮の例。
 > myGenes <- getGenes(cuff,myGeneIds)
 > myGenes
 CuffGeneSet instance for  10  genes
  
 Slots:
 	 annotation
 	 fpkm
 	 repFpkm
 	 diff
 	 count
 	 isoforms	 CuffFeatureSet instance of size 10 
 	 TSS		 CuffFeatureSet instance of size 0 
 	 CDS		 CuffFeatureSet instance of size 0 
 	 promoters		 CuffFeatureSet instance of size 10 
 	 splicing		 CuffFeatureSet instance of size 0 
 	 relCDS		 CuffFeatureSet instance of size 10 
 > myGeneset.pluri<-getGenes(cuff,myGeneIds,sampleIdList=c("batch"))
 > h<-csHeatmap(myGenes,cluster='both')
 > h
 > b<-expressionBarplot(myGenes)
 > b
 > s<-csScatter(myGenes,"chemo","batch",smooth=T)
 > s
 > v<-csVolcano(myGenes,"chemo","batch")
 > v
 > ih<-csHeatmap(isoforms(myGenes),cluster='both',labRow=F)
 > ih
 > th<-csHeatmap(TSS(myGenes),cluster='both',labRow=F)
 > th
 > den<-csDendro(myGenes)
 > den

1つだけにする
 > myGeneId <- c("gene_0010")        1つだけにする
 > myGene<-getGene(cuff,myGeneId)
 > head(fpkm(myGene))       同じように使える
 > gl<-expressionPlot(myGene)
 > gl
 > gl.iso.rep<-expressionPlot(isoforms(myGene),replicates=T)
 > gb<-expressionBarplot(myGene)
 > gb
 > igb<-expressionBarplot(isoforms(myGene),replicates=T)
 > gp<-csPie(myGene,level="isoforms")
 > gp


 > mySigMat<-sigMatrix(cuff,level='genes',alpha=0.05)
 > mySigMat
 > mySigGeneIds<-getSig(cuff,alpha=0.05,level='genes')
 > head(mySigGeneIds)
 > chemo_vs_batch.sigIsoformIds<-getSig(cuff,x='chemo',y='batch',alpha=0.05,level='isoforms')


 > myDistHeat<-csDistHeat(genes(cuff))
 > myDistHeat
 > myRepDistHeat<-csDistHeat(genes(cuff),replicates=T)
 > myRepDistHeat
 > genes.PCA<-PCAplot(genes(cuff),"PC1","PC2")
 > genes.PCA
 > genes.MDS<-MDSplot(genes(cuff))
  cmdscale(d, eig = TRUE, k = 2) でエラー: 
   'k' must be in {1, 2, ..  n - 1}    MDS(multi-dimensional scaling)は次元数3以上

クラスタ化
 > ic<-csCluster(myGenes,k=4)
 > ic
 > ic$cluster
 gene_0010 gene_0011 gene_0013 gene_0014 gene_0015 gene_0016 gene_0017 
         1         1         2         2         3         3         4 
 gene_0018 gene_0019 
         3         3 
 > icp<-csClusterPlot(ic)
 > icp

類似geneの検出
 > mySimilar<-findSimilar(cuff,myGeneId,n=20)
 > mySimilar.expression<-expressionPlot(mySimilar,logMode=T,showErrorbars=F)
 > mySimilar.expression
 
 > myProfile<-c(500,0,400)
 > mySimilar2<-findSimilar(cuff,myProfile,n=10)
 > mySimilar2.expression<-expressionPlot(mySimilar2,logMode=T,showErrorbars=F)
 > mySimilar2.expression

 > head(annotation(genes(cuff)))
     gene_id class_code nearest_ref_id gene_short_name
 1 gene_0001       <NA>           <NA>            PAU8
 2 gene_0002       <NA>           <NA>       YAL067W-A
 3 gene_0003       <NA>           <NA>            SEO1
 4 gene_0004       <NA>           <NA>         YAL065C
 5 gene_0005       <NA>           <NA>       YAL064W-B
 6 gene_0006       <NA>           <NA>            TDA8
                     locus length coverage feature_id isoform_id seqnames
 1   NC_001133.9:1806-2169     NA       NA         NA       <NA>     <NA>
 2   NC_001133.9:2479-2707     NA       NA         NA       <NA>     <NA>
 3   NC_001133.9:7234-9016     NA       NA         NA       <NA>     <NA>
 4 NC_001133.9:11564-11951     NA       NA         NA       <NA>     <NA>
 5 NC_001133.9:12045-12426     NA       NA         NA       <NA>     <NA>
 6 NC_001133.9:13362-13743     NA       NA         NA       <NA>     <NA>
   source type start end score strand frame
 1   <NA>   NA    NA  NA    NA   <NA>  <NA>
 2   <NA>   NA    NA  NA    NA   <NA>  <NA>
 3   <NA>   NA    NA  NA    NA   <NA>  <NA>
 4   <NA>   NA    NA  NA    NA   <NA>  <NA>
 5   <NA>   NA    NA  NA    NA   <NA>  <NA>
 6   <NA>   NA    NA  NA    NA   <NA>  <NA>




〜〜〜〜〜〜〜〜〜
**もう一度勉強 [#y2c99830]
その前に
-NIGのパイプラインサービス [[NGSアプリケーション Pipline-Menu-wiki:http://cell-innovation.nig.ac.jp/maser/Applications/applications_top.html]]、 [[提供全パイプラインリスト All-Pipelines:http://cell-innovation.nig.ac.jp/maser_cgi/cip-pl_list_violin_ja.cgi]]、と [[RNA-seqのパイプライン:http://cell-innovation.nig.ac.jp/maser/Applications/RNA-seq.html]]

参照
-主 [[Tuxedo Genome Guided Transcriptome Assembly Workshop &#183; trinityrnaseq/RNASeq_Trinity_Tuxedo_Workshop Wiki:https://github.com/trinityrnaseq/RNASeq_Trinity_Tuxedo_Workshop/wiki/Tuxedo-Genome-Guided-Transcriptome-Assembly-Workshop]]
-追加
--[[RNA seq 其の2 マッピングから統計処理まで - macでインフォマティクス:http://kazumaxneo.hatenablog.com/entry/2017/03/27/141521]]~
--[[RNA seq 其の3 ヒートマップ作成 - macでインフォマティクス:http://kazumaxneo.hatenablog.com/entry/2017/03/29/171543]]
-追加 [[RNA-seq Analysis With R/Bioconductor - The Cat Way:http://catway.jp/lecture/tohoku-u/NGS-R-Bioconductor-2nd.html]]

-おまけ
--Filgen.jpのスライド[[NGSデータ解析入門Webセミナー:https://filgen.jp/Product/BioScience21-software/CLC/NGS_Web_Seminar_slide_20181126.pdf]]  スライドp17,18に解析の全体像あり
--中岡慎治 [[遺伝子発現解析入門:http://coop-math.ism.ac.jp/files/216/%E8%B3%87%E6%96%99_nakaoka.pdf]] スライド13以降がRNA-seq 少し古いか?
--アメリエフ R N A - s e q [[20150805and28_yamaguchi.pdf https://biosciencedbc.jp/gadget/human/20150805and28_yamaguchi.pdf]]

トップ   編集 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS