Pythonバイオ? Pythonバイオ/ツール?
26   2019-05-11 (土) 15:51:27

分析ツール


Cuffdiff (Cufflinks)

cufflinksでエラーが出る。HISAT2-StringTie-Cuffdiff error: BAM record error: found spliced alignment without XS attribute(正しくは?)

対策は、HISAT2でパラメータ--dta-cufflinksを指定すること

cufflinks/cuffdiff

cuffdiffを使ってみる

あらかじめ用意したsort済みのbamファイルSRR4535xx.tagged.sorted.bam とマップs288c_3.gffに対して

$cuffdiff -p 32 -o ./results1 ../s288c_e.gff -L batch,chemo \
SRR453566.tagged.sorted.bam,SRR453567.tagged.sorted.bam,SRR453568.tagged.sorted.bam \
SRR453569.tagged.sorted.bam,SRR453570.tagged.sorted.bam,SRR453571.tagged.sorted.bam

ラベル付け-Lの都合で、ラベル2種類batch,chemoに対して、入力ファイルSRR453566,67,68をbatchに、SRR453569,70,71をchemoに割り当てた(はず)。コンマの前後は空白なし(あるとコマンド解釈で引数の切れ目と思ってしまうので) この-Lと入力ファイルの区切りによって、後の解析パターン(ソースの分類)が決まるので要注意。

Warning: Could not connect to update server to verify current version. Please check
  at the Cufflinks website (http://cufflinks.cbcb.umd.edu).
[10:38:24] Loading reference annotation.
[10:38:25] Inspecting maps and determining fragment length distributions.
[10:40:50] Modeling fragment count overdispersion.
[10:42:08] Modeling fragment count overdispersion.
> Map Properties:
>	Normalized Map Mass: 3618635.35
>	Raw Map Mass: 3746966.57
>	Fragment Length Distribution: Empirical (learned)
>	              Estimated Mean: 209.34
>	           Estimated Std Dev: 99.53
> Map Properties:
>	Normalized Map Mass: 3618635.35
>	Raw Map Mass: 5136465.50
>	Fragment Length Distribution: Empirical (learned)
>	              Estimated Mean: 213.53
>	           Estimated Std Dev: 96.07
> Map Properties:
>	Normalized Map Mass: 3618635.35
>	Raw Map Mass: 3780062.80
>	Fragment Length Distribution: Empirical (learned)
>	              Estimated Mean: 206.94
>	           Estimated Std Dev: 94.02
> Map Properties:
>	Normalized Map Mass: 3618635.35
>	Raw Map Mass: 2502112.93
>	Fragment Length Distribution: Empirical (learned)
>	              Estimated Mean: 202.52
>	           Estimated Std Dev: 96.05
> Map Properties:
>	Normalized Map Mass: 3618635.35
>	Raw Map Mass: 3127907.67
>	Fragment Length Distribution: Empirical (learned)
>	              Estimated Mean: 211.12
>	           Estimated Std Dev: 94.18
> Map Properties:
>	Normalized Map Mass: 3618635.35
>	Raw Map Mass: 3898613.50
>	Fragment Length Distribution: Empirical (learned)
>	              Estimated Mean: 213.42
>	           Estimated Std Dev: 98.72
[10:43:25] Calculating preliminary abundance estimates
[10:43:25] Testing for differential expression and regulation in locus.
> Processed 6278 loci.                         [*************************] 100%
Performed 5732 isoform-level transcription difference tests
Performed 0 tss-level transcription difference tests
Performed 5732 gene-level transcription difference tests
Performed 0 CDS-level transcription difference tests
Performed 0 splicing tests
Performed 0 promoter preference tests
Performing 0 relative CDS output tests
Writing isoform-level FPKM tracking
Writing TSS group-level FPKM tracking
Writing gene-level FPKM tracking
Writing CDS-level FPKM tracking
Writing isoform-level count tracking
Writing TSS group-level count tracking
Writing gene-level count tracking
Writing CDS-level count tracking
Writing isoform-level read group tracking
Writing TSS group-level read group tracking
Writing gene-level read group tracking
Writing CDS-level read group tracking
Writing read group info
Writing run info
$ cd results1
$ ls
bias_params.info         cds_exp.diff          genes.read_group_tracking     promoters.diff      tss_groups.count_tracking
cds.count_tracking       cuffData.db           isoform_exp.diff              read_groups.info    tss_groups.fpkm_tracking
cds.diff                 gene_exp.diff         isoforms.count_tracking       run.info            tss_groups.read_group_tracking
cds.fpkm_tracking        genes.count_tracking  isoforms.fpkm_tracking        splicing.diff       var_model.info
cds.read_group_tracking  genes.fpkm_tracking   isoforms.read_group_tracking  tss_group_exp.diff

この例では(CDSが書いていないので?)gene_exp.diffとisoforms_exp.diffが結果になっている。

gene_exp.diffを見ると

SRR453566-cuffdiff-gene-exp-diff.png

isoforms_exp.diffを見ると

SRR453566-cuffdiff-isoforms-exp-diff.png

元論文
Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks
によると、この結果はcummeRbundによって可視化できる。
手順はたとえば

cummeRbundを使う

cummeRbundの入力はcuffDiffの結果で、cummeRbandでの解析のためにcuffDiffでラベル付けをしてあること。そのラベルを元に、比較を取ったりする。

データルート
   SRR453566.tagged.sorted.bam
   SRR453567.tagged.sorted.bam
   ...
   SRR453571.tagged.sorted.bam
   results1
       bias_params.info
       cds_exp.diff
       genes.read_group_tracking
       promoters.diff
       tss_groups.count_tracking
       cds.count_tracking
       isoform_exp.diff
       read_groups.info
       tss_groups.fpkm_tracking
       cds.diff
       gene_exp.diff
       isoforms.count_tracking
       run.info
       tss_groups.read_group_tracking
       cds.fpkm_tracking
       genes.count_tracking
       isoforms.fpkm_tracking
       splicing.diff
       var_model.info
       cds.read_group_tracking
       genes.fpkm_tracking
       isoforms.read_group_tracking
       tss_group_exp.diff
 ↑  このレベルをRのワーキングディレクトリにしておく。

Rを起動。

> library(cummeRbund)    プログラムcummRbundをロード
> cuff<-readCufflinks("results1")  ディレクトリ results1 からデータ読込んでデータベースに書き込む
(注) > cuff <- readCufflinks('results1', gtfFile='../s288c_e.gff')はgenome=が無いとダメ。genome=に書ける値が不明
  A character string indicating to which genome build the .gtf annotations belong
  (e.g. ’hg19’ or ’mm9’)
> cuff
CuffSet instance with:
	 2 samples
	 6445 genes
	 6445 isoforms
	 0 TSS
	 0 CDS
	 0 promoters
	 0 splicing
	 0 relCDS
> disp<-dispersionPlot(genes(cuff))
> disp
> dens<-csDensity(genes(cuff))
> dens
> b<-csBoxplot(genes(cuff))
> b
> sm<-csScatterMatrix(genes(cuff))
> sm
> s<-csScatter(genes(cuff),"chemo","batch",smooth=T)
 chemoとbatchでの散布図
> s
> dend<-csDendro(genes(cuff)) 
> dend
> m<-MAplot(genes(cuff),"chemo","batch")
> m
> v<-csVolcanoMatrix(genes(cuff))
> v
> mySigGeneIds<-getSig(cuff,alpha=0.05,level='genes')
> lengthh(mySigGeneIds)
> sigGenes<-getGenes(cuff,sigGeneIds)
sigGenes
CuffGeneSet instance for  3900  genes
 
Slots:
	 annotation
	 fpkm
	 repFpkm
	 diff
	 count
	 isoforms	 CuffFeatureSet instance of size 3900 
	 TSS		 CuffFeatureSet instance of size 0 
	 CDS		 CuffFeatureSet instance of size 0 
	 promoters		 CuffFeatureSet instance of size 3900 
	 splicing		 CuffFeatureSet instance of size 0 
	 relCDS		 CuffFeatureSet instance of size 3900 
> head(annotation(genes(cuff)))
    gene_id class_code nearest_ref_id gene_short_name
1 gene_0001       <NA>           <NA>            PAU8
2 gene_0002       <NA>           <NA>       YAL067W-A
3 gene_0003       <NA>           <NA>            SEO1
4 gene_0004       <NA>           <NA>         YAL065C
5 gene_0005       <NA>           <NA>       YAL064W-B
6 gene_0006       <NA>           <NA>            TDA8
                    locus length coverage feature_id isoform_id seqnames
1   NC_001133.9:1806-2169     NA       NA         NA       <NA>     <NA>
2   NC_001133.9:2479-2707     NA       NA         NA       <NA>     <NA>
3   NC_001133.9:7234-9016     NA       NA         NA       <NA>     <NA>
4 NC_001133.9:11564-11951     NA       NA         NA       <NA>     <NA>
5 NC_001133.9:12045-12426     NA       NA         NA       <NA>     <NA>
6 NC_001133.9:13362-13743     NA       NA         NA       <NA>     <NA>
  source type start end score strand frame
1   <NA>   NA    NA  NA    NA   <NA>  <NA>
2   <NA>   NA    NA  NA    NA   <NA>  <NA>
3   <NA>   NA    NA  NA    NA   <NA>  <NA>
4   <NA>   NA    NA  NA    NA   <NA>  <NA>
5   <NA>   NA    NA  NA    NA   <NA>  <NA>
6   <NA>   NA    NA  NA    NA   <NA>  <NA>
> head(fpkm(genes(cuff)))
> head(count(genes(cuff)))
> head(fpkm(isoforms(cuff)))
> head(diffData(genes(cuff)))
> head(fpkmMatrix(genes(cuff)))

興味あるgene(集合)を指定して分析する

> myGeneIds <- c("gene_0010", "gene_0010", "gene_0011", "gene_0012", "gene_0013", "gene_0014", "gene_0015", "gene_0016", "gene_0017", "gene_0018", "gene_0019")
  興味のあるgeneidをリストしたもの。あらかじめ作ってあるとする。ここは仮の例。
> myGenes <- getGenes(cuff,myGeneIds)
> myGenes
CuffGeneSet instance for  10  genes
 
Slots:
	 annotation
	 fpkm
	 repFpkm
	 diff
	 count
	 isoforms	 CuffFeatureSet instance of size 10 
	 TSS		 CuffFeatureSet instance of size 0 
	 CDS		 CuffFeatureSet instance of size 0 
	 promoters		 CuffFeatureSet instance of size 10 
	 splicing		 CuffFeatureSet instance of size 0 
	 relCDS		 CuffFeatureSet instance of size 10 
> myGeneset.pluri<-getGenes(cuff,myGeneIds,sampleIdList=c("batch"))
> h<-csHeatmap(myGenes,cluster='both')
> h
> b<-expressionBarplot(myGenes)
> b
> s<-csScatter(myGenes,"chemo","batch",smooth=T)
> s
> v<-csVolcano(myGenes,"chemo","batch")
> v
> ih<-csHeatmap(isoforms(myGenes),cluster='both',labRow=F)
> ih
> th<-csHeatmap(TSS(myGenes),cluster='both',labRow=F)
> th
> den<-csDendro(myGenes)
> den

1つだけにする

> myGeneId <- c("gene_0010")        1つだけにする
> myGene<-getGene(cuff,myGeneId)
> head(fpkm(myGene))       同じように使える
> gl<-expressionPlot(myGene)
> gl
> gl.iso.rep<-expressionPlot(isoforms(myGene),replicates=T)
> gb<-expressionBarplot(myGene)
> gb
> igb<-expressionBarplot(isoforms(myGene),replicates=T)
> gp<-csPie(myGene,level="isoforms")
> gp
> mySigMat<-sigMatrix(cuff,level='genes',alpha=0.05)
> mySigMat
> mySigGeneIds<-getSig(cuff,alpha=0.05,level='genes')
> head(mySigGeneIds)
> chemo_vs_batch.sigIsoformIds<-getSig(cuff,x='chemo',y='batch',alpha=0.05,level='isoforms')
> myDistHeat<-csDistHeat(genes(cuff))
> myDistHeat
> myRepDistHeat<-csDistHeat(genes(cuff),replicates=T)
> myRepDistHeat
> genes.PCA<-PCAplot(genes(cuff),"PC1","PC2")
> genes.PCA
> genes.MDS<-MDSplot(genes(cuff))
 cmdscale(d, eig = TRUE, k = 2) でエラー: 
  'k' must be in {1, 2, ..  n - 1}    MDS(multi-dimensional scaling)は次元数3以上

クラスタ化

> ic<-csCluster(myGenes,k=4)
> ic
> ic$cluster
gene_0010 gene_0011 gene_0013 gene_0014 gene_0015 gene_0016 gene_0017 
        1         1         2         2         3         3         4 
gene_0018 gene_0019 
        3         3 
> icp<-csClusterPlot(ic)
> icp

類似geneの検出

> mySimilar<-findSimilar(cuff,myGeneId,n=20)
> mySimilar.expression<-expressionPlot(mySimilar,logMode=T,showErrorbars=F)
> mySimilar.expression

> myProfile<-c(500,0,400)
> mySimilar2<-findSimilar(cuff,myProfile,n=10)
> mySimilar2.expression<-expressionPlot(mySimilar2,logMode=T,showErrorbars=F)
> mySimilar2.expression
> head(annotation(genes(cuff)))
    gene_id class_code nearest_ref_id gene_short_name
1 gene_0001       <NA>           <NA>            PAU8
2 gene_0002       <NA>           <NA>       YAL067W-A
3 gene_0003       <NA>           <NA>            SEO1
4 gene_0004       <NA>           <NA>         YAL065C
5 gene_0005       <NA>           <NA>       YAL064W-B
6 gene_0006       <NA>           <NA>            TDA8
                    locus length coverage feature_id isoform_id seqnames
1   NC_001133.9:1806-2169     NA       NA         NA       <NA>     <NA>
2   NC_001133.9:2479-2707     NA       NA         NA       <NA>     <NA>
3   NC_001133.9:7234-9016     NA       NA         NA       <NA>     <NA>
4 NC_001133.9:11564-11951     NA       NA         NA       <NA>     <NA>
5 NC_001133.9:12045-12426     NA       NA         NA       <NA>     <NA>
6 NC_001133.9:13362-13743     NA       NA         NA       <NA>     <NA>
  source type start end score strand frame
1   <NA>   NA    NA  NA    NA   <NA>  <NA>
2   <NA>   NA    NA  NA    NA   <NA>  <NA>
3   <NA>   NA    NA  NA    NA   <NA>  <NA>
4   <NA>   NA    NA  NA    NA   <NA>  <NA>
5   <NA>   NA    NA  NA    NA   <NA>  <NA>
6   <NA>   NA    NA  NA    NA   <NA>  <NA>

〜〜〜〜〜〜〜〜〜

もう一度勉強

その前に

参照


添付ファイル: fileSRR453566-cuffdiff-isoforms-exp-diff.png 2件 [詳細] fileSRR453566-cuffdiff-gene-exp-diff.png 2件 [詳細]

トップ   編集 凍結 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS
Last-modified: 2019-05-11 (土) 15:51:27 (14d)