Pythonバイオ? Pythonバイオ/ツール?
28   2019-06-25 (火) 11:15:26

GO解析

入力データとして用意するもの: 

測定結果

測定データは、study genesとして対象となる遺伝子のリスト。diff_tpm_ex.tsvはTPM発現量の差のリストなのだが、元処理ではNCBI GeneIDが付いていない。勝手につけたgeneidのみ。

それで、あらかじめ表gene_id_product_plus_GeneID.tsvを作っておき、それを使って表に追記する。表gene_id_product_plus_GeneID.tsvの作成は、

import pandas as pd
import numpy as np
import urllib.parse

outputlist = []
fout = open('gene_id_product_plus_GeneID.tsv', 'w')
with open('../s288c_e.gff', 'r') as fin:
    for line in fin:
        product = ''
        geneid = ''
        line=line.rstrip()
        if not line.startswith('#'):
            s=line.split('\t')
            items=s[8].split(';')
            for item in items:
                if item.startswith('product='):
                    product = item.split('=')[1]
                    product = urllib.parse.unquote(product)
                if item.startswith('gene_id='):
                    geneid = item.split('=')[1]
                if item.startswith('gene='):
                    gene = item.split('=')[1]
                if 'GeneID:' in item:
                    pos = item.find('GeneID')
                    end = item[pos:].find(',')
                    NCBI_ID = item[pos+7:pos+end]
                if (geneid != '') and (product != '') and (gene != '') and  (NCBI_ID != ''):
                    outputlist.append(geneid+'\t'+product+'\t'+gene+'\t'+NCBI_ID)
outputlist = sorted(list(set(outputlist)))
print('Geneid\tproduct\tgene\tNCBI_ID', file=fout)
for u in outputlist:
    print(u, file=fout)
fout.close()

元へ戻って、Studyデータの取込みと変換。

# Studyデータを取りこむ
import pandas as pd
df_study = pd.read_csv('../diff_tpm_ex.tsv', sep='\t')
df_study = df_study[:400]
#print(df_study.head(), len(df_study))
# GeneID を付ける
# あらかじめ表gene_id_product_plus_GeneID.tsvを作っておく Saccha_gene_id-to-product-plus-GeneID-gene.ipynb
df_GeneID = pd.read_csv('gene_id_product_plus_GeneID.tsv', sep='\t', index_col=0)
print(df_GeneID.head())
df_study = df_study.merge(df_GeneID)
print(df_study.head()['NCBI_ID'])

このdf_GeneIDは、

                          product  gene  NCBI_ID
Geneid                                          
gene_0001       seripauperin PAU8  PAU8   851229
gene_0002    hypothetical protein  PAU8  1466426
gene_0003  putative permease SEO1  SEO1   851230
gene_0004    hypothetical protein  SEO1   851232
gene_0005    hypothetical protein  SEO1   851233

df_study.head()['NCBI_ID']は

0     854753
1     855213
2     855134
3    1466426
4     851232

association geneid2gos

NBCI Taxonomy IDを知る

GeneID2nt

1) NCBIでgene listを作る 

https://github.com/tanghaibao/goatools/tree/master/doc/md/README_NCBI_gene.md

browserで https://www.ncbi.nlm.nih.gov/gene/ ⇒ 検索キーワード

genetype protein coding[Properties] AND "559292"[Taxonomy ID] AND alive[property]

これは、genes_NCBI_****_ProteinCodingと称しているものに当たる。

画面に検索結果が得られたら、「Send to」でFile⇒Tabular選択。結果はgene_result.txtとしてダウンロードされる。先頭部分は

tax_id	Org_name	GeneID	CurrentID	Status	Symbol	Aliases	description	other_designations	map_location	chromosome	genomic_nucleotide_accession.version	start_position_on_the_genomic_accession	end_position_on_the_genomic_accession	orientation	exon_count	OMIM	
559292	Saccharomyces cerevisiae S288C	854976	0	live	RAD52	YML032C	recombinase RAD52	recombinase RAD52	     XIII	NC_001145.3	212515	213930	minus	1		
559292	Saccharomyces cerevisiae S288C	852457	0	live	CDC28	YBR160W, CDK1, HSL5, SRM5	cyclin-dependent serine/threonine-protein kinase CDC28	cyclin-dependent serine/threonine-protein kinase CDC28		II	NC_001134.8	560078	56097plus	1		
559292	Saccharomyces cerevisiae S288C	856831	0	live	RAD51	YER095W, MUT5	recombinase RAD51	recombinase RAD51    NC_001137.3	349980	351182	plus	1		
559292	Saccharomyces cerevisiae S288C	851415	0	live	RPO21	YDL140C, RPB1, RPB220, SUA8	DNA-directed RNA polymerase II core subunit RPO21	DNA-directed RNA polymerase II core subunit RPO21		IV	NC_001136.10	205360	210561	minus1
559292	Saccharomyces cerevisiae S288C	851752	0	live	SUP35	YDR172W, GST1, PNM2, SAL3, SUF12, SUP2, SUP36	translation termination factor GTPase eRF3	translation termination factor GTPase eRF3		IV	NC_001136.10	808324	810381	plus 1

2) 次に goatools のncbi_gene_results_to_python.pyで変換 goatoolsのスクリプト goatools/goatools/scripts/ncbi_gene_results_to_python.pyで Pythonプログラム(のデータ)に変換する

scripts/ncbi_gene_results_to_python.py \
   -i ~/src/RNAseq-Saccha/Saccha/GO/gene_result.txt \
   -o ~/src/RNAseq-Saccha/Saccha/GO/gene_result.py

       6002 lines READ:  /home/yamanouc/src/RNAseq-Saccha/Saccha/GO/gene_result.txt
       6002 geneids WROTE: /home/yamanouc/src/RNAseq-Saccha/Saccha/GO/gene_result.py

できたgene_result.pyは

"""Data downloaded from NCBI Gene converted into Python namedtuples."""

from collections import namedtuple

WRITTEN = "2019_06_25" # 6002 items

#pylint: disable=line-too-long,too-many-lines,invalid-name
ntncbi = namedtuple('ntncbi', 'tax_id Org_name GeneID CurrentID Status Symbol Aliases description other_designations map_location chromosome genomic_nucleotide_accession_version start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation exon_count OMIM no_hdr0')

GENEID2NT = { # 6,002 items
    850289 : ntncbi(tax_id=559292, Org_name='Saccharomyces cerevisiae S288C', GeneID=850289, CurrentID=0, Status='live', Symbol='GEX1', Aliases=['YCL073C'], description='glutathione exchanger', other_designations='glutathione exchanger', map_location='', chromosome='III', genomic_nucleotide_accession_version='NC_001135.5', start_position_on_the_genomic_accession=6479, end_position_on_the_genomic_accession=8326, orientation='minus', exon_count=1, OMIM='', no_hdr0=''),
    850290 : ntncbi(tax_id=559292, Org_name='Saccharomyces cerevisiae S288C', GeneID=850290, CurrentID=0, Status='live', Symbol='VBA3', Aliases=['YCL069W'], description='basic amino acid transporter', other_designations='basic amino acid transporter', map_location='', chromosome='III', genomic_nucleotide_accession_version='NC_001135.5', start_position_on_the_genomic_accession=9706, end_position_on_the_genomic_accession=11082, orientation='plus', exon_count=1, OMIM='', no_hdr0=''),

GOEnrichmentStudyのインスタンスを作る

from gene_result import GENEID2NT as GeneID2nt_saccha

from goatools.go_enrichment import GOEnrichmentStudy
goeaobj = GOEnrichmentStudy(
        GeneID2nt_saccha.keys(), # List of mouse protein-coding genes
        geneid2gos_saccha, # geneid/GO associations
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = ['fdr_bh']) # defult multipletest correction method

94%  5,627 of  6,002 population items found in association

GOEA実行

df_studyのGeneIDの列だけを取出してリスト化し、GOEAに与える。

# goea 実行
study_list = df_study['NCBI_ID'].values.tolist()
#print(study_list)
goea_results_all = goeaobj.run_study(study_list)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]
goeaobj.wr_xlsx("SacchaGo.xlsx", goea_results_sig)
goeaobj.wr_txt("SacchGo.txt", goea_results_sig)

画面出力は

study_in_pop {851972, 856068, 856071, 856072, 856074, 851980, 854031, (後略)}
 67%    705 of  1,050 study items found in association
  2%  1,050 of 49,269 study items found in population(6002)
Calculating 6,030 uncorrected p-values using fisher
   6,030 GO terms are associated with  5,627 of  6,002 population items
   1,220 GO terms are associated with    705 of  1,066 study items
     133 GO terms found significant (< 0.05=alpha) after multitest correction: statsmodels fdr_bh
    133 items WROTE: nbt3102.xlsx
    133 GOEA results for   518 study items. WROTE: nbt3102.txt

結果はp<0.05の部分だけを取出して、ファイル化。SacchaGo.xlsx、SacchaGo.txt

続いてグラフ化

from goatools.godag_plot import plot_gos, plot_results, plot_goid2goobj

plot_results("Saccha_Go_{NS}.png", goea_results_sig)

   57 usr 263 GOs  WROTE: Saccha_Go_BP.png
   43 usr 121 GOs  WROTE: Saccha_Go_CC.png
   33 usr  89 GOs  WROTE: Saccha_Go_MF.png

グラフは非常に大きい。

そこで、スタートのGOタームを、study count数の多いものに限定。 先ほど得られたSacchaGo.xlsxを開いて、study_countの欄でソートし、32程度より多いものを選択した。

# Plot subset starting from these significant GO terms
goid_subset = {
    'GO:0005737', #	CC	p	cytoplasm
    'GO:0005634', #	CC	p	nucleus
    'GO:0005739', #	CC	p	mitochondrion
    'GO:0005829', #	CC	p	cytosol
    'GO:0005783', #	CC	p	endoplasmic reticulum
    'GO:0046872', #	MF	p	metal ion binding
    'GO:0000166', #	MF	p	nucleotide binding
    'GO:0005524', #	MF	p	ATP binding
    'GO:0016740', #	MF	p	transferase activity
    'GO:0016787', #	MF	p	hydrolase activity
}
plot_gos("SacchaGo_Subset.png", 
    goid_subset, # Source GO ids
    obodag, 
    goea_results=goea_results_all) # Use pvals for coloring

この結果の図が

SacchaGo_Subset.png

添付ファイル: fileSacchaGo_Subset.png 4件 [詳細] fileREADME_NCBI_gene.pdf 7件 [詳細]

トップ   編集 凍結 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS
Last-modified: 2019-06-25 (火) 11:15:26 (28d)