Pythonバイオ? Pythonバイオ/ツール?
29   2019-05-11 (土) 10:33:39

(遺伝研セミナー 2-1, 2-2)発現解析

やること:

2-1の前提 АfeatureCountでSRRのカウントをし、出力all.counts.txtを得ている。コマンド

featureCounts -p -T 32 -t exon -g gene_id -a s288c_e.gff -o all.counts.txt *.sorted.bam

で処理。出力は

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/ 
	  v1.6.3

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 6 BAM files                                      ||
||                           P SRR453566.sorted.bam                           ||
||                           P SRR453567.sorted.bam                           ||
||                           P SRR453568.sorted.bam                           ||
||                           P SRR453569.sorted.bam                           ||
||                           P SRR453570.sorted.bam                           ||
||                           P SRR453571.sorted.bam                           ||
||                                                                            ||
||             Output file : all.counts.txt                                   ||
||                 Summary : all.counts.txt.summary                           ||
||              Annotation : s288c_e.gff (GTF)                                ||
||      Dir for temp files : ./                                               ||
||                                                                            ||
||                 Threads : 32                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================// 

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file s288c_e.gff ...                                       ||
||    Features : 6761                                                         ||
||    Meta-features : 6420                                                    ||
||    Chromosomes/contigs : 17                                                ||
||                                                                            ||
|| Process BAM file SRR453566.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 5650830                                              ||
||    Successfully assigned alignments : 4568783 (80.9%)                      ||
||    Running time : 0.07 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453567.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 7714981                                              ||
||    Successfully assigned alignments : 6257956 (81.1%)                      ||
||    Running time : 0.12 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453568.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 5547385                                              ||
||    Successfully assigned alignments : 4527122 (81.6%)                      ||
||    Running time : 0.10 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453569.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 3990749                                              ||
||    Successfully assigned alignments : 3077110 (77.1%)                      ||
||    Running time : 0.07 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453570.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 6691016                                              ||
||    Successfully assigned alignments : 3873751 (57.9%)                      ||
||    Running time : 0.12 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453571.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 6248063                                              ||
||    Successfully assigned alignments : 4907305 (78.5%)                      ||
||    Running time : 0.11 minutes                                             ||
||                                                                            ||
||                                                                            ||
|| Summary of counting results can be found in file "all.counts.txt.summary"  ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

2-1の前提◆.侫.ぅ gene_id_product.tsv (gene_idとproduct の一覧表) を使いたい。
これはステップ1-3で作っている。

課題 1
../1-1/reference/s288c_e.gff を処理して,gene_id と product の一覧表を作成する.
product は URL エンコードされている.
import urllib.parse
product="glutamate dehydrogenase %28NADP%28%2B%29%29 GDH3"
product=urllib.parse.unquote(product)
print(product) ## glutamate dehydrogenase (NADP(+)) GDH3
gene_0001 seripauperin PAU8
gene_0002 hypothetical protein
gene_0003 putative permease SEO1
gene_0004 hypothetical protein

ということで、作る。

import pandas as pd
import numpy as np
import urllib.parse

outputlist = []
fout = open('gene_id_product.tsv', 'w')
with open('s288c_e.gff', 'r') as fin:
    for line in fin:
        product = ''
        geneid = ''
        line=line.rstrip()
        if not line.startswith('#'):
            s=line.split('\t')
            items=s[8].split(';')
            for item in items:
                if item.startswith('product='):
                    product = item.split('=')[1]
                    product = urllib.parse.unquote(product)
                if item.startswith('gene_id='):
                    geneid = item.split('=')[1]
                if (geneid != '') and (product != ''):
                    outputlist.append(geneid+'\t'+product)
outputlist = sorted(list(set(outputlist)))
fout = open('gene_id_product.tsv', 'w')
for u in outputlist:
    print(u, file=fout)
fout.close()

出力は

gene_0001	seripauperin PAU8
gene_0002	hypothetical protein
gene_0003	putative permease SEO1
gene_0004	hypothetical protein
gene_0005	hypothetical protein
(以下略)

次に、featureCountの出力であるall.counts.txtと、gene_id_product.tsvとをマージすることで、all.counts.txtにproductの入った表を作り、count_preprocessed.tsvに書き出す。

import pandas as pd
import numpy as np
df = pd.read_table('all.counts.txt', index_col=0, skiprows=1) 
rename_columns = {'SRR453566.sorted.bam': 'batch_1',
    'SRR453567.sorted.bam': 'batch_2',
    'SRR453568.sorted.bam': 'batch_3',
    'SRR453569.sorted.bam': 'chemostat_1',
    'SRR453570.sorted.bam': 'chemostat_2',
    'SRR453571.sorted.bam': 'chemostat_3'}
df = df.rename_axis(mapper=rename_columns, axis=1)
#print(df.head())
df = df[df.Chr != 'NC_001224.1']  # ミトコンドリア(NC_001224.1)上の遺伝子を除く
gene_products = pd.read_table('gene_id_product.tsv', index_col=0, names=["gene_id", "product"])
#print(gene_products.head())

# マージ
#df_with_product = gene_products.join(df)  # 値がfloatになる?
df_with_product = df.join(gene_products)
#print(df_with_product.head())
df_count = df_with_product[[\
    "batch_1","batch_2","batch_3","chemostat_1","chemostat_2","chemostat_3"]]
df_count.to_csv('count_raw.tsv', sep='\t')
df_with_product.to_csv('count_preprocessed.tsv', sep='\t')

出力は

Geneid	Chr	Start	End	Strand	Length	batch_1	batch_2	batch_3	chemostat_1	chemostat_2	chemostat_3	product
gene_0001	NC_001133.9	1807	2169	-	363	0	2	6	0	0	1	seripauperin PAU8
gene_0002	NC_001133.9	2480	2707	+	228	0	0	0	0	0	0	hypothetical protein
gene_0003	NC_001133.9	7235	9016	-	1782	0	0	0	0	0	0	putative permease SEO1
gene_0004	NC_001133.9	11565	11951	-	387	0	0	0	0	0	0	hypothetical protein
gene_0005	NC_001133.9	12046	12426	+	381	2	8	10	6	7	18	hypothetical protein
(以下略)

ここから正規化処理。 RPM/FPM、RPKM/FPKM、TPMを計算する。

FPKMを使って、batch cultureとchemostatの間で発現変動の大きい遺伝子を抽出

%matplotlib inline
import pandas as pd
import numpy as np

def normalize_per_million_reads(df):
    # RPM/FPM = reads per million  fragments per million
    sum_count = df.sum()
    return 10**6 * df / sum_count

def normalize_per_kilobase(df, gene_length):
    # FPKM = fragments per kilobase of exon per million reads mapped
    df_tmp = df.copy()
    df_tmp = (df.T * 10**3 / gene_length).T 
    return df_tmp

def normalize_tpm(df, gene_length):
    # TPM = transcripts per kilobase million   https://bi.biopapyrus.jp/ 
    df_tmp = df.copy()
    df_tmp = normalize_per_kilobase(df_tmp, gene_length)
    df_tmp = normalize_per_million_reads(df_tmp)
    return df_tmp

df_count = pd.read_table('count_raw.tsv', index_col=0)
df_with_product = pd.read_table('count_preprocessed.tsv', index_col=0)
gene_products = pd.read_table('gene_id_product.tsv', index_col=0, names=["gene_id", "product"])

# RPM/FPM = reads per million  fragments per million
df_count_fpm = normalize_per_million_reads(df_count)
#print(df_count_fpm.head())
df_count_fpm.to_csv("count_fpm.tsv", sep="\t")

# RPKM/FPKM
# FPKM = fragments per kilobase of exon per million reads mapped
# single-endの場合、RPKM = reads per kilobase of exon per million reads mapped
gene_length = df_with_product['Length']
df_count_fpkm = normalize_per_kilobase(df_count_fpm, gene_length)
#print(df_count_fpkm.head())
df_count_fpkm.to_csv('count_fpkm.tsv', sep='\t')

# TPM normalization
# TPM = transcripts per kilobase million   https://bi.biopapyrus.jp/ 
df_count_tpm = normalize_tpm(df_count, gene_length)
#print(df_count_tpm.head())
#print(df_count_tpm.sum())
df_count_tpm.to_csv("count_tpm.tsv", sep="\t")

# 発現変動の大きい遺伝子の抽出
# batch cultureの平均を求める
df_count_fpkm["batch"] = (df_count_fpkm["batch_1"] + df_count_fpkm["batch_2"] + 
df_count_fpkm["batch_3"]) / 3
# chemostatの平均を求める
df_count_fpkm["chemostat"] = (df_count_fpkm["chemostat_1"] + 
df_count_fpkm["chemostat_2"] + df_count_fpkm["chemostat_3"]) / 3
# 発現変動をlog2 fold として求める
# 0 での除算を防ぐため、分&#11935;に微&#12073;な値を加えている
df_count_fpkm["log2fold"] = \
   df_count_fpkm["chemostat"] / (df_count_fpkm["batch"] + 10**-6)
df_count_fpkm["log2fold"] = df_count_fpkm["log2fold"].apply(np.log2)
#print(df_count_fpkm.head())
# 必要部分のみ抜き出し、productと結合
diff_ex = df_count_fpkm[["batch", "chemostat", "log2fold"]]
diff_ex = diff_ex.join(gene_products)
# カウント数が0のデータを除く
diff_ex = diff_ex[diff_ex["batch"] > 0]
diff_ex = diff_ex[diff_ex["chemostat"] > 0]
#print(diff_ex)
diff_ex = diff_ex.sort_values("log2fold", ascending=False)
#print(diff_ex.head())
#print(diff_ex.tail())
diff_ex.to_csv("diff_ex.tsv", sep="\t")

結果は、

Geneid	batch	chemostat	log2fold	product
gene_2989	0.5113545477349243	1609.2291342813207	11.619755389936394	Rgi2p
gene_4740	3.714171096311252	5568.713898594607	10.550087794327904	Sip18p
gene_4667	5.970597813523303	5108.597498100943	9.740835924216602	Spg4p
gene_5965	8.609101140765873	5809.572900018754	9.398353606047285	Gre1p
gene_4237	1.158010357867918	775.5514280425233	9.387429238134654	hypothetical protein
(以下略)

おまけ。batch_1、batcn_2、batch_3、chemstat_1, chemstat_2, chemstat_3間の相関距離を計算して、階層クラスタ処理を使ってトリ―表現する。

# TMP正規化したデータをクラスタリング
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
tpm_t = df_count_tpm.T
print('---')
print(df_count_tpm.head())
from scipy.spatial.distance import pdist
print(tpm_t.shape, pdist(tpm_t, 'correlation'))
linkage_result = linkage(tpm_t, method='average', metric='correlation')
print(linkage_result)
plt.figure(num=None, figsize=(16, 9), dpi=200, facecolor='w', edgecolor='k')
dendrogram(linkage_result, labels=df_count_tpm.columns)
plt.show()

結果は

SRR453566-71-gene-clustering.png

添付ファイル: fileSRR453566-71-gene-clustering.png 10件 [詳細]

トップ   編集 凍結 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS
Last-modified: 2019-05-11 (土) 10:33:39 (14d)