Python¥Ð¥¤¥ª?¡¡Python¥Ð¥¤¥ª/¥Ä¡¼¥ë?
3646¡¡¡¡¡¡2019-09-01 (Æü) 13:26:10
¤ä¤ë¤³¤È¡§
2-1¤ÎÁ°Äó¡¡§¡¡featureCount¤ÇSRR¤Î¥«¥¦¥ó¥È¤ò¤·¡¢½ÐÎÏall.counts.txt¤òÆÀ¤Æ¤¤¤ë¡£¥³¥Þ¥ó¥É
featureCounts -p -T 32 -t exon -g gene_id -a s288c_e.gff -o all.counts.txt *.sorted.bam
¤Ç½èÍý¡£½ÐÎϤÏ
========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.6.3 //========================== featureCounts setting ===========================\\ || || || Input files : 6 BAM files || || P SRR453566.sorted.bam || || P SRR453567.sorted.bam || || P SRR453568.sorted.bam || || P SRR453569.sorted.bam || || P SRR453570.sorted.bam || || P SRR453571.sorted.bam || || || || Output file : all.counts.txt || || Summary : all.counts.txt.summary || || Annotation : s288c_e.gff (GTF) || || Dir for temp files : ./ || || || || Threads : 32 || || Level : meta-feature level || || Paired-end : yes || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || Min overlapping bases : 1 || || || || Chimeric reads : counted || || Both ends mapped : not required || || || \\===================== http://subread.sourceforge.net/ ======================// //================================= Running ==================================\\ || || || Load annotation file s288c_e.gff ... || || Features : 6761 || || Meta-features : 6420 || || Chromosomes/contigs : 17 || || || || Process BAM file SRR453566.sorted.bam... || || Paired-end reads are included. || || Assign alignments (paired-end) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Pairing up the read pairs. || || || || Total alignments : 5650830 || || Successfully assigned alignments : 4568783 (80.9%) || || Running time : 0.07 minutes || || || || Process BAM file SRR453567.sorted.bam... || || Paired-end reads are included. || || Assign alignments (paired-end) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Pairing up the read pairs. || || || || Total alignments : 7714981 || || Successfully assigned alignments : 6257956 (81.1%) || || Running time : 0.12 minutes || || || || Process BAM file SRR453568.sorted.bam... || || Paired-end reads are included. || || Assign alignments (paired-end) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Pairing up the read pairs. || || || || Total alignments : 5547385 || || Successfully assigned alignments : 4527122 (81.6%) || || Running time : 0.10 minutes || || || || Process BAM file SRR453569.sorted.bam... || || Paired-end reads are included. || || Assign alignments (paired-end) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Pairing up the read pairs. || || || || Total alignments : 3990749 || || Successfully assigned alignments : 3077110 (77.1%) || || Running time : 0.07 minutes || || || || Process BAM file SRR453570.sorted.bam... || || Paired-end reads are included. || || Assign alignments (paired-end) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Pairing up the read pairs. || || || || Total alignments : 6691016 || || Successfully assigned alignments : 3873751 (57.9%) || || Running time : 0.12 minutes || || || || Process BAM file SRR453571.sorted.bam... || || Paired-end reads are included. || || Assign alignments (paired-end) to features... || || || || WARNING: reads from the same pair were found not adjacent to each || || other in the input (due to read sorting by location or || || reporting of multi-mapping read pairs). || || || || Pairing up the read pairs. || || || || Total alignments : 6248063 || || Successfully assigned alignments : 4907305 (78.5%) || || Running time : 0.11 minutes || || || || || || Summary of counting results can be found in file "all.counts.txt.summary" || || || \\===================== http://subread.sourceforge.net/ ======================//
all.counts.txt¥Õ¥¡¥¤¥ë¤¬¤Ç¤¤ë¡£¤½¤ÎÆâÍÆ¤Ï
2-1¤ÎÁ°Ä󢡡¥Õ¥¡¥¤¥ë gene_id_product.tsv ¡Êgene_id¤Èproduct ¤Î°ìÍ÷ɽ¡Ë ¤ò»È¤¤¤¿¤¤¡£
¤³¤ì¤Ï¥¹¥Æ¥Ã¥×1-3¤Çºî¤Ã¤Æ¤¤¤ë¡£
²ÝÂê 1 ../1-1/reference/s288c_e.gff ¤ò½èÍý¤·¤Æ¡¤gene_id ¤È product ¤Î°ìÍ÷ɽ¤òºîÀ®¤¹¤ë¡¥ product ¤Ï URL ¥¨¥ó¥³¡¼¥É¤µ¤ì¤Æ¤¤¤ë¡¥ import urllib.parse product="glutamate dehydrogenase %28NADP%28%2B%29%29 GDH3" product=urllib.parse.unquote(product) print(product) ## glutamate dehydrogenase (NADP(+)) GDH3 gene_0001 seripauperin PAU8 gene_0002 hypothetical protein gene_0003 putative permease SEO1 gene_0004 hypothetical protein
¤È¤¤¤¦¤³¤È¤Ç¡¢ºî¤ë¡£
import pandas as pd import numpy as np import urllib.parse outputlist = [] fout = open('gene_id_product.tsv', 'w') with open('s288c_e.gff', 'r') as fin: for line in fin: product = '' geneid = '' line=line.rstrip() if not line.startswith('#'): s=line.split('\t') items=s[8].split(';') for item in items: if item.startswith('product='): product = item.split('=')[1] product = urllib.parse.unquote(product) if item.startswith('gene_id='): geneid = item.split('=')[1] if (geneid != '') and (product != ''): outputlist.append(geneid+'\t'+product) outputlist = sorted(list(set(outputlist))) fout = open('gene_id_product.tsv', 'w') for u in outputlist: print(u, file=fout) fout.close()
½ÐÎϤÏ
gene_0001 seripauperin PAU8 gene_0002 hypothetical protein gene_0003 putative permease SEO1 gene_0004 hypothetical protein gene_0005 hypothetical protein ¡Ê°Ê²¼Î¬¡Ë
¼¡¤Ë¡¢featureCount¤Î½ÐÎϤǤ¢¤ëall.counts.txt¤È¡¢gene_id_product.tsv¤È¤ò¥Þ¡¼¥¸¤¹¤ë¤³¤È¤Ç¡¢all.counts.txt¤Ëproduct¤ÎÆþ¤Ã¤¿É½¤òºî¤ê¡¢count_preprocessed.tsv¤Ë½ñ¤½Ð¤¹¡£
import pandas as pd import numpy as np df = pd.read_table('all.counts.txt', index_col=0, skiprows=1) rename_columns = {'SRR453566.sorted.bam': 'batch_1', 'SRR453567.sorted.bam': 'batch_2', 'SRR453568.sorted.bam': 'batch_3', 'SRR453569.sorted.bam': 'chemostat_1', 'SRR453570.sorted.bam': 'chemostat_2', 'SRR453571.sorted.bam': 'chemostat_3'} df = df.rename_axis(mapper=rename_columns, axis=1) #print(df.head()) df = df[df.Chr != 'NC_001224.1'] # ¥ß¥È¥³¥ó¥É¥ê¥¢(NC_001224.1)¾å¤Î°äÅÁ»Ò¤ò½ü¤¯ gene_products = pd.read_table('gene_id_product.tsv', index_col=0, names=["gene_id", "product"]) #print(gene_products.head()) # ¥Þ¡¼¥¸ #df_with_product = gene_products.join(df) # Ãͤ¬float¤Ë¤Ê¤ë¡© df_with_product = df.join(gene_products) #print(df_with_product.head()) df_count = df_with_product[[\ "batch_1","batch_2","batch_3","chemostat_1","chemostat_2","chemostat_3"]] df_count.to_csv('count_raw.tsv', sep='\t') df_with_product.to_csv('count_preprocessed.tsv', sep='\t')
½ÐÎϤÏ
Geneid Chr Start End Strand Length batch_1 batch_2 batch_3 chemostat_1 chemostat_2 chemostat_3 product gene_0001 NC_001133.9 1807 2169 - 363 0 2 6 0 0 1 seripauperin PAU8 gene_0002 NC_001133.9 2480 2707 + 228 0 0 0 0 0 0 hypothetical protein gene_0003 NC_001133.9 7235 9016 - 1782 0 0 0 0 0 0 putative permease SEO1 gene_0004 NC_001133.9 11565 11951 - 387 0 0 0 0 0 0 hypothetical protein gene_0005 NC_001133.9 12046 12426 + 381 2 8 10 6 7 18 hypothetical protein ¡Ê°Ê²¼Î¬¡Ë
¤³¤³¤«¤éÀµµ¬²½½èÍý¡£¡¡RPM/FPM¡¢RPKM/FPKM¡¢TPM¤ò·×»»¤¹¤ë¡£
FPKM¤ò»È¤Ã¤Æ¡¢batch culture¤Èchemostat¤Î´Ö¤Çȯ¸½ÊÑÆ°¤ÎÂ礤¤°äÅÁ»Ò¤òÃê½Ð
%matplotlib inline import pandas as pd import numpy as np def normalize_per_million_reads(df): # RPM/FPM = reads per million fragments per million sum_count = df.sum() return 10**6 * df / sum_count def normalize_per_kilobase(df, gene_length): # FPKM = fragments per kilobase of exon per million reads mapped df_tmp = df.copy() df_tmp = (df.T * 10**3 / gene_length).T return df_tmp def normalize_tpm(df, gene_length): # TPM = transcripts per kilobase million https://bi.biopapyrus.jp/ df_tmp = df.copy() df_tmp = normalize_per_kilobase(df_tmp, gene_length) df_tmp = normalize_per_million_reads(df_tmp) return df_tmp df_count = pd.read_table('count_raw.tsv', index_col=0) df_with_product = pd.read_table('count_preprocessed.tsv', index_col=0) gene_products = pd.read_table('gene_id_product.tsv', index_col=0, names=["gene_id", "product"]) # RPM/FPM = reads per million fragments per million df_count_fpm = normalize_per_million_reads(df_count) #print(df_count_fpm.head()) df_count_fpm.to_csv("count_fpm.tsv", sep="\t") # RPKM/FPKM # FPKM = fragments per kilobase of exon per million reads mapped # single-end¤Î¾ì¹ç¡¢RPKM = reads per kilobase of exon per million reads mapped gene_length = df_with_product['Length'] df_count_fpkm = normalize_per_kilobase(df_count_fpm, gene_length) #print(df_count_fpkm.head()) df_count_fpkm.to_csv('count_fpkm.tsv', sep='\t') # TPM normalization # TPM = transcripts per kilobase million https://bi.biopapyrus.jp/ df_count_tpm = normalize_tpm(df_count, gene_length) #print(df_count_tpm.head()) #print(df_count_tpm.sum()) df_count_tpm.to_csv("count_tpm.tsv", sep="\t") # ȯ¸½ÊÑÆ°¤ÎÂ礤¤°äÅÁ»Ò¤ÎÃê½Ð # batch culture¤ÎÊ¿¶Ñ¤òµá¤á¤ë df_count_fpkm["batch"] = (df_count_fpkm["batch_1"] + df_count_fpkm["batch_2"] + df_count_fpkm["batch_3"]) / 3 # chemostat¤ÎÊ¿¶Ñ¤òµá¤á¤ë df_count_fpkm["chemostat"] = (df_count_fpkm["chemostat_1"] + df_count_fpkm["chemostat_2"] + df_count_fpkm["chemostat_3"]) / 3 # ȯ¸½ÊÑÆ°¤òlog2 fold ¤È¤·¤Æµá¤á¤ë # 0 ¤Ç¤Î½ü»»¤òËɤ°¤¿¤á¡¢Ê¬Êì¤ËÈù¾¯¤ÊÃͤò²Ã¤¨¤Æ¤¤¤ë df_count_fpkm["log2fold"] = \ df_count_fpkm["chemostat"] / (df_count_fpkm["batch"] + 10**-6) df_count_fpkm["log2fold"] = df_count_fpkm["log2fold"].apply(np.log2) #print(df_count_fpkm.head()) # ɬÍ×Éôʬ¤Î¤ßÈ´¤½Ð¤·¡¢product¤È·ë¹ç diff_ex = df_count_fpkm[["batch", "chemostat", "log2fold"]] diff_ex = diff_ex.join(gene_products) # ¥«¥¦¥ó¥È¿ô¤¬0¤Î¥Ç¡¼¥¿¤ò½ü¤¯ diff_ex = diff_ex[diff_ex["batch"] > 0] diff_ex = diff_ex[diff_ex["chemostat"] > 0] #print(diff_ex) diff_ex = diff_ex.sort_values("log2fold", ascending=False) #print(diff_ex.head()) #print(diff_ex.tail()) diff_ex.to_csv("diff_ex.tsv", sep="\t")
·ë²Ì¤Ï¡¢
Geneid batch chemostat log2fold product gene_2989 0.5113545477349243 1609.2291342813207 11.619755389936394 Rgi2p gene_4740 3.714171096311252 5568.713898594607 10.550087794327904 Sip18p gene_4667 5.970597813523303 5108.597498100943 9.740835924216602 Spg4p gene_5965 8.609101140765873 5809.572900018754 9.398353606047285 Gre1p gene_4237 1.158010357867918 775.5514280425233 9.387429238134654 hypothetical protein ¡Ê°Ê²¼Î¬¡Ë
¤ª¤Þ¤±¡£batch_1¡¢batcn_2¡¢batch_3¡¢chemstat_1, chemstat_2, chemstat_3´Ö¤ÎÁê´Øµ÷Î¥¤ò·×»»¤·¤Æ¡¢³¬ÁØ¥¯¥é¥¹¥¿½èÍý¤ò»È¤Ã¤Æ¥È¥ê¡½É½¸½¤¹¤ë¡£
# TMPÀµµ¬²½¤·¤¿¥Ç¡¼¥¿¤ò¥¯¥é¥¹¥¿¥ê¥ó¥° import matplotlib.pyplot as plt from scipy.cluster.hierarchy import linkage, dendrogram tpm_t = df_count_tpm.T print('---') print(df_count_tpm.head()) from scipy.spatial.distance import pdist print(tpm_t.shape, pdist(tpm_t, 'correlation')) linkage_result = linkage(tpm_t, method='average', metric='correlation') print(linkage_result) plt.figure(num=None, figsize=(16, 9), dpi=200, facecolor='w', edgecolor='k') dendrogram(linkage_result, labels=df_count_tpm.columns) plt.show()
·ë²Ì¤Ï
MA¥×¥í¥Ã¥È¤È¤Ï¡¢
¥Þ¥¤¥¯¥í¥¢¥ì¥¤¤Î¥Ç¡¼¥¿¤ÎÂåɽŪ¤Êɽ¼¨ÊýË¡¤Ç¡¢¿Þ·Á¤È¤·¤Æ¤Ï¡¢»¶ÉÛ¿Þ¤ò45Åٲ󞤵¤»¤¿¤è¤¦¤Ê¥¤¥á¡¼¥¸
Z-score ¡Êȯ¸½ÊÑÆ°°äÅÁ»Ò¤òȽÄꤹ¤ë¤â¤¦1¤Ä¤ÎÊýË¡¡Ë – °äÅÁ»Òȯ¸½²òÀϡʥޥ¤¥¯¥í¥¢¥ì¥¤²òÀÏ, RNA-seq¡Ë
MA ¥×¥í¥Ã¥È (MA PLOT) – °äÅÁ»Òȯ¸½²òÀϡʥޥ¤¥¯¥í¥¢¥ì¥¤²òÀÏ, RNA-SEQ¡Ë¤Ë¡¢°Ê²¼¤ÎÀâÌÀ
2¥µ¥ó¥×¥ë¤Î¥Ç¡¼¥¿¤«¤é¡¢M¤ÈA¤ÎÃͤò»»½Ð¤·¤Æ»ÈÍѤ·¤Þ¤¹¡£¤³¤³¤Ç¡¢M ¤Ï¡Ölog2ÊÑ´¹¤µ¤ì¤¿¥·¥°¥Ê¥ëÃͤκ¹¡×¤Ç¤¢¤ê¡¢A ¤Ï¡Ölog2ÊÑ´¹¤µ¤ì¤¿¥·¥°¥Ê¥ëÃͤÎÊ¿¶ÑÃ͡פǤ¹¡£¿ô¼°¤Ç¤Ï¡¢¼¡¤Î¤è¤¦¤Ë½ñ¤±¤Þ¤¹¡£
M = log2(¼Â¸³¥µ¥ó¥×¥ë) – log2(¥³¥ó¥È¥í¡¼¥ë¥µ¥ó¥×¥ë) A = { log2(¼Â¸³¥µ¥ó¥×¥ë) + log2(¥³¥ó¥È¥í¡¼¥ë¥µ¥ó¥×¥ë) } / 2
log2(KO/WT)
M > 1 ¤È¤¤¤¦¤³¤È¤Ï¡¢logFC > 1 ¤Ä¤Þ¤ê¡¢ratio > 2 ¡£¡Êȯ¸½Áý²Ã¡Ë ratio¤Ï¼Â¸³/¥³¥ó¥È¥í¡¼¥ë M = 0 ¤È¤¤¤¦¤³¤È¤Ï¡¢logFC = 0 ¤Ä¤Þ¤ê¡¢ratio = 1¡£¡ÊÊÑÆ°¤Ê¤·¡Ë M < -1 ¤È¤¤¤¦¤³¤È¤Ï¡¢logFC < -1 ¤Ä¤Þ¤ê¡¢ratio < 0.5¡£¡Êȯ¸½¸º¾¯¡Ë
MA¥×¥í¥Ã¥È¤òÉÁ¤¯¥×¥í¥°¥é¥à¡¡RNASaccha/Saccha/°äÅÁ¸¦2-3²Ä»ë²½2-1-MAplot
%matplotlib inline # TPM count_tpm.tsv¤«¤éMA¥×¥í¥Ã¥È # MA¥×¥í¥Ã¥È¤È¤Ï http://array.cell-innovator.com/?p=1772 # M = log2(¼Â¸³¥µ¥ó¥×¥ë) – log2(¥³¥ó¥È¥í¡¼¥ë¥µ¥ó¥×¥ë) # A = { log2(¼Â¸³¥µ¥ó¥×¥ë) + log2(¥³¥ó¥È¥í¡¼¥ë¥µ¥ó¥×¥ë) } / 2 import matplotlib.pyplot as plt import pandas as pd import numpy as np import math df_tpm = pd.read_table('count_tpm.tsv')[['batch_1', 'chemostat_1']] u = np.array(list(map(math.log2, df_tpm['chemostat_1']+1))) v = np.array(list(map(math.log2, df_tpm['batch_1']+1))) M = u - v A = (u + v) / 2 df_tpm_log = df_tpm.applymap(lambda x: math.log2(x+1)) df_tpm_log['M'] = df_tpm_log['chemostat_1'] - df_tpm_log['batch_1'] df_tpm_log['A'] = (df_tpm_log['chemostat_1'] + df_tpm_log['batch_1'])/2 df_tpm_log_smaller = df_tpm_log[(df_tpm_log['M']<=2) & (df_tpm_log['M']>=-2)] df_tpm_log_larger = df_tpm_log[(df_tpm_log['M']>2) | (df_tpm_log['M']<-2)] fig = plt.figure() ax = fig.add_subplot(1,1,1) #plt.scatter(A, M) df_tpm_log_smaller.plot.scatter(x='A', y='M', ax=ax, label='non-DEG') df_tpm_log_larger.plot.scatter(x='A', y='M', c='orange', ax=ax, label='DEG') plt.xlabel('A value') plt.ylabel('M value') plt.legend() plt.show()
ÉÁ¤±¤¿¿Þ¤Ï