Python¥Ð¥¤¥ª?¡¡Python¥Ð¥¤¥ª/¥Ä¡¼¥ë?
2584¡¡¡¡¡¡2019-06-09 (Æü) 09:07:15
¤ä¤ë¤³¤È¡§
2-2¤ÇÆÀ¤¿Á°Äó¥Õ¥¡¥¤¥ë¡§¡¡diff_ex.tsv
Geneid batch chemostat log2fold product gene_2989 0.5113545477349243 1609.2291342813207 11.619755389936394 Rgi2p gene_4740 3.714171096311252 5568.713898594607 10.550087794327904 Sip18p gene_4667 5.970597813523303 5108.597498100943 9.740835924216602 Spg4p gene_5965 8.609101140765873 5809.572900018754 9.398353606047285 Gre1p gene_4237 1.158010357867918 775.5514280425233 9.387429238134654 hypothetical protein ¡Ê°Ê²¼Î¬¡Ë
count_tpm.tsv¤ÇÆÀ¤é¤ì¤¿TMP¥Ç¡¼¥¿¤Ç¡¢batch_1, 2, 3, chemostat_1, 2, 3¤Î£¶¼¡¸µ¤ò PCA¤Ë¤è¤ê°µ½Ì¤¹¤ë¡£½èÍý¤Î´ðËܤÏ
df = pd.read_table('count_tpm.tsv', index_col=0) pca = sklearn.decomposition.PCA() coords = pca.fit_transform(df.transpose().values)
¤À¤¬¡¢ÆÀ¤é¤ì¤¿coords¤ò¥×¥í¥Ã¥È¤¹¤ë¤³¤È¡¢ÆþÎϥǡ¼¥¿¤Î¤¦¤Á¥ª¡¼¥ë0¤Ï½ü³°¤¹¤ë¤³¤È ¤Ê¤É¤Î½èÍý¤¬É¬Íס£
%matplotlib inline import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns import sklearn.decomposition df = pd.read_table('count_tpm.tsv', index_col=0) all_zero_index = df.index[df.sum(axis=1) == 0] df = df.drop(all_zero_index) sample_colors = {'batch_1':'b', # b : blue 'batch_2':'b', 'batch_3':'b', 'chemostat_1':'g', # g: green 'chemostat_2':'g', 'chemostat_3':'g'} #colors = (df.columns).map(sample_colors) colors = [sample_colors[u] for u in df.columns] print(colors) # PCA ¼Â¹Ô pca = sklearn.decomposition.PCA() coords = pca.fit_transform(df.transpose().values) def scatter_plot(coords, sample_labels, colors, xlabel=None, ylabel=None, title=''): fig = plt.figure() ax = fig.add_subplot(111) ax.scatter(coords[:, 0], coords[:, 1], color=colors) for i, sample_label in enumerate(sample_labels): ax.annotate(sample_label, xy=(coords[i, :2]), xytext=(10,10), textcoords='offset points', color=colors[i], arrowprops={'arrowstyle':'-', 'edgecolor':colors[i]}) ax.set_xlabel(xlabel) ax.set_ylabel(ylabel) ax.set_title(title) plt.show() scatter_plot(coords, df.columns, colors, xlabel='PC1', ylabel='PC2')
½ÐÎϤÏ
Z¥¹¥³¥¢¤Ë¤è¤ëɸ½à²½¡Ê¥Ç¡¼¥¿·²¤ÎÅö³º¿ôÃͤ«¤éÊ¿¶ÑÃͤò°ú¤¤¤ÆÉ¸½àÊк¹¤Ç³ä¤Ã¤¿¤â¤Î¡Ë ¤ò¤·¤ÆPCA¡£
%matplotlib inline
# Z-score ɸ½à²½ import sklearn.preprocessing values = df.transpose().values scaler = sklearn.preprocessing.StandardScaler(with_mean=True, with_std=True) std_values = scaler.fit_transform(values) std_df = pd.DataFrame(std_values.T, index=df.index, columns=df.columns) print('\nRaw values') print('TPM Average :') print(df.mean(axis=1)[:10]) print('\nTPM Standard deviation :') print(df.std(ddof=0, axis=1)[:10]) print('\n\nStandardized values') print('TPM Average :') print(std_df.mean(axis=1)[:10]) print('\nTPM Standard deviation :') print(std_df.std(ddof=0, axis=1)[:10]) pca = sklearn.decomposition.PCA() coords = pca.fit_transform(std_df.transpose().values) scatter_plot(coords, std_df.columns, colors, xlabel='PC1', ylabel='PC2')
½ÐÎϤÏ
£¶¤Ä¤Î¥µ¥ó¥×¥ëbatch_1¡¢batcn_2¡¢batch_3¡¢chemstat_1, chemstat_2, chemstat_3¤Î´Ö¤Î Áê´Ø¤ò¸«¤ë¡£¤Þ¤º¤Ï»¶ÉÛ¿Þ¤òÉÁ¤¯¡£
%matplotlib inline import pandas as pd import seaborn as sns sns.set() #seaborn # Á´ÁȤ߹ç¤ï¤»¤Î»¶ÉÛ¿Þ¤òɽ⽰ tpm = pd.read_table("count_tpm.tsv") sns.pairplot(tpm)
·ë²Ì¤Ï
batch´Ö¡¢chemostat´Ö¤Ï¹â¤¤Áê´Ø¡¢batch-chemostat´Û¤Ï¤Ð¤é¤Ä¤¤¬¤¢¤ë¤³¤È¤¬¤ï¤«¤ë¡£
Áê´Ø·¸¿ô¤òµá¤á¤Æ¤ß¤ë¡£
# SpearmanÁê´Ø·¸¿ô¤ò·×»»¤·¤Æ¤ß¤ë from scipy import stats corr, p = stats.spearmanr(tpm.batch_1, tpm.batch_2) print("Spearman correlation between batch_1 and batch_2: " + str(corr)) print("p value: " + str(p)) corr, p = stats.spearmanr(tpm.batch_1, tpm.chemostat_1) print("Spearman correlation between batch_1 and chemostat_1: " + str(corr)) print("p value: " + str(p))
·ë²Ì¤Ï
Spearman correlation between batch_1 and batch_2: 0.9903475599652781 p value: 0.0 Spearman correlation between batch_1 and chemostat_1: 0.8483821121349234 p value: 0.0
Áê´Ø·¸¿ô¤Ç¸«¤ë¤È¡¢°Û¼ï´Ö¤ÎÁê´Ø·¸¿ô¤¬ÌÀ¤é¤«¤Ë¾®¤µ¤¤¤¬¡¢¤½¤ì¤Ç¤â¤½¤³¤½¤³Áê´Ø¤¬¤¢¤ë¡£
°ã¤¤¤¬¤¢¤ë¤«¤É¤¦¤«¤ò¡¢Mann-Whitney U¸¡Äê¤òÍѤ¤¤ÆÄ´¤Ù¤ë¡£ 2¤Ä¤ÎÊ콸ÃĤ¬Æ±¤¸¤Ç¤¢¤ë¤È¤¹¤ë¤È¤¤¤¦²¾Äê¤ËÂФ¹¤ëpÃÍ¡£p<0.05(0.01)¤À¤È´þµÑ¢ªÆ±¤¸¤Ç¤Ê¤¤
# °ã¤¤¤¬¤¢¤ë¤«¤É¤¦¤«¤ò¸¡Äꤷ¤Æ¤ß¤ë # Ê콸ÃĤ¬Àµµ¬Ê¬ÉÛ¤·¤Æ¤¤¤ë¤È¤Ï¤¤¤¤¤Ë¤¯¤¤¤Î¤Ç¡¢Mann-Whitney U¸¡Äê¤òÍѤ¤¤ë b1c1 = stats.mannwhitneyu(tpm.batch_1, tpm.chemostat_1, alternative="two-sided") # alternative="two-sided" ¤ò»ØÄꤷ¤Ê¤¤¤È⽚¦¸¡Äê¤Ë¤Ê¤Ã¤Æ¤·¤Þ¤¦ b1c2 = stats.mannwhitneyu(tpm.batch_1, tpm.chemostat_2, alternative="two-sided") b1c3 = stats.mannwhitneyu(tpm.batch_1, tpm.chemostat_3, alternative="two-sided") b2c1 = stats.mannwhitneyu(tpm.batch_2, tpm.chemostat_1, alternative="two-sided") b2c2 = stats.mannwhitneyu(tpm.batch_2, tpm.chemostat_2, alternative="two-sided") b2c3 = stats.mannwhitneyu(tpm.batch_2, tpm.chemostat_3, alternative="two-sided") b3c1 = stats.mannwhitneyu(tpm.batch_3, tpm.chemostat_1, alternative="two-sided") b3c2 = stats.mannwhitneyu(tpm.batch_3, tpm.chemostat_2, alternative="two-sided") b3c3 = stats.mannwhitneyu(tpm.batch_3, tpm.chemostat_3, alternative="two-sided") print(b1c1.pvalue, b1c2.pvalue, b1c3.pvalue, "\n", b2c1.pvalue, b2c2.pvalue, b2c3.pvalue, "\n", b3c1.pvalue, b3c2.pvalue, b3c3.pvalue) # ¤Ä¤¤¤Ç¤Ë¡¢Æ±¼ïÎàbatch_1¤Èbatch_2¤Î´Ö¤Ç¤Ï¤É¤¦¤«¡© print() b1b2 = stats.mannwhitneyu(tpm.batch_1, tpm.batch_2, alternative="two-sided") b1b3 = stats.mannwhitneyu(tpm.batch_1, tpm.batch_3, alternative="two-sided") b2b3 = stats.mannwhitneyu(tpm.batch_2, tpm.batch_3, alternative="two-sided") c1c2 = stats.mannwhitneyu(tpm.chemostat_1, tpm.chemostat_2, alternative="two-sided") c1c3 = stats.mannwhitneyu(tpm.chemostat_1, tpm.chemostat_3, alternative="two-sided") c2c3 = stats.mannwhitneyu(tpm.chemostat_2, tpm.chemostat_3, alternative="two-sided") print(b1b2.pvalue, b1b3.pvalue, b2b3.pvalue, '\n', \ c1c2.pvalue, c1c3.pvalue, c2c3.pvalue)
·ë²Ì¤Ï¡¢°Û¼ï´Ö¤Î¾ì¹ç
7.443032122149763e-45 1.872248071308331e-35 5.5364679561011494e-30 6.146536587950258e-55 1.2997112467897929e-44 1.6801015887922584e-38 1.5468893669458462e-52 1.9165599784456836e-42 1.8061342891881474e-36
¤Ç¡¢¤¤¤º¤ì¤âp<<0.01¤Ç´þµÑ¤µ¤ì¤ë¡ÊƱ¤¸¤Ç¤Ï¤Ê¤¤¡Ë
¾Êý¡¢Æ±¼ï´Ö¡Êb1-b2´Ö¡Ë¤Î¾ì¹ç
0.08065503501024336 0.15188401219837053 0.7603592816804289 0.07125032415042876 0.006632138001689073 0.3395999834216805
¤Ç¡¢¤¤¤º¤ì¤â´þµÑ¤·¤Ê¤¤¤È¸À¤¤¤¿¤¤¤¬¡¢chemostat_1¤È3¤Î´Ö¤Ïp¤¬¤«¤Ê¤ê¾®¤µ¤¯¡¢´þµÑ¤µ¤ì¤ë¡£
¥«¥¤£²¾èÆÈΩÀ¸¡Äê¤Ë¤è¤ê¡¢°äÅÁ»ÒX¤È¾¤Î°äÅÁ»Ò¤È¤ÎÆÈΩÀ¤ò¸¡Äꤷ¡¢ÆÈΩ¤Ê¤â¤Î¤òõ¤¹¡£
# ¥«¥¤£²¾èÆÈΩÀ¤Î¸¡Äê¤Ë¤è¤ê¡¢°äÅÁ»Ò¤´¤È¤Îȯ¸½Î̤ÎͰպ¹¤ò¸«¤ë # ¼¡¤Îʬ³äɽ¤ËÂФ¹¤ë¥«¥¤£²¾è¤Ë¤è¤ëÆÈΩÀ¸¡Äê # ¡Ã ¡¡¡¡¡Ãbatch ¡¡¡¡¡¡¡Ã chemostat # ¡Ã°äÅÁ»Ò£±¡¡¡Ã£µ£°¡¡¡¡¡¡¡¡¡Ã £±£°£° # ¡Ã¤½¤ì°Ê³°¡¡¡Ã12,000,000 ¡Ã13,000,000 # µ¢Ìµ²¾Àâ¡§ÆÈΩ¤Ç¤¢¤ë¡¡¤ò¸¡Äꢪp<<0.05¤Ê¤é´þµÑ¡ÊÆÈΩ¤Ç¤Ê¤¤¡Ë #¥«¥¤⼆¾èÆÈ⽴À¤Î¸¡Äê¤Ç¤È¤ê¤¢¤¨¤ººÇ½é¤Î10°äÅÁ⼦¤ò¸¡Äꤷ¤Æ¤ß¤ë import numpy as np #¥«¥¤⼆¾èÆÈ⽴À¤Î¸¡Äê¤Ï¥«¥¦¥ó¥È¥Ç¡¼¥¿¤ò»È¤¦¤Î¤Ç¡¢rawtag¤ò»È¤¦ rawtag = pd.read_table("count_raw.tsv") gene_ids = rawtag['Geneid'] #batch 3 replicate¤Î¥¿¥°¥«¥¦¥ó¥È¤Î¹ç·×¤ò·×»» sx = sum(rawtag.batch_1) + sum(rawtag.batch_2) + sum(rawtag.batch_3) #chemostat 3 replicate¤Î¥¿¥°¥«¥¦¥ó¥È¤Î¹ç·×¤ò·×»» sy = sum(rawtag.chemostat_1) + sum(rawtag.chemostat_2) + sum(rawtag.chemostat_3) ar = [] for i in range(10): #iÈÖÌܤΰäÅÁ»Ò¤Îbatch 3 replicate¤Ç¤Î¥¿¥°¥«¥¦¥ó¥È¤Î¹ç·×¤ò·×»» x = rawtag.batch_1[i] + rawtag.batch_2[i] + rawtag.batch_3[i] #iÈÖÌܤΰäÅÁ»Ò¤Îchemostat 3 replicate¤Ç¤Î¥¿¥°¥«¥¦¥ó¥È¤Î¹ç·×¤ò·×»» y = rawtag.chemostat_1[i] + rawtag.chemostat_2[i] + rawtag.chemostat_3[i] #iÈÖÌܤΰäÅÁ»Ò°Ê³°¤ÎÁ´°äÅÁ»Ò¤Î¡¢batch 3 replicate¤Ç¤Î¥¿¥°¥«¥¦¥ó¥È¤Î¹ç·×¤ò·×»» sxr = sx - x #iÈÖÌܤΰäÅÁ»Ò°Ê³°¤ÎÁ´°äÅÁ»Ò¤Î¡¢chemostat 3 replicate¤Ç¤Î¥¿¥°¥«¥¦¥ó¥È¤Î¹ç·×¤ò·×»» syr = sy - y if ( rawtag.batch_1[i] == 0 or rawtag.batch_2[i] == 0 or rawtag.batch_3[i] == 0 or \ rawtag.chemostat_1[i] == 0 or rawtag.chemostat_2[i] == 0 or \ rawtag.chemostat_3[i] == 0): ar.append(1) #¥«¥¤£²¾èÆÈΩÀ¤Î¸¡Äê¤Ç¤Ï¡¢³ÆÃͤΤɤ줫¤¬0¤À¤È¸¡Äê¤ÎÀºÅÙ¤¬°¤¯¤Ê¤ë¤¿¤á¡¢ # ¤½¤Î¤è¤¦¤Ê¥Ç¡¼¥¿¤Î¾ì¹ç¤Ë¤Ï¸¡Äê¤ò¤»¤º¤ËpÃͤò1¤È¤¹¤ë else: forchi = np.array([[x,y],[sxr,syr]]) chi = stats.chi2_contingency(forchi) ar.append(chi[1]) print(ar) print() # ¾å¤À¤È¸¡Äê¤ò·«¤êÊÖ¤·¤Æ¤¤¤ë¡£¤½¤ì¤ËÂФ¹¤ëpÃͤÎÊäÀµ¤È¤·¤ÆBonferroniÊäÀµ¤ò»È¤¦ # BonferroniÊäÀµ¤Ï¡¢0.05Åù¤Î¸Ä¡¹¤Î¸¡Äê¤ÎpÃͤò¡¢¸¡Äê²ó¿ô¤Ç³ä¤Ã¤ÆÆÀ¤é¤ì¤¿ÃͤòͰÕÀ¤ÎȽÄê¤Ë»È¤¦ import statsmodels.stats.multitest as multi sx = sum(rawtag.batch_1) + sum(rawtag.batch_2) + sum(rawtag.batch_3) sy = sum(rawtag.chemostat_1) + sum(rawtag.chemostat_2) + sum(rawtag.chemostat_3) ar = [] for i in range(10): x = rawtag.batch_1[i] + rawtag.batch_2[i] + rawtag.batch_3[i] y = rawtag.chemostat_1[i] + rawtag.chemostat_2[i] + rawtag.chemostat_3[i] sxr = sx - x syr = sy - y if ( rawtag.batch_1[i] == 0 or rawtag.batch_2[i] == 0 or rawtag.batch_3[i] == 0 or \ rawtag.chemostat_1[i] == 0 or rawtag.chemostat_2[i] == 0 or rawtag.chemostat_3[i] == 0): ar.append(1) else: forchi = np.array([[x,y],[sxr,syr]]) chi = stats.chi2_contingency(forchi) ar.append(chi[1]) print(ar) #¦Á=0.05¤È¤·¤Æ¡¢BonferroniÊäÀµ¤ò¹Ô¤¦ pval_corr = multi.multipletests(ar, alpha=0.05, method='Bonferroni') print('Bonferroni:\n', pval_corr[1]) # BonferroniÊäÀµ¤Ï¤«¤Ê¤ê¸·¤·¤¤ÊäÀµ¼êË¡¤Ç¤¢¤ë¡£ # ¤½¤³¤Ç¡¢familywise error rate¤ÎÂå¤ï¤ê¤Ë¡¢Í°Õº¹¤¢¤ê¤ÈȽÄꤵ¤ì¤¿·ë²Ì¤ÎÃæ¤ËËÜÅö¤Ïº¹¤¬Ìµ¤¤¤â¤Î¤ò´Þ¤à³ÎΨ # ¡ÊFalse Discovery Rate¡Ë¤ò⼀Äê¤Î⽔½à°Ê²¼¤Ë¤¹¤ë¡¢FDR¤òpÃͤÎÂå¤ï¤ê¤Ë⽤¤¤¤ë⼿Ë¡¤Ç¤ÎÊäÀµ¼êË¡¤ò»î¤·¤Æ¤ß¤ë¡£ # FDR¤Î·×»»⽅Ë¡¤Ë¤Ï¤¤¤¯¤Ä¤«¼ïÎब¤¢¤ë¤¬¡¢⼀ÈÖ´ðËÜŪ¤ÊBenjamini HochbergË¡¤Ç¤Ï¡¢³Æ¸¡Äê ¤ÇÆÀ¤é¤ì¤¿pÃͤò # ⼩¤µ¤¤½ç¤Ëʤ٤ơ¢N/½ç°Ì¤òpÃͤˤ«¤±¤ë¤³¤È¤Ç¸ÄÊ̤θ¡Äê¤Ë¤ª¤±¤ëFDR¤òµá¤á¤ë¡£ # ¿¤¯¤Î¾ì¹ç¡¢FDR¤ÏpÃͤȶèÊ̤¹¤ë¤¿¤á¤ËqÃͤȸƤФì¤ë¡£ pval_corr = multi.multipletests(ar, alpha=0.05, method='fdr_bh') #Benjamini HochbergË¡ print('FDR-BH', pval_corr[1])
·ë²Ì¤Ï
[1, 1, 1, 1, 0.019436546571414606, 1, 1, 1, 2.0214198154233473e-13, 0.0] [1, 1, 1, 1, 0.019436546571414606, 1, 1, 1, 2.0214198154233473e-13, 0.0] Bonferroni: [1.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00 1.94365466e-01 1.00000000e+00 1.00000000e+00 1.00000000e+00 2.02141982e-12 0.00000000e+00] FDR-BH [1.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00 6.47884886e-02 1.00000000e+00 1.00000000e+00 1.00000000e+00 1.01070991e-12 0.00000000e+00]
¤³¤ì¤òƧ¤Þ¤¨¤Æ¡¢Á´°äÅÁ»Ò¤Ç¥È¥é¥¤
# Á´°äÅÁ¤Ç¸¡Äê¤ò¤·¡¢FDR¤òµá¤á¤Æ¥Õ¥¡¥¤¥ë¤Ë½ÐÎϤ¹¤ë # ÆÈΩ¤Ç¤¢¤ë¤È¤¤¤¦µ¢Ìµ²¾Äê¤ËÂФ¹¤ëpÃÍ¡£p<0.05(0.01)¤À¤È´þµÑ¢ªÆÈΩ¤Ç¤Ê¤¤ sx = sum(rawtag.batch_1) + sum(rawtag.batch_2) + sum(rawtag.batch_3) sy = sum(rawtag.chemostat_1) + sum(rawtag.chemostat_2) + sum(rawtag.chemostat_3) ai = len(rawtag.batch_1) ar = [] for i in range(ai): x = rawtag.batch_1[i] + rawtag.batch_2[i] + rawtag.batch_3[i] y = rawtag.chemostat_1[i] + rawtag.chemostat_2[i] + rawtag.chemostat_3[i] sxr = sx - x syr = sy - y if ( rawtag.batch_1[i] <= 0 or rawtag.batch_2[i] <= 0 or rawtag.batch_3[i] <= 0 \ or rawtag.chemostat_1[i] <= 0 or rawtag.chemostat_2[i] <= 0 or\ rawtag.chemostat_3[i] <= 0): ar.append(1) else: forchi = np.array([[x,y],[sxr,syr]]) chi = stats.chi2_contingency(forchi) ar.append(chi[1]) pval_corr = multi.multipletests(ar, alpha=0.05, method='fdr_bh') np.savetxt('BHtest.txt', pval_corr[1]) gene_ids = list(rawtag['Geneid']) output = pd.DataFrame(list(zip(gene_ids, list(pval_corr[1])))) output.columns = ['Geneid', 'pvalue'] output = output[output['pvalue']<0.05] print(output[:29]) print('complete')
·ë²Ì¤Ï
Geneid pvalue 4 gene_0005 2.643073e-02 8 gene_0009 4.575206e-13 9 gene_0010 0.000000e+00 10 gene_0011 0.000000e+00 12 gene_0013 6.566874e-13 13 gene_0014 1.237817e-08 14 gene_0015 6.762139e-64 15 gene_0016 6.777083e-18 16 gene_0017 0.000000e+00 17 gene_0018 1.301251e-34 18 gene_0019 6.370921e-08 19 gene_0020 8.823167e-71 20 gene_0021 4.864046e-25 ¡Ê°Ê²¼Î¬¡Ë