Python¥Ð¥¤¥ª?¡¡Python¥Ð¥¤¥ª/¥Ä¡¼¥ë?
3646¡¡¡¡¡¡2019-09-01 (Æü) 13:26:10

¡Ê°äÅÁ¸¦¥»¥ß¥Ê¡¼ 2-1, 2-2¡Ëȯ¸½²òÀÏ

¤ä¤ë¤³¤È¡§

2-1¤ÎÁ°Äó­¡¡§¡¡featureCount¤ÇSRR¤Î¥«¥¦¥ó¥È¤ò¤·¡¢½ÐÎÏall.counts.txt¤òÆÀ¤Æ¤¤¤ë¡£¥³¥Þ¥ó¥É

featureCounts -p -T 32 -t exon -g gene_id -a s288c_e.gff -o all.counts.txt *.sorted.bam

¤Ç½èÍý¡£½ÐÎϤÏ

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/ 
	  v1.6.3

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 6 BAM files                                      ||
||                           P SRR453566.sorted.bam                           ||
||                           P SRR453567.sorted.bam                           ||
||                           P SRR453568.sorted.bam                           ||
||                           P SRR453569.sorted.bam                           ||
||                           P SRR453570.sorted.bam                           ||
||                           P SRR453571.sorted.bam                           ||
||                                                                            ||
||             Output file : all.counts.txt                                   ||
||                 Summary : all.counts.txt.summary                           ||
||              Annotation : s288c_e.gff (GTF)                                ||
||      Dir for temp files : ./                                               ||
||                                                                            ||
||                 Threads : 32                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================// 

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file s288c_e.gff ...                                       ||
||    Features : 6761                                                         ||
||    Meta-features : 6420                                                    ||
||    Chromosomes/contigs : 17                                                ||
||                                                                            ||
|| Process BAM file SRR453566.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 5650830                                              ||
||    Successfully assigned alignments : 4568783 (80.9%)                      ||
||    Running time : 0.07 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453567.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 7714981                                              ||
||    Successfully assigned alignments : 6257956 (81.1%)                      ||
||    Running time : 0.12 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453568.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 5547385                                              ||
||    Successfully assigned alignments : 4527122 (81.6%)                      ||
||    Running time : 0.10 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453569.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 3990749                                              ||
||    Successfully assigned alignments : 3077110 (77.1%)                      ||
||    Running time : 0.07 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453570.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 6691016                                              ||
||    Successfully assigned alignments : 3873751 (57.9%)                      ||
||    Running time : 0.12 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR453571.sorted.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 6248063                                              ||
||    Successfully assigned alignments : 4907305 (78.5%)                      ||
||    Running time : 0.11 minutes                                             ||
||                                                                            ||
||                                                                            ||
|| Summary of counting results can be found in file "all.counts.txt.summary"  ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

all.counts.txt¥Õ¥¡¥¤¥ë¤¬¤Ç¤­¤ë¡£¤½¤ÎÆâÍÆ¤Ï

Saccha_all_counts_txt.png

2-1¤ÎÁ°Ä󭢡¡¥Õ¥¡¥¤¥ë gene_id_product.tsv ¡Êgene_id¤Èproduct ¤Î°ìÍ÷ɽ¡Ë ¤ò»È¤¤¤¿¤¤¡£
¤³¤ì¤Ï¥¹¥Æ¥Ã¥×1-3¤Çºî¤Ã¤Æ¤¤¤ë¡£

²ÝÂê 1
../1-1/reference/s288c_e.gff ¤ò½èÍý¤·¤Æ¡¤gene_id ¤È product ¤Î°ìÍ÷ɽ¤òºîÀ®¤¹¤ë¡¥
product ¤Ï URL ¥¨¥ó¥³¡¼¥É¤µ¤ì¤Æ¤¤¤ë¡¥
import urllib.parse
product="glutamate dehydrogenase %28NADP%28%2B%29%29 GDH3"
product=urllib.parse.unquote(product)
print(product) ## glutamate dehydrogenase (NADP(+)) GDH3
gene_0001 seripauperin PAU8
gene_0002 hypothetical protein
gene_0003 putative permease SEO1
gene_0004 hypothetical protein

¤È¤¤¤¦¤³¤È¤Ç¡¢ºî¤ë¡£

import pandas as pd
import numpy as np
import urllib.parse

outputlist = []
fout = open('gene_id_product.tsv', 'w')
with open('s288c_e.gff', 'r') as fin:
    for line in fin:
        product = ''
        geneid = ''
        line=line.rstrip()
        if not line.startswith('#'):
            s=line.split('\t')
            items=s[8].split(';')
            for item in items:
                if item.startswith('product='):
                    product = item.split('=')[1]
                    product = urllib.parse.unquote(product)
                if item.startswith('gene_id='):
                    geneid = item.split('=')[1]
                if (geneid != '') and (product != ''):
                    outputlist.append(geneid+'\t'+product)
outputlist = sorted(list(set(outputlist)))
fout = open('gene_id_product.tsv', 'w')
for u in outputlist:
    print(u, file=fout)
fout.close()

½ÐÎϤÏ

gene_0001	seripauperin PAU8
gene_0002	hypothetical protein
gene_0003	putative permease SEO1
gene_0004	hypothetical protein
gene_0005	hypothetical protein
¡Ê°Ê²¼Î¬¡Ë

¼¡¤Ë¡¢featureCount¤Î½ÐÎϤǤ¢¤ëall.counts.txt¤È¡¢gene_id_product.tsv¤È¤ò¥Þ¡¼¥¸¤¹¤ë¤³¤È¤Ç¡¢all.counts.txt¤Ëproduct¤ÎÆþ¤Ã¤¿É½¤òºî¤ê¡¢count_preprocessed.tsv¤Ë½ñ¤­½Ð¤¹¡£

import pandas as pd
import numpy as np
df = pd.read_table('all.counts.txt', index_col=0, skiprows=1) 
rename_columns = {'SRR453566.sorted.bam': 'batch_1',
    'SRR453567.sorted.bam': 'batch_2',
    'SRR453568.sorted.bam': 'batch_3',
    'SRR453569.sorted.bam': 'chemostat_1',
    'SRR453570.sorted.bam': 'chemostat_2',
    'SRR453571.sorted.bam': 'chemostat_3'}
df = df.rename_axis(mapper=rename_columns, axis=1)
#print(df.head())
df = df[df.Chr != 'NC_001224.1']  # ¥ß¥È¥³¥ó¥É¥ê¥¢(NC_001224.1)¾å¤Î°äÅÁ»Ò¤ò½ü¤¯
gene_products = pd.read_table('gene_id_product.tsv', index_col=0, names=["gene_id", "product"])
#print(gene_products.head())

# ¥Þ¡¼¥¸
#df_with_product = gene_products.join(df)  # Ãͤ¬float¤Ë¤Ê¤ë¡©
df_with_product = df.join(gene_products)
#print(df_with_product.head())
df_count = df_with_product[[\
    "batch_1","batch_2","batch_3","chemostat_1","chemostat_2","chemostat_3"]]
df_count.to_csv('count_raw.tsv', sep='\t')
df_with_product.to_csv('count_preprocessed.tsv', sep='\t')

½ÐÎϤÏ

Geneid	Chr	Start	End	Strand	Length	batch_1	batch_2	batch_3	chemostat_1	chemostat_2	chemostat_3	product
gene_0001	NC_001133.9	1807	2169	-	363	0	2	6	0	0	1	seripauperin PAU8
gene_0002	NC_001133.9	2480	2707	+	228	0	0	0	0	0	0	hypothetical protein
gene_0003	NC_001133.9	7235	9016	-	1782	0	0	0	0	0	0	putative permease SEO1
gene_0004	NC_001133.9	11565	11951	-	387	0	0	0	0	0	0	hypothetical protein
gene_0005	NC_001133.9	12046	12426	+	381	2	8	10	6	7	18	hypothetical protein
¡Ê°Ê²¼Î¬¡Ë

¤³¤³¤«¤éÀµµ¬²½½èÍý¡£¡¡RPM/FPM¡¢RPKM/FPKM¡¢TPM¤ò·×»»¤¹¤ë¡£

FPKM¤ò»È¤Ã¤Æ¡¢batch culture¤Èchemostat¤Î´Ö¤Çȯ¸½ÊÑÆ°¤ÎÂ礭¤¤°äÅÁ»Ò¤òÃê½Ð

%matplotlib inline
import pandas as pd
import numpy as np

def normalize_per_million_reads(df):
    # RPM/FPM = reads per million  fragments per million
    sum_count = df.sum()
    return 10**6 * df / sum_count

def normalize_per_kilobase(df, gene_length):
    # FPKM = fragments per kilobase of exon per million reads mapped
    df_tmp = df.copy()
    df_tmp = (df.T * 10**3 / gene_length).T 
    return df_tmp

def normalize_tpm(df, gene_length):
    # TPM = transcripts per kilobase million   https://bi.biopapyrus.jp/ 
    df_tmp = df.copy()
    df_tmp = normalize_per_kilobase(df_tmp, gene_length)
    df_tmp = normalize_per_million_reads(df_tmp)
    return df_tmp

df_count = pd.read_table('count_raw.tsv', index_col=0)
df_with_product = pd.read_table('count_preprocessed.tsv', index_col=0)
gene_products = pd.read_table('gene_id_product.tsv', index_col=0, names=["gene_id", "product"])

# RPM/FPM = reads per million  fragments per million
df_count_fpm = normalize_per_million_reads(df_count)
#print(df_count_fpm.head())
df_count_fpm.to_csv("count_fpm.tsv", sep="\t")

# RPKM/FPKM
# FPKM = fragments per kilobase of exon per million reads mapped
# single-end¤Î¾ì¹ç¡¢RPKM = reads per kilobase of exon per million reads mapped
gene_length = df_with_product['Length']
df_count_fpkm = normalize_per_kilobase(df_count_fpm, gene_length)
#print(df_count_fpkm.head())
df_count_fpkm.to_csv('count_fpkm.tsv', sep='\t')

# TPM normalization
# TPM = transcripts per kilobase million   https://bi.biopapyrus.jp/ 
df_count_tpm = normalize_tpm(df_count, gene_length)
#print(df_count_tpm.head())
#print(df_count_tpm.sum())
df_count_tpm.to_csv("count_tpm.tsv", sep="\t")

# ȯ¸½ÊÑÆ°¤ÎÂ礭¤¤°äÅÁ»Ò¤ÎÃê½Ð
# batch culture¤ÎÊ¿¶Ñ¤òµá¤á¤ë
df_count_fpkm["batch"] = (df_count_fpkm["batch_1"] + df_count_fpkm["batch_2"] + 
df_count_fpkm["batch_3"]) / 3
# chemostat¤ÎÊ¿¶Ñ¤òµá¤á¤ë
df_count_fpkm["chemostat"] = (df_count_fpkm["chemostat_1"] + 
df_count_fpkm["chemostat_2"] + df_count_fpkm["chemostat_3"]) / 3
# ȯ¸½ÊÑÆ°¤òlog2 fold ¤È¤·¤Æµá¤á¤ë
# 0 ¤Ç¤Î½ü»»¤òËɤ°¤¿¤á¡¢Ê¬Êì¤ËÈù¾¯¤ÊÃͤò²Ã¤¨¤Æ¤¤¤ë
df_count_fpkm["log2fold"] = \
   df_count_fpkm["chemostat"] / (df_count_fpkm["batch"] + 10**-6)
df_count_fpkm["log2fold"] = df_count_fpkm["log2fold"].apply(np.log2)
#print(df_count_fpkm.head())
# ɬÍ×Éôʬ¤Î¤ßÈ´¤­½Ð¤·¡¢product¤È·ë¹ç
diff_ex = df_count_fpkm[["batch", "chemostat", "log2fold"]]
diff_ex = diff_ex.join(gene_products)
# ¥«¥¦¥ó¥È¿ô¤¬0¤Î¥Ç¡¼¥¿¤ò½ü¤¯
diff_ex = diff_ex[diff_ex["batch"] > 0]
diff_ex = diff_ex[diff_ex["chemostat"] > 0]
#print(diff_ex)
diff_ex = diff_ex.sort_values("log2fold", ascending=False)
#print(diff_ex.head())
#print(diff_ex.tail())
diff_ex.to_csv("diff_ex.tsv", sep="\t")

·ë²Ì¤Ï¡¢

Geneid	batch	chemostat	log2fold	product
gene_2989	0.5113545477349243	1609.2291342813207	11.619755389936394	Rgi2p
gene_4740	3.714171096311252	5568.713898594607	10.550087794327904	Sip18p
gene_4667	5.970597813523303	5108.597498100943	9.740835924216602	Spg4p
gene_5965	8.609101140765873	5809.572900018754	9.398353606047285	Gre1p
gene_4237	1.158010357867918	775.5514280425233	9.387429238134654	hypothetical protein
¡Ê°Ê²¼Î¬¡Ë

¤ª¤Þ¤±¡£batch_1¡¢batcn_2¡¢batch_3¡¢chemstat_1, chemstat_2, chemstat_3´Ö¤ÎÁê´Øµ÷Î¥¤ò·×»»¤·¤Æ¡¢³¬ÁØ¥¯¥é¥¹¥¿½èÍý¤ò»È¤Ã¤Æ¥È¥ê¡½É½¸½¤¹¤ë¡£

# TMPÀµµ¬²½¤·¤¿¥Ç¡¼¥¿¤ò¥¯¥é¥¹¥¿¥ê¥ó¥°
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
tpm_t = df_count_tpm.T
print('---')
print(df_count_tpm.head())
from scipy.spatial.distance import pdist
print(tpm_t.shape, pdist(tpm_t, 'correlation'))
linkage_result = linkage(tpm_t, method='average', metric='correlation')
print(linkage_result)
plt.figure(num=None, figsize=(16, 9), dpi=200, facecolor='w', edgecolor='k')
dendrogram(linkage_result, labels=df_count_tpm.columns)
plt.show()

·ë²Ì¤Ï

SRR453566-71-gene-clustering.png

¤µ¤é¤Ë¡¢¿Þ¤òÉÁ¤¯ÏÃ

MA¥×¥í¥Ã¥È

°äÅÁ¸¦»ñÎÁ

MA¥×¥í¥Ã¥È¤È¤Ï¡¢

Wikipedia

¥Þ¥¤¥¯¥í¥¢¥ì¥¤¤Î¥Ç¡¼¥¿¤ÎÂåɽŪ¤Êɽ¼¨ÊýË¡¤Ç¡¢¿Þ·Á¤È¤·¤Æ¤Ï¡¢»¶ÉÛ¿Þ¤ò45Åٲ󞤵¤»¤¿¤è¤¦¤Ê¥¤¥á¡¼¥¸

Z-score ¡Êȯ¸½ÊÑÆ°°äÅÁ»Ò¤òȽÄꤹ¤ë¤â¤¦1¤Ä¤ÎÊýË¡¡Ë – °äÅÁ»Òȯ¸½²òÀϡʥޥ¤¥¯¥í¥¢¥ì¥¤²òÀÏ, RNA-seq¡Ë

MA ¥×¥í¥Ã¥È (MA PLOT) – °äÅÁ»Òȯ¸½²òÀϡʥޥ¤¥¯¥í¥¢¥ì¥¤²òÀÏ, RNA-SEQ¡Ë¤Ë¡¢°Ê²¼¤ÎÀâÌÀ

2¥µ¥ó¥×¥ë¤Î¥Ç¡¼¥¿¤«¤é¡¢M¤ÈA¤ÎÃͤò»»½Ð¤·¤Æ»ÈÍѤ·¤Þ¤¹¡£¤³¤³¤Ç¡¢M ¤Ï¡Ölog2ÊÑ´¹¤µ¤ì¤¿¥·¥°¥Ê¥ëÃͤκ¹¡×¤Ç¤¢¤ê¡¢A ¤Ï¡Ölog2ÊÑ´¹¤µ¤ì¤¿¥·¥°¥Ê¥ëÃͤÎÊ¿¶ÑÃ͡פǤ¹¡£¿ô¼°¤Ç¤Ï¡¢¼¡¤Î¤è¤¦¤Ë½ñ¤±¤Þ¤¹¡£

M = log2(¼Â¸³¥µ¥ó¥×¥ë) – log2(¥³¥ó¥È¥í¡¼¥ë¥µ¥ó¥×¥ë) A = { log2(¼Â¸³¥µ¥ó¥×¥ë) + log2(¥³¥ó¥È¥í¡¼¥ë¥µ¥ó¥×¥ë) } / 2

log2(KO/WT)

M > 1 ¤È¤¤¤¦¤³¤È¤Ï¡¢logFC > 1 ¤Ä¤Þ¤ê¡¢ratio > 2 ¡£¡Êȯ¸½Áý²Ã¡Ë ratio¤Ï¼Â¸³/¥³¥ó¥È¥í¡¼¥ë M = 0 ¤È¤¤¤¦¤³¤È¤Ï¡¢logFC = 0 ¤Ä¤Þ¤ê¡¢ratio = 1¡£¡ÊÊÑÆ°¤Ê¤·¡Ë M < -1 ¤È¤¤¤¦¤³¤È¤Ï¡¢logFC < -1 ¤Ä¤Þ¤ê¡¢ratio < 0.5¡£¡Êȯ¸½¸º¾¯¡Ë

MA¥×¥í¥Ã¥È¤òÉÁ¤¯¥×¥í¥°¥é¥à¡¡RNASaccha/Saccha/°äÅÁ¸¦2-3²Ä»ë²½2-1-MAplot

%matplotlib inline
# TPM count_tpm.tsv¤«¤éMA¥×¥í¥Ã¥È
# MA¥×¥í¥Ã¥È¤È¤Ï  http://array.cell-innovator.com/?p=1772
#   M = log2(¼Â¸³¥µ¥ó¥×¥ë) &#8211; log2(¥³¥ó¥È¥í¡¼¥ë¥µ¥ó¥×¥ë)
#   A =  { log2(¼Â¸³¥µ¥ó¥×¥ë) + log2(¥³¥ó¥È¥í¡¼¥ë¥µ¥ó¥×¥ë) } / 2

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import math
df_tpm = pd.read_table('count_tpm.tsv')[['batch_1', 'chemostat_1']]

u = np.array(list(map(math.log2, df_tpm['chemostat_1']+1)))
v = np.array(list(map(math.log2, df_tpm['batch_1']+1)))
M = u - v
A = (u + v) / 2

df_tpm_log = df_tpm.applymap(lambda x: math.log2(x+1))
df_tpm_log['M'] = df_tpm_log['chemostat_1'] - df_tpm_log['batch_1']
df_tpm_log['A'] = (df_tpm_log['chemostat_1'] + df_tpm_log['batch_1'])/2
df_tpm_log_smaller = df_tpm_log[(df_tpm_log['M']<=2) & (df_tpm_log['M']>=-2)]
df_tpm_log_larger = df_tpm_log[(df_tpm_log['M']>2) | (df_tpm_log['M']<-2)]

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
#plt.scatter(A, M)
df_tpm_log_smaller.plot.scatter(x='A', y='M', ax=ax, label='non-DEG')
df_tpm_log_larger.plot.scatter(x='A', y='M', c='orange', ax=ax, label='DEG')
plt.xlabel('A value')
plt.ylabel('M value')
plt.legend()
plt.show()

ÉÁ¤±¤¿¿Þ¤Ï

SRR453566-71-MAplot.png

PCAʬÀÏ

SRR453566-71-PCA.png
SRR453566-71-PCA-Zscore.png

źÉÕ¥Õ¥¡¥¤¥ë: fileSaccha_all_counts_txt.png 885·ï [¾ÜºÙ] fileSRR453566-71-PCA.png 633·ï [¾ÜºÙ] fileSRR453566-71-PCA-Zscore.png 625·ï [¾ÜºÙ] fileSRR453566-71-MAplot.png 717·ï [¾ÜºÙ] fileSRR453566-71-gene-clustering.png 762·ï [¾ÜºÙ]

¥È¥Ã¥×   ÊÔ½¸ Åà·ë º¹Ê¬ ¥Ð¥Ã¥¯¥¢¥Ã¥× źÉÕ Ê£À½ ̾Á°Êѹ¹ ¥ê¥í¡¼¥É   ¿·µ¬ °ìÍ÷ ñ¸ì¸¡º÷ ºÇ½ª¹¹¿·   ¥Ø¥ë¥×   ºÇ½ª¹¹¿·¤ÎRSS
Last-modified: 2019-09-01 (Æü) 13:26:10 (1305d)