Pythonバイオ? Pythonバイオ/ツール?
45   2019-05-22 (水) 15:47:41

Variant Calling

参照

いずれにしても、

Duplicate

gatk MarkDuplicates -I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics \
-O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE

Using GATK jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
   java -Dsamjdk.use_async_io_read_samtools=false 
-Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 
-jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar MarkDuplicates 
-I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics 
-O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE
16:35:52.607 INFO  NativeLibraryLoader -
 Loading libgkl_compression.so from jar:file:/usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu May 16 16:35:52 JST 2019] MarkDuplicates  --INPUT Kishimoto_01_43B_S1.sorted.bam 
--OUTPUT Kishimoto_01_43B_S1.dedup.sorted.bam --METRICS_FILE Kishimoto_01_43B_S1.duplicated.metrics 
--REMOVE_DUPLICATES true --ASSUME_SORTED true  --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 
--MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false 
--REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false 
--ADD_PG_TAG_TO_READS true --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates 
--PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> 
--OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 
--VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 
--MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false 
--GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false 
--USE_JDK_INFLATER false
[Thu May 16 16:35:55 JST 2019] Executing as yamanouc@pepper.is.sci.toho-u.ac.jp on Linux 3.10.0-514.2.2.el7.x86_64 amd64; 
  OpenJDK 64-Bit Server VM 1.8.0_111-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available; 
  Picard version: Version:4.1.2.0
INFO	2019-05-16 16:35:55	MarkDuplicates	Start of doWork freeMemory: 1417478608; totalMemory: 1434976256; maxMemory: 28631367680
INFO	2019-05-16 16:35:55	MarkDuplicates	Reading input file and constructing read end information.
INFO	2019-05-16 16:35:55	MarkDuplicates	Will retain up to 103736839 data points before spilling to disk.
INFO	2019-05-16 16:36:52	MarkDuplicates	Read     1,000,000 records.  Elapsed time: 00:00:52s.  Time for last 1,000,000:   52s.  
  Last read position: AP012030.1:74,603
INFO	2019-05-16 16:36:52	MarkDuplicates	Tracking 40722 as yet unmatched pairs. 40722 records in RAM.
INFO	2019-05-16 16:37:29	MarkDuplicates	Read     2,000,000 records.  Elapsed time: 00:01:29s.  Time for last 1,000,000:   37s.  Last read position: AP012030.1:154,447
INFO	2019-05-16 16:37:29	MarkDuplicates	Tracking 77823 as yet unmatched pairs. 77823 records in RAM.
INFO	2019-05-16 16:38:24	MarkDuplicates	Read     3,000,000 records.  Elapsed time: 00:02:25s.  Time for last 1,000,000:   55s.  Last read position: AP012030.1:228,168
INFO	2019-05-16 16:38:24	MarkDuplicates	Tracking 110546 as yet unmatched pairs. 110546 records in RAM.
INFO	2019-05-16 16:38:53	MarkDuplicates	Read     4,000,000 records.  Elapsed time: 00:02:53s.  Time for last 1,000,000:   28s.  Last read position: AP012030.1:301,562
INFO	2019-05-16 16:38:53	MarkDuplicates	Tracking 143886 as yet unmatched pairs. 143886 records in RAM.
INFO	2019-05-16 16:39:37	MarkDuplicates	Read     5,000,000 records.  Elapsed time: 00:03:38s.  Time for last 1,000,000:   44s.  Last read position: AP012030.1:377,173
INFO	2019-05-16 16:39:37	MarkDuplicates	Tracking 175571 as yet unmatched pairs. 175571 records in RAM.
INFO	2019-05-16 16:40:06	MarkDuplicates	Read     6,000,000 records.  Elapsed time: 00:04:07s.  Time for last 1,000,000:   29s.  Last read position: AP012030.1:448,403
INFO	2019-05-16 16:40:06	MarkDuplicates	Tracking 203544 as yet unmatched pairs. 203544 records in RAM.
INFO	2019-05-16 16:41:02	MarkDuplicates	Read     7,000,000 records.  Elapsed time: 00:05:02s.  Time for last 1,000,000:   55s.  Last read position: AP012030.1:528,920
INFO	2019-05-16 16:41:02	MarkDuplicates	Tracking 235908 as yet unmatched pairs. 235908 records in RAM.
INFO	2019-05-16 16:41:26	MarkDuplicates	Read     8,000,000 records.  Elapsed time: 00:05:27s.  Time for last 1,000,000:   24s.  Last read position: AP012030.1:608,145
INFO	2019-05-16 16:41:26	MarkDuplicates	Tracking 263951 as yet unmatched pairs. 263951 records in RAM.
INFO	2019-05-16 16:42:03	MarkDuplicates	Read     9,000,000 records.  Elapsed time: 00:06:03s.  Time for last 1,000,000:   36s.  Last read position: AP012030.1:690,374
INFO	2019-05-16 16:42:03	MarkDuplicates	Tracking 291824 as yet unmatched pairs. 291824 records in RAM.

(中略)

INFO	2019-05-16 17:08:15	MarkDuplicates	Read    59,000,000 records.  Elapsed time: 00:32:16s.  Time for last 1,000,000:   28s.  Last read position: AP012030.1:4,479,193
INFO	2019-05-16 17:08:15	MarkDuplicates	Tracking 70213 as yet unmatched pairs. 70213 records in RAM.
INFO	2019-05-16 17:08:44	MarkDuplicates	Tracking 30486 as yet unmatched pairs. 30486 records in RAM.
INFO	2019-05-16 17:09:06	MarkDuplicates	Read 60734092 records. 0 pairs never matched.
INFO	2019-05-16 17:10:06	MarkDuplicates	After buildSortedReadEndLists freeMemory: 11654078344; totalMemory: 18848153600; maxMemory: 28631367680
INFO	2019-05-16 17:10:06	MarkDuplicates	Will retain up to 894730240 duplicate indices before spilling to disk.
INFO	2019-05-16 17:10:14	MarkDuplicates	Traversing read pair information and detecting duplicates.
INFO	2019-05-16 17:13:40	MarkDuplicates	Traversing fragment information and detecting duplicates.
INFO	2019-05-16 17:13:57	MarkDuplicates	Sorting list of duplicate records.
INFO	2019-05-16 17:14:00	MarkDuplicates	After generateDuplicateIndexes freeMemory: 21992264872; totalMemory: 29286203392; maxMemory: 29286203392
INFO	2019-05-16 17:14:00	MarkDuplicates	Marking 2183568 records as duplicates.
INFO	2019-05-16 17:14:00	MarkDuplicates	Found 9424 optical duplicate clusters.
INFO	2019-05-16 17:14:00	MarkDuplicates	Reads are assumed to be ordered by: coordinate

2019-05-20 bcftools mpileup | bcftools call

samtools 使い方 mpileup ( calling SNPs ) & annotation には

$ samtools mpileup -uf ref.fa myfile.sorted_unique.bam | bcftools call -O b -v -c ->| 
myfile.sorted_unique.raw.bcf

と書いてある。また samtoolsでvariant call - Qiita にはこう書いている。

bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> |
bcftools call -vmO z -o <study.vcf.gz>

でも

bcftools mpileup -Ou -f AP012030-new.fasta Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam 2_2-1/AP012030/data/reference.bam 2_5-1/AP012030/data/reference.bam 2_5-1-7A/AP012030/data/reference.bam > mytest.vcf
bcftools call -O v -v -c mytest.vcf > mytest.raw.bcf

bcftools call -vmO z -o mystudy.vcf.gz mytest.vcf
Could not add dummy header for contig 
Segmentation fault (コアダンプ)

でうまくいかない。

bcftoolsのマニュアルのmpileupにある例では

   bcftools mpileup -Ou -f ref.fa aln.bam | \
   bcftools call -Ou -mv | \
   bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf

と書いてある。

ここでは、

bcftools mpileup -Ou -f AP012030-new.fasta \
  Anc/AP012030/data/reference.bam \
  1_2-1/AP012030/data/reference.bam \
  2_2-1/AP012030/data/reference.bam \
  2_5-1/AP012030/data/reference.bam \
  2_5-1-7A/AP012030/data/reference.bam |
bcftools call -Ou -mv -o mytest.call.vcf

を試してみる。

この結果を、vcf-annotateでアノテーションを追加して読めるようにする。この手順の出所はsamtools 使い方 mpileup ( calling SNPs ) & annotation – バイオインフォ 道場

bcftools view mytest.call.vcf | vcf-annotate -f + > mytest.call.annnotate.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam       2_2-1/AP012030/data/reference.bam       2_5-1/AP012030/data/reference.bam       2_5-1-7A/AP012030/data/reference.bam
AP012030        301     .       CTTTTTTTTTT     CTTTTTTTTT      61      PASS    INDEL;IDV=211;IMF=0.886555;DP=1191;VDB=0.990724;SGB=-30.4726;MQSB=1;MQ0F=0;ICB=0.425;HOB=0.32;AC=8;AN=10;DP4=178,15,967,31;MQ=20        GT:PL   1/1:31,255,0    1/1:22,216,0    1/1:31,255,0    1/1:16,235,0    0/0:0,157,15
AP012030        3448    .       C       T       464     PASS    DP=1211;VDB=0.484561;SGB=120.949;RPB=0.750085;MQB=1;MQSB=1;BQB=0.101507;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=478,6,697,8;MQ=20   GT:PL   0/0:0,255,173   0/0:0,255,183   1/1:167,255,0   1/1:189,255,0   1/1:162,255,0
AP012030        10591   .       C       T       262     PASS    DP=1220;VDB=0.354141;SGB=400.466;RPB=0.300118;MQB=1;MQSB=1;BQB=7.53989e-05;MQ0F=0;ICB=0.117361;HOB=0.48;AC=4;AN=10;DP4=748,5,433,8;MQ=20        GT:PL   0/0:0,255,176   0/0:0,255,169   0/0:0,255,176   1/1:139,255,0   1/1:177,255,0
AP012030        41097   .       G       A       465     PASS    DP=1222;VDB=0.0336028;SGB=359.679;RPB=0.995592;MQB=1;MQSB=1;BQB=0.533691;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=477,10,705,7;MQ=20 GT:PL   0/0:0,255,178   0/0:0,255,197   1/1:173,255,0   1/1:176,255,0   1/1:170,255,0
AP012030        44797   .       AGGGGG  AGGGG   340     PASS    INDEL;IDV=234;IMF=0.955102;DP=1231;VDB=0.691086;SGB=115.801;MQSB=1;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=548,7,672,4;MQ=20        GT:PL   0/0:0,255,180   0/0:0,255,186   1/1:134,255,0   1/1:132,255,0   1/1:128,255,0
AP012030        59093   .       C       T       13.7826 PASS    DP=252;VDB=0.00328272;SGB=-7.51117;RPB=0.00279524;MQB=0.0718584;MQSB=0.999297;BQB=0.00206264;MQ0F=0;ICB=1;HOB=0.166667;AC=3;AN=6;DP4=233,3,5,0;MQ=19    GT:PL   0/0:0,255,175   ./.:0,0,0       ./.:0,0,0       1/1:46,9,0      0/1:6,0,14
AP012030        59105   .       A       C       69      PASS    DP=260;VDB=0.000538717;SGB=-8.68981;RPB=0.000860322;MQB=0.187859;MQSB=0.999902;BQB=0.0445104;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=237,4,8,0;MQ=20     GT:PL   0/0:0,255,193   ./.:0,0,0       1/1:49,9,0      1/1:46,9,0      0/1:18,0,28
AP012030        59108   .       T       C       71      PASS    DP=266;VDB=0.000464403;SGB=-8.68981;RPB=0.00435826;MQB=0.19652;MQSB=0.998903;BQB=0.106941;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=238,4,7,1;MQ=19        GT:PL   0/0:0,255,177   ./.:0,0,0       1/1:36,6,0      1/1:46,9,0      0/1:32,0,39

ミソは最後のFORMATの部分が5つのサンプルそれぞれの値を列記した形になること。でもこれをどうやって使えばいいのか? この出力を、IGVで表示することができる。

さて、サンプル間の変異を比較する方法としてサンプル間で共通する変異と固有の変異を抽出する - MACでインフォマティクスにはそれぞれのサンプルのBAMを(単独で)VCFに変換し、それらの間の差分を取る方法を提案している。

VCFファイルをPythonで読み込んで操作する

pyVCFがパーザーとして使える。

PyVCF - A Variant Call Format Parser for Python — PyVCF 0.6.8 documentation   API — PyVCF 0.6.8 documentation

pyVCFはnamedTupleという便利っぽいものを使っている。

PyVCFを使ってみる。

import vcf
# vcf.model &#8212; PyVCF 0.6.8 documentation 
https://pyvcf.readthedocs.io/en/latest/_modules/vcf/model.html#_Record
# PyVCF | VCF ファイルをパースする Python ライブラリー https://bi.biopapyrus.jp/python/module/pyvcf.html

vcff = vcf.Reader(open('mytest.call.annnotate.vcf', 'r'))
for i, record in enumerate(vcff):
    print('POS:', record.POS, 'SNP:', record.is_snp, 'INDEL:', record.is_indel, 'subtype:', 
record.var_subtype)
    for sample in record.samples:
        print('  ', sample.sample, end=' ')
        #print(sample.sample, sample.data, sample.gt_type)
        #print(sample.data.GT)
        #print(sample.data._fields)
        for skey in sample.data._fields:
            print(skey, ':',  getattr(sample.data, skey), end=', ')
        print()
    if i>1:
        break

結果は

POS: 301 SNP: False INDEL: True subtype: del
   Anc/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0], 
   1_2-1/AP012030/data/reference.bam GT : 1/1, PL : [22, 216, 0], 
   2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0], 
   2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [16, 235, 0], 
   2_5-1-7A/AP012030/data/reference.bam GT : 0/0, PL : [0, 157, 15], 
POS: 3448 SNP: True INDEL: False subtype: ts
   Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 173], 
   1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 183], 
   2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [167, 255, 0], 
   2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [189, 255, 0], 
   2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [162, 255, 0], 
POS: 10591 SNP: True INDEL: False subtype: ts
   Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176], 
   1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 169], 
   2_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176], 
   2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [139, 255, 0], 
   2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [177, 255, 0],

トップ   編集 凍結 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS
Last-modified: 2019-05-22 (水) 15:47:41 (3d)