Python¥Ð¥¤¥ª?¡¡Python¥Ð¥¤¥ª/¥Ä¡¼¥ë?
3854¡¡¡¡¡¡2019-05-22 (¿å) 15:47:41

Variant Calling

»²¾È

¤¤¤º¤ì¤Ë¤·¤Æ¤â¡¢

Duplicate

gatk MarkDuplicates -I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics \
-O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE

Using GATK jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
   java -Dsamjdk.use_async_io_read_samtools=false 
-Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 
-jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar MarkDuplicates 
-I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics 
-O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE
16:35:52.607 INFO  NativeLibraryLoader -
 Loading libgkl_compression.so from jar:file:/usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu May 16 16:35:52 JST 2019] MarkDuplicates  --INPUT Kishimoto_01_43B_S1.sorted.bam 
--OUTPUT Kishimoto_01_43B_S1.dedup.sorted.bam --METRICS_FILE Kishimoto_01_43B_S1.duplicated.metrics 
--REMOVE_DUPLICATES true --ASSUME_SORTED true  --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 
--MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false 
--REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false 
--ADD_PG_TAG_TO_READS true --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates 
--PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> 
--OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 
--VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 
--MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false 
--GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false 
--USE_JDK_INFLATER false
[Thu May 16 16:35:55 JST 2019] Executing as yamanouc@pepper.is.sci.toho-u.ac.jp on Linux 3.10.0-514.2.2.el7.x86_64 amd64; 
  OpenJDK 64-Bit Server VM 1.8.0_111-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available; 
  Picard version: Version:4.1.2.0
INFO	2019-05-16 16:35:55	MarkDuplicates	Start of doWork freeMemory: 1417478608; totalMemory: 1434976256; maxMemory: 28631367680
INFO	2019-05-16 16:35:55	MarkDuplicates	Reading input file and constructing read end information.
INFO	2019-05-16 16:35:55	MarkDuplicates	Will retain up to 103736839 data points before spilling to disk.
INFO	2019-05-16 16:36:52	MarkDuplicates	Read     1,000,000 records.  Elapsed time: 00:00:52s.  Time for last 1,000,000:   52s.  
  Last read position: AP012030.1:74,603
INFO	2019-05-16 16:36:52	MarkDuplicates	Tracking 40722 as yet unmatched pairs. 40722 records in RAM.
INFO	2019-05-16 16:37:29	MarkDuplicates	Read     2,000,000 records.  Elapsed time: 00:01:29s.  Time for last 1,000,000:   37s.  Last read position: AP012030.1:154,447
INFO	2019-05-16 16:37:29	MarkDuplicates	Tracking 77823 as yet unmatched pairs. 77823 records in RAM.
INFO	2019-05-16 16:38:24	MarkDuplicates	Read     3,000,000 records.  Elapsed time: 00:02:25s.  Time for last 1,000,000:   55s.  Last read position: AP012030.1:228,168
INFO	2019-05-16 16:38:24	MarkDuplicates	Tracking 110546 as yet unmatched pairs. 110546 records in RAM.
INFO	2019-05-16 16:38:53	MarkDuplicates	Read     4,000,000 records.  Elapsed time: 00:02:53s.  Time for last 1,000,000:   28s.  Last read position: AP012030.1:301,562
INFO	2019-05-16 16:38:53	MarkDuplicates	Tracking 143886 as yet unmatched pairs. 143886 records in RAM.
INFO	2019-05-16 16:39:37	MarkDuplicates	Read     5,000,000 records.  Elapsed time: 00:03:38s.  Time for last 1,000,000:   44s.  Last read position: AP012030.1:377,173
INFO	2019-05-16 16:39:37	MarkDuplicates	Tracking 175571 as yet unmatched pairs. 175571 records in RAM.
INFO	2019-05-16 16:40:06	MarkDuplicates	Read     6,000,000 records.  Elapsed time: 00:04:07s.  Time for last 1,000,000:   29s.  Last read position: AP012030.1:448,403
INFO	2019-05-16 16:40:06	MarkDuplicates	Tracking 203544 as yet unmatched pairs. 203544 records in RAM.
INFO	2019-05-16 16:41:02	MarkDuplicates	Read     7,000,000 records.  Elapsed time: 00:05:02s.  Time for last 1,000,000:   55s.  Last read position: AP012030.1:528,920
INFO	2019-05-16 16:41:02	MarkDuplicates	Tracking 235908 as yet unmatched pairs. 235908 records in RAM.
INFO	2019-05-16 16:41:26	MarkDuplicates	Read     8,000,000 records.  Elapsed time: 00:05:27s.  Time for last 1,000,000:   24s.  Last read position: AP012030.1:608,145
INFO	2019-05-16 16:41:26	MarkDuplicates	Tracking 263951 as yet unmatched pairs. 263951 records in RAM.
INFO	2019-05-16 16:42:03	MarkDuplicates	Read     9,000,000 records.  Elapsed time: 00:06:03s.  Time for last 1,000,000:   36s.  Last read position: AP012030.1:690,374
INFO	2019-05-16 16:42:03	MarkDuplicates	Tracking 291824 as yet unmatched pairs. 291824 records in RAM.

¡ÊÃæÎ¬¡Ë

INFO	2019-05-16 17:08:15	MarkDuplicates	Read    59,000,000 records.  Elapsed time: 00:32:16s.  Time for last 1,000,000:   28s.  Last read position: AP012030.1:4,479,193
INFO	2019-05-16 17:08:15	MarkDuplicates	Tracking 70213 as yet unmatched pairs. 70213 records in RAM.
INFO	2019-05-16 17:08:44	MarkDuplicates	Tracking 30486 as yet unmatched pairs. 30486 records in RAM.
INFO	2019-05-16 17:09:06	MarkDuplicates	Read 60734092 records. 0 pairs never matched.
INFO	2019-05-16 17:10:06	MarkDuplicates	After buildSortedReadEndLists freeMemory: 11654078344; totalMemory: 18848153600; maxMemory: 28631367680
INFO	2019-05-16 17:10:06	MarkDuplicates	Will retain up to 894730240 duplicate indices before spilling to disk.
INFO	2019-05-16 17:10:14	MarkDuplicates	Traversing read pair information and detecting duplicates.
INFO	2019-05-16 17:13:40	MarkDuplicates	Traversing fragment information and detecting duplicates.
INFO	2019-05-16 17:13:57	MarkDuplicates	Sorting list of duplicate records.
INFO	2019-05-16 17:14:00	MarkDuplicates	After generateDuplicateIndexes freeMemory: 21992264872; totalMemory: 29286203392; maxMemory: 29286203392
INFO	2019-05-16 17:14:00	MarkDuplicates	Marking 2183568 records as duplicates.
INFO	2019-05-16 17:14:00	MarkDuplicates	Found 9424 optical duplicate clusters.
INFO	2019-05-16 17:14:00	MarkDuplicates	Reads are assumed to be ordered by: coordinate

2019-05-20 bcftools mpileup | bcftools call

samtools »È¤¤Êý mpileup ( calling SNPs ) & annotation¡¡¤Ë¤Ï

$ samtools mpileup -uf ref.fa myfile.sorted_unique.bam | bcftools call -O b -v -c ->| 
myfile.sorted_unique.raw.bcf

¤È½ñ¤¤¤Æ¤¢¤ë¡£¤Þ¤¿ samtools¤Çvariant call - Qiita ¤Ë¤Ï¤³¤¦½ñ¤¤¤Æ¤¤¤ë¡£

bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> |
bcftools call -vmO z -o <study.vcf.gz>

¤Ç¤â

bcftools mpileup -Ou -f AP012030-new.fasta Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam 2_2-1/AP012030/data/reference.bam 2_5-1/AP012030/data/reference.bam 2_5-1-7A/AP012030/data/reference.bam > mytest.vcf
bcftools call -O v -v -c mytest.vcf > mytest.raw.bcf

¤ä

bcftools call -vmO z -o mystudy.vcf.gz mytest.vcf
Could not add dummy header for contig 
Segmentation fault (¥³¥¢¥À¥ó¥×)

¤Ç¤¦¤Þ¤¯¤¤¤«¤Ê¤¤¡£

bcftools¤Î¥Þ¥Ë¥å¥¢¥ë¤Împileup¤Ë¤¢¤ëÎã¤Ç¤Ï

   bcftools mpileup -Ou -f ref.fa aln.bam | \
   bcftools call -Ou -mv | \
   bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf

¤È½ñ¤¤¤Æ¤¢¤ë¡£

¤³¤³¤Ç¤Ï¡¢

bcftools mpileup -Ou -f AP012030-new.fasta \
  Anc/AP012030/data/reference.bam \
  1_2-1/AP012030/data/reference.bam \
  2_2-1/AP012030/data/reference.bam \
  2_5-1/AP012030/data/reference.bam \
  2_5-1-7A/AP012030/data/reference.bam |
bcftools call -Ou -mv -o mytest.call.vcf

¤ò»î¤·¤Æ¤ß¤ë¡£

¤³¤Î·ë²Ì¤ò¡¢vcf-annotate¤Ç¥¢¥Î¥Æ¡¼¥·¥ç¥ó¤òÄɲä·¤ÆÆÉ¤á¤ë¤è¤¦¤Ë¤¹¤ë¡£¤³¤Î¼ê½ç¤Î½Ð½ê¤Ïsamtools »È¤¤Êý mpileup ( calling SNPs ) & annotation – ¥Ð¥¤¥ª¥¤¥ó¥Õ¥© Æ»¾ì

bcftools view mytest.call.vcf | vcf-annotate -f + > mytest.call.annnotate.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam       2_2-1/AP012030/data/reference.bam       2_5-1/AP012030/data/reference.bam       2_5-1-7A/AP012030/data/reference.bam
AP012030        301     .       CTTTTTTTTTT     CTTTTTTTTT      61      PASS    INDEL;IDV=211;IMF=0.886555;DP=1191;VDB=0.990724;SGB=-30.4726;MQSB=1;MQ0F=0;ICB=0.425;HOB=0.32;AC=8;AN=10;DP4=178,15,967,31;MQ=20        GT:PL   1/1:31,255,0    1/1:22,216,0    1/1:31,255,0    1/1:16,235,0    0/0:0,157,15
AP012030        3448    .       C       T       464     PASS    DP=1211;VDB=0.484561;SGB=120.949;RPB=0.750085;MQB=1;MQSB=1;BQB=0.101507;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=478,6,697,8;MQ=20   GT:PL   0/0:0,255,173   0/0:0,255,183   1/1:167,255,0   1/1:189,255,0   1/1:162,255,0
AP012030        10591   .       C       T       262     PASS    DP=1220;VDB=0.354141;SGB=400.466;RPB=0.300118;MQB=1;MQSB=1;BQB=7.53989e-05;MQ0F=0;ICB=0.117361;HOB=0.48;AC=4;AN=10;DP4=748,5,433,8;MQ=20        GT:PL   0/0:0,255,176   0/0:0,255,169   0/0:0,255,176   1/1:139,255,0   1/1:177,255,0
AP012030        41097   .       G       A       465     PASS    DP=1222;VDB=0.0336028;SGB=359.679;RPB=0.995592;MQB=1;MQSB=1;BQB=0.533691;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=477,10,705,7;MQ=20 GT:PL   0/0:0,255,178   0/0:0,255,197   1/1:173,255,0   1/1:176,255,0   1/1:170,255,0
AP012030        44797   .       AGGGGG  AGGGG   340     PASS    INDEL;IDV=234;IMF=0.955102;DP=1231;VDB=0.691086;SGB=115.801;MQSB=1;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=548,7,672,4;MQ=20        GT:PL   0/0:0,255,180   0/0:0,255,186   1/1:134,255,0   1/1:132,255,0   1/1:128,255,0
AP012030        59093   .       C       T       13.7826 PASS    DP=252;VDB=0.00328272;SGB=-7.51117;RPB=0.00279524;MQB=0.0718584;MQSB=0.999297;BQB=0.00206264;MQ0F=0;ICB=1;HOB=0.166667;AC=3;AN=6;DP4=233,3,5,0;MQ=19    GT:PL   0/0:0,255,175   ./.:0,0,0       ./.:0,0,0       1/1:46,9,0      0/1:6,0,14
AP012030        59105   .       A       C       69      PASS    DP=260;VDB=0.000538717;SGB=-8.68981;RPB=0.000860322;MQB=0.187859;MQSB=0.999902;BQB=0.0445104;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=237,4,8,0;MQ=20     GT:PL   0/0:0,255,193   ./.:0,0,0       1/1:49,9,0      1/1:46,9,0      0/1:18,0,28
AP012030        59108   .       T       C       71      PASS    DP=266;VDB=0.000464403;SGB=-8.68981;RPB=0.00435826;MQB=0.19652;MQSB=0.998903;BQB=0.106941;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=238,4,7,1;MQ=19        GT:PL   0/0:0,255,177   ./.:0,0,0       1/1:36,6,0      1/1:46,9,0      0/1:32,0,39

¥ß¥½¤ÏºÇ¸å¤ÎFORMAT¤ÎÉôʬ¤¬£µ¤Ä¤Î¥µ¥ó¥×¥ë¤½¤ì¤¾¤ì¤ÎÃͤòÎóµ­¤·¤¿·Á¤Ë¤Ê¤ë¤³¤È¡£¤Ç¤â¤³¤ì¤ò¤É¤¦¤ä¤Ã¤Æ»È¤¨¤Ð¤¤¤¤¤Î¤«¡© ¤³¤Î½ÐÎϤò¡¢IGV¤Çɽ¼¨¤¹¤ë¤³¤È¤¬¤Ç¤­¤ë¡£

¤µ¤Æ¡¢¥µ¥ó¥×¥ë´Ö¤ÎÊѰۤòÈæ³Ó¤¹¤ëÊýË¡¤È¤·¤Æ¥µ¥ó¥×¥ë´Ö¤Ç¶¦Ä̤¹¤ëÊѰۤȸÇÍ­¤ÎÊѰۤòÃê½Ð¤¹¤ë - MAC¤Ç¥¤¥ó¥Õ¥©¥Þ¥Æ¥£¥¯¥¹¤Ë¤Ï¤½¤ì¤¾¤ì¤Î¥µ¥ó¥×¥ë¤ÎBAM¤ò¡ÊñÆÈ¤Ç¡ËVCF¤ËÊÑ´¹¤·¡¢¤½¤ì¤é¤Î´Ö¤Îº¹Ê¬¤ò¼è¤ëÊýË¡¤òÄ󰯤·¤Æ¤¤¤ë¡£

VCF¥Õ¥¡¥¤¥ë¤òPython¤ÇÆÉ¤ß¹þ¤ó¤ÇÁàºî¤¹¤ë

pyVCF¤¬¥Ñ¡¼¥¶¡¼¤È¤·¤Æ»È¤¨¤ë¡£

PyVCF - A Variant Call Format Parser for Python — PyVCF 0.6.8 documentation ¡¡ API — PyVCF 0.6.8 documentation

pyVCF¤ÏnamedTuple¤È¤¤¤¦ÊØÍø¤Ã¤Ý¤¤¤â¤Î¤ò»È¤Ã¤Æ¤¤¤ë¡£

PyVCF¤ò»È¤Ã¤Æ¤ß¤ë¡£

import vcf
# vcf.model &#8212; PyVCF 0.6.8 documentation 
https://pyvcf.readthedocs.io/en/latest/_modules/vcf/model.html#_Record
# PyVCF | VCF ¥Õ¥¡¥¤¥ë¤ò¥Ñ¡¼¥¹¤¹¤ë Python ¥é¥¤¥Ö¥é¥ê¡¼ https://bi.biopapyrus.jp/python/module/pyvcf.html

vcff = vcf.Reader(open('mytest.call.annnotate.vcf', 'r'))
for i, record in enumerate(vcff):
    print('POS:', record.POS, 'SNP:', record.is_snp, 'INDEL:', record.is_indel, 'subtype:', 
record.var_subtype)
    for sample in record.samples:
        print('  ', sample.sample, end=' ')
        #print(sample.sample, sample.data, sample.gt_type)
        #print(sample.data.GT)
        #print(sample.data._fields)
        for skey in sample.data._fields:
            print(skey, ':',  getattr(sample.data, skey), end=', ')
        print()
    if i>1:
        break

·ë²Ì¤Ï

POS: 301 SNP: False INDEL: True subtype: del
   Anc/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0], 
   1_2-1/AP012030/data/reference.bam GT : 1/1, PL : [22, 216, 0], 
   2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0], 
   2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [16, 235, 0], 
   2_5-1-7A/AP012030/data/reference.bam GT : 0/0, PL : [0, 157, 15], 
POS: 3448 SNP: True INDEL: False subtype: ts
   Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 173], 
   1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 183], 
   2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [167, 255, 0], 
   2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [189, 255, 0], 
   2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [162, 255, 0], 
POS: 10591 SNP: True INDEL: False subtype: ts
   Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176], 
   1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 169], 
   2_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176], 
   2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [139, 255, 0], 
   2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [177, 255, 0],

¥È¥Ã¥×   ÊÔ½¸ Åà·ë º¹Ê¬ ¥Ð¥Ã¥¯¥¢¥Ã¥× źÉÕ Ê£À½ ̾Á°Êѹ¹ ¥ê¥í¡¼¥É   ¿·µ¬ °ìÍ÷ ñ¸ì¸¡º÷ ºÇ½ª¹¹¿·   ¥Ø¥ë¥×   ºÇ½ª¹¹¿·¤ÎRSS
Last-modified: 2019-05-22 (¿å) 15:47:41 (1407d)