[[Pythonバイオ]] [[Pythonバイオ/ツール]]~
&counter();   &lastmod();~

*Variant Calling [#ab7ead5d]

参照
-2019-04-21 
--ゲノム解析がらみで、2つのVCFファイルの比較。[[サンプル間で共通する変異と固有の変異を抽出する:http://kazumaxneo.hatenablog.com/entry/2017/06/06/165329]]、 GATKを使う方法と、bcftoolsを使う方法と。要確認。

--その手前。[[small indelとSNV検出のワークフロー:http://kazumaxneo.hatenablog.com/entry/2017/05/31/170329]]~
更にその元 (GATKサイト) [[Variant Calling Pipeline: FastQ to Annotated SNPs in Hours:https://gencore.bio.nyu.edu/variant-calling-pipeline/]]


-[[(howto) Discover variants with GATK - A GATK Workshop Tutorial:https://software.broadinstitute.org/gatk/documentation/article?id=7869]]

-2019-04-21 DNA-seq pipeline [[A framework for variation discovery and genotyping using next-generation DNA sequencing data (Nature Genetics 43, 491?498 (2011) doi:10.1038/ng.806):https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083463/]]

-2019-04-21 NIG National Institute of Genetics NIGのパイプラインサービス  [[DNA-seqのパイプライン:http://cell-innovation.nig.ac.jp/maser/Applications/Genome-seq.html]]

-Dry解析教本のp163〜 エクソームシーケンスのデータ解析 ワークフローの図は[[GATK Best Practice:https://software.broadinstitute.org/gatk/best-practices/?v=3]]からの改変だそうな。

-[[Variant Calling Pipeline: FastQ to Annotated SNPs in Hours – The Genomics Core Facility @ NYU Center for Genomics and Systems Biology:https://gencore.bio.nyu.edu/variant-calling-pipeline/]]

-GATKの主張するワークフローは[[GATK workflows:https://github.com/gatk-workflows/]]にリストがある。これおもしろいかも。

いずれにしても、
-クオリティチェック FASTQC
-トリミング Trimmomatic (Best PracticeではここではやらないとDry解析教本に書いてあった。理由はバリアントコール抽出の過程で考慮されるからだそうな。)
-マッピング  ⇒ BAM
-重複リードの除去 picard
-InDelリアラインメント GATK target finding; then realign around InDel
-クオリティの再調整? (Dry解析教本にはある、NYUフローにはない)
-Variant Calling  ⇒ VCF
-(GenoTyping??)
-Extraction
--Extract SNPs; Filter SNPs (GATK) VCF
--Extract Indels; Filter Indels (GATK) VCF
-Predict Variant Effects (snpEFF, NYU)
-アノテーション付与 (Annovar, Dry解析教本)


Duplicate

 gatk MarkDuplicates -I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics \
 -O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE
 
 Using GATK jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
 Running:
    java -Dsamjdk.use_async_io_read_samtools=false 
 -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 
 -jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar MarkDuplicates 
 -I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics 
 -O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE
 16:35:52.607 INFO  NativeLibraryLoader -
  Loading libgkl_compression.so from jar:file:/usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
 [Thu May 16 16:35:52 JST 2019] MarkDuplicates  --INPUT Kishimoto_01_43B_S1.sorted.bam 
 --OUTPUT Kishimoto_01_43B_S1.dedup.sorted.bam --METRICS_FILE Kishimoto_01_43B_S1.duplicated.metrics 
 --REMOVE_DUPLICATES true --ASSUME_SORTED true  --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 
 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false 
 --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false 
 --ADD_PG_TAG_TO_READS true --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates 
 --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> 
 --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 
 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 
 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false 
 --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false 
 --USE_JDK_INFLATER false
 [Thu May 16 16:35:55 JST 2019] Executing as yamanouc@pepper.is.sci.toho-u.ac.jp on Linux 3.10.0-514.2.2.el7.x86_64 amd64; 
   OpenJDK 64-Bit Server VM 1.8.0_111-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available; 
   Picard version: Version:4.1.2.0
 INFO	2019-05-16 16:35:55	MarkDuplicates	Start of doWork freeMemory: 1417478608; totalMemory: 1434976256; maxMemory: 28631367680
 INFO	2019-05-16 16:35:55	MarkDuplicates	Reading input file and constructing read end information.
 INFO	2019-05-16 16:35:55	MarkDuplicates	Will retain up to 103736839 data points before spilling to disk.
 INFO	2019-05-16 16:36:52	MarkDuplicates	Read     1,000,000 records.  Elapsed time: 00:00:52s.  Time for last 1,000,000:   52s.  
   Last read position: AP012030.1:74,603
 INFO	2019-05-16 16:36:52	MarkDuplicates	Tracking 40722 as yet unmatched pairs. 40722 records in RAM.
 INFO	2019-05-16 16:37:29	MarkDuplicates	Read     2,000,000 records.  Elapsed time: 00:01:29s.  Time for last 1,000,000:   37s.  Last read position: AP012030.1:154,447
 INFO	2019-05-16 16:37:29	MarkDuplicates	Tracking 77823 as yet unmatched pairs. 77823 records in RAM.
 INFO	2019-05-16 16:38:24	MarkDuplicates	Read     3,000,000 records.  Elapsed time: 00:02:25s.  Time for last 1,000,000:   55s.  Last read position: AP012030.1:228,168
 INFO	2019-05-16 16:38:24	MarkDuplicates	Tracking 110546 as yet unmatched pairs. 110546 records in RAM.
 INFO	2019-05-16 16:38:53	MarkDuplicates	Read     4,000,000 records.  Elapsed time: 00:02:53s.  Time for last 1,000,000:   28s.  Last read position: AP012030.1:301,562
 INFO	2019-05-16 16:38:53	MarkDuplicates	Tracking 143886 as yet unmatched pairs. 143886 records in RAM.
 INFO	2019-05-16 16:39:37	MarkDuplicates	Read     5,000,000 records.  Elapsed time: 00:03:38s.  Time for last 1,000,000:   44s.  Last read position: AP012030.1:377,173
 INFO	2019-05-16 16:39:37	MarkDuplicates	Tracking 175571 as yet unmatched pairs. 175571 records in RAM.
 INFO	2019-05-16 16:40:06	MarkDuplicates	Read     6,000,000 records.  Elapsed time: 00:04:07s.  Time for last 1,000,000:   29s.  Last read position: AP012030.1:448,403
 INFO	2019-05-16 16:40:06	MarkDuplicates	Tracking 203544 as yet unmatched pairs. 203544 records in RAM.
 INFO	2019-05-16 16:41:02	MarkDuplicates	Read     7,000,000 records.  Elapsed time: 00:05:02s.  Time for last 1,000,000:   55s.  Last read position: AP012030.1:528,920
 INFO	2019-05-16 16:41:02	MarkDuplicates	Tracking 235908 as yet unmatched pairs. 235908 records in RAM.
 INFO	2019-05-16 16:41:26	MarkDuplicates	Read     8,000,000 records.  Elapsed time: 00:05:27s.  Time for last 1,000,000:   24s.  Last read position: AP012030.1:608,145
 INFO	2019-05-16 16:41:26	MarkDuplicates	Tracking 263951 as yet unmatched pairs. 263951 records in RAM.
 INFO	2019-05-16 16:42:03	MarkDuplicates	Read     9,000,000 records.  Elapsed time: 00:06:03s.  Time for last 1,000,000:   36s.  Last read position: AP012030.1:690,374
 INFO	2019-05-16 16:42:03	MarkDuplicates	Tracking 291824 as yet unmatched pairs. 291824 records in RAM.
 
 (中略)
 
 INFO	2019-05-16 17:08:15	MarkDuplicates	Read    59,000,000 records.  Elapsed time: 00:32:16s.  Time for last 1,000,000:   28s.  Last read position: AP012030.1:4,479,193
 INFO	2019-05-16 17:08:15	MarkDuplicates	Tracking 70213 as yet unmatched pairs. 70213 records in RAM.
 INFO	2019-05-16 17:08:44	MarkDuplicates	Tracking 30486 as yet unmatched pairs. 30486 records in RAM.
 INFO	2019-05-16 17:09:06	MarkDuplicates	Read 60734092 records. 0 pairs never matched.
 INFO	2019-05-16 17:10:06	MarkDuplicates	After buildSortedReadEndLists freeMemory: 11654078344; totalMemory: 18848153600; maxMemory: 28631367680
 INFO	2019-05-16 17:10:06	MarkDuplicates	Will retain up to 894730240 duplicate indices before spilling to disk.
 INFO	2019-05-16 17:10:14	MarkDuplicates	Traversing read pair information and detecting duplicates.
 INFO	2019-05-16 17:13:40	MarkDuplicates	Traversing fragment information and detecting duplicates.
 INFO	2019-05-16 17:13:57	MarkDuplicates	Sorting list of duplicate records.
 INFO	2019-05-16 17:14:00	MarkDuplicates	After generateDuplicateIndexes freeMemory: 21992264872; totalMemory: 29286203392; maxMemory: 29286203392
 INFO	2019-05-16 17:14:00	MarkDuplicates	Marking 2183568 records as duplicates.
 INFO	2019-05-16 17:14:00	MarkDuplicates	Found 9424 optical duplicate clusters.
 INFO	2019-05-16 17:14:00	MarkDuplicates	Reads are assumed to be ordered by: coordinate


***2019-05-20 bcftools mpileup | bcftools call [#g3a8f9c0]
[[samtools 使い方 mpileup ( calling SNPs ) & annotation:http://bioinfo-dojo.net/2016/07/13/samtools_calling_snps/#Calling_SNPs-2]] には

 $ samtools mpileup -uf ref.fa myfile.sorted_unique.bam | bcftools call -O b -v -c ->| 
 myfile.sorted_unique.raw.bcf

と書いてある。また
[[samtoolsでvariant call - Qiita:https://qiita.com/motthy/items/38951f127dce26b22ce3]]
にはこう書いている。

 bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> |
 bcftools call -vmO z -o <study.vcf.gz>

でも
 bcftools mpileup -Ou -f AP012030-new.fasta Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam 2_2-1/AP012030/data/reference.bam 2_5-1/AP012030/data/reference.bam 2_5-1-7A/AP012030/data/reference.bam > mytest.vcf
 bcftools call -O v -v -c mytest.vcf > mytest.raw.bcf
や
 bcftools call -vmO z -o mystudy.vcf.gz mytest.vcf
 Could not add dummy header for contig 
 Segmentation fault (コアダンプ)
でうまくいかない。

bcftoolsの[[マニュアル:https://samtools.github.io/bcftools/bcftools.html]]のmpileupにある例では
    bcftools mpileup -Ou -f ref.fa aln.bam | \
    bcftools call -Ou -mv | \
    bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
と書いてある。

ここでは、
 bcftools mpileup -Ou -f AP012030-new.fasta \
   Anc/AP012030/data/reference.bam \
   1_2-1/AP012030/data/reference.bam \
   2_2-1/AP012030/data/reference.bam \
   2_5-1/AP012030/data/reference.bam \
   2_5-1-7A/AP012030/data/reference.bam |
 bcftools call -Ou -mv -o mytest.call.vcf
を試してみる。

この結果を、vcf-annotateでアノテーションを追加して読めるようにする。この手順の出所は[[samtools 使い方 mpileup ( calling SNPs ) & annotation &#8211; バイオインフォ 道場:http://bioinfo-dojo.net/2016/07/13/samtools_calling_snps/#Calling_SNPs-2]]
 bcftools view mytest.call.vcf | vcf-annotate -f + > mytest.call.annnotate.vcf

 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam       2_2-1/AP012030/data/reference.bam       2_5-1/AP012030/data/reference.bam       2_5-1-7A/AP012030/data/reference.bam
 AP012030        301     .       CTTTTTTTTTT     CTTTTTTTTT      61      PASS    INDEL;IDV=211;IMF=0.886555;DP=1191;VDB=0.990724;SGB=-30.4726;MQSB=1;MQ0F=0;ICB=0.425;HOB=0.32;AC=8;AN=10;DP4=178,15,967,31;MQ=20        GT:PL   1/1:31,255,0    1/1:22,216,0    1/1:31,255,0    1/1:16,235,0    0/0:0,157,15
 AP012030        3448    .       C       T       464     PASS    DP=1211;VDB=0.484561;SGB=120.949;RPB=0.750085;MQB=1;MQSB=1;BQB=0.101507;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=478,6,697,8;MQ=20   GT:PL   0/0:0,255,173   0/0:0,255,183   1/1:167,255,0   1/1:189,255,0   1/1:162,255,0
 AP012030        10591   .       C       T       262     PASS    DP=1220;VDB=0.354141;SGB=400.466;RPB=0.300118;MQB=1;MQSB=1;BQB=7.53989e-05;MQ0F=0;ICB=0.117361;HOB=0.48;AC=4;AN=10;DP4=748,5,433,8;MQ=20        GT:PL   0/0:0,255,176   0/0:0,255,169   0/0:0,255,176   1/1:139,255,0   1/1:177,255,0
 AP012030        41097   .       G       A       465     PASS    DP=1222;VDB=0.0336028;SGB=359.679;RPB=0.995592;MQB=1;MQSB=1;BQB=0.533691;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=477,10,705,7;MQ=20 GT:PL   0/0:0,255,178   0/0:0,255,197   1/1:173,255,0   1/1:176,255,0   1/1:170,255,0
 AP012030        44797   .       AGGGGG  AGGGG   340     PASS    INDEL;IDV=234;IMF=0.955102;DP=1231;VDB=0.691086;SGB=115.801;MQSB=1;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=548,7,672,4;MQ=20        GT:PL   0/0:0,255,180   0/0:0,255,186   1/1:134,255,0   1/1:132,255,0   1/1:128,255,0
 AP012030        59093   .       C       T       13.7826 PASS    DP=252;VDB=0.00328272;SGB=-7.51117;RPB=0.00279524;MQB=0.0718584;MQSB=0.999297;BQB=0.00206264;MQ0F=0;ICB=1;HOB=0.166667;AC=3;AN=6;DP4=233,3,5,0;MQ=19    GT:PL   0/0:0,255,175   ./.:0,0,0       ./.:0,0,0       1/1:46,9,0      0/1:6,0,14
 AP012030        59105   .       A       C       69      PASS    DP=260;VDB=0.000538717;SGB=-8.68981;RPB=0.000860322;MQB=0.187859;MQSB=0.999902;BQB=0.0445104;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=237,4,8,0;MQ=20     GT:PL   0/0:0,255,193   ./.:0,0,0       1/1:49,9,0      1/1:46,9,0      0/1:18,0,28
 AP012030        59108   .       T       C       71      PASS    DP=266;VDB=0.000464403;SGB=-8.68981;RPB=0.00435826;MQB=0.19652;MQSB=0.998903;BQB=0.106941;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=238,4,7,1;MQ=19        GT:PL   0/0:0,255,177   ./.:0,0,0       1/1:36,6,0      1/1:46,9,0      0/1:32,0,39

ミソは最後のFORMATの部分が5つのサンプルそれぞれの値を列記した形になること。でもこれをどうやって使えばいいのか?
この出力を、IGVで表示することができる。

さて、サンプル間の変異を比較する方法として[[サンプル間で共通する変異と固有の変異を抽出する - MACでインフォマティクス:http://kazumaxneo.hatenablog.com/entry/2017/06/06/165329]]にはそれぞれのサンプルのBAMを(単独で)VCFに変換し、それらの間の差分を取る方法を提案している。

***VCFファイルをPythonで読み込んで操作する [#qb769039]
pyVCFがパーザーとして使える。

[[PyVCF - A Variant Call Format Parser for Python &#8212; PyVCF 0.6.8 documentation:https://pyvcf.readthedocs.io/en/latest/index.html]]
 
[[API &#8212; PyVCF 0.6.8 documentation:https://pyvcf.readthedocs.io/en/latest/API.html#vcf-model-record]]


pyVCFはnamedTupleという便利っぽいものを使っている。
-2019-05-21 python namedtuple
--[[namedtupleで美しいpythonを書く!(翻訳) - Qiita:https://qiita.com/Seny/items/add4d03876f505442136]]
--[[collections --- コンテナデータ型 &#8212; Python 3.7.3 ドキュメント:https://docs.python.org/ja/3/library/collections.html#collections.namedtuple]]
--一般には、p.fieldnameのようにしてアクセス。
--フィールド名リストは p._fields なので、
--文字列としてフィールド名が与えられている時の値の取り出し getattr(p, 'フィールド名')
 for fieldname in p._fields:
     print(getattr(p, fieldname)

PyVCFを使ってみる。

 import vcf
 # vcf.model &#8212; PyVCF 0.6.8 documentation 
 https://pyvcf.readthedocs.io/en/latest/_modules/vcf/model.html#_Record
 # PyVCF | VCF ファイルをパースする Python ライブラリー https://bi.biopapyrus.jp/python/module/pyvcf.html
 
 vcff = vcf.Reader(open('mytest.call.annnotate.vcf', 'r'))
 for i, record in enumerate(vcff):
     print(record.POS, record.is_snp, record.is_indel, record.var_subtype)
     print('POS:', record.POS, 'SNP:', record.is_snp, 'INDEL:', record.is_indel, 'subtype:', 
 record.var_subtype)
     for sample in record.samples:
         print(sample.sample, sample.data, sample.gt_type)
         print('  ', sample.sample, end=' ')
         #print(sample.sample, sample.data, sample.gt_type)
         #print(sample.data.GT)
         #print(sample.data._fields)
         for skey in sample.data._fields:
             print(getattr(sample.data, skey))
             print(skey, ':',  getattr(sample.data, skey), end=', ')
         print()
     if i>1:
         break
結果は
 POS: 301 SNP: False INDEL: True subtype: del
    Anc/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0], 
    1_2-1/AP012030/data/reference.bam GT : 1/1, PL : [22, 216, 0], 
    2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0], 
    2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [16, 235, 0], 
    2_5-1-7A/AP012030/data/reference.bam GT : 0/0, PL : [0, 157, 15], 
 POS: 3448 SNP: True INDEL: False subtype: ts
    Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 173], 
    1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 183], 
    2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [167, 255, 0], 
    2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [189, 255, 0], 
    2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [162, 255, 0], 
 POS: 10591 SNP: True INDEL: False subtype: ts
    Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176], 
    1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 169], 
    2_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176], 
    2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [139, 255, 0], 
    2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [177, 255, 0],

トップ   編集 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS