[
トップ
] [
新規
|
一覧
|
単語検索
|
最終更新
|
ヘルプ
]
開始行:
[[Pythonバイオ]] [[Pythonバイオ/ツール]]~
&counter(); &lastmod();~
*Variant Calling [#ab7ead5d]
参照
-2019-04-21
--ゲノム解析がらみで、2つのVCFファイルの比較。[[サンプル間で共通する変異と固有の変異を抽出する:http://kazumaxneo.hatenablog.com/entry/2017/06/06/165329]]、 GATKを使う方法と、bcftoolsを使う方法と。要確認。
--その手前。[[small indelとSNV検出のワークフロー:http://kazumaxneo.hatenablog.com/entry/2017/05/31/170329]]~
更にその元 (GATKサイト) [[Variant Calling Pipeline: FastQ to Annotated SNPs in Hours:https://gencore.bio.nyu.edu/variant-calling-pipeline/]]
-[[(howto) Discover variants with GATK - A GATK Workshop Tutorial:https://software.broadinstitute.org/gatk/documentation/article?id=7869]]
-2019-04-21 DNA-seq pipeline [[A framework for variation discovery and genotyping using next-generation DNA sequencing data (Nature Genetics 43, 491?498 (2011) doi:10.1038/ng.806):https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083463/]]
-2019-04-21 NIG National Institute of Genetics NIGのパイプラインサービス [[DNA-seqのパイプライン:http://cell-innovation.nig.ac.jp/maser/Applications/Genome-seq.html]]
-Dry解析教本のp163〜 エクソームシーケンスのデータ解析 ワークフローの図は[[GATK Best Practice:https://software.broadinstitute.org/gatk/best-practices/?v=3]]からの改変だそうな。
-[[Variant Calling Pipeline: FastQ to Annotated SNPs in Hours – The Genomics Core Facility @ NYU Center for Genomics and Systems Biology:https://gencore.bio.nyu.edu/variant-calling-pipeline/]]
-GATKの主張するワークフローは[[GATK workflows:https://github.com/gatk-workflows/]]にリストがある。これおもしろいかも。
いずれにしても、
-クオリティチェック FASTQC
-トリミング Trimmomatic (Best PracticeではここではやらないとDry解析教本に書いてあった。理由はバリアントコール抽出の過程で考慮されるからだそうな。)
-マッピング ⇒ BAM
-重複リードの除去 picard
-InDelリアラインメント GATK target finding; then realign around InDel
-クオリティの再調整? (Dry解析教本にはある、NYUフローにはない)
-Variant Calling ⇒ VCF
-(GenoTyping??)
-Extraction
--Extract SNPs; Filter SNPs (GATK) VCF
--Extract Indels; Filter Indels (GATK) VCF
-Predict Variant Effects (snpEFF, NYU)
-アノテーション付与 (Annovar, Dry解析教本)
Duplicate
gatk MarkDuplicates -I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics \
-O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE
Using GATK jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false
-Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2
-jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar MarkDuplicates
-I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics
-O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE
16:35:52.607 INFO NativeLibraryLoader -
Loading libgkl_compression.so from jar:file:/usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu May 16 16:35:52 JST 2019] MarkDuplicates --INPUT Kishimoto_01_43B_S1.sorted.bam
--OUTPUT Kishimoto_01_43B_S1.dedup.sorted.bam --METRICS_FILE Kishimoto_01_43B_S1.duplicated.metrics
--REMOVE_DUPLICATES true --ASSUME_SORTED true --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000
--MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false
--REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false
--ADD_PG_TAG_TO_READS true --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates
--PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values>
--OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000
--VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2
--MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false
--GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false
--USE_JDK_INFLATER false
[Thu May 16 16:35:55 JST 2019] Executing as yamanouc@pepper.is.sci.toho-u.ac.jp on Linux 3.10.0-514.2.2.el7.x86_64 amd64;
OpenJDK 64-Bit Server VM 1.8.0_111-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available;
Picard version: Version:4.1.2.0
INFO 2019-05-16 16:35:55 MarkDuplicates Start of doWork freeMemory: 1417478608; totalMemory: 1434976256; maxMemory: 28631367680
INFO 2019-05-16 16:35:55 MarkDuplicates Reading input file and constructing read end information.
INFO 2019-05-16 16:35:55 MarkDuplicates Will retain up to 103736839 data points before spilling to disk.
INFO 2019-05-16 16:36:52 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:52s. Time for last 1,000,000: 52s.
Last read position: AP012030.1:74,603
INFO 2019-05-16 16:36:52 MarkDuplicates Tracking 40722 as yet unmatched pairs. 40722 records in RAM.
INFO 2019-05-16 16:37:29 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:01:29s. Time for last 1,000,000: 37s. Last read position: AP012030.1:154,447
INFO 2019-05-16 16:37:29 MarkDuplicates Tracking 77823 as yet unmatched pairs. 77823 records in RAM.
INFO 2019-05-16 16:38:24 MarkDuplicates Read 3,000,000 records. Elapsed time: 00:02:25s. Time for last 1,000,000: 55s. Last read position: AP012030.1:228,168
INFO 2019-05-16 16:38:24 MarkDuplicates Tracking 110546 as yet unmatched pairs. 110546 records in RAM.
INFO 2019-05-16 16:38:53 MarkDuplicates Read 4,000,000 records. Elapsed time: 00:02:53s. Time for last 1,000,000: 28s. Last read position: AP012030.1:301,562
INFO 2019-05-16 16:38:53 MarkDuplicates Tracking 143886 as yet unmatched pairs. 143886 records in RAM.
INFO 2019-05-16 16:39:37 MarkDuplicates Read 5,000,000 records. Elapsed time: 00:03:38s. Time for last 1,000,000: 44s. Last read position: AP012030.1:377,173
INFO 2019-05-16 16:39:37 MarkDuplicates Tracking 175571 as yet unmatched pairs. 175571 records in RAM.
INFO 2019-05-16 16:40:06 MarkDuplicates Read 6,000,000 records. Elapsed time: 00:04:07s. Time for last 1,000,000: 29s. Last read position: AP012030.1:448,403
INFO 2019-05-16 16:40:06 MarkDuplicates Tracking 203544 as yet unmatched pairs. 203544 records in RAM.
INFO 2019-05-16 16:41:02 MarkDuplicates Read 7,000,000 records. Elapsed time: 00:05:02s. Time for last 1,000,000: 55s. Last read position: AP012030.1:528,920
INFO 2019-05-16 16:41:02 MarkDuplicates Tracking 235908 as yet unmatched pairs. 235908 records in RAM.
INFO 2019-05-16 16:41:26 MarkDuplicates Read 8,000,000 records. Elapsed time: 00:05:27s. Time for last 1,000,000: 24s. Last read position: AP012030.1:608,145
INFO 2019-05-16 16:41:26 MarkDuplicates Tracking 263951 as yet unmatched pairs. 263951 records in RAM.
INFO 2019-05-16 16:42:03 MarkDuplicates Read 9,000,000 records. Elapsed time: 00:06:03s. Time for last 1,000,000: 36s. Last read position: AP012030.1:690,374
INFO 2019-05-16 16:42:03 MarkDuplicates Tracking 291824 as yet unmatched pairs. 291824 records in RAM.
(中略)
INFO 2019-05-16 17:08:15 MarkDuplicates Read 59,000,000 records. Elapsed time: 00:32:16s. Time for last 1,000,000: 28s. Last read position: AP012030.1:4,479,193
INFO 2019-05-16 17:08:15 MarkDuplicates Tracking 70213 as yet unmatched pairs. 70213 records in RAM.
INFO 2019-05-16 17:08:44 MarkDuplicates Tracking 30486 as yet unmatched pairs. 30486 records in RAM.
INFO 2019-05-16 17:09:06 MarkDuplicates Read 60734092 records. 0 pairs never matched.
INFO 2019-05-16 17:10:06 MarkDuplicates After buildSortedReadEndLists freeMemory: 11654078344; totalMemory: 18848153600; maxMemory: 28631367680
INFO 2019-05-16 17:10:06 MarkDuplicates Will retain up to 894730240 duplicate indices before spilling to disk.
INFO 2019-05-16 17:10:14 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2019-05-16 17:13:40 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2019-05-16 17:13:57 MarkDuplicates Sorting list of duplicate records.
INFO 2019-05-16 17:14:00 MarkDuplicates After generateDuplicateIndexes freeMemory: 21992264872; totalMemory: 29286203392; maxMemory: 29286203392
INFO 2019-05-16 17:14:00 MarkDuplicates Marking 2183568 records as duplicates.
INFO 2019-05-16 17:14:00 MarkDuplicates Found 9424 optical duplicate clusters.
INFO 2019-05-16 17:14:00 MarkDuplicates Reads are assumed to be ordered by: coordinate
***2019-05-20 bcftools mpileup | bcftools call [#g3a8f9c0]
[[samtools 使い方 mpileup ( calling SNPs ) & annotation:http://bioinfo-dojo.net/2016/07/13/samtools_calling_snps/#Calling_SNPs-2]] には
$ samtools mpileup -uf ref.fa myfile.sorted_unique.bam | bcftools call -O b -v -c ->|
myfile.sorted_unique.raw.bcf
と書いてある。また
[[samtoolsでvariant call - Qiita:https://qiita.com/motthy/items/38951f127dce26b22ce3]]
にはこう書いている。
bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> |
bcftools call -vmO z -o <study.vcf.gz>
でも
bcftools mpileup -Ou -f AP012030-new.fasta Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam 2_2-1/AP012030/data/reference.bam 2_5-1/AP012030/data/reference.bam 2_5-1-7A/AP012030/data/reference.bam > mytest.vcf
bcftools call -O v -v -c mytest.vcf > mytest.raw.bcf
や
bcftools call -vmO z -o mystudy.vcf.gz mytest.vcf
Could not add dummy header for contig
Segmentation fault (コアダンプ)
でうまくいかない。
bcftoolsの[[マニュアル:https://samtools.github.io/bcftools/bcftools.html]]のmpileupにある例では
bcftools mpileup -Ou -f ref.fa aln.bam | \
bcftools call -Ou -mv | \
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
と書いてある。
ここでは、
bcftools mpileup -Ou -f AP012030-new.fasta \
Anc/AP012030/data/reference.bam \
1_2-1/AP012030/data/reference.bam \
2_2-1/AP012030/data/reference.bam \
2_5-1/AP012030/data/reference.bam \
2_5-1-7A/AP012030/data/reference.bam |
bcftools call -Ou -mv -o mytest.call.vcf
を試してみる。
この結果を、vcf-annotateでアノテーションを追加して読めるようにする。この手順の出所は[[samtools 使い方 mpileup ( calling SNPs ) & annotation – バイオインフォ 道場:http://bioinfo-dojo.net/2016/07/13/samtools_calling_snps/#Calling_SNPs-2]]
bcftools view mytest.call.vcf | vcf-annotate -f + > mytest.call.annnotate.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam 2_2-1/AP012030/data/reference.bam 2_5-1/AP012030/data/reference.bam 2_5-1-7A/AP012030/data/reference.bam
AP012030 301 . CTTTTTTTTTT CTTTTTTTTT 61 PASS INDEL;IDV=211;IMF=0.886555;DP=1191;VDB=0.990724;SGB=-30.4726;MQSB=1;MQ0F=0;ICB=0.425;HOB=0.32;AC=8;AN=10;DP4=178,15,967,31;MQ=20 GT:PL 1/1:31,255,0 1/1:22,216,0 1/1:31,255,0 1/1:16,235,0 0/0:0,157,15
AP012030 3448 . C T 464 PASS DP=1211;VDB=0.484561;SGB=120.949;RPB=0.750085;MQB=1;MQSB=1;BQB=0.101507;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=478,6,697,8;MQ=20 GT:PL 0/0:0,255,173 0/0:0,255,183 1/1:167,255,0 1/1:189,255,0 1/1:162,255,0
AP012030 10591 . C T 262 PASS DP=1220;VDB=0.354141;SGB=400.466;RPB=0.300118;MQB=1;MQSB=1;BQB=7.53989e-05;MQ0F=0;ICB=0.117361;HOB=0.48;AC=4;AN=10;DP4=748,5,433,8;MQ=20 GT:PL 0/0:0,255,176 0/0:0,255,169 0/0:0,255,176 1/1:139,255,0 1/1:177,255,0
AP012030 41097 . G A 465 PASS DP=1222;VDB=0.0336028;SGB=359.679;RPB=0.995592;MQB=1;MQSB=1;BQB=0.533691;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=477,10,705,7;MQ=20 GT:PL 0/0:0,255,178 0/0:0,255,197 1/1:173,255,0 1/1:176,255,0 1/1:170,255,0
AP012030 44797 . AGGGGG AGGGG 340 PASS INDEL;IDV=234;IMF=0.955102;DP=1231;VDB=0.691086;SGB=115.801;MQSB=1;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=548,7,672,4;MQ=20 GT:PL 0/0:0,255,180 0/0:0,255,186 1/1:134,255,0 1/1:132,255,0 1/1:128,255,0
AP012030 59093 . C T 13.7826 PASS DP=252;VDB=0.00328272;SGB=-7.51117;RPB=0.00279524;MQB=0.0718584;MQSB=0.999297;BQB=0.00206264;MQ0F=0;ICB=1;HOB=0.166667;AC=3;AN=6;DP4=233,3,5,0;MQ=19 GT:PL 0/0:0,255,175 ./.:0,0,0 ./.:0,0,0 1/1:46,9,0 0/1:6,0,14
AP012030 59105 . A C 69 PASS DP=260;VDB=0.000538717;SGB=-8.68981;RPB=0.000860322;MQB=0.187859;MQSB=0.999902;BQB=0.0445104;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=237,4,8,0;MQ=20 GT:PL 0/0:0,255,193 ./.:0,0,0 1/1:49,9,0 1/1:46,9,0 0/1:18,0,28
AP012030 59108 . T C 71 PASS DP=266;VDB=0.000464403;SGB=-8.68981;RPB=0.00435826;MQB=0.19652;MQSB=0.998903;BQB=0.106941;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=238,4,7,1;MQ=19 GT:PL 0/0:0,255,177 ./.:0,0,0 1/1:36,6,0 1/1:46,9,0 0/1:32,0,39
ミソは最後のFORMATの部分が5つのサンプルそれぞれの値を列記した形になること。でもこれをどうやって使えばいいのか?
この出力を、IGVで表示することができる。
さて、サンプル間の変異を比較する方法として[[サンプル間で共通する変異と固有の変異を抽出する - MACでインフォマティクス:http://kazumaxneo.hatenablog.com/entry/2017/06/06/165329]]にはそれぞれのサンプルのBAMを(単独で)VCFに変換し、それらの間の差分を取る方法を提案している。
***VCFファイルをPythonで読み込んで操作する [#qb769039]
pyVCFがパーザーとして使える。
[[PyVCF - A Variant Call Format Parser for Python — PyVCF 0.6.8 documentation:https://pyvcf.readthedocs.io/en/latest/index.html]]
[[API — PyVCF 0.6.8 documentation:https://pyvcf.readthedocs.io/en/latest/API.html#vcf-model-record]]
pyVCFはnamedTupleという便利っぽいものを使っている。
-2019-05-21 python namedtuple
--[[namedtupleで美しいpythonを書く!(翻訳) - Qiita:https://qiita.com/Seny/items/add4d03876f505442136]]
--[[collections --- コンテナデータ型 — Python 3.7.3 ドキュメント:https://docs.python.org/ja/3/library/collections.html#collections.namedtuple]]
--一般には、p.fieldnameのようにしてアクセス。
--フィールド名リストは p._fields なので、
--文字列としてフィールド名が与えられている時の値の取り出し getattr(p, 'フィールド名')
for fieldname in p._fields:
print(getattr(p, fieldname)
PyVCFを使ってみる。
import vcf
# vcf.model — PyVCF 0.6.8 documentation
https://pyvcf.readthedocs.io/en/latest/_modules/vcf/model.html#_Record
# PyVCF | VCF ファイルをパースする Python ライブラリー https://bi.biopapyrus.jp/python/module/pyvcf.html
vcff = vcf.Reader(open('mytest.call.annnotate.vcf', 'r'))
for i, record in enumerate(vcff):
print('POS:', record.POS, 'SNP:', record.is_snp, 'INDEL:', record.is_indel, 'subtype:',
record.var_subtype)
for sample in record.samples:
print(' ', sample.sample, end=' ')
#print(sample.sample, sample.data, sample.gt_type)
#print(sample.data.GT)
#print(sample.data._fields)
for skey in sample.data._fields:
print(skey, ':', getattr(sample.data, skey), end=', ')
print()
if i>1:
break
結果は
POS: 301 SNP: False INDEL: True subtype: del
Anc/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0],
1_2-1/AP012030/data/reference.bam GT : 1/1, PL : [22, 216, 0],
2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0],
2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [16, 235, 0],
2_5-1-7A/AP012030/data/reference.bam GT : 0/0, PL : [0, 157, 15],
POS: 3448 SNP: True INDEL: False subtype: ts
Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 173],
1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 183],
2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [167, 255, 0],
2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [189, 255, 0],
2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [162, 255, 0],
POS: 10591 SNP: True INDEL: False subtype: ts
Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176],
1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 169],
2_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176],
2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [139, 255, 0],
2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [177, 255, 0],
終了行:
[[Pythonバイオ]] [[Pythonバイオ/ツール]]~
&counter(); &lastmod();~
*Variant Calling [#ab7ead5d]
参照
-2019-04-21
--ゲノム解析がらみで、2つのVCFファイルの比較。[[サンプル間で共通する変異と固有の変異を抽出する:http://kazumaxneo.hatenablog.com/entry/2017/06/06/165329]]、 GATKを使う方法と、bcftoolsを使う方法と。要確認。
--その手前。[[small indelとSNV検出のワークフロー:http://kazumaxneo.hatenablog.com/entry/2017/05/31/170329]]~
更にその元 (GATKサイト) [[Variant Calling Pipeline: FastQ to Annotated SNPs in Hours:https://gencore.bio.nyu.edu/variant-calling-pipeline/]]
-[[(howto) Discover variants with GATK - A GATK Workshop Tutorial:https://software.broadinstitute.org/gatk/documentation/article?id=7869]]
-2019-04-21 DNA-seq pipeline [[A framework for variation discovery and genotyping using next-generation DNA sequencing data (Nature Genetics 43, 491?498 (2011) doi:10.1038/ng.806):https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083463/]]
-2019-04-21 NIG National Institute of Genetics NIGのパイプラインサービス [[DNA-seqのパイプライン:http://cell-innovation.nig.ac.jp/maser/Applications/Genome-seq.html]]
-Dry解析教本のp163〜 エクソームシーケンスのデータ解析 ワークフローの図は[[GATK Best Practice:https://software.broadinstitute.org/gatk/best-practices/?v=3]]からの改変だそうな。
-[[Variant Calling Pipeline: FastQ to Annotated SNPs in Hours – The Genomics Core Facility @ NYU Center for Genomics and Systems Biology:https://gencore.bio.nyu.edu/variant-calling-pipeline/]]
-GATKの主張するワークフローは[[GATK workflows:https://github.com/gatk-workflows/]]にリストがある。これおもしろいかも。
いずれにしても、
-クオリティチェック FASTQC
-トリミング Trimmomatic (Best PracticeではここではやらないとDry解析教本に書いてあった。理由はバリアントコール抽出の過程で考慮されるからだそうな。)
-マッピング ⇒ BAM
-重複リードの除去 picard
-InDelリアラインメント GATK target finding; then realign around InDel
-クオリティの再調整? (Dry解析教本にはある、NYUフローにはない)
-Variant Calling ⇒ VCF
-(GenoTyping??)
-Extraction
--Extract SNPs; Filter SNPs (GATK) VCF
--Extract Indels; Filter Indels (GATK) VCF
-Predict Variant Effects (snpEFF, NYU)
-アノテーション付与 (Annovar, Dry解析教本)
Duplicate
gatk MarkDuplicates -I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics \
-O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE
Using GATK jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false
-Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2
-jar /usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar MarkDuplicates
-I Kishimoto_01_43B_S1.sorted.bam -M Kishimoto_01_43B_S1.duplicated.metrics
-O Kishimoto_01_43B_S1.dedup.sorted.bam --ASSUME_SORTED TRUE --REMOVE_DUPLICATES TRUE
16:35:52.607 INFO NativeLibraryLoader -
Loading libgkl_compression.so from jar:file:/usr/local/src/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu May 16 16:35:52 JST 2019] MarkDuplicates --INPUT Kishimoto_01_43B_S1.sorted.bam
--OUTPUT Kishimoto_01_43B_S1.dedup.sorted.bam --METRICS_FILE Kishimoto_01_43B_S1.duplicated.metrics
--REMOVE_DUPLICATES true --ASSUME_SORTED true --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000
--MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false
--REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false
--ADD_PG_TAG_TO_READS true --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates
--PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values>
--OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000
--VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2
--MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false
--GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false
--USE_JDK_INFLATER false
[Thu May 16 16:35:55 JST 2019] Executing as yamanouc@pepper.is.sci.toho-u.ac.jp on Linux 3.10.0-514.2.2.el7.x86_64 amd64;
OpenJDK 64-Bit Server VM 1.8.0_111-b15; Deflater: Intel; Inflater: Intel; Provider GCS is available;
Picard version: Version:4.1.2.0
INFO 2019-05-16 16:35:55 MarkDuplicates Start of doWork freeMemory: 1417478608; totalMemory: 1434976256; maxMemory: 28631367680
INFO 2019-05-16 16:35:55 MarkDuplicates Reading input file and constructing read end information.
INFO 2019-05-16 16:35:55 MarkDuplicates Will retain up to 103736839 data points before spilling to disk.
INFO 2019-05-16 16:36:52 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:52s. Time for last 1,000,000: 52s.
Last read position: AP012030.1:74,603
INFO 2019-05-16 16:36:52 MarkDuplicates Tracking 40722 as yet unmatched pairs. 40722 records in RAM.
INFO 2019-05-16 16:37:29 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:01:29s. Time for last 1,000,000: 37s. Last read position: AP012030.1:154,447
INFO 2019-05-16 16:37:29 MarkDuplicates Tracking 77823 as yet unmatched pairs. 77823 records in RAM.
INFO 2019-05-16 16:38:24 MarkDuplicates Read 3,000,000 records. Elapsed time: 00:02:25s. Time for last 1,000,000: 55s. Last read position: AP012030.1:228,168
INFO 2019-05-16 16:38:24 MarkDuplicates Tracking 110546 as yet unmatched pairs. 110546 records in RAM.
INFO 2019-05-16 16:38:53 MarkDuplicates Read 4,000,000 records. Elapsed time: 00:02:53s. Time for last 1,000,000: 28s. Last read position: AP012030.1:301,562
INFO 2019-05-16 16:38:53 MarkDuplicates Tracking 143886 as yet unmatched pairs. 143886 records in RAM.
INFO 2019-05-16 16:39:37 MarkDuplicates Read 5,000,000 records. Elapsed time: 00:03:38s. Time for last 1,000,000: 44s. Last read position: AP012030.1:377,173
INFO 2019-05-16 16:39:37 MarkDuplicates Tracking 175571 as yet unmatched pairs. 175571 records in RAM.
INFO 2019-05-16 16:40:06 MarkDuplicates Read 6,000,000 records. Elapsed time: 00:04:07s. Time for last 1,000,000: 29s. Last read position: AP012030.1:448,403
INFO 2019-05-16 16:40:06 MarkDuplicates Tracking 203544 as yet unmatched pairs. 203544 records in RAM.
INFO 2019-05-16 16:41:02 MarkDuplicates Read 7,000,000 records. Elapsed time: 00:05:02s. Time for last 1,000,000: 55s. Last read position: AP012030.1:528,920
INFO 2019-05-16 16:41:02 MarkDuplicates Tracking 235908 as yet unmatched pairs. 235908 records in RAM.
INFO 2019-05-16 16:41:26 MarkDuplicates Read 8,000,000 records. Elapsed time: 00:05:27s. Time for last 1,000,000: 24s. Last read position: AP012030.1:608,145
INFO 2019-05-16 16:41:26 MarkDuplicates Tracking 263951 as yet unmatched pairs. 263951 records in RAM.
INFO 2019-05-16 16:42:03 MarkDuplicates Read 9,000,000 records. Elapsed time: 00:06:03s. Time for last 1,000,000: 36s. Last read position: AP012030.1:690,374
INFO 2019-05-16 16:42:03 MarkDuplicates Tracking 291824 as yet unmatched pairs. 291824 records in RAM.
(中略)
INFO 2019-05-16 17:08:15 MarkDuplicates Read 59,000,000 records. Elapsed time: 00:32:16s. Time for last 1,000,000: 28s. Last read position: AP012030.1:4,479,193
INFO 2019-05-16 17:08:15 MarkDuplicates Tracking 70213 as yet unmatched pairs. 70213 records in RAM.
INFO 2019-05-16 17:08:44 MarkDuplicates Tracking 30486 as yet unmatched pairs. 30486 records in RAM.
INFO 2019-05-16 17:09:06 MarkDuplicates Read 60734092 records. 0 pairs never matched.
INFO 2019-05-16 17:10:06 MarkDuplicates After buildSortedReadEndLists freeMemory: 11654078344; totalMemory: 18848153600; maxMemory: 28631367680
INFO 2019-05-16 17:10:06 MarkDuplicates Will retain up to 894730240 duplicate indices before spilling to disk.
INFO 2019-05-16 17:10:14 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2019-05-16 17:13:40 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2019-05-16 17:13:57 MarkDuplicates Sorting list of duplicate records.
INFO 2019-05-16 17:14:00 MarkDuplicates After generateDuplicateIndexes freeMemory: 21992264872; totalMemory: 29286203392; maxMemory: 29286203392
INFO 2019-05-16 17:14:00 MarkDuplicates Marking 2183568 records as duplicates.
INFO 2019-05-16 17:14:00 MarkDuplicates Found 9424 optical duplicate clusters.
INFO 2019-05-16 17:14:00 MarkDuplicates Reads are assumed to be ordered by: coordinate
***2019-05-20 bcftools mpileup | bcftools call [#g3a8f9c0]
[[samtools 使い方 mpileup ( calling SNPs ) & annotation:http://bioinfo-dojo.net/2016/07/13/samtools_calling_snps/#Calling_SNPs-2]] には
$ samtools mpileup -uf ref.fa myfile.sorted_unique.bam | bcftools call -O b -v -c ->|
myfile.sorted_unique.raw.bcf
と書いてある。また
[[samtoolsでvariant call - Qiita:https://qiita.com/motthy/items/38951f127dce26b22ce3]]
にはこう書いている。
bcftools mpileup -Ou -f <ref.fa> <sample1.bam> <sample2.bam> <sample3.bam> |
bcftools call -vmO z -o <study.vcf.gz>
でも
bcftools mpileup -Ou -f AP012030-new.fasta Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam 2_2-1/AP012030/data/reference.bam 2_5-1/AP012030/data/reference.bam 2_5-1-7A/AP012030/data/reference.bam > mytest.vcf
bcftools call -O v -v -c mytest.vcf > mytest.raw.bcf
や
bcftools call -vmO z -o mystudy.vcf.gz mytest.vcf
Could not add dummy header for contig
Segmentation fault (コアダンプ)
でうまくいかない。
bcftoolsの[[マニュアル:https://samtools.github.io/bcftools/bcftools.html]]のmpileupにある例では
bcftools mpileup -Ou -f ref.fa aln.bam | \
bcftools call -Ou -mv | \
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
と書いてある。
ここでは、
bcftools mpileup -Ou -f AP012030-new.fasta \
Anc/AP012030/data/reference.bam \
1_2-1/AP012030/data/reference.bam \
2_2-1/AP012030/data/reference.bam \
2_5-1/AP012030/data/reference.bam \
2_5-1-7A/AP012030/data/reference.bam |
bcftools call -Ou -mv -o mytest.call.vcf
を試してみる。
この結果を、vcf-annotateでアノテーションを追加して読めるようにする。この手順の出所は[[samtools 使い方 mpileup ( calling SNPs ) & annotation – バイオインフォ 道場:http://bioinfo-dojo.net/2016/07/13/samtools_calling_snps/#Calling_SNPs-2]]
bcftools view mytest.call.vcf | vcf-annotate -f + > mytest.call.annnotate.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Anc/AP012030/data/reference.bam 1_2-1/AP012030/data/reference.bam 2_2-1/AP012030/data/reference.bam 2_5-1/AP012030/data/reference.bam 2_5-1-7A/AP012030/data/reference.bam
AP012030 301 . CTTTTTTTTTT CTTTTTTTTT 61 PASS INDEL;IDV=211;IMF=0.886555;DP=1191;VDB=0.990724;SGB=-30.4726;MQSB=1;MQ0F=0;ICB=0.425;HOB=0.32;AC=8;AN=10;DP4=178,15,967,31;MQ=20 GT:PL 1/1:31,255,0 1/1:22,216,0 1/1:31,255,0 1/1:16,235,0 0/0:0,157,15
AP012030 3448 . C T 464 PASS DP=1211;VDB=0.484561;SGB=120.949;RPB=0.750085;MQB=1;MQSB=1;BQB=0.101507;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=478,6,697,8;MQ=20 GT:PL 0/0:0,255,173 0/0:0,255,183 1/1:167,255,0 1/1:189,255,0 1/1:162,255,0
AP012030 10591 . C T 262 PASS DP=1220;VDB=0.354141;SGB=400.466;RPB=0.300118;MQB=1;MQSB=1;BQB=7.53989e-05;MQ0F=0;ICB=0.117361;HOB=0.48;AC=4;AN=10;DP4=748,5,433,8;MQ=20 GT:PL 0/0:0,255,176 0/0:0,255,169 0/0:0,255,176 1/1:139,255,0 1/1:177,255,0
AP012030 41097 . G A 465 PASS DP=1222;VDB=0.0336028;SGB=359.679;RPB=0.995592;MQB=1;MQSB=1;BQB=0.533691;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=477,10,705,7;MQ=20 GT:PL 0/0:0,255,178 0/0:0,255,197 1/1:173,255,0 1/1:176,255,0 1/1:170,255,0
AP012030 44797 . AGGGGG AGGGG 340 PASS INDEL;IDV=234;IMF=0.955102;DP=1231;VDB=0.691086;SGB=115.801;MQSB=1;MQ0F=0;ICB=0.117361;HOB=0.48;AC=6;AN=10;DP4=548,7,672,4;MQ=20 GT:PL 0/0:0,255,180 0/0:0,255,186 1/1:134,255,0 1/1:132,255,0 1/1:128,255,0
AP012030 59093 . C T 13.7826 PASS DP=252;VDB=0.00328272;SGB=-7.51117;RPB=0.00279524;MQB=0.0718584;MQSB=0.999297;BQB=0.00206264;MQ0F=0;ICB=1;HOB=0.166667;AC=3;AN=6;DP4=233,3,5,0;MQ=19 GT:PL 0/0:0,255,175 ./.:0,0,0 ./.:0,0,0 1/1:46,9,0 0/1:6,0,14
AP012030 59105 . A C 69 PASS DP=260;VDB=0.000538717;SGB=-8.68981;RPB=0.000860322;MQB=0.187859;MQSB=0.999902;BQB=0.0445104;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=237,4,8,0;MQ=20 GT:PL 0/0:0,255,193 ./.:0,0,0 1/1:49,9,0 1/1:46,9,0 0/1:18,0,28
AP012030 59108 . T C 71 PASS DP=266;VDB=0.000464403;SGB=-8.68981;RPB=0.00435826;MQB=0.19652;MQSB=0.998903;BQB=0.106941;MQ0F=0;ICB=1;HOB=0.21875;AC=5;AN=8;DP4=238,4,7,1;MQ=19 GT:PL 0/0:0,255,177 ./.:0,0,0 1/1:36,6,0 1/1:46,9,0 0/1:32,0,39
ミソは最後のFORMATの部分が5つのサンプルそれぞれの値を列記した形になること。でもこれをどうやって使えばいいのか?
この出力を、IGVで表示することができる。
さて、サンプル間の変異を比較する方法として[[サンプル間で共通する変異と固有の変異を抽出する - MACでインフォマティクス:http://kazumaxneo.hatenablog.com/entry/2017/06/06/165329]]にはそれぞれのサンプルのBAMを(単独で)VCFに変換し、それらの間の差分を取る方法を提案している。
***VCFファイルをPythonで読み込んで操作する [#qb769039]
pyVCFがパーザーとして使える。
[[PyVCF - A Variant Call Format Parser for Python — PyVCF 0.6.8 documentation:https://pyvcf.readthedocs.io/en/latest/index.html]]
[[API — PyVCF 0.6.8 documentation:https://pyvcf.readthedocs.io/en/latest/API.html#vcf-model-record]]
pyVCFはnamedTupleという便利っぽいものを使っている。
-2019-05-21 python namedtuple
--[[namedtupleで美しいpythonを書く!(翻訳) - Qiita:https://qiita.com/Seny/items/add4d03876f505442136]]
--[[collections --- コンテナデータ型 — Python 3.7.3 ドキュメント:https://docs.python.org/ja/3/library/collections.html#collections.namedtuple]]
--一般には、p.fieldnameのようにしてアクセス。
--フィールド名リストは p._fields なので、
--文字列としてフィールド名が与えられている時の値の取り出し getattr(p, 'フィールド名')
for fieldname in p._fields:
print(getattr(p, fieldname)
PyVCFを使ってみる。
import vcf
# vcf.model — PyVCF 0.6.8 documentation
https://pyvcf.readthedocs.io/en/latest/_modules/vcf/model.html#_Record
# PyVCF | VCF ファイルをパースする Python ライブラリー https://bi.biopapyrus.jp/python/module/pyvcf.html
vcff = vcf.Reader(open('mytest.call.annnotate.vcf', 'r'))
for i, record in enumerate(vcff):
print('POS:', record.POS, 'SNP:', record.is_snp, 'INDEL:', record.is_indel, 'subtype:',
record.var_subtype)
for sample in record.samples:
print(' ', sample.sample, end=' ')
#print(sample.sample, sample.data, sample.gt_type)
#print(sample.data.GT)
#print(sample.data._fields)
for skey in sample.data._fields:
print(skey, ':', getattr(sample.data, skey), end=', ')
print()
if i>1:
break
結果は
POS: 301 SNP: False INDEL: True subtype: del
Anc/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0],
1_2-1/AP012030/data/reference.bam GT : 1/1, PL : [22, 216, 0],
2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [31, 255, 0],
2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [16, 235, 0],
2_5-1-7A/AP012030/data/reference.bam GT : 0/0, PL : [0, 157, 15],
POS: 3448 SNP: True INDEL: False subtype: ts
Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 173],
1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 183],
2_2-1/AP012030/data/reference.bam GT : 1/1, PL : [167, 255, 0],
2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [189, 255, 0],
2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [162, 255, 0],
POS: 10591 SNP: True INDEL: False subtype: ts
Anc/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176],
1_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 169],
2_2-1/AP012030/data/reference.bam GT : 0/0, PL : [0, 255, 176],
2_5-1/AP012030/data/reference.bam GT : 1/1, PL : [139, 255, 0],
2_5-1-7A/AP012030/data/reference.bam GT : 1/1, PL : [177, 255, 0],
ページ名: