GnomAD的nhemi注释
#coding
需求是注释出GnomAD的nhemi字段,但翻了一下dbNSFP和VEP,都没有提供这个内容,又得自己下载数据库搞了。
根据提示,nhemi即半合子计数,取GnomAD数据中的AC_XY值就好。这个值给chrX和chrY的突变提供即可。
数据库
UCSC里有一个现成的注释表,版本是GnomAD v4.1的,足够新。我尝试下载了一下,发现足足有200+G,因此放弃了(如果不嫌弃的话这个版本倒是不错的,值直接了当列好了)。
由于我只需要chrX和chrY,那么就只下载GnomAD官网提供的vcf即可。
wget https://gnomad-public-us-east-1.s3.amazonaws.com/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrX.vcf.bgz
wget https://gnomad-public-us-east-1.s3.amazonaws.com/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chrY.vcf.bgz
wget https://gnomad-public-us-east-1.s3.amazonaws.com/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrX.vcf.bgz
wget https://gnomad-public-us-east-1.s3.amazonaws.com/release/4.1/vcf/genomes/gnomad.genomes.v4.1.sites.chrY.vcf.bgz
甲方的需求是要把exomes和genomes的值相加,所以我都下载了,然后再进行处理。
处理
这个文件最终的用途只是注释nhemi,因此这里将多余的字段删除,并计算exomes+genomes的总和,使用python
# coding=utf-8
import gzip
import re
def extract_ac_xy(info_field):
"""从INFO字段中提取AC_XY值"""
match = re.search(r'AC_XY=(\d+)', info_field)
return int(match.group(1)) if match else 0
def process_vcf_files(exome_file, genome_file, output_file, chrom_name="chrY"):
# 存储位置和AC_XY值的字典
variants = {}
# 处理exome文件
with gzip.open(exome_file, 'rt') as f:
for line in f:
if line.startswith('#'):
continue
fields = line.strip().split('\t')
pos_key = f"{fields[0]}_{fields[1]}_{fields[3]}_{fields[4]}" # CHROM_POS_REF_ALT
ac_xy = extract_ac_xy(fields[7])
variants[pos_key] = {'exome': ac_xy, 'genome': 0, 'pos': fields[1],
'ref': fields[3], 'alt': fields[4]}
# 处理genome文件
with gzip.open(genome_file, 'rt') as f:
for line in f:
if line.startswith('#'):
continue
fields = line.strip().split('\t')
pos_key = f"{fields[0]}_{fields[1]}_{fields[3]}_{fields[4]}"
ac_xy = extract_ac_xy(fields[7])
if pos_key in variants:
variants[pos_key]['genome'] = ac_xy
else:
variants[pos_key] = {'exome': 0, 'genome': ac_xy, 'pos': fields[1],
'ref': fields[3], 'alt': fields[4]}
# 写入输出文件
with open(output_file, 'w') as out:
# 写入头部信息
out.write('##fileformat=VCFv4.2\n')
out.write(f'##INFO=<ID=Exomes_AC_XY,Number=1,Type=Integer,Description="Allele count for {chrom_name} in exomes">\n')
out.write(f'##INFO=<ID=Genomes_AC_XY,Number=1,Type=Integer,Description="Allele count for {chrom_name} in genomes">\n')
out.write(f'##INFO=<ID=Gnomad_AC_XY,Number=1,Type=Integer,Description="Combined allele count for {chrom_name}">\n')
out.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
# 写入变异信息
for pos_key, data in sorted(variants.items()):
chrom = chrom_name
total_ac = data['exome'] + data['genome']
info = f'Exomes_AC_XY={data["exome"]};Genomes_AC_XY={data["genome"]};Gnomad_AC_XY={total_ac}'
out.write(f'{chrom}\t{data["pos"]}\t.\t{data["ref"]}\t{data["alt"]}\t.\t.\t{info}\n')
# chrY
process_vcf_files('gnomad.exomes.v4.1.sites.chrY.vcf.bgz', 'gnomad.genomes.v4.1.sites.chrY.vcf.bgz', 'gnomad.v4.1.sites.chrY.combined_AC_XY.vcf')
# chrX
process_vcf_files('gnomad.exomes.v4.1.sites.chrX.vcf.bgz', 'gnomad.genomes.v4.1.sites.chrX.vcf.bgz', 'gnomad.v4.1.sites.chrX.combined_AC_XY.vcf', "chrX")
chrX处理完有一点顺序问题,使用bedtools再处理一下,不知道chrY有没有问题,也顺便处理一下
bedtools sort -i gnomad.v4.1.sites.chrX.combined_AC_XY.vcf -header > gnomad.v4.1.sites.chrX.combined_AC_XY.sort.vcf
bedtools sort -i gnomad.v4.1.sites.chrY.combined_AC_XY.vcf -header > gnomad.v4.1.sites.chrY.combined_AC_XY.sort.vcf
liftover
是的,还是在用GRCh37/hg19的,需要liftover一下,不去管是否会造成的数据损失了。首先下载hg38转hg19的chain file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
gunzip hg38ToHg19.over.chain.gz
为了一步到位,先把chrX和chrY合起来
cp gnomad.v4.1.sites.chrX.combined_AC_XY.sort.vcf gnomad.v4.1.sites.combined_AC_XY.vcf
awk 'NR>1 && !/^#/' gnomad.v4.1.sites.chrY.combined_AC_XY.sort.vcf >> gnomad.v4.1.sites.combined_AC_XY.vcf
然后可以使用gatk/picard对vcf进行liftover
gatk LiftoverVcf \
-I gnomad.v4.1.sites.combined_AC_XY.vcf \
-O gnomad.v4.1.sites.combined_AC_XY.hg19.vcf \
-C hg38ToHg19.over.chain \
--REJECT rejected_variants.vcf \
-R hg19.fa
最后使用bgzip压缩,建立索引(用gatk可能可以直接输出为.vcf.gz?我没试。)
bgzip gnomad.v4.1.sites.combined_AC_XY.hg19.vcf
tabix -p vcf gnomad.v4.1.sites.combined_AC_XY.hg19.vcf.gz
注释
还是使用VEP进行注释
vep \
-i input.vcf -o output.vcf \
--custom file=gnomad.v4.1.sites.combined_AC_XY.hg19.vcf.gz,short_name=GnomHemi,format=vcf,type=exact,coords=0,fields=Exomes_AC_XY%Genomes_AC_XY%Gnomad_AC_XY