使用gencode的人类GFF制作一个基于Mane Select转录本(Refseq版本)的外显子Bed文件。
首先下载gff文件。是的,还是使用的GRCh37。
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh37_mapping/gencode.v46lift37.annotation.gff3.gz
然后把mane select和exon部分提取出来
zcat gencode.v46lift37.annotation.gff3.gz | grep "MANE_Select" | grep "ID=exon" > grch37.mane.exon.gff3
但是,由于gencode这里面是Ensembl的转录本编号,我们一般使用的都是Refseq编号,因此还要将编号对应回去。
下载这个下面这个Refseq的GRCh38的gff用于匹配Ensembl和Refseq的转录本(如果是需要GRCh38的bed,那就只用下面这个文件就好,我在NCBI官网找不到GRCh37的gff)。
wget https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.3/MANE.GRCh38.v1.3.refseq_genomic.gff.gz
gunzip MANE.GRCh38.v1.3.refseq_genomic.gff.gz
接下来让撸个脚本进行处理
# python3
transcript_dict = {}
with open("MANE.GRCh38.v1.3.refseq_genomic.gff", "r") as f:
for line in f:
if line.startswith("#"):
continue
ensembl = refseq = "."
lines = line.rstrip().split("\t")
if (lines[2] == "exon") and ("tag=MANE Select" in line):
info_fields = lines[8].split(";")
for i in info_fields:
if i.startswith("Dbxref"):
dbxrefs = i.lstrip("Dbxref=").split(",")
for d in dbxrefs:
if d.startswith("Ensembl:"):
ensembl = d.split("Ensembl:")[1]
elif i.startswith("transcript_id="):
refseq = i.split("transcript_id=")[1]
if not any([ensembl == ".", refseq == "."]):
transcript_dict[ensembl] = refseq
output_bed = open("GRCh37.Gencode.ManeSelect.Exon.bed", "w", encoding="utf-8")
output_bed.write("#Chrom\tStart\tEnd\tRegion\tGene\tEnsembl\tRefseq\tStrand\n")
with open("grch37.mane.exon.gff3", "r") as gff_file:
for line in gff_file:
if not line.startswith("#"):
fields = line.rstrip().split("\t")
info_fields = fields[8].split(";")
gene = ensembl = exon = "."
for i in info_fields:
if i.startswith("Parent"):
ensembl = i.split("=")[1]
elif i.startswith("gene_name"):
gene = i.split("=")[1]
elif i.startswith("exon_number"):
exon = "exon" + i.split("=")[1]
if ensembl in transcript_dict:
refseq = transcript_dict[ensembl]
output_list = [fields[0], fields[3], fields[4], exon, gene, ensembl, refseq, fields[6]]
output_bed.write("\t".join(output_list) + "\n")
output_bed.close()
如果需要的是CDS区域的话,就调整一下筛选就可以了。另外这里不会输出Ensembl和Refseq无对照的转录本,需要注意一下。
都4202年了,到底是谁还在用GRCh37啊。某委只要一声令下,往后的所有质评都只用GRCh38,还不是纷纷响应的。要知道GRCh38都已经推出超过10年了,还拿数据库不完善说事呢,我主要感觉现在找GRCh37的资源比找GRCh38困难多了。