外显子Bed文件制作

使用gencode的人类GFF制作一个基于Mane Select转录本(Refseq版本)的外显子Bed文件。

首先下载gff文件。是的,还是使用的GRCh37。

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh37_mapping/gencode.v46lift37.annotation.gff3.gz

然后把mane select和exon部分提取出来

zcat gencode.v46lift37.annotation.gff3.gz | grep "MANE_Select" | grep "ID=exon" > grch37.mane.exon.gff3

但是,由于gencode这里面是Ensembl的转录本编号,我们一般使用的都是Refseq编号,因此还要将编号对应回去。

下载这个下面这个Refseq的GRCh38的gff用于匹配Ensembl和Refseq的转录本(如果是需要GRCh38的bed,那就只用下面这个文件就好,我在NCBI官网找不到GRCh37的gff)。

wget https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.3/MANE.GRCh38.v1.3.refseq_genomic.gff.gz
gunzip MANE.GRCh38.v1.3.refseq_genomic.gff.gz

接下来让撸个脚本进行处理

# python3

transcript_dict = {}
with open("MANE.GRCh38.v1.3.refseq_genomic.gff", "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        ensembl = refseq = "."
        lines = line.rstrip().split("\t")
        if (lines[2] == "exon") and ("tag=MANE Select" in line):
            info_fields = lines[8].split(";")
            for i in info_fields:
                if i.startswith("Dbxref"):
                    dbxrefs = i.lstrip("Dbxref=").split(",")
                    for d in dbxrefs:
                        if d.startswith("Ensembl:"):
                            ensembl = d.split("Ensembl:")[1]
                elif i.startswith("transcript_id="):
                    refseq = i.split("transcript_id=")[1]
        if not any([ensembl == ".", refseq == "."]):
            transcript_dict[ensembl] = refseq

output_bed = open("GRCh37.Gencode.ManeSelect.Exon.bed", "w", encoding="utf-8")
output_bed.write("#Chrom\tStart\tEnd\tRegion\tGene\tEnsembl\tRefseq\tStrand\n")

with open("grch37.mane.exon.gff3", "r") as gff_file:
    for line in gff_file:
        if not line.startswith("#"):
            fields = line.rstrip().split("\t")
            info_fields = fields[8].split(";")
            gene = ensembl = exon = "."
            for i in info_fields:
                if i.startswith("Parent"):
                    ensembl = i.split("=")[1]
                elif i.startswith("gene_name"):
                    gene = i.split("=")[1]
                elif i.startswith("exon_number"):
                    exon = "exon" + i.split("=")[1]
            if ensembl in transcript_dict:
                refseq = transcript_dict[ensembl]
            output_list = [fields[0], fields[3], fields[4], exon, gene, ensembl, refseq, fields[6]]
            output_bed.write("\t".join(output_list) + "\n")

output_bed.close()

如果需要的是CDS区域的话,就调整一下筛选就可以了。另外这里不会输出Ensembl和Refseq无对照的转录本,需要注意一下。

都4202年了,到底是谁还在用GRCh37啊。某委只要一声令下,往后的所有质评都只用GRCh38,还不是纷纷响应的。要知道GRCh38都已经推出超过10年了,还拿数据库不完善说事呢,我主要感觉现在找GRCh37的资源比找GRCh38困难多了:joy: