基因功能注释：swissprot篇

By liyupeng, 22 September, 2025

Forums

Researches

基因功能注释：swissprot篇

介绍

swissprot库里面收集的是人工矫正过后的蛋白序列，也有没矫正的蛋白序列，下载库的时候需要注意；但是，既然选择了这个数据库，所以这里用的是校正后的蛋白序列库，这样数据量也比较少一些。

参考的文档是徐洲更老师的教程，也用了他的脚本进行后期的输出整理，脚本存在的一些细小的问题也在后面有提到，并修改了，附件中添加的脚本也是修改完的

操作流程

从swissprot官网上下载蛋白序列（dat.gz，推测是包含功能信息描述，这也与徐州更大佬的步骤），uniprot_sprot.dat.gz（人工审查过的），tremble是包括没有审查在内的蛋白序列，come from GPT

 wget https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.dat.gz

解压成dat格式
从data文件中提取fasta序列

dat=uniprot_sprot.dat
awk '{if (/^ /) {gsub(/ /, ""); print} else if (/^AC/) print ">" $2}' $dat > ${dat%%.dat}.fasta

建立diamond索引

diamond makedb --in uniprot_sprot.fasta -d uniprot

使用diamond进行比对

nohup diamond blastp \
    --query female_hap1_rename_protein.fa \
    --db uniprot.dmnd \
    --out female_hap1_blastp.outfmt6 \
    --outfmt 6 \
    --evalue 1e-5 \
    --max-target-seqs 1 \
    --threads 20 &

使用add_annotation_from_dat.py(代码在GitHub上)根据blastp输出从dat中提取GO/KEGG/同源基因。运行在Python2/3环境中，需要安装BioPython
1. 默认输出文件：swiss_annotation.tsv，单次只能操作一条命令
2. 另外，这个作者开发的脚本存在大概是python2和python3的转码还是什么的问题，好在这个问题只要插入一条语句就可以解决：error with add_annotation_from_dat.py · Issue #3 · xuzhougeng/myscripts · GitHub

python3 add_annotation_from_dat.py male_hap2_blastp.outfmt6 uniprot_sprot.dat

附件

add_annotation_from_dat.py

#!/user/bin/env python
import re
import sys
from Bio import SeqIO

if len(sys.argv) < 2:
    print("usage: add_annotation_from_dat.py blastp dat")
    sys.exit(1)

blastp_file = sys.argv[1]
dat = sys.argv[2]

uniprot = SeqIO.index(dat, "swiss")

out_file = open("swiss_annotation.tsv", "w")

for line in open(blastp_file, "r"):
    gene,acc,ident = line.strip().split()[0:3]
    if not uniprot.get(acc.strip(";")):
        continue
    record = uniprot.get_raw(acc.strip(";"))
    record = record.decode('utf-8')
    GO_RECORD = re.findall(r"GO; (GO:\d+); ([F|P|C]):.*?; (.*):",record)

    SPECIES = re.findall(r"OS   (.*)\.", record)
    if len(SPECIES) == 0:
        SPECIES = [""]
    else:
        SPECIES = [ SPECIES[0].replace("\t"," ") ]

    ENSEMBLE_Plant = re.findall(r"EnsemblPlants; (.*?);", record)
    if len(ENSEMBLE_Plant) == 0:
        ENSEMBLE_Plant = [""]
    else:
        ENSEMBLE_Plant = [ENSEMBLE_Plant[0].replace("\t"," ")]

    for GO in GO_RECORD:
        outline = "\t".join([gene, acc,ident] + SPECIES + ENSEMBLE_Plant + list(GO))
        out_file.writelines(outline + "\n")

out_file.close()