Gene Annotation Version Mapping Table v2

By liyupeng, 29 July, 2025

Introduction

Previously on the Gene Annotation Version Mapping Table: so far there are three separate mapping tables, NC-NCC, NCC-BA, and NC-sun (the naming scheme from Prof. Sun Xuepeng).

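Feedback from the community: there are too many mapping tables and juggling them is a hassle, so a single unified table would be welcome.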

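For me, building a new table from scratch is a hassle too. Fortunately I recently read Lao Zhou's economics theory. I have forgotten most of the main content, but it reminded me that Excel is a decent tool, so here I consolidate the existing mapping tables in Excel.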

Workflow, part 1

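The BA-vs-NCC alignment results had a problem. In short, because the alignment was done on protein sequences, some gene IDs on different chromosomes were matched to each other, and these problematic pairs survived my filtering step. In total, 4466 pairs of genes on different chromosomes made it into the final ID mapping table. The situation looks roughly like the figure below:
1. Fix: add one more filter layer that drops hits between genes on different chromosomes.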

Keep only the hits whose query and subject share the same chromosome number:

python chr_filter.py -i blast_results.tsv -o blast_results_filter.tsv
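As a rough illustration of what this filter checks, here is a minimal sketch of the same-chromosome test. It reuses the regex from the attached chr_filter.py, which assumes the chromosome number is the two digits between a letter prefix and "G". The sample IDs and the helper name same_chromosome are hypothetical, made up for this example.

```python
import re

# Assumption: chromosome number = the two digits right before "G",
# as in the regex used by the attached chr_filter.py.
CHR_PATTERN = re.compile(r'[A-Za-z]+(\d{2})G')

def extract_chr(gene_id):
    """Return the two-digit chromosome code, or None if the ID does not match."""
    match = CHR_PATTERN.search(gene_id)
    return match.group(1) if match else None

def same_chromosome(query_id, subject_id):
    """True only when both IDs yield a chromosome code and the codes agree."""
    q_chr = extract_chr(query_id)
    s_chr = extract_chr(subject_id)
    return q_chr is not None and q_chr == s_chr

# Hypothetical IDs, for illustration only.
print(same_chromosome("Tgra01G000100", "Tgba01G000200"))  # True
print(same_chromosome("Tgra01G000100", "Tgba07G000200"))  # False
```

IDs that do not fit the pattern return None and are treated as non-matching, so such hits are dropped rather than kept.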

After that, run through the workflow again (https://www.kdocs.cn/l/cospz3D2R2ja), rerunning with the filter criteria set there:

python3 filter_blast_results.py -i blast_results_filter.tsv -o ncc_ba.tsv

Workflow, part 2

Collect the existing mapping tables
1. Tgra_comparison.tsv (NC-NCC)
2. gene_name_table.txt (NC-sun)
3. ncc_ba.tsv (NCC-BA)
I first imported the contents of the three tables into an Excel sheet, like this:

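Clean up the raw data first, because the tables pasted in directly have some problems. For example, in the sun version the "evm" in the NC IDs is uppercase, so the IDs fail to match. In the BA-NCC table, the IDs end with an mRNA suffix; I split that off with Text to Columns and kept only the gene ID part.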
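The cleanup described above was done by hand in Excel. The sketch below shows one way the same two fixes could be scripted instead. The function names are hypothetical, the "EVM" sample ID is invented, and the suffix format "-mRNA1" is taken from the example ID in the comment of chr_filter2.py.

```python
import re

def normalize_nc_id(nc_id):
    """Lowercase an uppercase 'EVM' so sun-table NC IDs match the other tables."""
    return nc_id.strip().replace("EVM", "evm")

def strip_mrna_suffix(gene_id):
    """Drop a trailing '-mRNA<n>' (any case) and keep only the gene ID part."""
    return re.sub(r'-mrna\d*$', '', gene_id.strip(), flags=re.IGNORECASE)

# Hypothetical NC ID; the real ID layout may differ.
print(normalize_nc_id("EVM.TU.chr1.100"))
# Suffix format as in the chr_filter2.py comment.
print(strip_mrna_suffix("TgFv401G24140-mRNA1"))
```

IDs without the suffix pass through strip_mrna_suffix unchanged, so it is safe to apply to the whole column.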
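Next, use formulas to map the corresponding sun and BA IDs into columns 3 and 4.
1. Map the sun-version IDs (just type the formulas into columns C and D, fill down, and you are done).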

=IFERROR(VLOOKUP(A2, I:J, 2, FALSE), "")

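A2 is the NC ID of the current row;
the lookup searches the range I:J for the NC → sun mapping;
if nothing is found, it returns the empty string "" so that #N/A does not appear.
2. Map the BA-version IDs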

=IFERROR(VLOOKUP(B2, E:F, 2, FALSE), "")
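For readers who prefer a script to a spreadsheet, the two lookups can be approximated with plain dictionaries. This is only a sketch under stated assumptions: the column order (NC, NCC, sun, BA) follows the sheet layout described above, while the function name build_unified_rows and all sample IDs are hypothetical.

```python
def build_unified_rows(nc_ncc_pairs, nc_to_sun, ncc_to_ba):
    """Mimic the two IFERROR(VLOOKUP(...), "") columns.

    nc_ncc_pairs: iterable of (NC ID, NCC ID) tuples (columns A and B).
    nc_to_sun:    dict mapping NC ID -> sun ID (the I:J range).
    ncc_to_ba:    dict mapping NCC ID -> BA ID (the E:F range).
    A missing key yields "" instead of an error, like IFERROR.
    """
    rows = []
    for nc_id, ncc_id in nc_ncc_pairs:
        sun_id = nc_to_sun.get(nc_id, "")
        ba_id = ncc_to_ba.get(ncc_id, "")
        rows.append((nc_id, ncc_id, sun_id, ba_id))
    return rows

# Hypothetical IDs, for illustration only.
pairs = [("nc1", "ncc1"), ("nc2", "ncc2")]
rows = build_unified_rows(pairs, {"nc1": "sun1"}, {"ncc2": "ba2"})
for row in rows:
    print("\t".join(row))
```

A dictionary holds one value per key, so if a lookup table contains duplicate keys, the way the dictionaries are built decides which match is kept. VLOOKUP with FALSE returns the first exact match.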

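Final result
1. Some entries have no corresponding BA ID (6031). My guess is that these genes were not annotated in BA and the alignment returned no hit either; after all, the two are entirely different versions.
2. Also, the IDs saved here have all been filtered, so there are at most 32057 genes.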
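A count like the number of entries without a BA ID can be re-derived from the unified table with a few lines. The sketch below assumes rows are tuples in the order (NC, NCC, sun, BA); the helper name count_missing and the sample rows are hypothetical.

```python
def count_missing(rows, column):
    """Count rows whose value at the given column index is empty or blank."""
    return sum(1 for row in rows if not row[column].strip())

# Hypothetical rows in the order (NC, NCC, sun, BA).
sample = [
    ("nc1", "ncc1", "sun1", ""),
    ("nc2", "ncc2", "", "ba2"),
    ("nc3", "ncc3", "sun3", "ba3"),
]
print(count_missing(sample, 3))  # rows without a BA ID: 1
```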
File location (updated 2025/7/21)

/data2/liyupeng/alice/output/female_anno/female_gavmt

Attachments

chr_filter.py

Usage: -i specifies the input file, -o specifies the output file

python chr_filter.py -i blast_results.tsv -o blast_results_filter.tsv

import sys
import re

def extract_chr(gene_id):
    # Chromosome number = the two digits between the letter prefix and 'G'
    match = re.search(r'[A-Za-z]+(\d{2})G', gene_id)
    return match.group(1) if match else None

args = sys.argv
input_file = None
output_file = None

# Minimal argument parsing: -i <input> -o <output>
for i in range(1, len(args)):
    if args[i] == '-i' and i + 1 < len(args):
        input_file = args[i + 1]
    elif args[i] == '-o' and i + 1 < len(args):
        output_file = args[i + 1]

if not input_file or not output_file:
    print("Usage: python chr_filter.py -i input.tsv -o output.tsv")
    sys.exit(1)

with open(input_file) as infile, open(output_file, 'w') as outfile:
    for i, line in enumerate(infile):
        if i == 0:
            outfile.write(line)
            continue
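        cols = line.strip().split('\t')
        if len(cols) < 2:
            continue  # skip blank or malformed lines instead of crashing
        q_chr = extract_chr(cols[0])
        s_chr = extract_chr(cols[1])
        # Keep the hit only when both IDs parse and the chromosomes agree
        if q_chr and s_chr and q_chr == s_chr:
            outfile.write(line)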

chr_filter2.py

import sys
import re

def extract_chr(gene_id):
    # Extract the two digits before G as the chromosome number, e.g. 01 from TgFv401G24140-mRNA1
    match = re.search(r'(\d{2})G', gene_id)
    return match.group(1) if match else None

args = sys.argv
input_file = None
output_file = None

# Minimal argument parsing: -i <input> -o <output>
for i in range(1, len(args)):
    if args[i] == '-i' and i + 1 < len(args):
        input_file = args[i + 1]
    elif args[i] == '-o' and i + 1 < len(args):
        output_file = args[i + 1]

if not input_file or not output_file:
    print("Usage: python chr_filter2.py -i input.tsv -o output.tsv")
    sys.exit(1)

with open(input_file) as infile, open(output_file, 'w') as outfile:
    for i, line in enumerate(infile):
        if i == 0:
            outfile.write(line)
            continue
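        cols = line.strip().split('\t')
        if len(cols) < 2:
            continue  # skip blank or malformed lines instead of crashing
        q_chr = extract_chr(cols[0])
        s_chr = extract_chr(cols[1])
        # Keep the hit only when both IDs parse and the chromosomes agree
        if q_chr and s_chr and q_chr == s_chr:
            outfile.write(line)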