extract_gene.sh（Forum） | Zhang Lab Forum

By liyupeng, 24 October, 2025

Forums

设定：

制作一个能够提取指定区间注释内容的脚本
情况：给定一段区间（包含了起始位置和终止位置），脚本根据这个提取指定范围内的基因）
构建：
1. gff3文件的第4列和第5列分别是起始位置和终止位置，直接和设定的区间范围进行比较就可以了
2. 完全落在区间范围内的gene才会被输出到
3. gene的定位，根据第三列的内容（gene）进行定位，第三列gene所在行的第四列和第五列，起始和终止位置作为判断标准
4. 提取所有符合条件的gene所有行和列的注释内容，将区间范围内的注释内容输出到指定文件extract_gene.gff3中
5. 从extract_gene.gff3获取这些被提取的gene的ID，ID在gff3文件的第9列内容，以ID和；作为分隔符，需要的ID就在这两个分隔符的中间
6. 提取基因ID，输出到gene_id.txt文档中
7. 根据gene_id.txt中储存的id，从cds和protein的序列文件提取对应的序列，分别保存到extract_cds.fa和extract_protein.fa；因为蛋白序列和cds序列的id后面多了-mRNA的部分，例如：>TgF01G00010-mRNA1，匹配的时候无视-mRNA（数字）这部分的内容，直接匹配前面的编号
8. 制作一个extract_out文件夹，将输出文档都放在这个文件夹当中；如果当前文件夹存在相同的名字的文件夹，则覆盖（例如把原先的删除掉）
参数设定
1. g，指定输入的文件夹
2. s，起始位置
3. e，终止位置
4. d，cds序列
5. p，蛋白序列
6. c，指定染色体（指定染色体后，在该染色体上进行上述的作业）
7. o，指定输出文件夹

第一版：

命令行

bash extract_genes.sh \
  -o extract_out \
  -g female_ba_last.gff3 \
  -s 4006664 \
  -e 15018754 \
  -c Chr01 \
  -d female_ba_last.cds.fa \
  -p female_ba_last.protein.fa

带注释的命令行（不可运行）

bash extract_genes.sh \ 
  -o extract_out \  # 指定输出目录为 `
  -g female_ba_last.gff3 \  # 指定注释文件
  -s 4006664 \  # 指定区间的起始位置
  -e 15018754 \  # 指定区间的终止位置
  -c Chr01 \  # 指定染色体编号为
  -d female_ba_last.cds.fa \  # 指定 CDS 序列文件
  -p female_ba_last.protein.fa  # 指定蛋白质序列文件

脚本内容

#!/bin/bash

# 脚本名称：process_genomic_data.sh
# 用途：处理基因组数据，提取注释行、mRNA ID、CDS 序列和蛋白质序列。
# 运行方法：
# ./process_genomic_data.sh -o <output_dir> -g <annotation_file> -s <start_position> -e <end_position> -c <chromosome> -d <cds_fasta> -p <protein_fasta>

# 默认值
output_dir="./output"
annotation_file=""
start_position=""
end_position=""
chromosome=""
cds_fasta=""
protein_fasta=""

# 提示信息
usage() {
    echo "Usage: $0 -o <output_dir> -g <annotation_file> -s <start_position> -e <end_position> -c <chromosome> -d <cds_fasta> -p <protein_fasta>"
    exit 1
}

# 解析输入参数
while getopts "o:g:s:e:c:d:p:" opt; do
    case ${opt} in
        o) output_dir=${OPTARG} ;;
        g) annotation_file=${OPTARG} ;;
        s) start_position=${OPTARG} ;;
        e) end_position=${OPTARG} ;;
        c) chromosome=${OPTARG} ;;
        d) cds_fasta=${OPTARG} ;;
        p) protein_fasta=${OPTARG} ;;
        *) usage ;;
    esac
done

# 检查必需的参数
if [[ -z "$annotation_file" || -z "$start_position" || -z "$end_position" || -z "$chromosome" || -z "$cds_fasta" || -z "$protein_fasta" ]]; then
    usage
fi

# 创建输出文件夹
mkdir -p "$output_dir"

# 1. 提取完全落在区间内的注释行
echo "Step 1: Extracting annotations within the specified region..."
awk -v start="$start_position" -v end="$end_position" -v chrom="$chromosome" \
    '$1 == chrom && $4 >= start && $5 <= end {print}' "$annotation_file" > "$output_dir/extract_genes.gff3"

# 2. 提取基因ID
echo "Step 2: Extracting gene IDs..."
awk '$3 == "gene" {split($9, arr, "ID="); split(arr[2], id, ";"); print id[1]}' "$output_dir/extract_genes.gff3" > "$output_dir/gene_id.txt"

# 3. 提取mRNA的ID
echo "Step 3: Extracting mRNA IDs..."
awk '$3 == "mRNA" {split($9, arr, "ID="); split(arr[2], id, ";"); print id[1]}' "$output_dir/extract_genes.gff3" > "$output_dir/mRNA_id.txt"

# 4. 根据mRNA的ID提取对应的CDS序列
echo "Step 4: Extracting CDS sequences based on mRNA IDs..."
# 使用awk提取CDS序列
awk 'NR==FNR {a[$1]; next} 
     /^>/ { 
         id = substr($1, 2);  # 去掉 > 符号，只保留 ID 部分
         if (id in a) { 
             print_header = 1; 
             print $0;
         } else { 
             print_header = 0;
         }
     } 
     !/^>/ && print_header { print $0 }' "$output_dir/mRNA_id.txt" "$cds_fasta" > "$output_dir/extracted_cds_full.fa"

# 5. 根据mRNA的ID提取对应的蛋白质序列
echo "Step 5: Extracting protein sequences based on mRNA IDs..."
awk 'NR==FNR {a[$1]; next} 
     /^>/ { 
         id = substr($1, 2);  # 去掉 > 符号，只保留 ID 部分
         if (id in a) { 
             print_header = 1; 
             print $0;
         } else { 
             print_header = 0;
         }
     } 
     !/^>/ && print_header { print $0 }' "$output_dir/mRNA_id.txt" "$protein_fasta" > "$output_dir/extracted_protein.fa"

# 完成
echo "Process completed successfully. Results saved in $output_dir."

# 建议：可以考虑将后续的 `blastp` 或其他分析步骤集成到脚本中。
# 例如：执行blastp分析，输出最佳比对结果等。

第一版：extract_genes.sh运行流程

脚本运行流程
1. 构建一个文件夹，将输出文件存放到这个地方
  1. o，指定输出文件文件夹
提取完全落在区间内的注释行
1. 参数
  1. g，指定注释文件
  2. s，指定区间的起始位置
  3. e，指定区间的终止位置
  4. c，指定染色体编号

awk -v start=4006664 -v end=15018754 -v chrom="Chr01" '$1 == chrom && $4 >= start && $5 <= end {print}' female_ba_last.gff3 > extract_genes.gff3

提取基因id

awk '$3 == "gene" {split($9, arr, "ID="); split(arr[2], id, ";"); print id[1]}' extract_genes.gff3 > gene_id.txt

提取mra的id

awk '$3 == "mRNA" {split($9, arr, "ID="); split(arr[2], id, ";"); print id[1]}' extract_genes.gff3 > mRNA_id.txt

根据mrna的id提取对应的cds序列
1. 使用seqkit和gene的id来提取，不错是不错，但是没有安装的话比较麻烦

seqtk subseq female_ba_last.cds.fa <(cat gene_id.txt | sed 's/$/-mRNA1/') > extracted_cds.fa

b.awk根据mrna的id文档来提取

参数d，指定cds序列

awk 'NR==FNR {a[$1]; next} 
     /^>/ { 
         id = substr($1, 2);  # 去掉 > 符号，只保留 ID 部分
         if (id in a) { 
             print_header = 1; 
             print $0;
         } else { 
             print_header = 0;
         }
     } 
     !/^>/ && print_header { print $0 }' mRNA_id.txt female_ba_last.cds.fa > extracted_cds_full.fa

和上一个相同，根据mrna的id提取对应的蛋白序列
1. 参数p，指定蛋白序列

awk 'NR==FNR {a[$1]; next} 
     /^>/ { 
         id = substr($1, 2);  # 去掉 > 符号，只保留 ID 部分
         if (id in a) { 
             print_header = 1; 
             print $0;
         } else { 
             print_header = 0;
         }
     } 
     !/^>/ && print_header { print $0 }' mRNA_id.txt female_ba_last.protein.fa > extracted_protein.fa