By jiangchenhao, 31 March, 2026
Forums

  • 这个文档的内容还会更新:
    https://www.kdocs.cn/l/cfA58xFB7km9

    猫言猫语

我之前和董老师提过使用等位基因频率的做法,我现在想到一个更方便的事情?!为什么不使用核苷酸多样性呢?不过话说回来了,如果使用Pi值,结果80%是反过来的。

不能用Pi值,但是可以用基于等位基因频率的向量来计算。

背景

基本做法是确定基因,从群体中提取基因序列,并进行一致性序列的计算。

一、 提取fasta序列:
1. /data/chuand/fusion_gene/data/female_vcf_clean/2_female.clean.vcf.gz vcf文件
2. /data/chuand/organ_development/data/Tgra.chr.fa 基因组
3. /data/chuand/organ_development/data/Tgra.chr.no_fragment.gff gff文件
4. Suplementary Tables.xlsx 文件里面的 1_vs_1.csv 存放的是基因对的信息。
5.为我个人创建一个工作目录:/data/jiangchenhao/26_2_4_chund_work_repeat

起容器

docker run -itd --name chuand_work_repeat_jiangchenhao_2_4_3 -v /data/jiangchenhao/26_2_4_chund_work_repeat:/project1/output -v /data/chuand/fusion_gene/data/female_vcf_clean/:/project1/input:ro  -v /data/chuand/organ_development/data/:/projec1/refer  ubuntu_mamba:1.12

问AI:

1.我支持使用共识序列。最好是使用IUPAC编码。
2.当我对多个个体提取序列后,我要根据为个体构建一共共识序列,这是可以做到的吗?如果可以做到,请告诉我要用什么工具。
3.当我要对两个序列比较基因相似性的时候,我希望先对齐,然后再使用python直接计数,有这样的工具吗?

环境安装

mamba create -n vcf2fasta -c conda-forge -c bioconda   bcftools bedtools samtools -y
mamba activate vcf2fasta
mamba install -c bioconda mafft

01对于每个基因提取所有样本的序列

fastmale_01_vcf_to_sequence.py

这个代码的输入是基因,样本list,输出是一个文件,里面包含了所有样本的序列。

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import subprocess


def get_gene_region(gene_id, gff):
    pats = [f"ID={gene_id}", f"Name={gene_id}", f'gene_id "{gene_id}"', f"gene={gene_id}"]
    with open(gff) as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            a = line.rstrip("\n").split("\t")
            if len(a) < 9:
                continue
            chrom, _, feature, start, end, _, strand, _, attr = a
            if feature == "gene" and any(p in attr for p in pats):
                return chrom, int(start), int(end), strand
    raise ValueError(f"gene not found: {gene_id}")


def cmd_out(cmd):
    p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if p.returncode != 0:
        raise RuntimeError(p.stderr)
    return p.stdout


def extract_gene_fasta_all_samples(gene_id, gff, vcf_gz, ref_fa, out_fa, samples, missing="N"):
    chrom, start, end, strand = get_gene_region(gene_id, gff)
    region = f"{chrom}:{start}-{end}"

    if not os.path.exists(vcf_gz + ".tbi"):
        raise FileNotFoundError(f"missing index: {vcf_gz}.tbi (run tabix -p vcf)")

    os.makedirs(os.path.dirname(os.path.abspath(out_fa)), exist_ok=True)

    with open(out_fa, "w") as out:
        for s in samples:
            fa = cmd_out(["bcftools", "consensus", "-M", missing, "-f", ref_fa, "-s", s, "-r", region, vcf_gz])
            seq = "".join([x.strip() for x in fa.splitlines()[1:] if x.strip()])
            out.write(f">{gene_id}|{s}|{region}|{strand}\n{seq}\n")

    return region

02对齐+投票,构建共识序列

fastmale_02_align_vote_consensus.py

(base) root@e6bde9036e8e:/project1/bin# cat fastmale_02_align_vote_consensus.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import subprocess
from collections import Counter


def read_fasta(path):
    seqs = {}
    name = None
    buf = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(buf).upper()
                name = line[1:].split()[0]
                buf = []
            else:
                buf.append(line)
        if name is not None:
            seqs[name] = "".join(buf).upper()
    return seqs


def write_fasta(path, header, seq, width=60):
    with open(path, "w") as out:
        out.write(f">{header}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i+width] + "\n")


def run_mafft(in_fa, out_aln, threads=4):
    cmd = ["mafft", "--auto", "--thread", str(threads), in_fa]
    p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if p.returncode != 0:
        raise RuntimeError(p.stderr)
    with open(out_aln, "w") as f:
        f.write(p.stdout)


def consensus_from_alignment(aln_fa, include_missing=True):
    seqs = read_fasta(aln_fa)
    names = list(seqs.keys())
    if not names:
        raise ValueError("empty alignment fasta")

    L = len(seqs[names[0]])
    for n in names[1:]:
        if len(seqs[n]) != L:
            raise ValueError("alignment sequences have different lengths")

    cons = []
    for i in range(L):
        col = [seqs[n][i] for n in names]

        if not include_missing:
            col = [c for c in col if c not in set(["-", "N"])]

        if not col:
            cons.append("N")
            continue

        cnt = Counter(col)
        top = cnt.most_common()
        if len(top) == 1 or top[0][1] > top[1][1]:
            cons.append(top[0][0])
        else:
            cons.append("N")
    return "".join(cons)



def build_population_consensus(in_fa, out_prefix, threads=4):
    aln_fa = out_prefix + ".aln.fa"
    cons_fa = out_prefix + ".consensus.fa"

    run_mafft(in_fa, aln_fa, threads=threads)
    cons = consensus_from_alignment(aln_fa)

    write_fasta(cons_fa, header=os.path.basename(out_prefix) + "|population_consensus", seq=cons)
    return aln_fa, cons_fa  

03结合01和02计算一个基因的共识序列

fastmale_03_gene_to_population_consensus.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse
import os

from fastmale_01_vcf_to_sequence import extract_gene_fasta_all_samples
from fastmale_02_align_vote_consensus import build_population_consensus


def read_samples(sample_list_file):
    with open(sample_list_file) as f:
        return [x.strip() for x in f if x.strip() and not x.startswith("#")]


def main():
    p = argparse.ArgumentParser(description="Extract per-sample gene FASTA from VCF and build population consensus.")
    p.add_argument("-g", "--gene", required=True, help="gene_id (must match GFF attributes)")
    p.add_argument("--gff", required=True, help="GFF/GTF file")
    p.add_argument("--vcf", required=True, help="bgzipped VCF (.vcf.gz)")
    p.add_argument("--ref", required=True, help="reference genome fasta")
    p.add_argument("--samples", required=True, help="sample list file (one sample per line)")
    p.add_argument("-o", "--outdir", default=".", help="output directory")
    p.add_argument("-t", "--threads", type=int, default=4, help="mafft threads")
    p.add_argument("--missing", default="N", help="missing genotype mask char for bcftools consensus (default: N)")
    args = p.parse_args()

    os.makedirs(args.outdir, exist_ok=True)
    sample_list = read_samples(args.samples)

    # 根据样本列表文件名生成前缀
    sample_file_prefix = os.path.splitext(os.path.basename(args.samples))[0]
    out_prefix = os.path.join(args.outdir, f"{sample_file_prefix}_{args.gene}")

    # per-sample FASTA 文件名统一
    per_sample_fa = f"{out_prefix}.samples.fa"

    # 提取每个样本的基因序列
    extract_gene_fasta_all_samples(
        gene_id=args.gene,
        gff=args.gff,
        vcf_gz=args.vcf,
        ref_fa=args.ref,
        out_fa=per_sample_fa,
        samples=sample_list,
        missing=args.missing
    )

    # 构建群体共识序列
    aln_fa, cons_fa = build_population_consensus(
        in_fa=per_sample_fa,
        out_prefix=out_prefix,
        threads=args.threads
    )

    print("DONE")
    print("Per-sample FASTA:", per_sample_fa)
    print("Alignment FASTA:", aln_fa)
    print("Consensus FASTA:", cons_fa)


if __name__ == "__main__":
    main()

04构建基因Pairs,并提取序列,进行比较

非常好,现在我要求你做更进一步的工作,我要求调用代码03计算一对基因A,A'在群体中M的序列,并比较其差异。
具体而言,我们先分别计算在基因A,A‘在群体M中的两个共识序列MA,MA'
然后我们比较MA'和MA的碱基差异 (杂合位点计算时可以考虑计为1分,缺失2分,纯合级别的差异计算为2个,这里有没有更严肃的说法?以避免我自己构建理论)
把以上代码汇总为一个可以命令行调用的代码,最好是对03的调用。

AI给的代码没有对齐,我要求他对齐后重新发给我

(base) root@e6bde9036e8e:/project1/bin# cat fastmale_04_compare_gene_pair.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import shutil
import argparse
import os
import subprocess
import tempfile
import re
import sys
from collections import Counter

def run_03_script(gene_id, gff, vcf_gz, ref_fa, samples, outdir, threads=4, missing="N"):
    """调用 fastmale_03_gene_to_population_consensus.py 生成单个基因的共识序列"""
    # 构建命令行参数
    cmd = [
        sys.executable, 
        f"{os.path.dirname(os.path.abspath(__file__))}/fastmale_03_gene_to_population_consensus.py",
        "-g", gene_id,
        "--gff", gff,
        "--vcf", vcf_gz,
        "--ref", ref_fa,
        "--samples", samples,
        "-o", outdir,
        "-t", str(threads),
        "--missing", missing
    ]
    
    # 执行命令
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    
    if result.returncode != 0:
        raise RuntimeError(f"Error running script for gene {gene_id}:\n{result.stderr}")
    
    # 查找输出的共识序列文件
    sample_file_prefix = os.path.splitext(os.path.basename(samples))[0]
    cons_file = os.path.join(outdir, f"{sample_file_prefix}_{gene_id}.consensus.fa")
    
    if not os.path.exists(cons_file):
        raise FileNotFoundError(f"Consensus file not found: {cons_file}")
    
    return cons_file

def read_consensus_sequence(cons_file):
    """从FASTA文件中读取共识序列"""
    with open(cons_file) as f:
        lines = f.readlines()
        if not lines:
            raise ValueError(f"Empty consensus file: {cons_file}")
        
        # 忽略以 '>' 开头的行
        seq = "".join(line.strip().upper() for line in lines if not line.startswith(">"))
        
        if not seq:
            raise ValueError(f"Empty sequence in consensus file: {cons_file}")
        
        return seq

def pairwise_align(seq1, seq2, threads=1):
    """使用mafft对两条序列进行双序列比对"""
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp_in, \
         tempfile.NamedTemporaryFile(mode='r', delete=False) as tmp_out:
        
        tmp_in.write(f">seq1\n{seq1}\n>seq2\n{seq2}\n")
        tmp_in.flush()
        
        cmd = ["mafft", "--quiet", "--thread", str(threads), tmp_in.name]
        p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        
        if p.returncode != 0:
            os.unlink(tmp_in.name)
            raise RuntimeError(f"mafft error: {p.stderr}")
        
        # 将比对结果写入临时文件
        with open(tmp_out.name, 'w') as f:
            f.write(p.stdout)
        
        os.unlink(tmp_in.name)
        return tmp_out.name

def calculate_sequence_diff(aligned_file):
    """基于 IUPAC 等位集合的序列差异计算"""

    IUPAC_MAP = {
        'A': {'A'}, 'C': {'C'}, 'G': {'G'}, 'T': {'T'},
        'R': {'A', 'G'}, 'Y': {'C', 'T'},
        'S': {'G', 'C'}, 'W': {'A', 'T'},
        'K': {'G', 'T'}, 'M': {'A', 'C'},
        'B': {'C', 'G', 'T'}, 'D': {'A', 'G', 'T'},
        'H': {'A', 'C', 'T'}, 'V': {'A', 'C', 'G'},
        'N': None, '-': None
    }

    with open(aligned_file) as f:
        content = f.read()
        seqs = re.findall(r">.*\n([A-Za-z\n-]+)", content)

        if len(seqs) != 2:
            raise ValueError("Alignment file should contain exactly 2 sequences")

        seq1 = seqs[0].replace("\n", "").upper()
        seq2 = seqs[1].replace("\n", "").upper()

    total_score = 0
    homo_diff = 0
    hetero_diff = 0
    missing = 0

    for b1, b2 in zip(seq1, seq2):
        if b1 == b2:
            continue

        a1 = IUPAC_MAP.get(b1)
        a2 = IUPAC_MAP.get(b2)

        # 任意一方是缺失或 gap
        if a1 is None and a2 is None:
            continue
        if a1 is None or a2 is None:
            total_score += 2
            missing += 1
            continue

        # 有交集:杂合 or 杂合-纯合
        if a1 & a2:
            total_score += 1
            hetero_diff += 1
        else:
            total_score += 2
            homo_diff += 1

    return {
        "total_score": total_score,
        "homozygous_diff": homo_diff,
        "heterozygous_diff": hetero_diff,
        "missing": missing,
        "alignment_length": len(seq1)
    }

def main():
    parser = argparse.ArgumentParser(description="比较两个基因在群体中的共识序列差异")
    parser.add_argument("--geneA", required=True, help="第一个基因ID")
    parser.add_argument("--geneA_prime", required=True, help="第二个基因ID")
    parser.add_argument("--gff", required=True, help="GFF/GTF注释文件")
    parser.add_argument("--vcf", required=True, help="BGZIP压缩的VCF文件")
    parser.add_argument("--ref", required=True, help="参考基因组FASTA文件")
    parser.add_argument("--samples", required=True, help="样本列表文件")
    parser.add_argument("-o", "--outdir", default="results", help="输出目录")
    parser.add_argument("-t", "--threads", type=int, default=4, help="并行线程数")
    parser.add_argument("--missing", default="N", help="缺失基因型标记字符")
    args = parser.parse_args()

    # 创建输出目录
    os.makedirs(args.outdir, exist_ok=True)
    
    # 为基因A生成共识序列
    print(f"Generating consensus for gene {args.geneA}...")
    cons_file_A = run_03_script(
        gene_id=args.geneA,
        gff=args.gff,
        vcf_gz=args.vcf,
        ref_fa=args.ref,
        samples=args.samples,
        outdir=args.outdir,
        threads=args.threads,
        missing=args.missing
    )
    cons_seq_A = read_consensus_sequence(cons_file_A)
    print(f"Consensus for {args.geneA} generated: {cons_file_A}")
    
    # 为基因A'生成共识序列
    print(f"\nGenerating consensus for gene {args.geneA_prime}...")
    cons_file_A_prime = run_03_script(
        gene_id=args.geneA_prime,
        gff=args.gff,
        vcf_gz=args.vcf,
        ref_fa=args.ref,
        samples=args.samples,
        outdir=args.outdir,
        threads=args.threads,
        missing=args.missing
    )
    cons_seq_A_prime = read_consensus_sequence(cons_file_A_prime)
    print(f"Consensus for {args.geneA_prime} generated: {cons_file_A_prime}")
    
    # 比对两个共识序列
    print("\nAligning consensus sequences...")
    aligned_tmp = pairwise_align(cons_seq_A, cons_seq_A_prime, threads=args.threads)
    aligned_file = os.path.join(args.outdir,f"{args.geneA}_vs_{args.geneA_prime}.aligned.fa")
    shutil.move(aligned_tmp, aligned_file)
    
    # 计算序列差异
    diff_result = calculate_sequence_diff(aligned_file)
    
    # 输出比较结果
    print("\n===== Comparison Results =====")
    print(f"Alignment length: {diff_result['alignment_length']} bp")
    print(f"Total difference score: {diff_result['total_score']}")
    print(f"Homozygous differences (2 points): {diff_result['homozygous_diff']}")
    print(f"Heterozygous-related differences (1 point): {diff_result['heterozygous_diff']}")
    print(f"Missing sites (2 points): {diff_result['missing']}")
    # 保存结果到文件
    result_file = os.path.join(args.outdir, "comparison_results.txt")

    with open(result_file, "w") as f:
        f.write(f"Gene A: {args.geneA}\n")
        f.write(f"Gene A': {args.geneA_prime}\n")
        f.write(f"Alignment length: {diff_result['alignment_length']} bp\n")
        f.write(f"Total difference score: {diff_result['total_score']}\n")
        f.write(f"Homozygous differences: {diff_result['homozygous_diff']}\n")
        f.write(f"Heterozygous-related differences: {diff_result['heterozygous_diff']}\n")
        f.write(f"Missing sites: {diff_result['missing']}\n")

    
    print(f"\nFull results saved to {result_file}")
    #os.unlink(aligned_file)  # 清理临时文件

if __name__ == "__main__":
    main()
  • 杂合位点(IUPAC)差异计 1
  • 缺失(N 或 -)计 2
  • 纯合差异(A vs T/G/C)计 2

让Deepseek写的代码应当是没有

fastmale_04_compare_gene_pair.py   --geneA evm.TU.PTG000014L.12   --geneA_prime evm.TU.PTG000700L.62   --gff /projec1/refer/Tgra.chr.gff   --vcf /project1/input/2_female.clean.vcf.gz   --ref /projec1/refer/Tgra.chr.fa   --samples /project1/output/female_pop.txt   -o ./results   -t 8

05 根据pair list进行迭代运算,并形成汇总文件。

好的,现在我对04代码暂时没有问题了,我要求你给我写05代码,他是对04代码的调用,目标是这样,输入一个pair list,他是一个格式为A\tA'的制位表文件。我要求你遍历这个文件夹下的所有pairs对,调04计算相似程度,
最后,当你跑完所有基因对了,我要求你汇总结果为制位表文件,请你尽可能避免文件路径之类的接口问题。04的每个基因的结果格式如下:
(vcf2fasta) root@e6bde9036e8e:/project1/output/results# cat comparison_results.txt 
Gene A: evm.TU.PTG000014L.12 
Gene A': evm.TU.PTG000700L.62 
Alignment length: 15125 bp 
Total difference score: 20439 
Homozygous differences: 4555 
Heterozygous-related differences: 13 
Missing sites: 5658

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse
import re
import os
import subprocess
import csv
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm


def run_04_script(geneA, geneA_prime, gff, vcf, ref, samples, outdir, missing="N"):
    """调用 fastmale_04_compare_gene_pair.py,比对并保留全部中间文件(04 单线程)"""

    pair_tag = f"{geneA}__{geneA_prime}"
    pair_outdir = os.path.join(outdir, "pairs", pair_tag)
    os.makedirs(pair_outdir, exist_ok=True)

    cmd = [
        "fastmale_04_compare_gene_pair.py",
        "--geneA", geneA,
        "--geneA_prime", geneA_prime,
        "--gff", gff,
        "--vcf", vcf,
        "--ref", ref,
        "--samples", samples,
        "-o", pair_outdir,
        "-t", "1",          # 04 固定单线程
        "--missing", missing
    ]

    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )

    if result.returncode != 0:
        raise RuntimeError(
            f"{geneA} vs {geneA_prime} failed:\n{result.stderr}"
        )

    comp_file = os.path.join(pair_outdir, "comparison_results.txt")
    if not os.path.exists(comp_file):
        raise FileNotFoundError(
            f"comparison_results.txt not found for {pair_tag}"
        )

    data = {}
    with open(comp_file) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Gene A:"):
                data["GeneA"] = line.split(":", 1)[1].strip()
            elif line.startswith("Gene A'"):
                data["GeneA_prime"] = line.split(":", 1)[1].strip()
            elif line.startswith("Alignment length:"):
                data["Alignment_length"] = int(re.search(r"\d+", line).group())
            elif line.startswith("Total difference score:"):
                data["Total_score"] = int(re.search(r"\d+", line).group())
            elif line.startswith("Homozygous differences"):
                data["Homozygous_diff"] = int(re.search(r"\d+", line).group())
            elif line.startswith("Heterozygous-related differences"):
                data["Heterozygous_diff"] = int(re.search(r"\d+", line).group())
            elif line.startswith("Missing sites"):
                data["Missing"] = int(re.search(r"\d+", line).group())

    return data


def worker(args):
    geneA, geneA_prime, gff, vcf, ref, samples, outdir, missing = args
    data = run_04_script(
        geneA, geneA_prime, gff, vcf, ref, samples, outdir, missing
    )

    data["Missing_fraction"] = round(
        data["Missing"] / data["Alignment_length"], 4
    )
    data["Information_fraction"] = round(
        1 - data["Missing_fraction"], 4
    )
    return data
def main():
    parser = argparse.ArgumentParser(description="批量比较基因对(保留所有中间文件)")
    parser.add_argument("--pair_list", required=True)
    parser.add_argument("--gff", required=True)
    parser.add_argument("--vcf", required=True)
    parser.add_argument("--ref", required=True)
    parser.add_argument("--samples", required=True)
    parser.add_argument("-o", "--outdir", default="results")
    parser.add_argument(
        "-t", "--threads", type=int, default=4,
        help="pairs 并行进程数"
    )
    parser.add_argument("--missing", default="N")
    args = parser.parse_args()

    os.makedirs(args.outdir, exist_ok=True)
    os.makedirs(os.path.join(args.outdir, "pairs"), exist_ok=True)

    summary_file = os.path.join(args.outdir, "all_pairs_comparison.tsv")
    fieldnames = [
        "GeneA", "GeneA_prime",
        "Alignment_length", "Total_score",
        "Homozygous_diff", "Heterozygous_diff", "Missing",
        "Missing_fraction", "Information_fraction"
    ]

    pairs = []
    with open(args.pair_list) as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                pairs.append((row[0].strip(), row[1].strip()))

    tasks = [
        (geneA, geneA_prime,
         args.gff, args.vcf, args.ref, args.samples,
         args.outdir, args.missing)
        for geneA, geneA_prime in pairs
    ]

    with open(summary_file, "w", newline="") as outcsv:
        writer = csv.DictWriter(outcsv, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()

        with ProcessPoolExecutor(max_workers=args.threads) as executor:
            futures = [executor.submit(worker, t) for t in tasks]

            for future in tqdm(
                as_completed(futures),
                total=len(futures),
                desc="Comparing gene pairs"
            ):
                try:
                    data = future.result()
                    writer.writerow(data)
                except Exception as e:
                    print(f"[ERROR] {e}")

    print(f"\nAll pairs comparison finished: {summary_file}")


if __name__ == "__main__":
    main()

比较两者的结果

好的,现在我有female的比较文件female_pair_comparison_summary.tsv
我希望你根据这个文件计算序列相似度,计算方法是这样:
对于male_pair_comparison_summary.tsv的每一行,用Alignment_length列,减去Homozygous_differences,Heterozygous_related_differences,Missing_sites,得到相同位点的数量,这个时候再用相同位点的数量/Alignment_length,我们就得到了相同位点的比例。
然后我还有male的比较文件,你也要计算一遍male_pair_comparison_summary.tsv。
最后,我希望根据这些结果绘制一个用于比较male和female的小提琴图,请用R代码绘图,可以吗?

Promote

[jiangchenhao@localhost fasta_male_promote]$ cat fastmale_01_vcf_to_sequence.py 
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import subprocess

def get_gene_region(gene_id, gff):
    pats = [f"ID={gene_id}", f"Name={gene_id}", f'gene_id "{gene_id}"', f"gene={gene_id}"]
    with open(gff) as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            a = line.rstrip("\n").split("\t")
            if len(a) < 9:
                continue
            chrom, _, feature, start, end, _, strand, _, attr = a
            if feature == "gene" and any(p in attr for p in pats):
                return chrom, int(start), int(end), strand
    raise ValueError(f"gene not found: {gene_id}")

def cmd_out(cmd, shell=False):
    p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, shell=shell)
    if p.returncode != 0:
        raise RuntimeError(p.stderr)
    return p.stdout

def extract_gene_fasta_all_samples(gene_id, gff, vcf_gz, ref_fa, out_fa, samples, missing="N"):

    promoter_len = 2000  # ← 在这里改 promoter 长度即可

    chrom, start, end, strand = get_gene_region(gene_id, gff)

    # ---------- 计算 promoter 区间 ----------
    if strand == "+":
        pro_start = max(1, start - promoter_len)
        pro_end = start - 1
    else:
        pro_start = end + 1
        pro_end = end + promoter_len

    region = f"{chrom}:{pro_start}-{pro_end}"

    # ---------- 检查 index ----------
    if not (os.path.exists(vcf_gz + ".csi") or os.path.exists(vcf_gz + ".tbi")):
        raise FileNotFoundError(f"missing index for {vcf_gz} (run bcftools index or tabix)")

    os.makedirs(os.path.dirname(os.path.abspath(out_fa)), exist_ok=True)

    # ---------- 反向互补函数 ----------
    def revcomp(seq):
        comp = str.maketrans("ATCGNatcgn", "TAGCNtagcn")
        return seq.translate(comp)[::-1]

    # ---------- 提取每个样本 ----------
    with open(out_fa, "w") as out:
        for s in samples:
            cmd = (
                f"samtools faidx {ref_fa} {region} | "
                f"bcftools consensus -M {missing} -s {s} -I -f - {vcf_gz}"
            )

            fa = cmd_out(cmd, shell=True)

            seq = "".join([x.strip() for x in fa.splitlines()[1:] if x.strip()])

            # 如果是负链 → 反向互补
            if strand == "-":
                seq = revcomp(seq)
            out.write(f">{gene_id}|{s}|{region}|{strand}\n{seq}\n")

调用代码

/project1/bin/fasta_male_promote/fastmale_05_compare_all_pairs.py --pair_list /project1/output/pairs_simple.txt  --gff /projec1/refer/Tgra.chr.no_fragment.gff --vcf /project1/input/2_female.clean.vcf.gz  --ref /projec1/refer/Tgra.chr.fa --samples /project1/output/female_pop.txt -o /project1/output/ --threads 4

让AI再修改一下这个版本的代码块

当我跑10个基因对的时候,无事发生,但是当我跑7000个基因对的时候,事情就发生了变化,程序不能正常工作,疑似卡死了。
你能不能把我上述的五个代码整合成一个代码,保持接口完全不变?

代码块

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
fastmale_integrated.py - 整合版基因对比较工具
整合了原fastmale_01到05的所有功能,优化了大规模处理性能
"""
import argparse
import os
import sys
import re
import csv
import subprocess
import tempfile
import shutil
import uuid
from collections import Counter
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm
import psutil

# ------------------------- 核心功能模块 -------------------------

def get_gene_region(gene_id, gff):
    """从GFF文件中获取基因区域 (原01功能)"""
    pats = [f"ID={gene_id}", f"Name={gene_id}", f'gene_id "{gene_id}"', f"gene={gene_id}"]
    with open(gff) as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            a = line.rstrip("\n").split("\t")
            if len(a) < 9:
                continue
            chrom, _, feature, start, end, _, strand, _, attr = a
            if feature == "gene" and any(p in attr for p in pats):
                return chrom, int(start), int(end), strand
    raise ValueError(f"Gene not found in GFF: {gene_id}")

def run_command(cmd, shell=False):
    """安全执行命令行工具"""
    p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, shell=shell)
    if p.returncode != 0:
        raise RuntimeError(f"Command failed: {' '.join(cmd) if not shell else cmd}\n{p.stderr}")
    return p.stdout

def extract_gene_fasta(gene_id, gff, vcf_gz, ref_fa, sample, missing="N", promoter_len=2000):
    """为单个样本提取基因FASTA序列 (原01功能增强版)"""
    chrom, start, end, strand = get_gene_region(gene_id, gff)
    
    # 计算启动子区域
    if strand == "+":
        pro_start = max(1, start - promoter_len)
        pro_end = start - 1
    else:
        pro_start = end + 1
        pro_end = end + promoter_len
        
    region = f"{chrom}:{pro_start}-{pro_end}"
    
    # 反向互补函数
    def revcomp(seq):
        comp = str.maketrans("ATCGNatcgn", "TAGCNtagcn")
        return seq.translate(comp)[::-1]
    
    # 提取序列
    cmd = f"samtools faidx {ref_fa} {region} | bcftools consensus -M {missing} -s {sample} -I -f - {vcf_gz}"
    fa = run_command(cmd, shell=True)
    
    # 处理序列
    seq_lines = [line.strip() for line in fa.splitlines() if line.strip() and not line.startswith(">")]
    seq = "".join(seq_lines)
    
    if strand == "-":
        seq = revcomp(seq)
    
    return f">{gene_id}|{sample}|{region}|{strand}\n{seq}\n"

def run_mafft(in_fa, out_aln, threads=1):
    """运行MAFFT进行多序列比对 (原02功能)"""
    cmd = ["mafft", "--auto", "--quiet", "--thread", str(threads), in_fa]
    p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if p.returncode != 0:
        raise RuntimeError(f"MAFFT error: {p.stderr}")
    with open(out_aln, "w") as f:
        f.write(p.stdout)

def consensus_from_alignment(aln_fa):
    """从比对结果构建共识序列 (原02功能)"""
    seqs = {}
    current_seq = []
    current_name = None
    
    with open(aln_fa) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if current_name:
                    seqs[current_name] = "".join(current_seq).upper()
                current_name = line[1:].split()[0]
                current_seq = []
            elif line:
                current_seq.append(line)
                
        if current_name:
            seqs[current_name] = "".join(current_seq).upper()
    
    if not seqs:
        raise ValueError("Empty alignment file")
    
    seq_length = len(next(iter(seqs.values())))
    consensus = []
    
    for i in range(seq_length):
        column = [seq[i] for seq in seqs.values() if i < len(seq)]
        # 移除非信息位点
        column = [base for base in column if base not in ['-', 'N']]
        
        if not column:
            consensus.append('N')
            continue
            
        base_counts = Counter(column)
        most_common = base_counts.most_common(1)[0][0]
        consensus.append(most_common)
    
    return "".join(consensus)

def build_population_consensus(gene_id, samples, gff, vcf_gz, ref_fa, out_prefix, threads=1, missing="N"):
    """构建群体共识序列 (原03功能优化版)"""
    # 创建每样本FASTA序列
    per_sample_fa = f"{out_prefix}.samples.fa"
    
    with open(per_sample_fa, "w") as out_fa:
        for sample in samples:
            seq = extract_gene_fasta(gene_id, gff, vcf_gz, ref_fa, sample, missing)
            out_fa.write(seq)
    
    # 运行MAFFT比对
    aln_fa = f"{out_prefix}.aln.fa"
    run_mafft(per_sample_fa, aln_fa, threads=threads)
    
    # 生成共识序列
    consensus_seq = consensus_from_alignment(aln_fa)
    cons_fa = f"{out_prefix}.consensus.fa"
    
    with open(cons_fa, "w") as out:
        out.write(f">{gene_id}_consensus\n{consensus_seq}\n")
    
    return per_sample_fa, aln_fa, cons_fa

def pairwise_align(seq1, seq2, threads=1):
    """对两条序列进行双序列比对 (原04功能)"""
    # 使用UUID确保临时文件唯一性,避免并行冲突
    unique_id = uuid.uuid4().hex[:8]
    tmp_in = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix=f"_{unique_id}.fa")
    tmp_out = tempfile.NamedTemporaryFile(mode='r', delete=False, suffix=f"_{unique_id}.aln")
    
    # 写入序列
    tmp_in.write(f">seq1\n{seq1}\n>seq2\n{seq2}\n")
    tmp_in.close()
    
    # 运行MAFFT
    cmd = ["mafft", "--quiet", "--thread", str(threads), tmp_in.name]
    p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    
    if p.returncode != 0:
        os.unlink(tmp_in.name)
        os.unlink(tmp_out.name)
        raise RuntimeError(f"MAFFT error: {p.stderr}")
    
    # 保存比对结果
    with open(tmp_out.name, 'w') as f:
        f.write(p.stdout)
    
    os.unlink(tmp_in.name)
    return tmp_out.name

def calculate_sequence_diff(aligned_file):
    """计算序列差异 (原04功能增强版)"""
    IUPAC_MAP = {
        'A': {'A'}, 'C': {'C'}, 'G': {'G'}, 'T': {'T'},
        'R': {'A', 'G'}, 'Y': {'C', 'T'}, 'S': {'G', 'C'}, 'W': {'A', 'T'},
        'K': {'G', 'T'}, 'M': {'A', 'C'}, 'B': {'C', 'G', 'T'}, 'D': {'A', 'G', 'T'},
        'H': {'A', 'C', 'T'}, 'V': {'A', 'C', 'G'}, 'N': None, '-': None
    }
    
    with open(aligned_file) as f:
        content = f.read()
        seqs = re.findall(r">.*\n([A-Za-z\n-]+)", content)
        
        if len(seqs) != 2:
            raise ValueError(f"Alignment file should contain exactly 2 sequences, found {len(seqs)}")
        
        seq1 = seqs[0].replace("\n", "").upper()
        seq2 = seqs[1].replace("\n", "").upper()
    
    if len(seq1) != len(seq2):
        raise ValueError("Aligned sequences have different lengths")
    
    total_score = 0
    homo_diff = 0
    hetero_diff = 0
    missing = 0
    
    for b1, b2 in zip(seq1, seq2):
        if b1 == b2:
            continue
            
        a1 = IUPAC_MAP.get(b1)
        a2 = IUPAC_MAP.get(b2)
        
        # 处理缺失或gap
        if a1 is None and a2 is None:
            continue
        elif a1 is None or a2 is None:
            total_score += 2
            missing += 1
            continue
            
        # 检查交集
        if a1 & a2:
            total_score += 1
            hetero_diff += 1
        else:
            total_score += 2
            homo_diff += 1
            
    return {
        "total_score": total_score,
        "homozygous_diff": homo_diff,
        "heterozygous_diff": hetero_diff,
        "missing": missing,
        "alignment_length": len(seq1)
    }

def process_gene_pair(args):
    """处理单个基因对 (核心工作函数)"""
    geneA, geneA_prime, gff, vcf_gz, ref_fa, samples_list, outdir, missing = args
    
    # 确保输出目录存在
    pair_dir = os.path.join(outdir, "pairs", f"{geneA}__{geneA_prime}")
    os.makedirs(pair_dir, exist_ok=True)
    
    # 读取样本列表
    with open(samples_list) as f:
        samples = [line.strip() for line in f if line.strip() and not line.startswith("#")]
    
    # 资源监控
    mem = psutil.virtual_memory()
    if mem.available < 1024**3:  # 小于1GB内存时警告
        print(f"WARNING: Low memory available ({mem.available/1024**2:.1f}MB) for {geneA} vs {geneA_prime}")
    
    # 处理基因A
    prefix_A = os.path.join(pair_dir, geneA)
    _, _, cons_file_A = build_population_consensus(
        geneA, samples, gff, vcf_gz, ref_fa, prefix_A, threads=1, missing=missing
    )
    
    # 读取共识序列
    with open(cons_file_A) as f:
        cons_seq_A = "".join(line.strip() for line in f if not line.startswith(">"))
    
    # 处理基因A'
    prefix_A_prime = os.path.join(pair_dir, geneA_prime)
    _, _, cons_file_A_prime = build_population_consensus(
        geneA_prime, samples, gff, vcf_gz, ref_fa, prefix_A_prime, threads=1, missing=missing
    )
    
    # 读取共识序列
    with open(cons_file_A_prime) as f:
        cons_seq_A_prime = "".join(line.strip() for line in f if not line.startswith(">"))
    
    # 序列比对
    aligned_tmp = pairwise_align(cons_seq_A, cons_seq_A_prime, threads=1)
    aligned_file = os.path.join(pair_dir, f"{geneA}_vs_{geneA_prime}.aligned.fa")
    shutil.move(aligned_tmp, aligned_file)
    
    # 计算差异
    diff_result = calculate_sequence_diff(aligned_file)
    
    # 保存结果
    result_file = os.path.join(pair_dir, "comparison_results.txt")
    with open(result_file, "w") as f:
        f.write(f"Gene A: {geneA}\n")
        f.write(f"Gene A': {geneA_prime}\n")
        f.write(f"Alignment length: {diff_result['alignment_length']} bp\n")
        f.write(f"Total difference score: {diff_result['total_score']}\n")
        f.write(f"Homozygous differences: {diff_result['homozygous_diff']}\n")
        f.write(f"Heterozygous-related differences: {diff_result['heterozygous_diff']}\n")
        f.write(f"Missing sites: {diff_result['missing']}\n")
    
    # 返回结构化结果
    return {
        "GeneA": geneA,
        "GeneA_prime": geneA_prime,
        "Alignment_length": diff_result["alignment_length"],
        "Total_score": diff_result["total_score"],
        "Homozygous_diff": diff_result["homozygous_diff"],
        "Heterozygous_diff": diff_result["heterozygous_diff"],
        "Missing": diff_result["missing"],
        "Missing_fraction": round(diff_result["missing"] / diff_result["alignment_length"], 4),
        "Information_fraction": round(1 - (diff_result["missing"] / diff_result["alignment_length"]), 4)
    }

# ------------------------- 命令行接口 -------------------------

def process_single_gene(args):
    """处理单个基因模式 (原03功能)"""
    # 读取样本列表
    with open(args.samples) as f:
        samples = [line.strip() for line in f if line.strip() and not line.startswith("#")]
    
    # 创建输出前缀
    sample_prefix = os.path.splitext(os.path.basename(args.samples))[0]
    out_prefix = os.path.join(args.outdir, f"{sample_prefix}_{args.gene}")
    
    # 构建群体共识
    per_sample_fa, aln_fa, cons_fa = build_population_consensus(
        args.gene, samples, args.gff, args.vcf, args.ref, out_prefix, 
        threads=args.threads, missing=args.missing
    )
    
    print("DONE")
    print("Per-sample FASTA:", per_sample_fa)
    print("Alignment FASTA:", aln_fa)
    print("Consensus FASTA:", cons_fa)

def process_gene_pair_direct(args):
    """直接处理基因对模式 (原04功能)"""
    # 读取样本列表
    with open(args.samples) as f:
        samples = [line.strip() for line in f if line.strip() and not line.startswith("#")]
    
    # 处理基因对
    result = process_gene_pair((
        args.geneA, args.geneA_prime, args.gff, args.vcf, args.ref, 
        args.samples, args.outdir, args.missing
    ))
    
    # 输出结果
    print("\n===== Comparison Results =====")
    print(f"Gene A: {result['GeneA']}")
    print(f"Gene A': {result['GeneA_prime']}")
    print(f"Alignment length: {result['Alignment_length']} bp")
    print(f"Total difference score: {result['Total_score']}")
    print(f"Homozygous differences: {result['Homozygous_diff']}")
    print(f"Heterozygous-related differences: {result['Heterozygous_diff']}")
    print(f"Missing sites: {result['Missing']}")

def process_all_pairs(args):
    """批量处理基因对模式 (原05功能优化版)"""
    os.makedirs(args.outdir, exist_ok=True)
    os.makedirs(os.path.join(args.outdir, "pairs"), exist_ok=True)
    
    # 资源检查
    mem = psutil.virtual_memory()
    print(f"System resources at start:")
    print(f"  CPU cores: {os.cpu_count()}")
    print(f"  Memory: {mem.used/1024**3:.1f}/{mem.total/1024**3:.1f} GB ({mem.percent}% used)")
    print(f"  Available: {mem.available/1024**3:.1f} GB")
    
    # 读取基因对列表
    pairs = []
    with open(args.pair_list) as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:
            if len(row) >= 2:
                pairs.append((row[0].strip(), row[1].strip()))
    
    print(f"Loaded {len(pairs)} gene pairs for processing")
    
    # 准备任务
    tasks = [
        (geneA, geneA_prime, args.gff, args.vcf, args.ref, 
         args.samples, args.outdir, args.missing)
        for geneA, geneA_prime in pairs
    ]
    
    # 结果文件
    summary_file = os.path.join(args.outdir, "all_pairs_comparison.tsv")
    fieldnames = [
        "GeneA", "GeneA_prime",
        "Alignment_length", "Total_score",
        "Homozygous_diff", "Heterozygous_diff", "Missing",
        "Missing_fraction", "Information_fraction"
    ]
    
    # 使用进程池处理
    completed = 0
    with open(summary_file, "w") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        
        with ProcessPoolExecutor(max_workers=args.threads) as executor:
            futures = {executor.submit(process_gene_pair, task): task for task in tasks}
            
            # 进度条和资源监控
            with tqdm(total=len(futures), desc="Processing gene pairs") as pbar:
                for future in as_completed(futures):
                    try:
                        result = future.result()
                        writer.writerow(result)
                        completed += 1
                        pbar.update(1)
                        
                        # 每100对输出资源状态
                        if completed % 100 == 0:
                            mem = psutil.virtual_memory()
                            pbar.set_postfix({
                                "mem_used": f"{mem.used/1024**3:.1f}G",
                                "avail": f"{mem.available/1024**3:.1f}G"
                            })
                            
                    except Exception as e:
                        task = futures[future]
                        print(f"\nERROR processing {task[0]} vs {task[1]}: {str(e)}")
                        # 记录错误到日志
                        with open(os.path.join(args.outdir, "error_log.txt"), "a") as errlog:
                            errlog.write(f"Failed pair: {task[0]}__{task[1]}\nError: {str(e)}\n\n")
    
    print(f"\nProcessing completed. Results saved to: {summary_file}")
    print(f"Successfully processed {completed}/{len(pairs)} gene pairs")
    print(f"Error details in: {os.path.join(args.outdir, 'error_log.txt')}")

def main():
    parser = argparse.ArgumentParser(
        description="Integrated FASTMale Tool - Gene Pair Comparison",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Modes of operation:
  single-gene: Generate population consensus for a single gene
     Example: %(prog)s single-gene -g GENE_ID [options]
     
  gene-pair: Compare two specific genes
     Example: %(prog)s gene-pair --geneA GENE1 --geneA_prime GENE2 [options]
     
  all-pairs: Batch process list of gene pairs
     Example: %(prog)s all-pairs --pair_list PAIRS.txt [options]"""
    )
    
    # 模式选择
    subparsers = parser.add_subparsers(dest="mode", required=True, help="Operation mode")
    
    # 单个基因模式 (原03)
    parser_single = subparsers.add_parser('single-gene', help='Process a single gene')
    parser_single.add_argument("-g", "--gene", required=True, help="Gene ID")
    parser_single.add_argument("--gff", required=True, help="GFF/GTF file")
    parser_single.add_argument("--vcf", required=True, help="bgzipped VCF (.vcf.gz)")
    parser_single.add_argument("--ref", required=True, help="Reference genome FASTA")
    parser_single.add_argument("--samples", required=True, help="Sample list file")
    parser_single.add_argument("-o", "--outdir", default=".", help="Output directory")
    parser_single.add_argument("-t", "--threads", type=int, default=4, help="MAFFT threads")
    parser_single.add_argument("--missing", default="N", help="Missing genotype character")
    
    # 基因对模式 (原04)
    parser_pair = subparsers.add_parser('gene-pair', help='Compare two specific genes')
    parser_pair.add_argument("--geneA", required=True, help="First gene ID")
    parser_pair.add_argument("--geneA_prime", required=True, help="Second gene ID")
    parser_pair.add_argument("--gff", required=True, help="GFF/GTF file")
    parser_pair.add_argument("--vcf", required=True, help="bgzipped VCF (.vcf.gz)")
    parser_pair.add_argument("--ref", required=True, help="Reference genome FASTA")
    parser_pair.add_argument("--samples", required=True, help="Sample list file")
    parser_pair.add_argument("-o", "--outdir", default="results", help="Output directory")
    parser_pair.add_argument("--missing", default="N", help="Missing genotype character")
    
    # 批量模式 (原05)
    parser_batch = subparsers.add_parser('all-pairs', help='Batch process list of gene pairs')
    parser_batch.add_argument("--pair_list", required=True, help="TSV file with gene pairs")
    parser_batch.add_argument("--gff", required=True, help="GFF/GTF file")
    parser_batch.add_argument("--vcf", required=True, help="bgzipped VCF (.vcf.gz)")
    parser_batch.add_argument("--ref", required=True, help="Reference genome FASTA")
    parser_batch.add_argument("--samples", required=True, help="Sample list file")
    parser_batch.add_argument("-o", "--outdir", default="results", help="Output directory")
    parser_batch.add_argument("-t", "--threads", type=int, default=4, help="Parallel processes")
    parser_batch.add_argument("--missing", default="N", help="Missing genotype character")
    
    args = parser.parse_args()
    
    # 执行相应模式
    if args.mode == "single-gene":
        process_single_gene(args)
    elif args.mode == "gene-pair":
        process_gene_pair_direct(args)
    elif args.mode == "all-pairs":
        process_all_pairs(args)

if __name__ == "__main__":
    main()

调用

/project1/bin/fasta_male_promote/fastmale_integrated.py all-pairs   --pair_list /project1/output/pairs_simple.txt   --gff /projec1/refer/Tgra.chr.no_fragment.gff   --vcf /project1/input/2_female.clean.vcf.gz   --ref /projec1/refer/Tgra.chr.fa   --samples /project1/output/female_pop.txt   -o /project1/output/   --threads 20
/project1/bin/fasta_male_promote/fastmale_integrated.py all-pairs   --pair_list /project1/output/pairs_simple.txt   --gff /projec1/refer/Tgra.chr.no_fragment.gff   --vcf /project1/input/2_female.clean.vcf.gz   --ref /projec1/refer/Tgra.chr.fa   --samples /project1/output/male_pop.txt   -o /project1/output/   --threads 20

等着吧,等程序跑完。

按照不同的基因对类型进行fast_male计算

参考链接:https://www.kdocs.cn/l/ckAEEcJZKvrX

根据董老师给的代码,先把复制基因对在雌性网络里的表达量相关性和雄性网络里的表达量相关性计算出来。

(base) root@e6bde9036e8e:/project1/bin# cat coexpression_v3.py 
#!/usr/bin/env python3

import pandas as pd
from scipy.stats import pearsonr

# 输入文件
male_file = "/project1/output/TG_male_gene_fpkm_matrix.csv"
female_file = "/project1/output/TG_female_gene_fpkm_matrix_average.csv"
pair_file = "/project1/output/pairs_simple.txt"

# 读取表达矩阵
male = pd.read_csv(male_file)
female = pd.read_csv(female_file)

male = male.set_index(male.columns[0])
female = female.set_index(female.columns[0])

# gene → expression vector
male_exp = male.to_dict('index')
female_exp = female.to_dict('index')

pairs = pd.read_csv(pair_file, sep="\t")

results = []

for g1, g2 in zip(pairs.iloc[:,0], pairs.iloc[:,1]):

    if g1 not in male.index or g2 not in male.index:
        continue
    if g1 not in female.index or g2 not in female.index:
        continue

    m1 = male.loc[g1].values
    m2 = male.loc[g2].values

    f1 = female.loc[g1].values
    f2 = female.loc[g2].values

    r_m, p_m = pearsonr(m1, m2)
    r_f, p_f = pearsonr(f1, f2)

    results.append([g1, g2, r_m, p_m, r_f, p_f])

out = pd.DataFrame(results,
                   columns=["Gene1","Gene2","r_male","p_male","r_female","p_female"])

out.to_csv("gene_pair_correlation.txt", sep="\t", index=False)

print("Finished.")

好了,现在我们得到了这个表达量相关性的数据

(base) root@e6bde9036e8e:/project1/output# head gene_pair_correlation.txt 
Gene1   Gene2   r_male  p_male  r_female        p_female
evm.TU.PTG000014L.12    evm.TU.PTG000700L.62    0.32570273469192074     0.43112200023595526     0.14636552940230652     0.6865967478727787
evm.TU.PTG000014L.34    evm.TU.PTG009293L.1     0.987559857975728       4.768207710100965e-06   0.25692741409177094     0.4736254069802095
evm.TU.PTG000014L.90    evm.TU.PTG003679L.13    -0.22243486518277664    0.5964872697054151      0.0472959631398287      0.8967712005187125
evm.TU.PTG000014L.98    evm.TU.PTG005291L.1     0.1478849138616649      0.7267320559737454
evm.TU.PTG000021L.1     evm.TU.PTG009463L.37    -0.2252310598241153     0.591756597886558
evm.TU.PTG000022L.13    evm.TU.PTG000022L.20
evm.TU.PTG000033L.4     evm.TU.PTG003861L.7
evm.TU.PTG000033L.41    evm.TU.PTG003428L.45                    0.6781565114282653      0.031131517205275853
evm.TU.PTG000046L.3     evm.TU.PTG011317L.96

那么,我们先得到在两个网络里都有数据的基因对?对,先获得在两个网络里都有数据的基因对,然后获得符合下面标准的三个分类,e值设定为0.05. e值不到0.05就视为不显著。

所以要先用一个命令得到上述基因对,然后从promote_female_pop_similty.txt文件和promote_male_pop_similty.txt文件中调取相似度信息。

怎么样呢?

应该是可以的。最好汇总到一个脚本里。现在我写提示词。

给我写一个python脚本,目标如下,现在有输入文件gene_pair_correlation.txt,他是一个六列的制位标文件,格式如下:
Gene1 Gene2 r_male p_male r_female p_female
evm.TU.PTG000014L.12 evm.TU.PTG000700L.62 0.32570273469192074 0.43112200023595526 0.14636552940230652 0.6865967478727787
evm.TU.PTG000014L.34 evm.TU.PTG009293L.1 0.987559857975728 4.768207710100965e-06 0.25692741409177094 0.4736254069802095
evm.TU.PTG000014L.90 evm.TU.PTG003679L.13 -0.22243486518277664 0.5964872697054151 0.0472959631398287 0.8967712005187125
evm.TU.PTG000014L.98 evm.TU.PTG005291L.1 0.1478849138616649 0.7267320559737454
evm.TU.PTG000021L.1 evm.TU.PTG009463L.37 -0.2252310598241153 0.591756597886558
这个文件描述了基因对在雄性共表达网络和雌性共表达网络中的表达相关性指数。r_male代表在雄性共表达网络和雌性共表达网络中的相似度,有的行中出现了数据缺失的情况,我要求你从这个表格中得到数据全部完整的行,在这个基础上:
得到r_male>0 且p_male <=0.05,r_female<0且p_female<0.05的行,这种行定义为female-bias的基因对
得到r_female>0且p_female<0.05,r_male<0且p_male <=0.05的行,这种行定位为male-bias的基因对。
剩下的有完整数据的行视为没有任何bias的行,读取完了之后计数。

那么这个标准太高了,现在稍做修改:

这个文件描述了基因对在雄性共表达网络和雌性共表达网络中的表达相关性指数。r_male代表在雄性共表达网络和雌性共表达网络中的相似度,有的行中出现了数据缺失的情况,我要求你从这个表格中得到数据全部完整的行,在这个基础上:
得到r_male>0 且p_male <=0.05,r_female<0且p_female<0.05的行,这种行定义为male-bias的基因对;
得到r_female>0且p_female<0.05,r_male<0且p_male <=0.05的行,这种行定位为female-bias的基因对。
剩下的有完整数据的行视为没有任何bias的行,读取完了之后计数。

参考

使用了log值处理了一下fpkm,并逐步放宽p值后,数据量有所增加,在放宽p值到0.25后,基因的数量来到了百来个。

p值到0.25后,基因的数量来到了百来个

按照董老师的要求,完全取消p值

这是不做p值过滤的结果

这是做了p值过滤的结果

还是可以看到,no-bias的基因对序列相似度要高一些。

但是我发现一个问题,这里的no-bias的基因对,不一定就没有表达的分化,只能说,no-sex-bias,也有可能在其他组织中出现bias,这一点也是需要注意的。

咦,这里是不是可以用我那个根茎叶的转录组数据?
那个数据有18组呢。

很好,除了这张箱线图以外,我还需要绘制一张回归的图,横坐标是promote的相似性,纵坐标是共表达的相关程度。

那么,应当也要区分三种基因对吗?我想是需要的。

现在写一个代码,绘制一张回归的图,需要先筛选基因对,筛选那些共表达关系可靠的基因对,这是第一步,然后用female的共表达值-male的共表达值,这是y轴,

第二步,从相似度结果中获得promote的序列相似度,这个时候我们就得到一组回归数据,可以绘制一组拟合线。由于我又做了基因本身的相似度,这个时候我就又得到一组回归数据,又可以绘制一组拟合线。

现在搞笑的是,序列本身的相似度信息和promote的序列相似度信息是分成雌+雄两组的,这个还是比较愚蠢的。没有必要这样搞的。

至于后续要不要区分基因对,就另说了。

好的,现在我想绘制一个r代码,用来绘制一组回归图。
输入文件有这些
1.gene_pair_delta_r.txt。就是上面这个代码生成的,要对于每个基因对A,A‘载入Delta_r值
2.四个文件,分别是female和male群体中的基因相似度和promote相似度。female_pair_comparison_summary.tsv, male_pair_comparison_summary.tsv,promote_female_pop_similty.txt,promote_male_pop_similty.txt
我要求,对于male群体和female群体,这两个回归图要分别绘制,一个回归图里要有两条拟合线,分别代表基因的拟合和promot的拟合,我希望你可以参考这个代码中部分的绘图风格。

等一等哈,如果是male更容易解偶联,那么当male为+,female为+,male-female应该<0

当male为-,female为-,male-famle应该>0

所以可以总结为|male|-|female|<0

但是又有例外的情况,比如male>0,female<0,这个时候可以理解为male-female>0

所以解偶联的定义很重要,

我现在决定用male-feamle

0------
-
-相关性值:-0.5
-
-
-
-相关性值:-0.9

-0.5相比于-0.9是正相关了,还是负相关了?所以这个东西应该是理解为负相关了,更解偶联了?

这个数值为+时,代表male相比对female变得正相关的程度。

当这个数值为-时,代表female相比male变得正相关的程度。

所以+-都应该绘制拟合线。

而且这个值应该选择使用female-male,而不是male-female?

那么如果选择使用female-male作为拟合值?可以吗?

这个图确实能说明promote的贡献比gene的贡献大,但是要解释变化,这样似乎还不够。因为只解释了

这个结果不对,不能这样用,左边应该要绘制蛋白质的相似性,而不是dna序列的相似性,因为gene的相似度包含了cds+蛋白质。

应该是要这样做:蛋白质+基因+promote

改成绝对值回归,摆脱群体数据来计算。现在只看promote和蛋白质的线条。
两条线:改成p值过滤和p值不过滤。
回归方法改成箱线图的平均值回归。加上置信区间(95%)。

不再依赖于群体数据集的promote,基因,序列相似度计算

现在请你给我写一个python,def函数,输入是基因(实际编号为mRNA)对组成的list文件,要求,每次遍历每一个mRNA pair,然后提取基因对的蛋白质,基因序列(从mRNA映射过去,包含内含子的序列),promote序列(2000bp)。分别比较相似度,相似度规则为相似的碱基或氨基酸除以总的碱基或氨基酸数目。
输入是gff文件,基因组文件,gene_pairs文件
gene_pairs的格式是什么?
/project1/output/pairs_simple.txt 
evm.TU.PTG000013L.28 evm.TU.PTG010933L.31
evm.TU.PTG000014L.12 evm.TU.PTG000700L.62
在请你用gffread提取蛋白质序列,再用biopython提取目标蛋白质序列。
用samtools提取promote序列和含有内含子的mRNA序列。
这些包已经在当前环境中安装,你不必再安装
可以用一个tmp-dir来储存比对结果
注意正负链

在AI写的代码面前小修小补一下吧。

(base) root@8ade79efb6a6:/project1/bin# cat extract.py 
import subprocess
import os
import sys
from Bio.Seq import Seq
from Bio import Align
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

def run_command(cmd):
    """执行shell命令并返回输出"""
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    stdout, stderr = process.communicate()
    if process.returncode != 0:
        raise RuntimeError(f"命令执行失败: {cmd}\n错误信息: {stderr}")
    return stdout

def parse_gff_for_mrna(gff_file):
    """解析GFF文件,构建mRNA位置字典"""
    mrna_dict = {}
    with open(gff_file, 'r') as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue
            feature_type = parts[2]
            if feature_type not in ["mRNA", "transcript"]:
                continue
            
            attrs = {k: v for k, v in [item.split('=') for item in parts[8].split(';') if '=' in item]}
            mrna_id = attrs.get('ID', None)
            if not mrna_id:
                continue
            
            chrom = parts[0]
            start = int(parts[3])
            end = int(parts[4])
            strand = parts[6]
            mrna_dict[mrna_id] = (chrom, start, end, strand)
    return mrna_dict

def extract_sequence(genome_file, chrom, start, end, strand='+'):
    """使用samtools提取基因组序列,处理正负链"""
    region = f"{chrom}:{start}-{end}"
    cmd = f"samtools faidx {genome_file} {region}"
    output = run_command(cmd)
    seq_lines = output.split('\n')[1:]
    seq = ''.join(line.strip() for line in seq_lines if line.strip())
    
    if strand == '-':
        seq = str(Seq(seq).reverse_complement())
    return seq

def get_promoter_sequence(mrna_info, genome_file, length=2000):
    """获取启动子序列(转录起始位点上游2000bp)"""
    chrom, start, end, strand = mrna_info
    if strand == '+':
        promo_start = max(1, start - length)
        promo_end = start - 1
    else:  # 负链基因的启动子在下游
        promo_start = end + 1
        promo_end = end + length
    return extract_sequence(genome_file, chrom, promo_start, promo_end, strand)

def get_gene_sequence(mrna_info, genome_file):
    """获取包含内含子的完整基因序列"""
    chrom, start, end, strand = mrna_info
    return extract_sequence(genome_file, chrom, start, end, strand)

def extract_all_protein_sequences(gff_file, genome_file, cache_file="all_proteins.pep"):
    """使用gffread提取所有mRNA的蛋白质序列并缓存"""
    if os.path.exists(cache_file):
        return cache_file
    
    with tempfile.NamedTemporaryFile(mode='w') as tmp_gff, open(cache_file, 'w') as tmp_faa:
        # 创建包含所有mRNA的临时GFF
        tmp_gff.write(f"##gff-version 3\n")
        run_command(f"grep 'mRNA\\|transcript' {gff_file} >> {tmp_gff.name}")
        tmp_gff.flush()
        
        # 生成蛋白质序列
        cmd = f"gffread -y {tmp_faa.name} -g {genome_file} {tmp_gff.name}"
        run_command(cmd)
        tmp_faa.write(open(tmp_faa.name).read())
    
    return cache_file

def get_protein_sequence_from_cache(cache_file, mrna_id):
    """从缓存文件中提取特定mRNA的蛋白质序列"""
    with open(cache_file, 'r') as faa:
        lines = faa.readlines()
        header, *seq_lines = [], []
        for line in lines:
            if line.startswith('>'):
                if header and seq_lines:
                    if header[0].split()[0][1:] == mrna_id:
                        return ''.join(seq_lines).replace('\n', '')
                    header, seq_lines = [], []
                header.append(line)
            else:
                seq_lines.append(line.strip())
        if header and seq_lines:
            if header[0].split()[0][1:] == mrna_id:
                return ''.join(seq_lines).replace('\n', '')
    return None

def calculate_similarity_with_mafft(seq1, seq2):
    """使用MAFFT计算两条序列的相似度"""
    if not seq1 or not seq2:
        return 0.0

    with tempfile.NamedTemporaryFile(mode='w') as tmp_fasta, \
         tempfile.NamedTemporaryFile(mode='r') as tmp_aligned:

        # 创建临时FASTA文件
        tmp_fasta.write(f">seq1\n{seq1}\n>seq2\n{seq2}\n")
        tmp_fasta.flush()

        # 使用MAFFT进行比对
        cmd = f"mafft --auto {tmp_fasta.name} > {tmp_aligned.name}"
        run_command(cmd)

        # 读取比对结果
        alignment = tmp_aligned.read()
        lines = alignment.strip().split('\n')
        seq1_aligned = ''.join(lines[1:lines.index('>')]).replace('-', '')
        seq2_aligned = ''.join(lines[lines.index('>')+1:]).replace('-', '')

        # 计算相同字符数
        matches = sum(1 for a, b in zip(seq1_aligned, seq2_aligned) if a == b)
        return matches / min(len(seq1), len(seq2))


def process_gene_pairs(gff_file, genome_file, pairs_file, output_file=None, num_threads=4):
    """主处理函数,支持多线程"""
    if not os.path.exists(f"{genome_file}.fai"):
        run_command(f"samtools faidx {genome_file}")

    mrna_dict = parse_gff_for_mrna(gff_file)

    # 提取所有蛋白质序列并缓存
    cache_file = extract_all_protein_sequences(gff_file, genome_file)

    # 提前解析所有需要的蛋白质序列
    protein_sequences = {}
    for mrna_id in mrna_dict.keys():
        try:
            protein_sequences[mrna_id] = get_protein_sequence_from_cache(cache_file, mrna_id)
        except Exception as e:
            print(f"Failed to extract protein sequence for {mrna_id}: {e}")
            protein_sequences[mrna_id] = None

    pairs = []
    with open(pairs_file, 'r') as f:
        for line in f:
            if not line.strip():
                continue
            ids = line.split()
            if len(ids) >= 2:
                pairs.append((ids[0], ids[1]))

    def process_pair(mrna1, mrna2):
        try:
            sys.stderr.write(f"Processing pair: {mrna1} vs {mrna2}\n")

            if mrna1 not in mrna_dict or mrna2 not in mrna_dict:
                return (mrna1, mrna2, "ERROR", "mRNA not found in GFF")

            prot1 = protein_sequences.get(mrna1)
            prot2 = protein_sequences.get(mrna2)
            if prot1 is None or prot2 is None:
                return (mrna1, mrna2, "ERROR", "Failed to extract protein sequence")

            prot_sim = calculate_similarity_with_mafft(prot1, prot2)

            gene_seq1 = get_gene_sequence(mrna_dict[mrna1], genome_file)
            gene_seq2 = get_gene_sequence(mrna_dict[mrna2], genome_file)
            gene_sim = calculate_similarity_with_mafft(gene_seq1, gene_seq2)

            promo1 = get_promoter_sequence(mrna_dict[mrna1], genome_file)
            promo2 = get_promoter_sequence(mrna_dict[mrna2], genome_file)
            promo_sim = calculate_similarity_with_mafft(promo1, promo2)

            return (mrna1, mrna2, prot_sim, gene_sim, promo_sim)

        except Exception as e:
            return (mrna1, mrna2, "ERROR", str(e))

    results = []
    futures = [executor.submit(process_pair, mrna1, mrna2) for mrna1, mrna2 in pairs]
    
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(process_pair, mrna1, mrna2) for mrna1, mrna2 in pairs]
        for future in tqdm(as_completed(futures), total=len(futures), desc="Processing gene pairs"):
            results.append(future.result())

    output = sys.stdout if output_file is None else open(output_file, 'w')
    output.write("mRNA1\tmRNA2\tProtein_Similarity\tGene_Similarity\tPromoter_Similarity\n")
    for res in results:
        output.write("\t".join(str(x) for x in res) + "\n")
    if output_file:
        output.close()


if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python script.py <gff_file> <genome_file> <pairs_file> [output_file] [num_threads]")
        sys.exit(1)

    gff_file = sys.argv[1]
    genome_file = sys.argv[2]
    pairs_file = sys.argv[3]
    output_file = sys.argv[4] if len(sys.argv) > 4 else None
    num_threads = int(sys.argv[5]) if len(sys.argv) > 5 else 4

    process_gene_pairs(gff_file, genome_file, pairs_file, output_file, num_threads)

 

python /project1/bin/extract.py  /project1/refer/Tgra.chr.no_fragment.gff /project1/refer/Tgra.chr.fa /project1/output/mrna_pairs_simple.txt smiliatr.txt 20

 

import subprocess
import os
import sys
import uuid
from Bio.Seq import Seq
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

def run_command(cmd):
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    stdout, stderr = process.communicate()
    if process.returncode != 0:
        print(f"[ERROR CMD] {cmd}", file=sys.stderr)
        print(stderr, file=sys.stderr)
        raise RuntimeError("Command failed")
    return stdout

def parse_gff_for_mrna(gff_file):
    mrna_dict = {}
    with open(gff_file, 'r') as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue
            if parts[2] not in ["mRNA", "transcript"]:
                continue

            attrs = {}
            for item in parts[8].split(';'):
                if '=' in item:
                    k, v = item.split('=', 1)
                    attrs[k] = v

            mrna_id = attrs.get('ID')
            if not mrna_id:
                continue

            chrom = parts[0]
            start = int(parts[3])
            end = int(parts[4])
            strand = parts[6]

            mrna_dict[mrna_id] = (chrom, start, end, strand)

    return mrna_dict

def extract_sequence(genome_file, chrom, start, end, strand='+'):
    region = f"{chrom}:{start}-{end}"
    cmd = f"samtools faidx {genome_file} {region}"
    output = run_command(cmd)

    seq_lines = output.split('\n')[1:]
    seq = ''.join(line.strip() for line in seq_lines if line.strip())

    if strand == '-':
        seq = str(Seq(seq).reverse_complement())

    return seq


def get_promoter_sequence(mrna_info, genome_file, length=2000):
    chrom, start, end, strand = mrna_info
    if strand == '+':
        promo_start = max(1, start - length)
        promo_end = start - 1
    else:
        promo_start = end + 1
        promo_end = end + length

    return extract_sequence(genome_file, chrom, promo_start, promo_end, strand)


def get_gene_sequence(mrna_info, genome_file):
    chrom, start, end, strand = mrna_info
    return extract_sequence(genome_file, chrom, start, end, strand)


def extract_all_protein_sequences(gff_file, genome_file, output_file):
    if os.path.exists(output_file):
        return output_file
    cmd = f"gffread -y {output_file} -g {genome_file} {gff_file}"
    run_command(cmd)
    return output_file


def load_protein_cache(faa_file):
    seqs = {}
    current_id = None
    seq = []

    with open(faa_file) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if current_id:
                    seqs[current_id] = ''.join(seq)
                current_id = line.split()[0][1:]
                seq = []
            else:
                seq.append(line)

        if current_id:
            seqs[current_id] = ''.join(seq)

    return seqs


def calculate_similarity_with_mafft(seq1, seq2, output_dir="./mafft_tmp"):
    if not seq1 or not seq2:
        return 0.0

    if len(seq1) < 10 or len(seq2) < 10:
        return 0.0

    os.makedirs(output_dir, exist_ok=True)

    uid = uuid.uuid4().hex
    fasta_file = f"{output_dir}/pair_{uid}.fa"
    aligned_file = f"{output_dir}/pair_{uid}_aligned.fa"

    with open(fasta_file, 'w') as f:
        f.write(f">seq1\n{seq1}\n>seq2\n{seq2}\n")

    cmd = f"mafft --thread 1 --auto {fasta_file} > {aligned_file}"
    run_command(cmd)

    # 检查输出
    if not os.path.exists(aligned_file) or os.path.getsize(aligned_file) == 0:
        return 0.0

    # 解析FASTA
    seqs = {}
    current_id = None

    with open(aligned_file) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                current_id = line[1:]
                seqs[current_id] = []
            else:
                if current_id:
                    seqs[current_id].append(line)

    if 'seq1' not in seqs or 'seq2' not in seqs:
        return 0.0

    seq1_aligned = ''.join(seqs['seq1'])
    seq2_aligned = ''.join(seqs['seq2'])

    seq1_clean = seq1_aligned.replace('-', '')
    seq2_clean = seq2_aligned.replace('-', '')

    if not seq1_clean or not seq2_clean:
        return 0.0

    matches = sum(1 for a, b in zip(seq1_clean, seq2_clean) if a == b)

    return matches / min(len(seq1_clean), len(seq2_clean))


def process_gene_pairs(gff_file, genome_file, pairs_file, output_file=None, num_threads=4):
    print(">>> Start processing", flush=True)

    if not os.path.exists(f"{genome_file}.fai"):
        run_command(f"samtools faidx {genome_file}")

    mrna_dict = parse_gff_for_mrna(gff_file)

    protein_file = "proteins.faa"
    extract_all_protein_sequences(gff_file, genome_file, protein_file)

    protein_sequences = load_protein_cache(protein_file)

    pairs = []
    with open(pairs_file) as f:
        for line in f:
            if line.strip():
                ids = line.split()
                if len(ids) >= 2:
                    pairs.append((ids[0], ids[1]))

    def process_pair(mrna1, mrna2):
        try:
            if mrna1 not in mrna_dict or mrna2 not in mrna_dict:
                return (mrna1, mrna2, "ERROR", "mRNA not in GFF")

            prot1 = protein_sequences.get(mrna1)
            prot2 = protein_sequences.get(mrna2)

            if not prot1 or not prot2:
                return (mrna1, mrna2, "ERROR", "Protein missing")

            prot_sim = calculate_similarity_with_mafft(prot1, prot2)

            gene1 = get_gene_sequence(mrna_dict[mrna1], genome_file)
            gene2 = get_gene_sequence(mrna_dict[mrna2], genome_file)
            gene_sim = calculate_similarity_with_mafft(gene1, gene2)

            promo1 = get_promoter_sequence(mrna_dict[mrna1], genome_file)
            promo2 = get_promoter_sequence(mrna_dict[mrna2], genome_file)
            promo_sim = calculate_similarity_with_mafft(promo1, promo2)

            return (mrna1, mrna2, prot_sim, gene_sim, promo_sim)

        except Exception as e:
            return (mrna1, mrna2, "ERROR", str(e))

    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(process_pair, m1, m2) for m1, m2 in pairs]

        for f in tqdm(as_completed(futures), total=len(futures)):
            results.append(f.result())

    output = sys.stdout if output_file is None else open(output_file, 'w')

    output.write("mRNA1\tmRNA2\tProtein_Similarity\tGene_Similarity\tPromoter_Similarity\n")

    for r in results:
        output.write("\t".join(map(str, r)) + "\n")

    if output_file:
        output.close()


if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python script.py <gff> <genome> <pairs> [output] [threads]")
        sys.exit(1)
    gff = sys.argv[1]
    genome = sys.argv[2]
    pairs = sys.argv[3]
    output = sys.argv[4] if len(sys.argv) > 4 else None
    threads = int(sys.argv[5]) if len(sys.argv) > 5 else 2
    process_gene_pairs(gff, genome, pairs, output, threads)

这个东西跑完以后我们再做个小的检验。抽几个序列来看看,有没有比对错误。

>evm.model.PTG009160L.17
MGFKGTKAEKKVVYDRKICDLLEEFSQVLVCLADNVGSTQLQTIRQGLRPDSIVLMGKNTMMKRSIRLYS
EKSGNTAYLNLIPLLVGNVGLIFTKGDLKEVREEVAKYKVGAPARVGLVAPVDVVVPPGNTGLDPSQTSF
FQVLNIPTKINKGTVEIITPVELIKKGDKVGSSEAALLAKLGIRPFSYGLIVRSVYDNGSVFDPAVLDLT
EDDLIGKFAAGISNVAALSLALNYPTLAAAPHFFINGYKNVLSIAIATDYSFPYAEKVKEFLKDPSKFAV
TAAPVAAEAAAAPAAAAPKEEEKKPEPEEESEEDMGFSLFD
>evm.model.PTG000994L.15
MGFKGTKAEKKVVYDKKICELLEDYSQVLVCLADNVGSTQLQTIRQGLRPDSIVLMGKNTMMKRSIRIYT
EKTGNAAYQNLIPLLVGNVGLIFTKGDLKEVREEVAKYKVGAPARVGLVAPVDVVVPPGNTGLDPSQTSF
FQVLNIPTKINKGTVEIITPVELIKKGDKVGSSEAALLAKLGIRPFSYGLIVQSVYDNGSVFDPAVLDLT
EDDLFQRFSAGISSVAALSLALNYPTLAAAPHALINSYKNVLAVAIATDYSFPLAEKVKEYLEDPSKFAV
TAAPVATEAAVPSAAAPKEEEKKPEPEEESEEDMGFSLFD

代码输出说是有85%的相似度,看上去也是非常相似的。
由于下面那个图蛋白质的相似度也太低了,我再找个结果来比对一下。

>evm.model.PTG000014L.98
MQRVEELLKFRVNAKMKMIVKFISNAQMKKLVRKMYSFPRASVTVVLGSIVRRFGFDPWTSVFIFSKKML
VNTIADLDNIWQWCNARKKLCHQRSATVTTIVGFDALSTRVAPLKEQEEIIRRRRMKGRKFGVMGFVNPK
DPGALIATTAYTIAALQ
>evm.model.PTG005291L.10
MNAMAKNGCLVCCSLVAHSVEETVMQMQQAKADGADIIELRIDYLHNFNPDTHLQILLQQRPLPIIITYR
PKWEGGEYGGEEVTRLDTLRLALELGADYVDVELQVYPHQYFCRKLKKKRIKKWNFEKIRADLDRTYPSW
PKNLDPAGFCADPARFGNYVLHPLLGLTNTFIDDVLDENA

代码输出只有0.03821656050955414的相似度,实际上也只有这么点的相似度。

绘图代码

(base) root@8ade79efb6a6:/project1/bin# cat r_plot.r 
# ============================================================================
# 修正版回归图绘制脚本
# ============================================================================

library(ggplot2)
library(dplyr)

# 1. 读取数据
bias_data <- read.table("gene_pair_analysis_with_type_and_delta.txt", 
                        header = TRUE, sep = "\t", stringsAsFactors = FALSE)

similarity_data <- read.table("test.txt", 
                               header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# 2. 统一基因ID格式(如需要)
bias_data$Gene1 <- gsub("^evm\\.TU\\.", "evm.model.", bias_data$Gene1)
bias_data$Gene2 <- gsub("^evm\\.TU\\.", "evm.model.", bias_data$Gene2)

# 3. 创建唯一的基因对ID
bias_data$GenePair <- apply(bias_data[, c("Gene1", "Gene2")], 1, function(x) {
  paste(sort(x), collapse = "_")
})

similarity_data$GenePair <- apply(similarity_data[, c("mRNA1", "mRNA2")], 1, function(x) {
  paste(sort(x), collapse = "_")
})

# 4. 处理Delta_r(取绝对值)
bias_data$Delta_r_abs <- abs(bias_data$Delta_r)

# 5. 合并数据
merged_data <- merge(bias_data, similarity_data, by = "GenePair")

# 6. 绘制回归图(x轴=相似度,y轴=Delta_r)
p <- ggplot(merged_data, aes(y = Delta_r_abs)) +
  # 蛋白质相似度回归线(x轴)
  geom_smooth(aes(x = Protein_Similarity, color = "Protein Similarity"), 
              method = "lm", se = FALSE, linewidth = 1) +
  # 基因相似度回归线(x轴)
  geom_smooth(aes(x = Gene_Similarity, color = "Gene Similarity"), 
              method = "lm", se = FALSE, linewidth = 1) +
  # 启动子相似度回归线(x轴)
  geom_smooth(aes(x = Promoter_Similarity, color = "Promoter Similarity"), 
              method = "lm", se = FALSE, linewidth = 1) +
  # 添加散点
  geom_point(aes(x = Protein_Similarity, color = "Protein Similarity"), 
             alpha = 0.3, size = 0.5) +
  geom_point(aes(x = Gene_Similarity, color = "Gene Similarity"), 
             alpha = 0.3, size = 0.5) +
  geom_point(aes(x = Promoter_Similarity, color = "Promoter Similarity"), 
             alpha = 0.3, size = 0.5) +
  # 坐标轴标签
  labs(x = "Similarity", 
       y = "Absolute Delta_r (|Δr|)",
       title = "Expression Bias vs Sequence Similarity",
       color = "Similarity Type") +
  theme_bw(base_size = 14) +
  theme(legend.position = "top",
        legend.title = element_text(face = "bold"),
        plot.title = element_text(hjust = 0.5, face = "bold"))

# 7. 输出图片
ggsave("similarity_vs_delta_regression.png", plot = p, width = 10, height = 7, dpi = 300)

# 8. 打印回归方程和R²值
cat("=== 回归分析结果(修正版) ===\n")
cat("y = |Δr|, x = Similarity\n\n")
for(sim_type in c("Protein_Similarity", "Gene_Similarity", "Promoter_Similarity")) {
  model <- lm(as.formula("Delta_r_abs ~ ."), data = merged_data[, c("Delta_r_abs", sim_type)])
  r2 <- summary(model)$r.squared
  coef <- coef(model)
  cat(paste0(sim_type, ":\n"))
  cat(paste0("  方程: |Δr| = ", round(coef[2], 4), " × ", sim_type, " + ", round(coef[1], 4), "\n"))
  cat(paste0("  R² = ", round(r2, 4), "\n\n"))
}

绘图结果

按照董老师的要求,把pep改成了cds,然后现在我们要输出下面这几张图:

全局的拟合图+全局的小提琴分布图。

雌性>0

雌性>雄性

回归图小提琴图,小提琴图显示在回归图的右上方。

现在我要修改这个代码以达到董老师的要求

现在,我希望你为我调整这个绘图代码,原来的绘图代码只绘制了三种不同数据与表达相关性的拟合线。现在,我希望你在图中的右上方绘制一组小提琴图,分别包括这三种数据自身的相似度的分布,能否做到呢?

感觉这个图,效果也不能说完全符合假设?promote的拟合效果确实是最好的,当promote的相似度达到1的时候,非常明显的是,基因的表达量达到差异接近于0.

但是,我们也可以看到cds(pep,这里还没改header)本身的相似度分布是更广泛的。promote的平均相似度要更高一些,那就是说,promote相比对cds,要更保守一些。