By jiangchenhao, 30 July, 2025

Forums

原始金山文档链接
背景

这个工具发表在cell上，链接是https://github.com/xingxingshen/HGTfinder，从马尾松那篇论文的结果来看，这个工具应该是比较宽松的，能够计算比较多的hgt

配置

conda activate base
git clone https://github.com/xingxingshen/HGTfinder
apt update && apt install -y liblist-moreutils-perl
根据github上的信息，用mamba配置环境：
mamba create -n hgt_analysis -c conda-forge -c bioconda \
    perl=5.32 \                # Perl环境
    mafft=7.505 \              # 多序列比对
    trimal=1.4.1 \             # 比对修剪
    iqtree=2.2.0 \             # 建树工具
    r-ape=5.7 \                # R的ape包
    r-base=4.3 \               # R语言环境
    bioperl=1.7.8 \            # 生物信息学Perl模块
    list-moreutils-perl        # Perl额外模块
mamba activate hgt_analysis

测试，跟着测试数据集进一步配置

wget https://figshare.com/articles/dataset/Tempo_and_mode_of_genome_evolution_in_the_budding_yeast_subphylum/5854692
#解压后一级的结构目录如下
0_332yeast_genomes.zip
1_molecular_phylogenetics.zip
2_HGTs.zip  343taxa_speicies-name_clade-name_color-code.txt
3_trait_evolution.zip  4_orthomcl_output.zip
additional_figures_and_tables.zip
data_in_Figure2.zip
README.txt
这个目录下的README.txt中，明确指出了
我逐个解压压缩包，没能从0_332yeast_genomes.zip和1_molecular_phylogenetics.zip提取到有效信息，根据github上的提示“Toy data (see compressed file "2_HGTs.zip") can be found on the figshare",所以压缩包里的第三个文件是比较重要的，进一步解压来看，其中

我们直接看第二步的readme：

Detection of horizontal gene transfer (HGT):
1,538,912 analyzed genes from 329 yeast genomes are put in 329yeasts_genes_analyzed folder.
878HGTs_QUERY_seqs file contains 878 protein gene sequences.
List of 878 putative_HGTs_table file lists gene name, potential donor, GO term for each of 78 protein gene sequences.
Alignments and gene trees for 878 putative HGT-driven genes are given in 878HGTs_alignments folder and 878HGTs_genetrees folder, respectively.
329yeasts_outputs folder contain output files for alien index value for each gene; and output files for 1,806 HGT candidates.

Workflow of HGT identification:
Before conducting blastp queries against the nr database or your own database, please convert their sequence ids in the database into following ids:
>WP_003409891.1-77643-Mycobacterium_tub-Actinobact
MTDRTDADDLDLQRVGARLAARAQIRDIRLLRTQAAVHRAPKPAQGLTYDLEFEPAVDADPATISAFVVRISCHLRIQNQAADDDVKEGDTKDETQDVATADFEFAALFDYHLQEGEDDPTEEELTAYAATTGRFALYPYIREYVYDLTGRLALPPLTLEILSRPMPVSPGAQWPATRGTP
WP_003409891.1: NCBI Reference Sequence
77643: Taxonomy id
Mycobacterium_tub: Shorten species name
Actinobact: Taxonomic classification

Step 1:
blast
blastp -query Spathaspora_gorwiae.fas -db nr.fas -evalue 1e-10 -out Spathaspora_gorwiae_vs_nr.b6 -outfmt '6 qseqid sseqid pident length evalue bitscore' -num_threads 2

AI calculation (see example in example4AI_estimate folder)
perl calculate_alien_index_MD_Xing.pl -c Spathaspora_gorwiae_ai.config

Step 2:
building tree:
/tools/mafft-7.299-without-extensions/bin/mafft --thread 6 --auto $filename > ${gene}_aln.fasta
/tools/trimal-trimAl/source/trimal -in ${gene}_aln.fasta -out ${gene}_aln_trimmed.fasta -automated1
/tools/IQ-TREE-1.5.1/build/iqtree-omp -nt 6 -st AA -s ${gene}_aln_trimmed.fasta -m TEST -mrate G4 -keep-ident -bb 1000 -pre ${gene}

Midpoint root a phylogeny with R:
mytree <- read.tree("$filename")
midponit_mytree <- ladderize (midpoint(mytree))
write.tree (midponit_mytree, "${gene}midpoint.tree")

太长不看，直接让AI给我解析：

这个README文件描述了一个检测水平基因转移（HGT）的工作流程，主要分为数据准备、BLAST比对、AI（Alien Index）计算和系统发育树构建等步骤。以下是对该文件的详细解析：

1. 数据文件说明

文件/文件夹	内容描述
329yeasts_genes_analyzed	包含来自329个酵母基因组的1,538,912个待分析基因
878HGTs_QUERY_seqs	878个疑似HGT的蛋白质基因序列
putative_HGTs_table	包含878个基因的名称、潜在供体物种和GO功能注释
878HGTs_alignments	878个基因的多序列比对结果
878HGTs_genetrees	878个基因的系统发育树
329yeasts_outputs	输出文件，包括每个基因的Alien Index值和1,806个HGT候选基因

2. 关键步骤解析

(1) BLASTP比对

目的：鉴定目标基因与数据库的相似性
命令示例

blastp -query Spathaspora_gorwiae.fas -db nr.fas \
       -evalue 1e-10 -out Spathaspora_gorwiae_vs_nr.b6 \
       -outfmt '6 qseqid sseqid pident length evalue bitscore' \
       -num_threads 2

这个blastp比对是需要预先进行的，也可以使用diamond进行比对，线程数属于是想加就加。

这一步它在它的测试文件里已经跑了，我们需要做的就是准备对应的nr库，nr库我们也是有的，所以只需要跑就行了。

(2) Alien Index (AI) 计算

perl calculate_alien_index_MD_Xing.pl -c Spathaspora_gorwiae_ai.config

在/project1/data/2_HGTs/example4AI_estimate运行这个脚本，而对于这个perl脚本，用diff做了检验，这个pl脚本文件和另外一个pl直接从github获得的文件，是几乎相同的

我们解析一下这个配置文件里的信息：

#--------------------#
# Alien Index Config #
#--------------------#
PROGRAM                         = blast
BLASTFILE               = Spathaspora_gorwiae_vs_nr.b6
QUERY_BLASTFILE         = Spathaspora_gorwiae_vs_Spathaspora_gorwiae.b6
#这个是自己比对自己
QUERY2Sacch_BLASTFILE           = Spathaspora_gorwiae_vs_329_Saccharomycotina.b6
#329_Saccharomycotina在这里是什么意思？我也不知道
TAXDB                           = blastdbs
RECIPIENT                       = 147537
#这个是接收者
ANCESTRAL_LINEAGE       = 4751
#这个是受体物种（RECIPIENT）在进化上的保守祖先类群，可以是所有裸子植物的taxid，也可以是所有植物的id
## hits_no is max number of hits that are not from the RECIPIENT group ##
HITS_NO         = 300
E_VALUE         = 1e-10
PREFIX          = Spathaspora_gorwiae

输出文件解释Spathaspora_gorwiae_alien.txt

测试后我发现，运行完这个命令后，它会输出两个文件，一个是Spathaspora_gorwiae_alien.txt，另外一个是Spathaspora_gorwiae_usefullhits.txt

先解析Spathaspora_gorwiae_alien.txt，我打印前10行，并把列名用""标注处来。

"query sequence id"       "query sequence length"   "maximum bitscore possible for query sequence"    "sequence id of top hit"  "taxonomy id of top hit"  "lineage of top hit"      "evalue of top hit"       "bitscore of top hit"     "sequence id of top hit within GROUP lineage"  "taxonomy id of top hit within GROUP lineage"     "evalue of top hit within GROUP lineage"  "bitscore of top hit within GROUP lineage"        "sequence id of top hit outside of GROUP" "lineage taxonomy id of top hit outside of GROUP lineage"      "evalue of top hit outside of GROUP lineage"      "bitscore of top hit outside of GROUP lineage"    "alien index"     "unique lineage(s) of top hits"   "no. of unique lineage(s) of top hits"    "recipient_no"    "group_no"    "other_no" "pct of other_no"
Spathaspora_gorwiae@Seq_3705    226       455   XP_007375095.1-619300-Spathaspora_passa-Saccharom       619300  Saccharomycotina        2e-74     232   XP_007833845.1-1229662-Pestalotiopsis_fi-Sordario       1229662 3e-38     143   XP_001624605.1-45351-Nematostella_vect-Metazoa       45351   1e-30     122   -0.0461538461538462     Saccharomycotina Sordariomycetes Taphrinomycotina other_Pezizomycotina Dothideomycetes other_Fungi Leotiomycetes Eurotiomycetes 8       216 84       0       0
Spathaspora_gorwiae@Seq_4475    133       276   XP_007375119.1-619300-Spathaspora_passa-Saccharom       619300  Saccharomycotina        3e-40     138   CEQ41300.1-5005-Sporidiobolus_sal-Basidiomyc    5005    9e-14   69.3    WP_061990031.1-1257021-Flammeovirgaceae_-BacteroC    1257021 2e-15   73.2    0.0141304347826087      Saccharomycotina BacteroidetesChlorobi Basidiomycota Euglenozoa Actinobacteria Proteobacteria other_Alveolata other_Bacteria Dothideomycetes Crenarchaeota Cyanobacteria Firmicutes Apicomplexa      13      213     6       81      27
Spathaspora_gorwiae@Seq_2740    1932     4022   XP_007377139.1-619300-Spathaspora_passa-Saccharom       619300  Saccharomycotina        0.0      3424   XP_011118845.1-756982-Arthrobotrys_olig-other_Pez       756982  0.0      2112   XP_011424585.1-29159-Crassostrea_gigas-Metazoa       29159   0.0      1677   -0.108155146693187      Saccharomycotina        1       300     0       0       0
Spathaspora_gorwiae@Seq_1182    195       402   XP_002420785.1-573826-Candida_dublinien-Saccharom       573826  Saccharomycotina        3e-127    365   XP_003296695.1-861557-Pyrenophora_teres-Dothideom       861557  5e-86     263   XP_015929200.1-114398-Parasteatoda_tepi-Metazoa      114398  1e-58     191   -0.17910447761194       Saccharomycotina Dothideomycetes        2       298     2       0       0
Spathaspora_gorwiae@Seq_3964    175       349   XP_007373348.1-619300-Spathaspora_passa-Saccharom       619300  Saccharomycotina        5e-78     239   OKO96907.1-1316194-Penicillium_subru-Eurotiomyc 1316194 1e-06   51.6    na      na  10       -0.14785100286533       Saccharomycotina        1       167     0       0       0
Spathaspora_gorwiae@Seq_5902    404       840   XP_007375106.1-619300-Spathaspora_passa-Saccharom       619300  Saccharomycotina        0.0       788   OCL12505.1-574774-Glonium_stellatum-Dothideomy  574774  0.0       572   XP_004348297.1-595528-Capsaspora_owczar-other_Opi    595528  6e-159    464   -0.128571428571429      Saccharomycotina Dothideomycetes Sordariomycetes other_Pezizomycotina Eurotiomycetes Leotiomycetes other_Fungi  7       187     113     0       0
Spathaspora_gorwiae@Seq_14      704      1457   XP_001382258.2-322104-Scheffersomyces_s-Saccharom       322104  Saccharomycotina        0.0      1119   GAP86496.1-77044-Rosellinia_necatr-Sordariomy   77044   0.0       606   GAV02364.1-947166-Ramazzottius_vari-Metazoa  947166  4e-159    487   -0.0816746739876458     Saccharomycotina        1       300     0       0       0
Spathaspora_gorwiae@Seq_6170    218       444   XP_001385142.1-322104-Scheffersomyces_s-Saccharom       322104  Saccharomycotina        2e-143    408   XP_018160710.1-759273-Colletotrichum_hi-Sordariom       759273  1e-106    316   SCN22162.1-5821-Plasmodium_berghe-Apicomplex 5821    8e-110    334   0.0405405405405406      Saccharomycotina Apicomplexa Sordariomycetes    3       296     3       1       0.333333333333333
Spathaspora_gorwiae@Seq_6371    376       788   KGQ89858.1-1094983-Candida_albicans_-Saccharomy 1094983 Saccharomycotina        0.0       570   XP_016220506.1-212818-Exophiala_mesophi-Eurotiomy       212818  3e-137    407   KXJ29407.1-1720309-Exaiptasia_pallid-Metazoa 1720309 5e-86     273   -0.17005076142132       Saccharomycotina Eurotiomycetes Sordariomycetes other_Pezizomycotina Leotiomycetes Dothideomycetes      6       211     89      0       0

这有个10+列，老老实实上excel

不行，就算转到excel也太多了，打印第一列之后，让AI解释每一列的含义，它能解释的很清楚

以下是 Spathaspora_gorwiae_alien.txt 文件列名的详细解释（按顺序对应）：

1. 基本信息列

列名	说明
`query sequence id`	查询基因ID（格式：`物种名@基因编号`，如 `Spathaspora_gorwiae@Seq_3705`）
`query sequence length`	查询基因的序列长度（氨基酸/碱基数）
`maximum bitscore possible for query sequence`	查询基因自比对的最高bitscore（用于标准化比较）

2. 全局最佳匹配（Top Hit）

列名	说明
`sequence id of top hit`	最佳匹配序列的ID（含物种和分类信息）
`taxonomy id of top hit`	最佳匹配序列的分类ID（NCBI Taxonomy）
`lineage of top hit`	最佳匹配序列的分类层级（如 `Saccharomycotina`）
`evalue of top hit`	最佳匹配的E值（显著性越低，值越小）
`bitscore of top hit`	最佳匹配的bitscore（比对得分，越高越相似）

3. 组内最佳匹配（GROUP Lineage）

列名	说明
`sequence id of top hit within GROUP lineage`	GROUP类群（如真菌）中的最佳匹配序列ID
`taxonomy id of top hit within GROUP lineage`	组内匹配的分类ID
`evalue of top hit within GROUP lineage`	组内匹配的E值
`bitscore of top hit within GROUP lineage`	组内匹配的bitscore

4. 组外最佳匹配（OUTGROUP Lineage）

列名	说明
`sequence id of top hit outside of GROUP lineage`	非GROUP类群（如细菌、动物）中的最佳匹配序列ID
`taxonomy id of top hit outside of GROUP lineage`	组外匹配的分类ID
`evalue of top hit outside of GROUP lineage`	组外匹配的E值
`bitscore of top hit outside of GROUP lineage`	组外匹配的bitscore

5. 外源指数（Alien Index）与分类统计

列名	说明
`alien index`	外源指数，计算公式：`(组外bitscore - 组内bitscore) / 最大bitscore`<br> • AI > 0：可能为水平转移基因<br> • AI ≤ 0：保守基因
`unique lineage(s) of top hits`	所有匹配序列的独特分类层级（多个层级用空格分隔）
`no. of unique lineage(s) of top hits`	独特分类层级的数量
`recipient_no`	匹配到 RECIPIENT类群（如酵母亚门）的序列数
`group_no`	匹配到 GROUP类群（如其他真菌）的序列数
`other_no`	匹配到非GROUP类群（如细菌、动物）的序列数
`pct of other_no`	`other_no` 占总匹配数的百分比（`other_no / (recipient_no + group_no + other_no)`

输出文件解释Spathaspora_gorwiae_usefullhits.txt

那么从描述来看，这个文件不需要使用。

信息的过滤总结

所以它最后的信息中有两列值（pct of other_no和Alien Index）是有用的，只需要根据这两列值进行筛选就可以了。

文献阅读

这个工具的原文是发在cell上的，我直接查看这篇文章里的方法

Horizontal gene transfer analyses
Identification of HGT events
To detect genes in budding yeast genomes that may have been horizontally acquired from non-fungal organisms, we employed a
robust and conservative phylogeny-based approach (Husnik and McCutcheon, 2018; Marcet-Houben and Gabaldo´ n, 2010; Richards et al., 2011; Wisecaver et al., 2016). Briefly, for a given budding yeast gene, we inferred it to have been acquired by HGT if
there was substantial topological disagreement between the gene tree and its associated species tree and the budding yeast
gene sequence was robustly nested within the donor lineage in the gene tree.
To avoid spurious results due to the presence of small genomic fragments of contaminant organisms in our genome assemblies
(Scho¨ nknecht et al., 2014), we limited our analyses to those genes that resided in genomic contigs or scaffolds that were R 100 kb.
This filter resulted in the analysis of 1,538,912 predicted genes (out of a total of 1,892,694 genes) in 329 yeast genomes. The remaining three genomes (Botryozyma nematodophila, Blastobotrys nivea, and Citeromyces siamensis) did not contain genomic contigs
that were R 100 kb).
For each gene, we evaluated whether it had been horizontally acquired using a two-step workflow.
In step 1, we first carried out a BLASTP search against a custom database (nr+) consisting of the NCBI non-redundant (nr) protein
database (last accessed January 20, 2017) and all predicted protein sequences from 329 yeasts genomes, with an e-value cutoff
of 1010. We next used custom Perl scripts to: (a) assign taxonomic information to each BLAST hit rating with the dump files downloaded from the NCBI Taxonomy database (nodes.dmp, merged.dmp, and names.dmp; ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy),
and then (b) parse the BLAST hits, based on their taxonomic information, into three different lineages (RECIPIENT: Saccharomycotina; GROUP: Fungi; OUTGROUP: non-fungal) so as to obtain three values: bbhO (BLAST bitscore of the best hit in OUTGROUP lineage), bbhG (bitscore of the best hit in GROUP lineage but not in RECIPIENT lineage), and maxB (bitscore of the query to itself).
Using this information, we next calculated: (a) the Alien Index (AI) value (Wisecaver et al., 2016), which is a normalized
difference of bitscore between the best hit in OUTGROUP lineage and the best hit in GROUP lineage but not in RECIPIENT lineage:
AI = bbhO=maxB  bbhG=maxB, and (b) the percentage of species from OUTGROUP lineage (outg_pct) in the list of the top 300 hits
that have different taxonomic species names (so that we avoid over-representation of multiple strains of the same species) (MarcetHouben and Gabaldo´ n, 2010). From the 1,538,912 genes analyzed, we found 1,806 genes with AI value R 0.1 and outg_pct R 90%.
In step 2, we retrieved the 300 most similar homologs that have different taxonomic species names from the nr+ database
(see above), aligned them by the MAFFT, version 7.299 (Katoh and Standley, 2013), with ‘‘–auto’’ option, and trimmed ambiguously
aligned regions using trimAl v1.4 (Capella-Gutie´rrez et al., 2009) with ‘‘-automated1’’ option. We then used the resulting alignment to
infer the ML tree using IQ-TREE 1.5.1 (Nguyen et al., 2015) with its best-fitting model of amino acid evolution and 1000 ultrafast bootstrapping replicates (Minh et al., 2013). Lastly, we rooted each ML tree at the midpoint using the ape and phangorn R packages and
visualized it using the command version of iTOL v3 (Letunic and Bork, 2016). After manually inspecting all 1,806 ML trees, we identified 878 genes in 186 / 329 budding yeast genomes whose phylogenies indicated they were putatively acquired via HGT. A summary
table file that contains species name, gene name, genomic contig ID, genomic contig length, HGT status, and Gene Ontology (GO)
term for each of 1,538,912 genes in 329 yeast genomes that we examined has been deposited on the Figshare data repository

然后我让AI来解释：

分析流程分为两步：
第一步：

使用BLASTP比对自建数据库（nr+，包含NCBI非冗余蛋白库及329个酵母基因组蛋白序列），E值阈值设为10^-10。
通过自定义Perl脚本：
基于NCBI分类数据库（nodes.dmp、merged.dmp、names.dmp）为比对标注释分类信息；
将比对结果按分类划分为三类：
RECIPIENT（受体类群）：酵母亚门（Saccharomycotina）
GROUP（近缘类群）：真菌界（Fungi）
OUTGROUP（外源类群）：非真菌生物
计算三个关键值：
bbhO：OUTGROUP类群最佳比对的bitscore
bbhG：GROUP类群（非RECIPIENT）最佳比对的bitscore
maxB：查询序列自比对的bitscore
计算：
外源指数（Alien Index, AI）= (bbhO/maxB) - (bbhG/maxB)
OUTGROUP类群在Top300比对中的占比（outg_pct，排除同物种多菌株的重复计数）
在1,538,912个基因中，筛选出1,806个满足AI≥0.1且outg_pct≥90%的候选基因。
第二步：

对候选基因，从nr+库中获取300个不同物种的同源序列；
使用MAFFT v7.299（--auto参数）进行多序列比对，trimAl v1.4（-automated1参数）修剪不确定区域；
用IQ-TREE 1.5.1构建最大似然树（选择最佳氨基酸替代模型+1000次超快自举检验）；
使用ape和phangorn R包进行中点定根，iTOL v3可视化；
人工检查1,806个系统发育树后，最终确定186个酵母基因组中的878个基因为HGT候选基因。
所有分析的1,538,912个基因信息（物种名、基因名、contig ID、contig长度、HGT状态、GO注释等）已上传至Figshare数据仓库。
</FILE_CONTENT>
</ATTACHMENT_FILE>

镜像封装

目前就保留了92服务器上的hgt_jiangchenhao:1.1

这个镜像有保存测试数据集，所以很大，有接近20.7G

Ncbi的分类单元中没有一个独立的裸子植物分类单元，因此在定义内群的时候，可以使用松柏纲

原始金山文档链接背景

配置