原始金山文档链接
背景
这个工具发表在cell上,链接是https://github.com/xingxingshen/HGTfinder,从马尾松那篇论文的结果来看,这个工具应该是比较宽松的,能够计算比较多的hgt
配置
conda activate base
git clone https://github.com/xingxingshen/HGTfinder
apt update && apt install -y liblist-moreutils-perl
根据github上的信息,用mamba配置环境:
mamba create -n hgt_analysis -c conda-forge -c bioconda \
perl=5.32 \ # Perl环境
mafft=7.505 \ # 多序列比对
trimal=1.4.1 \ # 比对修剪
iqtree=2.2.0 \ # 建树工具
r-ape=5.7 \ # R的ape包
r-base=4.3 \ # R语言环境
bioperl=1.7.8 \ # 生物信息学Perl模块
list-moreutils-perl # Perl额外模块
mamba activate hgt_analysis
测试,跟着测试数据集进一步配置
wget https://figshare.com/articles/dataset/Tempo_and_mode_of_genome_evolution_in_the_budding_yeast_subphylum/5854692
#解压后一级的结构目录如下
0_332yeast_genomes.zip
1_molecular_phylogenetics.zip
2_HGTs.zip 343taxa_speicies-name_clade-name_color-code.txt
3_trait_evolution.zip 4_orthomcl_output.zip
additional_figures_and_tables.zip
data_in_Figure2.zip
README.txt
这个目录下的README.txt中,明确指出了
我逐个解压压缩包,没能从0_332yeast_genomes.zip和1_molecular_phylogenetics.zip提取到有效信息,根据github上的提示“Toy data (see compressed file "2_HGTs.zip") can be found on the figshare",所以压缩包里的第三个文件是比较重要的,进一步解压来看,其中
我们直接看第二步的readme:
Detection of horizontal gene transfer (HGT):
1,538,912 analyzed genes from 329 yeast genomes are put in 329yeasts_genes_analyzed folder.
878HGTs_QUERY_seqs file contains 878 protein gene sequences.
List of 878 putative_HGTs_table file lists gene name, potential donor, GO term for each of 78 protein gene sequences.
Alignments and gene trees for 878 putative HGT-driven genes are given in 878HGTs_alignments folder and 878HGTs_genetrees folder, respectively.
329yeasts_outputs folder contain output files for alien index value for each gene; and output files for 1,806 HGT candidates.
Workflow of HGT identification:
Before conducting blastp queries against the nr database or your own database, please convert their sequence ids in the database into following ids:
>WP_003409891.1-77643-Mycobacterium_tub-Actinobact
MTDRTDADDLDLQRVGARLAARAQIRDIRLLRTQAAVHRAPKPAQGLTYDLEFEPAVDADPATISAFVVRISCHLRIQNQAADDDVKEGDTKDETQDVATADFEFAALFDYHLQEGEDDPTEEELTAYAATTGRFALYPYIREYVYDLTGRLALPPLTLEILSRPMPVSPGAQWPATRGTP
WP_003409891.1: NCBI Reference Sequence
77643: Taxonomy id
Mycobacterium_tub: Shorten species name
Actinobact: Taxonomic classification
Step 1:
blast
blastp -query Spathaspora_gorwiae.fas -db nr.fas -evalue 1e-10 -out Spathaspora_gorwiae_vs_nr.b6 -outfmt '6 qseqid sseqid pident length evalue bitscore' -num_threads 2
AI calculation (see example in example4AI_estimate folder)
perl calculate_alien_index_MD_Xing.pl -c Spathaspora_gorwiae_ai.config
Step 2:
building tree:
/tools/mafft-7.299-without-extensions/bin/mafft --thread 6 --auto $filename > ${gene}_aln.fasta
/tools/trimal-trimAl/source/trimal -in ${gene}_aln.fasta -out ${gene}_aln_trimmed.fasta -automated1
/tools/IQ-TREE-1.5.1/build/iqtree-omp -nt 6 -st AA -s ${gene}_aln_trimmed.fasta -m TEST -mrate G4 -keep-ident -bb 1000 -pre ${gene}
Midpoint root a phylogeny with R:
mytree <- read.tree("$filename")
midponit_mytree <- ladderize (midpoint(mytree))
write.tree (midponit_mytree, "${gene}midpoint.tree")
太长不看,直接让AI给我解析:
这个README文件描述了一个检测水平基因转移(HGT)的工作流程,主要分为数据准备、BLAST比对、AI(Alien Index)计算和系统发育树构建等步骤。以下是对该文件的详细解析:
1. 数据文件说明
文件/文件夹 | 内容描述 |
329yeasts_genes_analyzed | 包含来自329个酵母基因组的1,538,912个待分析基因 |
878HGTs_QUERY_seqs | 878个疑似HGT的蛋白质基因序列 |
putative_HGTs_table | 包含878个基因的名称、潜在供体物种和GO功能注释 |
878HGTs_alignments | 878个基因的多序列比对结果 |
878HGTs_genetrees | 878个基因的系统发育树 |
329yeasts_outputs | 输出文件,包括每个基因的Alien Index值和1,806个HGT候选基因 |
2. 关键步骤解析
(1) BLASTP比对
- 目的:鉴定目标基因与数据库的相似性
- 命令示例
blastp -query Spathaspora_gorwiae.fas -db nr.fas \
-evalue 1e-10 -out Spathaspora_gorwiae_vs_nr.b6 \
-outfmt '6 qseqid sseqid pident length evalue bitscore' \
-num_threads 2
这个blastp比对是需要预先进行的,也可以使用diamond进行比对,线程数属于是想加就加。
这一步它在它的测试文件里已经跑了,我们需要做的就是准备对应的nr库,nr库我们也是有的,所以只需要跑就行了。
(2) Alien Index (AI) 计算
perl calculate_alien_index_MD_Xing.pl -c Spathaspora_gorwiae_ai.config
在/project1/data/2_HGTs/example4AI_estimate运行这个脚本,而对于这个perl脚本,用diff做了检验,这个pl脚本文件和另外一个pl直接从github获得的文件,是几乎相同的
我们解析一下这个配置文件里的信息:
#--------------------#
# Alien Index Config #
#--------------------#
PROGRAM = blast
BLASTFILE = Spathaspora_gorwiae_vs_nr.b6
QUERY_BLASTFILE = Spathaspora_gorwiae_vs_Spathaspora_gorwiae.b6
#这个是自己比对自己
QUERY2Sacch_BLASTFILE = Spathaspora_gorwiae_vs_329_Saccharomycotina.b6
#329_Saccharomycotina在这里是什么意思?我也不知道
TAXDB = blastdbs
RECIPIENT = 147537
#这个是接收者
ANCESTRAL_LINEAGE = 4751
#这个是受体物种(RECIPIENT)在进化上的保守祖先类群,可以是所有裸子植物的taxid,也可以是所有植物的id
## hits_no is max number of hits that are not from the RECIPIENT group ##
HITS_NO = 300
E_VALUE = 1e-10
PREFIX = Spathaspora_gorwiae
输出文件解释Spathaspora_gorwiae_alien.txt
测试后我发现,运行完这个命令后,它会输出两个文件,一个是Spathaspora_gorwiae_alien.txt,另外一个是Spathaspora_gorwiae_usefullhits.txt
先解析Spathaspora_gorwiae_alien.txt,我打印前10行,并把列名用""标注处来。
"query sequence id" "query sequence length" "maximum bitscore possible for query sequence" "sequence id of top hit" "taxonomy id of top hit" "lineage of top hit" "evalue of top hit" "bitscore of top hit" "sequence id of top hit within GROUP lineage" "taxonomy id of top hit within GROUP lineage" "evalue of top hit within GROUP lineage" "bitscore of top hit within GROUP lineage" "sequence id of top hit outside of GROUP" "lineage taxonomy id of top hit outside of GROUP lineage" "evalue of top hit outside of GROUP lineage" "bitscore of top hit outside of GROUP lineage" "alien index" "unique lineage(s) of top hits" "no. of unique lineage(s) of top hits" "recipient_no" "group_no" "other_no" "pct of other_no"
Spathaspora_gorwiae@Seq_3705 226 455 XP_007375095.1-619300-Spathaspora_passa-Saccharom 619300 Saccharomycotina 2e-74 232 XP_007833845.1-1229662-Pestalotiopsis_fi-Sordario 1229662 3e-38 143 XP_001624605.1-45351-Nematostella_vect-Metazoa 45351 1e-30 122 -0.0461538461538462 Saccharomycotina Sordariomycetes Taphrinomycotina other_Pezizomycotina Dothideomycetes other_Fungi Leotiomycetes Eurotiomycetes 8 216 84 0 0
Spathaspora_gorwiae@Seq_4475 133 276 XP_007375119.1-619300-Spathaspora_passa-Saccharom 619300 Saccharomycotina 3e-40 138 CEQ41300.1-5005-Sporidiobolus_sal-Basidiomyc 5005 9e-14 69.3 WP_061990031.1-1257021-Flammeovirgaceae_-BacteroC 1257021 2e-15 73.2 0.0141304347826087 Saccharomycotina BacteroidetesChlorobi Basidiomycota Euglenozoa Actinobacteria Proteobacteria other_Alveolata other_Bacteria Dothideomycetes Crenarchaeota Cyanobacteria Firmicutes Apicomplexa 13 213 6 81 27
Spathaspora_gorwiae@Seq_2740 1932 4022 XP_007377139.1-619300-Spathaspora_passa-Saccharom 619300 Saccharomycotina 0.0 3424 XP_011118845.1-756982-Arthrobotrys_olig-other_Pez 756982 0.0 2112 XP_011424585.1-29159-Crassostrea_gigas-Metazoa 29159 0.0 1677 -0.108155146693187 Saccharomycotina 1 300 0 0 0
Spathaspora_gorwiae@Seq_1182 195 402 XP_002420785.1-573826-Candida_dublinien-Saccharom 573826 Saccharomycotina 3e-127 365 XP_003296695.1-861557-Pyrenophora_teres-Dothideom 861557 5e-86 263 XP_015929200.1-114398-Parasteatoda_tepi-Metazoa 114398 1e-58 191 -0.17910447761194 Saccharomycotina Dothideomycetes 2 298 2 0 0
Spathaspora_gorwiae@Seq_3964 175 349 XP_007373348.1-619300-Spathaspora_passa-Saccharom 619300 Saccharomycotina 5e-78 239 OKO96907.1-1316194-Penicillium_subru-Eurotiomyc 1316194 1e-06 51.6 na na 10 -0.14785100286533 Saccharomycotina 1 167 0 0 0
Spathaspora_gorwiae@Seq_5902 404 840 XP_007375106.1-619300-Spathaspora_passa-Saccharom 619300 Saccharomycotina 0.0 788 OCL12505.1-574774-Glonium_stellatum-Dothideomy 574774 0.0 572 XP_004348297.1-595528-Capsaspora_owczar-other_Opi 595528 6e-159 464 -0.128571428571429 Saccharomycotina Dothideomycetes Sordariomycetes other_Pezizomycotina Eurotiomycetes Leotiomycetes other_Fungi 7 187 113 0 0
Spathaspora_gorwiae@Seq_14 704 1457 XP_001382258.2-322104-Scheffersomyces_s-Saccharom 322104 Saccharomycotina 0.0 1119 GAP86496.1-77044-Rosellinia_necatr-Sordariomy 77044 0.0 606 GAV02364.1-947166-Ramazzottius_vari-Metazoa 947166 4e-159 487 -0.0816746739876458 Saccharomycotina 1 300 0 0 0
Spathaspora_gorwiae@Seq_6170 218 444 XP_001385142.1-322104-Scheffersomyces_s-Saccharom 322104 Saccharomycotina 2e-143 408 XP_018160710.1-759273-Colletotrichum_hi-Sordariom 759273 1e-106 316 SCN22162.1-5821-Plasmodium_berghe-Apicomplex 5821 8e-110 334 0.0405405405405406 Saccharomycotina Apicomplexa Sordariomycetes 3 296 3 1 0.333333333333333
Spathaspora_gorwiae@Seq_6371 376 788 KGQ89858.1-1094983-Candida_albicans_-Saccharomy 1094983 Saccharomycotina 0.0 570 XP_016220506.1-212818-Exophiala_mesophi-Eurotiomy 212818 3e-137 407 KXJ29407.1-1720309-Exaiptasia_pallid-Metazoa 1720309 5e-86 273 -0.17005076142132 Saccharomycotina Eurotiomycetes Sordariomycetes other_Pezizomycotina Leotiomycetes Dothideomycetes 6 211 89 0 0
这有个10+列,老老实实上excel
不行,就算转到excel也太多了,打印第一列之后,让AI解释每一列的含义,它能解释的很清楚
以下是 Spathaspora_gorwiae_alien.txt 文件列名的详细解释(按顺序对应):
1. 基本信息列
列名 | 说明 |
| 查询基因ID(格式: |
| 查询基因的序列长度(氨基酸/碱基数) |
| 查询基因自比对的最高bitscore(用于标准化比较) |
2. 全局最佳匹配(Top Hit)
列名 | 说明 |
| 最佳匹配序列的ID(含物种和分类信息) |
| 最佳匹配序列的分类ID(NCBI Taxonomy) |
| 最佳匹配序列的分类层级(如 |
| 最佳匹配的E值(显著性越低,值越小) |
| 最佳匹配的bitscore(比对得分,越高越相似) |
3. 组内最佳匹配(GROUP Lineage)
列名 | 说明 |
| GROUP类群(如真菌)中的最佳匹配序列ID |
| 组内匹配的分类ID |
| 组内匹配的E值 |
| 组内匹配的bitscore |
4. 组外最佳匹配(OUTGROUP Lineage)
列名 | 说明 |
| 非GROUP类群(如细菌、动物)中的最佳匹配序列ID |
| 组外匹配的分类ID |
| 组外匹配的E值 |
| 组外匹配的bitscore |
5. 外源指数(Alien Index)与分类统计
列名 | 说明 |
| 外源指数,计算公式: |
| 所有匹配序列的独特分类层级(多个层级用空格分隔) |
| 独特分类层级的数量 |
| 匹配到 RECIPIENT类群(如酵母亚门)的序列数 |
| 匹配到 GROUP类群(如其他真菌)的序列数 |
| 匹配到 非GROUP类群(如细菌、动物)的序列数 |
|
|
输出文件解释Spathaspora_gorwiae_usefullhits.txt
那么从描述来看,这个文件不需要使用。
信息的过滤总结
所以它最后的信息中有两列值(pct of other_no
和Alien Index
)是有用的,只需要根据这两列值进行筛选就可以了。
文献阅读
这个工具的原文是发在cell上的,我直接查看这篇文章里的方法
Horizontal gene transfer analyses
Identification of HGT events
To detect genes in budding yeast genomes that may have been horizontally acquired from non-fungal organisms, we employed a
robust and conservative phylogeny-based approach (Husnik and McCutcheon, 2018; Marcet-Houben and Gabaldo´ n, 2010; Richards et al., 2011; Wisecaver et al., 2016). Briefly, for a given budding yeast gene, we inferred it to have been acquired by HGT if
there was substantial topological disagreement between the gene tree and its associated species tree and the budding yeast
gene sequence was robustly nested within the donor lineage in the gene tree.
To avoid spurious results due to the presence of small genomic fragments of contaminant organisms in our genome assemblies
(Scho¨ nknecht et al., 2014), we limited our analyses to those genes that resided in genomic contigs or scaffolds that were R 100 kb.
This filter resulted in the analysis of 1,538,912 predicted genes (out of a total of 1,892,694 genes) in 329 yeast genomes. The remaining three genomes (Botryozyma nematodophila, Blastobotrys nivea, and Citeromyces siamensis) did not contain genomic contigs
that were R 100 kb).
For each gene, we evaluated whether it had been horizontally acquired using a two-step workflow.
In step 1, we first carried out a BLASTP search against a custom database (nr+) consisting of the NCBI non-redundant (nr) protein
database (last accessed January 20, 2017) and all predicted protein sequences from 329 yeasts genomes, with an e-value cutoff
of 1010. We next used custom Perl scripts to: (a) assign taxonomic information to each BLAST hit rating with the dump files downloaded from the NCBI Taxonomy database (nodes.dmp, merged.dmp, and names.dmp; ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy),
and then (b) parse the BLAST hits, based on their taxonomic information, into three different lineages (RECIPIENT: Saccharomycotina; GROUP: Fungi; OUTGROUP: non-fungal) so as to obtain three values: bbhO (BLAST bitscore of the best hit in OUTGROUP lineage), bbhG (bitscore of the best hit in GROUP lineage but not in RECIPIENT lineage), and maxB (bitscore of the query to itself).
Using this information, we next calculated: (a) the Alien Index (AI) value (Wisecaver et al., 2016), which is a normalized
difference of bitscore between the best hit in OUTGROUP lineage and the best hit in GROUP lineage but not in RECIPIENT lineage:
AI = bbhO=maxB bbhG=maxB, and (b) the percentage of species from OUTGROUP lineage (outg_pct) in the list of the top 300 hits
that have different taxonomic species names (so that we avoid over-representation of multiple strains of the same species) (MarcetHouben and Gabaldo´ n, 2010). From the 1,538,912 genes analyzed, we found 1,806 genes with AI value R 0.1 and outg_pct R 90%.
In step 2, we retrieved the 300 most similar homologs that have different taxonomic species names from the nr+ database
(see above), aligned them by the MAFFT, version 7.299 (Katoh and Standley, 2013), with ‘‘–auto’’ option, and trimmed ambiguously
aligned regions using trimAl v1.4 (Capella-Gutie´rrez et al., 2009) with ‘‘-automated1’’ option. We then used the resulting alignment to
infer the ML tree using IQ-TREE 1.5.1 (Nguyen et al., 2015) with its best-fitting model of amino acid evolution and 1000 ultrafast bootstrapping replicates (Minh et al., 2013). Lastly, we rooted each ML tree at the midpoint using the ape and phangorn R packages and
visualized it using the command version of iTOL v3 (Letunic and Bork, 2016). After manually inspecting all 1,806 ML trees, we identified 878 genes in 186 / 329 budding yeast genomes whose phylogenies indicated they were putatively acquired via HGT. A summary
table file that contains species name, gene name, genomic contig ID, genomic contig length, HGT status, and Gene Ontology (GO)
term for each of 1,538,912 genes in 329 yeast genomes that we examined has been deposited on the Figshare data repository
然后我让AI来解释:
分析流程分为两步:
第一步:
使用BLASTP比对自建数据库(nr+,包含NCBI非冗余蛋白库及329个酵母基因组蛋白序列),E值阈值设为10^-10。
通过自定义Perl脚本:
基于NCBI分类数据库(nodes.dmp、merged.dmp、names.dmp)为比对标注释分类信息;
将比对结果按分类划分为三类:
RECIPIENT(受体类群):酵母亚门(Saccharomycotina)
GROUP(近缘类群):真菌界(Fungi)
OUTGROUP(外源类群):非真菌生物
计算三个关键值:
bbhO:OUTGROUP类群最佳比对的bitscore
bbhG:GROUP类群(非RECIPIENT)最佳比对的bitscore
maxB:查询序列自比对的bitscore
计算:
外源指数(Alien Index, AI)= (bbhO/maxB) - (bbhG/maxB)
OUTGROUP类群在Top300比对中的占比(outg_pct,排除同物种多菌株的重复计数)
在1,538,912个基因中,筛选出1,806个满足AI≥0.1且outg_pct≥90%的候选基因。
第二步:
对候选基因,从nr+库中获取300个不同物种的同源序列;
使用MAFFT v7.299(--auto参数)进行多序列比对,trimAl v1.4(-automated1参数)修剪不确定区域;
用IQ-TREE 1.5.1构建最大似然树(选择最佳氨基酸替代模型+1000次超快自举检验);
使用ape和phangorn R包进行中点定根,iTOL v3可视化;
人工检查1,806个系统发育树后,最终确定186个酵母基因组中的878个基因为HGT候选基因。
所有分析的1,538,912个基因信息(物种名、基因名、contig ID、contig长度、HGT状态、GO注释等)已上传至Figshare数据仓库。
</FILE_CONTENT>
</ATTACHMENT_FILE>
镜像封装
目前就保留了92服务器上的hgt_jiangchenhao:1.1
这个镜像有保存测试数据集,所以很大,有接近20.7G
Ncbi的分类单元中没有一个独立的裸子植物分类单元,因此在定义内群的时候,可以使用松柏纲