一、数据的准备
- 收集数据
- 统一 ID
seqkit replace -p "\s+.*$" -r "" Ath.pep.fa -o Ath.pep.clean.fa
seqkit replace -p "\s+.*$" -r "" Ath.cds.fa -o Ath.cds.clean.fa
二、鉴定单拷贝直系同源基因
3. 运行 OrthoFinder
orthofinder -f pep_clean_dir -S diamond -M msa -T iqtree -t 32 -a 32
三、给物种树分支点编号
4. Python 脚本
from ete3 import Tree
t = Tree('SpeciesTree_rooted.txt', format=1)
for i, node in enumerate(t.traverse('levelorder')):
if not node.is_leaf():
node.name = f'br{i}'
t.write(format=1, outfile='species_br.nwk')
四、把“单拷贝基因树”与“物种树 branch”做 reconciliation
5. 批量跑 Notung
for g in Single_Copy_Orthologue_Sequences/*.fa; do
iqtree -s $g -m LG+F+G4 -bb 1000 -nt 1 -pre ${g%.fa}
java -jar Notung-3.0.jar -g ${g%.fa}.treefile -s species_br.nwk \
--reconcile --csvexport --threshold 80
done
五、生成“基因 × branch”0/1 矩阵
6. 解析 Notung CSV → 0/1
import pandas as pd, glob, re
from ete3 import Tree
branches = [n.name for n in Tree('species_br.nwk', format=1).traverse()
if n.name.startswith('br')]
matrix = pd.DataFrame(index=[], columns=sorted(branches), dtype=int)
for csv in glob.glob('*.csv'):
gene = re.search(r'(\w+)\.csv', csv).group(1)
tmp = pd.read_csv(csv)
present = tmp[tmp['Event'] != 'L']['Branch'].unique()
row = {br: 1 if br in present else 0 for br in branches}
matrix.loc[gene] = row
matrix.to_csv('gene_branch_matrix.csv')
六、判定“基因在哪个 branch 出现”
7. 结算,直到 1 的比例 ≥ 50 %
order = sorted(branches, key=lambda x: int(x[2:])) # br0<br1<br2...
appear = pd.Series(index=matrix.index, dtype='object')
for gene, row in matrix.iterrows():
rest = order[:]
while rest:
m = len(rest)
k = row[rest].sum()
if k/m >= 0.5:
appear[gene] = rest[0]
break
rest.pop(0)
else:
appear[gene] = 'terminal'
appear.to_csv('gene_first_appearance.csv')
七、统计每个 branch
8. 汇总 reconciliation CSV
summary = pd.DataFrame(index=branches,
columns=['maintain', 'dup', 'loss'], dtype=int)
for br in branches:
events = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')])
summary.loc[br] = [
len(events[(events['Branch']==br) & (events['Event']=='S')]),
len(events[(events['Branch']==br) & (events['Event']=='D')]),
len(events[(events['Branch']==br) & (events['Event']=='L')])
]
summary.to_csv('branch_event_summary.csv')
八、可视化
9. 热图 + 出现 branch 色带
import seaborn as sns, matplotlib.pyplot as plt
g = sns.clustermap(matrix, cmap='Gray_r', row_cluster=False,
col_cluster=False, figsize=(6, 12),
row_colors=appear.map(dict(zip(order+['terminal'],
sns.color_palette('hls', len(order)+1))))
g.ax_row_dendrogram.set_visible(False)
plt.savefig('gene_branch_heatmap.pdf', bbox_inches='tight')
10.branch 柱状图
summary.plot(kind='bar', stacked=True, color=['gray', 'red', 'black'])
plt.ylabel('No. of genes')
plt.xlabel('Branch')
plt.tight_layout()
plt.savefig('branch_event_bar.pdf')
【金山文档 | WPS云文档】 系统发育树中的每个branch与单拷贝同源基因对应用于分析基因年龄
https://www.kdocs.cn/l/cbzP2WNFzxNu