背景
香榧性别群体信息重建
https://www.kdocs.cn/l/ctScwmSTfJQ5
因为龙凤榧是明确同一棵树上取下来的样本,他们的亲缘关系在0.43-0.44
明确要剔除的七个样本是F26,F27,F28,F29,F30,F5,F3
可以考虑保留的样本是 F10,F2,F4,有可能是章老师说的远源的无性系。
很奇怪啊,F10为什么也这么近,现在还不知道为什么,先让子弹飞一会。
从群晖服务器上传数据
数据路径在这里:
/volume1/homes/zafu/Torreya/SunXP_VCF/gvcf.gz五个龙凤榧的数据是在这里:
/data3/wuxiaopei/TG_MF_WGS_vcf/ZAFU001/jointCall-5这一次章老师明确讲了,不要使用龙凤榧数据,因此这次的数据整合优先考虑这30 -7 +4 =27个雌树
雄树样本有30个
这样就是29比30。
我93服务器上的路径:
/data3/jiangchenhao/all_vcf章老师觉得这个数据在服务器上肯定有,但是我真搞不清楚有没有,先放着吧,有3.4T呢,实际上也比较大了
Nas传输效率太低了,只有20MB/S,我拿着硬盘去拷贝了
nohup cp M2* /volumeUSB1/usbshare/ &
nohup find . -maxdepth 1 -name "M3*" | grep --color=auto -v "MF" | xargs -I {} cp {} /volumeUSB1/usbshare &好的,数据传输已经完成了,是20MB/s跑完的传输
好,现在新的问题是,我怎么合并他们,为了保证前后工作的一致性,我们找到我们做GWAS分析常用的那个文件,然后根据header行查看合并命令。
##SentieonCommandLine.GVCFtyper=<ID=GVCFtyper,CommandLine="/data/phxst/software/sentieon-genomics-202308/libexec/driver -r ../ref/genome.fa -t 60 --algo GVCFtyper all.vcf.gz ../02.gvcf/F10-hc.g.vcf.gz ../02.gvcf/F11-hc.g.vcf.gz ../02.gvcf/F12-hc.g.vcf.gz ../02.gvcf/F13-hc.g.vcf.gz ../02.gvcf/F14-hc.g.vcf.gz ../02.gvcf/F15-hc.g.vcf.gz ../02.gvcf/F16-hc.g.vcf.gz ../02.gvcf/F17-hc.g.vcf.gz ../02.gvcf/F18-hc.g.vcf.gz ../02.gvcf/F19-hc.g.vcf.gz ../02.gvcf/F1-hc.g.vcf.gz ../02.gvcf/F20-hc.g.vcf.gz ../02.gvcf/F21-hc.g.vcf.gz ../02.gvcf/F22-hc.g.vcf.gz ../02.gvcf/F23-hc.g.vcf.gz ../02.gvcf/F24-hc.g.vcf.gz ../02.gvcf/F25-hc.g.vcf.gz ../02.gvcf/F26-hc.g.vcf.gz ../02.gvcf/F27-hc.g.vcf.gz ../02.gvcf/F28-hc.g.vcf.gz ../02.gvcf/F29-hc.g.vcf.gz ../02.gvcf/F2-hc.g.vcf.gz ../02.gvcf/F30-hc.g.vcf.gz ../02.gvcf/F3-hc.g.vcf.gz ../02.gvcf/F4-hc.g.vcf.gz ../02.gvcf/F5-hc.g.vcf.gz ../02.gvcf/F6-hc.g.vcf.gz ../02.gvcf/F7-hc.g.vcf.gz ../02.gvcf/F8-hc.g.vcf.gz ../02.gvcf/F9-hc.g.vcf.gz ../02.gvcf/M10-hc.g.vcf.gz ../02.gvcf/M11-hc.g.vcf.gz ../02.gvcf/M12-hc.g.vcf.gz ../02.gvcf/M13-hc.g.vcf.gz ../02.gvcf/M14-hc.g.vcf.gz ../02.gvcf/M15-hc.g.vcf.gz ../02.gvcf/M16-hc.g.vcf.gz ../02.gvcf/M17-hc.g.vcf.gz ../02.gvcf/M18-hc.g.vcf.gz ../02.gvcf/M19-hc.g.vcf.gz ../02.gvcf/M1-hc.g.vcf.gz ../02.gvcf/M20-hc.g.vcf.gz ../02.gvcf/M21-hc.g.vcf.gz ../02.gvcf/M22-hc.g.vcf.gz ../02.gvcf/M23-hc.g.vcf.gz ../02.gvcf/M24-hc.g.vcf.gz ../02.gvcf/M25-hc.g.vcf.gz ../02.gvcf/M26-hc.g.vcf.gz ../02.gvcf/M27-hc.g.vcf.gz ../02.gvcf/M28-hc.g.vcf.gz ../02.gvcf/M29-hc.g.vcf.gz ../02.gvcf/M2-hc.g.vcf.gz ../02.gvcf/M30-hc.g.vcf.gz ../02.gvcf/M3-hc.g.vcf.gz ../02.gvcf/M4-hc.g.vcf.gz ../02.gvcf/M5-hc.g.vcf.gz ../02.gvcf/M6-hc.g.vcf.gz ../02.gvcf/M7-hc.g.vcf.gz ../02.gvcf/M8-hc.g.vcf.gz ../02.gvcf/M9-hc.g.vcf.gz ../02.gvcf/MF31-hc.g.vcf.gz ../02.gvcf/MF32-hc.g.vcf.gz ../02.gvcf/MF33-hc.g.vcf.gz",Date=2024-01-17T12:52:13Z,Version=sentieon-genomics-202308>
##SentieonCommandLine.Haplotyper=<ID=Haplotyper,CommandLine="/data1/phxst/software/sentieon-genomics-202308/libexec/driver -r ref/genome.fa -t 96 -i 03.align/F10_md.cram --algo Haplotyper --emit_mode gvcf 04.gvcf/F10-hc.g.vcf.gz",Date=2023-11-21T01:08:16Z,Version=sentieon-genomics-202308>之前的gwas数据是用sentieon合并的
那么正好章老师给了五个龙凤榧的合并gvcf,我看看这五个龙凤的合并命令
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu F1-hc.g.vcf.gz F2-hc.g.vcf.gz F3-hc.g.vcf.gz F4-hc.g.vcf.gz F6-hc.g.vcf.gz F7-hc.g.vcf.gz F8-hc.g.vcf.gz F9-hc.g.vcf.gz F10-hc.g.vcf.gz F11-hc.g.vcf.gz F12-hc.g.vcf.gz F13-hc.g.vcf.gz F14-hc.g.vcf.gz F15-hc.g.vcf.gz F16-hc.g.vcf.gz F17-hc.g.vcf.gz F18-hc.g.vcf.gz F19-hc.g.vcf.gz F20-hc.g.vcf.gz F21-hc.g.vcf.gz F22-hc.g.vcf.gz F23-hc.g.vcf.gz F24-hc.g.vcf.gz F25-hc.g.vcf.gz ZWY_1_merged_with_Chr0.g.vcf.gz ZWY_2_merged_with_Chr0.g.vcf.gz ZWY-4_merged_with_Chr0.g.vcf.gz ZWY-5_merged_with_Chr0.g.vcf.gz M1-hc.g.vcf.gz M2-hc.g.vcf.gz M3-hc.g.vcf.gz M4-hc.g.vcf.gz M5-hc.g.vcf.gz M6-hc.g.vcf.gz M7-hc.g.vcf.gz M8-hc.g.vcf.gz M9-hc.g.vcf.gz M10-hc.g.vcf.gz M11-hc.g.vcf.gz M12-hc.g.vcf.gz M13-hc.g.vcf.gz M14-hc.g.vcf.gz M15-hc.g.vcf.gz M16-hc.g.vcf.gz M17-hc.g.vcf.gz M18-hc.g.vcf.gz M19-hc.g.vcf.gz M20-hc.g.vcf.gz M21-hc.g.vcf.gz M22-hc.g.vcf.gz M23-hc.g.vcf.gz M24-hc.g.vcf.gz M25-hc.g.vcf.gz M26-hc.g.vcf.gz M27-hc.g.vcf.gz M28-hc.g.vcf.gz M29-hc.g.vcf.gz M30-hc.g.vcf.gz -g gwas_25_11_15 &
# 检查 F1-hc.g.vcf.gz 的 contigs
bcftools view -h F1-hc.g.vcf.gz | grep "##contig"
# 检查 ZWY_1_merged.g.vcf.gz 的 contigs
bcftools view -h ZWY_1_merged.g.vcf.gz | grep "##contig"先写一个比较齐全的
按照上面说的,剔除F26,F27,F28,F29,F30,F5,F3
其中留一个F3,不能全踢了
那就是踢掉了6个,增加了4个,总数还有28个
报错:
[2025-11-15 22:06:52.475] [slmgvcf_gpu] [critical] [message] ERROR: the contigs of file(ZWY_1_merged.g.vcf.gz) is not same with file(F1-hc.g.vcf.gz)!这个报错的原因是原始版本的gvcf.gz文件是有contigs合成的chr0的,而我现在的版本是没有的(拆分染色体、合并.....)
然后在软件层面上,是GATK合并不同样本时似乎不允许header中,存在样本A有染色体1,样本B没有的情况
那这样,我手动添加一个空的Chr0的header,因为Chr0在后续的分析中是不会用到的,只是为了软件不报错。
nohup bash -c 'zcat ZWY-4_merged.g.vcf.gz | bcftools reheader -h <( bcftools view -h ZWY-4_merged.g.vcf.gz; echo "##contig=<ID=Chr0,length=165740284>" ) | bgzip > ZWY-4_merged_Chr0.g.vcf.gz' > reheader4.log 2>&1 &
nohup bash -c 'zcat ZWY-5_merged.g.vcf.gz | bcftools reheader -h <( bcftools view -h ZWY-5_merged.g.vcf.gz; echo "##contig=<ID=Chr0,length=165740284>" ) | bgzip > ZWY-5_merged_Chr0.g.vcf.gz' > reheader5.log 2>&1 &
nohup bash -c 'zcat ZWY_1_merged.g.vcf.gz | bcftools reheader -h <( bcftools view -h ZWY_1_merged.g.vcf.gz; echo "##contig=<ID=Chr0,length=165740284>" ) | bgzip > ZWY_1_merged_Chr0.g.vcf.gz' > reheader1.log 2>&1 &
nohup bash -c 'zcat ZWY_2_merged.g.vcf.gz | bcftools reheader -h <( bcftools view -h ZWY_2_merged.g.vcf.gz; echo "##contig=<ID=Chr0,length=165740284>" ) | bgzip > ZWY_2_merged_Chr0.g.vcf.gz' > reheader2.log 2>&1 &这个东西太麻烦了,我现在理解了董老师发给我的语音是什么意思了:
我当时回的是我自己试试好了,不听老人言,吃亏在眼前呐
意思就是合并命令要求每个文件里header行里的contigs排序必须是完全一样的。
那么就不仅仅是Chr0的问题,那还需要合并的时候调整顺序,我明白了,我完全明白了。唉。
bcftools reheader --threads 10 -h fixed_template.txt ZWY-4_merged.g.vcf.gz -o ZWY-4_fixed.g.vcf.gztest
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu ZWY-4_reformatted.g.vcf.gz M1-hc.g.vcf.gz -g test.vcf.gz/data2/93chuand/.bin/baseNumber/slmgvcf_gpu ZWY-4_fixed.g.vcf.gz M1-hc.g.vcf.gz -g test.vcf.gz
尝试了几个小时,虽然还是失败,但是现在我感觉我已经开始弄懂这个事情了。
第一,合并失败首先是因为header行缺乏染色体0,而这个问题我没有办法现在去解决它了。
第二,单纯的修改header不能解决问题,因为实际上仍然缺乏Chr0的数据,我们喂给basenumber还会报错。
所以,现在我解决问题的办法是,从原始数据中把的Chr0给踢出去,同时修改header行文件。
这样应该可以解决问题的。
如果还不行呢?
那么就这样做,从已知的样本中完整地提取一个样本的Chr0,然后把这部分样本的Sample id改掉,和我ZWY的样本合并。
思路正确,的确是能够跑通的:
#bcftools去除掉Chr0的比对结果
bcftools view F10-hc.g.vcf.gz -t "^Chr0" -Oz -o F10-hc_no_Chr0.g.vcf.gzbcftools view -h ZWY_1_merged.g.vcf.gz > Header.txt
#这一步的目的是以植物园的样本为模板,构建一个通用的表头,但是注意,这一步要手动改一下最后一行,把CHROM这一行的文本样本名改成我们后面要重拍的样本名,比如下面要重排的是F10,这里也要改成F10
bcftools reheader --threads 10 --header ZWY_header_template.txt F10_no_Chr0.g.vcf.gz -o F10_no_Chr0_fix.g.vcf.gz然后再重新写表头,这样就可以了。
这一步搞明白了,合并的时候我们可以再试着用英伟达的那个工具合并一下,如果可以的话,最起码有两个设备可以用于合并,对吧,总是好事。
事实上,我在这一步就遇到了问题
bcftools view F10-hc.g.vcf.gz -t "^Chr0" -Oz -o F10-hc_no_Chr0.g.vcf.gz
原因是服务器没有算力了,好吧,92还有算力,现在16个一组并行,不知道到白天能不能全部跑完,由于另外两个步骤非常快,如果白天我醒过来的时候能跑完,应该是问题不大的
我也尝试在工作站运行,但是当我在工作站上并行运行的时候,竟然报错了,而且是报的文件内容错误,md5校验无误后发现单个程序独立跑是不会报错的,暂时不清楚这是什么原因。
现在工作站上也在跑,IO写入速度快,但是也要半个小时一个样本,总之,慢慢等吧。
算了程序早晚会跑完了,不论我是否盯着它,都不会比现在的情况更坏了,多等一天甚至不是什么大不了的事情,睡觉!
到晚上的19点51,服务器上还有两个样本没跑完,我的解决办法是把服务器上的数据拉下来,交给工作站去跑,而我自己可以先把reheader部分的代码写起来。
首先,我们对于每一个输入的需要rehader的样本的gvcf文件,根据样本名生成一个新的header,
首先使用这个命令获取一个gvcf文件的样本名:
bcftools query -l M10-hc_no_Chr0.g.vcf.gz然后我们需要修改的是模板header的这个位置,也就是最后一行,我们需要替换最后的F10为我们刚刚获得的样本名
awk -F"\t" 'NR==FNR {if (NR>1) print last; last=$0; next} FNR==NR {print} END {sub(/[^\t]+$/, "F11", last); print last}' ZWY_header_template.txt > F11.header.txt可以的,这个命令是可以用的。
bcftools reheader --threads 10 --header F11.header.txt F11-hc_no_Chr0.g.vcf.gz -o F11-hc_no_Chr0_fixed.g.vcf.gz在93上直接core dump了,看来是没有cpu了,放92上重新跑,然后我们让AI按照我的思路整合成一个新的脚本
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu \
F1-hc_no_fixed_Chr0.g.vcf.gz \
F2-hc_no_fixed_Chr0.g.vcf.gz \
F3-hc_no_fixed_Chr0.g.vcf.gz \
F4-hc_no_fixed_Chr0.g.vcf.gz \
F6-hc_no_fixed_Chr0.g.vcf.gz \
F7-hc_no_fixed_Chr0.g.vcf.gz \
F8-hc_no_fixed_Chr0.g.vcf.gz \
F9-hc_no_fixed_Chr0.g.vcf.gz \
F10-hc_no_fixed_Chr0.g.vcf.gz \
F11-hc_no_fixed_Chr0.g.vcf.gz \
F12-hc_no_fixed_Chr0.g.vcf.gz \
F13-hc_no_fixed_Chr0.g.vcf.gz \
F14-hc_no_fixed_Chr0.g.vcf.gz \
F15-hc_no_fixed_Chr0.g.vcf.gz \
F16-hc_no_fixed_Chr0.g.vcf.gz \
F17-hc_no_fixed_Chr0.g.vcf.gz \
F18-hc_no_fixed_Chr0.g.vcf.gz \
F19-hc_no_fixed_Chr0.g.vcf.gz \
F20-hc_no_fixed_Chr0.g.vcf.gz \
F21-hc_no_fixed_Chr0.g.vcf.gz \
F22-hc_no_fixed_Chr0.g.vcf.gz \
F23-hc_no_fixed_Chr0.g.vcf.gz \
F24-hc_no_fixed_Chr0.g.vcf.gz \
F25-hc_no_fixed_Chr0.g.vcf.gz \
M1-hc_no_fixed_Chr0.g.vcf.gz \
M2-hc_no_fixed_Chr0.g.vcf.gz \
M3-hc_no_fixed_Chr0.g.vcf.gz \
M4-hc_no_fixed_Chr0.g.vcf.gz \
M5-hc_no_fixed_Chr0.g.vcf.gz \
M6-hc_no_fixed_Chr0.g.vcf.gz \
M7-hc_no_fixed_Chr0.g.vcf.gz \
M8-hc_no_fixed_Chr0.g.vcf.gz \
M9-hc_no_fixed_Chr0.g.vcf.gz \
M10-hc_no_fixed_Chr0.g.vcf.gz \
M11-hc_no_fixed_Chr0.g.vcf.gz \
M12-hc_no_fixed_Chr0.g.vcf.gz \
M13-hc_no_fixed_Chr0.g.vcf.gz \
M14-hc_no_fixed_Chr0.g.vcf.gz \
M15-hc_no_fixed_Chr0.g.vcf.gz \
M16-hc_no_fixed_Chr0.g.vcf.gz \
M17-hc_no_fixed_Chr0.g.vcf.gz \
M18-hc_no_fixed_Chr0.g.vcf.gz \
M19-hc_no_fixed_Chr0.g.vcf.gz \
M20-hc_no_fixed_Chr0.g.vcf.gz \
M21-hc_no_fixed_Chr0.g.vcf.gz \
M22-hc_no_fixed_Chr0.g.vcf.gz \
M23-hc_no_fixed_Chr0.g.vcf.gz \
M24-hc_no_fixed_Chr0.g.vcf.gz \
M25-hc_no_fixed_Chr0.g.vcf.gz \
M26-hc_no_fixed_Chr0.g.vcf.gz \
M27-hc_no_fixed_Chr0.g.vcf.gz \
M28-hc_no_fixed_Chr0.g.vcf.gz \
M29-hc_no_fixed_Chr0.g.vcf.gz \
M30-hc_no_fixed_Chr0.g.vcf.gz \
-g gwas_25_11_18_no_ZWY &这个命令直接就给我报Segmentation fault的错误,我只能一步步来了,搞不清楚为什么,今天有程序在93上跑也这样
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu ZWY_1_merged.g.vcf.gz ZWY_2_merged.g.vcf.gz ZWY-4_clean.g.merged.gvcf.gz ZWY-5_clean.g.merged.gvcf.gz F10-hc_no_fixed_Chr0.g.vcf.gz F11-hc_no_fixed_Chr0.g.vcf.gz F12-hc_no_fixed_Chr0.g.vcf.gz F13-hc_no_fixed_Chr0.g.vcf.gz F14-hc_no_fixed_Chr0.g.vcf.gz F15-hc_no_fixed_Chr0.g.vcf.gz -g test_ten_samples &跑10个样本没问题,有进度条!说明我这个套路是没问题的!问题可能真的出在93服务器上没资源了。
目前93的负载率实在是太高了
他是支持一个文件包含多个样本的,那么我其实不急于一次性合并所有样本,可以一次合并一部分,然后在这个基础上再次合并。
还是不行!
因为在比对行中,四个4个ZWY的样本的gvcf文件和60个群体的gvcf文件的染色体顺序又又又又又不一样,那么我们这样做,这次我先合并60个植物园的样本。
/data2/93chuand/.bin/baseNumber/slmgvcf_gpu F1-hc_no_fixed_Chr0.g.vcf.gz F2-hc_no_fixed_Chr0.g.vcf.gz F3-hc_no_fixed_Chr0.g.vcf.gz F4-hc_no_fixed_Chr0.g.vcf.gz F6-hc_no_fixed_Chr0.g.vcf.gz F7-hc_no_fixed_Chr0.g.vcf.gz F8-hc_no_fixed_Chr0.g.vcf.gz F9-hc_no_fixed_Chr0.g.vcf.gz F10-hc_no_fixed_Chr0.g.vcf.gz F11-hc_no_fixed_Chr0.g.vcf.gz F12-hc_no_fixed_Chr0.g.vcf.gz F13-hc_no_fixed_Chr0.g.vcf.gz F14-hc_no_fixed_Chr0.g.vcf.gz F15-hc_no_fixed_Chr0.g.vcf.gz F16-hc_no_fixed_Chr0.g.vcf.gz F17-hc_no_fixed_Chr0.g.vcf.gz F18-hc_no_fixed_Chr0.g.vcf.gz F19-hc_no_fixed_Chr0.g.vcf.gz F20-hc_no_fixed_Chr0.g.vcf.gz F21-hc_no_fixed_Chr0.g.vcf.gz F22-hc_no_fixed_Chr0.g.vcf.gz F23-hc_no_fixed_Chr0.g.vcf.gz F24-hc_no_fixed_Chr0.g.vcf.gz F25-hc_no_fixed_Chr0.g.vcf.gz M1-hc_no_fixed_Chr0.g.vcf.gz M2-hc_no_fixed_Chr0.g.vcf.gz M3-hc_no_fixed_Chr0.g.vcf.gz M4-hc_no_fixed_Chr0.g.vcf.gz M5-hc_no_fixed_Chr0.g.vcf.gz M6-hc_no_fixed_Chr0.g.vcf.gz M7-hc_no_fixed_Chr0.g.vcf.gz M8-hc_no_fixed_Chr0.g.vcf.gz M9-hc_no_fixed_Chr0.g.vcf.gz M10-hc_no_fixed_Chr0.g.vcf.gz M11-hc_no_fixed_Chr0.g.vcf.gz M12-hc_no_fixed_Chr0.g.vcf.gz M13-hc_no_fixed_Chr0.g.vcf.gz M14-hc_no_fixed_Chr0.g.vcf.gz M15-hc_no_fixed_Chr0.g.vcf.gz M16-hc_no_fixed_Chr0.g.vcf.gz M17-hc_no_fixed_Chr0.g.vcf.gz M18-hc_no_fixed_Chr0.g.vcf.gz M19-hc_no_fixed_Chr0.g.vcf.gz M20-hc_no_fixed_Chr0.g.vcf.gz M21-hc_no_fixed_Chr0.g.vcf.gz M22-hc_no_fixed_Chr0.g.vcf.gz M23-hc_no_fixed_Chr0.g.vcf.gz M24-hc_no_fixed_Chr0.g.vcf.gz M25-hc_no_fixed_Chr0.g.vcf.gz M26-hc_no_fixed_Chr0.g.vcf.gz M27-hc_no_fixed_Chr0.g.vcf.gz M28-hc_no_fixed_Chr0.g.vcf.gz M29-hc_no_fixed_Chr0.g.vcf.gz M30-hc_no_fixed_Chr0.g.vcf.gz -g gwas_25_11_18_no_ZWY先把没有植物园的样本给跑了,然后再把植物园的样本给跑了。
这样来看,直接比对之前的60个样本可能有问题,不一定能跑通,不如把植物园的样本跑了,然后好好的reheader一下,再比照原有的染色体排序一下。这样或许会快一点。
现在直接合成恐怕完不成的。我说实话我有点脑壳疼了。最好的办法是停掉现在的命令,先把群体里的我已经过滤掉Chr0的文件合并一遍。然后再把植物园的合并一遍,在把植物园的header给清洗一遍,然后再把植物园的reads排序重新排一下,这样就好了,是不是看上去很简单呢~~~~~~
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu_t \
F1-hc_no_Chr0.g.vcf.gz F2-hc_no_Chr0.g.vcf.gz F4-hc_no_Chr0.g.vcf.gz F6-hc_no_Chr0.g.vcf.gz F7-hc_no_Chr0.g.vcf.gz F8-hc_no_Chr0.g.vcf.gz F9-hc_no_Chr0.g.vcf.gz F10-hc_no_Chr0.g.vcf.gz F11-hc_no_Chr0.g.vcf.gz F12-hc_no_Chr0.g.vcf.gz F13-hc_no_Chr0.g.vcf.gz F14-hc_no_Chr0.g.vcf.gz F15-hc_no_Chr0.g.vcf.gz F16-hc_no_Chr0.g.vcf.gz F17-hc_no_Chr0.g.vcf.gz F18-hc_no_Chr0.g.vcf.gz F19-hc_no_Chr0.g.vcf.gz F20-hc_no_Chr0.g.vcf.gz F21-hc_no_Chr0.g.vcf.gz F22-hc_no_Chr0.g.vcf.gz F23-hc_no_Chr0.g.vcf.gz F24-hc_no_Chr0.g.vcf.gz F25-hc_no_Chr0.g.vcf.gz \
M1-hc_no_Chr0.g.vcf.gz M2-hc_no_Chr0.g.vcf.gz M3-hc_no_Chr0.g.vcf.gz M4-hc_no_Chr0.g.vcf.gz M5-hc_no_Chr0.g.vcf.gz M6-hc_no_Chr0.g.vcf.gz M7-hc_no_Chr0.g.vcf.gz M8-hc_no_Chr0.g.vcf.gz M9-hc_no_Chr0.g.vcf.gz M10-hc_no_Chr0.g.vcf.gz M11-hc_no_Chr0.g.vcf.gz M12-hc_no_Chr0.g.vcf.gz M13-hc_no_Chr0.g.vcf.gz M14-hc_no_Chr0.g.vcf.gz M15-hc_no_Chr0.g.vcf.gz M16-hc_no_Chr0.g.vcf.gz M17-hc_no_Chr0.g.vcf.gz M18-hc_no_Chr0.g.vcf.gz M19-hc_no_Chr0.g.vcf.gz M20-hc_no_Chr0.g.vcf.gz M21-hc_no_Chr0.g.vcf.gz M22-hc_no_Chr0.g.vcf.gz M23-hc_no_Chr0.g.vcf.gz M24-hc_no_Chr0.g.vcf.gz M25-hc_no_Chr0.g.vcf.gz M26-hc_no_Chr0.g.vcf.gz M27-hc_no_Chr0.g.vcf.gz M28-hc_no_Chr0.g.vcf.gz M29-hc_no_Chr0.g.vcf.gz M30-hc_no_Chr0.g.vcf.gz \
-g gwas_25_11_18_no_ZWY \
> slmgvcf_gpu_run.log 2>&1 &根据说明书上说的,slmgvcf_gpu_t速度更快
gwas_25_11_18_no_ZWY
那么,植物园样本也可以直接挂上去,应该不至于一次只能运行一个样本。只要储存池不溢出就行了。
储存池子是不可能溢出的,就在这里跑就行了。同时运行cpu版本和gpu版本
从合并后的植物园样本提取header
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=2025-11-18
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=VariantType,Number=1,Type=String,Description="Variant type description">
##contig=<ID=Chr1,length=1824307213>
##contig=<ID=Chr2,length=1166843293>
##contig=<ID=Chr3,length=1762147815>
##contig=<ID=Chr4,length=1717786663>
##contig=<ID=Chr5,length=2007191503>
##contig=<ID=Chr6,length=1957849926>
##contig=<ID=Chr7,length=1367468039>
##contig=<ID=Chr8,length=1668700658>
##contig=<ID=Chr9,length=1587360643>
##contig=<ID=Chr10,length=2000250390>
##contig=<ID=Chr11,length=1826763786>
##source=sailegene
##bcftools_viewVersion=1.11+htslib-1.13
##bcftools_viewCommand=view -h ZWY_only.gvcf; Date=Tue Nov 18 22:47:31 2025
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ZWY_1_clean ZWY_2_clean ZWY-4_clean ZWY-5_clean##contig=<ID=Chr11,length=1826763786,assembly=unknown>
##contig=<ID=Chr7,length=1367468039,assembly=unknown>
##contig=<ID=Chr1,length=1824307213,assembly=unknown>
##contig=<ID=Chr3,length=1762147815,assembly=unknown>
##contig=<ID=Chr2,length=1166843293,assembly=unknown>
##contig=<ID=Chr9,length=1587360643,assembly=unknown>
##contig=<ID=Chr4,length=1717786663,assembly=unknown>
##contig=<ID=Chr6,length=1957849926,assembly=unknown>
##contig=<ID=Chr5,length=2007191503,assembly=unknown>
##contig=<ID=Chr8,length=1668700658,assembly=unknown>
##contig=<ID=Chr10,length=2000250390,assembly=unknown>
##contig=<ID=Chr0,length=165740284,assembly=unknown>把下面这个群体系列的contigs改掉,替换上面的contigs的顺序,然后用命令重新排序
bcftools sort -m 800G -o ZWY_only_reheader_sorted.gvcf -O z -T ./tmp ZWY_only_reheader.gvcfsort之后确实是Chr11在前面了,这下应该能跑通了,picard应该有等价的命令。
完了又出错了,因为我生成的是vcf文件而不是gvcf文件,两个vcf文件是不能直接合并的。
那么现在我重新用工具合成,生成两个gvcf文件,然后再reheader,sort之后
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu_t \
F1-hc_no_Chr0.g.vcf.gz F2-hc_no_Chr0.g.vcf.gz F3-hc_no_Chr0.g.vcf.gz F4-hc_no_Chr0.g.vcf.gz F6-hc_no_Chr0.g.vcf.gz F7-hc_no_Chr0.g.vcf.gz F8-hc_no_Chr0.g.vcf.gz F9-hc_no_Chr0.g.vcf.gz F10-hc_no_Chr0.g.vcf.gz F11-hc_no_Chr0.g.vcf.gz F12-hc_no_Chr0.g.vcf.gz F13-hc_no_Chr0.g.vcf.gz F14-hc_no_Chr0.g.vcf.gz F15-hc_no_Chr0.g.vcf.gz F16-hc_no_Chr0.g.vcf.gz F17-hc_no_Chr0.g.vcf.gz F18-hc_no_Chr0.g.vcf.gz F19-hc_no_Chr0.g.vcf.gz F20-hc_no_Chr0.g.vcf.gz F21-hc_no_Chr0.g.vcf.gz F22-hc_no_Chr0.g.vcf.gz F23-hc_no_Chr0.g.vcf.gz F24-hc_no_Chr0.g.vcf.gz F25-hc_no_Chr0.g.vcf.gz ZWY_1_merged_sorted.g.vcf.gz ZWY_2_merged_sorted.g.vcf.gz ZWY_4_merged_sorted.g.vcf.gz ZWY_5_merged_sorted.g.vcf.gz\
M1-hc_no_Chr0.g.vcf.gz M2-hc_no_Chr0.g.vcf.gz M3-hc_no_Chr0.g.vcf.gz M4-hc_no_Chr0.g.vcf.gz M5-hc_no_Chr0.g.vcf.gz M6-hc_no_Chr0.g.vcf.gz M7-hc_no_Chr0.g.vcf.gz M8-hc_no_Chr0.g.vcf.gz M9-hc_no_Chr0.g.vcf.gz M10-hc_no_Chr0.g.vcf.gz M11-hc_no_Chr0.g.vcf.gz M12-hc_no_Chr0.g.vcf.gz M13-hc_no_Chr0.g.vcf.gz M14-hc_no_Chr0.g.vcf.gz M15-hc_no_Chr0.g.vcf.gz M16-hc_no_Chr0.g.vcf.gz M17-hc_no_Chr0.g.vcf.gz M18-hc_no_Chr0.g.vcf.gz M19-hc_no_Chr0.g.vcf.gz M20-hc_no_Chr0.g.vcf.gz M21-hc_no_Chr0.g.vcf.gz M22-hc_no_Chr0.g.vcf.gz M23-hc_no_Chr0.g.vcf.gz M24-hc_no_Chr0.g.vcf.gz M25-hc_no_Chr0.g.vcf.gz M26-hc_no_Chr0.g.vcf.gz M27-hc_no_Chr0.g.vcf.gz M28-hc_no_Chr0.g.vcf.gz M29-hc_no_Chr0.g.vcf.gz M30-hc_no_Chr0.g.vcf.gz \
-g gwas_ZWY_dup.vcf.gz \
> slmgvcf_gpu_run.log 2>&1 &区别就在于-o参数,生成的是联合的gvcf文件,-e参数,生成是vcf文件,毫无疑问,gvcf文件会很大。
这下可以看到生成的gvcf文件是有no refer变异的了。
ok清理了一下中间文件,这样一口气清理了8TB文件。
植物园的样本也重新跑,区别在于换成cpu程序,输出也要是gvcf文件。
nohup /data2/93chuand/.bin/baseNumber/slmgvcf ZWY_1_merged.g.vcf.gz ZWY_2_merged.g.vcf.gz ZWY-4_clean.g.merged.gvcf.gz ZWY-5_clean.g.merged.gvcf.gz -o ZWY_merged.gvcf.gz &那这样看,gvcf文件应该要保留,这样下一次要跑的时候就只需要合并这些gvcf文件就好了。
bcftools sort -m 800G -o ZWY_only_reheader_sorted.gvcf -O z -T ./tmp ZWY_merged_reheader.gvcf.gz好的,最后我们可以生成群体数据了
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu_t gwas_no_ZWY.gvcf.gz ZWY_only_reheader_sorted.gvcf.gz -g gwas.vcf.gz &
/data2/93chuand/.bin/baseNumber/slmgvcf_gpu_t M8-hc_no_Chr0.g.vcf.gz ZWY_only_reheader_sorted.gvcf.gz -g test.vcf.gz不行,仍然报错
[2025-11-20 21:11:24.043] [slmgvcf_gpu_t] [critical] [message] [ERROR]The file(1) is must be with a gVCF format这个file1应该指的是谁呢,应该是指的我排序后的 ZWY_only_reheader_sorted.gvcf.gz
那么,现在我的解决办法就只剩下把zwy里的每个样本都重新rehader再sorted了。
死忙当作活马医
为什么起这个标题,因为我之前合并一个样本的不同染色体的gvcf的时候我发现它似乎不是按照我给的顺序排序,现在我想再试试。
(rna_seq_call_snp) [jiangchenhao@localhost ZWY_4]$ bcftools concat -a -Oz -o test_merged.g.vcf.gz ZWY-4_clean.sorted.Chr7.dedup.filtered.gvcf.gz ZWY-4_clean.sorted.Chr4.dedup.filtered.gvcf.gz
Checking the headers and starting positions of 2 files这回它没有按照字典顺序排序,而是正常排序了,真奇怪啊,但是没关系,我搞清楚了。
这下正常排序了
我搞明白了为什么不行,因为上次我是用AI给的代码合并的,我没有检测他给我的顺序是否真的准确。所以太着急是真的不行的,绝对不行的,人在头脑发昏的时候可能会因为负反馈形成负循环的。
好吧排序的命令也不对,
nohup bcftools sort -m 300G \
-T ./tmp4 \
-O z \
-o ZWY_4_rehader_sorted.g.vcf.gz \
ZWY_4_rehader.g.vcf.gz &要这个才对,靠
bcftools sort -m 800G -o ZWY_only_reheader_sorted.gvcf -O z -T ./tmp ZWY_merged_reheader.gvcf.gz这个命令是不对的
nohup bcftools concat -a -Oz -o ZWY_4_merged.g.vcf.gz
ZWY-4_clean.sorted.Chr11.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr7.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr1.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr3.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr2.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr9.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr4.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr6.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr5.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr8.dedup.filtered.gvcf.gz
ZWY-4_clean.sorted.Chr10.dedup.filtered.gvcf.gz --threads 30 &这个命令和我之前的命令应该没有差别,但是居然臂上次的命令文件大小少了将近10%,而我确实看了命令,数量是没问题的呀
好吧,先不管文件大小的事情,先拿两个样本跑一下gpu合并,看一下报错不报错
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu_t ZWY_4_merged_reheader_sorted.g.vcf.gz M8-hc_no_Chr0.g.vcf.gz -g test.vcf.gz &成了成了!!!!!!!撒花!!!!!!!!!
最终命令
nohup /data2/93chuand/.bin/baseNumber/slmgvcf_gpu_t \
F1-hc_no_Chr0.g.vcf.gz F2-hc_no_Chr0.g.vcf.gz F4-hc_no_Chr0.g.vcf.gz F6-hc_no_Chr0.g.vcf.gz F7-hc_no_Chr0.g.vcf.gz F8-hc_no_Chr0.g.vcf.gz F9-hc_no_Chr0.g.vcf.gz F10-hc_no_Chr0.g.vcf.gz F11-hc_no_Chr0.g.vcf.gz F12-hc_no_Chr0.g.vcf.gz F13-hc_no_Chr0.g.vcf.gz F14-hc_no_Chr0.g.vcf.gz F15-hc_no_Chr0.g.vcf.gz F16-hc_no_Chr0.g.vcf.gz F17-hc_no_Chr0.g.vcf.gz F18-hc_no_Chr0.g.vcf.gz F19-hc_no_Chr0.g.vcf.gz F20-hc_no_Chr0.g.vcf.gz F21-hc_no_Chr0.g.vcf.gz F22-hc_no_Chr0.g.vcf.gz F23-hc_no_Chr0.g.vcf.gz F24-hc_no_Chr0.g.vcf.gz F25-hc_no_Chr0.g.vcf.gz ZWY_1_merged_sorted.g.vcf.gz ZWY_2_merged_sorted.g.vcf.gz ZWY_4_merged_sorted.g.vcf.gz ZWY_5_merged_sorted.g.vcf.gz \
M1-hc_no_Chr0.g.vcf.gz M2-hc_no_Chr0.g.vcf.gz M3-hc_no_Chr0.g.vcf.gz M4-hc_no_Chr0.g.vcf.gz M5-hc_no_Chr0.g.vcf.gz M6-hc_no_Chr0.g.vcf.gz M7-hc_no_Chr0.g.vcf.gz M8-hc_no_Chr0.g.vcf.gz M9-hc_no_Chr0.g.vcf.gz M10-hc_no_Chr0.g.vcf.gz M11-hc_no_Chr0.g.vcf.gz M12-hc_no_Chr0.g.vcf.gz M13-hc_no_Chr0.g.vcf.gz M14-hc_no_Chr0.g.vcf.gz M15-hc_no_Chr0.g.vcf.gz M16-hc_no_Chr0.g.vcf.gz M17-hc_no_Chr0.g.vcf.gz M18-hc_no_Chr0.g.vcf.gz M19-hc_no_Chr0.g.vcf.gz M20-hc_no_Chr0.g.vcf.gz M21-hc_no_Chr0.g.vcf.gz M22-hc_no_Chr0.g.vcf.gz M23-hc_no_Chr0.g.vcf.gz M24-hc_no_Chr0.g.vcf.gz M25-hc_no_Chr0.g.vcf.gz M26-hc_no_Chr0.g.vcf.gz M27-hc_no_Chr0.g.vcf.gz M28-hc_no_Chr0.g.vcf.gz M29-hc_no_Chr0.g.vcf.gz M30-hc_no_Chr0.g.vcf.gz \
-g gwas_25_11_21.vcf.gz \
> slmgvcf_gpu_run.log 2>&1 &路径:
/data3/jiangchenhao/25_11_21_popu/gwas_25_11_21.vcf.gz总结:如何构建与孙学鹏老师的版本相匹配的GVCF文件
我假定已经有了按染色体分割好的gvcf文件,
第一步是用bcftools contact合并,合并的时候的染色体排序要按照孙老师的染色体顺序来
参考下面这个顺序:
nohup bcftools concat -a -Oz -o ZWY_4_merged.g.vcf.gz \
ZWY-4_clean.sorted.Chr11.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr7.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr1.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr3.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr2.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr9.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr4.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr6.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr5.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr8.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr10.dedup.filtered.gvcf.gz --threads 3 &搞定之后,我们使用bcftools view -h 或者zcat 打印合并后的gvcf.gz文件中的header行,输出到一个新文件里面。
比如:
zcat ZWY_4_merged |head -n 105 > header.txt然后用文本编辑器打开
vim header.txt找到其中的染色体部分
##contig=<ID=Chr1,length=1824307213>
##contig=<ID=Chr2,length=1166843293>
##contig=<ID=Chr3,length=1762147815>
##contig=<ID=Chr4,length=1717786663>
##contig=<ID=Chr5,length=2007191503>
##contig=<ID=Chr6,length=1957849926>
##contig=<ID=Chr7,length=1367468039>
##contig=<ID=Chr8,length=1668700658>
##contig=<ID=Chr9,length=1587360643>
##contig=<ID=Chr10,length=2000250390>
##contig=<ID=Chr11,length=1826763786>注意有的时候第一条染色体会和第二条染色体隔开一行,要删干净
替换为下面这个
##contig=<ID=Chr11,length=1826763786,assembly=unknown>
##contig=<ID=Chr7,length=1367468039,assembly=unknown>
##contig=<ID=Chr1,length=1824307213,assembly=unknown>
##contig=<ID=Chr3,length=1762147815,assembly=unknown>
##contig=<ID=Chr2,length=1166843293,assembly=unknown>
##contig=<ID=Chr9,length=1587360643,assembly=unknown>
##contig=<ID=Chr4,length=1717786663,assembly=unknown>
##contig=<ID=Chr6,length=1957849926,assembly=unknown>
##contig=<ID=Chr5,length=2007191503,assembly=unknown>
##contig=<ID=Chr8,length=1668700658,assembly=unknown>
##contig=<ID=Chr10,length=2000250390,assembly=unknown>
##contig=<ID=Chr0,length=165740284,assembly=unknown>然后现在我们获得了新的,修正过的header后,使用bcftools reheader命令重排就好了。
bcftools reheader --threads 10 --header header.txt ZWY_4_merged.g.vcf.gz -o ZWY_4_merged_reheader_merged.g.vcf.gz然后再用basenumber的gpu合并工具进行合并就好了,非常简单。
过滤vcf.gz文件使之满足群体分析标准
我在CNS大论文的word文档里看到了完整的表述
QD < 2.0 || MQ < 40.0 || FS > 60.0 || QUAL <30.0 || MQrankSum < -12.5 || ReadPosRankSum < -8.0. The SNPs identified by GATK were further filtered: only SNPs with a minor allele frequency greater than 5%, less than 90% missing data and minQuality > 30 were considered as high-quality SNPs. 前面的"QD < 2.0 || MQ < 40.0 || FS > 60.0 || QUAL <30.0 || MQrankSum < -12.5 || ReadPosRankSum < -8.0"
中间的这个比较麻烦,需要通过bcftools生成MAF指标和F_MISSING,这里我备注一下,我的F_MISSING指标是0.1,而不是word里写的0.9,应该是之前写的时候搞错了吧?
但是目前我仍然是按照<0.9来计算的
最后这个"minQuality > 30"应该等价于basenumber的"--minimum-mapping-quality 30"
也就是说之前就进行了过滤了,传送门https://www.kdocs.cn/l/clIBseLKGTVF
为了以防万一,我又查看了原始vcf文件的header行
zcat 2_female_60sample.biallelic.snp.vcf.gz |head -n 200
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=Filter,Description="QD<2.0 or FS>60.0 or MQ<40.0 or MQRankSum < -12.5 or SOR > 3.0 or ReadPosRankSum < -8.0 or QUAL < 30.0">所以第一步对应的命令是
nohup bcftools filter -e "QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0" gwas_25_11_21.vcf.gz -Oz -o gwas_25_11_21_filtered1_deleted.vcf.gz --threads 8 &这个命令似乎没有删除位点,因为我一开始没有加参数"filter",后来加了这个filter参数之后,它开始能够正常工作了,查看前面的比对行能明显看到有的数据被删除掉了,这符合预期。
这个命令不怎么占用CPU,我在工作站上和服务器上并行了这个命令,工作站的写入速度是服务器的3倍,那么照这个速度,我有.......事实上,这个命令在工作站上卡死了。
但是后续单独过滤SNPs的命令是没有卡死的
我又想到一件事,我能不能按照染色体拆分,过滤,然后再合并?因为我有csi索引,它不用遍历整个文件,这样就可以加速了。
我想是可以的
nohup bcftools view -v snps -m2 -M2 -r Chr1 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr1_highqual_biallelic_snps.vcf.gz --threads 16 &
nohup bcftools view -v snps -m2 -M2 -r Chr2 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr2_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr3 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr3_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr4 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr4_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr5 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr5_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr6 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr6_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr7 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr7_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr8 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr8_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr9 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr9_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr10 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr10_highqual_biallelic_snps.vcf.gz --threads 4 &
nohup bcftools view -v snps -m2 -M2 -r Chr11 -e 'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0' gwas_25_11_21.vcf.gz -Oz -o Chr11_highqual_biallelic_snps.vcf.gz --threads 4 &这个命令是直接按照'SNP'、'双等位'、'QD<2.0 || FS>60.0 || MQ<40.0 || MQRankSum<-12.5 || SOR>3.0 || ReadPosRankSum<-8.0 || QUAL<30.0'、具体的染色体来过滤的,等到它跑完了,我再按照每条染色体分别计算它们的MAF和F_MISSING。
第二部
nohup bcftools +fill-tags Chr1_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr1_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr2_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr2_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr3_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr3_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr4_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr4_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr5_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr5_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr6_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr6_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr7_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr7_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr8_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr8_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr9_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr9_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr10_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr10_highqual_biallelic_snps_filtered.vcf.gz &
nohup bcftools +fill-tags Chr11_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.9' -Oz -o Chr11_highqual_biallelic_snps_filtered.vcf.gz &不过啊,非常遗憾的是在工作站上它又报错了
(base) root@7e61571abaa4:/data# tail nohup.out
[E::bgzf_uncompress] CRC32 checksum mismatch
[E::bgzf_read_block] BGZF decode jobs returned error 1 for block offset 6839004835
Error: BCF read error我怀疑这个东西和cpu资源挤压有关系。
我直接给docker stop了,然后再docker start起来。
总之这里可以慢慢来,别着急。
工作站的写入速度只有20M/s
不会是i9-14900k这块cpu的缩肛问题吧
nohup bcftools concat -a -Oz -o 11_merged.g.vcf.gz \
ZWY-4_clean.sorted.Chr11.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr7.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr1.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr3.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr2.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr9.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr4.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr6.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr5.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr8.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr10.dedup.filtered.gvcf.gz --threads 3 &然后用类似的bcftools concat命令汇总就好了。
按照0.1再过滤一遍
之前我的参数是完全按照word上写的"minor allele frequency greater than 5%, less than 90% missing data",但是" less than 90% missing data"和小培老师、董川老师使用的vcf.gz文件中的实际情况不符合,实际情况是“less than 10% missing data”。那么就按照实际情况再过滤一遍吧。
nohup bcftools +fill-tags Chr1_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr1_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr2_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr2_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr3_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr3_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr4_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr4_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr5_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr5_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr6_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr6_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr7_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr7_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr8_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr8_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr9_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr9_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr10_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr10_highqual_biallelic_snps_filtered_0.1.vcf.gz &
nohup bcftools +fill-tags Chr11_highqual_biallelic_snps.vcf.gz -- -t MAF,F_MISSING | bcftools view -i 'MAF>=0.05 && F_MISSING<0.1' -Oz -o Chr11_highqual_biallelic_snps_filtered_0.1.vcf.gz &nohup bcftools concat -a -Oz -o 11_merged.g.vcf.gz \
ZWY-4_clean.sorted.Chr11.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr7.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr1.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr3.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr2.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr9.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr4.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr6.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr5.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr8.dedup.filtered.gvcf.gz \
ZWY-4_clean.sorted.Chr10.dedup.filtered.gvcf.gz --threads 3 &