杜鹃花VCF转Hmp/012/plink文件

By Liu Qirui, 5 September, 2024

Forums

目前杜鹃花的生信研究需要的文件格式有这几种：VCF格式、PLINK的BED/BIM格式、Numeric格式(也有说012格式)、Hapmap格式(Hmp)格式

基于叶倩男提供的原始VCF格式文件，可以转为不同的格式

从全部变异的VCF文件中筛选出实际的SNP

vcftools --vcf combine_all.pass.recode.clear.vcf --remove-indels --chr 1 2 3 4 5 6 7 8 9 10 11 12 13 --recode --recode-INFO-all --out combine_all.pass.recode.clear.snp

plink --vcf combine_all.pass.recode.vcf --allow-extra-chr --snps-only --chr 1-13 --make-bed --out combine_all.pass.recode.13chr

plink --bfile combine_all.pass.recode.13chr --allow-extra-chr --recode --out combine_all.pass.recode.13chr

5327033 out of 5981019 variants loaded from .bim file.

这两种方法最终都会得到来自13个染色体的5327033个SNP（剔除了非CHR染色体、非SNP的位点），二者的区别时，前者能够直接输出VCF格式文件，后者输出则为BIM格式和BED格式

转为Hapmap和Numeric格式

该步骤使用R语言下的blupADC包

library(blupADC)

phased_result=geno_format(

input_data_path="/root/home/liuqirui/SNPdata", #替换为自用路径，如果在本目录下可以直接使用getwd()

input_data_name="combine_all.pass.recode.snp.13chr.vcf", #读取的文件

input_data_type="VCF",

phased_genotype=TRUE,

haplotype_window_nSNP=NULL,

output_data_type=c("Hapmap"),

output_data_path="/root/home/liuqirui/SNPdata", #替换为自用路径，如果在本目录下可以直接使用getwd()

output_data_name="combine_all.pass.recode.snp.13chr.hmp", #输出文件名（输出为设置名.txt）

return_result=TRUE, cpu_cores=1 )

其他参数默认即可，也可以自行查看blupADC的目录进行进一步详细设置

sed -i 's/rs#/rs/g' combine_all.pass.recode.snp.hmp.txt

blupADC会输出一个带有”rs#“标题头的hmp文件，在后续的分析使用中有可能会造成问题，因此通过一个sed表达式将其替换为rs（即去掉#符号）

vcftools --vcf combine_all.pass.recode.snp.13chr.vcf --out combine_all.pass.recode.snp.13chr --012

VCFtools可以将vcf输出为012格式

paste combine_all.pass.recode.snp.012.indv combine_all.pass.recode.snp.012 > combine_all.pass.recode.snp.012.txt

vcftools是输出中，样本列会被单独记录在.indv文件中，可以使用简单的paste命令将其合并为一个带有样本名称的012文本文件