3 Variant Calling
3.1 SNV/Indel Calling
We call SNVs and indels with GATK4 HaplotypeCaller.
exome=sureselectV4_padded_hg38.bed
names='father mother proband'
for name in $names
do
input="BAM/${name}.recal.bam"
output="${name}_raw_variants.g.vcf.gz"
java -Xmx6g -jar $GATK_JAR HaplotypeCaller \
-R $reference \
-I $input \
-O $output \
-L $exome \
-ip 0 \
-ERC GVCF
done
Then, combine gVCFs and compile genotypes at sites with a variant in at least one sample.
# Combine gVCFs
find . -maxdepth 1 -type f -name "*_raw_variants.g.vcf.gz" > input.list
java -Xmx6g -jar $GATK_JAR CombineGVCFs \
$reference \
-R \
-V input.list
-O joint/joint_raw_variants.g.vcf.gz
# Genotype gVCF
java -Xmx6g -jar $GATK_JAR GenotypeGVCFs \
$reference \
-R \
-V joint/joint_raw_variants.g.vcf.gz -O joint/joint_raw_variants_genotype.vcf.gz
3.2 Variant Filtration
For SNVs, we use VQSR to filter variants. VQSR creates a training set of high-confidence variants and learns which region of the parameter space is associated with high quality calls.
# Sties only
java -Xmx2g -jar $GATK_JAR MakeSitesOnlyVcf \
\
-I joint/joint_raw_variants_genotype.vcf.gz
-O joint/joint_raw_variants_genotype_sitesonly.vcf.gz
# VQSR for SNPs
java -Xmx6g -jar $GATK_JAR VariantRecalibrator \
$reference \
-R \
-V joint/joint_raw_variants_genotype_sitesonly.vcf.gz \
--trust-all-polymorphic -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
-tranche 100.0 -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP \
-an QD \
-resource:hapmap,known=false,training=true,truth=true,prior=15 VQSR/hg38/hapmap_3.3.hg38.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12 VQSR/hg38/1000G_omni2.5.hg38.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10 VQSR/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=7 VQSR/hg38/Homo_sapiens_assembly38.dbsnp138.vcf \
--max-gaussians 4 \
-mode SNP \
-O joint/cohort_snps.recal \
--tranches-file joint/cohort_snps.tranches
--rscript-file joint/output_snp.plots.R
java -Xmx6g -jar $GATK_JAR ApplyVQSR \
$reference \
-R \
-V joint/joint_raw_variants_genotype.vcf.gz \
-O joint/snp.recalibrated.vcf.gz \
--recal-file joint/cohort_snps.recal \
--tranches-file joint/cohort_snps.tranches \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true -mode SNP
We manually filter indels by setting hard thresholds due to the small number of samples. For large numbers of samples, we can use VQSR to filter indels.
# Select indels
java -Xmx2g -jar $GATK_JAR SelectVariants \
$reference \
-R \
-V joint/joint_raw_variants_genotype.vcf.gz \
--select-type-to-include INDEL
-O joint/raw_indels.vcf.gz
# Indels hard threshold
java -Xmx2g -jar $GATK_JAR VariantFiltration \
$reference \
-R \
-V joint/raw_indels.vcf.gz \
-O joint/filtered_indels.vcf.gz "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" -filter
We merge SNPs and indels into a single VCF.
java -Xmx2g -jar $GATK_JAR SelectVariants \
$reference \
-R \
-V joint/snp.recalibrated.vcf.gz \
--select-type-to-include SNP
-O joint/snp.recalibrated_minus_indel.vcf.gz
java -Xmx2g -jar $GATK_JAR MergeVcfs \
\
-I joint/snp.recalibrated_minus_indel.vcf.gz \
-I joint/filtered_indels.vcf.gz -O joint/output_vqsr_snps_hard_indels.vcf.gz
3.3 Genotype Posteriors
We recalculate the likelihood of the genotypes given pedigree relationships and population allele frequencies.
java -Xmx2g -jar $GATK_JAR CalculateGenotypePosteriors \
\
-V joint/output_vqsr_snps_hard_indels.vcf.gz \
-ped pedigree.ped \
--supporting-callsets af-only-gnomad.hg38.vcf.gz -O trio_refined.vcf.gzb
3.4 Structural Variants
We call structural variants with Manta.
module load manta/1.6.0
for name in $names
do
mkdir "manta_${name}"
configManta.py \
--bam "BAM/${name}.recal.bam" \
--referenceFasta $reference \
--runDir manta \
--callRegions $exome \
--exome
manta_${name}/runWorkflow.py
done
3.5 Benchmarking
We benchmarked the pipeline against the GIAB high-confidence variants calls using the Illumina hap.py VCF comparison tool. We obtain the following results for the proband HG002:
Type | Truth TP | Truth FN | Query FP | Recall | Precision | F Score |
---|---|---|---|---|---|---|
SNV | 37653 | 753 | 369 | 98.0% | 99.0% | 0.985 |
INDEL | 2732 | 176 | 382 | 93.9% | 87.9% | 0.908 |
We note that the F-scores are comparable to those obtained by Kumaran, Subramanian, and Devarajan (2019): SNV F-score>0.96 and INDEL F-score>0.9.