3 Variant Calling

3.1 SNV/Indel Calling

We call SNVs and indels with GATK4 HaplotypeCaller.

exome=sureselectV4_padded_hg38.bed
names='father mother proband' 
for name in $names
do
    input="BAM/${name}.recal.bam"
    output="${name}_raw_variants.g.vcf.gz"
    java -Xmx6g -jar $GATK_JAR HaplotypeCaller \
    -R $reference \
    -I $input \
    -O $output \
    -L $exome \
    -ip 0 \
    -ERC GVCF
done

Then, combine gVCFs and compile genotypes at sites with a variant in at least one sample.

# Combine gVCFs
find . -maxdepth 1 -type f -name "*_raw_variants.g.vcf.gz" > input.list
java -Xmx6g -jar $GATK_JAR CombineGVCFs \
-R $reference \
-V input.list \
-O joint/joint_raw_variants.g.vcf.gz

# Genotype gVCF
java -Xmx6g -jar $GATK_JAR GenotypeGVCFs \
-R $reference \
-V joint/joint_raw_variants.g.vcf.gz \
-O joint/joint_raw_variants_genotype.vcf.gz

3.2 Variant Filtration

For SNVs, we use VQSR to filter variants. VQSR creates a training set of high-confidence variants and learns which region of the parameter space is associated with high quality calls.

# Sties only
java -Xmx2g -jar $GATK_JAR MakeSitesOnlyVcf \
-I joint/joint_raw_variants_genotype.vcf.gz \
-O joint/joint_raw_variants_genotype_sitesonly.vcf.gz

# VQSR for SNPs
java -Xmx6g -jar $GATK_JAR VariantRecalibrator \
-R $reference \
-V joint/joint_raw_variants_genotype_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP \
-resource:hapmap,known=false,training=true,truth=true,prior=15 VQSR/hg38/hapmap_3.3.hg38.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12 VQSR/hg38/1000G_omni2.5.hg38.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10 VQSR/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=7 VQSR/hg38/Homo_sapiens_assembly38.dbsnp138.vcf \
--max-gaussians 4 \
-mode SNP \
-O joint/cohort_snps.recal \
--tranches-file joint/cohort_snps.tranches \
--rscript-file joint/output_snp.plots.R

java -Xmx6g -jar $GATK_JAR ApplyVQSR \
-R $reference \
-V joint/joint_raw_variants_genotype.vcf.gz \
-O joint/snp.recalibrated.vcf.gz \
--recal-file joint/cohort_snps.recal \
--tranches-file joint/cohort_snps.tranches \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode SNP

We manually filter indels by setting hard thresholds due to the small number of samples. For large numbers of samples, we can use VQSR to filter indels.

# Select indels
java -Xmx2g -jar $GATK_JAR SelectVariants \
-R $reference \
-V joint/joint_raw_variants_genotype.vcf.gz \
--select-type-to-include INDEL \
-O joint/raw_indels.vcf.gz

# Indels hard threshold
java -Xmx2g -jar $GATK_JAR VariantFiltration \
-R $reference \
-V joint/raw_indels.vcf.gz \
-O joint/filtered_indels.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20"

We merge SNPs and indels into a single VCF.

java -Xmx2g -jar $GATK_JAR SelectVariants \
-R $reference \
-V joint/snp.recalibrated.vcf.gz \
--select-type-to-include SNP \
-O joint/snp.recalibrated_minus_indel.vcf.gz

java -Xmx2g -jar $GATK_JAR MergeVcfs \
-I joint/snp.recalibrated_minus_indel.vcf.gz \
-I joint/filtered_indels.vcf.gz \
-O joint/output_vqsr_snps_hard_indels.vcf.gz

3.3 Genotype Posteriors

We recalculate the likelihood of the genotypes given pedigree relationships and population allele frequencies.

java -Xmx2g -jar $GATK_JAR CalculateGenotypePosteriors \
-V joint/output_vqsr_snps_hard_indels.vcf.gz \
-ped pedigree.ped \
--supporting-callsets af-only-gnomad.hg38.vcf.gz \
-O trio_refined.vcf.gzb

3.4 Structural Variants

We call structural variants with Manta.

module load manta/1.6.0
for name in $names
do
  mkdir "manta_${name}"
  configManta.py \
  --bam "BAM/${name}.recal.bam" \
  --referenceFasta $reference \
  --runDir manta \
  --callRegions $exome \
  --exome
  manta_${name}/runWorkflow.py
done

3.5 Benchmarking

We benchmarked the pipeline against the GIAB high-confidence variants calls using the Illumina hap.py VCF comparison tool. We obtain the following results for the proband HG002:

Type Truth TP Truth FN Query FP Recall Precision F Score
SNV 37653 753 369 98.0% 99.0% 0.985
INDEL 2732 176 382 93.9% 87.9% 0.908

We note that the F-scores are comparable to those obtained by Kumaran, Subramanian, and Devarajan (2019): SNV F-score>0.96 and INDEL F-score>0.9.