Whole Exome Sequencing Pipeline
2024-05-19
1 Overview
This is a whole-exome sequencing (WES) pipeline for identifying the genetic causes of Mendelian disorders in singletons, trios and larger pedigrees. It was assembled in the Kitzler lab at the McGill University Health Centre and run on Compute Canada servers. It includes the following steps:
- Pre-processing of raw reads.
- Germline variant calling for SNVs, indels and structural variants.
- Pedigree-based, read-based and statistical phasing.
- Estimation of kinship coefficients to verify pedigree relationships.
- Variant annotation and trio analysis to identify disease-causing variants.
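As a rough illustration of the kinship verification step listed above, the sketch below computes the kinship coefficients expected from a declared pedigree and flags pairs whose estimated coefficient deviates from that expectation. It is a minimal Python sketch, not the pipeline's own code. The function names, sample labels, example estimates and the 0.05 tolerance are all assumptions chosen for illustration, and the pedigree is assumed to list parents before their children.

```python
from itertools import combinations


def pedigree_kinship(pedigree):
    """Expected kinship coefficients from a pedigree.

    pedigree: dict mapping individual -> (father, mother), with None for
    unknown parents (founders). Assumption: parents appear before children.
    Founders are assumed unrelated and non-inbred.
    """
    ids = list(pedigree)
    phi = {}
    for idx, i in enumerate(ids):
        father, mother = pedigree[i]
        # Self-kinship is 0.5 * (1 + kinship between the two parents).
        parents_phi = phi[(father, mother)] if father and mother else 0.0
        phi[(i, i)] = 0.5 * (1.0 + parents_phi)
        # Kinship with each earlier individual is the average over i's parents.
        for j in ids[:idx]:
            value = 0.0
            if father:
                value += 0.5 * phi[(father, j)]
            if mother:
                value += 0.5 * phi[(mother, j)]
            phi[(i, j)] = phi[(j, i)] = value
    return phi


def flag_discordant_pairs(pedigree, estimated, tolerance=0.05):
    """Return (a, b, expected, estimated) for pairs outside the tolerance."""
    expected = pedigree_kinship(pedigree)
    flagged = []
    for a, b in combinations(pedigree, 2):
        est = estimated.get((a, b), estimated.get((b, a)))
        if est is not None and abs(est - expected[(a, b)]) > tolerance:
            flagged.append((a, b, expected[(a, b)], est))
    return flagged


# Hypothetical trio and hypothetical estimates (not real results).
trio = {"father": (None, None), "mother": (None, None), "child": ("father", "mother")}
estimates = {
    ("father", "mother"): 0.01,
    ("father", "child"): 0.02,  # far below the expected 0.25
    ("mother", "child"): 0.24,
}
print(flag_discordant_pairs(trio, estimates))
```

In this made-up example only the father-child pair is flagged, which is the pattern one would expect from a sample swap or misattributed paternity. A real threshold would need to account for the variance of the estimator and the number of SNPs used.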
We follow GATK best practices for pre-processing and SNV/indel calling. We use Manta to call structural variants, WhatsHap to phase variants, Linkdatagen to select SNPs in approximate linkage equilibrium, and OpenMendel and PLINK to estimate kinship coefficients. We use Ensembl VEP to annotate variants. We provide R scripts to identify de novo variants and compound heterozygous variants (two variants in trans within the same gene), and to fetch data from the gnomAD API, in particular homozygote counts and phasing estimates based on the EM algorithm. The results are written to Excel in a readable format.
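The trio logic behind de novo and compound heterozygous detection can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the R scripts provided with the pipeline: genotypes are encoded as alternate-allele counts (0, 1 or 2), only pedigree-based reasoning is used, and the record layout, function names, gene labels and variant IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def is_de_novo(child, father, mother):
    """Child is heterozygous and neither parent carries the alternate allele."""
    return child == 1 and father == 0 and mother == 0


def parental_origin(child, father, mother):
    """Parent that transmitted a heterozygous variant, if unambiguous."""
    if child != 1:
        return None
    if father > 0 and mother == 0:
        return "paternal"
    if mother > 0 and father == 0:
        return "maternal"
    return None  # de novo or both parents carry it: origin not resolved here


def compound_het_pairs(variants):
    """Pairs of variants in the same gene inherited from different parents."""
    by_gene = defaultdict(list)
    for v in variants:
        origin = parental_origin(v["child"], v["father"], v["mother"])
        if origin is not None:
            by_gene[v["gene"]].append((v["id"], origin))
    pairs = []
    for gene, calls in by_gene.items():
        for (id1, origin1), (id2, origin2) in combinations(calls, 2):
            if origin1 != origin2:  # one paternal, one maternal: in trans
                pairs.append((gene, id1, id2))
    return pairs


# Hypothetical trio genotypes (alternate-allele counts), not real data.
variants = [
    {"id": "v1", "gene": "GENE_A", "child": 1, "father": 1, "mother": 0},
    {"id": "v2", "gene": "GENE_A", "child": 1, "father": 0, "mother": 1},
    {"id": "v3", "gene": "GENE_B", "child": 1, "father": 0, "mother": 0},
    {"id": "v4", "gene": "GENE_B", "child": 1, "father": 1, "mother": 0},
]

de_novo_ids = [v["id"] for v in variants
               if is_de_novo(v["child"], v["father"], v["mother"])]
print(de_novo_ids)
print(compound_het_pairs(variants))
```

Here `v1` and `v2` form a compound heterozygous pair in `GENE_A`, while `v3` is a de novo candidate. Genotypes alone cannot tell whether `v3` lies in trans with the paternally inherited `v4`, which is the kind of case where read-based phasing adds information.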
We illustrate the expected output and benchmark the pipeline using Illumina paired-end WES data from the GIAB Ashkenazi Jewish reference trio (SRR2962669, SRR2962692, SRR2962694). The Bash scripts submitted to Slurm, together with the associated data and results, can be found in the GitHub repository.