4 Statistical Genetics
4.1 Phasing
We use WhatsHap to phase genotypes in a pedigree using the Mendelian transmission rules and physical phasing based on reads that span two or more heterozygous variants.
whatshap phase --ped pedigree.ped \
\
-o phased_whatshap.vcf joint/output_vqsr_variants.vcf.gz \
aligned_HG002.bam aligned_HG003.bam aligned_HG004.bam --no-reference --indels --tag=PS
4.2 Kinship
We seek to verify pedigree relationships and identify hidden relatedness between samples. We first select SNPs in approximate linkage equilibrium with Linkdatagen based on the genetic linkage map of HapMap SNPs. Linkdatagen also eliminates Mendelian inconsitencies in the data.
# Split variant calling file into seperate files for each individual
bcftools +split AJ_output_vqsr_snps_hard_indels.vcf.gz -Ov -o split
# Write path to each individual's vcf in a text file named MyVCFlist.txt
perl vcf2linkdatagen.pl \
\
-variantCaller unifiedgenotyper \
-annotfile annotHapMap2U_hg38.txt -mindepth 10 -missingness 0 \
-pop CEU > HG002_trio.brlmm
-idlist MyVCFlist.txt
perl linkdatagen.pl \
-pedfile MyPed.ped \
-data m \
-whichSamplesFile MyWS.ws \
-callFile MySNPs.brlmm \
-annotFile annotHapMap2U.txt -binsize 0.3 \
-pop CEU -prog pl \
-MendelErrors removeSNPs > MyPed_HapMap2_pl.out -outputDir MyPed_HapMap2_pl
We construct an allele frequency report for the chosen SNPs.
library(tidyverse)
<- read_delim(file = "data/plink.map", delim = " ", col_select = 1:4, col_names = c("CHR", "SNP", "cM", "bp"))
map <- read_tsv(file = "data/orderedSNPs.txt")
ord <- map %>% left_join(ord, by = "SNP") %>%
freq_pl ::select(CHR, SNP, `Allele freq`) %>%
dplyrmutate(A1="1") %>% mutate(A2="2") %>% mutate(NCHROBS="300") %>%
::select(CHR, SNP, A1, A2, `Allele freq`, NCHROBS) %>%
dplyr::rename(MAF = `Allele freq`)
dplyrwrite_delim(freq_pl, "result/freq_pl.frq", delim = " ")
Alternatively, with a sufficient number of samples, we can estimate the linkage disequilibrium between alleles and remove correlated SNPs using PLINK.
plink --file data --indep 50 5 2
We then calculate the empirical kinship coefficient by estimating identity by descent based on identity by state and population allele frequencies (Lange 2002). We run this calculation in PLINK.
module load plink/1.9b_6.21-x86_64
plink --file plink --genome full --read-freq freq_pl.frq
We also run this calculation using MendelKinship from the OpenMendel project.
using MendelKinship, CSV
Kinship("control_compare.txt")
We obtain the following results for the AJ reference trio using PLINK. PI_HAT is two times the estimated kinship coefficient between individuals IID1 and IID2. We note that the estimated kinship coefficient between individuals HG003 and HG004, which equals the the inbreeding coefficient of their son HG002, is 0.
IID1 | IID2 | RT | EZ | Z0 | Z1 | Z2 | PI_HAT | DST | IBS0 | IBS1 | IBS2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
HG002 | HG003 | PO | 0.5 | 0 | 1 | 0 | 0.5 | 0.752334 | 0 | 2918 | 2973 | |
HG002 | HG004 | PO | 0.5 | 0 | 1 | 0 | 0.5 | 0.756153 | 0 | 2873 | 3018 | |
HG003 | HG004 | OT | 0 | 1 | 0 | 0 | 0 | 0.616449 | 638 | 3243 | 2010 |
4.3 Runs of Homozygosity
If we suspect some degree of consanguinity, we may look for a homozygous pathogenic variant in runs of homozygosity using AutozygosityMapper.

AutozygosityMapper report for HG002 as case and HG003 and HG004 as controls.
4.4 Linkage Analysis
Using the same set of SNPs in linkage equillibrium, we can run linkage analysis on appropriately chosen pedigrees using OpenMendel for two-point linkage or multipoint linkage analysis.
# Two-point linkage analysis
using MendelTwoPointLinkage, CSV
TwoPointLinkage("Control_file.txt")
# Multipoint linkage analysis
using MendelLocationScores, CSV
LocationScores("Control_file.txt")
4.5 Population Substructure
Given a large number of samples from a population, we cluster individuals by degree of identity by state using PLINK in order to identify subgroups with shared ancestry.
plink --file mydata --cluster
# dimensional reduction to 4D then plot two coordinates
plink --file mydata --read-genome plink.genome --cluster --mds-plot 4