We used the DNA left over from the library prep for WGS of 18 samples: the DNA was extracted but not used up entirely by the library prep. Here we compare the genotypes at sites shared between the two data sets. For the chip genotyping, we used both the recommended priors and new priors that we obtained from the crosses using the “SSTool” from Thermo Fisher. The WGS data was used to design the probe sequences for the chip. However, we used 819 genomes to design the chip, and here we consider only 18 samples. We used ANGSD to call genotypes for all 819 samples together, whereas here we look at only a few of them. Therefore, although the comparison can help us identify problematic loci, we are cautious about drawing conclusions on the accuracy of each technology. The average sequencing depth for the WGS across the 819 samples was 12X, but it varies from sample to sample and across the genome. Therefore, we cannot tell precisely whether the discrepancies in genotypes between the technologies are due to sequencing depth, sequencing errors, or the chip. We aim to get a general overview of loci with discrepancies in zygosity and of genomic regions with higher than expected genotype discordance.
library(tidyverse)
library(here)
library(colorout)
library(flextable)
library(ggplot2)
library(scales)
library(reticulate)
library(extrafont)
library(stringr)
library(readr)
library(dplyr)
library(data.table)
library(ggrepel)
library(forcats)
library(officer)
library(ggvenn)
library(RColorBrewer)
library(ggstatsplot)
library(broom)
Note about the general approach
We have data for 18 samples from 2 populations genotyped with both technologies: 6 samples from Nepal (KAT) and 12 samples from Trinidad and Tobago (SAI). We did not have enough DNA left after library prep for all samples.
We have 3 sets of genotype calls for each technology:
WGS -> 800+ samples, 30 samples (KAT 12 samples and SAI 18 samples), and 18 samples (KAT 6 samples and SAI 12 samples)
Chip -> 500 samples, 95 samples (1 plate with the 18 samples and other wild samples), and 18 samples (KAT 6 samples and SAI 12 samples)
Since the WGS calls took longer, part of the code is written to compare the default priors with the new priors generated using the crosses. This part is illustrative and lets us develop the code while waiting for the WGS calls to finish. It is not a good idea to use priors from lab crosses in genotype calls for wild animals.
Check how many samples
# count the samples in the vcf
bcftools query -l data/raw_data/albo/wgs_vs_chip/wgs_default_prior_recommended_june_16_2023.vcf | wc -l
## 18
Check sample names
# list the first sample names in the vcf
bcftools query -l data/raw_data/albo/wgs_vs_chip/wgs_default_prior_recommended_june_16_2023.vcf | head
## 601_Debug027_A12.CEL
## 604_Debug027_B1.CEL
## 605_Debug027_B2.CEL
## 606_Debug027_B3.CEL
## 607_Debug027_B4.CEL
## 608_Debug027_B5.CEL
## 611_Debug028_G10.CEL
## 612_Debug028_G11.CEL
## 613_Debug028_G7.CEL
## 614_Debug028_G8.CEL
Create output directory
# Create main directory
dir.create(
here("output", "wgs_vs_chip"),
showWarnings = FALSE,
recursive = FALSE
)
Convert ‘vcf’ file from Axiom suite to ‘bed’ format
# I created a fam file with the information about each sample, but first we import the data and create a bed file setting the family id constant
plink2 \
--allow-extra-chr \
--vcf data/raw_data/albo/wgs_vs_chip/wgs_default_prior_recommended_june_16_2023.vcf \
--const-fid \
--make-bed \
--fa data/genome/albo.fasta.gz \
--ref-from-fa 'force' `# sets REF alleles when it can be done unambiguously, we use force to change the alleles` \
--out output/wgs_vs_chip/chip_dp_01 `# dp - default priors` \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants" output/wgs_vs_chip/chip_dp_01.log # to get the number of variants from the log file.
## --vcf: 105607 variants scanned.
## 105607 variants loaded from output/wgs_vs_chip/chip_dp_01-temporary.pvar.zst.
## --ref-from-fa force: 0 variants changed, 105607 validated.
Using the default priors we obtained 105,607 SNPs. All the reference alleles matched the reference genome (AalbF3).
# I created a fam file with the information about each sample, but first we import the data and create a bed file setting the family id constant
plink2 \
--allow-extra-chr \
--vcf data/raw_data/albo/wgs_vs_chip/wgs_new_prior_recommended_june_16_2023.vcf \
--const-fid \
--make-bed \
--fa data/genome/albo.fasta.gz \
--ref-from-fa 'force' `# sets REF alleles when it can be done unambiguously, we use force to change the alleles` \
--out output/wgs_vs_chip/chip_np_01 `# np - new priors` \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants" output/wgs_vs_chip/chip_np_01.log # to get the number of variants from the log file.
## --vcf: 118408 variants scanned.
## 118408 variants loaded from output/wgs_vs_chip/chip_np_01-temporary.pvar.zst.
## --ref-from-fa force: 0 variants changed, 118408 validated.
Using the new priors we obtained 118,408 SNPs. All the reference alleles matched the reference genome (AalbF3).
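As a quick check of the two variant counts reported above, the arithmetic can be sketched in Python. The variable names are only illustrative and are not part of the original pipeline.

```python
# Variant counts taken from the two plink2 logs above
n_default = 105_607  # chip calls with the default priors
n_new = 118_408      # chip calls with the new priors from the SSTool

extra = n_new - n_default
pct_more = extra / n_default * 100

print(f"{extra:,} more SNPs with the new priors ({pct_more:.1f}% increase)")
```

This only compares how many SNPs each set of priors returned. It says nothing about which calls are correct.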
Check the first rows of the files we will work on.
## 0 601_Debug027_A12.CEL 0 0 0 -9
## 0 604_Debug027_B1.CEL 0 0 0 -9
## 0 605_Debug027_B2.CEL 0 0 0 -9
## 0 606_Debug027_B3.CEL 0 0 0 -9
## 0 607_Debug027_B4.CEL 0 0 0 -9
We need to update the family information, individual id, and sex of each individual. We can use the same file we use with the Axiom Suite to update our .fam file.
## Sample Filename Family_ID Individual_ID Father_ID Mother_ID Sex Affection Status
## 608_Debug027_B5.CEL KAT 12a 0 0 2 -9
## 616_Debug028_H10.CEL SAI 16a 0 0 2 -9
## 615_Debug028_G9.CEL SAI 3a 0 0 2 -9
## 607_Debug027_B4.CEL KAT 11a 0 0 2 -9
Import the fam file we use with Axiom Suite
# the order of the rows in this file does not matter
samples <-
read.delim(
file = here(
"data",
"raw_data",
"albo",
"wgs_vs_chip",
"sample_ped_info.txt"
),
header = TRUE
)
head(samples)
## Sample.Filename Family_ID Individual_ID Father_ID Mother_ID Sex
## 1 608_Debug027_B5.CEL KAT 12a 0 0 2
## 2 616_Debug028_H10.CEL SAI 16a 0 0 2
## 3 615_Debug028_G9.CEL SAI 3a 0 0 2
## 4 607_Debug027_B4.CEL KAT 11a 0 0 2
## 5 606_Debug027_B3.CEL KAT 10a 0 0 2
## 6 614_Debug028_G8.CEL SAI 2a 0 0 2
## Affection.Status
## 1 -9
## 2 -9
## 3 -9
## 4 -9
## 5 -9
## 6 -9
Import .fam file we created once we created the bed file using Plink2
# The fam file is the same for both data sets with the default or new priors
fam1 <-
read.delim(
file = here(
"output", "wgs_vs_chip", "chip_dp_01.fam"
),
header = FALSE
)
head(fam1)
## V1 V2 V3 V4 V5 V6
## 1 0 601_Debug027_A12.CEL 0 0 0 -9
## 2 0 604_Debug027_B1.CEL 0 0 0 -9
## 3 0 605_Debug027_B2.CEL 0 0 0 -9
## 4 0 606_Debug027_B3.CEL 0 0 0 -9
## 5 0 607_Debug027_B4.CEL 0 0 0 -9
## 6 0 608_Debug027_B5.CEL 0 0 0 -9
We can merge the two data frames.
# to keep the same order as the .fam file, we first create an index based on the sample numbers, then use it to keep the order
# Extract the number part from the columns
fam1_temp <- fam1 |>
mutate(num_id = as.numeric(str_extract(V2, "^\\d+")))
samples_temp <- samples |>
mutate(num_id = as.numeric(str_extract(Sample.Filename, "^\\d+")))
# Perform the left join using the num_id columns and keep the order of fam1
df <- fam1_temp |>
dplyr::left_join(samples_temp, by = "num_id") |>
dplyr::select(-num_id) |>
dplyr::select(8:13)
# check the data frame
head(df)
## Family_ID Individual_ID Father_ID Mother_ID Sex Affection.Status
## 1 KAT 7a 0 0 2 -9
## 2 KAT 8a 0 0 2 -9
## 3 KAT 9a 0 0 2 -9
## 4 KAT 10a 0 0 2 -9
## 5 KAT 11a 0 0 2 -9
## 6 KAT 12a 0 0 2 -9
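The idea behind the join above (match on the numeric prefix of the file name, keep the order of the .fam file) can be sketched in plain Python. This is a minimal illustration using a few rows copied from the outputs above; `numeric_prefix` is a hypothetical helper, not part of the original code.

```python
import re

def numeric_prefix(name):
    """Return the leading digits of a sample file name as an int."""
    match = re.match(r"\d+", name)
    if match is None:
        raise ValueError(f"no numeric prefix in {name!r}")
    return int(match.group())

# Order of the samples in the .fam file written by plink2
fam_order = ["606_Debug027_B3.CEL", "607_Debug027_B4.CEL", "608_Debug027_B5.CEL"]

# Sample sheet rows (any order): file name -> (Family_ID, Individual_ID)
sample_info = {
    "608_Debug027_B5.CEL": ("KAT", "12a"),
    "607_Debug027_B4.CEL": ("KAT", "11a"),
    "606_Debug027_B3.CEL": ("KAT", "10a"),
}

# Index the sample sheet by number, then walk the .fam order
by_number = {numeric_prefix(name): info for name, info in sample_info.items()}
ordered = [by_number[numeric_prefix(name)] for name in fam_order]
print(ordered)
```

A lookup like this fails loudly if a sample in the .fam file is missing from the sample sheet, which a left join would silently fill with NA.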
We can check how many samples we have in our file
## [1] 18
Before you save the new fam file, you can rename the original file so you can compare the order later. If you want to repeat the steps above after saving the new .fam file, you will need to import the vcf again.
# Save and override the .fam file for dp
write.table(
df,
file = here(
"output", "wgs_vs_chip", "chip_dp_01.fam"
),
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
# Save and override the .fam file for np
# First we need to change the sample ids
df$Individual_ID <- gsub("a", "b", df$Individual_ID)
# Save it
write.table(
df,
file = here(
"output", "wgs_vs_chip", "chip_np_01.fam"
),
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
Check the new .fam files to see if they have the order and the sample attributes we want.
# you can open the file on a text editor and double check the sample order and information.
head -n 5 output/wgs_vs_chip/chip_dp_01.fam
## KAT 7a 0 0 2 -9
## KAT 8a 0 0 2 -9
## KAT 9a 0 0 2 -9
## KAT 10a 0 0 2 -9
## KAT 11a 0 0 2 -9
# you can open the file on a text editor and double check the sample order and information.
head -n 5 output/wgs_vs_chip/chip_np_01.fam
## KAT 7b 0 0 2 -9
## KAT 8b 0 0 2 -9
## KAT 9b 0 0 2 -9
## KAT 10b 0 0 2 -9
## KAT 11b 0 0 2 -9
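The sample ids carry a one-letter suffix for the source of the calls: "a" for the default priors, "b" for the new priors, and "w" for WGS. The `gsub("a", "b", ...)` call above replaces every "a" in the id, which is fine here because the ids are digits plus the suffix. A slightly more defensive variant only touches the last character. A minimal sketch in Python, where `relabel` is a hypothetical helper:

```python
import re

def relabel(individual_id, new_suffix):
    """Swap the one-letter suffix (a, b or w) at the end of an id."""
    return re.sub(r"[abw]$", new_suffix, individual_id)

default_ids = ["7a", "8a", "9a", "10a", "11a"]
new_ids = [relabel(sample_id, "b") for sample_id in default_ids]
print(new_ids)
```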
The WGS data is already in the ‘bed’ format. We can create a new bed file and check if the reference alleles match the reference genome.
# We can create a new bed file and check if the reference and alternative alleles are set correctly
# I manually added "w" to the sample names after creating the file
plink2 \
--allow-extra-chr \
--bfile data/raw_data/albo/wgs_vs_chip/wgs \
--make-bed \
--fa data/genome/albo.fasta.gz \
--ref-from-fa 'force' \
--out output/wgs_vs_chip/wgs_01 \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants\|samples" output/wgs_vs_chip/wgs_01.log
## 18 samples (0 females, 0 males, 18 ambiguous; 18 founders) loaded from
## 175360 variants loaded from data/raw_data/albo/wgs_vs_chip/wgs.bim.
## --ref-from-fa force: 0 variants changed, 175360 validated.
Now we need to decide which strategy to follow for the pairwise comparison of the 18 samples:
Single VCF for each technology: We can create two multi-sample VCFs, one for each technology (sequencing and SNP chip). This approach could make it easier to manage and manipulate the data, especially if the number of variants detected by each technology is different.
Single VCF for each sample: Having a separate VCF for each sample could be useful if we plan to do a lot of sample-specific processing. However, it could become difficult to manage with a large number of samples.
I will create a vcf for each pair of calls from the same sample, setting the missingness to zero.
Create output directory
# Create a subdirectory for the vcfs
subdirs <- c("vcfs")
for (subdir in subdirs) {
dir.create(here("output", "wgs_vs_chip", subdir), showWarnings = FALSE)
}
We can merge the WGS and Chip data sets
# Create list of files to merge: wgs and chip with default and new priors
echo 'output/wgs_vs_chip/wgs_01
output/wgs_vs_chip/chip_dp_01
output/wgs_vs_chip/chip_np_01' > output/wgs_vs_chip/merge_list.txt
Merge the data (wgs and both chip data sets)
plink \
--allow-extra-chr \
--keep-allele-order \
--merge-list output/wgs_vs_chip/merge_list.txt \
--out output/wgs_vs_chip/wgs_chip \
--silent
grep "variants\|samples" output/wgs_vs_chip/wgs_chip.log
## Performing single-pass merge (54 people, 175388 variants).
Now we can subset the samples and keep the pairs that we are interested in.
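Code explanation: for each family (population) in the merged .fam file, the script strips the a, b, or w suffix to get the base sample ids. Then, for each pair of suffixes (aw, ab, bw), it checks that both samples exist and exports a two-sample vcf that keeps only the sites with no missing genotypes (--geno 0).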
input_file="output/wgs_vs_chip/wgs_chip.fam"
output_dir="output/wgs_vs_chip/vcfs"
bfile="output/wgs_vs_chip/wgs_chip"
# create the output directory if it does not exist
mkdir -p $output_dir
# get unique families
families=$(awk '{print $1}' $input_file | sort | uniq)
for famid in $families; do
# get the base sample ids (without a, b, w)
base_iids=$(grep "$famid" $input_file | awk '{print $2}' | sed 's/[abw]$//' | uniq)
for base_iid in $base_iids; do
for combination in "aw" "ab" "bw"; do
# Check if both samples exist
if grep -qE "${famid}\s${base_iid}[${combination:0:1}]\s" "$input_file" &&
grep -qE "${famid}\s${base_iid}[${combination:1:1}]\s" "$input_file"; then
# Create temporary file
tmp_file=$(mktemp)
grep -E "${famid}\s${base_iid}[${combination:0:1}]\s" "$input_file" > "$tmp_file"
grep -E "${famid}\s${base_iid}[${combination:1:1}]\s" "$input_file" >> "$tmp_file"
# Execute plink2
plink2 \
--allow-extra-chr \
--keep-allele-order \
--bfile $bfile \
--keep "$tmp_file" \
--recode vcf-iid \
--geno 0 \
--out "$output_dir/${famid}_${base_iid}${combination}" \
--silent
# Remove temporary file
rm "$tmp_file"
fi
done
done
done
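The pairing logic of the loop above can be sketched in Python to make the expected output files explicit. This is only an illustration with toy rows; `pair_files` is a hypothetical helper and it does not call plink2.

```python
from itertools import combinations

def pair_files(fam_rows, suffixes="abw"):
    """Map each two-sample output prefix to the (FID, IID) rows it keeps."""
    present = set(fam_rows)
    # Base ids are the individual ids without the a/b/w suffix
    bases = sorted({(fid, iid[:-1]) for fid, iid in fam_rows if iid[-1] in suffixes})
    plan = {}
    for fid, base in bases:
        for first, second in combinations(suffixes, 2):  # ab, aw, bw
            pair = [(fid, base + first), (fid, base + second)]
            # Only plan a file when both members of the pair exist
            if all(row in present for row in pair):
                plan[f"{fid}_{base}{first}{second}"] = pair
    return plan

# Toy .fam content: SAI 1 has no "b" sample here, on purpose
rows = [("KAT", "7a"), ("KAT", "7b"), ("KAT", "7w"), ("SAI", "1a"), ("SAI", "1w")]
plan = pair_files(rows)
for prefix, pair in plan.items():
    print(prefix, pair)
```

With 18 base samples and all three suffixes present, this gives 18 x 3 = 54 files, which matches the 54 vcfs listed in the next step.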
Check how many SNPs per vcf
# Define directory with the vcfs
output_dir="output/wgs_vs_chip/vcfs"
# Count how many SNPs we have in each vcf file
for file in ${output_dir}/*.vcf; do
echo $(basename $file): $(grep -v '^#' $file | wc -l)
done
## KAT_10ab.vcf: 88082
## KAT_10aw.vcf: 103266
## KAT_10bw.vcf: 112299
## KAT_11ab.vcf: 87696
## KAT_11aw.vcf: 102966
## KAT_11bw.vcf: 111933
## KAT_12ab.vcf: 87242
## KAT_12aw.vcf: 102463
## KAT_12bw.vcf: 110802
## KAT_7ab.vcf: 88070
## KAT_7aw.vcf: 103231
## KAT_7bw.vcf: 112281
## KAT_8ab.vcf: 87510
## KAT_8aw.vcf: 102794
## KAT_8bw.vcf: 111420
## KAT_9ab.vcf: 87797
## KAT_9aw.vcf: 103062
## KAT_9bw.vcf: 111759
## SAI_12ab.vcf: 87428
## SAI_12aw.vcf: 102797
## SAI_12bw.vcf: 112888
## SAI_13ab.vcf: 87351
## SAI_13aw.vcf: 102716
## SAI_13bw.vcf: 112654
## SAI_14ab.vcf: 87155
## SAI_14aw.vcf: 102598
## SAI_14bw.vcf: 112582
## SAI_15ab.vcf: 87550
## SAI_15aw.vcf: 102946
## SAI_15bw.vcf: 113085
## SAI_16ab.vcf: 87591
## SAI_16aw.vcf: 102931
## SAI_16bw.vcf: 113002
## SAI_17ab.vcf: 87443
## SAI_17aw.vcf: 102744
## SAI_17bw.vcf: 112943
## SAI_18ab.vcf: 87646
## SAI_18aw.vcf: 103116
## SAI_18bw.vcf: 113466
## SAI_1ab.vcf: 87267
## SAI_1aw.vcf: 102757
## SAI_1bw.vcf: 112835
## SAI_2ab.vcf: 87376
## SAI_2aw.vcf: 102681
## SAI_2bw.vcf: 112683
## SAI_3ab.vcf: 87600
## SAI_3aw.vcf: 102966
## SAI_3bw.vcf: 112968
## SAI_4ab.vcf: 87492
## SAI_4aw.vcf: 102970
## SAI_4bw.vcf: 113414
## SAI_5ab.vcf: 87469
## SAI_5aw.vcf: 102862
## SAI_5bw.vcf: 112823
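The same count can be done without grep. A minimal Python sketch, using an in-memory toy vcf instead of the real files (the records below are made up for illustration):

```python
import io

def count_variants(handle):
    """Count vcf records: non-empty lines that do not start with '#'."""
    return sum(1 for line in handle if line.strip() and not line.startswith("#"))

toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t10a\t10w\n"
    "1\t100\tsnp1\tA\tC\t.\t.\t.\tGT\t0/0\t0/1\n"
    "1\t200\tsnp2\tG\tT\t.\t.\t.\tGT\t1/1\t1/1\n"
)
n_records = count_variants(toy_vcf)
print(n_records)
```

For a real file, the handle would come from `open(path)` instead of `io.StringIO`.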
Check sample names to see if our code created the vcfs with two samples
# Define directory with the VCFs
output_dir="output/wgs_vs_chip/vcfs"
# Iterate over each VCF file
for file in "${output_dir}"/*.vcf; do
# Extract the file name without the directory path
file_name=$(basename "${file}")
# Use bcftools query to retrieve the sample names
sample_names=$(bcftools query -l "${file}")
# Print the file name and the sample names
echo "${file_name}: ${sample_names}"
done
## KAT_10ab.vcf: 10a
## 10b
## KAT_10aw.vcf: 10a
## 10w
## KAT_10bw.vcf: 10b
## 10w
## KAT_11ab.vcf: 11a
## 11b
## KAT_11aw.vcf: 11a
## 11w
## KAT_11bw.vcf: 11b
## 11w
## KAT_12ab.vcf: 12a
## 12b
## KAT_12aw.vcf: 12a
## 12w
## KAT_12bw.vcf: 12b
## 12w
## KAT_7ab.vcf: 7a
## 7b
## KAT_7aw.vcf: 7a
## 7w
## KAT_7bw.vcf: 7b
## 7w
## KAT_8ab.vcf: 8a
## 8b
## KAT_8aw.vcf: 8a
## 8w
## KAT_8bw.vcf: 8b
## 8w
## KAT_9ab.vcf: 9a
## 9b
## KAT_9aw.vcf: 9a
## 9w
## KAT_9bw.vcf: 9b
## 9w
## SAI_12ab.vcf: 12a
## 12b
## SAI_12aw.vcf: 12a
## 12w
## SAI_12bw.vcf: 12b
## 12w
## SAI_13ab.vcf: 13a
## 13b
## SAI_13aw.vcf: 13a
## 13w
## SAI_13bw.vcf: 13b
## 13w
## SAI_14ab.vcf: 14a
## 14b
## SAI_14aw.vcf: 14a
## 14w
## SAI_14bw.vcf: 14b
## 14w
## SAI_15ab.vcf: 15a
## 15b
## SAI_15aw.vcf: 15a
## 15w
## SAI_15bw.vcf: 15b
## 15w
## SAI_16ab.vcf: 16a
## 16b
## SAI_16aw.vcf: 16a
## 16w
## SAI_16bw.vcf: 16b
## 16w
## SAI_17ab.vcf: 17a
## 17b
## SAI_17aw.vcf: 17a
## 17w
## SAI_17bw.vcf: 17b
## 17w
## SAI_18ab.vcf: 18a
## 18b
## SAI_18aw.vcf: 18a
## 18w
## SAI_18bw.vcf: 18b
## 18w
## SAI_1ab.vcf: 1a
## 1b
## SAI_1aw.vcf: 1a
## 1w
## SAI_1bw.vcf: 1b
## 1w
## SAI_2ab.vcf: 2a
## 2b
## SAI_2aw.vcf: 2a
## 2w
## SAI_2bw.vcf: 2b
## 2w
## SAI_3ab.vcf: 3a
## 3b
## SAI_3aw.vcf: 3a
## 3w
## SAI_3bw.vcf: 3b
## 3w
## SAI_4ab.vcf: 4a
## 4b
## SAI_4aw.vcf: 4a
## 4w
## SAI_4bw.vcf: 4b
## 4w
## SAI_5ab.vcf: 5a
## 5b
## SAI_5aw.vcf: 5a
## 5w
## SAI_5bw.vcf: 5b
## 5w
Create new directories
# Create directory for the scripts
dir.create(
here("output", "wgs_vs_chip", "scripts"),
showWarnings = FALSE,
recursive = FALSE
)
Script to compare alleles between wgs and chip or chip priors
Code summary: The provided code performs the following steps:
1. Import the necessary libraries: The code imports the required libraries: “allel”, “pandas”, “os”, and “numpy”.
2. Create an empty DataFrame: The code initializes an empty DataFrame called “output_df” to store the output results obtained from the analysis.
3. Specify the directory: The code defines the directory path where the VCF files are located using the “dir_name” variable.
4. Retrieve a list of VCF files: The code uses the “os.listdir()” function and list comprehension to create a list of all VCF files in the specified directory that end with '.vcf'.
5. Iterate over each VCF file: The code sets up a loop to iterate over each VCF file found in the previous step.
6. Construct the file path: The code constructs the full file path for the current VCF file by combining the directory path and the file name using “os.path.join()”.
7. Read the VCF file: The code reads the VCF file using “allel.read_vcf()” from the “allel” library, specifying to load all available fields ('*').
8. Extract the genotype data: The code extracts the genotype data from the VCF file using “allel.GenotypeArray(callset['calldata/GT'])”.
9. Check sample count: The code verifies that the VCF file contains two samples by checking the shape of the genotype array using the “assert” statement. If the shape doesn’t match the expected number of samples, an assertion error is raised.
10. Count total SNPs: The code determines the total number of SNPs in the genotype data by calculating the length of the genotype array using “len(gt)”.
11. Calculate counts of homozygous and heterozygous SNPs: The code uses “np.count_nonzero()” and relevant methods of the “gt” object to count the number of homozygous reference, homozygous alternate, and heterozygous SNPs for each sample.
12. Compute counts of mismatched homozygous and heterozygous SNPs: The code compares the genotypes between the two samples using “np.sum()” to calculate the counts of mismatched homozygous reference, homozygous alternate, and heterozygous SNPs.
13. Extract reference and alternative alleles: The code retrieves the reference and alternative alleles for each SNP from the VCF file.
14. Count mismatching reference and alternative alleles: The code compares the alleles between the two samples and counts the number of SNPs with mismatching reference alleles and the number of SNPs with mismatching alternative alleles.
15. Calculate counts of A, T, C, and G alleles: The code computes the counts of A, T, C, and G alleles for each sample based on the genotype data and the corresponding reference and alternative alleles.
16. Create and append result to output dataframe: The code creates a DataFrame called “result” to store the calculated statistics for the current VCF file and appends it to the “output_df” DataFrame using “pd.concat()”.
17. Repeat for each VCF file: The code repeats steps 5 to 16 for each VCF file in the directory, processing and appending the results to the “output_df” DataFrame.
18. Write the output to a CSV file: The code writes the final “output_df” DataFrame to a CSV file named 'allele_comparison_stats_2.csv' using the “to_csv()” method of pandas.
import allel
import pandas as pd
import os
import numpy as np
# Initialize the output dataframe
output_df = pd.DataFrame()
# Directory with vcf files
dir_name = "output/wgs_vs_chip/vcfs/"
# Get list of all vcf files in the directory
vcf_files = [f for f in os.listdir(dir_name) if f.endswith('.vcf')]
# Iterate over VCF files
for vcf_file in vcf_files:
    file_path = os.path.join(dir_name, vcf_file)
    callset = allel.read_vcf(file_path, fields=['*'])
    # Get genotype
    gt = allel.GenotypeArray(callset['calldata/GT'])
    # Verify the vcf contains two samples
    assert gt.shape[1] == 2, f"Expected 2 samples in {vcf_file}, found {gt.shape[1]}"
    # Count SNPs
    n_snps = len(gt)
    # Count homozygous and heterozygous SNPs for each sample
    n_homo_ref = np.count_nonzero(gt.is_hom_ref(), axis=0)
    n_homo_alt = np.count_nonzero(gt.is_hom_alt(), axis=0)
    n_hetero = np.count_nonzero(gt.is_het(), axis=0)
    # Count homozygous and heterozygous SNPs mismatches
    n_homo_ref_mismatch = np.sum(gt.is_hom_ref()[:, 0] != gt.is_hom_ref()[:, 1])
    n_homo_alt_mismatch = np.sum(gt.is_hom_alt()[:, 0] != gt.is_hom_alt()[:, 1])
    n_hetero_mismatch = np.sum(gt.is_het()[:, 0] != gt.is_het()[:, 1])
    # Get alleles
    ref_alleles = callset['variants/REF']
    alt_alleles = callset['variants/ALT'][:, 0]  # assuming bi-allelic
    # Count mismatching reference and alternative alleles
    # NOTE: gt[:, 0] and gt[:, 1] hold allele codes (0 = REF, 1 = ALT), so the
    # arrays below are indexed by allele code rather than by variant position;
    # interpret these two counts with care
    n_snps_ref_mismatch = np.count_nonzero(ref_alleles[gt[:,0]] != ref_alleles[gt[:,1]])
    n_snps_alt_mismatch = np.count_nonzero(alt_alleles[gt[:,0]] != alt_alleles[gt[:,1]])
    # Count alleles for each sample
    # NOTE: range(4) only checks the REF/ALT of the first four variants, and
    # gt == i counts allele codes, so these are not per-base totals over all SNPs
    n_a = sum(np.count_nonzero(gt == i, axis=0) for i in range(4) if ref_alleles[i] == 'A' or alt_alleles[i] == 'A')
    n_t = sum(np.count_nonzero(gt == i, axis=0) for i in range(4) if ref_alleles[i] == 'T' or alt_alleles[i] == 'T')
    n_c = sum(np.count_nonzero(gt == i, axis=0) for i in range(4) if ref_alleles[i] == 'C' or alt_alleles[i] == 'C')
    n_g = sum(np.count_nonzero(gt == i, axis=0) for i in range(4) if ref_alleles[i] == 'G' or alt_alleles[i] == 'G')
    # Append results to the output dataframe
    result = pd.DataFrame({
        'vcf_file': [file_path],
        'n_SNPs': [n_snps],
        'n_SNPs_ref_mismatch': [n_snps_ref_mismatch],
        'n_SNPs_alt_mismatch': [n_snps_alt_mismatch],
        'n_A': [n_a],
        'n_T': [n_t],
        'n_C': [n_c],
        'n_G': [n_g],
        'n_homo_ref': [n_homo_ref],
        'n_homo_alt': [n_homo_alt],
        'n_hetero': [n_hetero],
        'n_homo_ref_mismatch': [n_homo_ref_mismatch],
        'n_homo_alt_mismatch': [n_homo_alt_mismatch],
        'n_hetero_mismatch': [n_hetero_mismatch]
    })
    output_df = pd.concat([output_df, result])
# Write the result to a csv file
output_df.to_csv('output/wgs_vs_chip/allele_comparison_stats_2.csv', index=False)
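Genotype codes in a vcf (0 for REF, 1 for ALT) refer to the alleles of their own variant, so counting A, T, C and G per sample needs a lookup per variant. The sketch below shows one way to do that with toy data and the standard library only. It is not the script used to produce the results in this document, and `base_counts` is a hypothetical helper that assumes biallelic, diploid calls without missing data.

```python
from collections import Counter

def base_counts(ref, alt, genotypes):
    """Count bases per sample.

    ref, alt  : one base per variant
    genotypes : genotypes[v][s] is a pair of allele codes (0 = REF, 1 = ALT)
    """
    n_samples = len(genotypes[0])
    counts = [Counter() for _ in range(n_samples)]
    for v, calls in enumerate(genotypes):
        alleles = (ref[v], alt[v])  # bases of this particular variant
        for s, call in enumerate(calls):
            for code in call:
                counts[s][alleles[code]] += 1
    return counts

# Three toy variants, two samples
ref = ["A", "G", "C"]
alt = ["C", "T", "T"]
genotypes = [
    [(0, 0), (0, 1)],
    [(1, 1), (1, 1)],
    [(0, 1), (0, 0)],
]
counts = base_counts(ref, alt, genotypes)
print(counts[0], counts[1])
```

With diploid calls, each sample's total is twice the number of variants, which gives a simple sanity check on the counts.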
Clean env
Import the data
data <-
read_delim(
"output/wgs_vs_chip/allele_comparison_stats_2.csv",
delim = ",",
show_col_types = FALSE
)
data <-
data |>
mutate(vcf_file = str_remove(vcf_file, "output/wgs_vs_chip/vcfs/")) |>
separate(
vcf_file,
into = c("Population", "Sample_Comparison"),
sep = "_",
extra = "drop"
) |>
separate(
Sample_Comparison,
into = c("Sample", "Comparison"),
sep = "(?<=\\d)(?=[a-z])",
convert = TRUE
) |>
mutate(Comparison = str_remove(Comparison, "\\.vcf$")) |>
arrange(Comparison)
# Split the "Comparison" column into "Sample1" and "Sample2"
data <-
data |>
separate(
Comparison,
into = c("Sample1", "Sample2"),
sep = 1,
# because each comparison has two characters
remove = FALSE
) |> # keep the original comparison column
relocate(Sample1, Sample2, .after = Comparison) # move the new columns right after Comparison
cols_to_split <-
c("n_A",
"n_T",
"n_C",
"n_G",
"n_homo_ref",
"n_homo_alt",
"n_hetero")
# Remove unwanted characters from the columns
for (col_name in cols_to_split) {
data[[col_name]] <- gsub("\\[\\[|]\\n", "", data[[col_name]])
}
# Split the columns
for (col_name in cols_to_split) {
# Create new column names based on 'Sample1' and 'Sample2'
new_col_names <- paste0(col_name, "_sample", 1:2)
data <- data |>
separate(
col = all_of(col_name),
into = new_col_names,
sep = " ",
extra = "drop"
)
}
# Clean the new columns
cols_to_clean <-
grep("^n_", names(data), value = TRUE)
for (col_name in cols_to_clean) {
# Remove unwanted characters '[', ']', and '\n'
data[[col_name]] <- gsub("\\[|]|\\n", "", data[[col_name]])
}
# Split "Comparison" into "Sample1" and "Sample2" (same step as above)
data <-
data |>
separate(
col = Comparison,
into = c("Sample1", "Sample2"),
sep = 1,
remove = FALSE
) |>
relocate(Sample1, Sample2, .after = Comparison)
# Convert columns to numeric
# Specify the column names to convert to numeric
columns_to_convert <-
c(
# "Population",
"Sample",
# "Comparison",
# "Sample1",
# "Sample2",
"n_SNPs",
"n_SNPs_ref_mismatch",
"n_SNPs_alt_mismatch",
"n_A_sample1",
"n_A_sample2",
"n_T_sample1",
"n_T_sample2",
"n_C_sample1",
"n_C_sample2",
"n_G_sample1",
"n_G_sample2",
"n_homo_ref_sample1",
"n_homo_ref_sample2",
"n_homo_alt_sample1",
"n_homo_alt_sample2",
"n_hetero_sample1",
"n_hetero_sample2",
"n_homo_ref_mismatch",
"n_homo_alt_mismatch",
"n_hetero_mismatch"
)
# Convert columns to numeric
data[columns_to_convert] <-
lapply(data[columns_to_convert], function(x)
as.numeric(as.character(x)))
# Verify the column types
print(sapply(data[columns_to_convert], class))
## Sample n_SNPs n_SNPs_ref_mismatch n_SNPs_alt_mismatch
## "numeric" "numeric" "numeric" "numeric"
## n_A_sample1 n_A_sample2 n_T_sample1 n_T_sample2
## "numeric" "numeric" "numeric" "numeric"
## n_C_sample1 n_C_sample2 n_G_sample1 n_G_sample2
## "numeric" "numeric" "numeric" "numeric"
## n_homo_ref_sample1 n_homo_ref_sample2 n_homo_alt_sample1 n_homo_alt_sample2
## "numeric" "numeric" "numeric" "numeric"
## n_hetero_sample1 n_hetero_sample2 n_homo_ref_mismatch n_homo_alt_mismatch
## "numeric" "numeric" "numeric" "numeric"
## n_hetero_mismatch
## "numeric"
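The per-sample counts reach the CSV as stringified NumPy arrays, and NumPy pads numbers of different widths with extra spaces, so splitting on a single space can be fragile. A more forgiving way to read such cells is to pull out every integer with a regular expression. A minimal Python sketch, where `parse_counts` is a hypothetical helper and the numbers are made up for illustration:

```python
import re

def parse_counts(cell):
    """Return every integer found in a stringified array."""
    return [int(token) for token in re.findall(r"-?\d+", cell)]

# One value per sample, with a padding space after the bracket
per_sample = parse_counts("[ 9677 68443]")

# A nested array keeps its row order: first row, then second row
nested = parse_counts("[[10 20]\n [30 40]]")

print(per_sample, nested)
```

Checking the length of the returned list (2 for one value per sample, 4 for a 2 x 2 array) makes it clear what each cell actually contains before the columns are split.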
Now we can subset the data to have more meaningful comparisons and visualizations.
First we can compare the priors to see if it is reasonable to generate new priors using the SSTool. I am doing this first because the genotype calls for the WGS data are still running. We can test our code now and later use it for the comparisons of interest. I do not expect the new priors generated from the crosses data to work well, since that population has been in the lab for several generations and we are applying the priors to wild animals.
I created new priors using the SSTool from Thermo Fisher and the crosses data. We can compare the genotype calls obtained with each set of priors. We need to do some data tidying first.
# Filter rows containing "ab" in column "Comparison"
priors <-
data |>
filter(
Comparison == "ab"
)
# The default priors are represented as "a" (Sample1) and the new priors are represented as "b" (Sample2)
# Change column names
colnames(priors) <- gsub("sample1", "default_prior", colnames(priors))
colnames(priors) <- gsub("sample2", "new_prior", colnames(priors))
# Verify the updated column names
print(colnames(priors))
## [1] "Population" "Sample"
## [3] "Comparison" "Sample1"
## [5] "Sample2" "n_SNPs"
## [7] "n_SNPs_ref_mismatch" "n_SNPs_alt_mismatch"
## [9] "n_A_default_prior" "n_A_new_prior"
## [11] "n_T_default_prior" "n_T_new_prior"
## [13] "n_C_default_prior" "n_C_new_prior"
## [15] "n_G_default_prior" "n_G_new_prior"
## [17] "n_homo_ref_default_prior" "n_homo_ref_new_prior"
## [19] "n_homo_alt_default_prior" "n_homo_alt_new_prior"
## [21] "n_hetero_default_prior" "n_hetero_new_prior"
## [23] "n_homo_ref_mismatch" "n_homo_alt_mismatch"
## [25] "n_hetero_mismatch"
Sanity check
# Add columns with the sum of the A, T, C and G counts for the new and the default priors
priors <-
priors |>
mutate(
allele_total_new = n_A_new_prior + n_T_new_prior + n_C_new_prior + n_G_new_prior,
allele_total_default = n_A_default_prior + n_T_default_prior + n_C_default_prior + n_G_default_prior
)
# Compare the allele totals with the number of SNPs
head(priors |>
dplyr::select(Population, Sample, n_SNPs, allele_total_new, allele_total_default))
## # A tibble: 6 × 5
## Population Sample n_SNPs allele_total_new allele_total_default
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 KAT 11 87696 175392 175392
## 2 KAT 9 87797 175594 175594
## 3 SAI 12 87428 174856 174856
## 4 SAI 16 87591 175182 175182
## 5 SAI 14 87155 174310 174310
## 6 KAT 12 87242 174484 174484
The sum of A, T, C and G is twice the number of SNPs because we have two samples in each comparison. Therefore, we need to divide by 2 when calculating the differences in allele counts.
# we can calculate how many counts of each allele (A, T, C and G) we have for each prior. Let's do difference = new - default prior
priors_allele_count <-
priors |>
dplyr::select(
Population,
Sample,
n_SNPs,
n_A_default_prior,
n_A_new_prior,
n_T_default_prior,
n_T_new_prior,
n_C_default_prior,
n_C_new_prior,
n_G_default_prior,
n_G_new_prior,
) |>
mutate(
n_A_diff = (n_A_new_prior / 2 - n_A_default_prior / 2),
n_T_diff = (n_T_new_prior / 2 - n_T_default_prior / 2),
n_C_diff = (n_C_new_prior / 2 - n_C_default_prior / 2),
n_G_diff = (n_G_new_prior / 2 - n_G_default_prior / 2)
) |>
dplyr::select(Population,
Sample,
n_SNPs,
n_A_diff,
n_T_diff,
n_C_diff,
n_G_diff) |>
arrange(Population, Sample) |>
mutate(
n_A_diff = paste0(
formatC(
n_A_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_A_diff / n_SNPs) * 100, 2),
"%)"
),
n_T_diff = paste0(
formatC(
n_T_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_T_diff / n_SNPs) * 100, 2),
"%)"
),
n_C_diff = paste0(
formatC(
n_C_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_C_diff / n_SNPs) * 100, 2),
"%)"
),
n_G_diff = paste0(
formatC(
n_G_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_G_diff / n_SNPs) * 100, 2),
"%)"
)
) |>
relocate(n_C_diff, .after = n_A_diff) # move the new columns right after n_A_diff
# Convert the result to a tibble
table_result <-
as_tibble(priors_allele_count)
# Set theme if you want to use something different from the previous table
set_flextable_defaults(
font.family = "Arial",
font.size = 9,
big.mark = ",",
theme_fun = "theme_zebra" # try the themes: theme_alafoli(), theme_apa(), theme_booktabs(), theme_box(), theme_tron_legacy(), theme_tron(), theme_vader(), theme_vanilla(), theme_zebra()
)
# Then create the flextable object
flex_table <-
flextable(table_result) |>
set_caption(caption = as_paragraph(
as_chunk(
"Table 1. Differences between the default and new priors from the crosses obtained using the SSTool.",
props = fp_text_default(color = "#000000", font.size = 14)
)
),
fp_p = fp_par(text.align = "center", padding = 5))
flex_table
Population | Sample | n_SNPs | n_A_diff | n_C_diff | n_T_diff | n_G_diff |
---|---|---|---|---|---|---|
KAT | 7 | 88,070 | 5,343 (6.07%) | -5,343 (-6.07%) | 0 (0%) | 0 (0%) |
KAT | 8 | 87,510 | 6,162 (7.04%) | -6,162 (-7.04%) | 0 (0%) | 0 (0%) |
KAT | 9 | 87,797 | 4,838 (5.51%) | -4,838 (-5.51%) | 0 (0%) | 0 (0%) |
KAT | 10 | 88,082 | 5,141 (5.84%) | -5,141 (-5.84%) | 0 (0%) | 0 (0%) |
KAT | 11 | 87,696 | 6,703 (7.64%) | -6,703 (-7.64%) | 0 (0%) | 0 (0%) |
KAT | 12 | 87,242 | 4,926 (5.65%) | -4,926 (-5.65%) | 0 (0%) | 0 (0%) |
SAI | 1 | 87,267 | 10,592 (12.14%) | -10,592 (-12.14%) | 0 (0%) | 0 (0%) |
SAI | 2 | 87,376 | -10,104 (-11.56%) | 10,104 (11.56%) | -10,104 (-11.56%) | 10,104 (11.56%) |
SAI | 3 | 87,600 | 9,602 (10.96%) | -9,602 (-10.96%) | 0 (0%) | 0 (0%) |
SAI | 4 | 87,492 | 10,586 (12.1%) | -10,586 (-12.1%) | 0 (0%) | 0 (0%) |
SAI | 5 | 87,469 | 10,018 (11.45%) | -10,018 (-11.45%) | 0 (0%) | 0 (0%) |
SAI | 12 | 87,428 | 9,953 (11.38%) | -9,953 (-11.38%) | 0 (0%) | 0 (0%) |
SAI | 13 | 87,351 | 9,996 (11.44%) | -9,996 (-11.44%) | 0 (0%) | 0 (0%) |
SAI | 14 | 87,155 | 10,676 (12.25%) | -10,676 (-12.25%) | 0 (0%) | 0 (0%) |
SAI | 15 | 87,550 | 10,196 (11.65%) | -10,196 (-11.65%) | 0 (0%) | 0 (0%) |
SAI | 16 | 87,591 | 9,513 (10.86%) | -9,513 (-10.86%) | 0 (0%) | 0 (0%) |
SAI | 17 | 87,443 | 9,590 (10.97%) | -9,590 (-10.97%) | 0 (0%) | 0 (0%) |
SAI | 18 | 87,646 | 10,398 (11.86%) | -10,398 (-11.86%) | 0 (0%) | 0 (0%) |
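Each cell in the table above is a difference followed by its percentage of n_SNPs. The formatting can be reproduced with a small Python sketch, where `format_diff` is a hypothetical helper. Note that Python's `.2f` keeps trailing zeros, so it would print 12.10% where R's `round()` prints 12.1%.

```python
def format_diff(diff, n_snps):
    """Format a count as 'N (P%)', with P the percentage of n_snps."""
    return f"{diff:,.0f} ({diff / n_snps * 100:.2f}%)"

# First row of the table: KAT 7, n_SNPs = 88,070
cell_a = format_diff(5343, 88070)
cell_c = format_diff(-5343, 88070)
print(cell_a, cell_c)
```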
The main difference between the genotypes obtained with the two sets of priors is in the A and C counts, which would correspond to A/C transversions. T and G counts differ only for sample SAI 2. The problem might come from the fact that we used priors from the crosses. What I can do is run a genotype call with the entire plate that contains the samples we are comparing and generate priors for them. The SSTool requires at least 1 plate to generate new priors and we have only 18 samples. I will do that and add it to the comparisons we need to do.
Let's do a sanity check and count how many homozygous and heterozygous calls we have
# Add new columns with the sum of homozygous reference, homozygous alternate, and heterozygous calls for each prior
priors <-
priors |>
mutate(
n_hom_het_default = rowSums(
cbind(
n_homo_ref_default_prior,
n_homo_alt_default_prior,
n_hetero_default_prior
),
na.rm = TRUE
),
n_hom_het_new = rowSums(
cbind(
n_homo_ref_new_prior,
n_homo_alt_new_prior,
n_hetero_new_prior
),
na.rm = TRUE
)
)
# Compare the allele totals with the number of SNPs
head(priors |>
dplyr::select(Population, Sample, n_SNPs, n_hom_het_default, n_hom_het_new))
## # A tibble: 6 × 5
## Population Sample n_SNPs n_hom_het_default n_hom_het_new
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 KAT 11 87696 87696 87696
## 2 KAT 9 87797 78120 87391
## 3 SAI 12 87428 87428 87428
## 4 SAI 16 87591 87591 87591
## 5 SAI 14 87155 87155 87155
## 6 KAT 12 87242 77390 86829
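The total number of SNPs matches the sum of homozygous and heterozygous genotypes for four of the six samples shown; for KAT 9 and KAT 12 the sums are lower than n_SNPs under both priors. Because these are genotype counts rather than allele counts, we do not have to divide by 2 as we did for the sum of alleles.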
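The same check can be sketched outside the notebook. The snippet below is a minimal illustration in Python that uses only the six rows printed by head() above; the tuple layout and the helper name `flag_incomplete` are assumptions made for this example and are not part of the original pipeline.

```python
# Rows copied from the head() output above:
# (Population, Sample, n_SNPs, n_hom_het_default, n_hom_het_new)
rows = [
    ("KAT", 11, 87696, 87696, 87696),
    ("KAT", 9, 87797, 78120, 87391),
    ("SAI", 12, 87428, 87428, 87428),
    ("SAI", 16, 87591, 87591, 87591),
    ("SAI", 14, 87155, 87155, 87155),
    ("KAT", 12, 87242, 77390, 86829),
]


def flag_incomplete(rows):
    """Return samples whose genotype total differs from n_SNPs,
    together with the shortfall under each set of priors."""
    flagged = []
    for pop, sample, n_snps, n_default, n_new in rows:
        if n_default != n_snps or n_new != n_snps:
            flagged.append((pop, sample, n_snps - n_default, n_snps - n_new))
    return flagged


for pop, sample, short_default, short_new in flag_incomplete(rows):
    print(f"{pop} {sample}: {short_default:,} short (default), "
          f"{short_new:,} short (new)")
```

In these six rows only KAT 9 and KAT 12 are flagged. What causes the shortfall (for example, no-calls) would have to be checked against the genotype data itself.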
# We can select only one of the columns since the data are biallelic
priors_ref_alt <-
priors |>
dplyr::select(
Population,
Sample,
n_SNPs,
n_SNPs_ref_mismatch,
n_SNPs_alt_mismatch,
n_homo_ref_default_prior,
n_homo_ref_new_prior,
n_homo_ref_mismatch,
n_homo_alt_default_prior,
n_homo_alt_new_prior,
n_homo_alt_mismatch,
n_hetero_default_prior,
n_hetero_new_prior,
n_hetero_mismatch
) |>
arrange(
Population, Sample
)
# We can select or rename columns to make our table easier to understand. We can create new columns since the alt and ref allele counts are the same because the alleles are swapped when we use the new priors.
# Get the number of SNPs with the alleles swapped. Remember, for 2 mosquitoes with 10 SNPs we have 40 alleles. When we want to calculate the percentages based on the number of SNPs, we need to divide the allele counts by 2 (two alleles per SNP)
priors_ref_alt <-
priors_ref_alt |>
mutate(
alleles_swapped = n_SNPs_ref_mismatch,
hom_ref_diff = n_homo_ref_mismatch,
hom_ref_alt = n_homo_alt_mismatch,
het_diff = n_hetero_mismatch
) |>
dplyr::select(Population,
Sample,
n_SNPs,
alleles_swapped,
hom_ref_diff,
hom_ref_alt,
het_diff) |>
mutate(
alleles_swapped = paste0(
formatC(alleles_swapped, big.mark = ",", format = "d"),
" (",
round((alleles_swapped / n_SNPs) * 100, 2),
"%)"
),
hom_ref_diff = paste0(
formatC(hom_ref_diff, big.mark = ",", format = "d"),
" (",
round((hom_ref_diff / n_SNPs) * 100, 2),
"%)"
),
hom_ref_alt = paste0(
formatC(hom_ref_alt, big.mark = ",", format = "d"),
" (",
round((hom_ref_alt / n_SNPs) * 100, 2),
"%)"
),
het_diff = paste0(
formatC(het_diff, big.mark = ",", format = "d"),
" (",
round((het_diff / n_SNPs) * 100, 2),
"%)"
)
)
# Convert the summary table to a tibble
table_result <-
as_tibble(priors_ref_alt)
# Set theme if you want to use something different from the previous table
set_flextable_defaults(
font.family = "Arial",
font.size = 9,
big.mark = ",",
theme_fun = "theme_zebra" # try the themes: theme_alafoli(), theme_apa(), theme_booktabs(), theme_box(), theme_tron_legacy(), theme_tron(), theme_vader(), theme_vanilla(), theme_zebra()
)
# Then create the flextable object
flex_table <-
flextable(table_result) |>
set_caption(caption = as_paragraph(
as_chunk(
"Table 2. Number of SNPs with alleles swapped and differences in zygosity between the default priors and the new priors from the crosses.",
props = fp_text_default(color = "#000000", font.size = 14)
)
),
fp_p = fp_par(text.align = "center", padding = 5))
# Print the flextable
flex_table
Population | Sample | n_SNPs | alleles_swapped | hom_ref_diff | hom_ref_alt | het_diff |
---|---|---|---|---|---|---|
KAT | 7 | 88,070 | 1,500 (1.7%) | 718 (0.82%) | 782 (0.89%) | 1,486 (1.69%) |
KAT | 8 | 87,510 | 1,573 (1.8%) | 724 (0.83%) | 849 (0.97%) | 1,565 (1.79%) |
KAT | 9 | 87,797 | 1,500 (1.71%) | 682 (0.78%) | 818 (0.93%) | 1,486 (1.69%) |
KAT | 10 | 88,082 | 1,417 (1.61%) | 659 (0.75%) | 758 (0.86%) | 1,411 (1.6%) |
KAT | 11 | 87,696 | 1,559 (1.78%) | 651 (0.74%) | 908 (1.04%) | 1,559 (1.78%) |
KAT | 12 | 87,242 | 1,655 (1.9%) | 789 (0.9%) | 866 (0.99%) | 1,649 (1.89%) |
SAI | 1 | 87,267 | 1,702 (1.95%) | 650 (0.74%) | 1,052 (1.21%) | 1,696 (1.94%) |
SAI | 2 | 87,376 | 1,696 (1.94%) | 700 (0.8%) | 996 (1.14%) | 1,692 (1.94%) |
SAI | 3 | 87,600 | 1,530 (1.75%) | 638 (0.73%) | 892 (1.02%) | 1,522 (1.74%) |
SAI | 4 | 87,492 | 1,727 (1.97%) | 654 (0.75%) | 1,073 (1.23%) | 1,719 (1.96%) |
SAI | 5 | 87,469 | 1,592 (1.82%) | 678 (0.78%) | 914 (1.04%) | 1,584 (1.81%) |
SAI | 12 | 87,428 | 1,628 (1.86%) | 656 (0.75%) | 972 (1.11%) | 1,618 (1.85%) |
SAI | 13 | 87,351 | 1,651 (1.89%) | 695 (0.8%) | 956 (1.09%) | 1,643 (1.88%) |
SAI | 14 | 87,155 | 1,761 (2.02%) | 739 (0.85%) | 1,022 (1.17%) | 1,751 (2.01%) |
SAI | 15 | 87,550 | 1,589 (1.81%) | 641 (0.73%) | 948 (1.08%) | 1,585 (1.81%) |
SAI | 16 | 87,591 | 1,639 (1.87%) | 673 (0.77%) | 966 (1.1%) | 1,623 (1.85%) |
SAI | 17 | 87,443 | 1,577 (1.8%) | 679 (0.78%) | 898 (1.03%) | 1,563 (1.79%) |
SAI | 18 | 87,646 | 1,589 (1.81%) | 641 (0.73%) | 948 (1.08%) | 1,579 (1.8%) |
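The "count (percent%)" cells in this table come from the formatC()/round() pattern in the R code above. As a worked example, the sketch below reproduces two cells of the first row (KAT 7) in Python; the function name `count_with_percent` is hypothetical and only serves this illustration.

```python
def count_with_percent(count, n_snps):
    """Format a count as 'count (percent%)', mirroring the
    formatC()/round() pattern used in the R code."""
    percent = round(count / n_snps * 100, 2)
    # ':g' drops trailing zeros, similar to how R prints 1.70 as 1.7
    return f"{count:,} ({percent:g}%)"


# First row of the table (KAT 7), n_SNPs = 88,070
print(count_with_percent(1500, 88070))  # -> 1,500 (1.7%)
print(count_with_percent(718, 88070))   # -> 718 (0.82%)
```

The percentages are always relative to the n_SNPs of that sample, so they are comparable across rows even though n_SNPs varies slightly.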
I created new priors using the SSTool from Thermo Fisher and the crosses data. We can compare the genotype calls obtained with each set of priors. We need to do some data tidying first.
# Keep the rows where the column "Comparison" is "aw" (default priors vs. WGS)
default_wgs <-
data |>
filter(
Comparison == "aw"
)
# The default priors are represented as "a" (sample1) and the WGS data as "w" (sample2)
# Change column names
colnames(default_wgs) <- gsub("sample1", "default_prior", colnames(default_wgs))
colnames(default_wgs) <- gsub("sample2", "wgs", colnames(default_wgs))
# Verify the updated column names
print(colnames(default_wgs))
## [1] "Population" "Sample"
## [3] "Comparison" "Sample1"
## [5] "Sample2" "n_SNPs"
## [7] "n_SNPs_ref_mismatch" "n_SNPs_alt_mismatch"
## [9] "n_A_default_prior" "n_A_wgs"
## [11] "n_T_default_prior" "n_T_wgs"
## [13] "n_C_default_prior" "n_C_wgs"
## [15] "n_G_default_prior" "n_G_wgs"
## [17] "n_homo_ref_default_prior" "n_homo_ref_wgs"
## [19] "n_homo_alt_default_prior" "n_homo_alt_wgs"
## [21] "n_hetero_default_prior" "n_hetero_wgs"
## [23] "n_homo_ref_mismatch" "n_homo_alt_mismatch"
## [25] "n_hetero_mismatch"
# We can count how many times each allele (A, T, C, and G) was called
priors_allele_count_dw <-
default_wgs |>
dplyr::select(
Population,
Sample,
n_SNPs,
n_A_default_prior,
n_A_wgs,
n_T_default_prior,
n_T_wgs,
n_C_default_prior,
n_C_wgs,
n_G_default_prior,
n_G_wgs,
) |>
mutate(
n_A_diff = (n_A_wgs / 2 - n_A_default_prior / 2),
n_T_diff = (n_T_wgs / 2 - n_T_default_prior / 2),
n_C_diff = (n_C_wgs / 2 - n_C_default_prior / 2),
n_G_diff = (n_G_wgs / 2 - n_G_default_prior / 2)
) |>
dplyr::select(Population,
Sample,
n_SNPs,
n_A_diff,
n_T_diff,
n_C_diff,
n_G_diff) |>
arrange(Population, Sample) |>
mutate(
n_A_diff = paste0(
formatC(
n_A_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_A_diff / n_SNPs) * 100, 2),
"%)"
),
n_T_diff = paste0(
formatC(
n_T_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_T_diff / n_SNPs) * 100, 2),
"%)"
),
n_C_diff = paste0(
formatC(
n_C_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_C_diff / n_SNPs) * 100, 2),
"%)"
),
n_G_diff = paste0(
formatC(
n_G_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_G_diff / n_SNPs) * 100, 2),
"%)"
)
) |>
relocate(n_C_diff, .after = n_A_diff) # move n_C_diff right after n_A_diff
# Convert the summary table to a tibble
table_result <-
as_tibble(priors_allele_count_dw)
# Set theme if you want to use something different from the previous table
set_flextable_defaults(
font.family = "Arial",
font.size = 9,
big.mark = ",",
theme_fun = "theme_zebra" # try the themes: theme_alafoli(), theme_apa(), theme_booktabs(), theme_box(), theme_tron_legacy(), theme_tron(), theme_vader(), theme_vanilla(), theme_zebra()
)
# Then create the flextable object
flex_table <-
flextable(table_result) |>
set_caption(caption = as_paragraph(
as_chunk(
"Table 3. Allele count differences between the default priors and the WGS data.",
props = fp_text_default(color = "#000000", font.size = 14)
)
),
fp_p = fp_par(text.align = "center", padding = 5))
# Print the flextable
flex_table
Population | Sample | n_SNPs | n_A_diff | n_C_diff | n_T_diff | n_G_diff |
---|---|---|---|---|---|---|
KAT | 7 | 103,231 | 6,742 (6.53%) | -6,742 (-6.53%) | 0 (0%) | 0 (0%) |
KAT | 8 | 102,794 | 7,750 (7.54%) | -7,750 (-7.54%) | 0 (0%) | 0 (0%) |
KAT | 9 | 103,062 | 6,186 (6%) | -6,186 (-6%) | 0 (0%) | 0 (0%) |
KAT | 10 | 103,266 | 6,574 (6.37%) | -6,574 (-6.37%) | 0 (0%) | 0 (0%) |
KAT | 11 | 102,966 | 8,372 (8.13%) | -8,372 (-8.13%) | 0 (0%) | 0 (0%) |
KAT | 12 | 102,463 | 6,241 (6.09%) | -6,241 (-6.09%) | 0 (0%) | 0 (0%) |
SAI | 1 | 102,757 | 12,930 (12.58%) | -12,930 (-12.58%) | 0 (0%) | 0 (0%) |
SAI | 2 | 102,681 | 0 (0%) | 0 (0%) | -12,286 (-11.97%) | 12,286 (11.97%) |
SAI | 3 | 102,966 | 11,771 (11.43%) | -11,771 (-11.43%) | 0 (0%) | 0 (0%) |
SAI | 4 | 102,970 | 12,924 (12.55%) | -12,924 (-12.55%) | 0 (0%) | 0 (0%) |
SAI | 5 | 102,862 | 12,219 (11.88%) | -12,219 (-11.88%) | 0 (0%) | 0 (0%) |
SAI | 12 | 102,797 | 12,152 (11.82%) | -12,152 (-11.82%) | 0 (0%) | 0 (0%) |
SAI | 13 | 102,716 | 12,126 (11.8%) | -12,126 (-11.8%) | 0 (0%) | 0 (0%) |
SAI | 14 | 102,598 | 13,013 (12.68%) | -13,013 (-12.68%) | 0 (0%) | 0 (0%) |
SAI | 15 | 102,946 | 12,514 (12.16%) | -12,514 (-12.16%) | 0 (0%) | 0 (0%) |
SAI | 16 | 102,931 | 11,548 (11.22%) | -11,548 (-11.22%) | 0 (0%) | 0 (0%) |
SAI | 17 | 102,744 | 11,670 (11.36%) | -11,670 (-11.36%) | 0 (0%) | 0 (0%) |
SAI | 18 | 103,116 | 12,732 (12.35%) | -12,732 (-12.35%) | 0 (0%) | 0 (0%) |
The main differences between the genotypes obtained with the different priors are A/C transversions. The problem might stem from the fact that we used priors from the crosses. What I can do is run a genotype call with the entire plate that contains the samples we are comparing and generate priors for them. The SSTool requires at least one plate to generate new priors, and we have only 18 samples. I will do that and add it to the comparisons we need to make.
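To make the halving of allele counts in the n_A_diff, n_C_diff, n_T_diff, and n_G_diff columns concrete, here is a small sketch in Python. The three genotypes are hypothetical toy data, not values from the chip or WGS calls, and `allele_counts` is a helper written only for this example.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count A/C/G/T over space-separated diploid genotypes such as 'A C'."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype.split())
    return counts


# Hypothetical toy data: the same three SNPs called by the chip and by WGS
chip = ["C C", "A A", "G T"]
wgs = ["A A", "A A", "G T"]  # first SNP: C/C on the chip, A/A in WGS

chip_counts = allele_counts(chip)
wgs_counts = allele_counts(wgs)

# Same arithmetic as in the R code: halve the allele counts, then subtract
diff = {base: wgs_counts[base] / 2 - chip_counts[base] / 2 for base in "ACGT"}
print(diff)
```

A homozygous swap changes two alleles, so after halving it shows up as +1 for one base and -1 for the other, which is the mirrored pattern seen in the A and C columns. A discordance that involves a heterozygote (for example "A C" against "A A") would contribute only 0.5, so the halved values are an approximation of the number of affected SNPs rather than an exact count.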
# we can select only one of the column since it is biallelic data
priors_ref_alt_dw <-
default_wgs |>
dplyr::select(
Population,
Sample,
n_SNPs,
n_SNPs_ref_mismatch,
n_SNPs_alt_mismatch,
n_homo_ref_default_prior,
n_homo_ref_wgs,
n_homo_ref_mismatch,
n_homo_alt_default_prior,
n_homo_alt_wgs,
n_homo_alt_mismatch,
n_hetero_default_prior,
n_hetero_wgs,
n_hetero_mismatch
) |>
arrange(
Population, Sample
)
# We can select or rename columns to make our table easier to understand. We can create new columns since the alt and ref allele counts are the same because the alleles are swapped when we use the new priors.
# Set the display format to avoid scientific notation
options(scipen = 999)
# Get the number of SNPs with the alleles swapped
priors_ref_alt_dw <-
priors_ref_alt_dw |>
mutate(
alleles_swapped = n_SNPs_ref_mismatch,
hom_ref_diff = n_homo_ref_mismatch,
hom_ref_alt = n_homo_alt_mismatch,
het_diff = n_hetero_mismatch
) |>
dplyr::select(Population,
Sample,
n_SNPs,
alleles_swapped,
hom_ref_diff,
hom_ref_alt,
het_diff) |>
mutate(
alleles_swapped = paste0(
formatC(alleles_swapped, big.mark = ",", format = "d"),
" (",
round((alleles_swapped / n_SNPs) * 100, 2),
"%)"
),
hom_ref_diff = paste0(
formatC(hom_ref_diff, big.mark = ",", format = "d"),
" (",
round((hom_ref_diff / n_SNPs) * 100, 2),
"%)"
),
hom_ref_alt = paste0(
formatC(hom_ref_alt, big.mark = ",", format = "d"),
" (",
round((hom_ref_alt / n_SNPs) * 100, 2),
"%)"
),
het_diff = paste0(
formatC(het_diff, big.mark = ",", format = "d"),
" (",
round((het_diff / n_SNPs) * 100, 2),
"%)"
)
)
# Convert the summary table to a tibble
table_result <-
as_tibble(priors_ref_alt_dw)
# Set theme if you want to use something different from the previous table
set_flextable_defaults(
font.family = "Arial",
font.size = 9,
big.mark = ",",
theme_fun = "theme_zebra" # try the themes: theme_alafoli(), theme_apa(), theme_booktabs(), theme_box(), theme_tron_legacy(), theme_tron(), theme_vader(), theme_vanilla(), theme_zebra()
)
# Then create the flextable object
flex_table <-
flextable(table_result) |>
set_caption(caption = as_paragraph(
as_chunk(
"Table 4. SNPs with alleles swapped and differences in zygosity comparing the default prior and WGS data.",
props = fp_text_default(color = "#000000", font.size = 14)
)
),
fp_p = fp_par(text.align = "center", padding = 5))
# Print the flextable
flex_table
Population | Sample | n_SNPs | alleles_swapped | hom_ref_diff | hom_ref_alt | het_diff |
---|---|---|---|---|---|---|
KAT | 7 | 103,231 | 8,537 (8.27%) | 4,785 (4.64%) | 3,752 (3.63%) | 6,051 (5.86%) |
KAT | 8 | 102,794 | 9,023 (8.78%) | 4,959 (4.82%) | 4,064 (3.95%) | 6,539 (6.36%) |
KAT | 9 | 103,062 | 7,891 (7.66%) | 4,438 (4.31%) | 3,453 (3.35%) | 5,555 (5.39%) |
KAT | 10 | 103,266 | 8,573 (8.3%) | 4,844 (4.69%) | 3,729 (3.61%) | 6,025 (5.83%) |
KAT | 11 | 102,966 | 9,184 (8.92%) | 5,055 (4.91%) | 4,129 (4.01%) | 6,864 (6.67%) |
KAT | 12 | 102,463 | 8,444 (8.24%) | 4,741 (4.63%) | 3,703 (3.61%) | 5,742 (5.6%) |
SAI | 1 | 102,757 | 11,541 (11.23%) | 6,258 (6.09%) | 5,283 (5.14%) | 9,839 (9.58%) |
SAI | 2 | 102,681 | 11,628 (11.32%) | 6,322 (6.16%) | 5,306 (5.17%) | 9,788 (9.53%) |
SAI | 3 | 102,966 | 13,355 (12.97%) | 7,493 (7.28%) | 5,862 (5.69%) | 10,475 (10.17%) |
SAI | 4 | 102,970 | 14,416 (14%) | 8,084 (7.85%) | 6,332 (6.15%) | 11,322 (11%) |
SAI | 5 | 102,862 | 15,141 (14.72%) | 8,644 (8.4%) | 6,497 (6.32%) | 11,563 (11.24%) |
SAI | 12 | 102,797 | 10,711 (10.42%) | 5,842 (5.68%) | 4,869 (4.74%) | 9,165 (8.92%) |
SAI | 13 | 102,716 | 12,642 (12.31%) | 6,909 (6.73%) | 5,733 (5.58%) | 10,204 (9.93%) |
SAI | 14 | 102,598 | 12,376 (12.06%) | 6,829 (6.66%) | 5,547 (5.41%) | 10,502 (10.24%) |
SAI | 15 | 102,946 | 10,326 (10.03%) | 5,519 (5.36%) | 4,807 (4.67%) | 9,042 (8.78%) |
SAI | 16 | 102,931 | 10,644 (10.34%) | 5,822 (5.66%) | 4,822 (4.68%) | 8,882 (8.63%) |
SAI | 17 | 102,744 | 13,389 (13.03%) | 7,579 (7.38%) | 5,810 (5.65%) | 10,367 (10.09%) |
SAI | 18 | 103,116 | 12,630 (12.25%) | 7,045 (6.83%) | 5,585 (5.42%) | 10,270 (9.96%) |
I created new priors using the SSTool from Thermo Fisher and the crosses data. We can compare the genotype calls obtained with each set of priors. We need to do some data tidying first.
# Keep the rows where the column "Comparison" is "bw" (crosses' priors vs. WGS)
cross_prior_wgs <-
data |>
filter(
Comparison == "bw"
)
# The crosses' priors are represented as "b" (sample1) and the WGS data as "w" (sample2)
# Change column names
colnames(cross_prior_wgs) <- gsub("sample1", "cross_prior", colnames(cross_prior_wgs))
colnames(cross_prior_wgs) <- gsub("sample2", "wgs", colnames(cross_prior_wgs))
# Verify the updated column names
print(colnames(cross_prior_wgs))
## [1] "Population" "Sample" "Comparison"
## [4] "Sample1" "Sample2" "n_SNPs"
## [7] "n_SNPs_ref_mismatch" "n_SNPs_alt_mismatch" "n_A_cross_prior"
## [10] "n_A_wgs" "n_T_cross_prior" "n_T_wgs"
## [13] "n_C_cross_prior" "n_C_wgs" "n_G_cross_prior"
## [16] "n_G_wgs" "n_homo_ref_cross_prior" "n_homo_ref_wgs"
## [19] "n_homo_alt_cross_prior" "n_homo_alt_wgs" "n_hetero_cross_prior"
## [22] "n_hetero_wgs" "n_homo_ref_mismatch" "n_homo_alt_mismatch"
## [25] "n_hetero_mismatch"
# We can count how many times each allele (A, T, C, and G) was called
priors_allele_count_nw <-
cross_prior_wgs |>
dplyr::select(
Population,
Sample,
n_SNPs,
n_A_cross_prior,
n_A_wgs,
n_T_cross_prior,
n_T_wgs,
n_C_cross_prior,
n_C_wgs,
n_G_cross_prior,
n_G_wgs,
) |>
mutate(
n_A_diff = (n_A_wgs / 2 - n_A_cross_prior / 2),
n_T_diff = (n_T_wgs / 2 - n_T_cross_prior / 2),
n_C_diff = (n_C_wgs / 2 - n_C_cross_prior / 2),
n_G_diff = (n_G_wgs / 2 - n_G_cross_prior / 2)
) |>
dplyr::select(Population,
Sample,
n_SNPs,
n_A_diff,
n_T_diff,
n_C_diff,
n_G_diff) |>
arrange(Population, Sample) |>
mutate(
n_A_diff = paste0(
formatC(
n_A_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_A_diff / n_SNPs) * 100, 2),
"%)"
),
n_T_diff = paste0(
formatC(
n_T_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_T_diff / n_SNPs) * 100, 2),
"%)"
),
n_C_diff = paste0(
formatC(
n_C_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_C_diff / n_SNPs) * 100, 2),
"%)"
),
n_G_diff = paste0(
formatC(
n_G_diff,
big.mark = ",",
format = "f",
digits = 0
),
" (",
round((n_G_diff / n_SNPs) * 100, 2),
"%)"
)
) |>
relocate(n_C_diff, .after = n_A_diff) # move n_C_diff right after n_A_diff
# Convert the summary table to a tibble
table_result <-
as_tibble(priors_allele_count_nw)
# Set theme if you want to use something different from the previous table
set_flextable_defaults(
font.family = "Arial",
font.size = 9,
big.mark = ",",
theme_fun = "theme_zebra" # try the themes: theme_alafoli(), theme_apa(), theme_booktabs(), theme_box(), theme_tron_legacy(), theme_tron(), theme_vader(), theme_vanilla(), theme_zebra()
)
# Then create the flextable object
flex_table <-
flextable(table_result) |>
set_caption(caption = as_paragraph(
as_chunk(
"Table 5. Allele count differences between the crosses' priors and the WGS data.",
props = fp_text_default(color = "#000000", font.size = 14)
)
),
fp_p = fp_par(text.align = "center", padding = 5))
# Print the flextable
flex_table
Population | Sample | n_SNPs | n_A_diff | n_C_diff | n_T_diff | n_G_diff |
---|---|---|---|---|---|---|
KAT | 7 | 112,281 | 6,634 (5.91%) | -6,634 (-5.91%) | -6,634 (-5.91%) | 6,634 (5.91%) |
KAT | 8 | 111,420 | 7,564 (6.79%) | -7,564 (-6.79%) | -7,564 (-6.79%) | 7,564 (6.79%) |
KAT | 9 | 111,759 | 6,236 (5.58%) | -6,236 (-5.58%) | -6,236 (-5.58%) | 6,236 (5.58%) |
KAT | 10 | 112,299 | 6,488 (5.78%) | -6,488 (-5.78%) | -6,488 (-5.78%) | 6,488 (5.78%) |
KAT | 11 | 111,933 | 8,256 (7.38%) | -8,256 (-7.38%) | -8,256 (-7.38%) | 8,256 (7.38%) |
KAT | 12 | 110,802 | 6,302 (5.69%) | -6,302 (-5.69%) | -6,302 (-5.69%) | 6,302 (5.69%) |
SAI | 1 | 112,835 | 12,710 (11.26%) | -12,710 (-11.26%) | -12,710 (-11.26%) | 12,710 (11.26%) |
SAI | 2 | 112,683 | 12,124 (10.76%) | -12,124 (-10.76%) | -12,124 (-10.76%) | 12,124 (10.76%) |
SAI | 3 | 112,968 | 11,580 (10.25%) | -11,580 (-10.25%) | -11,580 (-10.25%) | 11,580 (10.25%) |
SAI | 4 | 113,414 | 12,744 (11.24%) | -12,744 (-11.24%) | -12,744 (-11.24%) | 12,744 (11.24%) |
SAI | 5 | 112,823 | 11,893 (10.54%) | -11,893 (-10.54%) | -11,893 (-10.54%) | 11,893 (10.54%) |
SAI | 12 | 112,888 | 11,938 (10.57%) | -11,938 (-10.57%) | -11,938 (-10.57%) | 11,938 (10.57%) |
SAI | 13 | 112,654 | 11,978 (10.63%) | -11,978 (-10.63%) | -11,978 (-10.63%) | 11,978 (10.63%) |
SAI | 14 | 112,582 | 12,753 (11.33%) | -12,753 (-11.33%) | -12,753 (-11.33%) | 12,753 (11.33%) |
SAI | 15 | 113,085 | 12,135 (10.73%) | -12,135 (-10.73%) | -12,135 (-10.73%) | 12,135 (10.73%) |
SAI | 16 | 113,002 | 11,524 (10.2%) | -11,524 (-10.2%) | -11,524 (-10.2%) | 11,524 (10.2%) |
SAI | 17 | 112,943 | 11,556 (10.23%) | -11,556 (-10.23%) | -11,556 (-10.23%) | 11,556 (10.23%) |
SAI | 18 | 113,466 | 12,483 (11%) | -12,483 (-11%) | -12,483 (-11%) | 12,483 (11%) |
The main differences between the genotypes obtained with the different priors are A/C transversions. The problem might stem from the fact that we used priors from the crosses. What I can do is run a genotype call with the entire plate that contains the samples we are comparing and generate priors for them. The SSTool requires at least one plate to generate new priors, and we have only 18 samples. I will do that and add it to the comparisons we need to make.
# we can select only one of the column since it is biallelic data
priors_ref_alt_nw <-
cross_prior_wgs |>
dplyr::select(
Population,
Sample,
n_SNPs,
n_SNPs_ref_mismatch,
n_SNPs_alt_mismatch,
n_homo_ref_cross_prior,
n_homo_ref_wgs,
n_homo_ref_mismatch,
n_homo_alt_cross_prior,
n_homo_alt_wgs,
n_homo_alt_mismatch,
n_hetero_cross_prior,
n_hetero_wgs,
n_hetero_mismatch
) |>
arrange(
Population, Sample
)
# We can select or rename columns to make our table easier to understand. We can create new columns since the alt and ref allele counts are the same because the alleles are swapped when we use the new priors.
# Set the display format to avoid scientific notation
options(scipen = 999)
# Get the number of SNPs with the alleles swapped. Note that, unlike the code for Tables 2 and 4, the mismatch counts are divided by 2 here
priors_ref_alt_nw <-
priors_ref_alt_nw |>
mutate(
alleles_swapped = n_SNPs_ref_mismatch / 2,
hom_ref_diff = n_homo_ref_mismatch / 2,
hom_ref_alt = n_homo_alt_mismatch / 2,
het_diff = n_hetero_mismatch / 2
) |>
dplyr::select(Population,
Sample,
n_SNPs,
alleles_swapped,
hom_ref_diff,
hom_ref_alt,
het_diff) |>
mutate(
alleles_swapped = paste0(
formatC(alleles_swapped, big.mark = ",", format = "d"),
" (",
round((alleles_swapped / n_SNPs) * 100, 2),
"%)"
),
hom_ref_diff = paste0(
formatC(hom_ref_diff, big.mark = ",", format = "d"),
" (",
round((hom_ref_diff / n_SNPs) * 100, 2),
"%)"
),
hom_ref_alt = paste0(
formatC(hom_ref_alt, big.mark = ",", format = "d"),
" (",
round((hom_ref_alt / n_SNPs) * 100, 2),
"%)"
),
het_diff = paste0(
formatC(het_diff, big.mark = ",", format = "d"),
" (",
round((het_diff / n_SNPs) * 100, 2),
"%)"
)
)
# Convert the summary table to a tibble
table_result <-
as_tibble(priors_ref_alt_nw)
# Set theme if you want to use something different from the previous table
set_flextable_defaults(
font.family = "Arial",
font.size = 9,
big.mark = ",",
theme_fun = "theme_zebra" # try the themes: theme_alafoli(), theme_apa(), theme_booktabs(), theme_box(), theme_tron_legacy(), theme_tron(), theme_vader(), theme_vanilla(), theme_zebra()
)
# Then create the flextable object
flex_table <-
flextable(table_result) |>
set_caption(caption = as_paragraph(
as_chunk(
"Table 6. SNPs with alleles swapped and differences in zygosity comparing crosses prior and WGS data.",
props = fp_text_default(color = "#000000", font.size = 14)
)
),
fp_p = fp_par(text.align = "center", padding = 5))
# Print the flextable
flex_table
Population | Sample | n_SNPs | alleles_swapped | hom_ref_diff | hom_ref_alt | het_diff |
---|---|---|---|---|---|---|
KAT | 7 | 112,281 | 6,608 (5.89%) | 3,690 (3.29%) | 2,918 (2.6%) | 4,225 (3.76%) |
KAT | 8 | 111,420 | 6,810 (6.11%) | 3,782 (3.39%) | 3,027 (2.72%) | 4,463 (4.01%) |
KAT | 9 | 111,759 | 6,221 (5.57%) | 3,447 (3.08%) | 2,773 (2.48%) | 3,924 (3.51%) |
KAT | 10 | 112,299 | 6,602 (5.88%) | 3,678 (3.28%) | 2,924 (2.6%) | 4,184 (3.73%) |
KAT | 11 | 111,933 | 6,866 (6.13%) | 3,824 (3.42%) | 3,042 (2.72%) | 4,693 (4.19%) |
KAT | 12 | 110,802 | 6,499 (5.87%) | 3,603 (3.25%) | 2,895 (2.61%) | 4,002 (3.61%) |
SAI | 1 | 112,835 | 8,107 (7.19%) | 4,770 (4.23%) | 3,337 (2.96%) | 6,555 (5.81%) |
SAI | 2 | 112,683 | 8,127 (7.21%) | 4,799 (4.26%) | 3,327 (2.95%) | 6,427 (5.7%) |
SAI | 3 | 112,968 | 8,935 (7.91%) | 5,300 (4.69%) | 3,634 (3.22%) | 6,723 (5.95%) |
SAI | 4 | 113,414 | 9,449 (8.33%) | 5,582 (4.92%) | 3,867 (3.41%) | 7,173 (6.33%) |
SAI | 5 | 112,823 | 9,759 (8.65%) | 5,783 (5.13%) | 3,975 (3.52%) | 7,225 (6.4%) |
SAI | 12 | 112,888 | 7,633 (6.76%) | 4,458 (3.95%) | 3,175 (2.81%) | 6,131 (5.43%) |
SAI | 13 | 112,654 | 8,580 (7.62%) | 4,990 (4.43%) | 3,590 (3.19%) | 6,601 (5.86%) |
SAI | 14 | 112,582 | 8,412 (7.47%) | 4,994 (4.44%) | 3,417 (3.04%) | 6,775 (6.02%) |
SAI | 15 | 113,085 | 7,529 (6.66%) | 4,392 (3.88%) | 3,137 (2.77%) | 6,150 (5.44%) |
SAI | 16 | 113,002 | 7,638 (6.76%) | 4,463 (3.95%) | 3,174 (2.81%) | 5,946 (5.26%) |
SAI | 17 | 112,943 | 8,922 (7.9%) | 5,318 (4.71%) | 3,604 (3.19%) | 6,665 (5.9%) |
SAI | 18 | 113,466 | 8,634 (7.61%) | 5,070 (4.47%) | 3,564 (3.14%) | 6,707 (5.91%) |
Now, I have to do the genotype call using the entire plate, generate new priors, and then compare the data to the WGS data set. However, I did not do any filtering. I could do some QC on the data before any comparisons, but that would decrease the total number of SNPs that I can compare.
Comparing the two priors
import allel
import pandas as pd
import os
import numpy as np
import warnings
# Ignore DtypeWarnings from pandas
warnings.filterwarnings('ignore', category=pd.errors.DtypeWarning)
# Directory with vcf files
dir_name = "output/wgs_vs_chip/vcfs/"
# Get list of all vcf files in the directory
# vcf_files = [f for f in os.listdir(dir_name) if f.endswith('.vcf')]
# Get list of all vcf files in the directory with *_ab.vcf, *_aw.vcf or *_bw.vcf
vcf_files = [f for f in os.listdir(dir_name) if f.endswith('ab.vcf')]
csv_output_files = []
# Function to convert genotype indices to alleles
def genotype_to_alleles(gt_indices, ref_allele, alt_alleles):
    alleles = np.concatenate(([ref_allele], alt_alleles))
    # An index of -1 means missing data, so it is skipped
    return " ".join(alleles[idx] for idx in gt_indices if idx != -1)

# Iterate over VCF files
for vcf_file in vcf_files:
    file_path = os.path.join(dir_name, vcf_file)
    callset = allel.read_vcf(file_path, fields=['*'])

    # Get genotype
    gt = allel.GenotypeArray(callset['calldata/GT'])

    # Verify the vcf contains two samples before unpacking the sample names
    assert gt.shape[1] == 2, f"Expected 2 samples in {vcf_file}, found {gt.shape[1]}"

    # Get sample names and add prefix from file name
    sample_1, sample_2 = callset['samples']
    prefix = vcf_file.split("_")[0] + "_"  # Added "_" after prefix
    sample_1 = prefix + sample_1
    sample_2 = prefix + sample_2

    # Create DataFrame
    # Note: calls that are neither hom_ref nor hom_alt (including missing calls) are labelled 'het'
    df = pd.DataFrame({
        'SNP_id': callset['variants/ID'],
        f'{sample_1}_geno': [genotype_to_alleles(call, callset['variants/REF'][i], callset['variants/ALT'][i]) for i, call in enumerate(gt[:, 0])],
        f'{sample_2}_geno': [genotype_to_alleles(call, callset['variants/REF'][i], callset['variants/ALT'][i]) for i, call in enumerate(gt[:, 1])],
        f'{sample_1}_{sample_2}_gcomp': np.where(gt[:, 0] == gt[:, 1], 'match', 'mismatch').tolist(),
        f'{sample_1}_zygo': np.where(gt.is_hom_ref()[:, 0], 'hom_ref', np.where(gt.is_hom_alt()[:, 0], 'hom_alt', 'het')).tolist(),
        f'{sample_2}_zygo': np.where(gt.is_hom_ref()[:, 1], 'hom_ref', np.where(gt.is_hom_alt()[:, 1], 'hom_alt', 'het')).tolist(),
        f'{sample_1}_{sample_2}_zcomp': np.where(gt.is_hom()[:, 0] == gt.is_hom()[:, 1], 'match', 'mismatch').tolist()
    })

    # Change the suffix here when you change the vcfs you are analyzing
    output_file = f'output/wgs_vs_chip/{os.path.basename(vcf_file).replace(".vcf", "")}_comparison_ab.csv'
    df.to_csv(output_file, index=False)
    csv_output_files.append(output_file)
# Combine the per-sample CSVs into one
# Get the directory path where your files are located
dir_path = "output/wgs_vs_chip/"
# Get list of all CSV files in the directory that end with '_ab.csv',
# skipping the combined file from a previous run (its name also ends with '_ab.csv')
combined_name = 'combined_comparison_ab.csv'
csv_files = [os.path.join(dir_path, f) for f in os.listdir(dir_path)
             if f.endswith('_ab.csv') and f != combined_name]
# Ensure that we have at least one such file
if not csv_files:
    raise ValueError("No CSV files found matching '_ab.csv'")
# Load the first CSV file
combined_csv = pd.read_csv(csv_files[0])
# Merge the rest of the CSV files one by one
for f in csv_files[1:]:
    df = pd.read_csv(f)
    combined_csv = pd.merge(combined_csv, df, on='SNP_id', how='outer')
combined_csv.to_csv(os.path.join(dir_path, combined_name), index=False)
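One detail of the script above is easy to miss. Assuming the `==` between the two genotype columns is evaluated allele by allele (as it is for NumPy arrays), each `_gcomp` cell holds a two-element list rather than a single label. The sketch below imitates that behaviour with plain Python; `compare_alleles` is a hypothetical helper and the allele indices are toy values.

```python
def compare_alleles(call_1, call_2):
    """Compare two diploid calls allele by allele (hypothetical helper)."""
    return ["match" if a == b else "mismatch" for a, b in zip(call_1, call_2)]


# Genotypes as allele indices: 0 = REF, 1 = first ALT (toy values)
same = compare_alleles((0, 1), (0, 1))
differs = compare_alleles((0, 1), (0, 0))

# A list written into a CSV cell is stored as its string form
cell = str(differs)
print(same, differs, cell)
```

If this reading is right, it would explain why the R code that imports the combined CSV splits every `_gcomp` column on the comma and strips brackets and quotes before naming the two halves `_REF` and `_ALT`.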
Compare the reference and alternative allele between the two priors
Import the data and use “Tidyverse” to change column names. I kept both versions of the code (Tidyverse and data.table) so that we can compare the outputs and make sure they create the same object.
data_ab <-
read_delim(
"output/wgs_vs_chip/combined_comparison_ab.csv",
delim = ",",
show_col_types = FALSE
)
# Get all column names that end with '_gcomp'
gcomp_cols <- grep("_gcomp$", names(data_ab), value = TRUE)
# Iterate over those column names and for each, create new _ref and _alt columns
for (col in gcomp_cols) {
  data_ab <- data_ab |>
    # all_of() makes it explicit that col holds a column name stored in a variable
    separate(all_of(col), into = c(paste0(col, "_ref"), paste0(col, "_alt")), sep = ",") |>
    mutate(across(
      starts_with(paste0(col, "_")),
      ~ str_replace_all(., "\\[|\\]|'|[:space:]", "")
    ))
}
# Renaming columns to match the reference and alternative alleles
data_ab <-
data_ab |>
dplyr::rename_with(~ str_replace_all(., "_gcomp_alt$", "_ALT"),
ends_with("_gcomp_alt")) |>
dplyr::rename_with(~ str_replace_all(., "_gcomp_ref$", "_REF"),
ends_with("_gcomp_ref"))
# Now we can count how many times each SNP had errors within the 18 samples
# Check output
# data_ab is a tibble, so we select columns with dplyr instead of the data.table argument with = FALSE
data_ab |>
  dplyr::select(SNP_id, matches("_REF$|_ALT$")) |>
  head()
## # A tibble: 6 × 37
## SNP_id KAT_9a_KAT_9b_REF KAT_9a_KAT_9b_ALT SAI_15a_SAI_15b_REF
## <chr> <chr> <chr> <chr>
## 1 AX-581444870 match match match
## 2 AX-583035067 match match match
## 3 AX-583033342 match match match
## 4 AX-583035163 match match match
## 5 AX-583035194 match match match
## 6 AX-583033387 match match match
## # ℹ 33 more variables: SAI_15a_SAI_15b_ALT <chr>, SAI_3a_SAI_3b_REF <chr>,
## # SAI_3a_SAI_3b_ALT <chr>, KAT_12a_KAT_12b_REF <chr>,
## # KAT_12a_KAT_12b_ALT <chr>, KAT_7a_KAT_7b_REF <chr>,
## # KAT_7a_KAT_7b_ALT <chr>, SAI_2a_SAI_2b_REF <chr>, SAI_2a_SAI_2b_ALT <chr>,
## # SAI_14a_SAI_14b_REF <chr>, SAI_14a_SAI_14b_ALT <chr>,
## # KAT_8a_KAT_8b_REF <chr>, KAT_8a_KAT_8b_ALT <chr>,
## # SAI_13a_SAI_13b_REF <chr>, SAI_13a_SAI_13b_ALT <chr>, …
Import the data and use “data.table” to change column names.
# Read the file with fread() function which is faster than read_delim()
data_ab_dt <-
fread(
here(
"output",
"wgs_vs_chip",
"combined_comparison_ab.csv"
)
)
# Get all column names that end with '_gcomp'
gcomp_cols <- grep("_gcomp$", names(data_ab_dt), value = TRUE)
# Convert data.frame to data.table
setDT(data_ab_dt)
# Iterate over those column names and for each, create new _REF and _ALT columns
for (col in gcomp_cols) {
# Split each '_gcomp' column into '_REF' and '_ALT'
data_ab_dt[, c(paste0(col, "_REF"), paste0(col, "_ALT")) := tstrsplit(get(col), ", ", fixed=TRUE)]
# Remove unwanted characters from each new column
data_ab_dt[, (paste0(col, "_REF")) := gsub("\\[|\\]|'", "", get(paste0(col, "_REF")))]
data_ab_dt[, (paste0(col, "_ALT")) := gsub("\\[|\\]|'", "", get(paste0(col, "_ALT")))]
}
# Renaming columns to remove _gcomp
new_names <- names(data_ab_dt)
new_names <- gsub("_gcomp_ALT$", "_ALT", new_names)
new_names <- gsub("_gcomp_REF$", "_REF", new_names)
setnames(data_ab_dt, new_names)
# Select and display only columns that match the criteria
head(data_ab_dt[, c("SNP_id", names(data_ab_dt)[grepl("_REF$|_ALT$", names(data_ab_dt))]), with = FALSE])
## SNP_id KAT_9a_KAT_9b_REF KAT_9a_KAT_9b_ALT SAI_15a_SAI_15b_REF
## 1: AX-581444870 match match match
## 2: AX-583035067 match match match
## 3: AX-583033342 match match match
## 4: AX-583035163 match match match
## 5: AX-583035194 match match match
## 6: AX-583033387 match match match
## SAI_15a_SAI_15b_ALT SAI_3a_SAI_3b_REF SAI_3a_SAI_3b_ALT KAT_12a_KAT_12b_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_12a_KAT_12b_ALT KAT_7a_KAT_7b_REF KAT_7a_KAT_7b_ALT SAI_2a_SAI_2b_REF
## 1: match match match <NA>
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_2a_SAI_2b_ALT SAI_14a_SAI_14b_REF SAI_14a_SAI_14b_ALT KAT_8a_KAT_8b_REF
## 1: <NA> match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_8a_KAT_8b_ALT SAI_13a_SAI_13b_REF SAI_13a_SAI_13b_ALT SAI_5a_SAI_5b_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_5a_SAI_5b_ALT SAI_18a_SAI_18b_REF SAI_18a_SAI_18b_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_10a_KAT_10b_REF KAT_10a_KAT_10b_ALT SAI_1a_SAI_1b_REF SAI_1a_SAI_1b_ALT
## 1: match match <NA> <NA>
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_17a_SAI_17b_REF SAI_17a_SAI_17b_ALT SAI_4a_SAI_4b_REF SAI_4a_SAI_4b_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_12a_SAI_12b_REF SAI_12a_SAI_12b_ALT KAT_11a_KAT_11b_REF
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_11a_KAT_11b_ALT SAI_16a_SAI_16b_REF SAI_16a_SAI_16b_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
Compare one sample to check that the counts of “match” and “mismatch” are correct, using either “Tidyverse” or “data.table”
##
## match mismatch
## 87506 546
##
## match mismatch
## 87506 546
We can also count NAs
##
## match mismatch <NA>
## 87506 546 2490
##
## match mismatch <NA>
## 87506 546 2490
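The check above can be sketched in base R on toy data. The vector `zcomp` below is made up; in the analysis it would be a single `_zcomp` column. `table()` drops `NA` by default, which is why the first pair of outputs has no `<NA>` column, and with the missing calls included the three counts add up to the 90,542 SNPs tested.

```r
# Toy comparison column; NA marks a SNP with no call for this sample pair
zcomp <- c("match", "match", "mismatch", NA, "match", NA)

# Default behaviour: NA values are silently dropped from the tally
counts <- table(zcomp)

# useNA = "ifany" adds an <NA> bucket when missing values are present
counts_na <- table(zcomp, useNA = "ifany")

print(counts)
print(counts_na)

# The totals differ by exactly the number of missing values
sum(counts_na) - sum(counts)
```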
The main difference between the two objects is that we kept the original columns in data_ab_dt but not in data_ab. This is not important, but it lets us inspect the data for inconsistencies in our code.
Now we can change our code to get all the metrics we want. We have many columns in our data, so we first check the column names
## [1] "SNP_id" "KAT_9a_geno" "KAT_9b_geno"
## [4] "KAT_9a_KAT_9b_gcomp" "KAT_9a_zygo" "KAT_9b_zygo"
## [7] "KAT_9a_KAT_9b_zcomp" "SAI_15a_geno" "SAI_15b_geno"
## [10] "SAI_15a_SAI_15b_gcomp" "SAI_15a_zygo" "SAI_15b_zygo"
## [13] "SAI_15a_SAI_15b_zcomp" "SAI_3a_geno" "SAI_3b_geno"
## [16] "SAI_3a_SAI_3b_gcomp" "SAI_3a_zygo" "SAI_3b_zygo"
## [19] "SAI_3a_SAI_3b_zcomp" "KAT_12a_geno" "KAT_12b_geno"
## [22] "KAT_12a_KAT_12b_gcomp" "KAT_12a_zygo" "KAT_12b_zygo"
## [25] "KAT_12a_KAT_12b_zcomp" "KAT_7a_geno" "KAT_7b_geno"
## [28] "KAT_7a_KAT_7b_gcomp" "KAT_7a_zygo" "KAT_7b_zygo"
## [31] "KAT_7a_KAT_7b_zcomp" "SAI_2a_geno" "SAI_2b_geno"
## [34] "SAI_2a_SAI_2b_gcomp" "SAI_2a_zygo" "SAI_2b_zygo"
## [37] "SAI_2a_SAI_2b_zcomp" "SAI_14a_geno" "SAI_14b_geno"
## [40] "SAI_14a_SAI_14b_gcomp" "SAI_14a_zygo" "SAI_14b_zygo"
## [43] "SAI_14a_SAI_14b_zcomp" "KAT_8a_geno" "KAT_8b_geno"
## [46] "KAT_8a_KAT_8b_gcomp" "KAT_8a_zygo" "KAT_8b_zygo"
## [49] "KAT_8a_KAT_8b_zcomp" "SAI_13a_geno" "SAI_13b_geno"
## [52] "SAI_13a_SAI_13b_gcomp" "SAI_13a_zygo" "SAI_13b_zygo"
## [55] "SAI_13a_SAI_13b_zcomp" "SAI_5a_geno" "SAI_5b_geno"
## [58] "SAI_5a_SAI_5b_gcomp" "SAI_5a_zygo" "SAI_5b_zygo"
## [61] "SAI_5a_SAI_5b_zcomp" "SAI_18a_geno" "SAI_18b_geno"
## [64] "SAI_18a_SAI_18b_gcomp" "SAI_18a_zygo" "SAI_18b_zygo"
## [67] "SAI_18a_SAI_18b_zcomp" "KAT_10a_geno" "KAT_10b_geno"
## [70] "KAT_10a_KAT_10b_gcomp" "KAT_10a_zygo" "KAT_10b_zygo"
## [73] "KAT_10a_KAT_10b_zcomp" "SAI_1a_geno" "SAI_1b_geno"
## [76] "SAI_1a_SAI_1b_gcomp" "SAI_1a_zygo" "SAI_1b_zygo"
## [79] "SAI_1a_SAI_1b_zcomp" "SAI_17a_geno" "SAI_17b_geno"
## [82] "SAI_17a_SAI_17b_gcomp" "SAI_17a_zygo" "SAI_17b_zygo"
## [85] "SAI_17a_SAI_17b_zcomp" "SAI_4a_geno" "SAI_4b_geno"
## [88] "SAI_4a_SAI_4b_gcomp" "SAI_4a_zygo" "SAI_4b_zygo"
## [91] "SAI_4a_SAI_4b_zcomp" "SAI_12a_geno" "SAI_12b_geno"
## [94] "SAI_12a_SAI_12b_gcomp" "SAI_12a_zygo" "SAI_12b_zygo"
## [97] "SAI_12a_SAI_12b_zcomp" "KAT_11a_geno" "KAT_11b_geno"
## [100] "KAT_11a_KAT_11b_gcomp" "KAT_11a_zygo" "KAT_11b_zygo"
## [103] "KAT_11a_KAT_11b_zcomp" "SAI_16a_geno" "SAI_16b_geno"
## [106] "SAI_16a_SAI_16b_gcomp" "SAI_16a_zygo" "SAI_16b_zygo"
## [109] "SAI_16a_SAI_16b_zcomp" "KAT_9a_KAT_9b_REF" "KAT_9a_KAT_9b_ALT"
## [112] "SAI_15a_SAI_15b_REF" "SAI_15a_SAI_15b_ALT" "SAI_3a_SAI_3b_REF"
## [115] "SAI_3a_SAI_3b_ALT" "KAT_12a_KAT_12b_REF" "KAT_12a_KAT_12b_ALT"
## [118] "KAT_7a_KAT_7b_REF" "KAT_7a_KAT_7b_ALT" "SAI_2a_SAI_2b_REF"
## [121] "SAI_2a_SAI_2b_ALT" "SAI_14a_SAI_14b_REF" "SAI_14a_SAI_14b_ALT"
## [124] "KAT_8a_KAT_8b_REF" "KAT_8a_KAT_8b_ALT" "SAI_13a_SAI_13b_REF"
## [127] "SAI_13a_SAI_13b_ALT" "SAI_5a_SAI_5b_REF" "SAI_5a_SAI_5b_ALT"
## [130] "SAI_18a_SAI_18b_REF" "SAI_18a_SAI_18b_ALT" "KAT_10a_KAT_10b_REF"
## [133] "KAT_10a_KAT_10b_ALT" "SAI_1a_SAI_1b_REF" "SAI_1a_SAI_1b_ALT"
## [136] "SAI_17a_SAI_17b_REF" "SAI_17a_SAI_17b_ALT" "SAI_4a_SAI_4b_REF"
## [139] "SAI_4a_SAI_4b_ALT" "SAI_12a_SAI_12b_REF" "SAI_12a_SAI_12b_ALT"
## [142] "KAT_11a_KAT_11b_REF" "KAT_11a_KAT_11b_ALT" "SAI_16a_SAI_16b_REF"
## [145] "SAI_16a_SAI_16b_ALT"
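The 145 names follow a regular pattern, so the total can be checked with simple arithmetic: one ID column, six columns per sample pair (two genotypes, `_gcomp`, two zygosities, `_zcomp`), and two allele comparison columns per pair (`_REF`, `_ALT`). The sketch below is a base R illustration; `toy_names` is a made-up vector holding the names of a single pair only.

```r
n_pairs <- 18         # sample pairs (a and b calls of the same sample)
per_pair_main <- 6    # a_geno, b_geno, gcomp, a_zygo, b_zygo, zcomp
per_pair_allele <- 2  # _REF and _ALT comparison columns

# 1 ID column + 18 * 6 + 18 * 2
expected_cols <- 1 + n_pairs * per_pair_main + n_pairs * per_pair_allele
expected_cols

# The same tally from the names themselves, shown for one pair
toy_names <- c(
  "SNP_id",
  "KAT_9a_geno", "KAT_9b_geno", "KAT_9a_KAT_9b_gcomp",
  "KAT_9a_zygo", "KAT_9b_zygo", "KAT_9a_KAT_9b_zcomp",
  "KAT_9a_KAT_9b_REF", "KAT_9a_KAT_9b_ALT"
)

# Keep only the text after the last underscore, then count each suffix
suffix <- sub("^.*_", "", toy_names[-1])
suffix_counts <- table(suffix)
print(suffix_counts)
```

On the full table, the same `sub()` and `table()` calls applied to the real column names should give 18 for each comparison suffix and 36 for `geno` and `zygo`, if every pair is complete.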
Check the data
## Rows: 90,542
## Columns: 145
## $ SNP_id <chr> "AX-581444870", "AX-583035067", "AX-583033342", …
## $ KAT_9a_geno <chr> "T T", "A T", "G C", "G G", "G G", "T C", "T T",…
## $ KAT_9b_geno <chr> "T T", "A T", "G C", "G G", "G G", "T C", "T T",…
## $ KAT_9a_KAT_9b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ KAT_9a_zygo <chr> "hom_ref", "het", "het", "hom_ref", "hom_ref", "…
## $ KAT_9b_zygo <chr> "hom_ref", "het", "het", "hom_ref", "hom_ref", "…
## $ KAT_9a_KAT_9b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_15a_geno <chr> "T T", "A A", "G G", "G A", "G A", "T T", "T T",…
## $ SAI_15b_geno <chr> "T T", "A A", "G G", "G A", "G A", "T T", "T T",…
## $ SAI_15a_SAI_15b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_15a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "het", "het", "…
## $ SAI_15b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "het", "het", "…
## $ SAI_15a_SAI_15b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_3a_geno <chr> "T T", "A A", "G G", "G G", "G A", "T T", "T T",…
## $ SAI_3b_geno <chr> "T T", "A A", "G G", "G G", "G A", "T T", "T T",…
## $ SAI_3a_SAI_3b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_3a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_3b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_3a_SAI_3b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_12a_geno <chr> "T T", "A A", "G C", "G G", "G A", "T T", "T T",…
## $ KAT_12b_geno <chr> "T T", "A A", "G C", "G G", "G A", "T T", "T T",…
## $ KAT_12a_KAT_12b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ KAT_12a_zygo <chr> "hom_ref", "hom_ref", "het", "hom_ref", "het", "…
## $ KAT_12b_zygo <chr> "hom_ref", "hom_ref", "het", "hom_ref", "het", "…
## $ KAT_12a_KAT_12b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_7a_geno <chr> "T T", "A T", "G C", "G G", "G G", "T T", "T T",…
## $ KAT_7b_geno <chr> "T T", "A T", "G C", "G G", "G G", "T T", "T T",…
## $ KAT_7a_KAT_7b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ KAT_7a_zygo <chr> "hom_ref", "het", "het", "hom_ref", "hom_ref", "…
## $ KAT_7b_zygo <chr> "hom_ref", "het", "het", "hom_ref", "hom_ref", "…
## $ KAT_7a_KAT_7b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_2a_geno <chr> "", "A A", "G G", "G A", "G G", "T T", "T T", "C…
## $ SAI_2b_geno <chr> "", "A A", "G G", "G A", "G G", "T T", "T T", "C…
## $ SAI_2a_SAI_2b_gcomp <chr> "", "['match', 'match']", "['match', 'match']", …
## $ SAI_2a_zygo <chr> "", "hom_ref", "hom_ref", "het", "hom_ref", "hom…
## $ SAI_2b_zygo <chr> "", "hom_ref", "hom_ref", "het", "hom_ref", "hom…
## $ SAI_2a_SAI_2b_zcomp <chr> "", "match", "match", "match", "match", "match",…
## $ SAI_14a_geno <chr> "T T", "A A", "G G", "G G", "G G", "T T", "T T",…
## $ SAI_14b_geno <chr> "T T", "A A", "G G", "G G", "G G", "T T", "T T",…
## $ SAI_14a_SAI_14b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_14a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "hom…
## $ SAI_14b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "hom…
## $ SAI_14a_SAI_14b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_8a_geno <chr> "T T", "A A", "G C", "G G", "G A", "T T", "T T",…
## $ KAT_8b_geno <chr> "T T", "A A", "G C", "G G", "G A", "T T", "T T",…
## $ KAT_8a_KAT_8b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ KAT_8a_zygo <chr> "hom_ref", "hom_ref", "het", "hom_ref", "het", "…
## $ KAT_8b_zygo <chr> "hom_ref", "hom_ref", "het", "hom_ref", "het", "…
## $ KAT_8a_KAT_8b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_13a_geno <chr> "T T", "A A", "G G", "A A", "G G", "T T", "T T",…
## $ SAI_13b_geno <chr> "T T", "A A", "G G", "A A", "G G", "T T", "T T",…
## $ SAI_13a_SAI_13b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_13a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_alt", "hom…
## $ SAI_13b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_alt", "hom…
## $ SAI_13a_SAI_13b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_5a_geno <chr> "T T", "A A", "G G", "G G", "G A", "T C", "T T",…
## $ SAI_5b_geno <chr> "T T", "A A", "G G", "G G", "G A", "T C", "T T",…
## $ SAI_5a_SAI_5b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_5a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_5b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_5a_SAI_5b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_18a_geno <chr> "T T", "A A", "G G", "G A", "G A", "T T", "T T",…
## $ SAI_18b_geno <chr> "T T", "A A", "G G", "G A", "G A", "T T", "T T",…
## $ SAI_18a_SAI_18b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_18a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "het", "het", "…
## $ SAI_18b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "het", "het", "…
## $ SAI_18a_SAI_18b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_10a_geno <chr> "T T", "A T", "C C", "G G", "G G", "T T", "T T",…
## $ KAT_10b_geno <chr> "T T", "A T", "C C", "G G", "G G", "T T", "T T",…
## $ KAT_10a_KAT_10b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ KAT_10a_zygo <chr> "hom_ref", "het", "hom_alt", "hom_ref", "hom_ref…
## $ KAT_10b_zygo <chr> "hom_ref", "het", "hom_alt", "hom_ref", "hom_ref…
## $ KAT_10a_KAT_10b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_1a_geno <chr> "", "A A", "G G", "G G", "G A", "T T", "T T", "C…
## $ SAI_1b_geno <chr> "", "A A", "G G", "G G", "G A", "T T", "T T", "C…
## $ SAI_1a_SAI_1b_gcomp <chr> "", "['match', 'match']", "['match', 'match']", …
## $ SAI_1a_zygo <chr> "", "hom_ref", "hom_ref", "hom_ref", "het", "hom…
## $ SAI_1b_zygo <chr> "", "hom_ref", "hom_ref", "hom_ref", "het", "hom…
## $ SAI_1a_SAI_1b_zcomp <chr> "", "match", "match", "match", "match", "match",…
## $ SAI_17a_geno <chr> "T T", "A A", "G G", "G A", "G A", "T T", "T T",…
## $ SAI_17b_geno <chr> "T T", "A A", "G G", "G A", "G A", "T T", "T T",…
## $ SAI_17a_SAI_17b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_17a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "het", "het", "…
## $ SAI_17b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "het", "het", "…
## $ SAI_17a_SAI_17b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_4a_geno <chr> "T T", "A A", "G G", "G G", "G A", "C C", "T C",…
## $ SAI_4b_geno <chr> "T T", "A A", "G G", "G G", "G A", "C C", "T C",…
## $ SAI_4a_SAI_4b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_4a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_4b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_4a_SAI_4b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_12a_geno <chr> "T T", "A A", "G G", "G G", "G A", "T T", "T T",…
## $ SAI_12b_geno <chr> "T T", "A A", "G G", "G G", "G A", "T T", "T T",…
## $ SAI_12a_SAI_12b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_12a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_12b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_12a_SAI_12b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_11a_geno <chr> "T T", "A T", "C C", "G G", "G G", "T T", "T T",…
## $ KAT_11b_geno <chr> "T T", "A T", "C C", "G G", "G G", "T T", "T T",…
## $ KAT_11a_KAT_11b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ KAT_11a_zygo <chr> "hom_ref", "het", "hom_alt", "hom_ref", "hom_ref…
## $ KAT_11b_zygo <chr> "hom_ref", "het", "hom_alt", "hom_ref", "hom_ref…
## $ KAT_11a_KAT_11b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_16a_geno <chr> "T T", "A A", "G G", "G G", "G A", "T T", "T T",…
## $ SAI_16b_geno <chr> "T T", "A A", "G G", "G G", "G A", "T T", "T T",…
## $ SAI_16a_SAI_16b_gcomp <chr> "['match', 'match']", "['match', 'match']", "['m…
## $ SAI_16a_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_16b_zygo <chr> "hom_ref", "hom_ref", "hom_ref", "hom_ref", "het…
## $ SAI_16a_SAI_16b_zcomp <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_9a_KAT_9b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_9a_KAT_9b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_15a_SAI_15b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_15a_SAI_15b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_3a_SAI_3b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_3a_SAI_3b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_12a_KAT_12b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_12a_KAT_12b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_7a_KAT_7b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_7a_KAT_7b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_2a_SAI_2b_REF <chr> NA, "match", "match", "match", "match", "match",…
## $ SAI_2a_SAI_2b_ALT <chr> NA, "match", "match", "match", "match", "match",…
## $ SAI_14a_SAI_14b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_14a_SAI_14b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_8a_KAT_8b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_8a_KAT_8b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_13a_SAI_13b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_13a_SAI_13b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_5a_SAI_5b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_5a_SAI_5b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_18a_SAI_18b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_18a_SAI_18b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_10a_KAT_10b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_10a_KAT_10b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_1a_SAI_1b_REF <chr> NA, "match", "match", "match", "match", "match",…
## $ SAI_1a_SAI_1b_ALT <chr> NA, "match", "match", "match", "match", "match",…
## $ SAI_17a_SAI_17b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_17a_SAI_17b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_4a_SAI_4b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_4a_SAI_4b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_12a_SAI_12b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_12a_SAI_12b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_11a_KAT_11b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ KAT_11a_KAT_11b_ALT <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_16a_SAI_16b_REF <chr> "match", "match", "match", "match", "match", "ma…
## $ SAI_16a_SAI_16b_ALT <chr> "match", "match", "match", "match", "match", "ma…
Although we have 145 columns, they all describe the comparison of each sample using different priors or genotyping technology. First, we have the genotypes of each sample, for example “KAT_11a_geno” and “KAT_11b_geno”. These columns hold the called genotype of the sample. Here “a” and “b” refer to the two priors we are comparing (a - default prior and b - new prior from the crosses). Later, I will compare the default prior with the plate prior (I will create a prior using the plate that had the 18 samples we are comparing).
The columns ending in “_REF” and “_ALT” hold the comparison of the reference and alternative alleles. The values in these columns are “match” and “mismatch”. Later we can summarize the data by counting the strings “match” and “mismatch” across the 18 samples. Or, if we are curious, we can even compare the two populations.
The remaining columns are about the zygosity of each sample, for example “KAT_11a_zygo”, “KAT_11b_zygo”, and “KAT_11a_KAT_11b_zcomp”. The values in the first two columns are “hom_ref”, “hom_alt”, or “het”. The values in the “_zcomp” column are “match” or “mismatch”, the result of comparing the zygosity in the two columns before it.
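How the `_zygo` and `_zcomp` columns were built is not shown in this section, so the following is only a hypothetical base R sketch of the logic described above. The function name `classify_zygosity()` and the `ref` and `alt` vectors are assumptions made for illustration; genotypes are assumed to be two alleles separated by a space, with an empty string for a missing call.

```r
# Hypothetical helper: classify "A B" genotype strings against REF/ALT alleles
classify_zygosity <- function(geno, ref, alt) {
  alleles <- strsplit(geno, " ", fixed = TRUE)
  vapply(seq_along(alleles), function(i) {
    a <- alleles[[i]]
    # Missing or malformed calls get NA
    if (length(a) != 2 || anyNA(a)) return(NA_character_)
    if (a[1] != a[2]) return("het")
    if (a[1] == ref[i]) return("hom_ref")
    if (a[1] == alt[i]) return("hom_alt")
    NA_character_
  }, character(1))
}

# Toy data: four SNPs, the last one without a call
ref <- c("T", "A", "G", "G")
alt <- c("C", "T", "C", "A")
geno_a <- c("T T", "A T", "C C", "")
geno_b <- c("T T", "A A", "C C", "")

zygo_a <- classify_zygosity(geno_a, ref, alt)
zygo_b <- classify_zygosity(geno_b, ref, alt)

# Compare zygosity between the two calls; NA propagates for missing calls
zcomp <- ifelse(zygo_a == zygo_b, "match", "mismatch")
print(data.frame(geno_a, geno_b, zygo_a, zygo_b, zcomp))
```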
We can create new columns that count the matches and mismatches across all the samples.
# Convert your data to a data.table (it is already)
setDT(data_ab_dt)
# Create columns for match and mismatch count for columns ending with _REF
cols_REF <-
grep("_REF$", names(data_ab_dt), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_ab_dt[, c("REF_match_count", "REF_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_REF]
# Create columns for match and mismatch count for columns ending with _ALT
cols_ALT <-
grep("_ALT$", names(data_ab_dt), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_ab_dt[, c("ALT_match_count", "ALT_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_ALT]
# Create columns for match and mismatch count for columns ending with _zcomp
cols_Zigo <-
grep("_zcomp$", names(data_ab_dt), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_ab_dt[, c("Zigo_match_count", "Zigo_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_Zigo]
# Now, you can summarize this for each SNP_id
summary_18_samples <-
data_ab_dt[, .(
REF_match = sum(REF_match_count, na.rm = TRUE),
REF_mismatch = sum(REF_mismatch_count, na.rm = TRUE),
ALT_match = sum(ALT_match_count, na.rm = TRUE),
ALT_mismatch = sum(ALT_mismatch_count, na.rm = TRUE),
Zigo_match = sum(Zigo_match_count, na.rm = TRUE),
Zigo_mismatch = sum(Zigo_mismatch_count, na.rm = TRUE)
),
by = SNP_id]
# Sort data by SNP_id
setorder(summary_18_samples, SNP_id)
# Check the result
head(summary_18_samples)
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436125 18 0 18 0 18
## 2: AX-579436196 16 0 16 0 16
## 3: AX-579436243 15 3 18 0 15
## 4: AX-579436298 17 0 17 0 17
## 5: AX-579436308 16 0 16 0 16
## 6: AX-579436317 18 0 18 0 18
## Zigo_mismatch
## 1: 0
## 2: 0
## 3: 3
## 4: 0
## 5: 0
## 6: 0
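The row-wise counting used above can be illustrated on a small base R data frame. The column and SNP names here are made up. Because `na.rm = TRUE` skips missing comparisons, the match and mismatch counts of a SNP can sum to fewer than the number of sample pairs, as for AX-579436196 above (16 matches and 0 mismatches out of 18 pairs).

```r
# Toy table: three SNPs compared in three sample pairs
toy <- data.frame(
  SNP_id = c("snp1", "snp2", "snp3"),
  S1a_S1b_zcomp = c("match", "match", "mismatch"),
  S2a_S2b_zcomp = c("match", NA, "mismatch"),
  S3a_S3b_zcomp = c("match", "match", "match"),
  stringsAsFactors = FALSE
)

# Select the comparison columns by suffix, as in the analysis
cols <- grep("_zcomp$", names(toy), value = TRUE)

# Comparing a data frame with a string gives a logical matrix;
# rowSums() then counts TRUE values per SNP and ignores NA
toy$match_count <- rowSums(toy[, cols] == "match", na.rm = TRUE)
toy$mismatch_count <- rowSums(toy[, cols] == "mismatch", na.rm = TRUE)

# Number of pairs with a usable comparison for each SNP
toy$called <- toy$match_count + toy$mismatch_count
print(toy[, c("SNP_id", "match_count", "mismatch_count", "called")])
```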
How many SNPs have discrepancies in the genotypes in 1 or more samples (out of the 18 samples)?
# Discrepancies in 1 or more samples
# How many SNPs we tested
tested_snps <- length(unique(data_ab_dt$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90542
# How many SNPs failed
failed_snpsR <-
length(
unique(data_ab_dt[data_ab_dt$REF_mismatch_count >= 1,]$SNP_id
)
)
cat("REF mismatch in at least 1 sample:", failed_snpsR, "\n")
## REF mismatch in at least 1 sample: 6387
# How many SNPs failed
failed_snpsA <-
length(
unique(data_ab_dt[data_ab_dt$ALT_mismatch_count >= 1,]$SNP_id
)
)
cat("ALT mismatch at least in 1 sample:", failed_snpsA, "\n")
## ALT mismatch at least in 1 sample: 3464
# How many SNPs failed zygosity
failed_snps <-
length(
unique(data_ab_dt[data_ab_dt$Zigo_mismatch_count >= 1,]$SNP_id
)
)
cat("Zygosity mismatch in at least 1 sample:", failed_snps, "\n")
## Zygosity mismatch in at least 1 sample: 9309
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
cat("Percentage of failed SNPs in 1 or more samples:", percentage_failed, "%\n")
## Percentage of failed SNPs in 1 or more samples: 10.28 %
We see 12,031 SNPs with discrepancies, but many of them occur in only 1 sample. Let's check how many have mismatches in 2 or more samples
# Discrepancies in 2 or more samples
# How many SNPs we tested
tested_snps <- length(unique(data_ab_dt$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90542
# How many SNPs failed
failed_snpsR <-
length(
unique(data_ab_dt[data_ab_dt$REF_mismatch_count >= 2,]$SNP_id
)
)
cat("REF mismatch in 2 or more samples:", failed_snpsR, "\n")
## REF mismatch in 2 or more samples: 2657
# How many SNPs failed
failed_snpsA <-
length(
unique(data_ab_dt[data_ab_dt$ALT_mismatch_count >= 2,]$SNP_id
)
)
cat("ALT mismatch in 2 or more samples:", failed_snpsA, "\n")
## ALT mismatch in 2 or more samples: 1286
# How many SNPs failed
failed_snps <-
length(
unique(data_ab_dt[data_ab_dt$Zigo_mismatch_count >= 2,]$SNP_id
)
)
cat("Zygosity mismatch in 2 or more samples:", failed_snps, "\n")
## Zygosity mismatch in 2 or more samples: 3936
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
cat("Percentage of failed SNPs in 2 or more samples:", percentage_failed, "%\n")
## Percentage of failed SNPs in 2 or more samples: 4.35 %
We see that about half of the discordant SNPs have mismatching genotypes in only 1 sample, and 6,061 SNPs show genotyping mismatches in 2 or more samples.
# Check how many SNPs with errors in only 1 sample
# How many SNPs we tested
tested_snps <- length(unique(data_ab_dt$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90542
# How many SNPs failed
failed_snpsR <-
length(
unique(data_ab_dt[data_ab_dt$REF_mismatch_count == 1,]$SNP_id
)
)
cat("REF mismatch in only 1 sample:", failed_snpsR, "\n")
## REF mismatch in only 1 sample: 3730
# How many SNPs failed
failed_snpsA <-
length(
unique(data_ab_dt[data_ab_dt$ALT_mismatch_count == 1,]$SNP_id
)
)
cat("ALT mismatch in only 1 sample:", failed_snpsA, "\n")
## ALT mismatch in only 1 sample: 2178
# How many SNPs failed
failed_snps <-
length(
unique(data_ab_dt[data_ab_dt$Zigo_mismatch_count == 1,]$SNP_id
)
)
cat("Zygosity mismatch in only 1 sample:", failed_snps, "\n")
## Zygosity mismatch in only 1 sample: 5373
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
cat("Percentage of failed SNPs in only 1 sample:", percentage_failed, "%\n")
## Percentage of failed SNPs in only 1 sample: 5.93 %
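The three blocks above differ only in the threshold and the comparison operator. A minimal sketch of a reusable helper is given below in base R; the function name `count_snps()` and its arguments are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: number of unique SNPs whose mismatch count
# is >= n (default) or exactly n (exactly = TRUE)
count_snps <- function(mismatch_count, snp_id, n, exactly = FALSE) {
  hit <- if (exactly) mismatch_count == n else mismatch_count >= n
  # which() drops NA so missing counts are never selected
  length(unique(snp_id[which(hit)]))
}

# Toy data: five SNPs with their per-SNP zygosity mismatch counts
snp_id <- c("a", "b", "c", "d", "e")
zigo_mismatch <- c(0, 1, 1, 2, 5)

at_least_1 <- count_snps(zigo_mismatch, snp_id, 1)
at_least_2 <- count_snps(zigo_mismatch, snp_id, 2)
exactly_1 <- count_snps(zigo_mismatch, snp_id, 1, exactly = TRUE)

cat("At least 1:", at_least_1, "\n")
cat("At least 2:", at_least_2, "\n")
cat("Exactly 1:", exactly_1, "\n")
```

With the real data, the call would take a column such as `data_ab_dt$Zigo_mismatch_count` and `data_ab_dt$SNP_id` as its first two arguments.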
We observe that 6,004 SNPs have genotype mismatches in only 1 sample out of the 18 samples. Is it random or does it follow a pattern?
Nearly half of the SNPs that have discrepancies are from a single sample genotype mismatch.
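As a quick consistency check on the counts printed above, the SNPs with a mismatch in exactly 1 sample plus those with mismatches in 2 or more samples should equal those with a mismatch in at least 1 sample, within each category. The values below are copied from the outputs above. Taken one category at a time, single-sample mismatches make up somewhat more than half of the discordant SNPs (roughly 58% to 63%).

```r
# Counts copied from the printed outputs above
at_least_1 <- c(REF = 6387, ALT = 3464, Zygosity = 9309)
at_least_2 <- c(REF = 2657, ALT = 1286, Zygosity = 3936)
exactly_1 <- c(REF = 3730, ALT = 2178, Zygosity = 5373)

# The two disjoint groups must add up to the "at least 1" group
stopifnot(all(exactly_1 + at_least_2 == at_least_1))

# Share of discordant SNPs that mismatch in a single sample, per category
share_single <- round(exactly_1 / at_least_1 * 100, 1)
print(share_single)
```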
We can create a histogram of the number of samples with mismatches per SNP
# summary_18_samples is your data.table
setDT(summary_18_samples)
# Select only the relevant columns
dt <-
summary_18_samples[, .(SNP_id, REF_mismatch, ALT_mismatch, Zigo_mismatch)]
# Reshape data to long format
dt_long <-
melt(dt, id.vars = "SNP_id", variable.name = "type", value.name = "count")
# Convert to data.table if it's not already
setDT(dt_long)
# Convert to numeric if it's not already
dt_long[, count := as.numeric(count)]
# Count occurrences per count value
dt_long <-
dt_long[, .(n = .N), by = .(type, count)]
# Calculate total count of unique SNPs
total_SNP <-
length(unique(dt$SNP_id))
# Add a new column for the percentage
dt_long[, perc := n / total_SNP * 100]
# Define new labels
new_labels <-
c(
"Reference Allele" = "REF_mismatch",
"Alternative Allele" = "ALT_mismatch",
"Zygosity Mismatch" = "Zigo_mismatch"
)
# Apply new labels
dt_long$type <-
fct_recode(dt_long$type, !!!new_labels)
# import plotting theme
source(
here(
"scripts",
"analysis",
"my_theme2.R" # choose my_theme.R (Roboto Condensed) or my_theme2.R (default font)
)
)
# Create facet histogram
ggplot(dt_long, aes(x = count, y = n)) +
geom_bar(
stat = "identity",
fill = "#ffcae4",
color = ifelse(
dt_long$count == 0,
"#CCFF00",
ifelse(dt_long$count == 1, "#4169E1", "#FF7F50")
),
width = 0.6,
linewidth = 1
) +
geom_text_repel(aes(label = paste0(
scales::comma(n), " (", round(perc, 2), "%)"
)), size = 2.7, color = "gray10") +
facet_wrap(~ type, scales = "free_y") +
labs(
title = "Histogram of SNP Mismatch Counts across the 18 samples",
x = "Count",
y = "Frequency",
caption = "Comparison of the genotypes of 90,834 SNPs using default and crosses priors.\n 12,030 SNPs (13.24%) have discrepancies in at least 1 sample.\n Bar border colors: Electric Lime = no errors; Royal Blue = 1 error; Coral = more than 1 error"
) +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(breaks = 0:18) +
my_theme() +
coord_flip() +
theme(plot.caption = element_text(
face = "italic",
size = 10,
color = "grey20"
))
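The data preparation behind the plot can be sketched in base R on a toy vector of per-SNP mismatch counts. The plot above passes the border colours as a vector outside `aes()`, which relies on the row order of the summary table. Storing the colour as a column, as in this sketch, keeps each colour attached to its row. The object names here are made up for illustration.

```r
# Toy input: number of mismatching samples for each of 10 SNPs
mismatch_per_snp <- c(0, 0, 0, 1, 1, 2, 0, 3, 0, 1)

# Count how many SNPs fall on each mismatch value
tab <- as.data.frame(table(count = mismatch_per_snp), stringsAsFactors = FALSE)
tab$count <- as.numeric(as.character(tab$count))
names(tab)[2] <- "n"

# Percentage of all SNPs in each bar
tab$perc <- tab$n / length(mismatch_per_snp) * 100

# Border colour per bar: 0 mismatches, exactly 1, more than 1
tab$border <- ifelse(
  tab$count == 0, "#CCFF00",
  ifelse(tab$count == 1, "#4169E1", "#FF7F50")
)
print(tab)
```

In ggplot2, such a column could be mapped with `aes(color = border)` together with `scale_colour_identity()`.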
Now we can create columns to get the same statistics for each population, “SAI” and “KAT”.
Let's check the first SNP
## Empty data.table (0 rows and 151 cols): SNP_id,KAT_9a_geno,KAT_9b_geno,KAT_9a_KAT_9b_gcomp,KAT_9a_zygo,KAT_9b_zygo...
## Empty data.table (0 rows and 4 cols): SNP_id,REF_mismatch,ALT_mismatch,Zigo_mismatch
We have SAI_ and KAT_; we can subset the data and compare the two populations.
Check SAI
# Convert your data to a data.table
# setDT(data_ab_dt)
# Extract SAI and KAT columns
SAI_cols <- grep("^SAI_", names(data_ab_dt), value = TRUE)
KAT_cols <- grep("^KAT_", names(data_ab_dt), value = TRUE)
# Subset the data into two data tables for SAI and KAT
data_SAI <- data_ab_dt[, c('SNP_id', SAI_cols), with = FALSE]
data_KAT <- data_ab_dt[, c('SNP_id', KAT_cols), with = FALSE]
# SAI
# Create columns for match and mismatch count for columns ending with _REF
cols_REF <-
grep("_REF$", names(data_SAI), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_SAI[, c("REF_match_count", "REF_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_REF]
# Create columns for match and mismatch count for columns ending with _ALT
cols_ALT <-
grep("_ALT$", names(data_SAI), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_SAI[, c("ALT_match_count", "ALT_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_ALT]
# Create columns for match and mismatch count for columns ending with _zcomp
cols_Zigo <-
grep("_zcomp$", names(data_SAI), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_SAI[, c("Zigo_match_count", "Zigo_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_Zigo]
# Now, you can summarize this for each SNP_id
summary_sai <-
data_SAI[, .(
REF_match = sum(REF_match_count, na.rm = TRUE),
REF_mismatch = sum(REF_mismatch_count, na.rm = TRUE),
ALT_match = sum(ALT_match_count, na.rm = TRUE),
ALT_mismatch = sum(ALT_mismatch_count, na.rm = TRUE),
Zigo_match = sum(Zigo_match_count, na.rm = TRUE),
Zigo_mismatch = sum(Zigo_mismatch_count, na.rm = TRUE)
),
by = SNP_id]
# Sort data by SNP_id
setorder(summary_sai, SNP_id)
# Check the result
head(summary_sai)
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436125 12 0 12 0 12
## 2: AX-579436196 10 0 10 0 10
## 3: AX-579436243 10 2 12 0 10
## 4: AX-579436298 11 0 11 0 11
## 5: AX-579436308 10 0 10 0 10
## 6: AX-579436317 12 0 12 0 12
## Zigo_mismatch
## 1: 0
## 2: 0
## 3: 2
## 4: 0
## 5: 0
## 6: 0
Now KAT
# KAT
# Create columns for match and mismatch count for columns ending with _REF
cols_REF <-
grep("_REF$", names(data_KAT), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_KAT[, c("REF_match_count", "REF_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_REF]
# Create columns for match and mismatch count for columns ending with _ALT
cols_ALT <-
grep("_ALT$", names(data_KAT), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_KAT[, c("ALT_match_count", "ALT_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_ALT]
# Create columns for match and mismatch count for columns ending with _zcomp
cols_Zigo <-
grep("_zcomp$", names(data_KAT), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_KAT[, c("Zigo_match_count", "Zigo_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_Zigo]
# Now, you can summarize this for each SNP_id
summary_kat <-
data_KAT[, .(
REF_match = sum(REF_match_count, na.rm = TRUE),
REF_mismatch = sum(REF_mismatch_count, na.rm = TRUE),
ALT_match = sum(ALT_match_count, na.rm = TRUE),
ALT_mismatch = sum(ALT_mismatch_count, na.rm = TRUE),
Zigo_match = sum(Zigo_match_count, na.rm = TRUE),
Zigo_mismatch = sum(Zigo_mismatch_count, na.rm = TRUE)
),
by = SNP_id]
# Sort data by SNP_id
setorder(summary_kat, SNP_id)
# Check output
head(summary_kat)
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436125 6 0 6 0 6
## 2: AX-579436196 6 0 6 0 6
## 3: AX-579436243 5 1 6 0 5
## 4: AX-579436298 6 0 6 0 6
## 5: AX-579436308 6 0 6 0 6
## 6: AX-579436317 6 0 6 0 6
## Zigo_mismatch
## 1: 0
## 2: 0
## 3: 1
## 4: 0
## 5: 0
## 6: 0
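The SAI and KAT blocks above repeat the same steps with a different column prefix. A minimal sketch of how they could be wrapped in one function is shown below in base R; `summarise_population()` is a hypothetical helper, and the toy column names only mimic the naming pattern of the real table.

```r
# Hypothetical helper: per-SNP mismatch counts for one population prefix
summarise_population <- function(df, prefix) {
  pop_cols <- grep(paste0("^", prefix, "_"), names(df), value = TRUE)
  count_for <- function(suffix, value) {
    cols <- grep(paste0("_", suffix, "$"), pop_cols, value = TRUE)
    # drop = FALSE keeps a data frame even when only one column matches
    rowSums(df[, cols, drop = FALSE] == value, na.rm = TRUE)
  }
  data.frame(
    SNP_id = df$SNP_id,
    REF_mismatch = count_for("REF", "mismatch"),
    ALT_mismatch = count_for("ALT", "mismatch"),
    Zigo_mismatch = count_for("zcomp", "mismatch"),
    stringsAsFactors = FALSE
  )
}

# Toy table: two SNPs, two SAI pairs and one KAT pair
toy <- data.frame(
  SNP_id = c("snp1", "snp2"),
  SAI_1a_SAI_1b_REF = c("match", "mismatch"),
  SAI_1a_SAI_1b_ALT = c("match", "match"),
  SAI_1a_SAI_1b_zcomp = c("match", "mismatch"),
  SAI_2a_SAI_2b_REF = c("match", "mismatch"),
  SAI_2a_SAI_2b_ALT = c(NA, "match"),
  SAI_2a_SAI_2b_zcomp = c(NA, "mismatch"),
  KAT_1a_KAT_1b_REF = c("match", "match"),
  KAT_1a_KAT_1b_ALT = c("match", "mismatch"),
  KAT_1a_KAT_1b_zcomp = c("match", "mismatch"),
  stringsAsFactors = FALSE
)

sai <- summarise_population(toy, "SAI")
kat <- summarise_population(toy, "KAT")
print(sai)
print(kat)
```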
Make plot to visualize the output
First, let's get statistics to add to the plot caption. I tried two approaches to make sure we get the right output:
How many SNPs have discrepancies in the genotypes in 1 or more samples for KAT?
Code 1
# Discrepancies in 1 or more samples; we use the OR operator |
failed_kat_ab <-
data_KAT |>
dplyr::filter(REF_mismatch_count > 0 |
ALT_mismatch_count > 0 | Zigo_mismatch_count > 0)
# How many SNPs we tested
tested_snps <-
length(unique(data_KAT$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90542
# How many SNPs failed
failed_snps_kat_ab <-
length(unique(failed_kat_ab$SNP_id))
cat("Number of SNPs failed:", failed_snps_kat_ab, "\n")
## Number of SNPs failed: 2773
# Calculate percentage
percentage_failed_kat_ab <-
round(failed_snps_kat_ab / tested_snps * 100, 2)
cat("Percentage of failed SNPs:", percentage_failed_kat_ab, "%\n")
## Percentage of failed SNPs: 3.06 %
Code 2
# Discrepancies in 1 or more samples
# How many SNPs we tested
tested_snps <- length(unique(data_KAT$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90542
# How many SNPs failed
failed_kat_ab <-
length(unique(data_KAT[data_KAT$REF_mismatch_count > 0 |
data_KAT$ALT_mismatch_count > 0 |
data_KAT$Zigo_mismatch_count > 0, ]$SNP_id))
cat("Number of SNPs failed:", failed_kat_ab, "\n")
## Number of SNPs failed: 2773
# Calculate percentage
percentage_failed <- round(failed_kat_ab / tested_snps * 100, 2)
cat("Percentage of failed SNPs:", percentage_failed, "%\n")
## Percentage of failed SNPs: 3.06 %
How many SNPs have discrepancies in the genotypes in 1 or more samples for SAI?
Code 1
# Discrepancies in 1 or more samples; we use the OR operator |
failed_sai_ab <-
data_SAI |>
dplyr::filter(REF_mismatch_count > 0 |
ALT_mismatch_count > 0 | Zigo_mismatch_count > 0)
# How many SNPs we tested
tested_snps <-
length(unique(data_SAI$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90542
# How many SNPs failed
failed_snps_sai_ab <-
length(unique(failed_sai_ab$SNP_id))
cat("Number of SNPs failed:", failed_snps_sai_ab, "\n")
## Number of SNPs failed: 7532
# Calculate percentage
percentage_failed_sai_ab <-
round(failed_snps_sai_ab / tested_snps * 100, 2)
cat("Percentage of failed SNPs:", percentage_failed_sai_ab, "%\n")
## Percentage of failed SNPs: 8.32 %
Code 2
# Discrepancies in 1 or more samples
# How many SNPs we tested
tested_snps <-
length(unique(data_SAI$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90542
# How many SNPs failed
failed_sai_ab <-
length(unique(data_SAI[data_SAI$REF_mismatch_count > 0 |
data_SAI$ALT_mismatch_count > 0 |
data_SAI$Zigo_mismatch_count > 0,]$SNP_id))
cat("Number of SNPs failed:", failed_sai_ab, "\n")
## Number of SNPs failed: 7532
# Calculate percentage
percentage_failed <-
round(failed_sai_ab / tested_snps * 100, 2)
cat("Percentage of failed SNPs:", percentage_failed, "%\n")
## Percentage of failed SNPs: 8.32 %
Both approaches produced the same output.
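The agreement between the two approaches can also be asserted in code rather than checked by eye. The sketch below uses base R and a made-up four-SNP table; the count columns are named as in the analysis. Note that the OR condition counts each SNP once, even when it fails in more than one category.

```r
# Toy per-SNP mismatch counts
toy <- data.frame(
  SNP_id = c("a", "b", "c", "d"),
  REF_mismatch_count = c(0, 1, 0, 0),
  ALT_mismatch_count = c(0, 0, 1, 0),
  Zigo_mismatch_count = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Approach 1: subset the rows first, then count unique SNP IDs
failed_rows <- subset(
  toy,
  REF_mismatch_count > 0 | ALT_mismatch_count > 0 | Zigo_mismatch_count > 0
)
n1 <- length(unique(failed_rows$SNP_id))

# Approach 2: build the logical index directly
any_mismatch <- toy$REF_mismatch_count > 0 |
  toy$ALT_mismatch_count > 0 |
  toy$Zigo_mismatch_count > 0
n2 <- length(unique(toy$SNP_id[any_mismatch]))

# Stop with an error if the two approaches ever disagree
stopifnot(identical(n1, n2))
cat("SNPs with any mismatch:", n1, "\n")
```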
Data tidying and plotting
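One step in the tidying code that follows splits names such as `REF_mismatch_sai` into an allele label and a population label using `str_extract()`. The same split can be sketched with base R `sub()`, shown here on a made-up vector of names so the two patterns are easy to verify.

```r
# Names as produced by merging the two summaries with suffixes _sai / _kat
type <- c("REF_mismatch_sai", "ALT_mismatch_kat", "Zigo_mismatch_sai")

# Part before the first underscore: remove everything from the first "_" on
allele <- sub("_.*$", "", type)

# Part after the last underscore: remove everything up to the last "_"
group <- sub("^.*_", "", type)

print(data.frame(type, allele, group))
```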
# import plotting theme
source(
here(
"scripts",
"analysis",
"my_theme2.R" # choose my_theme.R (Roboto Condensed) or my_theme2.R (default font)
)
)
# Merge summary_sai and summary_kat
merged_sai_kat <-
merge(summary_sai,
summary_kat,
by = "SNP_id",
suffixes = c("_sai", "_kat"))
# Select only the relevant columns
dt <-
merged_sai_kat[, .(
SNP_id,
REF_mismatch_sai,
ALT_mismatch_sai,
Zigo_mismatch_sai,
REF_mismatch_kat,
ALT_mismatch_kat,
Zigo_mismatch_kat
)]
# Reshape data to long format
dt_long <-
melt(dt,
id.vars = "SNP_id",
variable.name = "type",
value.name = "count")
# Convert to data.table if it's not already
setDT(dt_long)
# Extract the last part after "_" in the 'type' column to form 'group' column
dt_long[, group := str_extract(type, "(?<=_)[^_]+$")]
# Extract the part before the first "_" in the 'type' column to form 'allele' column
dt_long[, allele := str_extract(type, "^[^_]+")]
# Convert to numeric if it's not already
dt_long[, count := as.numeric(count)]
# Count occurrences per count value
dt_long <-
dt_long[, .(n = .N), by = .(allele, group, count)]
# dt_long[, n := .N, by = .(allele, group, count)]
# Calculate total count of unique SNPs
total_SNP <-
length(unique(dt$SNP_id))
# Add a new column for the percentage
dt_long[, perc := n / total_SNP * 100, by = group]
# Set levels for 'group' variable
dt_long$group <-
factor(dt_long$group, levels = c("sai", "kat"))
# Set levels for 'allele' variable
dt_long$allele <-
factor(dt_long$allele, levels = c("REF", "ALT", "Zigo"))
# Modify levels for 'allele' variable
levels(dt_long$allele) <-
c("Reference Allele", "Alternative Allele", "Zygosity")
# Modify levels for 'group' variable
levels(dt_long$group) <-
c("SAI", "KAT")
dt_long$count <-
as.numeric(dt_long$count)
# Create plot
ggplot(dt_long, aes(x = count, y = n)) +
geom_bar(
stat = "identity",
fill = "#ffcae4",
color = ifelse(
dt_long$count == 0,
"#CCFF00",
ifelse(dt_long$count == 1, "#4169E1", "#FF7F50")
),
width = 0.6,
linewidth = 1
) +
geom_text_repel(aes(label = paste0(
scales::comma(n), " (", round(perc, 2), "%)"
)), size = 2.7, color = "gray10") +
facet_wrap(~ group + allele, scales = "free_y", ncol = 3) +
labs(
title = "Histogram of SNP Mismatch Counts across all samples for each population",
x = "Count",
y = "Frequency",
caption = "Comparison of the genotypes of 90,542 SNPs using default and crosses priors.\n Number of SNPs with genotype discordance in at least 1 sample for each sampling locality:\n KAT: 6 samples from the native range; SAI: 12 samples from the invasive range\n Bar border colors: Electric Lime = no errors; Royal Blue = 1 error; Coral = more than 1 error \nSAI: Saint Augustine, Trinidad and Tobago -> 7,532 SNPs (8.32%)\n KAT: Kathmandu, Nepal -> 2,773 SNPs (3.06%)"
) +
coord_flip() +
my_theme() +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(breaks = 0:18) +
theme(plot.caption = element_text(
face = "italic",
size = 10,
color = "grey20"
))
# save the plot
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"default_cross_priors_mismatches_SAI_KAT.pdf"
),
width = 8,
height = 8,
units = "in"
)
It seems that SAI has more mismatches, but it has twice as many samples as KAT. We can check the mismatches per sample.
# Initialize an empty list to hold the counts
count_list <- list()
# Select columns
matching_columns <- colnames(data_ab_dt)[grepl(pattern = "(_REF$|_ALT$|_zcomp$)", colnames(data_ab_dt))]
# Loop through each column
for (column in matching_columns) {
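# Note: the unanchored pattern "match" is also found inside "mismatch";
# use "^match$" if Match should count exact matches only
match_count <-
sum(str_detect(data_ab_dt[[column]], "match"), na.rm = TRUE)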
mismatch_count <-
sum(str_detect(data_ab_dt[[column]], "mismatch"), na.rm = TRUE)
# Create a data.table with counts for the current column
count_dt <-
data.table(Column = column,
Match = match_count,
Mismatch = mismatch_count)
# Add the count data.table to the list
count_list[[column]] <- count_dt
}
# Combine all count data.tables into a single data.table
counts_all_columns <-
rbindlist(count_list)
# Calculate total
counts_all_columns <-
counts_all_columns |>
mutate(Total = Match + Mismatch)
# Create new columns: Population, Sample, and Comparison
counts_all_columns <-
counts_all_columns |>
mutate(
Population = sub("^([^_]+).*", "\\1", Column),
Sample = sub("^.*_(\\d+).*", "\\1", Column),
Comparison = sub(".*_([^_]+)$", "\\1", Column)
)
# Keep only the columns we need, in display order
counts_all_columns <-
counts_all_columns |>
dplyr::select(Population, Sample, Comparison, Match, Mismatch, Total)
# Calculate percentage columns
counts_all_columns <-
counts_all_columns |>
mutate(Percent_Match = round((Match / Total) * 100, 2),
Percent_Mismatch = round((Mismatch / Total) * 100, 2))
# Replace zcomp with Zygosity
counts_all_columns$Comparison <-
gsub("zcomp", "Zygosity", counts_all_columns$Comparison)
head(counts_all_columns)
## Population Sample Comparison Match Mismatch Total Percent_Match
## 1: KAT 9 Zygosity 87964 775 88739 99.13
## 2: SAI 15 Zygosity 87864 1064 88928 98.80
## 3: SAI 3 Zygosity 87880 1025 88905 98.85
## 4: KAT 12 Zygosity 87462 894 88356 98.99
## 5: KAT 7 Zygosity 88177 810 88987 99.09
## 6: SAI 2 Zygosity 87657 1148 88805 98.71
## Percent_Mismatch
## 1: 0.87
## 2: 1.20
## 3: 1.15
## 4: 1.01
## 5: 0.91
## 6: 1.29
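One detail of the counting loop above is worth a quick check: an unanchored pattern `match` is also found inside the string `mismatch`, so the two counts overlap unless the pattern is anchored. A minimal shell sketch on made-up values (that the comparison columns hold exactly `match` and `mismatch` is an assumption):

```shell
# Toy column with 3 matches and 2 mismatches (hypothetical values)
values=$(printf 'match\nmismatch\nmatch\nmismatch\nmatch\n')

# Unanchored: counts every line that contains "match", including "mismatch"
unanchored=$(printf '%s\n' "$values" | grep -c 'match')

# Whole-line match (-x): counts only lines that are exactly "match"
exact=$(printf '%s\n' "$values" | grep -cx 'match')

echo "unanchored: $unanchored  exact: $exact"
```

In R, the counterpart of the whole-line match would be an anchored pattern such as `^match$`.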
Make a plot
# import plotting theme
source(
here(
"scripts",
"analysis",
"my_theme2.R" # choose my_theme.R (Roboto Condensed) or my_theme2.R (default font)
)
)
# Define color palette
color_palette <- c("#92C6FF", "#f5cb8b", "#bff28c")
# Convert Sample to numeric and sort samples numerically within each Population group
counts_all_columns$Sample <-
as.numeric(counts_all_columns$Sample)
counts_all_columns <-
counts_all_columns |>
arrange(Population, Sample)
# Convert Sample column back to factor with sorted levels within each group
counts_all_columns$Sample <-
factor(counts_all_columns$Sample,
levels = unique(counts_all_columns$Sample))
# Rename and reorder Comparison column
counts_all_columns <-
counts_all_columns |>
mutate(
Comparison_new = recode(
Comparison,
"REF" = "Reference Allele",
"ALT" = "Alternative Allele",
"Zygosity" = "Zygosity"
)
) |>
mutate(Comparison_new = factor(
Comparison_new,
levels = c("Reference Allele", "Alternative Allele", "Zygosity")
))
# Create plot
ggplot(counts_all_columns,
aes(x = Sample, y = Mismatch, fill = Comparison)) +
geom_bar(stat = "identity", position = "dodge") +
facet_grid(Population ~ Comparison_new,
scales = "free_y",
space = "free") +
coord_flip() +
labs(
title = "SNP Mismatch Counts per Sample",
x = "Sample",
y = "Mismatches",
caption = "Genotyping errors per sample within each population using the default and the crosses priors."
) +
# labs(x = "Sample", y = "Mismatch") +
theme(panel.spacing = unit(0.5, "lines")) +
geom_text(aes(label = paste0(
scales::comma(Mismatch), " (", Percent_Mismatch, "%)"
)),
# position = position_dodge(width = 0.9),
hjust = 1,
size = 2.5) +
scale_fill_manual(values = color_palette) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
my_theme() +
# theme(plot.margin = margin(10, 20, 10, 10)) + # Increase right margin to prevent labels getting cut off
scale_y_continuous(labels = scales::comma) + # Add thousands separator to y-axis labels
theme(plot.caption = element_text(
face = "italic",
size = 10,
color = "grey20"
))
# save the plot
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"default_cross_priors_mismatches_SAI_KAT_per_sample_stats.pdf"
),
width = 8,
height = 7,
units = "in"
)
We see that the number of mismatches is quite consistent across all 18 samples, and there does not seem to be a bias towards the native or the invasive range. What we have to decide now is which mismatches are random and acceptable, and which SNPs we need to filter out to avoid problems in our downstream analyses.
# Save the data 18 samples
saveRDS(
summary_18_samples,
file = here(
"output",
"wgs_vs_chip",
"summary_18_samples.rds"
)
)
# Save the data KAT
saveRDS(
summary_kat,
file = here(
"output",
"wgs_vs_chip",
"summary_kat.rds"
)
)
# Save the data SAI
saveRDS(
summary_sai,
file = here(
"output",
"wgs_vs_chip",
"summary_sai.rds"
)
)
# Save the data
saveRDS(
counts_all_columns,
file = here(
"output",
"wgs_vs_chip",
"counts_all_columns.rds"
)
)
# Save the data
saveRDS(
data_ab_dt,
file = here(
"output",
"wgs_vs_chip",
"data_ab_dt.rds"
)
)
We can compare the SNPs that have discrepancies in 2 or more samples with the SNPs that did not pass our segregation test.
Get the SNPs that have errors in 2 or more samples
# Discrepancies in 2 or more samples
# How many SNPs we tested
tested_snps <- length(unique(data_ab_dt$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90542
# How many SNPs failed
failed_snpsR <-
length(
unique(data_ab_dt[data_ab_dt$REF_mismatch_count >= 2,]$SNP_id
)
)
cat("REF mismatch in at least 2 samples:", failed_snpsR, "\n")
## REF mismatch in at least 2 samples: 2657
# How many SNPs failed
failed_snpsA <-
length(
unique(data_ab_dt[data_ab_dt$ALT_mismatch_count >= 2,]$SNP_id
)
)
cat("ALT mismatch at least in 2 samples:", failed_snpsA, "\n")
## ALT mismatch at least in 2 samples: 1286
# How many SNPs failed zygosity
failed_snps <-
length(
unique(data_ab_dt[data_ab_dt$Zigo_mismatch_count >= 2,]$SNP_id
)
)
cat("Zygosity mismatch in at least 2 samples:", failed_snps, "\n")
## Zygosity mismatch in at least 2 samples: 3936
# Calculate percentage (zygosity mismatches only, since failed_snps holds the zygosity count)
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
cat("Percentage of SNPs with zygosity mismatches in 2 or more samples:", percentage_failed, "%\n")
## Percentage of SNPs with zygosity mismatches in 2 or more samples: 4.35 %
Get the SNP ids
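The file `SNPs_failed_2_samples.txt` is read in the Venn diagram step, but the code that writes it is not shown. A minimal sketch of one way to produce such a list, assuming a tab-separated table with a header and the columns `SNP_id`, `REF_mismatch_count`, `ALT_mismatch_count`, and `Zigo_mismatch_count`. The input file name `mismatch_counts.tsv`, the toy SNP ids, and the rule "any of the three counts is 2 or more" are assumptions, not the confirmed criterion:

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical input: one row per SNP with per-category mismatch counts
{
  printf 'SNP_id\tREF_mismatch_count\tALT_mismatch_count\tZigo_mismatch_count\n'
  printf 'AX-1\t0\t0\t0\n'
  printf 'AX-2\t2\t0\t1\n'
  printf 'AX-3\t0\t1\t1\n'
  printf 'AX-4\t0\t0\t3\n'
} > "$tmp/mismatch_counts.tsv"

# Skip the header, keep ids where any count is >= 2, and de-duplicate
awk -F'\t' 'NR > 1 && ($2 >= 2 || $3 >= 2 || $4 >= 2) {print $1}' \
  "$tmp/mismatch_counts.tsv" | sort -u > "$tmp/SNPs_failed_2_samples.txt"

n_failed=$(wc -l < "$tmp/SNPs_failed_2_samples.txt" | tr -d ' ')
failed_ids=$(paste -sd, "$tmp/SNPs_failed_2_samples.txt")
echo "SNPs flagged: $n_failed ($failed_ids)"
```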
Create a Venn diagram between the SNPs with genotyping mismatches and those that failed our segregation test
# Read in the two files as vectors
fail_mendel <-
read_table(
here(
"output",
"segregation",
"albopictus",
"albopictus_SNPs_fail_segregation.txt"
),
col_names = FALSE,
show_col_types = FALSE
)[[1]]
fail_geno <-
read_table(
here(
"output",
"wgs_vs_chip",
"SNPs_failed_2_samples.txt"
),
col_names = FALSE,
show_col_types = FALSE
)[[1]]
# Calculate shared values
errors_SNPs <-
intersect(
fail_mendel,
fail_geno
)
# Create Venn diagram
venn_data <-
list(
"Fail Mendel" = fail_mendel,
"Genotype Mismatches" = fail_geno
)
venn_plot <-
ggvenn(
venn_data,
fill_color = c("steelblue", "darkorange"),
show_percentage = TRUE
)
# Add a title
venn_plot <-
venn_plot +
ggtitle("Comparison of SNPs with errors") +
theme(plot.title = element_text(hjust = .5))
# Display the Venn diagram
print(venn_plot)
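The three regions of the Venn diagram can also be counted directly from the two ID files. A sketch using `comm` on sorted copies, with made-up ids standing in for the real lists:

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy id lists (hypothetical)
printf 'AX-3\nAX-1\nAX-2\n' > "$tmp/fail_mendel.txt"
printf 'AX-2\nAX-4\nAX-3\nAX-5\n' > "$tmp/fail_geno.txt"

# comm requires sorted input
sort -u "$tmp/fail_mendel.txt" > "$tmp/mendel.sorted"
sort -u "$tmp/fail_geno.txt" > "$tmp/geno.sorted"

# -12 keeps shared lines, -23 lines only in file 1, -13 lines only in file 2
shared=$(comm -12 "$tmp/mendel.sorted" "$tmp/geno.sorted" | wc -l | tr -d ' ')
only_mendel=$(comm -23 "$tmp/mendel.sorted" "$tmp/geno.sorted" | wc -l | tr -d ' ')
only_geno=$(comm -13 "$tmp/mendel.sorted" "$tmp/geno.sorted" | wc -l | tr -d ' ')

echo "shared: $shared  Mendel only: $only_mendel  genotype only: $only_geno"
```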
We can prepare a PCA before and after removing the SNPs with errors. First let’s combine the two vectors with the SNP ids with errors
# Combine vectors
combined_errors <-
unique(c(fail_mendel,
fail_geno))
# Write to file
write.table(
combined_errors,
file = here(
"output",
"wgs_vs_chip",
"SNPs_with_errors.txt"
),
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
Now use Plink to create PCA excluding only the SNPs that failed our segregation test
Let's import our .fam file to select the IDs we want to compare.
# Read the data
fam_data <-
here("output", "wgs_vs_chip", "wgs_chip.fam") |>
read_delim(
delim = "\t",
col_names = FALSE,
show_col_types = FALSE
) |>
setNames(
c(
"FID", "IID", "PID", "MID", "Sex", "Phenotype"
)
)
# Filter the data
filtered_data <-
fam_data |>
dplyr::filter(stringr::str_detect(IID, "a$|b$")) |>
dplyr::select("FID", "IID")
# Save to file
write.table(
filtered_data,
file = here("output", "wgs_vs_chip", "samples_priors.txt"),
quote = FALSE,
sep = " ",
row.names = FALSE,
col.names = FALSE
)
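The same selection can be reproduced on the command line as a cross-check. A sketch on a toy `.fam` file, assuming (as in the filter above) that the chip samples called with the two priors carry IIDs ending in `a` or `b`, while other samples do not:

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy .fam: FID, IID, PID, MID, Sex, Phenotype (hypothetical rows)
{
  printf 'KAT\t7a\t0\t0\t0\t-9\n'
  printf 'KAT\t7b\t0\t0\t0\t-9\n'
  printf 'KAT\t7\t0\t0\t0\t-9\n'
  printf 'SAI\t12a\t0\t0\t0\t-9\n'
  printf 'SAI\t12\t0\t0\t0\t-9\n'
} > "$tmp/toy.fam"

# Keep FID and IID for samples whose IID ends in "a" or "b"
awk '$2 ~ /[ab]$/ {print $1, $2}' "$tmp/toy.fam" > "$tmp/samples_priors.txt"

n_selected=$(wc -l < "$tmp/samples_priors.txt" | tr -d ' ')
cat "$tmp/samples_priors.txt"
```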
Use Plink with only the samples we are comparing (priors) and remove SNPs that failed Mendel test
# Before
plink \
--allow-extra-chr \
--keep-allele-order \
--bfile output/wgs_vs_chip/wgs_chip \
--exclude output/segregation/albopictus/albopictus_SNPs_fail_segregation.txt \
--keep output/wgs_vs_chip/samples_priors.txt \
--pca \
--geno 0.1 \
--maf 0.05 \
--out output/wgs_vs_chip/priors_pca_1 \
--silent
Now do it again, but remove both the SNPs that failed the Mendel test and the SNPs that have genotype mismatches in at least 2 samples.
# After
plink \
--allow-extra-chr \
--keep-allele-order \
--bfile output/wgs_vs_chip/wgs_chip \
--exclude output/wgs_vs_chip/SNPs_with_errors.txt \
--keep output/wgs_vs_chip/samples_priors.txt \
--pca \
--geno 0.1 \
--maf 0.05 \
--out output/wgs_vs_chip/priors_pca_2 \
--silent
Create PCA plot
# Load the PCA results
pca_1 <-
read.table(here("output", "wgs_vs_chip", "priors_pca_1.eigenvec"),
header = FALSE)
colnames(pca_1) <- c("FID", "IID", paste0("PC", 1:(ncol(pca_1) - 2)))
pca_1$analysis <- "Before"
pca_1$group <- ifelse(
stringr::str_detect(pca_1$IID, "a$"),
"a",
ifelse(stringr::str_detect(pca_1$IID, "b$"), "b", "Other")
)
pca_2 <-
read.table(here("output", "wgs_vs_chip", "priors_pca_2.eigenvec"),
header = FALSE)
colnames(pca_2) <- c("FID", "IID", paste0("PC", 1:(ncol(pca_2) - 2)))
pca_2$analysis <- "After"
pca_2$group <- ifelse(
stringr::str_detect(pca_2$IID, "a$"),
"a",
ifelse(stringr::str_detect(pca_2$IID, "b$"), "b", "Other")
)
# Combine the data
combined_pca <- rbind(pca_1, pca_2)
# import plotting theme
source(
here(
"scripts",
"analysis",
"my_theme2.R"
)
)
# Convert the 'analysis' column to a factor and specify the level order
combined_pca$analysis <-
factor(combined_pca$analysis, levels = c("Before", "After"))
# Create a facet plot
ggplot(combined_pca, aes(x = PC1, y = PC2, color = group, shape = group)) +
geom_point(size = 2) +
facet_grid(FID ~ analysis, scales = "free") +
# geom_text_repel(aes(label = IID), size = 3, max.overlaps = Inf) +
labs(
x = "PC1",
y = "PC2",
title = "The effect of SNPs with genotyping mismatches in 2 or more samples",
colour = "Prior",
shape = "Prior",
caption = "Removing SNPs with genotype errors in at least 2 samples. \n'Before' with 71,144 SNPs; 'After' with 66,485 SNPs (--maf 0.05 and --geno 0.1)."
) +
my_theme() +
scale_color_manual(
values = c(
"a" = "lightblue",
"b" = "orange",
"Other" = "black"
),
labels = c("a" = "Default", "b" = "Crosses", "Other" = "Other")
) +
theme(plot.caption = element_text(
face = "italic",
size = 10,
color = "grey20"
),
legend.position = "top") +
scale_shape_manual(
values = c(
"a" = 19, # Filled circle
"b" = 1, # Open circle
"Other" = 3 # Plus
),
labels = c("a" = "Default", "b" = "Crosses", "Other" = "Other")
)
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"PCA_before_after_remove_SNPs_errors_2_or_more_samples.pdf"
),
height = 6,
width = 6,
dpi = 300
)
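The SNP counts quoted in the caption can be read from the two PLINK logs. A sketch that parses the "pass filters and QC" line that PLINK 1.9 writes, run here on stand-in log files. The variant counts are the ones from the caption; the number of people in the toy lines is a placeholder:

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Stand-in logs; only the first field of the matching line is used
printf '71144 variants and 36 people pass filters and QC.\n' > "$tmp/priors_pca_1.log"
printf '66485 variants and 36 people pass filters and QC.\n' > "$tmp/priors_pca_2.log"

# Print the variant count from a PLINK log
count_variants() {
  awk '/pass filters and QC/ {print $1}' "$1"
}

before=$(count_variants "$tmp/priors_pca_1.log")
after=$(count_variants "$tmp/priors_pca_2.log")
removed=$((before - after))

echo "Before: $before  After: $after  Difference: $removed"
```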
If we removed all SNPs with errors, the points would overlap perfectly, because the frequencies and genotypes would all be the same. It is interesting that we can see the effect of a few thousand SNPs (~ 6k) that have 1 genotype wrong in 1 sample out of the 18 samples.
Because we extracted the genotypes of the WGS samples from the output of the genotype call using 819 genomes, with KAT and SAI having more samples than we are analyzing here, I will redo the genotype call using only the samples we have here. Then we can compare the results. One would think that it is okay to subset a dataset and compare it to another one. However, since ANGSD made the genotype calls using all samples, we have the opportunity to test that assumption by comparing the outcomes.
We can use the “filtered_data” object to get the sample IDs we need.
# Removing 'a' and 'b' from IID column
samples_wgs <-
filtered_data |>
mutate(IID = str_remove_all(IID, "[ab]")) |>
dplyr::select(FID, IID) |>
distinct()
# Get the number of samples
length(samples_wgs$IID)
## [1] 18
Check the wgs samples
## # A tibble: 18 × 2
## FID IID
## <chr> <chr>
## 1 KAT 7
## 2 KAT 8
## 3 KAT 9
## 4 KAT 10
## 5 KAT 11
## 6 KAT 12
## 7 SAI 1
## 8 SAI 2
## 9 SAI 3
## 10 SAI 4
## 11 SAI 5
## 12 SAI 12
## 13 SAI 13
## 14 SAI 14
## 15 SAI 15
## 16 SAI 16
## 17 SAI 17
## 18 SAI 18
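The table above comes from stripping the prior suffix and de-duplicating. The same step as a shell sketch on a few toy IDs, assuming the suffix is always a single trailing `a` or `b`:

```shell
# Toy FID/IID pairs for chip samples (hypothetical)
ids=$(printf 'KAT 7a\nKAT 7b\nSAI 12a\nSAI 12b\nSAI 1a\n')

# Drop the trailing a/b and keep each sample once
wgs_ids=$(printf '%s\n' "$ids" | sed 's/[ab]$//' | sort -u)
n_wgs=$(printf '%s\n' "$wgs_ids" | wc -l | tr -d ' ')

printf '%s\n' "$wgs_ids"
echo "samples: $n_wgs"
```

Anchoring the pattern at the end of the line avoids touching an `a` or `b` that might appear elsewhere in an ID.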
We have a total of 30 samples for KAT + SAI
The name of the wgs samples on the cluster
# all 30 samples for genotype call
# Kathmandu_Nepal_F_10.cram
# Kathmandu_Nepal_F_11.cram
# Kathmandu_Nepal_F_12.cram
# Kathmandu_Nepal_F_7.cram
# Kathmandu_Nepal_F_8.cram
# Kathmandu_Nepal_F_9.cram
# Kathmandu_Nepal_M_1.cram
# Kathmandu_Nepal_M_2.cram
# Kathmandu_Nepal_M_3.cram
# Kathmandu_Nepal_M_4.cram
# Kathmandu_Nepal_M_5.cram
# Kathmandu_Nepal_M_6.cram
# StAugustine_Trinidad_F_12.cram
# StAugustine_Trinidad_F_13.cram
# StAugustine_Trinidad_F_14.cram
# StAugustine_Trinidad_F_15.cram
# StAugustine_Trinidad_F_16.cram
# StAugustine_Trinidad_F_17.cram
# StAugustine_Trinidad_F_18.cram
# StAugustine_Trinidad_F_1.cram
# StAugustine_Trinidad_F_2.cram
# StAugustine_Trinidad_F_3.cram
# StAugustine_Trinidad_F_4.cram
# StAugustine_Trinidad_F_5.cram
# StAugustine_Trinidad_F_6.cram
# StAugustine_Trinidad_M_10.cram
# StAugustine_Trinidad_M_11.cram
# StAugustine_Trinidad_M_7.cram
# StAugustine_Trinidad_M_8.cram
# StAugustine_Trinidad_M_9.cram
# we will do a genotype call with the 18 samples
# Kathmandu_Nepal_F_10.cram
# Kathmandu_Nepal_F_11.cram
# Kathmandu_Nepal_F_12.cram
# Kathmandu_Nepal_F_7.cram
# Kathmandu_Nepal_F_8.cram
# Kathmandu_Nepal_F_9.cram
# StAugustine_Trinidad_F_1.cram
# StAugustine_Trinidad_F_2.cram
# StAugustine_Trinidad_F_3.cram
# StAugustine_Trinidad_F_4.cram
# StAugustine_Trinidad_F_5.cram
# StAugustine_Trinidad_F_12.cram
# StAugustine_Trinidad_F_13.cram
# StAugustine_Trinidad_F_14.cram
# StAugustine_Trinidad_F_15.cram
# StAugustine_Trinidad_F_16.cram
# StAugustine_Trinidad_F_17.cram
# StAugustine_Trinidad_F_18.cram
On the cluster the data is at /ycga-gpfs/project/caccone/lvc26/september_2020/crams
We can do two genotype calls: one with all KAT and SAI samples (30), and one with only the samples we also genotyped with the chip (18). Then, we can compare the results with the extracted genotypes of the 18 samples, which we extracted from a file that we created using ANGSD and 819 samples.
We can use the same script that we used for the genotype calls, but change the samples and the sites file (use only the one we have in the chip).
To create a sites file we can use the .bim file of the wgs data with all the sites we have in the chip (175k)
Here is a batch script I used for the genotype calls
#!/bin/sh
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=luciano.cosme@yale.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=6gb
#SBATCH --time=120:00:00
#SBATCH --array=1-819
#SBATCH --job-name=angsd_chr
#SBATCH -o angsd_chr.%A_%a.o.txt
#SBATCH -e angsd_chr.%A_%a.ERROR.txt
cd /gpfs/ycga/project/caccone/lvc26/september_2020/snp_calls/chunk_calls
samplesheet="scaffolds.txt"
threads=$SLURM_JOB_CPUS_PER_NODE
name=`sed -n "$SLURM_ARRAY_TASK_ID"p $samplesheet | awk '{print $1}'`
/home/lvc26/project/angsd/angsd \
-ref /gpfs/ycga/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta \
-bam /gpfs/ycga/project/caccone/lvc26/september_2020/snp_calls/bams.txt \
-nThreads 40 \
-r $name \
-gl 1 \
-dopost 1 \
-doMaf 2 \
-doMajorMinor 4 \
-minMapQ 20 \
-minQ 10 \
-remove_bads 1 \
-uniqueOnly 1 \
-sites /gpfs/ycga/project/caccone/lvc26/september_2020/sites/cat/intersects/shared/shared_sites.txt \
-doCounts 1 \
-setMinDepthInd 10 \
-minInd 2 \
-SNP_pval 1e-6 \
-doPlink 2 \
-doGeno 4 \
-capDepth 45 \
-minMaf 0.01 \
-out $name
We need to create two lists of cram files and a new sites file.
Create new sites file. First, check the .bim file
## 1.1 AX-581444870 0 97856 C T
## 1.1 AX-583033226 0 161729 A G
## 1.1 AX-583035067 0 229640 T A
## 1.1 AX-583035083 0 305518 A G
## 1.1 AX-583035102 0 308124 A G
## 1.1 AX-583033340 0 311920 G A
## 1.1 AX-583033342 0 315059 C G
## 1.1 AX-583035163 0 315386 A G
## 1.1 AX-583033356 0 315674 C T
## 1.1 AX-583033370 0 330057 G A
We can get the first (chromosome) and fourth (position) columns to create a sites file. Check how many SNPs we have in the .bim file
## 175360 output/wgs_vs_chip/wgs_01.bim
We can use “awk” to do what we need
awk '{print "chr"$1, $4}' output/wgs_vs_chip/wgs_01.bim > output/wgs_vs_chip/new_calls/wgs_sites.txt;
head output/wgs_vs_chip/new_calls/wgs_sites.txt
## chr1.1 97856
## chr1.1 161729
## chr1.1 229640
## chr1.1 305518
## chr1.1 308124
## chr1.1 311920
## chr1.1 315059
## chr1.1 315386
## chr1.1 315674
## chr1.1 330057
The reference genome that I used had “chr” before the scaffold names. We need to use it to match the genome. It is easy to remove or add it.
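Adding or removing the prefix is a one-line substitution. A sketch on two toy site lines:

```shell
# Two sites without the prefix (scaffold and position)
sites=$(printf '1.1 97856\n1.10 91677\n')

# Add "chr" at the start of every line
with_chr=$(printf '%s\n' "$sites" | sed 's/^/chr/')

# Remove it again; the ^ anchor restricts the edit to the start of the line
without_chr=$(printf '%s\n' "$with_chr" | sed 's/^chr//')

printf '%s\n' "$with_chr"
```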
We can create a file with the SNP id that ANGSD creates (chromosome_position)
chr1.1 chr1.1_97856
chr1.1 chr1.1_161729
chr1.1 chr1.1_229640
chr1.1 chr1.1_305518
chr1.1 chr1.1_308124
chr1.1 chr1.1_311920
chr1.1 chr1.1_315059
chr1.1 chr1.1_315386
awk -v OFS='\t' '{$6="chr"$1 "_" $4; $7="chr" $1; print $1, $7, $4, $6, $2}' output/wgs_vs_chip/wgs_01.bim > output/wgs_vs_chip/new_calls/wgs_snps_ids.txt;
head output/wgs_vs_chip/new_calls/wgs_snps_ids.txt
## 1.1 chr1.1 97856 chr1.1_97856 AX-581444870
## 1.1 chr1.1 161729 chr1.1_161729 AX-583033226
## 1.1 chr1.1 229640 chr1.1_229640 AX-583035067
## 1.1 chr1.1 305518 chr1.1_305518 AX-583035083
## 1.1 chr1.1 308124 chr1.1_308124 AX-583035102
## 1.1 chr1.1 311920 chr1.1_311920 AX-583033340
## 1.1 chr1.1 315059 chr1.1_315059 AX-583033342
## 1.1 chr1.1 315386 chr1.1_315386 AX-583035163
## 1.1 chr1.1 315674 chr1.1_315674 AX-583033356
## 1.1 chr1.1 330057 chr1.1_330057 AX-583033370
We can use this file to replace the SNP ids that we will get with ANGSD.
We can add the SNP ids (AX-) to our file to convert between the two SNP names. We can use the position as a reference when replacing the SNP ids that ANGSD creates with the ones we have in the chip.
Since we are using only 175k sites instead of over 300 million when we did a genotype call, we do not need to split the genome into chunks or scaffolds. We can do a genotype call for the entire genome.
Index the sites file with ANGSD on the cluster
Now we create the list of cram files.
# Define path and file names
path <- "/ycga-gpfs/project/caccone/lvc26/september_2020/crams/"
samples_30 <-
c(
"Kathmandu_Nepal_F_10.cram",
"Kathmandu_Nepal_F_11.cram",
"Kathmandu_Nepal_F_12.cram",
"Kathmandu_Nepal_F_7.cram",
"Kathmandu_Nepal_F_8.cram",
"Kathmandu_Nepal_F_9.cram",
"Kathmandu_Nepal_M_1.cram",
"Kathmandu_Nepal_M_2.cram",
"Kathmandu_Nepal_M_3.cram",
"Kathmandu_Nepal_M_4.cram",
"Kathmandu_Nepal_M_5.cram",
"Kathmandu_Nepal_M_6.cram",
"StAugustine_Trinidad_F_12.cram",
"StAugustine_Trinidad_F_13.cram",
"StAugustine_Trinidad_F_14.cram",
"StAugustine_Trinidad_F_15.cram",
"StAugustine_Trinidad_F_16.cram",
"StAugustine_Trinidad_F_17.cram",
"StAugustine_Trinidad_F_18.cram",
"StAugustine_Trinidad_F_1.cram",
"StAugustine_Trinidad_F_2.cram",
"StAugustine_Trinidad_F_3.cram",
"StAugustine_Trinidad_F_4.cram",
"StAugustine_Trinidad_F_5.cram",
"StAugustine_Trinidad_F_6.cram",
"StAugustine_Trinidad_M_10.cram",
"StAugustine_Trinidad_M_11.cram",
"StAugustine_Trinidad_M_7.cram",
"StAugustine_Trinidad_M_8.cram",
"StAugustine_Trinidad_M_9.cram"
)
# Combine path and file names
full_paths_30 <- file.path(path, samples_30)
# Write to a text file
writeLines(full_paths_30, here("output","wgs_vs_chip", "new_calls", "crams_30.txt"))
# 18 samples
samples_18 <-
c(
"Kathmandu_Nepal_F_10.cram",
"Kathmandu_Nepal_F_11.cram",
"Kathmandu_Nepal_F_12.cram",
"Kathmandu_Nepal_F_7.cram",
"Kathmandu_Nepal_F_8.cram",
"Kathmandu_Nepal_F_9.cram",
"StAugustine_Trinidad_F_1.cram",
"StAugustine_Trinidad_F_2.cram",
"StAugustine_Trinidad_F_3.cram",
"StAugustine_Trinidad_F_4.cram",
"StAugustine_Trinidad_F_5.cram",
"StAugustine_Trinidad_F_12.cram",
"StAugustine_Trinidad_F_13.cram",
"StAugustine_Trinidad_F_14.cram",
"StAugustine_Trinidad_F_15.cram",
"StAugustine_Trinidad_F_16.cram",
"StAugustine_Trinidad_F_17.cram",
"StAugustine_Trinidad_F_18.cram"
)
# Combine path and file names
full_paths_18 <- file.path(path, samples_18)
# Write to a text file
writeLines(full_paths_18, here("output","wgs_vs_chip", "new_calls", "crams_18.txt"))
Now we have to create the batch scripts to submit in the cluster
30 samples
#!/bin/sh
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=luciano.cosme@yale.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=120:00:00
#SBATCH --job-name=angsd_wgs_chip_30
#SBATCH -o angsd_wgs_chip_30%A_%a.o.txt
#SBATCH -e angsd_wgs_chip_30%A_%a.ERROR.txt
cd /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls
/home/lvc26/project/angsd/angsd \
-ref /gpfs/ycga/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta \
-bam /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/crams_30.txt \
-nThreads 40 \
-gl 1 \
-dopost 1 \
-doMaf 2 \
-doMajorMinor 4 \
-minMapQ 20 \
-minQ 10 \
-remove_bads 1 \
-uniqueOnly 1 \
-sites /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/wgs_sites.txt \
-doCounts 1 \
-setMinDepthInd 10 \
-minInd 2 \
-SNP_pval 1e-6 \
-doPlink 2 \
-doGeno 4 \
-capDepth 45 \
-minMaf 0.01 \
-out /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/wgs_chip_30
18 samples
#!/bin/sh
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=luciano.cosme@yale.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=120:00:00
#SBATCH --job-name=angsd_wgs_chip_18
#SBATCH -o angsd_wgs_chip_18%A_%a.o.txt
#SBATCH -e angsd_wgs_chip_18%A_%a.ERROR.txt
cd /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls
/home/lvc26/project/angsd/angsd \
-ref /gpfs/ycga/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta \
-bam /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/crams_18.txt \
-nThreads 40 \
-gl 1 \
-dopost 1 \
-doMaf 2 \
-doMajorMinor 4 \
-minMapQ 20 \
-minQ 10 \
-remove_bads 1 \
-uniqueOnly 1 \
-sites /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/wgs_sites.txt \
-doCounts 1 \
-setMinDepthInd 10 \
-minInd 2 \
-SNP_pval 1e-6 \
-doPlink 2 \
-doGeno 4 \
-capDepth 45 \
-minMaf 0.01 \
-out /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/wgs_chip_18
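Both scripts request 20 CPUs per task but pass `-nThreads 40` to ANGSD. If the intent is to match the thread count to the allocation, one option is to read it from the environment. A sketch, assuming SLURM exports `SLURM_CPUS_PER_TASK` (it does when `--cpus-per-task` is set) and falling back to 1 outside a job; the variable names `threads` and `angsd_threads_flag` are illustrative:

```shell
# Use the CPUs SLURM allocated to this task; default to 1 when not under SLURM
threads="${SLURM_CPUS_PER_TASK:-1}"

# Flag that could replace the hard-coded "-nThreads 40" in the ANGSD call
angsd_threads_flag="-nThreads $threads"

echo "$angsd_threads_flag"
```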
Once the genotype calls are done we can convert the tped to bed file. Our file has only the SNPs from the list we supplied. We can double check and extract the SNP ids to see if everything works.
awk -v OFS='\t' '{print "chr"$1, "chr"$1"_"$4}' output/wgs_vs_chip/wgs_01.bim > output/wgs_vs_chip/new_calls/SNPs_175k.txt;
head output/wgs_vs_chip/new_calls/SNPs_175k.txt
Now extract the SNPs and create new bed file
# Load Plink
module load PLINK/1.90-beta4.4
cd /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls
# 30 samples
# Run Plink and extract the 175k SNPs
plink \
--allow-extra-chr \
--keep-allele-order \
--tfile wgs_chip_30 \
--make-bed \
--extract SNPs_175k.txt \
--out wgs_chip_30
# 128238 MB RAM detected; reserving 64119 MB for main workspace.
# Processing .tped file... done.
# wgs_chip_30-temporary.bed + wgs_chip_30-temporary.bim +
# wgs_chip_30-temporary.fam written.
# 169798 variants loaded from .bim file.
# 30 people (0 males, 0 females, 30 ambiguous) loaded from .fam.
# Ambiguous sex IDs written to wgs_chip_30.nosex .
# --extract: 169798 variants remaining.
# Using 1 thread (no multithreaded calculations invoked).
# Before main variant filters, 30 founders and 0 nonfounders present.
# Calculating allele frequencies... done.
# 169798 variants and 30 people pass filters and QC.
# Note: No phenotypes present.
# --make-bed to wgs_chip_30.bed + wgs_chip_30.bim + wgs_chip_30.fam ... done.
# 18 samples
# Run Plink and extract the 175k SNPs
plink \
--allow-extra-chr \
--keep-allele-order \
--tfile wgs_chip_18 \
--make-bed \
--extract SNPs_175k.txt \
--out wgs_chip_18
# 128238 MB RAM detected; reserving 64119 MB for main workspace.
# Processing .tped file... done.
# wgs_chip_18-temporary.bed + wgs_chip_18-temporary.bim +
# wgs_chip_18-temporary.fam written.
# 165104 variants loaded from .bim file.
# 18 people (0 males, 0 females, 18 ambiguous) loaded from .fam.
# Ambiguous sex IDs written to wgs_chip_18.nosex .
# --extract: 165104 variants remaining.
# Using 1 thread (no multithreaded calculations invoked).
# Before main variant filters, 18 founders and 0 nonfounders present.
# Calculating allele frequencies... done.
# 165104 variants and 18 people pass filters and QC.
# Note: No phenotypes present.
# --make-bed to wgs_chip_18.bed + wgs_chip_18.bim + wgs_chip_18.fam ... done.
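To put the two runs side by side, the share of the 175,360 chip sites recovered in each call can be computed from the variant counts in the logs above:

```shell
total=175360   # sites in wgs_01.bim
n_30=169798    # variants in the 30-sample call
n_18=165104    # variants in the 18-sample call

# Percentage of sites recovered, rounded to two decimals
pct() {
  awk -v n="$1" -v t="$2" 'BEGIN {printf "%.2f", n / t * 100}'
}

pct_30=$(pct "$n_30" "$total")
pct_18=$(pct "$n_18" "$total")
lost_between=$((n_30 - n_18))

echo "30 samples: $pct_30%  18 samples: $pct_18%  difference: $lost_between sites"
```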
The last thing we need to adjust is to make sure our files have the same IDs for chromosomes and SNPs. The reference genome used for mapping had “chr” before each scaffold name. When we do a genotype call with the chip data, we do not have the extra string “chr” in each scaffold name. Therefore, we need to adjust that to compare the samples. I removed the string “chr” from the reference genome. We can remove it from our bed file using any tool.
In the past I did a genotype call for each population. We have WGS data for 18 SAI samples and 12 KAT samples, but we had DNA left over for only 6 KAT samples and 12 SAI samples. That makes the comparison more complicated. We have to make sure that there are no differences in the genotype calls based on the number of samples with which we do the calls.
For now, I will compare the results of the wgs calls using the 819 samples (all populations), 30 samples (both populations, KAT and SAI), and 18 samples (only the samples for which we have chip data).
For the chip calls, I first did a call using only the 18 samples. Since I did not have more samples, I also did a genotype call using the entire plate of samples where the 18 samples were (95 samples total). Finally, I did a genotype call using all wild samples (native and invasive ranges) we have in the manuscript together with the 18 samples.
Therefore, we have 3 wgs calls and 3 chip calls. I decided not to compare the priors. We can compare the priors separately.
Let's get the data in the same format. I downloaded the data from the cluster and put it in the dir “new_calls”.
Check the .bim file after downloading it from the cluster
## 2.206 NA 0 14153 G A
## 2.206 NA 0 41198 G T
## 2.206 NA 0 46216 C T
## 2.206 NA 0 46416 G A
## 2.206 NA 0 47314 T G
## 2.206 NA 0 64862 A G
## 2.206 NA 0 67410 C T
## 2.206 NA 0 69313 A C
## 2.206 NA 0 71859 A T
## 2.206 NA 0 72355 A G
Now check how the chip data is different
## 1.1 AX-581444870 0 97856 C T
## 1.1 AX-583035067 0 229640 T A
## 1.1 AX-583035102 0 308124 A G
## 1.1 AX-583033342 0 315059 C G
## 1.1 AX-583035163 0 315386 A G
## 1.1 AX-583035194 0 330265 A G
## 1.1 AX-583033387 0 331288 C T
## 1.1 AX-583035211 0 345197 C T
## 1.10 AX-583035257 0 91677 T C
## 1.10 AX-583033504 0 141489 C T
We can see that they are not in the same order and that the SNP ids are different. We can use the file we created earlier to update the ids.
Check the file
## 1.1 chr1.1 97856 chr1.1_97856 AX-581444870
## 1.1 chr1.1 161729 chr1.1_161729 AX-583033226
## 1.1 chr1.1 229640 chr1.1_229640 AX-583035067
## 1.1 chr1.1 305518 chr1.1_305518 AX-583035083
## 1.1 chr1.1 308124 chr1.1_308124 AX-583035102
## 1.1 chr1.1 311920 chr1.1_311920 AX-583033340
## 1.1 chr1.1 315059 chr1.1_315059 AX-583033342
## 1.1 chr1.1 315386 chr1.1_315386 AX-583035163
## 1.1 chr1.1 315674 chr1.1_315674 AX-583033356
## 1.1 chr1.1 330057 chr1.1_330057 AX-583033370
We can import the files, but we must keep the same row order as in “wgs_chip_18.bim”, so we create an index once we import it.
# Define file paths using here
bim_file <-
here("output", "wgs_vs_chip", "new_calls", "wgs_chip_18.bim")
snp_ids_file <-
here("output", "wgs_vs_chip", "new_calls", "wgs_snps_ids.txt")
output_file <-
here("output",
"wgs_vs_chip",
"new_calls",
"wgs_chip_18_updated.bim")
# Import the .bim file
bim_data <- read_delim(
bim_file,
delim = "\t",
show_col_types = FALSE,
col_names = c("chr", "id_match", "cm", "bp", "allele1", "allele2"),
col_types = cols(.default = col_character())
)
# Create an index column
bim_data <-
bim_data |>
mutate(index = row_number()) |>
# Remove the string "chr" from the chr column
mutate(chr = str_remove(chr, "chr"))
# Import the .txt file
snp_ids <- read_delim(
snp_ids_file,
delim = "\t",
show_col_types = FALSE,
col_names = c("chr_ref", "id_ref", "bp_ref", "id_match", "snp_id"),
col_types = cols(.default = col_character())
)
# Merge the two data frames on the ANGSD-style id (chromosome_position) stored in id_match
merged_data <-
left_join(bim_data, snp_ids, by = "id_match") |>
dplyr::select(
chr, snp_id, cm, bp, allele1, allele2
)
# Check output
head(merged_data)
## # A tibble: 6 × 6
## chr snp_id cm bp allele1 allele2
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2.206 <NA> 0 14153 G A
## 2 2.206 <NA> 0 41198 G T
## 3 2.206 <NA> 0 46216 C T
## 4 2.206 <NA> 0 46416 G A
## 5 2.206 <NA> 0 47314 T G
## 6 2.206 <NA> 0 64862 A G
# Write the updated data frame to a new .bim file without headers or quotes
write.table(
merged_data,
file = output_file,
sep = "\t",
quote = FALSE,
row.names = FALSE,
col.names = FALSE
)
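The join above returns NA for any variant whose id has no match in the id table, so it helps to count the unmatched variants before replacing the .bim file. Below is a minimal shell sketch of the same lookup. It assumes tab-delimited inputs with the column layout shown above (id table: matching id in column 4, new SNP id in column 5; .bim: variant id in column 2). The function name `update_bim_ids` is hypothetical, and unlike the R code it keeps the old id when there is no match instead of writing NA.

```shell
# Hypothetical helper: replace column 2 of a .bim file using an id table.
# $1 = id table (col 4 = id used in the .bim, col 5 = new SNP id)
# $2 = .bim file (col 2 = variant id)
# The updated .bim goes to stdout; the unmatched count goes to stderr.
update_bim_ids() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { new_id[$4] = $5; next }      # first file: build the lookup
       {
         if ($2 in new_id) { $2 = new_id[$2] }  # matched: swap in the new id
         else { unmatched++ }                   # unmatched: keep the old id
         print
       }
       END { printf "%d unmatched ids\n", unmatched > "/dev/stderr" }' "$1" "$2"
}

# Toy example (the alleles and the second variant are made up)
tmp=$(mktemp -d)
printf '1.1\tchr1.1\t97856\tchr1.1_97856\tAX-581444870\n' > "$tmp/ids.txt"
printf '1.1\tchr1.1_97856\t0\t97856\tG\tA\n1.1\tno_match\t0\t100\tC\tT\n' > "$tmp/demo.bim"
update_bim_ids "$tmp/ids.txt" "$tmp/demo.bim"
```

A non-zero unmatched count suggests that the id columns of the two files do not use the same format.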
Now rename the current .bim file by adding “backup” to its name, then remove “updated” from the name of the new file so that it replaces the current .bim file.
Compare both .bim files to see if they look okay
Before
## 2.206 NA 0 14153 G A
## 2.206 NA 0 41198 G T
## 2.206 NA 0 46216 C T
## 2.206 NA 0 46416 G A
## 2.206 NA 0 47314 T G
## 2.206 NA 0 64862 A G
## 2.206 NA 0 67410 C T
## 2.206 NA 0 69313 A C
## 2.206 NA 0 71859 A T
## 2.206 NA 0 72355 A G
After
## 2.206 NA 0 14153 G A
## 2.206 NA 0 41198 G T
## 2.206 NA 0 46216 C T
## 2.206 NA 0 46416 G A
## 2.206 NA 0 47314 T G
## 2.206 NA 0 64862 A G
## 2.206 NA 0 67410 C T
## 2.206 NA 0 69313 A C
## 2.206 NA 0 71859 A T
## 2.206 NA 0 72355 A G
It looks okay. We can replace the original file with the new file
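The backup-and-replace step takes two `mv` calls, as is done for the 30-sample file further down. Below is a minimal sketch wrapped in a hypothetical function, `swap_in_updated`, assuming the `_updated` and `_backup` naming used in this document. It refuses to run when a backup already exists, so repeating the step cannot overwrite the original file.

```shell
# Hypothetical helper: back up <prefix>.bim and promote <prefix>_updated.bim.
# Refuses to run twice so that an existing backup is never overwritten.
swap_in_updated() {
  local prefix=$1
  if [ ! -f "${prefix}_updated.bim" ]; then
    echo "missing ${prefix}_updated.bim" >&2
    return 1
  fi
  if [ -e "${prefix}_backup.bim" ]; then
    echo "${prefix}_backup.bim already exists" >&2
    return 1
  fi
  mv "${prefix}.bim" "${prefix}_backup.bim" &&
    mv "${prefix}_updated.bim" "${prefix}.bim"
}

# Example call for the 18-sample file (prefix taken from the paths above):
# swap_in_updated output/wgs_vs_chip/new_calls/wgs_chip_18
```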
We can check if everything is working by checking the reference allele using the genome without the string ‘chr’
plink2 \
--allow-extra-chr \
--bfile output/wgs_vs_chip/new_calls/wgs_chip_18 \
--make-bed \
--fa data/genome/albo.fasta.gz \
--ref-from-fa 'force' `# sets REF alleles when it can be done unambiguously, we use force to change the alleles` \
--out output/wgs_vs_chip/new_calls/wgs_chip_18_samples \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants" output/wgs_vs_chip/new_calls/wgs_chip_18_samples.log # to get the number of variants from the log file.
## 165104 variants loaded from output/wgs_vs_chip/new_calls/wgs_chip_18.bim.
## --ref-from-fa force: 35328 variants changed, 129772 validated.
We updated the alleles and now we can do the same operation for the other file with the 30 samples.
# Define file paths using here
bim_file <-
here("output", "wgs_vs_chip", "new_calls", "wgs_chip_30.bim")
snp_ids_file <-
here("output", "wgs_vs_chip", "new_calls", "wgs_snps_ids.txt")
output_file <-
here("output",
"wgs_vs_chip",
"new_calls",
"wgs_chip_30_updated.bim")
# Import the .bim file
bim_data <- read_delim(
bim_file,
delim = "\t",
show_col_types = FALSE,
col_names = c("chr", "id_match", "cm", "bp", "allele1", "allele2"),
col_types = cols(.default = col_character())
)
# Create an index column
bim_data <-
bim_data |>
mutate(index = row_number()) |>
# Remove the string "chr" from the chr column
mutate(chr = str_remove(chr, "chr"))
# Import the .txt file
snp_ids <- read_delim(
snp_ids_file,
delim = "\t",
show_col_types = FALSE,
col_names = c("chr_ref", "id_ref", "bp_ref", "id_match", "snp_id"),
col_types = cols(.default = col_character())
)
# Merge the two data frames by matching the id_match column, which is present in both bim_data and snp_ids
merged_data <-
left_join(bim_data, snp_ids, by = "id_match") |>
dplyr::select(
chr, snp_id, cm, bp, allele1, allele2
)
# Check output
head(merged_data)
## # A tibble: 6 × 6
## chr snp_id cm bp allele1 allele2
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2.206 <NA> 0 14153 G A
## 2 2.206 <NA> 0 41198 G T
## 3 2.206 <NA> 0 46216 C T
## 4 2.206 <NA> 0 46416 G A
## 5 2.206 <NA> 0 47314 T G
## 6 2.206 <NA> 0 64862 A G
# Write the updated data frame to a new .bim file without headers or quotes
write.table(
merged_data,
file = output_file,
sep = "\t",
quote = FALSE,
row.names = FALSE,
col.names = FALSE
)
Compare both .bim files to see if they look okay
Before
## 2.206 NA 0 14153 G A
## 2.206 NA 0 41198 G T
## 2.206 NA 0 46216 C T
## 2.206 NA 0 46416 G A
## 2.206 NA 0 47314 T G
## 2.206 NA 0 64862 A G
## 2.206 NA 0 67410 C T
## 2.206 NA 0 69313 A C
## 2.206 NA 0 71859 A T
## 2.206 NA 0 72355 A G
After
## 2.206 NA 0 14153 G A
## 2.206 NA 0 41198 G T
## 2.206 NA 0 46216 C T
## 2.206 NA 0 46416 G A
## 2.206 NA 0 47314 T G
## 2.206 NA 0 64862 A G
## 2.206 NA 0 67410 C T
## 2.206 NA 0 69313 A C
## 2.206 NA 0 71859 A T
## 2.206 NA 0 72355 A G
It looks okay. We can replace the original file with the new file
mv output/wgs_vs_chip/new_calls/wgs_chip_30.bim output/wgs_vs_chip/new_calls/wgs_chip_30_backup.bim;
mv output/wgs_vs_chip/new_calls/wgs_chip_30_updated.bim output/wgs_vs_chip/new_calls/wgs_chip_30.bim
We can check if everything is working by checking the reference allele using the genome without the string ‘chr’
plink2 \
--allow-extra-chr \
--bfile output/wgs_vs_chip/new_calls/wgs_chip_30 \
--make-bed \
--fa data/genome/albo.fasta.gz \
--ref-from-fa 'force' `# sets REF alleles when it can be done unambiguously, we use force to change the alleles` \
--out output/wgs_vs_chip/new_calls/wgs_chip_30_samples \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants" output/wgs_vs_chip/new_calls/wgs_chip_30_samples.log # to get the number of variants from the log file.
## 169798 variants loaded from output/wgs_vs_chip/new_calls/wgs_chip_30.bim.
## --ref-from-fa force: 36269 variants changed, 133525 validated.
The bed file with the genotypes for the 18 samples extracted after a genotype call with all 819 samples is in our directory. We already set the reference alleles. Check the log of the file
## PLINK v2.00a3.3 64-bit (3 Jun 2022)
## Options in effect:
## --allow-extra-chr
## --bfile data/raw_data/albo/wgs_vs_chip/wgs
## --fa data/genome/albo.fasta.gz
## --make-bed
## --out output/wgs_vs_chip/wgs_01
## --ref-from-fa force
## --silent
##
## Hostname: LucianoCosme.wireless.yale.internal
## Working directory: /Users/lucianocosme/Library/CloudStorage/Dropbox/Albopictus/manuscript_chip/data/no_autogenous/albo_chip
## Start time: Mon Aug 28 10:26:14 2023
##
## Random number seed: 1693232774
## 32768 MiB RAM detected; reserving 16384 MiB for main workspace.
## Using up to 12 threads (change this with --threads).
## 18 samples (0 females, 0 males, 18 ambiguous; 18 founders) loaded from
## data/raw_data/albo/wgs_vs_chip/wgs.fam.
## 175360 variants loaded from data/raw_data/albo/wgs_vs_chip/wgs.bim.
## Note: No phenotype data present.
## --ref-from-fa force: 0 variants changed, 175360 validated.
## Writing output/wgs_vs_chip/wgs_01.fam ... done.
## Writing output/wgs_vs_chip/wgs_01.bim ... done.
## Writing output/wgs_vs_chip/wgs_01.bed ... done.
##
## End time: Mon Aug 28 10:26:17 2023
For the chip calls we will use only the default prior. We will have the 3 data sets: call using 18 samples, call using a plate (95 samples), and call using all wild samples (515 samples).
We need to make sure the sex is correct in all files. We can add letters to the sample ids to separate each data set:
a - chip call with 18 samples
b - chip call with plate (95 samples)
c - chip call with 500+ samples
w - wgs call with 800+ samples
x - wgs call with 30 samples
y - wgs call with 18 samples
Check the log of Plink when we set alleles for the data set with the 18 samples only.
## PLINK v2.00a3.3 64-bit (3 Jun 2022)
## Options in effect:
## --allow-extra-chr
## --const-fid
## --fa data/genome/albo.fasta.gz
## --make-bed
## --out output/wgs_vs_chip/chip_dp_01
## --ref-from-fa force
## --silent
## --vcf data/raw_data/albo/wgs_vs_chip/wgs_default_prior_recommended_june_16_2023.vcf
##
## Hostname: LucianoCosme.wireless.yale.internal
## Working directory: /Users/lucianocosme/Library/CloudStorage/Dropbox/Albopictus/manuscript_chip/data/no_autogenous/albo_chip
## Start time: Mon Aug 28 10:26:09 2023
##
## Random number seed: 1693232769
## 32768 MiB RAM detected; reserving 16384 MiB for main workspace.
## Using up to 12 threads (change this with --threads).
## --vcf: 105607 variants scanned.
## --vcf: output/wgs_vs_chip/chip_dp_01-temporary.pgen +
## output/wgs_vs_chip/chip_dp_01-temporary.pvar.zst +
## output/wgs_vs_chip/chip_dp_01-temporary.psam written.
## 18 samples (0 females, 0 males, 18 ambiguous; 18 founders) loaded from
## output/wgs_vs_chip/chip_dp_01-temporary.psam.
## 105607 variants loaded from output/wgs_vs_chip/chip_dp_01-temporary.pvar.zst.
## Note: No phenotype data present.
## --ref-from-fa force: 0 variants changed, 105607 validated.
## Writing output/wgs_vs_chip/chip_dp_01.fam ... done.
## Writing output/wgs_vs_chip/chip_dp_01.bim ... done.
## Writing output/wgs_vs_chip/chip_dp_01.bed ... done.
##
## End time: Mon Aug 28 10:26:11 2023
Import the new results (95 and 515 samples). I used the default prior for both. A separate document compares the priors and decides whether the new priors are worth using.
# I created a fam file with the information about each sample, but first we import the data and create a bed file setting the family id constant
plink2 \
--allow-extra-chr \
--vcf data/raw_data/albo/wgs_vs_chip/chip_wgs_plate_june_28_dp.vcf \
--const-fid \
--make-bed \
--fa data/genome/albo.fasta.gz \
--ref-from-fa 'force' `# sets REF alleles when it can be done unambiguously, we use force to change the alleles` \
--out output/wgs_vs_chip/chip_plate_dp_01 `# dp - default priors` \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants" output/wgs_vs_chip/chip_plate_dp_01.log # to get the number of variants from the log file.
## --vcf: 104895 variants scanned.
## 104895 variants loaded from
## --ref-from-fa force: 0 variants changed, 104895 validated.
Now the chip calls using 500+ samples
Import the fam file we use with Axiom Suite
# the order of the rows in this file does not matter
samples <-
read.delim(
file = here(
"data",
"raw_data",
"albo",
"wgs_vs_chip",
"sample_ped_info_2.txt"
),
header = TRUE
)
head(samples)
## Sample.Filename Family_ID Individual_ID Father_ID Mother_ID Sex
## 1 8_MAN_Brazil.CEL MAU 8 0 0 0
## 2 9_MAN_Brazil.CEL MAU 9 0 0 0
## 3 16_MAN_Brazil.CEL MAU 16 0 0 0
## 4 17_MAN_Brazil.CEL MAU 17 0 0 0
## 5 18_MAN_Brazil.CEL MAU 18 0 0 0
## 6 60_MAN_Brazil.CEL MAU 60 0 0 0
## Affection.Status
## 1 -9
## 2 -9
## 3 -9
## 4 -9
## 5 -9
## 6 -9
Import the .fam file that Plink2 wrote when we created the bed file
# The fam file is the same for both data sets with the default or new priors
fam1 <-
read.delim(
file = here(
"output", "wgs_vs_chip", "chip_plate_dp_01.fam"
),
header = FALSE,
)
head(fam1)
## V1 V2 V3 V4 V5 V6
## 1 0 601_Debug027_A12.CEL 0 0 0 -9
## 2 0 602_Debug027_A2.CEL 0 0 0 -9
## 3 0 603_Debug027_A5.CEL 0 0 0 -9
## 4 0 604_Debug027_B1.CEL 0 0 0 -9
## 5 0 605_Debug027_B2.CEL 0 0 0 -9
## 6 0 606_Debug027_B3.CEL 0 0 0 -9
We can merge the two data frames
# to keep the same order of the .fam file, we will first create an index based on the numbers of the samples, then use it to keep the order
# Extract the number part from the columns
fam1_temp <- fam1 |>
mutate(num_id = as.numeric(str_extract(V2, "^\\d+")))
samples_temp <- samples |>
mutate(num_id = as.numeric(str_extract(Sample.Filename, "^\\d+")))
# Perform the left join using the num_id columns and keep the order of fam1
df <- fam1_temp |>
dplyr::left_join(samples_temp, by = "num_id") |>
dplyr::select(-num_id) |>
dplyr::select(8:13)
# check the data frame
head(df)
## Family_ID Individual_ID Father_ID Mother_ID Sex Affection.Status
## 1 KAT 7 0 0 1 -9
## 2 GEL 602 0 0 0 -9
## 3 GEL 603 0 0 0 -9
## 4 KAT 8 0 0 1 -9
## 5 KAT 9 0 0 1 -9
## 6 KAT 10 0 0 1 -9
We can check how many samples we have in our file
## [1] 95
Before you save the new .fam file, you can rename the original file so that you can compare the order later. If you want to repeat the steps above after overwriting the .fam file, you will need to import the vcf again.
# Save and override the .fam file for dp
write.table(
df,
file = here(
"output", "wgs_vs_chip", "chip_plate_dp_01.fam"
),
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
Now we have to subset the data set to keep only the samples from KAT and SAI. We can create a file with the samples we have to keep using the .fam file of our previous call.
Check the .fam file
## KAT 7a 0 0 2 -9
## KAT 8a 0 0 2 -9
## KAT 9a 0 0 2 -9
## KAT 10a 0 0 2 -9
## KAT 11a 0 0 2 -9
## KAT 12a 0 0 2 -9
## SAI 4a 0 0 2 -9
## SAI 5a 0 0 2 -9
## SAI 1a 0 0 2 -9
## SAI 2a 0 0 2 -9
We need to remove the “a”
awk '{sub(/a$/, "", $2); print $1, $2}' output/wgs_vs_chip/chip_dp_01.fam > output/wgs_vs_chip/chip_samples_subset.txt; # remove only the trailing "a" from the sample id
head output/wgs_vs_chip/chip_samples_subset.txt
## KAT 7
## KAT 8
## KAT 9
## KAT 10
## KAT 11
## KAT 12
## SAI 4
## SAI 5
## SAI 1
## SAI 2
Now subset the samples
plink2 \
--allow-extra-chr \
--bfile output/wgs_vs_chip/chip_plate_dp_01 \
--make-bed \
--keep output/wgs_vs_chip/chip_samples_subset.txt \
--out output/wgs_vs_chip/chip_plate_dp_02 \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants\|samples" output/wgs_vs_chip/chip_plate_dp_02.log # to get the number of variants from the log file.
## --keep output/wgs_vs_chip/chip_samples_subset.txt
## 95 samples (21 females, 60 males, 14 ambiguous; 95 founders) loaded from
## 104895 variants loaded from output/wgs_vs_chip/chip_plate_dp_01.bim.
## --keep: 18 samples remaining.
## 18 samples (0 females, 18 males; 18 founders) remaining after main filters.
Check the new .fam file to see if it has the order and the sample attributes we want.
Check the fam file of the call with 18 samples
# you can open the file on a text editor and double check the sample order and information.
head -n 5 output/wgs_vs_chip/chip_dp_01.fam
## KAT 7a 0 0 2 -9
## KAT 8a 0 0 2 -9
## KAT 9a 0 0 2 -9
## KAT 10a 0 0 2 -9
## KAT 11a 0 0 2 -9
Check the plate data
# you can open the file on a text editor and double check the sample order and information.
head -n 5 output/wgs_vs_chip/chip_plate_dp_02.fam
## KAT 7 0 0 1 -9
## KAT 8 0 0 1 -9
## KAT 9 0 0 1 -9
## KAT 10 0 0 1 -9
## KAT 11 0 0 1 -9
The sex codes are inconsistent between the two files, and we still need to add a letter to the sample ids in “chip_plate_dp_02.fam”. Let's use awk to add the letter “b”.
# Run this only once
awk '{$2 = $2 "b"; print $0}' output/wgs_vs_chip/chip_plate_dp_02.fam > output/wgs_vs_chip/chip_plate_dp_02_new.fam && mv output/wgs_vs_chip/chip_plate_dp_02_new.fam output/wgs_vs_chip/chip_plate_dp_02.fam;
# Check the output
head output/wgs_vs_chip/chip_plate_dp_02.fam
## KAT 7b 0 0 1 -9
## KAT 8b 0 0 1 -9
## KAT 9b 0 0 1 -9
## KAT 10b 0 0 1 -9
## KAT 11b 0 0 1 -9
## KAT 12b 0 0 1 -9
## SAI 4b 0 0 1 -9
## SAI 5b 0 0 1 -9
## SAI 1b 0 0 1 -9
## SAI 2b 0 0 1 -9
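The awk command above has to be run only once, because every extra run appends another “b”. Below is a minimal sketch of a variant that is safe to repeat, assuming the sample ids are plain numbers that never end with the suffix letter on their own. The function name `add_suffix` is hypothetical.

```shell
# Hypothetical helper: append a suffix letter to column 2 of a .fam file,
# skipping ids that already end with that letter.
# $1 = .fam file, $2 = suffix letter; the result goes to stdout.
add_suffix() {
  awk -v s="$2" '{
    if (substr($2, length($2)) != s) { $2 = $2 s }  # add the letter only once
    print
  }' "$1"
}

# Example call (write to a new file first, then replace the original):
# add_suffix output/wgs_vs_chip/chip_plate_dp_02.fam b > tmp.fam && mv tmp.fam output/wgs_vs_chip/chip_plate_dp_02.fam
```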
I fixed the sex manually and created new file
## KAT 7b 0 0 2 -9
## KAT 8b 0 0 2 -9
## KAT 9b 0 0 2 -9
## KAT 10b 0 0 2 -9
## KAT 11b 0 0 2 -9
## KAT 12b 0 0 2 -9
## SAI 4b 0 0 2 -9
## SAI 5b 0 0 2 -9
## SAI 1b 0 0 2 -9
## SAI 2b 0 0 2 -9
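The manual fix could also be scripted. Below is a minimal sketch, assuming every sample kept in the subset should carry the same sex code (2, as in the 18-sample chip call shown earlier). The function name `set_sex` is hypothetical.

```shell
# Hypothetical helper: set the sex column (column 5) of a .fam file.
# $1 = .fam file, $2 = sex code (1 = male, 2 = female, 0 = unknown).
# The result goes to stdout, space-delimited.
set_sex() {
  awk -v sex="$2" '{ $5 = sex; print }' "$1"
}

# Example call:
# set_sex output/wgs_vs_chip/chip_plate_dp_02.fam 2 > output/wgs_vs_chip/chip_plate_dp_02_fixed.fam
```

Writing to a new file keeps the original available for comparison, which matches the backup habit used elsewhere in this document. The output file name in the example call is only an illustration.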
We can use ‘c’ for the data set from the call with 500+ samples.
We can also update the .fam file of the wgs data, adding letters to the samples. We will then merge the bed files and use code to create vcf files with pairs of samples setting missingness to zero.
Check the wgs data
# you can open the file on a text editor and double check the sample order and information.
head -n 5 output/wgs_vs_chip/new_calls/wgs_chip_30_samples.fam
## 1 1 0 0 0 -9
## 2 1 0 0 0 -9
## 3 1 0 0 0 -9
## 4 1 0 0 0 -9
## 5 1 0 0 0 -9
ANGSD creates the file with the samples following the order of the samples in our list of cram files
## /ycga-gpfs/project/caccone/lvc26/september_2020/crams/Kathmandu_Nepal_F_10.cram
## /ycga-gpfs/project/caccone/lvc26/september_2020/crams/Kathmandu_Nepal_F_11.cram
## /ycga-gpfs/project/caccone/lvc26/september_2020/crams/Kathmandu_Nepal_F_12.cram
## /ycga-gpfs/project/caccone/lvc26/september_2020/crams/Kathmandu_Nepal_F_7.cram
## /ycga-gpfs/project/caccone/lvc26/september_2020/crams/Kathmandu_Nepal_F_8.cram
I created a file with 3 columns: Family id, sex, individual id
## KAT 2 10
## KAT 2 11
## KAT 2 12
## KAT 2 7
## KAT 2 8
Now we can use the file with the name of the samples to replace columns in the .fam file
# Create new fam
paste output/wgs_vs_chip/new_calls/wgs_chip_30_samples.fam output/wgs_vs_chip/new_calls/crams_30_names_sex.txt| awk '{print $7, $9, $3, $4, $8, $6}' > output/wgs_vs_chip/new_calls/merged_30.fam;
# Check it
head output/wgs_vs_chip/new_calls/merged_30.fam;
# Backup and replace
mv output/wgs_vs_chip/new_calls/wgs_chip_30_samples.fam output/wgs_vs_chip/new_calls/wgs_chip_30_samples_backup.fam;
mv output/wgs_vs_chip/new_calls/merged_30.fam output/wgs_vs_chip/new_calls/wgs_chip_30_samples.fam
## KAT 10 0 0 2 -9
## KAT 11 0 0 2 -9
## KAT 12 0 0 2 -9
## KAT 7 0 0 2 -9
## KAT 8 0 0 2 -9
## KAT 9 0 0 2 -9
## KAT 1 0 0 1 -9
## KAT 2 0 0 1 -9
## KAT 3 0 0 1 -9
## KAT 4 0 0 1 -9
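Because `paste` joins the two files line by line, the .fam file and the names file must have the same number of rows, in the same sample order. Below is a small guard that could be run before the `paste` step. The function name `same_line_count` is hypothetical, and it checks only the row count, not the order.

```shell
# Hypothetical helper: succeed only if two files have the same number of lines.
same_line_count() {
  local n1 n2
  n1=$(awk 'END { print NR }' "$1")
  n2=$(awk 'END { print NR }' "$2")
  [ "$n1" -eq "$n2" ]
}

# Example call with the files used above:
# same_line_count output/wgs_vs_chip/new_calls/wgs_chip_30_samples.fam \
#   output/wgs_vs_chip/new_calls/crams_30_names_sex.txt || echo "row counts differ" >&2
```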
We have to repeat it for the other wgs data sets
18 samples
# Create new fam
paste output/wgs_vs_chip/new_calls/wgs_chip_18_samples.fam output/wgs_vs_chip/new_calls/crams_18_names_sex.txt| awk '{print $7, $9, $3, $4, $8, $6}' > output/wgs_vs_chip/new_calls/merged_18.fam;
# Check it
head output/wgs_vs_chip/new_calls/merged_18.fam;
# Backup and replace
mv output/wgs_vs_chip/new_calls/wgs_chip_18_samples.fam output/wgs_vs_chip/new_calls/wgs_chip_18_samples_backup.fam;
mv output/wgs_vs_chip/new_calls/merged_18.fam output/wgs_vs_chip/new_calls/wgs_chip_18_samples.fam
## KAT 10 0 0 2 -9
## KAT 11 0 0 2 -9
## KAT 12 0 0 2 -9
## KAT 7 0 0 2 -9
## KAT 8 0 0 2 -9
## KAT 9 0 0 2 -9
## SAI 1 0 0 2 -9
## SAI 2 0 0 2 -9
## SAI 3 0 0 2 -9
## SAI 4 0 0 2 -9
Check the file extracted from the 819 samples genotype call
## SAI 5w 0 0 0 -9
## SAI 4w 0 0 0 -9
## SAI 3w 0 0 0 -9
## SAI 2w 0 0 0 -9
## SAI 1w 0 0 0 -9
## SAI 18w 0 0 0 -9
## SAI 17w 0 0 0 -9
## SAI 16w 0 0 0 -9
## SAI 15w 0 0 0 -9
## SAI 14w 0 0 0 -9
I created a new file with the “w” suffix and the sex set to 2
## SAI 5w 0 0 2 -9
## SAI 4w 0 0 2 -9
## SAI 3w 0 0 2 -9
## SAI 2w 0 0 2 -9
## SAI 1w 0 0 2 -9
## SAI 18w 0 0 2 -9
## SAI 17w 0 0 2 -9
## SAI 16w 0 0 2 -9
## SAI 15w 0 0 2 -9
## SAI 14w 0 0 2 -9
Let's make sure the sex is set the same in all files
Check “a” chip call with 18 samples
## KAT 7a 0 0 2 -9
## KAT 8a 0 0 2 -9
## KAT 9a 0 0 2 -9
## KAT 10a 0 0 2 -9
## KAT 11a 0 0 2 -9
## KAT 12a 0 0 2 -9
## SAI 4a 0 0 2 -9
## SAI 5a 0 0 2 -9
## SAI 1a 0 0 2 -9
## SAI 2a 0 0 2 -9
Check “b” chip call with plate
## KAT 7b 0 0 2 -9
## KAT 8b 0 0 2 -9
## KAT 9b 0 0 2 -9
## KAT 10b 0 0 2 -9
## KAT 11b 0 0 2 -9
## KAT 12b 0 0 2 -9
## SAI 4b 0 0 2 -9
## SAI 5b 0 0 2 -9
## SAI 1b 0 0 2 -9
## SAI 2b 0 0 2 -9
Check “c” chip call with 500+ samples
We need to prepare the bed file first.
# I created a fam file with the information about each sample, but first we import the data and create a bed file setting the family id constant
plink2 \
--allow-extra-chr \
--vcf data/raw_data/albo/wgs_vs_chip/manuscript_dp_june_28.vcf \
--const-fid \
--make-bed \
--fa data/genome/albo.fasta.gz \
--ref-from-fa 'force' `# sets REF alleles when it can be done unambiguously, we use force to change the alleles` \
--out output/wgs_vs_chip/chip_500_dp_01 `# dp - default priors` \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants" output/wgs_vs_chip/chip_500_dp_01.log # to get the number of variants from the log file.
## --vcf: 107294 variants scanned.
## 107294 variants loaded from
## --ref-from-fa force: 0 variants changed, 107294 validated.
Import the fam file we use with Axiom Suite
# the order of the rows in this file does not matter
samples <-
read.delim(
file = here(
"data",
"raw_data",
"albo",
"wgs_vs_chip",
"sample_ped_info_ALLPOPS_for_comparisons.txt"
),
header = TRUE
)
head(samples)
## Sample.Filename Family_ID Individual_ID Father_ID Mother_ID Sex
## 1 8_MAN_Brazil.CEL MAU 8 0 0 0
## 2 9_MAN_Brazil.CEL MAU 9 0 0 0
## 3 16_MAN_Brazil.CEL MAU 16 0 0 0
## 4 17_MAN_Brazil.CEL MAU 17 0 0 0
## 5 18_MAN_Brazil.CEL MAU 18 0 0 0
## 6 60_MAN_Brazil.CEL MAU 60 0 0 0
## Affection.Status
## 1 -9
## 2 -9
## 3 -9
## 4 -9
## 5 -9
## 6 -9
Import the .fam file that Plink2 wrote when we created the bed file
# The fam file is the same for both data sets with the default or new priors
fam1 <-
read.delim(
file = here(
"output", "wgs_vs_chip", "chip_500_dp_01.fam"
),
header = FALSE,
)
head(fam1)
## V1 V2 V3 V4 V5 V6
## 1 0 1001_OKI.CEL 0 0 0 -9
## 2 0 1002_OKI.CEL 0 0 0 -9
## 3 0 1003_OKI.CEL 0 0 0 -9
## 4 0 1004_OKI.CEL 0 0 0 -9
## 5 0 1005_OKI.CEL 0 0 0 -9
## 6 0 1006_OKI.CEL 0 0 0 -9
We can merge the two data frames.
# to keep the same order of the .fam file, we will first create an index based on the numbers of the samples, then use it to keep the order
# Extract the number part from the columns
fam1_temp <- fam1 |>
mutate(num_id = as.numeric(str_extract(V2, "^\\d+")))
samples_temp <- samples |>
mutate(num_id = as.numeric(str_extract(Sample.Filename, "^\\d+")))
# Perform the left join using the num_id columns and keep the order of fam1
df <- fam1_temp |>
dplyr::left_join(samples_temp, by = "num_id") |>
dplyr::select(-num_id) |>
dplyr::select(8:13)
# check the data frame
head(df)
## Family_ID Individual_ID Father_ID Mother_ID Sex Affection.Status
## 1 OKI 1001 0 0 2 -9
## 2 OKI 1002 0 0 2 -9
## 3 OKI 1003 0 0 2 -9
## 4 OKI 1004 0 0 2 -9
## 5 OKI 1005 0 0 2 -9
## 6 OKI 1006 0 0 1 -9
We can check how many samples we have in our file
## [1] 479
Before you save the new .fam file, you can rename the original file so that you can compare the order later. If you want to repeat the steps above after overwriting the .fam file, you will need to import the vcf again.
# Save and override the .fam file for dp
write.table(
df,
file = here("output", "wgs_vs_chip", "chip_500_dp_01.fam"),
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
Now we have to subset the data set to keep only the samples from KAT and SAI. We can create a file with the samples we have to keep using the .fam file of our previous call.
Check the .fam file
## OKI 1001 0 0 2 -9
## OKI 1002 0 0 2 -9
## OKI 1003 0 0 2 -9
## OKI 1004 0 0 2 -9
## OKI 1005 0 0 2 -9
## OKI 1006 0 0 1 -9
## OKI 1007 0 0 1 -9
## OKI 1008 0 0 1 -9
## OKI 1009 0 0 1 -9
## OKI 1010 0 0 1 -9
Now we have to select only the 18 samples for our comparisons.
plink2 \
--allow-extra-chr \
--bfile output/wgs_vs_chip/chip_500_dp_01 \
--make-bed \
--keep output/wgs_vs_chip/chip_samples_subset.txt \
--out output/wgs_vs_chip/chip_500_dp_02 \
--silent;
# --keep-allele-order \ if you use Plink 1.9
grep "variants\|samples" output/wgs_vs_chip/chip_500_dp_02.log # to get the number of variants from the log file.
## --keep output/wgs_vs_chip/chip_samples_subset.txt
## 479 samples (138 females, 130 males, 211 ambiguous; 479 founders) loaded from
## 107294 variants loaded from output/wgs_vs_chip/chip_500_dp_01.bim.
## --keep: 18 samples remaining.
## 18 samples (0 females, 18 males; 18 founders) remaining after main filters.
Check the .fam file
## KAT 7 0 0 1 -9
## KAT 8 0 0 1 -9
## KAT 9 0 0 1 -9
## KAT 10 0 0 1 -9
## KAT 11 0 0 1 -9
## KAT 12 0 0 1 -9
## SAI 4 0 0 1 -9
## SAI 5 0 0 1 -9
## SAI 1 0 0 1 -9
## SAI 2 0 0 1 -9
After fixing the sex and adding the letter “c”
## KAT 7c 0 0 2 -9
## KAT 8c 0 0 2 -9
## KAT 9c 0 0 2 -9
## KAT 10c 0 0 2 -9
## KAT 11c 0 0 2 -9
## KAT 12c 0 0 2 -9
## SAI 4c 0 0 2 -9
## SAI 5c 0 0 2 -9
## SAI 1c 0 0 2 -9
## SAI 2c 0 0 2 -9
Check “w” wgs call with 800+ samples
## SAI 5w 0 0 2 -9
## SAI 4w 0 0 2 -9
## SAI 3w 0 0 2 -9
## SAI 2w 0 0 2 -9
## SAI 1w 0 0 2 -9
## SAI 18w 0 0 2 -9
## SAI 17w 0 0 2 -9
## SAI 16w 0 0 2 -9
## SAI 15w 0 0 2 -9
## SAI 14w 0 0 2 -9
Check “x” wgs call with 30 samples (I duplicated the files and added the x manually)
## KAT 10x 0 0 2 -9
## KAT 11x 0 0 2 -9
## KAT 12x 0 0 2 -9
## KAT 7x 0 0 2 -9
## KAT 8x 0 0 2 -9
## KAT 9x 0 0 2 -9
## KAT 1x 0 0 1 -9
## KAT 2x 0 0 1 -9
## KAT 3x 0 0 1 -9
## KAT 4x 0 0 1 -9
Check “y” wgs call with 18 samples
## KAT 10y 0 0 2 -9
## KAT 11y 0 0 2 -9
## KAT 12y 0 0 2 -9
## KAT 7y 0 0 2 -9
## KAT 8y 0 0 2 -9
## KAT 9y 0 0 2 -9
## SAI 1y 0 0 2 -9
## SAI 2y 0 0 2 -9
## SAI 3y 0 0 2 -9
## SAI 4y 0 0 2 -9
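Checking each .fam file by eye works for 18 samples, but the same comparison can be automated. Below is a minimal sketch that strips the data-set letter from each sample id and reports any sample whose sex code differs between files. The function name `sex_conflicts` is hypothetical, and it assumes every id ends in exactly one of the letters a, b, c, w, x, or y.

```shell
# Hypothetical helper: list samples whose sex code differs between .fam files.
# Arguments: any number of .fam files whose ids end in one data-set letter.
# Prints "family id" for every sample seen with more than one sex code.
sex_conflicts() {
  awk '{
    id = $2
    sub(/[abcwxy]$/, "", id)            # drop the data-set letter
    key = $1 " " id
    if (key in sex && sex[key] != $5) { bad[key] = 1 }
    sex[key] = $5
  }
  END { for (k in bad) print k }' "$@" | sort
}

# Example call with two of the files checked above:
# sex_conflicts output/wgs_vs_chip/chip_dp_01.fam output/wgs_vs_chip/chip_plate_dp_02.fam
```

An empty result means that every sample shared between the files has the same sex code.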
Now we can merge the files into a single bed file. We set all the reference alleles to match the reference genome in every data set. This is crucial for our comparisons. We also need to use --keep-allele-order if we use Plink 1.9
We can create a list of the files to merge
# chip
echo 'output/wgs_vs_chip/chip_dp_01
output/wgs_vs_chip/chip_plate_dp_03
output/wgs_vs_chip/chip_500_dp_03
' > output/wgs_vs_chip/merge_list_2.txt
# wgs
echo 'output/wgs_vs_chip/wgs_02
output/wgs_vs_chip/new_calls/wgs_chip_30x
output/wgs_vs_chip/new_calls/wgs_chip_18y
' > output/wgs_vs_chip/merge_list_3.txt
Merge the chip bed files
plink \
--allow-extra-chr \
--keep-allele-order \
--merge-list output/wgs_vs_chip/merge_list_2.txt \
--out output/wgs_vs_chip/chip_3_datasets \
--silent
grep "variants\|samples" output/wgs_vs_chip/chip_3_datasets.log
Merge the wgs bed files
When we run Plink to merge the files, we get an error about sites having three alleles. This happens because ANGSD, a population-based genotype caller, can report different alleles at the same site depending on whether the call used 18, 30, or 819 samples. Plink writes a list of the SNPs that have more than 2 alleles (the .missnp file), which we can check later. Let's count how many SNPs it contains:
## 2755 output/wgs_vs_chip/wgs_3_datasets.missnp
Let’s see how many SNPs have this problem once we decrease the sample size
# wgs 2 - 819 samples vs 30 samples
echo 'output/wgs_vs_chip/wgs_02
output/wgs_vs_chip/new_calls/wgs_chip_30x
' > output/wgs_vs_chip/merge_list_4.txt;
# wgs 3 - 819 samples vs 18 samples
echo 'output/wgs_vs_chip/wgs_02
output/wgs_vs_chip/new_calls/wgs_chip_18y
' > output/wgs_vs_chip/merge_list_5.txt;
# wgs 4 - 30 samples vs 18 samples
echo 'output/wgs_vs_chip/new_calls/wgs_chip_30x
output/wgs_vs_chip/new_calls/wgs_chip_18y
' > output/wgs_vs_chip/merge_list_6.txt;
Now we can try to merge them to see how many SNPs have different alleles
30 versus 819 samples
plink \
--allow-extra-chr \
--keep-allele-order \
--merge-list output/wgs_vs_chip/merge_list_4.txt \
--out output/wgs_vs_chip/wgs_800_vs_30_samples \
--silent
grep "variants\|samples" output/wgs_vs_chip/wgs_800_vs_30_samples.log
We have 2,245 SNPs with 3+ alleles. It happens because the alternative alleles are different in each data set
## 2245 output/wgs_vs_chip/wgs_800_vs_30_samples.missnp
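To see which sites have different alleles in two call sets without running the merge, the two .bim files can be compared directly. Below is a minimal sketch, assuming whitespace-delimited .bim files with the variant id in column 2 and the alleles in columns 5 and 6. The function name `allele_mismatches` is hypothetical. This check is stricter than the Plink merge: it also flags sites where one call set is monomorphic (an allele coded as 0), which Plink may still be able to merge.

```shell
# Hypothetical helper: list variant ids whose allele pair differs between
# two .bim files (col 2 = id, cols 5 and 6 = alleles), ignoring allele order.
allele_mismatches() {
  awk 'function pair(a, b) { return (a < b) ? a "/" b : b "/" a }
       NR == FNR { alleles[$2] = pair($5, $6); next }   # first file: store pairs
       ($2 in alleles) && alleles[$2] != pair($5, $6) { print $2 }' "$1" "$2"
}

# Example call with two of the wgs call sets:
# allele_mismatches output/wgs_vs_chip/wgs_02.bim output/wgs_vs_chip/new_calls/wgs_chip_30x.bim | wc -l
```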
18 versus 819 samples
plink \
--allow-extra-chr \
--keep-allele-order \
--merge-list output/wgs_vs_chip/merge_list_5.txt \
--out output/wgs_vs_chip/wgs_800_vs_18_samples \
--silent
grep "variants\|samples" output/wgs_vs_chip/wgs_800_vs_18_samples.log
We have 2,257 SNPs with 3+ alleles
## 2257 output/wgs_vs_chip/wgs_800_vs_18_samples.missnp
18 versus 30 samples
plink \
--allow-extra-chr \
--keep-allele-order \
--merge-list output/wgs_vs_chip/merge_list_6.txt \
--out output/wgs_vs_chip/wgs_18_vs_30_samples \
--silent
grep "variants\|samples" output/wgs_vs_chip/wgs_18_vs_30_samples.log
We have 882 SNPs with 3+ alleles
## 882 output/wgs_vs_chip/wgs_18_vs_30_samples.missnp
We can get the list of SNPs
cat output/wgs_vs_chip/wgs_800_vs_30_samples.missnp output/wgs_vs_chip/wgs_800_vs_18_samples.missnp output/wgs_vs_chip/wgs_18_vs_30_samples.missnp | awk '!seen[$0]++' > output/wgs_vs_chip/SNPs_wgs_3_alleles.txt;
wc -l output/wgs_vs_chip/SNPs_wgs_3_alleles.txt
## 2755 output/wgs_vs_chip/SNPs_wgs_3_alleles.txt
We need to remove the 2,755 SNPs.
We can remove these SNPs and compare only the remaining ones. Let's also double check that we have only bi-allelic data. Perhaps multi-allelic sites are why we see inconsistencies between the genotype calls from the chip and wgs.
Exclude from 18 samples
plink \
--bfile output/wgs_vs_chip/new_calls/wgs_chip_18y \
--allow-extra-chr \
--keep-allele-order \
--biallelic-only \
--exclude output/wgs_vs_chip/SNPs_wgs_3_alleles.txt \
--out output/wgs_vs_chip/new_calls/wgs_chip_18y_b \
--make-bed \
--silent
grep "variants\|samples" output/wgs_vs_chip/new_calls/wgs_chip_18y_b.log
Exclude from 30 samples
plink \
--bfile output/wgs_vs_chip/new_calls/wgs_chip_30x \
--allow-extra-chr \
--keep-allele-order \
--biallelic-only \
--exclude output/wgs_vs_chip/SNPs_wgs_3_alleles.txt \
--out output/wgs_vs_chip/new_calls/wgs_chip_30x_b \
--make-bed \
--silent
grep "variants\|samples" output/wgs_vs_chip/new_calls/wgs_chip_30x_b.log
Exclude from data subset with 819 samples
plink \
--bfile output/wgs_vs_chip/wgs_02 \
--allow-extra-chr \
--keep-allele-order \
--biallelic-only \
--exclude output/wgs_vs_chip/SNPs_wgs_3_alleles.txt \
--out output/wgs_vs_chip/wgs_03 \
--make-bed \
--silent
grep "variants\|samples" output/wgs_vs_chip/wgs_03.log
Exclude from data subset from 30 samples
plink \
--bfile output/wgs_vs_chip/new_calls/wgs_chip_30x \
--allow-extra-chr \
--keep-allele-order \
--biallelic-only \
--exclude output/wgs_vs_chip/SNPs_wgs_3_alleles_b.txt \
--out output/wgs_vs_chip/new_calls/wgs_chip_30x_c \
--make-bed \
--silent
grep "variants\|samples" output/wgs_vs_chip/new_calls/wgs_chip_30x_c.log
Exclude from data subset from 18 samples
plink \
--bfile output/wgs_vs_chip/new_calls/wgs_chip_18y \
--allow-extra-chr \
--keep-allele-order \
--biallelic-only \
--exclude output/wgs_vs_chip/SNPs_wgs_3_alleles_b.txt \
--out output/wgs_vs_chip/new_calls/wgs_chip_18y_c \
--make-bed \
--silent
grep "variants\|samples" output/wgs_vs_chip/new_calls/wgs_chip_18y_c.log
Now create new list to merge. We can merge all files (chip and wgs) into one single file, but first lets create one file with the wgs samples only
# wgs
echo 'output/wgs_vs_chip/wgs_03
output/wgs_vs_chip/new_calls/wgs_chip_30x_b
output/wgs_vs_chip/new_calls/wgs_chip_18y_b
' > output/wgs_vs_chip/merge_list_7.txt
# all
echo 'output/wgs_vs_chip/wgs_03
output/wgs_vs_chip/new_calls/wgs_chip_30x_b
output/wgs_vs_chip/new_calls/wgs_chip_18y_b
output/wgs_vs_chip/chip_dp_01
output/wgs_vs_chip/chip_plate_dp_03
output/wgs_vs_chip/chip_500_dp_03
' > output/wgs_vs_chip/merge_list_8.txt
WGS
plink \
--allow-extra-chr \
--keep-allele-order \
--merge-list output/wgs_vs_chip/merge_list_7.txt \
--out output/wgs_vs_chip/wgs_3_datasets_b \
--silent;
grep "variants\|samples" output/wgs_vs_chip/wgs_3_datasets_b.log
Merge all data sets
plink \
--allow-extra-chr \
--keep-allele-order \
--merge-list output/wgs_vs_chip/merge_list_8.txt \
--out output/wgs_vs_chip/wgs_chip_merged \
--silent
grep "variants\|samples" output/wgs_vs_chip/wgs_chip_merged.log
Now we have set the reference alleles to match the reference genome, removed the SNPs with more than 1 alternative allele (caused by genotype calls with low sample size), and created a single file. The merged .fam file looks like this:
## KAT 1x 0 0 1 -9
## KAT 2x 0 0 1 -9
## KAT 3x 0 0 1 -9
## KAT 4x 0 0 1 -9
## KAT 5x 0 0 1 -9
## KAT 6x 0 0 1 -9
## KAT 7a 0 0 2 -9
## KAT 7b 0 0 2 -9
## KAT 7c 0 0 2 -9
## KAT 7w 0 0 2 -9
## KAT 7x 0 0 2 -9
## KAT 7y 0 0 2 -9
## KAT 8a 0 0 2 -9
## KAT 8b 0 0 2 -9
## KAT 8c 0 0 2 -9
## KAT 8w 0 0 2 -9
## KAT 8x 0 0 2 -9
## KAT 8y 0 0 2 -9
## KAT 9a 0 0 2 -9
## KAT 9b 0 0 2 -9
## KAT 9c 0 0 2 -9
## KAT 9w 0 0 2 -9
## KAT 9x 0 0 2 -9
## KAT 9y 0 0 2 -9
## KAT 10a 0 0 2 -9
The 18 samples were extracted when more samples were used for the genotype call
a: chip - call with 18 samples
b: chip - call with 95 samples (full plate)
c: chip - call with 500+ samples
w: wgs - call with 800+ samples
x: wgs - call with 30 samples (all wgs samples for both populations)
y: wgs - call with 18 samples (only samples with wgs and chip)
We do not need to do all pairwise comparisons.
Chip: ab, ac, bc
WGS: wx, wy, xy
WGS versus chip: aw, ax, ay, bw, bx, by, cw, cx, cy
All comparisons (* those I will focus on)
Chip
ab - chip_18 vs chip_95
ac - chip_18 vs chip_500 *
bc - chip_95 vs chip_500
WGS
wx - wgs_800 vs wgs_30
wy - wgs_800 vs wgs_18 *
xy - wgs_18 vs wgs_30
Chip vs WGS
aw - chip_18 vs wgs_800
ax - chip_18 vs wgs_30
ay - chip_18 vs wgs_18
bw - chip_95 vs wgs_800
bx - chip_95 vs wgs_30
by - chip_95 vs wgs_18
cw - chip_500 vs wgs_800
cx - chip_500 vs wgs_30
cy - chip_500 vs wgs_18
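The 15 comparisons listed above are all the unordered pairs of the six data-set letters. Instead of typing them by hand, they can be generated with two nested loops. Below is a minimal sketch; the variable names `letters` and `pairs` are only for illustration.

```shell
# Build every unordered pair of data-set letters, in the order given.
letters="a b c w x y"
pairs=""
for first in $letters; do
  seen_first=0
  for second in $letters; do
    # only letters that come after "first" in the list form a new pair
    if [ "$seen_first" -eq 1 ]; then
      pairs="$pairs $first$second"
    fi
    if [ "$second" = "$first" ]; then seen_first=1; fi
  done
done
pairs=${pairs# }   # drop the leading space
echo "$pairs"
```

The resulting list can be used directly in a loop of the form `for combination in $pairs; do ... done`.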
input_file="output/wgs_vs_chip/wgs_chip_merged.fam"
output_dir="output/wgs_vs_chip/vcfs2"
bfile="output/wgs_vs_chip/wgs_chip_merged"
# create the output directory if it does not exist
mkdir -p $output_dir
# get unique families
families=$(awk '{print $1}' $input_file | sort | uniq)
for famid in $families; do
# get the base sample ids (without the a, b, c, w, x, or y suffix)
base_iids=$(grep "$famid" $input_file | awk '{print $2}' | sed 's/[abcwxy]$//' | uniq)
for base_iid in $base_iids; do
for combination in "ab" "ac" "bc" "wx" "wy" "xy" "aw" "ax" "ay" "bw" "bx" "by" "cw" "cx" "cy"; do
# Check if both samples exist
if grep -qE "${famid}\s${base_iid}[${combination:0:1}]\s" "$input_file" &&
grep -qE "${famid}\s${base_iid}[${combination:1:1}]\s" "$input_file"; then
# Create temporary file
tmp_file=$(mktemp)
grep -E "${famid}\s${base_iid}[${combination:0:1}]\s" "$input_file" > "$tmp_file"
grep -E "${famid}\s${base_iid}[${combination:1:1}]\s" "$input_file" >> "$tmp_file"
# Execute plink2
plink2 \
--allow-extra-chr \
--keep-allele-order \
--bfile $bfile \
--keep "$tmp_file" \
--recode vcf-iid \
--geno 0 \
--out "$output_dir/${famid}_${base_iid}${combination}" \
--silent
# Remove temporary file
rm "$tmp_file"
fi
done
done
done
Check how many SNPs per vcf
# Define directory with the vcfs
output_dir="output/wgs_vs_chip/vcfs2"
# Count how many SNPs we have in each vcf file
for file in ${output_dir}/*.vcf; do
echo $(basename $file): $(grep -v '^#' $file | wc -l)
done
## KAT_10ab.vcf: 88159
## KAT_10ac.vcf: 87404
## KAT_10aw.vcf: 101950
## KAT_10ax.vcf: 99196
## KAT_10ay.vcf: 96789
## KAT_10bc.vcf: 92187
## KAT_10bw.vcf: 100087
## KAT_10bx.vcf: 97433
## KAT_10by.vcf: 95133
## KAT_10cw.vcf: 100694
## KAT_10cx.vcf: 97956
## KAT_10cy.vcf: 95593
## KAT_10wx.vcf: 167052
## KAT_10wy.vcf: 162472
## KAT_10xy.vcf: 162349
## KAT_11ab.vcf: 88052
## KAT_11ac.vcf: 87326
## KAT_11aw.vcf: 101657
## KAT_11ax.vcf: 98913
## KAT_11ay.vcf: 96503
## KAT_11bc.vcf: 92556
## KAT_11bw.vcf: 100431
## KAT_11bx.vcf: 97766
## KAT_11by.vcf: 95463
## KAT_11cw.vcf: 101043
## KAT_11cx.vcf: 98301
## KAT_11cy.vcf: 95930
## KAT_11wx.vcf: 167052
## KAT_11wy.vcf: 162472
## KAT_11xy.vcf: 162349
## KAT_12ab.vcf: 87462
## KAT_12ac.vcf: 86666
## KAT_12aw.vcf: 101153
## KAT_12ax.vcf: 98413
## KAT_12ay.vcf: 95990
## KAT_12bc.vcf: 91243
## KAT_12bw.vcf: 99186
## KAT_12bx.vcf: 96579
## KAT_12by.vcf: 94294
## KAT_12cw.vcf: 99524
## KAT_12cx.vcf: 96821
## KAT_12cy.vcf: 94467
## KAT_12wx.vcf: 167052
## KAT_12wy.vcf: 162472
## KAT_12xy.vcf: 162349
## KAT_7ab.vcf: 88177
## KAT_7ac.vcf: 87404
## KAT_7aw.vcf: 101913
## KAT_7ax.vcf: 99167
## KAT_7ay.vcf: 96757
## KAT_7bc.vcf: 92119
## KAT_7bw.vcf: 100143
## KAT_7bx.vcf: 97489
## KAT_7by.vcf: 95188
## KAT_7cw.vcf: 100648
## KAT_7cx.vcf: 97909
## KAT_7cy.vcf: 95539
## KAT_7wx.vcf: 167052
## KAT_7wy.vcf: 162472
## KAT_7xy.vcf: 162349
## KAT_8ab.vcf: 87828
## KAT_8ac.vcf: 87045
## KAT_8aw.vcf: 101493
## KAT_8ax.vcf: 98756
## KAT_8ay.vcf: 96344
## KAT_8bc.vcf: 91879
## KAT_8bw.vcf: 99824
## KAT_8bx.vcf: 97180
## KAT_8by.vcf: 94881
## KAT_8cw.vcf: 100319
## KAT_8cx.vcf: 97577
## KAT_8cy.vcf: 95212
## KAT_8wx.vcf: 167052
## KAT_8wy.vcf: 162472
## KAT_8xy.vcf: 162349
## KAT_9ab.vcf: 87964
## KAT_9ac.vcf: 87230
## KAT_9aw.vcf: 101750
## KAT_9ax.vcf: 99001
## KAT_9ay.vcf: 96585
## KAT_9bc.vcf: 91906
## KAT_9bw.vcf: 99797
## KAT_9bx.vcf: 97158
## KAT_9by.vcf: 94873
## KAT_9cw.vcf: 100447
## KAT_9cx.vcf: 97711
## KAT_9cy.vcf: 95341
## KAT_9wx.vcf: 167052
## KAT_9wy.vcf: 162472
## KAT_9xy.vcf: 162349
## SAI_12ab.vcf: 87741
## SAI_12ac.vcf: 87510
## SAI_12aw.vcf: 101471
## SAI_12ax.vcf: 98706
## SAI_12ay.vcf: 96286
## SAI_12bc.vcf: 93098
## SAI_12bw.vcf: 100688
## SAI_12bx.vcf: 98024
## SAI_12by.vcf: 95713
## SAI_12cw.vcf: 102927
## SAI_12cx.vcf: 100107
## SAI_12cy.vcf: 97679
## SAI_12wx.vcf: 167052
## SAI_12wy.vcf: 162472
## SAI_12xy.vcf: 162349
## SAI_13ab.vcf: 87660
## SAI_13ac.vcf: 87456
## SAI_13aw.vcf: 101401
## SAI_13ax.vcf: 98651
## SAI_13ay.vcf: 96226
## SAI_13bc.vcf: 93075
## SAI_13bw.vcf: 100586
## SAI_13bx.vcf: 97927
## SAI_13by.vcf: 95618
## SAI_13cw.vcf: 102993
## SAI_13cx.vcf: 100179
## SAI_13cy.vcf: 97751
## SAI_13wx.vcf: 167052
## SAI_13wy.vcf: 162472
## SAI_13xy.vcf: 162349
## SAI_14ab.vcf: 87533
## SAI_14ac.vcf: 87237
## SAI_14aw.vcf: 101271
## SAI_14ax.vcf: 98518
## SAI_14ay.vcf: 96097
## SAI_14bc.vcf: 92921
## SAI_14bw.vcf: 100489
## SAI_14bx.vcf: 97823
## SAI_14by.vcf: 95505
## SAI_14cw.vcf: 102829
## SAI_14cx.vcf: 100004
## SAI_14cy.vcf: 97585
## SAI_14wx.vcf: 167052
## SAI_14wy.vcf: 162472
## SAI_14xy.vcf: 162349
## SAI_15ab.vcf: 87864
## SAI_15ac.vcf: 87468
## SAI_15aw.vcf: 101625
## SAI_15ax.vcf: 98868
## SAI_15ay.vcf: 96432
## SAI_15bc.vcf: 93104
## SAI_15bw.vcf: 100799
## SAI_15bx.vcf: 98133
## SAI_15by.vcf: 95815
## SAI_15cw.vcf: 102932
## SAI_15cx.vcf: 100132
## SAI_15cy.vcf: 97690
## SAI_15wx.vcf: 167052
## SAI_15wy.vcf: 162472
## SAI_15xy.vcf: 162349
## SAI_16ab.vcf: 87927
## SAI_16ac.vcf: 87597
## SAI_16aw.vcf: 101603
## SAI_16ax.vcf: 98843
## SAI_16ay.vcf: 96400
## SAI_16bc.vcf: 93231
## SAI_16bw.vcf: 100806
## SAI_16bx.vcf: 98139
## SAI_16by.vcf: 95831
## SAI_16cw.vcf: 103026
## SAI_16cx.vcf: 100200
## SAI_16cy.vcf: 97775
## SAI_16wx.vcf: 167052
## SAI_16wy.vcf: 162472
## SAI_16xy.vcf: 162349
## SAI_17ab.vcf: 87744
## SAI_17ac.vcf: 87447
## SAI_17aw.vcf: 101417
## SAI_17ax.vcf: 98666
## SAI_17ay.vcf: 96242
## SAI_17bc.vcf: 93112
## SAI_17bw.vcf: 100736
## SAI_17bx.vcf: 98062
## SAI_17by.vcf: 95751
## SAI_17cw.vcf: 102914
## SAI_17cx.vcf: 100092
## SAI_17cy.vcf: 97664
## SAI_17wx.vcf: 167052
## SAI_17wy.vcf: 162472
## SAI_17xy.vcf: 162349
## SAI_18ab.vcf: 87935
## SAI_18ac.vcf: 87564
## SAI_18aw.vcf: 101797
## SAI_18ax.vcf: 99029
## SAI_18ay.vcf: 96601
## SAI_18bc.vcf: 93301
## SAI_18bw.vcf: 101047
## SAI_18bx.vcf: 98377
## SAI_18by.vcf: 96048
## SAI_18cw.vcf: 103184
## SAI_18cx.vcf: 100357
## SAI_18cy.vcf: 97911
## SAI_18wx.vcf: 167052
## SAI_18wy.vcf: 162472
## SAI_18xy.vcf: 162349
## SAI_1ab.vcf: 87689
## SAI_1ac.vcf: 87385
## SAI_1aw.vcf: 101429
## SAI_1ax.vcf: 98673
## SAI_1ay.vcf: 96245
## SAI_1bc.vcf: 93177
## SAI_1bw.vcf: 100815
## SAI_1bx.vcf: 98143
## SAI_1by.vcf: 95812
## SAI_1cw.vcf: 103214
## SAI_1cx.vcf: 100379
## SAI_1cy.vcf: 97949
## SAI_1wx.vcf: 167052
## SAI_1wy.vcf: 162472
## SAI_1xy.vcf: 162349
## SAI_2ab.vcf: 87657
## SAI_2ac.vcf: 87426
## SAI_2aw.vcf: 101355
## SAI_2ax.vcf: 98620
## SAI_2ay.vcf: 96190
## SAI_2bc.vcf: 93082
## SAI_2bw.vcf: 100677
## SAI_2bx.vcf: 98014
## SAI_2by.vcf: 95688
## SAI_2cw.vcf: 103010
## SAI_2cx.vcf: 100205
## SAI_2cy.vcf: 97774
## SAI_2wx.vcf: 167052
## SAI_2wy.vcf: 162472
## SAI_2xy.vcf: 162349
## SAI_3ab.vcf: 87880
## SAI_3ac.vcf: 87578
## SAI_3aw.vcf: 101643
## SAI_3ax.vcf: 98887
## SAI_3ay.vcf: 96457
## SAI_3bc.vcf: 93174
## SAI_3bw.vcf: 100786
## SAI_3bx.vcf: 98121
## SAI_3by.vcf: 95804
## SAI_3cw.vcf: 103002
## SAI_3cx.vcf: 100179
## SAI_3cy.vcf: 97756
## SAI_3wx.vcf: 167052
## SAI_3wy.vcf: 162472
## SAI_3xy.vcf: 162349
## SAI_4ab.vcf: 87863
## SAI_4ac.vcf: 87521
## SAI_4aw.vcf: 101639
## SAI_4ax.vcf: 98872
## SAI_4ay.vcf: 96440
## SAI_4bc.vcf: 93447
## SAI_4bw.vcf: 101201
## SAI_4bx.vcf: 98511
## SAI_4by.vcf: 96177
## SAI_4cw.vcf: 103372
## SAI_4cx.vcf: 100533
## SAI_4cy.vcf: 98094
## SAI_4wx.vcf: 167052
## SAI_4wy.vcf: 162472
## SAI_4xy.vcf: 162349
## SAI_5ab.vcf: 87833
## SAI_5ac.vcf: 87542
## SAI_5aw.vcf: 101537
## SAI_5ax.vcf: 98775
## SAI_5ay.vcf: 96350
## SAI_5bc.vcf: 93216
## SAI_5bw.vcf: 100802
## SAI_5bx.vcf: 98141
## SAI_5by.vcf: 95826
## SAI_5cw.vcf: 102992
## SAI_5cx.vcf: 100175
## SAI_5cy.vcf: 97746
## SAI_5wx.vcf: 167052
## SAI_5wy.vcf: 162472
## SAI_5xy.vcf: 162349
Because we allow no missing genotypes within each pair of samples (--geno 0), the number of SNPs differs between vcf files.
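A toy illustration of why the counts differ: each pair keeps only the sites called in both of its members, so pairs that include a call set with more missing genotypes retain fewer SNPs. The genotype matrix below is hypothetical and is not taken from our data.

```python
import numpy as np

# Toy genotype matrix: rows are SNPs, columns are three call sets of one individual
# (hypothetical values; -1 marks a missing genotype)
names = ["a", "w", "y"]
geno = np.array([
    [0, 0, 0],
    [1, -1, 1],
    [2, 2, 2],
    [0, 1, 1],
    [0, 0, -1],
    [1, -1, 1],
])

def n_shared_sites(geno, names, first, second):
    """Count SNPs with a called genotype in both call sets of a pair."""
    i, j = names.index(first), names.index(second)
    called_in_both = (geno[:, i] != -1) & (geno[:, j] != -1)
    return int(called_in_both.sum())

for first, second in [("a", "w"), ("a", "y"), ("w", "y")]:
    print(first + second, n_shared_sites(geno, names, first, second))
```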
Check sample names to see if our code created the vcfs with two samples
# Define directory with the VCFs
output_dir="output/wgs_vs_chip/vcfs2"
# Iterate over each VCF file
for file in "${output_dir}"/*.vcf; do
# Extract the file name without the directory path
file_name=$(basename "${file}")
# Use bcftools query to retrieve the sample names
sample_names=$(bcftools query -l "${file}")
# Print the file name and the sample names
echo "${file_name}: ${sample_names}"
done
## KAT_10ab.vcf: 10a
## 10b
## KAT_10ac.vcf: 10a
## 10c
## KAT_10aw.vcf: 10a
## 10w
## KAT_10ax.vcf: 10a
## 10x
## KAT_10ay.vcf: 10a
## 10y
## KAT_10bc.vcf: 10b
## 10c
## KAT_10bw.vcf: 10b
## 10w
## KAT_10bx.vcf: 10b
## 10x
## KAT_10by.vcf: 10b
## 10y
## KAT_10cw.vcf: 10c
## 10w
## KAT_10cx.vcf: 10c
## 10x
## KAT_10cy.vcf: 10c
## 10y
## KAT_10wx.vcf: 10w
## 10x
## KAT_10wy.vcf: 10w
## 10y
## KAT_10xy.vcf: 10x
## 10y
## KAT_11ab.vcf: 11a
## 11b
## KAT_11ac.vcf: 11a
## 11c
## KAT_11aw.vcf: 11a
## 11w
## KAT_11ax.vcf: 11a
## 11x
## KAT_11ay.vcf: 11a
## 11y
## KAT_11bc.vcf: 11b
## 11c
## KAT_11bw.vcf: 11b
## 11w
## KAT_11bx.vcf: 11b
## 11x
## KAT_11by.vcf: 11b
## 11y
## KAT_11cw.vcf: 11c
## 11w
## KAT_11cx.vcf: 11c
## 11x
## KAT_11cy.vcf: 11c
## 11y
## KAT_11wx.vcf: 11w
## 11x
## KAT_11wy.vcf: 11w
## 11y
## KAT_11xy.vcf: 11x
## 11y
## KAT_12ab.vcf: 12a
## 12b
## KAT_12ac.vcf: 12a
## 12c
## KAT_12aw.vcf: 12a
## 12w
## KAT_12ax.vcf: 12a
## 12x
## KAT_12ay.vcf: 12a
## 12y
## KAT_12bc.vcf: 12b
## 12c
## KAT_12bw.vcf: 12b
## 12w
## KAT_12bx.vcf: 12b
## 12x
## KAT_12by.vcf: 12b
## 12y
## KAT_12cw.vcf: 12c
## 12w
## KAT_12cx.vcf: 12c
## 12x
## KAT_12cy.vcf: 12c
## 12y
## KAT_12wx.vcf: 12w
## 12x
## KAT_12wy.vcf: 12w
## 12y
## KAT_12xy.vcf: 12x
## 12y
## KAT_7ab.vcf: 7a
## 7b
## KAT_7ac.vcf: 7a
## 7c
## KAT_7aw.vcf: 7a
## 7w
## KAT_7ax.vcf: 7a
## 7x
## KAT_7ay.vcf: 7a
## 7y
## KAT_7bc.vcf: 7b
## 7c
## KAT_7bw.vcf: 7b
## 7w
## KAT_7bx.vcf: 7b
## 7x
## KAT_7by.vcf: 7b
## 7y
## KAT_7cw.vcf: 7c
## 7w
## KAT_7cx.vcf: 7c
## 7x
## KAT_7cy.vcf: 7c
## 7y
## KAT_7wx.vcf: 7w
## 7x
## KAT_7wy.vcf: 7w
## 7y
## KAT_7xy.vcf: 7x
## 7y
## KAT_8ab.vcf: 8a
## 8b
## KAT_8ac.vcf: 8a
## 8c
## KAT_8aw.vcf: 8a
## 8w
## KAT_8ax.vcf: 8a
## 8x
## KAT_8ay.vcf: 8a
## 8y
## KAT_8bc.vcf: 8b
## 8c
## KAT_8bw.vcf: 8b
## 8w
## KAT_8bx.vcf: 8b
## 8x
## KAT_8by.vcf: 8b
## 8y
## KAT_8cw.vcf: 8c
## 8w
## KAT_8cx.vcf: 8c
## 8x
## KAT_8cy.vcf: 8c
## 8y
## KAT_8wx.vcf: 8w
## 8x
## KAT_8wy.vcf: 8w
## 8y
## KAT_8xy.vcf: 8x
## 8y
## KAT_9ab.vcf: 9a
## 9b
## KAT_9ac.vcf: 9a
## 9c
## KAT_9aw.vcf: 9a
## 9w
## KAT_9ax.vcf: 9a
## 9x
## KAT_9ay.vcf: 9a
## 9y
## KAT_9bc.vcf: 9b
## 9c
## KAT_9bw.vcf: 9b
## 9w
## KAT_9bx.vcf: 9b
## 9x
## KAT_9by.vcf: 9b
## 9y
## KAT_9cw.vcf: 9c
## 9w
## KAT_9cx.vcf: 9c
## 9x
## KAT_9cy.vcf: 9c
## 9y
## KAT_9wx.vcf: 9w
## 9x
## KAT_9wy.vcf: 9w
## 9y
## KAT_9xy.vcf: 9x
## 9y
## SAI_12ab.vcf: 12a
## 12b
## SAI_12ac.vcf: 12a
## 12c
## SAI_12aw.vcf: 12a
## 12w
## SAI_12ax.vcf: 12a
## 12x
## SAI_12ay.vcf: 12a
## 12y
## SAI_12bc.vcf: 12b
## 12c
## SAI_12bw.vcf: 12b
## 12w
## SAI_12bx.vcf: 12b
## 12x
## SAI_12by.vcf: 12b
## 12y
## SAI_12cw.vcf: 12c
## 12w
## SAI_12cx.vcf: 12c
## 12x
## SAI_12cy.vcf: 12c
## 12y
## SAI_12wx.vcf: 12w
## 12x
## SAI_12wy.vcf: 12w
## 12y
## SAI_12xy.vcf: 12x
## 12y
## SAI_13ab.vcf: 13a
## 13b
## SAI_13ac.vcf: 13a
## 13c
## SAI_13aw.vcf: 13a
## 13w
## SAI_13ax.vcf: 13a
## 13x
## SAI_13ay.vcf: 13a
## 13y
## SAI_13bc.vcf: 13b
## 13c
## SAI_13bw.vcf: 13b
## 13w
## SAI_13bx.vcf: 13b
## 13x
## SAI_13by.vcf: 13b
## 13y
## SAI_13cw.vcf: 13c
## 13w
## SAI_13cx.vcf: 13c
## 13x
## SAI_13cy.vcf: 13c
## 13y
## SAI_13wx.vcf: 13w
## 13x
## SAI_13wy.vcf: 13w
## 13y
## SAI_13xy.vcf: 13x
## 13y
## SAI_14ab.vcf: 14a
## 14b
## SAI_14ac.vcf: 14a
## 14c
## SAI_14aw.vcf: 14a
## 14w
## SAI_14ax.vcf: 14a
## 14x
## SAI_14ay.vcf: 14a
## 14y
## SAI_14bc.vcf: 14b
## 14c
## SAI_14bw.vcf: 14b
## 14w
## SAI_14bx.vcf: 14b
## 14x
## SAI_14by.vcf: 14b
## 14y
## SAI_14cw.vcf: 14c
## 14w
## SAI_14cx.vcf: 14c
## 14x
## SAI_14cy.vcf: 14c
## 14y
## SAI_14wx.vcf: 14w
## 14x
## SAI_14wy.vcf: 14w
## 14y
## SAI_14xy.vcf: 14x
## 14y
## SAI_15ab.vcf: 15a
## 15b
## SAI_15ac.vcf: 15a
## 15c
## SAI_15aw.vcf: 15a
## 15w
## SAI_15ax.vcf: 15a
## 15x
## SAI_15ay.vcf: 15a
## 15y
## SAI_15bc.vcf: 15b
## 15c
## SAI_15bw.vcf: 15b
## 15w
## SAI_15bx.vcf: 15b
## 15x
## SAI_15by.vcf: 15b
## 15y
## SAI_15cw.vcf: 15c
## 15w
## SAI_15cx.vcf: 15c
## 15x
## SAI_15cy.vcf: 15c
## 15y
## SAI_15wx.vcf: 15w
## 15x
## SAI_15wy.vcf: 15w
## 15y
## SAI_15xy.vcf: 15x
## 15y
## SAI_16ab.vcf: 16a
## 16b
## SAI_16ac.vcf: 16a
## 16c
## SAI_16aw.vcf: 16a
## 16w
## SAI_16ax.vcf: 16a
## 16x
## SAI_16ay.vcf: 16a
## 16y
## SAI_16bc.vcf: 16b
## 16c
## SAI_16bw.vcf: 16b
## 16w
## SAI_16bx.vcf: 16b
## 16x
## SAI_16by.vcf: 16b
## 16y
## SAI_16cw.vcf: 16c
## 16w
## SAI_16cx.vcf: 16c
## 16x
## SAI_16cy.vcf: 16c
## 16y
## SAI_16wx.vcf: 16w
## 16x
## SAI_16wy.vcf: 16w
## 16y
## SAI_16xy.vcf: 16x
## 16y
## SAI_17ab.vcf: 17a
## 17b
## SAI_17ac.vcf: 17a
## 17c
## SAI_17aw.vcf: 17a
## 17w
## SAI_17ax.vcf: 17a
## 17x
## SAI_17ay.vcf: 17a
## 17y
## SAI_17bc.vcf: 17b
## 17c
## SAI_17bw.vcf: 17b
## 17w
## SAI_17bx.vcf: 17b
## 17x
## SAI_17by.vcf: 17b
## 17y
## SAI_17cw.vcf: 17c
## 17w
## SAI_17cx.vcf: 17c
## 17x
## SAI_17cy.vcf: 17c
## 17y
## SAI_17wx.vcf: 17w
## 17x
## SAI_17wy.vcf: 17w
## 17y
## SAI_17xy.vcf: 17x
## 17y
## SAI_18ab.vcf: 18a
## 18b
## SAI_18ac.vcf: 18a
## 18c
## SAI_18aw.vcf: 18a
## 18w
## SAI_18ax.vcf: 18a
## 18x
## SAI_18ay.vcf: 18a
## 18y
## SAI_18bc.vcf: 18b
## 18c
## SAI_18bw.vcf: 18b
## 18w
## SAI_18bx.vcf: 18b
## 18x
## SAI_18by.vcf: 18b
## 18y
## SAI_18cw.vcf: 18c
## 18w
## SAI_18cx.vcf: 18c
## 18x
## SAI_18cy.vcf: 18c
## 18y
## SAI_18wx.vcf: 18w
## 18x
## SAI_18wy.vcf: 18w
## 18y
## SAI_18xy.vcf: 18x
## 18y
## SAI_1ab.vcf: 1a
## 1b
## SAI_1ac.vcf: 1a
## 1c
## SAI_1aw.vcf: 1a
## 1w
## SAI_1ax.vcf: 1a
## 1x
## SAI_1ay.vcf: 1a
## 1y
## SAI_1bc.vcf: 1b
## 1c
## SAI_1bw.vcf: 1b
## 1w
## SAI_1bx.vcf: 1b
## 1x
## SAI_1by.vcf: 1b
## 1y
## SAI_1cw.vcf: 1c
## 1w
## SAI_1cx.vcf: 1c
## 1x
## SAI_1cy.vcf: 1c
## 1y
## SAI_1wx.vcf: 1w
## 1x
## SAI_1wy.vcf: 1w
## 1y
## SAI_1xy.vcf: 1x
## 1y
## SAI_2ab.vcf: 2a
## 2b
## SAI_2ac.vcf: 2a
## 2c
## SAI_2aw.vcf: 2a
## 2w
## SAI_2ax.vcf: 2a
## 2x
## SAI_2ay.vcf: 2a
## 2y
## SAI_2bc.vcf: 2b
## 2c
## SAI_2bw.vcf: 2b
## 2w
## SAI_2bx.vcf: 2b
## 2x
## SAI_2by.vcf: 2b
## 2y
## SAI_2cw.vcf: 2c
## 2w
## SAI_2cx.vcf: 2c
## 2x
## SAI_2cy.vcf: 2c
## 2y
## SAI_2wx.vcf: 2w
## 2x
## SAI_2wy.vcf: 2w
## 2y
## SAI_2xy.vcf: 2x
## 2y
## SAI_3ab.vcf: 3a
## 3b
## SAI_3ac.vcf: 3a
## 3c
## SAI_3aw.vcf: 3a
## 3w
## SAI_3ax.vcf: 3a
## 3x
## SAI_3ay.vcf: 3a
## 3y
## SAI_3bc.vcf: 3b
## 3c
## SAI_3bw.vcf: 3b
## 3w
## SAI_3bx.vcf: 3b
## 3x
## SAI_3by.vcf: 3b
## 3y
## SAI_3cw.vcf: 3c
## 3w
## SAI_3cx.vcf: 3c
## 3x
## SAI_3cy.vcf: 3c
## 3y
## SAI_3wx.vcf: 3w
## 3x
## SAI_3wy.vcf: 3w
## 3y
## SAI_3xy.vcf: 3x
## 3y
## SAI_4ab.vcf: 4a
## 4b
## SAI_4ac.vcf: 4a
## 4c
## SAI_4aw.vcf: 4a
## 4w
## SAI_4ax.vcf: 4a
## 4x
## SAI_4ay.vcf: 4a
## 4y
## SAI_4bc.vcf: 4b
## 4c
## SAI_4bw.vcf: 4b
## 4w
## SAI_4bx.vcf: 4b
## 4x
## SAI_4by.vcf: 4b
## 4y
## SAI_4cw.vcf: 4c
## 4w
## SAI_4cx.vcf: 4c
## 4x
## SAI_4cy.vcf: 4c
## 4y
## SAI_4wx.vcf: 4w
## 4x
## SAI_4wy.vcf: 4w
## 4y
## SAI_4xy.vcf: 4x
## 4y
## SAI_5ab.vcf: 5a
## 5b
## SAI_5ac.vcf: 5a
## 5c
## SAI_5aw.vcf: 5a
## 5w
## SAI_5ax.vcf: 5a
## 5x
## SAI_5ay.vcf: 5a
## 5y
## SAI_5bc.vcf: 5b
## 5c
## SAI_5bw.vcf: 5b
## 5w
## SAI_5bx.vcf: 5b
## 5x
## SAI_5by.vcf: 5b
## 5y
## SAI_5cw.vcf: 5c
## 5w
## SAI_5cx.vcf: 5c
## 5x
## SAI_5cy.vcf: 5c
## 5y
## SAI_5wx.vcf: 5w
## 5x
## SAI_5wy.vcf: 5w
## 5y
## SAI_5xy.vcf: 5x
## 5y
Compare the two samples in each vcf file and create csv output across all samples
import allel
import pandas as pd
import os
import numpy as np

# Initialize the output dataframe
output_df = pd.DataFrame()

# Directory with vcf files
dir_name = "output/wgs_vs_chip/vcfs2/"

# Get list of all vcf files in the directory
vcf_files = [f for f in os.listdir(dir_name) if f.endswith('.vcf')]

# Iterate over VCF files
for vcf_file in vcf_files:
    file_path = os.path.join(dir_name, vcf_file)
    callset = allel.read_vcf(file_path, fields=['*'])

    # Get genotype
    gt = allel.GenotypeArray(callset['calldata/GT'])

    # Verify the vcf contains two samples
    assert gt.shape[1] == 2, f"Expected 2 samples in {vcf_file}, found {gt.shape[1]}"

    # Count SNPs
    n_snps = len(gt)

    # Count homozygous and heterozygous SNPs for each sample
    n_homo_ref = np.count_nonzero(gt.is_hom_ref(), axis=0)
    n_homo_alt = np.count_nonzero(gt.is_hom_alt(), axis=0)
    n_hetero = np.count_nonzero(gt.is_het(), axis=0)

    # Count SNPs whose zygosity class differs between the two samples
    n_homo_ref_mismatch = np.sum(gt.is_hom_ref()[:, 0] != gt.is_hom_ref()[:, 1])
    n_homo_alt_mismatch = np.sum(gt.is_hom_alt()[:, 0] != gt.is_hom_alt()[:, 1])
    n_hetero_mismatch = np.sum(gt.is_het()[:, 0] != gt.is_het()[:, 1])

    # Get alleles
    ref_alleles = callset['variants/REF']
    alt_alleles = callset['variants/ALT'][:, 0]  # assuming bi-allelic

    # Count SNPs where the two samples carry a different number of REF (or ALT) alleles.
    # Indexing the allele arrays by the genotype codes would compare the alleles of the
    # first two variants only, so we compare per-SNP allele counts instead.
    n_ref = gt.to_n_ref()
    n_alt = gt.to_n_alt()
    n_snps_ref_mismatch = np.count_nonzero(n_ref[:, 0] != n_ref[:, 1])
    n_snps_alt_mismatch = np.count_nonzero(n_alt[:, 0] != n_alt[:, 1])

    # Nucleotide behind every allele call: REF where the call is 0, first ALT otherwise
    # (there are no missing calls because the vcfs were exported with --geno 0)
    calls = callset['calldata/GT']
    bases = np.where(calls == 0, ref_alleles[:, None, None], alt_alleles[:, None, None])

    # Count each nucleotide per sample (summed over SNPs and both allele calls)
    n_a = np.count_nonzero(bases == 'A', axis=(0, 2))
    n_t = np.count_nonzero(bases == 'T', axis=(0, 2))
    n_c = np.count_nonzero(bases == 'C', axis=(0, 2))
    n_g = np.count_nonzero(bases == 'G', axis=(0, 2))

    # Append results to the output dataframe
    result = pd.DataFrame({
        'vcf_file': [file_path],
        'n_SNPs': [n_snps],
        'n_SNPs_ref_mismatch': [n_snps_ref_mismatch],
        'n_SNPs_alt_mismatch': [n_snps_alt_mismatch],
        'n_A': [n_a],
        'n_T': [n_t],
        'n_C': [n_c],
        'n_G': [n_g],
        'n_homo_ref': [n_homo_ref],
        'n_homo_alt': [n_homo_alt],
        'n_hetero': [n_hetero],
        'n_homo_ref_mismatch': [n_homo_ref_mismatch],
        'n_homo_alt_mismatch': [n_homo_alt_mismatch],
        'n_hetero_mismatch': [n_hetero_mismatch]
    })
    output_df = pd.concat([output_df, result])

# Write the result to a csv file
output_df.to_csv('output/wgs_vs_chip/vcfs2/allele_comparison_stats.csv', index=False)
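To make the zygosity mismatch counts concrete, the sketch below reproduces the same logic with plain NumPy on a hypothetical two-sample genotype array, so it runs without scikit-allel or a vcf file. The array values and the helper `n_mismatch` are illustrative assumptions.

```python
import numpy as np

# Hypothetical genotype calls: shape (n_snps, 2 samples, 2 alleles); 0 = REF, 1 = ALT
gt = np.array([
    [[0, 0], [0, 0]],  # both homozygous REF (concordant)
    [[0, 1], [0, 0]],  # heterozygous vs homozygous REF
    [[1, 1], [0, 1]],  # homozygous ALT vs heterozygous
    [[1, 1], [1, 1]],  # both homozygous ALT (concordant)
    [[0, 0], [1, 1]],  # homozygous REF vs homozygous ALT
])

# Boolean zygosity class per SNP and sample, shape (n_snps, 2)
is_hom_ref = (gt == 0).all(axis=2)
is_hom_alt = (gt == 1).all(axis=2)
is_het = gt[:, :, 0] != gt[:, :, 1]

def n_mismatch(mask):
    """Number of SNPs where the class membership differs between the two samples."""
    return int(np.sum(mask[:, 0] != mask[:, 1]))

counts = {
    "hom_ref": n_mismatch(is_hom_ref),
    "hom_alt": n_mismatch(is_hom_alt),
    "het": n_mismatch(is_het),
}

# SNPs where the two samples fall in different zygosity classes
n_discordant = int(np.sum(is_hom_ref[:, 0] != is_hom_ref[:, 1])
                   + np.sum(is_hom_alt[:, 0] & is_het[:, 1])
                   + np.sum(is_het[:, 0] & is_hom_alt[:, 1]))

print(counts, n_discordant)
```

Each discordant SNP changes membership in two of the three classes, so the three mismatch counts sum to twice the number of discordant SNPs. This is worth keeping in mind when reading the per-category percentages.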
Clean env
Import the data
data <-
read_delim(
"output/wgs_vs_chip/vcfs2/allele_comparison_stats.csv",
delim = ",",
show_col_types = FALSE
)
data <-
data |>
mutate(vcf_file = str_remove(vcf_file, "output/wgs_vs_chip/vcfs2/")) |>
separate(
vcf_file,
into = c("Population", "Sample_Comparison"),
sep = "_",
extra = "drop"
) |>
separate(
Sample_Comparison,
into = c("Sample", "Comparison"),
sep = "(?<=\\d)(?=[a-z])",
convert = TRUE
) |>
mutate(Comparison = str_remove(Comparison, "\\.vcf$")) |>
arrange(Comparison)
# Split the "Comparison" column into "Sample1" and "Sample2"
data <-
data |>
separate(
Comparison,
into = c("Sample1", "Sample2"),
sep = 1,
# because each comparison has two characters
remove = FALSE
) |> # keep the original comparison column
relocate(Sample1, Sample2, .after = Comparison) # move the new columns right after Comparison
cols_to_split <-
c("n_A",
"n_T",
"n_C",
"n_G",
"n_homo_ref",
"n_homo_alt",
"n_hetero")
# Remove unwanted characters from the columns
for (col_name in cols_to_split) {
data[[col_name]] <- gsub("\\[\\[|]\\n", "", data[[col_name]])
}
# Split the columns
for (col_name in cols_to_split) {
# Create new column names based on 'Sample1' and 'Sample2'
new_col_names <- paste0(col_name, "_sample", 1:2)
data <- data |>
separate(
col = all_of(col_name),
into = new_col_names,
sep = " ",
extra = "drop"
)
}
# Clean the new columns
cols_to_clean <-
grep("^n_", names(data), value = TRUE)
for (col_name in cols_to_clean) {
# Remove unwanted characters '[', ']', and '\n'
data[[col_name]] <- gsub("\\[|]|\\n", "", data[[col_name]])
}
# Specify the column names to convert to numeric
columns_to_convert <-
c(
# "Population",
"Sample",
# "Comparison",
# "Sample1",
# "Sample2",
"n_SNPs",
"n_SNPs_ref_mismatch",
"n_SNPs_alt_mismatch",
"n_A_sample1",
"n_A_sample2",
"n_T_sample1",
"n_T_sample2",
"n_C_sample1",
"n_C_sample2",
"n_G_sample1",
"n_G_sample2",
"n_homo_ref_sample1",
"n_homo_ref_sample2",
"n_homo_alt_sample1",
"n_homo_alt_sample2",
"n_hetero_sample1",
"n_hetero_sample2",
"n_homo_ref_mismatch",
"n_homo_alt_mismatch",
"n_hetero_mismatch"
)
# Convert columns to numeric
data[columns_to_convert] <-
lapply(data[columns_to_convert], function(x)
as.numeric(as.character(x)))
# Verify the column types
print(sapply(data[columns_to_convert], class))
## Sample n_SNPs n_SNPs_ref_mismatch n_SNPs_alt_mismatch
## "numeric" "numeric" "numeric" "numeric"
## n_A_sample1 n_A_sample2 n_T_sample1 n_T_sample2
## "numeric" "numeric" "numeric" "numeric"
## n_C_sample1 n_C_sample2 n_G_sample1 n_G_sample2
## "numeric" "numeric" "numeric" "numeric"
## n_homo_ref_sample1 n_homo_ref_sample2 n_homo_alt_sample1 n_homo_alt_sample2
## "numeric" "numeric" "numeric" "numeric"
## n_hetero_sample1 n_hetero_sample2 n_homo_ref_mismatch n_homo_alt_mismatch
## "numeric" "numeric" "numeric" "numeric"
## n_hetero_mismatch
## "numeric"
We can look over all the comparisons to see if we can see any pattern
# Calculate the percentages for each category
data <-
data |>
mutate(
Perc_n_homo_ref_mismatch = round((n_homo_ref_mismatch / n_SNPs) * 100, 2),
Perc_n_homo_alt_mismatch = round((n_homo_alt_mismatch / n_SNPs) * 100, 2),
Perc_n_hetero_mismatch = round((n_hetero_mismatch / n_SNPs) * 100, 2)
)
# Continue with the reshaping
data_long <- data |>
pivot_longer(cols = starts_with("n_"),
names_to = "Category",
values_to = "Value") |>
pivot_longer(cols = starts_with("Perc_"),
names_to = "Category_Perc",
values_to = "Percentage") |>
mutate(Category_Perc = str_remove(Category_Perc, "Perc_")) |>
filter(Category == Category_Perc |
Category == "n_SNPs") # Remove Total_mismatch
# Define a color palette
color_palette <- c("#FF8C94", "#FFE180", "#9CE09C", "#A391FF")
# Rename categories
data_long <- data_long |>
mutate(
Category = recode(
Category,
"n_SNPs" = "SNPs",
"n_homo_ref_mismatch" = "Homozygous REF",
"n_homo_alt_mismatch" = "Homozygous ALT",
"n_hetero_mismatch" = "Heterozygous"
)
)
# Change the order of the Comparison variable (Chip, WGS, and Chip vs WGS)
data_long$Comparison <-
factor(
data_long$Comparison,
levels = c(
"ab",
"ac",
"bc",
"wx",
"wy",
"xy",
"aw",
"ax",
"ay",
"bw",
"bx",
"by",
"cw",
"cx",
"cy"
)
)
# Recode the levels of the "Comparison" variable
data_long$Comparison <- recode(
data_long$Comparison,
"ab" = "chip_18 : chip_95",
"ac" = "chip_18 : chip_500",
"bc" = "chip_95 : chip_500",
"wx" = "wgs_800 : wgs_30",
"wy" = "wgs_800 : wgs_18",
"xy" = "wgs_18 : wgs_30",
"aw" = "chip_18 : wgs_800",
"ax" = "chip_18 : wgs_30",
"ay" = "chip_18 : wgs_18",
"bw" = "chip_95 : wgs_800",
"bx" = "chip_95 : wgs_30",
"by" = "chip_95 : wgs_18",
"cw" = "chip_500 : wgs_800",
"cx" = "chip_500 : wgs_30",
"cy" = "chip_500 : wgs_18"
)
# Create the plot
ggplot(data_long, aes(x = Category, y = Value, fill = Category)) +
geom_bar(stat = "identity", position = "dodge") +
facet_grid(Comparison ~ Population, scales = "free_y", space = "free") +
coord_flip() +
labs(
title = "Mismatches of zygosity in pairwise comparisons",
x = "Category",
y = "Count",
caption = "The comparisons are between SNPs genotyped in both samples.\n Each sample was genotyped together with a different number of samples. Each pair of samples was subset to\n a vcf file allowing no genotyping missingness. Next, with a custom Python script,\n the total number of genotype matches and mismatches was stored in a csv file. \nThe data were tidied and visualized in R.\nREF = Reference allele; ALT = Alternative allele\n At right are the number of samples used in the genotype calls\n for each data set comparison."
) +
theme(panel.spacing = unit(1.5, "lines")) +
geom_text(
aes(label = ifelse(
Category == "SNPs",
scales::comma(Value),
paste0(scales::comma(Value), " (", sprintf("%.2f", Percentage), "%)")
)),
position = position_dodge(width = 0.7),
hjust = 0.9,
vjust = 0.5,
size = 2.5,
check_overlap = TRUE,
color = "black"
) +
scale_fill_manual(values = color_palette) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
my_theme() +
scale_y_continuous(
labels = scales::comma,
trans = "log10",
breaks = c(10, 100, 1000, 10000, 100000),
limits = c(1, NA),
expand = expansion(mult = c(0, 0.1))
) +
theme(
plot.caption = element_text(
face = "italic",
size = 8,
color = "grey20"
),
plot.margin = unit(c(1, 2, 1, 1), "cm"),
axis.text.x = element_text(angle = 0, hjust = 1),
axis.text = element_text(size = 7),
strip.text.y = element_text(angle = 360)
)
# save the plot
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"01.Pairwise_comparions_wgs_chip.pdf"
),
width = 8,
height = 14,
units = "in"
)
The lowest mismatch rates are in the comparisons within each technology, regardless of how many samples were included in the genotype call. The chip seems slightly better for smaller sample sizes. The sample size does not seem to affect the overall result of the comparisons between technologies. The mismatch rate is higher in SAI (island, invasive range) than in KAT (continent, native range), which may indicate the presence of low-frequency alleles that are difficult to detect.
Next, we can look at the performance of the technologies across all 18 samples. Our next questions are a bit different. For example, how many SNPs have mismatches in 1, 2 or more samples? Are there any SNPs that have errors in more than 2 samples? Can we find a way to identify them and remove them?
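One way to frame these questions is to count, for every SNP, the number of samples in which the two technologies disagree. The sketch below uses a hypothetical table with one match/mismatch column per sample. The sample columns, SNP ids, and the removal threshold are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical per-SNP comparison table: one column per sample
comp = pd.DataFrame(
    {
        "KAT_7": ["match", "mismatch", "match", "mismatch"],
        "KAT_8": ["match", "mismatch", "match", "match"],
        "SAI_1": ["match", "mismatch", "mismatch", "match"],
    },
    index=["snp1", "snp2", "snp3", "snp4"],
)

# Number of samples in which each SNP is discordant
n_mismatch = (comp == "mismatch").sum(axis=1)

# How many SNPs are discordant in 0, 1, 2, ... samples
distribution = n_mismatch.value_counts().sort_index()

# SNPs discordant in more than `max_allowed` samples are candidates for removal
max_allowed = 2  # hypothetical threshold
flagged = n_mismatch[n_mismatch > max_allowed].index.tolist()

print(distribution)
print(flagged)
```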
Since the sample size used for the genotype call does not affect the overall results, we can select a few comparisons to examine in detail. We can then compare the WGS read count for each allele with the mismatch rate. Do the SNPs with mismatches between the technologies have a lower read count? If so, what is the error rate if we remove SNPs supported by only one or a few reads? Does that filter improve the concordance between the technologies? We can do this with one or two data sets. For example, we can select chip_18 : wgs_18 (ay) and chip_500 : wgs_800 (cw). We will then compare the genotypes of samples that were called using only the 18 samples with those called using the entire data set (around 500 samples for the chip and 800 for the WGS). For these comparisons, we extracted the 18 samples from the large data sets.
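A minimal sketch of the read-depth question, assuming a table with one row per SNP and sample that holds the WGS read depth and the chip versus WGS comparison status. The column names, values, and thresholds are hypothetical and do not represent our results.

```python
import pandas as pd

# Hypothetical table: WGS read depth and chip-vs-WGS status per genotype call
calls = pd.DataFrame({
    "depth": [1, 2, 3, 8, 12, 15, 20, 25],
    "status": ["mismatch", "mismatch", "mismatch", "match",
               "match", "match", "mismatch", "match"],
})

def mismatch_rate(df, min_depth):
    """Fraction of mismatches among calls with at least `min_depth` reads."""
    kept = df[df["depth"] >= min_depth]
    return float((kept["status"] == "mismatch").mean())

# Compare the mismatch rate under increasingly strict depth filters
for min_depth in (1, 3, 10):
    print(min_depth, round(mismatch_rate(calls, min_depth), 3))
```

If low read depth drives the discordance, the rate should fall as the threshold rises. If it stays flat, the cause is more likely elsewhere.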
We can also compare if the SNPs with mismatches are the same when the sample size for the genotype call is large or small. What percentage of SNPs have mismatches when varying the sample size? We can do the same comparison for each technology.
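The overlap question reduces to set operations on the ids of discordant SNPs. A small sketch with hypothetical SNP ids, where the two sets stand in for the mismatches found in a small-call and a large-call comparison:

```python
# Hypothetical sets of discordant SNP ids from two comparisons
mismatch_small = {"snp1", "snp2", "snp5", "snp9"}  # e.g. chip_18 vs wgs_18 (ay)
mismatch_large = {"snp2", "snp5", "snp7"}          # e.g. chip_500 vs wgs_800 (cw)

# SNPs discordant in both comparisons, or in only one of them
shared = mismatch_small & mismatch_large
only_small = mismatch_small - mismatch_large
only_large = mismatch_large - mismatch_small

# Percentage of all discordant SNPs that are discordant in both comparisons
pct_shared = 100 * len(shared) / len(mismatch_small | mismatch_large)

print(sorted(shared), sorted(only_small), sorted(only_large), pct_shared)
```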
Save the data first
# Save the data
saveRDS(
data_long,
file = here(
"output",
"wgs_vs_chip",
"pairwise_comparison_long.rds"
)
)
# Clean environment and memory
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 10847074 579.3 17143224 915.6 NA 15008399 801.6
## Vcells 19261980 147.0 37740698 288.0 32768 37739631 288.0
Python script to get the match and mismatches from a vcf file
import argparse
import allel
import pandas as pd
import os
import numpy as np
import warnings

# Ignore DtypeWarnings from pandas
warnings.filterwarnings('ignore', category=pd.errors.DtypeWarning)


# Function to convert genotype indices to alleles
def genotype_to_alleles(gt_indices, ref_allele, alt_alleles):
    alleles = np.concatenate(([ref_allele], alt_alleles))
    return " ".join(alleles[idx] for idx in gt_indices if idx != -1)  # idx -1 means missing data


def process_vcf_files(vcf_file_ending):
    dir_name = "output/wgs_vs_chip/vcfs2/"
    vcf_files = [f for f in os.listdir(dir_name) if f.endswith(f'{vcf_file_ending}.vcf')]
    if not vcf_files:
        raise ValueError(f"No VCF files found matching '{vcf_file_ending}'")

    for vcf_file in vcf_files:
        file_path = os.path.join(dir_name, vcf_file)
        callset = allel.read_vcf(file_path, fields=['*'])

        # Get genotype
        gt = allel.GenotypeArray(callset['calldata/GT'])

        # Verify the vcf contains two samples before unpacking the sample names
        assert gt.shape[1] == 2, f"Expected 2 samples in {vcf_file}, found {gt.shape[1]}"

        # Get sample names and add the population prefix from the file name (e.g. "KAT_")
        sample_1, sample_2 = callset['samples']
        prefix = vcf_file.split("_")[0] + "_"
        sample_1 = prefix + sample_1
        sample_2 = prefix + sample_2

        ref = callset['variants/REF']
        alt = callset['variants/ALT']

        # Create DataFrame
        df = pd.DataFrame({
            'SNP_id': callset['variants/ID'],
            f'{sample_1}_geno': [genotype_to_alleles(call, ref[i], alt[i]) for i, call in enumerate(gt[:, 0])],
            f'{sample_2}_geno': [genotype_to_alleles(call, ref[i], alt[i]) for i, call in enumerate(gt[:, 1])],
            # Allele-by-allele comparison: one 'match'/'mismatch' per allele call,
            # written to the csv as a two-element list
            f'{sample_1}_{sample_2}_gcomp': np.where(gt[:, 0] == gt[:, 1], 'match', 'mismatch').tolist(),
            f'{sample_1}_zygo': np.where(gt.is_hom_ref()[:, 0], 'hom_ref', np.where(gt.is_hom_alt()[:, 0], 'hom_alt', 'het')).tolist(),
            f'{sample_2}_zygo': np.where(gt.is_hom_ref()[:, 1], 'hom_ref', np.where(gt.is_hom_alt()[:, 1], 'hom_alt', 'het')).tolist(),
            # Compares homozygous vs heterozygous status only (hom_ref vs hom_alt counts as a match)
            f'{sample_1}_{sample_2}_zcomp': np.where(gt.is_hom()[:, 0] == gt.is_hom()[:, 1], 'match', 'mismatch').tolist()
        })

        # Use the input filename to create the corresponding output filename
        output_file = f'output/wgs_vs_chip/{os.path.basename(vcf_file).replace(".vcf", "")}_comparison.csv'
        df.to_csv(output_file, index=False)


def combine_csv_files(vcf_file_ending):
    # Combine only the newly created CSVs into one
    dir_path = "output/wgs_vs_chip/"
    csv_files = [os.path.join(dir_path, f) for f in os.listdir(dir_path) if f.endswith(f'{vcf_file_ending}_comparison.csv')]

    # Ensure that we have at least one such file
    if not csv_files:
        raise ValueError(f"No CSV files found matching '{vcf_file_ending}_comparison.csv'")

    # Load the first CSV file
    combined_csv = pd.read_csv(csv_files[0])

    # Merge the rest of the CSV files one by one
    for f in csv_files[1:]:
        df = pd.read_csv(f)
        combined_csv = pd.merge(combined_csv, df, on='SNP_id', how='outer')

    combined_csv.to_csv(os.path.join(dir_path, f'combined_comparison_{vcf_file_ending}.csv'), index=False)


def main():
    # Initialize parser
    parser = argparse.ArgumentParser(description="Process VCF files and output CSV comparison files")

    # Add argument
    parser.add_argument('vcf_file_ending', type=str, help="The ending for VCF files to be processed (e.g., 'ay')")

    # Parse arguments
    args = parser.parse_args()

    # Remove '.vcf' from the ending, if present
    vcf_file_ending = args.vcf_file_ending.replace('.vcf', '')

    # Process VCF files and combine CSV files
    process_vcf_files(vcf_file_ending)
    combine_csv_files(vcf_file_ending)


if __name__ == "__main__":
    main()
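To show what the script writes for a single SNP, here is a small sketch that applies the same conversion and comparison to one hypothetical site (REF A, ALT G) using NumPy only. The genotype values are made up for illustration.

```python
import numpy as np

def genotype_to_alleles(gt_indices, ref_allele, alt_alleles):
    """Translate genotype indices (0 = REF, 1+ = ALT) into nucleotides."""
    alleles = np.concatenate(([ref_allele], alt_alleles))
    return " ".join(alleles[idx] for idx in gt_indices if idx != -1)

# One hypothetical bi-allelic site; unused ALT slots are empty strings
ref = "A"
alt = np.array(["G", "", ""], dtype=object)

# Genotype calls of the two samples at this site
sample_1 = np.array([0, 1])  # heterozygous
sample_2 = np.array([1, 1])  # homozygous ALT

geno_1 = genotype_to_alleles(sample_1, ref, alt)
geno_2 = genotype_to_alleles(sample_2, ref, alt)

# Allele-by-allele comparison, as stored in the '_gcomp' column
allele_comp = np.where(sample_1 == sample_2, "match", "mismatch").tolist()

print(geno_1, "|", geno_2, "|", allele_comp)
```

Because the comparison is stored as a two-element list, it reaches the csv as a single text field such as `['mismatch', 'match']`. This is why the R import function splits the `_gcomp` columns on `", "` and strips the brackets and quotes.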
How to run the Python script
We can write a function to import and process the csv files our python script generates
process_csv_files <- function(csv_file_ending) {
# Read the CSV file using fread() function
csv_file <- paste0("output/wgs_vs_chip/combined_comparison_", csv_file_ending, ".csv")
data_dt <- data.table::fread(csv_file)
# Get all column names that end with '_gcomp'
gcomp_cols <- grep("_gcomp$", names(data_dt), value = TRUE)
# Convert data.frame to data.table
setDT(data_dt)
# Iterate over the '_gcomp' columns and create new '_REF' and '_ALT' columns
for (col in gcomp_cols) {
# Split each '_gcomp' column into '_REF' and '_ALT'
ref_col <- paste0(col, "_REF")
alt_col <- paste0(col, "_ALT")
data_dt[, c(ref_col, alt_col) := tstrsplit(get(col), ", ", fixed = TRUE)]
# Remove unwanted characters from each new column
data_dt[, (ref_col) := gsub("\\[|\\]|'", "", get(ref_col))]
data_dt[, (alt_col) := gsub("\\[|\\]|'", "", get(alt_col))]
}
# Rename columns to remove '_gcomp'
new_names <- names(data_dt)
new_names <- gsub("_gcomp_ALT$", "_ALT", new_names)
new_names <- gsub("_gcomp_REF$", "_REF", new_names)
setnames(data_dt, new_names)
# Return the processed data.table
return(data_dt)
}
# we can save the function to source it later
dump(
"process_csv_files",
here(
"scripts", "analysis", "process_csv_files.R")
)
How to run the function to import the csv files
data_ay_dt <- process_csv_files("ay")
# Check and display only columns that match the criteria
head(data_ay_dt[, c("SNP_id", names(data_ay_dt)[grepl("_REF$|_ALT$", names(data_ay_dt))]), with = FALSE])
Function to get summary of the mismatches
process_data_object <- function(object_name) {
# Get the data.table object based on the input name
data_dt <- get(object_name)
# Create columns for match and mismatch count for columns ending with _REF
cols_REF <- grep("_REF$", names(data_dt), value = TRUE)
data_dt[, c("REF_match_count", "REF_mismatch_count") := .(
rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)
), .SDcols = cols_REF]
# Create columns for match and mismatch count for columns ending with _ALT
cols_ALT <- grep("_ALT$", names(data_dt), value = TRUE)
data_dt[, c("ALT_match_count", "ALT_mismatch_count") := .(
rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)
), .SDcols = cols_ALT]
# Create columns for match and mismatch count for columns ending with _zcomp
cols_Zigo <- grep("_zcomp$", names(data_dt), value = TRUE)
data_dt[, c("Zigo_match_count", "Zigo_mismatch_count") := .(
rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)
), .SDcols = cols_Zigo]
# Summarize the data for each SNP_id
summary_dt <- data_dt[, .(
REF_match = sum(REF_match_count, na.rm = TRUE),
REF_mismatch = sum(REF_mismatch_count, na.rm = TRUE),
ALT_match = sum(ALT_match_count, na.rm = TRUE),
ALT_mismatch = sum(ALT_mismatch_count, na.rm = TRUE),
Zigo_match = sum(Zigo_match_count, na.rm = TRUE),
Zigo_mismatch = sum(Zigo_mismatch_count, na.rm = TRUE)
), by = SNP_id]
# Sort the summarized data by SNP_id
setorder(summary_dt, SNP_id)
# Return the processed summary data.table
return(summary_dt)
}
# we can save the function to source it later
dump(
"process_data_object",
here(
"scripts", "analysis", "process_data_object.R")
)
How to run the function
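A minimal sketch of the call, assuming the imported table is stored as `data_ay_dt`; the result name `summary_ay` is a hypothetical choice. The function takes the object name as a string because it looks the table up with `get()`. The toy table underneath only illustrates the row-wise counting idiom the function relies on, with made-up column names and values.

```r
library(data.table)

# Hypothetical call on the real data (commented out; needs data_ay_dt):
# summary_ay <- process_data_object("data_ay_dt")

# Toy table with made-up sample columns
toy <- data.table(
  SNP_id   = c("snp_1", "snp_2"),
  s1_REF   = c("match", "mismatch"),
  s2_REF   = c("match", NA),
  s1_zcomp = c("match", "mismatch")
)

cols_REF <- grep("_REF$", names(toy), value = TRUE)

# rowSums() over a logical matrix counts TRUE per row; NA cells are skipped
toy[, c("REF_match_count", "REF_mismatch_count") := .(
  rowSums(.SD == "match", na.rm = TRUE),
  rowSums(.SD == "mismatch", na.rm = TRUE)
), .SDcols = cols_REF]

print(toy)
```

Missing genotypes (`NA`) count as neither match nor mismatch, which is why the per-SNP totals in the summaries can be lower than 18.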
Function to process the summaries for plotting
process_summary_object <- function(summary_object_name) {
# Select only the relevant columns
dt <- get(summary_object_name)[, .(SNP_id, REF_mismatch, ALT_mismatch, Zigo_mismatch)]
# Reshape data to long format
dt_long <- reshape2::melt(dt, id.vars = "SNP_id", variable.name = "type", value.name = "count")
# Convert to data.table if it's not already
setDT(dt_long)
# Convert count to numeric if it's not already
dt_long[, count := as.numeric(count)]
# Count occurrences per count value
dt_long <- dt_long[, .(n = .N), by = .(type, count)]
# Calculate total count of unique SNPs
total_SNP <- length(unique(dt$SNP_id))
# Add a new column for the percentage
dt_long[, perc := n / total_SNP * 100]
# Define new labels
new_labels <- c(
"Reference Allele" = "REF_mismatch",
"Alternative Allele" = "ALT_mismatch",
"Zygosity Mismatch" = "Zigo_mismatch"
)
# Apply new labels
dt_long$type <- forcats::fct_recode(dt_long$type, !!!new_labels)
# Return the processed data.table
return(dt_long)
}
# we can save the function to source it later
dump(
"process_summary_object",
here(
"scripts", "analysis", "process_summary_object.R")
)
How to run the process_summary_object function
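A minimal sketch of the call, assuming a summary table named `summary_ay` exists (a hypothetical name). Because `plot_dt_long()` looks for an object called `dt_long_<suffix>`, the result is stored under that pattern here. The toy example underneath reproduces the reshaping and percentage steps on made-up counts.

```r
library(data.table)

# Hypothetical call on the real summary (commented out; needs summary_ay):
# dt_long_ay <- process_summary_object("summary_ay")

# Toy summary: mismatches per SNP, made-up values
toy <- data.table(
  SNP_id        = c("snp_1", "snp_2", "snp_3", "snp_4"),
  REF_mismatch  = c(0, 0, 1, 3),
  ALT_mismatch  = c(0, 0, 0, 1),
  Zigo_mismatch = c(0, 0, 1, 3)
)

# Long format: one row per SNP and mismatch type
toy_long <- melt(toy, id.vars = "SNP_id",
                 variable.name = "type", value.name = "count")

# Number of SNPs at each mismatch count, as a share of all SNPs
toy_long <- toy_long[, .(n = .N), by = .(type, count)]
toy_long[, perc := n / uniqueN(toy$SNP_id) * 100]

print(toy_long)
```

Within each type the `perc` values sum to 100, since every SNP falls into exactly one mismatch-count bin.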
Theme for plotting
# import plotting theme
source(
here(
"scripts",
"analysis",
"my_theme2.R" # choose my_theme.R (Roboto Condensed) or my_theme2.R (default font)
)
)
Function to plot errors per SNP per sample
plot_dt_long <- function(object_suffix) {
# Get the object name based on the suffix
object_name <- paste0("dt_long_", object_suffix)
# Get the corresponding data.table object
dt_long <- get(object_name)
# Create facet histogram
p <- ggplot(dt_long, aes(x = count, y = n)) +
geom_bar(
stat = "identity",
fill = "#ffcae4",
color = ifelse(
dt_long$count == 0,
"#CCFF00",
ifelse(dt_long$count == 1, "#4169E1", "#FF7F50")
),
width = 0.6,
linewidth = 1
) +
geom_text(
aes(label = paste0(
scales::comma(n), " (", round(perc, 2), "%)"
)),
hjust = ifelse(dt_long$count == 0, .7, 0.01),
size = 2.3,
color = "gray10"
) +
facet_wrap(~ type, scales = "free_y") +
labs(
title = paste("Histogram of SNP Mismatch Counts", object_suffix),
x = "Sample Count",
y = "SNP Count",
caption = paste(object_suffix, "\n Bar border colors: Electric Lime = no errors; Royal Blue = 1 error; Coral = more than 1 error")
) +
scale_y_continuous(
breaks = c(0, 25000, 50000, 75000, 100000, 125000, 150000, 175000),
labels = function(x) paste0(x / 1000, "k"),
expand = expansion(mult = c(0, 0.2))
) +
scale_x_continuous(breaks = 0:18, expand = expansion(add = c(0.5, 0))) +
my_theme() +
coord_flip() +
theme(
plot.caption = element_text(
face = "italic",
size = 10,
color = "grey20"
),
panel.spacing = unit(2, "lines"),
plot.margin = unit(c(1, 3, 1, 1), "cm"),
axis.text.x = element_text(size = 7, angle = 0)
)
# Save the plot
output_file <- here("output", "wgs_vs_chip", "figures", paste0(object_suffix, "_mismatches.pdf"))
ggsave(output_file, p, width = 8, height = 6, units = "in")
# Return the plot object
return(p)
}
How to run the plotting function
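A minimal sketch of the call, assuming `dt_long_ay` has been created with `process_summary_object()` and that `my_theme()` has been sourced. The toy plot underneath is an alternative sketch, not the function above: it stores the border colour in a column and maps it with `scale_colour_identity()`, so each colour stays attached to its row. All values are made up.

```r
library(ggplot2)

# Hypothetical call on the real data (commented out; needs dt_long_ay and my_theme):
# plot_dt_long("ay")

# Toy counts, made-up values
toy <- data.frame(
  type  = "Reference Allele",
  count = c(0, 1, 2),
  n     = c(90, 7, 3)
)

# Border colour per bar: no errors, 1 error, more than 1 error
toy$border <- ifelse(toy$count == 0, "#CCFF00",
                     ifelse(toy$count == 1, "#4169E1", "#FF7F50"))

p <- ggplot(toy, aes(x = count, y = n, colour = border)) +
  geom_col(fill = "#ffcae4", width = 0.6, linewidth = 1) +
  scale_colour_identity() +
  facet_wrap(~ type, scales = "free_y") +
  coord_flip() +
  labs(x = "Sample Count", y = "SNP Count")

print(p)
```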
Function to get summary for each population
generate_summary <- function(data_dt, population, object_suffix) {
# Extract population columns
pop_cols <- grep(paste0("^", population, "_"), names(data_dt), value = TRUE)
# Subset the data into population-specific data table
data_pop <- data_dt[, c('SNP_id', pop_cols), with = FALSE]
# Create columns for match and mismatch count for columns ending with _REF
cols_REF <- grep("_REF$", names(data_pop), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_pop[, c("REF_match_count", "REF_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_REF]
# Create columns for match and mismatch count for columns ending with _ALT
cols_ALT <- grep("_ALT$", names(data_pop), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_pop[, c("ALT_match_count", "ALT_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_ALT]
# Create columns for match and mismatch count for columns ending with _zcomp
cols_Zigo <- grep("_zcomp$", names(data_pop), value = TRUE)
# Calculate the count of "match" or "mismatch" for each row
data_pop[, c("Zigo_match_count", "Zigo_mismatch_count") :=
.(rowSums(.SD == "match", na.rm = TRUE),
rowSums(.SD == "mismatch", na.rm = TRUE)),
.SDcols = cols_Zigo]
# Now, you can summarize this for each SNP_id
summary_pop <- data_pop[, .(
REF_match = sum(REF_match_count, na.rm = TRUE),
REF_mismatch = sum(REF_mismatch_count, na.rm = TRUE),
ALT_match = sum(ALT_match_count, na.rm = TRUE),
ALT_mismatch = sum(ALT_mismatch_count, na.rm = TRUE),
Zigo_match = sum(Zigo_match_count, na.rm = TRUE),
Zigo_mismatch = sum(Zigo_mismatch_count, na.rm = TRUE)
),
by = SNP_id]
# Sort data by SNP_id
setorder(summary_pop, SNP_id)
# Assign the summary_pop object to a new variable based on the object_suffix
summary_pop_object_name <- paste0("summary_", population, "_", object_suffix)
assign(summary_pop_object_name, summary_pop, envir = .GlobalEnv)
# Return the summary_pop object
return(summary_pop)
}
How to run the functions
summary_sai_ay <- generate_summary(data_ay_dt, "SAI", "ay")
summary_kat_ay <- generate_summary(data_ay_dt, "KAT", "ay")
# merge_and_transform() is defined in the next section; call it once both summaries exist
Function to merge the SAI and KAT summaries
merge_and_transform <- function(object_suffix) {
# Merge summary_sai and summary_kat
merged_sai_kat <- merge(
get(paste0("summary_sai_", object_suffix)),
get(paste0("summary_kat_", object_suffix)),
by = "SNP_id",
suffixes = c("_sai", "_kat")
)
# Select only the relevant columns
dt <- merged_sai_kat[, .(
SNP_id,
REF_mismatch_sai,
ALT_mismatch_sai,
Zigo_mismatch_sai,
REF_mismatch_kat,
ALT_mismatch_kat,
Zigo_mismatch_kat
)]
# Reshape data to long format
dt_long <- melt(
dt,
id.vars = "SNP_id",
variable.name = "type",
value.name = "count"
)
# Convert to data.table if it's not already
setDT(dt_long)
# Extract the last part after "_" in the 'type' column to form 'group' column
dt_long[, group := str_extract(type, "(?<=_)[^_]+$")]
# Extract the part before the first "_" in the 'type' column to form 'allele' column
dt_long[, allele := str_extract(type, "^[^_]+")]
# Convert to numeric if it's not already
dt_long[, count := as.numeric(count)]
# Count occurrences per count value
dt_long <- dt_long[, .(n = .N), by = .(allele, group, count)]
# Calculate total count of unique SNPs
total_SNP <- length(unique(dt$SNP_id))
# Add a new column for the percentage
dt_long[, perc := n / total_SNP * 100, by = group]
# Set levels for 'group' variable
dt_long$group <- factor(dt_long$group, levels = c("sai", "kat"))
# Set levels for 'allele' variable
dt_long$allele <- factor(dt_long$allele, levels = c("REF", "ALT", "Zigo"))
# Modify levels for 'allele' variable
levels(dt_long$allele) <- c("Reference Allele", "Alternative Allele", "Zygosity")
# Modify levels for 'group' variable
levels(dt_long$group) <- c("SAI", "KAT")
dt_long$count <- as.numeric(dt_long$count)
# Assign the dt_long object to a new variable
dt_long_object_name <- paste0("dt_long_", object_suffix)
assign(dt_long_object_name, dt_long, envir = .GlobalEnv)
# Return the dt_long object
return(dt_long)
}
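The two regular expressions that build the `allele` and `group` columns can be checked in isolation. A small sketch, using column names of the form produced by the merge with `suffixes = c("_sai", "_kat")`:

```r
library(stringr)

# Column names after the merge, as they appear in the 'type' column
type <- c("REF_mismatch_sai", "ALT_mismatch_kat", "Zigo_mismatch_sai")

# Text before the first underscore: the comparison (REF, ALT, Zigo)
allele <- str_extract(type, "^[^_]+")

# Text after the last underscore: the population suffix (sai, kat)
group <- str_extract(type, "(?<=_)[^_]+$")

print(data.frame(type, allele, group))
```

The lookbehind `(?<=_)` requires an underscore immediately before the match without including it, so only the final token is returned.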
Function to create plot comparing the two populations
create_plot2 <- function(object_suffix, output_path, dt_long) {
# Create plot
plot <- ggplot(dt_long, aes(x = count, y = n)) +
geom_bar(
stat = "identity",
fill = "#ffcae4",
color = ifelse(
dt_long$count == 0,
"#CCFF00",
ifelse(dt_long$count == 1, "#4169E1", "#FF7F50")
),
width = 0.6,
linewidth = 1
) +
geom_text(
aes(label = paste0(
scales::comma(n), " (", round(perc, 2), "%)"
)),
hjust = ifelse(dt_long$count == 0, .7, 0.01),
size = 2.3,
color = "gray10"
) +
facet_wrap(~ group + allele, scales = "free_y", ncol = 3) +
labs(
title = paste("Histogram of SNP Mismatch Counts", object_suffix),
x = "Count",
y = "Frequency",
caption = paste(
object_suffix,
"\n KAT: 6 samples from native range; SAI: 12 samples from invasive range\n Bar border colors: Electric Lime = no errors; Royal Blue = 1 error; Coral = more than 1 error"
)
) +
coord_flip() +
my_theme() +
# scale_y_continuous(labels = scales::comma) +
scale_y_continuous(
breaks = c(0, 25000, 50000, 75000, 100000, 125000, 150000, 175000),
labels = function(x) paste0(x / 1000, "k"),
expand = expansion(mult = c(0, 0.2))
) +
scale_x_continuous(breaks = 0:18) +
theme(
plot.caption = element_text(
face = "italic",
size = 10,
color = "grey20"
),
panel.spacing = unit(3, "lines"),
plot.margin = unit(c(1, 3, 1, 1), "cm"),
axis.text.x = element_text(size = 7, angle = 0)
)
# Print the plot in RStudio
print(plot)
# Save the plot
ggsave(
output_path,
plot = plot,
width = 8,
height = 8,
units = "in"
)
}
How to run the functions
summary_sai_ay <- generate_summary(data_ay_dt, "SAI", "ay")
summary_kat_ay <- generate_summary(data_ay_dt, "KAT", "ay")
dt_long_2_ay <- merge_and_transform("ay")
create_plot2("ay", here("output", "wgs_vs_chip", "figures", "ay_mismatches_SAI_KAT.pdf"), dt_long_2_ay)
Function to get counts for pairwise comparison plot
calculate_counts <- function(data_dt) {
# Initialize an empty list to hold the counts
count_list <- list()
# Select columns
matching_columns <- colnames(data_dt)[grepl(pattern = "(_REF$|_ALT$|_zcomp$)", colnames(data_dt))]
# Loop through each column
for (column in matching_columns) {
# Compare exactly: "mismatch" contains "match", so pattern matching would count both
match_count <- sum(data_dt[[column]] == "match", na.rm = TRUE)
mismatch_count <- sum(data_dt[[column]] == "mismatch", na.rm = TRUE)
# Create a data.table with counts for the current column
count_dt <- data.table(Column = column, Match = match_count, Mismatch = mismatch_count)
# Add the count data.table to the list
count_list[[column]] <- count_dt
}
# Combine all count data.tables into a single data.table
counts <- rbindlist(count_list)
# Calculate total
counts <- counts |>
mutate(Total = Match + Mismatch)
# Create new columns: Population, Sample, and Comparison
counts <- counts |>
mutate(
Population = sub("^([^_]+).*", "\\1", Column),
Sample = sub("^.*_(\\d+).*", "\\1", Column),
Comparison = sub(".*_([^_]+)$", "\\1", Column)
)
# Keep the columns of interest in a fixed order
counts <- counts |>
dplyr::select(Population, Sample, Comparison, Match, Mismatch, Total)
# Calculate percentage columns
counts <- counts |>
mutate(
Percent_Match = round((Match / Total) * 100, 2),
Percent_Mismatch = round((Mismatch / Total) * 100, 2)
)
# Replace zcomp with Zygosity
counts$Comparison <- gsub("zcomp", "Zygosity", counts$Comparison)
# Define color palette
color_palette <- c("#92C6FF", "#f5cb8b", "#bff28c")
# Convert Sample to numeric and sort samples numerically within each Population group
counts$Sample <- as.numeric(counts$Sample)
counts <- counts |> arrange(Population, Sample)
# Convert Sample column back to factor with sorted levels within each group
counts$Sample <- factor(counts$Sample, levels = unique(counts$Sample))
# Rename and reorder Comparison column
counts <- counts |> mutate(
Comparison_new = recode(
Comparison,
"REF" = "Reference Allele",
"ALT" = "Alternative Allele",
"Zygosity" = "Zygosity"
)
) |> mutate(
Comparison_new = factor(
Comparison_new,
levels = c("Reference Allele", "Alternative Allele", "Zygosity")
)
)
return(counts)
}
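One detail matters when counting these labels: the string "mismatch" contains "match", so a pattern search for "match" and an exact comparison give different totals. The sketch below shows the difference on made-up values; the variable names are illustrative only.

```r
library(stringr)

# Made-up comparison labels for one sample column
x <- c("match", "match", "mismatch", NA)

# Pattern search: also hits "mismatch", because it contains "match"
n_pattern <- sum(str_detect(x, "match"), na.rm = TRUE)

# Exact comparison: only the literal labels
n_match    <- sum(x == "match", na.rm = TRUE)
n_mismatch <- sum(x == "mismatch", na.rm = TRUE)

print(c(pattern = n_pattern, match = n_match, mismatch = n_mismatch))
```

With a pattern search, `Match` would include every mismatch, which inflates `Total` and lowers `Percent_Mismatch`.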
Pairwise plotting function
plot_counts <- function(counts, output_file = NULL) {
library(ggplot2)
# Define color palette
color_palette <- c("#92C6FF", "#f5cb8b", "#bff28c")
# Create plot
plot <- ggplot(counts,
aes(x = Sample, y = Mismatch, fill = Comparison)) +
geom_bar(stat = "identity", position = "dodge") +
facet_grid(Population ~ Comparison_new,
scales = "free_y",
space = "free") +
coord_flip() +
labs(
title = "SNP Mismatch Counts per Sample",
x = "Sample",
y = "Mismatches",
caption = "Genotyping errors per sample within each population."
) +
my_theme() +
theme(panel.spacing = unit(0.5, "lines")) +
geom_text(aes(label = paste0(
scales::comma(Mismatch), " (", Percent_Mismatch, "%)"
)),
hjust = 1,
size = 2.5) +
scale_fill_manual(values = color_palette) +
theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 7)) +
guides(fill = "none") +
theme(plot.caption = element_text(
face = "italic",
size = 10,
color = "grey20"
)) +
scale_y_continuous(labels = scales::comma) # Add thousands separator to y-axis labels
# Save the plot if output_file is provided
if (!is.null(output_file)) {
ggsave(output_file, plot, width = 8, height = 7, units = "in")
}
# Return the plot
return(plot)
}
How to run the calculate_counts and plot_counts functions
# Call the function with data_*_dt as input
counts_ay <- calculate_counts(data_ay_dt)
plot_counts(counts_ay, here("output", "wgs_vs_chip", "figures", "ay_SAI_KAT_per_sample_stats.pdf"))
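The column names encode population, sample, and comparison type, and `calculate_counts()` recovers them with three regular expressions. A small sketch of that parsing, using two `_REF`/`_ALT` names of the form produced by `process_csv_files()` plus one `_zcomp` name whose exact form is an assumption:

```r
# Example column names; the '_zcomp' name is assumed to follow the same pattern
cols <- c("KAT_9a_KAT_9b_REF", "SAI_15a_SAI_15b_ALT", "SAI_3a_SAI_3b_zcomp")

parsed <- data.frame(
  Column     = cols,
  # Leading text up to the first underscore
  Population = sub("^([^_]+).*", "\\1", cols),
  # Last run of digits that directly follows an underscore
  Sample     = as.numeric(sub("^.*_(\\d+).*", "\\1", cols)),
  # Trailing text after the last underscore
  Comparison = sub(".*_([^_]+)$", "\\1", cols)
)

print(parsed)
```

Because the first `.*` is greedy, the sample number is taken from the second sample label in the name (for example `9b`), which carries the same number as the first.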
The comparisons we will make:
Chip:
“ab” - Genotyping calls using 18 versus 95 samples
“ac” - Genotyping calls using 18 versus 500 samples
“bc” - Genotyping calls using 95 versus 500 samples
WGS:
“xy” - Genotyping calls with 18 versus 30 samples
“wy” - Genotyping calls with 18 versus 800 samples
“wx” - Genotyping calls with 30 versus 800 samples
Chip x WGS:
“ay” - WGS and chip calls with 18 samples
“bx” - WGS call with 30 samples and chip call with 95 samples
“cw” - WGS call with 800 samples and chip call with 500 samples
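The two-letter codes can be kept in a small lookup table so that each suffix is easy to decode. This is a sketch only: the meaning of the individual letters (a, b, c for chip calls; y, x, w for WGS calls) is inferred from the list above, and the table and column names are hypothetical.

```r
library(data.table)

# Call sets behind each letter (inferred from the comparison list; an assumption)
call_sets <- data.table(
  code       = c("a", "b", "c", "y", "x", "w"),
  technology = c("chip", "chip", "chip", "WGS", "WGS", "WGS"),
  n_samples  = c(18L, 95L, 500L, 18L, 30L, 800L)
)

# The nine two-letter suffixes used in the CSV file names
comparisons <- data.table(
  suffix = c("ab", "ac", "bc", "xy", "wy", "wx", "ay", "bx", "cw")
)
comparisons[, left_code  := substr(suffix, 1, 1)]
comparisons[, right_code := substr(suffix, 2, 2)]

# Update joins: attach a description for each side of the comparison
comparisons[call_sets, on = .(left_code = code),
            left_set := paste(i.technology, i.n_samples)]
comparisons[call_sets, on = .(right_code = code),
            right_set := paste(i.technology, i.n_samples)]

print(comparisons)
```

A table like this could also drive a loop over `comparisons$suffix` when importing the CSV files, instead of repeating the same block for each comparison.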
Generate csv files
Import csv
data_ab_dt <- process_csv_files("ab")
# Check and display only columns that match the criteria
head(data_ab_dt[, c("SNP_id", names(data_ab_dt)[grepl("_REF$|_ALT$", names(data_ab_dt))]), with = FALSE])
## SNP_id KAT_9a_KAT_9b_REF KAT_9a_KAT_9b_ALT SAI_15a_SAI_15b_REF
## 1: AX-581444870 match match match
## 2: AX-583035067 match match match
## 3: AX-583033342 match match match
## 4: AX-583035163 match match match
## 5: AX-583035194 match match match
## 6: AX-583033387 match match match
## SAI_15a_SAI_15b_ALT SAI_3a_SAI_3b_REF SAI_3a_SAI_3b_ALT KAT_12a_KAT_12b_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_12a_KAT_12b_ALT KAT_7a_KAT_7b_REF KAT_7a_KAT_7b_ALT SAI_2a_SAI_2b_REF
## 1: match match match <NA>
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_2a_SAI_2b_ALT SAI_14a_SAI_14b_REF SAI_14a_SAI_14b_ALT KAT_8a_KAT_8b_REF
## 1: <NA> match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_8a_KAT_8b_ALT SAI_13a_SAI_13b_REF SAI_13a_SAI_13b_ALT SAI_5a_SAI_5b_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_5a_SAI_5b_ALT SAI_18a_SAI_18b_REF SAI_18a_SAI_18b_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_10a_KAT_10b_REF KAT_10a_KAT_10b_ALT SAI_1a_SAI_1b_REF SAI_1a_SAI_1b_ALT
## 1: match match <NA> <NA>
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_17a_SAI_17b_REF SAI_17a_SAI_17b_ALT SAI_4a_SAI_4b_REF SAI_4a_SAI_4b_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_12a_SAI_12b_REF SAI_12a_SAI_12b_ALT KAT_11a_KAT_11b_REF
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_11a_KAT_11b_ALT SAI_16a_SAI_16b_REF SAI_16a_SAI_16b_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436125 18 0 18 0 18
## 2: AX-579436196 16 0 16 0 16
## 3: AX-579436243 15 3 18 0 15
## 4: AX-579436298 17 0 17 0 17
## 5: AX-579436308 16 0 16 0 16
## 6: AX-579436317 18 0 18 0 18
## Zigo_mismatch
## 1: 0
## 2: 0
## 3: 3
## 4: 0
## 5: 0
## 6: 0
Check NAs, match and mismatch counts
##
## match mismatch
## 2490 87163 889
Make data long format for plotting
## type count n perc
## 1: Reference Allele 0 84155 92.9458152
## 2: Reference Allele 3 622 0.6869740
## 3: Reference Allele 4 341 0.3766208
## 4: Reference Allele 2 1324 1.4623048
## 5: Reference Allele 1 3730 4.1196351
## 6: Reference Allele 5 174 0.1921760
Create plot of SNP error per sample
Compare both populations
summary_sai_ab <- generate_summary(data_ab_dt, "SAI", "ab")
summary_kat_ab <- generate_summary(data_ab_dt, "KAT", "ab")
dt_long_2_ab <- merge_and_transform("ab")
create_plot2("ab", here("output", "wgs_vs_chip", "figures", "ab_mismatches_SAI_KAT.pdf"), dt_long_2_ab)
Counts plot
# Call the function with data_*_dt as input
counts_ab <- calculate_counts(data_ab_dt)
plot_counts(counts_ab, here("output", "wgs_vs_chip", "figures", "ab_SAI_KAT_per_sample_stats.pdf"))
Generate csv files
Import csv
data_ac_dt <- process_csv_files("ac")
# Check and display only columns that match the criteria
head(data_ac_dt[, c("SNP_id", names(data_ac_dt)[grepl("_REF$|_ALT$", names(data_ac_dt))]), with = FALSE])
## SNP_id KAT_12a_KAT_12c_REF KAT_12a_KAT_12c_ALT SAI_3a_SAI_3c_REF
## 1: AX-583035067 match match match
## 2: AX-583033342 match match match
## 3: AX-583035194 match match match
## 4: AX-583033387 match match match
## 5: AX-583035211 match match match
## 6: AX-583035257 match match match
## SAI_3a_SAI_3c_ALT KAT_9a_KAT_9c_REF KAT_9a_KAT_9c_ALT SAI_15a_SAI_15c_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_15a_SAI_15c_ALT SAI_14a_SAI_14c_REF SAI_14a_SAI_14c_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_8a_KAT_8c_REF KAT_8a_KAT_8c_ALT SAI_2a_SAI_2c_REF SAI_2a_SAI_2c_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_7a_KAT_7c_REF KAT_7a_KAT_7c_ALT SAI_17a_SAI_17c_REF SAI_17a_SAI_17c_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_1a_SAI_1c_REF SAI_1a_SAI_1c_ALT KAT_10a_KAT_10c_REF KAT_10a_KAT_10c_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_18a_SAI_18c_REF SAI_18a_SAI_18c_ALT SAI_5a_SAI_5c_REF SAI_5a_SAI_5c_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_13a_SAI_13c_REF SAI_13a_SAI_13c_ALT SAI_16a_SAI_16c_REF
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_16a_SAI_16c_ALT KAT_11a_KAT_11c_REF KAT_11a_KAT_11c_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_12a_SAI_12c_REF SAI_12a_SAI_12c_ALT SAI_4a_SAI_4c_REF SAI_4a_SAI_4c_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436089 13 1 14 0 13
## 2: AX-579436149 18 0 18 0 18
## 3: AX-579436196 16 0 16 0 16
## 4: AX-579436243 15 3 18 0 15
## 5: AX-579436298 17 0 17 0 17
## 6: AX-579436308 16 0 16 0 16
## Zigo_mismatch
## 1: 1
## 2: 0
## 3: 0
## 4: 3
## 5: 0
## 6: 0
Make data long format for plotting
## type count n perc
## 1: Reference Allele 1 4327 4.7943536
## 2: Reference Allele 0 82923 91.8794043
## 3: Reference Allele 3 696 0.7711740
## 4: Reference Allele 2 1508 1.6708771
## 5: Reference Allele 4 371 0.4110712
## 6: Reference Allele 6 126 0.1396091
Create plot of SNP error per sample
Compare both populations
summary_sai_ac <- generate_summary(data_ac_dt, "SAI", "ac")
summary_kat_ac <- generate_summary(data_ac_dt, "KAT", "ac")
dt_long_2_ac <- merge_and_transform("ac")
create_plot2("ac", here("output", "wgs_vs_chip", "figures", "ac_mismatches_SAI_KAT.pdf"), dt_long_2_ac)
Counts plot
# Call the function with data_*_dt as input
counts_ac <- calculate_counts(data_ac_dt)
plot_counts(counts_ac, here("output", "wgs_vs_chip", "figures", "ac_SAI_KAT_per_sample_stats.pdf"))
Generate csv files
Import csv
data_bc_dt <- process_csv_files("bc")
# Check and display only columns that match the criteria
head(data_bc_dt[, c("SNP_id", names(data_bc_dt)[grepl("_REF$|_ALT$", names(data_bc_dt))]), with = FALSE])
## SNP_id SAI_3b_SAI_3c_REF SAI_3b_SAI_3c_ALT SAI_2b_SAI_2c_REF
## 1: AX-583035067 match match match
## 2: AX-583033342 match match match
## 3: AX-583033370 match match match
## 4: AX-583035194 match match match
## 5: AX-583033387 match match match
## 6: AX-583035211 match match match
## SAI_2b_SAI_2c_ALT SAI_18b_SAI_18c_REF SAI_18b_SAI_18c_ALT SAI_1b_SAI_1c_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_1b_SAI_1c_ALT SAI_4b_SAI_4c_REF SAI_4b_SAI_4c_ALT SAI_5b_SAI_5c_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_5b_SAI_5c_ALT SAI_12b_SAI_12c_REF SAI_12b_SAI_12c_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_13b_SAI_13c_REF SAI_13b_SAI_13c_ALT SAI_17b_SAI_17c_REF
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_17b_SAI_17c_ALT SAI_16b_SAI_16c_REF SAI_16b_SAI_16c_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_15b_SAI_15c_REF SAI_15b_SAI_15c_ALT SAI_14b_SAI_14c_REF
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_14b_SAI_14c_ALT KAT_7b_KAT_7c_REF KAT_7b_KAT_7c_ALT KAT_11b_KAT_11c_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_11b_KAT_11c_ALT KAT_10b_KAT_10c_REF KAT_10b_KAT_10c_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_8b_KAT_8c_REF KAT_8b_KAT_8c_ALT KAT_9b_KAT_9c_REF KAT_9b_KAT_9c_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_12b_KAT_12c_REF KAT_12b_KAT_12c_ALT
## 1: match match
## 2: match match
## 3: match match
## 4: match match
## 5: match match
## 6: match match
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436196 16 0 16 0 16
## 2: AX-579436243 18 0 18 0 18
## 3: AX-579436298 17 0 17 0 17
## 4: AX-579436308 18 0 18 0 18
## 5: AX-579436317 18 0 18 0 18
## 6: AX-579436348 18 0 18 0 18
## Zigo_mismatch
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## 6: 0
Make data long format for plotting
## type count n perc
## 1: Reference Allele 0 93283 97.19510289
## 2: Reference Allele 1 1741 1.81401407
## 3: Reference Allele 4 88 0.09169054
## 4: Reference Allele 2 498 0.51888513
## 5: Reference Allele 3 227 0.23651993
## 6: Reference Allele 6 41 0.04271946
Create plot of SNP error per sample
Compare both populations
summary_sai_bc <- generate_summary(data_bc_dt, "SAI", "bc")
summary_kat_bc <- generate_summary(data_bc_dt, "KAT", "bc")
dt_long_2_bc <- merge_and_transform("bc")
create_plot2("bc", here("output", "wgs_vs_chip", "figures", "bc_mismatches_SAI_KAT.pdf"), dt_long_2_bc)
Counts plot
# Call the function with data_*_dt as input
counts_bc <- calculate_counts(data_bc_dt)
plot_counts(counts_bc, here("output", "wgs_vs_chip", "figures", "bc_SAI_KAT_per_sample_stats.pdf"))
Generate csv files
Import csv
data_xy_dt <- process_csv_files("xy")
# Check and display only columns that match the criteria
head(data_xy_dt[, c("SNP_id", names(data_xy_dt)[grepl("_REF$|_ALT$", names(data_xy_dt))]), with = FALSE])
## SNP_id KAT_7x_KAT_7y_REF KAT_7x_KAT_7y_ALT SAI_2x_SAI_2y_REF
## 1: AX-583035067 match match match
## 2: AX-583035102 match match match
## 3: AX-583033340 match match match
## 4: AX-583033342 match match match
## 5: AX-583035163 match match match
## 6: AX-583033356 match match match
## SAI_2x_SAI_2y_ALT KAT_8x_KAT_8y_REF KAT_8x_KAT_8y_ALT SAI_14x_SAI_14y_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_14x_SAI_14y_ALT KAT_12x_KAT_12y_REF KAT_12x_KAT_12y_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_15x_SAI_15y_REF SAI_15x_SAI_15y_ALT KAT_9x_KAT_9y_REF KAT_9x_KAT_9y_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_3x_SAI_3y_REF SAI_3x_SAI_3y_ALT SAI_4x_SAI_4y_REF SAI_4x_SAI_4y_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_12x_SAI_12y_REF SAI_12x_SAI_12y_ALT SAI_16x_SAI_16y_REF
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_16x_SAI_16y_ALT KAT_11x_KAT_11y_REF KAT_11x_KAT_11y_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_18x_SAI_18y_REF SAI_18x_SAI_18y_ALT SAI_13x_SAI_13y_REF
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_13x_SAI_13y_ALT SAI_5x_SAI_5y_REF SAI_5x_SAI_5y_ALT SAI_1x_SAI_1y_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_1x_SAI_1y_ALT SAI_17x_SAI_17y_REF SAI_17x_SAI_17y_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_10x_KAT_10y_REF KAT_10x_KAT_10y_ALT
## 1: match match
## 2: match match
## 3: match match
## 4: match match
## 5: match match
## 6: match match
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436016 18 0 18 0 18
## 2: AX-579436089 18 0 18 0 18
## 3: AX-579436102 18 0 18 0 18
## 4: AX-579436125 18 0 18 0 18
## 5: AX-579436196 18 0 18 0 18
## 6: AX-579436214 18 0 18 0 18
## Zigo_mismatch
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## 6: 0
Make data long format for plotting
## type count n perc
## 1: Reference Allele 0 159905 98.49460114
## 2: Reference Allele 1 836 0.51494004
## 3: Reference Allele 3 256 0.15768499
## 4: Reference Allele 4 165 0.10163290
## 5: Reference Allele 2 285 0.17554774
## 6: Reference Allele 15 17 0.01047127
Create plot of SNP error per sample
Compare both populations
summary_sai_xy <- generate_summary(data_xy_dt, "SAI", "xy")
summary_kat_xy <- generate_summary(data_xy_dt, "KAT", "xy")
dt_long_2_xy <- merge_and_transform("xy")
create_plot2("xy", here("output", "wgs_vs_chip", "figures", "xy_mismatches_SAI_KAT.pdf"), dt_long_2_xy)
Counts plot
# Call the function with data_*_dt as input
counts_xy <- calculate_counts(data_xy_dt)
plot_counts(counts_xy, here("output", "wgs_vs_chip", "figures", "xy_SAI_KAT_per_sample_stats.pdf"))
Generate csv files
Import csv
data_wy_dt <- process_csv_files("wy")
# Check and display only columns that match the criteria
head(data_wy_dt[, c("SNP_id", names(data_wy_dt)[grepl("_REF$|_ALT$", names(data_wy_dt))]), with = FALSE])
## SNP_id KAT_7w_KAT_7y_REF KAT_7w_KAT_7y_ALT KAT_8w_KAT_8y_REF
## 1: AX-583035067 match match match
## 2: AX-583035102 match match match
## 3: AX-583033340 match match match
## 4: AX-583033342 match match match
## 5: AX-583035163 match match match
## 6: AX-583033356 match match match
## KAT_8w_KAT_8y_ALT SAI_14w_SAI_14y_REF SAI_14w_SAI_14y_ALT SAI_2w_SAI_2y_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_2w_SAI_2y_ALT SAI_3w_SAI_3y_REF SAI_3w_SAI_3y_ALT SAI_15w_SAI_15y_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_15w_SAI_15y_ALT KAT_9w_KAT_9y_REF KAT_9w_KAT_9y_ALT KAT_12w_KAT_12y_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_12w_KAT_12y_ALT SAI_12w_SAI_12y_REF SAI_12w_SAI_12y_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_4w_SAI_4y_REF SAI_4w_SAI_4y_ALT KAT_11w_KAT_11y_REF KAT_11w_KAT_11y_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_16w_SAI_16y_REF SAI_16w_SAI_16y_ALT SAI_5w_SAI_5y_REF SAI_5w_SAI_5y_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_13w_SAI_13y_REF SAI_13w_SAI_13y_ALT SAI_18w_SAI_18y_REF
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_18w_SAI_18y_ALT KAT_10w_KAT_10y_REF KAT_10w_KAT_10y_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_17w_SAI_17y_REF SAI_17w_SAI_17y_ALT SAI_1w_SAI_1y_REF SAI_1w_SAI_1y_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436016 18 0 18 0 18
## 2: AX-579436089 18 0 18 0 18
## 3: AX-579436102 18 0 18 0 18
## 4: AX-579436125 18 0 18 0 18
## 5: AX-579436196 18 0 18 0 18
## 6: AX-579436214 18 0 18 0 18
## Zigo_mismatch
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## 6: 0
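The per-SNP summary above counts, for each SNP, how many samples match or mismatch. The code that builds it is not shown in this section, so the following is only a minimal sketch of the idea on a toy table. The helper `count_status()`, the object names, and the data are hypothetical and are not the notebook's `generate_summary()`.

```r
# Toy wide table (hypothetical data): one row per SNP, one *_REF and one *_ALT
# status column per sample pair, NA where a genotype call is missing
wide <- data.frame(
  SNP_id = c("snp1", "snp2", "snp3"),
  S1_REF = c("match", "match", "mismatch"),
  S1_ALT = c("match", "mismatch", "mismatch"),
  S2_REF = c("match", NA, "match"),
  S2_ALT = c("match", NA, "match"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each SNP, count the columns in `cols` equal to `status`,
# ignoring missing values
count_status <- function(df, cols, status) {
  unname(rowSums(df[, cols, drop = FALSE] == status, na.rm = TRUE))
}

ref_cols <- grep("_REF$", names(wide), value = TRUE)
alt_cols <- grep("_ALT$", names(wide), value = TRUE)

summary_toy <- data.frame(
  SNP_id       = wide$SNP_id,
  REF_match    = count_status(wide, ref_cols, "match"),
  REF_mismatch = count_status(wide, ref_cols, "mismatch"),
  ALT_match    = count_status(wide, alt_cols, "match"),
  ALT_mismatch = count_status(wide, alt_cols, "mismatch")
)
print(summary_toy)
```

Because missing calls are ignored in this sketch, the match and mismatch counts of a SNP can add up to fewer than the number of samples.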
Make data long format for plotting
## type count n perc
## 1: Reference Allele 0 156689 96.440617460
## 2: Reference Allele 3 564 0.347136737
## 3: Reference Allele 1 2128 1.309764144
## 4: Reference Allele 2 814 0.501009405
## 5: Reference Allele 5 318 0.195726033
## 6: Reference Allele 16 16 0.009847851
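In the long table above, `count` appears to be the number of samples with a mismatch, `n` the number of SNPs with that count, and `perc` the share of SNPs. A minimal sketch of such a tally, assuming that interpretation and using made-up counts, could look like this:

```r
# Hypothetical per-SNP number of samples with a reference-allele mismatch
ref_mismatch <- c(0, 0, 0, 1, 0, 2, 1, 0)

# Tally how many SNPs show each mismatch count
tally <- as.data.frame(
  table(count = ref_mismatch),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Label the allele type and express each tally as a percentage of all SNPs
tally$type <- "Reference Allele"
tally$perc <- tally$n / sum(tally$n) * 100
tally <- tally[, c("type", "count", "n", "perc")]
print(tally)
```

The same tally would be repeated for the alternative allele and for zygosity before stacking the three tables for plotting.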
Create plot of SNP error per sample
Compare both populations
summary_sai_wy <- generate_summary(data_wy_dt, "SAI", "suffix")
summary_kat_wy <- generate_summary(data_wy_dt, "KAT", "suffix")
dt_long_2_wy <- merge_and_transform("wy")
create_plot2("wy", here("output", "wgs_vs_chip", "figures", "wy_mismatches_SAI_KAT.pdf"), dt_long_2_wy)
Counts plot
# Call the function with data_*_dt as input
counts_wy <- calculate_counts(data_wy_dt)
plot_counts(counts_wy, here("output", "wgs_vs_chip", "figures", "wy_SAI_KAT_per_sample_stats.pdf"))
Generate csv files
Import csv
data_wx_dt <- process_csv_files("wx")
# Check and display only columns that match the criteria
head(data_wx_dt[, c("SNP_id", names(data_wx_dt)[grepl("_REF$|_ALT$", names(data_wx_dt))]), with = FALSE])
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436016 18 0 18 0 18
## 2: AX-579436089 18 0 18 0 18
## 3: AX-579436102 18 0 18 0 18
## 4: AX-579436125 18 0 18 0 18
## 5: AX-579436196 18 0 18 0 18
## 6: AX-579436214 18 0 18 0 18
## Zigo_mismatch
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## 6: 0
Make data long format for plotting
## type count n perc
## 1: Reference Allele 0 161444 96.64296147
## 2: Reference Allele 3 577 0.34540143
## 3: Reference Allele 2 824 0.49325958
## 4: Reference Allele 1 2117 1.26727007
## 5: Reference Allele 5 310 0.18557096
## 6: Reference Allele 16 24 0.01436678
Create plot of SNP error per sample
Compare both populations
summary_sai_wx <- generate_summary(data_wx_dt, "SAI", "suffix")
summary_kat_wx <- generate_summary(data_wx_dt, "KAT", "suffix")
dt_long_2_wx <- merge_and_transform("wx")
create_plot2("wx", here("output", "wgs_vs_chip", "figures", "wx_mismatches_SAI_KAT.pdf"), dt_long_2_wx)
Counts plot
# Call the function with data_*_dt as input
counts_wx <- calculate_counts(data_wx_dt)
plot_counts(counts_wx, here("output", "wgs_vs_chip", "figures", "wx_SAI_KAT_per_sample_stats.pdf"))
Generate csv files
Import csv
data_ay_dt <- process_csv_files("ay")
# Check and display only columns that match the criteria
head(data_ay_dt[, c("SNP_id", names(data_ay_dt)[grepl("_REF$|_ALT$", names(data_ay_dt))]), with = FALSE])
## SNP_id KAT_11a_KAT_11y_REF KAT_11a_KAT_11y_ALT SAI_16a_SAI_16y_REF
## 1: AX-583035067 match mismatch match
## 2: AX-583035102 match match mismatch
## 3: AX-583033342 match match match
## 4: AX-583035163 match match match
## 5: AX-583035194 match match match
## 6: AX-583033387 match match match
## SAI_16a_SAI_16y_ALT SAI_12a_SAI_12y_REF SAI_12a_SAI_12y_ALT
## 1: match match match
## 2: match mismatch mismatch
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_4a_SAI_4y_REF SAI_4a_SAI_4y_ALT KAT_10a_KAT_10y_REF KAT_10a_KAT_10y_ALT
## 1: match match match match
## 2: match match mismatch match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_17a_SAI_17y_REF SAI_17a_SAI_17y_ALT SAI_1a_SAI_1y_REF SAI_1a_SAI_1y_ALT
## 1: match match match match
## 2: match match match mismatch
## 3: match match match match
## 4: match match match match
## 5: match match match mismatch
## 6: match match match match
## SAI_5a_SAI_5y_REF SAI_5a_SAI_5y_ALT SAI_13a_SAI_13y_REF SAI_13a_SAI_13y_ALT
## 1: match match match match
## 2: match mismatch match mismatch
## 3: match match match match
## 4: match match mismatch mismatch
## 5: match match match match
## 6: match match match match
## SAI_18a_SAI_18y_REF SAI_18a_SAI_18y_ALT KAT_8a_KAT_8y_REF KAT_8a_KAT_8y_ALT
## 1: match match match match
## 2: <NA> <NA> match mismatch
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_14a_SAI_14y_REF SAI_14a_SAI_14y_ALT SAI_2a_SAI_2y_REF SAI_2a_SAI_2y_ALT
## 1: match match match match
## 2: match mismatch match mismatch
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_7a_KAT_7y_REF KAT_7a_KAT_7y_ALT SAI_3a_SAI_3y_REF SAI_3a_SAI_3y_ALT
## 1: match mismatch match match
## 2: match mismatch mismatch match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_15a_SAI_15y_REF SAI_15a_SAI_15y_ALT KAT_9a_KAT_9y_REF KAT_9a_KAT_9y_ALT
## 1: match match match match
## 2: match mismatch <NA> <NA>
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_12a_KAT_12y_REF KAT_12a_KAT_12y_ALT
## 1: match match
## 2: match mismatch
## 3: match match
## 4: match match
## 5: match match
## 6: match match
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436089 15 0 15 0 15
## 2: AX-579436125 15 3 18 0 15
## 3: AX-579436196 14 2 16 0 14
## 4: AX-579436243 15 3 18 0 15
## 5: AX-579436298 16 1 12 5 13
## 6: AX-579436308 16 0 16 0 16
## Zigo_mismatch
## 1: 0
## 2: 3
## 3: 2
## 4: 3
## 5: 4
## 6: 0
Make data long format for plotting
## type count n perc
## 1: Reference Allele 0 60897 61.5419597
## 2: Reference Allele 3 4609 4.6578139
## 3: Reference Allele 2 8737 8.8295335
## 4: Reference Allele 1 18417 18.6120543
## 5: Reference Allele 4 2586 2.6133883
## 6: Reference Allele 7 471 0.4759884
Create plot of SNP error per sample
Compare both populations
summary_sai_ay <- generate_summary(data_ay_dt, "SAI", "suffix")
summary_kat_ay <- generate_summary(data_ay_dt, "KAT", "suffix")
dt_long_2_ay <- merge_and_transform("ay")
create_plot2("ay", here("output", "wgs_vs_chip", "figures", "ay_mismatches_SAI_KAT.pdf"), dt_long_2_ay)
Counts plot
# Call the function with data_*_dt as input
counts_ay <- calculate_counts(data_ay_dt)
plot_counts(counts_ay, here("output", "wgs_vs_chip", "figures", "ay_SAI_KAT_per_sample_stats.pdf"))
Generate csv files
Import csv
data_bx_dt <- process_csv_files("bx")
# Check and display only columns that match the criteria
head(data_bx_dt[, c("SNP_id", names(data_bx_dt)[grepl("_REF$|_ALT$", names(data_bx_dt))]), with = FALSE])
## SNP_id SAI_14b_SAI_14x_REF SAI_14b_SAI_14x_ALT KAT_8b_KAT_8x_REF
## 1: AX-583035067 match match match
## 2: AX-583035102 match mismatch match
## 3: AX-583033342 match match match
## 4: AX-583035163 match match match
## 5: AX-583033370 match match match
## 6: AX-583035194 match match match
## KAT_8b_KAT_8x_ALT SAI_2b_SAI_2x_REF SAI_2b_SAI_2x_ALT KAT_7b_KAT_7x_REF
## 1: match match match match
## 2: mismatch match mismatch match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## KAT_7b_KAT_7x_ALT KAT_12b_KAT_12x_REF KAT_12b_KAT_12x_ALT SAI_3b_SAI_3x_REF
## 1: mismatch match match match
## 2: mismatch match mismatch mismatch
## 3: match match match match
## 4: match match match match
## 5: match match match mismatch
## 6: match match match match
## SAI_3b_SAI_3x_ALT KAT_9b_KAT_9x_REF KAT_9b_KAT_9x_ALT SAI_15b_SAI_15x_REF
## 1: match match match match
## 2: match <NA> <NA> match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_15b_SAI_15x_ALT SAI_16b_SAI_16x_REF SAI_16b_SAI_16x_ALT
## 1: match match match
## 2: mismatch mismatch match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_11b_KAT_11x_REF KAT_11b_KAT_11x_ALT SAI_12b_SAI_12x_REF
## 1: match mismatch match
## 2: match match mismatch
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_12b_SAI_12x_ALT SAI_4b_SAI_4x_REF SAI_4b_SAI_4x_ALT SAI_17b_SAI_17x_REF
## 1: match match match match
## 2: mismatch match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_17b_SAI_17x_ALT SAI_1b_SAI_1x_REF SAI_1b_SAI_1x_ALT KAT_10b_KAT_10x_REF
## 1: match match match match
## 2: match match mismatch mismatch
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match mismatch match
## KAT_10b_KAT_10x_ALT SAI_18b_SAI_18x_REF SAI_18b_SAI_18x_ALT
## 1: match match match
## 2: match <NA> <NA>
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## SAI_5b_SAI_5x_REF SAI_5b_SAI_5x_ALT SAI_13b_SAI_13x_REF SAI_13b_SAI_13x_ALT
## 1: match match match match
## 2: match mismatch match mismatch
## 3: match match match match
## 4: match match mismatch mismatch
## 5: match match match match
## 6: match match match match
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436125 15 3 18 0 15
## 2: AX-579436196 14 2 16 0 14
## 3: AX-579436243 18 0 18 0 18
## 4: AX-579436298 16 1 12 5 13
## 5: AX-579436308 18 0 18 0 18
## 6: AX-579436348 18 0 18 0 18
## Zigo_mismatch
## 1: 3
## 2: 2
## 3: 0
## 4: 4
## 5: 0
## 6: 0
Make data long format for plotting
## type count n perc
## 1: Reference Allele 3 3861 3.830927
## 2: Reference Allele 2 7427 7.369152
## 3: Reference Allele 0 66782 66.261845
## 4: Reference Allele 1 16746 16.615568
## 5: Reference Allele 5 1168 1.158903
## 6: Reference Allele 4 2155 2.138215
Create plot of SNP error per sample
Compare both populations
summary_sai_bx <- generate_summary(data_bx_dt, "SAI", "suffix")
summary_kat_bx <- generate_summary(data_bx_dt, "KAT", "suffix")
dt_long_2_bx <- merge_and_transform("bx")
create_plot2("bx", here("output", "wgs_vs_chip", "figures", "bx_mismatches_SAI_KAT.pdf"), dt_long_2_bx)
Counts plot
# Call the function with data_*_dt as input
counts_bx <- calculate_counts(data_bx_dt)
plot_counts(counts_bx, here("output", "wgs_vs_chip", "figures", "bx_SAI_KAT_per_sample_stats.pdf"))
Generate csv files
Import csv
data_cw_dt <- process_csv_files("cw")
# Check and display only columns that match the criteria
head(data_cw_dt[, c("SNP_id", names(data_cw_dt)[grepl("_REF$|_ALT$", names(data_cw_dt))]), with = FALSE])
## SNP_id KAT_9c_KAT_9w_REF KAT_9c_KAT_9w_ALT SAI_15c_SAI_15w_REF
## 1: AX-583035067 match match match
## 2: AX-583033342 match match match
## 3: AX-583033356 match match match
## 4: AX-583033370 match match match
## 5: AX-583035194 match match match
## 6: AX-583033387 match match match
## SAI_15c_SAI_15w_ALT SAI_3c_SAI_3w_REF SAI_3c_SAI_3w_ALT KAT_12c_KAT_12w_REF
## 1: match match match match
## 2: match match match match
## 3: match mismatch match match
## 4: match mismatch match match
## 5: match match match match
## 6: match match match match
## KAT_12c_KAT_12w_ALT KAT_7c_KAT_7w_REF KAT_7c_KAT_7w_ALT SAI_2c_SAI_2w_REF
## 1: match match mismatch match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_2c_SAI_2w_ALT SAI_14c_SAI_14w_REF SAI_14c_SAI_14w_ALT KAT_8c_KAT_8w_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match mismatch match
## KAT_8c_KAT_8w_ALT SAI_13c_SAI_13w_REF SAI_13c_SAI_13w_ALT SAI_5c_SAI_5w_REF
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match match match match
## SAI_5c_SAI_5w_ALT SAI_18c_SAI_18w_REF SAI_18c_SAI_18w_ALT
## 1: match match match
## 2: match match match
## 3: match match match
## 4: match match match
## 5: match match match
## 6: match match match
## KAT_10c_KAT_10w_REF KAT_10c_KAT_10w_ALT SAI_1c_SAI_1w_REF SAI_1c_SAI_1w_ALT
## 1: match match match match
## 2: match match match match
## 3: <NA> <NA> match match
## 4: match match match match
## 5: match match match mismatch
## 6: match match match mismatch
## SAI_17c_SAI_17w_REF SAI_17c_SAI_17w_ALT SAI_4c_SAI_4w_REF SAI_4c_SAI_4w_ALT
## 1: match match match match
## 2: match match match match
## 3: match match match match
## 4: match match match match
## 5: match match match match
## 6: match mismatch match match
## SAI_12c_SAI_12w_REF SAI_12c_SAI_12w_ALT KAT_11c_KAT_11w_REF
## 1: match match match
## 2: match match match
## 3: match match <NA>
## 4: match match match
## 5: match match match
## 6: match mismatch match
## KAT_11c_KAT_11w_ALT SAI_16c_SAI_16w_REF SAI_16c_SAI_16w_ALT
## 1: mismatch match match
## 2: match match match
## 3: <NA> mismatch match
## 4: match match match
## 5: match match match
## 6: match match mismatch
Get the summary
## SNP_id REF_match REF_mismatch ALT_match ALT_mismatch Zigo_match
## 1: AX-579436089 15 1 16 0 15
## 2: AX-579436149 18 0 18 0 18
## 3: AX-579436196 15 2 17 0 15
## 4: AX-579436243 18 0 18 0 18
## 5: AX-579436298 16 1 14 3 13
## 6: AX-579436308 18 0 18 0 18
## Zigo_mismatch
## 1: 1
## 2: 0
## 3: 2
## 4: 0
## 5: 4
## 6: 0
Make data long format for plotting
## type count n perc
## 1: Reference Allele 1 17110 16.1682022
## 2: Reference Allele 0 71793 67.8412473
## 3: Reference Allele 2 7336 6.9321994
## 4: Reference Allele 5 1105 1.0441767
## 5: Reference Allele 3 3758 3.5511458
## 6: Reference Allele 7 350 0.3307347
Create plot of SNP error per sample
Compare both populations
summary_sai_cw <- generate_summary(data_cw_dt, "SAI", "suffix")
summary_kat_cw <- generate_summary(data_cw_dt, "KAT", "suffix")
dt_long_2_cw <- merge_and_transform("cw")
create_plot2("cw", here("output", "wgs_vs_chip", "figures", "cw_mismatches_SAI_KAT.pdf"), dt_long_2_cw)
Counts plot
# Call the function with data_*_dt as input
counts_cw <- calculate_counts(data_cw_dt)
plot_counts(counts_cw, here("output", "wgs_vs_chip", "figures", "cw_SAI_KAT_per_sample_stats.pdf"))
Function to get the Zygosity summary for each object
create_Zygosity_df <- function(counts_df) {
Zygosity_df <- counts_df |>
filter(Comparison == "Zygosity") |>
dplyr::select(
Population,
Sample,
Total,
Match,
Percent_Match,
Mismatch,
Percent_Mismatch
)
return(Zygosity_df)
}
Apply the function
# chip
Zygosity_ab <- create_Zygosity_df(counts_ab)
Zygosity_ac <- create_Zygosity_df(counts_ac)
Zygosity_bc <- create_Zygosity_df(counts_bc)
# wgs
Zygosity_xy <- create_Zygosity_df(counts_xy)
Zygosity_wy <- create_Zygosity_df(counts_wy)
Zygosity_wx <- create_Zygosity_df(counts_wx)
# wgs x chip
Zygosity_ay <- create_Zygosity_df(counts_ay)
Zygosity_bx <- create_Zygosity_df(counts_bx)
Zygosity_cw <- create_Zygosity_df(counts_cw)
We use library(ggstatsplot) to compare the Zygosity mismatch rate between comparisons. We classified the genotype at each locus as homo_ref, homo_alt, or het in both data sets, and then checked whether the two classifications matched.
# Add source columns to each data frame
Zygosity_ab$Source <- 'ab'
Zygosity_ac$Source <- 'ac'
Zygosity_bc$Source <- 'bc'
Zygosity_xy$Source <- 'xy'
Zygosity_wy$Source <- 'wy'
Zygosity_wx$Source <- 'wx'
Zygosity_ay$Source <- 'ay'
Zygosity_bx$Source <- 'bx'
Zygosity_cw$Source <- 'cw'
# Combine all data frames
combined_data <-
rbind(
Zygosity_ab,
Zygosity_ac,
Zygosity_bc,
Zygosity_xy,
Zygosity_wy,
Zygosity_wx,
Zygosity_ay,
Zygosity_bx,
Zygosity_cw
)
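The zygosity classification described above (homo_ref, homo_alt, het) can be sketched on toy genotypes. This is an illustration under stated assumptions, not the code used upstream: `classify_zygosity()` is a hypothetical helper, and the alleles are invented.

```r
# Hypothetical helper: classify a diploid genotype from its two alleles and the
# reference allele; returns NA when either allele is missing
classify_zygosity <- function(a1, a2, ref) {
  ifelse(
    is.na(a1) | is.na(a2), NA_character_,
    ifelse(a1 != a2, "het", ifelse(a1 == ref, "homo_ref", "homo_alt"))
  )
}

ref <- rep("A", 4)

# Toy genotypes for the same four loci from two technologies
chip <- classify_zygosity(c("A", "A", "G", NA), c("A", "G", "G", "A"), ref)
wgs  <- classify_zygosity(c("A", "G", "G", "A"), c("A", "G", "G", "A"), ref)

# A locus matches when both technologies give the same class
status <- ifelse(
  is.na(chip) | is.na(wgs), NA_character_,
  ifelse(chip == wgs, "match", "mismatch")
)

# Percent mismatch over the loci that were called by both technologies
percent_mismatch <- sum(status == "mismatch", na.rm = TRUE) / sum(!is.na(status)) * 100
print(data.frame(chip, wgs, status))
```

In this toy example the second locus is het on the chip and homo_alt in the WGS data, so one of the three comparable loci is a mismatch.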
For KAT
# Specify the desired order
desired_order <- c("ab", "ac", "bc", "xy", "wy", "wx", "ay", "bx", "cw")
# Convert the 'Source' column to a factor and specify the order of the levels
combined_data$Source <- factor(combined_data$Source, levels = desired_order)
# For KAT
data_KAT_t <- subset(combined_data, Population == "KAT")
# first, assign the plot to a variable
plot_KAT_plot <- ggbetweenstats(
data = data_KAT_t,
x = Source,
y = Percent_Mismatch,
title = "Genotyping mismatches for KAT (native)",
type = "nonparametric",
pairwise.comparisons = TRUE,
pairwise.display = "significant",
palette = "RdYlBu", # change to a different palette if you prefer
package = "RColorBrewer"
)
plot_KAT_plot
# Use here function to specify the path
output_path <- here("output", "wgs_vs_chip", "figures", "stats_KAT.pdf")
# Save the plot
ggsave(filename = output_path, plot = plot_KAT_plot, width = 10, height = 7, dpi = 300)
For SAI
# For SAI
data_SAI_t <- subset(combined_data, Population == "SAI")
# first, assign the plot to a variable
plot_SAI_plot <- ggbetweenstats(
data = data_SAI_t,
x = Source,
y = Percent_Mismatch,
title = "Genotyping mismatches for SAI (invasive)",
type = "nonparametric",
pairwise.comparisons = TRUE,
pairwise.display = "significant",
palette = "RdYlBu", # change to a different palette if you prefer
package = "RColorBrewer"
)
plot_SAI_plot
# Use here function to specify the path
output_path <- here("output", "wgs_vs_chip", "figures", "stats_SAI.pdf")
# Save the plot
ggsave(filename = output_path, plot = plot_SAI_plot, width = 10, height = 7, dpi = 300)
Comparison irrespective of population
plot_both_plot<- ggbetweenstats(
data = combined_data, # using the entire data here, not just KAT
x = Source,
y = Percent_Mismatch,
title = "Comparison of mean percent mismatch between sources",
type = "nonparametric",
pairwise.comparisons = TRUE,
pairwise.display = "significant",
palette = "RdYlBu",
package = "RColorBrewer"
)
plot_both_plot
# Use here function to specify the path
output_path <- here("output", "wgs_vs_chip", "figures", "stats_both.pdf")
# Save the plot
ggsave(filename = output_path, plot = plot_both_plot, width = 10, height = 7, dpi = 300)
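With type = "nonparametric", ggbetweenstats reports, to our knowledge, a Kruskal-Wallis test across the groups (with Dunn tests for the pairwise comparisons). The global test can be reproduced with base R. The sketch below uses invented percent-mismatch values for three hypothetical sources, so the numbers are illustrative only.

```r
# Invented percent-mismatch values for three hypothetical sources, six samples each
toy <- data.frame(
  Source = factor(rep(c("s1", "s2", "s3"), each = 6)),
  Percent_Mismatch = c(
    1.10, 1.20, 1.30, 1.00, 1.40, 1.25,  # s1
    1.15, 1.22, 1.31, 1.05, 1.38, 1.27,  # s2
    7.10, 7.60, 8.20, 6.90, 8.70, 7.40   # s3
  )
)

# Kruskal-Wallis rank sum test: do the sources share the same distribution?
kw <- kruskal.test(Percent_Mismatch ~ Source, data = toy)
print(kw)

# Group medians, the natural summary for a rank-based comparison
medians <- tapply(toy$Percent_Mismatch, toy$Source, median)
print(medians)
```

Because the test is rank based, group medians are a more natural summary of these plots than group means.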
We can use the broom package to get the pairwise p-values as a table
set.seed(123)
# warning=FALSE is set for this chunk because tied values prevent wilcox.test from computing exact p-values.
# The next chunk adds jitter to break the ties, which avoids the warnings.
# Conduct pairwise Wilcoxon test
result <- pairwise.wilcox.test(
combined_data$Percent_Mismatch,
combined_data$Source,
p.adjust.method = "holm"
)
# Tidy the result to a dataframe
result_tidy <- broom::tidy(result)
# Print the result
print(result_tidy)
## # A tibble: 36 × 3
## group1 group2 p.value
## <chr> <chr> <dbl>
## 1 ac ab 0.384
## 2 bc ab 0.0000101
## 3 bc ac 0.0000101
## 4 xy ab 0.694
## 5 xy ac 1
## 6 xy bc 0.0000101
## 7 wy ab 0.0000101
## 8 wy ac 0.0000101
## 9 wy bc 0.0000101
## 10 wy xy 0.0000116
## # ℹ 26 more rows
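The Holm adjustment used above can be made concrete with a small worked example. The three raw p-values below are invented; the sketch compares `p.adjust()` with a manual calculation of the same step-down rule.

```r
# Three invented raw p-values from pairwise tests
p_raw <- c(0.01, 0.04, 0.03)

# Holm adjustment as used by pairwise.wilcox.test(p.adjust.method = "holm")
p_holm <- p.adjust(p_raw, method = "holm")

# Manual version: sort ascending, multiply the i-th smallest p-value by (m - i + 1),
# enforce monotonicity with a running maximum, cap at 1, restore the input order
m <- length(p_raw)
o <- order(p_raw)
manual <- pmin(1, cummax((m - seq_len(m) + 1) * p_raw[o]))[order(o)]

print(rbind(p_raw, p_holm, manual))
```

The smallest p-value is multiplied by the number of tests (0.01 x 3 = 0.03); the next one gives 0.03 x 2 = 0.06, and the largest inherits 0.06 from the running maximum.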
Add jitters
set.seed(123)
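# Jitter breaks the ties. The significant p-values become smaller (for example bc vs ab: 1.01e-05 without jitter, 7.93e-09 with jitter),
# while the non-significant ones barely change.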
combined_data$Percent_Mismatch_jitter <- jitter(combined_data$Percent_Mismatch, amount = 1e-9)
result <- pairwise.wilcox.test(
combined_data$Percent_Mismatch_jitter,
combined_data$Source,
p.adjust.method = "holm"
)
# Tidy the result to a dataframe
result_tidy <- broom::tidy(result)
# Print the result
print(result_tidy)
## # A tibble: 36 × 3
## group1 group2 p.value
## <chr> <chr> <dbl>
## 1 ac ab 0.382
## 2 bc ab 0.00000000793
## 3 bc ac 0.00000000793
## 4 xy ab 0.644
## 5 xy ac 1
## 6 xy bc 0.00000000793
## 7 wy ab 0.00000000837
## 8 wy ac 0.0000000423
## 9 wy bc 0.00000000793
## 10 wy xy 0.000000278
## # ℹ 26 more rows
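The effect of the jitter step can be checked on a toy example: a tiny amount of noise removes the ties without moving any value by more than the chosen amount. The vectors below are invented, and the sketch only illustrates the mechanism.

```r
set.seed(123)

# Two invented groups that share the tied value 1.2
x <- c(1.2, 1.2, 1.5, 1.7)
y <- c(1.2, 1.9, 2.1, 2.4)
has_ties_before <- anyDuplicated(c(x, y)) > 0

# jitter(v, amount = a) adds uniform noise drawn from [-a, a] to every value
eps <- 1e-9
xj <- jitter(x, amount = eps)
yj <- jitter(y, amount = eps)
has_ties_after <- anyDuplicated(c(xj, yj)) > 0

# Largest absolute change introduced by the jitter
max_shift <- max(abs(c(xj - x, yj - y)))

# Without ties, an exact Wilcoxon rank sum test can be requested
res <- wilcox.test(xj, yj, exact = TRUE)
cat("ties before:", has_ties_before, "| ties after:", has_ties_after, "\n")
cat("max shift:", max_shift, "| exact p-value:", res$p.value, "\n")
```

Note that jitter resolves ties in a random order, so the result depends on the seed; this is why `set.seed()` is called first.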
Create table
# Calculate mean
mean_df <- combined_data |>
group_by(Source) |>
summarise(Mean_Percent_Mismatch = mean(Percent_Mismatch, na.rm = TRUE))
# Calculate median
median_df <- combined_data |>
group_by(Source) |>
summarise(Median_Percent_Mismatch = median(Percent_Mismatch, na.rm = TRUE))
# Pairwise Wilcoxon test
result <- pairwise.wilcox.test(
combined_data$Percent_Mismatch,
combined_data$Source,
p.adjust.method = "holm"
)
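# Extract the p-value matrix and tidy it into a data frame.
# Column 2 holds the p-values against "ab"; it is renamed P_Value here,
# so the P_Value column of the table is the "ab" column.
pvalues_df <- as.data.frame(result$p.value) |>
  rownames_to_column("Source") |>
  dplyr::rename(P_Value = 2)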
# pvalues_df <-
# tibble::rownames_to_column(as.data.frame(result$p.value), "Source") %>%
# dplyr::rename(P_Value = 2)
# Merge mean, median and p-values into one table
summary_df <- full_join(mean_df, median_df, by = "Source") |>
full_join(pvalues_df, by = "Source")
# Rename "Source" column
summary_df <- dplyr::rename(summary_df, Comparison = Source)
# Function to format data for mean and median
format_mean_median <- function(x) {
round(x, 2)
}
# Function to format data for p-values
format_pvalue <- function(x) {
formatted_x <- ifelse(abs(x) < 1e-4, formatC(x, format = "e", digits = 4), round(x, 4))
# Append an asterisk for p-values below 0.05
ifelse(x < 0.05, paste0(formatted_x, "*"), formatted_x)
}
# Apply the function to each column as needed
summary_df$Mean_Percent_Mismatch <- format_mean_median(summary_df$Mean_Percent_Mismatch)
summary_df$Median_Percent_Mismatch <- format_mean_median(summary_df$Median_Percent_Mismatch)
# Apply the function to each column as needed
pvalue_cols <- colnames(summary_df)[-(1:3)]
for(col in pvalue_cols){
summary_df[[col]] <- format_pvalue(summary_df[[col]])
}
# Create the flextable
ft <- flextable::flextable(summary_df)
# Apply zebra theme
ft <- flextable::theme_zebra(ft)
# Add a caption to the table
ft <- flextable::add_header_lines(ft, "Table 1: Mean and Median Percent Mismatch by Comparison. The P-values are from a pairwise Wilcoxon test with Holm adjustment for multiple comparisons. An asterisk (*) next to a P-value indicates a statistically significant difference (P < 0.05).")
# Save it to a Word document
officer::read_docx() |>
body_add_flextable(ft) |>
print(target = here::here("output", "wgs_vs_chip", "figures", "summary_table.docx"))
ft
Table 1: Mean and Median Percent Mismatch by Comparison. The P-values are from a pairwise Wilcoxon test with Holm adjustment for multiple comparisons. An asterisk (*) next to a P-value indicates a statistically significant difference (P < 0.05).
Comparison | Mean_Percent_Mismatch | Median_Percent_Mismatch | P_Value | ac | bc | xy | wy | wx | ay | bx
---|---|---|---|---|---|---|---|---|---|---
ab | 1.16 | 1.19 | | | | | | | |
ac | 1.29 | 1.34 | 0.3841 | | | | | | |
bc | 0.41 | 0.43 | 1.0098e-05* | 1.0098e-05* | | | | | |
xy | 1.35 | 1.27 | 0.6937 | 1 | 1.0098e-05* | | | | |
wy | 3.09 | 2.67 | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 1.1608e-05* | | | |
wx | 2.99 | 2.58 | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 1.1608e-05* | 1 | | |
ay | 8.11 | 8.71 | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | |
bx | 7.14 | 7.58 | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 2.0980e-06* | 7.6958e-07* | 0.3841 |
cw | 6.70 | 7.10 | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 1.0098e-05* | 2.7127e-06* | 2.0980e-06* | 0.0760 | 1
In this table the P_Value column holds the p-values against ab, because the code renames the second column of the p-value matrix (the ab column) to P_Value.
We can look at each population or across all samples. The code below assumes you have all the data loaded.
The “ac” comparison contrasts the genotype calls for the same 18 samples when they are genotyped alone and when they are genotyped together with around 500 samples.
To make the comparison, we extracted the 18 samples from the full data set and compared them with the 18 samples genotyped alone.
How many SNPs have genotype discrepancies in 1 or more of the 18 samples?
# Discrepancies in 1 or more samples
# How many SNPs we tested
tested_snps <- length(unique(data_ac_dt$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90252
# How many SNPs have a REF mismatch
failed_snpsR <-
  length(unique(data_ac_dt[data_ac_dt$REF_mismatch_count >= 1, ]$SNP_id))
cat("REF mismatch in at least 1 sample:", failed_snpsR, "\n")
## REF mismatch in at least 1 sample: 7329
# How many SNPs have an ALT mismatch
failed_snpsA <-
  length(unique(data_ac_dt[data_ac_dt$ALT_mismatch_count >= 1, ]$SNP_id))
cat("ALT mismatch in at least 1 sample:", failed_snpsA, "\n")
## ALT mismatch in at least 1 sample: 3782
# How many SNPs have a zygosity mismatch
failed_snps <-
  length(unique(data_ac_dt[data_ac_dt$Zigo_mismatch_count >= 1, ]$SNP_id))
cat("Zygosity mismatch in at least 1 sample:", failed_snps, "\n")
## Zygosity mismatch in at least 1 sample: 10545
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
cat("Percentage of failed SNPs in 1 or more samples:", percentage_failed, "%\n")
## Percentage of failed SNPs in 1 or more samples: 11.68 %
When we look at the Zygosity of each SNP, we find that 10,545 SNPs (11.68%) have a mismatch in at least 1 sample. However, the previous plot shows that many of these SNPs have a discrepancy in only 1 of the 18 samples.
Check how many SNPs have errors in 2 or more samples
# Discrepancies in 2 or more samples
# How many SNPs we tested
tested_snps <- length(unique(data_ac_dt$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 90252
# How many SNPs failed
failed_snpsR <-
length(
unique(data_ac_dt[data_ac_dt$REF_mismatch_count >= 2,]$SNP_id
)
)
cat("REF mismatch in 2 or more samples:", failed_snpsR, "\n")
## REF mismatch in 2 or more samples: 3002
# How many SNPs failed
failed_snpsA <-
length(
unique(data_ac_dt[data_ac_dt$ALT_mismatch_count >= 2,]$SNP_id
)
)
cat("ALT mismatch in 2 or more samples:", failed_snpsA, "\n")
## ALT mismatch in 2 or more samples: 1417
# How many SNPs failed
failed_snps <-
length(
unique(data_ac_dt[data_ac_dt$Zigo_mismatch_count >= 2,]$SNP_id
)
)
cat("Zygosity mismatch in 2 or more samples:", failed_snps, "\n")
## Zygosity mismatch in 2 or more samples: 4396
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
cat("Percentage of failed SNPs in 2 or more samples:", percentage_failed, "%\n")
## Percentage of failed SNPs in 2 or more samples: 4.87 %
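The two thresholds above can be combined to see how many SNPs have a zygosity mismatch in exactly 1 sample. The short calculation below only reuses the counts printed in the outputs above; the variable names are ours.

```r
# Counts taken from the outputs above ("ac" comparison)
tested_snps <- 90252
failed_ge1  <- 10545  # zygosity mismatch in at least 1 sample
failed_ge2  <- 4396   # zygosity mismatch in 2 or more samples

# SNPs with a mismatch in exactly 1 sample
failed_exactly1 <- failed_ge1 - failed_ge2

# As a percentage of all tested SNPs and of all SNPs with any mismatch
perc_of_tested <- round(failed_exactly1 / tested_snps * 100, 2)
perc_of_failed <- round(failed_exactly1 / failed_ge1 * 100, 2)

cat("Exactly 1 sample:", failed_exactly1, "\n")
cat("Percent of tested SNPs:", perc_of_tested, "%\n")
cat("Percent of SNPs with any mismatch:", perc_of_failed, "%\n")
```

So 6,149 SNPs, a little over half of the SNPs with any zygosity mismatch, disagree in a single sample only.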
We can check how many times a SNP has mismatching Zygosity or alleles across the 18 samples.
# Number of samples you want to iterate over
num_samples <- 18
# Create an empty data frame to store results
results2 <- data.frame()
# How many SNPs we tested
tested_snps <- length(unique(data_ac_dt$SNP_id))
for(i in 1:num_samples){
# How many SNPs failed REF
failed_snpsR <- length(unique(data_ac_dt[data_ac_dt$REF_mismatch_count >= i,]$SNP_id))
# How many SNPs failed ALT
failed_snpsA <- length(unique(data_ac_dt[data_ac_dt$ALT_mismatch_count >= i,]$SNP_id))
# How many SNPs failed zygosity
failed_snpsZ <- length(unique(data_ac_dt[data_ac_dt$Zigo_mismatch_count >= i,]$SNP_id))
# Calculate percentage
percentage_failed <- round(failed_snpsZ / tested_snps * 100, 2)
# Create a data frame with results for this number of samples
temp_results <- data.frame(
'Samples' = i,
'SNPs' = tested_snps,
'Mismatch_REF' = failed_snpsR,
'Mismatch_ALT' = failed_snpsA,
'Mismatch_Zygosity' = failed_snpsZ,
'Mismatch_Zygosity_perc' = percentage_failed
)
# Append the results to the main results data frame
results2 <- rbind(results2, temp_results)
}
# Create the flextable
ft <- flextable(results2)
# Apply zebra theme
ft <- theme_zebra(ft)
# Add a caption to the table
ft <- add_header_lines(ft, "Table 2: Summary of the SNP mismatch rate for the 18 samples genotyped alone or with 500 samples. ")
# Save it to a Word document
officer::read_docx() |>
body_add_flextable(ft) |>
print(target = here::here("output", "wgs_vs_chip", "figures", "summary_ac.docx"))
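The loop above counts, for each threshold i, the SNPs with a mismatch in at least i samples. A compact way to express the same cumulative tally, shown here on invented counts and not as a replacement for the loop, is:

```r
# Invented per-SNP number of samples with a zygosity mismatch (8 SNPs)
zigo_mismatch_count <- c(0, 0, 1, 1, 2, 3, 0, 5)
thresholds <- 1:5

# Number of SNPs with a mismatch in at least i samples, for each threshold i
n_at_least <- vapply(
  thresholds,
  function(i) sum(zigo_mismatch_count >= i),
  integer(1)
)

# Differences between consecutive thresholds give the SNPs with exactly i mismatches
n_exactly <- n_at_least - c(n_at_least[-1], 0L)

results_toy <- data.frame(
  Samples    = thresholds,
  SNPs       = length(zigo_mismatch_count),
  At_least   = n_at_least,
  Exactly    = n_exactly,
  Perc_least = round(n_at_least / length(zigo_mismatch_count) * 100, 2)
)
print(results_toy)
```

Building the whole table in one step avoids growing a data frame with `rbind()` inside the loop, which matters little for 18 thresholds but keeps the code shorter.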
KAT
# Discrepancies in 1 or more samples
# How many SNPs we tested
tested_snps_ac <- length(unique(summary_kat_ac$SNP_id))
cat("Number of SNPs tested:", tested_snps_ac, "\n")
## Number of SNPs tested: 90252
# How many SNPs failed
failed_kat_ac <-
length(unique(summary_kat_ac[summary_kat_ac$REF_mismatch > 0 |
summary_kat_ac$ALT_mismatch > 0 |
summary_kat_ac$Zigo_mismatch > 0, ]$SNP_id))
cat("Number of SNPs failed:", failed_kat_ac, "\n")
## Number of SNPs failed: 3029
# Calculate percentage
percentage_failed_ac <- round(failed_kat_ac / tested_snps_ac * 100, 2)
cat("Percentage of failed SNPs:", percentage_failed_ac, "%\n")
## Percentage of failed SNPs: 3.36 %
# How many SNPs failed in SAI (failed_kat_ac was computed above)
failed_sai_ac <-
  length(unique(summary_sai_ac[summary_sai_ac$REF_mismatch > 0 |
                               summary_sai_ac$ALT_mismatch > 0 |
                               summary_sai_ac$Zigo_mismatch > 0, ]$SNP_id))
cat("Number of SNPs failed (SAI):", failed_sai_ac, "\n")
## Number of SNPs failed (SAI): 8579
# Calculate the percentage for each population
percentage_kat_ac <- round(failed_kat_ac / tested_snps_ac * 100, 2)
cat("Percentage of failed SNPs (KAT):", percentage_kat_ac, "%\n")
## Percentage of failed SNPs (KAT): 3.36 %
percentage_sai_ac <- round(failed_sai_ac / tested_snps_ac * 100, 2)
cat("Percentage of failed SNPs (SAI):", percentage_sai_ac, "%\n")
## Percentage of failed SNPs (SAI): 9.51 %
Summary
# Create an empty data frame to store results
results_ac <- data.frame()
# How many SNPs we tested
tested_snps_ac <- length(unique(summary_kat_ac$SNP_id))
# Datasets and corresponding number of samples
datasets_ac <- list(KAT=list(data=summary_kat_ac, num_samples=6), SAI=list(data=summary_sai_ac, num_samples=12))
for(name in names(datasets_ac)){
data <- datasets_ac[[name]]$data
num_samples <- datasets_ac[[name]]$num_samples
for(i in 1:num_samples){
# How many SNPs failed
failed_snps <- length(unique(data[data$REF_mismatch >= i |
data$ALT_mismatch >= i |
data$Zigo_mismatch >= i, ]$SNP_id))
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps_ac * 100, 2)
# Create a data frame with results for this number of samples
temp_results <- data.frame(
'Data_Set' = name,
'Num_Samples' = i,
'Tested_SNPs' = tested_snps_ac,
'Failed_SNPs' = failed_snps,
'Perc_ac' = percentage_failed
)
# Append the results to the main results data frame
results_ac <- rbind(results_ac, temp_results)
}
}
# Print the results
print(results_ac)
## Data_Set Num_Samples Tested_SNPs Failed_SNPs Perc_ac
## 1 KAT 1 90252 3029 3.36
## 2 KAT 2 90252 1241 1.38
## 3 KAT 3 90252 654 0.72
## 4 KAT 4 90252 363 0.40
## 5 KAT 5 90252 165 0.18
## 6 KAT 6 90252 95 0.11
## 7 SAI 1 90252 8579 9.51
## 8 SAI 2 90252 3228 3.58
## 9 SAI 3 90252 1541 1.71
## 10 SAI 4 90252 793 0.88
## 11 SAI 5 90252 411 0.46
## 12 SAI 6 90252 236 0.26
## 13 SAI 7 90252 130 0.14
## 14 SAI 8 90252 89 0.10
## 15 SAI 9 90252 58 0.06
## 16 SAI 10 90252 47 0.05
## 17 SAI 11 90252 38 0.04
## 18 SAI 12 90252 21 0.02
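The thresholding used in the loop above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function name `cumulative_failures` and the toy SNP ids and counts are assumptions made for the example. A SNP is counted as failed at threshold i when any of its three mismatch counts (REF, ALT, or zygosity) is at least i, which is the same as testing the largest of the three.

```python
def cumulative_failures(mismatch_counts, n_samples):
    """Return (threshold, failed SNPs, percentage) for thresholds 1..n_samples.

    mismatch_counts maps a SNP id to a (ref, alt, zygosity) tuple of mismatch
    counts, mirroring the REF_mismatch, ALT_mismatch and Zigo_mismatch columns.
    """
    tested = len(mismatch_counts)
    rows = []
    for i in range(1, n_samples + 1):
        # A SNP fails at threshold i if any of its three counts is >= i
        failed = sum(1 for counts in mismatch_counts.values() if max(counts) >= i)
        rows.append((i, failed, round(failed / tested * 100, 2)))
    return rows


# Toy data: hypothetical SNP ids and counts, not taken from the analysis
toy = {
    "snp_1": (0, 0, 0),
    "snp_2": (1, 0, 0),
    "snp_3": (0, 2, 3),
    "snp_4": (0, 0, 1),
}
table = cumulative_failures(toy, n_samples=3)
for threshold, failed, pct in table:
    print(threshold, failed, pct)
```

Because the condition is "at least i samples", each row is a subset of the row above it, which is why the percentages in the table can only decrease as the number of samples grows.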
We can get the percentage of failing SNPs for all nine comparisons we made: c("ab", "ac", "bc", "xy", "wy", "wx", "ay", "bx", "cw")
# Your data set identifiers
datasets_identifiers <- c("ab", "ac", "bc", "xy", "wy", "wx", "ay", "bx", "cw")
# Define all possible 'Num_Samples'
all_samples <- 1:18
# Initialize an empty list to hold results data frames for each data set
results_list <- list()
# Iterate over the data set identifiers
for(ds_id in datasets_identifiers){
# Generate the variable name for this data set
summary_var_name <- paste0("summary_", ds_id)
# Retrieve the data frame
summary_data <- get(summary_var_name)
# Create an empty data frame to store results with all possible 'Num_Samples'
results <- data.frame(Num_Samples = all_samples)
for(i in 1:18){
# How many SNPs we tested
tested_snps <- length(unique(summary_data$SNP_id))
# How many SNPs failed
failed_snps <- length(unique(summary_data[summary_data$REF_mismatch >= i |
summary_data$ALT_mismatch >= i |
summary_data$Zigo_mismatch >= i, ]$SNP_id))
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
# Assign the results to the corresponding row
results[i, paste0('Perc_', ds_id)] <- percentage_failed
}
# Add the results data frame to the list
results_list[[ds_id]] <- results
}
# Initialize the final merged results data frame with just 'Num_Samples' and the first percentage column.
merged_results <- results_list[[datasets_identifiers[1]]]
# Merge all other results data frames into the final results data frame
for(ds_id in datasets_identifiers[-1]){
merged_results <- merge(merged_results, results_list[[ds_id]], by = "Num_Samples", all = TRUE)
}
# Rename 'Num_Samples' to 'n_sample_fail'
names(merged_results)[names(merged_results) == "Num_Samples"] <- "n_sample_fail"
# Remove 'Perc_' from other column names
names(merged_results)[-1] <- sub("Perc_", "", names(merged_results)[-1])
# Create the flextable
ft <- flextable::flextable(merged_results)
# Apply zebra theme
ft <- flextable::theme_zebra(ft)
# Add a caption to the table
ft <- flextable::add_header_lines(ft, "Table 3: SNP mismatch percentage for Zygosity across all data set comparisons. ")
# Save it to a Word document
officer::read_docx() |>
body_add_flextable(ft) |>
print(target = here::here("output", "wgs_vs_chip", "figures", "summary_all_data_sets.docx"))
ft
Table 3: SNP mismatch percentage for Zygosity across all data set comparisons.
n_sample_fail | ab | ac | bc | xy | wy | wx | ay | bx | cw |
---|---|---|---|---|---|---|---|---|---|
1 | 10.28 | 11.69 | 3.93 | 5.65 | 13.74 | 13.31 | 58.58 | 53.45 | 50.67 |
2 | 4.35 | 4.88 | 1.41 | 3.99 | 9.63 | 9.16 | 37.66 | 33.29 | 30.89 |
3 | 2.27 | 2.54 | 0.70 | 3.21 | 7.64 | 7.22 | 24.67 | 21.47 | 19.66 |
4 | 1.27 | 1.42 | 0.38 | 2.64 | 6.24 | 5.84 | 16.20 | 14.05 | 12.80 |
5 | 0.75 | 0.83 | 0.23 | 2.22 | 5.14 | 4.79 | 10.84 | 9.38 | 8.57 |
6 | 0.47 | 0.52 | 0.14 | 1.84 | 4.20 | 3.92 | 7.40 | 6.43 | 5.98 |
7 | 0.28 | 0.30 | 0.08 | 1.51 | 3.39 | 3.19 | 5.01 | 4.57 | 4.33 |
8 | 0.19 | 0.20 | 0.05 | 1.23 | 2.71 | 2.56 | 3.47 | 3.38 | 3.25 |
9 | 0.13 | 0.14 | 0.04 | 0.96 | 2.07 | 1.99 | 2.48 | 2.63 | 2.65 |
10 | 0.10 | 0.11 | 0.03 | 0.72 | 1.54 | 1.50 | 1.77 | 2.06 | 2.17 |
11 | 0.09 | 0.09 | 0.03 | 0.52 | 1.16 | 1.15 | 1.26 | 1.66 | 1.83 |
12 | 0.07 | 0.07 | 0.02 | 0.33 | 0.83 | 0.85 | 0.93 | 1.34 | 1.54 |
13 | 0.06 | 0.05 | 0.02 | 0.21 | 0.57 | 0.63 | 0.69 | 1.10 | 1.28 |
14 | 0.05 | 0.04 | 0.02 | 0.12 | 0.37 | 0.43 | 0.52 | 0.87 | 1.05 |
15 | 0.04 | 0.03 | 0.02 | 0.06 | 0.20 | 0.27 | 0.36 | 0.68 | 0.87 |
16 | 0.03 | 0.03 | 0.01 | 0.03 | 0.09 | 0.15 | 0.25 | 0.51 | 0.66 |
17 | 0.03 | 0.02 | 0.01 | 0.00 | 0.00 | 0.06 | 0.15 | 0.33 | 0.44 |
18 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.07 | 0.17 | 0.24 |
Create a plot
Theme for plotting
# import plotting theme
source(
here(
"scripts",
"analysis",
"my_theme2.R" # choose my_theme.R (Roboto Condensed) or my_theme2.R (default font)
)
)
Plot
# Convert the data frame from wide to long format
long_results <- merged_results %>%
pivot_longer(
cols = -n_sample_fail,
names_to = "Data_Set",
values_to = "Percentage"
)
# Specify the order of the fill factor
long_results$Data_Set <-
factor(long_results$Data_Set,
levels = c("ab", "ac", "bc", "xy", "wy", "wx", "ay", "bx", "cw"))
# Define color blind friendly palette
color_blind_friendly <- c(
"Chip (ab)" = "#E69F00",
"Chip (ac)" = "#56B4E9",
"Chip (bc)" = "#009E73",
"WGS (xy)" = "#F0E442",
"WGS (wy)" = "#0072B2",
"WGS (wx)" = "#D55E00",
"WGS_Chip (ay)" = "#CC79A7",
"WGS_Chip (bx)" = "#999999",
"WGS_Chip (cw)" = "#000000"
)
# Create a named vector to recode Data_Set column
recode_vector <- c(
"ab" = "Chip (ab)",
"ac" = "Chip (ac)",
"bc" = "Chip (bc)",
"xy" = "WGS (xy)",
"wy" = "WGS (wy)",
"wx" = "WGS (wx)",
"ay" = "WGS_Chip (ay)",
"bx" = "WGS_Chip (bx)",
"cw" = "WGS_Chip (cw)"
)
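# Recode the Data_Set column by name (not by factor code) and keep the intended legend order
long_results$Data_Set <- factor(
recode_vector[as.character(long_results$Data_Set)],
levels = unname(recode_vector)
)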
# Create the bar plot with new legend labels
ggplot(long_results,
aes(x = n_sample_fail, y = Percentage, fill = Data_Set)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
x = "Samples (n)",
y = "SNPs with mismatches (%)",
fill = "Comparison",
title = "Cumulative mismatches by number of samples for each SNP",
caption = "Number of samples per genotype call: \nChip:\n'ab' - 18 versus 95 samples\n'ac' - 18 versus 500 samples\n'bc' - 95 versus 500 samples\n\nWGS:\n'xy' - 18 versus 30 samples\n'wy' - 18 versus 800 samples\n'wx' - Genotyping calls with 30 versus 800 samples\n\nChip x WGS:\n'ay' - both 18 samples\n'bx' - WGS 30 samples and chip 95 samples\n'cw' - WGS 800 samples and chip 500 samples"
) +
scale_fill_manual(
values = color_blind_friendly,
labels = c(
"Chip (ab)" = "Chip (ab)",
"Chip (ac)" = "Chip (ac)",
"Chip (bc)" = "Chip (bc)",
"WGS (xy)" = "WGS (xy)",
"WGS (wy)" = "WGS (wy)",
"WGS (wx)" = "WGS (wx)",
"WGS_Chip (ay)" = "WGS_Chip (ay)",
"WGS_Chip (bx)" = "WGS_Chip (bx)",
"WGS_Chip (cw)" = "WGS_Chip (cw)"
)
) +
coord_flip() +
my_theme() +
scale_x_continuous(breaks = seq(0, 18, 1)) +
theme(
legend.position = "top",
plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
)
) # This changes the caption's size, color, and makes it italic.
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"percentage_all_samples.pdf"
),
height = 10,
width = 8,
dpi = 300
)
Per population
# Your data set identifiers
datasets_identifiers <-
c("ab", "ac", "bc", "xy", "wy", "wx", "ay", "bx", "cw")
# Initialize an empty list to hold results data frames for each data set
results_list <- list()
# Iterate over the data set identifiers
for (ds_id in datasets_identifiers) {
# Generate the variable names for this data set
kat_var_name <- paste0("summary_kat_", ds_id)
sai_var_name <- paste0("summary_sai_", ds_id)
# Retrieve the data frames
summary_kat <- get(kat_var_name)
summary_sai <- get(sai_var_name)
# Datasets and corresponding number of samples
datasets <-
list(
KAT = list(data = summary_kat, num_samples = 6),
SAI = list(data = summary_sai, num_samples = 12)
)
# Create an empty data frame to store results
results <- data.frame()
for (name in names(datasets)) {
data <- datasets[[name]]$data
num_samples <- datasets[[name]]$num_samples
for (i in 1:num_samples) {
# How many SNPs we tested
tested_snps <- length(unique(data$SNP_id))
# How many SNPs failed
failed_snps <- length(unique(data[data$REF_mismatch >= i |
data$ALT_mismatch >= i |
data$Zigo_mismatch >= i,]$SNP_id))
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
# Create a data frame with results for this number of samples
temp_results <- data.frame(
'Data_Set' = name,
'Num_Samples' = i,
'Tested_SNPs' = tested_snps,
'Failed_SNPs' = failed_snps,
'Percentage' = percentage_failed
)
# Assign the appropriate column name
colnames(temp_results)[which(colnames(temp_results) == "Percentage")] <-
paste0('Perc_', ds_id)
# Append the results to the main results data frame
results <- rbind(results, temp_results)
}
}
# Add the results data frame to the list
results_list[[ds_id]] <- results
}
# Initialize the final merged results data frame with just 'Num_Samples', 'Data_Set' and percentage column.
merged_results <-
results_list[[datasets_identifiers[1]]][, c("Data_Set",
"Num_Samples",
paste0('Perc_', datasets_identifiers[1]))]
# Merge all other results data frames into the final results data frame
for (ds_id in datasets_identifiers[-1]) {
# Select only 'Num_Samples', 'Data_Set' and 'Perc_*' column for merging.
merge_data <-
results_list[[ds_id]][, c("Data_Set", "Num_Samples", paste0('Perc_', ds_id))]
merged_results <-
merge(
merged_results,
merge_data,
by = c("Data_Set", "Num_Samples"),
all = TRUE
)
}
# Select only the 'Data_Set', 'Num_Samples' and 'Perc_*' columns
perc_columns <- grep("^Perc_", names(merged_results), value = TRUE)
selected_columns <- c("Data_Set", "Num_Samples", perc_columns)
# Subset the merged results
subset_results <- merged_results[, selected_columns]
# Group by 'Data_Set' and 'Num_Samples' and calculate the mean for each 'Num_Samples' across all 'Perc_*' columns, ignoring NAs
summary_results <- subset_results |>
group_by(Data_Set, Num_Samples) |>
summarise(across(starts_with("Perc_"), \(x) mean(x, na.rm = TRUE)), .groups = "drop")
# Remove "Perc_" from column names
names(summary_results) <- sub("Perc_", "", names(summary_results))
# Create the flextable
ft <- flextable::flextable(summary_results)
# Apply zebra theme
ft <- flextable::theme_zebra(ft)
# Add a caption to the table
ft <-
flextable::add_header_lines(
ft,
"Table 4: Summary of the SNP mismatch percentage for Zygosity across each population in all data sets. "
)
# Save it to a Word document
officer::read_docx() |>
body_add_flextable(ft) |>
print(
target = here::here(
"output",
"wgs_vs_chip",
"figures",
"summary_all_data_sets_per_pop.docx"
)
)
ft
Table 4: Summary of the SNP mismatch percentage for Zygosity across each population in all data sets.
Data_Set | Num_Samples | ab | ac | bc | xy | wy | wx | ay | bx | cw |
---|---|---|---|---|---|---|---|---|---|---|
KAT | 1 | 3.06 | 3.36 | 1.02 | 2.99 | 6.35 | 6.07 | 18.54 | 16.56 | 16.14 |
KAT | 2 | 1.24 | 1.38 | 0.37 | 1.95 | 4.01 | 3.73 | 9.93 | 8.89 | 8.61 |
KAT | 3 | 0.65 | 0.72 | 0.19 | 1.32 | 2.66 | 2.46 | 6.07 | 5.51 | 5.39 |
KAT | 4 | 0.36 | 0.40 | 0.09 | 0.85 | 1.70 | 1.59 | 3.54 | 3.27 | 3.32 |
KAT | 5 | 0.17 | 0.18 | 0.05 | 0.49 | 1.02 | 0.95 | 1.76 | 1.85 | 1.91 |
KAT | 6 | 0.10 | 0.11 | 0.03 | 0.21 | 0.50 | 0.50 | 1.00 | 1.16 | 1.25 |
SAI | 1 | 8.32 | 9.51 | 3.19 | 5.19 | 12.91 | 12.46 | 53.31 | 48.64 | 45.80 |
SAI | 2 | 3.19 | 3.58 | 1.06 | 3.43 | 8.56 | 8.16 | 31.82 | 28.23 | 25.84 |
SAI | 3 | 1.53 | 1.71 | 0.48 | 2.60 | 6.38 | 6.06 | 19.09 | 16.78 | 15.10 |
SAI | 4 | 0.78 | 0.88 | 0.26 | 1.98 | 4.81 | 4.56 | 11.39 | 10.04 | 9.06 |
SAI | 5 | 0.41 | 0.46 | 0.14 | 1.47 | 3.60 | 3.42 | 6.81 | 6.17 | 5.64 |
SAI | 6 | 0.24 | 0.26 | 0.08 | 1.06 | 2.64 | 2.53 | 4.12 | 3.87 | 3.72 |
SAI | 7 | 0.14 | 0.14 | 0.05 | 0.74 | 1.85 | 1.80 | 2.49 | 2.55 | 2.57 |
SAI | 8 | 0.10 | 0.10 | 0.03 | 0.48 | 1.25 | 1.24 | 1.50 | 1.73 | 1.84 |
SAI | 9 | 0.07 | 0.06 | 0.03 | 0.28 | 0.76 | 0.80 | 0.92 | 1.22 | 1.40 |
SAI | 10 | 0.06 | 0.05 | 0.02 | 0.14 | 0.38 | 0.44 | 0.57 | 0.90 | 1.05 |
SAI | 11 | 0.05 | 0.04 | 0.02 | 0.06 | 0.14 | 0.21 | 0.31 | 0.56 | 0.71 |
SAI | 12 | 0.03 | 0.02 | 0.01 | 0.01 | 0.02 | 0.05 | 0.13 | 0.29 | 0.40 |
Per population plot
# Convert the data frame to long format
long_results <- summary_results %>%
pivot_longer(
cols = -c(Data_Set, Num_Samples),
names_to = "Comparison",
values_to = "Percentage"
) |>
mutate(Data_Set = factor(Data_Set))
# Specify the order of the fill factor
long_results$Comparison <- factor(long_results$Comparison,
levels = c("ab", "ac", "bc", "xy", "wy", "wx", "ay", "bx", "cw"))
# Define color blind friendly palette
color_blind_friendly <- c(
"Chip (ab)" = "#E69F00",
"Chip (ac)" = "#56B4E9",
"Chip (bc)" = "#009E73",
"WGS (xy)" = "#F0E442",
"WGS (wy)" = "#0072B2",
"WGS (wx)" = "#D55E00",
"WGS_Chip (ay)" = "#CC79A7",
"WGS_Chip (bx)" = "#999999",
"WGS_Chip (cw)" = "#000000"
)
# Create a named vector to recode the Comparison column
recode_vector <- c(
"ab" = "Chip (ab)",
"ac" = "Chip (ac)",
"bc" = "Chip (bc)",
"xy" = "WGS (xy)",
"wy" = "WGS (wy)",
"wx" = "WGS (wx)",
"ay" = "WGS_Chip (ay)",
"bx" = "WGS_Chip (bx)",
"cw" = "WGS_Chip (cw)"
)
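# Recode the Comparison column by name (not by factor code) and keep the intended legend order
long_results$Comparison <- factor(
recode_vector[as.character(long_results$Comparison)],
levels = unname(recode_vector)
)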
# Create the bar plot with facets
ggplot(long_results,
aes(x = Num_Samples, y = Percentage)) +
geom_bar(aes(fill = Comparison), stat = "identity", position = "dodge") +
facet_wrap( ~ Data_Set, ncol = 1, scales = "free_y") +
labs(
x = "Samples (n)",
y = "SNPs with mismatches (%)",
fill = "Comparison",
title = "Number of samples in which SNPs have zygosity mismatches",
caption = "Number of samples per genotype call: \nChip:\n'ab' - 18 versus 95 samples\n'ac' - 18 versus 500 samples\n'bc' - 95 versus 500 samples\n\nWGS:\n'xy' - 18 versus 30 samples\n'wy' - 18 versus 800 samples\n'wx' - Genotyping calls with 30 versus 800 samples\n\nChip x WGS:\n'ay' - both 18 samples\n'bx' - WGS 30 samples and chip 95 samples\n'cw' - WGS 800 samples and chip 500 samples"
) +
scale_fill_manual(
values = color_blind_friendly,
labels = c(
"Chip (ab)" = "Chip (ab)",
"Chip (ac)" = "Chip (ac)",
"Chip (bc)" = "Chip (bc)",
"WGS (xy)" = "WGS (xy)",
"WGS (wy)" = "WGS (wy)",
"WGS (wx)" = "WGS (wx)",
"WGS_Chip (ay)" = "WGS_Chip (ay)",
"WGS_Chip (bx)" = "WGS_Chip (bx)",
"WGS_Chip (cw)" = "WGS_Chip (cw)"
)
) +
coord_flip() +
my_theme() +
scale_x_continuous(breaks = seq(0, 12, 1)) +
theme(
legend.position = "top",
plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
)
)
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"percentage_per_pop.pdf"
),
height = 10,
width = 8,
dpi = 300
)
We can also create a plot with the pairwise sample mismatch rate across all 18 samples and comparisons
# Your data set identifiers
datasets_identifiers <- c("ab", "ac", "bc", "xy", "wy", "wx", "ay", "bx", "cw")
# Initialize the final merged results data frame with the first dataset
merged_results <- get(paste0("Zygosity_", datasets_identifiers[1]))[, .(Population, Sample, Percent_Mismatch)]
setnames(merged_results, "Percent_Mismatch", datasets_identifiers[1])
# Merge all other results data frames into the final results data frame
for (ds_id in datasets_identifiers[-1]) {
# Retrieve the dataset
Zygosity_data <- get(paste0("Zygosity_", ds_id))[, .(Population, Sample, Percent_Mismatch)]
# Rename the Percent_Mismatch column to the dataset identifier
setnames(Zygosity_data, "Percent_Mismatch", ds_id)
# Merge with the final results data frame
merged_results <- merge(merged_results, Zygosity_data, by = c("Population", "Sample"), all = TRUE)
}
# Create the flextable
ft <- flextable::flextable(merged_results)
# Apply zebra theme
ft <- flextable::theme_zebra(ft)
# Add a caption to the table
ft <-
flextable::add_header_lines(
ft,
"Table 5: Summary of the SNP mismatch percentage for Zygosity for pairwise comparisons "
)
# Save it to a Word document
officer::read_docx() |>
body_add_flextable(ft) |>
print(
target = here::here(
"output",
"wgs_vs_chip",
"figures",
"summary_all_data_sets_pairwise.docx"
)
)
ft
Table 5: Summary of the SNP mismatch percentage for Zygosity for pairwise comparisons
Population | Sample | ab | ac | bc | xy | wy | wx | ay | bx | cw |
---|---|---|---|---|---|---|---|---|---|---|
KAT | 7 | 0.91 | 1.00 | 0.30 | 1.28 | 2.61 | 2.51 | 5.51 | 4.72 | 4.60 |
KAT | 8 | 1.01 | 1.09 | 0.31 | 1.27 | 2.66 | 2.52 | 5.96 | 5.13 | 4.94 |
KAT | 9 | 0.87 | 0.95 | 0.26 | 1.09 | 2.15 | 2.03 | 5.07 | 4.27 | 4.19 |
KAT | 10 | 0.88 | 0.99 | 0.31 | 1.31 | 2.74 | 2.63 | 5.52 | 4.78 | 4.63 |
KAT | 11 | 1.00 | 1.12 | 0.32 | 1.19 | 2.48 | 2.39 | 6.30 | 5.43 | 5.25 |
KAT | 12 | 1.01 | 1.12 | 0.31 | 1.31 | 2.68 | 2.54 | 5.32 | 4.36 | 4.17 |
SAI | 12 | 1.23 | 1.36 | 0.42 | 0.87 | 2.06 | 1.99 | 8.43 | 7.31 | 6.86 |
SAI | 1 | 1.41 | 1.55 | 0.49 | 0.97 | 2.30 | 2.27 | 9.03 | 7.83 | 7.34 |
SAI | 2 | 1.29 | 1.50 | 0.49 | 1.08 | 2.65 | 2.61 | 9.00 | 7.91 | 7.39 |
SAI | 3 | 1.15 | 1.34 | 0.44 | 1.83 | 4.42 | 4.31 | 9.74 | 8.79 | 8.12 |
SAI | 4 | 1.42 | 1.59 | 0.50 | 1.94 | 4.62 | 4.51 | 10.37 | 9.43 | 8.69 |
SAI | 5 | 1.17 | 1.30 | 0.41 | 2.30 | 5.56 | 5.37 | 10.90 | 10.00 | 9.20 |
SAI | 13 | 1.23 | 1.38 | 0.47 | 1.57 | 3.70 | 3.60 | 9.48 | 8.46 | 7.93 |
SAI | 14 | 1.43 | 1.56 | 0.50 | 1.20 | 2.90 | 2.85 | 9.62 | 8.43 | 7.80 |
SAI | 15 | 1.20 | 1.32 | 0.44 | 0.58 | 1.42 | 1.38 | 8.31 | 7.32 | 6.83 |
SAI | 16 | 1.21 | 1.42 | 0.45 | 1.07 | 2.44 | 2.40 | 8.24 | 7.12 | 6.65 |
SAI | 17 | 1.19 | 1.34 | 0.42 | 1.86 | 4.46 | 4.33 | 9.64 | 8.69 | 8.10 |
SAI | 18 | 1.23 | 1.38 | 0.46 | 1.54 | 3.77 | 3.65 | 9.48 | 8.51 | 7.86 |
Create a plot
# Convert the data frame to long format
long_results <- merged_results |>
pivot_longer(
cols = -c(Population, Sample),
names_to = "Comparison",
values_to = "Percentage"
) |>
mutate(Population = factor(Population))
# Specify the order of the fill factor
long_results$Comparison <- factor(long_results$Comparison,
levels = datasets_identifiers)
# Define color blind friendly palette
color_blind_friendly <- c(
"ab" = "#E69F00",
"ac" = "#56B4E9",
"bc" = "#009E73",
"xy" = "#F0E442",
"wy" = "#0072B2",
"wx" = "#D55E00",
"ay" = "#CC79A7",
"bx" = "#999999",
"cw" = "#000000"
)
# Create the bar plot with facets
ggplot(long_results,
aes(x = Sample, y = Percentage)) +
geom_bar(aes(fill = Comparison), stat = "identity", position = "dodge") +
facet_wrap(~Population, ncol = 1, scales = "free_y") +
labs(
x = "Sample",
y = "SNPs with mismatches (%)",
fill = "Comparison",
title = "Percentage of SNPs with zygosity mismatches in pairwise comparisons",
caption = "Number of samples per genotype call: \nChip:\n'ab' - 18 versus 95 samples\n'ac' - 18 versus 500 samples\n'bc' - 95 versus 500 samples\n\nWGS:\n'xy' - 18 versus 30 samples\n'wy' - 18 versus 800 samples\n'wx' - Genotyping calls with 30 versus 800 samples\n\nChip x WGS:\n'ay' - both 18 samples\n'bx' - WGS 30 samples and chip 95 samples\n'cw' - WGS 800 samples and chip 500 samples"
) +
my_theme() +
scale_fill_manual(
values = color_blind_friendly,
labels = datasets_identifiers
) +
coord_flip() +
theme(
legend.position = "top",
plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
)
)
We can count the reads supporting each allele in each CRAM file, at all 175k sites, for every sample.
Changed strategy: count the number of A, T, C, and G bases at each SNP position.
#!/bin/sh
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=luciano.cosme@yale.edu
#SBATCH --array=1-30
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10gb
#SBATCH --time=100:00:00
#SBATCH --job-name=base_count
#SBATCH -o base_count%A_%a.o.txt
#SBATCH -e base_count%A_%a.ERROR.txt
cd /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls
module load SAMtools/1.16-GCCcore-10.2.0
# File containing the paths to the CRAM files
file_list="/ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/crams_30.txt"
# Get the file path for this array task
file_path=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$file_list")
# Reference genome
reference="/gpfs/ycga/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta"
# Sites file
sites_file="/ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/wgs_sites.txt"
# Extract the file name from the file path
file_name=$(basename "$file_path" .cram)
# Call samtools mpileup on the entire sites file
samtools mpileup -q 20 -Q 20 -f "$reference" -l "$sites_file" "$file_path" > "pileup_${file_name}.txt"
Merge the output files
#!/bin/sh
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=luciano.cosme@yale.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=02:00:00
#SBATCH --job-name=merge_pileup
#SBATCH -o merge_pileup%j.o.txt
#SBATCH -e merge_pileup%j.ERROR.txt
cd /ycga-gpfs/project/caccone/lvc26/wgs_chip_calls
# File containing the paths to the CRAM files
file_list="/ycga-gpfs/project/caccone/lvc26/wgs_chip_calls/crams_30.txt"
# Total number of chunks
num_chunks=50
# For each CRAM file
for i in $(seq 1 30); do
# Get the file path for this CRAM file
file_path=$(sed -n "$i"p "$file_list")
# Extract the file name from the file path
file_name=$(basename "$file_path" .cram)
# Concatenate the chunk outputs and delete them
for j in $(seq -f "%02g" 0 $((num_chunks-1))); do
cat "pileup_${file_name}_$j.txt" >> "pileup_${file_name}.txt"
rm "pileup_${file_name}_$j.txt"
done
done
The pileup format example:
chr2.196 14755 A 20 ,...,..,....,,,,..,, FkFFFkkFFFFFFFFFFFkF
Explanation:
chr2.196: the name of the chromosome or scaffold.
14755: the position on the chromosome or scaffold.
A: the reference base at this position.
20: the depth of coverage at this position, that is, the number of reads that cover it.
,...,..,....,,,,..,,: the base that each read carries at this position, one character per read. A "." is a match to the reference base on the forward strand and a "," is a match on the reverse strand. Here all 20 reads match the reference base "A".
FkFFFkkFFFFFFFFFFFkF: the base quality of each read at this position, as ASCII-encoded Phred scores. The higher the score, the lower the probability that the base was called incorrectly.
In summary, at position 14755 on chr2.196 the reference base is "A", and all 20 reads covering the position match it. The reads come from both the forward and reverse strands, and the quality string indicates high base-calling accuracy for these bases.
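The base string can contain more than "." and ",". In the samtools pileup format, a read start is written as "^" followed by a mapping-quality character, a read end as "$", and insertions or deletions as "+" or "-" followed by a length and that many bases. Counting letters directly in the raw string can therefore pick up characters that are not base calls at the site. The sketch below strips those annotations before counting. It is a minimal illustration under that reading of the format; the helper names `clean_pileup` and `count_bases` are assumptions made for this example and are not part of the analysis scripts.

```python
import re


def clean_pileup(bases):
    """Return the base string with read markers and indel annotations removed."""
    # Drop read-start markers: '^' plus the mapping-quality character after it
    bases = re.sub(r"\^.", "", bases)
    # Drop read-end markers
    bases = bases.replace("$", "")
    out = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch in "+-":
            # Indel annotation: sign, decimal length, then that many bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j]) if j > i + 1 else 0
            i = j + length
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def count_bases(bases, ref):
    """Count A, C, G and T calls, treating '.' and ',' as the reference base."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for ch in clean_pileup(bases):
        base = ref.upper() if ch in ".," else ch.upper()
        if base in counts:
            counts[base] += 1
    return counts


# The example line from the text: 20 reads, all matching the reference "A"
print(count_bases(",...,..,....,,,,..,,", "A"))

# A made-up string with a read start (^A), an insertion (+2at) and a read end ($)
print(clean_pileup("^A.,+2at,$G"))
print(count_bases("^A.,+2at,$G", "C"))
```

In the made-up string, a direct letter count would report two A and one T that come from the mapping-quality character and the inserted sequence, while the cleaned string contains only the four base calls.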
Check the files
## KAT_10.txt
## KAT_11.txt
## KAT_12.txt
## KAT_1n.txt
## KAT_2n.txt
## KAT_3n.txt
## KAT_4n.txt
## KAT_5n.txt
## KAT_6n.txt
## KAT_7.txt
## KAT_8.txt
## KAT_9.txt
## SAI_1.txt
## SAI_10n.txt
## SAI_11n.txt
## SAI_12.txt
## SAI_13.txt
## SAI_14.txt
## SAI_15.txt
## SAI_16.txt
## SAI_17.txt
## SAI_18.txt
## SAI_2.txt
## SAI_3.txt
## SAI_4.txt
## SAI_5.txt
## SAI_6n.txt
## SAI_7n.txt
## SAI_8n.txt
## SAI_9n.txt
## merged_data.csv
## processed
We can create a Python script to parse the pileup files.
import pandas as pd
import numpy as np
import glob
import os
def process_pileup_file(filename):
    def phred33ToQ(qual):
        score = ord(qual) - 33
        return min(score, 40)  # Limit the score to a maximum of 40

    # Read the file into a DataFrame
    df = pd.read_csv(filename, sep='\t', header=None, names=['chr', 'pos', 'ref_base', 'site_counts', 'pileup', 'quality'],
                     usecols=range(6), dtype={'site_counts': str}, on_bad_lines='skip')
    # Remove rows with missing 'ref_base' or 'site_counts'
    df = df.dropna(subset=['ref_base', 'site_counts'])
    # Replace NaNs with empty strings in the 'ref_base' and 'pileup' columns
    df['ref_base'] = df['ref_base'].replace(np.nan, '', regex=True).str.upper()
    df['pileup'] = df['pileup'].replace(np.nan, '', regex=True)
    # Convert 'site_counts' to numeric, handle errors by converting them to NaN, then to int
    df['site_counts'] = pd.to_numeric(df['site_counts'], errors='coerce').fillna(0).astype(int)
    # Initialize nucleotide count columns
    df['A'] = 0
    df['T'] = 0
    df['C'] = 0
    df['G'] = 0
    df['ref_allele'] = df['ref_base']
    df['ref_count'] = 0
    df['alt_allele'] = ''
    df['alt_count'] = 0
    # Initialize InDel column
    df['InDel'] = False
    # Calculate counts and identify InDels
    for i, row in df.iterrows():
        # Replace '.' and ',' with reference base
        pileup = row['pileup'].replace('.', row['ref_base']).replace(',', row['ref_base'])
        # Count each nucleotide
        counts = {
            'A': pileup.count('A') + pileup.count('a'),
            'T': pileup.count('T') + pileup.count('t'),
            'C': pileup.count('C') + pileup.count('c'),
            'G': pileup.count('G') + pileup.count('g'),
        }
        # Assign nucleotide counts
        df.at[i, 'A'] = counts['A']
        df.at[i, 'T'] = counts['T']
        df.at[i, 'C'] = counts['C']
        df.at[i, 'G'] = counts['G']
        # Assign reference allele count
        ref_allele = row['ref_base'].upper()
        df.at[i, 'ref_count'] = counts.get(ref_allele, 0)
        # Identify InDels
        if '+' in pileup or '-' in pileup:
            df.at[i, 'InDel'] = True
        # Identify alternative alleles
        if ref_allele in counts:
            del counts[ref_allele]
        if counts:  # If there are any alternative alleles
            alt_allele, alt_count = max(counts.items(), key=lambda x: x[1])  # Pick the most common alternative allele
            df.at[i, 'alt_allele'] = alt_allele
            df.at[i, 'alt_count'] = alt_count
        # Handle bases on both strands
        quality_scores = [phred33ToQ(qual) for qual in str(row['quality'])]
        # Calculate average quality scores for reference and alternative alleles
        ref_qual_scores = [score for base, score in zip(pileup, quality_scores) if base.upper() == ref_allele]
        alt_qual_scores = [score for base, score in zip(pileup, quality_scores) if base.upper() == alt_allele]
        ref_mean_quality = np.mean(ref_qual_scores) if ref_qual_scores else np.nan
        alt_mean_quality = np.mean(alt_qual_scores) if alt_qual_scores else np.nan
        # Assign mean quality scores
        df.at[i, 'ref_mean_quality'] = round(ref_mean_quality, 2)
        df.at[i, 'alt_mean_quality'] = round(alt_mean_quality, 2)
        # Calculate zygosity
        ref_count = df.at[i, 'ref_count']
        alt_count = df.at[i, 'alt_count']
        if ref_count == 0 and alt_count > 0:
            zygosity = 'hom_alt'
        elif ref_count > 0 and alt_count == 0:
            zygosity = 'hom_ref'
        elif ref_count > 0 and alt_count > 0:
            zygosity = 'hete'
        else:
            zygosity = ''
        df.at[i, 'zygosity'] = zygosity
    # Create an 'id' column by concatenating 'chr' (without 'chr') and 'pos'
    df['id'] = df['chr'].astype(str).apply(lambda x: x.replace('chr', '')) + '_' + df['pos'].astype(str)
    # Keep only the desired columns
    df = df[['id', 'chr', 'pos', 'site_counts', 'ref_base', 'A', 'T', 'C', 'G', 'ref_allele', 'ref_count', 'alt_allele', 'alt_count', 'InDel', 'ref_mean_quality', 'alt_mean_quality', 'zygosity']]
    return df
# Get a list of all .txt files
files = glob.glob('output/wgs_vs_chip/allele_counts/*.txt')
# Create output directory if it does not exist
output_directory = 'output/wgs_vs_chip/allele_counts/processed'
os.makedirs(output_directory, exist_ok=True)
# Process all files
for file in files:
    df = process_pileup_file(file)
    # Create output filename
    output_filename = os.path.join(output_directory, f'{os.path.basename(file)[:-4]}.csv')
    # Write the processed data to a new .csv file
    df.to_csv(output_filename, index=False)
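The zygosity rule in the parsing script labels a site as heterozygous as soon as both alleles have at least one read, so a site with 1 reference read and 16 alternative reads is called "hete". The sketch below restates that rule as a small function and adds an optional `min_fraction` argument. Both the function name `call_zygosity` and the `min_fraction` argument are hypothetical and are not part of the script; with the default of 0.0 the function reproduces the script's rule, and the value 0.1 used in the example is only illustrative, not a recommended cutoff.

```python
def call_zygosity(ref_count, alt_count, min_fraction=0.0):
    """Classify a site from its reference and alternative read counts.

    min_fraction is a hypothetical knob: an allele must reach this share of
    the ref + alt reads to be treated as present. The default of 0.0 means
    that a single read is enough, as in the parsing script.
    """
    total = ref_count + alt_count
    if total == 0:
        return ""
    ref_present = ref_count > 0 and ref_count / total >= min_fraction
    alt_present = alt_count > 0 and alt_count / total >= min_fraction
    if ref_present and alt_present:
        return "hete"
    if alt_present:
        return "hom_alt"
    if ref_present:
        return "hom_ref"
    return ""


# One reference read out of 17 is enough for a heterozygous call by default
print(call_zygosity(1, 16))
# With an illustrative 10% minimum share, the same site is called hom_alt
print(call_zygosity(1, 16, min_fraction=0.1))
```

Comparing the two calls for the same counts gives a quick sense of how sensitive the read-count genotypes are to a few stray reads, which matters when sequencing depth varies between samples.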
import pandas as pd
import glob
# Get a list of all processed CSV files
files = glob.glob('output/wgs_vs_chip/allele_counts/processed/*.csv')
# Initialize an empty list to store the individual DataFrames
dfs = []
# Iterate over the files and read them into DataFrames
for file in files:
    # Read the CSV file
    df = pd.read_csv(file)
    # Extract the file name without the extension
    file_name = file.split('/')[-1].split('.')[0]
    # Append the file name to the column names
    df = df.rename(columns={col: col + '_' + file_name for col in df.columns if col != 'id'})
    # Append the DataFrame to the list
    dfs.append(df)
# Merge the individual DataFrames into a single DataFrame using 'id' as the key
merged_df = dfs[0]  # Initialize merged_df with the first DataFrame
for df in dfs[1:]:
    merged_df = merged_df.merge(df, on='id', how='outer')
# Drop the 'chr_' and 'pos_' columns
merged_df = merged_df.drop(columns=[col for col in merged_df.columns if col.startswith('chr_') or col.startswith('pos_')])
# Save the merged data to a new CSV file
merged_df.to_csv('output/wgs_vs_chip/allele_counts/merged_data.csv', index=False)
Clean env
We created a file earlier to update the SNP ids. We can use it to add the information to our data.
Check the file
## 1.1 chr1.1 97856 chr1.1_97856 AX-581444870
## 1.1 chr1.1 161729 chr1.1_161729 AX-583033226
## 1.1 chr1.1 229640 chr1.1_229640 AX-583035067
## 1.1 chr1.1 305518 chr1.1_305518 AX-583035083
## 1.1 chr1.1 308124 chr1.1_308124 AX-583035102
## 1.1 chr1.1 311920 chr1.1_311920 AX-583033340
## 1.1 chr1.1 315059 chr1.1_315059 AX-583033342
## 1.1 chr1.1 315386 chr1.1_315386 AX-583035163
## 1.1 chr1.1 315674 chr1.1_315674 AX-583033356
## 1.1 chr1.1 330057 chr1.1_330057 AX-583033370
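The rows shown above can be turned into a lookup from position to probe id. The parsing script builds its "id" column from the scaffold name without the "chr" prefix, an underscore, and the position, so the key here is built from the first and third columns. This is a minimal sketch that assumes the file is whitespace-delimited with the five columns shown; the function name `build_id_lookup` is hypothetical.

```python
def build_id_lookup(lines):
    """Map 'scaffold_position' (no 'chr' prefix) to the probe id."""
    lookup = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        scaffold, _chrom, pos, _chrom_pos, probe_id = fields[:5]
        lookup[f"{scaffold}_{pos}"] = probe_id
    return lookup


# First two rows of the file shown above
rows = [
    "1.1 chr1.1 97856 chr1.1_97856 AX-581444870",
    "1.1 chr1.1 161729 chr1.1_161729 AX-583033226",
]
lookup = build_id_lookup(rows)
print(lookup.get("1.1_97856"))
# Positions that are not on the chip return None
print(lookup.get("2.206_14153"))
```

Using `get` rather than direct indexing keeps sites without a probe id from raising an error, so they can be flagged or dropped in a later step.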
Our data has a column named “id” that we can use to add the SNP id
merged_data <-
fread(here("output", "wgs_vs_chip", "allele_counts", "merged_data.csv"))
head(merged_data)
## id site_counts_KAT_5n ref_base_KAT_5n A_KAT_5n T_KAT_5n C_KAT_5n
## 1: 2.206_14153 21 A 8 0 0
## 2: 2.206_41198 22 T 0 22 0
## 3: 2.206_46216 21 C 0 0 21
## 4: 2.206_46416 23 A 23 0 0
## 5: 2.206_47314 14 T 0 14 0
## 6: 2.206_49900 19 G 19 0 0
## G_KAT_5n ref_allele_KAT_5n ref_count_KAT_5n alt_allele_KAT_5n
## 1: 13 A 8 G
## 2: 0 T 22 A
## 3: 0 C 21 A
## 4: 0 A 23 T
## 5: 0 T 14 A
## 6: 0 G 0 A
## alt_count_KAT_5n InDel_KAT_5n ref_mean_quality_KAT_5n
## 1: 13 FALSE 35.88
## 2: 0 FALSE 37.41
## 3: 0 FALSE 37.14
## 4: 0 FALSE 37.26
## 5: 0 FALSE 36.14
## 6: 19 FALSE NA
## alt_mean_quality_KAT_5n zygosity_KAT_5n site_counts_SAI_18 ref_base_SAI_18
## 1: 35.15 hete 7 A
## 2: NA hom_ref 23 T
## 3: NA hom_ref 16 C
## 4: NA hom_ref 19 A
## 5: NA hom_ref 17 T
## 6: 36.05 hom_alt 15 G
## A_SAI_18 T_SAI_18 C_SAI_18 G_SAI_18 ref_allele_SAI_18 ref_count_SAI_18
## 1: 7 0 0 0 A 7
## 2: 0 13 0 10 T 13
## 3: 0 16 0 0 C 0
## 4: 19 0 0 0 A 19
## 5: 0 1 0 16 T 1
## 6: 15 0 0 0 G 0
## alt_allele_SAI_18 alt_count_SAI_18 InDel_SAI_18 ref_mean_quality_SAI_18
## 1: T 0 FALSE 35.29
## 2: G 10 FALSE 37.23
## 3: T 16 FALSE NA
## 4: T 0 FALSE 36.37
## 5: G 16 FALSE 37.00
## 6: A 15 FALSE NA
## alt_mean_quality_SAI_18 zygosity_SAI_18 site_counts_KAT_4n ref_base_KAT_4n
## 1: NA hom_ref 14 A
## 2: 37.30 hete 21 T
## 3: 35.56 hom_alt 19 C
## 4: NA hom_ref 25 A
## 5: 37.75 hete 20 T
## 6: 36.60 hom_alt 16 G
## A_KAT_4n T_KAT_4n C_KAT_4n G_KAT_4n ref_allele_KAT_4n ref_count_KAT_4n
## 1: 9 0 0 5 A 9
## 2: 0 21 0 0 T 21
## 3: 0 7 12 0 C 12
## 4: 11 0 0 14 A 11
## 5: 0 13 0 7 T 13
## 6: 16 0 0 0 G 0
## alt_allele_KAT_4n alt_count_KAT_4n InDel_KAT_4n ref_mean_quality_KAT_4n
## 1: G 5 FALSE 37.67
## 2: A 0 FALSE 37.29
## 3: T 7 FALSE 37.75
## 4: G 14 FALSE 35.91
## 5: G 7 FALSE 36.54
## 6: A 16 FALSE NA
## alt_mean_quality_KAT_4n zygosity_KAT_4n site_counts_KAT_11 ref_base_KAT_11
## 1: 37.00 hete 9 A
## 2: NA hom_ref 22 T
## 3: 35.43 hete 25 C
## 4: 36.79 hete 16 A
## 5: 37.86 hete 18 T
## 6: 34.75 hom_alt 26 G
## A_KAT_11 T_KAT_11 C_KAT_11 G_KAT_11 ref_allele_KAT_11 ref_count_KAT_11
## 1: 2 0 0 7 A 2
## 2: 0 22 0 0 T 22
## 3: 0 14 11 0 C 11
## 4: 9 0 0 7 A 9
## 5: 0 7 0 11 T 7
## 6: 26 0 0 0 G 0
## alt_allele_KAT_11 alt_count_KAT_11 InDel_KAT_11 ref_mean_quality_KAT_11
## 1: G 7 FALSE 37.00
## 2: A 0 FALSE 36.73
## 3: T 14 FALSE 37.27
## 4: G 7 FALSE 37.33
## 5: G 11 FALSE 35.29
## 6: A 26 FALSE NA
## alt_mean_quality_KAT_11 zygosity_KAT_11 site_counts_KAT_10 ref_base_KAT_10
## 1: 37.29 hete 17 A
## 2: NA hom_ref 23 T
## 3: 37.21 hete 23 C
## 4: 37.43 hete 28 A
## 5: 37.00 hete 8 T
## 6: 36.77 hom_alt 23 G
## A_KAT_10 T_KAT_10 C_KAT_10 G_KAT_10 ref_allele_KAT_10 ref_count_KAT_10
## 1: 0 0 0 17 A 0
## 2: 0 23 0 0 T 23
## 3: 0 12 11 0 C 11
## 4: 16 0 0 12 A 16
## 5: 0 1 0 7 T 1
## 6: 23 0 0 0 G 0
## alt_allele_KAT_10 alt_count_KAT_10 InDel_KAT_10 ref_mean_quality_KAT_10
## 1: G 17 FALSE NA
## 2: A 0 FALSE 37.00
## 3: T 12 FALSE 37.00
## 4: G 12 FALSE 37.38
## 5: G 7 FALSE 37.00
## 6: A 23 FALSE NA
## alt_mean_quality_KAT_10 zygosity_KAT_10 site_counts_SAI_6n ref_base_SAI_6n
## 1: 36.00 hom_alt 9 A
## 2: NA hom_ref 21 T
## 3: 37.00 hete 19 C
## 4: 37.25 hete 19 A
## 5: 35.29 hete 14 T
## 6: 36.22 hom_alt 19 G
## A_SAI_6n T_SAI_6n C_SAI_6n G_SAI_6n ref_allele_SAI_6n ref_count_SAI_6n
## 1: 9 0 0 0 A 9
## 2: 0 21 0 0 T 21
## 3: 0 19 0 0 C 0
## 4: 12 0 0 7 A 12
## 5: 0 0 0 14 T 0
## 6: 19 0 0 0 G 0
## alt_allele_SAI_6n alt_count_SAI_6n InDel_SAI_6n ref_mean_quality_SAI_6n
## 1: T 0 FALSE 37.0
## 2: A 0 FALSE 37.0
## 3: T 19 FALSE NA
## 4: G 7 FALSE 37.5
## 5: G 14 FALSE NA
## 6: A 19 FALSE NA
## alt_mean_quality_SAI_6n zygosity_SAI_6n site_counts_KAT_3n ref_base_KAT_3n
## 1: NA hom_ref 34 A
## 2: NA hom_ref 34 T
## 3: 36.67 hom_alt 24 C
## 4: 37.00 hete 28 A
## 5: 37.21 hom_alt 15 T
## 6: 36.21 hom_alt 20 G
## A_KAT_3n T_KAT_3n C_KAT_3n G_KAT_3n ref_allele_KAT_3n ref_count_KAT_3n
## 1: 0 0 0 34 A 0
## 2: 0 34 0 0 T 34
## 3: 0 0 24 0 C 24
## 4: 28 0 0 0 A 28
## 5: 0 15 0 0 T 15
## 6: 20 0 0 0 G 0
## alt_allele_KAT_3n alt_count_KAT_3n InDel_KAT_3n ref_mean_quality_KAT_3n
## 1: G 34 FALSE NA
## 2: A 0 FALSE 36.03
## 3: A 0 FALSE 36.62
## 4: T 0 FALSE 36.68
## 5: A 0 FALSE 36.40
## 6: A 20 FALSE NA
## alt_mean_quality_KAT_3n zygosity_KAT_3n site_counts_KAT_12 ref_base_KAT_12
## 1: 36.59 hom_alt 23 A
## 2: NA hom_ref 34 T
## 3: NA hom_ref 16 C
## 4: NA hom_ref 21 A
## 5: NA hom_ref 24 T
## 6: 36.55 hom_alt 19 G
## A_KAT_12 T_KAT_12 C_KAT_12 G_KAT_12 ref_allele_KAT_12 ref_count_KAT_12
## 1: 0 0 0 23 A 0
## 2: 0 34 0 0 T 34
## 3: 0 16 0 0 C 0
## 4: 0 0 0 21 A 0
## 5: 0 0 0 24 T 0
## 6: 19 0 0 0 G 0
## alt_allele_KAT_12 alt_count_KAT_12 InDel_KAT_12 ref_mean_quality_KAT_12
## 1: G 23 FALSE NA
## 2: A 0 FALSE 36.55
## 3: T 16 FALSE NA
## 4: G 21 FALSE NA
## 5: G 24 FALSE NA
## 6: A 19 FALSE NA
## alt_mean_quality_KAT_12 zygosity_KAT_12 site_counts_SAI_11n ref_base_SAI_11n
## 1: 35.17 hom_alt 5 A
## 2: NA hom_ref 27 T
## 3: 37.19 hom_alt 20 C
## 4: 36.86 hom_alt 17 A
## 5: 37.00 hom_alt 21 T
## 6: 36.68 hom_alt 21 G
## A_SAI_11n T_SAI_11n C_SAI_11n G_SAI_11n ref_allele_SAI_11n ref_count_SAI_11n
## 1: 5 0 0 0 A 5
## 2: 0 27 0 0 T 27
## 3: 0 20 0 0 C 0
## 4: 10 0 0 7 A 10
## 5: 0 0 0 21 T 0
## 6: 21 0 0 0 G 0
## alt_allele_SAI_11n alt_count_SAI_11n InDel_SAI_11n ref_mean_quality_SAI_11n
## 1: T 0 FALSE 37.00
## 2: A 0 FALSE 36.56
## 3: T 20 FALSE NA
## 4: G 7 FALSE 37.00
## 5: G 21 FALSE NA
## 6: A 21 FALSE NA
## alt_mean_quality_SAI_11n zygosity_SAI_11n site_counts_KAT_7 ref_base_KAT_7
## 1: NA hom_ref 23 A
## 2: NA hom_ref 36 T
## 3: 36.40 hom_alt 26 C
## 4: 37.00 hete 30 A
## 5: 36.43 hom_alt 12 T
## 6: 37.29 hom_alt 22 G
## A_KAT_7 T_KAT_7 C_KAT_7 G_KAT_7 ref_allele_KAT_7 ref_count_KAT_7
## 1: 0 0 0 23 A 0
## 2: 0 36 0 0 T 36
## 3: 0 11 15 0 C 15
## 4: 9 0 0 21 A 9
## 5: 0 7 0 5 T 7
## 6: 22 0 0 0 G 0
## alt_allele_KAT_7 alt_count_KAT_7 InDel_KAT_7 ref_mean_quality_KAT_7
## 1: G 23 FALSE NA
## 2: A 0 FALSE 36.26
## 3: T 11 FALSE 37.00
## 4: G 21 FALSE 37.38
## 5: G 5 FALSE 35.29
## 6: A 22 FALSE NA
## alt_mean_quality_KAT_7 zygosity_KAT_7 site_counts_SAI_10n ref_base_SAI_10n
## 1: 36.87 hom_alt 8 A
## 2: NA hom_ref 25 T
## 3: 37.27 hete 20 C
## 4: 36.71 hete 23 A
## 5: 37.00 hete 18 T
## 6: 37.00 hom_alt 25 G
## A_SAI_10n T_SAI_10n C_SAI_10n G_SAI_10n ref_allele_SAI_10n ref_count_SAI_10n
## 1: 8 0 0 0 A 8
## 2: 0 25 0 0 T 25
## 3: 0 10 10 0 C 10
## 4: 13 0 0 10 A 13
## 5: 0 12 0 6 T 12
## 6: 25 0 0 0 G 0
## alt_allele_SAI_10n alt_count_SAI_10n InDel_SAI_10n ref_mean_quality_SAI_10n
## 1: T 0 FALSE 36.25
## 2: A 0 FALSE 36.52
## 3: T 10 FALSE 37.00
## 4: G 10 FALSE 37.23
## 5: G 6 FALSE 37.25
## 6: A 25 FALSE NA
## alt_mean_quality_SAI_10n zygosity_SAI_10n site_counts_SAI_7n ref_base_SAI_7n
## 1: NA hom_ref 11 A
## 2: NA hom_ref 31 T
## 3: 37.30 hete 33 C
## 4: 37.00 hete 35 A
## 5: 37.50 hete 21 T
## 6: 37.24 hom_alt 22 G
## A_SAI_7n T_SAI_7n C_SAI_7n G_SAI_7n ref_allele_SAI_7n ref_count_SAI_7n
## 1: 11 0 0 0 A 11
## 2: 0 31 0 0 T 31
## 3: 0 33 0 0 C 0
## 4: 18 0 0 17 A 18
## 5: 0 0 0 21 T 0
## 6: 22 0 0 0 G 0
## alt_allele_SAI_7n alt_count_SAI_7n InDel_SAI_7n ref_mean_quality_SAI_7n
## 1: T 0 FALSE 37.00
## 2: A 0 FALSE 37.21
## 3: T 33 FALSE NA
## 4: G 17 FALSE 36.67
## 5: G 21 FALSE NA
## 6: A 22 FALSE NA
## alt_mean_quality_SAI_7n zygosity_SAI_7n site_counts_KAT_2n ref_base_KAT_2n
## 1: NA hom_ref 7 A
## 2: NA hom_ref 21 T
## 3: 36.82 hom_alt 19 C
## 4: 36.71 hete 15 A
## 5: 36.86 hom_alt 25 T
## 6: 36.18 hom_alt 19 G
## A_KAT_2n T_KAT_2n C_KAT_2n G_KAT_2n ref_allele_KAT_2n ref_count_KAT_2n
## 1: 0 0 0 7 A 0
## 2: 0 21 0 0 T 21
## 3: 0 19 0 0 C 0
## 4: 0 0 0 15 A 0
## 5: 0 0 0 25 T 0
## 6: 19 0 0 0 G 0
## alt_allele_KAT_2n alt_count_KAT_2n InDel_KAT_2n ref_mean_quality_KAT_2n
## 1: G 7 FALSE NA
## 2: A 0 FALSE 35.86
## 3: T 19 FALSE NA
## 4: G 15 FALSE NA
## 5: G 25 FALSE NA
## 6: A 19 FALSE NA
## alt_mean_quality_KAT_2n zygosity_KAT_2n site_counts_KAT_1n ref_base_KAT_1n
## 1: 37.86 hom_alt 14 A
## 2: NA hom_ref 23 T
## 3: 36.37 hom_alt 20 C
## 4: 37.20 hom_alt 17 A
## 5: 36.64 hom_alt 15 T
## 6: 35.79 hom_alt 25 G
## A_KAT_1n T_KAT_1n C_KAT_1n G_KAT_1n ref_allele_KAT_1n ref_count_KAT_1n
## 1: 5 0 0 9 A 5
## 2: 0 23 0 0 T 23
## 3: 0 20 0 0 C 0
## 4: 0 0 0 17 A 0
## 5: 0 0 0 15 T 0
## 6: 25 0 0 0 G 0
## alt_allele_KAT_1n alt_count_KAT_1n InDel_KAT_1n ref_mean_quality_KAT_1n
## 1: G 9 FALSE 34.60
## 2: A 0 FALSE 36.77
## 3: T 20 FALSE NA
## 4: G 17 FALSE NA
## 5: G 15 FALSE NA
## 6: A 25 FALSE NA
## alt_mean_quality_KAT_1n zygosity_KAT_1n site_counts_SAI_1 ref_base_SAI_1
## 1: 35.89 hete 12 A
## 2: NA hom_ref 15 T
## 3: 36.30 hom_alt 29 C
## 4: 36.29 hom_alt 25 A
## 5: 36.20 hom_alt 31 T
## 6: 36.52 hom_alt 20 G
## A_SAI_1 T_SAI_1 C_SAI_1 G_SAI_1 ref_allele_SAI_1 ref_count_SAI_1
## 1: 12 0 0 0 A 12
## 2: 0 6 0 9 T 6
## 3: 0 29 0 0 C 0
## 4: 11 0 0 14 A 11
## 5: 0 0 0 31 T 0
## 6: 20 0 0 0 G 0
## alt_allele_SAI_1 alt_count_SAI_1 InDel_SAI_1 ref_mean_quality_SAI_1
## 1: T 0 FALSE 37.0
## 2: G 9 FALSE 35.5
## 3: T 29 FALSE NA
## 4: G 14 FALSE 37.0
## 5: G 31 FALSE NA
## 6: A 20 FALSE NA
## alt_mean_quality_SAI_1 zygosity_SAI_1 site_counts_SAI_8n ref_base_SAI_8n
## 1: NA hom_ref 35 A
## 2: 35.67 hete 42 T
## 3: 35.83 hom_alt 21 C
## 4: 37.00 hete 17 A
## 5: 36.81 hom_alt 25 T
## 6: 37.00 hom_alt 28 G
## A_SAI_8n T_SAI_8n C_SAI_8n G_SAI_8n ref_allele_SAI_8n ref_count_SAI_8n
## 1: 35 0 0 0 A 35
## 2: 0 42 0 0 T 42
## 3: 0 21 0 0 C 0
## 4: 0 0 0 17 A 0
## 5: 0 0 0 25 T 0
## 6: 28 0 0 0 G 0
## alt_allele_SAI_8n alt_count_SAI_8n InDel_SAI_8n ref_mean_quality_SAI_8n
## 1: T 0 FALSE 36.14
## 2: A 0 FALSE 36.57
## 3: T 21 FALSE NA
## 4: G 17 FALSE NA
## 5: G 25 FALSE NA
## 6: A 28 FALSE NA
## alt_mean_quality_SAI_8n zygosity_SAI_8n site_counts_SAI_2 ref_base_SAI_2
## 1: NA hom_ref 23 A
## 2: NA hom_ref 14 T
## 3: 36.71 hom_alt 23 C
## 4: 36.88 hom_alt 23 A
## 5: 36.76 hom_alt 18 T
## 6: 37.32 hom_alt 22 G
## A_SAI_2 T_SAI_2 C_SAI_2 G_SAI_2 ref_allele_SAI_2 ref_count_SAI_2
## 1: 23 0 0 0 A 23
## 2: 0 7 0 7 T 7
## 3: 0 11 12 0 C 12
## 4: 23 0 0 0 A 23
## 5: 0 8 0 10 T 8
## 6: 22 0 0 0 G 0
## alt_allele_SAI_2 alt_count_SAI_2 InDel_SAI_2 ref_mean_quality_SAI_2
## 1: T 0 FALSE 37.26
## 2: G 7 FALSE 36.29
## 3: T 11 FALSE 37.25
## 4: T 0 FALSE 37.27
## 5: G 10 FALSE 35.88
## 6: A 22 FALSE NA
## alt_mean_quality_SAI_2 zygosity_SAI_2 site_counts_SAI_3 ref_base_SAI_3
## 1: NA hom_ref 15 A
## 2: 37.00 hete 17 T
## 3: 35.45 hete 24 C
## 4: NA hom_ref 24 A
## 5: 36.10 hete 15 T
## 6: 36.86 hom_alt 12 G
## A_SAI_3 T_SAI_3 C_SAI_3 G_SAI_3 ref_allele_SAI_3 ref_count_SAI_3
## 1: 15 0 0 0 A 15
## 2: 0 17 0 0 T 17
## 3: 0 24 0 0 C 0
## 4: 24 0 0 0 A 24
## 5: 0 0 0 15 T 0
## 6: 12 0 0 0 G 0
## alt_allele_SAI_3 alt_count_SAI_3 InDel_SAI_3 ref_mean_quality_SAI_3
## 1: T 0 FALSE 36.4
## 2: A 0 FALSE 37.0
## 3: T 24 FALSE NA
## 4: T 0 FALSE 37.0
## 5: G 15 FALSE NA
## 6: A 12 FALSE NA
## alt_mean_quality_SAI_3 zygosity_SAI_3 site_counts_SAI_9n ref_base_SAI_9n
## 1: NA hom_ref 4 A
## 2: NA hom_ref 19 T
## 3: 36.58 hom_alt 21 C
## 4: NA hom_ref 17 A
## 5: 37.00 hom_alt 18 T
## 6: 37.00 hom_alt 13 G
## A_SAI_9n T_SAI_9n C_SAI_9n G_SAI_9n ref_allele_SAI_9n ref_count_SAI_9n
## 1: 4 0 0 0 A 4
## 2: 0 0 0 19 T 0
## 3: 0 21 0 0 C 0
## 4: 17 0 0 0 A 17
## 5: 0 0 0 18 T 0
## 6: 13 0 0 0 G 0
## alt_allele_SAI_9n alt_count_SAI_9n InDel_SAI_9n ref_mean_quality_SAI_9n
## 1: T 0 FALSE 37
## 2: G 19 FALSE NA
## 3: T 21 FALSE NA
## 4: T 0 FALSE 37
## 5: G 18 FALSE NA
## 6: A 13 FALSE NA
## alt_mean_quality_SAI_9n zygosity_SAI_9n site_counts_SAI_4 ref_base_SAI_4
## 1: NA hom_ref 7 A
## 2: 37.47 hom_alt 22 T
## 3: 37.14 hom_alt 24 C
## 4: NA hom_ref 25 A
## 5: 37.33 hom_alt 15 T
## 6: 36.08 hom_alt 15 G
## A_SAI_4 T_SAI_4 C_SAI_4 G_SAI_4 ref_allele_SAI_4 ref_count_SAI_4
## 1: 7 0 0 0 A 7
## 2: 0 22 0 0 T 22
## 3: 0 0 24 0 C 24
## 4: 25 0 0 0 A 25
## 5: 0 15 0 0 T 15
## 6: 15 0 0 0 G 0
## alt_allele_SAI_4 alt_count_SAI_4 InDel_SAI_4 ref_mean_quality_SAI_4
## 1: T 0 FALSE 37.00
## 2: A 0 FALSE 35.77
## 3: A 0 FALSE 37.26
## 4: T 0 FALSE 37.48
## 5: A 0 FALSE 36.14
## 6: A 15 FALSE NA
## alt_mean_quality_SAI_4 zygosity_SAI_4 site_counts_KAT_9 ref_base_KAT_9
## 1: NA hom_ref 20 A
## 2: NA hom_ref 21 T
## 3: NA hom_ref 28 C
## 4: NA hom_ref 28 A
## 5: NA hom_ref 19 T
## 6: 37.4 hom_alt 26 G
## A_KAT_9 T_KAT_9 C_KAT_9 G_KAT_9 ref_allele_KAT_9 ref_count_KAT_9
## 1: 0 0 0 20 A 0
## 2: 0 21 0 0 T 21
## 3: 0 14 14 0 C 14
## 4: 14 0 0 14 A 14
## 5: 0 6 0 13 T 6
## 6: 26 0 0 0 G 0
## alt_allele_KAT_9 alt_count_KAT_9 InDel_KAT_9 ref_mean_quality_KAT_9
## 1: G 20 FALSE NA
## 2: A 0 FALSE 37.30
## 3: T 14 FALSE 37.00
## 4: G 14 FALSE 35.29
## 5: G 13 FALSE 37.50
## 6: A 26 FALSE NA
## alt_mean_quality_KAT_9 zygosity_KAT_9 site_counts_KAT_8 ref_base_KAT_8
## 1: 34.60 hom_alt 28 A
## 2: NA hom_ref 23 T
## 3: 36.57 hete 24 C
## 4: 36.43 hete 21 A
## 5: 37.46 hete 20 T
## 6: 35.73 hom_alt 18 G
## A_KAT_8 T_KAT_8 C_KAT_8 G_KAT_8 ref_allele_KAT_8 ref_count_KAT_8
## 1: 0 0 0 28 A 0
## 2: 0 23 0 0 T 23
## 3: 0 24 0 0 C 0
## 4: 0 0 0 21 A 0
## 5: 0 0 0 20 T 0
## 6: 18 0 0 0 G 0
## alt_allele_KAT_8 alt_count_KAT_8 InDel_KAT_8 ref_mean_quality_KAT_8
## 1: G 28 FALSE NA
## 2: A 0 FALSE 35.57
## 3: T 24 FALSE NA
## 4: G 21 FALSE NA
## 5: G 20 FALSE NA
## 6: A 18 FALSE NA
## alt_mean_quality_KAT_8 zygosity_KAT_8 site_counts_SAI_5 ref_base_SAI_5
## 1: 36.56 hom_alt 13 A
## 2: NA hom_ref 5 T
## 3: 36.12 hom_alt 11 C
## 4: 36.52 hom_alt 23 A
## 5: 37.45 hom_alt 10 T
## 6: 36.33 hom_alt 10 G
## A_SAI_5 T_SAI_5 C_SAI_5 G_SAI_5 ref_allele_SAI_5 ref_count_SAI_5
## 1: 13 0 0 0 A 13
## 2: 0 2 0 3 T 2
## 3: 0 11 0 0 C 0
## 4: 17 0 0 6 A 17
## 5: 0 0 0 10 T 0
## 6: 10 0 0 0 G 0
## alt_allele_SAI_5 alt_count_SAI_5 InDel_SAI_5 ref_mean_quality_SAI_5
## 1: T 0 FALSE 34.23
## 2: G 3 FALSE 37.00
## 3: T 11 FALSE NA
## 4: G 6 FALSE 36.47
## 5: G 10 FALSE NA
## 6: A 10 FALSE NA
## alt_mean_quality_SAI_5 zygosity_SAI_5 site_counts_SAI_17 ref_base_SAI_17
## 1: NA hom_ref 9 A
## 2: 37.00 hete 7 T
## 3: 37.27 hom_alt 17 C
## 4: 37.00 hete 21 A
## 5: 36.40 hom_alt 15 T
## 6: 37.30 hom_alt 15 G
## A_SAI_17 T_SAI_17 C_SAI_17 G_SAI_17 ref_allele_SAI_17 ref_count_SAI_17
## 1: 9 0 0 0 A 9
## 2: 0 7 0 0 T 7
## 3: 0 17 0 0 C 0
## 4: 14 0 0 7 A 14
## 5: 0 0 0 15 T 0
## 6: 15 0 0 0 G 0
## alt_allele_SAI_17 alt_count_SAI_17 InDel_SAI_17 ref_mean_quality_SAI_17
## 1: T 0 FALSE 37.00
## 2: A 0 FALSE 35.71
## 3: T 17 FALSE NA
## 4: G 7 FALSE 37.21
## 5: G 15 FALSE NA
## 6: A 15 FALSE NA
## alt_mean_quality_SAI_17 zygosity_SAI_17 site_counts_SAI_16 ref_base_SAI_16
## 1: NA hom_ref 15 A
## 2: NA hom_ref 27 T
## 3: 35.47 hom_alt 17 C
## 4: 34.14 hete 20 A
## 5: 37.00 hom_alt 23 T
## 6: 36.40 hom_alt 10 G
## A_SAI_16 T_SAI_16 C_SAI_16 G_SAI_16 ref_allele_SAI_16 ref_count_SAI_16
## 1: 15 0 0 0 A 15
## 2: 0 27 0 0 T 27
## 3: 0 17 0 0 C 0
## 4: 0 0 0 20 A 0
## 5: 0 0 0 23 T 0
## 6: 10 0 0 0 G 0
## alt_allele_SAI_16 alt_count_SAI_16 InDel_SAI_16 ref_mean_quality_SAI_16
## 1: T 0 FALSE 35.6
## 2: A 0 FALSE 37.0
## 3: T 17 FALSE NA
## 4: G 20 FALSE NA
## 5: G 23 FALSE NA
## 6: A 10 FALSE NA
## alt_mean_quality_SAI_16 zygosity_SAI_16 site_counts_SAI_14 ref_base_SAI_14
## 1: NA hom_ref 4 A
## 2: NA hom_ref 23 T
## 3: 36.29 hom_alt 21 C
## 4: 36.70 hom_alt 13 A
## 5: 35.74 hom_alt 22 T
## 6: 37.60 hom_alt 17 G
## A_SAI_14 T_SAI_14 C_SAI_14 G_SAI_14 ref_allele_SAI_14 ref_count_SAI_14
## 1: 4 0 0 0 A 4
## 2: 0 12 0 11 T 12
## 3: 0 21 0 0 C 0
## 4: 12 0 0 1 A 12
## 5: 0 0 0 22 T 0
## 6: 17 0 0 0 G 0
## alt_allele_SAI_14 alt_count_SAI_14 InDel_SAI_14 ref_mean_quality_SAI_14
## 1: T 0 FALSE 37.00
## 2: G 11 FALSE 36.55
## 3: T 21 FALSE NA
## 4: G 1 FALSE 37.50
## 5: G 22 FALSE NA
## 6: A 17 FALSE NA
## alt_mean_quality_SAI_14 zygosity_SAI_14 site_counts_SAI_15 ref_base_SAI_15
## 1: NA hom_ref 16 A
## 2: 37.27 hete 24 T
## 3: 36.43 hom_alt 33 C
## 4: 25.00 hete 24 A
## 5: 36.18 hom_alt 34 T
## 6: 36.00 hom_alt 28 G
## A_SAI_15 T_SAI_15 C_SAI_15 G_SAI_15 ref_allele_SAI_15 ref_count_SAI_15
## 1: 16 0 0 0 A 16
## 2: 0 0 0 24 T 0
## 3: 0 33 0 0 C 0
## 4: 24 0 0 0 A 24
## 5: 0 0 0 34 T 0
## 6: 28 0 0 0 G 0
## alt_allele_SAI_15 alt_count_SAI_15 InDel_SAI_15 ref_mean_quality_SAI_15
## 1: T 0 FALSE 37.19
## 2: G 24 FALSE NA
## 3: T 33 FALSE NA
## 4: T 0 FALSE 36.62
## 5: G 34 FALSE NA
## 6: A 28 FALSE NA
## alt_mean_quality_SAI_15 zygosity_SAI_15 site_counts_KAT_6n ref_base_KAT_6n
## 1: NA hom_ref 18 A
## 2: 36.88 hom_alt 30 T
## 3: 37.00 hom_alt 22 C
## 4: NA hom_ref 27 A
## 5: 36.82 hom_alt 12 T
## 6: 36.46 hom_alt 20 G
## A_KAT_6n T_KAT_6n C_KAT_6n G_KAT_6n ref_allele_KAT_6n ref_count_KAT_6n
## 1: 5 0 0 13 A 5
## 2: 0 30 0 1 T 30
## 3: 0 22 0 0 C 0
## 4: 0 0 0 27 A 0
## 5: 0 0 0 12 T 0
## 6: 20 0 0 0 G 0
## alt_allele_KAT_6n alt_count_KAT_6n InDel_KAT_6n ref_mean_quality_KAT_6n
## 1: G 13 FALSE 37.00
## 2: G 1 FALSE 37.52
## 3: T 22 FALSE NA
## 4: G 27 FALSE NA
## 5: G 12 FALSE NA
## 6: A 20 FALSE NA
## alt_mean_quality_KAT_6n zygosity_KAT_6n site_counts_SAI_12 ref_base_SAI_12
## 1: 37.00 hete 15 A
## 2: NA hete 16 T
## 3: 37.27 hom_alt 24 C
## 4: 36.37 hom_alt 23 A
## 5: 37.25 hom_alt 31 T
## 6: 36.10 hom_alt 25 G
## A_SAI_12 T_SAI_12 C_SAI_12 G_SAI_12 ref_allele_SAI_12 ref_count_SAI_12
## 1: 15 0 0 0 A 15
## 2: 0 7 0 9 T 7
## 3: 0 24 0 0 C 0
## 4: 10 0 0 13 A 10
## 5: 0 0 0 31 T 0
## 6: 25 0 0 0 G 0
## alt_allele_SAI_12 alt_count_SAI_12 InDel_SAI_12 ref_mean_quality_SAI_12
## 1: T 0 FALSE 37.20
## 2: G 9 FALSE 35.71
## 3: T 24 FALSE NA
## 4: G 13 FALSE 35.80
## 5: G 31 FALSE NA
## 6: A 25 FALSE NA
## alt_mean_quality_SAI_12 zygosity_SAI_12 site_counts_SAI_13 ref_base_SAI_13
## 1: NA hom_ref 4 A
## 2: 37.33 hete 16 T
## 3: 35.88 hom_alt 19 C
## 4: 37.00 hete 13 A
## 5: 37.10 hom_alt 14 T
## 6: 36.52 hom_alt 20 G
## A_SAI_13 T_SAI_13 C_SAI_13 G_SAI_13 ref_allele_SAI_13 ref_count_SAI_13
## 1: 4 0 0 0 A 4
## 2: 0 8 0 8 T 8
## 3: 0 19 0 0 C 0
## 4: 7 0 0 6 A 7
## 5: 0 0 0 14 T 0
## 6: 20 0 0 0 G 0
## alt_allele_SAI_13 alt_count_SAI_13 InDel_SAI_13 ref_mean_quality_SAI_13
## 1: T 0 FALSE 37.75
## 2: G 8 FALSE 37.00
## 3: T 19 FALSE NA
## 4: G 6 FALSE 37.43
## 5: G 14 FALSE NA
## 6: A 20 FALSE NA
## alt_mean_quality_SAI_13 zygosity_SAI_13
## 1: NA hom_ref
## 2: 37.00 hete
## 3: 36.68 hom_alt
## 4: 37.50 hete
## 5: 37.21 hom_alt
## 6: 35.35 hom_alt
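The `id` values shown above (for example `2.206_14153`) look like a scaffold name and a base-pair position joined by an underscore. As a minimal sketch under that assumption, using a few example keys and a hypothetical table name `toy`, the key can be split back into its two parts, which can be handy for sorting or plotting sites by position:

```r
library(data.table)

# Toy table; only the format of "id" is meant to mirror the real data
toy <- data.table(id = c("2.206_14153", "2.206_41198", "1.1_97856"))

# Split on the underscore. The scaffold is kept as text so that "2.206"
# is not converted to a number; the position becomes an integer.
toy[, c("scaffold", "bp") := tstrsplit(id, "_", fixed = TRUE)]
toy[, bp := as.integer(bp)]

print(toy)
```

Keeping the scaffold as a character column avoids silently turning names such as `1.10` and `1.1` into the same numeric value.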
Import the SNP ID file.
# Import the tab-delimited .txt file, reading every column as character
snp_ids <- read_delim(
  here("output", "wgs_vs_chip", "new_calls", "wgs_snps_ids.txt"),
  delim = "\t",
  show_col_types = FALSE,
  col_names = c("chr_ref", "id_ref", "bp_ref", "id", "snp_id"),
  col_types = cols(.default = col_character())
)
# Remove "chr" from "id" so it matches the ids in merged_data,
# and drop "id_ref", which is not needed downstream
snp_ids <-
  snp_ids |>
  mutate(id = gsub("chr", "", id)) |>
  dplyr::select(-"id_ref")
head(snp_ids)
## # A tibble: 6 × 4
## chr_ref bp_ref id snp_id
## <chr> <chr> <chr> <chr>
## 1 1.1 97856 1.1_97856 AX-581444870
## 2 1.1 161729 1.1_161729 AX-583033226
## 3 1.1 229640 1.1_229640 AX-583035067
## 4 1.1 305518 1.1_305518 AX-583035083
## 5 1.1 308124 1.1_308124 AX-583035102
## 6 1.1 311920 1.1_311920 AX-583033340
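Before joining on `id`, it may be worth confirming that the key is unique in the SNP table and counting how many WGS sites actually find a match. The following is a minimal sketch on a few example rows; the object names `toy_sites` and `toy_snps` are hypothetical, and the site `9.9_1` is made up to represent a site with no probe:

```r
# Small stand-ins for merged_data and snp_ids
toy_sites <- data.frame(id = c("1.1_97856", "1.1_161729", "9.9_1"))
toy_snps <- data.frame(
  id     = c("chr1.1_97856", "chr1.1_161729", "chr1.1_229640"),
  snp_id = c("AX-581444870", "AX-583033226", "AX-583035067")
)

# Anchor the pattern so only a leading "chr" is removed
toy_snps$id <- sub("^chr", "", toy_snps$id)

# A duplicated key would multiply rows in the join
n_dup <- sum(duplicated(toy_snps$id))

# Number of sites that would survive an inner join on "id"
n_match <- sum(toy_sites$id %in% toy_snps$id)

cat("duplicated keys:", n_dup, "\n")
cat("matching sites:", n_match, "of", nrow(toy_sites), "\n")
```

If `n_dup` were greater than zero on the real table, the merged object could contain more rows than the allele-count table, which would be worth resolving before comparing genotypes.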
We can now merge the two tables on the shared “id” column.
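As a toy illustration of the join semantics (example tables with the hypothetical names `x` and `y`, and a made-up `depth` column), `merge()` on two data.tables performs an inner join by default, so sites without an SNP ID are dropped. Setting `all.x = TRUE` would instead keep them, with `NA` in `snp_id`:

```r
library(data.table)

x <- data.table(id = c("1.1_97856", "1.1_161729", "9.9_1"),
                depth = c(21L, 22L, 14L))
y <- data.table(id = c("1.1_97856", "1.1_161729"),
                snp_id = c("AX-581444870", "AX-583033226"))

inner <- merge(x, y, by = "id")                # matched sites only
left  <- merge(x, y, by = "id", all.x = TRUE)  # unmatched sites kept, snp_id is NA

# merge() also sorts the result by the key, which is a character column here
print(inner)
print(left)
```

Comparing `nrow(inner)` with `nrow(x)` is a quick way to see how many sites lack a probe ID.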
# Convert the 'snp_ids' tibble to a data.table
snp_ids <- as.data.table(snp_ids)
# Inner join on the "id" column: sites without a matching SNP ID are dropped
merged_data2 <- merge(merged_data, snp_ids, by = "id")
# Not run: na.omit() would drop every row containing any NA (for example,
# missing mean quality values), not only rows without a match
# merged_data2 <- na.omit(merged_data2)
# Make sure merged_data2 is a data.table
setDT(merged_data2)
# Move "id" and "snp_id" to the front with setcolorder()
setcolorder(merged_data2, c("id", "snp_id", setdiff(names(merged_data2), c("id", "snp_id"))))
# Print the first few rows of the data table
head(merged_data2)
## id snp_id site_counts_KAT_5n ref_base_KAT_5n A_KAT_5n
## 1: 1.101_110197 AX-583079274 4 A 0
## 2: 1.101_116980 AX-583077250 19 C 0
## 3: 1.101_118670 AX-583079283 16 G 16
## 4: 1.101_147467 AX-583079310 22 G 0
## 5: 1.101_171602 AX-583077312 10 C 9
## 6: 1.101_210793 AX-583077325 7 T 0
## T_KAT_5n C_KAT_5n G_KAT_5n ref_allele_KAT_5n ref_count_KAT_5n
## 1: 0 0 4 A 0
## 2: 19 0 0 C 0
## 3: 0 0 0 G 0
## 4: 0 22 0 G 0
## 5: 0 1 0 C 1
## 6: 2 5 0 T 2
## alt_allele_KAT_5n alt_count_KAT_5n InDel_KAT_5n ref_mean_quality_KAT_5n
## 1: G 4 FALSE NA
## 2: T 19 FALSE NA
## 3: A 16 FALSE NA
## 4: C 22 FALSE NA
## 5: A 9 FALSE 25
## 6: C 5 FALSE 37
## alt_mean_quality_KAT_5n zygosity_KAT_5n site_counts_SAI_18 ref_base_SAI_18
## 1: 37.00 hom_alt 12 A
## 2: 37.32 hom_alt 13 C
## 3: 35.50 hom_alt 13 G
## 4: 36.45 hom_alt 15 G
## 5: 37.33 hete 15 C
## 6: 37.00 hete 1 T
## A_SAI_18 T_SAI_18 C_SAI_18 G_SAI_18 ref_allele_SAI_18 ref_count_SAI_18
## 1: 12 0 0 0 A 12
## 2: 0 13 0 0 C 0
## 3: 3 0 0 10 G 10
## 4: 0 0 0 15 G 15
## 5: 15 7 14 7 C 14
## 6: 0 0 1 0 T 0
## alt_allele_SAI_18 alt_count_SAI_18 InDel_SAI_18 ref_mean_quality_SAI_18
## 1: T 0 FALSE 37.75
## 2: T 13 FALSE NA
## 3: A 3 FALSE 37.60
## 4: A 0 FALSE 37.21
## 5: A 15 TRUE 28.75
## 6: C 1 FALSE NA
## alt_mean_quality_SAI_18 zygosity_SAI_18 site_counts_KAT_4n ref_base_KAT_4n
## 1: NA hom_ref 4 A
## 2: 37.23 hom_alt 15 C
## 3: 37.00 hete 30 G
## 4: NA hom_ref 22 G
## 5: 29.67 hete 14 C
## 6: 37.00 hom_alt 9 T
## A_KAT_4n T_KAT_4n C_KAT_4n G_KAT_4n ref_allele_KAT_4n ref_count_KAT_4n
## 1: 0 0 0 4 A 0
## 2: 0 15 0 0 C 0
## 3: 30 0 0 0 G 0
## 4: 0 0 22 0 G 0
## 5: 12 0 2 0 C 2
## 6: 0 4 5 0 T 4
## alt_allele_KAT_4n alt_count_KAT_4n InDel_KAT_4n ref_mean_quality_KAT_4n
## 1: G 4 FALSE NA
## 2: T 15 FALSE NA
## 3: A 30 FALSE NA
## 4: C 22 FALSE NA
## 5: A 12 FALSE 37
## 6: C 5 FALSE 37
## alt_mean_quality_KAT_4n zygosity_KAT_4n site_counts_KAT_11 ref_base_KAT_11
## 1: 37.00 hom_alt 11 A
## 2: 37.00 hom_alt 16 C
## 3: 36.20 hom_alt 17 G
## 4: 36.45 hom_alt 6 G
## 5: 37.25 hete 7 C
## 6: 33.60 hete 3 T
## A_KAT_11 T_KAT_11 C_KAT_11 G_KAT_11 ref_allele_KAT_11 ref_count_KAT_11
## 1: 10 0 0 1 A 10
## 2: 0 8 8 0 C 8
## 3: 10 0 0 7 G 7
## 4: 0 0 0 6 G 6
## 5: 0 0 7 0 C 7
## 6: 0 1 2 0 T 1
## alt_allele_KAT_11 alt_count_KAT_11 InDel_KAT_11 ref_mean_quality_KAT_11
## 1: G 1 FALSE 38.20
## 2: T 8 FALSE 37.00
## 3: A 10 FALSE 37.43
## 4: A 0 FALSE 37.50
## 5: A 0 FALSE 37.00
## 6: C 2 FALSE 37.00
## alt_mean_quality_KAT_11 zygosity_KAT_11 site_counts_KAT_10 ref_base_KAT_10
## 1: 37 hete 6 A
## 2: 37 hete 28 C
## 3: 37 hete 28 G
## 4: NA hom_ref 19 G
## 5: NA hom_ref 22 C
## 6: 37 hete 7 T
## A_KAT_10 T_KAT_10 C_KAT_10 G_KAT_10 ref_allele_KAT_10 ref_count_KAT_10
## 1: 0 0 0 6 A 0
## 2: 0 0 28 0 C 28
## 3: 28 0 0 0 G 0
## 4: 0 0 0 19 G 19
## 5: 0 0 22 0 C 22
## 6: 0 0 7 0 T 0
## alt_allele_KAT_10 alt_count_KAT_10 InDel_KAT_10 ref_mean_quality_KAT_10
## 1: G 6 FALSE NA
## 2: A 0 FALSE 36.36
## 3: A 28 FALSE NA
## 4: A 0 FALSE 37.00
## 5: A 0 FALSE 37.29
## 6: C 7 FALSE NA
## alt_mean_quality_KAT_10 zygosity_KAT_10 site_counts_SAI_6n ref_base_SAI_6n
## 1: 37.00 hom_alt 14 A
## 2: NA hom_ref 19 C
## 3: 36.71 hom_alt 22 G
## 4: NA hom_ref 8 G
## 5: NA hom_ref 6 C
## 6: 37.00 hom_alt 6 T
## A_SAI_6n T_SAI_6n C_SAI_6n G_SAI_6n ref_allele_SAI_6n ref_count_SAI_6n
## 1: 14 0 0 0 A 14
## 2: 0 14 5 0 C 5
## 3: 22 0 0 0 G 0
## 4: 0 0 0 8 G 8
## 5: 6 6 12 6 C 12
## 6: 0 0 6 0 T 0
## alt_allele_SAI_6n alt_count_SAI_6n InDel_SAI_6n ref_mean_quality_SAI_6n
## 1: T 0 FALSE 36.14
## 2: T 14 FALSE 35.20
## 3: A 22 FALSE NA
## 4: A 0 FALSE 37.00
## 5: A 6 TRUE 33.00
## 6: C 6 FALSE NA
## alt_mean_quality_SAI_6n zygosity_SAI_6n site_counts_KAT_3n ref_base_KAT_3n
## 1: NA hom_ref 7 A
## 2: 37.43 hete 21 C
## 3: 35.36 hom_alt 15 G
## 4: NA hom_ref 26 G
## 5: 26.00 hete 14 C
## 6: 38.00 hom_alt 11 T
## A_KAT_3n T_KAT_3n C_KAT_3n G_KAT_3n ref_allele_KAT_3n ref_count_KAT_3n
## 1: 0 0 0 7 A 0
## 2: 0 21 0 0 C 0
## 3: 15 0 0 0 G 0
## 4: 0 0 26 0 G 0
## 5: 11 0 3 0 C 3
## 6: 0 7 4 0 T 7
## alt_allele_KAT_3n alt_count_KAT_3n InDel_KAT_3n ref_mean_quality_KAT_3n
## 1: G 7 FALSE NA
## 2: T 21 FALSE NA
## 3: A 15 FALSE NA
## 4: C 26 FALSE NA
## 5: A 11 FALSE 37.00
## 6: C 4 FALSE 37.43
## alt_mean_quality_KAT_3n zygosity_KAT_3n site_counts_KAT_12 ref_base_KAT_12
## 1: 37.86 hom_alt 17 A
## 2: 36.24 hom_alt 13 C
## 3: 37.00 hom_alt 15 G
## 4: 37.12 hom_alt 14 G
## 5: 35.91 hete 11 C
## 6: 37.00 hete 3 T
## A_KAT_12 T_KAT_12 C_KAT_12 G_KAT_12 ref_allele_KAT_12 ref_count_KAT_12
## 1: 5 0 0 12 A 5
## 2: 0 13 0 0 C 0
## 3: 9 0 0 6 G 6
## 4: 0 0 11 3 G 3
## 5: 7 7 18 7 C 18
## 6: 0 3 0 0 T 3
## alt_allele_KAT_12 alt_count_KAT_12 InDel_KAT_12 ref_mean_quality_KAT_12
## 1: G 12 FALSE 37.6
## 2: T 13 FALSE NA
## 3: A 9 FALSE 37.0
## 4: C 11 FALSE 37.0
## 5: A 7 TRUE 32.5
## 6: A 0 FALSE 38.0
## alt_mean_quality_KAT_12 zygosity_KAT_12 site_counts_SAI_11n ref_base_SAI_11n
## 1: 37.00 hete 21 A
## 2: 37.46 hom_alt 13 C
## 3: 35.67 hete 26 G
## 4: 35.91 hete 5 G
## 5: 26.00 hete 8 C
## 6: NA hom_ref 5 T
## A_SAI_11n T_SAI_11n C_SAI_11n G_SAI_11n ref_allele_SAI_11n ref_count_SAI_11n
## 1: 21 0 0 0 A 21
## 2: 0 13 0 0 C 0
## 3: 26 0 0 0 G 0
## 4: 0 0 0 5 G 5
## 5: 0 0 8 0 C 8
## 6: 0 1 4 0 T 1
## alt_allele_SAI_11n alt_count_SAI_11n InDel_SAI_11n ref_mean_quality_SAI_11n
## 1: T 0 FALSE 36.68
## 2: T 13 FALSE NA
## 3: A 26 FALSE NA
## 4: A 0 FALSE 37.00
## 5: A 0 FALSE 35.50
## 6: C 4 FALSE 37.00
## alt_mean_quality_SAI_11n zygosity_SAI_11n site_counts_KAT_7 ref_base_KAT_7
## 1: NA hom_ref 4 A
## 2: 35.00 hom_alt 18 C
## 3: 36.77 hom_alt 25 G
## 4: NA hom_ref 18 G
## 5: NA hom_ref 20 C
## 6: 37.75 hete 7 T
## A_KAT_7 T_KAT_7 C_KAT_7 G_KAT_7 ref_allele_KAT_7 ref_count_KAT_7
## 1: 0 0 0 4 A 0
## 2: 0 0 18 0 C 18
## 3: 25 0 0 0 G 0
## 4: 0 0 0 18 G 18
## 5: 0 0 20 0 C 20
## 6: 0 0 7 0 T 0
## alt_allele_KAT_7 alt_count_KAT_7 InDel_KAT_7 ref_mean_quality_KAT_7
## 1: G 4 FALSE NA
## 2: A 0 FALSE 37.19
## 3: A 25 FALSE NA
## 4: A 0 FALSE 37.17
## 5: A 0 FALSE 37.00
## 6: C 7 FALSE NA
## alt_mean_quality_KAT_7 zygosity_KAT_7 site_counts_SAI_10n ref_base_SAI_10n
## 1: 37.00 hom_alt 9 A
## 2: NA hom_ref 13 C
## 3: 37.24 hom_alt 17 G
## 4: NA hom_ref NA
## 5: NA hom_ref 16 C
## 6: 35.43 hom_alt 4 T
## A_SAI_10n T_SAI_10n C_SAI_10n G_SAI_10n ref_allele_SAI_10n ref_count_SAI_10n
## 1: 9 0 0 0 A 9
## 2: 0 13 0 0 C 0
## 3: 17 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 16 8 16 8 C 16
## 6: 0 0 4 0 T 0
## alt_allele_SAI_10n alt_count_SAI_10n InDel_SAI_10n ref_mean_quality_SAI_10n
## 1: T 0 FALSE 36.89
## 2: T 13 FALSE NA
## 3: A 17 FALSE NA
## 4: NA NA NA
## 5: A 16 TRUE 28.75
## 6: C 4 FALSE NA
## alt_mean_quality_SAI_10n zygosity_SAI_10n site_counts_SAI_7n ref_base_SAI_7n
## 1: NA hom_ref 11 A
## 2: 36.31 hom_alt 12 C
## 3: 36.53 hom_alt 11 G
## 4: NA 38 G
## 5: 35.40 hete 32 C
## 6: 37.00 hom_alt 10 T
## A_SAI_7n T_SAI_7n C_SAI_7n G_SAI_7n ref_allele_SAI_7n ref_count_SAI_7n
## 1: 11 0 0 0 A 11
## 2: 0 12 0 0 C 0
## 3: 11 0 0 0 G 0
## 4: 0 0 0 38 G 38
## 5: 0 0 32 0 C 32
## 6: 0 10 0 0 T 10
## alt_allele_SAI_7n alt_count_SAI_7n InDel_SAI_7n ref_mean_quality_SAI_7n
## 1: T 0 FALSE 37.27
## 2: T 12 FALSE NA
## 3: A 11 FALSE NA
## 4: A 0 FALSE 37.03
## 5: A 0 FALSE 37.47
## 6: A 0 FALSE 37.30
## alt_mean_quality_SAI_7n zygosity_SAI_7n site_counts_KAT_2n ref_base_KAT_2n
## 1: NA hom_ref 15 A
## 2: 37.00 hom_alt 13 C
## 3: 35.91 hom_alt 26 G
## 4: NA hom_ref NA
## 5: NA hom_ref 6 C
## 6: NA hom_ref 1 T
## A_KAT_2n T_KAT_2n C_KAT_2n G_KAT_2n ref_allele_KAT_2n ref_count_KAT_2n
## 1: 15 0 0 0 A 15
## 2: 0 13 0 0 C 0
## 3: 11 0 0 15 G 15
## 4: NA NA NA NA NA
## 5: 0 0 6 0 C 6
## 6: 0 1 0 0 T 1
## alt_allele_KAT_2n alt_count_KAT_2n InDel_KAT_2n ref_mean_quality_KAT_2n
## 1: T 0 FALSE 36.20
## 2: T 13 FALSE NA
## 3: A 11 FALSE 35.14
## 4: NA NA NA
## 5: A 0 FALSE 37.50
## 6: A 0 FALSE 37.00
## alt_mean_quality_KAT_2n zygosity_KAT_2n site_counts_KAT_1n ref_base_KAT_1n
## 1: NA hom_ref 23 A
## 2: 37.00 hom_alt 15 C
## 3: 37.27 hete 13 G
## 4: NA 17 G
## 5: NA hom_ref 11 C
## 6: NA hom_ref 2 T
## A_KAT_1n T_KAT_1n C_KAT_1n G_KAT_1n ref_allele_KAT_1n ref_count_KAT_1n
## 1: 0 0 0 23 A 0
## 2: 0 15 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: 0 0 14 3 G 3
## 5: 7 0 4 0 C 4
## 6: 0 2 0 0 T 2
## alt_allele_KAT_1n alt_count_KAT_1n InDel_KAT_1n ref_mean_quality_KAT_1n
## 1: G 23 FALSE NA
## 2: T 15 FALSE NA
## 3: A 13 FALSE NA
## 4: C 14 FALSE 37
## 5: A 7 FALSE 37
## 6: A 0 FALSE 31
## alt_mean_quality_KAT_1n zygosity_KAT_1n site_counts_SAI_1 ref_base_SAI_1
## 1: 36.61 hom_alt 31 A
## 2: 36.80 hom_alt 13 C
## 3: 37.23 hom_alt 13 G
## 4: 36.71 hete NA
## 5: 37.43 hete 8 C
## 6: NA hom_ref 7 T
## A_SAI_1 T_SAI_1 C_SAI_1 G_SAI_1 ref_allele_SAI_1 ref_count_SAI_1
## 1: 31 0 0 0 A 31
## 2: 0 13 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 8 0 0 0 C 0
## 6: 0 0 7 0 T 0
## alt_allele_SAI_1 alt_count_SAI_1 InDel_SAI_1 ref_mean_quality_SAI_1
## 1: T 0 FALSE 36.03
## 2: T 13 FALSE NA
## 3: A 13 FALSE NA
## 4: NA NA NA
## 5: A 8 FALSE NA
## 6: C 7 FALSE NA
## alt_mean_quality_SAI_1 zygosity_SAI_1 site_counts_SAI_8n ref_base_SAI_8n
## 1: NA hom_ref 12 A
## 2: 36.31 hom_alt 8 C
## 3: 35.62 hom_alt 16 G
## 4: NA 40 G
## 5: 37.00 hom_alt 28 C
## 6: 37.00 hom_alt 11 T
## A_SAI_8n T_SAI_8n C_SAI_8n G_SAI_8n ref_allele_SAI_8n ref_count_SAI_8n
## 1: 12 0 0 0 A 12
## 2: 0 8 0 0 C 0
## 3: 16 0 0 0 G 0
## 4: 0 0 0 40 G 40
## 5: 0 0 28 0 C 28
## 6: 0 6 5 0 T 6
## alt_allele_SAI_8n alt_count_SAI_8n InDel_SAI_8n ref_mean_quality_SAI_8n
## 1: T 0 FALSE 36.00
## 2: T 8 FALSE NA
## 3: A 16 FALSE NA
## 4: A 0 FALSE 36.72
## 5: A 0 FALSE 37.32
## 6: C 5 FALSE 37.00
## alt_mean_quality_SAI_8n zygosity_SAI_8n site_counts_SAI_2 ref_base_SAI_2
## 1: NA hom_ref 24 A
## 2: 37.38 hom_alt 19 C
## 3: 37.00 hom_alt 15 G
## 4: NA hom_ref 12 G
## 5: NA hom_ref 6 C
## 6: 37.60 hete 5 T
## A_SAI_2 T_SAI_2 C_SAI_2 G_SAI_2 ref_allele_SAI_2 ref_count_SAI_2
## 1: 12 0 0 12 A 12
## 2: 0 19 0 0 C 0
## 3: 15 0 0 0 G 0
## 4: 0 0 0 12 G 12
## 5: 0 0 6 0 C 6
## 6: 0 0 5 0 T 0
## alt_allele_SAI_2 alt_count_SAI_2 InDel_SAI_2 ref_mean_quality_SAI_2
## 1: G 12 FALSE 37.5
## 2: T 19 FALSE NA
## 3: A 15 FALSE NA
## 4: A 0 FALSE 36.0
## 5: A 0 FALSE 37.0
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_2 zygosity_SAI_2 site_counts_SAI_3 ref_base_SAI_3
## 1: 35.00 hete 15 A
## 2: 36.53 hom_alt 8 C
## 3: 36.40 hom_alt 9 G
## 4: NA hom_ref NA
## 5: NA hom_ref 20 C
## 6: 38.80 hom_alt 5 T
## A_SAI_3 T_SAI_3 C_SAI_3 G_SAI_3 ref_allele_SAI_3 ref_count_SAI_3
## 1: 15 0 0 0 A 15
## 2: 0 8 0 0 C 0
## 3: 9 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 20 0 0 0 C 0
## 6: 0 0 5 0 T 0
## alt_allele_SAI_3 alt_count_SAI_3 InDel_SAI_3 ref_mean_quality_SAI_3
## 1: T 0 FALSE 37
## 2: T 8 FALSE NA
## 3: A 9 FALSE NA
## 4: NA NA NA
## 5: A 20 FALSE NA
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_3 zygosity_SAI_3 site_counts_SAI_9n ref_base_SAI_9n
## 1: NA hom_ref 5 A
## 2: 34.88 hom_alt 10 C
## 3: 37.33 hom_alt 11 G
## 4: NA 9 G
## 5: 37.00 hom_alt 9 C
## 6: 34.20 hom_alt 4 T
## A_SAI_9n T_SAI_9n C_SAI_9n G_SAI_9n ref_allele_SAI_9n ref_count_SAI_9n
## 1: 0 0 0 5 A 0
## 2: 0 10 0 0 C 0
## 3: 11 0 0 0 G 0
## 4: 0 0 0 9 G 9
## 5: 0 0 9 0 C 9
## 6: 0 2 2 0 T 2
## alt_allele_SAI_9n alt_count_SAI_9n InDel_SAI_9n ref_mean_quality_SAI_9n
## 1: G 5 FALSE NA
## 2: T 10 FALSE NA
## 3: A 11 FALSE NA
## 4: A 0 FALSE 37
## 5: A 0 FALSE 37
## 6: C 2 FALSE 31
## alt_mean_quality_SAI_9n zygosity_SAI_9n site_counts_SAI_4 ref_base_SAI_4
## 1: 37 hom_alt 16 A
## 2: 37 hom_alt 12 C
## 3: 37 hom_alt 9 G
## 4: NA hom_ref NA
## 5: NA hom_ref 13 C
## 6: 37 hete 5 T
## A_SAI_4 T_SAI_4 C_SAI_4 G_SAI_4 ref_allele_SAI_4 ref_count_SAI_4
## 1: 16 0 0 0 A 16
## 2: 0 12 0 0 C 0
## 3: 9 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 13 0 0 0 C 0
## 6: 0 0 5 0 T 0
## alt_allele_SAI_4 alt_count_SAI_4 InDel_SAI_4 ref_mean_quality_SAI_4
## 1: T 0 FALSE 37.12
## 2: T 12 FALSE NA
## 3: A 9 FALSE NA
## 4: NA NA NA
## 5: A 13 FALSE NA
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_4 zygosity_SAI_4 site_counts_KAT_9 ref_base_KAT_9
## 1: NA hom_ref 10 A
## 2: 37.25 hom_alt 9 C
## 3: 37.33 hom_alt 9 G
## 4: NA 36 G
## 5: 37.23 hom_alt 9 C
## 6: 37.00 hom_alt 12 T
## A_KAT_9 T_KAT_9 C_KAT_9 G_KAT_9 ref_allele_KAT_9 ref_count_KAT_9
## 1: 10 0 0 0 A 10
## 2: 0 9 0 0 C 0
## 3: 6 0 0 3 G 3
## 4: 0 0 36 0 G 0
## 5: 4 4 13 4 C 13
## 6: 0 4 8 0 T 4
## alt_allele_KAT_9 alt_count_KAT_9 InDel_KAT_9 ref_mean_quality_KAT_9
## 1: T 0 FALSE 36.00
## 2: T 9 FALSE NA
## 3: A 6 FALSE 37.00
## 4: C 36 FALSE NA
## 5: A 4 TRUE 31.86
## 6: C 8 FALSE 37.75
## alt_mean_quality_KAT_9 zygosity_KAT_9 site_counts_KAT_8 ref_base_KAT_8
## 1: NA hom_ref 7 A
## 2: 34.56 hom_alt 15 C
## 3: 37.00 hete 15 G
## 4: 36.67 hom_alt 15 G
## 5: NA hete 4 C
## 6: 38.12 hete 7 T
## A_KAT_8 T_KAT_8 C_KAT_8 G_KAT_8 ref_allele_KAT_8 ref_count_KAT_8
## 1: 0 0 0 7 A 0
## 2: 0 15 0 0 C 0
## 3: 15 0 0 0 G 0
## 4: 0 0 14 1 G 1
## 5: 0 0 4 0 C 4
## 6: 0 6 1 0 T 6
## alt_allele_KAT_8 alt_count_KAT_8 InDel_KAT_8 ref_mean_quality_KAT_8
## 1: G 7 FALSE NA
## 2: T 15 FALSE NA
## 3: A 15 FALSE NA
## 4: C 14 FALSE 37.00
## 5: A 0 FALSE 36.75
## 6: C 1 FALSE 37.50
## alt_mean_quality_KAT_8 zygosity_KAT_8 site_counts_SAI_5 ref_base_SAI_5
## 1: 35.00 hom_alt 8 A
## 2: 36.60 hom_alt 8 C
## 3: 36.20 hom_alt 13 G
## 4: 36.57 hete 10 G
## 5: NA hom_ref 13 C
## 6: 20.00 hete 1 T
## A_SAI_5 T_SAI_5 C_SAI_5 G_SAI_5 ref_allele_SAI_5 ref_count_SAI_5
## 1: 8 0 0 0 A 8
## 2: 0 8 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: 0 0 0 10 G 10
## 5: 1 1 14 1 C 14
## 6: 0 0 1 0 T 0
## alt_allele_SAI_5 alt_count_SAI_5 InDel_SAI_5 ref_mean_quality_SAI_5
## 1: T 0 FALSE 37.38
## 2: T 8 FALSE NA
## 3: A 13 FALSE NA
## 4: A 0 FALSE 34.44
## 5: A 1 TRUE 36.38
## 6: C 1 FALSE NA
## alt_mean_quality_SAI_5 zygosity_SAI_5 site_counts_SAI_17 ref_base_SAI_17
## 1: NA hom_ref 9 A
## 2: 37.00 hom_alt 12 C
## 3: 37.23 hom_alt 16 G
## 4: NA hom_ref 14 G
## 5: 37.00 hete 9 C
## 6: 37.00 hom_alt 2 T
## A_SAI_17 T_SAI_17 C_SAI_17 G_SAI_17 ref_allele_SAI_17 ref_count_SAI_17
## 1: 9 0 0 0 A 9
## 2: 0 0 12 0 C 12
## 3: 16 0 0 0 G 0
## 4: 0 0 0 14 G 14
## 5: 9 9 18 9 C 18
## 6: 0 0 2 0 T 0
## alt_allele_SAI_17 alt_count_SAI_17 InDel_SAI_17 ref_mean_quality_SAI_17
## 1: T 0 FALSE 37.00
## 2: A 0 FALSE 37.00
## 3: A 16 FALSE NA
## 4: A 0 FALSE 37.43
## 5: A 9 TRUE 27.33
## 6: C 2 FALSE NA
## alt_mean_quality_SAI_17 zygosity_SAI_17 site_counts_SAI_16 ref_base_SAI_16
## 1: NA hom_ref 15 A
## 2: NA hom_ref 17 C
## 3: 36.25 hom_alt 11 G
## 4: NA hom_ref 1 G
## 5: 26.00 hete 15 C
## 6: 38.50 hom_alt 9 T
## A_SAI_16 T_SAI_16 C_SAI_16 G_SAI_16 ref_allele_SAI_16 ref_count_SAI_16
## 1: 15 0 0 0 A 15
## 2: 0 17 0 0 C 0
## 3: 11 0 0 0 G 0
## 4: 0 0 0 1 G 1
## 5: 15 0 0 0 C 0
## 6: 0 0 9 0 T 0
## alt_allele_SAI_16 alt_count_SAI_16 InDel_SAI_16 ref_mean_quality_SAI_16
## 1: T 0 FALSE 37
## 2: T 17 FALSE NA
## 3: A 11 FALSE NA
## 4: A 0 FALSE 40
## 5: A 15 FALSE NA
## 6: C 9 FALSE NA
## alt_mean_quality_SAI_16 zygosity_SAI_16 site_counts_SAI_14 ref_base_SAI_14
## 1: NA hom_ref 11 A
## 2: 37.35 hom_alt 7 C
## 3: 37.55 hom_alt 27 G
## 4: NA hom_ref 8 G
## 5: 37.00 hom_alt 10 C
## 6: 36.33 hom_alt 3 T
## A_SAI_14 T_SAI_14 C_SAI_14 G_SAI_14 ref_allele_SAI_14 ref_count_SAI_14
## 1: 0 0 0 11 A 0
## 2: 0 7 0 0 C 0
## 3: 27 0 0 0 G 0
## 4: 0 0 0 8 G 8
## 5: 10 0 0 0 C 0
## 6: 0 3 0 0 T 3
## alt_allele_SAI_14 alt_count_SAI_14 InDel_SAI_14 ref_mean_quality_SAI_14
## 1: G 11 FALSE NA
## 2: T 7 FALSE NA
## 3: A 27 FALSE NA
## 4: A 0 FALSE 35.5
## 5: A 10 FALSE NA
## 6: A 0 FALSE 38.0
## alt_mean_quality_SAI_14 zygosity_SAI_14 site_counts_SAI_15 ref_base_SAI_15
## 1: 35.09 hom_alt 26 A
## 2: 37.43 hom_alt 21 C
## 3: 36.56 hom_alt 36 G
## 4: NA hom_ref 29 G
## 5: 37.30 hom_alt 17 C
## 6: NA hom_ref 7 T
## A_SAI_15 T_SAI_15 C_SAI_15 G_SAI_15 ref_allele_SAI_15 ref_count_SAI_15
## 1: 26 0 0 0 A 26
## 2: 0 21 0 0 C 0
## 3: 20 0 0 16 G 16
## 4: 0 0 0 29 G 29
## 5: 14 14 31 14 C 31
## 6: 0 0 7 0 T 0
## alt_allele_SAI_15 alt_count_SAI_15 InDel_SAI_15 ref_mean_quality_SAI_15
## 1: T 0 FALSE 37.23
## 2: T 21 FALSE NA
## 3: A 20 FALSE 37.00
## 4: A 0 FALSE 36.50
## 5: A 14 TRUE 32.14
## 6: C 7 FALSE NA
## alt_mean_quality_SAI_15 zygosity_SAI_15 site_counts_KAT_6n ref_base_KAT_6n
## 1: NA hom_ref 5 A
## 2: 35.38 hom_alt 15 C
## 3: 36.53 hete 14 G
## 4: NA hom_ref 20 G
## 5: 26.00 hete 10 C
## 6: 37.86 hom_alt 6 T
## A_KAT_6n T_KAT_6n C_KAT_6n G_KAT_6n ref_allele_KAT_6n ref_count_KAT_6n
## 1: 0 0 0 5 A 0
## 2: 0 15 0 0 C 0
## 3: 14 0 0 0 G 0
## 4: 0 0 20 0 G 0
## 5: 10 0 0 0 C 0
## 6: 0 1 5 0 T 1
## alt_allele_KAT_6n alt_count_KAT_6n InDel_KAT_6n ref_mean_quality_KAT_6n
## 1: G 5 FALSE NA
## 2: T 15 FALSE NA
## 3: A 14 FALSE NA
## 4: C 20 FALSE NA
## 5: A 10 FALSE NA
## 6: C 5 FALSE 37
## alt_mean_quality_KAT_6n zygosity_KAT_6n site_counts_SAI_12 ref_base_SAI_12
## 1: 34.00 hom_alt 6 A
## 2: 37.40 hom_alt 9 C
## 3: 35.43 hom_alt 13 G
## 4: 36.40 hom_alt 15 G
## 5: 34.60 hom_alt 9 C
## 6: 38.20 hete 11 T
## A_SAI_12 T_SAI_12 C_SAI_12 G_SAI_12 ref_allele_SAI_12 ref_count_SAI_12
## 1: 0 0 0 6 A 0
## 2: 0 9 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: 0 0 0 15 G 15
## 5: 0 0 9 0 C 9
## 6: 0 0 11 0 T 0
## alt_allele_SAI_12 alt_count_SAI_12 InDel_SAI_12 ref_mean_quality_SAI_12
## 1: G 6 FALSE NA
## 2: T 9 FALSE NA
## 3: A 13 FALSE NA
## 4: A 0 FALSE 37.40
## 5: A 0 FALSE 37.33
## 6: C 11 FALSE NA
## alt_mean_quality_SAI_12 zygosity_SAI_12 site_counts_SAI_13 ref_base_SAI_13
## 1: 37.00 hom_alt 8 A
## 2: 37.33 hom_alt 16 C
## 3: 36.54 hom_alt 12 G
## 4: NA hom_ref 4 G
## 5: NA hom_ref 12 C
## 6: 37.73 hom_alt 4 T
## A_SAI_13 T_SAI_13 C_SAI_13 G_SAI_13 ref_allele_SAI_13 ref_count_SAI_13
## 1: 8 0 0 0 A 8
## 2: 0 16 0 0 C 0
## 3: 12 0 0 0 G 0
## 4: 0 0 0 4 G 4
## 5: 10 10 22 10 C 22
## 6: 0 0 4 0 T 0
## alt_allele_SAI_13 alt_count_SAI_13 InDel_SAI_13 ref_mean_quality_SAI_13
## 1: T 0 FALSE 37.75
## 2: T 16 FALSE NA
## 3: A 12 FALSE NA
## 4: A 0 FALSE 37.00
## 5: A 10 TRUE 31.00
## 6: C 4 FALSE NA
## alt_mean_quality_SAI_13 zygosity_SAI_13 chr_ref bp_ref
## 1: NA hom_ref 1.101 110197
## 2: 37.19 hom_alt 1.101 116980
## 3: 36.00 hom_alt 1.101 118670
## 4: NA hom_ref 1.101 147467
## 5: 25.00 hete 1.101 171602
## 6: 38.50 hom_alt 1.101 210793
We can explore the data sets now and see if the SNPs with mismatching genotypes have lower read counts or low base quality.
First we can check the SNPs with indels
# Make sure merged_data2 is a data.table
setDT(merged_data2)
# Select columns
column_names <- c("snp_id", grep("^InDel_", names(merged_data2), value = TRUE))
# Create a new data table with the selected columns
indels_dt <- merged_data2[, ..column_names]
# Filter rows with any TRUE values
indels_dt_true <- indels_dt[rowSums(indels_dt[, -1, with = FALSE] == TRUE, na.rm = TRUE) > 0, ]
# Count TRUE values in each column
indels_count_true <- sapply(indels_dt_true[, -1, with = FALSE], function(col) sum(col == TRUE, na.rm = TRUE))
# Print the counts
print(indels_count_true)
## InDel_KAT_5n InDel_SAI_18 InDel_KAT_4n InDel_KAT_11 InDel_KAT_10
## 903 895 978 945 911
## InDel_SAI_6n InDel_KAT_3n InDel_KAT_12 InDel_SAI_11n InDel_KAT_7
## 842 921 884 940 910
## InDel_SAI_10n InDel_SAI_7n InDel_KAT_2n InDel_KAT_1n InDel_SAI_1
## 916 884 930 909 929
## InDel_SAI_8n InDel_SAI_2 InDel_SAI_3 InDel_SAI_9n InDel_SAI_4
## 968 908 859 866 860
## InDel_KAT_9 InDel_KAT_8 InDel_SAI_5 InDel_SAI_17 InDel_SAI_16
## 892 918 846 855 899
## InDel_SAI_14 InDel_SAI_15 InDel_KAT_6n InDel_SAI_12 InDel_SAI_13
## 876 997 952 907 880
We have around 900 SNPs per sample with indels, so the WGS genotypes at these sites may be wrong.
We can also count how many SNPs have an indel in at least one sample. We create a new column that flags whether any sample has a TRUE value.
# Create a new column "any_true"
indels_dt[, any_true := rowSums(.SD == TRUE, na.rm = TRUE) > 0, .SDcols = patterns("InDel")]
# Select rows where "any_true" is TRUE
true_rows <- indels_dt[any_true == TRUE]
# Count unique "snp_id" where "any_true" is TRUE
num_true_snp_id <- uniqueN(true_rows$snp_id)
# Print the number of unique "snp_id" with any TRUE
print(num_true_snp_id)
## [1] 4814
Across all 30 samples, we see 4,814 sites with an indel (insertion or deletion) in at least one sample. Next, how many times do we see indels per SNP?
Theme for plotting
# import plotting theme
source(
here(
"scripts",
"analysis",
"my_theme2.R" # choose my_theme.R (Roboto Condensed) or my_theme2.R (default font)
)
)
Create histogram
# Create a new column "num_true"
indels_dt[, num_true := rowSums(.SD == TRUE, na.rm = TRUE), .SDcols = patterns("InDel")]
# Count number of snp_id for each number of TRUE
true_counts <- indels_dt[, .(count = .N), by = num_true]
# Plot histogram with the indel counts
ggplot(true_counts, aes(x = num_true, y = count)) +
geom_bar(
stat = "identity",
fill = "#ddfacc",
color = "#f5c5d8",
width = 0.8
) +
geom_text(aes(label = scales::comma(count)), size = 2) +
scale_y_log10(labels = scales::comma) +
labs(x = "Number of times the SNP site has an indel",
y = "Number of SNPs (log10)",
title = "How many times a SNP site has indels in 30 cram files") +
coord_flip() +
my_theme()
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"indels_per_30_cram.pdf"
),
height = 8,
width = 6,
dpi = 300
)
Most of the sites have no indels (170,546), but 4,814 do. Of these, 1,621 show an indel in only one sample, 685 in two samples, and so on. We see that 35 sites have indels in all the samples.
We can repeat the same calculations using only the 18 samples for which we have both chip and WGS data.
# Get names of the columns that do not end with "n"
cols_to_keep <- names(indels_dt)[!grepl("n$", names(indels_dt))]
# Subset the dataframe to keep only the desired columns
indels_dt <- indels_dt[, ..cols_to_keep]
# Create a new column "any_true"
indels_dt[, any_true := rowSums(.SD == TRUE, na.rm = TRUE) > 0, .SDcols = patterns("InDel")]
# Select rows where "any_true" is TRUE
true_rows <- indels_dt[any_true == TRUE]
# Count unique "snp_id" where "any_true" is TRUE
num_true_snp_id <- uniqueN(true_rows$snp_id)
# Print the number of unique "snp_id" with any TRUE
print(num_true_snp_id)
## [1] 4020
Across the 18 samples, we see 4,020 sites with an indel (insertion or deletion), down from 4,814 when we used the 30 samples. Next, how many times do we see indels per SNP in the 18 samples?
Create histogram
# Create a new column "num_true"
indels_dt[, num_true := rowSums(.SD == TRUE, na.rm = TRUE), .SDcols = patterns("InDel")]
# Count number of snp_id for each number of TRUE
true_counts <- indels_dt[, .(count = .N), by = num_true]
# Plot histogram with the indel counts
ggplot(true_counts, aes(x = num_true, y = count)) +
geom_bar(
stat = "identity",
fill = "#ddfacc",
color = "#f5c5d8",
width = 0.8
) +
geom_text(aes(label = scales::comma(count)), size = 2) +
scale_y_log10(labels = scales::comma) +
labs(x = "Number of times the SNP site has an indel",
y = "Number of SNPs (log10)",
title = "How many times a SNP site has indels in 18 cram files") +
coord_flip() +
my_theme()
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"indels_per_18_cram.pdf"
),
height = 8,
width = 6,
dpi = 300
)
We still see sites with indels in multiple samples, although fewer than with the 30 samples. This might explain some of the mismatches when we compare genotype calls made with different numbers of samples.
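The drop from 4,814 to 4,020 sites means that 794 sites carry an indel only in the 12 samples that are not part of the 18-sample set. A minimal sketch of how such sites could be listed is shown here, using base R and a small hypothetical table (the sample columns and values are made up for illustration; only the `InDel_` prefix and the trailing "n" convention come from the data above).

```r
# Toy indel flags: rows are SNPs, columns are samples (hypothetical data).
# Column names ending in "n" mimic the samples absent from the 18-sample set.
indels <- data.frame(
  snp_id       = c("snp1", "snp2", "snp3", "snp4"),
  InDel_KAT_1n = c(TRUE,  FALSE, FALSE, TRUE),
  InDel_KAT_7  = c(FALSE, FALSE, TRUE,  TRUE),
  InDel_SAI_1  = c(FALSE, FALSE, FALSE, FALSE)
)

# Indel columns for the full set and for the reduced set
flag_cols_30 <- grep("^InDel_", names(indels), value = TRUE)
flag_cols_18 <- flag_cols_30[!grepl("n$", flag_cols_30)]

# SNPs with an indel in at least one sample of each set
sites_30 <- indels$snp_id[rowSums(indels[flag_cols_30], na.rm = TRUE) > 0]
sites_18 <- indels$snp_id[rowSums(indels[flag_cols_18], na.rm = TRUE) > 0]

# Sites flagged only because of the extra samples
only_in_extra <- setdiff(sites_30, sites_18)
print(only_in_extra)
```

On the real data, the same `setdiff()` on the two sets of SNP ids would return the sites whose indel evidence comes exclusively from the extra samples.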
Next, we can choose one comparison within WGS and one chip versus WGS comparison, and see if there is any correlation between allele read depth and mismatches.
WGS: “xy” - genotype calls with 18 versus 30 samples; “wy” - genotype calls with 18 versus 800 samples
Chip x WGS: “ay” - WGS and chip calls with 18 samples; “bx” - WGS call with 30 samples and chip call with 95 samples
Because of limited time, I will compare the WGS (18 vs. 30 samples in the genotype call), chip (18 vs. 95), then WGS vs. chip (18 samples).
First we need to get the SNP ids with 2 or more mismatches
Find those with zero mismatches
# Filter the dataframe for Zigo_mismatch == 0
no_mismatches_xy <- summary_xy[summary_xy$Zigo_mismatch == 0,]
# Create a vector with SNP_id
SNPs_0_mismatches_xy <- no_mismatches_xy$SNP_id
# Print the number of SNPs in the vector
length(SNPs_0_mismatches_xy)
## [1] 153255
Find those with 2 or more mismatches
# Filter the dataframe for Zigo_mismatch >= 2
filtered_xy <- summary_xy[summary_xy$Zigo_mismatch >= 2,]
# Create a vector with SNP_id
SNPs_2_mismatches_xy <- filtered_xy$SNP_id
# Print the number of SNPs in the vector
length(SNPs_2_mismatches_xy)
## [1] 6404
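One way to connect the indel flags to these two groups is to compare the fraction of SNPs carrying an indel in each group. The sketch below is only an illustration under stated assumptions: the table, the `any_indel` column, and all values are hypothetical toy data, and Fisher's exact test is one possible choice of test, not necessarily the one used in this analysis.

```r
# Hypothetical toy table: one row per SNP (not the real data)
snps <- data.frame(
  snp_id        = paste0("snp", 1:10),
  Zigo_mismatch = c(0, 0, 0, 0, 0, 0, 2, 3, 2, 5),
  any_indel     = c(FALSE, FALSE, FALSE, TRUE, FALSE,
                    FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Keep the two groups of interest: 0 mismatches and 2 or more mismatches
snps <- snps[snps$Zigo_mismatch == 0 | snps$Zigo_mismatch >= 2, ]
snps$group <- ifelse(snps$Zigo_mismatch >= 2, "two_or_more", "zero")

# Contingency table of group by indel status
tab <- table(group = snps$group, any_indel = snps$any_indel)
print(tab)

# Fraction of SNPs with an indel in each group
indel_fraction <- tapply(snps$any_indel, snps$group, mean)
print(indel_fraction)

# Fisher's exact test for association between group and indel status
ft <- fisher.test(tab)
print(ft$p.value)
```

With the real data, `any_indel` could come from the `any_true` column computed earlier, joined to `summary_xy` by SNP id.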
Now we can check the read counts of these two sets of SNPs in our data. We first need to select only the 18 samples.
# Identify columns that end with "n"
cols_to_remove <- grep("n$", names(merged_data2))
# Remove those columns
merged_data3 <- merged_data2[, -cols_to_remove, with = FALSE]
# Print the updated data table
head(merged_data3)
## id snp_id site_counts_SAI_18 ref_base_SAI_18 A_SAI_18
## 1: 1.101_110197 AX-583079274 12 A 12
## 2: 1.101_116980 AX-583077250 13 C 0
## 3: 1.101_118670 AX-583079283 13 G 3
## 4: 1.101_147467 AX-583079310 15 G 0
## 5: 1.101_171602 AX-583077312 15 C 15
## 6: 1.101_210793 AX-583077325 1 T 0
## T_SAI_18 C_SAI_18 G_SAI_18 ref_allele_SAI_18 ref_count_SAI_18
## 1: 0 0 0 A 12
## 2: 13 0 0 C 0
## 3: 0 0 10 G 10
## 4: 0 0 15 G 15
## 5: 7 14 7 C 14
## 6: 0 1 0 T 0
## alt_allele_SAI_18 alt_count_SAI_18 InDel_SAI_18 ref_mean_quality_SAI_18
## 1: T 0 FALSE 37.75
## 2: T 13 FALSE NA
## 3: A 3 FALSE 37.60
## 4: A 0 FALSE 37.21
## 5: A 15 TRUE 28.75
## 6: C 1 FALSE NA
## alt_mean_quality_SAI_18 zygosity_SAI_18 site_counts_KAT_11 ref_base_KAT_11
## 1: NA hom_ref 11 A
## 2: 37.23 hom_alt 16 C
## 3: 37.00 hete 17 G
## 4: NA hom_ref 6 G
## 5: 29.67 hete 7 C
## 6: 37.00 hom_alt 3 T
## A_KAT_11 T_KAT_11 C_KAT_11 G_KAT_11 ref_allele_KAT_11 ref_count_KAT_11
## 1: 10 0 0 1 A 10
## 2: 0 8 8 0 C 8
## 3: 10 0 0 7 G 7
## 4: 0 0 0 6 G 6
## 5: 0 0 7 0 C 7
## 6: 0 1 2 0 T 1
## alt_allele_KAT_11 alt_count_KAT_11 InDel_KAT_11 ref_mean_quality_KAT_11
## 1: G 1 FALSE 38.20
## 2: T 8 FALSE 37.00
## 3: A 10 FALSE 37.43
## 4: A 0 FALSE 37.50
## 5: A 0 FALSE 37.00
## 6: C 2 FALSE 37.00
## alt_mean_quality_KAT_11 zygosity_KAT_11 site_counts_KAT_10 ref_base_KAT_10
## 1: 37 hete 6 A
## 2: 37 hete 28 C
## 3: 37 hete 28 G
## 4: NA hom_ref 19 G
## 5: NA hom_ref 22 C
## 6: 37 hete 7 T
## A_KAT_10 T_KAT_10 C_KAT_10 G_KAT_10 ref_allele_KAT_10 ref_count_KAT_10
## 1: 0 0 0 6 A 0
## 2: 0 0 28 0 C 28
## 3: 28 0 0 0 G 0
## 4: 0 0 0 19 G 19
## 5: 0 0 22 0 C 22
## 6: 0 0 7 0 T 0
## alt_allele_KAT_10 alt_count_KAT_10 InDel_KAT_10 ref_mean_quality_KAT_10
## 1: G 6 FALSE NA
## 2: A 0 FALSE 36.36
## 3: A 28 FALSE NA
## 4: A 0 FALSE 37.00
## 5: A 0 FALSE 37.29
## 6: C 7 FALSE NA
## alt_mean_quality_KAT_10 zygosity_KAT_10 site_counts_KAT_12 ref_base_KAT_12
## 1: 37.00 hom_alt 17 A
## 2: NA hom_ref 13 C
## 3: 36.71 hom_alt 15 G
## 4: NA hom_ref 14 G
## 5: NA hom_ref 11 C
## 6: 37.00 hom_alt 3 T
## A_KAT_12 T_KAT_12 C_KAT_12 G_KAT_12 ref_allele_KAT_12 ref_count_KAT_12
## 1: 5 0 0 12 A 5
## 2: 0 13 0 0 C 0
## 3: 9 0 0 6 G 6
## 4: 0 0 11 3 G 3
## 5: 7 7 18 7 C 18
## 6: 0 3 0 0 T 3
## alt_allele_KAT_12 alt_count_KAT_12 InDel_KAT_12 ref_mean_quality_KAT_12
## 1: G 12 FALSE 37.6
## 2: T 13 FALSE NA
## 3: A 9 FALSE 37.0
## 4: C 11 FALSE 37.0
## 5: A 7 TRUE 32.5
## 6: A 0 FALSE 38.0
## alt_mean_quality_KAT_12 zygosity_KAT_12 site_counts_KAT_7 ref_base_KAT_7
## 1: 37.00 hete 4 A
## 2: 37.46 hom_alt 18 C
## 3: 35.67 hete 25 G
## 4: 35.91 hete 18 G
## 5: 26.00 hete 20 C
## 6: NA hom_ref 7 T
## A_KAT_7 T_KAT_7 C_KAT_7 G_KAT_7 ref_allele_KAT_7 ref_count_KAT_7
## 1: 0 0 0 4 A 0
## 2: 0 0 18 0 C 18
## 3: 25 0 0 0 G 0
## 4: 0 0 0 18 G 18
## 5: 0 0 20 0 C 20
## 6: 0 0 7 0 T 0
## alt_allele_KAT_7 alt_count_KAT_7 InDel_KAT_7 ref_mean_quality_KAT_7
## 1: G 4 FALSE NA
## 2: A 0 FALSE 37.19
## 3: A 25 FALSE NA
## 4: A 0 FALSE 37.17
## 5: A 0 FALSE 37.00
## 6: C 7 FALSE NA
## alt_mean_quality_KAT_7 zygosity_KAT_7 site_counts_SAI_1 ref_base_SAI_1
## 1: 37.00 hom_alt 31 A
## 2: NA hom_ref 13 C
## 3: 37.24 hom_alt 13 G
## 4: NA hom_ref NA
## 5: NA hom_ref 8 C
## 6: 35.43 hom_alt 7 T
## A_SAI_1 T_SAI_1 C_SAI_1 G_SAI_1 ref_allele_SAI_1 ref_count_SAI_1
## 1: 31 0 0 0 A 31
## 2: 0 13 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 8 0 0 0 C 0
## 6: 0 0 7 0 T 0
## alt_allele_SAI_1 alt_count_SAI_1 InDel_SAI_1 ref_mean_quality_SAI_1
## 1: T 0 FALSE 36.03
## 2: T 13 FALSE NA
## 3: A 13 FALSE NA
## 4: NA NA NA
## 5: A 8 FALSE NA
## 6: C 7 FALSE NA
## alt_mean_quality_SAI_1 zygosity_SAI_1 site_counts_SAI_2 ref_base_SAI_2
## 1: NA hom_ref 24 A
## 2: 36.31 hom_alt 19 C
## 3: 35.62 hom_alt 15 G
## 4: NA 12 G
## 5: 37.00 hom_alt 6 C
## 6: 37.00 hom_alt 5 T
## A_SAI_2 T_SAI_2 C_SAI_2 G_SAI_2 ref_allele_SAI_2 ref_count_SAI_2
## 1: 12 0 0 12 A 12
## 2: 0 19 0 0 C 0
## 3: 15 0 0 0 G 0
## 4: 0 0 0 12 G 12
## 5: 0 0 6 0 C 6
## 6: 0 0 5 0 T 0
## alt_allele_SAI_2 alt_count_SAI_2 InDel_SAI_2 ref_mean_quality_SAI_2
## 1: G 12 FALSE 37.5
## 2: T 19 FALSE NA
## 3: A 15 FALSE NA
## 4: A 0 FALSE 36.0
## 5: A 0 FALSE 37.0
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_2 zygosity_SAI_2 site_counts_SAI_3 ref_base_SAI_3
## 1: 35.00 hete 15 A
## 2: 36.53 hom_alt 8 C
## 3: 36.40 hom_alt 9 G
## 4: NA hom_ref NA
## 5: NA hom_ref 20 C
## 6: 38.80 hom_alt 5 T
## A_SAI_3 T_SAI_3 C_SAI_3 G_SAI_3 ref_allele_SAI_3 ref_count_SAI_3
## 1: 15 0 0 0 A 15
## 2: 0 8 0 0 C 0
## 3: 9 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 20 0 0 0 C 0
## 6: 0 0 5 0 T 0
## alt_allele_SAI_3 alt_count_SAI_3 InDel_SAI_3 ref_mean_quality_SAI_3
## 1: T 0 FALSE 37
## 2: T 8 FALSE NA
## 3: A 9 FALSE NA
## 4: NA NA NA
## 5: A 20 FALSE NA
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_3 zygosity_SAI_3 site_counts_SAI_4 ref_base_SAI_4
## 1: NA hom_ref 16 A
## 2: 34.88 hom_alt 12 C
## 3: 37.33 hom_alt 9 G
## 4: NA NA
## 5: 37.00 hom_alt 13 C
## 6: 34.20 hom_alt 5 T
## A_SAI_4 T_SAI_4 C_SAI_4 G_SAI_4 ref_allele_SAI_4 ref_count_SAI_4
## 1: 16 0 0 0 A 16
## 2: 0 12 0 0 C 0
## 3: 9 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 13 0 0 0 C 0
## 6: 0 0 5 0 T 0
## alt_allele_SAI_4 alt_count_SAI_4 InDel_SAI_4 ref_mean_quality_SAI_4
## 1: T 0 FALSE 37.12
## 2: T 12 FALSE NA
## 3: A 9 FALSE NA
## 4: NA NA NA
## 5: A 13 FALSE NA
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_4 zygosity_SAI_4 site_counts_KAT_9 ref_base_KAT_9
## 1: NA hom_ref 10 A
## 2: 37.25 hom_alt 9 C
## 3: 37.33 hom_alt 9 G
## 4: NA 36 G
## 5: 37.23 hom_alt 9 C
## 6: 37.00 hom_alt 12 T
## A_KAT_9 T_KAT_9 C_KAT_9 G_KAT_9 ref_allele_KAT_9 ref_count_KAT_9
## 1: 10 0 0 0 A 10
## 2: 0 9 0 0 C 0
## 3: 6 0 0 3 G 3
## 4: 0 0 36 0 G 0
## 5: 4 4 13 4 C 13
## 6: 0 4 8 0 T 4
## alt_allele_KAT_9 alt_count_KAT_9 InDel_KAT_9 ref_mean_quality_KAT_9
## 1: T 0 FALSE 36.00
## 2: T 9 FALSE NA
## 3: A 6 FALSE 37.00
## 4: C 36 FALSE NA
## 5: A 4 TRUE 31.86
## 6: C 8 FALSE 37.75
## alt_mean_quality_KAT_9 zygosity_KAT_9 site_counts_KAT_8 ref_base_KAT_8
## 1: NA hom_ref 7 A
## 2: 34.56 hom_alt 15 C
## 3: 37.00 hete 15 G
## 4: 36.67 hom_alt 15 G
## 5: NA hete 4 C
## 6: 38.12 hete 7 T
## A_KAT_8 T_KAT_8 C_KAT_8 G_KAT_8 ref_allele_KAT_8 ref_count_KAT_8
## 1: 0 0 0 7 A 0
## 2: 0 15 0 0 C 0
## 3: 15 0 0 0 G 0
## 4: 0 0 14 1 G 1
## 5: 0 0 4 0 C 4
## 6: 0 6 1 0 T 6
## alt_allele_KAT_8 alt_count_KAT_8 InDel_KAT_8 ref_mean_quality_KAT_8
## 1: G 7 FALSE NA
## 2: T 15 FALSE NA
## 3: A 15 FALSE NA
## 4: C 14 FALSE 37.00
## 5: A 0 FALSE 36.75
## 6: C 1 FALSE 37.50
## alt_mean_quality_KAT_8 zygosity_KAT_8 site_counts_SAI_5 ref_base_SAI_5
## 1: 35.00 hom_alt 8 A
## 2: 36.60 hom_alt 8 C
## 3: 36.20 hom_alt 13 G
## 4: 36.57 hete 10 G
## 5: NA hom_ref 13 C
## 6: 20.00 hete 1 T
## A_SAI_5 T_SAI_5 C_SAI_5 G_SAI_5 ref_allele_SAI_5 ref_count_SAI_5
## 1: 8 0 0 0 A 8
## 2: 0 8 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: 0 0 0 10 G 10
## 5: 1 1 14 1 C 14
## 6: 0 0 1 0 T 0
## alt_allele_SAI_5 alt_count_SAI_5 InDel_SAI_5 ref_mean_quality_SAI_5
## 1: T 0 FALSE 37.38
## 2: T 8 FALSE NA
## 3: A 13 FALSE NA
## 4: A 0 FALSE 34.44
## 5: A 1 TRUE 36.38
## 6: C 1 FALSE NA
## alt_mean_quality_SAI_5 zygosity_SAI_5 site_counts_SAI_17 ref_base_SAI_17
## 1: NA hom_ref 9 A
## 2: 37.00 hom_alt 12 C
## 3: 37.23 hom_alt 16 G
## 4: NA hom_ref 14 G
## 5: 37.00 hete 9 C
## 6: 37.00 hom_alt 2 T
## A_SAI_17 T_SAI_17 C_SAI_17 G_SAI_17 ref_allele_SAI_17 ref_count_SAI_17
## 1: 9 0 0 0 A 9
## 2: 0 0 12 0 C 12
## 3: 16 0 0 0 G 0
## 4: 0 0 0 14 G 14
## 5: 9 9 18 9 C 18
## 6: 0 0 2 0 T 0
## alt_allele_SAI_17 alt_count_SAI_17 InDel_SAI_17 ref_mean_quality_SAI_17
## 1: T 0 FALSE 37.00
## 2: A 0 FALSE 37.00
## 3: A 16 FALSE NA
## 4: A 0 FALSE 37.43
## 5: A 9 TRUE 27.33
## 6: C 2 FALSE NA
## alt_mean_quality_SAI_17 zygosity_SAI_17 site_counts_SAI_16 ref_base_SAI_16
## 1: NA hom_ref 15 A
## 2: NA hom_ref 17 C
## 3: 36.25 hom_alt 11 G
## 4: NA hom_ref 1 G
## 5: 26.00 hete 15 C
## 6: 38.50 hom_alt 9 T
## A_SAI_16 T_SAI_16 C_SAI_16 G_SAI_16 ref_allele_SAI_16 ref_count_SAI_16
## 1: 15 0 0 0 A 15
## 2: 0 17 0 0 C 0
## 3: 11 0 0 0 G 0
## 4: 0 0 0 1 G 1
## 5: 15 0 0 0 C 0
## 6: 0 0 9 0 T 0
## alt_allele_SAI_16 alt_count_SAI_16 InDel_SAI_16 ref_mean_quality_SAI_16
## 1: T 0 FALSE 37
## 2: T 17 FALSE NA
## 3: A 11 FALSE NA
## 4: A 0 FALSE 40
## 5: A 15 FALSE NA
## 6: C 9 FALSE NA
## alt_mean_quality_SAI_16 zygosity_SAI_16 site_counts_SAI_14 ref_base_SAI_14
## 1: NA hom_ref 11 A
## 2: 37.35 hom_alt 7 C
## 3: 37.55 hom_alt 27 G
## 4: NA hom_ref 8 G
## 5: 37.00 hom_alt 10 C
## 6: 36.33 hom_alt 3 T
## A_SAI_14 T_SAI_14 C_SAI_14 G_SAI_14 ref_allele_SAI_14 ref_count_SAI_14
## 1: 0 0 0 11 A 0
## 2: 0 7 0 0 C 0
## 3: 27 0 0 0 G 0
## 4: 0 0 0 8 G 8
## 5: 10 0 0 0 C 0
## 6: 0 3 0 0 T 3
## alt_allele_SAI_14 alt_count_SAI_14 InDel_SAI_14 ref_mean_quality_SAI_14
## 1: G 11 FALSE NA
## 2: T 7 FALSE NA
## 3: A 27 FALSE NA
## 4: A 0 FALSE 35.5
## 5: A 10 FALSE NA
## 6: A 0 FALSE 38.0
## alt_mean_quality_SAI_14 zygosity_SAI_14 site_counts_SAI_15 ref_base_SAI_15
## 1: 35.09 hom_alt 26 A
## 2: 37.43 hom_alt 21 C
## 3: 36.56 hom_alt 36 G
## 4: NA hom_ref 29 G
## 5: 37.30 hom_alt 17 C
## 6: NA hom_ref 7 T
## A_SAI_15 T_SAI_15 C_SAI_15 G_SAI_15 ref_allele_SAI_15 ref_count_SAI_15
## 1: 26 0 0 0 A 26
## 2: 0 21 0 0 C 0
## 3: 20 0 0 16 G 16
## 4: 0 0 0 29 G 29
## 5: 14 14 31 14 C 31
## 6: 0 0 7 0 T 0
## alt_allele_SAI_15 alt_count_SAI_15 InDel_SAI_15 ref_mean_quality_SAI_15
## 1: T 0 FALSE 37.23
## 2: T 21 FALSE NA
## 3: A 20 FALSE 37.00
## 4: A 0 FALSE 36.50
## 5: A 14 TRUE 32.14
## 6: C 7 FALSE NA
## alt_mean_quality_SAI_15 zygosity_SAI_15 site_counts_SAI_12 ref_base_SAI_12
## 1: NA hom_ref 6 A
## 2: 35.38 hom_alt 9 C
## 3: 36.53 hete 13 G
## 4: NA hom_ref 15 G
## 5: 26.00 hete 9 C
## 6: 37.86 hom_alt 11 T
## A_SAI_12 T_SAI_12 C_SAI_12 G_SAI_12 ref_allele_SAI_12 ref_count_SAI_12
## 1: 0 0 0 6 A 0
## 2: 0 9 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: 0 0 0 15 G 15
## 5: 0 0 9 0 C 9
## 6: 0 0 11 0 T 0
## alt_allele_SAI_12 alt_count_SAI_12 InDel_SAI_12 ref_mean_quality_SAI_12
## 1: G 6 FALSE NA
## 2: T 9 FALSE NA
## 3: A 13 FALSE NA
## 4: A 0 FALSE 37.40
## 5: A 0 FALSE 37.33
## 6: C 11 FALSE NA
## alt_mean_quality_SAI_12 zygosity_SAI_12 site_counts_SAI_13 ref_base_SAI_13
## 1: 37.00 hom_alt 8 A
## 2: 37.33 hom_alt 16 C
## 3: 36.54 hom_alt 12 G
## 4: NA hom_ref 4 G
## 5: NA hom_ref 12 C
## 6: 37.73 hom_alt 4 T
## A_SAI_13 T_SAI_13 C_SAI_13 G_SAI_13 ref_allele_SAI_13 ref_count_SAI_13
## 1: 8 0 0 0 A 8
## 2: 0 16 0 0 C 0
## 3: 12 0 0 0 G 0
## 4: 0 0 0 4 G 4
## 5: 10 10 22 10 C 22
## 6: 0 0 4 0 T 0
## alt_allele_SAI_13 alt_count_SAI_13 InDel_SAI_13 ref_mean_quality_SAI_13
## 1: T 0 FALSE 37.75
## 2: T 16 FALSE NA
## 3: A 12 FALSE NA
## 4: A 0 FALSE 37.00
## 5: A 10 TRUE 31.00
## 6: C 4 FALSE NA
## alt_mean_quality_SAI_13 zygosity_SAI_13 chr_ref bp_ref
## 1: NA hom_ref 1.101 110197
## 2: 37.19 hom_alt 1.101 116980
## 3: 36.00 hom_alt 1.101 118670
## 4: NA hom_ref 1.101 147467
## 5: 25.00 hete 1.101 171602
## 6: 38.50 hom_alt 1.101 210793
Now we can get the mean read count for each allele across all the samples, or we could compare only two samples. Let’s subset the columns with counts and quality into a new data table.
# Define the patterns to look for
patterns <- c("^ref_count_", "^alt_count_", "^ref_mean_quality_", "^alt_mean_quality_", "^site_counts_")
# Create an empty vector to store the column indices
cols_to_keep <- integer(0)
# Loop over the patterns
for (pattern in patterns) {
# Find columns that start with the pattern and append their indices to cols_to_keep
cols_to_keep <- c(cols_to_keep, grep(pattern, names(merged_data3)))
}
# Append the index of the 'snp_id' column to cols_to_keep
cols_to_keep <- c(which(names(merged_data3) == "snp_id"), cols_to_keep)
# Subset the data table
merged_data4 <- merged_data3[, cols_to_keep, with = FALSE]
# Print the updated data table
head(merged_data4)
## snp_id ref_count_SAI_18 ref_count_KAT_11 ref_count_KAT_10
## 1: AX-583079274 12 10 0
## 2: AX-583077250 0 8 28
## 3: AX-583079283 10 7 0
## 4: AX-583079310 15 6 19
## 5: AX-583077312 14 7 22
## 6: AX-583077325 0 1 0
## ref_count_KAT_12 ref_count_KAT_7 ref_count_SAI_1 ref_count_SAI_2
## 1: 5 0 31 12
## 2: 0 18 0 0
## 3: 6 0 0 0
## 4: 3 18 NA 12
## 5: 18 20 0 6
## 6: 3 0 0 0
## ref_count_SAI_3 ref_count_SAI_4 ref_count_KAT_9 ref_count_KAT_8
## 1: 15 16 10 0
## 2: 0 0 0 0
## 3: 0 0 3 0
## 4: NA NA 0 1
## 5: 0 0 13 4
## 6: 0 0 4 6
## ref_count_SAI_5 ref_count_SAI_17 ref_count_SAI_16 ref_count_SAI_14
## 1: 8 9 15 0
## 2: 0 12 0 0
## 3: 0 0 0 0
## 4: 10 14 1 8
## 5: 14 18 0 0
## 6: 0 0 0 3
## ref_count_SAI_15 ref_count_SAI_12 ref_count_SAI_13 alt_count_SAI_18
## 1: 26 0 8 0
## 2: 0 0 0 13
## 3: 16 0 0 3
## 4: 29 15 4 0
## 5: 31 9 22 15
## 6: 0 0 0 1
## alt_count_KAT_11 alt_count_KAT_10 alt_count_KAT_12 alt_count_KAT_7
## 1: 1 6 12 4
## 2: 8 0 13 0
## 3: 10 28 9 25
## 4: 0 0 11 0
## 5: 0 0 7 0
## 6: 2 7 0 7
## alt_count_SAI_1 alt_count_SAI_2 alt_count_SAI_3 alt_count_SAI_4
## 1: 0 12 0 0
## 2: 13 19 8 12
## 3: 13 15 9 9
## 4: NA 0 NA NA
## 5: 8 0 20 13
## 6: 7 5 5 5
## alt_count_KAT_9 alt_count_KAT_8 alt_count_SAI_5 alt_count_SAI_17
## 1: 0 7 0 0
## 2: 9 15 8 0
## 3: 6 15 13 16
## 4: 36 14 0 0
## 5: 4 0 1 9
## 6: 8 1 1 2
## alt_count_SAI_16 alt_count_SAI_14 alt_count_SAI_15 alt_count_SAI_12
## 1: 0 11 0 6
## 2: 17 7 21 9
## 3: 11 27 20 13
## 4: 0 0 0 0
## 5: 15 10 14 0
## 6: 9 0 7 11
## alt_count_SAI_13 ref_mean_quality_SAI_18 ref_mean_quality_KAT_11
## 1: 0 37.75 38.20
## 2: 16 NA 37.00
## 3: 12 37.60 37.43
## 4: 0 37.21 37.50
## 5: 10 28.75 37.00
## 6: 4 NA 37.00
## ref_mean_quality_KAT_10 ref_mean_quality_KAT_12 ref_mean_quality_KAT_7
## 1: NA 37.6 NA
## 2: 36.36 NA 37.19
## 3: NA 37.0 NA
## 4: 37.00 37.0 37.17
## 5: 37.29 32.5 37.00
## 6: NA 38.0 NA
## ref_mean_quality_SAI_1 ref_mean_quality_SAI_2 ref_mean_quality_SAI_3
## 1: 36.03 37.5 37
## 2: NA NA NA
## 3: NA NA NA
## 4: NA 36.0 NA
## 5: NA 37.0 NA
## 6: NA NA NA
## ref_mean_quality_SAI_4 ref_mean_quality_KAT_9 ref_mean_quality_KAT_8
## 1: 37.12 36.00 NA
## 2: NA NA NA
## 3: NA 37.00 NA
## 4: NA NA 37.00
## 5: NA 31.86 36.75
## 6: NA 37.75 37.50
## ref_mean_quality_SAI_5 ref_mean_quality_SAI_17 ref_mean_quality_SAI_16
## 1: 37.38 37.00 37
## 2: NA 37.00 NA
## 3: NA NA NA
## 4: 34.44 37.43 40
## 5: 36.38 27.33 NA
## 6: NA NA NA
## ref_mean_quality_SAI_14 ref_mean_quality_SAI_15 ref_mean_quality_SAI_12
## 1: NA 37.23 NA
## 2: NA NA NA
## 3: NA 37.00 NA
## 4: 35.5 36.50 37.40
## 5: NA 32.14 37.33
## 6: 38.0 NA NA
## ref_mean_quality_SAI_13 alt_mean_quality_SAI_18 alt_mean_quality_KAT_11
## 1: 37.75 NA 37
## 2: NA 37.23 37
## 3: NA 37.00 37
## 4: 37.00 NA NA
## 5: 31.00 29.67 NA
## 6: NA 37.00 37
## alt_mean_quality_KAT_10 alt_mean_quality_KAT_12 alt_mean_quality_KAT_7
## 1: 37.00 37.00 37.00
## 2: NA 37.46 NA
## 3: 36.71 35.67 37.24
## 4: NA 35.91 NA
## 5: NA 26.00 NA
## 6: 37.00 NA 35.43
## alt_mean_quality_SAI_1 alt_mean_quality_SAI_2 alt_mean_quality_SAI_3
## 1: NA 35.00 NA
## 2: 36.31 36.53 34.88
## 3: 35.62 36.40 37.33
## 4: NA NA NA
## 5: 37.00 NA 37.00
## 6: 37.00 38.80 34.20
## alt_mean_quality_SAI_4 alt_mean_quality_KAT_9 alt_mean_quality_KAT_8
## 1: NA NA 35.00
## 2: 37.25 34.56 36.60
## 3: 37.33 37.00 36.20
## 4: NA 36.67 36.57
## 5: 37.23 NA NA
## 6: 37.00 38.12 20.00
## alt_mean_quality_SAI_5 alt_mean_quality_SAI_17 alt_mean_quality_SAI_16
## 1: NA NA NA
## 2: 37.00 NA 37.35
## 3: 37.23 36.25 37.55
## 4: NA NA NA
## 5: 37.00 26.00 37.00
## 6: 37.00 38.50 36.33
## alt_mean_quality_SAI_14 alt_mean_quality_SAI_15 alt_mean_quality_SAI_12
## 1: 35.09 NA 37.00
## 2: 37.43 35.38 37.33
## 3: 36.56 36.53 36.54
## 4: NA NA NA
## 5: 37.30 26.00 NA
## 6: NA 37.86 37.73
## alt_mean_quality_SAI_13 site_counts_SAI_18 site_counts_KAT_11
## 1: NA 12 11
## 2: 37.19 13 16
## 3: 36.00 13 17
## 4: NA 15 6
## 5: 25.00 15 7
## 6: 38.50 1 3
## site_counts_KAT_10 site_counts_KAT_12 site_counts_KAT_7 site_counts_SAI_1
## 1: 6 17 4 31
## 2: 28 13 18 13
## 3: 28 15 25 13
## 4: 19 14 18 NA
## 5: 22 11 20 8
## 6: 7 3 7 7
## site_counts_SAI_2 site_counts_SAI_3 site_counts_SAI_4 site_counts_KAT_9
## 1: 24 15 16 10
## 2: 19 8 12 9
## 3: 15 9 9 9
## 4: 12 NA NA 36
## 5: 6 20 13 9
## 6: 5 5 5 12
## site_counts_KAT_8 site_counts_SAI_5 site_counts_SAI_17 site_counts_SAI_16
## 1: 7 8 9 15
## 2: 15 8 12 17
## 3: 15 13 16 11
## 4: 15 10 14 1
## 5: 4 13 9 15
## 6: 7 1 2 9
## site_counts_SAI_14 site_counts_SAI_15 site_counts_SAI_12 site_counts_SAI_13
## 1: 11 26 6 8
## 2: 7 21 9 16
## 3: 27 36 13 12
## 4: 8 29 15 4
## 5: 10 17 9 12
## 6: 3 7 11 4
We can get the mean value of each metric per SNP across all 18 samples, ignoring the NAs.
# Define the prefixes
prefixes <- c("site_counts_", "ref_count_", "alt_count_", "ref_mean_quality_", "alt_mean_quality_")
# Create an empty data table for the results
snp_depth_qual <- data.table(snp_id = merged_data4$snp_id)
# Loop over the prefixes
for (prefix in prefixes) {
# Get the column indices for the current prefix
cols <- grep(prefix, names(merged_data4))
# Compute the row-wise means while ignoring NA values and round them to two decimal places
mean_values <- apply(merged_data4[, cols, with = FALSE], 1, function(x) round(mean(x, na.rm = TRUE), 2))
# Add the mean values to the results data table
snp_depth_qual[[paste0(prefix, "mean")]] <- mean_values
}
# Print the results
head(snp_depth_qual)
## snp_id site_counts_mean ref_count_mean alt_count_mean
## 1: AX-583079274 13.11 9.83 3.28
## 2: AX-583077250 14.11 3.67 10.44
## 3: AX-583079283 16.44 2.33 14.11
## 4: AX-583079310 14.40 10.33 4.07
## 5: AX-583077312 12.22 11.00 7.00
## 6: AX-583077325 5.50 0.94 4.56
## ref_mean_quality_mean alt_mean_quality_mean
## 1: 37.20 36.26
## 2: 36.89 36.63
## 3: 37.21 36.68
## 4: 36.94 36.38
## 5: 34.03 32.29
## 6: 37.65 36.09
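The row-wise means above are computed with `apply()`, which can be slow on tables with many rows. A possible alternative is base R's vectorized `rowMeans()`. The sketch below uses a small hypothetical data frame (sample names A and B and all values are made up) and anchors each prefix with `^` so that it only matches at the start of a column name.

```r
# Toy per-sample depth table (hypothetical values)
depth <- data.frame(
  snp_id        = c("snp1", "snp2", "snp3"),
  site_counts_A = c(12, 13, NA),
  site_counts_B = c(11, 16, 6),
  ref_count_A   = c(12, 0, NA),
  ref_count_B   = c(10, 8, 6)
)

prefixes <- c("site_counts_", "ref_count_")
means <- data.frame(snp_id = depth$snp_id)

for (prefix in prefixes) {
  # Columns whose names start with the current prefix
  cols <- grep(paste0("^", prefix), names(depth), value = TRUE)
  # Vectorized row means, ignoring NA values, rounded to two decimals
  means[[paste0(prefix, "mean")]] <- round(rowMeans(depth[cols], na.rm = TRUE), 2)
}

print(means)
```

Like `mean(x, na.rm = TRUE)`, `rowMeans(..., na.rm = TRUE)` returns `NaN` for a row in which every value is missing.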
Now we can merge our data tables
# Using data.table's efficient join
setkey(snp_depth_qual, snp_id)
setkey(summary_xy, SNP_id)
snp_depth_qual_xy <- snp_depth_qual[summary_xy]
head(snp_depth_qual_xy)
## snp_id site_counts_mean ref_count_mean alt_count_mean
## 1: AX-579436016 19.44 15.78 3.67
## 2: AX-579436089 19.39 15.44 3.94
## 3: AX-579436102 18.44 14.06 4.28
## 4: AX-579436125 23.17 16.22 6.94
## 5: AX-579436196 21.28 15.22 6.06
## 6: AX-579436214 20.67 11.50 9.17
## ref_mean_quality_mean alt_mean_quality_mean REF_match REF_mismatch ALT_match
## 1: 36.81 36.50 18 0 18
## 2: 36.79 36.58 18 0 18
## 3: 36.84 36.49 18 0 18
## 4: 36.84 36.53 18 0 18
## 5: 36.79 36.53 18 0 18
## 6: 36.88 36.69 18 0 18
## ALT_mismatch Zigo_match Zigo_mismatch
## 1: 0 18 0
## 2: 0 18 0
## 3: 0 18 0
## 4: 0 18 0
## 5: 0 18 0
## 6: 0 18 0
Let’s start easy and see if there is any correlation between site_counts_mean and Zigo_mismatch
# Compute the correlation
correlation <- cor(snp_depth_qual_xy$site_counts_mean, snp_depth_qual_xy$Zigo_mismatch, use = "complete.obs")
# Print the correlation
print(correlation)
## [1] -0.2451664
A negative correlation coefficient, like the -0.2451664 we obtained, indicates an inverse relationship between the two variables, site_counts_mean and Zigo_mismatch in our case.
This means that as site_counts_mean increases, Zigo_mismatch tends to decrease, and vice versa. However, the value of -0.2451664 suggests a weak negative correlation.
Typically, we interpret the strength of a correlation using the absolute value of the coefficient (ignoring the negative sign), where:
Values near 0 indicate a very weak correlation. Values of about 0.2 to 0.3 are generally considered weak. Values of about 0.4 to 0.6 are moderate. Values above 0.6 are strong.
So in our case, the weak negative correlation of -0.2451664 suggests that while there may be a general trend of Zigo_mismatch decreasing as site_counts_mean increases, this relationship is not particularly strong and a lot of the variability is not accounted for by it.
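To put a number on "variability not accounted for", the Pearson coefficient can be squared: with r = -0.2451664, r² is about 0.06, so a linear fit on read depth captures roughly 6% of the variance in the mismatch counts. The sketch below shows that arithmetic and, because Zigo_mismatch is a count with many ties, also computes a rank-based (Spearman) coefficient as a possible complement. The `depth` and `mismatch` vectors are hypothetical toy values, not the real data.

```r
# Pearson coefficient reported above
r <- -0.2451664

# Share of variance captured by a linear relationship
r_squared <- r^2
print(round(r_squared, 3))

# Hypothetical toy vectors standing in for site_counts_mean and Zigo_mismatch
depth    <- c(5, 8, 10, 12, 15, 18, 20, 25)
mismatch <- c(6, 4, 4, 2, 1, 1, 0, 0)

# Pearson (linear) and Spearman (rank-based) correlations
pearson  <- cor(depth, mismatch, use = "complete.obs")
spearman <- cor(depth, mismatch, method = "spearman", use = "complete.obs")
print(c(pearson = pearson, spearman = spearman))
```

If the two coefficients differ much on the real data, the relationship is probably monotonic but not linear.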
# Create a scatter plot with a regression line
ggplot(snp_depth_qual_xy, aes(x = site_counts_mean, y = Zigo_mismatch)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, color = "red") +
my_theme() +
labs(x = "Site Counts Mean", y = "Zigo Mismatch", title = "Correlation between Site Counts Mean and Zigo Mismatch")
We can check whether the counts and the quality correlate strongly with the mismatches, using the data.table library.
# Define the suffixes of interest
mean_suffixes <- c("_counts_mean", "_count_mean", "_quality_mean") # Add "_counts_mean" to match "site_counts_mean"
mismatch_suffixes <- c("_mismatch")
# Get the column names of interest
mean_cols <- grep(paste(mean_suffixes, collapse = "|"), names(snp_depth_qual_xy), value = TRUE)
mismatch_cols <- grep(paste(mismatch_suffixes, collapse = "|"), names(snp_depth_qual_xy), value = TRUE)
# Compute the correlations
correlations <- list()
for (mean_col in mean_cols) {
for (mismatch_col in mismatch_cols) {
correlations[[length(correlations) + 1]] <- list(
Mean_Column = mean_col,
Mismatch_Column = mismatch_col,
Correlation = cor(snp_depth_qual_xy[[mean_col]], snp_depth_qual_xy[[mismatch_col]], use = "complete.obs")
)
}
}
# Convert correlations into a data table
correlations_dt <- rbindlist(correlations)
# Rename values in the 'Mean_Column' column
correlations_dt[, Mean_Column := gsub("_mean", "", Mean_Column)]
# Rename values in the 'Mismatch_Column' column
correlations_dt[, Mismatch_Column := gsub("_mismatch", "", Mismatch_Column)]
# Convert data table to long format
correlations_dt_long <- melt(correlations_dt, id.vars = c("Mean_Column", "Mismatch_Column"),
measure.vars = "Correlation")
# Convert 'value' column to numeric
correlations_dt_long[, value := as.numeric(value)]
# Rename 'value' column to 'Correlation'
setnames(correlations_dt_long, old = "value", new = "Correlation")
# Format the correlation to 2 decimal places
correlations_dt_long[, Correlation_formatted := sprintf("%.2f", Correlation)]
# Create a heatmap of the correlations
ggplot(correlations_dt_long,
aes(x = Mean_Column, y = Mismatch_Column, fill = Correlation)) +
geom_tile(color = "black", linewidth = 0.5) + # Here you can specify the border color and line width
geom_text(aes(label = Correlation_formatted), color = "black", size = 4) + # Add correlation values
scale_fill_gradient2(
low = "blue",
high = "red",
mid = "white",
midpoint = 0,
limit = c(-1, 1),
space = "Lab",
name = "Pearson\nCorrelation"
) +
my_theme() +
theme(axis.text.x = element_text(
angle = 45,
vjust = 1,
size = 12,
hjust = 1
)) +
coord_fixed() +
labs(x = "Counts or quality", y = "Mismatches", title = "Correlation between sites read counts and quality and mismatches", caption = "WGS samples, comparison of genotype calls using 18 or 30 samples.") +
theme(plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
))
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"xy_read_depth_by_zigo_mismatches.pdf"
),
height = 5,
width = 6,
dpi = 300
)
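The nested loop above computes one coefficient per pair of columns. As an alternative sketch, `cor()` accepts two sets of columns and returns the whole matrix in one call; `use = "pairwise.complete.obs"` handles NAs pair by pair, as the loop does with `use = "complete.obs"`. The data frame `toy` and its values are invented for illustration only.

```r
# Toy data with the same naming scheme as the real table (values are made up)
toy <- data.frame(
  site_counts_mean = c(10, 12, 8, 15, 20, NA),
  alt_count_mean   = c(3, 5, 2, 9, 4, 6),
  REF_mismatch     = c(2, 1, 3, 0, 0, 1),
  Zigo_mismatch    = c(3, 2, 4, 1, NA, 2)
)
mean_cols <- grep("_mean$", names(toy), value = TRUE)
mismatch_cols <- grep("_mismatch$", names(toy), value = TRUE)

# One call returns a (mean columns) x (mismatch columns) matrix of Pearson coefficients
cor_matrix <- cor(toy[, mean_cols], toy[, mismatch_cols], use = "pairwise.complete.obs")

# Reshape to long format, one row per pair, ready for a heatmap
cor_long <- as.data.frame(as.table(cor_matrix), responseName = "Correlation")
names(cor_long)[1:2] <- c("Mean_Column", "Mismatch_Column")
cor_long
```

The long table has the same three columns as the one built by the loop, so it could feed the same plotting code.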
The highest correlation is between the number of reads at the site and the zygosity mismatches: as the read depth decreases, the number of mismatches increases. Let’s group the data by Zigo_mismatch and get the mean site_counts per group.
# Group by 'Zigo_mismatch' and calculate the mean of 'site_counts_mean'
snp_summary_dt <- snp_depth_qual_xy[, .(mean_site_counts = round(mean(site_counts_mean, na.rm = TRUE), 2)), by = Zigo_mismatch]
# Create the bar plot with annotations and adjusted x-axis limits
ggplot(snp_summary_dt, aes(x = Zigo_mismatch, y = mean_site_counts)) +
geom_bar(stat = "identity",
fill = "#b0dfe8",
color = "#f5c5d8") +
geom_text(aes(label = sprintf("%.1f", mean_site_counts)), vjust = -0.5) +
labs(x = "Number of samples with Zygosity mismatches", y = "Mean Site Counts", title = "Mean Site Counts by Zygosity Mismatch", caption = "WGS samples, comparison of genotype calls using 18 or 30 samples.") +
my_theme() + coord_cartesian(xlim = c(0, 18)) +
scale_x_continuous(breaks = seq(0, 18, 1)) +
theme(plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
))
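When reading a bar plot of group means, it can help to know how many SNPs sit behind each bar, since classes with many mismatching samples may contain only a handful of sites. The sketch below uses invented values (`toy` is not the real table) and reports the group size next to the mean.

```r
# Toy data: number of samples with a zygosity mismatch and mean depth per SNP (made up)
toy <- data.frame(
  Zigo_mismatch    = c(0, 0, 0, 1, 1, 2, 5),
  site_counts_mean = c(14.2, 13.8, NA, 12.1, 11.5, 9.7, 6.0)
)

# Mean depth and number of SNPs per mismatch class
group_summary <- do.call(rbind, lapply(split(toy, toy$Zigo_mismatch), function(g) {
  data.frame(
    Zigo_mismatch = g$Zigo_mismatch[1],
    n_snps = nrow(g),
    mean_site_counts = round(mean(g$site_counts_mean, na.rm = TRUE), 2)
  )
}))
group_summary
```

In data.table syntax, the same count is available as `.N` inside the `.()` call that computes the group mean.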
First we need to get the SNP ids with zero mismatches and those with 2 or more mismatches
Find those with zero mismatches
# Filter the dataframe for Zigo_mismatch == 0
no_mismatches_ay <- summary_ay[summary_ay$Zigo_mismatch == 0,]
# Create a vector with SNP_id
SNPs_0_mismatches_ay <- no_mismatches_ay$SNP_id
# Print the number of SNPs in the vector
length(SNPs_0_mismatches_ay)
## [1] 42960
Find those with 2 or more mismatches
# Filter the dataframe for Zigo_mismatch >= 2
filtered_ay <- summary_ay[summary_ay$Zigo_mismatch >= 2,]
# Create a vector with SNP_id
SNPs_2_mismatches_ay <- filtered_ay$SNP_id
# Print the number of SNPs in the vector
length(SNPs_2_mismatches_ay)
## [1] 34895
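Once the two id vectors exist, they can be used with `%in%` to compare the read depth of the two SNP sets. The following is a minimal sketch on invented data; `toy_summary`, `toy_depth` and their values are assumptions for illustration, not the real tables.

```r
# Toy mismatch summary and depth table (made-up ids and values)
toy_summary <- data.frame(
  SNP_id = paste0("snp", 1:6),
  Zigo_mismatch = c(0, 0, 1, 2, 3, 5)
)
toy_depth <- data.frame(
  snp_id = paste0("snp", 1:6),
  site_counts_mean = c(15, 14, 12, 9, 8, 6)
)

# Same two selections as above: zero mismatches, and two or more mismatches
ids_0 <- toy_summary$SNP_id[toy_summary$Zigo_mismatch == 0]
ids_2plus <- toy_summary$SNP_id[toy_summary$Zigo_mismatch >= 2]

# Sanity check: the two sets must not share any SNP
stopifnot(length(intersect(ids_0, ids_2plus)) == 0)

# Mean depth within each set
mean_depth_0 <- mean(toy_depth$site_counts_mean[toy_depth$snp_id %in% ids_0])
mean_depth_2plus <- mean(toy_depth$site_counts_mean[toy_depth$snp_id %in% ids_2plus])
c(zero_mismatches = mean_depth_0, two_or_more = mean_depth_2plus)
```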
Now we can check the read counts of these two sets of SNPs in our data. We first need to select only the 18 samples.
# Identify columns that end with "n"
cols_to_remove <- grep("n$", names(merged_data2))
# Remove those columns
merged_data3 <- merged_data2[, -cols_to_remove, with = FALSE]
# Print the updated data table
head(merged_data3)
## id snp_id site_counts_SAI_18 ref_base_SAI_18 A_SAI_18
## 1: 1.101_110197 AX-583079274 12 A 12
## 2: 1.101_116980 AX-583077250 13 C 0
## 3: 1.101_118670 AX-583079283 13 G 3
## 4: 1.101_147467 AX-583079310 15 G 0
## 5: 1.101_171602 AX-583077312 15 C 15
## 6: 1.101_210793 AX-583077325 1 T 0
## T_SAI_18 C_SAI_18 G_SAI_18 ref_allele_SAI_18 ref_count_SAI_18
## 1: 0 0 0 A 12
## 2: 13 0 0 C 0
## 3: 0 0 10 G 10
## 4: 0 0 15 G 15
## 5: 7 14 7 C 14
## 6: 0 1 0 T 0
## alt_allele_SAI_18 alt_count_SAI_18 InDel_SAI_18 ref_mean_quality_SAI_18
## 1: T 0 FALSE 37.75
## 2: T 13 FALSE NA
## 3: A 3 FALSE 37.60
## 4: A 0 FALSE 37.21
## 5: A 15 TRUE 28.75
## 6: C 1 FALSE NA
## alt_mean_quality_SAI_18 zygosity_SAI_18 site_counts_KAT_11 ref_base_KAT_11
## 1: NA hom_ref 11 A
## 2: 37.23 hom_alt 16 C
## 3: 37.00 hete 17 G
## 4: NA hom_ref 6 G
## 5: 29.67 hete 7 C
## 6: 37.00 hom_alt 3 T
## A_KAT_11 T_KAT_11 C_KAT_11 G_KAT_11 ref_allele_KAT_11 ref_count_KAT_11
## 1: 10 0 0 1 A 10
## 2: 0 8 8 0 C 8
## 3: 10 0 0 7 G 7
## 4: 0 0 0 6 G 6
## 5: 0 0 7 0 C 7
## 6: 0 1 2 0 T 1
## alt_allele_KAT_11 alt_count_KAT_11 InDel_KAT_11 ref_mean_quality_KAT_11
## 1: G 1 FALSE 38.20
## 2: T 8 FALSE 37.00
## 3: A 10 FALSE 37.43
## 4: A 0 FALSE 37.50
## 5: A 0 FALSE 37.00
## 6: C 2 FALSE 37.00
## alt_mean_quality_KAT_11 zygosity_KAT_11 site_counts_KAT_10 ref_base_KAT_10
## 1: 37 hete 6 A
## 2: 37 hete 28 C
## 3: 37 hete 28 G
## 4: NA hom_ref 19 G
## 5: NA hom_ref 22 C
## 6: 37 hete 7 T
## A_KAT_10 T_KAT_10 C_KAT_10 G_KAT_10 ref_allele_KAT_10 ref_count_KAT_10
## 1: 0 0 0 6 A 0
## 2: 0 0 28 0 C 28
## 3: 28 0 0 0 G 0
## 4: 0 0 0 19 G 19
## 5: 0 0 22 0 C 22
## 6: 0 0 7 0 T 0
## alt_allele_KAT_10 alt_count_KAT_10 InDel_KAT_10 ref_mean_quality_KAT_10
## 1: G 6 FALSE NA
## 2: A 0 FALSE 36.36
## 3: A 28 FALSE NA
## 4: A 0 FALSE 37.00
## 5: A 0 FALSE 37.29
## 6: C 7 FALSE NA
## alt_mean_quality_KAT_10 zygosity_KAT_10 site_counts_KAT_12 ref_base_KAT_12
## 1: 37.00 hom_alt 17 A
## 2: NA hom_ref 13 C
## 3: 36.71 hom_alt 15 G
## 4: NA hom_ref 14 G
## 5: NA hom_ref 11 C
## 6: 37.00 hom_alt 3 T
## A_KAT_12 T_KAT_12 C_KAT_12 G_KAT_12 ref_allele_KAT_12 ref_count_KAT_12
## 1: 5 0 0 12 A 5
## 2: 0 13 0 0 C 0
## 3: 9 0 0 6 G 6
## 4: 0 0 11 3 G 3
## 5: 7 7 18 7 C 18
## 6: 0 3 0 0 T 3
## alt_allele_KAT_12 alt_count_KAT_12 InDel_KAT_12 ref_mean_quality_KAT_12
## 1: G 12 FALSE 37.6
## 2: T 13 FALSE NA
## 3: A 9 FALSE 37.0
## 4: C 11 FALSE 37.0
## 5: A 7 TRUE 32.5
## 6: A 0 FALSE 38.0
## alt_mean_quality_KAT_12 zygosity_KAT_12 site_counts_KAT_7 ref_base_KAT_7
## 1: 37.00 hete 4 A
## 2: 37.46 hom_alt 18 C
## 3: 35.67 hete 25 G
## 4: 35.91 hete 18 G
## 5: 26.00 hete 20 C
## 6: NA hom_ref 7 T
## A_KAT_7 T_KAT_7 C_KAT_7 G_KAT_7 ref_allele_KAT_7 ref_count_KAT_7
## 1: 0 0 0 4 A 0
## 2: 0 0 18 0 C 18
## 3: 25 0 0 0 G 0
## 4: 0 0 0 18 G 18
## 5: 0 0 20 0 C 20
## 6: 0 0 7 0 T 0
## alt_allele_KAT_7 alt_count_KAT_7 InDel_KAT_7 ref_mean_quality_KAT_7
## 1: G 4 FALSE NA
## 2: A 0 FALSE 37.19
## 3: A 25 FALSE NA
## 4: A 0 FALSE 37.17
## 5: A 0 FALSE 37.00
## 6: C 7 FALSE NA
## alt_mean_quality_KAT_7 zygosity_KAT_7 site_counts_SAI_1 ref_base_SAI_1
## 1: 37.00 hom_alt 31 A
## 2: NA hom_ref 13 C
## 3: 37.24 hom_alt 13 G
## 4: NA hom_ref NA
## 5: NA hom_ref 8 C
## 6: 35.43 hom_alt 7 T
## A_SAI_1 T_SAI_1 C_SAI_1 G_SAI_1 ref_allele_SAI_1 ref_count_SAI_1
## 1: 31 0 0 0 A 31
## 2: 0 13 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 8 0 0 0 C 0
## 6: 0 0 7 0 T 0
## alt_allele_SAI_1 alt_count_SAI_1 InDel_SAI_1 ref_mean_quality_SAI_1
## 1: T 0 FALSE 36.03
## 2: T 13 FALSE NA
## 3: A 13 FALSE NA
## 4: NA NA NA
## 5: A 8 FALSE NA
## 6: C 7 FALSE NA
## alt_mean_quality_SAI_1 zygosity_SAI_1 site_counts_SAI_2 ref_base_SAI_2
## 1: NA hom_ref 24 A
## 2: 36.31 hom_alt 19 C
## 3: 35.62 hom_alt 15 G
## 4: NA 12 G
## 5: 37.00 hom_alt 6 C
## 6: 37.00 hom_alt 5 T
## A_SAI_2 T_SAI_2 C_SAI_2 G_SAI_2 ref_allele_SAI_2 ref_count_SAI_2
## 1: 12 0 0 12 A 12
## 2: 0 19 0 0 C 0
## 3: 15 0 0 0 G 0
## 4: 0 0 0 12 G 12
## 5: 0 0 6 0 C 6
## 6: 0 0 5 0 T 0
## alt_allele_SAI_2 alt_count_SAI_2 InDel_SAI_2 ref_mean_quality_SAI_2
## 1: G 12 FALSE 37.5
## 2: T 19 FALSE NA
## 3: A 15 FALSE NA
## 4: A 0 FALSE 36.0
## 5: A 0 FALSE 37.0
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_2 zygosity_SAI_2 site_counts_SAI_3 ref_base_SAI_3
## 1: 35.00 hete 15 A
## 2: 36.53 hom_alt 8 C
## 3: 36.40 hom_alt 9 G
## 4: NA hom_ref NA
## 5: NA hom_ref 20 C
## 6: 38.80 hom_alt 5 T
## A_SAI_3 T_SAI_3 C_SAI_3 G_SAI_3 ref_allele_SAI_3 ref_count_SAI_3
## 1: 15 0 0 0 A 15
## 2: 0 8 0 0 C 0
## 3: 9 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 20 0 0 0 C 0
## 6: 0 0 5 0 T 0
## alt_allele_SAI_3 alt_count_SAI_3 InDel_SAI_3 ref_mean_quality_SAI_3
## 1: T 0 FALSE 37
## 2: T 8 FALSE NA
## 3: A 9 FALSE NA
## 4: NA NA NA
## 5: A 20 FALSE NA
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_3 zygosity_SAI_3 site_counts_SAI_4 ref_base_SAI_4
## 1: NA hom_ref 16 A
## 2: 34.88 hom_alt 12 C
## 3: 37.33 hom_alt 9 G
## 4: NA NA
## 5: 37.00 hom_alt 13 C
## 6: 34.20 hom_alt 5 T
## A_SAI_4 T_SAI_4 C_SAI_4 G_SAI_4 ref_allele_SAI_4 ref_count_SAI_4
## 1: 16 0 0 0 A 16
## 2: 0 12 0 0 C 0
## 3: 9 0 0 0 G 0
## 4: NA NA NA NA NA
## 5: 13 0 0 0 C 0
## 6: 0 0 5 0 T 0
## alt_allele_SAI_4 alt_count_SAI_4 InDel_SAI_4 ref_mean_quality_SAI_4
## 1: T 0 FALSE 37.12
## 2: T 12 FALSE NA
## 3: A 9 FALSE NA
## 4: NA NA NA
## 5: A 13 FALSE NA
## 6: C 5 FALSE NA
## alt_mean_quality_SAI_4 zygosity_SAI_4 site_counts_KAT_9 ref_base_KAT_9
## 1: NA hom_ref 10 A
## 2: 37.25 hom_alt 9 C
## 3: 37.33 hom_alt 9 G
## 4: NA 36 G
## 5: 37.23 hom_alt 9 C
## 6: 37.00 hom_alt 12 T
## A_KAT_9 T_KAT_9 C_KAT_9 G_KAT_9 ref_allele_KAT_9 ref_count_KAT_9
## 1: 10 0 0 0 A 10
## 2: 0 9 0 0 C 0
## 3: 6 0 0 3 G 3
## 4: 0 0 36 0 G 0
## 5: 4 4 13 4 C 13
## 6: 0 4 8 0 T 4
## alt_allele_KAT_9 alt_count_KAT_9 InDel_KAT_9 ref_mean_quality_KAT_9
## 1: T 0 FALSE 36.00
## 2: T 9 FALSE NA
## 3: A 6 FALSE 37.00
## 4: C 36 FALSE NA
## 5: A 4 TRUE 31.86
## 6: C 8 FALSE 37.75
## alt_mean_quality_KAT_9 zygosity_KAT_9 site_counts_KAT_8 ref_base_KAT_8
## 1: NA hom_ref 7 A
## 2: 34.56 hom_alt 15 C
## 3: 37.00 hete 15 G
## 4: 36.67 hom_alt 15 G
## 5: NA hete 4 C
## 6: 38.12 hete 7 T
## A_KAT_8 T_KAT_8 C_KAT_8 G_KAT_8 ref_allele_KAT_8 ref_count_KAT_8
## 1: 0 0 0 7 A 0
## 2: 0 15 0 0 C 0
## 3: 15 0 0 0 G 0
## 4: 0 0 14 1 G 1
## 5: 0 0 4 0 C 4
## 6: 0 6 1 0 T 6
## alt_allele_KAT_8 alt_count_KAT_8 InDel_KAT_8 ref_mean_quality_KAT_8
## 1: G 7 FALSE NA
## 2: T 15 FALSE NA
## 3: A 15 FALSE NA
## 4: C 14 FALSE 37.00
## 5: A 0 FALSE 36.75
## 6: C 1 FALSE 37.50
## alt_mean_quality_KAT_8 zygosity_KAT_8 site_counts_SAI_5 ref_base_SAI_5
## 1: 35.00 hom_alt 8 A
## 2: 36.60 hom_alt 8 C
## 3: 36.20 hom_alt 13 G
## 4: 36.57 hete 10 G
## 5: NA hom_ref 13 C
## 6: 20.00 hete 1 T
## A_SAI_5 T_SAI_5 C_SAI_5 G_SAI_5 ref_allele_SAI_5 ref_count_SAI_5
## 1: 8 0 0 0 A 8
## 2: 0 8 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: 0 0 0 10 G 10
## 5: 1 1 14 1 C 14
## 6: 0 0 1 0 T 0
## alt_allele_SAI_5 alt_count_SAI_5 InDel_SAI_5 ref_mean_quality_SAI_5
## 1: T 0 FALSE 37.38
## 2: T 8 FALSE NA
## 3: A 13 FALSE NA
## 4: A 0 FALSE 34.44
## 5: A 1 TRUE 36.38
## 6: C 1 FALSE NA
## alt_mean_quality_SAI_5 zygosity_SAI_5 site_counts_SAI_17 ref_base_SAI_17
## 1: NA hom_ref 9 A
## 2: 37.00 hom_alt 12 C
## 3: 37.23 hom_alt 16 G
## 4: NA hom_ref 14 G
## 5: 37.00 hete 9 C
## 6: 37.00 hom_alt 2 T
## A_SAI_17 T_SAI_17 C_SAI_17 G_SAI_17 ref_allele_SAI_17 ref_count_SAI_17
## 1: 9 0 0 0 A 9
## 2: 0 0 12 0 C 12
## 3: 16 0 0 0 G 0
## 4: 0 0 0 14 G 14
## 5: 9 9 18 9 C 18
## 6: 0 0 2 0 T 0
## alt_allele_SAI_17 alt_count_SAI_17 InDel_SAI_17 ref_mean_quality_SAI_17
## 1: T 0 FALSE 37.00
## 2: A 0 FALSE 37.00
## 3: A 16 FALSE NA
## 4: A 0 FALSE 37.43
## 5: A 9 TRUE 27.33
## 6: C 2 FALSE NA
## alt_mean_quality_SAI_17 zygosity_SAI_17 site_counts_SAI_16 ref_base_SAI_16
## 1: NA hom_ref 15 A
## 2: NA hom_ref 17 C
## 3: 36.25 hom_alt 11 G
## 4: NA hom_ref 1 G
## 5: 26.00 hete 15 C
## 6: 38.50 hom_alt 9 T
## A_SAI_16 T_SAI_16 C_SAI_16 G_SAI_16 ref_allele_SAI_16 ref_count_SAI_16
## 1: 15 0 0 0 A 15
## 2: 0 17 0 0 C 0
## 3: 11 0 0 0 G 0
## 4: 0 0 0 1 G 1
## 5: 15 0 0 0 C 0
## 6: 0 0 9 0 T 0
## alt_allele_SAI_16 alt_count_SAI_16 InDel_SAI_16 ref_mean_quality_SAI_16
## 1: T 0 FALSE 37
## 2: T 17 FALSE NA
## 3: A 11 FALSE NA
## 4: A 0 FALSE 40
## 5: A 15 FALSE NA
## 6: C 9 FALSE NA
## alt_mean_quality_SAI_16 zygosity_SAI_16 site_counts_SAI_14 ref_base_SAI_14
## 1: NA hom_ref 11 A
## 2: 37.35 hom_alt 7 C
## 3: 37.55 hom_alt 27 G
## 4: NA hom_ref 8 G
## 5: 37.00 hom_alt 10 C
## 6: 36.33 hom_alt 3 T
## A_SAI_14 T_SAI_14 C_SAI_14 G_SAI_14 ref_allele_SAI_14 ref_count_SAI_14
## 1: 0 0 0 11 A 0
## 2: 0 7 0 0 C 0
## 3: 27 0 0 0 G 0
## 4: 0 0 0 8 G 8
## 5: 10 0 0 0 C 0
## 6: 0 3 0 0 T 3
## alt_allele_SAI_14 alt_count_SAI_14 InDel_SAI_14 ref_mean_quality_SAI_14
## 1: G 11 FALSE NA
## 2: T 7 FALSE NA
## 3: A 27 FALSE NA
## 4: A 0 FALSE 35.5
## 5: A 10 FALSE NA
## 6: A 0 FALSE 38.0
## alt_mean_quality_SAI_14 zygosity_SAI_14 site_counts_SAI_15 ref_base_SAI_15
## 1: 35.09 hom_alt 26 A
## 2: 37.43 hom_alt 21 C
## 3: 36.56 hom_alt 36 G
## 4: NA hom_ref 29 G
## 5: 37.30 hom_alt 17 C
## 6: NA hom_ref 7 T
## A_SAI_15 T_SAI_15 C_SAI_15 G_SAI_15 ref_allele_SAI_15 ref_count_SAI_15
## 1: 26 0 0 0 A 26
## 2: 0 21 0 0 C 0
## 3: 20 0 0 16 G 16
## 4: 0 0 0 29 G 29
## 5: 14 14 31 14 C 31
## 6: 0 0 7 0 T 0
## alt_allele_SAI_15 alt_count_SAI_15 InDel_SAI_15 ref_mean_quality_SAI_15
## 1: T 0 FALSE 37.23
## 2: T 21 FALSE NA
## 3: A 20 FALSE 37.00
## 4: A 0 FALSE 36.50
## 5: A 14 TRUE 32.14
## 6: C 7 FALSE NA
## alt_mean_quality_SAI_15 zygosity_SAI_15 site_counts_SAI_12 ref_base_SAI_12
## 1: NA hom_ref 6 A
## 2: 35.38 hom_alt 9 C
## 3: 36.53 hete 13 G
## 4: NA hom_ref 15 G
## 5: 26.00 hete 9 C
## 6: 37.86 hom_alt 11 T
## A_SAI_12 T_SAI_12 C_SAI_12 G_SAI_12 ref_allele_SAI_12 ref_count_SAI_12
## 1: 0 0 0 6 A 0
## 2: 0 9 0 0 C 0
## 3: 13 0 0 0 G 0
## 4: 0 0 0 15 G 15
## 5: 0 0 9 0 C 9
## 6: 0 0 11 0 T 0
## alt_allele_SAI_12 alt_count_SAI_12 InDel_SAI_12 ref_mean_quality_SAI_12
## 1: G 6 FALSE NA
## 2: T 9 FALSE NA
## 3: A 13 FALSE NA
## 4: A 0 FALSE 37.40
## 5: A 0 FALSE 37.33
## 6: C 11 FALSE NA
## alt_mean_quality_SAI_12 zygosity_SAI_12 site_counts_SAI_13 ref_base_SAI_13
## 1: 37.00 hom_alt 8 A
## 2: 37.33 hom_alt 16 C
## 3: 36.54 hom_alt 12 G
## 4: NA hom_ref 4 G
## 5: NA hom_ref 12 C
## 6: 37.73 hom_alt 4 T
## A_SAI_13 T_SAI_13 C_SAI_13 G_SAI_13 ref_allele_SAI_13 ref_count_SAI_13
## 1: 8 0 0 0 A 8
## 2: 0 16 0 0 C 0
## 3: 12 0 0 0 G 0
## 4: 0 0 0 4 G 4
## 5: 10 10 22 10 C 22
## 6: 0 0 4 0 T 0
## alt_allele_SAI_13 alt_count_SAI_13 InDel_SAI_13 ref_mean_quality_SAI_13
## 1: T 0 FALSE 37.75
## 2: T 16 FALSE NA
## 3: A 12 FALSE NA
## 4: A 0 FALSE 37.00
## 5: A 10 TRUE 31.00
## 6: C 4 FALSE NA
## alt_mean_quality_SAI_13 zygosity_SAI_13 chr_ref bp_ref
## 1: NA hom_ref 1.101 110197
## 2: 37.19 hom_alt 1.101 116980
## 3: 36.00 hom_alt 1.101 118670
## 4: NA hom_ref 1.101 147467
## 5: 25.00 hete 1.101 171602
## 6: 38.50 hom_alt 1.101 210793
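One caveat with removing columns by negative index: if `grep()` finds no match it returns `integer(0)`, and `-integer(0)` selects zero columns rather than all of them. The sketch below illustrates this on a base data frame with invented column names, together with a logical-index alternative using `grepl()` that is not affected.

```r
# Toy table: one column name ends in "n", the others do not (names are made up)
toy <- data.frame(site_counts_SAI_1 = 1:3, ref_count_SAI_1 = 4:6, depth_30n = 7:9)

# Pattern that matches: both approaches drop the same column
drop_idx <- grep("n$", names(toy))
kept_negative <- toy[, -drop_idx, drop = FALSE]
kept_logical <- toy[, !grepl("n$", names(toy)), drop = FALSE]

# Pattern that matches nothing: negative indexing silently drops every column
no_match <- grep("zzz$", names(toy))
lost_everything <- toy[, -no_match, drop = FALSE]
kept_everything <- toy[, !grepl("zzz$", names(toy)), drop = FALSE]

ncol(lost_everything)  # number of columns left after negative indexing with no match
ncol(kept_everything)  # number of columns left with the logical index
```

Checking `length(cols_to_remove) > 0` before subsetting would be another way to guard against this case.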
Now we can get the mean read count for each allele across all the samples, or we could compare only two samples. Let’s subset the columns with counts and quality into a new data table.
# Define the patterns to look for
patterns <- c("^ref_count_", "^alt_count_", "^ref_mean_quality_", "^alt_mean_quality_", "^site_counts_")
# Create an empty vector to store the column indices
cols_to_keep <- integer(0)
# Loop over the patterns
for (pattern in patterns) {
# Find columns that start with the pattern and append their indices to cols_to_keep
cols_to_keep <- c(cols_to_keep, grep(pattern, names(merged_data3)))
}
# Append the index of the 'snp_id' column to cols_to_keep
cols_to_keep <- c(which(names(merged_data3) == "snp_id"), cols_to_keep)
# Subset the data table
merged_data4 <- merged_data3[, cols_to_keep, with = FALSE]
# Print the updated data table
head(merged_data4)
## snp_id ref_count_SAI_18 ref_count_KAT_11 ref_count_KAT_10
## 1: AX-583079274 12 10 0
## 2: AX-583077250 0 8 28
## 3: AX-583079283 10 7 0
## 4: AX-583079310 15 6 19
## 5: AX-583077312 14 7 22
## 6: AX-583077325 0 1 0
## ref_count_KAT_12 ref_count_KAT_7 ref_count_SAI_1 ref_count_SAI_2
## 1: 5 0 31 12
## 2: 0 18 0 0
## 3: 6 0 0 0
## 4: 3 18 NA 12
## 5: 18 20 0 6
## 6: 3 0 0 0
## ref_count_SAI_3 ref_count_SAI_4 ref_count_KAT_9 ref_count_KAT_8
## 1: 15 16 10 0
## 2: 0 0 0 0
## 3: 0 0 3 0
## 4: NA NA 0 1
## 5: 0 0 13 4
## 6: 0 0 4 6
## ref_count_SAI_5 ref_count_SAI_17 ref_count_SAI_16 ref_count_SAI_14
## 1: 8 9 15 0
## 2: 0 12 0 0
## 3: 0 0 0 0
## 4: 10 14 1 8
## 5: 14 18 0 0
## 6: 0 0 0 3
## ref_count_SAI_15 ref_count_SAI_12 ref_count_SAI_13 alt_count_SAI_18
## 1: 26 0 8 0
## 2: 0 0 0 13
## 3: 16 0 0 3
## 4: 29 15 4 0
## 5: 31 9 22 15
## 6: 0 0 0 1
## alt_count_KAT_11 alt_count_KAT_10 alt_count_KAT_12 alt_count_KAT_7
## 1: 1 6 12 4
## 2: 8 0 13 0
## 3: 10 28 9 25
## 4: 0 0 11 0
## 5: 0 0 7 0
## 6: 2 7 0 7
## alt_count_SAI_1 alt_count_SAI_2 alt_count_SAI_3 alt_count_SAI_4
## 1: 0 12 0 0
## 2: 13 19 8 12
## 3: 13 15 9 9
## 4: NA 0 NA NA
## 5: 8 0 20 13
## 6: 7 5 5 5
## alt_count_KAT_9 alt_count_KAT_8 alt_count_SAI_5 alt_count_SAI_17
## 1: 0 7 0 0
## 2: 9 15 8 0
## 3: 6 15 13 16
## 4: 36 14 0 0
## 5: 4 0 1 9
## 6: 8 1 1 2
## alt_count_SAI_16 alt_count_SAI_14 alt_count_SAI_15 alt_count_SAI_12
## 1: 0 11 0 6
## 2: 17 7 21 9
## 3: 11 27 20 13
## 4: 0 0 0 0
## 5: 15 10 14 0
## 6: 9 0 7 11
## alt_count_SAI_13 ref_mean_quality_SAI_18 ref_mean_quality_KAT_11
## 1: 0 37.75 38.20
## 2: 16 NA 37.00
## 3: 12 37.60 37.43
## 4: 0 37.21 37.50
## 5: 10 28.75 37.00
## 6: 4 NA 37.00
## ref_mean_quality_KAT_10 ref_mean_quality_KAT_12 ref_mean_quality_KAT_7
## 1: NA 37.6 NA
## 2: 36.36 NA 37.19
## 3: NA 37.0 NA
## 4: 37.00 37.0 37.17
## 5: 37.29 32.5 37.00
## 6: NA 38.0 NA
## ref_mean_quality_SAI_1 ref_mean_quality_SAI_2 ref_mean_quality_SAI_3
## 1: 36.03 37.5 37
## 2: NA NA NA
## 3: NA NA NA
## 4: NA 36.0 NA
## 5: NA 37.0 NA
## 6: NA NA NA
## ref_mean_quality_SAI_4 ref_mean_quality_KAT_9 ref_mean_quality_KAT_8
## 1: 37.12 36.00 NA
## 2: NA NA NA
## 3: NA 37.00 NA
## 4: NA NA 37.00
## 5: NA 31.86 36.75
## 6: NA 37.75 37.50
## ref_mean_quality_SAI_5 ref_mean_quality_SAI_17 ref_mean_quality_SAI_16
## 1: 37.38 37.00 37
## 2: NA 37.00 NA
## 3: NA NA NA
## 4: 34.44 37.43 40
## 5: 36.38 27.33 NA
## 6: NA NA NA
## ref_mean_quality_SAI_14 ref_mean_quality_SAI_15 ref_mean_quality_SAI_12
## 1: NA 37.23 NA
## 2: NA NA NA
## 3: NA 37.00 NA
## 4: 35.5 36.50 37.40
## 5: NA 32.14 37.33
## 6: 38.0 NA NA
## ref_mean_quality_SAI_13 alt_mean_quality_SAI_18 alt_mean_quality_KAT_11
## 1: 37.75 NA 37
## 2: NA 37.23 37
## 3: NA 37.00 37
## 4: 37.00 NA NA
## 5: 31.00 29.67 NA
## 6: NA 37.00 37
## alt_mean_quality_KAT_10 alt_mean_quality_KAT_12 alt_mean_quality_KAT_7
## 1: 37.00 37.00 37.00
## 2: NA 37.46 NA
## 3: 36.71 35.67 37.24
## 4: NA 35.91 NA
## 5: NA 26.00 NA
## 6: 37.00 NA 35.43
## alt_mean_quality_SAI_1 alt_mean_quality_SAI_2 alt_mean_quality_SAI_3
## 1: NA 35.00 NA
## 2: 36.31 36.53 34.88
## 3: 35.62 36.40 37.33
## 4: NA NA NA
## 5: 37.00 NA 37.00
## 6: 37.00 38.80 34.20
## alt_mean_quality_SAI_4 alt_mean_quality_KAT_9 alt_mean_quality_KAT_8
## 1: NA NA 35.00
## 2: 37.25 34.56 36.60
## 3: 37.33 37.00 36.20
## 4: NA 36.67 36.57
## 5: 37.23 NA NA
## 6: 37.00 38.12 20.00
## alt_mean_quality_SAI_5 alt_mean_quality_SAI_17 alt_mean_quality_SAI_16
## 1: NA NA NA
## 2: 37.00 NA 37.35
## 3: 37.23 36.25 37.55
## 4: NA NA NA
## 5: 37.00 26.00 37.00
## 6: 37.00 38.50 36.33
## alt_mean_quality_SAI_14 alt_mean_quality_SAI_15 alt_mean_quality_SAI_12
## 1: 35.09 NA 37.00
## 2: 37.43 35.38 37.33
## 3: 36.56 36.53 36.54
## 4: NA NA NA
## 5: 37.30 26.00 NA
## 6: NA 37.86 37.73
## alt_mean_quality_SAI_13 site_counts_SAI_18 site_counts_KAT_11
## 1: NA 12 11
## 2: 37.19 13 16
## 3: 36.00 13 17
## 4: NA 15 6
## 5: 25.00 15 7
## 6: 38.50 1 3
## site_counts_KAT_10 site_counts_KAT_12 site_counts_KAT_7 site_counts_SAI_1
## 1: 6 17 4 31
## 2: 28 13 18 13
## 3: 28 15 25 13
## 4: 19 14 18 NA
## 5: 22 11 20 8
## 6: 7 3 7 7
## site_counts_SAI_2 site_counts_SAI_3 site_counts_SAI_4 site_counts_KAT_9
## 1: 24 15 16 10
## 2: 19 8 12 9
## 3: 15 9 9 9
## 4: 12 NA NA 36
## 5: 6 20 13 9
## 6: 5 5 5 12
## site_counts_KAT_8 site_counts_SAI_5 site_counts_SAI_17 site_counts_SAI_16
## 1: 7 8 9 15
## 2: 15 8 12 17
## 3: 15 13 16 11
## 4: 15 10 14 1
## 5: 4 13 9 15
## 6: 7 1 2 9
## site_counts_SAI_14 site_counts_SAI_15 site_counts_SAI_12 site_counts_SAI_13
## 1: 11 26 6 8
## 2: 7 21 9 16
## 3: 27 36 13 12
## 4: 8 29 15 4
## 5: 10 17 9 12
## 6: 3 7 11 4
We can get the mean values across all 18 samples for each SNP. We will ignore the NAs.
# Define the prefixes
prefixes <- c("site_counts_", "ref_count_", "alt_count_", "ref_mean_quality_", "alt_mean_quality_")
# Create an empty data table for the results
snp_depth_qual <- data.table(snp_id = merged_data4$snp_id)
# Loop over the prefixes
for (prefix in prefixes) {
# Get the column indices for the current prefix
cols <- grep(prefix, names(merged_data4))
# Compute the row-wise means while ignoring NA values and round them to two decimal places
mean_values <- apply(merged_data4[, cols, with = FALSE], 1, function(x) round(mean(x, na.rm = TRUE), 2))
# Add the mean values to the results data table
snp_depth_qual[[paste0(prefix, "mean")]] <- mean_values
}
# Print the results
head(snp_depth_qual)
## snp_id site_counts_mean ref_count_mean alt_count_mean
## 1: AX-583079274 13.11 9.83 3.28
## 2: AX-583077250 14.11 3.67 10.44
## 3: AX-583079283 16.44 2.33 14.11
## 4: AX-583079310 14.40 10.33 4.07
## 5: AX-583077312 12.22 11.00 7.00
## 6: AX-583077325 5.50 0.94 4.56
## ref_mean_quality_mean alt_mean_quality_mean
## 1: 37.20 36.26
## 2: 36.89 36.63
## 3: 37.21 36.68
## 4: 36.94 36.38
## 5: 34.03 32.29
## 6: 37.65 36.09
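The row-wise means computed with `apply()` above can also be obtained with `rowMeans()`, which is usually faster on wide tables. A small sketch on invented data (`toy` is illustrative) shows that both give the same values, and that a row with only NAs yields `NaN` rather than `NA`.

```r
# Toy per-sample depth columns; the last row has no data at all (values are made up)
toy <- data.frame(
  site_counts_A = c(12, NA, NA),
  site_counts_B = c(10, 8, NA),
  site_counts_C = c(14, 6, NA)
)
cols <- grep("^site_counts_", names(toy))

# Row-wise mean with apply(), as in the loop above
means_apply <- apply(toy[, cols], 1, function(x) round(mean(x, na.rm = TRUE), 2))

# Same computation with rowMeans()
means_rowmeans <- round(rowMeans(toy[, cols], na.rm = TRUE), 2)

means_rowmeans
is.nan(means_rowmeans)  # TRUE only for the row where every sample is NA
```

Such `NaN` rows are treated as missing by `cor(..., use = "complete.obs")`, so they do not enter the correlations.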
Now we can merge our data tables
# Using data.table's efficient join
setkey(snp_depth_qual, snp_id)
setkey(summary_ay, SNP_id)
snp_depth_qual_ay <- snp_depth_qual[summary_ay]
head(snp_depth_qual_ay)
## snp_id site_counts_mean ref_count_mean alt_count_mean
## 1: AX-579436089 19.39 15.44 3.94
## 2: AX-579436125 23.17 16.22 6.94
## 3: AX-579436196 21.28 15.22 6.06
## 4: AX-579436243 21.44 18.67 2.78
## 5: AX-579436298 15.61 12.22 3.39
## 6: AX-579436308 21.39 3.06 18.33
## ref_mean_quality_mean alt_mean_quality_mean REF_match REF_mismatch ALT_match
## 1: 36.79 36.58 15 0 15
## 2: 36.84 36.53 15 3 18
## 3: 36.79 36.53 14 2 16
## 4: 36.93 36.98 15 3 18
## 5: 37.14 36.77 16 1 12
## 6: 37.37 36.83 16 0 16
## ALT_mismatch Zigo_match Zigo_mismatch
## 1: 0 15 0
## 2: 0 15 3
## 3: 0 14 2
## 4: 0 15 3
## 5: 5 13 4
## 6: 0 16 0
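In data.table, a join written as `X[Y]` keeps every row of `Y` and fills the columns of `X` with NA where there is no match. The sketch below reproduces that behaviour with base R `merge()` and `all.y = TRUE` on invented ids, so the tables and values are illustrative only.

```r
# Toy tables: depth is known for AX-1..AX-3, mismatches for AX-2..AX-4 (made-up ids)
depth <- data.frame(
  snp_id = c("AX-1", "AX-2", "AX-3"),
  site_counts_mean = c(19.4, 23.2, 21.3)
)
summary_tbl <- data.frame(
  SNP_id = c("AX-2", "AX-3", "AX-4"),
  Zigo_mismatch = c(3, 2, 0)
)

# Keep every row of the mismatch summary, as depth[summary_tbl] would in data.table
joined <- merge(depth, summary_tbl, by.x = "snp_id", by.y = "SNP_id", all.y = TRUE)
joined

# SNPs without depth information show up as NA and can be counted
sum(is.na(joined$site_counts_mean))
```

This is one reason the correlations that follow are computed with `use = "complete.obs"`.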
Let’s start easy and see if there is any correlation between site_counts_mean and Zigo_mismatch
# Compute the correlation
correlation <- cor(snp_depth_qual_ay$site_counts_mean, snp_depth_qual_ay$Zigo_mismatch, use = "complete.obs")
# Print the correlation
print(correlation)
## [1] -0.3468091
A negative correlation coefficient, like the -0.3468091 we obtained, indicates an inverse relationship between the two variables, site_counts_mean and Zigo_mismatch in our case.
This means that as site_counts_mean increases, Zigo_mismatch tends to decrease, and vice versa. However, a value of -0.3468091 suggests only a weak to moderate negative correlation.
Typically, we would interpret the strength of the correlation using the absolute value of the correlation coefficient (ignoring the negative sign), where:
Values near 0 indicate a very weak correlation. Values near 0.2 to 0.3 are generally considered weak. Values near 0.4 to 0.6 are moderate. Values above 0.6 are strong.
So in our case, the correlation of -0.3468091 suggests that while there is a general trend of Zigo_mismatch decreasing as site_counts_mean increases, this relationship is not particularly strong and a lot of the variability is not accounted for by it.
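One way to put a number on the variability that is not accounted for is the squared coefficient, which for a simple linear relationship is the fraction of variance shared by the two variables. The sketch below uses the coefficient printed above, and then simulated data (the vectors `depth` and `mismatch` are made up) to show how `cor.test()` adds a confidence interval to the estimate.

```r
# Coefficient reported above for site_counts_mean vs Zigo_mismatch
r <- -0.3468091
r_squared <- r^2
round(r_squared, 3)  # roughly 12% of the variance is shared

# Simulated data, for illustration only (not the real SNP table)
set.seed(1)
depth <- rnorm(200, mean = 12, sd = 4)
mismatch <- pmax(0, round(6 - 0.3 * depth + rnorm(200, sd = 2)))

# cor.test() reports the Pearson estimate together with a 95% confidence interval
ct <- cor.test(depth, mismatch, method = "pearson")
ct$estimate
ct$conf.int
```

Because the mismatch counts are discrete and bounded, a rank-based coefficient (`method = "spearman"`) may be worth comparing against the Pearson value.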
# Create a scatter plot with a regression line
ggplot(snp_depth_qual_ay, aes(x = site_counts_mean, y = Zigo_mismatch)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, color = "red") +
my_theme() +
labs(x = "Site Counts Mean", y = "Zigo Mismatch", title = "Correlation between Site Counts Mean and Zigo Mismatch")
We can check whether the read counts or the base quality correlate strongly with the mismatches, using the data.table library
# Define the suffixes of interest
mean_suffixes <- c("_counts_mean", "_count_mean", "_quality_mean") # Add "_counts_mean" to match "site_counts_mean"
mismatch_suffixes <- c("_mismatch")
# Get the column names of interest
mean_cols <- grep(paste(mean_suffixes, collapse = "|"), names(snp_depth_qual_ay), value = TRUE)
mismatch_cols <- grep(paste(mismatch_suffixes, collapse = "|"), names(snp_depth_qual_ay), value = TRUE)
# Compute the correlations
correlations <- list()
for (mean_col in mean_cols) {
for (mismatch_col in mismatch_cols) {
correlations[[length(correlations) + 1]] <- list(
Mean_Column = mean_col,
Mismatch_Column = mismatch_col,
Correlation = cor(snp_depth_qual_ay[[mean_col]], snp_depth_qual_ay[[mismatch_col]], use = "complete.obs")
)
}
}
# Convert correlations into a data table
correlations_dt <- rbindlist(correlations)
# Rename values in the 'Mean_Column' column
correlations_dt[, Mean_Column := gsub("_mean", "", Mean_Column)]
# Rename values in the 'Mismatch_Column' column
correlations_dt[, Mismatch_Column := gsub("_mismatch", "", Mismatch_Column)]
# Convert data table to long format
correlations_dt_long <- melt(correlations_dt, id.vars = c("Mean_Column", "Mismatch_Column"),
measure.vars = "Correlation")
# Convert 'value' column to numeric
correlations_dt_long[, value := as.numeric(value)]
# Rename 'value' column to 'Correlation'
setnames(correlations_dt_long, old = "value", new = "Correlation")
# Format the correlation to 2 decimal places
correlations_dt_long[, Correlation_formatted := sprintf("%.2f", Correlation)]
# Create a heatmap of the correlations
ggplot(correlations_dt_long,
aes(x = Mean_Column, y = Mismatch_Column, fill = Correlation)) +
geom_tile(color = "black", linewidth = 0.5) + # Here you can specify the border color and line width
geom_text(aes(label = Correlation_formatted), color = "black", size = 4) + # Add correlation values
scale_fill_gradient2(
low = "blue",
high = "red",
mid = "white",
midpoint = 0,
limit = c(-1, 1),
space = "Lab",
name = "Pearson\nCorrelation"
) +
my_theme() +
theme(axis.text.x = element_text(
angle = 45,
vjust = 1,
size = 12,
hjust = 1
)) +
coord_fixed() +
labs(x = "Counts or quality", y = "Mismatches", title = "Correlation between sites read counts \nand quality and mismatches", caption = "WGS and Chip calls done with the 18 samples.") +
theme(plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
))
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"ay_read_depth_by_zigo_mismatches_correlation.pdf"
),
height = 5,
width = 6,
dpi = 300
)
Unlike in the WGS-only comparison above, the highest correlation here is not between the number of reads at the site and the zygosity mismatches. Instead, the read depth of the alternative allele has a moderate correlation with the number of mismatches (0.39). For the site read depth, the trend is the same as before: as the read depth decreases, the number of mismatches increases.
As we did before, let’s group the data by Zigo_mismatch and get the mean site_counts per group.
# Group by 'Zigo_mismatch' and calculate the mean of 'site_counts_mean'
snp_summary_dt <- snp_depth_qual_ay[, .(mean_site_counts = round(mean(site_counts_mean, na.rm = TRUE), 2)), by = Zigo_mismatch]
# Create the bar plot with annotations and adjusted x-axis limits
ggplot(snp_summary_dt, aes(x = Zigo_mismatch, y = mean_site_counts)) +
geom_bar(stat = "identity",
fill = "#b0dfe8",
color = "#f5c5d8") +
geom_text(aes(label = sprintf("%.1f", mean_site_counts)), vjust = -0.5, size = 3) +
labs(x = "Number of samples with Zygosity mismatches", y = "Mean Site Counts", title = "Mean Site Counts by Zygosity Mismatch", caption = "WGS and chip samples genotyped with 18 samples") +
my_theme() + coord_cartesian(xlim = c(0, 18)) +
scale_x_continuous(breaks = seq(0, 18, 1)) +
theme(plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
))
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"ay_read_depth_by_zigo_mismatches.pdf"
),
height = 5,
width = 6,
dpi = 300
)
We can also look at the site read counts grouped by the number of samples with ALT allele mismatches
# Group by 'ALT_mismatch' and calculate the mean of 'site_counts_mean'
snp_summary_dt <- snp_depth_qual_ay[, .(mean_site_counts = round(mean(site_counts_mean, na.rm = TRUE), 2)), by = ALT_mismatch]
# Create the bar plot with annotations and adjusted x-axis limits
ggplot(snp_summary_dt, aes(x = ALT_mismatch, y = mean_site_counts)) +
geom_bar(stat = "identity",
fill = "#b0dfe8",
color = "#f5c5d8") +
geom_text(aes(label = sprintf("%.1f", mean_site_counts)), vjust = -0.5, size = 3) +
labs(x = "Number of samples with ALT allele mismatches", y = "Mean Site Counts", title = "Mean Site Counts by ALT Allele Mismatch", caption = "WGS and chip samples genotyped with 18 samples") +
my_theme() + coord_cartesian(xlim = c(0, 18)) +
scale_x_continuous(breaks = seq(0, 18, 1)) +
theme(plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
))
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"ay_read_depth_by_ALT_read_zigo_mismatches.pdf"
),
height = 5,
width = 6,
dpi = 300
)
We see that the read depth decreases as the number of samples with zygosity mismatches (homo_ref, homo_alt and heterozygous) increases.
Now we can import the SNP metrics for the chip genotype call and see if we find correlations as well.
# Read the file with fread()
ay_chip_metrics <- fread(
here(
"data",
"raw_data",
"albo",
"wgs_vs_chip",
"wgs_18_samples_metrics.txt"
)
)
# We can add two new columns, n_NoCall (missing call) and n_OTV (off-target variant), and subset our data table
# Define a list of columns to check
call_code_cols = grep("_call_code$", names(ay_chip_metrics), value = TRUE)
# Create the n_NoCall column
ay_chip_metrics[, n_NoCall := rowSums(do.call(cbind, lapply(.SD, function(x) x == "NoCall"))), .SDcols = call_code_cols]
# Create the n_OTV column
ay_chip_metrics[, n_OTV := rowSums(do.call(cbind, lapply(.SD, function(x) x == "OTV"))), .SDcols = call_code_cols]
# Select the columns to keep - MMD, HomFLD, HetSO and HomRO are dropped because they contain NAs
ay_chip_metrics <- ay_chip_metrics[, .(probeset_id, CR, FLD, nMinorAllele, Nclus, n_AA, n_AB, n_BB, n_NC, MinorAlleleFrequency, n_NoCall, n_OTV)]
# Check output
head(ay_chip_metrics)
## probeset_id CR FLD nMinorAllele Nclus n_AA n_AB n_BB n_NC
## 1: AX-579436016 83.333 2.533 6 2 0 6 9 3
## 2: AX-579436089 94.444 3.760 7 3 11 5 1 1
## 3: AX-579436102 88.889 3.377 10 2 6 10 0 2
## 4: AX-579436125 100.000 4.809 8 2 10 8 0 0
## 5: AX-579436149 100.000 NaN 0 1 0 0 18 0
## 6: AX-579436196 94.444 5.996 12 3 8 6 3 1
## MinorAlleleFrequency n_NoCall n_OTV
## 1: 0.200 7 0
## 2: 0.206 1 2
## 3: 0.312 6 5
## 4: 0.222 0 0
## 5: 0.000 0 0
## 6: 0.353 0 2
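The per-row counting of call codes can be sketched on a small invented table. Whether the real metrics file contains missing call codes is not known from the text, so the `na.rm = TRUE` below is a precaution and an assumption; without it, a single NA in a row would make that row's count NA. The table `toy` and its sample columns are made up.

```r
# Toy chip metrics with three hypothetical samples; one call code is missing
toy <- data.frame(
  probeset_id = c("AX-1", "AX-2", "AX-3"),
  s1_call_code = c("AA", "NoCall", "OTV"),
  s2_call_code = c("NoCall", "NoCall", "AB"),
  s3_call_code = c("BB", NA, "OTV"),
  stringsAsFactors = FALSE
)
call_code_cols <- grep("_call_code$", names(toy), value = TRUE)

# Compare the whole block of call codes at once, then count matches per row
calls <- as.matrix(toy[, call_code_cols])
toy$n_NoCall <- rowSums(calls == "NoCall", na.rm = TRUE)
toy$n_OTV <- rowSums(calls == "OTV", na.rm = TRUE)

toy[, c("probeset_id", "n_NoCall", "n_OTV")]
```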
Now we can merge our data sets to run the correlation analysis (merge with snp_depth_qual_ay)
# Set the key for each table
setkey(ay_chip_metrics, probeset_id)
setkey(snp_depth_qual_ay, snp_id)
# Join the tables
ay_wgs_chip_metrics <- snp_depth_qual_ay[ay_chip_metrics, nomatch = 0L]
# Remove rows with NAs
ay_wgs_chip_metrics <- ay_wgs_chip_metrics[complete.cases(ay_wgs_chip_metrics), ]
head(ay_wgs_chip_metrics)
## snp_id site_counts_mean ref_count_mean alt_count_mean
## 1: AX-579436089 19.39 15.44 3.94
## 2: AX-579436125 23.17 16.22 6.94
## 3: AX-579436196 21.28 15.22 6.06
## 4: AX-579436243 21.44 18.67 2.78
## 5: AX-579436298 15.61 12.22 3.39
## 6: AX-579436308 21.39 3.06 18.33
## ref_mean_quality_mean alt_mean_quality_mean REF_match REF_mismatch ALT_match
## 1: 36.79 36.58 15 0 15
## 2: 36.84 36.53 15 3 18
## 3: 36.79 36.53 14 2 16
## 4: 36.93 36.98 15 3 18
## 5: 37.14 36.77 16 1 12
## 6: 37.37 36.83 16 0 16
## ALT_mismatch Zigo_match Zigo_mismatch CR FLD nMinorAllele Nclus n_AA
## 1: 0 15 0 94.444 3.760 7 3 11
## 2: 0 15 3 100.000 4.809 8 2 10
## 3: 0 14 2 94.444 5.996 12 3 8
## 4: 0 15 3 100.000 5.017 9 3 3
## 5: 5 13 4 100.000 6.941 6 3 13
## 6: 0 16 0 100.000 12.298 5 2 13
## n_AB n_BB n_NC MinorAlleleFrequency n_NoCall n_OTV
## 1: 5 1 1 0.206 1 2
## 2: 8 0 0 0.222 0 0
## 3: 6 3 1 0.353 0 2
## 4: 3 12 0 0.250 0 0
## 5: 4 1 0 0.167 0 1
## 6: 5 0 0 0.139 2 0
Check for NAs
# Check for NAs in ay_wgs_chip_metrics
any_na <- any(colSums(is.na(ay_wgs_chip_metrics)) > 0)
if (any_na) {
print("There are NA values in the ay_wgs_chip_metrics table.")
} else {
print("There are no NA values in the ay_wgs_chip_metrics table.")
}
## [1] "There are no NA values in the ay_wgs_chip_metrics table."
Plot
# Define the suffixes of interest
mean_suffixes <- c("_counts_mean", "_count_mean", "_quality_mean")
mismatch_suffixes <- c("_mismatch")
# Get the column names of interest
mean_cols <-
grep(paste(mean_suffixes, collapse = "|"),
names(ay_wgs_chip_metrics),
value = TRUE)
mismatch_cols <-
grep(paste(mismatch_suffixes, collapse = "|"),
names(ay_wgs_chip_metrics),
value = TRUE)
other_numeric_cols <-
c(
"CR",
"FLD",
"nMinorAllele",
"Nclus",
"n_AA",
"n_AB",
"n_BB",
"n_NC",
"MinorAlleleFrequency",
"n_NoCall",
"n_OTV"
)
# Combine mean_cols and other_numeric_cols
mean_and_other_numeric_cols <- c(mean_cols, other_numeric_cols)
# Compute the correlations
ay_correlations <- list()
for (col1 in mean_and_other_numeric_cols) {
for (col2 in mismatch_cols) {
ay_correlations[[length(ay_correlations) + 1]] <- list(
Column1 = col1,
Column2 = col2,
Correlation = cor(ay_wgs_chip_metrics[[col1]], ay_wgs_chip_metrics[[col2]], use = "complete.obs")
)
}
}
# Convert correlations into a data table
ay_correlations_dt <- rbindlist(ay_correlations)
# Format the correlation to 2 decimal places
ay_correlations_dt[, Correlation_formatted := sprintf("%.2f", Correlation)]
# Update the visualization
ggplot(ay_correlations_dt,
aes(x = Column1, y = Column2, fill = Correlation)) +
geom_tile(color = "black", size = 0.5) +
geom_text(aes(label = Correlation_formatted),
color = "black",
size = 4) +
scale_fill_gradient2(
low = "blue",
high = "red",
mid = "white",
midpoint = 0,
limit = c(-1, 1),
space = "Lab",
name = "Pearson\nCorrelation"
) +
my_theme() +
theme(axis.text.x = element_text(
angle = 45,
vjust = 1,
size = 12,
hjust = 1
)) +
coord_fixed() +
labs(
x = "WGS/chip metrics",
y = "Mismatches",
title = "Correlation between WGS counts/quality,\n chip metrics against mismatches",
caption = "WGS and Chip calls done with the 18 samples."
) +
theme(plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
))
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"ay_wgs_chip_metrics_vs_mismatches.pdf"
),
height = 6,
width = 10,
dpi = 300
)
We can check the number of reads at each site against the zygosity mismatches, as we did before
snp_summary_dt <-
ay_wgs_chip_metrics[, .(
mean_FLD = round(mean(FLD, na.rm = TRUE), 2),
mean_CR = round(mean(CR, na.rm = TRUE), 2),
mean_site_counts_mean = round(mean(site_counts_mean, na.rm = TRUE), 2)
),
by = Zigo_mismatch]
# Reshape data from wide to long format
snp_summary_dt_long <-
melt(
snp_summary_dt,
id.vars = "Zigo_mismatch",
variable.name = "Variable",
value.name = "Mean"
)
cbPalette <- c("#CC79A7", "#56B4E9", "#009E73")
# Change variable names for legend keys
snp_summary_dt_long$Variable <- factor(
snp_summary_dt_long$Variable,
levels = c("mean_FLD", "mean_CR", "mean_site_counts_mean"),
labels = c("FLD", "Call Rate", "Read Count")
)
ggplot(snp_summary_dt_long,
aes(x = Zigo_mismatch, y = Mean, fill = Variable)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(
aes(label = sprintf("%.1f", Mean)),
position = position_dodge(width = 0.9),
vjust = -0.25,
size = 2
) +
scale_fill_manual(values = cbPalette) +
labs(
x = "Number of samples with Zygosity mismatches",
y = "Mean Value (log 10)",
fill = "Metrics",
title = "Mean Values by Zygosity Mismatch",
caption = "WGS and chip samples genotyped with 18 samples"
) +
theme_minimal() + coord_cartesian(xlim = c(0, 18)) +
scale_x_continuous(breaks = seq(0, 18, 1)) +
scale_y_log10() + # apply log transformation to y-axis
theme(
plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
),
legend.position = "top"
)
# Save plot to PDF
ggsave(
here(
"output",
"wgs_vs_chip",
"figures",
"ay_wgs_chip_bars_after_correlation.pdf"
),
height = 5,
width = 6,
dpi = 300
)
We do not see a clear pattern: FLD, call rate and read count look similar regardless of how many samples show mismatches. Mean values per number of mismatching samples may not be a good way to represent the relationship, because averaging hides the per-SNP variation. We know that SNPs with lower FLD can have a higher rate of mismatches.
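One way to look at this relationship per SNP, rather than through group means, is to bin the SNPs by FLD and compute the fraction of SNPs in each bin with at least one zygosity mismatch, alongside a rank correlation. The following is a minimal sketch on toy data, assuming a table with columns named FLD and Zigo_mismatch as in ay_wgs_chip_metrics. The function name mismatch_rate_by_fld and the bin edges are hypothetical choices for illustration, not part of the original analysis.

```r
# Hypothetical helper: fraction of SNPs with >= 1 zygosity mismatch per FLD bin.
# Assumes columns named FLD and Zigo_mismatch (as in ay_wgs_chip_metrics).
mismatch_rate_by_fld <- function(dt, breaks = c(0, 4, 6, 8, Inf)) {
  # Left-closed bins: [0,4), [4,6), [6,8), [8,Inf)
  fld_bin <- cut(dt$FLD, breaks = breaks, right = FALSE)
  has_mismatch <- dt$Zigo_mismatch > 0
  data.frame(
    fld_bin = levels(fld_bin),
    n_snps = as.vector(table(fld_bin)),
    mismatch_rate = as.vector(tapply(has_mismatch, fld_bin, mean))
  )
}

# Toy data for illustration only (not results from this study)
toy <- data.frame(
  FLD = c(2.5, 3.8, 4.8, 5.0, 6.0, 6.9, 9.1, 12.3),
  Zigo_mismatch = c(4, 2, 3, 0, 2, 0, 0, 0)
)

rates <- mismatch_rate_by_fld(toy)
print(rates)

# Rank correlation between FLD and the per-SNP mismatch count
rho <- cor(toy$FLD, toy$Zigo_mismatch, method = "spearman")
```

With the real table, the same call would be mismatch_rate_by_fld(ay_wgs_chip_metrics). A rank correlation is used here because the mismatch counts are small integers with many ties at zero.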
Let's try a violin plot
# Reshape the per-SNP data from wide to long format. We use the individual SNP values rather than the group means in snp_summary_dt, so each violin shows a distribution
snp_summary_dt_long <- melt(
ay_wgs_chip_metrics,
id.vars = "Zigo_mismatch",
measure.vars = c("FLD", "CR", "site_counts_mean"),
variable.name = "Variable",
value.name = "Value"
)
# Change variable names for legend keys
snp_summary_dt_long$Variable <- factor(
snp_summary_dt_long$Variable,
levels = c("FLD", "CR", "site_counts_mean"),
labels = c("FLD", "Call Rate", "Read Count")
)
# Create a new categorical variable based on Zigo_mismatch
snp_summary_dt_long$Zigo_group <-
cut(
snp_summary_dt_long$Zigo_mismatch,
breaks = seq(0, 18, 2),
labels = seq(0, 16, 2),
include.lowest = TRUE
)
# Create violin plot
ggplot(snp_summary_dt_long,
aes(x = Zigo_group, y = Value, fill = Variable)) +
geom_violin(scale = "width", trim = FALSE) +
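geom_boxplot(
aes(group = interaction(Zigo_group, Variable)), # keep one box per metric; a fixed fill drops the fill grouping
position = position_dodge(width = 0.9), # align the boxes with the dodged violins
width = 0.2,
fill = "white",
color = "black",
outlier.shape = NA
) +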
scale_fill_manual(values = cbPalette) +
labs(
x = "Number of samples with Zygosity mismatches",
y = "Value",
fill = "Metrics",
title = "Distribution of Values by Zygosity Mismatch",
caption = "WGS and chip samples genotyped with 18 samples"
) +
my_theme() +
scale_y_log10() +
theme(
plot.caption = element_text(
size = 8,
color = "gray30",
face = "italic",
hjust = 1
),
legend.position = "top"
)
We can remove the SNPs with segregation errors before we run the PCA. In previous comparisons we did not see a significant overlap between the two types of problematic SNPs: those with genotype mismatches and those with segregation errors.
We can compare the SNPs with 1 or more samples with discrepancies with the SNPs that did not pass our segregation test.
Get the SNPs that have errors in 1 or more samples
# Discrepancies in 1 or more samples
# How many SNPs we tested
tested_snps <- length(unique(data_ay_dt_filtered$SNP_id))
cat("Number of SNPs tested:", tested_snps, "\n")
## Number of SNPs tested: 33466
# How many SNPs failed
failed_snpsR <-
length(
unique(data_ay_dt_filtered[data_ay_dt_filtered$REF_mismatch_count >= 1,]$SNP_id
)
)
cat("REF mismatch in at least 1 sample:", failed_snpsR, "\n")
## REF mismatch in at least 1 sample: 0
# How many SNPs failed
failed_snpsA <-
length(
unique(data_ay_dt_filtered[data_ay_dt_filtered$ALT_mismatch_count >= 1,]$SNP_id
)
)
cat("ALT mismatch in at least 1 sample:", failed_snpsA, "\n")
## ALT mismatch in at least 1 sample: 0
# How many SNPs failed zygosity
failed_snps <-
length(
unique(data_ay_dt_filtered[data_ay_dt_filtered$Zigo_mismatch_count >= 1,]$SNP_id
)
)
cat("Zygosity mismatch in at least 1 sample:", failed_snps, "\n")
## Zygosity mismatch in at least 1 sample: 0
# Calculate percentage
percentage_failed <- round(failed_snps / tested_snps * 100, 2)
cat("Percentage of failed SNPs in 1 or more samples:", percentage_failed, "%\n")
## Percentage of failed SNPs in 1 or more samples: 0 %
Get the SNP ids
Create a Venn diagram between the SNPs with genotyping mismatches and those that failed our segregation test
# Read in the two files as vectors
fail_mendel <-
read_table(
here(
"output",
"segregation",
"albopictus",
"albopictus_SNPs_fail_segregation.txt"
),
col_names = FALSE,
show_col_types = FALSE
)[[1]]
fail_geno <-
read_table(
here("output",
"wgs_vs_chip",
"SNPs_failed_2_samples.txt"),
col_names = FALSE,
show_col_types = FALSE
)[[1]]
# Calculate shared values
errors_SNPs <-
intersect(fail_mendel,
fail_geno)
# Create Venn diagram
venn_data <-
list(
"Fail Mendel" = fail_mendel,
"Genotype Mismatches" = fail_geno
)
venn_plot <-
ggvenn(
venn_data,
fill_color = c("steelblue", "darkorange"),
show_percentage = TRUE
)
# Add a title
venn_plot <-
venn_plot +
ggtitle("Comparison of SNPs with errors") +
theme(plot.title = element_text(hjust = .5))
# Display the Venn diagram
print(venn_plot)
# Save Venn diagram to PDF
output_path <-
here(
"output",
"wgs_vs_chip",
"figures",
"Mendel_mismatches_overlap_filtered.pdf"
)
ggsave(
output_path,
venn_plot,
height = 6,
width = 6,
dpi = 300
)
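To judge whether the overlap shown in the Venn diagram is larger than expected by chance, one option is a one-sided hypergeometric test. The sketch below is an illustration on made-up IDs, not the test used in this analysis. It assumes that both lists are drawn from the same universe of tested SNPs, and the function name overlap_p_value is hypothetical.

```r
# Hypothetical helper: one-sided hypergeometric test for the overlap of two
# SNP lists. Both lists are first restricted to the universe of tested SNPs.
overlap_p_value <- function(set_a, set_b, universe) {
  universe <- unique(universe)
  set_a <- intersect(unique(set_a), universe)
  set_b <- intersect(unique(set_b), universe)
  k <- length(intersect(set_a, set_b)) # observed overlap
  # P(overlap >= k) when length(set_b) SNPs are drawn at random from the universe
  p <- phyper(
    k - 1,
    length(set_a),
    length(universe) - length(set_a),
    length(set_b),
    lower.tail = FALSE
  )
  list(overlap = k, p_value = p)
}

# Toy example (made-up IDs, not results from this study)
universe <- paste0("snp", 1:100)
set_a <- paste0("snp", 1:10)
set_b <- paste0("snp", 6:25)
res <- overlap_p_value(set_a, set_b, universe)

# Identical lists give a very small p-value, disjoint lists give 1
res_same <- overlap_p_value(set_a, set_a, universe)
res_disjoint <- overlap_p_value(set_a, paste0("snp", 50:60), universe)
```

In this notebook the inputs could be fail_mendel and fail_geno, with the SNP IDs in data_ay_dt_filtered as the universe. That choice of universe is an assumption, and the result depends on it.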
We will remove all SNPs flagged for genotype mismatches between WGS and chip (the list in SNPs_failed_2_samples.txt) and those with segregation errors; as reported below, the combined list contains 5,933 SNPs. We write it to a new file to use later with Plink
# Merge the vectors
SNPs_to_exclude <- unique(c(fail_mendel, fail_geno))
# How many to remove
cat("How many SNPs to remove:", length(SNPs_to_exclude), "\n")
## How many SNPs to remove: 5933
# Write to file
write.table(
SNPs_to_exclude,
file = here(
"output",
"wgs_vs_chip",
"SNPs_to_exclude.txt"
),
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
# We can also create a list of SNPs that we can use for the PCA
ay_SNPs_to_extract <-
unique(
data_ay_dt_filtered$SNP_id
)
# We can remove the SNPs with errors
ay_SNPs_to_extract_filtered <- ay_SNPs_to_extract[!ay_SNPs_to_extract %in% SNPs_to_exclude]
# How many SNPs left
cat("How many SNPs to keep after filtering:", length(ay_SNPs_to_extract_filtered), "\n")
## How many SNPs to keep after filtering: 32223
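The Plink call in the next section reads the SNPs to keep from output/wgs_vs_chip/ay_SNPs_to_extract_filtered.txt, so the filtered vector has to be written to disk first. That step is not shown here. A minimal sketch of it follows, with one SNP ID per line as Plink's --extract option accepts. The helper name write_snp_list is hypothetical, and the example writes toy IDs to a temporary file.

```r
# Hypothetical helper: write one SNP ID per line, a format that Plink's
# --extract and --exclude options can read.
write_snp_list <- function(snp_ids, path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  writeLines(unique(as.character(snp_ids)), path)
  invisible(path)
}

# Toy example written to a temporary file (duplicated IDs are written once)
toy_ids <- c("AX-579436089", "AX-579436125", "AX-579436125")
toy_path <- write_snp_list(
  toy_ids,
  file.path(tempdir(), "toy_snps_to_extract.txt")
)
written <- readLines(toy_path)

# In this notebook the call would be something like:
# write_snp_list(
#   ay_SNPs_to_extract_filtered,
#   here("output", "wgs_vs_chip", "ay_SNPs_to_extract_filtered.txt")
# )
```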
WGS vs chip “ay” - WGS and chip calls with 18 samples
Now use Plink to create a PCA excluding only the SNPs that failed our segregation test
Let's import our .fam file to filter the IDs we want to compare.
# Read the data
fam_data <-
here("output", "wgs_vs_chip", "wgs_chip_merged.fam") |>
read_delim(
delim = "\t",
col_names = FALSE,
show_col_types = FALSE
) |>
setNames(
c(
"FID", "IID", "PID", "MID", "Sex", "Phenotype"
)
)
# Filter the data
filtered_data <-
fam_data |>
dplyr::filter(stringr::str_detect(IID, "a$|y$")) |>
dplyr::select("FID", "IID")
# Save to file
write.table(
filtered_data,
file = here("output", "wgs_vs_chip", "ay_wgs_chip_samples.txt"),
quote = FALSE,
sep = " ",
row.names = FALSE,
col.names = FALSE
)
Use Plink with only the samples we are comparing (priors) and remove the SNPs that failed the Mendel test (output/segregation/albopictus/albopictus_SNPs_fail_segregation.txt)
# Before
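# Here we set genotyping missingness to 10%, MAF 5%, and remove the SNPs with segregation errors
# The exclude list is albopictus_SNPs_fail_segregation.txt (the file read in R above); Plink stops with "Failed to open" if this path is misspelled
plink \
--allow-extra-chr \
--keep-allele-order \
--bfile output/wgs_vs_chip/wgs_chip_merged \
--exclude output/segregation/albopictus/albopictus_SNPs_fail_segregation.txt \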
--keep output/wgs_vs_chip/ay_wgs_chip_samples.txt \
--pca \
--geno 0.1 \
--maf 0.05 \
--out output/wgs_vs_chip/ay_pca_before \
--silent;
grep "samples\|variants" output/wgs_vs_chip/ay_pca_before.log
## Error: Failed to open
## output/segregation/albopictus/alobopictus_SNPs_fail_segregation.txt.
## --keep output/wgs_vs_chip/ay_wgs_chip_samples.txt
## 174424 variants loaded from .bim file.
Now do it again, this time removing both the SNPs that failed the Mendel (segregation) test and those with genotype mismatches in at least 2 samples.
# After
# Here we set genotyping missingness to 10%, MAF 5%, and remove the SNPs with segregation errors and those with mismatches between WGS and chip, by extracting only the SNPs that passed
plink \
--allow-extra-chr \
--keep-allele-order \
--bfile output/wgs_vs_chip/wgs_chip_merged \
--extract output/wgs_vs_chip/ay_SNPs_to_extract_filtered.txt \
--keep output/wgs_vs_chip/ay_wgs_chip_samples.txt \
--pca \
--geno 0.1 \
--maf 0.05 \
--out output/wgs_vs_chip/ay_pca_after \
--silent;
grep "samples\|variants" output/wgs_vs_chip/ay_pca_after.log
## --keep output/wgs_vs_chip/ay_wgs_chip_samples.txt
## 174424 variants loaded from .bim file.
## --extract: 32223 variants remaining.
## Total genotyping rate in remaining samples is 0.994761.
## 308 variants removed due to missing genotype data (--geno).
## 2189 variants removed due to minor allele threshold(s)
## 29726 variants and 36 people pass filters and QC.
Create PCA plot
# Load the PCA results
pca_1 <-
read.table(here("output", "wgs_vs_chip", "ay_pca_before.eigenvec"),
header = FALSE)
colnames(pca_1) <- c("FID", "IID", paste0("PC", 1:(ncol(pca_1) - 2)))
pca_1$analysis <- "Before"
pca_1$group <- ifelse(
stringr::str_detect(pca_1$IID, "a$"),
"a",
ifelse(stringr::str_detect(pca_1$IID, "y$"), "y", "Other")
)
pca_2 <-
read.table(here("output", "wgs_vs_chip", "ay_pca_after.eigenvec"),
header = FALSE)
colnames(pca_2) <- c("FID", "IID", paste0("PC", 1:(ncol(pca_2) - 2)))
pca_2$analysis <- "After"
pca_2$group <- ifelse(
stringr::str_detect(pca_2$IID, "a$"),
"a",
ifelse(stringr::str_detect(pca_2$IID, "y$"), "y", "Other")
)
# Combine the data
combined_pca <- rbind(pca_1, pca_2)
# import plotting theme
source(
here(
"scripts",
"analysis",
"my_theme2.R"
)
)
# Convert the 'analysis' column to a factor and specify the level order
combined_pca$analysis <-
factor(combined_pca$analysis, levels = c("Before", "After"))
# Create a facet plot
ggplot(combined_pca, aes(x = PC1, y = PC2, color = group, shape = group)) +
geom_point(size = 2) +
facet_grid(FID ~ analysis, scales = "fixed") +
labs(
x = "PC1",
y = "PC2",
title = "The effect of SNPs with genotyping mismatches in\n 1 or more samples and low read depth",
colour = "Method",
shape = "Method",
caption = "Removing SNPs with genotype mismatches in >= 1 sample.\n WGS data filtered based on read depth (20x for site or alleles). \n Chip data filtered by call rate (CR=98.5%) and Fisher Linear Discriminant (FLD>=6). \n'Before' with 88,443 SNPs 'After' with 20,293 SNPs (--maf 0.05 and --geno 0.1)."
) +
my_theme() +
scale_color_manual(
values = c(
"a" = "lightblue",
"y" = "orange",
"Other" = "black"
),
labels = c("a" = "Chip", "y" = "WGS", "Other" = "Other")
) +
theme(plot.caption = element_text(
face = "italic",
size = 10,
color = "grey20"
),
legend.position = "top") +
scale_shape_manual(
values = c(
"a" = 19, # Filled circle
"y" = 1, # Open circle
"Other" = 3 # Plus
),
labels = c("a" = "Chip", "y" = "WGS", "Other" = "Other")
)
If we wish to combine samples genotyped with WGS and the chip, we can use these thresholds: for WGS, a read depth of 20x for the site or alleles; for the chip, FLD >= 6 and call rate >= 98.5%. In addition, the number of samples included in the genotype calls does affect the mismatch rate between the technologies, and sample size seems to affect the WGS data more than the chip data. It is therefore crucial to check the site read depth and the allele-specific read depth, and to filter sites or alleles with low depth. It is probably not possible to combine low-depth sequencing data with the chip data. Because the genome is highly repetitive (~70%), sequencing will cost more if we want to obtain 20x per site or allele. Merging the data sets without these considerations is probably not a good idea.
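The thresholds above can be applied as a simple per-SNP filter. The sketch below uses toy values and is only one possible reading of the thresholds: it assumes a table with the columns site_counts_mean (WGS depth at the site), FLD and CR, and it filters on site depth only, without the allele-specific depth. The function name passes_thresholds is hypothetical.

```r
# Hypothetical helper: flag SNPs that meet the WGS depth and chip thresholds.
# Assumed column names: site_counts_mean (WGS depth at the site), FLD and CR.
# Assumption: only site depth is checked here, not allele-specific depth.
passes_thresholds <- function(dt, min_depth = 20, min_fld = 6, min_cr = 98.5) {
  ok <- dt$site_counts_mean >= min_depth & dt$FLD >= min_fld & dt$CR >= min_cr
  # Treat missing metrics (e.g. FLD = NaN at monomorphic sites) as failing
  !is.na(ok) & ok
}

# Toy values for illustration only (not results from this study)
toy <- data.frame(
  snp_id = c("snp1", "snp2", "snp3", "snp4", "snp5"),
  site_counts_mean = c(23.2, 19.4, 21.4, 25.0, 30.1),
  FLD = c(6.9, 7.2, 5.0, 12.3, NaN),
  CR = c(100, 100, 100, 94.4, 100),
  stringsAsFactors = FALSE
)

toy$keep <- passes_thresholds(toy)
kept_ids <- toy$snp_id[toy$keep]
print(kept_ids)
```

An allele-specific version would add conditions on columns such as ref_count_mean and alt_count_mean, depending on how "site or alleles" is meant to be applied.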