We will used SnpEff to perform a functional annotation of our SNPs You can download the software HERE. They have great documentation that it is worth checking it HERE.
# move to the directory you want to download it
# Download and install SnpEff
curl -v -L 'https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip' > snpEff_latest_core.zip
unzip snpEff_latest_core.zip
We have two options to compare the functional annotation of data sets. We can do the annotation on each data set and parse the results with bash or R. A different approach would be run one annotation once, and then remove those that do not pass our MAF threshold from the annotated data set.
We also want to do LD pruning to remove close linked loci. We can do the LD filtering before or after the functional annotation. Perhaps, we want to compare the effect of LD pruning on the variant numbers by effect or region. It seems possible that pruning might effect more variants on coding regions, whose minor allele frequencies might be low.
I decided to do the filtering (LD and MAF) before the functional annotation, avoiding any bias based on the SNP location or it biological implications. Therefore, the filtering will be based solely on the SNP frequencies and locations. Previous analysis of linkage disequilibrium in Ae. albopictus showed considerable less linkage than in Ae aegypti. While in Ae. aegypti there are long linkage blocks (1/2 dist maximum r2 between a pair of SNPs vary from 40kb up to 120kb in aegypti), albopictus have shorter blocks (1/2 dist maximum r2 between a pair of SNPs vary from less than 1 to 2kb). However, using a chromosomal scale, we found similar linkage patterns in Ae. albopictus.
While WGS LD pruning using number of SNPs is the usual choice, the chip data needs a different approach. For example, we can we look at 50 SNPs, estimate r2 among all SNP pairs, and remove one SNP from pairs where the estimate is above 0.1 (remove the SNP whose minor allele frequency is smaller), and then move the window one SNP, and repeat the operation.
This is the case in Plink when we use the flag –indep-pairwise 50 1 0.1 (it is a bit more complicated than I explained, check to learn about VIF https://www.cog-genomics.org/plink/2.0/ld or https://www.cog-genomics.org/plink/1.9/ld).
For the chip data, we have long stretches of the genome without any data. If we compare 50 SNPs they can come from difference genomic regions, the average distance of SNPs on the chip is a bit above 10kb, however it is variable. Therefore, if we use 20kb and maximum distance we want to compare the SNPs, we will be conservative since there is low linkage among variants. We can also test different linkage filterings and their effect on the data sets we end up with.
To do all these tests, it is easier to have access to a cluster. The main problem turns out to be storage for all the different data sets. Therefore, we can install SnpEff on the cluster.
# Go to home dir
cd
# Download latest version
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
# Unzip file
unzip snpEff_latest_core.zip
# pwd
# /home/lvc26/snpEff
# on the cluster
srun --pty -N 1 -n 1 -c 2 -p interactive zsh
module load miniconda/4.12.0
# check conda
which conda
/gpfs/ycga/apps/hpc/software/miniconda/4.12.0/condabin/conda
#
# install GFF3Toolkit https://github.com/NAL-i5K/GFF3toolkit
# on the base env using --user
pip install gff3tool --user
#
#
# QC the annotation from current and previous assembly
# ____________________________________________________________________________
# Boyle et al, 2021 ####
#
# To run the QC on the annotation we can try different strategies:
# 1. use the gff file as it is
# 2. sort by scaffold and physical location
# 3. sort by transcript id.
# I am not sure which one would be the recommended option, therefore I will test all 3.
# GFF3toolkit has a couple options https://github.com/NAL-i5K/GFF3toolkit#sort-a-gff3-file-back
# https://gff3toolkit.readthedocs.io/en/latest/gff3_sort.html#inputs
#
# we need to decompress the gff file and rename it to use it with snpEff later
cd $HOME/snpEff/data/albo_v3;
zcat aedes_albopictus_LA2_20200826.gff3.gz > genes.gff
#
## ............................................................................
## 1. use gff as it is ####
# run the qc to get errors
logsave -a $HOME/snpEff/data/albo_v3/log_1.txt gff3_QC \
-g $HOME/snpEff/data/albo_v3/genes.gff \
-f $HOME/snpEff/data/genomes/albo_v3.fa \
-o $HOME/snpEff/data/albo_v3/error_1.txt \
-s $HOME/snpEff/data/albo_v3/statistic_1.txt
#
# run the qc to get errors but use flag -noncg –noncanonical_gene
# check https://www.uniprot.org/help/canonical_and_isoforms to see what it means
logsave -a $HOME/snpEff/data/albo_v3/log_1nc.txt gff3_QC \
-noncg \
-g $HOME/snpEff/data/albo_v3/genes.gff \
-f $HOME/snpEff/data/genomes/albo_v3.fa \
-o $HOME/snpEff/data/albo_v3/error_1nc.txt \
-s $HOME/snpEff/data/albo_v3/statistic_1nc.txt
#
#
## ............................................................................
## 2. sort gff by scaffolds and positions ####
#
# Sort a GFF3 file according to the order of Scaffold, coordinates on a Scaffold, and parent-child feature relationships
# check https://github.com/NAL-i5K/GFF3toolkit/blob/master/docs/gff3_sort.md
# we can also use a template to sort by feature
echo "gene pseudogene
mRNA
exon
CDS" > $HOME/snpEff/data/albo_v3/template.txt
# Note:
# If not all the feature type are documented in the sort template file. gff3_sort will sort features by level(1st-level, 2nd-level, and etc) and then by the order in sort template file.
#
# sort
logsave -a $HOME/snpEff/data/albo_v3/log_sort.txt gff3_sort \
--sort_template $HOME/snpEff/data/albo_v3/template.txt \
-g $HOME/snpEff/data/albo_v3/genes.gff \
-og $HOME/snpEff/data/albo_v3/sorted_genes.gff
#
# now we can run the QC again on the sorted gff file
logsave -a $HOME/snpEff/data/albo_v3/log_2.txt gff3_QC \
-g $HOME/snpEff/data/albo_v3/sorted_genes.gff \
-f $HOME/snpEff/data/genomes/albo_v3.fa \
-o $HOME/snpEff/data/albo_v3/error_2.txt \
-s $HOME/snpEff/data/albo_v3/statistic_2.txt
#
## ............................................................................
## 3. sort gff by transcript ####
#
# sort gff3 file with --isoform_sort
logsave -a $HOME/snpEff/data/albo_v3/log_sort_isoform.txt gff3_sort \
--isoform_sort \
--sort_template $HOME/snpEff/data/albo_v3/template.txt \
-g $HOME/snpEff/data/albo_v3/genes.gff \
-og $HOME/snpEff/data/albo_v3/isoform_sorted_genes.gff
#
# now we can run the QC again on the sorted gff file
logsave -a $HOME/snpEff/data/albo_v3/log_3.txt gff3_QC \
-g $HOME/snpEff/data/albo_v3/isoform_sorted_genes.gff \
-f $HOME/snpEff/data/genomes/albo_v3.fa \
-o $HOME/snpEff/data/albo_v3/error_3.txt \
-s $HOME/snpEff/data/albo_v3/statistic_3.txt
#
# run the qc to get errors
gff3_QC \
-g $HOME/snpEff/data/albo_v3/genes_sorted.gff3 \
-f $HOME/snpEff/data/genomes/albo_v3.fa \
-o $HOME/snpEff/data/albo_v3/error.txt \
-s $HOME/snpEff/data/albo_v3/statistic.txt
#
# when sorting it creates another error - duplicated ids
# therefore sorting does not seem to solve the errors that cannot be fixed automatically
#
# ____________________________________________________________________________
# AalboF2 - Palinni et al, 2020 ####
#
# genome at https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7160/102/GCF_006496715.1_Aalbo_primary.1/
#
# decompress the fasta file and name it as albo_v2
zcat GCF_006496715.1_Aalbo_primary.1_genomic.fna.gz > albo_v2.fa
#
# index the genome
module load SAMtools/1.16-GCCcore-10.2.0
samtools faidx $HOME/snpEff/data/genomes/albo_v2.fa
#
# decompress gff file, name it as genes.gff, and place it in a directory called albo_v2 inside snpEff directory
mkdir $HOME/snpEff/data/genomes/albo_v2;
zcat GCF_006496715.1_Aalbo_primary.1_genomic.gff.gz > genes.gff
#
# create interactive session with 4CPU and 20GB memory (10 was not enough)
srun --pty -N 1 -n 1 -c 4 -p interactive zsh
#
# load GATK, it will automatic load the python I installed gfftoolkit
module load GATK/4.2.6.1-GCCcore-10.2.0-Java-11
#
## ............................................................................
## 1. use gff as it is ####
# run the qc to get errors
logsave -a $HOME/snpEff/data/albo_v2/log_1.txt gff3_QC \
-g $HOME/snpEff/data/albo_v2/genes.gff \
-f $HOME/snpEff/data/genomes/albo_v2.fa \
-o $HOME/snpEff/data/albo_v2/error_1.txt \
-s $HOME/snpEff/data/albo_v2/statistic_1.txt
#
# with flag non canonical
logsave -a $HOME/snpEff/data/albo_v2/log_1nc.txt gff3_QC \
-noncg \
-g $HOME/snpEff/data/albo_v2/genes.gff \
-f $HOME/snpEff/data/genomes/albo_v2.fa \
-o $HOME/snpEff/data/albo_v2/error_1nc.txt \
-s $HOME/snpEff/data/albo_v2/statistic_1nc.txt
#
#
# lets sort it as well to see if there is a difference
# it seems some genes are missing some features, but those genes might be non coding genes
# now we can run the QC again on the sorted gff file
#
# sort
logsave -a $HOME/snpEff/data/albo_v2/log_sort.txt gff3_sort \
--sort_template $HOME/snpEff/data/albo_v3/template.txt \
-g $HOME/snpEff/data/albo_v2/genes.gff \
-og $HOME/snpEff/data/albo_v2/sorted_genes.gff
#
# now we can run the QC again on the sorted gff file
logsave -a $HOME/snpEff/data/albo_v2/log_2.txt gff3_QC \
-g $HOME/snpEff/data/albo_v2/sorted_genes.gff \
-f $HOME/snpEff/data/genomes/albo_v2.fa \
-o $HOME/snpEff/data/albo_v2/error_2.txt \
-s $HOME/snpEff/data/albo_v2/statistic_2.txt
#
# sorting also does not help resolving some of the errors, as it happens to our current assembly.
We can try to fix some of the errors
# check documentation at https://github.com/NAL-i5K/GFF3toolkit/blob/master/docs/gff3_fix.py-documentation.md
# only 30 types of errors can be corrected but there are 22 errors that there is no automatic correction
#
## ............................................................................
## 1. use gff as it is ####
#
# run gff3_fix
logsave -a $HOME/snpEff/data/albo_v3/log_fix_1.txt gff3_fix \
-qc_r $HOME/snpEff/data/albo_v3/error_1.txt \
-g $HOME/snpEff/data/albo_v3/genes.gff \
-og $HOME/snpEff/data/albo_v3/corrected_genes.gff
# no luck here it throws an error
# UnboundLocalError: local variable 'CDS_list' referenced before assignment
# gff3_fix died with exit status 1
# if you check the type of errors, you can see that some errors cannot be fixed automatically
# list of errors that can or cannot be fixed at https://github.com/NAL-i5K/GFF3toolkit/blob/master/docs/gff3_fix.py-documentation.md
head -n 100 statistic_1.txt
# Error_code Number_of_problematic_models Error_level Error_tag
# Esf0014 1 Error ##gff-version" missing from the first line
# Esf0036 1060 Error Value of a attribute contains unescaped ","
# Ema0006 26 Info Wrong phase
# Ema0007 110 Warning CDS and parent feature on different strands
# Ema0004 5079 Info Incomplete gene feature that should contain at least one mRNA, exon, and CDS
# Ema0005 1739 Info Pseudogene has invalid child feature type
# Ema0009 1957 Warning Incorrectly merged gene parent? Isoforms that do not share coding sequences are found
# Ema0002 118 Warning Protein sequence contains internal stop codons
# Ema0008 17 Warning for distinct isoforms that do not share any regions
# Emr0002 39 Warning Incorrectly split gene parent?
#
# using the non canonical option
logsave -a $HOME/snpEff/data/albo_v3/log_fix_1nc.txt gff3_fix \
-qc_r $HOME/snpEff/data/albo_v3/error_1nc.txt \
-g $HOME/snpEff/data/albo_v3/genes.gff \
-og $HOME/snpEff/data/albo_v3/corrected_genes_nc.gff
# then it works
# if you check the type of errors it had
head -n 100 statistic_1nc.txt
# Error_code Number_of_problematic_models Error_level Error_tag
# Esf0014 1 Error ##gff-version" missing from the first line
# Esf0036 1060 Error Value of a attribute contains unescaped ","
# Ema0005 1739 Info Pseudogene has invalid child feature type
#
# therefore we are able to fix these errors. The other "errors" is because the annotation was not curated,
# we do not have data supporting all gene models, exons and introns boundaries. For example,
# we can have exons of a gene that are actually 2, but the intron was not identified. We also have
# exons without parent feature - the software expect exons to be in order by transcript. If we have
# interspaced exons of transcripts, it results in "orphan" exons. It did not follow what our gene
# models expected
# we can use this version to build the snpEff database
#
# we can try the sorted files
## ............................................................................
## 2. sort gff by scaffolds and positions ####
#
logsave -a $HOME/snpEff/data/albo_v3/log_fix_2.txt gff3_fix \
-qc_r $HOME/snpEff/data/albo_v3/error_2.txt \
-g $HOME/snpEff/data/albo_v3/sorted_genes.gff \
-og $HOME/snpEff/data/albo_v3/corrected_sorted_genes.gff
# we get a different error
# ValueError: max() arg is an empty sequence
# gff3_fix died with exit status 1
#
## ............................................................................
## 3. sort gff by transcript ####
#
logsave -a $HOME/snpEff/data/albo_v3/log_fix_3.txt gff3_fix \
-qc_r $HOME/snpEff/data/albo_v3/error_3.txt \
-g $HOME/snpEff/data/albo_v3/isoform_sorted_genes.gff \
-og $HOME/snpEff/data/albo_v3/corrected_isoform_sorted_genes.gff
# ufortunately we get the error again
# ValueError: max() arg is an empty sequence
# gff3_fix died with exit status 1
#
### . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..
### count errors from log files ####
#
# original
grep -c 'There are no CDSs' log_1.txt
# 12677
grep -c 'Parent feature missing' log_1.txt
# 11560 - we have 11,560 orphan exons
# sorted by scaffold following template
grep -c 'There are no CDSs' log_2.txt
# 12677
grep -c 'Parent feature missing' log_2.txt
# 0
# sorted by isoform following template
grep -c 'There are no CDSs' log_3.txt
# 12677
grep -c 'Parent feature missing' log_3.txt
# 0
# sorting removes the orphan exons but we still have 12,677 genes with incomplete or missing CDS
# sorting is not having the impact I expected.
#
We can check the Ensembl annotation files for the first genome assembly of Ae. albopictus and check the latest annotation of Aedes aegypti genome and compare it to the Ae. albopictus
# we can check the data in Ensembl, they curated their annotations very well.
# check the paper about annotations, what a canonical transcript is and they find it https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4919035/pdf/baw093.pdf
#
# we have aegypti and albopictus genome on ensembl. The albopictus genome is AaloF1
https://metazoa.ensembl.org/info/data/ftp/index.html
#
# aegypti
mkdir $HOME/snpEff/data/aegypti_v5
# decompress gff
zcat Aedes_aegypti_lvpagwg.AaegL5.55.gff3.gz > genes.gff
# decompress genome and index it
zcat Aedes_aegypti_lvpagwg.AaegL5.dna.toplevel.fa.gz > aegypti_v5.fa
samtools faidx $HOME/snpEff/data/genomes/aegypti_v5.fa
#
# Ensembl albopictus AaloF1 - first genome assembly Chen et al, 2015 - 154,782 scaffolds
zcat Aedes_albopictus.AaloF1.dna.toplevel.fa.gz > AaloF1.fa
samtools faidx $HOME/snpEff/data/genomes/AaloF1.fa
zcat Aedes_albopictus.AaloF1.55.gff3.gz > genes.gff
Run QC on Ensembl annotations
# we will not try to sort the files since it creates other artifacts
#
# Ae. aegypti
#
## ............................................................................
## 1. use gff as it is ####
# run the qc to get errors
logsave -a $HOME/snpEff/data/aegypti_v5/log_1.txt gff3_QC \
-g $HOME/snpEff/data/aegypti_v5/genes.gff \
-f $HOME/snpEff/data/genomes/aegypti_v5.fa \
-o $HOME/snpEff/data/aegypti_v5/error_1.txt \
-s $HOME/snpEff/data/aegypti_v5/statistic_1.txt
#
# with flag non canonical
logsave -a $HOME/snpEff/data/aegypti_v5/log_1nc.txt gff3_QC \
-noncg \
-g $HOME/snpEff/data/aegypti_v5/genes.gff \
-f $HOME/snpEff/data/genomes/aegypti_v5.fa \
-o $HOME/snpEff/data/aegypti_v5/error_1nc.txt \
-s $HOME/snpEff/data/aegypti_v5/statistic_1nc.txt
#
# AaloF1
logsave -a $HOME/snpEff/data/AaloF1/log_1.txt gff3_QC \
-g $HOME/snpEff/data/AaloF1/genes.gff \
-f $HOME/snpEff/data/genomes/AaloF1.fa \
-o $HOME/snpEff/data/AaloF1/error_1.txt \
-s $HOME/snpEff/data/AaloF1/statistic_1.txt
#
# with flag non canonical
logsave -a $HOME/snpEff/data/AaloF1/log_1nc.txt gff3_QC \
-noncg \
-g $HOME/snpEff/data/AaloF1/genes.gff \
-f $HOME/snpEff/data/genomes/AaloF1.fa \
-o $HOME/snpEff/data/AaloF1/error_1nc.txt \
-s $HOME/snpEff/data/AaloF1/statistic_1nc.txt
We can parse the data from the QC we did with the different genomes
# create dir and transfer the data from cluster to it
mkdir -p output/SnpEff/gff_qc/{albo_v1,albo_v2,albo_v3,aegypti_v5}
Import data into R
# I also included the logs and the error descriptions in the output directory for each genome assembly
# import data
## ............................................................................
## aegypti_v5 ####
#
aegypti_v5_c <- # using canonical gene models
read_delim(
here(
"output", "SnpEff", "gff_qc", "aegypti_v5", "statistic_1.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "cnccc"
) |>
mutate( # create column with info about assembly and flags
assembly = "AaegL5",
model = "Canonical"
)
#
aegypti_v5_nc <- # using non canonical gene models
read_delim(
here(
"output", "SnpEff", "gff_qc", "aegypti_v5", "statistic_1nc.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "cnccc"
) |>
mutate(
assembly = "AaegL5",
model = "Non-canonical"
)
#
## ............................................................................
## albo_v1 ####
#
albo_v1_c <- # using canonical gene models
read_delim(
here(
"output", "SnpEff", "gff_qc", "albo_v1", "statistic_1.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "cnccc"
) |>
mutate(
assembly = "AalbF1",
model = "Canonical"
)
#
albo_v1_nc <- # using non canonical gene models
read_delim(
here(
"output", "SnpEff", "gff_qc", "albo_v1", "statistic_1nc.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "cnccc"
) |>
mutate(
assembly = "AalbF1",
model = "Non-canonical"
)
#
## ............................................................................
## albo_v2 ####
#
albo_v2_c <- # using canonical gene models
read_delim(
here(
"output", "SnpEff", "gff_qc", "albo_v2", "statistic_1.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "cnccc"
) |>
mutate(
assembly = "AalbF2",
model = "Canonical"
)
#
albo_v2_nc <- # using non canonical gene models
read_delim(
here(
"output", "SnpEff", "gff_qc", "albo_v2", "statistic_1nc.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "cnccc"
) |>
mutate(
assembly = "AalbF2",
model = "Non-canonical"
)
#
## ............................................................................
## albo_v3 ####
#
albo_v3_c <- # using canonical gene models
read_delim(
here(
"output", "SnpEff", "gff_qc", "albo_v3", "statistic_1.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "cnccc"
) |>
mutate(
assembly = "AalbF3",
model = "Canonical"
)
#
albo_v3_nc <- # using non canonical gene models
read_delim(
here(
"output", "SnpEff", "gff_qc", "albo_v3", "statistic_1nc.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "cnccc"
) |>
mutate(
assembly = "AalbF3",
model = "Non-canonical"
)
#
Merge the data sets and calculate totals
# row bind the tibbles
gff_qc <-
bind_rows(
list(
aegypti_v5_c, aegypti_v5_nc, albo_v1_c, albo_v1_nc, albo_v2_c, albo_v2_nc, albo_v3_c, albo_v3_nc
)
)
#
# calculate total for each group
gff_qc2 <-
gff_qc |>
group_by(
assembly, model
) |>
group_split() |>
purrr::map_df(
janitor::adorn_totals
)
#
# summary errors
gff_qc3 <-
gff_qc |>
group_by(assembly,model) |>
summarise(Total = sum(Number_of_problematic_models), .groups = "keep") # |>
# knitr::kable()
Create table with gff QC data
#
# make table
ftab_gff <- flextable(
gff_qc2 |>
rename(
"Error code" = 1,
"Number of problematic models" = 2,
"Error level" = 3,
"Error tag" = 4,
Assembly = 5,
Model = 6
) |>
select( # change the order of the columns by index
c(
5, 1, 6, 2, 3, 4
)
)
) |>
set_caption(
as_paragraph(
as_chunk(
"Errors detected in genome annotations files according to the gene model by GFF3ToolKit",
props = fp_text_default(
font.family = "Cambria"
)
)
),
word_stylename = "Table Caption"
)
#
# check it
ftab_gff |>
autofit()
Assembly | Error code | Model | Number of problematic models | Error level | Error tag |
---|---|---|---|---|---|
AaegL5 | Ema0009 | Canonical | 447 | Warning | Incorrectly merged gene parent? Isoforms that do not share coding sequences are found |
AaegL5 | Ema0004 | Canonical | 366 | Info | Incomplete gene feature that should contain at least one mRNA, exon, and CDS |
AaegL5 | Ema0008 | Canonical | 28 | Warning | Warning for distinct isoforms that do not share any regions |
AaegL5 | Ema0002 | Canonical | 91 | Warning | Protein sequence contains internal stop codons |
AaegL5 | Esf0003 | Canonical | 2,310 | Error | Strand information missing |
- | Total | - | 3,242 | - | - |
AaegL5 | Esf0003 | Non-canonical | 2,310 | Error | Strand information missing |
- | Total | - | 2,310 | - | - |
AalbF1 | Esf0012 | Canonical | 82 | Info | Found Ns in a feature using the external FASTA |
AalbF1 | Esf0003 | Canonical | 154,782 | Error | Strand information missing |
- | Total | - | 154,864 | - | - |
AalbF1 | Esf0012 | Non-canonical | 82 | Info | Found Ns in a feature using the external FASTA |
AalbF1 | Esf0003 | Non-canonical | 154,782 | Error | Strand information missing |
- | Total | - | 154,864 | - | - |
AalbF2 | Esf0036 | Canonical | 2,234 | Error | Value of a attribute contains unescaped "," |
AalbF2 | Ema0004 | Canonical | 9,530 | Info | Incomplete gene feature that should contain at least one mRNA, exon, and CDS |
AalbF2 | Ema0005 | Canonical | 3,994 | Info | Pseudogene has invalid child feature type |
AalbF2 | Ema0009 | Canonical | 4,285 | Warning | Incorrectly merged gene parent? Isoforms that do not share coding sequences are found |
AalbF2 | Ema0002 | Canonical | 235 | Warning | Protein sequence contains internal stop codons |
AalbF2 | Ema0008 | Canonical | 38 | Warning | Warning for distinct isoforms that do not share any regions |
AalbF2 | Emr0001 | Canonical | 1 | Warning | Duplicate transcript found |
AalbF2 | Emr0002 | Canonical | 137 | Warning | Incorrectly split gene parent? |
- | Total | - | 20,454 | - | - |
AalbF2 | Esf0036 | Non-canonical | 2,234 | Error | Value of a attribute contains unescaped "," |
AalbF2 | Ema0005 | Non-canonical | 3,994 | Info | Pseudogene has invalid child feature type |
- | Total | - | 6,228 | - | - |
AalbF3 | Esf0014 | Canonical | 1 | Error | ##gff-version" missing from the first line |
AalbF3 | Esf0036 | Canonical | 1,060 | Error | Value of a attribute contains unescaped "," |
AalbF3 | Ema0006 | Canonical | 26 | Info | Wrong phase |
AalbF3 | Ema0007 | Canonical | 110 | Warning | CDS and parent feature on different strands |
AalbF3 | Ema0004 | Canonical | 5,079 | Info | Incomplete gene feature that should contain at least one mRNA, exon, and CDS |
AalbF3 | Ema0005 | Canonical | 1,739 | Info | Pseudogene has invalid child feature type |
AalbF3 | Ema0009 | Canonical | 1,957 | Warning | Incorrectly merged gene parent? Isoforms that do not share coding sequences are found |
AalbF3 | Ema0002 | Canonical | 118 | Warning | Protein sequence contains internal stop codons |
AalbF3 | Ema0008 | Canonical | 17 | Warning | Warning for distinct isoforms that do not share any regions |
AalbF3 | Emr0002 | Canonical | 39 | Warning | Incorrectly split gene parent? |
- | Total | - | 10,146 | - | - |
AalbF3 | Esf0014 | Non-canonical | 1 | Error | ##gff-version" missing from the first line |
AalbF3 | Esf0036 | Non-canonical | 1,060 | Error | Value of a attribute contains unescaped "," |
AalbF3 | Ema0005 | Non-canonical | 1,739 | Info | Pseudogene has invalid child feature type |
- | Total | - | 2,800 | - | - |
# save
# set doc properties
sect_properties <-
prop_section(
page_size = page_size(
orient = "landscape",
width = 8.3,
height = 11.7
),
type = "continuous",
page_margins = page_mar()
)
#
# save it using the properties
ftab_gff |>
autofit() |>
save_as_docx(
path = here(
"output", "SnpEff", "Errors_annotations_GFF3ToolKit.docx"
),
pr_section = sect_properties
)
Summary of errors
# summary errors
ftab_gff2 <- flextable(
gff_qc3 |>
rename(
Assembly = 1,
Model = 2
)
) |>
set_caption(
as_paragraph(
as_chunk(
"Summary of errors detected in genome annotations files by gene model using GFF3ToolKit",
props = fp_text_default(
font.family = "Cambria"
)
)
),
word_stylename = "Table Caption"
)
#
# check it
ftab_gff2 |>
autofit()
Assembly | Model | Total |
---|---|---|
AaegL5 | Canonical | 3,242 |
AaegL5 | Non-canonical | 2,310 |
AalbF1 | Canonical | 154,864 |
AalbF1 | Non-canonical | 154,864 |
AalbF2 | Canonical | 20,454 |
AalbF2 | Non-canonical | 6,228 |
AalbF3 | Canonical | 10,146 |
AalbF3 | Non-canonical | 2,800 |
# save
# set doc properties
sect_properties <-
prop_section(
page_size = page_size(
orient = "portrait",
width = 8.3,
height = 11.7
),
type = "continuous",
page_margins = page_mar()
)
#
# save it using the properties
ftab_gff2 |>
autofit() |>
save_as_docx(
path = here(
"output", "SnpEff", "Summary_errors_annotations_GFF3ToolKit.docx"
),
pr_section = sect_properties
)
Generate peptide sequences
logsave -a $HOME/snpEff/data/albo_v3/log_protein_1.txt gff3_to_fasta \
--gff $HOME/snpEff/data/albo_v3/genes.gff \
--fasta $HOME/snpEff/data/genomes/albo_v3.fa \
--sequence_type pep \
--defline simple \
--output_prefix $HOME/snpEff/data/albo_v3/albo
# it will create a peptide file names albo_pep.fa, we need to change the name of the file for snpEff database building
mv $HOME/snpEff/data/albo_v3/albo_pep.fa $HOME/snpEff/data/albo_v3/protein.fa
# check the file
head $HOME/snpEff/data/albo_v3/protein.fa
wc -l $HOME/snpEff/data/albo_v3/protein.fa
# 47990 sequences
Create data base for genome on the cluster
# we need to create a dir for the database at the SnpEff home directory
# create interactive session ( 2CPU 10Gb)
srun --pty -N 1 -n 1 -c 2 -p interactive zsh
cd /home/lvc26/snpEff;
mkdir -p data/{genomes,albo_v3}
#
# create protein sequences from gff
# check http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread
git clone https://github.com/gpertea/gffread
cd gffread
make release
# export path
export PATH=$PATH:/gpfs/ycga/home/lvc26/gffread
#
# check gff file
# In order to see the GTF2 version of the same transcripts, the -T option should be added
gffread -E $HOME/snpEff/data/albo_v3/genes.gff -T -o- | more
# The examples above also show that gffread can be used to convert a file between GTF2 and GFF3 file formats.
# now we have to put the gff the genome file inside the directories we create the data base
# name the them accordingly, your directories show look like this
$HOME/snpEff/data/albo_v3/genes.gff
$HOME/snpEff/data/albo_v3/genes.gtf
$HOME/snpEff/data/genomes/albo_v3.fa
#
# index the genome
module load SAMtools/1.16-GCCcore-10.2.0
samtools faidx $HOME/snpEff/data/genomes/albo_v3.fa
#
# earlier I created a fasta file with protein sequences
$HOME/snpEff/data/albo_v3/protein.fa
#
# then run chmod on the executable file
chmod 775 snpEff.jar
#
# load gatk to automatic load java 11, I tried some versions and it did not work, saying it the version was too old
module load GATK/4.2.6.1-GCCcore-10.2.0-Java-11
#
## ............................................................................
## create database ####
#
# when I first run it, I used 4 CPUs and 20GB of memory, however, I got the message "java died with exit status 255"
# at the end of the database build. It means that I did not have enough memory. Fix it requesting more memory
# the command below requests 1CP with 40GB of memory for 30 minutes
salloc -t 0:30:00 --mem=40G
# it also failed
# submit batch job instead
# we can use dsq, first we create script, we can use echo and save it as file
# check snpEff options
java -Xmx2g -jar $HOME/snpEff/snpEff.jar build
# load gatk again
module load GATK/4.2.6.1-GCCcore-10.2.0-Java-11
#
# build database
logsave -a $HOME/snpEff/data/albo_v3/log_database_1.txt java \
-Xmx40g \
-jar $HOME/snpEff/snpEff.jar \
build \
-gff3 \
-v albo_v3
#
# create command file
echo "module load GATK/4.2.6.1-GCCcore-10.2.0-Java-11; java -Xmx50g -jar /home/lvc26/snpEff/snpEff.jar build -noCheckProtein -noCheckCds -gff3 -v albo_v3" > $HOME/snpEff/data/albo_v3/1.snpEff_build_database.txt
#
# load dsq
module load dSQ/1.05
# run dsq
dsq \
--job-file $HOME/snpEff/data/albo_v3/1.snpEff_build_database.txt \
--mem-per-cpu 50g \
--cpus-per-task=1 \
-t 00-00:15:00 \
--mail-type END \
--partition=scavenge \
--job-name snpEff \
--output $HOME/snpEff/data/albo_v3/dsq-jobfile-%A_%a-%N.txt \
--batch-file $HOME/snpEff/data/albo_v3/1.snpEff.sh
#
# submit it
sbatch /home/lvc26/snpEff/data/albo_v3/1.snpEff.sh
# autopsy
dsqa -j 19329159
# completed and database is ready for use.
# I checked the email of the job and it only used less than 2GB of memory, so it was failing because of the database checks that failed.
# then next time we need to use it check help
java -Xmx2g -jar $HOME/snpEff/snpEff.jar
#
# to submit a batch job it is probably to load gatk to load a recent java version, and then use snpEff. It will probably be easier to submit an array job with all diffent parameters.
module load GATK/4.2.6.1-GCCcore-10.2.0-Java-11
java -Xmx8g -jar $HOME/snpEff/snpEff.jar
Our first step is to create a database using the AalbF3 using the annotation file (gff format). There is one database for Aedes albopictus on the SnpEff databases, but it is for the AalbF1. We cannot use it.
## chr3.9 Gnomon gene 10376 142706 . - . ID=gene-LOC109400916;Dbxref=GeneID:109400916;Name=LOC109400916;gbkey=Gene;gene=LOC109400916;gene_biotype=protein_coding;partial=true;start_range=.,10376
## chr3.9 Gnomon mRNA 10376 142706 . - . ID=rna-XM_029851713.1;Parent=gene-LOC109400916;Dbxref=GeneID:109400916,Genbank:XM_029851713.1;Name=XM_029851713.1;gbkey=mRNA;gene=LOC109400916;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 97%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 57 samples with support for all annotated introns;partial=true;product=UPF0565 protein C2orf69 homolog;start_range=.,10376;transcript_id=XM_029851713.1
Create database for AalbF3. SnpEff documentation HERE:
There are three main steps when building a database:
Step 1: Configure a new genome in SnpEff's config file snpEff.config.
Step 2: Build using gene annotations and reference sequences
Step 3: Checking the database: SnpEff will check the database by comparing predicted protein sequences and CDS sequences with ones provided by the user.
Documentation regarding building database using GFF file:
GFF File name
SnpEff expects the GFF file to be located at
$SNPEFF_HOME/data/GENOME_NAME/genes.gff
where:
$SNPEFF_HOME is the directory where SnpEff is installed (usually $HOME/snpEff)
GENOME_NAME is the genome name of the genome you are trying to build, which MUST match the name you added in the config file snpEff.config
Note: The file name can be genes.gff.gz if it's compressed using gzip.
# change the path to the location you have SnpEff
# make sure you move the gene.gff to the directory you have SnpEff
# also add the genome to snpEff.config at SnpEff home directory
echo "albo_v3.genome : Albopictus" >> snpEff.config
#
# then build the database, in my case I had a few databases that I used previously. You can name it anyway you want.
java -Xmx8g -jar /Users/lucianocosme/Packages/snpEff/snpEff.jar build -gff3 -v albo_v3
Check SnpEff options
We will annotate only SNPs that passed quality control. Our first step is to convert it to vcf file format
# We will first annotate the SNPs from the wild caught animals. We will use the file after quality control.
plink2 \
--bfile data/raw_data/albo/variant_annotation/file7 \
--allow-extra-chr \
--fa data/genome/albo.fasta.gz \
--ref-from-fa 'force' \
--recode vcf \
--out output/SnpEff/wild \
--silent;
grep 'variants' output/SnpEff/wild.log;
Check sample names on the vcf
## OKI_1001
## OKI_1002
## OKI_1003
## OKI_1004
## OKI_1005
## OKI_1006
## OKI_1007
## OKI_1008
## OKI_1009
## OKI_1011
# then annotate the vcf with the v4
java -Xmx8g -jar /Users/lucianocosme/Packages/snpEff/snpEff.jar albo_v3 output/SnpEff/wild.vcf -v > output/SnpEff/chip_ann.vcf
#
# we will get some warnings or errors because the annotation was lifted from AalbF2 to the AalbF3, with some transcripts missing some features.
# therefore we need to focus on genes with id starting with LOC
Move files to output
# SnpEff will create a .txt and a .html file in our project root directory, we can move them to the right place
mv snpEff_* output/SnpEff/
# you can double click the file snpEff_summary.html to see the summary of our annotation
Get types of variants
# i wrote a bash script to get the number of variants we have, you might have to run chmod 775 if you get permission denied
./scripts/analysis/variants_by_effect.sh output/SnpEff/chip_ann.vcf
Alternatively, we can create a file with the effects and how they are annotated in the vcf file
echo "Nonsynonymous missense_variant
Synonymous synonymous_variant
Intron intron_variant
Intergenic intergenic_region
3UTR 3_prime_UTR_variant
5UTR 5_prime_UTR" > output/SnpEff/variant_types.txt
# not the most elegant but works
cat output/SnpEff/variant_types.txt | while IFS=" " read -r variant effect
do
echo "$variant"; cat output/SnpEff/chip_ann.vcf | sed '/##/d' | grep -c "$effect"
done | xargs -n2 > output/SnpEff/annotations_summary.txt
Now, we can import it into R and make a table
# import data
var_effects <- read_delim(
here(
"output", "SnpEff", "annotations_summary.txt"
),
col_names = FALSE,
show_col_types = FALSE,
col_types = "ci"
) |>
rename(
Effect = 1,
Number = 2
) |>
mutate(
Effect = as.factor(
Effect
),
Percentage = prop.table(
Number
) * 100,
across(
3, ~round(., 2)
)
) |>
arrange(
-Percentage
)
# make table
ftabv <- flextable(
var_effects |>
rename(
Effect = 1,
"Number of SNPs" = 2
)
) |>
set_caption(
as_paragraph(
as_chunk(
"Variant effects for SNPs genotyped with the Chip",
props = fp_text_default(
font.family = "Cambria"
)
)
),
word_stylename = "Table Caption"
)
# Set theme
ftabv <- flextable::theme_zebra(ftabv)
# print(ftab, preview = "docx") # to open and see it in Microsoft Word.
ftabv
Effect | Number of SNPs | Percentage |
---|---|---|
Intron | 24,387 | 39.49 |
Synonymous | 14,323 | 23.20 |
Intergenic | 13,758 | 22.28 |
3UTR | 3,697 | 5.99 |
5UTR | 3,642 | 5.90 |
Nonsynonymous | 1,942 | 3.14 |
# save
# then you can open MS Word and make adjustments
ftabv |>
autofit() |>
save_as_docx(
path = here(
"output", "SnpEff", "effect_type_and_number_SNPs.docx"
)
)
We can also make a plot of the effects
# create vector of colors for our plot
# Wong, B. Points of view: Color blindness. Nat Methods (2011).
bla <- '#000000'
blu <- '#0072b2'
grb <- '#56b4e9'
lir <- '#cc79a7'
gre <- '#009e73'
red <- '#d55e00'
org <- '#e69f00'
yel <- '#f0e442'
gry <- '#BBBBBB'
#
# create vector
# option1
# special_colors <- list(
# "Intron" = "#BBBBBB",
# "Intergenic" = "#0072b2",
# "Synonymous" = "#009e73",
# "5UTR" = "#56b4e9",
# "3UTR" = "#cc79a7",
# "Nonsynonymous" = "#d55e00"
# )
#
# Option 2
special_colors <- list(
"Intron" = "#C1CDCD",
"Intergenic" = "#1AC722",
"Synonymous" = "#FF8C00",
"5UTR" = "#87CEFA",
"3UTR" = "#FF3030",
"Nonsynonymous" = "pink"
)
#
# make plot
var_effects |>
mutate( # we can use mutate to change the class of column
Effect = as.factor( # we make column "Effect" factor
Effect
),
Effect = fct_reorder( # library forcats to re-order by the number of variants
Effect, Number
)
) |>
ggplot() +
geom_col( # to make bar plot with different colors
aes( # mapping
x = Effect,
y = Number,
fill = Effect
),
color = "yellow" # for the border
) +
labs( # to change the ylab and tittle
title = "Chip variants per SnpEff annotation",
y = "Number of annotated variants",
x = "Annotated effect",
caption = "Functional annotation of the SNPs in the CHIP after QC (~60k)"
) +
hrbrthemes::theme_ipsum( # I like this theme, but you can use any
base_family = "Roboto Condensed", # needs to have extrafont library
axis_text_size = 12,
axis_title_size = 14,
plot_margin = margin(
10, 10, 10, 10
),
grid = TRUE,
grid_col = "#fabbe2"
) +
scale_fill_manual(
values = unlist(
special_colors
)
) +
scale_color_manual(
values = unlist(
special_colors
)
) +
theme(
legend.position = "none",
panel.grid.major = element_line(
linetype = "dashed",
linewidth = 0.2,
),
panel.grid.minor = element_line(
linetype = "dashed",
linewidth = 0.2
)
) +
scale_y_continuous(
trans = "log10",
breaks = scales::log_breaks(
n = 10
),
labels = label_comma()
) +
annotation_logticks(
sides = "b"
) +
geom_label(
aes(
x = Effect,
y = Number,
color = Effect,
label=paste(
formatC(
Number, format="f", big.mark = ",", digits=0
),"\n","(",Percentage,"%)"
)
),
face = "bold",
size = 2,
family = "",
alpha = .9,
) +
coord_flip() # to make it easier to read
#
# save plot
ggsave(
here(
"output", "SnpEff", "figures", "SNPs_effect_chip.pdf"
),
width = 10,
height = 4,
units = "in"
)
We can make variant lists based on their effect. It is useful for other analysis. However, since the annotation is not optimal, we have to be careful.
# variants from intergenic regions
sed '/##/d' output/SnpEff/chip_ann.vcf | grep "intergenic_region" | awk '{print $3}' > output/SnpEff/intergenic_SNPs.txt
#
# variants from introns
sed '/##/d' output/SnpEff/chip_ann.vcf | grep "intron_variant" | awk '{print $3}' > output/SnpEff/intron_SNPs.txt
#
# variants from 3'UTR
sed '/##/d' output/SnpEff/chip_ann.vcf | grep "3_prime_UTR_variant" | awk '{print $3}' > output/SnpEff/3UTR_SNPs.txt
#
# variants from 5'UTR
sed '/##/d' output/SnpEff/chip_ann.vcf | grep "5_prime_UTR" | awk '{print $3}' > output/SnpEff/5UTR_SNPs.txt
#
# variants from both 3 and 5'UTRs
sed '/##/d' output/SnpEff/chip_ann.vcf | grep "3_prime_UTR_variant\|5_prime_UTR" | awk '{print $3}' > output/SnpEff/UTRs_SNPs.txt
#
# synonymous
sed '/##/d' output/SnpEff/chip_ann.vcf | grep "synonymous_variant" | awk '{print $3}' > output/SnpEff/Synonymous_SNPs.txt
#
# nonsynonymous
sed '/##/d' output/SnpEff/chip_ann.vcf | grep "missense_variant" | awk '{print $3}' > output/SnpEff/Nonsynonymous_SNPs.txt
We can also import the output of SnpEff into R
# import SnpEff output
snpEff <-
read_delim(
here(
"output", "SnpEff", "snpEff_genes.txt"
),
delim = '\t',
comment = "# ",
col_names = TRUE,
show_col_types = FALSE
) |>
rename(
GeneName = 1
)
# we can check the name of the columns
colnames(snpEff)
## [1] "GeneName"
## [2] "GeneId"
## [3] "TranscriptId"
## [4] "BioType"
## [5] "variants_impact_HIGH"
## [6] "variants_impact_LOW"
## [7] "variants_impact_MODERATE"
## [8] "variants_impact_MODIFIER"
## [9] "variants_effect_3_prime_UTR_variant"
## [10] "variants_effect_5_prime_UTR_premature_start_codon_gain_variant"
## [11] "variants_effect_5_prime_UTR_variant"
## [12] "variants_effect_downstream_gene_variant"
## [13] "variants_effect_initiator_codon_variant"
## [14] "variants_effect_intron_variant"
## [15] "variants_effect_missense_variant"
## [16] "variants_effect_non_coding_transcript_exon_variant"
## [17] "variants_effect_non_coding_transcript_variant"
## [18] "variants_effect_splice_acceptor_variant"
## [19] "variants_effect_splice_donor_variant"
## [20] "variants_effect_splice_region_variant"
## [21] "variants_effect_stop_gained"
## [22] "variants_effect_stop_lost"
## [23] "variants_effect_stop_retained_variant"
## [24] "variants_effect_synonymous_variant"
## [25] "variants_effect_upstream_gene_variant"
Lets get some gene counts
# because the annotation was lifted from one genome assembly to a new one, we have some issues with the functional annotation. For example, we have genes missing start or stop codons, exons, etc. Therefore, we can focus on genes with the initial ID. SnpEff ties to annotate genes with missing exons or that have been broken up due to lifting.
# lets count how many genes we have starting with the letter LOC and remove rows with exons only
protein <-
snpEff |>
filter(
!grepl(
"exon",
GeneName
)
) |>
filter(
grepl(
"LOC",
GeneName
)
) |>
count(
BioType,
sort = TRUE
) |>
rename(
"BioType of the gene" = 1,
"Number of genes with SNPs" = 2
)
#
# make table
set_flextable_defaults(na_str = "NA", big.mark = ",")
ftabp <-
flextable(
protein
) |>
set_caption(
as_paragraph(
as_chunk(
"Number of genes with SNPs",
props = fp_text_default(
font.family = "Cambria"
)
)
),
word_stylename = "Table Caption"
)
#
# Apply zebra theme
ftabp <- flextable::theme_zebra(ftabp)
# print(ftab, preview = "docx") # to open and see it in Microsoft Word.
ftabp
BioType of the gene | Number of genes with SNPs |
---|---|
protein_coding | 17,461 |
pseudogene | 982 |
NA | 448 |
snRNA | 13 |
snoRNA | 5 |
rRNA | 2 |
The plot and statistics above are an approximation since the annotation was lifted, and we have more variant types I did not list. We could import the vcf to R, and parse it.
Import vcf to R.
# following the code shared here https://joemcgirr.github.io/files/code_tutorials/my_genome/SnpEFF.html
start_time <- Sys.time()
# Wong, B. Points of view: Color blindness. Nat Methods (2011).
bla <- '#000000'
blu <- '#0072b2'
grb <- '#56b4e9'
lir <- '#cc79a7'
gre <- '#009e73'
red <- '#d55e00'
org <- '#e69f00'
yel <- '#f0e442'
gry <- '#BBBBBB'
#
# this step we did it earlier
# wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
# unzip snpEff_latest_core.zip
We need to get gatk to convert vcf to table
# you can run it on the terminal
wget \
https://github.com/broadinstitute/gatk/releases/download/4.3.0.0/gatk-4.3.0.0.zip \
--directory-prefix output/files
#
# decompress it
unzip \
/Users/lucianocosme/Dropbox/Albopictus/manuscript_chip/data/no_autogenous/albo_chip/output/files/gatk-4.3.0.0.zip \
-d /Users/lucianocosme/Packages
# export path
export PATH=$PATH:/Users/lucianocosme/Packages/gatk-4.3.0.0
#
Convert vcf to table
# export path
export PATH=$PATH:/Users/lucianocosme/Packages/gatk-4.3.0.0;
gatk \
VariantsToTable \
-V output/SnpEff/chip_ann.vcf \
-F CHROM -F POS -F TYPE -F ID -F ANN -F LOF -F NMD -GF AD -GF DP -GF GQ -GF GT \
-O output/SnpEff/chip_ann.txt
Import data to R
# read the data
ann <- vroom(
here(
"output", "SnpEff", "chip_ann.txt"
),
guess_max = 1000000,
show_col_types = FALSE
)
#
Create annotation types
# table based on SnpEFF documentation
ann_types <- data.frame(
impact = c(
"HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "LOW", "LOW", "LOW", "LOW", "LOW", "LOW", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER"
),
ontology_term = c(
"chromosome_number_variation", "exon_loss_variant", "frameshift_variant", "rare_amino_acid_variant", "splice_acceptor_variant", "splice_donor_variant", "start_lost", "stop_gained", "stop_lost", "transcript_ablation", "3_prime_UTR_truncation&exon_loss", "5_prime_UTR_truncation&exon_loss_variant", "coding_sequence_variant-moderate", "conservative_inframe_deletion", "conservative_inframe_insertion", "disruptive_inframe_deletion", "disruptive_inframe_insertion", "missense_variant", "regulatory_region_ablation", "splice_region_variant-moderate", "TFBS_ablation", "5_prime_UTR_premature_start_codon_gain_variant", "initiator_codon_variant", "splice_region_variant-low", "start_retained", "stop_retained_variant", "synonymous_variant", "3_prime_UTR_variant", "5_prime_UTR_variant", "coding_sequence_variant-modifier", "conserved_intergenic_variant", "conserved_intron_variant", "downstream_gene_variant", "exon_variant", "feature_elongation", "feature_truncation", "gene_variant", "intergenic_region", "intragenic_variant", "intron_variant", "mature_miRNA_variant", "miRNA", "NMD_transcript_variant", "non_coding_transcript_exon_variant", "non_coding_transcript_variant", "regulatory_region_amplification", "regulatory_region_variant", "TF_binding_site_variant", "TFBS_amplification", "transcript_amplification", "transcript_variant", "upstream_gene_variant"
)
)
# check the possible variant annotations
ann_types |>
gt() |>
tab_header(
title = "Putative impact for sequence ontology terms output by SnpEFF"
)
Putative impact for sequence ontology terms output by SnpEFF | |
impact | ontology_term |
---|---|
HIGH | chromosome_number_variation |
HIGH | exon_loss_variant |
HIGH | frameshift_variant |
HIGH | rare_amino_acid_variant |
HIGH | splice_acceptor_variant |
HIGH | splice_donor_variant |
HIGH | start_lost |
HIGH | stop_gained |
HIGH | stop_lost |
HIGH | transcript_ablation |
MODERATE | 3_prime_UTR_truncation&exon_loss |
MODERATE | 5_prime_UTR_truncation&exon_loss_variant |
MODERATE | coding_sequence_variant-moderate |
MODERATE | conservative_inframe_deletion |
MODERATE | conservative_inframe_insertion |
MODERATE | disruptive_inframe_deletion |
MODERATE | disruptive_inframe_insertion |
MODERATE | missense_variant |
MODERATE | regulatory_region_ablation |
MODERATE | splice_region_variant-moderate |
MODERATE | TFBS_ablation |
LOW | 5_prime_UTR_premature_start_codon_gain_variant |
LOW | initiator_codon_variant |
LOW | splice_region_variant-low |
LOW | start_retained |
LOW | stop_retained_variant |
LOW | synonymous_variant |
MODIFIER | 3_prime_UTR_variant |
MODIFIER | 5_prime_UTR_variant |
MODIFIER | coding_sequence_variant-modifier |
MODIFIER | conserved_intergenic_variant |
MODIFIER | conserved_intron_variant |
MODIFIER | downstream_gene_variant |
MODIFIER | exon_variant |
MODIFIER | feature_elongation |
MODIFIER | feature_truncation |
MODIFIER | gene_variant |
MODIFIER | intergenic_region |
MODIFIER | intragenic_variant |
MODIFIER | intron_variant |
MODIFIER | mature_miRNA_variant |
MODIFIER | miRNA |
MODIFIER | NMD_transcript_variant |
MODIFIER | non_coding_transcript_exon_variant |
MODIFIER | non_coding_transcript_variant |
MODIFIER | regulatory_region_amplification |
MODIFIER | regulatory_region_variant |
MODIFIER | TF_binding_site_variant |
MODIFIER | TFBS_amplification |
MODIFIER | transcript_amplification |
MODIFIER | transcript_variant |
MODIFIER | upstream_gene_variant |
Make variant plots. First, get variant effect numbers
# function to calculate the number of variants annotated for each ontology
# not my code, it is from https://joemcgirr.github.io/files/code_tutorials/my_genome/SnpEFF.html
n_variants <- c()
for (ot in ann_types$ontology_term) {
n_variants <- c(n_variants, (filter(ann, grepl(ot, ANN)) |> nrow()))
# print(ot)
}
#
# create new column and sort
ann_types$n_variants <- n_variants
ann_types <- arrange(ann_types, n_variants)
#
# set factors
ann_types$ontology_term <- factor(ann_types$ontology_term, levels = ann_types$ontology_term)
ann_types$impact <- factor(ann_types$impact, levels = c("HIGH","MODERATE","LOW","MODIFIER"))
Prepare data for plotting
# arrange and remove those with those with no variants
ann_types2 <-
ann_types |>
mutate(
ontology_term = str_remove_all( # to remove the _variant from the column
ontology_term, "_variant"
)
) |>
arrange(
-n_variants, ann_types
) |>
filter(
!row_number() %in% c(1, 12) # remove row with by index, we remove those that have the same values
) |>
filter(
n_variants > 0
) |>
mutate(
impact = tolower( # we use base R to convert all values on column Effect to lower case
impact
),
impact = str_to_sentence( # here we use library stringr to convert the first letter to upper case
impact
),
ontology_term = str_to_sentence( # here we use library stringr to convert the first letter to upper case
ontology_term
),
across("ontology_term", ~str_replace(., "_", " ")),
across("ontology_term", ~str_replace(., " prime_utr", "'UTR")),
across("ontology_term", ~str_replace(., "_premature_start_codon_gain", " start codon gain")),
across("ontology_term", ~str_replace(., "Missense", "Nonsynonymous")),
ontology_term = factor( # set ontology_term as factor
ontology_term
),
impact = factor( # set impact as factor
impact
),
Percentage = prop.table( # we use prop.table to calculate percentages
n_variants
) * 100,
across( # we use across to set column 3 as 2 decimals
4, round, 2
)
) |>
rename( # to make impact start with capital letter
Impact = 1
)
Make plot
# Wong, B. Points of view: Color blindness. Nat Methods (2011).
bla <- '#000000'
blu <- '#0072b2'
grb <- '#56b4e9'
lir <- '#cc79a7'
gre <- '#009e73'
red <- '#d55e00'
org <- '#e69f00'
yel <- '#f0e442'
gry <- '#BBBBBB'
#
# create colors
my_colors2 <- c(
'#d55e00', '#f0e442', '#009e73','#56b4e9'
)
#
# create plot
p1 <-
ann_types2 |>
mutate(
ontology_term = fct_reorder( # library forcats to re-order by the number of variants
ontology_term, n_variants
),
Impact = fct_relevel(
Impact,
"High", "Moderate", "Low", "Modifier"
),
) |>
ggplot(
aes()
) +
geom_col( # to make colorful bar plot
aes(
x = ontology_term,
y = n_variants,
fill = Impact
),
color = "pink", # for the border
linewidth = .3
) +
coord_flip() + # to make it easier to read
hrbrthemes::theme_ipsum(
base_family = "Roboto Condensed",
axis_text_size = 12,
axis_title_size = 12,
plot_margin = margin(
10, 10, 10, 10 # top, right, bottom, left
),
grid = TRUE,
grid_col = "#fabbe2"
) +
labs(
y = "Number of annotated variants",
x = "",
title = "Chip variants per SnpEff annotation"
) +
geom_text(
aes(
y = n_variants,
x = ontology_term,
color = Impact,
label = percent(
Percentage / 100
) # we use library scales function to show percentage
),
size = 3.5,
family = "Roboto Condensed",
hjust = -0.01
) +
scale_fill_manual(
values = my_colors2
) +
scale_color_manual(
values = my_colors2
) +
scale_y_continuous(
labels = comma
) +
theme(
legend.title = element_text(
family = "Roboto Condensed",
face = "bold",
),
legend.position = "top",
axis.title.x = element_text(
size = 12
),
axis.title.y = element_text(
size = 12
),
axis.title = element_text(
size = 12
),
axis.text = element_text(
size = 12
),
plot.title = element_text(
size = 18,
face = "bold"
),
panel.grid.major = element_line(
linetype = "dashed",
linewidth = 0.2,
),
panel.grid.minor = element_line(
linetype = "dashed",
linewidth = 0.2
)
)
#
p1
#
# save plot
ggsave(
here(
"output", "figures", "SNPs_effect_chip_plot2.pdf"
),
width = 11,
height = 5,
units = "in"
)
The downside of this approach is importing the vcf into R. If you have millions of SNPs it is not doable.
We can obtain the list of genes from Boyle, et al. (2021). There is some indication that these may be associated with the climatic adaptation of Ae. albopictus in cold regions, where winter conditions do not allow breeding or survival of mosquitoes for a few months. We will use genes with FDR < 0.05; check their paper for more details (supplemental material).
Transpose file with the diapause genes
# copied the genes from the paper
# if you are in a MacOS, you can install datamash, and on the terminal, run the command below (but don't worry, we have both files in the repository)
# brew install datamash
# then we can easily transpose the file
cat output/files/diapause_genes_fdr_05_boyle_et_al_2021.txt | datamash -W transpose > output/files/fdr_05_diapause_genes.txt
Get the SNPs on the diapause genes
cat <(echo 'Gene Number_SNPs') \
<(cat output/files/fdr_05_diapause_genes.txt | while IFS=" " read -r diapause
do
echo $diapause; cat output/SnpEff/chip_ann.vcf | sed '/##/d' | grep $diapause | wc -l
done | xargs -n2) > output/files/number_snps_per_diapause_gene.txt;
# import data
diapause_genes <-
read_delim(
here(
"output", "files", "number_snps_per_diapause_gene.txt"
),
col_names = TRUE,
show_col_types = FALSE,
col_types = "ci"
) |>
rename(
Gene = 1,
SNPs = 2
) |>
arrange( # here we sort by descending value
desc(
SNPs
)
)
#
# make table
ftabd <-
flextable(
diapause_genes
) |>
set_caption(
as_paragraph(
as_chunk(
"Number of SNPs on diapause genes",
props = fp_text_default(
font.family = "Cambria"
)
)
),
word_stylename = "Table Caption"
)
#
# Apply zebra theme
ftabd <- flextable::theme_zebra(ftabd)
# print(ftab, preview = "docx") # to open and see it in Microsoft Word.
ftabd
Gene | SNPs |
---|---|
LOC109411232 | 19 |
LOC109398746 | 4 |
LOC109402178 | 3 |
LOC109398750 | 3 |
LOC109402174 | 2 |
LOC109398749 | 2 |
LOC109398756 | 2 |
LOC109398753 | 1 |
LOC109398754 | 1 |
LOC109413239 | 0 |
LOC109424986 | 0 |
# save
# then you can open MS Word and make adjustments
ftabd |>
autofit() |>
save_as_docx(
path = here(
"output", "SnpEff", "Number_SNPs_on_diapause_genes.docx"
)
)
Now lets get the SNPs associated on diapause genes
# we can bash tools to get some information about our SNPs
# to make our command to run faster, we can grep the rows with the diapause genes from our vcf and create a new file. Then we can use it to parse our results
# create new file with only the genes of interest - we could use # grep -Ff file1 file2
# ripgrep https://formulae.brew.sh/formula/ripgrep is faster - it will take long if we have long gene lists
# on MacOS you can use brew to install it - brew install ripgrip
# the -Ff are option to search fixed patter (-F), and file (-f) - check man rg
rg -Ff \
output/files/fdr_05_diapause_genes.txt \
output/SnpEff/chip_ann.vcf \
> output/SnpEff/fdr_05_diapause_snps.txt
Now use get the SNPs
# we can now parse the results to make a table later
grep "missense\|synonymous\|5_prime_UTR_variant\|3_prime_UTR_variant" output/SnpEff/fdr_05_diapause_snps.txt \
| awk '{print $1,$2,$3,$4,$5,$8}' \
| sed 's/|/\t/g' \
| awk '{print $1,$2,$3,$4,$5,$8,$7,$13,$15,$16,$23}' \
| sed 's|exon-XM_019671140.2-1|LOC109398756|' \
| sed 's|XM_019684743.2|LOC109411232|' \
| sed 's|XM_019686967.2|LOC109413239|' \
| sed 's|XM_019671140.2|LOC109398750|' \
| sed 's|rna-XM_029870393.1|LOC109398750|' \
| sed 's|XM_029870418.1|LOC109424986|' \
| sed 's|exon-XM_029870393.1-1|LOC109398750|' \
> output/SnpEff/snps_diapause_genes_1.txt
# the script above is not perfect, and we end up missing some gene IDs, we can use grep to get the rows with the missing ids. For example,
# we get a SNPs with gene id rna-XM_029870393.1, we can use grep "rna-XM_029870393.1" output/SnpEff/chip_ann.vcf to find its id. I did that and added that to the file
# we can use sed to replace them in our file
# you can use grep to get the gene names from the gff file.
# for example
# grep "exon-XM_029870393.1-1" output/files/genes.gff
# chr3.19 Gnomon exon 2234864 2235110 . - . ID=exon-XM_029870393.1-1;Parent=rna-XM_029870393.1;Dbxref=GeneID:109398750,Genbank:XM_029870393.1;gbkey=mRNA;gene=LOC109398750;product=methyl-CpG-binding domain protein 3%2C transcript variant X2;transcript_id=XM_029870393.1
# I did it for all genes missing ID
# LOC109398756 # exon-XM_019671140.2-1
# LOC109411232 # XM_019684743.2
# LOC109413239 # XM_019686967.2
# LOC109398750 # XM_019671140.2
# LOC109398750 # rna-XM_029870393.1
# LOC109424986 # XM_029870418.1
SNPs on introns
# the command in previous chunk does not work well for the SNPs on introns, we can do it in two steps
paste -d ' ' <(
# step 1
grep "intron_variant" output/SnpEff/fdr_05_diapause_snps.txt \
| awk '{print $1,$2,$3,$4,$5,$8}' \
| sed 's/,/\t/g' \
| awk '{print $1,$2,$3,$4,$5,$6}' \
| sed 's/|/\t/g' \
| awk '{print $1,$2,$3,$4,$5,$8}') \
<(
# step 2
grep "intron_variant" output/SnpEff/fdr_05_diapause_snps.txt\
| awk '{print $1,$2,$3,$4,$5,$8}' \
| sed 's/,/\t/g' \
| grep "protein_coding\|intron_variant" \
| awk '{
for (i = 1; i <= NF; i++) {
if ($i ~ /intron_variant/) {
printf "%s %s ", $i, $(i + 1)
}
}
print ""
}' | sed 's/|/\t/g' | awk '{print $2,$8,$10,$13,$15}') > output/SnpEff/snps_diapause_genes_2.txt
We can make a table from our data
# import data
snps_diapause1 <- # name of the tibble
read_delim( # function to read the file, check options typing ?read_delim on the console
here( # use library here to find the file
"output", "SnpEff", "snps_diapause_genes_1.txt" # the location of the file and it's name, works in all plattaforms
),
col_names = FALSE, # we don't have columns names
show_col_types = FALSE, # to hide message from tidyverse library
) |>
rename( # we use rename to specify the names of each column
Scaffold = 1, # the numbers are the columns numbers from left to right, the the first word is the name of the columns
Position = 2,
SNP = 3,
"Reference allele" = 4, # if the name of the columns is more than one word, we have to use " "
"Alternative allele" = 5,
Effect = 6,
Type = 7,
Biotype = 8,
"Nucleotide change" = 9,
"Aminoacid change" = 10,
GeneID = 11
) |>
mutate(
Scaffold = as.character( # make scaffold as character, we will manipulate it later
Scaffold
)
)
# sendond file
snps_diapause2 <-
read_delim(
here(
"output", "SnpEff", "snps_diapause_genes_2.txt"
),
col_names = FALSE,
show_col_types = FALSE,
) |>
rename(
Scaffold = 1,
Position = 2,
SNP = 3,
"Reference allele" = 4,
"Alternative allele" = 5,
Effect = 6,
Type = 7,
Biotype = 8,
"Nucleotide change" = 9,
"Aminoacid change" = 10,
GeneID = 11
) |>
mutate(
"Aminoacid change" = " " # adds a columns since SNPs on introns don't cause aa changes
) |>
mutate(
Scaffold = as.character( # make scaffold as character, we will manipulate it later
Scaffold
)
)
#
# rbind them
snps_diapause3 <-
rbind( # use rbind combine the objects
snps_diapause1, snps_diapause2 # the name of the objects
) |>
arrange( # use arrange to sort by scaffold and type
Scaffold, Type
) |>
mutate( # use mutate to change geneIDs
GeneID = str_remove( # use library stringr to remove strings
GeneID, "gene-" # we remove gene- since some genes have the extra word
)
) |>
mutate_at( # we use mutate_at to only make changes in one column
vars( # to select column or variable
"Aminoacid change" # name of the column
), funs( # list function calls
replace( # to replace
., !grepl( # we use !grepl to remove fields without the pattern
"p", . # all amino acid changes have the letter p on them, so we remove those without it
),
"" # here we replace it with nothing using ''
)
)
) |>
mutate(
Effect = tolower( # we use base R to convert all values on column Effect to lower case
Effect
)
) |>
mutate(
Effect = str_to_sentence( # here we use library stringr to convert the first letter to upper case, just like in a sentence
Effect
)
)
#
#
# make table
ftab.snps <-
flextable(
snps_diapause3
) |>
set_caption(
as_paragraph(
as_chunk(
"SNPs on diapause genes",
props = fp_text_default(
font.fam = "Cambria"
)
)
),
word_stylena = "Table Caption"
)
#
# Apply zebra theme
ftab.snps <- flextable::theme_zebra(ftab.snps)
# print(ftab.snps, preview = "docx") # to open and see it in Microsoft Word.
ftab.snps |>
autofit()
Scaffold | Position | SNP | Reference allele | Alternative allele | Effect | Type | Biotype | Nucleotide change | Aminoacid change | GeneID |
---|---|---|---|---|---|---|---|---|---|---|
1.5 | 384,768 | AX-583746747 | G | A | Modifier | intron_variant | protein_coding | c.145+30C>T | LOC109402178 | |
1.5 | 374,571 | AX-583746721 | A | C | Low | synonymous_variant | protein_coding | c.456T>G | p.Ala152Ala | LOC109413238 |
1.5 | 374,775 | AX-583750651 | A | G | Low | synonymous_variant | protein_coding | c.252T>C | p.Asp84Asp | LOC109413238 |
1.5 | 460,009 | AX-583747442 | G | A | Low | synonymous_variant | protein_coding | c.1212C>T | p.Pro404Pro | LOC109402173 |
1.5 | 460,904 | AX-583747485 | T | C | Low | synonymous_variant | protein_coding | c.432A>G | p.Leu144Leu | LOC109402173 |
3.1 | 3,300,814 | AX-580336192 | A | T | Modifier | 3_prime_UTR_variant | protein_coding | c.*729T>A | LOC109411232 | |
3.1 | 3,463,101 | AX-580336870 | T | G | Modifier | 5_prime_UTR_variant | protein_coding | c.-341A>C | LOC109411232 | |
3.1 | 3,463,774 | AX-580336878 | T | C | Modifier | 5_prime_UTR_variant | protein_coding | c.-1014A>G | LOC109411232 | |
3.1 | 3,304,574 | AX-580338227 | C | T | Modifier | intron_variant | protein_coding | c.961+594G>A | LOC109411232 | |
3.1 | 3,304,968 | AX-580338236 | A | G | Modifier | intron_variant | protein_coding | c.961+200T>C | LOC109411232 | |
3.1 | 3,339,542 | AX-580336357 | T | A | Modifier | intron_variant | protein_coding | c.679-30795A>T | LOC109411232 | |
3.1 | 3,340,340 | AX-580338346 | G | A | Modifier | intron_variant | protein_coding | c.679-31593C>T | LOC109411232 | |
3.1 | 3,340,788 | AX-580336395 | A | T | Modifier | intron_variant | protein_coding | c.679-32041T>A | LOC109411232 | |
3.1 | 3,377,418 | AX-580338434 | G | C | Modifier | intron_variant | protein_coding | c.678+13167C>G | LOC109411232 | |
3.1 | 3,389,566 | AX-580336499 | A | T | Modifier | intron_variant | protein_coding | c.678+1019T>A | LOC109411232 | |
3.1 | 3,389,812 | AX-580338487 | C | T | Modifier | intron_variant | protein_coding | c.678+773G>A | LOC109411232 | |
3.1 | 3,390,039 | AX-580336526 | T | C | Modifier | intron_variant | protein_coding | c.678+546A>G | LOC109411232 | |
3.1 | 3,401,309 | AX-580336571 | A | C | Modifier | intron_variant | protein_coding | c.392-170T>G | LOC109411232 | |
3.1 | 3,407,679 | AX-580338576 | T | C | Modifier | intron_variant | protein_coding | c.392-6540A>G | LOC109411232 | |
3.1 | 3,408,533 | AX-580338597 | A | G | Modifier | intron_variant | protein_coding | c.392-7394T>C | LOC109411232 | |
3.1 | 3,427,972 | AX-580338657 | A | G | Modifier | intron_variant | protein_coding | c.392-26833T>C | LOC109411232 | |
3.1 | 3,439,964 | AX-580338775 | T | C | Modifier | intron_variant | protein_coding | c.391+22406A>G | LOC109411232 | |
3.1 | 3,440,228 | AX-580338784 | G | A | Modifier | intron_variant | protein_coding | c.391+22142C>T | LOC109411232 | |
3.1 | 3,462,068 | AX-580336816 | T | C | Modifier | intron_variant | protein_coding | c.391+302A>G | LOC109411232 | |
3.19 | 2,229,464 | AX-581509085 | A | G | Modifier | 3_prime_UTR_variant | protein_coding | c.*44A>G | LOC109398750 | |
3.19 | 2,270,349 | AX-581511297 | G | T | Modifier | intron_variant | protein_coding | c.-8+9G>T | LOC109398746 | |
3.19 | 2,289,280 | AX-581511654 | A | G | Modifier | intron_variant | protein_coding | c.972+18T>C | LOC109398754 | |
3.19 | 2,233,284 | AX-581509100 | G | A | Low | synonymous_variant | protein_coding | c.498C>T | p.Asn166Asn | LOC109398750 |
3.19 | 2,235,931 | AX-581510984 | G | T | Low | synonymous_variant | protein_coding | c.306G>T | p.Pro102Pro | LOC109398756 |
3.19 | 2,260,721 | AX-581509393 | G | A | Low | synonymous_variant | protein_coding | c.1296C>T | p.Thr432Thr | LOC109398753 |
3.19 | 2,271,368 | AX-581509585 | C | T | Low | synonymous_variant | protein_coding | c.888C>T | p.Ser296Ser | LOC109398746 |
3.19 | 2,271,680 | AX-581511336 | C | T | Low | synonymous_variant | protein_coding | c.1200C>T | p.Asn400Asn | LOC109398746 |
#
# save it
# create settings - check ??save_as_docx on console
sect_properties <-
prop_section(
page_size = page_size(
orient = "landscape",
width = 8.3,
height = 11.7
),
type = "continuous",
page_margins = page_mar()
)
# save
# then you can open MS Word and make adjustments
ftab.snps |>
save_as_docx(
path = here(
"output", "SnpEff", "SNPs_on_diapause_genes.docx"
),
pr_section = sect_properties
)
I downloaded the supplemental material from Palinni et al, 2020, and got the list of immunity genes. Let’s check how many SNPs we have on them.
# we can bash tools to get some information about our SNPs
# to make our command to run faster, we can grep the rows with the diapause genes from our vcf and create a new file. Then we can use it to parse our results
# this is slow, I already have the output in the repository
# create new file with only the genes of interest
# this time only use ripgrip since we have more than 600 genes, it will take a long time for grep to do it
rg -Ff \
output/files/immunity_genes_Palinni_et_al_2020.txt \
output/SnpEff/chip_ann.vcf \
> output/SnpEff/immunity_snps.txt
We can use the same code from the diapause genes, however, since we have many more genes now, I will not fix the geneID field. We could write a loop to fix it later.
# we can now parse the results to make a table later
grep "missense\|synonymous\|5_prime_UTR_variant\|3_prime_UTR_variant" output/SnpEff/immunity_snps.txt \
| awk '{print $1,$2,$3,$4,$5,$8}' \
| sed 's/|/\t/g' \
| awk '{print $1,$2,$3,$4,$5,$8,$7,$13,$15,$16,$23}' \
> output/SnpEff/snps_immunity_genes_1.txt
SNPs on introns. We just replace the file names. You can use your own gene list of interest.
# the command in previous chunk does not work well for the SNPs on introns, we can do it in two steps
paste -d ' ' <(
# step 1
grep "intron_variant" output/SnpEff/immunity_snps.txt \
| awk '{print $1,$2,$3,$4,$5,$8}' \
| sed 's/,/\t/g' \
| awk '{print $1,$2,$3,$4,$5,$6}' \
| sed 's/|/\t/g' \
| awk '{print $1,$2,$3,$4,$5,$8}') \
<(
# step 2
grep "intron_variant" output/SnpEff/immunity_snps.txt\
| awk '{print $1,$2,$3,$4,$5,$8}' \
| sed 's/,/\t/g' \
| grep "protein_coding\|intron_variant" \
| awk '{
for (i = 1; i <= NF; i++) {
if ($i ~ /intron_variant/) {
printf "%s %s ", $i, $(i + 1)
}
}
print ""
}' | sed 's/|/\t/g' | awk '{print $2,$8,$10,$13,$15}') > output/SnpEff/snps_immunity_genes_2.txt
Using the same code from the diapause genes, we can create a table with our immunity genes SNPs. We could write a function to repeat the entire process for a gene list. First write a bash pipeline to get the files ready for R, then write a R function to import the data.
# import data
snps_immunity1 <- # name of the tibble
read_delim( # function to read the file, check options typing ?read_delim on the console
here( # use library here to find the file
"output", "SnpEff", "snps_immunity_genes_1.txt" # the location of the file and it's name, works in all plattaforms
),
col_names = FALSE, # we don't have columns names
show_col_types = FALSE, # to hide message from tidyverse library
) |>
rename( # we use rename to specify the names of each column
Scaffold = 1, # the numbers are the columns numbers from left to right, the the first word is the name of the columns
Position = 2,
SNP = 3,
"Reference allele" = 4, # if the name of the columns is more than one word, we have to use " "
"Alternative allele" = 5,
Effect = 6,
Type = 7,
Biotype = 8,
"Nucleotide change" = 9,
"Aminoacid change" = 10,
GeneID = 11
) |>
mutate(
Scaffold = as.character( # make scaffold as character, we will manipulate it later
Scaffold
)
)
# sendond file
snps_immunity2 <-
read_delim(
here(
"output", "SnpEff", "snps_immunity_genes_2.txt"
),
col_names = FALSE,
show_col_types = FALSE,
) |>
rename(
Scaffold = 1,
Position = 2,
SNP = 3,
"Reference allele" = 4,
"Alternative allele" = 5,
Effect = 6,
Type = 7,
Biotype = 8,
"Nucleotide change" = 9,
"Aminoacid change" = 10,
GeneID = 11
) |>
mutate(
"Aminoacid change" = " " # adds a columns since SNPs on introns don't cause aa changes
) |>
mutate(
Scaffold = as.character( # make scaffold as character, we will manipulate it later
Scaffold
)
)
#
# rbind them
snps_immunity3 <-
rbind( # use rbind combine the objects
snps_immunity1, snps_immunity2 # the name of the objects
) |>
arrange( # use arrange to sort by Scaffold and Type
Scaffold, Type
) |>
mutate( # use mutate to change geneIDs
GeneID = str_remove( # use library stringr to remove strings
GeneID, "gene-" # we remove gene- since some genes have the extra word
)
) |>
mutate_at( # we use mutate_at to only make changes in one column
vars( # to select column or variable
"Aminoacid change" # name of the column
), funs( # list function calls
replace( # to replace
., !grepl( # we use !grepl to remove fields without the pattern, we only want nonsynonymous changes starting with p
"p", . # all amino acid changes have the letter p on them, so we remove those without it
),
"" # here we replace it with nothing using ''
)
)
) |>
mutate(
Effect = tolower( # we use base R to convert all values on column Effect to lower case
Effect
)
) |>
mutate(
Effect = str_to_sentence( # here we use library stringr to convert the first letter to upper case, just like in a sentence
Effect
)
)
#
#
# make table
ftai.snps <-
flextable(
snps_immunity3 |>
filter( # the table would be too long (1,689 SNPs on immunity genes)
Type == "missense_variant"
)
) |>
set_caption(
as_paragraph(
as_chunk(
"SNPs resulting in nonsynonymous changes in immunity genes",
props = fp_text_default(
font.family = "Cambria"
)
)
),
word_stylename = "Table Caption"
)
#
# Apply zebra theme
ftai.snps <- flextable::theme_zebra(ftai.snps)
# print(ftab.snps, preview = "docx") # to open and see it in Microsoft Word.
# it is a long table, we have almost 1700 SNPs on immunity genes.
ftai.snps |>
autofit()
Scaffold | Position | SNP | Reference allele | Alternative allele | Effect | Type | Biotype | Nucleotide change | Aminoacid change | GeneID |
---|---|---|---|---|---|---|---|---|---|---|
1.1 | 10,697,741 | AX-583056929 | G | A | Moderate | missense_variant | protein_coding | c.491G>A | p.Arg164His | exon-XM_029855176.1-1 |
1.1 | 10,699,650 | AX-583058970 | A | C | Moderate | missense_variant | protein_coding | c.196T>G | p.Ser66Ala | LOC109430086 |
1.106 | 2,512,509 | AX-583109537 | C | A | Moderate | missense_variant | protein_coding | c.1405C>A | p.His469Asn | LOC115257625 |
1.113 | 5,762,762 | AX-583151909 | A | G | Moderate | missense_variant | protein_coding | c.623A>G | p.Glu208Gly | LOC109415428 |
1.113 | 5,763,562 | AX-583152046 | A | G | Moderate | missense_variant | protein_coding | c.1423A>G | p.Ile475Val | LOC109415428 |
1.131 | 5,279,847 | AX-583254712 | C | T | Moderate | missense_variant | protein_coding | c.626G>A | p.Ser209Asn | LOC109399009 |
1.147 | 5,839,558 | AX-583431723 | T | C | Moderate | missense_variant | protein_coding | c.775A>G | p.Ile259Val | exon-XM_029868010.1-1 |
1.147 | 5,840,353 | AX-583434252 | T | G | Moderate | missense_variant | protein_coding | c.193A>C | p.Thr65Pro | exon-XM_029868010.1-1 |
1.25 | 1,021,102 | AX-583588144 | A | G | Moderate | missense_variant | protein_coding | c.424A>G | p.Lys142Glu | LOC109402615 |
1.36 | 4,880,818 | AX-583694151 | T | C | Moderate | missense_variant | protein_coding | c.445A>G | p.Lys149Glu | LOC109397206 |
1.79 | 54,655 | AX-583972510 | G | A | Moderate | missense_variant | protein_coding | c.599C>T | p.Ser200Leu | LOC115260154 |
1.85 | 17,046,403 | AX-584044965 | C | T | Moderate | missense_variant | protein_coding | c.524C>T | p.Thr175Met | LOC109424643 |
1.85 | 23,680,098 | AX-584075828 | A | G | Moderate | missense_variant | protein_coding | c.4819A>G | p.Lys1607Glu | exon-XM_029862069.1-1 |
1.85 | 23,681,315 | AX-584075910 | A | G | Moderate | missense_variant | protein_coding | c.5722A>G | p.Met1908Val | exon-XM_029862069.1-1 |
1.85 | 27,167,461 | AX-584094434 | G | C | Moderate | missense_variant | protein_coding | c.27C>G | p.Asp9Glu | LOC109431512 |
1.91 | 3,380,846 | AX-584233670 | A | G | Moderate | missense_variant | protein_coding | c.2011A>G | p.Asn671Asp | LOC109412409 |
1.91 | 5,039,401 | AX-584245979 | A | G | Moderate | missense_variant | protein_coding | c.1156A>G | p.Ser386Gly | LOC109430356 |
1.91 | 8,897,363 | AX-584263120 | G | T | Moderate | missense_variant | protein_coding | c.880C>A | p.Leu294Met | LOC109621626 |
1.91 | 8,897,632 | AX-584265058 | A | G | Moderate | missense_variant | protein_coding | c.611T>C | p.Met204Thr | LOC109621626 |
2.103 | 2,573,871 | AX-584332894 | G | C | Moderate | missense_variant | protein_coding | c.471C>G | p.His157Gln | LOC115266575 |
2.103 | 2,574,411 | AX-584332912 | A | T | Moderate | missense_variant | protein_coding | c.28T>A | p.Ser10Thr | LOC115266575 |
2.138 | 335,263 | AX-584531554 | A | G | Moderate | missense_variant | protein_coding | c.172A>G | p.Ser58Gly | LOC109409282 |
2.14 | 12,158,966 | AX-585060557 | T | C | Moderate | missense_variant | protein_coding | c.7A>G | p.Thr3Ala | exon-XM_029859293.1-1 |
2.14 | 12,647,106 | AX-584552253 | G | C | Moderate | missense_variant | protein_coding | c.964G>C | p.Gly322Arg | LOC109409279 |
2.146 | 419,711 | AX-585089323 | A | T | Moderate | missense_variant | protein_coding | c.701A>T | p.Asp234Val | LOC109407295 |
2.167 | 1,433,370 | AX-585168461 | G | A | Moderate | missense_variant | protein_coding | c.199G>A | p.Val67Ile | exon-XM_029853094.1-1 |
2.17 | 3,924,782 | AX-585190434 | G | A | Moderate | missense_variant | protein_coding | c.587G>A | p.Arg196Gln | exon-XM_019684995.2-1 |
2.17 | 3,926,378 | AX-584681782 | C | G | Moderate | missense_variant | protein_coding | c.2050C>G | p.Pro684Ala | exon-XM_019684995.2-1 |
2.17 | 4,004,422 | AX-584682128 | C | T | Moderate | missense_variant | protein_coding | c.11C>T | p.Thr4Met | exon-XM_019684993.2-1 |
2.17 | 13,726,115 | AX-582441613 | T | C | Moderate | missense_variant | protein_coding | c.671A>G | p.Lys224Arg | LOC109410220 |
2.175 | 780,098 | AX-585206011 | C | T | Moderate | missense_variant | protein_coding | c.956C>T | p.Ser319Leu | exon-XM_029852847.1-1 |
2.175 | 800,927 | AX-584697392 | C | A | Moderate | missense_variant | protein_coding | c.3090C>A | p.Ser1030Arg | exon-XM_029852847.1-1 |
2.27 | 1,560,585 | AX-579525195 | A | T | Moderate | missense_variant | protein_coding | c.1326T>A | p.Asp442Glu | LOC109419663 |
2.42 | 12,014,338 | AX-579673469 | T | A | Moderate | missense_variant | protein_coding | c.961T>A | p.Ser321Thr | LOC109426828 |
2.6 | 16,276,549 | AX-579809339 | T | A | Moderate | missense_variant | protein_coding | c.490T>A | p.Ser164Thr | exon-XM_029878835.1-1 |
2.6 | 16,296,793 | AX-579809489 | C | T | Moderate | missense_variant | protein_coding | c.586C>T | p.Pro196Ser | exon-XM_019700969.2-1 |
2.68 | 5,446,027 | AX-579896611 | G | T | Moderate | missense_variant | protein_coding | c.1060C>A | p.Leu354Ile | LOC109622615 |
2.68 | 5,446,516 | AX-579896667 | A | G | Moderate | missense_variant | protein_coding | c.571T>C | p.Trp191Arg | LOC109622615 |
2.87 | 2,149,902 | AX-580019993 | G | A | Moderate | missense_variant | protein_coding | c.49G>A | p.Ala17Thr | LOC109398084 |
2.87 | 2,150,355 | AX-580020020 | A | G | Moderate | missense_variant | protein_coding | c.370A>G | p.Ile124Val | LOC109398084 |
2.87 | 2,174,851 | AX-580020260 | A | G | Moderate | missense_variant | protein_coding | c.1162A>G | p.Thr388Ala | LOC109398030 |
2.95 | 352,322 | AX-580191265 | T | A | Moderate | missense_variant | protein_coding | c.109A>T | p.Thr37Ser | LOC115269749 |
3.116 | 5,061,100 | AX-580503305 | T | C | Moderate | missense_variant | protein_coding | c.619A>G | p.Arg207Gly | exon-XM_029857358.1-1 |
3.122 | 5,261,799 | AX-580598894 | A | C | Moderate | missense_variant | protein_coding | c.801T>G | p.Asp267Glu | exon-XM_029865713.1-1 |
3.151 | 3,436,049 | AX-580842317 | T | C | Moderate | missense_variant | protein_coding | c.1045A>G | p.Thr349Ala | exon-XM_029875170.1-1 |
3.159 | 1,104,660 | AX-580902042 | A | G | Moderate | missense_variant | protein_coding | c.638T>C | p.Val213Ala | LOC109409956 |
3.17 | 18,459,269 | AX-581338723 | T | C | Moderate | missense_variant | protein_coding | c.28A>G | p.Thr10Ala | exon-XM_029875820.1-1 |
3.18 | 2,874,117 | AX-581428184 | T | C | Moderate | missense_variant | protein_coding | c.1267A>G | p.Ser423Gly | exon-XM_029875979.1-1 |
3.2 | 8,846,714 | AX-581596089 | T | C | Moderate | missense_variant | protein_coding | c.287A>G | p.Asn96Ser | LOC109413737 |
3.32 | 3,322,424 | AX-581763570 | T | C | Moderate | missense_variant | protein_coding | c.1193T>C | p.Val398Ala | LOC109415245 |
3.32 | 3,408,703 | AX-581765674 | G | A | Moderate | missense_variant | protein_coding | c.592G>A | p.Gly198Ser | LOC109414897 |
3.32 | 3,418,991 | AX-581765806 | T | G | Moderate | missense_variant | protein_coding | c.324T>G | p.Asp108Glu | LOC109416024 |
3.4 | 5,487,127 | AX-581830792 | T | A | Moderate | missense_variant | protein_coding | c.295A>T | p.Thr99Ser | LOC109399310 |
3.58 | 5,278,791 | AX-582051662 | C | A | Moderate | missense_variant | protein_coding | c.316G>T | p.Ala106Ser | exon-XM_019673611.2-1 |
3.75 | 16,587,391 | AX-582760777 | A | G | Moderate | missense_variant | protein_coding | c.334A>G | p.Asn112Asp | LOC109398879 |
3.75 | 16,589,393 | AX-582761776 | C | T | Moderate | missense_variant | protein_coding | c.317G>A | p.Arg106Lys | LOC109398826 |
3.77 | 6,092,180 | AX-582825108 | G | A | Moderate | missense_variant | protein_coding | c.92G>A | p.Arg31Lys | LOC109420317 |
3.93 | 5,614,846 | AX-583002481 | G | A | Moderate | missense_variant | protein_coding | c.1039G>A | p.Asp347Asn | exon-XM_029860991.1-1 |
#
# save it
# create settings - check ??save_as_docx on console
sect_properties <- prop_section(
page_size = page_size(
orient = "landscape",
width = 8.3, height = 11.7
),
type = "continuous",
page_margins = page_mar()
)
# save
# then you can open MS Word and make adjustments
ftai.snps |>
save_as_docx(
path = here(
"output", "SnpEff", "SNPs_nonsynonymous_immunity_genes.docx"
),
pr_section = sect_properties
)
We can also get the sum totals for the type of SNPs we have in diapause and immunity genes
# row bind both tibbles
diap_immun <-
bind_rows( # we use bind_rows and not rbind from base R - both would work here (same number of columns)
snps_diapause3, snps_immunity3 # the objects we want to bind
) |>
count( # here we count based on the column Type
Type,
sort = TRUE # we sort high > low
) |>
filter( # here we filter n > 90 because we only want to show the types with more SNPs, and the other ones are mixed types
n > 90
) |>
mutate_if( # here we replace all underscores with spaces
is.character, # all columns as characters
str_replace_all, # replace all
"_", # underscore
" " # with space
) |>
mutate(
Type = str_to_sentence( # here we use library stringr to convert the first letter to upper case, just like in a sentence
Type
)
) |>
rename( # we can finally rename the columns
"SNP effect" = 1,
"Number" = 2
) |>
mutate_if( # here we replace Missense with Synonymous
is.character, # all columns as characters
str_replace_all, # replace all
"Missense", #
"Nonsynonymous"
) |>
mutate_if( # here we replace prime with 'Synonymous' prime
is.character, # all columns as characters
str_replace_all, # replace all
" prime",
"' prime"
)
#
# lets make a table
set_flextable_defaults(na_str = "NA", big.mark = ",")
ftabdi <-
flextable(
diap_immun
) |>
set_caption(
as_paragraph(
as_chunk(
"The effects of SNPs from the Chip on immunity and diapause genes",
props = fp_text_default(
font.family = "Cambria"
)
)
),
word_stylename = "Table Caption"
)
#
# Apply zebra theme
ftabdi <- flextable::theme_zebra(ftabdi)
# print(ftabdi, preview = "docx") # to open and see it in Microsoft Word.
# check the table
ftabdi |>
autofit()
SNP effect | Number |
---|---|
Synonymous variant | 415 |
Intron variant | 409 |
5' prime utr variant | 91 |
To get an idea of the variant distribution across the genome of Ae. albopictus, I will do the functional annotation using the WGS data set I used to select variants for the chip. The initial data set had around 34 million variants, but after MAF filter of 10%, we obtained a data set with approximately 2.5 million variants. Because we do not have permission to share the WGS data set, I will not include it in the repository.
# the data is located at
/ycga-gpfs/project/caccone/lvc26/september_2020/snp_calls/chunk_calls/plink/merged_files/albo_MAF_0.01
#
# our first step is to set reference allele as the reference genome, and covert to vcf using Plink2
# the genome is at
/ycga-gpfs/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta
#
# create interactive session with 2 CPU
srun --pty -N 1 -n 1 -c 2 -p interactive zsh
#
# Plink2
export PATH=$PATH:/ycga-gpfs/project/caccone/lvc26/plink2
#
# get plink2 version
plink2 --version
# PLINK v2.00a3.6LM 64-bit Intel (14 Aug 2022)
## bash: line 1: /ycga-gpfs/project/caccone/lvc26/september_2020/snp_calls/chunk_calls/plink/merged_files/albo_MAF_0.01: No such file or directory
## bash: line 5: /ycga-gpfs/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta: No such file or directory
## bash: line 8: srun: command not found
## PLINK v2.00a3.3 64-bit (3 Jun 2022)
On the cluster, set reference allele, and make vcf file
# Plink 1.x sets major alleles as reference alleles, therefore it can change based on the allele frequency. We can use the reference genome with Plink to set the reference alleles correctly.
#
# MAF 1% is the default of plink.
## ............................................................................
## MAF 1% ####
# MAF 0.01 - default plink
plink2 \
--bfile /ycga-gpfs/project/caccone/lvc26/september_2020/snp_calls/chunk_calls/plink/merged_files/albo_MAF_0.01 \
--allow-extra-chr \
--fa /ycga-gpfs/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta \
--ref-from-fa 'force' \
--make-bed \
--export vcf id-delim='-' ref-first bgz \
--out /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs \
--silent;
grep 'variants' /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs.log
#
# 24616205 variants loaded from
# --ref-from-fa force: 720711 variants changed, 23895494 validated.
#
## ............................................................................
## MAF 5% ####
plink2 \
--bfile /ycga-gpfs/project/caccone/lvc26/september_2020/snp_calls/chunk_calls/plink/merged_files/albo_MAF_0.01 \
--allow-extra-chr \
--fa /ycga-gpfs/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta \
--ref-from-fa 'force' \
--make-bed \
--maf 0.05 \
--export vcf id-delim='-' ref-first bgz \
--out /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs_MAF_0.05 \
--silent;
grep 'variants' /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs_MAF_0.05.log
#
# 24616205 variants loaded from
# 18671088 variants removed due to allele frequency threshold(s)
# 5945117 variants remaining after main filters.
# --ref-from-fa force: 606703 variants changed, 5338414 validated.
#
## ............................................................................
## MAF 10% ####
plink2 \
--bfile /ycga-gpfs/project/caccone/lvc26/september_2020/snp_calls/chunk_calls/plink/merged_files/albo_MAF_0.01 \
--allow-extra-chr \
--fa /ycga-gpfs/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta \
--ref-from-fa 'force' \
--make-bed \
--maf 0.1 \
--export vcf id-delim='-' ref-first bgz \
--out /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs_MAF_0.1 \
--silent;
grep 'variants' /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs_MAF_0.1.log
#
# 24616205 variants loaded from
# 21819313 variants removed due to allele frequency threshold(s)
# 2796892 variants remaining after main filters.
# --ref-from-fa force: 503613 variants changed, 2293279 validated.
#
## ............................................................................
## MAF 15% ####
plink2 \
--bfile /ycga-gpfs/project/caccone/lvc26/september_2020/snp_calls/chunk_calls/plink/merged_files/albo_MAF_0.01 \
--allow-extra-chr \
--fa /ycga-gpfs/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta \
--ref-from-fa 'force' \
--make-bed \
--maf 0.15 \
--export vcf id-delim='-' ref-first bgz \
--out /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs_MAF_0.15 \
--silent;
grep 'variants' /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs_MAF_0.15.log
#
# 24616205 variants loaded from
# 22917110 variants removed due to allele frequency threshold(s)
# 1699095 variants remaining after main filters.
# --ref-from-fa force: 422154 variants changed, 1276941 validated.
#
## ............................................................................
## MAF 25% ####
plink2 \
--bfile /ycga-gpfs/project/caccone/lvc26/september_2020/snp_calls/chunk_calls/plink/merged_files/albo_MAF_0.01 \
--allow-extra-chr \
--fa /ycga-gpfs/project/caccone/lvc26/september_2020/genome/aedes_albopictus_LA2_20200826.fasta \
--ref-from-fa 'force' \
--make-bed \
--maf 0.20 \
--export vcf id-delim='-' ref-first bgz \
--out /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs_MAF_0.20 \
--silent;
grep 'variants' /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs_MAF_0.20.log
#
# 24616205 variants loaded from
# 23446040 variants removed due to allele frequency threshold(s)
# 1170165 variants remaining after main filters.
# --ref-from-fa force: 351474 variants changed, 818691 validated.
Transfer data to SSD
# run this command in the terminal and not R studio
# I put transferred the data to the external SSD drive because it more than 8Gb
rsync -chavzP --stats lvc26@ruddle.hpc.yale.edu:/ycga-gpfs/project/caccone/lvc26/SnpEff/ /Volumes/evo2/data_chip_design
6.1 Run annotation
# on SSD
## ............................................................................
## MAF 1% ####
# then annotate the vcf with the v4
# we first run annotation for MAF 0.01
java -Xmx8g -jar /Users/lucianocosme/Packages/snpEff/snpEff.jar albo_v3 \
/Volumes/evo2/data_chip_design/albo_wgs.vcf.gz \
-v \
-stats /Volumes/evo2/data_chip_design/MAF_0.01.html \
-csvStats /Volumes/evo2/data_chip_design/MAF_0.01.csv \
> /Volumes/evo2/data_chip_design/albo_wgs.ann_MAF_0.01.vcf
#
# we will get some warnings or errors because the annotation was lifted from AalbF2 to the AalbF3, with some transcripts missing some features.
# therefore we need to focus on genes with id starting with LOC
#
# because I need to run multiple annotations, I used the cluster to do the annotations, the bottleneck was the storage, the 2Tb SSD did not have enough space to save all the vcfs.
#
# on the cluster
srun --pty -N 1 -n 1 -c 2 -p interactive zsh
module load GATK/4.2.6.1-GCCcore-10.2.0-Java-11 # for java to be loaded
java -Xmx8g -jar $HOME/snpEff/snpEff.jar --help
#
## ............................................................................
## MAF 1% ####
# then annotate the vcf with the v4
# we first run annotation for MAF 0.01
java -Xmx10g -jar $HOME/snpEff/snpEff.jar albo_v3 \
/ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs.vcf.gz \
-v \
-stats /ycga-gpfs/project/caccone/lvc26/SnpEff/MAF_0.01.html \
-csvStats /ycga-gpfs/project/caccone/lvc26/SnpEff/MAF_0.01.csv \
> /ycga-gpfs/project/caccone/lvc26/SnpEff/albo_wgs.ann_MAF_0.01.vcf
## ............................................................................
## MAF 5% ####
# then annotate the vcf with the v4
# we first run annotation for MAF 0.05
java -Xmx8g -jar /Users/lucianocosme/Packages/snpEff/snpEff.jar albo_v3 \
/Volumes/evo2/data_chip_design/albo_wgs_MAF_0.05.vcf.gz \
-v \
-stats /Volumes/evo2/data_chip_design/MAF_0.05.html \
-csvStats /Volumes/evo2/data_chip_design/MAF_0.05.csv \
> /Volumes/evo2/data_chip_design/albo_wgs.ann_MAF_0.05.vcf
#
#
## ............................................................................
## MAF 10% ####
# then annotate the vcf with the v4
# we first run annotation for MAF 0.1
java -Xmx8g -jar /Users/lucianocosme/Packages/snpEff/snpEff.jar albo_v3 /Volumes/evo2/data_chip_design/albo_wgs_MAF_0.1.vcf.gz \
-v \
-stats /Volumes/evo2/data_chip_design/MAF_0.1.html \
-csvStats /Volumes/evo2/data_chip_design/MAF_0.1.csv \
> /Volumes/evo2/data_chip_design/albo_wgs.ann_MAF_0.1.vcf
#
#
## ............................................................................
## MAF 15% ####
# then annotate the vcf with the v4
# we first run annotation for MAF 0.15
java -Xmx8g -jar /Users/lucianocosme/Packages/snpEff/snpEff.jar albo_v3 \
/Volumes/evo2/data_chip_design/albo_wgs_MAF_0.15.vcf.gz \
-v \
-stats /Volumes/evo2/data_chip_design/MAF_0.15.html \
-csvStats /Volumes/evo2/data_chip_design/MAF_0.15.csv \
> /Volumes/evo2/data_chip_design/albo_wgs.ann_MAF_0.15.vcf
#
#
## ............................................................................
## MAF 20% ####
# then annotate the vcf with the v4
# we first run annotation for MAF 0.2
java -Xmx8g -jar /Users/lucianocosme/Packages/snpEff/snpEff.jar albo_v3 \
/Volumes/evo2/data_chip_design/albo_wgs_MAF_0.20.vcf.gz \
-v \
-stats /Volumes/evo2/data_chip_design/MAF_0.2.html \
-csvStats /Volumes/evo2/data_chip_design/MAF_0.2.csv \
> /Volumes/evo2/data_chip_design/albo_wgs.ann_MAF_0.2.vcf
#
Move files to output
# SnpEff will create a .txt and a .html file in our project root directory, we can move them to the right place
mv snpEff_* output/SnpEff/
# you can double click the file snpEff_summary.html to see the summary of our annotation
Get types of variants
# I wrote a bash script to get the number of variants we have, you might have to run chmod 775 if you get permission denied
./scripts/analysis/variants_by_effect.sh /Volumes/evo2/data_chip_design/albo_wgs.ann_MAF_0.1.vcf
Counting variats from /Volumes/evo2/data_chip_design/albo_wgs.ann_MAF_0.1.vcf Nonsynonymous: 67217 Synonymous: 384482 Intron: 1281482 Intergenic: 993796 3’UTR: 64912 5’UTR: 67744
Alternatively, we can create a file with the effects and how they are annotated in the vcf file
echo "Nonsynonymous missense_variant
Synonymous synonymous_variant
Intron intron_variant
Intergenic intergenic_region
3UTR 3_prime_UTR_variant
5UTR 5_prime_UTR" > output/SnpEff/variant_types.txt
Count effects of variants
# not the most elegant but works
cat output/SnpEff/variant_types.txt | while IFS=" " read -r variant effect
do
echo "$variant"; cat /Volumes/evo2/data_chip_design/albo_wgs.ann_MAF_0.1.vcf | sed '/##/d' | grep -c "$effect"
done | xargs -n2 > output/SnpEff/wgs_annotations_summary_MAF_0.1.txt
Make a table
# import data
wgs_var_effects <- read_delim(
here(
"output", "SnpEff", "wgs_annotations_summary_MAF_0.1.txt"
),
col_names = FALSE,
show_col_types = FALSE,
col_types = "cn"
) |>
rename(
Effect = 1,
Number = 2
) |>
mutate(
Effect = as.factor(
Effect
),
Percentage = prop.table( # we use prop.table to calculate percentages
Number
) * 100,
across( # we use across to set column 3 as 2 decimals
3, round, 2
)
) |>
arrange( # we want to sort column percentage from higher to low
-Percentage
)
#
# make table
ftabv <- flextable(
wgs_var_effects |>
rename(
Effect = 1,
"Number of SNPs" = 2,
Percentage = 3
)
) |>
set_caption(
as_paragraph(
as_chunk(
"Variant effects for 2,796,892 SNPs from WGS data set - MAF 10%",
props = fp_text_default(
font.family = "Cambria"
)
)
),
word_stylename = "Table Caption"
)
#
# Apply zebra theme
ftabv <- flextable::theme_zebra(ftabv)
# print(ftab, preview = "docx") # to open and see it in Microsoft Word.
ftabv
Effect | Number of SNPs | Percentage |
---|---|---|
Intron | 1,281,482 | 44.81 |
Intergenic | 993,796 | 34.75 |
Synonymous | 384,482 | 13.45 |
5UTR | 67,744 | 2.37 |
Nonsynonymous | 67,217 | 2.35 |
3UTR | 64,912 | 2.27 |
# save
# then you can open MS Word and make adjustments
ftabv |>
autofit() |>
save_as_docx(
path = here(
"output", "SnpEff", "wgs_effect_type_and_number_SNPs_MAF_0.1.docx"
)
)
Plot of the effects
# create vector of colors for our plot
# option1
# special_colors <- list(
# "Intron" = "#BBBBBB",
# "Intergenic" = "#0072b2",
# "Synonymous" = "#009e73",
# "5UTR" = "#56b4e9",
# "3UTR" = "#cc79a7",
# "Nonsynonymous" = "#d55e00"
# )
#
# Option 2
special_colors <- list(
"Intron" = "#C1CDCD",
"Intergenic" = "#1AC722",
"Synonymous" = "#FF8C00",
"5UTR" = "#87CEFA",
"3UTR" = "#FF3030",
"Nonsynonymous" = "pink"
)
#
# make plot
wgs_var_effects |>
mutate( # we can use mutate to change the class of column
Effect = as.factor( # we make column "Effect" factor
Effect
),
Effect = fct_reorder( # library forcats to re-order by the number of variants
Effect, Number
)
) |>
ggplot() +
geom_col( # to make bar plot with different colors
aes( # mapping
x = Effect,
y = Number,
fill = Effect
),
color = "yellow" # for the border
) +
labs( # to change the ylab and tittle
title = "WGS variants per SnpEff annotation",
y = "Number of annotated variants",
x = "Annotated effect",
caption = "Functional annotation of SNPs in WGS data set (~2.8m - MAF 10%)"
) +
hrbrthemes::theme_ipsum( # I like this theme, but you can use any
base_family = "", # needs to have extrafont library
axis_text_size = 12,
axis_title_size = 14,
plot_margin = margin(
10, 10, 10, 10
),
grid = TRUE,
grid_col = "#fabbe2"
) +
scale_fill_manual(
values = unlist(
special_colors
)
) +
scale_color_manual(
values = unlist(
special_colors
)
) +
theme(
legend.position = "none",
panel.grid.major = element_line(
linetype = "dashed",
linewidth = 0.2,
),
panel.grid.minor = element_line(
linetype = "dashed",
linewidth = 0.2
)
) +
scale_y_continuous(
trans = "log10",
breaks = scales::log_breaks(
n = 7
),
labels = label_comma()
) +
annotation_logticks(
sides = "b"
) +
geom_label(
aes(
x = Effect,
y = Number,
color = Effect,
label = paste(
formatC(
Number,
format = "f",
big.mark = ",",
digits = 0
),"\n","(",Percentage,"%)"
)
),
face = "bold",
size = 2,
family = "Roboto Condensed",
alpha = .9,
) +
coord_flip() # to make it easier to read
# chip data
chip_data <- read_delim(
here(
"output", "SnpEff", "annotations_summary.txt"
),
col_names = FALSE,
show_col_types = FALSE,
col_types = "cn"
) |>
rename(
Effect = 1,
Number = 2
) |>
mutate(
Effect = as.factor(
Effect
),
Percentage = prop.table( # we use prop.table to calculate percentages
Number
) * 100,
across( # we use acrros to set column 3 as 2 decimals
3, round, 2
)
) |>
arrange( # we want to sort column percentage from higher to low
-Percentage
) |>
mutate(Data = "Chip")
#
# wgs data
wgs_data <- read_delim(
here(
"output", "SnpEff", "wgs_annotations_summary_MAF_0.1.txt"
),
col_names = FALSE,
show_col_types = FALSE,
col_types = "cn"
) |>
rename(
Effect = 1,
Number = 2
) |>
mutate(
Effect = as.factor(
Effect
),
Percentage = prop.table( # we use prop.table to calculate percentages
Number
) * 100,
across( # we use across to set column 3 as 2 decimals
3, round, 2
)
) |>
arrange( # we want to sort column percentage from higher to low
-Percentage
) |>
mutate(Data = "WGS")
#
# merge data sets
wgs_chip_data <-
rbind(
chip_data, wgs_data
)
Create plot
#
# plot it
wgs_chip_data |>
mutate( # we can use mutate to change the class of column
Effect = as.factor( # we make column "Effect" factor
Effect
),
Effect = fct_reorder( # library forcats to re-order by the number of variants
Effect, Number
)
) |>
ggplot() +
geom_col( # to make bar plot with different colors
aes( # mapping
x = Effect,
y = Number,
color = Effect,
fill = Data
),
show.legend = TRUE,
position = position_dodge(width = 0.7),
width = .6,
linewidth = .01,
alpha=0.8
) +
labs( # to change the ylab and tittle
title = "Variants effects per SnpEff annotation",
y = bquote(bold(.("Number of annotated variants effects"~(Log[10])))),
x = "Annotated effect",
caption = "Functional annotation of WGS SNPs used for chip design (~2.8m) and Chip recovered SNPs (~60k) - MAF = 0.1"
) +
hrbrthemes::theme_ipsum( # I like this theme, but you can use any
base_family = "", # needs to have extrafont library
axis_text_size = 12,
axis_title_size = 14,
plot_margin = margin(
10, 10, 10, 10
),
grid = TRUE,
grid_col = "#fabbe2"
) +
scale_color_manual(
name = "Effect",
breaks = c(
"Intron", "Intergenic", "Synonymous", "5UTR", "3UTR", "Nonsynonymous"
),
values = c(
"Intron" = "#7A7070",
"Intergenic" = "#4674D9",
"Synonymous" = "#17D13C",
"5UTR" = "purple",
"3UTR" = "#F7AB1E",
"Nonsynonymous" = "#ED1717"
),
guide = "none"
) +
scale_fill_manual(
name = "Data",
breaks = c(
"WGS", "Chip"
),
values = c(
"Chip" = "#bab9b1",
"WGS" = "#d2f59e"
)
) +
theme(
axis.text.y = element_text(
colour = "black"
),
axis.title.y = element_text(
colour = "black",
face = "bold"
),
legend.position = "right",
panel.grid.major = element_line(
linetype = "dashed",
linewidth = 0.2,
),
panel.grid.minor = element_line(
linetype = "dashed",
linewidth = 0.2
)
) +
scale_y_continuous(
trans = "log10",
breaks = scales::log_breaks(
n = 7
),
labels = label_comma()
) +
annotation_logticks(
sides = "b"
) +
geom_label(
aes(
x = Effect,
y = Number,
color = Effect,
label = paste(
formatC(
Number,
format = "f",
big.mark = ",",
digits = 0
), "\n", "(", Percentage, "%)"
)
),
fill = alpha(
c(
"white"
),
0.8
),
size = 2.5,
family = "Roboto Condensed",
label.size = NA, # to remove border
nudge_x = -.05
) +
coord_flip() # to make it easier to read
#
# save plot
ggsave(
here(
"output", "SnpEff","figures", "wgs_vs_chip_SnpEff.pdf"
),
width = 8,
height = 5,
units = "in"
)
We can make a table with the percentage differences
# calculate the differences
percetage_diff <-
wgs_data |>
left_join( # use left join to join by column Effect (common column)
chip_data,
by = "Effect",
suffix = c( # we add wgs and chip to keep track of data
".wgs", ".chip"
),
) |>
mutate( # we create a new column and calculate the difference
difference = Percentage.chip - Percentage.wgs
) |>
select( # we select only the columns we want in our table
Effect, Number.wgs, Percentage.wgs, Number.chip, Percentage.chip, difference
) |>
rename( # we rename the columns by index
"Variant effect" = 1,
"WGS (N)" = 2,
"WGS (%)" = 3,
"Chip (N)" = 4,
"Chip (%)" = 5,
"Chip bias (%)" = 6
)
#
# make table
ftab_diff <-
flextable(
percetage_diff
) |>
set_caption(
as_paragraph(
as_chunk(
"Percentage of variants by their effects in WGS and chip data sets (%)",
props = fp_text_default(
font.fam = "Cambria"
)
)
),
word_stylena = "Table Caption"
)
#
# Apply zebra theme
ftab_diff <- flextable::theme_zebra(ftab_diff)
# print(ftab.snps, preview = "docx") # to open and see it in Microsoft Word.
ftab_diff |>
autofit()
Variant effect | WGS (N) | WGS (%) | Chip (N) | Chip (%) | Chip bias (%) |
---|---|---|---|---|---|
Intron | 1,281,482 | 44.81 | 24,387 | 39.49 | -5.32 |
Intergenic | 993,796 | 34.75 | 13,758 | 22.28 | -12.47 |
Synonymous | 384,482 | 13.45 | 14,323 | 23.20 | 9.75 |
5UTR | 67,744 | 2.37 | 3,642 | 5.90 | 3.53 |
Nonsynonymous | 67,217 | 2.35 | 1,942 | 3.14 | 0.79 |
3UTR | 64,912 | 2.27 | 3,697 | 5.99 | 3.72 |
We do not know how the differences in the proportion of each variant by effect in the chip data set affects ancestry analysis. We can create SNP lists of the chip data matching the proportions of the WGS data set. For example, in the chip we have bias towards coding regions.
In the WGS data set, SNPs in introns are the highest proportion in the data set. Therefore, the chip data set should also have SNPs in introns with the highest proportion. The same idea is valid for other SNP effect types. Overall, we have an under-representation of variants in introns (-6.96%) and intergenic regions (-23.30%) in the chip data set, with an over representation of synonymous (+18.36%), 3’UTR (-4.47%), 5’UTR (+5.19%), and nonsynonymous (+2.23%). We do not know how selection is acting on each set of SNPs. Presumably, variants in coding region may be under stronger selection than those variants in intergenic regions.
# I used Excel to calculate the new proportions of the chip variants to match the WGS proportions
# then I used the RStudio addin datapasta to paste here as tibble, you can also import the Excel file or do it in a different way
new_proportions <- tibble::tribble(
~Variant.effect, ~`WGS.(N)`, ~`WGS.(%)`, ~`Chip.(N)`, ~`Chip.(%)`, ~`Chip.bias.(%)`, ~`Chip.Possible.(N)`, ~`Chip.new.(%)`,
"Intron", 1281482, 44.81, 24387, 39.49, -5.32, 17741, 44.81,
"Intergenic", 993796, 34.75, 13758, 22.28, -12.47, 13758, 34.75,
"Synonymous", 384482, 13.45, 14323, 23.2, 9.75, 5325, 13.45,
"5UTR", 67744, 2.37, 3642, 5.9, 3.53, 938, 2.37,
"3UTR", 64912, 2.27, 3697, 5.99, 3.72, 899, 2.27,
"Nonsynonymous", 67217, 2.35, 1942, 3.14, 0.79, 930, 2.35,
NA, 2859633, 100, 61749, 100, 0, 39591, 100
)
# create table
# make table
ftab_prop <-
flextable(
new_proportions
) |>
set_caption(
as_paragraph(
as_chunk(
"New proportion of variants from Chip data set to match WGS variant proportions",
props = fp_text_default(
font.fam = "Cambria"
)
)
),
word_stylena = "Table Caption"
)
#
# Apply zebra theme
ftab_prop <- flextable::theme_zebra(ftab_prop)
# print(ftab.snps, preview = "docx") # to open and see it in Microsoft Word.
ftab_prop |>
autofit()
Variant.effect | WGS.(N) | WGS.(%) | Chip.(N) | Chip.(%) | Chip.bias.(%) | Chip.Possible.(N) | Chip.new.(%) |
---|---|---|---|---|---|---|---|
Intron | 1,281,482 | 44.81 | 24,387 | 39.49 | -5.32 | 17,741 | 44.81 |
Intergenic | 993,796 | 34.75 | 13,758 | 22.28 | -12.47 | 13,758 | 34.75 |
Synonymous | 384,482 | 13.45 | 14,323 | 23.20 | 9.75 | 5,325 | 13.45 |
5UTR | 67,744 | 2.37 | 3,642 | 5.90 | 3.53 | 938 | 2.37 |
3UTR | 64,912 | 2.27 | 3,697 | 5.99 | 3.72 | 899 | 2.27 |
Nonsynonymous | 67,217 | 2.35 | 1,942 | 3.14 | 0.79 | 930 | 2.35 |
NA | 2,859,633 | 100.00 | 61,749 | 100.00 | 0.00 | 39,591 | 100.00 |
#
# save
ftab_prop |>
autofit() |>
save_as_docx(
path = here(
"output", "SnpEff", "new_proportions_chip.docx"
)
)
The challenge now is to subset the Chip data set by variant effect and generate a new data set with similar variant proportions to the WGS data set. We can randomly select variants from each effect group, then we can perform a new variant effect annotation, and calculate the proportion of each variant type by their effect. However, since some variants have more than one effect, we might not get the exact proportion we want.
We created a SNP list from the chip data set earlier. We can use those to create the new chip data set. We can use bash tools to randomly select SNPs, then we merge the lists of SNPs, and use Plink2 to generate the new data set. We then perform the functional annotation of the new data set. Perhaps, we can generate a few data sets, randomly selecting the same number of SNPs, and compare the robustness of our approach.
We can create a new directory inside the SnpEff directory, and move the files there
# I created a new directory and copy the variant lists there. I have a file with both 3 and 5'UTR SNPs, that is why we have more than 61k total
# mkdir output/SnpEff/chip_new_proportions
wc -l output/SnpEff/chip_new_proportions/*SNPs.txt
## 3697 output/SnpEff/chip_new_proportions/3UTR_SNPs.txt
## 3642 output/SnpEff/chip_new_proportions/5UTR_SNPs.txt
## 1942 output/SnpEff/chip_new_proportions/Nonsynonymous_SNPs.txt
## 14323 output/SnpEff/chip_new_proportions/Synonymous_SNPs.txt
## 7334 output/SnpEff/chip_new_proportions/UTRs_SNPs.txt
## 13758 output/SnpEff/chip_new_proportions/intergenic_SNPs.txt
## 24387 output/SnpEff/chip_new_proportions/intron_SNPs.txt
## 69083 total
We can create a bash pipeline to randomly sample SNPs and create a new data set, following the WGS variant proportions.
# the first step of the pipeline is to get the number of SNPs we need to randomly sample from the list of SNPs of each effect group. This information is in the table we created recently.
head(new_proportions)
## # A tibble: 6 × 8
## Variant.effect `WGS.(N)` `WGS.(%)` `Chip.(N)` `Chip.(%)` `Chip.bias.(%)`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Intron 1281482 44.8 24387 39.5 -5.32
## 2 Intergenic 993796 34.8 13758 22.3 -12.5
## 3 Synonymous 384482 13.4 14323 23.2 9.75
## 4 5UTR 67744 2.37 3642 5.9 3.53
## 5 3UTR 64912 2.27 3697 5.99 3.72
## 6 Nonsynonymous 67217 2.35 1942 3.14 0.79
## # ℹ 2 more variables: `Chip.Possible.(N)` <dbl>, `Chip.new.(%)` <dbl>
We use all the intergenic SNPs we have since they are the most under represented in the chip data set.
We have 13758 intergenic SNPs and we will always include them in all data sets. For all other categories of SNPs, we will always randomly sample the SNPs in that category. We have more than we need to achieve the WGS data set proportions.
For example, we will randomly sample 17741 SNPs out of the 39591 for SNPs on introns to end with a proportion of 44.81% for Chip data set.
We will have similar proportions to the WGS data set, but since we have access to more SNPs in coding regions, we can sample them and see if there are differences in the output of the analysis.
We don’t know if the the allele frequency differences we see in each data set are the result of different selection pressures, for example, due to local adaptation of populations.
We might see a swift in the allele frequency of populations if certain SNPs are sampled. For example, we have 24387 on introns, and we will randomly sample 17741, more than half of the data set will always be re-sampled.
In other SNPs sets we have a different scenario. We have 1942 SNPs with nonsynonymous effects, but we will always sample 930 SNPs, which is about 50% of the data set.
If variants with nonsynonymous effects have a strong effect on the allele frequency of a population, they might bias the output of our analysis.
We might conclude that SNPs sampling following the WGS variants proportions according to their effect, is influenced by the selection pressures on each variant. Then, even an even sampling of the variants according the their effect is not sufficient to replicate the allele frequencies using whole genome sequencing.
If a set of variants have strong effects on the allele frequency of populations, then we will be able to have a holistic view of the current populations’ genetic architecture, if we have high density data across the genome. It will also tell us that the selection pressure on variants is strong enough to bias the output of the analysis using the chip data. Then, precaution would be advised. Perhaps, limiting or avoiding certain applications or analysis, or valuing the outcome differently.
We wanted to include as many as possible SNPs on exons. If the SNPs on the exons are under strong selection, they might have low minor allele frequency. For example, for a variant causing a nonsynonymous change in a protein coding gene of vital importance, the polymorphic might exist only in the heterozygous state in a few individuals in the populations.
An example would be a variant causing a ineffective enzyme. The transcription machinery might not have adapted yet to the new mutation. Both chromosomes would be transcribed, but only RNA molecules from one chromosome would be translated to an normal enzyme.
The lower transcription rate of the RNA of the gene will result in lower levels of the enzyme. If the lower activity of this enzyme causes changes in the organism, positively or negatively, the mutation will be maintained or not in the population.
For example, the lower level of the enzyme, results in changes in behavior that increases the reproductive fitness of the individual.
If the offspring of this individual carrying the new mutation also have a higher reproductive fitness, the allele frequency of the population might change over generations. We observe these changes in frequency via linkage network analysis. For example, an influx of migrants in a population might change the allele frequencies in the next generations.
These changes can be visualized by measuring linkage differences between two populations, and of course, only if there are differences in recombination rate during meiosis. A general increase in linkage might be observed.
Create pipeline
# the first step in our pipeline is to create a file with how many SNPs we have to sample from each group.
# we can select it from the object we created earlier
random_variants <-
new_proportions |>
select(
"Variant.effect", "Chip.Possible.(N)"
) |>
filter(
!row_number() %in% c(7) # to remove the row "Total"
) |>
rename( # new names for the columns
effect = 1,
number = 2
) |>
mutate( # we need to change the cases of the rows to match the files
effect = tolower( # we use base R to change everything to lowercase
effect
),
across( # however we need to change utr back capital letters UTR
"effect", ~str_replace(., "utr", "UTR")
)
)
#
# save it to a file to use in the bash pipeline
write_delim(
random_variants,
here(
"output", "SnpEff", "chip_new_proportions", "n_variant_for_sampling_by_effect.txt"
),
delim = " ",
quote = "none",
col_names = FALSE
)
head(random_variants)
## # A tibble: 6 × 2
## effect number
## <chr> <dbl>
## 1 intron 17741
## 2 intergenic 13758
## 3 synonymous 5325
## 4 5UTR 938
## 5 3UTR 899
## 6 nonsynonymous 930
Loop to get sets of randomly selected variants
# a loop to generate random sets of SNPs
for set in $(seq 1 10); do
cat output/SnpEff/chip_new_proportions/n_variant_for_sampling_by_effect.txt | while IFS=" " read -r effect number
do
cat output/SnpEff/chip_new_proportions/$effect\_SNPs.txt | shuf -n $number
done > output/SnpEff/chip_new_proportions/sets/set$set.txt
done
Count how many SNPs we have in each set
## 39591 output/SnpEff/chip_new_proportions/sets/set1.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set10.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set2.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set3.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set4.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set5.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set6.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set7.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set8.txt
## 39591 output/SnpEff/chip_new_proportions/sets/set9.txt
## 395910 total
Compare the SNPs on two data sets (1 and 2)
# from man comm
# -1 Suppress printing of column 1, lines only in file1.
# -2 Suppress printing of column 2, lines only in file2.
# -3 Suppress printing of column 3, lines common to both.
#
# comparing set1 and set2
# to get the shared number of SNPs
echo "Comparing set1 and set2:";
echo "shared number SNPs";
comm -12 <(sort output/SnpEff/chip_new_proportions/sets/set1.txt) <(sort output/SnpEff/chip_new_proportions/sets/set2.txt) | wc -l;
comm -12 <(sort output/SnpEff/chip_new_proportions/sets/set1.txt) <(sort output/SnpEff/chip_new_proportions/sets/set2.txt) > output/SnpEff/chip_new_proportions/comm_set1_2.txt
#
# to get number SNPs unique to set1
echo "SNPs unique to set1";
comm -23 <(sort output/SnpEff/chip_new_proportions/sets/set1.txt) <(sort output/SnpEff/chip_new_proportions/sets/set2.txt) | wc -l;
#
# to get number SNPs unique to set2
echo "SNPs unique to set2";
comm -13 <(sort output/SnpEff/chip_new_proportions/sets/set1.txt) <(sort output/SnpEff/chip_new_proportions/sets/set2.txt) | wc -l;
## Comparing set1 and set2:
## shared number SNPs
## 29969
## SNPs unique to set1
## 9622
## SNPs unique to set2
## 9622
Compare the SNPs on two data sets (3 and 4)
echo "Comparing set3 and set4:";
echo "shared number SNPs";
comm -12 <(sort output/SnpEff/chip_new_proportions/sets/set3.txt) <(sort output/SnpEff/chip_new_proportions/sets/set4.txt) | wc -l;
comm -12 <(sort output/SnpEff/chip_new_proportions/sets/set3.txt) <(sort output/SnpEff/chip_new_proportions/sets/set4.txt) > output/SnpEff/chip_new_proportions/comm_set3_4.txt
#
# to get number SNPs unique to set1
echo "SNPs unique to set1";
comm -23 <(sort output/SnpEff/chip_new_proportions/sets/set3.txt) <(sort output/SnpEff/chip_new_proportions/sets/set4.txt) | wc -l;
#
# to get number SNPs unique to set2
echo "SNPs unique to set2";
comm -13 <(sort output/SnpEff/chip_new_proportions/sets/set3.txt) <(sort output/SnpEff/chip_new_proportions/sets/set4.txt) | wc -l;
## Comparing set3 and set4:
## shared number SNPs
## 29966
## SNPs unique to set1
## 9625
## SNPs unique to set2
## 9625
Check SNPs that are shared between set1 and 2 and and those from set 3 and 4
echo "shared SNPs for each comparison:"
echo " number SNPs shared set1 and set2"
wc -l output/SnpEff/chip_new_proportions/comm_set1_2.txt;
#
echo " number SNPs shared set3 and set4"
wc -l output/SnpEff/chip_new_proportions/comm_set3_4.txt;
#
echo "Comparing shared SNPs from set1_2 and set3_4:";
echo "shared SNPs - we got them twice";
comm -12 <(sort output/SnpEff/chip_new_proportions/comm_set1_2.txt) <(sort output/SnpEff/chip_new_proportions/comm_set3_4.txt) | wc -l;
#comm -12 <(sort output/SnpEff/chip_new_proportions/comm_set1_2.txt) <(sort output/SnpEff/chip_new_proportions/comm_set3_4.txt) > output/SnpEff/chip_new_proportions/comm_set3_4.txt
#
# to get number SNPs unique to set1
echo "SNPs unique to set1_2 - we did not sample then twice";
comm -23 <(sort output/SnpEff/chip_new_proportions/comm_set1_2.txt) <(sort output/SnpEff/chip_new_proportions/comm_set3_4.txt) | wc -l;
#
# to get number SNPs unique to set2
echo "SNPs unique to set3_4 - we did not sample then twice";
comm -13 <(sort output/SnpEff/chip_new_proportions/comm_set1_2.txt) <(sort output/SnpEff/chip_new_proportions/comm_set3_4.txt) | wc -l;
## shared SNPs for each comparison:
## number SNPs shared set1 and set2
## 29969 output/SnpEff/chip_new_proportions/comm_set1_2.txt
## number SNPs shared set3 and set4
## 29966 output/SnpEff/chip_new_proportions/comm_set3_4.txt
## Comparing shared SNPs from set1_2 and set3_4:
## shared SNPs - we got them twice
## 21697
## SNPs unique to set1_2 - we did not sample then twice
## 8272
## SNPs unique to set3_4 - we did not sample then twice
## 8269
Out of the 39591 SNPs from our comparison, we obtained 29943 shared between set1 and set3 and 29889 shared between set3 and set4.
From these two list of shared SNPs, we got 21683 that were sampled 4 times (they were in all 4 sets), while 16466 (8260 + 8206 unique to each re-sampling) were not re sampled in all 4 sets. We get around 8200 SNPs unique to each set.
Now we can use the sets of SNPs to create new files for our analyses.
# since we cannot repeat test all SNPs sets, we will choose 3 out of the 10 randomly
# you can run the comman below several time if you want. Each time, 3 random SNPs sets will be choose
ls -1 output/SnpEff/chip_new_proportions/sets/*.txt | shuf -n3
## output/SnpEff/chip_new_proportions/sets/random3.txt
## output/SnpEff/chip_new_proportions/sets/random5.txt
## output/SnpEff/chip_new_proportions/sets/random10.txt
When I first run it, I got set3, 5, and 10. We can also create another 3 sets with random SNPs from the full data set. Our goal is to see if the our approach is sensible to biological factors and not only due to dimensional reduction of the data set.
# the bed file7 in our population directory has all the 59384 SNPs passing QC
# we can create the 3 random SNP list from it. We need 39591 SNPs to match the number of SNPs in our previous SNP sets.
# note about each SNP set about LD pruning and MAF: we will have to perform filtering depending on the type of analysis we will perform with the data. For example, for Admixture, we will use LD pruning, which might create SNP sets with slight different proportions from what we wanted. Minor allele frequency filters will also affect our data set's SNP numbers
#
# we can use a loop to create 10 sets, then we randomly choose 3 of them
# when we extract the set of snps with plink we obtain around 33400 SNPs
for set in $(seq 1 10); do
awk '{print $2}' output/populations/file7.bim | shuf -n 33400 > output/SnpEff/chip_new_proportions/sets/random$set.txt
done
Select 3 sets from the 10 random sets
# since we cannot repeat test all SNPs sets, we will choose 3 out of the 10 randomly
# you can run the comm below several time if you want. Each time, 3 random SNPs sets will be choose
ls -1 output/SnpEff/chip_new_proportions/sets/random*.txt | shuf -n3
## output/SnpEff/chip_new_proportions/sets/random5.txt
## output/SnpEff/chip_new_proportions/sets/random1.txt
## output/SnpEff/chip_new_proportions/sets/random6.txt
Each time we run the command above a different set of SNPs will be selected. In my case, it selected random9, 4, and 5.
We can make Venn diagrams of our sets. Let’s do 3 way Venn diagrams
We will be using the R library reticulate, which allows integration of python and R
# once you import it, you will see the functions in our R environment
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn3, venn3_circles
Create Venn diagram from SNP sets
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn3, venn3_circles
import sys
sys.stdout = open('output.log', 'w')
# compare sets with SNP sampling following the WGS proportions
# we will compare set3, 5, and 10
set3 = open('output/SnpEff/chip_new_proportions/sets/set3.txt').read().split()
set5 = open('output/SnpEff/chip_new_proportions/sets/set5.txt').read().split()
set10 = open('output/SnpEff/chip_new_proportions/sets/set10.txt').read().split()
#
## count total number of SNPs
total = len(list(set(set3+set5+set10)))
#
# plot Venn
fig = plt.figure()
fig, ax = plt.subplots(1, figsize=(11,6))
#
v = venn3(subsets=([set(set3), set(set5), set(set10)]),
set_labels = ('set3', 'set5', 'set10'),
set_colors=("#ff88c0", "#4fd3ff", "#FF8C00"),
subset_label_formatter=lambda x: str(x) + "\n(" + f"{(x/total):.2%}" + ")",
alpha=0.5)
#
# add circles
c = venn3_circles(([set(set3), set(set5), set(set10)]),
linestyle='dashed',
linewidth=1,
color="#cfcac6")
#
# the aesthetics
overlap = v.get_patch_by_id("C") # get the overlap
overlap.set_edgecolor("white") # set the edge color
overlap.set_alpha(1) # set the alpha to 1 otherwise
overlap.set_linestyle("dashed") # set a dashed line
overlap.set_linewidth(2) # set same line width as the circles
overlap.set_zorder(2) # bump overlap up a level so we can
#
# Set titles for the figure and the subplot respectively
fig.suptitle('SNP sets following WGS proportions', fontsize=14, fontweight='bold');
ax.set_title('~34k SNPs each set');
fig.tight_layout()
#
plt.savefig('output/SnpEff/chip_new_proportions/sets/sets_following_wgs_prop.pdf', bbox_inches='tight')
plt.show()
From the Venn diagram, we can see that the 3 SNP sets share almost 25k SNPs (~48% of the SNPs), but they have about 4k unique SNPs (~8%), while sharing ~4.8k SNPs, and the other set (~9%). It will be interesting to see if the difference among sets will result in differences in the population or individual based statistics.
Create Venn diagram from random SNP sets
# compare sets with SNP sampling following the WGS proportions
# we will compare set3, 5, and 10
random2 = open('output/SnpEff/chip_new_proportions/sets/random2.txt').read().split()
random8 = open('output/SnpEff/chip_new_proportions/sets/random8.txt').read().split()
random9 = open('output/SnpEff/chip_new_proportions/sets/random9.txt').read().split()
#
## count total number of SNPs
total = len(list(set(random2+random8+random9)))
#
# plot Venn
fig = plt.figure()
fig, ax = plt.subplots(1, figsize=(11,6))
#
v = venn3(subsets=([set(random2), set(random8), set(random9)]),
set_labels = ('random2', 'random8', 'random9'),
set_colors=("#ff88c0", "#4fd3ff", "#FF8C00"),
subset_label_formatter=lambda x: str(x) + "\n(" + f"{(x/total):.2%}" + ")",
alpha=0.5)
#
# add circles
c = venn3_circles(([set(random2), set(random8), set(random9)]),
linestyle='dashed',
linewidth=1,
color="#cfcac6")
#
# the aesthetics
overlap = v.get_patch_by_id("C") # get the overlap
overlap.set_edgecolor("white") # set the edge color
overlap.set_alpha(1) # set the alpha to 1 otherwise
overlap.set_linestyle("dashed") # set a dashed line
overlap.set_linewidth(2) # set same line width as the circles
overlap.set_zorder(2) # bump overlap up a level so we can
#
# Set titles for the figure and the subplot respectively
fig.suptitle('Random SNP sets not following WGS proportions', fontsize=14, fontweight='bold')
ax.set_title('~34k SNPs each set')
fig.tight_layout()
#
plt.savefig('output/SnpEff/chip_new_proportions/sets/sets_random_not_following_WGS_prop.pdf', bbox_inches='tight')
plt.show()
To exit python, type quit or exit in the console.
We can see that the shared number of SNPs among the 3 randoms sets are different than when we follow the WGS proportions. Now we have about 30% shared SNPs instead of 48%. We see a similar number of unique SNPs for each set (8%), and each set shared more with other sets, increasing from 9% to 15%.
We will use the sets following the WGS SNP effect proportions to create new data sets to run with fastStructure.
Choose 2 sets randomly
# since we cannot repeat test all SNPs sets, we will choose 2 out of the 10 randomly
# you can run the comman below several time if you want. Each time, 2 random SNPs sets will be choose
ls -1 output/SnpEff/chip_new_proportions/sets/set*.txt | shuf -n2
## output/SnpEff/chip_new_proportions/sets/set8.txt
## output/SnpEff/chip_new_proportions/sets/set6.txt
I will use set4.txt and set10.txt.
We can extract these SNPs from the file7 using Plink
set4
plink2 \
--allow-extra-chr \
--bfile output/populations/file7 \
--extract output/SnpEff/chip_new_proportions/sets/set4.txt \
--make-bed \
--out output/populations/set4 \
--silent;
grep 'variants\|samples' output/populations/set4.log
## 237 samples (30 females, 67 males, 140 ambiguous; 237 founders) loaded from
## 82731 variants loaded from output/populations/file7.bim.
## --extract: 33447 variants remaining.
## 33447 variants remaining after main filters.
set10
plink2 \
--allow-extra-chr \
--bfile output/populations/file7 \
--extract output/SnpEff/chip_new_proportions/sets/set10.txt \
--make-bed \
--out output/populations/set10 \
--silent;
grep 'variants\|samples' output/populations/set10.log
## 237 samples (30 females, 67 males, 140 ambiguous; 237 founders) loaded from
## 82731 variants loaded from output/populations/file7.bim.
## --extract: 33380 variants remaining.
## 33380 variants remaining after main filters.
We can select two random sets of SNPs to compare with these two sets
# since we cannot repeat test all SNPs sets, we will choose 3 out of the 10 randomly
# you can run the comm below several time if you want. Each time, 3 random SNPs sets will be choose
ls -1 output/SnpEff/chip_new_proportions/sets/random*.txt | shuf -n2
## output/SnpEff/chip_new_proportions/sets/random5.txt
## output/SnpEff/chip_new_proportions/sets/random4.txt
random8
plink2 \
--allow-extra-chr \
--bfile output/populations/file7 \
--extract output/SnpEff/chip_new_proportions/sets/random8.txt \
--make-bed \
--out output/populations/random8 \
--silent;
grep 'variants\|samples' output/populations/random8.log
## 237 samples (30 females, 67 males, 140 ambiguous; 237 founders) loaded from
## 82731 variants loaded from output/populations/file7.bim.
## --extract: 39591 variants remaining.
## 39591 variants remaining after main filters.
random4
plink2 \
--allow-extra-chr \
--bfile output/populations/file7 \
--extract output/SnpEff/chip_new_proportions/sets/random4.txt \
--make-bed \
--out output/populations/random4 \
--silent;
grep 'variants\|samples' output/populations/random4.log
## 237 samples (30 females, 67 males, 140 ambiguous; 237 founders) loaded from
## 82731 variants loaded from output/populations/file7.bim.
## --extract: 39591 variants remaining.
## 39591 variants remaining after main filters.
We did run ancestry analysis with these different sets, but there was no significant differences.