23 KiB
Comparative Genomics Project
- General infos
- Download the source data
- Develop the Nextflow workflow
- From the given BLASTp format 6 file
- Amount of Tandemly Arrayed Genes
- Complete slides
- Family sizes
- $K_s$ For all duplicated pairs within a family
- Does Glycine max have big TAGs and which function are the big TAG gene implied?
- Are TAG pairs different in age from non-TAG pairs?
- Are genes inside a TAG orientated in the same way more often than random?
General infos
<2024-10-25 Fri>
Objectives
Part 1
-
Determine the amount of duplicate genes in my genome.
- What is the proportion of duplicate genes in my genome.
-
Give an interval based on two datasets (low stringency dataset and high stringency dataset).
Low High Coverage 30% 40% %identity 30% 50% Bitscore no filtering no filtering Clustering mcl (with default parameters) mcl (with default parameters) min_coverage=30 min_identity=30
min_coverage=40 min_identity=50
- Determine the amount of Tandemly Arrayed Genes (TAGs) in the genome (definition 0 and 1).
- Determine the amount in plant's chloroplasts and mitochondria.
- Is there a lot of duplicated operons in plants?
Part 2
- Compute \(K_s\) values and use it as a proxy of the age of the duplication.
- Are TAGs younger than other duplicate genes?
Our species
Glycine max is the soy plant.
Download the source data
Develop the Nextflow workflow
Filtering the proteome to keep the longest isoform
I used three awk families_and_TAGs/scripts to filter the FASTA file to keep only the longest isoforms:
protein_lengths.awk
associate each FASTA record with the length of its sequence,filter_longest.awk
keep the isoform ID that has the maximum length,filter_records.awk
based on the list of isoform ID, retrieve the sequence of from the original file.
protein_lengths.awk
filter_longest.awk
filter_records.awk
How many protein coding gene are there?
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
| grep -v 'supercontig' \
| awk ' /^>/ { print $4 }' | sort | uniq | wc -l
55589
55897
Filtering out Chloroplasts and Mitochondria proteins
From the Ensembl Plants index page we can see that there is no chloroplast nor mitochondria proteins in the sequenced genome of Glycine max.
Filter out supercontigs
<2024-10-28 Mon>
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
| awk -f ./workflow/families_and_TAGs/scripts/remove_supercontigs.awk \
> ./tmp/Glycine_max.Glycine_max_v2.1.pep.all.nosupercontig.fa
Sanity check: Has the script removed all supercontigs proteins and only them?
How many records in supercontigs were there?
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
| awk -F' ' ' /^>/ { print $3}' | grep -c "^supercontig"
435
How many records have been removed?
zgrep -c "^>" ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz
88412
grep -c "^>" ./tmp/Glycine_max.Glycine_max_v2.1.pep.all.nosupercontig.fa
87977
expr $ALL_ISOFORMS - $NO_SUPERCONTIGS
435
BLASTp All Against All
makeblastdb -in "$proteome" -dbtype prot -out 'tmp/blastdb' -title "Glycine max"
blastp -query "$proteome" -db 'tmp/blastdb' -outfmt 6 -out "tmp/Glycine_max.blastp.tsv"
Clustering
Using mcl
.
-
Extract the homology in ABC format
awk 'BEGIN { OFS="\t" } { print \$14, \$16, \$12 }' "${blastp}" > 'graph.abc'
-
Run
mcl
with default parametersmcl 'graph.abc' --abc -o 'custering.mcl'
-
Convert
mcl
output to TSV with two columnsgene id
,family id
.awk -f workflow/families_and_TAGs/scripts/mcl_to_tsv.awk 'clustering.mcl' > 'families.tsv'
Running the Nextflow workflow
Update the parameters in nextflow.config
and run
nextflow run main.nf -profile conda -resume
from the ./workflow/
folder
From the given BLASTp format 6 file
<2024-11-02 Sat> How many BLASTp HSP have been reported?
cat ../data/Glycine_max_Blastp_longIsoforme | wc -l
4456730
Remove HSP on proteins found on supercontigs
Extract the isoform ID whose gene is on supercontigs.
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz| awk -F' ' '/^>/ && $3 ~ /^supercontig/ { sequence_id = $1; gsub(">", "", sequence_id); print sequence_id }' > ./tmp/protein_on_supercontigs.list
cat ./tmp/protein_on_supercontigs.list | wc -l
435
Remove HSP with subject or query being on a supercontig
filter_blastp_sequence_id.awk
awk -f workflow/families_and_TAGs/scripts/filter_blastp_sequence_id.awk ./tmp/protein_on_supercontigs.list ../data/Glycine_max_Blastp_longIsoforme > ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig
How many HSP have been removed?
length() {
cat $1 | wc -l
}
expr $(length ../data/Glycine_max_Blastp_longIsoforme) - $(length ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig)
36598
cat ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig | wc -l
4420132
Filter by coverage and identity percentage
Add protein length and gene name columns
Extract the protein length and gene id from the FASTA file
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz | awk -f ./workflow/families_and_TAGs/scripts/protein_lengths.awk > ./tmp/Glycine_max_protein_length_gene.tsv
Join the BLASTp HSP file with the gene id and gene lengths
protein_length=./tmp/Glycine_max_protein_length_gene.tsv
blastp=./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig
sort -k 1 "${blastp}" > tmp/blastp_s
sort -k 1 "${protein_length}" > tmp/protein_length_s
join -1 1 -2 1 -t $'\t' tmp/blastp_s tmp/protein_length_s > tmp/join1.tsv
sort -k 2 tmp/join1.tsv > tmp/join1.tsv_s
join -1 2 -2 1 -t $'\t' tmp/join1.tsv_s tmp/protein_length_s > tmp/joined.blastp.tsv
Filter the BLASTp according to coverage and identity percentage
blastp=./tmp/joined.blastp.tsv
awk -f "workflow/families_and_TAGs/scripts/filter_blastp.awk" \
-v min_coverage="${min_coverage}" \
-v min_identity="${min_identity}" \
"${blastp}" > "tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.tsv"
First dataset
<<low-stringency>>
<<filter-blastp>>
<<high-stringency>>
<<filter-blastp>>
cat ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.tsv | wc -l
1520850
cat ./tmp/Glycine_max_Blastp_filtered_coverage40_identity50.tsv | wc -l
342180
Clustering
<2024-11-03 Sun>
Extract the homology graph
blastp="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.tsv"
abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.abc"
awk 'BEGIN { FS = OFS = "\t" } { print $14, $16, $12 }' "$blastp" > "$abc"
<<low-stringency>>
<<extract-abc>>
<<high-stringency>>
<<extract-abc>>
head -n 1 "tmp/Glycine_max_Blastp_filtered_coverage30_identity30.abc"
GLYMA_10G098500 | GLYMA_U008600 | 154 |
Filter the homology graph to keep only the heaviest weight
original_abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.abc"
filtered_abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}_f.abc"
awk -f "workflow/families_and_TAGs/scripts/keep_heaviest_edge_abc.awk" "$original_abc" > "$filtered_abc"
<<low-stringency>>
<<filter-abc>>
<<high-stringency>>
<<filter-abc>>
Run Markov Clustering
abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}_f.abc"
mcl="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl"
mcl "${abc}" --abc --discard-loops=yes -o "$mcl"
<<low-stringency>>
<<mcl>>
<<high-stringency>>
<<mcl>>
Transform MCL output into a gene to family mapping TSV file
mcl="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl"
tsv="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl.tsv"
awk -f "workflow/families_and_TAGs/scripts/mcl_to_tsv.awk" "${mcl}" > "${tsv}"
<<low-stringency>>
<<mcl-to-tsv>>
<<high-stringency>>
<<mcl-to-tsv>>
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
cat "${tsv}" | wc -l
46769
tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
cat "${tsv}" | wc -l
50254
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
head "${tsv}"
GLYMA_12G034500 | 1 |
GLYMA_12G138300 | 1 |
GLYMA_12G124600 | 1 |
GLYMA_12G126700 | 1 |
GLYMA_12G125200 | 1 |
GLYMA_12G034900 | 1 |
GLYMA_U036400 | 1 |
GLYMA_12G124200 | 1 |
GLYMA_12G126300 | 1 |
GLYMA_12G035100 | 1 |
tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
head "${tsv}"
GLYMA_07G053000 | 1 |
GLYMA_10G247700 | 1 |
GLYMA_12G188000 | 1 |
GLYMA_07G159200 | 1 |
GLYMA_13G313300 | 1 |
GLYMA_15G016600 | 1 |
GLYMA_14G159200 | 1 |
GLYMA_18G287500 | 1 |
GLYMA_20G132900 | 1 |
GLYMA_16G204800 | 1 |
What is the interval of duplicate gene count? There are between 48109 and 49391 duplicate genes in Glycine max.
What is the number of families?
tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
cut -f2 "${tsv}" | sort | uniq | wc -l
8426
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
cut -f2 "${tsv}" | sort | uniq | wc -l
11997
What is the number of singletons?
expr $ALL - $DUPLICATED
5643
expr $ALL - $DUPLICATED
9128
Amount of Tandemly Arrayed Genes
Extract gene position from a GFF3 file
zcat "../data/Glycine_max.Glycine_max_v2.1.60.chr.gff3.gz" \
| awk -f "./workflow/families_and_TAGs/scripts/gff3_to_gene_positions_table.awk" \
> "./tmp/Glycine_max.gene_positions.tsv"
head "./tmp/Glycine_max.gene_positions.tsv"
GLYMA_01G000100 | 1 | 27355 | 28320 |
GLYMA_01G000200 | 1 | 58975 | 67527 |
GLYMA_01G000300 | 1 | 67770 | 69968 |
GLYMA_01G000400 | 1 | 90152 | 95947 |
GLYMA_01G000500 | 1 | 90289 | 91197 |
GLYMA_01G000600 | 1 | 116094 | 127845 |
GLYMA_01G000700 | 1 | 143467 | 155573 |
GLYMA_01G000800 | 1 | 157030 | 157772 |
GLYMA_01G000900 | 1 | 170534 | 193342 |
GLYMA_01G001000 | 1 | 196256 | 201895 |
cat "./tmp/Glycine_max.gene_positions.tsv" \
| wc -l
55589
There is 55897 protein coding genes in the sequence file, but only 55589 gene positions.
expr $GENES - $POSITIONS
308
Extract TAGs
positions="./tmp/Glycine_max.gene_positions.tsv"
families="./tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl.tsv"
./rust/tagfinder/target/release/tagfinder \
--families "$families" \
--positions "$positions" \
--definitions 0,1 \
> "./tmp/Glycine_max_TAGs_coverage${min_coverage}_identity${min_identity}.tsv"
<<low-stringency>>
<<extract-TAGs>>
<<high-stringency>>
<<extract-TAGs>>
head "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
gene | family | tag0 | tag1 |
GLYMA_07G257400 | 3109 | - | - |
GLYMA_17G203900 | 2067 | - | - |
GLYMA_20G062000 | 509 | - | - |
GLYMA_16G099400 | 40 | - | - |
GLYMA_14G045400 | 1136 | 804 | 896 |
GLYMA_11G189200 | 15 | - | - |
GLYMA_02G191500 | 1310 | - | - |
GLYMA_02G308200 | 520 | - | - |
GLYMA_14G205200 | 8 | - | - |
How many TAGs for definition 0?
awk 'BEGIN {
max=0
}
NR > 1 {
if ($3 > max) {
max=$3
}
}
END {
print max
}
' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
3208
For TAG1?
awk 'BEGIN {
max=0
}
NR > 1 {
if ($4 > max) {
max=$4
}
}
END {
print max
}
' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
3481
awk 'BEGIN {
max=0
}
NR > 1 {
if ($3 > max) {
max=$3
}
}
END {
print max
}
' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
2500
awk 'BEGIN {
max=0
}
NR > 1 {
if ($4 > max) {
max=$4
}
}
END {
print max
}
' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
2652
Size of the greatest TAG
High stringency
TAG₀
awk '
{
tag_index=$3
if (tag_index != "-")
gene_count[tag_index]++
}
END {
max = 0
max_tag_index = ""
for (tag_index in gene_count) {
if (gene_count[tag_index] > max) {
max_tag_index = tag_index
max = gene_count[tag_index]
}
}
print max_tag_index, max
}' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
421 53
TAG₁
awk '
{
tag_index=$4
if (tag_index != "-")
gene_count[tag_index]++
}
END {
max = 0
max_tag_index = ""
for (tag_index in gene_count) {
if (gene_count[tag_index] > max) {
max_tag_index = tag_index
max = gene_count[tag_index]
}
}
print max_tag_index, max
}' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
459 76
Low stringency
TAG₀
awk '
{
tag_index=$3
if (tag_index != "-")
gene_count[tag_index]++
}
END {
max = 0
max_tag_index = ""
for (tag_index in gene_count) {
if (gene_count[tag_index] > max) {
max_tag_index = tag_index
max = gene_count[tag_index]
}
}
print max_tag_index, max
}' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
579 76
TAG₁
awk '
{
tag_index=$4
if (tag_index != "-")
gene_count[tag_index]++
}
END {
max = 0
max_tag_index = ""
for (tag_index in gene_count) {
if (gene_count[tag_index] > max) {
max_tag_index = tag_index
max = gene_count[tag_index]
}
}
print max_tag_index, max
}' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
646 76
DONE Complete slides
<2024-11-23 Sat>
Family sizes
head -1 ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl \
| awk '{ print NF }'
437
head -1 ./tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl \
| awk '{ print NF }'
159
TODO $K_s$ For all duplicated pairs within a family
Let $A$ be a family of $n$ duplicate members. The number of duplicate pairs is $n(n-1) / 2$.
PAML package:
conda install paml pal2nal clustalw -c bioconda
Use a Nextflow workflow
<2024-12-22 Sun>
All pair of duplicate genes
Generated by a rust program in rust/pairs
folder.
The algorithm is simple.
We assume the gene-families mapping file is sorted by family index.
For each family, loop for i from 0 to the size of the family minus one, loop for j from i plus one to the size of the family and output the pair gene i - gene j.
Example:
cat ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv | ./rust/pairs/target/release/pairs > ./tmp/Glycine_max_duplicate_gene_pairs.tsv
head ./tmp/Glycine_max_duplicate_gene_pairs.tsv
GLYMA_07G053000 | GLYMA_10G247700 |
GLYMA_07G053000 | GLYMA_12G188000 |
GLYMA_07G053000 | GLYMA_07G159200 |
GLYMA_07G053000 | GLYMA_13G313300 |
GLYMA_07G053000 | GLYMA_15G016600 |
GLYMA_07G053000 | GLYMA_14G159200 |
GLYMA_07G053000 | GLYMA_18G287500 |
GLYMA_07G053000 | GLYMA_20G132900 |
GLYMA_07G053000 | GLYMA_16G204800 |
GLYMA_07G053000 | GLYMA_08G302300 |
PAML Steps
Example for the first gene pair.
Extract the CDS sequence and protein sequence of the genes
gene_id_1="GLYMA_07G053000"
gene_id_2="GLYMA_10G247700"
CDS:
output_file="./tmp/${gene_id_1}_${gene_id_2}.cds.fst"
echo "" > "${output_file}"
for gene_id in "${gene_id_1}" "${gene_id_2}"; do
zcat ../data/Glycine_max.Glycine_max_v2.1.cds.all.fa.gz | awk -v gene_id="${gene_id}" '
BEGIN {
on_gene=0
}
/^[^>]/ && on_gene == 1 {
print $0
}
/^>/ {
gene = $4
gsub("gene:", "", gene)
if (gene == gene_id) {
on_gene=1
print $0
} else {
on_gene=0
}
}
' >> "${output_file}"
done
<<pair-gene-id>>
<<pair-gene-cds-extract>>
Proteins:
output_file="./tmp/${gene_id_1}_${gene_id_2}.prot.fst"
echo "" > "${output_file}"
for gene_id in "${gene_id_1}" "${gene_id_2}"; do
cat ./tmp/proteome_filtered.fa | awk -v gene_id="${gene_id}" '
BEGIN {
on_gene=0
}
/^[^>]/ && on_gene == 1 {
print $0
}
/^>/ {
gene = $4
gsub("gene:", "", gene)
if (gene == gene_id) {
on_gene=1
print $0
} else {
on_gene=0
}
}
' >> "${output_file}"
done
<<pair-gene-id>>
<<pair-gene-protein-extract>>
Clustalw2 alignment
<<pair-gene-id>>
protein_sequence="./tmp/${gene_id_1}_${gene_id_2}.prot.fst"
clustalw2 -quiet -align -infile="${protein_sequence}" -outfile="./tmp/${gene_id_1}_${gene_id_2}.prot.ali.aln"
Convert protein alignment to coding sequence alignment
<<pair-gene-id>>
protein_alignment="./tmp/${gene_id_1}_${gene_id_2}.prot.ali.aln"
coding_sequence="./tmp/${gene_id_1}_${gene_id_2}.cds.fst"
pal2nal "${protein_alignment}" "${coding_sequence}" -output paml > "./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy"
YN00 Ka-Ks computation
<<pair-gene-id>>
phylip_file="./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy"
echo "seqfile = ${phylip_file}\noutfile = ${phylip_file}.yn\nverbose = 0\nicode = 0\nweighting = 0\ncommonf3x4 = 0" > yn00.ctl
yn00
Extraction of Ka-Ks values
<<pair-gene-id>>
yn_file="./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy.yn"
awk '
BEGIN {
on_good_section=0
skip=0
}
$1 == "(B)" {
skip=8
on_good_section=1
}
on_good_section == 1 {
if (skip == 0) {
Ka=$8
Ks=$11
print Ka, Ks
exit
} else {
skip -= 1
}
}
' "${yn_file}"
0.6207 1.9409
Run the Nextflow workflow
<2024-12-23 Mon>