comparative-genomics-project/notebook.org

23 KiB

Comparative Genomics Project

General infos

<2024-10-25 Fri>

Objectives

Part 1

  1. Determine the amount of duplicate genes in my genome.

    • What is the proportion of duplicate genes in my genome.
    • Give an interval based on two datasets (low stringency dataset and high stringency dataset).

      Low High
      Coverage 30% 40%
      %identity 30% 50%
      Bitscore no filtering no filtering
      Clustering mcl (with default parameters) mcl (with default parameters)
      min_coverage=30
      min_identity=30
      min_coverage=40
      min_identity=50
  2. Determine the amount of Tandemly Arrayed Genes (TAGs) in the genome (definition 0 and 1).
  3. Determine the amount in plant's chloroplasts and mitochondria.
  4. Is there a lot of duplicated operons in plants?

Part 2

  1. Compute \(K_s\) values and use it as a proxy of the age of the duplication.
  2. Are TAGs younger than other duplicate genes?

Our species

Glycine max is the soy plant.

Develop the Nextflow workflow

Filtering the proteome to keep the longest isoform

I used three awk families_and_TAGs/scripts to filter the FASTA file to keep only the longest isoforms:

  1. protein_lengths.awk associate each FASTA record with the length of its sequence,
  2. filter_longest.awk keep the isoform ID that has the maximum length,
  3. filter_records.awk based on the list of isoform ID, retrieve the sequence of from the original file.

protein_lengths.awk

filter_longest.awk

filter_records.awk

How many protein coding gene are there?

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
    | grep -v 'supercontig' \
    | awk ' /^>/ { print $4 }' | sort | uniq | wc -l
55589
55897

Filtering out Chloroplasts and Mitochondria proteins

From the Ensembl Plants index page we can see that there is no chloroplast nor mitochondria proteins in the sequenced genome of Glycine max.

Filter out supercontigs

<2024-10-28 Mon>

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
    | awk -f ./workflow/families_and_TAGs/scripts/remove_supercontigs.awk \
    > ./tmp/Glycine_max.Glycine_max_v2.1.pep.all.nosupercontig.fa

Sanity check: Has the script removed all supercontigs proteins and only them?

How many records in supercontigs were there?

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz  \
    | awk -F' ' ' /^>/ { print $3}'  | grep -c "^supercontig"
435

How many records have been removed?

zgrep -c "^>" ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz
88412
grep -c "^>" ./tmp/Glycine_max.Glycine_max_v2.1.pep.all.nosupercontig.fa
87977
expr $ALL_ISOFORMS - $NO_SUPERCONTIGS
435

BLASTp All Against All

makeblastdb -in "$proteome" -dbtype prot -out 'tmp/blastdb' -title "Glycine max"
blastp -query "$proteome" -db 'tmp/blastdb' -outfmt 6 -out "tmp/Glycine_max.blastp.tsv"

Clustering

Using mcl.

  1. Extract the homology in ABC format

    awk 'BEGIN { OFS="\t" } { print \$14, \$16, \$12 }' "${blastp}" > 'graph.abc'
  2. Run mcl with default parameters

    mcl 'graph.abc' --abc -o 'custering.mcl'
  3. Convert mcl output to TSV with two columns gene id, family id.

    awk -f workflow/families_and_TAGs/scripts/mcl_to_tsv.awk 'clustering.mcl' > 'families.tsv'

Running the Nextflow workflow

Update the parameters in nextflow.config and run

nextflow run main.nf -profile conda -resume

from the ./workflow/ folder

From the given BLASTp format 6 file

<2024-11-02 Sat> How many BLASTp HSP have been reported?

cat ../data/Glycine_max_Blastp_longIsoforme | wc -l
4456730

Remove HSP on proteins found on supercontigs

Extract the isoform ID whose gene is on supercontigs.

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz| awk -F' ' '/^>/ && $3 ~ /^supercontig/ { sequence_id = $1; gsub(">", "", sequence_id); print sequence_id }' > ./tmp/protein_on_supercontigs.list
cat ./tmp/protein_on_supercontigs.list | wc -l
435

Remove HSP with subject or query being on a supercontig

filter_blastp_sequence_id.awk

awk -f workflow/families_and_TAGs/scripts/filter_blastp_sequence_id.awk ./tmp/protein_on_supercontigs.list ../data/Glycine_max_Blastp_longIsoforme > ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig

How many HSP have been removed?

length() {
    cat $1 | wc -l
 }
expr $(length ../data/Glycine_max_Blastp_longIsoforme) - $(length ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig)
36598
cat ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig | wc -l
4420132

Filter by coverage and identity percentage

Add protein length and gene name columns

Extract the protein length and gene id from the FASTA file
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz | awk -f ./workflow/families_and_TAGs/scripts/protein_lengths.awk > ./tmp/Glycine_max_protein_length_gene.tsv
Join the BLASTp HSP file with the gene id and gene lengths
protein_length=./tmp/Glycine_max_protein_length_gene.tsv
blastp=./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig
sort -k 1 "${blastp}" > tmp/blastp_s
sort -k 1 "${protein_length}" > tmp/protein_length_s
join -1 1 -2 1 -t $'\t' tmp/blastp_s tmp/protein_length_s > tmp/join1.tsv
sort -k 2 tmp/join1.tsv > tmp/join1.tsv_s
join -1 2 -2 1 -t $'\t' tmp/join1.tsv_s tmp/protein_length_s > tmp/joined.blastp.tsv
Filter the BLASTp according to coverage and identity percentage
blastp=./tmp/joined.blastp.tsv
awk -f "workflow/families_and_TAGs/scripts/filter_blastp.awk" \
        -v min_coverage="${min_coverage}" \
        -v min_identity="${min_identity}" \
        "${blastp}" > "tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.tsv"
First dataset
<<low-stringency>>
<<filter-blastp>>
<<high-stringency>>
<<filter-blastp>>
cat ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.tsv | wc -l
1520850
cat ./tmp/Glycine_max_Blastp_filtered_coverage40_identity50.tsv | wc -l
342180

Clustering

<2024-11-03 Sun>

Extract the homology graph

blastp="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.tsv"
abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.abc"
awk 'BEGIN { FS = OFS = "\t" } { print $14, $16, $12 }' "$blastp" > "$abc"
<<low-stringency>>
<<extract-abc>>
<<high-stringency>>
<<extract-abc>>
head -n 1 "tmp/Glycine_max_Blastp_filtered_coverage30_identity30.abc"
GLYMA_10G098500 GLYMA_U008600 154

Filter the homology graph to keep only the heaviest weight

original_abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.abc"
filtered_abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}_f.abc"
awk -f "workflow/families_and_TAGs/scripts/keep_heaviest_edge_abc.awk" "$original_abc" > "$filtered_abc"
<<low-stringency>>
<<filter-abc>>
<<high-stringency>>
<<filter-abc>>

Run Markov Clustering

abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}_f.abc"
mcl="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl"
mcl "${abc}" --abc --discard-loops=yes -o "$mcl"
<<low-stringency>>
<<mcl>>
<<high-stringency>>
<<mcl>>

Transform MCL output into a gene to family mapping TSV file

mcl="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl"
tsv="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl.tsv"
awk -f "workflow/families_and_TAGs/scripts/mcl_to_tsv.awk" "${mcl}" > "${tsv}"
<<low-stringency>>
<<mcl-to-tsv>>
<<high-stringency>>
<<mcl-to-tsv>>
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
cat "${tsv}" | wc -l
46769
tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
cat "${tsv}" | wc -l
50254
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
head "${tsv}"
GLYMA_12G034500 1
GLYMA_12G138300 1
GLYMA_12G124600 1
GLYMA_12G126700 1
GLYMA_12G125200 1
GLYMA_12G034900 1
GLYMA_U036400 1
GLYMA_12G124200 1
GLYMA_12G126300 1
GLYMA_12G035100 1
tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
head "${tsv}"
GLYMA_07G053000 1
GLYMA_10G247700 1
GLYMA_12G188000 1
GLYMA_07G159200 1
GLYMA_13G313300 1
GLYMA_15G016600 1
GLYMA_14G159200 1
GLYMA_18G287500 1
GLYMA_20G132900 1
GLYMA_16G204800 1

What is the interval of duplicate gene count? There are between 48109 and 49391 duplicate genes in Glycine max.

What is the number of families?

tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
cut -f2 "${tsv}" | sort | uniq | wc -l
8426
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
cut -f2 "${tsv}" | sort | uniq | wc -l
11997

What is the number of singletons?

expr $ALL - $DUPLICATED
5643
expr $ALL - $DUPLICATED
9128

Amount of Tandemly Arrayed Genes

Extract gene position from a GFF3 file

zcat "../data/Glycine_max.Glycine_max_v2.1.60.chr.gff3.gz" \
    | awk -f "./workflow/families_and_TAGs/scripts/gff3_to_gene_positions_table.awk" \
    > "./tmp/Glycine_max.gene_positions.tsv"
head "./tmp/Glycine_max.gene_positions.tsv"
GLYMA_01G000100 1 27355 28320
GLYMA_01G000200 1 58975 67527
GLYMA_01G000300 1 67770 69968
GLYMA_01G000400 1 90152 95947
GLYMA_01G000500 1 90289 91197
GLYMA_01G000600 1 116094 127845
GLYMA_01G000700 1 143467 155573
GLYMA_01G000800 1 157030 157772
GLYMA_01G000900 1 170534 193342
GLYMA_01G001000 1 196256 201895
cat "./tmp/Glycine_max.gene_positions.tsv" \
    | wc -l
55589

There is 55897 protein coding genes in the sequence file, but only 55589 gene positions.

expr $GENES - $POSITIONS
308

Extract TAGs

positions="./tmp/Glycine_max.gene_positions.tsv"
families="./tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl.tsv"
./rust/tagfinder/target/release/tagfinder \
    --families "$families" \
    --positions "$positions" \
    --definitions 0,1 \
    > "./tmp/Glycine_max_TAGs_coverage${min_coverage}_identity${min_identity}.tsv"
<<low-stringency>>
<<extract-TAGs>>
<<high-stringency>>
<<extract-TAGs>>
head "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
gene family tag0 tag1
GLYMA_07G257400 3109 - -
GLYMA_17G203900 2067 - -
GLYMA_20G062000 509 - -
GLYMA_16G099400 40 - -
GLYMA_14G045400 1136 804 896
GLYMA_11G189200 15 - -
GLYMA_02G191500 1310 - -
GLYMA_02G308200 520 - -
GLYMA_14G205200 8 - -

How many TAGs for definition 0?

awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($3 > max) {
            max=$3
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
3208

For TAG1?

awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($4 > max) {
            max=$4
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
3481
awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($3 > max) {
            max=$3
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
2500
awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($4 > max) {
            max=$4
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
2652

Size of the greatest TAG

High stringency

TAG₀

awk '
    {
    tag_index=$3
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
421 53

TAG₁

awk '
    {
    tag_index=$4
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
459 76
Low stringency

TAG₀

awk '
    {
    tag_index=$3
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
579 76

TAG₁

awk '
    {
    tag_index=$4
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
646 76

DONE Complete slides

<2024-11-23 Sat>

Family sizes

head -1 ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl \
    | awk '{ print NF }'
437
head -1 ./tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl \
    | awk '{ print NF }'
159

TODO $K_s$ For all duplicated pairs within a family

Let $A$ be a family of $n$ duplicate members. The number of duplicate pairs is $n(n-1) / 2$.

PAML package:

conda install paml pal2nal clustalw -c bioconda

Use a Nextflow workflow

<2024-12-22 Sun>

All pair of duplicate genes

Generated by a rust program in rust/pairs folder. The algorithm is simple. We assume the gene-families mapping file is sorted by family index. For each family, loop for i from 0 to the size of the family minus one, loop for j from i plus one to the size of the family and output the pair gene i - gene j.

Example:

cat ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv | ./rust/pairs/target/release/pairs > ./tmp/Glycine_max_duplicate_gene_pairs.tsv
head ./tmp/Glycine_max_duplicate_gene_pairs.tsv
GLYMA_07G053000 GLYMA_10G247700
GLYMA_07G053000 GLYMA_12G188000
GLYMA_07G053000 GLYMA_07G159200
GLYMA_07G053000 GLYMA_13G313300
GLYMA_07G053000 GLYMA_15G016600
GLYMA_07G053000 GLYMA_14G159200
GLYMA_07G053000 GLYMA_18G287500
GLYMA_07G053000 GLYMA_20G132900
GLYMA_07G053000 GLYMA_16G204800
GLYMA_07G053000 GLYMA_08G302300

PAML Steps

Example for the first gene pair.

Extract the CDS sequence and protein sequence of the genes

gene_id_1="GLYMA_07G053000"
gene_id_2="GLYMA_10G247700"

CDS:

output_file="./tmp/${gene_id_1}_${gene_id_2}.cds.fst"
echo "" > "${output_file}"
for gene_id in "${gene_id_1}" "${gene_id_2}"; do
        zcat ../data/Glycine_max.Glycine_max_v2.1.cds.all.fa.gz | awk -v gene_id="${gene_id}" '
        BEGIN {
            on_gene=0
        }
        /^[^>]/ && on_gene == 1 {
            print $0
        }
        /^>/ {
            gene = $4
            gsub("gene:", "", gene)
            if (gene == gene_id) {
                on_gene=1
                print $0
            } else {
                on_gene=0
            }
        }
        ' >> "${output_file}"
done
<<pair-gene-id>>
<<pair-gene-cds-extract>>

Proteins:

output_file="./tmp/${gene_id_1}_${gene_id_2}.prot.fst"
echo "" > "${output_file}"
for gene_id in "${gene_id_1}" "${gene_id_2}"; do
        cat ./tmp/proteome_filtered.fa | awk -v gene_id="${gene_id}" '
        BEGIN {
            on_gene=0
        }
        /^[^>]/ && on_gene == 1 {
            print $0
        }
        /^>/ {
            gene = $4
            gsub("gene:", "", gene)
            if (gene == gene_id) {
                on_gene=1
                print $0
            } else {
                on_gene=0
            }
        }
        ' >> "${output_file}"
done
<<pair-gene-id>>
<<pair-gene-protein-extract>>

Clustalw2 alignment

<<pair-gene-id>>
protein_sequence="./tmp/${gene_id_1}_${gene_id_2}.prot.fst"
clustalw2 -quiet -align -infile="${protein_sequence}" -outfile="./tmp/${gene_id_1}_${gene_id_2}.prot.ali.aln"

Convert protein alignment to coding sequence alignment

<<pair-gene-id>>
protein_alignment="./tmp/${gene_id_1}_${gene_id_2}.prot.ali.aln"
coding_sequence="./tmp/${gene_id_1}_${gene_id_2}.cds.fst"
pal2nal "${protein_alignment}" "${coding_sequence}" -output paml > "./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy"

YN00 Ka-Ks computation

<<pair-gene-id>>
phylip_file="./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy"
echo "seqfile = ${phylip_file}\noutfile = ${phylip_file}.yn\nverbose = 0\nicode = 0\nweighting = 0\ncommonf3x4 = 0" > yn00.ctl
yn00

Extraction of Ka-Ks values

<<pair-gene-id>>
yn_file="./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy.yn"
awk '
    BEGIN {
          on_good_section=0
          skip=0
    }
    $1 == "(B)" {
         skip=8
         on_good_section=1
    }

    on_good_section == 1 {
        if (skip == 0) {
            Ka=$8
            Ks=$11
            print Ka, Ks
            exit
        } else {
            skip -= 1
        }
    }
    ' "${yn_file}"
0.6207 1.9409

Run the Nextflow workflow

<2024-12-23 Mon>

TODO Does Glycine max have big TAGs and which function are the big TAG gene implied?

TODO Are TAG pairs different in age from non-TAG pairs?

TODO Are genes inside a TAG orientated in the same way more often than random?