comparative-genomics-project/notebook.org

#+title: Comparative Genomics Project
#+subtitle: Duplicate Genes in /Glycine max/
#+date: 2024-2025
#+author: Samuel Ortion
#+LATEX_CLASS: scrartcl
#+LATEX_HEADER: \titlehead{M2 GENIOMHE}

* General infos

<2024-10-25 Fri>

** Objectives
*** Part 1
1. Determine the amount of duplicate genes in my genome.
   - What is the proportion of duplicate genes in my genome.
   - Give an interval based on two datasets (low stringency dataset and high stringency dataset).
     #+caption: High stringency dataset and low stringency dataset duplicate genes filtering criteria.
     #+name: dataset-criteria
     |            | Low                           | High                          |
     |------------+-------------------------------+-------------------------------|
     | Coverage   | 30%                           | 40%                           |
     | %identity  | 30%                           | 50%                           |
     | Bitscore   | no filtering                  | no filtering                  |
     | Clustering | mcl (with default parameters) | mcl (with default parameters) |

     #+name: low-stringency
     #+begin_src bash
min_coverage=30
min_identity=30
     #+end_src

     #+name: high-stringency
     #+begin_src bash
min_coverage=40
min_identity=50
     #+end_src
2. Determine the amount of Tandemly Arrayed Genes (TAGs) in the genome (definition 0 and 1).
3. Determine the amount in plant's chloroplasts and mitochondria.
4. Is there a lot of duplicated operons in plants?

*** Part 2
1. Compute \(K_s\) values and use it as a proxy of the age of the duplication.
2. Are TAGs younger than other duplicate genes?


** Our species
/Glycine max/ is the soy plant.

* Download the source data

https://plants.ensembl.org/Glycine_max/Info/Index

https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/glycine_max/pep/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz


* Develop the Nextflow workflow

** Filtering the proteome to keep the longest isoform

I used three awk families_and_TAGs/scripts to filter the FASTA file to keep only the longest isoforms:
 1. =protein_lengths.awk= associate each FASTA record with the length of its sequence,
 2. =filter_longest.awk= keep the isoform ID that has the maximum length,
 3. =filter_records.awk= based on the list of isoform ID, retrieve the sequence of from the original file.
=protein_lengths.awk=

#+include: workflow/families_and_TAGs/scripts/protein_lengths.awk src awk
=filter_longest.awk=

#+include: workflow/families_and_TAGs/scripts/filter_longest.awk src awk
=filter_records.awk=

#+include: workflow/families_and_TAGs/scripts/filter_records_fasta.awk src awk

How many protein coding gene are there?
#+name: protein-coding-gene-count
#+begin_src bash :export both
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
    | grep -v 'supercontig' \
    | awk ' /^>/ { print $4 }' | sort | uniq | wc -l
#+end_src

#+RESULTS: protein-coding-gene-count
: 55589

#+RESULTS:
: 55897

** Filtering out Chloroplasts and Mitochondria proteins

From the Ensembl Plants index page we can see that there is no chloroplast nor mitochondria proteins in the sequenced genome of /Glycine max/.

** Filter out supercontigs

<2024-10-28 Mon>

#+include: workflow/families_and_TAGs/scripts/remove_supercontigs.awk src awk

#+begin_src bash
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
    | awk -f ./workflow/families_and_TAGs/scripts/remove_supercontigs.awk \
    > ./tmp/Glycine_max.Glycine_max_v2.1.pep.all.nosupercontig.fa
#+end_src

#+RESULTS:
/Sanity check:/ Has the script removed all supercontigs proteins and only them?
/How many records in supercontigs were there?/
#+begin_src bash :export both
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz  \
    | awk -F' ' ' /^>/ { print $3}'  | grep -c "^supercontig"
#+end_src

#+RESULTS:
: 435
/How many records have been removed?/

#+name: all-isoforms
#+begin_src bash
zgrep -c "^>" ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz
#+end_src

#+RESULTS: all-isoforms
: 88412

#+name: all-isoforms-no-supercontig
#+begin_src bash
grep -c "^>" ./tmp/Glycine_max.Glycine_max_v2.1.pep.all.nosupercontig.fa
#+end_src

#+RESULTS: all-isoforms-no-supercontig
: 87977

#+begin_src bash :var ALL_ISOFORMS=all-isoforms :var NO_SUPERCONTIGS=all-isoforms-no-supercontig :exports both
expr $ALL_ISOFORMS - $NO_SUPERCONTIGS
#+end_src

#+RESULTS:
: 435

** BLASTp All Against All

#+begin_src bash
makeblastdb -in "$proteome" -dbtype prot -out 'tmp/blastdb' -title "Glycine max"
#+end_src

#+begin_src bash
blastp -query "$proteome" -db 'tmp/blastdb' -outfmt 6 -out "tmp/Glycine_max.blastp.tsv"
#+end_src

** Clustering

Using =mcl=.

1. Extract the homology in ABC format
   #+begin_src bash
awk 'BEGIN { OFS="\t" } { print \$14, \$16, \$12 }' "${blastp}" > 'graph.abc'
   #+end_src
2. Run =mcl= with default parameters
   #+begin_src bash
mcl 'graph.abc' --abc -o 'custering.mcl'
   #+end_src
3. Convert =mcl= output to TSV with two columns =gene id=, =family id=.
   #+include: workflow/families_and_TAGs/scripts/mcl_to_tsv.awk src awk

   #+begin_src bash
awk -f workflow/families_and_TAGs/scripts/mcl_to_tsv.awk 'clustering.mcl' > 'families.tsv'
   #+end_src


** Running the Nextflow workflow

Update the parameters in =nextflow.config= and run
#+begin_src bash
nextflow run main.nf -profile conda -resume
#+end_src

from the =./workflow/= folder


* From the given BLASTp format 6 file
<2024-11-02 Sat>
/How many BLASTp HSP have been reported?/
#+begin_src bash
cat ../data/Glycine_max_Blastp_longIsoforme | wc -l
#+end_src

#+RESULTS:
: 4456730

** Remove HSP on proteins found on supercontigs

Extract the isoform ID whose gene is on supercontigs.
#+begin_src bash
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz| awk -F' ' '/^>/ && $3 ~ /^supercontig/ { sequence_id = $1; gsub(">", "", sequence_id); print sequence_id }' > ./tmp/protein_on_supercontigs.list
#+end_src

#+RESULTS:

#+begin_src bash
cat ./tmp/protein_on_supercontigs.list | wc -l
#+end_src

#+RESULTS:
: 435

Remove HSP with subject or query being on a supercontig
=filter_blastp_sequence_id.awk=

#+include: workflow/families_and_TAGs/scripts/filter_blastp_sequence_id.awk src awk

#+begin_src bash
awk -f workflow/families_and_TAGs/scripts/filter_blastp_sequence_id.awk ./tmp/protein_on_supercontigs.list ../data/Glycine_max_Blastp_longIsoforme > ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig
#+end_src

#+RESULTS:

How many HSP have been removed?

#+begin_src bash
length() {
    cat $1 | wc -l
 }
expr $(length ../data/Glycine_max_Blastp_longIsoforme) - $(length ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig)
#+end_src

#+RESULTS:
: 36598

#+begin_src bash
cat ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig | wc -l
#+end_src

#+RESULTS:
: 4420132


** Filter by coverage and identity percentage

*** Add protein length and gene name columns

**** Extract the protein length and gene id from the FASTA file


#+begin_src bash
zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz | awk -f ./workflow/families_and_TAGs/scripts/protein_lengths.awk > ./tmp/Glycine_max_protein_length_gene.tsv
#+end_src

#+RESULTS:


**** Join the BLASTp HSP file with the gene id and gene lengths

#+begin_src bash
protein_length=./tmp/Glycine_max_protein_length_gene.tsv
blastp=./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig
sort -k 1 "${blastp}" > tmp/blastp_s
sort -k 1 "${protein_length}" > tmp/protein_length_s
join -1 1 -2 1 -t $'\t' tmp/blastp_s tmp/protein_length_s > tmp/join1.tsv
sort -k 2 tmp/join1.tsv > tmp/join1.tsv_s
join -1 2 -2 1 -t $'\t' tmp/join1.tsv_s tmp/protein_length_s > tmp/joined.blastp.tsv
#+end_src

#+RESULTS:

**** Filter the BLASTp according to coverage and identity percentage

#+name: filter-blastp
#+begin_src bash
blastp=./tmp/joined.blastp.tsv
awk -f "workflow/families_and_TAGs/scripts/filter_blastp.awk" \
        -v min_coverage="${min_coverage}" \
        -v min_identity="${min_identity}" \
        "${blastp}" > "tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.tsv"
#+end_src

#+RESULTS: filter-blastp

***** First dataset

#+begin_src bash :noweb strip-export
<<low-stringency>>
<<filter-blastp>>
#+end_src

#+RESULTS:

#+begin_src bash :noweb strip-export
<<high-stringency>>
<<filter-blastp>>
#+end_src

#+RESULTS:

#+begin_src bash
cat ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.tsv | wc -l
#+end_src

#+RESULTS:
: 1520850

#+begin_src bash
cat ./tmp/Glycine_max_Blastp_filtered_coverage40_identity50.tsv | wc -l
#+end_src

#+RESULTS:
: 342180

** Clustering
<2024-11-03 Sun>
*** Extract the homology graph

#+name: extract-abc
#+begin_src bash
blastp="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.tsv"
abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.abc"
awk 'BEGIN { FS = OFS = "\t" } { print $14, $16, $12 }' "$blastp" > "$abc"
#+end_src


#+begin_src bash :noweb strip-export
<<low-stringency>>
<<extract-abc>>
#+end_src

#+RESULTS:

#+begin_src bash :noweb strip-export
<<high-stringency>>
<<extract-abc>>
#+end_src

#+RESULTS:

#+begin_src bash
head -n 1 "tmp/Glycine_max_Blastp_filtered_coverage30_identity30.abc"
#+end_src

#+RESULTS:
| GLYMA_10G098500 | GLYMA_U008600 | 154 |

*** Filter the homology graph to keep only the heaviest weight


#+include: workflow/families_and_TAGs/scripts/keep_heaviest_edge_abc.awk src awk

#+name: filter-abc
#+begin_src bash
original_abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.abc"
filtered_abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}_f.abc"
awk -f "workflow/families_and_TAGs/scripts/keep_heaviest_edge_abc.awk" "$original_abc" > "$filtered_abc"
#+end_src

#+begin_src bash :noweb strip-export
<<low-stringency>>
<<filter-abc>>
#+end_src

#+RESULTS:

#+begin_src bash :noweb strip-export
<<high-stringency>>
<<filter-abc>>
#+end_src

#+RESULTS:

*** Run Markov Clustering

#+name: mcl
#+begin_src bash
abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}_f.abc"
mcl="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl"
mcl "${abc}" --abc --discard-loops=yes -o "$mcl"
#+end_src

#+RESULTS: mcl

#+begin_src bash :noweb strip-export
<<low-stringency>>
<<mcl>>
#+end_src

#+RESULTS:

#+begin_src bash :noweb strip-export
<<high-stringency>>
<<mcl>>
#+end_src

#+RESULTS:

*** Transform MCL output into a gene to family mapping TSV file

#+name: mcl-to-tsv
#+begin_src bash
mcl="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl"
tsv="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl.tsv"
awk -f "workflow/families_and_TAGs/scripts/mcl_to_tsv.awk" "${mcl}" > "${tsv}"
#+end_src

#+begin_src bash :noweb strip-export
<<low-stringency>>
<<mcl-to-tsv>>
#+end_src

#+RESULTS:

#+begin_src bash :noweb strip-export
<<high-stringency>>
<<mcl-to-tsv>>
#+end_src

#+RESULTS:

#+name: duplicate-genes-count-high-stringency
#+begin_src bash
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
cat "${tsv}" | wc -l
#+end_src

#+RESULTS: duplicate-genes-count-high-stringency
: 46769

#+name: duplicate-genes-count-low-stringency
#+begin_src bash
tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
cat "${tsv}" | wc -l
#+end_src

#+RESULTS: duplicate-genes-count-low-stringency
: 50254

#+begin_src bash
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
head "${tsv}"
#+end_src

#+RESULTS:
| GLYMA_12G034500 | 1 |
| GLYMA_12G138300 | 1 |
| GLYMA_12G124600 | 1 |
| GLYMA_12G126700 | 1 |
| GLYMA_12G125200 | 1 |
| GLYMA_12G034900 | 1 |
| GLYMA_U036400   | 1 |
| GLYMA_12G124200 | 1 |
| GLYMA_12G126300 | 1 |
| GLYMA_12G035100 | 1 |

#+begin_src bash
tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
head "${tsv}"
#+end_src

#+RESULTS:
| GLYMA_07G053000 | 1 |
| GLYMA_10G247700 | 1 |
| GLYMA_12G188000 | 1 |
| GLYMA_07G159200 | 1 |
| GLYMA_13G313300 | 1 |
| GLYMA_15G016600 | 1 |
| GLYMA_14G159200 | 1 |
| GLYMA_18G287500 | 1 |
| GLYMA_20G132900 | 1 |
| GLYMA_16G204800 | 1 |

What is the interval of duplicate gene count? There are between 48109 and 49391 duplicate genes in /Glycine max/.

What is the number of families?

#+begin_src bash
tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
cut -f2 "${tsv}" | sort | uniq | wc -l
#+end_src

#+RESULTS:
: 8426

#+begin_src bash
tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
cut -f2 "${tsv}" | sort | uniq | wc -l
#+end_src

#+RESULTS:
: 11997

What is the number of singletons?

#+name: singletons-low-stringency
#+begin_src bash :var ALL=protein-coding-gene-count :var DUPLICATED=duplicate-genes-count-low-stringency
expr $ALL - $DUPLICATED
#+end_src

#+RESULTS: singletons-low-stringency
: 5643

#+name: singletons-high-stringency
#+begin_src bash :var ALL=protein-coding-gene-count :var DUPLICATED=duplicate-genes-count-high-stringency
expr $ALL - $DUPLICATED
#+end_src

#+RESULTS: singletons-high-stringency
: 9128

* Amount of Tandemly Arrayed Genes

*** Extract gene position from a GFF3 file

#+include: workflow/families_and_TAGs/scripts/gff3_to_gene_positions_table.awk src awk

#+begin_src bash
zcat "../data/Glycine_max.Glycine_max_v2.1.60.chr.gff3.gz" \
    | awk -f "./workflow/families_and_TAGs/scripts/gff3_to_gene_positions_table.awk" \
    > "./tmp/Glycine_max.gene_positions.tsv"
#+end_src

#+RESULTS:

#+begin_src bash
head "./tmp/Glycine_max.gene_positions.tsv"
#+end_src

#+RESULTS:
| GLYMA_01G000100 | 1 |  27355 |  28320 |
| GLYMA_01G000200 | 1 |  58975 |  67527 |
| GLYMA_01G000300 | 1 |  67770 |  69968 |
| GLYMA_01G000400 | 1 |  90152 |  95947 |
| GLYMA_01G000500 | 1 |  90289 |  91197 |
| GLYMA_01G000600 | 1 | 116094 | 127845 |
| GLYMA_01G000700 | 1 | 143467 | 155573 |
| GLYMA_01G000800 | 1 | 157030 | 157772 |
| GLYMA_01G000900 | 1 | 170534 | 193342 |
| GLYMA_01G001000 | 1 | 196256 | 201895 |

#+name: gene-positions-count
#+begin_src bash
cat "./tmp/Glycine_max.gene_positions.tsv" \
    | wc -l
#+end_src

#+RESULTS:
: 55589

There is 55897 protein coding genes in the sequence file, but only 55589 gene positions.

#+begin_src bash :var POSITIONS=gene-positions-count :var GENES=protein-coding-gene-count :exports both
expr $GENES - $POSITIONS
#+end_src

#+RESULTS:
: 308

*** Extract TAGs

#+name: extract-TAGs
#+begin_src bash
positions="./tmp/Glycine_max.gene_positions.tsv"
families="./tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl.tsv"
./rust/tagfinder/target/release/tagfinder \
    --families "$families" \
    --positions "$positions" \
    --definitions 0,1 \
    > "./tmp/Glycine_max_TAGs_coverage${min_coverage}_identity${min_identity}.tsv"
#+end_src

#+RESULTS: extract-TAGs

#+begin_src bash :noweb strip-export
<<low-stringency>>
<<extract-TAGs>>
#+end_src

#+RESULTS:

#+begin_src bash :noweb strip-export
<<high-stringency>>
<<extract-TAGs>>
#+end_src

#+RESULTS:

# LocalWords:  operons

#+begin_src bash
head "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
#+end_src

#+RESULTS:
| gene            | family | tag0 | tag1 |
| GLYMA_07G257400 |   3109 | -    | -    |
| GLYMA_17G203900 |   2067 | -    | -    |
| GLYMA_20G062000 |    509 | -    | -    |
| GLYMA_16G099400 |     40 | -    | -    |
| GLYMA_14G045400 |   1136 | 804  | 896  |
| GLYMA_11G189200 |     15 | -    | -    |
| GLYMA_02G191500 |   1310 | -    | -    |
| GLYMA_02G308200 |    520 | -    | -    |
| GLYMA_14G205200 |      8 | -    | -    |

How many TAGs for definition 0?

#+begin_src bash
awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($3 > max) {
            max=$3
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
#+end_src

#+RESULTS:
: 3208

For TAG1?

#+begin_src bash
awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($4 > max) {
            max=$4
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
#+end_src

#+RESULTS:
: 3481


#+begin_src bash
awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($3 > max) {
            max=$3
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
#+end_src

#+RESULTS:
: 2500


#+begin_src bash
awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($4 > max) {
            max=$4
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
#+end_src

#+RESULTS:
: 2652


*** Size of the greatest TAG

**** High stringency
/TAG₀/

#+begin_src bash
awk '
    {
    tag_index=$3
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
#+end_src

#+RESULTS:
: 421 53
/TAG₁/

#+begin_src bash
awk '
    {
    tag_index=$4
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"
#+end_src

#+RESULTS:
: 459 76

**** Low stringency
/TAG₀/

#+begin_src bash
awk '
    {
    tag_index=$3
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
#+end_src

#+RESULTS:
: 579 76
/TAG₁/

#+begin_src bash
awk '
    {
    tag_index=$4
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"
#+end_src

#+RESULTS:
: 646 76

* DONE Complete slides
<2024-11-23 Sat>

* Family sizes

#+begin_src bash
head -1 ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl \
    | awk '{ print NF }'
#+end_src

#+RESULTS:
: 437

#+begin_src bash
head -1 ./tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl \
    | awk '{ print NF }'
#+end_src

#+RESULTS:
: 159


* TODO $K_s$ For all duplicated pairs within a family

Let $A$ be a family of $n$ duplicate members. The number of duplicate pairs is $n(n-1) / 2$.

PAML package:
#+begin_src bash
conda install paml pal2nal clustalw -c bioconda
#+end_src

** Use a Nextflow workflow
<2024-12-22 Sun>

*** All pair of duplicate genes

Generated by a rust program in =rust/pairs= folder.
The algorithm is simple.
We assume the gene-families mapping file is sorted by family index.
For each family, loop for i from 0 to the size of the family minus one, loop for j from i plus one to the size of the family and output the pair gene i - gene j.


Example:

#+begin_src bash
cat ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv | ./rust/pairs/target/release/pairs > ./tmp/Glycine_max_duplicate_gene_pairs.tsv
head ./tmp/Glycine_max_duplicate_gene_pairs.tsv
#+end_src

#+RESULTS:
| GLYMA_07G053000   | GLYMA_10G247700 |
| GLYMA_07G053000 | GLYMA_12G188000 |
| GLYMA_07G053000 | GLYMA_07G159200 |
| GLYMA_07G053000 | GLYMA_13G313300 |
| GLYMA_07G053000 | GLYMA_15G016600 |
| GLYMA_07G053000 | GLYMA_14G159200 |
| GLYMA_07G053000 | GLYMA_18G287500 |
| GLYMA_07G053000 | GLYMA_20G132900 |
| GLYMA_07G053000 | GLYMA_16G204800 |
| GLYMA_07G053000 | GLYMA_08G302300 |

*** PAML Steps
Example for the first gene pair.
/Extract the CDS sequence and protein sequence of the genes/

#+name: pair-gene-id
#+begin_src bash
gene_id_1="GLYMA_07G053000"
gene_id_2="GLYMA_10G247700"
#+end_src

CDS:


#+name: pair-gene-cds-extract
#+begin_src bash :noweb strip-export
output_file="./tmp/${gene_id_1}_${gene_id_2}.cds.fst"
echo "" > "${output_file}"
for gene_id in "${gene_id_1}" "${gene_id_2}"; do
        zcat ../data/Glycine_max.Glycine_max_v2.1.cds.all.fa.gz | awk -v gene_id="${gene_id}" '
        BEGIN {
            on_gene=0
        }
        /^[^>]/ && on_gene == 1 {
            print $0
        }
        /^>/ {
            gene = $4
            gsub("gene:", "", gene)
            if (gene == gene_id) {
                on_gene=1
                print $0
            } else {
                on_gene=0
            }
        }
        ' >> "${output_file}"
done
#+end_src

#+RESULTS: pair-gene-cds-extract

#+begin_src bash :noweb strip-export
<<pair-gene-id>>
<<pair-gene-cds-extract>>
#+end_src

#+RESULTS:

Proteins:

#+name: pair-gene-protein-extract
#+begin_src bash :noweb strip-export
output_file="./tmp/${gene_id_1}_${gene_id_2}.prot.fst"
echo "" > "${output_file}"
for gene_id in "${gene_id_1}" "${gene_id_2}"; do
        cat ./tmp/proteome_filtered.fa | awk -v gene_id="${gene_id}" '
        BEGIN {
            on_gene=0
        }
        /^[^>]/ && on_gene == 1 {
            print $0
        }
        /^>/ {
            gene = $4
            gsub("gene:", "", gene)
            if (gene == gene_id) {
                on_gene=1
                print $0
            } else {
                on_gene=0
            }
        }
        ' >> "${output_file}"
done
#+end_src

#+RESULTS: pair-gene-protein-extract


#+begin_src bash :noweb strip-export
<<pair-gene-id>>
<<pair-gene-protein-extract>>
#+end_src

#+RESULTS:
/Clustalw2 alignment/

#+begin_src bash :noweb strip-export
<<pair-gene-id>>
protein_sequence="./tmp/${gene_id_1}_${gene_id_2}.prot.fst"
clustalw2 -quiet -align -infile="${protein_sequence}" -outfile="./tmp/${gene_id_1}_${gene_id_2}.prot.ali.aln"
#+end_src

#+RESULTS:
/Convert protein alignment to coding sequence alignment/
#+begin_src bash :noweb strip-export
<<pair-gene-id>>
protein_alignment="./tmp/${gene_id_1}_${gene_id_2}.prot.ali.aln"
coding_sequence="./tmp/${gene_id_1}_${gene_id_2}.cds.fst"
pal2nal "${protein_alignment}" "${coding_sequence}" -output paml > "./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy"
#+end_src
/YN00 Ka-Ks computation/

#+begin_src bash :noweb strip-export
<<pair-gene-id>>
phylip_file="./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy"
echo "seqfile = ${phylip_file}\noutfile = ${phylip_file}.yn\nverbose = 0\nicode = 0\nweighting = 0\ncommonf3x4 = 0" > yn00.ctl
yn00
#+end_src
/Extraction of Ka-Ks values/

#+begin_src bash :noweb strip-export
<<pair-gene-id>>
yn_file="./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy.yn"
awk '
    BEGIN {
          on_good_section=0
          skip=0
    }
    $1 == "(B)" {
         skip=8
         on_good_section=1
    }

    on_good_section == 1 {
        if (skip == 0) {
            Ka=$8
            Ks=$11
            print Ka, Ks
            exit
        } else {
            skip -= 1
        }
    }
    ' "${yn_file}"
#+end_src

#+RESULTS:
: 0.6207 1.9409

*** Run the Nextflow workflow
<2024-12-23 Mon>


* TODO Does /Glycine max/ have big TAGs and which function are the big TAG gene implied?


* TODO Are TAG pairs different in age from non-TAG pairs?

* TODO Are genes inside a TAG orientated in the same way more often than random?