#+title: Further development on FTAG Finder, a pipeline to identify Gene Families and Tandemly Arrayed Genes
#+author: Samuel Ortion
#+date: 2023-2024
#+LATEX_CLASS: lamme2024
#+LATEX_CLASS_OPTIONS: [twoside=true]
#+LATEX_HEADER: \usepackage{sty/lamme2024}

#+bibliography: references.bib
#+exclude_tags: noexport
#+options: H:7
#+options: toc:nil
# ref. https://write.as/dani/writing-a-phd-thesis-with-org-mode

#+name: acronyms
| key        | abbreviation | full form                                  |
|------------+--------------+--------------------------------------------|
| TAG        | TAG          | Tandemly Arrayed Genes                     |
| FTAGFinder | FTAG Finder  | Families and Tandemly Arrayed Genes Finder |
| WGD        | WGD          | Whole Genome Duplication                   |
| MCL        | MCL          | Markov Clustering                          |
| BLAST      | BLAST        | Basic Local Alignment Search Tool          |
| GO         | GO           | Gene Ontology                              |

#+name: glossary
| label                | name                 | description                                                                                                                   |
|----------------------+----------------------+-------------------------------------------------------------------------------------------------------------------------------|
| singleton            | singleton            | A gene with a single copy                                                                                                     |
| polyploidisation     | polyploidisation     | Mechanism leading to the acquisition of at least three versions of the same original genome in a species                      |
| pseudogene           | pseudogene           | A gene-like sequence that has lost its capacity to be transcribed                                                             |
| retroduplication     | retroduplication     | Duplication of a gene through reverse transcription of its RNA transcript                                                     |
| autopolyploidisation | autopolyploidisation | Polyploidisation within the same species                                                                                      |
| allopolyploidisation | allopolyploidisation | Polyploidisation with genetic material coming from a diverged species                                                         |
| polyspermy           | polyspermy           | Fertilization of an egg by more than one sperm                                                                                |
| segment_duplication  | segment duplication  | DNA sequences present in multiple locations within a genome that share a high level of sequence identity                      |
| subfunctionalization | subfunctionalization | Fate of a duplicate gene which gets a part of the original gene function, the function being shared among multiple duplicates |
| orthologues          | orthologues          | Homologous genes whose divergence started at a speciation event                                                               |
| neofunctionalization | neofunctionalization | Acquisition of a new function by the duplicate gene                                                                           |

#+begin_export latex
\makeatletter
\hypersetup{
	pdfkeywords={duplicate genes, tandemly arrayed genes, pipeline},
        pdfauthor={Samuel Ortion},
}
\makeatother
#+end_export

#+begin_center
*keywords*: duplicate genes, tandemly arrayed genes, pipeline
#+end_center

#+begin_export latex
{
\hypersetup{linkcolor=black}
\tableofcontents
\listoffigures
\listoftables
}
#+end_export

[[printglossaries:]]

#+begin_export latex
\flstart
#+end_export

* Scientific context
[[latex:lettrine][D]]uplicate genes represent an important fraction of eukaryotic genes: it is estimated that between 46% and 65.5% of human genes could be considered duplicates[fn:: The estimates vary strongly depending on the criteria in use, because ancient duplication events may be hard to detect.] [cite:@correaTransposableElementEnvironment2021].
Duplicate genes offer a pool of genetic material available for further experimentation during species evolution.

** Gene duplication mechanisms

#+begin_src emacs-lisp :exports results :results raw
(setq fig:gene-duplication-mechanisms "#+label: fig:gene-duplication-mechanisms
#+caption[Different types of duplication]: Different types of duplication. (A) Whole genome duplication. (B) An unequal crossing-over leads to the duplication of a fragment of a chromosome. (C) In tandem duplication, two (sets of) genes are duplicated one after the other. (D) A retrotransposon enables retroduplication: an RNA transcript is reverse transcribed and inserted back into the genome without introns and with a polyA tail. (E) A DNA transposon can acquire a fragment of a gene. (F) Segmental duplication corresponds to long stretches of duplicated sequences with high identity. Adapted from [cite:@lallemandOverviewDuplicatedGene2020] (fig. 1)
[[./figures/lallemand2020-fig1_copy.svg]]")

(if (eq org-export-current-backend 'html)
    fig:gene-duplication-mechanisms
  ""
  )
#+end_src

#+begin_export latex
\fladdfig{
	\includegraphics[width=.9\linewidth]{./figures/lallemand2020-fig1_copy.pdf}
	\caption[Different types of duplication]{\label{fig:gene-duplication-mechanisms}Different types of duplication. (A) Whole genome duplication. (B) An unequal crossing-over leads to the duplication of a fragment of a chromosome. (C) In tandem duplication, two (sets of) genes are duplicated one after the other. (D) A retrotransposon enables retroduplication: an RNA transcript is reverse transcribed and inserted back into the genome without introns and with a polyA tail. (E) A DNA transposon can acquire a fragment of a gene. (F) Segmental duplication corresponds to long stretches of duplicated sequences with high identity. (Adapted from \textcite{lallemandOverviewDuplicatedGene2020} (fig. 1)).}
}
#+end_export

Multiple mechanisms may lead to gene duplication. Their effects range from the duplication of the whole genome to the duplication of a fragment of a gene.

*** Whole genome duplication and polyploidisation
During an event of gls:WGD, the entire set of genes present on the chromosomes is duplicated ([[cref:fig:gene-duplication-mechanisms]] (A)).
gls:WGD can occur through gls:polyspermy or through the involvement of a non-reduced gamete.
Gls:polyploidisation is a mechanism leading to a species with at least three copies of an initial genome.
A striking example is /Triticum aestivum/ (wheat), which is hexaploid due to hybridisation events [cite:@golovninaMolecularPhylogenyGenus2007a].

We distinguish two kinds of glspl:polyploidisation, based on the origin of the duplicate genome: (i) Gls:allopolyploidisation occurs when the supplementary chromosomes come from a divergent species. This is the case for the /Triticum aestivum/ hybridisation, which consisted in the union of the chromosome set of a /Triticum/ species with that of an /Aegilops/ species. (ii) Gls:autopolyploidisation consists in the hybridisation or duplication of the whole genome within the same species.

*** Unequal crossing-over
Another source of gene duplication is unequal crossing-over. During cell division, a crossing-over occurs when two chromatids exchange fragments of chromosome. If the two chromatids are cleaved at different positions, the exchanged fragments have different lengths. Recombination following such an uneven crossing-over leads to the incorporation of a duplicate region in one of the chromatids, as depicted in cref:fig:gene-duplication-mechanisms (B, C).
This mechanism duplicates the whole set of genes present in the fragment. The duplicate genes are located one set after the other: we call them gls:TAG. Gls:TAG are the kind of duplicate genes we will be particularly interested in during this internship.

*** Retroduplication
Transposable elements play a major role in genome plasticity, and enable gene duplication too.
Retrotransposons, or RNA transposons, are one type of transposable element.
They share similar structure and replication mechanisms with retroviruses.
Retrotransposons replicate in the genome through a mechanism known as "copy-and-paste".
These transposons typically contain a reverse transcriptase gene. This enzyme reverse transcribes an RNA transcript into a complementary DNA sequence, which can then be inserted elsewhere in the genome.
More generally, gls:retroduplication refers to the duplication of a sequence through reverse transcription of an RNA transcript. Genes duplicated through retroduplication lose their intronic sequences and bring a polyA tail with them to their new locus (cref:fig:gene-duplication-mechanisms (D)).

*** Transduplication
DNA transposons are another kind of transposable element whose transposition mechanism can also lead to gene duplication.
This type of transposable element moves in the genome through a mechanism known as "cut-and-paste".
A typical DNA transposon contains a transposase gene. This enzyme recognizes two sites surrounding the donor transposon sequence in the chromosome, cleaves the DNA at these sites and excises the transposon. The transposase can then insert the transposon at a new locus in the genome. A transposon may carry a fragment of a gene along during its transposition to the new locus (cref:fig:gene-duplication-mechanisms (E)), leading to the duplication of this fragment.

*** Segment duplication
Finally, glspl:segment_duplication, also called /low copy repeats/, are long stretches of DNA with a high level of sequence identity ([[cref:fig:gene-duplication-mechanisms]] (F)). Their exact duplication mechanism remains unclear [cite:@lallemandOverviewDuplicatedGene2020]. They may come from an accidental replication, distinct from an uneven crossing-over or a double-strand break.
Transposable elements may well be involved in the mechanism, as a strong enrichment of transposable elements is found next to the extremities of duplicate segments in /Drosophila/ [cite:@lallemandOverviewDuplicatedGene2020].

#+begin_src emacs-lisp :exports results :results raw
(setq fig:duplicate-genes-fate "#+label: fig:fate-duplicate-genes
,#+caption[Fate of duplicate genes]: Fate of duplicate genes. An original gene with four functions is duplicated. Its two copies may both keep the original functions (functional redundancy). The original functions may split between the different copies (subfunctionalization). One of the copies may acquire a new function (neofunctionalization). It may also degenerate and lose its original functions (pseudogenization). Adapted from [[https://commons.wikimedia.org/wiki/File:Evolution_fate_duplicate_genes_-_vector.svg][Smedlib]], [[https://creativecommons.org/licenses/by-sa/4.0][CC BY-SA 4.0]] via Wikimedia Commons.
[[./figures/Evolution_fate_duplicate_genes.svg]]")

(if (eq org-export-current-backend 'html)
    fig:duplicate-genes-fate
  ""
  )
#+end_src

#+RESULTS:

#+begin_export latex
\fladdfig{
	\includegraphics[width=.9\linewidth]{figures/Evolution_fate_duplicate_genes.pdf}
	\caption[Fate of duplicate genes]{\label{fig:fate-duplicate-genes} Fate of duplicate genes. An original gene with four functions is duplicated. Its two copies may both keep the original functions (functional redundancy). The original functions may split between the different copies (subfunctionalization). One of the copies may acquire a new function (neofunctionalization). It may also degenerate and lose its original functions (pseudogenization). (Adapted from \href{https://commons.wikimedia.org/wiki/File:Evolution_fate_duplicate_genes_-_vector.svg}{Smedlib}, \href{https://creativecommons.org/licenses/by-sa/4.0}{CC BY-SA 4.0}, via Wikimedia Commons).}
}
#+end_export

** Fate of duplicate genes in genome evolution
In his book /Evolution by Gene Duplication/, Susumu [[latex:textsc][Ohno]] proposed that gene duplication plays a major role in species evolution [cite:@ohnoEvolutionGeneDuplication1970], because it provides new genetic material on which to build new phenotypes while keeping a backup gene for the previous function.
Indeed, duplicate genes evolve after duplication: they may be inactivated and become glspl:pseudogene; they may be deleted or conserved, and if conserved, they may or may not acquire a new function.
[[Cref:fig:fate-duplicate-genes]] depicts the different possible fates of a duplicate gene.

# *** Pseudogenization
As the genome evolves, duplicate genes may be inactivated and become pseudogenes. These pseudogenes keep a gene-like structure, which degrades as further genome modifications occur, but they are no longer expressed.

# *** Neofunctionalization
After duplication, the new gene copy may gain a new function. We call this possible outcome gls:neofunctionalization.
For instance, the current set of olfactory receptor genes results from several duplication and deletion events (for /Drosophila/, see: [cite/t:@nozawaEvolutionaryDynamicsOlfactory2007]), after which each duplicate olfactory gene specialized in the detection of a particular chemical compound.

# *** Subfunctionalization
Two duplicate genes with the same original function may undergo gls:subfunctionalization: each gene conserves only one part of the function.

# *** Functional redundancy
Another possibility is that the two gene copies keep the ancestral function, resulting in functional redundancy. In this case the quantity of gene product may increase.
* Objectives for the internship
** Scientific questions
The underlying question of FTAG Finder is the study of the evolutionary fate of duplicate genes in Eukaryotes.
Duplicate genes are
** Extend the existing FTAG Finder Galaxy pipeline
Galaxy is a web-based platform for running accessible data analysis pipelines, first designed for use in genomics data analysis [cite:@goecksGalaxyComprehensiveApproach2010].
Last year, Séanna [[latex:textsc][Charles]] worked on the Galaxy version of the FTAG Finder pipeline during her M1 internship [cite:@charlesFinalisationPipelineFTAG2023]. I will continue this work.
FTAG Finder is currently deployed on the server of the /Laboratoire de Mathématiques et Modélisation d'Évry/[fn:: [[http://stat.genopole.cnrs.fr/galaxy]] ].

** Port FTAG Finder pipeline on a workflow manager
Another objective of my internship will be to port FTAG Finder to a workflow manager better suited to larger and more reproducible analyses.

We will have to choose the tool we will use, the two main options being Snakemake and Nextflow. Snakemake is a Python-powered workflow manager based on rules /à la/ GNU Make [cite:@kosterSnakemakeScalableBioinformatics2012]. Nextflow is a Groovy-powered workflow manager, which relies on the dataflow paradigm [cite:@ditommasoNextflowEnablesReproducible2017]. Both are widely used in the bioinformatics community. Their use has been on the rise since they came out in 2012 and 2013 respectively [cite:@djaffardjyDevelopingReusingBioinformatics2023].


# #+begin_export latex
# \fladdtab{
#        \begin{tabular}{ccc}
#        \toprule
#        & List ref & List $L$ \\
#        \midrule
#        related to $go$ & $a$ & $b$ \\
#        unrelated & $c$ & $d$ \\
#        \bottomrule
#        \end{tabular}
#        \caption{\label{tab:fisher-test-contigency-table}Contingency table for a Fisher exact test on gene lists}
# }
# #+end_export
* Methodological approaches

** Duplicate gene detection method used in FTAG Finder
#+begin_export latex
\fladdfig{
	\includegraphics[width=.9\linewidth]{./figures/tag-definition.pdf}
	\caption[Schematic representation of TAG definitions]{\label{fig:tag-definitions} Schematic representation of TAG definitions. Several genes are represented on a linear chromosome. The red box represents a singleton gene. The orange boxes represent a TAG with two duplicate genes separated by 7 other genes ($\mathrm{TAG}_7$). The four green boxes constitute a TAG whose genes at the extremities are separated by three genes ($\mathrm{TAG}_3$). The two blue boxes represent a TAG with two genes next to each other ($\mathrm{TAG}_0$). The curved edges represent the homology links between each pair of genes within a TAG.}}
#+end_export

Different methods exist to detect duplicate genes. These methods depend on the type of duplicate genes they target and vary in computational burden as well as in ease of use (for a review, see [cite/t:@lallemandOverviewDuplicatedGene2020]).

*** Paralog detection
Paralogs are homologous genes derived from a duplication event. We can identify them as homologous genes coming from the same genome, or as homologous genes between different species once we have filtered out gls:orthologues (homologous genes derived from a speciation event).

We can use two gene characteristics to assess the homology between two genes: gene structure or sequence similarity.
The sequence similarity can be tested with a sequence alignment tool, such as =BLAST= [cite:@altschulBasicLocalAlignment1990], =PSI-BLAST=, =HMMER3= [cite:@johnsonHiddenMarkovModel2010] or =DIAMOND= [cite:@buchfinkSensitiveProteinAlignments2021]. These tools rely on heuristics, which means they may miss the optimal alignment, but they run much faster than exact algorithms, such as the classical Smith-Waterman algorithm [cite:@smithIdentificationCommonMolecular1981] or its optimized implementations =PARALIGN= [cite:@rognesParAlignParallelSequence2001] and =SWIMM=.
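
To make the cost of an exact method concrete, here is a minimal Python sketch of the Smith-Waterman recurrence that returns only the best local alignment score. It is an illustration under simplifying assumptions: the function name =smith_waterman_score= is hypothetical, and the flat match, mismatch and linear gap scores are arbitrary placeholders (real protein comparisons would typically use a substitution matrix and affine gap penalties). The double loop visits every pair of residues, which is why heuristic tools are preferred for all-against-all searches.

#+begin_src python :exports code :eval never-export
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (score only).

    Hypothetical helper: the scoring values are illustrative assumptions.
    """
    previous = [0] * (len(seq_b) + 1)  # DP row for the previous residue of seq_a
    best = 0
    for residue_a in seq_a:
        current = [0]  # first column is always 0 (local alignment)
        for j, residue_b in enumerate(seq_b, start=1):
            substitution = match if residue_a == residue_b else mismatch
            score = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution,  # align the two residues
                previous[j] + gap,               # gap in seq_b
                current[j - 1] + gap,            # gap in seq_a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("MKTAYIAKQR", "MKTAYIAKQR"))
print(smith_waterman_score("MKTAYIAKQR", "GGGWWWGGG"))
#+end_src

Keeping only two rows of the dynamic programming matrix is enough for the score; recovering the alignment itself would require storing the full matrix for the traceback.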

*** FTAG Finder
Developed in the LaMME laboratory, the FTAG Finder (Families and Tandemly Arrayed Genes Finder) pipeline is a simple pipeline targeting the detection of gls:TAG based on the proteome sequences of a single species [cite:@bouillonFTAGFinderOutil2016].

The pipeline proceeds in three steps. First, it estimates the homology links between each pair of genes. Then, it deduces the gene families. Finally, it searches for gls:TAG, relying on the position of genes belonging to the same family.
**** Estimation of homology links between genes
This step consists in establishing a homology relationship between each pair of genes in the proteome.
In this step, FTAG Finder uses =BLAST= (Basic Local Alignment Search Tool) [cite:@altschulBasicLocalAlignment1990] with an "all against all" search on the proteome.

Several =BLAST= metrics can be used as a homology measure, such as bitscore, identity percentage, E-value or a variation on these. The choice of metric can affect the results of graph clustering in the following step, and we should therefore choose it carefully [cite:@gibbonsEvaluationBLASTbasedEdgeweighting2015].
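
As an illustration of this step, the following Python sketch turns tabular =BLAST= output into weighted homology links. It assumes the default 12-column tabular format (=-outfmt 6=), uses the bitscore as edge weight and keeps the best score for each unordered pair of genes. The function name =homology_edges=, the E-value cutoff and the toy hits are hypothetical and are not taken from FTAG Finder.

#+begin_src python :exports code :eval never-export
def homology_edges(blast_lines, max_evalue=1e-5):
    """Return {(gene_a, gene_b): best_bitscore} from BLAST tabular lines.

    Assumed column order (default -outfmt 6): qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    The max_evalue cutoff is an arbitrary illustrative value.
    """
    edges = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query == subject or evalue > max_evalue:
            continue  # ignore self hits and weak hits
        pair = tuple(sorted((query, subject)))  # undirected edge
        if bitscore > edges.get(pair, 0.0):
            edges[pair] = bitscore
    return edges


# Made-up hits for illustration only.
toy_hits = [
    "g1\tg1\t100.0\t200\t0\t0\t1\t200\t1\t200\t1e-120\t410.0",
    "g1\tg2\t85.0\t190\t28\t1\t1\t190\t5\t194\t3e-90\t320.0",
    "g2\tg1\t85.0\t190\t28\t1\t5\t194\t1\t190\t4e-90\t318.0",
    "g1\tg3\t30.0\t50\t35\t2\t10\t60\t20\t70\t0.5\t25.0",
]
print(homology_edges(toy_hits))
#+end_src

Taking the maximum over the two directions of a hit is only one way to symmetrise the scores; averaging them or requiring reciprocal hits are equally plausible choices.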
**** Identification of gene families
Based on the homology links between each pair of genes, we construct an undirected weighted graph whose vertices correspond to genes and whose edges correspond to homology links between them.
We apply a graph clustering algorithm on the homology gene graph in order to infer the gene families corresponding to densely connected communities of vertices.
FTAG Finder offers three alternative graph clustering algorithms: single linkage, Markov Clustering [cite:@vandongenNewClusterAlgorithm1998] and Walktrap [cite:@ponsComputingCommunitiesLarge2005].
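
Single linkage is the simplest of the three options. Under the usual reading, two genes fall in the same family as soon as a chain of sufficiently strong homology links connects them, so the families are the connected components of the thresholded graph. The Python sketch below illustrates this with a union-find structure; the function name =single_linkage_families=, the =min_weight= threshold and the toy graph are hypothetical and do not reproduce the FTAG Finder implementation.

#+begin_src python :exports code :eval never-export
def single_linkage_families(edges, min_weight=0.0):
    """Group genes into families as connected components.

    edges: {(gene_a, gene_b): weight}. Hypothetical helper; edges lighter
    than min_weight are ignored, but their genes are still reported.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for (gene_a, gene_b), weight in edges.items():
        root_a, root_b = find(gene_a), find(gene_b)  # registers both genes
        if weight >= min_weight and root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    families = {}
    for gene in parent:
        families.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in families.values())


# Made-up weighted homology graph.
toy_edges = {
    ("g1", "g2"): 320.0,
    ("g2", "g3"): 150.0,
    ("g4", "g5"): 90.0,
    ("g5", "g6"): 20.0,
}
print(single_linkage_families(toy_edges, min_weight=50.0))
#+end_src

Because a single link is enough to merge two groups, this approach tends to chain unrelated families together through multi-domain proteins, which is one motivation for the flow-based alternatives.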

**** Detection of TAG
The final step of FTAG Finder consists in the identification of gls:TAG from the gene families and the positions of genes.
For a given chromosome, the tool seeks genes belonging to the same family and located close to each other. The tool allows a maximal number of genes between the homologous genes, with a parameter set by the user. Cref:fig:tag-definitions is a schematic representation of some possible gls:TAG positioning on a genome associated with their definition in this FTAG Finder step.
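
The following Python sketch shows one plausible reading of this step. It assumes that genes are given in positional order for each chromosome, and that two consecutive members of a family are chained into the same gls:TAG when at most =max_spacers= genes lie between them. The function name =find_tags=, its arguments and the toy data are hypothetical; the exact rule applied by FTAG Finder may differ.

#+begin_src python :exports code :eval never-export
from collections import defaultdict


def find_tags(gene_order, family_of, max_spacers=0):
    """Detect tandem arrays (hypothetical helper, illustrative rule).

    gene_order: {chromosome: [gene ids sorted by position]}
    family_of:  {gene id: family id}; genes without a family are singletons.
    Returns a list of (chromosome, family, [genes of the array]).
    """
    tags = []
    for chromosome, genes in gene_order.items():
        # Ranks (positions in the gene order) of each family's members.
        ranks_by_family = defaultdict(list)
        for rank, gene in enumerate(genes):
            family = family_of.get(gene)
            if family is not None:
                ranks_by_family[family].append(rank)

        for family, ranks in ranks_by_family.items():
            current = [ranks[0]]
            for rank in ranks[1:]:
                spacers = rank - current[-1] - 1  # genes lying in between
                if spacers <= max_spacers:
                    current.append(rank)
                else:
                    if len(current) > 1:
                        tags.append((chromosome, family, [genes[r] for r in current]))
                    current = [rank]
            if len(current) > 1:
                tags.append((chromosome, family, [genes[r] for r in current]))
    return tags


# Made-up chromosome with two families, A and B.
toy_order = {"chr1": ["a1", "x", "a2", "b1", "b2", "y", "z", "a3"]}
toy_families = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
print(find_tags(toy_order, toy_families, max_spacers=0))
print(find_tags(toy_order, toy_families, max_spacers=1))
#+end_src

In this sketch spacers are counted between consecutive family members rather than between the extremities of the array, which is an assumption worth checking against the definition retained for each $\mathrm{TAG}_k$ list.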

** Analyses performed on TAG

FTAG Finder output consists mostly of lists of genes, corresponding to TAG under various definitions. These lists can subsequently be used as the basis of more specific statistical analyses.

*** Are there over-represented gene functions among TAG?

The gls:GO describes biological concepts across three main classes: Cellular Component, Molecular Function and Biological Process. It provides a controlled vocabulary of concepts and the relationships between them. Genes with a functional annotation can be linked to particular GO terms. We can then perform a GO enrichment analysis to assess whether a particular GO term is over-represented in a gene list of interest compared with a reference list. To do so, we can use a Fisher exact test for each GO term and correct for multiple testing with the FDR (False Discovery Rate) control procedure of [[latex:textsc][Benjamini]] and [[latex:textsc][Hochberg]].
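
A minimal Python sketch of this test is given below. It assumes a 2x2 contingency table per GO term (genes annotated or not with the term, in the list of interest and in a separate reference list), computes a one-sided Fisher exact $p$-value from the hypergeometric distribution, and then adjusts a set of $p$-values with the Benjamini-Hochberg procedure. The function names and the counts are hypothetical and only serve the illustration.

#+begin_src python :exports code :eval never-export
from math import comb


def fisher_enrichment_pvalue(list_with, list_without, ref_with, ref_without):
    """One-sided Fisher exact test (over-representation in the list).

    Hypothetical helper. The four arguments are gene counts: annotated or
    not with the GO term, in the list of interest and in the reference.
    """
    total_with = list_with + ref_with
    total_without = list_without + ref_without
    list_size = list_with + list_without
    denominator = comb(total_with + total_without, list_size)
    # Sum the hypergeometric probabilities of tables at least as extreme.
    numerator = sum(
        comb(total_with, x) * comb(total_without, list_size - x)
        for x in range(list_with, min(total_with, list_size) + 1)
    )
    return numerator / denominator


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value to the smallest
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up counts for a single GO term, then made-up p-values for four terms.
print(fisher_enrichment_pvalue(3, 1, 1, 3))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
#+end_src

When the list of interest is a subset of the reference, the reference counts passed to the function should exclude the genes of the list so that the two columns of the table remain disjoint.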

# Let $go$ be a GO term. We construct a contingency matrix based on the count of genes associated with this GO term (or associated with one of its brother GO term) for the reference gene list and the list of interest (here, the list of genes in a TAG) (see cref:tab:fisher-test-contigency-table).
*** Are TAG located preferentially on specific chromosome regions?

*** Are there chromosomes enriched or depleted in TAG?

*** Do genes located next to each other in a TAG share the same orientation?

The relative orientation of two genes of a TAG falls into one of three cases: both genes are on the same strand (\(\rightarrow \rightarrow\)), they have a divergent orientation (\(\leftarrow \rightarrow\)), or they have a convergent one (\(\rightarrow \leftarrow\)). Graham conjectured that genes of a TAG that are close to each other would be more likely to share the same orientation, and this indeed seems to be the case [cite:@shojaRoadmapTandemlyArrayed2006].
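
The three cases can be counted with a few lines of Python. The sketch below assumes that the strands of the genes of a TAG are given as =+= or =-= in positional order along the chromosome; the function names and the example strands are hypothetical.

#+begin_src python :exports code :eval never-export
from collections import Counter


def orientation_class(upstream_strand, downstream_strand):
    """Classify two neighbouring genes, ordered by position (hypothetical helper)."""
    for strand in (upstream_strand, downstream_strand):
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand: {strand!r}")
    if upstream_strand == downstream_strand:
        return "same"        # -> -> or <- <-
    if upstream_strand == "-":
        return "divergent"   # <- ->
    return "convergent"      # -> <-


def orientation_counts(strands):
    """Count the orientation classes over consecutive genes of one TAG."""
    return Counter(
        orientation_class(first, second)
        for first, second in zip(strands, strands[1:])
    )


# Made-up strands for a TAG of four genes.
print(orientation_counts(["+", "+", "-", "+"]))
#+end_src

Counts obtained this way over all TAG could then be compared with the counts observed for neighbouring genes that do not belong to a TAG.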

# To test this, we can use a $\Chi^2$ test of goodness of fit or a Student $t$-test.

*** What is the robustness and accuracy of the detection method?

[cite/t:@le-hoangEtudeTranscriptomiqueGenes2017] started analyzing the impact of parameter choice on FTAG Finder results. A more detailed benchmark of FTAG Finder against other methods on a controlled test dataset might be of particular interest.
This would pose the challenge of homogenization of the outputs of the different methods.

#+begin_export latex
\flstop
#+end_export

* References
:PROPERTIES:
:UNNUMBERED: t
:END:

#+print_bibliography:

#+begin_export latex
\cleartoleftpage
\clearpairofpagestyles
#+end_export


* Summary
:PROPERTIES:
:UNNUMBERED: t
:END:

Duplicate genes are an important feature of eukaryotic genomes. They contribute to genome plasticity, and hence to the capacity of species to evolve.

Several mechanisms may lead to gene duplication. Among them, an unequal crossing-over leads to the formation of Tandemly Arrayed Genes (TAG) corresponding to homologous genes located one set after the other on the same chromosome.

There are multiple methods for detecting duplicate genes from sequences. These methods vary in terms of the particular gene duplication mechanism they target, computational efficiency and ease of use.

FTAG Finder is a simple Galaxy pipeline aiming at the detection of families of duplicate genes and the identification of TAG based on the proteome of a single species. FTAG Finder is developed in the /Laboratoire de Mathématiques et Modélisation d'Évry/, where I will do my internship.

The first aim of my internship is to extend the current Galaxy implementation of FTAG Finder with new export lists better suited to the analysis requirements of the laboratory. The second is to port the Galaxy pipeline to another scientific workflow manager better suited to reproducible analyses, such as Snakemake or Nextflow.

Then, the updated version of the FTAG Finder pipeline will be used to perform an analysis on the TAG of a model species, to assess its proper behavior. A benchmark of the pipeline will probably be run to compare FTAG Finder with alternative published methods targeting duplicate genes, and TAG in particular.

* Bean :noexport:
** MCL
MCL uses two operations on a stochastic matrix representation $M$ of the graph, first derived from the adjacency matrix, namely /expansion/ and /inflation/. Expansion consists in taking a power of the matrix through the usual matrix product. Inflation consists in raising each entry of the matrix to a power $r$, and subsequently scaling its columns so that they sum to 1 again. The image of the inflation operator $\Gamma_r$ is defined as
\[
(\Gamma_r M)_{pq} = (M_{pq})^r / \sum_{i=1}^m (M_{iq})^r
\]
where $m$ is the number of rows in the matrix, and $M_{pq}$ is the value in the $p, q$ cell of the matrix $M$.

This operator strengthens the edges with higher weights and tends to annihilate edges with lower flow.

The iterative application of both operators eventually ends up in a partition of the initial graph's vertices into clusters of closely connected nodes (corresponding, in our case, to gene families).
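
The two operators can be sketched in plain Python as follows. This is a toy illustration under several assumptions rather than the reference MCL implementation: self-loops are added before normalisation, expansion is a simple squaring of the matrix, the number of iterations is fixed instead of testing convergence, and clusters are read from the rows that keep a non-negligible value. All function names and the example graph are hypothetical.

#+begin_src python :exports code :eval never-export
def normalise_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    size = len(matrix)
    sums = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    return [[matrix[i][j] / sums[j] for j in range(size)] for i in range(size)]


def expand(matrix):
    """Expansion: square the matrix with the usual matrix product."""
    size = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def inflate(matrix, r):
    """Inflation: entry-wise power r followed by column rescaling."""
    return normalise_columns([[value ** r for value in row] for row in matrix])


def mcl(adjacency, r=2.0, iterations=50, threshold=1e-6):
    """Toy MCL; iterations and threshold are arbitrary illustrative values."""
    size = len(adjacency)
    with_loops = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
        for i in range(size)
    ]
    matrix = normalise_columns(with_loops)
    for _ in range(iterations):
        matrix = inflate(expand(matrix), r)
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in matrix:
        members = frozenset(j for j, value in enumerate(row) if value > threshold)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Made-up graph: two triangles joined by one weak edge between nodes 2 and 3.
toy_graph = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
print(mcl(toy_graph))
#+end_src

Larger values of $r$ make inflation more aggressive and generally yield smaller clusters, which is why this parameter deserves attention when comparing gene family definitions.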
** Walktrap
Principle: construct vertex communities based on where an agent would get stuck in a random walk.

* Setup :noexport:

#+name: startup
#+begin_src emacs-lisp
(org-babel-load-file "./setup.org")
#+end_src

#+RESULTS: startup
: Loaded ./setup.el


#  LocalWords:  speciation subfunctionalization neofunctionalization
#  LocalWords:  pseudogenization bioinformatics
# #  Local Variables:
# #  eval: (progn (org-babel-goto-named-src-block "startup") (org-babel-execute-src-block) (outline-hide-sublevels 1))
# #  End: