
<div class="org-center">
<p>
<b>keywords</b>: duplicate genes, tandemly arrayed genes, pipeline
</p>
</div>


# Scientific context

It is estimated that between 46% and 65.5% of human genes could be considered as duplicate genes (<a href="#citeproc_bib_item_5">Correa et al. 2021</a>).
Duplicate genes offer a pool of genetic material available for further experimentation during species evolution.


## Gene duplication mechanisms

![img](./figures/lallemand2020-fig1_copy.svg "Different types of duplication. (A) Whole genome duplication. (B) An unequal crossing-over leads to a duplication of a fragment of a chromosome. (C) In tandem duplication, two (sets of) genes are duplicated one after the other. (D) A retrotransposon enables retroduplication: an RNA transcript is reverse transcribed and inserted back in the genome without introns and with a polyA tail. (E) A DNA transposon can acquire a fragment of a gene. (F) Segmental duplication corresponds to long stretches of duplicated sequences with high identity. Adapted from (<a href="#citeproc_bib_item_14">Lallemand et al. 2020</a>) (fig. 1).")

Multiple mechanisms may lead to gene duplication. The following sections review them.


### Polyploidisation and whole genome duplication

In a whole genome duplication (WGD) event, the entire set of genes present on the chromosomes is duplicated (panel A of the figure above).
WGD is particularly frequent in plants.
A striking example is probably the *Triticum* genus (wheat), in which some species (such as *T. aestivum*) are hexaploid due to hybridisation events (<a href="#citeproc_bib_item_11">Golovnina et al. 2007</a>).

We distinguish two kinds of polyploidisation, based on the origin of the duplicate genome:

-   Allopolyploidisation occurs when the supplementary chromosomes come from another species. This is the case for *Triticum aestivum* hybridisation.
-   Autopolyploidisation consists of the duplication of the genome within the same species.

WGD can result from polyspermy or from a non-reduced gamete.


### Unequal crossing-over

During a crossing-over, which may occur during cell division, two chromatids exchange a chromosome fragment. If the two chromatids are cleaved at different positions, the exchanged fragments have different lengths. Homologous recombination following such an uneven crossover incorporates a duplicate region into one of the chromatids, as represented in panels B and C of the figure above.
This mechanism duplicates the whole set of genes present in the inserted fragment. The duplicated genes lie one set after the other and are thus called tandemly arrayed genes (TAGs).


### Retroduplication

Retrotransposons, or RNA transposons, are one type of transposable element. They share a similar structure and mechanism with retroviruses.
They may replicate in the genome through a mechanism known as &ldquo;copy-and-paste&rdquo;.
These transposons typically contain a reverse transcriptase gene. This enzyme can reverse-transcribe an mRNA transcript into a DNA sequence, which can then be inserted elsewhere in the genome.
More generally, retroduplication refers to the duplication of a region of a chromosome through reverse transcription of an RNA transcript. In this case, the duplicate gene has lost its intronic sequences and carries a polyA tail (panel D of the figure above).


### Transduplication

DNA transposons are another type of transposable element whose transposition mechanism can also lead to gene duplication.
This type of transposable element moves in the genome through a mechanism known as &ldquo;cut-and-paste&rdquo;.
A typical DNA transposon contains a transposase gene. This enzyme recognizes two sites surrounding the donor transposon sequence in the chromosome, which results in DNA cleavage and excision of the transposon. The transposase can then insert the transposon at a new locus in the genome. A transposon can carry a fragment of a gene along during its transposition to the other locus (panel E of the figure above).


### Segmental duplication

Segmental duplications, also called low copy repeats, are long stretches of DNA with a high identity score (panel F of the figure above). Their exact duplication mechanism remains unclear (<a href="#citeproc_bib_item_14">Lallemand et al. 2020</a>). They may result from an accidental replication distinct from an uneven cross-over or a double-stranded breakage.
Nevertheless, transposable elements may well be involved, as a high enrichment of transposable elements has been found at segment extremities in *Drosophila* (<a href="#citeproc_bib_item_14">Lallemand et al. 2020</a>).


## Fate of duplicate genes in genome evolution

In his book *Evolution by Gene Duplication*, Susumu Ohno proposed that gene duplication plays a major role in species evolution (<a href="#citeproc_bib_item_16">Ohno 1970</a>), as it provides new genetic material on which to build new phenotypes while keeping a backup gene for the previous function.

A duplicate gene may be inactivated and become a pseudogene, be deleted, or be conserved.


### Pseudogenisation

Duplicate genes may be inactivated and become pseudogenes. These pseudogenes are no longer expressed but keep a gene-like structure, which degrades as further genome modifications accumulate.


### Neofunctionalisation

Duplicate genes may be conserved and gain a new function.
For instance, in *Drosophila*, the set of olfactory receptor genes results from several duplication and deletion events (<a href="#citeproc_bib_item_15">Nozawa and Nei 2007</a>), after which a duplicate may specialize in the detection of a particular chemical compound.


### Subfunctionalisation

Two duplicate genes with the same original function may undergo subfunctionalisation, during which each gene retains only one part of that function.


## Methods to identify duplicate genes

Lallemand et al. review the different methods used to detect duplicate genes. These methods depend on the type of duplicate genes they target and vary in computational cost (<a href="#citeproc_bib_item_14">Lallemand et al. 2020</a>).


### Paralog detection

Paralogs are homologous genes derived from a duplication event. They can be identified as homologous genes located in the same genome, or as homologous genes between different species once orthologous genes (homologous genes derived from a speciation event) have been filtered out.

Two gene characteristics can be used to assess homology between two genes: gene structure or sequence similarity.
Sequence similarity can be tested with a sequence alignment tool such as `BLAST` (<a href="#citeproc_bib_item_1">Altschul et al. 1990</a>), `Psi-BLAST`, `HMMER3` (<a href="#citeproc_bib_item_12">Johnson, Eddy, and Portugaly 2010</a>), or `diamond` (<a href="#citeproc_bib_item_3">Buchfink, Reuter, and Drost 2021</a>). These tools rely on heuristic algorithms: they may not return the optimal alignment, but they run much faster than exact algorithms such as the classical Smith and Waterman algorithm (<a href="#citeproc_bib_item_18">Smith and Waterman 1981</a>) or its optimized implementations `PARALIGN` and `SWIMM`.
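
To make the contrast with heuristic tools concrete, here is a minimal sketch of the exact Smith and Waterman local alignment score. It assumes a simple match/mismatch scheme and a linear gap penalty (the parameter values are illustrative, not taken from any of the cited tools, which typically use substitution matrices and affine gaps). The function name `smith_waterman_score` is hypothetical.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only: fixed match/mismatch values and a linear
    gap penalty, not a substitution matrix such as BLOSUM62.
    """
    # Only the previous row of the dynamic-programming matrix is kept.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            up = prev[j] + gap        # gap in sequence b
            left = curr[j - 1] + gap  # gap in sequence a
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the matrix is filled, so the cost grows with the product of the two sequence lengths. This is the cost that heuristic tools avoid by only extending promising seeds.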


### FTAG Finder

Developed in the LaMME laboratory, the FTAG Finder pipeline targets the detection of gene Families and Tandemly Arrayed Genes from a given species&rsquo; proteome (<a href="#citeproc_bib_item_2">Bouillon et al. 2016</a>).

The pipeline proceeds in three steps. First, it estimates the homology links between each pair of genes; then, it deduces the gene families; finally, it detects TAGs.


#### Estimation of homology links between genes

This step establishes a homology relation between each pair of genes in the proteome.
The typical tool for this step is `BLAST` (Basic Local Alignment Search Tool) (<a href="#citeproc_bib_item_1">Altschul et al. 1990</a>), run &ldquo;all against all&rdquo; on the proteome.
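
As an illustration of an &ldquo;all against all&rdquo; run, the sketch below builds the two BLAST+ command lines involved: one to index the proteome as a protein database, one to search the proteome against itself. The file names, the E-value cutoff and the function name `all_against_all_commands` are assumptions made for the example. They do not describe how FTAG Finder actually invokes `BLAST`.

```python
def all_against_all_commands(proteome: str, db_prefix: str,
                             out_tsv: str, evalue: float = 1e-5):
    """Return the (makeblastdb, blastp) argument lists for a self-search.

    All paths and the E-value cutoff are illustrative assumptions.
    """
    # Index the proteome as a protein database.
    make_db = ["makeblastdb", "-in", proteome,
               "-dbtype", "prot", "-out", db_prefix]
    # Search every protein against that same database,
    # writing tabular output (-outfmt 6).
    search = ["blastp", "-query", proteome, "-db", db_prefix,
              "-outfmt", "6", "-evalue", str(evalue), "-out", out_tsv]
    return make_db, search
```

Each list can then be passed to `subprocess.run(..., check=True)` on a machine where BLAST+ is installed.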

Several `BLAST` metrics can be used as homology measures, such as bitscore, identity percentage, E-value or variations of these. The choice of metric can affect the results of graph clustering in the following step, and should therefore be made carefully (<a href="#citeproc_bib_item_9">Gibbons et al. 2015</a>).
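
The following sketch shows how such a metric could be turned into edge weights. It assumes the default 12-column tabular `BLAST` output (`-outfmt 6`) and keeps, for each unordered pair of genes, the highest value of the chosen metric. The function name `read_homology_edges` and the aggregation rule are assumptions for illustration, not FTAG Finder's actual behaviour.

```python
import csv

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                 "gapopen", "qstart", "qend", "sstart", "send",
                 "evalue", "bitscore"]

def read_homology_edges(handle, metric: str = "bitscore") -> dict:
    """Map each unordered gene pair to its best score for `metric`.

    Only "higher is better" metrics are supported in this sketch.
    """
    if metric not in ("bitscore", "pident"):
        raise ValueError("metric must be 'bitscore' or 'pident'")
    edges = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        hit = dict(zip(BLAST_COLUMNS, row))
        if hit["qseqid"] == hit["sseqid"]:
            continue  # a gene trivially matches itself
        # (g1, g2) and (g2, g1) describe the same undirected link.
        pair = tuple(sorted((hit["qseqid"], hit["sseqid"])))
        weight = float(hit[metric])
        if weight > edges.get(pair, float("-inf")):
            edges[pair] = weight
    return edges
```

Switching `metric` from `"bitscore"` to `"pident"` changes the weights, and possibly the clusters obtained downstream, without changing the set of candidate pairs.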


#### Identification of gene families

Based on the homology links between each pair of genes, we construct an undirected weighted graph whose vertices correspond to genes and whose edges correspond to the homology links between them.
We apply a graph clustering algorithm to this graph in order to infer the gene families.

FTAG Finder offers three alternative clustering algorithms: single linkage, Markov Clustering (<a href="#citeproc_bib_item_8">van Dongen 1998</a>) or Walktrap (<a href="#citeproc_bib_item_17">Pons and Latapy 2005</a>).
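
Of the three, single linkage is the simplest to sketch: once weak edges are discarded, each connected component of the graph forms one family. The example below implements this with a union-find structure. The input format (a dictionary from gene pairs to weights), the `threshold` parameter and the function name `single_linkage_families` are assumptions for illustration and may differ from the FTAG Finder implementation.

```python
def single_linkage_families(edges: dict, threshold: float) -> list:
    """Group genes into families: connected components of the graph
    restricted to edges whose weight is at least `threshold`."""
    parent = {}

    def find(gene):
        # Register unseen genes as their own root, then walk up to the
        # root while halving the path.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for (gene_a, gene_b), weight in edges.items():
        root_a, root_b = find(gene_a), find(gene_b)
        if weight >= threshold and root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    families = {}
    for gene in parent:
        families.setdefault(find(gene), set()).add(gene)
    # Sorted output makes the result deterministic.
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])
```

With this rule a single strong link is enough to merge two groups, which is why single linkage tends to produce larger families than Markov Clustering or Walktrap on the same graph.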


#### Detection of TAGs

The final step of FTAG Finder determines the TAGs from the gene families and the positions of the genes on the chromosomes.
For a given chromosome, the tool looks for genes that belong to the same family and are located close to each other. A user-defined parameter sets the maximal number of genes allowed between two homologous genes.
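
This search can be sketched as a scan along the ordered list of genes of one chromosome. The sketch assumes that every intervening gene counts as a spacer, whatever its family, and that a TAG contains at least two genes. The names `find_tags`, `gene_order`, `family_of` and `max_spacers` are hypothetical and do not come from FTAG Finder.

```python
def find_tags(gene_order: list, family_of: dict, max_spacers: int = 0) -> list:
    """Return the TAGs found on one chromosome.

    gene_order  -- genes sorted by position on the chromosome
    family_of   -- maps a gene to its family identifier
    max_spacers -- maximal number of genes allowed between two
                   consecutive members of the same TAG
    """
    # Collect, for each family, the ranks of its genes on the chromosome.
    positions = {}
    for rank, gene in enumerate(gene_order):
        family = family_of.get(gene)
        if family is not None:
            positions.setdefault(family, []).append(rank)

    tags = []
    for ranks in positions.values():
        group = [ranks[0]]
        for rank in ranks[1:]:
            if rank - group[-1] - 1 <= max_spacers:
                group.append(rank)  # close enough: extend the current TAG
            else:
                if len(group) > 1:
                    tags.append([gene_order[r] for r in group])
                group = [rank]      # too far: start a new candidate
        if len(group) > 1:
            tags.append([gene_order[r] for r in group])
    return tags
```

For the gene order `a1, x, a2, b1, b2, y, z, a3`, where the `a` genes and the `b` genes form two families, only `b1, b2` is reported with `max_spacers=0`, while `max_spacers=1` also reports `a1, a2`.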


# Objectives for the internship


## Scientific questions

The scientific question underlying FTAG Finder is the evolutionary fate of duplicate genes in eukaryotes.


## Extend the existing FTAG Finder Galaxy pipeline

Galaxy is a web-based platform for running accessible data analysis pipelines, first designed for use in genomic data analysis (<a href="#citeproc_bib_item_10">Goecks et al. 2010</a>).

Last year, Séanna Charles worked on the Galaxy version of the FTAG Finder pipeline during her M1 internship (<a href="#citeproc_bib_item_4">Charles 2023</a>). I will continue this work.


## Port the FTAG Finder pipeline to a workflow manager

Another objective of my internship will be to port FTAG Finder to a workflow manager better suited to larger and more reproducible analyses.

We will have to choose which tool to use.
The two main options are Snakemake and Nextflow. Snakemake is a Python-powered workflow manager based on rules *à la* GNU Make (<a href="#citeproc_bib_item_13">Köster and Rahmann 2012</a>). Nextflow is a Groovy-powered workflow manager that relies on data flows (<a href="#citeproc_bib_item_6">Di Tommaso et al. 2017</a>). Both are widely used in the bioinformatics community, and their use has been on the rise since they came out in 2012 and 2013 respectively (<a href="#citeproc_bib_item_7">Djaffardjy et al. 2023</a>).

<h3>Bibliography</h3>

<style>.csl-entry{text-indent: -1.5em; margin-left: 1.5em;}</style><div class="csl-bib-body">
  <div class="csl-entry"><a id="citeproc_bib_item_1"></a>Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. “Basic Local Alignment Search Tool.” <i>Journal of Molecular Biology</i> 215 (3): 403–10. <a href="https://doi.org/10.1016/S0022-2836(05)80360-2">https://doi.org/10.1016/S0022-2836(05)80360-2</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_2"></a>Bouillon, Bérengère, Franck Samson, Etienne Birmelé, Loïc Ponger, and Carène Rizzon. 2016. “FTAG Finder: Un Outil Simple Pour Déterminer Les Familles de Gènes et Les Gènes Dupliqués En Tandem Sous Galaxy.”</div>
  <div class="csl-entry"><a id="citeproc_bib_item_3"></a>Buchfink, Benjamin, Klaus Reuter, and Hajk-Georg Drost. 2021. “Sensitive Protein Alignments at Tree-of-Life Scale Using DIAMOND.” <i>Nature Methods</i> 18 (4): 366–68. <a href="https://doi.org/10.1038/s41592-021-01101-x">https://doi.org/10.1038/s41592-021-01101-x</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_4"></a>Charles, Séanna. 2023. “Finalisation du pipeline FTAG (Families and TAG) Finder, un outil de détection des gènes dupliqués sous Galaxy.” Internship Report. Laboratoire de Mathématiques et Modélisation d’Évry.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_5"></a>Correa, Margot, Emmanuelle Lerat, Etienne Birmelé, Franck Samson, Bérengère Bouillon, Kévin Normand, and Carène Rizzon. 2021. “The Transposable Element Environment of Human Genes Differs According to Their Duplication Status and Essentiality.” <i>Genome Biology and Evolution</i> 13 (5): evab062. <a href="https://doi.org/10.1093/gbe/evab062">https://doi.org/10.1093/gbe/evab062</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_6"></a>Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. “Nextflow Enables Reproducible Computational Workflows.” <i>Nature Biotechnology</i> 35 (4): 316–19. <a href="https://doi.org/10.1038/nbt.3820">https://doi.org/10.1038/nbt.3820</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_7"></a>Djaffardjy, Marine, George Marchment, Clémence Sebe, Raphael Blanchet, Khalid Bellajhame, Alban Gaignard, Frédéric Lemoine, and Sarah Cohen-Boulakia. 2023. “Developing and Reusing Bioinformatics Data Analysis Pipelines Using Scientific Workflow Systems.” <i>Computational and Structural Biotechnology Journal</i> 21: 2075. <a href="https://doi.org/10.1016/j.csbj.2023.03.003">https://doi.org/10.1016/j.csbj.2023.03.003</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_8"></a>Dongen, S. van. 1998. “A New Cluster Algorithm for Graphs,” no. R 9814 (January). <a href="https://ir.cwi.nl/pub/4604">https://ir.cwi.nl/pub/4604</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_9"></a>Gibbons, Theodore R., Stephen M. Mount, Endymion D. Cooper, and Charles F. Delwiche. 2015. “Evaluation of BLAST-based Edge-Weighting Metrics Used for Homology Inference with the Markov Clustering Algorithm.” <i>BMC Bioinformatics</i> 16 (1): 218. <a href="https://doi.org/10.1186/s12859-015-0625-x">https://doi.org/10.1186/s12859-015-0625-x</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_10"></a>Goecks, Jeremy, Anton Nekrutenko, James Taylor, and Galaxy Team. 2010. “Galaxy: A Comprehensive Approach for Supporting Accessible, Reproducible, and Transparent Computational Research in the Life Sciences.” <i>Genome Biology</i> 11 (8): R86. <a href="https://doi.org/10.1186/gb-2010-11-8-r86">https://doi.org/10.1186/gb-2010-11-8-r86</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_11"></a>Golovnina, K. A., S. A. Glushkov, A. G. Blinov, V. I. Mayorov, L. R. Adkison, and N. P. Goncharov. 2007. “Molecular Phylogeny of the Genus Triticum L.” <i>Plant Systematics and Evolution</i> 264 (3): 195–216. <a href="https://doi.org/10.1007/s00606-006-0478-x">https://doi.org/10.1007/s00606-006-0478-x</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_12"></a>Johnson, L. Steven, Sean R. Eddy, and Elon Portugaly. 2010. “Hidden Markov Model Speed Heuristic and Iterative HMM Search Procedure.” <i>BMC Bioinformatics</i> 11 (1): 431. <a href="https://doi.org/10.1186/1471-2105-11-431">https://doi.org/10.1186/1471-2105-11-431</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_13"></a>Köster, Johannes, and Sven Rahmann. 2012. “Snakemake–a Scalable Bioinformatics Workflow Engine.” <i>Bioinformatics (Oxford, England)</i> 28 (19): 2520–22. <a href="https://doi.org/10.1093/bioinformatics/bts480">https://doi.org/10.1093/bioinformatics/bts480</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_14"></a>Lallemand, Tanguy, Martin Leduc, Claudine Landès, Carène Rizzon, and Emmanuelle Lerat. 2020. “An Overview of Duplicated Gene Detection Methods: Why the Duplication Mechanism Has to Be Accounted for in Their Choice.” <i>Genes</i> 11 (9): 1046. <a href="https://doi.org/10.3390/genes11091046">https://doi.org/10.3390/genes11091046</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_15"></a>Nozawa, Masafumi, and Masatoshi Nei. 2007. “Evolutionary Dynamics of Olfactory Receptor Genes in Drosophila Species.” <i>Proceedings of the National Academy of Sciences</i> 104 (17): 7122–27. <a href="https://doi.org/10.1073/pnas.0702133104">https://doi.org/10.1073/pnas.0702133104</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_16"></a>Ohno, Susumu. 1970. <i>Evolution by Gene Duplication</i>. Berlin, Heidelberg: Springer Berlin Heidelberg. <a href="https://doi.org/10.1007/978-3-642-86659-3">https://doi.org/10.1007/978-3-642-86659-3</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_17"></a>Pons, Pascal, and Matthieu Latapy. 2005. “Computing Communities in Large Networks Using Random Walks (Long Version).” December 12, 2005. <a href="https://doi.org/10.48550/arXiv.physics/0512106">https://doi.org/10.48550/arXiv.physics/0512106</a>.</div>
  <div class="csl-entry"><a id="citeproc_bib_item_18"></a>Smith, T. F., and M. S. Waterman. 1981. “Identification of Common Molecular Subsequences.” <i>Journal of Molecular Biology</i> 147 (1): 195–97. <a href="https://doi.org/10.1016/0022-2836(81)90087-5">https://doi.org/10.1016/0022-2836(81)90087-5</a>.</div>
</div>


## Summary