Further development on FTAG Finder, a pipeline to identify Gene Families and Tandemly Arrayed Genes

key	abbreviation	full form
TAG	TAG	Tandemly Arrayed Genes
FTAGFinder	FTAG Finder	Families and Tandemly Arrayed Genes Finder
WGD	WGD	Whole Genome Duplication
MCL	MCL	Markov Clustering

label	name	description
singleton	singleton	A gene with a single copy
polyploidisation	polyploidisation	Mechanism leading to the acquisition of at least three versions of the same original genome in a species
pseudogene	pseudogene	A gene-like sequence that has lost its capacity to be transcribed
segment_duplcation	segment duplication	Long stretches of DNA sequences with high identity score
retroduplication	retroduplication	Duplication of a gene through retro-transcription of its RNA transcript
autopolyploidisation	autopolyploidisation	Polyploidisation within the same species
allopolyploidisation	allopolyploidisation	Polyploidisation with genetic material coming from a diverged species
polyspermy	polyspermy	Fertilization of an egg by more than one sperm
segment_duplication	segment duplication	DNA sequences present in multiple locations within a genome that share high level of sequence identity
subfunctionalization	subfunctionalization	Fate of a duplicate gene which gets a part of the original gene function, the function being shared among multiple duplicates
orthologues	orthologues	Homologous genes whose divergence started at a speciation event

keywords: duplicate genes, tandemly arrayed genes, pipeline

printglossaries:

Scientific context

It is estimated that between 46% and 65.5% of human genes could be considered duplicate genes\footnote{The estimates vary strongly depending on the criteria used} [cite:@correaTransposableElementEnvironment2021]. Duplicate genes offer a pool of genetic material available for further experimentation during species evolution.

Gene duplication mechanisms

Multiple mechanisms may lead to a gene duplication. Their effect ranges from the duplication of the whole genome to the duplication of a fragment of a gene.

Whole genome duplication and polyploidisation

During a gls:WGD event, the entire set of genes present on the chromosomes is duplicated (cref:fig:gene-duplication-mechanisms (A)). Gls:WGD can occur through gls:polyspermy or when a non-reduced gamete is involved. Gls:polyploidisation is a mechanism leading to a species with at least three copies of an initial genome. A striking example is Triticum aestivum (wheat), which is hexaploid¹ due to several hybridisation events [cite:@golovninaMolecularPhylogenyGenus2007a].

We distinguish two kinds of glspl:polyploidisation, based on the origin of the duplicate genome: (i) Gls:allopolyploidisation occurs when the supplementary chromosomes come from a divergent species. This is the case for the Triticum aestivum hybridisation, which united the chromosome set of a Triticum species with that of an Aegilops species. (ii) Gls:autopolyploidisation consists of the hybridisation or duplication of the whole genome within the same species.

Unequal crossing-over

Another source of gene duplication is unequal crossing-over. During cell division, a crossing-over occurs when two chromatids exchange chromosome fragments. If the two chromatids are cleaved at different positions, the exchanged fragments may have different lengths. Homologous recombination following such an unequal crossing-over leads to the incorporation of a duplicate region, as depicted in cref:fig:gene-duplication-mechanisms (B, C). This mechanism duplicates the whole set of genes present in the fragment. The duplicate genes are located one after the other on the chromosome: we call them gls:TAG. Gls:TAG are the kind of gene duplication we will focus on during this internship.

Retroduplication

Transposable elements play a major role in genome plasticity, and enable gene duplication too. Retrotransposons, or RNA transposons, are one type of transposable element. They share a similar structure and similar replication mechanisms with retroviruses. Retrotransposons replicate in the genome through a mechanism known as "copy-and-paste". These transposons typically contain a reverse transcriptase gene. This enzyme reverse-transcribes an mRNA transcript into its complementary DNA sequence, which can then be inserted elsewhere in the genome. More generally, gls:retroduplication refers to the duplication of a sequence through reverse transcription of an RNA transcript. Genes duplicated through retroduplication lose their intronic sequences and carry a polyA tail with them to their new locus (cref:fig:gene-duplication-mechanisms (D)).

Transduplication

DNA transposons are another kind of transposable element whose transposition mechanism can also lead to gene duplication. This type of transposable element moves in the genome through a mechanism known as "cut-and-paste". A typical DNA transposon contains a transposase gene. This enzyme recognizes two sites surrounding the donor transposon sequence in the chromosome, which results in DNA cleavage and excision of the transposon. The transposase can then insert the transposon at a new genome locus. A transposon may carry a fragment of a gene to the new locus during its transposition (cref:fig:gene-duplication-mechanisms (E)), leading to the duplication of this fragment.

Segment duplication

Finally, glspl:segment_duplication, also called low copy repeats, are long stretches of DNA with a high identity score (cref:fig:gene-duplication-mechanisms (F)). Their exact duplication mechanism remains unclear [cite:@lallemandOverviewDuplicatedGene2020]. They may come from an accidental replication, distinct from an unequal crossing-over, or from a double-strand break. Transposable elements may well be involved in the mechanism, as a high enrichment of transposable elements is found next to the extremities of duplicate segments in Drosophila [cite:@lallemandOverviewDuplicatedGene2020].

Fate of duplicate genes in genome evolution

In his book Evolution by Gene Duplication, Susumu Ohno proposed that gene duplication plays a major role in species evolution [cite:@ohnoEvolutionGeneDuplication1970], because it provides new genetic material on which to build new phenotypes while keeping a backup gene for the previous function. Indeed, duplicate genes may evolve after duplication: they may be inactivated, becoming glspl:pseudogene; they may be deleted; or they may be conserved, in which case they may acquire new functions.

Pseudogenization

Duplicate genes may be inactivated and become pseudogenes. These pseudogenes are no longer expressed. They keep a gene-like structure, which degrades as further genome modifications accumulate.

Neofunctionalization

Duplicate genes may be conserved and gain a new function. For instance, the current set of olfactory receptor genes results from several duplication and deletion events (in Drosophila: [cite/t:@nozawaEvolutionaryDynamicsOlfactory2007]), after which the duplicate olfactory genes specialized in the detection of particular chemical compounds.

Subfunctionalization

Two duplicate genes with the same original function may undergo gls:subfunctionalization, by which each gene conserves only one part of the function.

Functional redundancy

The two gene copies may keep the ancestral function: in this case the quantity of gene product may increase.

Methods to identify duplicate genes

Lallemand et al. review the different methods used to detect duplicate genes. These methods depend on the type of duplicate genes they target and vary in computational burden as well as ease of use [cite:@lallemandOverviewDuplicatedGene2020].

Paralog detection

Paralogs are homologous genes derived from a duplication event. We can identify them as homologous genes coming from the same genome, or as homologous genes between different species once we have filtered out gls:orthologues (homologous genes derived from a speciation event).

We can use two gene characteristics to assess the homology between two genes: gene structure or sequence similarity. Sequence similarity can be tested with a sequence alignment tool such as BLAST [cite:@altschulBasicLocalAlignment1990], Psi-BLAST, HMMER3 [cite:@johnsonHiddenMarkovModel2010] or diamond [cite:@buchfinkSensitiveProteinAlignments2021]. These tools rely on heuristic algorithms: they may not return the best possible result, but they run much faster than exact algorithms such as the classical Smith and Waterman algorithm [cite:@smithIdentificationCommonMolecular1981] or its optimized versions PARALIGN [cite:@rognesParAlignParallelSequence2001] and SWIMM.
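To illustrate why exact methods are costly, here is a minimal sketch of the Smith and Waterman local alignment score, computed by dynamic programming over every pair of positions. The function name and the scoring values (match, mismatch and a linear gap penalty) are assumptions chosen for readability; real tools use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only (assumption): fixed match/mismatch values
    and a linear gap penalty.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
```

Every cell of the len(a) by len(b) matrix is visited, so the running time grows with the product of the two sequence lengths. This is the cost that heuristic tools avoid by only extending promising seeds.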

FTAG Finder

Developed in the LaMME laboratory, the FTAG Finder (Families and Tandemly Arrayed Genes Finder) pipeline is a simple pipeline targeting the detection of gls:TAG from the proteome of a single species [cite:@bouillonFTAGFinderOutil2016].

The pipeline proceeds in three steps. First, it estimates the homology links between each pair of genes. Then, it deduces the gene families. Finally, it searches for gls:TAG.

Estimation of homology links between genes

This step consists of establishing a homology relationship between each pair of genes in the proteome. The typical tool involved is BLAST (Basic Local Alignment Search Tool) [cite:@altschulBasicLocalAlignment1990], run "all against all" on the proteome.

Several BLAST metrics can be used as a homology measure, such as bitscore, identity percentage, E-value or variations of these. The choice of metric can affect the results of graph clustering in the following step, and we should therefore choose it carefully [cite:@gibbonsEvaluationBLASTbasedEdgeweighting2015].
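As a rough sketch of this step, the snippet below turns "all against all" BLAST hits into weighted homology links. It assumes the hits were written in the standard 12-column tabular format (`-outfmt 6`). The function name, the E-value cut-off and the rule of keeping the best hit per gene pair are illustrative assumptions, not FTAG Finder's actual behaviour.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def homology_edges(handle, metric="bitscore", max_evalue=1e-5):
    """Map each unordered gene pair to its best value of `metric`.

    Assumption: `metric` is a column where higher means more similar
    ("bitscore" or "pident"); an E-value would need the opposite rule.
    """
    edges = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        query, subject = row["qseqid"], row["sseqid"]
        # Skip self hits and hits above the (hypothetical) E-value cut-off.
        if query == subject or float(row["evalue"]) > max_evalue:
            continue
        pair = tuple(sorted((query, subject)))
        weight = float(row[metric])
        edges[pair] = max(weight, edges.get(pair, weight))
    return edges


# Made-up hits for three hypothetical genes g1, g2 and g3.
example_hits = (
    "g1\tg1\t100.0\t50\t0\t0\t1\t50\t1\t50\t1e-30\t105.0\n"
    "g1\tg2\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t80.0\n"
    "g2\tg1\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t82.0\n"
    "g1\tg3\t30.0\t20\t14\t0\t1\t20\t1\t20\t0.5\t15.0\n"
)
edges = homology_edges(io.StringIO(example_hits))
```

Here the self hit and the weak g1/g3 hit are discarded, and the two reciprocal g1/g2 hits collapse into a single link carrying the higher bitscore. Switching `metric` to `"pident"` gives a different edge weighting from the same hits.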

Identification of gene families

Based on the homology links between each pair of genes, we construct an undirected weighted graph whose vertices correspond to genes and whose edges correspond to the homology links between them. We apply a graph clustering algorithm to this graph in order to infer the gene families, which correspond to densely connected communities of vertices.

FTAG Finder offers three alternative clustering algorithms: single linkage, Markov Clustering [cite:@vandongenNewClusterAlgorithm1998] or Walktrap [cite:@ponsComputingCommunitiesLarge2005].
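The simplest of these options, single linkage, can be sketched as follows: two genes belong to the same family as soon as a chain of sufficiently strong homology links connects them, so families are the connected components of the thresholded graph. The function name, the `min_weight` threshold and the data layout are assumptions made for this example; it is not the pipeline's code.

```python
def single_linkage_families(genes, edges, min_weight=0.0):
    """Group genes into families with single linkage (union-find).

    genes: iterable of gene identifiers.
    edges: {(gene_a, gene_b): weight}; every gene named in an edge is
           assumed to be listed in `genes`.
    """
    genes = list(genes)
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links up to the representative, halving the path.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for (gene_a, gene_b), weight in edges.items():
        if weight >= min_weight:
            parent[find(gene_a)] = find(gene_b)

    families = {}
    for gene in genes:
        families.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in families.values())


# Hypothetical genes and homology weights.
example_edges = {("g1", "g2"): 82.0, ("g2", "g3"): 60.0, ("g4", "g5"): 20.0}
families = single_linkage_families(
    ["g1", "g2", "g3", "g4", "g5"], example_edges, min_weight=50.0
)
```

With this threshold, g1, g2 and g3 form one family even though g1 and g3 share no direct link, while g4 and g5 remain singletons. This chaining effect is the main reason to consider the Markov Clustering or Walktrap alternatives on dense graphs.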

Detection of TAGs

The final step of FTAG Finder consists of identifying gls:TAG from the gene families and the positions of the genes. For a given chromosome, the tool looks for genes that belong to the same family and are located close to each other. The tool tolerates a maximal number of intervening genes between the homologous genes, set by the user as a parameter. Ref:fig:tag-definitions is a schematic representation of some possible gls:TAG arrangements on a genome, together with their definition in the FTAG Finder Find Tags step.
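The search described above can be sketched as follows, under one possible reading of the rule: on each chromosome, consecutive members of a family are chained into the same TAG when at most `max_spacers` other genes lie between them. The function name, the parameter name and the input layout are hypothetical, and the exact definitions used by the Find Tags step may differ.

```python
from collections import defaultdict


def find_tags(gene_order, family_of, max_spacers=0):
    """Return a list of (chromosome, family, [genes]) tandem arrays.

    gene_order: {chromosome: [gene ids sorted by position]}.
    family_of:  {gene id: family id}; genes without a family are ignored
                as family members but still count as spacers.
    Assumption: spacers are counted as the number of genes lying between
    two consecutive members of the same family.
    """
    tags = []
    for chromosome, genes in gene_order.items():
        ranks_by_family = defaultdict(list)
        for rank, gene in enumerate(genes):
            if gene in family_of:
                ranks_by_family[family_of[gene]].append(rank)
        for family, ranks in ranks_by_family.items():
            run = [ranks[0]]
            # The trailing None is a sentinel that closes the last run.
            for rank in ranks[1:] + [None]:
                if rank is not None and rank - run[-1] - 1 <= max_spacers:
                    run.append(rank)
                    continue
                if len(run) > 1:
                    tags.append((chromosome, family, [genes[r] for r in run]))
                run = [rank]
    return tags


# Hypothetical chromosome with three families A, B and C.
example_order = {"chr1": ["a1", "b1", "a2", "c1", "c2", "a3"]}
example_families = {
    "a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C", "c2": "C",
}
strict_tags = find_tags(example_order, example_families, max_spacers=0)
relaxed_tags = find_tags(example_order, example_families, max_spacers=1)
```

With no spacer allowed, only the adjacent pair c1, c2 is reported. Allowing one spacer also reports a1, a2, which are separated by b1, whereas a3 stays out because two genes separate it from a2.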

Objectives for the internship

Scientific questions

The scientific question underlying FTAG Finder is the evolutionary fate of duplicate genes in eukaryotes.

Extend the existing FTAG Finder Galaxy pipeline

Galaxy is a web-based platform for running accessible data analysis pipelines, first designed for use in genomics data analysis [cite:@goecksGalaxyComprehensiveApproach2010]. Last year, Séanna Charles worked on the Galaxy version of the FTAG Finder pipeline during her M1 internship [cite:@charlesFinalisationPipelineFTAG2023]. I will continue this work.

Port the FTAG Finder pipeline to a workflow manager

Another objective of my internship will be to port FTAG Finder to a workflow manager better suited to larger and more reproducible analyses.

We will have to choose which tool to use. The two main options are Snakemake and Nextflow. Snakemake is a Python-powered workflow manager based on rules à la GNU Make [cite:@kosterSnakemakeScalableBioinformatics2012]. Nextflow is a Groovy-powered workflow manager, which relies on the dataflow paradigm [cite:@ditommasoNextflowEnablesReproducible2017]. Both are widely used in the bioinformatics community, and their use has been on the rise since they came out in 2012 and 2013 respectively [cite:@djaffardjyDevelopingReusingBioinformatics2023].

Bibliography

Summary

A hexaploid cell carries six sets of chromosomes, i.e. three pairs of homologous chromosome sets.
