Fix typos, amend glossary entries, improve phrasing

2024-04-16 10:44:29 +02:00 · b7e3d67ed9
parent c3262936a1
commit b7e3d67ed9
4 changed files with 65 additions and 76 deletions
--- a/folder-structure.sh
+++ b/folder-structure.sh
@@ -1,8 +0,0 @@
-#!/bin/sh
-
-find ./content -type d > folder_list.txt
-
-mkdir -p build
-cd build
-cat ../folder_list.txt | xargs mkdir -p
-rm ../folder_list.txt
--- a/report.org
+++ b/report.org
@@ -15,10 +15,25 @@
 | key        | abbreviation | full form                                  |
 |------------+--------------+--------------------------------------------|
 | TAG        | TAG          | Tandemly Arrayed Genes                     |
-| FTAGfinder | FTAG Finder  | Families and Tandemly Arrayed Genes Finder |
+| FTAGFinder | FTAG Finder  | Families and Tandemly Arrayed Genes Finder |
 | WGD        | WGD          | Whole Genome Duplication                   |
 | MCL        | MCL          | Markov Clustering                          |

+#+name: glossary
+| label                | name                 | description                                                                                                                   |
+|----------------------+----------------------+-------------------------------------------------------------------------------------------------------------------------------|
+| singleton            | singleton            | A gene with a single copy                                                                                                     |
+| polyploidisation     | polyploidisation     | Mechanisms leading to the acquisition of at least three versions of the same original genome in a species                     |
+| pseudogene           | pseudogene           | A gene-like sequence that lost its capacity to be transcribed                                                                 |
+| segment_duplcation   | segment duplication  | Long stretches of DNA sequences with high identity score                                                                      |
+| retroduplication     | retroduplication     | Duplication of a gene through retro-transcription of its RNA transcript                                                       |
+| autopolyploidisation | autopolyploidisation | Polyploidisation within the same species                                                                                      |
+| allopolyploidisation | allopolyploidisation | Polyploidisation with genetic material coming from a diverged species                                                         |
+| polyspermy           | polyspermy           | Fertilization of an egg by more than one sperm                                                                                |
+| segment_duplication  | segment duplication  | DNA sequences present in multiple locations within a genome that share a high level of sequence identity                      |
+| subfunctionalization | subfunctionalization | Fate of duplicate genes which get a part of the original gene function, the function being shared among multiple duplicates   |
+| orthologues          | orthologues          | Homologous genes whose divergence started at a speciation event                                                               |
+
 #+begin_export latex
 \makeatletter
 \hypersetup{
@@ -35,7 +50,7 @@
 #+begin_export latex
 \tableofcontents
 \listoffigures
-%\listoftables
+\listoftables
 #+end_export

 [[printglossaries:]]
@@ -45,7 +60,7 @@
 #+end_export

 * Scientific context
-It is estimated that between 46% and 65.5% of human genes could be considered as duplicate genes [cite:@correaTransposableElementEnvironment2021].
+It is estimated that between 46% and 65.5% of human genes could be considered as duplicate genes\footnote{Estimates vary strongly depending on the criteria used.} [cite:@correaTransposableElementEnvironment2021].
 Duplicate genes offer a pool of genetic material available for further experimentation during species evolution.

 ** Gene duplication mechanisms
@@ -57,68 +72,66 @@ Duplicate genes offers a pool of genetic material available for further experime
 }
 #+end_export

+Multiple mechanisms may lead to a gene duplication. Their effects range from the duplication of the whole genome to the duplication of a fragment of a gene.

-Multiple mechanisms may lead to gene duplication. The following sections review them.
-*** Polyploidisation and whole genome duplication
-In an event of gls:WGD, the entire set of genes present on the chromosomes is duplicated ([[cref:fig:gene-duplication-mechanisms]] (A)).
-gls:WGD is more frequent in plants.
-A striking example is probably the /Triticum/ genus (wheat) in which some species (such as /T. aestivum/) are hexaploid, due to hybridisation events [cite:@golovninaMolecularPhylogenyGenus2007a].
+In an event of *gls:WGD*, the entire set of genes present on the chromosomes is duplicated ([[cref:fig:gene-duplication-mechanisms]] (A)).
+gls:WGD can result from gls:polyspermy or from a non-reduced gamete.
+Gls:polyploidisation is a mechanism leading to a species with at least three copies of an initial genome.
+A striking example is probably /Triticum aestivum/ (wheat), which is hexaploid\footnote{A hexaploid cell carries six sets of chromosomes.} due to several hybridisation events [cite:@golovninaMolecularPhylogenyGenus2007a].
+We distinguish two kinds of glspl:polyploidisation, based on the origin of the duplicate genome: (i) Gls:allopolyploidisation occurs when the supplementary chromosomes come from a divergent species. This is the case for /Triticum aestivum/ hybridisation, which consisted in the union of the chromosome set of a /Triticum/ species with those of an /Aegilops/ species. (ii) Gls:autopolyploidisation consists in the hybridisation or duplication of the whole genome within the same species.

-We distinguish two kinds of polyploidisation, based on the origin of the duplicate genome:
- Allopolyploidisation occurs when the supplementary chromosomes comes from an other species. This is the case for /Triticum aestivum/ hybridisation.
- Autopolyploidisation consist in the duplication of the genome within the same species.
+Another source of gene duplication relies on *unequal crossing-over*. Unequal crossing-over may occur during cell division: two chromatids may exchange a fragment of chromosome and if the cleavage of the two chromatids occurs at different positions, the shared fragments may have different lengths. Homologous recombination of such uneven crossover results in the incorporation of a duplicate region, as depicted in cref:fig:gene-duplication-mechanisms (B, C).
+This mechanism leads to the duplication of the whole set of genes present in the fragment. These duplicate genes are located one after the other: we call them gls:TAG, and they are the kind of gene duplication we will be particularly interested in during this internship.
+
+Transposable elements play a major role in genome plasticity, and enable gene duplication too.
+
+Retrotransposons, or RNA transposons, are one type of transposable element. They share similar structure and replication mechanisms with retroviruses.
+They replicate in the genome through a mechanism known as "copy-and-paste".
+These transposons typically contain a reverse transcriptase gene. This enzyme reverse-transcribes an mRNA transcript into its complementary DNA sequence, which can then be inserted elsewhere in the genome.
+More generally, gls:retroduplication refers to the duplication of a sequence through reverse transcription of an RNA transcript. A gene duplicated through retroduplication loses its intronic sequences and brings a poly(A) tail with it (cref:fig:gene-duplication-mechanisms (D)).
+
+DNA transposons are another kind of transposable element whose transposition mechanism can also lead to gene duplication.
+This type of transposable element moves in the genome through a mechanism known as "cut-and-paste".
+A typical DNA transposon contains a transposase gene. This enzyme recognizes two sites surrounding the donor transposon sequence in the chromosome, resulting in a DNA cleavage and an excision of the transposon. The transposase can then insert the transposon at a new genome locus. A transposon may bring a fragment of a gene during its transposition to the new locus (cref:fig:gene-duplication-mechanisms (E)), leading to the duplication of this fragment.
+
+Finally, *glspl:segment_duplication*, also called /low copy repeats/, are long stretches of DNA with high identity score ([[cref:fig:gene-duplication-mechanisms]] (F)). Their exact duplication mechanism remains unclear [cite:@lallemandOverviewDuplicatedGene2020]. They may come from an accidental replication, distinct from an uneven cross-over or a double-strand break.
+Nevertheless, transposable elements may well be involved in the mechanism, as a high enrichment of transposable elements has been found at the extremities of duplicate segments in /Drosophila/ [cite:@lallemandOverviewDuplicatedGene2020].

-gls:WGD can occur thanks to polyspermy or in case of a non-reduced gamete.
-*** Unequal crossing-over
-A crossing-over may occur during cell division. Two chromatids may exchange a fragment of chromosome. If the cleavage of the two chromatids occurs at different positions, the shared fragments may have different lengths. Homologous recombination of such uneven crossover results in the incorporation of a duplicate region, as represented in cref:fig:gene-duplication-mechanisms (B, C).
-This mechanism leads to the duplication of the whole set of genes present in the inserted fragment. These duplicate genes locate one set after the other, and are thus called gls:TAG.
-*** Retroduplication
-Retrotransposons, or RNA transposons are one type of transposable elements. Retrotransposons share similar structure and mechanism with retroviruses.
-They may replicate in the genome through a mechanism known as "copy-and-paste".
-These transposons typically contain a reverse transcriptase gene. This enzyme may proceed in the reverse transcription of an mRNA transcript into DNA sequence which can then be inserted elsewhere in the genome.
-More generally, retroduplication refers to the duplication of a region of a chromosome through reverse transcription of a RNA transcript. In this case the duplicate gene lost its intronic sequences and brings a polyA tail with it ( cref:fig:gene-duplication-mechanisms (D)).
-*** Transduplication
-DNA transposons are another type of transposable elements whose transposition mechanism can also lead to gene duplication.
-This type of transposable element moves in the genome through a mechanisms known as "cut-and-paste".
-A typical DNA transposon contains a transposase gene. This enzyme recognize two sites surrounding the donnor transposon sequence in the chromosome resulting in a DNA cleavage and excision of the transposon. The transposase can then insert the transposon in a new genome locus. A transposon can bring a fragment of a gene during its transposition in the other locus (cref:fig:gene-duplication-mechanisms (E)).
-*** Segment duplication
-Segment duplications, also called low copy repeats are long stretches of DNA with high identity score ([[cref:fig:gene-duplication-mechanisms]] (F)). Their exact duplication mechanisms remains unclear [cite:@lallemandOverviewDuplicatedGene2020], they may results from an accidental replication, distinct from an uneven cross-over or a double stranded breakage.
-Nevertheless, transposable elements may well be involved as a high enrichment of transposable elements has been found at segment extremities, in /Drosophila/ [cite:@lallemandOverviewDuplicatedGene2020].
 ** Fate of duplicate genes in genome evolution
-In his book /Evolution by Gene Duplication/, Susumu [[latex:textsc][Ohno]] proposed that gene duplication plays a major role in species evolution [cite:@ohnoEvolutionGeneDuplication1970], as it provides a new genetic material to build on new phenotypes while keeping a backup gene for the previous function.
+In his book /Evolution by Gene Duplication/, Susumu [[latex:textsc][Ohno]] proposed that gene duplication plays a major role in species evolution [cite:@ohnoEvolutionGeneDuplication1970], because it provides new genetic material from which new phenotypes can be built while keeping a backup gene for the previous function.
+Indeed, duplicate genes may evolve after duplication: they may be inactivated and become glspl:pseudogene, be deleted, or be conserved and possibly acquire new functions.

-Duplicate genes may be inactivated becoming pseudogenes, be deleted or conserved.
 *** Pseudogenisation
-Duplicate genes may be inactivated and become pseudogenes. These pseudogenes keep a gene-like structure, which degrades as and when further genome modifications occur, but are no longer expressed.
-*** Neofunctionalisation
+Duplicate genes may be inactivated and become pseudogenes. These pseudogenes keep a gene-like structure, which degrades as further genome modifications occur. They are, however, no longer expressed.
+*** Neofunctionalization
 Duplicate genes may be conserved and gain a new function.
-For instance, in /Drosophila/, the set of olfactory receptor genes result from several duplication and deletion events [cite:@nozawaEvolutionaryDynamicsOlfactory2007], after which the duplicate may specialize in the detection of a particular chemical compound.
-*** Subfunctionalisation
-Two duplicate genes with the same original function may encounter a subfunctionalisation during which each gene conserves only one part of the function.
+For instance, the set of olfactory receptor genes in /Drosophila/ results from several duplication and deletion events [cite:@nozawaEvolutionaryDynamicsOlfactory2007], after which the duplicates may specialize in the detection of a particular chemical compound.
+*** Subfunctionalization
+Two duplicate genes with the same original function may undergo gls:subfunctionalization, during which each gene conserves only one part of the function.
 *** Functional redundancy
 Two copies may keep the ancestral function: in this case the organism may increase the quantity of gene product.

 ** Methods to identify duplicate genes
-[[latex:textsc][Lallemand]] et al. review the different methods used to detect duplicate genes. These methods depend on the type of duplicate genes they target, and vary on computation burden [cite:@lallemandOverviewDuplicatedGene2020].
+[[latex:textsc][Lallemand]] et al. review the different methods used to detect duplicate genes. These methods depend on the type of duplicate genes they target and vary in computational burden and ease of application [cite:@lallemandOverviewDuplicatedGene2020].

 *** Paralog detection
-Paralogs are homologous genes derived from a duplication event. They can be identified as homologous genes located in the same genome, or as homologous genes between different species once we filtered out orthologous genes (homologous genes derived from a speciation event).
+Paralogs are homologous genes derived from a duplication event. We can identify them as homologous genes coming from the same genome, or as homologous genes between different species once we have filtered out gls:orthologues (homologous genes derived from a speciation event).

-Two gene characteristics can be used to assess to assess homology between two genes: gene structure of sequence similarity.
-The sequence similarity can be tested with a sequence alignment tool, such as =BLAST= [cite:@altschulBasicLocalAlignment1990], =Psi-BLAST=, and =HMMER3= [cite:@johnsonHiddenMarkovModel2010], or =diamond= [cite:@buchfinkSensitiveProteinAlignments2021], which are heuristic algorithm, which means they may not provide the best results, but do so way faster than exact algorithms, such as the classical Smith and Waterman algorithm [cite:@smithIdentificationCommonMolecular1981] or its optimized versions =PARALIGN= or =SWIMM=.
+We can use two gene characteristics to assess the homology between two genes: gene structure or sequence similarity.
+The sequence similarity can be tested with a sequence alignment tool, such as =BLAST= [cite:@altschulBasicLocalAlignment1990], =Psi-BLAST=, and =HMMER3= [cite:@johnsonHiddenMarkovModel2010], or =diamond= [cite:@buchfinkSensitiveProteinAlignments2021]. These are heuristic algorithms: they may not provide the optimal result, but they run much faster than exact algorithms, such as the classical Smith and Waterman algorithm [cite:@smithIdentificationCommonMolecular1981] or its optimized versions =PARALIGN= or =SWIMM=.

 *** FTAG Finder
-Developed in the LaMME laboratory, the gls:FTAGfinder pipeline targets the detection of gene Families and Tandemly Arrayed Genes from a given species' proteome [cite:@bouillonFTAGFinderOutil2016].
+Developed in the LaMME laboratory, the gls:FTAGFinder pipeline targets the detection of gene Families and Tandemly Arrayed Genes from a given species' proteome [cite:@bouillonFTAGFinderOutil2016].

-The pipeline proceeds in three steps. First, it estimates the homology links between each pair of genes; then, it deduce the gene families and finally, it detects gls:TAG.
+The pipeline proceeds in three steps. First, it estimates the homology links between each pair of genes; then, it deduces the gene families and finally, it detects gls:TAG.
 **** Estimation of homology links between genes
 This step consists in establishing a relation between each pair of genes in the proteome.
 In this step, the typical tool involved is =BLAST= (Basic Local Alignment Search Tool) [cite:@altschulBasicLocalAlignment1990] run "all against all" on the proteome.

-Several =BLAST= metrics can be used as homology measures, such as bitscore, identity percentage, E-value or variations of these. The choice of metrics can affect the results of graph clustering in the following step, and should therefore be chosen carefully [cite:@gibbonsEvaluationBLASTbasedEdgeweighting2015].
+Several =BLAST= metrics can be used as a homology measure, such as bitscore, identity percentage, E-value or variations of these. The choice of metric can affect the results of graph clustering in the following step, and we should therefore choose it carefully [cite:@gibbonsEvaluationBLASTbasedEdgeweighting2015].
 **** Identification of gene families
 Based on the homology links between each pair of genes, we construct an undirected weighted graph whose vertices correspond to genes and edges to homology links between them.
-We apply a graph clustering algorithm on the graph in order to infer the gene families.
+We apply a graph clustering algorithm on the graph in order to infer the gene families, which correspond to densely connected communities of vertices.

 FTAG Finder proposes three clustering algorithm alternatives: single linkage, Markov Clustering [cite:@vandongenNewClusterAlgorithm1998] or Walktrap [cite:@ponsComputingCommunitiesLarge2005].
 **** Detection of TAGs
@@ -129,18 +142,13 @@ For a given chromosome, the tool seeks genes belonging to the same family and lo
 The underlying question of FTAG Finder is the study of the evolutionary fate of duplicate genes in Eukaryotes.
 ** Extend the existing FTAG Finder Galaxy pipeline
 Galaxy is a web-based platform for running accessible data analysis pipelines, first designed for use in genomic data analysis [cite:@goecksGalaxyComprehensiveApproach2010].
-
 Last year, Séanna [[latex:textsc][Charles]] worked on the Galaxy version of the FTAG Finder pipeline during her M1 internship  [cite:@charlesFinalisationPipelineFTAG2023]. I will continue this work.

 ** Port FTAG Finder pipeline on a workflow manager
 Another objective of my internship will be to port FTAG Finder on a workflow manager better suited to larger and more reproducible analyses.

 We will have to choose which tool to use.
-The two main options are Snakemake and Nextflow. Snakemake is a python powered workflow manager based on rules /à la/ GNU Make [cite:@kosterSnakemakeScalableBioinformatics2012]. Nextflow, is a groovy powered workflow manager, which rely on data flows [cite:@ditommasoNextflowEnablesReproducible2017]. Both are widely used in the bioinformatics community, and their use have been on the rise since they came out in 2012 and 2013 respectively [cite:@djaffardjyDevelopingReusingBioinformatics2023].
-
-#+begin_export latex
-%\flstop
-#+end_export
+The two main options are Snakemake and Nextflow. Snakemake is a Python-powered workflow manager based on rules /à la/ GNU Make [cite:@kosterSnakemakeScalableBioinformatics2012]. Nextflow is a Groovy-powered workflow manager, which relies on the dataflow paradigm [cite:@ditommasoNextflowEnablesReproducible2017]. Both are widely used in the bioinformatics community, and their use has been on the rise since they came out in 2012 and 2013 respectively [cite:@djaffardjyDevelopingReusingBioinformatics2023].

 #+begin_export html
 <h3>Bibliography</h3>
@@ -148,6 +156,10 @@ The two main options are Snakemake and Nextflow. Snakemake is a python powered w

 #+print_bibliography:

+#+begin_export latex
+\flstop
+#+end_export
+
 #+begin_export latex
 \cleartoleftpage
 #+end_export
--- a/report.pdf
+++ b/report.pdf
--- a/sty/lamme2024.sty
+++ b/sty/lamme2024.sty
@@ -27,30 +27,15 @@
 % References
 \usepackage[
 	maxcitenames=2,
+	maxbibnames=99, % show all authors in the bibliography
 	style=authoryear-comp,
-	backend=biber,
 	citestyle=authoryear-comp,
+	backend=biber,
 	natbib=true
 ]{biblatex}

 \RequirePackage{doi}
 \RequirePackage{xurl}
-% \AtEveryBibitem{\clearfield{number}}
-\DeclareSortingNamekeyScheme{
-	\keypart{
-		\namepart{given}
-	}
-	\keypart{
-		\namepart{prefix}
-	}
-	\keypart{
-		\namepart{family}
-	}
-	\keypart{
-		\namepart{suffix}
-	}
-}
-
 \RequirePackage{orcidlink}

 \RequirePackage[
@@ -67,7 +52,7 @@

 \usepackage[
 	abbreviations,         % create "abbreviations" glossary
-	nomain,                % don't create "main" glossary
+	%nomain,                % don't create "main" glossary
 	stylemods=longbooktabs, % do the adjustments for the longbooktabs styles,
 	automake
 ]{glossaries-extra}