From fbb6e726b79b18487639360ce9e2cb573cf4728e Mon Sep 17 00:00:00 2001 From: Samuel Ortion Date: Mon, 8 Apr 2024 17:24:40 +0200 Subject: [PATCH] feat: Add new words --- Makefile | 4 +- report.org | 122 +++++++++++++++++-------------- report.pdf | 4 +- sty/floatlefttextright.sty | 8 +- sty/lamme2024.sty | 40 ++++++++-- sty/test_figurelefttextright.pdf | 4 +- sty/test_figurelefttextright.tex | 4 +- titlepage.tex | 6 +- 8 files changed, 115 insertions(+), 77 deletions(-) diff --git a/Makefile b/Makefile index c6cd8f5..c3b67cf 100755 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ OPTIONS=-shell-escape -file-line-error -synctex=1 -interaction=batchmode SOURCE=report -all: build bib glossaries latexmk +all: build bib glossaries build build debug: lualatex -shell-escape -file-line-error $(SOURCE) @@ -16,3 +16,5 @@ bib: glossaries: makeglossaries $(SOURCE) + +.PHONY: build diff --git a/report.org b/report.org index 13b8b4d..b0a823c 100644 --- a/report.org +++ b/report.org @@ -13,24 +13,19 @@ #+name: acronyms | key | abbreviation | full form | |------+--------------+------------------------------------| -| TAG | TAG | Tandemly Arrayed Gene | -| FTAG | FTAG | Families and Tandemly Arrayed Gene | +| TAG | TAG | Tandemly Arrayed Genes | +| FTAG | FTAG | Families and Tandemly Arrayed Genes | | WGD | WGD | Whole Genome Duplication | | MCL | MCL | Markov Clustering | + #+begin_export latex \makeatletter \hypersetup{ - pdfkeywords={duplicate genes, workflow management systems, pipeline}, + pdfkeywords={duplicate genes, tandemly arrayed genes, pipeline}, } \makeatother -\pagenumbering{roman} #+end_export -#+begin_myabstract -Duplicate genes is an important component of genomes. -They have a particular role in genome evolution, allowing species to explore new gene functionality offering a pool of usable genes to build on. 
-TODO: -#+end_myabstract #+begin_center *keywords*: duplicate genes, tandemly arrayed genes, pipeline @@ -45,14 +40,16 @@ TODO: [[printglossaries:]] #+begin_export latex -\pagenumbering{arabic} +%\afterpage{\flblankpage} #+end_export -* Context +* Scientific context It is estimated that between 46% and 65.5% of human genes could be considered as duplicate genes [cite:@correaTransposableElementEnvironment2021]. Duplicate genes offers a pool of genetic material available for further experimentation during species evolution. -** Duplication mechanisms +** Gene duplication mechanisms + +#+begin_comment #+name: fig:gene-duplication-mechanisms #+begin_src emacs-lisp :exports results :results raw (cond @@ -62,71 +59,85 @@ Duplicate genes offers a pool of genetic material available for further experime "[[./figures/lallemand2020-fig1_copy.pdf]]") (t "[[./figures/lallemand2020-fig1_copy.svg]]")) #+end_src -#+caption[Different types of duplications]: Different types of duplications. (A) Whole genome duplication. (B) An unequal crossing-over leads to a duplication of a fragment of a chromosome. (C) In tandem duplication, two (set of) genes are duplicated one after the other. (D) Retrotransposon enables retroduplication: a RNA transcript is reverse transcribed and inserted back without introns and with a polyA tail in the genome. (E) A DNA transposon can acquire a fragment of a gene. (F) Segmental duplication corresponds to long stretches of duplicated sequences with high identity. *Source* Adapted from [cite:@lallemandOverviewDuplicatedGene2020]. -#+RESULTS: fig:gene-duplication-mechanisms -[[./figures/lallemand2020-fig1_copy.svg]] +#+end_comment + +#+LABEL: fig:gene-duplication-mechanisms +#+CAPTION[Different types of duplication]: Different types of duplication. (A) Whole genome duplication. (B) An unequal crossing-over leads to a duplication of a fragment of a chromosome. (C) In tandem duplication, two (set of) genes are duplicated one after the other. 
(D) Retrotransposon enables retroduplication: a RNA transcript is reverse transcribed and inserted back without introns and with a polyA tail in the genome. (E) A DNA transposon can acquire a fragment of a gene. (F) Segmental duplication corresponds to long stretches of duplicated sequences with high identity. Adapted from [cite:@lallemandOverviewDuplicatedGene2020] (fig. 1). +[[./figures/lallemand2020-fig1_copy.pdf]] # https://stackoverflow.com/questions/13611837/how-can-i-use-different-image-formats-for-different-exports-in-org-mode -Multiple mechanisms may lead to gene duplication. We review them in this section. - -*** Segment duplication -*** Retroduplication -Retrotransposons, or RNA transposons are one type of transposable elements. Retrotransposons share similar structure and mechanism with retroviruses. -They may replicate in the genome through a mechanism known as "copy-and-paste". -These transposons are typically composed of a reverse transcriptase gene. This enzyme may proceed in the reverse transcription of an mRNA transcript into DNA sequence which can then be inserted elsewhere in the genome. -More generally, retroduplication refers to the duplication of a region of a chromosome through reverse transcription or a RNA transcript. -*** Transduplication -DNA transposons are another type of transposable element whose transposition mechanism can also lead to gene duplication. -This type of transposable element moves in the genome through a mechanisms known as "cut-and-paste". -A typical DNA transposon contains a transposase gene. This enzyme recognize two sites surrounding the donnor transposon sequence in the chromosome resulting in a DNA cleavage and excision of the transposon. The transposase can then insert the transposon in a new place of the genome. -Similarly to retrotransposon, if a gene was present between the two cleavage sites of the donnor transposon, it may move with the transposed sequence. 
-*** Tandem Duplication -*** Polyploidisation and Whole Genome Duplication -In an event of whole genome duplication, the entire set of genes present on the chromosomes is duplicated. -Whole genome duplication is more frequent in plants. +Multiple mechanisms may lead to gene duplication. The following sections review them. +*** Polyploidisation and whole genome duplication +In an event of gls:WGD, the entire set of genes present on the chromosomes is duplicated (figure [[ref:fig:gene-duplication-mechanisms]] (A)). +gls:WGD is more frequent in plants. A striking example is probably the /Triticum/ genus (wheat) in which some species (such as /T. aestivum/) are hexaploid, due to hybridisation events [cite:@golovninaMolecularPhylogenyGenus2007]. We distinguish two kinds of polyploidisation, based on the origin of the duplicate genome: - Allopolyploidisation occurs when the supplementary chromosomes comes from an other species. This is the case for /Triticum aestivum/ hybridisation. -- Autopolyploidisation consist in the hybridisation of the genome within the same species. +- Autopolyploidisation consists in the duplication of the genome within the same species. -Whole genome duplication can occur thanks to polyspermy or in case of a non-reduced gamete, for instance. +gls:WGD can result from polyspermy or from a non-reduced gamete. *** Unequal crossing-over -A crossing-over may occur during cell division. A fragment of chromosome is exchanged between two chromatids of a pair of chromosome. If the cleavage of the two chromatids occured at different positions on both chromosomes, the shared fragments may have different lengths. When the repair of missing fragment is performed, the resulting chromosome will incorporate a duplicate region of the chromosome, leading to a potential duplication for genes present in this region, as represented in figure [[fig:gene-duplication-mechanisms]]. 
-This mechanism leads to the duplication of the whole set of genes present in the inserted fragment. An array of genes is duplicated after the original array and are thus called Tandemly Arrayed Genes. -** Role of duplicate genes in genome evolution -In his book /Evolution by Gene Duplication/, Susumu [[latex:textsc][Ohno]] proposed that gene duplication plays a major role in species evolution [cite:@ohnoEvolutionGeneDuplication1970]. +A crossing-over may occur during cell division. Two chromatids may exchange a fragment of chromosome. If the cleavage of the two chromatids occurs at different positions, the shared fragments may have different lengths. Homologous recombination of such an uneven crossover results in the incorporation of a duplicate region, as represented in figure ref:fig:gene-duplication-mechanisms (B, C). +This mechanism leads to the duplication of the whole set of genes present in the inserted fragment. These duplicate genes are located one set after the other, and are thus called gls:TAG. +*** Retroduplication +Retrotransposons, or RNA transposons, are one type of transposable element. Retrotransposons share a similar structure and mechanism with retroviruses. +They may replicate in the genome through a mechanism known as "copy-and-paste". +These transposons typically contain a reverse transcriptase gene. This enzyme can reverse-transcribe an mRNA transcript into a DNA sequence, which can then be inserted elsewhere in the genome. +More generally, retroduplication refers to the duplication of a region of a chromosome through reverse transcription of an RNA transcript. In this case, the duplicate gene has lost its intronic sequences and carries a polyA tail (figure ref:fig:gene-duplication-mechanisms (D)). +*** Transduplication +DNA transposons are another type of transposable element whose transposition mechanism can also lead to gene duplication. 
+This type of transposable element moves in the genome through a mechanism known as "cut-and-paste". +A typical DNA transposon contains a transposase gene. This enzyme recognizes two sites surrounding the donor transposon sequence in the chromosome, resulting in DNA cleavage and excision of the transposon. The transposase can then insert the transposon at a new genome locus. A transposon can carry a fragment of a gene with it when it transposes to the new locus (figure ref:fig:gene-duplication-mechanisms (E)). +*** Segment duplication +Segment duplications, also called low copy repeats, are long stretches of DNA with a high identity score (figure [[ref:fig:gene-duplication-mechanisms]] (F)). Their exact duplication mechanism remains unclear [cite:@lallemandOverviewDuplicatedGene2020]; they may result from an accidental replication, distinct from an uneven crossing-over or a double-stranded break. +Nevertheless, transposable elements may well be involved, as a high enrichment of transposable elements has been found at segment extremities in /Drosophila/ [cite:@lallemandOverviewDuplicatedGene2020]. +** Fate of duplicate genes in genome evolution +In his book /Evolution by Gene Duplication/, Susumu [[latex:textsc][Ohno]] proposed that gene duplication plays a major role in species evolution [cite:@ohnoEvolutionGeneDuplication1970], as it provides new genetic material on which to build new phenotypes while keeping a backup gene for the previous function. + +Duplicate genes may be inactivated and become pseudogenes, be deleted, or be conserved. +*** Pseudogenisation +Duplicate genes may be inactivated and become pseudogenes. These pseudogenes keep a gene-like structure, which degrades as further genome modifications occur, but are no longer expressed. +*** Neofunctionalisation +Duplicate genes may be conserved and gain a new function. 
+For instance, in /Drosophila/, the set of olfactory receptor genes results from several duplication and deletion events [cite:@nozawaEvolutionaryDynamicsOlfactory2007], after which the duplicates may specialize in the detection of a particular chemical compound. +*** Subfunctionalisation +Two duplicate genes with the same original function may undergo subfunctionalisation, during which each gene conserves only part of the function. ** Methods to identify duplicate genes -[[latex:textsc][Lallemand]] et al. review the different methods used to detect duplicate genes. These methods are dependant on the type of duplicate genes they target [cite:@lallemandOverviewDuplicatedGene2020]. +[[latex:textsc][Lallemand]] et al. review the different methods used to detect duplicate genes. These methods depend on the type of duplicate genes they target [cite:@lallemandOverviewDuplicatedGene2020]. *** FTAG Finder -Developped in the LaMME laboratory, this pipeline targets the detection of gene families and tandemly arrayed genes from a given species' proteome [cite:@bouillonFTAGFinderOutil]. +Developed in the LaMME laboratory, the FTAG Finder pipeline targets the detection of gene Families and Tandemly Arrayed Genes from a given species' proteome [cite:@bouillonFTAGFinderOutil2016]. +The pipeline proceeds in three steps: first, it estimates the homology links between each pair of genes; then, it deduces the gene families; and finally, it detects gls:TAG. **** Estimation of homology links between genes -This steps consists in establishing a relation between each genes in a genome. -In this step, the typical tool involved is =BLAST= (Basic Local Alignment Search Tool) [cite:@altschulBasicLocalAlignment1990] run on the whole proteome. +This step consists in establishing a relation between each pair of genes in the proteome. +The typical tool for this step is =BLAST= (Basic Local Alignment Search Tool) [cite:@altschulBasicLocalAlignment1990], run "all against all" on the proteome. 
-Several =BLAST= metrics can be used as an homology measure, such as bitscore, identity percentage, E-value or variations on those. The metrics choice may have an impact on the results of graph clustering in the following step [cite:@gibbonsEvaluationBLASTbasedEdgeweighting2015]. +Several =BLAST= metrics can be used as homology measures, such as bitscore, identity percentage, E-value or variations of these. The choice of metric can affect the results of graph clustering in the following step, and should therefore be made carefully [cite:@gibbonsEvaluationBLASTbasedEdgeweighting2015]. **** Identification of gene families -Based on the homology links between each pair of genes, we construct a weighted undirected graph whose vertices corresponds to genes and edges to homology links. -Then, a graph clustering algorithm is applied on this graph in order to infer the gene families. +Based on the homology links between each pair of genes, we construct an undirected weighted graph whose vertices correspond to genes and edges to homology links between them. +We apply a graph clustering algorithm to the graph in order to infer the gene families. -The team chosed to propose three clustering algorithms: Single linkage, Markov Clustering or Walktrap. -* Objectives -** Extend the existing Galaxy pipeline -Galaxy is a web-based platform for performing accessible data analysis pipeline, first designed for use in genomic data analysis [cite:@goecksGalaxyComprehensiveApproach2010]. +FTAG Finder offers three alternative clustering algorithms: single linkage, Markov Clustering [cite:@vandongenNewClusterAlgorithm1998] or Walktrap [cite:@ponsComputingCommunitiesLarge2005]. +**** Detection of TAGs +The final step of FTAG Finder consists in the determination of gls:TAG from the gene families and the chromosome sequence. +For a given chromosome, the tool seeks genes belonging to the same family and located close to each other. 
The tool allows a maximal number of genes between the homologous genes, with a parameter set by the user. +* Objectives for the internship +** Scientific questions +The underlying question of FTAG Finder is the study of the evolutionary fate of duplicate genes in Eukaryotes. +** Extend the existing FTAG Finder Galaxy pipeline +Galaxy is a web-based platform for running accessible data analysis pipelines, first designed for use in genomic data analysis [cite:@goecksGalaxyComprehensiveApproach2010]. -Last year, Séanna [[latex:textsc][Charles]], worked on the Galaxy's version of the gls:FTAG Finder pipeline during her M1 internship [cite:@charlesFinalisationPipelineFTAG2023]. I will continue this work. +Last year, Séanna [[latex:textsc][Charles]] worked on the Galaxy version of the FTAG Finder pipeline during her M1 internship [cite:@charlesFinalisationPipelineFTAG2023]. I will continue this work. ** Port FTAG Finder pipeline on a workflow manager Another objective of my internship will be to port FTAG Finder on a workflow manager better suited to larger and more reproducible analysis. We will have to make a choice for the tool we will use. -The two main options are Snakemake and Nextflow. Snakemake is a python powered workflow manager based on rules /à la/ GNU Make [cite:@kosterSnakemakeScalableBioinformatics2012]. Nextflow, is a groovy powered workflow manager, which rely on data flows [cite:@ditommasoNextflowEnablesReproducible2017]. Both are widely used in the bioinformatics community, and their use have been on the rise since they came out in 2012 and 2016 respectively [cite:@djaffardjyDevelopingReusingBioinformatics2023]. - -These tools ease the deployment of large scale data analysis workflow with reproducible output. +The two main options are Snakemake and Nextflow. Snakemake is a python powered workflow manager based on rules /à la/ GNU Make [cite:@kosterSnakemakeScalableBioinformatics2012]. 
Nextflow is a Groovy-powered workflow manager which relies on data flows [cite:@ditommasoNextflowEnablesReproducible2017]. Both are widely used in the bioinformatics community, and their use has been on the rise since they came out in 2012 and 2013 respectively [cite:@djaffardjyDevelopingReusingBioinformatics2023]. #+begin_export html

Bibliography

@@ -135,14 +146,13 @@ These tools ease the deployment of large scale data analysis workflow with repro #+print_bibliography: #+begin_export latex -\clearpage +\cleartoleftpage #+end_export ** Summary :PROPERTIES: :UNNUMBERED: t :END: - * Bean :noexport: ** MCL MCL uses two operations on a stochastic matrix representation $M$ of the graph first derived from the adjacency matrix, namely /expansion/ and /inflation/. Expansion consists in elevating the matrix to a power $r$, and subsequently scaling its columns so that they sum to 1 again. The image of the inflation operator $\Gamma_r$ is defined as @@ -154,3 +164,5 @@ where $m$ is number of rows in the matrix, and $M_{pq}$ is the value in the $p, This operator strengthens the edges with higher weights and tend to anihilate edges with lower flow. The application of both operator iteratively eventually ends up in a partition of the initial graph's edges into clusters of closely connected nodes (corresponding, in our case to gene families). +** Walktrap +Principle: construct vertex communities based on where an agent would get stuck in a random walk. 
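Review note on the MCL paragraph in the hunk above: the two operators can be illustrated with a minimal sketch, assuming the usual formulation of Markov Clustering (expansion as a matrix product, inflation as an element-wise power $r$ followed by rescaling each column so that it sums to 1). All function names, the added self-loops, the expansion power of 2 and the fixed iteration count are illustrative assumptions; this is not FTAG Finder's implementation.

```python
# Hedged sketch of MCL expansion and inflation on a small graph.
# Assumptions (not taken from the report): pure-Python lists of lists,
# self-loops added to every vertex, expansion = matrix squaring,
# inflation parameter r, and a fixed number of iterations.

def normalize_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for q in range(n):
        total = sum(m[p][q] for p in range(n))
        if total > 0:
            for p in range(n):
                out[p][q] = m[p][q] / total
    return out

def expand(m):
    """Expansion: multiply the matrix by itself (flow spreads along paths)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: raise each entry to the power r, then rescale the columns."""
    return normalize_columns([[v ** r for v in row] for row in m])

def mcl(adjacency, r=2.0, iterations=30):
    """Alternate expansion and inflation, starting from the adjacency matrix."""
    n = len(adjacency)
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), r)
    return m

def clusters(m, eps=1e-6):
    """Read clusters from the rows that still carry flow after iteration."""
    found = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > eps)
        if members:
            found.add(members)
    return sorted(sorted(c) for c in found)

# Two disconnected triangles stand in for two gene families.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
families = clusters(mcl(two_triangles))
```

In this toy input, each connected component is recovered as one cluster. On real homology graphs the edge weights would come from the chosen =BLAST= metric, and the inflation parameter controls how finely the families are split.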
diff --git a/report.pdf b/report.pdf index 75e1105..39aa26d 100644 --- a/report.pdf +++ b/report.pdf @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:b570ea851097f0ad59e4783bc83432d1218c65ceec7896453f3457cf6513c81d -size 130136 +oid sha256:16005abcb1280de91d99fd5f867e02591b0a0d67d210d336b27cfe9b651bc785 +size 135146 diff --git a/sty/floatlefttextright.sty b/sty/floatlefttextright.sty index f031a06..c8c3fef 100644 --- a/sty/floatlefttextright.sty +++ b/sty/floatlefttextright.sty @@ -19,9 +19,8 @@ \clearpage% %\addtocounter{page}{-1} \afterpage{\flblankpage} -} - - + } + \fi \iffalse % Example @@ -36,9 +35,6 @@ \label{Ima1} } \fi - -\fi - \def\@floatplacement{\global\@topnum\c@topnumber \global\@toproom \topfraction\@colht \global\@botnum \c@bottomnumber diff --git a/sty/lamme2024.sty b/sty/lamme2024.sty index eecba42..108eb7a 100644 --- a/sty/lamme2024.sty +++ b/sty/lamme2024.sty @@ -33,9 +33,25 @@ backend=biber, citestyle=authoryear-comp, natbib=true -]{biblatex} -%\RequirePackage{doi} -%\RequirePackage{xurl} + ]{biblatex} + +\RequirePackage{doi} +\RequirePackage{xurl} +% \AtEveryBibitem{\clearfield{number}} +\DeclareSortingNamekeyScheme{ + \keypart{ + \namepart{given} + } + \keypart{ + \namepart{prefix} + } + \keypart{ + \namepart{family} + } + \keypart{ + \namepart{suffix} + } +} \RequirePackage{orcidlink} @@ -53,15 +69,16 @@ \usepackage[ abbreviations, % create "abbreviations" glossary - %nomain, % don't create "main" glossary + nomain, % don't create "main" glossary stylemods=longbooktabs, % do the adjustments for the longbooktabs styles, automake -]{glossaries-extra} + ]{glossaries-extra} +\setabbreviationstyle[acronym]{long-short} \usepackage{hyperref} % Force text on right side, float on left side -% \usepackage{sty/floatlefttextright} +\usepackage{sty/floatlefttextright} \renewcommand\maketitle{\include{titlepage}} @@ -85,3 +102,14 @@ \renewcommand*{\mkbibnamefamily}[1]{\textsc{#1}} \renewcommand*{\mkbibnameprefix}[1]{\textsc{#1}} + +% 
Ensure summary is on even page + +\newcommand*\cleartoleftpage{ +\clearpage\ifodd\c@page +\hbox{} +\vspace*{\fill} +\thispagestyle{empty} +\newpage +\fi +} diff --git a/sty/test_figurelefttextright.pdf b/sty/test_figurelefttextright.pdf index 6d96e77..64fdd32 100644 --- a/sty/test_figurelefttextright.pdf +++ b/sty/test_figurelefttextright.pdf @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:3f6f82a0a7aba261435428e3fb228b40df4fc8895e9f1e3a6b2efc7cb64d3a43 -size 24048 +oid sha256:31b2b9a303e6288a5e9ae9b02619aab4f6bf7e8358358bfde4a0dbeb6bee45c9 +size 11665 diff --git a/sty/test_figurelefttextright.tex b/sty/test_figurelefttextright.tex index 3e6dae8..ab99c7c 100644 --- a/sty/test_figurelefttextright.tex +++ b/sty/test_figurelefttextright.tex @@ -1,4 +1,4 @@ -\documentclass{book} +\documentclass[twoside=false]{scrbook} \usepackage{floatlefttextright} \usepackage{lipsum} @@ -6,7 +6,7 @@ \begin{document} - \afterpage{\blankpage} +\afterpage{\flblankpage} \lipsum{100} diff --git a/titlepage.tex b/titlepage.tex index f1344bf..97bceb7 100644 --- a/titlepage.tex +++ b/titlepage.tex @@ -11,7 +11,7 @@ \Large 2023--2024 - \vfill + \vspace{2cm} \Large Samuel \textsc{Ortion} \orcidlink{0009-0001-0971-497X} @@ -24,11 +24,11 @@ \vfill \normalsize - \begin{minipage}{15em} + \begin{minipage}{12.5em} \textbf{Advisors}:\\ Carène \textsc{Rizzon} \\ Franck \textsc{Samson} \\ - Laboratoire de Mathématiques et Modélisation d'Évry \\ + Laboratoire de Mathématiques \\et Modélisation d'Évry \\ \href{mailto:carene.rizzon@univ-evry.fr}{carene.rizzon@univ-evry.fr} \\ \href{mailto:franck.samson@inrae.fr}{franck.samson@inrae.fr} \\ +33~(0)~1~64~85~35~40 \\