feat: Add new text to the report

Samuel Ortion 2024-03-27 19:51:22 +01:00
parent 61c504ec61
commit 012b91d0cc
7 changed files with 111 additions and 171 deletions

.gitignore vendored

@ -1,3 +1,4 @@
report.tex
*.gls-abr
*.glo-abr
*.glg-abr


@ -1,6 +1,9 @@
OPTIONS=-shell-escape -file-line-error -synctex=1
OPTIONS=-shell-escape -file-line-error -synctex=1 -interaction=batchmode
SOURCE=report
all: latexmk bib glossaries latexmk
all: build bib glossaries latexmk
debug:
lualatex -shell-escape -file-line-error $(SOURCE)
build:
lualatex $(OPTIONS) $(SOURCE)
@ -9,7 +12,7 @@ latexmk:
latexmk -gg -pdf $(SOURCE)
bib:
biber --output-directory=build $(SOURCE)
biber $(SOURCE)
glossaries:
makeglossaries $(SOURCE)


@ -1,9 +1,8 @@
#+title: Further development on Finder, a pipeline to identify Tandem Arrayed Genes
#+title: Further development on FTAG Finder, a pipeline to identify Tandemly Arrayed Genes
#+author: Samuel Ortion
#+date: 2023-2024
#+LATEX_CLASS: scientific-project
#+LATEX_HEADER: \usepackage{sty/lamme2024}
#+latex_header_extra: \newglossaryentry{LaMME}{name={LaMME},description={Laboratoire de Mathématiques et Modélisation d'Évry}}
#+bibliography: ../references.bib
#+exclude_tags: noexport
@ -13,9 +12,10 @@
#+name: acronyms
| key | abbreviation | full form |
|-----+--------------+---------------------|
| TAG | TAG | Tandem Arrayed Gene |
| FTAG | FTAG | Families and Tandem Arrayed Gene |
|------+--------------+------------------------------------|
| TAG | TAG | Tandemly Arrayed Gene |
| FTAG | FTAG | Families and Tandemly Arrayed Gene |
| WGD | WGD | Whole Genome Duplication |
#+begin_export latex
\hypersetup{
@ -25,17 +25,20 @@
}
\pagenumbering{roman}
#+end_export
#+begin_abstract
Duplicate genes is an important component of genomes. They have a particular role in genome evolution, allowing species to explore new gene functionality offering a pool of usable genes to build on.
#+begin_myabstract
Duplicate genes are an important component of genomes.
They have a particular role in genome evolution, allowing species to explore new gene functions by offering a pool of usable genes to build on.
TODO:
#+end_abstract
#+end_myabstract
#+begin_center
*keywords*: duplicate genes, tandem arrayed genes, pipeline
*keywords*: duplicate genes, tandemly arrayed genes, pipeline
#+end_center
#+begin_export latex
\tableofcontents
\listoffigures
\listoftables
#+end_export
[[printglossaries:]]
@ -43,50 +46,87 @@ TODO:
#+begin_export latex
\pagenumbering{arabic}
#+end_export
* Context
** What are duplicate genes?
Duplicate genes are genes that experienced a duplication event during species evolution.
These are homologous genes.
*** Duplication mechanisms
** Duplication mechanisms
#+name: fig:gene-duplication-mechanisms
#+CAPTION: Mechanisms leading to gene duplication
[[./figures/lallemand2020-fig1_copy.pdf]]
Several mechanisms may lead to gene duplication. We review them in this section.
**** Segment duplication
**** Retroduplication
Transposable elements cause an important part of gene duplication [citation needed]
Retrotransposon, or RNA transposon is one type of transposable element. Some of the representant of retrotransposon are similar to retroviruses.
Retrotransposon may be duplicated in the genome through a mechanism known as "copy-and-paste".
These transposons are typically composed of a reverse transcriptase gene. The protein encoded by this gene may proceed in the reverse transcription of the RNA transcript of the transposon sequence resulting in a DNA sequence which can then be included elsewhere in the genome.
During this process, the RNA transcript may include nearby gene sequence, which can thus be copied and pasted along with the retrotransposon.
**** Transduplication
DNA transposon is an other type of transposable element whose transposition mechanisms can lead to gene duplication too.
Multiple mechanisms may lead to gene duplication. We review them in this section.
*** Segment duplication
*** Retroduplication
Retrotransposons, or RNA transposons, are one type of transposable element. They share a similar structure and mechanism with retroviruses.
They may replicate in the genome through a mechanism known as "copy-and-paste".
These transposons typically contain a reverse transcriptase gene. The enzyme it encodes can reverse-transcribe an mRNA transcript into a DNA sequence, which can then be inserted elsewhere in the genome.
More generally, retroduplication refers to the duplication of a region of a chromosome through reverse transcription of a transcript.
*** Transduplication
DNA transposons are another type of transposable element whose transposition mechanism can also lead to gene duplication.
This type of transposable element moves in the genome through a mechanism known as "cut-and-paste".
The typical DNA transposon contains a transposase gene. The protein encoded by this gene recognize two sites surrounding the donnor transposon sequence in the chromosome resulting in a DNA cleavage. The transposase can then insert the transposon in a new place of the genome.
A typical DNA transposon contains a transposase gene. The transposase it encodes recognizes two sites flanking the donor transposon sequence in the chromosome, which results in DNA cleavage and excision of the transposon. The transposase can then insert the transposon at a new location in the genome.
As with retrotransposons, if a gene is present between the two cleavage sites of the donor transposon, it may move with the transposed sequence.
**** Tandem Duplication
**** Polyploidisation
***** Alloployploïdisation
***** Autopolyploïdisation
***** Mechanisms
****** Polyspermy
****** Non-reduced gametes
**** Unequal crossing-over
A crossing-over may occur during cell division. A fragment of chromosome is exchanged between two chromatids of a pair of chromosome. If the cleavage of the two chromatids occured at different positions on both chromosomes, the shared fragments may have different lengths. When the repair of missing fragment is performed, the resulting chromosome will incorporate a duplicate region of the chromosome, leading to a potential duplication for genes present in this region, as represented in figure [[fig:gene-duplication-mechanisms]] B. # TODO: check that this is really the B subfigure
*** Role in genome evolution
** Identification of duplicate genes
***
*** Finder
*** Tandem Duplication
*** Polyploidisation
**** Allopolyploidisation
**** Autopolyploidisation
**** Polyploidisation mechanisms
***** Polyspermy
***** Non-reduced gametes
*** Unequal crossing-over
A crossing-over may occur during cell division: a chromosome fragment is exchanged between two chromatids of a pair of chromosomes. If the two chromatids are cleaved at different positions, the exchanged fragments may have different lengths. When the missing fragment is repaired, one of the resulting chromosomes incorporates a duplicated region, leading to a potential duplication of the genes present in this region, as represented in figure [[fig:gene-duplication-mechanisms]].
This mechanism duplicates the whole set of genes present in the inserted fragment. The copies form a second array of genes placed right after the original one, and are thus called Tandemly Arrayed Genes.
** Role of duplicate genes in genome evolution
In his book /Evolution by Gene Duplication/, Susumu \textsc{Ohno} proposed that gene duplication plays a major role in species evolution [cite:@ohnoEvolutionGeneDuplication1970].
** Methods to identify duplicate genes
\textsc{Lallemand} et al. review the different methods used to detect duplicate genes. These methods depend on the type of duplicate genes they target [cite:@lallemandOverviewDuplicatedGene2020].
*** FTAG Finder
Developed in the LaMME laboratory, this pipeline targets the detection of gene families and tandemly arrayed genes from a given species' proteome [cite:@bouillonFTAGFinderOutil].
**** Estimation of homology links between genes
This step consists in establishing a homology relation between each pair of genes in a genome.
The tool typically used for this step is =BLAST= (Basic Local Alignment Search Tool) [cite:@altschulBasicLocalAlignment1990], run on the whole proteome.
Several =BLAST= metrics can be used as a homology measure, such as bitscore, identity percentage, E-value or modifications of these. The choice of metric may have an impact on the results of the graph clustering step [cite:@gibbonsEvaluationBLASTbasedEdgeweighting2015].
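As an illustration of this step, here is a minimal Python sketch that turns =BLAST= tabular hits into weighted homology links. It assumes an all-against-all run written with the default 12-column tabular layout (=-outfmt 6=). The function name =homology_links= and the choice of keeping the best hit per gene pair are hypothetical and are not taken from FTAG Finder.

```python
# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def homology_links(lines, metric="bitscore"):
    """Return {(gene_a, gene_b): weight} from BLAST tabular lines.

    Assumption: only the best hit is kept for each unordered gene pair,
    so the metric must be one where higher means more similar.
    """
    if metric not in ("bitscore", "pident"):
        raise ValueError("this sketch only supports 'bitscore' or 'pident'")
    column = BLAST_COLUMNS.index(metric)
    links = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # a self-hit carries no homology information
        pair = tuple(sorted((query, subject)))
        weight = float(fields[column])
        if pair not in links or weight > links[pair]:
            links[pair] = weight
    return links
```

The resulting dictionary can be read as the edge list of an undirected weighted graph, one entry per pair of genes with at least one hit.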
**** Identification of gene families
Based on the homology links between each pair of genes, we construct a weighted undirected graph whose vertices correspond to genes and whose edges correspond to homology links.
Then, a graph clustering algorithm is applied to this graph in order to infer the gene families.
The team chose to offer three clustering algorithms: single linkage, Markov Clustering and Walktrap.
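The simplest of these three options can be sketched in a few lines of Python. The sketch assumes that single linkage amounts to taking the connected components of the graph once edges below a weight threshold are ignored. The function name =single_linkage_families=, the =min_weight= parameter and the input format (a dictionary mapping gene pairs to weights) are hypothetical and may differ from what FTAG Finder does.

```python
def single_linkage_families(links, min_weight=0.0):
    """Group genes into families: connected components of the homology graph.

    links maps (gene_a, gene_b) pairs to an edge weight; edges whose weight
    is below min_weight are ignored, but their genes are still reported.
    """
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for (gene_a, gene_b), weight in links.items():
        root_a, root_b = find(gene_a), find(gene_b)
        if weight >= min_weight and root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(family) for family in families.values())
```

With this definition, two genes end up in the same family as soon as a chain of sufficiently strong homology links connects them, which is why single linkage tends to produce larger families than the two other algorithms.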
* Objectives
** Amend the existing Galaxy pipeline
Last year, a M1 student, Seanna Charles, worked on the Galaxy's version of the gls: Finder pipeline [cite:@charlesFinalisationPipelineFTAG2023].
During my internship, I will continue this work.
** Porting Finder pipeline on a workflow manager
** Extend the existing Galaxy pipeline
Galaxy is a web-based platform for running accessible data analysis pipelines, mostly used for genomic data analysis [cite:@goecksGalaxyComprehensiveApproach2010].
Last year, Séanna \textsc{Charles} worked on the Galaxy version of the gls:FTAG Finder pipeline [cite:@charlesFinalisationPipelineFTAG2023] during her M1 internship. I will continue this work.
** Port the FTAG Finder pipeline to a workflow manager
Another objective of my internship will be to port FTAG Finder to a workflow manager better suited to larger and more reproducible analyses.
We will have to choose which tool to use.
The two main options are Snakemake and Nextflow. Snakemake is a Python-powered workflow manager based on rules /à la/ GNU Make [cite:@kosterSnakemakeScalableBioinformatics2012]. Nextflow is a Groovy-powered workflow manager which relies on dataflows [cite:@ditommasoNextflowEnablesReproducible2017]. Both are widely used in the bioinformatics community, and their use has been on the rise since they came out in 2012 and 2016 respectively [cite:@djaffardjyDevelopingReusingBioinformatics2023].
These tools ease the deployment of large-scale data analysis workflows with reproducible outputs.
#+begin_export latex
\printbibliography
#+end_export
** Summary
:PROPERTIES:
:UNNUMBERED: t
:END:
* Bean :noexport:
MCL uses two operations on a stochastic matrix representation $M$ of the graph, first derived from the adjacency matrix, namely /expansion/ and /inflation/. Expansion consists in raising the matrix to a power using the ordinary matrix product. Inflation consists in raising each entry of the matrix to a power $r$, and subsequently rescaling its columns so that they sum to 1 again. The image of the inflation operator $\Gamma_r$ is defined as
\[
(\Gamma_r M)_{pq} = (M_{pq})^r / \sum_{i=1}^m (M_{iq})^r
\]
where $m$ is the number of rows of the matrix, and $M_{pq}$ is the entry in row $p$ and column $q$ of the matrix $M$.
This operator strengthens the edges with higher weights and tends to annihilate edges with lower flow.
Applying both operators iteratively eventually yields a partition of the initial graph's nodes into clusters of closely connected nodes (corresponding, in our case, to gene families).
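The two operators can be sketched in plain Python as follows. This is a toy implementation under stated assumptions, not the reference =mcl= program: the expansion power is fixed to 2, self-loops are added before normalisation, the number of iterations is fixed, and all function names are hypothetical.

```python
def normalize_columns(matrix):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    size = len(matrix)
    sums = [sum(matrix[i][q] for i in range(size)) for q in range(size)]
    return [[matrix[p][q] / sums[q] if sums[q] else 0.0 for q in range(size)]
            for p in range(size)]


def expand(matrix):
    """Expansion: ordinary matrix product of M with itself (power 2)."""
    size = len(matrix)
    return [[sum(matrix[p][k] * matrix[k][q] for k in range(size))
             for q in range(size)] for p in range(size)]


def inflate(matrix, r):
    """Inflation: (Gamma_r M)_pq = (M_pq)^r / sum_i (M_iq)^r."""
    return normalize_columns([[value ** r for value in row] for row in matrix])


def mcl(adjacency, r=2.0, iterations=50):
    """Alternate expansion and inflation, then read clusters off the rows."""
    size = len(adjacency)
    # Assumption: add a self-loop to every node before normalising.
    matrix = normalize_columns(
        [[adjacency[p][q] + (1.0 if p == q else 0.0) for q in range(size)]
         for p in range(size)])
    for _ in range(iterations):
        matrix = inflate(expand(matrix), r)
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in matrix:
        members = frozenset(q for q, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)
```

Higher values of =r= make inflation more aggressive and generally lead to smaller clusters.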

BIN
report.pdf (Stored with Git LFS)

Binary file not shown.


@ -1,111 +0,0 @@
% Created 2024-03-26 Tue 15:13
% Intended LaTeX compiler: lualatex
\documentclass{scrbook}
\usepackage{sty/lamme2024}
\newacronym{TAG}{TAG}{Tandem Arrayed Gene}
\newacronym{FTAG}{FTAG}{Families and Tandem Arrayed Gene}
\newglossaryentry{LaMME}{name={LaMME},description={Laboratoire de Mathématiques et Modélisation d'Évry}}
\makeindex
\makeglossaries
\usepackage{minted}
\author{Samuel Ortion}
\date{2023-2024}
\title{Further development on Finder, a pipeline to identify Tandem Arrayed Genes}
\hypersetup{
pdfauthor={Samuel Ortion},
pdftitle={Further development on Finder, a pipeline to identify Tandem Arrayed Genes},
pdfkeywords={},
pdfsubject={},
pdfcreator={Emacs 29.2 (Org mode 9.7)},
pdflang={English}}
\usepackage{biblatex}
\addbibresource{../references.bib}
\begin{document}
\maketitle
\hypersetup{
pdfauthor={Samuel Ortion},
pdftitle={},
pdfkeywords={duplicate genes, workflow management systems, pipeline},
}
\pagenumbering{roman}
\begin{abstract}
Duplicate genes is an important component of genomes. They have a particular role in genome evolution, allowing species to explore new gene functionality offering a pool of usable genes to build on.
TODO:
\end{abstract}
\begin{center}
\textbf{keywords}: duplicate genes, tandem arrayed genes, pipeline
\end{center}
\tableofcontents
\printglossaries
\pagenumbering{arabic}
\part{Context}
\label{sec:org8d0fa24}
\chapter{What are duplicate genes?}
\label{sec:orgee68751}
Duplicate genes are genes that experienced a duplication event during species evolution.
These are homologous genes.
\section{Duplication mechanisms}
\label{sec:orgcf44cad}
\begin{center}
\includegraphics[width=.9\linewidth]{./figures/lallemand2020-fig1_copy.pdf}
\caption{\label{fig:gene-duplication-mechanisms}Mechanisms leading to gene duplication}
\end{center}
Several mechanisms may lead to gene duplication. We review them in this section.
\subsection{Segment duplication}
\label{sec:org922a1dd}
\subsection{Retroduplication}
\label{sec:orgd8f7e18}
Transposable elements cause an important part of gene duplication [citation needed]
Retrotransposon, or RNA transposon is one type of transposable element. Some of the representant of retrotransposon are similar to retroviruses.
Retrotransposon may be duplicated in the genome through a mechanism known as ``copy-and-paste''.
These transposons are typically composed of a reverse transcriptase gene. The protein encoded by this gene may proceed in the reverse transcription of the RNA transcript of the transposon sequence resulting in a DNA sequence which can then be included elsewhere in the genome.
During this process, the RNA transcript may include nearby gene sequence, which can thus be copied and pasted along with the retrotransposon.
\subsection{Transduplication}
\label{sec:org74a527a}
DNA transposon is an other type of transposable element whose transposition mechanisms can lead to gene duplication too.
This type of transposable element moves in the genome through a mechanisms known as ``cut-and-paste''.
The typical DNA transposon contains a transposase gene. The protein encoded by this gene recognize two sites surrounding the donnor transposon sequence in the chromosome resulting in a DNA cleavage. The transposase can then insert the transposon in a new place of the genome.
Similarly to retrotransposon, if a gene was present between the two cleavage sites of the donnor transposon, it may move with the transposed sequence.
\subsection{Tandem Duplication}
\label{sec:org1185c12}
\subsection{Polyploidisation}
\label{sec:org349eaa4}
\subsubsection{Alloployploïdisation}
\label{sec:org323512f}
\subsubsection{Autopolyploïdisation}
\label{sec:orgba5b73e}
\subsubsection{Mechanisms}
\label{sec:orga1009de}
\paragraph{Polyspermy}
\label{sec:orgee32a5c}
\paragraph{Non-reduced gametes}
\label{sec:org3297de6}
\subsection{Unequal crossing-over}
\label{sec:org31e5f76}
A crossing-over may occur during cell division. A fragment of chromosome is exchanged between two chromatids of a pair of chromosome. If the cleavage of the two chromatids occured at different positions on both chromosomes, the shared fragments may have different lengths. When the repair of missing fragment is performed, the resulting chromosome will incorporate a duplicate region of the chromosome, leading to a potential duplication for genes present in this region, as represented in figure \ref{fig:gene-duplication-mechanisms} B. \# TODO: check that this is really the B subfigure
\section{Role in genome evolution}
\label{sec:orga7bdfd9}
\chapter{Identification of duplicate genes}
\label{sec:org3aec87b}
\textbf{*}
\section{Finder}
\label{sec:org9b93040}
\part{Objectives}
\label{sec:org1b30340}
\chapter{Amend the existing Galaxy pipeline}
\label{sec:orgf108b0f}
Last year, a M1 student, Seanna Charles, worked on the Galaxy's version of the gls: Finder pipeline \autocite{charlesFinalisationPipelineFTAG2023}.
During my internship, I will continue this work.
\chapter{Porting Finder pipeline on a workflow manager}
\label{sec:orgd5c8063}
\printbibliography
\end{document}


@ -1,4 +1,6 @@
\usepackage{scrhack}
% Font
\usepackage{fontspec}
\setmainfont{TeX Gyre Termes} % Times New Roman alternative
@ -61,16 +63,14 @@
% Force text on right side, float on left side
% \usepackage{sty/floatlefttextright}
\makeglossaries
\makeindex
\renewcommand\maketitle{\include{titlepage}}
% Abstract
\providecommand{\abstractname}{Abstract} % not in scrbook class
\newenvironment{abstract}[1]{%
\hrule
\small\textbf{\abstractname: }
\providecommand{\myabstractname}{Abstract} % not in scrbook class
\newenvironment{myabstract}[1]{%
\hrule
\vspace{0.25cm}
\small\textbf{\myabstractname: }
%\small\emph #1 % emph takes an argument
\small\emph{#1} % or \small\textit{#1}
\itshape % use this if you want the text to be in italics
@ -78,3 +78,10 @@
\newline\hrule
\vspace{0.6cm}
}
\hypersetup{
hidelinks
}
\renewcommand*{\mkbibnamefamily}[1]{\textsc{#1}}
\renewcommand*{\mkbibnameprefix}[1]{\textsc{#1}}


@ -13,7 +13,7 @@
\vfill
\Large Samuel ORTION \orcidlink{0009-0001-0971-497X}
\Large Samuel \textsc{Ortion} \orcidlink{0009-0001-0971-497X}
\vfill
@ -26,8 +26,8 @@
\normalsize
\begin{minipage}{15em}
\textbf{Advisors}:\\
Carène RIZZON \\
Franck SAMSON \\
Carène \textsc{Rizzon} \\
Franck \textsc{Samson} \\
Laboratoire de Mathématiques et Modélisation d'Évry \\
\href{mailto:carene.rizzon@univ-evry.fr}{carene.rizzon@univ-evry.fr} \\
\href{mailto:franck.samson@inrae.fr}{franck.samson@inrae.fr} \\