Add summary

Samuel Ortion 2024-04-19 17:56:27 +02:00
parent 447b03797c
commit 0860998ac0
Signed by: sortion
GPG Key ID: 9B02406F8C4FB765
2 changed files with 57 additions and 7 deletions


@ -19,7 +19,7 @@
| WGD | WGD | Whole Genome Duplication |
| MCL | MCL | Markov Clustering |
| BLAST | BLAST | Basic Local Alignment Search Tool |
| | | |
| GO | GO | Gene Ontology |
#+name: glossary
| label | name | description |
@ -153,7 +153,7 @@ Another possibility is that the two gene copies keep the ancestral function, res
#+begin_export latex
\fladdfig{
\includegraphics[width=.9\linewidth]{./figures/tag-definition.pdf}
\caption[Schematic representation of TAG definitions]{\label{fig:tag-definitions} Schematic representation of TAG definitions. Several genes are represented on a linear chromosome. The red box represent a singleton gene. Orange boxes represent a TAG with two duplicate genes seperated by 7 other genes ($\mathrm{TAG}_7$). Four green boxes constitute a TAG, the gene at the extremities are seperated by three genes ($\mathrm{TAG}_3$. The two blue boxes represents a TAG with two genes next to each other $\mathrm{TAG}_0$. The bended edges represents the homology links between each pair of genes of a TAG.}}
\caption[Schematic representation of TAG definitions]{\label{fig:tag-definitions} Schematic representation of TAG definitions. Several genes are represented on a linear chromosome. The red box represents a singleton gene. Orange boxes represent a TAG with two duplicate genes separated by 7 other genes ($\mathrm{TAG}_7$). Four green boxes constitute a TAG; the genes at the extremities are separated by three genes ($\mathrm{TAG}_3$). The two blue boxes represent a TAG with two genes next to each other ($\mathrm{TAG}_0$). The curved edges represent the homology links between each pair of genes within a TAG.}}
#+end_export
Different methods exist to detect duplicate genes. These methods depend on the type of duplicate genes they target and vary in computational burden as well as in ease of use (for a review, see [cite/t:@lallemandOverviewDuplicatedGene2020]).
@ -192,6 +192,7 @@ Duplicate genes are
** Extend the existing FTAG Finder Galaxy pipeline
Galaxy is a web-based platform for running accessible data analysis pipelines, first designed for use in genomics data analysis [cite:@goecksGalaxyComprehensiveApproach2010].
Last year, Séanna [[latex:textsc][Charles]] worked on the Galaxy version of the FTAG Finder pipeline during her M1 internship [cite:@charlesFinalisationPipelineFTAG2023]. I will continue this work.
FTAG Finder is currently deployed on the server of the /Laboratoire de Mathématiques et Modélisation d'Évry/[fn: [[http://stat.genopole.cnrs.fr/galaxy]] ].
** Port FTAG Finder pipeline to a workflow manager
Another objective of my internship will be to port FTAG Finder to a workflow manager better suited to larger and more reproducible analyses.
@ -200,11 +201,46 @@ We will have to make a choice for the tool we will use.
The two main options are Snakemake and Nextflow. Snakemake is a Python-powered workflow manager based on rules /à la/ GNU Make [cite:@kosterSnakemakeScalableBioinformatics2012]. Nextflow is a Groovy-powered workflow manager, which relies on the dataflow paradigm [cite:@ditommasoNextflowEnablesReproducible2017]. Both are widely used in the bioinformatics community. Their use has been on the rise since they came out in 2012 and 2013 respectively [cite:@djaffardjyDevelopingReusingBioinformatics2023].
#+begin_export latex
\flstop
\fladdtab{
\begin{tabular}{ccc}
\toprule
& List ref & List $L$ \\
\midrule
related to $go$ & $a$ & $b$ \\
unrelated & $c$ & $d$ \\
\bottomrule
\end{tabular}
\caption{\label{tab:fisher-test-contigency-table}Contingency table for a Fisher exact test on gene lists}
}
#+end_export
* Methodological approaches
Based on the output of the FTAG Finder pipeline, which consists of lists of genes, researchers could perform bespoke subsequent analyses on TAGs.
** Analysis of over-represented gene functions among TAGs
The gls:GO describes biological concepts across three main classes: Cellular Component, Molecular Function and Biological Process. It defines a controlled vocabulary of concepts and the relationships between them. Genes with known functions can be associated with particular GO terms. We can perform a GO enrichment analysis to assess whether a particular GO term is over-represented in a given gene list compared to another. To do so, we can use a Fisher exact test, with the FDR (False Discovery Rate) control procedure of [[latex:textsc][Benjamini]] and [[latex:textsc][Hochberg]] to account for the many GO terms tested.
Let $go$ be a GO term. We construct a contingency table based on the counts of genes associated with this GO term (or with one of its brother GO terms) for the reference gene list and for the list of interest (here, the list of genes in a TAG) (see cref:tab:fisher-test-contigency-table).
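As a minimal sketch of the test described above, the following Python code computes a one-sided Fisher exact p-value from the four counts $a$, $b$, $c$, $d$ of the contingency table, and adjusts a list of p-values with the Benjamini-Hochberg procedure. The function names are hypothetical. The sketch assumes that the reference list and the list $L$ do not overlap, and that over-representation in $L$ is the alternative of interest; neither point is fixed by the text above.

```python
from math import comb


def enrichment_p_value(a, b, c, d):
    """One-sided Fisher exact test for over-representation in list L.

    a, b: genes related to the GO term in the reference list and in list L.
    c, d: unrelated genes in the reference list and in list L.
    Returns P(X >= b) under the hypergeometric null distribution.
    """
    related_total = a + b      # genes related to the GO term
    unrelated_total = c + d    # genes unrelated to the GO term
    list_total = b + d         # size of list L
    upper = min(related_total, list_total)
    # Sum the hypergeometric probabilities of tables at least as extreme.
    numerator = sum(
        comb(related_total, k) * comb(unrelated_total, list_total - k)
        for k in range(b, upper + 1)
    )
    return numerator / comb(related_total + unrelated_total, list_total)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Illustrative counts only, not real data.
    print(enrichment_p_value(1, 3, 3, 1))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

In practice, one p-value would be computed per GO term, and the whole list of p-values passed to the adjustment function at once.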
** Are TAGs preferentially located in specific chromosome regions?
** Are there chromosomes enriched or depleted in TAGs?
** Do genes located next to each other in a TAG share the same orientation?
The relative orientation of two genes of a TAG falls into one of three cases: either both genes are on the same strand (\(\rightarrow \rightarrow\)), or they have a divergent orientation (\(\leftarrow \rightarrow\)), or a convergent one (\(\rightarrow \leftarrow\)). Graham conjectured that genes of a TAG that are close to each other would be more likely to share the same orientation, and this indeed seems to be the case [cite:@shojaRoadmapTandemlyArrayed2006].
# To test this, we can use a $\chi^2$ test of goodness of fit or a Student $t$-test.
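The three orientation cases can be illustrated with a short Python sketch. It classifies adjacent gene pairs from their strands and computes a $\chi^2$ goodness-of-fit statistic against expected proportions. The function names and the strand encoding (="+"= and ="-"=) are assumptions, and the null proportions used in the example (1/2 same, 1/4 divergent, 1/4 convergent, as obtained if strands were independent and equally likely) are only one possible choice, since the hypotheses are not written down yet.

```python
from collections import Counter
from math import exp


def classify_orientation(left_strand, right_strand):
    """Classify a pair of neighbouring genes, given in chromosome order."""
    if left_strand == right_strand:
        return "same"
    if left_strand == "-" and right_strand == "+":
        return "divergent"   # <- ->
    return "convergent"      # -> <-


def orientation_counts(strands):
    """Count the orientation classes of all adjacent pairs of genes."""
    return Counter(
        classify_orientation(left, right)
        for left, right in zip(strands, strands[1:])
    )


def chi2_goodness_of_fit(counts, expected_proportions):
    """Chi-square goodness-of-fit test restricted to three classes.

    With three classes there are 2 degrees of freedom, for which the
    chi-square survival function is exactly exp(-x / 2).
    """
    assert len(expected_proportions) == 3
    total = sum(counts.get(key, 0) for key in expected_proportions)
    statistic = sum(
        (counts.get(key, 0) - total * p) ** 2 / (total * p)
        for key, p in expected_proportions.items()
    )
    return statistic, exp(-statistic / 2)


if __name__ == "__main__":
    # Illustrative strands only, not real data.
    print(orientation_counts(["+", "+", "-", "+", "-", "-"]))
    null = {"same": 0.5, "divergent": 0.25, "convergent": 0.25}
    print(chi2_goodness_of_fit({"same": 60, "divergent": 20, "convergent": 20}, null))
```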
*** TODO write down the hypotheses
** How robust and accurate is the detection method?
[cite/t:@le-hoangEtudeTranscriptomiqueGenes2017] started analysing the impact of parameter choice on FTAG Finder output lists. A more detailed benchmark of FTAG Finder against other methods on known test datasets might be of particular interest.
#+begin_export latex
\flstop
#+end_export
* References
:PROPERTIES:
@ -226,6 +262,18 @@ The two main options being Snakemake and Nextflow. Snakemake is a python powered
:UNNUMBERED: t
:END:
Duplicate genes are an important feature of eukaryotic genomes. They contribute to genome plasticity, and hence to the capacity of species to evolve.
Several mechanisms may lead to gene duplication. Among them, unequal crossing-over leads to the formation of Tandemly Arrayed Genes (TAGs), which are homologous genes located one after the other on the same chromosome.
There are multiple methods for detecting duplicate genes from sequences. These methods vary in terms of the particular gene duplication mechanism they target, computational efficiency and ease of use.
FTAG Finder is a simple Galaxy pipeline aiming at the detection of families of duplicate genes and the identification of TAGs based on the proteome of a single species. FTAG Finder is developed in the /Laboratoire de Mathématiques et Modélisation d'Évry/, where I will do my internship.
On the one hand, the aim of my internship is to extend the current Galaxy implementation of FTAG Finder with new export lists best suited to the analysis requirements of the laboratory. On the other hand, the objective of my internship will be to port the Galaxy pipeline to another scientific workflow manager better suited to reproducible analyses, such as Snakemake or Nextflow.
Then, the updated version of the FTAG Finder pipeline will be used to perform an analysis on the TAGs of a model species, to assess its proper behavior. A benchmark of the pipeline will probably be run to compare FTAG Finder with alternative published methods targeting duplicate genes, and TAGs in particular.
* Bean :noexport:
** MCL
MCL uses two operations on a stochastic matrix representation $M$ of the graph, first derived from the adjacency matrix, namely /expansion/ and /inflation/. Expansion consists in raising the matrix to a power using the ordinary matrix product. Inflation raises each entry of the matrix to a power $r$ and subsequently rescales the columns so that they sum to 1 again. The image of the inflation operator $\Gamma_r$ is defined as
@ -234,7 +282,7 @@ MCL uses two operations on a stochastic matrix representation $M$ of the graph f
\]
where $m$ is the number of rows in the matrix, and $M_{pq}$ is the value in the $(p, q)$ cell of the matrix $M$.
This operator strengthens the edges with higher weights and tend to anihilate edges with lower flow.
This operator strengthens the edges with higher weights and tends to annihilate edges with lower flow.
Applying both operators iteratively eventually partitions the initial graph into clusters of closely connected nodes (corresponding, in our case, to gene families).
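The iteration can be sketched in a few lines of Python. This sketch follows the usual formulation of MCL, in which expansion is the matrix product of $M$ with itself and inflation is the entry-wise power $r$ followed by column rescaling. The function names, the added self-loops and the fixed number of iterations are assumptions made for illustration; this is not the reference MCL implementation.

```python
def normalize_columns(matrix):
    """Rescale each column so that it sums to 1."""
    size = len(matrix)
    sums = [sum(matrix[p][q] for p in range(size)) for q in range(size)]
    return [[matrix[p][q] / sums[q] for q in range(size)] for p in range(size)]


def expand(matrix):
    """Expansion: ordinary matrix product of the matrix with itself."""
    size = len(matrix)
    return [
        [sum(matrix[p][k] * matrix[k][q] for k in range(size)) for q in range(size)]
        for p in range(size)
    ]


def inflate(matrix, r):
    """Inflation: entry-wise power r, then column rescaling."""
    return normalize_columns([[value ** r for value in row] for row in matrix])


def mcl(adjacency, r=2.0, iterations=20):
    """Alternate expansion and inflation for a fixed number of iterations."""
    size = len(adjacency)
    # Self-loops are added before normalization (an assumption of this sketch).
    with_loops = [
        [adjacency[p][q] + (1.0 if p == q else 0.0) for q in range(size)]
        for p in range(size)
    ]
    matrix = normalize_columns(with_loops)
    for _ in range(iterations):
        matrix = inflate(expand(matrix), r)
    return matrix


if __name__ == "__main__":
    # Toy graph with two disconnected edges: 0-1 and 2-3.
    graph = [
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
    for row in mcl(graph):
        print(row)
```

Clusters can then be read from the non-zero entries of the resulting matrix: nodes that share flow end up in the same cluster.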
** Walktrap
@ -251,8 +299,10 @@ Principle: construct vertex communities based on where an agent would get stuck
: Loaded ./setup.el
#+begin_example
# LocalWords: speciation subfunctionalization neofunctionalization
# LocalWords: pseudogenization bioinformatics
# Local Variables:
# eval: (progn (org-babel-goto-named-src-block "startup") (org-babel-execute-src-block) (outline-hide-sublevels 1))
# End:
#+end_example

BIN
report.pdf (Stored with Git LFS)

Binary file not shown.