comparative-genomics-project/docs/report/report.tex

\documentclass{scrreprt}
\usepackage{sty/style}
\title{Tandemly arrayed genes in \textit{Glycine max}}
\subtitle{Comparative genomics project}
\titlehead{M2 GENIOMHE}
\author{Naïa Périnelle \and Samuel Ortion}
\date{2024-12-21}

\makeglossaries
\makeindex
\addbibresource{references.bib}
\begin{document}
\maketitle

\tableofcontents
\clearpage

\begin{relaxclearpage}
\listoffigures
\listoftables
\end{relaxclearpage}

\chapter{Introduction}

The objective of our analysis is to determine the amount of duplicate genes in a plant species. 
We chose to focus more specifically on soybean (\textit{Glycine max}).
We aimed to determine the amount of tandemly arrayed genes.
The main questions that marked our analysis progress were:
Is there a plant clade that is more loaded in duplicated genes?
Is the orientation of TAG gene pairs random?
Is there a lot of duplicated operons in plants?
Is TAG gene duplication more recent than non-TAG duplicates?

\section{Soybean}

Soybean (\textit{Glycine max}) is a native East Asian legume that is widely used in agriculture, mainly for cattle feed or human alimentation. 
Soybean has been domesticated from \textit{Glycine soja} multiple times in China \autocite{stuparInsightsSoybeanGlycine2013}. Documents dated from 1500 to 1100 BC relates its use.

Pictures of the general aspect of \textit{Glycine max}, its flowers, leaves and beans are reproduced in \cref{fig:soy-plant} and \cref{fig:soy-bean-varieties}.

Soybean karyotype comprises 20 chromosomes. Its genome is about 1~Gbp long. Soybean proteome counts 55,897 protein-coding genes.
General statistics on \textit{Glycine max} genome are summarized in \cref{tab:glycine-max-genome-statistics}.
% 
\begin{figure}
    \centering
    \begin{subfigure}[b]{0.3\columnwidth}
    \centering
        \includegraphics[height=3.5cm]{media/Glycine_max_plant1_Carol_Rose_(10220578213).jpg}
        \caption{}
        \label{fig:soy-plant-global}
    \end{subfigure}
    \begin{subfigure}[b]{0.3\columnwidth}
    \centering
        \includegraphics[height=3.5cm]{media/Soybean_flowers.png}
        \caption{}
        \label{fig:soy-plant-flower}
    \end{subfigure}
    \begin{subfigure}[b]{0.3\columnwidth}
    \centering
        \includegraphics[height=3.5cm]{media/Plante_de_Soja_-_Feuilles_et_fruits.jpg}
        \caption{}
        \label{fig:soy-plant-leaves}
    \end{subfigure}
    \caption[Soy plant pictures]{Soy plant pictures. (a) General aspect of the plant (Harry Rose, \href{https://creativecommons.org/licenses/by/2.0}{CC BY 2.0}, via Wikimedia Commons). (b) Flowers (Huwmanbeing, Public Domain via Wikimedia Commons). (c) Leaves and fruits (Pancrat, \href{https://creativecommons.org/licenses/by-sa/3.0}{CC BY-SA 3.0}, via Wikimedia Commons).}
    \label{fig:soy-plant}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{media/Soybeanvarieties.jpeg}
    \caption{Soy bean varieties}
    \label{fig:soy-bean-varieties}
\end{figure}

\section{Duplicate genes and gene families}

During species evolution, different mechanisms may lead to a duplication of a gene.
Polyploidization may occur thanks to an abnormal meiosis, and lead to the duplication of the whole genome (Whole Genome Duplication, WGD) or a chromosome (as in the case of aneuploidy).
An uneven crossing-over may duplicate a segment of a chromosome that may potentially lead to a segment duplication, or tandemly arrayed genes.

Transposable elements may well introduce new copies of genes.
DNA transposon may incorporate a gene sequence with its transposase enzyme at a new place in the genome. The retrotransposase enzyme of retrotransposons may retro-transcribe a mRNA of a gene into its cDNA that may be included right back into the genome, resulting in a copy of the transcribed gene with its introns lost and with a poly-A tail appended.

\section{The fate of duplicate genes}

Once a gene is duplicated, one of the two copies may be lost in a process known as pseudogenization. They may both keep the same function, resulting in a functional redundancy. Otherwise, both of them may specialize in a subpart of the original function, being subject to subfunctionalization. Finally, one of the duplicate genes may acquire a new function, a process termed neofunctionalization.

\section{Tandemly arrayed genes}

Tandemly Arrayed Genes (TAGs) are duplicated genes located close to each other on the same chromosome.
Closeness on a chromosome is measured with the number of genes that do not belong to the duplicate gene family and that are located between two members of the TAG. Those genes are called ``spacers''. We denote by TAG\textsubscript{$d$} the TAG definition allowing at most $d$ spacers between TAG genes.

\section{Synonymous mutation rate as a proxy of duplication age}

Codons are three nucleotides units encoding one amino-acid. There exist $4^3 = 64$ codons, and only 20 proteinogenic amino-acid. Thus, some amino-acid may be encoded by multiple different codons.

A mutation in a coding sequence may not be involved in any change in the peptide sequence: This is what we call a synonymous mutation. The ratio of the number of synonymous mutations over synonymous sites ($K_s$) can be used as a proxy for the age of the duplication under the assumption that the rate of mutation remains steady and is similar in every gene.

\chapter{Material and Methods}

\section{Data sources}

For our species \textit{Glycine max}, we used the following input data files from the \href{https://plants.ensembl.org/Glycine_max/Info/Index}{Ensembl Plant portal}:
\begin{itemize}
    \item \verb|Glycine_max.Glycine_max_v2.1.pep.all.fa.gz| -- the proteome of \textit{Glycine max} in compressed FASTA format
    \item \verb|Glycine_max.Glycine_max_v2.1.60.chr.gff3.gz| -- the genome features table for \textit{Glycine max} in compressed GFF3 format
    \item \verb|Glycine_max.Glycine_max_v2.1.cds.all.fa.gz| -- the coding sequences of \textit{Glycine max} genes in compressed FASTA format
\end{itemize}
In addition, we used the following input data files from eCampus:
\begin{itemize}
    \item \verb|Glycine_max_list| -- the list of protein IDs and their respective gene ID
    \item \verb|Glycine_max_Blastp_longIsoforme| -- the results of the BLASTp alignment already performed.
\end{itemize}

For our extended analysis on the other proposed plants (cf. \cref{tab:count-table-for-all-plants}), we used the same data sources.

\section{Identification of Duplicated Gene Families and Detection of Tandemly Arrayed Genes}

\paragraph{Mitochondria and Chloroplasts} We do not want mitochondrial nor chloroplast genes to appear in our dataset. In our case, none of the organelles were sequenced in \textit{Glycine max}, so we did not have to filter out these genes. We did however systematically filter these genes in our analyses on the whole set of plants.

\paragraph{Supercontigs} In the Ensembl Plant proteome file, the denomination ``supercontig'' corresponds to a contiguous sequence that the assembly could not locate in the right position on the genome. We filter out proteins whose sequence is located in a supercontig.

\paragraph{Isoform selection} We kept only the longest protein isoform of each gene.

\paragraph{BLASTp all-against-all} To estimate the homology links between each pair of genes, we ran a BLASTp \autocite{altschulBasicLocalAlignment1990} all-against-all on the filtered proteome.

\paragraph{Filter BLASTp hits} Resulting BLASTp hits were filtered based on the coverage of the local alignment on both query and subject sequences and identity percentage for two different datasets whose threshold values are reported in \cref{tab:dataset-threshold-values}.

\begin{table}
\centering
\begin{tabular}{ccc}
    \toprule
     & \bfseries Low stringency dataset & \bfseries High stringency dataset  \\
    \midrule
     \bfseries Coverage & >30\% & >40\% \\
     \bfseries Identity & >30\% & >50\% \\
    \bottomrule
\end{tabular}
\caption{\label{tab:dataset-threshold-values} Threshold values for the high and low stringency datasets}
\end{table}

\paragraph{Clustering of the homology graph} From the BLASTp file, we extracted the inferred homology links between pairs of genes with the bitscore value of the local alignment. We kept the highest bitscore value in case there were several alignments for the same pair of genes. We clustered the homology graph with the Markov Clustering algorithm (\texttt{mcl} 22-28) to extract communities of genes that we consider being families of duplicate genes \autocite{vandongenUsingMCLExtract2012a}.

\paragraph{Detection of TAGs} A homemade Rust program has been written in order to associate each gene to its TAG status: whether it belongs to a TAG, and which one, or if it does not belong to a TAG. This program accepts as input a list of TAG definition numbers. Another version of the code was developed in python. Comparing the results of the two programs allowed to correct errors in both until arriving at the correct result.

\paragraph{Workflow automation with Nextflow} The workflow has been ported on Nextflow to be run on the thirteen plant genomes proposed more easily. Each plant genome analysis takes about a minute to run, from the given BLASTP format 7 file to the summary statistics.

\section{Testing orientation concordance between TAG genes and non-TAG genes}

Based on the TAG data table, we generated all pairs of genes not separated by any spacer (definition TAG\textsubscript{0}). We associated each pair to whether they belong to the same TAG and to its orientation concordance status. If both genes are oriented in the same direction, their orientation is said to be \textit{coherent} ($\rightarrow \rightarrow$ or $\leftarrow\leftarrow$). If the genes head in opposite directions, they are said to be \textit{divergent} ($\leftarrow \rightarrow$). If they head towards each other ($\rightarrow \leftarrow$), their orientation is \textit{convergent}.

We run a Fisher exact test on this contingency table.
The tested hypotheses are
\[
\begin{cases}
(H_0) & \text{the gene pair orientation concordance is not associated with the TAG status} \\
(H_1) & \text{the gene pair orientation concordance is associated with the TAG status}
\end{cases}
\]

Given a $2 \times 3$ contingency table like \cref{tab:contingency-table}, the Fisher exact test $p$-value is computed with the following formula
\begin{equation}
\label{eqn:fisher-exact-test-2x3}
p = \frac{\binom{a + b}{a}\binom{c + d}{c}\binom{e + f}{e}}{\binom{n}{a + c + e}},
\end{equation}
where $n$ is the total count of gene pairs. We used the R function \texttt{fisher.test} to run this test.

\begin{table}[H]
    \centering
\begin{tblr}{
	hline{1,2}={3-Z}{solid}, hline{3-Z}={solid},
    vline{1,2}={3-Z}{solid}, vline{3-Z}={solid},
	cell{1}{3} = {c = 3}{halign = c},
	cell{3}{1} = {r = 2}{valign = m},
}
 &   &  Category 1   &    \\
 & & Group 1 & Group 2 & Group 3 \\
Category 2 & Group 1 & $a$ & $b$ & $c$ \\
           & Group 2 & $d$ & $e$ & $f$ 
\end{tblr}
    \caption{$2 \times 3$ contingency table}
    \label{tab:contingency-table}
\end{table}

\section{Testing age difference between TAG pairs and non-TAG duplicates}

Given the list of duplicate genes associated with their family identifiers, we built a list of duplicate gene pairs.
This list contains 1,029,762 pairs.
Then, we computed the $K_a$ and $K_s$ values using \texttt{PAML} v4.10.7.
To do so, we wrote a simple bash script that (i) extracts the protein sequence of each of the gene pair members, (ii) extracts the coding sequence of these proteins, (iii) aligns the proteins using \texttt{Clustalw2} v2.1, (iv) use Pal2Nal v14 to obtain the corresponding nucleotide alignment, and finally (v) use PAML \texttt{yn00} tool using the Yang and Nielsen model to compute $K_a$ and $K_s$ values \autocite{yangEstimatingSynonymousNonsynonymous2000a}.

$K_s$ values above 5 have been filtered out, as the duplication events associated with such a high $K_s$ value are too ancient for the $K_s$ value to be reliable.
We did not filter the $K_s$ values based on the standard deviation.

To test whether the duplication age of TAG gene is more recent than the duplication age of non-TAG genes, we performed a Wilcoxon-Mann-Whitney U test.
For randomly selected values $X$ and $Y$ from `non-TAG' and `TAG' respectively, the tested hypotheses are
\[
\begin{cases}
(H_0) & \parbox[t]{.6\textwidth}{the probability of $X$ to be greater than $Y$ is equal to the probability of $Y$ being greater than $X$} \\
(H_1) & \parbox[t]{.6\textwidth}{the probability of $X$ to be greater than $Y$ is greater than the probability of $Y$ being greater than $X$}
\end{cases}
\]
To do so, we used the \texttt{wilcox.test} R function with the ``greater'' alternative.

\section{Identification of big TAG and functional analysis}

\paragraph{PANTHERdb annotation}

We counted the number of duplicated genes within each TAG to analyze the distribution of genes across TAGs and identify the TAG with the highest number of duplicated genes. We then retrieved the list of gene IDs contained in the largest TAG according to the definition ``1 spacer". To perform a brief functional analysis of the genes in this largest TAG\textsubscript{1}, we used the online \href{https://pantherdb.org/}{PANTHERdb} (Protein ANalysis THrough Evolutionary Relationships database) \autocite{thomasPANTHERMakingGenomescale2022}. We uploaded our gene lists and selected \textit{Glycine max} as the organism. We ran a statistical over-representation test using PANTHER GO-Slim Biological Process as the annotation set. The analysis was conducted with default parameters.

We looked at other annotation data set, and even changed the test type and the correction (respectively \texttt{Ficher exact} and \texttt{Calculation false discovery rate} initially, then \texttt{Binomial} and/or \texttt{Bonferroni correction for multiple testing}). But results remain always the same.

\paragraph{g:Profiler annotation} Using the same set of TAG gene identifiers, we ran a functional annotation with g:Profiler R package \autocite{reimandProfilerWebServer2016}. 
\chapter{Results}

\Cref{tab:statistics-results-glycine-max} reports the counts we obtained during our analysis. \Cref{tab:count-table-for-all-plants} reports our duplicate genes and TAGs counts for the thirteen plants selection.
In \cref{fig:family-sizes-distributions}, the distribution of the sizes of the duplicate genes families are represented for both dataset. As expected, this distribution follows an exponential degrowth.

\begin{table}
    \centering
    \begin{tabular}{ll}
    \toprule
         \bfseries Variable & \bfseries Value \\
     \midrule
         Chromosomes & 20 \\
        Genome length & 978,491,270~bp \\
        Protein coding genes (with supercontigs) & 55,897 \\
        Protein coding genes (without supercontigs) & 55,589 \\
        Protein isoforms & 88,412 \\
    \bottomrule
    \end{tabular}
    \caption{General statistics on \textit{Glycine max} genome}
    \label{tab:glycine-max-genome-statistics}
\end{table}

\begin{figure}
    \centering
    \begin{subfigure}[t]{0.45\columnwidth}
    \centering
    \includegraphics[width=\columnwidth]{results/Glycine_max_family_size_hist_coverage30_identity30.pdf}
    \end{subfigure}
    \begin{subfigure}[t]{0.45\columnwidth}
    \centering
    \includegraphics[width=\columnwidth]{results/Glycine_max_family_size_hist_coverage40_identity50.pdf}
    \end{subfigure}
    \caption{Gene family sizes distributions for low and high stringency dataset in \textit{Glycine max}}
    \label{fig:family-sizes-distributions}
\end{figure}

\begin{table}
    \centering
        \begin{tabular}{lll}
        \toprule
        \bfseries Dataset stringency & \bfseries low & \bfseries high \\
        \midrule
        Families & 8,426 & 11,997 \\
        Duplicate genes & 50,254 (89.9\%) & 46,769 (83.7\%) \\ 
        Singletons & 5,643 (10.1\%) & 9,128 (16.3\%) \\
        TAG\textsubscript{0} & 2,620 & 2,157 \\
        TAG\textsubscript{1} & 2,916 & 2,438 \\
        Genes in TAG\textsubscript{0} & 6,652 (13\%) & 5,335 (12\%) \\
        Genes in TAG\textsubscript{1} & 7,857 (16\%) & 6,420 (14\%) \\
        Genes in biggest TAG\textsubscript{0} & 32 (4.8\%) & 32 (6.0\%) \\
        Genes in biggest TAG\textsubscript{1} & 43 (5.5\%) & 43 (6.7\%) \\
        \bottomrule
    \end{tabular}

    \caption{Number of families, duplicated genes, singletons, TAGs, duplicated genes belongs to a TAG, and max size of a TAG obtained in \textit{Glycine max} for low and high stringency datasets. The percentages of genes in TAG\textsubscript{d} are calculated from the number of duplicated genes, not the total genes. Those for the largest TAG are calculated from the number of duplicated genes belongs to a TAG}
    \label{tab:statistics-results-glycine-max}
\end{table}

\begin{table}[]
    \centering
\begin{adjustbox}{angle=90}
    \csvreader[
head to column names,
tabular = lllllllllll,
table head = \toprule  \bfseries Species & \bfseries Genes & \bfseries Duplicates & \bfseries Singletons & \bfseries Families & \bfseries \makecell{Largest\\Family} & \bfseries TAG\textsubscript{0} & \bfseries TAG\textsubscript{1} & \bfseries \makecell{TAG\textsubscript{0}\\Genes} & \bfseries \makecell{TAG\textsubscript{1}\\Genes} & \bfseries \makecell{Largest\\TAG\textsubscript{0}} \\\midrule,
table foot = \bottomrule,
]{data/concat.csv}{}{%
\slshape\Species & \num[group-separator={,}]{\Genes} & \num[group-separator={,}]{\Duplicates} & \num[group-separator={,}]{\Singletons} & \num[group-separator={,}]{\Families} & \num[group-separator={,}]{\LargestFamily} & \num[group-separator={,}]{\csvcolvii} & \num[group-separator={,}]{\csvcolviii} & \num[group-separator={,}]{\csvcolix} & \num[group-separator={,}]{\csvcolx} & \num[group-separator={,}]{\csvcolxi} 
}
\end{adjustbox}
    \caption{Count tables for duplicate genes in a selection of thirteen plants, using the low stringency filtering criteria.}
    \label{tab:count-table-for-all-plants}
\end{table}

\section{TAG\textsubscript{0} gene pairs orientation concordance is different from the one of other gene pairs}

\Cref{tab:contingency-table-fisher-test-orientation-convergence} reports the count of TAG and not TAG gene pairs orientation concordance.
The Fisher exact test on this contingency table reports a $p$-value below $2.2\cdot 10^{-16}$, so at level $\alpha = 0.05$, we reject the null hypothesis: the orientations of a pair of genes in TAG is significantly different from that of non-TAG genes. More specifically, TAG gene are more coherent than non-TAG genes.

\begin{table}
    \centering
    \begin{tabular}{llll}
         \toprule
         & \bfseries Coherent & \bfseries Convergent & \bfseries Divergent \\
         \midrule 
 \bfseries TAG    &     3,353 &        342 &       361 \\
 \bfseries non-TAG &    25,699 &      12,926 &     12,907 \\
 \bottomrule
    \end{tabular}
    \caption{Contingency table, count of gene pair orientation concordance for TAG pairs or not TAG pairs}
    \label{tab:contingency-table-fisher-test-orientation-convergence}
\end{table}

\section{TAG\textsubscript{0} gene pair duplication is more recent than non-TAG duplication}

In \cref{fig:ks-density-tag-and-not-tag}, the distribution of duplication age, as seen through the proxy of substitutions per synonymous site ($K_s$), is depicted. There are a large peak in the distribution of $K_s$ (mode $K_s \approx 0.1$). This corresponds to a secondary burst of gene duplication in the species evolution. This is consistent with what has been found by Blanc and Wolfe, 2004 \autocite{blancWidespreadPaleopolyploidyModel2004}. 

The Wilcoxon-Mann-Whitney U test statistic is $W = 116946624$, $p$-value $< 2.2\cdot 10^{-16} < 0.05$, so at level $\alpha = 5~\%$, we reject the null hypothesis. The TAG gene pair duplication that we still observe in the remaining genes tends to have a lower $K_s$ value than non-TAG duplicate pairs, which means that TAG pair duplication tends to have occurred more recently than non-TAG pairs.

\begin{figure}
    \centering
    \includegraphics{results/ks_density_tag_and_not_tag.png}
    \caption{Distribution of substitutions per synonymous site ($K_s$) for \textit{Glycine max} duplicated gene pairs. The duplicate genes are splitted in two categories based on whether they belong to a TAG\textsubscript{0} (in blue) or not (in red).}
    \label{fig:ks-density-tag-and-not-tag}
\end{figure}

\section{The number of TAG increases logarithmically with the number of allowed spacers}

\Cref{fig:number-of-tag-in-function-of-tag-definition} depicts the number of TAG detected in the low stringency dataset with varying number of spacers.
We obtain a logarithmic growth of the number of TAGs as we increase the number of spacers.

\begin{figure}[H]
    \centering
    \includegraphics[width=0.5\linewidth]{results/nb_TAG_against_definition.png}
    \caption{Number of TAG in function of the TAG definition}
    \label{fig:number-of-tag-in-function-of-tag-definition}
\end{figure}

\section{Functional analysis}

One the one hand, with a reference list of 55,853 genes, only 27 of our 43 gene IDs map uniquely to the PANTHER database. However, these 27 genes are "Unclassified", like the majority of genes in the reference. 17 gene IDs mapped to multiple entries in the database. This result is the same for the two largest TAG of definition 1 for low and high stringency.

On the other hand, with g:Profiler, the largest TAG's genes present an enrichment in function related to cellular respiration and energy metabolism.

\chapter{Discussion}

\section{Amount of paralogs}

We found a number of paralogs quite different from previous studies performed on \textit{Glycine max} \autocite{blancWidespreadPaleopolyploidyModel2004}. They found a proportion of 32\% of genes being duplicated, whereas we estimate this proportion to be more than 80\%.  This may be explained by the different criteria used to define a homology between two genes. Indeed, in Blanc and Wolfe approach, they defined duplicate genes based on a nucleic sequence alignment, which is far more stringent than our protein alignment approach.

\section{Orientation of TAG genes}

As previously shown for \textit{Arabidopsis thaliana} \autocite{le-hoangEtudeTranscriptomiqueGenes2017}, the orientation of tandemly arrayed genes is not random: two genes belonging to the same gene family and separated by no spacer are more likely to share the same orientation.
However, this result has to be mitigated, because in TAG category there are only pairs of the same family, whereas in non-TAG any pair of genes is considered. Thus, we could well be measuring an effect of gene families instead of TAG ownership. In spite of this remark, we may interpret the orientation concordance as an effect of a gene regulation constraint on duplicate genes with no spacer, sharing a similar genetic environment -- enhancers, silencers: \textit{cis} regulatory regions -- might be advantageous to keep a coherent expression pattern for the duplicate. This could be important to keep the same stoichiometry in protein complexes for instance.

% TODO : discuss all results

% TODO : discuss more specificaly Ks results

\section{Age of gene duplication}

Our analysis of the age of genes duplication in soybean has shown that TAG gene duplicates found in the current soybean sequenced variety are more recent than non-TAG gene duplicates. The distribution of synonymous mutations per synonymous sites showed the presence of two more ancient bursts of gene duplication. These bursts of duplication could be due to ancient whole genome duplication events. It has indeed been shown that several whole genome duplication events occurred in the history of \textit{Glycine max} genome evolution \autocite{stuparInsightsSoybeanGlycine2013}.

% TODO : compare results to literature

% TODO : Compare to other plants
\section{Comparison with other plants}

In all the plants studied, it seems that the larger the genome (i.e., the more genes it contains), the greater the proportion of duplicated genes (and conversely, the smaller the proportion of singletons), although there are exceptions (such as \textit{Oryza sativa}). Our organism, \textit{Glycine max}, has a much larger number of genes in its genome (55,897 genes), while the other genomes have between 22,000 and 40,000 genes; in terms of genome size, \textit{Glycine max} is the most distant from the others. Its proportion of duplicated genes is the highest, around 90\%, while most other plants have between 72\% and 86\% duplicated genes. In terms of the proportion of duplicated genes, \textit{Glycine max} has indeed more duplicate genes than the other plants, but the proportions remain similar, whereas \textit{Oryza sativa} shows a much lower proportion of duplicated genes (65\%), despite being in the upper range for genome size (about 35,000 genes). The number of families is also higher when there are more genes. Thus, \textit{Glycine max} shows a significantly higher number of families. The same applies to the number of TAGs, for different definitions (comparison made for TAG\textsubscript{0} and TAG\textsubscript{1}). However, the size of the largest family in our organism is among the smallest when compared to the other organisms. Additionally, in terms of the proportion of duplicated genes associated with a TAG, \textit{Glycine max} shows lower proportions compared to the other genomes, with 13\% and 16\%, whereas the other genomes average between 14\%-21\% and 18\%-24\% (values for 7 of the other genomes). \textit{Prunus persica} exhibits the highest proportion of duplicated genes in TAGs, yet has one of the smallest numbers of families, duplicated genes, and total genes in the genome. To summarize, it seems that \textit{Glycine max} has a large genome, highly redundant (although normal compared to other plants), but with duplicated genes that are very dispersed.

\printbibliography

\end{document}