23 KiB

Raw Permalink Blame History

Comparative Genomics Project

General infos
- Objectives
  - Part 1
  - Part 2
- Our species
Download the source data
Develop the Nextflow workflow
From the given BLASTp format 6 file
Amount of Tandemly Arrayed Genes
Complete slides
Family sizes
$K_s$ For all duplicated pairs within a family
- Use a Nextflow workflow
Does Glycine max have big TAGs and which function are the big TAG gene implied?
Are TAG pairs different in age from non-TAG pairs?
Are genes inside a TAG orientated in the same way more often than random?

General infos

<2024-10-25 Fri>

Objectives

Part 1

Determine the amount of duplicate genes in my genome.
- What is the proportion of duplicate genes in my genome.
- Give an interval based on two datasets (low stringency dataset and high stringency dataset).
  
  Low High
  
  Coverage 30% 40%
  
  %identity 30% 50%
  
  Bitscore no filtering no filtering
  
  Clustering mcl (with default parameters) mcl (with default parameters)
```
min_coverage=30
min_identity=30
```
```
min_coverage=40
min_identity=50
```
Determine the amount of Tandemly Arrayed Genes (TAGs) in the genome (definition 0 and 1).
Determine the amount in plant's chloroplasts and mitochondria.
Is there a lot of duplicated operons in plants?

	Low	High
Coverage	30%	40%
%identity	30%	50%
Bitscore	no filtering	no filtering
Clustering	mcl (with default parameters)	mcl (with default parameters)

Part 2

Compute $K_s$ values and use it as a proxy of the age of the duplication.
Are TAGs younger than other duplicate genes?

Our species

Glycine max is the soy plant.

Download the source data

https://plants.ensembl.org/Glycine_max/Info/Index

https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/glycine_max/pep/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz

Develop the Nextflow workflow

Filtering the proteome to keep the longest isoform

I used three awk families_and_TAGs/scripts to filter the FASTA file to keep only the longest isoforms:

protein_lengths.awk associate each FASTA record with the length of its sequence,
filter_longest.awk keep the isoform ID that has the maximum length,
filter_records.awk based on the list of isoform ID, retrieve the sequence of from the original file.

protein_lengths.awk

filter_longest.awk

filter_records.awk

How many protein coding gene are there?

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
    | grep -v 'supercontig' \
    | awk ' /^>/ { print $4 }' | sort | uniq | wc -l

Filtering out Chloroplasts and Mitochondria proteins

From the Ensembl Plants index page we can see that there is no chloroplast nor mitochondria proteins in the sequenced genome of Glycine max.

Filter out supercontigs

<2024-10-28 Mon>

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz \
    | awk -f ./workflow/families_and_TAGs/scripts/remove_supercontigs.awk \
    > ./tmp/Glycine_max.Glycine_max_v2.1.pep.all.nosupercontig.fa

Sanity check: Has the script removed all supercontigs proteins and only them?

How many records in supercontigs were there?

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz  \
    | awk -F' ' ' /^>/ { print $3}'  | grep -c "^supercontig"

How many records have been removed?

zgrep -c "^>" ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz

grep -c "^>" ./tmp/Glycine_max.Glycine_max_v2.1.pep.all.nosupercontig.fa

expr $ALL_ISOFORMS - $NO_SUPERCONTIGS

BLASTp All Against All

makeblastdb -in "$proteome" -dbtype prot -out 'tmp/blastdb' -title "Glycine max"

blastp -query "$proteome" -db 'tmp/blastdb' -outfmt 6 -out "tmp/Glycine_max.blastp.tsv"

Clustering

Using mcl.

Extract the homology in ABC format

awk 'BEGIN { OFS="\t" } { print \$14, \$16, \$12 }' "${blastp}" > 'graph.abc'

Run mcl with default parameters

mcl 'graph.abc' --abc -o 'custering.mcl'

Convert mcl output to TSV with two columns gene id, family id.

awk -f workflow/families_and_TAGs/scripts/mcl_to_tsv.awk 'clustering.mcl' > 'families.tsv'

Running the Nextflow workflow

Update the parameters in nextflow.config and run

nextflow run main.nf -profile conda -resume

from the ./workflow/ folder

From the given BLASTp format 6 file

<2024-11-02 Sat> How many BLASTp HSP have been reported?

cat ../data/Glycine_max_Blastp_longIsoforme | wc -l

Remove HSP on proteins found on supercontigs

Extract the isoform ID whose gene is on supercontigs.

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz| awk -F' ' '/^>/ && $3 ~ /^supercontig/ { sequence_id = $1; gsub(">", "", sequence_id); print sequence_id }' > ./tmp/protein_on_supercontigs.list

cat ./tmp/protein_on_supercontigs.list | wc -l

Remove HSP with subject or query being on a supercontig

filter_blastp_sequence_id.awk

awk -f workflow/families_and_TAGs/scripts/filter_blastp_sequence_id.awk ./tmp/protein_on_supercontigs.list ../data/Glycine_max_Blastp_longIsoforme > ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig

How many HSP have been removed?

length() {
    cat $1 | wc -l
 }
expr $(length ../data/Glycine_max_Blastp_longIsoforme) - $(length ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig)

cat ./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig | wc -l

Filter by coverage and identity percentage

Add protein length and gene name columns

Extract the protein length and gene id from the FASTA file

zcat ../data/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz | awk -f ./workflow/families_and_TAGs/scripts/protein_lengths.awk > ./tmp/Glycine_max_protein_length_gene.tsv

Join the BLASTp HSP file with the gene id and gene lengths

protein_length=./tmp/Glycine_max_protein_length_gene.tsv
blastp=./tmp/Glycine_max_Blastp_longIsoforme_nosupercontig
sort -k 1 "${blastp}" > tmp/blastp_s
sort -k 1 "${protein_length}" > tmp/protein_length_s
join -1 1 -2 1 -t $'\t' tmp/blastp_s tmp/protein_length_s > tmp/join1.tsv
sort -k 2 tmp/join1.tsv > tmp/join1.tsv_s
join -1 2 -2 1 -t $'\t' tmp/join1.tsv_s tmp/protein_length_s > tmp/joined.blastp.tsv

Filter the BLASTp according to coverage and identity percentage

blastp=./tmp/joined.blastp.tsv
awk -f "workflow/families_and_TAGs/scripts/filter_blastp.awk" \
        -v min_coverage="${min_coverage}" \
        -v min_identity="${min_identity}" \
        "${blastp}" > "tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.tsv"

First dataset

<<low-stringency>>
<<filter-blastp>>

<<high-stringency>>
<<filter-blastp>>

cat ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.tsv | wc -l

cat ./tmp/Glycine_max_Blastp_filtered_coverage40_identity50.tsv | wc -l

Clustering

<2024-11-03 Sun>

Extract the homology graph

blastp="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.tsv"
abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.abc"
awk 'BEGIN { FS = OFS = "\t" } { print $14, $16, $12 }' "$blastp" > "$abc"

<<low-stringency>>
<<extract-abc>>

<<high-stringency>>
<<extract-abc>>

head -n 1 "tmp/Glycine_max_Blastp_filtered_coverage30_identity30.abc"

GLYMA_10G098500

GLYMA_U008600

154

Filter the homology graph to keep only the heaviest weight

original_abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.abc"
filtered_abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}_f.abc"
awk -f "workflow/families_and_TAGs/scripts/keep_heaviest_edge_abc.awk" "$original_abc" > "$filtered_abc"

<<low-stringency>>
<<filter-abc>>

<<high-stringency>>
<<filter-abc>>

Run Markov Clustering

abc="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}_f.abc"
mcl="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl"
mcl "${abc}" --abc --discard-loops=yes -o "$mcl"

<<low-stringency>>
<<mcl>>

<<high-stringency>>
<<mcl>>

Transform MCL output into a gene to family mapping TSV file

mcl="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl"
tsv="tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl.tsv"
awk -f "workflow/families_and_TAGs/scripts/mcl_to_tsv.awk" "${mcl}" > "${tsv}"

<<low-stringency>>
<<mcl-to-tsv>>

<<high-stringency>>
<<mcl-to-tsv>>

tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
cat "${tsv}" | wc -l

tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
cat "${tsv}" | wc -l

tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
head "${tsv}"

GLYMA_12G034500	1
GLYMA_12G138300	1
GLYMA_12G124600	1
GLYMA_12G126700	1
GLYMA_12G125200	1
GLYMA_12G034900	1
GLYMA_U036400	1
GLYMA_12G124200	1
GLYMA_12G126300	1
GLYMA_12G035100	1

tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
head "${tsv}"

GLYMA_07G053000	1
GLYMA_10G247700	1
GLYMA_12G188000	1
GLYMA_07G159200	1
GLYMA_13G313300	1
GLYMA_15G016600	1
GLYMA_14G159200	1
GLYMA_18G287500	1
GLYMA_20G132900	1
GLYMA_16G204800	1

What is the interval of duplicate gene count? There are between 48109 and 49391 duplicate genes in Glycine max.

What is the number of families?

tsv="tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv"
cut -f2 "${tsv}" | sort | uniq | wc -l

tsv="tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl.tsv"
cut -f2 "${tsv}" | sort | uniq | wc -l

What is the number of singletons?

expr $ALL - $DUPLICATED

expr $ALL - $DUPLICATED

Amount of Tandemly Arrayed Genes

Extract gene position from a GFF3 file

zcat "../data/Glycine_max.Glycine_max_v2.1.60.chr.gff3.gz" \
    | awk -f "./workflow/families_and_TAGs/scripts/gff3_to_gene_positions_table.awk" \
    > "./tmp/Glycine_max.gene_positions.tsv"

head "./tmp/Glycine_max.gene_positions.tsv"

GLYMA_01G000100	1	27355	28320
GLYMA_01G000200	1	58975	67527
GLYMA_01G000300	1	67770	69968
GLYMA_01G000400	1	90152	95947
GLYMA_01G000500	1	90289	91197
GLYMA_01G000600	1	116094	127845
GLYMA_01G000700	1	143467	155573
GLYMA_01G000800	1	157030	157772
GLYMA_01G000900	1	170534	193342
GLYMA_01G001000	1	196256	201895

cat "./tmp/Glycine_max.gene_positions.tsv" \
    | wc -l

There is 55897 protein coding genes in the sequence file, but only 55589 gene positions.

expr $GENES - $POSITIONS

Extract TAGs

positions="./tmp/Glycine_max.gene_positions.tsv"
families="./tmp/Glycine_max_Blastp_filtered_coverage${min_coverage}_identity${min_identity}.mcl.tsv"
./rust/tagfinder/target/release/tagfinder \
    --families "$families" \
    --positions "$positions" \
    --definitions 0,1 \
    > "./tmp/Glycine_max_TAGs_coverage${min_coverage}_identity${min_identity}.tsv"

<<low-stringency>>
<<extract-TAGs>>

<<high-stringency>>
<<extract-TAGs>>

head "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"

gene	family	tag0	tag1
GLYMA_07G257400	3109	-	-
GLYMA_17G203900	2067	-	-
GLYMA_20G062000	509	-	-
GLYMA_16G099400	40	-	-
GLYMA_14G045400	1136	804	896
GLYMA_11G189200	15	-	-
GLYMA_02G191500	1310	-	-
GLYMA_02G308200	520	-	-
GLYMA_14G205200	8	-	-

How many TAGs for definition 0?

awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($3 > max) {
            max=$3
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"

For TAG1?

awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($4 > max) {
            max=$4
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"

awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($3 > max) {
            max=$3
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"

awk 'BEGIN {
           max=0
    }
    NR > 1 {
        if ($4 > max) {
            max=$4
        }
    }
    END  {
        print max
    }
    ' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"

Size of the greatest TAG

High stringency

TAG₀

awk '
    {
    tag_index=$3
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"

421 53

TAG₁

awk '
    {
    tag_index=$4
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage40_identity50.tsv"

459 76

Low stringency

TAG₀

awk '
    {
    tag_index=$3
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"

579 76

TAG₁

awk '
    {
    tag_index=$4
    if (tag_index != "-")
        gene_count[tag_index]++
    }
    END {
    max = 0
    max_tag_index = ""
    for (tag_index in gene_count) {
        if (gene_count[tag_index] > max) {
            max_tag_index = tag_index
            max = gene_count[tag_index]
        }
    }
    print max_tag_index, max
    }' "./tmp/Glycine_max_TAGs_coverage30_identity30.tsv"

646 76

DONE Complete slides

<2024-11-23 Sat>

Family sizes

head -1 ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl \
    | awk '{ print NF }'

head -1 ./tmp/Glycine_max_Blastp_filtered_coverage40_identity50.mcl \
    | awk '{ print NF }'

TODO $K_s$ For all duplicated pairs within a family

Let $A$ be a family of $n$ duplicate members. The number of duplicate pairs is $n(n-1) / 2$.

PAML package:

conda install paml pal2nal clustalw -c bioconda

Use a Nextflow workflow

<2024-12-22 Sun>

All pair of duplicate genes

Generated by a rust program in rust/pairs folder. The algorithm is simple. We assume the gene-families mapping file is sorted by family index. For each family, loop for i from 0 to the size of the family minus one, loop for j from i plus one to the size of the family and output the pair gene i - gene j.

Example:

cat ./tmp/Glycine_max_Blastp_filtered_coverage30_identity30.mcl.tsv | ./rust/pairs/target/release/pairs > ./tmp/Glycine_max_duplicate_gene_pairs.tsv
head ./tmp/Glycine_max_duplicate_gene_pairs.tsv

GLYMA_07G053000	GLYMA_10G247700
GLYMA_07G053000	GLYMA_12G188000
GLYMA_07G053000	GLYMA_07G159200
GLYMA_07G053000	GLYMA_13G313300
GLYMA_07G053000	GLYMA_15G016600
GLYMA_07G053000	GLYMA_14G159200
GLYMA_07G053000	GLYMA_18G287500
GLYMA_07G053000	GLYMA_20G132900
GLYMA_07G053000	GLYMA_16G204800
GLYMA_07G053000	GLYMA_08G302300

PAML Steps

Example for the first gene pair.

Extract the CDS sequence and protein sequence of the genes

gene_id_1="GLYMA_07G053000"
gene_id_2="GLYMA_10G247700"

CDS:

output_file="./tmp/${gene_id_1}_${gene_id_2}.cds.fst"
echo "" > "${output_file}"
for gene_id in "${gene_id_1}" "${gene_id_2}"; do
        zcat ../data/Glycine_max.Glycine_max_v2.1.cds.all.fa.gz | awk -v gene_id="${gene_id}" '
        BEGIN {
            on_gene=0
        }
        /^[^>]/ && on_gene == 1 {
            print $0
        }
        /^>/ {
            gene = $4
            gsub("gene:", "", gene)
            if (gene == gene_id) {
                on_gene=1
                print $0
            } else {
                on_gene=0
            }
        }
        ' >> "${output_file}"
done

<<pair-gene-id>>
<<pair-gene-cds-extract>>

Proteins:

output_file="./tmp/${gene_id_1}_${gene_id_2}.prot.fst"
echo "" > "${output_file}"
for gene_id in "${gene_id_1}" "${gene_id_2}"; do
        cat ./tmp/proteome_filtered.fa | awk -v gene_id="${gene_id}" '
        BEGIN {
            on_gene=0
        }
        /^[^>]/ && on_gene == 1 {
            print $0
        }
        /^>/ {
            gene = $4
            gsub("gene:", "", gene)
            if (gene == gene_id) {
                on_gene=1
                print $0
            } else {
                on_gene=0
            }
        }
        ' >> "${output_file}"
done

<<pair-gene-id>>
<<pair-gene-protein-extract>>

Clustalw2 alignment

<<pair-gene-id>>
protein_sequence="./tmp/${gene_id_1}_${gene_id_2}.prot.fst"
clustalw2 -quiet -align -infile="${protein_sequence}" -outfile="./tmp/${gene_id_1}_${gene_id_2}.prot.ali.aln"

Convert protein alignment to coding sequence alignment

<<pair-gene-id>>
protein_alignment="./tmp/${gene_id_1}_${gene_id_2}.prot.ali.aln"
coding_sequence="./tmp/${gene_id_1}_${gene_id_2}.cds.fst"
pal2nal "${protein_alignment}" "${coding_sequence}" -output paml > "./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy"

YN00 Ka-Ks computation

<<pair-gene-id>>
phylip_file="./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy"
echo "seqfile = ${phylip_file}\noutfile = ${phylip_file}.yn\nverbose = 0\nicode = 0\nweighting = 0\ncommonf3x4 = 0" > yn00.ctl
yn00

Extraction of Ka-Ks values

<<pair-gene-id>>
yn_file="./tmp/${gene_id_1}_${gene_id_2}.cds.ali.phy.yn"
awk '
    BEGIN {
          on_good_section=0
          skip=0
    }
    $1 == "(B)" {
         skip=8
         on_good_section=1
    }

    on_good_section == 1 {
        if (skip == 0) {
            Ka=$8
            Ks=$11
            print Ka, Ks
            exit
        } else {
            skip -= 1
        }
    }
    ' "${yn_file}"

0.6207 1.9409

Run the Nextflow workflow

<2024-12-23 Mon>

23 KiB Raw Permalink Blame History

Comparative Genomics Project

General infos

Objectives

Part 1

Part 2

Our species

Download the source data

Develop the Nextflow workflow

Filtering the proteome to keep the longest isoform

Filtering out Chloroplasts and Mitochondria proteins

Filter out supercontigs

BLASTp All Against All

Clustering

Running the Nextflow workflow

From the given BLASTp format 6 file

Remove HSP on proteins found on supercontigs

Filter by coverage and identity percentage

Add protein length and gene name columns

Extract the protein length and gene id from the FASTA file

Join the BLASTp HSP file with the gene id and gene lengths

Filter the BLASTp according to coverage and identity percentage

First dataset

Clustering

Extract the homology graph

Filter the homology graph to keep only the heaviest weight

Run Markov Clustering

Transform MCL output into a gene to family mapping TSV file

Amount of Tandemly Arrayed Genes

Extract gene position from a GFF3 file

Extract TAGs

Size of the greatest TAG

High stringency

Low stringency

DONE Complete slides

Family sizes

TODO $K_s$ For all duplicated pairs within a family

Use a Nextflow workflow

All pair of duplicate genes

PAML Steps

Run the Nextflow workflow

TODO Does Glycine max have big TAGs and which function are the big TAG gene implied?

TODO Are TAG pairs different in age from non-TAG pairs?

TODO Are genes inside a TAG orientated in the same way more often than random?

23 KiB

Raw Permalink Blame History