An ML project to classify French Orthoptera species by their stridulation sounds.

Orthoptera Sound Classification

List the French orthoptera species

The ASsociation pour la Caractérisation et l'ÉTude des Entomocénoses (ASCETE) publishes a PDF listing the Orthoptera species of France. From the raw text content of this file, ./src/extract_species_list_tsv.py extracts a TSV file listing each species' binomial name and French common name.
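As an illustration of the text-to-TSV step, here is a minimal sketch. It does not reproduce ./src/extract_species_list_tsv.py: the line layout ("Genus species French common name"), the regular expression, the column names and the function names `extract_species` and `to_tsv` are all assumptions, and the real ASCETE text may need a different pattern.

```python
import csv
import io
import re

# Assumed (hypothetical) line layout: "<Genus> <species> <French common name>".
# The actual ASCETE text may differ; adapt the pattern to the real file.
SPECIES_LINE = re.compile(r"^(?P<binomial>[A-Z][a-z]+ [a-z]+)\s+(?P<french>.+)$")


def extract_species(raw_text):
    """Return (binomial, French name) pairs for lines matching the assumed layout."""
    rows = []
    for line in raw_text.splitlines():
        match = SPECIES_LINE.match(line.strip())
        if match:
            rows.append((match["binomial"], match["french"].strip()))
    return rows


def to_tsv(rows):
    """Serialise the pairs as TSV text, preceded by a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["binomial", "french_name"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "Tettigonia viridissima Grande Sauterelle verte\n"
    print(to_tsv(extract_species(sample)), end="")
```

A pattern this simple will also match ordinary prose lines that happen to start with a capitalised word followed by a lowercase word, so the output should be checked against the source list.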

Build a reference audio dataset from Xeno-Canto

Using the xeno-canto-py helper functions for the Xeno-Canto API (modified for API v3 in this fork), ./src/construct_reference_dataset.py bulk downloads a set of audio recordings for each Orthoptera species. To run this step, you need to set the XENO_CANTO_API_KEY environment variable, e.g., in a .env file.
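A script that depends on the key can fail early with a clear message when it is missing. The sketch below is one possible way to do that; the helper name `get_api_key` is hypothetical and not taken from the repository, and it assumes that any .env file has already been loaded into the process environment by other means.

```python
import os


def get_api_key(environ=os.environ):
    """Return the Xeno-Canto API key, or raise a clear error if it is missing."""
    key = environ.get("XENO_CANTO_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "XENO_CANTO_API_KEY is not set; export it or add it to a .env file"
        )
    return key


if __name__ == "__main__":
    # Print only the key length, to avoid leaking the secret into logs.
    print(f"API key found ({len(get_api_key())} characters)")
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary.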

The audio files are downloaded and stored in subfolders in dataset/audio, named with the species binomial names.
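Because each subfolder is named after a species, the folder name can serve as the class label. The following sketch, which assumes nothing about the audio format and uses a hypothetical helper name `recordings_by_species`, lists the files found under each species folder.

```python
from pathlib import Path


def recordings_by_species(root):
    """Map each species subfolder name to the sorted list of file names it contains."""
    index = {}
    for species_dir in sorted(Path(root).iterdir()):
        if species_dir.is_dir():
            index[species_dir.name] = sorted(
                path.name for path in species_dir.iterdir() if path.is_file()
            )
    return index


if __name__ == "__main__":
    root = Path("dataset/audio")
    if root.is_dir():
        for species, files in recordings_by_species(root).items():
            print(f"{species}: {len(files)} recording(s)")
```

Counting recordings per species this way gives a quick view of class imbalance before training.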

Audio feature extraction with Tadarida-D

Tadarida-D is a C++ program developed for the Vigie-Chiro program to extract features from audio files.

The goal is to build a classifier of Orthoptera sounds in the audible spectrum.

The Bash script ./src/tadarida_bulk.sh runs Tadarida-D on every audio file downloaded in the previous step.

Tested Models

Decision tree

Random forest

XGBoost