Tennie Le | Data & BI Portfolio

Project Echo: Can a Model Learn Animal Names From Sound?

A small-data audio classification study using mel spectrograms and MobileNetV2.

Data Science
Audio ML
Transfer Learning
Author

Tennie Le

Published

April 18, 2026

This study explores how wildlife recordings can be turned into mel spectrograms, classified with transfer learning, and checked on unseen videos to see whether the predicted animal name stays stable over time.

Python · librosa · Mel Spectrograms · TensorFlow / Keras · MobileNetV2 · Audio Classification
Dataset
52 clips
Clean rerun with 4 classes and 13 clips per class: 9 train, 2 validation, and 2 test.
Best clean result
87.5%
Held-out accuracy on the improved run, with macro F1 of 0.8667.
Improvement
+50 pts
Accuracy improved from 37.5% in the first benchmark to 87.5% in the cleaner rerun.
Unseen video check
3 videos
Used to inspect prediction confidence and window-level stability outside the test split.

What I Studied

  • 4 animal classes: Boar, Cane Toad, Foxes, and Tree Frogs.
  • Mel spectrograms turned sound into a visual pattern the model could read.
  • 5-second clips were used so the model learned from a consistent audio length.
  • The final check used 3 unseen videos to compare label, confidence, and stability.

Why This Is Useful For Data Science And Business

Turn unstructured sound into structured data

Wildlife audio starts as raw sound. This pipeline converts it into features, class probabilities, labels, and model outputs that can be searched, summarized, and tracked.

Reduce manual review effort

A model can screen many recordings first, which helps people focus on the clips that are most likely to contain useful events.

Support monitoring and reporting

Predictions can feed simple summaries, alert logic, and trend analysis. That is the main business value of this kind of data science workflow.

How It Works

1. Audio preprocessing

Long recordings are clipped into consistent 5-second examples so the model trains on the same input length each time.
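
The clipping step can be sketched in plain Python. This is a minimal illustration, not the project's script: the function name `split_into_clips`, the 16 kHz sample rate, and the choice to drop a short trailing remainder are all assumptions made for the example.

```python
from typing import List, Sequence


def split_into_clips(
    samples: Sequence[float],
    sample_rate: int,
    clip_seconds: float = 5.0,
) -> List[Sequence[float]]:
    """Cut a mono signal into back-to-back clips of equal length.

    A trailing remainder shorter than one clip is dropped (an assumption),
    so every returned clip has exactly the same number of samples.
    """
    clip_len = int(round(sample_rate * clip_seconds))
    if clip_len <= 0:
        raise ValueError("clip length must be at least one sample")
    n_clips = len(samples) // clip_len
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


# Hypothetical 12-second recording at 16 kHz (silence, for illustration only).
recording = [0.0] * (12 * 16_000)
clips = split_into_clips(recording, sample_rate=16_000)
print(len(clips), len(clips[0]))  # 2 clips of 80000 samples; the last 2 s are dropped
```

Padding the remainder with silence instead of dropping it is an equally reasonable choice. What matters for training is that every example ends up with the same length.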

2. Mel spectrogram

Each clip becomes a time-frequency image. This makes rhythm and spectral shape easier for an image model to learn than they would be from a raw waveform.

3. MobileNetV2 classifier

A frozen MobileNetV2 backbone plus a small dense head classifies the spectrogram into one of four animal classes.

4. Window-level prediction

For longer media, the model predicts over sliding windows instead of one single pass.

5. Final label and confidence

Window predictions are averaged to produce the final label, confidence score, and a timeline of prediction behavior.
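
The averaging step can be sketched as follows. The function name `aggregate_windows`, the class order, and the three probability rows are hypothetical values chosen for illustration. They are not outputs from the study's model.

```python
from typing import Dict, Sequence, Tuple

# Assumed class order; the real model's output order may differ.
CLASS_NAMES = ["Boar", "Cane Toad", "Foxes", "Tree Frogs"]


def aggregate_windows(
    window_probs: Sequence[Sequence[float]],
    class_names: Sequence[str] = CLASS_NAMES,
) -> Tuple[str, float, Dict[str, float]]:
    """Average per-window class probabilities into one final prediction."""
    if not window_probs:
        raise ValueError("need at least one window")
    n_windows = len(window_probs)
    mean_probs = [
        sum(window[i] for window in window_probs) / n_windows
        for i in range(len(class_names))
    ]
    # The final label is the class with the highest mean probability.
    best = max(range(len(class_names)), key=lambda i: mean_probs[i])
    return class_names[best], mean_probs[best], dict(zip(class_names, mean_probs))


# Made-up probabilities for three windows (each row sums to 1).
windows = [
    [0.10, 0.30, 0.50, 0.10],
    [0.05, 0.45, 0.40, 0.10],
    [0.20, 0.20, 0.55, 0.05],
]
label, confidence, mean_by_class = aggregate_windows(windows)
print(label, round(confidence, 4))  # Foxes 0.4833
```

In this made-up case Foxes wins on the average even though the middle window favors Cane Toad, which is the kind of detail a single final label hides.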

From First Run To Better Rerun

The rerun improved because the setup matched the dataset better. The gain came from cleaner small-data choices, not from making the model more complex.

First benchmark
37.5%

Accuracy on the initial run

  • Macro F1: 0.3214
  • 7 training clips per class
  • Larger trainable setup for a very small dataset
  • Weak separation in the harder classes
→
Better rerun
87.5%

Accuracy on the clean improved run

  • Macro F1: 0.8667
  • 9 training clips per class
  • Frozen backbone with a smaller dense head
  • Audio augmentation with noise and pitch variation
Accuracy jump
+50 pts
The rerun lifted held-out accuracy from 37.5% to 87.5%.
Macro F1
0.32 → 0.87
Per-class performance became more balanced, instead of one easy class dominating the result.
Model setup
Smaller and frozen
A frozen backbone and smaller dense head fit the small dataset better.
Data strategy
Cleaner rerun
More consistent training data and augmentation improved generalization.
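
The noise half of the augmentation can be sketched with the standard library alone. This is an assumption-heavy illustration: the function name `add_noise_at_snr`, the Gaussian noise model, the 20 dB signal-to-noise ratio, and the synthetic tone are all hypothetical, and the study's actual noise settings are not stated here. Pitch variation is left out because it normally relies on an audio library.

```python
import math
import random
from typing import List, Sequence


def add_noise_at_snr(
    samples: Sequence[float], snr_db: float, rng: random.Random
) -> List[float]:
    """Return a copy of `samples` with Gaussian noise mixed in at `snr_db`."""
    if not samples:
        return []
    signal_power = sum(x * x for x in samples) / len(samples)
    if signal_power == 0.0:
        return list(samples)  # silent clip: nothing to scale the noise against
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so signal_power / noise_power matches the target SNR.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(samples, noise)]


# Hypothetical stand-in for an animal call: a 1-second 440 Hz tone at 8 kHz.
sample_rate = 8_000
clean = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(sample_rate)]
noisy = add_noise_at_snr(clean, snr_db=20.0, rng=random.Random(0))
```

Augmentation like this would normally be applied to training clips only, so the validation and test clips stay untouched.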

Baseline training curves

The first run was useful as a benchmark, but the learning pattern was not strong enough to produce reliable class separation.

Improved training curves

The better rerun trained more cleanly and reached a much stronger result with a simpler small-data configuration.

Results

Held-out accuracy
87.5%
The final clean run classified 7 of 8 test clips correctly.
Macro F1
0.8667
This suggests the model performed well across classes, not only on one label.
Strongest classes
Boar, Tree Frogs
These classes showed the clearest separation in the clean rerun.
Main challenge
Foxes vs Cane Toad
This is the class boundary that still causes the most uncertainty.
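
The two headline numbers can be reproduced with a small worked example. The placement of the single error below (one Foxes clip predicted as Cane Toad) is a hypothetical arrangement that happens to be consistent with the reported figures. It is not taken from the actual confusion matrix, and the reverse error would give the same scores.

```python
from typing import Sequence

LABELS = ["Boar", "Cane Toad", "Foxes", "Tree Frogs"]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of clips whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# 2 test clips per class, 8 in total.
y_true = ["Boar", "Boar", "Cane Toad", "Cane Toad",
          "Foxes", "Foxes", "Tree Frogs", "Tree Frogs"]
# Hypothetical predictions with exactly one mistake (Foxes -> Cane Toad).
y_pred = ["Boar", "Boar", "Cane Toad", "Cane Toad",
          "Foxes", "Cane Toad", "Tree Frogs", "Tree Frogs"]

print(accuracy(y_true, y_pred))                     # 0.875
print(round(macro_f1(y_true, y_pred, LABELS), 4))   # 0.8667
```

With only 2 test clips per class, one misclassified clip moves accuracy by 12.5 points, which is why the scores are best read alongside the small test size.
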
Note: Main insight

The strongest result was not just a better score. The rerun produced cleaner class separation, more stable predictions, and a more explainable path from sound to final label.

Improved confusion matrix

Confusion matrix

This figure shows that Boar and Tree Frogs are the clearest classes in the clean rerun. The main remaining confusion is still between Foxes and Cane Toad.

Prediction showcase figure

Prediction summary figure

This figure is useful because it shows both the final probabilities and the window-by-window behavior behind the final label. It makes the confidence story easier to understand than a single label alone.

Note: Limitation

The test split is still small at 8 clips total, so the improved result is promising but should be treated as an early study result rather than a final production-level claim.

Prediction On Unseen Videos

Strong Cane Toad example

Cane Toad · 86.26%

The clearest example. The model stays on Cane Toad across 16 windows, which suggests a repeated and stable acoustic pattern.

16 windows · Tree Frogs next: 7.78%

Mixed fox example

Foxes · 47.10%

A weaker prediction. Foxes wins overall, but Cane Toad and Boar compete in several windows, so this result is less certain.

10 windows · Cane Toad next: 29.17%

Stronger fox example

Foxes · 58.88%

More confident than the previous fox sample, but still split with Cane Toad in multiple windows. This is the main class boundary that still needs work.

8 windows · Cane Toad next: 38.47%

These video checks are useful because they show how confidence changes over time. A single final label can hide uncertainty, while window-level behavior shows whether the model is actually consistent.
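
One simple way to put a number on window-level consistency is the share of windows whose top class matches the final label. The sketch below is an illustration under assumptions: the function name `window_timeline`, the class order, and the four probability rows are hypothetical, and the study may measure stability differently.

```python
from typing import List, Sequence, Tuple

# Assumed class order; the real model's output order may differ.
CLASS_NAMES = ["Boar", "Cane Toad", "Foxes", "Tree Frogs"]


def window_timeline(
    window_probs: Sequence[Sequence[float]],
    final_label: str,
    class_names: Sequence[str] = CLASS_NAMES,
) -> Tuple[List[str], float]:
    """Return each window's top label and the share that match `final_label`."""
    if not window_probs:
        raise ValueError("need at least one window")
    timeline = [
        class_names[max(range(len(probs)), key=lambda i: probs[i])]
        for probs in window_probs
    ]
    agreement = sum(label == final_label for label in timeline) / len(timeline)
    return timeline, agreement


# Made-up probabilities for four windows of a fox-like clip.
windows = [
    [0.10, 0.30, 0.50, 0.10],
    [0.05, 0.45, 0.40, 0.10],
    [0.20, 0.20, 0.55, 0.05],
    [0.10, 0.20, 0.60, 0.10],
]
timeline, agreement = window_timeline(windows, final_label="Foxes")
print(timeline)   # ['Foxes', 'Cane Toad', 'Foxes', 'Foxes']
print(agreement)  # 0.75
```

A clip where nearly every window agrees with the final label is a stable prediction, while a low agreement rate flags a label that is only weakly dominant.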

What I Learned And Next Step

What I Learned

  • Small datasets need a controlled setup. Simpler transfer learning worked better than a heavier trainable model.
  • Mel spectrograms made the pipeline easier to understand because sound became a visible pattern.
  • Video inference was important because it exposed which predictions were stable and which ones were only weakly dominant.

Next Step

  • Add more source recordings per class so the model sees more recording conditions.
  • Use source-level splits for a fairer measure of generalization.
  • Focus on separating Foxes from Cane Toad more clearly, because that is the main remaining confusion.
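
A source-level split keeps every clip cut from the same original recording on one side of the split, so the test set measures generalization to recordings the model has never heard. The sketch below is a minimal illustration: the function name `source_level_split`, the clip and source identifiers, and the 25% test fraction are hypothetical, and it ignores per-class balancing, which a real split would also need.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def source_level_split(
    clips: Sequence[Tuple[str, str]],
    test_fraction: float = 0.25,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split (clip_id, source_id) pairs so no source lands in both sets."""
    by_source: Dict[str, List[str]] = defaultdict(list)
    for clip_id, source_id in clips:
        by_source[source_id].append(clip_id)
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)
    # Whole sources, not individual clips, are assigned to the test set.
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])
    train = [c for s in sources if s not in test_sources for c in by_source[s]]
    test = [c for s in sources if s in test_sources for c in by_source[s]]
    return train, test


# Hypothetical fox clips: 4 source recordings with 3 clips each.
clips = [(f"fox_src{s}_clip{c}", f"fox_src{s}") for s in range(4) for c in range(3)]
train_ids, test_ids = source_level_split(clips)
print(len(train_ids), len(test_ids))  # 9 3
```

Which source ends up in the test set depends on the seed, but the guarantee that no source appears on both sides holds for any seed.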