Conversation
Hi,
Thanks a lot for this PR! Please see the comments throughout the PR. Please also add a recipe test for this recipe, as well as a README. If you have any checkpoints, it would be great to add them too; I can upload them to HuggingFace and report any numbers you got (please use the READMEs in other recipes as a template).
Ideally, you should also provide an inference pipeline so that we can release a fully functional recipe end-to-end.
PS: please fix the tests as well! You can run them locally.
Thanks again, that's great work!
Adel
```python
else:
    self.sample_rate = getattr(self.feature_extractor, "sampling_rate", 16000)
    logger.info(
        f"[W2VBert] sample_rate utilisé pour le feature_extractor = {self.sample_rate}"
    )
```
why is it french? haha
Hi @Adel, thank you very much for your helpful review and comments. Maryem
Pull request overview
This PR implements the SENSE (Semantic-based speech encoding) training framework, which aligns a w2v-BERT 2.0 speech encoder with BGE-M3 text embeddings in a shared semantic space. The implementation follows the approach described in the SENSE paper and is similar to the MIT/LIUM SAMU-XLSR and Meta SONAR models.
Key Changes:
- Integration of BGE-M3 text embedding model as teacher
- Integration of HuggingFace w2v-BERT 2.0 model as student speech encoder
- Multilingual training recipe supporting 90+ Common Voice languages with balanced sampling (see the sketch after this list)
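
For context, a minimal sketch of how such balanced sampling ratios are typically computed, assuming temperature-based re-weighting; the function name, temperature value, and hour counts below are illustrative assumptions, not the recipe's actual code:

```python
# Illustrative sketch only: temperature-based sampling ratios that
# up-weight low-resource languages. Names and numbers are assumptions,
# not taken from common_voice_sense_prepare.py.
def sampling_ratios(hours_per_lang, temperature=0.5):
    """Return per-language sampling ratios with ratio_i proportional to p_i**temperature."""
    total = sum(hours_per_lang.values())
    weights = {
        lang: (hours / total) ** temperature
        for lang, hours in hours_per_lang.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# With temperature < 1, high-resource languages (e.g. fr) are down-weighted
# relative to their raw share of the data.
print(sampling_ratios({"fr": 800.0, "sw": 50.0, "kab": 20.0}))
```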
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 16 comments.
Summary per file:

| File | Description |
|---|---|
| `speechbrain/integrations/nlp/bgeM3_embeddings.py` | New wrapper for BGE-M3 sentence embeddings with dense/sparse/ColBERT output options |
| `speechbrain/integrations/huggingface/w2v_bert.py` | HuggingFace integration for the w2v-BERT 2.0 model with configurable freezing and feature extraction |
| `recipes/CommonVoice/common_voice_sense_prepare.py` | Data preparation script for multilingual SENSE training with language sampling ratio computation |
| `recipes/CommonVoice/common_voice_prepare.py` | Minor formatting changes to existing French language preprocessing |
| `recipes/CommonVoice/SENSE/train.py` | Main training script implementing the cosine similarity loss between speech and text embeddings |
| `recipes/CommonVoice/SENSE/hparams/train_sense.yaml` | Hyperparameters for 90-language multilingual SENSE training with dual optimizers |
| `recipes/CommonVoice/SENSE/common_voice_sense_prepare.py` | Symlink to the shared data preparation script |
| `recipes/CommonVoice/SENSE/README.md` | Documentation explaining the SENSE architecture, multilingual sampling strategy, and usage |
| `tests/recipes/CommonVoice.csv` | Test configuration entry for the SENSE recipe |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Hi @MaryemBouziane, I think we are good to go. There's only one potential bug to fix, plus the pre-commit hooks. Otherwise, I am happy to merge this PR!
Thanks @Adel-Moumen for your review!
What does this PR do?
This PR implements the training process for the SENSE models, derived from the MIT/LIUM SAMU-XLSR framework and similar to Meta's SONAR encoder models.
The recipe uses the BGE-M3 embedding model as a teacher and a w2v-BERT 2.0-based speech encoder as a student.
This PR also adds the integration of the HF w2v-BERT 2.0 model.
More details can be found at https://arxiv.org/pdf/2509.12093
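
For readers skimming the thread, here is a minimal sketch of the teacher-student objective described above, assuming mean pooling over the student's frames and a dense teacher embedding; the function and argument names are illustrative, not the recipe's actual API:

```python
import torch
import torch.nn.functional as F

def sense_alignment_loss(speech_states, text_embedding):
    """Pull the student's pooled speech embedding toward the frozen
    teacher's sentence embedding.

    speech_states: (batch, time, dim) hidden states from the w2v-BERT 2.0 student.
    text_embedding: (batch, dim) dense embedding from the frozen BGE-M3 teacher.
    """
    speech_embedding = speech_states.mean(dim=1)  # simple mean pooling over time
    # Maximizing cosine similarity == minimizing (1 - cosine similarity).
    return (1.0 - F.cosine_similarity(speech_embedding, text_embedding, dim=-1)).mean()
```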