
asumagic · 2024-09-23T11:53:41Z

What does this PR do?

WIP; I still need to make sure this doesn't break anything. Behavior should be strictly identical, except where the hparams are misconfigured.

Changes:

Prior to this PR, the AudioNormalizer was always configured for 16 kHz unless the normalizer was explicitly overridden. With this PR, it uses the `sample_rate` hparam when one is specified.
There are some gotchas. For example, MSTacotron2 uses `sample_rate` to refer to the TTS output sample rate rather than the input sample rate. For MSTacotron2, the default audio normalizer is therefore overridden to use the `spk_emb_sample_rate` hparam instead (which is required).
`EncoderClassifier`, `PIQAudioInterpreter`, and `SepformerSeparation` now all use `Pretrained.load_audio`, which fetches the file, loads the audio, and normalizes it with the `audio_normalizer` in a single call. This removes a fair amount of duplicated code and means all of these interfaces go through the same `audio_normalizer` path.
Improved load_audio documentation.
Fixed some type annotations in fetch/audio loading code.
There is still some code that calls `torchaudio.load` directly. Fixed the path conversion there (passing a `pathlib.PosixPath` failed in `torchaudio_sox::load_audio_file` with `Unable to cast Python instance of type <class 'pathlib.PosixPath'> to C++ type '?'`, see #2650) and made the documentation notice about this usage more prominent.
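The normalizer selection described in the first two items can be sketched as follows. This is a minimal illustration under stated assumptions, not SpeechBrain's code: hparams are modeled as a plain dict, and `NormalizerConfig`, `pick_normalizer`, and `pick_mstacotron2_normalizer` are hypothetical names. The MSTacotron2 variant also assumes that an explicitly configured `audio_normalizer` still takes precedence.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Hypothetical sketch; names are illustrative, not SpeechBrain APIs.
DEFAULT_SAMPLE_RATE = 16000  # rate used before this PR when no normalizer was given


@dataclass
class NormalizerConfig:
    """Stand-in for an audio normalizer; only the target rate matters here."""

    sample_rate: int


def pick_normalizer(hparams: Mapping[str, Any]) -> NormalizerConfig:
    """An explicit audio_normalizer wins; else use sample_rate; else 16 kHz."""
    explicit: Optional[NormalizerConfig] = hparams.get("audio_normalizer")
    if explicit is not None:
        return explicit
    return NormalizerConfig(
        sample_rate=hparams.get("sample_rate", DEFAULT_SAMPLE_RATE)
    )


def pick_mstacotron2_normalizer(hparams: Mapping[str, Any]) -> NormalizerConfig:
    """MSTacotron2 variant: sample_rate is the TTS output rate, so the input
    rate comes from spk_emb_sample_rate, which must be present (KeyError otherwise)."""
    explicit: Optional[NormalizerConfig] = hparams.get("audio_normalizer")
    if explicit is not None:
        return explicit  # assumption: an explicit normalizer still takes precedence
    return NormalizerConfig(sample_rate=hparams["spk_emb_sample_rate"])
```

Keeping the MSTacotron2 behavior in a separate function leaves the generic fallback unchanged for every other interface, while making the required `spk_emb_sample_rate` fail loudly if it is missing.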

Fixes #2650

Before submitting

Did you read the contributor guideline?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified
Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
Review the self-review checklist to ensure the code is ready for review

`load_audio` is preferred because it goes through the common path-handling code and through normalization (resampling and downmixing). This relies on the audio normalizer being correctly configured for the model's sample rate. Commit ccd0ed5 makes the sample rate default to `hparams.sample_rate` when the audio normalizer is not specified. This should result in strictly identical behavior, except where the normalizer is misconfigured in hparams (provided this code actually uses the proper tensor format).
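The path handling and normalization steps that `load_audio` bundles can be illustrated with a pure-Python sketch. Everything here is hypothetical: audio is modeled as channels-first lists of floats, downmixing averages the channels, and resampling uses naive linear interpolation as a stand-in for the proper band-limited resampler a real pipeline would use. `to_str_path` mirrors the idea behind the #2650 fix by handing backends a plain `str` instead of a `pathlib` object.

```python
import os
from pathlib import PurePath
from typing import List, Sequence, Union

# All names below are hypothetical; this is not SpeechBrain's implementation.


def to_str_path(path: Union[str, PurePath]) -> str:
    """Convert str or pathlib paths to a plain str for backends that reject Path objects."""
    return os.fspath(path)


def downmix(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average channels-first audio ([channel][time]) down to mono."""
    if not channels:
        raise ValueError("expected at least one channel")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same length")
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def resample_linear(signal: Sequence[float], orig_sr: int, target_sr: int) -> List[float]:
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if orig_sr == target_sr:
        return list(signal)
    n_out = round(len(signal) * target_sr / orig_sr)
    out: List[float] = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr  # fractional position in input samples
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)  # clamp at the last input sample
        frac = pos - lo
        out.append(signal[lo] * (1.0 - frac) + signal[hi] * frac)
    return out


def normalize(channels: Sequence[Sequence[float]], orig_sr: int, target_sr: int) -> List[float]:
    """Downmix to mono, then resample to the model's target rate."""
    return resample_linear(downmix(channels), orig_sr, target_sr)


def add_batch_dim(signal: List[float]) -> List[List[float]]:
    """Wrap a single [time] signal as a [1, time] batch."""
    return [signal]
```

Averaging the channels is the usual "average to mono" downmix; linear interpolation is used only to keep the example dependency-free and will alias when downsampling real audio.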

Whether this is technically superior or not doesn't really matter: the rest of the interfaces use the working directory, so we move to that as well. Additionally, v1.0.1 avoids creating symlinks, so this should avoid issues in the majority of cases anyway.

asumagic added 8 commits September 23, 2024 13:40

Use hparams["sample_rate"] for resampling when normalizer not explicit

ccd0ed5

Typing adjustment for paths, Pretrained.load_audio documentation

e9ba60a

Fix batch dim for load_audio

4a5ed8f

In VAD, explicitly convert provided path to str

6d03c9b

More explicit doc notice on audio load in VAD

7ea2c54

MSTacotron2 inference: fix normalizer and use load_audio

c29b4d2

asumagic changed the title from ~~Refactor audio loading code in some interfaces~~ to Inference audio normalizer changes and use `load_audio` in more places on Sep 24, 2024

asumagic added 2 commits September 24, 2024 13:34

VAD: remove incorrect audio_normalizer and stub read_audio

0060cd0

Add docstring to override

69c44a0


Inference audio normalizer changes and use `load_audio` in more places #2695

asumagic wants to merge 10 commits into speechbrain:develop from asumagic:interface-read-audio-fixes
