Inference audio normalizer changes and use load_audio in more places #2695
Draft
asumagic wants to merge 10 commits into speechbrain:develop from
`load_audio` is preferred as it goes through the common path handling code and normalization (for resampling and downmixing). This relies on the audio normalizer being correctly configured for the model sample rate. Commit ccd0ed introduced functionality to infer the sample rate from `hparams.sample_rate` by default when the audio normalizer is not specified. This should result in strictly identical behavior, save for misconfigurations of the normalizer in hparams (assuming this code actually uses the proper tensor format, that is).
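The default described above can be sketched as follows. This is only an illustration of the fallback logic, not the actual SpeechBrain code: the function name `infer_normalizer_sample_rate`, the `DEFAULT_SAMPLE_RATE` constant, and the use of `SimpleNamespace` as a stand-in for hparams are all assumptions made for the example.

```python
from types import SimpleNamespace

# Assumed fallback: the 16 kHz rate the normalizer used before this change.
DEFAULT_SAMPLE_RATE = 16000


def infer_normalizer_sample_rate(hparams, key="sample_rate"):
    """Pick the sample rate a default audio normalizer should target.

    Hypothetical helper: prefers the hparams entry named `key` when it is
    present, otherwise falls back to DEFAULT_SAMPLE_RATE.
    """
    rate = getattr(hparams, key, None)
    if rate is None:
        return DEFAULT_SAMPLE_RATE
    return int(rate)


# Stand-ins for loaded hparams (hypothetical values).
with_rate = SimpleNamespace(sample_rate=22050)
without_rate = SimpleNamespace()

print(infer_normalizer_sample_rate(with_rate))
print(infer_normalizer_sample_rate(without_rate))
```

A model whose hparams already declare the correct `sample_rate` gets a matching normalizer without extra configuration, while a model that declares nothing keeps the previous behavior.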
Whether this is technically superior or not doesn't really matter: the rest of the interfaces use the working directory, so move to that. Additionally, v1.0.1 avoids creating symlinks, so this should avoid issues in the majority of cases anyway.
What does this PR do?
WIP + need to ensure that it doesn't break things. Should behave strictly identically save for hparams misconfigurations.
Changes:
- `AudioNormalizer` was unconditionally configured to 16kHz unless the normalizer was overridden. With this PR, it will prefer the `sample_rate` hparams, if specified.
- `MSTacotron2` uses `sample_rate` to refer to the TTS sample rate rather than the input sample rate. For `MSTacotron2`, the default audio normalizer is overridden to use the `spk_emb_sample_rate` hparams instead (which is non-optional).
- `EncoderClassifier`, `PIQAudioInterpreter`, `SepformerSeparation` now all use `Pretrained.load_audio`, which performs fetching, audio loading and normalization using the `audio_normalizer` at once. This deduplicates a fair amount of code, and means everything uses the streamlined `audio_normalizer`.
- `load_audio` documentation.
- `torchaudio.load` manually. Fixed path conversion there (see #2650: `Declaration: torchaudio_sox::load_audio_file(str _0, int? _1, int? _2, bool? _3, bool? _4, str? _5) -> (Tensor _0, int _1) Cast error details: Unable to cast Python instance of type <class 'pathlib.PosixPath'> to C++ type '?' (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)`) and made the documentation about the use of it more obvious.

Fixes #2650
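The path conversion fix mentioned in the list above can be illustrated with a small sketch. It does not call torchaudio: `strict_loader` is a hypothetical stand-in for a C++-bound loader that only accepts `str` paths, and `load_with_str_path` is an assumed wrapper name. The point is only that a `pathlib.Path` is converted to a plain string before it reaches such a binding.

```python
import os
from pathlib import Path


def strict_loader(path):
    """Hypothetical stand-in for a binding that rejects non-str paths."""
    if not isinstance(path, str):
        raise TypeError(f"expected str path, got {type(path).__name__}")
    return path


def load_with_str_path(path, loader=strict_loader):
    """Convert path-like objects to str before handing them to the loader."""
    # os.fspath returns str for pathlib.Path objects and leaves str unchanged.
    return loader(os.fspath(path))


# Passing a Path directly fails, while the wrapper succeeds.
try:
    strict_loader(Path("example.wav"))
except TypeError as err:
    print("direct call failed:", err)

print(load_with_str_path(Path("example.wav")))
```

Doing the conversion in one shared place means callers can keep passing `Path` objects without each of them having to remember the cast.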
Before submitting
PR review
Reviewer checklist