fix spectrogram drop mask starting position #3036
Open
gfdb wants to merge 1 commit into speechbrain:develop
Conversation
What does this PR do?
In the implementation of SpectrogramDrop, we get the bounds for the masks by first randomly sampling the mask starting position, i.e. the index at which the mask should begin. The correct way to set the upper bound for sampling this index is to take the length of the dimension we are applying the mask over, call it D, and subtract the mask's length from it. This is done so the mask doesn't go out of bounds.

However, in the current implementation, instead of subtracting mask_len from D, we accidentally pass it as an argument to max() due to what is almost certainly a typo in the form of an extra comma. As a result, we sample from [0, D[ instead of [0, D - mask_len[.

How does it not go out of bounds?
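The difference between the two readings can be checked with plain Python (the numbers here are hypothetical; `D` stands for the masked dimension's length and `mask_len` for a sampled mask length):

```python
# Hypothetical values: D is the length of the masked dimension,
# mask_len a sampled mask length.
D, mask_len = 100, 10

# Intended: subtract mask_len from D to get the sampling upper bound.
intended = max(1, D - mask_len)   # 90

# With the extra comma, mask_len becomes a third argument to max(),
# so the bound is just D whenever D is the largest value.
buggy = max(1, D, mask_len)       # 100
```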
The final mask is computed by comparing an arange over the masked dimension against the mask bounds, and that arange only goes up to D. Since we are comparing indices, even if mask_pos + mask_len exceeds D, the arange comparison simply truncates the mask.

Why is this bad?
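A plain-Python sketch of this truncation (a stand-in for the tensor version, which compares an arange against mask_pos and mask_pos + mask_len; the values below are illustrative):

```python
D = 8
mask_pos, mask_len = 6, 4          # mask would extend to index 10 > D

# The mask is built by comparing each index against the mask bounds.
arange = list(range(D))
mask = [mask_pos <= i < mask_pos + mask_len for i in arange]

# Indices beyond D don't exist in arange, so the mask silently
# truncates to [6, 8) instead of the intended length 4.
assert sum(mask) == 2              # only 2 of 4 positions are masked
```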
This is bad because the current implementation allows, for example, the last element in the slice to be sampled as the starting position for the mask, i.e. almost no masking. Essentially, any starting position drawn from ]D - mask_len, D[ will result in a partial mask.
How does this affect augmentation strength in existing recipes?
This bug makes the augmentation weaker. The longer the utterances, the less it matters, but for extremely short utterances it matters a lot. Existing recipes whose hparams were tuned against the current implementation may therefore have higher drop_count / drop_length upper and lower bounds than they otherwise would. But again, the longer the utterances, the less of a big deal it is.
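As a rough sketch of why short utterances suffer more: with the buggy bound, a uniformly drawn starting position lands in the truncating region with probability roughly mask_len / D (numbers below are illustrative):

```python
# Probability that a uniformly drawn start position in [0, D) produces
# a truncated (weaker-than-intended) mask is roughly mask_len / D.
mask_len = 20
for D in (1000, 50):               # long vs. very short utterance
    p_partial = mask_len / D
    print(f"D={D}: P(partial mask) ~ {p_partial:.2f}")
# D=1000 gives ~0.02; D=50 gives ~0.40
```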
How to fix?
If we wanted to keep the existing augmentation strength, we could just set the upper bound directly to D instead of doing the max call that always yields D.
If we think the change in strength is no big deal, the simplest way to correct the mistake is the current fix I have on the branch (just removing the comma), but this bases the maximum mask starting position on the longest mask we sampled, which is again not totally correct but probably not a huge deal.
The most correct way to fix it would be to do something to the effect of:
```python
mask_pos = (
    torch.rand(batch_size, n_masks, 1, device=spectrogram.device)
    * (D - mask_len)
)
```

so that each mask gets its own correct maximum starting position.
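A minimal standalone check of the per-mask approach (using Python's random as a stand-in for torch.rand; D and the mask lengths are made up):

```python
import random

random.seed(0)                     # just for reproducibility
D = 30                             # hypothetical dimension length
mask_lens = [5, 12, 20]            # illustrative sampled mask lengths

# Each mask gets its own upper bound D - mask_len, mirroring the
# torch.rand(...) * (D - mask_len) approach.
mask_pos = [int(random.random() * (D - L)) for L in mask_lens]

# Every mask now fits entirely inside the dimension.
assert all(pos + L <= D for pos, L in zip(mask_pos, mask_lens))
```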
The comma key and its consequences...
Fixes #<issue_number>