fix spectrogram drop mask starting position #3036

gfdb wants to merge 1 commit into speechbrain:develop
Conversation
Good find @gfdb , clearly we should change something here so that the documentation and arguments match what's actually going on. The annoying part about "fixing" it is that it is not backwards compatible, and we might have to test recipes to make sure it doesn't regress performance. I'm not sure what exactly the performance implications are, but I imagine it could hurt performance in some recipes to mask out the entirety of short utterances every time.

Perhaps equally unsatisfying/annoying would be to leave the behavior as-is, as it feels pretty ad-hoc and not mathematically sound, plus it treats long and short samples very differently. If that sort of length-dependent behavior was desired, we should be explicit, e.g. have some sort of maximum masking threshold, e.g. never cover more than 50% of the sample. I dunno if anyone has analyzed SpecAugment on short utterances before, if there's some approach we could lift from the literature.

I think I'm leaning towards some sort of threshold-based approach so that we can try to keep the performance while fixing the correctness while not seeming too ad-hoc. What do @TParcollet and @Adel-Moumen and @mravanelli think?
This paper proposes an "adaptive mask" which we could think about lifting for SpeechBrain: https://arxiv.org/pdf/1912.05533 The "uniform length sampling" actually sort of has the flavor that we've accidentally implemented for short utterances 😅 |
@pplantinga lol true, could be interesting, but they mention in that paper: "We note that while we have experimented with adaptive time masking policies, we have not discovered one that out-performs fixed policy SpecAugBasic." (2 time masks 2 freq masks, no time warp) so it might not be worth bothering |
What does this PR do?
In the implementation of SpectrogramDrop we get the bounds for the masks by first randomly sampling the mask starting position, i.e. the index of where the mask should begin. The correct way to set the upper bound for sampling this index is to take the length of the dimension we are applying the mask over, let's call it `D`, and subtract the mask's length from it. This is done so the mask doesn't go out of bounds.

However, in the current implementation, instead of subtracting the `mask_len` from `D`, we accidentally pass it as an argument to `max()` due to what is almost certainly a typo in the form of an extra comma. As a result, we sample from [0, D[ instead of [0, D - mask_len[.

How does it not go out of bounds?
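To make the off-by-comma concrete, here is a hypothetical sketch (the names `D` and `mask_len` are stand-ins for illustration, not the literal SpeechBrain source):

```python
# Intended: clamp the sampling upper bound so the mask always fits.
D, mask_len = 100, 30
high_intended = max(1, D - mask_len)  # 70 -> start in [0, 70[, mask stays in bounds

# Actual: the stray comma turns the subtraction into a third argument,
# so max() simply returns D whenever mask_len <= D.
high_buggy = max(1, D, mask_len)      # 100 -> start in [0, 100[, mask can overrun
```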
The final mask gets computed by comparing an `arange` over the masked dimension against the mask bounds, and that `arange` only goes up to `D`. Since we are comparing indices, even if `mask_pos + mask_len` exceeds `D`, the comparison just truncates the mask at the boundary.

Why is this bad?
This is bad because the current implementation allows the starting position to be sampled so close to the end that little of the intended mask survives, e.g. the last element in the slice can be sampled as the starting position, leaving almost nothing masked. Essentially, any start position drawn from ]D - mask_len, D[ results in a partial (truncated) mask.
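For instance, a pure-Python sketch (shapes assumed for illustration) of a start position drawn past `D - mask_len`:

```python
# The arange-style comparison silently truncates the mask at the boundary D.
D = 10
mask_pos, mask_len = 8, 5  # 8 > D - mask_len = 5, so the mask overruns
mask = [mask_pos <= t < mask_pos + mask_len for t in range(D)]
n_masked = sum(mask)       # only 2 of the intended 5 frames get masked
```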
How does this affect augmentation strength in existing recipes?
This bug makes the augmentation weaker. The longer the utterances are, the less it matters, but for extremely short utterances it matters a lot. Existing recipes whose hparams were tuned against the current implementation may therefore have inflated drop_count / drop_length upper and lower bounds. But again, the longer the utterances, the smaller the effect.
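A quick torch-free simulation (numbers chosen purely for illustration) of how much weaker the masking gets on a very short utterance:

```python
import random

def mean_masked_frames(D, mask_len, high, trials=20000):
    """Average number of frames actually masked when the start position is
    sampled from [0, high[ and the mask is truncated at the boundary D."""
    total = 0
    for _ in range(trials):
        pos = random.randrange(high)
        total += min(mask_len, D - pos)  # truncation at the boundary
    return total / trials

D, mask_len = 20, 15                                            # very short utterance
buggy = mean_masked_frames(D, mask_len, high=D)                 # start in [0, D[
fixed = mean_masked_frames(D, mask_len, high=D - mask_len + 1)  # start in [0, D - mask_len]
# buggy averages roughly 9.75 masked frames vs. the intended 15.
```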
How to fix?
If we wanted to keep the existing augmentation strength, we could just set the upper bound directly to `D` instead of doing the `max()` call that always yields `D`.
If we think the change in strength is nbd, the simplest way to correct the mistake is the current fix I have on the branch (just removing the comma), but this bases the maximum mask starting position on the longest mask we sampled, which is again not totally correct but probably not a huge deal.
The most correct way to fix it would be to do something to the effect of:

```python
mask_pos = (
    torch.rand(batch_size, n_masks, 1, device=spectrogram.device)
    * (D - mask_len)
)
```

so that each mask gets its own correct maximum starting position.
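As a torch-free sanity check (pure Python, names assumed), the per-mask scaling keeps every mask in bounds:

```python
import random

def sample_mask_pos(D, mask_lens):
    # Each mask scales a uniform draw in [0, 1[ by its own (D - mask_len),
    # mirroring the per-mask torch.rand(...) * (D - mask_len) fix above.
    return [int(random.random() * (D - L)) for L in mask_lens]

D = 100
mask_lens = [5, 40, 99]
for _ in range(1000):
    for pos, L in zip(sample_mask_pos(D, mask_lens), mask_lens):
        assert 0 <= pos and pos + L <= D  # the mask never overruns D
```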
The comma key and its consequences...
Fixes #<issue_number>
Before submitting
PR review
Reviewer checklist