
fix spectrogram drop mask starting position #3036

Open
gfdb wants to merge 1 commit into speechbrain:develop from gfdb:fix-spectrogram-drop
Conversation

@gfdb (Contributor) commented Feb 17, 2026

What does this PR do?

In the implementation of SpectrogramDrop we get the bounds for the masks by first randomly sampling the mask starting position, i.e. the index at which the mask should begin. The correct way to set the upper bound for sampling this index is to take the length of the dimension we are applying the mask over, let's call it D, and subtract the mask's length from it. This keeps the mask from going out of bounds.

However, in the current implementation, instead of subtracting mask_len from D, we are accidentally passing it as a separate argument to max() due to what is almost certainly a typo in the form of an extra comma, resulting in us sampling from [0, D[ instead of [0, D - mask_len[.
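To illustrate the effect of the stray comma, here is a minimal plain-Python sketch (the names D, mask_len, intended, and buggy are illustrative, not the actual identifiers in the SpeechBrain source):

```python
D, mask_len = 100, 20

# Intended upper bound for the start index: leave room for the mask.
intended = max(1, D - mask_len)   # 80

# With the extra comma, mask_len becomes a *separate argument* to max(),
# so the subtraction never happens and the bound is always D itself.
buggy = max(1, D, -mask_len)      # 100

assert intended == 80
assert buggy == D
```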

How does it not go out of bounds?

The final mask gets computed as

arange = torch.arange(D, device=spectrogram.device).view(1, 1, -1)
mask = (mask_pos <= arange) * (arange < (mask_pos + mask_len))

where arange only covers indices 0 through D - 1. We are comparing indices, so even if mask_pos + mask_len exceeds D, the comparison simply truncates the mask at the end of the spectrogram.
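The truncation can be reproduced with the same comparison in plain Python (illustrative values; the torch code above broadcasts the identical predicate over a tensor):

```python
D = 10                      # length of the masked dimension
mask_pos, mask_len = 8, 5   # mask nominally covers [8, 13[, past the end

# Same index comparison as the torch code, element by element:
mask = [(mask_pos <= i) and (i < mask_pos + mask_len) for i in range(D)]

# Only indices 8 and 9 end up masked: 2 elements instead of 5.
assert sum(mask) == 2
```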

Why is this bad?

This is bad because the current implementation allows, for example, the very last index in the slice to be sampled as the starting position, leaving all but one element unmasked. Essentially, any start drawn from ]D - mask_len, D[ results in a truncated, partial mask.

How does this affect augmentation strength in existing recipes?

This bug makes the augmentation weaker, and the effect grows as utterances get shorter: for long utterances it barely matters, but for extremely short ones it matters a lot. Existing recipes whose hparams were tuned against the current implementation may therefore have drop_count / drop_length upper and lower bounds set higher than they would be under the corrected behavior.
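To put a rough number on the weakening, here is a back-of-the-envelope calculation in plain Python (D and L are illustrative values; it assumes the buggy start is sampled uniformly from [0, D[ and the mask is truncated at D, as described above):

```python
D, L = 20, 10   # a short utterance (20 frames) with a 10-frame mask

# Buggy sampling: start uniform over [0, D[, mask truncated at D.
# Expected masked length = average of min(L, D - pos) over all starts.
avg_masked_buggy = sum(min(L, D - p) for p in range(D)) / D

# Correct sampling: start drawn from [0, D - L], mask never truncated.
avg_masked_correct = L

assert avg_masked_buggy == 7.75     # ~22% less masking than intended
assert avg_masked_correct == 10
```

For this short utterance the average masked span shrinks from 10 frames to 7.75; for D = 1000 the same calculation gives a negligible difference, matching the observation that longer utterances are barely affected.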

How to fix?

If we wanted to keep the existing augmentation strength, we could just set the upper bound directly to D, since the current max call always yields D anyway.

If we think the change in strength is no big deal, the simplest way to correct the mistake is the current fix I have on the branch (just removing the comma). However, that bases the maximum mask starting position on the longest mask we sampled, which is again not totally correct, though probably not a huge deal.

The most correct way to fix it would be to do something to the effect of:

mask_pos = (torch.rand(batch_size, n_masks, 1, device=spectrogram.device) * (D - mask_len))

so that each mask gets its own correct maximum starting position.
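A stdlib sketch of that per-mask idea (the function name and shapes are illustrative, not the torch.rand expression above verbatim; it keeps the sampling interval [0, D - mask_len[ exclusive, as used throughout this description):

```python
import random

def sample_mask_positions(D, mask_lens):
    """Sketch of the per-mask fix: each start is drawn from
    [0, D - its own mask length[, so no mask is ever truncated."""
    return [random.randrange(max(1, D - L)) for L in mask_lens]

# Every mask fits entirely inside the D frames, even the near-full one.
positions = sample_mask_positions(100, [20, 50, 99])
assert all(p + L <= 100 for p, L in zip(positions, [20, 50, 99]))
```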

The comma key and its consequences...

Fixes #<issue_number>

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@pplantinga (Collaborator)

Good find @gfdb, clearly we should change something here so that the documentation and arguments match what's actually going on. The annoying part about "fixing" it is that it is not backwards compatible, and we might have to test recipes to make sure it doesn't regress performance. I'm not sure what exactly the performance implications are, but I imagine it could hurt performance in some recipes to mask out the entirety of short utterances every time.

Perhaps equally unsatisfying/annoying would be to leave the behavior as-is: it feels pretty ad hoc and not mathematically sound, plus it treats long and short samples very differently. If that sort of length-dependent behavior were desired, we should be explicit about it, e.g. have some sort of maximum masking threshold, say never cover more than 50% of the sample. I dunno if anyone has analyzed SpecAugment on short utterances before, or if there's some approach we could lift from the literature.

I think I'm leaning towards some sort of threshold-based approach so that we can keep the performance while fixing the correctness, without seeming too ad hoc. What do @TParcollet and @Adel-Moumen and @mravanelli think?

@pplantinga (Collaborator) commented Feb 24, 2026

This paper proposes an "adaptive mask" which we could think about lifting for SpeechBrain:

https://arxiv.org/pdf/1912.05533

The "uniform length sampling" actually sort of has the flavor of what we've accidentally implemented for short utterances 😅

@gfdb (Contributor, Author) commented Feb 24, 2026

@pplantinga lol true, could be interesting, but they mention in that paper: "We note that while we have experimented with adaptive time masking policies, we have not discovered one that out-performs fixed policy SpecAugBasic." (2 time masks, 2 freq masks, no time warp), so it might not be worth bothering.
