[speaker verification - ECAPA-TDNN(C=1024)] lower performance than the papers #2161

shlee782 · 2023-09-15T06:45:22Z

Hello,

In papers 1 and 2, the ECAPA-TDNN (C=1024) model, trained on VoxCeleb2 and tested on VoxCeleb1-O, is reported to reach EERs of 0.87% and 0.856%, respectively.

Both papers indicate that the model consists of 14.7M parameters.

In the speaker verification recipe from SpeechBrain, the channels are defined as [1024, 1024, 1024, 1024, 3072] in the YAML file.
This setup, however, results in 20.8M parameters.
By adjusting the channels to [1024, 1024, 1024, 1024, 1536], the parameter count becomes 14.7M, consistent with the papers.
The latter configuration therefore appears to be the one that matches the papers.
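
For anyone who wants to check the two parameter counts quoted above, here is a minimal sketch. It assumes the `ECAPA_TDNN` class from `speechbrain.lobes.models.ECAPA_TDNN`, and it assumes 80-dimensional input features and a 192-dimensional embedding, which may not match your YAML file. The helper names `count_parameters` and `build_ecapa` are made up for illustration. Whether the figures in the papers also include the classifier head is not something this sketch can settle.

```python
def count_parameters(model, trainable_only=True):
    """Sum the element counts of a model's parameters.

    Works with any object exposing a PyTorch-style ``parameters()`` iterator.
    """
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )


def build_ecapa(channels):
    """Build the embedding model for a given channel list (hypothetical helper).

    The import is done lazily so that count_parameters can be used on its own.
    input_size=80 and lin_neurons=192 are assumptions, not values confirmed
    by this thread; adjust them to your own hyperparameter file.
    """
    from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

    return ECAPA_TDNN(
        input_size=80,
        lin_neurons=192,
        channels=channels,
        kernel_sizes=[5, 3, 3, 3, 1],
        dilations=[1, 2, 3, 4, 1],
        attention_channels=128,
    )
```

With SpeechBrain installed, `count_parameters(build_ecapa([1024, 1024, 1024, 1024, 3072]))` and the same call with `1536` as the last entry can be compared directly against the 20.8M and 14.7M figures.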

However, even with this configuration and the same training and test datasets (trained for 10 epochs), the EER I reach is 1.03%.
Why does this deviate from the results reported in the papers?
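
To make sure the numbers are compared on the same footing, here is a minimal sketch of how an EER can be computed from raw trial scores. It is a plain threshold sweep in pure Python, not the routine used by the recipe or by the papers, and the function name `compute_eer` is only illustrative. It picks the threshold where the false acceptance rate and false rejection rate are closest and returns their average.

```python
def compute_eer(target_scores, nontarget_scores):
    """Return (eer, threshold) from same-speaker and different-speaker scores.

    A trial is accepted when its score is >= threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

Different toolkits interpolate between thresholds in slightly different ways, so small differences in the last digit of an EER are possible even on identical scores.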

I would greatly appreciate direct assistance or a referral to someone who could provide insight.
@mravanelli

underdogliu (Collaborator) · 2023-09-16T01:32:46Z

Hi, and thanks for the post.

I think the performance of the system depends on many factors, such as:

- Learning rate, learning-rate schedule, and hardware configuration.
- Network configuration. Decreasing the number of channels to 1024 may reduce performance, since the pooling layer then has less context to work with.
- Training data. Using VoxCeleb2 only instead of VoxCeleb1+2 may be detrimental, although in my personal experience this is not always true. In fact, we have reported performance using only VoxCeleb2, and if you also used only VoxCeleb2, the result you obtained is better.

With C=3072 and 4 RTX 2080 Ti GPUs, I reached 0.80% EER (with s-norm), as reported. I'd therefore like to learn a bit more about your configuration.
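
Since score normalization can shift the EER noticeably, here is a minimal sketch of plain symmetric normalization (s-norm) for a single trial. It assumes you already have the scores of the enrollment and test embeddings against a cohort of impostor embeddings. The function name `s_norm` and its arguments are illustrative only, and this is not the adaptive variant that restricts the cohort to the top-scoring entries.

```python
from statistics import mean, pstdev


def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization for a single trial.

    enroll_cohort_scores: scores of the enrollment embedding against the cohort.
    test_cohort_scores: scores of the test embedding against the cohort.
    """
    mu_e, sd_e = mean(enroll_cohort_scores), pstdev(enroll_cohort_scores)
    mu_t, sd_t = mean(test_cohort_scores), pstdev(test_cohort_scores)
    # Average of the two z-normalized views of the same raw score.
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)
```

When comparing against published figures, it helps to check whether each number was obtained with or without this kind of normalization.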


shlee782 (Author) · 2023-09-17T15:32:37Z

My configuration used C=1024 and an RTX 3090. While the achieved EER of 1.03% is better than what you reported, it still falls short of the figures mentioned in the papers (0.87% and 0.856%). I'm curious as to why this discrepancy exists.

I'll need to investigate this further.

To start, I plan on training the model for more epochs.


its-nmt05 · 2025-05-30T17:40:37Z

I have a similar issue. I was trying to use the default setup (C=1024) to reproduce the results in the paper. According to the paper, the model (C=1024) should have 14.7M parameters, but the SpeechBrain setup comes to ~20M parameters. Any update on this, @underdogliu?
