Catching 98.9 Out of 100 Deepfakes: What It Takes to Lead Hugging Face’s Leaderboard

Voice deepfake losses are projected to hit $40B by 2027, a 6,566% jump from 2023. Modulate’s velma-2 now ranks #1 on Hugging Face’s Speech Deepfake Leaderboard with a 1.104% average EER across 14 datasets and 2M+ audio samples, catching 98.9 out of every 100 deepfakes. This post breaks down why the Hugging Face benchmark is the most credible public standard for detection, how Modulate’s voice-native ELM architecture outperforms repurposed models from Hiya and Resemble AI, and why running detection at $0.25/hr (100x cheaper than competitors) lets fraud teams monitor entire calls instead of just the opening seconds where most checks stop today.


This content originally appeared on HackerNoon and was authored by Modulate

$40 billion in projected losses by 2027 for businesses exposed to voice deepfakes.

To put that into perspective, total losses from voice deepfakes were $600 million in 2023 (and $12.5 billion in 2024). That’s an increase of 6,566% in four years.

Naturally, financial services were among the hardest hit, with 23% of financial-sector organizations reporting losses of over $1 million. Contact centers fared little better: they now reportedly encounter a voice deepfake attack every 46 seconds.

Businesses need a deepfake detection API that can accurately and reliably detect voice fraud as it happens in order to mitigate those hard losses.

Vetting such a solution depends on credible benchmarks like Hugging Face’s (HF) Speech Deepfake Leaderboard, where Modulate ranks #1 as of March 2026 with an average EER of 1.104%. That translates to Modulate catching 98.9% of the AI-generated deepfake voices across the diverse audio in Hugging Face’s 14 benchmarks. And because EER (equal error rate) is the operating point where false positives and false negatives are equally likely, it also implies just a 1.1% false positive rate.

Why everyone looks to the Hugging Face Deepfake Leaderboard

The Hugging Face Deepfake Leaderboard refers to a set of public, continuously updated leaderboards that evaluate how well AI systems detect synthetic or manipulated media, especially speech deepfakes.

To truly test the mettle of each model, it benchmarks them against 14 datasets and 2M+ audio samples, spanning clean lab audio to real-world telephony.

Because the HF Deepfake Arena is public and anyone can reproduce (or expose) its results, it’s the most transparent benchmark for evaluating how accurately detection models flag manipulated audio.

Though the public may submit their own results, the leaderboard was developed and is continuously maintained by researchers at Idiap Research Institute (Switzerland), CNRS/IRISA (France), Mohamed bin Zayed University of Artificial Intelligence (UAE), Tallinn University of Technology (Estonia), and Validsoft Ltd. (UK).

This makes HF one of the most credible and rigorous public benchmarks for evaluating detection systems.

The Leaderboard: Modulate #1, Hiya #2, Resemble AI #3

The two most important values on the leaderboard are the Average Result and the Pool Result.

The Average Result grants equal weight to each of the fourteen datasets.

The Pooled Result merges every evaluation sample into a single pool and computes one score, so larger datasets carry more weight.

The leaderboard now shows Modulate (velma-2) at #1 for both Average and Pooled, with scores of 1.104 and 1.586, respectively. That translates to an average accuracy of 98.9%: out of every 100 deepfake audio files, Modulate catches roughly 99, while only about 1.1% of legitimate audio gets falsely flagged. It’s the closest any model has come to complete accuracy in deepfake detection.
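To make the Average/Pooled distinction concrete, here is a toy sketch with invented scores (higher score = "sounds more genuine" to the detector). It is an illustration of the metric definitions, not the leaderboard's actual evaluation code:

```python
# Toy illustration of EER and the "Average" vs "Pooled" leaderboard results.

def eer(genuine, spoof):
    """Equal error rate: the operating point where the false-accept rate
    (spoofs passing) and false-reject rate (genuine speech rejected) meet."""
    best = 1.0
    for t in sorted(set(genuine) | set(spoof)):
        far = sum(s >= t for s in spoof) / len(spoof)     # spoofs accepted
        frr = sum(g < t for g in genuine) / len(genuine)  # genuine rejected
        best = min(best, max(far, frr))
    return best

# A small, hard dataset (one spoof scores higher than the real speech)...
ds_small = ([0.9, 0.8], [0.1, 0.95])
# ...and a larger, easy dataset with clean separation.
ds_large = ([0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95],
            [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4])

# Average result: every dataset counts equally, regardless of size.
average = (eer(*ds_small) + eer(*ds_large)) / 2  # 0.25

# Pooled result: merge all samples first, then compute a single EER,
# so bigger datasets dominate the number.
pooled = eer(ds_small[0] + ds_large[0], ds_small[1] + ds_large[1])  # 0.10
```

The two summaries deliberately disagree here: the hard dataset drags the Average up, while the Pooled score is dominated by the large, easy one, which is why the leaderboard reports both.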

Continuing with the results, Hiya (authenticity-verific) closely follows with an Average of 2.113 (97.9% average accuracy) and Pooled at 2.324. Resemble AI (resemble-detect-3b) lands in third with an Average of 2.570 (97.4% average accuracy) and Pooled at 2.099.


These top three results are remarkable, especially when you consider how drastically more accurate each model is than the fourth-placed model, DLMSL-Speaksure.

To get an even better understanding of why these achievements are so pivotal, we must look at our two top competitors individually:

Resemble AI is primarily a voice generation company (TTS and voice cloning), which could place them on the opposite side of the detection arena. However, their detection product (resemble-detect-3b) is a 3B parameter model. So while detection isn’t their core architecture, they’ve shown dedication to combating voice fraud with a solid, accurate model.

Hiya is a serious player in telephony fraud, with a model that is three times smaller than Resemble AI’s (using only 1 billion parameters), operating at 8x real-time speed in streaming mode. The majority of their business is focused on branded caller ID and voice agents, though they’ve dedicated a branch of their business to spam and fraud detection and prevention.

Modulate is a different story altogether. We’re voice-native from day one, and detection is the very core of our business. We are built on the ELM architecture, and our product offerings (conversation intelligence, speech-to-text, and deepfake detection) all live within that architecture.

This focus has paid off by allowing us to make strides in deepfake detection accuracy.

What Hugging Face’s 14 datasets actually test

There are roughly 2 million audio files across the 14 datasets in the HF deepfake arena, all collected to represent real-world attack scenarios across a variety of settings, accents/languages, industries, and technical jargon.

Let’s take a look at how truly diverse these audio files are, and what they test for.

ASVspoof series (2019, 2021 LA, 2021 DF, 2024)

Out of all of the datasets in the HF collection, the ASVspoof series is the closest thing to an industry standard for model evaluation with controlled, but realistic conditions.

That’s why it is the longest-running and most widely cited anti-spoofing benchmark series in all of speech security.

It measures:

  • LA (Logical Access): TTS and voice conversion attacks injected directly into the system (no channel noise).
  • PA (Physical Access): Playback attacks in real rooms with microphones, reverberation, and environmental noise.
  • DF (Deepfake): Modern neural TTS/VC systems, including diffusion‑based models.

With each new edition, you get new attack types, codecs, and channel conditions. For instance, 2024 expands into VoIP, telephony, compression artifacts, and more realistic channel distortions.

ADD Challenges (2022 Track 1/3, 2023 R1/R2)

The Audio Deepfake Detection (ADD) challenge series was designed with the knowledge that most models are tested on clean audio, with those unchallenging benchmarks flaunted as sales signals.

This dataset focuses on noisy, degraded, and real-world audio as a way to punish those models.

It measures:

  • Track 1: In-the-wild deepfake detection.
  • Track 3: Robustness to channel effects, background noise, and environmental distortions.
  • 2023 R1/R2: Introduces more diverse languages, codecs, and unseen synthesis methods.

In‑The‑Wild (YouTube, social media, uncontrolled noise)

In-The-Wild is not a single dataset. It’s an entire category, typically curated from video-based social media like TikTok, livestreams, and YouTube, as well as podcasts and other uncontrolled environments. Essentially, it’s audio captured “in the wild.”

The nature of the audio lends itself well to application across every modern platform that ingests user audio.

It measures:

  • Real‑world noise
  • Room acoustics
  • Microphone variability
  • Editing artifacts
  • Background music, cross‑talk, overlapping speech

CodecFake (Neural codec processing)

Neural codecs are now embedded into communication and social apps like WhatsApp, Instagram, TikTok, and Zoom. This increasingly makes real human audio look synthetic to older detectors.


Obviously, this decreases the accuracy rates of deepfake detection models and has negative implications in real-world scenarios.


The CodecFake benchmark aims to identify the models that can still discern true human audio from deepfakes by focusing on neural codecs (EnCodec, DAC, SoundStream, etc.) and codec‑induced artifacts.

It measures:

  • Whether a detector can handle audio that has been encoded → decoded → re‑encoded
  • Robustness to neural codec artifacts that resemble TTS artifacts
  • Sensitivity to bitrate changes
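The encode → decode → re-encode chain can be illustrated with a much older, simpler lossy codec than the neural ones CodecFake actually uses. This is a toy G.711-style mu-law companding round trip (an assumption for illustration, not part of the benchmark), showing how re-encoded audio stays audibly intact yet acquires small, systematic artifacts:

```python
import math

MU = 255.0  # mu-law companding constant (G.711-style)

def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] with the mu-law curve, then quantize
    to an 8-bit code -- the lossy step of a classic telephony codec."""
    c = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return round((c + 1) / 2 * 255)

def mulaw_decode(code: int) -> float:
    """Invert the quantized code back to an approximate sample value."""
    c = code / 255 * 2 - 1
    return math.copysign(math.expm1(abs(c) * math.log1p(MU)) / MU, c)

# A 440 Hz tone at 8 kHz, pushed through two encode -> decode round trips,
# mimicking the re-encoding chains CodecFake probes.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(160)]
once = [mulaw_decode(mulaw_encode(s)) for s in tone]
twice = [mulaw_decode(mulaw_encode(s)) for s in once]

# The waveform is no longer bit-identical: small distortions like these
# are what can make real human audio look synthetic to older detectors.
max_err = max(abs(a - b) for a, b in zip(tone, twice))
```

Neural codecs introduce far subtler, learned artifacts than this 8-bit quantizer, which is precisely why they resemble TTS output and trip up detectors trained only on clean audio.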

Academic benchmarks (Fake-or-Real, DFADD, SONAR)

Academic benchmarks are another category: research-grade datasets used in peer-reviewed papers to compare new architectures and to evaluate TTS, voice conversion, and deepfake detection.

Fake-or-Real is a dataset with binary classification and diverse TTS/VC systems. The audio is clean and controlled, which makes it perfect for setting a baseline of discriminative ability.

DFADD (Deepfake Audio Detection Dataset) includes multiple languages and synthesis methods and was designed to test generalization to unseen attacks.

The SONAR dataset places heavy focus on neural vocoder artifacts by including challenging borderline-realistic samples and high-quality TTS systems.

LibriSeVOC (neural vocoder synthesis)

If you’re trying to detect high-quality synthetic speech in commercial TTS systems, then the LibriSeVOC dataset is critical.

While it is built on LibriSpeech, whose clean studio audio is not challenging on its own, the speech was re-synthesized using neural vocoders (HiFi-GAN, WaveGlow, WaveRNN, etc.), and detecting vocoder output is an essential part of detection today.

Modern TTS pipelines often use diffusion models for acoustic modeling and neural vocoders for waveform generation, so vocoder detection is a core capability.

This dataset measures:

  • Ability to detect vocoder‑generated speech
  • Sensitivity to subtle phase and spectral artifacts
  • Generalization across vocoder families

How voice-native architecture beats repurposed models

When systems repurpose a model, they’re typically layering non-voice-specific, generalized ML models to handle new tasks.

There are several drawbacks to this:

  • Inefficiency: The extensive post-processing/manual review needed when repurposing a generalized model generally makes the process incredibly inefficient.

  • Accuracy gaps: Voice-native AI tools are purpose-built to take into account tone, cadence, and other complexities of speech-based communication. This makes them incredibly accurate. Repurposed, generalized models may misinterpret conversational nuances.

  • Missed context: The ability to detect tone and intent, as voice-native AI models do, is pertinent to stopping harmful behaviors. Repurposed models may even reinforce those behaviors and alienate users.

  • Limited scalability: Non-specified systems struggle to keep up with the growing volume of voice interactions, causing delayed responses (at minimum) while also increasing user harm.

You can see how this plays out in the architecture of the top three models on the HF Deepfake Leaderboard.

Hiya is telephony-focused, which means it’s strong on phone call conditions. But their architecture is optimized for a specific channel.

Resemble AI comes from the generation side, which means they understand synthesis because, well, they build synthesizers. While this is an essential factor in effective deepfake detection, it’s not the only factor necessary.

Detection requires architectural priorities that repurposed models often don’t have, including:

  • Adversarial robustness

  • Real-time processing

  • False positive management at scale

That is why Modulate’s ELMs take the voice-native approach, which generalizes across telephony, VoIP, clean audio, and degraded conditions. Because they’re purpose-built for voice, they operate directly on audio features (spectrograms, prosody, formant transitions, micro-temporal patterns).

These architectural differences make all the difference in the accuracy of these models when applied to deepfake detection.

Production performance (beyond the benchmark)

The question has never been just “Can you detect deepfakes in the lab?” It’s “Can you do it at scale without drowning your fraud team in false alerts?”

That is why we must go beyond the benchmark and look at deployment. The per-dataset metrics on the deepfake leaderboard help us do just that.

| Dataset | EER (%) |
|----|----|
| Pooled | 1.586 |
| Average | 1.104 |
| In‑The‑Wild | 1.271 |
| ASVspoof 2019 | 0.299 |
| ASVspoof 2021 LA | 1.330 |
| ASVspoof 2021 DF | 0.331 |
| ASVspoof 2024 Eval | 0.384 |
| Fake‑or‑Real | 0.133 |
| CodecFake | 1.538 |
| ADD 2022 Track 1 | 5.059 |
| ADD 2022 Track 3 | 1.174 |
| ADD 2023 R1 | 1.041 |
| ADD 2023 R2 | 1.742 |
| DFADD | 0.000 |
| LibriSeVoc | 0.265 |
| SONAR | 0.888 |


Among the top-performing models, the margins at the highest rankings are extremely slim, making for a competitive field. Some models also compensate for lower scores with smaller footprints and fast processing times (Hiya claims a noteworthy 8x real-time processing speed).

Still, Modulate has managed to earn top placements, with near-perfect results across six datasets, each challenging in its own right.

In the remaining benchmarks, we stay competitive, close behind the other top submitted systems.

With all the success across individual datasets, though, it is the 1.1% Average EER that demonstrates impressive generalization across real-world applications and synthesis methods.

For those outside the speech security industry, the 1.466-point difference in Average EER between Modulate and Resemble AI’s system might not seem like it would matter in application. In reality, that difference amounts to roughly 60% fewer missed deepfakes and roughly 150,000 fewer false positives per 10 million calls. All done on a model that is 10x smaller.
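The figures above are simple arithmetic on the leaderboard's Average EERs, assuming errors scale linearly with call volume. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope: what a 1.466-point Average EER gap means at scale.
calls = 10_000_000

modulate_eer = 1.104 / 100   # velma-2 Average EER
resemble_eer = 2.570 / 100   # the next Average score discussed above

# At the EER operating point the false-positive rate equals the miss rate,
# so the same gap also applies to legitimate callers who get wrongly flagged.
extra_false_positives = (resemble_eer - modulate_eer) * calls  # ~146,600

# Share of previously missed deepfakes that the lower EER now catches.
miss_reduction = 1 - modulate_eer / resemble_eer  # ~0.57, i.e. ~57% fewer misses
```

146,600 is where the "roughly 150,000 per 10 million calls" figure comes from, and the ~57% reduction in misses is the source of the "roughly 60%" claim.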

For banks, insurance companies, and customer service teams, this could account for millions in losses.

Now consider that this unprecedented level of accuracy costs just $0.25/hr. You also gain both batch and streaming modes, and a structured output offered in four confidence segments (meaning you’ll get four separate scores representing levels of certainty about the predictions).

A lower-cost detection model means continuous coverage

That $0.25/hr is 100x more affordable than our competitors, thanks to an inherently efficient and smaller model. This isn’t just nice to have while maintaining the highest level of accuracy in deepfake detection; it’s absolutely essential to stopping fraudsters.

Most banks, insurance companies, and call centers are checking for fraud. However, they run those checks only at the beginning of calls, hitting the off switch early to avoid the high costs that come with longer run times.

Fraudsters know this. It’s why those with a little more sophistication open their calls with a real human voice to get through the fraud check and turn on the AI voice once they’re through it. \n

An affordable cost structure makes it possible to check the entire call, not just the opening seconds. You can run every single segment, for every speaker, continuously and even in the background.
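Continuous, whole-call monitoring can be sketched as a simple loop over fixed-length segments. Note that `score_segment`, the 4-second segment length, and the 0.8 alert threshold are all hypothetical stand-ins for a real detection API, not Modulate's actual client library:

```python
from dataclasses import dataclass

SEGMENT_SECONDS = 4.0   # score the call in short rolling segments (assumed)
ALERT_THRESHOLD = 0.8   # deepfake likelihood that triggers an alert (assumed)

@dataclass
class Alert:
    start: float  # seconds into the call
    score: float

def score_segment(segment) -> float:
    """Placeholder for a real detection API call; returns a likelihood in [0, 1]."""
    raise NotImplementedError

def monitor_call(segments, scorer=score_segment):
    """Score every segment, not just the opening ones: a caller who starts
    with a real human voice is still caught when the AI voice takes over."""
    alerts = []
    for i, segment in enumerate(segments):
        score = scorer(segment)
        if score >= ALERT_THRESHOLD:
            alerts.append(Alert(start=i * SEGMENT_SECONDS, score=score))
    return alerts

# Simulated call: ~28 seconds of human speech, then a synthetic voice.
fake_scores = [0.05, 0.1, 0.08, 0.07, 0.06, 0.1, 0.09, 0.93, 0.95, 0.97]
alerts = monitor_call(range(len(fake_scores)), scorer=lambda i: fake_scores[i])
# alerts flags the segments starting at 28s, 32s, and 36s
```

A start-of-call-only check would have scored just the first segment or two and passed this caller; the per-segment loop is what catches the mid-call voice swap.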

The efficiency question: Does a smaller model really matter?

Hiya has highlighted their 1 billion parameter model as more efficient than other systems.

It’s a legitimate claim, since model efficiency matters for both deployment cost and latency. However, the claim doesn’t hold up when tested against other approaches to efficiency and accuracy.

Modulate’s model is already smaller than most, and its voice-native architecture provides an additional, inherent efficiency advantage: it doesn’t need to process the full complexity of language.

The models operate purely on acoustic features, avoiding the computational overhead associated with transformer‑driven language processing.

They also avoid the post-processing most repurposed models typically need, which would otherwise dramatically decrease efficiency.

Voice-native architecture leads them all in accuracy, efficiency, and cost

The three models that top the Hugging Face Speech Deepfake Leaderboard have each taken a different architectural approach to achieving both efficiency and impressive accuracy. But it is Modulate’s voice-native architecture that delivers the best results, consistently.

We deliver this pivotal performance thanks to consistent testing and training on noisy voice data. We built our models on half a billion hours of real audio, focusing on the diverse range of vocal tones, speech rhythms, and pronunciations that appear in patterns over longer audio segments.

This helps us deliver the accuracy seen in the HF benchmarks at a price that allows businesses to run Modulate continuously without significantly driving up total costs.


:::tip As the scale of deepfake attacks grows, you need a solution that can scale with them. You need Modulate.

:::
