[memo]Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models


Prior work has built multiple hallucination benchmarks.

This paper builds a framework for evaluating the quality of hallucination benchmarks themselves (a benchmark of benchmarks).

Introduction

LVLMs (Large Vision-Language Models) tend to generate hallucinations: responses that are inconsistent with the corresponding visual inputs.

Hallucination benchmark quality measurement framework

Contributions

  • Propose a quality measurement framework for hallucination benchmarks for LVLMs
  • Construct a new, higher-quality hallucination benchmark

Related work

POPE constructs yes/no questions about object existence, including questions about objects that are not in the image.

AMBER extends yes/no questions to other types of hallucination (e.g., attributes and relations).

HallusionBench uses paired yes/no questions.
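As a rough, hypothetical sketch (not the paper's or POPE's actual code) of how such yes/no probes can be constructed: pair positive questions about annotated objects with negative questions about objects sampled from outside the ground-truth list. The names build_pope_questions, gt_objects, and vocabulary below are illustrative.

```python
import random

def build_pope_questions(gt_objects, vocabulary, num_negatives=3):
    """Sketch of POPE-style yes/no probing questions.

    gt_objects: objects annotated as present in the image (ground truth).
    vocabulary: candidate object names to sample negatives from.
    Returns (question, expected_answer) pairs.
    """
    questions = []
    # Positive probes: objects that are actually in the image.
    for obj in gt_objects:
        questions.append((f"Is there a {obj} in the image?", "yes"))
    # Negative probes: objects sampled from outside the ground-truth set.
    absent = [o for o in vocabulary if o not in gt_objects]
    for obj in random.sample(absent, min(num_negatives, len(absent))):
        questions.append((f"Is there a {obj} in the image?", "no"))
    return questions

print(build_pope_questions(["dog", "frisbee"], ["dog", "frisbee", "cat", "car", "boat"]))
```

POPE also defines popular and adversarial negative-sampling strategies; the sketch above only shows random sampling.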

Evaluation metrics

CHAIR: measures the fraction of objects mentioned in a caption that do not appear in the image's ground-truth annotations.

OpenCHAIR: an open-vocabulary extension of CHAIR.
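For reference, a minimal sketch of the instance-level CHAIR score (hallucinated objects divided by all mentioned objects), assuming the caption has already been parsed into object mentions; this is illustrative rather than the official implementation.

```python
def chair_i(mentioned_objects, gt_objects):
    """Instance-level CHAIR: hallucinated objects / all mentioned objects."""
    mentioned = set(mentioned_objects)
    hallucinated = mentioned - set(gt_objects)
    return len(hallucinated) / len(mentioned) if mentioned else 0.0

# Example: the caption mentions "dog", "frisbee", "cat"; only "dog" and "frisbee" are present.
print(chair_i(["dog", "frisbee", "cat"], ["dog", "frisbee"]))  # ~0.33
```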

Hallucination benchmark quality measurement framework

The authors select six representative, publicly available hallucination benchmarks, including MMHal and GAVIE.

The framework follows the approach of psychometric testing, assessing benchmarks in terms of reliability and validity.

Across different benchmarks, the same models receive noticeably different scores.

From the perspective of test-retest reliability, closed-ended benchmarks reveal obvious shortcomings.

Existing free-form VQA benchmarks exhibit limitations in both reliability and validity.
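As an illustration of how test-retest reliability might be quantified in this setting (my own sketch, not necessarily the paper's protocol): score the same models on the same benchmark twice, for example with different sampling seeds or shuffled answer options, and correlate the two score lists.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two score lists of equal length."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical benchmark scores for four models on run 1 and run 2.
run1 = [62.0, 55.5, 71.2, 48.9]
run2 = [60.5, 57.0, 70.1, 50.3]
print(pearson(run1, run2))  # close to 1.0 -> high test-retest reliability
```

A correlation close to 1 would indicate stable scores across repeated runs; unstable scores would show up as a lower value, which is the kind of shortcoming noted above for closed-ended benchmarks.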

Conclusion

Introduced a quality measurement framework for hallucination benchmarks

Impressions
It feels like they took insights from psychological reliability testing and brought them into AI evaluation.

