Posted on 2025-7-14 13:55:40
Tencent improves testing of creative AI models with a new benchmark
Getting it right, like a human would
So, how does Tencent's AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
To see how the application behaves, it captures a series of screenshots over time. This lets it check things like animations, state changes after a button click, and other dynamic user feedback.
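The build-and-run step described above might be sketched roughly as follows. This is a minimal stand-in, not ArtifactsBench's actual harness: it executes generated code in a separate process with a hard timeout, whereas a real setup would use a proper sandbox (e.g. a container) plus a browser harness for the timed screenshots, which is omitted here.

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Write model-generated code to a temp file and run it in a child
    process with a timeout -- a lightweight stand-in for a real sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        return subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    finally:
        os.unlink(path)

# Hypothetical artifact standing in for AI-generated code.
result = run_in_sandbox("print('hello from the artifact')")
print(result.stdout.strip())
```

The timeout matters: generated code that hangs (an infinite animation loop, say) must not stall the whole evaluation run.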
Finally, it hands all of this evidence – the original request, the AI's code, and the screenshots – to a Multimodal LLM (MLLM) acting as a judge.
This MLLM judge isn't just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring covers functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
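The checklist idea can be illustrated with a toy aggregation step. The metric names and scores below are illustrative assumptions, not ArtifactsBench's actual rubric; the point is only that a per-task checklist yields per-metric scores that roll up into one number.

```python
from statistics import mean

# Hypothetical per-metric scores (0-10) from the MLLM judge for one task.
checklist = {
    "functionality": 9,
    "user_experience": 8,
    "aesthetics": 7,
    "robustness": 8,
    "code_quality": 9,
    "responsiveness": 8,
    "accessibility": 6,
    "performance": 8,
    "correctness": 9,
    "completeness": 8,
}

# Simple unweighted mean across the ten metrics.
overall = mean(checklist.values())
print(round(overall, 2))
```

Scoring each metric separately before averaging is what makes the judgment auditable: a low overall score can be traced back to, say, accessibility rather than functionality.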
The big question is: does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched with 94.4% consistency. That is a huge jump from older automated benchmarks, which managed only around 69.4% consistency.
On top of this, the framework's judgments showed over 90% agreement with professional human developers.
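One common way to quantify this kind of ranking consistency is pairwise agreement: the fraction of model pairs that two rankings order the same way. The sketch below is a generic illustration of that idea under assumed model names, not the article's stated methodology.

```python
from itertools import combinations

def pairwise_consistency(rank_a: list[str], rank_b: list[str]) -> float:
    """Fraction of item pairs ordered the same way by both rankings."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    # A pair agrees when both rankings place x and y in the same order.
    agree = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0 for x, y in pairs
    )
    return agree / len(pairs)

# Hypothetical leaderboards: the benchmark and a human-vote arena.
bench = ["model_a", "model_b", "model_c", "model_d"]
arena = ["model_a", "model_c", "model_b", "model_d"]
print(pairwise_consistency(bench, arena))
```

Here only the model_b/model_c pair is flipped, so 5 of 6 pairs agree. A 94.4% figure would mean almost every pairwise comparison matches the human-voted ordering.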
Source: https://www.artificialintelligence-news.com/