Posted on 2025-8-7 12:46:42
Tencent improves testing of creative AI models with new benchmark
Getting it to judge creative output the way a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
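The article doesn’t publish ArtifactsBench’s harness code, but the build-and-run step can be sketched as a subprocess with a hard timeout. This is a minimal illustration, not Tencent’s implementation; a real sandbox would also isolate the filesystem and network.

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout_s: float = 10.0):
    """Write AI-generated code to a temp dir and execute it in a
    separate process with a timeout, capturing all output.
    Returns (succeeded, combined stdout/stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "artifact.py")
        with open(path, "w") as f:
            f.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True, text=True, timeout=timeout_s,
            )
            return proc.returncode == 0, proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            # A hung artifact counts as a failure rather than blocking the run.
            return False, "timed out"

ok, output = run_in_sandbox("print('hello artifact')")
```

The timeout matters: generated code that loops forever must fail the run rather than stall the whole benchmark.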
To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
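The time-series aspect is the interesting part: a single screenshot can’t show an animation, but a sequence can. A generic sketch of such a capture loop, with the actual screenshot mechanism injected as a callable (the function name and signature are illustrative, not from the article):

```python
import time
from typing import Callable, List

def capture_timeline(capture: Callable[[], bytes],
                     interval_s: float, count: int) -> List[bytes]:
    """Collect `count` frames spaced `interval_s` seconds apart, so a
    judge can compare early and late frames for animations or
    post-click state changes."""
    frames: List[bytes] = []
    for _ in range(count):
        frames.append(capture())
        time.sleep(interval_s)
    return frames
```

In practice `capture` would wrap a headless-browser screenshot call; comparing the first and last frames then reveals whether anything on the page actually moved.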
Finally, it hands all this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM), to act as a judge.
This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring covers functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
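A checklist-driven score is easy to aggregate mechanically once the judge has filled it in. The metric names below are hypothetical (the article only names functionality, user experience, and aesthetics among the ten), and equal weighting is an assumption:

```python
# Hypothetical metric names; the article confirms only three of the ten.
METRICS = [
    "functionality", "robustness", "user_experience", "interactivity",
    "visual_fidelity", "aesthetics", "responsiveness", "accessibility",
    "code_quality", "task_fit",
]

def aggregate_score(per_metric: dict) -> float:
    """Validate a judge's per-metric scores (0-10 each) and average
    them into a single task score, assuming equal weights."""
    missing = set(METRICS) - set(per_metric)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    for name in METRICS:
        value = per_metric[name]
        if not 0.0 <= value <= 10.0:
            raise ValueError(f"{name} out of range: {value}")
    return sum(per_metric[name] for name in METRICS) / len(METRICS)
```

Forcing the judge to fill every slot is what makes the scoring consistent across tasks: a model can’t score well just by acing one dimension while the checklist silently skips the rest.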
The big question is, does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a massive jump from older automated benchmarks, which only managed around 69.4% consistency.
On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
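The article doesn’t say exactly how the consistency figure is computed, but one standard way to compare two rankings of the same models is pairwise agreement: the fraction of model pairs that both rankings order the same way. A sketch under that assumption:

```python
from itertools import combinations

def pairwise_consistency(rank_a, rank_b):
    """Fraction of model pairs that two rankings (lists of model names,
    best first, same members) order identically."""
    pos_a = {model: i for i, model in enumerate(rank_a)}
    pos_b = {model: i for i, model in enumerate(rank_b)}
    agree = total = 0
    for x, y in combinations(rank_a, 2):
        total += 1
        # Same sign of position difference => same relative order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            agree += 1
    return agree / total
```

For example, if a benchmark and a human arena agree on every pair except one adjacent swap among four models, the score is 5/6 ≈ 83.3%; a 94.4% figure would mean near-total agreement on pairwise orderings.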
Source: https://www.artificialintelligence-news.com/