New Platform Uses Claude, GPT, Gemini, and Grok Agents to Score AI Tools Independently

The Problem With AI Tool Reviews

Let's be real—most AI tool reviews are hot garbage. You've got vendors hyping their own benchmarks, influencers shilling affiliate deals, and a hundred 'best of' lists that all copy each other. Finding what's actually worth your time has become its own full-time job. That's the gap Tested is trying to fill with their new platform at trytested.com.

How Tested Works

The core idea is elegant in its simplicity: let AI agents evaluate AI tools, then publish those scores transparently. Every tool on Tested gets scored independently by four different frontier model agents—Anthropic's Claude, OpenAI's GPT models, Google's Gemini, and xAI's Grok. Each agent runs its own evaluation with its own criteria, and the platform publishes every score with a date stamp so you can track how tools evolve over time.
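To make the "four agents, date-stamped scores" idea concrete, here is a minimal sketch of how such records could be stored and summarized. Everything in it is an assumption for illustration: the `Score` record, the 0-10 scale, the `latest_by_agent` and `summary` helpers, and the sample data are hypothetical and are not taken from Tested's actual data model or API.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class Score:
    """One agent's score for one tool on one date (hypothetical schema)."""
    tool: str
    agent: str        # e.g. "claude", "gpt", "gemini", "grok"
    value: float      # assumed 0-10 scale; the real scale is not stated
    scored_on: date   # the date stamp published with the score


def latest_by_agent(scores, tool):
    """Return the most recent Score per agent for the given tool."""
    latest = {}
    for s in scores:
        if s.tool != tool:
            continue
        current = latest.get(s.agent)
        if current is None or s.scored_on > current.scored_on:
            latest[s.agent] = s
    return latest


def summary(scores, tool):
    """Average the latest per-agent scores and report how far they disagree."""
    values = [s.value for s in latest_by_agent(scores, tool).values()]
    return {"mean": mean(values), "spread": max(values) - min(values)}


# Made-up sample data: four agents score a tool, then one agent re-scores it later.
SAMPLE = [
    Score("example-tool", "claude", 7.0, date(2025, 1, 10)),
    Score("example-tool", "gpt", 8.0, date(2025, 1, 10)),
    Score("example-tool", "gemini", 6.0, date(2025, 1, 10)),
    Score("example-tool", "grok", 7.0, date(2025, 1, 10)),
    Score("example-tool", "claude", 9.0, date(2025, 3, 1)),
]

print(summary(SAMPLE, "example-tool"))
```

Keeping every dated record, rather than overwriting old scores, is what would let a reader see how a tool's rating moves over time; the spread between agents is one rough signal of how much the judges disagree.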

Multi-Agent Evaluation: Snake Eating Its Own Tail?

Here's where it gets interesting. Using LLMs to evaluate LLMs isn't exactly novel: researchers have been running LLM-as-a-judge setups for years in benchmarks like AlpacaEval and MT-Bench. But Tested's setup invites conflicts of interest (Claude scoring tools that might compete with Claude, Grok scoring things adjacent to xAI products), which makes it either brilliant meta-awareness or a potential minefield. The platform doesn't shy away from this tension, which I respect.
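One way a reader could probe that conflict-of-interest worry is to compare the score from the agent whose vendor makes the tool against the scores from the other agents. The sketch below is purely illustrative: the `AGENT_VENDOR` mapping follows the vendors named in this article, while `conflicted_agents`, `self_preference_gap`, and the sample scores are hypothetical and say nothing about how Tested itself handles the issue.

```python
from statistics import mean

# Which vendor stands behind each judging agent, as named in the article.
AGENT_VENDOR = {
    "claude": "Anthropic",
    "gpt": "OpenAI",
    "gemini": "Google",
    "grok": "xAI",
}


def conflicted_agents(tool_vendor):
    """Agents whose own vendor makes the tool being scored."""
    return sorted(a for a, v in AGENT_VENDOR.items() if v == tool_vendor)


def self_preference_gap(scores, tool_vendor):
    """Mean score from conflicted agents minus mean score from the rest.

    `scores` maps agent name -> score. Returns None when the comparison
    is not possible (no conflicted agent, or no independent agent).
    """
    conflicted = conflicted_agents(tool_vendor)
    own = [scores[a] for a in conflicted if a in scores]
    others = [s for a, s in scores.items() if a not in conflicted]
    if not own or not others:
        return None
    return mean(own) - mean(others)


# Made-up scores for a hypothetical tool built by Anthropic.
example_scores = {"claude": 9.0, "gpt": 7.0, "gemini": 6.0, "grok": 8.0}
print(self_preference_gap(example_scores, "Anthropic"))
```

A large positive gap on a single tool proves nothing by itself, but a consistent gap across many tools from the same vendor would be worth a closer look.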

Commission Transparency

Tested does accept affiliate commissions on some links—but they make a point of stating this never influences rankings or pricing. Whether you trust that claim is your call, but the transparency itself is refreshing in a space where most review sites bury their financial relationships three clicks deep in fine print.

Key Takeaways

Four independent frontier model agents evaluate each tool: Claude, GPT, Gemini, and Grok
All scores are timestamped for historical tracking and accountability
Affiliate links exist and are disclosed; the platform states they never influence rankings or pricing
The site is billed as 'a squidcode project', and its builders appear to be thinking about AI tooling from first principles

The Bottom Line

Tested won't solve the AI evaluation problem overnight, but it's tackling a real pain point with an approach that at least attempts to remove human bias from the equation. Whether four LLMs judging other LLMs produces better signal than traditional reviews—or just creates a more sophisticated echo chamber—remains to be seen. Worth bookmarking and watching how the scoring data evolves.
