Discover and use great skill extensions
Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. Even top agents score below 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.