New Benchmark Suite Evaluates Financial AI Competence
Researchers introduce FINESSE-Bench, a new benchmark for evaluating financial AI’s technical analysis skills. The benchmark addresses gaps in existing LLM frameworks. #AI #FinancialTech
Researchers unveil PolitNuggets: a multilingual benchmark testing AI agents’ ability to discover rare political facts through the FactNet protocol. It advances evaluation beyond static QA to open-ended discovery.
Google DeepMind takes minority stake in EVE Online studio CCP Games, turning the 20-year-old space MMO into a testing ground for advanced multi-agent AI research. Terms undisclosed.
OpenAI’s GPT-5.5 edges past Anthropic’s Claude Mythos Preview on Terminal-Bench 2.0, the agentic benchmark — underscoring the tightening race between U.S. AI labs. #AI #OpenAI #Anthropic