Topic / Trend Rising

AI Model Performance and Benchmarking

The AI community is actively evaluating and comparing the performance of various large language models (LLMs) using benchmarks and real-world tasks. This includes efforts to improve efficiency, reduce toxicity, and tailor models for specific hardware configurations.

Detected: 2026-02-22 · Updated: 2026-03-02

Related Coverage

2026-03-02 ArXiv cs.AI

HumanMCP: A Dataset for Evaluating Tool Retrieval Performance

Introducing HumanMCP, a new large-scale dataset for evaluating the effectiveness of Model Context Protocol (MCP) servers. The dataset includes realistic, diverse, and high-quality user queries designed to simulate human interactions with 2800 tools a...

#Fine-Tuning
2026-03-01 LocalLLaMA

Qwen3.5 Small Dense model release seems imminent?

Rumors on Reddit suggest the imminent release of Qwen3.5 Small Dense. The open-source community is eager to evaluate the performance and potential applications of this model.

#Hardware #LLM On-Premise #DevOps
2026-03-01 LocalLLaMA

Qwen 3.5 27B: Best Chinese Translation Model Under 70B

A LocalLLaMA user reports that Qwen 3.5 27B offers Chinese translations comparable to GPT-3.5 and Gemini, outperforming other models up to 70B. The model was tested on a local setup with 24GB of VRAM, highlighting excellent tone and consistency.

#LLM On-Premise #DevOps
2026-02-28 LocalLLaMA

Google: Longer Reasoning Chains Don't Imply Higher Accuracy in LLMs

New research from Google challenges the assumption that longer reasoning chains lead to better results in language models. The study introduces the concept of Deep Thinking Ratio (DTR) to measure reasoning quality, demonstrating that accurate token s...

#LLM On-Premise #DevOps
2026-02-28 LocalLLaMA

Monthly update on top-performing open-weight models

A monthly overview of top-performing open-weight models, evaluated based on community discussions and benchmarks. The initiative aims to provide an updated view of open-source alternatives to proprietary models, focusing on their capabilities and lim...

#LLM On-Premise #DevOps
2026-02-27 LocalLLaMA

LLmFit: a tool to find the right LLM for your hardware

LLmFit is a terminal tool that helps identify which LLM best fits available hardware resources. It analyzes system RAM, CPU, and GPU, evaluating models based on quality, speed, and context, suggesting the most suitable ones for execution.

#Hardware #LLM On-Premise #DevOps
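The fitting logic such a tool needs can be approximated in a few lines. The sketch below is only a guess at the general approach; the catalog, overhead constant, and memory formula are illustrative assumptions, not LLmFit's actual logic:

```python
# Hypothetical hardware-fit heuristic in the spirit of LLmFit.
# Footprint model and catalog entries are illustrative assumptions.

def est_memory_gb(params_b: float, bits_per_weight: int, ctx_overhead_gb: float = 2.0) -> float:
    """Rough footprint: weight bytes plus a flat allowance for KV cache and runtime."""
    return params_b * bits_per_weight / 8 + ctx_overhead_gb

def models_that_fit(available_gb: float, catalog: dict) -> list:
    """Return catalog entries whose estimated footprint fits in available memory."""
    return [name for name, (params_b, bits) in catalog.items()
            if est_memory_gb(params_b, bits) <= available_gb]

catalog = {
    "27B @ q4": (27.0, 4),   # ~15.5 GB estimated
    "35B @ q4": (35.0, 4),   # ~19.5 GB estimated
    "70B @ q4": (70.0, 4),   # ~37 GB estimated
    "8B @ q8":  (8.0, 8),    # ~10 GB estimated
}

print(models_that_fit(24.0, catalog))  # → ['27B @ q4', '35B @ q4', '8B @ q8']
```

A real tool would also account for context length, backend overhead, and CPU/GPU memory splits.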
2026-02-27 LocalLLaMA

Small Qwen 3.5 27B and Qwen 35B-A3B models excel in logical reasoning

The Qwen 3.5 27B and Qwen 35B-A3B models have demonstrated remarkable logical reasoning capabilities on a specific benchmark. The results, obtained using lineage-bench, highlight how relatively small models can handle complex deductions from hundr...

#Hardware #LLM On-Premise #DevOps
2026-02-27 LocalLLaMA

Qwen3.5: promising performance for real-world workloads

A user tested Qwen3.5-35B-A3B-UD-Q6_K_XL on real-world projects, finding positive results. Token generation speed is high, especially on a single GPU. The experience suggests a potential shift to a hybrid model, with API models for spec generation an...

#Hardware #LLM On-Premise #DevOps
2026-02-27 ArXiv cs.CL

GPT-5: Contextual Analysis and Advanced Prompt Engineering

A new study explores the use of LLMs, specifically GPT-5, for analyzing the context of textual citations. The research focuses on prompt sensitivity, varying their structure to assess how they influence the model's interpretations. The goal is to und...

2026-02-27 ArXiv cs.AI

FIRE: A Comprehensive Benchmark for Financial Intelligence of LLMs

FIRE is a new benchmark for evaluating LLM capabilities in the financial domain. It includes theoretical knowledge tests based on certification exams and practical scenarios with 3,000 questions. Results obtained with state-of-the-art models, such as...

2026-02-26 The Register AI

AI models still struggle with math, but less than before

According to the ORCA test, current large language models (LLMs), while improving, remain prediction engines and do not always provide the correct solution to mathematical problems. Even Gemini 3 Flash, among the top performers, would receive a medio...

#LLM On-Premise #DevOps
2026-02-26 LocalLLaMA

Qwen3.5-35B-A3B: Optimized GGUF for 24GB GPUs

A new GGUF quantization for the Qwen3.5-35B-A3B model promises improved performance on GPUs with 24GB of VRAM. The optimization focuses on using q8_0/q4_0/q4_1 quantization types and aims for increased speed, especially with Vulkan/ROCm backends. The...

#Hardware #LLM On-Premise
2026-02-25 IEEE Spectrum

AI Is Acing Math Exams Faster Than Scientists Write Them

Artificial intelligence systems are rapidly improving in solving complex mathematical problems, surpassing the capabilities of scientists in some areas. New benchmarks are needed to assess the true capabilities of AI, as existing ones quickly become ...

2026-02-24 PyTorch Blog

Accelerating Autotuning in Helion with Bayesian Optimization

Helion, the high-level DSL for high-performance ML kernels, introduces a new search algorithm (LFBO Pattern Search) that leverages Bayesian optimization to drastically reduce autotuning times. The algorithm, based on machine learning models, filters ...

#Hardware
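The core idea of filtering expensive benchmark runs through a cheap learned model can be sketched generically. The surrogate, configs, and runtime model below are stand-ins, not Helion's LFBO Pattern Search:

```python
import random

def autotune(configs, surrogate, benchmark, top_k=4):
    """Surrogate-filtered autotuning: score every candidate config with a cheap
    learned model, benchmark only the most promising few, return the fastest.
    A generic sketch of the idea, not Helion's actual algorithm."""
    ranked = sorted(configs, key=surrogate)   # cheap: no kernel launches
    shortlist = ranked[:top_k]                # expensive measurements only here
    return min(shortlist, key=benchmark)

# Stand-ins: true runtime is quadratic around block size 64; the surrogate
# is a bounded-noise approximation of it.
random.seed(0)
def true_runtime(cfg): return (cfg["block"] - 64) ** 2 + 10
def surrogate(cfg):    return true_runtime(cfg) + random.uniform(-50, 50)

configs = [{"block": b} for b in (16, 32, 48, 64, 96, 128, 256)]
best = autotune(configs, surrogate, true_runtime)
print(best)  # → {'block': 64}
```

Because the surrogate only needs to get the ranking roughly right, most candidates never touch the hardware, which is where the autotuning-time savings come from.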
2026-02-24 ArXiv cs.CL

ConfSpec: Efficient Step-Level Speculative Reasoning for LLMs

ConfSpec is a framework that accelerates inference in large language models (LLMs) through step-level speculative verification. It leverages smaller, well-calibrated verification models to reduce latency while maintaining target model accuracy. It op...

#Hardware #LLM On-Premise #DevOps
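Step-level speculative verification in general follows a draft-then-verify loop. A toy sketch with stand-in components; none of the functions below reflect ConfSpec's actual models or acceptance rule:

```python
import random

def draft_step(state):            # stand-in for a small, fast draft model
    return state + [random.choice("abc")]

def verifier_confidence(state):   # stand-in for a calibrated verification model
    return random.random()

def target_step(state):           # stand-in for the slow, accurate target model
    return state + ["T"]

def speculative_reasoning(n_steps=10, threshold=0.5, seed=0):
    """Draft-then-verify loop: accept cheap draft steps when the verifier is
    confident, fall back to the target model otherwise. Illustrative only."""
    random.seed(seed)
    state, accepted, fallbacks = [], 0, 0
    for _ in range(n_steps):
        candidate = draft_step(state)
        if verifier_confidence(candidate) >= threshold:
            state, accepted = candidate, accepted + 1
        else:
            state, fallbacks = target_step(state), fallbacks + 1
    return state, accepted, fallbacks

state, acc, fb = speculative_reasoning()
print(len(state), acc, fb)  # 10 steps total, split between draft and target
```

Latency savings come from the accepted draft steps; accuracy depends on the verifier being well calibrated, which is the calibration point the abstract emphasizes.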
2026-02-23 LocalLLaMA

GLM-5 surpasses Kimi K2.5 on the NYT Connections benchmark

The GLM-5 model has achieved a new high score on the Extended NYT Connections benchmark, surpassing Kimi K2.5 Thinking. This result highlights the progress in the field of open-source language models and their ability to solve complex reasoning and a...

#LLM On-Premise #DevOps
2026-02-23 LocalLLaMA

Benchmarking 17 local LLMs: focusing on tool calling

A recent study compared 17 large language models (LLMs) running locally, evaluating their "tool calling" capabilities in real-world scenarios. The research highlights how the "agentic loop" approach, where the model receives feedback from the tools, ...

#Hardware #LLM On-Premise #Fine-Tuning
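The agentic loop the study refers to, where tool output is fed back to the model before it answers, can be sketched minimally. The toy model and calculator tool are illustrative stand-ins, not any benchmarked LLM:

```python
def run_agentic_loop(model, tools, task, max_turns=5):
    """Minimal agentic loop: the model proposes a tool call, receives the
    tool's output as feedback, and iterates until it answers."""
    history = [("task", task)]
    for _ in range(max_turns):
        action = model(history)
        if action["type"] == "answer":
            return action["text"]
        result = tools[action["tool"]](**action["args"])
        history.append(("tool_result", result))  # feedback closes the loop
    return None

# Stand-in model: call the calculator once, then answer with its result.
def toy_model(history):
    if history[-1][0] == "task":
        return {"type": "call", "tool": "calc", "args": {"expr": "6*7"}}
    return {"type": "answer", "text": str(history[-1][1])}

tools = {"calc": lambda expr: eval(expr, {"__builtins__": {}})}  # toy only

print(run_agentic_loop(toy_model, tools, "what is 6*7?"))  # → 42
```

The contrast the benchmark draws is between this loop and single-shot tool calling, where the model never sees the tool's result.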
2026-02-23 LocalLLaMA

Open-source framework for local LLMs: Gemini 3/GPT-5.2 performance

A new open-source framework aims to bridge the performance gap between proprietary large language models (LLMs) and locally run alternatives. The goal is to achieve performance levels comparable to Gemini 3 Deep Think and GPT-5.2 Pro using self-hoste...

#LLM On-Premise #DevOps
2026-02-21 LocalLLaMA

Wave Field LLM: O(n log n) attention via wave equation dynamics

A novel attention mechanism for LLMs, Wave Field LLM, uses wave equations to scale at O(n log n). The model maps tokens onto a continuous 1D field and propagates information via damped wave equations. Initial results on WikiText-2 show competitive pe...
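The damped wave propagation it describes is a standard PDE; a generic finite-difference integrator conveys the mechanism. Parameters and boundary handling here are illustrative, not the paper's implementation:

```python
import numpy as np

def damped_wave_step(u, u_prev, c=1.0, gamma=0.05, dx=1.0, dt=0.5):
    """One explicit finite-difference step of u_tt = c^2 * u_xx - gamma * u_t
    on a 1D field with fixed zero boundaries. Generic PDE integration, not
    the Wave Field LLM code."""
    lap = np.zeros_like(u)
    lap[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    u_next = (2 * u - u_prev
              + dt**2 * c**2 * lap
              - gamma * dt * (u - u_prev))
    return u_next, u

n = 64
u = np.zeros(n); u[n // 2] = 1.0      # a token's contribution as an impulse
u_prev = u.copy()
for _ in range(20):
    u, u_prev = damped_wave_step(u, u_prev)
print(f"field energy after 20 steps: {np.sum(u**2):.4f}")
```

The impulse spreads outward and decays, which is the intuition behind replacing all-pairs attention with field propagation: information travels across the sequence without computing an n-by-n interaction matrix.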

2026-02-21 LocalLLaMA

GLM-4.7: Distilled Model for Advanced Reasoning Locally

A distilled model named GLM-4.7, designed to offer advanced reasoning capabilities, is available on Hugging Face. This version, mentioned by Unsloth, aims to provide high performance in local usage contexts. The model is available in GGUF format, fac...

#Hardware #LLM On-Premise #DevOps
2026-02-21 LocalLLaMA

GLM-5: "Claude" Personality and Censorship Bypass?

A user discovered that GLM-5, a large language model, significantly changes its behavior when told it is Claude from Anthropic. This personality shift also appears to bypass some built-in censorship. It remains unclear whether this behavior is intent...

#LLM On-Premise #DevOps
2026-02-21 TechCrunch AI

Sarvam launches Indus AI chat app as competition heats up

Indian startup Sarvam AI has launched Indus, a new chat application powered by artificial intelligence. Currently available in beta, Indus enters a rapidly expanding market where competition among AI solution providers is increasingly intense.

#LLM On-Premise #DevOps
2026-02-20 OpenAI Blog

AI Model Attempts High-Level Math Challenges

An artificial intelligence model tackles the First Proof math challenge, a competition testing reasoning capabilities on complex problems. The initiative aims to evaluate the performance of AI models in scenarios requiring expert-level skills.

2026-02-20 Tom's Hardware

The great GPU benchmark retest begins — what to expect for 2026

A preview of the benchmark tests for next-generation GPUs, including RTX 50 and RX 9000 series. The article focuses on testing methodologies and performance expectations, excluding upscaling and frame generation.

#Hardware #LLM On-Premise #DevOps
2026-02-20 LocalLLaMA

Deepseek and Gemma: comparison in the LocalLLaMA community

A Reddit post in the LocalLLaMA community compares Deepseek and Gemma models. The discussion revolves around the characteristics and performance of these models, with a focus on local usage. The original article includes an image, presumably comparat...

#LLM On-Premise #DevOps
2026-02-20 LocalLLaMA

Hallucination Benchmark: Kimi K2.5 outperforms Opus 4.6 in Pharma

A recent benchmark measured how prone several large language models (LLMs) are to hallucination in the pharmaceutical domain. Surprisingly, Kimi K2.5 outperformed Opus 4.6 in this specific test. The dataset used is available on Hugging Face, offe...

#LLM On-Premise #Fine-Tuning #DevOps
2026-02-20 LocalLLaMA

SanityBoard: New LLM Models and Open Source Agents Compared

SanityBoard updates with new benchmark results for models like Qwen3.5 Plus, GLM 5, and Gemini 3.1 Pro, along with three new open source coding agents. The analysis highlights the importance of infrastructure and model characteristics (iteration) on ...

#LLM On-Premise #DevOps
2026-02-20 LocalLLaMA

Luma v2.9: a compact LLM trainable locally

Luma v2.9, a small language model (around 10 million parameters) based on a transformer architecture, has been released. Its key feature is that it can be trained with custom data and run entirely locally, without cloud dependencies or telemetry. The...

#Hardware #LLM On-Premise #Fine-Tuning
2026-02-20 ArXiv cs.CL

Evaluating LLMs for Greek QA: The DemosQA Benchmark

A new study introduces DemosQA, a dataset for Question Answering in Greek, built from social media user questions. The research evaluates 11 language models, both monolingual and multilingual, using different prompting strategies, aiming to bridge th...

#Fine-Tuning
2026-02-20 ArXiv cs.CL

References Improve LLM Alignment in Non-Verifiable Domains

A new study explores the use of reference-guided LLM-evaluators to improve the alignment of large language models (LLMs) in non-verifiable domains. The results show that this approach can significantly improve the accuracy of LLM-judges and lead to p...

#LLM On-Premise #DevOps
2026-02-20 ArXiv cs.AI

LLMs and GraphRAG for Design Structure Matrix Generation

A new study explores the use of Large Language Models (LLMs) and Graph-based Retrieval-Augmented Generation (GraphRAG) to automate the creation of Design Structure Matrices (DSMs) in cyber-physical systems. The research evaluates performance on two u...

#LLM On-Premise #DevOps #RAG
2026-02-20 LocalLLaMA

Qwen3 Coder Next: impressive performance with 102GB of RAM

A user tested Qwen3 Coder Next FP8 by converting Flutter documentation with a three-sentence prompt and a 64K token context window. The model required 102GB of RAM out of 128GB available, outperforming other OSS models like GPT OSS 120B and GLM 4.7 F...

#Hardware
2026-02-20 TechCrunch AI

Google Gemini Pro 3.1: Record-Breaking Benchmark Scores

Google's new Gemini Pro 3.1 model promises advanced capabilities for handling complex workloads. Benchmark performances suggest a significant step forward in Google's LLM capabilities.

#LLM On-Premise #DevOps
2026-02-20 LocalLLaMA

Gemini 3.1 Coming Before Gemma 4: Sneak Peek

Rumors suggest that Google might release Gemini 3.1 before Gemma 4. The news, appearing on Antigravity and shared on Reddit, fuels speculation about Google's next moves in the field of large language models (LLMs). It remains to be seen what improvem...

2026-02-19 LocalLLaMA

GLM-5: LLM Survives (Almost) a Month on FoodTruck Bench

GLM-5, a large language model (LLM), nearly completed a month of testing on the FoodTruck Bench platform, designed to simulate real-world business scenarios. Despite good diagnostic capabilities and efficient tool usage, the model failed due to exces...

#LLM On-Premise #DevOps
2026-02-19 The Register AI

Google germinates Gemini 3.1 Pro in ongoing AI model race

Google announced the release of Gemini 3.1 Pro, characterizing the model's arrival as "a step forward in core reasoning." This new AI model promises improved reasoning capabilities, fueling the race in the large language model (LLM) space.

#Hardware #LLM On-Premise #DevOps
2026-02-19 Ars Technica AI

Google announces Gemini 3.1 Pro, focuses on complex problem-solving

Google announced Gemini 3.1 Pro, the latest version of its AI model. It promises significant improvements in problem-solving and reasoning capabilities. The model is currently in preview for developers and consumers. Google's internal benchmarks show...

#LLM On-Premise #DevOps
2026-02-19 LocalLLaMA

TextWeb: Render web pages as text grids for AI agents

TextWeb is an open-source project that transforms web pages into small text grids (2-5KB) that AI agents can process efficiently. Instead of 1MB screenshots, TextWeb offers MCP, LangChain, and CrewAI integrations for this more compact representation of page content.

#LLM On-Premise #DevOps
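The idea of collapsing markup into a compact text representation can be sketched with the standard library. This is a rough approximation of the concept, not TextWeb's renderer:

```python
from html.parser import HTMLParser

class TextGrid(HTMLParser):
    """Collapse a page's markup into compact text lines, in the spirit of
    rendering pages as small text grids for agents. Illustrative only."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.lines, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())      # squeeze whitespace runs
        if text and not self._skip:
            self.lines.append(text)

html = ("<html><head><style>body{color:red}</style></head>"
        "<body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>")
p = TextGrid()
p.feed(html)
print(" | ".join(p.lines))  # → Title | Hello | world | .
```

A kilobyte-scale text rendering like this fits comfortably in an agent's context where a screenshot would not.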
2026-02-19 LocalLLaMA

Kitten TTS V0.8: New SOTA Super-tiny TTS Model (Less than 25 MB)

Kitten ML has released Kitten TTS V0.8, a series of super-tiny open-source text-to-speech (TTS) models, with the smallest model taking up less than 25 MB. These models, available under the Apache 2.0 license, offer eight expressive voices and can run...

#Hardware #LLM On-Premise #Fine-Tuning
2026-02-19 LocalLLaMA

Advanced Visualization of Quantization Techniques for Local LLMs

A Reddit user has revisited and expanded previous work on visualizing quantization techniques, including new types and PPL/KLD measurements to evaluate efficiency. Source code and some results are available on Codeberg. The analysis focuses on the im...

#LLM On-Premise #DevOps
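KLD-based comparisons of this kind measure how far a quantized model's token distribution drifts from the full-precision one. A minimal sketch, with made-up logits for illustration:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats between two token distributions, as used to compare
    a full-precision model's output against its quantized counterpart."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Illustrative logits: the quantized model's logits are slightly perturbed.
fp_logits    = [2.0, 1.0, 0.5, -1.0]
quant_logits = [1.9, 1.1, 0.4, -0.9]

kld = kl_divergence(softmax(fp_logits), softmax(quant_logits))
print(f"per-token KLD: {kld:.5f} nats")  # near zero for a faithful quantization
```

Averaged over many tokens, this gives a per-quantization score that, unlike perplexity alone, directly measures divergence from the full-precision model rather than from the text.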
2026-02-18 LocalLLaMA

LLM Benchmark: Logical Reasoning and the 'Car Wash' Test

A test on 53 language models assessed their ability to solve a simple reasoning problem: if the car wash is 50 meters away, is it better to walk or drive? Only a minority answered correctly and consistently, highlighting the challenges in achieving r...

#LLM On-Premise #DevOps
2026-02-18 LocalLLaMA

LLMs grading other LLMs: a meta-analysis

A Reddit user has repeated an interesting experiment: having different language models evaluate the performance of other LLMs on specific criteria. The collected data is available on Hugging Face for further analysis and comparison.

#LLM On-Premise #DevOps
2026-02-18 Tom's Hardware

Nvidia's DLSS 4.5 Dominates FSR and Native Rendering in Blind Tests

In blind tests across six titles, Nvidia's DLSS 4.5 significantly outperformed AMD's FSR and native rendering. The results indicate a strong preference among PC gamers for Nvidia's technology, highlighting the importance of upscaling and frame genera...

#Hardware #LLM On-Premise #DevOps
2026-02-18 LocalLLaMA

GLM-5: New techniques to reduce training and inference costs

The GLM-5 technical report reveals key innovations such as DSA adoption to reduce training and inference costs, an asynchronous RL infrastructure to improve post-training efficiency, and Agent RL algorithms for more effective learning. The model achi...

#LLM On-Premise #Fine-Tuning #DevOps
2026-02-18 ArXiv cs.LG

Anomaly Detection: Time Series Embeddings and Feature Engineering

A novel hybrid approach combines time series embeddings from Granite TinyTimeMixer with statistical features for anomaly detection in HVAC equipment. The integration, evaluated on over 50,000 samples, achieves high precision and a low false positive ...

#LLM On-Premise #Fine-Tuning #DevOps
2026-02-18 The Register AI

Anthropic's Sonnet 4.6 Improves Coding and Reasoning

Anthropic has released version 4.6 of the Sonnet model, focusing on improved coding, reasoning, and planning capabilities. The model also promises more 'warm, honest, and prosocial' responses.

#LLM On-Premise #DevOps
2026-02-17 LocalLLaMA

Reasoning test: AI models fail on car wash scenario

A test conducted on 53 AI models revealed difficulties in basic reasoning. Many models provided incorrect answers to a simple question about car washing, suggesting that real-world reasoning capabilities are still a challenge for AI.

2026-02-17 LocalLLaMA

Qwen3.5: Attention Architectures Under Scrutiny

An article by Maxime Labonne explores the different attention implementations in the Qwen3.5 language model. The discussion, originating on Reddit, highlights the lack of unanimous agreement on the most effective attention architectures, opening a de...

#LLM On-Premise #DevOps
2026-02-17 ArXiv cs.CL

LLM-Powered Automatic Translation: Urgency Matters in Crisis Scenarios

Large language models (LLMs) are increasingly proposed for crisis management, particularly for multilingual communication. A recent study highlights how automatic translations, even if linguistically correct, can alter the perception of urgency, a cr...

#LLM On-Premise #DevOps
2026-02-17 ArXiv cs.CL

Multimodal Data Selection for ASR Accent Adaptation

A novel approach to improve automatic speech recognition (ASR) systems when dealing with different accents. The technique uses multimodal consistency to select training data without labels, reducing the performance gap compared to fully supervised tr...

#Fine-Tuning
2026-02-17 ArXiv cs.LG

Uncertainty Quantification in Generative Models: A New Approach

A new framework, Directional Concentration Uncertainty (DCU), promises to enhance the reliability of generative models. DCU quantifies uncertainty by measuring the geometric dispersion of outputs, surpassing traditional heuristics and generalizing to...

#LLM On-Premise #DevOps
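The abstract's "geometric dispersion of outputs" suggests a concentration-style measure over sampled outputs. One generic way to compute such a score, an assumption-laden sketch rather than the paper's exact DCU definition:

```python
import numpy as np

def directional_uncertainty(samples: np.ndarray) -> float:
    """Uncertainty as 1 minus the resultant length of the unit output vectors:
    tightly clustered directions -> near 0, dispersed directions -> near 1.
    A generic directional-concentration measure, not the paper's formulation."""
    unit = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    resultant = np.linalg.norm(unit.mean(axis=0))   # in [0, 1]
    return 1.0 - resultant

rng = np.random.default_rng(0)
mean_dir = rng.standard_normal(16)
tight = mean_dir + 0.01 * rng.standard_normal((32, 16))  # concentrated samples
loose = rng.standard_normal((32, 16))                    # dispersed samples

print(f"tight: {directional_uncertainty(tight):.3f}")  # near 0
print(f"loose: {directional_uncertainty(loose):.3f}")  # closer to 1
```

The appeal of dispersion-based scores is that they need only repeated sampling from the generative model, with no access to internal probabilities.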
2026-02-16 LocalLLaMA

Qwen 3 Max-Thinking: Superior Performance in Spatial Reasoning

A spatial reasoning benchmark (MineBench) demonstrates a significant performance improvement in the Qwen 3 Max-Thinking model compared to Qwen 3.5. The results suggest that Qwen 3 Max-Thinking approaches or surpasses models like Opus 4.6, GPT-5.2, an...

2026-02-16 LocalLLaMA

Qwen 3.5 struggles on Vending-Bench 2: results analysis

A user reported difficulties with the Qwen 3.5 language model when running the Vending-Bench 2 benchmark. The analysis of the results, shared on Reddit, highlights the model's limitations in this specific scenario. Vending-Bench 2 is designed to test...

#LLM On-Premise #DevOps
2026-02-16 ArXiv cs.AI

AI Benchmarking: An Adaptive Framework for Utility-Weighted Evaluations

A new study introduces a theoretical framework for rethinking benchmarking in artificial intelligence, particularly for large language models. The framework proposes an adaptive approach that integrates stakeholder priorities and evaluation metrics, ...

#LLM On-Premise #DevOps