Choosing a federated learning (FL) algorithm is less a linear sequence of decisions than a set of coupled trade-offs. Optimizer variants, server aggregation rules, local training schedules, normalization, regularization, and model architecture all interact, so a small change to one can quietly alter the whole distributed training trajectory. Manual exploration is expensive, and fair comparisons remain elusive.

To address this, researchers built Auto-FL-Research (AFR), a constrained coding-agent workflow in which LLM-based agents propose and implement FL algorithmic recipes, including server aggregation rules, client update schedules, local objectives, and model variants. Each experiment runs inside a campaign that records scores, runtime, edited files, artifacts, and failure status. The benchmark spans five cross-silo healthcare tasks from FLamby and grouped-client profiles on LEAF datasets, plus a LEAF synthetic task. Gains appear on four FLamby tasks and five LEAF profiles, but five-seed repeats also expose seed-sensitive failures and search-selected artifacts.
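To make the campaign bookkeeping concrete, here is a minimal sketch of what one record per experiment might look like. It is not AFR's actual schema: the class names, field names, and helper methods are hypothetical, chosen only to mirror the quantities listed above (scores, runtime, edited files, artifacts, failure status).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExperimentRecord:
    """One agent-proposed experiment (hypothetical field names)."""
    name: str
    score: Optional[float]          # None when the run produced no score
    runtime_s: float                # wall-clock runtime in seconds
    edited_files: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)
    failed: bool = False


@dataclass
class Campaign:
    """A group of experiments on one task (hypothetical structure)."""
    task: str
    records: List[ExperimentRecord] = field(default_factory=list)

    def log(self, record: ExperimentRecord) -> None:
        self.records.append(record)

    def best(self) -> Optional[ExperimentRecord]:
        # Only successful runs with a score are eligible.
        ok = [r for r in self.records if not r.failed and r.score is not None]
        return max(ok, key=lambda r: r.score, default=None)

    def failure_rate(self) -> float:
        if not self.records:
            return 0.0
        return sum(r.failed for r in self.records) / len(self.records)
```

Keeping failed runs in the log, rather than discarding them, is what makes a failure rate computable afterwards and keeps the single best score from being read in isolation.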

The most instructive part comes from the same-budget controls: some improvements stem from genuine FL-recipe changes, others are recovered by scalar tuning on a fixed mutation surface, and others collapse under repeat or held-out evaluation. The authors call these “mixed outcomes” and treat them as part of the contribution, because they help separate repeatable FL mechanisms from local tuning effects and single-run ghosts.
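The three outcome types can be expressed as a simple decision rule. The sketch below is an illustration only, not the authors' procedure: the function name, the tolerance `tol`, and the exact comparisons are assumptions, and all gains are taken to be measured against the same baseline.

```python
def classify_gain(agent_gain: float,
                  tuning_gain: float,
                  repeat_gain: float,
                  tol: float = 0.005) -> str:
    """Label a single-run gain (hypothetical rule, assumed tolerance).

    agent_gain  -- gain of the agent's recipe in the original single run
    tuning_gain -- best gain from scalar tuning under the same budget
    repeat_gain -- gain of the agent's recipe under repeat or held-out evaluation
    """
    # The gain does not survive repeats or held-out data.
    if repeat_gain <= tol:
        return "collapses under repeat"
    # Scalar tuning alone gets within tolerance of the agent's gain.
    if tuning_gain >= agent_gain - tol:
        return "recovered by scalar tuning"
    # The gain survives and tuning cannot match it.
    return "recipe change"
```

For example, `classify_gain(0.04, 0.01, 0.035)` falls through both checks and is labeled as a recipe change, because the gain persists under repeats and tuning does not come close to it.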

For teams running on-premise federated learning infrastructure, often in healthcare or finance where data cannot leave the building, the signal is twofold. On one hand, agent-based automation can shorten the experimentation cycle and unearth combinations a human team might miss. On the other, non-repeatable results and single-seed gains demand caution: validating recipes across multiple seeds and held-out data is non-negotiable before any production change.
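A multi-seed check of this kind can be sketched as follows. It is a minimal example under stated assumptions: the function names, the paired per-seed comparison, and the default thresholds (`min_win_fraction`, `min_mean_gain`) are illustrative choices, not values taken from the source.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_gain(baseline: Sequence[float],
                   candidate: Sequence[float]) -> Dict[str, float]:
    """Paired per-seed comparison; index i must be the same seed in both lists."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need paired scores for at least two seeds")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    return {
        "mean_gain": mean(deltas),
        "std_gain": stdev(deltas),
        "wins": sum(d > 0 for d in deltas),
        "n_seeds": len(deltas),
    }


def is_repeatable(baseline: Sequence[float],
                  candidate: Sequence[float],
                  min_win_fraction: float = 0.8,   # assumed threshold
                  min_mean_gain: float = 0.0) -> bool:
    """True if the candidate wins on most seeds and improves on average."""
    s = summarize_gain(baseline, candidate)
    win_fraction = s["wins"] / s["n_seeds"]
    return s["mean_gain"] > min_mean_gain and win_fraction >= min_win_fraction
```

The same function can be applied a second time to scores computed on held-out data. A recipe whose average gain is carried by one lucky seed fails the win-fraction test even though its mean gain is positive.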

Cost figures are absent from the source. In an on-premise setting, total cost of ownership (TCO) could benefit from fewer person-hours spent on manual hyperparameter search. However, the compute cost of agent-driven campaigns, which combine LLM inference with multi-seed experiment replication, can strain local GPUs and CPUs, and that raises questions about how efficient such workflows really are when hardware is tight.

Auto-FL-Research is not a magic wand but an analytical lens that forces scrutiny of every claimed improvement. For organizations running FL on proprietary nodes, the real value lies in separating robust signals from single-run ghosts, an insight no cloud dashboard can offer.