GitHub's Scaling Challenges: AI's Impact on Service Availability

The AI Wave Puts GitHub to the Test

In recent months, GitHub has faced significant challenges with its service availability. The code-sharing platform, a cornerstone for millions of developers, is experiencing the impact of rapidly growing traffic, largely fueled by the widespread adoption of AI-assisted coding tools and new "agentic" development workflows. This surge has severely tested the existing infrastructure, making service stability a constant challenge.

To address these issues, GitHub has embarked on an ambitious strategy of expanding capacity and migrating an increasing number of workloads to Microsoft's Azure infrastructure. Despite these efforts, the situation has not yet stabilized. GitHub's availability report for May 2026 acknowledged nine incidents that degraded performance, a slight improvement from the ten reported in April, but the path to full reliability remains long.

Scaling and Migration: A Race Against Time

The numbers show the size of the challenge. Although GitHub initially planned a tenfold capacity expansion in October 2025, by February 2026 it became clear that a thirtyfold expansion would be needed to handle the enormous volume of pull requests, commits, and new repositories. Last year, GitHub processed one billion commits over the full twelve months; today, it receives 1.4 billion commits every month.
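To put the two commit figures on the same footing, the monthly rate can be annualized and compared with last year's total. The short Python sketch below does only that arithmetic. The variable names are illustrative, and it assumes the current monthly rate holds steady for twelve months, which the article does not claim.

```python
# Figures quoted in the article.
COMMITS_LAST_YEAR = 1_000_000_000       # one billion commits over the whole year
COMMITS_PER_MONTH_NOW = 1_400_000_000   # 1.4 billion commits every month

# Assumption (not from the article): the current monthly rate holds for 12 months.
annualized_now = COMMITS_PER_MONTH_NOW * 12
growth_factor = annualized_now / COMMITS_LAST_YEAR

print(f"Annualized commits at the current rate: {annualized_now:,}")
print(f"Growth versus last year: {growth_factor:.1f}x")
```

Under that assumption the commit volume is roughly 17 times last year's. Commit volume is only one of the load drivers named above, so this ratio is not a direct measure of the capacity GitHub needs.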

Jakub Oleksy, SVP of Software Engineering at GitHub, stated in the report that the company is implementing "structural changes that permanently remove failure modes." He also highlighted migration progress: "We are now serving 40 percent of monolith traffic from Azure (up from 8 percent in February), with Git traffic at 30 percent and repository replication at 99 percent." These efforts have more than doubled effective capacity in just four months. However, availability remains a concern, partly because Azure itself has recently encountered capacity problems.

The Challenges of Measurement and Cloud Trade-offs

The perception of service availability varies significantly depending on the source. While GitHub's official status page reports uptime figures close to 99.9% for listed services, independent projects like "The Missing GitHub Status Page" offer a different perspective. This unofficial project recorded twelve incidents in May and an average uptime of 87.26% over the past ninety days, with values of 78.33% in April, 93.86% in May, and 88.39% for June so far. GitHub's own incident history page cites 26 incidents in April, 23 in May, and 12 to date in June. This discrepancy highlights the complexity in defining and measuring "availability" in distributed environments.
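The gap between these percentages is easier to grasp as downtime. The Python sketch below converts an uptime percentage over a period into hours of unavailability. The function name `downtime_hours` is illustrative. The sketch also treats each percentage as one aggregate figure for the whole platform, which is an assumption: both sources report per-service values and may count partial degradation differently.

```python
def downtime_hours(uptime_percent: float, period_days: int) -> float:
    """Hours of unavailability implied by an uptime percentage over a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - uptime_percent) / 100.0
    return unavailable_fraction * period_days * 24


# Ninety-day figures quoted in the article.
official = downtime_hours(99.9, 90)     # status page, "close to 99.9%"
unofficial = downtime_hours(87.26, 90)  # "The Missing GitHub Status Page"

print(f"Official figure:   {official:.1f} hours over 90 days")
print(f"Unofficial figure: {unofficial:.1f} hours over 90 days")
```

Read this way, 99.9% over ninety days is a little over two hours of downtime, while 87.26% is about 275 hours, or roughly eleven and a half days.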

These episodes underscore the complexities of managing large-scale infrastructures, especially when integrating high-intensity AI workloads. For companies evaluating the deployment of Large Language Models (LLMs) or other AI workloads, the choice between self-hosted solutions and cloud services involves significant trade-offs. While the cloud offers apparent scalability and reduced initial investment (CapEx), capacity challenges and long-term operational costs (OpEx), along with data sovereignty concerns, can push organizations towards on-premises solutions. AI-RADAR provides analytical frameworks on /llm-onpremise to evaluate these constraints and their implications for the Total Cost of Ownership (TCO).

Future Outlook and Infrastructure Control

GitHub's efforts to isolate its primary database cluster by moving users, authentication, and authorization into separate domains aim to prevent cascading failures that could compromise the entire system. This strategy, while promising, has not yet fully resolved the availability challenges. The need to manage an unprecedented volume of data and requests, combined with dependencies on external cloud infrastructures that are themselves facing capacity issues, creates a complex operational environment.

GitHub's situation serves as a warning for organizations that rely on external services for their critical development pipelines. The ability to maintain control over the underlying infrastructure, or at least to diversify dependencies, becomes a key factor in mitigating risks and ensuring operational continuity. The pursuit of solutions that balance scalability, reliability, and cost control remains a top priority in today's technological landscape.