- We provide insights into Meta’s Capacity Efficiency program, for which we developed an AI agent platform that automatically finds and resolves performance issues across our infrastructure.
- By encoding domain expertise behind a unified, standardized tool interface, these agents help save power and give engineers time back from troubleshooting performance issues to spend on new product innovation.
We’ve built a unified AI agent platform that encodes the expertise of experienced efficiency engineers into reusable, composable capabilities. These agents now automate both the discovery and the resolution of performance issues, recovering hundreds of megawatts (MW) of power and condensing hours of manual regression investigation into minutes, allowing the program to scale MW savings to a growing number of product areas without proportionally scaling headcount.
On the defensive side, FBDetect, Meta’s in-house regression detection tool, captures thousands of regressions weekly; faster, automated resolution means fewer megawatts wasted as regressions accumulate across the fleet. On the offensive side, the AI-powered opportunity solver is expanding into additional product areas every half-year, handling a volume of opportunities that engineers could never work through manually. Taken together, this is how Meta’s Capacity Efficiency program continues to increase MW savings without proportionally growing the team. The end goal is a self-sustaining efficiency engine, with AI taking over the long tail.
Here’s how it works and where we’re going:
- Hyperscale efficiency requires both offense (proactively seeking optimizations) and defense (intercepting and mitigating regressions that make it into production); AI can accelerate both.
- We have built a unified platform that combines standardized tool interfaces with encoded domain expertise to automate investigations on both sides.
- These AI systems now form the infrastructure for the Capacity Efficiency program, which has recovered hundreds of megawatts of power, enough to supply hundreds of thousands of American homes.
- Automating diagnostics can compress about 10 hours of manual investigation into about 30 minutes, while AI agents fully automate the path from efficiency opportunity to review-ready pull request.
An overview of the Capacity Efficiency program
If the code you deploy serves more than 3 billion people, even a 0.1% performance regression can result in significant additional power consumption.
In Meta’s Capacity Efficiency organization, we view efficiency as a two-sided effort:
- Offense: Finding and implementing opportunities (proactive code changes) to make our existing systems more efficient.
- Defense: Monitoring resource usage in production to detect regressions, trace them to a pull request, and drive remediation.
These systems have worked well and played an important role in Meta’s efficiency efforts for years. However, actually resolving the issues they surface created a new bottleneck: human engineering time.
This human engineering time can be spent on any of the following activities:
- Querying profiling data to find opportunities to optimize hot functions.
- Reviewing the description, documentation, and previous examples of an efficiency opportunity to understand the best approach to implementing an optimization.
- Reviewing recent code and configuration deployments that may have caused a drastic change in resource usage.
- Reviewing recent internal discussions about launches that may be related to a regression.
Many engineers at Meta use our efficiency tools to work on these problems every day. But no matter how high-quality the tools, engineers have limited time to address performance issues when innovating new products is our top priority.
We started asking: What if AI could do the investigation and resolution?
Offense and defense share the same structure
The breakthrough was realizing that both problems share the same three-step structure: gather context with tools, apply domain expertise, and produce a resolution. This meant we didn’t need two separate AI systems. We needed one platform that could serve both.
We built it on two levels:
- MCP tools: Standardized interfaces that let LLMs call code. Each tool does one job: query profiling data, retrieve experiment results, fetch configuration history, search code, or pull documentation.
- Skills: These encode performance-efficiency expertise. A skill tells an LLM which tools to use and how to interpret the results. It captures reasoning patterns that experienced engineers have developed over years, such as “check the top GraphQL endpoints for endpoint latency regressions” or “look for recent schema changes when the affected function involves serialization.”
Together, tools and skills turn a general-purpose language model into something that can apply the domain expertise senior engineers typically hold. The same tools power both offense and defense; only the skills differ.
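A minimal sketch of the tools-plus-skills split can make this concrete. All names here (`query_profiling_data`, `Skill`, the file paths) are illustrative stand-ins, not Meta’s actual interfaces:

```python
# Sketch: single-purpose "tools" an LLM can call, and a "skill" that
# composes them into an expert's playbook. All names are hypothetical.

from dataclasses import dataclass
from typing import Callable

# Tools: standardized, one-job functions (stand-ins for real backends).
def query_profiling_data(function_name: str) -> dict:
    return {"function": function_name, "cpu_pct": 1.7}

def search_code(symbol: str) -> list[str]:
    return [f"www/endpoints/{symbol}.php"]

@dataclass
class Skill:
    """Encodes which tools to call, in what order, and how to combine
    the results into an investigation context."""
    name: str
    steps: list[Callable[[dict], dict]]

    def run(self, context: dict) -> dict:
        for step in self.steps:
            context.update(step(context))
        return context

# A latency-triage skill composes the shared tools; a different skill
# could reuse the exact same tools for offense instead of defense.
latency_skill = Skill(
    name="graphql_latency_triage",
    steps=[
        lambda ctx: {"profile": query_profiling_data(ctx["endpoint"])},
        lambda ctx: {"files": search_code(ctx["endpoint"])},
    ],
)

result = latency_skill.run({"endpoint": "getFeed"})
```

The design point is that new agents only write new `Skill` objects; the tool layer is shared.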
Defense: Detect regressions before they worsen
FBDetect, Meta’s in-house regression detection tool, analyzes time-series data to detect performance degradations as small as 0.005% in noisy production environments.
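To build intuition for how a sub-percent regression can be detected at all in noisy data: with enough samples, the standard error of a window mean shrinks until even a tiny shift stands out. This toy example (not FBDetect’s actual algorithm) flags a 0.1% shift buried in 1% noise:

```python
# Toy illustration: a Welch-style z-score on two windows of a noisy
# metric. Not FBDetect's real method; numbers are synthetic.

import random
import statistics

def mean_shift_zscore(before: list[float], after: list[float]) -> float:
    """z-score of the difference in window means."""
    se = (statistics.variance(before) / len(before)
          + statistics.variance(after) / len(after)) ** 0.5
    return (statistics.mean(after) - statistics.mean(before)) / se

random.seed(0)
# Baseline of 100 CPU-units with 1% noise, then a 0.1% regression.
before = [random.gauss(100.0, 1.0) for _ in range(20000)]
after = [random.gauss(100.1, 1.0) for _ in range(20000)]

z = mean_shift_zscore(before, after)
# With 20k samples per window, |z| lands well above the usual ~3
# threshold even though the shift is 10x smaller than the noise.
```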
When FBDetect finds a regression, we immediately try to attribute it to a code or configuration change; this is the critical first step in understanding what happened. Attribution is done mainly with traditional techniques, such as correlating the regressed functions with recently landed pull requests. Once a root cause is identified, engineers are notified and expected to take action, such as optimizing the offending code change. To speed this up, we added a new capability:
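The traditional attribution step can be sketched as a simple correlation: find the recent pull requests whose diffs touch the regressed function. The PR identifiers and hunks below are made up for illustration:

```python
# Toy attribution: match a regressed symbol against recent PR diffs.
# PR ids and hunks are hypothetical examples.

def attribute_regression(regressed_function: str,
                         recent_prs: list[dict]) -> list[dict]:
    """Return candidate root-cause PRs whose diff touches the symbol."""
    return [
        pr for pr in recent_prs
        if any(regressed_function in hunk for hunk in pr["changed_hunks"])
    ]

recent_prs = [
    {"id": "D101", "changed_hunks": ["def render_feed():", "cache.get(key)"]},
    {"id": "D102", "changed_hunks": ["def serialize_story():", "json.dumps(story)"]},
]

candidates = attribute_regression("serialize_story", recent_prs)
# candidates → only D102, the PR that touched the regressed function
```

Real attribution also weighs deploy timing and configuration changes, but the correlation idea is the same.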
AI Regression Solver
Our AI Regression Solver is the newest and most promising component of FBDetect: it generates a pull request that automatically mitigates a regression. Traditionally, root-cause pull requests that caused performance degradations were either rolled back (slowing development) or ignored (unnecessarily inflating infrastructure resource usage).
Now our in-house coding agent kicks in to do the following:
- Gather context with tools: Find the symptoms of the regression (e.g., the functions that regressed) and its root cause (a pull request), including the exact files and lines that changed.
- Apply expertise with skills: Use regression-mitigation knowledge specific to the codebase, language, or regression type. For example, regressions in logging can be mitigated by adding sampling.
- Create a resolution: Open a new pull request and send it to the author of the root-cause change for review.
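The three steps above can be sketched end to end. Everything here (the context fields, the logging-sampling skill, the PR shape) is a hypothetical stand-in for Meta’s internal systems:

```python
# Hedged sketch of the regression-solver flow: gather context, apply a
# skill, produce a review-ready PR. All details are illustrative.

def gather_context(regression_id: str) -> dict:
    # Tools would return symptoms (regressed functions) and the
    # root-cause PR with its exact changed lines; hardcoded here.
    return {
        "regressed_function": "log_impression",
        "root_cause_pr": "D203",
        "changed_lines": ["logger.info(event)"],
    }

def logging_sampling_skill(context: dict) -> str:
    # Skill: a known mitigation for logging regressions is to sample
    # the new log site instead of logging every event.
    line = context["changed_lines"][0]
    return f"if random.random() < 0.01:\n    {line}"

def create_resolution(context: dict, patched_code: str) -> dict:
    # Package the fix as a PR routed to the root-cause author.
    return {
        "title": f"Mitigate regression from {context['root_cause_pr']}",
        "reviewer": "root-cause author",
        "patch": patched_code,
    }

ctx = gather_context("regression-42")
pr = create_resolution(ctx, logging_sampling_skill(ctx))
```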
Offense: Turning opportunities into shipped code
On the offensive side, “efficiency opportunities” are proposed conceptual code changes that are believed to improve the performance of existing code. We built a system that allows engineers to view an opportunity and request an AI-generated pull request that implements it. What once required hours of investigation now takes just minutes to review and deploy.
The pipeline mirrors the defensive AI Regression Solver:
- Gather context with tools: The AI agent searches for:
- Opportunity metadata.
- Documentation explaining the optimization pattern.
- Examples showing how similar opportunities have been solved.
- The specific files and functions involved.
- Validation criteria to confirm that the fix works.
- Apply expertise with skills: Use the knowledge of experienced engineers about a specific type of efficiency opportunity, encoded in a skill. For example, memoizing a specific function to reduce CPU usage.
- Create a resolution: Generate a candidate fix within guardrails, checking syntax and style and confirming that it addresses the right issue. The generated code appears in the engineer’s editor, ready to apply with one click.
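As a concrete example of the kind of fix this pipeline can generate, consider the memoization pattern mentioned above: caching a hot, pure function so repeated calls stop burning CPU. The function here is a made-up stand-in:

```python
# Illustrative offense fix: memoize a pure, frequently-called helper.
# The helper itself is hypothetical.

from functools import lru_cache

call_count = 0  # tracks how many times the real work actually runs

@lru_cache(maxsize=None)
def normalize_locale(locale: str) -> str:
    """Pure, hot helper; a good memoization candidate."""
    global call_count
    call_count += 1
    lang, _, region = locale.partition("_")
    return f"{lang.lower()}_{region.upper()}"

# The same hot inputs recur, so the cache absorbs repeated calls.
results = [normalize_locale(loc) for loc in ["en_us", "en_us", "fr_fr", "en_us"]]
# call_count is 2: only the two distinct inputs did real work.
```

The guardrails in the pipeline matter precisely here: memoization is only safe when the function is pure, which is the kind of check an encoded skill can enforce.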
Crucially, we use the same tools as defense: profiling data, documentation, code search. Only the skills differ.
One platform, increasing returns
Our unified architecture of shared tools and data sources has proven to be a clean abstraction. Every existing and new agent gets an easy way to pull in performance context without reinventing the interfaces we have built.
This post has focused on our initial use cases: performance regressions and efficiency opportunities. Within a year, the same foundation supported additional applications: conversational efficiency assistants, capacity-planning agents, personalized opportunity recommendations, guided investigation workflows, and AI-powered validation. Each new feature required little to no new data integration, because it could simply compose existing tools with new skills.
Impact
The results of the Capacity Efficiency program are substantial: we have recovered hundreds of megawatts of power, and the AI systems for offense and defense help sustain these gains.
But the more profound change is in how offense and defense reinforce each other. Engineers who used to spend mornings on defensive triage now review AI-generated analyses in minutes. Engineers using our efficiency tools can start from AI-generated code instead of from scratch. The daunting question “Where do I even begin?” has been replaced by reviewing and shipping high-impact fixes.
