On November 15, a group of researchers from the CLTC AI Security Initiative submitted a working paper to the Paris Peace Forum in response to its Call for AI Projects as part of the AI Action Summit, which will take place February 10-11, 2025, in Paris.
In May 2024, at the AI Seoul Summit, sixteen global AI industry organizations committed to publishing how they will measure and manage risks posed by their frontier AI models in an accountable and transparent manner, and to determining thresholds for intolerable risks. Specifically, as part of these Frontier AI Safety Commitments, they must define “thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable.”
In the absence of clear guidance from regulators, academics, or civil society that places a high priority on protecting public safety, companies may face incentives to develop thresholds that are low-cost for them to implement but do not provide adequate levels of public safety. The CLTC AI Security Initiative is working to help bridge this gap and puts forward this working paper to provide provisional recommendations and considerations to inform the development of intolerable risk thresholds for frontier AI models.
The paper builds upon an in-person roundtable held in Berkeley, CA, on November 12, 2024, that brought together representatives from academia, industry, civil society, and government. We are grateful to the roundtable participants for their expert feedback and insights.
Multiple types of thresholds have been proposed and used thus far, including capability thresholds, compute thresholds, and risk thresholds. In the working paper, we focus primarily on capability thresholds (including those related to CBRN weapons, cyber operations, model autonomy, persuasion, and deception), which are most prominent in current examples and have been most closely aligned with establishing intolerable risk thresholds (See Section 2.1). In Table 1 in the working paper, we specify the outcomes of concern related to these risk categories, evidence of the risks materializing, and provisional intolerable risk thresholds for each category.
Many other risks can stem from unacceptable uses of frontier AI models, unacceptable limitations (or failure modes), and unacceptable impacts; these are also of critical importance and require attention and mitigation.
In proposing thresholds, we applied the following key principles:
- Seek to identify cases of substantial increase in risk
- Focus on capabilities (more than “risk” per se or compute), at least until likelihood estimation becomes more reliable, while accounting for reasonably foreseeable ways capabilities can be enhanced (e.g. plugin tools and scaffolding)
- Compare to appropriate base cases
- Minimal increases in risk should be detectable, but not necessarily intolerable
- A substantial increase in capabilities for some attack stages can have a disproportionately large effect on risk
- Leave a margin of safety: operationalize intolerable risk thresholds at approximately the “substantial” level, preserving a buffer before risks reach a “severe” level
- Define at least some thresholds without factoring in model-capability mitigations due to the unreliability of virtually all safeguards at this time
- Use best practices in dual-use capability evaluation
In the working paper, we also identify the following key considerations:
- Additional Risk Criteria: This working paper has focused on how intolerable risk thresholds are informed by capabilities (in particular), compute, and risks, with additional discussion of how uses, impacts, and limitations may also play a role. However, other risk criteria may also inform these thresholds. For example, in the EU AI Act, additional criteria that inform the designation of general-purpose AI models with systemic risk (beyond compute and capabilities) include:
- The number of parameters of the model
- The quality or size of the data set, for example measured through tokens
- The input and output modalities of the model
- The size of its reach (e.g. if it will be made available to at least 10,000 business users)
- The number of registered end-users
- Risk-Benefit Tradeoffs: Companies may want to compare potentially substantial risks of frontier AI models against potentially substantial benefits. However, unacceptable risk is likely to be absolute and not subject to such tradeoffs. In some cases, tradeoffs will relate to relative gains in offensive and defensive capabilities. The assumptions underlying decisions to continue developing dual-use capabilities need to be made explicit. For instance, developing AI capabilities to defend against cybersecurity threats is more promising than developing biological capabilities, which are likely to take longer to yield satisfactory defensive uses and are at greater risk of malicious use in the short term. Tentative predictions that the offense-defense balance will skew toward offense in increasingly complex AI systems must also be confronted (Shevlane and Dafoe 2020).
- Defining Baselines: To approach sound empirical analyses of model capabilities against thresholds, it is necessary to determine baselines of human performance, other state-of-the-art models, and human-AI systems (UK AISI 2024). Drawing on threat modeling approaches from computer security, Kapoor, Bommasani et al. (2024) propose a framework for empirically sound model evaluation that assesses the marginal risk introduced by foundation models relative to a baseline of existing threats and defenses for a particular type of risk (a schematic sketch of this framing appears after this list). See our key principle, “Compare to appropriate base cases,” for more information about some of the challenges and limitations of assessing marginal risk.
- Deployment Strategy: With hosted or restricted-access models, it is often easier to monitor use and prevent misuse, whereas releasing model weights offers advantages in access and customization. It is important to assess how different deployment strategies influence the scale, scope, and irreversibility of risks.
- Timeframe of Impact: Current industry discourse on intolerable risks privileges risks arising from acute catastrophic events that foundation models might enable, such as the creation of CBRN weapons or sophisticated cyberattacks that cause immediate and massive destruction of life and property. The long-term impacts of frontier models that could fundamentally change the fabric of society are then left for state actors to govern (Lohn and Jackson 2022).
- Metrics of Evaluation: Potential impacts of frontier models can be used to determine the intolerability of risks if there is consensus on the metric of evaluation. For instance, if we determined intolerability by measuring impact on ‘quality of life’ rather than the ‘number of human lives lost,’ our appetites for risk may differ significantly. As an example, a company could set a threshold requiring carbon emissions from its data centers to remain below a certain x% of global or national emissions to prevent climate-related deaths, or could apply dynamic metrics to gauge the receptiveness of the local population to its continued operations.
- Developer vs State Responsibilities: This working paper focuses on determining a small number of capability thresholds, but we also recommend the consideration of additional intolerable risks (Section 2.3). Industry actors must also bear responsibility for the long-term impacts that accumulate from deploying frontier models in high-impact systems operating at scale in critical domains: automating tasks, amplifying biases, and increasing emissions, with long-term effects such as workforce displacement, amplified inequalities, cultural homogenization, and environmental degradation.
- Future-Proofing Benchmarks: (a) Intolerable risks do not necessarily require large-scale runs; due consideration should therefore be given to how thresholds may need to change rapidly as compute for fine-tuning open models becomes widely available and affordable (Seger et al. 2023). (b) Build these margins of safety into threshold determination along the dimensions of increasing affordability, access, and expertise in AI systems, to ensure sufficient safety between recalibrations.
- Benchmarking Often (or at least at every stage of model development): While model performance can be evaluated against a threshold before deployment, equally robust safety testing is needed at every point in the AI pipeline, from data sourcing to curating training sets to choosing hyperparameters. Only by identifying appropriate metrics at each stage of development and evaluating against the right benchmarks can developers exercise responsibility across the product life cycle. Ensuring traceability of decisions and inputs at every stage of development is imperative for enforcing appropriate oversight.
- Transparent Evaluations: These documented risks and decisions should also be reported transparently to regulators or internal review boards, red teamers, and auditors to ensure appropriate testing against vulnerabilities in the chosen design of the model.
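To make the marginal risk framing under “Defining Baselines” above more concrete, the following is a minimal schematic sketch. It assumes, purely for illustration, that risk for a given threat vector can be summarized on a common quantitative scale; the working paper and Kapoor, Bommasani et al. (2024) treat this assessment qualitatively rather than through a single formula.

```latex
% Schematic (illustrative) formulation of marginal risk for a threat vector v.
% R_baseline(v): risk from existing tools, threats, and defenses without the model.
% R_model(v):    risk once the frontier model is available to the relevant actors.
% The quantities and their common scale are assumptions for illustration only.
\Delta R(v) = R_{\text{model}}(v) - R_{\text{baseline}}(v)
```

Under this framing, a capability-based intolerable risk threshold would be triggered by a substantial uplift in ΔR(v), consistent with the key principles of seeking substantial increases in risk and comparing to appropriate base cases, rather than by the absolute level of risk associated with the model in isolation.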
We are continuing to develop these ideas and refine this paper. Please provide written comments or feedback by email to deepika.raman@berkeley.edu.