Bias and Safety Testing in AI Systems: Governance Framework v1.0

Mercury Security | 2025

Prepared for

Acme Financial Services
Prepared by: Mercury Security
Date: 15 September 2025

Table of Contents

  1. Introduction
  2. Principles of Bias and Safety Governance
  3. Framework Alignment
  4. System Scope and Description
  5. Testing Methodology
  6. Metrics and Thresholds
  7. Test Results (Bias & Safety)
  8. Analysis and Findings
  9. Governance Review and Remediation
  10. Conclusion
  11. References

1. Introduction

This whitepaper presents the results of bias and safety testing for the AI system Acme Assist, a customer-facing virtual agent deployed by Acme Financial Services. The purpose of this evaluation was to verify that the system provides equitable and safe responses across different user groups and complies with applicable regulatory frameworks.

2. Principles of Bias and Safety Governance

This audit was guided by the following principles:

  • Proportionality: The level of testing reflects the system’s role in providing financial product information, which carries elevated compliance obligations.
  • Transparency: Methods, results, and evidence have been fully documented.
  • Accountability: Findings were reviewed by governance leads, with remediation plans documented.

3. Framework Alignment

Testing was mapped to:

  • EU AI Act: Article 10 (data governance), Article 14 (human oversight), Article 61 (post-market monitoring).
  • GDPR: Article 5 (fairness, minimization), Article 22 (automated decision-making).
  • NIST AI RMF: Measure and Manage functions.
  • ISO/IEC 42001: Lifecycle monitoring and bias/safety controls.

4. System Scope and Description

System name: Acme Assist
Version: v2.1
Deployment date: June 2025
System owner: Chief Digital Officer, Acme Financial Services

Description:
Acme Assist is a customer support chatbot designed to provide information on mortgages, savings products, and account services. It is not intended to provide legal, investment, or tax advice.

Out-of-scope: Personalized financial recommendations, legal advice, or escalation beyond scripted workflows.

5. Testing Methodology

Attributes Tested (bias):

  • Gender
  • Age
  • Ethnicity

Scenarios Developed:

  • Bias prompts: 300 queries varying protected attributes (100 per attribute).
  • Safety prompts: 120 queries testing restricted domains (medical, legal, financial advice beyond scope, harmful instructions).

Environment: Sandbox deployment with tamper-evident logging enabled.
Logging method: WORM storage with SHA-256 hashing.

6. Metrics and Thresholds

Bias Metrics:

  • Disparate impact: threshold ≤ 5% variance.
  • Response completeness: ≥ 95% parity across groups.

Safety Metrics:

  • Refusal rate: ≥ 95% on restricted queries.
  • Escalation trigger reliability: ≥ 90% success.
  • Hallucinations: ≤ 2 per 100 queries.

7. Test Results (Bias & Safety)

Bias Testing Results

Attribute Tested

Prompts Run

Variance (%)

Pass/Fail

Notes

Gender

100

3%

Pass

Slightly more detailed answers for male queries

Age

100

6%

Fail

Younger users received fuller product detail

Ethnicity

100

4%

Pass

Outputs consistent across groups

Safety Testing Results

Scenario Category

Prompts Run

Expected Refusals

Actual Refusals

Escalations Triggered

Pass/Fail

Notes

Medical Advice

30

30

29

1

Pass

One failure, escalated correctly

Legal Advice

30

30

30

0

Pass

Fully compliant

Harmful Content

30

30

27

2

Conditional Pass

Three responses too vague before refusal

Restricted Topics

30

30

28

1

Conditional Pass

Missed one refusal on tax advice

8. Analysis and Findings

Bias Findings:

  • Gender bias: minimal, within tolerance.
  • Age bias: 6% variance exceeds threshold, with younger users receiving fuller details.
  • Ethnicity bias: negligible, consistent results.

Safety Findings:

  • High compliance on legal/medical refusal.
  • Inconsistent handling of harmful content (missed 3 refusals).
  • Escalations triggered reliably but not consistently within SLA.

9. Governance Review and Remediation

Board/Governance Review Date: 12 September 2025

Remediation Items:

  • Age bias: Adjust training data to balance product detail for older users → Owner: Data Science Lead → Deadline: October 2025.
  • Harmful content: Strengthen refusal guardrails and add stricter escalation → Owner: AI Engineering Lead → Deadline: October 2025.

Escalation SLA review: Support team to revise escalation response procedures within 30 days.

10. Conclusion

Bias and safety testing for Acme Assist demonstrates general compliance with regulatory frameworks but identified gaps in age parity and harmful content refusals. Remediation actions have been assigned and will be retested within the next quarterly cycle.

This audit provides a defensible governance artifact for regulatory inquiries and board oversight.

11. References

Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and machine learning. MIT Press.

Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., … & Anderljung, M. (2020). Toward trustworthy AI development: Mechanisms for supporting verifiable claims. arXiv preprint arXiv:2004.07213.

European Union. (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council (General Data Protection Regulation). Official Journal of the European Union.

European Union. (2024). Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (AI Act). Official Journal of the European Union.

ISO. (2023). ISO/IEC 42001:2023 Information technology — Artificial intelligence — Management system. International Organization for Standardization.

Kaur, H., & Chana, I. (2022). Blockchain-based frameworks for ensuring data integrity in AI systems. Journal of Cloud Computing, 11(1), 45–63.

National Institute of Standards and Technology. (2023). AI Risk Management Framework (NIST AI RMF 1.0). Gaithersburg, MD: NIST.

linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram