MSA Contract Redline RFC
Title: Open_Evaluation_Redline_Eval_0.1
Authors: Open Evaluation team
Date Published: July 29, 2025
Updated Date: July 29, 2025
Closed Date: August
For a version open for commenting, please request access to the RFC here and to the evaluation questions here.
Introduction
The use of AI-powered tools in the legal space is growing, especially for contract interpretation and editing tasks. However, there is currently a lack of open guidance or standardized benchmarks to objectively assess the performance of these tools. Adoption decisions are often based on subjective perceptions rather than data-driven evaluation. This gap in evaluation standards creates a need for a transparent, fair, and rigorous environment to help organizations identify the most effective agentic MSA redlining products. By introducing a common benchmark and evaluation methodology, this project seeks to bring clarity and confidence to businesses considering such AI tools.
Objectives and Scope
The primary objectives of the Open Evaluation Contract Redlining project are:
  • Empower Decision Makers: Enable business decision-makers to identify which contract redlining agent best fits their needs by providing clear, data-driven performance comparisons.
  • Standardized Benchmark: Define a universal set of evaluation criteria for contract redlining agents using a transparent, standardized framework that can be trusted by the industry.
  • Scalable Evaluation Platform: Build a scalable evaluation platform that can be expanded over time to assess other AI agent-driven products beyond MSA redlining.
  • Continuous Improvement: Create a foundation for continuous improvement, ensuring that the evaluation benchmarks can evolve alongside technological advancements in AI and legal tech.
Within the scope of this project, the Open Evaluation team will:
  • Evaluate Five Products: Apply the framework to five selected contract redlining AI tools, assessing each against the same datasets and criteria.
  • Industry Participation: Invite the companies behind these MSA redlining agents to participate by providing access to their tools and by contributing feedback on the evaluation process.
  • Publish Transparent Results: Analyze and publish the evaluation results in a public report and on a community platform to promote transparency and encourage industry-wide improvements.
  • Foundation for Broader Evaluations: Use this project as a foundation for a broader platform that can evaluate other agentic AI product categories going forward.
Out of Scope: It should be noted that adversarial security testing or "red-teaming" activities (e.g., prompt injection attacks and other AI security stress tests) are out of scope for this evaluation. The focus is on standard redlining performance and usability, not on penetration testing or other Red Team security evaluations.
Evaluation Process
The evaluation will utilize a standardized dataset comprising the following (a sketch of one possible record structure follows this list):
  • Contracts: The dataset selection process is currently focused on Master Service Agreements (MSAs), beginning with an initial collection of ten agreements of varying complexity. These are drawn from publicly available MSAs that have either been published directly by the companies themselves or obtained through the Securities and Exchange Commission's Electronic Data Gathering, Analysis, and Retrieval (SEC/EDGAR) system or other publicly available databases. Each dataset will include a PDF copy accompanied by a timestamp documenting the date of data acquisition to ensure proper version control and traceability.
  • Playbooks: For each contract, a corresponding negotiation playbook (or guideline document) will be provided. These playbooks contain company-specific or industry-standard guidelines on acceptable contract edits and negotiation positions (e.g., which clauses to insert, tolerate, or reject). Each AI agent must use the playbook as a reference when generating its redlines.
  • Reference Standards: For every contract-playbook pair, a human expert redline will be prepared as the “ground truth” or Reference Standard. Experienced contract attorneys will manually redline the contracts according to the playbook instructions, producing the ideal responses against which the AI outputs will be compared.
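For illustration only, the sketch below shows one way a single contract/playbook/Reference Standard record could be represented; the field names and types are assumptions for this sketch, not a prescribed schema.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class EvaluationRecord:
        """One contract/playbook/Reference Standard triple in the benchmark dataset.

        All field names here are illustrative assumptions; the actual dataset layout may differ.
        """
        contract_id: str        # hypothetical identifier for the MSA
        contract_pdf: str       # path or URL to the PDF copy of the MSA
        acquired_at: date       # date of data acquisition, for version control and traceability
        source: str             # e.g., "SEC/EDGAR" or "company website"
        playbook: str           # path to the negotiation playbook for this contract
        reference_redline: str  # path to the human expert redline ("ground truth")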
The evaluation methodology involves collaboration with industry experts, specifically actively practicing attorneys who possess direct, hands-on experience in the contract redlining process or supervise teams engaged in such activities, ensuring that the assessment criteria and methodology reflect current industry practices and real-world application requirements.
The Open Evaluation team begins by interviewing industry experts who represent the target customer base, using their insights to propose evaluation criteria. These criteria are reviewed with the experts to ensure alignment with practical needs and industry standards and are open for public comment (link here). Once approved, products are tested against the criteria, and the results are validated with the experts. After expert review, the data is aggregated and prepared for publication, ensuring the final results reflect both rigorous testing and industry relevance.
The primary evaluation workflow requires submitting an MSA and a corresponding playbook, which the product must effectively reference. A diverse set of MSAs with varying levels of complexity will be selected to ensure broad coverage of real-world scenarios. Each MSA will undergo expert review and will be accompanied by a Reference Standard that defines the expected output, against which all product outputs will be measured. For each product, the evaluation begins with the submission of a contract and its corresponding playbook. The product then generates a redlined version of the contract based on the provided inputs. A human evaluator reviews the output against the Reference Standard, assessing factors such as accuracy, helpfulness, honesty, and harmlessness. This evaluation is then reviewed and approved by a licensed attorney to ensure legal soundness and compliance. Upon approval, a final report card is published.
As part of the overall assessment, a comprehensive review of available product features and capabilities will also be conducted. This includes detailed comparisons across areas such as collaboration functionality, integration capabilities, and notification systems. In parallel, the evaluation will address critical privacy and security dimensions, including adherence to recognized standards and certifications, policies governing data use and retention, the robustness of security controls, and other relevant safeguards.
Please note that both the evaluation process and the data used are subject to change at the discretion of the organization. Modifications may be made to improve accuracy, adapt to evolving standards, or align with strategic objectives. Any such changes will be implemented with careful consideration to maintain the integrity and reliability of the evaluation framework.
Output Explanations
Each product will receive a report card, publicly available on our website, based on the following aggregated metrics.
Accuracy: Product Identified Risks / Total Identified Risks
To be marked as a pass, each redlined phrase must be correctly identified, appropriately edited, and accompanied by a clear rationale or citation referring to the applicable section of the provided playbook.
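As a minimal illustration of the accuracy metric above, the sketch below treats the risks identified in the Reference Standard as the denominator; the set-based representation and function name are assumptions, and the sketch deliberately omits the edit-quality and citation checks described above.

    def accuracy(product_identified_risks: set[str], reference_risks: set[str]) -> float:
        """Accuracy = risks the product identified that appear in the Reference Standard,
        divided by the total risks in the Reference Standard.

        Simplified sketch: a real "pass" also requires an appropriate edit and a
        citation to the applicable playbook section.
        """
        if not reference_risks:
            return 0.0
        return len(product_identified_risks & reference_risks) / len(reference_risks)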
Helpful, Honest, Harmless (Overall and Individual): Total number of passes / total number of relevant questions
We define these categories as follows:
  • Helpful: Enhances user value by delivering accurate, relevant, and contextually appropriate outputs aligned with the intended workflow.
  • Honest: Maintains truthfulness and transparency, avoids making false assertions, and appropriately signals uncertainty when applicable.
  • Harmless: Minimizes potential harm by mitigating bias, misinformation, or inappropriate use.
Each of these categories contains a predefined set of questions that assess specific criteria and together help answer the question “Is this result Helpful/Honest/Harmless?”. Evaluators will assess each redlined output against these specific benchmarks, assigning pass/fail designations that will then be aggregated into a percentage score for each category.
For instance, under Helpfulness, a representative question may be: “Does the tool avoid striking necessary content?” For the full set of questions, we welcome input on the list we’ve provided here.
The percentage for each category is calculated as the total number of passes divided by the total number of questions (omitting any “N/A” results), multiplied by 100. The overall HHH percentage is calculated as the total number of passes across all three categories divided by the total number of questions across all three categories (again omitting any “N/A” results), multiplied by 100.
For example, if the Helpful % is 80% (8 passed of 10), Honest % is 50% (1 of 2), and Harmless % is 25% (1 of 4), the overall HHH % is 62.5% (10 (8+1+1) of 16 (10+2+4)).
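To make the arithmetic concrete, here is a minimal sketch of the scoring described above, reproducing the worked example; the pass/fail/“N/A” list representation is an assumption for illustration.

    def category_pct(results: list[str]) -> float:
        """Percentage of passes among applicable questions ("N/A" results are omitted)."""
        applicable = [r for r in results if r != "N/A"]
        return 100.0 * applicable.count("pass") / len(applicable) if applicable else 0.0

    def overall_hhh_pct(categories: dict[str, list[str]]) -> float:
        """Overall HHH %: total passes across all categories / total applicable questions."""
        all_results = [r for results in categories.values() for r in results if r != "N/A"]
        return 100.0 * all_results.count("pass") / len(all_results) if all_results else 0.0

    # Reproducing the worked example: 8/10, 1/2, 1/4 -> overall 10/16 = 62.5%
    scores = {
        "Helpful": ["pass"] * 8 + ["fail"] * 2,
        "Honest": ["pass", "fail"],
        "Harmless": ["pass"] + ["fail"] * 3,
    }
    assert category_pct(scores["Helpful"]) == 80.0
    assert category_pct(scores["Honest"]) == 50.0
    assert category_pct(scores["Harmless"]) == 25.0
    assert overall_hhh_pct(scores) == 62.5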
Cost: Upfront (listed as-is) and additional usage costs (Total usage cost / Total contracts)
All products will have their costs listed. If applicable, usage costs will be calculated on a per-contract basis. Where possible, a normalized metric that accounts for contract length will also be provided.
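A brief sketch of the usage-cost arithmetic follows; the document only commits to a length-normalized metric where possible, so the per-page variant shown here is an assumption.

    def usage_cost_per_contract(total_usage_cost: float, total_contracts: int) -> float:
        """Additional usage cost averaged over the contracts processed."""
        return total_usage_cost / total_contracts

    def usage_cost_per_page(total_usage_cost: float, total_pages: int) -> float:
        """One possible length-normalized variant (assumption): usage cost per contract page."""
        return total_usage_cost / total_pages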
Latency: Average Processing Time and Average Processing Time, Normalized
Products will be evaluated for latency using the timestamp closest to submission and the timestamp closest to when results become available to calculate how long the product takes to turn around results. This will be averaged across all contracts. A separate metric will be normalized against contract length in pages; a brief sketch of both metrics follows the definitions below.
  • Average Processing Time is defined as the total time from submission to results / total contracts.
  • Average Processing Time, Normalized is defined as the total time from submission to results / total contract length in pages.
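The sketch below expresses both latency metrics from the definitions above; how timestamps are captured (closest to submission, closest to results becoming available) is left as an assumption of the measurement harness.

    from datetime import datetime

    def average_processing_time(runs: list[tuple[datetime, datetime]]) -> float:
        """Average seconds from submission to results, across all contracts.

        Each tuple is (submission_timestamp, results_available_timestamp).
        """
        total_seconds = sum((done - submitted).total_seconds() for submitted, done in runs)
        return total_seconds / len(runs)

    def average_processing_time_normalized(runs: list[tuple[datetime, datetime]], total_pages: int) -> float:
        """Total time from submission to results divided by total contract length in pages."""
        total_seconds = sum((done - submitted).total_seconds() for submitted, done in runs)
        return total_seconds / total_pages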
Privacy and Security: List of certifications
Companies will be asked to provide the list of privacy and security certifications with which they are currently in compliance.
Invitation to Contract Redline Agent Providers
We hereby invite companies developing generative AI-based MSA redlining agents to participate in this evaluation initiative. By participating, vendors will contribute to shaping an industry standard for quality and will gain visibility among potential users. Interested companies are encouraged to reach out to participants@openevaluation.org and agree to the following:
  • Provide Tool Access: Grant the evaluation team trial or demo access to your MSA redlining tool for the duration of the evaluation period (approximately two weeks). This access should include sufficient features or credits to run the test dataset through the tool.
  • Review Evaluation Criteria: Collaborate with us by reviewing the proposed evaluation framework and criteria. Participants have the opportunity to provide feedback or suggestions on the metrics and workflow before the evaluation begins, ensuring the criteria are fair and cover relevant aspects of performance.
  • For a comment-able version of this document, please request access here.
  • For a comment-able version of the HHH questions, please request access here.
  • Participate in Results Review: Attend a review meeting (or conference call) after the evaluation to discuss the outcomes. In this session, the Open Evaluation team will present your tool’s results and observations, and you will have a chance to ask questions or provide comments for clarification.
Participation in the evaluation process provides organizations with enhanced visibility through performance showcasing on a community-driven platform and inclusion in published reports, where strong results serve as public validation of tool capabilities. Companies gain industry leadership positioning by contributing to the establishment and refinement of industry benchmarks for AI contract review tools, thereby helping to elevate market standards and best practices. Additionally, participants receive comprehensive performance feedback that identifies specific strengths and weaknesses through expert analysis and real-world testing data. This detailed assessment provides valuable insights to inform product development and improvement strategies.
Review and Feedback
We welcome and highly encourage feedback on this RFC from all interested parties, including vendors, legal professionals, and potential end-users of MSA redlining tools. If you have comments, questions, or suggestions regarding the evaluation criteria, methodology, or any other aspect of this project, please reach out to us.
All feedback received during the RFC comment period will be taken into consideration. Our goal is to ensure the evaluation is as fair, comprehensive, and useful as possible. Contributors who provide valuable feedback will be acknowledged in the project report.
Interested in Contributing?
Reach out to volunteers@openevaluation.org with your background and areas of interest. We welcome contributors at all experience levels who are passionate about transparent AI evaluation.