## Getting Started

### ⚙️ Installation
To install the TrustEval-toolkit, follow these steps (a consolidated command sketch is given after the list):

1. **Clone the Repository**
2. **Set Up a Conda Environment**: create and activate a new environment with Python 3.10.
3. **Install Dependencies**: install the package and its dependencies.
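A minimal sketch of these commands is shown below; the repository URL, environment name, and editable-install command are assumptions and may differ from the project's official instructions.

```bash
# 1. Clone the repository (URL assumed)
git clone https://github.com/TrustGen/TrustEval-toolkit.git
cd TrustEval-toolkit

# 2. Create and activate a conda environment with Python 3.10 (environment name is illustrative)
conda create -n trusteval python=3.10 -y
conda activate trusteval

# 3. Install the package and its dependencies (editable install assumed)
pip install -e .
```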
### 🤖 Usage

#### Configure API Keys
Run the configuration script to set up your API keys:
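The script's location can vary between versions; a hedged example invocation, assuming it ships at `trusteval/src/configuration.py`, is:

```bash
# Path is an assumption; adjust to where the configuration script lives in your installation
python trusteval/src/configuration.py
```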

#### Quick Start
The following example demonstrates an Advanced AI Risk Evaluation workflow.
**Step 0: Set Your Project Base Directory**
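Every later step reads from and writes to this directory. A minimal sketch, assuming any writable local path works (the folder name below is illustrative):

```python
import os

# Illustrative project base directory; any writable path works
base_dir = os.path.abspath('my_trusteval_project')
os.makedirs(base_dir, exist_ok=True)
```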
**Step 1: Download Metadata**

```python
from trusteval import download_metadata

download_metadata(
    section='advanced_ai_risk',
    output_path=base_dir
)
```
**Step 2: Generate Datasets Dynamically**

```python
from trusteval.dimension.ai_risk import dynamic_dataset_generator

dynamic_dataset_generator(
    base_dir=base_dir,
)
```
**Step 3: Apply Contextual Variations**
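No snippet for this step survives above. As an assumption mirroring the pattern of the surrounding steps, the call might look like the following; the `contextual_variator_cli` name and its `dataset_folder` parameter are assumptions, not confirmed API:

```python
from trusteval import contextual_variator_cli  # assumed entry point

# Apply contextual variations to the generated dataset (parameter name assumed)
contextual_variator_cli(
    dataset_folder=base_dir
)
```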
**Step 4: Generate Model Responses**

```python
from trusteval import generate_responses

request_type = ['llm']  # Options: 'llm', 'vlm', 't2i'
async_list = ['your_async_model']
sync_list = ['your_sync_model']

await generate_responses(
    data_folder=base_dir,
    request_type=request_type,
    async_list=async_list,
    sync_list=sync_list,
)
```
**Step 5: Evaluate and Generate Reports**
- **Judge the Responses**

  ```python
  from trusteval import judge_responses

  target_models = ['your_target_model1', 'your_target_model2']
  judge_type = 'llm'  # Options: 'llm', 'vlm', 't2i'
  judge_key = 'your_judge_key'
  async_judge_model = ['your_async_model']

  await judge_responses(
      data_folder=base_dir,
      async_judge_model=async_judge_model,
      target_models=target_models,
      judge_type=judge_type,
  )
  ```
- **Generate Evaluation Metrics** (see the sketch below)
- **Generate Final Report** (see the sketch below)
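The original snippets for these two sub-steps are not shown above. As an assumption mirroring the earlier steps, the metric and report calls might look like the following; the `lm_metric` and `report_generator` names, and their `aspect` and `model_list` parameters, are assumptions rather than confirmed API:

```python
from trusteval import lm_metric, report_generator  # assumed entry points

# Compute evaluation metrics over the judged responses (names/parameters assumed)
lm_metric(
    base_dir=base_dir,
    aspect='ai_risk',
    model_list=target_models,
)

# Build the final interactive report (report.html) in base_dir (names/parameters assumed)
report_generator(
    base_dir=base_dir,
    aspect='ai_risk',
    model_list=target_models,
)
```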
Your `report.html` will be saved in the `base_dir` folder. For additional examples, check the `examples` folder.
## Trustworthiness Report
A detailed trustworthiness evaluation report is generated for each dimension. The reports are presented as interactive web pages, which can be opened in a browser to explore the results. The report includes the following sections:
> **Note:** The data shown in the images below is simulated and does not reflect actual results.
- **Test Model Results**: Displays the evaluation scores for each model, with a breakdown of average scores across evaluation dimensions.
- **Model Performance Summary**: Summarizes the model's performance in the evaluated dimension using LLM-generated summaries, highlighting comparisons with other models.
- **Error Case Study**: Presents error cases for the evaluated dimension, including input/output examples and detailed judgments.
- **Leaderboard**: Shows the evaluation results for all models, along with visualized comparisons to previous versions (e.g., our v1.0 results).