Evaluation System
The evaluation system in the OmniEmbodied Framework benchmarks embodied AI agents. It handles scenario management, parallel execution, performance measurement, and detailed result analysis.
Overview
The evaluation system is designed for rigorous, reproducible research:
Standardized Benchmarks: 1400+ curated scenarios across multiple task types
Parallel Execution: Efficient evaluation with configurable parallelism
Comprehensive Metrics: Success rates, efficiency measures, and error analysis
Flexible Configuration: Support for custom evaluation protocols
Result Management: Organized storage and analysis of experimental data
Core Components
Evaluation Manager
The EvaluationManager serves as the central coordinator for all evaluation activities:
Key Features:
Scenario selection and filtering
Parallel execution management
Result collection and aggregation
Performance monitoring and optimization
Experimental configuration management
from evaluation.evaluation_interface import EvaluationInterface
# Run evaluation using the interface
result = EvaluationInterface.run_evaluation(
    config_file="single_agent_config",  # Will resolve to config/baseline/
    agent_type="single",
    task_type="independent",
    scenario_selection={
        "dataset_type": "single",
        "scenario_range": {"start": "00001", "end": "00100"}
    }
)
print(f"Success rate: {result.get('success_rate', 0):.2%}")
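To make the `scenario_range` argument concrete, here is a minimal sketch of how a start/end pair might be expanded into individual scenario IDs. It assumes the range is inclusive and that IDs are zero-padded decimal strings; `expand_scenario_range` is a hypothetical helper, not part of the framework's API.

```python
def expand_scenario_range(scenario_range):
    """Expand {"start": "00001", "end": "00003"} into zero-padded scenario IDs.

    Assumption: both ends are inclusive and share the same padding width.
    """
    start, end = scenario_range["start"], scenario_range["end"]
    width = len(start)  # reuse the zero padding of the start ID
    first, last = int(start), int(end)
    if first > last:
        raise ValueError(f"start {start} is after end {end}")
    return [str(i).zfill(width) for i in range(first, last + 1)]


ids = expand_scenario_range({"start": "00001", "end": "00100"})
print(len(ids), ids[0], ids[-1])  # 100 00001 00100
```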
Scenario Management
Scenario Selector:
The scenario selection system provides flexible filtering and sampling:
from evaluation.scenario_selector import ScenarioSelector
# Define selection criteria
selection_config = {
    "dataset_type": "single",
    "task_categories": ["direct_command", "attribute_reasoning"],
    "difficulty_levels": ["basic", "intermediate"],
    "scenario_count": 50
}
# Get scenario list (`config` is the evaluation configuration loaded beforehand)
scenarios = ScenarioSelector.get_scenario_list(config, selection_config)
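The filtering and sampling behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the `ScenarioSelector` implementation: the scenario records, their `category` and `difficulty` fields, the `other_category` value, and the `select_scenarios` helper are all hypothetical.

```python
import random


def select_scenarios(scenarios, selection_config, seed=0):
    """Filter scenarios by category and difficulty, then sample up to scenario_count."""
    categories = set(selection_config.get("task_categories", []))
    levels = set(selection_config.get("difficulty_levels", []))
    # An empty filter list is treated as "accept everything" (an assumption).
    matched = [
        s for s in scenarios
        if (not categories or s["category"] in categories)
        and (not levels or s["difficulty"] in levels)
    ]
    count = selection_config.get("scenario_count")
    if count is None or count >= len(matched):
        return matched
    # A seeded generator keeps the sample reproducible across runs.
    return random.Random(seed).sample(matched, count)


catalog = [
    {"id": "00001", "category": "direct_command", "difficulty": "basic"},
    {"id": "00002", "category": "attribute_reasoning", "difficulty": "advanced"},
    {"id": "00003", "category": "other_category", "difficulty": "basic"},
    {"id": "00004", "category": "attribute_reasoning", "difficulty": "intermediate"},
]
criteria = {
    "task_categories": ["direct_command", "attribute_reasoning"],
    "difficulty_levels": ["basic", "intermediate"],
    "scenario_count": 50,
}
picked = select_scenarios(catalog, criteria)
print([s["id"] for s in picked])  # ['00001', '00004']
```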
Scenario Types:
Single-Agent Scenarios: Independent task completion
Multi-Agent Scenarios: Collaborative task scenarios
Progressive Difficulty: From basic commands to complex reasoning
Specialized Tasks: Domain-specific evaluation scenarios
Task Execution
Scenario Executor:
Handles individual scenario execution with proper state management:
from evaluation.scenario_executor import ScenarioExecutor
# Execute single scenario
executor = ScenarioExecutor(config)
result = executor.execute_scenario(
    scenario_id="00001",
    max_steps=50,
    timeout=300
)
# Result contains detailed execution information
print(f"Scenario: {result['scenario_id']}")
print(f"Success: {result['success']}")
print(f"Steps taken: {result['steps_taken']}")
print(f"Execution time: {result['execution_time']:.2f}s")
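As an illustration of how `max_steps` and `timeout` could bound a single scenario, the sketch below runs a step function until it reports success, the step budget is spent, or the wall-clock limit passes. `run_with_limits` and the `status` field are hypothetical; the real `ScenarioExecutor` may enforce its limits differently.

```python
import time


def run_with_limits(step_fn, max_steps=50, timeout=300.0):
    """Call step_fn() until it returns True, the step budget is spent, or time runs out."""
    started = time.monotonic()
    steps_taken = 0
    success = False
    status = "max_steps_reached"
    while steps_taken < max_steps:
        # The clock is checked between steps, so one slow step can overrun the limit.
        if time.monotonic() - started > timeout:
            status = "timeout"
            break
        steps_taken += 1
        if step_fn():
            success = True
            status = "completed"
            break
    return {
        "success": success,
        "status": status,
        "steps_taken": steps_taken,
        "execution_time": time.monotonic() - started,
    }


# Toy task that succeeds on its third step.
outcomes = iter([False, False, True])
result = run_with_limits(lambda: next(outcomes), max_steps=50, timeout=300)
print(result["success"], result["steps_taken"])  # True 3
```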
Task Executor:
Manages the execution lifecycle for individual tasks:
from evaluation.task_executor import TaskExecutor
# Initialize task executor
task_executor = TaskExecutor(agent_config, llm_config)
# Execute task with detailed tracking
result = task_executor.execute_task(
    scenario_data=scenario,
    task_config=task_config
)
Evaluation Workflows
Basic Evaluation Workflow
Standard evaluation process for single agents:
from evaluation.evaluation_interface import EvaluationInterface
# 1. Configure evaluation parameters
scenario_selection = {
    "dataset_type": "single",
    "scenario_range": {"start": "00001", "end": "00050"},
}
# 2. Run evaluation using interface
result = EvaluationInterface.run_evaluation(
    config_file="single_agent_config",
    agent_type="single",
    task_type="independent",
    scenario_selection=scenario_selection
)
# 3. Analyze results
print(f"Overall success rate: {result.get('success_rate', 0):.2%}")
print(f"Total scenarios: {result.get('total_scenarios', 0)}")
Multi-Agent Evaluation
Evaluation workflow for multi-agent scenarios:
from evaluation.evaluation_interface import EvaluationInterface
# Configure for multi-agent evaluation
result = EvaluationInterface.run_evaluation(
    config_file="centralized_config",  # Will resolve to config/baseline/centralized_config.yaml
    agent_type="multi",
    task_type="collaborative",
    scenario_selection={
        "dataset_type": "multi",
        "scenario_range": {"start": "00001", "end": "00050"}
    }
)
print(f"Multi-agent success rate: {result.get('success_rate', 0):.2%}")
Parallel Evaluation
Configure parallel execution for efficient evaluation:
# Configuration for parallel evaluation
parallel_evaluation:
  scenario_parallelism:
    max_parallel_scenarios: 5     # Number of scenarios to run simultaneously
    timeout_per_scenario: 300     # Timeout for individual scenarios
  resource_management:
    memory_limit_mb: 8192         # Memory limit per process
    cpu_cores: 4                  # CPU cores to use
  error_handling:
    max_retries: 3                # Retry failed scenarios
    continue_on_error: true       # Continue evaluation after failures
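The sketch below shows one way the `max_parallel_scenarios`, `max_retries`, and `continue_on_error` settings could interact, using only the standard library. It is an illustration, not the framework's scheduler: `evaluate_in_parallel`, `run_with_retries`, and `fake_scenario` are hypothetical names, and `max_retries` is assumed to count retries after the first attempt.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_retries(run_scenario, scenario_id, max_retries):
    """Try a scenario once, then retry up to max_retries times on any exception."""
    last_error = None
    for attempt in range(1, max_retries + 2):  # first attempt plus the retries
        try:
            result = run_scenario(scenario_id)
            return {"scenario_id": scenario_id, "attempts": attempt, **result}
        except Exception as exc:
            last_error = exc
    raise RuntimeError(
        f"scenario {scenario_id} failed after {max_retries} retries"
    ) from last_error


def evaluate_in_parallel(run_scenario, scenario_ids, max_parallel_scenarios=5,
                         max_retries=3, continue_on_error=True):
    """Run scenarios on a bounded thread pool and return results in input order."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel_scenarios) as pool:
        futures = {
            pool.submit(run_with_retries, run_scenario, sid, max_retries): sid
            for sid in scenario_ids
        }
        for future in as_completed(futures):
            sid = futures[future]
            try:
                results[sid] = future.result()
            except Exception as exc:
                if not continue_on_error:
                    raise
                # Record the failure and keep evaluating the remaining scenarios.
                results[sid] = {"scenario_id": sid, "success": False,
                                "error_message": str(exc)}
    return [results[sid] for sid in scenario_ids]


def fake_scenario(scenario_id):
    """Stand-in for a real scenario run; "00003" always fails."""
    if scenario_id == "00003":
        raise ValueError("simulated failure")
    return {"success": True}


outcomes = evaluate_in_parallel(fake_scenario, ["00001", "00002", "00003"])
print([(o["scenario_id"], o["success"]) for o in outcomes])
```

Threads suit workloads that mostly wait on LLM API calls. This sketch does not enforce `timeout_per_scenario` or `memory_limit_mb`; hard per-scenario limits generally require separate processes, since a running thread cannot be stopped from outside.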
Performance Metrics
Success Metrics
The evaluation system tracks multiple success criteria:
Primary Metrics:
Success Rate: Percentage of scenarios completed successfully
Efficiency: Steps taken relative to optimal solution
Completion Time: Total execution time per scenario
Resource Usage: Memory and computational requirements
Secondary Metrics:
Error Categories: Classification of failure modes
Action Distribution: Frequency of different action types
Exploration Efficiency: Coverage of search space
Communication Patterns: Inter-agent message analysis (multi-agent)
# Access detailed metrics (`results` is assumed to be the dictionary produced by an evaluation run)
metrics = results['detailed_metrics']
print("Success rate by task type:")
for task_type, rate in metrics['success_by_task_type'].items():
    print(f"  {task_type}: {rate:.2%}")
print("Average steps by difficulty:")
for difficulty, steps in metrics['steps_by_difficulty'].items():
    print(f"  {difficulty}: {steps:.1f}")
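For readers who want to see how such breakdowns could be derived, here is a minimal sketch that aggregates per-scenario records into the two dictionaries used above. The `task_type` and `difficulty` fields on each record, the `aggregate_metrics` helper, and the efficiency formula (optimal steps divided by steps taken) are assumptions for illustration; the source does not define them.

```python
from collections import defaultdict


def aggregate_metrics(detailed_results):
    """Group per-scenario records into success rates and mean step counts."""
    successes = defaultdict(list)
    steps = defaultdict(list)
    for record in detailed_results:
        successes[record["task_type"]].append(1 if record["success"] else 0)
        steps[record["difficulty"]].append(record["steps_taken"])
    return {
        "success_by_task_type": {t: sum(v) / len(v) for t, v in successes.items()},
        "steps_by_difficulty": {d: sum(v) / len(v) for d, v in steps.items()},
    }


def step_efficiency(steps_taken, optimal_steps):
    """Hypothetical efficiency score: 1.0 means the agent matched the optimal step count."""
    if steps_taken <= 0:
        return 0.0
    return min(1.0, optimal_steps / steps_taken)


records = [
    {"task_type": "direct_command", "difficulty": "basic", "success": True, "steps_taken": 8},
    {"task_type": "direct_command", "difficulty": "basic", "success": False, "steps_taken": 12},
    {"task_type": "attribute_reasoning", "difficulty": "intermediate", "success": True, "steps_taken": 20},
]
metrics = aggregate_metrics(records)
print(metrics["success_by_task_type"])  # {'direct_command': 0.5, 'attribute_reasoning': 1.0}
print(metrics["steps_by_difficulty"])   # {'basic': 10.0, 'intermediate': 20.0}
print(step_efficiency(8, 6))            # 0.75
```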
Error Analysis
Comprehensive error categorization and analysis:
# Analyze failure modes
error_analysis = results['error_analysis']
print("Common failure modes:")
for error_type, count in error_analysis['error_types'].items():
    percentage = (count / results['total_scenarios']) * 100
    print(f"  {error_type}: {count} ({percentage:.1f}%)")
# Get detailed error information
failed_scenarios = error_analysis['failed_scenarios']
for failure in failed_scenarios[:5]:  # Show first 5 failures
    print(f"Scenario {failure['scenario_id']}: {failure['error_message']}")
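A structure like `error_analysis` could be assembled from per-scenario records roughly as follows. This is a sketch under assumptions: the `error_type` field on each failed record and the `build_error_analysis` helper are hypothetical, and the error type names are made up for the example.

```python
from collections import Counter


def build_error_analysis(detailed_results):
    """Count failure categories and collect a short record for each failed scenario."""
    failed = [r for r in detailed_results if not r["success"]]
    # Failures without a category are grouped under "unknown".
    error_types = Counter(r.get("error_type", "unknown") for r in failed)
    return {
        "error_types": dict(error_types.most_common()),  # most frequent first
        "failed_scenarios": [
            {"scenario_id": r["scenario_id"],
             "error_message": r.get("error_message", "")}
            for r in failed
        ],
    }


records = [
    {"scenario_id": "00001", "success": True},
    {"scenario_id": "00002", "success": False, "error_type": "timeout",
     "error_message": "exceeded 300 s"},
    {"scenario_id": "00003", "success": False, "error_type": "invalid_action",
     "error_message": "unknown object"},
    {"scenario_id": "00004", "success": False, "error_type": "timeout",
     "error_message": "exceeded 300 s"},
]
analysis = build_error_analysis(records)
print(analysis["error_types"])  # {'timeout': 2, 'invalid_action': 1}
```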
Comparative Analysis
Compare different models or configurations:
# Compare multiple evaluation runs
from evaluation.analysis import ResultsAnalyzer
analyzer = ResultsAnalyzer()
# Load multiple result sets
gpt4_results = analyzer.load_results("gpt-4_evaluation_results.json")
claude_results = analyzer.load_results("claude_evaluation_results.json")
# Generate comparison report
comparison = analyzer.compare_results([
    ("GPT-4", gpt4_results),
    ("Claude-3", claude_results)
])
print(comparison['summary'])
Configuration and Customization
Evaluation Configuration
Comprehensive configuration options:
# evaluation_config.yaml
evaluation:
  # Dataset selection
  dataset_type: "single"            # single, multi, mixed
  scenario_range:
    start: "00001"
    end: "00800"
  # Task filtering
  task_filter:
    categories: ["direct_command", "attribute_reasoning"]
    difficulty_levels: ["basic", "intermediate", "advanced"]
    exclude_scenarios: []           # Specific scenarios to exclude
  # Execution parameters
  max_steps: 50                     # Maximum steps per scenario
  timeout: 300                      # Timeout in seconds
  max_retries: 3                    # Retries for failed scenarios
  # Output configuration
  save_trajectories: true           # Save detailed execution traces
  save_intermediate_states: false   # Save state at each step
  output_format: "json"             # json, csv, both
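A configuration like this can be checked before a long run starts. The sketch below validates an already-loaded `evaluation` mapping against the value ranges given in the comments above; `validate_evaluation_config` is a hypothetical helper and the specific rules are assumptions drawn from those comments, not the framework's own validation.

```python
ALLOWED_DATASETS = {"single", "multi", "mixed"}
ALLOWED_FORMATS = {"json", "csv", "both"}


def validate_evaluation_config(cfg):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if cfg.get("dataset_type") not in ALLOWED_DATASETS:
        problems.append("dataset_type must be one of single, multi, mixed")
    scenario_range = cfg.get("scenario_range", {})
    start, end = scenario_range.get("start"), scenario_range.get("end")
    if not (isinstance(start, str) and isinstance(end, str)
            and start.isdigit() and end.isdigit()):
        problems.append("scenario_range start and end must be digit strings such as '00001'")
    elif int(start) > int(end):
        problems.append("scenario_range start must not exceed end")
    for key in ("max_steps", "timeout"):
        value = cfg.get(key)
        # bool is a subclass of int, so it is rejected explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    retries = cfg.get("max_retries", 0)
    if not isinstance(retries, int) or isinstance(retries, bool) or retries < 0:
        problems.append("max_retries must be a non-negative integer")
    if cfg.get("output_format", "json") not in ALLOWED_FORMATS:
        problems.append("output_format must be one of json, csv, both")
    return problems


good = {
    "dataset_type": "single",
    "scenario_range": {"start": "00001", "end": "00800"},
    "max_steps": 50, "timeout": 300, "max_retries": 3, "output_format": "json",
}
bad = dict(good, scenario_range={"start": 1, "end": 800}, output_format="xml")
print(validate_evaluation_config(good))  # []
print(len(validate_evaluation_config(bad)))  # 2
```

The string check on `scenario_range` reflects why the IDs are quoted in the YAML: an unquoted value may be parsed as an integer by a YAML loader and lose its zero padding.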
Agent Configuration
Configure agent behavior for evaluation:
# agent_evaluation_config.yaml
agent_config:
  # Core agent settings
  agent_class: "modes.single_agent.llm_agent.LLMAgent"
  max_history: 20                 # Maximum conversation history
  # Reasoning configuration
  use_chain_of_thought: true      # Enable CoT reasoning
  reflection_enabled: false       # Enable self-reflection
  # Performance settings
  action_timeout: 30              # Timeout per action
  retry_failed_actions: true      # Retry on action failure
  # Evaluation-specific settings
  detailed_logging: true          # Enable detailed logging
  save_reasoning_traces: true     # Save reasoning steps
Custom Evaluation Protocols
Define custom evaluation procedures:
from evaluation.custom_evaluator import CustomEvaluator
class DomainSpecificEvaluator(CustomEvaluator):
    def __init__(self, config):
        super().__init__(config)
        self.domain_metrics = {}
    def evaluate_scenario(self, scenario):
        # Standard evaluation
        result = super().evaluate_scenario(scenario)
        # Add domain-specific metrics
        domain_result = self.evaluate_domain_specific(scenario, result)
        result.update(domain_result)
        return result
    def evaluate_domain_specific(self, scenario, base_result):
        # Implement domain-specific evaluation logic.
        # calculate_domain_score and extract_custom_metrics are placeholders
        # that you define on this class for your own domain.
        return {
            'domain_score': self.calculate_domain_score(scenario),
            'custom_metrics': self.extract_custom_metrics(base_result)
        }
Result Management
Result Storage
Organized storage of evaluation results:
# Results are automatically saved with structured naming
# Format: {model}_{dataset}_{timestamp}_{suffix}.json
# Example result structure:
results = {
    'metadata': {
        'model_name': 'gpt-4',
        'dataset_type': 'single',
        'evaluation_time': '2024-01-15T10:30:00',
        'total_scenarios': 100,
        'config_hash': 'abc123...'
    },
    'summary': {
        'success_rate': 0.85,
        'average_steps': 12.3,
        'total_time': 1800.5
    },
    'detailed_results': [
        {
            'scenario_id': '00001',
            'success': True,
            'steps_taken': 8,
            'execution_time': 15.2,
            'trajectory': [...]
        }
    ]
}
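The naming pattern `{model}_{dataset}_{timestamp}_{suffix}.json` can be illustrated with a small helper. The timestamp layout (`YYYYMMDD_HHMMSS`) and the replacement of `/` in model names are assumptions made for this sketch; the framework's actual format may differ, and `result_filename` is a hypothetical name.

```python
from datetime import datetime


def result_filename(model, dataset, suffix, when=None):
    """Build a result file name following {model}_{dataset}_{timestamp}_{suffix}.json."""
    when = when or datetime.now()
    timestamp = when.strftime("%Y%m%d_%H%M%S")  # assumed timestamp layout
    safe_model = model.replace("/", "-")  # keep provider-style names out of the path
    return f"{safe_model}_{dataset}_{timestamp}_{suffix}.json"


name = result_filename("gpt-4", "single", "baseline", datetime(2024, 1, 15, 10, 30, 0))
print(name)  # gpt-4_single_20240115_103000_baseline.json
```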
Result Analysis
Built-in analysis tools for result interpretation:
from evaluation.result_analyzer import ResultAnalyzer
# Load and analyze results
analyzer = ResultAnalyzer("evaluation_results.json")
# Generate summary statistics
summary = analyzer.generate_summary()
print(summary)
# Create visualizations
analyzer.plot_success_by_task_type("success_by_task.png")
analyzer.plot_steps_distribution("steps_distribution.png")
analyzer.plot_error_analysis("error_analysis.png")
# Export detailed report
analyzer.export_detailed_report("evaluation_report.html")
Batch Processing
Process multiple evaluation runs efficiently:
from evaluation.batch_processor import BatchProcessor
# Define multiple evaluation configurations
eval_configs = [
    {"model": "gpt-3.5-turbo", "scenarios": "00001-00100"},
    {"model": "gpt-4", "scenarios": "00001-00100"},
    {"model": "claude-3", "scenarios": "00001-00100"}
]
# Process all configurations
processor = BatchProcessor()
all_results = processor.process_batch(eval_configs)
# Generate comparative analysis
comparison_report = processor.generate_comparison(all_results)
Best Practices
Evaluation Design
Scenario Selection:
Use stratified sampling for representative evaluation sets
Include diverse task types and difficulty levels
Validate scenario quality and consistency
Consider domain-specific requirements
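Stratified sampling, as recommended above, can be sketched in a few lines: group scenarios by a key such as difficulty, then draw the same fraction from every group. `stratified_sample` and the `difficulty` field are hypothetical; drawing at least one item per group is a design choice of this sketch, not a framework rule.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(scenarios, key, fraction, seed=0):
    """Sample the same fraction from every stratum, keeping at least one item each."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded for reproducible evaluation sets
    strata = defaultdict(list)
    for scenario in scenarios:
        strata[scenario[key]].append(scenario)
    sample = []
    for name in sorted(strata):  # fixed iteration order keeps results stable
        group = strata[name]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


pool = (
    [{"id": f"b{i}", "difficulty": "basic"} for i in range(60)]
    + [{"id": f"i{i}", "difficulty": "intermediate"} for i in range(30)]
    + [{"id": f"a{i}", "difficulty": "advanced"} for i in range(10)]
)
subset = stratified_sample(pool, key="difficulty", fraction=0.2)
counts = Counter(s["difficulty"] for s in subset)
print(dict(sorted(counts.items())))  # {'advanced': 2, 'basic': 12, 'intermediate': 6}
```

The subset keeps the 6:3:1 difficulty mix of the full pool, which a plain random draw of 20 scenarios would only approximate.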
Metric Selection:
Choose metrics aligned with research objectives
Include both primary and secondary metrics
Consider efficiency metrics alongside accuracy
Use standardized metrics for comparability
Performance Optimization
Parallel Execution:
Configure parallelism based on available resources
Monitor memory usage during parallel evaluation
Implement proper error handling for parallel processes
Use timeout mechanisms to prevent hanging evaluations
Resource Management:
Set appropriate memory and CPU limits
Monitor resource usage during evaluation
Implement cleanup procedures for failed evaluations
Consider cost implications of API usage
Reproducibility
Configuration Management:
Use version control for evaluation configurations
Document all experimental parameters
Save configuration hashes with results
Implement deterministic random seeding where possible
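Two of the practices above, configuration hashes and deterministic seeding, can be sketched with the standard library. `config_hash` and `seeded_rng` are hypothetical helpers; the choice of SHA-256 over a canonical JSON dump is an assumption, not the framework's documented scheme.

```python
import hashlib
import json
import random


def config_hash(config):
    """Hash a configuration so that key order does not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seeded_rng(config, scenario_id):
    """Derive a reproducible random generator for one scenario under one config."""
    material = f"{config_hash(config)}:{scenario_id}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    return random.Random(seed)


cfg_a = {"max_steps": 50, "timeout": 300}
cfg_b = {"timeout": 300, "max_steps": 50}  # same settings, different key order
print(config_hash(cfg_a) == config_hash(cfg_b))  # True
print(seeded_rng(cfg_a, "00001").random() == seeded_rng(cfg_b, "00001").random())  # True
```

Seeding of this kind only covers randomness inside the evaluation code; responses from a hosted LLM can still vary between runs.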
Result Documentation:
Include comprehensive metadata with results
Document evaluation environment and dependencies
Save detailed execution traces for debugging
Implement result validation and consistency checks
Troubleshooting
Common Issues
Performance Problems:
High memory usage during parallel evaluation
Slow execution due to API rate limits
Timeout issues with complex scenarios
Resource contention in multi-process execution
Configuration Errors:
Missing or incorrect configuration parameters
Incompatible agent and evaluation configurations
Path and file access issues
Version compatibility problems
Result Inconsistencies:
Non-deterministic behavior in evaluations
Inconsistent scenario interpretations
Missing or corrupted result files
Statistical significance issues
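For the statistical significance issue listed above, one common check is to report a confidence interval alongside each success rate. The sketch below computes a Wilson score interval; `wilson_interval` is a hypothetical helper, and the choice of the Wilson method is a suggestion rather than something the framework prescribes.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a success proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(85, 100)
print(f"85/100 successes: 95% interval {low:.3f} to {high:.3f}")
```

If the intervals of two models overlap heavily on a 100-scenario run, the observed gap in success rate may not be meaningful, and a larger scenario set is worth considering.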
Debugging Tools
# Enable detailed debugging
evaluator.enable_debug_mode()
# Get execution traces
trace = evaluator.get_last_execution_trace()
# Validate evaluation configuration
validation_result = evaluator.validate_configuration()
# Monitor resource usage
resource_stats = evaluator.get_resource_statistics()
API Reference
For complete API documentation, see:
evaluation.scenario_executor.ScenarioExecutor