OmniEmbodied Framework

The OmniEmbodied Framework is a comprehensive evaluation and training system built on top of OmniSimulator. It provides LLM integration, multi-agent coordination, and standardized benchmarking for embodied AI research.

What is the OmniEmbodied Framework?

The OmniEmbodied Framework extends OmniSimulator with:

πŸ“Š Evaluation System

Comprehensive benchmarking tools with standardized tasks, metrics, and analysis.

πŸ€– Agent Implementations

Ready-to-use agent architectures including single-agent and multi-agent coordination modes.

πŸ’¬ LLM Integration

Seamless integration with OpenAI, Anthropic, vLLM, and other language model providers.

πŸ“ˆ Data Generation

Automated tools for creating training datasets, evaluation scenarios, and synthetic data.

πŸ”§ Configuration Management

Flexible YAML-based configuration system supporting complex experimental setups.

Core Components

Evaluation Framework

The evaluation system provides:

  • Standardized Tasks: 1400+ curated scenarios across multiple difficulty levels

  • Task Categories: Direct commands, reasoning, collaboration, and tool use

  • Performance Metrics: Success rates, efficiency measures, and error analysis

  • Comparative Analysis: Tools for comparing different agent architectures

from evaluation.evaluation_interface import EvaluationInterface

# Configure evaluation parameters
result = EvaluationInterface.run_evaluation(
    config_file="single_agent_config",  # Will resolve to config/baseline/single_agent_config.yaml
    agent_type="single",
    task_type="independent",
    scenario_selection={
        "dataset_type": "single",
        "scenario_range": {"start": "00001", "end": "00003"}
    }
)

# Analyze results
print(f"Success rate: {result.get('success_rate', 0):.2%}")
print(f"Total scenarios: {result.get('total_scenarios', 0)}")

Agent Modes

The framework includes multiple agent architectures:

Single Agent Mode: - Individual agents complete tasks independently - Support for chain-of-thought reasoning - Configurable memory and history management

Centralized Multi-Agent Mode: - Central coordinator manages multiple worker agents - Hierarchical task decomposition and assignment - Coordinated action planning and execution

Decentralized Multi-Agent Mode (Future): - Autonomous agents with peer-to-peer communication - Negotiation and consensus mechanisms - Distributed problem solving

from modes.single_agent.llm_agent import LLMAgent
from modes.centralized.centralized_agent import CentralizedAgent

# Single agent
single_agent = LLMAgent(
    agent_id="solo_explorer",
    config=single_agent_config
)

# Centralized multi-agent
coordinator = CentralizedAgent(
    coordinator_id="mission_control",
    worker_count=3,
    config=centralized_config
)

LLM Integration

The framework supports various language model providers:

  • OpenAI Models: GPT-3.5, GPT-4, GPT-4-Turbo

  • Anthropic Models: Claude-3 family

  • Local Models: vLLM, HuggingFace Transformers

  • Custom Endpoints: Any OpenAI-compatible API

from llm.llm_factory import create_llm_from_config

# OpenAI integration
openai_llm = create_llm_from_config({
    "mode": "api",
    "api": {
        "provider": "openai",
        "model": "gpt-4",
        "temperature": 0.1,
        "api_key": "your-key"
    }
})

# vLLM local deployment
vllm_llm = create_llm_from_config({
    "mode": "vllm",
    "model": "Qwen2.5-7B-Instruct",
    "endpoint": "http://localhost:8000/v1"
})

Key Features

Comprehensive Benchmarking

  • Task Taxonomy: 8 distinct task categories covering different AI capabilities

  • Difficulty Levels: Progressive complexity from basic to advanced reasoning

  • Evaluation Protocols: Standardized procedures for reproducible research

  • Performance Analysis: Detailed breakdowns by task type and error mode

Multi-Modal Agent Support

  • Text-Based Reasoning: Natural language understanding and generation

  • Symbolic Manipulation: Logical reasoning and problem solving

  • Spatial Reasoning: Understanding of physical relationships and constraints

  • Tool Usage: Interaction with objects and environmental elements

Experimental Infrastructure

  • Parallel Execution: Concurrent evaluation across multiple scenarios

  • Result Management: Organized storage and retrieval of experimental data

  • Configuration Versioning: Track and reproduce experimental conditions

  • Error Analysis: Detailed logging and failure mode investigation

Getting Started with the Framework

Quick Evaluation:

# Configure your LLM (see quickstart guide)
vim config/baseline/llm_config.yaml

# Run evaluation using provided scripts
cd scripts/
bash qwen7b-wg.sh  # Single-agent with guidance
bash deepseekr1-wg.sh  # Multi-agent evaluation

Custom Evaluation:

from evaluation.evaluation_interface import EvaluationInterface

# Run evaluation using the interface
result = EvaluationInterface.run_evaluation(
    config_file="single_agent_config",
    agent_type="single",
    task_type="independent",
    scenario_selection={
        "dataset_type": "single",
        "scenario_range": {"start": "00001", "end": "00100"},
        "task_filter": {
            "categories": ["direct_command", "tool_use"]
        }
    }
)

print(f"Evaluation completed: {result.get('success_rate', 0):.2%} success rate")

Framework Architecture

The framework layers on top of OmniSimulator:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           Evaluation & Training             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚    Benchmark    β”‚ β”‚   Data Generation   β”‚ β”‚
β”‚  β”‚     Suite       β”‚ β”‚      Tools          β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Agent Modes                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚   Single    β”‚ β”‚ Centralized β”‚ β”‚ Custom β”‚ β”‚
β”‚  β”‚   Agent     β”‚ β”‚Multi-Agent  β”‚ β”‚ Modes  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚             LLM Integration                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚   OpenAI    β”‚ β”‚  Anthropic  β”‚ β”‚  vLLM  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              OmniSimulator                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Configuration System

The framework uses hierarchical YAML configuration:

# Base configuration
extends: "base_config"

# Agent setup
agent_config:
  agent_class: "modes.single_agent.llm_agent.LLMAgent"
  max_history: 20

# LLM configuration
llm_config:
  provider: "vllm"
  model_name: "Qwen2.5-7B-Instruct"
  endpoint: "http://localhost:8000/v1"
  temperature: 0.1

# Evaluation setup
evaluation:
  dataset_type: "single"
  scenario_range: {"start": "00001", "end": "00800"}
  max_parallel: 5

Data and Scenarios

The framework includes extensive datasets:

Evaluation Datasets: - Single-Agent: 800 scenarios across 4 task categories - Multi-Agent: 600 collaborative scenarios - Progressive Difficulty: From basic commands to complex reasoning

Task Categories: - Direct Command Following - Attribute-Based Reasoning - Tool Use and Manipulation - Spatial Reasoning - Compound Multi-Step Reasoning - Explicit Collaboration - Implicit Collaboration - Compound Collaboration

Performance and Optimization

The framework is optimized for research workflows:

Efficient Evaluation: - Parallel scenario processing - Intelligent caching and reuse - Optimized LLM API usage

Resource Management: - Configurable memory limits - Automatic cleanup and garbage collection - Progress tracking and resumption

Result Analysis: - Automated statistical analysis - Comparative performance metrics - Error categorization and analysis

Next Steps

To learn more about the Framework components:

  • Evaluation System - Evaluation system and benchmarking

  • Agent Modes - Agent implementations and customization

  • llm_integration - Language model integration guide

  • Data Generation Tools - Dataset creation and management

  • configuration - Advanced configuration patterns

For practical usage:

  • ../examples/evaluation_workflows - Evaluation examples

  • ../examples/custom_agents - Creating custom agents

  • ../examples/llm_integration - LLM integration examples