Quick Start

This guide will help you run your first evaluation with OmniEmbodied in just a few minutes.

Prerequisites

Make sure you have completed the Installation process before proceeding.

Configuration Setup

Before running evaluations, configure the LLM service and the observation mode. OmniEmbodied supports several LLM providers, including DeepSeek, the OpenAI API, and local vLLM deployments.

Configure LLM Service

Edit the LLM configuration file:

vim config/baseline/llm_config.yaml  # Edit the LLM configuration

For DeepSeek API (Recommended for quick start):

# config/baseline/llm_config.yaml
api:
  provider: "deepseekv3"
  providers:
    deepseekv3:
      api_key: "your-api-key-here"  # Replace with your actual API key
      model: "deepseek-chat"
      temperature: 0.3
      max_tokens: 2048

For vLLM Server Deployment (For local models):

# config/baseline/llm_config.yaml
provider: "custom"
model_name: "your-model-name"  # e.g., "Qwen2.5-7B-Instruct"

api_config:
  base_url: "http://localhost:8000/v1"  # vLLM server endpoint
  api_key: "EMPTY"  # vLLM doesn't require a real API key
  timeout: 60
  max_retries: 3

model_config:
  temperature: 0.1
  max_tokens: 2000
  top_p: 0.9

Start vLLM Server (if using local deployment):

# Install vLLM (if not already installed)
pip install vllm

# Start vLLM server with your model
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --host 0.0.0.0 \
    --port 8000

For OpenAI API (alternative):

# config/baseline/llm_config.yaml
provider: "openai"
model_name: "gpt-4"

api_config:
  api_key: "${OPENAI_API_KEY}"  # Set as environment variable
  timeout: 30
  max_retries: 3

model_config:
  temperature: 0.1
  max_tokens: 2000

Set your API key:

export OPENAI_API_KEY="your-api-key-here"
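The `${OPENAI_API_KEY}` placeholder only helps if the variable is set in the shell that launches the evaluation. This guide does not show how OmniEmbodied expands the placeholder, so the following is a minimal sketch that assumes plain `${NAME}` substitution. The helper `resolve_env_placeholders` is hypothetical and not part of the project's API; it illustrates how to fail early when a key is missing.

```python
import os
import re

# Matches ${NAME} placeholders such as "${OPENAI_API_KEY}".
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_placeholders(value, env=None):
    """Replace ${NAME} placeholders in a string with values from env.

    Hypothetical helper for illustration only. Raises KeyError when a
    referenced variable is not set, so a missing key is caught early.
    """
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)


if __name__ == "__main__":
    # Use an explicit mapping here instead of the real environment.
    demo_env = {"OPENAI_API_KEY": "your-api-key-here"}
    print(resolve_env_placeholders("${OPENAI_API_KEY}", demo_env))
```

Running a check like this before a long evaluation avoids discovering an unset key only after the first request fails.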

Global Observation Configuration

Scripts ending in -wg.sh run with global observation. Before using them, enable it in the simulator configuration:

# config/simulator/simulator_config.yaml
global_observation: true  # Enable global observation mode

With this setting, agents can see the complete environment from the start.

Running Your First Evaluation

OmniEmbodied provides ready-to-use shell scripts for different evaluation scenarios.

Quick Start Example

Run a basic evaluation with DeepSeek:

# Run basic evaluation (without global observation)
bash scripts/deepseekv3-wo.sh

Single Agent Evaluation

Run single-agent tasks with different models:

# Run Qwen 7B single-agent evaluation (with global observation)
bash scripts/qwen7b-wg.sh

# Run Qwen 7B single-agent evaluation (without global observation)
bash scripts/qwen7b-wo.sh

The script will:

Load the configured LLM service
Run evaluation on single-agent scenarios (00001 to 00800)
Save results with timestamp to output/ directory
Generate detailed logs and trajectory files

Multi-Agent Evaluation

Run multi-agent collaborative tasks:

# Run DeepSeek R1 multi-agent evaluation
bash scripts/deepseekr1-wg.sh

# Run Llama 8B multi-agent evaluation
bash scripts/llama8b-wg.sh

Understanding Script Parameters

Each script contains key configuration parameters:

#!/bin/bash
# Example: qwen7b-wg.sh

# Model configuration
MODEL_NAME="qwen7b"
DATASET_TYPE="single"          # single or multi agent
GUIDANCE="wg"                  # wg (with global observation) or wo (without global observation)

# Evaluation range
START_SCENARIO="00001"
END_SCENARIO="00800"

# Parallel processing
MAX_PARALLEL=5                 # Number of concurrent scenarios

# Configuration file
CONFIG_FILE="single_agent_config"  # ConfigManager will resolve the full path

Key Parameters:

MODEL_NAME: Identifier for the model being evaluated
DATASET_TYPE: single for single-agent, multi for multi-agent tasks
GUIDANCE: wg runs with global observation, wo runs without it
START_SCENARIO/END_SCENARIO: Range of scenarios to evaluate
MAX_PARALLEL: Controls concurrent evaluation processes
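To make the scenario range concrete, here is a minimal sketch of how zero-padded scenario IDs and a run label could be derived from these parameters. The functions `scenario_ids` and `run_name` are hypothetical, and the label format is inferred from the example log output in this guide rather than taken from the scripts themselves.

```python
def scenario_ids(start, end):
    """Return zero-padded scenario IDs from start to end, inclusive."""
    width = len(start)  # keep the padding of the START_SCENARIO value
    return [str(n).zfill(width) for n in range(int(start), int(end) + 1)]


def run_name(model_name, dataset_type, start, end, guidance):
    """Build a run label in the form model_type_start_to_end_guidance."""
    return f"{model_name}_{dataset_type}_{start}_to_{end}_{guidance}"


if __name__ == "__main__":
    ids = scenario_ids("00001", "00800")
    print(len(ids), ids[0], ids[-1])
    print(run_name("qwen7b", "single", "00001", "00800", "wg"))
```

Narrowing START_SCENARIO and END_SCENARIO to a handful of IDs is a cheap way to smoke-test a new model configuration before a full run.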

Monitoring Evaluation Progress

During evaluation, you’ll see output like:

[INFO] Starting evaluation: qwen7b_single_00001_to_00800_wg
[INFO] Configuration loaded: single_agent_config (resolved to config/baseline/single_agent_config.yaml)
[INFO] LLM service connected: http://localhost:8000/v1
[INFO] Processing scenario 00001/00800...
[INFO] Task: "Find the red apple in the kitchen"
[INFO] Agent completed task in 12 steps - SUCCESS
[INFO] Processing scenario 00002/00800...
...
[INFO] Evaluation completed. Results saved to:
output/20241220_143052_qwen7b_single_00001_to_00800_wg.json
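For long runs it can help to summarize progress from a captured log. The sketch below assumes the log lines look like the sample above, in particular the "Agent completed task in N steps - STATUS" line; the real format may differ, and `summarize_log` is a hypothetical helper. Any status other than SUCCESS is counted as a non-success.

```python
import re

# Pattern assumed from the sample line
# "[INFO] Agent completed task in 12 steps - SUCCESS".
_RESULT_LINE = re.compile(r"Agent completed task in (\d+) steps - (\w+)")


def summarize_log(lines):
    """Count finished tasks, successes, and mean steps in log lines."""
    steps = []
    successes = 0
    for line in lines:
        match = _RESULT_LINE.search(line)
        if match is None:
            continue  # not a task-result line
        steps.append(int(match.group(1)))
        if match.group(2) == "SUCCESS":
            successes += 1
    finished = len(steps)
    return {
        "finished": finished,
        "successes": successes,
        "mean_steps": sum(steps) / finished if finished else 0.0,
    }


if __name__ == "__main__":
    sample = [
        "[INFO] Processing scenario 00001/00800...",
        "[INFO] Agent completed task in 12 steps - SUCCESS",
    ]
    print(summarize_log(sample))
```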

Understanding Results

The evaluation generates several output files:

Result Files:

- {timestamp}_{model}_{type}_{range}_{guidance}.json: Main results
- trajectory_logs/: Detailed step-by-step agent actions
- error_logs/: Failed scenarios and error analysis
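When comparing many runs, the result file name itself carries the run parameters. The following sketch splits a name into its parts, assuming the pattern above and the example name shown in the sample output (an 8-digit date, a 6-digit time, and a start_to_end range). The function `parse_result_name` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: {timestamp}_{model}_{type}_{range}_{guidance}.json
_RESULT_NAME = re.compile(
    r"^(?P<timestamp>\d{8}_\d{6})_"
    r"(?P<model>.+)_"
    r"(?P<type>single|multi)_"
    r"(?P<start>\d+)_to_(?P<end>\d+)_"
    r"(?P<guidance>wg|wo)\.json$"
)


def parse_result_name(filename):
    """Split a result file name into its components."""
    match = _RESULT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected result file name: {filename}")
    fields = match.groupdict()
    # Convert the timestamp text into a datetime for easy sorting.
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y%m%d_%H%M%S")
    return fields


if __name__ == "__main__":
    print(parse_result_name("20241220_143052_qwen7b_single_00001_to_00800_wg.json"))
```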

Key Metrics:

- Success Rate: Percentage of successfully completed tasks
- Average Steps: Mean number of actions per task
- Task Categories: Performance breakdown by task type
- Error Analysis: Common failure modes and patterns

Sample Results:

{
  "model_name": "qwen7b",
  "dataset_type": "single",
  "guidance": "wg",
  "total_scenarios": 800,
  "completed_scenarios": 782,
  "success_rate": 0.847,
  "average_steps": 8.3,
  "task_breakdown": {
    "direct_command": {"success_rate": 0.92, "count": 200},
    "attribute_reasoning": {"success_rate": 0.85, "count": 200},
    "tool_use": {"success_rate": 0.78, "count": 200},
    "compound_reasoning": {"success_rate": 0.73, "count": 200}
  }
}
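A results file with this shape can be inspected with a few lines of Python. The sketch below assumes the field names shown in the sample and computes two things: the count-weighted mean of the per-category success rates and the weakest category. The function names are illustrative. The top-level success_rate may be computed differently (for example, over completed scenarios only), so it need not equal the weighted mean.

```python
import json


def weighted_success_rate(results):
    """Count-weighted mean of the per-category success rates."""
    breakdown = results["task_breakdown"]
    total = sum(entry["count"] for entry in breakdown.values())
    if total == 0:
        return 0.0
    weighted = sum(
        entry["success_rate"] * entry["count"] for entry in breakdown.values()
    )
    return weighted / total


def weakest_category(results):
    """Name of the category with the lowest success rate."""
    breakdown = results["task_breakdown"]
    return min(breakdown, key=lambda name: breakdown[name]["success_rate"])


if __name__ == "__main__":
    # A trimmed copy of the sample above; load a real file with json.load.
    text = """{"task_breakdown": {
        "direct_command": {"success_rate": 0.92, "count": 200},
        "compound_reasoning": {"success_rate": 0.73, "count": 200}}}"""
    results = json.loads(text)
    print(weakest_category(results), weighted_success_rate(results))
```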

Common Configuration Options

You can customize evaluations by modifying configuration files:

Agent Behavior (configured via single_agent_config):

agent_config:
  max_history: 20              # Conversation history length
  max_steps_per_task: 35       # Maximum actions per task

environment_description:
  detail_level: 'full'         # full, basic, minimal
  show_object_properties: true # Include object attributes
  update_frequency: 0          # Update every step

Execution Control:

execution:
  max_total_steps: 400         # Total simulation steps
  timeout_seconds: 300         # Per-action timeout

parallel_evaluation:
  max_parallel_scenarios: 5    # Concurrent evaluations

Task Filtering:

scenario_selection:
  task_filter:
    categories:                # Select specific task types
      - "direct_command"
      - "attribute_reasoning"

Troubleshooting

LLM Service Connection Issues:

# Test vLLM server connectivity
curl -X POST http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"your-model","messages":[{"role":"user","content":"Hello"}]}'
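The same connectivity check can be scripted in Python using only the standard library. This is a sketch, assuming an OpenAI-compatible endpoint at the base URL from the vLLM configuration; `build_chat_request` and `check_server` are hypothetical helper names, and "your-model" must be replaced with the model name the server actually serves.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="EMPTY"):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def check_server(base_url, model, timeout=10):
    """Send one short prompt; return True if the server answers HTTP 200."""
    request = build_chat_request(base_url, model, "Hello")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    print(check_server("http://localhost:8000/v1", "your-model"))
```

A False result here corresponds to the failure cases listed under Common Issues, so check that the server is running, that the port is correct, and that the model name matches.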

Common Issues:

“Connection refused”: vLLM server not running or wrong port
“Model not found”: Check model name in configuration matches vLLM server
“Out of memory”: Reduce max_parallel_scenarios or use a smaller model
Slow evaluation: Check network latency to LLM service

Debug Mode:

Enable detailed logging for troubleshooting:

logging:
  level: "DEBUG"
  show_llm_details: true

Next Steps

Once you’ve run your first evaluation:

Explore Results:

- Analyze performance by task category
- Compare different models and configurations
- Review trajectory logs for agent behavior insights

Advanced Usage:

- user_guide/evaluation_framework: Systematic evaluation strategies
- user_guide/configuration: Advanced configuration patterns
- Examples: More evaluation examples

Customize Evaluations:

- developer/extending: Create custom agents
- OmniSimulator API Reference: Use simulation API directly
- Task Types and Categories: Understand task categories

This quick start focuses on getting evaluations running. For a deeper understanding of the system components, see the detailed documentation sections.