Quick Start
This guide will help you run your first evaluation with OmniEmbodied in just a few minutes.
Prerequisites
Make sure you have completed the Installation process before proceeding.
Configuration Setup
Before running evaluations, you need to configure the LLM service and the observation mode. OmniEmbodied supports several LLM backends, including the DeepSeek API, the OpenAI API, and local vLLM deployments.
Configure LLM Service
Edit the LLM configuration file:
vim config/baseline/llm_config.yaml # Edit the LLM configuration
For DeepSeek API (Recommended for quick start):
# config/baseline/llm_config.yaml
api:
  provider: "deepseekv3"
  providers:
    deepseekv3:
      api_key: "your-api-key-here" # Replace with your actual API key
      model: "deepseek-chat"
      temperature: 0.3
      max_tokens: 2048
For vLLM Server Deployment (For local models):
# config/baseline/llm_config.yaml
provider: "custom"
model_name: "your-model-name" # e.g., "Qwen2.5-7B-Instruct"
api_config:
  base_url: "http://localhost:8000/v1" # vLLM server endpoint
  api_key: "EMPTY" # vLLM doesn't require real API key
  timeout: 60
  max_retries: 3
model_config:
  temperature: 0.1
  max_tokens: 2000
  top_p: 0.9
Start vLLM Server (if using local deployment):
# Install vLLM (if not already installed)
pip install vllm
# Start vLLM server with your model
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/your/model \
  --host 0.0.0.0 \
  --port 8000
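An evaluation script started before the vLLM server has finished loading the model will fail with a connection error. The following is a minimal readiness check, assuming the server answers on the OpenAI-compatible `/v1/models` route at the `base_url` configured above. `probe_models_endpoint` and `wait_until_ready` are illustrative helpers, not part of OmniEmbodied.

```python
import time
import urllib.error
import urllib.request


def probe_models_endpoint(base_url="http://localhost:8000/v1", timeout=5):
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server not up yet, wrong port, or request timed out.
        return False


def wait_until_ready(probe=probe_models_endpoint, attempts=30, delay=2.0):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_ready()` before launching a script gives the server up to about a minute to come up. Large models may need a higher `attempts` value.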
For OpenAI API (alternative):
# config/baseline/llm_config.yaml
provider: "openai"
model_name: "gpt-4"
api_config:
  api_key: "${OPENAI_API_KEY}" # Set as environment variable
  timeout: 30
  max_retries: 3
model_config:
  temperature: 0.1
  max_tokens: 2000
Set your API key:
export OPENAI_API_KEY="your-api-key-here"
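The `${OPENAI_API_KEY}` placeholder only resolves if the variable is exported in the same shell that launches the evaluation. The sketch below shows one way such a placeholder could be expanded and checked up front. It assumes shell-style `${NAME}` syntax; how OmniEmbodied's own config loader performs the substitution is not shown in this guide, and `resolve_placeholders` is a hypothetical helper.

```python
import os
import re

# Matches ${NAME} where NAME is a valid environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(value, env=None):
    """Replace each ${NAME} in `value` with env[NAME]; raise KeyError if NAME is unset or empty."""
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if not env.get(name):
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, value)
```

Failing early on an unset variable is usually easier to debug than an authentication error several minutes into a run.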
Global Observation Configuration
For scripts ending with -wg.sh (with global observation), you need to configure:
# config/simulator/simulator_config.yaml
global_observation: true # Enable global observation mode
This enables agents to have complete visibility of the environment from the start.
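Because the script name and the simulator setting are configured in two different places, they can drift apart. A small consistency check might look like the sketch below. It assumes that `-wo.sh` scripts expect `global_observation: false`, that a missing key means false, and that the config has already been parsed into a dict; both function names are hypothetical.

```python
def expected_global_observation(script_name):
    """Infer the required global_observation value from a script's -wg/-wo suffix."""
    if script_name.endswith("-wg.sh"):
        return True
    if script_name.endswith("-wo.sh"):  # assumption: "wo" scripts expect the flag off
        return False
    raise ValueError(f"unrecognised script suffix: {script_name}")


def check_simulator_config(script_name, simulator_config):
    """Return None if the config matches the script, else a message describing the mismatch."""
    expected = expected_global_observation(script_name)
    actual = bool(simulator_config.get("global_observation", False))
    if actual == expected:
        return None
    return (
        f"{script_name} expects global_observation: {str(expected).lower()}, "
        f"found {str(actual).lower()}"
    )
```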
Running Your First Evaluation
OmniEmbodied provides ready-to-use shell scripts for different evaluation scenarios.
Quick Start Example
Run a basic evaluation with DeepSeek:
# Run basic evaluation (without global observation)
bash scripts/deepseekv3-wo.sh
Single Agent Evaluation
Run single-agent tasks with different models:
# Run Qwen 7B single-agent evaluation (with global observation)
bash scripts/qwen7b-wg.sh
# Run Qwen 7B single-agent evaluation (without global observation)
bash scripts/qwen7b-wo.sh
The script will:
Load the configured LLM service
Run evaluation on single-agent scenarios (00001 to 00800)
Save results with timestamp to the output/ directory
Generate detailed logs and trajectory files
Multi-Agent Evaluation
Run multi-agent collaborative tasks:
# Run DeepSeek R1 multi-agent evaluation
bash scripts/deepseekr1-wg.sh
# Run Llama 8B multi-agent evaluation
bash scripts/llama8b-wg.sh
Understanding Script Parameters
Each script contains key configuration parameters:
#!/bin/bash
# Example: qwen7b-wg.sh
# Model configuration
MODEL_NAME="qwen7b"
DATASET_TYPE="single" # single or multi agent
GUIDANCE="wg" # wg (with guidance) or wo (without guidance)
# Evaluation range
START_SCENARIO="00001"
END_SCENARIO="00800"
# Parallel processing
MAX_PARALLEL=5 # Number of concurrent scenarios
# Configuration file
CONFIG_FILE="single_agent_config" # ConfigManager will resolve the full path
Key Parameters:
- MODEL_NAME: Identifier for the model being evaluated
- DATASET_TYPE: single for single-agent, multi for multi-agent tasks
- GUIDANCE: wg includes task guidance, wo tests pure reasoning
- START_SCENARIO/END_SCENARIO: Range of scenarios to evaluate
- MAX_PARALLEL: Controls concurrent evaluation processes
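These parameters also determine the run identifier that appears in logs and result file names. The sketch below reproduces that naming pattern and expands a scenario range. The pattern is inferred from the sample identifier `qwen7b_single_00001_to_00800_wg`; `run_name` and `scenario_ids` are illustrative names rather than functions from the scripts.

```python
def run_name(model_name, dataset_type, start_scenario, end_scenario, guidance):
    """Build an identifier like 'qwen7b_single_00001_to_00800_wg'."""
    if dataset_type not in ("single", "multi"):
        raise ValueError("dataset_type must be 'single' or 'multi'")
    if guidance not in ("wg", "wo"):
        raise ValueError("guidance must be 'wg' or 'wo'")
    return f"{model_name}_{dataset_type}_{start_scenario}_to_{end_scenario}_{guidance}"


def scenario_ids(start_scenario, end_scenario):
    """Expand an inclusive range of zero-padded IDs, keeping the padding width of the start ID."""
    width = len(start_scenario)
    first, last = int(start_scenario), int(end_scenario)
    return [str(number).zfill(width) for number in range(first, last + 1)]
```

Narrowing `START_SCENARIO` and `END_SCENARIO` to a handful of IDs is a cheap way to smoke-test a new model before committing to the full range.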
Monitoring Evaluation Progress
During evaluation, you’ll see output like:
[INFO] Starting evaluation: qwen7b_single_00001_to_00800_wg
[INFO] Configuration loaded: single_agent_config (resolved to config/baseline/single_agent_config.yaml)
[INFO] LLM service connected: http://localhost:8000/v1
[INFO] Processing scenario 00001/00800...
[INFO] Task: "Find the red apple in the kitchen"
[INFO] Agent completed task in 12 steps - SUCCESS
[INFO] Processing scenario 00002/00800...
...
[INFO] Evaluation completed. Results saved to:
output/20241220_143052_qwen7b_single_00001_to_00800_wg.json
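For long runs it can be useful to track progress without waiting for the final results file. The sketch below tallies completion lines from captured output. It assumes that every finished task logs a line of the form shown above and that unsuccessful tasks use the same format with a status word other than `SUCCESS`; the exact failure wording is not documented here.

```python
import re

# Matches e.g. "[INFO] Agent completed task in 12 steps - SUCCESS"
_COMPLETED = re.compile(r"Agent completed task in (\d+) steps - (\w+)")


def summarize_log(lines):
    """Return finished-task count, success rate, and mean steps from log lines."""
    steps = []
    successes = 0
    for line in lines:
        match = _COMPLETED.search(line)
        if match is None:
            continue  # not a task-completion line
        steps.append(int(match.group(1)))
        if match.group(2) == "SUCCESS":
            successes += 1
    finished = len(steps)
    return {
        "finished": finished,
        "success_rate": successes / finished if finished else 0.0,
        "average_steps": sum(steps) / finished if finished else 0.0,
    }
```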
Understanding Results
The evaluation generates several output files:
Result Files:
- {timestamp}_{model}_{type}_{range}_{guidance}.json - Main results
- trajectory_logs/ - Detailed step-by-step agent actions
- error_logs/ - Failed scenarios and error analysis
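When comparing many runs, the file name alone carries most of the run metadata. A sketch of a parser for that naming scheme follows. The timestamp layout (`YYYYMMDD_HHMMSS`) and the `{start}_to_{end}` range format are inferred from the sample file name `20241220_143052_qwen7b_single_00001_to_00800_wg.json`, and `parse_result_filename` is a hypothetical helper.

```python
import re
from datetime import datetime

_RESULT_NAME = re.compile(
    r"^(?P<timestamp>\d{8}_\d{6})_(?P<model>.+)_(?P<type>single|multi)"
    r"_(?P<start>\d+)_to_(?P<end>\d+)_(?P<guidance>wg|wo)\.json$"
)


def parse_result_filename(name):
    """Split a main results file name into its fields; raise ValueError if it does not match."""
    match = _RESULT_NAME.match(name)
    if match is None:
        raise ValueError(f"not a results file name: {name}")
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y%m%d_%H%M%S")
    return fields
```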
Key Metrics:
- Success Rate: Percentage of successfully completed tasks
- Average Steps: Mean number of actions per task
- Task Categories: Performance breakdown by task type
- Error Analysis: Common failure modes and patterns
Sample Results:
{
  "model_name": "qwen7b",
  "dataset_type": "single",
  "guidance": "wg",
  "total_scenarios": 800,
  "completed_scenarios": 782,
  "success_rate": 0.847,
  "average_steps": 8.3,
  "task_breakdown": {
    "direct_command": {"success_rate": 0.92, "count": 200},
    "attribute_reasoning": {"success_rate": 0.85, "count": 200},
    "tool_use": {"success_rate": 0.78, "count": 200},
    "compound_reasoning": {"success_rate": 0.73, "count": 200}
  }
}
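A results file in this shape is straightforward to post-process. The sketch below loads one and reports the completion rate along with the weakest and strongest task categories. It relies only on the field names visible in the sample; `load_results` and `summarize_results` are illustrative helpers, not part of the framework.

```python
import json


def load_results(path):
    """Read a main results JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_results(results):
    """Return completion rate plus the lowest- and highest-scoring task categories."""
    completion_rate = results["completed_scenarios"] / results["total_scenarios"]
    # Sort categories from lowest to highest success rate.
    ranked = sorted(
        results["task_breakdown"].items(),
        key=lambda item: item[1]["success_rate"],
    )
    return {
        "completion_rate": completion_rate,
        "weakest_category": ranked[0][0],
        "strongest_category": ranked[-1][0],
    }
```

A completion rate below 1.0 means some scenarios never finished, so error_logs/ is worth checking before comparing success rates across models.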
Common Configuration Options
You can customize evaluations by modifying configuration files:
Agent Behavior (configured via single_agent_config):
agent_config:
  max_history: 20 # Conversation history length
  max_steps_per_task: 35 # Maximum actions per task
  environment_description:
    detail_level: 'full' # full, basic, minimal
    show_object_properties: true # Include object attributes
    update_frequency: 0 # Update every step
Execution Control:
execution:
  max_total_steps: 400 # Total simulation steps
  timeout_seconds: 300 # Per-action timeout
parallel_evaluation:
  max_parallel_scenarios: 5 # Concurrent evaluations
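To illustrate what a concurrency cap like `max_parallel_scenarios` does, the sketch below runs scenarios through a bounded worker pool. It is a generic illustration using threads, which suit workloads that mostly wait on LLM responses. OmniEmbodied's actual scheduler may use processes or a different mechanism, and `evaluate` stands in for whatever function runs a single scenario.

```python
from concurrent.futures import ThreadPoolExecutor


def run_scenarios(scenario_ids, evaluate, max_parallel_scenarios=5):
    """Evaluate scenarios with at most `max_parallel_scenarios` in flight; return {id: result}."""
    ids = list(scenario_ids)
    with ThreadPoolExecutor(max_workers=max_parallel_scenarios) as pool:
        # pool.map yields results in input order, regardless of completion order.
        results = list(pool.map(evaluate, ids))
    return dict(zip(ids, results))
```

Each in-flight scenario holds an open request to the LLM service, so a higher cap increases throughput only until the server or API rate limit becomes the bottleneck.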
Task Filtering:
scenario_selection:
  task_filter:
    categories: # Select specific task types
      - "direct_command"
      - "attribute_reasoning"
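Conceptually, a category filter keeps only the scenarios whose task type appears in the configured list. The sketch below shows that selection logic. It assumes each scenario is a dict with a `category` key, which is a hypothetical data layout, and that an empty or missing filter means no filtering.

```python
def filter_scenarios(scenarios, categories=None):
    """Keep scenarios whose 'category' is in `categories`; keep all if no filter is given."""
    if not categories:
        return list(scenarios)
    allowed = set(categories)
    return [scenario for scenario in scenarios if scenario.get("category") in allowed]
```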
Troubleshooting
LLM Service Connection Issues:
# Test vLLM server connectivity
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"your-model","messages":[{"role":"user","content":"Hello"}]}'
Common Issues:
“Connection refused”: vLLM server not running or wrong port
“Model not found”: Check model name in configuration matches vLLM server
“Out of memory”: Reduce max_parallel_scenarios or use a smaller model
Slow evaluation: Check network latency to LLM service
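The first two issues above can often be told apart programmatically from the exception a request raises. The sketch below maps `urllib` errors to the corresponding hints. It assumes the server answers HTTP 404 when the requested model name is unknown, which should be verified against your deployment; `diagnose` is a hypothetical helper.

```python
import urllib.error


def diagnose(exc):
    """Map an exception raised by an LLM request to a troubleshooting hint."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 404:  # assumption: unknown model names produce a 404
            return "Model not found: check that the configured model name matches the served model"
        return f"Server returned HTTP {exc.code}"
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, ConnectionRefusedError):
            return "Connection refused: vLLM server not running or wrong port"
        return f"Could not reach server: {exc.reason}"
    if isinstance(exc, TimeoutError):
        return "Timed out: check network latency to the LLM service"
    return f"Unexpected error: {exc!r}"
```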
Debug Mode:
Enable detailed logging for troubleshooting:
logging:
  level: "DEBUG"
  show_llm_details: true
Next Steps
Once you’ve run your first evaluation:
Explore Results:
- Analyze performance by task category
- Compare different models and configurations
- Review trajectory logs for agent behavior insights
Advanced Usage:
- user_guide/evaluation_framework - Systematic evaluation strategies
- user_guide/configuration - Advanced configuration patterns
- Examples - More evaluation examples
Customize Evaluations:
- developer/extending - Create custom agents
- OmniSimulator API Reference - Use simulation API directly
- Task Types and Categories - Understand task categories
This quick start covers only what is needed to get an evaluation running. For a deeper understanding of the system components, see the detailed documentation sections.