Evaluation Metrics - The Reality Gap in Robotics

Assessing the Reality Gap

Sim-to-real Correlation Coefficient

A key question in sim-to-real transfer is whether performance improvements observed in simulation reliably translate into better performance in the real world. The Sim-to-real Correlation Coefficient (SRCC) addresses this by computing the Pearson correlation coefficient between the performance of a set of agents in simulation and the performance of the same agents in the real world. A high SRCC (close to 1) indicates that simulation is a good predictor of real-world performance.
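
As a minimal sketch of the SRCC, the snippet below computes the Pearson correlation between paired simulation and real-world scores. The function name and the example scores are illustrative assumptions, not values from any study; each index is assumed to refer to the same agent evaluated in both domains.

```python
import numpy as np

def sim_to_real_correlation(sim_scores, real_scores):
    """Pearson correlation between paired sim and real performance scores."""
    sim = np.asarray(sim_scores, dtype=float)
    real = np.asarray(real_scores, dtype=float)
    if sim.ndim != 1 or sim.shape != real.shape or sim.size < 2:
        raise ValueError("need two 1-D score lists of equal length (>= 2)")
    # Centre both score vectors, then normalise their covariance.
    sim_c = sim - sim.mean()
    real_c = real - real.mean()
    denom = np.sqrt((sim_c ** 2).sum() * (real_c ** 2).sum())
    if denom == 0.0:
        raise ValueError("correlation is undefined when all scores are identical")
    return float((sim_c * real_c).sum() / denom)

# Hypothetical success rates of five policies (illustrative numbers only).
sim_success = [0.90, 0.75, 0.60, 0.95, 0.40]
real_success = [0.70, 0.55, 0.50, 0.80, 0.20]
srcc = sim_to_real_correlation(sim_success, real_success)
print(f"SRCC = {srcc:.3f}")
```

Because the coefficient is computed over a set of agents, it needs several policies evaluated in both domains; with only a handful of policies the estimate can be noisy.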

Offline Replay Error

When direct real-world policy deployment is not feasible, offline replay error provides a practical alternative. It replays the actions recorded from a real-world trajectory in the simulator in an open-loop fashion and compares the resulting simulated state trajectory with the recorded real one. This metric is simple, low-cost, and provides a quick diagnostic of how well the simulation represents the target real-world environment.
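
The open-loop replay described above can be sketched as follows. Everything here is an assumption made for illustration: the `sim_step(state, action)` interface, the toy point-mass simulator, and the damped model that stands in for a real-world log (it is synthetic, not recorded data). The error is taken as the Euclidean distance between simulated and logged states at each step.

```python
import numpy as np

def offline_replay_error(sim_step, initial_state, real_actions, real_states):
    """Replay logged actions open-loop in the simulator and return the
    per-step Euclidean distance to the logged states.
    real_states[t] is the state logged after applying real_actions[t]."""
    state = np.asarray(initial_state, dtype=float)
    errors = []
    for action, real_next in zip(real_actions, real_states):
        # Open loop: the simulator never sees the logged state, only the action.
        state = sim_step(state, np.asarray(action, dtype=float))
        errors.append(np.linalg.norm(state - np.asarray(real_next, dtype=float)))
    return np.asarray(errors)

def toy_sim_step(state, action, dt=0.1):
    """Hypothetical simulator: frictionless 1-D point mass, state = [pos, vel]."""
    pos, vel = state
    vel = vel + action[0] * dt
    pos = pos + vel * dt
    return np.array([pos, vel])

def toy_real_step(state, action, dt=0.1, damping=0.9):
    """Synthetic stand-in for the real system: same model plus velocity damping."""
    pos, vel = state
    vel = damping * (vel + action[0] * dt)
    pos = pos + vel * dt
    return np.array([pos, vel])

# Build a synthetic "real" log by rolling out the damped model.
actions = [[1.0]] * 20
state = np.zeros(2)
real_states = []
for a in actions:
    state = toy_real_step(state, np.asarray(a))
    real_states.append(state)

errors = offline_replay_error(toy_sim_step, np.zeros(2), actions, real_states)
print("mean replay error:", errors.mean())
```

Since errors compound in open-loop replay, it can be more informative to inspect the whole error curve over time than a single averaged number.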

Visual Fidelity Analysis

Assessing the perceptual domain gap between simulated and real-world environments is particularly relevant for visuomotor policies. Metrics such as Inception Score (IS) and Fréchet Inception Distance (FID) operate at the level of image distributions, while the Structural Similarity Index (SSIM) compares individual image pairs; together they quantify visual similarity at both the distribution and single-image levels.
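
For the single-image case, a simplified SSIM can be sketched in a few lines. Note the assumption: this version computes the statistics once over the whole image, whereas the standard SSIM averages the same formula over local windows, so the values will differ from library implementations. The constants follow the usual choice of (0.01 L)^2 and (0.03 L)^2 for dynamic range L, and the images are random arrays used only for illustration.

```python
import numpy as np

def global_ssim(img_a, img_b, data_range=255.0):
    """Simplified SSIM using whole-image statistics (no sliding window)."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    c1 = (0.01 * data_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilises the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

# Illustrative data: a random "real" image and a noisy "simulated" copy.
rng = np.random.default_rng(0)
real_img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
sim_img = np.clip(real_img + rng.normal(0.0, 25.0, real_img.shape), 0.0, 255.0)

print("SSIM(real, real):", global_ssim(real_img, real_img))
print("SSIM(real, sim): ", global_ssim(real_img, sim_img))
```

A pairwise metric like this needs pixel-aligned sim and real images of the same scene; distribution-level metrics such as FID do not require that pairing.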

Assessing Sim-to-Real Transfer

Success Rate

Success rate measures the proportion of trials in which a policy successfully completes the intended task. It is a simple and reliable indicator of sim-to-real transfer effectiveness, but because each trial is scored only as a success or a failure, it does not reveal how or why a policy fails.
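
A minimal sketch of the computation, assuming each trial is logged as a boolean outcome. The trial counts are made up for illustration, and the difference between the two rates is just one simple way to summarise how much performance is lost in transfer.

```python
def success_rate(outcomes):
    """Fraction of trials that ended in success (outcomes are truthy/falsy)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("at least one trial is required")
    return sum(bool(o) for o in outcomes) / len(outcomes)

# Hypothetical logs: 20 trials in simulation, 20 on the real robot.
sim_outcomes = [True] * 18 + [False] * 2
real_outcomes = [True] * 13 + [False] * 7

sim_rate = success_rate(sim_outcomes)
real_rate = success_rate(real_outcomes)
transfer_gap = sim_rate - real_rate  # positive: the policy does worse on hardware
print(f"sim={sim_rate:.2f} real={real_rate:.2f} gap={transfer_gap:.2f}")
```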

Cumulative Reward

For policies learned using reinforcement learning, the cumulative reward measures how well a policy achieves long-term objectives. It offers a more fine-grained view of performance than success rate, but it is only interpretable when the reward function is defined consistently across simulation and the real world.
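
As a sketch, the cumulative reward of one episode is the sum of its per-step rewards. The optional discount factor `gamma` is an assumption added here for generality; with the default of 1.0 the function returns the plain undiscounted sum.

```python
def cumulative_reward(rewards, gamma=1.0):
    """Return sum_t gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for r in reversed(list(rewards)):
        total = r + gamma * total
    return total

# Illustrative per-step rewards from a single episode.
episode_rewards = [1.0, 2.0, 3.0]
print(cumulative_reward(episode_rewards))             # undiscounted sum
print(cumulative_reward(episode_rewards, gamma=0.5))  # discounted return
```

Comparing this quantity between simulation and the real world only makes sense if the same reward function can actually be evaluated in both, which may require extra instrumentation on the real system.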

Task-specific Metrics

Most successful sim-to-real studies evaluate performance using task-specific criteria tailored to the target domain, such as path efficiency in navigation or object-distance-to-goal in manipulation. These metrics offer fine-grained insights, but because they differ from study to study, they can hinder fair comparison between methods.
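
The two examples named above can be sketched as follows. The definitions are assumptions, since studies define these metrics differently: path efficiency is taken here as a reference length divided by the length actually travelled (defaulting to the straight-line distance between start and end when no shortest-path length is supplied), and distance-to-goal as the Euclidean distance between the final object position and the goal.

```python
import numpy as np

def path_length(waypoints):
    """Total length of a piecewise-linear path through the waypoints."""
    pts = np.asarray(waypoints, dtype=float)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())

def path_efficiency(waypoints, reference_length=None):
    """Reference length divided by travelled length (1.0 = as short as the reference)."""
    pts = np.asarray(waypoints, dtype=float)
    travelled = path_length(pts)
    if travelled == 0.0:
        raise ValueError("path has zero length")
    if reference_length is None:
        # Assumption: fall back to the straight line from start to end.
        reference_length = float(np.linalg.norm(pts[-1] - pts[0]))
    return reference_length / travelled

def distance_to_goal(object_position, goal_position):
    """Euclidean distance between the final object position and the goal."""
    diff = np.asarray(object_position, dtype=float) - np.asarray(goal_position, dtype=float)
    return float(np.linalg.norm(diff))

# Illustrative trajectory: an L-shaped detour of length 7 for a 5-unit displacement.
route = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
print("path efficiency:", path_efficiency(route))
print("distance to goal:", distance_to_goal((0.5, 0.2, 0.0), (0.5, 0.0, 0.0)))
```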