Robotic Seminar: Sensor Modeling in Sim2Real

Sensor Modeling in Sim2Real


Advisors: Dr. Philip Arm, Dr. Filip Bjelonic

Introduction

Training robots in real-world environments poses potential dangers and
incurs significant expenses. Simulations offer a safer and more
cost-effective alternative, mitigating risks commonly associated with
real-world experiments. In scenarios where real-world data is limited or
challenging to collect, simulations provide a viable means of generating
the necessary data, thereby facilitating comprehensive training and
testing. Sim2Real refers to the process of transferring knowledge,
skills, or models developed in a simulated environment (Sim) to
real-world applications (Real). This approach bridges the gap between
theoretical models and practical, real-world utility. A primary
challenge in Sim2Real is the "reality gap," characterized by
discrepancies between simulated and real-world conditions. These
discrepancies may arise from differences in physics, sensor data,
lighting, and other environmental factors. To address the reality gap,
techniques such as Domain Randomization are employed. This method
introduces variability into simulations to better approximate the
unpredictability of the real world, thereby enhancing the robustness of
Sim2Real transfers [@gao2020domain_random]. Another approach involves
the use of high-fidelity simulations, which strive to enhance the
realism of simulated environments. Improvements in physics modeling,
material properties, lighting, and sensor behavior aim to more closely
replicate real-world
conditions [@tan2018sim; @du2021sim_param; @akella2023measure_discrepency; @byravan2023nerf2real; @singh2023acutors_feedback; @kadian2020sim_tune].
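
To make the idea of Domain Randomization more concrete, the sketch below
resamples physics and sensor parameters at the start of every training
episode so a policy never overfits to one fixed set of simulated
conditions. The `SimEnv` class and its parameter names are illustrative
assumptions, not the API of any of the cited works.

```python
import numpy as np

# Minimal domain-randomization sketch (illustrative only).
# `SimEnv` and its parameter names are hypothetical stand-ins for a simulator API.
class SimEnv:
    def __init__(self):
        self.params = {}

    def set_parameters(self, **params):
        self.params.update(params)

    def reset(self):
        # Reset the simulated scene using the current parameters.
        return np.zeros(12)  # placeholder observation


def randomize(env: SimEnv, rng: np.random.Generator) -> None:
    """Resample physics and sensor parameters before each episode."""
    env.set_parameters(
        friction=rng.uniform(0.5, 1.2),          # ground friction coefficient
        mass_scale=rng.uniform(0.8, 1.2),        # +/-20% body-mass perturbation
        motor_delay=rng.integers(0, 4),          # actuation latency in sim steps
        camera_noise_std=rng.uniform(0.0, 0.02), # additive RGB pixel noise
        light_intensity=rng.uniform(0.6, 1.4),   # lighting variation
    )


rng = np.random.default_rng(0)
env = SimEnv()
for episode in range(3):
    randomize(env, rng)   # a new random "world" every episode
    obs = env.reset()
```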

Developing models that exhibit generalizability and robustness is also
crucial. This includes implementing continuous learning strategies and
improving model architectures to ensure effective Sim2Real
applications [@abey2023human].

Shifting our focus from general Sim2Real challenges to one pivotal
aspect, sensor modeling, this paper examines how sensors such as RGB
and depth cameras serve as vital tools for perceiving the environment
and the robot’s state, rather than for physical interaction. The precise
replication of sensor data in simulation is crucial for successful
Sim2Real applications. This paper not only explores the complexities of
accurately modeling these sensors within the Sim2Real framework but also
reviews recent research efforts in this domain. We aim to elucidate the
impact of sensor model accuracy on the efficacy of Sim2Real transfer and
to discuss contemporary strategies developed by researchers to enhance
sensor model fidelity for real-world applications.

The structure of this article is as follows:

  • Sensor Modeling in Sim2Real: The second section provides an
    extensive review of various sensor models used in Sim2Real. It
    encompasses a range of sensor types, including RGB cameras, depth
    sensors, Inertial Measurement Units (IMUs), force sensors, and
    encoders. This section aims to dissect the methodologies,
    advantages, and limitations associated with each sensor type,
    offering insights into their practical applications in Sim2Real
    scenarios.

  • Challenges and Future Directions: The third section is dedicated
    to summarizing the prevailing challenges in sensor modeling within
    the Sim2Real context. It also outlines potential future research
    directions, aiming to address these challenges and enhance the
    efficacy of sensor models. This section aims to stimulate further
    research and development in the field, paving the way for more
    sophisticated and reliable Sim2Real transitions.

Review of Literature

RGB camera

RGB cameras, notable for their accessibility and affordability, offer a
high-density information format through pixels. The standard approach,
as Zhu et al. [@zhu2018reinforcement] illustrate, employs a
Convolutional Neural Network (CNN) to encode the RGB stream
from the camera, followed by a Long Short-Term Memory
(LSTM) network to model the signal sequence.
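
A minimal PyTorch sketch of this CNN-encoder-plus-LSTM pattern is given
below; the layer sizes and the `RGBPolicyEncoder` name are illustrative
assumptions rather than the exact architecture used by Zhu et al.

```python
import torch
import torch.nn as nn

class RGBPolicyEncoder(nn.Module):
    """Illustrative CNN + LSTM encoder for a stream of RGB frames."""

    def __init__(self, feature_dim=256, hidden_dim=128):
        super().__init__()
        # CNN encodes each RGB frame into a compact feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feature_dim), nn.ReLU(),
        )
        # LSTM models the temporal sequence of frame features.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.view(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.lstm(feats)      # (batch, time, hidden_dim)
        return out[:, -1]              # summary of the latest frame


encoder = RGBPolicyEncoder()
dummy = torch.zeros(2, 8, 3, 84, 84)   # 2 sequences of 8 RGB frames
state = encoder(dummy)                  # -> shape (2, 128)
```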

Inspired by the impressive capabilities of Neural Radiance Fields
(NeRF), Byravan et al. [@byravan2023nerf2real] utilized NeRF to
construct simulation environments from short mobile phone videos.
Although capturing a 4-5 minute video is easy, the overall process is
resource-intensive, involving lengthy preprocessing with COLMAP[1]
(3-4 hours) and NeRF training (20 minutes on 8 V100 GPUs).
Additionally, it demands manual calibration of the NeRF mesh against
the real world and struggles with dynamic objects, which are instead
incorporated using the MuJoCo simulation environment.
Figure [fig:nerf] illustrates the whole pipeline of this approach in
detail.

Figure [fig:nerf]: Overview of the NeRF2Real pipeline [@byravan2023nerf2real].
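
To make the pipeline more concrete, the sketch below strings together
the three stages described above: COLMAP pose estimation, NeRF training
and mesh export, and a MuJoCo scene that mixes the static NeRF mesh
with dynamic bodies. Only the COLMAP command line, the MJCF format, and
the `mujoco` Python calls are real interfaces; the NeRF training step
is deliberately left as hypothetical comments, and file names such as
`scene.obj` are assumptions for illustration.

```python
import subprocess
import mujoco  # official MuJoCo Python bindings

# 1) Recover camera poses from the phone-video frames with COLMAP
#    (real CLI; this is the 3-4 hour preprocessing step).
def run_colmap(workspace: str, frames: str) -> None:
    subprocess.run(
        ["colmap", "automatic_reconstructor",
         "--workspace_path", workspace,
         "--image_path", frames],
        check=True,
    )

# 2) NeRF training and mesh export are project-specific; the paper reports
#    ~20 minutes on 8 V100 GPUs. They are left as comments because no single
#    public API covers them:
#       nerf = train_nerf(workspace)            # hypothetical
#       export_mesh(nerf, "scene.obj")          # hypothetical

# 3) The exported mesh becomes a static asset in an MJCF scene, while
#    dynamic objects (e.g., a ball) are ordinary MuJoCo bodies.
SCENE_XML = """
<mujoco>
  <asset>
    <mesh name="nerf_scene" file="scene.obj"/>
  </asset>
  <worldbody>
    <geom type="mesh" mesh="nerf_scene"/>           <!-- static NeRF geometry -->
    <body name="ball" pos="0 0 1">                  <!-- dynamic MuJoCo object -->
      <freejoint/>
      <geom type="sphere" size="0.1" mass="0.2"/>
    </body>
  </worldbody>
</mujoco>
"""

if __name__ == "__main__":
    run_colmap("workspace", "workspace/frames")
    model = mujoco.MjModel.from_xml_string(SCENE_XML)  # needs scene.obj on disk
    data = mujoco.MjData(model)
    mujoco.mj_step(model, data)
```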

Expanding upon the Neural Radiance Fields (NeRF) concept,
Yang et al. [@yang2023unisim] introduced an innovative approach using voxel
rendering, as opposed to traditional mesh rendering, to enhance memory
efficiency and facilitate the handling of dynamic objects. In this
methodology, a hypernetwork is employed to generate voxel-based
representations for each dynamic actor within the simulated environment.
This technique effectively manages the complexities of dynamic scene
rendering, balancing detail and computational load.

To provide a clearer depiction of this technique, Figure 1
illustrates the voxel rendering process. As shown, the 3D scene is
bifurcated into a static background (grey) and a set of dynamic actors
(red). The static scene is represented through a sparse feature-grid,
while the hypernetwork dynamically generates the voxel representation
for each actor. This representation leverages a learnable latent space,
and the scene is subsequently brought to life via neural rendering. This
innovative approach, as visualized in the figure, underscores the
efficiency and adaptability of voxel rendering in complex, dynamic
environments.
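
As a rough sketch of the idea (not the paper's actual architecture),
the snippet below shows a hypernetwork that maps each actor's latent
code to a small voxel grid of features, alongside a directly optimized
background grid standing in for the sparse feature-grid; all names,
sizes, and the omitted renderer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VoxelHyperNetwork(nn.Module):
    """Maps a per-actor latent code to a small voxel grid of features."""

    def __init__(self, latent_dim=32, feat_dim=8, grid=16):
        super().__init__()
        self.grid, self.feat_dim = grid, feat_dim
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim * grid ** 3),
        )

    def forward(self, latent):                 # latent: (num_actors, latent_dim)
        n = latent.shape[0]
        voxels = self.net(latent)
        return voxels.view(n, self.feat_dim, self.grid, self.grid, self.grid)


# Static background: a dense stand-in for the sparse feature-grid, optimized
# directly rather than generated per frame.
background = nn.Parameter(torch.zeros(1, 8, 64, 64, 64))

hyper = VoxelHyperNetwork()
actor_latents = torch.randn(3, 32)       # one learnable latent per dynamic actor
actor_voxels = hyper(actor_latents)      # -> (3, 8, 16, 16, 16)

# A real system would composite the actor voxels into the scene at their poses
# and run volumetric neural rendering; here we only check that shapes line up.
print(background.shape, actor_voxels.shape)
```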

Figure 1: Overview of the approach in [@yang2023unisim]. The 3D scene
is divided into a static background (grey) and a set of dynamic actors
(red). The static scene is modeled with a sparse feature grid, and a
hypernetwork is utilized to generate the representation of each actor
from a learnable latent. The neural feature description is then
produced through neural rendering.

