A real-world dataset for transparent object detection, segmentation, and 3D reconstruction in human-robot interaction.
Shaken, Not Stirred: We present a real-world dataset for transparent object perception, featuring 7,850 images from 100 cluttered scenes with six types of glasses, captured by five cameras on the NICOL humanoid robot. Our automated pipeline generates segmentation masks and depth ground truth with minimal human effort. The dataset enables robust training and benchmarking for glass detection, classification, and manipulation, and supports research in human-robot interaction.
Our baseline model outperforms state-of-the-art open-vocabulary detectors and achieves an 81% success rate in a real-world robot bartender task.
100 scenes with a mix of transparent and non-transparent objects were captured on a 2 m × 1 m tabletop in front of the NICOL robot. Each scene is scanned three times: (1) with clean glasses, (2) with 3D-printed green caps for height measurement, and (3) with identical glasses sprayed with chalk for ground-truth depth and segmentation.
Five cameras (three RGB-D RealSense, two 4K fisheye RGB) provide multi-view data. Each scene has 25 robot head views, resulting in 7,850 images for training/validation and 150 manually labeled test images.
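Depth from the RGB-D cameras can be lifted into 3D with a standard pinhole back-projection. The sketch below is illustrative only: the intrinsics (`fx`, `fy`, `cx`, `cy`) and the millimetre depth scale are assumed placeholder values, not the dataset's actual calibration.

```python
# Hypothetical sketch: back-project a RealSense-style depth image into a
# 3D point cloud using the pinhole camera model. Intrinsics here are
# illustrative values, not the dataset's real calibration.
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Convert an (H, W) uint16 depth image (millimetres) to an (N, 3)
    point cloud in metres, dropping invalid (zero-depth) pixels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32) * depth_scale
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    return points[valid]

# Minimal usage: a single valid pixel at the principal point, 1 m away.
depth = np.zeros((480, 640), dtype=np.uint16)
depth[240, 320] = 1000
pts = depth_to_pointcloud(depth, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
# pts is a single point at approximately (0, 0, 1)
```

In practice each of the 25 head views would be transformed into a common robot frame before merging, using the known head kinematics.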
Applications: Transparent object detection, segmentation, depth estimation, robotic grasping, and HRI.
Our pipeline uses depth sensing, color verification, and object detection to create accurate segmentation masks and bounding boxes. Depth images are converted to 3D point clouds, objects are detected and filtered by height and color, and final candidates are verified with YOLO-World and segmented with the Segment Anything Model (SAM). All annotations are created automatically, minimizing human labor.
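The height-filtering step of the pipeline can be sketched as follows. This is a minimal illustration under assumed thresholds; the actual pipeline's parameters, clustering method, and the YOLO-World/SAM verification stages are not reproduced here.

```python
# Hedged sketch of the height-based candidate filtering described above.
# The height range (5-30 cm) is an illustrative assumption for typical
# glasses, not the authors' actual parameters.
import numpy as np

def filter_candidates_by_height(clusters, min_height=0.05, max_height=0.30):
    """Keep point-cloud clusters whose vertical extent above the table
    plane (z range, in metres) is plausible for a glass."""
    kept = []
    for pts in clusters:
        height = pts[:, 2].max() - pts[:, 2].min()
        if min_height <= height <= max_height:
            kept.append(pts)
    return kept

# Toy clusters: a ~12 cm glass-like cluster and a ~1 cm flat artifact.
glass_like = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.12]])
flat_noise = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.01]])
survivors = filter_candidates_by_height([glass_like, flat_noise])
# only the glass-like cluster survives
```

Surviving candidates would then be cropped from the RGB views, verified with an open-vocabulary detector, and passed to SAM for mask generation, as the paragraph above describes.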
Our dataset was collected in a real-world human-robot interaction scenario, where the NICOL humanoid robot acts as a bartender, perceiving and manipulating glasses on a cluttered tabletop. The robot uses multi-view perception and our auto-labeling pipeline to detect, segment, and interact with transparent objects.
Watch the scenario video below
If you use this dataset, please cite our paper:
@INPROCEEDINGS{11246715,
author={Gajdošech, Lukáš and Ali, Hassan and Habekost, Jan-Gerrit and Madaras, Martin and Kerzel, Matthias and Wermter, Stefan},
booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks},
year={2025},
pages={20516-20523},
keywords={Visualization;Three-dimensional displays;Robot vision systems;Pipelines;Proprioception;Glass;Detectors;Cameras;Planning;Sensors},
doi={10.1109/IROS60139.2025.11246715}}