A real-world dataset for transparent object detection, segmentation, and 3D reconstruction in human-robot interaction.
Shaken, Not Stirred: We present a real-world dataset for transparent object perception, featuring 7,850 images from 100 cluttered scenes with six types of glasses, captured by five cameras on the NICOL humanoid robot. Our automated pipeline generates segmentation masks and depth ground truth with minimal human effort. The dataset enables robust training and benchmarking for glass detection, classification, and manipulation, and supports research in human-robot interaction.
Our baseline model outperforms state-of-the-art open-vocabulary detectors and achieves an 81% success rate in a real-world robot bartender task.
100 scenes with a mix of transparent and non-transparent objects were captured on a 2 m × 1 m tabletop in front of the NICOL robot. Each scene is scanned three times: (1) with clean glasses, (2) with 3D-printed green caps for height measurement, and (3) with identical glasses sprayed with chalk for ground-truth depth and segmentation.
Five cameras (three RGB-D RealSense, two 4K fisheye RGB) provide multi-view data. Each scene has 25 robot head views, resulting in 7,850 images for training/validation and 150 manually labeled test images.
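Depth from the RGB-D cameras can be lifted into 3D with a standard pinhole back-projection. The sketch below is illustrative only: the intrinsics (`fx`, `fy`, `cx`, `cy`) and the millimetre depth scale are assumed placeholder values, not the dataset's actual calibration.

```python
# Hypothetical sketch: back-project a RealSense-style depth image into a
# 3D point cloud using the pinhole camera model. Intrinsics here are
# illustrative values, not the dataset's real calibration.
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Convert an (H, W) uint16 depth image (millimetres) to an (N, 3)
    point cloud in metres, dropping invalid (zero-depth) pixels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32) * depth_scale
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    return points[valid]

# Minimal usage: a single valid pixel at the principal point, 1 m away.
depth = np.zeros((480, 640), dtype=np.uint16)
depth[240, 320] = 1000
pts = depth_to_pointcloud(depth, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
# pts is a single point at approximately (0, 0, 1)
```

In practice each of the 25 head views would be transformed into a common robot frame before merging, using the known head kinematics.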
Applications: Transparent object detection, segmentation, depth estimation, robotic grasping, and HRI.
Our pipeline uses depth sensing, color verification, and object detection to create accurate segmentation masks and bounding boxes. Depth images are converted to 3D point clouds, objects are detected and filtered by height and color, and final candidates are verified with YOLO-World and segmented with the Segment Anything Model (SAM). All annotations are created automatically, minimizing human labor.
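The height-filtering step of the pipeline can be sketched as follows. This is a minimal illustration under assumed thresholds; the actual pipeline's parameters, clustering method, and the YOLO-World/SAM verification stages are not reproduced here.

```python
# Hedged sketch of the height-based candidate filtering described above.
# The height range (5-30 cm) is an illustrative assumption for typical
# glasses, not the authors' actual parameters.
import numpy as np

def filter_candidates_by_height(clusters, min_height=0.05, max_height=0.30):
    """Keep point-cloud clusters whose vertical extent above the table
    plane (z range, in metres) is plausible for a glass."""
    kept = []
    for pts in clusters:
        height = pts[:, 2].max() - pts[:, 2].min()
        if min_height <= height <= max_height:
            kept.append(pts)
    return kept

# Toy clusters: a ~12 cm glass-like cluster and a ~1 cm flat artifact.
glass_like = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.12]])
flat_noise = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.01]])
survivors = filter_candidates_by_height([glass_like, flat_noise])
# only the glass-like cluster survives
```

Surviving candidates would then be cropped from the RGB views, verified with an open-vocabulary detector, and passed to SAM for mask generation, as the paragraph above describes.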
Our dataset was collected in a real-world human-robot interaction scenario, where the NICOL humanoid robot acts as a bartender, perceiving and manipulating glasses on a cluttered tabletop. The robot uses multi-view perception and our auto-labeling pipeline to detect, segment, and interact with transparent objects.
Watch the scenario video below
If you use this dataset, please cite our paper:
@INPROCEEDINGS{11246715,
author={Gajdošech, Lukáš and Ali, Hassan and Habekost, Jan-Gerrit and Madaras, Martin and Kerzel, Matthias and Wermter, Stefan},
booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks},
year={2025},
pages={20516-20523},
keywords={Visualization;Three-dimensional displays;Robot vision systems;Pipelines;Proprioception;Glass;Detectors;Cameras;Planning;Sensors},
doi={10.1109/IROS60139.2025.11246715}}