Andy Zeng

I am a research scientist at Robotics at Google AI. Before that, I received a PhD in Computer Science from Princeton and a BA in Math and Computer Science from UC Berkeley. My research focuses on vision for robotics. In particular, I am interested in developing learning algorithms that enable machines to interact intelligently with the physical world and improve themselves over time.

Github  •  G. Scholar  •  LinkedIn  •  Twitter
Email: andyzeng at google dot com




Spatial Action Maps for Mobile Manipulation

Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Johnny Lee, Szymon Rusinkiewicz, Thomas Funkhouser
Robotics: Science and Systems (RSS) 2020
Webpage  •   PDF

Typical end-to-end formulations for learning robotic navigation involve predicting a small set of steering command actions (forward, left, etc.) from images of the current state (e.g. a bird's-eye view of a SLAM reconstruction). Instead, we show that it can be advantageous to predict dense action representations defined in the same domain as the state. We present Spatial Action Maps, in which the set of possible actions is represented by a pixel map (aligned with the input image of the current state), where each pixel represents a navigational endpoint at the corresponding scene location. Action predictions are thereby spatially anchored on local visual features in the scene, enabling significantly faster learning of complex behaviors for mobile manipulation tasks with reinforcement learning.
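As a rough illustration of the idea, and not the authors' implementation, the sketch below picks the highest-valued pixel in a small action map and converts it to a scene location. The function name, the robot-centered map convention, and the `meters_per_pixel` parameter are assumptions made for this example.

```python
from typing import List, Tuple


def select_endpoint(
    action_map: List[List[float]],
    meters_per_pixel: float,
) -> Tuple[Tuple[int, int], Tuple[float, float]]:
    """Pick the best pixel in a dense action map and convert it to metric offsets.

    Assumes (hypothetically) that the map is centered on the robot, so the
    returned (dx, dy) offset is measured from the center pixel.
    """
    height = len(action_map)
    width = len(action_map[0])

    # Find the pixel with the highest predicted value.
    best_row, best_col = 0, 0
    for r in range(height):
        for c in range(width):
            if action_map[r][c] > action_map[best_row][best_col]:
                best_row, best_col = r, c

    # Convert the pixel index to an offset from the map center, in meters.
    center_row = (height - 1) / 2.0
    center_col = (width - 1) / 2.0
    dx = (best_col - center_col) * meters_per_pixel
    dy = (best_row - center_row) * meters_per_pixel
    return (best_row, best_col), (dx, dy)


# Toy 3x3 map standing in for a network's per-pixel value predictions.
toy_map = [
    [0.1, 0.2, 0.1],
    [0.0, 0.3, 0.9],
    [0.2, 0.1, 0.4],
]
pixel, offset = select_endpoint(toy_map, meters_per_pixel=0.5)
```

Because the action is a pixel in the same image space as the observation, the selected endpoint can be read directly off the map without a separate action decoder.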

Grasping in the Wild: Learning 6DoF Closed-Loop Grasping from Low-Cost Demonstrations

Shuran Song, Andy Zeng, Johnny Lee, Thomas Funkhouser
IEEE International Conference on Intelligent Robots and Systems (IROS) 2020
IEEE Robotics and Automation Letters (RA-L) 2020
Webpage  •   PDF

We present a new low-cost hardware interface for collecting grasping demonstrations by people in diverse environments. With this data, we show that it is possible to train an end-to-end 6DoF closed-loop grasping model with reinforcement learning that transfers to real robots. A key aspect of our model is that it uses “action-view”-based rendering to simulate future states with respect to different possible actions. By evaluating these states using a learned value function, our method is able to better select corresponding actions that maximize total rewards (i.e., grasping success). Our system is able to achieve reliable 6DoF closed-loop grasping of novel objects across various scene configurations, as well as dynamic scenes with moving objects.

Form2Fit: Learning Shape Priors for Generalizable Assembly from Disassembly

Kevin Zakka, Andy Zeng, Johnny Lee, Shuran Song
IEEE International Conference on Robotics and Automation (ICRA) 2020
★ Best Paper in Automation Award Finalist, ICRA ★
Webpage  •   PDF  •   Code  •   Google AI Blog  •   VentureBeat  •   2 Minute Papers

Is it possible to learn policies for robotic assembly that can generalize to new objects? We propose to formulate the kit assembly task as a shape matching problem, where the goal is to learn a shape descriptor that establishes geometric correspondences between object surfaces and their target placement locations from visual input. This formulation enables the model to acquire a broader understanding of how shapes and surfaces fit together for assembly — allowing it to generalize to new objects and kits. To obtain training data, we present a self-supervised data-collection pipeline that learns assembly from disassembly.

ClearGrasp: 3D Shape Estimation of Transparent Objects for Manipulation

Shreeyak Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, Shuran Song
IEEE International Conference on Robotics and Automation (ICRA) 2020
Webpage  •   PDF  •   Code  •   Dataset  •   Google AI Blog  •   VentureBeat

Transparent objects are a common part of everyday life, yet their unique visual properties make it very difficult for standard 3D sensors to produce accurate depth estimates for them. We present a deep learning approach, trained on large-scale synthetic data, that estimates accurate 3D geometry of transparent objects from a single RGB-D image. Our experiments demonstrate that ClearGrasp is substantially better than monocular depth estimation baselines and is capable of generalizing to real-world images and novel objects. We also show that ClearGrasp can be applied out-of-the-box to improve robotic grasping.

Learning to See before Learning to Act: Visual Pre-training for Manipulation

Lin Yen-Chen, Andy Zeng, Shuran Song, Phillip Isola, Tsung-Yi Lin
IEEE International Conference on Robotics and Automation (ICRA) 2020
Webpage  •   PDF  •   Code  •   Google AI Blog  •   VentureBeat

Does having visual priors (e.g. the ability to detect objects) facilitate learning vision-based manipulation (e.g. picking up objects)? We study this in the context of transfer learning, where a convolutional network is first trained on a passive vision task, then adapted to perform an active manipulation task. We find that pre-training on vision tasks can improve the generalization and sample efficiency of models used for learning manipulation affordances. This makes it possible to learn robotic grasping from just 10 minutes of trial-and-error experience.

TossingBot: Learning to Throw Arbitrary Objects with Residual Physics

Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodriguez, Thomas Funkhouser
Robotics: Science and Systems (RSS) 2019
IEEE Transactions on Robotics (T-RO) 2020
Featured on the front page of The New York Times Business!
★ Best Systems Paper Award, RSS ★
★ Best Student Paper Award Finalist, RSS ★
Webpage  •   PDF  •   Google AI Blog  •   New York Times  •   IEEE Spectrum

Throwing is an excellent means of exploiting dynamics to increase the capabilities of a manipulator. In the case of pick-and-place, throwing can enable a robot arm to rapidly place objects into selected boxes outside its maximum kinematic range — improving its physical reachability and picking speed. We propose an end-to-end formulation (vision to actions) where we investigate the synergies between grasping and throwing (i.e., learning grasps that enable more accurate throws) and between simulation and deep learning (i.e., inferring residuals on top of control parameters predicted by a physics simulator). The resulting system is able to grasp and throw arbitrary objects into boxes located outside its maximum range at 500+ mean picks per hour and generalizes to new objects and target locations.
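The "residual on top of physics" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the TossingBot code: a drag-free ballistic formula stands in for the physics-based estimate, and a constant `residual` stands in for the correction that a learned model would predict. All function and parameter names are hypothetical.

```python
import math

GRAVITY = 9.81  # m/s^2


def ballistic_release_speed(distance: float, height_diff: float, angle_rad: float) -> float:
    """Release speed for a drag-free projectile to land `distance` meters away.

    `height_diff` is the target height minus the release height, in meters.
    Derived from x = v*cos(a)*t and y = v*sin(a)*t - g*t^2/2.
    """
    denom = 2.0 * math.cos(angle_rad) ** 2 * (distance * math.tan(angle_rad) - height_diff)
    if denom <= 0.0:
        raise ValueError("target is unreachable at this release angle")
    return math.sqrt(GRAVITY * distance ** 2 / denom)


def release_speed_with_residual(
    distance: float, height_diff: float, angle_rad: float, residual: float
) -> float:
    """Physics estimate plus a correction term (a placeholder for a learned value)."""
    return ballistic_release_speed(distance, height_diff, angle_rad) + residual


# Throw 1 m forward to a target at the release height, with a 45-degree release angle.
physics_speed = ballistic_release_speed(1.0, 0.0, math.radians(45))
corrected_speed = release_speed_with_residual(1.0, 0.0, math.radians(45), residual=0.2)
```

The appeal of this split is that the analytic term already gives a reasonable first guess, so the learned part only has to account for effects the simple model ignores, such as object shape and how the object was grasped.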

DensePhysNet: Learning Dense Physical Object Representations via Multi-step Dynamic Interactions

Zhenjia Xu, Jiajun Wu, Andy Zeng, Joshua B. Tenenbaum, Shuran Song
Robotics: Science and Systems (RSS) 2019
Webpage  •   PDF  •   Code

Through vision and interaction, can robots discover the physical properties of objects? In this work, we propose DensePhysNet, a system that actively executes a sequence of dynamic interactions (e.g. sliding and colliding), and uses a deep predictive model over its visual observations to learn dense pixel-wise representations that reflect the physical properties of observed objects. Our experiments in both simulation and real settings demonstrate that the learned representations carry rich physical information, and can be used for more accurate and efficient manipulation in downstream tasks than state-of-the-art alternatives.

Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning

Andy Zeng, Shuran Song, Stefan Welker, Johnny Lee, Alberto Rodriguez, Thomas Funkhouser
IEEE International Conference on Intelligent Robots and Systems (IROS) 2018
★ Best Cognitive Robotics Paper Award Finalist, IROS ★
Webpage  •   PDF  •   Code  •   2 Minute Papers

Skilled robotic manipulation benefits from complex synergies between non-prehensile (e.g. pushing) and prehensile (e.g. grasping) actions: pushing can help rearrange cluttered objects to make space for arms and fingers; likewise, grasping can help displace objects to make pushing movements more precise and collision-free. In this work, we demonstrate that it is possible to discover and learn these synergies from scratch by combining visual affordance-based manipulation with model-free deep reinforcement learning. Our method is sample efficient and generalizes to novel objects and scenarios.

Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching

Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R. Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, Alberto Rodriguez
IEEE International Conference on Robotics and Automation (ICRA) 2018
The International Journal of Robotics Research (IJRR) 2019
★ Best Systems Paper Award, Amazon Robotics ★
★ 1st Place (Stow Task), Amazon Robotics Challenge 2017 ★
Webpage  •   PDF  •   Code  •   Journal (IJRR)  •   MIT News  •   Engadget

We built a robo-picker that can grasp and recognize novel objects (appearing for the first time during testing) in cluttered environments without needing any additional data collection or re-training. It achieves this with pixel-wise visual affordance-based grasping and one-shot learning to recognize objects using only product images (e.g. from the web). The approach was part of the MIT-Princeton system that took 1st place (stow task) at the 2017 Amazon Robotics Challenge.

Im2Pano3D: Extrapolating 360° Structure and Semantics Beyond the Field of View

Shuran Song, Andy Zeng, Angel X. Chang, Manolis Savva, Silvio Savarese, Thomas Funkhouser
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018
★ Oral Presentation, CVPR ★
Webpage  •   PDF

We explore the limits of leveraging strong contextual priors learned from large-scale synthetic and real-world indoor scenes. To this end, we trained a network that generates a dense prediction of 3D structure and a probability distribution of semantic labels for a full 360° panoramic view of an indoor scene when given only a partial observation in the form of an RGB-D image (i.e., infers what's behind you).

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, Yinda Zhang
IEEE International Conference on 3D Vision (3DV) 2017
Webpage  •   PDF  •   Code  •   Matterport Blog

We introduce Matterport3D, a large-scale RGB-D dataset with 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of self-supervised tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and scene classification.

3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions

Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, Thomas Funkhouser
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
★ Oral Presentation, CVPR ★
Webpage  •   PDF  •   Code  •   Talk  •   2 Minute Papers

We present a data-driven model that learns a local 3D shape descriptor for establishing correspondences between partial and noisy 3D/RGB-D data. To amass training data for our model, we propose an unsupervised feature learning method that leverages the millions of correspondence labels found in existing RGB-D reconstructions. Our learned descriptor is not only able to match local geometry in new scenes for reconstruction, but also generalizes to different tasks and spatial scales (e.g. instance-level object model alignment for the Amazon Picking Challenge, and mesh surface correspondence).
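Once local descriptors are available, correspondences are usually found by comparing them across two sets of points. The sketch below shows that matching step only, assuming L2 nearest-neighbor search with a distance cutoff. The function names, the toy 2-D descriptors, and the `max_distance` threshold are illustrative assumptions and not the 3DMatch code.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_descriptors(
    source: List[Sequence[float]],
    target: List[Sequence[float]],
    max_distance: float,
) -> List[Tuple[int, int]]:
    """Match each source descriptor to its nearest target descriptor.

    A pair (i, j) is kept only if the nearest distance is within `max_distance`,
    which discards points that have no plausible counterpart.
    """
    matches = []
    for i, desc in enumerate(source):
        distances = [l2_distance(desc, t) for t in target]
        j = min(range(len(target)), key=distances.__getitem__)
        if distances[j] <= max_distance:
            matches.append((i, j))
    return matches


# Toy descriptors: the third source point has no close counterpart in the target.
source_descs = [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]
target_descs = [[0.9, 0.1], [0.1, 1.1]]
matches = match_descriptors(source_descs, target_descs, max_distance=0.5)
```

In a registration pipeline, matches like these would typically be passed to a robust estimator to recover the rigid transform between the two fragments.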

Semantic Scene Completion from a Single Depth Image

Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, Thomas Funkhouser
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
★ Oral Presentation, CVPR ★
Webpage  •   PDF  •   SUNCG Dataset  •   Code  •   Talk  •   2 Minute Papers

We present an end-to-end model that can infer a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. To train our model, we construct SUNCG, a manually created large-scale dataset of synthetic 3D scenes with dense volumetric annotations.

Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge

Andy Zeng, Kuan-Ting Yu, Shuran Song, Daniel Suo, Ed Walker Jr., Alberto Rodriguez, Jianxiong Xiao
IEEE International Conference on Robotics and Automation (ICRA) 2017
★ 3rd Place, Amazon Picking Challenge 2016 ★
Webpage  •   PDF  •   Shelf & Tote Dataset  •   Code

We developed a vision system that can recognize objects and estimate their 6D poses under cluttered environments, partial data, sensor noise, multiple instances of the same object, and a large variety of object categories. Our approach leverages fully convolutional networks to segment and label multiple RGB-D views of a scene, then fits pre-scanned 3D object models to the resulting segmentation to estimate their poses. We also propose a scalable self-supervised method that leverages precise and repeatable robot motions to generate a large labeled dataset without tedious manual annotations. The approach was part of the MIT-Princeton system that took 3rd place at the 2016 Amazon Picking Challenge.