Andy Zeng

I am a research scientist at Google Brain working on machine learning, computer vision, and robotics.
My research focuses on robot vision, manipulation, and deep learning, to enable machines to intelligently interact with the physical world and improve themselves over time. These days, I am particularly interested in how robot learning can benefit from the knowledge stored across Internet-scale data. Formal bio here.

Github  •  G. Scholar  •  LinkedIn  •  Twitter
Email: andyzeng at google dot com




Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, Brian Ichter
Webpage  •   PDF

Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robotics. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, object recognition, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion across three domains, including simulated and real tabletop rearrangement tasks and long-horizon mobile manipulation tasks in a real kitchen environment.
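As a rough illustration (not the paper's implementation), the closed-loop idea can be sketched as a prompt that accumulates textual environment feedback between planning steps. The planner and success detector below are hypothetical toy stand-ins:

```python
def plan_with_inner_monologue(instruction, llm, get_feedback, max_steps=5):
    """Closed-loop planning sketch: textual feedback from the environment
    is appended to the prompt before each new planning step."""
    prompt = f"Task: {instruction}\n"
    history = []
    for _ in range(max_steps):
        action = llm(prompt)             # language model proposes the next skill
        if action == "done":
            break
        feedback = get_feedback(action)  # e.g. success detector, scene describer
        prompt += f"Robot: {action}\nFeedback: {feedback}\n"
        history.append((action, feedback))
    return history

# Toy stand-ins for an LLM planner and a success detector (purely illustrative).
def toy_llm(prompt):
    return "done" if "Feedback: success" in prompt else "pick up the apple"

def toy_feedback(action):
    return "success"

steps = plan_with_inner_monologue("put the apple away", toy_llm, toy_feedback)
```

The point of the sketch is only the control flow: the model's next decision is conditioned on feedback from its previous action.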

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence
Webpage  •   PDF  •   Code

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot, i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.
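A minimal sketch of the composition pattern (not the paper's code, and with hypothetical toy models standing in for a real VLM and LM): one model ranks candidate phrases against an image, and a language model turns the winners into text, with language as the sole interface between them.

```python
def socratic_caption(image, vlm_rank, lm_generate, candidates):
    """Compose a VLM and an LM zero-shot: the VLM scores candidate phrases
    against the image, and the LM rewrites the top phrases as a caption."""
    scored = sorted(candidates, key=lambda c: vlm_rank(image, c), reverse=True)
    prompt = "Describe a photo containing: " + ", ".join(scored[:2])
    return lm_generate(prompt)

# Hypothetical stand-ins: the "image" is a set of ground-truth tags, the
# "VLM" scores membership, and the "LM" just rewrites the prompt.
def toy_vlm(image, phrase):
    return 1.0 if phrase in image else 0.0

def toy_lm(prompt):
    return prompt.replace("Describe a photo containing: ", "A photo of ")

caption = socratic_caption({"dog", "park"}, toy_vlm, toy_lm, ["dog", "park", "car"])
```

The design choice worth noting: neither model is finetuned; information flows only through prompts.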

Learning to Fold Real Garments with One Arm: A Case Study in Cloud-Based Robotics Research

Ryan Hoque, Kaushik Shivakumar, Shrey Aeron, Gabriel Deza, Aditya Ganapathi, Adrian Wong, Johnny Lee, Andy Zeng, Vincent Vanhoucke, Ken Goldberg
IEEE International Conference on Intelligent Robots and Systems (IROS) 2022
Webpage  •   PDF  •   Code

Autonomous fabric manipulation is a longstanding challenge in robotics, but evaluating progress is difficult due to the cost and diversity of robot hardware. Using Reach, a new cloud robotics platform that enables low-latency remote execution of control policies on physical robots, we present the first systematic benchmarking of fabric manipulation algorithms on physical hardware. We develop 4 novel learning-based algorithms that model expert actions, keypoints, reward functions, and dynamic motions, and we compare these against 4 learning-free and inverse dynamics algorithms on the task of folding a crumpled T-shirt with a single robot arm. The entire lifecycle of data collection, model training, and policy evaluation is performed remotely without physical access to the robot workcell. Results suggest that a new algorithm combining imitation learning with analytic methods achieves 84% of human-level performance on the folding task.

Learning Pneumatic Non-Prehensile Manipulation with a Mobile Blower

Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Szymon Rusinkiewicz, Thomas Funkhouser
IEEE International Conference on Intelligent Robots and Systems (IROS) 2022
IEEE Robotics and Automation Letters (RA-L) 2022
Webpage  •   PDF  •   Code

We investigate pneumatic non-prehensile manipulation (i.e., blowing) as a means of efficiently moving scattered objects into a target receptacle. Due to the chaotic nature of aerodynamic forces, a blowing controller must (i) continually adapt to unexpected changes from its actions, (ii) maintain fine-grained control, since the slightest misstep can result in large unintended consequences (e.g., scattering objects already in a pile), and (iii) infer long-range plans (e.g., move the robot to strategic blowing locations). We tackle these challenges in the context of deep reinforcement learning, introducing a multi-frequency version of the spatial action maps framework. This allows for efficient learning of vision-based policies that effectively combine high-level planning and low-level closed-loop control for dynamic mobile manipulation. Experiments show that our system naturally encourages emergent specialization between the different subpolicies spanning low-level fine-grained control and high-level planning. On a real mobile robot equipped with a miniature air blower, we show that our simulation-trained policies transfer well to a real environment and can generalize to novel objects.

Multiscale Sensor Fusion and Continuous Control with Neural CDEs

Sumeet Singh, Francis McCann Ramirez, Jacob Varley, Andy Zeng, Vikas Sindhwani
IEEE International Conference on Intelligent Robots and Systems (IROS) 2022

Though robot learning is often formulated in terms of discrete-time Markov decision processes (MDPs), physical robots require near-continuous multiscale feedback control. Machines operate on multiple asynchronous sensing modalities, each with different frequencies, e.g., video frames at 30Hz, proprioceptive state at 100Hz, force-torque data at 500Hz, etc. While the classic approach is to batch observations into fixed-time windows then pass them through feed-forward encoders (e.g., with deep networks), we show that there exists a more elegant approach – one that treats policy learning as modeling latent state dynamics in continuous-time. Specifically, we present InFuser, a unified architecture that trains continuous-time policies with Neural Controlled Differential Equations (CDEs). InFuser evolves a single latent state representation over time by (In)tegrating and (Fus)ing multi-sensory observations (arriving at different frequencies), and inferring actions in continuous-time. This enables policies that can react to multi-frequency multi-sensory feedback for truly end-to-end visuomotor control, without discrete-time assumptions.
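A toy sketch of the CDE-style update (my own simplification, not InFuser itself): asynchronous sensor events are merged by timestamp, and a single latent state is driven by the increments of the observation path, dz = f(z) dx. In practice f would be a neural network; here it is a trivial lambda.

```python
def fuse_multirate(observations, f, z0=0.0):
    """Evolve one latent state z across asynchronous sensor events.
    `observations` is a list of (timestamp, value) pairs from any stream,
    possibly at very different rates; the update is controlled by the
    path increments dx rather than by fixed-size time windows."""
    events = sorted(observations)   # merge all streams by timestamp
    z = z0
    x_prev = events[0][1]
    for _, x in events[1:]:
        dx = x - x_prev             # increment of the control path
        z = z + f(z) * dx           # controlled (CDE-style) Euler update
        x_prev = x
    return z

# Toy dynamics: with f ≡ 1 the latent state just accumulates increments.
z = fuse_multirate([(0.0, 0.0), (1.0, 3.0), (0.5, 1.0)], f=lambda z: 1.0)
```

Note that out-of-order events from different sensors are handled simply by sorting before integration, which is the appeal of the continuous-time view.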

Implicit Kinematic Policies: Unifying Joint and Cartesian Action Spaces in End-to-End Robot Learning

Aditya Ganapathi, Pete Florence, Jake Varley, Kaylee Burns, Ken Goldberg, Andy Zeng
IEEE International Conference on Robotics and Automation (ICRA) 2022
Webpage  •   PDF

Action representation is an important yet often overlooked aspect in end-to-end robot learning with deep networks. Choosing one action space over another (e.g. target joint positions, or Cartesian end-effector poses) can result in surprisingly stark performance differences between various downstream tasks – and as a result, considerable research has been devoted to finding the right action space for a given application. However, in this work, we instead investigate how our models can discover and learn for themselves which action space to use. Leveraging recent work on implicit behavioral cloning, which takes both observations and actions as input, we demonstrate that it is possible to present the same action in multiple different spaces to the same policy -- allowing it to learn inductive patterns from each space. Specifically, we study the benefits of combining Cartesian and joint action spaces in the context of learning manipulation skills. To this end, we present Implicit Kinematic Policies (IKP), which incorporates the kinematic chain as a differentiable module within the deep network. Quantitative experiments across several simulated continuous control tasks—from scooping piles of small objects, to lifting boxes with elbows, to precise block insertion with miscalibrated robots—suggest IKP not only learns complex prehensile and non-prehensile manipulation from pixels better than baseline alternatives, but also can learn to compensate for small joint encoder offset errors.
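The kinematic-chain module can be illustrated with ordinary planar forward kinematics (a generic textbook example, not IKP's actual network component): the same mapping from joint angles to Cartesian end-effector pose that, embedded as a differentiable layer, lets one policy see both action spaces.

```python
import math

def forward_kinematics(joints, link_lengths=(1.0, 1.0)):
    """Planar 2-link forward kinematics: joint angles -> end-effector (x, y).
    Inside IKP an analogous (differentiable) mapping connects the joint and
    Cartesian action spaces within the network."""
    q1, q2 = joints
    l1, l2 = link_lengths
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return (x, y)

# First link along +x, second link bent 90 degrees: end effector at (1, 1).
ee = forward_kinematics((0.0, math.pi / 2))
```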

VIRDO: Visio-Tactile Implicit Representations of Deformable Objects

Youngsun Wi, Pete Florence, Andy Zeng, Nima Fazeli
IEEE International Conference on Robotics and Automation (ICRA) 2022

Deformable object manipulation requires computationally efficient representations that are compatible with robotic sensing modalities. In this paper, we present VIRDO: an implicit, multi-modal, and continuous representation for deformable-elastic objects. VIRDO operates directly on visual (point cloud) and tactile (reaction forces) modalities and learns rich latent embeddings of contact locations and forces to predict object deformations subject to external contacts. Here, we demonstrate VIRDO's ability to: i) produce high-fidelity cross-modal reconstructions with dense unsupervised correspondences, ii) generalize to unseen contact formations, and iii) perform state estimation with partial visio-tactile feedback.

Multi-Task Learning with Sequence-Conditioned Transporter Networks

Michael H. Lim, Andy Zeng, Brian Ichter, Maryam Bandari, Erwin Coumans, Claire Tomlin, Stefan Schaal, Aleksandra Faust
IEEE International Conference on Robotics and Automation (ICRA) 2022

Enabling robots to learn multiple vision-based manipulation tasks can lead to a wide range of industrial applications. While learning-based approaches enjoy flexibility and generalizability, scaling these approaches to solve compositional tasks remains a challenge. In this work, we present a framework for multi-task learning via sequence-conditioning and weighted sampling that outperforms learning on individual tasks. We also present MultiRavens, a new benchmark designed around compositional tasks.

Hybrid Random Features

Krzysztof Choromanski, Haoxian Chen, Han Lin, Yuanzhe Ma, Arijit Sehanobish, Deepali Jain, Michael S Ryoo, Jake Varley, Andy Zeng, Valerii Likhosherstov, Dmitry Kalashnikov, Vikas Sindhwani, Adrian Weller
The International Conference on Learning Representations (ICLR) 2022

We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide the most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric or positive random features. By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF mechanism provides strong theoretical guarantees - unbiased approximation and strictly smaller worst-case relative errors than its counterparts. Experiments with HRFs, ranging from pointwise kernel estimation to implicit-attention Transformers and downstream robotics, demonstrate their effectiveness in a wide spectrum of machine learning problems.

Implicit Behavioral Cloning

Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, Jonathan Tompson
Conference on Robot Learning (CoRL) 2021
Webpage  •   PDF  •   Code  •   Google AI Blog

Supervised robot policy learning with implicit models (states and actions as input, inference by sampling or gradient descent) generally performs better than learning with common explicit models (states as input, actions as output). We present extensive experiments on this finding, with theoretical arguments distinguishing the properties of implicit vs. explicit models, particularly when approximating complex, discontinuous and multi-valued (set-valued) functions. For robot learning, implicit behavioral cloning (IBC) policies with energy-based models (EBMs) often outperform common explicit (Mean Square Error, or Mixture Density) policies, including on tasks with high-dimensional action spaces and visual RGB image inputs. IBC also outperforms state-of-the-art offline reinforcement learning methods on the D4RL benchmark without rewards. In the real world, robots with implicit policies can learn complex and remarkably subtle behaviors on contact-rich tasks from human demonstrations, including tasks with high combinatorial complexity and tasks requiring 1mm precision.
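The implicit-vs-explicit distinction can be sketched in a few lines (a simplified sampling-based variant, with a hypothetical hand-written energy function): instead of regressing an action, the policy scores (state, action) pairs and returns the argmin-energy action.

```python
def implicit_policy(state, energy, candidate_actions):
    """Implicit inference: score (state, action) pairs with an energy model
    and return the lowest-energy action (derivative-free sampling variant)."""
    return min(candidate_actions, key=lambda a: energy(state, a))

# Toy multi-valued target: both -1 and +1 are equally valid actions. An
# explicit MSE policy would average them toward 0 (an invalid action),
# while the implicit policy commits to one mode.
energy_fn = lambda s, a: (abs(a) - 1.0) ** 2
best = implicit_policy(None, energy_fn, [-1.0, -0.5, 0.0, 0.5, 1.0])
```

This mode-seeking behavior on set-valued targets is exactly the property the abstract's theoretical arguments concern.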

XIRL: Cross-Embodiment Inverse Reinforcement Learning

Kevin Zakka, Andy Zeng, Pete Florence, Jonathan Tompson, Jeannette Bohg, Debidatta Dwibedi
Conference on Robot Learning (CoRL) 2021
★ Best Paper Award Finalist, CoRL ★
Webpage  •   PDF  •   Code  •   Benchmark  •   Google AI Blog

We investigate visual cross-embodiment imitation, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, strategies, etc. In this work, we show that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment demonstration videos that are robust to these differences. Specifically, we present a self-supervised method for Cross-Embodiment Inverse Reinforcement Learning (XIRL) that leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progression from offline videos of demonstrations across multiple expert agents, each performing the same task differently due to embodiment differences.
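Once such an embedding is learned, the reward construction itself is simple. The sketch below (with a hypothetical one-dimensional "embedding" standing in for the learned deep visual embedding) shows the idea: reward is the negative embedding distance to the goal frame, so it increases with task progress regardless of embodiment.

```python
def xirl_reward(embed, frame, goal_frame):
    """Reward = negative Euclidean distance between the current frame's
    embedding and the goal frame's embedding (task progress proxy)."""
    z, g = embed(frame), embed(goal_frame)
    return -sum((zi - gi) ** 2 for zi, gi in zip(z, g)) ** 0.5

# Hypothetical embedding: task progress already encoded as a scalar.
embed = lambda frame: [frame["progress"]]
r_far = xirl_reward(embed, {"progress": 0.2}, {"progress": 1.0})
r_near = xirl_reward(embed, {"progress": 0.9}, {"progress": 1.0})
```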

Spatial Intention Maps for Multi-Agent Mobile Manipulation

Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Szymon Rusinkiewicz, Thomas Funkhouser
IEEE International Conference on Robotics and Automation (ICRA) 2021
Webpage  •   PDF  •   Code  •   Princeton News

The ability to communicate intention enables decentralized multi-agent robots to collaborate while performing physical tasks. We present spatial intention maps, a new representation for multi-agent vision-based deep reinforcement learning that improves coordination between decentralized mobile manipulators. In this representation, each agent's intention is provided to other agents, and rendered into an overhead 2D map aligned with visual observations. This synergizes with the spatial action maps framework, in which state and action representations are spatially aligned, providing inductive biases that encourage emergent cooperative behaviors requiring spatial coordination, such as passing objects to each other or avoiding collisions. Experiments across a variety of multi-agent environments, including heterogeneous robot teams with different abilities (lifting, pushing, or throwing), show that incorporating spatial intention maps improves performance for different mobile manipulation tasks while significantly enhancing cooperative behaviors.

Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks

Daniel Seita, Pete Florence, Jonathan Tompson, Erwin Coumans, Vikas Sindhwani, Ken Goldberg, Andy Zeng
IEEE International Conference on Robotics and Automation (ICRA) 2021
Webpage  •   PDF  •   Code  •   Google AI Blog

Rearranging deformable objects such as cables, fabrics, and bags remains a long-standing challenge in robotic manipulation due to complex dynamics and high-dimensional configuration spaces. This presents difficulties not only for multi-step planning, but also for goal specification, which may involve complex spatial relations such as "place the item inside the bag". In this work, we develop a suite of simulated benchmarks with 1D, 2D, and 3D deformable structures, including tasks that involve image-based goal-conditioning and multi-step manipulation. We show that we can embed goal-conditioning into Transporter Networks to enable robots to sequence pick and place actions that manipulate deformable objects into desired configurations.

Transporter Networks: Rearranging the Visual World for Robotic Manipulation

Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, Johnny Lee
Conference on Robot Learning (CoRL) 2020
★ Plenary Talk, Best Paper Presentation Award Finalist, CoRL ★
Webpage  •   PDF  •   Code  •   Google AI Blog  •   VentureBeat

Robotic manipulation can be formulated as inducing a sequence of spatial displacements: where the space being moved can encompass an object, part of an object, or end effector. In this work, we propose the Transporter Network, a simple model architecture that rearranges deep features to infer spatial displacements from visual input -- which can parameterize robot actions. It makes no assumptions of objectness (e.g. canonical poses, models, or keypoints), it exploits spatial symmetries, and is orders of magnitude more sample efficient than our benchmarked alternatives in learning vision-based manipulation tasks: from stacking a pyramid of blocks, to assembling kits with unseen objects; from manipulating deformable ropes, to pushing piles of small objects with closed-loop feedback. Our method can represent complex multi-modal policy distributions and generalizes to multi-step sequential tasks, as well as 6DoF pick-and-place.
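The core operation can be sketched with plain lists standing in for deep feature maps (a toy illustration, not the actual architecture): a feature crop around the picked object is cross-correlated against the scene features, and the argmax gives the placement.

```python
def best_placement(scene, template):
    """Transporter-style matching: slide a feature crop (template) over the
    scene feature map, score each offset by cross-correlation, and return
    the argmax location as the inferred spatial displacement."""
    th, tw = len(template), len(template[0])
    best, best_score = None, float("-inf")
    for i in range(len(scene) - th + 1):
        for j in range(len(scene[0]) - tw + 1):
            score = sum(scene[i + di][j + dj] * template[di][dj]
                        for di in range(th) for dj in range(tw))
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Toy "features": the template matches the bright 2x2 patch at (1, 1).
scene = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 4, 0],
         [0, 0, 0, 0]]
template = [[1, 2],
            [3, 4]]
loc = best_placement(scene, template)
```

Because the correlation is computed over spatially structured features, translations of the scene translate the output correspondingly, which is the spatial symmetry the abstract refers to.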

Spatial Action Maps for Mobile Manipulation

Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Johnny Lee, Szymon Rusinkiewicz, Thomas Funkhouser
Robotics: Science and Systems (RSS) 2020
Webpage  •   PDF  •   Code  •   Princeton News

Typical end-to-end formulations for learning robotic navigation involve predicting a small set of steering command actions (forward, left, etc.) from images of the current state (e.g. birds-eye view of a SLAM reconstruction). Instead, we show that it can be advantageous to predict dense action representations defined in the same domain as the state. We present Spatial Action Maps, in which the set of possible actions is represented by a pixel map (aligned with the input image of the current state), where each pixel represents a navigational endpoint at the corresponding scene location. Action predictions are thereby spatially anchored on local visual features in the scene, enabling significantly faster learning of complex behaviors for mobile manipulation tasks with reinforcement learning.
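Action selection under this representation reduces to an argmax over a dense value map (a minimal sketch; the real system predicts the map with a fully convolutional network from overhead observations):

```python
def select_action(q_map):
    """Spatial action map: the policy outputs a per-pixel value map aligned
    with the input image; the chosen action is the pixel (i.e., the scene
    location used as a navigational endpoint) with the highest value."""
    best, best_q = None, float("-inf")
    for i, row in enumerate(q_map):
        for j, q in enumerate(row):
            if q > best_q:
                best, best_q = (i, j), q
    return best

# Toy 2x2 value map: the lower-left location scores highest.
target = select_action([[0.1, 0.3],
                        [0.9, 0.2]])
```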

Grasping in the Wild: Learning 6DoF Closed-Loop Grasping from Low-Cost Demonstrations

Shuran Song, Andy Zeng, Johnny Lee, Thomas Funkhouser
IEEE International Conference on Intelligent Robots and Systems (IROS) 2020
IEEE Robotics and Automation Letters (RA-L) 2020
Webpage  •   PDF

We present a new low-cost hardware interface for collecting grasping demonstrations by people in diverse environments. With this data, we show that it is possible to train an end-to-end 6DoF closed-loop grasping model with reinforcement learning that transfers to real robots. A key aspect of our model is that it uses “action-view”-based rendering to simulate future states with respect to different possible actions. By evaluating these states using a learned value function, our method is able to better select corresponding actions that maximize total rewards (i.e., grasping success). Our system is able to achieve reliable 6DoF closed-loop grasping of novel objects across various scene configurations, as well as dynamic scenes with moving objects.

Form2Fit: Learning Shape Priors for Generalizable Assembly from Disassembly

Kevin Zakka, Andy Zeng, Johnny Lee, Shuran Song
IEEE International Conference on Robotics and Automation (ICRA) 2020
★ Best Paper in Automation Award Finalist, ICRA ★
Webpage  •   PDF  •   Code  •   Google AI Blog  •   VentureBeat  •   2 Minute Papers

Is it possible to learn policies for robotic assembly that can generalize to new objects? We propose to formulate the kit assembly task as a shape matching problem, where the goal is to learn a shape descriptor that establishes geometric correspondences between object surfaces and their target placement locations from visual input. This formulation enables the model to acquire a broader understanding of how shapes and surfaces fit together for assembly — allowing it to generalize to new objects and kits. To obtain training data, we present a self-supervised data-collection pipeline that learns assembly from disassembly.

ClearGrasp: 3D Shape Estimation of Transparent Objects for Manipulation

Shreeyak Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, Shuran Song
IEEE International Conference on Robotics and Automation (ICRA) 2020
Webpage  •   PDF  •   Code  •   Dataset  •   Google AI Blog  •   VentureBeat

Transparent objects are a common part of everyday life, yet they possess unique visual properties that make it incredibly difficult for standard 3D sensors to produce accurate depth estimates. We present a deep learning approach trained from large-scale synthetic data, to estimate accurate 3D geometry of transparent objects from a single RGB-D image. Our experiments demonstrate that ClearGrasp is substantially better than monocular depth estimation baselines and is capable of generalizing to real-world images and novel objects. We also show that ClearGrasp can be applied out-of-the-box to improve robotic grasping.

Learning to See before Learning to Act: Visual Pre-training for Manipulation

Lin Yen-Chen, Andy Zeng, Shuran Song, Phillip Isola, Tsung-Yi Lin
IEEE International Conference on Robotics and Automation (ICRA) 2020
Webpage  •   PDF  •   Code  •   Google AI Blog  •   VentureBeat

Does having visual priors (e.g. the ability to detect objects) facilitate learning vision-based manipulation (e.g. picking up objects)? We study this in the context of transfer learning, where a convolutional network is first trained on a passive vision task, then adapted to perform an active manipulation task. We find that pre-training on vision tasks can improve the generalization and sample efficiency of models used for learning manipulation affordances. This makes it possible to learn robotic grasping in just 10 minutes of trial and error experience.

TossingBot: Learning to Throw Arbitrary Objects with Residual Physics

Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodriguez, Thomas Funkhouser
Robotics: Science and Systems (RSS) 2019
IEEE Transactions on Robotics (T-RO) 2020
Featured on the front page of The New York Times Business!
★ King-Sun Fu Memorial Best Paper Award, T-RO ★
★ Best Systems Paper Award, RSS ★
★ Best Student Paper Award Finalist, RSS ★
Webpage  •   PDF  •   Google AI Blog  •   New York Times  •   IEEE Spectrum

Throwing is an excellent means of exploiting dynamics to increase the capabilities of a manipulator. In the case of pick-and-place, throwing can enable a robot arm to rapidly place objects into selected boxes outside its maximum kinematic range — improving its physical reachability and picking speed. We propose an end-to-end formulation (vision to actions) where we investigate the synergies between grasping and throwing (i.e., learning grasps that enable more accurate throws) and between simulation and deep learning (i.e., inferring residuals on top of control parameters predicted by a physics simulator). The resulting system is able to grasp and throw arbitrary objects into boxes located outside its maximum range at 500+ mean picks per hour and generalizes to new objects and target locations.
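The residual-physics idea can be sketched with the textbook ballistic range formula (a simplification of the paper's controller, with a hypothetical residual value standing in for the network's prediction): physics supplies a first-order estimate of the release velocity, and learning corrects it.

```python
import math

def throw_velocity(distance, residual, g=9.8):
    """Residual physics sketch: analytical release-velocity estimate for a
    45-degree throw (range d = v^2 / g), plus a learned residual delta that
    absorbs un-modeled effects (aerodynamic drag, grasp offsets, etc.)."""
    v_physics = math.sqrt(distance * g)  # ideal projectile-motion estimate
    return v_physics + residual          # network would predict `residual`

# Target 1 m away; a (hypothetical) learned correction of +0.25 m/s.
v = throw_velocity(distance=1.0, residual=0.25)
```

The division of labor is the point: the simulator handles the bulk of the physics, so the network only has to learn a small, object-dependent correction.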

DensePhysNet: Learning Dense Physical Object Representations via Multi-step Dynamic Interactions

Zhenjia Xu, Jiajun Wu, Andy Zeng, Joshua B. Tenenbaum, Shuran Song
Robotics: Science and Systems (RSS) 2019
Webpage  •   PDF  •   Code

Through vision and interaction, can robots discover the physical properties of objects? In this work, we propose DensePhysNet, a system that actively executes a sequence of dynamic interactions (e.g. sliding and colliding), and uses a deep predictive model over its visual observations to learn dense pixel-wise representations that reflect the physical properties of observed objects. Our experiments in both simulation and real settings demonstrate that the learned representations carry rich physical information, and can be used for more accurate and efficient manipulation in downstream tasks than state-of-the-art alternatives.

Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning

Andy Zeng, Shuran Song, Stefan Welker, Johnny Lee, Alberto Rodriguez, Thomas Funkhouser
IEEE International Conference on Intelligent Robots and Systems (IROS) 2018
★ Best Cognitive Robotics Paper Award Finalist, IROS ★
Webpage  •   PDF  •   Code  •   2 Minute Papers

Skilled robotic manipulation benefits from complex synergies between non-prehensile (e.g. pushing) and prehensile (e.g. grasping) actions: pushing can help rearrange cluttered objects to make space for arms and fingers; likewise, grasping can help displace objects to make pushing movements more precise and collision-free. In this work, we demonstrate that it is possible to discover and learn these synergies from scratch by combining visual affordance-based manipulation with model-free deep reinforcement learning. Our method is sample efficient and generalizes to novel objects and scenarios.
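At execution time, arbitrating between the two primitives is straightforward once each has a dense affordance map (a toy sketch of the selection rule, not the trained system): run whichever primitive has the highest predicted value anywhere in the scene.

```python
def pick_primitive(grasp_affordances, push_affordances):
    """Compare dense per-pixel affordance maps for two motion primitives
    and execute whichever has the highest predicted value in the scene."""
    g_best = max(q for row in grasp_affordances for q in row)
    p_best = max(q for row in push_affordances for q in row)
    return ("grasp", g_best) if g_best >= p_best else ("push", p_best)

# Toy cluttered scene: pushing scores higher, so the robot pushes first
# to make room before attempting a grasp.
action, q = pick_primitive([[0.2, 0.3]], [[0.6, 0.1]])
```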

What are the Important Technologies for Bin Picking? Technology Analysis of Robots in Competitions Based on a Set of Performance Metrics

Masahiro Fujita, Yukiyasu Domae, Akio Noda, Gustavo Alfonso Garcia Ricardez, Tatsuya Nagatani, Andy Zeng, Shuran Song, Alberto Rodriguez, Albert Causo, I-Ming Chen, Tsukasa Ogasawara
Advanced Robotics (Journal) 2019
★ Japan Factory Automation (FA) Foundation Paper Award ★
Webpage  •   PDF

Bin picking is still a challenge in robotics, as shown in recent robot competitions. These competitions are an excellent platform for technology comparisons, since some participants may use state-of-the-art technologies while others may use conventional ones. Nevertheless, even though points are awarded or subtracted based on performance within the competition rules, the final score does not directly reflect the suitability of the technology. Therefore, it is difficult to understand which technologies, and which combinations of them, are optimal for various real-world problems. In this paper, we propose a set of performance metrics, selected in terms of actual field use, as a solution to clarify the important technologies in bin picking. Moreover, we use the selected metrics to compare our four original robot systems, which achieved the best performance in the Stow task of the Amazon Robotics Challenge 2017. Based on this comparison, we discuss which technologies are ideal for practical use in bin-picking robots in the fields of factory and warehouse automation.

Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching

Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R. Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, Alberto Rodriguez
IEEE International Conference on Robotics and Automation (ICRA) 2018
The International Journal of Robotics Research (IJRR) 2019
★ Best Systems Paper Award, Amazon Robotics ★
★ 1st Place (Stow Task), Amazon Robotics Challenge 2017 ★
Webpage  •   PDF  •   Code  •   Journal (IJRR)  •   MIT News  •   Engadget

We built a robo-picker that can grasp and recognize novel objects (appearing for the first time during testing) in cluttered environments without needing any additional data collection or re-training. It achieves this with pixel-wise visual affordance-based grasping and one-shot learning to recognize objects using only product images (e.g. from the web). The approach was part of the MIT-Princeton system that took 1st place (stow task) at the 2017 Amazon Robotics Challenge.

Im2Pano3D: Extrapolating 360° Structure and Semantics Beyond the Field of View

Shuran Song, Andy Zeng, Angel X. Chang, Manolis Savva, Silvio Savarese, Thomas Funkhouser
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018
★ Oral Presentation, CVPR ★
Webpage  •   PDF

We explore the limits of leveraging strong contextual priors learned from large-scale synthetic and real-world indoor scenes. To this end, we trained a network that generates a dense prediction of 3D structure and a probability distribution of semantic labels for a full 360° panoramic view of an indoor scene when given only a partial observation in the form of an RGB-D image (i.e., infers what's behind you).

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, Yinda Zhang
IEEE International Conference on 3D Vision (3DV) 2017
Webpage  •   PDF  •   Code  •   Matterport Blog

We introduce Matterport3D, a large-scale RGB-D dataset with 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of self-supervised tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and scene classification.

3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions

Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, Thomas Funkhouser
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
★ Oral Presentation, CVPR ★
Webpage  •   PDF  •   Code  •   Talk  •   2 Minute Papers

We present a data-driven model that learns a local 3D shape descriptor for establishing correspondences between partial and noisy 3D/RGB-D data. To amass training data for our model, we propose an unsupervised feature learning method that leverages the millions of correspondence labels found in existing RGB-D reconstructions. Our learned descriptor is not only able to match local geometry in new scenes for reconstruction, but also generalizes to different tasks and spatial scales (e.g. instance-level object model alignment for the Amazon Picking Challenge, and mesh surface correspondence).

Semantic Scene Completion from a Single Depth Image

Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, Thomas Funkhouser
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
★ Oral Presentation, CVPR ★
Webpage  •   PDF  •   SUNCG Dataset  •   Code  •   Talk  •   2 Minute Papers

We present an end-to-end model that can infer a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. To train our model, we construct SUNCG -- a manually created large-scale dataset of synthetic 3D scenes with dense volumetric annotations.

Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge

Andy Zeng, Kuan-Ting Yu, Shuran Song, Daniel Suo, Ed Walker Jr., Alberto Rodriguez, Jianxiong Xiao
IEEE International Conference on Robotics and Automation (ICRA) 2017
★ 3rd Place, Amazon Robotics Challenge 2016 ★
Webpage  •   PDF  •   Shelf & Tote Dataset  •   Code

We developed a vision system that can recognize objects and estimate their 6D poses under cluttered environments, partial data, sensor noise, multiple instances of the same object, and a large variety of object categories. Our approach leverages fully convolutional networks to segment and label multiple RGB-D views of a scene, then fits pre-scanned 3D object models to the resulting segmentation to estimate their poses. We also propose a scalable self-supervised method that leverages precise and repeatable robot motions to generate a large labeled dataset without tedious manual annotations. The approach was part of the MIT-Princeton system that took 3rd place at the 2016 Amazon Picking Challenge.
