As e-commerce orders pour in, a warehouse robot picks mugs off a shelf and places them into boxes for shipping. Everything is humming along, until the warehouse processes a change and the robot must now grasp taller, narrower cups that are stored upside down.
Reprogramming the robot involves hand-labeling thousands of images that show it how to grasp these new cups, then retraining the system.
But a new technique developed by MIT researchers would require only a handful of human demonstrations to reprogram the robot. This machine-learning method enables a robot to pick up and place never-before-seen objects that are in random poses it has never encountered. Within 10 to 15 minutes, the robot would be ready to perform a new pick-and-place task.
The technique uses a neural network specifically designed to reconstruct the shapes of 3D objects. With just a few demos, the system uses what the neural network has learned about 3D geometry to grasp new objects that are similar to those in the demos.
In simulations and using a real robotic arm, the researchers show that their system can efficiently manipulate never-before-seen cups, bowls and bottles, arranged in random poses, using just 10 demonstrations to teach the robot.
“Our major contribution is the general ability to deliver new skills much more efficiently to robots that need to operate in less structured environments where there could be a lot of variability. The concept of generalization by construction is a fascinating capability because this problem is typically much harder,” says Anthony Simeonov, an electrical engineering and computer science (EECS) graduate student and co-lead author of the paper.
Simeonov wrote the paper with co-lead author Yilun Du, an EECS graduate student; Andrea Tagliasacchi, a research scientist at Google Brain; Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Alberto Rodriguez, the Class of 1957 Associate Professor in the Department of Mechanical Engineering; and senior authors Pulkit Agrawal, a professor in CSAIL, and Vincent Sitzmann, an incoming assistant professor in EECS. The research will be presented at the International Conference on Robotics and Automation.
A robot can be trained to pick up a specific item, but if that item is lying on its side (perhaps it fell over), the robot sees this as a completely new scenario. This is one reason it is so difficult for machine-learning systems to generalize to new object orientations.
To overcome this challenge, the researchers created a new type of neural network model, a Neural Descriptor Field (NDF), which learns the 3D geometry of a class of objects. The model computes the geometric representation of a specific item using a 3D point cloud, which is a set of data points or coordinates in three dimensions. The data points can be obtained from a depth camera, which provides information on the distance between the object and a viewpoint. Although the network was trained in simulation on a large dataset of synthetic 3D shapes, it can be directly applied to objects in the real world.
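As a rough illustration of the depth-camera step, back-projecting a depth image into a 3D point cloud can be sketched with a standard pinhole camera model. The intrinsics `fx`, `fy`, `cx`, `cy` and the tiny depth image below are made-up values for illustration, not details from the paper:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image into a 3D point cloud using the
    pinhole camera model. `depth` is an (H, W) array of distances
    along the camera's z-axis; fx, fy, cx, cy are camera intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no depth reading

# A tiny synthetic example: a 2x2 depth image with one invalid pixel.
depth = np.array([[1.0, 1.0],
                  [0.0, 2.0]])
cloud = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud.shape)  # (3, 3): three valid pixels, each an (x, y, z) point
```

In practice a robotics stack would use a library routine for this conversion, but the arithmetic above is the standard pinhole back-projection.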
The team designed the NDF with a property known as equivariance. With this property, if the model is shown an image of an upright cup, and is then shown an image of the same cup on its side, it understands that the second cup is the same object, just rotated.
“This equivariance is what allows us to handle cases where the object you’re observing is in an arbitrary orientation much more efficiently,” says Simeonov.
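The actual NDF learns equivariant descriptors with a neural network, but the property itself can be illustrated with a deliberately simple, hand-built stand-in: the offset of a query point from a point cloud's centroid, which rotates exactly as the scene rotates. Everything below is a hypothetical toy, not the paper's model:

```python
import numpy as np

def toy_descriptor(cloud, query):
    """A deliberately simple rotation-equivariant 'descriptor':
    the offset from the cloud's centroid to a query point."""
    return query - cloud.mean(axis=0)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))   # stand-in for an object's point cloud
query = np.array([0.3, -0.1, 0.5])  # a point of interest on the object

# A rotation of 90 degrees about the z-axis.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

d_before = toy_descriptor(cloud, query)
d_after = toy_descriptor(cloud @ R.T, query @ R.T)  # rotate the whole scene

# Equivariance: the descriptor of the rotated scene equals the rotated
# descriptor of the original scene.
print(np.allclose(d_after, R @ d_before))  # True
```

The learned NDF descriptors are much richer than a centroid offset, but they are built to satisfy this same rotate-the-input, rotate-the-output relationship, which is what lets the system treat a tipped-over cup as a familiar object.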
As the NDF learns to reconstruct the shapes of similar objects, it also learns to associate related parts of those objects. For instance, it learns that the handles of cups are similar, even though some cups are taller or wider than others, or have shorter or longer handles.
“If you wanted to do this with another approach, you would have to hand label all the parts. Instead, our approach automatically discovers these parts from the shape reconstruction,” says Du.
The researchers use this trained NDF model to teach a robot a new skill with only a few physical demonstrations. They move the robot's hand onto the part of an object they want it to grip, like the rim of a bowl or the handle of a cup, and record the locations of its fingertips.
Because the NDF has learned so much about 3D geometry and how to reconstruct shapes, it can infer the structure of a new shape, which enables the system to transfer the demonstrations to new objects in arbitrary poses, Du explains.
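The real system optimizes a full gripper pose against the learned neural descriptors; as a loose sketch of the transfer idea only, one can match a descriptor recorded at the demonstrated grasp point against candidate points on a new object. The code below is a hypothetical toy that reuses a hand-built centroid-offset descriptor in place of a trained network, and it only handles translated copies of the demo object:

```python
import numpy as np

def toy_descriptor(cloud, query):
    # Hand-built stand-in for a learned descriptor: offset from centroid.
    return query - cloud.mean(axis=0)

def transfer_grasp(demo_cloud, demo_grasp, new_cloud):
    """Pick the point on the new object whose descriptor best matches
    the descriptor recorded at the demonstrated grasp point."""
    target = toy_descriptor(demo_cloud, demo_grasp)
    descs = new_cloud - new_cloud.mean(axis=0)  # descriptor of each candidate
    best = np.argmin(np.linalg.norm(descs - target, axis=1))
    return new_cloud[best]

# Demo object: the eight corners of a unit cube; the grasp was
# demonstrated at one corner.
demo_cloud = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0],
                       [0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]], float)
demo_grasp = demo_cloud[7]

# "New" object: the same shape in a different position.
new_cloud = demo_cloud + np.array([5.0, -2.0, 1.0])

print(transfer_grasp(demo_cloud, demo_grasp, new_cloud))  # [6. -1. 2.]
```

The matching step lands on the corresponding corner of the moved cube. The published method generalizes this idea across object instances and full 3D rotations by using the learned, equivariant descriptors.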
Picking a winner
They tested their model in simulations and on a real robotic arm using cups, bowls, and bottles as objects. Their method had an 85% success rate on pick-and-place tasks with new objects in new orientations, while the best baseline achieved only a 45% success rate. Success meant grasping a new object and placing it in a target location, like hanging cups on a rack.
Many baselines use 2D image information rather than 3D geometry, which makes it more difficult for these methods to incorporate equivariance. This is one reason the NDF technique performed so much better.
While the researchers were happy with its performance, their method only works for the particular object category on which it is trained. A robot taught to pick up cups won't be able to pick up boxes or headphones, because these objects have geometric features that are too different from what the network was trained on.
“In the future, extending it to many categories or completely abandoning the notion of category would be ideal,” says Simeonov.
They also plan to adapt the system for nonrigid objects and, in the longer term, to enable the system to perform pick-and-place tasks when the target area changes.
This work is supported, in part, by the Defense Advanced Research Projects Agency, the Singapore Defense Science and Technology Agency, and the National Science Foundation.