Imagine having a robot that can bring you a drink from the fridge. Unload the dishwasher. Paint the wall. Unload a few boxes of books. It doesn’t look human, it doesn’t have consciousness, it isn’t your friend. It is just extremely damn useful since it can understand any simple command and execute it.
Let’s call it a “General-Purpose Robot” or GPR.
We currently lack GPRs; the best most of us can get is a robot vacuum with lidar or a ride in a Waymo self-driving car in Phoenix, Arizona (assuming we have the means and, more importantly, the willingness to go to Phoenix, Arizona). At the same time, it doesn’t seem inconceivable that we could have GPRs in the near future. Recent advances in Machine Learning, Computer Vision, Self-Driving Cars, and Robot Dancing all point that way. The purpose of this post is to assess the technological and economic feasibility of building GPRs.
Let’s start with hardware. Robot hardware can be divided into 3 groups: body, sensors, brains. As far as bodies go, we mostly have things handled, the Boston Dynamics Spot with a robotic arm is likely all we need for the first version of a GPR.
Sensors are slightly trickier. It is almost certain that we have sensors that are good enough for vision. Lidar is also great for not bumping into things. Microphones are undoubtedly good enough. However, it is uncertain whether we have tactile sensors that are sufficient for a GPR. A recent paper that showcased new tactile sensors demonstrated 94% accuracy in classifying garbage into 7 categories. That is good, but nowhere near human good. However, that paper relied ONLY on tactile sensors. It is likely that if the vision system is good enough to recognize most objects, the robotic arm would be able to manipulate them without perfect tactile recognition. This paper is also very encouraging - it shows blind locomotion on extremely challenging terrain. Robots learned to interact with tricky materials like ice and rocks at the current level of tactile sensors, using a lot of ML magic. I’m going to tentatively put us at “if we’re not there, we’re almost there” as far as tactile sensors go.
You would expect brains to be more difficult than sensors, but that is actually not the case, at least as far as hardware goes. We know that our computers will be able to run the required software, as they are general-purpose devices. We might need a lot of computer chips per GPR, but if we allow for some of the chips to be in the cloud and not in the robot itself, we have access to almost infinite computing power.
Software is the real blocker. We don’t have the software required to run such robots, and it will take a long time to develop it. Moreover, we don’t necessarily know how to write it… Or do we?
In a sense, the only proof of knowing how to write software that does something is, well, having written it. At the same time, one cannot help but notice trends. The biggest, baddest trend in the digital world today is the scaling of neural networks.
Observing the successes of GPT-3 and AlphaGo Zero, I am going to posit the following:
No advances in neural networks are required to build a General-Purpose Robot. We can achieve that with larger neural networks of the kinds we have today.
Note that this doesn’t say anything at all about AGI. A GPR needs to do but a few things: understand human language, link words to physical objects, navigate the 3-dimensional world without breaking anything, develop a plan to achieve an understood goal. We have neural nets that do these things already. Examples include GPT-3, DALL-E, self-driving cars, and the neural nets that learned to play all of the Atari games. We are raising the level of complexity, for sure. A driving environment is fairly regimented and standardized. It mostly consists of other cars and pedestrians. People’s apartments contain thousands of objects with billions of variations. So, the neural nets will need to be BIGGER, that’s fine. But there is no obvious reason why we would need a fundamental breakthrough in Machine Learning technology to build GPRs. This is an assumption, but it’s heavily supported by data.
What we are missing is the training environment. Enormous neural nets take a long time to train. As we’ve discovered with self-driving cars, the real world is too slow. We need to first build a virtual training environment where the nets can do the bulk of their training. Some training will need to happen in the real world, but the vast majority of it will be done virtually. What would such a training environment look like? A lot like this:
It would be a collection of thousands (millions?) of real-world scenes that the neural nets can explore and interact with - with the initial objective of learning how to manipulate objects without breaking them.
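To make the idea concrete, here is a toy sketch of what “learning to manipulate objects without breaking them” looks like as a training loop. Everything here is invented for illustration (`SimScene`, the toy physics, the grip-force “policy”); a real system would use a full physics engine and a large neural network, not a lookup table:

```python
import random

class SimScene:
    """Hypothetical stand-in for one simulated real-world scene.
    A real training environment would expose thousands of these."""
    def __init__(self, objects):
        self.objects = objects          # e.g. ["mug", "book", "lamp"]
        self.broken = 0

    def reset(self):
        self.broken = 0
        return tuple(self.objects)      # "observation" of the scene

    def step(self, action):
        # Toy physics: gripping too hard sometimes breaks the object.
        obj, force = action
        if force > 0.8 and random.random() < 0.5:
            self.broken += 1
            return -10.0                # penalty for breaking something
        return 1.0                      # reward for a safe manipulation

def train(scene, episodes=1000):
    """Crude trial-and-error 'policy': learn a per-object grip force."""
    force = {obj: 1.0 for obj in scene.objects}
    for _ in range(episodes):
        scene.reset()
        for obj in scene.objects:
            reward = scene.step((obj, force[obj]))
            if reward < 0:
                force[obj] *= 0.9       # grip more gently next time
    return force

random.seed(0)  # reproducible toy run
forces = train(SimScene(["mug", "book", "lamp"]))
print(forces)   # learned grip forces settle below the breaking threshold
```

The real version of this loop is exactly what makes the scale of the problem apparent: the “scene” has to be a physically faithful simulation, and the “policy” has to generalize across billions of object variations.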
However, just because we have something that looks like what we need, doesn’t mean that we are close to having it. Developing the necessary training environment is a hideously difficult task for two reasons.
The first one is physics. Video games are beginning to look like real life, but their physics are still, effectively, toy physics. To see where the limits of our physics modeling are, go to the website of one of the commercial simulation companies. We have OK models of individual objects that take a metric ton of human time to create, and two metric tons of computers to run.
The second one is variety. As mentioned above, there are thousands of objects in each household with billions of variations. The training environment would have to feature some significant proportion of these variations for the neural nets to be able to generalize to the full set. And every one of them will have to be at the physical fidelity of the simulation software… Ooof.
So, once we create a physics-simulation environment of unprecedented fidelity (one that we can run thousands of instances of), would we then need highly paid engineers to figure out the physical parameters of each of the billions of objects found in the real world? Hopefully not. AlphaFold is a neural net that achieved a phenomenal level of predictive power in protein folding. The general category of what it did was “predicting physical properties of an object”. It is entirely possible that we will be able to train neural nets that predict (infer?) the physical properties of everyday objects. Not from the code that makes them, as was the case for AlphaFold, but from a few physical interactions.
In practice it will look like this: you will point your phone at a physical object. You will try to film it from all angles. Maybe the phone will illuminate it with different levels/colors of light. Then you interact with the object in a few ways. Knock on it. Throw it up and down. Try to bend it with a small amount of force. Roll it on the floor. Maybe you’ll write down what it’s made of. You’ll definitely have to label the object as a “lamp” or a “chair”.
Then the neural net will create a virtual version of the object in the training environment, with all of its physical characteristics.
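One way to picture this step is as a structured scan record handed to an inference model. The sketch below is purely illustrative: every name and number in it is invented, and the hand-written rules stand in for what would actually be a learned model:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectScan:
    """Hypothetical record produced by scanning one household object."""
    label: str                      # user-supplied, e.g. "lamp"
    material: str                   # optional user note, e.g. "ceramic"
    video_angles: int               # how many viewpoints were filmed
    interactions: list = field(default_factory=list)  # "knock", "roll", ...

@dataclass
class VirtualObject:
    """What the inference net would emit for the training environment."""
    label: str
    mass_kg: float
    elasticity: float               # 0 = rigid, 1 = very bouncy
    friction: float

def infer_physical_properties(scan: ObjectScan) -> VirtualObject:
    # Placeholder for the learned model: crude hand-written defaults.
    # A real system would regress these values from video, audio, and
    # the recorded interactions.
    defaults = {"ceramic": (1.2, 0.05, 0.6), "rubber": (0.3, 0.8, 0.9)}
    mass, elasticity, friction = defaults.get(scan.material, (1.0, 0.2, 0.5))
    return VirtualObject(scan.label, mass, elasticity, friction)

scan = ObjectScan(label="lamp", material="ceramic",
                  video_angles=12, interactions=["knock", "roll"])
print(infer_physical_properties(scan))
```

The hard part, of course, is not the record format but the inference model in the middle, and that model does not exist today.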
Millions of people will have to do that with billions of objects. But, in the grand scheme of things, that’s the easy part.
We don’t have this virtualization technology today. We don’t even have anything quite like it. But it is a crucial step in developing a virtual training environment required to train GPR-capable neural nets. Physics simulation and virtualization are the two biggest obstacles we have to overcome to build General-Purpose Robots.
After that point, it’s relatively smooth sailing. Host the virtual environment on thousands of extremely powerful computers. Train huge neural nets, spending hundreds of millions of dollars just on electricity. Develop safety guidelines required to operate near living beings. Wrap them around the neural net. Finish training the robots in the real world in thousands of locations. Difficult? Yes, but nothing we fundamentally don’t know how to do.
To sum up: we have almost all of the technology we need to develop GPRs. We need to develop advanced and performant physics simulation software to host a training environment and a way to virtualize billions of objects. We need neural net scaling to continue to deliver gains for just a little bit longer.
This is not to minimize the difficulty of the project. It will require a ton of work by many incredibly talented people. It will likely require some brilliant inventions. But it seems that, from the viewpoint of civilization, we are almost there.
Now let’s look at the economic feasibility of such a project. Obviously, there are a lot of unknowns and assumptions involved, but, overall, the assumptions chosen are conservative and even pessimistic.
Let’s start with the training environment. Unity Technologies has raised around a billion dollars to build a world-class video game engine. Let’s say that our virtual environment will be 20X as difficult to make. $20B
Virtualizing 10 billion objects. 50 cents an object seems reasonable, as it would be quite quick. $5B
Training the neural net. GPT-3, by one estimate, took $12M to train. Let’s assume that the neural net we need is 1000X as expensive to train. $12B
Boston Dynamics recently sold for just over a billion dollars. Let’s say the cost to develop our hardware is 5X that (including sensors, chips, and whatnot). $5B
Once we are done with development, we would need a factory. The Economist recently pegged the most expensive factory in the world at $17B. While the most realistic assumption is that a robot factory will be cheaper than a semiconductor fab, let’s peg our first factory at twice that. $34B
That brings the total to $76 billion. It is a mind-bogglingly huge amount of money. That being said, as far as mind-bogglingly huge amounts of money go, it is not that big? The Apollo program cost about $152B in today’s dollars. We expect to spend $1.5T (and rising) on F-35s over the lifetime of the project. The latest round of Covid stimulus is $1.9T.
But wait! Success of the project is not guaranteed. Let’s say that we estimate the chances of our R&D succeeding at 25%. The factory will only need to be built if our R&D succeeds, so for the purposes of expected value calculation, let’s quadruple all of the costs apart from the factory. So while the initial amount of capital we need to risk is just $42B, the economic cost of the project can be viewed as $202B. Still mind-bogglingly huge, still… Reasonable, as far as such things go.
Now let’s think about how much money General-Purpose Robots would make. The base case is that, very quickly, they become the biggest manufacturing industry on the planet. Right now, that spot is held by car manufacturing, which did around $3T in revenue globally in 2020. General-Purpose Robots will likely be bigger than cars, but let’s assume they will be about as big. The conservative assumption is that the company that invents them will capture 10% of the value (Apple’s net margin is 20%). That means that, in a single year at full capacity, the company can make $300B, 150% of the economic cost. Now, in the first year, we will only have one factory, but the margins will likely also be higher. Let’s say that the first factory will produce $100B worth of GPRs, at 50% margins, that’s $50B - enough to recoup all of the R&D (or the cost of the factory).
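The napkin math above can be double-checked in a few lines, using only the estimates already stated:

```python
# Cost estimates from the post, in billions of dollars.
rnd_costs = {
    "training environment": 20,   # ~20x Unity's ~$1B
    "virtualizing objects": 5,    # 10B objects x $0.50
    "training the net":     12,   # ~1000x GPT-3's ~$12M
    "hardware development": 5,    # ~5x Boston Dynamics' sale price
}
factory = 34                      # 2x the most expensive factory ($17B)

rnd_total = sum(rnd_costs.values())
total = rnd_total + factory
print(f"capital at risk: ${rnd_total}B, total: ${total}B")    # $42B, $76B

# Expected-value view: R&D succeeds with p = 25%, so the R&D costs are
# quadrupled, while the factory is only built if the R&D succeeds.
p_success = 0.25
economic_cost = rnd_total / p_success + factory
print(f"economic cost: ${economic_cost:.0f}B")                # $202B

# Revenue side: a ~$3T market with 10% value capture.
yearly_at_capacity = 3000 * 0.10
print(f"yearly revenue at capacity: ${yearly_at_capacity:.0f}B")  # $300B
```

All three headline figures from the text ($42B at risk, $202B economic cost, $300B/year at capacity) fall out of the stated assumptions.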
Napkin math for this project works out.
It also assumes that putting some $42B into R&D yields zero revenue apart from GPRs, which is extremely unlikely. If nothing else, the training environment will advance video games by a generation.
This project is actually feasible! Are there entities that can muster the required technological and financial resources (keep in mind that $202B is the economic cost, the actual cost of the R&D is $42B)? There are at least 12:
Chinese Government and its tech menagerie
There are other contenders that are close (OpenAI if it gets Softbank-tier amount of capital, Elon Musk if he is able to sell enough Tesla stock without pushing down the share price to anything resembling reasonable levels). But the point is, yes, organizations exist today that can make GPRs happen, and for which it is an expected-value positive project (thought experiment - how much is it worth to the US Government that GPRs are invented in America?)
Granted, telling your shareholders (or taxpayers or wife) that you are going to spend tens of billions of dollars on a blue sky R&D project requires extreme boldness, vision, chutzpah.
Perhaps, that is the biggest obstacle we have to overcome in order to create GPRs.
I think the biggest thing missing is how to explore and incorporate new environments/tasks. There are teams that can solve any *one set* of tasks (especially in controlled settings), but the generalization and data-efficiency are a big step behind. Simulators can help us pre-solve many generic environments, but people want them in homes and we are nowhere near there.
I suspect I will write about this soon. The exploration and where do we get the data question is a good way to think of it. You may be interested in poking around my content https://democraticrobots.substack.com/p/the-uncanny-world-of-at-home-robots or here is a post on exploration https://lilianweng.github.io/lil-log/2020/06/07/exploration-strategies-in-deep-reinforcement-learning.html.
"General-Purpose Robots will likely be bigger than cars...." The whole economic analysis hinges on this statement, but there's no evidence to support it.
In contrast, consider that Google, which has the AI expertise and compute hardware to make this a reality, sold Boston Dynamics in June 2017 because they didn't see a commercial opportunity. I'm sure that you, Sergey Alexashenko, are a smart guy, but my money is on Sergey Brin, Larry Page, and Astro Teller. They had all the ingredients in their hands but made the decision to sell Boston Dynamics. (https://techcrunch.com/2017/06/08/softbank-is-buying-robotics-firm-boston-dynamics-and-schaft-from-alphabet/)