Researchers at OpenAI have taught a neural network to play Minecraft from gameplay videos, using a method called Video PreTraining (VPT). They combined a massive unlabeled video dataset of human Minecraft play with a small amount of labeled data recorded by contractors.
In the first stage, the AI "watched" 2,000 hours of labeled gameplay videos. The labels were the keypresses and mouse movements recorded alongside the footage, and the AI itself acts through an emulated standard mouse and keyboard. As a result, the neural network learned to process video, infer the keypresses and mouse movements behind it, and label the footage with them.
In the second stage, the neural network watched 70,000 hours of unlabeled gameplay videos (footage without any keypress data) taken from open sources, using the keypresses it had learned to guess as stand-in labels. As a result, the system learned not only how to walk in the game world, but also how to mine resources and craft objects, search for food and hunt, run, swim, get around obstacles, and so on. The AI also learned to pillar jump, that is, to gain height by repeatedly jumping and placing a block underneath itself.
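The first two stages amount to a pseudo-labeling pipeline: a model trained on a small labeled set guesses the actions in a much larger unlabeled set, and a policy is then trained to imitate those guesses. The sketch below illustrates the idea in Python under heavy simplifying assumptions: a "video" is a list of positions in a toy one-dimensional world, the action-guessing model is a lookup table, and every name (`train_idm`, `pseudo_label`, `behavioral_cloning`) is hypothetical. It is not OpenAI's code, which uses large neural networks on real video frames.

```python
from collections import Counter, defaultdict


def train_idm(labeled_videos):
    """Fit a toy action-guessing model: frame-to-frame change -> most common action."""
    counts = defaultdict(Counter)
    for frames, actions in labeled_videos:
        for prev, nxt, action in zip(frames, frames[1:], actions):
            counts[nxt - prev][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in counts.items()}


def pseudo_label(idm, frames, default="noop"):
    """Guess the action between each pair of consecutive frames."""
    return [idm.get(nxt - prev, default) for prev, nxt in zip(frames, frames[1:])]


def behavioral_cloning(videos_with_actions):
    """Fit a toy policy: observed frame -> most common (pseudo-)labeled action."""
    counts = defaultdict(Counter)
    for frames, actions in videos_with_actions:
        # zip stops at the shorter list, so the final frame (no action) is skipped
        for frame, action in zip(frames, actions):
            counts[frame][action] += 1
    return {frame: c.most_common(1)[0][0] for frame, c in counts.items()}


# Stage 1: a small labeled set (frames plus the action taken after each frame).
labeled = [([0, 1, 2, 2, 1], ["right", "right", "noop", "left"])]

# Stage 2: a larger set with frames only, labeled by the stage-1 model.
unlabeled = [[0, 1, 2, 3], [1, 2, 3, 3]]

idm = train_idm(labeled)
pseudo = [(frames, pseudo_label(idm, frames)) for frames in unlabeled]
policy = behavioral_cloning(pseudo)

print(idm)     # learned mapping from frame change to action
print(policy)  # learned mapping from frame to action
```

The point of the split is that the labeled set only has to be large enough to teach the action-guessing model; the bulk of the training signal then comes from footage that nobody had to annotate by hand.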
In the next stage, the researchers recruited players who were asked to create a new world in the game, collect the necessary resources, and make basic items from them. Their play was recorded on video and shown to the neural network. The researchers also applied reinforcement learning, which eventually allowed the AI to craft a diamond pickaxe.
The researchers believe that Video PreTraining (VPT) will make it possible to train neural networks quickly for specific tasks, and more broadly to teach artificial intelligence to use a mouse and keyboard.