Artificial intelligence research has made rapid progress in a wide variety of domains from speech recognition and image classification to genomics and drug discovery. In many cases, these are specialist systems that leverage enormous amounts of human expertise and data.
However, for some problems this human knowledge may be too expensive, too unreliable or simply unavailable. As a result, a long-standing ambition of AI research is to bypass this step, creating algorithms that achieve superhuman performance in the most challenging domains with no human input.
The paper introduces AlphaGo Zero, the latest evolution of AlphaGo, the first computer program to defeat a world champion at the ancient Chinese game of Go. AlphaGo Zero is even more powerful and is arguably the strongest Go player in history.
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human levels of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.
It is able to do this by using a novel form of reinforcement learning in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.
This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.
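The loop described above can be sketched in miniature. The code below is an illustrative toy, not AlphaGo Zero's actual implementation: it replaces Go with a tiny counting game (players alternately add 1 or 2 to a total; whoever reaches 5 wins) and replaces the neural network with a simple value table, but it follows the same pattern the system uses - play against yourself, label each visited position with the eventual winner, and nudge the evaluator toward those outcomes.

```python
import random

TARGET = 5  # toy game: players alternately add 1 or 2; reaching exactly 5 wins

def legal_moves(total):
    return [m for m in (1, 2) if total + m <= TARGET]

def self_play_game(values, explore=0.5):
    """Play one game guided by the current value table.
    Returns the (state, player) pairs visited and the winner."""
    total, player, history = 0, 0, []
    while True:
        history.append((total, player))
        moves = legal_moves(total)
        if random.random() < explore:
            move = random.choice(moves)  # occasional random exploration
        else:
            # Greedy: move to the state that is worst for the opponent.
            move = min(moves, key=lambda m: values.get(total + m, 0.0))
        total += move
        if total == TARGET:
            return history, player  # the player who just moved wins
        player = 1 - player

def train(iterations=2000, lr=0.1, seed=0):
    """The self-play loop: play, label every visited state with the
    game's outcome, update the evaluator, and repeat."""
    random.seed(seed)
    values = {}  # state -> estimated value for the player to move
    for _ in range(iterations):
        history, winner = self_play_game(values)
        for state, player in history:
            outcome = 1.0 if player == winner else -1.0
            v = values.get(state, 0.0)
            values[state] = v + lr * (outcome - v)
    return values
```

Starting from an empty table (random play), the evaluator gradually discovers which positions are won and lost - for example, that a total of 4 is always winning for the player to move - mirroring, at miniature scale, how each iteration of self-play produces a slightly stronger evaluator.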
This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it is able to learn tabula rasa from the strongest player in the world: AlphaGo itself.
It also differs from previous versions in other notable ways.
- AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features.
- It uses one neural network rather than two. Earlier versions of AlphaGo used a “policy network” to select the next move to play and a “value network” to predict the winner of the game from each position. These are combined in AlphaGo Zero, allowing it to be trained and evaluated more efficiently.
- AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions.
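To make the single-network idea concrete, here is a minimal sketch of a network with one shared trunk feeding two heads: a policy head that outputs a probability for every board point, and a value head that outputs a single score in (-1, 1) predicting the winner. Everything here is an illustrative assumption - plain Python, one tiny tanh layer, random untrained weights, and a raw-stones encoding (+1 black, -1 white, 0 empty); the real system is a deep residual convolutional network.

```python
import math
import random

class DualHeadNet:
    """Toy sketch of a shared-trunk, two-headed network."""

    def __init__(self, board_size=9, hidden=16, seed=0):
        rng = random.Random(seed)
        n = board_size * board_size
        # Random, untrained weights - for illustrating the shape only.
        self.w_trunk = [[rng.gauss(0, 0.1) for _ in range(n)] for _ in range(hidden)]
        self.w_policy = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(n)]
        self.w_value = [rng.gauss(0, 0.1) for _ in range(hidden)]

    def forward(self, stones):
        # Shared trunk: one tanh layer over the raw stone encoding.
        h = [math.tanh(sum(w * s for w, s in zip(row, stones))) for row in self.w_trunk]
        # Policy head: softmax over all board points.
        logits = [sum(w * x for w, x in zip(row, h)) for row in self.w_policy]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        policy = [e / z for e in exps]
        # Value head: single tanh-squashed winner prediction in (-1, 1).
        value = math.tanh(sum(w * x for w, x in zip(self.w_value, h)))
        return policy, value
```

Because both heads share one trunk, a single forward pass yields both a move distribution and a position evaluation - which is what lets the search dispense with rollouts and what makes training and evaluation more efficient.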
All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change - learning purely from self-play, with no human data - that makes the system much more powerful and efficient.
AlphaGo has become progressively more efficient thanks to hardware gains and, more recently, algorithmic advances
After just three days of self-play training, AlphaGo Zero emphatically defeated the previously published version of AlphaGo - which had itself defeated 18-time world champion Lee Sedol - by 100 games to 0. After 40 days of self-play training, AlphaGo Zero became even stronger, outperforming the version of AlphaGo known as “Master”, which had defeated the world's best players, including world number one Ke Jie.
Elo ratings - a measure of the relative skill levels of players in competitive games such as Go - show how AlphaGo has become progressively stronger during its development
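The Elo model behind such ratings is simple enough to state in a few lines. The sketch below implements the standard formulas - a logistic expected score on a 400-point scale and a K-factor update after each game; the specific ratings in the chart come from DeepMind's evaluation games, so none are hard-coded here.

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B. A 400-point
    rating gap corresponds to roughly 10-to-1 odds for A."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32):
    """New rating for A after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))
```

For instance, a player rated 200 points above an opponent is expected to score about 0.76 per game, so they gain little rating from a win (the result was expected) but lose considerably more after an upset defeat.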
Over the course of millions of AlphaGo vs AlphaGo games, the system progressively learned the game of Go from scratch, accumulating thousands of years of human knowledge during a period of just a few days. AlphaGo Zero also discovered new knowledge, developing unconventional strategies and creative new moves that echoed and surpassed the novel techniques it played in the games against Lee Sedol and Ke Jie.
Innovations of AlphaGo
One of the great promises of AI is its potential to help us unearth new knowledge in complex domains. We’ve already seen exciting glimpses of this, when our algorithms found ways to dramatically improve energy use in data centers - as well as of course with our program AlphaGo.
Since its historic success in Seoul last March, AlphaGo has heralded a new era for the ancient game of Go. Thanks to AlphaGo's creative and intriguing revelations, players of all levels have been inspired to test out new moves and strategies of their own, often re-evaluating centuries of inherited knowledge in the process.
Ahead of 'The Future of Go Summit' in Wuzhen, we summarise some recent examples of AlphaGo’s strategic and tactical innovations, and the new insights they have revealed.
"AlphaGo’s game last year transformed the industry of Go and its players. The way AlphaGo showed its level was far above our expectations and brought many new elements to the game."
– Shi Yue, 9 Dan Professional, World Champion
“I believe players more or less have all been affected by Professor Alpha. AlphaGo’s play makes us feel more free and no move is impossible to play anymore. Now everyone is trying to play in a style that hasn’t been tried before.”
– Zhou Ruiyang, 9 Dan Professional, World Champion
AlphaGo's greatest strength is not any one move or sequence, but rather the unique perspective that it brings to every game. While Go style is difficult to encapsulate, one could say that AlphaGo's strategy embodies a spirit of flexibility and open-mindedness: a lack of preconceptions that allows it to find the most effective line of play. As the following two games will show, this philosophy often leads AlphaGo to discover counterintuitive yet powerful moves.
Although Go is a game of territory, most decisive battles hinge on the balance of power between groups, and AlphaGo excels in shaping this balance. Specifically, AlphaGo makes masterful use of "influence," or the effect of existing stones on surrounding areas. Although influence cannot be measured exactly, AlphaGo's value network enables it to consider all stones on the board at once, endowing its judgment with subtlety and precision. These abilities let AlphaGo convert local regions of influence into coordinated global advantages.