French version:

https://docs.google.com/document/d/1lBh-2l9LvkVU6ArZo5e5iRo6ODClYZ8Ak4Ue3tXsPKs/edit#

Computers playing the game of Go

Google DeepMind vs Fan Hui: the machine wins 5-0.

Notes on the performance “AlphaGo” (Google DeepMind) vs “Fan Hui” (pro 2p).

Written by Olivier Teytaud and Jialin Liu (Inria Saclay-IDF, Tao team, olivier.teytaud@inria.fr, jialin.liu@inria.fr).

Convolutional deep networks, CDN

Imitation learning: towards a policy network

Learning by self-play: improving the actor network and learning a critic network

Monte Carlo Tree Search (MCTS)

Abstract:

At last, a computer has won against a professional human player at the game of Go, in a fair game (no handicap), on a 19x19 board. Until now, computers needed at least 4 handicap stones.

The algorithm uses several tools from artificial intelligence:

- imitation learning on deep convolutional networks (mimicking human moves, from expert games);
- learning by self-play (policy gradient);
- value function learning, with another deep network.
- Monte Carlo tree search, as all current strong programs.

We see that the algorithm is “multiple”, and that most human forms of learning are included (learning by studying games, by pattern recognition, by self-play, by mental analysis of possible futures).

Whereas many games are either solved (i.e. the machine plays optimally, as in Draughts) or quasi-solved (in the sense that the machine is stronger than all humans, as in Chess), the game of Go was until recently intractable for computers, and humans remained unbeatable. The game of Go comes from Asia.

Go board (source: Wikipedia).

The program developed by Google DeepMind combines several artificial intelligence techniques: deep neural networks and Monte Carlo tree search.

A neuron computes a linear combination z of its inputs x1,x2,...,xk, z=w0+w1x1+w2x2+...+wkxk, which is then transformed by a sigmoid function z → f(z), for example f(z)=1/(1+exp(-z)). The neuron, when it receives an input x, outputs 1/(1+exp(-w0-w1x1-w2x2-...-wkxk)).
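The computation above can be written in a few lines; this is a minimal sketch of a single neuron, with an illustrative function name:

```python
import math

def neuron_output(w, x):
    """Linear combination z = w0 + w1*x1 + ... + wk*xk, then sigmoid 1/(1+exp(-z))."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# A neuron with bias w0 = 0 and weights (1, 1) on a 2-dimensional input:
# for the input (0, 0) we get sigmoid(0) = 0.5.
print(neuron_output([0.0, 1.0, 1.0], [0.0, 0.0]))  # prints 0.5
```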

A neural network combines a large number of neurons: typically, the information arrives on the left and is propagated and processed towards the right, where an output is produced. Below, y is the output obtained when an input x is presented.

Such a network has a huge expressivity; thanks to nonlinearities (sigmoids), each layer drastically increases the expressivity. A challenge, however, is to find the coefficients w such that the network computes something interesting.

Solutions have been known for a long time when data x1,y1,...,xm,ym are available: we compute the w such that the network outputs (on the right) y1 when it sees the input x1 on the left (and, more generally, yi for the input xi). The classical technique for this is called “gradient backpropagation”. However, this method is slow when there are many layers. For this reason, the artificial intelligence community had somewhat abandoned neural networks with many layers (“deep” neural nets). However, two key tools can save the situation:
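As a minimal sketch of gradient backpropagation (not the exact training setup of any real Go program), here is a one-hidden-layer network trained on a toy dataset, the XOR function; the error at the output is propagated backwards to update all the weights w:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: the XOR function, a classic case where a hidden layer is needed.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

random.seed(0)
H = 4
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]  # per hidden neuron: bias + 2 weights
w2 = [random.uniform(-1, 1) for _ in range(H + 1)]                  # output neuron: bias + H weights

def forward(x1, x2):
    h = [sigmoid(w[0] + w[1] * x1 + w[2] * x2) for w in w1]
    return h, sigmoid(w2[0] + sum(w2[j + 1] * h[j] for j in range(H)))

def total_loss():
    return sum((forward(x1, x2)[1] - y) ** 2 for (x1, x2), y in data)

initial_loss = total_loss()
lr = 1.0
for _ in range(2000):
    for (x1, x2), y in data:
        h, out = forward(x1, x2)
        d_out = (out - y) * out * (1 - out)            # error signal at the output layer
        for j in range(H):                             # propagate it backwards, one weight at a time
            d_h = d_out * w2[j + 1] * h[j] * (1 - h[j])
            w2[j + 1] -= lr * d_out * h[j]
            w1[j][0] -= lr * d_h
            w1[j][1] -= lr * d_h * x1
            w1[j][2] -= lr * d_h * x2
        w2[0] -= lr * d_out

final_loss = total_loss()
```

With more layers, the error signal has to traverse every layer, which is exactly where plain backpropagation becomes slow, as noted above.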

- specific training techniques for deep networks;
- convolutional networks.

We have seen that neural networks are a powerful tool, but that it is difficult to tune the w coefficients (millions of parameters). One approach consists in creating successive layers in an unsupervised manner, i.e. without using the output part (the yi). In the case of Go, this means that the network first studies positions only, and not the moves proposed by human experts. This can be done layer by layer, making the problem much easier. Then the last layer, and only this one, uses the output information (the yi). Optionally, the whole network is later fine-tuned using the complete data.

But for images, and a Go board is an image (one pixel per intersection of the board^{[1]}), another trick brings a huge improvement. The first layers, as in the human visual cortex, compute low-level features; these are the same everywhere in the picture, so we can use far fewer coefficients by sharing the same coefficients everywhere. Even for high-level features, such as recognizing a cat, a cat at the top of an image is the same as a cat at the bottom. With such a trick, what is learnt in one part of the board is also learnt for everywhere else on the board^{[2]}.
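Weight sharing is easy to illustrate: a single small kernel of coefficients is applied at every position of the board, so the same local feature detector covers the whole board with very few parameters. A minimal sketch on a tiny 5x5 binary “board” (the kernel here is an arbitrary local-contrast detector, not one learnt by any real program):

```python
# One shared 3x3 kernel slid over the whole board: the same coefficients
# are reused at every position (weight sharing, as in convolutional networks).
def convolve2d(board, kernel):
    n, k = len(board), len(kernel)
    out_size = n - k + 1
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            out[i][j] = sum(kernel[a][b] * board[i + a][j + b]
                            for a in range(k) for b in range(k))
    return out

# A tiny checkerboard pattern standing in for stones on intersections.
board = [[1 if (i + j) % 2 == 0 else 0 for j in range(5)] for i in range(5)]
contrast_kernel = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]  # a simple local feature detector
feature_map = convolve2d(board, contrast_kernel)  # 3x3 map of local responses
```

A full 3x3 kernel needs only 9 coefficients, however large the board; a fully connected layer on a 19x19 board would need hundreds of coefficients per output neuron.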

Illustration of a deep neural network by Google, used by Cypher. The first layers distinguish simple features (such as “diagonal line”); the later layers recognize high-level concepts such as “cats” or “faces”.

Deep neural networks have outperformed all other approaches in many areas of image recognition, and are used in many domains, including Atari games and the game of Go.

Thanks to convolutional network learning, the neural network is ready to associate outputs y (which are moves) to inputs x (which are Go positions). For this, it is given plenty of boards, each labelled with an excellent move (obtained from games between strong players). This is the first step of the program. Note that the network proposes several possible moves, with different probabilities. This neural net is called the “actor” (or policy) network: given a situation, it proposes legal moves. This is imitation learning, because it is based on mimicking human moves.
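The output side of such a policy network can be sketched simply (the scores and function name below are illustrative, not the real network's): the network produces one score per intersection, and a softmax turns these scores into probabilities over the legal moves only.

```python
import math

# Sketch of a policy network's output head: one score per candidate move is
# turned into a probability distribution, with illegal moves masked out.
def move_probabilities(scores, legal):
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, legal)]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 1.0]          # raw scores for 4 candidate moves
legal = [True, False, True, True]       # move 1 is illegal in this position
probs = move_probabilities(scores, legal)
# probs sums to 1; the illegal move gets probability 0,
# and the highest-scored legal move gets the largest probability.
```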

Learning by imitation might seem disappointing at first sight, because it only aims at reproducing the behavior of an existing expert; but it is a good first step for learning (in the case of Go, the neural network is later trained by self-play). In this picture, table tennis learning at Tuebingen MPI-IS.

In a second step, the deep network plays against itself or earlier versions of itself. The network learns using the REINFORCE algorithm: using gradients (again), it increases the probability of moves which lead to more victories. In addition, the games are stored; they are used to train a second deep convolutional network (the value network, also known as the critic network), which evaluates positions (for each position, the network provides a probability of winning).
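The REINFORCE idea fits in a few lines. The sketch below uses a two-move bandit as a stand-in for self-play outcomes (the win probabilities and learning rate are illustrative): the gradient of the log-probability of the chosen move is scaled by the reward, so moves that win more often become more probable.

```python
import math
import random

# REINFORCE sketch: increase the probability of moves that lead to wins.
random.seed(0)
theta = [0.0, 0.0]            # one preference per move

def probs(t):
    e = [math.exp(v) for v in t]
    s = sum(e)
    return [v / s for v in e]

win_prob = [0.8, 0.2]         # toy "self-play": move 0 wins far more often
lr = 0.5
for _ in range(2000):
    p = probs(theta)
    a = 0 if random.random() < p[0] else 1                  # sample a move
    reward = 1.0 if random.random() < win_prob[a] else 0.0  # play it out
    for i in range(2):
        # gradient of log pi(a): 1 - p[a] for the chosen move, -p[i] otherwise
        grad = (1.0 if i == a else 0.0) - p[i]
        theta[i] += lr * reward * grad

p = probs(theta)  # after training, the winning move dominates
```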

Unlike Chess and many other games, Go, as well as some other games which are hard for computers, has no simple evaluation function: there is no simple function which, given a board with stones, outputs an approximate probability of winning. For these games, some pioneering works have shown that a simple methodology, in which random games are played and the frequency of games won by Black is taken as an estimate of the quality of the position for Black, is not that bad: this is termed the Monte Carlo approach. B. Bouzy developed various works in this direction, originally proposed by B. Bruegmann. T. Cazenave still uses it for very hard games, such as Phantom Go, for which it is still the state of the art.
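The Monte Carlo evaluation idea can be sketched on a game far simpler than Go (single-pile Nim here, as a stand-in for a Go position): play many games to the end with uniformly random moves, and use the observed win frequency as the value of the position.

```python
import random

random.seed(0)

def random_playout(pile, player):
    """Nim: players alternately take 1 or 2 stones; taking the last stone wins.
    Both players move uniformly at random; returns the winner (0 or 1)."""
    while True:
        take = random.choice((1, 2)) if pile >= 2 else 1
        pile -= take
        if pile == 0:
            return player  # this player took the last stone
        player = 1 - player

def monte_carlo_value(pile, player, n_playouts=10000):
    """Estimated winning probability of `player` = frequency of won playouts."""
    wins = sum(random_playout(pile, player) == player for _ in range(n_playouts))
    return wins / n_playouts

value = monte_carlo_value(pile=4, player=0)
# Under random play, the exact value for a pile of 4 is 0.625,
# so the estimate lands close to that.
```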

Left: a naive Monte Carlo. Right: a smarter Monte Carlo. Y. Wang has been a pioneer in the art of using “smart” Monte Carlo in MCTS.

This approach, nonetheless, lacks a critical component, at least for fully observable games: a good combination with tree search.

Artificial intelligence algorithms for games are usually either alpha-beta (with tree search and an evaluation function as central components; the latter is used as a bias in the former) or Monte Carlo; R. Coulom’s article in 2006 provided a new tool, MCTS. The principle of MCTS is to introduce, dynamically, a bias in the Monte Carlo simulations: moves leading to more victories become more likely to be played. Therefore, the simulations become smarter and smarter.
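The core of this dynamic biasing is the bandit-style selection rule at each tree node. A minimal sketch with the classical UCB1 formula (the class and the toy win rates are illustrative): moves whose simulations win more often are selected more, but rarely tried moves keep an exploration bonus.

```python
import math
import random

random.seed(0)

class Node:
    """Statistics of one MCTS tree node: per-move visit and win counts."""
    def __init__(self, n_moves):
        self.visits = [0] * n_moves
        self.wins = [0.0] * n_moves

    def select(self):
        """UCB1: exploit high win rates, but keep exploring under-tried moves."""
        total = sum(self.visits) + 1
        best, best_score = 0, -1.0
        for m in range(len(self.visits)):
            if self.visits[m] == 0:
                return m  # try every move at least once
            score = (self.wins[m] / self.visits[m]
                     + math.sqrt(2 * math.log(total) / self.visits[m]))
            if score > best_score:
                best, best_score = m, score
        return best

    def update(self, move, won):
        self.visits[move] += 1
        self.wins[move] += won

# Toy "simulations": move 0 wins 70% of the time, move 1 only 30%.
root = Node(2)
for _ in range(1000):
    m = root.select()
    root.update(m, 1.0 if random.random() < (0.7 if m == 0 else 0.3) else 0.0)

best_move = max(range(2), key=lambda m: root.visits[m])  # the stronger move
```

After the simulations, the move with the most visits is played; the weaker move was still explored, but far less often.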

This approach has been critical for the first milestones of Computer Go against humans, with the first wins against pros in 9x9 (without handicap) and in 19x19 (with handicap), thanks to parallel machines.

The first win in 9x9 against a top pro, without handicap, playing the disadvantageous side (komi 7.5, the computer plays Black). This was achieved by MoGo (2009).

Diagram of the use of deep networks for biasing Monte Carlo tree search.

The combination is realized as follows:

- First, imitation learning provides a policy which, for any possible situation, gives a list of legal moves with probabilities. The resulting network plays at a good amateur level, very quickly. Convolution is used to speed up learning. This is the first actor deep network.
- Then, learning by self-play: the network plays against itself and increases its probability of playing moves which lead to higher winning rates in simulations. This provides a better deep network and, also, a large set of positions.
- These positions are given to another deep network, used as a predictor (critic) able to estimate the value of a position.
- Finally, the MCTS part plays as follows:

- the critic deep network is used for biasing the MCTS into interesting parts of the tree of possible futures;
- the deep network policy (actor) is used for generating good MCTS moves, expanding the tree in promising directions;
- the classical MCTS principle, i.e. increasing the likelihood of moves when they lead to better situations, is used as in classical MCTS.
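The combination above can be sketched as a prior-biased selection rule (a PUCT-style formula; this is a hedged illustration of the principle, not necessarily the exact formula used by AlphaGo, and all names and numbers below are illustrative): the actor's prior pulls the search toward plausible moves at first, and the accumulated simulation values take over as visits grow.

```python
import math

def puct_select(moves, prior, visits, value_sum, c_puct=1.0):
    """Pick the move maximizing q + u, where q is the mean simulated value
    and u is an exploration bonus weighted by the actor network's prior."""
    total = sum(visits[m] for m in moves)
    def score(m):
        q = value_sum[m] / visits[m] if visits[m] > 0 else 0.0
        u = c_puct * prior[m] * math.sqrt(total + 1) / (1 + visits[m])
        return q + u
    return max(moves, key=score)

# Two candidate moves; the actor gives "a" a much higher prior probability.
moves = ["a", "b"]
prior = {"a": 0.9, "b": 0.1}
visits = {"a": 0, "b": 0}
value_sum = {"a": 0.0, "b": 0.0}

first = puct_select(moves, prior, visits, value_sum)  # picks "a", the high-prior move
```

Early on, u dominates and the search follows the actor's suggestions; once many simulations have been run, q (fed by the critic's evaluations and the playout results) dominates, which is the classical MCTS behavior described above.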

AI designed by Google DeepMind, combining the latest artificial intelligence technologies (deep neural networks, Monte Carlo tree search).

(source: Wikipedia)

Fan Hui (https://fr.wikipedia.org/wiki/Fan_Hui) is a 2-dan professional (2p). He is usually considered the best Go player in Europe; the best players in the world, however, are in Asia.

Five games were played formally, with time settings chosen by the human player; the machine won 5 out of 5. Then, the human requested blitz games (3 byo-yomi periods of 30s); the machine won 3 games out of 5.


- Tristan Cazenave (Université Paris-Dauphine), expert in Monte Carlo tree search, author of a program which regularly wins Phantom Go competitions
- Bruno Bouzy (expert in Monte Carlo tree search; Université Paris 5), one of the first authors of Monte Carlo Go
- Sylvain Gelly (Google Zürich), behind the first wins against humans in 9x9 Go, at the time a PhD student at Université Paris-Sud, Tao team, LRI
- Yann LeCun (now at Facebook research, originally at an American university)
- Rémi Coulom (formerly at Université Lille and Inria; now in the private sector), originator of Monte Carlo tree search

- Convolutional neural networks: Yann LeCun, L. Bottou, Y. Bengio, P. Haffner, 1998 (http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
- Pioneering work on Monte Carlo Tree Search: Rémi Coulom, 2006. (https://hal.inria.fr/inria-00116992/document/ )
- Pioneering work on Monte Carlo Go: B. Bruegmann, 1993. (http://www.ideanest.com/vegos/MonteCarloGo.pdf )
- Clever Monte Carlo for Monte Carlo Tree Search: S. Gelly, Y. Wang, R. Munos, O. Teytaud, 2006. ( https://hal.inria.fr/inria-00117266v2/document )
- Pioneering work on deep networks: K. Fukushima, 1980 (“neocognitron”, http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf ).
- Deep networks for learning a policy at Go: Clark and Storkey, http://jmlr.org/proceedings/papers/v37/clark15.pdf
- Gradient backpropagation for neural network learning: Rumelhart, Hinton, and Williams, 1986 (http://www.nature.com/nature/journal/v323/n6088/abs/323533a0.html )

[1] In the game of Go, stones are put on line intersections.

[2] Inspired by the pioneering work of Yann LeCun (1998).