Hi everyone, and welcome to this webinar on reinforcement learning in Minecraft: challenges and opportunities in multiplayer games. My name is Diego Perez-Liebana. I'm a lecturer in computer games and AI at Queen Mary University of London, and I'm part of the Game AI group, as are two of my co-presenters in this webinar, Raluca Gaina and Martin Balla, who will talk after me; they are also part of the Game AI group as PhD students at Queen Mary University of London. Our final speaker is Sam Devlin from Microsoft Research, who will close this presentation.

During this webinar we're going to cover four main parts. In the first one, I'm going to introduce the main concepts and ideas in general game AI research. After me, Raluca Gaina will talk about the different games that have been implemented in Minecraft and can be used for this particular area of research. After Raluca, Martin Balla will talk about how to train different reinforcement learning agents in Malmo, the main platform for multi-agent games in Minecraft. Finally, Sam Devlin will talk about several open questions and the research that is happening at the moment in this domain.

So, starting with the first block: general game AI research. Traditionally in game AI, something we have done quite often is to take different games and do research on those games. For instance, you could create an AI agent that is going to play one particular game. It could be a 3D game like Minecraft, a two-dimensional game like Seaquest, or another navigational game in which you have to control a ship in an environment and visit a series of waypoints. Normally, the way this is framed is that you have a particular agent and you try to create this AI so that it plays the game as well as it can. While this advances the field, in many cases we are interested in approaches that instead focus on one
AI agent that is able to play multiple games. This allows researchers to focus exclusively on the AI methods, on the actual technology behind the agents, rather than on the specifics of every game, so that we can advance AI and build methods that are applicable to many different domains, not only one particular one.

Examples of this can be found in the literature. One you might have heard of is the work on the Atari Learning Environment using deep convolutional neural networks and deep Q-networks, which receive an image of the game as input, try to extract features from it automatically, and then map those features to the different outputs the agent can produce, essentially the actions the agent can take in the game. By progressively and repeatedly playing these games, the agent learns and plays better and better as time progresses.

Another example is the General Video Game AI (GVGAI) competition. This is something we developed at Queen Mary, and essentially the idea is that you can play multiple arcade games in a setting in which you don't have to train on them: you have access to the model of the game, and you can plan which actions are best. You can play many games, from Space Invaders or Sokoban to more complex ones like Portals, or others such as Butterflies or Zelda. This was run as a competition and allowed people to train and prepare agents without even knowing the games those agents would be playing later on.

As a more general example, we have Stratega, a general strategy games framework that we have been developing. This tries to tackle more complex situations in which games require decisions in multiple dimensions: for instance, you might be deciding how to research different technologies, or you might be controlling multiple units that you need
to manage at the same time, or you might be trying to manage an economy, or playing against multiple enemies. We are developing this framework to allow you to play many different strategy games at once, focusing research on the difficult decision-making questions these domains raise.

But one clear example, one of the main platforms at the moment for multi-game development, is Project Malmo, which is developed by Microsoft and uses Minecraft as a platform for AI research. You can find all the information in these links as well, but in general it takes the game Minecraft, with all its complexity, and offers a rich diversity of games that you can use for training and learning with your agents.

The main design principles of Malmo are, first, a low entry barrier: you can create your own agents in the language you prefer, be it Java, .NET, C++, Python, or Lua. It's cross-platform, so you can use it on multiple operating systems such as Windows, Linux, or macOS. It provides an API for agents, so you can create an agent by implementing several functions that interact with the environment. The environments themselves are defined in XML: you can specify what the world looks like, the types of objects and elements incorporated into it, and also the task itself that you want to set, defining the actions available to the agents, the rewards that exist, how an episode ends, and what you have to do to win.

So, given all these possibilities, all these languages for defining different games, with Malmo we are moving from executing agents in one particular domain to a wide range of domains you can work in, and this allows us to focus on more general multi-task learning rather than on the narrow AI of one single environment.
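As a flavor of what these XML mission definitions look like, here is a rough sketch of a single-agent mission. The element names follow Malmo's schema as best I recall it, so treat this as illustrative rather than a guaranteed-valid mission file:

```xml
<Mission xmlns="http://ProjectMalmo.microsoft.com">
  <About><Summary>Reach the goal block</Summary></About>
  <ServerSection>
    <ServerHandlers>
      <FlatWorldGenerator/>
      <!-- end the episode after 30 seconds -->
      <ServerQuitFromTimeUp timeLimitMs="30000"/>
    </ServerHandlers>
  </ServerSection>
  <AgentSection mode="Survival">
    <Name>Agent0</Name>
    <AgentStart><Placement x="0.5" y="4" z="0.5"/></AgentStart>
    <AgentHandlers>
      <!-- 84x84 RGB observations -->
      <VideoProducer><Width>84</Width><Height>84</Height></VideoProducer>
      <DiscreteMovementCommands/>
      <!-- reward and quit condition define the task -->
      <RewardForTouchingBlockType>
        <Block type="diamond_block" reward="1"/>
      </RewardForTouchingBlockType>
      <AgentQuitFromTouchingBlockType>
        <Block type="diamond_block"/>
      </AgentQuitFromTouchingBlockType>
    </AgentHandlers>
  </AgentSection>
</Mission>
```

The same file format carries the world definition, the observation and action handlers, the rewards, and the episode-end conditions that Diego lists above.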
Not only that, but you can also use Malmo to have multiple agents in the same game. For instance, if you're playing a football game, or capture the flag like the examples on the screen, you could have multiple agents, each one controlled by a different AI algorithm or controller, either competing in the same game or cooperating to achieve a common goal. You could even have a human playing with all these AIs, and there is an interesting aspect of collaboration and competition between the AIs and the humans that can also be explored in this domain.

That was the introduction to Project Malmo, and now Raluca Gaina is going to talk about the different games that have been created: examples of tasks that are interesting for doing AI in this particular environment.

Thank you, Diego. My name is Raluca, and I will be talking about the multiplayer games available in the framework. We currently have three different games, all parameterized, with procedural level generation behind them as well; that allows many variations of each environment to be generated, giving diverse and challenging training settings for general players.

The first game is Mob Chase, previously known as Pig Chase, which was used in a previous Microsoft collaborative AI challenge. The point here is for two different agents to corner a mob in a fenced-off play area. They get a large reward for catching the mob, and a smaller individual reward if they move to an exit block instead, which could be the better choice if the agent's partner is not actually cooperating. To play this game, the agents have three actions available: they can move forward, move backward, or turn. Their main goal, catching the mob, awards them the most points, and they have a secondary goal of reaching an exit, which
also ends the game but awards fewer points. All of the games we have here also have a maximum number of commands that can be sent by the agents, to encourage them to complete the problems as fast as possible.

With the current parameter options, which can easily be modified and extended to cover more potentially interesting spaces, we can get over 6 million variations of Mob Chase, not including the various level layouts resulting from the chosen parameters. We can vary the look of the game, and the number of objects in the level and their positions within a play area of varying size, all of which can also be used to tweak the difficulty of the game if needed. These are some examples of what the generated games might actually look like in practice. Mob Chase specifically targets AI skills based on cooperation, chasing, navigation, and exploration.

The next game we're looking at is Build Battle, in which two different agents compete to build a given structure. One agent placing a block correctly awards it points, which are subtracted from the opponent. Here the agents can move forward, move backward, or turn, but they can also jump, and they can place or destroy blocks. The main goal of the game is to copy the given structure, which awards one point, but here we also see a more granular reward structure: a small number of points is awarded for placing blocks correctly or for removing incorrectly placed blocks, and the opposite for placing blocks in incorrect positions or removing blocks that had already been placed correctly, with the opposite reward scheme applied to the opponent. Here we can obtain over 10,000 variations of the game, not including different level layouts, varying things like the size of the structure to be built, the overall level size, the look of the game through the types of blocks used, and the percentage of the structure that is already built for the players, to generate interesting challenges of varying complexity.
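As a toy sketch of the granular, zero-sum reward scheme just described (the actual point values in the missions may differ; these numbers are made up for illustration):

```python
# Toy sketch of Build Battle's zero-sum granular rewards.
# The event names and point values are illustrative, not the
# ones used in the actual missions.
STEP_REWARDS = {
    "place_correct": 1,      # block placed where the target structure has one
    "remove_incorrect": 1,   # wrongly placed block removed
    "place_incorrect": -1,   # block placed in a wrong position
    "remove_correct": -1,    # correctly placed block destroyed
}

def score_event(event: str) -> tuple:
    """Return (builder_reward, opponent_reward).

    Whatever one agent gains is subtracted from the opponent,
    which is what makes the game competitive."""
    r = STEP_REWARDS[event]
    return r, -r
```

For example, a correctly placed block gives the builder +1 and the opponent -1, while destroying a correctly placed block does the reverse.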
To look at this game in action, we can see two different instances here. Build Battle specifically targets AI skills based on navigation, puzzle solving, and resource management.

And lastly we have Treasure Hunt, in which one agent is defenseless and has the mission of collecting treasure, the brightly colored blocks in the pictures and videos shown here, and then making it to the exit of the dungeon as well. The other agent is equipped with a sword and must protect the collector from the enemies as they navigate the dungeon together. Multiple teams of collector and protector players could be added to the same game for an interesting competitive angle. In this game the agents can again move forward, move backward, or turn, but they also have different abilities depending on their particular role in the challenge: the protector can attack enemies, and the collector can interact with the items in the level and collect them. The overall goal is for the collector to reach the exit, and we again see a more granular reward structure, with points awarded for each treasure block collected, and agents penalized if anyone in their team dies. Over 3 billion variations of this game can be created by varying the look of the game, combat aspects, and the number of enemies faced, as well as the overall and detailed configuration of the dungeon. In these examples we're only seeing some one-room play areas, but larger and more complex dungeons can easily be generated. And this is what the game looks like in action.

Treasure Hunt is one of the more complex scenarios we have available, targeting many AI skills such as navigation, exploration, object transportation and escort, aiming, defeating enemies, chasing, fleeing, and obstacle and harm avoidance. Overall, the games available in the framework pose a wide variety of challenges, requiring the agents to exhibit many different skills in order to successfully solve all of the problems.
Thank you very much for listening, and I will pass on to Martin, who will be talking about actually training agents to play such games.

Thank you, Raluca. Hi, my name is Martin Balla, and after seeing what the missions in Malmo look like, we turn our attention to how to get an agent to successfully play these games. For this we use reinforcement learning. In reinforcement learning, an agent interacts with an environment: the environment gives the agent an initial observation, and based on that observation the agent picks an action. The action goes into the environment, and the environment updates its internal state and returns the next observation along with a reward. The agent's objective is to accumulate the highest discounted reward it can. The observations in Minecraft are in the form of RGB pixels, and the action space in the simplest form can be moving one grid cell forward or turning 90 degrees left or right. The reward in the simplest case is minus one or one: one when the agent successfully completes a mission, and minus one when it fails.

Next, let's have a look at how to actually set up Malmo on your own machine; it got a bit simpler than with previous versions. The first step is to get Java version 8 and Python 3 installed. Unfortunately, Minecraft runs with an older version of Java, so it requires Java 8 specifically; it won't work with newer versions. After you have Java and Python installed, the next step is to clone the repository, and once it's cloned you can install Malmo using pip. If you want to run the examples and tutorials from the repository, you can optionally install the examples sub-package, but to get it running on your own machine I refer you to the project's README.

For our examples we use RLlib. RLlib is part of the Ray project; you might have come across this, as Ray is a popular Python package for speeding up computation through parallelization. RLlib provides high-quality, state-of-the-art, easily scalable RL algorithms.
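The agent-environment interaction loop Martin describes can be sketched with a stand-in environment. This toy class is purely illustrative and is not Malmo's actual API; it just mimics the observation/action/reward cycle and the ±1 terminal reward of the simplest missions:

```python
class ToyEnv:
    """Stand-in for a Malmo mission: the episode ends after five steps,
    and the final reward depends on the agent's last action."""

    def reset(self):
        self.t = 0
        return "obs0"                       # initial observation

    def step(self, action):
        self.t += 1
        done = self.t >= 5
        # +1 only when the episode ends with a "successful" action
        reward = 1.0 if (done and action == "move") else 0.0
        return f"obs{self.t}", reward, done


def run_episode(env, policy, gamma=0.99):
    """Accumulate the discounted return the agent tries to maximize."""
    obs = env.reset()
    ret, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(obs)                # agent picks an action
        obs, reward, done = env.step(action)
        ret += discount * reward            # discounted accumulation
        discount *= gamma
    return ret
```

A policy that always issues `"move"` earns the terminal reward discounted by the four steps it took to get there, i.e. `0.99 ** 4`.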
You can run the same algorithm on your laptop and easily transfer it to a cluster; you just have to change the resources you have available, so you can specify more CPUs and more GPUs and RLlib automatically handles that. It supports both TensorFlow and PyTorch, so if you don't want to learn one framework or the other, it's not a problem.

We prepared a few notebook tutorials to lower the entry barrier for Malmo. We recommend you go through them in the order listed on the slide, as they build from basic concepts to more advanced ones. The first tutorial is about how to run a random agent in Malmo: it shows you how to set up the environment, and then it just samples random actions at each time step. The next tutorial shows how to use RLlib to run a PPO agent on a sample mission. Then we go through how to restore a checkpoint and how to evaluate it, and finally we show an example of how to run a multi-agent experiment with Malmo and RLlib.

The first notebook is about how to run a random agent in Malmo. The first thing we have to do is create the environment by calling malmoenv.make, and then for the initialization we have to pass the mission file along with a port; that port is going to be used to communicate between the Java and the Python process. We also have the launcher, which automatically starts up the Java Minecraft instance; it just requires an array of ports (in this example we only use a single port) and a launch script that we explain in the next slide. Then we can just use the normal reinforcement learning loop, and in this example, while the episode is not over, we sample a random action at each time step.

It's worth highlighting how the launcher works. Previously you had to manually start up the Minecraft instances using a Gradle process for each environment that you wanted to use; instead, we created a Python script that does it for you. The launcher takes in an array of ports, and for each port it starts a new Minecraft instance.
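The launcher's job, one Minecraft instance per port, can be sketched as follows. The script name and its `-port` argument are my assumptions about the repository's launch script, so check the project's README for the real interface:

```python
def build_launch_commands(ports, launch_script="./launchClient.sh"):
    """One Minecraft instance per port.

    In the real launcher each command would be handed to
    subprocess.Popen to start a Java Minecraft process; here we
    only build the command lists, so nothing is actually launched."""
    return [[launch_script, "-port", str(p)] for p in ports]
```

For a single-port example you would pass `[10000]`; for a multi-agent mission, one port per agent.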
instance and it also takes a launch script the launch script Is a bash file that defines the launch options for minecraft so by default it renders a window on your desktop but that doesn’t work on for example had less linux machines if you want to train it for longer periods on a server and you can use xvf to export the display and That helps you running it that was one of the issues that we had with the earlier versions of mamo the next example is how to run a ppu agent in malmo ppo is one of the state-of-the-art reinforcement learning algorithms it stands for proximal policy optimization and we are using the tune api That helps you run an experiment so it’s quite straightforward to run it in the first line as the first argument you define the algorithm so in this case we use ppo then we give a config which is a python dictionary that defines the algorithm’s parameters so it defines the Learning rate and the available resources so you can define how many cpus you want to allocate and how many gpus then we set the stopping conditions that it makes the training stop after a certain number of environment agent interactions and then we have some checkpoint argument it creates a checkpoint at the End and also throughout training after every few in the algorithm iterations it makes a checkpoint and finally we set the log there so everything that rle blocks it just saved at a specific place after training we can visualize the 10 server output that rlip automatically saves we don’t have to wait Until the end of the training we can visualize it during training and we have some a few example what kind of data we can visualize so it shows the average length the episode text 12 training the maximum and the average rewards that it collected during the episode and Rlip collects much more data so for example you can visualize even cpu usage or memory usage and sampling time in milliseconds there are much more these are just a few examples and then after we trained an agent we 
The GIF on the right is recorded using the screen recorder that we also provide in the framework; it takes the observations the agent receives and saves them as a GIF or an MP4 file. The mission was the single-agent Mob Chase; in this example the agent just has to navigate to the brick block, which the agent does very nicely in the GIF.

In the next tutorial we have a look at how to restore a reinforcement learning agent in Malmo. One use case for this is that you have trained an agent and have the checkpoint, but you want to visualize what it actually learned. In this case you can change the configuration: you may want to use just a single CPU and, for example, no GPU for the visualization, and you can switch off exploration so the agent picks the best action it can at each time step. Then we can use the same ray.tune function as before, but now, as the last argument, we use the restore flag and provide the checkpoint file we saved earlier.

Then we may want to evaluate an agent. As you've seen, the Tune API doesn't have the same flexibility as handling the agent directly: we may want to use a different level, or we may want to get more insight into what the agent does in the environment. In this case it's better to load the agent's trainer, so here we load the PPO trainer, restore the checkpoint file we had, and then we can access the policy directly. In the reinforcement learning loop that we saw in the random agent example, we can call policy.compute_action and give it the current observation; it doesn't just return the action, it also provides the action distribution, the value function, and any other algorithm-specific quantities (for example, it can provide Q-values). We can then take the best action the algorithm has and send it to the environment.

Before we move on to the last tutorial, on multi-agent reinforcement learning, let's have a look at how the multi-agent setup works in Malmo.
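The evaluation idea, querying the policy directly and picking the greedy action, can be sketched with a stub in place of the restored trainer. In the real workflow these calls would go through RLlib's restored PPO trainer; the stub here (its logits included) is entirely made up so the sketch runs standalone:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

class StubPolicy:
    """Stand-in for a restored RL policy: returns an action plus
    extras such as the action distribution and a value estimate,
    mirroring the shape of RLlib's compute_action output."""

    def compute_action(self, obs, explore=False):
        logits = [0.1, 2.0, -1.0]          # pretend network output for obs
        dist = softmax(logits)
        action = dist.index(max(dist))     # greedy when exploration is off
        return action, {"action_dist": dist, "value": 0.42}

action, extras = StubPolicy().compute_action("some observation")
```

With exploration switched off, the agent deterministically takes the highest-probability action, which is exactly what you want when visualizing learned behavior.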
So far we have only had single-agent examples, where each Minecraft instance was attached to a single agent. In the multi-agent setup there are multiple roles: for each environment, the agent with role 0 is going to act as the server, and all the other roles act as clients. Once we establish the mission, all the clients connect to the server, and they have synchronized observations. To do that in code, we have to use an additional helper function to create a multi-agent environment, where we define the agent configuration; that's where we assign the roles to the agents. Then we use the turn-based RLlib multi-agent wrapper. It's turn-based in the sense that each agent acts at each time step in lock-step, rather than acting asynchronously in its own time.

Then, in a similar way, in the tune training function we now use the multiagent argument, where we define how the policies are set up. In this example we use a shared policy, which means that both agents are going to have identical weights, but they act in a decentralized manner: they don't share information or observations with each other. To do that we also need to define the observation space, the action space, and a policy mapping function, but you can find more information on this in the RLlib documentation.

Okay, so after you've trained the PPO agent on the multi-agent setup, you should see something like this. On the left side, the agent with role 0 acts as the server, and the right-side window is the client. In this mission, the agents should collaborate and catch the chicken, or they can decide not to collaborate and just move to the sand tile, which is what one of the agents does. Sam, in the next section, is going to talk about ways you can make reinforcement learning agents learn how to collaborate.

To sum up what is in the Queen Mary Malmo repository: we added the launcher, which wasn't in the previous repository, and updated the pip package, so it's more convenient to call the launcher, for example.
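The shared-policy setup can be sketched as an RLlib-style multiagent config. The key names follow RLlib's conventions, but the observation and action space entries are placeholder strings here (they would normally be gym spaces), so treat the details as illustrative:

```python
def policy_mapping_fn(agent_id):
    """Map every agent to the same ("shared") policy, so both agents
    use identical weights but still act on their own observations."""
    return "shared"

multiagent_config = {
    "multiagent": {
        # (policy_class, obs_space, act_space, config); None means
        # "use the trainer's default policy class". The space entries
        # are placeholders standing in for real gym spaces.
        "policies": {"shared": (None, "obs_space", "act_space", {})},
        "policy_mapping_fn": policy_mapping_fn,
    }
}
```

Because every agent id maps to the same policy, training is centralized over the shared weights while execution stays decentralized: each agent only ever sees its own observation.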
We also have updated documentation, and the notebooks I mentioned previously, and we provide plain Python scripts as well, so if you don't want to use the notebooks that's fine, you can use the scripts instead, which might be easier to use in your own project. We also provide some additional helper functions: for example, an observation wrapper that converts the Minecraft observations to any arbitrary size, the video recorder you saw in the single-agent recording earlier, and a symbolic representation extractor. So far all the examples were based on image input, but the symbolic representation extractor gives you a top-down symbolic view of the environment, which could be helpful for your project. Finally, we provide two PPO checkpoints: one for the single-agent Mob Chase mission and one for the two-player Mob Chase mission. Next, Sam is going to talk about how to learn to collaborate.

Thank you, Martin. I'm Sam Devlin, a senior researcher at Microsoft, and in this final section I want to talk a little bit about what happens when we try to apply single-agent reinforcement learning to multi-agent tasks such as the games we've talked about so far today. In this section I want to include some of our recent research that provides scalable approaches to learning coordinated policies in these complex games.

In all the work we've seen so far today, we've seen issues when directly applying single-agent reinforcement learning algorithms to multiplayer tasks; this is why the agents in Martin's section didn't coordinate. In particular, the challenge is that if we just naively place multiple individual learners into the same environment, the environment appears non-stationary to any one of them. That is to say, the same observation and action map to different outcomes, because the other agents are also updating their policies at the same time in the environment.
This causes issues by breaking fundamental assumptions in how single-agent reinforcement learning algorithms are designed. In particular, for deep reinforcement learning, where it's common to use a replay buffer, it can cause issues where experiences become stale, because they depend on the previous policies of the other agents in the environment.

An alternative approach is to just group all agents as one: if we're trying to learn a joint policy for a group of agents that we're controlling, then why not stick them all into one big agent that controls all of them? This can be done in certain use cases, but it leads to an exponential increase in the state-action space, which in turn requires even more training time. As deep reinforcement learning algorithms are typically very sample-intensive, this exponential increase in the state-action space can be very problematic, making it intractable for many people to train agents this way. From our perspective, looking at gaming applications, even worse than this, it can be considered cheating: a lot of these games are designed so that you have partial observability based on where you currently are in the space, so if we allow all the agents in the game to see what everyone else is doing, then that's not how the game is played.

One way to work around these two issues is to use the paradigm of centralized training for decentralized execution. In this approach the agents are considered as one whilst training, so we use all of the data that can be generated from all agents during training, and then we learn policies that can be decoupled at the time we deploy them. This framework is used a lot in many modern deep multi-agent reinforcement learning approaches; it uses the same assumption as many recent distributed reinforcement learning algorithms such as IMPALA, and it's the simplest but most effective way to quickly learn a coordinated policy. In particular, it is often used to
learn a centralized critic, and there's an example implementation in the RLlib docs for how to do this with the setup we've provided and talked about in the earlier sections. In any project where I'm trying to learn coordinated policies for multiple agents, this approach with a centralized critic would always be my first go-to, as a safe bet for learning a reasonably good policy.

As we dive deeper into some of the problems that come up, I want to take a look at this particular instance of one of the games we talked about earlier. In this situation, one of the agents has trapped the mob in the far corner, and the other one is not really doing anything; it's just standing in the corner looking at what's going on. In a team game where both agents are rewarded the same based on how they perform as a team, both agents here would receive a positive reward, because one of the agents has captured the mob. But the other agent, who's not really helping, will also get a positive reward, and so may learn that standing in the corner doing nothing is a useful behavior. This is known as the multi-agent credit assignment problem: how do we break down a global reward so that each agent understands what its contribution is, and learns policies that actually contribute towards the team?

One approach to tackling this is difference rewards. This was originally proposed by David Wolpert and Kagan Tumer in a 1999 NASA tech report under the name of the Wonderful Life Utility. Formally, instead of receiving just the shared team reward the game gives, each agent receives a shaped reward which is the difference between the global team reward and what it would have been if that agent didn't exist, or had followed some sort of default policy. By doing so, we get to reward the agents based on their actual contribution to the game rather than on how the team is doing overall. If each agent tries to maximize
its contribution, then typically the team will perform better. This is a multi-agent-specific form of reward shaping, designed to remove the noise created by the actions of the other agents in the environment, so each agent gets a clear signal about what it is actually contributing to the game.

This approach was proposed originally in 1999, and there was over a decade of very successful applications of it to collaborative games. However, it made one fundamental assumption: that you had direct access to the reward function, so that you could calculate that far-right-hand term, the global reward as it would have been if you weren't there or had taken a default action. This isn't always possible, and so in a 2014 paper, Mitchell Colby, Kagan Tumer, and colleagues proposed instead that you learn the reward function, so that you can then query it for this second term, to find out what the reward might have been if you weren't in the environment.

This line of work was then built on further by Jakob Foerster and colleagues in their AAAI 2018 paper, which extended it into a deep reinforcement learning approach. They instead use the value function of a centralized critic: again using the paradigm of centralized training with decentralized execution, they learn a centralized critic and then calculate the difference rewards from the central value function instead. This allowed them to scale up to some tasks within StarCraft and other complex games, using the flexibility of deep nets as value function approximators.

In a more recent, upcoming paper that I was involved in, we proposed Dr.Reinforce, another deep reinforcement learning approach that instead returns to the concept of learning the reward function. This is because learning a centralized Q-function can be prohibitively challenging: if we can learn just the reward function, rather than the more forward-looking Q-function, we might be able to get a more efficient way of estimating the difference rewards, while still gaining the benefits that COMA had by using deep reinforcement learning to learn the policy.
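The difference reward, the global reward minus what it would have been with the agent replaced by a default action, can be illustrated with a toy team task. The global reward function here is invented purely for illustration:

```python
def global_reward(actions):
    """Toy team reward: 1 if at least two agents 'chase' the mob, else 0."""
    return 1.0 if sum(a == "chase" for a in actions) >= 2 else 0.0

def difference_reward(actions, i, default="noop"):
    """D_i = G(z) - G(z with agent i replaced by a default action).

    This isolates agent i's actual contribution to the team reward."""
    counterfactual = list(actions)
    counterfactual[i] = default
    return global_reward(actions) - global_reward(counterfactual)

team = ["chase", "chase", "noop"]
# The idle third agent gets D = 0 even though the shared team reward is 1,
# while each chasing agent gets D = 1: removing either would lose the mob.
```

This is exactly the credit-assignment fix from the Mob Chase example earlier: the agent standing in the corner no longer gets credit for its teammate's capture. Note that computing the counterfactual term this way requires direct access to the reward function, which is the assumption the later learned-reward-function approaches relax.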
Let's have a look at how this performs in practice. In this paper we considered a smaller version of our Mob Chase game, where we have three agents trying to capture a mob, or prey, pictured as the red square. By experimenting in this simpler environment, we were able to compare against earlier methods that had direct access to the reward function, that past decade of work I alluded to earlier. When we have three agents like this, we can see that all methods perform relatively well. The line to look at in particular is the green line in the top-left plot: the best-performing agent learns very quickly to achieve the highest performance. This agent is the one that used the previous assumption that you have access to the reward function, so it can directly calculate the difference reward rather than approximate it. Our agent, in the dark blue line, which has to learn the reward function online whilst the agents are acting, is able to recover the same performance fairly quickly after that, but all agents behave fairly similarly; COMA is also competitive here, as is the Colby agent from the original 2014 paper. One of the big things I would take away from this is that any time you use difference rewards, they still outperform the naive approach of just placing multiple individual learners in the environment.

However, as we increase the number of agents in the environment, we see that the effect size becomes larger and the methods start to separate. In this case we see that our approach, Dr.Reinforce, learning the reward function while still using a deep reinforcement learning approach to learn the policy, is able to handle more agents in the environment, which we believe is down to the difficulty
of learning the centralized Q-function in COMA. This is again evidenced in the final plot on the far right, where COMA now performs worse than the original approach by Colby, despite having a more powerful function approximator for the policy; Colby's method uses a more limited one, but again learns the reward function and not the joint Q-value. We're still exploring the cause of all these differences and how this approach scales to more complex environments, but this is one approach, tested in a wide variety of scenarios, for learning more coordinated policies and overcoming that challenge of multi-agent credit assignment.

I want to move on now to a slightly different problem, something we saw when we originally proposed these tasks as a competition in 2018. In that competition, a number of participants submitted agents that performed very well when playing the games with another instance of their own agent. So if we look at the two agents in the left box, they perform very well with another instance of themselves, and similarly with the ones on the right. However, if we take one agent from each of these teams (two agents trained by different teams, both of which perform well when teamed with a copy of themselves), they don't necessarily perform well together. There can be miscoordination due to assumptions each agent makes about the other agent in its environment. Even more bizarrely, we found that if you train two instances of the same agent using exactly the same code base but a different random seed, they can often be uncoordinated and not play the game well together.

This is known as the ad hoc teamwork problem. What we want is agents that are able to play with any other agent without any prior coordination. The agents in this competition so far have strictly relied on the fact that they trained with the other agent, and on the concepts they formed about how they should be
taking different roles within that game, but ultimately we want an agent that can recognize who they're playing with and adapt to them, so they play well with them online. Formally, this extends the normal multi-agent MDP objective, which tries to maximize the accumulated reward of all agents in the environment; the extension just includes an expectation over the other possible agents, so we want the agent to perform well on average across all agents that it's going to play with in the environment. This was particularly well summed up in a challenge paper from AAAI 2010 by Peter Stone and colleagues, where they talk about the human ability for ad hoc teamwork to do things like play pickup games of basketball. As humans, we should be able to play a game that we're good at with anybody that we meet, maybe with a brief period of coordination at the beginning, but we shouldn't completely fall apart and be unable to adapt to them. That's ultimately what we want for the agents, and what we're trying to achieve when taking on the ad hoc teamwork challenge. To go after this challenge, we recently proposed a method that's upcoming at the AAMAS conference this year. This looks like a nightmarish bit of a network, but I can break it down into four simple stages, where this represents the network that is our policy for this agent. First, we observe the behavior of other agents in our environment. This can be done online while the agent is learning, as in this paper, or from a prior batch of data, for instance replays of human players playing this game on public servers. Our network architecture includes an information bottleneck through which these observations must be compressed. From this compressed representation, we then try to predict the future actions of the other agents. The error in our predictions can be used to update the parameters in the blue encoder, to maximize the information retained from the observations that is needed to predict the
future actions. We factor these per agent, which allows us to scale well in the number of agents, but you could also do a joint prediction here for the actions of the other agents. From that training of the top two sections, we get this compressed representation that attempts to capture our belief over the other agents. We did this in a way so that there are two versions, separated here: one that stays stable throughout the entire episode, with which our intention is to capture the play style of the other agent, and another that changes per time step, which is hoping to capture the current mindset. So you might be playing with someone that plays in a particular style, but recognize that they've maybe gone into a particular mode within that style. We do this with a variational autoencoder-like structure, so we are capturing both a current mean estimate and also some variance over it, to try to capture some uncertainty over our current belief of the other agents' play styles and mindsets. In the final stage, we can then condition our policy on that current belief of the other agents. So instead of just trying to choose an action based off of the current state, we choose our action based off of both the current state and our current belief of the other agents in our environment. If those beliefs are currently highly uncertain, our agent may learn to perform information-gathering actions to infer more about the other agents in the environment that it's acting with; or, if it's at a critical stage of the game, it may choose to act despite this uncertainty. For instance, in the mob chase scenario, if the mob were about to escape, the agent might choose to capture it even if it doesn't know that the other agent will be there to support it. Either way, the agent now adapts to others in its environment instead of assuming that they will adapt to accommodate the agent. Finally, to demonstrate this approach in practice, I'll show
this on another small game that we used in the paper. In this game, we have two agents that are collecting coins that they want to take to the bank: red coins must be taken to the red bank, blue coins to the blue bank. The team is rewarded collectively, so this is a fully collaborative game where they both want to maximize the number of coins they can take to the bank in a fixed period of time. We are controlling the agent marked with the bubble, but we have no control over the agent in the bottom left corner. This agent might have a preference for particular types of coins, might prefer to take the coin that's always closest to it, or might always try to take the coin that's furthest away from it. Depending on how that other agent is acting, our agent needs to recognize it, adapt, and play the game differently. And what do we see when we apply our method in this approach? With our method, the Bayesian meta-learning approach here, we see far higher average return than some comparative methods from the literature. First we have the dashed grey line, which is a typical model-free approach with a feed-forward network for a policy; this agent takes no consideration of other agents in the environment, it has just learned on average how to best respond to all of the agents it's seen during training. Alternatively, our approach can be seen as a meta-learning approach, where it has learnt over a population of other agents that it's trained with, so we compared to the RL² algorithm, the green line, which is a state-of-the-art meta-learning approach with a recurrent network. As we can see, both of our approaches significantly outperform these two baseline approaches. If we look a little deeper into this, we can also see a potential cause for why this is happening. In this next plot, what we're trying to do is predict what the other agent in the environment is. This wasn't a part of the training loop for either agent, but what we do is, once the agents are trained, or at
various time steps throughout training, we take the current intermediate representation and attempt to predict from that what the other agent is. This is a separate supervised learning problem, just used to probe what is being learned in the intermediate representations of these agents. And what we see is that, from the intermediate layers of the RL² agent, it remains quite challenging throughout training to predict what the other agent is doing, so this agent is not retaining information about what the other agent in the environment is doing; whereas with our Bayesian meta-learning approach, that intermediate representation can, after training, be used to accurately predict what the other agent is, showing that it has retained information critical to understanding who this agent is playing with. So, to close, I just want to summarize some of the core problems that occur in multi-agent reinforcement learning, and some of the methods that we've talked about today to overcome them. First, if we just naively place multiple single-agent reinforcement learning algorithms into the same environment, then we introduce the problem of non-stationarity, which breaks many standard assumptions in RL algorithms, causing them not necessarily to converge towards an optimal policy. Secondly, we have the curse of dimensionality: if we put all of these agents into one monolithic agent that has to control all of them, then we get an exponential growth in the state-action space, which can be intractable to learn. Then we talked about the multi-agent credit assignment problem, where it's hard for an agent to understand, from a global team reward, what it did to contribute towards that team score; and here we talked about difference rewards as an approach to give more informed credit to the agents that actually contributed to the success of the team. And finally, we talked about ad hoc teamwork, the problem of an agent that has to generalize to another agent which it hasn't
previously coordinated with. So how do you learn an agent's policy that can generalize to a wide range of other partner agents that they might play with? We talked a little bit about some approaches to this, but for interested listeners I would recommend these two surveys, which cover a wide range of past methods: both the pre-deep-reinforcement-learning era, in the first survey from 2008, and also a very recent survey that covers a large portion of the approaches that have been proposed more recently in the deep reinforcement learning paradigm. The approaches that I've covered today have not yet been tested in the multi-agent Minecraft tasks my collaborators presented earlier, so I invite all those on the call today to try these approaches out, and I would love to hear the learnings of anybody in the audience that gained insights by applying these approaches in those multi-agent problems in Minecraft. Finally, before we start the Q&A session, I would just like to invite you all to join us for our upcoming AI and Gaming Research Summit, which will be taking place on February 23rd and 24th, 2021. There's a link there for registration and for any of the resources that we make available after the event. This would be a great opportunity to learn about a far wider range of research in this area, looking both at AI agents in other settings, but also topics such as responsible gaming, computational creativity, and understanding players. Thank you. Hello everyone, many thanks to all of you for attending the Reinforcement Learning in Minecraft webinar today, and also for staying for this live Q&A. I'm Diego Perez, one of the co-presenters of this webinar, together with Raluca Gaina, Martin Balla, and Sam Devlin, who will also be answering your questions. You have probably seen that we've already answered some of your questions in the chat, and now we're going to take some extra questions that have been submitted by you, which we've selected to reply to live. So let's
get started with this. The first one is a question submitted by Lucas from the University of Edinburgh, and I'll read the question out now: did you observe, for the meta-learning work, that the variance and uncertainty of the VAE were interpretable? Could you, for example, observe that the variance increased whenever the other agents executed behavior rarely or never seen during training, and hence that inferring the play styles could be difficult? I think Sam is going to answer this. Sure, thanks for the question, Lucas. In the example that we showed in the talk, it was a little harder for us to interpret what was in the latent representation learned by the VAE, due to the size that we chose to encode it as. However, in the paper that supports this work, we did look at a game where we were able to use a far smaller latent representation, and there a very interpretable representation was learned. We see things like it counting the actions that the other agents were taking, and clearly separating the different types of agents that it trained against. In all of those cases, what we did see was that the variance would reduce over time when it was recognizing a behavior, so once it had seen the history, it had more confidence over which agent it was playing with. But what we didn't explore was how to generalize beyond agents that are in that training set, so I think that's the really exciting part of Lucas's question: having the agent actually recognize when it's playing with someone that it doesn't recognize. I think that's a really key thing for future work; these agents need to be able to learn how to acknowledge that they're in a position where they don't understand what's going on, and so take more information-gathering actions. I know Lucas and the team at Edinburgh are also working on this, so I'm really keen to see what they do in that space; maybe they'll soon have something to share with
us all. All right, thank you. This next question comes from QMUL, and there are actually quite a few questions in one, so I'm going to split it into two parts. Starting with the first one: is the design of the structure of the deep neural network important in RL? Do you use a general neural network for different games, and are the hyperparameters of the DNN tuned for every game? Yeah, I can jump on that. Another good question. Obviously, ideally we wouldn't want to have to tune these specifically for every single game, but in practice that's still the way we're mostly doing things. There's a very interesting line of work on making deep RL more generalizable and more robust, and this is something we've explored; I posted a link in the questions to our NeurIPS 2019 paper on this topic. So we're definitely exploring options for network architectures that allow more generalization; we've particularly found that the variational information bottleneck is beneficial for doing so, but it's still a wide open question, particularly when looking at that challenge of generalizing across different games. We're also exploring this from another perspective: working with colleagues at MSR New York from a more cognitive neuroscience background, who are looking at more neuro-inspired architectures that can encode more human-like priors, to get more human-like behavior out of our agents, which we hope would then be broadly applicable to a whole range of games. And as a follow-up: if the input is an image, do you need state-of-the-art image processing, or can a simple model be used? So this is somewhere where our team differs a little bit. We work very closely with game studios, and in this instance with Malmo; the way Malmo is set up, we can get access to more low-level game state from the game itself. We don't see an advantage in many of our games for operating at that sort of per-pixel level,
so a lot of this work is done off of a lower-level game state, taking advantage of the fact that we have access to the game, so you can learn a lot more sample-efficiently than doing things at the pixel level. Okay, fantastic. We've got another question, from Ashwin Vinay from the University of Buffalo: considering that this is a stochastic environment, how do the agents act based on the probability, and is the choice of states random or based on a certain transition probability? I think Sam can also reply to that. Sure. For the majority of the work that we do, we use policy gradient or actor-critic based algorithms, so that we can learn stochastic policies; particularly in multi-agent settings, this is very important. I didn't quite follow the second half of this, but the choice of states will very much be based off of the transition probability within the environment; it is probabilistic, but these algorithms are able to deal with that. The part that we do sometimes randomize, though, is generating lots of different instances, so that the agents get trained in different settings. This is both for making the agents more robust, a method popularly known as domain randomization, and also something that we did in these environments when we ran them as competitions, to ensure that we could provide test-set environments that the participants hadn't previously trained on. All right, I hope that covers it. Okay, there's another question by Hamsa, sorry if I'm pronouncing the names incorrectly, from Ryerson University: I'm a student, and I don't know too much about employing reinforcement learning in games; how would you recommend I get started in this field? I believe Martin Balla can answer this. Yeah, so this is a good question. I would recommend starting by learning a bit about the theory of reinforcement learning; there are some good lectures
recommend you picking up item programming if you don’t know it already and then open ai scheme have very good examples to get started with it okay so thank you let’s see other questions from kimbeau lee from university of cambridge recent researching draft new networks Also have been actively investigated to build up decentralized multi-agent problems related wars show that this can be helpful for generalization performance that train in 10 agents but tested in a larger scale as also addressed in your work have you considered the applications of gnns In your team um i know some if you can answer this in your team sure um yeah so it’s not something we’ve experimented with yet um but it is a very exciting direction uh i again i know that uh jing bao and the the group at cambridge are doing some Really exciting work in this space so uh we’d love to see uh if that can scale to to these tasks in malo like this this is why we put these environments out there really is to see uh see how the uh community can sort of engage with these problems and try out these Different approaches uh there’s only so much bandwidth within one team to try out new ideas but graph neural networks are definitely a very powerful tool for approaching these okay thank you we’re gonna leave a couple of minutes in case somebody else has another question So i did i did have one uh that that i didn’t get quite around to responding to in my in my badge uh so from sagar ubretti at the university of warwick asked how we define mindset in the uh the paper on ad-hoc teamwork um and so here we we we weren’t Explicitly defining them essentially we have two variables in late and representation one that’s only sampled once per game and one that’s being sampled per time step and so really the the labeling of it as a mindset is more just a way for us to uh interpret and communicate our intention With those two different components it’s actually something that was originally defined in the machine theory of mind 
paper from a few years back. I can't remember the reference directly off the top of my head, but it's cited in our paper, which is where that one came from. The mindset, I believe, was the per-time-step one: we capture the overall play style with the per-episode variable, so you have a particular play style that you adhere to throughout, whereas the mindset is more that sort of moment-to-moment reaction, and how these might change based on what's currently going on. That's great, thank you. This is another question, by UC Liu from QMUL: hello, in terms of learning other agents in the Malmo platform, how do we learn the other agent efficiently, given that the data for that agent is too limited? In addition, can the method of learning other agents be implemented in an adversarial game, to learn about the opponent? Well, actually, multi-agent learning is something that has been researched for many, many years; it is one of the main research questions at the moment, and normally what people try to do is to implement or create a model of the opponent, by basically trying to analyze the style and the policy that this opponent is going to be executing, by storing data from playing these games against these agents, and trying to build this model in a more or less dynamic manner. I don't know if anyone else wants to add something to this, or any particular ways in which maybe Microsoft is doing this in your research. Yeah, so I think the example I presented at the end was probably the best one I have in mind for learning that model. In a competitive game, it's absolutely the case that if you learn the model, then that's a way to exploit a particular opponent, beyond playing just a Nash equilibrium that's safe against, and robust to, a whole range of opponents. In the paper that's linked, for the ad hoc teamwork
setting, we do have a game that is more competitive than the example that I showed, so it is equally applicable in that setting. But I'm really motivated by the application of using it for fully collaborative games; ideally, what we want is for these agents to empower the player and enable the player to do more, rather than exploit them or try to beat them. It's quite easy to make agents that are very good at games and beat people; it's much more fun to create agents that can allow more people to enjoy games. Okay, there are no more questions coming up; we can wait a minute, just in case somebody's typing. Probably a good chance just to make sure everybody's seen the invite in the resource list to our AI and Gaming Summit next week. The registration is open until tomorrow, so please do sign up today if you're interested. There's going to be a lot of work there that spans a range of different topics in game AI, including lots of stuff on collaborative agents working with humans, but also going into things such as computational creativity, responsible gaming, and understanding players; so hopefully it is of wide interest to people on this call today. Okay, so I think we might be ready to start wrapping up. Thank you for attending today; we appreciate your participation, your questions, and your interest in the subject. This tutorial is going to be available on demand very shortly, after this Q&A finishes, and if you are interested in learning more, as Sam just said, we have a list of great resources in the resource list to the right of your screen. As you can see, we have the presentation slides that we've been using in this webinar, a couple of links to Project Malmo, the actual research project at Microsoft Research, and also the link to the latest repository on GitHub, so you can find the final examples and the final code that has been implemented. There are also
a couple of references to relevant papers that we have been mentioning today during the webinar, so there are two references to arXiv papers; and, as Sam also mentioned, the latest one is a link to the AI and Gaming Research Summit, which is happening next week, the 23rd to the 24th of February. The registration is free, but it's going to be closing tomorrow, so try to register by tomorrow at the latest. It's going to be a very, very interesting event; there are lots of different topics being covered across two full days of work and talks on RL, game AI, and so on. So check that out, and we look forward to seeing how you build on all of this research and how you push the boundaries of RL and game AI; we're looking forward to seeing your work. So thank you again for tuning in, and have a great day.

Video Information
This video, titled ‘Reinforcement learning in Minecraft: Challenges and opportunities in multiplayer games’, was uploaded by Microsoft Research on 2021-03-17 20:22:48. It has garnered 3468 views and 81 likes. The duration of the video is 01:07:25 or 4045 seconds.
Webinar starts here: https://youtu.be/bhHWmwSixJw?t=59
Games have a long history as test beds in pushing AI research forward. From early works on chess and Go to more recent advances on modern video games, researchers have used games as complex decision-making benchmarks. Learning in multi-agent settings is one of the fundamental problems in AI research, posing unique challenges for agents that learn independently, such as coordinating with other learning agents or adapting rapidly online to agents they haven’t previously learned with.
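As a toy illustration of that last challenge, agents that coordinate with the partner they trained with but not necessarily with agents from other training runs, here is a minimal sketch with two independent tabular Q-learners in a two-action coordination game. The game, hyperparameters, and seeds are all invented for this sketch and are not taken from the webinar:

```python
import random

def train_pair(seed, episodes=5000, eps=0.2, lr=0.1):
    """Two independent tabular Q-learners on a repeated two-action
    coordination game: both receive reward 1 if their actions match,
    0 otherwise. Each learner only sees its own Q-table."""
    rng = random.Random(seed)
    # random initial values break the symmetry between the two actions
    q = [[rng.random(), rng.random()] for _ in range(2)]
    for _ in range(episodes):
        acts = []
        for i in range(2):
            if rng.random() < eps:                       # explore
                acts.append(rng.randrange(2))
            else:                                        # exploit
                acts.append(max((0, 1), key=lambda a: q[i][a]))
        r = 1.0 if acts[0] == acts[1] else 0.0           # shared team reward
        for i in range(2):
            q[i][acts[i]] += lr * (r - q[i][acts[i]])    # tabular update
    return q

def convention(qtable):
    """The action an agent plays greedily after training."""
    return max((0, 1), key=lambda a: qtable[a])

team_a = train_pair(seed=0)
team_b = train_pair(seed=1)
# Each pair learns to coordinate with the partner it trained with,
# but agents drawn from different runs may have settled on different
# conventions and so miscoordinate (the ad hoc teamwork problem):
print(convention(team_a[0]), convention(team_a[1]))
print(convention(team_b[0]), convention(team_b[1]))
```

Each pair reliably settles on a convention with its own partner, but which convention emerges depends on the run, so pairing agents across runs can yield zero reward, which is exactly the ad hoc teamwork failure described in the webinar.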
In this webinar, join Microsoft researcher Sam Devlin and Queen Mary University of London researchers Martin Balla, Raluca D. Gaina, and Diego Perez-Liebana to learn how the latest AI techniques can be applied to multiplayer games in the challenging and diverse 3D environment of Minecraft. The researchers will demonstrate how Project Malmo—a platform for AI experimentation built on Minecraft—provides an ideal environment for designing different and rich training tasks and how reinforcement learning agents can be trained in these scenarios. They’ll provide examples of tasks, agent implementations, and the latest research done in this area.
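The "adapting rapidly online" idea above can be caricatured in a few lines: keep a belief, with uncertainty, over the other agent's behavior, and condition the decision on that belief, gathering information while still uncertain. All class, action, and threshold names here are invented for illustration; the actual research uses a variational autoencoder over trajectories rather than simple counts:

```python
class PartnerBelief:
    """Running Beta-like estimate of how often the partner picks action 1."""

    def __init__(self):
        self.counts = [1, 1]  # Laplace-smoothed counts of partner actions

    def update(self, partner_action):
        self.counts[partner_action] += 1

    def estimate(self):
        n = sum(self.counts)
        p1 = self.counts[1] / n   # mean belief: P(partner plays 1)
        var = p1 * (1 - p1) / n   # shrinks as evidence accumulates
        return p1, var

def policy(state, belief, threshold=0.05):
    """Condition the action on both the state and the belief: probe the
    partner while uncertain, commit once the belief is confident."""
    p1, var = belief.estimate()
    if var > threshold:
        return "gather_info"
    return "match_1" if p1 > 0.5 else "match_0"

belief = PartnerBelief()
print(policy(state=None, belief=belief))   # early on: "gather_info"
for _ in range(50):
    belief.update(1)                       # partner keeps playing action 1
print(policy(state=None, belief=belief))   # now confident: "match_1"
```

The design point this mirrors from the webinar is that the policy's input is the pair (state, belief), not the state alone, so uncertainty itself can drive information-gathering behavior.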
Together, you’ll explore:
■ The Malmo platform and multi-agent tasks
■ Using the reinforcement learning library RLlib to implement and train agents to complete Minecraft tasks
■ Coordinated policies for collaborative multi-agent tasks
■ Open challenges in learning robust policies for ad-hoc teamwork
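As a rough sketch of the second bullet, this is the general shape of a multi-agent setup in RLlib's classic configuration format: a dictionary of named policies plus a function mapping agent IDs to policies. The env ID "mob_chase_env" is a placeholder, not a real registered environment, and exact keys and function signatures differ across RLlib versions, so treat this as an outline rather than a working recipe:

```python
def policy_mapping_fn(agent_id, *args, **kwargs):
    """Route each agent ID to a named policy. All hunters share one
    policy (parameter sharing); a controlled prey would get its own."""
    return "prey_policy" if agent_id.startswith("prey") else "hunter_policy"

config = {
    "env": "mob_chase_env",   # placeholder env ID, assumed registered elsewhere
    "framework": "torch",
    "multiagent": {
        # policy spec tuples are (policy_cls, obs_space, act_space, config);
        # None entries let RLlib fill in defaults from the environment
        "policies": {
            "hunter_policy": (None, None, None, {}),
            "prey_policy": (None, None, None, {}),
        },
        "policy_mapping_fn": policy_mapping_fn,
    },
}

print(policy_mapping_fn("hunter_0"), policy_mapping_fn("prey_0"))
```

In older RLlib releases a dict like this could be passed to a trainer class such as PPO's; newer releases express the same information through the `AlgorithmConfig` builder API instead.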
𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲 𝗹𝗶𝘀𝘁:
■ Project Malmo – Microsoft Research (project page): https://www.microsoft.com/en-us/research/project/project-malmo/
■ Project Malmo key repository (GitHub): https://github.com/GAIGResearch/malmo
■ Difference Rewards Policy Gradients (paper): https://www.microsoft.com/en-us/research/publication/dr-reinforce/
■ Deep Interactive Bayesian Reinforcement Learning via Meta-Learning (paper): https://www.microsoft.com/en-us/research/publication/deep-interactive-bayesian-reinforcement-learning-via-meta-learning/
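The difference rewards idea behind the policy-gradients paper linked above scores each agent by the global reward minus a counterfactual in which that agent's action is replaced by a default (or the agent is removed entirely). A toy sketch, with an invented global reward that is not the one used in the paper:

```python
def global_reward(actions):
    """Toy team objective (an assumption for illustration): the team is
    rewarded for the number of distinct positions covered, so an agent
    piling onto an already-covered spot contributes nothing."""
    return len(set(actions))

def difference_reward(actions, i, default=None):
    """D_i = G(z) - G(z_-i): the global reward minus the counterfactual
    global reward with agent i's action removed (or replaced by a
    default action c_i when one is given)."""
    counterfactual = actions[:i] + actions[i + 1:]
    if default is not None:
        counterfactual = counterfactual + [default]
    return global_reward(actions) - global_reward(counterfactual)

acts = [0, 1, 1]  # agents 0 and 1 cover new cells; agent 2 duplicates agent 1
print([difference_reward(acts, i) for i in range(3)])  # -> [1, 0, 0]
```

Here agent 0 covers a unique cell and receives credit 1, while agents 1 and 2 duplicate each other and each receive 0, unlike the raw team reward, which would credit all three equally; that per-agent signal is what makes difference rewards useful for multi-agent credit assignment.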
*This on-demand webinar features a previously recorded Q&A session and open captioning.
Explore more Microsoft Research webinars: https://aka.ms/msrwebinars