Analysis of Reinforcement Learning in Multi-Agent Systems
Adam Nelson-Archer, Joshua Mathews, Anh Vu, Richard Denton
COSC 4368
Professor Christoph Eick
14 April 2024
Table of Contents
Title Page
Default and Complex State Spaces
Part A – PRandom Policy – Simple
Part A – PRandom Policy – Complex
Part B – PGreedy Policy – Simple
Part B – PGreedy Policy – Complex
Part C – PExploit Policy – Simple
Part C – PExploit Policy – Complex
Agent Blocking & Technical Appendix
Put simply, our project considered the many challenges an agent must overcome when learning to pick up and drop off blocks through reinforcement learning. The biggest challenge was doing this effectively and in coordination with other agents. Our solution generally involved a more in-depth state space for the agents to learn from, and being limited to 9,000 steps across all agents made it difficult to keep that state space learnable within the budget. We will demonstrate the consequences of making a state space too small and too large. In the next few experiments, we focus on two different state spaces. It is important to note that each agent is given its own Q-table; we wanted each agent to learn based on its own actions.
For the first (basic) state space we only considered the location of the agent, the action it performed, and whether it was holding an item to generate Q-values. That meant each agent would have at most 300 state-action pairs (25 locations, 2 carrying states, 6 actions), and in practice fewer, since pickups/drop-offs can only be considered at certain locations and some moves are out of bounds (you cannot go west from (0,0), for example). This state space trained quickly and required little memory.
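As a rough illustration, here is a minimal sketch of how such a basic Q-table could be keyed; the names and structure are ours for illustration, not the project's actual code:

```python
# Hypothetical sketch of the basic state space: the Q-table is keyed only by
# the agent's (x, y) location and whether it is carrying a block; each key
# maps to Q-values for the six actions.
ACTIONS = ["north", "south", "east", "west", "pickup", "dropoff"]

def make_basic_q_table(grid_size=5):
    """Build an empty Q-table: 25 locations x 2 carrying flags x 6 actions."""
    q_table = {}
    for x in range(grid_size):
        for y in range(grid_size):
            for carrying in (False, True):
                q_table[(x, y, carrying)] = {a: 0.0 for a in ACTIONS}
    return q_table

q = make_basic_q_table()
print(len(q) * len(ACTIONS))  # 300 state-action entries before pruning unreachable ones
```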
For the second (complex) state space we added another state component to learn on: the environment's availability of pickups and drop-offs, where "available" means a pickup still has a block to give or a drop-off can still accept one. In this complex world, instead of a single Q-table, each agent has at most 64 sub-Q-tables. This was done by assigning a boolean value to each object (pickups P and drop-offs D) in the world, in the format (P1, P2, P3, D1, D2, D3). For example, an agent using the sub-table (1,0,1,0,0,1) sees two pickups (P1, P3) still usable (1) and one drop-off (D3) still usable. This more atomic representation should make learning new paths easier, since agents are not confined to one fixed world representation and can adapt to the changing availability of objects. However, what seemed effective in theory lost its advantage in practice: the number of steps provided was not enough to learn such a large state space.
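A small sketch of how the availability key selecting the sub-table could be derived; the helper name and input format are hypothetical:

```python
# Hypothetical sketch of the complex state space: each agent keeps a sub
# Q-table per availability pattern (P1, P2, P3, D1, D2, D3), giving at most
# 2**6 = 64 sub-tables per agent.
def availability_key(pickups, dropoffs):
    """Encode which pickups still hold blocks and which drop-offs still have room.

    `pickups` and `dropoffs` are assumed to be lists of remaining-block /
    remaining-capacity counts, e.g. pickups=[2, 0, 5], dropoffs=[0, 0, 1].
    """
    return tuple(int(count > 0) for count in pickups + dropoffs)

# Example: P1 and P3 usable, only D3 usable -> selects sub-table (1, 0, 1, 0, 0, 1)
key = availability_key(pickups=[2, 0, 5], dropoffs=[0, 0, 1])
print(key)  # (1, 0, 1, 0, 0, 1)
```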
Between these two extremes, the middle ground is agent coordination. If the agents are neither simply memorizing a fixed world (basic) nor learning a highly complex changing one, the best solution would be to coordinate and thereby avoid branching into a larger state space. The experiment runs focus on the two state spaces described above, and agent coordination will be examined in both; we will dig deeper into which of them produced agent coordination, if either did.
Beyond agent coordination, two parameters shape learning: gamma, between 0 and 1, which sets how much emphasis is placed on future rewards, and alpha, between 0 and 1, the learning rate, which sets how strongly each new experience overwrites the current estimate. These values are fixed within each experiment, but every update changes the Q-tables according to them, and we will explore the impact this has on learning.
Another critical element was the two types of learning implemented: off-policy Q-learning, whose update is independent of the action the agent actually takes next, and on-policy SARSA, whose update depends on it.
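For reference, a hedged sketch of the two update rules with the alpha and gamma values used later; the function and variable names are illustrative, not the project's actual API:

```python
# Illustrative update rules; q[state] is assumed to map each action to its Q-value.
def q_learning_update(q, state, action, reward, next_state, alpha=0.3, gamma=0.5):
    # Off-policy: bootstrap from the best available next action.
    target = reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])

def sarsa_update(q, state, action, reward, next_state, next_action, alpha=0.3, gamma=0.5):
    # On-policy: bootstrap from the action the policy actually chose next.
    target = reward + gamma * q[next_state][next_action]
    q[state][action] += alpha * (target - q[state][action])
```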
The seeding for the following experiments is simple: the first run uses seed 1 and the second run uses seed 2. However, we actually run each seed twice per experiment, because we want to show the advantages and disadvantages of the complex versus basic state space; both state spaces are run with seeds 1 and 2. So for each task outlined in an experiment there are four expected runs (complex with seeds 1 and 2, and basic with seeds 1 and 2).
In the Python script run_simulation.py, the SimulationWorker class is central to managing the simulation of agents within a given environment. This class is initialized with several parameters that define the behavior and setup of the simulation, including the agents themselves, the environment (env), the complexity level of the world (complex_world2), whether the simulation is episode-based, and various settings regarding simulation speed and execution.
The simulation runs within the run_simulation method, which either iterates through a set number of episodes or until a certain number of steps are reached, based on the configuration provided during initialization. This method employs a helper function, core_logic, which is responsible for executing the simulation's main loop for each episode or set of steps. The core_logic function manages the agents' interactions with the environment, updating their states and decisions at each step.
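The following is a hedged sketch of how this loop might be organized; the real SimulationWorker in run_simulation.py takes more parameters and emits UI signals, so treat the method bodies and helper calls here as assumptions:

```python
# Assumed interfaces: env.episode_complete()/step()/reset_episode() and
# agent.current_state()/choose_action()/update_q() stand in for the real ones.
class SimulationWorker:
    def __init__(self, agents, env, complex_world2=False, episode_based=True,
                 max_episodes=6, max_steps=9000):
        self.agents = agents
        self.env = env
        self.complex_world2 = complex_world2
        self.episode_based = episode_based
        self.max_episodes = max_episodes
        self.max_steps = max_steps
        self.steps_taken = 0

    def run_simulation(self):
        # Either iterate over a fixed number of episodes or stop at the step budget.
        if self.episode_based:
            for _ in range(self.max_episodes):
                self.core_logic()
        else:
            while self.steps_taken < self.max_steps:
                self.core_logic()

    def core_logic(self):
        # Main loop body: each agent perceives, acts, is rewarded, and updates its Q-table.
        while not self.env.episode_complete() and self.steps_taken < self.max_steps:
            for agent in self.agents:
                state = agent.current_state(self.env)
                action = agent.choose_action(state)
                reward, next_state = self.env.step(agent, action)
                agent.update_q(state, action, reward, next_state)
                self.steps_taken += 1
        self.env.reset_episode()
```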
Within each cycle of the simulation, agents decide on their actions based on their perception of the environment and their policy, which is defined in policy.py and chosen with the initialization of the world. The environment responds to these actions, modifying agent states and providing feedback in the form of rewards or penalties. This interaction is captured and emitted through signals to update the graphical user interface in real-time, providing a visualization of the agents' behaviors.
The program also includes mechanisms to record blockages and collisions, essential aspects of managing agent movement in shared spaces, which is visualized in the third tab in the UI. Agents assess their available actions and determine the best course based on the state of the environment and other agents' positions.
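A minimal sketch of the kind of blockage bookkeeping this describes, with hypothetical names; the project's actual recording code may differ:

```python
# Illustrative blockage check and tally; names are hypothetical.
def is_blocked(target_cell, other_agent_positions):
    """An intended move is blocked when another agent already occupies the cell."""
    return target_cell in other_agent_positions

def record_blockage(blockage_counts, blocked_agent, blocking_agent):
    """Tally who blocked whom; these counts feed the coordination view in the UI's third tab."""
    key = (blocked_agent, blocking_agent)
    blockage_counts[key] = blockage_counts.get(key, 0) + 1
```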
Throughout the simulation, the system maintains a high level of interactivity and configurability, allowing users to pause, adjust speed, or skip through the simulation to examine different aspects of agent behavior and environmental interaction. This flexibility makes run_simulation.py a robust tool for studying complex systems and behaviors in a controlled yet dynamic setting.
In the scripts agent.py and environment.py, a primary focus is placed on the concept of state spaces and their pivotal role in defining and managing the interactions of agents within a simulated gridworld environment. State spaces in this context serve as a backbone for both the structure of the environment and the behavioral rules that govern agent interactions.
The script agent.py introduces the Agent class, which relies heavily on the manipulation and understanding of state spaces for effective functioning. Agents maintain and update their state based on interactions within the environment, with state spaces defining the possible conditions each agent can find itself in. These states are crucial for agents as they decide their actions based on their current state and the anticipated outcomes defined within the state space. State management within the agent class includes maintaining a current state and determining the feasibility of actions based on this state. Furthermore, the learning mechanisms employed by the agents, such as Q-learning or SARSA, update their knowledge of the state space through Q-values associated with state-action pairs. These values are dynamically updated as agents interact with the environment, learning from the outcomes of their actions and refining their decision-making processes.
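As one concrete illustration of the feasibility check described here, a hypothetical helper that filters actions by grid bounds and carrying state; env.has_block and env.has_capacity are assumed stand-ins, and the coordinate convention is ours:

```python
# Hypothetical feasibility helper: an action is applicable only if it stays on
# the grid and, for pickup/dropoff, if the agent is at a matching cell in the
# right carrying state. Directions assume y grows northward from (0, 0).
def valid_actions(x, y, carrying, env, grid_size=5):
    actions = []
    if y < grid_size - 1: actions.append("north")
    if y > 0:             actions.append("south")
    if x < grid_size - 1: actions.append("east")
    if x > 0:             actions.append("west")
    if not carrying and env.has_block(x, y):      # pickup cell with blocks left
        actions.append("pickup")
    if carrying and env.has_capacity(x, y):       # drop-off cell with room left
        actions.append("dropoff")
    return actions
```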
Concurrently, the environment.py script defines the GridWorld class, which establishes the simulation's physical layout and the abstract state spaces within which the agents operate. This class allows the environment to be represented at varying levels of complexity, from simple binary states indicating the presence or absence of items at specific locations to more sophisticated configurations that account for proximity to other agents or dominance over certain areas. The concept of a pd_string within GridWorld is particularly significant for customizing state spaces: this string enables dynamic definitions of the environment, where different elements like agent proximity, item availability, and territorial dominance can be encoded. The technicalities and results of these various pd_strings are discussed later in this paper.
State spaces are thus categorized into two primary types: default and complex. Default state spaces typically involve straightforward interactions where agents move and interact within static environments designed with fixed pickups and dropoffs. On the other hand, complex state spaces are crafted through the pd_string, enabling a dynamic environment where state spaces can be tailored to include various strategic elements. These customized state spaces allow for a deeper exploration of agent behaviors and strategies, providing a richer simulation experience. In essence, the management and customization of state spaces are central to the simulation of agent behavior in these scripts. By defining the rules and conditions of agent interactions through state spaces, the system facilitates a controlled yet highly configurable setting for studying various agent strategies and interactions within a complex environment.
Outline: The main idea is to understand how different policies create different learning patterns in Q-learning and the trade-offs each entails. For the following sub-experiments (A, B, C) we keep these items constant across runs:
Learning Rate (alpha, α) = 0.3 | Discount Factor (gamma, γ) = 0.5 | Learning Style = Q-Learning
For each of the following experiments, the policy PRandom was mandated for the agents' first 500 steps, followed by the policy required for each part (A - PRandom, B - PGreedy, C - PExploit) for an additional 8,500 steps.
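A minimal sketch of this step schedule, with illustrative names only:

```python
# Illustrative schedule: PRandom for the first 500 steps, then the part's policy
# for the remaining 8,500 steps of the 9,000-step budget.
def select_policy(step, part_policy):
    return "PRandom" if step < 500 else part_policy

# e.g. part B switches from PRandom to PGreedy at step 500
assert select_policy(499, "PGreedy") == "PRandom"
assert select_policy(500, "PGreedy") == "PGreedy"
```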
Each of these experiments was run using two state spaces:
Normal State Space | Complex State Space
Each was run with two different seeds:
1 | 2
creating a total of four sub-experiments for each part:
Normal Seed 1 | Normal Seed 2 | Complex Seed 1 | Complex Seed 2
This was done so we could compare which state space handles blockage and agent coordination more robustly, along with other details explained later.
[Figure: per-episode totals for Seed 1 (left) and Seed 2 (right); lines show total steps (top), total blockages (middle), and total rewards (bottom) against the number of episodes (x-axis).]
The random movement policy did not allow agents to use their learning (Q-values) from the state space, so in some cases agents performed very well and in others minimally. Across the two seeds, agents completed the world roughly six times (episodes) on average. There is no need to examine coordination here: agents moving randomly cannot coordinate, since they do not act on their Q-values. This illustrates maximum exploration with no attempt to exploit. Because the state space design in this experiment is simple, the Q-values for all the agents filled in relatively fast. Now we examine how efficient each of these tables was, with a closer look at the actions of each agent based on its Q-table. Here we examine Agent 1; the table below shows the Q-values for Agent 1 when not holding a block.
(In the above table, N, S, E, and W are the four movement actions. Zero values represent Q-values that were never encountered and/or are impossible to attain.)
We can see from the figure that Agent 1's Q-values grow stronger closer to the pickup locations (red) when not holding a block. The blue cells show the actions surrounding the pickup locations increasing relative to those locations, and the yellow cells show Q-values that, although not positive, were the best choice and led to finding a pickup. Pure exploration does help the agent learn the environment, but without using those values to guide the next action it gains nothing toward accomplishing the task faster. Here we see the Q-values being updated purely through exploration.
[Figure: total steps (top), total blockages (middle), and total rewards (bottom) per episode; Seed 1 (left), Seed 2 (right).]
Comparatively, the complex world does not do enough better to claim an advantage (8 completions on the left and 6 on the right). This is because the agents are moving at random and cannot coordinate or use their learned Q-values to make decisions. We again examine Agent 1 without a block, but this time we must consider the configuration of the world: we use the sub-table for Agent 1 in which none of the pickups and drop-offs have been exhausted (1,1,1,1,1,1).
It may look the same, but there is one key difference: there is less exploration. A good way to see this is to look at the agent's start point (pink, [3,3]), where we only have Q-values for two actions (north and west). Since the state space has increased roughly 64-fold, there is only so much the agents can explore within 9,000 steps. Keep in mind these runs were strictly exploration; in the other experiments exploration is limited to the first 500 steps. This is the obvious trade-off between the state spaces: the basic one can explore almost every state but struggles to relearn a changing world, while the complex state space can keep up with the changing world but needs more training to become efficient.
A brief overview of the policy (PGreedy): it chooses the action with the best Q-value (if one is present). Note that we only apply it in the later part of the run (8,500 steps); we still allow the agents to familiarize themselves with and explore the world at the beginning (500 steps).
[Figure: total steps (top), total blockages (middle), and total rewards (bottom) per episode; Seed 1 (left), Seed 2 (right).]
PGreedy shows some promise in the simple state space: we hit above 10 episodes in both cases. However, problems start to develop. Agents assume their current path is best and keep increasing its Q-values. That works well on perhaps the first completion, but on the next one the agents experience more blockage, because the same paths are attractive to all agents; each agent is drawn to the same pickups and drop-offs and has to wait for the others. On top of the coordination problem, agents return to drop-offs that are full and pickups that have no blocks left, and have to unlearn those paths to compensate. Because this world is so small, an agent can, even at random, eventually find another pickup or drop-off for its block. The simple state space works with PGreedy because its Q-values change quickly, letting agents find other promising paths despite blockage at the same spots, but this quickly changes beyond the scope of 9,000 steps. For example, at approximately 318,000 steps, which accounts for about 300 completions, we start to see more of the issue:
The red circle looks like an impressive accomplishment by PGreedy, but over time this policy overfits more and more, to the point where there is no reasonable increase in performance. The agents end up relearning and penalizing themselves because they are essentially "fighting" for the same resources and blocking each other. They are not aware of this: to them it just looks like a move they cannot use, when in reality agent coordination is simply low.
Let's take a closer look at how coordinated these agents are:
Here the agents are evaluated on how often other agents blocked them from a pickup/drop-off: they were adjacent to one but unable to use it as their next location because it was occupied. The empty columns indicate the agent in question, and the other columns indicate which agent blocked it and how many times, split by pickups and drop-offs. We assume the blocked agents would have chosen those actions, since we are under a PGreedy policy. We see, for example, that Agents 2 and 3 are poorly coordinated with Agent 1. In general, these blockages show that a greedy policy combined with a limited, simple state space design produced very poor agent coordination. Hopefully, by advancing the state space a bit, we can solve this problem.
[Figure: total steps (top), total blockages (middle), and total rewards (bottom) per episode; Seed 1 (left), Seed 2 (right).]
Compared to the simple state space, the agents take longer to reach completion, and one main difference stands out between the two. Because the greedy policy is acting over a bigger state space you might expect better results, but the opposite happens: the bigger state space creates even bigger issues. Initially blockage is minimal, since the state space has not been fully explored and exploited, but after subsequent completions it becomes evident. Eventually the agents overfit harder, and since no agent coordination is involved, the learning keeps considering only a few points; with all agents centering on these paths there is more blockage and less coordination. This is shown below with a longer run of the setup stated previously:
Overall, greedy is not a good long-run choice for reinforcement learning here. Exploiting previously found values stops working when agents exploit them more and more; the agents become locked and blocked in the process, leading to less and less coordination.
Here is a visual of the simple Q-learning table, showing the effect of PGreedy on the Q-values in the simple world for Agent 1 when not holding a block:
It is clear how the greedy policy involves minimal agent coordination, effectively forcing agents to stay fixed on certain paths and neglect others. Here the learned path focuses mainly on reaching the pickups when not holding a block. This is problematic because the other agents have similar Q-tables; over time this creates large blockages and very inefficient solutions.
PExploit adds some exploration to avoid this degree of overfitting when agents select their actions: 80% of the time agents pick the action with the highest Q-value, and 20% of the time they choose a random action. The sketch below shows the difference from PGreedy (and PRandom).
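A hedged sketch of the three selection rules side by side; the actual implementation lives in policy.py and may differ in detail:

```python
import random

# Illustrative action selection for the three policies compared in this report.
def choose_action(policy, q_values, applicable_actions, rng=random):
    if policy == "PRandom":
        return rng.choice(applicable_actions)
    best = max(applicable_actions, key=lambda a: q_values[a])
    if policy == "PGreedy":
        return best                      # always exploit the highest Q-value
    if policy == "PExploit":
        # exploit 80% of the time, otherwise pick a random applicable action
        return best if rng.random() < 0.8 else rng.choice(applicable_actions)
    raise ValueError(f"unknown policy: {policy}")
```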
[Figure: total steps (top), total blockages (middle), and total rewards (bottom) per episode; Seed 1 (left), Seed 2 (right).]
Right from the start we can see that PExploit generally takes more steps but produces less blockage between the agents, even in the simple state space. Let's examine the longer-term behavior of PExploit.
We see that PExploit offers no better approach in this simple state space. That is expected, since this state space struggles to unlearn unattainable rewards and agent coordination is absent. This may be solved by turning to the complex representation of the state space.
[Figure: total steps (top), total blockages (middle), and total rewards (bottom) per episode; Seed 1 (left), Seed 2 (right).]
Here we can see that agents using this large state space are able to find efficient solutions. PExploit seems to do a better job than PGreedy and than PExploit on the simple state space. However, the agents are still not coordinating effectively; all they are doing is learning to make the best of their own situation, with a little random exploration mixed in to compensate. Over time, their final result will resemble that of any of the three policies run for more than 9,000 steps. But for the most part, if the agents are not trained to coordinate, PExploit holds up as the best policy: it provides passive exploration while still honing in on the important Q-values. PExploit coupled with a state space that encodes agent coordination would provide the best efficiency among agents in this world. Below is the agent coordination data for this complex state; you can see it is similar to the rest. This policy provides no real competitive edge over the others in terms of agent coordination, but in terms of learning the changing world, the complex state space has the advantage.
Now we examine one of Agent 1's Q-tables when not holding a block to understand the policy, as before:
Here we see fewer positive values surrounding each pickup after 9,000 steps. This shows that instead of always following the optimal route, the agents at some point took a non-optimal route (randomly) to aid exploration. This is a hybrid of the two Q-tables shown above: it allows both exploration (PRandom) and exploitation (PGreedy).
Here are the Q-tables for PExploit.
Q Tables Simple State Seed 1:
https://docs.google.com/spreadsheets/d/1cpM5QJeWq82tOdQNPeEgYhhL3wstuO2J2Cog5I-0mJ4/edit?usp=sharing
Q Tables Complex State Seed 1:
https://docs.google.com/spreadsheets/d/1mGm6rDB37gJzUHPouanGu8LxyCfutxiWcidSMgfYdnk/edit?usp=sharing
Conclusion Experiment 1
Moving the state space design from basic to complex did a good job of allowing each individual agent to learn an updated view of the world. However, blockages were still present and did not fade away, since the state space did not incorporate them. The policy that showed the most promise was PExploit: it allowed the agent to both explore and exploit, preventing the extreme overfitting found with PGreedy and the excessive exploration found with PRandom.
In this experiment, the learning algorithm was SARSA and the policies remained unchanged: PRandom for 500 steps, then PExploit for another 8,500 steps. One of the final Q-tables, the Q-table for Agent 1 in the simple state space, is provided for both seeds for comparison. In addition, we use the number of steps the agents took and the reward they achieved each episode as metrics to assess the difference between Q-learning and SARSA.
[Figures: Agent 1 Q-tables, complex state space, seed 1 (left) and seed 2 (right).]
[Figures: per-episode results for the simple state space (left) and complex state space (right), each with seed 1 (top) and seed 2 (bottom).]
Based on the number of steps the agents took to finish all tasks in each episode in the complex state space with seed 1, a downward trend formed, which shows that the agents were actually learning to choose better paths. The upward trend in total rewards also supports the efficiency of our model. Although the result for seed 2 looked like the agents were taking more steps, that was due to the complex state space being so large that more episodes were needed to assess the model accurately. The trends were not shown clearly in the two seeds of the simple state space because that state space was too small, so the agents were unable to unlearn paths. But we could notice that the results from the complex state space were more stable than the normal one.
The results from SARSA appeared less chaotic than Q-learning because of how the Q-values were updated. The PExploit policy allowed SARSA to update Q-values more efficiently and with somewhat more exploration, instead of blindly updating based on the maximum Q-value of the next state. As a result, with more episodes, the curves generated by SARSA would have shown better convergence than Q-learning.
In terms of agent coordination, the number of blockages did not change noticeably; it was basically the same as with Q-learning. This was probably because no punishment was given for not moving, so the agents could not learn to choose a different path when they were blocked.
Conclusion experiment 2
The SARSA learning algorithm allowed the agents to make decisions more realistically than Q-learning.
In this experiment, we recreated experiment 2 but with different alpha values (0.15, 0.3, and 0.45) for comparison. The learning algorithm was again SARSA.
[Figures: simple state space, alpha = 0.15 (left) and alpha = 0.45 (right), each with seed 1 (top) and seed 2 (bottom).]
[Figures: complex state space, alpha = 0.15 (left) and alpha = 0.45 (right), each with seed 1 (top) and seed 2 (bottom).]
Apparently, there were no significant differences between the three alpha values; because of the small number of steps and episodes, the effect of the learning rate could not yet be determined. The only thing we could spot was that the number of steps and the rewards per episode for alpha = 0.3 did not fluctuate like the other two. However, when we increased the number of steps to 300,000, which translated to approximately 495 episodes for seed 1 in the complex state space, we noticed that alpha = 0.45 allowed the total steps and total rewards to converge more quickly than alpha = 0.15 (after roughly 100 episodes for 0.45 versus 140 for 0.15), and its curves were also more stable than those for alpha = 0.3.
[Figures: 300,000-step runs with alpha = 0.15 (left), alpha = 0.45 (right), and alpha = 0.3 (below).]
Conclusion experiment 3:
Alpha = 0.45 would be a better choice than 0.15 or 0.3 because it balanced the behavior of the other two values, converging quickly while remaining stable.
For this experiment, we are looking to see how changing the pickup locations affects the learning abilities of the agents. To do this, I set up two types of runs:
Run 1: This is the run specified in the assignment instructions, lasting 6 episodes. We run PRandom for 500 steps, then switch to PExploit. After 3 episodes have completed, we change the pickup locations and run for 3 more episodes (see the sketch after Run 2).
Run 2: This is Run 1 but without changing the pickup locations mid-run; its purpose is to provide a baseline for Run 1.
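A minimal sketch of the Run 1 location swap; the coordinates here are hypothetical placeholders, not the assignment's actual pickup cells:

```python
# Illustrative Run 1 setup: swap the pickup cells once three episodes complete.
ORIGINAL_PICKUPS = [(0, 4), (1, 3), (4, 1)]   # hypothetical coordinates
NEW_PICKUPS      = [(2, 2), (3, 4), (0, 0)]   # hypothetical coordinates

def pickups_for_episode(episode):
    """Return the pickup locations in effect for a given episode (0-indexed)."""
    return ORIGINAL_PICKUPS if episode < 3 else NEW_PICKUPS
```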
[Figures: Run 1, seed 11 (left) and seed 43 (right).]
In both seeds of Run 1, we can see an increase in steps and a decrease in rewards on the fourth episode, which is to be expected considering we changed the pickup locations. What is interesting, however, is how much the seed value affected these two runs.
The first seed featured slowly increasing steps, plateauing at episode 4, and then quickly dropping off, showing massive improvement with the new pickup locations. Conversely, the second seed featured a very high-step initial episode which quickly dropped off on episode 2. The step count then slowly increased all the way through the end of the run.
[Figures: Run 2, seed 11 (left) and seed 43 (right).]
As in Run 1, the seed value gives us two very different graphs. The first seed seems to linger in high step count for the first few episodes, while the second seed quickly drops off on episode 2 (as it did in Run 1).
Comparing the two runs, we can see that changing the pickup locations mid-run does have an effect on step count; step count rises in Run 1 and decreases in Run 2. When looking at Run 1, we can see that despite the change in pickup locations, the agents are still able to recover and lower the step count in the succeeding episode. This is especially apparent in seed 11, as step count decreases dramatically afterward.
What’s interesting is that, looking at the 4 total runs, they all end up in roughly the same place after the final episode; around 800 steps to complete the episode, with Run 1 Seed 11 being an outlier at around 600 steps. Looking at the Q-tables, the agents seem to have figured out their own optimal path decently well, with the exception of Run 1 Seed 11 Agent 3; this agent has lower than average Q-values:
[Figures: Q-tables for Run 1, Agent 3, seed 11 (left) and seed 43 (right).]
This poor performance is due to a combination of the change in pickup locations and agent collision. Agent 3 always went last and therefore got last pick when it came to optimal routes, which became a problem when the pickup locations were suddenly switched.
Overall, the best performing run was Run 1 Seed 11. Unlike the other runs, it is the only one that had a downward trend in step count after episode 3 (pickup location switch). This culminated in the lowest step count of the four runs for the final episode. Agent collisions seem to be unaffected across the runs and have no obvious trend.
Complex World Runs
[Figures: complex world runs, seed 11 (left) and seed 43 (right).]
When I ran the complex world, the results were not desirable; or rather, there was not enough time for them to become desirable. We can see the step count following a trend similar to that of Run 1, yet the counts end at much higher numbers. This is likely because the complex world takes longer to fully train than the simple world, and we would need to run many more episodes before seeing improvement over the simple world.
In an effort to expand the capabilities and interactivity of the agent-based simulation environment, several features were developed, each designed to enhance the depth and analytical possibilities of the system. These features are available on the user interface, and some of them are as follows:
Agent Pheromone Trails:
This introduces a novel approach to territorial control within the gridworld environment. Modeled after pheromone trails in ant colonies, this system allows agents to assert dominance over specific squares by traversing them repeatedly. Each pass slightly increases the dominance level of that square for the agent, eventually marking it as "their territory". Conversely, agents receive a penalty when they enter a square dominated by another agent, discouraging frequent cross-territorial movements and enhancing strategic path planning. This feature not only adds a layer of depth to agent interaction but also mirrors certain natural phenomena, offering a unique twist to traditional grid-based navigation challenges.
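A hedged sketch of how such a pheromone/dominance mechanic could work; the increment and penalty values here are illustrative, not the tuned values used in the project:

```python
from collections import defaultdict

# cell -> agent -> dominance level accumulated by repeated traversal
dominance = defaultdict(lambda: defaultdict(float))

def visit(cell, agent_id, increment=0.1):
    """Each traversal nudges the cell's dominance toward the visiting agent."""
    dominance[cell][agent_id] += increment

def territory_penalty(cell, agent_id, penalty=-0.5):
    """Entering a square another agent dominates incurs a small reward penalty."""
    if not dominance[cell]:
        return 0.0
    owner = max(dominance[cell], key=dominance[cell].get)
    return penalty if owner != agent_id else 0.0
```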
[Figures: no pheromone trails (50 episodes, left) versus using pheromone trails and space dominance (right).]
8-State Proximity Checker:
This is a crucial feature that enhances the spatial awareness of each agent by providing real-time data on the positions of nearby agents. The function examines eight surrounding cells (one and two cells away in each of the four cardinal directions) to determine the proximity of the closest neighboring agent, so agents receive valuable state data without drastically increasing memory complexity. If an agent comes into direct contact with another, a penalty is applied, simulating a real-world scenario where personal space or operational boundaries are essential. This mechanism is particularly useful because agents are more efficient when they stay around two cells apart. A sketch of the check follows.
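A hedged sketch of the eight-cell check described above; function and variable names are illustrative:

```python
# One and two cells away along each cardinal direction: eight cells total.
PROXIMITY_OFFSETS = [(dx * d, dy * d)
                     for dx, dy in [(0, 1), (0, -1), (1, 0), (-1, 0)]
                     for d in (1, 2)]

def nearest_neighbor_distance(x, y, other_positions):
    """Return 1 or 2 if another agent sits in one of the eight checked cells, else None."""
    for dx, dy in sorted(PROXIMITY_OFFSETS, key=lambda o: abs(o[0]) + abs(o[1])):
        if (x + dx, y + dy) in other_positions:
            return abs(dx) + abs(dy)
    return None
```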
[Figures: no proximity checking (50 episodes, left) versus using proximity checking and touch punishment (right), showing clear region dominance and isolated behavior.]
Heatmap Visualization:
A heatmap visualization tool (pictured above) has been integrated to graphically represent the "dominance" of a certain agent in the gridworld. While certain datasets (total reward, total steps) give us a view of the model as a whole, this key visualization provides insight into how well agents coordinate with one another. The heatmap provides immediate visual feedback on agent behavior and movement efficiency, making it an invaluable tool for debugging and refining agent strategies.
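As a hedged illustration of how such a heatmap could be produced, here is a small stand-in using matplotlib; the project's UI has its own plotting code, so this is only a sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_dominance_heatmap(visit_counts, grid_size=5, agent_label="Agent 1"):
    """visit_counts: dict mapping (x, y) -> number of visits by one agent."""
    grid = np.zeros((grid_size, grid_size))
    for (x, y), count in visit_counts.items():
        grid[y, x] = count
    plt.imshow(grid, cmap="hot", origin="lower")
    plt.colorbar(label="visits")
    plt.title(f"{agent_label} gridworld dominance")
    plt.show()
```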
While these features significantly enhance the functionality of the project, the specific guidelines of the assignment required adherence to a ruleset that did not accommodate the full utilization of these tools. As a result, these features were not activated in the official experiments presented in this paper. However, they remain accessible through the user interface for further exploration and use in unrestricted scenarios. This setup allows users to interact with these tools, exploring their effects on agent behavior and system dynamics outside of the formal experimental constraints.
By providing these capabilities, the project not only adheres to the academic requirements but also offers extended value through a sandbox-like environment where users can experiment with complex agent interactions and control mechanisms that were otherwise constrained by the assignment parameters.
This paper presented a comprehensive analysis of multi-agent reinforcement learning within controlled environments, exploring the impact of different state space complexities and learning policies. We investigated the effects of basic and complex state spaces, as well as the implications of employing various policies such as PRandom, PGreedy, and PExploit. These experiments demonstrated that while complex state spaces offer a nuanced understanding and adaptation to dynamic environments, they require a significantly higher number of steps to learn effectively, often resulting in initial inefficiencies. Conversely, simpler state spaces, though quick to learn, often fail to capture the necessary details to facilitate effective long-term strategy and coordination among agents.
Key takeaways from this research underscore the importance of balancing state space complexity and learning policy to optimize agent performance and efficiency. The experiments reveal that a mid-range complexity in state spaces, coupled with a policy that supports both exploration and exploitation (like PExploit), tends to yield the most effective results in terms of learning speed and operational efficiency.
As we look to the future, the insights and technological advancements derived from this research have significant implications for the broader application of multi-agent systems. This work enhances our understanding of how to effectively implement and manage complex interactions in environments ranging from robotic teams to automated traffic systems. With each step forward, we move closer to realizing the full potential of intelligent, adaptable, and autonomously coordinated systems.