To improve the reasoning and question-answering capabilities of Large Language Models (LLMs), several multi-agent approaches have been introduced. While these methods enhance performance, the application of collective intelligence to complex network structures, and the dynamics of agent interactions within them, remain underexplored. This work extends the concept of multi-agent debate to more general network topologies, measuring question-answering accuracy, influence, consensus, and the effects of bias on the collective. The results show that random networks perform similarly to fully connected networks despite using significantly fewer tokens. Furthermore, a strong consensus among agents correlates with correct answers, whereas divided responses typically indicate incorrect answers. Analyzing the influence of the agents reveals a balance between self-reflection and interconnectedness: self-reflection helps when local interactions are incorrect, and local interactions help when the agent itself is incorrect. Additionally, bias strongly affects system performance, with correctly biased hub nodes boosting accuracy and incorrectly biased hubs degrading it. These insights suggest that using random networks, or scale-free networks with knowledgeable agents placed in central positions, can enhance the overall performance of multi-agent systems.
LLMs have demonstrated impressive performance on various tasks but still struggle with hallucinations and incorrect answers. Multi-agent approaches, inspired by human problem-solving, have been introduced to address these issues. Techniques like ReAct and Reflexion enable LLMs to engage in iterative reasoning and self-reflection. This work explores multi-agent systems on scale-free networks to understand agent influence and network topology effects on performance, extending the concept of multi-agent debate to these complex networks.
LLM agents are represented as nodes in a network, with edges indicating communication channels. In multi-agent debate, agents first solve the problem individually, then reconsider their answers based on their neighbours' responses and their own previous answer. This process repeats for several rounds, culminating in a majority vote that determines the collective answer. We introduce bias by providing certain agents with correct or incorrect answers and analyze their influence based on network position, i.e., whether they occupy highly connected hubs or sparsely connected peripheral ("edge") nodes.
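A minimal sketch of this debate protocol is given below. The `ask_llm` helper and the exact prompt wording are assumptions for illustration; the source does not specify either.

```python
from collections import Counter

import networkx as nx

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API; returns an answer."""
    raise NotImplementedError

def debate(graph: nx.Graph, question: str, rounds: int = 4) -> str:
    # Round 1: every agent answers the question independently.
    answers = {node: ask_llm(question) for node in graph.nodes}
    for _ in range(rounds - 1):
        new_answers = {}
        for node in graph.nodes:
            neighbour_answers = [answers[nbr] for nbr in graph.neighbors(node)]
            # Each agent reconsiders given its own previous answer
            # and those of its neighbours.
            prompt = (
                f"{question}\n"
                f"Your previous answer: {answers[node]}\n"
                f"Your neighbours answered: {neighbour_answers}\n"
                "Reconsider and state your final answer."
            )
            new_answers[node] = ask_llm(prompt)
        answers = new_answers
    # The collective answer is the majority (plurality) vote in the final round.
    return Counter(answers.values()).most_common(1)[0][0]
```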
Experiments were conducted using three scale-free networks, each with 25 GPT-3.5-Turbo agents engaging in four rounds of debate to answer 100 high-school mathematics questions from the MMLU dataset. Each experiment was repeated three times to estimate statistical variation. Bias was introduced into hub or peripheral nodes, and performance was compared with unbiased networks to observe how biased nodes influence the spread of information and overall accuracy.
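The source does not state how the topologies were generated; the configuration below is a plausible reconstruction using networkx, where the Barabási–Albert and Erdős–Rényi models (and the edge probability `p`) are assumptions rather than details from the paper.

```python
import networkx as nx

N_AGENTS = 25      # agents per network
N_ROUNDS = 4       # rounds of debate
N_QUESTIONS = 100  # high-school mathematics questions from MMLU
N_REPEATS = 3      # repetitions per experiment

# Assumed generative models: Barabasi-Albert is a standard choice for
# scale-free graphs, Erdos-Renyi for random graphs.
topologies = {
    "fully_connected": nx.complete_graph(N_AGENTS),
    "fully_disconnected": nx.empty_graph(N_AGENTS),
    "random": nx.erdos_renyi_graph(N_AGENTS, p=0.2, seed=0),
    "scale_free": nx.barabasi_albert_graph(N_AGENTS, m=2, seed=0),
}
```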
Comparing the QA performance between different types of networks, it is evident that structure plays a role in accuracy, as shown in Table 1. In particular, random networks achieve similar performance to fully connected networks while using over four times fewer input tokens per round of debate. Scale-free networks, which use a similar number of input tokens per round to random networks, exhibit worse performance, suggesting the random topology is better suited to problem-solving tasks. In contrast, fully disconnected networks demonstrate the lowest performance, highlighting the importance of collaborative problem-solving.
Table 1: Input tokens per round of debate and QA accuracy for each network topology.

| Network | Input Tokens per Round | Accuracy |
|---|---|---|
| Fully Connected | 125,000 | 67.7 ± 1.1% |
| Fully Disconnected | 5,000 | 63.9 ± 0.4% |
| Random | 28,600 | 68.2 ± 0.5% |
| Scale-Free | 21,800 | 64.8 ± 1.0% |
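The token counts in Table 1 are consistent with input cost scaling with the number of neighbour messages exchanged per round. The back-of-the-envelope check below uses a per-message token estimate derived from Table 1, not a figure reported in the source.

```python
# Fully connected, 25 agents: each agent reads 24 neighbour answers per round,
# i.e. 25 * 24 = 600 relayed messages. At 125,000 input tokens per round
# (Table 1), that is about 125_000 / 600 = ~208 tokens per message, so input
# cost falls roughly in proportion to the number of edges.
def input_tokens_per_round(graph, tokens_per_message: float = 208.0) -> float:
    # Each undirected edge carries one message in each direction per round.
    return 2 * graph.number_of_edges() * tokens_per_message
```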
Comparing the performance between biased and unbiased systems, it is found that bias also plays a role in QA accuracy, as shown in Table 2. Networks with correctly (incorrectly) biased nodes at their hubs perform significantly better (worse) than their unbiased counterpart. In particular, networks with correctly biased hub nodes achieved roughly twice the accuracy of networks with incorrectly biased hubs: 88.1 ± 0.5% versus 43.8 ± 1.5%. Although bias is expected to impact performance, the sharp decrease in accuracy for incorrectly biased networks shows that it takes only a few well-connected biased agents, two in this case, to significantly impair the results. Moreover, the stronger comparative performance of the unbiased system demonstrates that although agents may be capable of solving problems correctly, they are easily influenced by incorrect agents. When bias is instead inserted at the periphery of the network, it has little effect on QA performance, likely because peripheral nodes have few connections through which to spread their answers.
Table 2: QA accuracy for scale-free networks with bias introduced at hub or peripheral (edge) nodes.

| Network | Accuracy |
|---|---|
| Unbiased | 64.8 ± 1.0% |
| Correctly Biased (Hub) | 88.1 ± 0.5% |
| Incorrectly Biased (Hub) | 43.8 ± 1.5% |
| Correctly Biased (Edge) | 65.7 ± 1.1% |
| Incorrectly Biased (Edge) | 64.9 ± 1.3% |
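A minimal sketch of how such bias could be injected follows; selecting the two highest-degree nodes as hubs and the prompt wording are both assumptions for illustration.

```python
import networkx as nx

def pick_biased_nodes(graph: nx.Graph, n_biased: int = 2, at_hubs: bool = True) -> list:
    """Select the n_biased highest-degree nodes (hubs) or lowest-degree nodes (periphery)."""
    ranked = sorted(graph.degree, key=lambda kv: kv[1], reverse=at_hubs)
    return [node for node, _ in ranked[:n_biased]]

def biased_prompt(question: str, planted_answer: str) -> str:
    # Hypothetical wording: the biased agent is seeded with an answer,
    # which may be the correct option or an incorrect one.
    return f"{question}\nYou are confident that the answer is {planted_answer}."
```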
To understand how an agent may be influenced, the probability of the agent being correct in round n, given its own response and the responses of its neighbours in round n-1, is shown in Figure 5. The figure shows that as the number of correct neighbours increases, so too does the probability of the agent being correct. Furthermore, agents that were correct in the previous round (green points) consistently outperform those that were incorrect (red points), highlighting the positive impact of self-reflection: regardless of the neighbours' responses, an agent is more likely to answer correctly if it was correct in the previous round. These findings highlight the importance of both individuality and collective thinking in multi-agent systems. That is, collaborative problem-solving improves the overall performance of the collective, while self-reflection improves performance when local interactions are misguided.
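This conditional probability can be estimated directly from the debate logs. A sketch, assuming a per-round record of each agent's correctness (a data structure not described in the source):

```python
from collections import defaultdict

def conditional_accuracy(graph, correct_by_round):
    """Estimate P(correct in round n | own correctness and number of correct
    neighbours in round n-1). `correct_by_round` is a list of dicts, one per
    round, mapping node -> bool."""
    counts = defaultdict(lambda: [0, 0])  # (was_correct, n_correct_nbrs) -> [hits, trials]
    for prev, curr in zip(correct_by_round, correct_by_round[1:]):
        for node in graph.nodes:
            n_correct_nbrs = sum(prev[nbr] for nbr in graph.neighbors(node))
            key = (prev[node], n_correct_nbrs)
            counts[key][0] += curr[node]
            counts[key][1] += 1
    return {key: hits / trials for key, (hits, trials) in counts.items()}
```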
To further understand the dynamics of these systems, the way agents change their answers between rounds is shown in Figure 6. In fully connected, scale-free, and random networks, the number of agents selecting and remaining on the correct answer increases with each round of debate. For fully disconnected networks, on the other hand, the number of agents remaining correct or incorrect is near-constant, with agents continuing to switch between correct and incorrect answers. Considering bias, networks correctly biased at their hubs exhibit a large number of agents switching from incorrect to correct answers after the first round, in agreement with Figure 4. Agents that reach the correct answer tend to keep it throughout the remaining rounds of debate. Networks incorrectly biased at their hubs, on the other hand, show an increasing number of agents switching from correct to incorrect after round two. This is a significant result: agents may hold the correct answer, yet be convinced to abandon it by the influence of their biased neighbours.
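These switching dynamics can be summarised by counting the four possible transitions between consecutive rounds; a small sketch under the same logging assumption as above:

```python
from collections import Counter

def transition_counts(correct_by_round):
    """Count correct/incorrect transitions between consecutive rounds.
    `correct_by_round` is a list of dicts, one per round, mapping node -> bool."""
    labels = {
        (True, True): "correct -> correct",
        (True, False): "correct -> incorrect",
        (False, True): "incorrect -> correct",
        (False, False): "incorrect -> incorrect",
    }
    return [
        Counter(labels[(prev[node], curr[node])] for node in prev)
        for prev, curr in zip(correct_by_round, correct_by_round[1:])
    ]
```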
While the accuracy gives insight into the average QA performance of the system, it provides little information on how answers are distributed inside the network during any given round, or on whether the agents agree. In fact, because the vote is a plurality over more than two candidate answers, the network can answer correctly even when fewer than half of its agents give the correct answer. This section explores how, and under which conditions, a consensus is formed. The percentage of agents that answered the question correctly in the final round is shown in Figure 7. Although this metric highlights the relationship between consensus towards the correct answer and overall QA performance, it still says little about how the answers are distributed across the remaining options.
To gain insight into the distribution of answers, the Simpson index is used to estimate the level of consensus within the collective. The Simpson index, a standard diversity measure, gives the probability that two randomly selected agents give the same answer in the final round of the experiment; it equals the sum of the squared frequencies of each answer. High values for fully connected networks, followed by those for random and scale-free networks, indicate a relationship between network connectivity and agreement among agents. The results show that a high degree of consensus among agents correlates with correct answers, indicating greater certainty. Conversely, when consensus is lower, the system is more likely to be incorrect.
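Concretely, with answer frequencies p_a the index is the sum over answers of p_a squared. A minimal implementation of the with-replacement variant of the index:

```python
from collections import Counter

def simpson_index(answers) -> float:
    """Probability that two randomly chosen agents (sampled with replacement)
    gave the same answer: the sum of squared answer frequencies."""
    n = len(answers)
    return sum((count / n) ** 2 for count in Counter(answers).values())

# Full consensus among 25 agents gives 1.0; 25 distinct answers give 1/25.
assert simpson_index(["A"] * 25) == 1.0
assert abs(simpson_index(list(range(25))) - 1 / 25) < 1e-12
```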
Strategic placement of knowledgeable agents in central network positions can enhance multi-agent systems' performance. Future systems should leverage network topology to optimize collective intelligence by placing larger, more capable models at network hubs and smaller models at the periphery, improving performance without significant computational cost. Further research should explore different network structures and larger systems to generalize these findings.
This study has important implications for designing future multi-agent systems but is limited by the number of agents, questions, and rounds due to computational constraints. Future work should explore a broader range of network structures, such as small-world networks, and increase the number of agents to better understand the dynamics and performance of these systems. Despite these limitations, the findings provide valuable insights into how bias and network topology influence collective problem-solving and consensus formation in multi-agent systems.