Problem-Solving in Language Model Networks

Ciaran Regan1, Alexandre Gournail2, Mizuki Oka1

1University of Tsukuba, Japan         2Grenoble INP - Ensimag, UGA, France


Abstract

To improve the reasoning and question-answering capabilities of Large Language Models (LLMs), several multi-agent approaches have been introduced. While these methods enhance performance, the application of collective intelligence-based approaches to complex network structures and the dynamics of agent interactions remain underexplored. This work extends the concept of multi-agent debate to more general network topologies, measuring the question-answering accuracy, influence, consensus, and the effects of bias on the collective. The results show that random networks perform similarly to fully connected networks despite using significantly fewer tokens. Furthermore, a strong consensus among agents correlates with correct answers, whereas divided responses typically indicate incorrect answers. Analyzing the influence of the agents reveals a balance between self-reflection and interconnectedness; self-reflection aids when local interactions are incorrect, and local interactions aid when the agent itself is incorrect. Additionally, bias plays a strong role in system performance, with correctly biased hub nodes boosting performance. These insights suggest that using random networks or scale-free networks with knowledgeable agents placed in central positions can enhance the overall performance of multi-agent systems.

Overview of multi-agent debate on complex networks.

Introduction

LLMs have demonstrated impressive performance on various tasks but still struggle with hallucinations and incorrect answers. Multi-agent approaches, inspired by human problem-solving, have been introduced to address these issues. Techniques like ReAct and Reflexion enable LLMs to engage in iterative reasoning and self-reflection. This work explores multi-agent systems on scale-free networks to understand agent influence and network topology effects on performance, extending the concept of multi-agent debate to these complex networks.

Methods

LLM agents are represented as nodes in a network with edges indicating communication channels. In multi-agent debate, agents solve problems individually, then reconsider answers based on neighbors' responses and previous answers. This process repeats for several rounds, culminating in a majority vote for the collective answer. We introduce bias by providing certain agents with correct or incorrect answers and analyze their influence based on network position (hubs or edges).
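The debate procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `answer_fn` is a hypothetical stand-in for a call to the underlying LLM, and the adjacency-dict representation of the network is an assumption for clarity.

```python
from collections import Counter

def debate(adjacency, answer_fn, question, rounds=4):
    # adjacency: {agent: [neighbour, ...]} describing the communication graph.
    # answer_fn(question, own_previous, neighbour_answers) -> answer string,
    # a hypothetical stand-in for a prompt to the LLM agent.

    # First round: every agent answers independently.
    answers = {a: answer_fn(question, None, []) for a in adjacency}

    # Remaining rounds: each agent reconsiders given its own previous
    # answer and the answers of its neighbours.
    for _ in range(rounds - 1):
        answers = {
            a: answer_fn(question, answers[a], [answers[n] for n in adjacency[a]])
            for a in adjacency
        }

    # Collective answer: majority vote over the final round.
    return Counter(answers.values()).most_common(1)[0][0]
```

A fully connected, random, or scale-free topology then differs only in the adjacency structure passed in; the debate loop itself is unchanged.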

Figure 1: Scale-Free Networks.
Figure 2: Random Networks.

Experimental Setup

Experiments were conducted using three scale-free networks, each with 25 GPT-3.5-Turbo agents, engaging in four rounds of debate to answer 100 high-school mathematics questions from the MMLU dataset. The experiment was repeated three times for statistical significance. Bias was introduced into hub or edge nodes, comparing performance with unbiased networks to observe how biased nodes influence information spread and overall accuracy.
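As an illustration of the setup, the scale-free topologies and the choice of hub versus edge nodes for biased agents can be generated along the following lines. This is a sketch using standard Barabási-Albert preferential attachment; the function names and parameters are ours, not the authors'.

```python
import random

def scale_free(n=25, m=2, seed=0):
    # Barabási-Albert preferential attachment: each new node attaches to
    # m existing nodes chosen with probability proportional to degree.
    rng = random.Random(seed)
    adjacency = {i: set() for i in range(n)}
    repeated = []             # each node appears once per unit of degree
    targets = list(range(m))  # targets for the first incoming node
    for v in range(m, n):
        for t in targets:
            adjacency[v].add(t)
            adjacency[t].add(v)
        repeated.extend(targets)
        repeated.extend([v] * m)
        # Sample m distinct targets, weighted by degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return adjacency

def hub_and_edge_nodes(adjacency, k=2):
    # Candidates for biased agents: the k highest- and k lowest-degree nodes.
    by_degree = sorted(adjacency, key=lambda a: len(adjacency[a]))
    return by_degree[-k:], by_degree[:k]
```

Biasing the network then amounts to injecting a correct (or incorrect) answer into the prompts of the returned hub or edge nodes before the debate begins.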

Results

Network Structure and QA Performance

Comparing the QA performance between different types of networks, it is evident that structure plays a role in accuracy, as shown in Table 1. In particular, random networks achieve similar performance to fully connected networks while using roughly a quarter of the input tokens per round of debate (28,600 versus 125,000). Scale-free networks, which use a similar number of input tokens per round to random networks, exhibit worse performance, suggesting the random topology is better suited to these problem-solving tasks. In contrast, fully disconnected networks demonstrate the lowest performance, highlighting the importance of collaborative problem-solving.

Table 1: Accuracy and number of tokens used per round of debate for various types of (unbiased) networks.
Network              Tokens per Round    Accuracy
Fully Connected      125,000             67.7 ± 1.1%
Fully Disconnected   5,000               63.9 ± 0.4%
Random               28,600              68.2 ± 0.5%
Scale-Free           21,800              64.8 ± 1.0%
Figure 3: Accuracy per round of debate for different types of networks. Initially, the accuracy of all systems is the same, but as rounds progress, fully connected networks show the steepest increase in accuracy.

Bias and QA Performance

Comparing biased and unbiased systems, bias is also found to play a role in QA accuracy, as shown in Table 2. Networks with correctly (incorrectly) biased nodes at their hubs perform significantly better (worse) than their unbiased counterparts. In particular, networks with correctly biased hub nodes performed twice as well as networks with incorrectly biased hubs, with accuracies of 88.1 ± 0.5% and 43.8 ± 1.5% respectively. Although bias is expected to impact performance, the sharp decrease in accuracy for incorrectly biased networks shows that it takes only a few biased, well-connected agents, two in this case, to impair the results significantly. Moreover, the stronger comparative performance of the unbiased system demonstrates that although agents may be capable of solving problems correctly, they are easily influenced by incorrect agents. When bias is instead inserted at the edge of the network, it has little effect on QA performance.

Table 2: Accuracy for biased scale-free networks.
Network                     Accuracy
Unbiased                    64.8 ± 1.0%
Correctly Biased (Hub)      88.1 ± 0.5%
Incorrectly Biased (Hub)    43.8 ± 1.5%
Correctly Biased (Edge)     65.7 ± 1.1%
Incorrectly Biased (Edge)   64.9 ± 1.3%
Figure 4: Accuracy per round of debate for scale-free networks with different types of biases. Networks with correctly biased hubs show a significant increase in accuracy after the first round.

Influence

To understand how an agent may be influenced, Figure 5 shows the probability of an agent being correct in round n, given its own response and the responses of its neighbours in round n-1. As the number of correct neighbours increases, so does the probability of the agent being correct. Furthermore, the tendency for green points to lie above red points highlights the positive impact of self-reflection: regardless of its neighbours' responses, an agent is more likely to answer correctly if it was correct in the previous round. These findings highlight the importance of both individuality and collective thinking in multi-agent systems. That is, collaborative problem-solving improves the overall performance of the collective, while self-reflection improves performance when local interactions are misguided.
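The quantity plotted in Figure 5 can be estimated from debate transcripts roughly as follows. The transcript format (a list with one agent-to-answer dict per round) is an assumption for illustration, not the paper's data format.

```python
from collections import defaultdict

def influence_table(history, adjacency, correct_answer):
    # history: list of {agent: answer} dicts, one per round (assumed format).
    # Returns the empirical P(correct in round n | own correctness and
    # number of correct neighbours in round n-1),
    # keyed by (was_correct, n_correct_neighbours).
    counts = defaultdict(lambda: [0, 0])  # key -> [times correct, total]
    for prev, curr in zip(history, history[1:]):
        for agent, neighbours in adjacency.items():
            was_correct = prev[agent] == correct_answer
            n_correct = sum(prev[n] == correct_answer for n in neighbours)
            key = (was_correct, n_correct)
            counts[key][1] += 1
            counts[key][0] += int(curr[agent] == correct_answer)
    return {key: hits / total for key, (hits, total) in counts.items()}
```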

Figure 5: Agent influence; the likelihood of an agent being correct based on both its own answer and the answer of its neighbors in the previous round. Agents are more likely to be correct if the majority of their neighbors are correct.

Dynamics

To further understand the dynamics of these systems, the way in which agents change their answers between rounds is shown in Figure 6. In the case of fully connected, scale-free, and random networks, the number of agents selecting and remaining on the correct answer increases with each round of debate. For fully disconnected networks, on the other hand, the number of agents remaining correct or incorrect is near-constant, with agents continuing to switch between the correct and incorrect answers. Considering bias, networks correctly biased at their hubs exhibit a large number of agents switching from incorrect to correct answers after the first round, in agreement with Figure 4. These agents with correct answers tend to keep the correct response throughout the remaining rounds of debate. Networks incorrectly biased at their hubs, on the other hand, have an increasing number of agents switching from correct to incorrect after round two. This is a significant result, as it highlights the fact that agents may have the correct answer, but will be convinced to switch due to the influence of their biased neighbours.
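The transitions shown in Figure 6 amount to counting, between each pair of consecutive rounds, how many agents stay correct, stay incorrect, or switch in either direction. A sketch, again assuming a transcript of one agent-to-answer dict per round:

```python
from collections import Counter

def answer_transitions(history, correct_answer):
    # For each pair of consecutive rounds, count agents in each of the four
    # cells: (was correct, now correct), (was correct, now incorrect), etc.
    per_round = []
    for prev, curr in zip(history, history[1:]):
        cells = Counter()
        for agent in prev:
            before = prev[agent] == correct_answer
            after = curr[agent] == correct_answer
            cells[(before, after)] += 1
        per_round.append(dict(cells))
    return per_round
```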

Figure 6a: Dynamics in Fully Connected Networks.
Figure 6b: Dynamics in Fully Disconnected Networks.
Figure 6c: Dynamics in Scale-Free Unbiased Networks.
Figure 6d: Dynamics in Random Networks.
Figure 6e: Dynamics in Correctly Biased Hub Networks.
Figure 6f: Dynamics in Incorrectly Biased Hub Networks.
Figure 6g: Dynamics in Correctly Biased Edge Networks.
Figure 6h: Dynamics in Incorrectly Biased Edge Networks.

Consensus

While accuracy gives insight into the average QA performance of the system, it provides little information on how answers are distributed within the network during a given round, or on whether the agents agree. In fact, under majority voting the network can answer correctly with fewer than half of its agents giving the correct answer. This section explores how, and under which conditions, a consensus is formed. Figure 7 shows the percentage of agents in the network that answered the question correctly in the final round. This metric highlights the relationship between consensus towards the correct answer and overall QA performance, but still says little about how the remaining answers are distributed.

Figure 7: The percentage of agents that answered correctly in the final round.

To gain insight into the distribution of answers, the Simpson index is used to estimate the level of consensus within the collective. The Simpson index, which is used to quantify diversity, measures the probability that any two randomly selected agents give the same answer in the final round of the experiment. High values for fully connected networks, followed by those for random and scale-free networks, indicate a relationship between network connectivity and the agreement among agents. The results show that a high degree of consensus among agents correlates with correct answers, indicating greater certainty. Conversely, when consensus is lower, the system is more likely to be incorrect.
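Concretely, the Simpson index over the final-round answers can be computed as below, using the without-replacement form that matches the "two randomly selected agents" description; whether the paper uses this form or the with-replacement variant is not stated, so this is an assumption.

```python
from collections import Counter

def simpson_index(answers):
    # Probability that two distinct, randomly chosen agents gave the same
    # answer in the final round; 1.0 indicates full consensus.
    n = len(answers)
    return sum(c * (c - 1) for c in Counter(answers).values()) / (n * (n - 1))
```

For example, a 25-agent network that splits evenly between two answers scores close to 0.5, while unanimous agreement scores exactly 1.0.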

Figure 8a: Simpson Index for Fully Connected Networks.
Figure 8b: Simpson Index for Fully Disconnected Networks.
Figure 8c: Simpson Index for Scale-Free Unbiased Networks.
Figure 8d: Simpson Index for Random Networks.

Conclusion

Strategic placement of knowledgeable agents in central network positions can enhance multi-agent systems' performance. Future systems should leverage network topology to optimize collective intelligence by placing larger, more capable models at network hubs and smaller models at the periphery, improving performance without significant computational cost. Further research should explore different network structures and larger systems to generalize these findings.

Discussion and Limitations

This study has important implications for designing future multi-agent systems but is limited by the number of agents, questions, and rounds due to computational constraints. Future work should explore a broader range of network structures, including random and small-world networks, and increase the number of agents to better understand these systems' dynamics and performance. Despite these limitations, the findings provide valuable insights into how bias and network topology influence collective problem-solving and consensus formation in multi-agent systems.