The government’s new AI partnerships with Anthropic and Meta, including pilots for AI assistants across public services, come as new research from the Open Data Institute (ODI) raises concerns about whether chatbots - which are increasingly used in daily life to answer questions on everything from transport to health and finance - can be trusted to give citizens accurate information about government services.
Drawing on more than 22,000 synthetically generated “citizen queries” such as “How do I apply for Universal Credit?”, which were mapped against authoritative answers from gov.uk, the ODI and its collaborators tested how leading AI models performed when answering real public-service questions. Responses from models including Anthropic’s Claude-4.5-Haiku, Google’s Gemini-3-Flash and OpenAI’s GPT-4o were then compared directly with official government sources.
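In outline, an evaluation of this kind pairs each generated query with its authoritative gov.uk answer and scores the model’s response against that reference. The sketch below illustrates the idea; the `ask_model` callable, the toy similarity metric and the accuracy threshold are assumptions for illustration, not the benchmark’s published pipeline (which is described in the arXiv release).

```python
# Illustrative sketch only: ask_model, the similarity metric and the
# threshold are hypothetical stand-ins, not the ODI's actual pipeline.
from difflib import SequenceMatcher

def similarity(answer: str, reference: str) -> float:
    """Crude textual similarity in [0, 1]; a placeholder for whatever
    factuality metric the benchmark actually uses."""
    return SequenceMatcher(None, answer.lower(), reference.lower()).ratio()

# Each record pairs a synthetic citizen query with its gov.uk reference answer.
dataset = [
    {
        "query": "How do I apply for Universal Credit?",
        "reference": "You apply for Universal Credit online through GOV.UK...",
    },
    # ... ~22,000 more generated query/reference pairs
]

def evaluate(ask_model, dataset, threshold=0.8):
    """Score a model (a callable: query -> answer) against the references."""
    scores = [similarity(ask_model(item["query"]), item["reference"])
              for item in dataset]
    accurate = sum(s >= threshold for s in scores)
    return accurate / len(scores)
```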
Examples of the errors the researchers found include:
- GPT-OSS-20B incorrectly advised that a person caring for a child whose parents have died is only eligible for Guardian’s Allowance if they are the guardian of a child who has died. It also incorrectly stated that the applicant is ineligible if they receive other benefits for the child.
- Llama 3.1 8B incorrectly stated that a court order is essential to add an ex-partner’s name to a child’s birth certificate. If followed, this advice would cause unnecessary stress and financial cost.
- Qwen3-32B incorrectly informed a pregnant woman that the Sure Start Maternity Grant is available in Scotland, even though it is not. It also advised a small charity to follow the standard Self Assessment tax deadline, but failed to ask what legal form the charity took, a crucial detail that changes which return is required and when it is due.
When LLMs are used as chatbots, they aim to be helpful by drawing on information from a diverse range of sources and presenting it as a single answer. Unless they are explicitly configured to do so, they do not reliably prioritise official government information, cite their sources or admit when they don’t know the answer.
If AI is to support people reliably when they ask citizen queries, models must deliver concise, accurate responses. However, some chatbots, such as Claude 4.5 Haiku, gave very verbose answers. These models may be accurate but swamp people with unrelated information, adding confusion and putting users at risk. Crucially, when researchers experimented with forcing models to be more concise and direct, their factual accuracy actually dropped, suggesting that when models are asked to condense their responses to citizen queries, they do not prioritise gov.uk information over other sources.
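The study’s exact prompting set-up is not detailed here, but the experiment can be pictured as running the same queries under a default prompt and a brevity-forcing prompt, then comparing scores. A minimal sketch, assuming a generic `chat(system, user)` helper and a `score` function; the prompt wording is illustrative, not the researchers’ own:

```python
# Hypothetical sketch: chat() stands in for any chat-completion call;
# the prompts are illustrative, not those used in the study.
DEFAULT_SYSTEM = "You are a helpful assistant answering UK public-service questions."
CONCISE_SYSTEM = (
    "You are a helpful assistant answering UK public-service questions. "
    "Answer in no more than two sentences and state only the key facts."
)

def compare_conditions(chat, dataset, score):
    """Run every query under both prompts; return mean accuracy per condition."""
    results = {}
    for name, system in [("default", DEFAULT_SYSTEM), ("concise", CONCISE_SYSTEM)]:
        scores = [score(chat(system, item["query"]), item["reference"])
                  for item in dataset]
        results[name] = sum(scores) / len(scores)
    return results  # the ODI found the concise condition scored lower
```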
Although the models are generally accurate, some answers are significantly wrong, undermining overall reliability. All tested models showed extremely high variance and a ‘long tail’ of poor answers. This is particularly important when users are asking questions such as ‘What is Capital Gains Tax?’, where there is no room for inaccuracy.
Importantly, the research also found that when AI models were given examples to follow, they gave shorter answers and became less accurate, basing their responses on information not supported by gov.uk. This highlights an important challenge for the use of LLMs: the need for models to prioritise authoritative government sources over other information, especially when people use ‘few-shot prompting’, a common technique in which worked examples are included in the prompt (illustrated below).
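Few-shot prompting simply means placing worked question-and-answer pairs in the prompt so the model imitates them when answering the real query. A minimal illustration in the common chat-message format; the example content is hypothetical, not drawn from the study:

```python
# Few-shot prompting: the prompt carries example Q&A pairs before the real query.
# The example answers below are illustrative only.
messages = [
    {"role": "system",
     "content": "Answer questions about UK government services."},
    # Worked examples the model is expected to imitate:
    {"role": "user", "content": "How do I renew my passport?"},
    {"role": "assistant",
     "content": "Apply online at GOV.UK; you will need a digital photo..."},
    {"role": "user", "content": "How do I apply for a provisional driving licence?"},
    {"role": "assistant",
     "content": "Apply online at GOV.UK from age 15 years and 9 months..."},
    # The actual citizen query comes last:
    {"role": "user", "content": "How do I apply for Universal Credit?"},
]
```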
Professor Elena Simperl, the ODI’s Director of Research, said: “The CitizenQuery-UK Benchmark is an important new tool to help anyone in the country assess how well large language models respond to people's needs for timely, up-to-date information about public services.
“If language models are to be used safely in citizen-facing services, we need to understand where the technology can be trusted and where it cannot. That means being open about uncertainty, keeping answers tightly focused on authoritative sources such as gov.uk, and addressing the high levels of inconsistency seen in current systems. Investment in independent benchmarks such as CitizenQuery-UK is essential if we want to build AI-supported public services that are reliable, accountable and worthy of public trust.
“CitizenQuery-UK, released as an open benchmark and accompanying dataset on arXiv, assesses the trustworthiness of LLMs in citizen query tasks. As AI becomes increasingly integrated into day-to-day life, our benchmark, built entirely on open data, lays the foundations for better evidence-based decision-making for AI and the public sector.”
The authors also challenge the idea that ever-larger, more resource-intensive models are always the best fit for the public sector. In many cases, smaller models delivered results comparable to those of large, closed-source models such as GPT-4.1, at a lower cost, making them better suited to public-sector requirements and demonstrating the need for flexibility in public-sector adoption of AI. The finding also shows why governments should avoid locking themselves into long-term contracts on the strength of a model temporarily outperforming its rivals on price or benchmarks.
With new AI models and model updates being launched every week, ongoing evaluation of their risks and benefits is essential, particularly before adoption by the public sector and application to public services. The research also indicates that rigorous testing and configuration are essential before deployment, along with safeguards such as defining models’ limits and ensuring they admit fallibility. CitizenQuery-UK is the first benchmark that facilitates the collection of the evidence needed to support analysis and decision-making for public-sector AI. The ODI is releasing the code and dataset so that everyone has the tools to test LLMs for factuality and trustworthiness.
The benchmark will be expanded to include languages such as Welsh, since gov.uk is fully available in Welsh, and will be regularly refreshed and curated to stay up to date. Evaluation pipelines, released as standalone code, will also be scaled up to enable continued monitoring of the accuracy and utility of large language models for citizen queries over time.