
In the last few years, LLMs (Large Language Models) and related technologies have vastly extended the ability of computers to work with natural language. But although they are often treated as question-answering machines, they struggle to remain faithful to facts.
KGs (Knowledge Graphs) have become a new core component in many organisations, enriching their ability to fuse data together and to answer questions. But KGs often come with limited expressivity and rigid schemas.
In this talk we will offer introductions to both of these technologies, and discuss how we can use the novel capabilities of large language models to extend the expressivity of the knowledge representation used in knowledge graphs, thus complementing both technologies.
Who is this webinar for?
Anyone who is interested in data, AI, LLMs and KGs.
Level of difficulty
For everyone, it's an opportunity to hear from a renowned expert and learn about LLMs and KGs.
Can't come to the event, but would like to see the recording? Then join our research email list and we'll email you as soon as it is released, along with updates of all our latest research.
Speakers
Denny Vrandečić, Head of Special Projects, Wikimedia Foundation
Denny Vrandečić is Head of Special Projects at the Wikimedia Foundation, where he leads the work on Abstract Wikipedia and Wikifunctions. He is the founder of Wikidata. Previously, he co-founded Semantic MediaWiki, used in many organizations such as NASA, the US intelligence agencies, the Metropolitan Museum of Art, and others. He received a PhD from KIT and was a visiting researcher at USC's ISI and the Laboratory of Applied Ontologies at the CNR in Rome. He was the founder of the Croatian Wikipedia, and was an elected member of the Wikimedia Foundation Board of Trustees. He worked on the Google Knowledge Graph from 2013 to 2019. He now lives in Stuttgart, Germany.
Wikifunctions: https://wikifunctions.org
Wikidata: https://wikidata.org
Elena Simperl, Director of Research, ODI
Elena Simperl is the ODI’s Director of Research and a Professor of Computer Science at King’s College London. She is also a Fellow of the British Computer Society, a Fellow of the Royal Society of Arts, a senior member of the Society for the Study of AI and Simulation of Behaviour, and a Hans Fischer Senior Fellow.
Elena’s research is in human-centric AI, exploring socio-technical questions around the management, use, and governance of data in AI applications. According to AMiner, she is in the top 100 most influential scholars in knowledge engineering of the last decade. She also features in the Women in AI 2000 ranking.
In her 15-year career, she has led 14 national and international research projects, contributing to another 26. She leads the ODI’s programme of research on data-centric AI, which studies and designs the socio-technical data infrastructure of AI models and applications. Elena chaired several conferences in artificial intelligence, social computing, and data innovation. She is the president of the Semantic Web Science Association.
Elena is passionate about ensuring that AI technologies and applications allow everyone to take advantage of their opportunities, whether that is by making AI more participatory by design, investing in novel AI literacy interventions, or paying more attention to the stewardship and governance of data in AI.
Questions asked at the ODI Data-centric AI Webinar #4 on Oct 8 2024
Question: I want to ask about the connection between Wikifunctions, Wikipedia and Wikidata, since for me it's still vague how a function library processes information stored in Wikipedia and Wikidata.
Wikifunctions will very soon have access to data in Wikidata. That will be simply baked in: a function in Wikifunctions can access data from Wikidata. For example, if we want a function that calculates the distance between two cities as the crow flies, we can take the two cities as arguments, use Wikidata to look up the geo-coordinates of those cities, and then calculate the distance based on those coordinates.
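Purely as an illustration (not the actual Wikifunctions implementation), here is a minimal Python sketch of such a function: it looks up each city's coordinate location (Wikidata property P625) through the public Wikidata API and computes the great-circle distance with the haversine formula. The helper names are invented for this example.

```python
import math
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def get_coordinates(qid):
    """Look up the coordinate location (P625) of a Wikidata item."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetclaims", "entity": qid,
        "property": "P625", "format": "json",
    })
    value = resp.json()["claims"]["P625"][0]["mainsnak"]["datavalue"]["value"]
    return value["latitude"], value["longitude"]

def distance_km(city_a, city_b):
    """Great-circle ('as the crow flies') distance between two Wikidata items."""
    (lat1, lon1), (lat2, lon2) = get_coordinates(city_a), get_coordinates(city_b)
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))  # mean Earth radius ~6371 km

# Edinburgh is Q23436 and London is Q84 on Wikidata
print(round(distance_km("Q23436", "Q84")))  # ≈ 534
```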
In Wikipedia, we will be able to put a function call into the Wikipedia article. This way, the article source text contains the function with the arguments, Wikifunctions evaluates the function call, returns a result, and the result is displayed in the Wikipedia article. To give a simple example: if we want to write the text “Edinburgh is 534 km (332 miles) away from London.”, we could simply give the argument 534 to a function that will then calculate 332, and this way we can avoid calculating the second number manually (and with the function above, we can even avoid calculating the first number manually).
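Again as an illustration only (the real wiki syntax for such a call may look different), a sketch of the text-producing function: given a distance in kilometres it computes the miles value and returns the rendered snippet, so only one number ever has to be entered by hand.

```python
def km_with_miles(km: float) -> str:
    """Render a distance in kilometres with the miles equivalent in brackets."""
    miles = round(km * 0.621371)
    return f"{round(km)} km ({miles} miles)"

# What the article text could get back for the argument 534:
print(f"Edinburgh is {km_with_miles(534)} away from London.")
# -> "Edinburgh is 534 km (332 miles) away from London."
```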
Question: What is the ontology inside Wikidata [not Wikimedia]? And is it 4D, so that each item can have timestamps to show when it was valid?
The ontology is crowdsourced, just like everything in Wikidata. Every statement can have a time qualifier, but that is not required. So, the ontology allows individual statements to be 4D, but does not require it for every single statement.
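To make the point about time qualifiers concrete, here is a small sketch (mine, not part of the talk) that asks the Wikidata Query Service for Germany's heads of government together with the optional start-time (P580) and end-time (P582) qualifiers on each statement.

```python
import requests

QUERY = """
SELECT ?headLabel ?start ?end WHERE {
  wd:Q183 p:P6 ?statement .                  # Germany (Q183), head of government (P6)
  ?statement ps:P6 ?head .
  OPTIONAL { ?statement pq:P580 ?start . }   # start time qualifier
  OPTIONAL { ?statement pq:P582 ?end . }     # end time qualifier
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "odi-webinar-example/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["headLabel"]["value"],
          row.get("start", {}).get("value", "?"),
          "to",
          row.get("end", {}).get("value", "still in office"))
```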
Question: To achieve what the last slide portrays [Abstract Wikipedia], would it not be important that the statements in Wikidata adhere to the data model standard? How is the community performing or enforcing QA systematically?
Each function using data would require a certain structure of the data in order to work. This will be a negotiation between the contributors doing data entry and the contributors creating the functions that use the data. It is expected that this will lead to more uniformity in the data than we currently have: Wikifunctions will become a forcing function towards more uniformity in Wikidata.
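A hedged sketch of what "functions requiring a certain structure" can look like in practice: a function that needs a coordinate location (P625) on an item can report clearly when that expectation is not met, which in turn nudges editors towards modelling the data in the expected shape. Everything here beyond the property ID is invented for illustration.

```python
def expect_coordinate(item_claims: dict, qid: str):
    """Check that an item carries the coordinate location (P625) a function needs."""
    claims = item_claims.get("P625", [])
    if not claims:
        # A missing or differently modelled statement surfaces as a concrete,
        # fixable data issue rather than a silently wrong result.
        raise ValueError(f"{qid} has no coordinate location (P625); "
                         "the distance function cannot use this item yet.")
    return claims[0]["mainsnak"]["datavalue"]["value"]
```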
Question: Sadly, symbolic approaches still have an image problem -- the Semantic Web never really took root in the mainstream. How can we better promote explicit semantics?
I would disagree with the statement that the Semantic Web never took off: more than a third of all websites have RDF annotations on them. I do agree that the Semantic Web has an image problem, i.e. it really is a problem of perception – which may partially be due to inflated expectations at the beginning of that research programme. In my opinion, the best promotion of semantic technology is to continue to make it increasingly useful, and to help implement use cases where it shows its benefit.
Question: Does the pursuit of smaller LLMs conflict with the potential benefits they offer as a fuzzy coupling? In other words, are the advantages of these models only achievable at a larger scale?
I think that is an open research question. My gut feeling says that the current capabilities of state-of-the-art models can be achieved by considerably smaller models as well, but that is something for which I expect an empirical answer in the near future.
Question: Do you not think that AI mistakes will be ironed out by RLHF [Reinforcement Learning from Human Feedback]? Will AI eventually train itself?
We have had years of RLHF, and I don’t see a qualitative improvement regarding the types of mistakes I have presented. It seems that architectural changes (e.g. RAG, or evaluating code snippets) have had a much more qualitative effect on the performance of many AI applications. The follow-up question – will AI eventually train itself? – has a similar answer: yes, I think we will eventually have an architecture that will allow for that. But we don’t have that architecture yet.
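For readers unfamiliar with what an architectural change such as RAG looks like, here is a minimal, generic sketch (not any specific product's implementation): retrieve the passages most similar to the question from a small corpus and prepend them to the prompt, so the model answers from supplied evidence rather than from parametric memory alone. The `llm_generate` call at the end is a hypothetical placeholder for whichever model is in use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Wikidata is a free, collaboratively edited knowledge graph.",
    "Wikifunctions is a wiki of functions that anyone can call and edit.",
    "Abstract Wikipedia aims to generate Wikipedia articles from abstract content.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    q_emb = model.encode([question], normalize_embeddings=True)[0]
    scores = corpus_emb @ q_emb  # cosine similarity (embeddings are normalized)
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

question = "What is Wikifunctions?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = llm_generate(prompt)  # hypothetical call to the LLM being used
```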
Question: Is AI-assisted knowledge formalisation being considered to "digest" knowledge outside of Wikipedia?
I would hope that this will eventually be the case.
Question: Most approaches to RAG use an embedding model to rank the retrieved knowledge, where this has been pre-computed over documents. Given that LLMs don’t know QIDs, how do we avoid going back to the ambiguity of natural language to interface LLMs to a dynamic structured knowledge source like Wikidata?
LLMs are actually not too bad at remembering QIDs – certainly better than most humans I know. Besides that, Wikimedia Deutschland has recently published a vectorized version of Wikidata that has the explicit goal of helping with that task. You can read more about this here: https://blog.wikimedia.de/2024/09/17/wikidata-and-artificial-intelligence-simplified-access-to-open-data-for-open-source-projects/
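One common way to bridge natural-language mentions and QIDs, sketched here with a tiny hand-made index rather than the vectorized Wikidata dump mentioned above: embed each item's label plus a short description, and map a free-text mention to the nearest QID by vector similarity. The three entries and the `link` helper are invented for this example.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# A toy label index; in practice this would come from a vectorized Wikidata dump.
entities = {
    "Q84": "London, capital of the United Kingdom",
    "Q64": "Berlin, capital of Germany",
    "Q23436": "Edinburgh, capital of Scotland",
}
qids = list(entities)
index = model.encode(list(entities.values()), normalize_embeddings=True)

def link(mention: str) -> str:
    """Map a free-text mention to the QID with the most similar label text."""
    emb = model.encode([mention], normalize_embeddings=True)[0]
    return qids[int(np.argmax(index @ emb))]

print(link("the capital of Scotland"))  # expected: Q23436
```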