R&D
11
Building Agentic Systems for Cultural Archives
Field
June 9, 2026

Creative Developer Daan Rongen explores the agentic systems behind IBM’s GRAMMY Museum interactive table, looking at how FIELD approached content-generation tools for music-led experiences.

The Starting Point

IBM commissioned this work in partnership with the GRAMMY Museum for a permanent installation about the connections that run through recorded music. The brief was deceptively simple: let a visitor pick a piece of music information — a track, an artist, a genre — and find their way to related content.

There are several reasonable ways to build a system like that, and the obvious answer was a recommender. Recommenders are now the default response to “what relates to this?” They power the next song, the next film, the next product, and almost every platform large enough to predict what someone might want next. For a music archive, that logic seemed useful at first, until the system began to show us the difference between proximity and meaning.

The Limits of Latent Space

The first build represented music as vectors in latent space. We embedded ten-second audio chunks through both CLAP and MERT to capture timbral and emotional character. We transcribed lyrics with Whisper and re-embedded the transcripts with a sentence transformer. Each artist’s biography, era, and origin went through the same sentence transformer. Then we ran KNN across the resulting latent spaces and let similarity build a graph.

The latent-space build: a track and its KNN neighbours across three embedding spaces, rendered as a graph.

A visitor at the table taps a track, and the system places it at the centre of a small constellation. Eight or ten nearby tracks orbit it. Each one is pulled in because the embedding model thinks it is close along some dimension: timbre, lyrical content, biography, or place of origin. Tapping a neighbour reshuffles the constellation around the new centre. There is no narration; the graph speaks in adjacencies, and the adjacencies are produced by the geometry of the latent space.

The map is not the territory, and a latent space is a particularly seductive kind of map: high-dimensional, dense, opinionated, and invisible. To see it at all, we have to project it down through something like UMAP or t-SNE, and the projection always looks better than it is. Distances are distorted by flattening, and clusters that appear obvious in three dimensions may not exist in the original embedding. Spanish-language tracks clustered together because Whisper had transcribed them as Spanish, not because they belonged together culturally or semantically. Instrumental songs collapsed because Whisper transcribed them using closed captioning.

From Retrieval to Interpretation

From Retrieval to Interpretation

We stopped asking the system what is nearest and started asking how they are related. That second question is structurally different. It requires reading several sources, weighing them against each other, writing a paragraph, and being prepared to be wrong about it in public. None of those is operations a vector index performs. They are operations a person performs, and the closest software analogue we have to a person performing them today is an agent.

The problem also changed shape. It stopped being a Cartesian problem: vectors, distances, KNN. It became a data-aggregation problem. The task was to populate a graph from text by gathering verifiable facts from named online sources, then turning those facts into edges. The graph stopped being learned and became authored. Every edge carries a label, a direction, and a weight put there by something we can interrogate.

A Small Editorial System

The system that replaced the recommender looks like an editorial department rather than a model. Five agents, each running on IBM watsonx against the Granite 4H Small family, sit atop a Neo4j graph. The graph is the archive that the editorial team works from.

The choice of model was a real constraint. Granite 4H Small is a small, on-premise-friendly model, far below the generalist frontier in raw capability. That constraint pushed the design toward small, isolated, testable functions; at this size, reading, writing, evaluating, and rewriting needed tighter prompts and stricter output schemas. The authored graph also had to carry much of the model's reasoning. Most agentic write-ups quietly assume frontier-scale capability; this one does not.

The team has a particular shape: there is no orchestrator agent at the top of the system delegating to sub-agents. The five agents are isolated from each other. Each has its own context, tools, prompt, and output schema. They are arranged on a pipeline rather than wired into a hierarchy. The orchestration is the pipeline; the intelligence stays local to the role, closer to a small editorial process than to a swarm.

A genre node, an artist node, and an authored edge between them. The agents work on graphs of this shape.

The research agent goes out and reads. It queries a set of trusted music data sources through tool calls, then sanitises the responses. It returns atomic research facts tagged by category: biography, history, style, influence, and impact. This is what tells a visitor where an artist made their name, or how a genre took shape in a particular decade.

Research expansion: the agent issues tool calls, research-fact nodes bloom outward, and synthetic Place and Decade nodes are extracted and linked back to the source.

The describe agent reads those facts and the node’s weighted neighbours, then writes the paragraph. Its prompt tells it to reference connected nodes by name in the prose. A tool wraps each reference in an HTML anchor tag, so the installation text becomes a navigable map. Higher-weight relationships anchor earlier sentences; lower-weight ones get a clause or nothing.

The describe agent reads the research log and the weighted neighbours, then composes the paragraph in a museum register.

The Audit Loop

A public cultural system has a low tolerance for confident drift. The evaluate and refine agents handle that work before a paragraph reaches the audience.

The evaluate agent reads what the describe agent wrote and scores it against the research log. It uses five rubrics: factuality, style, specificity, context use, and grounding. Each score sits on a one-to-five scale, with reasoning recorded per criterion. It does not rewrite; its only job is to flag what is wrong and explain why. The evaluator is calibrated in our eval suite against a golden-standard set of human-written samples. Its rubric is anchored to a curator’s judgement rather than the model’s own self-assessment.

The refine agent takes the full description and evaluation, then rewrites. It is not trying to improve the prose; its brief is narrower: correct what was flagged, strip what exceeds the evidence. It records its own reasoning as a diff, so the changes are traceable back to specific evaluation criteria.

Take Billie Holiday. The research agent was clear: Strange Fruit was her civil rights anthem; God Bless the Child was a jazz standard. The describe agent saw both alongside a high-weight connection to the God Bless the Child track and wrote fluent prose — then attributed the protest role to the wrong song. Every piece was in context; it was the synthesis that confused the cultural role of one song with that of another. The evaluate agent scored factuality at two out of five: it’s ‘Strange Fruit’ that is widely considered a civil rights anthem. The refine agent changed 75 per cent of the text and corrected it. The error that would have been visible on a museum touchscreen was resolved before a curator saw it.

Per-criterion scores from the evaluate agent are passed to the refine agent, which rewrites and records its reasoning as a diff.

A team-shaped architecture is more expensive than a single longer prompt. The gain is legibility of failure: language models fail plausibly, not loudly, and a single long prompt tells you nothing about where the reasoning broke. Isolated roles shrink that surface to one agent at a time, so a failing role can be re-prompted without touching the others.

The agents do not run unsupervised. The CMS sits on top of the back-end API, making the audit loop visible. A curator opens a node and sees its weighted neighbours, the research log, the describe agent’s draft, the evaluate agent’s scores, and the refine agent’s rewrite in one view. Approval is still an editorial decision; the agents prepare the material around it.

Handing the Interpretation Back

The four agents described so far operate outside the visitor loop. They populate the graph, write the descriptions, and audit them. The report agent is the exception: it runs at the table, in real time, and is the only one that responds to a specific visitor’s behaviour. It reads the traversal as a behavioural signature, nodes visited, order, dwell category, and graph paths between interests, and returns a title, a short paragraph, and one onward recommendation.

The visitor’s traversal aggregated into title, narrative, insight, and a single onward recommendation.

Closing the experience this way makes the session history visible. The visitor leaves the table with a short account of the path they made through the graph, rather than another generic recommendation.

When the System Becomes the Work

For most of the last few years, agentic tools have sat on the studio’s side of the deliverable. They helped texture a render, allowed creative teams to quickly draft prototypes, and tested our code. This project sits on the other side. The deliverable is the agentic system.

That changes the design work. The client is not only buying output; they are buying behaviour over time. Roles, prompts, schemas, review states, and failure visibility become part of the experience layer. The system has to keep working when the people who designed it are no longer in the room.

For cultural institutions, brands, and platforms, capability alone will not be enough. Agents need roles, restraint, and an observable means of correction. That is where the work starts to look less like model selection and more like editorial design.

Our R&D shows the technical and creative studies behind our work, from early prototypes and visual systems to machine learning experiments, interaction tests, and production methods. It gives space to the development process itself, showing how ideas take shape through research, iteration, and hands-on exploration.