Bias in metadata 3: Adding (polyvocal) context to semantic web representations

  • feb 2023
  • Ryan Brate
  • Updated 28 Jun
Preservation Digitaal Erfgoed

This is the third of three blog posts published by the Dutch Heritage Network Preservation Watch, which monitors technological developments relevant to the GLAM sphere.

The blog posts are written by Ryan Brate, who is a PhD candidate within DHLab of the Koninklijke Nederlandse Akademie van Wetenschappen. Read the interview with Ryan and the first blog and second blog in the series on Kennisplatform Preservation.

Humans are very good at parsing the relationships between objects described in unstructured text and at recognising variation in context. However, the information held in such unstructured formats means very little to computers, which makes information extraction and exploration difficult for automated methods. Semantic web technologies are an effort to give information an underlying structure, i.e. to make it machine-readable. They do this by describing snippets of knowledge as subject, predicate, object relationships, known as triples (e.g., Amsterdam grantedCityRights 1306). By using URIs as unique object identifiers, it becomes possible to assemble unambiguous pieces of information from multiple datasets into linked data (i.e., knowledge graphs). As the name implies, semantic web technologies originate in an early vision of Tim Berners-Lee, the inventor of the web, of how information on the internet might interconnect. These technologies are attractive to digital humanities scholars and GLAM institutions (Galleries, Libraries, Archives and Museums) as a way of mapping what we know about a subject. The knowledge representation provided by linked data can be readily queried and used to explore how pieces of information are linked.

What do we mean by polyvocality and the role of context in information representation?

The philosopher David Hume separates all statements of knowledge into two types: relations of ideas and matters of fact. That the angles of a rectangle sum to 360 degrees is an example of a relation of ideas: a truth that can be recognised without empirical evidence. Matters of fact, on the other hand, are knowledge based on inference from what we observe and test in the real world. However, neither category of knowledge is absolute; both depend, to varying degrees, on underlying assumptions and contextual framing.

The humanities are specifically concerned with human viewpoints as the contextual frames of reference in which statements of fact are recorded. Multiple, potentially differing viewpoints together make up a polyvocal collection. Thus, when we talk of polyvocality, we mean the narratives that exist in relation to some subject, each told from a particular subjective viewpoint. Such viewpoints may be informed by specific events, country, religion, socio-economic background and so on. In short, where there is a difference in human perspective, there may be a particular narrative worth distinguishing and representing in data. We do not have to search hard to find terminology relevant to GLAM institutions for which the ability to highlight different viewpoints on its use is invaluable. For example, the Amsterdam Museum recently stopped using the phrase “Gouden Eeuw” (Golden Age) as a synonym for the 17th century. The phrase is explicit in its positive bias towards Dutch expansion and colonial activity in this period: it represents an entirely Dutch-centric viewpoint, likely very different from the perspective of the colonised or slave-traded peoples. Similarly, the Tropenmuseum’s Words Matter publication highlights the contrary implications of the word marron (maroon in English): generally derogatory, but for some a source of pride that represents the struggle against colonialism. By identifying and mapping such context-dependent, subjective viewpoints on particular objects, GLAM institutions are better able to handle such cases sensitively and in an informed manner.


Incorporating different viewpoints, i.e., polyvocality, into the linked data assemblies described above would give heritage professionals a valuable resource for recording the varying connotations and implications of objects, according to the contrasting contexts from which they may be viewed. These contexts may include time, the groups of people affected, or other factors. The objects in question could be collection objects (paintings, essays, etc.) or any other item of interest, such as terminology. Such a resource would help GLAM institutions enrich object descriptions with information that communicates lesser-known viewpoints on a subject to the reader.

Can the semantic web accommodate polyvocality and time-dependent information?

The short answer is “Yes”, but significant challenges remain.

The core of semantic web technologies is often described as consisting of RDF, OWL and SPARQL. RDF (Resource Description Framework) is the fundamental data model of the semantic web: it defines collections of knowledge as subject, predicate, object triples, using URIs as unique object references. OWL (Web Ontology Language) enables the definition of domain-specific ontologies, i.e., the predicate relationships between objects and the types of the objects being linked. SPARQL is an SQL-like query language for querying data described in RDF.
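To make the triple model concrete, here is a minimal sketch in plain Python rather than in a real RDF library. The `http://example.org/` namespace, the `locatedIn` predicate and the `match` helper are assumptions made purely for illustration; only the Amsterdam example comes from the text above. The helper imitates, very loosely, what a SPARQL basic graph pattern does: positions left as `None` behave like query variables.

```python
from typing import Iterator, Optional

# A triple is (subject, predicate, object); URIs are plain strings here.
Triple = tuple[str, str, str]

EX = "http://example.org/"  # hypothetical namespace, not a real vocabulary

graph: set[Triple] = {
    (EX + "Amsterdam", EX + "grantedCityRights", "1306"),
    (EX + "Amsterdam", EX + "locatedIn", EX + "Netherlands"),
}


def match(
    graph: set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in graph:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


# Roughly: SELECT ?s ?o WHERE { ?s ex:grantedCityRights ?o }
city_rights = sorted(match(graph, p=EX + "grantedCityRights"))
```

A real deployment would use an RDF store and SPARQL itself; the point of the sketch is only that a knowledge graph is a set of uniformly shaped statements that can be filtered by pattern.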

Accommodating polyvocality in linked data is a problem of recording the provenance of a statement (RDF triple): that is, being able to attach additional metadata to an RDF triple that describes the viewpoint the data originates from (e.g., the timeframe over which the statement is true). This means having one RDF triple refer to another, a technique known as reification. RDF-star is an extension to the RDF data model that offers a convenient way to make statements about other statements. As recently as 17 December 2021, a W3C community report was released detailing a final specification for RDF-star, along with an extension to SPARQL for querying RDF-star triples. The figure below, taken from the community report, gives an example of RDF-star being used to attach a viewpoint to an RDF triple.

Thus, RDF-star offers a simple way to attach provenance data (and other metadata) to the triples in a knowledge graph, and so makes it easier to capture the individual pieces of a polyvocal narrative.
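The idea of a statement about a statement can be sketched in plain Python by letting a triple appear in the subject position of another triple. This is an illustration of the concept only, not RDF-star syntax or any library's API. All identifiers (`EX`, `hasConnotation`, `fromViewpoint`, `generalUsage`, `someCommunities`) are hypothetical; the two connotations of "marron" paraphrase the Words Matter example mentioned earlier.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    subject: "Term"   # a URI string or, RDF-star style, another triple
    predicate: str
    object: "Term"


Term = Union[str, Triple]

EX = "http://example.org/"              # hypothetical namespace
MARRON = EX + "term/marron"
HAS_CONNOTATION = EX + "hasConnotation"  # hypothetical predicate
FROM_VIEWPOINT = EX + "fromViewpoint"    # hypothetical predicate

# Two conflicting statements about the same term.
derogatory = Triple(MARRON, HAS_CONNOTATION, "derogatory")
pride = Triple(MARRON, HAS_CONNOTATION, "source of pride")

# Each annotation triple has a whole triple as its subject.
graph = [
    Triple(derogatory, FROM_VIEWPOINT, EX + "viewpoint/generalUsage"),
    Triple(pride, FROM_VIEWPOINT, EX + "viewpoint/someCommunities"),
]


def viewpoints_on(graph: list[Triple], subject: str) -> dict[Term, list[Term]]:
    """Group the connotations recorded for `subject` by the viewpoint they come from."""
    result: dict[Term, list[Term]] = {}
    for t in graph:
        quoted = t.subject
        if (
            isinstance(quoted, Triple)
            and t.predicate == FROM_VIEWPOINT
            and quoted.subject == subject
        ):
            result.setdefault(t.object, []).append(quoted.object)
    return result


voices = viewpoints_on(graph, MARRON)
```

Because both statements stay in the graph, each tied to its own viewpoint, neither has to be chosen as the single "true" description of the term.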

The challenges of knowledge graph polyvocality

Researchers in the semantic web community are realising that knowledge graphs need to capture context better in order to represent data in a more inclusive and polyvocal manner. The topic has come up, for example, in multiple Dagstuhl workshops, such as seminar 18371 Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web and seminar 22372 Knowledge Graphs and Their Role in the Knowledge Engineering of the 21st Century (report forthcoming). These workshops are meant to bring researchers together to brainstorm and map out new research directions for the community. Some recent publications have also called for the topic to be addressed, such as the Cultural AI Lab’s A Polyvocal and Contextualised Semantic Web (2021). As this problem is not yet solved, we summarise the core challenges here.

Adding information to Knowledge Graphs

Suppose we have a collection of knowledge statements to add to a knowledge graph, complete with a variety of metadata about their source that can be recorded as provenance information, using RDF-star or otherwise. Even then, we have not solved polyvocal representation in knowledge graphs. In the first instance, there is the issue of correctly interpreting and representing the context in which the knowledge statements appear. The textual context in which statements are made can be difficult even for human readers to summarise adequately. For automated NLP methods this is a hard problem, and knowledge graphs are frequently constructed from data extracted from unstructured text via text mining techniques. In addition, even within a single information source there may be inconsistency of voice: a publication may tend towards a certain socio-cultural outlook, but there may be considerable variation between its writers. In short, further work is needed on voice-aware information extraction techniques that adequately represent the context, and ultimately the viewpoints, from which knowledge statements are taken.

Usage of Polyvocal Knowledge

Accepting, then, that even contextualised triples in a knowledge graph will probably not point directly to a viewpoint, any polyvocal analysis of knowledge graphs will require further interrogation and analysis of the complex information they contain. Semantic web examples such as Europeana and Wikidata already have complex data models, and adding voice as a further layer in some form has the potential to increase this complexity. Work on further exploratory and visualisation tooling to aid such interrogation would be greatly beneficial.

The digital heritage community harbours a vast treasure of polyvocal data sources, as well as the expertise to interpret them. This makes it the ideal domain to develop and test polyvocal approaches.

The complete series by Ryan Brate:
blog 1: Monitoring advances in the field of AI, with an emphasis on bias
blog 2: The Influence of polyvocality on the life-cycle of the GLAM objects
blog 3: Adding (polyvocal) context to semantic web representations
Interview with Ryan Brate
