Karma: Tools for Mapping Collection Meta-Data to Linked Open Data


Pedro Szekely, University of Southern California, USA, Craig Knoblock, University of Southern California, USA, Jing Wan, Beijing University of Chemical Technology, China

Abstract

Museums around the world have built databases with metadata about millions of objects, their history, the people who created them, and the entities they represent. This data is stored in proprietary databases and is not readily available for use. Recently, museums have embraced the Semantic Web as a means to make this data available to the world, but the experience so far shows that publishing museum data to the linked data cloud is difficult: the databases are large and complex, the information is richly structured and varies from museum to museum, and it is difficult to link the data to other datasets. This paper describes Karma, our system for mapping museum data to the Linked Data Cloud. We describe the capabilities in Karma to easily map museum data to RDF according to an ontology of the user's choice, to link the data to hub datasets such as DBpedia and to other museum datasets, and to curate the data.

Keywords: Semantic Web, Linked Data, Resource Description Framework (RDF), Web Ontology Language (OWL), Extraction transformation and loading (ETL), Cultural Heritage

Introduction

Recently, several efforts have sought to publish metadata about the objects in museums as Linked Open Data (LOD). LOD provides an approach to publishing data in a standard format (called RDF) using a shared terminology (called a domain ontology) and linked to other data sources. The linking is particularly important because it relates information across sources, breaks down data silos and enables applications that provide rich context. Some notable LOD efforts include the Europeana project, the Amsterdam Museum, the LODAC Museum and ResearchSpace.

Mapping the data of a museum to linked data involves three steps:

  • Map the Data to RDF: The first step is to map the metadata about works of art into RDF. De Boer et al. (Boer 2012) note that the process is complicated because many museums have richly-structured data including attributes that are unique to a particular museum, and the data is often inconsistent and noisy.
  • Link to External Sources: Once the data is in RDF, the next step is to find the links from the metadata to other repositories from other museums or data hubs, such as DBpedia or GeoNames.
  • Curate the Linked Data: The third step is to curate the data to ensure that both the published information and its links to other sources within the LOD are accurate.

Despite the many recent efforts, significant challenges remain. Mapping the data to RDF is typically done by writing rules in specialized languages such as R2RML (http://www.w3.org/TR/r2rml/) or D2RQ (http://d2rq.org/), or by writing scripts in languages such as XSLT (http://www.w3.org/TR/xslt), Python or Java. Writing mappings using these technologies is labor intensive and requires significant technical expertise. Linking to external sources is also a technically difficult problem, so the number of links in past work is actually quite small as a percentage of the total set of objects that have been published. This means that museums need to set up teams with data experts who understand the data and software developers who understand the technologies. The process is expensive and creates a barrier to publishing linked open data. In addition, curation is labor intensive, creating an additional barrier to the publication of high-quality linked data.

In previous work (Szekely 2013), we described Karma, a tool for mapping structured sources to RDF, and for establishing and curating links to external sources. We described the lessons learned mapping the Smithsonian American Art Museum (SAAM) to the Europeana EDM ontology.

In this work we describe how Karma can be used to map data to the CIDOC CRM (http://www.cidoc-crm.org/) ontology, and present the challenges and lessons learned mapping the SAAM data to CRM.

Overview of CIDOC CRM

The CIDOC Conceptual Reference Model (CRM) is an ontology for describing information about cultural heritage. The CRM Web site states that “The intended scope of the CIDOC CRM may be defined as all information required for the scientific documentation of cultural heritage collections, with a view to enabling wide area information exchange and integration of heterogeneous sources.” The CRM OWL ontology is large, containing 82 classes and 263 properties. The ontology includes classes to represent a wide variety of events (e.g., creation, production, attribute assignment), immaterial things (e.g., information objects, appellations, rights) and material things (e.g., actors, physical things, man-made objects). It also includes many properties to represent the relationships among the classes of entities that can be represented.

CIDOC CRM became an ISO standard in 2006, and in recent years has gained significant traction in the Linked Data community. The British Museum and the Yale Center for British Art are notable users of CRM, providing extensive datasets using CRM.

The challenge when mapping data to CRM is that each column of a database must be mapped to the appropriate class and property in the CRM ontology. For example, to map the name of a person to CRM, we need to define a two-class structure. First, we map the name to the “rdfs:label” property of an “E82_Actor_Appellation” class. Then we need to define an “E21_Person” class and connect these two classes using the “P1_is_identified_by” property. In addition, we need to connect the classes defined for each column of a data table into a coherent structure. To do this, we often need to introduce additional classes and connect them together. For example, to specify that a person is the creator of an artwork, we need to introduce an “E12_Production” class, and then connect the production class to the person class using the “P14_carried_out_by” property and then connect the class for the artwork to the production class using the “P108i_was_produced_by” property.
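The two structures described above can be sketched as plain RDF triples in Python. This is an illustrative sketch, not Karma output: the CRM namespace URI, the entity URIs and the person name are assumptions made for the example.

```python
# Namespace constants; the CRM namespace URI is an assumption, adjust it
# to the serialization actually in use.
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://collection.americanart.si.edu/id/"

# Illustrative URIs; "John Smith" is a made-up name.
person = BASE + "person-institution/3"
appellation = person + "/appellation"
artwork = BASE + "object/1984.124.45"
production = artwork + "/production"

triples = [
    # Two-class structure for a person's name
    (person, RDF_TYPE, CRM + "E21_Person"),
    (appellation, RDF_TYPE, CRM + "E82_Actor_Appellation"),
    (appellation, RDFS_LABEL, "John Smith"),
    (person, CRM + "P1_is_identified_by", appellation),
    # Production event connecting the artwork to its creator
    (production, RDF_TYPE, CRM + "E12_Production"),
    (artwork, CRM + "P108i_was_produced_by", production),
    (production, CRM + "P14_carried_out_by", person),
]
```

Note that even these two small mappings already require seven triples spread over four resources, which gives a sense of the scale of a full object model.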

The process is both intellectually and technically challenging. It is intellectually challenging because we need to build elaborate structures using many classes and properties connected in a coherent way. For example, the structure to represent an object in SAAM has 49 classes and 66 properties.

The process is technically challenging because we need to specify how to build this elaborate structure from the data contained in the database. If we are using a language such as R2RML, we must specify a TripleMap structure for each of the classes (49 TripleMaps in the case of the SAAM dataset) and on average, each TripleMap requires 10 to 20 R2RML statements. An additional challenge is that often the data in the database does not map directly to the properties and classes in the CRM and needs to be reformatted before it fits the CRM.

Using Karma to map the SAAM data to the CRM

Our workflow to map cultural heritage data to the CRM consists of six steps:

  1. Define the URI Scheme: each entity needs a URI that uniquely identifies it in the published Linked Data. In this step we define a template for the URI for each type of entity (e.g., http://collection.americanart.si.edu/id/object/{object-number}). It is important to define these templates at the outset so that we can use them consistently when mapping each of the tables in the database.
  2. Extend the Ontology: even though the CRM provides a comprehensive set of classes and properties, the SAAM data contains data elements that cannot be accurately represented using the available classes and properties. For example, for some artists the nationality data is uncertain, so we had to define a new property to specify that the nationality is uncertain. Our extensions consist primarily of specializations of the CRM object and data properties.
  3. Define the Controlled Vocabularies: museums often use their own controlled vocabularies to represent features such as material, medium, subjects, etc. We represent these controlled vocabularies using SKOS and assign an appropriate URI for each term in these vocabularies.
  4. Clean and Normalize Data: even though museums carefully curate their data, tables often contain rows that should be filtered out, and cells contain values that don’t directly map to the CRM. For example, date fields often contain values such as “ca. 1890”, but the CRM requires that dates be in ISO format. The CRM provides constructs to define time intervals, so we need to write scripts to convert dates such as “ca. 1890” into the appropriate interval representation.
  5. Define the Mapping for Each Database Table: this involves defining the structure of classes and properties that must be built for each row in the table, and specifying the rules to build this structure from the columns in the table. As mentioned before, this is a critical and difficult step.
  6. Create and Publish RDF: once the mappings for each table are defined, we need to execute the mappings on the complete database to generate the RDF and then load the RDF in a triple store to make it available to the world. It is important to automate this process to automatically refresh the RDF after the master database is updated.

In addition to the six steps for creating and publishing the RDF data, there are two additional steps for creating and curating links. We discussed these steps in earlier work, so we will not discuss them further in this paper. In the rest of this section we discuss each of the six steps in more detail.

Define the URI Scheme

The URI scheme defines the conventions that will be used to identify all resources in the RDF dataset. The URI for each resource must be unique, must never change after it is published, and must be independent of implementation details. The URIs for all entities in the SAAM dataset have a common prefix:

“http://collection.americanart.si.edu/id/”

The SAAM data contains four different types of entities: objects, people or institutions, properties, and thesauri. We organize the URIs in a hierarchical structure as follows:

  • Collections: object/{ObjectNumber} (e.g., “/object/1984.124.45”)
  • People or institutions: person-institution/{ConstitutionID} (e.g., “/person-institution/3”)
  • Properties of a collection: object/{ObjectNumber}/{property name} (e.g., “/object/1984.124.45/dimension”)
  • Properties of a person or an institution: person-institution/{ConstitutionID}/{property name} (e.g., “/person-institution/3/birth”, “/person-institution/3/birth/birthdate”)
  • Thesauri: thesauri/{TermType}/{term name} (e.g., “/thesauri/nationality/American”)
In Karma we define the URIs for resources by defining formulas that assemble the URI from the elementary values present in the columns of a table. For example, the URI for an object is defined by the expression:

“object/” + getValue(“ObjectNumber”)

This expression concatenates the string “object/” with the value of the “ObjectNumber” column. The interface in Karma is similar to the interface for defining formulas in Microsoft Excel. The user enters the formula in a formula dialog, and Karma shows the effect of the formula in a new column on the screen. In contrast to Microsoft Excel, this new column is present only on the screen: Karma records it internally as part of the mapping definition, but the database table remains unchanged.
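The evaluation of such a formula against a row can be sketched in a few lines of Python; here getValue is a stand-in for Karma's built-in accessor of the same name, and the row is an illustrative example.

```python
def object_uri(row):
    """Evaluate the Karma-style formula "object/" + getValue("ObjectNumber")
    for one table row."""
    def getValue(column):
        # Stand-in for Karma's built-in accessor: look up a column value
        return row[column]
    return "object/" + getValue("ObjectNumber")

object_uri({"ObjectNumber": "1984.124.45"})  # → "object/1984.124.45"
```

The resulting value is later expanded with the common “http://collection.americanart.si.edu/id/” prefix described above.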

Extend the Ontology

Even though the CRM ontology is extensive, the SAAM dataset includes fields that cannot be accurately represented using the available classes and properties. For example, the SAAM dataset provides information about people’s first, middle and last names, but the CRM does not define properties in the “E82_Actor_Appellation” class to capture the parts of a name. Similarly, the CRM provides a “P3_has_note” property with domain “E1_CRM_Entity”, but does not provide a property to record the biography of an artist. Using only the CRM, it would have been necessary to represent a person’s name as a single value, forgoing the ability to, for example, search for people based on last name only. Similarly, it would have been necessary to record the biography of an artist as a “P3_has_note”, which would not distinguish the biography from any other type of note attached to a person.

In almost all cases we were able to define our extensions as subclasses and subproperties of existing classes and properties in the CRM. For example, we defined “PE_has_biography” as a subproperty of “P3_has_note”, restricting its domain to “E39_Actor”.

One of the difficult cases was capturing uncertainty in the nationality of an artist. The CRM provides the property “P107i_is_current_or_former_member_of” to record the membership of an actor in a group. When the nationality is certain, we record the relationship between an actor and his or her nationality using this property. One possibility for recording uncertainty would be to annotate the relationship to specify that it is uncertain. We rejected this approach because applications that do not reason with the annotations would retrieve the relationship as if it were certain. Instead, we define a new “PE_claimed_current_or_former_member_of” property as a super-property of “P107i_is_current_or_former_member_of”. Applications that do not understand the extension will not retrieve the uncertain nationality relationships.
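The two extensions just described boil down to a handful of schema triples. The sketch below spells them out; the extension namespace URI is hypothetical, and the CRM namespace URI is an assumption.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
CRM = "http://www.cidoc-crm.org/cidoc-crm/"   # assumed CRM namespace
EXT = "http://example.org/saam-crm-extensions/"  # hypothetical extension namespace

extension_triples = [
    # Biography as a specialized note, with its domain restricted to actors
    (EXT + "PE_has_biography", RDFS + "subPropertyOf", CRM + "P3_has_note"),
    (EXT + "PE_has_biography", RDFS + "domain", CRM + "E39_Actor"),
    # The certain membership property specializes the "claimed" one, so
    # certain memberships remain visible to extension-aware applications
    # while uncertain ones stay invisible to CRM-only applications
    (CRM + "P107i_is_current_or_former_member_of",
     RDFS + "subPropertyOf",
     EXT + "PE_claimed_current_or_former_member_of"),
]
```

The direction of the last triple is the important design choice: the CRM property is the subproperty, so asserting a certain membership entails a claimed one, but not vice versa.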

We define the ontology extensions in a separate ontology file and load it into Karma along with the CRM, SKOS and QUDT ontologies.

Define the Controlled Vocabularies

Each museum defines controlled vocabularies in its collection management system. The SAAM dataset uses the following controlled vocabularies:

  • Collection type (e.g. photography, graphic arts)
  • Collection medium (e.g. mezzotint on paper, color lithograph on paper)
  • Name type (e.g. married name, variant name)
  • Name title (e.g. King, sister)
  • Suffix (e.g. the Elder, le jeune)
  • Image rights (e.g. restricted, unrestricted)
  • Media type (used for collection’s images, e.g. digital image, black and white print)
  • Code (e.g. provisional, candidate, verified)
  • Medium view (used for artist images, e.g. luce, juley, permanent collection)
  • Keyword type (e.g. Subject specific, folk art, mark type)
  • Current site (e.g. AA2@E251, AA2@E250)
  • Place
  • Dimension
  • Nationality

We define the controlled vocabularies as thesauri represented using SKOS, and link the corresponding SKOS concepts to the relevant classes in the CRM ontology. For example, nationality “American” is represented as follows:

/id/thesauri/nationality/American   rdf:type   crm:E55_Type
/id/thesauri/nationality/American   rdf:type   crm:E74_Group
/id/thesauri/nationality/American   rdf:type   skos:Concept
/id/thesauri/nationality/American   skos:inScheme   /id/thesauri/nationality
/id/thesauri/nationality/American   skos:prefLabel   American

We extract the data for the controlled vocabularies using SQL queries on the relevant tables in the SAAM database. We map these tables to RDF in the same way as we map all the other tables in the SAAM dataset. The difference is that we map the data to both SKOS and CRM classes and properties.
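Because every term in a vocabulary follows the same pattern, the triples above can be generated mechanically. The sketch below shows one way to do this for the nationality thesaurus; it mirrors the triples listed above, with the CRM namespace URI assumed.

```python
def nationality_triples(term):
    """Generate the SKOS + CRM triples for one nationality term,
    following the /id/thesauri/nationality/{term} URI scheme."""
    CRM = "http://www.cidoc-crm.org/cidoc-crm/"   # assumed CRM namespace
    SKOS = "http://www.w3.org/2004/02/skos/core#"
    RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    scheme = "http://collection.americanart.si.edu/id/thesauri/nationality"
    concept = scheme + "/" + term
    return [
        (concept, RDF_TYPE, CRM + "E55_Type"),
        (concept, RDF_TYPE, CRM + "E74_Group"),
        (concept, RDF_TYPE, SKOS + "Concept"),
        (concept, SKOS + "inScheme", scheme),
        (concept, SKOS + "prefLabel", term),
    ]
```

Typing the concept as both a SKOS Concept and the CRM classes is what lets the same URI serve as a thesaurus term and as the group used in the membership relationships described earlier.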

Clean and Normalize Data

Often, museum datasets represent data in formats that differ from those required by the CRM. For example, the SAAM dataset represents days, months and years in separate columns, and the CRM expects them in ISO format. Before mapping the data, it is necessary to combine these fields. This is not always trivial; for example, single-digit months need a leading zero. In other cases, museum fields contain annotations such as “ca. 1890”. In these cases we need to remove the annotations and create separate columns to record the information they convey.

In Karma we define these transformations using formulas, in the same way that we use formulas to define URIs. Karma’s formula language is Python, so users can use the full power of the Python language to manipulate the data.
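A transformation of the kind described above can be written as an ordinary Python function. The sketch below turns an annotated date such as “ca. 1890” into an ISO interval; the ±5-year width chosen for “circa” dates is an illustrative assumption, not SAAM's actual rule.

```python
import re

def normalize_date(value):
    """Convert a museum date field into an ISO (start, end) interval.
    Handles plain years and 'ca.'-annotated years; returns None for
    values that need manual attention."""
    m = re.match(r"(ca\.)?\s*(\d{4})$", value.strip())
    if m is None:
        return None
    year = int(m.group(2))
    circa = m.group(1) is not None
    if circa:
        # Map a circa year to a +/- 5-year interval (illustrative choice)
        return ("%d-01-01" % (year - 5), "%d-12-31" % (year + 5))
    return ("%d-01-01" % year, "%d-12-31" % year)

normalize_date("ca. 1890")  # → ("1885-01-01", "1895-12-31")
```

Returning None for unrecognized values is deliberate: it surfaces the rows that need a curator's attention instead of silently guessing.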

Define the Mapping for Each Database Table

Karma provides a graphical user interface to make it easy for users to map database tables to an ontology. The figure below shows a Karma screen with the “WebMakers” table loaded and modeled according to the CRM ontology. The bottom part of the screen shows the data from the database. The top part of the figure shows the mapping to the CRM ontology. The top bubble, labeled “E22_Man-Made_Object1”, specifies that the table represents information about a man-made object. The “classLink” arrow ties the bubble to the column “Object” that defines the URI for the man-made object. The values in the column are defined using a formula that concatenates “object/” with the value of the “ObjectId” column (not in view). These URIs will be expanded using the global prefix for all the SAAM data. For example, the URI of the first man-made object will be “http://collection.americanart.si.edu/id/object/824”.

Karma automatically suggests mappings to the CRM ontology and provides an easy-to-use interface for adjusting the automatically generated mappings.


The second bubble labeled “E12_Production” specifies that the columns it encompasses contain information about the production of the artwork. The link clearly indicates the relationship to the man-made object. The third bubble specifies information about the person who created the object. The last column provides an example of a thesaurus that defines the different roles of people.

Create and Publish RDF

We mapped each of the 14 tables and views in the SAAM database. The models for the other tables are similar in complexity to the mapping shown in the figure above. After each mapping is complete, Karma produces a mapping file that can be used to generate the RDF for the complete table (40,000 objects and 9,000 people). The process of generating the RDF for all the tables takes less than 5 minutes on a modern laptop (MacBook Pro).
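An automated refresh pipeline of the kind mentioned in step 6 can be sketched as follows. The per-row mapping and the table contents here are illustrative stand-ins; in practice the mapping logic comes from the Karma mapping files, and the output would be loaded into a triple store.

```python
def row_to_triples(row):
    """Illustrative per-row mapping: one label triple per object.
    A real Karma mapping would emit the full class structure."""
    uri = "http://collection.americanart.si.edu/id/object/" + row["ObjectNumber"]
    return [(uri, "http://www.w3.org/2000/01/rdf-schema#label", row["Title"])]

def to_ntriples(triples):
    """Serialize (subject, predicate, literal) triples as N-Triples lines."""
    return "".join('<%s> <%s> "%s" .\n' % t for t in triples)

# Stand-in for a database table; a real pipeline would query the master
# database so the RDF refreshes whenever the database is updated.
table = [{"ObjectNumber": "1984.124.45", "Title": "Untitled"}]
rdf = to_ntriples(t for row in table for t in row_to_triples(row))
```

Keeping the pipeline scriptable like this is what makes the automatic refresh after database updates practical.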

References

Boer, V. de, J. Wielemaker, J. van Gent, M. Hildebrand, A. Isaac, J. van Ossenbruggen, and G. Schreiber. “Supporting Linked Data Production for Cultural Heritage Institutes: The Amsterdam Museum Case Study”. In E. Simperl, P. Cimiano, A. Polleres, O. Corcho, and V. Presutti (eds.), Lecture Notes in Computer Science, 733–747. Berlin/Heidelberg: Springer, 2012.
Szekely, P., C. A. Knoblock, F. Yang, X. Zhu, E. Fink, R. Allen, and G. Goodlander. “Connecting the Smithsonian American Art Museum to the Linked Data Cloud”. Proceedings of the 10th Extended Semantic Web Conference (ESWC 2013), Montpellier, May 2013.

Cite as:
Szekely, P., C. Knoblock, and J. Wan. "Karma: Tools for Mapping Collection Meta-Data to Linked Open Data." MW2014: Museums and the Web 2014. Published February 1, 2014.
https://mw2014.museumsandtheweb.com/paper/karma-tools-for-mapping-collection-meta-data-to-linked-open-data/

