Search and Deploy: Solving the Silo Problem with XML Document Databases
Ryan Donahue, Baltimore Museum of Art, USA
The Metropolitan Museum of Art has begun implementation of an application-agnostic content-persistence and search tool, in the form of an XML document database. This provides a consistent platform to store and search across content, while supporting a wide variety of deployment strategies and content-creation workflows.
Keywords: XML, CMS, content management, silos, database
1. In the beginning, there was only ad hoc
Modern museum content infrastructure is the result of the first fifty years of museums’ forays into the digital domain. In order to properly contextualize how we’ve gotten to our current state, we must first briefly detail those fifty years. Digital began in many museums in the late 1960s with the work of New York University’s Institute for Computer Research in the Humanities. By 1967, the Museum Computer Network was founded as a professional development group geared towards the establishment of the computer in cultural heritage institutions. In the 1970s, museums’ relationships with technology largely manifested as partnerships with research institutions and universities. In these early days, the focuses of digital efforts were, in most cases, solely electronic record keeping and collection management.
As time dragged on, more of museum operations, including content production, went digital. By the 1990s, museums were posting content to the Web, digital asset management systems emerged, and the first “viral videos” appeared. The 2000s brought further refinements to the internet, including Web 2.0, smart phones, and tablets. Museums largely began adopting Digital Asset Management Systems, which became fast friends with long-established collection management tools. These two systems, coupled with bespoke per-project or per-channel content management tools, became standard content infrastructure for museums. At a mid-sized museum, it’s not uncommon to see three or more content tools, a variety of databases, and lots of ‘sticky glue’ code holding the pieces in place.
At the Metropolitan Museum of Art (MMA) alone, we have utilized Sitecore, WordPress, and Drupal as web-facing CMSs. We’ve also utilized SQL Server, FileMaker Pro, flat files, Access, and MySQL for databases; and lots of bespoke AppleScript, bash scripts, PHP, Perl, VBScript, C#, and Objective-C code for gluing processes together.
The complexity and maintenance required for such an infrastructure are immense, and distract the digital media team from focusing on developing new solutions and projects. Furthermore, with so many technologies in play, the MMA risks permanently losing the ability to make materials accessible as technologies are made obsolete. This is due partly to the amount of code delivered by temporary vendors and contractors, who rarely focus on reducing maintenance costs and improving sustainability with the same intensity as internal staff.
2. Selecting an integration methodology
Recognizing the inability to act with agility on the corpus of MMA content, the MMA set out to develop a next-generation data integration system capable of enabling rapid application development, and to maximize the amount of shared code between projects, both in delivery and in content editing. This had to be undertaken while recognizing the importance of keeping legacy systems in place, at least for the time being.
The MMA set several design goals for the implementation of a new generation of content management infrastructure. The Met Content Repository (MCR) would use a shared-database approach to create an environment we could tap into to quickly create downstream applications that used the data and published across a variety of channels, from digital signage systems to the desktop to mobile and more. The MCR needed to require as little operational code and support as possible, to help focus the MMA development resources squarely on creating new digital products instead of maintaining a large data warehouse or complex “god application.” These constraints ruled out more complicated forms of data integration, such as messaging buses and canonical data model data warehousing.
Every museum, at some point in its quest for a sensible content architecture, has undergone (or will undergo) a Sisyphean phase of finding the One True System. Some institutions choose a collection management system (CMS) or a digital asset management system (DAM), jamming in assets with a mere cursory thought paid to the system intent. Some museums prefer using channel-specific content management systems.
Those that choose to use collection management systems for other purposes quickly learn how painful it can be to peel potatoes with hammers. Data maps become increasingly complicated as you bend the system to support entities that aren’t strictly in the purview of the system’s scope. Museum programs modeled as very short exhibitions, TMS-powered DAM systems (and DAM-powered CMSs), and object records for educational discovery kits are but three of the recent horrors the museum community has seen.
What appear to be perfectly sensible decisions when your software is on version 3 may look and feel quite different by the time you deploy version 9. As software matures, and domain-specific features and schema tweaks are made, the jury-rigged records drift further from an ideal data model and move closer to serious peril via data inconsistency.
People who use channel-specific CMSs (such as WordPress or Drupal) tend to fare a bit better. Generally, these systems have a smaller impedance mismatch than their non-content-management domain counterparts. Many of these systems, however, do not make regular ingestion of external content easy, and many have limitations for export as well. These systems generally assume that the content input is intended to be published to the channel the CMS is designed to serve. This can lead to significant differences in how content is structured in the CMS. Many systems assume the deliverable format is a website, and therefore the content is atomized into ‘pages,’ whereas in a channel-agnostic system, a web page would likely be a combination of smaller content pieces.
While content for an exhibition listing must generally be entered as one large document, in many cases exhibition text for the website is a combination of curatorial writing and the marketing department’s take on the show, sprinkled with artist interviews or content from the print catalogue or various essays. In reality, this document is composed of several different pieces, and closely derived documents (e.g., the traveling exhibition page for the show, press release for the show, text for the online version of the exhibition) are generally started by copy and paste. On larger sites with more content, this can lead to an extremely difficult editorial process dominated by scouring the site for all instances of the artist quote where the curatorial intern accidentally misspelled the name of the artist’s beloved childhood pet, or that donor’s last name, or any other speck of minutiae (read: nuance).
The boundaries of the system are clearly delineated: provide content downstream in ways that facilitate rapid innovation in the digital space, and assist in internal applications utilizing communication among major systems.
3. Content types: Entities and documents
The next step in the MMA process was to identify key entities that developers need downstream. The most obvious was the ‘object’ entity, representing an object in the museum collection. Almost every collection-driven feature the MMA has put in place relies on object data, including: the Timeline of Art History, Connections, 82nd & 5th, in-gallery kiosks, the website, digital signage, and more. Similarly, information about the constituents surrounding the art in the collection was necessary, as was information about the MMA’s various digital assets.
Beyond access to MMA business entities, the next corpus of data we needed to provide access to was the litany of bespoke content, such as journal articles and essays. This content exists in a variety of ways, from HTML blobs in databases to Word documents and other semi-structured content. These disparate forms of content can slowly coalesce towards a common format, starting with a standard XML envelope pattern, whereby content is ‘wrapped’ in a standard way.
Simplicity led us to focus on providing both entities and authored content as aggregated documents, as opposed to providing access to on-the-fly assemblages of these entities from a variety of relational databases and DAM applications. This was a more natural fit with the schema-less documents that represented the most complex data we needed to account for, and additionally removed a lot of work from MMA developers by aggregating entity information for them.
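As an illustration of the envelope pattern described above, here is a minimal Python sketch using only the standard library. The namespace URI, the element names (envelope, metadata, source, type, content), and the sample essay are all hypothetical; the paper does not specify the envelope schema the MMA actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the envelope; not taken from the MMA's schema.
ENVELOPE_NS = "urn:example:mcr:envelope"
ET.register_namespace("env", ENVELOPE_NS)


def wrap_in_envelope(content, source_system, content_type):
    """Wrap an existing content element in a standard envelope element."""
    envelope = ET.Element(f"{{{ENVELOPE_NS}}}envelope")
    meta = ET.SubElement(envelope, f"{{{ENVELOPE_NS}}}metadata")
    ET.SubElement(meta, f"{{{ENVELOPE_NS}}}source").text = source_system
    ET.SubElement(meta, f"{{{ENVELOPE_NS}}}type").text = content_type
    body = ET.SubElement(envelope, f"{{{ENVELOPE_NS}}}content")
    body.append(content)  # the original content is carried unchanged
    return envelope


# Invented sample content standing in for a migrated legacy document.
essay = ET.fromstring("<essay><p>Text migrated from a legacy HTML blob.</p></essay>")
wrapped = wrap_in_envelope(essay, "legacy-web-db", "essay")
print(ET.tostring(wrapped, encoding="unicode"))
```

Because the wrapped content is left untouched, very different source documents can share the same outer metadata while their bodies are normalized gradually.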
4. Selecting a canonical representation
Selecting a canonical representation for the information our institutions produce is no small task. For the MMA content infrastructure, high value was put on minimalism: deriving the smallest possible solution to the problem so as to better manage complexity, operational overhead, and ease-of-use for developers utilizing our new system. In addition to the intrinsic benefits of systems consolidation, we wanted to shift the onus of understanding several business applications’ relationships and APIs from the developer to an abstracted set of APIs that a developer could begin using quickly.
Given this overall vision, we wanted a format that was well supported from content creation through content deployment to a variety of channels with minimal need for transformation. This preference does not preclude a format transformation at any point in the process; we merely sought a format that would not require transformation to perform routine tasks (such as validation, content enrichment, search indexing).
Finally, we wanted whatever format we selected to be suitable for composition and in-line enrichment (for example, using tools to automatically annotate artist names in scholarly essays), which mandated the ability to do some sort of document join, and additionally provide some mechanism for ensuring that composed documents suffered from no ambiguity problems. Given the MMA’s preference for storage in a document store, two different technologies emerged as clear front-runners for our new application: JSON and XML.
The eXtensible Markup Language (XML) takes a very different approach from JSON. XML’s basis is in Standard Generalized Markup Language (SGML). SGML is a generalized model for creating formats, and is based on a hierarchical tree structure. XML at its most basic consists of elements (delimited by “tags”) and named attributes, with support for some other features such as comments. It is capable of representing almost any kind of information, from highly structured fielded documents to rich-text documents.
The complexity XML affords requires programmers to exert some effort in transforming the XML into native data structures in their language of choice. As JSON is a subset of common programming data structures, exceptionally little effort is required to make equivalent data structures out of JSON data. Couple that with a perception of significant size and speed advantages for JSON, and you have some technology experts proclaiming the death of XML as we know it.
Closer examination of our needs started surfacing significant advantages for XML as the lingua franca of content at the MMA. While stability between JSON and XML is largely a push, as both have native stores, can be parsed easily in-browser, and are ubiquitous on the Web, XML does have the advantage of native schemas and namespacing. XML has a clear edge in flexibility, as it is outstanding for rich text (of which the MMA has heaps), but this flexibility does come at the cost of programmer speed. XML also has support for limited document joins via XInclude, and also supports arbitrary combining of documents with clear disambiguation through namespaces. These features are crucial for document composition, as they ensure that we can compose and transform documents while retaining validity. Finally, the MMA already has a sizable corpus of XML internally, yielding some significant amount of momentum.
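The following Python sketch shows how XInclude and namespaces can work together when composing a document from smaller pieces. It uses the standard library's xml.etree.ElementInclude module; the fragment names, namespace URIs, element names, and the in-memory loader are assumptions made for illustration, not the MMA's actual documents.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical namespaces for two departments that both use a <title> element.
CUR = "urn:example:curatorial"
MKT = "urn:example:marketing"

# Invented fragments held in memory instead of on disk or in a database.
FRAGMENTS = {
    "curatorial.xml": f'<essay xmlns="{CUR}"><title>Curatorial essay</title></essay>',
    "marketing.xml": f'<blurb xmlns="{MKT}"><title>See the show</title></blurb>',
}

PAGE = """<page xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="curatorial.xml"/>
  <xi:include href="marketing.xml"/>
</page>"""


def load_fragment(href, parse, encoding=None):
    """Loader used by ElementInclude: resolve an href to a parsed fragment."""
    if parse != "xml":
        raise ValueError("this sketch only handles XML includes")
    return ET.fromstring(FRAGMENTS[href])


def compose(page_xml):
    """Expand every xi:include in the page and return the composed tree."""
    root = ET.fromstring(page_xml)
    ElementInclude.include(root, loader=load_fragment)
    return root


page = compose(PAGE)
# The two <title> elements stay distinguishable because of their namespaces.
curatorial_title = page.findtext(f"{{{CUR}}}essay/{{{CUR}}}title")
marketing_title = page.findtext(f"{{{MKT}}}blurb/{{{MKT}}}title")
print(curatorial_title, "|", marketing_title)
```

A quote or essay maintained once as a fragment can be included in several pages this way, which avoids the copy-and-paste drift described earlier.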
It’s been widely reported as fact that JSON presents significant speed advantages over XML, but recent research (Lee, 2013) suggests the speed differences in transfer, parsing, and querying are overstated and minimally impactful to an end-user experience.
XML has a vast array of ancillary tools surrounding it. Things like validation, batch processing, transformation, query languages, search tools, editors, PDF conversion, and more exist as standard tools in an XML toolchain, along with several different mechanisms for parsing (including special parsing methods suitable for extremely large documents, such as pull parsing). JSON simply does not feature a toolset surrounding it that is as robust and available in as many languages.
Given the above information, it was clear that XML was a more suitable format for the goals we are trying to achieve. It very well could be that some or all applications downstream may one day consume JSON or another format, but at the core every piece of content in the repository will be an XML document (or partial document, known in XML as a “fragment”).
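To make the mention of pull parsing concrete, here is a small Python sketch using the standard library's XMLPullParser. The document is fed in small chunks, as it would be when streaming a very large file, and each element is discarded once handled. The element names and sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET


def collect_titles(chunks, tag="title"):
    """Feed XML text chunk by chunk and collect the text of matching elements."""
    parser = ET.XMLPullParser(events=("end",))
    titles = []
    for chunk in chunks:
        parser.feed(chunk)
        # Pull whatever events the parser has completed so far.
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                titles.append(elem.text)
            elem.clear()  # drop content we no longer need to keep memory low
    parser.close()
    return titles


# Invented sample data, split into 16-character chunks to simulate streaming.
DOCUMENT = (
    "<objects>"
    "<object><title>Object A</title></object>"
    "<object><title>Object B</title></object>"
    "</objects>"
)
chunks = [DOCUMENT[i:i + 16] for i in range(0, len(DOCUMENT), 16)]
titles = collect_titles(chunks)
print(titles)
```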
5. Content storage
The storage environment for the Met Content Repository (MCR) needed to exhibit the same characteristics as any production-critical persistent store: security and reliability. To that end, we needed to ensure that documents we put into the repository persisted in the exact condition they were entered in, and were only accessible by select users. We were searching for a solution to store aggregated XML documents in a manner that maintained schema-consistency where necessary, and transactional reliability at all times. Additionally, integration with the museum’s standard Active Directory authentication was a requirement to enable remote management of service accounts and role assignments for use in the MCR.
Finally, any outstanding storage mechanism for content would need to include some version-control mechanisms for content versioning. Delegating versioning so low in the stack prevents custom versioning logic from needing to be part of each application domain.
6. Content transformation
The MCR needs to feature extensive support for programmatic transformation and recombination of content elements to suit application purposes. While there is a balance that ought to be struck between transformation living in the infrastructure and application-specific transformations, logic dictates that a content repository cannot be universally useful without significant transformation capability.
Ideally, there would be multiple ways to enable rich transformation possibilities. While a technology such as XSLT could be sufficient for basic transformations, transformations that require many external data sources to provide enrichment would need a more powerful language to transform and manipulate external data, such as Java or XQuery.
7. Content search
Oftentimes, a weak spot in cultural heritage institutions is search. In fact, many common search engines require separate and non-integrated indexing of content to enable rich full-text search, and complex customization to facilitate fielded search. Offerings like Solr entice institutions with low buy-in costs; in some cases the initial buy is just the hardware to run it on. As needs become sophisticated and types of content skyrocket, significant resourcing must be given to add-on search tools to perfect indexing, ranking, and query flexibility.
Ideally, search would be a full-stop function of the content repository itself, with results passing through transformations to augment result rank, richness, and function. Ideally, certain key metadata pieces (like a title) should be mappable from multiple schemas, and easy ways to build standardized workflow around content (such as a transform-on-ingest workflow, or transform-on-fetch) would be first-class features of the MCR system.
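One way to picture mapping a key metadata field from multiple schemas is a small lookup table from document type to the location of the title. The Python sketch below is only an illustration of that idea; the root element names and paths are hypothetical and do not describe the MCR's actual schemas or any particular product's feature.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: root element name -> path to the title in that schema.
TITLE_PATHS = {
    "object": "titles/primary",
    "essay": "head/title",
    "tourStop": "title",
}


def extract_title(xml_text):
    """Return the title of a document, whichever known schema it follows."""
    root = ET.fromstring(xml_text)
    path = TITLE_PATHS.get(root.tag)
    if path is None:
        return None  # unknown schema: no title mapping defined
    return root.findtext(path)


# Invented sample documents, one per schema.
samples = [
    "<object><titles><primary>Object title</primary></titles></object>",
    "<essay><head><title>Essay title</title></head><body/></essay>",
    "<tourStop><title>Stop title</title></tourStop>",
]
print([extract_title(sample) for sample in samples])
```

A search layer could index the value returned here under a single shared field, so that one query covers titles from every document type.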
A longtime strength of using relational data stores is outstanding structured, fielded queries. Historically, much skepticism has been leveled at document databases due to their very basic or simplistic search. Most document databases are based on a key-value concept, yielding an index (usually in the form of a uniform resource locator (URL)) to a particular document, or a named index for a particular XML element. With our requirements for fielded, full-text, and (eventually) semantic searching, any solution would almost certainly be a combination of database and search system. We needed a document database with outstanding integrated repository search, utilizing full-text, semantic, and fielded search all together to enable complex queries that aren’t possible utilizing only one of the aforementioned search technologies at a time.
Integrating search into the repository has many benefits, including reduced work with respect to indexing (as search indexes are built in-line). This has many potential side effects, such as enabling us to provide search across a variety of content (including social media data such as tweets) and enabling true universal search for both public-facing and enterprise initiatives. The surface area of code under the MMA’s responsibility is also drastically limited if the integration between the repository and the search engine is complete out of the box and supported by the integration company.
As more museum applications, such as DAMs and CMSs, move toward rich Web clients, having enterprise search and arbitrary transformations allows for interesting possibilities with deep linking into DAMs and CMSs to maximize productivity and enable curatorial and collection management users to use ‘one click’ to edit any piece of content across the various key information systems. Users who manage object record publication in external systems can save substantial time by not having to switch between content-editing environments, and far less user interface has to be duplicated to enable Web editing of pieces of MMA content.
8. Content deployment
Any content repository should enable a wide variety of content transfer mechanisms. Data should be able to be loaded, unloaded, and copied in bulk for MMA staff, and easily accessible on a wide variety of platforms, including Web, print, mobile, and more. Ideal integration mechanisms would include at the very least a RESTful API (which is a style of inter-application communication that uses web conventions and is very well supported across just about every modern language), a command-line mechanism for content extraction, and some sort of filesystem access. Additionally, extraction should be empowered with transformational capability to enable application-specific renditions to be generated on export.
9. Product selection
In addition to the feature requirements of the MCR, we were interested in backing the MCR with a system with excellent support, a track record of success, and the ability to hire both vendors and programmers with experience in the product. We began by sorting features into must-have, should-have, and bonus.
- Store XML documents
- Provide easy-to-use interfaces (e.g., REST, CLI, File system)
- Support standards (e.g., XSLT, XML, XQuery)
- Have excellent reliability
- Have excellent search (i.e., fielded, full-text, bonus for semantic)
- Fine-grained access control
- Easy, well-documented, and proven scalability
- Flexible output formats
- Index in real-time
- Content-processing workflow/transformation engine
- Open source
- Cloud deployment
- Support multilingual content
- Ecosystem of vendors and candidates for employment
Given these evaluative criteria, we identified three targets: eXist-db, BaseX, and MarkLogic.
eXist-db and BaseX both scored highly in a number of areas, including support for standards such as XQuery and XSLT, and also for the open-source communities around them. Both are in active development and have been utilized in many projects. These options would be a nice fit for an organization that didn’t need lots of bells and whistles, and was dealing with a smaller corpus of content to manage, as the environmental review of their scalability online left many serious questions unanswered.
In almost every area of differentiation, MarkLogic was the clear winner. The only areas in which MarkLogic did not lead were in observation of standards (eXist-db has more complete support for XQuery 3.0) and in its commercially licensed nature. MarkLogic is the only product to offer extensive features surrounding geospatial indexing, a content workflow engine, extensible REST endpoints, and scalability.
10. The Audio Guide cometh
The Audio Guide project was our first application of this new approach to digital interactive development. The design follows a cascading set of XML documents that starts with an application configuration XML file, which specifies one or more menu documents. Menu documents specify one or more tour documents, and finally tour documents specify one or more tour stop documents. The Audio Guide would not normally be considered an ideal candidate for this style of integrated environment; in many ways it represented a ‘worst-case’ scenario, as the content is largely simple and fielded with no real need for complex markup or searching. On the other hand, the project had a very simple and limited scope, and featured a wide range of integration needs (at various points, the MMA collection management system (TMS), the web CMS (Sitecore), and the DAM (MediaBin)). Recognizing what the amount of overkill might be, we bravely plunged forward, excited for the possibility of a better way of museum-content-driven application development.
In each instance of the content documents (menus, tours, and tour stops), a portion of each child document was copied into its parent to ease parsing in the application. The menu showed graphical lists of tours available, so it was imperative the menu document contain the information needed to visually represent the tour on the device. This kept the number of XML documents that required parsing for each screen to at most one. Similarly, tour documents showed graphical lists of tour stops, and as such required similar information from the tour stop documents themselves. Additionally, several indexes were created to enable some specific functionality, such as the standard keypad mechanism in the app that operates on a separately generated stop number index.
The entire environment around the project consisted of a few distinct modules, with a few modules existing in deployments on multiple servers (e.g., both the editing server and the extraction server leverage the MarkLogic Bridge and Met Content Auditor).
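The cascade of documents described above can be sketched as a simple recursive walk. In this Python sketch, the element names, the href attribute used to point at child documents, the URIs, and the sample content are all hypothetical; the paper does not give the actual Audio Guide schema.

```python
import xml.etree.ElementTree as ET

# Invented documents, keyed by URI, standing in for the repository.
DOCS = {
    "config.xml": '<config><menu href="menus/main.xml"/></config>',
    "menus/main.xml": '<menu><tour href="tours/sample.xml"/></menu>',
    "tours/sample.xml": (
        '<tour><stop href="stops/100.xml"/><stop href="stops/101.xml"/></tour>'
    ),
    "stops/100.xml": '<tourStop number="100"><title>Stop A</title></tourStop>',
    "stops/101.xml": '<tourStop number="101"><title>Stop B</title></tourStop>',
}


def walk(uri, depth=0):
    """Yield (depth, uri, root tag) for a document and everything it references."""
    root = ET.fromstring(DOCS[uri])
    yield depth, uri, root.tag
    for child in root:
        href = child.get("href")
        if href is not None:
            yield from walk(href, depth + 1)


cascade = list(walk("config.xml"))
for depth, uri, tag in cascade:
    print("  " * depth + f"{tag}: {uri}")
```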
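As a rough sketch of the two ideas in this paragraph, the Python below copies a display summary of each tour stop into its parent tour document and builds a stop-number index for a keypad lookup. The element names, the choice of summary fields, and the sample stops are assumptions for illustration only.

```python
import json
import xml.etree.ElementTree as ET

# Invented tour stop documents, keyed by URI.
STOPS = {
    "stops/100.xml": (
        '<tourStop number="100"><title>Stop A</title>'
        "<thumbnail>a.jpg</thumbnail><audio>a.mp3</audio></tourStop>"
    ),
    "stops/101.xml": (
        '<tourStop number="101"><title>Stop B</title>'
        "<thumbnail>b.jpg</thumbnail><audio>b.mp3</audio></tourStop>"
    ),
}

# Hypothetical set of fields the list screen needs from each child document.
SUMMARY_FIELDS = ("title", "thumbnail")


def build_tour(title, stop_uris):
    """Build a tour document that carries a display summary of each stop."""
    tour = ET.Element("tour")
    ET.SubElement(tour, "title").text = title
    for uri in stop_uris:
        stop = ET.fromstring(STOPS[uri])
        summary = ET.SubElement(tour, "stop", href=uri, number=stop.get("number"))
        for field in SUMMARY_FIELDS:
            ET.SubElement(summary, field).text = stop.findtext(field)
    return tour


def build_stop_index():
    """Map each stop number to its document URI, for keypad-style lookup."""
    return {ET.fromstring(xml).get("number"): uri for uri, xml in STOPS.items()}


tour = build_tour("Sample tour", list(STOPS))
index_json = json.dumps(build_stop_index(), sort_keys=True)
print(ET.tostring(tour, encoding="unicode"))
print(index_json)
```

The trade-off is deliberate duplication: the list screen parses one document, at the cost of regenerating the parent whenever a child's summary fields change.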
Externally developed software used
- MarkLogic – XML Document Database (and REST API, depicted separately)
- MarkLogic Content Pump (MLCP) – CLI tool for ingestion of content.
- Jenkins – Continuous Integration Server
- Apache – Web serving
- PHP – Server-side dynamic Web components
- Bash – Support for content extraction and integration with Jenkins
- XMLStarlet – CLI XML processor used for extraction and index generation
Internally developed software used
- MarkLogic Bridge – A small PHP wrapper around the MarkLogic REST API
- Met Content Editor – A PHP application to wrap XOpus and provide document selection
- Met Content Auditor – A PHP app for additional auditing and certain batch operations
- Met Content Extractor – A command line tool responsible for extracting content for the device
- MetAudioGuide App – The iOS Audio Guide application
- Met Mobile App Deployment Tool – Web interface to manage on-device provisioning and deployment
The process began with a painful migration of legacy data stored in a text file, with fields delimited by commas (a comma-separated-value file, or CSV) to load the original Audio Guide stops and relevant object information and media links. The schema was simple: every piece of content became its own entity, with a single piece of media, stop number, and associated metadata. The legacy application would pick the corresponding language record based off of the app’s language preference.
We built a small ingestion pipeline to accommodate moving content from this CSV, to a simple XML file, to a series of three XSLT transformations. The first transformation assembled the necessary information for the app and correctly renamed fields to the new schema. The second XSLT transformation used Muenchian grouping to collect all languages for each original stop, and also consolidated stops based on related object. Stops without objects (such as gallery overview stops) were manually consolidated shortly after ingestion. The third and final XSLT transformation split each stop in the document full of stops into separate files for use in MarkLogic. The XML was then loaded using MarkLogic’s batch loading tool, MLCP.
When schema changes were necessary, the XML was extracted, transformed using XSLT, and reloaded using MLCP. Each migration was separately archived during development, enabling ‘rollback’ to earlier iterations of the schema as necessary.
Document selection in the editor begins by calling a saved search from the MarkLogic Bridge, which in turn calls a search in MarkLogic that matches preset criteria. A supplementary function adds data (such as title, department, and object number, if present) to the search response. The response is then run through an Extensible Stylesheet Language Transformation (XSLT, an XML-based language for transforming XML documents) to format the XML in a way the grid component can easily leverage.
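The grouping step of this pipeline can be illustrated outside XSLT. The Python sketch below is a stand-in, not the project's actual transformation: it groups legacy CSV rows by stop number with a dictionary and emits one XML element per stop, with every language version collected inside it. The column names, element names, and sample rows are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented legacy data: one row per stop per language.
LEGACY_CSV = """stop_number,language,title,media
100,en,Stop A,100_en.mp3
100,fr,Stop A (fr),100_fr.mp3
101,en,Stop B,101_en.mp3
"""


def group_stops(csv_text):
    """Group CSV rows by stop number and return one element per stop."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped.setdefault(row["stop_number"], []).append(row)

    stops = []
    for number, rows in grouped.items():
        stop = ET.Element("tourStop", number=number)
        for row in rows:
            version = ET.SubElement(stop, "version", lang=row["language"])
            ET.SubElement(version, "title").text = row["title"]
            ET.SubElement(version, "media").text = row["media"]
        stops.append(stop)
    return stops


stops = group_stops(LEGACY_CSV)
# Each element could now be written out as its own file, one document per stop.
for stop in stops:
    print(ET.tostring(stop, encoding="unicode"))
```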
Each time a document is saved from the editor, it runs through Met Content Auditor to run data audits. Throughout the process, wherever data consistency issues were raised, we solved the inconsistency, then implemented a related audit to ensure that the inconsistency would not regress later in the content editing process. A tour stop would need to pass all audits to be saved to MarkLogic, and content editors would be shown an aggregated list of failed audits on each save, where necessary. Future enhancements will result in more humane audit failure messages.
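The audit-on-save behaviour described above can be sketched as a list of small check functions whose failures are aggregated. This Python sketch is an assumption about the general shape of such a tool, not the Met Content Auditor itself (which the paper describes as a PHP application); the individual audits, element names, and messages are invented.

```python
import xml.etree.ElementTree as ET


def audit_has_title(stop):
    """Fail when the title is missing or blank."""
    if not (stop.findtext("title") or "").strip():
        return "Tour stop has no title."
    return None


def audit_stop_number(stop):
    """Fail when the stop number is not purely numeric."""
    number = stop.get("number", "")
    if not number.isdigit():
        return f"Stop number {number!r} is not numeric."
    return None


def audit_has_media(stop):
    """Fail when no media element is present."""
    if stop.find("media") is None:
        return "Tour stop has no media file."
    return None


# New audits are appended here whenever a fresh inconsistency is discovered.
AUDITS = [audit_has_title, audit_stop_number, audit_has_media]


def run_audits(xml_text):
    """Run every audit and return the aggregated list of failure messages."""
    stop = ET.fromstring(xml_text)
    results = [audit(stop) for audit in AUDITS]
    return [message for message in results if message]


def save_allowed(xml_text):
    """A document may be saved only when every audit passes."""
    return not run_audits(xml_text)


good = '<tourStop number="100"><title>Stop A</title><media>a.mp3</media></tourStop>'
bad = '<tourStop number="12a"><title> </title></tourStop>'
print(save_allowed(good), run_audits(bad))
```

Keeping each audit as its own function matches the practice described in the text: fix an inconsistency once, then add an audit so it cannot silently return.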
The editor was used for all tour stop and tour documents. The tour editor had a different look, but it was still easily integrated into the larger content editing package.
The editor was almost totally integrated into TMS for the selection and update of TMS data. Future enhancements will enable more automated means of keeping TMS data up to date and will get image transfer working. With our shared architecture, this feature will immediately become available to other apps that use this setup for content editing.
With additional refinement, the editor component and the components for locating documents to edit could be decoupled heavily from this specific application, yielding a strong start towards a general purpose content management template.
Whenever a build was triggered on our continuous integration server (Jenkins), we ran a bespoke content extraction tool that extracted the necessary XML documents, generated the indexes (such as the aforementioned stop-number JSON index), and performed some very basic transformations to the data to accommodate the application. Post extraction, the data was subject to one final set of rigorous audits (also leveraging the auditing tool) as part of the build process to ensure referential integrity between files and to address any data quality issues present.
When the IPA (the iOS application archive) was built, Jenkins would copy the IPA and generate some configuration data for a mobile app deployment server, which held deployment certificates and offered on-device one-click install, similar to TestFlight. As our IPA for the main building weighed in around 3.5 gigabytes for most of its life, keeping deployment in-house reduced network transfer times considerably.
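A referential-integrity audit of the kind mentioned here might look like the following Python sketch, which checks that every reference in the extracted files points at a file that was actually extracted. The href attribute, file names, and sample documents are hypothetical, and the real build used the project's own auditing tool rather than this code.

```python
import xml.etree.ElementTree as ET


def find_broken_references(docs):
    """Return (source uri, missing href) pairs for references that do not resolve."""
    broken = []
    for uri, xml_text in docs.items():
        for elem in ET.fromstring(xml_text).iter():
            href = elem.get("href")
            if href is not None and href not in docs:
                broken.append((uri, href))
    return broken


# Invented extraction output: the tour points at a stop that was never extracted.
EXTRACTED = {
    "tours/sample.xml": (
        '<tour><stop href="stops/100.xml"/><stop href="stops/999.xml"/></tour>'
    ),
    "stops/100.xml": '<tourStop number="100"><title>Stop A</title></tourStop>',
}

broken = find_broken_references(EXTRACTED)
print(broken)
```

In a build pipeline, a non-empty result would typically fail the build so that a broken reference never reaches a device.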
Completing the process of requirements gathering, software selection, procurement, and application development provides an excellent vantage point for retrospection. While it is far too early to declare the MCR done, or the process over, the promise is slowly becoming realized. This XML-driven approach had several advantages over our traditional relational-database-driven content infrastructure, including enabling rapid iteration on schema without need for extensive data migration and query or ORM code in the content editor, extractor, or application. The approach makes extending data (e.g., adding new fields) fairly easy. In fact, a criticism of our first implementation is that we were a bit quick to add new fields without necessarily thinking through the implications and side effects; that kind of forethought comes more naturally in relational models, where adding fields comes at a more substantial cost.
The integrations we built between MarkLogic and TMS can easily be a part of future projects, and the lessons learned about the nuances of this approach will almost certainly make the next application to hit this content infrastructure better, faster, and easier to develop.
This approach is not without its challenges. Abandoning the current (comfortable) practice of Web development, whereby we model the domain relationally and build straightforward and well-understood Web applications on top, is not easily done. Old habits die hard, and developers new to this way of working have difficulty quickly adopting this methodology. I myself was skeptical through the early stages of development, not yet understanding the level of productivity afforded by thinking of data as collections of aggregated documents.
Once developers became familiar with this methodology, particularly when leveraging MarkLogic and XOpus, creation of content editors and tooling happened very rapidly. Additionally, dealing with XML on the client side is largely well understood by most development shops, and virtually every platform has a substantial variety of production-grade XML parsers, libraries, and tools. Parts of the application (specifically one of the XOpus content-editing modules, and the auditing tool architecture) were constructed by a second developer at the MMA, and once the initial learning curve was conquered, his productivity rapidly reached the level of the traditional approach.
In the end, we have found a system that we know is capable of easily ingesting content from legacy systems, has tools for rapid creation of content editing components, enables on-the-fly channel-specific transformations, and is wrapped in a well-understood API architecture (REST). The architecture is new to museums but old hat to many industries, including healthcare and publishing. While developers skilled in this methodology are hard to find fresh from university, any sufficiently talented developer can pick up the workflow and be productive.
Thanks to all of my colleagues at the MMA, because it takes a village to raise a content infrastructure, and more specifically:
- Don Undeen, for having the foresight to hire me and for stewarding the process
- Thomas P. Campbell and Carrie Rebora Barratt, for their extraordinary vision
- Erin Coburn, for allowing me to start this, and Sree Sreenivasan, for allowing me to finish it
- Jeff Spar, for the hot tip about MarkLogic
- Cristina Del Valle, for making contract reviews as awesome as contract reviews can be
- Paco Link, Colin Kennedy, Mike Westfall, Rachel Rothbaum, Liz Filardi, Staci Hou and Loic Tallon, for doing the rest of the Audio Guide project
- Ariel Estrada, for all the admin support
- Adam Padron, Nicholas Cerbini, and Steve Ryan, for all their IS&T support
- Jennie Choi, Shyam Oberoi, and Jeff Strickland, for helping with integrating other MMA systems
- Scott Hall, Eric Austvold, and Sandy Bodzin at MarkLogic, for their incredible knowledge and patience
- Finally, Faber Fedor and Roy Davies, for the tech talk and encouragement over many cigars
Lee, D. (2013). “Fat markup: Trimming the myth one calorie at a time.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6–9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, volume 10. http://www.balisage.net/Proceedings/vol10/html/Lee01/BalisageVol10-Lee01.html
Donahue, R. "Search and Deploy: Solving the Silo Problem with XML Document Databases." MW2014: Museums and the Web 2014. Published February 1, 2014.