CART Transcript for Connecting the Clouds: Strategies and tools for managing museum data

Thursday, April 3, 2014 1:33 p.m. – 3:00 p.m.

Museums and the Web 2014 Conference: Connecting the Clouds: Strategies and Tools for Managing Museum Data.

 Held at:

Renaissance Baltimore Harborplace Hotel

Ballroom A

202 East Pratt Street

Baltimore, MD

Communication Access Realtime Translation (CART) is provided in order to facilitate communication accessibility and may not be a totally verbatim record of the proceedings.

>> Rob Lancefield: All right. We will get started in just a moment here.
This is the session, “Connecting the Clouds: Strategies and Tools for Managing Museum Data.” I’m Rob Lancefield from the Davison Art Center at Wesleyan University. I will be chairing, but not saying anything of substance here; our presenters will be. We can jump right in to conserve time. I’ll be introducing them prior to each individual presentation.
First up we have Jane Alexander and Niki Krause from Cleveland Museum of Art speaking about “Art in the Clouds: Creating a Cloud‑Based Archival Repository for Museum Assets.”
>> Jane Alexander: Good afternoon. I am Jane Alexander, I’m the CIO at the Cleveland Museum of Art, and this is Niki Krause, who is the manager of applications for the Cleveland Museum of Art. Niki has run with this project and done an awesome job, so I’m going to quickly give you an overview of my full service as advisory and then pass it on to Niki.
So a couple years ago when we were contemplating the art in the clouds project, it turned out to be a much bigger project than I originally had intended it to be. Or I thought it would be. And so to give you a little background and a little plug for our talk on Saturday at our digital strategy, Looking at the Big Picture.
When I came to the Museum of Art there were multiple projects going on and the biggest one was Gallery One, which was ‑‑ no one knew what it was going to be, we didn’t know what kind of hardware we were going to use. I quickly realized what we were going to build was going to change and be outdated soon enough. So we had to look at how our systems were going to work and talk.
The museum had made a big effort to digitize the collection, and we had just begun our building project. It was a seven‑year project. We decided to make sure that Gallery One could act as a test bed, and it ended up being a project where any object on the art wall pulls information from our digital asset management system. When something is new or moves on loan or goes to conservation, it reflects dynamically on the wall and in ArtLens. That was sort of our first step. We said, why don’t we look at all our systems? Our systems naturally fell into functional groups that we could tie together when we worked on projects. In this diagram that Niki drew, we began to see that they fell into natural groups, and this archival project would fall right here, into the collection information and scholarship backbone.
So just to tell you a bit about how this project came about: when I first came here I met with the photography studio because I was very interested in their process for digitizing the collection. They called me down for a completely different reason.
When I walked in there, I saw Bruce. He’s our photo studio manager, and he does all the ingestion of our high res images. I said, what’s that? Those are 2800 DVDs of our high res images, and there’s another two stacks offsite. I said, that’s how we’re managing our expensive digitized collection?
So I immediately in the next budget year, it was the first project. I said, oh, we need to get some storage on site and manage the system. As you realize, the cost of digitizing a collection to keep it stored that way, the risks were pretty high. And this is to say our collection was worth its weight in gold.
This is my joke I inserted for her, that unfortunately she’s never seen the movie, so she doesn’t get it.
>> Niki Krause: Unfortunately, for the big pile of gold that we have in our image assets, finding any specific image is very difficult in the system we’re currently using with all those DVDs. It’s like the dragon: you know it’s there, it’s very precious, but don’t try and find the ring.
>> Jane Alexander: So we knew at the beginning of this project we were going to have these high proper, high res masters. We had three sets of 3800. We had to manually look up and pull to use any of them. So anytime there was an ingestion that went wrong, you didn’t do it right away. That actually became a huge issue when we were doing Gallery One because we were cleaning the data and we had to have every single object on view perfect so that it could display in multiple ways. These DVDs were also expensive, so we ditched them to move to online storage. I went to Niki. I put a capital line in 2011 and about 2012, halfway through 2012 I told Niki, we really got to work on this project.
So she looked into it and she said you know what? This is archival in nature, and now she’ll tell you about the whole project.
>> Niki Krause: Jane and I started the discussion on this based on her visit to the photo studio. I have an archival background, so my first thought was we can’t take our very valuable image assets and just put them up on online storage with no organization, no way to search and no solid plan for holding those for the long term. How do you access them 100 years from now, 150 years from now. How do you make sure that as a group, you can deal with them properly.
So we put together an interdepartmental team and invited our museum archivist and digital archivist, our network infrastructure and storage duo, our library applications analyst who happens to be a top notch developer, and Bruce, who you saw buried in his DVDs, to talk about the project and start working on a plan to put together a proper archival repository, which is something the museum hadn’t had before.
As the work progressed and we talked more, we extended the team to include our photographers, not just our photo manager, our collection management and registrar staff, conservation staff and our performing arts, film and music staff, all of whom had digital assets we decided would eventually need to go into the repository as long‑term institutional records.
The approach we came up with was six‑step. Very easy. The first three steps all took place concurrently. We needed to inventory all digital materials, find them throughout the museum on every form of storage, jump drive, portable hard drive, people’s computers, whatever it had to be. We had to find all the digital materials and find out how much of everything we had. We had to establish standards to guide the project. We had to choose both a storage platform and appropriate archival management software to make sure we could find everything later. We needed to map metadata, because mapping is my life. We needed to define work flows to make sure everything was very efficient, and then we needed to start iterating the process to make sure we fine‑tuned it from the easiest content types we were going to be ingesting to the most complex.
We started with our inventory and projections, and Jane told you about the impetus for the project, which was the artwork photography. We found that the amount of artwork photography, which was over 6 terabytes, was dwarfed by the conservation photography; they had x‑rays in addition to high resolution details and so forth, pre and post treatment.
We had a huge cache of editorial photography. Anything that would show up in the magazine ended up in the editorial department. Our business documents: everything from minutes, policies, manuals, anything that people came up with in Microsoft Office, anything they came up with in their e‑mail, that’s fair game for the business documents area. We also had significant caches of AV materials documenting projects and other aspects of our institutional history, and also recordings of audio and video from lectures, performances and other events the museum has had over the past 100 years in various media.
Finally, we also had to consider our latecomer, artwork in time-based media. All that together currently, actually as of last year, totaled 20 terabytes, which for a museum like us was quite a lot. It’s bigger than we knew how to handle. Projected growth, based on what we knew was happening with the technology, takes us to 35 terabytes by early 2018. So we had a lot of thinking to do.
The standards we identified: we started with standard archival best practices and the rule of thumb, lots of copies keep stuff safe, LOCKSS. We looked at the OAIS reference model for building an archival repository. We looked at preservation file formats and what’s recommended as best practice for that beyond TIFF and JPEG 2000, and is EG going to become an archival standard in the future. We got up to our elbows in PREMIS and started studying what we needed for preservation metadata, and then we looked at what we needed to support long‑term metadata needs for searching.
We are taking aspects of IPTC and XMP and mapping them into Dublin Core to capture part of what our photographers have authored. We are dealing with local metadata needs with a local schema. Things like our service request order numbers and our object accession numbers for photographs which are not necessarily object photography, but where art appears in the background, also need to be recorded.
And it’s not something that’s standard in any of the existing metadata standards. For those of you who haven’t seen it this week, here’s the OAIS model. And this was our guiding principle to make sure we had a good bundle of information going into the system.
We made sure we had the right coverage for our preservation planning and our administration to deal with both our metadata and our digital files properly. And that we had the right security and the right hand back to people who were using our data, the good solid package of information.
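The mapping of embedded IPTC and XMP fields into Dublin Core, plus a local schema for fields no standard covers, can be pictured as a simple crosswalk table. The following is a minimal sketch only: the specific field pairings and the `local.*` element names are illustrative assumptions, not the museum's actual mapping, and the input is assumed to be a dictionary of header values already extracted from an image file.

```python
from collections import defaultdict

# Hypothetical crosswalk from embedded header field names to target elements.
# The pairings below are illustrative assumptions, not the museum's mapping.
CROSSWALK = {
    "By-line": "dc.creator",
    "Headline": "dc.title",
    "Caption/Abstract": "dc.description",
    "Keywords": "dc.subject",
    "Date Created": "dc.date",
    "Copyright Notice": "dc.rights",
    # Hypothetical local-schema fields for values no standard covers.
    "ServiceRequestNumber": "local.serviceRequest",
    "AccessionNumbers": "local.accessionNumber",
}


def crosswalk(embedded):
    """Map extracted header fields to target elements.

    Returns (mapped, unmapped): a dict of element -> list of values, and a
    sorted list of source field names that had no entry in the crosswalk.
    """
    mapped = defaultdict(list)
    unmapped = []
    for name, value in embedded.items():
        target = CROSSWALK.get(name)
        if target is None:
            unmapped.append(name)
            continue
        values = value if isinstance(value, list) else [value]
        mapped[target].extend(v for v in values if v)
    return dict(mapped), sorted(unmapped)
```

Reporting the unmapped field names, rather than silently dropping them, is one way to notice when a new camera or editing tool starts writing headers the crosswalk does not yet know about.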
As I mentioned, our approach had many things going on simultaneously. Our team is very busy working in groups of two or three in different areas. The first aspect that we started looking at was storage platform options. We considered three. Onsite hardware and storage, plus onsite archival management system, we considered hosted solutions, where we didn’t have access to anything, we just plugged in our data and hoped it worked. And a Cloud‑based solution where the hardware and storage were managed by the virtual data center, but we had full access to the archival management system.
We had concerns. The team made the decision to go on site. It was the safest and what papers in the archival industry recommended. Nobody was saying Cloud, Cloud, Cloud, they were saying you have to have control and you have to monitor it carefully.
>> Jane Alexander: I missed the meeting they decided that and then they started working and a month later I came and said yeah, can we go back and look at this Cloud thing. And Niki said, I’m going to need a moment.
>> Niki Krause: I took a moment and I went to the bathroom and went, oh, my God, all that work and planning we just did! What are we going to do? So we sat down, the digital archivists and the library analyst and I, and listed all the concerns we had originally and sat down then and talked to Jane, who called her friend from a recent cocktail party that she met who happened to be a co‑owner of a virtual data center in Cleveland. We had concerns, so we addressed them.
First of all, we were concerned about performance. This is one of the images from our archival collection. This is Euclid Avenue in Downtown Cleveland. We were afraid our performance would look like this, bumpy, congested, things getting crossed in the mix. We were very concerned about this.
However, the data center was about 100 blocks away. One of us was on one side and one of us was on the other. Amazingly, we share the same ISP and we’re on the same fiber trunk line. So we did benchmarking tests with the same sets of data: uploading and accessing on our internal systems, uploading and accessing on a pilot system set up at their data center, and uploading and testing the same data at a SaaS solution operator down in Columbus, Ohio. Accessing the network system 100 blocks down was like doing it in-house. It was so fast, it was unbelievable. With just a couple modifications to the point to point, we were able to get beautiful, almost in‑house performance.
Obviously, having a SaaS center not on your ISP or trunk might not be workable for everybody. The performance frames with uploads and downloads to the Columbus SaaS provider were dismal. Live and learn. Test everything.
We also had some concerns about data.
Our photographers specifically were very concerned about the effects of hardware compression on the algorithms already used with compression. They wanted to make sure we had absolutely no loss whatsoever. So there was a huge discussion, many technical papers pulled and many discussions. We found indeed the hardware compression has no effect whatsoever on the files themselves. Those are fine whenever you go to use them. No loss of integrity or data. We were also concerned about security. But when you come down to it and admit it, the virtual data center’s security people knew more about security than our IT department. They do it for a living. They have three or four stacks of servers the size of a ballroom and one stack generates 25 million a month. They’re not worried about security. They’ve got that down. They’re much better at it than we are.
So one of the things we were also concerned with was the archival requirement for keeping many copies. And in the case of this type of data, the recommendation is to keep the copies as far away from each other as possible. You want them in different geographic regions if possible so that if a power grid goes down, you still have access on another grid. If there’s natural disaster, both of the copies aren’t wiped out.
One thing we found was the data center based in Cleveland already had setups so that they could have data in Cleveland, in Columbus, and in Grand Rapids, Michigan.
What we ultimately chose was to go with Cleveland and Grand Rapids so we actually have a geographic divergence for our redundant data systems. So the Cloud was looking better and better. The sun was shining. Every Cloud has a silver lining. You’ve all heard that.
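Keeping redundant copies in two locations is only useful if the copies can be shown to match. A common way to check that is a fixity comparison using checksums. The sketch below assumes, purely for illustration, that both copies are reachable as directory trees; the function names are hypothetical and nothing here describes the data center's actual replication tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(primary_root, replica_root):
    """Report files that are missing from, or differ in, the replica tree.

    Returns a dict of relative path -> "missing" or "checksum mismatch".
    An empty dict means every file in the primary has an identical copy.
    """
    primary_root, replica_root = Path(primary_root), Path(replica_root)
    problems = {}
    for source in sorted(primary_root.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(primary_root)
        copy = replica_root / relative
        if not copy.is_file():
            problems[relative.as_posix()] = "missing"
        elif sha256_of(source) != sha256_of(copy):
            problems[relative.as_posix()] = "checksum mismatch"
    return problems
```

The same digest comparison, run before and after a file passes through a storage layer, is also a direct way to confirm that the file comes back bit-for-bit unchanged.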
Unfortunately, what we found when we started looking very closely, after resolving and discussing all our technical concerns, was the fact that we would need this many gold bars to pay for the Cloud service. Because all of that hardware, all of that 24/7, 365 uptime, all the staffing, all the expertise, you pay for it. If you want that redundancy, that absolute surety of your data center, you pay for it.
So what happened? This was our biggest concern. How did we handle cost?
Well, luckily this Cleveland‑based firm is a philanthropic firm. So over the course of five years, they are donating over $500,000 worth of their services to us as a donation. That’s wonderful. And there they are, on our website, listed as one of our major corporate donors.
We also took a look at archival management systems. We considered open source by preference, although we knew and had reviewed early on several of the commercial solutions including Rosetta and several others. We concentrated on open source because the museum is moving away from ongoing licensing fees. We also considered a couple of SaaS solutions because DSpace, one of the main open source systems, is available both for local install and from a SaaS provider.
So we did look at those. We also looked at CONTENTdm, which is SaaS provided by OCLC in Columbus, Ohio. We also looked at Fedora Commons, something that came out of Cornell. Great system. Archivematica and Invenio. We read a lot of white papers.
We narrowed it down to two and decided we were going to try out DSpace and Fedora Commons. This was going on concurrently with the choice of the platform. So we were able to set up one online, an in‑house system in our own virtual host system to do the same thing. We also did the same with Fedora Commons. We were able to compare apples to apples for performance and customizability and so forth.
Both were easy to install, easy to configure. But you needed a plug‑in for Fedora Commons to be able to put together a good front end for users.
So bad thing, we decided we were going to go to DSpace. We didn’t want to deal with yet another plug‑in option.
Both of them had an issue with bulk loaders, so we knew we were going to have to code that area, and our developers started getting to work putting together a bulk uploader for our archival materials. This is a screen cap of what he came up with. It allows us to capture not only the context within the museum, essentially archival series, but also do the checks and double‑checks of all the extracted header information, the creation of the XML files, the addition of assumed data values that are going to come in and become part of the data set.
And keeping all the logs for each. We have a nice bulk loader.
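The museum's bulk loader itself is not shown in the transcript. As a rough illustration of the "creation of the XML files" step, the sketch below writes one item folder in the layout of DSpace's Simple Archive Format for batch import: a `dublin_core.xml` file of `dcvalue` elements, a `contents` file listing the bitstreams, and the files themselves. The function name and the sample values are hypothetical, and a real loader would add the header checks and logging described above.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def write_saf_item(item_dir, bitstreams, dc_values):
    """Write one Simple Archive Format item folder.

    bitstreams: paths of the files to package with the item.
    dc_values:  list of (element, qualifier, value); qualifier may be None.
    """
    item_dir = Path(item_dir)
    item_dir.mkdir(parents=True, exist_ok=True)

    # dublin_core.xml: one <dcvalue> per metadata statement.
    root = ET.Element("dublin_core")
    for element, qualifier, value in dc_values:
        node = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        node.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy each file in and list it, one name per line, in "contents".
    names = []
    for source in map(Path, bitstreams):
        shutil.copy2(source, item_dir / source.name)
        names.append(source.name)
    (item_dir / "contents").write_text("\n".join(names) + "\n", encoding="utf-8")
    return item_dir
```

A batch is then just a directory of such item folders, which keeps the packaging step separate from whatever import command or service actually loads the items.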
I told you the various file groups we were going to start with. We decided that editorial photography was the lowest‑hanging fruit in the materials we needed to deal with, so that’s where we started. And then chose the other groups in order to go up in complexity. So we would learn from our mistakes and not make mistakes with bigger, more complex sets.
This is a little update on where we are right now. Our editorial photography is about two‑thirds ingested, searching beautifully, it looks great. Artwork photography, of those 2800 disks, about 2,000 are now staged and we’re in the final bits of putting together the scripts needed behind the bulk loader to map everything properly. Because the metadata is a bit different for the artwork photography.
Conservation photography, we’re in the process of doing a file review and de‑duping probably about 8 terabytes’ worth of files. Lots of fun. Our business documents. We’re scripting the transformations right now to get those all properly into a PDF/A format if they’re textual and figuring out a way to do that very efficiently. We’re performing a detailed inventory with our staff members of all the analog and digital AV materials. We have to decide how to handle each set of them, because there are many permutations we have collected over the last hundred years.
Our AV media, we have another team working on in-house standards for handling artwork in time-based media. We only have ten works so far. We figure we can afford to wait a little bit and get it right before we start ingesting those into the archival repository for long‑term keeping as well.
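The de-duping pass mentioned for the conservation photography can be approached by grouping files on their content rather than their names. This is a minimal sketch under the assumption that the staged files sit in one directory tree; it is not the museum's script, and the function names are hypothetical. Files are grouped by size first so that only potential duplicates are hashed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def content_hash(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have an identical twin
        for path in same_size:
            by_hash[content_hash(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

This only finds exact copies. Two exports of the same photograph with different embedded metadata would hash differently and would still need a human review.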
Challenges? Our developer got another job. Tomorrow is his last day. So watch the CMA website if you’re a developer. There will be a job posting shortly. If you like DSpace, have I got a challenge for you. One other challenge is time. Our digital archivist is stretched thin and she’s taking over some duties as far as analysis of data. She works in the library archives department. And our PAMF staff is working overtime. Our network storage is running very high.
You say Niki, you’re in the Cloud, why do you care what your local network storage is? Well, we’re pulling things off jump drives, people’s PCs, and we have to look at them together. So there is a certain overhead on our own local systems as we’re sorting through cleaning up, figuring out what we have and how to organize it for final ingest. So that’s always a challenge. We’re clearing things over time.
As people see this DSpace install work and are able to use it, we actually have a couple demands, mostly from the library and archives, for additional access systems, not for archival, but for public use. So we have a few more pictures to add to our diagram in the near future. So thank you very much.
>> Rob Lancefield: Thank you, Niki, and thank you, Jane. We have a moment for a quick question as our next speaker makes his laptop transition.
>> You mentioned your main objection to Commons was the plug‑in for it. It sounds like you need to plug in equals, don’t want to go there? Is there a reason for that?
>> Niki Krause: It was just a neatness of platform and the fact we couldn’t find any one of the plug‑ins that was recommended as being better than the others. At the time, all of them that we evaluated seemed to have a bigger problem list.
>> The plug-in, they didn’t have the front ‑‑
>> Niki Krause: There wasn’t a good base for us to start with that point that we found. There might be one out there we missed completely or there might have been one developed in the meantime.
Anybody else?
>> Hypothetically speaking, what would you do if Bridgeware busts?
>> Niki Krause: That’s handled in our contract and we can get all the data back. However, if they go bust, I don’t know how we would set up the hardware in a timely fashion to be able to take all that information back.
However, their biggest customer in Cleveland is a very large Homeland Security office.
>> You said bank?
>> Niki Krause: Bank, no. Homeland Security and a lot of e‑commerce. So I think we’re safe for a while. We’re one very small little stack of servers in an entire ballroom full of racks and racks and racks.
And all of those make money.
>> Rob Lancefield: Great questions. I’ll have to ask that other questions be held until the end. But our next speaker is Ryan Donahue from the Metropolitan Museum of Art.
Speaking on “Solving the Silo Problem with XML Document Databases.”
>> Ryan Donahue: Hello, everyone. That was a very tech‑heavy, hard act to follow. I don’t know if I have as many interesting details ‑‑ they’re all in my paper coincidently.
My name is Ryan Donahue. I’m a graduate of the prestigious Rochester Institute of Technology. As is Mike over there. I spent 6 years or so at the George Eastman House in Western New York. The last two years I’ve been at the Metropolitan Museum of Art doing data integration and projects surrounding data integration. And the Met’s not ‑‑ not scale problems, not the way that engineers normally look at scale problems. Lots of people like to go to our websites or online resources, that is true, but the problem is we produce digital content faster than we can keep track of it.
We have so many people creating digital products and videos and photographs. It’s just a nightmare to kind of keep up with the tidal wave of systems. So to be brief, much more of the detail’s in the paper. Including things like why we really chose XML and think that JSON is a bunch of rubbish.
That’s just to get the JSON nerds to read the paper.
We’ll go through kind of how we’re solving it at the Met. One way we particularly, one application we’ve already done and kind of where we’re going with it.
So really quick, overview for those in the room that aren’t kind of day in, day out museum kind of project builders, is that we kind of go through this standard workflow for any new microsite or whatever kind of spin we want to put on the silo. Sometimes we call it microsites, temporary web features, whatever you want to call them.
Essentially we take whatever data it is that this application is to wield, figure out how to relationalize it and put it into databases that developers are used to using, things like Oracle, OK, not Oracle, but sort of Postgres, the rest, kind of the traditional databases. We figure out how to get that data out of the database and then assemble it back into the package we need that builds the page or the mobile phone or whatever we’re building.
And sort of being privileged with the opportunity to spend my days thinking of nothing but this kind of problem, it dawned on me that hold on, we take data we’ve got in kind of convenient tidy packages, we split it up, store it, take it out of storage, put it back together, put it on the page. It sounds like there’s a few extra steps in there we might not really have to do if we’re willing to evaluate some tools that might be unfamiliar or uncomfortable for us to consider.
And that’s exactly this, is that a lot of the work we do for building information retrieval applications is unnecessary work.
There’s a lot of inertia and a lot of good reasons to do that, if you’re kind of unwilling to think outside the box a bit.
Another problem with relational data models is you end up putting different projects in different databases and it becomes very difficult to search across those databases. A search on the Met website will not return everything that’s on the Met website. There’s a difference between what search can see and what the web can see, a very vexing problem, one that’s difficult to solve because we have more than one place to store stuff.
Finally, relational databases tend to be really poor at doing things like full text search. We end up needing tools like Solr or Cloud‑based search services on top.
It’s just like, you know, we’re museums. Our kind of day-to-day money-maker is information retrieval. And a huge part of information retrieval, search, is something that very few of us handle capably. That’s kind of a pervasive problem faced by big and small institutions.
Small museums don’t have the same level of tech resources but also don’t have as much stuff to search through. So the problem scales nicely as you add engineers and your organization gets bigger and bigger you’re also adding content and increasing that your search returns the right mix of things to keep curators and educators happy. All sorts of political problems that happen with search.
The kind of benefit of search working poorly is if it does, it works poorly for everyone.
Once you start fixing a search, the critical ‑‑ it becomes the second frontier after the home page or technology turf battles.
This is perhaps my favorite quote, XML quote of all time. Particularly in the context of the museum conference.
XML to any developer in the room or any non‑developer in the room is something the kids really hate. It tends to be sort of an overestablished technology that people have used in some creative and innovative ways and has sort of garnered a slightly poor reputation among a certain set.
And part of it is the reason this quote exists is because a lot of places use XML to describe the XML that describes the XML that describes the XML that marks up their collection records. For some people that kind of thing is a bit mind‑bending.
A few characteristics that I particularly like about XML: it is pervasive from a digital preservation standpoint. Much like Java is kind of the default language for boring business applications over the last 20 years or so, XML is sort of the default language for boring business markup. It shares similar roots with HTML, and even things like this presentation, if I were to export it as Keynote ’09, would be in fact an XML document.
This is probably the most controversial one to the developers in the room because I happen to like these things and developers sometimes don’t like them so much. But XML has some really interesting tooling that enables you to transform XML, map data, do things you would normally do with Python or PHP with XML, which again gets back to that kind of nest and nest and nest and infinite recursion. But if you use it prudently it’s a very powerful thing, as we will see in a few moments.
The other kind of really important thing for me that XML has is several ways of validating structures of XML documents. It’s definitely one of those things that the JSON people are starting to catch up, in terms of I have this data, but I don’t know if this data is formatted well, if this is a number, a string, a geographical location. XML has sort of built-in tools for dealing with some very important things.
So what we started doing is instead of thinking relationally, we started thinking with aggregates. We said, what if we took the data that we needed for a view and just never broke it up? What if we kept it whole, put it in the database whole, got it out whole? We could save some time and some considerable complexity with just that whole process of ORM, or marshaling, or serializing and de-serializing data from our data store. We figured hey, the fewer formats we’ve got, if we’ve got a canonical representation of an audio guide tour stop or of a scholarly essay, let’s keep it whole. The problem with that has traditionally been that there’s been very little tooling and support around querying and searching whole pieces of documents.
We’ve removed that unnecessary work with an XML document database. The one we chose is called MarkLogic. There are other things that are also XML document databases. The paper goes into substantial detail on why we went the way we went. I’m happy to field questions at the end, but I feel it’s more important to focus on what we did rather than how we did it.
As I will tell anyone that asks me if they should do things the way the Met did them, I will say no, no, no, no, no, no. No. Don’t even think about it.
Every museum has a completely different set of resourcing challenges, and there are tools in this kind of workflow that we’ve arranged, tools on the front end, tools on the back end, that are just not appropriate for organizations of a different size.
That said, this as an approach and MarkLogic as a system is pretty good at scaling up and down depending on your challenges.
So pay attention very little to the specifics and more to the general what we’re trying to do. So yes. We’ve tried to eliminate unnecessary work whenever we can.
We create code that pulls object information from TMS and stores it as XML in our XML document database. Every project from that point on uses that tool to get data out of TMS, which is a departure from how we’ve traditionally done it, where every silo that pops up forms a brand new path for getting information out of TMS.
Enough of that, because next time TMS 2014, TMS 2016 comes out, instead of having one pile of code to fix we’re going to have 18, and it’s going to be buggy for a couple weeks while stuff is unstable. But if you only do that work once, you only have to fix it in one spot, which is to me like smart money.
As an IT pro, I would rather work on new projects than fix old ones. Not everyone in the museum agrees with that premise. In fact most don’t it seems, present company excluded.
The other thing is that it tends to be computationally expensive. You don’t want to have that 13-join SQL query run every time you want to get information about your object. You want to run that once. Hopefully you’ve gotten it to the point where you can run it once every time TMS changes. Once nightly is generally our rule of thumb.
Do it once, aggregate it, store it, you never have to endure the overhead of that computation again. You’ve just got your XML representation and it’s literally as fast as reading it off of whatever storage you put it on. There’s no additional kind of fetchy things going on.
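The "run the join once, store the aggregate" idea can be sketched with the Python standard library. This is not the Met's code and does not use the TMS schema or any MarkLogic API: the table names, column names, and the two-join query are hypothetical stand-ins for the much larger query described above, and a plain dictionary stands in for the document database.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_object_documents(conn):
    """Run the join once and return {object_id: xml_string}.

    Each value is one whole XML document per object, so later reads need
    no joins at all, only a lookup by id.
    """
    rows = conn.execute(
        """
        SELECT o.id, o.title, o.accession, m.name
        FROM objects AS o
        LEFT JOIN object_makers AS om ON om.object_id = o.id
        LEFT JOIN makers AS m ON m.id = om.maker_id
        ORDER BY o.id, m.name
        """
    )
    roots = {}
    for object_id, title, accession, maker in rows:
        root = roots.get(object_id)
        if root is None:
            # First row for this object: create the aggregate document.
            root = ET.Element("object", id=str(object_id))
            ET.SubElement(root, "title").text = title
            ET.SubElement(root, "accession").text = accession
            roots[object_id] = root
        if maker is not None:
            ET.SubElement(root, "maker").text = maker
    return {oid: ET.tostring(root, encoding="unicode") for oid, root in roots.items()}
```

Scheduled nightly, a job like this would be the single place that knows the relational schema, so a schema change in the source system means one fix rather than one per project.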
The other interesting thing about XML document databases is that they are partitioned completely differently from relational databases, in that I could easily throw in, and in fact in a two-week period when we were trying out MarkLogic we did throw in, scholarly essays, tweets, web pages, object records, any data we had lying around, education PDFs and finding aids. Just kind of a smattering.
And searched across all of them, regardless of what the format of the individual pieces were.
You can do these interesting things, and there’s kind of two ways you can go about it. Since you can transform XML using XML, the strategy we employed was that we basically built an envelope around each document in the search corpus that had some very minimal structured metadata: what was the title of the thing, what was the link to the web representation, what type of record was it. Very basic stuff, just for the purposes of having kind of sensible search results come up. When you type in Picasso and potatoes, what would you get? Turns out some interesting stuff.
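The envelope idea can be illustrated in a few lines: each document keeps its own internal structure, and a thin wrapper adds the same three pieces of metadata to all of them. The element names (`envelope`, `header`, `body`) and the naive substring search are assumptions made for this sketch. The talk describes doing the wrapping with XML transforms inside the database, which is not reproduced here.

```python
import xml.etree.ElementTree as ET


def wrap_in_envelope(payload_xml, title, url, record_type):
    """Wrap an XML document in an envelope carrying shared search metadata."""
    envelope = ET.Element("envelope")
    header = ET.SubElement(envelope, "header")
    ET.SubElement(header, "title").text = title
    ET.SubElement(header, "link").text = url
    ET.SubElement(header, "type").text = record_type
    body = ET.SubElement(envelope, "body")
    body.append(ET.fromstring(payload_xml))  # original document is kept whole
    return envelope


def search(envelopes, term):
    """Return (type, title, link) for every envelope whose text contains term.

    The result shape is uniform because it reads only the header, no matter
    what kind of document sits in the body.
    """
    needle = term.lower()
    hits = []
    for env in envelopes:
        if needle in "".join(env.itertext()).lower():
            hits.append(
                (
                    env.findtext("header/type"),
                    env.findtext("header/title"),
                    env.findtext("header/link"),
                )
            )
    return hits
```

Because the result list is built from the header alone, an essay, a tweet, and an object record can all appear in one result page without the search code knowing anything about their individual schemas.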
The most interesting piece of data I managed to fish out of a search was a conversation the director of the Metropolitan Museum of Art had with one of our curators in the paintings department talking about an exhibition that resulted in the curator receiving a lot of unsolicited e‑mail.
I searched for spam and it was clever enough to figure out it was probably related to that, and curators don’t generally know what spam is.
Go figure.
It sort of started jelling that this was sort of an approach that might solve the problems we pushed off as too difficult to solve. Search, we’ll turn around and do it eventually, we’ve got bigger problems. You don’t have bigger problems than search at the end of the day. Search is your problem.
And that was sort of the third, yeah, there’s other stuff for accomplishing it, you want the text tank or any of the host of kind of third-party searches you’re bolting onto your corpus. But this one sort of just had it. Everything we wanted to do, from loading kind of custom thesauri in to doing fielded search, semantic search using triples, we can do that using this database and combine them into the same queries, which is mind-bending when you think of combining SPARQL with advanced full-text search.
I can’t even imagine how cool that will be yet. We’re just starting to scratch the surface of the things we can do with exposing linked data as part of our search tool chain.
And so ‑‑ how are we doing on time?
>> Rob Lancefield: Five minutes.
>> Ryan Donahue: Perfect. The Met audio guide. I love Met stock photography because we just tend to use digital media people. That’s Stacy on the left and an intern on the right. Stacy is back at the office, but she’ll love that she was in this presentation.
Here’s a sneak peek. There’s bigger, poorly reproduced versions of these screenshots in the printed guide and I think online there’s some good ones. Our content editor, as colorful as it is doesn’t really translate well to black and white. I will say as the person who put this content editor together, that’s why you don’t let developers pick colors.
I mean, they said they wanted me to color coordinate the boxes so when I tell other people, I put it in the green box. The designers looked at me and were just like ‑‑
So our content editor is actually a pretty cool piece. Since we’re only editing XML documents that are going into the audio guide, we use this piece of software called Xopus, which takes an XML schema, which we already had, and an XSLT transformation that we’re doing all kinds of interesting things with, and turns it into this sort of editable form depicted in the screenshots. It’s HTML under the hood so the editor doesn’t know he’s editing horrendous XML documents.
It allowed us to build the trappings around Xopus content. It finds the content you want to edit. If you wanted to find all the tour stops that had Japanese audio, were about Egyptian art, there’s only a couple, and contains the word “blue,” you could enter it into the content, get the one document, pop it open and edit away.
Which is like a sort of a lot of drudge work we ended up not having to do because it was just very easy, between search results coming out of the database and the tree view that we were using for document selection.
This ended up being a really kind of powerful way to enable us to build content authoring around XML schemas quickly. We had a fairly substantial schema change in the middle of the document, in the middle of the project as is almost always the case with big projects. Right in the middle everything changes. Instead of having to go back and completely redo our database and all our ORM, all I did was update the schema, do an XSLT transformation to go from the old to the new schema and update the transformation for the content editor and we were back in business in a couple days rather than a couple weeks, which was surprising to me.
I was sitting there doing it, which I always view like a technology as being truly remarkable when it surprises the person wielding it. We actually used the XML documents raw in the iOS app. iOS has a number of libraries for parsing XML and most work well. We slapped them into this beautiful app. Pay no attention to the case, we didn’t do the case. But the application itself, just kind of read the XML, provided the interface and our kind of build process was kind of merely copy documents out of the database, build the app, load it on the device, we’re done. This has been running for a little under ‑‑ let’s see, when did it actually launch? September? So we’re coming up on a goodly portion of a year. It feels like it’s been running forever because we’ve been working on it so darn long. It’s been behaving well enough that this is sort of the approach we’re taking with our next mobile project.
The other kind of interesting stuff we had to do was we created an audit harness to kind of beyond the schema, do some external validation. Like make sure an object hadn’t moved from one gallery to another. Things that schema won’t address, your sort of high order business rules.
A lot of the stuff was reusing validation code we had already written in other parts of our application data.
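As a rough picture of what such an audit check might look like (not the Met's actual harness), the Python sketch below tests one business rule a schema cannot express: the gallery recorded on a tour stop must match the object's current gallery. The element names, IDs, and the `current_locations` lookup are hypothetical.

```python
import xml.etree.ElementTree as ET


def audit_stop(stop_xml, current_locations):
    """Return a list of business-rule problems for one tour-stop document."""
    stop = ET.fromstring(stop_xml)
    object_id = stop.findtext("objectId")
    gallery = stop.findtext("gallery")
    actual = current_locations.get(object_id)
    problems = []
    if actual is None:
        problems.append(f"object {object_id} has no known location")
    elif actual != gallery:
        problems.append(
            f"object {object_id} moved from gallery {gallery} to {actual}"
        )
    return problems


def audit_all(stops, current_locations):
    """Audit every stop and keep only the ones that have problems."""
    report = {}
    for index, stop_xml in enumerate(stops):
        problems = audit_stop(stop_xml, current_locations)
        if problems:
            report[index] = problems
    return report


# Hypothetical data: where the collections system says each object is now.
current_locations = {"obj-1": "131", "obj-2": "207"}
stops = [
    "<stop><objectId>obj-1</objectId><gallery>131</gallery></stop>",
    "<stop><objectId>obj-2</objectId><gallery>199</gallery></stop>",
]
report = audit_all(stops, current_locations)
print(report)
```

Both documents here would pass a schema check; only the cross-check against an outside source of truth flags the second one.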
Next we were going to migrate some kiosks in there. We’re going to continue with this style. We have to kind of blow out the temporary bridges we built to TMS and our DAM and our website and replace them with something a little more suitably long‑lived. Then we start exploring the implications this kind of technology has on areas like editorial, design. When you look at kind of MarkLogic target verticals, they do really well in publishing and the Met happens to publish a lot of regular print collateral so that seems like a very interesting opportunity target.
We’ve already had some interesting requests come through the ‑‑ with stuff like ‑‑ start looking at things like DITA and DocBook, and to do some interesting things with labels. And sort of trying to formalize some projects around making label production a little more standard and consistent.
But yeah, more or less it just sort of, full speed ahead with this kind of technology. We’re still wrapping our heads around it, and it takes a developer that’s never done work this way probably six weeks to a few months to really wrap their head around like they really don’t have to do that stuff they had to do every single time, and it’s OK. The kind of hardest part of using this is letting go of your old habits and paradigms and patterns you’ve used and that seems to be somewhat consistent across the board. Now that I’ve done this, I find it really tedious to go back and do things the old way.
Thank you very much for sitting through the most exciting session of post‑lunch, I think.
>> Rob Lancefield: Thanks, Ryan. And we do have time for a question or two. You can pick.
>> Thank you very much. I would like to talk with you a bit afterwards.
>> Sure.
>> Given all you said about the advantages of XML versus relational structures, if you were starting from scratch, buying the system for the museum, setting it up, would you not get TMS?
>> Ryan Donahue: A very good question. Considering the field doesn’t have a standard for cataloging in any way, shape, or form. We’ve got LIDO, but that’s not meant to be a full-fledged cataloging replacement, it’s an interchange mechanism.
I would probably still go with a collections management system off the shelf. TMS comes to mind most readily as it’s the most efficient collection management system I’ve ever known. It’s easy to imagine a world where we have standards for this and you could potentially explore other avenues.
But the whole idea that we’ve undertaken with rolling this technology to the Met is that it is not our intent to roll in and make everyone give up their systems of record, to change what they’re doing. What this kind of technology allows us to do is build ingestion pathways from those systems into our document database and sort of let them keep going. As painful as it might be to let a FileMaker or Access database or heaven forbid an Excel spreadsheet add ‑‑ sometimes it’s just better to render unto Caesar what is Caesar’s. Just give us an XML representation of it and we’ll be cool. That’s been a strategy that’s certainly working for us with TMS.
There are things you don’t necessarily want to re‑create in an XML document database. Let the DAM thing do the DAM thing. Conde Nast built kind of a mini DAM. They decided they were going to build a central metadata library that builds them all together so they would stop buying assets from Getty that they own the rights to.
I’ve done it. I’ve bought photographs from the collection because it was easier than digging it up at the time. That was before we rolled the DAM in. After the DAM it was smooth sailing. But it’s a great piece of software to be part of the stack, but I haven’t done enough with it to recommend replacing the whole stack yet.
>> Rob Lancefield: We have time for one more question.
>> It seems like you have a lot of data you’re working on here ‑‑ as a developer a lot of the advantages, the one of them being ontological references, I wonder if you have an ontology problem and what MarkLogic does. On the receiving end on the various tech that you’re using ‑‑ all this XML which would revolve for ontological problems. It seems you’re going to have various things that are going to be similar and you’re going to want to know about them.
>> Ryan Donahue: That’s a good question and right now we’re sort of just kicking the can down the road as long as we can. Particularly with the newest iteration of the database they have added sort of a proper triple store to it and I’m hoping the Semantic Web people are magically going to find a way to sort of make it work. That’s what they’ve been promising. They haven’t been saying it, but they’ve been implying. When that happens, I want to be the first in line to take advantage.
Another advantage we have is we’ve got namespaces. We’ve got at least the ability to make sure we know this title isn’t the same thing as this title isn’t the same thing as this title. So while we’re not necessarily doing anything to solve that problem now, we’re trying our level best to make sure we’re not making the problem worse.
If you’ve got any ideas, we’ll talk over beers.
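The namespace point in the answer above can be illustrated with a short Python sketch (not from the talk). The two namespace URIs and the sample titles are invented; the point is only that two elements both named `title` stay distinguishable when each lives in its own namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces: one for object records, one for essays.
NS = {
    "obj": "http://example.org/ns/object",
    "essay": "http://example.org/ns/essay",
}

doc = ET.fromstring(
    '<record xmlns:obj="http://example.org/ns/object" '
    'xmlns:essay="http://example.org/ns/essay">'
    "<obj:title>Water Lilies</obj:title>"
    "<essay:title>Notes on a Late Series</essay:title>"
    "</record>"
)

# Same local name, different namespaces, so the lookups cannot collide.
object_title = doc.findtext("obj:title", namespaces=NS)
essay_title = doc.findtext("essay:title", namespaces=NS)
print(object_title, "|", essay_title)
```

This does not reconcile the two notions of "title"; it only keeps them from being silently merged, which matches the "not making the problem worse" framing.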
>> Rob Lancefield: Excellent transitional moment, Ryan.
So thank you again.
And our final speaker on today’s session is Jing Wan from the Beijing University of Chemical Technology. If you’re looking at the printed program with the two abstracts, the names of her two coauthors of the paper are on there. Hers is on the website.
>> Jing Wan: Good afternoon, everyone. I’m Jing Wan. I come from China. And I am a visiting professor at the University of Southern California. I work for the information group.
Thank you.
The group is led by Professor Craig. My presentation is about assistance for publishing museum data to the linked data Cloud.
This is the outline of my presentation. First I will talk about museums and the linked data Cloud: what is the linked data Cloud, and what can it do for museums. Second, we’ll talk about the challenges. If a museum wants to publish its data to the linked data Cloud, what will it need to do and what will the problems be?
Third, we will talk about what Karma can do for you. And after that, we’ll talk about a project where ‑‑ in the project we use Karma, we ‑‑ data, we publish a museum’s data to the linked data Cloud. Last, we will talk about our future work.
Museums and the linked data Cloud. What is linked data? It refers to a set of best practices for publishing and connecting data on the web using URIs and RDF. Every thing has a URI, which is like an ID. RDF is the data format, and the data are linked together.
During the last several years these best practices have been adopted by a number of linked data providers, and a global data space was built.
It contains millions of data items and we call it the linked data Cloud. The linked data Cloud consists of data sets in different domains and resources within the data sets.
In this picture, each bubble is a data set.
How will museums benefit from linked data? Look at this picture. In the center is the linked data Cloud, for example, DBpedia. Now museums store their data in databases and then they build websites; users can access their data from the web.
But now these data sets are isolated. If museums publish their data to the linked data Cloud and it becomes part of the linked data Cloud, it will be linked together. It is a promising way to provide richer content to users. It is a promising way to share data.
Challenges: if a museum wants to publish its data to the linked data Cloud, what do we need to do? In the center is the linked data Cloud and this is the data set of the museum. First we need to map data to RDF. We map data to RDF. The second step, we link to external resources. We need to link the RDF, the data in the RDF, to external resources.
Maybe link to data from another museum, maybe link to data sets such as DBpedia or the New York Times.
After these two steps we can build application.
We can build applications. Users can access the data of the museum. And through these links they can access data in the linked data Cloud. I will talk about the challenges. The first step, and the second step the works. When you’re mapping data to RDF it will be labeled and saved. It will be significant public external resources. The way you link is technically difficult. You need to build a team; the team includes data experts and programmers.
It will be expensive and time consuming. The software Karma wants to ease the first and second steps and make them easier. What can Karma do? Karma is an interactive tool for extracting, cleaning, transforming and publishing data.
This is a very simple image to show part of the function of Karma. Karma can be used by domain experts. It’s very easy to use. First assemble source data, the data you want to map. The second step is the domain model. We often choose public or ‑‑ output is source mapping. Karma helps us map the data to ‑‑ we record the mapping in the model. And after the model is generated we use the model to generate the RDF file.
This is a Karma model. What is a Karma model? A Karma model captures the meaning of the data in each column and the relationships among data columns. This is the example.
The bottom is the database table. And the top is mapping to linked data. We specify semantic type to each column. We specify relationship between columns. This is the model.
Another function of Karma is creating links. Curating links.
After you link to a source you need to verify it. Karma provides this interface to let you verify the links.
Next I will talk about the project we are doing. We’re using Karma with the SAAM data. This page is the Smithsonian American Art Museum. Researchers can search collections and artists. About 40,000 artworks and 80,000 [Inaudible] institutions. This is the data we will map.
This image shows what we have done. The input is the SAAM data, which includes the tables and the records and columns. We ‑‑ the CRM has many classes and properties. It’s used to describe the information of heritage. After we map the SAAM data to the CRM, we generate the RDF file and we create links to DBpedia and the New York Times.
That’s the work we have done.
Let’s talk about the process steps. Step 1, define the URI scheme. Everything is identified by a URI. This is an example. We define for types of entities. For example, collection includes three parts. Base object, object number. On the left we see an example.
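A URI scheme of the kind described in Step 1 can be sketched in a few lines of Python. This is an illustration only, not the project's actual scheme: the base URL, the entity type names, and the sample object number are all hypothetical.

```python
from urllib.parse import quote

# Hypothetical base; a real project would use a domain the museum controls.
BASE = "http://example.org/museum"


def make_uri(entity_type, local_id):
    """Build a URI from three parts: base, entity type, local identifier."""
    return f"{BASE}/{quote(entity_type, safe='')}/{quote(str(local_id), safe='')}"


object_uri = make_uri("object", "1979.98.1")
place_uri = make_uri("place", "New York/NY")
print(object_uri)
print(place_uri)
```

Percent-encoding the local identifier keeps characters such as `/` or spaces in source values from changing the structure of the URI.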
Step 2, extend the ontology. Even though the CRM has many classes and properties, sometimes it cannot describe or represent some elements. For example, an object has many classes, classified, classification, and a self‑classification, and it has media. There are no properties for it in the CRM, so we define property, object make class, object class, and media description for it.
Step 3, define the controlled vocabularies. Each museum has its own vocabularies. For example, they have their own terms of classification. We need to define these controlled vocabularies. For example, the list is the collection classification of the SAAM data. We need to define and return the ontology. It has a label and it has a scheme.
Step 4, cleanse and normalize data. Sometimes the data in the table has its own format. And when we do some mapping, we need another format of data. For example, in the table the date is in three columns, but in the mapping we want them joined together.
You can use Kaising to manipulate data and after transformation the data became what we want.
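The date example in Step 4 can be sketched as a small Python function. This is plain Python for illustration, not Karma's own transformation API, and the rule of dropping missing trailing parts is an assumption about how partial dates might be handled.

```python
def join_date(year, month, day):
    """Join separate year/month/day columns into one ISO-style string.

    Parts are used in order until the first missing one, so a row with
    only a year yields just the year.
    """
    parts = []
    for value, width in ((year, 4), (month, 2), (day, 2)):
        if value is None or value == "":
            break
        parts.append(str(int(value)).zfill(width))
    return "-".join(parts)


# Made-up rows standing in for three date columns in a source table.
rows = [("1923", "4", "7"), ("1923", "", ""), ("", "", "")]
normalized = [join_date(*row) for row in rows]
print(normalized)
```

Zero-padding the month and day means the joined values sort correctly as plain strings.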
Step 5, define the mapping. The table is the original data from the SAAM data, and first we need to add some URIs to the table. We use PyTransform. And after that, we add Lee URI, object URI, and the place URI. Now we have all the data or columns we need, then we need to map.
We set semantic types. Karma can give you some suggestions and make this process easier.
We specify the identifier URI as Identifier. Identifier is a class of the CRM. We do this again and we have Man-Made Object and Place.
This is the model we will get. There are three classes: Identifier, Man-Made Object, and Place. And each column has a semantic type. And then we have relationships between the Man-Made Object and Identifier,
Man-Made Object and Place.
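To show what a simple mapping like the one above produces, here is a minimal Python sketch that turns one table row into RDF triples in N-Triples syntax. It is not Karma output: the `http://example.org/` base, the vocabulary terms (`Object`, `Identifier`, `Place`, `identifiedBy`, `locatedAt`), and the sample row are hypothetical stand-ins for the CRM classes and properties used in the project.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
VOCAB = "http://example.org/vocab/"    # hypothetical ontology namespace
BASE = "http://example.org/museum/"    # hypothetical URI scheme base


def uri(value):
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(text):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(row):
    """Map one row to three typed resources and two relationships."""
    obj = BASE + "object/" + row["object_id"]
    ident = BASE + "identifier/" + row["object_id"]
    place = BASE + "place/" + row["place_id"]
    return [
        (uri(obj), uri(RDF_TYPE), uri(VOCAB + "Object")),
        (uri(ident), uri(RDF_TYPE), uri(VOCAB + "Identifier")),
        (uri(place), uri(RDF_TYPE), uri(VOCAB + "Place")),
        (uri(ident), uri(RDFS_LABEL), literal(row["object_number"])),
        (uri(place), uri(RDFS_LABEL), literal(row["place_name"])),
        (uri(obj), uri(VOCAB + "identifiedBy"), uri(ident)),
        (uri(obj), uri(VOCAB + "locatedAt"), uri(place)),
    ]


def to_ntriples(triples):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Made-up sample row.
row = {"object_id": "42", "object_number": "1979.98.1",
       "place_id": "7", "place_name": "Gallery 7"}
triples = row_to_triples(row)
print(to_ntriples(triples))
```

Each column ends up as either a typed resource or a literal attached to one, and the relationships between classes become triples whose object is another URI.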
This is a very simple mapping. The real mapping is more complicated. Step 6, creating links with Karma.
In this interface, you can view the data of the Smithsonian American Art Museum, the data we want to link from Wikipedia, and then the data of the New York Times. And you can check and confirm them.
Step 7 create and publish RDF.
We publish the model, publish RDF. This model we will use to generate our final RDF file. We use this model for the complete database. After generating our RDF, we load that file into the triple store. In our project we build 19 RDF files and triples.
This is the result. After we publish the data.
Users can access it through SPARQL. And links on the website. You can link to Wikipedia or the New York Times.
Our future work. We would like to build an open data virtual museum. It uses the data in the triple store. The triple store gets data from several different museums. We have finished mapping some data and uploaded it to the triple store. We still need several data sets.
If anyone interested, please send us letter of interest. This is the contact information. You can call me or Eleanor. I will convert your data. And I will give you your collection data.
This is the website of Karma. Karma is available to download. Now I’ll use ‑‑ there are user guides and useful demonstration. And this is the website of our project we document the mapping to CRM. We build this website in order to share with others.
In this website you can see the example resource and examples from other data sets, and we recorded the mapping. This is the record about the mapping: for each table, we recorded sample data. We recorded the mapping. And we recorded an RDF sample.
Thank you. I’m still struggling with spoken English. Thank you for your consideration.
I have several fliers with me about the project. If you are interested, you can take one.
>> Rob Lancefield: Thank you. We do have a bit of time, so perhaps a question or two for Jing Wan and perhaps maybe a few questions or discussion?
>> I’m wondering what the scope of Karma is and the architecture you were showing us: is it a triple store, and does it have its own SPARQL interface?
>> Jing Wan: It’s not a triple store, it’s just software for doing the conversion, for doing the mapping. Yeah.
>> Rob Lancefield: Any other questions?
>> Jing Wan: I just wanted to add what Jane had said, I know the manager of the American Art Collaborative Project where we’re trying to collect the American art data. She mentioned it’s free of charge, and that’s true, but we have limitations as to how much we can take for this pilot project.
So if you could get that response back to us, bear in mind we’re going to be selective going through and provided in the format needed and everything.
>> Rob Lancefield: Any other specific questions about this presentation? Then I might ask if anybody has questions that could cross‑cut any of these topics, the session, and tie things together.
We have a microphone over there.
>> This is not only for Ryan, I suppose it works for you as well. You were talking about the benefits of XML link store over traditional relational data store. Where would you say that breaks down and XML data store would be just a terrible fit for what type of data or what use case. Same question for RDF, where that would actually be a good fit for a data set and ‑‑ so where’s the good fit and the bad fit for each of these techniques?
>> Ryan Donahue: I’ve seen demos of XML document databases doing some insane things, like transactional processing and stuff you would just assume all day long is like Oracle and just relational databases’ sweet spot. I haven’t really found ‑‑ I mean there’s stuff that’s like so, kind of relational in nature that it just would be painful to come up with ‑‑ if you’re doing data warehousing, you’re going to pivot and do it in those big relational stores, it’s not going to be what I would use an XML document database for.
Most of what we’re doing is sort of long‑form, text‑heavy, makes more sense. To me, whenever the database tables have more kind of giant varchars and text fields in them than they do, than kind of more interesting SQL data types, the more text heavy it is, the more I start leaning towards looking at XML first.
Versus if it’s a lot of numbers or a lot of analytical stuff, I sort of don’t. It’s becoming very blurry to me as the whole NoSQL thing gets better and better. It’s not like the computer science is new. The computer science is old. Vendors are finally starting to implement the technology in interesting ways.
It starts with like the Mongos of the world, for the love of Pete, don’t ever store anything you want to keep around in Mongo, or at least not one instance of Mongo. Get a dozen. But these NoSQL databases and these kind of nonrelational databases are making a comeback. It’s interesting.
That’s not probably a very satisfying answer, but I think that a large percentage of the things that museums do are better suited for document databases than relational databases.
>> Rob Lancefield: Any other questions?
>> Yes. A quick question back here. As you all are moving to these different data formats, they all seem to be sort of pipelining out, so most of them are focused on read only. I guess the question that I have is the people who are working on applications or thinking about what to do with the data, are they imagining different things to do with it, or is there an education task that those of you who have your hands in the soil, so to speak, are having to do with ‑‑ are actually responsible for them putting the data out there and coming up with application ideas.
>> No one else? OK, Ryan.
>> Ryan Donahue: Fine!
I think that sort of for me, the kind of outward‑facing demand isn’t changing, it’s just our ways of delivering what they’ve wanted kind of the whole time. Like very few externally facing APIs expose SQL queries. SPARQL is sort of an interesting thing. Not only is it transformationally different because it’s using triples, but it’s also kind of a directly exposed, standardized query language, which is an interesting wrinkle, which you could sort of do if you weren’t so darn concerned with security in most contexts. But with the sort of security-less, anyone can triple about anything at any time sort of vibe of the linked open data stuff, it’s like heck with it, put the query out and let the public face it.
But in terms of what people expect out of us in terms of external collaboration with data, I think it’s the same as it’s ever been. They want all of our work done and in as easy a form as possible. Which is generally JSON. The beauty of XML, they make it easy to convert to JSON at the last mile. Another handy side effect is, if your last-mile transformation is to JSON, you can sort of ‑‑ automatic XML to JSON translators are gross unless you tool your XML to look sort of JSONish to do that.
So if a developer is really like I want JSON or I can’t do XML because I don’t like to work ‑‑
Read the paper!
You cannot only deliver them JSON, but JSON they would never know is XML under the hood.
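The last-mile conversion mentioned above can be sketched in Python as follows. This is an illustration, not the Met's code: the element names and sample document are made up, attributes and mixed content are ignored, and the converter is deliberately naive.

```python
import json
import xml.etree.ElementTree as ET


def element_to_data(elem):
    """Convert an element to a string (leaf) or a dict of its children."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    data = {}
    for child in children:
        value = element_to_data(child)
        if child.tag in data:
            # A repeated tag turns the existing value into a list.
            if not isinstance(data[child.tag], list):
                data[child.tag] = [data[child.tag]]
            data[child.tag].append(value)
        else:
            data[child.tag] = value
    return data


xml_doc = (
    "<stop><title>Blue Hippo</title>"
    "<language>en</language><language>ja</language></stop>"
)
as_json = json.dumps(element_to_data(ET.fromstring(xml_doc)), sort_keys=True)
print(as_json)
```

The sketch also shows why automatic translation gets awkward: `language` is a list here only because it repeats, and a stop with one language would produce a plain string instead. Shaping the XML with JSON in mind, for example by always wrapping repeatable items in a container element, avoids that kind of surprise.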
>> Rob Lancefield: And we do have time for one more question. Is there a JSON die‑hard in the house? Care to rebut?
Anything else for any of the presenters here? One in the back.
>> A general question. In the first presentation there was the motivation that you wanted to go to for open source solutionize a person ‑‑ data ‑‑ obviously you’ve done a very thorough comparison of the different approaches. Did you also compare the cost of data ‑‑ resulting that you need to do for open source. Based on licenses? Or commercial application is just not good enough?
>> Hopefully I can get this all.
>> Niki Krause: We were asked if we had also considered the costs of maintaining the in‑house development skills to maintain the open source when saying that we were going to start moving away from licensing necessary for commercial sourcing, yes, we actually did. The primary motivation for open source is control and flexibility. We wanted to make sure we had the in‑house power to on our own schedules and our own terms make changes that would allow us to deliver things, to integrate things and so forth.
If as anybody has tried to integrate with a commercial product has found, you might hang and call customer support and technical support back and wait weeks to get a meeting to find out exactly how to get the exposure for what you need to get in there and grab the data. They have the tools and don’t necessarily want to share them with anybody that’s a customer. We’ve had that and ongoing problems with that with a couple of our other projects.
We have a central table for our donor data and consolidating all of our transactions to perform a history of a person’s relationship with the museum, the Museum of Art. Digging into those commercial systems using the centralized approach has really formatted the fact that we need open whenever probably so we can get in there without having to go through technical support.
>> I would add that a lot of these projects were started ‑‑ I think one of my first goals was really also how many softwares we’re using that are not good support. And the licensing fees for our operating budgets were crazy. Especially in our libraries. So we really looked at ‑‑ we always looked at the best way to do each project, but we definitely are going away from that.
We want to own our own ‑‑ we want to be able to get at our own data the way we want to be able to get at it.
>> Ryan Donahue: Sounds like you need to visit the Tesator booth.
>> Or the Black Lab.
>> Rob Lancefield: And so it draws to a close for now. But there is coffee out there, I believe. Conversations can continue. I would like to thank Jane Alexander, Niki Krause, Ryan Donahue, and Jing Wan. She has information about the projects and she has some up here. Thank you again.