Publishing Data: First thoughts on simple and sustainable
In this blog post we’re going to share some of our very early thinking about publishing data to VertNet. None of these ideas are final and we’re definitely looking for feedback, so please leave comments and suggestions so that we can incorporate your ideas!
For a quick review, VertNet won’t need servers for data portals (e.g., MaNIS) or data publishers (e.g., DiGIR providers on local servers). VertNet data and applications will be stored and run in the Google App Engine cloud. To publish, users will simply upload their records, which will then be ready for searching, accessing, analyzing, and visualizing.
Nice! But there has to be more to it than that, right? To reflect how biodiversity data management works in the real world, VertNet will support familiar concepts such as data publishers (e.g., institutions or individuals sharing data), data collections (e.g., birds or mammals), and Darwin Core records (e.g., specimens and observations). This is similar to the organizational structure of the GBIF Integrated Publishing Toolkit (more about that in a future post).
But what about the actual mechanics of publishing records? How will that work? That’s meant to be as simple as possible, and no simpler, but simple is not our only goal. It is important for the system to be sustainable in the long term, so let’s take a moment to think about what that means.
The community won’t need to maintain servers just to participate in VertNet, which will save time, money, and hassle. But there’s more to sustainability than that, because we still get charged for running VertNet in the cloud. The cost is based entirely on the amount of cloud services we use (storage, processing time, bandwidth). VertNet has funding to pay for this during the three years of the grant, but we need to think longer term than that. So there are a couple of things that we’re doing right now. First, we’re designing a highly optimized VertNet that uses cloud services in cost-effective ways. Second, we’re figuring out how to pay for it once the grant is over. Stay tuned for the latter, because VertNet will have a workshop in its third year specifically to address sustainability.
Since we’re charged for using cloud services, it makes sense not to use them for things we can do outside of the cloud. We will use the cloud for the things that we can’t do better and more cheaply elsewhere, namely performance, reliability, and scalability. Performance is the answer to impatience, reliability keeps that performance consistent over time, and scalability keeps it consistent regardless of how many people are using the system.
However, we can prepare records locally before publishing them to VertNet. This includes validating data, building search indexes, etc. You may be thinking, “Wait, this sounds complicated, and like a lot of extra work for me!” Well, not really. The idea is to have a very simple script (in Python, for those who are interested) on your own computer that prepares and publishes records to VertNet while hiding all of the complexity. Our lingua franca (or lingua biodiversité) will continue to be Darwin Core, which is great because it means that we can leverage formats and tools that already exist (e.g., CSV, Darwin Core Archive), including tools that we’ve created to produce Darwin Core records from current databases in the existing networks.
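To make the idea of local preparation a little more concrete, here is a minimal sketch of what validating records and building a search index on your own computer might look like. It is an illustration only, not the actual VertNet script: the function names, the choice of required terms, the word-level index on scientificName, and the sample rows are all assumptions made for this example. The only things taken from Darwin Core itself are the term names occurrenceID, scientificName, and decimalLatitude.

```python
import csv
import io
from collections import defaultdict

# Assumption for this sketch: these Darwin Core terms must be filled in.
# The real script may require a different set.
REQUIRED_TERMS = ("occurrenceID", "scientificName")


def validate(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for term in REQUIRED_TERMS:
        if not (record.get(term) or "").strip():
            problems.append("missing " + term)
    lat = (record.get("decimalLatitude") or "").strip()
    if lat:
        try:
            if not -90.0 <= float(lat) <= 90.0:
                problems.append("decimalLatitude out of range")
        except ValueError:
            problems.append("decimalLatitude is not a number")
    return problems


def prepare(csv_file):
    """Split records into valid and rejected, and index the valid ones."""
    valid, rejected = [], []
    index = defaultdict(set)  # lowercase word -> set of occurrenceIDs
    for record in csv.DictReader(csv_file):
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
            continue
        valid.append(record)
        for word in record["scientificName"].lower().split():
            index[word].add(record["occurrenceID"])
    return valid, rejected, dict(index)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = io.StringIO(
        "occurrenceID,scientificName,decimalLatitude\n"
        "example:1,Puma concolor,37.87\n"
        "example:2,,12.0\n"
        "example:3,Lynx rufus,123.4\n"
    )
    valid, rejected, index = prepare(sample)
    print(len(valid), "valid,", len(rejected), "rejected")
    for record, problems in rejected:
        print(record["occurrenceID"], "->", "; ".join(problems))
```

All of this work happens before anything is uploaded, so in a design like this the cloud would only need to store and serve records that are already checked and indexed.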
That’s all for now. Our take-home message is that the publishing process at VertNet is still being worked out. However, we plan to use familiar concepts (publishers, collections, records, collaborations) and will be smart about design and cloud usage so that VertNet is sustainable beyond the funding period of the grant. Right now we have functional prototypes for the local Python script that we mentioned earlier, and also for the search API. These are two of the components of the Darwin Core Engine, which we’ll discuss in detail in a future post.
Thanks for reading, and again we’re hoping to get feedback through your comments.
Posted by Aaron Steele and John Wieczorek on behalf of the VertNet team.