Monday, May 5, 2014

Data Usage Reports Now Available for VertNet Data Publishers

VertNet is now publishing monthly data use reports for every dataset you’ve published to our data portal.

How Reporting Works

The reports cover two main categories of statistics: searches and downloads.  Each report includes the monthly statistics for each category, as well as cumulative statistics for the calendar year and for the period since VertNet started tracking each metric.

There are six distinct metrics in the search category, including the number of searches that retrieved data, the number of records retrieved, and a list of the query terms that retrieved data from your resources.  The download category contains seven distinct metrics, including the number of download events and the total number of records downloaded. For a complete list of the metrics we’re tracking, with explanations, take a look at our Guide to Usage Reporting posted on the VertNet web site.

The search statistics count records that were viewed on screen by a user via the VertNet portal.  Thus, if a user queries Sus scrofa and looks only at the first page of 100 records, ONLY those 100 records will be tallied in the respective institutional usage reports.  If a user loads more results, those additional records will be tallied as well.  In the download category, ALL of the records included in a downloaded dataset will be tallied, regardless of whether or not they were seen on screen.  In essence, the search statistics tell you what users actually see, while the download statistics tell you what they take with them.

About this First Set of Reports

The statistics in this first report set will be a little different from future monthly reports. The search category includes statistics beginning at the start of March 2014, but the download category contains cumulative numbers beginning in August 2013. This is because we have been tracking downloads since August; only recently, thanks to great feedback from our testers (all data publishers) and from posts to us on various listservs, have we begun to track search activity. The result is that you’ll be able to track the same statistics that were provided for the classic networks (MaNIS, et al.), as well as some new ones.

How to See the Reports

If you’ve published one or more datasets to VertNet, you can view the usage report for each dataset in its corresponding GitHub repository.  As of May 1, the first data usage reports for each dataset published to the VertNet portal are available.

To get these reports to providers, we’re using GitHub (woo!), the very same system we’ve set up to help portal users and publishers submit and track issues about occurrence records.

For example, the Royal Ontario Museum’s Fishes repository on GitHub is the location of all issues reported about records contained within it (none, right now, we assume because the data is perfect).  There is now a little blue folder, titled “reports,” just above the README file.  Opening that folder will present all of the reports created to date (see image below).

[Image: the “reports” folder in a dataset’s GitHub repository]

All dataset repositories are public, so anybody with a GitHub account can subscribe to and follow the activity within. If you don’t want to get a username, you can always check the page at your leisure to see if anything new has occurred.
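If you prefer to check for new reports from a script rather than a browser, GitHub’s public API can list the contents of a repository folder. Below is a minimal sketch (not an official VertNet tool); the organization and repository names are placeholders that you would replace with the ones for your institution’s dataset.

```python
# Minimal sketch: list the files in a dataset repository's "reports" folder
# via GitHub's public REST API. The organization and repository names are
# placeholders, not real VertNet identifiers.
import requests

ORG = "my-organization"         # placeholder GitHub organization
REPO = "my-dataset-repository"  # placeholder dataset repository

url = f"https://api.github.com/repos/{ORG}/{REPO}/contents/reports"
response = requests.get(url, timeout=30)
response.raise_for_status()

for entry in response.json():
    # Each entry describes one file in the folder, e.g. a monthly report.
    print(entry["name"], entry["html_url"])
```

Unauthenticated requests to the GitHub API are rate limited, so for anything more than an occasional check you would want to authenticate with your GitHub username.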

If you’re a data publisher and you haven’t talked to us about access to your institution’s GitHub organization, all you need to do is go to https://github.com/ and sign up (for free) to get yourself a username.  Once you have a username, send it to us (well, to Dave Bloom, VertNet Project Coordinator, actually) and we’ll get you set up to:

  • Receive notifications that issues have been submitted about your dataset and that monthly reports have been posted,

  • Track and address issues posted by users directly from the VertNet portal, and,

  • Post responses, edit issues, and download documents (such as data reports).

If you’ve forgotten or haven’t heard about VertNet’s Issue Tracking service, you can read more about it on our blog.

Kudos to VertNet’s Javier Otegui, at the University of Colorado, Boulder, for the hard work needed to make these reports possible.

If you have feedback or would like to request statistics that are not covered in our first version of the reports, do not hesitate to tell us.

Monday, April 14, 2014

Sustainability and VertNet: The Quest Continues

More than three years ago, the VertNet team proposed to NSF that we would address the challenge of sustainability in two ways.  VertNet would:

(1) strive to reduce the costs necessary to maintain hardware, software, and IT services to keep biodiversity data publicly accessible.

(2) engage the biodiversity and data user community in conversation to explore existing and future strategies for the long-term sustainability of digitization and data-sharing efforts.

So far, we’ve taken great strides to accomplish these two goals:

  • We’ve developed and implemented a data publishing procedure (using the IPT) and a portal architecture that reduce the cost of maintaining the network.  We believe our cloud-based infrastructure is a huge step toward sustainability.  Despite these successes, it still takes our team time and money to publish datasets and maintain the portal.

  • We’ve been engaged in a serious effort to learn as much as we can about sustainability, funding models, and how VertNet delivers its services to the community.  This includes participation in an intensive sustainability training course, Sustaining Digital Collections, developed by Ithaka S+R and supported by the Mellon Foundation (we’re right in the middle of it, so more on that when we’re done).  We’ve also been accepted to participate in the Ecological Society of America’s Sustaining Biological Infrastructure Workshop in June 2014.

But now we need your help.

We need to know what you think of us.  We want to know, in fact we need to know, what you think about our services and products.  Without your feedback, we can’t achieve our goal of delivering a plan to sustain VertNet and its services into the future.  This includes everything we do: Dataset Publication, Data Quality Improvement, Statistics and Reporting, Customer Service, Training and Capacity Building, and the Data Portal.

So, if you value what we do, as well as how VertNet and other data portals publish biodiversity data, please complete our Sustainability Survey.  It’ll only take about 5 minutes of your time, but it’s 5 minutes that will help assure the continued availability of the VertNet network and all of the data and services it provides.

We thank you for your support.  We’ll let you know what the community has to say.

Thursday, January 23, 2014

Issue Tracking Available in the VertNet Portal, At Last!

VertNet provides some significant data quality services to its data publishers, but we couldn’t possibly catch every taxonomic and geographic name change, georeferencing error, or mistyped collector name, preparation, or event date.  That’s where you come in.

Please allow us to introduce you to issue tracking, VertNet-style.  The goal of this new system is to (1) give users the opportunity to communicate suggested corrections, updates, and questions to data publishers directly from the VertNet portal, and (2) assist VertNet’s efforts to publish the highest quality data possible.

The VertNet issue tracking system isn’t rocket science.  When a user identifies a record in the VertNet portal that he or she believes is problematic and in need of review, there are three ways to communicate questions, concerns, and corrections to the data publisher.  They can (1) use the portal Feedback form, linked on each portal page, (2) contact a data publisher directly, OR! (3) post issues and questions directly to the data publisher from the VertNet portal using our new GitHub integration.

Options #1 and #2 work, but they are simple, (usually) one-way communications that end up hitting an inbox and may or may not earn a response (although we always respond to comments posted via the Feedback form).

Option #3 is where the good stuff lives.  With VertNet’s issue tracking system, all communications are submitted to a public GitHub collaborative repository so that the user and data publisher (and any other interested party) can track, discuss, and resolve data quality issues.

“Why GitHub?”

Well, for many reasons, including the fact that we already use this web-based code hosting service for the VertNet project.  If you’re interested, check out the code we use to interoperate with GitHub.

“But this isn’t code we’re posting.”

That’s true, but GitHub is used for all kinds of things, from wedding planning to home remodelling.  We chose it because it has everything we need for a robust issue tracking service, such as:

  • public, project-based information repositories

  • automatic notification services (great for people who want to do everything via email and don’t want to log in to yet another web page)

  • interactive discussion

  • public URLs for easy sharing

  • long-term archiving

  • subscription services (for folks who want to be notified when things change), and

  • easy administration (plus, we’ll help you whenever you need it)!

It really is simple!  The only thing you need to participate is a username from GitHub.  Once you’ve got one, you don’t need to visit the GitHub pages again if you don’t want to, and that goes for data publishers and portal users alike.  There is a login link built into the portal interface, so you can log in directly from the portal and start submitting issues (see images below).

There is a lot going on behind the scenes with the VertNet issue tracking system, so if you want to learn how it works, feel free to read the GitHub Reference and Set Up Guide we’ve written to help data publishers get everything set up.  Best of all, the VertNet development team is here to help.  We’ll do 90% of the work; you get 100% of the benefit.
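For the curious, the heart of the integration is simply opening an issue in the right dataset repository.  Here is a rough, hypothetical sketch of that one step using GitHub’s public REST API; it is not the portal’s actual code, and the organization, repository, and token values are placeholders.

```python
# Hypothetical sketch of the core step behind the portal integration: opening
# an issue in a dataset repository via GitHub's REST API. This is NOT VertNet's
# actual code; the organization, repository, and token are placeholders.
import requests

ORG = "example-publisher"    # placeholder GitHub organization
REPO = "example-dataset"     # placeholder dataset repository
TOKEN = "YOUR_GITHUB_TOKEN"  # a personal access token with access to the repo

issue = {
    "title": "Possible georeferencing error in catalog number 12345",
    "body": "The coordinates place this record in the ocean, but the "
            "locality string describes a terrestrial site.",
}

response = requests.post(
    f"https://api.github.com/repos/{ORG}/{REPO}/issues",
    json=issue,
    headers={"Authorization": f"token {TOKEN}"},
)
response.raise_for_status()
print("Issue created:", response.json()["html_url"])
```

The portal handles all of this for you when you click the green issue flag; the sketch is only meant to show why a GitHub username is all a participant needs.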

The next time you visit the VertNet portal, keep your eyes open for the green issue flag (see images below).  That’s where you can jump in and help us publish the highest quality data possible.

Here are six different looks at the VertNet issue tracking system using GitHub.

[Image] VertNet search page with the issue tracking (GitHub) login highlighted.

[Image] VertNet occurrence record summary page with the green issue flag button used to submit an issue.

[Image] VertNet issue submission form.

[Image] Example of an issue page once submitted to GitHub via the VertNet portal.

[Image] Example of a single repository for a single data set (MVZ-Herps) with a list of all issues submitted about records in that data set.

[Image] Example of an organization page (for MVZ).  All repositories owned by the organization are listed on the right.  Recent activities within the organization are listed on the left.

Friday, August 16, 2013

New VertNet Data Portal Released

We at VertNet are pleased to announce the public release of the new beta VertNet data portal (portal.vertnet.org).

We’ve been working over the last few months to test and retest our portal concept, and after a lot of great feedback from our community of testers, we’re finally ready to put this out to anyone and everyone who wants to give the new VertNet portal a try.  So please, visit the portal!

The new VertNet portal is faster, more scalable, and more efficient than any of the past vertebrate networks.  We hope searching is a much improved experience (please check out the Advanced search!). Visualizations have been improved, access to metadata associated with individual records has been improved, and our new spatial query tool should make many of you happy.  We’ve still got a lot of work to do before we’re finished, so over the next few months we’ll update the portal, its features, and the number of data sets available.  Check back regularly to see what’s new.

What you need to know about this version:

  • The portal contains 8,196,215 records from 44 data publishers.  Over the next several months, we’ll be adding more and more data and data publishers as we work toward our goal of an estimated 150M records in 2014.

  • There are three ways to search for data: full text-string, spatial, and advanced search.

  • Visualizations use CartoDB’s mapping interface.

  • Using Advanced search, you can search records by tissue and media.  We’ve even started to move paleo data sets into VertNet.

  • The portal works in Chrome, Firefox, Safari, and Internet Explorer.

Don’t forget - WE WANT YOUR FEEDBACK!  To this end, we have provided you with a “Feedback” tab at the top of every page.  Please, let us know what you think, what you want, and definitely let us know if something doesn’t work.  We’re still in beta, so you might experience some bugs.  Tell us when you find them so you won’t find them again.  We want to make the VertNet data portal a tool that meets your needs and makes your work easier.

Get in there and search!

The VertNet Team


Monday, July 22, 2013

Validating scientific names with the GBIF Checklist Bank

This guest post was written by Gaurav Vaidya, a graduate student in the Department of Ecology and Evolutionary Biology, University of Colorado, Boulder.  It is cross-posted with the GBIF Developers Blog.

[Image] A whale known both as Physeter catodon and Physeter macrocephalus (photograph by Gabriel Barathieu, reused under CC BY-SA from the Wikimedia Commons).

Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists’ understanding of species boundaries changes, the names attached to those species can be synonymized, moved between genera, or even have their Latin grammar corrected (it’s Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies, or species complex, or even which of several published names corresponds to our modern understanding of that species, such as the dispute over whether the sperm whale is really Physeter catodon Linnaeus, 1758, or Physeter macrocephalus Linnaeus, 1758.

A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.

Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming GBIF Portal, currently in development and testing. This collection includes large, global checklists, such as the Catalogue of Life and the International Plant Names Index, alongside smaller, more focussed checklists, such as a checklist of 383 species of seed plants found in the Singhalila National Park in India and the 87 species of moss bug recorded in the Coleorrhyncha Species File. Many of these checklists can be downloaded as Darwin Core Archive files, an important format for working with and exchanging biodiversity data.

So how can we match names against these databases? OpenRefine (the recently-renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. Javier Otegui has written a tutorial on cleaning biodiversity data in OpenRefine, and last year Rod Page provided tools and a step-by-step guide to reconciling scientific names, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.

We extended Rod’s work by building a reconciliation service against the forthcoming GBIF web services API. We wanted to see if we could use one of the GBIF Portal’s biggest strengths — the large number of checklists it has indexed — to identify names recognized in similar ways by different checklists. Searching through multiple checklists containing possible synonyms and accepted names increases the odds of finding an obscure or recently created name; and if the same name is recognized by a number of checklists, this may signify a well-known synonymy — for example, two of the Portal checklists recognize that the species Linnaeus named Felis tigris is the same one that is known as Panthera tigris today.

[Image] Linnaeus’ original description of Felis Tigris. From an 1894 republication of Linnaeus’ Systema Naturae, 10th edition, digitized by the Biodiversity Heritage Library.

To do this, we wrote a new OpenRefine reconciliation service that searches for a queried name in all the checklists on the GBIF Portal. It then clusters the matches and counts how often a particular name has the same four fields (a rough sketch of this grouping step follows the list):

  • scientific name (for example, “Felis tigris”),
  • authority (“Linnaeus, 1758”),
  • accepted name (“Panthera tigris”), and
  • kingdom (“Animalia”).
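To make the clustering idea concrete, here is a rough, hypothetical illustration of grouping checklist matches by those four fields and counting how many checklists share each interpretation. It is an illustration of the idea only, not the service’s actual code, and the sample records are invented.

```python
# Hypothetical illustration of the clustering step: group checklist matches by
# (scientific name, authority, accepted name, kingdom) and count how many
# checklists share each interpretation. The sample records are invented.
from collections import Counter

matches = [
    {"scientificName": "Felis tigris", "authority": "Linnaeus, 1758",
     "acceptedName": "Panthera tigris", "kingdom": "Animalia"},
    {"scientificName": "Felis tigris", "authority": "Linnaeus, 1758",
     "acceptedName": "Panthera tigris", "kingdom": "Animalia"},
    {"scientificName": "Felis tigris", "authority": None,
     "acceptedName": "Felis tigris", "kingdom": "Metazoa"},
]

counts = Counter(
    (m["scientificName"], m["authority"], m["acceptedName"], m["kingdom"])
    for m in matches
)

# Interpretations shared by the most checklists come first, mirroring how the
# reconciliation service sorts its candidate names.
for interpretation, n in counts.most_common():
    print(n, "checklist(s):", interpretation)
```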

Once you do a reconciliation through our new service, your results will look like this: 

[Image: reconciliation results for “Felis tigris” in OpenRefine]

Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name “Felis tigris”. Of these,

  • Two checklists consider Felis tigris Linnaeus, 1758 to be a junior synonym of Panthera tigris (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation — as it happens, the correct one — is at the top of the list.

  • The remaining checklists all consider Felis tigris to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and a taxonomic authority. You can click on each name to find out more details.

Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does “Ficus” refer to the genus Ficus Röding, 1798 or the genus Ficus L.? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.

We’ve designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the most widely recognized names, and ending with EOL’s fuzzy matching search as a final step to look for incorrectly spelled names. For VertNet’s 2013 Biodiversity Informatics Training Workshop, we wrote two tutorials that walk you through our workflow:

  • Name validation in OpenRefine, using both the new GBIF API reconciliation service and Rod Page’s reconciliation service for EOL, and

  • Higher taxonomy in OpenRefine, using the web service APIs provided by GBIF and EOL, as well as OpenRefine’s ability to parse JSON.

If you’re already familiar with OpenRefine, you can add the reconciliation service with the URL:

http://refine.taxonomics.org/gbifchecklists/reconcile
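If you would rather query the service outside OpenRefine, it follows the standard OpenRefine reconciliation protocol: you send a JSON object of queries and receive ranked candidate matches back. Here is a minimal sketch, assuming the service is still online and speaks that standard protocol.

```python
# Minimal sketch of querying the reconciliation service directly, assuming it
# is still online and follows the standard OpenRefine reconciliation protocol.
import json
import requests

SERVICE = "http://refine.taxonomics.org/gbifchecklists/reconcile"

queries = {"q0": {"query": "Felis tigris"}}
response = requests.get(SERVICE, params={"queries": json.dumps(queries)})
response.raise_for_status()

for candidate in response.json()["q0"]["result"]:
    # Each candidate is one interpretation of the name, ranked by how many
    # checklists share that interpretation.
    print(candidate["name"], candidate.get("score"))
```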

Give it a try, and let us know if it helps you reconcile names faster!

The Map of Life project is currently working on improving OpenRefine for taxonomic use in a project we call TaxRefine. If you have suggestions for features you’d like to see, please let us know! You can leave a comment on this blog post, or add an issue to our issue tracker on GitHub.

Monday, December 3, 2012

Reminders! Reminders! Lots of ways to get involved!

Three great ways to get involved with the VertNet community.

All deadlines are January 11, 2013, 11:59pm Pacific Time.

Biodiversity Informatics Training Workshop

During the five-day course, participants will work closely with trainers to address compelling biodiversity research questions and focus on the entire scope of a research project, including data acquisition, tools for data evaluation, analysis, and project dissemination and outreach. Participants are expected to bring high levels of motivation and a desire to learn the fundamentals of biodiversity informatics and become users of cutting-edge tools in GIS and modeling, although they are not necessarily expected to have prior experience with them. The workshop will…

VertNet Summer Internships

Two intern positions are available: one at the University of California, Berkeley, and one at the University of Colorado in Boulder.  Interns will work with the VertNet project team of experienced museum curators, researchers, and informaticists, as well as individuals from VertNet integration partners, to design and conduct a research project…

At-Large Steering Committee Positions

The term for each at-large member will be one (1) year with the opportunity for renewal at the completion of the term. At-large members should have significant experience in at least one of the following:

  • collection management
  • informatics development and big data
  • community and academic outreach
  • biodiversity-oriented research
  • business management and sustainability
  • grant writing and fundraising

VertNet welcomes applications from members of the private, non-profit, academic, governmental, and broad biodiversity communities. Applicants may be… 

Friday, February 10, 2012

VertNet, Creative Commons, and Data Licensing and Waivers

VertNet regularly faces questions about what data can be shared via its portals, who uses the data, how they use it, and how credit is given to the institutions sharing their data.  As we work to build a better network, we need to make decisions about how we, and all of our data publishers, will make data available to the public.

Thankfully, Peter Desmet and our friends at Canadensys have put together an excellent primer on the options available to data networks like ours.  They kindly posted the results of their thinking on their blog (27 January 2012), but we believe this is important enough to re-post here.

You can view the original post at the Canadensys Blog, along with all of the comments it received.  We’ve modified the original slightly to fit our formatting.

____________________________

With the first datasets getting published and more coming soon, the issue comes up under what license we – the Canadensys community and the individual collections – will publish our data. Dealing with the legal stuff can be tedious, which is why we have looked into this issue with the Canadensys Steering Committee & Science and Technology Advisory Board before opening the discussion to the whole community.

By data we mean specimen, observation or checklist datasets published as a Darwin Core Archive and any derivatives. To keep the discussion focused, this does not include pictures or software code.

2012.01.30 – Update to post: technically CC0 is not a license, but a waiver.

What we hope to achieve

  1. One license for the whole Canadensys community, which is easier for aggregation and sends a strong message as one community.
  2. An existing license, because we don’t want to write our own legal documents.
  3. An open license, allowing our data to be really used.
  4. A clear license, so users can focus on doing great research with the data, instead of figuring out the fine print.
  5. Giving credit where credit is due.

Our recommendation

We recommend that Canadensys participants publish their data under Creative Commons Zero (CC0). With CC0 you waive any copyright you might have over the data(set) and dedicate it to the public domain. Users can copy, use, modify and distribute the data without asking your permission. You cannot be held liable for any (mis)use of the data either.

CC0 is recommended for data and databases and is used by hundreds of organizations. It is especially recommended for scientific data and thus encouraged by Pensoft (see their guidelines for biodiversity data papers) and Nature (see this opinion piece). Although CC0 doesn’t legally require users of the data to cite the source, it does not take away the moral responsibility to give attribution, as is common in scientific research (more about that below).

Why would I waive my copyright?

For starters, there’s very little copyright to be had in our data, datasets and databases. Copyright only applies to creative content and 99% of our data are facts, which cannot be copyrighted. We do hold copyright over some text in remarks fields, the data format or database model we chose/created, and pictures. If we consider a Darwin Core Archive (which is how we are publishing our data) the creative content is even further reduced: the data format is a standard and we only provide a link to pictures, not the pictures themselves.

Figuring out where the facts stop and where the (copyrightable) creative content begins can already be difficult for the content owner, so imagine what a legal nightmare it can become for the user. On top of that different rules are used in different countries. Publishing our data under CC0 removes any ambiguity and red tape. We waive any copyright we might have had over the creative content and our data gets the legal status of public domain. It can no longer be copyrighted by anyone.

Can’t we use another license?

Let’s go over the options. Keep in mind that these licenses only apply to the creative aspect of the dataset, not the facts. But as pointed out above, figuring this out can be difficult or impossible for the user. So much so in fact, that the user may decide not to use the data at all, especially if they think they might not meet the conditions of the license.

All rights reserved

The user cannot use the data(set) without the permission of the owner.

Conclusion: Not good.

Open Data Commons Public Domain Dedication and License (PDDL)

There are no restrictions on how to use the data. This license is very similar to CC0.

Conclusion: Perfect, in fact this license was a precursor of CC0, but… it is less well known and maybe not as legally thorough as CC0. CC0 made a huge effort to cover legislation in almost all countries and the Creative Commons community is working hard to improve this even further. Therefore, if you have to choose, CC0 is probably better.

Creative Commons Attribution-NoDerivs (CC BY-ND)

The user cannot build upon the data(set), which is what most data use involves.

Conclusion: Not good, and sadly used by theplantlist.org. Roderic Page pointed this out by showing what cool things he can NOT do with the data.

Creative Commons Attribution-NonCommercial (CC BY-NC)

The user cannot use the data(set) for commercial purposes. This seems fine from an academic viewpoint, but the license is a lot more restrictive than intuitively thought. See: Hagedorn, G. et al. ZooKeys 150 (2011). Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information.

Conclusion: Not good.

Creative Commons Attribution-ShareAlike (CC BY-SA) or Open Data Commons Open Database License (ODbL)

The user has to share any work based upon the data(set) under a license that is identical or similar to the one used.

Conclusion: Good, but… this can lead to some problems for an aggregator like Canadensys or GBIF: if they are mixing and merging data with different SA licenses, which one do they choose? They might be incompatible.

Creative Commons Attribution (CC BY) or Open Data Commons Attribution License (ODC-By)

The user has to attribute the data(set) in the manner specified by the owner. This condition is also present in the three licenses above.

Conclusion: Good, but… this can lead to impractical “attribution stacking”. If an aggregator or a user of that aggregator is using and integrating different datasets provided under a BY license, they legally have to cite the owner for each and every one of those in the manner specified by these owners (again, for the potential creative content in the data). See point 5.3 at the bottom of this Creative Commons page for a better explanation and this blog post for an example.

But giving credit is a good thing!

Absolutely, but legally enforcing it can lead to the opposite effect: a user may decide not to use the data out of fear of not completely complying with the license (see the paragraph above). As hinted at the beginning of this post, CC0 removes the drastic legally enforceable requirement to give attribution, but it does not remove the moral obligation to give attribution. In fact, this has been the common practice in scientific research for many decades: legally, you don’t have to cite the research/data you’re using, but not doing so could be considered plagiarism, which would compromise your reputation and the credibility of your work.

To encourage users to give credit where credit is due, we propose to create Canadensys norms. Norms are not a legal document (see an example here), but a “code of conduct” in which we declare how we would like users to use, share and cite our data, and how they can participate. We can explain how one could cite an individual specimen, a collection, a dataset or an aggregated “Canadensys” download. We can point out that our data are constantly being corrected or added to, so it is useful to keep coming back to the original repository and not to a secondary repository that may not have been updated. In addition, we can build tools to monitor downloads or automatically create an adequate citation. And with the arrival of data papers – drafts of which can now be generated automatically from the IPT – data(sets) are really brought into the realm of traditional publishing and the associated scientific recognition.

Conclusion

All this to say that there are mechanisms where both users and data owners can benefit, without the legal burden. CC0 + norms guarantees that our data can be used now and in the future. I for one will update the license for our Université de Montréal Biodiversity Centre datasets. We hope you will join us!

Thanks to Gregor Hagedorn for his valuable advice on all the intricacies of data licensing.

Thursday, January 12, 2012

Farewell, NBII

On January 15, 2012, the National Biological Information Infrastructure will be taken offline permanently.

All of us at VertNet view this event with mixed emotions. On the one hand, we understand the pressures of shrinking budgets and the difficulty in making decisions to prioritize the public services provided by a government entity, such as the USGS.  On the other hand, we are sad to see this program, one with which we have been closely linked, go.

It is not an exaggeration to say that VertNet would likely not exist without the support of NBII. NBII staff participated in the creation of VertNet back in 2008 and have provided critical financial support to hire and sustain two full-time positions, the VertNet Coordinator and VertNet Programmer, ever since. Although NBII’s services will be discontinued or distributed to other departments of the USGS, VertNet’s activities will continue with funding from the National Science Foundation.

We wish to thank the individuals at NBII and the USGS for this support over the last four years and we look forward to working with others at the USGS into the future.

Wednesday, December 21, 2011

VertNet Project Quarterly Update #2

The final quarter of 2011 was very productive for VertNet. We announced the call for applications for our first Biodiversity Informatics workshop, made some major decisions about the development of the VertNet platform, and, somehow, managed to produce four months’ worth of posts on the VertNet Blog.

Before we get into the details of our recent progress, we’d just like to remind you that you can follow our work or ask questions of us in a number of ways.

Now, on to the details…

Biodiversity Informatics Training Workshop

We’re closing in on the January 10th application deadline for VertNet’s first Biodiversity Informatics Training Workshop. The workshop, hosted by the University of Colorado, Boulder, will feature five days during which participants will work closely with trainers to address compelling biodiversity research questions, focusing on the entire scope of a research project, from initial data acquisition to tools for data evaluation to analysis and, finally, dissemination and outreach. The workshop will include large and small group exercises on a common curriculum as well as the opportunity for participants to discuss and explore individual research questions with trainers.

By the end of the week, participants should leave the workshop with:

  • an understanding of the workflows needed to acquire, analyze, and report results generated from biodiversity resources found in data repositories such as VertNet;
  • a set of basic skills for using data repositories and informatics and analytic tools, and an understanding of which tools are appropriate for which tasks; and
  • knowledge of the abundant resources and additional training available to them.

You can learn more about the workshop and who should apply on VertNet.org.

VertNet Development

A cornerstone of the VertNet mission is to deliver a cloud-based platform upon which data publishers (i.e., institutions providing data) can store growing quantities of biodiversity data that can be accessed and enhanced via the Web. An important measure of our success will be how effectively this new platform can overcome the core technical challenges of scalability, data discovery, sustainability, and integration with other platforms. We will continue to work with our integration partners on evaluating platforms and, if all goes according to plan, we’ll open up testing to existing data publishers and others interested in our work.

In this quarter, the development team made some critical decisions about the design of our technical solutions. A key design principle in our decision-making process requires us to seek a balance between the most cost effective (i.e., sustainable) and flexible (i.e., scalable) software and the potential for development of robust tools and innovations that maximize data discovery. We plan to optimize sustainability and scalability by lowering maintenance costs and by simplifying the VertNet architecture so that data publishers can deploy VertNet anywhere, using any system.

In July of 2011, we built a prototype platform using the Google App Engine cloud, Python, and SQLite. The prototype, available on GitHub, was composed of a data bulkloading script used for uploading records, and an application programming interface (API) for searching them. This prototype was tested by our VertNet integration partners (e.g., AmphibiaWeb, Arctos, GeoLocate), who provided excellent feedback about usability, text matching queries, and support for Darwin Core Archives.

Coincidentally, while our prototype was being tested, Google App Engine announced a new service level agreement and pricing model. Because this change in pricing would likely increase our estimated annual operating cost, we proceeded to identify different cloud-based alternatives to our original plan.

After further research and several productive conversations with our integration partners, we decided to explore the cloud-based CouchDB and HTML5. CouchDB is intriguing because it is both a web server and a database server with strong replication support. It would allow us to deploy VertNet data and HTML5 applications in a highly flexible, scalable, and sustainable way. Plus, CouchDB would give our integration partners a way to extend VertNet with custom functionality specific to their needs.
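To make the appeal concrete, here is a minimal sketch of the kind of interaction CouchDB’s HTTP API allows. This is illustrative only, not the VertNet prototype code; the hostnames are placeholders. It creates a database, stores a Darwin Core-style occurrence record as a plain JSON document, and asks CouchDB to replicate the database to a second server.

```python
# Illustrative sketch of CouchDB's HTTP API (not VertNet prototype code).
# Hostnames are placeholders.
import requests

COUCH = "http://localhost:5984"            # a local CouchDB instance
MIRROR = "http://mirror.example.org:5984"  # placeholder replication target

# Create the database (CouchDB returns 412 if it already exists).
requests.put(f"{COUCH}/occurrences")

# Store one occurrence record as a plain JSON document.
record = {
    "institutionCode": "MVZ",
    "scientificName": "Sus scrofa",
    "decimalLatitude": 37.87,
    "decimalLongitude": -122.25,
}
requests.post(f"{COUCH}/occurrences", json=record)

# Ask CouchDB to replicate the database to another server.
requests.post(
    f"{COUCH}/_replicate",
    json={"source": f"{COUCH}/occurrences", "target": f"{MIRROR}/occurrences"},
)
```

Because the documents are just JSON served over HTTP, the same records can be delivered straight to an HTML5 application from the database, which is a large part of what makes this architecture attractive.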

In September 2011, we built a new prototype using CouchDB.  This prototype was composed of a browser-based bulkloader and a search API hosted in the cloud. The prototype and technical architecture were presented at the 2011 Biodiversity Information Standards (TDWG) conference in New Orleans.  So far, we are very pleased with the results of our testing with CouchDB and are encouraged by the potential to address our technical challenges.

Thanks for keeping your eyes on us.  More exciting things are coming in 2012.  If you’ve got ideas, suggestions, or questions, feel free to send us a note or comment.

Posted by Dave Bloom, VertNet Coordinator, and Aaron Steele, VertNet Information Architect, on behalf of the VertNet Team.

Wednesday, December 14, 2011

AIM-UP!: Advancing the use of museum collections

This guest post was written by Dr. Joseph Cook (Univ. of New Mexico) and Dr. Eileen Lacy (UC Berkeley) on behalf of the Advancing Integration of Museums into Undergraduate Programs (AIM-UP!) Research Coordination Network.
[Image: AIM-UP! logo]

Those familiar with VertNet are well aware of the importance of museums and museum data for research. Perhaps less immediately apparent is the vital role that museum collections can play in undergraduate education. Even a quick glance at a few specimens is typically enough to generate numerous student questions regarding the nature of museum collections and the reasons for the vast organismal diversity captured by museum specimens. Now, with specimen data increasingly available online, the power of natural history collections to excite and to inform students extends to institutions that lack physical specimens.

To further the use of natural history collections in undergraduate education, curators from the Museum of Southwestern Biology (University of New Mexico), the Museum of Vertebrate Zoology (UC Berkeley), the Museum of the North (University of Alaska), and the Museum of Comparative Zoology (Harvard University) have teamed up to create AIM-UP!, an NSF-sponsored network of museum scientists, collection specialists, undergraduate instructors, and artists dedicated to using museum data to promote undergraduate understanding of science.

In particular, AIM-UP! encourages undergraduate educators and students to explore the treasure trove of information available through natural history collections and their associated databases and data linkages. To facilitate this goal, the AIM-UP! network is working to develop new ways of incorporating the extensive archives and cyberinfrastructure of natural history museums into undergraduate education. These efforts focus on the following five themes:

  1. Integrative Inventories and Coevolving Communities: Exploring Complex Biotic Associations Across Space and Time
  2. Decoding Diversity: Making Sense of Geographic Variation
  3. Generating Genotypes: Evolutionary Dynamics of Genomes
  4. Fast Forward: Biotic Response to Climate Change
  5. Coevolving Communities and the Human Dimension

AIM-UP! rationale: Many natural history museums associated with academic institutions engage students in learning through specimen-based field projects and training opportunities related to the curatorial process. These experiences are often transformative, as witnessed by the large number of influential environmental and evolutionary biologists who cite their early exposure to natural history collections as pivotal to their career path. Such experiences, however, are necessarily limited to students at institutions with collections and, even then, the percentage of students who take advantage of such opportunities is often small.  How do we extend these formative experiences to reach a broader swath of the next generation of scientists?

By digitizing specimens, it has now become possible for anyone with access to the Internet to explore the vast reservoirs of information held in collections. What can students and instructors do with all the newly available data, images, recordings and other associated information? Can we encourage educators to use these increasingly comprehensive natural history databases to engage students in inquiry-based projects and activities? Will educational use of these databases stimulate greater public interest in our natural surroundings and in the dwindling wild places on earth? In short, how do we begin to incorporate the vast online digital databases now available into critically needed educational initiatives?

AIM-UP! is addressing these questions through the development of educational modules that build upon natural history collections and associated databases to make such information accessible to instructors in multiple biological disciplines, including those (e.g., developmental biology, behavior, physiology, and cellular biology) that may not typically use museum collections. The modules provide inquiry-based learning experiences for undergraduates (including students in AP High School Biology courses) that are built upon the informatics tools and natural history specimen databases now readily accessible online (e.g., VertNet, GenBank, BerkeleyMapper, MorphBank).

A few examples of educational modules already developed (or currently in progress) include:

  • Getting Started With On-Line Specimen Databases
  • Climate Change—Sierras, Great Lakes
  • Geographic Variation in Bird Song Dialects
  • Virtual Herbaria
  • GenBank & Museum Specimens: phylogeny and phylogeography

AIM-UP! goals: By integrating our expertise and experiences with university-based museums, we seek to greatly advance traditional and emerging fields that could use museum collections. The inclusion of participants from federal agencies, large free-standing museums, and leading educators from Latin America is ensuring wider dissemination of our educational products.

Upcoming activities: In Spring 2012, AIM-UP! will present a semester-long seminar exploring Geographic Variation, which will include a series of two-day workshops and a cross-listed course with the Art and Ecology Program and Biology Department at the University of New Mexico.  The course will be broadcast to the Museum of the North at the University of Alaska; the Museum of Vertebrate Zoology, University of California, Berkeley; and the Museum of Comparative Zoology, Harvard University.

To learn more about AIM-UP! or the upcoming seminar, visit http://www.aim-up.org/.