Monday, September 22, 2014

Introducing the VertNet Norms for Data Use and Publication

VertNet has just released its Norms for Data Use and Publication. Soon, the VertNet norms will be published with every dataset hosted by VertNet. All records downloaded from those hosted datasets via the data portal or VertNet APIs will be accompanied by a link to the new norms.

We hope that everyone will read these norms and do the right thing to help us to build a vibrant data-sharing community that recognizes the long-standing efforts of data publishers and respects the needs of data users to discover and utilize high-quality biodiversity data.

We wish to thank Canadensys for permission to adapt their norms for VertNet.

What’s in the Norms?

The VertNet norms are a code of conduct that we expect anyone who participates in the VertNet network to uphold.  Included in the norms are recommended behaviors for both data publishers and data users that we believe will help the community to give credit where credit is due, use data responsibly, and share their knowledge.  We’ve also included a set of recommended formats for creating data record and dataset citations.

Can I adopt or adapt these norms for my own project?

Yes!  We encourage you to do so.  VertNet has published the norms to its own web page, but, like Canadensys, we have also published a version of the norms to a VertNet GitHub repository under the CC-BY designation.  Please feel free to use either version of the norms to generate your own.  All we ask is that you adhere to the norms and give us, or Canadensys, proper attribution.

And here they are:

 

NORMS FOR DATA USE AND PUBLICATION

This document describes the VertNet norms for data publication and use. This is NOT a legal document or contract. This IS a well-considered code of conduct that anyone who publishes data to or uses data downloaded from VertNet is expected to uphold. When you adopt these norms, you will both model a much-needed set of ethical behaviors and help us to build a vibrant community intended to support efforts to make biodiversity data as complete, discoverable, and accessible as possible.

VertNet wishes to acknowledge Canadensys for their efforts to develop the norms upon which this document is based.

WHAT WE BELIEVE

  • Biodiversity data should be as complete, discoverable, and accessible as possible.

  • Biodiversity data should be standardized so they can be aggregated, shared, and used as easily as possible.

  • Factual data cannot be protected by copyright and should be committed to the public domain (see What Can Be Protected by Copyright and Licensing in the VertNet Guide to Copyright and Licenses for Dataset Publication).

  • Compilations or datasets that warrant protection (when applicable) should be licensed using standardized, machine-readable licenses, such as those from Creative Commons. The ideal is for all datasets to be published under the Creative Commons Zero (CC0) waiver to affirm clearly that the data are in the public domain.

  • Data publishers have the right to protect any creative content, including, but not limited to, images, sound files, videos, and descriptive/speculative text, where applicable.

  • The licensing and waiver process should be as simple as possible for data publishers and for users (results may vary from collection to collection).

  • All data publishers should get the credit they deserve when the data they curate are used by others.

  • It is worth the time and effort to achieve these goals to the greatest extent possible.

THE NORMS

GIVE CREDIT WHERE CREDIT IS DUE

As is common practice in scientific research, cite the resources you are using. VertNet data publishers have invested considerable time, resources, and effort into collecting, digitizing, maintaining, and publishing the biodiversity information you are using. They deserve credit for their work. We have provided recommended formats for citation at the end of this document.

BE RESPONSIBLE

Use the data responsibly. Data are published to VertNet and other biodiversity portals to allow you and everyone else to better study and understand the world in which we live. It is your responsibility to use these data for the benefit of our collective health, knowledge, and self-improvement. Avoid using these data in any way that is unlawful, harmful, or misleading. Please understand that these data are subject to change, error, and bias. When giving credit where credit is due, protect the reputation of the data publisher and indicate clearly any changes you may have made to the data.

SHARE KNOWLEDGE

Let VertNet data publishers and the broader community know if — and how — you have used data from the network. Sharing helps:

  • VertNet to understand the value of this project to you and the community and to create better tools and services for your use.

  • Data publishers to showcase their efforts, encouraging them to continue to improve data quality and the maintenance of their collections.

  • You and your work reach a wider audience through an expanded presence. (Yes, we publicize and celebrate the work that people do with data published through VertNet.)

  • Everyone - data publishers, bioinformatics projects, and researchers - to raise the money needed to keep the data, products, services, and research alive.

You can contact us or share your work with any member of the VertNet team directly.

Communicate with the data publisher(s) directly. Let them know if you have comments or questions, notice errors, or want more information about the data they publish. We’ve provided you with three ways to start a conversation:

  • Post an issue directly from the VertNet data portal. It’s simple and we’ve posted instructions on our blog.

  • Contact the data publisher directly using the contact information provided within each dataset published through VertNet via the Publishers page on the VertNet portal.

  • Use the feedback form provided at the top of every VertNet web page.

RESPECT THE DATA LICENSE OR WAIVER

Understand and respect the data license or waiver under which the data are published. Whenever possible, VertNet places selected licenses and waivers in the rights field of every record and in the dataset metadata. In some cases, data publishers have published using non-standard terms of use. These terms could be located in many possible locations in the dataset, so please review the data fully before you use it.

To help data publishers make the best decision about how to license or waive rights to their datasets, and to help data users understand the waivers and licenses, we have created the VertNet Guide to Copyright and Licenses for Dataset Publication. Most of the data publishers who have selected a recommended Creative Commons license or waiver dedicated their data to the public domain using the Creative Commons Zero waiver (CC0).

Do not remove the public domain mark or provide misleading information about the copyright status.

DATA PUBLICATION CONDITIONS

We invite any institution with biological or natural history collections (not just vertebrate collections) to join our growing community of data publishers. To publish data to the VertNet network you will need:

  • Digitized biological or natural history data.

  • The desire to share your data with scientists, researchers, educators, and others around the world.

  • A willingness to maintain and improve your data over time.

If you have all three, or if you have questions, please contact VertNet’s support team and we’ll get you and your data started on the path to publication and discovery.

PREFERRED CITATIONS

In the absence of a citation practice that takes precedence, we recommend the following preferred formats to use when citing data published through VertNet. Square brackets denote values that must be obtained either from records within a dataset or from the description of the dataset. A glossary of terms used in square brackets is given at the end of the Preferred Citations section.

SINGLE DATASET

GENERAL FORMAT

[dataset name]. [data publisher]. [link to dataset] (accessed on [date])

EXAMPLE

Cowan Tetrapod Collection at the University of British Columbia Beaty Biodiversity Museum (UBCBBM). University of British Columbia. http://ipt.vertnet.org:8080/ipt/resource.do?r=ubc_bbm_ctc_birds (accessed on 2014-07-28)

AGGREGATED DATA (FROM MULTIPLE DATASETS)

Cite each data publisher in the aggregate using the single dataset citation format described above.

EXAMPLE

Cowan Tetrapod Collection at the University of British Columbia Beaty Biodiversity Museum (UBCBBM). University of British Columbia. http://ipt.vertnet.org:8080/ipt/resource.do?r=ubc_bbm_ctc_birds (accessed on 2014-07-28)
Field Museum of Natural History (Zoology) Bird Collection. Field Museum. http://fmipt.fieldmuseum.org:8080/ipt/resource.do?r=fm_birds (accessed on 2014-07-28)
University of Kansas Bird Collection. University of Kansas Biodiversity Institute. http://ipt.nhm.ku.edu/ipt/resource.do?r=kubi_ornithology (accessed on 2014-07-28)

SINGLE SPECIMEN/OBSERVATION RECORD

VertNet includes the text of a record citation in the bibliographicCitation field in the record itself for all data publishers who provide this information. If the record has a value in the bibliographicCitation field, construct the full citation by appending information about the date the data were accessed. If the record does not contain a value for bibliographicCitation, use the appropriate format described below.

IF THE BIBLIOGRAPHICCITATION IS PROVIDED IN THE RECORD:

[bibliographicCitation] (accessed on [date])

EXAMPLE

urn:catalog:CM:Herps:105730. Carnegie Museum of Natural History Herpetology Collection. Carnegie Museums. http://ipt.vertnet.org:8080/ipt/resource.do?r=cm_herps (accessed on 2014-07-28)

IF THERE IS NO BIBLIOGRAPHICCITATION, BUT OCCURRENCEID IS PROVIDED IN THE RECORD:

[occurrenceID]. [dataset name]. [data publisher]. [link to dataset] (accessed on [date])

EXAMPLE

urn:catalog:CM:Herps:105730. Carnegie Museum of Natural History Herpetology Collection. Carnegie Museums. http://ipt.vertnet.org:8080/ipt/resource.do?r=cm_herps (accessed on 2014-07-28)

IF THERE IS NO BIBLIOGRAPHICCITATION OR OCCURRENCEID PROVIDED IN THE RECORD:

[catalogNumber]. [dataset name]. [data publisher]. [link to dataset] (accessed on [date])

EXAMPLE

105730. Field Museum of Natural History (Zoology) Bird Collection. Field Museum. http://fmipt.fieldmuseum.org:8080/ipt/resource.do?r=fm_birds (accessed on 2014-07-28)

WHERE CAN I FIND THE ELEMENTS TO CREATE A CITATION?

All the elements can be found in fields contained within downloaded records or in the description of the dataset (i.e., the metadata) from which the record originates:

  • [bibliographicCitation]: in the field bibliographicCitation in the record.

  • [occurrenceID]: in the field occurrenceID in the record.

  • [catalogNumber]: in the field catalogNumber in the record.

  • [dataset name]: This is listed as the Resource Citation (under Citations) in the dataset metadata (published as an EML file in a Darwin Core archive) and can also be found under Citation on the Rights tab in the record detail in the VertNet portal. If the Resource Citation is missing, use the Title of the resource in the dataset metadata, also found under Resource on the Rights tab in the record detail in the VertNet portal.

  • [data publisher]: This is listed as the Organisation in the dataset metadata and under Organisation on the Rights tab in the record detail in the VertNet portal. It is not in downloaded records.

  • [link to dataset]: in the field dataSource in the record. If dataSource is missing, use the Source URL on the Rights tab in the record detail in the VertNet portal.
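To make these rules concrete, here is a minimal Python sketch that assembles a record citation from a downloaded record, following the precedence described above (bibliographicCitation, then occurrenceID, then catalogNumber). The record dictionary and helper function are illustrative, not part of any VertNet tool:

def record_citation(record, dataset_name, publisher, dataset_link, access_date):
    """Build a citation for a single specimen/observation record (dict of fields)."""
    accessed = "(accessed on %s)" % access_date
    if record.get("bibliographicCitation"):
        # The publisher supplied a ready-made citation; just append the access date.
        return "%s %s" % (record["bibliographicCitation"], accessed)
    # Otherwise fall back to occurrenceID, then catalogNumber.
    identifier = record.get("occurrenceID") or record.get("catalogNumber", "")
    return "%s. %s. %s. %s %s" % (
        identifier, dataset_name, publisher, dataset_link, accessed)

# Example with a record that lacks a bibliographicCitation:
record = {"occurrenceID": "urn:catalog:CM:Herps:105730", "catalogNumber": "105730"}
print(record_citation(
    record,
    "Carnegie Museum of Natural History Herpetology Collection",
    "Carnegie Museums",
    "http://ipt.vertnet.org:8080/ipt/resource.do?r=cm_herps",
    "2014-07-28"))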

** These norms may be subject to minor revisions without notice; major revisions will be announced on the VertNet web site, data portal, blog, and social media (Twitter, Facebook, and Google Plus).


If you have any questions about this document, please contact VertNet’s support team.

Visit our Help page for more resources created for the VertNet project.

If you’d like to adapt these norms to your own project or work, please fork our norms repository on GitHub.

Monday, May 5, 2014

Data Usage Reports Now Available for VertNet Data Publishers

VertNet is now publishing monthly data use reports for every dataset you’ve published to our data portal.

How Reporting Works

The reports contain two main categories: statistics about searches and statistics about downloads. Each report presents the monthly statistics for each category as well as cumulative statistics for both the calendar year and for the period since VertNet started tracking each metric.

There are 6 distinct metrics in the search category, including the # of searches that retrieved data, the # of records retrieved, and a list of query terms that retrieved data from your resources.  The download category contains 7 distinct metrics, including the # of download events and the total # of records downloaded. To see a complete list of the metrics we’re tracking, with explanations, take a look at our Guide to Usage Reporting posted on the VertNet web site.

The search category statistics include records that were viewed on-screen by a user via the VertNet portal.  Thus, if a user queries Sus scrofa and looks only at the first page of 100 records, ONLY those 100 records will be tallied in the respective institutional usage reports.  If a user loads more results, those additional records will be tallied.  In the download category, ALL of the records included in the downloaded dataset, regardless of whether or not they were seen on screen, will be tallied.  In essence, the search stats tell you what users actually see, while the download stats tell you what they take with them.

About this First Set of Reports

The statistics in the first report set will differ a bit from future monthly reports. The search category will include statistics beginning at the start of March 2014, but the download category will contain cumulative numbers beginning in August 2013, because we have been tracking downloads since then. Only recently, thanks to great feedback from our testers (all data publishers) and from posts to us on various listservs, did we begin to track search activity. The result is that you’ll be able to track the same statistics that were provided for the classic networks (MaNIS, et al.) as well as some new ones.

How to See the Reports

If you’ve published one or more datasets to VertNet, you can view the usage report for each dataset in its corresponding GitHub repository.  As of May 1, the first data usage reports for each dataset published to the VertNet portal are available.

We’re using GitHub (woo!) to get these reports to providers - the very same system we’ve set up to help portal users and publishers submit and track issues about occurrence records.

For example, the Royal Ontario Museum’s Fishes repository on GitHub is the location of all issues reported about records contained within it (none right now - we assume because the data are perfect).  There is now a little blue folder just above the README file titled “reports.”  Opening that folder will present all of the reports created to date (see image below).

image

All dataset repositories are public, so anybody with a GitHub account can subscribe to and follow the activity within. If you don’t want to get a username, you can always check the page at your leisure to see if anything new has occurred.

If you’re a data publisher and you haven’t talked to us about access to your institution’s GitHub organization, all you need to do is go to https://github.com/ and sign up (for free) to get yourself a username.  Once you have a username, send it to us (well, to Dave Bloom, VertNet Project Coordinator, actually) and we’ll get you set up to:

  • Receive notifications that issues have been submitted about your dataset and that monthly reports have been posted,

  • Track and address issues posted by users directly from the VertNet portal, and,

  • Post responses, edit issues, and download documents (such as data reports).

If you’ve forgotten or haven’t heard about VertNet’s Issue Tracking service, you can read more about it on our blog.

Kudos to VertNet’s Javier Otegui, at the University of Colorado, Boulder, for the hard work needed to make these reports possible.

If you have feedback or would like to request statistics that are not covered in our first version of the reports, do not hesitate to tell us.

Monday, April 14, 2014

Sustainability and VertNet: The Quest Continues

More than three years ago, the VertNet team proposed to NSF that we would address the challenge of sustainability in two ways.  VertNet would:

(1) strive to reduce the costs necessary to maintain hardware, software, and IT services to keep biodiversity data publicly accessible.

(2) engage the biodiversity and data user community in conversation to explore existing and future strategies for the long-term sustainability of digitization and data-sharing efforts.

So far, we’ve taken great strides to accomplish these two goals:

We’ve developed and implemented a data publishing procedure (using the IPT) and a portal architecture that reduce the costs needed to maintain the network.  We believe our cloud-based infrastructure is a huge step toward sustainability, although it still takes our team time and money to publish datasets and maintain the portal.

We’ve also been engaged in a serious effort to learn as much as we can about sustainability, funding models, and how VertNet delivers its services to the community.  This includes participation in an intensive sustainability training course, Sustaining Digital Collections, developed by Ithaka S+R and supported by the Mellon Foundation (we’re right in the middle of it, so more on that when we’re done).  In addition, we’ve been accepted to participate in the Ecological Society of America’s Sustaining Biological Infrastructure Workshop in June 2014.

But now we need your help.

We need to know what you think of us.  We want, in fact we need, to know what you think about our services and products.  Without your feedback, we can’t achieve our goal of delivering a plan to sustain VertNet and its services into the future.  This includes everything we do: Dataset Publication, Data Quality Improvement, Statistics and Reporting, Customer Service, Training and Capacity Building, and the Data Portal.

So, if you value what we do, as well as how VertNet and other data portals publish biodiversity data, please complete our Sustainability Survey.  It’ll only take about 5 minutes of your time, but it’s 5 minutes that will help ensure the continued availability of the VertNet network and all of the data it publishes and services it provides.

We thank you for your support.  We’ll let you know what the community has to say.

Thursday, January 23, 2014

Issue Tracking Available in the VertNet Portal, At Last!

VertNet provides some significant data quality services to its data publishers, but we couldn’t possibly catch every taxonomic and geographic name change, georeferencing error, or mistyped collector name, preparation, or event date.  That’s where you come in.

Please allow us to introduce you to issue tracking, VertNet-style.  The goal of this new system is to (1) give users the opportunity to communicate suggested corrections, updates, and questions to data publishers directly from the VertNet portal, and (2) assist VertNet’s efforts to publish the highest quality data possible.

The VertNet issue tracking system isn’t rocket science.  When a user identifies a record in the VertNet portal that he or she believes is problematic and in need of review, there are three ways to communicate questions, concerns, and corrections to the data publisher.  They can (1) use the portal Feedback form, linked on each portal page, (2) contact a data publisher directly, OR! (3) post issues and questions directly to the data publisher from the VertNet portal using our new GitHub integration.

Options #1 and #2 are good communication options, but they are just simple one-way (usually) communications that end up hitting an inbox, and may or may not earn a response (although we always respond to comments posted via the Feedback form).

Option #3 is where the good stuff lives.  With VertNet’s issue tracking system, all communications are submitted to a public GitHub collaborative repository so that the user and data publisher (and any other interested party) can track, discuss, and resolve data quality issues.

“Why GitHub?”

Well, for many reasons, including the fact that we already use this web-based code hosting service for the VertNet project.  If you’re interested, check out our fundamental code to interoperate with GitHub.
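For the curious, filing an issue programmatically is a single HTTP call to GitHub’s REST API. Here is a minimal, illustrative Python sketch (the repository name, token, and issue text are placeholders; this is not VertNet’s actual integration code):

import requests  # third-party HTTP library (pip install requests)

# Placeholders: a hypothetical dataset repository and a personal access token.
REPO = "example-org/example-dataset"
TOKEN = "YOUR_GITHUB_TOKEN"

# POST /repos/{owner}/{repo}/issues creates a new issue.
resp = requests.post(
    "https://api.github.com/repos/%s/issues" % REPO,
    headers={"Authorization": "token %s" % TOKEN},
    json={
        "title": "Possible georeferencing error in record urn:example:12345",
        "body": "The coordinates place this specimen offshore, but the "
                "locality string describes a terrestrial site.",
    })
print(resp.status_code, resp.json().get("html_url"))  # 201 and the issue URL on success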

“But this isn’t code we’re posting.”

That’s true, but GitHub is used for all kinds of things, from wedding planning to home remodeling.  We’ve chosen it because it’s got everything we need for a robust issue tracking service, such as:

  • public, project-based information repositories

  • automatic notification services (great for people who want to do everything via email and don’t want to log in to yet another web page)

  • interactive discussion

  • public URLs for easy sharing

  • long-term archiving

  • subscription services (for folks who want to be notified when things change), and

  • easy administration (plus, we’ll help you whenever you need it)!

It really is simple!  The only thing you need to participate is a GitHub username.  Once you’ve got one, you don’t need to visit the GitHub pages again if you don’t want to - and that includes data publishers and portal users.  There is a login link built into the portal interface, so you can log in directly from the portal and start submitting issues (see images below).

There is a lot going on behind the scenes with the VertNet issue tracking system, so if you want to learn how it works, feel free to read the GitHub Reference and Set Up Guide we’ve written to help data publishers get everything set up.  Best of all, the VertNet development team is here to help.  We’ll do 90% of the work, you get 100% of the benefit.

The next time you visit the VertNet portal, keep your eyes open for the green issue flag (see images below).  That’s where you can jump in and help us publish the highest quality data possible.

Here are six different looks at the VertNet issue tracking system using GitHub.

image

VertNet search page with issue tracking (GitHub) login highlighted.

image

VertNet occurrence record summary page with the green issue flag button to submit an issue.

image

VertNet issue submission form.

image

Example of an issue page once submitted to GitHub via the VertNet portal.

image

Example of single repository for a single data set (MVZ-Herps) with a list of all issues submitted about records in that data set.

image

Example of an organization page (for MVZ).  All repositories owned by the organization are listed on the right.  Recent activities within the organization are listed on the left.

Friday, August 16, 2013

New VertNet Data Portal Released

We at VertNet are pleased to announce the public release of the new beta VertNet data portal (portal.vertnet.org).

We’ve been working over the last few months to test and retest our portal concept, and after a lot of great feedback from our community of testers, we’re finally ready to put this out to anyone and everyone who wants to give the new VertNet portal a try.  So please, visit the portal!

The new VertNet portal is faster, more scalable, and more efficient than any of the past vertebrate networks.  We hope searching is a much improved experience (please check out the Advanced search!). Visualizations have been improved, access to metadata associated with individual records is better, and our new spatial query tool should make many of you happy.  We’ve still got a lot of work to do before we’re finished, so over the next few months we’ll update the portal, its features, and the number of data sets available.  Check back regularly to see what’s new.

What you need to know about this version:

  • The portal contains 8,196,215 records from 44 data publishers.  Over the next several months, we’ll be adding more and more data and data publishers as we work toward our goal of an estimated 150M records in 2014.

  • There are three ways to search for data: full text-string, spatial, and advanced search.

  • Visualizations use CartoDB’s mapping interface.

  • Using Advanced search, you can search records by tissue and media.  We’ve even started to move paleo data sets into VertNet.

  • The portal works in Chrome, Firefox, Safari, and Explorer.

Don’t forget - WE WANT YOUR FEEDBACK!  To this end, we have provided you with a “Feedback” tab at the top of every page.  Please, let us know what you think, what you want, and definitely let us know if something doesn’t work.  We’re still in beta, so you might experience some bugs.  Tell us when you find them so you won’t find them again.  We want to make the VertNet data portal a tool that meets your needs and makes your work easier.

Get in there and search!

The VertNet Team


Monday, July 22, 2013

Validating scientific names with the GBIF Checklist Bank

This guest post was written by Gaurav Vaidya, a graduate student in the Department of Ecology and Evolutionary Biology, University of Colorado, Boulder.  It is cross-posted with the GBIF Developers Blog.

image

A whale variously named Physeter macrocephalus, Physeter catodon, and Physeter macrocephalus again (photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons)

Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists’ understanding of species boundaries changes, the names attached to them can be synonymized, moved between genera, or even have their Latin grammar corrected (it’s Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies, or species complex, or even which of several published names corresponds to our modern understanding of that species, such as the dispute over whether the sperm whale is really Physeter catodon Linnaeus, 1758, or Physeter macrocephalus Linnaeus, 1758.

A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.

Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming GBIF Portal, currently in development and testing. This collection includes large, global checklists, such as the Catalogue of Life and the International Plant Names Index, alongside smaller, more focused checklists, such as a checklist of 383 species of seed plants found in the Singhalila National Park in India and the 87 species of moss bug recorded in the Coleorrhyncha Species File. Many of these checklists can be downloaded as Darwin Core Archive files, an important format for working with and exchanging biodiversity data.
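As an aside, a Darwin Core Archive is simply a zip file containing a meta.xml descriptor plus one or more delimited text files, so it is easy to peek inside with a few lines of Python. A minimal sketch (the archive file name and core file name are assumptions; a real script should read meta.xml to find the actual core file and its column mappings):

import csv
import zipfile

# Hypothetical archive name; the core file ("taxon.txt" here) and its
# delimiter are declared in meta.xml and vary between archives.
with zipfile.ZipFile("checklist-dwca.zip") as archive:
    print(archive.namelist())  # e.g. ['meta.xml', 'taxon.txt']
    with archive.open("taxon.txt") as core:
        rows = csv.DictReader(
            (line.decode("utf-8") for line in core), delimiter="\t")
        for i, row in enumerate(rows):
            print(row.get("scientificName"))
            if i >= 4:  # show only the first five names
                break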

So how can we match names against these databases? OpenRefine (the recently renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. Javier Otegui has written a tutorial on cleaning biodiversity data in OpenRefine, and last year Rod Page provided tools and a step-by-step guide to reconciling scientific names, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.

We extended Rod’s work by building a reconciliation service against the forthcoming GBIF web services API. We wanted to see if we could use one of the GBIF Portal’s biggest strengths — the large number of checklists it has indexed — to identify names recognized in similar ways by different checklists. Searching through multiple checklists containing possible synonyms and accepted names increases the odds of finding an obscure or recently created name; and if the same name is recognized by a number of checklists, this may signify a well-known synonymy — for example, two of the Portal checklists recognize that the species Linnaeus named Felis tigris is the same one that is known as Panthera tigris today.

image

Linnaeus’ original description of Felis Tigris. From an 1894 republication of Linnaeus’ Systema Naturae, 10th edition, digitized by the Biodiversity Heritage Library.

To do this, we wrote a new OpenRefine reconciliation service that searches for a queried name in all the checklists on the GBIF Portal. It then clusters names using four criteria (a toy sketch of this clustering follows the list) and counts how often a particular name has the same:

  • scientific name (for example, “Felis tigris”),
  • authority (“Linnaeus, 1758”),
  • accepted name (“Panthera tigris”), and
  • kingdom (“Animalia”).
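The clustering step amounts to grouping the per-checklist matches by that four-part key and counting how many checklists share each interpretation. A toy Python sketch, with hypothetical match data and field names (not the service’s actual code):

from collections import Counter

# Hypothetical matches for one queried name, as returned by several checklists.
matches = [
    {"name": "Felis tigris", "authority": "Linnaeus, 1758",
     "accepted": "Panthera tigris", "kingdom": "Animalia"},
    {"name": "Felis tigris", "authority": "Linnaeus, 1758",
     "accepted": "Panthera tigris", "kingdom": "Animalia"},
    {"name": "Felis tigris", "authority": None,
     "accepted": "Felis tigris", "kingdom": "Metazoa"},
]

# Group by the four criteria and count checklists per interpretation.
clusters = Counter(
    (m["name"], m["authority"], m["accepted"], m["kingdom"]) for m in matches)

# Interpretations backed by the most checklists sort to the top.
for interpretation, votes in clusters.most_common():
    print(votes, interpretation)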

Once you do a reconciliation through our new service, your results will look like this: 

image

Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name “Felis tigris”. Of these,

Two checklists consider Felis tigris Linnaeus, 1758 to be a junior synonym of Panthera tigris (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation — as it happens, the correct one — is at the top of the list.

The remaining checklists all consider Felis tigris to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and a taxonomic authority. You can click on each name to find out more details.

Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does “Ficus” refer to the genus Ficus Röding, 1798 or the genus Ficus L.? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.

We’ve designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the best-recognized names, and ending with EOL’s fuzzy matching search as a final step to look for incorrectly spelled names. For VertNet’s 2013 Biodiversity Informatics Training Workshop, we wrote two tutorials that walk you through our workflow:

  • Name validation in OpenRefine, using both the new GBIF API reconciliation service and Rod Page’s reconciliation service for EOL, and

  • Higher taxonomy in OpenRefine, using the web service APIs provided by GBIF and EOL, as well as OpenRefine’s ability to parse JSON.

If you’re already familiar with OpenRefine, you can add the reconciliation service with the URL:

http://refine.taxonomics.org/gbifchecklists/reconcile
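Because the service speaks the standard OpenRefine reconciliation protocol, you can also query it outside OpenRefine. A minimal illustrative sketch (the "queries" parameter and the result shape follow the generic reconciliation protocol; the exact fields returned by this service may differ):

import json
import requests

SERVICE = "http://refine.taxonomics.org/gbifchecklists/reconcile"

# The reconciliation protocol takes a JSON map of queries...
queries = {"q0": {"query": "Felis tigris"}}
resp = requests.get(SERVICE, params={"queries": json.dumps(queries)})

# ...and returns candidate matches for each query key.
for candidate in resp.json()["q0"]["result"]:
    print(candidate["name"], candidate.get("score"))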

Give it a try, and let us know if it helps you reconcile names faster!

The Map of Life project is currently working on improving OpenRefine for taxonomic use in a project we call TaxRefine. If you have suggestions for features you’d like to see, please let us know! You can leave a comment on this blog post, or add an issue to our issue tracker on GitHub.

Monday, December 3, 2012

Reminders! Reminders! Lots of ways to get involved!

Three great ways to get involved with the VertNet community.

All deadlines are January 11, 2013, 11:59pm Pacific Time.

Biodiversity Informatics Training Workshop

During the five-day course, participants will work closely with trainers to address compelling biodiversity research questions and focus on the entire scope of a research project, including data acquisition, tools for data evaluation, analysis, and project dissemination and outreach. Participants are expected to bring high levels of motivation and a desire to learn the fundamentals of biodiversity informatics and become users of cutting-edge tools in GIS and modeling, although they are not expected to have prior experience with them. The workshop will…

VertNet Summer Internships

Two intern positions are available: one at the University of California, Berkeley, and one at the University of Colorado, Boulder.  Interns will work with the VertNet project team of experienced museum curators, researchers, and informaticists, as well as individuals from VertNet integration partners, to design and conduct a research project…

At-Large Steering Committee Positions

The term for each at-large member will be one (1) year with the opportunity for renewal at the completion of the term. At-large members should have significant experience in at least one of the following:

  • collection management
  • informatics development and big data
  • community and academic outreach
  • biodiversity-oriented research
  • business management and sustainability
  • grant writing and fundraising

VertNet welcomes applications from members of the private, non-profit, academic, governmental, and broad biodiversity communities. Applicants may be… 

Friday, February 10, 2012

VertNet, Creative Commons, and Data Licensing and Waivers

VertNet is faced regularly with issues involving what data can be shared via its portals, who uses the data, how they use it, and how credit is given to the institutions sharing their data.  As we work to build a better network we need to make decisions about how we, and all of our data publishers, will make data available to the public.

Thankfully, Peter Desmet and our friends at Canadensys have put together an excellent primer on the options available to data networks like ours.  They kindly posted the results of their thinking on their blog (27 January 2012), but we believe this is important enough to re-post here.

You can view the original post at the Canadensys Blog, along with all of the comments it received.  We’ve modified the original slightly to fit our formatting.

____________________________

With the first datasets getting published and more coming soon, the issue comes up under what license we – the Canadensys community and the individual collections – will publish our data. Dealing with the legal stuff can be tedious, which is why we have looked into this issue with the Canadensys Steering Committee & Science and Technology Advisory Board before opening the discussion to the whole community.

By data we mean specimen, observation or checklist datasets published as a Darwin Core Archive and any derivatives. To keep the discussion focused, this does not include pictures or software code.

2012.01.30 – Update to post: technically CC0 is not a license, but a waiver.

What we hope to achieve

  1. One license for the whole Canadensys community, which is easier for aggregation and sends a strong message as one community.
  2. An existing license, because we don’t want to write our own legal documents.
  3. An open license, allowing our data to be really used.
  4. A clear license, so users can focus on doing great research with the data, instead of figuring out the fine print.
  5. Giving credit where credit is due.

Our recommendation

We recommend that Canadensys participants publish their data under Creative Commons Zero (CC0). With CC0 you waive any copyright you might have over the data(set) and dedicate it to the public domain. Users can copy, use, modify, and distribute the data without asking your permission. You cannot be held liable for any (mis)use of the data either.

CC0 is recommended for data and databases and is used by hundreds of organizations. It is especially recommended for scientific data and thus encouraged by Pensoft (see their guidelines for biodiversity data papers) and Nature (see this opinion piece). Although CC0 doesn’t legally require users of the data to cite the source, it does not take away the moral responsibility to give attribution, as is common in scientific research (more about that below).

Why would I waive my copyright?

For starters, there’s very little copyright to be had in our data, datasets and databases. Copyright only applies to creative content and 99% of our data are facts, which cannot be copyrighted. We do hold copyright over some text in remarks fields, the data format or database model we chose/created, and pictures. If we consider a Darwin Core Archive (which is how we are publishing our data) the creative content is even further reduced: the data format is a standard and we only provide a link to pictures, not the pictures themselves.

Figuring out where the facts stop and where the (copyrightable) creative content begins can already be difficult for the content owner, so imagine what a legal nightmare it can become for the user. On top of that different rules are used in different countries. Publishing our data under CC0 removes any ambiguity and red tape. We waive any copyright we might have had over the creative content and our data gets the legal status of public domain. It can no longer be copyrighted by anyone.

Can’t we use another license?

Let’s go over the options. Keep in mind that these licenses only apply to the creative aspect of the dataset, not the facts. But as pointed out above, figuring this out can be difficult or impossible for the user. So much so in fact, that the user may decide not to use the data at all, especially if they think they might not meet the conditions of the license.

All rights reserved

The user cannot use the data(set) without the permission of the owner.

Conclusion: Not good.

Open Data Commons Public Domain Dedication and License (PDDL)

There are no restrictions on how to use the data. This license is very similar to CC0.

Conclusion: Perfect, in fact this license was a precursor of CC0, but… it is less well known and maybe not as legally thorough as CC0. CC0 made a huge effort to cover legislation in almost all countries and the Creative Commons community is working hard to improve this even further. Therefore, if you have to choose, CC0 is probably better.

Creative Commons Attribution-NoDerivs (CC BY-ND)

The user cannot build upon the data(set), which is what most data use involves.

Conclusion: Not good, and sadly used by theplantlist.org. Roderic Page pointed this out by showing what cool things he can NOT do with the data.

Creative Commons Attribution-NonCommercial (CC BY-NC)

The user cannot use the data(set) for commercial purposes. This seems fine from an academic viewpoint, but the license is a lot more restrictive than intuitively thought. See: Hagedorn, G. et al. ZooKeys 150 (2011). Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information.

Conclusion: Not good.

Creative Commons Attribution-ShareAlike (CC BY-SA) or Open Data Commons Open Database License (ODbL)

The user has to share any work based upon the data(set) under a license that is identical or similar to the one used.

Conclusion: Good, but… this can lead to some problems for an aggregator like Canadensys or GBIF: if they are mixing and merging data with different SA licenses, which one do they choose? They might be incompatible.

Creative Commons Attribution (CC BY) or Open Data Commons Attribution License (ODC-By)

The user has to attribute the data(set) in the manner specified by the owner. This condition is also present in the three licenses above.

Conclusion: Good, but… this can lead to impractical “attribution stacking”. If an aggregator or a user of that aggregator is using and integrating different datasets provided under a BY license, they legally have to cite the owner for each and every one of those in the manner specified by these owners (again, for the potential creative content in the data). See point 5.3 at the bottom of this Creative Commons page for a better explanation and this blog post for an example.

But giving credit is a good thing!

Absolutely, but legally enforcing it can lead to the opposite effect: a user may decide not to use the data out of fear of not completely complying with the license (see the paragraph above). As hinted at the beginning of this post, CC0 removes the drastic, legally enforceable requirement to give attribution, but it does not remove the moral obligation to give attribution. In fact, this has been the common practice in scientific research for many decades: legally, you don’t have to cite the research/data you’re using, but not doing so could be considered plagiarism, which would compromise your reputation and the credibility of your work.

To encourage users to give credit where credit is due, we propose to create Canadensys norms. Norms are not a legal document (see an example here), but a “code of conduct” where we declare how we would like users to use, share, and cite our data, and how they can participate. We can explain how one could cite an individual specimen, a collection, a dataset, or an aggregated “Canadensys” download. We can point out that our data are constantly being corrected or added to, so it is useful to keep coming back to the original repository and not to a secondary repository that may not have been updated. In addition to that, we can build tools to monitor downloads or automatically create an adequate citation. And with the arrival of data papers – drafts of which can now be generated automatically from the IPT – data(sets) are really brought into the realm of traditional publishing and the associated scientific recognition.

Conclusion

All this to say that there are mechanisms where both users and data owners can benefit, without the legal burden. CC0 + norms guarantees that our data can be used now and in the future. I for one will update the license for our Université de Montréal Biodiversity Centre datasets. We hope you will join us!

Thanks to Gregor Hagedorn for his valuable advice on all the intricacies of data licensing.

Thursday, January 12, 2012

Farewell, NBII

On January 15, 2012, the National Biological Information Infrastructure will be taken offline permanently.

All of us at VertNet view this event with mixed emotions. On the one hand, we understand the pressures of shrinking budgets and the difficulty in making decisions to prioritize the public services provided by a government entity, such as the USGS.  On the other hand, we are sad to see this program, one with which we have been closely linked, go.

It is not an exaggeration to say that VertNet would likely not exist without the support of NBII. NBII staff participated in the creation of VertNet back in 2008 and have provided critical financial support to hire and sustain two full-time positions, the VertNet Coordinator and VertNet Programmer, ever since. Although NBII’s services will be discontinued or distributed to other departments of the USGS, VertNet’s activities will continue with funding from the National Science Foundation.

We wish to thank the individuals at NBII and the USGS for this support over the last four years and we look forward to working with others at the USGS into the future.

Wednesday, December 21, 2011

VertNet Project Quarterly Update #2

The final quarter of 2011 was very productive for VertNet. We announced the call for applications for our first Biodiversity Informatics workshop, made some major decisions about the development of the VertNet platform, and, somehow, managed to produce four months’ worth of posts on the VertNet Blog.

Before we get into the details of our recent progress, we’d just like to remind you that you can follow our work or ask questions of us in a number of ways.

Now, on to the details…

Biodiversity Informatics Training Workshop

We’re closing in on the January 10th application deadline for VertNet’s first Biodiversity Informatics Training Workshop. The workshop, hosted by the University of Colorado, Boulder, will feature five days during which participants will work closely with trainers to address compelling biodiversity research questions, focusing on the entire scope of a research project, from initial data acquisition, to tools for data evaluation, to analysis, and, finally, to dissemination and outreach. The workshop will include large and small group exercises on a common curriculum as well as the opportunity for participants to discuss and explore individual research questions with trainers.

By the end of the week, participants should leave the workshop with:

  • an understanding of the workflows needed to acquire, analyze, and report results generated from biodiversity resources found in data repositories such as VertNet,
  • a set of basic skills for using data repositories and informatics and analytic tools, along with an understanding of which tools are appropriate for which tasks, and
  • knowledge of the abundant resources and additional training available to them.

You can learn more about the workshop and who should apply on VertNet.org.

VertNet Development

A cornerstone of the VertNet mission is to deliver a cloud-based platform upon which data publishers (i.e., institutions providing data) can store growing quantities of biodiversity data that can be accessed and enhanced via the Web. An important measure of our success will be how effectively this new platform can overcome the core technical challenges of scalability, data discovery, sustainability, and integration with other platforms. We will continue to work with our integration partners on evaluating platforms and, if all goes according to plan, we’ll open up testing to existing data publishers and others interested in our work.

In this quarter, the development team made some critical decisions about the design of our technical solutions. A key design principle in our decision-making process requires us to seek a balance between the most cost-effective (i.e., sustainable) and flexible (i.e., scalable) software and the potential for development of robust tools and innovations that maximize data discovery. We plan to optimize sustainability and scalability by lowering maintenance costs and by simplifying the VertNet architecture so that data publishers can deploy VertNet anywhere, using any system.

In July of 2011, we built a prototype platform using the Google App Engine cloud, Python, and SQLite. The prototype, available on GitHub, was composed of a data bulkloading script used for uploading records, and an application programming interface (API) for searching them. This prototype was tested by our VertNet integration partners (e.g., AmphibiaWeb, Arctos, GeoLocate), who provided excellent feedback about usability, text matching queries, and support for Darwin Core Archives.
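In spirit, the prototype paired a bulkloader with a search API. As a rough, illustrative analogue (not the actual prototype code, which lives on GitHub), loading Darwin Core records into SQLite and querying them might look like this:

import sqlite3

conn = sqlite3.connect("vertnet_prototype.db")  # hypothetical local database
conn.execute("""CREATE TABLE IF NOT EXISTS records (
                    occurrence_id TEXT PRIMARY KEY,
                    scientific_name TEXT,
                    locality TEXT)""")

def bulkload(rows):
    """Bulkload (occurrence_id, scientific_name, locality) tuples."""
    conn.executemany("INSERT OR REPLACE INTO records VALUES (?, ?, ?)", rows)
    conn.commit()

def search(name_fragment):
    """Return records whose scientific name contains the fragment."""
    return conn.execute(
        "SELECT * FROM records WHERE scientific_name LIKE ?",
        ("%" + name_fragment + "%",)).fetchall()

bulkload([("urn:example:1", "Sus scrofa", "Boulder County, Colorado")])
print(search("Sus"))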

Coincidentally, while our prototype was being tested, Google App Engine announced a new service level agreement and pricing model. Because this change in pricing would likely increase our estimated annual operating cost, we proceeded to identify different cloud-based alternatives to our original plan.

After further research and several productive conversations with our integration partners, we decided to explore the cloud-based CouchDB and HTML5. CouchDB is intriguing because it is both a web server and a database server with strong replication support. It would allow us to deploy VertNet data and HTML5 applications in a highly flexible, scalable, and sustainable way. Plus, CouchDB would give our integration partners a way to extend VertNet with custom functionality specific to their needs.

In September 2011, we built a new prototype using CouchDB.  This prototype was composed of a browser-based bulkloader and a search API hosted in the cloud. The prototype and technical architecture were presented at the 2011 Biodiversity Information Standards (TDWG) conference in New Orleans.  So far, we are very pleased with the results of our testing with CouchDB and are encouraged by the potential to address our technical challenges.
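For readers unfamiliar with CouchDB, everything happens over plain HTTP, which is part of its appeal: the same server that stores documents can also serve an HTML5 application. A hedged sketch of bulkloading records (the server URL, database name, and field names are assumptions for the example):

import requests

COUCH = "http://127.0.0.1:5984"   # a local CouchDB server
requests.put(COUCH + "/vertnet")  # create the database (returns 412 if it already exists)

# POST to _bulk_docs loads many documents in one request.
docs = {"docs": [
    {"_id": "urn:example:1", "scientificName": "Sus scrofa", "country": "US"},
    {"_id": "urn:example:2", "scientificName": "Puma concolor", "country": "MX"},
]}
resp = requests.post(COUCH + "/vertnet/_bulk_docs", json=docs)
print(resp.status_code, resp.json())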

Thanks for keeping your eyes on us.  More exciting things are coming in 2012.  If you’ve got ideas, suggestions, or questions, feel free to send us a note or comment.

Posted by Dave Bloom, VertNet Coordinator, and Aaron Steele, VertNet Information Architect, on behalf of the VertNet Team.