Database Frontiers

They aren’t really frontiers anymore, but I’m learning more about NoSQL databases and their use cases; specifically, the best way to create a better fossil specimen database. The use case would be a simple “what do we have and where is it” database for the University of North Dakota paleo collections.

Ideal features are:

  • Easy replication or download, in case someone wants to run an analysis on all or part of the database.
  • Easy and/or flexible query capabilities. I’m getting the impression I can have one or the other, but not both.
  • Easy import.
  • Versioning would be nice.
  • Scalability (in the end, 500,000 specimens).

So far CouchDB and MongoDB are my frontrunners, and even though I think CouchDB has more of the features I’d like to use, the supposedly easier querying in MongoDB has me intrigued. This is definitely not going to be a post comparing the two. It’s possible that MySQL is the right choice, and I am re-exploring that possibility. I’m beginning to shy away from NodakPaleo as it exists right now, because Drupal adds a layer of complexity between researchers and the data. Sure, it’s nice to have all the extras and to be able to build a View that you can query a certain way via GUI, but ideally the database would have an API that researchers could use for more interesting things: mapping exercises, mashups with other datasets, etc. For that to happen, I think a JSON/BSON/XML document database may be the best choice in the end.
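To make that concrete, here’s a rough sketch of what a single specimen might look like as a JSON document. The field names and catalog number are my own invention for illustration, not an existing schema; the point is that a document store like CouchDB sends and receives records as plain JSON.

```python
import json

# Hypothetical specimen record as a JSON document; field names and
# values are invented for illustration, not taken from any real schema.
specimen = {
    "_id": "UND-PC-001234",          # catalog number doubles as document ID
    "taxon": "Sphaerium striatinum",
    "locality": {
        "name": "Pembina Gorge, ND",
        "unit": "Pierre Shale",
    },
    "collected": "1998-07-14",
}

# Serializing to and from JSON is all it takes to talk to a document
# database over HTTP -- no ORM or GUI layer between the researcher
# and the data.
doc = json.dumps(specimen)
restored = json.loads(doc)
print(restored["taxon"])  # prints "Sphaerium striatinum"
```

A researcher could pull documents like this straight from the database’s HTTP API and feed them into a mapping script or a mashup without going through any web frontend.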

I just want people to be able to talk to it and use it in a way that makes sense to them.  Continuing on, I hope to add more comprehensible thoughts about each option and how I am learning (because I know very little about these systems).

Finally, if you want to see where my free time went during 2007, check out SpecimenDB, the PHP/MySQL version I created.  Much of the backend I scraped from ‘PHP and MySQL for Dummies,’ but apparently I learned a lot that summer.  This is a new installation, so it obviously doesn’t have any information in it.


Geologic Figures Database, step 1

I hope to someday put together a tagged database of figures pulled from the geological literature that may be useful to others, primarily maps and stratigraphic columns.  For now, though, I am sharing an Evernote notebook to which I will be adding over time.  It can be accessed here: [EDIT: No longer. 2014-02-04]

Future plans include tagging the images (type of map, area covered, stratigraphic units), but I am not sure which platform to use. Tumblr, Blogger, or even Pinterest (in addition to the possibility of Drupal) are all options at this point that would allow collaboration.

Petra: Select wells with a datum

In Petra 4, to select wells that have a datum (e.g., if you want wells you can export with sub-sea true vertical depth [SSTVD] rather than just measured depth [MD]), the correct search criterion is Well Header–Datum–By Active Datum–Datum Is Between the Minimum and Maximum, and set the minimum depth to something greater than one.

Exporting from Panorama Sheets

I just started playing with Panorama Sheets, a cheaper (and lower-powered) version of Panorama that does exactly what I want: it combines a spreadsheet interface with database-querying capabilities. At the moment I don’t need the other functions of the full version of Panorama, so I may purchase this software after my trial runs out. I’ve seen a few glitches so far. One is that in the Export Wizard (File->Export Text…), clicking the “Export Text” button results in an “Error resolving alias.” error. Luckily, this can be avoided by using the Export->Export to Text File… menu item instead. Another problem is that the encoding comes out wrong when opening the exported file in OpenOffice Calc, so it was easiest to open the text file, copy the contents, and paste them into Calc, which brings up a dialog box for selecting your delimiter, etc.


Trying to Focus on Specimen Databasing

This post will explore some fairly specific topics, but I hope the thought process will be instructional (or inspiring) to others. Additionally, I think it’s worthwhile to talk about the concepts of specimen/biological-collection database management with reference to funding, not schemas and platforms.

At the UND Department of Geology and Geological Engineering, a small number of us have been pursuing an overall upgrade of the paleontological specimen and lithotype collection, consisting of improved facilities (compactor cabinets) and a comprehensive online database. We’ve applied for funding from NSF and been denied twice, and the project would be dead in the water except for the quarter-time assistantship I’m receiving from the Dean’s office at the School of Engineering and Mines. Development has been slow, mostly due to the conversion between the existing databases (stored as flat text files) and the online system. (I won’t name the new system here, because events today have made me question, again, the cost/benefit ratio of using it.) In the meantime, I’ve been importing locality data so we can use the new system to analyze locality distribution, among other things.

The question today is how to proceed. As useful as locality data are to paleontological and geological researchers, locality information is, at its core, supplementary to the specimens themselves. (I’ll avoid an argument right here: I believe that locality data are essential to proper context, and I’m not advocating the dissociation of these data from specimens.) Specimens are the core of the paleontological sciences, and it is from specimens and their assigned taxonomic identities that researchers work toward understanding past life. Rather than browsing locality lists and then looking at specimens, given a database most researchers will search by taxon or in special cases by specimen number, and then they will look at the associated locality data. In my opinion, we’ve been doing it wrong.

The above point regards usability, and I promised to talk about funding issues, so here we go: in order for such an online database (and more importantly, the effort to digitize specimen data and provide specimen imagery) to keep getting funding, it needs to be usable so it will be used! That’s the whole point. If the Dean (or any other UND administrator) wants to put us on the map for having a world-class collection, we need to get the data out there that people want, we need to tell them about it, and we need to encourage them to use it. From the administration’s perspective, numbers are going to determine how successful we are: number of unique visitors the online database gets every year, number of publications that reference specimens held in our collections, and number of researchers who visit or request material loans.

What can I do today that will improve our chances? In my opinion, we need to improve usability for others before we improve usability for ourselves. This means a focus on specimen-data entry, the postponement of certain analytical capabilities we (as UND researchers) would like, and beginning with those specimens referenced in peer-reviewed articles, dissertations, and theses. These specimens have already gotten the most attention, and they are likely to get more in the future because of their “published” status. The associated material can come next, and then we can start adding data systematically. In short, to prove this is possible and to show our research collections in a good light, we need to get the bare bones online first and follow with everything else later.

That’s what I think, and what I will discuss with others here later today. Has anyone else come across such a crux of funding issues? How about with specimen collections that are even less sexy than ours (which are primarily freshwater mollusks, and are pretty darn sexy in my opinion)? Am I on the right track, or should we back this train up again?

Specify: possible solutions?

Brainstorming database solutions.

– Use Specify and jettison some data that won’t fit. This wouldn’t go over well.

– Use Access and deal with not having the data schema or the Specify software (which seems like it can work fairly well).

– Use MySQL and build a frontend in PHP that mimics Q&A, including the ability to rearrange fields, sort things easily, and add new fields when needed.

– Use MySQL and steal the schema from Specify, and build a frontend in PHP.

– Use MySQL and steal the schema from Specify, and use Access as a frontend.

– Use Specify, add the needed columns to the tables with Access (I think this can be done), and use Specify for the web interface and Access for on-site work.

– Keep bugging the Specify team to tell me how the program knows when it needs to display new tables and hope they eventually tell me (see a couple posts back).

– Waste a lot of time figuring it out on my own.

Why does it seem like the last option would be the best?

Links that need to be made into a tutorial.

Specify 5.2 and my quest to import lots of data

I’ve been unable to figure out how to get a username and password to log in to the Specify forum; I’ve requested help with that, but until I get a response, I’m going to shout my questions out into the ether and hope I get some echoes of answers back.

UPDATE: Apparently you have to email to get a login (source).

1. When is Specify 6 coming out? Originally it was early 2008, then it was mid 2008, and I haven’t seen any updates recently. I hope it comes out soon.

2. Is there a Specify mailing list? If there isn’t, there should be, especially since it seems like nobody is able to log into the official forum. Depending on interest, I might try to get one going, since there are supposedly lots of people using the software. UPDATE: There is a mailing list here.

3. The big question: is it possible to add more than the default offering of a few new fields to each form? I’m on the cusp of moving over a great deal of data, and I’d like to be able to include everything and add fields as necessary, but that doesn’t seem to be an option.

I’ve tried adding fields to the database itself (through Server Management Studio Express), but I think the number and name of each field are hard-coded into the Specify program rather than being stored as an expandable list in the database, because a new field doesn’t show up when I go to edit forms. For example, each field has both a name and a set of properties; the properties (such as what kind of data is allowed, how long it is allowed to be, etc.) are described in the database tables, but I have not been able to determine where the name of each field is stored. Hold up. It has to be stored somewhere, because it can be changed by the user. I may have to dig deeper. Sometimes talking through things does lead to answers. We’ll see what I can dig up.
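For what it’s worth, the column properties really do live in the database’s own catalog, which is why they show up in Management Studio. Here’s a small sketch using SQLite as a stand-in for SQL Server (the table and column names are made up); in SQL Server the equivalent lookup would be a query against INFORMATION_SCHEMA.COLUMNS.

```python
import sqlite3

# SQLite stand-in for the SQL Server instance behind Specify; the
# table and column names here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collectionobject ("
    "  CatalogNumber TEXT,"
    "  FieldNotes TEXT"
    ")"
)

# The catalog stores each column's name and type -- the properties
# visible in Server Management Studio. PRAGMA table_info returns
# (cid, name, type, notnull, default, pk) per column.
columns = [row[1] for row in conn.execute("PRAGMA table_info(collectionobject)")]
print(columns)  # prints ['CatalogNumber', 'FieldNotes']
```

What this catalog does not hold is the user-visible caption Specify shows for each field, which is presumably why renaming a column there doesn’t change the forms; that mapping must be stored in some separate table or configuration.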

In any case, I’m very comfortable messing with the database tables themselves in order to get the data in, which is what I’m going to have to do anyway: the data will be coming out of a tab-delimited file when I’m done with it and should be easy to get into SQL. It’s just a question of making sure those columns exist in the tables first…
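The tab-delimited-to-SQL step itself is mechanical. Here’s a minimal sketch, again using SQLite in place of the real Specify tables; the table name, column names, and sample rows are placeholders, and in practice the input would be the actual exported file rather than an inline string.

```python
import csv
import io
import sqlite3

# Stand-in for the tab-delimited export; in practice this would be
# open("specimens.tsv") and the columns would match the real tables.
tsv = "CatalogNumber\tTaxon\n1001\tSphaerium\n1002\tPisidium\n"

conn = sqlite3.connect(":memory:")  # stand-in for the real database server
conn.execute("CREATE TABLE specimen (CatalogNumber TEXT, Taxon TEXT)")

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
rows = [(r["CatalogNumber"], r["Taxon"]) for r in reader]

# Parameterized inserts keep stray quotes or tabs in the data from
# breaking the generated SQL.
conn.executemany("INSERT INTO specimen VALUES (?, ?)", rows)
count = conn.execute("SELECT COUNT(*) FROM specimen").fetchone()[0]
print(count)  # prints 2
```

The only real work is mapping the columns of the flat file onto the columns Specify expects, which is exactly the question of whether those columns exist in the tables first.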

Anyway, that’s where I stand. I’ll update when I know more.