Database Frontiers

They aren’t really frontiers anymore, but I’m learning more about NoSQL databases and use cases–specifically, what the best way to create a better fossil specimen database is.  The use case would be a simple “what do we have and where is it” database for the University of North Dakota paleo collections.

Ideal features are

  • Easy replication or download, in case someone wants to run an analysis on all or part of the database.
  • Easy and or flexible query capabilities.  I’m getting the impression I get one or the other.
  • Easy import.
  • Versioning would be nice.
  • Scalability (in the end, 500,000 specimens).

So far CouchDB and MongoDB are my frontrunners, and even though I think CouchDB has more of the features I’d like to use, the supposedly easier querying in MongoDB has me intrigued.  This is definitely not going to be a post comparing the two.  It’s possible that MySQL is the right choice, and I am re-exploring that possibility.  I’m beginning to shy away from NodakPaleo as it exists right now, because Drupal adds a layer of complexity between researchers and the data.  Sure, it’s nice to have all the extras and be able to build a View that you can query a certain way via GUI, but ideally the databse would have an API that could be used by researchers for more interesting things–mapping exercises, mashups with different datasets, etc.  For that to happen, I think a JSON/BSON/XML document database may be the best choice in the end.  

I just want people to be able to talk to it and use it in a way that makes sense to them.  Continuing on, I hope to add more comprehensible thoughts about each option and how I am learning (because I know very little about these systems).

Finally, if you want to see where my free time went during 2007, check out SpecimenDB, the PHP/MySQL version I created.  Much of the backend I scraped from ‘PHP and MySQL for Dummies,’ but apparently I learned a lot that summer.  This is a new installation, so it obviously doesn’t have any information in it.

 

NodakPaleo

If you’re looking for NodakPaleo, the University of North Dakota Paleontology Specimen Database, you can find it (for the moment) at http://nodakpaleo.mattbk.com.  Sorry for any confusion, I am seeing if we can get the domain name renewed.

Specify: possible solutions?

Brainstorming database solutions.

– Use Specify and jettison some data that won’t fit. This wouldn’t go over well.

– Use Access and deal with not having the data schema or the Specify software (which seems like it can work fairly well).

– Use MySQL and build a frontend in PHP that mimics Q&A, including the ability to rearrange fields, sort things easily, and add new fields when needed.

– Use MySQL and steal the schema from Specify, and build a frontend in PHP.

– Use MySQL and steal the schema from Specify, and use Access as a frontend.

– Use Specify and add the columns to the tables that need to be added with Access (I think this can be done), use Specify for the web interface and Access for on-site.

– Keep bugging the Specify team to tell me how the program knows when it needs to display new tables and hope they eventually tell me (see a couple posts back).

– Waste a lot of time figuring it out on my own.

Why does it seem like the last option would be the best?

LATER:
Links that need to be made into a tutorial.
http://dev.mysql.com/tech-resources/articles/migrating-from-microsoft.html
http://www.ucl.ac.uk/is/mysql/access/
http://mobiledeveloper.wordpress.com/2007/01/31/data-import-export-with-sql-server-express-using-dts-wizard/
http://dev.mysql.com/downloads/connector/odbc/5.1.html
http://www.ucl.ac.uk/is/mysql/odbc/
http://www.aspnetcafe.com/post/2007/12/HOWTO-Get-Complete-SQL-dump-of-SQL-Express-2005-Database.aspx
http://www.microsoft.com/downloads/details.aspx?FamilyId=56E5B1C5-BF17-42E0-A410-371A838E570A&displaylang=en

Specify 5.2 and my quest to import lots of data

I’ve been unable to figure out how to get a username and password to login to the Specify forum; I’ve requested help with that but until I get a response, I’m going to shout my questions out to the ether and hope I get some echos of answers back.

UPDATE: Apparently you have to email specify@ku.edu to get a login (source).

1. When is Specify 6 coming out? Originally it was early 2008, then it was mid 2008, and I haven’t seen any updates recently. I hope it comes out soon.

2. Is there a Specify mailing list? If there isn’t, there should be, especially since it seems like nobody is able to log into the official forum. I might be interested in getting one going depending on interest, since there are supposedly lots of people using the software. There is a mailing list here.

3. The big question: is it possible to add more than the default offering of a few new fields to each form? I’m on the cusp of moving over a great deal of data, and I’d like to be able to include everything and add fields as necessary, but that doesn’t seem to be an option.

I’ve tried adding fields to the database itself (through Server Management Studio Express), but I think the number and name of each field is hard-coded into the Specify program rather than being stored as an expandable list in the database, because the new field doesn’t show up when I go to edit forms. For example, each field has both a name and a set of properties–the properties (such as what kind of data is allowed, how long it is allowed to be, etc.) are described in the database tables, but I have not been able to determine where the name of each field is stored. Hold up. It has to be stored somewhere, because it can be changed by the user. I may have to dig deeper. Sometimes talking through things does lead to answers. We’ll see what I can dig up.

In any case, I’m very comfortable with messing with the database tables themselves in order to get the data in, which is what I’m going to have to do anyway, because the data will be coming out of a tab-delimited file when I’m done with it and should be easy to get into SQL. It’s just a question of making sure those columns exist in the tables first…

Anyway, that’s where I stand. I’ll update when I know more.