Data Basics With Databases – The Wonders of Data Material

Vacation time is over and – which is a bit surprising for PhD students – I had plenty of time to think about my dissertation and my academic future. I did some research and picked up many interesting new fields one could work on. However, from time to time, I got angry about an issue that came up a lot. Have you ever noticed how limited some subject-specific databases are? I mean, having them help us find sources and texts that enriched our own work was a wonderful opportunity 10 years ago; but today, 10 years later, the relations between whole datasets have become the focus of some researchers' attention.

The need (or wish) to work with the full material of a database leads to the wide discussion about Open Access in science. I don't want to go deeper into that aspect in this post, but let me make my opinion clear: Open Access is a great convenience for an open and free science. I support this position, and I think my contribution to the scientific community is to share research results and raw data in an easy, accessible way. Yet I understand the need for legal restrictions, payment, and license policies. Publishers, universities, and scholars invest a great deal of time and money to develop their systems; and, of course, charges and copyright restrictions are necessary for financial viability. I'm happy to pay for access if that's the price to support scientific progress. One can't be so naive as to think open science is free. Someone must pay for it, and even my time and work as a PhD student cost resources and money. To make a long story short: if a database is hidden behind payment or license restrictions, there should be good reasons for it. And I hope the reason is not profit, but the necessity of keeping the system running.

In this context, it is very important to distinguish between two types of access. The first is access as an intended user. A publisher builds a database with a user interface and provides services for specific use cases. The Library of the University of Heidelberg, for example, lists 3,291 databases on DBIS, the "Datenbankinformationssystem" (German for: Information System for Databases). DBIS is nothing more than a colorful index of bibliographies, encyclopedias, dictionaries, etc., with different licenses and payment policies. If you look more closely at the individual databases – even across different disciplines – you will notice that most of them are built for storing and retrieving knowledge. Databases caught up in this store-and-retrieve mode may make research much easier; nevertheless, it is wasted potential.

Imagine the following example, which I picked randomly as a stand-in for many others (and without any offensive intent): the World Biographical Information System (WBIS) is a very, very, very powerful tool for searching biographical articles. It collects information from encyclopedias and archives from all over the world. The opportunities for researchers are enormous. For example, if you look up the name "Mark Twain," you will find entries in biographical archives from 7 different countries. Some of them are digitized and machine-readable. Of course, researchers who are interested in persons, in the history of reception, or in cultural reception will find a huge number of sources. But – and now your imagination is in demand – wouldn't it be amazing if someone coded an algorithm that detected "Mark Twain" in all the other machine-readable articles, created an interactive network of persons in biographical resources, and visualized the links (maybe even with places and years)? Wait … Someone already did this with Wikipedia: it's called EVELIN.
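The core of such an algorithm is simple to sketch. Assuming we had the machine-readable article texts and a list of person names (the sample articles and names below are invented for illustration, not real WBIS data), a minimal co-occurrence network could be built like this:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: in practice, these would be the
# machine-readable biographical articles from a database like WBIS.
articles = {
    "archive_us": "Mark Twain met William Dean Howells in Boston.",
    "archive_de": "Mark Twain corresponded with Helen Keller for years.",
    "archive_uk": "Helen Keller admired William Dean Howells and his essays.",
}

# A known list of persons to detect (a real system would use
# named-entity recognition instead of exact string matching).
persons = ["Mark Twain", "William Dean Howells", "Helen Keller"]

def cooccurrence_network(articles, persons):
    """Link two persons whenever they are mentioned in the same article."""
    edges = defaultdict(int)
    for text in articles.values():
        mentioned = [p for p in persons if p in text]
        for a, b in combinations(sorted(mentioned), 2):
            edges[(a, b)] += 1  # edge weight = number of shared articles
    return dict(edges)

network = cooccurrence_network(articles, persons)
for (a, b), weight in sorted(network.items()):
    print(f"{a} -- {b} ({weight} shared article(s))")
```

The resulting weighted edge list is exactly what visualization tools expect as input; the hard part is not the algorithm but getting access to the raw article texts in the first place.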

The difference between WBIS and Wikipedia is obvious in this case, and it reveals the second type of access: closed versus open systems. Again, this post is not a plea for Open Access, but for a shift in mindset: closed databases have much more potential than expected. Maybe lexicologists could explore new horizons with access to closed encyclopedias, or linguists with access to full-text databases. The obvious basic functions of a database are only the small part its creators think you need most. But once an idea involving links and overlapping questions arises in your mind, the problems with closed systems begin. As soon as the raw material – the underlying datasets behind the services and user interfaces – becomes important for your research, it gets much harder to work with the material.

For publishers, there could be a simple solution for "easier access": distinguish between usage scenarios! Paying for and licensing a service or tool highlights the "product" that is built on the data; granting access to the whole data material for academic research would set off a gold rush among digital humanists. Maybe someday, what Google did (making the raw data behind the Ngram Viewer downloadable) will be the standard. Or, as a model for publishers: a fee-based tool, but Open Access to the raw data.

I will close with a very positive experience: during vacation, I had an idea for a future project that needs data from a particular database. Some key features and search functions for the database's text sources are available online for free, but it's not possible (and maybe illegal, too) to parse them with a script. So I did something audacious: I pitched my idea to the person in charge and asked about raw data, plain texts, and license policies. (I will describe the project idea in another post if everything works out as expected.) Within a few days, I received an answer: they would not discuss IF they would grant me access, but HOW they could do it! Let's dig up the treasure.
