The Problematic Lack of Transparency

This week’s readings discussed databases and the ways in which such technology affects the historical profession. Patrick Spedding’s article “The New Machine: Discovering the Limits of ECCO” touches briefly on something we mentioned in discussion last week. Spedding notes that ECCO’s OCR transcriptions are not available to the users of the database. This is a problem that many historians have encountered. When working on my Clio 1 final project, which used the text mining tool Voyant, I went to ProQuest and pulled a selection of newspaper articles written in the 20th century. ProQuest is one of the many databases that do not provide a transcription of the OCRed text, so I had to transcribe all of my newspaper articles by hand. While it was time-consuming, in the long run it was probably quicker than going through an OCR transcription and fixing each and every error.

This issue fits better with last week’s topic of digitization, but I am perplexed as to why databases so frequently do not give users access to OCR transcripts. Pete had mentioned that databases are essentially “black boxes,” meaning that we don’t fully understand what sort of process our data has gone through. Why can’t all databases follow the model set by Chronicling America? While I am fairly certain my entire Clio 1 class was appalled by its poor OCR quality, showing the OCRed transcripts promotes transparency and openness. Not only are OCR transcripts not provided, but users of databases like ProQuest are given no information about the accuracy of the OCRed text. Digital history as a field promotes transparency, and I feel that we, as digital scholars, should do something to correct this problem. It is important to understand how the data we are seeking has been transformed. For example, what had to happen for this article on Roe v. Wade to appear as I see it now on my computer screen?
What sort of process did this article go through to appear in this digitized format? Databases need to be accountable to their users and share their practices.

Caleb McDaniel’s blog post “The Digital Early Republic” examines how historians who do not identify as digital historians use databases in their research. McDaniel mentions that there are no specific conventions for citing searches performed in databases, for naming which databases were used, or for reporting the results those searches yielded. The lack of a citation convention continues to pose a problem for digital historians, and even for those historians who do not identify with the digital part of the field but use databases while researching. Once again, there is a lack of transparency. We’ve discussed how historians are sometimes hesitant to say that they used digital sources and digital methodologies because they often come under stricter scrutiny than historians who have researched the “old-fashioned” way. We need to abandon our tendency to judge the digital more harshly than the analog: all methodologies should be examined, discussed, and meticulously documented. We need to create standards for citing digital sources, especially the ones mentioned by McDaniel, since all historians engage in such practices whether or not they cite them as such.

