Digitization: Accuracy, Integrity, and Transparency

When I was working on my Master’s degree, I had the opportunity to work on a fairly large-scale digitization project: the Archive of European Integration, which digitizes official European Union documents and puts them online. As part of the project, together with other Master’s students, I disbound documents, scanned them using a scanner with an automatic sheet feeder, and then page-checked and bookmarked the PDF versions. The documents then had to go through quite a few more steps before being put online for others to access, including being run through OCR software. Despite being highly repetitive, this job gave me invaluable experience with digitization and made me much more appreciative of the amount of work that goes into such efforts.

There are both positive and negative aspects of digitization. One of the disadvantages is that OCR technology is not fully accurate. In “Deciding Whether Optical Character Recognition is Feasible,” Simon Tanner gives an example of just how flawed OCR can be: if a hypothetical page contained 500 words and 2,500 characters, and the OCR were 98% accurate, then 50 of those characters would be incorrect. This can create a host of problems for researchers. Ian Milligan, in “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997-2010,” states that OCR errors can alter research, lead to missed hits when searching, and lay a flawed foundational layer beneath historical research. Bob Nicholson offers one remedy for OCR problems in “The Digital Turn”: the British Newspaper Archive enables users to correct OCR errors manually and to tag articles with their own keywords. Crowd-sourcing of this kind might be one way to deal with the problem. Tanner notes that OCR is only one way to deliver text to a user. Why do we rely so heavily on OCR rather than on other methods? Is it the most accurate option (despite its inaccuracies)? Regardless of whether it is, what are the alternatives to OCR, and what are their drawbacks?
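To make Tanner’s arithmetic concrete, here is a minimal sketch of the calculation in Python. The page figures come from Tanner’s example; the five-characters-per-word average and the assumption that errors fall independently are simplifications of my own, not claims from the article.

```python
# Tanner's hypothetical page: 500 words, 2,500 characters, 98% OCR accuracy.
chars = 2500
words = 500
char_accuracy = 0.98

# Character level: 2% of 2,500 characters come out wrong.
wrong_chars = chars * (1 - char_accuracy)
print(f"Expected incorrect characters: {wrong_chars:.0f}")  # 50

# Word level (my assumptions: errors independent, ~5 characters per word).
# A word is only searchable if every one of its characters is correct.
chars_per_word = chars / words
word_accuracy = char_accuracy ** chars_per_word
print(f"Approximate word-level accuracy: {word_accuracy:.1%}")             # ~90.4%
print(f"Expected misrecognized words: {words * (1 - word_accuracy):.0f}")  # ~48
```

The word-level figure suggests why Milligan worries about missed hits: under these assumptions, a “98% accurate” scan still garbles roughly one word in ten, and every one of those words is invisible to a keyword search.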

Yet another problem with digitization is the loss of integrity that occurs when a source is digitized. Whenever an analog source becomes digital, some data is lost. How can we retain the integrity of the original once it has been digitized? Newspapers provide a great example of this loss. They originally circulated as single issues; those issues were then gathered into bound volumes, and the volumes were later microfilmed. Each of those remediations places the user one step further from the original text. We need to be especially careful to collect as much information as we can prior to digitizing in order to mitigate the loss of data. This is one reason metadata is so vitally important to researchers: the more information you have about a source, the better informed you are, and the higher the quality of your research will be.
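As an illustration of the kind of metadata I have in mind, here is a hypothetical record for a digitized newspaper page, sketched in Python with Dublin Core-style field names. The specific issue, dates, and field choices are my own invention, not drawn from any of the readings.

```python
# A hypothetical metadata record for a digitized newspaper page.
# Recording the full chain of remediations (issue -> bound volume ->
# microfilm -> scan) preserves information the digital file alone loses.
record = {
    "title": "The Example Gazette, 12 March 1923, p. 4",  # hypothetical issue
    "type": "newspaper page",
    "format_original": "newsprint, single issue",
    "provenance": [
        "bound into quarterly volume, 1923",
        "microfilmed, 1968 (black and white)",
        "scanned from microfilm at 400 dpi, 2012",
        "OCR applied, uncorrected",
    ],
    "source": "microfilm reel 47",  # hypothetical identifier
    "rights": "public domain",
}

for field, value in record.items():
    print(f"{field}: {value}")
```

A researcher reading this record knows at a glance that the colors, margins, and any blank pages of the original issue may not survive in the scan, which is exactly the kind of informed judgment metadata makes possible.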

Another issue brought up in this week’s articles is transparency in the research process. This is something that I, as both a librarian and a historian, am passionate about. While it is not common practice in the historical discipline, I think it is highly important for researchers in any field to discuss their search strategies and results: the keywords they used, what was successful, and what was not. All scholars should be able to describe in explicit detail how they arrived at a specific source so that colleagues can replicate the search. This would allow historians to collectively identify the problems they face and then work together to find solutions. Transparency in research is essential, even more so now that we are using digital sources. When we create bibliographies, why do we not cite digital sources, or at least document the search process that led us to them? This should become a “best practice” for historians in the 21st century.
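To show what such documentation might look like in practice, here is one possible shape for a search-log entry, again sketched in Python. The fields, the query, and the result counts are entirely hypothetical; this is not an established citation standard, only an illustration of the level of detail I mean.

```python
# A hypothetical search-log entry: enough detail that a colleague
# could re-run the search and judge how a source was actually found.
search_log_entry = {
    "date": "2014-02-03",                          # hypothetical
    "database": "British Newspaper Archive",       # named in Nicholson's article
    "query": '"european integration" AND treaty',  # hypothetical keywords
    "filters": "1950-1960, English-language titles",
    "hits": 212,                                   # hypothetical result count
    "kept": ["leader column, 4 May 1955"],         # hypothetical
    "notes": "OCR errors likely depress hits; retry with common misreadings",
}

print("\n".join(f"{key}: {value}" for key, value in search_log_entry.items()))
```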

There are many ways in which digitization has made the research process easier. Both Nicholson and Sarah Werner, in “When Material Book Culture Meets Digital Humanities,” highlight the positive aspects of digitization and OCR technology and the ways in which they benefit the study of material culture. Werner’s article, a highly engaging read, shows how microfilm does not always capture what it should, since it reproduces pages only in black and white. She also examines the ways that digital tools can be used to study the physical characteristics of texts, a use of digitization that would not have occurred to me but is fascinating to read about. Nicholson argues that we are on the cusp of a “digital turn” in scholarship, driven by these new technologies and the new research possibilities they create. Instead of focusing on the negative aspects of digitization, he believes it is far more beneficial to focus on the opportunities that have been opened up.

Overall, digitization and OCR technologies have made research much easier, although inaccuracies and loss of integrity are serious obstacles that should be studied and remedied. In addition, transparency in research should be promoted and encouraged until it becomes standard practice across all disciplines.

In response to Manoff’s article, I do believe that libraries have adequately met the challenges posed by digital technologies and continue to keep themselves abreast of technological changes and advances. Because the article is now somewhat dated, I feel that the issues it raises have largely been dealt with, which leaves me little to add. What I would really like to see is a follow-up to the original article: would Manoff agree or disagree with me?
