The Shocking Truth About OCR

Google OCR – page three of Pinkerton files

Pinkerton – vertical
Pinkerton – cropped
Pinkerton – contrast

At first I left the image horizontal and ran it through Google OCR. It did not recognize any of the characters, so I had no text to examine. I then rotated the image so it was vertical, and Google OCR did recognize characters and words, but with a high degree of inaccuracy. Out of 23 lines of OCRed text, only one line correctly reproduced the words shown in the image. The OCR also cut off the last few lines of text, as well as a few others throughout the document, so the text was incomplete. I then cropped the image and ran it through again, which was a bit more successful: it recognized a few more words than the uncropped version had, and the software was able to OCR all of the lines of text this time. The OCR was still incredibly poor, however. I then tried darkening the contrast, which helped the software recognize a few more words, but the result was not vastly improved.

April and I compared our results and found similar problems, but April had less success with the cropped version of the image than I did. I was shocked that the OCR did not read our files in exactly the same way. We both had poor success with Google OCR, but I would have assumed our OCRed text would have more in common.

In general, the software did not recognize the Qs and As in the document and simply skipped many words. The OCR was so inaccurate because the original image was not of high quality, the text was very small, the typewritten characters were difficult to distinguish because many of them resembled each other, and there was bleed-through from the other side of the page.
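To put numbers on comparisons like these, here is a minimal sketch in Python. It is not what I actually used for this exercise (I compared everything by eye). The function names `line_accuracy` and `similarity` are my own, and the sample lines in the usage note are invented placeholders, not text from the Pinkerton file. It assumes you have typed out a correct transcription to check against.

```python
from difflib import SequenceMatcher


def line_accuracy(ocr_lines, truth_lines):
    """Fraction of transcribed lines that the OCR reproduced exactly.

    Lines the OCR dropped entirely count as wrong, because the total
    is taken from the hand-made transcription, not from the OCR output.
    """
    if not truth_lines:
        return 0.0
    exact = sum(
        1 for ocr, truth in zip(ocr_lines, truth_lines)
        if ocr.strip() == truth.strip()
    )
    return exact / len(truth_lines)


def similarity(text_a, text_b):
    """Rough similarity between two OCR outputs, from 0.0 to 1.0."""
    return SequenceMatcher(None, text_a, text_b).ratio()


if __name__ == "__main__":
    # Invented placeholder lines, only to show how the functions are called.
    truth = ["Q. Where were you that night?", "A. At the hotel."]
    mine = ["Q. Where were you that night?", "At the hote1"]
    aprils = ["Where wore you that night", "A. At the hotel."]

    print("My line accuracy:", line_accuracy(mine, truth))
    print("April's line accuracy:", line_accuracy(aprils, truth))
    print("Similarity of our two outputs:",
          similarity("\n".join(mine), "\n".join(aprils)))
```

By this measure, one correct line out of 23 works out to a line accuracy of about 4 percent. The `similarity` score would be one way to show how far two people's OCR output of the same page actually diverged.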

Chronicling America – Daily Capital Journal, September 15, 1914, Image 1

Original image
OCRed text

The first thing I noticed about the OCRed text is that the software had a hard time recognizing words close to the fold on the left side of the page. This made the first few words of each line in the first column unrecognizable to the OCR software. Whoever scanned the microfilm did not ensure that the page was completely flat. I was disconcerted by how inaccurate the text was. The software didn’t even read the masthead of the paper. It also had a difficult time reading subheadings: “Say Austrians Must Quit” was OCRed as “i 8ajr Austrian Mim Quit.” The software was unable to read the bottom portion of the page as well. Most of the “Uncle Sam Protests” column came out as gibberish, and several of the last lines were missing from the OCRed text altogether.
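One common way to measure how badly a line like that subheading was mangled is a character error rate: the number of single-character insertions, deletions, and substitutions needed to turn the OCR output into the correct text, divided by the length of the correct text. The following is a minimal sketch under that definition. The function names are my own, and I make no claim that Chronicling America measures accuracy this way.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def char_error_rate(ocr_text, true_text):
    """Edit distance divided by the length of the correct text."""
    if not true_text:
        raise ValueError("true_text must not be empty")
    return edit_distance(ocr_text, true_text) / len(true_text)


if __name__ == "__main__":
    true_text = "Say Austrians Must Quit"
    ocr_text = "i 8ajr Austrian Mim Quit"
    print("Edits needed:", edit_distance(ocr_text, true_text))
    print("Character error rate:", round(char_error_rate(ocr_text, true_text), 2))
```

A rate of 0 means the OCR matched perfectly. The rate can go above 1 when the OCR output is much longer than the real text.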

Chronicling America – The Hawaiian Gazette, September 18, 1914, Image 1

Original image
OCRed text

I chose this particular paper because the image is so dark, and I wanted to see how that would affect the OCR. I was surprised to find that the darkness did not seem to hurt the accuracy of the OCR, which was actually better in the darker places of the paper. For example, in the far right column, under the subheading “Blunder Not Known Till Damage is Done,” the text is almost perfectly accurate. There were a few folds in the image, but they didn’t affect the OCR as badly as I had expected. The OCR was able to read the bottom portions of the paper, unlike in the first image I studied, and picked up the “Teuton Cruiser Sinks British Merchantmen” headline very well. Once again, the OCR did not read the masthead.

Chronicling America – The Day Book, September 18, 1914, Image 1

Original image
OCRed text

It was not surprising to find that this image produced almost completely accurate OCR text. The only issue the software had was with the image underneath the title of the paper. The near-perfect OCR can be attributed to a few factors: the readability of the font, the lack of images, the lack of columns, the greyscale, and the lack of folds in the paper. Chronicling America and Google OCR seemed to have the same sorts of problems, but I found the latter to be much worse than the former, simply because the quality of the original Pinkerton file was so poor. While the Chronicling America papers did not produce high-quality OCRed text by any stretch, they were still better than the Google OCR result, because the microfilm scans produced much more readable and searchable text.

Digital resource in my field – The Abbeville Press and Banner, February 8, 1918, Image 6

Original image
OCRed text

The first problem I had with this task was finding a source that allowed users to see the text output. I checked sources in the African American Newspapers (1827-1998) database, the ProQuest Historical Newspapers database, the electronic holdings at the Library of Virginia, the Virginia Historical Society, and elsewhere. I also tried running a few newspaper sources of my own through Google OCR, but it did not recognize any of the characters or words, so that didn’t work. After doing a search on Chronicling America, I was able to find a newspaper article that will be highly useful for my research further down the road. The text did not OCR as well as I had hoped. The image is very clear and easy to read, but the text is nevertheless inaccurate. There are symbols, such as asterisks, placed in the middle of words, and the software did not know what to do with the empty space where one paragraph ends and another begins.

After doing this exercise, it is very clear to me how inaccurate OCR can be. This can be extremely detrimental to historians, because if they do keyword searches in databases and the text is OCRed incorrectly, then they might miss out on an important source. Despite the appalling problems I had with OCR technology, I still think the positive impacts it has on the research process outweigh the negatives.
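As a rough illustration of the keyword problem, a search can be made more forgiving by accepting near matches instead of exact ones. This is only a sketch using Python’s standard `difflib` module. The function name `fuzzy_find` and the 0.75 cutoff are my own assumptions, and I do not know whether any of the databases mentioned above search this way.

```python
import re
from difflib import get_close_matches


def fuzzy_find(keyword, ocr_text, cutoff=0.75):
    """Return words in the OCR text that closely resemble the keyword.

    cutoff runs from 0.0 to 1.0; higher values demand a closer match.
    """
    words = re.findall(r"[A-Za-z0-9]+", ocr_text.lower())
    return get_close_matches(keyword.lower(), words, n=5, cutoff=cutoff)


if __name__ == "__main__":
    ocr_line = "i 8ajr Austrian Mim Quit"
    # An exact search for "Austrians" fails on this line...
    print("Exact match found:", "austrians" in ocr_line.lower().split())
    # ...while the fuzzy search still turns up the near miss.
    print("Fuzzy matches:", fuzzy_find("Austrians", ocr_line))
```

The tradeoff is that a looser cutoff recovers more misread words but also returns more false hits, so a researcher would still need to check each result against the page image.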
