We seem to have moved on from the question of ‘what is digital history’ to how to perform and use digital history. The topics for this week’s readings are text mining and topic modeling. I was eager to read the articles this week because text mining and topic modeling are two phrases I’ve heard used but have never known what they actually mean. Close reading, distant reading, and ngrams (as well as ngram viewers) are other terms from this week that I need to take some time to examine more closely. I wasn’t sure what they were or how they are used in the field of digital humanities, so this week’s blog post will focus on these new terms and how the articles defined and discussed them. Some of these concepts remain a bit abstract to me, despite articles like Blevins’ and Kaufman’s that demonstrate their application, so these definitions might not be fully accurate.
Text mining, a form of data mining applied to text, is a quantitative method that analyzes words from a large corpus. It is a tool that can be used to understand the history of culture, and it can complement the ways we already organize historical information. Text mining can be used as an exploratory technique to determine which areas of a research topic need to be fleshed out and written up using more traditional methods. Searching is a form of text mining, but it is not a pure form, in that it only shows users what they already expect to find. Scholars have questioned the validity of this method, since meaning can be found in a much wider variety of cultural objects, not simply text, and since words are ambiguous and dependent entirely upon context, which a computer cannot understand. Much of the statistical machinery underlying text mining is Bayesian. Underwood argues that text mining can be used to help scholars think more deeply about existing practices of algorithmic research.
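To make this less abstract for myself, here is a minimal sketch of text mining in its simplest form: counting word frequencies across a corpus. The two one-sentence “documents” below are invented for illustration, not drawn from any of the readings:

```python
# A toy text-mining example: tokenize a tiny invented corpus and
# count word frequencies across all documents.
import re
from collections import Counter

corpus = [
    "The railroad changed the American West.",
    "The telegraph and the railroad spread together.",
]

# Tokenize: lowercase each document and split on non-letter characters.
words = []
for doc in corpus:
    words.extend(re.findall(r"[a-z]+", doc.lower()))

counts = Counter(words)
print(counts.most_common(3))  # the most frequent words across the corpus
```

Real projects work at a vastly larger scale and with more sophisticated statistics, but the basic move is the same: turning prose into countable units.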
Topic modeling is a computational method that surfaces patterns in text. It uses an algorithm to organize the language of a collection of texts into clusters of terms that tend to occur in the same contexts; these clusters are the “topics.” The modeling software can then produce a visual map showing which clusters of words appear most often across a set of documents. This method can reveal patterns throughout a collection that a reader might not otherwise see. Rather than grouping documents that share the same words, topic modeling groups the words themselves according to the documents in which they appear together. The method is popular in part because the resulting maps are easily interpreted, and because it allows the researcher to examine larger trends across the corpus rather than analyzing individual documents one at a time. According to Nelson, topic modeling is incredibly useful for identifying patterns that scholars cannot yet explain, which in turn prompts new research questions.
Close reading is the more traditional of the two approaches and involves the careful examination of a small body of text. Distant reading is computer-assisted and draws on a large collection of texts, focusing on the ways content and meaning emerge at scale. Topic modeling is one way of practicing distant reading. Some of the authors framed close and distant reading as a dichotomy, but all agreed that the two should be used in conjunction to produce more in-depth, complex research findings. Digital methodologies must therefore allow scholars to move easily between close and distant reading.
An ngram is a sequence of n consecutive words, and an ngram viewer is an application of text mining that charts how often those sequences occur in a corpus. The authors of the articles seem to have an overwhelmingly negative view of Google’s ngram viewer. The viewer ignores the meaning of words when it searches, so the words are taken out of context, which is not helpful when conducting research. In addition to the critique regarding context, Gibbs and Cohen argue that the viewer lacks transparency and gives users little ability to interact with the data behind the interface.
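Extracting ngrams turns out to be the simplest of this week’s concepts once written down. A small sketch, using an invented sentence (a viewer like Google’s then counts how often each such sequence appears across millions of books, year by year):

```python
# Extract word ngrams from a token list: every contiguous run of n words.
def ngrams(tokens, n):
    """Return the list of contiguous n-word sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the history of the american west".split()
print(ngrams(tokens, 2))  # bigrams: each overlapping pair of words
```

The context critique is visible even here: the bigram (“the”, “american”) carries no sense of what the sentence is actually about.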