Ways to Use Text Mining

For this week’s practicum, we are using various ngram viewers and comparing them, and then using Voyant. For the first half of this exercise I chose keywords that are of interest to me in my research. I started out the practicum hoping to get a better grasp on what text mining is and how it works, since the topic seemed abstract to me even after the readings and class discussion.

The Google ngram viewer scans the corpus collected in the Google Books project to display a graph showing the usage of a particular word or words. I first ran the following set of words through the viewer: domestic_NOUN, maid, housemaid, maidservant. Neither domestic nor maidservant had been used frequently, and housemaid saw a small percentage of usage from 1860 to the 1940s. Maid had the highest frequency of matches within Google Books; its usage peaked in 1900 and has slowly declined since. I also did a search using maid and housewife. Maid was much more common, and the use of housewife increased a small amount from the 1920s until its peak in the 1980s, and then declined. Google then displayed links in chronological order, such as 1800-1843, 1843-1900, and so on, that, when followed, brought me to a list of books within the Google Books project that had my word or words in them. This is not usually helpful for me as a 20th-century historian, since copyright restrictions do not allow the user to read the entire book online. In addition to not being able to follow through on finding the actual words within the books, there is not enough transparency. How was this ngram viewer put together? What algorithm(s) are used? While the interface is clean and straightforward, it does not allow for user interaction.

Bookworm’s Chronicling America ngram viewer scans the newspapers included within the Chronicling America database. I ran the words housewife, domestic, and maid in the viewer. The results I received were different from Google’s. Domestic had a far greater number of hits, but I had to remind myself that in the Google ngram viewer I was able to state that I only wanted the noun form of domestic. Housewife was hardly used, and the frequency of maid was only a bit higher than that of housewife. One of the greatest advantages of using Bookworm is that it will link you back to the original paper the word was found in, since there are no copyright restrictions. Bookworm’s ngram viewer is more user friendly and the interface is much more interactive than Google’s. You can change the publishing time to be either year, month, day, day and year, month and year, or week and year. The user can also determine the quantity based on the percentage of words, percentage of texts, word count, or text count. The user can decide whether they want the search to be case sensitive or insensitive.

Bookworm's Chronicling America Viewer displaying the results for a search of housewife, domestic, maid.
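
To see how Bookworm’s four quantity options (percentage of words, percentage of texts, word count, text count) and the case-sensitivity setting can give different numbers for the same search, here is a minimal Python sketch. It is only an illustration, not how Bookworm is implemented: the three-sentence corpus and the function names `tokenize` and `measure` are made up for this example.

```python
import re

# Hypothetical toy corpus: each string stands in for one newspaper page.
texts = [
    "The maid and the housewife met the Maid of honor.",
    "Domestic news and domestic service were discussed.",
    "No relevant words appear on this page.",
]

def tokenize(text, case_sensitive=False):
    """Split a text into word tokens, lowercasing unless case matters."""
    if not case_sensitive:
        text = text.lower()
    return re.findall(r"[A-Za-z']+", text)

def measure(word, texts, case_sensitive=False):
    """Return the four quantities for one search word (illustrative only)."""
    if not case_sensitive:
        word = word.lower()
    token_lists = [tokenize(t, case_sensitive) for t in texts]
    word_count = sum(tokens.count(word) for tokens in token_lists)
    text_count = sum(1 for tokens in token_lists if word in tokens)
    total_words = sum(len(tokens) for tokens in token_lists)
    return {
        "word_count": word_count,
        "text_count": text_count,
        "percent_of_words": 100 * word_count / total_words,
        "percent_of_texts": 100 * text_count / len(texts),
    }

print(measure("maid", texts))                       # counts "maid" and "Maid"
print(measure("maid", texts, case_sensitive=True))  # counts "maid" only
```

In this toy corpus a word that appears twice on a single page counts as two words but only one text, which is why a graph of word percentages cannot be read as a graph of how many documents mention the word.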

I also used Bookworm’s Congress.gov ngram viewer to search the percentage of words in Democrat-sponsored and Republican-sponsored bills about abortion. The results showed that Republicans sponsored more bills with the word abortion in them. But I cannot say that Republicans have sponsored more bills about abortion, since the visual analysis shows the percentage of words in the bills themselves. Thus this is not a straightforward analysis. I’m sure that other scholars who have worked with quantitative methodologies previously will not make this mistake. It is important to state explicitly what the graph is actually showing, versus what your initial reaction might be. The graph will link the user to a site that gives the full text of the bill, as well as pertinent information about the sponsors, the bill’s progress or lack thereof through committees, etc. This viewer would be particularly beneficial to those studying legal or political history.

The Congress.gov Viewer is useful for legal and political historians.

The New York Times Chronicle is an ngram viewer that analyzes every issue of the newspaper. I attempted to search housewi* to see the results for both housewife and housewives in one graph, but it didn’t work, so I searched for housewife and housewives separately. The only great disparity between the two sets of results is in the early to late 1940s, when there is a much greater use of housewives than housewife. This viewer links back to the articles in which the searched word was found, but this can get complicated. If neither you nor your institution has a New York Times subscription, then there are certain articles that you will not be able to read. The Chronicle has a somewhat friendly user interface. It’s much better than Google’s but not as interactive as Bookworm’s. You can choose whether you want your results to be displayed based on the percentage of total articles or the number of articles.

Truncated searching would be extremely useful for all ngram viewers.
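
As a rough idea of what a truncated search such as housewi* would need to do, here is a minimal Python sketch. The word counts are invented and `truncated_search` is a hypothetical helper, not a feature of any of the viewers discussed here. It uses the standard library’s `fnmatch` module, in which `*` matches any run of characters.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical word counts for one year of articles (made-up numbers).
counts = Counter({"housewife": 12, "housewives": 30, "house": 95, "maid": 40})

def truncated_search(pattern, counts):
    """Find every word matching a pattern like 'housewi*' and sum the counts."""
    matches = {word: n for word, n in counts.items() if fnmatchcase(word, pattern)}
    return matches, sum(matches.values())

matches, total = truncated_search("housewi*", counts)
print(matches)  # the individual word forms that matched
print(total)    # their combined count, which could be plotted as one line
```

Here housewife and housewives are combined into one figure, while house is left out because it does not begin with housewi.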

I received different results in each ngram viewer because each of the viewers scans a different corpus of information. Google Books will have the word housewife appear in various books over time, but those results may or may not correlate with what has appeared in the Chronicling America papers or the New York Times. Also, Google allows users to specify the part of speech of the word, which was incredibly helpful. In general I found these viewers to be fun and neat to play with, and they did help me get a better feel for what text mining is. I’m still not entirely sure to what extent text mining will be useful to me in my research. Also, these viewers should be used as tools, not as methodologies, and I will need to be careful should I choose to use them in my research. They can be utilized in conjunction with scholarly research but should not be the sole method for proving one’s argument. The analysis and interpretation of the graphs can differ, and it is important to ensure that, as a scholar and researcher, you are reading the graph the way it was meant to be read. I also think that all of the ngram viewers should have the capability of search truncation. There is not enough transparency with these viewers, either. The problems I had with Google were never fully resolved by the other viewers, with the exception of interface interactivity.

I ran the magazine version and the novel version of The Picture of Dorian Gray through Voyant, which is a web-based tool that analyzes digital texts. Voyant is much more than an ngram viewer and is by far a better tool because there is more information readily available for the user, and the interface is customizable. The summary details the number of words in the documents; vocabulary density; most frequent words in the corpus; words that had notable peaks in their frequency; and distinctive words. There is also a list of words in the entire corpus, and the user can decide whether or not he/she wants to include stop words. Within this window the trend is shown by a small icon. For The Picture of Dorian Gray the five most frequently used words, in descending order, are: Dorian, said, lord, life, and Henry. Voyant can also trace the word trends. There are two other categories: keywords in context and words in documents. The latter displays even more information, showing the raw count and the relative count, and the trend graph shows the values of the mean relative counts across the corpus. I really enjoyed playing around in Voyant, but I do have one complaint. While I appreciate the interface being so interactive, it seems a bit dated and very crowded. The experience would be greatly enhanced if the user could choose which of the many boxes they want to see on the screen at one time. I know the boxes can be minimized, but that doesn’t solve the problem. For example, I can minimize the corpus reader box, but that leaves a gaping hole in the middle of the site where other graphs or information could go should I want them to. Ultimately, I think Voyant will be incredibly useful to come back to when I’m conducting my own research, especially since I have control over the corpus. I do have a much better sense of what text mining is and how it works now that I’ve played around with the ngram viewers and Voyant. I am certain I’ll use Voyant in the future but am still undecided about the ngram viewers.
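
For anyone curious about what figures like the ones in Voyant’s summary boil down to, here is a minimal Python sketch of word count, vocabulary density, most frequent words with stop words removed, and raw versus relative counts. It rests on several assumptions: the two-sentence sample text is a made-up stand-in for the novel, the stop word list is a tiny illustrative one rather than Voyant’s, vocabulary density is taken to mean unique words divided by total words, and the relative count is expressed per 1,000 words, which is an arbitrary choice for this sketch.

```python
import re
from collections import Counter

# Hypothetical stand-in text; a real run would load the full novel.
text = "Dorian said that life was art. Lord Henry said that Dorian was life."

# A tiny illustrative stop word list, not Voyant's actual list.
STOP_WORDS = {"that", "was", "the", "a", "and", "of"}

def summarize(text, stop_words=frozenset(), top_n=3):
    """Compute simple summary statistics for one document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    # Vocabulary density (assumed definition): unique words / total words.
    density = len(set(tokens)) / total
    # Drop stop words before ranking, but keep them in the total.
    kept = [t for t in tokens if t not in stop_words]
    raw_counts = Counter(kept)
    # Relative count: occurrences per 1,000 words of the whole text.
    relative = {word: 1000 * n / total for word, n in raw_counts.items()}
    return {
        "total_words": total,
        "vocabulary_density": density,
        "top_words": raw_counts.most_common(top_n),
        "relative_counts": relative,
    }

summary = summarize(text, STOP_WORDS)
for key, value in summary.items():
    print(key, value)
```

Calling `summarize(text)` without a stop word list shows how quickly words like that and was crowd into the top of the ranking, which is why the stop word setting matters when reading a list of most frequent words.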




