Sunday, October 19, 2014

Digital_Humanities, Weekly Update 10/13: A Mess of Words

This post is a bit late; hopefully no one will mind too much. I've been playing around with Voyant, Google Ngram, and Bookworm recently. Bookworm remains the most stubborn, but, mechanically, I've adjusted to the former two with ease. The question remains, however, of how I will integrate these tools into my research, and whether the effort of using them will result in an adequate payoff.

In Laura Turner O'Hara's post, "Cleaning OCR'd text with Regular Expressions," [link] she discusses the process of cleaning up a text document in order to use it as a CSV file. She admits that the work needed to clean up text files before they are ready for analysis can be taxing, and may not be worth the effort compared to interpreting texts without the aid of digital analysis tools. I have to agree with her on this point. In Voyant, I loaded several editions of the Daily Lobo's archives from 1890 to 1910. I've used this collection before for research, and all of the available editions are scanned and searchable by keyword. The PDF files are saved in a format that recognizes text, which is why they worked in Voyant. I attempted to do the same with other newspapers through different online archives, but those PDF files and records did not have the same attributes, illustrating one main problem with using these digital tools: even if a newspaper is digitized and searchable by keyword, the file itself does not guarantee compatibility.
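To give a sense of what this kind of regex cleanup looks like in practice, here is a minimal Python sketch. The patterns and the sample string are my own illustrative assumptions, not taken from O'Hara's lesson; real cleanup has to be tailored to the quirks of each scanned source.

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Apply a few simple regex fixes typical of OCR cleanup.

    These patterns are illustrative only: each scanned collection
    has its own recurring errors that the patterns must target.
    """
    text = raw.replace("\r\n", "\n")
    # Rejoin words hyphenated across line breaks: "enfranchise-\nment"
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    # Drop stray non-text characters that OCR often introduces
    text = re.sub(r"[^\w\s.,;:'\"!?()-]", "", text)
    # Collapse runs of spaces left by column layouts
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "The question of enfranchise-\nment   was %% debated."
print(clean_ocr_text(sample))
# → The question of enfranchisement was debated.
```

From output like this, rows can then be assembled into a CSV with Python's standard `csv` module, which is the general destination O'Hara describes.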

I was expecting this issue, however, so it didn't cause much alarm. For the sake of the exercise, I stuck with the Daily Lobo newspapers and selected several ranging from 1890 to 1910, primarily to see whether the issues of New Mexico statehood and the enfranchisement of Hispanic males and females were common topics that fluctuated throughout this period. Neither topic, however, appeared on my word map (they were contained in individual articles, just not at a high enough frequency to register). Instead, my word map looked something like this:


This prominently displayed a second issue with using such tools on archival material. Not only were common words, such as "the," featured more prominently than any word that would actually generate interest, but the OCR'd text was also riddled with errors. Just because a document is a PDF with recognized text does not mean the text was read correctly. This is where cleaning and manipulating data becomes crucial.

After establishing several stop words that removed common words and OCR errors, my word map started to develop into something more significant. It still did not tell me much, aside from revealing the limitations of using a university newspaper, which yielded more references to words such as "varsity" and "university" than to the events I considered significant.
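The stop-word step above can be sketched in a few lines of Python. The stop list here is hypothetical, mixing common words with invented OCR misreadings ("tho" for "the," "aud" for "and"); a real list would be built by inspecting the corpus, much as I did interactively in Voyant.

```python
import re
from collections import Counter

# Hypothetical stop list: common words plus recurring OCR errors.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "on", "tho", "aud"}

def top_words(text: str, n: int = 5):
    """Count word frequencies after dropping stop words --
    roughly what Voyant does once a stop list is applied."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

sample = ("The varsity team of the university met. "
          "Tho varsity phone rang aud the university closed.")
print(top_words(sample, 3))
```

With the stop words stripped out, "varsity" and "university" rise to the top of this toy sample, mirroring how my cleaned word map came to be dominated by campus vocabulary.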


It does, at least, paint a more interesting picture and potentially raises some questions that could encourage further research. The significant frequency of the word "phone," for instance, seems unusual, and while this word does not come close to my original intended query of statehood and suffrage, it at least shows how such tools can be useful. As Ted Underwood notes in his blog post, "Where to Start with Text Mining," [link] tools such as Voyant and Google Ngram can provide insights into textual sources which might otherwise be overlooked. This can be the simple generation of a future research question, or it could be something more complex that challenges historians' interpretations and estimations, as my text mining exercise illustrates. I had originally hypothesized that, given the debate over statehood, there would be a significant frequency of its mention in these records. I was wrong. This could be due, in part, to the editions I chose rather haphazardly, and could also be a result of the limitations of the tools. Voyant shows the frequency of words, for example, but, unlike Google Ngram, does not display the changing rate at which a term is mentioned over time, so any rising discussion stays invisible. In my opinion, Voyant is better suited as a tool for looking at a wide body of textual documents to see if a pattern emerges; any preformed conception or intent should not be the driving force behind using it. Other text mining tools have their own limitations and strengths, which encourages using multiple tools to analyze the same body of texts. Interesting conclusions might be drawn from the results, but there is a lot of work that needs to be completed before any interpretation of data can truly occur.
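The "changing rate" measure that Voyant's summary leaves out is simple to compute by hand: for each year, divide a term's count by the total word count, which is essentially what Google Ngram plots. A minimal sketch, using an invented toy corpus rather than actual Daily Lobo text:

```python
import re

def relative_frequency(texts_by_year: dict, term: str) -> dict:
    """For each year, return the term's count divided by the
    year's total word count -- an Ngram-style rate over time."""
    rates = {}
    for year, text in texts_by_year.items():
        words = re.findall(r"[a-z]+", text.lower())
        total = len(words) or 1  # avoid division by zero on empty years
        rates[year] = words.count(term) / total
    return rates

# Hypothetical toy corpus, one string per year.
corpus = {
    1890: "statehood debated today statehood question open",
    1900: "phone installed on campus today",
}
print(relative_frequency(corpus, "statehood"))
# → {1890: 0.3333333333333333, 1900: 0.0}
```

Plotting these rates across the full 1890-1910 run would show whether discussion of statehood actually rose or fell, rather than just its overall frequency.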

Cleaning a document, of course, is key, and it is a daunting process. In the interest of time, I did not even attempt to clean up the texts I used. For more serious research outside of this exercise, however, I may dabble in textual cleaning and make the mess of words form coherent sentences that can then be analyzed. In spite of the questionable yield of this work, the exercise alone is valuable to any historian. At a very simple level, it can open up new questions and insights. As a community, making such archival documents and information (copyright permitting) available can also enrich the field and allow other historians and digital humanists to investigate in new ways. And if a finding from text mining does support your research, it can be a truly significant addition to a body of work. Text mining does need refinement, and my own abilities with these tools need further development, but its potential as a tool of historical analysis is apparent.


