In my first experiment with Mallet, the Command Line would not run the program or even list the items in the directory. Mallet was in my C:\ drive as instructed, but I was missing the Java installation required to run the program. Similar failures occurred throughout the process, from trying to convert txt files into a readable Mallet file, as recommended in the Programming Historian tutorial (a great tutorial, by the way; I highly recommend it), to simply listing what was in a specific directory or txt file. But with patience and a lot of back and forth between the tutorial, the GUI of the Mallet directory, and my command prompt, I was able to extract the information I wanted and get some pretty interesting results.
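For anyone who hits the same wall, a small pre-flight check along these lines might save some of that back and forth. This is only a sketch under assumptions: the `C:\mallet` location, the function names `preflight` and `import_command`, and the folder names are all hypothetical, and the script only builds the import command rather than running it. The `import-dir` options shown are the ones the Programming Historian tutorial uses.

```python
import shutil
from pathlib import Path

# Assumption: Mallet was unpacked to C:\mallet; adjust to your own setup.
MALLET_HOME = Path(r"C:\mallet")


def preflight(mallet_home, which=shutil.which):
    """Return a list of readable problems; an empty list means ready to go."""
    problems = []
    # Mallet is a Java program, so a missing Java install stops everything.
    if which("java") is None:
        problems.append("Java was not found on PATH; install it before running Mallet.")
    # The launcher scripts live in the bin folder of the Mallet directory.
    if not (Path(mallet_home) / "bin").is_dir():
        problems.append(f"No bin folder under {mallet_home}; check where Mallet was unpacked.")
    return problems


def import_command(mallet_home, input_dir, output_file):
    """Build the command that turns a folder of txt files into a .mallet file."""
    return [
        str(Path(mallet_home) / "bin" / "mallet"),
        "import-dir",
        "--input", str(input_dir),
        "--output", str(output_file),
        "--keep-sequence",
        "--remove-stopwords",
    ]


if __name__ == "__main__":
    for problem in preflight(MALLET_HOME):
        print("Problem:", problem)
    # Hypothetical folder and file names, shown only to illustrate the shape.
    print(" ".join(import_command(MALLET_HOME, "my-texts", "my-texts.mallet")))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; the printed line can then be pasted into the command prompt from inside the Mallet directory.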
The documents I analyzed were just for the purpose of the exercise and won't help me in my research, but I'm starting to see how text mining might be useful in the future. I guarantee that whatever files I deal with will have to be cleaned to make them suitable for these tools, but the process could reveal some interesting connections that I would not normally have considered. If I do come across a large corpus, these tools will allow a deeper analysis and will also help me understand exactly how the documents are connected and how they might be useful.
The content of the documents can also be made clearer by tools such as Overview. I did find this tool easier and more appealing than Mallet, in part due to my own avoidance of the Command Prompt (it's a bit hard for me to read the text in this format, even if I open it in a Notepad or Word doc). But I could see its limitations. The tool is very specific in what it can do with text files and with text modeling. It shows the same strings of words that Mallet does, but a portion of the original connections is lost. Overview breaks down the texts to a greater degree and makes distinct divisions between different documents, which is useful for understanding what kind of files you might be dealing with, but it may not reveal as clearly how the documents connect or how certain topics are addressed in the material as a whole. This view of Overview might just be due to my own novice status with the program, but that was my initial impression nonetheless.
My own preferences and views of these programs aside, I'm starting to feel how much my toolkit has grown over this past semester and how my general understanding of and approach to these once-foreign tools have developed. It's a good feeling. I still have a long way to go (there are many more failures to make and moments of frustration to experience), but it's a start.