As part of our ongoing research into text-mining historical newspapers, we’ve been experimenting with new methods for extracting language patterns scattered across millions of digitized words. One of the most intriguing methods for such work to emerge in recent years is topic modeling. The idea of topic modeling is, at base, to use mathematical and statistical models to identify words that are related to one another and then group them into “topics.” The hope is thereby to expose underlying patterns in the language of large-scale collections that would otherwise be hard, if not impossible, to see.
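To make the idea of grouping related words a little more concrete, here is a minimal sketch of one common topic model, latent Dirichlet allocation, fitted by collapsed Gibbs sampling in plain Python. It is only an illustration: the four tiny "documents" are invented, the function names `toy_lda` and `top_words` are hypothetical, and the parameter values are arbitrary assumptions rather than the software or settings used in our research.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model by collapsed Gibbs sampling.

    docs is a list of tokenized documents (lists of words).
    Returns one Counter of word counts per topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total tokens per topic
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # A topic is likelier if it is common in this document
                # and if this word is common in that topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_words(topic_word, n=3):
    """Return the n most frequent words assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Invented toy documents, for illustration only.
docs = [
    "cotton crop price market cotton".split(),
    "railroad train depot freight railroad".split(),
    "cotton market price crop".split(),
    "train freight depot railroad".split(),
]
topics = toy_lda(docs)
print(top_words(topics))
```

On a collection this small the groupings are unstable and depend on the random seed; the sketch is meant only to show the mechanics of assigning words to topics, not to suggest what results on real newspapers look like.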
For this project, we have concentrated on the popular MALLET software package. We recently presented a paper based on this work, “Topic Modeling on Historical Newspapers,” at the meeting of the Association for Computational Linguistics in June 2011.
Download the paper: Topic Modeling on Historical Newspapers
From Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (2011), pp. 96-104.