This post is a follow-up to last year’s How to Create Topic Clouds with Lexos, where I showed how Lexos can be used to visualise topic models produced by Mallet. From time to time, colleagues have wondered whether it would be possible to use Lexos to perform cluster analysis on the topics Mallet produces. The motivation for doing this is simple enough; topics are often very similar, and it would be useful to have some statistical measure of this similarity to help us decide where groups of topics really should be interpreted under some meta-class. Some added urgency has arisen in discussions for the 4Humanities WhatEvery1Says Project, which is topic modelling a large collection of public discourse about the Humanities. We’ve begun considering whether doing cluster analysis on topic models can help us to refine our experiments.
The first step is finding a way to massage the Mallet data into a form we can submit to clustering algorithms. Lexos already transforms Mallet output into a topic-term matrix, which is then used to make word clouds using the top 100 words in each topic. Essentially, the topics are treated just like text documents (or at least slices of them). Since the Mallet file is uploaded through the Lexos Multicloud tool, this data is separate from the main Lexos file store and can’t be submitted to the cluster analysis tools.
To address this, we have introduced a new feature in the Multicloud tool. The “Convert Topics to Documents” check box creates a new file for each topic and adds these files to the main Lexos file store so that they can be used by any of the Lexos tools, such as Hierarchical Clustering. Each file consists of all the words in the topic (not just the top 100), with each word repeated as many times as it occurs in the topic according to Mallet’s --word-topic-counts-file output.
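The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the Lexos implementation: it assumes each line of the word-topic-counts file has the form "index word topic:count topic:count ...", and the function name, the sample counts, and the output file names are all made up for the example.

```python
from collections import defaultdict


def topics_to_documents(lines):
    """Build one pseudo-document per topic from word-topic-count lines.

    Assumed line format: "<index> <word> <topic>:<count> <topic>:<count> ..."
    """
    docs = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and words with no topic assignments
        word = fields[1]
        for pair in fields[2:]:
            topic, count = pair.rsplit(":", 1)
            # Repeat the word once for every time it was assigned to the topic.
            docs[int(topic)].extend([word] * int(count))
    return {topic: " ".join(words) for topic, words in sorted(docs.items())}


# Invented sample data, not taken from a real model.
sample = [
    "0 regional 0:2",
    "1 bringing 0:1 1:3",
    "2 wednesday 0:1",
    "3 charter 0:4 1:1",
]

documents = topics_to_documents(sample)
for topic, text in documents.items():
    # Hypothetical file names; Lexos may label the topic files differently.
    print(f"Topic{topic}.txt: {text}")
```

In this sketch the words appear in the order they occur in the input file. Since clustering works from word counts, the order should not matter for that purpose.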
The result can be a little odd, since it produces texts like “regional regional bringing wednesday charter charter charter charter”, but for many operations these word strings function like any other text. (Certain functions, such as cutting or n-gram analysis, may not be terribly useful.) But if you want to analyse how similar certain topics are to others, you can head straight over to the Lexos clustering functions. To try it out, you can download a 20-topic model based on the WhatEvery1Says US Patents Related to the Humanities data set created by Alan Liu. Just save the file as a .txt file and upload it in the Lexos Multicloud tool.
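To give a rough sense of what “how similar certain topics are” can mean numerically, here is a small sketch that compares topic documents by the cosine similarity of their word counts. Cosine similarity is just one common choice and is not necessarily the measure the Lexos clustering tools apply; the function name and the three tiny topic strings are invented for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Only words shared by both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Invented topic documents, in the repeated-word form described above.
topic_a = "charter charter school school board"
topic_b = "charter school school district"
topic_c = "patent patent invention"

print(round(cosine_similarity(topic_a, topic_b), 3))
print(round(cosine_similarity(topic_a, topic_c), 3))
```

Topics that share heavily weighted words score close to 1, and topics with no words in common score 0. A hierarchical clustering routine works from pairwise comparisons of this general kind to decide which topics to merge first.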
We hope users find this topic-to-document conversion feature useful, and I hope at a future date to report on the issues and theoretical considerations arising from our experiments with clustering topic model results.