Digital Humanities Projects with Small and Unusual Data: Some Experiences from the Trenches

Update March 15 2016: This content was selected for Digital Humanities Now by Editor-in-Chief Joshua Catalano based on nominations by Editors-at-Large: Ann Hanlon, Harika Kottakota, Heather Hill, Heriberto Sierra, Jill Buban, Marisha Caswell, Nuria Rodriguez Ortega. The slides have now been integrated, and they can also be seen in a reveal.js presentation here.

This is an edited version of a talk I gave at UC Irvine on February 5, at a Symposium on Data Science and Digital Humanities organized by Peter Krapp and Geoffrey Bowker.

I’ve made the focus of my talk Digital Humanities projects involving small and unusual data. What constitutes small and unusual will mean different things to different people, so keep in mind that I’ll always be speaking in relative terms. My working definition of small and unusual data will be texts and languages that are typically not used for developing and testing the tools, methods, and techniques used for Big Data analysis. I’ll be using “Big Data” as my straw man, even though most data sets in the Humanities are much smaller than those for which the term is typically used in other fields. But I want to distinguish the types of data I will be discussing from the large corpora of hundreds or thousands of novels in Modern English which are the basis of important Digital Humanities work.…



How to Create and Cluster Topic Files in Lexos

This post is a follow-up to last year’s How to Create Topic Clouds with Lexos, where I showed how Lexos can be used to visualise topic models produced by Mallet. From time to time, colleagues have wondered whether it would be possible to use Lexos to perform cluster analysis on the topics Mallet produces. The motivation for doing this is simple enough: topics are often very similar, and it would be useful to have some statistical measure of this similarity to help us decide when groups of topics should really be interpreted as a single meta-class. Some added urgency has arisen in discussions for the 4Humanities WhatEvery1Says Project, which is topic modelling a large collection of public discourse about the Humanities. We’ve begun considering whether cluster analysis on topic models can help us to refine our experiments.

The first step is finding a way to massage the Mallet data into a form we can submit to clustering algorithms. Lexos already transforms Mallet output into a topic-term matrix, which is then used to make word clouds using the top 100 words in each topic. Essentially, the topics are treated just like text documents (or at least slices of them).…
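To make that step concrete, here is a minimal sketch in Python, and emphatically not the Lexos implementation, of building a topic-term matrix from Mallet output and clustering the topics. It assumes Mallet’s train-topics was run with the --topic-word-weights-file option and that the resulting file holds tab-separated topic/term/weight lines; the file name and clustering parameters are illustrative.

```python
# A minimal sketch (not the Lexos implementation) of turning Mallet's
# topic-word weights into a topic-term matrix and clustering the topics.
# Assumes train-topics was run with --topic-word-weights-file weights.txt
# and that the file holds tab-separated "topic<TAB>term<TAB>weight" lines.
from collections import defaultdict

import pandas as pd
from scipy.cluster.hierarchy import linkage

weights = defaultdict(dict)
with open("weights.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        topic, term, weight = line.rstrip("\n").split("\t")
        weights[int(topic)][term] = float(weight)

# Keep only the top 100 terms per topic, mirroring the word-cloud step.
top = {t: dict(sorted(d.items(), key=lambda kv: -kv[1])[:100])
       for t, d in weights.items()}

# Rows are topics, columns are terms: each topic is treated as a "document".
matrix = pd.DataFrame(top).T.fillna(0.0)

# Hierarchical clustering of topics by their term-weight profiles.
Z = linkage(matrix.values, method="average", metric="cosine")
```

From here, Z can be handed to scipy’s dendrogram function to see which topics group together.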



How to Create Topic Clouds with Lexos

Some Background

Topic modelling is gaining momentum as a research method in Digital Humanities, with MALLET as the general tool of choice. However, many would-be topic modellers have struggled to make effective use of MALLET’s output, which is raw data. In fact, there has been a growing movement to devise methods of visualising topic modelling data generally. A while back, Elijah Meeks had an idea for generating topic clouds: separate word clouds for each topic in the model. [I can’t seem to access his original blog post, but here is his code on GitHub.] Although word clouds have their problems as visualisations, Meeks speculated that they were particularly effective for examining topics in a topic model. Indeed, others have used word clouds to visualise topic modelling results, most notably Matt Jockers in the digital supplement to his Macroanalysis. One of the things I liked about Meeks’ implementation using d3.js was that it placed the clouds next to each other so that they could be compared.
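As a rough Python analogue of that side-by-side layout (Meeks’ own implementation is the d3.js code linked above; nothing below is his), one can draw one word cloud per topic and arrange the panels in a grid. It assumes a mapping of topic IDs to term frequencies, such as the top dict from the clustering sketch earlier, and uses the wordcloud and matplotlib packages; the grid dimensions are arbitrary.

```python
# A rough analogue (not Meeks' d3.js implementation) of topic clouds:
# one word cloud per topic, drawn in a grid for side-by-side comparison.
# Assumes `top` maps topic IDs to {term: frequency} dicts.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def draw_topic_clouds(top, cols=4):
    rows = -(-len(top) // cols)  # ceiling division
    fig, axes = plt.subplots(rows, cols, squeeze=False,
                             figsize=(4 * cols, 3 * rows))
    for ax, (topic, freqs) in zip(axes.flat, sorted(top.items())):
        cloud = WordCloud(width=400, height=300, background_color="white")
        ax.imshow(cloud.generate_from_frequencies(freqs))
        ax.set_title(f"Topic {topic}")
        ax.axis("off")
    for ax in list(axes.flat)[len(top):]:  # blank any unused panels
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```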

I quickly transferred this idea to our work on the Lexomics project and to our software, Lexos. In Lexomics, we frequently cut texts into chunks or segments, which can then be clustered to measure similarities and differences.…
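For readers who want to see the shape of that workflow, here is a minimal sketch, again not the Lexos code itself: it cuts a text into fixed-length word segments, builds a term-frequency matrix with scikit-learn, and clusters the segments hierarchically. The input file name and segment size are hypothetical.

```python
# A minimal sketch of the cut-then-cluster workflow (not the Lexos code):
# split a text into fixed-size word segments, count term frequencies,
# and cluster the segments hierarchically.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import CountVectorizer

def segment(text, n_words=1000):
    """Cut a text into consecutive segments of n_words words each."""
    words = text.split()
    return [" ".join(words[i:i + n_words])
            for i in range(0, len(words), n_words)]

with open("beowulf.txt", encoding="utf-8") as f:  # hypothetical input file
    segments = segment(f.read())

# Rows are segments, columns are terms.
counts = CountVectorizer().fit_transform(segments).toarray()

# Dendrogram of segment similarities (cosine distance, average linkage).
dendrogram(linkage(counts, method="average", metric="cosine"),
           labels=[f"seg{i}" for i in range(len(segments))])
plt.show()
```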
