Digital Humanities Projects with Small and Unusual Data: Some Experiences from the Trenches

Update, March 15, 2016: This content was selected for Digital Humanities Now by Editor-in-Chief Joshua Catalano based on nominations by Editors-at-Large: Ann Hanlon, Harika Kottakota, Heather Hill, Heriberto Sierra, Jill Buban, Marisha Caswell, Nuria Rodriguez Ortega. The slides have now been integrated, and they can also be seen in a reveal.js presentation here.

This is an edited version of a talk I gave at UC Irvine on February 5, at a Symposium on Data Science and Digital Humanities organized by Peter Krapp and Geoffrey Bowker.

I’ve made the focus of my talk Digital Humanities projects involving small and unusual data. What constitutes small and unusual will mean different things to different people, so keep in mind that I’ll always be speaking in relative terms. My working definition of small and unusual data will be texts and languages that are typically not used for developing and testing the tools, methods, and techniques used for Big Data analysis. I’ll be using “Big Data” as my straw man, even though most data sets in the Humanities are much smaller than those for which the term is typically used in other fields. But I want to distinguish the types of data I will be discussing from the large corpora of hundreds or thousands of novels in Modern English which are the basis of important Digital Humanities work.… Read more…


Digital Humanities as Gamified Scholarship

The Digital Humanities trace their origins back to Father Roberto Busa’s efforts to analyse the works of Thomas Aquinas in the 1940s, which were followed by further efforts to perform textual analysis with the aid of computers. Since that time, the Digital Humanities has expanded to encompass a myriad of other activities (and acquired its name in the process) and a devoted community of practitioners. Nevertheless, doubts persist about whether the growth of the Digital Humanities has had, or has the potential to have, any significant impact on scholarship in the Humanities as a whole. Although I can’t say for certain, my feeling is that when doubters look back at the past, they tend to be thinking primarily of computational textual analysis as the method that has failed to achieve a wide impact. Whether this is a fair assessment of the Digital Humanities, or whether the appropriate criteria have been selected for assessing the significance of even this one area, is worthy of discussion, but my intention here is to look forward, rather than back. Computational textual analysis is beginning to evolve more rapidly, and to become more widely accessible to both students and scholars, meaning that the past should not be taken as an indication of the future.… Read more…


Topic Modelling Early Middle English II

This is a follow-up to my earlier post on Topic Models and Spelling Variation: The Case of Early Middle English. There I discussed the challenges of generating topic models of texts with non-standardised spelling in which topics do not merely correspond to texts. In any given model, differences in the spelling systems of two texts will cause some of the generated topics to be prominent in one text and effectively non-existent in the other, whereas other topics will be non-existent in the former and prominent in the latter. Topics are thus essentially orthographic patterns, rather than rhetorical discourses or indicators of subject matter. My initial experiments showed that some linguistic smoothing (admittedly, a questionable form of textual deformance) can help address this problem, along with increasing the granularity of the model by breaking texts into small chunks. Chunk sizes of about 1000 words began to reveal the sorts of patterns I was looking for: cases where individual topics had high prominence in parts of more than one text. But I speculated that chunk sizes needed to be still smaller in order to make these patterns more noticeable. A second reason for using smaller chunk sizes is that it becomes much easier to make the leap back from distant to close reading.… Read more…
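As a rough illustration of the chunking step described above, here is a minimal Python sketch that breaks a text into consecutive chunks of a fixed number of words before it is passed to a topic modelling tool. The function name `chunk_words` and the placeholder sample text are assumptions made for illustration; they are not taken from the original experiments, and real texts would likely need tokenisation that handles punctuation rather than a plain whitespace split.

```python
def chunk_words(text, size=1000):
    """Split a text into consecutive chunks of at most `size` words.

    Words are delimited by whitespace only (an assumption; punctuation
    stays attached to the neighbouring word).
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    words = text.split()
    # Step through the word list `size` words at a time; the final
    # chunk may be shorter than the others.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Placeholder text of ten dummy words, standing in for a real document.
sample = " ".join(f"w{i}" for i in range(10))

for number, chunk in enumerate(chunk_words(sample, size=4), start=1):
    print(number, chunk)
```

Lowering `size` increases the granularity of the model, at the cost of giving the topic modelling tool less context per chunk.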


Topic Models and Spelling Variation: The Case of Early Middle English

Topic Modelling has developed quite a following in the DH world, but it still has a long way to go before it proves itself a reliable method for literary research. (Caveat: I have not yet read Matthew Jockers’s soon-to-be-released Macroanalysis, which may answer many questions about how to use topic modelling to study literature.) As far as I can tell, topic modelling was originally tested on materials that, although diverse in subject matter, were fairly homogeneous in language. Literary language is problematic for topic modelling not so much because it contains more ambiguities or fuzziness than, say, scientific journals, but because the types of questions literary scholars ask tend to probe at these aspects of language. There’s no reason why we should expect a single, and fairly new, computational method to provide miraculous insight into questions that sustain whole disciplinary fields, but neither is that a reason to assume that it can provide no insight at all. Topic modelling has already shown particular importance in the area of literary history, as can be seen from Jockers’s work, as well as that of people like Ted Underwood and Lisa Rhody. But the results that they have made available share one thing in common with the various topic models of scientific journals, Day of DH posts, PMLA articles, and the like.… Read more…
