Computational Approaches to “Small Data” in the Digital Humanities

Taking a break from my re-working of past posts on computational approaches to text analysis, I want to make a plea for attention to small data, and for small data to form part of the larger conversation about quantitative analysis in the Digital Humanities. As the range of digitised materials expands through the proliferation of digitised print materials and born-digital texts, big data has come to represent the opportunity for innovation in the Digital Humanities. For those of us who work in areas where corpora are restricted in size (medieval English literature, in my case), this represents a problem. How can those of us working in fields that employ these corpora participate in the growing conversation? Are the Digital Humanities doomed to experience a divide between those who do distant reading on “big data” and those who do close reading through markup and editing of smaller numbers of texts? One way the latter group can bridge the divide is by generating lots of metadata. But that only goes so far, and it doesn’t address the kind of language analysis being done through techniques such as topic modelling.

Small data is very useful because it throws into relief some of the same theoretical questions that (in my view) have yet to be sufficiently addressed in approaches to “big data”. Essentially, quantitative analysis of big data and of small data produces the same kinds of results: statistical output that is frequently explored through graphing and other types of visualisation. There is a great deal of work to be done on the epistemological status of knowledge derived from this approach, particularly its relationship with the “original” texts and the extent (and necessity) of its claims to truth. Any method employing statistics must at the very least come to terms with statistical measures of probability, significance, confidence, and reliability. In general, studies employing more data will be interpreted as having more reliable results, but the criteria by which we judge this in the Digital Humanities really need to be examined. What kinds of thresholds do we actually need? Is a topic model based on five hundred 300-page novels any more appropriate for study than one based on a single five-hundred-line poem? On what basis? How far can we push small data to produce quantitative analyses that fit our criteria for interpretive significance?
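To make the question of thresholds a little more concrete, here is a minimal Python sketch of one standard measure of confidence, the Wilson score interval for a proportion, applied to a word's relative frequency. Everything specific in it is an assumption made for illustration: the function name `wilson_interval`, the 1% frequency, and the token counts (roughly 4,000 tokens for a five-hundred-line poem and 40 million for five hundred novels) are hypothetical, not measurements. The interval also treats each token as an independent draw, which words in a real text are not, so the numbers show how the arithmetic responds to corpus size rather than how much any reading should be trusted.

```python
import math


def wilson_interval(count, total, z=1.96):
    """Approximate 95% Wilson score interval for the proportion count/total.

    Assumes independent observations, which is a simplification for
    running text (hypothetical helper, for illustration only).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = count / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical counts: the same word occurring at a 1% rate in two corpora.
small = wilson_interval(40, 4_000)            # assumed size of a short poem
large = wilson_interval(400_000, 40_000_000)  # assumed size of many novels

small_width = small[1] - small[0]
large_width = large[1] - large[0]

print(f"small corpus: {small[0]:.4f} to {small[1]:.4f}")
print(f"large corpus: {large[0]:.4f} to {large[1]:.4f}")
```

The interval for the small corpus is far wider than the one for the large corpus, which is the formal version of the intuition that more data means more reliable results. What the calculation cannot say is how narrow an interval has to be before a result counts as interpretively significant; that threshold remains a disciplinary judgement.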

I’m very excited about the new opportunities afforded by the investigation of big data, but those who are doing it are ultimately going to have to address the same questions about their methods. The best approach would be to tackle these problems in tandem so that a wider group of Digital Humanists can benefit from quantitative approaches to the study of texts.