Some Thoughts on Combining Close and Distant Reading, Markup and Algorithms

I’m a little under the weather, so this post might not be as coherent as I’d like, but I want to get it up before I get overwhelmed by what is likely to be a very busy few days.

Over the weekend, I decided that an interesting exercise for my students reading the Alliterative Morte Arthure would be to have them compare two very different approaches to the poem, Kateryna Alexandra Rudnytzky’s article on Arthur’s battle with the giant of Mont Saint Michel, and Patricia DeMarco’s “An Arthur for the Ricardian Age”. The one examines the poem in terms of the transformation of its source material and connections with literary analogues; the other focuses on the poem’s engagement with military history. Both approaches add depth to our understanding of the text and its place in the medieval literary and cultural world, yet they are based on exactly the sorts of observations that students cannot make because they have not had the opportunity to read widely. Students are forced to read a few texts, those for which there is time during a single semester, in a virtual vacuum. Naturally, that’s why we have professors: to assign secondary literature and to draw students’ attention to this type of knowledge in class. But it’s a poor substitute for the type of exposure that students need to read literary texts (especially ones as historically remote as medieval texts) with much critical insight, let alone aesthetic enjoyment.

Enter distant reading. If students could make use of data-based observations derived from a much wider variety of texts—as opposed to synthesised/simplified professorial pronouncements—they might be able to gain a greater contextual understanding of the literature they are reading more closely. These generalities need not be innovative; they may amount to nothing more than observations scholars have already made through close reading. But the point is that students don’t know this information, and they generally can’t learn it by traditional means unless they have the luxury of pursuing a PhD. To my mind, this limits the kinds of questions they can ask. If distant reading can help bridge the knowledge gap between students and scholars, we need to cultivate it as a teaching method.

This is not to say that there is no potential for insights of greater interest to scholars. The oft-repeated criticism that computational approaches to literature only tell us what we already know is not quite true. When that happens, it is generally due to limitations in the data, the methodology, or the types of questions we are asking. To date, distant reading has tended to rely on fairly narrow sets of metadata (authorship, date, country of origin, and so on). Ideally, there would be richer metadata, and the texts themselves would have deeper forms of tagging so that we could investigate, say, how many monsters in Middle English literature get their genitals sliced off (which is what happens to the giant at Mont Saint Michel). In the absence of such specific markup, it becomes necessary to find ways to approximate it algorithmically. Topic modelling is a promising approach, but there are certainly others.
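By way of illustration, here is a minimal sketch of what approximating such markup with topic modelling might look like, using scikit-learn’s LatentDirichletAllocation. The toy corpus and the number of topics are placeholders of my own, not a real Middle English dataset or an actual workflow:

```python
# A minimal sketch: approximating thematic "markup" algorithmically with
# topic modelling (scikit-learn's LDA). The toy corpus and parameters are
# placeholders, not a real Middle English dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the giant fought arthur on the mount and was slain",
    "the king held feast and council with his knights",
    "the host crossed the sea to wage war in france",
]

# Bag-of-words counts; real medieval texts would first need lemmatisation
# and spelling normalisation, which is its own problem (more below).
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit a small topic model; n_components is arbitrary here.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Print the top words per topic -- a rough, unsupervised stand-in for the
# thematic tags (monsters, combat, dismemberment...) the markup lacks.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```

The unsupervised topics will never be as precise as a hand-applied tag, but they give us something countable across a corpus far larger than anyone could encode by hand.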

What we are talking about is creating the conditions for investigating nuance through computational means. We will need to attack the problem at both ends of the spectrum, providing increasingly dense markup and finding ways for statistical analysis to take up where metadata and markup leave off. Indeed, a coordinated approach would be most beneficial—too often those working on markup standards and those working on statistical analysis are not in dialogue. Dealing with the non-standardised spellings of medieval English (my hobby horse) is an obvious problem that could benefit from a two-pronged approach. In order to understand the rich, multi-textual nature of medieval English literature, we really need to be able to toggle between lemmatised and unlemmatised word forms, selectively collapse dialectal distinctions, and perform other types of manipulation that can only come from fine-grained analysis. Our markup must be designed to generate data that can then be used to visualise patterns. I have experimented with tagging lemmatised and normalised spellings, as well as diplomatic and critical texts, for the Lexomics project. The notion of markup that serves algorithmic analysis as well as searchability and display is also part of the nascent Archive of Early Middle English project (currently seeking funding).

But we have a long way to go in figuring out the best ways of combining markup-based and algorithmically-based research. I think we need to conceive of a text so dynamic that it might have, for instance, multiple topic models embedded within it. A reader might mouse over the text, creating new structural divisions, generating topic models, and tagging words with their topic prominence, all on the fly. A user of a large archive might generate a network graph with communities and then watch what happens to them as individual documents are added to or removed from the literary landscape.
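To make the toggling concrete, here is a small sketch that reads TEI-style <w> elements and renders whichever textual layer the reader selects. The sample line and the choice of attributes (@lemma for the headword, @norm for a normalised spelling) are my own illustration, not the Lexomics or Archive of Early Middle English encoding:

```python
# A sketch of toggling between text layers encoded in TEI-style markup.
# The sample and the attribute names (@lemma, @norm) are illustrative only.
import xml.etree.ElementTree as ET

SAMPLE = """<l>
  <w lemma="king" norm="king">kyng</w>
  <w lemma="comely" norm="comly">comlych</w>
  <w lemma="be" norm="was">watz</w>
</l>"""

def render(line_xml: str, layer: str = "text") -> str:
    """Return the line in one layer: 'text' (diplomatic surface forms),
    'norm' (normalised spellings), or 'lemma' (lemmatised forms)."""
    root = ET.fromstring(line_xml)
    words = []
    for w in root.iter("w"):
        if layer == "text":
            words.append(w.text)
        else:
            # Fall back to the surface form if the layer is missing.
            words.append(w.get(layer, w.text))
    return " ".join(words)

print(render(SAMPLE, "text"))   # kyng comlych watz
print(render(SAMPLE, "norm"))   # king comly was
print(render(SAMPLE, "lemma"))  # king comely be
```

Collapsing dialectal distinctions could work the same way: map several attested spellings onto a shared attribute value, then count, model, or visualise at whichever layer suits the question.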

This is where my ruminations about how we might expand the potential of computational approaches—for both teaching and research—converge with what appears (to go by a flurry of tweets in the last couple of hours) to have been the topic of Julia Flanders’ and Matthew Jockers’ keynote presentation at the Boston Area Days of DH 2013. Bringing these two scholars together (or these two scholars coming together) is a stroke of genius. Flanders represents the TEI, the gold standard of markup based on close textual analysis, whereas Jockers is pioneering methods of distant reading and algorithmic criticism. I greatly rue the fact that I was not in Boston to hear them speak, and I do hope they will make their keynote presentation public in some form as soon as possible (fully marked up and topic modelled). According to Twitter, Flanders called for a text analysis tool that uses TEI, and I want to second that. But there is also a need for a tool that marks up texts based on the results of algorithmic analysis. There is some fabulous work to be done in combining these two approaches. I’d like to imagine such a tool further, beyond the couple of sentences above, but that’s going to have to wait for another post.
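To gesture at what that second tool might do, here is a hedged sketch of the round trip: writing per-word topic scores from a model back into the markup, so that the algorithmic layer becomes queryable alongside the hand-encoded one. The @topic and @prominence attributes are invented for illustration and are not TEI conventions:

```python
# Sketch of the reverse direction: writing algorithmic results back into
# markup. Each <w> is annotated with the topic in which its (hypothetical)
# score is highest. The @topic and @prominence attributes are invented.
import xml.etree.ElementTree as ET

SAMPLE = '<l><w lemma="giant">geaunt</w><w lemma="fight">fauht</w></l>'

# Stand-in for real model output: lemma -> (best topic id, prominence).
TOPIC_SCORES = {"giant": (0, 0.91), "fight": (0, 0.74)}

root = ET.fromstring(SAMPLE)
for w in root.iter("w"):
    topic, score = TOPIC_SCORES.get(w.get("lemma"), (None, 0.0))
    if topic is not None:
        w.set("topic", str(topic))
        w.set("prominence", f"{score:.2f}")

print(ET.tostring(root, encoding="unicode"))
# <l><w lemma="giant" topic="0" prominence="0.91">geaunt</w>...</l>
```

Once the scores live in the markup, the usual TEI machinery—searching, display, further transformation—can operate on them like any hand-applied tag.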

Update: A PDF of Flanders and Jockers’ presentation, “A Matter of Scale”, is now available online.