Note: This post is part of a series of reworkings of materials originally written between 2009 and 2012. A description of the nature and purpose of this project can be found here.
Back when I was a PhD student, I tried to do some basic quantitative analysis with the Dictionary of Old English corpus, but my move into work on regionalism in the early years of this century seems to have taken me away from further research of this nature. Then in 2009, the Lexomics project seems to have captured my interest. My vague memory is that I was already thinking about how to detect regional discourses quantitatively, but the first of my blog posts in September of that year has me returning to Old English in order to begin thinking through the methodological challenges involved. At that time, the first Lexomics tools were emerging, and Peter Stokes seems to have provided an initial response to them (I think in “The Digital Dictionary”, Florilegium 26 (2009), but I don’t have it to hand to make sure). Stokes comments on a comparison of the works of Ælfric with the West Saxon translation of Bede’s Ecclesiastical History:
Indeed, the project team complained that their software was “identifying Ælfric as Ælfric and Bede as Bede”, rather than finding one’s use of the other. I would suggest, however, that this is a misinterpretation: the software is not distinguishing Ælfric’s writing from Bede’s (nor even that of Bede’s translator), but rather Godden’s edition from Miller’s, or perhaps the West Saxon copy of Ælfric from the Anglian translation of Bede. It is therefore probably identifying editors or scribes at least as much as authors.
Instinctively, I wanted to say that the lexomics algorithm, a form of hierarchical agglomerative clustering, is not really distinguishing Godden’s edition from Miller’s so much as the West Saxon copy of Ælfric from the Anglian translation of Bede. I felt that the distinction was first and foremost dialectal (Anglian v. West Saxon), with other (scribal and editorial) influences being more minor. But this was no more than an impression, and I suggested that lemmatisation and spelling normalisation would make a big difference, allowing us to identify the extent to which these factors operate. I proposed using these techniques as filters, but cautioned that in applying them we have to be clear that we are restricting our data set to semantics.
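To make the point about filters a little more concrete, here is a minimal sketch of the kind of word-frequency clustering involved, with an optional spelling-normalisation step. It is not the Lexomics implementation; the normalisation table, segment texts, and function names are invented for illustration.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical normalisation table: maps variant spellings to a single form.
# A real table would be far larger, or would map inflected forms to lemmata.
NORMALISE = {"heofenum": "heofonum"}

def word_frequencies(text, normalise=False):
    """Relative frequencies of word types in a whitespace-tokenised segment."""
    words = text.lower().split()
    if normalise:
        words = [NORMALISE.get(w, w) for w in words]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def cluster_segments(segments, labels, normalise=False):
    """Hierarchical agglomerative clustering of text segments by word frequency."""
    freqs = [word_frequencies(s, normalise) for s in segments]
    vocab = sorted(set().union(*freqs))
    matrix = np.array([[f.get(w, 0.0) for w in vocab] for f in freqs])
    tree = linkage(matrix, method="average", metric="euclidean")
    dendrogram(tree, labels=labels)  # plotting requires matplotlib
    return tree
```

Running the same segments twice, once with normalise=True and once without, and comparing the resulting dendrograms would give a rough sense of how much of the clustering depends on spelling rather than on vocabulary.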
Stokes also noted:
At the time of writing it appears that the Old English Lexomics data does not use this information [Miller’s corrections flagged in the Corpus] but includes these editorial forms without comment. For example, a small sample of words flagged in the Corpus text of “Bede4” (Cameron no. B.6.6) gives bodode, lyfesne, and towurpun: these are all editorial reconstructions and are not found in the manuscript. However, the so-called “Virtual Manuscript” Tool lists all of them without comment (LeBlank [sic] et al. 2009 “Tools” – “Virtual Manuscript” – “Text B.6.6”).
We do know that Ælfric used Bede, so, presumably, if we took the reconstructed words out and any affinities we had found between the two texts disappeared, we would learn that our presumptions about the influence of Bede on Ælfric owe something to the decisions of past editors.
With these observations in mind, I thought about how we could address the influence of editors on the patterns we extract. Not a lot of work has been done on this to my knowledge, in part because so much of the quantitative research done today focuses on modern printed texts. But maybe there ought to be a more thorough investigation of the phenomenon even in print media. From time to time I have seen scholars of modern literature detecting forms of “influence” that are superimposed upon the “finished” work, rather than subsumed within it. By way of example, I would cite Lisa Rhody’s observation that the diverse ekphrastic poems in The Gazer’s Spirit tend to cluster together, rather than being spread around a larger corpus, and Matthew Jockers’ recent observation that the Englishman George James’ first novel, which clusters with a group of Scottish novels, was only sent off for publication after it had been “endorsed” by Sir Walter Scott. This whole phenomenon needs to be theorised a bit more, but I’m dubbing it “post-authorial influences” for the moment to supply a title for this post.
How do we go about “examining post-authorial influences”? Back in 2009, I optimistically proposed that the lexomics team incorporate an XML schema. Texts could be marked up with both edited and unedited forms, and tools could be used to extract one or the other for the purposes of comparison. As proof of concept, I set to work on one of the basic test cases used in lexomics, the Old English poems Daniel and Azarias. Scholars have long known that the poems share common source material, and lexomics was able to reproduce this observation using word frequency clustering. In order to test whether the patterns would change with unreconstructed forms of the texts, I began producing XML versions of Daniel and Azarias which tagged the vocabulary appropriately, developing a schema loosely based on TEI. In principle, this was a good idea, since most digital texts today are produced in XML. However, it quickly became apparent that one of the shortcomings of XML, namely its difficulty in capturing overlapping text segments, would be a factor in marking up and collating editions. Desmond Schmidt’s work in finding an alternative to XML came to my attention, and I follow it from time to time. In any event, the need for a special lexomics schema seems to have faded for the moment as energy was devoted to developing web-based tools, rather than scripts. Currently, support for XML texts is still being developed in the form of the Scrubber tool, which has some limited functionality to extract appropriate forms from tagged documents. That said, my initial work in incorporating marked-up texts into the lexomics workflow was important to the further development of the method in ways that will be chronicled in the next few posts.
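To give a sense of what such markup and extraction might look like, here is a toy sketch loosely in the spirit of TEI’s <choice>/<sic>/<corr> mechanism. The element names, the sample fragment, and the “manuscript” reading in it are invented for illustration; they do not reproduce my 2009 schema or any actual manuscript.

```python
import xml.etree.ElementTree as ET

# A toy fragment: each <w> is a word; where editors have intervened, <choice>
# pairs an (invented) manuscript reading (<sic>) with the editorial form (<corr>).
SAMPLE = """
<text>
  <w>þa</w>
  <w><choice><sic>bodade</sic><corr>bodode</corr></choice></w>
  <w>he</w>
</text>
"""

def extract(xml_string, prefer="sic"):
    """Return word forms, preferring manuscript ('sic') or editorial ('corr') readings."""
    root = ET.fromstring(xml_string)
    words = []
    for w in root.findall("w"):
        choice = w.find("choice")
        if choice is None:
            words.append((w.text or "").strip())
        else:
            words.append((choice.find(prefer).text or "").strip())
    return words

print(extract(SAMPLE, prefer="sic"))   # unedited forms, e.g. ['þa', 'bodade', 'he']
print(extract(SAMPLE, prefer="corr"))  # editorial forms, e.g. ['þa', 'bodode', 'he']
```

Feeding the two extractions through the same frequency clustering would then show directly how far any affinity between, say, Ælfric and the Bede translation depends on editorial reconstructions.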
But it seems appropriate to me to pause and reflect on the larger significance of what I was doing. Essentially, I was trying to bring together two strands of the Digital Humanities—quantitative analysis and text markup—which are sometimes portrayed as opposed or unrelated methods. The former removes context from the textual representation; the latter attempts to encode context in the textual representation. I think most would agree that there is no real reason why these two methods should be opposed; it is just that scholars tend to focus on one or the other. But this is a situation that really needs more reflection. It is not just that quantitative methods are suspect when texts are reduced to mere lists of words; markup systems could also be improved by greater attention to text manipulation and systems of representation. The power of XML is clear when you consider that the Lexomics project is now encoding dendrogram structures using the PhyloXML schema. But anyone who has tried to learn XSLT can tell you that marking up texts in XML often dead-ends because it is so difficult for anyone but an expert to devise usable stylesheets. Thus it becomes very difficult to explore the questions this post began with. However, this is not just a question of technical difficulty. The choices digital editors make about what to mark up obviously determine what contextual information is available for quantitative analysis. If those doing the analysis are not also involved in the markup, an opportunity is lost. By the same token, if those doing the markup are not also interested in quantitative analysis, they are ignoring one of the most powerful tools available to digital scholarship for making the case that their work is useful. This is an area where more concerted collaboration is to be desired.