Topic Models and Spelling Variation: The Case of Early Middle English

Topic modelling has developed quite a following in the DH world, but it still has a long way to go before it proves itself a reliable method for literary research. (Caveat: I have not yet read Matthew Jockers’s soon-to-be released Macroanalysis, which may answer many questions about how to use topic modelling to study literature.) As far as I can tell, topic modelling was originally tested on materials that, although diverse in subject matter, were fairly homogeneous in language. Literary language is problematic for topic modelling not so much because it contains more ambiguities or fuzziness than, say, scientific journals but because the types of questions literary scholars ask tend to probe at these aspects of language. There’s no reason why we should expect a single, and fairly new, computational method to provide miraculous insight into questions that sustain whole disciplinary fields, and neither is that a reason to assume that it can provide no insight at all.

Topic modelling has already shown particular promise in the area of literary history, as can be seen from Jockers’s work, as well as that of people like Ted Underwood and Lisa Rhody. But the results that they have made available share one thing in common with the various topic models of scientific journals, Day of DH posts, PMLA articles, and the like: they all deal with texts that are fairly homogeneous in linguistic form. (It’s not very clear how Goldstone and Underwood dealt with quotations in their survey of PMLA.) Researchers may complain about bad OCR corrupting their data, but that is a relatively minor issue in applying the methodology. Where we really run into trouble is when we try to consider semantic change. What does a topic model tell us if words change their meanings during the time frame of the documents in the collections we are exploring? Increasingly, there is talk of employing methods such as Topics over Time as a way of addressing this issue.

There may in fact be algorithmic methods of addressing semantic change, but they still require a collection of texts with word types that are fairly stable in form, if not in meaning. My own forays into topic modelling explore another type of data: pre-modern (specifically Early Middle English) texts, which are very diverse in spelling and morphology. A single word in an Early Middle English text may have as many as thirty different spellings, and another text from the same time period may have a completely different set of spellings for the same word. Dialectal variations mean that grammatical variants of words may be entirely different in texts from different parts of the country. It might be possible to perform stemming or lemmatisation to erase these differences, but I don’t believe these methods have been shown to provide more meaningful results even in modern English texts, and, in any event, there are no effective means for stemming or lemmatising Early Middle English (the same linguistic heterogeneity is the barrier). Compounding the problem is the fact that the available corpus of Early Middle English is much smaller than it is for modern English, which limits our ability to rely on scale to smooth out statistical perturbations caused by what Chaucer called the “diversite of tonge” in Middle English. [Update: When this was written, I had not yet had a chance to read Adam Crymble’s analysis of the impact of reporters and editors on the Old Bailey Proceedings. Anyone interested in the impact of scribal usage on texts produced during the period of orthographic standardisation should take a look.]

So what can topic modelling tell us about literature before the development of standardised spelling? Given any statistical clustering method, we might assume that the results will group together texts with similarities in spelling. In a topic model, certain topics will therefore map entirely onto individual texts, and the topics are likely to be marked less by semantic coherence than by affiliation to particular scribal conventions. That, at least, is a prediction. It suggests that the study of medieval scribal habits might be one possible application. We medievalists like that kind of thing, but I rather suspect that the next 100 tenure-track jobs in medieval literature will not go to people who wrote their PhDs on medieval orthography.

If topic modelling is to make a wider impact in the study of medieval literature (in English), I think it will be as a means of overcoming the linguistic diversity of the period. If we can reliably say that n% of the vocabulary of a particular text (or chunk of text) belongs to a given topic, and that a similar percentage of another text belongs to the same topic, we might then begin to generate networks of something that I don’t quite want to call “influence”. One way to think of it is using topic modelling to create maps of textual communities. But, in order to do this, we need to make the topic modelling algorithm ignore spelling patterns and focus on lexical patterns.
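
To make this concrete, here is a minimal sketch of how such a map of textual communities might be generated from a topic model’s output. It assumes we already have a topic-proportion vector for each chunk (for example, parsed from a modelling tool’s document-topics output); the chunk names, vectors, and similarity threshold are all illustrative, not data from these experiments.

```python
# A minimal sketch of a "map of textual communities": compare the
# topic-proportion vectors of chunks and keep the strongest links.
# Assumes per-chunk topic proportions are already available.
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def affinity_edges(doc_topics, threshold=0.5):
    """Yield (chunk_a, chunk_b, similarity) pairs above a threshold.
    The threshold here is arbitrary; choosing one is an open problem."""
    for (a, u), (b, v) in combinations(doc_topics.items(), 2):
        sim = cosine(u, v)
        if sim >= threshold:
            yield a, b, sim

# Illustrative usage with made-up topic proportions:
doc_topics = {
    "TrinityHomilies_1": [0.60, 0.10, 0.30],
    "AncreneWisse_1":    [0.55, 0.15, 0.30],
    "KingHorn_1":        [0.05, 0.80, 0.15],
}
for a, b, sim in affinity_edges(doc_topics):
    print(f"{a} -- {b}: {sim:.2f}")
```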

In my experiments, I have tried to rely on a few assumptions:

  1. LDA will tend to identify co-occurrences of spelling variations with stable word types, thus generating topics based at least partially on similar usage across different texts.
  2. Early Middle English texts contain a wide range of topics which occur in varying proportions. That is, texts are not dominated by certain topics which are almost completely absent from other texts.
  3. Partial smoothing of linguistic heterogeneity, rather than complete stemming or lemmatisation, will help to encourage the above tendencies to a point where we can generate meaningful results.

What is the truth behind these assumptions? In order to test them, I created a corpus of some 25 texts composed between about 1100 and 1350, taking them, as much as possible, from the southerly and westerly regions of England (although a few have eastern affiliations). The texts, which I’ve listed at the end of this post, were a mixture of poetry and prose. A combination of simple word-frequency calculations and observations of the most prominent words in topic models allowed me to generate a list of about 1000 stop words, including named entities. (I should point out that one of the texts, Laȝamon’s Brut, is a history of Britain, and is filled with names and places.) The remaining collection contained about 30,000 word types.
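
For those curious about the mechanics, the sketch below shows how frequency-based stop word candidates might be generated. The actual list was curated by hand and supplemented with named entities, so code like this can only propose candidates for review.

```python
# Sketch: generate stop word candidates from raw frequency counts.
# This proposes candidates for manual review; it does not reproduce
# the hand-curated list described above.
from collections import Counter
import re

def word_frequencies(texts):
    """Count word types across a list of raw text strings.
    The letters-only pattern keeps non-ASCII letters like þ, ȝ, and æ."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts

def stopword_candidates(counts, n=1000):
    """Return the n most frequent word types for manual review."""
    return [word for word, _ in counts.most_common(n)]
```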

I actually skipped the first step, which would be to compare topic prominence in individual texts. Maybe this should be done, but it doesn’t address the types of questions I am exploring. Instead, I chunked the texts into 2000-word chunks—approximately 200-300 lines of poetry—in order to see what topics were prominent in sections of texts. My working theory is that evidence for textual affinity is more likely to show up in smaller sections of text, and my experience working on the Lexomics Project suggests that—at least for Old English—chunks of about 1000-1500 words are ideal.
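
A simple chunking function along these lines is sketched below. How the trailing words of a text are handled is my own choice here (an undersized final chunk is folded into the previous one), not necessarily what was done in these experiments.

```python
# Sketch: split a text into fixed-size word chunks (2000 words here;
# 1000 in the later experiments). Whitespace tokenisation is a
# simplification of whatever tokenisation the modelling tool applies.
def chunk_words(text, size=2000):
    words = text.split()
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    # Fold an undersized final chunk into the previous one (my own
    # choice; the post does not say how trailing words were handled).
    if len(chunks) > 1 and len(chunks[-1]) < size // 2:
        chunks[-2].extend(chunks.pop())
    return [" ".join(chunk) for chunk in chunks]
```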

The resulting topic models tended to confirm the prediction above that topics would map to texts (that is, individual topics were overwhelmingly prominent in individual texts and non-existent in all the others). Here is a typical example of a topic taken from a model with 100 topics.

TOPIC: holie men wise drihten lif louerd word seluen helende mannes mihte togenes þridde lichame sume …

All but one chunk of the Trinity Homilies contained between 148 and 375 words assigned to this topic. The Sayings of Saint Bernard ranked fairly high by comparison (339 words). Otherwise, there was a precipitous fall-off: 48 words from Chunk 7 of the Lambeth Homilies, 28 from Chunk 4 of the Lambeth Homilies, and everything else at 19 words or below. Admittedly, Chunk 9 of the Trinity Homilies had only 7 words assigned to this topic, implying that it might have very different subject matter. That is worth investigating, especially if it is still true after the model is tweaked a bit. We should note that, with the chosen parameters, the top words in the topic are not entirely coherent. Certain words like “seluen”, “mihte”, and “sume” might be moved to the stop word list, and the topic might still split into two if a different number of topics were chosen for the model. (Although, as Ted Underwood and Lisa Rhody have reminded us, we need not make semantic coherence a criterion for the usability of topics.)
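
Counts like these (“words assigned to this topic” per chunk) can be read off the token-level topic assignments in a Gibbs state file. The sketch below assumes MALLET, which this post does not actually name as its modelling tool, and MALLET’s state-file format.

```python
# Sketch: count "words assigned to a topic" per chunk from a Gibbs
# state file, assuming MALLET (train-topics --output-state state.gz).
# Each non-comment line in the file has the form:
#   doc source pos typeindex type topic
import gzip
from collections import Counter

def words_assigned(state_path, topic_id):
    """Count tokens assigned to topic_id in each document (chunk)."""
    counts = Counter()
    with gzip.open(state_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue
            doc, source, pos, typeindex, word, topic = line.split()
            if int(topic) == topic_id:
                counts[source] += 1  # keyed by chunk file name
    return counts
```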

Topic profiles like the above are fairly typical: individual topics are only really prominent in specific texts. I next experimented with some light linguistic smoothing. I changed –eth at the ends of present tense verbs to –, changed æ to e, and made a few other tweaks to make the spelling more consistent. It is important to note that I engaged in some pretty drastic deformance of the texts in the process: since I made the changes globally, I did not check whether they were appropriate in all cases. I re-tried the experiment using different numbers of topics, but the smoothing, I found, made very little difference to the results, perhaps because I had been too conservative in the changes I made.
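
The substitutions were applied as blunt global replacements, something like the sketch below. The particular patterns shown are illustrative of this kind of smoothing, not a record of the exact changes made.

```python
# Sketch: blunt, global spelling normalisation applied before
# modelling. The patterns are illustrative only and, being applied
# globally, will sometimes be wrong; that is the "deformance"
# acknowledged above.
import re

SUBSTITUTIONS = [
    (r"æ", "e"),        # e.g. "þæt" -> "þet"
    (r"eth\b", "eþ"),   # regularise one verb ending (illustrative)
]

def smooth(text):
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text
```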

Regardless, the membership of topics like the one discussed above, combined with the few instances like the anomalous chunk of the Trinity Homilies, suggested to me that my experiments needed to be more granular. So I re-chunked the texts into 1000-word chunks and ran the topic models again. The same patterns emerged, but they were much less marked. Here is an excerpt from a model with only 20 topics.

Topic: luue leue swete pine blod wat bodi lef blisse oc water world war sunne wan …

| Rank | Text and Chunk Number | Words Assigned to Topic |
|------|-----------------------|-------------------------|
| 1 | Bestiary (4) | 275 |
| 2 | Bestiary (3) | 253 |
| 3 | Bestiary (2) | 248 |
| 4 | Vision of Saint Paul (1) | 223 |
| 5 | Bestiary (1) | 217 |
| 6 | Debate between the Body and Soul (1) | 204 |
| 7 | Wohunge of Ure Lauerd (1) | 220 |
| 8 | Wohunge of Ure Lauerd (2) | 195 |
| 9 | Debate between the Body and Soul (2) | 179 |
| 10 | Debate between the Body and Soul (3) | 173 |
| 11 | Wohunge of Ure Lauerd (3) | 163 |
| 12 | Sawles Warde (5) | 55 |
| 13 | Religious Poems from Jesus 29 (5) | 41 |
| 14 | Litel Soth Sermun (1) | 36 |
| 15 | Religious Poems Caligula A ix (1) | 34 |
| 16 | Ancrene Wisse (17) | 34 |
| 17 | King Horn (7) | 31 |
| 18 | Trinity Homilies (15) | 29 |
| 19 | King Horn (2) | 29 |
| 20 | Religious Poems from Jesus 29 (4) | 27 |

Here we are starting to see some mixture. This topic (something to do with the purgation of sin?) is fairly prominent in four different texts, especially at their beginnings. That provides us with some interesting food for thought. It is a matter of debate at what point we should see the drop-off in words assigned to the topic in a chunk as significant. Consider the 29 words in Chunk 2 of the romance King Horn, just over 10% of the number found in the first four chunks of the Bestiary. Is that sufficient to conclude that there is some commonality between that section near the beginning of King Horn and the beginnings of the mostly religious literature where the topic is more prominent? By way of example, I’ve deliberately chosen a low-ranked chunk to pose this question, rather than the more obvious place between ranks 11 and 12, because, as yet, I have no sense of a cut-off point.

Nevertheless, we can see that the smaller chunk size does give us a little more flexibility to examine relationships between texts. In the topic modelling literature, there are occasional references to the ability to model documents as small as paragraphs, but, to my knowledge, no one applying the method to literature is attempting that level of granularity. Perhaps I should take the chunk size down to 500 words, which would be at the low end of what works for the hierarchical clustering method used in the Lexomics Project. I can also consider some more radical linguistic smoothing. For instance, Topic 7 in the above model, dominated by King Horn, contains the word “louerd”. The same word is prominent in Topic 13, dominated by Laȝamon’s Brut, but there it is spelt “lauerd”. If I just do some matching of the variant spellings that appear most prominently in each topic, I may find that the topics re-configure in ways that help to clarify textual affinities.
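
The matching itself could be as simple as a hand-built table of variant-to-canonical mappings applied before re-running the model. The sketch below includes the louerd/lauerd pair from the topics just discussed, plus one invented extra variant to show the shape of the table.

```python
# Sketch: collapse variant spellings to a canonical form before
# re-running the model. The louerd/lauerd pair is from the topics
# discussed above; "hlauerd" is a hypothetical extra variant
# included only to show the shape of the mapping.
VARIANTS = {
    "lauerd": "louerd",
    "hlauerd": "louerd",  # hypothetical additional variant
}

def canonicalise(tokens):
    """Replace known variant spellings with their canonical forms."""
    return [VARIANTS.get(token, token) for token in tokens]
```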

These are all preliminary experiments, as the pre-processing steps for dealing with Early Middle English texts are fairly time consuming. (And, although the software claims to be able to handle Unicode, I find that it never works unless I change æ, þ, and ȝ to something else and then change them back after running the model.) I will continue to report developments when I can.
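
The workaround, for the record, is just a reversible substitution into ASCII-safe placeholders, something like this (the placeholder strings are arbitrary, so long as they cannot otherwise occur in the corpus):

```python
# Sketch: reversible ASCII-safe substitution for æ, þ, and ȝ, applied
# before handing texts to the modelling software and undone afterwards.
# The placeholder strings here are arbitrary choices of mine.
ENCODE = {"æ": "+ae+", "þ": "+th+", "ȝ": "+yo+"}
DECODE = {placeholder: char for char, placeholder in ENCODE.items()}

def to_ascii(text):
    for char, placeholder in ENCODE.items():
        text = text.replace(char, placeholder)
    return text

def from_ascii(text):
    for placeholder, char in DECODE.items():
        text = text.replace(placeholder, char)
    return text
```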

Update: I am belatedly reminded that the challenges of performing stylometric analysis on non-homogeneous text collections were discussed by Christof Schöch in his post “Author or genre? Assessing the quality of cluster analysis graphs in two-dimensional classification problems”. Schöch shows the difficulty of removing authorship as a determiner of text classification when one wants to classify texts by genre. The situation is analogous to the problem of separating orthography from lexical patterns. Although it makes a lot of extra work for us, this difficulty can be a good thing in theoretical terms, as it forces us to confront aspects of the interpretive process which traditional scholarly methods often ignore. How much does our interpretation of the thematic content of Hamlet rest upon our simultaneous assumption that it is by Shakespeare? How much does our impression that the Trinity Homilies and Ancrene Wisse are unconnected relate to their differences in dialect? Quantifying answers to these questions at the very least forces us to put a lens to our interpretive practices and consider them more closely.

Texts in the Collection (sources in parentheses)

  1. Ancrene Wisse (based on the TEAMS edition)
  2. Bestiary (London, British Library, Arundel 292)
  3. Debate between the Body and Soul (Oxford, Bodleian Library, Laud Misc 108)
  4. Hali Meiðhad (Oxford, Bodleian Library, Bodley 34)
  5. The Infancy of Christ (Oxford, Bodleian Library, Laud Misc 108)
  6. Juliana (Oxford, Bodleian Library, Bodley 34)
  7. Kentish Sermons (Oxford, Bodleian Library, Laud Misc 471)
  8. King Horn (Oxford, Bodleian Library, Laud Misc 108)
  9. Laȝamon’s Brut (London, British Library, Cotton Caligula A ix)
  10. Laȝamon’s Brut (London, British Library, Cotton Otho C xiii)
  11. Lambeth Homilies (London, Lambeth Palace Library 487)
  12. The Life of Christ (Oxford, Bodleian Library, Laud Misc 108)
  13. A Litel Soth Sermun (London, British Library, Cotton Caligula A ix)
  14. The Owl and the Nightingale (London, British Library, Cotton Caligula A ix)
  15. The Passion of Our Lord (Oxford, Jesus College 29)
  16. Poema Morale (Oxford, Jesus College 29)
  17. The Proverbs of Alfred (Oxford, Jesus College 29)
  18. Religious Poems from Cotton Caligula A ix
  19. Religious Poems from Oxford, Jesus College 29
  20. Sawles Warde (Oxford, Bodleian Library, Bodley 34)
  21. The Sayings of St Bernard (Oxford, Bodleian Library, Laud Misc 108)
  22. Trinity Homilies (Cambridge, Trinity College 335)
  23. Vices and Virtues (London, British Library, Stowe 34)
  24. The Vision of Saint Paul (Oxford, Bodleian Library, Laud Misc 108)
  25. Wohunge of Ure Lauerd (London, British Library, Cotton Titus D xviii)