Topic Modelling Early Middle English II

This is a follow-up to my earlier post on Topic Models and Spelling Variation: The Case of Early Middle English. There I discussed the challenges of generating topic models of texts with non-standardised spelling where topics did not merely correspond to texts. In any given model, differences in the spelling systems of two texts will cause some of the generated topics to be prominent in one text and effectively non-existent in the other, whereas other topics will be non-existent in the former and prominent in the latter. Topics are thus essentially orthographic patterns, rather than rhetorical discourses or indicators of subject matter. My initial experiments showed that some linguistic smoothing (admittedly, a questionable form of textual deformance) can help address this problem, along with increasing the granularity of the model by breaking texts into small chunks. Chunk sizes of about 1000 words began to reveal the sorts of patterns I was looking for: cases where individual topics had high prominence in parts of more than one text. But I speculated that chunk sizes needed to be still smaller in order to make these patterns more noticeable. A second reason for using smaller chunk sizes is that it becomes much easier to make the leap back from distant to close reading. 500 words corresponds to approximately 65 lines of poetry in Laȝamon’s Brut, the text I am primarily interested in, which makes it easy to eyeball the text and see if the topic words are concentrated in a small passage or distributed sporadically around the chunk.

I’ve now had a chance to run a topic model of my Early Middle English corpus chunked into 500-word sections. I’ve only run a 20-topic version so far, but I’m happy to say that the increased granularity does continue to reveal individual topic prominence in multiple texts that are not necessarily similar in their orthography. There’s a lot in the model that seems of interest to the study of the texts used to generate it, but only one topic that seems to provide direct insight for the questions I am asking about Laȝamon’s Brut.
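As a rough illustration of what chunking to 500 words involves, here is a minimal Python sketch. It assumes that words are simply whitespace-separated tokens and that a short final chunk is kept rather than discarded. The function name chunk_words and the toy text are hypothetical; they stand in for whatever preprocessing is actually used to prepare a corpus.

```python
def chunk_words(text, size=500):
    """Split a text into consecutive chunks of `size` whitespace-separated words.

    Assumption: tokens are whatever str.split() yields; the last chunk
    may be shorter than `size`.
    """
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


# Toy stand-in for a real text: 1,200 dummy tokens.
toy_text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(toy_text)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # [500, 500, 200]
```

Whether to keep, drop, or merge a short final chunk is a design choice; it matters for any later comparison of raw word counts across chunks.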

Topic: men sune godes liue mon dei muchele drihten halie mihte laȝe folc walde burh dæþe …
Text and Number of Chunks: Words Assigned to Topic
6 Chunks of the Lambeth Homilies: 101-121
19 Chunks of the Lambeth Homilies: 45-83
19 Chunks of Laȝamon’s Brut (MS C): 39-42
1 Chunk of the Lambeth Homilies: 38
3 Chunks of Laȝamon’s Brut (MS C): 31-33
1 Chunk of Juliana: 30
1 Chunk of the Lambeth Homilies: 29
… More Chunks of Various Texts …
1 Chunk of Laȝamon’s Brut (MS O): 14

Some observations. The topic is most prominent in chunks of the Lambeth Homilies, so we might (simplistically) see it as associated with the homiletic genre. It is one of a number of possible homiletic discourses, and it is clearly very prominent in some portions of the Lambeth Homilies, but not the majority of them. In the majority of the sections of the Lambeth Homilies, the topic is not much more prominent than it is in some portions of the Caligula manuscript of Laȝamon’s Brut. In fact, the topic is more prominent in some sections of the Brut than it is in some sections of the Lambeth Homilies. The presence of the topic in the Brut is more or less specific to the Caligula version; the first chunk of the Otho version of the Brut to appear in the list contains only 14 items assigned to this topic.

Probably these patterns can still be explained by orthography; there is a close correlation in both date and dialect between the Lambeth Homilies (Herefordshire, c.1200) and the Caligula Brut (Worcestershire, last quarter of the 13th century, but with much archaism). Nevertheless, there appears to be a possibility that certain sections of both of these texts share affinities to the particular homiletic discourse–defined by something other than spelling–that is most prominent in 6 chunks of the Lambeth Homilies.

One question I asked in my earlier study is where the cut-off point should be for attributing significance to the number of words assigned to a topic. Clearly, the presence of only 14 words in a section of the Otho Brut does not indicate that we should look for any significant presence of the discourse we are examining (except in the unlikely case that all 14 words occur in the space of, say, 5 lines of poetry). But what about the chunks in which 45-83 words are assigned to the topic? And what about 39-42?
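The point about concentration can be made concrete with a small Python sketch. It counts the largest number of topic words falling inside any run of consecutive tokens in a chunk. The window of 40 tokens is an assumption, derived from the rough equivalence of 500 words to 65 lines (so 5 lines is about 38 words); the word list is just the first fifteen topic words given above, and max_window_count and toy_chunk are hypothetical helpers with made-up data.

```python
# First fifteen words of the topic, as listed above.
TOPIC_WORDS = {
    "men", "sune", "godes", "liue", "mon", "dei", "muchele", "drihten",
    "halie", "mihte", "laȝe", "folc", "walde", "burh", "dæþe",
}


def max_window_count(tokens, topic_words, window=40):
    """Largest number of topic words in any run of `window` consecutive tokens."""
    hits = [1 if token in topic_words else 0 for token in tokens]
    if len(hits) <= window:
        return sum(hits)
    current = sum(hits[:window])
    best = current
    # Slide the window one token at a time, updating the running count.
    for i in range(window, len(hits)):
        current += hits[i] - hits[i - window]
        best = max(best, current)
    return best


def toy_chunk(positions, length=500):
    """Hypothetical chunk: filler tokens with the topic word 'men' at `positions`."""
    return ["men" if i in positions else "x" for i in range(length)]


clustered = toy_chunk(set(range(100, 114)))   # 14 topic words in a row
spread = toy_chunk(set(range(0, 490, 35)))    # 14 topic words, 35 tokens apart

clustered_max = max_window_count(clustered, TOPIC_WORDS)
spread_max = max_window_count(spread, TOPIC_WORDS)
print(clustered_max, spread_max)  # 14 2
```

Both toy chunks contain 14 topic words, but only the first would be worth a closer look; a raw count per chunk cannot distinguish the two cases.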

The drop from 101 to 83 is the obvious place for a cut-off point, except that the scattering of chunks of the Lambeth Homilies where even fewer words are assigned to the topic suggests that we should see the topic as truly distributed in various proportions around the text as a whole. That leaves open the possibility of drawing the same conclusion about the Caligula Brut. Certainly, we should not expect 65 lines of a non-homiletic text to have a highly prominent homiletic discourse. But does the presence of the topic in some chunks at levels equivalent to some chunks of the Lambeth Homilies indicate that a few lines here and there share phrasing with homilies? We can also consider this in purely numeric terms. Given chunk sizes of 500 words, the highest ranked chunks for this topic have approximately 20-25% of their words assigned to the topic. For the highest ranked chunks of Laȝamon’s Brut, the figure is between 7 and 9%. Is that a large enough percentage to attribute some kind of significance to the ranking?
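For anyone who wants to check the arithmetic, here is a minimal Python sketch that converts the word counts from the list above into percentages. It assumes every chunk contains exactly 500 words, which will not hold for the final chunk of a text; the labels and the helper name share are my own.

```python
CHUNK_SIZE = 500  # assumption: every chunk is exactly 500 words long

# (lowest, highest) number of words assigned to the topic, from the list above.
count_ranges = {
    "Lambeth Homilies, top 6 chunks": (101, 121),
    "Lambeth Homilies, 19 chunks": (45, 83),
    "Brut (MS C), 19 chunks": (39, 42),
    "Brut (MS O), first chunk listed": (14, 14),
}


def share(count, chunk_size=CHUNK_SIZE):
    """Percentage of a chunk's words assigned to the topic."""
    return 100 * count / chunk_size


shares = {
    label: (share(low), share(high))
    for label, (low, high) in count_ranges.items()
}

for label, (low, high) in shares.items():
    print(f"{label}: {low:.1f}-{high:.1f}%")
```

This gives 20.2-24.2% for the top chunks of the Lambeth Homilies, 7.8-8.4% for the top chunks of the Caligula Brut, and 2.8% for the first Otho chunk.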

As it happens, the top-ranked chunk of the Brut is the second one, which corresponds to lines 64-122 of the poem. Here is a rough summary. Aeneas arrives in Italy, where, after defeating Duke Turnus, he wins the hand of Lavinia, daughter of King Latin. His son Ascanius rules after him and builds a city called Alba Longa. He tries to take a statue of a god which Aeneas had brought from Troy with him, but as soon as he brings it there, it is carried off by the devil (“feond”) on the wind. His reign ends after thirty-four years.

It is somewhat gratifying to know that Laȝamon tells us that, upon marrying Lavinia, Aeneas rules “inne griðe & inne friðe” (in peace and in prosperity), a phrase I have argued elsewhere to be derived from the homilies of Wulfstan. The delightful story about the disappearing statue is phrased obscurely, and the meaning is only moderately clarified in the Otho version. It is easier to understand what is going on from Laȝamon’s main source, Wace’s Roman de Brut. Here is the text there, along with Judith Weiss’s translation.

Mais lé Deus de Troie en ad pris
Ke Eneas i aveit mis,
En Albe les vuleit aveir,
Mais il n’i pourent remaneir;
Unches nes i sout tant porter
K’al main les poüst trover;
El temple ralouent ariere,
Mais jo ne sai en quel maniere. (97-104)

But [Ascanius] took from it [Aeneas’s castle] the gods of Troy, which Aeneas had placed here, and wanted to have them in Alba. However, they would not stay there: he could never carry them away to such an extent that he would find them there in the morning. They would return to the temple, but, how, I do not know.

Laȝamon refers to the city as Alba Longa, the name found in the Variant Version of Geoffrey of Monmouth’s Historia regum Britanniae. Wace appears to have used the Variant Version here, but not to have taken the name of the city from it. So there are some issues related to the copy of the Roman de Brut that Laȝamon was using. Nevertheless, it is clear that Laȝamon alone is responsible for the introduction of the devil, and that would be something we could attach to a homiletic discourse. But none of these words, “griðe”, “friðe”, or “feond”, can be found in the first fifteen words of the topic. In fact, they don’t appear to be associated with the topic at all!

Have I engaged in a wild goose chase only to find a golden egg by coincidence? Perhaps. But I think there is something else going on. Laȝamon does not have to blame the disappearing statue on the devil. There are plenty of other places in the Brut where he leaves such marvels alone. I want to suggest that the devil has entered the text here by finding an “exploit”. In The Exploit: A Theory of Networks, Galloway and Thacker conceive of “the exploit” as “a resonant flaw designed to resist, threaten, and ultimately desert the dominant political paradigm” (22). Here we may see the dominant paradigm not in political terms but in terms of source and genre. The fact that Laȝamon’s translation of Wace into English employs so many words (in this passage) that he might have encountered together in some homilies puts him, if you like, in a homiletic state of mind. And this allows the entry of the devil into the text at this point.

Imagine, if you will, a text that places in close proximity apples, oranges, bread, paper towels, pasta, chicken stock, and coffee. Although this particular example of such a text comes from scholarly discourse, you may well associate it with a marketing list, and you could easily add more items off the top of your head. And this is an important point: the added items don’t have to come from a particular source text or necessarily from the same discourse (topic). In the case of this chunk of the Brut, we could say that the relatively high prominence here of words found in the Lambeth Homilies suggests that Laȝamon may have been unconsciously nudged to select further items familiar to him from homilies in general.

I think this example reveals the way in which close reading and distant reading complement each other. Reading the Brut closely, I may well have noticed the divergence from Wace and speculated impressionistically about Laȝamon’s reasons for modifying the text. Looking at the topic model, I might have observed a statistical affinity between portions of the Lambeth Homilies and portions of the Brut. But I would have trouble getting a clear idea of the nature of that affinity. With this approach, I can begin to make claims about the texts by constructing a relationship between them that does not require a direct source connection.

A few caveats. I’m starting to make big claims based on one example, and I haven’t even looked yet at the sections of the Lambeth Homilies where the topic in question is most prominent. I should also run the topic model again with larger numbers of topics. Also, I’ve pushed to the side the important question of what proportion of words should be assigned to a topic in order for us even to begin speculating along the lines I have pursued. There is a great deal of work still to be done.