Note: This post is part of a series of reworkings of materials originally written between 2009 and 2012. A description of the nature and purpose of this project can be found here.
Still in September of 2009, I was speculating that comparing manuscript variations would be one of the most popular uses of Lexomics. I was definitely wrong. To date, the Anglo-Saxon Penitentials are the only texts we have subjected to this kind of analysis, and the verdict is not yet in. The process of (digital) manuscript collation is a real challenge, particularly collating in such a way that one can extract lined-up token lists for word frequency analysis. I haven’t seen what the latest version of Juxta can do; I probably need to make time for an experiment in the future. Sadly, I am not currently working on any texts that survive in more than two manuscripts, so this may have to wait.
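To make the alignment problem concrete, here is a minimal sketch of lining up the token lists of two witnesses so that their word frequencies can be compared position by position. This is not how Juxta or any collation tool actually works internally; it simply leans on the standard-library `SequenceMatcher` to pad divergent readings with a gap marker, and the sample tokens are invented for illustration.

```python
from difflib import SequenceMatcher

GAP = "*"  # placeholder token marking a reading absent from one witness

def align_witnesses(tokens_a, tokens_b):
    """Line up two witnesses' token lists, padding divergences with a
    gap marker so the aligned lists come out the same length."""
    aligned_a, aligned_b = [], []
    sm = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        span_a = tokens_a[i1:i2]
        span_b = tokens_b[j1:j2]
        width = max(len(span_a), len(span_b))
        aligned_a.extend(span_a + [GAP] * (width - len(span_a)))
        aligned_b.extend(span_b + [GAP] * (width - len(span_b)))
    return aligned_a, aligned_b

# Invented sample readings from two hypothetical witnesses:
ms_a = "her cyning feng to rice".split()
ms_b = "her se cyning feng rice".split()
col_a, col_b = align_witnesses(ms_a, ms_b)
for x, y in zip(col_a, col_b):
    print(f"{x:10} | {y}")
```

Once the lists are padded to equal length, each row is a direct witness-to-witness comparison, and stripping the gap markers back out recovers each witness's original token stream for frequency counting.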
As my speculation about collation more or less ground to a halt, I was continuing to think about the problem of representing meaning through cluster analysis. In addition to the problem of editorial influence, there is a question about what textual phenomena are being measured by the algorithm. If two chunks of a text cluster together, is this because of lexico-semantic similarities, morpho-syntactic similarities, dialectal similarities, scribal similarities, or some combination? Early on, critics of Lexomics suggested that more meaningful results would be obtained from lemmatised texts. The Lexomics team were sceptical, arguing that lemmatisation would erase valuable information. More pragmatically, lemmatisation of Old English is an arduous process, and they were getting good results without it. I argued that a comparative approach would be effective: if lemmatised and unlemmatised texts produced dendrograms with similar structures, chances are that the clusters are based on lexico-semantic phenomena (so I speculated). Things turned out to be a little more complicated, and testing is still ongoing.

In this short post, I’ll just say that I began hand lemmatising Daniel and Azarias as a way to test this theory (since they are short) whilst developing an algorithmic means of lemmatising other Old English texts. I think the Dictionary of Old English editors decided as early as the 1970s that this was impossible, and I wasn’t trying to prove them wrong. I wanted to create a “semi-automatic” lemmatiser (in part because it sounds like a weapon) which would at least speed up the process. That proved to be a less-than-straightforward undertaking as well, but it did (eventually) yield some results, which I’ll discuss at length in future posts.
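The comparative test can be sketched in miniature. The idea is to build word-frequency vectors for each chunk of a text twice, once from the raw tokens and once from lemmatised tokens, and then compare the resulting distances between chunks: if the distance structure (and hence the dendrogram built from it) stays roughly the same, the clustering is probably driven by lexico-semantic rather than inflectional variation. The lemma map below is a hypothetical toy covering a handful of Old English forms, not a real lemmatiser, and the chunking and distance measure are simplifications of what the Lexomics tools actually do.

```python
from collections import Counter
from itertools import combinations
import math

# Toy lemma map (hypothetical): collapses a few inflected forms onto
# their headwords; a real lemmatiser would have to cover the lexicon.
LEMMAS = {"cyninges": "cyning", "cyningas": "cyning",
          "wordum": "word", "worda": "word"}

def chunk_tokens(tokens, size):
    """Split a token list into consecutive chunks of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def freq_vector(tokens, lemmatise=False):
    """Relative word frequencies for one chunk, optionally lemmatised."""
    if lemmatise:
        tokens = [LEMMAS.get(t, t) for t in tokens]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def distance_matrix(chunks, lemmatise=False):
    """Pairwise chunk distances, the input a clustering step would use."""
    vecs = [freq_vector(c, lemmatise) for c in chunks]
    return {(i, j): cosine_distance(vecs[i], vecs[j])
            for i, j in combinations(range(len(vecs)), 2)}
```

On a toy text where two chunks differ only in inflection, the raw matrix keeps them apart while the lemmatised matrix pulls them together; on a real text, the interesting question is whether the two matrices rank chunk pairs in roughly the same order, since hierarchical clustering on similarly ranked distances yields similarly shaped dendrograms.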