Thanksgiving 2012

This year’s Thanksgiving word cloud includes both Facebook and my Twitter feed.

Word Cloud of Facebook and Twitter Feeds on Thanksgiving 2012

Noticeably absent this year is the word “bacon”.
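Under the hood, a word cloud is just a frequency tally over the combined feeds. A minimal sketch of that counting step, in Python's standard library only; the sample posts and the tiny stop list here are invented for illustration, not my actual feeds:

```python
import re
from collections import Counter

# Hypothetical status updates standing in for the real Facebook/Twitter feeds.
posts = [
    "Happy Thanksgiving! The turkey is in the oven.",
    "Thankful for family and friends this Thanksgiving.",
    "Turkey, pie, and a long walk with the dogs.",
]

# A trivial stop list; a real cloud would use a much fuller one.
STOPWORDS = {"the", "is", "in", "a", "and", "for", "this", "with"}

def word_frequencies(texts):
    """Tokenize, lowercase, drop stopwords, and tally word counts."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return counts

freqs = word_frequencies(posts)
print(freqs.most_common(5))
```

A word-cloud renderer then scales each word's display size by its count; the counting is the whole of the analysis.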

And, for good measure, here is a topic model of the tweets and posts:

  1. I’m canyon things man cuteness blog ol drop teens gourd
  2. you’re lovely hike home family kill center orangutanes morning daily
  3. thanksgiving love turkey join audience kennedys infotech anonymous open driving
  4. kill history started jugar marylebone brightest uk leave topic twitter
  5. happy monrovia post reading move pie cat puddytat beagles
  6. day 1st work hiking eating dogs participation gustar espa obama
  7. thankful friends house make grading awesome head miss family dpla
  8. today beautiful turkeys year media con project triumphs articles tea
  9. la weather london dinner survived grateful marblehead hackfest workshop en
  10. park gobble beginning good kittens feasting te cnn touch wouldn’t

Suggestions for topic labels welcome.

Update: I was surprised to find that this post appeared on DH Now Unfiltered, so, given the increased traffic, I thought I’d better provide some explanation. I am trying to make this a regular tradition, after having produced a word cloud of last year’s Facebook posts on Thanksgiving. The Facebook posts are only those by me or my friends, and this year’s Twitter feed includes only tweets by me or those I follow (plus the sponsored posts).… Read more…


Computational Approaches to “Small Data” in the Digital Humanities

Taking a break from my reworking of past posts on computational approaches to text analysis, I want to make a plea for attention to be paid to small data, and for small data to form part of the larger conversation about quantitative analysis in the Digital Humanities. As the range of digitised materials expands through the proliferation of digitised print materials and born-digital texts, big data has come to represent the opportunity for innovation in the Digital Humanities. For those of us who work in areas whose corpora are restricted in size (medieval English literature, in my case), this presents a problem. How can those of us working in such fields participate in the growing conversation? Are the Digital Humanities doomed to a divide between those who do distant reading on “big data” and those who do close reading through markup and editing of smaller numbers of texts? One way the latter group can bridge the divide is by generating lots of metadata. But that only goes so far, and it doesn’t address the analysis of language being done through techniques such as topic modelling.

Small data is very useful because it throws into relief some of the same theoretical questions that (in my view) have yet to be sufficiently addressed in approaches to “big data”.… Read more…


A Change of Direction

Note: This post is part of a series of reworkings of materials originally written between 2009 and 2012. A description of the nature and purpose of this project can be found here.

Still in September of 2009, I was speculating that comparing manuscript variations would be one of the most popular uses of Lexomics. I was definitely wrong. To date, the Anglo-Saxon Penitentials are the only texts we have subjected to this kind of analysis, and the verdict is not yet in. The process of (digital) manuscript collation is a real challenge, particularly collating in such a way that one can extract lined-up token lists for word-frequency analysis. I haven’t seen what the latest version of Juxta can do; I probably need to make time for an experiment in the future. Sadly, I am not currently working on any texts that survive in more than two manuscripts, so this may have to wait.
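The alignment step itself is mechanical once the witnesses are tokenized. A rough sketch using Python's difflib as a stand-in for a real collation tool like Juxta; the two manuscript readings below are invented for illustration, not the Penitentials:

```python
from difflib import SequenceMatcher

# Two hypothetical manuscript readings of the same passage.
ms_a = "se cyning het him bringan wæter".split()
ms_b = "se kyning heht bringan þæt wæter".split()

def align_tokens(a, b, gap="-"):
    """Line up two token lists, padding non-matching stretches with a gap
    marker so corresponding positions can be compared column by column."""
    rows = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        left, right = a[i1:i2], b[j1:j2]
        width = max(len(left), len(right))
        left += [gap] * (width - len(left))
        right += [gap] * (width - len(right))
        rows.extend(zip(left, right))
    return rows

for left, right in align_tokens(ms_a, ms_b):
    print(f"{left:10} {right}")
```

Each output row pairs the tokens occupying the same position in the two witnesses, which is exactly the lined-up token list a word-frequency comparison needs.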

As my speculation about collation more or less ground to a halt, I was continuing to think about the problem of representing meaning through cluster analysis. In addition to the problem of editorial influence, there is a question over what textual phenomena are being measured by the algorithm.… Read more…
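Part of that question is easy to state concretely: all the clustering algorithm "sees" of a text is a vector of relative word frequencies and the distances between such vectors. A toy sketch of that distance computation, with invented text segments rather than any real corpus:

```python
from collections import Counter
from math import sqrt

# Hypothetical text segments; a Lexomics-style analysis clusters such
# chunks by the similarity of their relative word frequencies.
segments = {
    "A": "wæs se grimma gæst grendel haten".split(),
    "B": "wæs se æðeling on elne".split(),
    "C": "com on wanre niht".split(),
}

def rel_freqs(tokens):
    """Relative frequency of each word within a segment."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def distance(f, g):
    """Euclidean distance over the union vocabulary: this, and only this,
    is the textual phenomenon the clustering algorithm measures."""
    vocab = set(f) | set(g)
    return sqrt(sum((f.get(w, 0) - g.get(w, 0)) ** 2 for w in vocab))

profiles = {name: rel_freqs(toks) for name, toks in segments.items()}
for x in "ABC":
    for y in "ABC":
        if x < y:
            print(x, y, round(distance(profiles[x], profiles[y]), 3))
```

A hierarchical clustering routine then merges segments with the smallest distances; whatever "meaning" the resulting dendrogram represents is entirely mediated by these frequency vectors.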


Editorial Influences in Daniel and the Dynamism of Texts in Quantitative Analysis

Note: This post is part of a series of reworkings of materials originally written between 2009 and 2012. A description of the nature and purpose of this project can be found here.

Having marked up the diplomatic and edited forms of the Old English poem Azarias (as discussed here), I turned my attention to the much longer poem Daniel. A look at this poem will provide some specific examples of the kinds of issues that arise when using markup to prepare materials for quantitative analysis. The edition of Daniel in the Anglo-Saxon Poetic Records (ASPR) contains some 80 editorial emendations, an average of one every 9½ lines. That’s denser than I was expecting, enough to make me wonder whether they would affect the results of the Daniel/Azarias study initially performed by the Lexomics team.

A closer look at the emendations raises some important points. There are a fair number of editorially supplied words, like to in line 25 or hæfdon in line 56. There are also many examples of late Old English changed to classical West Saxon: e.g. MS þeoden, changed to þeodum in line 34; MS metod, changed to meted in line 129. And there are clear uncorrected errors, like MS mæ gehwurfe, changed to mægen hwyrfe in l.… Read more…
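The worry about emendations is easy to make concrete: every word an editor supplies or regularizes shifts the word-frequency profile that the quantitative analysis consumes. A toy sketch of measuring that shift, using invented sample tokens (not the actual Daniel text) and Python's Counter:

```python
from collections import Counter

# Hypothetical diplomatic (manuscript) and edited versions of a passage,
# loosely echoing the kinds of changes described above.
diplomatic = "þeoden metod mæ gehwurfe wæron blide".split()
edited = "þeodum meted to mægen hwyrfe wæron blide".split()

def frequency_shift(before, after):
    """Report which token counts change between the two versions
    (positive: introduced by the editor; negative: removed)."""
    diff = Counter(after)
    diff.subtract(Counter(before))
    return {tok: n for tok, n in diff.items() if n != 0}

print(frequency_shift(diplomatic, edited))
```

Run over a whole poem, a report like this shows at a glance how much of the frequency profile is editorial rather than scribal.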
