How to do your own topic modelling

The topic flower tool produced by Jeff Clark in 2006 provides an easy way to demonstrate the power of topic modelling. Whilst topic flowers have the appeal of producing pretty visualisations, they are based on only six very broad topics, which may limit the insight they can give into the texts the tool processes.

A solution is to do your own topic modelling. There is a really good overview of topic modelling by Clay Templeton on the Maryland Institute for Technology in the Humanities web site (and see also his follow-up post). Another good, and fairly accessible, document for background reading is Scott Weingart’s blog post on Topic Modeling and Network Analysis. “Of Monsters, Men — And Topic Modeling” is a New York Times article which gives a nice example of how topic modelling can be used for history.

Any more background would start to get annoying because the great appeal of the Digital Humanities is that it’s pretty easy to get down and dirty with the tools without too much prior knowledge. Topic modelling has traditionally been something of an exception, but now there is a way to begin the process in just a few minutes using a GUI Topic Modeling Tool. The rest of this post is designed to give you some easy-to-follow instructions for using it.

Getting Everything Set Up

First, download the tool. You should be able to do everything just by reading the online instructions. Everything below exists to highlight certain points or warn you about certain pitfalls.

Double-clicking on the tool should open it; if it doesn’t, you probably need to download and install Java (the language the tool is programmed in) on your computer. Assuming you have Java, the tool should open. When I did this, the first thing I wanted to do was test it, but it took me a moment to find the test data sets the tool’s makers provide. There are four of them (in addition to the tool itself) available for download here. The next instruction I give you is not absolutely necessary but will make your life easier.

Using your computer’s operating system, create a new folder with a descriptive name. If you intend to play with the test data set called “testdata_news_music_2084docs.txt”, you may wish to call your folder “NewsMusicTopicModels”, or something like that. Download the test data set and save it to your new folder. There are two reasons why this is important. First, you’re saving your data set in a place where you can find it easily. Second, the “Create New Folder” button within the Topic Modeling Tool does not appear to work (at least it did not work for me). If you create the folder in advance, you won’t have a problem.

Once you have downloaded a test data set, open the Topic Modeling Tool (if you haven’t already), and click Select Input File or Dir. Navigate to your new folder and select the test data file. Next, click Select Output Dir and select your new folder.

Choosing Your Options

For your first test, creating 10 topics is a good choice, since there will be some variety, but it’s not too many for your mind to take in. Click on the Advanced button. A new dialog box will open with some more choices. You should probably leave the Remove stopwords box checked and the Preserve case box unchecked. The latter will make sure that “dog” and “Dog” are treated as the same word. Stopwords are words that will be ignored by the processing, typically short function words like “the” and “and”. The Topic Modelling Tool uses a standard list of stopwords by default. However, you can upload your own list. For instance, if you are modelling a text in Old English or French, you might want a very different list. To create a list of stopwords, just type them into a text (.txt) file with each word on a separate line. Then click the Stopword file button in the Topic Modelling Tool to upload it.
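If you prefer to script this step, here is a minimal Python sketch that writes a custom stopword file in the one-word-per-line format the tool expects. The filename and the words themselves are just illustrations; substitute your own list.

```python
# Write a custom stopword list: one word per line in a plain .txt file.
# The filename and words below are illustrative; use your own.
stopwords = ["and", "se", "on", "he", "to", "tha"]

with open("my_stopwords.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(stopwords))
```

You would then point the Stopword file button at my_stopwords.txt.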

The other options allow for further refinement. The No. of iterations setting refers to the number of times the program cycles through your data. Increasing the number may produce more refined results but will take longer. I’m not actually sure what Topic proportion threshold refers to, and I haven’t yet had a chance to play with it.

No. of topic words printed requires a little more explanation. The Topic Modelling Tool produces results which can be visualised as a number of circles representing topics, with each circle containing all the words that belong to that topic. However, the tool does not attach a label such as “Science” to these circles. Hence the only way to identify topics is by referring to all the words that belong to them. That is pretty tedious, so a shorthand is to refer to them by the first 10 words in the topic. The No. of topic words printed option allows you to change this number to suit your taste. For instance, I found five words worked pretty well for me.

When you are done choosing advanced options, click OK. Now you’re ready to do your modelling. Just click Learn Topics, and the tool will go to work. It will display the message “Learning” and spit some feedback out into the window. You can pretty much ignore the information, but by watching it you’ll know when the process is done. The mathematics involved are not trivial (Weingart refers to them as the “dark arts”) and require some computing power. So it may take a while. My experience is that a poem like Beowulf only takes a few seconds on my laptop. If your data set is larger, it may take longer. But your patience will be rewarded.

So what’s the result when the process is finished? Go to the folder you created and you will see two new folders inside: output_html and output_csv. The latter contains your topic models in comma-separated-value (.csv) files, which you can open in a text editor or, even better, a spreadsheet program like Excel. If you need to manipulate the topics, you can use these files.

However, you may be more interested in the contents of the output_html folder. Go into that folder and double-click on the all_topics.html file. Your browser will open a web page showing a master list of your topics, each named using the top 10 words in the topic (or whatever number of words you specified in the advanced options). Click on one, and you will see a list of every document in the data set that contains the topic, listed in order of the prominence of the topic within the document. This is calculated from the number of words in the topic that occur in the document. Click on any document, and you will see a list of all the topics occurring in that document, ranked in order of prominence. Now you can explore your data set in two different ways.

An important note. The Topic Modelling Tool really assumes that you are submitting a corpus of multiple texts (documents), so that is the vocabulary it uses in the output. However, you can also submit a chunked text, in which case everything referred to as a document would actually be a chunk. Why would you want to do this? If you chunked a text and submitted it, you could then see which topics are more prominent in different parts of the text.

Creating Your Own Data Sets

A data set is just a text (.txt) file with each text/document on a separate line. The easiest way to create one is to tokenise your text. Once you have done this, copy the list of tokens into Microsoft Word. Using the Find and Replace function, search for ^p (which means a paragraph break) and type a single space in the replace field. Click Replace all. The result (it may take a while) should be a list of words with spaces instead of line breaks between them. Save the file as a .txt file. Repeat this process for any other texts/documents you are planning to use. When you are done, copy each text/document and paste it onto a new line in your text editor. Save the file in .txt format, and now you have your data set.
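The Word steps above can also be scripted. Here is a minimal Python sketch, using made-up document names and contents, that collapses each text to a single line (the equivalent of replacing ^p with a space) and writes one document per line:

```python
import re

# Hypothetical documents as raw strings; in practice you would read
# these from your own files.
documents = {
    "beowulf": "Hwaet we Gardena\nin geardagum\ntheodcyninga",
    "judith": "tweode gifena\nin thys ginnan grunde",
}

with open("dataset.txt", "w", encoding="utf-8") as out:
    for name, text in documents.items():
        # Collapse line breaks and runs of whitespace to single spaces,
        # mirroring the ^p-to-space replacement done in Word.
        one_line = re.sub(r"\s+", " ", text).strip()
        out.write(one_line + "\n")  # one document per line
```

The resulting dataset.txt can be selected directly with Select Input File or Dir.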

The easiest way to chunk a text for making into a data set is to begin with the list of tokens. Paste it into an editor like Notepad++, which lists line numbers. You can then see one token on each numbered line. If you wish to divide the text into 1000-word chunks, scroll down to line 1000. Next to the token, type (without spaces) the tag <milestone>. Then do the same thing at line 2000, and so on until you have reached the last chunk. Now copy the entire text and paste it into Word. Follow the instructions above to replace ^p with a space. Once you have done that, there is one additional step: replace the text <milestone> with ^p. When you have done this, every chunk of the text will be on a separate line. Save the file in .txt format, and now you have a data set consisting of a chunked text.
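The same chunking can be scripted, skipping the <milestone> tags entirely. A short Python sketch, assuming you already have your tokens in a list (the placeholder tokens here are made up):

```python
# Split a token list into fixed-size chunks and write one chunk per line,
# automating the manual <milestone> / Word workflow described above.
tokens = ["word%d" % i for i in range(2500)]  # stand-in for a real token list
chunk_size = 1000

with open("chunked_dataset.txt", "w", encoding="utf-8") as out:
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        out.write(" ".join(chunk) + "\n")  # each chunk becomes one "document"
```

Changing chunk_size lets you experiment with finer or coarser divisions of the text.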

Some Further Magic

Aspects of the HTML output may annoy you. For instance, if you are using a chunked text, the references to “doc1”, “doc2”, and so on, may be confusing. But, using your knowledge of HTML, you can open all_topics.html and simply change the text to what you want. Change “doc1” to “Chunk1”, for instance, then go ahead and do the same for all the other HTML files the tool generates. It’s not really necessary, but it helps to customise the output to something you like.
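If there are many HTML files to edit, a small script can do the renaming for you. This sketch is deliberately crude (it replaces every occurrence of the string "doc", so check your files first), and the demo file it creates is only a stand-in for the tool's real output:

```python
import glob
import os

# Demo setup only: a stand-in for one of the tool's generated HTML files.
os.makedirs("output_html", exist_ok=True)
with open("output_html/all_topics.html", "w", encoding="utf-8") as f:
    f.write("<html><body><p>doc1</p> <p>doc2</p></body></html>")

# Replace "doc" with "Chunk" in every HTML file in the output folder.
# The folder name matches the tool's output; the label is your choice.
for path in glob.glob("output_html/*.html"):
    with open(path, encoding="utf-8") as f:
        html = f.read()
    html = html.replace("doc", "Chunk")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
```

A plain-text replacement like this leaves the HTML tags themselves untouched, so the pages still display and link correctly.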

You can also use this method for a more important function. Let’s say you have a topic for which the first six words are “dog cat elephant giraffe bird zoo”. It might be reasonable to call this topic “Animals”, even though one member is not an animal (but does relate to the dominant semantic idea). This is only a human interpretation given to a category constructed by the computer based on the probable relationships between the words, rather than on any knowledge of their meanings. But that’s what we humans (and humanities students) do: interpret. So we could just hack into our HTML files and change “dog cat elephant giraffe bird zoo” to “Animals: dog cat elephant giraffe bird zoo”, or even just “Animals”. That way, when we navigate around the web pages, we have actual useful labels for our topics. If the topic “Animals” occurs in document 1 or chunk 1, we can then easily begin to form conclusions about its prominence in that text or chunk.

Visualisation

At the end of the day, this process gives us more information than topic flowers, but it does not give us a pretty visualisation. I have no easy answers for how to go to the next step. It will probably require taking the data in the output_csv folder and massaging it into a format suitable for a visualisation tool. If I find any easy ways to do this, I’ll add them to the end of this document, which is long enough already.
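As a very modest starting point, here is a sketch that turns a CSV of topic weights into a crude text-mode bar chart. The filename, column names, and numbers are all made up for the demonstration; the real files in the output_csv folder will have a different layout, so adapt the column handling to what you actually find there.

```python
import csv

# Hypothetical CSV of topic weights for one document. The real output_csv
# files will be laid out differently; this is a format assumption.
rows = [("Animals", 0.42), ("Weather", 0.31), ("Science", 0.27)]
with open("topic_weights.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["topic", "weight"])
    writer.writerows(rows)

# A crude text-mode bar chart: one '#' per 5% of topic weight.
bars = []
with open("topic_weights.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        weight = float(row["weight"])
        bars.append("%-10s %s" % (row["topic"], "#" * round(weight * 20)))
print("\n".join(bars))
```

It is not pretty, but the same loop could just as easily feed a charting library once you know the shape of the real CSV files.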

For now, I want to emphasise that I have been deliberately long-winded in describing the process of doing topic modelling using the Topic Modeling Tool. Despite the length of this post, using the tool is really very easy.
