Topic Modelling Assignment

The work due for Wednesday is described in the second half of this post.

To do this assignment, you will need the GUI Topic Modeling Tool. Before you download it, make sure Java is installed on your computer. To check whether Java is installed, go here and follow the instructions.
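If you prefer to check for Java from a script, the following is a minimal sketch in Python. It assumes that Java, when installed, is available as a `java` command on your system path, which is common but not guaranteed; the function name `java_is_installed` is made up for this example.

```python
import shutil
import subprocess


def java_is_installed() -> bool:
    """Return True if a `java` command is found and runs successfully."""
    # Assumption: an installed Java is reachable as `java` on the PATH.
    if shutil.which("java") is None:
        return False
    try:
        # `java -version` prints version details and exits with code 0
        # when a working Java runtime is present.
        result = subprocess.run(
            ["java", "-version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Java found" if java_is_installed() else "Java not found")
```

If the script reports that Java was not found, that may only mean Java is installed somewhere outside your path, so the instructions linked above remain the more reliable check.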

To get some background on topic modelling, try reading some of the links here (you don’t have to read all of them).

Here are the steps for doing topic modelling:

1. On your computer, create a folder for your topic model. Inside that folder, create a “texts” folder and an “output” folder.

2. Scrub your texts using Scrubber. Then copy them into the “texts” folder. Each text will be a document in your topic model. You can divide the texts into chunks and make each chunk a document, or you can use separate texts as documents.

3. Optional: Copy the GUI Topic Modeling Tool into your topic model folder so that it is easy to find.

4. Open the GUI Topic Modeling Tool. Select the “texts” folder as your input folder and the “output” folder as your output folder. You may experiment with the various other options. When you are ready, click “Learn Topics”. After about 5-15 seconds, the tool will tell you that it is done.

5. In your web browser, select “Open” from the File menu (or press Ctrl/Command+O). Navigate to your topic model folder, open the “output” folder, and then open the new “output_html” folder that has been created inside it. There you will see a file called “all_topics.html”. Open it to browse through your topic model.
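Steps 1 and 2 can also be done with a short script. The sketch below creates the “texts” and “output” folders and splits one text into chunks, writing each chunk as its own document. It is an illustration, not part of the tool: the function names, the 1000-word default chunk size, and the numbered file names are all assumptions, and it expects a text that has already been scrubbed.

```python
from pathlib import Path


def make_project(root):
    """Create the topic model folder with "texts" and "output" subfolders."""
    root = Path(root)
    texts = root / "texts"
    output = root / "output"
    texts.mkdir(parents=True, exist_ok=True)
    output.mkdir(parents=True, exist_ok=True)
    return texts, output


def chunk_words(text, chunk_size=1000):
    """Split text into chunks of at most chunk_size words each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def write_chunks(source_file, texts_dir, chunk_size=1000):
    """Write each chunk of source_file as a separate document in texts_dir."""
    source_file = Path(source_file)
    text = source_file.read_text(encoding="utf-8", errors="replace")
    paths = []
    for n, chunk in enumerate(chunk_words(text, chunk_size), start=1):
        # Hypothetical naming scheme: original name plus a chunk number.
        out = Path(texts_dir) / f"{source_file.stem}_{n:03d}.txt"
        out.write_text(chunk, encoding="utf-8")
        paths.append(out)
    return paths
```

For example, `texts, output = make_project("my_topic_model")` followed by `write_chunks("scrubbed_novel.txt", texts)` would fill the “texts” folder with numbered chunks of that one file (both names here are placeholders). Smaller chunks give the model more documents to work with, while whole texts make it easier to see which topics dominate a given work.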

Now that you have the process down, here is the assignment for Wednesday:

Download this collection of 18th-century texts (courtesy of Zoe Barovsky). Unzip the archive and place the extracted files in your “texts” folder. You may wish to rename the text files (each title is given at the beginning of its file) so that your results are easier to read.
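Renaming the files by hand is perfectly fine, but it can also be scripted. The sketch below assumes that the title is the first non-empty line of each file and that the files end in `.txt`; check a few files before relying on either assumption. The function names are made up for this example, and by default the script only reports what it would do.

```python
import re
from pathlib import Path


def title_to_filename(title, max_length=60):
    """Turn a title into a short file name stem with no awkward characters."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return cleaned[:max_length].rstrip("_") or "untitled"


def first_nonempty_line(path):
    """Return the first line of the file that contains any text."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.strip()
    return ""


def rename_by_title(texts_dir, dry_run=True):
    """Plan renames from titles; perform them only when dry_run is False.

    Returns a list of (old_name, new_name) pairs.
    """
    plan = []
    used = set()
    for path in sorted(Path(texts_dir).glob("*.txt")):
        # Assumption: the title is the first non-empty line of the file.
        stem = title_to_filename(first_nonempty_line(path))
        new_name = f"{stem}.txt"
        n = 2
        # Avoid clobbering another file or reusing a name already planned.
        while new_name.lower() in used or (
            (path.parent / new_name).exists() and new_name != path.name
        ):
            new_name = f"{stem}_{n}.txt"
            n += 1
        used.add(new_name.lower())
        plan.append((path.name, new_name))
        if not dry_run and new_name != path.name:
            path.rename(path.parent / new_name)
    return plan
```

Calling `rename_by_title("texts")` first lets you read through the proposed names; calling it again with `dry_run=False` performs the renames.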

Run the topic model and do the following:

  1. Label each topic whose words are coherent enough to suggest a “theme”. Do not run a large number of topics (e.g. 50), because you will then have more topics to label; 10-15 topics is a good number.
  2. Are there groups of eighteenth-century texts that are dominated by particular topics? Do a cursory internet search to see if you can find out anything about the texts.
  3. Try to form conclusions based on the results. Are you able to say something about the texts without reading them? Post your conclusions (1-2 paragraphs) to the online forum.

As always, post questions or comments about the process to the forum and/or Twitter. You should especially consider posting thoughts about the problems and implications of employing this method. We will be discussing this on Wednesday.