How to Scrub Multiple Files in Your Text Editor

Although you could search for and replace everything you don’t want one character at a time, in one file at a time, this would quickly become tedious. Text editors let you automate these tasks across multiple files at once. They also let you enter patterns that match several characters at once, using a language called Regular Expressions (this is what Scrubber does behind the scenes). The instructions below allow you to do essentially the same thing as Scrubber.

Although the instructions are for Notepad++, TextWrangler works in a very similar way (except that the Regular expression option is called “Grep”). Instructions for using TextWrangler can be found on pages 120-123 of this document. Use the same regular expression patterns in the Find field as those described below.

To scrub multiple files in Notepad++

  1. Place all the files you want to scrub in a single folder.
  2. In Notepad++, select Replace… in the Search menu (or type Control+H). Click the Find in Files tab.
  3. At the bottom of the dialog, select the Regular expression radio button.
  4. In the Find what field, enter the following regular expression pattern (just copy and paste it from here):

    [.?!'”()\[\]*,;:]

    Put a space in the Replace with field.

  5. Click the “...” button to the right of the Directory field to select the folder where your files are located. Then click Replace in Files.

This procedure will replace most of the punctuation marks with spaces. However, you should scan the resulting files for punctuation marks that are not in the pattern, such as “–”, to make sure nothing has been left behind.
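For readers who prefer a script, the same replacement can be sketched in Python. This is only an illustration under stated assumptions, not what Notepad++ or Scrubber actually runs: the function names scrub_punctuation and scrub_folder are hypothetical, the files are assumed to be UTF-8 text with a .txt extension, and the files are overwritten in place, so work on copies.

```python
import re
from pathlib import Path

# The same characters as the Find what pattern, with the square brackets escaped.
PUNCTUATION = re.compile(r"[.?!'”()\[\]*,;:]")


def scrub_punctuation(text: str) -> str:
    """Replace each listed punctuation mark with a single space."""
    return PUNCTUATION.sub(" ", text)


def scrub_folder(folder: str) -> int:
    """Scrub every .txt file directly inside `folder`, in place.

    Assumes UTF-8 files; returns the number of files processed.
    """
    count = 0
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        path.write_text(scrub_punctuation(text), encoding="utf-8")
        count += 1
    return count
```

Calling scrub_folder on the folder from step 1 would play the role of Replace in Files. As with the editor, characters that are not in the pattern (an en dash, for example) are left untouched.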

If you wish to delete numbers, put [0-9]+ in the Find what field.
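To see what this pattern matches, here is a small Python sketch (the function name remove_numbers is hypothetical). The + means “one or more”, so a whole run of digits such as 1999 is matched and removed in one step rather than digit by digit.

```python
import re


def remove_numbers(text: str) -> str:
    """Delete every run of one or more ASCII digits."""
    return re.sub(r"[0-9]+", "", text)
```

Note that the spaces on either side of a deleted number remain, so the result can contain two spaces in a row.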

Converting Text to Lower Case

I know of no way to convert multiple files to lower case in a text editor. However, you can do this file by file. In Notepad++, open the file you wish to convert and choose Select All (Control+A). Then select Convert Case to->lowercase in the Edit menu (or type Control+U). Make sure to save the file when you are done.

Instructions for TextWrangler are on pages 99-100 of this document.
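If converting file by file becomes tedious, a short script outside the text editor can handle a whole folder. The following Python sketch is one possible approach, not a feature of either editor: the function name lowercase_folder is hypothetical, the files are assumed to be UTF-8 text, and they are overwritten in place, so keep backups.

```python
from pathlib import Path


def lowercase_folder(folder: str, pattern: str = "*.txt") -> list[str]:
    """Convert every matching file directly inside `folder` to lower case.

    Assumes UTF-8 files. Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        lowered = text.lower()
        if lowered != text:  # only rewrite files that actually change
            path.write_text(lowered, encoding="utf-8")
            changed.append(path.name)
    return changed
```

The returned list makes it easy to check which files contained upper-case letters.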

Removing Stop Words

After you have replaced all punctuation with spaces and converted the text to lower case, you can use a similar technique to remove stop words from multiple files. For instance, if you want to remove the stop words “the”, “to”, and “this”, you could enter the following regular expression pattern into the Find what field:

( the | to | this )

Note that it is important to have spaces around each word so that you don’t remove part of, say, “then”, “town”, or “thistle”. It is also possible that you may have to use several short lists of stop words. In TextWrangler, the pattern seems to break if the list is more than one line long in the field. This has not been tested extensively.

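Put a single space in the Replace with field, and all the words will be deleted. (If you leave the field empty, the spaces on both sides of each stop word are deleted too, and the neighbouring words run together.) Because each match uses up the spaces on both sides of the word, two stop words in a row may require running the replacement a second time.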
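The same idea can be sketched in Python. This is an illustration only, and it swaps in a slightly different technique from the pattern above: instead of surrounding spaces it uses the word-boundary marker \b, which also catches stop words at the start or end of the text and two stop words in a row. The names STOP_WORDS and remove_stop_words are hypothetical, and the text is assumed to be lower case already.

```python
import re

STOP_WORDS = ["the", "to", "this"]

# \b marks a word boundary, so "then", "town", and "thistle" are left alone.
STOP_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in STOP_WORDS) + r")\b"
)


def remove_stop_words(text: str) -> str:
    """Replace each stop word with a space, then tidy the leftover spaces."""
    without = STOP_PATTERN.sub(" ", text)
    # Collapse runs of spaces and trim whitespace at both ends of the text.
    return re.sub(r" {2,}", " ", without).strip()
```

Substituting a space rather than nothing keeps the neighbouring words apart. A longer stop word list can simply be added to STOP_WORDS.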