CSCI 315 Assignment #6

Due 11:59PM Friday 17 November, via Sakai

Goals

  1. Understand the basics of recurrent networks by replicating the classic Simple Recurrent Net results of Elman (1990) using NumPy.

  2. Understand state-of-the-art recurrent networks by building a Long Short-Term Memory (LSTM) network example using TensorFlow.

Part 1: The Simple Recurrent Network

To get started, copy-and-paste your backprop.py module from Assignment #3 into a new folder assignment6. If your Backprop train method doesn't use online learning (i.e., no batching: the weights are updated immediately after each pattern), modify it to do online learning, and re-test it on the classic XOR problem (two inputs, one output) to make sure it still works.
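
If it helps, here is a minimal sanity check you could drop into a throwaway script. The constructor and train signatures are the ones used later in this assignment; the eta and niter values are just illustrative:

  from backprop import Backprop

  # XOR truth table: two inputs, one output
  inputs  = [[0, 0], [0, 1], [1, 0], [1, 1]]
  targets = [[0],    [1],    [1],    [0]]

  bp = Backprop(2, 2, 1)   # two inputs, two hidden units, one output
  bp.train(inputs, targets, eta=0.5, niter=10000)

The error should head toward zero over training; if it stalls, just re-run (XOR with two hidden units occasionally lands in a local minimum).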

To generate your input and target patterns, you should now create another script, part1.py, in which you'll write two functions to produce the kinds of sequences shown on page 185 of Elman's classic paper:

Here is an example IDLE3 session to illustrate how these functions work:
>>> from part1 import xorseq, shift
>>> inputs = xorseq(10)
>>> inputs
[[1], [1], [0], [1], [1], [0], [0], [1], [1], [1], [1], [0], [0], [1], [1], [0], [1], [1], [1], [0], [1], [1], [0], [1], [0], [1], [1]]
>>> targets = shift(inputs)
>>> targets
[[1], [0], [1], [1], [0], [0], [1], [1], [1], [1], [0], [0], [1], [1], [0], [1], [1], [1], [0], [1], [1], [0], [1], [0], [1], [1], [0]]
#      *              *              *              *              *              *              *              *              *
I've added a comment line with asterisks to show that, as in Elman's paper, if you start with the second bit of the target, every third bit of the target is predictable: it is the XOR of the input bit above it and the input bit before that. Note also that, since our Backprop.train method expects the input and target patterns to be lists of lists, xorseq puts each bit into its own list.

Although I started writing xorseq using NumPy slicing tricks, I ended up just writing it using a loop (start with an empty list, append to it in a loop, then return the list). For the random bit, you can use np.random.randint(2), and for the XOR you can use the ^ operator (up-arrow, caret, shift-6, whatever you want to call that key).
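
If you get stuck, here is one possible sketch, assuming xorseq(n) produces n input-input-XOR triplets (so xorseq(1000) gives the 3,000-bit sequence used below) and shift pads its output with a final zero, since the last target is unpredictable anyway:

  import numpy as np

  def xorseq(n):
      seq = []
      for _ in range(n):
          a = np.random.randint(2)
          b = np.random.randint(2)
          seq += [[a], [b], [a ^ b]]   # two random bits, then their XOR
      return seq

  def shift(seq):
      return seq[1:] + [[0]]           # each target is the next input bit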

Once you've written these two functions, it should be easy to get the expected mediocre results using Elman's training regime (a 3,000-bit sequence, 600 training iterations, and two hidden units) by putting the following code in the main of your part1.py:

  from backprop import Backprop
  bp = Backprop(1, 2, 1) 
  inputs = xorseq(1000)
  targets = shift(inputs)
  bp.train(inputs, targets, eta=0.5, niter=600)
If your train method doesn't already periodically report the RMS Error (the root-mean-square of the errors over all output units and patterns for a given iteration), add some code to your train method to do that now. Once I did that, running this code took several minutes, after which (even with the large learning rate) the error only dropped to around 0.5. As Elman mentions, this is the best you can hope for with the classic back-prop net you have implemented, because without the context of the previous bit, the sequential XOR function looks like a random 50/50 guess.
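
As a sketch (nothing more), one convenient way to organize the report is a tiny helper like this, assuming you collect the (target - output) differences for every output unit of every pattern in the current iteration into a list or array called errs:

  import numpy as np

  def rms(errs):
      '''Root-mean-square of a collection of errors.'''
      errs = np.asarray(errs)
      return np.sqrt(np.mean(errs**2))

  # then, inside train(), something like:
  #    if (it+1) % 10 == 0:
  #        print('%d/%d: %f' % (it+1, niter, rms(errs)))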

So, now you're going to replicate Elman's actual contribution, the SRN. Copy-paste-rename your backprop.py into a new script srn.py, and rename the Backprop class in it to SRN. As shown on p. 184 of Elman's paper, this modification is relatively small: you'll need to (1) maintain a copy of the hidden-layer activations computed on the previous input pattern – what Elman calls “context units” – and (2) add more weights to connect this context layer to the hidden layer.

For (1), the context layer, you can simply re-use the variable containing hidden layer activations you've already computed (I called it ah). Of course, this variable won't exist on the first iteration; so, at the start of each training iteration (before looping over patterns), you can assign this variable (hidden-layer activations) a value using np.zeros().

For (2), the context-to-hidden weights, it's easier to think of the context units as additional input units, so you'll still only need two layers of weights (self.wih, self.who); however, self.wih will now have shape $(n+h+1) \times h$ instead of $(n+1) \times h$, where $n$ is the number of input units and $h$ the number of hidden units. Then your new input-layer activations (including bias) can be computed using two appends: np.append(ah, np.append(Ij,1)).
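
Here is a minimal sketch of how those two changes fit together in the forward pass; ah, Ij, and self.wih follow the naming used above, while self.nh (the number of hidden units) and self.g (your sigmoid) are just placeholder names:

  import numpy as np

  # at the start of each training iteration, before looping over the patterns:
  ah = np.zeros(self.nh)                  # the context starts out empty

  # then, for each input pattern Ij:
  ai = np.append(ah, np.append(Ij, 1))    # context units + inputs + bias
  ah = self.g(np.dot(ai, self.wih))       # new hidden activations; these become
                                          # the context for the next pattern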

Once you've written srn.py, it should be trivial to test it by adding the following lines to your part1.py:

  from srn import SRN
  srn = SRN(1, 2, 1) 
  srn.train(inputs, targets, eta=0.5, niter=600)
If you've coded up your SRN correctly, you should see a small but noticeable difference in the final RMS error: instead of the 0.5 you get with ordinary back-prop, you'll get something around 0.48, because of the SRN's increased ability to predict every third bit.

To finish up this part, you're going to add some code to do a more sensible error computation: the RMS error computed by starting at the second bit and striding to every third bit (indices 1, 4, 7, 10, ...). Before doing that, make sure that your Backprop and SRN classes each have a test method, if you haven't written one already. This method should take just the input patterns, loop over them, and return a NumPy array containing the output for each pattern. Finally, back in your part1.py script, create a function xorerr that takes the vector returned by these test methods, as well as the targets vector from above, and returns the RMS error for the values at those predictable bits (a sketch of one possible xorerr appears after the sample output below). By increasing the number of hidden units to 20 and reducing the number of training iterations to 100, I was able to get the following output from my part1.py script; yours should look similar:

Training ordinary backprop:
10/100: 0.504145
20/100: 0.503831
30/100: 0.503760
40/100: 0.503728
50/100: 0.503710
60/100: 0.503698
70/100: 0.503690
80/100: 0.503684
90/100: 0.503679

Training SRN:
10/100: 0.468300
20/100: 0.462518
30/100: 0.460941
40/100: 0.460271
50/100: 0.459882
60/100: 0.459505
70/100: 0.459096
80/100: 0.458638
90/100: 0.458146

XOR Error:  BP = 0.499610390031   SRN = 0.249277977162
As you can see, if we look at just the predictable bits, ordinary backprop still does no better than chance, but the SRN gets it right around 75% of the time. I've seen as bad as 67% on some runs, but it's still obviously better than chance.
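
For reference, here is one possible xorerr, assuming (as described above) that test returns the network's outputs in the same order as the input patterns:

  import numpy as np

  def xorerr(outputs, targets):
      '''RMS error over just the predictable bits (indices 1, 4, 7, ...).'''
      outputs = np.asarray(outputs).flatten()
      targets = np.asarray(targets, dtype=float).flatten()
      diffs = targets[1::3] - outputs[1::3]
      return np.sqrt(np.mean(diffs**2))

  # usage, continuing from the training code above:
  #    print('XOR Error:  BP =', xorerr(bp.test(inputs), targets),
  #          '  SRN =', xorerr(srn.test(inputs), targets))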

Part 2: LSTM

Of course, now that you've had a taste of Deep Learning awesomeness, 75% accuracy seems, well ... “not Scottish”. [1] For this second part of the assignment, we're going to repeat our familiar download/copy/paste/modify approach, this time with TensorFlow, to experiment with the Long Short-Term Memory (LSTM) networks that we're learning about in class. But this time, rather than fixing some partially-working code from the author's GitHub, we're going to take some already-working code and modify it in a useful way.

Spend a few minutes reading through this popular blog post by Prof. Rowel Atienza, on using LSTM to predict words in a story. Then grab the program and example text provided by Atienza. (I just did a select-all/copy/paste into the IDLE3 editor.) Open the program in IDLE3, hit F5, and the program should begin training on the data without further ado. (If you get a big error report, there's likely another TensorFlow program running on your machine, perhaps one that crashed earlier. Your easiest option is to log out and log back in, and, as a last resort, reboot the computer). After several minutes the program should converge to around 90% average accuracy, at which point it will let you type in sequences of words from the text, to show how LSTM can (partially) restore the next words by prediction:

3 words: we could easily
we could easily escape while her . cat that this , and some said that but at last a young mouse got up and said that is all very well and then spoke . then
As you can probably see, much of the fun of using LSTM networks and related technologies for automated language processing is the semi-coherent, obscenely funny outputs that they produce. (If you're in a playful mood, see what happens when you enter a nonsensical sequence of words like cat mouse the.)

Modification 1: Stop training at 90% average accuracy (easy!)

Pretty cool, huh? Still, I found it annoying to have to sit through 50,000 training epochs, when the network surpassed 90% average accuracy about halfway through and then failed to improve. So the first modification you're going to make is to break out of the training loop once the average accuracy surpasses 90%. Run the program again, making sure it still gives decent predictions after you've modified it this way.
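
As a sketch, the check can piggyback on the existing progress report that runs every display_step iterations (variable names as in Atienza's code; adjust if your copy differs, and make sure the check happens before acc_total is reset to zero):

        if (step+1) % display_step == 0:
            # ... existing code that prints Iter, Average Loss, and Average Accuracy ...
            if 100 * acc_total / display_step > 90:
                print("Reached 90%% average accuracy after %d iterations" % (step+1))
                break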

Modification 2: Avoid masking errors during testing (easy!)

If you look at the bottom of the code (testing phase), you'll see that Atienza has used a try ... except mechanism to avoid crashing the program when the user tests it with an unknown word. This makes sense; however, he has also put the rest of the testing code (including TensorFlow calls) in that try ... except, which will make it difficult to debug that code once we change it – because any error, including undefined variables and the like, will confusingly be reported as "Word not in dictionary". So for this step, make the simple change of moving the code starting with for i in range(32): outside the exception handler. You'll have to be more careful about not entering bad input in the testing phase, but it'll be worth the trouble.
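
Here is roughly what the restructured part looks like (a sketch; variable names follow Atienza's original, and the continue after the error message is a small extra step beyond the minimal change, so that bad input just sends you back to the prompt instead of reaching the moved code):

        try:
            symbols_in_keys = [dictionary[str(words[i])] for i in range(len(words))]
        except KeyError:
            print("Word not in dictionary")
            continue
        for i in range(32):
            keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])
            onehot_pred = session.run(pred, feed_dict={x: keys})
            # ... rest of the prediction loop, unchanged ...
        print(sentence)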

Modification 3: Towards more realistic word codes (difficult!)

As Atienza mentions (final note 2), using one-hot codes for words is okay with a small vocabulary like the 112 words here; however, once your vocabulary grows to a realistic size (like the 20,000-35,000-word vocabulary of a typical adult), the need to have a new output unit for each word would make the output layer impractically huge.

Unfortunately, looking over the code, you'll see that the one-hot assumption is heavily entrenched in the model:

  1. The loss function is built around softmax, which treats the target as a one-hot probability distribution over the vocabulary.

  2. The dictionary returned by build_dataset maps each word to an index, and the one-hot target codes are built on-the-fly from those indices during training.

  3. The reverse dictionary maps indices back to words, so the network's prediction has to go through argmax (during both training and testing) before it can be turned back into a word.

Let's deal with these three interrelated issues one at a time.

For the loss function, a little digging in the tf.nn part of the Python API section (on the left side of the TensorFlow docs) turned up a similarly-named loss function that uses an activation function other than softmax. (Hint: it should be familiar from the first half of the course.) When I changed my loss function to use that other function, I still got good results (hitting 90% average accuracy well before 50,000 epochs).

For the dictionary part, I was a little puzzled by the confusingly-named build_dataset function that creates the dictionary and reverse dictionary – specifically, by the clever use of zip to create the reverse dictionary. So the first thing I did was to get rid of this trick and build the reverse dictionary in the same loop as the dictionary. This gave me confidence that I could modify both dictionaries to do something different. Once I had done that, I realized I could have build_dataset also build a code dictionary mapping words to one-hot codes, and return it as a third returned item. This let me eliminate the on-the-fly construction of the one-hot codes during training, replacing it with a lookup in the new dictionary:

        symbols_out_onehot = code_dictionary[str(training_data[offset+n_input])]  # new

        symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])                # same as before
To keep things simple, my code dictionary used a plain old Python list of 0s and 1s (integers), not a NumPy array of floating-point numbers, which worked just fine thanks to the power of TensorFlow's placeholder mechanism.

What about the reverse dictionary? If you look at where it's used, you'll see a clever trick: the one-hot prediction from the network is turned into an index using argmax, and that index is passed to the reverse dictionary to give you back the corresponding word. This is done twice: once during training, as a kind of additional progress report, and once during testing, to reconstruct the story from your three-word input fragment.

To move away from relying on this trick, I had my build_dataset function build the reverse dictionary using keys that were no longer indices, but rather the string form of one-hot codes. For example, one of the entries was:

'0100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000': 'the'
So, we're still using a one-hot code, but now it's in the dictionary instead of the training/testing code, which will make things easier if we want to switch to a more realistic coding scheme later.
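
Putting the pieces together, a modified build_dataset consistent with the description above might look like this (a sketch only; Atienza's original uses collections.Counter to gather the word list, which I've kept):

  import collections

  def build_dataset(words):
      count = collections.Counter(words).most_common()
      vocab_size = len(count)
      dictionary = {}                # word -> index, as before
      reverse_code_dictionary = {}   # one-hot code string -> word
      code_dictionary = {}           # word -> one-hot code (plain list of 0s and 1s)
      for word, _ in count:
          index = len(dictionary)
          dictionary[word] = index
          code = [1 if i == index else 0 for i in range(vocab_size)]
          code_dictionary[word] = code
          reverse_code_dictionary[''.join(str(b) for b in code)] = word
      return dictionary, reverse_code_dictionary, code_dictionary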

Unfortunately, we're still going to have to rely on argmax to finish the job. Looking at the final line of the RNN neural-net function, I realized that it's returning a raw dot product (tf.matmul) without passing the result through a sigmoidal or ReLU function. So instead of zeros and ones, your prediction is going to be a bunch of arbitrary floating-point numbers. As a temporary solution to this problem, I wrote a “cleanup” function that took this prediction and returned a string representing the one-hot code made via argmax, which I then used to get the predicted word:

  onehot_pred = session.run(pred, feed_dict={x: keys})
  symbols_out_pred = reverse_code_dictionary[cleanup_code(onehot_pred)]
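
The cleanup function itself can be as simple as an argmax plus some string-building; here is a sketch, assuming the prediction comes back as a 1-by-vocabulary-size array of raw outputs:

  import numpy as np

  def cleanup_code(pred):
      '''Turn a raw prediction into the string form of a one-hot code.'''
      pred = np.asarray(pred).flatten()
      code = ['0'] * len(pred)
      code[int(np.argmax(pred))] = '1'    # winner-take-all: biggest output wins
      return ''.join(code)
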
So the third issue – how to eliminate argmax and move beyond one-hot codes – has to remain unsolved at this point (see next section). We have, however, taken some big, crucial steps toward its solution, by moving some of our argmax usage into the dictionary, and out of the main body of the code.

Extra credit / Final project idea

So, as you might expect, an extra-credit version of rnn_words.py would eliminate argmax entirely. One way of doing this is to use a sparse distributed representation, in which several of the bits (instead of just one) are 1, and the rest are 0. Although this might not seem like a big difference, consider the capacity of such a coding scheme: whereas a 112-bit one-hot code can represent 112 items, a 112-bit code with three 1 bits can represent 112 choose 3, or 227,920 items!
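
To get a feel for that capacity, here is a toy illustration (mine, not part of the assignment code) that enumerates every 112-bit code with exactly three 1 bits:

  from itertools import combinations

  def three_hot_codes(nbits=112):
      '''Generate all codes with exactly three 1 bits: C(112,3) = 227,920 of them.'''
      for positions in combinations(range(nbits), 3):
          code = [0] * nbits
          for p in positions:
              code[p] = 1
          yield code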

Another popular alternative to one-hot codes is word embedding, in which the representation of a word is compressed into a reasonably-sized vector in which similar words are closer together (have a higher dot product) than dissimilar words. The LSTM code that you've been working with here is part of a larger repository, which includes code for building word vectors.

So, if this language-processing work appeals to you, an excellent final-project idea would be to modify your scripts for this part to use Word2Vec representations instead of one-hot codes, or even just put aside LSTM and do something interesting (like semantic-space visualization) with the Word2Vec program alone.

What to submit to Sakai

Zip up everything I'll need to test your work quickly into assignment6.zip, as usual. For Part 1, I will test your code by opening part1.py and hitting F5, after which I'll want to see a short, simple output like the one shown at the end of the instructions for Part 1 above. For Part 2, include your modified version of Atienza's program, along with anything else (such as the story text) needed to run it the same way.

[1] For a fair comparison of SRN and Deep Learning on the sequential XOR problem, you'd want to switch to using one-hot codes in the SRN ([0,1] for False, [1,0] for True). If you have time at the end for a little extra-credit work, feel free to add a script extra.py that does this.