Computer Science 252
Assignment 6: Latent Semantic Analysis
Due Friday 18 November
- Enhance our understanding of LSA by applying it to the real-world problem of document retrieval.
- Replicate precisely the work of other scientists, a crucial and
under-appreciated aspect of
Because of the popularity of LSA, there are already some great examples on the web. This step-by-step
shows the use of LSA for document retrieval, using a very small corpus of documents (three) of a few words each.
The output of my lsa.py script closely matches these results, except that I
avoid repeating the results from a previous step.
To get started, write a main that looks like this:
if __name__ == '__main__':
    docs = ['shipment of gold damaged in a fire',
            'delivery of silver arrived in a silver truck',
            'shipment of gold arrived in a truck']
So all the actual work will go into your show_lsa() function. Use your Pythonic toolkit,
like .split(), set() and array slicing, to minimize the amount of code you need to write.
Indeed, thanks to the power of NumPy, the first step – printing out a
nicely-formatted table – was actually the most time-consuming. For that step, I
experimented with printing blank spaces and tabs until I got a nice-looking table.
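As a rough illustration of the toolkit mentioned above, here is a minimal sketch of how .split(), set() and a format string could produce the word list, the word-count matrix and an aligned table. The function names build_vocabulary, build_term_document_matrix and format_table are hypothetical, and the column layout is only one possibility; it is not necessarily the layout used in the step-by-step example or in my script.

```python
import numpy as np

docs = ['shipment of gold damaged in a fire',
        'delivery of silver arrived in a silver truck',
        'shipment of gold arrived in a truck']

def build_vocabulary(docs):
    """Return the alphabetically sorted list of distinct words in all documents."""
    return sorted(set(word for doc in docs for word in doc.split()))

def build_term_document_matrix(docs, vocabulary):
    """Return a words-by-documents array of raw word counts."""
    return np.array([[doc.split().count(word) for doc in docs]
                     for word in vocabulary])

def format_table(vocabulary, matrix):
    """Return the matrix as text, one row per word, with aligned columns."""
    width = max(len(word) for word in vocabulary)
    lines = []
    for word, row in zip(vocabulary, matrix):
        counts = ' '.join(f'{int(count):3d}' for count in row)
        lines.append(f'{word:<{width}} {counts}')
    return '\n'.join(lines)

vocabulary = build_vocabulary(docs)
A = build_term_document_matrix(docs, vocabulary)
print(format_table(vocabulary, A))
```

Padding each word to the width of the longest one avoids the trial and error with spaces and tabs, at the cost of a slightly denser format string.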
Here are some tips to help you with the mathematical part:
- numpy.set_printoptions(precision=4) will allow you to check your results against the ones in the PDF.
- To build the query matrix (really, a vector) in Step 1, I wrote a function build_query
that took a list of document words and a list of query words, and returned a 1 where a document
word was in the query, and 0 otherwise.
- I also wrote a cosine function for vector cosine, and a magnitude function to
support it, exactly as we did in our first exam.
- For a tiny text corpus like this one, the co-occurrence values are
small enough that taking the logarithm creates more problems than it
solves. So you do not need to do the add-one / take-logarithm part of
LSA here. However, it is still worth revisiting the simple LSA program
from slide #15 of the
- As you will see, numpy.linalg.svd returns the matrix Vt, the transpose of V. So
when I needed the actual matrix V, I did V = Vt.transpose().
- For the matrix inverse in Step 5, I used numpy.linalg.inv().
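The tips above could be sketched roughly as follows. This is only one way to write the helpers, under some assumptions: the function bodies are my guesses at a reasonable implementation, the 4x3 matrix A is made up for illustration (it is not the matrix from the step-by-step example), and full_matrices=False is a choice made here so that the diagonal matrix of singular values is square and can be inverted.

```python
import numpy as np

def build_query(doc_words, query_words):
    """Return a vector with 1 where a document word is in the query, else 0."""
    return np.array([1 if word in query_words else 0 for word in doc_words])

def magnitude(v):
    """Return the Euclidean length of vector v."""
    return np.sqrt(np.dot(v, v))

def cosine(u, v):
    """Return the cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (magnitude(u) * magnitude(v))

np.set_printoptions(precision=4)

# Made-up matrix, used only to show the NumPy calls.
A = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])

# svd returns the singular values as a 1-D array and V already transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.transpose()          # the actual matrix V
S = np.diag(s)              # singular values as a diagonal matrix
S_inv = np.linalg.inv(S)    # matrix inverse, as needed in Step 5

print(build_query(['a', 'gold', 'truck'], ['gold', 'truck']))
print(cosine(np.array([1., 0.]), np.array([1., 1.])))
```

Note that this sketch keeps all the singular values; reducing the matrices to a lower rank with array slicing, as the example does, is left out here.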
What to turn in to Sakai
All you need to turn in for this assignment is your lsa.py script, which should produce an
output like mine.