Lab 8
Similarity Analysis of Text Files
Submit after the end of the lab session

Overview

For this lab, you will experiment with a TextModel class that provides methods for analyzing texts in terms of their word frequencies. Comparisons will be made based on cosine similarity, a concept discussed in class.
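
As a reminder of the idea, here is a minimal sketch of cosine similarity between two word-frequency tables. It is not the TextModel implementation: the function name cosine_similarity and the two sample sentences are made up for illustration, and the lab's class may tokenize and compare texts differently.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Return the cosine of the angle between two word-frequency vectors.

    Hypothetical helper for illustration only; not part of TextModel.
    """
    # Only words that occur in both texts contribute to the dot product.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    # Vector lengths (Euclidean norms) of each frequency table.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction, so treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


text_a = Counter("the cat sat on the mat".split())
text_b = Counter("the dog sat on the log".split())
similarity = cosine_similarity(text_a, text_b)
print(round(similarity, 3))  # 0.75
```

A value near 1 means the two texts use words in similar proportions; a value of 0 means they share no words at all.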

Instructions

  1. Download the zip folder for the lab.
  2. Study the TextModel class. Note the class variables, class methods, instance methods and instance variables. Provide an example of each.
  3. Using the examples in the comments and some of your own, run TextModel.process_word on several words. Does it always change the words in a reasonable way? Provide examples of when it works well and when it might cause problems.
  4. Perform a few pairwise similarity comparisons to get a sense of what the similarity measure reports. Then, without running any more comparisons, predict which pairs of texts in the docs folder will be most similar and which will be most dissimilar.
  5. Run TextModel.load() and TextModel.analyze(). Compare these results to your predictions.
  6. Uncomment the include_list assignment in the .py file. Comparisons will now be made only on the words in that list. Repeat the previous step to run the analysis again. What difference does it make?
  7. Create your own include_list of words that matter to you. Be sure to include contrasting words (e.g., 'war' and 'peace'). Report any changes in the results.
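
To get a feel for why an include list can change the results in steps 6 and 7, here is a small sketch under stated assumptions. The helpers cosine_similarity and restrict and the two toy word lists are hypothetical and are not taken from the lab's code; the sketch simply assumes that words outside the list are dropped before the comparison is made.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two word-frequency tables (illustrative helper)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def restrict(counts, include_list):
    """Keep only the words named in include_list (assumed behavior)."""
    return Counter({w: c for w, c in counts.items() if w in include_list})


# Two toy "documents" that share common words but differ on war/peace.
doc_a = Counter("war war peace the the the of".split())
doc_b = Counter("peace peace peace the the the of".split())
include_list = ["war", "peace"]

full = cosine_similarity(doc_a, doc_b)
filtered = cosine_similarity(restrict(doc_a, include_list),
                             restrict(doc_b, include_list))

print(f"all words:    {full:.3f}")
print(f"include_list: {filtered:.3f}")
```

In this toy case the shared common words ('the', 'of') make the full comparison look more similar than the comparison restricted to the contrasting words. Whether the same pattern appears in the docs folder is for you to check.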

Deliverable

Create a text or pdf file that contains the following:

  1. A statement that summarizes your completion of the lab. As appropriate, the statement should include the following:
    • Who you worked with on the lab
    • Any difficulties you encountered
    • A summary of your experience
    • Your answers to the prompts in the instructions

Put your file in a folder, zip it and submit it under Lab 8 on D2L. Check that your submitted zip file is complete.

Grading

Your lab submission will be graded using the following rubric:

  • + .5 --- Your submission is clearly formatted.
  • + .5 --- Your submission includes a summary statement and describes how you collaborated.
  • + .5 / 1.0 --- You submitted most of the lab (0.5) or you submitted all of the lab (1.0).
  • + .5 --- Your lab submission is generally correct.