Quantitative Applications for Data Analysis  (Winter 2019) (Old site; new site is at https://scinet.courses)

5.9 Assignment 9

Due date: Friday, March 22, 2019 at 11:55 pm.

Be sure to use version control (git) as you develop your scripts. Run git add and git commit repeatedly as you add to your scripts. You will hand in the output of git log for your assignment repository as part of the assignment.

Problem

a) Create a file "myutils.py" which contains a function called "word_count" that accepts two arguments: a mandatory string and an optional list of stop words. The function should decompose the mandatory first argument into its component words, create a dictionary which contains a key for each word, and assign each key's value the number of times that word appears in the string.

Use the optional second argument of the function to specify the list of stop words you would like to exclude from the dictionary. The function should then return the dictionary.


>>> import myutils
>>> myutils.word_count("hello there")
{'hello': 1, 'there': 1}
>>> myutils.word_count("this is a wonderful wonderful world")
{'this': 1, 'is': 1, 'a': 1, 'wonderful': 2, 'world': 1}
>>> myutils.word_count("this is a wonderful wonderful world", stopwords=["this", "is", "a"])
{'wonderful': 2, 'world': 1}

Note that the string method "split" will be useful here. The dictionary method "setdefault" is also useful.
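
As a brief illustration of the two methods mentioned in the note, the sketch below shows how "split" and "setdefault" behave on a short string. The variable names (text, words, counts) are chosen only for this example, and the snippet is a starting point rather than the required implementation; in particular it does not handle stop words.

```python
# Illustration only: how str.split() and dict.setdefault() behave.
text = "this is a wonderful wonderful world"

# With no arguments, split() breaks the string on any run of whitespace.
words = text.split()

counts = {}
for word in words:
    # setdefault() returns the value stored under `word`; if the key is
    # missing, it first stores the default (0 here) and returns that.
    counts[word] = counts.setdefault(word, 0) + 1

print(words)   # ['this', 'is', 'a', 'wonderful', 'wonderful', 'world']
print(counts)  # {'this': 1, 'is': 1, 'a': 1, 'wonderful': 2, 'world': 1}
```

Note that split() does not strip punctuation or change case, so "World" and "world," would be counted as different words.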

b) Add another function named "read_file" to the file "myutils.py". This function should receive a file name as an argument, read the file and return its contents. You can use the first three commands in this sequence of commands as an outline of how to read the data from the file.


>>> file = open("shakespeare.sonnets.txt", "r")
>>> data = file.read()
>>> file.close()
>>>
>>> data[:100]
"\n\n                     1\n  From fairest creatures we desire increase,\n  That thereby beauty's rose m"

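The open/read/close sequence above can also be written with a "with" statement, which closes the file automatically even if reading fails. The sketch below is one possible shape, not the required solution: the helper name read_text and the temporary sample file are assumptions made only so that the example runs anywhere.

```python
import os
import tempfile


def read_text(path):
    """Return the full contents of the text file at `path` as one string."""
    # The with-statement closes the file when the block exits,
    # so no explicit close() call is needed.
    with open(path, "r") as handle:
        return handle.read()


# Demonstration on a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w") as out:
        out.write("From fairest creatures\nwe desire increase,\n")
    sample = read_text(sample_path)

print(repr(sample[:12]))  # 'From fairest'
```
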
c) Download the file plotTopWords.py (link: https://support.scinet.utoronto.ca/~alexey/plotTopWords.py) and add it to your git repository.

d) Create a Python driver script called "count_words.py" and import your file "myutils.py".
Your driver script, "count_words.py", should be able to receive and make use of command line arguments. The script should take a file name as an argument, and using the functions "myutils.read_file" and "myutils.word_count", count the word occurrences in the file.

Once the word occurrences have been counted, the script should use the provided function "plotTopWords.plot" to create a plot which displays the results of the word count.

As a test, use the script to count the word frequencies in the file shakespeare.sonnets.txt. Your output should look like the bar plot below.

You should not hard code the file names in your script. Use command line arguments to pass the file name into the script.
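
One common way to receive command line arguments in Python is through the list sys.argv, where element 0 is the script name and the following elements are the arguments. The sketch below assumes that approach; the helper name get_input_file and the usage message are hypothetical, and the call at the bottom passes a simulated argument list so the example can run without a real command line.

```python
import sys


def get_input_file(argv=None):
    """Return the file name given as the first command line argument."""
    if argv is None:
        # In a real script, the arguments come from sys.argv.
        argv = sys.argv
    if len(argv) < 2:
        # Exit with a usage message when no file name was supplied.
        raise SystemExit("usage: count_words.py TEXT_FILE")
    return argv[1]


# Simulates running: python count_words.py shakespeare.sonnets.txt
simulated = ["count_words.py", "shakespeare.sonnets.txt"]
print(get_input_file(simulated))  # shakespeare.sonnets.txt
```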

e) Download the file stopwords_en.txt (link: https://support.scinet.utoronto.ca/~alexey/stopwords_en.txt) containing several stop words.

Modify your driver script so that it will now take an optional argument. This optional second argument is a file of stop words. If the optional stop-words-file argument is supplied, the script should run the function "myutils.word_count" using the stop words taken from the stop words file.

Use the stop words from the provided file "stopwords_en.txt" to test your script (use "myutils.read_file" to read the stop words).

If the stop words file is supplied, the script should still plot the results using the provided function "plotTopWords.plot".
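
The optional second argument can be handled by checking how many arguments were supplied. The sketch below is one possible approach under stated assumptions: the helper names parse_args and to_stopword_list are hypothetical, and it assumes the stop words in the file are separated by whitespace (for example, one per line), which should be checked against the actual "stopwords_en.txt".

```python
def parse_args(argv):
    """Return (text_file, stopwords_file); the second is None if omitted."""
    # argv[0] is the script name, so 2 or 3 entries are valid here.
    if len(argv) not in (2, 3):
        raise SystemExit("usage: count_words.py TEXT_FILE [STOPWORDS_FILE]")
    text_file = argv[1]
    stopwords_file = argv[2] if len(argv) == 3 else None
    return text_file, stopwords_file


def to_stopword_list(contents):
    """Turn the raw contents of a stop words file into a list of words."""
    # Assumption: words are separated by whitespace (spaces or newlines).
    return contents.split()


# Simulated argument lists, standing in for sys.argv.
print(parse_args(["count_words.py", "shakespeare.sonnets.txt"]))
print(parse_args(["count_words.py", "shakespeare.sonnets.txt", "stopwords_en.txt"]))
print(to_stopword_list("this\nis\na\n"))  # ['this', 'is', 'a']
```

A script built this way would call "myutils.word_count" with the stop word list only when the second value returned by parse_args is not None.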

f) Create a bash script named "compare_words.sh" that runs your script "count_words.py" twice:

  • Once on the "shakespeare.sonnets.txt" file without using stop words and
  • Once on the "shakespeare.sonnets.txt" file using the "stopwords_en.txt" stop words file.


Submit your myutils.py, count_words.py, and compare_words.sh files, the two generated plots, and the output of git log from your assignment repository to the 'Assignment Dropbox'.

To capture the output of git log, use redirection (git log > git.log) and hand in the git.log file.

Assignments will be graded on a 10-point basis.

The due date is March 22, 2019 at 11:55 pm, with a 0.5-point penalty per day for late submission until the cut-off date of March 29, 2019 at 11:00 am.

Last Modified: Friday Mar 15, 2019 - 11:33. Revision: 54. Release Date: Friday Mar 15, 2019 - 10:00.

