yghern / pwp-capstones

0 stars 0 forks source link

frequency_comparison() does not work appropriately #3

Open karl-project-review opened 5 years ago

karl-project-review commented 5 years ago

For reference, here is your frequency_comparison() function:

def frequency_comparison(table1, table2):
    appearances = {}
    mutual_appearances = {}
    for key in table1:
        if key in table2:
            if table1[key] <= table2[key]:
                mutual_appearances[key] = table1[key]
            else:
                mutual_appearances[key] = table2[key]
        else:
            appearances[key] = table1[key]
    for key in table2:
        if key not in table1:
            appearances[key] = table2[key]
    return len(mutual_appearances)/(len(mutual_appearances) + len(appearances))

There are a few problems with how this function works, since currently it is calculating the number of unique words that appeared in both samples divided by the total number of unique words, but what we want is the number of uses of each given word that are the same in both divided by the total number of word uses.

To make this clearer, I am going to use a small example. Say we have two sentences, "this is a sample and that is not" and "this is what this is". Then the count of mutual appearances should be 3, because "is" appears twice in each sample and there is at least one instance of "this" in each sample but the second instance of "this" in the second sentence does not have a counterpart in the first sentence. Thus, we have 2 for "is" plus 1 for "this" gives us three mutual appearances. Then the count of appearances should be 11, since that is just the total number of words (the 6 in the first sentence plus the 5 in the second sentence). So, we would want to return 3/11. However, your function currently would return 2/(2 + 6) = 1/4. There are a few things we can do to fix this.

One, instead of using len() in the final line (which gives the number of elements in mutual_appearances and appearances rather than the sum of the values in them) we could use sum(). However, it would be more efficient just to make appearances and mutual appearances integer variables instead of dictionaries, as that way we could just add the totals as we go instead of building up dictionaries and then calculating the totals.

Two, we need to increment appearances for every word that is used, not just the words that have no mutual appearances. And three, once this is done, we can just divide by the number of appearances rather than by the sum of the number of appearances with the number of mutual appearances (since the number of mutual appearances will be included in this total).

I am going to include a modified version of your code below that would fix all these things, which you can use as a reference, but make sure you understand what these changes do and why they are important.

# working code
def frequency_comparison(table1, table2):
    appearances = 0 #note that I made these integers
    mutual_appearances = 0
    for key in table1:
        if key in table2:
            if table1[key] <= table2[key]:
                mutual_appearances += table1[key] 
                appearances += table2[key]
            else:
                mutual_appearances += table2[key]
                appearances += table1[key]
        else:
            appearances += table1[key]
    for key in table2:
        if key not in table1:
            appearances += table2[key]
    return mutual_appearances / appearances
yghern commented 5 years ago

OK fixed The trouble was I was (again -_-) writing everything on last day, read instruction two or three times and then worked fast around the clock. Now after your explaination it makes much more sense - and my use of dictionaries for that makes not. Thanks for your detailed comment on that.