Simplifying build_frequency_table


def build_frequency_table(corpus):
    #I came back and added the below conditions due to the next two steps of ngram. Point is prepare_text wouldn't work 
    #on lists due to lower(). and so I had to check the type before creating the dictionary. I hope that makes sense =)
    if type(corpus) is str:
        corpus_words = prepare_text(corpus)
    else:
        corpus_words = corpus

    unique_corpus_words = []
    for word in corpus_words:
        if word not in unique_corpus_words:
            unique_corpus_words.append(word)

    unique_word_counts = []
    for word in unique_corpus_words:
        unique_word_counts.append(corpus_words.count(word))

    frequency_table = {key:value for key, value in zip(unique_corpus_words, unique_word_counts)}

    return frequency_table

print(lily_sample.word_count_frequency)

Great job with this function! As you mention in your comment, it corrects the problem I mention in Issue 2. However, even with Issue 2 in mind, I did want to show a bit of a simpler way to achieve the same functionality. Here is an example implementation:

def build_frequency_table(corpus):
    # I came back and added the below conditions due to the next two steps of ngram. Point is prepare_text wouldn't work
    # on lists due to lower(). and so I had to check the type before creating the dictionary. I hope that makes sense =)
    if type(corpus) is str:
        corpus_words = prepare_text(corpus)
    else:
        corpus_words = corpus

    unique_word_counts = {}
    for word in corpus_words:
        if not unique_word_counts.get(word): # Check if the word exists
            unique_word_counts[word] = 0 # If not, add it to the table
        unique_word_counts[word] += 1 # Now add 1 to it

    return unique_word_counts

So, to correct for the type of input we are accepting, the first part of the function can remain. Also, I did want to mention quickly: great job correcting for the type of input we are accepting -- I actually do this type of test frequently in my own code.

Nevertheless, notice that the first thing I do is change the unique_word_count variable to a dictionary. This is the dictionary we will be later returning with the word count of each word.

Next, notice that I am using the get function to test if the word exists in our newly created dictionary. The get function will return the value if it does exist, but it will return None if it does not. Thus, the if not statement in front of the get function will determine if the value exists or not. If it does not exist, we are creating the value in our dictionary using the line unique_word_counts[word] = 0 # If not, add it to the table.

Finally, I am always adding 1 to the value stored in our dictionary, thus keeping track of any instance of the value in our dictionary.

This simplified our function because we no longer need two for loops or the zip function to create the dictionary. Instead, we only need to create the lone dictionary and generate the proper count of the word as we iterate through the loop.

Again, I think the original implementation is clever, but this simplifies it a little bit. Hope it helps!

yalqaoud / pwp-capstones

Simplifying build_frequency_table #3