zzcheah / Interview-Exercises


Discussion Only: Do not need to implement #2

Open orangepaydev opened 1 year ago

orangepaydev commented 1 year ago

Right now we require at least 2 out of 3 attributes to match.

  1. How much effort would it cost to add an additional attribute? Say we now have 4 attributes and 3 out of 4 need to match?
  2. Using the current implementation, how much memory would it cost to keep up if we were to scale up and process 1 million records? Would a Bloom filter help in this case?
zzcheah commented 1 year ago

Question 1

Depending on the requirement, the number of possible combinations, N, can be calculated. Patterns for these combinations can be identified dynamically (e.g. attribute values joined with "," and wrapped in { }). Based on these patterns, there will be N entries to add to the HashSet per record.

For the current implementation, the entry pattern is easy to identify. More generally, we can take parameters such as number_of_attributes and number_of_matches_required and use nested loops (or recursion) to produce the necessary entries to add to the set, as in the sketch below.
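
A rough sketch of how this could be generalized (class, method, and parameter names here are made up for illustration, not taken from the existing code):

```java
import java.util.ArrayList;
import java.util.List;

public class CombinationKeys {

    // Generate every key formed by choosing `matchesRequired` attribute values
    // out of the record's attributes, e.g. {a,b}, {a,c}, {b,c} for 2-of-3.
    static List<String> keysFor(List<String> attributes, int matchesRequired) {
        List<String> keys = new ArrayList<>();
        collect(attributes, matchesRequired, 0, new ArrayList<>(), keys);
        return keys;
    }

    // Recursive equivalent of the nested loops: pick or skip each attribute
    // until `matchesRequired` values have been chosen.
    private static void collect(List<String> attrs, int remaining, int start,
                                List<String> current, List<String> out) {
        if (remaining == 0) {
            out.add("{" + String.join(",", current) + "}");
            return;
        }
        for (int i = start; i <= attrs.size() - remaining; i++) {
            current.add(attrs.get(i));
            collect(attrs, remaining - 1, i + 1, current, out);
            current.remove(current.size() - 1);
        }
    }

    public static void main(String[] args) {
        // 3-of-4 matching produces C(4,3) = 4 keys per record.
        System.out.println(keysFor(List.of("a", "b", "c", "d"), 3));
        // -> [{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}]
    }
}
```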


Question 2

For the worst case where no duplicate is detected, each record produces 3 entries to add to the set, so 1 million records leave 3 million entries in the HashSet, taking up a large amount of memory (assuming an average of 15 characters per entry for the current implementation, and worse if more combinations are required).
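
As a rough back-of-envelope estimate (the per-entry numbers are JVM-dependent approximations, not measurements): a Java String of ~15 characters costs on the order of 50 to 70 bytes of object and character storage, and each HashSet entry adds roughly another 40 to 50 bytes of node and reference overhead, so 3 million entries would land somewhere in the ballpark of 300 MB of heap.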

Due to my limited understanding of Bloom filters, I have tried to come up with a "better" solution below. Please correct me if my understanding or the suggested solution is not feasible.

Flow:

  1. Design a Bloom filter, possibly with the help of a tool like this: Bloom Filter Calculator. Example:

[Bloom Filter Calculator screenshot: an 8MB filter with roughly a 1-in-443 false-positive rate]

  2. For every record, we still generate the entries and add them to the set. At the same time, we calculate the record's hash bits and add them to the Bloom filter (insert operation).

  3. Once a certain threshold is reached (say 100 entries in the set), we write the entries in the set to a file on disk, as shown in the sketch after this list.
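
A minimal sketch of this insert path, assuming Guava's `BloomFilter` for the filter itself; the class name, the 100-entry threshold, and the spill-file naming are all illustrative:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class DedupWriter {
    // Sized from the calculator: expected insertions and target false-positive rate.
    private final BloomFilter<CharSequence> filter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 3_000_000, 0.01);

    private final Set<String> inMemory = new HashSet<>();
    private static final int SPILL_THRESHOLD = 100; // illustrative threshold
    private int spillCount = 0;

    // Insert: add each generated key to the in-memory set and to the Bloom filter,
    // then spill the set to a file on disk once it grows past the threshold.
    void insert(Iterable<String> keysForRecord) throws IOException {
        for (String key : keysForRecord) {
            inMemory.add(key);
            filter.put(key);
        }
        if (inMemory.size() >= SPILL_THRESHOLD) {
            Path file = Path.of("spill-" + (spillCount++) + ".txt"); // hypothetical file layout
            Files.write(file, inMemory);
            inMemory.clear();
        }
    }
}
```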

Detection:

  1. For each record, after calculating the hash bits, we do the lookup operation to check whether there is any possibility of a match (all hashed bits are 1).

  2. If all hashed bits are 1, first check for a duplicate in the in-memory HashSet; if not found, load the sets saved to disk earlier one by one to check for a duplicate (see the sketch after this list).
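
Continuing the same hypothetical `DedupWriter` sketch from above (it reuses that class's `filter`, `inMemory`, and spill-file fields), the lookup side might look like this:

```java
    // Lookup: a negative from the Bloom filter means the key was definitely never
    // inserted, so the disk files only need to be scanned on a (possible) hit.
    boolean isDuplicate(String key) throws IOException {
        if (!filter.mightContain(key)) {
            return false;                      // true negative, no disk access
        }
        if (inMemory.contains(key)) {
            return true;                       // found in the current in-memory set
        }
        // Possible match (or false positive): scan the spilled sets one by one.
        for (int i = 0; i < spillCount; i++) {
            Path file = Path.of("spill-" + i + ".txt");   // hypothetical file layout
            if (Files.readAllLines(file).contains(key)) {
                return true;
            }
        }
        return false;                          // it was a Bloom filter false positive
    }
```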

Reasoning:

  1. This new design uses less memory. Compared to the current implementation, where all entries live in memory, the new design stores the sets on disk.

  2. While the current design could also opt to save the sets on disk, introducing the Bloom filter provides another benefit: it reduces the number of times the sets stored on disk have to be loaded.

The Bloom filter filters out all true-negative entries, avoiding the "heavy" operation of loading sets from disk. In the example above, with a Bloom filter of 8MB, only about 1 out of 443 entries would require loading sets from disk.
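
For reference, the approximation such calculators are typically based on, for a filter with $m$ bits, $k$ hash functions and $n$ inserted items, is

$$p \approx \left(1 - e^{-kn/m}\right)^k$$

so the 8MB figure is really a tuning knob: a larger $m$ (or a better-chosen $k$) pushes the false-positive rate down and further reduces how often the sets on disk have to be loaded.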