xNS5 / rentalreviewsdata

Data repository for the Rental Reviews front-end
https://github.com/xNS5/rentalreviews
GNU General Public License v3.0

JSON Key Modifications #9

xNS5 closed 4 months ago

xNS5 commented 5 months ago

Combining the reviews makes it a smidge annoying to display each source's reviews separately. One idea I had was to add a new key to each review indicating its source, but that seems like a lot of extra work for little gain. Instead of running some kind of filter(...) or map(...) function on the data client side, I could just return the specific key for the data.

xNS5 commented 5 months ago

Order of operations (in no particular order)

This is important as at this stage I have yet to implement the review portion in the frontend.

My reasoning for this is to reduce the number of DB requests made. I ultimately want the user to be able to view the reviews grouped by the originating website, and right now that would require another DB request. It doesn't make much sense to keep the reviews grouped the way they currently are.

xNS5 commented 5 months ago

Add a "slug" key in the JSON file. It'll make it easier to do DB searches based on the slug instead of doing string manipulation at query time.
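A minimal sketch of how such a slug could be generated when the JSON is written. The `slugify` helper and the `name` field are assumptions for illustration; only the "slug" key itself comes from the comment above.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Turn a display name into a lowercase, hyphen-separated slug."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower())
    # Drop hyphens left over at either end.
    return slug.strip("-")


# Hypothetical record: the "name" field is assumed, "slug" is the new key.
business = {"name": "Landmark Bellingham"}
business["slug"] = slugify(business["name"])
```

Storing the slug once at write time means the front-end can look a record up by an exact match rather than rebuilding the string on every request.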

xNS5 commented 5 months ago

Now that I'm at the point where I'm wanting to change the JSON structure, I think this is a good opportunity to just re-scrape the data and create a single "Review" class that handles all of the data structuring.

xNS5 commented 5 months ago

TODO for tomorrow:

xNS5 commented 4 months ago

TODO:

On the other hand, when I'm merging files I'm doing it with just the json library. A problem arises when I'm trying to get just the reviews. Since the key prefix will be variable, I would need to get the keys of the base Business object, then iterate through them and look for the ones ending in _reviews. I don't like that much. One alternative would be to bring back the overarching reviews key and put the [company_name]_reviews keys and their array values under it. A benefit of this is that I could just iterate through the keys of that object, and I wouldn't need to do any string inspection/manipulation when calculating the averages. This way I could keep the average calculations where they are.
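A rough sketch of the nested layout described above, assuming each review carries a numeric `rating` field (that field name, the sample record, and the `average_by_source` helper are all hypothetical). With every source's list under one reviews key, the averages come from iterating over that object's keys, with no suffix matching.

```python
from statistics import mean

# Hypothetical Business record; ratings are placeholder values.
business = {
    "name": "Example Rentals",
    "reviews": {
        "yelp_reviews": [{"rating": 4}, {"rating": 2}],
        "google_reviews": [{"rating": 5}, {"rating": 3}, {"rating": 1}],
    },
}


def average_by_source(record: dict) -> dict:
    """Return the mean rating for each non-empty list under "reviews"."""
    return {
        key: mean(review["rating"] for review in reviews)
        for key, reviews in record["reviews"].items()
        if reviews  # skip sources with no reviews to avoid an empty mean
    }


averages = average_by_source(business)
```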

I think the ultimate goal is that the averages will be calculated in the step before getting sent off to OpenAI. This way I'd have more control over the output instead of assuming OpenAI would do fine with it.

Food for thought in the morning.

xNS5 commented 4 months ago

Currently experiencing some data validation issues, and I believe they're stemming from the Google scraper. Unsure of the cause; investigating.

Example: Landmark Bellingham on Yelp has 71 reviews, and 677 on Google. The expected output of this would (obviously) be a total of 748 reviews. After loading up the data, the scraper returns 533 elements in the list. Unsure of the cause at this point. I'm adding an extra call to the scroll() function to go to the bottom, seeing if that helps.

Hypothesis: It could be that Google unloads reviews if the user scrolls back up enough.

xNS5 commented 4 months ago

It turns out that Google hides reviews while also displaying the total number of reviews, and nothing is wrong with my scripts. It would be a good idea to create JSON key(s) that differentiate between the actual and displayed review count + averages.

Edit: Made the modifications to the Business class and updated the average calculation scripts to use different keys, so that they differentiate between displayed, actual, and adjusted values. Will begin updating the review files in the AM.
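As an illustration only, the displayed/actual split could look something like the sketch below. The key names and the `review_counts` helper are assumptions, not the ones used in the Business class, and the "adjusted" value is left out because the thread doesn't define how it is computed. The numbers are the Landmark Bellingham totals quoted earlier (71 + 677 displayed, 533 scraped).

```python
def review_counts(displayed_total: int, scraped_reviews: list) -> dict:
    """Record the advertised review count next to the number actually scraped."""
    actual = len(scraped_reviews)
    return {
        "displayed_count": displayed_total,  # what the sites advertise
        "actual_count": actual,              # what the scraper could load
        "hidden_count": displayed_total - actual,
    }


# Placeholder list standing in for the 533 scraped review objects.
scraped = [None] * 533
counts = review_counts(71 + 677, scraped)
```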

xNS5 commented 4 months ago

Would be a good idea to change the name of the average calculation scripts as they're now returning dict objects instead of just straight values.

xNS5 commented 4 months ago

Final TODO:

Edit: Updated to remove OpenAI component, will be combining with merge script.

xNS5 commented 4 months ago

I think I discovered how to use the Yelp after value. I still have no idea how or where it's calculated, but I did discover that it's a shared value. Nothing on page 0 has an after value; on page 1, after gets set to eyJ2ZXJzaW9uIjoxLCJ0eXBlIjoib2Zmc2V0Iiwib2Zmc2V0Ijo5fQ==. A request sent with that value will return reviews starting from page 1.

If the page returns 42 reviews maximum at every request, which is another thing I discovered, I think I can figure out a way to get reviews for companies with > 42 reviews.
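One observation that may help here: the after value quoted above is base64-encoded JSON, and it decodes to `{"version":1,"type":"offset","offset":9}`. The sketch below decodes it and rebuilds the same token from an offset. The `decode_cursor` and `encode_cursor` helpers are hypothetical names, and whether Yelp accepts cursors built for arbitrary offsets is an untested assumption.

```python
import base64
import json

# The after value quoted in the comment above.
AFTER = "eyJ2ZXJzaW9uIjoxLCJ0eXBlIjoib2Zmc2V0Iiwib2Zmc2V0Ijo5fQ=="


def decode_cursor(token: str) -> dict:
    """Decode a base64 cursor into the JSON object it wraps."""
    return json.loads(base64.b64decode(token))


def encode_cursor(offset: int) -> str:
    """Build a cursor in the same shape as the observed one (assumed format)."""
    payload = {"version": 1, "type": "offset", "offset": offset}
    # Compact separators reproduce the whitespace-free JSON seen in the token.
    raw = json.dumps(payload, separators=(",", ":")).encode("ascii")
    return base64.b64encode(raw).decode("ascii")


cursor = decode_cursor(AFTER)
rebuilt = encode_cursor(cursor["offset"])
```

If the endpoint does honor generated cursors, stepping the offset by the page size would be one way to walk through companies with more reviews than a single response returns.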