xmunoz / sodapy

Python client for the Socrata Open Data API
MIT License
402 stars 114 forks

`get_all()` does not paginate correctly, returning duplicate rows #85

Closed. tonofshell closed this issue 2 years ago

tonofshell commented 2 years ago

When using `get_all()` on large datasets, the results are not paginated correctly. The response contains the correct total number of rows, but approximately 10% of them are duplicates of other rows. Unless the API call explicitly orders rows, there is no guarantee that each page of results is a unique chunk of the dataset. This could be resolved by making a single API call with a `limit` greater than or equal to the total number of rows in the dataset.
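
For anyone wanting to check their own results for this, here is a minimal sketch of how the duplicate share could be measured. It assumes the rows are JSON-like dicts, as sodapy returns them; `count_duplicate_rows` is a hypothetical helper, not part of sodapy. A dataset that legitimately contains identical rows would also be counted here.

```python
import json
from collections import Counter


def count_duplicate_rows(rows):
    """Return how many rows are exact repeats of another row.

    Hypothetical helper for illustration; not part of sodapy.
    """
    # Serialize with sorted keys so dicts with equal content but
    # different key order produce the same string.
    seen = Counter(json.dumps(row, sort_keys=True) for row in rows)
    return sum(n - 1 for n in seen.values())


# Made-up sample rows: the first two have the same content.
rows = [
    {"id": "1", "name": "a"},
    {"name": "a", "id": "1"},
    {"id": "2", "name": "b"},
]
print(count_duplicate_rows(rows))  # prints 1
```

Dividing the result by `len(rows)` gives the duplicate fraction.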

shua123 commented 2 years ago

Dev.socrata.com says to provide an order clause to ensure stable results. https://dev.socrata.com/docs/paging.html

Implementing their "at a minimum" recommendation in `get_all()` might be the best fix.

> Heads Up! The order of the results of a query are not implicitly ordered, so if you're paging, make sure you provide an $order clause or at a minimum $order=:id. That will guarantee that the order of your results will be stable as you page through the dataset.
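
A rough sketch of what paging with that recommendation could look like. This is not sodapy's actual `get_all()` implementation; `get_all_ordered`, `fetch_page`, and `fake_fetch` are hypothetical names, and the in-memory backend only stands in for the API so the loop can be run offline.

```python
def get_all_ordered(fetch_page, page_size=1000, order=":id"):
    """Yield every row, requesting each page with an explicit order.

    fetch_page is a hypothetical callable taking order, limit and
    offset keyword arguments and returning a list of rows.
    """
    offset = 0
    while True:
        page = fetch_page(order=order, limit=page_size, offset=offset)
        yield from page
        # A short (or empty) page means the dataset is exhausted.
        if len(page) < page_size:
            return
        offset += page_size


def fake_fetch(order, limit, offset):
    """Stand-in for the API: 25 rows that are always in the same order."""
    data = [{"n": i} for i in range(25)]
    return data[offset:offset + limit]


rows = list(get_all_ordered(fake_fetch, page_size=10))
print(len(rows))  # prints 25
```

With sodapy, `fetch_page` could plausibly be `lambda **kw: client.get(dataset_id, **kw)`, assuming `client.get` forwards `order`, `limit`, and `offset` as `$order`, `$limit`, and `$offset`.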

xmunoz commented 2 years ago

I would be happy to review and merge a pull request to address this :)