ziritrion / dataeng-zoomcamp


NYC dataset changed format and S3 url #2

Open kyleaddis opened 2 years ago

kyleaddis commented 2 years ago

NYC.gov has changed all their files to Parquet. The CSV files are no longer available through the provided S3 links. The new link is https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.parquet, but it requires some additional processing to follow along. This mostly applies to the video DE Zoomcamp 1.2.2 - Ingesting NY Taxi Data to Postgres, but it may pop up in other places throughout the course.

First, install pyarrow:

pip install pyarrow

Then read the Parquet file and convert it to a pandas DataFrame:

import pyarrow.parquet as pq
trips = pq.read_table('yellow_tripdata_2021-01.parquet')
df = trips.to_pandas()

Finally, run this command and wait. It will take a while, then return a number when it finishes (engine is the database engine already created in the video):

df.to_sql(name='yellow_taxi_data', con=engine, if_exists='replace', chunksize=100000)

Alternatively, the .csv files could be added to the repo with links to those instead.

erick093 commented 1 year ago

It changed again. The link is now: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page#:~:text=January-,Yellow%20Taxi%20Trip%20Records,-(PARQUET)