NYC dataset changed format and S3 url

NYC.gov has changed all their files to Parquet. The csv files are no longer available through the provided S3 links. The new link is https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.parquet But it requires some additional processing to follow a long. This mostly applies to video DE Zoomcamp 1.2.2 - Ingesting NY Taxi Data to Postgres, but it may pop up in other places throughout the course.

First pip install pyarrow

Then convert the parquet to pandas:

import pyarrow.parquet as pq
trips = pq.read_table('yellow_tripdata_2021-01.parquet')
df = trips.to_pandas()

Finally, run this command and wait. It will take awhile then return a number when it is finished. df.to_sql(name='yellow_taxi_data', con=engine, if_exists='replace', chunksize=100000)

Alternatively, the .csv files could be added to the repo with links to those instead.

ziritrion / dataeng-zoomcamp

NYC dataset changed format and S3 url #2