xhochy / nyc-taxi-fare-prediction-deployment-example

Deployment example for a scikit-learn/lightgbm pipeline
MIT License

Dataset is not available anymore #2

Open pavelzw opened 1 year ago

pavelzw commented 1 year ago

The `yellow_tripdata_2016-01.csv` file is not available anymore. They moved to parquet on 13.05.2022, removed the `pickup_longitude`, `pickup_latitude`, `dropoff_longitude` and `dropoff_latitude` columns, and replaced them with `PULocationID` and `DOLocationID`. They also added the `airport_fee` and `congestion_surcharge` columns.

Old `df.info()` output from `Train.ipynb`

```
Int64Index: 10906858 entries, 0 to 10906857
Data columns (total 19 columns):
 #   Column                 Dtype
---  ------                 -----
 0   VendorID               int64
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        int64
 4   trip_distance          float64
 5   pickup_longitude       float64
 6   pickup_latitude        float64
 7   RatecodeID             int64
 8   store_and_fwd_flag     bool
 9   dropoff_longitude      float64
 10  dropoff_latitude       float64
 11  payment_type           int64
 12  fare_amount            float64
 13  extra                  float64
 14  mta_tax                float64
 15  tip_amount             float64
 16  tolls_amount           float64
 17  improvement_surcharge  float64
 18  total_amount           float64
dtypes: bool(1), datetime64[ns](2), float64(12), int64(4)
memory usage: 1.6 GB
```

New `df.info()` from `pd.read_parquet("yellow_tripdata_2016-01.parquet")`

```
RangeIndex: 10905067 entries, 0 to 10905066
Data columns (total 19 columns):
 #   Column                 Dtype
---  ------                 -----
 0   VendorID               int64
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        int64
 4   trip_distance          float64
 5   RatecodeID             int64
 6   store_and_fwd_flag     object
 7   PULocationID           int64
 8   DOLocationID           int64
 9   payment_type           int64
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   object
 18  airport_fee            object
dtypes: datetime64[ns](2), float64(8), int64(6), object(3)
memory usage: 1.5+ GB
```
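For anyone who wants to check the difference mechanically, here is a minimal sketch that compares the two schemas. The dictionaries are transcribed by hand from the two `df.info()` outputs in this comment; `OLD_SCHEMA`, `NEW_SCHEMA` and `diff_schemas` are names made up for illustration and are not part of the repo.

```python
# Column -> dtype, transcribed from the two df.info() outputs in this issue.
OLD_SCHEMA = {
    "VendorID": "int64",
    "tpep_pickup_datetime": "datetime64[ns]",
    "tpep_dropoff_datetime": "datetime64[ns]",
    "passenger_count": "int64",
    "trip_distance": "float64",
    "pickup_longitude": "float64",
    "pickup_latitude": "float64",
    "RatecodeID": "int64",
    "store_and_fwd_flag": "bool",
    "dropoff_longitude": "float64",
    "dropoff_latitude": "float64",
    "payment_type": "int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
}

# The new parquet file: start from the old schema, drop the coordinates,
# then apply the dtype change and the added columns.
NEW_SCHEMA = {
    col: dtype
    for col, dtype in OLD_SCHEMA.items()
    if not col.endswith(("_longitude", "_latitude"))
}
NEW_SCHEMA.update(
    {
        "store_and_fwd_flag": "object",
        "PULocationID": "int64",
        "DOLocationID": "int64",
        "congestion_surcharge": "object",
        "airport_fee": "object",
    }
)


def diff_schemas(old, new):
    """Return (removed columns, added columns, dtype changes) between two schemas."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = {
        col: (old[col], new[col])
        for col in old
        if col in new and old[col] != new[col]
    }
    return removed, added, changed


removed, added, changed = diff_schemas(OLD_SCHEMA, NEW_SCHEMA)
print("removed:", removed)
print("added:  ", added)
print("changed:", changed)
```

Besides the removed and added columns, this also surfaces that `store_and_fwd_flag` went from `bool` to `object`, so any code relying on the old dtype would need a conversion step.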

The old dataset can still be downloaded from archive.org; the version from 08.01.2022 works, see here.

xhochy commented 1 year ago

I should archive this repo, as I don't intend to maintain it 😎. You can find a more minimal example over at https://github.com/xhochy/nyc-taxi-fare-prediction-distributed-example