The nyc-taxi-data repository is a dataset and exploratory analysis project built around New York City taxi trip records. It collects and preprocesses large-scale trip data (fares, pickup and dropoff locations, timestamps, passenger counts) to support analysis, modeling, and visualization. The project includes scripts and notebooks for cleaning and filtering the raw data, memory-efficient processing of large CSV/Parquet files, and aggregation workflows (e.g. trips per hour, heatmaps of pickups and dropoffs). It also contains example analyses, including spatial and temporal visualizations such as maps, time-series plots, and hotspot detection, that highlight demand patterns, peak times, and geospatial distributions. The repository is often used as a benchmark dataset and as a teaching and demonstration example in the data science and urban analytics communities.
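As a rough illustration of the memory-efficient "trips per hour" aggregation mentioned above, the sketch below streams a CSV one row at a time and keeps only a small counter in memory. It is not taken from the repository: the function name `trips_per_hour`, the column name `tpep_pickup_datetime`, the timestamp format, and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def trips_per_hour(rows, pickup_col="tpep_pickup_datetime"):
    """Count trips by hour of day (0-23), consuming rows one at a time.

    `rows` is any iterable of dicts (for example a csv.DictReader), so the
    full file never has to be loaded into memory. The column name and the
    timestamp format are assumptions, not the project's confirmed schema.
    """
    counts = Counter()
    for row in rows:
        raw = row.get(pickup_col) or ""
        try:
            ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip rows with a missing or malformed timestamp
        counts[ts.hour] += 1
    return counts


# Made-up rows standing in for a large CSV file on disk.
sample = io.StringIO(
    "tpep_pickup_datetime,passenger_count\n"
    "2016-01-01 00:05:00,1\n"
    "2016-01-01 00:40:00,2\n"
    "2016-01-01 08:15:00,1\n"
    "not-a-timestamp,1\n"
)
hourly = trips_per_hour(csv.DictReader(sample))
print(sorted(hourly.items()))
```

For a real file, the same function could be given `csv.DictReader(open(path, newline=""))`; because only the 24-entry counter is retained, memory use stays flat regardless of file size.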
Features
- Large-scale NYC taxi trip dataset with structured schemas
- Data-cleaning and preprocessing scripts for handling raw trip data
- Aggregation and summarization pipelines (hourly, daily, spatial bins)
- Example notebooks/analyses for visualization, heatmaps, and demand patterns
- Support for efficient I/O (Parquet/CSV handling, chunked reading)
- Use as a teaching example and benchmark for urban analytics, modeling, and demonstrations
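
The cleaning and spatial-binning features listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's own pipeline: the bounding box, the 0.01-degree cell size, the helper names `grid_cell` and `pickup_heatmap`, and the sample coordinates are all hypothetical.

```python
import math
from collections import Counter

# Rough bounding box around New York City. These limits are an assumption
# chosen for the example, not values taken from the repository.
NYC_BOUNDS = {"min_lon": -74.3, "max_lon": -73.7, "min_lat": 40.5, "max_lat": 40.95}


def grid_cell(lon, lat, cell_size=0.01):
    """Map a coordinate to an integer (column, row) index on a regular grid."""
    return (math.floor(lon / cell_size), math.floor(lat / cell_size))


def pickup_heatmap(points, cell_size=0.01, bounds=NYC_BOUNDS):
    """Count pickups per grid cell, dropping points outside the bounding box.

    The bounds check doubles as a simple cleaning step: records with zeroed
    or otherwise implausible coordinates are filtered out before counting.
    """
    counts = Counter()
    for lon, lat in points:
        inside = (
            bounds["min_lon"] <= lon <= bounds["max_lon"]
            and bounds["min_lat"] <= lat <= bounds["max_lat"]
        )
        if not inside:
            continue
        counts[grid_cell(lon, lat, cell_size)] += 1
    return counts


# Made-up (longitude, latitude) pickups: two in Midtown, one near JFK,
# and one invalid record with zeroed coordinates.
pickups = [
    (-73.9855, 40.7582),
    (-73.9851, 40.7589),
    (-73.7781, 40.6413),
    (0.0, 0.0),
]
heat = pickup_heatmap(pickups)
for cell, n in heat.most_common():
    print(cell, n)
```

The resulting cell counts can be passed to any plotting library to draw a heatmap. A fixed-degree grid is the simplest choice; a projected coordinate system or hexagonal bins would give more uniform cell areas at the cost of extra dependencies.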