ποΈ Architecture
ποΈ Technical Architecture
- Orchestration: GitHub Actions
- Data Warehouse: Snowflake
- Transformation: dbt
- Language: Python
π Project Structure
nyc-taxi-pipeline/
βββ .github/
β βββ workflows/
β β βββ nyc_taxi_pipeline.yml
β β βββ codeql.yml
β β βββ python_code_tests.yml
β β βββ release.yml
β β βββ sqlfluff.yml
β β
β βββ dependabot.yml
β
βββ docs/
β
βββ snowflake_ingestion/
β βββ init_data_warehouse.py
β βββ scrape_links.py
β βββ upload_stage.py
β βββ load_to_table.py
β β
β βββ sql/
β β βββ init/
β β βββ scraping/
β β βββ stage/
β β βββ load/
β β
β βββ tests/
β
βββ dbt_transformations/
βββ NYC_Taxi_dbt/
βββ models/
βββ staging/
βββ final/
βββ marts/
π Processing Flow
Main Pipeline
NYC Taxi Data Pipeline
Monthly execution data ingestion pipeline:
- Snowflake Infra Init Initialization of Snowflake infrastructure (database, schemas, warehouse, role, user).
- Scrape Links Scraping and retrieval of source links.
- Upload to Stage Uploading raw files to Snowflake stage.
- Load to Table Loading data into the RAW schema table.
- Run dbt Transformations dbt transformations (STAGING then FINAL).
- Run dbt Tests Execution of dbt tests to validate models.
- Backup Policy
Automatic configuration of backup policies for the database, RAW table, and FINAL schema.
Quality Pipelines
- CodeQL Security Scan
Static analysis of Python code using CodeQL to detect vulnerabilities on every push or pull request to
devandmain. - Dependabot Updates Automated updates of Python and GitHub Actions dependencies on a quarterly schedule.
- pages-build-deployment Automatic deployment of project documentation via GitHub Pages.
- Python Code Tests
Execution of Pytest unit tests on every push or pull request to
devandmain. - Release
Automatic versioning, changelog generation, and release publishing via Python Semantic Release on every push or pull request to
main. - SQL Code Quality
Automatic linting of SQL code (dbt models and Snowflake scripts) with SQLFluff on every push or pull request to
devandmain.
Data Modeling
This table documents how the data is stored.
| Table Name | Schema | Table Type | Materialization |
|---|---|---|---|
| FILE_LOADING_METADATA | SCHEMA_RAW |
Transient | Table |
| YELLOW_TAXI_TRIPS_RAW | SCHEMA_RAW |
Permanent | Incremental |
| TAXI_ZONE_LOOKUP | SCHEMA_RAW |
Permanent | Table |
| TAXI_ZONE_STG | SCHEMA_STAGING |
Transient | Table |
| YELLOW_TAXI_TRIPS_STG | SCHEMA_STAGING |
Transient | Incremental |
| int_trip_metrics | SCHEMA_STAGING |
View | |
| fact_trips | SCHEMA_FINAL |
Permanent | Incremental |
| dim_locations | SCHEMA_FINAL |
Permanent | Table |
| dim_time | SCHEMA_FINAL |
Permanent | Table |
| dim_date | SCHEMA_FINAL |
Permanent | Table |
| marts | SCHEMA_FINAL |
View |
Details available in the π Online dbt documentation
Star Schema (ERD)
%%{init: {"themeVariables": {"fontSize": "10px"}}}%%
erDiagram
FACT_TRIPS {
number surrogate_key PK
number date_id FK
number time_id FK
number location_id FK
float fare_amount
float trip_distance
}
DIM_DATE {
number date_id PK
int year
int month
int day_of_week
}
DIM_TIME {
number time_id PK
int hour
string period_of_day
}
DIM_LOCATIONS {
number location_id PK
string zone
string borough
}
FACT_TRIPS }o--|| DIM_DATE : "pickup / dropoff"
FACT_TRIPS }o--|| DIM_TIME : "pickup / dropoff"
FACT_TRIPS }o--|| DIM_LOCATIONS : "pickup / dropoff"
π Slowly Changing Dimensions
All 3 dimensions are SCD Type 0: no variation is expected.
| Dimension | SCD Type | Justification |
|---|---|---|
dim_date |
Type 0 | Date attributes never change |
dim_time |
Type 0 | Time attributes never change |
dim_locations |
Type 0 | The NYC TLC zone reference is stable |
Possible evolutions:
- Zone name correction β SCD Type 1 (overwrite without history)
- Zone split β SCD Type 2 (new row with
valid_from,valid_to,is_current)