Customer Service Requests Analysis

applied data science
visualization
Comprehensive Data analysis of NYC 311 service request calls dataset.
Author

Chaance Graves

Published

May 9, 2023

Modified

June 26, 2025

Open In Colab

Project Description

This project analyzes New York City’s 311 service request data to identify patterns and visualize the most common types of citizen complaints. Using the NYC Open Data API, I developed a CLI script to collect updated data from API endpoint at https://data.cityofnewyork.us/resource/erm2-nwe9.json.

I applied data wrangling techniques to clean and analyze the dataset. The analysis also includes interactive visualizations of complaint trends. To ensure reproducibility, I’ve uploaded a dataset snapshot to Hugging Face for use throughout the project. More information about the original data source is available on the NYC Open Data page.

Import the libraries for analysis

import numpy as np
import polars as pl
import altair as alt
from scipy.stats import stats, chi2_contingency
from datasets import Dataset, load_dataset

import warnings
warnings.filterwarnings('ignore')
# checking the info about the installed polars package version
print(pl.show_versions())
--------Version info---------
Polars:              1.31.0
Index type:          UInt32
Platform:            Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.13.5 (main, Jun 14 2025, 08:23:24) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2025.3.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.3.1
openpyxl             <not installed>
pandas               2.3.1
polars_cloud         <not installed>
pyarrow              20.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
None

Understanding the dataset

nyc_data = load_dataset("ctgadget/nyc_311_service_requests", split="train")

# let's get the dataset's features
nyc_data.info.features
{'unique_key': Value('string'),
 'created_date': Value('string'),
 'agency': Value('string'),
 'agency_name': Value('string'),
 'complaint_type': Value('string'),
 'descriptor': Value('string'),
 'location_type': Value('string'),
 'incident_zip': Value('string'),
 'incident_address': Value('string'),
 'street_name': Value('string'),
 'cross_street_1': Value('string'),
 'cross_street_2': Value('string'),
 'intersection_street_1': Value('string'),
 'intersection_street_2': Value('string'),
 'address_type': Value('string'),
 'city': Value('string'),
 'landmark': Value('string'),
 'status': Value('string'),
 'resolution_action_updated_date': Value('string'),
 'community_board': Value('string'),
 'bbl': Value('string'),
 'borough': Value('string'),
 'x_coordinate_state_plane': Value('string'),
 'y_coordinate_state_plane': Value('string'),
 'open_data_channel_type': Value('string'),
 'park_facility_name': Value('string'),
 'park_borough': Value('string'),
 'latitude': Value('string'),
 'longitude': Value('string'),
 'location': {'human_address': Value('string'),
  'latitude': Value('string'),
  'longitude': Value('string')},
 'resolution_description': Value('string'),
 'vehicle_type': Value('string'),
 'closed_date': Value('string'),
 'data_retrieved_at': Value('string')}
df = Dataset.to_polars(nyc_data)
df.head()
shape: (5, 34)
unique_key created_date agency agency_name complaint_type descriptor location_type incident_zip incident_address street_name cross_street_1 cross_street_2 intersection_street_1 intersection_street_2 address_type city landmark status resolution_action_updated_date community_board bbl borough x_coordinate_state_plane y_coordinate_state_plane open_data_channel_type park_facility_name park_borough latitude longitude location resolution_description vehicle_type closed_date data_retrieved_at
str str str str str str str str str str str str str str str str str str str str str str str str str str str str str struct[3] str str str str
"65363980" "2025-06-25T02:03:21.000" "NYPD" "New York City Police Departmen… "Illegal Parking" "Commercial Overnight Parking" "Street/Sidewalk" "11411" "115-22 FRANCIS LEWIS BOULEVARD" "FRANCIS LEWIS BOULEVARD" "115 AVENUE" "115 ROAD" "115 AVENUE" "115 ROAD" "ADDRESS" "CAMBRIA HEIGHTS" "FRANCIS LEWIS BOULEVARD" "In Progress" "2025-06-25T02:11:30.000" "12 QUEENS" "4110460160" "QUEENS" "1054262" "194055" "MOBILE" "Unspecified" "QUEENS" "40.699035965405535" "-73.7475083832415" {"{"address": "", "city": "", "state": "", "zip": ""}","40.699035965405535","-73.7475083832415"} null null null "06_26_2025_02_38_45_PM"
"65357474" "2025-06-25T02:01:47.000" "NYPD" "New York City Police Departmen… "Blocked Driveway" "Partial Access" "Street/Sidewalk" "11428" "90-25 219 STREET" "219 STREET" "90 AVENUE" "91 AVENUE" "90 AVENUE" "91 AVENUE" "ADDRESS" "QUEENS VILLAGE" "219 STREET" "In Progress" null "13 QUEENS" "4107180042" "QUEENS" "1055805" "203889" "ONLINE" "Unspecified" "QUEENS" "40.72601549402461" "-73.74183921146468" {"{"address": "", "city": "", "state": "", "zip": ""}","40.72601549402461","-73.74183921146468"} null null null "06_26_2025_02_38_45_PM"
"65367267" "2025-06-25T02:01:33.000" "NYPD" "New York City Police Departmen… "Noise - Residential" "Loud Talking" "Residential Building/House" "10031" "620 WEST  141 STREET" "WEST  141 STREET" "BROADWAY" "RIVERSIDE DRIVE" "BROADWAY" "RIVERSIDE DRIVE" "ADDRESS" "NEW YORK" "WEST  141 STREET" "In Progress" null "09 MANHATTAN" "1020880043" "MANHATTAN" "997264" "239545" "MOBILE" "Unspecified" "MANHATTAN" "40.82416166479311" "-73.95297805704422" {"{"address": "", "city": "", "state": "", "zip": ""}","40.82416166479311","-73.95297805704422"} null null null "06_26_2025_02_38_45_PM"
"65357731" "2025-06-25T02:01:29.000" "DHS" "Department of Homeless Service… "Homeless Person Assistance" "Non-Chronic" "Store/Commercial" "10001" "6 AVENUE" "6 AVENUE" "6 AVENUE" "WEST   27 STREET" "6 AVENUE" "WEST   27 STREET" "INTERSECTION" null null "In Progress" null "05 MANHATTAN" null "MANHATTAN" "986753" "210854" "PHONE" "Unspecified" "MANHATTAN" "40.745421755862786" "-73.99096689618635" {"{"address": "", "city": "", "state": "", "zip": ""}","40.745421755862786","-73.99096689618635"} null null null "06_26_2025_02_38_45_PM"
"65364041" "2025-06-25T02:01:21.000" "NYPD" "New York City Police Departmen… "Noise - Street/Sidewalk" "Loud Music/Party" "Street/Sidewalk" "10460" "1460 HOE AVENUE" "HOE AVENUE" "JENNINGS STREET" "EAST  172 STREET" "JENNINGS STREET" "EAST  172 STREET" "ADDRESS" "BRONX" "HOE AVENUE" "In Progress" null "03 BRONX" "2029880002" "BRONX" "1014792" "242538" "PHONE" "Unspecified" "BRONX" "40.832333371980106" "-73.88963263564457" {"{"address": "", "city": "", "state": "", "zip": ""}","40.832333371980106","-73.88963263564457"} null null null "06_26_2025_02_38_45_PM"
# Let's get the info about the dataset, shape of the dimensions
print(df.describe())
print( f"\nThe shape of the dataset is {df.shape}\n")
shape: (9, 35)
┌────────────┬────────────┬───────────┬────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ statistic  ┆ unique_key ┆ created_d ┆ agency ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ ---        ┆ ---        ┆ ate       ┆ ---    ┆   ┆ n_descrip ┆ ype       ┆ te        ┆ ieved_at  │
│ str        ┆ str        ┆ ---       ┆ str    ┆   ┆ tion      ┆ ---       ┆ ---       ┆ ---       │
│            ┆            ┆ str       ┆        ┆   ┆ ---       ┆ str       ┆ str       ┆ str       │
│            ┆            ┆           ┆        ┆   ┆ str       ┆           ┆           ┆           │
╞════════════╪════════════╪═══════════╪════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ count      ┆ 5000       ┆ 5000      ┆ 5000   ┆ … ┆ 3797      ┆ 171       ┆ 3762      ┆ 5000      │
│ null_count ┆ 0          ┆ 0         ┆ 0      ┆ … ┆ 1203      ┆ 4829      ┆ 1238      ┆ 0         │
│ mean       ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ std        ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ min        ┆ 65354926   ┆ 2025-06-2 ┆ DCWP   ┆ … ┆ DSNY      ┆ Car       ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆            ┆ 4T18:08:4 ┆        ┆   ┆ already   ┆           ┆ 4T18:09:3 ┆ 5_02_38_4 │
│            ┆            ┆ 0.000     ┆        ┆   ┆ has a     ┆           ┆ 7.000     ┆ 5_PM      │
│            ┆            ┆           ┆        ┆   ┆ request   ┆           ┆           ┆           │
│            ┆            ┆           ┆        ┆   ┆ to …      ┆           ┆           ┆           │
│ 25%        ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ 50%        ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ 75%        ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ max        ┆ 65367590   ┆ 2025-06-2 ┆ TLC    ┆ … ┆ Your      ┆ Van       ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆            ┆ 5T02:03:2 ┆        ┆   ┆ request   ┆           ┆ 5T03:05:0 ┆ 5_02_38_4 │
│            ┆            ┆ 1.000     ┆        ┆   ┆ can not   ┆           ┆ 0.000     ┆ 5_PM      │
│            ┆            ┆           ┆        ┆   ┆ be        ┆           ┆           ┆           │
│            ┆            ┆           ┆        ┆   ┆ proces…   ┆           ┆           ┆           │
└────────────┴────────────┴───────────┴────────┴───┴───────────┴───────────┴───────────┴───────────┘

The shape of the dataset is (5000, 34)
df.collect_schema()
Schema([('unique_key', String),
        ('created_date', String),
        ('agency', String),
        ('agency_name', String),
        ('complaint_type', String),
        ('descriptor', String),
        ('location_type', String),
        ('incident_zip', String),
        ('incident_address', String),
        ('street_name', String),
        ('cross_street_1', String),
        ('cross_street_2', String),
        ('intersection_street_1', String),
        ('intersection_street_2', String),
        ('address_type', String),
        ('city', String),
        ('landmark', String),
        ('status', String),
        ('resolution_action_updated_date', String),
        ('community_board', String),
        ('bbl', String),
        ('borough', String),
        ('x_coordinate_state_plane', String),
        ('y_coordinate_state_plane', String),
        ('open_data_channel_type', String),
        ('park_facility_name', String),
        ('park_borough', String),
        ('latitude', String),
        ('longitude', String),
        ('location',
         Struct({'human_address': String, 'latitude': String, 'longitude': String})),
        ('resolution_description', String),
        ('vehicle_type', String),
        ('closed_date', String),
        ('data_retrieved_at', String)])
# Check for null values in each column
null_counts = df.null_count()

print("Null counts per column:")
print(null_counts)

# To check if there are ANY null values in the entire DataFrame:
# You can sum all the null counts and check if the total is greater than 0.
total_nulls = null_counts.sum_horizontal().item()

if total_nulls > 0:
    print("\nThere are null values in the DataFrame.")
else:
    print("\nNo null values found in the DataFrame.")
Null counts per column:
shape: (1, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ n_descrip ┆ ype       ┆ te        ┆ ieved_at  │
│ u32        ┆ ---        ┆ u32    ┆ ---       ┆   ┆ tion      ┆ ---       ┆ ---       ┆ ---       │
│            ┆ u32        ┆        ┆ u32       ┆   ┆ ---       ┆ u32       ┆ u32       ┆ u32       │
│            ┆            ┆        ┆           ┆   ┆ u32       ┆           ┆           ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0          ┆ 0          ┆ 0      ┆ 0         ┆ … ┆ 1203      ┆ 4829      ┆ 1238      ┆ 0         │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

There are null values in the DataFrame.

Observations:

There are null values present in the df dataframe for the following variables:

  1. Closed Date
  2. Descriptor
  3. Location Type
  4. etc ...

Exploratory Data Analysis (EDA)

Plotting directly from Polars Series (more “Polars-native” if possible) This method is slightly more involved for a horizontal bar chart directly. We’ll build our charts in Altair.

# Let's prepare data as a dictionary of column_name: null_count
null_dict = null_counts.row(0, named=True) # Get the first (and only) row as a dictionary

columns = list(null_dict.keys())
counts = list(null_dict.values())

#--- Use the dict/lists to build a Polars DataFrame suitable for Altair ---
# Create a new Polars DataFrame from the lists
# This DataFrame will have the 'long' format expected by Altair
null_data = pl.DataFrame({
    "Column Name": columns,
    "Null Count": counts
})

# Optional: Filter out columns with 0 nulls for cleaner visualization
null_data = null_data.filter(pl.col("Null Count") > 0)

print("Null Count DataFrame prepared:\n")
print(null_data.head())

# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(null_data).mark_bar().encode(
    # X-axis: Null Count (Quantitative type)
    x=alt.X('Null Count:Q', title='Count'),
    # Y-axis: Column Name (Nominal type), sorted by null count descending
    y=alt.Y('Column Name:N', title='Columns', sort='-x'), # Sort by null count for better readability

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('Column Name:N', title='Column'),
        alt.Tooltip('Null Count:Q', title='Nulls')
    ]
).properties(
    title='Null Value Counts',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
    ).interactive() # Allows zooming and panning

# Display the chart
chart.show()
Null Count DataFrame prepared:

shape: (5, 2)
┌──────────────────┬────────────┐
│ Column Name      ┆ Null Count │
│ ---              ┆ ---        │
│ str              ┆ i64        │
╞══════════════════╪════════════╡
│ location_type    ┆ 2527       │
│ incident_zip     ┆ 15         │
│ incident_address ┆ 490        │
│ street_name      ┆ 490        │
│ cross_street_1   ┆ 526        │
└──────────────────┴────────────┘
# Remove null values from "Closed Date" column
# Polars operations are immutable, so you need to reassign the DataFrame
df = df.drop_nulls(subset=["closed_date"])

# Show sample of dataset with null values removed
print("\nShape after dropping nulls:", df.shape)
print("\nSample of dataset with null values removed:")
print(df.head())

# Verify nulls in "Closed Date" after dropping
print("\nNulls in 'Closed Date' after dropping:", df.select(pl.col("closed_date").null_count()))

Shape after dropping nulls: (3762, 34)

Sample of dataset with null values removed:
shape: (5, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ n_descrip ┆ ype       ┆ te        ┆ ieved_at  │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ tion      ┆ ---       ┆ ---       ┆ ---       │
│            ┆ str        ┆        ┆ str       ┆   ┆ ---       ┆ str       ┆ str       ┆ str       │
│            ┆            ┆        ┆           ┆   ┆ str       ┆           ┆           ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:50:40. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T01:58:0 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 0.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65366664   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:45:52. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T02:02:3 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 0.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t issued  ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ a…        ┆           ┆           ┆           │
│ 65358593   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:42:25. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T01:52:1 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 4.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65366547   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:40:10. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T02:03:0 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 8.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65363618   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:37:15. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T01:59:5 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 8.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

Nulls in 'Closed Date' after dropping: shape: (1, 1)
┌─────────────┐
│ closed_date │
│ ---         │
│ u32         │
╞═════════════╡
│ 0           │
└─────────────┘
# Check if null values are still present in "Closed Date" column
# Use pl.col() for column selection and .is_null().any() for the check

if df.select(pl.col("closed_date").is_null().any()).item():
    print('Null values still present in "Closed Date" column.')
else:
    print('No null values present in "Closed Date" column.')
No null values present in "Closed Date" column.
# Convert the "Closed Date" and "Created Date" columns to datetime format
# Polars operations are immutable, so use .with_columns() and reassign

df = df.with_columns([
    # converting "created date"
    pl.col("created_date").str.to_datetime().alias("created_date"),
    # converting "closed date"
    pl.col("closed_date").str.to_datetime().alias("closed_date"),
])
# Display the schema (Polars' equivalent of df.info() for types)
print("DataFrame Schema after datetime conversion:\n")
print(df.collect_schema)
DataFrame Schema after datetime conversion:

<bound method DataFrame.collect_schema of shape: (3_762, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ n_descrip ┆ ype       ┆ te        ┆ ieved_at  │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ tion      ┆ ---       ┆ ---       ┆ ---       │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ ---       ┆ str       ┆ datetime[ ┆ str       │
│            ┆ s]         ┆        ┆           ┆   ┆ str       ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:50:40   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 01:58:00  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65366664   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:45:52   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 02:02:30  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t issued  ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ a…        ┆           ┆           ┆           │
│ 65358593   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:42:25   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 01:52:14  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65366547   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:40:10   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 02:03:08  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65363618   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:37:15   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 01:59:58  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ …          ┆ …          ┆ …      ┆ …         ┆ … ┆ …         ┆ …         ┆ …         ┆ …         │
│ 65362166   ┆ 2025-06-24 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:10:04   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 21:57:40  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65359062   ┆ 2025-06-24 ┆ DOB    ┆ Departmen ┆ … ┆ Your      ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:09:37   ┆        ┆ t of      ┆   ┆ Service   ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Buildings ┆   ┆ Request   ┆           ┆ 18:09:37  ┆ 5_PM      │
│            ┆            ┆        ┆           ┆   ┆ has been  ┆           ┆           ┆           │
│            ┆            ┆        ┆           ┆   ┆ …         ┆           ┆           ┆           │
│ 65357050   ┆ 2025-06-24 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:09:26   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 19:19:26  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65361679   ┆ 2025-06-24 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:09:21   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 19:36:37  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65358615   ┆ 2025-06-24 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:08:43   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 19:29:13  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘>
# --- Convert 'time_delta' from Duration to a numerical type ---
# First, ensure 'time_delta' column is created as done previously
df = df.with_columns(
    (pl.col("closed_date") - pl.col("created_date")).alias("time_delta")
)

# Convert time_delta from Duration to total seconds (or minutes, hours, days)
# Polars Duration objects have methods like .total_seconds(), .total_minutes() etc.
df = df.with_columns(
    pl.col("time_delta").dt.total_seconds().alias("time_delta_seconds")
)

# Now, 'time_delta_seconds' is a numerical column (float)
# You can use this new column for plotting in Altair.
print("DataFrame schema after converting time_delta to seconds:")
print(df.schema)
print("\nHead of DataFrame showing new 'time_delta_seconds' column:")
print(df.head())
DataFrame schema after converting time_delta to seconds:
Schema({'unique_key': String, 'created_date': Datetime(time_unit='us', time_zone=None), 'agency': String, 'agency_name': String, 'complaint_type': String, 'descriptor': String, 'location_type': String, 'incident_zip': String, 'incident_address': String, 'street_name': String, 'cross_street_1': String, 'cross_street_2': String, 'intersection_street_1': String, 'intersection_street_2': String, 'address_type': String, 'city': String, 'landmark': String, 'status': String, 'resolution_action_updated_date': String, 'community_board': String, 'bbl': String, 'borough': String, 'x_coordinate_state_plane': String, 'y_coordinate_state_plane': String, 'open_data_channel_type': String, 'park_facility_name': String, 'park_borough': String, 'latitude': String, 'longitude': String, 'location': Struct({'human_address': String, 'latitude': String, 'longitude': String}), 'resolution_description': String, 'vehicle_type': String, 'closed_date': Datetime(time_unit='us', time_zone=None), 'data_retrieved_at': String, 'time_delta': Duration(time_unit='us'), 'time_delta_seconds': Int64})

Head of DataFrame showing new 'time_delta_seconds' column:
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ te        ┆ ieved_at  ┆ a         ┆ a_seconds │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ datetime[ ┆ str       ┆ duration[ ┆ i64       │
│            ┆ s]         ┆        ┆           ┆   ┆ μs]       ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 7m 20s    ┆ 440       │
│            ┆ 01:50:40   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:58:00  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366664   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 16m 38s   ┆ 998       │
│            ┆ 01:45:52   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 02:02:30  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65358593   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 9m 49s    ┆ 589       │
│            ┆ 01:42:25   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:52:14  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366547   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 22m 58s   ┆ 1378      │
│            ┆ 01:40:10   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 02:03:08  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65363618   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 22m 43s   ┆ 1363      │
│            ┆ 01:37:15   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:59:58  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
# View descriptive statistics of the "time_delta" column
# Changed the column selection to "time_delta"
print("\nDescriptive statistics for 'time_delta' column:")
print(df.select(pl.col("time_delta")).describe())

Descriptive statistics for 'time_delta' column:
shape: (9, 2)
┌────────────┬────────────────┐
│ statistic  ┆ time_delta     │
│ ---        ┆ ---            │
│ str        ┆ str            │
╞════════════╪════════════════╡
│ count      ┆ 3762           │
│ null_count ┆ 0              │
│ mean       ┆ 0:31:40.608187 │
│ std        ┆ null           │
│ min        ┆ 0:00:00        │
│ 25%        ┆ 0:00:00        │
│ 50%        ┆ 0:00:00        │
│ 75%        ┆ 0:42:35        │
│ max        ┆ 7:37:14        │
└────────────┴────────────────┘

Finding the major type of complaints

# Check the number of null values in the "complaint_type" and "city" columns using f-strings
print(f'Number of null values in "complaint_type" column: {df.select(pl.col("complaint_type").null_count()).item()}')
print(f'Number of null values in "city" column: {df.select(pl.col("city").null_count()).item()}')
Number of null values in "complaint_type" column: 0
Number of null values in "city" column: 148
# Fill in missing values (nulls) in the 'city' column with 'Unknown City'
df = df.with_columns(
    pl.col("city").fill_null(pl.lit("UNKNOWN CITY")).alias("city")
)

# Get Unique Count using Polars' n_unique()
# We need to extract the single value using .item() for the f-string
city_count = df.select(pl.col("city").n_unique()).item()

# Get all unique values in the 'city' column as a sorted list
unique_areas = df["city"].unique().sort().to_list()

print(f"Number of null values in 'city' after fill_null: {df.select(pl.col("city").null_count()).item()}")
print(f"\nThere are {city_count} values in the dataset that contain the following info:\n")
print(unique_areas)
Number of null values in 'city' after fill_null: 0

There are 42 values in the dataset that contain the following info:

['ARVERNE', 'ASTORIA', 'BAYSIDE', 'BRONX', 'BROOKLYN', 'CAMBRIA HEIGHTS', 'COLLEGE POINT', 'CORONA', 'EAST ELMHURST', 'ELMHURST', 'FAR ROCKAWAY', 'FLUSHING', 'FOREST HILLS', 'FRESH MEADOWS', 'HOLLIS', 'HOWARD BEACH', 'JACKSON HEIGHTS', 'JAMAICA', 'LITTLE NECK', 'LONG ISLAND CITY', 'MANHATTAN', 'MASPETH', 'MIDDLE VILLAGE', 'NEW YORK', 'OAKLAND GARDENS', 'OZONE PARK', 'QUEENS', 'QUEENS VILLAGE', 'REGO PARK', 'RICHMOND HILL', 'RIDGEWOOD', 'ROSEDALE', 'SAINT ALBANS', 'SOUTH OZONE PARK', 'SOUTH RICHMOND HILL', 'SPRINGFIELD GARDENS', 'STATEN ISLAND', 'SUNNYSIDE', 'UNKNOWN CITY', 'WHITESTONE', 'WOODHAVEN', 'WOODSIDE']
# Calculate total complaints per city and complaint type, then pivot
total_city_complaints = df.group_by(["city", "complaint_type"]).len().pivot(
    index="city",              # 'city' will be the rows
    columns="complaint_type",  # 'complaint_type' will be the new columns
    values="len"               # The aggregated count (from .len()) will be the values
)

total_city_complaints.head()
shape: (5, 32)
city Blocked Driveway Noise - Residential Abandoned Vehicle Elevator Noise - Street/Sidewalk Illegal Fireworks Illegal Parking Noise - Commercial General Construction/Plumbing Water System Noise - Vehicle Non-Emergency Police Matter Panhandling Dirty Condition Traffic Signal Condition Disorderly Youth Noise - Park Drinking Derelict Vehicles Electrical Encampment Drug Activity Violation of Park Rules Vendor Enforcement Street Condition Bike/Roller/Skate Chronic Animal-Abuse Traffic Urinating in Public Building/Use Emergency Response Team (ERT)
str u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
"FAR ROCKAWAY" 1 null null null 5 null 15 null null 2 null null null null null null 1 null null 1 null null null null null null null null null null null
"HOLLIS" 1 null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null
"UNKNOWN CITY" 1 7 1 null 87 4 15 2 null null 13 7 5 null 2 null 1 1 null null null null null null null 1 null 1 null null null
"BRONX" 24 70 null 2 220 5 42 8 1 1798 20 6 null 2 6 null null 3 4 1 null 1 null 1 null null 2 null null null null
"WOODSIDE" 2 1 3 null null null 3 2 null null null null 1 null null null null null null null null null null null null null null null null null null
# Reshape the DataFrame to "long" format for Altair

# narrow down into 'city' for 'complaint_type' and 'number_of_complaints'
complaints = total_city_complaints.unpivot(
    index=['city'],
    variable_name='complaint_type', # New column for the complaint type names
    value_name='number_of_complaints' # New column for the complaint counts
)

# Filter out rows where number_of_complaints is 0, as they don't contribute to the visualization
# This makes the tooltip cleaner and chart potentially faster if there are many zeros
complaints = complaints.filter(pl.col("number_of_complaints") > 0)

# Create the Altair Stacked Bar Chart
chart = alt.Chart(complaints).mark_bar().encode(
    # X-axis: Number of Complaints (Quantitative type)
    x=alt.X('number_of_complaints:Q', title='Number of Complaints'),
    # Y-axis: City (Nominal type), sorted by the sum of complaints for better readability
    y=alt.Y('city:N', title='City', sort='-x'), # Sort cities by total complaints in descending order

    # Color: Complaint Type (Nominal type) for stacking
    color=alt.Color('complaint_type:N', title='Complaint Type'),

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('city:N', title='City'),
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('number_of_complaints:Q', title='Count')
    ]
).properties(
    title='Frequency of Complaints by City',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive() # Allows zooming and panning

# Display the chart
chart.show()

Data Visualization of the major type of complaints

In order to find the major type of complaints in the dataset, the following code groups the dataset by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting bar graph shows the counts for each complaint type.

# Group by 'complaint_type' and count the number of occurrences
# Then sort the counts in descending order
complaint_counts = df.group_by("complaint_type").agg(
    pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order

complaint_types = df.select(pl.col("complaint_type").n_unique()).item()

# View the result
print(f"\nThere are {complaint_types} complaint types in the dataset that contain the following info:\n")
print(complaint_counts)

There are 31 complaint types in the dataset that contain the following info:

shape: (31, 2)
┌───────────────────────────────┬────────┐
│ complaint_type                ┆ counts │
│ ---                           ┆ ---    │
│ str                           ┆ u32    │
╞═══════════════════════════════╪════════╡
│ Water System                  ┆ 2145   │
│ Noise - Street/Sidewalk       ┆ 552    │
│ Illegal Parking               ┆ 337    │
│ Noise - Residential           ┆ 267    │
│ Blocked Driveway              ┆ 90     │
│ …                             ┆ …      │
│ Disorderly Youth              ┆ 2      │
│ Traffic                       ┆ 1      │
│ Emergency Response Team (ERT) ┆ 1      │
│ Bike/Roller/Skate Chronic     ┆ 1      │
│ Violation of Park Rules       ┆ 1      │
└───────────────────────────────┴────────┘
# Plot the bar graph
chart = alt.Chart(complaint_counts).mark_bar().encode(
    # X-axis: 'counts' (Quantitative type)
    x=alt.X('counts:Q', title='Number of Complaints',
           scale=alt.Scale(domainMin=0, zero=True)
           ),
    # Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
    y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
    color=alt.value('orange'), # You can use color names or hex codes like '#FF5733'

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('counts:Q', title='Count')
    ]
).properties(
    title='Frequency of Complaints by Type',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive()

# Display the chart
chart.show()

From our bar chart analysis, the top 10 major complaint types are the following:

  1. Illegal Parking
  2. Noise - Residental
  3. Noise - Street/Sidewalk
  4. Blocked Driveway
  5. Abandoned Vehicle
  6. Noise - Vehicle
  7. Noise - Commercial
  8. General Construction/Plumbing
  9. Vendor Enforcement
  10. Homeless Encampment

Let’s explore the complaint data specifically categorized as NYC

This code filters the dataset to only include complaints from New York City, groups the remaining data by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting output shows the frequency of each complaint type for New York City.

# Filter the DataFrame for only the 'NEW YORK' city data
# Use .filter() with a boolean expression
nyc_data = df.filter(pl.col('city') == 'NEW YORK')

print("Head of the nyc_data DataFrame (filtered for 'NEW YORK'):")
print(nyc_data.head())

print(f"\nShape of the original DataFrame: {df.shape}")
print(f"Shape of the 'nyc_data' DataFrame (after filtering): {nyc_data.shape}")

# You can also verify that all cities in nyc_data are 'NEW YORK'
print("\nUnique cities in 'nyc_data':")
print(nyc_data['city'].unique().to_list())
Head of the nyc_data DataFrame (filtered for 'NEW YORK'):
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ te        ┆ ieved_at  ┆ a         ┆ a_seconds │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ datetime[ ┆ str       ┆ duration[ ┆ i64       │
│            ┆ s]         ┆        ┆           ┆   ┆ μs]       ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65357312   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 10m 16s   ┆ 616       │
│            ┆ 01:19:05   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:29:21  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65365325   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 29m 48s   ┆ 1788      │
│            ┆ 01:18:35   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:48:23  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366862   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 30m 6s    ┆ 1806      │
│            ┆ 01:12:03   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:42:09  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65355807   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 9m 18s    ┆ 558       │
│            ┆ 01:06:14   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:15:32  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65361896   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 41m 42s   ┆ 2502      │
│            ┆ 01:04:42   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:46:24  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

Shape of the original DataFrame: (3762, 36)
Shape of the 'nyc_data' DataFrame (after filtering): (339, 36)

Unique cities in 'nyc_data':
['NEW YORK']
nyc_complaint_types = nyc_data.group_by("complaint_type").agg(
    pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order

# get the number of unique complaints
nyc_complaint_counts = df.select(pl.col("complaint_type").n_unique()).item()

# View the result
print(f"\nThere are {nyc_complaint_counts} complaint types in the dataset that contain the following info:\n")
print(nyc_complaint_types)

There are 31 complaint types in the dataset that contain the following info:

shape: (21, 2)
┌───────────────────────────────┬────────┐
│ complaint_type                ┆ counts │
│ ---                           ┆ ---    │
│ str                           ┆ u32    │
╞═══════════════════════════════╪════════╡
│ Water System                  ┆ 95     │
│ Noise - Street/Sidewalk       ┆ 88     │
│ Noise - Residential           ┆ 45     │
│ Illegal Parking               ┆ 38     │
│ Encampment                    ┆ 17     │
│ …                             ┆ …      │
│ General Construction/Plumbing ┆ 1      │
│ Animal-Abuse                  ┆ 1      │
│ Blocked Driveway              ┆ 1      │
│ Dirty Condition               ┆ 1      │
│ Emergency Response Team (ERT) ┆ 1      │
└───────────────────────────────┴────────┘
# Plot the bar graph
chart = alt.Chart(nyc_complaint_types).mark_bar().encode(
    # X-axis: 'counts' (Quantitative type)
    x=alt.X('counts:Q', title='Number of Complaints',
           scale=alt.Scale(domainMin=0, zero=True)
           ),
    # Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
    y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
    color=alt.value('green'), # You can use color names or hex codes like '#FF5733'

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('counts:Q', title='Count')
    ]
).properties(
    title='Types of Complaints in New York City',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive()

# Display the chart
chart.show()

Let’s explore the complaint data specifically for Brooklyn

This code filters the dataset to only include complaints from Brooklyn, groups the remaining data by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting output shows the frequency of each complaint type for Brooklyn.

# Filter the DataFrame for only the 'BROOKLYN' city data
# Use .filter() with a boolean expression
bkn_data = df.filter(pl.col('city') == 'BROOKLYN')

print("Head of the bkn_data DataFrame (filtered for 'BROOKLYN'):")
print(bkn_data.head())

print(f"\nShape of the original DataFrame: {df.shape}")
print(f"Shape of the 'nyc_data' DataFrame (after filtering): {bkn_data.shape}")

# You can also verify that all cities in nyc_data are 'BROOKLYN'
print("\nUnique cities in 'bkn_data':")
print(bkn_data['city'].unique().to_list())
Head of the bkn_data DataFrame (filtered for 'BROOKLYN'):
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ te        ┆ ieved_at  ┆ a         ┆ a_seconds │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ datetime[ ┆ str       ┆ duration[ ┆ i64       │
│            ┆ s]         ┆        ┆           ┆   ┆ μs]       ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 7m 20s    ┆ 440       │
│            ┆ 01:50:40   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:58:00  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366857   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 19m 28s   ┆ 1168      │
│            ┆ 01:25:36   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:45:04  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65365214   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 32m 12s   ┆ 1932      │
│            ┆ 01:15:21   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:47:33  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65355770   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 31m 42s   ┆ 1902      │
│            ┆ 01:08:58   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:40:40  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65355768   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 53m 13s   ┆ 3193      │
│            ┆ 01:08:52   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 02:02:05  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

Shape of the original DataFrame: (3762, 36)
Shape of the 'nyc_data' DataFrame (after filtering): (486, 36)

Unique cities in 'bkn_data':
['BROOKLYN']
bkn_complaint_types = bkn_data.group_by("complaint_type").agg(
    pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order

# get the number of unique complaints
bkn_complaint_counts = df.select(pl.col("complaint_type").n_unique()).item()

# View the result
print(f"\nThere are {bkn_complaint_counts} complaint types in the dataset that contain the following info:\n")
print(bkn_complaint_types)

There are 31 complaint types in the dataset that contain the following info:

shape: (22, 2)
┌───────────────────────────────┬────────┐
│ complaint_type                ┆ counts │
│ ---                           ┆ ---    │
│ str                           ┆ u32    │
╞═══════════════════════════════╪════════╡
│ Water System                  ┆ 124    │
│ Illegal Parking               ┆ 101    │
│ Noise - Residential           ┆ 78     │
│ Noise - Street/Sidewalk       ┆ 63     │
│ Blocked Driveway              ┆ 24     │
│ …                             ┆ …      │
│ Animal-Abuse                  ┆ 1      │
│ General Construction/Plumbing ┆ 1      │
│ Electrical                    ┆ 1      │
│ Violation of Park Rules       ┆ 1      │
│ Disorderly Youth              ┆ 1      │
└───────────────────────────────┴────────┘
# Plot the bar graph
chart = alt.Chart(bkn_complaint_types).mark_bar().encode(
    # X-axis: 'counts' (Quantitative type)
    x=alt.X('counts:Q', title='Number of Complaints',
           scale=alt.Scale(domainMin=0, zero=True)
           ),
    # Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
    y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
    color=alt.value('red'), # You can use color names or hex codes like '#FF5733'

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('counts:Q', title='Count')
    ]
).properties(
    title='Types of Complaints in Brooklyn',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive()

# Display the chart
chart.show()
# filter data for the geographic locations of Brooklyn in the dataset.
bkn_geo_data = df.filter(
    (pl.col("city") == "BROOKLYN") &
    pl.col("longitude").is_not_null() &
    pl.col("latitude").is_not_null()
).select(["longitude", "latitude"]) # <-- This line selects only longitude and latitude
# create a scatter plot of the concentration of complaints across Brooklyn

# Using typical bounds for Brooklyn for demonstration:
bkn_lon_min, bkn_lon_max = -74.05, -73.85
bkn_lat_min, bkn_lat_max = 40.55, 40.75

scatter_plot = alt.Chart(bkn_geo_data).mark_point(
    opacity=0.2,  # Reduce opacity to see overlapping points
    size=5        # Adjust point size
).encode(
    x=alt.X(
        'longitude:Q',
        title='Longitude',
        scale=alt.Scale(domain=[bkn_lon_min, bkn_lon_max])  # Set fixed domain for x
    ),
    y=alt.Y(
        'latitude:Q',
        title='Latitude',
        scale=alt.Scale(domain=[bkn_lat_min, bkn_lat_max])  # Set fixed domain for y
    ),
    tooltip=['longitude:Q', 'latitude:Q']
).properties(
    title='Complaint Locations in Brooklyn (Scatter Plot)',
    width=700,   # Set explicit width
    height=560   # Set explicit height
).interactive()  # Allows zooming and panning

print("\nDisplaying Scatter Plot for Brooklyn Complaints:")
scatter_plot.show()

Displaying Scatter Plot for Brooklyn Complaints:
# --- Hexbin plot using mark_circle() + bin approach ---
hexbin_plot = alt.Chart(bkn_geo_data).mark_circle().encode(
    # Bin longitude and latitude into hexagonal cells
    x=alt.X('longitude:Q', bin=alt.Bin(maxbins=50), title='Longitude'), # Adjust maxbins for granularity
    y=alt.Y('latitude:Q', bin=alt.Bin(maxbins=50), title='Latitude'),   # Adjust maxbins for granularity

    # Color by the count of points in each bin
    color=alt.Color('count():Q',
                    title='Number of Complaints',
                    scale=alt.Scale(range='heatmap', domainMin=1) # Ensure heatmap starts from non-zero count
                    ),
    # Size the circles by count as well for emphasis
     size=alt.Size('count():Q', legend=None), # Hide size legend as color already indicates count

    tooltip=[
        alt.Tooltip('count():Q', title='Complaints in Bin'),
        alt.Tooltip('longitude:Q', bin=True, title='Longitude Bin'),
        alt.Tooltip('latitude:Q', bin=True, title='Latitude Bin')
    ]
).properties(
    title='Complaint Concentration in Brooklyn (Hexbin Plot)',
    width=700,  # Set explicit width
    height=560  # Set explicit height
).interactive() # Allows zooming and panning

print("\nDisplaying Hexbin Plot for Brooklyn Complaints:")
hexbin_plot.show()

Displaying Hexbin Plot for Brooklyn Complaints:

This code groups the data by both city and complaint type, counts the number of occurrences for each combination, and then prints the resulting output, which shows the various types of complaints in each city.

# 1. Group by city and complaint type and count the number of occurrences
# This will result in a DataFrame with columns: 'city', 'complaint_type', 'count'
city_complaint_counts = df.group_by(["city", "complaint_type"]).agg(
    pl.len().alias("count")
).sort(["city", "count"], descending=[False, True]) # Sort by city, then by count descending for readability


print("Grouped by City and Complaint Type (Head):")
print(city_complaint_counts.head())
Grouped by City and Complaint Type (Head):
shape: (5, 3)
┌─────────┬─────────────────────────┬───────┐
│ city    ┆ complaint_type          ┆ count │
│ ---     ┆ ---                     ┆ ---   │
│ str     ┆ str                     ┆ u32   │
╞═════════╪═════════════════════════╪═══════╡
│ ARVERNE ┆ Blocked Driveway        ┆ 1     │
│ ASTORIA ┆ Illegal Parking         ┆ 21    │
│ ASTORIA ┆ Noise - Street/Sidewalk ┆ 5     │
│ ASTORIA ┆ Noise - Residential     ┆ 4     │
│ ASTORIA ┆ Animal-Abuse            ┆ 1     │
└─────────┴─────────────────────────┴───────┘

This code uses the pivot function to pivot the data in the original DataFrame df. In Polars, you first group by the columns you want to pivot by, then count, then pivot. We’ll use ‘complaint_type’ and ‘city’ as confirmed lowercase names.

The resulting DataFrame, df_new, has complaint types as rows, cities as columns, and the count of occurrences as the values. The fill_value=0 parameter fills any missing values with zeros. The resulting output shows the new DataFrame with cities as columns and complaint types in rows.

complaints_by_city = df.group_by(["complaint_type", "city"]).len().pivot(
    index="complaint_type", # Rows of the new pivot table
    columns="city",         # Columns of the new pivot table
    values="len"            # The count from .len() will fill the values
)

# Fill any null values (where a complaint type didn't occur in a city) with 0
complaints_by_city = complaints_by_city.fill_null(0)

# Print the new DataFrame's head
print("Complaints by City (head of the pivot table):")
print(complaints_by_city.head())
Complaints by City (head of the pivot table):
shape: (5, 43)
┌─────────────────────┬──────────┬─────────┬────────┬───┬────────┬─────────┬─────────┬─────────────┐
│ complaint_type      ┆ BROOKLYN ┆ ASTORIA ┆ FOREST ┆ … ┆ HOLLIS ┆ ARVERNE ┆ CAMBRIA ┆ LITTLE NECK │
│ ---                 ┆ ---      ┆ ---     ┆ HILLS  ┆   ┆ ---    ┆ ---     ┆ HEIGHTS ┆ ---         │
│ str                 ┆ u32      ┆ u32     ┆ ---    ┆   ┆ u32    ┆ u32     ┆ ---     ┆ u32         │
│                     ┆          ┆         ┆ u32    ┆   ┆        ┆         ┆ u32     ┆             │
╞═════════════════════╪══════════╪═════════╪════════╪═══╪════════╪═════════╪═════════╪═════════════╡
│ Disorderly Youth    ┆ 1        ┆ 0       ┆ 0      ┆ … ┆ 0      ┆ 0       ┆ 0       ┆ 0           │
│ Blocked Driveway    ┆ 24       ┆ 1       ┆ 0      ┆ … ┆ 1      ┆ 1       ┆ 0       ┆ 0           │
│ Noise - Residential ┆ 78       ┆ 4       ┆ 1      ┆ … ┆ 0      ┆ 0       ┆ 0       ┆ 0           │
│ Traffic Signal      ┆ 23       ┆ 0       ┆ 0      ┆ … ┆ 0      ┆ 0       ┆ 0       ┆ 0           │
│ Condition           ┆          ┆         ┆        ┆   ┆        ┆         ┆         ┆             │
│ Illegal Parking     ┆ 101      ┆ 21      ┆ 2      ┆ … ┆ 0      ┆ 0       ┆ 1       ┆ 0           │
└─────────────────────┴──────────┴─────────┴────────┴───┴────────┴─────────┴─────────┴─────────────┘

Statistical Analysis of Request_Closing_Time

# Calculate the response time (duration)
# Polars automatically creates a Duration type from datetime subtraction
df = df.with_columns(
    (pl.col("closed_date") - pl.col("created_date")).alias("request_closing_duration")
)

# Convert the duration to total hours and total minutes (as numerical values for plotting)
df = df.with_columns([
    pl.col("request_closing_duration").dt.total_hours().alias("request_closing_time_hrs"),
    pl.col("request_closing_duration").dt.total_minutes().alias("request_closing_time_min")
])

print("DataFrame Head with new time columns:")
print(df.head())

print("\nSchema of new time columns:")
print(df.select(["request_closing_duration", "request_closing_time_hrs", "request_closing_time_min"]).schema)
DataFrame Head with new time columns:
shape: (5, 39)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ time_delt ┆ request_c ┆ request_c ┆ request_c │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ a_seconds ┆ losing_du ┆ losing_ti ┆ losing_ti │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ ---       ┆ ration    ┆ me_hrs    ┆ me_min    │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ i64       ┆ ---       ┆ ---       ┆ ---       │
│            ┆ s]         ┆        ┆           ┆   ┆           ┆ duration[ ┆ i64       ┆ i64       │
│            ┆            ┆        ┆           ┆   ┆           ┆ μs]       ┆           ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 440       ┆ 7m 20s    ┆ 0         ┆ 7         │
│            ┆ 01:50:40   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366664   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 998       ┆ 16m 38s   ┆ 0         ┆ 16        │
│            ┆ 01:45:52   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65358593   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 589       ┆ 9m 49s    ┆ 0         ┆ 9         │
│            ┆ 01:42:25   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366547   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 1378      ┆ 22m 58s   ┆ 0         ┆ 22        │
│            ┆ 01:40:10   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65363618   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 1363      ┆ 22m 43s   ┆ 0         ┆ 22        │
│            ┆ 01:37:15   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

Schema of new time columns:
Schema({'request_closing_duration': Duration(time_unit='us'), 'request_closing_time_hrs': Int64, 'request_closing_time_min': Int64})
# Define the list of city names that represent the main boroughs in your dataset
borough_cities = ['NEW YORK', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND']

# Filter the DataFrame to include only these borough cities
borough_complaints_df = df.filter(pl.col("city").is_in(borough_cities))

# Calculate the average complaint resolution time (in hours) for each borough city
avg_resolution_time_by_borough = borough_complaints_df.group_by("city").agg(
    pl.col("request_closing_time_hrs").mean().alias("avg_resolution_hours")
).sort("avg_resolution_hours", descending=False) # Sort for better visualization

print("\nAverage Complaint Resolution Time (Hours) by Borough City:")
print(avg_resolution_time_by_borough)

# --- Plotting the average complaint resolved time for the five borough cities ---
chart_borough_resolution = alt.Chart(avg_resolution_time_by_borough).mark_bar().encode(
    # X-axis: Average Resolution Time in Hours (Quantitative)
    x=alt.X('avg_resolution_hours:Q',
            title='Average Resolution Time (Hours)',
            scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
           ),
    # Y-axis: City (Nominal), sorted by average time
    y=alt.Y('city:N',
            title='City',
            sort='-x' # Sort cities by their average resolution time
           ),
    color=alt.value('darkcyan'), # Set a single color for the bars
    tooltip=[
        alt.Tooltip('city:N', title='City'),
        alt.Tooltip('avg_resolution_hours:Q', title='Avg. Hours', format='.2f') # Format to 2 decimal places
    ]
).properties(
    title='Average Complaint Resolution Time by Borough City',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive()

chart_borough_resolution.show()

Average Complaint Resolution Time (Hours) by Borough City:
shape: (5, 2)
┌───────────────┬──────────────────────┐
│ city          ┆ avg_resolution_hours │
│ ---           ┆ ---                  │
│ str           ┆ f64                  │
╞═══════════════╪══════════════════════╡
│ QUEENS        ┆ 0.0                  │
│ BRONX         ┆ 0.204422             │
│ NEW YORK      ┆ 0.389381             │
│ BROOKLYN      ┆ 0.623457             │
│ STATEN ISLAND ┆ 0.863014             │
└───────────────┴──────────────────────┘
# Calculate the average closing time by city
# Group by 'city', aggregate the mean of 'request_closing_time_hrs', and sort
avg_closing_time_dict = df.group_by("city").agg(
    pl.col("request_closing_time_hrs").mean().alias("avg_hours_to_close")
).sort("avg_hours_to_close", descending=False).to_dict(as_series=False) # Convert to dictionary

# Use a dictionary comprehension to reformat it
# We're zipping the 'city' list with the 'avg_hours_to_close' list
avg_closing_time_by_city_dict = {
    city: avg_time
    for city, avg_time in zip(avg_closing_time_dict['city'], avg_closing_time_dict['avg_hours_to_close'])
}

print("Average Closing Time by City (sorted, as dictionary):\n")
for key, values in avg_closing_time_by_city_dict.items():
    print(f"{key} : {values:.2f} hrs")
Average Closing Time by City (sorted, as dictionary):

LITTLE NECK : 0.00 hrs
MIDDLE VILLAGE : 0.00 hrs
SPRINGFIELD GARDENS : 0.00 hrs
OAKLAND GARDENS : 0.00 hrs
BAYSIDE : 0.00 hrs
CAMBRIA HEIGHTS : 0.00 hrs
HOLLIS : 0.00 hrs
ROSEDALE : 0.00 hrs
QUEENS : 0.00 hrs
MANHATTAN : 0.11 hrs
SOUTH OZONE PARK : 0.14 hrs
SUNNYSIDE : 0.17 hrs
RIDGEWOOD : 0.17 hrs
FLUSHING : 0.18 hrs
HOWARD BEACH : 0.20 hrs
BRONX : 0.20 hrs
JAMAICA : 0.24 hrs
MASPETH : 0.25 hrs
REGO PARK : 0.33 hrs
WHITESTONE : 0.33 hrs
OZONE PARK : 0.33 hrs
UNKNOWN CITY : 0.36 hrs
NEW YORK : 0.39 hrs
FRESH MEADOWS : 0.44 hrs
FOREST HILLS : 0.50 hrs
CORONA : 0.55 hrs
BROOKLYN : 0.62 hrs
LONG ISLAND CITY : 0.73 hrs
RICHMOND HILL : 0.75 hrs
WOODHAVEN : 0.81 hrs
JACKSON HEIGHTS : 0.83 hrs
STATEN ISLAND : 0.86 hrs
COLLEGE POINT : 1.00 hrs
SAINT ALBANS : 1.00 hrs
FAR ROCKAWAY : 1.00 hrs
QUEENS VILLAGE : 1.00 hrs
ARVERNE : 1.00 hrs
SOUTH RICHMOND HILL : 1.00 hrs
ASTORIA : 1.23 hrs
EAST ELMHURST : 1.27 hrs
ELMHURST : 1.82 hrs
WOODSIDE : 1.83 hrs
# Calculate the average closing time by city (as a Polars DataFrame)
avg_closing_time_df = df.group_by("city").agg(
    pl.col("request_closing_time_hrs").mean().alias("avg_hours_to_close")
).sort("avg_hours_to_close", descending=False) # Sort for better chart readability

print("DataFrame for plotting (Avg. Closing Time by City):")
print(avg_closing_time_df.head())

# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(avg_closing_time_df).mark_bar().encode(
    # X-axis: Average Hours to Close (Quantitative)
    x=alt.X('avg_hours_to_close:Q',
            title='Average Closing Time (Hours)',
            axis=alt.Axis(titleFontSize=14),
            scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
           ),
    # Y-axis: City (Nominal), sorted by average closing time
    y=alt.Y('city:N',
            title='City',
            sort='-x', # Sort cities by their average resolution time
            axis=alt.Axis(titleFontSize=14)
           ),
    color=alt.value('purple'), # Set a single color for the bars
    tooltip=[
        alt.Tooltip('city:N', title='City'),
        alt.Tooltip('avg_hours_to_close:Q', title='Avg. Hours', format='.2f') # Format to 2 decimal places
    ]
).properties(
    title=alt.Title(
        'Average Complaint Closing Time by City',
        fontSize=18
    ),
    width=600,  # Adjust overall chart width
    height=alt.Step(15) # Dynamically adjust height based on number of cities, 15 pixels per bar
).interactive() # Enable pan and zoom

chart.show()
DataFrame for plotting (Avg. Closing Time by City):
shape: (5, 2)
┌─────────────────────┬────────────────────┐
│ city                ┆ avg_hours_to_close │
│ ---                 ┆ ---                │
│ str                 ┆ f64                │
╞═════════════════════╪════════════════════╡
│ CAMBRIA HEIGHTS     ┆ 0.0                │
│ MIDDLE VILLAGE      ┆ 0.0                │
│ LITTLE NECK         ┆ 0.0                │
│ SPRINGFIELD GARDENS ┆ 0.0                │
│ BAYSIDE             ┆ 0.0                │
└─────────────────────┴────────────────────┘

Let’s see the avg. response time across different types of complaints. We will also compare them across each borough.

# Calculate the average closing time by complaint type
# Group by 'complaint_type', aggregate the mean of 'request_closing_time_min', and sort
avg_closing_time_per_complaint = df.group_by("complaint_type").agg(
    pl.col("request_closing_time_min").mean().alias("avg_minutes_to_close")
).sort("avg_minutes_to_close", descending=False) # Default sort is ascending, so descending=False for ascending order

# Print the result
print("Average Closing Time (Minutes) per Complaint Type:")
print(avg_closing_time_per_complaint)
Average Closing Time (Minutes) per Complaint Type:
shape: (31, 2)
┌───────────────────────────────┬──────────────────────┐
│ complaint_type                ┆ avg_minutes_to_close │
│ ---                           ┆ ---                  │
│ str                           ┆ f64                  │
╞═══════════════════════════════╪══════════════════════╡
│ Street Condition              ┆ 0.0                  │
│ Building/Use                  ┆ 0.0                  │
│ Emergency Response Team (ERT) ┆ 0.0                  │
│ Elevator                      ┆ 0.0                  │
│ General Construction/Plumbing ┆ 0.0                  │
│ …                             ┆ …                    │
│ Animal-Abuse                  ┆ 123.4                │
│ Violation of Park Rules       ┆ 137.0                │
│ Traffic                       ┆ 152.0                │
│ Dirty Condition               ┆ 233.0                │
│ Derelict Vehicles             ┆ 268.2                │
└───────────────────────────────┴──────────────────────┘
# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(avg_closing_time_per_complaint).mark_bar().encode(
    # X-axis: Average Minutes to Close (Quantitative)
    x=alt.X('avg_minutes_to_close:Q',
            title='Average Closing Time (Minutes)',
            axis=alt.Axis(titleFontSize=14),
            scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
           ),
    # Y-axis: Complaint Type (Nominal), sorted by average closing time
    y=alt.Y('complaint_type:N',
            title='Complaint Type',
            sort='-x', # Sort complaint types by their average resolution time
            axis=alt.Axis(titleFontSize=14)
           ),
    color=alt.value('seagreen'), # Set a single color for the bars
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('avg_minutes_to_close:Q', title='Avg. Minutes', format='.2f') # Format to 2 decimal places
    ]
).properties(
    title=alt.Title(
        'Average Complaint Closing Time by Complaint Type',
        fontSize=18
    ),
    width=600,  # Adjust overall chart width
    height=alt.Step(15) # Dynamically adjust height based on number of complaint types, 15 pixels per bar
).interactive() # Enable pan and zoom

chart.show()

Here you can see that the Graffitti takes almost 11 hours to resolve. Other complaint types are more frequent where Air Quality, Noise from Places of Worship take much less time while some issues haven’t been recently reported at all, such as Plumbing.

# 1. Define the mapping (as a list of dictionaries or directly construct a DataFrame)
borough_mapping_data = [
    {'city_name': 'NEW YORK', 'Borough': 'MANHATTAN'},
    {'city_name': 'BRONX', 'Borough': 'BRONX'},
    {'city_name': 'BROOKLYN', 'Borough': 'BROOKLYN'},
    {'city_name': 'QUEENS', 'Borough': 'QUEENS'},
    {'city_name': 'STATEN ISLAND', 'Borough': 'STATEN ISLAND'},
    {'city_name': 'JAMAICA', 'Borough': 'QUEENS'},
    {'city_name': 'FLUSHING', 'Borough': 'QUEENS'},
    {'city_name': 'ASTORIA', 'Borough': 'QUEENS'},
    {'city_name': 'LONG ISLAND CITY', 'Borough': 'QUEENS'},
    {'city_name': 'EAST ELMHURST', 'Borough': 'QUEENS'},
    {'city_name': 'CORONA', 'Borough': 'QUEENS'},
    {'city_name': 'ELMHURST', 'Borough': 'QUEENS'},
    {'city_name': 'OZONE PARK', 'Borough': 'QUEENS'},
    {'city_name': 'SOUTH OZONE PARK', 'Borough': 'QUEENS'},
    {'city_name': 'ARVERNE', 'Borough': 'QUEENS'},
    {'city_name': 'FAR ROCKAWAY', 'Borough': 'QUEENS'},
    {'city_name': 'QUEENS VILLAGE', 'Borough': 'QUEENS'},
    {'city_name': 'RICHMOND HILL', 'Borough': 'QUEENS'},
    {'city_name': 'SOUTH RICHMOND HILL', 'Borough': 'QUEENS'},
    {'city_name': 'WOODHAVEN', 'Borough': 'QUEENS'},
    {'city_name': 'REGO PARK', 'Borough': 'QUEENS'},
    {'city_name': 'FOREST HILLS', 'Borough': 'QUEENS'},
    {'city_name': 'JACKSON HEIGHTS', 'Borough': 'QUEENS'},
    {'city_name': 'KEW GARDENS', 'Borough': 'QUEENS'},
    {'city_name': 'FRESH MEADOWS', 'Borough': 'QUEENS'},
    {'city_name': 'OAKLAND GARDENS', 'Borough': 'QUEENS'},
    {'city_name': 'COLLEGE POINT', 'Borough': 'QUEENS'},
    {'city_name': 'BAYSIDE', 'Borough': 'QUEENS'},
    {'city_name': 'GLEN OAKS', 'Borough': 'QUEENS'},
    {'city_name': 'BELLEROSE', 'Borough': 'QUEENS'},
    {'city_name': 'LITTLE NECK', 'Borough': 'QUEENS'},
    {'city_name': 'CAMBRIA HEIGHTS', 'Borough': 'QUEENS'},
    {'city_name': 'ROSEDALE', 'Borough': 'QUEENS'},
    {'city_name': 'ROCKAWAY PARK', 'Borough': 'QUEENS'},
    {'city_name': 'SPRINGFIELD GARDENS', 'Borough': 'QUEENS'},
    {'city_name': 'WOODSIDE', 'Borough': 'QUEENS'},
    {'city_name': 'MASPETH', 'Borough': 'QUEENS'},
    {'city_name': 'SUNNYSIDE', 'Borough': 'QUEENS'},
    {'city_name': 'MIDDLE VILLAGE', 'Borough': 'QUEENS'},
    {'city_name': 'RIDGEWOOD', 'Borough': 'BROOKLYN'},
    {'city_name': 'HOWARD BEACH', 'Borough': 'QUEENS'},
    {'city_name': 'BREEZY POINT', 'Borough': 'QUEENS'},
    {'city_name': 'WHITESTONE', 'Borough': 'QUEENS'},
    {'city_name': 'SAINT ALBANS', 'Borough': 'QUEENS'},
    {'city_name': 'Unknown City', 'Borough': 'Unspecified'} # Map previously filled 'Unknown City'
]

mapping_df = pl.DataFrame(borough_mapping_data)

# 2. Perform a left join to add the 'Borough' column
df = df.join(mapping_df, left_on="city", right_on="city_name", how="left")

# 3. Fill any remaining nulls in 'Borough' (for cities not in mapping) with 'Unspecified'
df = df.with_columns(
    pl.col("Borough").fill_null(pl.lit("Unspecified")).alias("Borough")
)
# --- Calculate average response time per complaint type per borough ---
avg_response_time_by_borough_complaint = df.group_by(["Borough", "complaint_type"]).agg(
    pl.col("request_closing_time_min").mean().alias("avg_minutes")
).sort(["Borough", "avg_minutes"], descending=[False, True])

print("Average Response Time by Borough and Complaint Type (Head):")
print(avg_response_time_by_borough_complaint.head())
Average Response Time by Borough and Complaint Type (Head):
shape: (5, 3)
┌─────────┬─────────────────────────────┬─────────────┐
│ Borough ┆ complaint_type              ┆ avg_minutes │
│ ---     ┆ ---                         ┆ ---         │
│ str     ┆ str                         ┆ f64         │
╞═════════╪═════════════════════════════╪═════════════╡
│ BRONX   ┆ Drug Activity               ┆ 394.0       │
│ BRONX   ┆ Derelict Vehicles           ┆ 324.75      │
│ BRONX   ┆ Dirty Condition             ┆ 219.0       │
│ BRONX   ┆ Illegal Fireworks           ┆ 137.8       │
│ BRONX   ┆ Non-Emergency Police Matter ┆ 130.833333  │
└─────────┴─────────────────────────────┴─────────────┘
# --- Define the boroughs for subplots (matching the order in Pandas code) ---
boroughs = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND', 'Unspecified']

# --- Set common X-axis limit (like plt.xlim(0, 3000)) ---
x_axis_limit = 3000 # Use a fixed limit as in Matplotlib

# --- Helper function to create an Altair subplot for each borough ---
def create_borough_subplot(data: pl.DataFrame, borough_name: str, x_domain_max: float) -> alt.Chart:
    """
    Creates a horizontal bar chart for average response time by complaint type for a given borough.
    """
    borough_data = data.filter(pl.col("Borough") == borough_name)
    borough_data = borough_data.filter(pl.col("avg_minutes").is_not_null() & (pl.col("avg_minutes") > 0))

    # --- FIX: Calculate height as an integer, not multiplying alt.Step() ---
    # Determine the number of bars (complaint types) for this borough
    num_complaint_types = len(borough_data)
    # Calculate the total height in pixels
    chart_height = 15 * max(1, num_complaint_types) # 15 pixels per bar
    # --- END FIX ---

    chart = alt.Chart(borough_data).mark_bar(color='cornflowerblue').encode(
        x=alt.X('avg_minutes:Q',
                title='Average Response Time (in minutes)',
                scale=alt.Scale(domain=[0, x_domain_max]),
                axis=alt.Axis(titleFontSize=12, labelFontSize=10)
               ),
        y=alt.Y('complaint_type:N',
                title='Complaint Type',
                sort='-x',
                axis=alt.Axis(titleFontSize=12, labelFontSize=10)
               ),
        tooltip=[
            alt.Tooltip('complaint_type:N', title='Complaint Type'),
            alt.Tooltip('avg_minutes:Q', title='Avg. Minutes', format='.2f')
        ]
    ).properties(
        title=alt.Title(
            borough_name,
            fontSize=14,
            anchor='middle'
        ),
        width=300,
        height=chart_height # Pass the integer height here
    )
    return chart
# --- Create individual subplots for each borough ---
borough_charts = [create_borough_subplot(avg_response_time_by_borough_complaint, b, x_axis_limit) for b in boroughs]

# --- Combine the subplots into a 2x3 grid ---
combined_chart = alt.vconcat(
    alt.hconcat(borough_charts[0], borough_charts[1], borough_charts[2]),
    alt.hconcat(borough_charts[3], borough_charts[4], borough_charts[5])
).properties(
    title=alt.Title(
        'Average Response Time for Complaint Types by Borough',
        fontSize=18,
        anchor='middle'
    )
).resolve_scale(
    x='shared'
)

combined_chart.show()

Hypothesis Testing

A boxplot can reveal visual differences in the min, median, max and presence of outliers across complaint types.

# Filter out extreme outliers for better visualization if needed
df_filtered_for_plot = df.filter(pl.col("request_closing_time_min") >= 0) # Ensure no negative times

# Drop the 'time_delta' and 'request_closing duration' column for Altair plotting
df_filtered_for_plot = df_filtered_for_plot.drop("time_delta", "request_closing_duration")

# Create the Boxplot
boxplot_chart = alt.Chart(df_filtered_for_plot).mark_boxplot().encode(
    x=alt.X('complaint_type:N',
            title='Complaint Type',
            axis=alt.Axis(labelAngle=-45, titleFontSize=14, labelFontSize=10) # Rotate labels for readability
           ),
    y=alt.Y('request_closing_time_min:Q',
            title='Request Closing Time (Minutes)',
            scale=alt.Scale(domainMin=0), # Ensure Y-axis starts at 0
            axis=alt.Axis(titleFontSize=14, labelFontSize=10)
           ),
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('request_closing_time_min:Q', title='Closing Time', format='.2f')
        # For boxplots, you might add tooltips for quartiles if directly aggregated
    ]
).properties(
    title=alt.Title(
        'Distribution of Request Closing Time by Complaint Type',
        fontSize=18
    ),
    width=alt.Step(25), # Adjust width of each box
    height=400 # Overall chart height
).interactive() # Enable zooming and panning

boxplot_chart.show()