Customer Service Requests Analysis

applied data science

visualization

Comprehensive Data analysis of NYC 311 service request calls dataset.

Author

Chaance Graves

Published

May 9, 2023

Modified

June 26, 2025

Project Description

This project analyzes New York City’s 311 service request data to identify patterns and visualize the most common types of citizen complaints. Using the NYC Open Data API, I developed a CLI script to collect updated data from API endpoint at https://data.cityofnewyork.us/resource/erm2-nwe9.json.

I applied data wrangling techniques to clean and analyze the dataset. The analysis also includes interactive visualizations of complaint trends. To ensure reproducibility, I’ve uploaded a dataset snapshot to Hugging Face for use throughout the project. More information about the original data source is available on the NYC Open Data page.

Import the libraries for analysis

import numpy as np
import polars as pl
import altair as alt
from scipy.stats import stats, chi2_contingency
from datasets import Dataset, load_dataset

import warnings
warnings.filterwarnings('ignore')

# checking the info about the installed polars package version
print(pl.show_versions())

--------Version info---------
Polars:              1.31.0
Index type:          UInt32
Platform:            Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.13.5 (main, Jun 14 2025, 08:23:24) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2025.3.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.3.1
openpyxl             <not installed>
pandas               2.3.1
polars_cloud         <not installed>
pyarrow              20.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
None

Understanding the dataset

nyc_data = load_dataset("ctgadget/nyc_311_service_requests", split="train")

# let's get the dataset's features
nyc_data.info.features

{'unique_key': Value('string'),
 'created_date': Value('string'),
 'agency': Value('string'),
 'agency_name': Value('string'),
 'complaint_type': Value('string'),
 'descriptor': Value('string'),
 'location_type': Value('string'),
 'incident_zip': Value('string'),
 'incident_address': Value('string'),
 'street_name': Value('string'),
 'cross_street_1': Value('string'),
 'cross_street_2': Value('string'),
 'intersection_street_1': Value('string'),
 'intersection_street_2': Value('string'),
 'address_type': Value('string'),
 'city': Value('string'),
 'landmark': Value('string'),
 'status': Value('string'),
 'resolution_action_updated_date': Value('string'),
 'community_board': Value('string'),
 'bbl': Value('string'),
 'borough': Value('string'),
 'x_coordinate_state_plane': Value('string'),
 'y_coordinate_state_plane': Value('string'),
 'open_data_channel_type': Value('string'),
 'park_facility_name': Value('string'),
 'park_borough': Value('string'),
 'latitude': Value('string'),
 'longitude': Value('string'),
 'location': {'human_address': Value('string'),
  'latitude': Value('string'),
  'longitude': Value('string')},
 'resolution_description': Value('string'),
 'vehicle_type': Value('string'),
 'closed_date': Value('string'),
 'data_retrieved_at': Value('string')}

df = Dataset.to_polars(nyc_data)

df.head()

shape: (5, 34)

unique_key	created_date	agency	agency_name	complaint_type	descriptor	location_type	incident_zip	incident_address	street_name	cross_street_1	cross_street_2	intersection_street_1	intersection_street_2	address_type	city	landmark	status	resolution_action_updated_date	community_board	bbl	borough	x_coordinate_state_plane	y_coordinate_state_plane	open_data_channel_type	park_facility_name	park_borough	latitude	longitude	location	resolution_description	vehicle_type	closed_date	data_retrieved_at
str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	struct[3]	str	str	str	str
"65363980"	"2025-06-25T02:03:21.000"	"NYPD"	"New York City Police Departmen…	"Illegal Parking"	"Commercial Overnight Parking"	"Street/Sidewalk"	"11411"	"115-22 FRANCIS LEWIS BOULEVARD"	"FRANCIS LEWIS BOULEVARD"	"115 AVENUE"	"115 ROAD"	"115 AVENUE"	"115 ROAD"	"ADDRESS"	"CAMBRIA HEIGHTS"	"FRANCIS LEWIS BOULEVARD"	"In Progress"	"2025-06-25T02:11:30.000"	"12 QUEENS"	"4110460160"	"QUEENS"	"1054262"	"194055"	"MOBILE"	"Unspecified"	"QUEENS"	"40.699035965405535"	"-73.7475083832415"	{"{"address": "", "city": "", "state": "", "zip": ""}","40.699035965405535","-73.7475083832415"}	null	null	null	"06_26_2025_02_38_45_PM"
"65357474"	"2025-06-25T02:01:47.000"	"NYPD"	"New York City Police Departmen…	"Blocked Driveway"	"Partial Access"	"Street/Sidewalk"	"11428"	"90-25 219 STREET"	"219 STREET"	"90 AVENUE"	"91 AVENUE"	"90 AVENUE"	"91 AVENUE"	"ADDRESS"	"QUEENS VILLAGE"	"219 STREET"	"In Progress"	null	"13 QUEENS"	"4107180042"	"QUEENS"	"1055805"	"203889"	"ONLINE"	"Unspecified"	"QUEENS"	"40.72601549402461"	"-73.74183921146468"	{"{"address": "", "city": "", "state": "", "zip": ""}","40.72601549402461","-73.74183921146468"}	null	null	null	"06_26_2025_02_38_45_PM"
"65367267"	"2025-06-25T02:01:33.000"	"NYPD"	"New York City Police Departmen…	"Noise - Residential"	"Loud Talking"	"Residential Building/House"	"10031"	"620 WEST 141 STREET"	"WEST 141 STREET"	"BROADWAY"	"RIVERSIDE DRIVE"	"BROADWAY"	"RIVERSIDE DRIVE"	"ADDRESS"	"NEW YORK"	"WEST 141 STREET"	"In Progress"	null	"09 MANHATTAN"	"1020880043"	"MANHATTAN"	"997264"	"239545"	"MOBILE"	"Unspecified"	"MANHATTAN"	"40.82416166479311"	"-73.95297805704422"	{"{"address": "", "city": "", "state": "", "zip": ""}","40.82416166479311","-73.95297805704422"}	null	null	null	"06_26_2025_02_38_45_PM"
"65357731"	"2025-06-25T02:01:29.000"	"DHS"	"Department of Homeless Service…	"Homeless Person Assistance"	"Non-Chronic"	"Store/Commercial"	"10001"	"6 AVENUE"	"6 AVENUE"	"6 AVENUE"	"WEST 27 STREET"	"6 AVENUE"	"WEST 27 STREET"	"INTERSECTION"	null	null	"In Progress"	null	"05 MANHATTAN"	null	"MANHATTAN"	"986753"	"210854"	"PHONE"	"Unspecified"	"MANHATTAN"	"40.745421755862786"	"-73.99096689618635"	{"{"address": "", "city": "", "state": "", "zip": ""}","40.745421755862786","-73.99096689618635"}	null	null	null	"06_26_2025_02_38_45_PM"
"65364041"	"2025-06-25T02:01:21.000"	"NYPD"	"New York City Police Departmen…	"Noise - Street/Sidewalk"	"Loud Music/Party"	"Street/Sidewalk"	"10460"	"1460 HOE AVENUE"	"HOE AVENUE"	"JENNINGS STREET"	"EAST 172 STREET"	"JENNINGS STREET"	"EAST 172 STREET"	"ADDRESS"	"BRONX"	"HOE AVENUE"	"In Progress"	null	"03 BRONX"	"2029880002"	"BRONX"	"1014792"	"242538"	"PHONE"	"Unspecified"	"BRONX"	"40.832333371980106"	"-73.88963263564457"	{"{"address": "", "city": "", "state": "", "zip": ""}","40.832333371980106","-73.88963263564457"}	null	null	null	"06_26_2025_02_38_45_PM"

# Let's get the info about the dataset, shape of the dimensions
print(df.describe())
print( f"\nThe shape of the dataset is {df.shape}\n")

shape: (9, 35)
┌────────────┬────────────┬───────────┬────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ statistic  ┆ unique_key ┆ created_d ┆ agency ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ ---        ┆ ---        ┆ ate       ┆ ---    ┆   ┆ n_descrip ┆ ype       ┆ te        ┆ ieved_at  │
│ str        ┆ str        ┆ ---       ┆ str    ┆   ┆ tion      ┆ ---       ┆ ---       ┆ ---       │
│            ┆            ┆ str       ┆        ┆   ┆ ---       ┆ str       ┆ str       ┆ str       │
│            ┆            ┆           ┆        ┆   ┆ str       ┆           ┆           ┆           │
╞════════════╪════════════╪═══════════╪════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ count      ┆ 5000       ┆ 5000      ┆ 5000   ┆ … ┆ 3797      ┆ 171       ┆ 3762      ┆ 5000      │
│ null_count ┆ 0          ┆ 0         ┆ 0      ┆ … ┆ 1203      ┆ 4829      ┆ 1238      ┆ 0         │
│ mean       ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ std        ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ min        ┆ 65354926   ┆ 2025-06-2 ┆ DCWP   ┆ … ┆ DSNY      ┆ Car       ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆            ┆ 4T18:08:4 ┆        ┆   ┆ already   ┆           ┆ 4T18:09:3 ┆ 5_02_38_4 │
│            ┆            ┆ 0.000     ┆        ┆   ┆ has a     ┆           ┆ 7.000     ┆ 5_PM      │
│            ┆            ┆           ┆        ┆   ┆ request   ┆           ┆           ┆           │
│            ┆            ┆           ┆        ┆   ┆ to …      ┆           ┆           ┆           │
│ 25%        ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ 50%        ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ 75%        ┆ null       ┆ null      ┆ null   ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│ max        ┆ 65367590   ┆ 2025-06-2 ┆ TLC    ┆ … ┆ Your      ┆ Van       ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆            ┆ 5T02:03:2 ┆        ┆   ┆ request   ┆           ┆ 5T03:05:0 ┆ 5_02_38_4 │
│            ┆            ┆ 1.000     ┆        ┆   ┆ can not   ┆           ┆ 0.000     ┆ 5_PM      │
│            ┆            ┆           ┆        ┆   ┆ be        ┆           ┆           ┆           │
│            ┆            ┆           ┆        ┆   ┆ proces…   ┆           ┆           ┆           │
└────────────┴────────────┴───────────┴────────┴───┴───────────┴───────────┴───────────┴───────────┘

The shape of the dataset is (5000, 34)

df.collect_schema()

Schema([('unique_key', String),
        ('created_date', String),
        ('agency', String),
        ('agency_name', String),
        ('complaint_type', String),
        ('descriptor', String),
        ('location_type', String),
        ('incident_zip', String),
        ('incident_address', String),
        ('street_name', String),
        ('cross_street_1', String),
        ('cross_street_2', String),
        ('intersection_street_1', String),
        ('intersection_street_2', String),
        ('address_type', String),
        ('city', String),
        ('landmark', String),
        ('status', String),
        ('resolution_action_updated_date', String),
        ('community_board', String),
        ('bbl', String),
        ('borough', String),
        ('x_coordinate_state_plane', String),
        ('y_coordinate_state_plane', String),
        ('open_data_channel_type', String),
        ('park_facility_name', String),
        ('park_borough', String),
        ('latitude', String),
        ('longitude', String),
        ('location',
         Struct({'human_address': String, 'latitude': String, 'longitude': String})),
        ('resolution_description', String),
        ('vehicle_type', String),
        ('closed_date', String),
        ('data_retrieved_at', String)])

# Check for null values in each column
null_counts = df.null_count()

print("Null counts per column:")
print(null_counts)

# To check if there are ANY null values in the entire DataFrame:
# You can sum all the null counts and check if the total is greater than 0.
total_nulls = null_counts.sum_horizontal().item()

if total_nulls > 0:
    print("\nThere are null values in the DataFrame.")
else:
    print("\nNo null values found in the DataFrame.")

Null counts per column:
shape: (1, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ n_descrip ┆ ype       ┆ te        ┆ ieved_at  │
│ u32        ┆ ---        ┆ u32    ┆ ---       ┆   ┆ tion      ┆ ---       ┆ ---       ┆ ---       │
│            ┆ u32        ┆        ┆ u32       ┆   ┆ ---       ┆ u32       ┆ u32       ┆ u32       │
│            ┆            ┆        ┆           ┆   ┆ u32       ┆           ┆           ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0          ┆ 0          ┆ 0      ┆ 0         ┆ … ┆ 1203      ┆ 4829      ┆ 1238      ┆ 0         │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

There are null values in the DataFrame.

Observations:

There are null values present in the df dataframe for the following variables:

Closed Date
Descriptor
Location Type
etc ...

Exploratory Data Analysis (EDA)

Plotting directly from Polars Series (more “Polars-native” if possible) This method is slightly more involved for a horizontal bar chart directly. We’ll build our charts in Altair.

# Let's prepare data as a dictionary of column_name: null_count
null_dict = null_counts.row(0, named=True) # Get the first (and only) row as a dictionary

columns = list(null_dict.keys())
counts = list(null_dict.values())

#--- Use the dict/lists to build a Polars DataFrame suitable for Altair ---
# Create a new Polars DataFrame from the lists
# This DataFrame will have the 'long' format expected by Altair
null_data = pl.DataFrame({
    "Column Name": columns,
    "Null Count": counts
})

# Optional: Filter out columns with 0 nulls for cleaner visualization
null_data = null_data.filter(pl.col("Null Count") > 0)

print("Null Count DataFrame prepared:\n")
print(null_data.head())

# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(null_data).mark_bar().encode(
    # X-axis: Null Count (Quantitative type)
    x=alt.X('Null Count:Q', title='Count'),
    # Y-axis: Column Name (Nominal type), sorted by null count descending
    y=alt.Y('Column Name:N', title='Columns', sort='-x'), # Sort by null count for better readability

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('Column Name:N', title='Column'),
        alt.Tooltip('Null Count:Q', title='Nulls')
    ]
).properties(
    title='Null Value Counts',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
    ).interactive() # Allows zooming and panning

# Display the chart
chart.show()

Null Count DataFrame prepared:

shape: (5, 2)
┌──────────────────┬────────────┐
│ Column Name      ┆ Null Count │
│ ---              ┆ ---        │
│ str              ┆ i64        │
╞══════════════════╪════════════╡
│ location_type    ┆ 2527       │
│ incident_zip     ┆ 15         │
│ incident_address ┆ 490        │
│ street_name      ┆ 490        │
│ cross_street_1   ┆ 526        │
└──────────────────┴────────────┘

# Remove null values from "Closed Date" column
# Polars operations are immutable, so you need to reassign the DataFrame
df = df.drop_nulls(subset=["closed_date"])

# Show sample of dataset with null values removed
print("\nShape after dropping nulls:", df.shape)
print("\nSample of dataset with null values removed:")
print(df.head())

# Verify nulls in "Closed Date" after dropping
print("\nNulls in 'Closed Date' after dropping:", df.select(pl.col("closed_date").null_count()))


Shape after dropping nulls: (3762, 34)

Sample of dataset with null values removed:
shape: (5, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ n_descrip ┆ ype       ┆ te        ┆ ieved_at  │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ tion      ┆ ---       ┆ ---       ┆ ---       │
│            ┆ str        ┆        ┆ str       ┆   ┆ ---       ┆ str       ┆ str       ┆ str       │
│            ┆            ┆        ┆           ┆   ┆ str       ┆           ┆           ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:50:40. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T01:58:0 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 0.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65366664   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:45:52. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T02:02:3 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 0.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t issued  ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ a…        ┆           ┆           ┆           │
│ 65358593   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:42:25. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T01:52:1 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 4.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65366547   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:40:10. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T02:03:0 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 8.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65363618   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ T01:37:15. ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5T01:59:5 ┆ 5_02_38_4 │
│            ┆ 000        ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 8.000     ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

Nulls in 'Closed Date' after dropping: shape: (1, 1)
┌─────────────┐
│ closed_date │
│ ---         │
│ u32         │
╞═════════════╡
│ 0           │
└─────────────┘

# Check if null values are still present in "Closed Date" column
# Use pl.col() for column selection and .is_null().any() for the check

if df.select(pl.col("closed_date").is_null().any()).item():
    print('Null values still present in "Closed Date" column.')
else:
    print('No null values present in "Closed Date" column.')

No null values present in "Closed Date" column.

# Convert the "Closed Date" and "Created Date" columns to datetime format
# Polars operations are immutable, so use .with_columns() and reassign

df = df.with_columns([
    # converting "created date"
    pl.col("created_date").str.to_datetime().alias("created_date"),
    # converting "closed date"
    pl.col("closed_date").str.to_datetime().alias("closed_date"),
])

# Display the schema (Polars' equivalent of df.info() for types)
print("DataFrame Schema after datetime conversion:\n")
print(df.collect_schema)

DataFrame Schema after datetime conversion:

<bound method DataFrame.collect_schema of shape: (3_762, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ n_descrip ┆ ype       ┆ te        ┆ ieved_at  │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ tion      ┆ ---       ┆ ---       ┆ ---       │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ ---       ┆ str       ┆ datetime[ ┆ str       │
│            ┆ s]         ┆        ┆           ┆   ┆ str       ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:50:40   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 01:58:00  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65366664   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:45:52   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 02:02:30  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t issued  ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ a…        ┆           ┆           ┆           │
│ 65358593   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:42:25   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 01:52:14  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65366547   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:40:10   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 02:03:08  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65363618   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 01:37:15   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 5         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 01:59:58  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ …          ┆ …          ┆ …      ┆ …         ┆ … ┆ …         ┆ …         ┆ …         ┆ …         │
│ 65362166   ┆ 2025-06-24 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:10:04   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 21:57:40  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65359062   ┆ 2025-06-24 ┆ DOB    ┆ Departmen ┆ … ┆ Your      ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:09:37   ┆        ┆ t of      ┆   ┆ Service   ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Buildings ┆   ┆ Request   ┆           ┆ 18:09:37  ┆ 5_PM      │
│            ┆            ┆        ┆           ┆   ┆ has been  ┆           ┆           ┆           │
│            ┆            ┆        ┆           ┆   ┆ …         ┆           ┆           ┆           │
│ 65357050   ┆ 2025-06-24 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:09:26   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 19:19:26  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65361679   ┆ 2025-06-24 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:09:21   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 19:36:37  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
│ 65358615   ┆ 2025-06-24 ┆ NYPD   ┆ New York  ┆ … ┆ The       ┆ null      ┆ 2025-06-2 ┆ 06_26_202 │
│            ┆ 18:08:43   ┆        ┆ City      ┆   ┆ Police    ┆           ┆ 4         ┆ 5_02_38_4 │
│            ┆            ┆        ┆ Police    ┆   ┆ Departmen ┆           ┆ 19:29:13  ┆ 5_PM      │
│            ┆            ┆        ┆ Departmen ┆   ┆ t         ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆ responde… ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘>

# --- Convert 'time_delta' from Duration to a numerical type ---
# First, ensure 'time_delta' column is created as done previously
df = df.with_columns(
    (pl.col("closed_date") - pl.col("created_date")).alias("time_delta")
)

# Convert time_delta from Duration to total seconds (or minutes, hours, days)
# Polars Duration objects have methods like .total_seconds(), .total_minutes() etc.
df = df.with_columns(
    pl.col("time_delta").dt.total_seconds().alias("time_delta_seconds")
)

# Now, 'time_delta_seconds' is a numerical column (float)
# You can use this new column for plotting in Altair.
print("DataFrame schema after converting time_delta to seconds:")
print(df.schema)
print("\nHead of DataFrame showing new 'time_delta_seconds' column:")
print(df.head())

DataFrame schema after converting time_delta to seconds:
Schema({'unique_key': String, 'created_date': Datetime(time_unit='us', time_zone=None), 'agency': String, 'agency_name': String, 'complaint_type': String, 'descriptor': String, 'location_type': String, 'incident_zip': String, 'incident_address': String, 'street_name': String, 'cross_street_1': String, 'cross_street_2': String, 'intersection_street_1': String, 'intersection_street_2': String, 'address_type': String, 'city': String, 'landmark': String, 'status': String, 'resolution_action_updated_date': String, 'community_board': String, 'bbl': String, 'borough': String, 'x_coordinate_state_plane': String, 'y_coordinate_state_plane': String, 'open_data_channel_type': String, 'park_facility_name': String, 'park_borough': String, 'latitude': String, 'longitude': String, 'location': Struct({'human_address': String, 'latitude': String, 'longitude': String}), 'resolution_description': String, 'vehicle_type': String, 'closed_date': Datetime(time_unit='us', time_zone=None), 'data_retrieved_at': String, 'time_delta': Duration(time_unit='us'), 'time_delta_seconds': Int64})

Head of DataFrame showing new 'time_delta_seconds' column:
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ te        ┆ ieved_at  ┆ a         ┆ a_seconds │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ datetime[ ┆ str       ┆ duration[ ┆ i64       │
│            ┆ s]         ┆        ┆           ┆   ┆ μs]       ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 7m 20s    ┆ 440       │
│            ┆ 01:50:40   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:58:00  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366664   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 16m 38s   ┆ 998       │
│            ┆ 01:45:52   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 02:02:30  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65358593   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 9m 49s    ┆ 589       │
│            ┆ 01:42:25   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:52:14  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366547   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 22m 58s   ┆ 1378      │
│            ┆ 01:40:10   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 02:03:08  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65363618   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 22m 43s   ┆ 1363      │
│            ┆ 01:37:15   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:59:58  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

# View descriptive statistics of the "time_delta" column
# Changed the column selection to "time_delta"
print("\nDescriptive statistics for 'time_delta' column:")
print(df.select(pl.col("time_delta")).describe())


Descriptive statistics for 'time_delta' column:
shape: (9, 2)
┌────────────┬────────────────┐
│ statistic  ┆ time_delta     │
│ ---        ┆ ---            │
│ str        ┆ str            │
╞════════════╪════════════════╡
│ count      ┆ 3762           │
│ null_count ┆ 0              │
│ mean       ┆ 0:31:40.608187 │
│ std        ┆ null           │
│ min        ┆ 0:00:00        │
│ 25%        ┆ 0:00:00        │
│ 50%        ┆ 0:00:00        │
│ 75%        ┆ 0:42:35        │
│ max        ┆ 7:37:14        │
└────────────┴────────────────┘

Finding the major type of complaints

# Check the number of null values in the "complaint_type" and "city" columns using f-strings
print(f'Number of null values in "complaint_type" column: {df.select(pl.col("complaint_type").null_count()).item()}')
print(f'Number of null values in "city" column: {df.select(pl.col("city").null_count()).item()}')

Number of null values in "complaint_type" column: 0
Number of null values in "city" column: 148

# Fill in missing values (nulls) in the 'city' column with 'Unknown City'
df = df.with_columns(
    pl.col("city").fill_null(pl.lit("UNKNOWN CITY")).alias("city")
)

# Get Unique Count using Polars' n_unique()
# We need to extract the single value using .item() for the f-string
city_count = df.select(pl.col("city").n_unique()).item()

# Get all unique values in the 'city' column as a sorted list
unique_areas = df["city"].unique().sort().to_list()

print(f"Number of null values in 'city' after fill_null: {df.select(pl.col("city").null_count()).item()}")
print(f"\nThere are {city_count} values in the dataset that contain the following info:\n")
print(unique_areas)

Number of null values in 'city' after fill_null: 0

There are 42 values in the dataset that contain the following info:

['ARVERNE', 'ASTORIA', 'BAYSIDE', 'BRONX', 'BROOKLYN', 'CAMBRIA HEIGHTS', 'COLLEGE POINT', 'CORONA', 'EAST ELMHURST', 'ELMHURST', 'FAR ROCKAWAY', 'FLUSHING', 'FOREST HILLS', 'FRESH MEADOWS', 'HOLLIS', 'HOWARD BEACH', 'JACKSON HEIGHTS', 'JAMAICA', 'LITTLE NECK', 'LONG ISLAND CITY', 'MANHATTAN', 'MASPETH', 'MIDDLE VILLAGE', 'NEW YORK', 'OAKLAND GARDENS', 'OZONE PARK', 'QUEENS', 'QUEENS VILLAGE', 'REGO PARK', 'RICHMOND HILL', 'RIDGEWOOD', 'ROSEDALE', 'SAINT ALBANS', 'SOUTH OZONE PARK', 'SOUTH RICHMOND HILL', 'SPRINGFIELD GARDENS', 'STATEN ISLAND', 'SUNNYSIDE', 'UNKNOWN CITY', 'WHITESTONE', 'WOODHAVEN', 'WOODSIDE']

# Calculate total complaints per city and complaint type, then pivot
total_city_complaints = df.group_by(["city", "complaint_type"]).len().pivot(
    index="city",              # 'city' will be the rows
    columns="complaint_type",  # 'complaint_type' will be the new columns
    values="len"               # The aggregated count (from .len()) will be the values
)

total_city_complaints.head()

shape: (5, 32)

city	Blocked Driveway	Noise - Residential	Abandoned Vehicle	Elevator	Noise - Street/Sidewalk	Illegal Fireworks	Illegal Parking	Noise - Commercial	General Construction/Plumbing	Water System	Noise - Vehicle	Non-Emergency Police Matter	Panhandling	Dirty Condition	Traffic Signal Condition	Disorderly Youth	Noise - Park	Drinking	Derelict Vehicles	Electrical	Encampment	Drug Activity	Violation of Park Rules	Vendor Enforcement	Street Condition	Bike/Roller/Skate Chronic	Animal-Abuse	Traffic	Urinating in Public	Building/Use	Emergency Response Team (ERT)
str	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32
"FAR ROCKAWAY"	1	null	null	null	5	null	15	null	null	2	null	null	null	null	null	null	1	null	null	1	null	null	null	null	null	null	null	null	null	null	null
"HOLLIS"	1	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
"UNKNOWN CITY"	1	7	1	null	87	4	15	2	null	null	13	7	5	null	2	null	1	1	null	null	null	null	null	null	null	1	null	1	null	null	null
"BRONX"	24	70	null	2	220	5	42	8	1	1798	20	6	null	2	6	null	null	3	4	1	null	1	null	1	null	null	2	null	null	null	null
"WOODSIDE"	2	1	3	null	null	null	3	2	null	null	null	null	1	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null

# Reshape the DataFrame to "long" format for Altair

# narrow down into 'city' for 'complaint_type' and 'number_of_complaints'
complaints = total_city_complaints.unpivot(
    index=['city'],
    variable_name='complaint_type', # New column for the complaint type names
    value_name='number_of_complaints' # New column for the complaint counts
)

# Filter out rows where number_of_complaints is 0, as they don't contribute to the visualization
# This makes the tooltip cleaner and chart potentially faster if there are many zeros
complaints = complaints.filter(pl.col("number_of_complaints") > 0)

# Create the Altair Stacked Bar Chart
chart = alt.Chart(complaints).mark_bar().encode(
    # X-axis: Number of Complaints (Quantitative type)
    x=alt.X('number_of_complaints:Q', title='Number of Complaints'),
    # Y-axis: City (Nominal type), sorted by the sum of complaints for better readability
    y=alt.Y('city:N', title='City', sort='-x'), # Sort cities by total complaints in descending order

    # Color: Complaint Type (Nominal type) for stacking
    color=alt.Color('complaint_type:N', title='Complaint Type'),

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('city:N', title='City'),
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('number_of_complaints:Q', title='Count')
    ]
).properties(
    title='Frequency of Complaints by City',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive() # Allows zooming and panning

# Display the chart
chart.show()

Data Visualization of the major type of complaints

In order to find the major type of complaints in the dataset, the following code groups the dataset by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting bar graph shows the counts for each complaint type.

# Group by 'complaint_type' and count the number of occurrences
# Then sort the counts in descending order
complaint_counts = df.group_by("complaint_type").agg(
    pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order

complaint_types = df.select(pl.col("complaint_type").n_unique()).item()

# View the result
print(f"\nThere are {complaint_types} complaint types in the dataset that contain the following info:\n")
print(complaint_counts)


There are 31 complaint types in the dataset that contain the following info:

shape: (31, 2)
┌───────────────────────────────┬────────┐
│ complaint_type                ┆ counts │
│ ---                           ┆ ---    │
│ str                           ┆ u32    │
╞═══════════════════════════════╪════════╡
│ Water System                  ┆ 2145   │
│ Noise - Street/Sidewalk       ┆ 552    │
│ Illegal Parking               ┆ 337    │
│ Noise - Residential           ┆ 267    │
│ Blocked Driveway              ┆ 90     │
│ …                             ┆ …      │
│ Disorderly Youth              ┆ 2      │
│ Traffic                       ┆ 1      │
│ Emergency Response Team (ERT) ┆ 1      │
│ Bike/Roller/Skate Chronic     ┆ 1      │
│ Violation of Park Rules       ┆ 1      │
└───────────────────────────────┴────────┘

# Plot the bar graph
chart = alt.Chart(complaint_counts).mark_bar().encode(
    # X-axis: 'counts' (Quantitative type)
    x=alt.X('counts:Q', title='Number of Complaints',
           scale=alt.Scale(domainMin=0, zero=True)
           ),
    # Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
    y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
    color=alt.value('orange'), # You can use color names or hex codes like '#FF5733'

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('counts:Q', title='Count')
    ]
).properties(
    title='Frequency of Complaints by Type',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive()

# Display the chart
chart.show()

From our bar chart analysis, the top 10 major complaint types are the following:

Illegal Parking
Noise - Residental
Noise - Street/Sidewalk
Blocked Driveway
Abandoned Vehicle
Noise - Vehicle
Noise - Commercial
General Construction/Plumbing
Vendor Enforcement
Homeless Encampment

Let’s explore the complaint data specifically categorized as NYC …

This code filters the dataset to only include complaints from New York City, groups the remaining data by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting output shows the frequency of each complaint type for New York City.

# Filter the DataFrame for only the 'NEW YORK' city data
# Use .filter() with a boolean expression
nyc_data = df.filter(pl.col('city') == 'NEW YORK')

print("Head of the nyc_data DataFrame (filtered for 'NEW YORK'):")
print(nyc_data.head())

print(f"\nShape of the original DataFrame: {df.shape}")
print(f"Shape of the 'nyc_data' DataFrame (after filtering): {nyc_data.shape}")

# You can also verify that all cities in nyc_data are 'NEW YORK'
print("\nUnique cities in 'nyc_data':")
print(nyc_data['city'].unique().to_list())

Head of the nyc_data DataFrame (filtered for 'NEW YORK'):
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ te        ┆ ieved_at  ┆ a         ┆ a_seconds │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ datetime[ ┆ str       ┆ duration[ ┆ i64       │
│            ┆ s]         ┆        ┆           ┆   ┆ μs]       ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65357312   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 10m 16s   ┆ 616       │
│            ┆ 01:19:05   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:29:21  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65365325   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 29m 48s   ┆ 1788      │
│            ┆ 01:18:35   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:48:23  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366862   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 30m 6s    ┆ 1806      │
│            ┆ 01:12:03   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:42:09  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65355807   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 9m 18s    ┆ 558       │
│            ┆ 01:06:14   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:15:32  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65361896   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 41m 42s   ┆ 2502      │
│            ┆ 01:04:42   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:46:24  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

Shape of the original DataFrame: (3762, 36)
Shape of the 'nyc_data' DataFrame (after filtering): (339, 36)

Unique cities in 'nyc_data':
['NEW YORK']

nyc_complaint_types = nyc_data.group_by("complaint_type").agg(
    pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order

# get the number of unique complaints
nyc_complaint_counts = df.select(pl.col("complaint_type").n_unique()).item()

# View the result
print(f"\nThere are {nyc_complaint_counts} complaint types in the dataset that contain the following info:\n")
print(nyc_complaint_types)


There are 31 complaint types in the dataset that contain the following info:

shape: (21, 2)
┌───────────────────────────────┬────────┐
│ complaint_type                ┆ counts │
│ ---                           ┆ ---    │
│ str                           ┆ u32    │
╞═══════════════════════════════╪════════╡
│ Water System                  ┆ 95     │
│ Noise - Street/Sidewalk       ┆ 88     │
│ Noise - Residential           ┆ 45     │
│ Illegal Parking               ┆ 38     │
│ Encampment                    ┆ 17     │
│ …                             ┆ …      │
│ General Construction/Plumbing ┆ 1      │
│ Animal-Abuse                  ┆ 1      │
│ Blocked Driveway              ┆ 1      │
│ Dirty Condition               ┆ 1      │
│ Emergency Response Team (ERT) ┆ 1      │
└───────────────────────────────┴────────┘

# Plot the bar graph
chart = alt.Chart(nyc_complaint_types).mark_bar().encode(
    # X-axis: 'counts' (Quantitative type)
    x=alt.X('counts:Q', title='Number of Complaints',
           scale=alt.Scale(domainMin=0, zero=True)
           ),
    # Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
    y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
    color=alt.value('green'), # You can use color names or hex codes like '#FF5733'

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('counts:Q', title='Count')
    ]
).properties(
    title='Types of Complaints in New York City',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive()

# Display the chart
chart.show()

Let’s explore the complaint data specifically for Brooklyn …

This code filters the dataset to only include complaints from Brooklyn, groups the remaining data by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting output shows the frequency of each complaint type for Brooklyn.

# Filter the DataFrame for only the 'BROOKLYN' city data
# Use .filter() with a boolean expression
bkn_data = df.filter(pl.col('city') == 'BROOKLYN')

print("Head of the bkn_data DataFrame (filtered for 'BROOKLYN'):")
print(bkn_data.head())

print(f"\nShape of the original DataFrame: {df.shape}")
print(f"Shape of the 'nyc_data' DataFrame (after filtering): {bkn_data.shape}")

# You can also verify that all cities in nyc_data are 'BROOKLYN'
print("\nUnique cities in 'bkn_data':")
print(bkn_data['city'].unique().to_list())

Head of the bkn_data DataFrame (filtered for 'BROOKLYN'):
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ te        ┆ ieved_at  ┆ a         ┆ a_seconds │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ datetime[ ┆ str       ┆ duration[ ┆ i64       │
│            ┆ s]         ┆        ┆           ┆   ┆ μs]       ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 7m 20s    ┆ 440       │
│            ┆ 01:50:40   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:58:00  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366857   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 19m 28s   ┆ 1168      │
│            ┆ 01:25:36   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:45:04  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65365214   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 32m 12s   ┆ 1932      │
│            ┆ 01:15:21   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:47:33  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65355770   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 31m 42s   ┆ 1902      │
│            ┆ 01:08:58   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 01:40:40  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65355768   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 53m 13s   ┆ 3193      │
│            ┆ 01:08:52   ┆        ┆ City      ┆   ┆ 5         ┆ 5_02_38_4 ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆ 02:02:05  ┆ 5_PM      ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

Shape of the original DataFrame: (3762, 36)
Shape of the 'nyc_data' DataFrame (after filtering): (486, 36)

Unique cities in 'bkn_data':
['BROOKLYN']

bkn_complaint_types = bkn_data.group_by("complaint_type").agg(
    pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order

# get the number of unique complaints
bkn_complaint_counts = df.select(pl.col("complaint_type").n_unique()).item()

# View the result
print(f"\nThere are {bkn_complaint_counts} complaint types in the dataset that contain the following info:\n")
print(bkn_complaint_types)


There are 31 complaint types in the dataset that contain the following info:

shape: (22, 2)
┌───────────────────────────────┬────────┐
│ complaint_type                ┆ counts │
│ ---                           ┆ ---    │
│ str                           ┆ u32    │
╞═══════════════════════════════╪════════╡
│ Water System                  ┆ 124    │
│ Illegal Parking               ┆ 101    │
│ Noise - Residential           ┆ 78     │
│ Noise - Street/Sidewalk       ┆ 63     │
│ Blocked Driveway              ┆ 24     │
│ …                             ┆ …      │
│ Animal-Abuse                  ┆ 1      │
│ General Construction/Plumbing ┆ 1      │
│ Electrical                    ┆ 1      │
│ Violation of Park Rules       ┆ 1      │
│ Disorderly Youth              ┆ 1      │
└───────────────────────────────┴────────┘

# Plot the bar graph
chart = alt.Chart(bkn_complaint_types).mark_bar().encode(
    # X-axis: 'counts' (Quantitative type)
    x=alt.X('counts:Q', title='Number of Complaints',
           scale=alt.Scale(domainMin=0, zero=True)
           ),
    # Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
    y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
    color=alt.value('red'), # You can use color names or hex codes like '#FF5733'

    # Tooltip for interactivity
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('counts:Q', title='Count')
    ]
).properties(
    title='Types of Complaints in Brooklyn',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive()

# Display the chart
chart.show()

# filter data for the geographic locations of Brooklyn in the dataset.
bkn_geo_data = df.filter(
    (pl.col("city") == "BROOKLYN") &
    pl.col("longitude").is_not_null() &
    pl.col("latitude").is_not_null()
).select(["longitude", "latitude"]) # <-- This line selects only longitude and latitude

# create a scatter plot of the concentration of complaints across Brooklyn

# Using typical bounds for Brooklyn for demonstration:
bkn_lon_min, bkn_lon_max = -74.05, -73.85
bkn_lat_min, bkn_lat_max = 40.55, 40.75

scatter_plot = alt.Chart(bkn_geo_data).mark_point(
    opacity=0.2,  # Reduce opacity to see overlapping points
    size=5        # Adjust point size
).encode(
    x=alt.X(
        'longitude:Q',
        title='Longitude',
        scale=alt.Scale(domain=[bkn_lon_min, bkn_lon_max])  # Set fixed domain for x
    ),
    y=alt.Y(
        'latitude:Q',
        title='Latitude',
        scale=alt.Scale(domain=[bkn_lat_min, bkn_lat_max])  # Set fixed domain for y
    ),
    tooltip=['longitude:Q', 'latitude:Q']
).properties(
    title='Complaint Locations in Brooklyn (Scatter Plot)',
    width=700,   # Set explicit width
    height=560   # Set explicit height
).interactive()  # Allows zooming and panning

print("\nDisplaying Scatter Plot for Brooklyn Complaints:")
scatter_plot.show()


Displaying Scatter Plot for Brooklyn Complaints:

# --- Hexbin plot using mark_circle() + bin approach ---
hexbin_plot = alt.Chart(bkn_geo_data).mark_circle().encode(
    # Bin longitude and latitude into hexagonal cells
    x=alt.X('longitude:Q', bin=alt.Bin(maxbins=50), title='Longitude'), # Adjust maxbins for granularity
    y=alt.Y('latitude:Q', bin=alt.Bin(maxbins=50), title='Latitude'),   # Adjust maxbins for granularity

    # Color by the count of points in each bin
    color=alt.Color('count():Q',
                    title='Number of Complaints',
                    scale=alt.Scale(range='heatmap', domainMin=1) # Ensure heatmap starts from non-zero count
                    ),
    # Size the circles by count as well for emphasis
     size=alt.Size('count():Q', legend=None), # Hide size legend as color already indicates count

    tooltip=[
        alt.Tooltip('count():Q', title='Complaints in Bin'),
        alt.Tooltip('longitude:Q', bin=True, title='Longitude Bin'),
        alt.Tooltip('latitude:Q', bin=True, title='Latitude Bin')
    ]
).properties(
    title='Complaint Concentration in Brooklyn (Hexbin Plot)',
    width=700,  # Set explicit width
    height=560  # Set explicit height
).interactive() # Allows zooming and panning

print("\nDisplaying Hexbin Plot for Brooklyn Complaints:")
hexbin_plot.show()


Displaying Hexbin Plot for Brooklyn Complaints:

This code groups the data by both city and complaint type, counts the number of occurrences for each combination, and then prints the resulting output, which shows the various types of complaints in each city.

# 1. Group by city and complaint type and count the number of occurrences
# This will result in a DataFrame with columns: 'city', 'complaint_type', 'count'
city_complaint_counts = df.group_by(["city", "complaint_type"]).agg(
    pl.len().alias("count")
).sort(["city", "count"], descending=[False, True]) # Sort by city, then by count descending for readability


print("Grouped by City and Complaint Type (Head):")
print(city_complaint_counts.head())

Grouped by City and Complaint Type (Head):
shape: (5, 3)
┌─────────┬─────────────────────────┬───────┐
│ city    ┆ complaint_type          ┆ count │
│ ---     ┆ ---                     ┆ ---   │
│ str     ┆ str                     ┆ u32   │
╞═════════╪═════════════════════════╪═══════╡
│ ARVERNE ┆ Blocked Driveway        ┆ 1     │
│ ASTORIA ┆ Illegal Parking         ┆ 21    │
│ ASTORIA ┆ Noise - Street/Sidewalk ┆ 5     │
│ ASTORIA ┆ Noise - Residential     ┆ 4     │
│ ASTORIA ┆ Animal-Abuse            ┆ 1     │
└─────────┴─────────────────────────┴───────┘

This code uses the pivot function to pivot the data in the original DataFrame df. In Polars, you first group by the columns you want to pivot by, then count, then pivot. We’ll use ‘complaint_type’ and ‘city’ as confirmed lowercase names.

The resulting DataFrame, df_new, has complaint types as rows, cities as columns, and the count of occurrences as the values. The fill_value=0 parameter fills any missing values with zeros. The resulting output shows the new DataFrame with cities as columns and complaint types in rows.

complaints_by_city = df.group_by(["complaint_type", "city"]).len().pivot(
    index="complaint_type", # Rows of the new pivot table
    columns="city",         # Columns of the new pivot table
    values="len"            # The count from .len() will fill the values
)

# Fill any null values (where a complaint type didn't occur in a city) with 0
complaints_by_city = complaints_by_city.fill_null(0)

# Print the new DataFrame's head
print("Complaints by City (head of the pivot table):")
print(complaints_by_city.head())

Complaints by City (head of the pivot table):
shape: (5, 43)
┌─────────────────────┬──────────┬─────────┬────────┬───┬────────┬─────────┬─────────┬─────────────┐
│ complaint_type      ┆ BROOKLYN ┆ ASTORIA ┆ FOREST ┆ … ┆ HOLLIS ┆ ARVERNE ┆ CAMBRIA ┆ LITTLE NECK │
│ ---                 ┆ ---      ┆ ---     ┆ HILLS  ┆   ┆ ---    ┆ ---     ┆ HEIGHTS ┆ ---         │
│ str                 ┆ u32      ┆ u32     ┆ ---    ┆   ┆ u32    ┆ u32     ┆ ---     ┆ u32         │
│                     ┆          ┆         ┆ u32    ┆   ┆        ┆         ┆ u32     ┆             │
╞═════════════════════╪══════════╪═════════╪════════╪═══╪════════╪═════════╪═════════╪═════════════╡
│ Disorderly Youth    ┆ 1        ┆ 0       ┆ 0      ┆ … ┆ 0      ┆ 0       ┆ 0       ┆ 0           │
│ Blocked Driveway    ┆ 24       ┆ 1       ┆ 0      ┆ … ┆ 1      ┆ 1       ┆ 0       ┆ 0           │
│ Noise - Residential ┆ 78       ┆ 4       ┆ 1      ┆ … ┆ 0      ┆ 0       ┆ 0       ┆ 0           │
│ Traffic Signal      ┆ 23       ┆ 0       ┆ 0      ┆ … ┆ 0      ┆ 0       ┆ 0       ┆ 0           │
│ Condition           ┆          ┆         ┆        ┆   ┆        ┆         ┆         ┆             │
│ Illegal Parking     ┆ 101      ┆ 21      ┆ 2      ┆ … ┆ 0      ┆ 0       ┆ 1       ┆ 0           │
└─────────────────────┴──────────┴─────────┴────────┴───┴────────┴─────────┴─────────┴─────────────┘

Statistical Analysis of `Request_Closing_Time`

# Calculate the response time (duration)
# Polars automatically creates a Duration type from datetime subtraction
df = df.with_columns(
    (pl.col("closed_date") - pl.col("created_date")).alias("request_closing_duration")
)

# Convert the duration to total hours and total minutes (as numerical values for plotting)
df = df.with_columns([
    pl.col("request_closing_duration").dt.total_hours().alias("request_closing_time_hrs"),
    pl.col("request_closing_duration").dt.total_minutes().alias("request_closing_time_min")
])

print("DataFrame Head with new time columns:")
print(df.head())

print("\nSchema of new time columns:")
print(df.select(["request_closing_duration", "request_closing_time_hrs", "request_closing_time_min"]).schema)

DataFrame Head with new time columns:
shape: (5, 39)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ time_delt ┆ request_c ┆ request_c ┆ request_c │
│ ---        ┆ te         ┆ ---    ┆ me        ┆   ┆ a_seconds ┆ losing_du ┆ losing_ti ┆ losing_ti │
│ str        ┆ ---        ┆ str    ┆ ---       ┆   ┆ ---       ┆ ration    ┆ me_hrs    ┆ me_min    │
│            ┆ datetime[μ ┆        ┆ str       ┆   ┆ i64       ┆ ---       ┆ ---       ┆ ---       │
│            ┆ s]         ┆        ┆           ┆   ┆           ┆ duration[ ┆ i64       ┆ i64       │
│            ┆            ┆        ┆           ┆   ┆           ┆ μs]       ┆           ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 440       ┆ 7m 20s    ┆ 0         ┆ 7         │
│            ┆ 01:50:40   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366664   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 998       ┆ 16m 38s   ┆ 0         ┆ 16        │
│            ┆ 01:45:52   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65358593   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 589       ┆ 9m 49s    ┆ 0         ┆ 9         │
│            ┆ 01:42:25   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65366547   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 1378      ┆ 22m 58s   ┆ 0         ┆ 22        │
│            ┆ 01:40:10   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
│ 65363618   ┆ 2025-06-25 ┆ NYPD   ┆ New York  ┆ … ┆ 1363      ┆ 22m 43s   ┆ 0         ┆ 22        │
│            ┆ 01:37:15   ┆        ┆ City      ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Police    ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ Departmen ┆   ┆           ┆           ┆           ┆           │
│            ┆            ┆        ┆ …         ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

Schema of new time columns:
Schema({'request_closing_duration': Duration(time_unit='us'), 'request_closing_time_hrs': Int64, 'request_closing_time_min': Int64})

# Define the list of city names that represent the main boroughs in your dataset
borough_cities = ['NEW YORK', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND']

# Filter the DataFrame to include only these borough cities
borough_complaints_df = df.filter(pl.col("city").is_in(borough_cities))

# Calculate the average complaint resolution time (in hours) for each borough city
avg_resolution_time_by_borough = borough_complaints_df.group_by("city").agg(
    pl.col("request_closing_time_hrs").mean().alias("avg_resolution_hours")
).sort("avg_resolution_hours", descending=False) # Sort for better visualization

print("\nAverage Complaint Resolution Time (Hours) by Borough City:")
print(avg_resolution_time_by_borough)

# --- Plotting the average complaint resolved time for the five borough cities ---
chart_borough_resolution = alt.Chart(avg_resolution_time_by_borough).mark_bar().encode(
    # X-axis: Average Resolution Time in Hours (Quantitative)
    x=alt.X('avg_resolution_hours:Q',
            title='Average Resolution Time (Hours)',
            scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
           ),
    # Y-axis: City (Nominal), sorted by average time
    y=alt.Y('city:N',
            title='City',
            sort='-x' # Sort cities by their average resolution time
           ),
    color=alt.value('darkcyan'), # Set a single color for the bars
    tooltip=[
        alt.Tooltip('city:N', title='City'),
        alt.Tooltip('avg_resolution_hours:Q', title='Avg. Hours', format='.2f') # Format to 2 decimal places
    ]
).properties(
    title='Average Complaint Resolution Time by Borough City',
    width=800,  # <-- Adjust the width here (e.g., 800 pixels)
    height=600  # <-- Adjust the height here (e.g., 600 pixels)
).interactive()

chart_borough_resolution.show()


Average Complaint Resolution Time (Hours) by Borough City:
shape: (5, 2)
┌───────────────┬──────────────────────┐
│ city          ┆ avg_resolution_hours │
│ ---           ┆ ---                  │
│ str           ┆ f64                  │
╞═══════════════╪══════════════════════╡
│ QUEENS        ┆ 0.0                  │
│ BRONX         ┆ 0.204422             │
│ NEW YORK      ┆ 0.389381             │
│ BROOKLYN      ┆ 0.623457             │
│ STATEN ISLAND ┆ 0.863014             │
└───────────────┴──────────────────────┘

# Calculate the average closing time by city
# Group by 'city', aggregate the mean of 'request_closing_time_hrs', and sort
avg_closing_time_dict = df.group_by("city").agg(
    pl.col("request_closing_time_hrs").mean().alias("avg_hours_to_close")
).sort("avg_hours_to_close", descending=False).to_dict(as_series=False) # Convert to dictionary

# Use a dictionary comprehension to reformat it
# We're zipping the 'city' list with the 'avg_hours_to_close' list
avg_closing_time_by_city_dict = {
    city: avg_time
    for city, avg_time in zip(avg_closing_time_dict['city'], avg_closing_time_dict['avg_hours_to_close'])
}

print("Average Closing Time by City (sorted, as dictionary):\n")
for key, values in avg_closing_time_by_city_dict.items():
    print(f"{key} : {values:.2f} hrs")

Average Closing Time by City (sorted, as dictionary):

LITTLE NECK : 0.00 hrs
MIDDLE VILLAGE : 0.00 hrs
SPRINGFIELD GARDENS : 0.00 hrs
OAKLAND GARDENS : 0.00 hrs
BAYSIDE : 0.00 hrs
CAMBRIA HEIGHTS : 0.00 hrs
HOLLIS : 0.00 hrs
ROSEDALE : 0.00 hrs
QUEENS : 0.00 hrs
MANHATTAN : 0.11 hrs
SOUTH OZONE PARK : 0.14 hrs
SUNNYSIDE : 0.17 hrs
RIDGEWOOD : 0.17 hrs
FLUSHING : 0.18 hrs
HOWARD BEACH : 0.20 hrs
BRONX : 0.20 hrs
JAMAICA : 0.24 hrs
MASPETH : 0.25 hrs
REGO PARK : 0.33 hrs
WHITESTONE : 0.33 hrs
OZONE PARK : 0.33 hrs
UNKNOWN CITY : 0.36 hrs
NEW YORK : 0.39 hrs
FRESH MEADOWS : 0.44 hrs
FOREST HILLS : 0.50 hrs
CORONA : 0.55 hrs
BROOKLYN : 0.62 hrs
LONG ISLAND CITY : 0.73 hrs
RICHMOND HILL : 0.75 hrs
WOODHAVEN : 0.81 hrs
JACKSON HEIGHTS : 0.83 hrs
STATEN ISLAND : 0.86 hrs
COLLEGE POINT : 1.00 hrs
SAINT ALBANS : 1.00 hrs
FAR ROCKAWAY : 1.00 hrs
QUEENS VILLAGE : 1.00 hrs
ARVERNE : 1.00 hrs
SOUTH RICHMOND HILL : 1.00 hrs
ASTORIA : 1.23 hrs
EAST ELMHURST : 1.27 hrs
ELMHURST : 1.82 hrs
WOODSIDE : 1.83 hrs

# Calculate the average closing time by city (as a Polars DataFrame)
avg_closing_time_df = df.group_by("city").agg(
    pl.col("request_closing_time_hrs").mean().alias("avg_hours_to_close")
).sort("avg_hours_to_close", descending=False) # Sort for better chart readability

print("DataFrame for plotting (Avg. Closing Time by City):")
print(avg_closing_time_df.head())

# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(avg_closing_time_df).mark_bar().encode(
    # X-axis: Average Hours to Close (Quantitative)
    x=alt.X('avg_hours_to_close:Q',
            title='Average Closing Time (Hours)',
            axis=alt.Axis(titleFontSize=14),
            scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
           ),
    # Y-axis: City (Nominal), sorted by average closing time
    y=alt.Y('city:N',
            title='City',
            sort='-x', # Sort cities by their average resolution time
            axis=alt.Axis(titleFontSize=14)
           ),
    color=alt.value('purple'), # Set a single color for the bars
    tooltip=[
        alt.Tooltip('city:N', title='City'),
        alt.Tooltip('avg_hours_to_close:Q', title='Avg. Hours', format='.2f') # Format to 2 decimal places
    ]
).properties(
    title=alt.Title(
        'Average Complaint Closing Time by City',
        fontSize=18
    ),
    width=600,  # Adjust overall chart width
    height=alt.Step(15) # Dynamically adjust height based on number of cities, 15 pixels per bar
).interactive() # Enable pan and zoom

chart.show()

DataFrame for plotting (Avg. Closing Time by City):
shape: (5, 2)
┌─────────────────────┬────────────────────┐
│ city                ┆ avg_hours_to_close │
│ ---                 ┆ ---                │
│ str                 ┆ f64                │
╞═════════════════════╪════════════════════╡
│ CAMBRIA HEIGHTS     ┆ 0.0                │
│ MIDDLE VILLAGE      ┆ 0.0                │
│ LITTLE NECK         ┆ 0.0                │
│ SPRINGFIELD GARDENS ┆ 0.0                │
│ BAYSIDE             ┆ 0.0                │
└─────────────────────┴────────────────────┘

Let’s see the avg. response time across different types of complaints. We will also compare them across each borough.

# Calculate the average closing time by complaint type
# Group by 'complaint_type', aggregate the mean of 'request_closing_time_min', and sort
avg_closing_time_per_complaint = df.group_by("complaint_type").agg(
    pl.col("request_closing_time_min").mean().alias("avg_minutes_to_close")
).sort("avg_minutes_to_close", descending=False) # Default sort is ascending, so descending=False for ascending order

# Print the result
print("Average Closing Time (Minutes) per Complaint Type:")
print(avg_closing_time_per_complaint)

Average Closing Time (Minutes) per Complaint Type:
shape: (31, 2)
┌───────────────────────────────┬──────────────────────┐
│ complaint_type                ┆ avg_minutes_to_close │
│ ---                           ┆ ---                  │
│ str                           ┆ f64                  │
╞═══════════════════════════════╪══════════════════════╡
│ Street Condition              ┆ 0.0                  │
│ Building/Use                  ┆ 0.0                  │
│ Emergency Response Team (ERT) ┆ 0.0                  │
│ Elevator                      ┆ 0.0                  │
│ General Construction/Plumbing ┆ 0.0                  │
│ …                             ┆ …                    │
│ Animal-Abuse                  ┆ 123.4                │
│ Violation of Park Rules       ┆ 137.0                │
│ Traffic                       ┆ 152.0                │
│ Dirty Condition               ┆ 233.0                │
│ Derelict Vehicles             ┆ 268.2                │
└───────────────────────────────┴──────────────────────┘

# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(avg_closing_time_per_complaint).mark_bar().encode(
    # X-axis: Average Minutes to Close (Quantitative)
    x=alt.X('avg_minutes_to_close:Q',
            title='Average Closing Time (Minutes)',
            axis=alt.Axis(titleFontSize=14),
            scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
           ),
    # Y-axis: Complaint Type (Nominal), sorted by average closing time
    y=alt.Y('complaint_type:N',
            title='Complaint Type',
            sort='-x', # Sort complaint types by their average resolution time
            axis=alt.Axis(titleFontSize=14)
           ),
    color=alt.value('seagreen'), # Set a single color for the bars
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('avg_minutes_to_close:Q', title='Avg. Minutes', format='.2f') # Format to 2 decimal places
    ]
).properties(
    title=alt.Title(
        'Average Complaint Closing Time by Complaint Type',
        fontSize=18
    ),
    width=600,  # Adjust overall chart width
    height=alt.Step(15) # Dynamically adjust height based on number of complaint types, 15 pixels per bar
).interactive() # Enable pan and zoom

chart.show()

Here you can see that the Graffitti takes almost 11 hours to resolve. Other complaint types are more frequent where Air Quality, Noise from Places of Worship take much less time while some issues haven’t been recently reported at all, such as Plumbing.

# 1. Define the mapping (as a list of dictionaries or directly construct a DataFrame)
borough_mapping_data = [
    {'city_name': 'NEW YORK', 'Borough': 'MANHATTAN'},
    {'city_name': 'BRONX', 'Borough': 'BRONX'},
    {'city_name': 'BROOKLYN', 'Borough': 'BROOKLYN'},
    {'city_name': 'QUEENS', 'Borough': 'QUEENS'},
    {'city_name': 'STATEN ISLAND', 'Borough': 'STATEN ISLAND'},
    {'city_name': 'JAMAICA', 'Borough': 'QUEENS'},
    {'city_name': 'FLUSHING', 'Borough': 'QUEENS'},
    {'city_name': 'ASTORIA', 'Borough': 'QUEENS'},
    {'city_name': 'LONG ISLAND CITY', 'Borough': 'QUEENS'},
    {'city_name': 'EAST ELMHURST', 'Borough': 'QUEENS'},
    {'city_name': 'CORONA', 'Borough': 'QUEENS'},
    {'city_name': 'ELMHURST', 'Borough': 'QUEENS'},
    {'city_name': 'OZONE PARK', 'Borough': 'QUEENS'},
    {'city_name': 'SOUTH OZONE PARK', 'Borough': 'QUEENS'},
    {'city_name': 'ARVERNE', 'Borough': 'QUEENS'},
    {'city_name': 'FAR ROCKAWAY', 'Borough': 'QUEENS'},
    {'city_name': 'QUEENS VILLAGE', 'Borough': 'QUEENS'},
    {'city_name': 'RICHMOND HILL', 'Borough': 'QUEENS'},
    {'city_name': 'SOUTH RICHMOND HILL', 'Borough': 'QUEENS'},
    {'city_name': 'WOODHAVEN', 'Borough': 'QUEENS'},
    {'city_name': 'REGO PARK', 'Borough': 'QUEENS'},
    {'city_name': 'FOREST HILLS', 'Borough': 'QUEENS'},
    {'city_name': 'JACKSON HEIGHTS', 'Borough': 'QUEENS'},
    {'city_name': 'KEW GARDENS', 'Borough': 'QUEENS'},
    {'city_name': 'FRESH MEADOWS', 'Borough': 'QUEENS'},
    {'city_name': 'OAKLAND GARDENS', 'Borough': 'QUEENS'},
    {'city_name': 'COLLEGE POINT', 'Borough': 'QUEENS'},
    {'city_name': 'BAYSIDE', 'Borough': 'QUEENS'},
    {'city_name': 'GLEN OAKS', 'Borough': 'QUEENS'},
    {'city_name': 'BELLEROSE', 'Borough': 'QUEENS'},
    {'city_name': 'LITTLE NECK', 'Borough': 'QUEENS'},
    {'city_name': 'CAMBRIA HEIGHTS', 'Borough': 'QUEENS'},
    {'city_name': 'ROSEDALE', 'Borough': 'QUEENS'},
    {'city_name': 'ROCKAWAY PARK', 'Borough': 'QUEENS'},
    {'city_name': 'SPRINGFIELD GARDENS', 'Borough': 'QUEENS'},
    {'city_name': 'WOODSIDE', 'Borough': 'QUEENS'},
    {'city_name': 'MASPETH', 'Borough': 'QUEENS'},
    {'city_name': 'SUNNYSIDE', 'Borough': 'QUEENS'},
    {'city_name': 'MIDDLE VILLAGE', 'Borough': 'QUEENS'},
    {'city_name': 'RIDGEWOOD', 'Borough': 'BROOKLYN'},
    {'city_name': 'HOWARD BEACH', 'Borough': 'QUEENS'},
    {'city_name': 'BREEZY POINT', 'Borough': 'QUEENS'},
    {'city_name': 'WHITESTONE', 'Borough': 'QUEENS'},
    {'city_name': 'SAINT ALBANS', 'Borough': 'QUEENS'},
    {'city_name': 'Unknown City', 'Borough': 'Unspecified'} # Map previously filled 'Unknown City'
]

mapping_df = pl.DataFrame(borough_mapping_data)

# 2. Perform a left join to add the 'Borough' column
df = df.join(mapping_df, left_on="city", right_on="city_name", how="left")

# 3. Fill any remaining nulls in 'Borough' (for cities not in mapping) with 'Unspecified'
df = df.with_columns(
    pl.col("Borough").fill_null(pl.lit("Unspecified")).alias("Borough")
)

# --- Calculate average response time per complaint type per borough ---
avg_response_time_by_borough_complaint = df.group_by(["Borough", "complaint_type"]).agg(
    pl.col("request_closing_time_min").mean().alias("avg_minutes")
).sort(["Borough", "avg_minutes"], descending=[False, True])

print("Average Response Time by Borough and Complaint Type (Head):")
print(avg_response_time_by_borough_complaint.head())

Average Response Time by Borough and Complaint Type (Head):
shape: (5, 3)
┌─────────┬─────────────────────────────┬─────────────┐
│ Borough ┆ complaint_type              ┆ avg_minutes │
│ ---     ┆ ---                         ┆ ---         │
│ str     ┆ str                         ┆ f64         │
╞═════════╪═════════════════════════════╪═════════════╡
│ BRONX   ┆ Drug Activity               ┆ 394.0       │
│ BRONX   ┆ Derelict Vehicles           ┆ 324.75      │
│ BRONX   ┆ Dirty Condition             ┆ 219.0       │
│ BRONX   ┆ Illegal Fireworks           ┆ 137.8       │
│ BRONX   ┆ Non-Emergency Police Matter ┆ 130.833333  │
└─────────┴─────────────────────────────┴─────────────┘

# --- Define the boroughs for subplots (matching the order in Pandas code) ---
boroughs = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND', 'Unspecified']

# --- Set common X-axis limit (like plt.xlim(0, 3000)) ---
x_axis_limit = 3000 # Use a fixed limit as in Matplotlib

# --- Helper function to create an Altair subplot for each borough ---
def create_borough_subplot(data: pl.DataFrame, borough_name: str, x_domain_max: float) -> alt.Chart:
    """
    Creates a horizontal bar chart for average response time by complaint type for a given borough.
    """
    borough_data = data.filter(pl.col("Borough") == borough_name)
    borough_data = borough_data.filter(pl.col("avg_minutes").is_not_null() & (pl.col("avg_minutes") > 0))

    # --- FIX: Calculate height as an integer, not multiplying alt.Step() ---
    # Determine the number of bars (complaint types) for this borough
    num_complaint_types = len(borough_data)
    # Calculate the total height in pixels
    chart_height = 15 * max(1, num_complaint_types) # 15 pixels per bar
    # --- END FIX ---

    chart = alt.Chart(borough_data).mark_bar(color='cornflowerblue').encode(
        x=alt.X('avg_minutes:Q',
                title='Average Response Time (in minutes)',
                scale=alt.Scale(domain=[0, x_domain_max]),
                axis=alt.Axis(titleFontSize=12, labelFontSize=10)
               ),
        y=alt.Y('complaint_type:N',
                title='Complaint Type',
                sort='-x',
                axis=alt.Axis(titleFontSize=12, labelFontSize=10)
               ),
        tooltip=[
            alt.Tooltip('complaint_type:N', title='Complaint Type'),
            alt.Tooltip('avg_minutes:Q', title='Avg. Minutes', format='.2f')
        ]
    ).properties(
        title=alt.Title(
            borough_name,
            fontSize=14,
            anchor='middle'
        ),
        width=300,
        height=chart_height # Pass the integer height here
    )
    return chart

# --- Create individual subplots for each borough ---
borough_charts = [create_borough_subplot(avg_response_time_by_borough_complaint, b, x_axis_limit) for b in boroughs]

# --- Combine the subplots into a 2x3 grid ---
combined_chart = alt.vconcat(
    alt.hconcat(borough_charts[0], borough_charts[1], borough_charts[2]),
    alt.hconcat(borough_charts[3], borough_charts[4], borough_charts[5])
).properties(
    title=alt.Title(
        'Average Response Time for Complaint Types by Borough',
        fontSize=18,
        anchor='middle'
    )
).resolve_scale(
    x='shared'
)

combined_chart.show()

Hypothesis Testing

A boxplot can reveal visual differences in the min, median, max and presence of outliers across complaint types.

# Filter out extreme outliers for better visualization if needed
df_filtered_for_plot = df.filter(pl.col("request_closing_time_min") >= 0) # Ensure no negative times

# Drop the 'time_delta' and 'request_closing duration' column for Altair plotting
df_filtered_for_plot = df_filtered_for_plot.drop("time_delta", "request_closing_duration")

# Create the Boxplot
boxplot_chart = alt.Chart(df_filtered_for_plot).mark_boxplot().encode(
    x=alt.X('complaint_type:N',
            title='Complaint Type',
            axis=alt.Axis(labelAngle=-45, titleFontSize=14, labelFontSize=10) # Rotate labels for readability
           ),
    y=alt.Y('request_closing_time_min:Q',
            title='Request Closing Time (Minutes)',
            scale=alt.Scale(domainMin=0), # Ensure Y-axis starts at 0
            axis=alt.Axis(titleFontSize=14, labelFontSize=10)
           ),
    tooltip=[
        alt.Tooltip('complaint_type:N', title='Complaint Type'),
        alt.Tooltip('request_closing_time_min:Q', title='Closing Time', format='.2f')
        # For boxplots, you might add tooltips for quartiles if directly aggregated
    ]
).properties(
    title=alt.Title(
        'Distribution of Request Closing Time by Complaint Type',
        fontSize=18
    ),
    width=alt.Step(25), # Adjust width of each box
    height=400 # Overall chart height
).interactive() # Enable zooming and panning

boxplot_chart.show()

Let’s identify the relationship between Complaint Type and Request_Closing_Time using p-values. Subset the data to only include relevant columns

# Using confirmed lowercase column names 'complaint_type'
df_subset = df.select(["complaint_type", "request_closing_time_min"])

# Drop rows with missing values in these specific columns (Polars is immutable, so reassign)
df_subset = df_subset.drop_nulls()

# Create a dictionary to store the p-values for each Complaint Type
p_values = {}

# Get the data for the entire dataset's closing times as a NumPy array
# This is the 'all_data_closing_times' against which each subset will be compared
all_data_closing_times = df_subset['request_closing_time_min'].to_numpy()

# Loop through each Complaint Type and perform a t-test
# Get unique complaint types as a Python list for iteration
unique_complaint_types = df_subset['complaint_type'].unique().to_list()

print("Performing t-tests for each complaint type vs. the entire dataset...")
for complaint_type in unique_complaint_types:
    # Filter data for the current complaint type (get a Polars Series)
    subset_data_series = df_subset.filter(pl.col('complaint_type') == complaint_type)['request_closing_time_min']

    # Ensure there's enough data for the t-test (at least 2 samples required)
    if len(subset_data_series) < 2:
        p_values[complaint_type] = np.nan # Assign NaN if not enough data
        continue

    # Convert Polars Series to NumPy array for scipy.stats
    subset_data_np = subset_data_series.to_numpy()

    # Perform t-test (comparing the subset to the entire dataset)
    # equal_var=False for Welch's t-test, which does not assume equal variances
    _, p_value = stats.ttest_ind(subset_data_np, all_data_closing_times, equal_var=False)
    p_values[complaint_type] = p_value

# Print the p-values for each Complaint Type
print("\n--- P-values for Complaint Type vs. Entire Dataset's Closing Time ---")
for complaint_type, p_value in p_values.items():
    print(f"{complaint_type}: {p_value:.4f}")

Performing t-tests for each complaint type vs. the entire dataset...

--- P-values for Complaint Type vs. Entire Dataset's Closing Time ---
Electrical: 0.0000
Blocked Driveway: 0.0000
Noise - Commercial: 0.0003
Building/Use: 0.0000
Noise - Park: 0.6933
Derelict Vehicles: 0.0374
Encampment: 0.0003
Noise - Residential: 0.0000
Urinating in Public: 0.1941
Traffic: nan
Animal-Abuse: 0.0412
Violation of Park Rules: nan
Panhandling: 0.0995
General Construction/Plumbing: 0.0000
Vendor Enforcement: 0.0066
Non-Emergency Police Matter: 0.0570
Noise - Street/Sidewalk: 0.0000
Bike/Roller/Skate Chronic: nan
Drinking: 0.0061
Emergency Response Team (ERT): nan
Street Condition: 0.0000
Illegal Fireworks: 0.0009
Noise - Vehicle: 0.0000
Dirty Condition: 0.0409
Drug Activity: 0.3738
Water System: 0.0000
Elevator: 0.0000
Illegal Parking: 0.0000
Traffic Signal Condition: 0.0171
Disorderly Youth: 0.5417
Abandoned Vehicle: 0.0000

we create a contingency table using Polars to count the number of complaints for each complaint type and city. Then, we use chi2_contingency() from the scipy.stats library to perform the chi-square test on the contingency table. Finally, we print the results including the chi-square value, p-value, degrees of freedom, and expected values.

The null hypothesis for the chi-square test is that there is no association between the two variables (in this case, Complaint Type and City), while the alternative hypothesis is that there is some association. If the p-value is less than our chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is evidence of a significant association between the variables.

# 1. Create a contingency table using Polars - group by both columns and count occurrences
contingency_counts = df.group_by(["complaint_type", "city"]).len()

# Then, pivot to get Complaint Types as rows and Cities as columns
# Fill any missing combinations with 0
cont_table_polars = contingency_counts.pivot(
    index="complaint_type",
    columns="city",
    values="len"
).fill_null(0)

# The chi2_contingency function from scipy expects a NumPy array.
# Convert the Polars DataFrame to a NumPy array for the test.
observed_values = cont_table_polars.drop("complaint_type").to_numpy() # Drop the index column before converting

# 2. Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(observed_values)

# Set significance level
alpha = 0.05

# Print results
print(f"Chi-square value: {chi2:.2f}")
print(f"P-value: {pval:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies (truncated for display if large):")

# Print expected frequencies. They can be very large, so truncate for display.
if expected.size > 200: # Adjust this number based on how much output you want
    print(expected[:5, :5]) # Print top-left corner
    print("... (array truncated) ...")
else:
    print(expected)
print("\n")

# Check if p-value is less than alpha
if pval < alpha: # Use pval here, not p_value from previous context
    print("Conclusion: There is a statistically significant association between Complaint Type and City. (Reject the null hypothesis)")
    print("This suggests that the distribution of complaint types is not the same across all cities, or vice versa.")
else:
    print("Conclusion: There is no statistically significant association between Complaint Type and City. (Fail to reject the null hypothesis)")
    print("This suggests that the distribution of complaint types is similar across cities, or vice versa.")

Chi-square value: 7973.75
P-value: 0.0000
Degrees of freedom: 1230
Expected frequencies (truncated for display if large):
[[  0.67464115   0.215311     4.86602871   1.04784689   0.63157895]
 [  3.33572568   1.0645933   24.05980861   5.18102073   3.12280702]
 [  1.12440191   0.35885167   8.11004785   1.74641148   1.05263158]
 [ 26.79824561   8.55263158 193.28947368  41.62280702  25.0877193 ]
 [  6.89633174   2.20095694  49.74162679  10.71132376   6.45614035]]
... (array truncated) ...


Conclusion: There is a statistically significant association between Complaint Type and City. (Reject the null hypothesis)
This suggests that the distribution of complaint types is not the same across all cities, or vice versa.

Kruskal-Wallis H Test - One-way ANOVA

The Kruskal-Wallis H Test is a non-parametric test, meaning that it makes no assumptions about the distribution of the data. However, it does assume that the samples are independent and that the variances of the populations are equal.

\[ H_0: \text{There is no significant difference in the mean of Request\_Closing\_Time across Complaint types} \]

\[ H_1: \text{There is a significant difference in the mean of Request\_Closing\_Time across Complaint types} \]

# Filter out rows with nulls in 'request_closing_time_min' or 'complaint_type'
# Kruskal-Wallis requires complete data for each group.
df_filtered_for_test = df.filter(
    pl.col("request_closing_time_min").is_not_null() &
    pl.col("complaint_type").is_not_null()
)

# Define a minimum sample size for each group to be included in the test.
# Small groups can skew results or cause errors in statistical tests.
min_samples_per_group = 20 # You can adjust this threshold

# Get the list of complaint types that have enough samples
valid_complaint_types = df_filtered_for_test.group_by("complaint_type").agg(
    pl.len().alias("count")
).filter(pl.col("count") >= min_samples_per_group).select("complaint_type").to_series().to_list()

print(f"\n--- There are {len(valid_complaint_types)} valid complaints for testing ---")
valid_complaint_types


--- There are 12 valid complaints for testing ---

['Blocked Driveway',
 'Illegal Parking',
 'Noise - Commercial',
 'Noise - Residential',
 'Noise - Vehicle',
 'Illegal Fireworks',
 'Noise - Street/Sidewalk',
 'Abandoned Vehicle',
 'Traffic Signal Condition',
 'Encampment',
 'Non-Emergency Police Matter',
 'Water System']

Prepare data for the Kruskal-Wallis test: a list of NumPy arrays. Each array contains the 'request_closing_time_min' for a specific complaint type

# Prepare data for the Kruskal-Wallis test: a list of NumPy arrays.
# Each array contains the 'request_closing_time_min' for a specific complaint type.
data_for_kruskal = []
for complaint_type in valid_complaint_types:
    # Filter for the current complaint type and extract the 'request_closing_time_min' column
    subset_series = df_filtered_for_test.filter(
        pl.col("complaint_type") == complaint_type
    ).select("request_closing_time_min").to_series()

    # Convert the Polars Series to a NumPy array
    data_for_kruskal.append(subset_series.to_numpy())

# Check if there are at least two groups to compare
if len(data_for_kruskal) < 2:
    print(f"Cannot perform Kruskal-Wallis test: Need at least two complaint types with >= {min_samples_per_group} samples each.")
else:
    # Perform the Kruskal-Wallis H-test
    # The *data_for_kruskal unpacks the list of arrays as separate arguments to kruskal()
    h_statistic, p_value = stats.kruskal(*data_for_kruskal)

    print(f"\n--- Kruskal-Wallis H-Test Results (comparing {len(valid_complaint_types)} complaint types) ---")
    print(f"H-statistic: {h_statistic:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Interpret the p-value
    alpha = 0.05 # Significance level
    if p_value < alpha:
        print(f"The P-value ({p_value:.4f}) is less than the significance level ({alpha}).")
        print("Conclusion: Reject the null hypothesis. There is a statistically significant difference in the median request closing times among the selected complaint types. 🎉")
    else:
        print(f"The P-value ({p_value:.4f}) is greater than the significance level ({alpha}).")
        print("Conclusion: Fail to reject the null hypothesis. There is no statistically significant difference in the median request closing times among the selected complaint types. 🤔")


--- Kruskal-Wallis H-Test Results (comparing 12 complaint types) ---
H-statistic: 3287.00
P-value: 0.0000
The P-value (0.0000) is less than the significance level (0.05).
Conclusion: Reject the null hypothesis. There is a statistically significant difference in the median request closing times among the selected complaint types. 🎉

Since the p value for the Complaint is less than 0.01, we accept alternate hypothesis testing (i.e. there’s a significant difference in the mean response time of different types of complaints)

Observations and Outcomes

Here are some of the major conclusions that can be drawn from the 311 Customer service NYC dataset:

Many Data visualizations across various complaint types in all boroughs are related proivde rich context and insight.
Brooklyn has the highest number of complaints compared to other boroughs while Queens takes the longest to resolve complaints.
The scatter and hexbin plot of Brooklyn illustrate the relationships between different complaint types and how they are clustered together.
The box plot shows that there is significant variation in response times across different complaint types.
Hypothesis testing (Kruskal-Wallis H Test and one-way ANOVA) suggests that there are significant differences in response times among different complaint types.
The p-value and chi-squared test suggest that there is a significant association between certain complaint types and boroughs.
Overall, the analysis of the 311 customer service dataset can provide valuable insights into the common complaints and response times in different boroughs, which can inform decisions on resource allocation and policy changes.