import numpy as np
import polars as pl
import altair as alt
from scipy.stats import chi2_contingency
from datasets import Dataset, load_dataset
import warnings
warnings.filterwarnings('ignore')
Customer Service Requests Analysis
Project Description
This project analyzes New York City’s 311 service request data to identify patterns and visualize the most common types of citizen complaints. Using the NYC Open Data API, I developed a CLI script to collect up-to-date data from the API endpoint at https://data.cityofnewyork.us/resource/erm2-nwe9.json.
I applied data wrangling techniques to clean and analyze the dataset. The analysis also includes interactive visualizations of complaint trends. To ensure reproducibility, I’ve uploaded a dataset snapshot to Hugging Face for use throughout the project. More information about the original data source is available on the NYC Open Data page.
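The collection script itself isn’t shown here, but its core looks roughly like the sketch below. Assumptions: the requests library is available, and we use Socrata’s standard $limit/$order query parameters (an app token is recommended for heavier use):
import requests
URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"
# Fetch the 5,000 most recent service requests (SoQL paging parameters)
resp = requests.get(URL, params={"$limit": 5000, "$order": "created_date DESC"}, timeout=30)
resp.raise_for_status()
records = resp.json()  # list of dicts, one per service request
print(f"Fetched {len(records)} records")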
Import the libraries for analysis
# check the version of the installed polars package and its optional dependencies
# (pl.show_versions() prints directly and returns None, so don't wrap it in print())
pl.show_versions()
--------Version info---------
Polars: 1.31.0
Index type: UInt32
Platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python: 3.13.5 (main, Jun 14 2025, 08:23:24) [GCC 13.3.0]
LTS CPU: False
----Optional dependencies----
Azure CLI <not installed>
adbc_driver_manager <not installed>
altair 5.5.0
azure.identity <not installed>
boto3 <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec 2025.3.0
gevent <not installed>
google.auth <not installed>
great_tables <not installed>
matplotlib <not installed>
numpy 2.3.1
openpyxl <not installed>
pandas 2.3.1
polars_cloud <not installed>
pyarrow 20.0.0
pydantic <not installed>
pyiceberg <not installed>
sqlalchemy <not installed>
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>
Understanding the dataset
nyc_data = load_dataset("ctgadget/nyc_311_service_requests", split="train")
# let's get the dataset's features
nyc_data.info.features
{'unique_key': Value('string'),
'created_date': Value('string'),
'agency': Value('string'),
'agency_name': Value('string'),
'complaint_type': Value('string'),
'descriptor': Value('string'),
'location_type': Value('string'),
'incident_zip': Value('string'),
'incident_address': Value('string'),
'street_name': Value('string'),
'cross_street_1': Value('string'),
'cross_street_2': Value('string'),
'intersection_street_1': Value('string'),
'intersection_street_2': Value('string'),
'address_type': Value('string'),
'city': Value('string'),
'landmark': Value('string'),
'status': Value('string'),
'resolution_action_updated_date': Value('string'),
'community_board': Value('string'),
'bbl': Value('string'),
'borough': Value('string'),
'x_coordinate_state_plane': Value('string'),
'y_coordinate_state_plane': Value('string'),
'open_data_channel_type': Value('string'),
'park_facility_name': Value('string'),
'park_borough': Value('string'),
'latitude': Value('string'),
'longitude': Value('string'),
'location': {'human_address': Value('string'),
'latitude': Value('string'),
'longitude': Value('string')},
'resolution_description': Value('string'),
'vehicle_type': Value('string'),
'closed_date': Value('string'),
'data_retrieved_at': Value('string')}
df = Dataset.to_polars(nyc_data)
df.head()
| unique_key | created_date | agency | agency_name | complaint_type | descriptor | location_type | incident_zip | incident_address | street_name | cross_street_1 | cross_street_2 | intersection_street_1 | intersection_street_2 | address_type | city | landmark | status | resolution_action_updated_date | community_board | bbl | borough | x_coordinate_state_plane | y_coordinate_state_plane | open_data_channel_type | park_facility_name | park_borough | latitude | longitude | location | resolution_description | vehicle_type | closed_date | data_retrieved_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | struct[3] | str | str | str | str |
| "65363980" | "2025-06-25T02:03:21.000" | "NYPD" | "New York City Police Departmen… | "Illegal Parking" | "Commercial Overnight Parking" | "Street/Sidewalk" | "11411" | "115-22 FRANCIS LEWIS BOULEVARD" | "FRANCIS LEWIS BOULEVARD" | "115 AVENUE" | "115 ROAD" | "115 AVENUE" | "115 ROAD" | "ADDRESS" | "CAMBRIA HEIGHTS" | "FRANCIS LEWIS BOULEVARD" | "In Progress" | "2025-06-25T02:11:30.000" | "12 QUEENS" | "4110460160" | "QUEENS" | "1054262" | "194055" | "MOBILE" | "Unspecified" | "QUEENS" | "40.699035965405535" | "-73.7475083832415" | {"{"address": "", "city": "", "state": "", "zip": ""}","40.699035965405535","-73.7475083832415"} | null | null | null | "06_26_2025_02_38_45_PM" |
| "65357474" | "2025-06-25T02:01:47.000" | "NYPD" | "New York City Police Departmen… | "Blocked Driveway" | "Partial Access" | "Street/Sidewalk" | "11428" | "90-25 219 STREET" | "219 STREET" | "90 AVENUE" | "91 AVENUE" | "90 AVENUE" | "91 AVENUE" | "ADDRESS" | "QUEENS VILLAGE" | "219 STREET" | "In Progress" | null | "13 QUEENS" | "4107180042" | "QUEENS" | "1055805" | "203889" | "ONLINE" | "Unspecified" | "QUEENS" | "40.72601549402461" | "-73.74183921146468" | {"{"address": "", "city": "", "state": "", "zip": ""}","40.72601549402461","-73.74183921146468"} | null | null | null | "06_26_2025_02_38_45_PM" |
| "65367267" | "2025-06-25T02:01:33.000" | "NYPD" | "New York City Police Departmen… | "Noise - Residential" | "Loud Talking" | "Residential Building/House" | "10031" | "620 WEST 141 STREET" | "WEST 141 STREET" | "BROADWAY" | "RIVERSIDE DRIVE" | "BROADWAY" | "RIVERSIDE DRIVE" | "ADDRESS" | "NEW YORK" | "WEST 141 STREET" | "In Progress" | null | "09 MANHATTAN" | "1020880043" | "MANHATTAN" | "997264" | "239545" | "MOBILE" | "Unspecified" | "MANHATTAN" | "40.82416166479311" | "-73.95297805704422" | {"{"address": "", "city": "", "state": "", "zip": ""}","40.82416166479311","-73.95297805704422"} | null | null | null | "06_26_2025_02_38_45_PM" |
| "65357731" | "2025-06-25T02:01:29.000" | "DHS" | "Department of Homeless Service… | "Homeless Person Assistance" | "Non-Chronic" | "Store/Commercial" | "10001" | "6 AVENUE" | "6 AVENUE" | "6 AVENUE" | "WEST 27 STREET" | "6 AVENUE" | "WEST 27 STREET" | "INTERSECTION" | null | null | "In Progress" | null | "05 MANHATTAN" | null | "MANHATTAN" | "986753" | "210854" | "PHONE" | "Unspecified" | "MANHATTAN" | "40.745421755862786" | "-73.99096689618635" | {"{"address": "", "city": "", "state": "", "zip": ""}","40.745421755862786","-73.99096689618635"} | null | null | null | "06_26_2025_02_38_45_PM" |
| "65364041" | "2025-06-25T02:01:21.000" | "NYPD" | "New York City Police Departmen… | "Noise - Street/Sidewalk" | "Loud Music/Party" | "Street/Sidewalk" | "10460" | "1460 HOE AVENUE" | "HOE AVENUE" | "JENNINGS STREET" | "EAST 172 STREET" | "JENNINGS STREET" | "EAST 172 STREET" | "ADDRESS" | "BRONX" | "HOE AVENUE" | "In Progress" | null | "03 BRONX" | "2029880002" | "BRONX" | "1014792" | "242538" | "PHONE" | "Unspecified" | "BRONX" | "40.832333371980106" | "-73.88963263564457" | {"{"address": "", "city": "", "state": "", "zip": ""}","40.832333371980106","-73.88963263564457"} | null | null | null | "06_26_2025_02_38_45_PM" |
# Let's get the info about the dataset, shape of the dimensions
print(df.describe())
print( f"\nThe shape of the dataset is {df.shape}\n")shape: (9, 35)
┌────────────┬────────────┬───────────┬────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ statistic ┆ unique_key ┆ created_d ┆ agency ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ --- ┆ --- ┆ ate ┆ --- ┆ ┆ n_descrip ┆ ype ┆ te ┆ ieved_at │
│ str ┆ str ┆ --- ┆ str ┆ ┆ tion ┆ --- ┆ --- ┆ --- │
│ ┆ ┆ str ┆ ┆ ┆ --- ┆ str ┆ str ┆ str │
│ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ ┆ │
╞════════════╪════════════╪═══════════╪════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ count ┆ 5000 ┆ 5000 ┆ 5000 ┆ … ┆ 3797 ┆ 171 ┆ 3762 ┆ 5000 │
│ null_count ┆ 0 ┆ 0 ┆ 0 ┆ … ┆ 1203 ┆ 4829 ┆ 1238 ┆ 0 │
│ mean ┆ null ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ std ┆ null ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ min ┆ 65354926 ┆ 2025-06-2 ┆ DCWP ┆ … ┆ DSNY ┆ Car ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ ┆ 4T18:08:4 ┆ ┆ ┆ already ┆ ┆ 4T18:09:3 ┆ 5_02_38_4 │
│ ┆ ┆ 0.000 ┆ ┆ ┆ has a ┆ ┆ 7.000 ┆ 5_PM │
│ ┆ ┆ ┆ ┆ ┆ request ┆ ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ to … ┆ ┆ ┆ │
│ 25% ┆ null ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ 50% ┆ null ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ 75% ┆ null ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ max ┆ 65367590 ┆ 2025-06-2 ┆ TLC ┆ … ┆ Your ┆ Van ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ ┆ 5T02:03:2 ┆ ┆ ┆ request ┆ ┆ 5T03:05:0 ┆ 5_02_38_4 │
│ ┆ ┆ 1.000 ┆ ┆ ┆ can not ┆ ┆ 0.000 ┆ 5_PM │
│ ┆ ┆ ┆ ┆ ┆ be ┆ ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ proces… ┆ ┆ ┆ │
└────────────┴────────────┴───────────┴────────┴───┴───────────┴───────────┴───────────┴───────────┘
The shape of the dataset is (5000, 34)
df.collect_schema()
Schema([('unique_key', String),
('created_date', String),
('agency', String),
('agency_name', String),
('complaint_type', String),
('descriptor', String),
('location_type', String),
('incident_zip', String),
('incident_address', String),
('street_name', String),
('cross_street_1', String),
('cross_street_2', String),
('intersection_street_1', String),
('intersection_street_2', String),
('address_type', String),
('city', String),
('landmark', String),
('status', String),
('resolution_action_updated_date', String),
('community_board', String),
('bbl', String),
('borough', String),
('x_coordinate_state_plane', String),
('y_coordinate_state_plane', String),
('open_data_channel_type', String),
('park_facility_name', String),
('park_borough', String),
('latitude', String),
('longitude', String),
('location',
Struct({'human_address': String, 'latitude': String, 'longitude': String})),
('resolution_description', String),
('vehicle_type', String),
('closed_date', String),
('data_retrieved_at', String)])
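Note that every field, including latitude, longitude, and the timestamps, arrives as a String; nothing is typed yet. A minimal sketch of casting the coordinates up front (df_typed is a throwaway name introduced here; the rest of the notebook keeps working on the string columns and casts where needed):
# Sketch only: cast coordinate columns to floats
# (strict=False turns unparsable strings into nulls instead of raising)
df_typed = df.with_columns(
    pl.col("latitude").cast(pl.Float64, strict=False),
    pl.col("longitude").cast(pl.Float64, strict=False),
)
print(df_typed.select(["latitude", "longitude"]).schema)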
# Check for null values in each column
null_counts = df.null_count()
print("Null counts per column:")
print(null_counts)
# To check if there are ANY null values in the entire DataFrame:
# You can sum all the null counts and check if the total is greater than 0.
total_nulls = null_counts.sum_horizontal().item()
if total_nulls > 0:
print("\nThere are null values in the DataFrame.")
else:
print("\nNo null values found in the DataFrame.")Null counts per column:
shape: (1, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ --- ┆ te ┆ --- ┆ me ┆ ┆ n_descrip ┆ ype ┆ te ┆ ieved_at │
│ u32 ┆ --- ┆ u32 ┆ --- ┆ ┆ tion ┆ --- ┆ --- ┆ --- │
│ ┆ u32 ┆ ┆ u32 ┆ ┆ --- ┆ u32 ┆ u32 ┆ u32 │
│ ┆ ┆ ┆ ┆ ┆ u32 ┆ ┆ ┆ │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ … ┆ 1203 ┆ 4829 ┆ 1238 ┆ 0 │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
There are null values in the DataFrame.
Observations:
There are null values present in the df DataFrame for the following variables:
closed_date, descriptor, location_type, resolution_description, vehicle_type, and others.
Exploratory Data Analysis (EDA)
Plotting directly from a Polars Series would be more “Polars-native”, but building a horizontal bar chart that way is somewhat involved, so we’ll build our charts in Altair instead.
# Let's prepare data as a dictionary of column_name: null_count
null_dict = null_counts.row(0, named=True) # Get the first (and only) row as a dictionary
columns = list(null_dict.keys())
counts = list(null_dict.values())
#--- Use the dict/lists to build a Polars DataFrame suitable for Altair ---
# Create a new Polars DataFrame from the lists
# This DataFrame will have the 'long' format expected by Altair
null_data = pl.DataFrame({
"Column Name": columns,
"Null Count": counts
})
# Optional: Filter out columns with 0 nulls for cleaner visualization
null_data = null_data.filter(pl.col("Null Count") > 0)
print("Null Count DataFrame prepared:\n")
print(null_data.head())
# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(null_data).mark_bar().encode(
# X-axis: Null Count (Quantitative type)
x=alt.X('Null Count:Q', title='Count'),
# Y-axis: Column Name (Nominal type), sorted by null count descending
y=alt.Y('Column Name:N', title='Columns', sort='-x'), # Sort by null count for better readability
# Tooltip for interactivity
tooltip=[
alt.Tooltip('Column Name:N', title='Column'),
alt.Tooltip('Null Count:Q', title='Nulls')
]
).properties(
title='Null Value Counts',
width=800, # <-- Adjust the width here (e.g., 800 pixels)
height=600 # <-- Adjust the height here (e.g., 600 pixels)
).interactive() # Allows zooming and panning
# Display the chart
chart.show()
Null Count DataFrame prepared:
shape: (5, 2)
┌──────────────────┬────────────┐
│ Column Name ┆ Null Count │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════════════╪════════════╡
│ location_type ┆ 2527 │
│ incident_zip ┆ 15 │
│ incident_address ┆ 490 │
│ street_name ┆ 490 │
│ cross_street_1 ┆ 526 │
└──────────────────┴────────────┘
# Remove null values from "Closed Date" column
# Polars operations are immutable, so you need to reassign the DataFrame
df = df.drop_nulls(subset=["closed_date"])
# Show sample of dataset with null values removed
print("\nShape after dropping nulls:", df.shape)
print("\nSample of dataset with null values removed:")
print(df.head())
# Verify nulls in "Closed Date" after dropping
print("\nNulls in 'Closed Date' after dropping:", df.select(pl.col("closed_date").null_count()))
Shape after dropping nulls: (3762, 34)
Sample of dataset with null values removed:
shape: (5, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ --- ┆ te ┆ --- ┆ me ┆ ┆ n_descrip ┆ ype ┆ te ┆ ieved_at │
│ str ┆ --- ┆ str ┆ --- ┆ ┆ tion ┆ --- ┆ --- ┆ --- │
│ ┆ str ┆ ┆ str ┆ ┆ --- ┆ str ┆ str ┆ str │
│ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ ┆ │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ T01:50:40. ┆ ┆ City ┆ ┆ Police ┆ ┆ 5T01:58:0 ┆ 5_02_38_4 │
│ ┆ 000 ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 0.000 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65366664 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ T01:45:52. ┆ ┆ City ┆ ┆ Police ┆ ┆ 5T02:02:3 ┆ 5_02_38_4 │
│ ┆ 000 ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 0.000 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t issued ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ a… ┆ ┆ ┆ │
│ 65358593 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ T01:42:25. ┆ ┆ City ┆ ┆ Police ┆ ┆ 5T01:52:1 ┆ 5_02_38_4 │
│ ┆ 000 ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 4.000 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65366547 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ T01:40:10. ┆ ┆ City ┆ ┆ Police ┆ ┆ 5T02:03:0 ┆ 5_02_38_4 │
│ ┆ 000 ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 8.000 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65363618 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ T01:37:15. ┆ ┆ City ┆ ┆ Police ┆ ┆ 5T01:59:5 ┆ 5_02_38_4 │
│ ┆ 000 ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 8.000 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
Nulls in 'Closed Date' after dropping: shape: (1, 1)
┌─────────────┐
│ closed_date │
│ --- │
│ u32 │
╞═════════════╡
│ 0 │
└─────────────┘
# Check if null values are still present in "Closed Date" column
# Use pl.col() for column selection and .is_null().any() for the check
if df.select(pl.col("closed_date").is_null().any()).item():
print('Null values still present in "Closed Date" column.')
else:
    print('No null values present in "Closed Date" column.')
No null values present in "Closed Date" column.
# Convert the "Closed Date" and "Created Date" columns to datetime format
# Polars operations are immutable, so use .with_columns() and reassign
df = df.with_columns([
# converting "created date"
pl.col("created_date").str.to_datetime().alias("created_date"),
# converting "closed date"
pl.col("closed_date").str.to_datetime().alias("closed_date"),
])
# Display the DataFrame; the column header row shows the dtypes after conversion
print("DataFrame after datetime conversion:\n")
print(df)
DataFrame after datetime conversion:
shape: (3_762, 34)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ resolutio ┆ vehicle_t ┆ closed_da ┆ data_retr │
│ --- ┆ te ┆ --- ┆ me ┆ ┆ n_descrip ┆ ype ┆ te ┆ ieved_at │
│ str ┆ --- ┆ str ┆ --- ┆ ┆ tion ┆ --- ┆ --- ┆ --- │
│ ┆ datetime[μ ┆ ┆ str ┆ ┆ --- ┆ str ┆ datetime[ ┆ str │
│ ┆ s] ┆ ┆ ┆ ┆ str ┆ ┆ μs] ┆ │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 01:50:40 ┆ ┆ City ┆ ┆ Police ┆ ┆ 5 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 01:58:00 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65366664 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 01:45:52 ┆ ┆ City ┆ ┆ Police ┆ ┆ 5 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 02:02:30 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t issued ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ a… ┆ ┆ ┆ │
│ 65358593 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 01:42:25 ┆ ┆ City ┆ ┆ Police ┆ ┆ 5 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 01:52:14 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65366547 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 01:40:10 ┆ ┆ City ┆ ┆ Police ┆ ┆ 5 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 02:03:08 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65363618 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 01:37:15 ┆ ┆ City ┆ ┆ Police ┆ ┆ 5 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 01:59:58 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 65362166 ┆ 2025-06-24 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 18:10:04 ┆ ┆ City ┆ ┆ Police ┆ ┆ 4 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 21:57:40 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65359062 ┆ 2025-06-24 ┆ DOB ┆ Departmen ┆ … ┆ Your ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 18:09:37 ┆ ┆ t of ┆ ┆ Service ┆ ┆ 4 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Buildings ┆ ┆ Request ┆ ┆ 18:09:37 ┆ 5_PM │
│ ┆ ┆ ┆ ┆ ┆ has been ┆ ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ … ┆ ┆ ┆ │
│ 65357050 ┆ 2025-06-24 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 18:09:26 ┆ ┆ City ┆ ┆ Police ┆ ┆ 4 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 19:19:26 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65361679 ┆ 2025-06-24 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 18:09:21 ┆ ┆ City ┆ ┆ Police ┆ ┆ 4 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 19:36:37 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
│ 65358615 ┆ 2025-06-24 ┆ NYPD ┆ New York ┆ … ┆ The ┆ null ┆ 2025-06-2 ┆ 06_26_202 │
│ ┆ 18:08:43 ┆ ┆ City ┆ ┆ Police ┆ ┆ 4 ┆ 5_02_38_4 │
│ ┆ ┆ ┆ Police ┆ ┆ Departmen ┆ ┆ 19:29:13 ┆ 5_PM │
│ ┆ ┆ ┆ Departmen ┆ ┆ t ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ responde… ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
# --- Convert 'time_delta' from Duration to a numerical type ---
# First, ensure 'time_delta' column is created as done previously
df = df.with_columns(
(pl.col("closed_date") - pl.col("created_date")).alias("time_delta")
)
# Convert time_delta from Duration to total seconds (or minutes, hours, days)
# Polars Duration objects have methods like .total_seconds(), .total_minutes() etc.
df = df.with_columns(
pl.col("time_delta").dt.total_seconds().alias("time_delta_seconds")
)
# Now, 'time_delta_seconds' is a numerical column (float)
# You can use this new column for plotting in Altair.
print("DataFrame schema after converting time_delta to seconds:")
print(df.schema)
print("\nHead of DataFrame showing new 'time_delta_seconds' column:")
print(df.head())
DataFrame schema after converting time_delta to seconds:
Schema({'unique_key': String, 'created_date': Datetime(time_unit='us', time_zone=None), 'agency': String, 'agency_name': String, 'complaint_type': String, 'descriptor': String, 'location_type': String, 'incident_zip': String, 'incident_address': String, 'street_name': String, 'cross_street_1': String, 'cross_street_2': String, 'intersection_street_1': String, 'intersection_street_2': String, 'address_type': String, 'city': String, 'landmark': String, 'status': String, 'resolution_action_updated_date': String, 'community_board': String, 'bbl': String, 'borough': String, 'x_coordinate_state_plane': String, 'y_coordinate_state_plane': String, 'open_data_channel_type': String, 'park_facility_name': String, 'park_borough': String, 'latitude': String, 'longitude': String, 'location': Struct({'human_address': String, 'latitude': String, 'longitude': String}), 'resolution_description': String, 'vehicle_type': String, 'closed_date': Datetime(time_unit='us', time_zone=None), 'data_retrieved_at': String, 'time_delta': Duration(time_unit='us'), 'time_delta_seconds': Int64})
Head of DataFrame showing new 'time_delta_seconds' column:
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ --- ┆ te ┆ --- ┆ me ┆ ┆ te ┆ ieved_at ┆ a ┆ a_seconds │
│ str ┆ --- ┆ str ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ ┆ datetime[μ ┆ ┆ str ┆ ┆ datetime[ ┆ str ┆ duration[ ┆ i64 │
│ ┆ s] ┆ ┆ ┆ ┆ μs] ┆ ┆ μs] ┆ │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 7m 20s ┆ 440 │
│ ┆ 01:50:40 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:58:00 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65366664 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 16m 38s ┆ 998 │
│ ┆ 01:45:52 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 02:02:30 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65358593 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 9m 49s ┆ 589 │
│ ┆ 01:42:25 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:52:14 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65366547 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 22m 58s ┆ 1378 │
│ ┆ 01:40:10 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 02:03:08 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65363618 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 22m 43s ┆ 1363 │
│ ┆ 01:37:15 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:59:58 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
# View descriptive statistics of the "time_delta" column
# Changed the column selection to "time_delta"
print("\nDescriptive statistics for 'time_delta' column:")
print(df.select(pl.col("time_delta")).describe())
Descriptive statistics for 'time_delta' column:
shape: (9, 2)
┌────────────┬────────────────┐
│ statistic ┆ time_delta │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪════════════════╡
│ count ┆ 3762 │
│ null_count ┆ 0 │
│ mean ┆ 0:31:40.608187 │
│ std ┆ null │
│ min ┆ 0:00:00 │
│ 25% ┆ 0:00:00 │
│ 50% ┆ 0:00:00 │
│ 75% ┆ 0:42:35 │
│ max ┆ 7:37:14 │
└────────────┴────────────────┘
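Note that describe() reports std as null for Duration columns; the numeric time_delta_seconds column created above gives the complete statistics. A small sketch:
# Summary statistics on the numeric duration column (in seconds)
print(
    df.select(
        pl.col("time_delta_seconds").mean().alias("mean_s"),
        pl.col("time_delta_seconds").std().alias("std_s"),
        pl.col("time_delta_seconds").median().alias("median_s"),
        pl.col("time_delta_seconds").max().alias("max_s"),
    )
)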
Finding the major type of complaints
# Check the number of null values in the "complaint_type" and "city" columns using f-strings
print(f'Number of null values in "complaint_type" column: {df.select(pl.col("complaint_type").null_count()).item()}')
print(f'Number of null values in "city" column: {df.select(pl.col("city").null_count()).item()}')Number of null values in "complaint_type" column: 0
Number of null values in "city" column: 148
# Fill in missing values (nulls) in the 'city' column with 'Unknown City'
df = df.with_columns(
pl.col("city").fill_null(pl.lit("UNKNOWN CITY")).alias("city")
)
# Get Unique Count using Polars' n_unique()
# We need to extract the single value using .item() for the f-string
city_count = df.select(pl.col("city").n_unique()).item()
# Get all unique values in the 'city' column as a sorted list
unique_areas = df["city"].unique().sort().to_list()
print(f"Number of null values in 'city' after fill_null: {df.select(pl.col("city").null_count()).item()}")
print(f"\nThere are {city_count} values in the dataset that contain the following info:\n")
print(unique_areas)
Number of null values in 'city' after fill_null: 0
There are 42 values in the dataset that contain the following info:
['ARVERNE', 'ASTORIA', 'BAYSIDE', 'BRONX', 'BROOKLYN', 'CAMBRIA HEIGHTS', 'COLLEGE POINT', 'CORONA', 'EAST ELMHURST', 'ELMHURST', 'FAR ROCKAWAY', 'FLUSHING', 'FOREST HILLS', 'FRESH MEADOWS', 'HOLLIS', 'HOWARD BEACH', 'JACKSON HEIGHTS', 'JAMAICA', 'LITTLE NECK', 'LONG ISLAND CITY', 'MANHATTAN', 'MASPETH', 'MIDDLE VILLAGE', 'NEW YORK', 'OAKLAND GARDENS', 'OZONE PARK', 'QUEENS', 'QUEENS VILLAGE', 'REGO PARK', 'RICHMOND HILL', 'RIDGEWOOD', 'ROSEDALE', 'SAINT ALBANS', 'SOUTH OZONE PARK', 'SOUTH RICHMOND HILL', 'SPRINGFIELD GARDENS', 'STATEN ISLAND', 'SUNNYSIDE', 'UNKNOWN CITY', 'WHITESTONE', 'WOODHAVEN', 'WOODSIDE']
# Calculate total complaints per city and complaint type, then pivot
# Calculate total complaints per city and complaint type, then pivot
# (in Polars 1.x the pivot column is passed as `on`, not `columns`)
total_city_complaints = df.group_by(["city", "complaint_type"]).len().pivot(
    index="city",          # 'city' will be the rows
    on="complaint_type",   # 'complaint_type' will be the new columns
    values="len"           # The aggregated count (from .len()) will be the values
)
total_city_complaints.head()
| city | Blocked Driveway | Noise - Residential | Abandoned Vehicle | Elevator | Noise - Street/Sidewalk | Illegal Fireworks | Illegal Parking | Noise - Commercial | General Construction/Plumbing | Water System | Noise - Vehicle | Non-Emergency Police Matter | Panhandling | Dirty Condition | Traffic Signal Condition | Disorderly Youth | Noise - Park | Drinking | Derelict Vehicles | Electrical | Encampment | Drug Activity | Violation of Park Rules | Vendor Enforcement | Street Condition | Bike/Roller/Skate Chronic | Animal-Abuse | Traffic | Urinating in Public | Building/Use | Emergency Response Team (ERT) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| "FAR ROCKAWAY" | 1 | null | null | null | 5 | null | 15 | null | null | 2 | null | null | null | null | null | null | 1 | null | null | 1 | null | null | null | null | null | null | null | null | null | null | null |
| "HOLLIS" | 1 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "UNKNOWN CITY" | 1 | 7 | 1 | null | 87 | 4 | 15 | 2 | null | null | 13 | 7 | 5 | null | 2 | null | 1 | 1 | null | null | null | null | null | null | null | 1 | null | 1 | null | null | null |
| "BRONX" | 24 | 70 | null | 2 | 220 | 5 | 42 | 8 | 1 | 1798 | 20 | 6 | null | 2 | 6 | null | null | 3 | 4 | 1 | null | 1 | null | 1 | null | null | 2 | null | null | null | null |
| "WOODSIDE" | 2 | 1 | 3 | null | null | null | 3 | 2 | null | null | null | null | 1 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
# Reshape the DataFrame to "long" format for Altair
# narrow down into 'city' for 'complaint_type' and 'number_of_complaints'
complaints = total_city_complaints.unpivot(
index=['city'],
variable_name='complaint_type', # New column for the complaint type names
value_name='number_of_complaints' # New column for the complaint counts
)
# Filter out rows where number_of_complaints is 0, as they don't contribute to the visualization
# This makes the tooltip cleaner and chart potentially faster if there are many zeros
complaints = complaints.filter(pl.col("number_of_complaints") > 0)
# Create the Altair Stacked Bar Chart
chart = alt.Chart(complaints).mark_bar().encode(
# X-axis: Number of Complaints (Quantitative type)
x=alt.X('number_of_complaints:Q', title='Number of Complaints'),
# Y-axis: City (Nominal type), sorted by the sum of complaints for better readability
y=alt.Y('city:N', title='City', sort='-x'), # Sort cities by total complaints in descending order
# Color: Complaint Type (Nominal type) for stacking
color=alt.Color('complaint_type:N', title='Complaint Type'),
# Tooltip for interactivity
tooltip=[
alt.Tooltip('city:N', title='City'),
alt.Tooltip('complaint_type:N', title='Complaint Type'),
alt.Tooltip('number_of_complaints:Q', title='Count')
]
).properties(
title='Frequency of Complaints by City',
width=800, # <-- Adjust the width here (e.g., 800 pixels)
height=600 # <-- Adjust the height here (e.g., 600 pixels)
).interactive() # Allows zooming and panning
# Display the chart
chart.show()
Data Visualization of the major types of complaints
In order to find the major type of complaints in the dataset, the following code groups the dataset by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting bar graph shows the counts for each complaint type.
# Group by 'complaint_type' and count the number of occurrences
# Then sort the counts in descending order
complaint_counts = df.group_by("complaint_type").agg(
pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order
complaint_types = df.select(pl.col("complaint_type").n_unique()).item()
# View the result
print(f"\nThere are {complaint_types} complaint types in the dataset that contain the following info:\n")
print(complaint_counts)
There are 31 complaint types in the dataset that contain the following info:
shape: (31, 2)
┌───────────────────────────────┬────────┐
│ complaint_type ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════════════════════════╪════════╡
│ Water System ┆ 2145 │
│ Noise - Street/Sidewalk ┆ 552 │
│ Illegal Parking ┆ 337 │
│ Noise - Residential ┆ 267 │
│ Blocked Driveway ┆ 90 │
│ … ┆ … │
│ Disorderly Youth ┆ 2 │
│ Traffic ┆ 1 │
│ Emergency Response Team (ERT) ┆ 1 │
│ Bike/Roller/Skate Chronic ┆ 1 │
│ Violation of Park Rules ┆ 1 │
└───────────────────────────────┴────────┘
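As an aside, Polars has a one-liner for this group-count-sort pattern; a sketch that should be equivalent (value_counts names its count column "count", so we rename it to match):
# Equivalent shortcut: Series.value_counts does the group-count-sort in one call
complaint_counts_alt = (
    df["complaint_type"]
    .value_counts(sort=True)        # DataFrame with columns: complaint_type, count
    .rename({"count": "counts"})
)
print(complaint_counts_alt.head())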
# Plot the bar graph
chart = alt.Chart(complaint_counts).mark_bar().encode(
# X-axis: 'counts' (Quantitative type)
x=alt.X('counts:Q', title='Number of Complaints',
scale=alt.Scale(domainMin=0, zero=True)
),
# Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
color=alt.value('orange'), # You can use color names or hex codes like '#FF5733'
# Tooltip for interactivity
tooltip=[
alt.Tooltip('complaint_type:N', title='Complaint Type'),
alt.Tooltip('counts:Q', title='Count')
]
).properties(
title='Frequency of Complaints by Type',
width=800, # <-- Adjust the width here (e.g., 800 pixels)
height=600 # <-- Adjust the height here (e.g., 600 pixels)
).interactive()
# Display the chart
chart.show()
From our bar chart analysis, Water System is by far the most frequent complaint type, followed by major types such as:
Illegal Parking, Noise - Residential, Noise - Street/Sidewalk, Blocked Driveway, Abandoned Vehicle, Noise - Vehicle, Noise - Commercial, General Construction/Plumbing, Vendor Enforcement, and Encampment.
Let’s explore the complaint data for records where the city is recorded as ‘NEW YORK’ …
This code filters the dataset to only include complaints from New York City, groups the remaining data by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting output shows the frequency of each complaint type for New York City.
# Filter the DataFrame for only the 'NEW YORK' city data
# Use .filter() with a boolean expression
nyc_data = df.filter(pl.col('city') == 'NEW YORK')
print("Head of the nyc_data DataFrame (filtered for 'NEW YORK'):")
print(nyc_data.head())
print(f"\nShape of the original DataFrame: {df.shape}")
print(f"Shape of the 'nyc_data' DataFrame (after filtering): {nyc_data.shape}")
# You can also verify that all cities in nyc_data are 'NEW YORK'
print("\nUnique cities in 'nyc_data':")
print(nyc_data['city'].unique().to_list())
Head of the nyc_data DataFrame (filtered for 'NEW YORK'):
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ --- ┆ te ┆ --- ┆ me ┆ ┆ te ┆ ieved_at ┆ a ┆ a_seconds │
│ str ┆ --- ┆ str ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ ┆ datetime[μ ┆ ┆ str ┆ ┆ datetime[ ┆ str ┆ duration[ ┆ i64 │
│ ┆ s] ┆ ┆ ┆ ┆ μs] ┆ ┆ μs] ┆ │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65357312 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 10m 16s ┆ 616 │
│ ┆ 01:19:05 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:29:21 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65365325 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 29m 48s ┆ 1788 │
│ ┆ 01:18:35 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:48:23 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65366862 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 30m 6s ┆ 1806 │
│ ┆ 01:12:03 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:42:09 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65355807 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 9m 18s ┆ 558 │
│ ┆ 01:06:14 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:15:32 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65361896 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 41m 42s ┆ 2502 │
│ ┆ 01:04:42 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:46:24 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
Shape of the original DataFrame: (3762, 36)
Shape of the 'nyc_data' DataFrame (after filtering): (339, 36)
Unique cities in 'nyc_data':
['NEW YORK']
nyc_complaint_types = nyc_data.group_by("complaint_type").agg(
pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order
# get the number of unique complaint types in the NYC subset (not the full df)
nyc_complaint_counts = nyc_data.select(pl.col("complaint_type").n_unique()).item()
# View the result
print(f"\nThere are {nyc_complaint_counts} complaint types in the 'NEW YORK' subset:\n")
print(nyc_complaint_types)
There are 21 complaint types in the 'NEW YORK' subset:
shape: (21, 2)
┌───────────────────────────────┬────────┐
│ complaint_type ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════════════════════════╪════════╡
│ Water System ┆ 95 │
│ Noise - Street/Sidewalk ┆ 88 │
│ Noise - Residential ┆ 45 │
│ Illegal Parking ┆ 38 │
│ Encampment ┆ 17 │
│ … ┆ … │
│ General Construction/Plumbing ┆ 1 │
│ Animal-Abuse ┆ 1 │
│ Blocked Driveway ┆ 1 │
│ Dirty Condition ┆ 1 │
│ Emergency Response Team (ERT) ┆ 1 │
└───────────────────────────────┴────────┘
# Plot the bar graph
chart = alt.Chart(nyc_complaint_types).mark_bar().encode(
# X-axis: 'counts' (Quantitative type)
x=alt.X('counts:Q', title='Number of Complaints',
scale=alt.Scale(domainMin=0, zero=True)
),
# Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
color=alt.value('green'), # You can use color names or hex codes like '#FF5733'
# Tooltip for interactivity
tooltip=[
alt.Tooltip('complaint_type:N', title='Complaint Type'),
alt.Tooltip('counts:Q', title='Count')
]
).properties(
title='Types of Complaints in New York City',
width=800, # <-- Adjust the width here (e.g., 800 pixels)
height=600 # <-- Adjust the height here (e.g., 600 pixels)
).interactive()
# Display the chart
chart.show()
Let’s explore the complaint data specifically for Brooklyn …
This code filters the dataset to only include complaints from Brooklyn, groups the remaining data by complaint type, counts the number of occurrences for each type, and then sorts the counts in descending order. The resulting output shows the frequency of each complaint type for Brooklyn.
# Filter the DataFrame for only the 'BROOKLYN' city data
# Use .filter() with a boolean expression
bkn_data = df.filter(pl.col('city') == 'BROOKLYN')
print("Head of the bkn_data DataFrame (filtered for 'BROOKLYN'):")
print(bkn_data.head())
print(f"\nShape of the original DataFrame: {df.shape}")
print(f"Shape of the 'nyc_data' DataFrame (after filtering): {bkn_data.shape}")
# You can also verify that all cities in nyc_data are 'BROOKLYN'
print("\nUnique cities in 'bkn_data':")
print(bkn_data['city'].unique().to_list())Head of the bkn_data DataFrame (filtered for 'BROOKLYN'):
shape: (5, 36)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ closed_da ┆ data_retr ┆ time_delt ┆ time_delt │
│ --- ┆ te ┆ --- ┆ me ┆ ┆ te ┆ ieved_at ┆ a ┆ a_seconds │
│ str ┆ --- ┆ str ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ ┆ datetime[μ ┆ ┆ str ┆ ┆ datetime[ ┆ str ┆ duration[ ┆ i64 │
│ ┆ s] ┆ ┆ ┆ ┆ μs] ┆ ┆ μs] ┆ │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 7m 20s ┆ 440 │
│ ┆ 01:50:40 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:58:00 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65366857 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 19m 28s ┆ 1168 │
│ ┆ 01:25:36 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:45:04 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65365214 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 32m 12s ┆ 1932 │
│ ┆ 01:15:21 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:47:33 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65355770 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 31m 42s ┆ 1902 │
│ ┆ 01:08:58 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 01:40:40 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65355768 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 2025-06-2 ┆ 06_26_202 ┆ 53m 13s ┆ 3193 │
│ ┆ 01:08:52 ┆ ┆ City ┆ ┆ 5 ┆ 5_02_38_4 ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ 02:02:05 ┆ 5_PM ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
Shape of the original DataFrame: (3762, 36)
Shape of the 'bkn_data' DataFrame (after filtering): (486, 36)
Unique cities in 'bkn_data':
['BROOKLYN']
bkn_complaint_types = bkn_data.group_by("complaint_type").agg(
pl.len().alias("counts") # pl.len() counts rows in each group, and we alias it to 'counts'
).sort("counts", descending=True) # Sort the resulting DataFrame by 'counts' in descending order
# get the number of unique complaint types in the Brooklyn subset (not the full df)
bkn_complaint_counts = bkn_data.select(pl.col("complaint_type").n_unique()).item()
# View the result
print(f"\nThere are {bkn_complaint_counts} complaint types in the 'BROOKLYN' subset:\n")
print(bkn_complaint_types)
There are 22 complaint types in the 'BROOKLYN' subset:
shape: (22, 2)
┌───────────────────────────────┬────────┐
│ complaint_type ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════════════════════════╪════════╡
│ Water System ┆ 124 │
│ Illegal Parking ┆ 101 │
│ Noise - Residential ┆ 78 │
│ Noise - Street/Sidewalk ┆ 63 │
│ Blocked Driveway ┆ 24 │
│ … ┆ … │
│ Animal-Abuse ┆ 1 │
│ General Construction/Plumbing ┆ 1 │
│ Electrical ┆ 1 │
│ Violation of Park Rules ┆ 1 │
│ Disorderly Youth ┆ 1 │
└───────────────────────────────┴────────┘
# Plot the bar graph
chart = alt.Chart(bkn_complaint_types).mark_bar().encode(
# X-axis: 'counts' (Quantitative type)
x=alt.X('counts:Q', title='Number of Complaints',
scale=alt.Scale(domainMin=0, zero=True)
),
# Y-axis: 'complaint_type' (Nominal type), sorted by counts descending
y=alt.Y('complaint_type:N', title='Complaint Type', sort='-x'), # Sort by counts for readability
color=alt.value('red'), # You can use color names or hex codes like '#FF5733'
# Tooltip for interactivity
tooltip=[
alt.Tooltip('complaint_type:N', title='Complaint Type'),
alt.Tooltip('counts:Q', title='Count')
]
).properties(
title='Types of Complaints in Brooklyn',
width=800, # <-- Adjust the width here (e.g., 800 pixels)
height=600 # <-- Adjust the height here (e.g., 600 pixels)
).interactive()
# Display the chart
chart.show()
# filter data for the geographic locations of Brooklyn in the dataset.
bkn_geo_data = df.filter(
    (pl.col("city") == "BROOKLYN") &
    pl.col("longitude").is_not_null() &
    pl.col("latitude").is_not_null()
).select(
    # latitude/longitude are stored as strings, so cast them to floats
    # so Altair can treat them as quantitative fields
    pl.col("longitude").cast(pl.Float64),
    pl.col("latitude").cast(pl.Float64),
)
# create a scatter plot of the concentration of complaints across Brooklyn
# Using typical bounds for Brooklyn for demonstration:
bkn_lon_min, bkn_lon_max = -74.05, -73.85
bkn_lat_min, bkn_lat_max = 40.55, 40.75
scatter_plot = alt.Chart(bkn_geo_data).mark_point(
opacity=0.2, # Reduce opacity to see overlapping points
size=5 # Adjust point size
).encode(
x=alt.X(
'longitude:Q',
title='Longitude',
scale=alt.Scale(domain=[bkn_lon_min, bkn_lon_max]) # Set fixed domain for x
),
y=alt.Y(
'latitude:Q',
title='Latitude',
scale=alt.Scale(domain=[bkn_lat_min, bkn_lat_max]) # Set fixed domain for y
),
tooltip=['longitude:Q', 'latitude:Q']
).properties(
title='Complaint Locations in Brooklyn (Scatter Plot)',
width=700, # Set explicit width
height=560 # Set explicit height
).interactive() # Allows zooming and panning
print("\nDisplaying Scatter Plot for Brooklyn Complaints:")
scatter_plot.show()
Displaying Scatter Plot for Brooklyn Complaints:
# --- "Hexbin"-style density plot using mark_circle() + binning ---
# (Vega-Lite has no true hexagonal binning; this approximates it with rectangular bins)
hexbin_plot = alt.Chart(bkn_geo_data).mark_circle().encode(
    # Bin longitude and latitude into rectangular cells
    x=alt.X('longitude:Q', bin=alt.Bin(maxbins=50), title='Longitude'),  # Adjust maxbins for granularity
    y=alt.Y('latitude:Q', bin=alt.Bin(maxbins=50), title='Latitude'),  # Adjust maxbins for granularity
# Color by the count of points in each bin
color=alt.Color('count():Q',
title='Number of Complaints',
scale=alt.Scale(range='heatmap', domainMin=1) # Ensure heatmap starts from non-zero count
),
# Size the circles by count as well for emphasis
size=alt.Size('count():Q', legend=None), # Hide size legend as color already indicates count
tooltip=[
alt.Tooltip('count():Q', title='Complaints in Bin'),
alt.Tooltip('longitude:Q', bin=True, title='Longitude Bin'),
alt.Tooltip('latitude:Q', bin=True, title='Latitude Bin')
]
).properties(
title='Complaint Concentration in Brooklyn (Hexbin Plot)',
width=700, # Set explicit width
height=560 # Set explicit height
).interactive() # Allows zooming and panning
print("\nDisplaying Hexbin Plot for Brooklyn Complaints:")
hexbin_plot.show()
Displaying Hexbin Plot for Brooklyn Complaints:
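Since Vega-Lite has no native hexbin, another common form of the same idea is a rect-mark heatmap; a sketch using the same bkn_geo_data:
# Binned heatmap: one colored rectangle per (longitude, latitude) bin
heatmap = alt.Chart(bkn_geo_data).mark_rect().encode(
    x=alt.X('longitude:Q', bin=alt.Bin(maxbins=50), title='Longitude'),
    y=alt.Y('latitude:Q', bin=alt.Bin(maxbins=50), title='Latitude'),
    color=alt.Color('count():Q', title='Number of Complaints',
                    scale=alt.Scale(scheme='viridis')),
    tooltip=[alt.Tooltip('count():Q', title='Complaints in Bin')]
).properties(
    title='Complaint Density in Brooklyn (Binned Heatmap)',
    width=700,
    height=560
)
heatmap.show()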
This code groups the data by both city and complaint type, counts the number of occurrences for each combination, and then prints the resulting output, which shows the various types of complaints in each city.
# 1. Group by city and complaint type and count the number of occurrences
# This will result in a DataFrame with columns: 'city', 'complaint_type', 'count'
city_complaint_counts = df.group_by(["city", "complaint_type"]).agg(
pl.len().alias("count")
).sort(["city", "count"], descending=[False, True]) # Sort by city, then by count descending for readability
print("Grouped by City and Complaint Type (Head):")
print(city_complaint_counts.head())
Grouped by City and Complaint Type (Head):
shape: (5, 3)
┌─────────┬─────────────────────────┬───────┐
│ city ┆ complaint_type ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═════════╪═════════════════════════╪═══════╡
│ ARVERNE ┆ Blocked Driveway ┆ 1 │
│ ASTORIA ┆ Illegal Parking ┆ 21 │
│ ASTORIA ┆ Noise - Street/Sidewalk ┆ 5 │
│ ASTORIA ┆ Noise - Residential ┆ 4 │
│ ASTORIA ┆ Animal-Abuse ┆ 1 │
└─────────┴─────────────────────────┴───────┘
This code pivots the data from the original DataFrame df. In Polars, you first group by the columns you want to pivot on, count with .len(), and then call pivot. We use the confirmed lowercase column names ‘complaint_type’ and ‘city’.
The resulting DataFrame, complaints_by_city, has complaint types as rows, cities as columns, and the count of occurrences as the values. The fill_null(0) call fills in zero for any complaint type that never occurred in a given city. The resulting output shows the new DataFrame with cities as columns and complaint types as rows.
complaints_by_city = df.group_by(["complaint_type", "city"]).len().pivot(
    index="complaint_type",  # Rows of the new pivot table
    on="city",               # Columns of the new pivot table (Polars 1.x uses `on`)
    values="len"             # The count from .len() will fill the values
)
# Fill any null values (where a complaint type didn't occur in a city) with 0
complaints_by_city = complaints_by_city.fill_null(0)
# Print the new DataFrame's head
print("Complaints by City (head of the pivot table):")
print(complaints_by_city.head())
Complaints by City (head of the pivot table):
shape: (5, 43)
┌─────────────────────┬──────────┬─────────┬────────┬───┬────────┬─────────┬─────────┬─────────────┐
│ complaint_type ┆ BROOKLYN ┆ ASTORIA ┆ FOREST ┆ … ┆ HOLLIS ┆ ARVERNE ┆ CAMBRIA ┆ LITTLE NECK │
│ --- ┆ --- ┆ --- ┆ HILLS ┆ ┆ --- ┆ --- ┆ HEIGHTS ┆ --- │
│ str ┆ u32 ┆ u32 ┆ --- ┆ ┆ u32 ┆ u32 ┆ --- ┆ u32 │
│ ┆ ┆ ┆ u32 ┆ ┆ ┆ ┆ u32 ┆ │
╞═════════════════════╪══════════╪═════════╪════════╪═══╪════════╪═════════╪═════════╪═════════════╡
│ Disorderly Youth ┆ 1 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ Blocked Driveway ┆ 24 ┆ 1 ┆ 0 ┆ … ┆ 1 ┆ 1 ┆ 0 ┆ 0 │
│ Noise - Residential ┆ 78 ┆ 4 ┆ 1 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ Traffic Signal ┆ 23 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ Condition ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ Illegal Parking ┆ 101 ┆ 21 ┆ 2 ┆ … ┆ 0 ┆ 0 ┆ 1 ┆ 0 │
└─────────────────────┴──────────┴─────────┴────────┴───┴────────┴─────────┴─────────┴─────────────┘
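This pivot is also a ready-made contingency table, which finally puts the chi2_contingency import to work. A sketch testing whether the complaint mix is independent of city; note the many small expected counts here, so the p-value should be read loosely:
from scipy.stats import chi2_contingency

# Contingency table: rows = complaint types, columns = cities
observed = complaints_by_city.drop("complaint_type").to_numpy()
# Drop all-zero rows/columns, which chi2_contingency cannot handle
row_mask = observed.sum(axis=1) > 0
col_mask = observed.sum(axis=0) > 0
observed = observed[row_mask][:, col_mask]

chi2, p, dof, _ = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")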
Statistical Analysis of Request_Closing_Time
# Calculate the response time (duration)
# Polars automatically creates a Duration type from datetime subtraction
df = df.with_columns(
(pl.col("closed_date") - pl.col("created_date")).alias("request_closing_duration")
)
# Convert the duration to total hours and total minutes (as numerical values for plotting)
df = df.with_columns([
pl.col("request_closing_duration").dt.total_hours().alias("request_closing_time_hrs"),
pl.col("request_closing_duration").dt.total_minutes().alias("request_closing_time_min")
])
print("DataFrame Head with new time columns:")
print(df.head())
print("\nSchema of new time columns:")
print(df.select(["request_closing_duration", "request_closing_time_hrs", "request_closing_time_min"]).schema)DataFrame Head with new time columns:
shape: (5, 39)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ unique_key ┆ created_da ┆ agency ┆ agency_na ┆ … ┆ time_delt ┆ request_c ┆ request_c ┆ request_c │
│ --- ┆ te ┆ --- ┆ me ┆ ┆ a_seconds ┆ losing_du ┆ losing_ti ┆ losing_ti │
│ str ┆ --- ┆ str ┆ --- ┆ ┆ --- ┆ ration ┆ me_hrs ┆ me_min │
│ ┆ datetime[μ ┆ ┆ str ┆ ┆ i64 ┆ --- ┆ --- ┆ --- │
│ ┆ s] ┆ ┆ ┆ ┆ ┆ duration[ ┆ i64 ┆ i64 │
│ ┆ ┆ ┆ ┆ ┆ ┆ μs] ┆ ┆ │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 65355773 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 440 ┆ 7m 20s ┆ 0 ┆ 7 │
│ ┆ 01:50:40 ┆ ┆ City ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65366664 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 998 ┆ 16m 38s ┆ 0 ┆ 16 │
│ ┆ 01:45:52 ┆ ┆ City ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65358593 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 589 ┆ 9m 49s ┆ 0 ┆ 9 │
│ ┆ 01:42:25 ┆ ┆ City ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65366547 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 1378 ┆ 22m 58s ┆ 0 ┆ 22 │
│ ┆ 01:40:10 ┆ ┆ City ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
│ 65363618 ┆ 2025-06-25 ┆ NYPD ┆ New York ┆ … ┆ 1363 ┆ 22m 43s ┆ 0 ┆ 22 │
│ ┆ 01:37:15 ┆ ┆ City ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Police ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ Departmen ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ … ┆ ┆ ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
Schema of new time columns:
Schema({'request_closing_duration': Duration(time_unit='us'), 'request_closing_time_hrs': Int64, 'request_closing_time_min': Int64})
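A caveat before averaging these: dt.total_hours() and dt.total_minutes() truncate to whole units (note the Int64 dtypes above), so a request closed in 59 minutes counts as 0 hours. That truncation is why several of the averages below come out as exactly 0.0. A sketch of a non-truncating alternative (request_closing_time_hrs_exact is a column name introduced here):
# Fractional hours computed from total seconds (Float64, no truncation)
df = df.with_columns(
    (pl.col("request_closing_duration").dt.total_seconds() / 3600)
    .alias("request_closing_time_hrs_exact")
)
print(df.select(pl.col("request_closing_time_hrs_exact").mean()))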
# Define the list of city names that represent the main boroughs in your dataset
borough_cities = ['NEW YORK', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND']
# Filter the DataFrame to include only these borough cities
borough_complaints_df = df.filter(pl.col("city").is_in(borough_cities))
# Calculate the average complaint resolution time (in hours) for each borough city
avg_resolution_time_by_borough = borough_complaints_df.group_by("city").agg(
pl.col("request_closing_time_hrs").mean().alias("avg_resolution_hours")
).sort("avg_resolution_hours", descending=False) # Sort for better visualization
print("\nAverage Complaint Resolution Time (Hours) by Borough City:")
print(avg_resolution_time_by_borough)
# --- Plotting the average complaint resolved time for the five borough cities ---
chart_borough_resolution = alt.Chart(avg_resolution_time_by_borough).mark_bar().encode(
# X-axis: Average Resolution Time in Hours (Quantitative)
x=alt.X('avg_resolution_hours:Q',
title='Average Resolution Time (Hours)',
scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
),
# Y-axis: City (Nominal), sorted by average time
y=alt.Y('city:N',
title='City',
sort='-x' # Sort cities by their average resolution time
),
color=alt.value('darkcyan'), # Set a single color for the bars
tooltip=[
alt.Tooltip('city:N', title='City'),
alt.Tooltip('avg_resolution_hours:Q', title='Avg. Hours', format='.2f') # Format to 2 decimal places
]
).properties(
title='Average Complaint Resolution Time by Borough City',
width=800, # <-- Adjust the width here (e.g., 800 pixels)
height=600 # <-- Adjust the height here (e.g., 600 pixels)
).interactive()
chart_borough_resolution.show()
Average Complaint Resolution Time (Hours) by Borough City:
shape: (5, 2)
┌───────────────┬──────────────────────┐
│ city ┆ avg_resolution_hours │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════════════╪══════════════════════╡
│ QUEENS ┆ 0.0 │
│ BRONX ┆ 0.204422 │
│ NEW YORK ┆ 0.389381 │
│ BROOKLYN ┆ 0.623457 │
│ STATEN ISLAND ┆ 0.863014 │
└───────────────┴──────────────────────┘
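As a cross-check, the raw data already carries a borough column, so the same comparison can be made without treating city names as borough proxies; a sketch:
# Group directly on the dataset's own 'borough' field
avg_by_borough = (
    df.group_by("borough")
    .agg(pl.col("request_closing_time_hrs").mean().alias("avg_resolution_hours"))
    .sort("avg_resolution_hours")
)
print(avg_by_borough)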
# Calculate the average closing time by city
# Group by 'city', aggregate the mean of 'request_closing_time_hrs', and sort
avg_closing_time_dict = df.group_by("city").agg(
pl.col("request_closing_time_hrs").mean().alias("avg_hours_to_close")
).sort("avg_hours_to_close", descending=False).to_dict(as_series=False) # Convert to dictionary
# Use a dictionary comprehension to reformat it
# We're zipping the 'city' list with the 'avg_hours_to_close' list
avg_closing_time_by_city_dict = {
city: avg_time
for city, avg_time in zip(avg_closing_time_dict['city'], avg_closing_time_dict['avg_hours_to_close'])
}
print("Average Closing Time by City (sorted, as dictionary):\n")
for city, hours in avg_closing_time_by_city_dict.items():
    print(f"{city} : {hours:.2f} hrs")
Average Closing Time by City (sorted, as dictionary):
LITTLE NECK : 0.00 hrs
MIDDLE VILLAGE : 0.00 hrs
SPRINGFIELD GARDENS : 0.00 hrs
OAKLAND GARDENS : 0.00 hrs
BAYSIDE : 0.00 hrs
CAMBRIA HEIGHTS : 0.00 hrs
HOLLIS : 0.00 hrs
ROSEDALE : 0.00 hrs
QUEENS : 0.00 hrs
MANHATTAN : 0.11 hrs
SOUTH OZONE PARK : 0.14 hrs
SUNNYSIDE : 0.17 hrs
RIDGEWOOD : 0.17 hrs
FLUSHING : 0.18 hrs
HOWARD BEACH : 0.20 hrs
BRONX : 0.20 hrs
JAMAICA : 0.24 hrs
MASPETH : 0.25 hrs
REGO PARK : 0.33 hrs
WHITESTONE : 0.33 hrs
OZONE PARK : 0.33 hrs
UNKNOWN CITY : 0.36 hrs
NEW YORK : 0.39 hrs
FRESH MEADOWS : 0.44 hrs
FOREST HILLS : 0.50 hrs
CORONA : 0.55 hrs
BROOKLYN : 0.62 hrs
LONG ISLAND CITY : 0.73 hrs
RICHMOND HILL : 0.75 hrs
WOODHAVEN : 0.81 hrs
JACKSON HEIGHTS : 0.83 hrs
STATEN ISLAND : 0.86 hrs
COLLEGE POINT : 1.00 hrs
SAINT ALBANS : 1.00 hrs
FAR ROCKAWAY : 1.00 hrs
QUEENS VILLAGE : 1.00 hrs
ARVERNE : 1.00 hrs
SOUTH RICHMOND HILL : 1.00 hrs
ASTORIA : 1.23 hrs
EAST ELMHURST : 1.27 hrs
ELMHURST : 1.82 hrs
WOODSIDE : 1.83 hrs
# Calculate the average closing time by city (as a Polars DataFrame)
avg_closing_time_df = df.group_by("city").agg(
pl.col("request_closing_time_hrs").mean().alias("avg_hours_to_close")
).sort("avg_hours_to_close", descending=False) # Sort for better chart readability
print("DataFrame for plotting (Avg. Closing Time by City):")
print(avg_closing_time_df.head())
# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(avg_closing_time_df).mark_bar().encode(
# X-axis: Average Hours to Close (Quantitative)
x=alt.X('avg_hours_to_close:Q',
title='Average Closing Time (Hours)',
axis=alt.Axis(titleFontSize=14),
scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
),
# Y-axis: City (Nominal), sorted by average closing time
y=alt.Y('city:N',
title='City',
sort='-x', # Sort cities by their average resolution time
axis=alt.Axis(titleFontSize=14)
),
color=alt.value('purple'), # Set a single color for the bars
tooltip=[
alt.Tooltip('city:N', title='City'),
alt.Tooltip('avg_hours_to_close:Q', title='Avg. Hours', format='.2f') # Format to 2 decimal places
]
).properties(
title=alt.Title(
'Average Complaint Closing Time by City',
fontSize=18
),
width=600, # Adjust overall chart width
height=alt.Step(15) # Dynamically adjust height based on number of cities, 15 pixels per bar
).interactive() # Enable pan and zoom
chart.show()
DataFrame for plotting (Avg. Closing Time by City):
shape: (5, 2)
┌─────────────────────┬────────────────────┐
│ city ┆ avg_hours_to_close │
│ --- ┆ --- │
│ str ┆ f64 │
╞═════════════════════╪════════════════════╡
│ CAMBRIA HEIGHTS ┆ 0.0 │
│ MIDDLE VILLAGE ┆ 0.0 │
│ LITTLE NECK ┆ 0.0 │
│ SPRINGFIELD GARDENS ┆ 0.0 │
│ BAYSIDE ┆ 0.0 │
└─────────────────────┴────────────────────┘
Let’s look at the average response time across the different complaint types; we will also compare it across the boroughs.
# Calculate the average closing time by complaint type
# Group by 'complaint_type', aggregate the mean of 'request_closing_time_min', and sort
avg_closing_time_per_complaint = df.group_by("complaint_type").agg(
pl.col("request_closing_time_min").mean().alias("avg_minutes_to_close")
).sort("avg_minutes_to_close", descending=False) # Default sort is ascending, so descending=False for ascending order
# Print the result
print("Average Closing Time (Minutes) per Complaint Type:")
print(avg_closing_time_per_complaint)
Average Closing Time (Minutes) per Complaint Type:
shape: (31, 2)
┌───────────────────────────────┬──────────────────────┐
│ complaint_type ┆ avg_minutes_to_close │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════════════════════════════╪══════════════════════╡
│ Street Condition ┆ 0.0 │
│ Building/Use ┆ 0.0 │
│ Emergency Response Team (ERT) ┆ 0.0 │
│ Elevator ┆ 0.0 │
│ General Construction/Plumbing ┆ 0.0 │
│ … ┆ … │
│ Animal-Abuse ┆ 123.4 │
│ Violation of Park Rules ┆ 137.0 │
│ Traffic ┆ 152.0 │
│ Dirty Condition ┆ 233.0 │
│ Derelict Vehicles ┆ 268.2 │
└───────────────────────────────┴──────────────────────┘
# --- Create the Altair Horizontal Bar Chart ---
chart = alt.Chart(avg_closing_time_per_complaint).mark_bar().encode(
# X-axis: Average Minutes to Close (Quantitative)
x=alt.X('avg_minutes_to_close:Q',
title='Average Closing Time (Minutes)',
axis=alt.Axis(titleFontSize=14),
scale=alt.Scale(domainMin=0) # Ensure axis starts from 0
),
# Y-axis: Complaint Type (Nominal), sorted by average closing time
y=alt.Y('complaint_type:N',
title='Complaint Type',
sort='-x', # Sort complaint types by their average resolution time
axis=alt.Axis(titleFontSize=14)
),
color=alt.value('seagreen'), # Set a single color for the bars
tooltip=[
alt.Tooltip('complaint_type:N', title='Complaint Type'),
alt.Tooltip('avg_minutes_to_close:Q', title='Avg. Minutes', format='.2f') # Format to 2 decimal places
]
).properties(
title=alt.Title(
'Average Complaint Closing Time by Complaint Type',
fontSize=18
),
width=600, # Adjust overall chart width
height=alt.Step(15) # Dynamically adjust height based on number of complaint types, 15 pixels per bar
).interactive() # Enable pan and zoom
chart.show()
Here you can see that Derelict Vehicles take the longest to resolve, roughly 4.5 hours on average, followed by Dirty Condition and Traffic. At the other end, several complaint types such as Street Condition and Elevator show an average closing time of zero, which more likely reflects missing or identical open/close timestamps than instant resolution.
# 1. Define the mapping (as a list of dictionaries or directly construct a DataFrame)
borough_mapping_data = [
{'city_name': 'NEW YORK', 'Borough': 'MANHATTAN'},
{'city_name': 'BRONX', 'Borough': 'BRONX'},
{'city_name': 'BROOKLYN', 'Borough': 'BROOKLYN'},
{'city_name': 'QUEENS', 'Borough': 'QUEENS'},
{'city_name': 'STATEN ISLAND', 'Borough': 'STATEN ISLAND'},
{'city_name': 'JAMAICA', 'Borough': 'QUEENS'},
{'city_name': 'FLUSHING', 'Borough': 'QUEENS'},
{'city_name': 'ASTORIA', 'Borough': 'QUEENS'},
{'city_name': 'LONG ISLAND CITY', 'Borough': 'QUEENS'},
{'city_name': 'EAST ELMHURST', 'Borough': 'QUEENS'},
{'city_name': 'CORONA', 'Borough': 'QUEENS'},
{'city_name': 'ELMHURST', 'Borough': 'QUEENS'},
{'city_name': 'OZONE PARK', 'Borough': 'QUEENS'},
{'city_name': 'SOUTH OZONE PARK', 'Borough': 'QUEENS'},
{'city_name': 'ARVERNE', 'Borough': 'QUEENS'},
{'city_name': 'FAR ROCKAWAY', 'Borough': 'QUEENS'},
{'city_name': 'QUEENS VILLAGE', 'Borough': 'QUEENS'},
{'city_name': 'RICHMOND HILL', 'Borough': 'QUEENS'},
{'city_name': 'SOUTH RICHMOND HILL', 'Borough': 'QUEENS'},
{'city_name': 'WOODHAVEN', 'Borough': 'QUEENS'},
{'city_name': 'REGO PARK', 'Borough': 'QUEENS'},
{'city_name': 'FOREST HILLS', 'Borough': 'QUEENS'},
{'city_name': 'JACKSON HEIGHTS', 'Borough': 'QUEENS'},
{'city_name': 'KEW GARDENS', 'Borough': 'QUEENS'},
{'city_name': 'FRESH MEADOWS', 'Borough': 'QUEENS'},
{'city_name': 'OAKLAND GARDENS', 'Borough': 'QUEENS'},
{'city_name': 'COLLEGE POINT', 'Borough': 'QUEENS'},
{'city_name': 'BAYSIDE', 'Borough': 'QUEENS'},
{'city_name': 'GLEN OAKS', 'Borough': 'QUEENS'},
{'city_name': 'BELLEROSE', 'Borough': 'QUEENS'},
{'city_name': 'LITTLE NECK', 'Borough': 'QUEENS'},
{'city_name': 'CAMBRIA HEIGHTS', 'Borough': 'QUEENS'},
{'city_name': 'ROSEDALE', 'Borough': 'QUEENS'},
{'city_name': 'ROCKAWAY PARK', 'Borough': 'QUEENS'},
{'city_name': 'SPRINGFIELD GARDENS', 'Borough': 'QUEENS'},
{'city_name': 'WOODSIDE', 'Borough': 'QUEENS'},
{'city_name': 'MASPETH', 'Borough': 'QUEENS'},
{'city_name': 'SUNNYSIDE', 'Borough': 'QUEENS'},
{'city_name': 'MIDDLE VILLAGE', 'Borough': 'QUEENS'},
{'city_name': 'RIDGEWOOD', 'Borough': 'QUEENS'},
{'city_name': 'HOWARD BEACH', 'Borough': 'QUEENS'},
{'city_name': 'BREEZY POINT', 'Borough': 'QUEENS'},
{'city_name': 'WHITESTONE', 'Borough': 'QUEENS'},
{'city_name': 'SAINT ALBANS', 'Borough': 'QUEENS'},
{'city_name': 'UNKNOWN CITY', 'Borough': 'Unspecified'} # Map the previously filled 'UNKNOWN CITY' placeholder
]
mapping_df = pl.DataFrame(borough_mapping_data)
# 2. Perform a left join to add the 'Borough' column
df = df.join(mapping_df, left_on="city", right_on="city_name", how="left")
# 3. Fill any remaining nulls in 'Borough' (for cities not in mapping) with 'Unspecified'
df = df.with_columns(
pl.col("Borough").fill_null(pl.lit("Unspecified")).alias("Borough")
)
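As a design note: the same mapping can be applied without a join. Polars' replace_strict expression maps values through a dict and takes a default, which makes the 'Unspecified' fallback explicit in one step. A minimal sketch using the same mapping data (behavior should match the join above):
# Dict-based alternative to the join: map city -> borough directly,
# falling back to 'Unspecified' for any city not in the mapping.
city_to_borough = {row['city_name']: row['Borough'] for row in borough_mapping_data}
df_alt = df.with_columns(
    pl.col("city").replace_strict(city_to_borough, default="Unspecified").alias("Borough")
)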
# --- Calculate average response time per complaint type per borough ---
avg_response_time_by_borough_complaint = df.group_by(["Borough", "complaint_type"]).agg(
pl.col("request_closing_time_min").mean().alias("avg_minutes")
).sort(["Borough", "avg_minutes"], descending=[False, True])
print("Average Response Time by Borough and Complaint Type (Head):")
print(avg_response_time_by_borough_complaint.head())
Average Response Time by Borough and Complaint Type (Head):
shape: (5, 3)
┌─────────┬─────────────────────────────┬─────────────┐
│ Borough ┆ complaint_type ┆ avg_minutes │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═════════╪═════════════════════════════╪═════════════╡
│ BRONX ┆ Drug Activity ┆ 394.0 │
│ BRONX ┆ Derelict Vehicles ┆ 324.75 │
│ BRONX ┆ Dirty Condition ┆ 219.0 │
│ BRONX ┆ Illegal Fireworks ┆ 137.8 │
│ BRONX ┆ Non-Emergency Police Matter ┆ 130.833333 │
└─────────┴─────────────────────────────┴─────────────┘
# --- Define the boroughs for subplots ---
boroughs = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND', 'Unspecified']
# --- Set a common X-axis limit across panels ---
x_axis_limit = 3000 # fixed upper bound so the subplots share a comparable scale
# --- Helper function to create an Altair subplot for each borough ---
def create_borough_subplot(data: pl.DataFrame, borough_name: str, x_domain_max: float) -> alt.Chart:
"""
Creates a horizontal bar chart for average response time by complaint type for a given borough.
"""
borough_data = data.filter(pl.col("Borough") == borough_name)
borough_data = borough_data.filter(pl.col("avg_minutes").is_not_null() & (pl.col("avg_minutes") > 0))
    # Compute the chart height as an integer: 15 pixels per complaint-type bar.
    # (Multiplying alt.Step() does not work; height must be a number here.)
    num_complaint_types = len(borough_data)
    chart_height = 15 * max(1, num_complaint_types)
chart = alt.Chart(borough_data).mark_bar(color='cornflowerblue').encode(
x=alt.X('avg_minutes:Q',
title='Average Response Time (in minutes)',
scale=alt.Scale(domain=[0, x_domain_max]),
axis=alt.Axis(titleFontSize=12, labelFontSize=10)
),
y=alt.Y('complaint_type:N',
title='Complaint Type',
sort='-x',
axis=alt.Axis(titleFontSize=12, labelFontSize=10)
),
tooltip=[
alt.Tooltip('complaint_type:N', title='Complaint Type'),
alt.Tooltip('avg_minutes:Q', title='Avg. Minutes', format='.2f')
]
).properties(
title=alt.Title(
borough_name,
fontSize=14,
anchor='middle'
),
width=300,
height=chart_height # Pass the integer height here
)
    return chart
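Worth noting: Altair can also build this grid natively with facet. The manual hconcat/vconcat approach used below lets each panel keep its own y-domain (each borough has a different set of complaint types), which plain faceting only matches with independent scale resolution. A minimal faceted sketch under that assumption:
# Faceted alternative to the manual grid; resolve y independently so each
# borough panel only lists its own complaint types.
facet_chart = alt.Chart(avg_response_time_by_borough_complaint).mark_bar(color='cornflowerblue').encode(
    x=alt.X('avg_minutes:Q', title='Average Response Time (in minutes)'),
    y=alt.Y('complaint_type:N', sort='-x', title='Complaint Type')
).properties(width=300, height=200).facet(
    facet='Borough:N', columns=3
).resolve_scale(y='independent')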
# --- Create individual subplots for each borough ---
borough_charts = [create_borough_subplot(avg_response_time_by_borough_complaint, b, x_axis_limit) for b in boroughs]
# --- Combine the subplots into a 2x3 grid ---
combined_chart = alt.vconcat(
alt.hconcat(borough_charts[0], borough_charts[1], borough_charts[2]),
alt.hconcat(borough_charts[3], borough_charts[4], borough_charts[5])
).properties(
title=alt.Title(
'Average Response Time for Complaint Types by Borough',
fontSize=18,
anchor='middle'
)
).resolve_scale(
x='shared'
)
combined_chart.show()
Hypothesis Testing
A boxplot can reveal visual differences in the minimum, median, and maximum, and the presence of outliers, across complaint types.
# Filter out extreme outliers for better visualization if needed
df_filtered_for_plot = df.filter(pl.col("request_closing_time_min") >= 0) # Ensure no negative times
# Drop the 'time_delta' and 'request_closing_duration' columns for Altair plotting
df_filtered_for_plot = df_filtered_for_plot.drop("time_delta", "request_closing_duration")
# Create the Boxplot
boxplot_chart = alt.Chart(df_filtered_for_plot).mark_boxplot().encode(
x=alt.X('complaint_type:N',
title='Complaint Type',
axis=alt.Axis(labelAngle=-45, titleFontSize=14, labelFontSize=10) # Rotate labels for readability
),
y=alt.Y('request_closing_time_min:Q',
title='Request Closing Time (Minutes)',
scale=alt.Scale(domainMin=0), # Ensure Y-axis starts at 0
axis=alt.Axis(titleFontSize=14, labelFontSize=10)
),
tooltip=[
alt.Tooltip('complaint_type:N', title='Complaint Type'),
alt.Tooltip('request_closing_time_min:Q', title='Closing Time', format='.2f')
# For boxplots, you might add tooltips for quartiles if directly aggregated
]
).properties(
title=alt.Title(
'Distribution of Request Closing Time by Complaint Type',
fontSize=18
),
width=alt.Step(25), # Adjust width of each box
height=400 # Overall chart height
).interactive() # Enable zooming and panning
boxplot_chart.show()
Let’s identify the relationship between complaint type and request closing time using p-values. First, subset the data to only the relevant columns.
# Using confirmed lowercase column names 'complaint_type'
df_subset = df.select(["complaint_type", "request_closing_time_min"])
# Drop rows with missing values in these specific columns (Polars is immutable, so reassign)
df_subset = df_subset.drop_nulls()
# Create a dictionary to store the p-values for each Complaint Type
p_values = {}
# Get the data for the entire dataset's closing times as a NumPy array
# This is the 'all_data_closing_times' against which each subset will be compared
all_data_closing_times = df_subset['request_closing_time_min'].to_numpy()
# Loop through each Complaint Type and perform a t-test
# Get unique complaint types as a Python list for iteration
unique_complaint_types = df_subset['complaint_type'].unique().to_list()
print("Performing t-tests for each complaint type vs. the entire dataset...")
for complaint_type in unique_complaint_types:
# Filter data for the current complaint type (get a Polars Series)
subset_data_series = df_subset.filter(pl.col('complaint_type') == complaint_type)['request_closing_time_min']
# Ensure there's enough data for the t-test (at least 2 samples required)
if len(subset_data_series) < 2:
p_values[complaint_type] = np.nan # Assign NaN if not enough data
continue
# Convert Polars Series to NumPy array for scipy.stats
subset_data_np = subset_data_series.to_numpy()
# Perform t-test (comparing the subset to the entire dataset)
# equal_var=False for Welch's t-test, which does not assume equal variances
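    # Note: each subset is also contained in all_data_closing_times, so the two
    # samples overlap; treat these p-values as approximate rather than exact.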
_, p_value = stats.ttest_ind(subset_data_np, all_data_closing_times, equal_var=False)
p_values[complaint_type] = p_value
# Print the p-values for each Complaint Type
print("\n--- P-values for Complaint Type vs. Entire Dataset's Closing Time ---")
for complaint_type, p_value in p_values.items():
print(f"{complaint_type}: {p_value:.4f}")Performing t-tests for each complaint type vs. the entire dataset...
--- P-values for Complaint Type vs. Entire Dataset's Closing Time ---
Electrical: 0.0000
Blocked Driveway: 0.0000
Noise - Commercial: 0.0003
Building/Use: 0.0000
Noise - Park: 0.6933
Derelict Vehicles: 0.0374
Encampment: 0.0003
Noise - Residential: 0.0000
Urinating in Public: 0.1941
Traffic: nan
Animal-Abuse: 0.0412
Violation of Park Rules: nan
Panhandling: 0.0995
General Construction/Plumbing: 0.0000
Vendor Enforcement: 0.0066
Non-Emergency Police Matter: 0.0570
Noise - Street/Sidewalk: 0.0000
Bike/Roller/Skate Chronic: nan
Drinking: 0.0061
Emergency Response Team (ERT): nan
Street Condition: 0.0000
Illegal Fireworks: 0.0009
Noise - Vehicle: 0.0000
Dirty Condition: 0.0409
Drug Activity: 0.3738
Water System: 0.0000
Elevator: 0.0000
Illegal Parking: 0.0000
Traffic Signal Condition: 0.0171
Disorderly Youth: 0.5417
Abandoned Vehicle: 0.0000
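With roughly 30 tests run at once, a few small p-values are expected by chance alone. A minimal sketch of a Bonferroni correction over the p_values dict computed above (the adjustment step is my suggestion, not part of the original analysis):
# Bonferroni-adjust: multiply each p-value by the number of valid tests,
# capping at 1.0. Conservative but simple.
m = sum(1 for p in p_values.values() if not np.isnan(p))
for complaint_type, p in sorted(p_values.items()):
    if not np.isnan(p):
        print(f"{complaint_type}: raw p={p:.4f}, adjusted p={min(1.0, p * m):.4f}")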
We create a contingency table using Polars by counting the number of complaints for each complaint type and city. Then we use chi2_contingency() from scipy.stats to perform the chi-square test on that table. Finally, we print the results, including the chi-square statistic, p-value, degrees of freedom, and expected frequencies.
The null hypothesis for the chi-square test is that there is no association between the two variables (here, complaint type and city); the alternative hypothesis is that there is some association. If the p-value is less than our chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is evidence of a significant association between the variables.
# 1. Create a contingency table using Polars - group by both columns and count occurrences
contingency_counts = df.group_by(["complaint_type", "city"]).len()
# Then, pivot to get Complaint Types as rows and Cities as columns
# Fill any missing combinations with 0
cont_table_polars = contingency_counts.pivot(
    on="city",  # Polars >= 1.0 uses `on` (the old `columns` argument was removed)
    index="complaint_type",
    values="len"
).fill_null(0)
# The chi2_contingency function from scipy expects a NumPy array.
# Convert the Polars DataFrame to a NumPy array for the test.
observed_values = cont_table_polars.drop("complaint_type").to_numpy() # Drop the index column before converting
# 2. Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(observed_values)
# Set significance level
alpha = 0.05
# Print results
print(f"Chi-square value: {chi2:.2f}")
print(f"P-value: {pval:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies (truncated for display if large):")
# Print expected frequencies. They can be very large, so truncate for display.
if expected.size > 200: # Adjust this number based on how much output you want
print(expected[:5, :5]) # Print top-left corner
print("... (array truncated) ...")
else:
print(expected)
print("\n")
# Check if p-value is less than alpha
if pval < alpha: # Use pval here, not p_value from previous context
print("Conclusion: There is a statistically significant association between Complaint Type and City. (Reject the null hypothesis)")
print("This suggests that the distribution of complaint types is not the same across all cities, or vice versa.")
else:
print("Conclusion: There is no statistically significant association between Complaint Type and City. (Fail to reject the null hypothesis)")
print("This suggests that the distribution of complaint types is similar across cities, or vice versa.")Chi-square value: 7973.75
P-value: 0.0000
Degrees of freedom: 1230
Expected frequencies (truncated for display if large):
[[ 0.67464115 0.215311 4.86602871 1.04784689 0.63157895]
[ 3.33572568 1.0645933 24.05980861 5.18102073 3.12280702]
[ 1.12440191 0.35885167 8.11004785 1.74641148 1.05263158]
[ 26.79824561 8.55263158 193.28947368 41.62280702 25.0877193 ]
[ 6.89633174 2.20095694 49.74162679 10.71132376 6.45614035]]
... (array truncated) ...
Conclusion: There is a statistically significant association between Complaint Type and City. (Reject the null hypothesis)
This suggests that the distribution of complaint types is not the same across all cities.
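With a sample this large, statistical significance is close to guaranteed, so it helps to look at effect size too. A sketch of Cramér's V, reusing the chi2 and observed_values variables from the cell above (this computation is an addition, not from the original analysis):
# Cramér's V: association strength on a 0-1 scale, independent of sample size.
n = observed_values.sum()
min_dim = min(observed_values.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"Cramér's V: {cramers_v:.3f}")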
Kruskal-Wallis H Test (Non-Parametric One-Way ANOVA)
The Kruskal-Wallis H test is the non-parametric counterpart of one-way ANOVA, meaning it makes no assumption that the data are normally distributed. It does assume that the samples are independent and, if the result is to be read as a difference in medians, that the group distributions have roughly similar shapes.
# Filter out rows with nulls in 'request_closing_time_min' or 'complaint_type'
# Kruskal-Wallis requires complete data for each group.
df_filtered_for_test = df.filter(
pl.col("request_closing_time_min").is_not_null() &
pl.col("complaint_type").is_not_null()
)
# Define a minimum sample size for each group to be included in the test.
# Small groups can skew results or cause errors in statistical tests.
min_samples_per_group = 20 # You can adjust this threshold
# Get the list of complaint types that have enough samples
valid_complaint_types = df_filtered_for_test.group_by("complaint_type").agg(
pl.len().alias("count")
).filter(pl.col("count") >= min_samples_per_group).select("complaint_type").to_series().to_list()
print(f"\n--- There are {len(valid_complaint_types)} valid complaints for testing ---")
valid_complaint_types
--- There are 12 valid complaints for testing ---
['Blocked Driveway',
'Illegal Parking',
'Noise - Commercial',
'Noise - Residential',
'Noise - Vehicle',
'Illegal Fireworks',
'Noise - Street/Sidewalk',
'Abandoned Vehicle',
'Traffic Signal Condition',
'Encampment',
'Non-Emergency Police Matter',
'Water System']
Prepare the data for the Kruskal-Wallis test as a list of NumPy arrays, one per complaint type, each containing that type's 'request_closing_time_min' values.
# Prepare data for the Kruskal-Wallis test: a list of NumPy arrays.
# Each array contains the 'request_closing_time_min' for a specific complaint type.
data_for_kruskal = []
for complaint_type in valid_complaint_types:
# Filter for the current complaint type and extract the 'request_closing_time_min' column
subset_series = df_filtered_for_test.filter(
pl.col("complaint_type") == complaint_type
).select("request_closing_time_min").to_series()
# Convert the Polars Series to a NumPy array
data_for_kruskal.append(subset_series.to_numpy())
# Check if there are at least two groups to compare
if len(data_for_kruskal) < 2:
print(f"Cannot perform Kruskal-Wallis test: Need at least two complaint types with >= {min_samples_per_group} samples each.")
else:
# Perform the Kruskal-Wallis H-test
# The *data_for_kruskal unpacks the list of arrays as separate arguments to kruskal()
h_statistic, p_value = stats.kruskal(*data_for_kruskal)
print(f"\n--- Kruskal-Wallis H-Test Results (comparing {len(valid_complaint_types)} complaint types) ---")
print(f"H-statistic: {h_statistic:.2f}")
print(f"P-value: {p_value:.4f}")
# Interpret the p-value
alpha = 0.05 # Significance level
if p_value < alpha:
print(f"The P-value ({p_value:.4f}) is less than the significance level ({alpha}).")
print("Conclusion: Reject the null hypothesis. There is a statistically significant difference in the median request closing times among the selected complaint types. 🎉")
else:
print(f"The P-value ({p_value:.4f}) is greater than the significance level ({alpha}).")
print("Conclusion: Fail to reject the null hypothesis. There is no statistically significant difference in the median request closing times among the selected complaint types. 🤔")
--- Kruskal-Wallis H-Test Results (comparing 12 complaint types) ---
H-statistic: 3287.00
P-value: 0.0000
The P-value (0.0000) is less than the significance level (0.05).
Conclusion: Reject the null hypothesis. There is a statistically significant difference in the median request closing times among the selected complaint types. 🎉
Since the p-value is well below 0.01, we reject the null hypothesis: there is a statistically significant difference in request closing times across complaint types.
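The Kruskal-Wallis test says at least one group differs, not which ones. A hedged follow-up sketch (not in the original analysis) runs pairwise Mann-Whitney U tests with a Bonferroni correction, reusing data_for_kruskal and valid_complaint_types from above:
from itertools import combinations
# Pairwise post-hoc comparisons; only print pairs that survive Bonferroni.
pairs = list(combinations(range(len(valid_complaint_types)), 2))
for i, j in pairs:
    _, p = stats.mannwhitneyu(data_for_kruskal[i], data_for_kruskal[j])
    if p * len(pairs) < 0.05:  # Bonferroni-adjusted threshold
        print(f"{valid_complaint_types[i]} vs {valid_complaint_types[j]}: p = {p:.2e}")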
Observations and Outcomes
Here are some of the major conclusions that can be drawn from the NYC 311 customer service dataset:
Data visualizations across complaint types and boroughs provide rich context and insight.
Brooklyn has the highest number of complaints among the boroughs, while Queens takes the longest to resolve them.
The scatter and hexbin plots for Brooklyn illustrate the relationships between different complaint types and how they cluster together.
The box plot shows that there is significant variation in response times across different complaint types.
Hypothesis testing (Welch's t-tests and the Kruskal-Wallis H test) suggests that there are significant differences in response times among different complaint types.
The chi-square test indicates a statistically significant association between complaint types and cities.
Overall, the analysis of the 311 customer service dataset can provide valuable insights into the common complaints and response times in different boroughs, which can inform decisions on resource allocation and policy changes.