The NYC Department of Finance collects data on every parking ticket issued in NYC (~10M per year). This data is made publicly available to aid in ticket resolution and to guide policymakers.

The data set we’ll explore is in 2017. Since its size is quite large - 2.09 GB, we’ll apply the concept of ‘laziness’ from Dask. We’ll also visualize our analysis with interactive Plotly (not displayed here).

import dask
import dask.dataframe as dd
import pandas as pd

from dask.distributed import Client
client = Client(n_workers=4)

df = dd.read_csv('Parking_Violations_Issued_-_Fiscal_Year_2017.csv', parse_dates=['Issue Date'])
df

Dask DataFrame Structure:

	Summons Number	Plate ID	Registration State	Plate Type	Issue Date	Violation Code	Vehicle Body Type	Vehicle Make	Issuing Agency	Street Code1	Street Code2	Street Code3	Vehicle Expiration Date	Violation Location	Violation Precinct	Issuer Precinct	Issuer Code	Issuer Command	Issuer Squad	Violation Time	Time First Observed	Violation County	Violation In Front Of Or Opposite	House Number	Street Name	Intersecting Street	Date First Observed	Law Section	Sub Division	Violation Legal Code	Days Parking In Effect	From Hours In Effect	To Hours In Effect	Vehicle Color	Unregistered Vehicle?	Vehicle Year	Meter Number	Feet From Curb	Violation Post Code	Violation Description	No Standing or Stopping Violation	Hydrant Violation	Double Parking Violation
npartitions=33
	int64	object	object	object	datetime64[ns]	int64	object	object	object	int64	int64	int64	int64	float64	int64	int64	int64	object	object	object	object	object	object	object	object	object	int64	int64	object	object	object	object	object	object	float64	int64	object	int64	object	object	float64	float64	float64
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

Dask Name: from-delayed, 99 tasks

df.head()

	Summons Number	Plate ID	Registration State	Plate Type	Issue Date	Violation Code	Vehicle Body Type	Vehicle Make	Issuing Agency	Street Code1	...	Vehicle Color	Unregistered Vehicle?	Vehicle Year	Meter Number	Violation Post Code	Violation Description	No Standing or Stopping Violation	Hydrant Violation	Double Parking Violation
0	5092469481	GZH7067	NY	PAS	2016-07-10	7	SUBN	TOYOT	V	0	...	GY	NaN	2001	NaN	NaN	FAILURE TO STOP AT RED LIGHT	NaN	NaN	NaN
1	5092451658	GZH7067	NY	PAS	2016-07-08	7	SUBN	TOYOT	V	0	...	GY	NaN	2001	NaN	NaN	FAILURE TO STOP AT RED LIGHT	NaN	NaN	NaN
2	4006265037	FZX9232	NY	PAS	2016-08-23	5	SUBN	FORD	V	0	...	BK	NaN	2004	NaN	NaN	BUS LANE VIOLATION	NaN	NaN	NaN
3	8478629828	66623ME	NY	COM	2017-06-14	47	REFG	MITSU	T	10610	...	WH	NaN	2007	NaN	04	47-Double PKG-Midtown	NaN	NaN	NaN
4	7868300310	37033JV	NY	COM	2016-11-21	69	DELV	INTER	T	10510	...	WHITE	NaN	2007	NaN	31 6	69-Failure to Disp Muni Recpt	NaN	NaN	NaN

5 rows × 43 columns

We check some basic info of the data set, e.g. number of partitions dask created.

df.npartitions

df.columns

Index(['Summons Number', 'Plate ID', 'Registration State', 'Plate Type',
       'Issue Date', 'Violation Code', 'Vehicle Body Type', 'Vehicle Make',
       'Issuing Agency', 'Street Code1', 'Street Code2', 'Street Code3',
       'Vehicle Expiration Date', 'Violation Location', 'Violation Precinct',
       'Issuer Precinct', 'Issuer Code', 'Issuer Command', 'Issuer Squad',
       'Violation Time', 'Time First Observed', 'Violation County',
       'Violation In Front Of Or Opposite', 'House Number', 'Street Name',
       'Intersecting Street', 'Date First Observed', 'Law Section',
       'Sub Division', 'Violation Legal Code', 'Days Parking In Effect    ',
       'From Hours In Effect', 'To Hours In Effect', 'Vehicle Color',
       'Unregistered Vehicle?', 'Vehicle Year', 'Meter Number',
       'Feet From Curb', 'Violation Post Code', 'Violation Description',
       'No Standing or Stopping Violation', 'Hydrant Violation',
       'Double Parking Violation'],
      dtype='object')

1. When are tickets most likely to be issued? Answer in different time periods

By season

New York Has A Continental Climate. New York’s climate is classed as being continental, which means that it receives four distinct seasons spring (March-May), summer (June-August), autumn (September-November) and winter (December-February.

# a helper function to get seasons from months
def get_season(df):
    seasons = df['Issue Date'].dt.month
    
    seasons.replace([3, 4, 5], 'spring', inplace=True)
    seasons.replace([6, 7, 8], 'summer', inplace=True)
    seasons.replace([9, 10, 11], 'autumn', inplace=True)
    seasons.replace([12, 1, 2], 'winter', inplace=True)
    
    return seasons

# map over df a function to return a Series season
result = df.map_partitions(get_season)

/Users/MimiHMB/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py:6746: SettingWithCopyWarning: modifications to a property of a datetimelike object are not supported and are discarded. Change values on the original.
  self._update_inplace(new_data)

issue_season = result.value_counts()

issue_season_computed = issue_season.compute()

# violations happened more frequently in spring and autumn 
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=issue_season_computed.index,
        y=issue_season_computed.values,
        name='line'
    ))

fig.add_trace(
    go.Bar(
        x=issue_season_computed.index,
        y=issue_season_computed.values,
        name='bar'
    ))

fig.update_layout(
    title="Parking violations by season",
    xaxis_title="season",
    yaxis_title="violation")

fig.show()

png

By month

issue_month = df['Issue Date'].dt.month

len(issue_month)

10803028

issue_month_computed = issue_month.value_counts().compute()

import plotly.express as px

months = issue_month_computed.sort_index().index
values = issue_month_computed.sort_index().values

fig = px.scatter(issue_month_computed, x=months, y=values, text=months)

fig.update_traces(textposition='top center')

fig.update_layout(xaxis = dict(tickmode = 'array',
                               tickvals = list(range(1, 13)),
                               ticktext = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']))

fig.update_layout(xaxis_title="month", 
                  yaxis_title="violation",
                  title_text='Parking violations by month')

fig.show()

png

By am hour

am = [i[0:2] for i in df['Violation Time'] if ('A' in str(i))]

am_df = pd.DataFrame.from_dict(dict((i, am.count(i)) for i in set(am)), orient='index', columns=['count'])
am_df = am_df.sort_index()[4:]

am_df

	count
00	58313
01	97058
02	82352
03	66118
04	30163
05	84800
06	236777
07	531769
08	1008957
09	1181344
10	976913
11	1138919
12	38904

# violations happened mostly during 8-11 am
times = am_df.index
time_values = am_df['count'].values
size = time_values/10000

fig = go.Figure(data=[go.Scatter(
    x=times,
    y=time_values,
    mode='markers',
    marker=dict(
        size=size,
        sizemode='area',
        sizeref=1.*max(size)/(40.**2),
        sizemin=5,
        color=[220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275]
    )
)])

fig.update_layout(xaxis = dict(tickmode = 'array',
                               tickvals = list(range(0, 13))))
fig.update_layout(xaxis_title="am hour", 
                  yaxis_title="violation",
                  title_text='Parking violations by am hour')

fig.show()

png

By pm hour

pm = [i[0:2] for i in df['Violation Time'] if ('P' in str(i))]

pm_df = pd.DataFrame.from_dict(dict((i, pm.count(i)) for i in set(pm)), orient='index', columns=['count'])
pm_df = pm_df.sort_values(by='count', ascending=False)[:13].sort_index()

pm_df

	count
00	261
01	1076390
02	907268
03	617187
04	587900
05	423884
06	223467
07	61111
08	108834
09	119346
10	91404
11	62722
12	990649

# violations happened mostly during 12-2 pm
pm_times = pm_df.index
pm_time_values = pm_df['count'].values
pm_size = pm_time_values/10000

fig = go.Figure(data=[go.Scatter(
    x=pm_times,
    y=pm_time_values,
    mode='markers',
    marker=dict(
        size=pm_size,
        sizemode='area',
        sizeref=1.*max(pm_size)/(40.**2),
        sizemin=5,
        color=[100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155]
    )
)])

fig.update_layout(xaxis = dict(tickmode = 'array',
                               tickvals = list(range(0, 13))))
fig.update_layout(xaxis_title="pm hour", 
                  yaxis_title="violation",
                  title_text='Parking violations by pm hour')

fig.show()

png

2. Where are tickets most commonly issued?

At which street

where = df['Street Name'].value_counts().compute()

# most issued at Broadway
where[:5]

Broadway         206157
3rd Ave          160282
Madison Ave       98853
5th Ave           90707
Lexington Ave     81199
Name: Street Name, dtype: int64

At which intersections

bw_int= df[df['Street Name'] == 'Broadway']['Intersecting Street'].value_counts().compute()

# top 20 intersections with Broadway street that parking violations occurred
bw_int[:20]

30ft S/of W 118th St    82
30ft N/of E 8th St      62
20ft S/of W 118th St    61
20ft N/of E 8th St      49
10ft N/of W 110th St    45
10ft S/of W 118th St    42
20ft S/of W 76th St     41
10ft S/of W 90th St     41
40ft N/of E 8th St      41
15ft S/of W 118th St    41
20ft S/of W 77th St     37
20ft N/of W 102nd St    37
30ft N/of W 102nd St    37
20ft N/of W 67th St     36
40ft S/of W 118th St    35
10ft N/of W 77th St     33
5ft N/of W 102nd St     31
30ft S/of W 76th St     31
10ft N/of W 102nd St    30
20ft N/of W 77th St     30
Name: Intersecting Street, dtype: int64

# a helper function to filter only street names
def get_int(df):
    filter = df[df['Street Name'] == 'Broadway']['Intersecting Street'].dropna().str.split().str.get(-2)
    return filter

broadway_int = df.map_partitions(get_int)

broadway_int_count = broadway_int.value_counts().compute()

# most commonly issued at the intersection of Broadway and 118th streets
broadway_int_count.sort_values(ascending=False)[:10]

118th      516
77th       478
8th        448
78th       435
76th       404
102nd      373
Britton    290
74th       284
82nd       258
107th      252
Name: Intersecting Street, dtype: int64

For which violation reasons

bw_desc = df[df['Street Name'] == 'Broadway']['Violation Description'].value_counts().compute()

# top 10 violation reasons at Broadway Street
bw_desc[:10]

38-Failure to Display Muni Rec    47029
37-Expired Muni Meter             29334
46B-Double Parking (Com-100Ft)    16541
14-No Standing                    15002
19-No Stand (bus stop)            12799
21-No Parking (street clean)      11571
69-Failure to Disp Muni Recpt      9755
31-No Stand (Com. Mtr. Zone)       8380
71A-Insp Sticker Expired (NYS)     6288
40-Fire Hydrant                    5490
Name: Violation Description, dtype: int64