Open data yield a bounty of insights, but what you find may be difficult to stomach.
San Francisco has more restaurants per capita than any other city in the country, and inspecting these kitchens is a small staff of tireless Department of Public Health workers, routinely hitting the streets to help ensure that you and I dine on food and nothing else. What’s more, our fair city by the bay has implemented a progressive open data policy that has seen countless public records made available in detailed, machine-readable formats. As a hacker foodie I was drawn to the store of health inspection data available through the program, and what I found did not disappoint.
This trove of data paints a revealing picture of the perils of dining out in San Francisco, with more than 40% of restaurants in many neighborhoods incurring high-risk health violations such as vermin infestation, sewage contamination, unapproved living quarters and the sale of previously served food. Combining these inspection reports with restaurant review data sourced through Mechanical Turk, we find, disappointingly if not surprisingly, that many perennially popular cuisines such as Indian, Chinese and Thai rate consistently among the most unclean. Dodgy dives aside, worse still are the many highly reviewed restaurants whose vermin-infested kitchens and adulterated food go unnoticed by thousands of hapless diners every month.
The Facts of the Matter
Focusing on the eleven neighborhoods in San Francisco with the most restaurants, a natural starting place is an average health inspection score on a per-neighborhood basis.
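As a rough illustration of that starting point, here is a minimal sketch of a per-neighborhood average. The record layout, the business IDs and the scores are all made up for the example, and averaging each business's inspections before averaging across businesses is an assumption on my part, not necessarily how the figures in this post were computed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (business_id, neighborhood, inspection_score).
# These values are invented for illustration; they are not DPH data.
inspections = [
    ("b1", "Mission", 92),
    ("b1", "Mission", 88),
    ("b2", "Mission", 70),
    ("b3", "Chinatown", 81),
    ("b4", "Chinatown", 75),
    ("b5", "Potrero Hill", 98),
]


def mean_score_by_neighborhood(rows):
    """Average each business's inspections, then average the businesses.

    Averaging per business first (an assumption) keeps a frequently
    inspected restaurant from dominating its neighborhood's score.
    """
    scores_by_business = defaultdict(list)
    neighborhood_of = {}
    for business, neighborhood, score in rows:
        scores_by_business[business].append(score)
        neighborhood_of[business] = neighborhood

    business_means = defaultdict(list)
    for business, scores in scores_by_business.items():
        business_means[neighborhood_of[business]].append(mean(scores))

    return {hood: mean(vals) for hood, vals in business_means.items()}


neighborhood_means = mean_score_by_neighborhood(inspections)
# Print neighborhoods from cleanest to dirtiest.
for hood, avg in sorted(neighborhood_means.items(), key=lambda kv: -kv[1]):
    print(f"{hood}: {avg:.1f}")
```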
Somewhat surprisingly, we find that businesses scattered throughout the city’s eastern industrial district are remarkably clean, though this score is probably helped in no small measure by the pristine, slightly precious eateries serving urbanite families atop Potrero Hill. Mercifully, my home, San Francisco’s Mission district, lands squarely in the middle of the pack, though the low end of the spectrum in this neighborhood will surely give even the most hardened diner-goer pause. Finally, bringing up the bottom of the list is lovable, touristic Chinatown, which, as we soon shall see, is not a place you would be well advised to eat again. For a compelling, interactive visualization of the geospatial dimensions of this data set, be sure to check out the awesome Leaflet-based map (below) from Zipfian Academy co-founder Jonathan Dinu.
Crimes Against Hygiene
Digging deeper, we can unpack these scores by category of violation to understand exactly where businesses are taking the hit. To be clear, many violations cataloged by the Department of Public Health are those which may have been, at one point or another, committed in our own kitchens: inappropriate cooling methods, improper food storage, unclean nonfood contact surfaces, etc.
Violations of this type aren’t anything to be proud of, but they’re not going to put you in the hospital, either. There is a class of violations, however, that just might.
From the sixty-eight unique violation categories I identified nine that I consider essentially unforgivable. These are the worst of the worst, and one can only hope that it is with great infrequency that you dine in a restaurant that has perpetrated the following transgressions:
- High risk vermin infestation
- Moderate risk vermin infestation
- Employee discharge from eyes, nose, or mouth
- Sewage or wastewater contamination
- Unapproved living quarters in a food facility
- Unsanitary employee garments, hair, or nails
- Improper food labeling or menu misrepresentation
- Contaminated or adulterated food
- Service of previously served foods
The news I have for you on this front is not good.
Below you’ll find the proportion of all businesses in each neighborhood that have incurred at least one of the violations on this list.
While you savor these findings, notice two interesting things. First and most unsettling is the fact that more than half the businesses in Chinatown and North Beach are committing the infractions listed above. For anyone who’s been to Chinatown, it’s not unrealistic to imagine that a number of businesses are operating on the margins, but even to my somewhat jaded sensibility this is an arrestingly high figure.
The other, more insidious takeaway is that, in even the cleanest neighborhoods, nearly one in five restaurants is operating under conditions that are unsettling at best and dangerous at worst. Recall here that inspections are meted out at random, with little or no forewarning, and it’s not unreasonable to assume that these infractions, where they do occur, are happening on the regular.
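For the curious, the per-neighborhood proportion discussed above can be sketched in a few lines. The category strings are copied from the list of nine; the record layout, business IDs and sample rows are hypothetical, and a `None` violation is my own convention for an inspection that found nothing.

```python
from collections import defaultdict

# The nine categories singled out in the list above.
SEVERE = {
    "High risk vermin infestation",
    "Moderate risk vermin infestation",
    "Employee discharge from eyes, nose, or mouth",
    "Sewage or wastewater contamination",
    "Unapproved living quarters in a food facility",
    "Unsanitary employee garments, hair, or nails",
    "Improper food labeling or menu misrepresentation",
    "Contaminated or adulterated food",
    "Service of previously served foods",
}

# Hypothetical records: (business_id, neighborhood, violation or None).
records = [
    ("b1", "Chinatown", "High risk vermin infestation"),
    ("b1", "Chinatown", "Improper food storage"),
    ("b2", "Chinatown", "Unclean nonfood contact surfaces"),
    ("b3", "Mission", None),
    ("b4", "Mission", "Contaminated or adulterated food"),
    ("b5", "Mission", "Improper food storage"),
    ("b6", "Mission", None),
]


def severe_share_by_neighborhood(rows):
    """Share of businesses per neighborhood with at least one severe violation."""
    businesses = defaultdict(set)  # every business seen, clean or not
    offenders = defaultdict(set)   # businesses with one or more severe violations
    for business, neighborhood, violation in rows:
        businesses[neighborhood].add(business)
        if violation in SEVERE:
            offenders[neighborhood].add(business)
    return {
        hood: len(offenders[hood]) / len(ids)
        for hood, ids in businesses.items()
    }


severe_share = severe_share_by_neighborhood(records)
for hood, share in severe_share.items():
    print(f"{hood}: {share:.0%}")
```

Note that each business counts once no matter how many times it offends, and that clean businesses stay in the denominator; dropping them would inflate every neighborhood's share.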
The city health inspection records are, in their own right, revealing, but the most interesting analyses almost always involve bringing together distinct data sources. To this end, I employed crowd workers to identify the review site ratings and categories associated with over 1,000 businesses documented in DPH records. Presented below are the average health inspection scores across forty-nine restaurant categories, each with at least twenty-five unique businesses.
|Category|Average health inspection score|
|---|---|
|Juice Bars & Smoothies|96.0|
|Ice Cream & Frozen Yogurt|93.7|
|Coffee & Tea|92.8|
|Beer, Wine & Spirits|90.9|
|Breakfast & Brunch|87.2|
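A category ranking like this one can be sketched as follows, assuming each business has already been reduced to a single mean inspection score. The rows are invented, and the `min_businesses` parameter name is my own; it defaults to the twenty-five-business cutoff mentioned above and is lowered to 2 here only so the toy data survives the filter.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (business_id, category, mean inspection score).
businesses = [
    ("b1", "Coffee & Tea", 94.0),
    ("b2", "Coffee & Tea", 90.0),
    ("b3", "Juice Bars & Smoothies", 96.0),
    ("b4", "Breakfast & Brunch", 85.0),
    ("b5", "Breakfast & Brunch", 89.0),
]


def category_means(rows, min_businesses=25):
    """Mean score per category, skipping categories with too few businesses."""
    by_category = defaultdict(dict)
    for business, category, score in rows:
        # Keying on business_id means a repeated business counts only once.
        by_category[category][business] = score
    return {
        category: mean(list(scores.values()))
        for category, scores in by_category.items()
        if len(scores) >= min_businesses
    }


# Cutoff lowered to 2 for the toy data; sort from cleanest to dirtiest.
ranked = sorted(
    category_means(businesses, min_businesses=2).items(),
    key=lambda kv: kv[1],
    reverse=True,
)
for category, score in ranked:
    print(f"{category}: {score:.1f}")
```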
Reassuringly, our school cafeterias, while generally devoid of what you and I might consider food, are not unclean, a trend that extends, somewhat surprisingly, to many of your favorite dive bars. There’s much to be said about this list and the spectrum it represents, but in the interest of space I’ll call attention to just one other fact.
What is the substance, tucked inside those little bundles of ground meat, that puts these businesses so far below the rest of the pack? How many dumplings have you eaten during the course of your life?
What was it, again, that you’ve been eating?
This unexpected feature of the data, namely that a favorite food could come from such miserable kitchens, invites a final comparison.
Here we plot the relationship between a business’s average review score and mean historical health inspection rating. Taking the ratio between these values yields a measure I call the ‘Squick Factor’, the extent to which a place is at once wildly popular and absolutely filthy. The lower right-hand quadrant of this plot shows many such businesses, the worst of which are documented below.
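To make the ratio concrete, here is a minimal sketch. It assumes the review score is the numerator and the inspection score the denominator, so that a higher value means more popular and less clean; it also assumes a 1-to-5 review scale and a 0-to-100 inspection scale. The restaurant names and numbers are invented.

```python
def squick_factor(review_score, health_score):
    """Ratio of average review score to mean inspection score.

    The direction of the ratio is an assumption: larger values mean a
    place is well reviewed relative to how clean its kitchen is.
    """
    if health_score <= 0:
        raise ValueError("health_score must be positive")
    return review_score / health_score


# Hypothetical restaurants: name -> (average review, mean inspection score).
restaurants = {
    "Spotless Cafe": (4.5, 98.0),
    "Beloved Dive": (4.5, 60.0),
    "Forgettable Grill": (2.0, 80.0),
}

# Highest Squick Factor first: popular but filthy floats to the top.
ranked = sorted(
    restaurants,
    key=lambda name: squick_factor(*restaurants[name]),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {squick_factor(*restaurants[name]):.3f}")
```

Because the two scales differ so much, the raw number means little on its own; it is useful only for ranking businesses against one another.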
Data science is powerful because it maximizes leverage in the presence of finite resources. Health inspections serve a vital function in protecting the public from food-borne illness, but are expensive and time-consuming to perform. In light of this, my hope is that the data science community can contribute to the public welfare by applying statistical modeling techniques to the wealth of open data to which we have access.
Using a simple linear regression, for example, on restaurants’ postal code, category, and mean review site score, I produced models (10-fold cross-validated R-squared of ~0.22) that were able to predict, on average, a previously unseen restaurant’s future health inspection scores within 8.5 points. Additional data, such as whether the restaurant serves alcohol, its hours of operation, and the text of restaurant reviews, could surely improve these models’ accuracy. Equipped with such statistical tools, the Department of Public Health could prioritize the inspection of newly opened restaurants in terms of their likelihood of spreading food-borne illness, saving money, time, and potentially lives. It is my intention to see that these tools make it into their hands.
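For readers who want to try the general approach, here is a deliberately stripped-down sketch: a one-predictor least-squares line (review score only) evaluated with 10-fold cross-validation and reported as mean absolute error. It is a stand-in rather than the model described above, which also used postal code and category; the data is synthetic, and the coefficients used to generate it are arbitrary.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return y_bar - slope * x_bar, slope


def cross_validated_mae(xs, ys, k=10, seed=0):
    """Mean absolute error on held-out points across k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line(
            [xs[i] for i in train], [ys[i] for i in train]
        )
        # Score only the points the model never saw during fitting.
        errors.extend(
            abs(ys[i] - (intercept + slope * xs[i])) for i in held_out
        )
    return mean(errors)


# Synthetic data: review scores in [1, 5] and noisy inspection scores.
rng = random.Random(42)
reviews = [rng.uniform(1.0, 5.0) for _ in range(200)]
scores = [70.0 + 4.0 * r + rng.gauss(0.0, 8.0) for r in reviews]

mae = cross_validated_mae(reviews, scores)
print(f"10-fold cross-validated mean absolute error: {mae:.1f} points")
```

Sorting newly opened restaurants by their predicted score, lowest first, would then give one possible inspection order.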
The promise of data is an efficiently functioning society, in which critical decisions are made in the presence of meaningful and actionable information. As data scientists, we live in an exciting time, and occupy an especially privileged position. It is our obligation to harness our abilities in service of the public good, such that we all may benefit from the hidden structure of the modern world.