This sample takes a restaurant violation dataset from the NYC Open Data portal and processes it using Spark.NET. The processed data is then used to train a machine learning model that attempts to predict the grade an establishment will receive after an inspection. The model is trained using ML.NET, an open-source, cross-platform machine learning framework. Finally, the trained model is used to enrich records that have no grade yet by assigning each one an expected grade.
For a detailed write-up, check out the Restaurant Inspections ETL & Data Enrichment with Spark.NET and ML.NET Automated (Auto) ML blog post.
This project was built using Ubuntu 18.04 but should work on Windows and Mac devices.
- .NET Core 2.1
- Java 8
- Apache Spark 2.4.1 with Hadoop 2.7
- .NET Spark Worker 0.4.0
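Before building, it can help to confirm that the command-line tools behind the prerequisites above are reachable. The following is a minimal sketch, not part of the original project: `check_tool` is a hypothetical helper name, and the script only checks that `dotnet`, `java`, and `spark-submit` are on the `PATH`. It does not verify the versions listed above.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found:   $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Count how many of the tools used by this sample are missing.
missing=0
for tool in dotnet java spark-submit; do
  check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"
```

If anything is reported missing, install it and make sure its `bin` directory is on the `PATH` before continuing.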
The dataset used in this solution is the DOHMH New York City Restaurant Inspection Results and comes from the NYC Open Data portal. It is updated daily and contains assigned and pending inspection results and violation citations for restaurants and college cafeterias. The dataset excludes establishments that have gone out of business. Although the dataset contains several columns, only a subset of them are used in this solution. For a detailed description of the dataset, visit the site.
This solution is made up of different .NET Core applications:
- RestaurantInspectionsETL: .NET Core Console application that takes raw data and uses Spark.NET to clean and transform the data into a format that is easier to use as input for training and making predictions with a machine learning model built with ML.NET.
- RestaurantInspectionsML: .NET Core Class Library that defines the input and output schema of the ML.NET machine learning model. The trained model is also saved in this project.
- RestaurantInspectionsTraining: .NET Core Console application that uses the graded data generated by the RestaurantInspectionsETL application to train a multiclass classification machine learning model using ML.NET's AutoML.
- RestaurantInspectionsEnrichment: .NET Core Console application that uses the ungraded data generated by the RestaurantInspectionsETL application as input for the trained ML.NET machine learning model to predict what grade an establishment is most likely to receive based on the violations found during inspection.
git clone https://github.com/lqdev/RestaurantInspectionsSparkMLNET.git
Before building the code, update the location of the solution in the RestaurantInspectionsTraining and RestaurantInspectionsEnrichment projects. In each project, replace the value of solutionDirectory with the path where your solution is saved.
Original:
string solutionDirectory = "/home/lqdev/Development/RestaurantInspectionsSparkMLNET";
New:
string solutionDirectory = "<YOUR-SOLUTION-PATH>/RestaurantInspectionsSparkMLNET";
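The same replacement can be scripted rather than done by hand. This is a rough sketch under two assumptions that the text above does not confirm: that the `solutionDirectory` assignment lives in a file named `Program.cs` in each project, and that the script is run from the root of the cloned repository. `set_solution_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the hard-coded solutionDirectory assignment in a C# file.
# Assumes the new path contains no '|', '&' or '\' characters, which are special to sed here.
# A backup of the original file is kept next to it with a .bak suffix.
set_solution_dir() {
  file="$1"
  new_path="$2"
  sed -i.bak \
    "s|string solutionDirectory = \".*\";|string solutionDirectory = \"$new_path\";|" \
    "$file"
}

# Assumption: the assignment is in Program.cs of each project; adjust the file name if not.
SOLUTION_DIR="${SOLUTION_DIR:-$(pwd)}"
for project in RestaurantInspectionsTraining RestaurantInspectionsEnrichment; do
  target="$project/Program.cs"
  if [ -f "$target" ]; then
    set_solution_dir "$target" "$SOLUTION_DIR"
    echo "updated $target"
  else
    echo "skipped $target (not found)"
  fi
done
```

Set `SOLUTION_DIR` explicitly if the script is run from somewhere other than the repository root.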
dotnet publish -f netcoreapp2.1 -r ubuntu.18.04-x64
dotnet build
dotnet publish -f netcoreapp2.1 -r ubuntu.18.04-x64
From the project directory run the application with spark-submit.
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish/microsoft-spark-2.4.x-0.4.0.jar dotnet bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish/RestaurantInspectionsETL.dll
dotnet run
Navigate to the publish directory. In this case, it's bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish.
From the publish directory, run the application with spark-submit.
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.4.0.jar dotnet RestaurantInspectionsEnrichment.dll
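Both spark-submit invocations in this README share the same shape: the DotnetRunner class, a local master, the microsoft-spark jar, and then `dotnet` followed by the application DLL. The sketch below factors that into one function. `spark_submit_dotnet` and the `DRY_RUN` variable are hypothetical names introduced here for illustration, not part of the project. By default the function only prints the command, so it can be compared against the ones above before anything is run.

```shell
#!/bin/sh
# Jar name taken from the commands above.
SPARK_JAR="microsoft-spark-2.4.x-0.4.0.jar"

# Hypothetical helper: assemble the spark-submit command for a published app.
#   $1 = path to the publish directory (use "." when already inside it)
#   $2 = name of the application DLL
spark_submit_dotnet() {
  publish_dir="$1"
  set -- spark-submit \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    "$publish_dir/$SPARK_JAR" \
    dotnet "$publish_dir/$2"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"   # dry run: print the command only
  else
    "$@"        # execute the command
  fi
}

# ETL, as run from its project directory.
spark_submit_dotnet bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish RestaurantInspectionsETL.dll
# Enrichment, as run from inside its publish directory.
spark_submit_dotnet . RestaurantInspectionsEnrichment.dll
```

Set `DRY_RUN=0` to execute the commands instead of printing them. The two calls are meant to be run from different working directories, as described above.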