Exercise 4 - Custom Notebooks


In this exercise we'll load a custom notebook into Zeppelin. The notebook ingests data about wild mushrooms and uses some Machine Learning algorithms to determine whether the mushrooms are edible or poisonous.

Add a notebook to Zeppelin

Note the two steps marked with numbers

After finishing the Apache Spark in 5 Minutes notebook, go back to the Zeppelin dashboard and click Import note. A modal box will pop up; give the note a name in the Import As text box, then click Add from URL and paste in the following URL:

https://raw.githubusercontent.com/kenmoini/RandomForestMushroomClassifier/master/Mushroom%20Classifier%20-%20Scala.json

Click Import note, then click into the newly added notebook to launch it.

Reset interpreters

Make note of the order, and restart both Spark interpreters

Now before we continue, we need to reset and reorder the Spark interpreters.

  1. In the upper right-hand corner of the notebook's title bar, click the Cog icon to open the Interpreter Bindings pane.
  2. Reorder spark2 so that it sits above spark.
  3. To the left of both spark and spark2, click the Refresh button to restart them.
  4. Click Save to save and close the Interpreter Bindings pane.

Mushroom Workbook

What would Bear do?
Ever wonder whether you could tell if a mushroom is poisonous just by the way it looks or smells? We can do just that!

This notebook will take us through the following processes:
  1. Intro
  2. Setup
  3. Marshalling data
  4. Distribution Analysis
  5. Creating and training RandomForest ML models
  6. Evaluating the model and displaying results
Make sure to expand the code and results sections of your paragraphs.

Dependencies, sessions, and HDFS

Three paragraphs that help set things up

Load Dependency Libraries

Here we're simply setting up a Spark 2 dependency to load in the spark-csv package.
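
Roughly, that dependency paragraph looks something like the following; the interpreter directive and the exact artifact coordinates and version are assumptions here, so check the paragraph in the notebook itself:

  %dep
  // Load the spark-csv package before the Spark 2 interpreter starts
  // (artifact coordinates and version are assumptions for illustration)
  z.reset()
  z.load("com.databricks:spark-csv_2.11:1.5.0")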

Initialize Spark Session

Next we're building the Spark 2 session and setting a few configuration points such as application name, enabling Hive support, and returning the Spark 2 version after the Spark session object has been created.
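
A minimal sketch of that session setup (the application name below is just a placeholder):

  %spark2
  import org.apache.spark.sql.SparkSession

  // Build (or reuse) a Spark 2 session with Hive support enabled
  val sparkSession = SparkSession.builder()
    .appName("MushroomClassifier")   // placeholder application name
    .enableHiveSupport()
    .getOrCreate()

  // Return the Spark 2 version to confirm the session is up
  sparkSession.version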

Download Datafile and Copy to HDFS

Here we're using the shell interpreter to execute some bash and wget the MushroomDatabase.csv file.

From there we're adding the file to HDFS.
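
The gist of that shell paragraph is along these lines; the download URL and HDFS target path below are placeholders, not the ones used in the notebook:

  %sh
  # Download the CSV locally (placeholder URL)
  wget -O /tmp/MushroomDatabase.csv https://example.com/MushroomDatabase.csv

  # Copy the file into HDFS so Spark can read it
  hdfs dfs -mkdir -p /tmp/mushrooms
  hdfs dfs -put -f /tmp/MushroomDatabase.csv /tmp/mushrooms/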


You may recall uploading this MushroomDatabase.csv file into HDFS via Ambari earlier. That's correct, and this paragraph does it again; it's a nice way to ensure the data is present before processing.

Load into data frame, test with SQL

Import data from the CSV into a data frame

This paragraph creates a data set and data frame.

The CSV is loaded into the data frame, the header option is set to true because the imported CSV has column headers, and the schema is otherwise inferred automatically.

We're creating a temporary view of this data frame, so it can now be queried with SQL, just as the next paragraph does.
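
A sketch of what that looks like in the Spark 2 context, assuming the session object and HDFS path from the earlier sketches and a view name of mushrooms:

  %spark2
  // Read the CSV from HDFS with headers and automatic schema inference
  val mushroomsDF = sparkSession.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/tmp/mushrooms/MushroomDatabase.csv")   // assumed HDFS path

  // Register a temporary view so the data can be queried with SQL
  mushroomsDF.createOrReplaceTempView("mushrooms")
  mushroomsDF.show(5)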

Distribution Analysis

Some would say these odds are good enough, but we need better

This is where you can start to realize the power of big data and large distributed data sets. With a wide and deep enough data set we can start to draw some very telling correlations across different properties.

Take a close look at the Odor by Poisonous paragraph. Here we can tell that smell almost always gives a definitive answer as to whether the mushroom is poisonous: for instance, if it smells spicy or of creosote, it's almost certainly poisonous. There's an outlier in the None column, though; odorless mushrooms are most often safe, but enough poisonous ones are odorless to raise concern.
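
That kind of distribution check boils down to a grouped SQL query over the view; a sketch, assuming column names of odor and class and the view registered earlier:

  %spark2
  // Count mushrooms by odor and by edible/poisonous class
  sparkSession.sql("""
    SELECT odor, class, COUNT(*) AS total
    FROM mushrooms
    GROUP BY odor, class
    ORDER BY odor, class
  """).show()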

Let's see if we can do better than this and find a more deterministic indicator for these mushrooms.

Indexes

The long listed output is trimmed in the middle

Here we're using one-hot encoding to transform our data: we're taking our categorical variables and converting them into a form that can be fed to an ML algorithm. To do that, we need to create some indexes, a lot of indexes.
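
As a rough sketch of the indexing step for a single column (the real notebook repeats this for every categorical column; the column names are assumptions):

  %spark2
  import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}

  // Convert a categorical string column into a numeric index...
  val odorIndexer = new StringIndexer()
    .setInputCol("odor")
    .setOutputCol("odorIndex")

  // ...then one-hot encode that index into a sparse vector column
  val odorEncoder = new OneHotEncoder()
    .setInputCol("odorIndex")
    .setOutputCol("odorVec")

  val indexedDF = odorIndexer.fit(mushroomsDF).transform(mushroomsDF)
  val encodedDF = odorEncoder.transform(indexedDF)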

Vectors

A vector, simply put, is one row of input data; in this case we're working with feature vectors. Here, for instance, we're assembling a feature vector from the different features of the mushrooms.

We're then transforming our data frame into a feature vector, taking the first two results and displaying them, then printing the schema of the vector.
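
A sketch of that assembly, reusing the encoded column from the previous sketch (the real notebook feeds in all of the encoded feature columns):

  %spark2
  import org.apache.spark.ml.feature.VectorAssembler

  // Combine the encoded feature columns into a single "features" vector column
  val assembler = new VectorAssembler()
    .setInputCols(Array("odorVec"))   // plus the rest of the encoded feature columns
    .setOutputCol("features")

  val featureDF = assembler.transform(encodedDF)
  featureDF.select("features").show(2, false)   // first two feature vectors
  featureDF.printSchema()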

Create Test & Train Sets

Here we're taking the feature vector we just created in the Spark 2 context and creating a test set and a train set.
What that means is that we're randomly splitting the group of vectors into an 80%/20% pool, creating the training set from the 80% split and the test set from the 20%. The percentages can differ, but these are common, standard values.
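
The split itself is a one-liner; the seed below is an arbitrary value added for reproducibility:

  %spark2
  // Randomly split the feature vectors into an 80% training set and a 20% test set
  val Array(trainingData, testData) = featureDF.randomSplit(Array(0.8, 0.2), 1234L)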

Train and Run Model

You can watch the progress bar go up and down as the Random Forest model is running
Now that we've got the floor set, let's dance.

We're creating a new Random Forest (or Random Decision Forest) classifier. The Random Forest learning method builds a number of randomized decision "trees" in our forest and combines their individual classification (or regression) results into an informed decision.
Next we're building the classifier and setting up a parameter grid with three forest sizes: 50, 150, and 300 trees.
Following that we're cross-validating, training the cross-validated model, and making some selections and predictions.
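
A condensed sketch of those paragraphs, assuming the edible/poisonous class has already been indexed into a numeric label column and using the features vector from the earlier sketch:

  %spark2
  import org.apache.spark.ml.classification.RandomForestClassifier
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  // Random Forest classifier over the assembled feature vectors
  val rf = new RandomForestClassifier()
    .setLabelCol("label")        // assumed indexed edible/poisonous column
    .setFeaturesCol("features")

  // Try forests of 50, 150, and 300 trees
  val paramGrid = new ParamGridBuilder()
    .addGrid(rf.numTrees, Array(50, 150, 300))
    .build()

  // Cross-validate over the parameter grid and train on the training set
  val cv = new CrossValidator()
    .setEstimator(rf)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  val cvModel = cv.fit(trainingData)

  // Run the best model against the held-out test set
  val predictions = cvModel.transform(testData)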

Evaluating the model

First, we're setting up two binary classification evaluators and evaluating discrimination and precision among the predictions. These values, called area-under-the-curve (AUC) values, are represented as percentages; the higher the value, the higher the level of assurance and precision in the model's predictions.
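
A sketch of that evaluation using Spark's binary classification evaluator for both metrics, assuming the predictions data frame from the previous sketch:

  %spark2
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

  // Area under the ROC curve (discrimination)
  val rocEvaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setMetricName("areaUnderROC")

  // Area under the precision-recall curve (precision)
  val prEvaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setMetricName("areaUnderPR")

  println(s"Area under ROC: ${rocEvaluator.evaluate(predictions)}")
  println(s"Area under PR:  ${prEvaluator.evaluate(predictions)}")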

Ranking evaluated features

Here we can see how named indexes are ranked
Now that the model has been run and our data set has been evaluated with a series of Random Forest classification models, we can see the ranked importance of the determined features. The classifier is fit with the test set, the features are gathered, and they are then sorted and printed in order of determined importance and weight.
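
A sketch of pulling those rankings out of the best cross-validated model; the notebook prints the features by name, while this simplified version just uses the feature vector index:

  %spark2
  import org.apache.spark.ml.classification.RandomForestClassificationModel

  // The best model selected by cross-validation is a Random Forest model
  val bestModel = cvModel.bestModel.asInstanceOf[RandomForestClassificationModel]

  // featureImportances holds one weight per assembled feature dimension;
  // sort descending and print the top ten
  bestModel.featureImportances.toArray.zipWithIndex
    .sortBy(-_._1)
    .take(10)
    .foreach { case (importance, idx) => println(f"feature $idx%-4d $importance%.4f") }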

Final computation

Much better than simple distribution analysis
Just as we've done before with the Apache Spark 2 context, we can easily query the computed results with a standard SQL query.

The following paragraph gathers the computed results and echoes out the count of determined edible and poisonous mushrooms.
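
A sketch of that final query, assuming the predictions data frame is registered as a view; which numeric prediction maps to edible versus poisonous depends on how the labels were indexed:

  %spark2
  // Expose the predictions to SQL and count predicted mushrooms per class
  predictions.createOrReplaceTempView("predictions")

  sparkSession.sql("""
    SELECT prediction, COUNT(*) AS total
    FROM predictions
    GROUP BY prediction
  """).show()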

The key to this final paragraph is that earlier in the notebook we saw some distribution analysis that gave great insight into the data at large, but it could not give a high level of assurance as to whether a mushroom would be poisonous based on the features it had. Now that we've computed this data set with Machine Learning functions, we have much better peace of mind when evaluating the toxicity of fungi!

