In this exercise we'll load a custom notebook into Zeppelin. This notebook ingests data on wild mushrooms and uses machine learning algorithms to determine whether the mushrooms are edible or poisonous.
After finishing the Apache Spark in 5 Minutes notebook, go back to the Zeppelin dashboard and click Import note. A modal box will pop up. Give the note a name in the Import As text box, click Add from URL, and paste in the following URL: https://raw.githubusercontent.com/kenmoini/RandomForestMushroomClassifier/master/Mushroom%20Classifier%20-%20Scala.json
Click Import note.
Now before we continue, we need to reset and reorder the Spark interpreters.
Here we're simply setting up a Spark 2 dependency to load in the spark-csv package.
Next we're building the Spark 2 session and setting a few configuration options, such as the application name and Hive support, then returning the Spark 2 version once the Spark session object has been created.
Here we're using the shell interpreter to execute some bash and wget the MushroomDatabase.csv file.
From there we're adding the file to HDFS.
You may recall uploading this MushroomDatabase.csv file into HDFS via Ambari earlier. That's right, and this paragraph does it again. It's a nice way to ensure the data is present before processing.
This paragraph creates a data set and data frame.
The CSV is loaded into the data frame. The header option is set to true because the imported CSV has a header row, and the schema is otherwise inferred automatically.
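To make "header row plus inferred schema" concrete, here is a minimal plain-Python sketch of the idea. It is not the notebook's Scala code and does not use Spark. The sample CSV text, its column names, and the helpers `infer_type` and `load_csv` are all made up for illustration.

```python
import csv
import io

# Toy CSV text standing in for MushroomDatabase.csv. The column names and
# rows are invented for illustration and are NOT taken from the real file.
SAMPLE_CSV = """id,odor,cap_diameter_cm,poisonous
1,almond,5.5,false
2,foul,7.0,true
3,none,4.2,false
"""

def infer_type(values):
    """Return the narrowest type name that fits every value in a column."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    if all(v.lower() in ("true", "false") for v in values):
        return "boolean"
    return "string"

def load_csv(text):
    """Read CSV text whose first row is a header and infer a type per column."""
    reader = csv.DictReader(io.StringIO(text))  # first row becomes the column names
    rows = list(reader)
    schema = {name: infer_type([row[name] for row in rows])
              for name in reader.fieldnames}
    return rows, schema

rows, schema = load_csv(SAMPLE_CSV)
print(schema)
```

Without the header option, the first line would be treated as data instead of column names. Without inference, every column would simply stay a string.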
We then create a view of this data frame so that we can query it with SQL, just as the next paragraph does.
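The "register a view, then query it with SQL" pattern can be sketched outside Spark with Python's built-in sqlite3 module. This is only an analogy to what the notebook does with its data frame. The table name, view name, columns, and rows below are invented for illustration.

```python
import sqlite3

# Toy rows invented for illustration; they are NOT real mushroom data.
toy_rows = [
    ("almond", "edible"),
    ("foul", "poisonous"),
    ("none", "edible"),
    ("none", "poisonous"),
    ("none", "edible"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mushroom_data (odor TEXT, label TEXT)")
conn.executemany("INSERT INTO mushroom_data VALUES (?, ?)", toy_rows)

# Expose the table under a view name, then query the view with ordinary SQL.
conn.execute("CREATE VIEW mushrooms AS SELECT odor, label FROM mushroom_data")
counts = conn.execute(
    "SELECT odor, label, COUNT(*) AS n FROM mushrooms "
    "GROUP BY odor, label ORDER BY odor, label"
).fetchall()

for odor, label, n in counts:
    print(odor, label, n)
conn.close()
```

A grouped count like this is the kind of query that can sit behind a chart such as Odor by Poisonous.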
This is where you can start to realize the power of big data and large distributed data sets. With a wide and deep enough data set we can start to draw some very keen correlations across different properties.
Take a close look at the Odor by Poisonous paragraph. Here we can tell that smell almost always gives a definitive answer as to whether the mushroom is poisonous. For instance, if it smells spicy or like creosote (?), it's most certainly poisonous. There's an outlier in the None column, though: most odorless mushrooms are safe, but enough odorless ones are poisonous to raise concern.
Let's see if we can do better than this and find a more reliable way to classify these mushrooms.
Here we're using one-hot encoding to transform our data. We're taking our categorical variables and converting them into a form that can be provided to an ML algorithm. To do that, we need to create some indexes, a lot of indexes.
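As a rough illustration of the two steps, indexing and then one-hot encoding, here is a plain-Python sketch. It is not the notebook's code. Spark ML's own indexer and encoder make different choices, for example about index order, so treat this purely as the concept. The helpers `build_index` and `one_hot` and the sample values are hypothetical.

```python
def build_index(values):
    """Map each distinct category to an integer index, in first-seen order."""
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index

def one_hot(value, index):
    """Return a 0/1 list with a single 1 at the position assigned to `value`."""
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec

# Invented sample values for one categorical column.
odors = ["almond", "foul", "none", "foul"]
odor_index = build_index(odors)
encoded = [one_hot(o, odor_index) for o in odors]

for odor, vec in zip(odors, encoded):
    print(odor, vec)
```

Every categorical column needs its own index, which is why the notebook ends up creating so many of them.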
A vector, simply put, is one row of input data, since what we're using here is a feature vector. Here, for instance, we're assembling a feature vector from the different features of the mushrooms.
We're then transforming our data frame into a feature vector, taking the first two results and displaying them, then printing the schema of the vector.
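The assembly step can be pictured as concatenating each row's encoded features into one flat list. The plain-Python sketch below shows that idea under made-up assumptions. The feature names, category lists, rows, and the `assemble` helper are hypothetical, and this is not the notebook's Scala code.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a 1 where `value` matches a known category."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical feature columns and their category lists (illustration only).
CATEGORIES = {
    "odor": ["almond", "foul", "none"],
    "cap_color": ["brown", "white"],
}

def assemble(row):
    """Concatenate the per-feature one-hot vectors into one flat feature vector."""
    features = []
    for name, cats in CATEGORIES.items():
        features.extend(one_hot(row[name], cats))
    return features

# Invented rows, NOT real mushroom data.
rows = [
    {"odor": "foul", "cap_color": "white"},
    {"odor": "none", "cap_color": "brown"},
    {"odor": "almond", "cap_color": "brown"},
]
vectors = [assemble(r) for r in rows]

# Show the first two results, as the notebook paragraph does.
for vec in vectors[:2]:
    print(vec)
```

Each row becomes one vector whose length is the total number of categories across all features, five in this toy case.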