Join us for a complimentary webinar to learn more about Apache Spark’s MLlib, which makes machine learning scalable and easier with ML pipelines built on top of DataFrames.
In this webinar, we will go over an example from the ebook Getting Started with Apache Spark 2.x: predicting flight delays using Apache Spark machine learning.
Flight delays create real problems in scheduling, passenger inconvenience, and economic losses, and there is growing interest in predicting flight delays in order to optimize operations and improve customer satisfaction. For example, Google Flights uses historic flight status data with machine learning algorithms to find common patterns in late departures in order to predict flight delays and share the reasons for those delays.
In this webinar, you will:
Review Machine Learning Classification and Random Forests
Use Spark SQL and DataFrames to explore real historic flight data
Use Spark DataFrames with Spark ML pipelines
Predict Flight Delays with Apache Spark ML Random Forests
Use Zeppelin to run Spark commands, visualize the results and discuss what features contribute the most to Flight Delays
00:00 So in this talk we’re going to go over getting started with Spark 2.x Machine Learning Pipelines to Predict Flight Delays. The agenda looks like this. First we’re going to go over an introduction to machine learning classifications. Then we’re gonna explore the flight dataset with Spark DataFrames. And finally we’re going to use a random forest algorithm to predict flight delays.
00:25 First we’ll go over machine learning. So first of all, what is artificial intelligence? Actually the definition of AI is continuously being redefined. It’s an umbrella term, but the idea started in the 50s. And machine learning is a subset of AI, and deep learning is a subset of machine learning. AI is a broad term applying to any technique that enables computers to mimic human intelligence.
00:52 In the late 80s, AI was also a very hot topic. At that time the latest technology was all about expert systems. Expert systems capture an expert’s knowledge in if-then-else rules, which are passed through an inference engine to answer questions.
01:09 Rules engines have had wide use in industries such as finance and healthcare. And they have some advantages, but they also have some disadvantages. Defining the rules is a manual process, and the rules can become difficult to update and maintain. Also, machine learning has the advantage that it learns from your data. And it can give finer-grained, data-driven, probabilistic decisions.
01:32 So the example in this image here is from an article. And it was about how this patient got an overdose, 38 times his dosage, because the doctors and the pharmacists kept getting these alerts from the rules engine too often. So if you have rules that are giving you alerts all the time, then you quit paying attention. So that’s why it’s good to have finer grained predictions so that you’ll pay attention to the important ones. So what is machine learning? Machine learning uses algorithms to find patterns in your data. And it trains an algorithm and then it uses a model that recognizes those patterns to make predictions on new data.
02:18 In general, machine learning can be broken down into two classes of algorithms: supervised and unsupervised. With supervised learning you have a known outcome that you want your data to predict. With unsupervised learning, you don’t have the output or known outcome in advance. There you have to make sense of the data without any hints.
02:40 Supervised algorithms use labeled data, in which both the input and the target outcome, or label, are provided to the algorithm. And they use this to build the model. And then, with new data, they can predict the label.
02:56 Gmail uses a machine learning technique called classification to identify if an email is spam or not based on the data of an email. Classification is a family of supervised machine learning algorithms that identify which category an item belongs to, for example, whether a transaction is fraud or not, based on labeled examples of known items, for example, transactions known to be fraud or not. Some common use cases for classification include credit card fraud detection, email spam detection, and sentiment analysis.
03:31 Classification takes a set of data with known labels and predetermined features and learns how to label new records based on that information. The features are the if questions that you ask and the label is the answer to those questions. In this example, if it walks, swims, and quacks like a duck, then the label is duck.
03:49 So let’s go through an example. Let’s go through a credit card fraud example. What are we trying to predict? In this case, we’re trying to predict is the credit card transaction fraud or not. And this is the label. What are the if questions or properties that we could use to predict this? Some examples are here. So you could use the number of transactions in the last 24 hours. The total amount in the last 24 hours. The average amount in the last 24 hours compared to the transaction history. The location and time difference since the last transaction. These are all the if questions or features that you could use to predict.
04:26 And here’s an example of classification using logistic regression. Logistic regression measures the relationship between the Y label and the X features by estimating the probabilities using a logistic function. The model predicts a probability, which is used to predict the label class. Here on the X we have our features, transaction amount and time/location, for example. Then on the Y we have the probability of fraud.
04:56 This shows again, here we’re building the model. In this case, the model is this logistic function. Which is then gonna be used with the features to predict the Y label.
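As a minimal sketch of the idea (plain Python, not the Spark API), the logistic function squashes a weighted sum of the features into a probability, and a threshold turns that probability into a label. The `weight`, `bias`, and `threshold` values below are invented for illustration; in practice they would be learned from labeled data.

```python
import math

def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_fraud(amount, weight=0.004, bias=-4.0, threshold=0.5):
    """Return (probability, label) for a single feature: the transaction amount.

    weight and bias are made-up values, not coefficients learned from data.
    """
    probability = logistic(weight * amount + bias)
    label = 1 if probability >= threshold else 0
    return probability, label

small_p, small_label = predict_fraud(100.0)   # small amount, low probability
large_p, large_label = predict_fraud(2000.0)  # large amount, high probability
```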
05:10 Let’s get to another example. This time we’re gonna go through house price prediction. What are we trying to predict in this case? In this case, the label is gonna be the house price and the if questions, or the features, are going to be, in this case a very simple one, just the size of the house. Here we’re using regression. A linear regression models the relationship between the Y label and the X feature. In this case, the relationship between the house price and the size. Regression is a line where Y is the label we want to predict and X is the input feature. So for each of these little dots, we have a dot based on the X and Y, and then the regression draws this line. So that’s the function that it draws. And this is the function that we have down here. And here the coefficient is going to measure the impact of the feature, which is the house size.
06:10 Again, in our supervised learning here, we’re building the model of linear regression. Which is going to be used to predict the house size. The house price, sorry. The house price.
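The line-fitting step can be sketched in plain Python with ordinary least squares for a single feature. The house sizes and prices are made-up numbers chosen so the fit is easy to check by hand; `fit_line` is a name used only for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance          # the coefficient: impact of the feature
    intercept = mean_y - slope * mean_x
    return intercept, slope

sizes = [1000, 1500, 2000, 2500]            # made-up house sizes
prices = [200000, 300000, 400000, 500000]   # made-up house prices
intercept, slope = fit_line(sizes, prices)
predicted_price = intercept + slope * 1800  # predict the label for a new house
```

With this toy data the coefficient comes out at 200 per unit of size, so the 1800-unit house is predicted at 360000.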
06:23 Here are some examples of classification. A retail example is predicting sales or price. In telecom, an example could be predicting if a customer will churn or not. A healthcare example is predicting the probability of readmission. A marketing example is predicting the probability that a customer will click on an ad.
06:45 So that was a brief overview of classification. In a future webinar we’re going to go over clustering. So that would be unsupervised learning. What we’re going to do now, next, is we’re gonna explore the flight delay dataset. This is how Spark runs an application on your cluster. It runs as independent executor processes. Here we have the executor processes. And the cluster manager assigns tasks to workers, one task per partition. The Spark session is in your application. That’s going to coordinate your application with the resource manager and the executors.
07:24 And then, at each task, these run in threads inside the JVM. Each executor is a JVM. So the task applies its unit of work to the dataset in its partition and outputs a new partition.
07:39 With Spark 2.0, datasets and dataframes were merged together. What a dataset is, is a distributed collection of objects. It’s like a list except that it’s spread out across multiple nodes in the cluster. And a dataframe is like a table, except that it’s also spread out across a cluster. So it’s partitioned so that you have rows on certain nodes and other rows on other nodes, so that it can all be executed concurrently, in parallel.
08:15 In Spark 2.0, dataframe API was merged with a dataset API. This just shows the difference between this. So a dataset is a collection of typed objects. And you can use that function [inaudible 00:08:27] so the Spark functions with that and also SQL. And a dataframe is a dataset of generic row objects. You can just use only SQL with the dataframes. And they both provide an ease of use and also they’re much more efficient than the older RDDs that Spark has.
08:49 This is the dataset that we’re gonna be looking at. The dataset comes in JSON format and, as an example of it, we have an ID, the day of the week, the carrier, the origin airport, the destination airport, the scheduled departure hour, the scheduled departure time, the departure delay. This is what we’re going to use to predict. The scheduled arrival time, the arrival delay, we’re not going to use this one. And then the elapsed time. This is how long the trip is scheduled to take. And then the distance.
09:28 First what we’re going to do is we’re going to load the dataset and then we’re going to explore this dataset. This is how you load a dataset with Spark. You have your data on disk; in this case we’re going to load it from a file. It gets loaded by tasks and put into partitions in cache, and you’re going to have one task per partition.
09:52 In order to load the dataset, what we’re going to do is we’re going to specify the class. So this is the object corresponding to our schema. And then we’re also going to specify the schema. Since we’re using JSON, we could have the schema inferred, but you’re going to get a lot better performance if you actually specify the schema. So here we’re going to specify it explicitly.
10:16 And this is what it looks like, the code for loading the dataset. We’re using the Spark read. We’re not inferring the schema. Here we specify the schema that we just looked at. That was the schema. And also we’re specifying the case class. So this will be a dataframe and a dataset. If we left that off, then it would just be a dataframe.
10:40 And this is what it looks like. So if we do a DF show, we see that a dataframe looks like a table with columns and rows. And with dataframes you have what’s called transformations and actions. Transformations actually create just a new dataset and they return a new dataset. These are lazily evaluated. They’re not actually executed until you call an action. And actions return a value to the driver, which is your program.
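Lazy evaluation can be illustrated with a rough analogy in plain Python, where a generator plays the role of a transformation and `list(...)` plays the role of an action. This is only an analogy under that assumption, not how Spark is implemented; the rows are made up and `reads` exists only to show when data is actually touched.

```python
reads = []  # records each row at the moment it is actually read

def read_rows():
    """Yield made-up (destination, departure delay) rows, logging each read."""
    rows = [("SFO", 55.0), ("EWR", 10.0), ("ORD", 120.0)]
    for row in rows:
        reads.append(row)
        yield row

# "Transformation": describes the filter but does not run it yet.
delayed = (row for row in read_rows() if row[1] > 40)
reads_before_action = len(reads)

# "Action": forces evaluation and returns a value to the caller.
result = list(delayed)
reads_after_action = len(reads)
```

No rows are read until the action runs, which mirrors the way nothing executes on the cluster until an action is called.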
11:15 Here’s an example of an action. So here’s an action take one. We’re just taking one element object from our dataset. So this is returning that as an array of one. And we see that it consists of flight objects. Because it is a dataset. If it was just a dataframe, this would be rows, generic rows.
11:35 Some examples of transformations. As you see, you can have transformations like filter, group by, and where. These allow you to perform SQL-like operations. And also sorting, selecting, and joining. And here are the actions. You can collect. You wanna be careful with that one because it’s going to return all the elements in your dataframe. Count returns the number of elements. You can just get one or you can show 20.
12:04 Briefly we’re going to go over an example of loading a dataset so that you understand it. So here we’re reading a dataset from a file, like we showed before. We’re going to cache it and then count it. Nothing actually happens until we call count. That’s the action.
12:20 What happens then is the tasks are gonna get launched from the driver. Each task is going to read in a block from the distributed file into cache. Then it’s going to process this data and then return the results, which, in this case, is a long: 41 thousand 300.
12:42 Then we’re calling another action, in this case the show. This is going to then launch the task again on the worker nodes. But here the data is in the cache, so it’s going to just process the data from cache. Because we already cached it. We called DF cache. So it doesn’t have to retrieve it from the file system. And then it returns the results. In this case, the show shows the first 20 rows. So if you’re going to reuse a dataset, it can be faster to cache this and then you’ll have faster results.
13:16 So you see that Spark can be good for iterative algorithms because you can cache this in memory and then work on it repeatedly. Some algorithms, for example for machine learning, that need iterations are clustering and linear regression; we saw that it needs to calculate that line, and it iterates to do that. Graph algorithms also. So this can be useful for machine learning for anomaly detection and classification, and also for recommendations.
13:44 Now we’re going to explore our dataset with the dataframes. In this example, what we’re doing is we’re filtering for departure delays greater than 40 minutes. We’re grouping by the destination and then we’re counting. Then we’re ordering this by the count. This is going to show the destinations with the highest number of departure delays.
14:05 So here we see that San Francisco and Newark have the highest number of departure delays in our dataset. The dataset we’re using is actually January 2017 and I also limited the number of airports in this dataset.
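The filter, group, count, and order steps can be mirrored on a small in-memory list in plain Python. The rows and the `dest` and `depdelay` field names are invented for this sketch and are not taken from the webinar dataset.

```python
from collections import Counter

# Made-up rows standing in for the flight dataset.
flights = [
    {"dest": "SFO", "depdelay": 95.0},
    {"dest": "EWR", "depdelay": 60.0},
    {"dest": "SFO", "depdelay": 41.0},
    {"dest": "ATL", "depdelay": 5.0},
    {"dest": "EWR", "depdelay": 12.0},
    {"dest": "SFO", "depdelay": 50.0},
]

# Filter: keep departure delays greater than 40 minutes.
delayed = [f for f in flights if f["depdelay"] > 40]

# Group by destination and count.
counts = Counter(f["dest"] for f in delayed)

# Order by count, highest first.
ranked = counts.most_common()
```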
14:23 Here’s an example for the top longest departure delays. Here we’re selecting which columns we want. The carrier, origin and destination. Then we’re filtering for departure delay greater than 40. And then we’re ordering it by departure delay, the highest departure delays. So descending. Here we see, for example, the longest ones were for San Francisco to Chicago with American Airlines.
14:56 Next what we’re going to do is we’re going to register the dataset as a temporary view in order to explore this with SQL. And we’re also using a Zeppelin notebook here, which, with the SQL, makes it nice that you can visually explore your dataset. In this case, we’re looking at the average departure delay by carrier. So here we’re getting the average and we’re grouping this by carrier. We see that United Airlines has the highest averages.
15:29 So the point of exploring this data is we wanna find out which features might help to predict these departure delays. So we’re exploring it to see what things contribute to the departure delays. Here we have the count of departure delays by carrier. And here we see that United Airlines has … Here we’re defining a departure delay as greater than 40 minutes.
15:51 We see that United Airlines has the highest count. Here we’re getting the count of departure delays by day of the week. And here we see that Monday is number one. Mondays and Sundays have the highest count. Here we have the count of departure delays by hour of the day. And here we see that the highest departure delays, the highest one is five o’clock. So rush hour. The departure delays are really the worst in the rush hour, the later rush hour.
16:29 Here the departure delays by originating airport. So Chicago and Atlanta have the highest. And the count of departure delays by originating airport and destination. We see that Denver, San Francisco. This one is Chicago and San Francisco. Those were some of the highest ones.
16:53 Another thing that we can do is we can bucket this dataset. We can add a column for delayed or not delayed. And that can also help us explore this dataset. What we’re doing is we’re creating a bucketizer, which is a Spark object. And we’re setting the input column as departure delay. That’s the minutes of delay. And then we’re going to create an output column for whether it was delayed or not. So true or false. And zero is going to be less than 40 minutes and one is going to be equal to or greater than 40 minutes. So in that column. Then, with this bucketizer, we call transform on our dataset and that returns a dataset with this new delayed column.
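The bucketing rule can be sketched in plain Python with `bisect`. This is an illustration of the rule described in the talk, not the Spark Bucketizer itself: the split points 0, 40, and infinity are assumed from the 40-minute threshold, and the sketch does not handle values outside the splits.

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i + 1]) holding value.

    splits must be strictly increasing; values outside the splits are not handled.
    """
    return float(bisect_right(splits, value) - 1)

# Assumed split points: bucket 0 is under 40 minutes, bucket 1 is 40 or more.
splits = [0.0, 40.0, float("inf")]

delays = [0.0, 39.9, 40.0, 1440.0]                 # made-up departure delays
delayed = [bucketize(d, splits) for d in delays]   # the new "delayed" column
```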
17:36 Here’s an example. Here we’re counting the departure delays by origin and destination and we’re seeing delayed is zero. That means not delayed. And delayed true is orange. We see that we have a lot more that were not delayed than that were delayed. But we do see, again, that five o’clock had the most delayed. At seven o’clock we have a lot of flights that were not delayed. This is useful to see the delayed and not delayed by hour, for example.
18:11 Here we’re just going to get the count of delayed versus not delayed. We see that here we have a whole lot more not delayed than delayed. One thing that we can do is we can stratify our dataset in order to have more delayed data. In order for the training. So to give us more examples of delayed for the training.
18:33 What we can do is we can sample by. And here we’re going to keep all of the delayed examples and we’re only gonna keep 13% of the not delayed. So here we see that we have a more equal distribution now.
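Stratified sampling by label can be sketched in plain Python as follows. The fractions match the ones mentioned in the talk (keep all delayed rows, about 13% of the rest); the data, the seed, and the `sample_by` helper are made up for this illustration and are not the Spark implementation.

```python
import random

def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions[key(row)]]

# Made-up, imbalanced data: 100 delayed rows and 900 not delayed.
rows = [{"delayed": 1.0} for _ in range(100)] + [{"delayed": 0.0} for _ in range(900)]

fractions = {0.0: 0.13, 1.0: 1.0}  # keep 13% of not delayed, all of delayed
sample = sample_by(rows, lambda row: row["delayed"], fractions, seed=36)

kept_delayed = sum(1 for row in sample if row["delayed"] == 1.0)
kept_not_delayed = len(sample) - kept_delayed
```

Because the sampling is random per row, the not-delayed count lands near 13% of 900 rather than exactly on it.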
18:57 Now before we move on to the predicting, I just wanna show you what we just did in the Zeppelin notebook. I’m going to switch over to … Here I have this in the Zeppelin notebook and I just wanna go over what we just went over in a Zeppelin notebook. This is actually the Zeppelin notebook running as, what’s called, our data science refinery. And it’s running on a MapR cluster. But you can also run this on your laptop with Sandbox or MapR has what’s called Developer Container.
19:35 First of all, what we’re doing in our notebook is we’re importing what we need to import. Here we’re defining the schema that we went over with our case class and our schema. Then we’re reading the data from the JSON file into the dataset. And this is what our dataset looks like.
19:54 Then another thing that we’re doing here is we’re adding a column. With this column, what we’re adding is an origin destination column, which shows the origination and the destination. And we’re making that from these two columns. So we’re creating a new column. For example here, this is originating in Atlanta and going to Boston. So we add that column.
20:19 Here we’re printing out our schema. Then we’re showing just the count of our origination destination. The count for each one of these. Showing just the top five rows. Here we’re performing some example statistics of the distance. So the time for each trip, the average. So we’re getting the count, the mean, the standard deviation, the min and the max, for the departure delays and the arrival delays.
20:52 Here we’re doing that bucketizer, which I talked about, to add a column for delayed or not delayed. And we’ve been getting the count for each one of these. So that’s how many delayed flights we have, not delayed. And this is how many delayed we have.
21:07 Now we’re registering this as a temporary view in order to explore it with SQL. Here we have the top five longest departure delays. Which we see the longest one was a thousand 440 minutes. Here we have the average departure delay by carrier. Which is United Airlines had the highest average. Here we have the maximum and American Airlines had the highest maximum.
21:36 Here we have the average departure delay by destination, which was San Francisco had the highest average. The average departure delay by origin, which also was San Francisco. And here we have the average departure delay by day of the week. Which this also has Monday and Sunday as the highest average.
22:02 Now we’re moving into the count of departure delays by day of the week. Again, this is Monday and Sunday had the highest. The count of departure delays by carrier. And the count of delayed and not delayed by carrier. So here we see that we still have here that zero is not delayed. We still have a lot more not delayed.
22:34 Here we have the count of departure delays by hour of the day. And here we have the count of delayed and not delayed by hour. We won’t go over all these since we went over a lot of them already.
22:52 Here are the count of departure delays by origin and destination. And then here, again, we’re going to stratify this in order to get a better sampling.
23:04 So now we’ll go back to the slides. Next what we’re going to look at is we’re going to look at building our model. We’re going to take the data, split it into training and test sets. We’re going to use the training set to build the model and a test set to evaluate the results. And then we test the results. And we’re going to do this repeatedly until we have the model that we want to deploy.
23:49 Again, looking at what are we trying to predict. This is the label. We’re trying to predict if it was delayed or not. In this case, we’re going to have delayed is equal to true if the delay was greater than 40 minutes.
24:01 So the properties that we can use to predict, these are the features. We’re going to use the day of the week, the scheduled departure time, the carrier, the elapsed time, the origin airport, the destination airport, and the distance.
24:22 We’re going to use a random forest for the predictions. The random forest uses decision trees. First we’ll go over what a decision tree is. Decision trees create a model that predicts a class by evaluating a set of rules that follow an if-then-else pattern. The if-then-else feature questions are the nodes, and the answers, true or false, are the branches in the tree to the nodes.
24:44 A decision model estimates the minimum number of true false questions needed to assess the probability of making a correct decision. So the difference between this and the if-then-else questions of a rules engine is that these are based on your data and it’s based on probabilities. And the machine learning algorithm learns these probabilities.
25:09 Here’s a simplified decision tree for flight delays. Here we have, for example, if the scheduled departure time was less than ten o’clock, if that’s false, if the origin was not in Boston or Miami, then if the destination was not in Boston or Miami, then true it was delayed. So that’s an example. You go through these true or false questions and then the end gives you the answer.
25:38 Here’s another example of the questions. If the origin airport is less than ten o’clock, if the origin is set, if the day of the week is Monday or Sunday, then it’s true if it’s one or false if it’s delayed. So that’s how the decision trees work.
25:54 And what a random forest is, is a model that consists of multiple decision trees. And each one is based on different subsets of data at the training stage. And the predictions are made by combining the output from all the trees, which reduces the variance and improves the predictive accuracy.
26:15 Each tree’s prediction is counted as a vote for one class, and the label is predicted to be the class which receives the most votes.
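The voting step can be sketched in plain Python. The three "trees" below are hand-written toy rules with invented thresholds, standing in for trees that a real random forest would learn from different subsets of the training data.

```python
from collections import Counter

def forest_predict(trees, features):
    """Each tree votes for a class; the class with the most votes wins."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy stand-ins for learned decision trees (rules are invented, not learned).
trees = [
    lambda f: 1 if f["hour"] >= 17 else 0,
    lambda f: 1 if f["origin"] in {"ORD", "ATL"} else 0,
    lambda f: 1 if f["day"] in {1, 7} and f["hour"] >= 12 else 0,
]

evening_flight = {"hour": 17, "origin": "SFO", "day": 1}
morning_flight = {"hour": 8, "origin": "ORD", "day": 3}

evening_label = forest_predict(trees, evening_flight)  # votes: 1, 0, 1
morning_label = forest_predict(trees, morning_flight)  # votes: 0, 1, 0
```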
26:26 What you’re doing when you’re feeding your features into these algorithms is you have to create what’s called feature vectors. Vectors are numbers that represent the value of each feature. So you have to transform all your features into numbers and then you’re going to use this for training and model evaluation.
26:46 So that’s what we’re going to go through now. We’re gonna go through this feature extraction first. And Spark provides a workflow. It provides a set of APIs built on top of dataframes for machine learning workflows. We’re going to use a transformer to transform a dataframe into another dataframe with a features vector column. And then we’re going to use an estimator to train on a dataframe to produce a model. And we’re gonna combine these into pipelines.
27:16 What’s important about these pipelines is we call fit on the pipeline. And that’s going to give us a pipeline for using in production and also for testing. That’s going to give us our model. And we call transform on this to get the predictions to evaluate. So the important ones are fitting our model and then using transform for evaluating the model.
27:42 In more detail, this is what our pipeline’s gonna look like. And we’re gonna start off with the top part. We’re going to build this top part of the pipeline. Then we call fit on this and that’s going to give us a pipeline model which we’re going to call transform to get the predictions for the final evaluation. This looks complicated, but we’ll go through it.
28:03 First we’re going to go through extracting the features. For extracting the features we’re going to build this transformer estimator pipeline. Which consists of multiple transformers and an estimator.
28:21 The first thing that we’re gonna use is called a StringIndexer. And what a StringIndexer does is it maps strings to numbers. Here we have an example of our carrier string. And what the StringIndexer does is when you call transform, it’s going to return numbers that map to each one of these strings. For example, Delta will be number one and American Airlines will be number two, for example. What we’re going to do is we’re going to create StringIndexers for all our categorical features. The categorical features are the ones that have strings. So the carrier, origin, destination, day of the week and origin destination. We’re going to create StringIndexers for all of these. What we’re doing here, for all these, we’re mapping through these and we’re using the column names as the input column and then the output column is going to be the column name plus index. And then we call fit to train this. Then we’re going to put these StringIndexers into our pipeline. We’ll show that later.
29:44 We’re going to use a bucketizer to transform and add the label zero or one. We saw how the bucketizer works before. What we do is we set the input column, which is the departure delay, the output column is going to be a label and the splits are going to be zero. So this is the departure delay of zero and that’ll be one. And then the departure delay … This is how it’s gonna split it up. So less than 40 minutes is gonna be a label of zero. Greater than or equal to 40 minutes is gonna be a label of one. In our pipeline, that’s gonna create this label of zero or one. And we’re gonna put that in the pipeline too. And then we’re gonna use a vector assembler in our pipeline. This is also a transformer that’s gonna combine a list of columns into a single vector column. What we do is we specify our list of columns here. We have the column names. So those are the input columns. And the output column is the features. So the vector assembler is going to create our feature vector in a column, which we can then pass in for the training. This is a feature vector inside of a column.
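A string indexer can be sketched in plain Python as a fit step that builds a string-to-number mapping and a transform step that applies it. This sketch orders labels by descending frequency, starting at 0, which is assumed here to match the StringIndexer default; breaking ties alphabetically is a choice made only for this illustration. The carrier codes are made up.

```python
from collections import Counter

def fit_string_indexer(values):
    """Build a label-to-index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

carriers = ["DL", "AA", "DL", "UA", "AA", "DL"]  # made-up carrier column

index = fit_string_indexer(carriers)             # the "fit" step
carrier_indexed = [index[c] for c in carriers]   # the "transform" step
```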
31:05 Next we’re creating a random forest estimator to train on our model. Here we specify we’re creating a random forest classifier. We specify the label column, which is that one that gets created by the bucketizer. And the features column is the one that gets created by the vector assembler. Those two columns are gonna be used by the classifier which is then going to return our model.
31:35 And then what we’re going to do is put all of these into a pipeline. So here we’re defining our stages, which are the string indexers, the labeler, the vector assembler, and the random forest. So we’re putting all these as the stages in a pipeline. This is gonna pass the data through the transformers to extract the features and label, and then pass this to the random forest estimator to fit the model when we call fit. So all this is actually an estimator pipeline that we call fit on. So what the fit is going to do is it’s going to use the data to train the model. Typically what you want to do when you train the model is you are going to have a training set and a test set. With the training set, you pass in the features and the labels to train, and then with the test set, you use the features and the predictions to test the accuracy. So you pass in the features to the model and then you test the actual label against the predictions.
32:47 A common technique for model selection is K-fold cross-validation, where the data is randomly split into K training and test dataset pairs. With K equal to 3, for example, each pair uses 2/3 of the data for training and 1/3 for testing. So what you first do is you split the data into training and test sets with the K-fold cross-validation. Then the training set is used to train the model. The test set is used to test the model predictions. So here you would test the prediction against the actual label, and then you’re going to train and test this K number of times for the cross-validation. So you repeat this process K number of times. And the model parameters leading to the highest performance metric produce the best model.
33:38 Spark supports K-fold cross-validation with a transformation estimation pipeline to try out different combinations of parameters using a process called grid search. So you set up the cross validator with the parameters to test, an estimator, and an evaluator. So you supply a parameter grid, an estimator and an evaluator, and that’s going to then do cross-validation in a workflow.
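The splitting step of K-fold cross-validation can be sketched in plain Python. The `k_fold_pairs` helper, the seed, and the nine toy rows are invented for this illustration and say nothing about how Spark partitions the data internally.

```python
import random

def k_fold_pairs(rows, k, seed):
    """Shuffle rows, cut them into k folds, and build k (train, test) pairs."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]  # one fold is held out for testing
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((train, test))
    return pairs

rows = list(range(9))  # made-up row ids
pairs = k_fold_pairs(rows, k=3, seed=1)
```

With 9 rows and K equal to 3, every pair trains on 6 rows and tests on the remaining 3, and each row is used for testing exactly once.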
34:08 So here we’re setting up our Cross-Validator. First, we create the parameters grid specifying the parameters that we want it to go through. Here, we have the max number of trees, the max depth of trees, number of the trees, the bins, and the impurities. Then we’re setting up the evaluator. And here, we’re setting up the Cross-Validator, which consists of the estimator, which was the pipeline that we set up before, our evaluator, which is there, and then the parameter grid. And then we set the number of folds.
34:45 So then, after we set up this Cross-Validator, we call fit on the Cross-Validator, which is going to then do the cross-validation on the training set, and it’s going to return a model. So again, we call fit, and it’s going to return a pipeline model.
35:07 So this is then, after we have the pipeline model, what we can do is … This is the [inaudible 00:35:14] pipeline model. We can get the best model from this, and then we can print out the feature importances. So here, what we’re finding out is which features were the most important. And we’re sorting these. In this case, what we found out was the scheduled departure time was the most important. Then, after that was the origination destination feature. And then after that was arrival time, the departure hour, and so on.
35:45 So then, after we’ve called fit, we have this model. We can get the test predictions. So what we do is we call transform on our pipeline model. And this is going to … With the test DataFrame, it’s going to extract the features and predict with the model and then return a predictions data frame, which we can use to evaluate.
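Grid search simply tries every combination of the listed parameter values. The sketch below enumerates such a grid in plain Python; the parameter names follow random forest settings mentioned in the talk, but the candidate values are invented for illustration and are not the ones used in the webinar.

```python
from itertools import product

# Candidate values are hypothetical; only the idea of a grid is the point.
param_grid = {
    "maxBins": [100, 200],
    "maxDepth": [2, 4, 10],
    "numTrees": [5, 20],
    "impurity": ["entropy", "gini"],
}

names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

num_folds = 3
# Every combination is trained and evaluated once per fold.
fits_during_cross_validation = len(combinations) * num_folds
```

This makes the cost of a grid visible: 24 combinations with 3 folds means 72 training runs, so grids are usually kept small.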
36:09 So here, what we do is we have our pipeline model. We call transform, passing in a test data set. And this returns a predictions data frame, which we can then use on the evaluator. So here, with our evaluator, we’re calling evaluate, passing in the predictions data frame. And this returns various metrics. The default metric is the area under the ROC curve. And we’ll look at, in a separate notebook I have, the values … The output of that one. Then you can evaluate this. This shows what the area under the ROC curve means. So what the area under the ROC curve measures is the ability of the test to correctly classify. So here, we have the false positive rate, and the true positive rate. And it’s measuring the area under this curve.
37:06 An area of 1 would be a perfect test. And an area of .5 would be a worthless test. So that would be just random. So the closer you have your test to one, the better it is.
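The two reference points, 1 for a perfect test and .5 for a worthless one, can be checked with a small plain-Python sketch. It uses the rank formulation of the area under the ROC curve: the probability that a randomly chosen positive row scores higher than a randomly chosen negative row, with ties counted as half. This is an illustration only, not the evaluator's own implementation, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    row has the higher score; a tie counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]  # made-up labels
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])      # positives always rank higher
uninformed = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])   # same score for every row
partial = roc_auc(labels, [0.1, 0.8, 0.2, 0.9])      # one pair ranked wrongly
print(perfect, uninformed, partial)
```

The pairwise loop is quadratic in the number of rows, which is fine for a toy example but not how a large test set would be scored.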
37:20 And we can also calculate some other metrics just from our predictions data frame. So here, what we’re doing is we’re getting the label and the predictions from the predictions data frame. Then here we have the total count of our data set. Then we’re counting the number correct, so where the label equals the prediction. Here, we’re counting where the label does not equal the prediction. Then we’re calculating the true positive. So this is where the prediction equals 0 and the label equals the prediction. Then we count where the prediction equals 1 and the label equals the prediction. Then we count where the prediction equals 0 and the label does not equal the prediction, and so on. So then where the prediction equals 1 and the label does not equal the prediction. And then we can calculate the ratios using these values.
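The counting described here can be sketched in plain Python over a list of (label, prediction) pairs. The pairs are made up for illustration, and the function and key names are hypothetical. The keys are deliberately neutral: the talk refers to the first correct bucket (prediction 0, label equal to the prediction) as the true positive.

```python
def prediction_ratios(pairs):
    """Return ratios of correct and wrong predictions, split by predicted class."""
    pairs = list(pairs)
    total = len(pairs)
    if total == 0:
        raise ValueError("no predictions to evaluate")

    def ratio(condition):
        # Share of all rows for which the condition holds.
        return sum(1 for label, pred in pairs if condition(label, pred)) / total

    return {
        "correct": ratio(lambda label, pred: label == pred),
        "wrong": ratio(lambda label, pred: label != pred),
        "correct_pred_0": ratio(lambda label, pred: pred == 0 and label == pred),
        "correct_pred_1": ratio(lambda label, pred: pred == 1 and label == pred),
        "wrong_pred_0": ratio(lambda label, pred: pred == 0 and label != pred),
        "wrong_pred_1": ratio(lambda label, pred: pred == 1 and label != pred),
    }


# Made-up (label, prediction) pairs.
sample = [(0, 0), (0, 0), (1, 1), (1, 0), (0, 1), (0, 0), (1, 0), (0, 0)]
ratios = prediction_ratios(sample)
for name, value in ratios.items():
    print(f"{name:15s} {value:.3f}")
```

The four per-class ratios always sum to 1, and the two "correct" buckets sum to the overall correct ratio, which is a quick sanity check on any reported numbers.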
38:25 So in our case, what we had was the ratio correct was .61. Wrong was .38. True positive was .53. True negative was .07. So one thing I want to say is we were only using 1 month for this data set, and also … So this could be improved by using a lot more months. And I originally ran this also on just one node so that I could do it on my laptop, and so that the example could easily be done on a laptop. And so it could be improved using more data. It could also be improved by adding different types of data to this data set, like weather, holidays, things like that could also improve the predictions.
39:13 So now I want to go through this Zeppelin notebook again real quick to show what we just went over. So here again … Here we’re setting up our string indexers for all the categorical columns. Here we’re setting up the vector assembler, which is going to put all those features into one column. The labeler, which will create the label 0 or 1 based on whether the delay is less than or greater than 40 minutes. And then, we’re creating a random forest estimator. Then we’re putting all of these, the labeler, the string indexers, the vector assembler, and the random forest estimator, into a pipeline. Then here we’re setting up our cross-validation. We created the param grid that we want, the binary classification evaluator, and then we set up the cross-validation with the pipeline that we created, the evaluator, and then the parameter grid, and we set the number of folds to 3. So that’s going to do 3-fold cross-validation.
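The labeler step in this recap comes down to a threshold on the delay. A minimal plain-Python sketch, with a hypothetical function name, looks like this. The talk does not say which side a delay of exactly 40 minutes falls on, so treating it as delayed is an assumption.

```python
DELAY_THRESHOLD_MINUTES = 40.0


def delay_label(delay_minutes, threshold=DELAY_THRESHOLD_MINUTES):
    """Return 1.0 for a delayed flight and 0.0 otherwise.

    Assumption: a delay of exactly `threshold` minutes counts as delayed.
    """
    return 1.0 if delay_minutes >= threshold else 0.0


example_delays = [0.0, 12.5, 39.9, 40.0, 120.0]  # made-up delays in minutes
example_labels = [delay_label(d) for d in example_delays]
print(example_labels)
```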
40:27 Then on our cross-validator, we call fit with the training data, which is going to return the pipeline model. From this model, we’re getting the best model. And then we’re printing out the most important features by order. And here, the departure time and the origination-destination index, those were the most important features in this one. Just an example.
40:55 Then we’re also printing out the best estimator parameters, which says the max bins was a hundred, the max depth of the trees was 4, and the number of trees was 20 in the best model.
41:09 Next, we’re getting the predictions from our pipeline model. So we call transform, passing in the test data, and we get the predictions returned as a data frame. Then with the evaluator, we call evaluate on those predictions and this returns the area under the ROC curve. In this case, this was .6948 as the area. And remember, this is the area under that curve, so the closer to 1 the better. And again, this could be improved with more data.
41:40 Then we’re calculating more metrics here, which we already went over. And so then those are the output of those metrics. So now I’ll go back to finish with the slides.
41:53 So the next thing that we can do is we can save this model. Here we’re saving it to the distributed file system. Then later, we can load this with what’s called the Cross-Validator model. We can load this from the file system and then we can use it just like we saw before. We can use this in production and then what that’s going to do is it’s going to … When we pass it a data set with the features, it’s going to extract those features using that pipeline that we set up for extracting the features.
42:36 So this is an example of an end-to-end application architecture. Here, we’re building the model with the data from the file system, for example, and then we’re deploying this model. And then here we’re using the model with stream processing on real-time events. So for example, we have our flight information coming in and we want to predict if it’s going to be delayed or not.
43:04 Also, you can have complex pipelines with streaming data. And if you want to learn more about this, I’ll show you … There’s an e-book all about how you can set up pipelines for your models and predictions. So, this is where you can get more information. So we have complimentary e-books that you can download. So you can download the new one, the Getting Started With Apache Spark. And that has this example that we went over in this webinar, and it has other examples, too. And it also has a pointer for where you can download the code inside that book. Then the Machine Learning Logistics, that one goes over the model management that I just showed you: the pipelines for model management. There’s also Streaming Architecture, and other eBooks that you might be interested in.
43:51 We also have free, on-demand, trainings, and there’s a new one for Spark 2.0 which you can use to get certified. We also have a MapR Spark 2.0 certification. And the trainings [inaudible 00:44:03] certification costs money. Then we also have a lot of these examples on our blog. These examples and other examples are on this blog, mapr.com/blog. And then … So that’s the end of this. If you have any questions, you can send those in and we’ll get back to you with the answers. You can also ask the questions underneath those blogs. Or on the blog you can ask questions about whatever you have. So thank you for attending this seminar.
44:38 Thanks Carol, and thanks everyone for joining. We had a large number of questions come in during the call. And instead of answering just a few of them, Carol will … We want to go through each of those questions in-depth and get back to you separately. So we will be responding shortly after this event with answers to your questions. So thank you again and thank you, Carol, for putting together this presentation. That is all the time we have for today. For more information on this topic and others, please visit the site that Carol just walked through. You can always start with mapr.com/resources. Thank you again, and have a great rest of your day.