In this blog post, we will take a brief look at an interesting data science application: customer churn modeling for Sparkify using Apache Spark.
Project Definition:
This project is about predicting customer churning behavior for a fictitious music service called Sparkify. Much of our internet music consumption revolves around online subscriptions, where users pay a recurring fee to keep their service and consume music. This is one of the major revenue sources for internet companies.
Unfortunately, customers tend to leave a service when they are not satisfied with it, and being able to predict the likelihood of a subscriber leaving can help minimize loss of revenue over time (by tailoring various promotional strategies).
The project involves analyzing thousands of log entries in JSON document format, using Apache Spark and the Apache Hadoop framework.
Data is getting bigger, and we need modern methods to handle it. Apache Spark is part of the big data ecosystem, a modern paradigm in which data that cannot be processed on a single machine is accommodated for analysis and modeling. In this context, data is loaded and processed in memory, as online platforms tend to generate serious amounts of data and Spark can process it on the fly.
Apache Spark distributes data by partitioning it into small batches and sending them to various machines within its big data cluster for analysis and processing. It then utilizes the Apache Hadoop framework and its distributed computing algorithms (MapReduce) to efficiently wrangle large amounts of data.
Apache Spark follows the functional programming paradigm. Initially developed in Scala, a functional programming language, it can be accessed through Python via an API called PySpark.
Problem Statement
I was given the task of predicting customer churn for a fictitious music streaming service called Sparkify by analyzing their log files in JSON document format, preparing the data, and eventually deploying machine learning models on it.
Metrics
The Sparkify Python Jupyter notebook contains two major machine learning models: logistic regression and a decision tree classifier. These two models have been deployed on cleaned data with new features built specifically for the algorithms to effectively predict customer churn. I am using the accuracy metric and the F1 score to understand the effectiveness of each algorithm. After several iterations of analysis and troubleshooting, I found the decision tree classifier to be a better fit for the features I engineered for this project.
Data Exploration and Visualization:
A mini data set file called mini_sparkify_event_data.json has been imported for analysis and investigation. Data cleansing and analysis have been performed on this data set, which has a little over 285,000 log entries.
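For readers following along, here is a minimal sketch of how such a file can be loaded; the session setup is illustrative and assumes a local PySpark installation:

from pyspark.sql import SparkSession

# create (or reuse) a Spark session and read the JSON logs
spark = SparkSession.builder.appName("Sparkify").getOrCreate()
df = spark.read.json("mini_sparkify_event_data.json")
print(df.count())  # a little over 285,000 log entries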
Data Inputs:
The data is in the form of JSON documents. Each JSON document is a log entry from the web app that contains the following:
We have the user information, location, type of subscription, details about the songs they are listening to, and the various pages the user navigated while using the app, from thumbs up/down to cancelling their paid subscription, along with timestamp and session information.
Data Visualization:
Initial descriptive analysis tells us that song length, sessionId, status, and timestamp are continuous variables while the rest are categorical. We also have various demographic information about each user, from the location they are streaming from to the type of device accessing Sparkify's music streaming service.
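A quick way to confirm this split between continuous and categorical fields, assuming the raw logs use the column names length, sessionId, status, and ts:

df.printSchema()  # numeric types are continuous; strings are categorical
df.describe("length", "sessionId", "status", "ts").show()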
Upon running some quick analysis, we can see that a good amount of account activity comes from paid subscribers' logs, and a small portion comes from free subscribers' logs.
Initial analysis suggests that 52 users are free subscribers while 226 are paid subscribers. That puts the current churn rate at 23.01%.
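Here is a rough sketch of how such counts can be derived. Note that a user's level can change over time, so the distinct (userId, level) pairs are an approximation, and churn is flagged here by the "Cancellation Confirmation" page event:

from pyspark.sql import functions as F

# distinct users per subscription level (free vs. paid)
df.select("userId", "level").dropDuplicates().groupBy("level").count().show()

# churn rate: users who reached "Cancellation Confirmation" over all users
churned = df.filter(F.col("page") == "Cancellation Confirmation") \
    .select("userId").distinct().count()
total = df.select("userId").distinct().count()
print(churned / total)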
Further analysis of the location data suggests that the most listening activity comes from metropolitan zones like Los Angeles-Long Beach-Anaheim, CA, followed by the New York-Newark-Jersey City, NY-NJ-PA area, with the third most populous subscription zone being the Boston-New England area (MA-NH).
To maintain chart consistency, I created this function:
def horizontalbar(x, y, chartcolor, xlabel, ylabel, title):
    """
    Reusable horizontal bar chart, with options for chart color, labels and title baked in.
    This needs matplotlib (referred to as plt) and numpy; please have these imports:
        import matplotlib.pyplot as plt
        import numpy as np
    """
    plt.style.use('ggplot')
    fig, ax = plt.subplots()
    y_pos = np.arange(len(y))
    ax.barh(y_pos, y, color=chartcolor)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(x)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    # annotate each bar with its value
    for i, v in enumerate(y):
        ax.text(v + 50, i + .0015, str(v), color='black', fontweight='bold')
    plt.show()
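As a quick illustration of a call to this function, with hypothetical counts standing in for the location totals discussed above:

# hypothetical example values, for illustration only
top_locations = ['Los Angeles-Long Beach-Anaheim, CA',
                 'New York-Newark-Jersey City, NY-NJ-PA',
                 'Boston-New England, MA-NH']
top_counts = [2940, 2635, 1930]
horizontalbar(top_locations, top_counts, 'steelblue',
              'Number of log entries', 'Location',
              'Activity by metropolitan area')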
Upon closer inspection, we see that gender is well distributed, with males and females each occupying a good proportion, and only one missing value. The data is relatively clean.
Methodology:
Data Preprocessing:
I started off the modeling process by eliminating missing values and duplicate data. There seem to be a lot of missing entries in many columns; the concerning ones are the missing userIds.
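A minimal sketch of this cleaning step; it assumes, as is common in these logs, that missing userIds appear either as nulls or as empty strings:

# keep only log entries with a real userId, then drop exact duplicates
df_valid = df.dropna(subset=["userId"]) \
    .filter(df.userId != "") \
    .dropDuplicates()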
Next, I looked at the various events within the page column. This categorical variable captures every interaction a user had while using the app, from Thumbs Up and Add Friend through to Cancellation Confirmation.
We then created a pivot table that pivots the page categories for each userId, and built box-and-whisker plots that helped us understand the patterns better.
Here is how the series of box-and-whisker plots was created:
# transform the data to one row per user
df_userview = df_valid.groupBy(["userId"]) \
    .pivot("page") \
    .count() \
    .fillna(0)
df_userview = df_userview.toPandas()

# create box-and-whisker plots against the churn variable
# (the "Cancellation Confirmation" page event)
fig = plt.figure(figsize=(20, 20))
for i in range(1, len(df_userview.columns)):
    df_userview.boxplot(df_userview.columns[i],
                        ax=fig.add_subplot(6, 4, i),
                        by="Cancellation Confirmation")
As we see in the box-and-whisker plots above, subscribers who leave the Sparkify service tend to participate less in the "Add to Playlist" event: they do not add as many songs to playlists as the ones who remain loyal subscribers. Subscribers who leave also tend to add fewer friends, and are less likely to click through to the next song, compared to those who stay. On the other hand, subscribers who leave tend to upgrade more than loyal subscribers do. Similar patterns are prevalent in the thumbs up and thumbs down categories, especially in the thumbs up event.
The features that went into the model are listed below (a PySpark sketch of how they can be built follows the list):
- Total number of songs each user had listened to
- Total number of friends that each user has
- Total number of playlists that each user created
- Total number of sessions that each user had over the course of their usage
- Total number of days that user engaged with Sparkify service
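Here is a sketch of how these features could be assembled in PySpark. The page values ("NextSong", "Add Friend", "Add to Playlist") and the millisecond ts column follow the raw log format; the output column names are my own:

from pyspark.sql import functions as F

# one row per user, with the five engineered features
features = df_valid.groupBy("userId").agg(
    F.sum((F.col("page") == "NextSong").cast("int")).alias("total_songs"),
    F.sum((F.col("page") == "Add Friend").cast("int")).alias("total_friends"),
    F.sum((F.col("page") == "Add to Playlist").cast("int")).alias("total_playlist_adds"),
    F.countDistinct("sessionId").alias("total_sessions"),
    # ts is in milliseconds, so convert the active time span to days
    ((F.max("ts") - F.min("ts")) / (1000 * 60 * 60 * 24)).alias("days_active"))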
Implementation:
This project has been built on the following ecosystem:
- Apache Spark 2.4.1
- Apache Hadoop 2.7
- Amazon S3 storage
- Amazon Elastic MapReduce (EMR) cluster
- PySpark 2.1.0
- Python packages: Pandas, NumPy, DateTime, PySpark
Although Apache Spark is known for its fast and resilient distributed computing, this analysis took a long time every time a process or an algorithm was run.
If more memory and compute power had been available, the processing would likely have completed much sooner.
The machine learning process involves training a model, or sometimes several models, on the data set and selecting the one with the best performance. I primarily used logistic regression and decision tree classifier models to predict the churning behavior of subscribers.
I created base models with no tuning and later implemented models with their hyperparameters tuned. I also utilized k-fold cross validation; let me explain what that means.
As we know, data is normally divided into a training set and a testing set; the training set is used to train the model, and the testing set is used to evaluate its performance. There can be cases where the accuracy obtained on one split differs from the accuracy obtained on another. So I employed k-fold cross validation: I divide the data into k sets, where k-1 sets are used for training and the remaining one is used for testing, rotating through all k choices. The k-fold cross validation result is the mean of the results obtained in each iteration.
Second, the hyperparameter tuning. Machine learning models have two types of parameters: the ones they learn through training, and the hyperparameters that we pass to the model. Traditionally, subject matter experts help determine the values of hyperparameters, but PySpark gives us an approach for automatically finding the best parameters for a particular model. Grid search helps us do just that. I used scaler methods to remove outliers and scale the data, and used pipelines to get the best parameters.
For logistic regression, I tuned the regularization parameter (the one that helps alleviate model overfitting), and for the decision tree I tuned the maxDepth parameter (the depth, or number of layers, of the tree).
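A condensed sketch of what this pipeline-plus-grid-search setup can look like in PySpark ML; the feature and label column names ("features", "churn"), the train split, and the grid values are illustrative stand-ins tied to my earlier sketches, not the exact project code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# assemble the engineered features into a vector, then scale them
assembler = VectorAssembler(inputCols=["total_songs", "total_friends",
                                       "total_playlist_adds", "total_sessions",
                                       "days_active"],
                            outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
lr = LogisticRegression(labelCol="churn", featuresCol="features")
pipeline = Pipeline(stages=[assembler, scaler, lr])

# grid search over the regularization parameter, with 3-fold cross validation
grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.01, 0.1]).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(
                        labelCol="churn", metricName="f1"),
                    numFolds=3)
cv_model = cv.fit(train)
# the decision tree variant swaps in DecisionTreeClassifier
# and a grid over dt.maxDepth instead of lr.regParam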
The data is relatively clean, and I also had confidence in the features I had built. Hence, I chose the simplest classifiers, as I did not see a particular use case for deploying complex models with a lot of hyperparameters to tune. These are the logit (logistic regression) model and the decision tree model, respectively.
Modeling Methodology and Refinement:
To develop the predictive model for this project, I decided to first build "base models", which I ran without any parameter tuning, letting PySpark's defaults decide the outcome.
Then I tuned the hyperparameters for the models and developed a second set of tuned models. I discuss this in more detail in the next section, under Results.
Results:
Model Evaluation and Validation:
The following are the results I got for the base models:
Logistic Regression: Accuracy 0.76 and F-1 Score 0.76
Decision Tree: Accuracy 0.82 and F-1 Score 0.79
The base model results seem impressive. My first reaction is to suspect model overfitting. In my experience, any accuracy between 0.6 and 0.8 can be deemed a good model, and anything above 0.8 could be a possible case of overfitting. Overfitted models can be dangerous when introduced to a completely new set of data and can perform poorly, which would translate into lost revenue.
Given the base models' performance and possible overfitting, I chose to tune the hyperparameters. For logistic regression, I tuned the regularization parameter, and for the decision tree classifier, I tuned the maxDepth (maximum depth of the tree) parameter. I also ran both 2-fold and 3-fold cross validation, and eventually decided to proceed with 3-fold cross validation, expecting that more training with hyperparameter tuning would yield better results. The following are the results I obtained after hyperparameter tuning:
Logistic Regression: Accuracy of 0.76 and F-1 Score of 0.76
Decision Tree Classifier: Accuracy of 0.8 and F-1 Score of 0.75
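For reference, here is a sketch of how both metrics can be computed with PySpark ML, assuming the cv_model and a held-out test split from the earlier sketch:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

preds = cv_model.transform(test)
acc = MulticlassClassificationEvaluator(labelCol="churn",
                                        metricName="accuracy").evaluate(preds)
f1 = MulticlassClassificationEvaluator(labelCol="churn",
                                       metricName="f1").evaluate(preds)
print("Accuracy: {:.2f}, F1: {:.2f}".format(acc, f1))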
Although these numbers seem lower than the base model's, I now have confidence that I have guarded against overfitting by tuning the regularization and max depth parameters.
If I had access to more memory, I would certainly advocate for more than 3-fold cross validation for my machine learning model. On the other hand, the data set has only a little over 250 rows (users), so breaking it into many folds would compromise the quality of the solution; a small fold count was deemed reasonable.
To offer more clarity here: the accuracy metric is a measure of all correctly identified cases. In this case, the model is roughly 80% accurate. However, the remaining percentage actually contains subscribers who are going to leave our music streaming service without being identified. This can be a high cost for the organization, and we should try to minimize the false negatives. This is where we use the F1 score.
The F1 score gives us a better measure of incorrectly classified cases than the accuracy metric. F1 is used when false positives and false negatives are very important. Missing a customer who is about to leave the service carries a high cost, which is why we need to look at the F1 score and weigh it more heavily than accuracy.
To understand the metrics better, let's look at some definitions as they fit our context:
Note: churning is when a subscriber leaves (downgrades from) the Sparkify music streaming service
- True Positive: predicted churning + actual churning event
- False Positive: predicted churning + actually not churning (staying with Sparkify)
- True Negative: predicted to stay with Sparkify + actually staying with Sparkify
- False Negative: predicted to not leave Sparkify + actually churning
Precision is the ratio of true positives to the model's total predicted positives. The formula is:
Precision = TP / (TP + FP)
Recall is the ratio of true positives to all actual positive events. Recall basically answers the following question:
Of all the subscribers who left the paid service, how many did the model correctly identify?
The formula is:
Recall = TP / (TP + FN)
Accuracy is the proportion of all cases, churners and non-churners alike, that the model classified correctly. It can be defined as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
And finally, the F1 score is the harmonic mean (a weighted average) of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
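To make the difference concrete, here is a small worked example with purely illustrative numbers. Suppose the model evaluates 100 subscribers, 23 of whom actually churn, and produces TP = 17, FN = 6, FP = 8, TN = 69:
Precision = 17 / (17 + 8) = 0.68
Recall = 17 / (17 + 6) ≈ 0.74
F1 = 2 × (0.68 × 0.74) / (0.68 + 0.74) ≈ 0.71
Accuracy = (17 + 69) / 100 = 0.86
Accuracy looks flattering here even though the model misses 6 of the 23 churners, which is exactly why F1 deserves more weight in this context.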
I am using the F1 score as the primary metric because of the context of our problem: each subscriber who goes unidentified can be a loss of revenue to the organization. A higher F1 score means the model has identified most of the churners who are likely to leave Sparkify, so we can offer them more promotions.
Justification:
Our decision tree model has an F1 score of 0.7595, which would mean that almost 76% of churners have been correctly identified, and since its accuracy is higher than logistic regression's, I am proposing the decision tree model to predict customer churning behavior. There may be other limitations and assumptions outside the scope of this project that can affect the model results; however, given the constrained space, the decision tree classifier is what I would recommend.
Conclusion:
In this capstone project, I imported the log files of an app from a fictitious company, Sparkify; performed data cleaning; identified and treated missing values; performed univariate and bivariate analysis along with data visualizations; fit a predictive model; tuned the hyperparameters to obtain a better fit; and presented the results.
I found the JSON documents part of it interesting and enjoyable, as I had worked with JSON documents in one of my previous projects. I found the technical setup difficult: the Spark cluster seemed to have limitations on memory and on the number of nodes running in the background.
Reflection and Improvement
As someone who played with Apache Spark back in my Scala days, it was a pleasure to discover that PySpark has matured into a proper package, with full support for data frames and growing traction for the PySpark ML library.
At the same time, I understand the pain points of running a relatively small batch of data through a distributed system. It is overkill to use an RDD for a data set that has little over 250 rows and a few columns. Nevertheless, it has been great practice for me, and I am thankful for that.
The data set is somewhat cleaned and relatively less messy. The behavior of churners is not entirely trivial: one would imagine that someone who spends less time on the streaming service and does not add many friends or playlist songs would be likely to drop out. However, we have a good mix of data, with subscribers who appear loyal to the service and yet decided to downgrade or trigger the "Cancellation Confirmation" event, so this was interesting.
There is also the matter of system tuning and troubleshooting the resource allocation (job scheduling, memory reallocation, etc.) at the system level, which plays a crucial role in how long each job takes to run and in how data persists.
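As one example of such system-level tuning, the Spark session itself can be configured at creation time; the values below are illustrative, not the settings used in this project:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Sparkify") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "16") \
    .getOrCreate()
# fewer shuffle partitions suit a small data set (the default is 200)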
Improvement:
I wish I had had the opportunity to allocate more memory to my memory-intensive tasks; the small sample size is concerning, and the model's effectiveness would be more apparent if the number of users were in the few thousands. However, given the computational limits of this course, I am pleased with the exercise.