
The Art of Statistical Clusters: Building and Forecasting

Article Author
Dr. A. K. Singh and Andrew Cardno
Publish Date
March 31, 2009

Authors' Note: In our earlier series of articles,1 we showed how market basket analysis can be used to produce a set of customer behavioral clusters. Once clustering is complete, the next step is the identification and analysis of the individual clusters. This article focuses on how clustering information can be incorporated into forecasting models to improve the forecasts (for variables such as revenue) for each cluster. These forecasts take into account seasonality and other factors and produce a fine-grained behavioral view of the business. Thus, the clusters become mini-business units with their own seasonality and profitability.

Personal computers, introduced in 1976, cost thousands of dollars and had a few kilobytes of RAM, no hard drive, and processor speeds reaching 1 MHz.2 Today, desktops with 2 gigabytes (GB) of RAM, 250 GB of hard disk drive space, and 2 GHz of processing speed cost around $500, and comparable laptops are available for about the same price. The dramatic decrease in the cost of computers and the phenomenal increase in computing power have enabled businesses to collect and store vast amounts of customer-related data in electronic form. Wal-Mart, the retail giant, started with a 250 GB database in the 1990s; the size of that database grew to 280 terabytes by 2004 and reached 4 petabytes in 2007.

Wal-Mart has been using data mining tools such as Association Rules Mining to optimize store operations. Other business organizations, including casinos, have increased their use of data mining techniques to find information (patterns, correlations, associations, trends) hidden in large customer transaction databases and other related databases. In 2000, Harrah’s introduced Total Rewards, the casino industry’s first customer loyalty program; now in its third generation, it offers members the ability to earn Reward Credits and Tier Credits for almost all purchases at its properties.3 Loyalty programs that give customers real reasons to use their membership cards for most of their transactions on the property typically have a large member database.4 The daily transactions of members are stored in the customer database, and data mining tools are used to extract actionable information from the data.

Historically, we have split the list of customers into groups based on dimensional tiers of spending. This widely used approach runs into limitations as the dimensionality of the data increases. Questions such as how to combine spending in the hotel with spending in the casino for a complete view of the customer become increasingly difficult to address. The critical question becomes: what are the defining characteristics that enable us to distinguish one segment of customers from another? Simply put, if we consider any two segments, are they different?

In this article, we present a statistical approach to data mining that uses customer transaction data to first form customer segments, and then identify the different segments. This identification of customer clusters provides information that can be used to optimize marketing operations. The example presented uses simulated data from a small casino with a customer loyalty program, but the approach outlined is quite general and has been used successfully in a number of business areas.

Review of Statistical Customer Segmentation
Many store chains cluster stores and use sophisticated forecasting models in order to customize inventory for the store clusters or groupings. Clustering in these cases is a division of the entire list of stores into similar groups of stores. Grouping the stores into clusters results in losing some finer details but brings out common features of stores in a cluster. This information is useful in managing the store clusters. Casinos can use the same methodology to cluster their customers with loyalty cards and use this information to better understand their customers and increase their market share.

Searching a large group of customer transaction data to find natural groupings or clusters of customers is an exploratory multivariate statistical method,5 which has become an important data mining tool6 and is referred to as “unsupervised learning.”

There are essentially two types of clustering schemes: hierarchical and nonhierarchical. Hierarchical clustering methods begin by computing a measure of similarity or distance for all items in the data and then go through either a series of mergers of clusters formed at the previous step (agglomerative) or a series of divisions of clusters formed at the previous step (divisive). Nonhierarchical methods typically group customers into a specified number (K) of clusters. They start with an initial clustering, then go through the list of customers, assigning each customer to the closest cluster and recalculating the cluster centroids.

The centroids are defined by the means of the clustering variables, calculated from the data; if the data has only two dimensions, the centroid can be shown on a 2-D plot. In Figure 1, the centroid is (5,5). The process of reassigning customers and recalculating the centroids is repeated until no customer is reassigned to a different cluster. Since a matrix of similarities or distances does not need to be calculated or stored, nonhierarchical clustering methods require less computation and are better suited to very large data sets.
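The nonhierarchical procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used for the article: it assumes Euclidean distance, reassigns all customers in one pass before recalculating the centroids, and uses made-up two-dimensional points and starting centroids. The function names are hypothetical.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance, an assumption)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def mean_point(points):
    """Centroid of a group: the mean of each variable over the group's points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign every customer to the closest cluster.
        new_labels = [closest(p, centroids) for p in points]
        if new_labels == labels:
            break  # no customer changed cluster, so the process stops
        labels = new_labels
        # Recalculate each centroid from its current members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = mean_point(members)
    return labels, centroids

# Made-up data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, initial_centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

Only the points and the K centroids are held in memory; no matrix of pairwise distances is built, which is the property that makes this family of methods practical for large customer lists.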

Example Calculation
We now present an example to illustrate the above approach for customer segmentation. For this example, we will use a simulated data set of 4,899 player visits to a small casino. Table 1 shows a small subset of this data set. The variables used for clustering the customers in this example represent the fraction of total money spent by each customer on a trip on the following activities offered by the casino:

The last column in Table 1 is the total dollar amount spent by a customer per visit.
The data was first standardized by subtracting the mean of each variable (column) from its values and dividing the results by the standard deviation of that column. The K-Means clustering method with K=4 clusters was then applied to the standardized data, producing four clusters with the cluster centers shown in Table 2.
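The standardization step can be illustrated with a short Python sketch. The rows below are made-up spending fractions, not values from Table 1, and the choice of the sample standard deviation is an assumption, since the article does not say which form was used.

```python
import statistics

def standardize(rows):
    """Return z-scores: (value - column mean) / column standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # Sample standard deviation (assumption); a constant column would give
    # a zero divisor and needs separate handling.
    sds = [statistics.stdev(c) for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, sds)) for row in rows]

# Hypothetical visits: fraction of spend on two activities.
rows = [(0.2, 0.8), (0.4, 0.6), (0.6, 0.4)]
z_rows = standardize(rows)
for z in z_rows:
    print(z)
```

After this step every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation merely because of its scale.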

 

 


The numbers in each column are the cluster centers. For example, in Table 2 the largest number in the Cluster 3 column is 98.78, in the Slots row (highlighted light red), so Cluster 3 is dominated by customers who spend most of their money on slots. One useful analysis is to look at the relative importance of each dimension to each cluster. For example, the Restaurant dimension is very important to Cluster 1 and Cluster 2 but relatively unimportant to Cluster 3 and Cluster 4.

 

Following this reasoning, four clusters are identified and are shown in Table 3. Figure 2 shows the average proportions spent on various activities by customers in each of the four clusters. The summary statistics of the four clusters are shown in Table 4.

Table 3 shows that Cluster 1 consists of customers who are staying at the hotel and eating at one of the restaurants in the casino. You can see from Table 4 that the means for Restaurant and Hotel are higher for Cluster 1 than for the other three clusters.

Table 5 shows the summary statistics for the total dollars spent per trip for each of the four clusters.

Customer Segmentation, Forecasting and Marketing
The clustering information can be used in optimizing marketing. As an example, it appears that Cluster 1 customers are customers who like all the nongaming facilities within the casino.

Casino databases also have information about the date of each visit by various carded customers. This information along with customer segmentation can be used to forecast the average amount a customer from one of the clusters will spend on his or her future visits. We illustrate this clustered forecasting method for Cluster 2 in Figure 3. 

Figure 3. Plot of total dollars spent per month by Cluster 2 members.

The following multiple linear regression model was fitted to the total dollars spent per month by Cluster 2 (Total 2). The predictors used in the model are T for month number (1, 2, …, 101), binary variables Jan, Feb, …, Nov for 11 of the 12 calendar months (leaving December as the baseline), and another binary variable, Road Work, indicating road work around our fictitious casino. The model fitted to the simulated data is: Total 2 = 251 + 1.17 T + 16.3 Feb + 12.4 Apr - 25.9 Jun - 16.1 Aug + 22.5 Sep - 35.9 Road Work.

The months that have a significant effect on Total 2 are February, April, June, August and September (indicating seasonality). The months with positive coefficients (Feb, Apr, Sep) correspond to higher total dollar amounts spent, and the months with negative coefficients (Jun, Aug) to lower amounts. The coefficient of determination (R²) for the model is 83.9 percent, indicating a reasonable fit. The fitted model was used to forecast total dollars spent in a month for the next six months (see Table 6).
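To show how the fitted equation turns into a forecast, here is a small Python sketch that simply evaluates it. The coefficients are the ones reported above; the function name and the example inputs are hypothetical, and the sketch assumes that months absent from the reported equation contribute nothing beyond the intercept and trend. It does not reproduce Table 6, because the article does not state which calendar month corresponds to each month number.

```python
# Coefficients of the fitted model as reported in the article.
INTERCEPT = 251.0
TREND_PER_MONTH = 1.17
MONTH_EFFECT = {"Feb": 16.3, "Apr": 12.4, "Jun": -25.9, "Aug": -16.1, "Sep": 22.5}
ROAD_WORK_EFFECT = -35.9

def forecast_total2(t, month, road_work=False):
    """Evaluate the fitted model for month number t and a calendar month label."""
    seasonal = MONTH_EFFECT.get(month, 0.0)  # months not in the equation add 0
    road = ROAD_WORK_EFFECT if road_work else 0.0
    return INTERCEPT + TREND_PER_MONTH * t + seasonal + road

# Hypothetical inputs: month number 102, supposing it falls in February.
print(round(forecast_total2(102, "Feb"), 2))
# The same month number with road work under way, supposing it falls in June.
print(round(forecast_total2(102, "Jun", road_work=True), 2))
```

Each binary variable shifts the forecast up or down by its coefficient when it applies, while the trend term adds 1.17 for every additional month.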

Comparison to Pivot Techniques
When the dimensionality is low, it is useful to use OLAP style pivot tables; however, once the dimensionality is high it becomes impossible for humans to usefully manipulate the data. It is an extremely complicated task to decide on which dimensions should be juxtaposed against each other, and even more complicated to decide on the statistical importance of the relative weights.

When faced with the challenge of creating analytics that are actionable and simple, clustering stands head and shoulders above other methods. The clustering technique is a mathematical way of dividing certain customers into groups based on important behaviors. The alternative, which is to use drag and drop style pivot tables, is unlikely to give the same accuracy of information, as handling of the large number of dimensions is quite simply beyond human capability.

As one can see in this example, different aspects of the customer list are brought to light by the four clusters and in the end, the four clusters are understandable enough that one could market to them in different and meaningful ways. Furthermore, we have described how to build forecast models of the individual clusters. Now, once control is established, marketing activities can be planned and measured against forecasted results.

           
Footnotes

1 Bart A. Lewin, Dr. A. K. Singh and Andrew Cardno: “Let’s Talk Turkey: Applying Retail Market Basket Analysis to Gaming.” Casino Enterprise Management, December 2008, pp. 10-14; “Market Basket Analysis II – Recovering Mr. Benedict’s Money.” Casino Enterprise Management, January 2009, pp. 16-18; “Market Basket Analysis, Part III: Using Demographics And Spatial Information.” Casino Enterprise Management, February 2009, pp. 10-15.

2 Berndt, Ernst R. and Rappaport, Neal J. (2001). “Price and Quality of Desktop and Mobile Personal Computers: A Quarter-Century Historical Overview.” The American Economic Review, Vol. 91, No. 2, (May, 2001), pp. 268-273.

3 www.reuters.com/article/pressRelease/idUS133764+26-Sep-2008+PRN20080926.

4 http://knowledge.wpcarey.asu.edu/article.cfm?articleid=1451.

5 Johnson, Richard and Wichern, Dean (2003). Applied Multivariate Statistical Analysis. Prentice Hall.

6 Shmueli, Galit, Patel, Nitin R., and Bruce, Peter C. (2007). Data Mining for Business Intelligence. Wiley Interscience.

Dr. A. K. Singh has taught statistics, mathematics and operations research courses at New Mexico Tech, Socorro, N.M., and statistics and mathematics courses at University of Nevada, Las Vegas. He has over 75 publications in theoretical and applied statistics.


Andrew Cardno has more than 16 years of experience in business analytics, ranging from modeling health care drive times to casino gaming floor analytics. He often presents on the future of analytics across the world and has spent the last seven years living in the United States and working with corporations around the world.
