Segmentation Tutorial

Created by Steve Hoover, Modified on Thu, Jan 16 at 2:48 PM by Steve Hoover

Overview

Segmentation and classification are analytic techniques that help firms compare and group customers who share common characteristics (i.e., segmentation variables) into homogeneous segments, and then identify ways to target particular segments of customers in a market on the basis of external variables (i.e., descriptor variables).

Segmentation refers to the process of classifying customers into homogeneous groups (segments), such that each group of customers shares enough characteristics in common to make it viable for the firm to design specific offerings or products for it. This application identifies customer segments using needs-based variables called basis variables. Cluster analysis helps firms to:

  • better understand their customers.
  • identify different segments in a market.
  • choose attractive customer segments to target with their marketing programs.

Getting Started

To apply segmentation and classification analysis, you can use your own data or use a template preformatted by the Enginius software. Because the Segmentation model requires a specific data format, users with their own data should review the preformatted template to become familiar with the appropriate structure. The next section explains how to create an easy-to-use template to enter your own data.

 

The following section, “Creating a Template”, applies if you need to enter your own data for analysis. If you are using one of our supplied cases, or the tutorial, you may want to move ahead to the “Entering Your Data” portion of this tutorial.

 

Creating a Template

From the Enginius Dashboard, click the Templates dropdown and select Segmentation to open the dialog box to create a Segmentation template.

 

The options for the Segmentation template are as follows:

  • Segmentation data:
    • Number of segmentation variables: These variables serve as the basis for segmentation and are often called basis variables. They might include customers' needs, wants, expectations, or preferences.
    • Number of respondents: The number of customers or respondents in the data that need to be clustered.
  • Descriptor data (optional):
    • Include descriptor data: Check the box if you have descriptor variables. Descriptor variables, or descriptors, are variables that do not contribute to the segment definition, but can be used to describe them (e.g., age, gender).
    • Number of descriptor variables: The total number of descriptor variables collected in your study.
  • Out-of-sample classification data (optional – option only available when descriptor data is also included in template):
    • Include classification data: Check the box if you have classification data. Classification data refers to individuals for whom you only have descriptors available, and wish to classify into their most likely segments.
    • Number of respondents to classify: The number of individuals with descriptor data only that you would like to classify into segments.

Note: the check box at the bottom of the dialog box will cause the template to populate with sample (random) data that will allow you to run Segmentation modeling immediately so you can preview the output produced. 

 

 

It is not always clear whether a specific variable should be treated as a segmentation variable or descriptor variable. This choice might depend on the context, the managerial question, or the product category.

When in doubt, ask yourself the following questions: (1) Would this piece of information tell me what that customer wants, in which case it should be treated as segmentation variable, or (2) does this piece of information tell me who that customer is and therefore should be treated as descriptor variable? For example, “gender” would fall in the second category most of the time, whereas “need for timely information” usually falls in the former category.

 

 

After selecting the desired model options, click Run to generate the data collection template. The software generates the required data blocks, depending on whether you have included descriptor data (and classification data), and fills them with random data (if selected).

Entering Your Data

A typical segmentation setup contains one or two data blocks that contain segmentation and/or descriptor data. If you are running classification, an additional data block holds the data to be classified.

  • Segmentation data are required for the segmentation model. This data block contains the respondent identifier and a column for each segmentation variable collected in the study. The data within each column must be scaled using the same scale (e.g., 1–10), but each column can have a different scale (e.g., 1–10 for satisfaction, 1–5 for convenience). Typically, segmentation variables are numerical values (interval or ratio scale). The data set contains one row per respondent in your study. If you must use basis variables that are nominal (e.g., “male” “female”), then you can apply latent class segmentation analysis (see appendix).
  • Descriptor data constitute an optional data block, depending on whether your study has collected descriptor (discrimination) data. Recall that descriptor data enable you to differentiate one customer from another (e.g., age, income, gender). Again, data within a column must be scaled using the same scale, but different columns may use different scales. Typically, descriptor variables are numerical (interval or ratio scale) or nominal (“male”, “female”). Each respondent in your study appears in a separate row.
  • Classification data is an optional data block. The block will contain columns that match your Descriptor data block that is already being used for analysis.

Running Segmentation Analyses

 

For the remainder of this tutorial, we will use the “OfficeStar: Segmentation” data set that is available with the Enginius Segmentation tutorial. To access the data set, open “Segmentation” under the Tutorials dropdown in the Enginius Dashboard. This will automatically load the “OfficeStar: Segmentation” data.

 

 

After you enter and/or upload your data, click the Run Segmentation Analysis button in the upper left corner to begin the analysis.

Analysis options

When you click Run Segmentation Analysis, a number of analysis options are presented:

Segmentation method

You may specify the number of segments (clusters) to develop during the analysis or you can allow Enginius to determine the appropriate number of segments. If you allow Enginius to determine the number of segments automatically, it will do so strictly on a mathematical basis. This may, or may not, be appropriate from a management perspective but can be useful as a starting point for manually determining the number of segments.

 

Usually, a segmentation analysis consists of two steps when manually determining the number of segments (i.e., using “Force number of segments” option). First, you run the analysis with a large number of segments (up to 9). Second, on the basis of analysis from the initial report (discussed subsequently), you can determine the number of segments to retain for further analysis.

 

 

Segmentation data

The dropdown box that appears under Segmentation data allows you to select the data block that corresponds to your segmentation data.  In OfficeStar, this data block is named “Segmentation data”.

A check box under the Segmentation data section allows you to choose whether to Standardize data. This option scales all variables to 0 mean and unit variance before the analysis. Choosing this option is recommended if you have measured the variables on different scales. 
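As a rough illustration of what standardizing to zero mean and unit variance does, the sketch below rescales each column of a small, made-up ratings table. The function name `standardize_columns` and the data are hypothetical, and the use of the population standard deviation is an assumption; this tutorial does not say which variant Enginius applies.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation. A constant column has
    # a standard deviation of 0, so its values are simply set to 0.
    sds = [pstdev(col) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0
         for value, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical ratings: column 1 on a 1-10 scale, column 2 on a 1-5 scale.
ratings = [[2, 1], [4, 3], [6, 3], [8, 5]]
scaled = standardize_columns(ratings)
```

After this step both columns contribute on a comparable scale, so the variable measured from 1 to 10 no longer dominates the distance between respondents.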

Descriptor analysis

To perform a Descriptor analysis, check the box beside “Run descriptor analysis” and then choose the data block where your descriptor data is located.

You may also choose to run Classification analysis if you have provided classification data. See the “Interpreting the Classification Output” section for a description of classification analysis. Classification analysis is typically not included until the segmentation analysis is complete.

Advanced options

Checking the Advanced checkbox will provide two additional options for running your segmentation analysis.

 

Segmentation method

For the segmentation method, you may choose either Hierarchical clustering or K-means (Hierarchical clustering is the default if the Advanced option is not checked). 

  • Hierarchical clustering builds up or breaks down the data, customer by customer (row by row). Due to the computational requirements, hierarchical clustering is not suitable for large data sets.
    Note: if there are more than 2,000 data points, Enginius will use K-means regardless of which method is selected.
  • K-means partitioning breaks the data into a pre-specified number of segments and then reallocates or swaps customers to improve some measure of effectiveness.
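To make the K-means description more concrete, here is a minimal sketch of the standard assign-and-update loop (Lloyd's algorithm) in plain Python. It is not Enginius's implementation; the function names, the random starting rows, and the sample data are all assumptions for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Assign each row to its nearest centroid, then recompute the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen rows
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no customer changed segment: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a segment is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical data: two tight groups of three customers each.
customers = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
labels, centroids = k_means(customers, k=2)
```

Each pass costs roughly (number of rows) × (number of segments) distance computations, whereas hierarchical clustering compares pairs of rows, which helps explain why partitioning scales better to large data sets.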

Data transformation

The Data transformation dropdown under the Segmentation data section allows you to select a method to pre-process the data.

  • None. This option indicates you want to use the original data.
  • Standardization (by column). This will standardize the data by column (variable), so that columns measured on different scales become comparable.
  • Standardization (by row). This will standardize the data by row (respondent), so that data will be measured as a deviation from each respondent's average response. This method is only valid if variables are already measured on the same scale.
  • Box-Cox normalization. This will apply Box-Cox normalization to the data which reduces data skewness and the effect of potential outliers.
  • Factorization. This will transform the data using factorization, and then segment the (weighted) factor loadings instead. This will remove smaller factors (i.e., noise) from the data.
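Two of the transformations above can be sketched in a few lines. The example below shows row centering (each response expressed as a deviation from the respondent's own average) and a Box-Cox power transform for a given lambda. The function names and data are hypothetical, and the details are assumptions: this tutorial does not state whether Enginius also divides rows by their standard deviation, or how it chooses lambda.

```python
import math

def center_rows(rows):
    """Express each response as a deviation from that respondent's average."""
    return [[v - sum(row) / len(row) for v in row] for row in rows]

def box_cox(values, lam):
    """Box-Cox power transform for strictly positive values and a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]  # limiting case of the formula
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical: two respondents answering three items on the same 1-10 scale.
responses = [[7, 9, 8], [2, 4, 3]]
centered = center_rows(responses)  # both rows become [-1.0, 1.0, 0.0]

# Hypothetical right-skewed variable; lambda = 0 is the log transform.
spend = [1, 10, 100, 1000]
logged = box_cox(spend, 0)
```

In the row-centering example, a generous rater and a harsh rater with the same pattern of relative preferences end up with identical rows, which is the point of standardizing by respondent.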

After selecting all the options, click the Run button found at the bottom of the Segmentation analysis setup window to begin the analysis. By default, the report will output as a web page.

 

Reminder: Clicking the world icon beside the “Run” option will allow you to choose a different output format for the report.

 

 

You will see a pop-up indicating that the segmentation analysis is underway. Your report is produced in the format you chose; reports in Microsoft, PDF, or Zip format may download automatically to your hard drive.

Interpreting the Segmentation Results

The report generated by segmentation analysis contains several sections, depending on the options chosen.  The results described below were generated with these model settings:

The first section depicts the number of segments (either chosen by the user or automatically chosen by Enginius).  These segments are depicted in 3 different displays: dendrogram, silhouette chart, and scree plot.

Dendrogram

Dendrograms provide graphical representations of the loss of information generated by grouping different clusters (or customers) together. 

At one extreme (upper part of the dendrogram), all customers group into one cluster, and the loss of information is maximum, because they all receive undifferentiated treatment, regardless of their characteristics.

At the other extreme (lower part of the dendrogram), customers appear in separate, small clusters, and only those customers very similar to one another group together (“similar” or “close” in this context refers to the distance between two customers in terms of the segmentation variables).

When reviewing a dendrogram, look for significant distances or “jumps” in the distances (using the scale on the Y axis). For example, the OfficeStar example contains a very large jump when moving from three to two clusters. Grouping these three clusters into two generates a significant loss of information; in other words, it results in grouping within the same cluster customers who are very dissimilar. In the preceding example, a three-cluster solution seems to be the best approach.
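The idea of looking for a jump can be sketched numerically. The example below merges clusters bottom-up and records the loss of information at each merge, then reports the number of clusters just before the largest jump. It assumes Ward's criterion (the increase in within-cluster sum of squares) as the measure of loss; this tutorial does not state which linkage Enginius uses, and the function names and data are hypothetical.

```python
def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def merge_cost(a, b):
    """Ward's criterion: increase in within-cluster sum of squares on merging."""
    gap = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
    return len(a) * len(b) / (len(a) + len(b)) * gap

def merge_history(points):
    """Repeatedly merge the cheapest pair; record (clusters before merge, cost)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        pairs = [(i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        history.append((len(clusters), merge_cost(clusters[i], clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return history

def clusters_before_biggest_jump(history):
    """Number of clusters just before the largest increase in merge cost."""
    jumps = [(later[1] - earlier[1], later[0])
             for earlier, later in zip(history, history[1:])]
    return max(jumps)[1]

# Hypothetical data: three well-separated groups of three customers.
customers = [[0, 0], [0, 1], [1, 0],
             [10, 0], [10, 1], [11, 0],
             [5, 9], [5, 10], [6, 9]]
history = merge_history(customers)
suggested = clusters_before_biggest_jump(history)
```

This naive version recomputes every pairwise cost at each step, so it is only practical for small data sets. A purely numerical rule like this is a starting point, not a substitute for judging whether the resulting segments make managerial sense.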

Scree Plot

The scree plot compares the sum of squared errors (SSE) for each cluster solution. A good cluster solution is often found where the decrease in SSE slows dramatically, creating an “elbow”. Such an elbow may not always exist.
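As a small illustration of the quantities behind a scree plot, the sketch below computes the SSE of a given solution and locates the elbow as the point where the curve bends most (the largest second difference). The function names, the elbow rule, and the SSE values are assumptions for illustration, not Enginius output.

```python
def sse(points, labels):
    """Sum of squared distances from each point to its own segment centroid."""
    total = 0.0
    for segment in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == segment]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, center))
    return total

def elbow(sse_by_k):
    """Return the k where the SSE curve bends most.

    Assumes the keys are consecutive integers (e.g., 1 to 9 segments).
    """
    ks = sorted(sse_by_k)
    bends = {
        k: (sse_by_k[k - 1] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[k + 1])
        for k in ks[1:-1]
    }
    return max(bends, key=bends.get)

# Four hypothetical customers on one variable, split into two segments.
points = [[0, 0], [2, 0], [10, 0], [12, 0]]
two_segments = sse(points, [0, 0, 1, 1])  # 4.0
one_segment = sse(points, [0, 0, 0, 0])   # 104.0

# Hypothetical SSE values for solutions with 1 to 5 segments.
best_k = elbow({1: 900.0, 2: 500.0, 3: 150.0, 4: 130.0, 5: 115.0})  # 3
```

When the SSE declines smoothly, the second differences are all small and of similar size, which is the numerical counterpart of a plot with no visible elbow.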

The above charts are simply a graphical representation of the clustering output. For a more detailed understanding of cluster members and attributes, you must analyze the other segmentation output as well.

Segment description

This section of the report contains the statistical output of the clustering process: segment size, segment description, segment differences, and a spatial depiction of segments and segmentation variables.

Segment size: The population of each segment in count and percent is shown in the table below.

Segment description: Average value of each segmentation variable, overall and for each segment (centroid). Segmentation variables that are statistically different from the rest of the population are highlighted in red (lower) or green (higher).

Segment differences per segment: Expanding on the previous chart colors, the shade of cell color indicates to what extent a segment is statistically different from the rest of the population on each segmentation variable.

Segment space: Spatial representation of segments and segmentation variables, using principal component analysis. Because only the first two dimensions of the PCA are displayed, and these two dimensions capture only part of the variance in the data, some differences between segments might not appear here. Note that segmentation variables with no variance, if any, have been excluded.

Two clusters that appear to overlap in the first two dimensions might actually be distinct on other dimensions. Consequently, this chart is a useful guide, especially to see which segmentation variables are correlated, but may be misleading if used to select the optimal number of segments.

Segment membership: The chart below shows an excerpt of the respondents mapped to their segment. The complete membership list is only available in the Excel formatted report.



Segment profiles (only available when data is NOT standardized)

A spider chart is displayed showing the averages of the segment variables across all segments.

 

To make each segment and its segmentation variables easier to visualize, a chart is created to represent the profile of each segment. For each segment, the segmentation variables are sorted in decreasing order of magnitude.

  • The colored dots represent the average of the segment. 
  • The horizontal lines represent the standard deviations within that segment. 
  • The vertical, gray lines represent the averages of the rest of the population, after excluding members of the segment under scrutiny.

Descriptor analysis

The next section of the report shows the output of Descriptor analysis (if selected). This portion of the report will show information regarding:  

  • Segment sizes depict the number of respondents who appear in each cluster, along with the proportion of the whole population that each cluster represents.
  • Descriptor variables depict the means of each descriptor variable for each cluster.
  • Descriptor function reflects the correlation of the variables with each significant descriptor function and thus indicates the predictive ability of each descriptor function.  
  • Confusion matrix depicts how well the descriptor data predict correct clusters. Two matrices are available, one showing the actual data counts and the other showing percentages for these same data.
  • Classification weights and classification coefficients are intermediary results required to run further classification analyses on external data. These matrices are of no particular interest in themselves and cannot be easily interpreted, but they are needed to carry out further classification analyses.

Descriptors

This table reports the descriptor variable averages of each segment. The more the segments differ on these variables, the easier it will be to predict segment membership based on descriptor data alone.

Descriptor data per segment: Average value of each descriptor variable, overall and within each cluster. Descriptor variables that are statistically different from the rest of the population are highlighted in red (lower) or green (higher).

Descriptor differences per segment. Expanding on the previous chart colors, the shade of cell color indicates to what extent the distribution of a descriptor variable in a segment is statistically different from the rest of the population.

Descriptor space

Spatial representation of segments and descriptor variables, using principal component analysis. Because only the first two dimensions of the PCA are displayed, and these two dimensions capture only part of the variance in the data, some differences between segments might not appear here. Note that descriptor variables with no variance, if any, have been excluded.

If two or more segments fully overlap, it is unlikely that they could be clearly separated based on descriptor data alone.

However, two segments that seem to overlap on two dimensions may be more clearly separated on other dimensions. Consequently, the confusion matrix is a better guide to assess the quality of segment discrimination.

Classification model

Often, segmentation variables may not be available to managers, but descriptors may be.

In this section, we explore whether descriptors alone could predict segment membership with sufficient accuracy. The confusion matrix and hit rates (reported below) indicate whether the model is accurate enough.

For descriptor analysis, Enginius uses a multinomial logit model (similar to the one used to predict 'choices between multiple alternatives (A/B/C)' in the predictive modeling module).

The largest segment is selected as the default option (dummy), and the model identifies which descriptor variables are the most significant to predict cluster memberships. If a descriptor variable is highly predictive, its p-values will be close to zero, and the cells will appear in green (or red).
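For readers unfamiliar with multinomial logit, the sketch below shows how such a model turns descriptor values into segment probabilities: the reference segment has a utility of zero, every other segment has a linear utility, and the probabilities are the normalized exponentials. The coefficients, segment names, and function name are hypothetical; they are not taken from Enginius output.

```python
import math

def segment_probabilities(descriptors, coefficients, reference="Segment 1"):
    """Multinomial logit probabilities with a zero-utility reference segment."""
    utilities = {reference: 0.0}
    for segment, (intercept, weights) in coefficients.items():
        utilities[segment] = intercept + sum(
            w * x for w, x in zip(weights, descriptors)
        )
    top = max(utilities.values())  # subtract the max for numerical stability
    exps = {s: math.exp(u - top) for s, u in utilities.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}

# Hypothetical coefficients: (intercept, one weight per descriptor variable).
coefficients = {
    "Segment 2": (-0.5, [2.0, 0.3]),
    "Segment 3": (0.2, [-1.0, 0.8]),
}
# Hypothetical respondent described by two descriptor values.
probabilities = segment_probabilities([1.0, 0.0], coefficients)
predicted = max(probabilities, key=probabilities.get)  # "Segment 2"
```

Because the reference segment's utility is fixed at zero, each coefficient is read relative to that segment: a positive weight means higher values of the descriptor make the segment more likely than the reference.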

Model coefficients

P-values

Confusion matrix

The confusion matrix compares actual segment membership (obtained from the segmentation analysis and the original segmentation variables) and predicted segment membership (obtained from the descriptor analysis and the descriptors alone). When actual and predicted segment memberships coincide, the diagonal elements will be comparatively large, indicating that the descriptor model is accurate.
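The comparison described above can be reproduced by hand. This minimal sketch builds a confusion matrix from two lists of segment labels and computes the hit rate as the share of respondents on the diagonal. The function names and labels are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count respondents for every (actual segment, predicted segment) pair."""
    segments = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in segments] for a in segments]
    return segments, matrix

def hit_rate(matrix):
    """Share of respondents on the diagonal, i.e., classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical memberships for eight respondents in three segments.
actual = [1, 1, 1, 2, 2, 3, 3, 3]
predicted = [1, 1, 2, 2, 2, 3, 1, 3]
segments, matrix = confusion_matrix(actual, predicted)
# Rows are actual segments, columns are predicted segments:
# [[2, 1, 0],
#  [0, 2, 0],
#  [1, 0, 2]]
rate = hit_rate(matrix)  # 0.75
```

A useful benchmark for the hit rate is the share of the largest segment, since a model that always predicts that segment would achieve it without using any descriptor.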

The plot below shows a graphical representation of the confusion matrix. Bubbles along the diagonal show where respondents were correctly classified.

 

Model predictions

This table details the probabilities of each member of the segmentation dataset to belong to each cluster (as predicted by the descriptor model and the descriptors alone). The segment with the highest probability is retained, and is compared to the actual segment membership to measure model accuracy and classification errors. The complete list is only available in the Excel formatted report.

Interpreting the Classification Output

Introduction

If you ran the analysis with descriptor data, the software estimated the best way to predict which cluster an individual is most likely to belong to, based solely on descriptor data. This is very useful for predicting, for example, whether young people (age as a descriptor variable) are likely to be more price sensitive (price sensitivity as a segmentation variable), or whether businesses in certain industries require more support than others.

The ability to recover segment membership from descriptor variables is best summarized by the confusion matrix and hit rate (see above).

Once this descriptor analysis has been applied to the original dataset, it can be applied again to external customers for whom descriptor data—but no segmentation data—is available. The process of classifying customers among segments, based on a preceding segmentation analysis, but using descriptor data only, is called classification analysis.

Note that this classification of customers across segments is our best guess based on descriptor analysis. It is not perfect, and some customers might be misclassified, that is, they are the closest to segment A in terms of needs, but their descriptor variables send us astray and predict they are more likely to belong to segment B.

 

Classification analysis is usually applied to new customers, for whom segmentation data is not available. For learning purposes, you can also apply it to the descriptor data of customers for whom segmentation data is available, and see how well segment memberships are recovered. This analysis is done automatically when you run a segmentation analysis with descriptor data, and its results are summarized by the confusion matrix.

 

 

 

 

 

 

Interpreting the results
The Classification output shows the results of applying the descriptor model to the respondents to be classified. Because segmentation variables and actual segment membership are unavailable, the actual accuracy of the model predictions is unknown and can only be inferred from the previous section.

Segment size


 

Model predictions

This table details the probabilities of each member of the classification dataset to belong to each cluster (as predicted by the descriptor model and the descriptors alone). The segment with the highest probability is retained.

 

 

 

 

 

 

 

 
