Questionnaire to clustering to Sankey

Shopper Journey Segmentation

An end-to-end demonstration of how an online questionnaire collects journey behavior, how multi-select touchpoints are converted into clustering features, and how the clustered results feed a Sankey diagram as the primary output.

Online Questionnaire

Respondent-facing mock

This is the collection screen that creates the variables used later for synthetic data generation, clustering, and Sankey aggregation.

Home hanging shopper survey

J1 First discovery

Where did you first notice or learn about the hanging or mounting solution?

Google Search
YouTube
Shopee
Hardware Store
Instagram / TikTok
Friend / Family

J2 Research touchpoints

Which sources did you use before deciding? Select all that apply.

Google Search
YouTube
Shopee Reviews
momo Reviews
Brand Website
In-store advice

J3 Most influential source

Which one source most influenced your final product choice?

Google Search
Shopee Reviews
YouTube
Hardware Store

J4 / J6 Purchase and timing

Where did you buy, and how many days passed from first need to purchase?

Shopee
momo
Hardware Store

Encoded row from this mock response

J1_discovery: Google Search
J2_touchpoint_count: 4
J2_search_count: 1
J2_social_content_count: 1
J2_marketplace_count: 1
J2_brand_owned_count: 1
J2_offline_count: 0
J3_research: Shopee Reviews
J4_purchase: Shopee
J6_days: 9
J2 is collapsed into count features so that each respondent stays on a single row. This avoids duplicating respondents in the clustering model, which would happen if every selected touchpoint became its own row.
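
The J2 encoding can be sketched as below. This is a minimal illustration, not the demo's actual code: the option-to-family mapping, the `encode_j2` helper, and the four selected options are assumptions chosen to be consistent with the encoded row shown above.

```python
from collections import Counter

# Hypothetical mapping from J2 answer options to channel families.
# The questionnaire does not state this mapping; it is assumed here.
CHANNEL_FAMILY = {
    "Google Search": "search",
    "YouTube": "social_content",
    "Shopee Reviews": "marketplace",
    "momo Reviews": "marketplace",
    "Brand Website": "brand_owned",
    "In-store advice": "offline",
}
FAMILIES = ["search", "social_content", "marketplace", "brand_owned", "offline"]


def encode_j2(selected):
    """Collapse one respondent's J2 multi-select into count features."""
    unique = set(selected)  # a touchpoint counts once even if submitted twice
    counts = Counter(CHANNEL_FAMILY[option] for option in unique)
    row = {"J2_touchpoint_count": len(unique)}
    for family in FAMILIES:
        row[f"J2_{family}_count"] = counts.get(family, 0)
    return row


# Assumed selection for the mock respondent (one row in, one row out).
row = encode_j2(["Google Search", "YouTube", "Shopee Reviews", "Brand Website"])
print(row)
```

Because the function returns exactly one dictionary per respondent, the number of rows passed to clustering equals the number of respondents, whatever the number of boxes ticked.
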
Synthetic Data

Mock respondent data

The table below shows respondent-level rows generated from the questionnaire structure before clustering and Sankey aggregation.

ID | J1 | J2 count | J3 | J4 | J6 days | Brand | Assigned group

Clustering Method

Proposed latent class clustering

We propose latent class clustering for the final segmentation. It suits questionnaire data because it estimates hidden shopper groups from observed response patterns and gives each respondent a probability of belonging to each class, rather than a single hard assignment.


J2 is still used, but not as a raw multi-select string. Each respondent keeps one row, and selected touchpoints are converted into behavioral indicators such as total touchpoint count and channel-family counts for search, social or content, marketplace, offline, and brand-owned sources.


Discovery, research, purchase, J2 touchpoint behavior, and decision duration are used to infer journey classes. Brand, mission, trigger, age, and gender are held out of the model and used only after grouping to profile the classes.

Latent class inputs

J1 Discovery channel
J3 Most influential research channel
J4 Final purchase channel
J2 Total touchpoint count
J2 Search, social/content, marketplace, offline, brand-owned counts
J6 Days to decision
One row per respondent; no attitudinal inputs; profile fields held out.
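
To make the latent class idea concrete, here is a small expectation-maximization sketch for categorical indicators only. It is an illustration under assumptions, not the method used to build the demo: `fit_lca` and its parameters are hypothetical names, the toy rows are invented, and numeric inputs such as the J2 counts and J6 days are assumed to have been binned into categories beforehand. A real project would more likely rely on a dedicated latent class package.

```python
import math
import random


def fit_lca(rows, n_classes, n_iter=50, seed=0, alpha=1e-3):
    """Fit a basic latent class model to categorical rows with EM.

    rows: list of equal-length tuples of categorical values.
    Returns (class_weights, posteriors, log_likelihood).
    """
    rng = random.Random(seed)
    n, n_vars = len(rows), len(rows[0])
    levels = [sorted({r[j] for r in rows}) for j in range(n_vars)]

    # Start from random soft assignments so the classes can separate.
    post = []
    for _ in rows:
        w = [rng.random() + 1e-6 for _ in range(n_classes)]
        total = sum(w)
        post.append([x / total for x in w])

    weights, loglik = [], float("-inf")
    for _ in range(n_iter):
        # M-step: class weights and per-class response probabilities.
        weights = [sum(p[k] for p in post) / n for k in range(n_classes)]
        theta = []
        for k in range(n_classes):
            per_var = []
            for j in range(n_vars):
                counts = {v: alpha for v in levels[j]}  # light smoothing
                for r, p in zip(rows, post):
                    counts[r[j]] += p[k]
                total = sum(counts.values())
                per_var.append({v: c / total for v, c in counts.items()})
            theta.append(per_var)

        # E-step: posterior class membership for each respondent.
        loglik, new_post = 0.0, []
        for r in rows:
            joint = [
                weights[k] * math.prod(theta[k][j][r[j]] for j in range(n_vars))
                for k in range(n_classes)
            ]
            total = sum(joint)
            loglik += math.log(total)
            new_post.append([x / total for x in joint])
        post = new_post
    return weights, post, loglik


# Toy data (hypothetical): J1, J3, J4 answers for two obvious journey patterns.
online = ("Google Search", "Shopee Reviews", "Shopee")
offline = ("Hardware Store", "Hardware Store", "Hardware Store")
rows = [online] * 10 + [offline] * 10

weights, post, ll = fit_lca(rows, n_classes=2)
print(weights, post[0], post[-1], ll)
```

Each entry of `post` is a respondent's probability of belonging to each class, which is the soft assignment described above; the assigned group is typically the class with the highest probability.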

The final number of classes should be selected by balancing statistical fit, class size, and interpretability. For the demonstration, four journey classes are used because they produce a clear and reviewable Sankey story.
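
One common way to quantify the "statistical fit" part of that trade-off is the Bayesian information criterion (BIC), where lower is better. The sketch below assumes a categorical latent class model; the helper names are illustrative, and the level counts in the example (6, 4, 3) are simply the number of answer options shown for J1, J3, and J4 in the mock questionnaire.

```python
import math


def lca_free_parameters(n_classes, levels_per_variable):
    """Free parameters of a categorical latent class model.

    (K - 1) class weights, plus (levels - 1) response probabilities
    per variable within each of the K classes.
    """
    per_class = sum(levels - 1 for levels in levels_per_variable)
    return (n_classes - 1) + n_classes * per_class


def bic(log_likelihood, n_params, n_respondents):
    """Bayesian information criterion: -2 * logL + p * ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_respondents)


def smallest_class_share(class_weights):
    """Share of the smallest class, as a simple class-size check."""
    return min(class_weights)


# Four classes over J1 (6 options), J3 (4 options), J4 (3 options).
n_params = lca_free_parameters(4, [6, 4, 3])
print(n_params)
```

BIC alone does not settle the choice: a solution with a slightly worse BIC can still be preferable if its classes are larger and easier to interpret.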

Sankey Output

Shopper journey by clustered group

The Sankey is generated from synthetic respondent-level questionnaire data after clustering. Filter by group to review how each journey type moves from Discovery through Research and Purchase to Brand.
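
The aggregation behind such a diagram amounts to counting respondents on each stage-to-stage link. The following is a sketch under assumptions: the `brand` and `group` column names, the `sankey_links` helper, and the three example rows are hypothetical, not taken from the demo's data.

```python
from collections import Counter

# Stage order: Discovery -> Research -> Purchase -> Brand.
STAGES = ["J1_discovery", "J3_research", "J4_purchase", "brand"]


def sankey_links(rows, group=None):
    """Count respondent flows between consecutive stages.

    Node labels are prefixed with the stage name so that, for example,
    "Hardware Store" at discovery and at purchase remain separate nodes.
    """
    links = Counter()
    for r in rows:
        if group is not None and r["group"] != group:
            continue
        for src, dst in zip(STAGES, STAGES[1:]):
            links[(f"{src}: {r[src]}", f"{dst}: {r[dst]}")] += 1
    return links


# Hypothetical respondent rows after clustering.
rows = [
    {"J1_discovery": "Google Search", "J3_research": "Shopee Reviews",
     "J4_purchase": "Shopee", "brand": "Brand A", "group": 1},
    {"J1_discovery": "Google Search", "J3_research": "Shopee Reviews",
     "J4_purchase": "momo", "brand": "Brand B", "group": 1},
    {"J1_discovery": "Hardware Store", "J3_research": "Hardware Store",
     "J4_purchase": "Hardware Store", "brand": "Brand A", "group": 2},
]

links = sankey_links(rows)              # all groups
group_2 = sankey_links(rows, group=2)   # filtered to one group
for (source, target), value in links.items():
    print(source, "->", target, value)
```

The resulting (source, target, value) triples are the usual input shape for Sankey plotting libraries, and filtering by group only changes which rows are counted.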

All groups - n = 600
Legend: Search, Social / Content, Marketplace, Offline / Physical, Brand

Grouping Review

Cluster profiles for review

These cards are generated from the same respondent rows as the Sankey. Names are analyst labels applied after reviewing channel patterns, depth, speed, missions, triggers, brands, and demographics.
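
A profile card of this kind can be approximated by computing, within each group, the share of every value of a held-out field. The sketch below is illustrative only: `profile_groups`, the field names, and the example rows are assumptions rather than the demo's actual schema.

```python
from collections import Counter, defaultdict


def profile_groups(rows, fields):
    """Summarize held-out fields per assigned group.

    Returns {group: {"n": size, field: {value: share, ...}, ...}}.
    """
    by_group = defaultdict(list)
    for r in rows:
        by_group[r["group"]].append(r)

    cards = {}
    for group, members in by_group.items():
        card = {"n": len(members)}
        for field in fields:
            counts = Counter(m[field] for m in members)
            # most_common() orders values from most to least frequent.
            card[field] = {v: c / len(members) for v, c in counts.most_common()}
        cards[group] = card
    return cards


# Hypothetical rows: profile fields were not used to form the groups.
rows = [
    {"group": 1, "brand": "Brand A", "gender": "F"},
    {"group": 1, "brand": "Brand A", "gender": "M"},
    {"group": 1, "brand": "Brand B", "gender": "F"},
    {"group": 2, "brand": "Brand B", "gender": "M"},
]

cards = profile_groups(rows, ["brand", "gender"])
print(cards)
```

Analyst labels would then be written by reading these shares alongside the channel patterns, since the profile fields describe the classes without having shaped them.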