Shopper Journey Segmentation - Data Collection to Sankey

Online Questionnaire

Respondent-facing mock

This is the collection screen that creates the variables used later for synthetic data generation, clustering, and Sankey aggregation.

J1 First discovery

Where did you first notice or learn about the hanging or mounting solution?

Google Search

YouTube

Shopee

Hardware Store

Instagram / TikTok

Friend / Family

J2 Research touchpoints

Which sources did you use before deciding? Select all that apply.

Google Search

YouTube

Shopee Reviews

momo Reviews

Brand Website

In-store advice

J3 Most influential source

Which one source most influenced your final product choice?

Google Search

Shopee Reviews

YouTube

Hardware Store

J4 / J6 Purchase and timing

Where did you buy, and how many days passed from first need to purchase?

Shopee

momo

Hardware Store

Encoded row from this mock response

J1_discovery: Google Search

J2_touchpoint_count: 4

J2_search_count: 1

J2_social_content_count: 1

J2_marketplace_count: 1

J2_brand_owned_count: 1

J2_offline_count: 0

J3_research: Shopee Reviews

J4_purchase: Shopee

J6_days: 9

The J2 multi-select is collapsed into count features (one total and one count per channel family), so each respondent stays on a single row. This avoids duplicating people in the clustering model.
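The encoding above can be sketched in a few lines of Python. The mapping from each J2 option to a channel family is an assumption inferred from the feature names, and the function name encode_j2 is hypothetical. The example selection is one set of answers consistent with the encoded row shown above; the mock does not state which boxes were ticked.

```python
from collections import Counter

# ASSUMPTION: this option-to-family mapping is inferred from the feature
# names in the encoded row; the source does not spell it out.
CHANNEL_FAMILY = {
    "Google Search": "search",
    "YouTube": "social_content",
    "Shopee Reviews": "marketplace",
    "momo Reviews": "marketplace",
    "Brand Website": "brand_owned",
    "In-store advice": "offline",
}
FAMILIES = ["search", "social_content", "marketplace", "brand_owned", "offline"]


def encode_j2(selected):
    """Collapse one respondent's J2 multi-select into count features."""
    family_counts = Counter(CHANNEL_FAMILY[option] for option in selected)
    row = {"J2_touchpoint_count": len(selected)}
    for family in FAMILIES:
        # Families with no selected option get an explicit 0.
        row[f"J2_{family}_count"] = family_counts.get(family, 0)
    return row


# One selection that is consistent with the encoded row in the mock.
example = encode_j2(["Google Search", "YouTube", "Shopee Reviews", "Brand Website"])
print(example)
```

Because the function returns a single dictionary per respondent, the output can be merged into the respondent row without creating one record per selected touchpoint.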

Synthetic Data

Mock respondent data

The table below shows respondent-level rows generated from the questionnaire structure before clustering and Sankey aggregation.

ID	J1	J2 count	J3	J4	J6 days	Brand	Assigned group
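As a rough illustration of how rows with these columns might be produced, the sketch below draws each answer from the option lists in the questionnaire mock. Uniform random sampling, the 1 to 30 day range for J6, the ID format, and the function names are all assumptions; the source does not describe the actual generator. Brand and Assigned group are left out because the source does not list brand names, and the group is assigned later by clustering.

```python
import random

# Option lists copied from the questionnaire mock.
J1_OPTIONS = ["Google Search", "YouTube", "Shopee", "Hardware Store",
              "Instagram / TikTok", "Friend / Family"]
J2_OPTIONS = ["Google Search", "YouTube", "Shopee Reviews", "momo Reviews",
              "Brand Website", "In-store advice"]
J3_OPTIONS = ["Google Search", "Shopee Reviews", "YouTube", "Hardware Store"]
J4_OPTIONS = ["Shopee", "momo", "Hardware Store"]


def make_respondent(rng, index):
    """Build one synthetic respondent row (uniform sampling is an assumption)."""
    picked = rng.sample(J2_OPTIONS, k=rng.randint(1, len(J2_OPTIONS)))
    return {
        "ID": f"R{index:04d}",          # hypothetical ID format
        "J1": rng.choice(J1_OPTIONS),
        "J2 count": len(picked),
        "J3": rng.choice(J3_OPTIONS),
        "J4": rng.choice(J4_OPTIONS),
        "J6 days": rng.randint(1, 30),  # assumed range, not from the source
    }


def make_dataset(n=600, seed=42):
    """Generate n rows with a fixed seed so the table is reproducible."""
    rng = random.Random(seed)
    return [make_respondent(rng, i) for i in range(1, n + 1)]


rows = make_dataset()
print(rows[0])
```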

Clustering Method

Proposed latent class clustering

We propose using Latent Class Clustering for the final segmentation. It is well suited to questionnaire data because it estimates hidden shopper groups from observed response patterns and gives each respondent a probability of belonging to each class.

J2 is still used, but not as a raw multi-select string. Each respondent keeps one row, and selected touchpoints are converted into behavioral indicators such as total touchpoint count and channel-family counts for search, social or content, marketplace, offline, and brand-owned sources.

Discovery, research, purchase, J2 touchpoint behavior, and decision duration are used to infer journey classes. Brand, mission, trigger, age, and gender are held out of the model and used only after grouping to profile the classes.
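To make the idea of class membership probabilities concrete, here is a minimal latent class sketch fitted with expectation-maximisation in NumPy. It is an illustration under stated assumptions, not the production method: every input is treated as a categorical code, so the J2 counts and J6 days would first need to be binned, and the function name fit_lca and its parameters are hypothetical. A dedicated latent class package would normally be preferred for the final segmentation.

```python
import numpy as np


def fit_lca(X, n_classes, n_iter=200, seed=0):
    """Fit a basic latent class model by EM.

    X is an (n_respondents, n_variables) integer array of category codes
    starting at 0. Returns class weights, per-variable conditional
    probability tables, posterior memberships, and the log-likelihood
    evaluated in the final E-step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_cats = X.max(axis=0) + 1
    weights = np.full(n_classes, 1.0 / n_classes)
    # theta[j][k, c] = P(variable j takes category c | class k)
    theta = [rng.dirichlet(np.ones(k), size=n_classes) for k in n_cats]

    for _ in range(n_iter):
        # E-step: log joint of each respondent with each class.
        log_p = np.tile(np.log(weights), (n, 1))
        for j in range(d):
            log_p += np.log(theta[j][:, X[:, j]]).T
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        post = np.exp(log_p - log_norm)  # each row sums to 1

        # M-step: update class weights and conditional tables.
        weights = np.clip(post.mean(axis=0), 1e-12, None)
        weights /= weights.sum()
        for j in range(d):
            counts = np.stack(
                [post[X[:, j] == c].sum(axis=0) for c in range(n_cats[j])],
                axis=1,
            ) + 1e-6  # small smoothing avoids log(0)
            theta[j] = counts / counts.sum(axis=1, keepdims=True)

    return weights, theta, post, float(log_norm.sum())
```

Each row of the returned posterior matrix holds one respondent's probability of belonging to each class; taking the largest entry gives a hard assignment for downstream charts.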

Latent class inputs

J1 Discovery channel

J3 Most influential research channel

J4 Final purchase channel

J2 Total touchpoint count

J2 Search, social/content, marketplace, offline, brand-owned counts

J6 Days to decision

One row per respondent

No attitudinal inputs

Profile fields held out

The final number of classes should be selected by balancing statistical fit, class size, and interpretability. For the demonstration, four journey classes are used because they produce a clear and reviewable Sankey story.
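For the statistical-fit part of that trade-off, one common criterion is BIC, which penalises the log-likelihood by the number of free parameters. The sketch below assumes a purely categorical latent class model; using BIC here is an assumption, since the source does not name a fit statistic, and the function names are hypothetical. The category counts in the usage line (6, 4, 3) are the numbers of options shown for J1, J3, and J4 in the mock.

```python
import math


def lca_free_params(n_classes, n_cats):
    """Free parameters of a categorical latent class model.

    (n_classes - 1) class weights, plus (categories - 1) conditional
    probabilities per variable for every class.
    """
    return (n_classes - 1) + n_classes * sum(c - 1 for c in n_cats)


def lca_bic(log_lik, n_obs, n_classes, n_cats):
    """Bayesian information criterion; lower values indicate better fit."""
    return -2.0 * log_lik + lca_free_params(n_classes, n_cats) * math.log(n_obs)


# Parameter count for four classes over J1 (6 options), J3 (4), J4 (3).
print(lca_free_params(4, [6, 4, 3]))
```

BIC covers only the fit criterion. Class size and interpretability still need an analyst's review of the candidate solutions.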

Sankey Output

Shopper journey by clustered group

The Sankey is generated from synthetic respondent-level questionnaire data after clustering. Filter by group to review how each journey type moves from Discovery through Research and Purchase to Brand.
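The aggregation behind such a chart can be sketched as counting respondents on each link between adjacent stages. The row keys, the group field, the brand labels "Brand A" and "Brand B", and the function name sankey_links are hypothetical placeholders; the Research stage is taken to be the J3 answer, following the J3_research field in the encoded row.

```python
from collections import Counter

# Stage order follows the text: Discovery, Research, Purchase, Brand.
STAGES = ["J1", "J3", "J4", "Brand"]


def sankey_links(rows, group=None):
    """Count respondents on each stage-to-stage link, optionally for one group."""
    links = Counter()
    for row in rows:
        if group is not None and row["group"] != group:
            continue
        for src_stage, dst_stage in zip(STAGES, STAGES[1:]):
            # Prefix node labels with the stage so that, for example,
            # YouTube as discovery and YouTube as research stay distinct.
            source = f"{src_stage}: {row[src_stage]}"
            target = f"{dst_stage}: {row[dst_stage]}"
            links[(source, target)] += 1
    return links


# Hypothetical rows for illustration only.
rows = [
    {"J1": "Google Search", "J3": "Shopee Reviews", "J4": "Shopee", "Brand": "Brand A", "group": 1},
    {"J1": "Google Search", "J3": "Shopee Reviews", "J4": "Shopee", "Brand": "Brand B", "group": 1},
    {"J1": "YouTube", "J3": "YouTube", "J4": "Hardware Store", "Brand": "Brand A", "group": 2},
]
links = sankey_links(rows)
for (source, target), value in links.items():
    print(source, "->", target, value)
```

Each (source, target, value) triple maps directly onto the link format most Sankey plotting libraries expect, and passing a group value reproduces the per-group filter.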

All groups - n = 600

Legend: Search, Social / Content, Marketplace, Offline / Physical, Brand

Grouping Review

Cluster profiles for review

These cards are generated from the same respondent rows as the Sankey. Names are analyst labels applied after reviewing channel patterns, depth, speed, missions, triggers, brands, and demographics.
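A profile card of this kind could be computed from the grouped rows roughly as follows. The field names, the choice of summary statistics (size, share, median days, most common purchase channel), and the function name profile_groups are assumptions for illustration; the source does not specify what each card contains beyond the reviewed dimensions.

```python
from collections import Counter, defaultdict
from statistics import median


def profile_groups(rows):
    """Summarise each assigned group from respondent-level rows."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group"]].append(row)

    profiles = {}
    for group, members in sorted(by_group.items()):
        purchase_counts = Counter(m["J4_purchase"] for m in members)
        profiles[group] = {
            "n": len(members),
            "share": len(members) / len(rows),
            "median_days": median(m["J6_days"] for m in members),
            "top_purchase": purchase_counts.most_common(1)[0][0],
        }
    return profiles


# Hypothetical rows for illustration only.
rows = [
    {"group": 1, "J4_purchase": "Shopee", "J6_days": 9},
    {"group": 1, "J4_purchase": "Shopee", "J6_days": 5},
    {"group": 1, "J4_purchase": "momo", "J6_days": 7},
    {"group": 2, "J4_purchase": "Hardware Store", "J6_days": 1},
]
profiles = profile_groups(rows)
print(profiles)
```

The same pattern extends to the held-out fields (brand, mission, trigger, age, gender) by adding one more summary per field, which keeps the profiling step separate from the model inputs.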
