Segment Based on Customer Behavior - Not Characteristics - with Louvain

Every meaningful marketing decision starts with segmentation—who to target, what to recommend, and how to personalize an experience. The quality of those decisions depends entirely on how customers are grouped.

There are two fundamentally different ways to think about customer segmentation. We can group customers by who they are, or by what they do.

Overview

What Is Neo4j Graph Analytics For Snowflake?

Neo4j helps organizations find hidden relationships and patterns across billions of data connections deeply, easily, and quickly. Neo4j Graph Analytics for Snowflake brings to the power of graph directly to Snowflake, allowing users to run 65+ ready-to-use algorithms on their data, all without leaving Snowflake!

Prerequisites

What You Will Need

  • A Snowflake account with appropriate access to databases and schemas.

  • Neo4j Graph Analytics application installed from the Snowflake marketplace. Access the marketplace via the menu bar on the left hand side of your screen, as seen below: image

What You Will Build

  • A method to identify communities that are at high risk of fraud in P2P networks

What You Will Learn

  • How to prepare and project your data for graph analytics

  • How to read and write directly from and to your snowflake tables

Loading The Data

Dataset overview : This dataset is a subset of instakart data and can be found here.

Let’s name our database RETAIL_RECS. We are going to add five new tables:

  • One called aisles based on the aisles.csv

  • One called baskets based on baskets.csv

  • One called departments based on departments.csv

  • One called order_history based on order_history.csv

  • One called products based on products.csv

Follow the steps found here to load in your data.

Then switch to that database:

USE DATABASE RETAIL_RECS;
USE SCHEMA PUBLIC;

Segmentation by Characteristics

In a traditional approach like K-means, each customer is first reduced to a small set of characteristics. These characteristics are fixed numeric attributes — things like average basket size or reorder tendency — that at best, these approaches cluster customers by summarized behavior and at worst, by characteristics that entirely miss how they are actually behaving.

Once every customer is represented as a point in feature space, K-means groups together customers who look numerically similar.

Let’s first prep our data for this approach:

CREATE OR REPLACE TABLE ORDER_ITEM_COUNTS AS
SELECT
  order_id,
  COUNT(*) AS items_in_order
FROM BASKETS
GROUP BY 1;

We are going to create two features: reorder rate and average items per order. These features do get at a given customer’s behavior, specifically how many things they buy each time and how often they reorder a item.

CREATE OR REPLACE TABLE USER_KMEANS_FEATURES AS
WITH ORDER_STATS AS (
  SELECT
    order_id,
    COUNT(*) AS items_in_order,
    AVG(reordered::FLOAT) AS reorder_rate_in_order
  FROM BASKETS
  GROUP BY 1
)
SELECT
  o.user_id,
  AVG(s.items_in_order::FLOAT)       AS avg_items_per_order,
  AVG(s.reorder_rate_in_order)       AS reorder_rate
FROM ORDERS o
JOIN ORDER_STATS s
  ON o.order_id = s.order_id
GROUP BY 1;
select * from user_kmeans_features
USER_ID AVG_ITEMS_PER_ORDER REORDER_RATE

202279

9

0.6666666666666666

205970

8

1

178520

13

0.9230769230769231

156122

26

0.8076923076923077

142903

2

0

Now that we have our features, we will take them out of SQL and load them into python where we will perform our analysis. With Graph Analytics for Snowflake, you can avoid this data movement and run your algorithm directly on SQL tables.

from snowflake.snowpark.context import get_active_session

session = get_active_session()

df = session.table("USER_KMEANS_FEATURES").select(
    "USER_ID", "AVG_ITEMS_PER_ORDER", "REORDER_RATE"
).dropna()

pdf = df.to_pandas()
pdf.head()
USER_ID AVG_ITEMS_PER_ORDER REORDER_RATE

202279

9

0.6666666666666666

205970

8

1

178520

13

0.9230769230769231

156122

26

0.8076923076923077

142903

2

0

import os
os.environ["LOKY_MAX_CPU_COUNT"] = "1"
os.environ["JOBLIB_MULTIPROCESSING"] = "0"
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

X = pdf[["AVG_ITEMS_PER_ORDER", "REORDER_RATE"]].to_numpy()

X_scaled = MinMaxScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=42, algorithm="lloyd")
pdf["CLUSTER"] = km.fit_predict(X_scaled)

Let’s take a look at our final result. We sorted our customers into three distinct groups based on how much they buy when they go to the store and how many of their items are reorders.

But these are summaries of customer’s behavior, and if we pause and consider what exactly we are modeling, we will be able to poke some holes. Does a teacher reordering pencils form a meaningful group with a new father reordering baby formula? From a marketing standpoint, likely not.

When customer behavior is flattened into these summaries, we necessarily lose information. That is the fundamental challenge in modeling based on characteristics rather than the behavior itself.

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(
    pdf["AVG_ITEMS_PER_ORDER"],
    pdf["REORDER_RATE"],
    c=pdf["CLUSTER"]
)
plt.xlabel("avg_items_per_order")
plt.ylabel("reorder_rate")
plt.title("K-means customer clusters")
plt.show()

image

Model Based on Behavior with Louvain

With a graph-based approach, we don’t summarize behavior into attributes. Instead, we model behavior directly.

Customer segmentation in graph analytics mirrors how students choose tables in a cafeteria. Students who interact frequently sit together; likewise, customers who interact with the same items naturally form groups. Thats how graph algorithms, like louvain, segment customers based on directly on their behavior

In our example, customers are connected to the products they purchase, forming a network of interactions.

Louvain finds communities in this network — groups of customers who are tightly connected through shared purchasing behavior.

We will create three tables. Two are node tables for the customers and the products. The third is an edge list showing customer purchase events.

CREATE OR REPLACE VIEW CUSTOMERS AS
SELECT DISTINCT
  user_id AS nodeId
FROM ORDERS;

CREATE OR REPLACE VIEW PRODUCTS_NODES AS
SELECT
  product_id AS nodeId
FROM PRODUCTS;

CREATE OR REPLACE VIEW PURCHASES AS
SELECT
  o.user_id    AS sourceNodeId,
  b.product_id AS targetNodeId
FROM ORDERS o
JOIN BASKETS b
  ON o.order_id = b.order_id;

Granting Permissions

Next we grant the necessary permissions:

-- Use a role with the required privileges
USE ROLE ACCOUNTADMIN;

-- Create a consumer role for users of the Graph Analytics application
CREATE ROLE IF NOT EXISTS MY_CONSUMER_ROLE;
GRANT APPLICATION ROLE Neo4j_Graph_Analytics.app_user TO ROLE MY_CONSUMER_ROLE;
SET MY_USER = (SELECT CURRENT_USER());
GRANT ROLE MY_CONSUMER_ROLE TO USER IDENTIFIER($MY_USER);

USE SCHEMA retail_recs.PUBLIC;
CREATE TABLE NODES (nodeId Number);
INSERT INTO NODES VALUES (1), (2), (3), (4), (5), (6);
CREATE TABLE RELATIONSHIPS (sourceNodeId Number, targetNodeId Number);
INSERT INTO RELATIONSHIPS VALUES (1, 2), (2, 3), (4, 5), (5, 6);

-- Grants needed for the app to read consumer data stored in tables and views, using a database role
USE DATABASE retail_recs;
CREATE DATABASE ROLE IF NOT EXISTS MY_DB_ROLE;
GRANT USAGE ON DATABASE retail_recs TO DATABASE ROLE MY_DB_ROLE;
GRANT USAGE ON SCHEMA retail_recs.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA retail_recs.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA retail_recs.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
-- Future tables also include tables that are created by the application itself.
-- This is useful as many use-cases require running algorithms in a sequence and using the output of a prior algorithm as input.
GRANT SELECT ON FUTURE TABLES IN SCHEMA retail_recs.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA retail_recs.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT CREATE TABLE ON SCHEMA retail_recs.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT DATABASE ROLE MY_DB_ROLE TO APPLICATION Neo4j_Graph_Analytics;

-- Ensure the consumer role has access to tables created by the application
GRANT USAGE ON DATABASE retail_recs TO ROLE MY_CONSUMER_ROLE;
GRANT USAGE ON SCHEMA retail_recs.PUBLIC TO ROLE MY_CONSUMER_ROLE;
GRANT SELECT ON FUTURE TABLES IN SCHEMA retail_recs.PUBLIC TO ROLE MY_CONSUMER_ROLE;

-- Use the consumer role to run the algorithm and inspect the output
USE ROLE MY_CONSUMER_ROLE;

Running our Algorithms

We apply Louvain to a customer–item purchase graph to identify communities of customers with similar buying behavior. These communities emerge from shared purchase patterns, rather than predefined customer attributes.

CALL Neo4j_Graph_Analytics.graph.louvain('CPU_X64_XS', {
    'defaultTablePrefix': 'retail_recs.public',
    'project': {
        'nodeTables': [ 'CUSTOMERS', 'PRODUCTS_NODES'],
        'relationshipTables': {
            'PURCHASES': {
                'sourceTable': 'CUSTOMERS',
                'targetTable': 'PRODUCTS_NODES',
                'orientation': 'UNDIRECTED'
            }
        }
    },
    'compute': {
        'mutateProperty': 'community_id'
    },
    'write': [{
        'nodeLabel': 'CUSTOMERS',
        'outputTable': 'CUSTOMERS_GROUP',
        'nodeProperty': 'community_id'
    }]
});

And now we have a different customer segments derived directly from those customer’s behavior rather than the customer’s characteristics:

select  community_id, count(*) as node_count from customers_group group by community_id
COMMUNITY_ID NODE_COUNT

4970

91

4952

43

5277

47

5380

30