Connected Feature Extraction

Important - page not maintained

This page is no longer being maintained and its content may be out of date. For the latest guidance, please visit the Neo4j Graph Data Science Library.

Goals
In this guide, we will learn about connected feature extraction, a technique used to improve the performance of machine learning models.


At a high level, machine learning algorithms take input data and produce an output. The input data comprises features, where a feature is an attribute or property of the item being modeled.

If we were building a model to predict house prices, for example, features could include the number of rooms or the size of the garden in square meters.
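To make this concrete, the features and the target can be laid out as a small table, where each row is one house and each column is one feature. The values below are invented purely for illustration.

```python
import pandas as pd

# Each row is one house; each column is one feature.
# All values are hypothetical, for illustration only.
features = pd.DataFrame({
    "num_rooms": [3, 4, 2],
    "garden_m2": [40, 0, 25],   # size of the garden in square meters
})

# The output the model learns to predict (house prices).
prices = pd.Series([250_000, 320_000, 180_000])

print(features)
print(prices)
```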

Feature Engineering

The process of generating and selecting appropriate features is called feature engineering.

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

— Discover Feature Engineering: How to Engineer Features and How to Get Good at It
machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
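As a small illustration of this idea, the sketch below derives two new features from hypothetical raw house-sale data using pandas. The column names and values are invented for the example; the point is that the derived columns may represent the problem better than the raw ones.

```python
import pandas as pd

# Hypothetical raw data about houses (values invented for illustration).
raw = pd.DataFrame({
    "year_built": [1995, 2010, 1978],
    "size_m2":    [120, 150, 90],
    "garden_m2":  [40, 0, 25],
})

# Engineer features that may represent the problem better than the raw columns.
raw["age_years"] = 2024 - raw["year_built"]   # house age (reference year fixed for the example)
raw["has_garden"] = raw["garden_m2"] > 0      # boolean flag derived from the raw garden size

print(raw[["age_years", "has_garden"]])
```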

Feature Extraction

Feature extraction is one of the sub-problems of feature engineering. It is the process of changing the shape or format of raw data so that it can be used in a machine learning pipeline.

It can be thought of as a dimensionality reduction process, where raw variables are reduced to more manageable groups (features), while still accurately describing the original data set.
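As one concrete (non-graph) illustration of this idea, the sketch below uses principal component analysis from scikit-learn to reduce ten raw variables to three extracted features. PCA is just one of many possible extraction techniques and is chosen here purely as an example; the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples described by 10 raw variables (synthetic data for illustration).
rng = np.random.default_rng(42)
raw_data = rng.normal(size=(100, 10))

# Reduce the 10 raw variables to 3 extracted features.
pca = PCA(n_components=3)
features = pca.fit_transform(raw_data)

print(features.shape)                 # (100, 3)
print(pca.explained_variance_ratio_)  # how much variance each extracted feature retains
```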

Figure 1. Feature Extraction

These features are often derived from data stored in tables in a relational database.

Connected Feature Extraction

Connected feature extraction is the process of changing the shape or format of graph data so that it is usable in a machine learning pipeline.

Figure 2. Connected Feature Extraction

Connected features can be generated in two main ways:

Running local graph queries

These work best when we know exactly what we’re trying to find. For example, counting the number of fraudulent users up to three degrees away from a person in a fraud graph.

Running global graph algorithms

These are used when we know the general structure that we want, but not the exact pattern. For example, computing the PageRank score of all users as an indicator of potentially fraudulent behavior. Both approaches are sketched in the example below.
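To make the two approaches concrete, here is a minimal Python sketch using the official Neo4j Python driver. The node label (`Person`), relationship type (`KNOWS`), fraud-flag property, projected graph name (`'users'`), and connection details are all assumptions for illustration; they would need to match your own data model, and the PageRank call requires a graph already projected into the Graph Data Science catalog.

```python
from neo4j import GraphDatabase

# Connection details are placeholders for illustration.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Local graph query: count known fraudsters up to three hops away from a person.
# Assumes a hypothetical schema with (:Person)-[:KNOWS]-(:Person) nodes carrying
# a boolean `flagged_fraudster` property.
LOCAL_QUERY = """
MATCH (p:Person {id: $person_id})-[:KNOWS*1..3]-(other:Person)
WHERE other.flagged_fraudster = true
RETURN count(DISTINCT other) AS fraudsters_within_three_hops
"""

# Global graph algorithm: PageRank over a graph previously projected into the
# GDS catalog under the (hypothetical) name 'users'.
PAGERANK_QUERY = """
CALL gds.pageRank.stream('users')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS user_id, score
ORDER BY score DESC
LIMIT 10
"""

with driver.session() as session:
    # Local query: one count per person, driven by an explicit pattern.
    local_feature = session.run(LOCAL_QUERY, person_id=42).single()
    print(local_feature["fraudsters_within_three_hops"])

    # Global algorithm: a score for every user in the projected graph.
    for record in session.run(PAGERANK_QUERY):
        print(record["user_id"], record["score"])

driver.close()
```

The count returned by the first query and the scores returned by the second could then be written back to the graph or exported as columns in a feature table for model training.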

In a talk from the ML4ALL 2019 conference, Amy Hodler explained how connected features can be used to improve machine learning predictions.

You can also read more articles on this topic in Amy’s AI & Graph Technology blog series.

For a practical example of how connected features can be used to train a machine learning model, see the Link Prediction with scikit-learn developer guide.

Use Cases for Connected Features

Connected features are used in many industries and have been particularly helpful for investigating financial crimes like fraud and money laundering. In these scenarios, criminals often try to hide activities through multiple layers of obfuscation and network relationships.

Traditional feature extraction methods may be unable to detect such behavior, and this is where graph-extracted features work well.

Amy Hodler’s Artificial Intelligence & Graph Technology: Enhancing AI with Context & Connections white paper goes into the uses for connected features in more detail.