Scale Properties

Introduction

The Scale Properties algorithm is a utility algorithm that is used to pre-process node properties for model training or post-process algorithm results such as PageRank scores. It scales the node properties based on the specified scaler. Multiple properties can be scaled at once and are returned in a list property.

The input properties must be numbers or lists of numbers. All lists for the same property must have the same size. The output property will always be a list. The size of the output list is equal to the sum of the lengths of the input properties, where a scalar property counts as length one. That is, if the input properties are two scalar numeric properties and one list property of length three, the output list will have a total length of five.
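To make the output layout concrete, here is a minimal Python sketch. It is plain Python rather than GDS code, and the helper name `concatenate_properties` is hypothetical. It only illustrates how scalar and list properties end up in one output list, in input order, and does not perform any scaling.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def concatenate_properties(values: Sequence[Union[Number, Sequence[Number]]]) -> List[float]:
    """Concatenate scalars and lists, in input order, into one flat list."""
    flat: List[float] = []
    for value in values:
        if isinstance(value, (list, tuple)):
            # A list property contributes one entry per element.
            flat.extend(float(v) for v in value)
        else:
            # A scalar property contributes exactly one entry.
            flat.append(float(value))
    return flat


# Two scalar properties and one list property of length three.
output = concatenate_properties([4.2, 1978, [32, 32, 0]])
print(len(output))  # 5
```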

If a node is missing a value for a property, the node will be omitted from scaling of that property. It will receive an output value of NaN. This includes list properties.
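The following Python sketch shows one way to read this rule. It assumes that the scaler statistics are computed only over the nodes that have a value, and it models a missing value as `None`; both choices, and the function name, are illustrative assumptions and not taken from the GDS implementation.

```python
import math
from typing import List, Optional


def min_max_with_missing(p: List[Optional[float]]) -> List[float]:
    """Min-max scale the present values; missing values become NaN."""
    present = [v for v in p if v is not None]
    lo, hi = min(present), max(present)
    # Assumes hi != lo; a zero range would need the division rule of the scalers.
    return [math.nan if v is None else (v - lo) / (hi - lo) for v in p]


# The second node has no value for the property.
result = min_max_with_missing([10.0, None, 20.0, 15.0])
print(result)  # [0.0, nan, 1.0, 0.5]
```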

There are a number of supported scalers for the Scale Properties algorithm. These can be configured using the scaler configuration parameter.

List properties are scaled index-by-index. See the list example for more details.
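As a small illustration of index-by-index scaling, the Python sketch below treats each list position as its own column and scales it independently, here with a min-max formula. The function names are hypothetical and the input lists are only sample data.

```python
from typing import List


def min_max(column: List[float]) -> List[float]:
    """Min-max scale one column; assumes the column is not constant."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def scale_index_by_index(rows: List[List[float]]) -> List[List[float]]:
    """Scale position 0 of every list together, then position 1, and so on."""
    columns = [list(col) for col in zip(*rows)]   # one column per list index
    scaled_columns = [min_max(col) for col in columns]
    return [list(row) for row in zip(*scaled_columns)]  # back to one list per node


rows = [[32, 32, 0], [18, 20, 0], [100, 100, 70]]
scaled = scale_index_by_index(rows)
print(scaled[1])  # [0.0, 0.0, 0.0]
print(scaled[2])  # [1.0, 1.0, 1.0]
```

The second list holds the minimum at every index and the third list holds the maximum at every index, so they map to all zeros and all ones respectively.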

In the following equations, p denotes the vector containing all property values for a single property across all nodes in the graph.

Min-max scaler

Scales all property values into the range [0, 1] where the minimum value(s) get the scaled value 0 and the maximum value(s) get the scaled value 1, according to this formula:

\[ p_{scaled} = \frac{p - \min(p)}{\max(p) - \min(p)} \]

The minimum and maximum values are reported as statistics when this scaler is used.
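A minimal Python sketch of the min-max formula follows. It is not the GDS implementation; the function name is hypothetical, and the input is the buildYear property of the eight hotels from the Examples section, in creation order.

```python
from typing import List


def min_max_scale(p: List[float]) -> List[float]:
    """Scale values into [0, 1]; assumes max(p) != min(p)."""
    lo, hi = min(p), max(p)
    return [(v - lo) / (hi - lo) for v in p]


# East, Plaza, Central, West, Polar, Beach, Mountain, Forest
build_year = [1978, 1958, 1999, 2005, 2020, 1981, 1984, 2010]
scaled = min_max_scale(build_year)
print(scaled[1], scaled[4])  # 0.0 1.0
```

Plaza (1958) is the minimum and Polar (2020) is the maximum, which agrees with the first list position in the stream example results.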

Max scaler

Scales all property values into the range [-1, 1] where the maximum absolute value(s) get the scaled value 1, according to this formula:

\[ p_{scaled} = \frac{p}{\max(|p|)} \]

The maximum absolute value is reported as a statistic when this scaler is used.
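The max formula can be sketched in Python as follows. The function name is hypothetical, and the input is the avgReview property of the hotels from the Examples section.

```python
from typing import List


def max_scale(p: List[float]) -> List[float]:
    """Divide every value by the largest absolute value; assumes it is non-zero."""
    max_abs = max(abs(v) for v in p)
    return [v / max_abs for v in p]


# East, Plaza, Central, West, Polar, Beach, Mountain, Forest
avg_review = [4.2, 8.1, 19.0, -4.12, 0.01, 3.3, 6.7, -1.2]
scaled = max_scale(avg_review)
print(scaled[2])  # 1.0
```

Negative inputs stay negative, which is why the output range is [-1, 1] rather than [0, 1].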

Mean scaler

Scales all property values into the range [-1, 1] where the average value(s) get the scaled value 0.

\[ p_{scaled} = \frac{p - \operatorname{avg}(p)}{\max(p) - \min(p)} \]

The minimum, maximum and average values are reported as statistics when this scaler is used.
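For illustration, here is a minimal Python sketch of the mean formula, again using the avgReview values from the Examples section. The function name is hypothetical and the code is not taken from GDS.

```python
from typing import List


def mean_scale(p: List[float]) -> List[float]:
    """Center on the average and divide by the value range; assumes max(p) != min(p)."""
    avg = sum(p) / len(p)
    value_range = max(p) - min(p)
    return [(v - avg) / value_range for v in p]


# East, Plaza, Central, West, Polar, Beach, Mountain, Forest
avg_review = [4.2, 8.1, 19.0, -4.12, 0.01, 3.3, 6.7, -1.2]
scaled = mean_scale(avg_review)
print(all(-1.0 <= v <= 1.0 for v in scaled))  # True
```

A value equal to the average would map to exactly zero, and the scaled values sum to zero up to floating-point rounding.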

Log scaler

Transforms all property values using the natural logarithm, according to this formula:

\[ p_{scaled} = \log(p + C) \]

C denotes a configurable constant offset (0 by default), which can be used to avoid negative values or zeros in the value space, as their logarithms are not finite values.

Standard Score

Scales all property values using the Standard Score (Wikipedia).

\[ p_{scaled} = \frac{p - \operatorname{avg}(p)}{\operatorname{std}(p)} \]

The average value and standard deviation are reported as statistics when this scaler is used.
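The sketch below computes the standard score in plain Python for the avgReview values from the Examples section. It assumes the population standard deviation (dividing by the number of values), which is consistent with the std statistic shown in the write example; the function name is hypothetical.

```python
import math
from typing import List, Tuple


def std_score(p: List[float]) -> Tuple[List[float], float, float]:
    """Return the scaled values together with the average and standard deviation."""
    avg = sum(p) / len(p)
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((v - avg) ** 2 for v in p) / len(p))
    return [(v - avg) / std for v in p], avg, std


# East, Plaza, Central, West, Polar, Beach, Mountain, Forest
avg_review = [4.2, 8.1, 19.0, -4.12, 0.01, 3.3, 6.7, -1.2]
scaled, avg, std = std_score(avg_review)
print(round(avg, 5), round(std, 4))  # 4.49875 6.6758
```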

Center

Transforms all property values by subtracting the mean, according to this formula:

\[ p_{scaled} = p - \operatorname{avg}(p) \]

The average value is reported as a statistic when this scaler is used.

Some scalers must do divisions as part of their computation. For example, computing the "Standard Score" requires dividing by the standard deviation. If computing a scaled property requires division by an illegal value, like 0 or NaN, the resulting scaled property value will be 0.
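The division rule can be sketched in Python as shown below. The helper `safe_divide` is a hypothetical name, and the standard score here uses the population standard deviation as an assumption; the point is only that a constant property, whose standard deviation is 0, scales to 0 instead of failing.

```python
import math
from typing import List


def safe_divide(numerator: float, divisor: float) -> float:
    """Return 0.0 when the divisor is 0 or NaN, otherwise the quotient."""
    if divisor == 0 or math.isnan(divisor):
        return 0.0
    return numerator / divisor


def std_score(p: List[float]) -> List[float]:
    avg = sum(p) / len(p)
    std = math.sqrt(sum((v - avg) ** 2 for v in p) / len(p))
    return [safe_divide(v - avg, std) for v in p]


# Every node has the same value, so the standard deviation is 0.
constant = [7.0, 7.0, 7.0]
scaled_constant = std_score(constant)
print(scaled_constant)  # [0.0, 0.0, 0.0]
```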

Syntax

This section covers the syntax used to execute the Scale Properties algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

Scale Properties syntax per mode

Run Scale Properties in stream mode on a named graph.

CALL gds.scaleProperties.stream(
  graphName: String,
  configuration: Map
) YIELD
  nodeId: Integer,
  scaledProperty: List of Float

Table 1. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4 ^[1]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
nodeProperties	List of String	`n/a`	no	The names of the node properties that are to be scaled. All property names must exist in the projected graph.
scaler	String or Map	`n/a`	no	The name of the scaler applied for the properties. Supported values are `MinMax`, `Max`, `Mean`, `Log`, `Center`, and `StdScore`, case insensitively. To apply scaler-specific configuration, use the Map syntax: `{type: 'name', …}`.
1. In a GDS Session the default is the number of available processors

Table 3. Results
Name	Type	Description
nodeId	Integer	Node ID.
scaledProperty	List of Float	Scaled values for each input node property.

Run Scale Properties in mutate mode on a named graph.

CALL gds.scaleProperties.mutate(
  graphName: String,
  configuration: Map
) YIELD
  scalerStatistics: Map,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  postProcessingMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map

Table 4. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 5. Configuration
Name	Type	Default	Optional	Description
mutateProperty	String	`n/a`	no	The node property in the GDS graph to which the scaled properties are written.
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types.
concurrency	Integer	`4`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
nodeProperties	List of String	`n/a`	no	The names of the node properties that are to be scaled. All property names must exist in the projected graph.
scaler	String or Map	`n/a`	no	The name of the scaler applied for the properties. Supported values are `MinMax`, `Max`, `Mean`, `Log`, `Center`, and `StdScore`, case insensitively. To apply scaler-specific configuration, use the Map syntax: `{type: 'name', …}`.

Table 6. Results
Name	Type	Description
scalerStatistics	Map	Statistics computed by the specified scaler, if any.
preProcessingMillis	Integer	Milliseconds for preprocessing the data.
computeMillis	Integer	Milliseconds for running the algorithm.
mutateMillis	Integer	Milliseconds for adding properties to the projected graph.
postProcessingMillis	Integer	Unused.
nodePropertiesWritten	Integer	Number of node properties written.
configuration	Map	Configuration used for running the algorithm.

Run Scale Properties in stats mode on a named graph.

CALL gds.scaleProperties.stats(
  graphName: String,
  configuration: Map
)
YIELD
  scalerStatistics: Map,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  configuration: Map

Table 7. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 8. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4 ^[2]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
nodeProperties	List of String	`n/a`	no	The names of the node properties that are to be scaled. All property names must exist in the projected graph.
scaler	String or Map	`n/a`	no	The name of the scaler applied for the properties. Supported values are `MinMax`, `Max`, `Mean`, `Log`, `Center`, and `StdScore`, case insensitively. To apply scaler-specific configuration, use the Map syntax: `{type: 'name', …}`.
2. In a GDS Session the default is the number of available processors

Table 9. Results
Name	Type	Description
scalerStatistics	Map	Statistics computed by the specified scaler, if any.
preProcessingMillis	Integer	Milliseconds for preprocessing the data.
computeMillis	Integer	Milliseconds for running the algorithm.
postProcessingMillis	Integer	Unused.
configuration	Map	Configuration used for running the algorithm.

Run Scale Properties in write mode on a named graph.

CALL gds.scaleProperties.write(
  graphName: String,
  configuration: Map
)
YIELD
  scalerStatistics: Map,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  postProcessingMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map

Table 10. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 11. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4 ^[3]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
writeConcurrency	Integer	`value of 'concurrency'`	yes	The number of concurrent threads used for writing the result to Neo4j.
writeProperty	String	`n/a`	no	The node property in the Neo4j database to which the scaled properties are written.
nodeProperties	List of String	`n/a`	no	The names of the node properties that are to be scaled. All property names must exist in the projected graph.
scaler	String or Map	`n/a`	no	The name of the scaler applied for the properties. Supported values are `MinMax`, `Max`, `Mean`, `Log`, `Center`, and `StdScore`, case insensitively. To apply scaler-specific configuration, use the Map syntax: `{type: 'name', …}`.
3. In a GDS Session the default is the number of available processors

Table 12. Results
Name	Type	Description
scalerStatistics	Map	Statistics computed by the specified scaler, if any.
preProcessingMillis	Integer	Milliseconds for preprocessing the data.
computeMillis	Integer	Milliseconds for running the algorithm.
writeMillis	Integer	Milliseconds for writing result back to Neo4j.
postProcessingMillis	Integer	Unused.
nodePropertiesWritten	Integer	Number of node properties written.
configuration	Map	Configuration used for running the algorithm.

Scaler-specific configuration options

The log scaler supports specific configuration, which we document here.

Table 13. Specific configuration for `log` scaler
Name	Type	Default	Optional	Description
type	String	`n/a`	no	Type of the scaler applied for the properties. Supported values are `MinMax`, `Max`, `Mean`, `Log`, `Center`, and `StdScore`, case insensitively.
offset	Number	`0`	yes	Constant additive term applied before computing the logarithm of the property value.

No other scaler supports additional, custom configuration.

Examples

All the examples below should be run in an empty database.

The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.

In this section we will show examples of running the Scale Properties algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small hotel graph of a handful of nodes. The algorithm does not use relationships, so the example graph consists of nodes only.

The following Cypher statement will create the example graph in the Neo4j database:

CREATE
  (:Hotel {avgReview: 4.2, buildYear: 1978, storyCapacity: [32, 32, 0], name: 'East'}),
  (:Hotel {avgReview: 8.1, buildYear: 1958, storyCapacity: [18, 20, 0], name: 'Plaza'}),
  (:Hotel {avgReview: 19.0, buildYear: 1999, storyCapacity: [100, 100, 70], name: 'Central'}),
  (:Hotel {avgReview: -4.12, buildYear: 2005, storyCapacity: [250, 250, 250], name: 'West'}),
  (:Hotel {avgReview: 0.01, buildYear: 2020, storyCapacity: [1250, 1250, 900], name: 'Polar'}),
  (:Hotel {avgReview: 3.3, buildYear: 1981, storyCapacity: [240, 240, 0], name: 'Beach'}),
  (:Hotel {avgReview: 6.7, buildYear: 1984, storyCapacity: [80, 0, 0], name: 'Mountain'}),
  (:Hotel {avgReview: -1.2, buildYear: 2010, storyCapacity: [55, 20, 0], name: 'Forest'})

With the graph in Neo4j we can now project it into the graph catalog to prepare it for algorithm execution. We do this using a Cypher projection targeting the Hotel nodes, including their properties. Note that no relationships are necessary to scale the node properties. Thus we pass null as the target node, so that only the nodes are projected.

The following statement will project a graph using a Cypher projection and store it in the graph catalog under the name 'myGraph'.

MATCH (hotel:Hotel)
RETURN gds.graph.project(
  'myGraph',
  hotel,
  null,
  {
    sourceNodeProperties: hotel { .avgReview, .buildYear, .storyCapacity },
    targetNodeProperties: {}
  }
)

In the following examples we will demonstrate how to scale the node properties of this graph.

Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the stream mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for running the algorithm:

CALL gds.scaleProperties.stream.estimate('myGraph', {
  nodeProperties: ['buildYear', 'storyCapacity'],
  scaler: 'MinMax'
})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory

Table 14. Results
nodeCount	relationshipCount	bytesMin	bytesMax	requiredMemory
8	0	480	480	"480 Bytes"

Stream

In the stream execution mode, the algorithm returns the scaled properties for each node. This allows us to inspect the results directly or post-process them in Cypher without any side effects. Note that the output is always a single list property, containing all scaled node properties in the input order.

For more details on the stream mode in general, see Stream.

The following will run the algorithm in stream mode:

CALL gds.scaleProperties.stream('myGraph', {
  nodeProperties: ['buildYear', 'avgReview'],
  scaler: 'MinMax'
}) YIELD nodeId, scaledProperty
RETURN gds.util.asNode(nodeId).name AS name, scaledProperty
  ORDER BY name ASC

Table 15. Results
name	scaledProperty
"Beach"	[0.3709677419354839, 0.3209342560553633]
"Central"	[0.6612903225806451, 1.0]
"East"	[0.3225806451612903, 0.35986159169550175]
"Forest"	[0.8387096774193549, 0.12629757785467127]
"Mountain"	[0.41935483870967744, 0.4679930795847751]
"Plaza"	[0.0, 0.5285467128027681]
"Polar"	[1.0, 0.17863321799307957]
"West"	[0.7580645161290323, 0.0]

In the results we can observe that the first element of scaledProperty holds the min-max-scaled value for buildYear: the Plaza hotel has the minimum value and is scaled to zero, while the Polar hotel has the maximum value and is scaled to one. This can be verified with the example graph. The second element of scaledProperty is the scaled value of the avgReview property.

Mutate

The mutate execution mode enables updating the named graph with a new node property containing the scaled properties for that node. The name of the new property is specified using the mandatory configuration parameter mutateProperty. The result is a single summary row containing metrics from the computation. The mutate mode is especially useful when multiple algorithms are used in conjunction.

For more details on the mutate mode in general, see Mutate.

In this example we will scale the two hotel properties of buildYear and avgReview using the Mean scaler. The output is a list property which we will call hotelFeatures, imagining that we will use this as input for a machine learning model later on.

The following will run the algorithm in mutate mode:

CALL gds.scaleProperties.mutate('myGraph', {
  nodeProperties: ['buildYear', 'avgReview'],
  scaler: 'Mean',
  mutateProperty: 'hotelFeatures'
}) YIELD nodePropertiesWritten, scalerStatistics

Table 16. Results
nodePropertiesWritten	scalerStatistics
8	{avgReview={avg=[4.49875], max=[19.0], min=[-4.12]}, buildYear={avg=[1991.875], max=[2020.0], min=[1958.0]}}

The result shows that there are now eight new node properties in the in-memory graph. These contain the scaled values from the input properties, where the scaled buildYear values are in the first list position and scaled avgReview values are in the second position. To find out how to inspect the new schema of the in-memory graph, see Listing graphs in the catalog.

Stats

In the stats execution mode, the algorithm returns a single row containing a summary of the algorithm result. This execution mode does not have any side effects. It can be useful for evaluating algorithm performance by inspecting the computeMillis return item. In the examples below we will omit returning the timings. The full signature of the procedure can be found in the syntax section.

For more details on the stats mode in general, see Stats.

The following will run the algorithm in stats mode using the "Center" scaler:

CALL gds.scaleProperties.stats('myGraph', {
  nodeProperties: ['buildYear', 'avgReview'],
  scaler: 'center'
}) YIELD scalerStatistics

Table 17. Results
scalerStatistics
{avgReview={avg=[4.49875]}, buildYear={avg=[1991.875]}}

Different scalers will need to compute different statistics as part of their computation. This will be reflected in the scalerStatistics returned. Since the "Center" scaler computes the average value of each input property, that is what we get as scaler statistics in this case.

Write

The write execution mode extends the stats mode with an important side effect: writing the scaled properties for each node as a property to the Neo4j database. The name of the new property is specified using the mandatory configuration parameter writeProperty. The result is a single summary row, similar to stats, but with some additional metrics. The write mode enables directly persisting the results to the database.

For more details on the write mode in general, see Write.

The following will run the algorithm in write mode:

CALL gds.scaleProperties.write('myGraph', {
  nodeProperties: ['buildYear', 'avgReview'],
  scaler: 'stdscore',
  writeProperty: 'hotelStdScore'
}) YIELD nodePropertiesWritten, scalerStatistics

Table 18. Results
nodePropertiesWritten	scalerStatistics
8	{avgReview={avg=[4.49875], std=[6.6758378454]}, buildYear={avg=[1991.875], std=[18.9171714323]}}

The result shows that there are now eight new node properties in the database graph on the nodes corresponding to those in the projection 'myGraph'. These node properties contain the scaled values from the input properties, where the scaled buildYear values are in the first list position and scaled avgReview values are in the second position.

List properties

The storyCapacity property models the number of rooms on each story of the hotel. The lists are padded with zeros, so that hotels with fewer stories have a zero value for the stories they lack. This is because the Scale Properties algorithm requires that all values for the same property have the same length. In this example we will show how to scale the values in these lists using the Scale Properties algorithm. We imagine using the output as a feature vector for a machine learning algorithm. Additionally, we will include the avgReview property in our feature vector.

The following will run the algorithm in stream mode:

CALL gds.scaleProperties.stream('myGraph', {
  nodeProperties: ['avgReview', 'storyCapacity'],
  scaler: 'StdScore'
}) YIELD nodeId, scaledProperty
RETURN gds.util.asNode(nodeId).name AS name, scaledProperty AS features
  ORDER BY name ASC

Table 19. Results
name	features
"Beach"	[-0.17956547594003253, -0.03401933556831381, 0.00254261210704973, -0.5187592498702616]
"Central"	[2.172199255871029, -0.3968922482969945, -0.3534230828799124, -0.2806402499298136]
"East"	[-0.0447509371737933, -0.5731448059080679, -0.526320706159294, -0.5187592498702616]
"Forest"	[-0.8536381697712284, -0.513529970245499, -0.5568320514438908, -0.5187592498702616]
"Mountain"	[0.32973389273242665, -0.4487312358296632, -0.6076842935848854, -0.5187592498702616]
"Plaza"	[0.5394453974799097, -0.609432097180936, -0.5568320514438908, -0.5187592498702616]
"Polar"	[-0.672387512096618, 2.583849534831454, 2.5705808402272767, 2.542770749364069]
"West"	[-1.2910364511016934, -0.00809984180197948, 0.027968733177547028, 0.3316657499170525]

The resulting feature vector contains the standard-score scaled value for the avgReview property in the first list position. We can see that some values are negative and that the maximum value sticks out for the Central hotel.

The other three list positions are the scaled values for the storyCapacity list property. Note that each list item is scaled only with respect to the corresponding item in the other lists. Thus, the Polar hotel has the greatest scaled value in all list positions.
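The index-by-index behaviour for storyCapacity can be reproduced with a short Python sketch. It is not GDS code: the function names are hypothetical, and the standard score is computed with the population standard deviation, an assumption that matches the values in the table above.

```python
import math
from typing import Dict, List


def std_score(column: List[float]) -> List[float]:
    """Standard score of one column, using the population standard deviation."""
    avg = sum(column) / len(column)
    std = math.sqrt(sum((v - avg) ** 2 for v in column) / len(column))
    return [(v - avg) / std for v in column]


story_capacity: Dict[str, List[float]] = {
    "East": [32, 32, 0],
    "Plaza": [18, 20, 0],
    "Central": [100, 100, 70],
    "West": [250, 250, 250],
    "Polar": [1250, 1250, 900],
    "Beach": [240, 240, 0],
    "Mountain": [80, 0, 0],
    "Forest": [55, 20, 0],
}

names = list(story_capacity)
# One column per story: each column is scaled independently of the others.
scaled_columns = [std_score(list(col)) for col in zip(*story_capacity.values())]
features = {name: [col[i] for col in scaled_columns] for i, name in enumerate(names)}
print([round(v, 4) for v in features["Polar"]])
```

The three printed numbers should match the last three features of the Polar hotel in the table, up to rounding.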

Scaler-specific configuration

The log scaler supports a configurable offset parameter. In this example we illustrate how to configure that offset.

We want to scale the avgReview property, but it contains negative numbers, for which the logarithm is not defined. First, we’ll determine what the minimum value is, by using Cypher’s min() aggregating function:

CALL gds.graph.nodeProperty.stream('myGraph', 'avgReview') YIELD propertyValue
RETURN min(propertyValue) AS minimumAvgReview

Table 20. Results
minimumAvgReview
-4.12

Knowing this value, we can choose an offset greater than its absolute value, 4.12, thus ensuring that every logarithm will be a finite value. We will use 5.12, as this will make the smallest scaled value zero.
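The choice of 5.12 follows from requiring that the smallest shifted value equals 1, whose logarithm is 0. A minimal Python sketch of that arithmetic, with a hypothetical function name and the avgReview values from the example graph:

```python
import math
from typing import List


def log_scale(p: List[float], offset: float = 0.0) -> List[float]:
    """Natural logarithm of each value after adding a constant offset."""
    return [math.log(v + offset) for v in p]


# East, Plaza, Central, West, Polar, Beach, Mountain, Forest
avg_review = [4.2, 8.1, 19.0, -4.12, 0.01, 3.3, 6.7, -1.2]

# Shift the minimum (-4.12) to 1, so that log(1) = 0 is the smallest result.
offset = 1 - min(avg_review)
scaled = log_scale(avg_review, offset)
print(round(offset, 2))  # 5.12
```

With an offset of 4.12 or lower, the minimum value would shift to zero or below, and `math.log` would raise an error for it in this sketch.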

The following will run the algorithm with a custom offset for the log scaler:

CALL gds.scaleProperties.stream('myGraph', {
  nodeProperties: ['avgReview'],
  scaler: {type: 'Log', offset: 5.12}
}) YIELD nodeId, scaledProperty
RETURN gds.util.asNode(nodeId).name AS name, scaledProperty
  ORDER BY name ASC

Table 21. Results
name	scaledProperty
"Beach"	[2.130609828254235]
"Central"	[3.183041371858985]
"East"	[2.2321626286975]
"Forest"	[1.366091653802371]
"Mountain"	[2.469793011977952]
"Plaza"	[2.581730834423540]
"Polar"	[1.635105659182678]
"West"	[0.0]

As we can see, all scaled values are finite numbers. In particular, the smallest scaled value is zero. Try this example with an offset lower than 4.12 if you are curious about the results.