Aggregating functions
Introduction
Aggregating functions take a set of values and calculate an aggregated value over them. Aggregation can be computed over all the matching paths, or it can be further divided by introducing grouping keys. Grouping keys are non-aggregate expressions that are used to group the values going into the aggregate functions.
For example, given the following query containing two return expressions, n
and count(*)
:
RETURN n, count(*)
The first, n
is not an aggregate function, so it will be the grouping key.
The latter, count(*)
is an aggregate expression.
The matching paths will be divided into different buckets, depending on the grouping key.
The aggregate function will then be run on these buckets, calculating an aggregate value per bucket.
The input expression of an aggregation function can contain any expression, including expressions that are not grouping keys.
However, not all expressions can be composed with aggregation functions.
The example below will throw an error since n.x
, which is not a grouping key, is combined with the aggregation expression count(*)
.
For more information, see Grouping keys.
RETURN n.x + count(*)
To use aggregations to sort the result set, the aggregation must be included in the RETURN
to be used in the ORDER BY
.
The DISTINCT
operator works in conjunction with aggregation.
It is used to make all values unique before running them through an aggregate function.
More information about DISTINCT
can be found in Syntax → Aggregation operators.
Example graph
The following graph is used for the examples below:
To recreate the graph, run the following query against an empty Neo4j database:
CREATE
(keanu:Person {name: 'Keanu Reeves', age: 58}),
(liam:Person {name: 'Liam Neeson', age: 70}),
(carrie:Person {name: 'Carrie Anne Moss', age: 55}),
(guy:Person {name: 'Guy Pearce', age: 55}),
(kathryn:Person {name: 'Kathryn Bigelow', age: 71}),
(speed:Movie {title: 'Speed'}),
(keanu)-[:ACTED_IN]->(speed),
(keanu)-[:KNOWS]->(carrie),
(keanu)-[:KNOWS]->(liam),
(keanu)-[:KNOWS]->(kathryn),
(carrie)-[:KNOWS]->(guy),
(liam)-[:KNOWS]->(guy)
avg() - Numeric values
The function avg()
returns the average of a set of numeric values.
avg(expression)
Returns:
Either an Integer or a Float, depending on the values returned by |
Arguments:
Name | Description |
---|---|
|
An expression returning a set of numeric values. |
Considerations:
Any |
|
MATCH (p:Person)
RETURN avg(p.age)
The average of all the values in the property age
is returned:
avg(p.age) |
---|
|
Rows: 1 |
avg() - Durations
The function avg()
returns the average of a set of Durations.
avg(expression)
Returns:
A Duration. |
Arguments:
Name | Description |
---|---|
|
An expression returning a set of Durations. |
Considerations:
Any |
|
UNWIND [duration('P2DT3H'), duration('PT1H45S')] AS dur
RETURN avg(dur)
The average of the two supplied Durations is returned:
avg(dur) |
---|
|
Rows: 1 |
collect()
The function collect()
returns a single aggregated list containing the values returned by an expression.
collect(expression)
Returns:
A list containing heterogeneous elements; the types of the elements are determined by the values returned by |
Arguments:
Name | Description |
---|---|
|
An expression returning a set of values. |
Considerations:
Any |
|
MATCH (p:Person)
RETURN collect(p.age)
All the values are collected and returned in a single list:
collect(p.age) |
---|
|
Rows: 1 |
count()
The function count()
returns the number of values or rows, and appears in two variants:
count(*)
-
returns the number of matching rows.
count(expr)
-
returns the number of non-
null
values returned by an expression.
count(expression)
Returns:
An Integer. |
Arguments:
Name | Description |
---|---|
|
An expression. |
Considerations:
|
|
|
Using count(*)
to return the number of nodes
The function count(*)
can be used to return the number of nodes; for example, the number of nodes connected to a node n
.
MATCH (p:Person {name: 'Keanu Reeves'})-->(x)
RETURN labels(p), p.age, count(*)
The labels and age
property of the start node Keanu Reeves
and the number of nodes related to it are returned:
labels(p) | p.age | count(*) |
---|---|---|
|
|
|
Rows: 1 |
Using count(*)
to group and count relationship types
The function count(*)
can be used to group the type of matched relationships and return the number of types.
MATCH (p:Person {name: 'Keanu Reeves'})-[r]->()
RETURN type(r), count(*)
The type of matched relationships are grouped and the group count of relationship types is returned:
type(r) | count(*) |
---|---|
|
|
|
|
Rows: 2 |
Counting non-null
values
Instead of simply returning the number of rows with count(*)
, the function count(expression)
can be used to return the number of non-null
values returned by the expression.
MATCH (p:Person)
RETURN count(p.age)
The number of nodes with the label Person
and a property age
is returned:
(To calculate the sum, use sum(n.age)
)
count(p.age) |
---|
|
Rows: 1 |
Counting with and without duplicates
This example tries to find all friends of friends of Keanu Reeves
and count them.
count(DISTINCT friendOfFriend)
-
Will only count a
friendOfFriend
once, asDISTINCT
removes the duplicates. count(friendOfFriend)
-
Will consider the same
friendOfFriend
multiple times.
MATCH (p:Person)-->(friend:Person)-->(friendOfFriend:Person)
WHERE p.name = 'Keanu Reeves'
RETURN friendOfFriend.name, count(DISTINCT friendOfFriend), count(friendOfFriend)
The nodes Carrie Anne Moss
and Liam Neeson
both have an outgoing KNOWS
relationship to Guy Pearce
.
The Guy Pearce
node will, therefore, get counted twice when not using DISTINCT
.
friendOfFriend.name | count(DISTINCT friendOfFriend) | count(friendOfFriend) |
---|---|---|
|
|
|
max()
The function max()
returns the maximum value in a set of values.
max(expression)
Returns:
A property type, or a list, depending on the values returned by |
Arguments:
Name | Description |
---|---|
|
An expression returning a set containing any combination of property types and lists thereof. |
Considerations:
Any |
In a mixed set, any numeric value is always considered to be higher than any string value, and any string value is always considered to be higher than any list. |
Lists are compared in dictionary order, i.e. list elements are compared pairwise in ascending order from the start of the list to the end. |
|
UNWIND [1, 'a', null, 0.2, 'b', '1', '99'] AS val
RETURN max(val)
The highest of all the values in the mixed set — in this case, the numeric value 1
— is returned:
The value |
max(val) |
---|
|
Rows: 1 |
UNWIND [[1, 'a', 89], [1, 2]] AS val
RETURN max(val)
The highest of all the lists in the set — in this case, the list [1, 2]
— is returned, as the number 2
is considered to be a higher value than the string 'a'
, even though the list [1, 'a', 89]
contains more elements.
max(val) |
---|
|
Rows: 1 |
MATCH (p:Person)
RETURN max(p.age)
The highest of all the values in the property age
is returned:
max(p.age) |
---|
|
Rows: 1 |
min()
The function min()
returns the minimum value in a set of values.
min(expression)
Returns:
A property type, or a list, depending on the values returned by |
Arguments:
Name | Description |
---|---|
|
An expression returning a set containing any combination of property types and lists thereof. |
Considerations:
Any |
In a mixed set, any string value is always considered to be lower than any numeric value, and any list is always considered to be lower than any string. |
Lists are compared in dictionary order, i.e. list elements are compared pairwise in ascending order from the start of the list to the end. |
|
UNWIND [1, 'a', null, 0.2, 'b', '1', '99'] AS val
RETURN min(val)
The lowest of all the values in the mixed set — in this case, the string value "1"
— is returned.
Note that the (numeric) value 0.2
, which may appear at first glance to be the lowest value in the list, is considered to be a higher value than "1"
as the latter is a string.
min(val) |
---|
|
Rows: 1 |
UNWIND ['d', [1, 2], ['a', 'c', 23]] AS val
RETURN min(val)
The lowest of all the values in the set — in this case, the list ['a', 'c', 23]
— is returned, as (i) the two lists are considered to be lower values than the string "d"
, and (ii) the string "a"
is considered to be a lower value than the numerical value 1
.
min(val) |
---|
|
Rows: 1 |
MATCH (p:Person)
RETURN min(p.age)
The lowest of all the values in the property age
is returned:
min(p.age) |
---|
|
Rows: 1 |
percentileCont()
The function percentileCont()
returns the percentile of the given value over a group, with a percentile from 0.0
to 1.0
.
It uses a linear interpolation method, calculating a weighted average between two values if the desired percentile lies between them.
For nearest values using a rounding method, see percentileDisc
.
percentileCont(expression, percentile)
Returns:
A Float. |
Arguments:
Name | Description |
---|---|
|
A numeric expression. |
|
A numeric value between |
Considerations:
Any |
|
MATCH (p:Person)
RETURN percentileCont(p.age, 0.4)
The 40th percentile of the values in the property age
is returned, calculated with a weighted average:
percentileCont(p.age, 0.4) |
---|
|
Rows: 1 |
percentileDisc()
The function percentileDisc()
returns the percentile of the given value over a group, with a percentile from 0.0
to 1.0
.
It uses a rounding method and calculates the nearest value to the percentile.
For interpolated values, see percentileCont
.
percentileDisc(expression, percentile)
Returns:
Either an Integer or a Float, depending on the values returned by |
Arguments:
Name | Description |
---|---|
|
A numeric expression. |
|
A numeric value between |
Considerations:
Any |
|
MATCH (p:Person)
RETURN percentileDisc(p.age, 0.5)
The 50th percentile of the values in the property age
is returned:
percentileDisc(p.age, 0.5) |
---|
|
Rows: 1 |
stDev()
The function stDev()
returns the standard deviation for the given value over a group.
It uses a standard two-pass method, with N - 1
as the denominator, and should be used when taking a sample of the population for an unbiased estimate.
When the standard variation of the entire population is being calculated, stdDevP
should be used.
stDev(expression)
Returns:
A Float. |
Arguments:
Name | Description |
---|---|
|
A numeric expression. |
Considerations:
Any |
|
MATCH (p:Person)
WHERE p.name IN ['Keanu Reeves', 'Liam Neeson', 'Carrie Anne Moss']
RETURN stDev(p.age)
The standard deviation of the values in the property age
is returned:
stDev(p.age) |
---|
|
Rows: 1 |
stDevP()
The function stDevP()
returns the standard deviation for the given value over a group.
It uses a standard two-pass method, with N
as the denominator, and should be used when calculating the standard deviation for an entire population.
When the standard variation of only a sample of the population is being calculated, stDev
should be used.
stDevP(expression)
Returns:
A Float. |
Arguments:
Name | Description |
---|---|
|
A numeric expression. |
Considerations:
Any |
|
MATCH (p:Person)
WHERE p.name IN ['Keanu Reeves', 'Liam Neeson', 'Carrie Anne Moss']
RETURN stDevP(p.age)
The population standard deviation of the values in the property age
is returned:
stDevP(p.age) |
---|
|
Rows: 1 |
sum() - Numeric values
The function sum()
returns the sum of a set of numeric values.
sum(expression)
Returns:
Either an Integer or a Float, depending on the values returned by |
Arguments:
Name | Description |
---|---|
|
An expression returning a set of numeric values. |
Considerations:
Any |
|
MATCH (p:Person)
RETURN sum(p.age)
The sum of all the values in the property age
is returned:
sum(p.age) |
---|
|
Rows: 1 |
sum() - Durations
The function sum()
returns the sum of a set of durations.
sum(expression)
Returns:
A Duration. |
Arguments:
Name | Description |
---|---|
|
An expression returning a set of Durations. |
Considerations:
Any |
UNWIND [duration('P2DT3H'), duration('PT1H45S')] AS dur
RETURN sum(dur)
The sum of the two supplied Durations is returned:
sum(dur) |
---|
|
Rows: 1 |
Grouping keys
Aggregation expressions are expressions which contain one or more aggregation functions.
A simple aggregation expression consists of a single aggregation function.
For instance, SUM(x.a)
is an aggregation expression that only consists of the aggregation function SUM( )
with x.a
as its argument.
Aggregation expressions are also allowed to be more complex, where the result of one or more aggregation functions are input arguments to other expressions.
For instance, 0.1 * (SUM(x.a) / COUNT(x.b))
is an aggregation expression that contains two aggregation functions, SUM( )
with x.a
as its argument and COUNT( )
with x.b
as its argument.
Both are input arguments to the division expression.
For aggregation expressions to be correctly computable for the buckets formed by the grouping key(s), they have to fulfill some requirements. Specifically, each sub expression in an aggregation expression has to be either:
-
an aggregation function, e.g.
SUM(x.a)
, -
a constant, e.g.
0.1
, -
a parameter, e.g.
$param
, -
a grouping key, e.g. the
a
inRETURN a, count(*)
-
a local variable, e.g. the
x
incount(*) + size([ x IN range(1, 10) | x ])
, or -
a subexpression, all whose operands are operands allowed in an aggregation expression.
Examples of aggregation expressions.
MATCH (p:Person)
RETURN max(p.age)
max(p.age) |
---|
|
Rows: 1 |
MATCH (p:Person)
RETURN max(p.age) + 1
max(p.age) + 1 |
---|
|
Rows: 1 |
Note that p
is a grouping key:
MATCH (p:Person{name:'Keanu Reeves'})-[:KNOWS]-(f:Person)
RETURN p, p.age - max(f.age)
p | p.age - max(f.age) |
---|---|
|
|
Rows: 1 |
Note that p.age
is a grouping key:
MATCH (p:Person {name:'Keanu Reeves'})-[:KNOWS]-(f:Person)
RETURN p.age, p.age - max(f.age)
p.age | p.age - max(f.age) |
---|---|
|
|
Rows: 1 |
Grouping keys themselves can be complex expressions. For better query readability, Cypher® only recognizes a sub-expression in aggregation expressions as a grouping key if the grouping key is either:
-
A variable - e.g. the
p
inRETURN p, p.age - max(f.age)
-
A property access - e.g. the
p.age
inRETURN p.age, p.age - max(f.age)
-
A map access - e.g. the
p.age
inWITH {name:'Keanu Reeves', age:58} AS p RETURN p.age, p.age - max(p.age)
If more complex grouping keys are needed as operands in aggregation expression, it is always possible to project them in advance with WITH
.
Using the property p.age
will throw an exception, since p.age
is not a grouping key.
Therefore, it cannot be used in the expressions which contain the aggregating function.
The below two queries would consequently return the same error message:
MATCH (p:Person {name:'Keanu Reeves'})-[:KNOWS]-(f:Person)
RETURN p.age - max(f.age)
MATCH (p:Person {name:'Keanu Reeves'})-[:KNOWS]-(f:Person)
RETURN p.age + p.age, p.age + p.age - max(f.age)
Aggregation column contains implicit grouping expressions. For example, in 'RETURN n.a, n.a + n.b + count(*)' the aggregation expression 'n.a + n.b + count(*)' includes the implicit grouping key 'n.b'. It may be possible to rewrite the query by extracting these grouping/aggregation expressions into a preceding WITH clause. Illegal expression(s): n.age
However, the latter query would work if rewritten to:
MATCH (p:Person {name:'Keanu Reeves'})-[:KNOWS]-(f:Person)
WITH p.age + p.age AS groupingKey, f
RETURN groupingKey, groupingKey - max(f.age)
groupingKey | groupingKey - max(f.age) |
---|---|
|
|
Rows: 1 |
Was this page helpful?