Enforcing Data Quality in Neo4j 5: New Property Type Constraints & Functions


At Neo4j, we’ve been working on making the Cypher types visible in the Cypher language. This includes seeing what type a value has with the type predicate expression and valueType function. It also applies to introducing type constraints, which let you restrict types allowed to be stored under a property. In this blog, we’ll explore what you can do with this new functionality.

We also presented on this topic at our NODES developer conference. Watch it here:

Cleaning Up Data

Consider this example. We set up a new database and imported data from two sources. Both sources contain information that is stored on a node with the label Person. A person is identified with the property person_id. Unfortunately, it turns out not all person IDs have been stored in the same format.

Here’s how the new features can help us make sense of this situation.

First, we check what types are present in the database:

MATCH (p:Person) RETURN DISTINCT valueType(p.person_id) AS type

╒════════════════════╕
│type │
╞════════════════════╡
│ INTEGER NOT NULL │
└────────────────────┘
│ STRING NOT NULL │
└────────────────────┘
│ NULL │
└────────────────────┘

We know that the person ID is supposed to be a number. To investigate it closer, we can return the nodes that have a string value with the following query:

MATCH (p)
WHERE p.person_id IS :: STRING
RETURN p

╒══════════════════════════════════════════════╕
│p │
╞══════════════════════════════════════════════╡
│(:Person {name: "Alice",person_id: "111 111"})│
├──────────────────────────────────────────────┤
│(:Person {name: "Eve"}) │
└──────────────────────────────────────────────┘

In our example, we notice that the person IDs are also numbers, but they contain whitespaces to make them more readable, such as “123 456”. We also see that there is data where the person ID is missing entirely (as expected, since the third result row in the previous query is NULL).

To clean this up, we can use the toInteger function to convert all string values to numbers. First, we must remove the whitespace from the value; otherwise, the result will be null.

MATCH (p:Person)
SET p.person_id = toInteger(replace(p.person_id, " ", ""))

This fails with the exception:

Expected a string value for `replace`, but got: Long(222222); 

Consider converting it to a string with toString().

This is because we are also hitting the properties that already contain an INTEGER value. We can use the new type predicate expression to only execute the statement on nodes with string properties:

MATCH (p:Person)
WHERE p.person_id IS :: STRING NOT NULL
SET p.person_id = toInteger(replace(p.person_id, " ", ""))

Enforce Data Integrity

We’ve cleaned up the data. We can now create a new constraint to guarantee that we do not add a person node where the id has another type in the future.

CREATE CONSTRAINT personId_type
FOR (p:Person)
REQUIRE p.person_id IS :: INTEGER

This allows us to create the following new nodes:

CREATE (:Person)
CREATE (:Person { person_id: 123456 })
CREATE (:Person { person_id: 0x123A })

But not:

CREATE (:Person { person_id = “123 456” })

Without cleaning up the previous data, we would have gotten the following error message when creating the constraint:

Node(0) with label `Person` required the property `person_id` to be of type
`INTEGER`, but was of type `STRING`.

Note that while type checks allow checking for INTEGER NOT NULL, it is not possible to define a constraint that is both a type constraint and an existence constraint right now. This might change in the future. In the meantime, if we wanted to enforce that the person_id is not null, we could use an additional property existence constraint. You can read up on the different constraint types here.

Could we create a valid type constraint without cleaning up the original data?

In fact, we could have created a weaker constraint by specifying several allowed types instead of only one, which we call a closed dynamic union type:

CREATE CONSTRAINT personId_clean_up_constraint
FOR (p:Person)
REQUIRE p.person_id IS :: INTEGER | STRING

These kinds of constraints could also come in handy while migrating between different types.

If you want to try out the new functionality on your own, you can use the following query to create a minimal dataset:

CREATE (:Person { name: "Alice", person_id: "111 111"}),
(:Person { name: "Bob", person_id: 222222}),
(:Person { name: "Eve" })

Corner Cases and Pitfalls

With all that out of the way, let’s take a look at some corner cases that might catch you up.

Typing Checks Return True If the Value Type Is Contained in the Checked Type

The Cypher data types can be seen as a set of possible values. A type predicate expression returns true if a type is contained in it.

We’ve already seen an example of this: a null value is contained in INTEGER. An INTEGER is technically a 64-bit integer — but that doesn’t mean it is not possible to store other types if you define an INTEGER constraint.

When interacting with Neo4j through the Java API, it would still be possible to store a short:

Short var = 2;
Result result = session.run(
"CREATE (p:Person { person_id: $person_id }) RETURN p.person_id, valueType(p.person_id)",
parameters("person_id", var)
);

List & Maps

A similar thing happens for lists. A type check of a list will return true if all values in the list satisfy the type check of the inner type. All of these queries will return true:

RETURN [1, 2] IS :: LIST<INTEGER>;
RETURN [] IS :: LIST<INTEGER>;
RETURN [null] IS :: LIST<INTEGER>;

If we return the valueType function instead, we can see the most precise type assigned to these values:

[1, 2] :: LIST<INTEGER>
[] :: LIST<NOTHING>
[null] :: LIST<NULL>

While it is mandatory to define the inner types for checks on LIST values, it is not possible to do the same for MAP values. The following property will simply return MAP:

MATCH (p:Person)
RETURN valueType(properties(p)) AS type

If you want to analyze the type in a MAP, you can do that by unwinding the properties instead:

MATCH (p:Person)
UNWIND keys(p) AS key
RETURN p, collect(key), collect(valueType(p[key]))

A List of Integers May Not Be a List

While Cypher itself does not support byte arrays, it is also possible to store a byte array using the Driver or the Java API. If you look at the documentation of types available in Cypher, you will notice that there exists no BYTE ARRAY type. When using the valueType function, it returns the most precise type known to Cypher, which is a LIST<INTEGER>, since a byte is technically an INTEGER.

While regular queries should treat the value as a LIST<INTEGER>, it is possible to get the stored types when accessing the data with an API, for example, in a custom procedure.

Let’s define the following procedure:

@Procedure("listsOnly")
public Stream<Output> listsOnly(@Name("object") Object object) throws IllegalAccessException {
List<Output> list = new ArrayList<>();
if (object instanceof List) {
list.add(new Output(object.getClass().getSimpleName()));
} else {
throw new IllegalAccessException("input was not a list");
}
return list.stream();
}

Executing the following statement will throw an exception if n.prop was stored as a byte array:

MATCH (n)
WHERE n.prop IS :: LIST<INTEGER> NOT NULL
CALL listsOnly(n.prop)
YIELD out
RETURN out

The procedure framework will, however, cast the passed-in value to a java.util.List of Long if you define the parameter type as a List<Long> instead.

Type Normalization

Before types are returned, Cypher normalizes type expressions. This is visible, for example, in SHOW CONSTRAINTS or in the output of the valueType function. Let’s look at the following example:

WITH [[1, 2, null], [], 1, 1.0, null] as val
RETURN valueType(val)

If we execute the valueType function on the individual elements, we would get the following results:

[1, 2, null] :: LIST<INTEGER> NOT NULL
[] :: LIST<NOTHING> NOT NULL //NOTHING means that this list doesn't contain any values
1 :: INTEGER NOT NULL
1.0 :: FLOAT NOT NULL
null :: NULL

However, when looking at the result of the entire list, this is simplified to LIST<INTEGER | FLOAT | LIST<INTEGER>> NOT NULL. Both [] and null are encompassed in other types and have been removed, returning the most precise and simple description of the given value type. Additionally, the types inside the list will always be returned in the same order regardless of the order of its elements.

Have a look at the documentation to see what else is included in the Cypher type normalization.

Case Statements

Let’s look at how all of this may affect you while writing case statements. Consider the following example:

MATCH(p)
RETURN
CASE
WHEN p.person_id IS :: INTEGER THEN "INTEGER"
WHEN p.person_id IS :: STRING THEN "STRING"
ELSE "UNKNOWN"
END AS result

This works as intended for fixed values like 1 or “String”, but what happens with null values? Since null is a legitimate INTEGER value, a null value would match the first case and return INTEGER.

The order of the type checks requires some extra attention to get right. If you want to define an order for a given set of types, you can check out the documentation of the type normalization order.

The order is defined in a way so that types will only be encompassed by another type later in the order, for example NULL comes before INTEGER. In short, these are the potentially conflicting types you need to watch out for:

  • If you want special logic for NULL values, you should have the NULL type check first
  • Dynamic unions should come after their simple types, e.g. INTEGER before INTEGER | FLOAT
  • Within dynamic unions, unions with a smaller size should come before unions with a bigger size.

For closed dynamic unions with the same union size, ordering becomes potentially ambiguous if some types are overlapping. Let’s look at a little more complex example with different dynamic unions:

MATCH(p)
RETURN
CASE
WHEN p.person_id IS :: LIST<INTEGER | STRING> THEN "INTEGER or STRING"
WHEN p.person_id IS :: LIST<INTEGER | FLOAT> "NUMBER"
END AS results

In this example, [1, “test”] would return “INTEGER or STRING” and [1, 2.0] would return “NUMBER”. What happens with [1, 2]?

The answer: Since type checks also use encompassing types, it will match the first case.

One option to handle this is to add an individual case to catch a LIST<INTEGER> and make the logic for this case clear:

MATCH(p)
RETURN
CASE
WHEN p.person_id IS :: LIST<INTEGER> THEN "INTEGER"
WHEN p.person_id IS :: LIST<INTEGER | STRING> THEN "INTEGER or STRING"
WHEN p.person_id IS :: LIST<INTEGER | FLOAT> "NUMBER"
END AS result

Another option would be to use the valueType function for comparing types instead. However, this solution may require more case statements and would have to be updated if Cypher introduces a subtype such as INTEGER32 in the future.

Get Hands-on With These New Features

Try the new features out yourself! All of them are available on Neo4j AuraDB. They were introduced gradually across multiple releases:

  • Type constraints & Type predicate expression
  • 5.9: initial release for simple types type constraints and type predicate expressions
  • 5.10: list types for type constraints, NOT NULL type predicates expressions
  • 5.11: support of closed dynamic unions
  • 5.13: introduction of the valueType function

Resources

See Neo4j 5 release notes for details.

Documentation:

Please let us know if this new functionality is helpful for you and how we can improve it.


Enforcing Data Quality in Neo4j 5: New Property Type Constraints & Functions was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.