Schema inference

Since Neo4j is essentially schemaless while Spark DataFrames use a fixed tabular schema, the Spark connector uses a schema inference system to convert graph data into DataFrames.

  1. If APOC is installed, the connector uses the apoc.meta.nodeTypeProperties and the apoc.meta.relTypeProperties procedures. You can tune both.

  2. If APOC is not installed, the connector uses the first n results (defined by the schema.flatten.limit option) of an additional Cypher® query to infer the schema by the type of each column. When using the query option, the schema is inferred from the result of the query itself.

Both methods use sampling, which is the default value (sample) of the schema.strategy option. The exact APOC procedure or Cypher query depends on the read option.

This strategy works when all instances of a property in Neo4j have the same type. Otherwise, the connector still attempts to infer a schema but it logs a message like the following:

The field "age" has different types: [String, Long]
Every value will be casted to string.

In this case you should define a schema instead.

labels option

If APOC is installed, the connector uses the apoc.meta.nodeTypeProperties procedure. Otherwise, it executes the following Cypher query:

MATCH (n:<labels>) (1)
RETURN n
ORDER BY rand()
LIMIT <limit> (2)
1 <labels> is the list of labels provided by the labels option.
2 <limit> is the value provided by the schema.flatten.limit option.

The schema is then inferred from the query result.

relationships option

If APOC is installed, the connector uses the apoc.meta.relTypeProperties procedure. Otherwise, it executes the following Cypher query:

MATCH (source:<source_labels>)-[rel:<relationship>]->(target:<target_labels>)  (1) (2) (3)
RETURN rel
ORDER BY rand()
LIMIT <limit> (4)
1 <source_labels> is the list of labels provided by relationship.source.labels option.
2 <target_labels> is the list of labels provided by relationship.target.labels option.
3 <relationship> is the list of labels provided by relationship option.
4 <limit> is the value provided via schema.flatten.limit.

The schema is then inferred from the query result.

query option

With the query option, the connector uses the first n results (defined by the schema.flatten.limit option) of the query result to infer the schema.

For example, if the read query is MATCH (n:Person) WITH n LIMIT 2 RETURN id(n) as id, n.name as name, the connector runs the following query first:

MATCH (n:Person) WITH n LIMIT 2 RETURN id(n) as id, n.age as age (1)
ORDER BY rand()
LIMIT <limit> (2)
1 The original read query.
2 <limit> is the value provided via schema.flatten.limit.

The schema is then inferred from the query result.

If the query returns no data, sampling is not possible. In this case the connector creates a schema from the RETURN statement, with every column of type String. This does not cause any issues since the result set is empty.