Schema inference

Since Neo4j is essentially schemaless while Spark DataFrames use a fixed tabular schema, the Spark connector uses a schema inference system to convert graph data into DataFrames.

If APOC is installed, the connector uses the `apoc.meta.nodeTypeProperties` and `apoc.meta.relTypeProperties` procedures. You can tune both.
If APOC is not installed, the connector uses the first n results (defined by the `schema.flatten.limit` option) of an additional Cypher® query to infer the schema from the type of each column. When using the `query` option, the schema is inferred from the result of the query itself.

Both methods rely on sampling, which corresponds to the default value (`sample`) of the `schema.strategy` option. The exact APOC procedure or Cypher query depends on the read option (`labels`, `relationships`, or `query`).
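As a rough illustration of where these options go, the following Python sketch collects them into a dictionary and passes them to a Spark reader. The helper names `node_reader_options` and `read_nodes` are hypothetical, the `url` value is a placeholder, and `read_nodes` assumes a `SparkSession` with the connector on its classpath.

```python
def node_reader_options(url, labels, sample_size):
    """Collect read options that affect sampling-based schema inference."""
    return {
        "url": url,                                # placeholder connection URL
        "labels": labels,                          # read option: nodes by label
        "schema.strategy": "sample",               # the default, spelled out
        "schema.flatten.limit": str(sample_size),  # rows sampled without APOC
    }


def read_nodes(spark, options):
    """Load nodes as a DataFrame (requires a running Neo4j and the connector)."""
    reader = spark.read.format("org.neo4j.spark.DataSource")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()


options = node_reader_options("neo4j://localhost:7687", "Person", 50)
```

Calling `read_nodes(spark, options)` would then trigger the inference step before any data is loaded.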

This strategy works when all instances of a property in Neo4j have the same type. Otherwise, the connector still attempts to infer a schema but it logs a message like the following:

The field "age" has different types: [String, Long]
Every value will be casted to string.

In this case you should define a schema instead.
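The fallback to string can be pictured with a minimal Python sketch. It is not the connector's implementation: `TYPE_NAMES` and `infer_column_types` are hypothetical names, and the mapping from Python types to type names is an assumption made for illustration.

```python
# Hypothetical mapping from Python value types to schema type names.
TYPE_NAMES = {str: "String", int: "Long", float: "Double", bool: "Boolean"}


def infer_column_types(rows):
    """Infer one type per column from sampled rows.

    A column whose sampled values disagree on type falls back to String.
    """
    seen = {}
    for row in rows:
        for column, value in row.items():
            if value is not None:  # nulls carry no type information
                type_name = TYPE_NAMES.get(type(value), "String")
                seen.setdefault(column, set()).add(type_name)

    schema = {}
    for column, types in seen.items():
        if len(types) == 1:
            schema[column] = next(iter(types))
        else:
            found = ", ".join(sorted(types))
            print(f'The field "{column}" has different types: [{found}]')
            schema[column] = "String"
    return schema
```

With this sketch, a sample in which `age` is a number on one node and text on another yields a String column for `age`, which is why an explicit schema is the safer choice for such data.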

`labels` option

If APOC is installed, the connector uses the apoc.meta.nodeTypeProperties procedure. Otherwise, it executes the following Cypher query:

MATCH (n:<labels>) (1)
RETURN n
ORDER BY rand()
LIMIT <limit> (2)

1	`<labels>` is the list of labels provided by the `labels` option.
2	`<limit>` is the value provided by the `schema.flatten.limit` option.

The schema is then inferred from the query result.
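To make the placeholders concrete, here is a small Python sketch that fills in the query template above. The function `node_sampling_query` is hypothetical, and quoting each label with backticks is an assumption of this sketch rather than documented connector behavior.

```python
def node_sampling_query(labels, limit):
    """Fill the node sampling template with labels and a row limit."""
    if not labels:
        raise ValueError("at least one label is required")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Join the labels as `A`:`B` so that MATCH requires all of them.
    label_pattern = ":".join(f"`{label}`" for label in labels)
    return (
        f"MATCH (n:{label_pattern})\n"
        "RETURN n\n"
        "ORDER BY rand()\n"
        f"LIMIT {limit}"
    )


print(node_sampling_query(["Person", "Customer"], 10))
```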

`relationships` option

If APOC is installed, the connector uses the apoc.meta.relTypeProperties procedure. Otherwise, it executes the following Cypher query:

MATCH (source:<source_labels>)-[rel:<relationship>]->(target:<target_labels>)  (1) (2) (3)
RETURN rel
ORDER BY rand()
LIMIT <limit> (4)

1	`<source_labels>` is the list of labels provided by the `relationship.source.labels` option.
2	`<target_labels>` is the list of labels provided by the `relationship.target.labels` option.
3	`<relationship>` is the relationship type provided by the `relationship` option.
4	`<limit>` is the value provided by the `schema.flatten.limit` option.

The schema is then inferred from the query result.
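The relationship template can be filled in the same spirit. In this Python sketch, `relationship_sampling_query` is a hypothetical helper, and the backtick quoting of labels and of the relationship type is an assumption of the sketch.

```python
def relationship_sampling_query(source_labels, relationship, target_labels, limit):
    """Fill the relationship sampling template with the three pattern parts."""

    def label_pattern(labels):
        # Produces :`A`:`B`, or an empty string when no labels are given.
        return "".join(f":`{label}`" for label in labels)

    return (
        f"MATCH (source{label_pattern(source_labels)})"
        f"-[rel:`{relationship}`]->"
        f"(target{label_pattern(target_labels)})\n"
        "RETURN rel\n"
        "ORDER BY rand()\n"
        f"LIMIT {limit}"
    )


print(relationship_sampling_query(["Person"], "BOUGHT", ["Product"], 10))
```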

`query` option

With the query option, the connector uses the first n results (defined by the schema.flatten.limit option) of the query result to infer the schema.

For example, if the read query is MATCH (n:Person) WITH n LIMIT 2 RETURN id(n) as id, n.name as name, the connector runs the following query first:

MATCH (n:Person) WITH n LIMIT 2 RETURN id(n) as id, n.name as name (1)
ORDER BY rand()
LIMIT <limit> (2)

1	The original read query.
2	`<limit>` is the value provided via `schema.flatten.limit`.

The schema is then inferred from the query result.

If the query returns no data, sampling is not possible. In this case the connector creates a schema from the `RETURN` clause, with every column of type String. This does not cause any issues because the result set is empty.
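The empty-result rule amounts to a simple branch, sketched below in Python. The function `schema_or_fallback` and its parameters are hypothetical, and the column names are assumed to have been extracted from the query already.

```python
def schema_or_fallback(columns, sampled_rows, infer_from_rows):
    """Use sampled rows when available; otherwise type every column as String."""
    if not sampled_rows:
        # Nothing to sample: keep the returned column names, all typed String.
        return {column: "String" for column in columns}
    return infer_from_rows(sampled_rows)


print(schema_or_fallback(["id", "name"], [], infer_from_rows=None))
```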
