Schema inference
Since Neo4j is essentially schemaless while Spark DataFrames use a fixed tabular schema, the Spark connector uses a schema inference system to convert graph data into DataFrames.
- If APOC is installed, the connector uses the apoc.meta.nodeTypeProperties and the apoc.meta.relTypeProperties procedures. You can tune both.
- If APOC is not installed, the connector uses the first n results (defined by the schema.flatten.limit option) of an additional Cypher® query to infer the schema by the type of each column. When using the query option, the schema is inferred from the result of the query itself.
Both methods use sampling, which corresponds to the default value (sample) of the schema.strategy option.
The exact APOC procedure or Cypher query depends on the read option.
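As a rough illustration of how the options mentioned above fit together, the following sketch collects them into a plain dictionary. The helper name sampling_read_options and the example values are hypothetical; only the option keys (labels, schema.strategy, schema.flatten.limit) come from this page.

```python
def sampling_read_options(labels, sample_size):
    """Build an option map for a label-based read that relies on sampling.

    Hypothetical helper for illustration; not part of the connector.
    """
    if sample_size < 1:
        raise ValueError("sample_size must be a positive integer")
    return {
        "labels": labels,
        "schema.strategy": "sample",               # the default strategy
        "schema.flatten.limit": str(sample_size),  # rows sampled without APOC
    }


options = sampling_read_options("Person", 50)

# Assuming a live SparkSession named `spark` and connection options
# configured elsewhere, the map could be passed to the reader like this:
# df = spark.read.format("org.neo4j.spark.DataSource").options(**options).load()
```

Keeping the options in one map makes it easy to see which settings influence inference before any read is attempted.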
The sampling strategy works when all instances of a property in Neo4j have the same type. Otherwise, the connector still attempts to infer a schema, but it logs a message like the following:
The field "age" has different types: [String, Long]
Every value will be casted to string.
In this case you should define a schema instead.
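The fallback to string can be pictured with a small sketch. This is not the connector's implementation, only a simplified model of the behaviour described above: each sampled row is a dictionary, and a column whose values disagree on type is reported as String. The function name infer_column_types and the type-name mapping are assumptions made for this example.

```python
# Simplified mapping from Python types to the type names used in the log message.
TYPE_NAMES = {bool: "Boolean", int: "Long", float: "Double", str: "String"}


def infer_column_types(rows):
    """Infer one type name per column from sampled rows (a list of dicts)."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            types = seen.setdefault(column, set())
            if value is not None:  # nulls carry no type information
                types.add(TYPE_NAMES.get(type(value), "String"))
    # A column keeps its type only if every sampled value agreed on it.
    return {
        column: next(iter(types)) if len(types) == 1 else "String"
        for column, types in seen.items()
    }


consistent = infer_column_types([{"name": "Alice", "age": 42},
                                 {"name": "Bob", "age": 31}])
conflicting = infer_column_types([{"name": "Alice", "age": 42},
                                  {"name": "Bob", "age": "unknown"}])
```

In the second call the age property mixes Long and String values, so the sketch falls back to String for that column, mirroring the logged warning.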
labels option
If APOC is installed, the connector uses the apoc.meta.nodeTypeProperties procedure.
Otherwise, it executes the following Cypher query:
MATCH (n:<labels>) (1)
RETURN n
ORDER BY rand()
LIMIT <limit> (2)
(1) <labels> is the list of labels provided by the labels option.
(2) <limit> is the value provided by the schema.flatten.limit option.
The schema is then inferred from the query result.
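To make the placeholders concrete, here is a minimal sketch that fills in the query template above. The function labels_sampling_query is hypothetical and performs no escaping of label names; it only shows how the two option values end up in the query text.

```python
def labels_sampling_query(labels, limit):
    """Fill the sampling query template for a list of node labels.

    Hypothetical helper for illustration; label names are not escaped.
    """
    if not labels:
        raise ValueError("at least one label is required")
    label_pattern = ":".join(labels)
    return (
        f"MATCH (n:{label_pattern})\n"
        "RETURN n\n"
        "ORDER BY rand()\n"
        f"LIMIT {int(limit)}"
    )


labels_query = labels_sampling_query(["Person", "Customer"], 10)
print(labels_query)
```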
relationship option
If APOC is installed, the connector uses the apoc.meta.relTypeProperties procedure.
Otherwise, it executes the following Cypher query:
MATCH (source:<source_labels>)-[rel:<relationship>]->(target:<target_labels>) (1) (2) (3)
RETURN rel
ORDER BY rand()
LIMIT <limit> (4)
(1) <source_labels> is the list of labels provided by the relationship.source.labels option.
(2) <target_labels> is the list of labels provided by the relationship.target.labels option.
(3) <relationship> is the relationship type provided by the relationship option.
(4) <limit> is the value provided by the schema.flatten.limit option.
The schema is then inferred from the query result.
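The same template can be filled in for relationships. The sketch below uses a hypothetical function, relationship_sampling_query, with no escaping of names; it simply substitutes the four option values into the query shown above.

```python
def relationship_sampling_query(source_labels, relationship, target_labels, limit):
    """Fill the sampling query template for a relationship read.

    Hypothetical helper for illustration; names are not escaped.
    """
    source = ":".join(source_labels)
    target = ":".join(target_labels)
    return (
        f"MATCH (source:{source})-[rel:{relationship}]->(target:{target})\n"
        "RETURN rel\n"
        "ORDER BY rand()\n"
        f"LIMIT {int(limit)}"
    )


rel_query = relationship_sampling_query(["Person"], "BOUGHT", ["Product"], 10)
print(rel_query)
```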
query option
With the query option, the connector uses the first n results (defined by the schema.flatten.limit option) of the query result to infer the schema.
For example, if the read query is MATCH (n:Person) WITH n LIMIT 2 RETURN id(n) as id, n.name as name, the connector runs the following query first:
MATCH (n:Person) WITH n LIMIT 2 RETURN id(n) as id, n.name as name (1)
ORDER BY rand()
LIMIT <limit> (2)
(1) The original read query.
(2) <limit> is the value provided by the schema.flatten.limit option.
The schema is then inferred from the query result.
If the query returns no data, sampling is not possible.
In this case, the connector creates a schema from the RETURN clause, with every column of type String.
This does not cause any issues since the result set is empty.
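The empty-result fallback can be approximated with a deliberately naive sketch. It assumes that every returned item has an explicit alias, that the RETURN clause is the last clause of the query, and that items contain no nested commas. The function string_schema_from_return is hypothetical and does not reflect how the connector parses queries.

```python
import re


def string_schema_from_return(query):
    """Derive (column, "String") pairs from the RETURN clause of a query.

    Naive illustration: splits on commas and on the AS keyword only.
    """
    parts = re.split(r"\bRETURN\b", query, flags=re.IGNORECASE)
    if len(parts) < 2:
        raise ValueError("query has no RETURN clause")
    columns = []
    for item in parts[-1].split(","):
        # Keep the alias if present, otherwise the expression itself.
        alias = re.split(r"\s+AS\s+", item.strip(), flags=re.IGNORECASE)[-1]
        columns.append((alias.strip(), "String"))
    return columns


empty_schema = string_schema_from_return(
    "MATCH (n:Person) WITH n LIMIT 2 RETURN id(n) as id, n.name as name"
)
```

For the example read query, the sketch yields the columns id and name, both typed as String.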