Vector search indexes
Node vector search indexes were released as a public beta in Neo4j 5.11 and general availability in Neo4j 5.13.
Vector indexes allow users to query vector embeddings from large datasets. An embedding is a numerical representation of a data object, such as text, image, audio, or document.
For example, each word or token in a text is typically represented as a high-dimensional vector where each dimension represents a certain aspect of the word’s meaning. Words that are semantically similar or related are often represented by vectors that are closer to each other in this vector space. This allows for mathematical operations like addition and subtraction to carry semantic meaning. For example, the vector representation of "king" minus "man" plus "woman" might be close to the vector representation of "queen". In other words, a vector embedding is a numerical representation of a data object that captures its semantic meaning.
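The arithmetic in the "king" and "queen" example can be sketched with a toy model. The two-dimensional vectors below are invented for illustration (one axis loosely standing for royalty, the other for gender); real embeddings are produced by a model and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical 2-dimensional "embeddings": (royalty, gender).
# These values are made up for illustration and come from no real model.
TOY_EMBEDDINGS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}


def cosine(u, v):
    """Trigonometric cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word whose vector is closest in angle to a - b + c."""
    target = tuple(
        x - y + z
        for x, y, z in zip(TOY_EMBEDDINGS[a], TOY_EMBEDDINGS[b], TOY_EMBEDDINGS[c])
    )
    return max(TOY_EMBEDDINGS, key=lambda word: cosine(TOY_EMBEDDINGS[word], target))


print(analogy("king", "man", "woman"))  # prints "queen"
```

In this toy space the result lands exactly on "queen"; with real embeddings the result is only expected to be close.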
The embedding for a particular data object can be generated by, for example, the Vertex AI or OpenAI embedding generators, which can produce vector embeddings with dimensions such as 256, 768, 1536, and 3072.
These vector embeddings are stored as LIST<INTEGER | FLOAT> properties on a node, where each dimensional component of the vector is an element in the LIST.
A Neo4j vector index can be used to index nodes by LIST<INTEGER | FLOAT> properties valid to the index.[1]
In Neo4j, a vector index allows you to write queries that match a neighborhood of nodes based on the similarity between the properties of those nodes and the ones specified in the query.
Neo4j vector indexes are powered by the Apache Lucene indexing and search library. Lucene implements a Hierarchical Navigable Small World[2] (HNSW) Graph to perform a k approximate nearest neighbors (k-ANN) query over the vector fields.
Vector index commands and procedures
Vector indexes are managed through Cypher® commands and built-in procedures. See Operations Manual → Procedures for a complete reference.
The procedures and commands for vector indexes are listed in the following table:
Usage | Procedure/Command | Description |
---|---|---|
Create node vector index. | CREATE VECTOR INDEX | Create a vector index for the specified label and property with the given vector dimensionality using the given similarity function. See the CREATE VECTOR INDEX command. |
Create relationship vector index. | CREATE VECTOR INDEX | Create a relationship vector index for the specified relationship type and property with the given vector dimensionality using the given similarity function. See the CREATE VECTOR INDEX command. |
Create vector index. | db.index.vector.createNodeIndex | It is replaced by CREATE VECTOR INDEX. |
Use node vector index. | db.index.vector.queryNodes | Query the given node vector index. Returns the requested number of approximate nearest neighbor nodes and their similarity score, ordered by score. |
Use relationship vector index. | db.index.vector.queryRelationships | Query the given relationship vector index. Returns the requested number of approximate nearest neighbor relationships and their similarity score, ordered by score. Introduced in 5.18 |
Drop vector index. | DROP INDEX | Drop the specified index, see the DROP INDEX command. |
Listing all vector indexes. | SHOW VECTOR INDEXES | Lists all vector indexes, see the SHOW INDEXES command. |
Set node vector property. | db.create.setNodeVectorProperty | Update a given node property with the given vector in a more space-efficient way than directly using SET. |
Set node vector property. | db.create.setVectorProperty | It is replaced by db.create.setNodeVectorProperty. |
Set relationship vector property. | db.create.setRelationshipVectorProperty | Update a given relationship property with the given vector in a more space-efficient way than directly using SET. |
Create and configure vector indexes
You can create vector indexes using the CREATE VECTOR INDEX command.
An index can be given a unique name when created (or get a generated one), which is used to reference the specific index when querying or dropping it.
Creating a vector index requires the CREATE INDEX privilege.
The index name must be unique among both indexes and constraints.
A vector index is a single-label, single-property index for nodes or a single-relationship-type, single-property index for relationships.[1]
A vector index needs to be configured with both the dimensionality of the vector (INTEGER between 1 and 4096 inclusive),[1] and the measure of similarity between two vectors (case-insensitive STRING).
For details, see Supported similarity functions.
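Command | Description |
---|---|
CREATE VECTOR INDEX [index_name] [IF NOT EXISTS] FOR (n:LabelName) ON (n.propertyName) OPTIONS {indexConfig: {`vector.dimensions`: dimensions, `vector.similarity_function`: 'function'}} | Create a vector index on nodes. The options map is mandatory because setting the vector dimensions and similarity function is mandatory when creating a vector index. |
CREATE VECTOR INDEX [index_name] [IF NOT EXISTS] FOR ()-[r:TYPE_NAME]-() ON (r.propertyName) OPTIONS {indexConfig: {`vector.dimensions`: dimensions, `vector.similarity_function`: 'function'}} | Create a vector index on relationships. The options map is mandatory because setting the vector dimensions and similarity function is mandatory when creating a vector index. Introduced in 5.18 |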
It is considered best practice to give the index a name when it is created.
This name is needed for both dropping and querying the index.
If the index is not explicitly named, it will get an auto-generated name.
As of Neo4j 5.16, the index name can also be given as a parameter, CREATE VECTOR INDEX $name FOR ….
The index name must be unique among all indexes and constraints.
The CREATE VECTOR INDEX command takes an OPTIONS clause. This has two parts, the indexProvider and the indexConfig.
As of Neo4j 5.18, there are two available index providers, vector-2.0 (default) and vector-1.0.
The indexConfig is a MAP from STRING keys to STRING and INTEGER values, and is used to set the index-specific configuration settings (the vector dimensions and similarity function).
The command is optionally idempotent. This means that its default behavior is to throw an error if an attempt is made to create the same index twice.
With IF NOT EXISTS, no error is thrown and nothing happens should an index with the same name, schema or both already exist.
It may still throw an error should a constraint with the same name exist.
As of Neo4j 5.17, an informational notification is instead returned showing the existing index which blocks the creation.
The new index is not immediately available but is created in the background.
All vectors within the index must have the same dimensionality. The measure of similarity is determined by the given vector similarity function. This defines how similar two vectors are to one another by a similarity score, how vectors are interpreted, and what vectors are valid for the index.
A node or relationship is indexed if all the following are true:
- The node/relationship contains the configured label/relationship type.
- The node/relationship contains the configured property key.
- The respective property value is of type LIST<INTEGER | FLOAT>.[1]
- The size() of the respective value is the same as the configured dimensionality.
- The value is a valid vector for the configured similarity function.
Otherwise, a node or relationship is not indexed.
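The value-related conditions in the list above can be approximated on the client side before writing data. The following is a minimal sketch, not Neo4j's own validation code: the function names are hypothetical, it covers only the type, size, and component checks, and it uses the Euclidean rule that every component must be finitely representable in IEEE 754 single precision. Cosine indexes have additional norm-related rules that this sketch does not check.

```python
import math
import struct


def fits_single_precision(x):
    """True if x is still a finite number after rounding to IEEE 754 single precision."""
    try:
        packed = struct.pack("f", x)  # fails for values too large for a float32
    except (OverflowError, struct.error):
        return False
    # Infinities and NaN pack without error, so check the round-tripped value.
    return math.isfinite(struct.unpack("f", packed)[0])


def is_candidate_vector(value, dimensions):
    """Approximate the property-value checks for a Euclidean vector index."""
    if not isinstance(value, list):
        return False
    if len(value) != dimensions:
        return False
    for component in value:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(component, bool) or not isinstance(component, (int, float)):
            return False
        if not fits_single_precision(component):
            return False
    return True
```

A check like this can catch wrong-sized or non-finite vectors early, since such nodes or relationships would otherwise be silently left out of the index.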
For instance, assume you have a graph of research papers, and each paper has an abstract. You want to find papers in the neighborhood of a paper you know.
(:Title)<--(:Paper)-->(:Abstract)
Assume for each abstract, you have generated a 1536-dimensional vector embedding of the abstract’s text using OpenAI’s text-embedding-ada-002 model.
This model suggests a cosine similarity.
For more information, see OpenAI’s official documentation.
You can create a cosine node vector index over the embedding property.
CREATE VECTOR INDEX `abstract-embeddings`
FOR (n: Abstract) ON (n.embedding)
OPTIONS {indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}}
Assume you have a graph of employees and their managers, who are themselves employees. Managers review their reports periodically, and you wish to search for reviews of similar themes and nuances to find excellent employees.
(:Manager)-[:REVIEWED]->(:Employee)
Assume for each review, you have generated a 256-dimensional vector embedding of the review’s text using a shortening of OpenAI’s text-embedding-3-large model.
This model suggests a cosine similarity.
For more information, see OpenAI’s official documentation.
You can create a cosine relationship vector index over the embedding property.
CREATE VECTOR INDEX `review-embeddings`
FOR ()-[r:REVIEWED]-() ON (r.embedding)
OPTIONS {indexConfig: {
`vector.dimensions`: 256,
`vector.similarity_function`: 'cosine'
}}
You can see that the two vector indexes have been created using SHOW INDEXES:
SHOW VECTOR INDEXES YIELD name, type, entityType, labelsOrTypes, properties, options
name | type | entityType | labelsOrTypes | properties | options |
---|---|---|---|---|---|
Query a vector index
You can query a vector index using the db.index.vector.queryNodes or the db.index.vector.queryRelationships procedure.
db.index.vector.queryNodes to query a node vector index:
db.index.vector.queryNodes(indexName :: STRING, numberOfNearestNeighbours :: INTEGER, query :: ANY) :: (node :: NODE, score :: FLOAT)
db.index.vector.queryRelationships to query a relationship vector index:
db.index.vector.queryRelationships(indexName :: STRING, numberOfNearestNeighbours :: INTEGER, query :: ANY) :: (relationship :: RELATIONSHIP, score :: FLOAT)
- The indexName (a STRING) refers to the unique name of the vector index to query.
- The numberOfNearestNeighbours (an INTEGER) refers to the number of nearest neighbors to return as the neighborhood.
- The query (a LIST<INTEGER | FLOAT>) is the vector whose neighborhood is searched for.
The procedures return the neighborhood of nodes or relationships with their respective similarity scores, ordered by those scores.
The scores are bounded between 0 and 1, where the closer to 1 the score is, the more similar the indexed vector is to the query vector.
This example takes the paper that describes the HNSW[2] graph structure that the vector index implements and tries to find similar papers.
First you MATCH to find the paper, and then you query the abstract-embeddings index for a neighborhood of 10 similar abstracts to your query.
Finally, you MATCH for the neighborhood’s respective titles.
MATCH (title:Title)<--(:Paper)-->(abstract:Abstract)
WHERE toLower(title.text) = 'efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs'
CALL db.index.vector.queryNodes('abstract-embeddings', 10, abstract.embedding)
YIELD node AS similarAbstract, score
MATCH (similarAbstract)<--(:Paper)-->(similarTitle:Title)
RETURN similarTitle.text AS title, score
title | score |
---|---|
Rows: 10 |
The results are as expected: papers discussing graph-based nearest-neighbor searches.
The most similar result is the query vector itself, which is to be expected as the index was queried with an indexed property.
If the query vector itself is not wanted, you can use WHERE score < 1 to remove vectors equivalent to the query vector.
This example takes a query vector describing particular themes and nuances in the reviews.
This query vector can be acquired by encoding the themes using the GenAI integrations plugin.
Then you query the review-embeddings index for a neighborhood of 10 reviews containing similar themes and nuances to the query.
Finally, you MATCH for the neighborhood’s respective employees.
CALL db.index.vector.queryRelationships('review-embeddings', 10, $query)
YIELD relationship AS review, score
MATCH ()-[review]->(employee:Employee)
RETURN employee.name AS name, score
name | score |
---|---|
Rows: 10 |
Drop vector indexes
A vector index is dropped by using the same command as for other indexes, DROP INDEX.
Dropping a vector index requires the DROP INDEX privilege.
In the following example, you drop the abstract-embeddings index that you created previously:
DROP INDEX `abstract-embeddings`
Removed 1 index.
The index name can also be given as a parameter, DROP INDEX $name.
Set a vector property
Valid vectors for use in the index must have components finitely representable in IEEE 754[3] single precision.
They are represented as properties on nodes with the type LIST<INTEGER | FLOAT>.
As of Neo4j 5.13, you can set a vector property on a node using the db.create.setNodeVectorProperty procedure.
It validates the input and sets the property as an array of IEEE 754[3] single precision values.
This beta procedure replaces db.create.setVectorProperty.
As of Neo4j 5.18, you can set a vector property on a relationship using the db.create.setRelationshipVectorProperty procedure.
db.create.setNodeVectorProperty:
db.create.setNodeVectorProperty(node :: NODE, key :: STRING, vector :: ANY)
db.create.setVectorProperty (deprecated):
db.create.setVectorProperty(node :: NODE, key :: STRING, vector :: ANY) :: (node :: NODE)
db.create.setRelationshipVectorProperty:
db.create.setRelationshipVectorProperty(relationship :: RELATIONSHIP, key :: STRING, vector :: ANY)
The following example passes an embedding as a Cypher parameter, matches a node, and sets its vector property using db.create.setNodeVectorProperty:
MATCH (n:Node {id: $id})
CALL db.create.setNodeVectorProperty(n, 'propertyKey', $vector)
RETURN n
Likewise, you can match a relationship and set its vector property using db.create.setRelationshipVectorProperty:
MATCH ()-[r:Relationship {id: $id}]->()
CALL db.create.setRelationshipVectorProperty(r, 'propertyKey', $vector)
RETURN r
You can also pass a list parameter containing several MATCH criteria and embeddings, and process it with an UNWIND clause to update multiple nodes.
This is ideal for creating and setting new vector properties in the graph.
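One way such a list parameter might be shaped is sketched below. The helper name `to_batches`, the parameter name `rows`, and the batch size are assumptions made for illustration; the label `Node`, the `id` property, and `propertyKey` are the placeholder names from the examples above. The Cypher text has not been run against a database here and is shown only to indicate how the parameter would be consumed.

```python
# Cypher text for a batched update (a sketch; the `rows` parameter name is an assumption).
BATCH_UPDATE_QUERY = """
UNWIND $rows AS row
MATCH (n:Node {id: row.id})
CALL db.create.setNodeVectorProperty(n, 'propertyKey', row.vector)
RETURN count(*) AS updated
"""


def to_batches(embeddings_by_id, batch_size):
    """Split an {id: vector} mapping into lists of rows for the $rows parameter."""
    rows = [
        {"id": key, "vector": list(vector)}
        for key, vector in embeddings_by_id.items()
    ]
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


batches = to_batches({1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}, batch_size=2)
print([len(batch) for batch in batches])  # prints [2, 1]
```

Each batch would then be sent as the `rows` parameter of one query execution through whichever driver you use, which keeps individual transactions small.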
You can also set a vector property on a node using the SET command as in the following example:
MATCH (node:Node {id: $id})
SET node.propertyKey = $vector
RETURN node
However, Cypher coerces and stores the provided LIST<INTEGER | FLOAT> as a primitive array of IEEE 754[3] double precision values.
This takes up almost twice as much space compared to the alternative method, where you use the db.create.setNodeVectorProperty procedure.
As a result, using SET for a vector index is not recommended.
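The size difference follows directly from the component width: 4 bytes per component in single precision versus 8 bytes in double precision. The sketch below counts only the raw component bytes for a 1536-dimensional vector and ignores any per-property storage overhead in the database, which is an assumption of this illustration.

```python
from array import array

DIMENSIONS = 1536  # same dimensionality as the abstract-embeddings example

vector = [0.5] * DIMENSIONS

single = array("f", vector)  # IEEE 754 single precision, 4 bytes per component
double = array("d", vector)  # IEEE 754 double precision, 8 bytes per component

single_bytes = single.itemsize * len(single)
double_bytes = double.itemsize * len(double)

print(single_bytes, double_bytes)  # prints 6144 12288
```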
To reduce the storage space, you can reset the existing properties using db.create.setNodeVectorProperty.
However, this comes with the cost of an increase in the transaction log size until the logs are rotated away.
Supported similarity functions
The choice of similarity function affects which indexed vectors are considered similar, and which are valid. The semantic meaning of the vector may itself dictate which similarity function to choose. Refer to the documentation for the particular vector embedding model you are using, as it may suggest a preference for certain similarity functions. Otherwise, being able to differentiate between the various similarity functions can assist in making a more informed decision.
Name | Case insensitive argument | Key similarity feature |
---|---|---|
Euclidean | "euclidean" | distance |
Cosine | "cosine" | angle |
For $\ell_2$-normalized vectors (unit vectors), that is, vectors of unit length $\|v\|_2 = 1$, the Euclidean and cosine similarity functions produce the same similarity ordering.
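Why the two orderings coincide for unit vectors can be sketched as follows. The derivation assumes that the Euclidean score has the form $1 / (1 + \|v - u\|_2^2)$ and that the cosine score has the form $(1 + \cos\theta) / 2$, where $\theta$ is the angle between $v$ and $u$; treat both forms as assumptions of this sketch.

```latex
\|v - u\|_2^2 = \|v\|_2^2 + \|u\|_2^2 - 2\, v \cdot u = 2 - 2\cos\theta
\quad\Longrightarrow\quad
s_{\text{Euclidean}} = \frac{1}{1 + \|v - u\|_2^2} = \frac{1}{3 - 2\cos\theta},
\qquad
s_{\text{cosine}} = \frac{1 + \cos\theta}{2}
```

Both expressions increase monotonically with $\cos\theta$ on $[-1, 1]$, so ranking neighbors by one score gives the same order as ranking them by the other.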
Euclidean similarity
Euclidean similarity is useful when the distance between the vectors is what determines how similar two vectors are.
A valid vector for a Euclidean vector index is when all vector components can be represented finitely in IEEE 754[3] single precision.
Euclidean interprets the vectors in Cartesian coordinates. The measure is related to the Euclidean distance, i.e., how far two points are from one another. However, that distance is unbounded and less useful as a similarity score. Euclidean similarity bounds the square of the Euclidean distance:
$$s = \frac{1}{1 + \|v - u\|_2^2}$$
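As a rough illustration of how a squared distance can be bounded into a score between 0 and 1, the sketch below assumes the score has the form 1 / (1 + squared distance). The function name is hypothetical, and the computation uses Python's double precision, so values may differ slightly from scores returned by the index.

```python
def euclidean_similarity(v, u):
    """Map the squared Euclidean distance between v and u into the range (0, 1]."""
    if len(v) != len(u):
        raise ValueError("vectors must have the same dimensionality")
    squared_distance = sum((a - b) ** 2 for a, b in zip(v, u))
    return 1.0 / (1.0 + squared_distance)


print(euclidean_similarity([1.0, 2.0], [1.0, 2.0]))  # prints 1.0
```

Identical vectors score exactly 1, and the score approaches 0 as the vectors move further apart.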
Cosine similarity
Cosine similarity is used when the angle between the vectors is what determines how similar two vectors are.
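A valid vector for a cosine vector index must satisfy the validity conditions of the index provider in use, which are listed under Vector index providers for compatibility.[1]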
Cosine similarity interprets the vectors in Cartesian coordinates. The measure is related to the angle between the two vectors. However, an angle can be described in many units, sign conventions, and periods. The trigonometric cosine of this angle is both agnostic to the aforementioned angle conventions and bounded. Cosine similarity re-bounds the trigonometric cosine to the range between 0 and 1:
$$s = \frac{1 + \hat{v} \cdot \hat{u}}{2}$$
In the above equation the trigonometric cosine is given by the scalar product of the two unit vectors, $\hat{v} \cdot \hat{u}$.
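The re-bounding of the cosine can be sketched in a few lines, assuming the score is the cosine of the angle shifted from the range [-1, 1] into [0, 1] as (1 + cos θ) / 2. The function name is hypothetical, and the arithmetic is done in Python's double precision rather than in the precision used by the index.

```python
import math


def cosine_similarity(v, u):
    """Shift the cosine of the angle between v and u from [-1, 1] into [0, 1]."""
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_u = math.sqrt(sum(b * b for b in u))
    if norm_v == 0.0 or norm_u == 0.0:
        # The angle is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for zero vectors")
    cos_theta = sum(a * b for a, b in zip(v, u)) / (norm_v * norm_u)
    return (1.0 + cos_theta) / 2.0


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # prints 0.5
```

Vectors pointing in the same direction score 1, perpendicular vectors score 0.5, and opposite vectors score 0, regardless of their lengths.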
Vector index providers for compatibility
As of Neo4j 5.18, the default and preferred vector index provider is vector-2.0.
Previously created vector-1.0 indexes will continue to function.
New indexes can still be created with the vector-1.0 index provider if specified, see Create and configure vector indexes.
Supported | vector-1.0 | vector-2.0 |
---|---|---|
Index schema | Single-label, single-property index for nodes. No relationship support. | Single-label, single-property index for nodes. Single-type, single-property index for relationships. |
Indexed property value type | | |
Indexed vector dimensionality | | |
Cosine similarity vector validity | All vector components can be represented finitely in IEEE 754[3] single precision. Its $\ell_2$-norm is non-zero and can be represented finitely in IEEE 754[3] single precision. | All vector components can be represented finitely in IEEE 754[3] double precision. Its $\ell_2$-norm is non-zero and can be represented finitely in IEEE 754[3] double precision. The ratio of each vector component with its $\ell_2$-norm can be represented finitely in IEEE 754[3] single precision. |
Limitations and idiosyncrasies
- The query is an approximate nearest neighbor search. The requested k nearest neighbors may not be the exact k nearest, but close within the same wider neighborhood, such as finding a local extremum vs the true extremum.
- For large requested nearest neighbors, k, close to the total number of indexed vectors, the search may retrieve fewer than k results.
- Only one vector index can be over a schema. For example, you cannot have one Euclidean and one cosine vector index on the same label-property key pair.
- No provided settings or options for tuning the index.
- Changes made within the same transaction are not visible to the index.
Known issues
As of Neo4j 5.13, the vector search index is no longer a beta feature. The following table lists the known issues and the version in which they were fixed:
Known issue | Fixed in |
---|---|
Procedure signatures from | |
Only node vector indexes are supported. | Neo4j 5.18 |
Vector indexes cannot be assigned autogenerated names. | Neo4j 5.15 |
There is no Cypher syntax for creating a vector index. | Neo4j 5.15 |
The standard index type filtering for | Neo4j 5.15 |
Vector indexes may incorrectly reject valid queries in a cluster setting. This is caused by an issue in the handling of index capabilities on followers. For more information about clustering in Neo4j, see the Operations Manual → Clustering. | Neo4j 5.14 |
Querying for a single approximate nearest neighbor from an index would fail a validation check. Passing a | Neo4j 5.13 |
Vector index queries throw an exception if the transaction state contains changes. This means that writes may only take place after the last vector index query in a transaction. | Neo4j 5.13 |
 | Neo4j 5.12 |
Passing | Neo4j 5.12 |
The creation of the vector index skipped the check to limit the dimensionality to | Neo4j 5.12 |
The validation for cosine similarity verifies that the vector’s $\ell_2$-norm can be represented finitely in IEEE 754[3] double precision, rather than in single precision. This can lead to certain large component vectors being incorrectly indexed, and return a similarity score of | Neo4j 5.12 |
 | Neo4j 5.12 |
The vector index | Neo4j 5.12 |
Copying a database store with a vector index does not log the recreation command, and instead logs an error: ERROR: [StoreCopy] Unable to format statement for index 'index-name' Due to an: java.lang.IllegalArgumentException: Did not recognize index type VECTOR | Neo4j 5.12 |
Some of the protections preventing the use of new features during a database rolling upgrade are missing. This can result in a transaction to create a vector index on a cluster member running Neo4j 5.11 and distributing it to other cluster members running an older Neo4j version. The older Neo4j versions will fail to understand the transaction. | Neo4j 5.12 |
Suggestions
Vector indexes can take advantage of the incubated Java 20 Vector API for noticeable speed improvements. If you are using a compatible version of Java, you can add the following setting to your configuration settings:
server.jvm.additional=--add-modules jdk.incubator.vector