1.4. Full-text index analyzer providers

This section describes how to extend full-text index analyzers provided in Neo4j.

Full-text indexes always have an analyzer, which describes how text is analyzed for indexing and querying. The analyzer breaks the text up into smaller tokens, and processes these tokens with a number of filters. These filters can do many different things. For example, some remove stop-words, such as "the" and "is", while others transform the tokens by stemming them, or turn them into lower-case.

Which analyzer you should use depends on what you want to use the index for. For example, if the text being indexed belong to a special domain, such as email addresses, you would want an analyzer specific to that domain. Or if the text is always in a particular language, such as Russian, you could use an analyzer specific to that language.

Neo4j comes with a number of built-in analyzers. The full list of analyzers is available by calling the db.index.fulltext.listAvailableAnalyzers() procedure. The procedure returns the analyzer name, a short description, and the full list of stop-words used by the analyzer.

The default analyzer is the standard-no-stop-words analyzer. This default can be changed with the dbms.index.fulltext.default_analyzer setting. This setting only takes effect when a full-text index is created. Once a full-text index has been created, it remembers the analyzer in its index-specific settings.

It is possible to extend the available analyzers in Neo4j, by implementing an AnalyzerProvider. The AnalyzerProvider acts as a factory, which builds the concrete Lucene Analyzer instance used by the index. The following example creates a custom analyzer:

    public class CustomAnalyzerProvider extends AnalyzerProvider (1)
    {
        public CustomAnalyzerProvider()                          (2)
        {
            super( "custom-analyzer" );                          (3)
        }

        @Override
        public Analyzer createAnalyzer()                         (4)
        {
            try
            {
                return CustomAnalyzer.builder()                  (5)
                        .withTokenizer( StandardTokenizerFactory.class )
                        .addTokenFilter( LowerCaseFilterFactory.class )
                        .addTokenFilter( StopFilterFactory.class, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset" )
                        .build();
            }
            catch ( IOException e )
            {
                throw new UncheckedIOException( e );
            }
        }
    }

1

The custom analyzer provider class must be public, and it must extend the org.neo4j.graphdb.schema.AnalyzerProvider class.

2

The custom analyzer provider class must additionally have a public constructor that takes no arguments. Without this constructor, the new analyzer provider will not be loaded by Neo4j. There will be no warnings logged when an analyzer provider is ignored for this reason, so take care.

3

The constructor must then call its super-constructor, and give the name of the custom analyzer provider as an argument. This is the name that will be used to refer to this analyzer provider, when configuring indexes.

4

Lastly, the createAnalyzer method must be implemented. This method creates and returns the concrete Analyzer instance, that the indexes will use. If this method returns null or throws an exception, then the index will be marked as FAILED.

5

In this example, we are creating instances of Lucene’s CustomAnalyzer. We can, however, create and return anything that extends the org.apache.lucene.analysis.Analyzer class.

Follow the guide in Setting up a plugin project, to learn how to package the custom AnalyzerProvider in a JAR file that can be integrated into Neo4j.

The analyzer providers are picked up by Neo4j via service loading. This means that in addition to having implemented the class above, we must also add the fully-qualified class name to a service file, and put that file on the class path. These service files are usually included in the JAR file that contains the Neo4j extensions. In a typical Maven project, such as the one created by the guide above, the directory structure would look like this:

project/
   src/
      main/
         java/
            my_package/
               CustomAnalyzerProvider.java (1)
         resources/
            META-INF/
               services/
                  org.neo4j.graphdb.schema.AnalyzerProvider (2)

1

This is the CustomAnalyzerProvider from our previous code example.

2

This is the service loader file. It is a plain-text file that contains a line, with the fully-qualified name, for each AnalyzerProvider we implement in our project. In this case, it contains a single line: my_package.CustomAnalyzerProvider.

In order for the META-INF/services resources to be handled correctly by the maven-shade-plugin, it may be necessary to include a service resource transformation to the plug-in configuration. This is an example of what that may look like:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

Consult the documentation on the maven-shade-plugin for more details on this step.