Full-text index analyzer providers

Full-text indexes always have an analyzer, which describes how text is analyzed for indexing and querying. The analyzer breaks the text up into smaller tokens and processes these tokens using filters. The filters can do different things, such as removing stop-words (e.g., the and is), stemming the tokens, or turning them into lower-case.

Which analyzer to use, depends on what you want to use the index for. For example, if the text being indexed belongs to a special domain, such as email addresses, you would want an analyzer specific to that domain. Or if the text is always in a particular language, such as Russian, you could use an analyzer specific to that language.

Neo4j comes with a number of built-in analyzers. The full list of analyzers is available by calling the db.index.fulltext.listAvailableAnalyzers() procedure. The procedure returns the analyzer name, a short description, and the full list of stop-words used by the analyzer.

The default analyzer is the standard-no-stop-words analyzer. This default can be changed with the dbms.index.fulltext.default_analyzer setting. This setting only takes effect when a full-text index is created. Once a full-text index has been created, it remembers the analyzer in its index-specific settings.

It is possible to extend the available analyzers in Neo4j by implementing AnalyzerProvider. The AnalyzerProvider acts as a factory, which builds the concrete Lucene Analyzer instance used by the index. The following example creates a customized analyzer:

public class CustomAnalyzerProvider extends AnalyzerProvider (1)
{
    public CustomAnalyzerProvider()                          (2)
    {
        super( "custom-analyzer" );                          (3)
    }

    @Override
    public Analyzer createAnalyzer()                         (4)
    {
        try
        {
            return CustomAnalyzer.builder()                  (5)
                    .withTokenizer( StandardTokenizerFactory.class )
                    .addTokenFilter( LowerCaseFilterFactory.class )
                    .addTokenFilter( StopFilterFactory.class, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset" )
                    .build();
        }
        catch ( IOException e )
        {
            throw new UncheckedIOException( e );
        }
    }
}
1 The CustomAnalyzerProvider class must be public, and it must extend the org.neo4j.graphdb.schema.AnalyzerProvider class.
2 The CustomAnalyzerProvider class must additionally have a public constructor that takes no arguments. Without this constructor, the new analyzer provider will not be loaded by Neo4j. There will be no warnings logged when an analyzer provider is ignored because of this reason, so take care.
3 The constructor must then call its super-constructor, and give the name of the custom-analyzer provider as an argument. This is the name that will be used to refer to this analyzer provider when configuring indexes.
4 Lastly, the createAnalyzer method must be implemented. This method creates and returns the concrete Analyzer instance, that the indexes will use. If this method returns null or throws an exception, the index will be marked as FAILED.
5 This example creates instances of Lucene’s CustomAnalyzer. You can, however, create and return anything that extends the org.apache.lucene.analysis.Analyzer class.

Follow the guide in Setting up a plugin project, to learn how to package the customized AnalyzerProvider in a JAR file that can be integrated into Neo4j.

The analyzer providers are picked up by Neo4j via service loading. This means that in addition to having implemented the class, you must also add the fully qualified class name to a service file, and put that file on the classpath. These service files are usually included in the JAR file that contains the Neo4j extensions. In a typical Maven project, such as the one created by Setting up a plugin project, the directory structure would look like this:

project/
   src/
      main/
         java/
            my_package/
               CustomAnalyzerProvider.java (1)
         resources/
            META-INF/
               services/
                  org.neo4j.graphdb.schema.AnalyzerProvider (2)
1 This is the CustomAnalyzerProvider from our previous code example.
2 This is the service loader file. It is a plain-text file that contains a line, with the fully qualified name, for each AnalyzerProvider implemented in our project. In this case, it contains a single line: my_package.CustomAnalyzerProvider.

For the META-INF/services resources to be handled correctly by the maven-shade-plugin, it may be necessary to include a service resource transformation to the plug-in configuration. This is an example of what that may look like:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

Consult the documentation on the Maven Shade Plugin for more details on this step.