How to Securely Deploy Neo4j into Amazon Web Services (AWS)

Editor’s Note: This presentation was given by Benjamin Nussbaum at GraphConnect Europe in April 2016. Here’s a quick TL;DR of what he covered:
    • The importance of cloud security
    • The language of security
    • NAT Routing
    • GraphGrid services

Today we’re going to talk about securely deploying Neo4j into Amazon Web Services (AWS).

To start, we need to ask the question: Why is cloud security important? Over the last several years, there has been an increase in security incidents in which millions of records have been stolen. It has been calculated that each leak costs a company $154 per record, which adds up to a huge loss for businesses.

And as the ones developing and building these solutions, it’s our responsibility to provide a high level of security to our customers. This isn’t just a technical aspect — security begins with personnel. A culture of security is the first place to start.

Each cloud technology provides a set of frameworks, tools and APIs that you can combine with different security components. Certain cloud providers, such as AWS, have a very robust security infrastructure that enables you to work with default security components, which saves time.

Whether you’re using a virtual private cloud (VPC) or a private network, you want everything running over SSL. In Neo4j, run all interactions between your graph and your application over SSL, which can be configured on port 7473 with HTTPS.
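As a minimal sketch of what "everything over SSL" can look like on the application side, the snippet below builds an HTTPS URL for port 7473 and a TLS context that refuses unverified certificates. The host name is a hypothetical placeholder, and the helper names are illustrative rather than part of any Neo4j API.

```python
import ssl
from urllib.parse import urlunsplit

# Hypothetical internal host name; 7473 is the HTTPS port named in the talk.
NEO4J_HOST = "neo4j.internal.example"
NEO4J_HTTPS_PORT = 7473


def neo4j_https_url(host: str, port: int = NEO4J_HTTPS_PORT, path: str = "/") -> str:
    """Build an https:// URL for the Neo4j HTTP endpoint (illustrative helper)."""
    return urlunsplit(("https", f"{host}:{port}", path, "", ""))


def strict_tls_context() -> ssl.SSLContext:
    """Return a client context that verifies the server certificate and host name."""
    context = ssl.create_default_context()
    # These are already the defaults; setting them explicitly guards against
    # verification being switched off by accident elsewhere.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


url = neo4j_https_url(NEO4J_HOST)
context = strict_tls_context()
print(url)
```

A client would then pass the context along with the URL, for example `urllib.request.urlopen(url, context=context)`, so that a connection to a server with an untrusted certificate fails instead of silently proceeding.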

AWS and Neo4j Deployment

Now we’re going to explore a few different ways to deploy Neo4j on AWS if you’d like to roll out your own cloud deployment:

However, none of the above come with security by default. So, even though Neo4j is now deployed in AWS, you have to determine what steps you need to take to secure your data. And this is where a lot of the learning starts.

The Language of Security: Part 1

Before figuring out how all the different components work together to secure your environment, you need to learn a few acronyms. The goal is for Neo4j to be able to access the external world while preventing people from seeing that you’re running Neo4j on port 7474 on your server:

    • Identity and Access Management (IAM). Provides user- and group-level permissions for authentication and authorization controls to AWS resources. This is where you manage your operations team’s users and groups, and so determine who within the organization has access to Neo4j once authenticated.
    • Multi-factor Authentication (MFA). This is an added layer of security that requires a token for access in addition to a username and password. This prevents those who have access to Neo4j information from accessing privileged accounts.
    • Virtual Private Cloud (VPC). This allows AWS resources to be launched into a private network without being publicly accessible. It also requires a VPN client. This restricts access to authorized personnel with the correct VPN access.
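One way IAM and MFA can be combined is a policy that denies sensitive actions unless the caller authenticated with MFA. The sketch below only builds such a policy document as JSON; the statement ID and the choice of actions are illustrative assumptions, not something prescribed in the talk.

```python
import json


def deny_without_mfa(actions: list) -> dict:
    """Return an IAM policy document that denies `actions` unless MFA was used."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Illustrative statement ID, chosen for this example.
                "Sid": "DenySensitiveActionsWithoutMFA",
                "Effect": "Deny",
                "Action": actions,
                "Resource": "*",
                # BoolIfExists also catches requests where the MFA key is absent.
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


# Example: protect the instances that host Neo4j from being stopped or terminated.
policy = deny_without_mfa(["ec2:StopInstances", "ec2:TerminateInstances"])
print(json.dumps(policy, indent=2))
```

Attaching a document like this to the operations group means a stolen username and password alone are not enough to take the listed actions.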

How to Access Secure Information

Once all your information is secure, you need to ensure that the appropriate people can access the secured information. There are a few options:

    • OpenVPN can be used to authenticate a user for VPC access. It has a very low cost of entry, $9.60 per connection per year, which makes it affordable even for startups.
    • Direct Connect establishes a dedicated network connection from your premises — such as an office or data center — to your VPC in AWS. This is a great option for an enterprise with existing infrastructure to migrate data to the cloud because it allows the company to use AWS as an extension of the existing network.

The Language of Security: Part 2

The next set of terms relates to traffic control. Security groups control inbound and outbound traffic and operate at the instance level, with support for allow-only rules. Alongside them are two kinds of access control list:
    • Network Access Control Lists (ACLs). These control inbound and outbound traffic for one or more subnets, and they are where broad, sweeping port decisions are made for public vs. private access across entire subnets. Something to keep in mind: if you have outbound traffic that expects a response from a server, you need to configure your ACL so that the response can get back in.
    • S3 ACLs. These define the accounts and groups with access and the type of access to a bucket or an object. This provides more granular control and the option to segment groups or individuals.
For Neo4j, the network ACLs on the subnets are used to block all incoming traffic by default. S3 ACLs come into play when you store document references in the graph: you query the graph, use its connectedness to find the right reference, and then load the document from S3. You can combine all of that to ensure that only the correct application or server is requesting that document and sending it back out.
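The point about responses needing a way back in can be illustrated with a toy model of network ACL evaluation. This is a simplified sketch, not AWS code: rules are checked in ascending rule-number order, the first match decides, and anything unmatched is denied. The rule numbers, the ephemeral port range 1024-65535 and the sample reply port are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    number: int      # lower numbers are evaluated first
    port_from: int
    port_to: int
    allow: bool


def acl_allows(rules: list, port: int) -> bool:
    """Apply the first matching rule in ascending rule-number order; deny if none match."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


# The subnet may open HTTPS connections to the outside world.
outbound = [AclRule(100, 443, 443, allow=True)]

# The reply arrives on an ephemeral port picked by the client side.
reply_port = 49152

inbound_closed = []
inbound_with_return_path = [AclRule(100, 1024, 65535, allow=True)]

request_leaves = acl_allows(outbound, 443)
reply_blocked = not acl_allows(inbound_closed, reply_port)
reply_admitted = acl_allows(inbound_with_return_path, reply_port)
print(request_leaves, reply_blocked, reply_admitted)
```

With only the outbound rule in place, the request leaves the subnet but the reply is dropped on the way back; adding an inbound rule for the return port range is what completes the round trip.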

A Neo4j Example

Below is an inbound security group for Neo4j on the elastic load balancer. You use the two defaults, HTTP and HTTPS:


All the source IP addresses begin with 172.128, which is an internal range of IP addresses in this networking scheme; the CIDR block is a /16, so the first 16 bits are fixed.

7473 is HTTPS and 7474 is HTTP. This provides access from your internal servers to Neo4j, but not from the external world. Because it’s limited to traffic only from the IP range of servers within your network, this prevents any access from external sources.
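To make the rule concrete, here is a small sketch that models the inbound rules described above and checks whether a given source address and port would be let through. The rule dictionaries are shaped like the EC2 security group "IpPermissions" structure, the CIDR block assumes the 172.128 range and /16 prefix mentioned in the text, and `is_allowed` is a hypothetical helper, not an AWS call.

```python
import ipaddress

# Values taken from the text: the internal /16 range and the two Neo4j ports.
INTERNAL_CIDR = "172.128.0.0/16"
NEO4J_PORTS = (7473, 7474)  # HTTPS, HTTP

# One inbound rule per port, limited to the internal address range.
ingress_rules = [
    {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": INTERNAL_CIDR}],
    }
    for port in NEO4J_PORTS
]


def is_allowed(source_ip: str, port: int, rules=ingress_rules) -> bool:
    """Security groups are allow-only: traffic passes only if some rule matches."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        if not rule["FromPort"] <= port <= rule["ToPort"]:
            continue
        if any(address in ipaddress.ip_network(r["CidrIp"]) for r in rule["IpRanges"]):
            return True
    return False


print(is_allowed("172.128.4.20", 7473))  # internal server, HTTPS port
print(is_allowed("203.0.113.9", 7473))   # address outside the internal range
```

Because there is no rule for any other source range, a request from outside the network never matches and is dropped, which is the behavior the paragraph above describes.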

In this type of infrastructure configuration, a network address translation (NAT) instance controls all inbound and outbound traffic through an Internet gateway. This allows you to control inbound traffic via expected protocols. Generally you’d want ports 80 and 443 open to some API layer, which then proxies to Neo4j and any other application servers you may have inside your VPC. This provides granular control while still providing a way to run your full internal infrastructure and have really good communication, without exposing it to the outside world.
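The routing side of this setup can be sketched as two route tables, one for a public subnet and one for a private subnet, with the most specific matching route winning. This is a simplified model under stated assumptions: the VPC range reuses the 172.128 /16 block from the example above, and the target names are placeholders rather than real AWS identifiers.

```python
import ipaddress

VPC_CIDR = "172.128.0.0/16"  # internal range from the example above

# Hypothetical route tables; the target names are placeholders, not real IDs.
PUBLIC_ROUTES = {VPC_CIDR: "local", "0.0.0.0/0": "internet-gateway"}
PRIVATE_ROUTES = {VPC_CIDR: "local", "0.0.0.0/0": "nat-instance"}


def next_hop(routes: dict, destination: str) -> str:
    """Pick the most specific (longest-prefix) route that contains the destination."""
    address = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(cidr)
        for cidr in routes
        if address in ipaddress.ip_network(cidr)
    ]
    best = max(matches, key=lambda network: network.prefixlen)
    return routes[str(best)]


print(next_hop(PRIVATE_ROUTES, "172.128.3.7"))  # stays inside the VPC
print(next_hop(PRIVATE_ROUTES, "8.8.8.8"))      # leaves via the NAT instance
print(next_hop(PUBLIC_ROUTES, "8.8.8.8"))       # leaves via the Internet gateway
```

In this model the Neo4j servers sit in the private subnet, so their only path to the outside world runs through the NAT instance, while traffic between internal servers never leaves the VPC.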


Security at GraphGrid

We’ve set up all the infrastructure so that everything we deploy is inside a VPC, even if it’s across regions and availability zones. There is some fairly complex networking involved in tunneling between regions, keeping the resources internal and ensuring that all the pieces are isolated yet still able to communicate.

Consider the following example in which we have three different Neo4j instances in different availability zones that need to communicate with one another:

A VPC of Neo4j Instances in GraphGrid

We have to set up the private DNS and the EBS data volume, which can be encrypted if necessary, along with the S3 storage and the elastic load balancer endpoints for master, slave and available instances. Security groups manage the subnet access so that the instances can communicate and traffic is routed correctly.

GraphGrid provides the basic security architecture that I just reviewed so that you don’t have to build it from scratch.
