GraphGists

Summary

Geoptima is an application for collecting passive and active events on mobile devices running on cellular networks. The event logs can be used to analyze subscribers' experience of the mobile network and to help track down performance problems. For example, does the iPhone 4 perform better or worse than the iPhone 5 when accessing Facebook in a specific region of the network? For more information on the product that creates this data, refer to the description of Geoptima. Alternatively, watch my original videos on Vimeo at https://vimeo.com/17321571.

This GraphGist describes one way of collecting and managing data for a hypothetical mobile network called 'Operator X' in Sweden. My own phone, an HTC One, is the primary example device being shown, but most other information is anonymized.

Data Model

The data model can be shown using a Graphviz DOT diagram.

Figure 1. Our data model

The light nodes at the bottom represent the main information carriers, the JSON-encoded files of events. This model does not manage the events themselves, only the files, because the purpose is to make a per-file decision on where the data belongs: who should have access to it, and how to get to it. The rules governing this are managed in the green sub-graph, and the rest of the graph provides convenient structure for the kinds of queries we will make on the graph. This will make sense once we see the example queries below.

Building the Model with Cypher

We’ll build the model in stages, starting with the simplest part: the users and projects. This sub-graph can be built using the Cypher query below.
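The setup query itself is not reproduced in this text. As a minimal sketch, it could look something like the query below. Only 'Craig', 'Operator X', the User and Project labels and the su relationship appear in the queries in this gist. The Users and Projects index nodes, the user and project relationship types, and the two placeholder user names are assumptions made for illustration.

```cypher
// Hypothetical sketch, not the original setup query.
// Taken from the gist: 'Craig', 'Operator X', :User, :Project, :su.
// Assumed: :Users/:Projects index nodes, :user/:project relationships,
// and the placeholder names 'User Two' and 'User Three'.
CREATE
(users:Users {name:"Users"}),
(projects:Projects {name:"Projects"}),
(craig:User {name:"Craig"}),
(u2:User {name:"User Two"}),
(u3:User {name:"User Three"}),
(opx:Project {name:"Operator X"}),
(users)-[:user]->(craig),
(users)-[:user]->(u2),
(users)-[:user]->(u3),
(projects)-[:project]->(opx),
(craig)-[:su]->(opx),
(u2)-[:su]->(opx)
```

In this sketch only two of the three users get an su relationship to the project, which is what the access query further down relies on.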

Now we have a simple graph with one project and three users, and for convenience, nodes for traversing to projects and users. We have also assigned users to projects, implying access rights. In the real application the access rights are handled in a far more complex way, but for this example we keep this simple because we want to focus on the event log data management.

The visualization has conveniently colored the users blue and the projects red, based on the fact that we used the Neo4j 2.0 standard of assigning labels to the nodes. Before building the more complex graph, let’s perform a couple of simple queries.

Querying the Graph

We will show a few useful queries on this graph, like:

  • Who has access to the project?

  • What rules does this project use to decide on data ownership?

  • How many devices does user 'Craig' collect data for?

  • How much data has 'Craig' collected?

  • For how many days has Craig been collecting data?

Only the first query above is possible on the simple graph so far, so let’s try that before building more.

The Project

Let’s do a basic query. Who has access to 'Operator X' data?

MATCH (u)-[:su]->(p)
WHERE p.name = 'Operator X'
RETURN u.name AS `Users with access to Operator X data`

Two of the three defined users have access to 'Operator X' data.

Project Allocation Rules

Now the purpose of the project is to collect data. We need to define rules for which data to collect. Let’s start by adding two sets of rules, one for devices by their internal device identity number, and another based on the mobile network the device is actually running on. In a real network both types of rules are commonly used.

MATCH (project:Project)
WHERE project.name = 'Operator X'
CREATE
(filter_plmn:Filter {name:"Filter PLMN"}),
(filter_devices:Filter {name:"Filter Devices"}),
(f1:FilterPLMN {name:"Operator X", mcc:'240', mnc:'08'}),
(f2:FilterPLMN {name:"My Operator", mcc:'240', mnc:'18'}),
(f3:FilterPLMN {name:"XTele 2", mcc:'240', mnc:'28'}),
(fd:FilterDevices {name:"Test Devices", devices:[
  '354436058915420','358506046830281','356451041578183','351503053121388','353328059211902'
]}),
(project)-[:filter]->(filter_plmn),
(project)-[:filter]->(filter_devices),
(filter_plmn)-[:filter]->(f1),
(filter_plmn)-[:filter]->(f2),
(filter_plmn)-[:filter]->(f3),
(filter_devices)-[:filter]->(fd)

Now we can ask the questions:

  • How many operators are selected?

  • How many specific test devices are also included?

Operators Selected

How many operators will be selected during allocation of data to 'Operator X'?

MATCH (p:Project)-[:filter*]->(f:FilterPLMN)
WHERE p.name = 'Operator X'
RETURN f.name AS Name, f.mcc AS mcc, f.mnc AS mnc

If a device collects data while served by any one of the above three network operators, its data will be associated with the project 'Operator X'.
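To see the rule from the other direction, we can ask which project a given network code would be allocated to. A small sketch, reusing one of the mcc and mnc pairs created above:

```cypher
// Which project(s) select data from the network with MCC 240 and MNC 08?
MATCH (p:Project)-[:filter*]->(f:FilterPLMN)
WHERE f.mcc = '240' AND f.mnc = '08'
RETURN p.name AS Project, f.name AS Operator
```

With the filters defined above, this should match the 'Operator X' project. A network code that no filter lists would return no rows, meaning the data is not allocated to any project by the PLMN rules.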

Test Devices Selected

MATCH (p:Project)-[:filter*]->(f:FilterDevices)
WHERE p.name = 'Operator X'
RETURN f.name AS Name, f.devices AS Devices

If any one of the five devices listed above collects data, it will be allocated to the project 'Operator X'.
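The same kind of reverse lookup works for the device rules. As a sketch, the query below checks which project a single device identity belongs to by testing membership in the devices list of each FilterDevices node:

```cypher
// Which project(s) explicitly list this device identity?
MATCH (p:Project)-[:filter*]->(f:FilterDevices)
WHERE '354436058915420' IN f.devices
RETURN p.name AS Project, f.name AS Filter
```

The identity used here is the first entry of the 'Test Devices' list created earlier, so the query should find the 'Operator X' project.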

Device Management

So far we’ve looked only at the model used to decide what data should be collected. Now let’s look at the actual data collected. We’ll model sample data for one of the devices listed in the filters above, my own phone, an HTC One device with identity defined by the number '354436058915420'.
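The query that creates the device data is not reproduced in this text. The sketch below builds only the single path that the queries in the next section traverse. The relationship types and the imei, imsi, date and events properties are taken from those queries. The Device, DeviceSIM, Files, EventDate and EventFile labels, the name properties and all placeholder values are assumptions for illustration, and the real graph contains more files and dates than this.

```cypher
// Hypothetical sketch of one user -> device -> SIM -> files -> date -> file path.
// Labels, name properties and placeholder values are assumptions, not the original data.
MATCH (craig:User)
WHERE craig.name = 'Craig'
CREATE
(device:Device {name:"HTC One", imei:'354436058915420'}),
(ds:DeviceSIM {imei:'354436058915420', imsi:'<imsi placeholder>'}),
(files:Files {name:"Files"}),
(day:EventDate {date:'<date placeholder>'}),
(json:EventFile {name:"<file name placeholder>", events:0}),
(craig)-[:USED_DEVICE]->(device),
(device)-[:ASSOC]->(ds),
(ds)-[:files]->(files),
(files)-[:DATE]->(day),
(day)-[:JSON]->(json)
```

The events property holds the number of events in a file, which is what the aggregate queries later on add up. It is set to 0 here only as a placeholder.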

Now the graph starts to look quite complex. This is in fact a complete version of the Graphviz diagram at the top of the page. We have less control over layout here than with Graphviz, so it is harder to make sense of, but now we can query it with Cypher.

Data Collected

Let’s try two queries on this graph:

  • How many events has Craig collected?

  • For how many days has Craig been collecting data?

MATCH (u:User)-[:USED_DEVICE]->(d)-[:ASSOC]->(ds)-[:files]->(f)-[:DATE]->(dd)
WHERE u.name = 'Craig'
RETURN u.name AS Name, ds.imei AS imei, ds.imsi AS imsi, dd.date AS Date

The above query answers the second question. We traverse the graph from the user, through the devices used by that user and the device-SIM card associations, to the files and the days the files contain events for. However, if all we want is the number of days, we do not need to return the entire table. Instead, we can use the count() function:

MATCH (u:User)-[:USED_DEVICE]->(d)-[:ASSOC]->(ds)-[:files]->(f)-[:DATE]->(dd)
WHERE u.name = 'Craig'
RETURN count(dd.date) AS `# Days`

Now we can see that we have 9 days of data collected.
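One of the questions listed earlier, how many devices Craig collects data for, can be answered on the same pattern. The sketch below also uses DISTINCT, so that a date reached through more than one device or SIM association is counted only once. Whether that situation can occur depends on how the date nodes are shared in the graph, which is an assumption here.

```cypher
// Count each device and each calendar date once, however many paths reach them.
MATCH (u:User)-[:USED_DEVICE]->(d)-[:ASSOC]->(ds)-[:files]->(f)-[:DATE]->(dd)
WHERE u.name = 'Craig'
RETURN count(DISTINCT d) AS `# Devices`, count(DISTINCT dd.date) AS `# Distinct Days`
```

Note that this only counts devices that have at least one file with a date, because the whole pattern has to match.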

Since we now know how to use functions like count(), let’s try another function, sum(), for adding the event properties of all event files together:

MATCH (u:User)-[:USED_DEVICE]->(d)-[:ASSOC]->(ds)-[:files]->(f)-[:DATE]->(dd)-[:JSON]->(json)
WHERE u.name = 'Craig'
RETURN count(json.events) AS `# Files`, sum(json.events) AS `Total Events`,
  sum(json.events)/count(json.events) AS `Avg Events/File`,
  min(json.events) AS `Min Events/File`, max(json.events) AS `Max Events/File`

So we can clearly see that we collected 1005 events in three files with an average of 335 events per file.
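Dividing sum() by count() works here because 1005 divides evenly by 3, but Cypher performs integer division when both operands are integers, so in general the result would be truncated. A sketch of the same average using the built-in avg() aggregate, which returns a floating point value:

```cypher
// avg() avoids the integer division of sum(...)/count(...)
MATCH (u:User)-[:USED_DEVICE]->(d)-[:ASSOC]->(ds)-[:files]->(f)-[:DATE]->(dd)-[:JSON]->(json)
WHERE u.name = 'Craig'
RETURN avg(json.events) AS `Avg Events/File`
```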

Summary

The above example was produced as part of some internal documentation while brainstorming on possible data models for an upgrade of one of the data collection components of the Geoptima data collection system by AmanziTel. It is not an exact model of the actual data collection system in use, but it does represent some of the decision logic of the real system. Using Neo4j as the database has helped with both the data modeling side of product management and the development of the actual products.