Uncovering Open Source Community Stories with Neo4j [Community Post]


Learn How Ed Finkler at Graph Story Used Neo4j to Uncover Open Source Community Stories and Trends

[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

Every dataset has a story to tell — we just need the right tools to find it.

At Graph Story, we believe that graph databases are one of the best tools for finding the story in your data. Because we are also active members of several open source communities, we wanted to find interesting stories about those communities. So, we decided to look at package ecosystems used by developers.

The first one we tackled was Packagist, the community package repository for PHP. Nearly 20,000 maintainers have submitted over 60,000 packages to Packagist, which gives us a lot of interesting data to investigate.

How We Used Neo4j to Graph the Packagist Data


Collecting this data and getting it into Neo4j was relatively straightforward.

One HTTP endpoint on the Packagist site returns a JSON array of all the package names. We iterated over that list and called a second endpoint for each name to retrieve a JSON object describing the package. That object includes base package data plus information on each version of the package, including which packages a given version requires.
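The collection loop described above might look roughly like the following Python sketch. It is not the code we ran: the two URLs, the `packageNames` and `package` wrapper keys, and the helper names `http_get_json` and `iter_packages` are assumptions made for illustration, so check them against the Packagist API documentation before relying on them.

```python
import json
from typing import Any, Callable, Iterator, Tuple
from urllib.request import urlopen

# Assumed endpoints, not confirmed by this post.
LIST_URL = "https://packagist.org/packages/list.json"
PACKAGE_URL = "https://packagist.org/packages/{name}.json"


def http_get_json(url: str) -> Any:
    """Fetch a URL and decode its body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_packages(
    fetch: Callable[[str], Any] = http_get_json,
) -> Iterator[Tuple[str, dict]]:
    """Yield (name, package_data) for every package in the listing."""
    listing = fetch(LIST_URL)
    # Assumption: names sit under "packageNames"; a bare array is accepted too.
    names = listing["packageNames"] if isinstance(listing, dict) else listing
    for name in names:
        payload = fetch(PACKAGE_URL.format(name=name))
        # Assumption: the package document is wrapped in a "package" key.
        yield name, payload.get("package", payload)
```

Passing `fetch` as a parameter keeps the loop easy to try against canned responses before pointing it at the live site, where some rate limiting between calls would be sensible.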

The data model for our initial version was simple. We have three node labels:

    • Package
    • Maintainer
    • Version

and five relationship types:

    • HAS_VERSION
      (Package)-[:HAS_VERSION]->(Version)
    • MAINTAINED_BY
      (Package)-[:MAINTAINED_BY]->(Maintainer)
    • REQUIRES
      (Version)-[:REQUIRES]->(Package)
    • REQUIRES_DEV
      (Version)-[:REQUIRES_DEV]->(Package)
    • SUGGESTS
      (Version)-[:SUGGESTS]->(Package)
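To make the model concrete, here is a small Python sketch that flattens one package document into edges shaped like the relationships above. The JSON keys (`name`, `maintainers`, `versions`, `require`, `require-dev`, `suggest`) are assumed from the general shape of Packagist and Composer metadata, and the `name@version` key format and the `package_to_edges` helper are hypothetical choices for this example, not a description of our importer.

```python
from typing import Dict, List, Tuple

# (start label, start key, relationship type, end label, end key)
Edge = Tuple[str, str, str, str, str]

# Assumed mapping from keys in a version's JSON to relationship types.
DEPENDENCY_KEYS: Dict[str, str] = {
    "require": "REQUIRES",
    "require-dev": "REQUIRES_DEV",
    "suggest": "SUGGESTS",
}


def package_to_edges(package: dict) -> List[Edge]:
    """Flatten one package document into edges matching the model above."""
    name = package["name"]
    edges: List[Edge] = []
    for maintainer in package.get("maintainers", []):
        edges.append(
            ("Package", name, "MAINTAINED_BY", "Maintainer", maintainer["name"])
        )
    for version, details in package.get("versions", {}).items():
        # Hypothetical key format so versions of different packages never collide.
        version_key = f"{name}@{version}"
        edges.append(("Package", name, "HAS_VERSION", "Version", version_key))
        for json_key, rel_type in DEPENDENCY_KEYS.items():
            # A missing or empty section contributes no edges.
            for target in details.get(json_key) or {}:
                edges.append(("Version", version_key, rel_type, "Package", target))
    return edges
```

Note that Composer `require` sections can also name platform requirements such as `php`; this sketch does not filter them, so they would show up as Package nodes unless handled separately.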

This certainly isn't a complete schema to represent everything within the Packagist ecosystem, but it let us do some interesting analyses:

    1. Which packages are required the most by other packages?
    2. Which maintainers have the most packages?
    3. Which maintainers' packages are required the most?
    4. Which maintainers work together the most (packages can have multiple maintainers)?
    5. What are the shortest paths between two given packages, or two given maintainers?
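To show what the first and fourth questions compute, here is a plain Python sketch over small in-memory collections. It is an illustration only: in the project these were Cypher queries against Neo4j, and the function names, input shapes, and the decision to count each requiring package once (however many of its versions declare the requirement) are assumptions of this example.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def most_required(
    requires: Iterable[Tuple[str, str]], limit: int = 10
) -> List[Tuple[str, int]]:
    """Rank packages by how many distinct packages require them.

    `requires` holds (requiring_package, required_package) pairs.
    """
    # Deduplicate so that many versions of one package count once,
    # and ignore any package that lists itself.
    unique_pairs = {(src, dst) for src, dst in requires if src != dst}
    counts = Counter(dst for _, dst in unique_pairs)
    return counts.most_common(limit)


def top_collaborators(
    maintainers_by_package: Dict[str, Set[str]], limit: int = 10
) -> List[Tuple[Tuple[str, str], int]]:
    """Rank maintainer pairs by the number of packages they co-maintain."""
    pair_counts: Counter = Counter()
    for maintainers in maintainers_by_package.values():
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(maintainers), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(limit)
```

In a graph database the same answers come from pattern matching and aggregation rather than hand-built counters, which is a large part of the appeal.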

Our Findings


You can see our results so far at packagist.graphstory.com.

Some of what we found was expected: certain well-known open source component libraries get required the most, like doctrine/orm and illuminate/support.

It gets more interesting when examining maintainers, though. Some are high profile folks in the PHP community, like fabpot and taylorotwell, but some are people with whom we weren't as familiar. It certainly made us re-examine what we thought we knew about the PHP community – it's not always folks who are speaking at conferences that are making big contributions.

The shortest path analyses were interesting as well. There were a few packages that showed up in these paths over and over to tie together maintainers and packages, such as psr/log. "Keystone packages" might be a good term for these, because they seem to join and support the PHP open source community again and again.
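For readers who have not worked with shortest paths before, the idea can be sketched in plain Python as a breadth-first search over an undirected adjacency map. This is only an illustration of the concept: the `shortest_path` helper and the shape of `graph` are assumptions of this example, and it does not reflect how Neo4j computes paths internally.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def shortest_path(
    graph: Dict[str, Set[str]], start: str, goal: str
) -> Optional[List[str]]:
    """Breadth-first search for one shortest path in an undirected graph.

    `graph` maps each node to its neighbors and must list every
    connection in both directions.
    """
    if start == goal:
        return [start]
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue  # Already reached by a path at least as short.
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor chain back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None  # The two nodes are not connected.
```

Running this for many pairs of maintainers and tallying the intermediate nodes is one way to see which packages keep turning up as connectors.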

A Cypher Example: Finding Top Maintainers by Packages


Here's one example Cypher query we ran to find the top Packagist maintainers by package count:

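// Count the packages each maintainer is attached to.
MATCH (m1:Maintainer)<-[:MAINTAINED_BY]-(:Package)
WITH m1, count(*) AS count
// Skip maintainers with only a single package.
WHERE count > 1
RETURN m1.name AS name, count
ORDER BY count DESC
// $limit is a query parameter; Neo4j 2.x wrote it as { limit }.
LIMIT $limit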

See the results of this query and others on packagist.graphstory.com.

Why We Used a Graph Database


Much of what we've done would be possible with an RDBMS or a document database, so why do it in a graph database – specifically Neo4j?

We found three major upsides while working on this project:

    1. It is much easier to map out data and relationships. Making relationships work in an RDBMS is harder, even in simple cases, and they are significantly more difficult to change down the road. And unlike popular document databases, Neo4j handles relationships in the database itself, so we don't have to maintain them with application logic.
    2. Discovering how people and packages are connected is much easier and faster than with RDBMSes and popular document databases. Cypher and the graph model make it easy to get the data we want without complex SQL joins or a wrapper script in another language.
    3. Trying new queries to explore the data is so convenient with Neo4j’s web interface. It's quick and easy to prototype and profile from there, and then copy and paste the Cypher into your app.

We're obviously big believers in graph databases at Graph Story, but this is a fun project that highlights a lot of the advantages of Neo4j. We found a number of interesting stories in Packagist, and there are certainly more to uncover.

For more from Ed Finkler, follow him on Twitter.

