Analyzing Software Dependencies With deps.dev – Discover AuraDB Free (Week 49)

Head of Product Innovation & Developer Strategy, Neo4j

May 25, 2023


This week we looked at software dependencies, an important software analytics use case for graph databases. With a dependency graph you can see not only which libraries your software uses directly and indirectly, but also how software vulnerabilities affect you.

If you missed it – the call for papers for our online developer conference NODES 2023 is open till June 30th, but if you submit early you might be selected as a featured speaker.

Two years ago, Google launched https://deps.dev, an open source package dependency database that makes package information from these systems available:

npm (JavaScript)
PyPI (Python)
Maven (Java / JVM)
Cargo (Rust)
NuGet (.NET)
Go

It even talks about dependency graphs in its “How it works” section.

The service repeatedly examines sites such as github.com, npmjs.com, and pkg.go.dev to find up-to-date information about open source software packages. Using that information it builds for each package the full dependency graph from scratch—not just from package lock files—connecting it to the packages it depends on and to those that depend on it. And then does it all again to keep the information fresh. This transitive dependency graph allows problems in any package to be made visible to the owners and users of any software they affect.

If you would rather watch the recording of the livestream, you can find it here:

Back then I threw together a quick script to load the data via their unofficial REST API that powered the site.

And tweeted about it:

Neat, REST API, let’s do @Neo4j

call apoc.load.json(“https://t.co/2CEy0rS9Ro“) yield value as v
merge (p:Package {name:v.package .name, version:v.version})
with * unwind v.dependencies as d
merge (o:Package {name:d.package .name, version:d.version})
merge (p)-[:DEPENDS_ON]->(o) https://t.co/XHdMJxCT6V pic.twitter.com/r4XncqPTXi

— Michael Hunger 🇪🇺 🇺🇦 @mesirii@chaos.social (@mesirii) June 4, 2021


Since then, they have published an API that we can use to access the data. The API docs are minimal, but good enough for our purposes.

The basic API call for a package is straightforward but doesn’t return much data. The per-version information is more interesting: it also lists licenses, security vulnerabilities, and links (homepage, repo, issue tracker).

Here is the example for React (no security vulnerabilities):

https://api.deps.dev/v3alpha/systems/npm/packages/react/versions/18.2.0

{
    "versionKey": {
        "system": "NPM",
        "name": "react",
        "version": "18.2.0"
    },
    "isDefault": true,
    "licenses": [
        "MIT"
    ],
    "advisoryKeys": [],
    "links": [
        {
            "label": "HOMEPAGE",
            "url": "https://reactjs.org/"
        },
        {
            "label": "ISSUE_TRACKER",
            "url": "https://github.com/facebook/react/issues"
        },
        {
            "label": "ORIGIN",
            "url": "https://registry.npmjs.org/react/18.2.0"
        },
        {
            "label": "SOURCE_REPO",
            "url": "git+https://github.com/facebook/react.git"
        }
    ]
}
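To make the per-version response a little more concrete, here is a minimal Python sketch that pulls the licenses, advisory keys, and links out of it. The payload is an abbreviated copy of the React response above. The helper name `summarize_version` and the `id` string format are made up for this example and are not part of the deps.dev API.

```python
import json

# Abbreviated copy of the React 18.2.0 response shown above.
RESPONSE = """
{
  "versionKey": {"system": "NPM", "name": "react", "version": "18.2.0"},
  "isDefault": true,
  "licenses": ["MIT"],
  "advisoryKeys": [],
  "links": [
    {"label": "HOMEPAGE", "url": "https://reactjs.org/"},
    {"label": "SOURCE_REPO", "url": "git+https://github.com/facebook/react.git"}
  ]
}
"""


def summarize_version(payload: dict) -> dict:
    """Flatten the fields we care about (hypothetical helper, not part of the API)."""
    key = payload["versionKey"]
    return {
        # The "system:name@version" id format is an assumption made for this sketch.
        "id": f'{key["system"].lower()}:{key["name"]}@{key["version"]}',
        "licenses": payload.get("licenses", []),
        # An empty advisoryKeys list means no known advisories for this version.
        "has_advisories": bool(payload.get("advisoryKeys")),
        "links": {link["label"]: link["url"] for link in payload.get("links", [])},
    }


summary = summarize_version(json.loads(RESPONSE))
print(summary["id"], summary["licenses"], summary["has_advisories"])
```

In a graph model, the licenses and advisory keys would be natural candidates for properties or separate nodes attached to a version.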

But we’re more interested in the graph, so let’s go directly for the package dependencies.

Dependencies of a Package

You can find the dependencies of a package (like TensorFlow) in the UI.

The API docs are here.

To load the data for the TensorFlow package via the API, we put the system, name, and version of the package into the URL:

https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies
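The URL pattern can be sketched as a small Python helper. The function name `dependencies_url` is hypothetical, and the percent-encoding step is an assumption on our side for names that contain special characters (for example npm scoped packages); the plain names used in this post pass through unchanged.

```python
from urllib.parse import quote

BASE = "https://api.deps.dev/v3alpha"


def dependencies_url(system: str, name: str, version: str) -> str:
    """Build the :dependencies URL for one package version (hypothetical helper)."""
    # safe="" also encodes "/" and "@" (assumption: the API expects encoded names).
    return (
        f"{BASE}/systems/{quote(system, safe='')}"
        f"/packages/{quote(name, safe='')}"
        f"/versions/{quote(version, safe='')}:dependencies"
    )


url = dependencies_url("pypi", "tensorflow", "2.12.0")
print(url)
```

For the TensorFlow example this produces exactly the URL shown above.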

The API responds with JSON that is already in a graph format:

{
    "nodes": [
        {
            "versionKey": {
                "system": "PYPI",
                "name": "tensorflow",
                "version": "2.12.0"
            },
            "bundled": false,
            "relation": "SELF",
            "errors": []
        },
        {
            "versionKey": {
                "system": "PYPI",
                "name": "absl-py",
                "version": "1.4.0"
            },
            "bundled": false,
            "relation": "DIRECT",
            "errors": []
        }, ...
    ],
    "edges": [
        {
            "fromNode": 0,
            "toNode": 1,
            "requirement": ">=1.0.0"
        },
        {
            "fromNode": 0,
            "toNode": 2,
            "requirement": ">=1.6.0"
        },
        {
            "fromNode": 0,
            "toNode": 6,
            "requirement": ">=2.0"
        }, ...
    ]
}

The response is already graph-shaped: first a list of nodes, then a list of edges. Each edge has a fromNode and a toNode (both are indexes into the nodes array) and the version requirement.

To load the data from the API, we use apoc.load.json, which returns the response as a nested Cypher structure of maps and lists.
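The index-based lookup is easiest to see outside the database. The following Python sketch resolves each edge to a pair of package names by indexing into the nodes list. The first two nodes and the first edge are taken from the response above; the third node ("example-lib") and the second edge are invented placeholders, and `resolve_edges` is a hypothetical helper name.

```python
# Sample in the shape of the API response. Nodes 0 and 1 and the first edge
# come from the article; "example-lib" and the second edge are made up.
response = {
    "nodes": [
        {"versionKey": {"system": "PYPI", "name": "tensorflow", "version": "2.12.0"}, "relation": "SELF"},
        {"versionKey": {"system": "PYPI", "name": "absl-py", "version": "1.4.0"}, "relation": "DIRECT"},
        {"versionKey": {"system": "PYPI", "name": "example-lib", "version": "0.0.1"}, "relation": "INDIRECT"},
    ],
    "edges": [
        {"fromNode": 0, "toNode": 1, "requirement": ">=1.0.0"},
        {"fromNode": 1, "toNode": 2, "requirement": ">=0.0.1"},
    ],
}


def resolve_edges(response: dict) -> list[tuple[str, str, str]]:
    """Turn index-based edges into (from_name, to_name, requirement) triples."""
    # Position in this list is the index that fromNode/toNode refer to.
    names = [node["versionKey"]["name"] for node in response["nodes"]]
    return [
        (names[edge["fromNode"]], names[edge["toNode"]], edge["requirement"])
        for edge in response["edges"]
    ]


triples = resolve_edges(response)
for source, target, requirement in triples:
    print(f"{source} -[DEPENDS_ON {requirement}]-> {target}")
```

The important detail is that the node order must be preserved, because the edges only make sense relative to the positions in the nodes array.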

call apoc.load.json("https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies")
yield value as r

We can now import the data by creating the nodes first and then collecting them into a list again, which provides the index lookup for the edges. We encode the “system”, here “pypi”, as an additional label :PyPi on our :Package nodes. That label also carries the uniqueness constraint on name:

create constraint package_pypi if not exists for (p:PyPi) require (p.name) is unique

In a real system, we would create separate version nodes for each package and link to those. Here, for simplicity, we stick with the :Package nodes only.

We then iterate over the nodes with UNWIND inside a CALL subquery to create the nodes, and use a second subquery for the relationships.

with "pypi" as system, "tensorflow" as name, "2.12.0" as version

call apoc.load.json("https://api.deps.dev/v3alpha/systems/"+system+"/packages/"
                    +name+"/versions/"+version+":dependencies")
yield value as r
// create nodes
call { with r
        unwind r.nodes as package
        merge (p:Package:PyPi {name:package.versionKey.name}) on create set p.version = package.versionKey.version
        return collect(p) as packages
}
// create relationships by linking nodes
call { with r, packages
        unwind r.edges as edge
        with packages[edge.fromNode] as from, packages[edge.toNode] as to, edge
        merge (from)-[rel:DEPENDS_ON]->(to) ON CREATE SET rel.requirement = edge.requirement
        return count(*) as numRels
}
return size(packages) as numPackages, numRels

Now we can visualize the data in the Query UI by running:
MATCH path=(:PyPi {name:"tensorflow"})-[:DEPENDS_ON*]->() RETURN path

Or we can head over to “Explore” and visualize it in the hierarchical layout and also find the shortest paths between packages visually.

Explore dependencies with the hierarchical layout

We can also use the packages that we already have imported into our graph to fetch their dependencies.

To achieve that we replace the hardcoded initial data for package and version with data from the graph. We also set an additional property (or label) to indicate which packages have already been loaded.

// pick packages whose dependencies have not been fetched yet and mark them
match (root:Package:PyPi) where root.imported is null
set root.imported = true
with "pypi" as system, root.name as name, root.version as version
call apoc.load.json("https://api.deps.dev/v3alpha/systems/"+system+"/packages/"
                    +name+"/versions/"+version+":dependencies")
yield value as r
// create nodes
call { with r
        unwind r.nodes as package
        merge (p:Package:PyPi {name:package.versionKey.name}) on create set p.version = package.versionKey.version
        return collect(p) as packages
}
// create relationships by linking nodes
call { with r, packages
        unwind r.edges as edge
        with packages[edge.fromNode] as from, packages[edge.toNode] as to, edge
        merge (from)-[rel:DEPENDS_ON]->(to) ON CREATE SET rel.requirement = edge.requirement
        return count(*) as numRels
}
return size(packages) as numPackages, numRels
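The "mark as imported, then fetch" pattern is a plain worklist traversal. Here is a minimal in-memory Python sketch of the idea. `FAKE_API` is an invented stand-in for the real service, and `expand` and `fetch` are hypothetical names. Unlike the Cypher query, which processes one round of unimported packages per run, this sketch loops until nothing is left.

```python
from collections import deque

# Hypothetical stand-in for the API: package name -> direct dependency names.
FAKE_API = {
    "app": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": ["lib-c"],
    "lib-c": [],
}

calls = []  # records every fetch so we can check that each package is fetched once


def fetch(name: str) -> list[str]:
    calls.append(name)
    return FAKE_API.get(name, [])


def expand(root: str, fetch) -> tuple[set[str], set[tuple[str, str]]]:
    """Fetch dependencies for every reachable package exactly once."""
    imported: set[str] = set()  # plays the role of the `imported` property
    edges: set[tuple[str, str]] = set()
    queue = deque([root])
    while queue:
        name = queue.popleft()
        if name in imported:
            continue  # already loaded, skip the API call
        imported.add(name)
        for dep in fetch(name):
            edges.add((name, dep))  # set semantics mirror MERGE: no duplicates
            if dep not in imported:
                queue.append(dep)
    return imported, edges


imported, edges = expand("app", fetch)
print(len(imported), "packages,", len(edges), "relationships")
```

Without the marker, shared dependencies such as `lib-c` would be fetched once per path that reaches them.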

Loading Dependents

The UI also shows dependents (i.e., packages that use the current package), which we could also infer from our imported data by following the relationships in the reverse direction. Unfortunately, there is no API call for this, so we have to use the REST call that the UI itself makes:

https://deps.dev/_/s/pypi/p/tensorflow/v/2.12.0/dependents

It has a different response format and only lists 100 results, but that’s better than nothing for demonstration purposes. We can pick the directSample list of entries and connect them to our root package that we start with.

with "pypi" as system, "tensorflow" as name, "2.12.0" as version
// use the same labels as in the dependency import so both queries share nodes
merge (root:Package:PyPi {name:name}) on create set root.version = version
with *
call apoc.load.json("https://deps.dev/_/s/"+system+"/p/"+name+"/v/"+version+"/dependents")
yield value as r
// each sampled dependent points at the root package
unwind r.directSample as entry
merge (dep:PyPi:Package {name:entry.package.name})
on create set dep.version = entry.version
merge (dep)-[:DEPENDS_ON]->(root)
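Note the direction: for dependents, the relationship points from the sampled package to the root, the opposite of the dependency import. A small Python sketch of that mapping follows. Only the field names (`directSample`, `package.name`, `version`) are taken from the query above; the entries themselves are invented, and `dependent_edges` is a hypothetical helper.

```python
# Hypothetical payload: the field names follow the Cypher query above,
# the two entries are placeholders and not real API output.
response = {
    "directSample": [
        {"package": {"name": "example-app"}, "version": "0.1.0"},
        {"package": {"name": "example-tool"}, "version": "2.0.0"},
    ]
}


def dependent_edges(root: str, response: dict) -> list[tuple[str, str]]:
    """Return (dependent, root) pairs: the dependent DEPENDS_ON the root."""
    return [
        (entry["package"]["name"], root)
        for entry in response.get("directSample", [])
    ]


pairs = dependent_edges("tensorflow", response)
for dependent, root in pairs:
    print(f"{dependent} -[DEPENDS_ON]-> {root}")
```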

Question from the viewers — Eshwar: How do I fix relationships that I imported wrongly?

Answer: