apoc.load.html

Procedure APOC Full

apoc.load.html('url', {name: jquery, name2: jquery}, config) YIELD value - Loads an HTML page and returns the result as a map, with one key per jQuery-style selector given in the second parameter

Signature

apoc.load.html(url :: STRING?, query = {} :: MAP?, config = {} :: MAP?) :: (value :: MAP?)

Input parameters

Name | Type | Default
url | STRING? | null
query | MAP? | {}
config | MAP? | {}

Config parameters

The procedure supports the following config parameters:

Table 1. Config parameters
name | type | default | description
browser | Enum [NONE, CHROME, FIREFOX] | NONE | If set to "CHROME" or "FIREFOX", Selenium WebDriver is used to read content generated dynamically by JavaScript. With "NONE" (the default), dynamic content cannot be read. Note that to use the Chrome or Firefox driver, the browser must be installed on your machine and additional jars must be downloaded into the plugins folder. See Dependencies below.
wait | long | 0 | If greater than 0, the procedure waits until it finds at least one element for each selector in the query parameter, up to the given number of seconds; after that it continues execution. Useful for elements that are rendered after the page is loaded (e.g. by slow asynchronous calls).
charset | String | "UTF-8" | The character set of the page being scraped, used if the http-equiv meta tag is not set.
headless | boolean | true | Only valid when browser is not NONE. Runs the browser in headless mode, that is, without opening the browser UI (recommended).
acceptInsecureCerts | boolean | true | If true, allows reading HTML from sites with insecure certificates.
baseUri | String | "" | Base URI used to resolve relative paths.
failSilently | Enum [FALSE, WITH_LOG, WITH_LIST] | FALSE | Controls what happens if parsing fails for one or more elements: FALSE throws a RuntimeException, WITH_LOG writes a log.warn for each invalid element, and WITH_LIST adds an errorList key to the result containing the failed tags.
htmlString | boolean | false | If true, the 1st parameter is treated as an HTML string instead of a URL.
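
The htmlString option can be tried without any file or web server by passing the markup directly as the first argument. The following is a minimal sketch, assuming an APOC version that supports htmlString; the inline HTML, the greeting key and the p.greeting selector are made up for illustration:

```cypher
// Parse an inline HTML string instead of fetching a URL.
// The markup and the "greeting" key are hypothetical examples.
CALL apoc.load.html(
  "<html><body><p class='greeting'>Hello</p></body></html>",
  {greeting: "p.greeting"},
  {htmlString: true}
) YIELD value
RETURN value.greeting AS greeting;
```

Passing htmlString explicitly keeps the intent clear regardless of the default in the installed version.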

Output parameters

Name | Type
value | MAP?

Usage Examples

We can extract the meta tags and h2 headings from the Wikipedia home page by running the following query:

CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"});
Table 2. Results
Output
{
   "metadata":[
      {
         "tagName":"meta",
         "attributes":{
            "charset":"UTF-8"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"ResourceLoaderDynamicStyles"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"generator",
            "content":"MediaWiki 1.36.0-wmf.16"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"referrer",
            "content":"origin"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"referrer",
            "content":"origin-when-crossorigin"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"referrer",
            "content":"origin-when-cross-origin"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "property":"og:image",
            "content":"https://upload.wikimedia.org/wikipedia/commons/1/1c/Orion_pulse_unit_%28transparent%29.png"
         }
      }
   ],
   "h2":[
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-tfa-h2"
         },
         "text":"From today's featured article",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-dyk-h2"
         },
         "text":"Did you know ...",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-itn-h2"
         },
         "text":"In the news",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-otd-h2"
         },
         "text":"On this day",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-tfl-h2"
         },
         "text":"From today's featured list",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-tfp-h2"
         },
         "text":"Today's featured picture",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-other"
         },
         "text":"Other areas of Wikipedia",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-sister"
         },
         "text":"Wikipedia's sister projects",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-lang"
         },
         "text":"Wikipedia languages",
         "tagName":"h2"
      },
      {
         "tagName":"h2",
         "text":"Navigation menu"
      }
   ]
}
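
Because each key in the result map holds a list of element maps, the result can be flattened into one row per element with UNWIND. This is a sketch based on the output structure shown above; the column aliases are arbitrary:

```cypher
// Turn the list under the "h2" key into one row per heading.
CALL apoc.load.html("https://en.wikipedia.org/", {h2: "h2"}) YIELD value
UNWIND value.h2 AS heading
RETURN heading.tagName AS tag,
       heading.text AS text,
       // null for headings without an "attributes" entry
       heading.attributes.id AS id;
```

Accessing a property on a missing (null) map returns null in Cypher, so headings without attributes, such as "Navigation menu" above, still produce a row.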

Let’s suppose we have a test.html file like this:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
  <h6 i d="error">test</h6>
  <h6 id="correct">test</h6>
</html>

We can handle the parse error caused by i d through the failSilently configuration. With the default (FALSE), the following query throws an exception:

CALL apoc.load.html("test.html",{h6:"h6"});
Table 3. Results
Failed to invoke procedure apoc.load.html: Caused by: java.lang.RuntimeException: Error during parsing element: <h6 i d="error">test</h6>

or with failSilently set to WITH_LIST:

CALL apoc.load.html("test.html",{h6:"h6"}, {failSilently: 'WITH_LIST'});
Table 4. Results
Output
{
  "errorList": [
    "<h6 i d=\"error\">test</h6>"
  ],
  "h6": [
    {
      "attributes": {
        "id": "correct"
      },
      "text": "test",
      "tagName": "h6"
    }
  ]
}
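
With WITH_LIST, the failures travel with the result, so a query can report them instead of aborting. The sketch below assumes the same test.html as above; the coalesce guard is a precaution for the case where errorList is absent, which this page does not specify:

```cypher
// Count parsed elements and failed tags from a WITH_LIST result.
CALL apoc.load.html("test.html", {h6: "h6"}, {failSilently: "WITH_LIST"}) YIELD value
RETURN size(value.h6) AS parsed,
       // treat a missing errorList as an empty list
       size(coalesce(value.errorList, [])) AS failed,
       value.errorList AS failedTags;
```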

or with failSilently set to WITH_LOG (note that a log.warn("Error during parsing element: <h6 i d="error">test</h6>") entry will be created):

CALL apoc.load.html("test.html",{h6:"h6"}, {failSilently: 'WITH_LOG'});
Table 5. Results
Output
{
  "h6": [
    {
      "attributes": {
        "id": "correct"
      },
      "text": "test",
      "tagName": "h6"
    }
  ]
}

HTML generated by JavaScript at runtime

If we have a .html file with a jQuery script like:

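<!DOCTYPE html>
<html>
<head>
    <!-- The script below calls jQuery's $(), so jQuery must be loaded first.
         This CDN reference is an assumption added for completeness. -->
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
    <script type="text/javascript">
    // Once the DOM is ready, append a <strong> element to the #appendStuff div
    $(() => {
        var newP = document.createElement("strong");
        var textNode = document.createTextNode("This is a new text node");
        newP.appendChild(textNode);
        document.getElementById("appendStuff").appendChild(newP);
    });
    </script>
    <meta charset="UTF-8"/>
</head>
<body onLoad="loadData()" class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-Aap_Kaa_Hak rootpage-Aap_Kaa_Hak skin-vector action-view">
    <div id="appendStuff"></div>
</body>
</html>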

we can read the content generated by JavaScript through the browser config (note that using a browser requires the additional dependencies described under Dependencies):

CALL apoc.load.html("test.html",{strong: "strong"}, {browser: "FIREFOX"});
Table 6. Results
Output
{
  "strong": [
    {
      "tagName": "strong",
      "text": "This is a new text node"
    }
  ]
}

If we need to parse a tag that is rendered by a slow async call, we can use the wait config to wait up to 10 seconds (in this example) for it to appear:

CALL apoc.load.html("test.html",{asyncTag: "#asyncTag"}, {browser: "FIREFOX", wait: 10});
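
Since the procedure continues after the wait time elapses even if nothing matched, it can be worth checking whether the element was actually found. The following sketch assumes the same test.html and #asyncTag selector; the coalesce guard covers the case where the key is missing, which is an assumption about the result shape:

```cypher
// Wait up to 10 seconds, then report whether the element appeared.
CALL apoc.load.html("test.html", {asyncTag: "#asyncTag"}, {browser: "FIREFOX", wait: 10}) YIELD value
WITH coalesce(value.asyncTag, []) AS matches
RETURN size(matches) > 0 AS found,
       matches;
```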

Dependencies

To use the apoc.load.html procedure with the browser config set to a value other than NONE, you have to add additional dependencies.

These dependencies are included in apoc-selenium-dependencies-4.3.0.12.jar, which can be downloaded from the releases page. Once the file is downloaded, place it in the plugins directory and restart the Neo4j server.