apoc.load.html

Procedure APOC Full

apoc.load.html('url',{name: jquery, name2: jquery}, config) YIELD value - Load Html page and return the result as a Map

Signature

apoc.load.html(url :: STRING?, query = {} :: MAP?, config = {} :: MAP?) :: (value :: MAP?)

Input parameters

Name	Type	Default
url	STRING?	null
query	MAP?	{}
config	MAP?	{}

Config parameters

The procedure supports the following config parameters:

Table 1. Config parameters
name	type	default	description
`browser`	`Enum [NONE, CHROME, FIREFOX]`	`NONE`	If set to `CHROME` or `FIREFOX`, Selenium WebDriver is used to read content generated dynamically by JavaScript. With `NONE` (the default), dynamic content cannot be read. To use the Chrome or Firefox driver, the browser must be installed on your machine and additional jars must be downloaded into the plugin folder. See below.
`wait`	`long`	`0`	If greater than 0, the procedure waits until it finds at least one element for each selector in the query parameter, up to the given number of seconds, after which it continues execution. Useful for elements that are rendered after the page is loaded (e.g. by slow asynchronous calls).
`charset`	`String`	`"UTF-8"`	The character set of the page being scraped, used if the `http-equiv` meta tag is not set.
`headless`	`boolean`	`true`	Only applies when `browser` is not `NONE`. Runs the browser in headless mode, that is, without opening the browser UI (recommended).
`acceptInsecureCerts`	`boolean`	`true`	If true, allows reading HTML from sites with insecure certificates.
`baseUri`	`String`	`""`	Base URI used to resolve relative paths.
`failSilently`	`Enum [FALSE, WITH_LOG, WITH_LIST]`	`FALSE`	Controls what happens when one or more elements fail to parse: `FALSE` throws a `RuntimeException`, `WITH_LOG` writes a `log.warn` for each failed element, and `WITH_LIST` adds an `errorList` key to the result containing the failed tags.
`htmlString`	`boolean`	`false`	If true, the 1st parameter is treated as an HTML string instead of a URL.
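
As a rough sketch of how several of these config parameters combine, the query below reads elements that are rendered by JavaScript. The URL and the `div.result` selector are hypothetical placeholders, and the query assumes that Firefox and the additional Selenium jars are already installed:

```cypher
// Hypothetical page whose <div class="result"> elements are rendered by JavaScript
CALL apoc.load.html(
  "https://example.com/dynamic-page",   // placeholder URL, not from the original docs
  {results: "div.result"},              // placeholder selector
  {browser: "FIREFOX", headless: true, wait: 10, failSilently: "WITH_LOG"}
)
YIELD value
RETURN value.results AS results;
```

Here `wait: 10` gives the page up to 10 seconds to render at least one matching element, and `failSilently: "WITH_LOG"` logs unparsable elements instead of aborting the call.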

Output parameters

Name	Type
value	MAP?

Usage Examples

We can extract the metadata and h2 headings from the Wikipedia home page by running the following query:

CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"});

Table 2. Results
Output
{
   "metadata":[
      {
         "tagName":"meta",
         "attributes":{
            "charset":"UTF-8"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"ResourceLoaderDynamicStyles"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"generator",
            "content":"MediaWiki 1.36.0-wmf.16"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"referrer",
            "content":"origin"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"referrer",
            "content":"origin-when-crossorigin"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "name":"referrer",
            "content":"origin-when-cross-origin"
         }
      },
      {
         "tagName":"meta",
         "attributes":{
            "property":"og:image",
            "content":"https://upload.wikimedia.org/wikipedia/commons/1/1c/Orion_pulse_unit_%28transparent%29.png"
         }
      }
   ],
   "h2":[
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-tfa-h2"
         },
         "text":"From today's featured article",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-dyk-h2"
         },
         "text":"Did you know ...",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-itn-h2"
         },
         "text":"In the news",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-otd-h2"
         },
         "text":"On this day",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-tfl-h2"
         },
         "text":"From today's featured list",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-tfp-h2"
         },
         "text":"Today's featured picture",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-other"
         },
         "text":"Other areas of Wikipedia",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-sister"
         },
         "text":"Wikipedia's sister projects",
         "tagName":"h2"
      },
      {
         "attributes":{
            "class":"mp-h2",
            "id":"mp-lang"
         },
         "text":"Wikipedia languages",
         "tagName":"h2"
      },
      {
         "tagName":"h2",
         "text":"Navigation menu"
      }
   ]
}
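
Because each key of the result map holds a list of element maps, we can post-process the result with ordinary Cypher. The following is a minimal sketch that turns the h2 list into one row per heading; the aliases `heading`, `id` and `text` are chosen only for illustration:

```cypher
CALL apoc.load.html("https://en.wikipedia.org/", {h2: "h2"})
YIELD value
// One row per matched <h2> element
UNWIND value.h2 AS heading
// Elements without attributes simply return null for id
RETURN heading.attributes.id AS id, heading.text AS text;
```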

Let’s suppose we have a test.html file like this:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
  <h6 i d="error">test</h6>
  <h6 id="correct">test</h6>
</html>

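The malformed i d attribute in the first h6 element causes a parse error, which we can handle through the failSilently configuration. With the default value (FALSE), the following query throws an exception: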

CALL apoc.load.html("test.html",{h6:"h6"});

Table 3. Results
Failed to invoke procedure `apoc.load.html`: Caused by: java.lang.RuntimeException: Error during parsing element: <h6 i d="error">test</h6>

or with failSilently WITH_LIST:

CALL apoc.load.html("test.html",{h6:"h6"}, {failSilently: 'WITH_LIST'});

Table 4. Results
Output
`{ "errorList": [ "<h6 i d=\"error\">test</h6>" ], "h6": [ { "attributes": { "id": "correct" }, "text": "test", "tagName": "h6" } ] }`

or with failSilently WITH_LOG (note that a log.warn("Error during parsing element: <h6 i d="error">test</h6>") entry will be created):

CALL apoc.load.html("test.html",{h6:"h6"}, {failSilently: 'WITH_LOG'});

Table 5. Results
Output
`{ "h6": [ { "attributes": { "id": "correct" }, "text": "test", "tagName": "h6" } ] }`
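
With WITH_LIST, the failed tags travel with the result, so a query can inspect them directly. The following is a minimal sketch against the same test.html file; the aliases `parsedCount` and `failedTags` are illustrative only:

```cypher
CALL apoc.load.html("test.html", {h6: "h6"}, {failSilently: "WITH_LIST"})
YIELD value
RETURN size(value.h6) AS parsedCount,
       // errorList is only present when something failed, so fall back to an empty list
       coalesce(value.errorList, []) AS failedTags;
```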

We can also pass an HTML string as the 1st parameter by setting the config parameter htmlString: true, for example:

CALL apoc.load.html("<!DOCTYPE html> <html> <body> <p class='firstClass'>My first paragraph.</p> </body> </html>",{metadata:"meta", h2:"h2"}, {htmlString: true});
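
As a variation on the query above, the sketch below uses a selector that matches the paragraph contained in the string and returns its tag name and text. The key `paragraphs` and the selector `p.firstClass` are chosen here for illustration:

```cypher
CALL apoc.load.html(
  "<!DOCTYPE html> <html> <body> <p class='firstClass'>My first paragraph.</p> </body> </html>",
  {paragraphs: "p.firstClass"},
  {htmlString: true}
)
YIELD value
// One row per matched <p> element
UNWIND value.paragraphs AS p
RETURN p.tagName AS tag, p.text AS text;
```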

The jsoup class org.jsoup.nodes.Element provides a set of functions for finding elements. We can emulate all of them using the appropriate css/jQuery selectors listed in the following table. Except for the last one, we can substitute the * with a tag name to search within that tag instead of everywhere. Furthermore, removing the * selector returns the same result:

jsoup function	css/jQuery selector	description
`getElementById(id)`	`#id`	Find an element by ID, including or under this element.
`getElementsByTag(tag)`	`tag`	Find elements, including and recursively under this element, with the specified tag name.
`getElementsByClass(className)`	`.className`	Find elements that have this class, including or under this element.
`getElementsByAttribute(key)`	`[key]`	Find elements that have a named attribute set.
`getElementsByAttributeStarting(keyPrefix)`	`*[^keyPrefix]`	Find elements that have an attribute name starting with the supplied prefix. Use `data-` to find elements that have HTML5 datasets.
`getElementsByAttributeValue(key,value)`	`*[key=value]`	Find elements that have an attribute with the specific value.
`getElementsByAttributeValueContaining(key,match)`	`*[key*=match]`	Find elements that have attributes whose value contains the match string.
`getElementsByAttributeValueEnding(key,valueSuffix)`	`*[class$="test"]`	Find elements that have attributes that end with the value suffix.
`getElementsByAttributeValueMatching(key,regex)`	`*[id~=content]`	Find elements that have attributes whose values match the supplied regular expression.
`getElementsByAttributeValueNot(key,value)`	`*:not([key="value"])`	Find elements that either do not have this attribute, or have it with a different value.
`getElementsByAttributeValueStarting(key,valuePrefix)`	`*[key^=valuePrefix]`	Find elements that have attributes that start with the value prefix.
`getElementsByIndexEquals(index)`	`*:nth-child(index)`	Find elements whose sibling index is equal to the supplied index.
`getElementsByIndexGreaterThan(index)`	`*:gt(index)`	Find elements whose sibling index is greater than the supplied index.
`getElementsByIndexLessThan(index)`	`*:lt(index)`	Find elements whose sibling index is less than the supplied index.
`getElementsContainingOwnText(searchText)`	`*:containsOwn(searchText)`	Find elements that directly contain the specified string.
`getElementsContainingText(searchText)`	`*:contains('searchText')`	Find elements that contain the specified string.
`getElementsMatchingOwnText(regex)`	`*:matchesOwn(regex)`	Find elements whose own text matches the supplied regular expression.
`getElementsMatchingText(pattern)`	`*:matches(pattern)`	Find elements whose text matches the supplied regular expression.
`getAllElements()`	`*`	Find all elements under this element, including this element and all of its descendants.

For example, we can execute:

CALL apoc.load.html($url, {nameKey: '#idName'})

Table 6. Results
Output
`{ "nameKey": [ { "attributes": { "id": "idName" }, "text": "test", "tagName": "h6" } ] }`
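
Several selectors can be combined in one call, each under its own key. The sketch below runs against an inline HTML string so that it does not depend on a remote page; the markup and the key names are made up for illustration:

```cypher
CALL apoc.load.html(
  "<html><body><h6 id='correct' class='note'>test</h6><p data-kind='intro'>Hello</p></body></html>",
  {
    byId: "#correct",            // emulates getElementById
    byClass: ".note",            // emulates getElementsByClass
    byAttrPrefix: "[^data-]",    // emulates getElementsByAttributeStarting
    byText: "p:contains(Hello)"  // emulates getElementsContainingText, limited to <p> tags
  },
  {htmlString: true}
)
YIELD value
RETURN value.byId AS byId,
       value.byClass AS byClass,
       value.byAttrPrefix AS byAttrPrefix,
       value.byText AS byText;
```

Each returned column is a list of element maps with the same shape as in the earlier results (tagName, text and, when present, attributes).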

HTML plain text representation

Using the same syntax and logic as apoc.load.html, we can get a plain text representation of the whole document, using the apoc.load.htmlPlainText(URL_OR_TEXT, QUERY_MAP, CONFIG_MAP) procedure, for example:

CALL apoc.load.htmlPlainText($urlOrString, {nameKey: 'body'})

or of some elements, with a selector:

CALL apoc.load.htmlPlainText($urlOrString, {nameKey: 'div'})
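
To try the procedure without a remote page, we can pass an inline HTML string. The sketch below assumes that htmlString: true is honored by apoc.load.htmlPlainText in the same way as by apoc.load.html, and that the result holds one plain text entry per key in the query map; the markup and key names are illustrative:

```cypher
CALL apoc.load.htmlPlainText(
  "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>",
  {body: "body", paragraphs: "p"},
  {htmlString: true}   // assumption: same config as apoc.load.html
)
YIELD value
RETURN value.body AS bodyText, value.paragraphs AS paragraphText;
```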