apoc.load.htmlPlainText

Procedure Apoc Extended

apoc.load.htmlPlainText('urlOrHtml',{name: jquery, name2: jquery}, config) YIELD value - Load Html page and return the result as a Map

Signature

apoc.load.htmlPlainText(urlOrHtml :: STRING?, query = {} :: MAP?, config = {} :: MAP?) :: (value :: MAP?)

Input parameters

Name Type Default

urlOrHtml

STRING?

null

query

MAP?

{}

config

MAP?

{}

Output parameters

Name Type

value

MAP?

Usage Examples

We can extract the <h1> tag and the tag with id mp-right the Wikipedia home page, by running the following query:

CALL apoc.load.htmlPlainText("https://en.wikipedia.org/",{h1:"h1", mp:"#mp-right"});

with a result like this (that is, a map of mp: "content of tag with id mp-right", h1: "content of h1 tags"):

Table 1. Results
Output
{
  "mp": "

In the news

Elizabeth II
 - In Nigeria, at least 40 people are killed in an attack  at a Catholic church  in Owo , Ondo State .
 - A fire and explosions  at a storage depot in Sitakunda , Bangladesh, kill at least 49 people and
injure more than 450 others.
 - The Commonwealth of Nations  celebrates the Platinum Jubilee  of Elizabeth

II (pictured) .
 - Denmark votes  to eliminate its opt-out  of the European Union 's Common Security and Defence
Policy . Ongoing :
 - COVID-19 pandemic
 - Russian invasion of Ukraine Recent deaths :
 - Paula Rego
 - Christopher Pratt
 -
Dorothy E. Smith

 - Zeta Emilianidou
 -
Ann Turner Cook

 - Barry Sussman
 - Nominate an article

On this day


June 10
Frederick Barbarossa
 - 1190  – Third Crusade : Frederick Barbarossa (pictured) , Holy Roman Emperor , drowned in the
Saleph River  in Anatolia .
 - 1692  – Bridget Bishop  became the first person to be executed for witchcraft  in the Salem witch
trials  in colonial Massachusetts .
 - 1878  – The League of Prizren  was officially founded to "struggle in arms to defend the wholeness
of the territories of Albania".
 - 1925  – The United Church of Canada , the country's largest Protestant  denomination, held its
inaugural service at the Mutual Street Arena  in Toronto.
 - 2008  – Sudan Airways Flight 109  crashed on landing at Khartoum International Airport , killing
30 of the 214 occupants on board.
 - Theodor Philipsen  ( b.
 1840)
 - Margarito Bautista  ( b.
 1878)
 - Margaret Abbott  ( d.
 1955)  More anniversaries:
 - June 9
 - June 10
 - June 11
 - Archive
 - By email
 - List of days of the year "
,

  "h1": "
Main Page


Welcome to Wikipedia

"
}

or we can extract and get the whole body document by running:

CALL apoc.load.htmlPlainText("https://en.wikipedia.org/",{body:"body"})
YIELD value
RETURN value["body"]

with a result similar to this:

Table 2. Results
body
"
Main Page
From Wikipedia, the free encyclopedia Jump to navigation Jump to search

Welcome to Wikipedia

, the free encyclopedia  that anyone can edit . 6,510,947  articles in English


From today's featured article

Life restoration  of Mosasaurus hoffmanni
Mosasaurus  is a genus  of mosasaurs , an extinct group of aquatic scaly reptiles . It lived from
about 82 to 66 million years ago during the Late Cretaceous . Its earliest fossils were found as
skulls near the River Meuse  ( Mosa  in Latin). In 1808, Georges Cuvier  concluded that the skulls
belonged to a giant marine lizard with similarities to monitors  but otherwise unlike any known
living animal, supporting the then-developing idea of extinction . Scientists continue to
debate whether its closest living relatives are monitors or snakes . Mosasaurus  had jaws
capable of swinging back and forth and was capable of powerful bites, using dozens of teeth
designed for cutting prey. Its four limbs were shaped into paddles to steer underwater.
Mosasaurus  was a predator with excellent vision but a poor sense of smell, and a high metabolic
rate suggesting it was warm-blooded . It lived in much of the Atlantic  and in a wide range of
oceanic climates including tropical, subtropical, temperate, and subpolar. ( Full
article... )
 Recently featured:
 - On the Job  (2013 film)
 - White swamphen
 - Lake Estancia
 - Archive
 - By email
 - More featured articles

Did you know ...

Bare formula shelves with purchase limit notice, January 2022
 - ... that the ongoing infant formula shortage in the United States (example pictured)  also
affects non-infant medical patients who require nasogastric feeding ?
 - ... that John Jacob Withrow  allegedly did not consult anyone before announcing a permanent
exhibition  in Toronto?
 - ... that the Hawaii Civil Liberties Committee  was designated as a Communist front  by the House
Un-American Activities Committee ?
 - ... that Mahendra Raj
's
 engineering work on the Hindustan Lever pavilion  resembled a crumpled sheet of paper?
 - ... that the clown character Mombo was created for The Dr. Max Show  after being blamed for an
off-stage noise?
 - ... that Roddie Fleming  was expecting to inherit the family business, but it was sold to Chase
Bank  instead?
 - ... that Darkness Visible: A Study of Vergil's Aeneid  was thought by one reviewer to have "the
remarkable qualities of the oracular"?
 - ... that Sunny Low  and his sister were dubbed the "King and Queen of Cha-Cha-Cha and Rock 'n'
Roll"?
 - Archive
 - Start a new article
 - Nominate an article

In the news

Elizabeth II
 - In Nigeria, at least 40 people are killed in an attack  at a Catholic church  in Owo , Ondo State .
 - A fire and explosions  at a storage depot in Sitakunda , Bangladesh, kill at least 49 people and
injure more than 450 others.
 - The Commonwealth of Nations  celebrates the Platinum Jubilee  of Elizabeth

II (pictured) .
 - Denmark votes  to eliminate its opt-out  of the European Union 's Common Security and Defence
Policy . Ongoing :
 - COVID-19 pandemic
 - Russian invasion of Ukraine Recent deaths :
 - Paula Rego
 - Christopher Pratt
 -
Dorothy E. Smith

 - Zeta Emilianidou
 -
Ann Turner Cook

 - Barry Sussman
 - Nominate an article

On this day


June 10
Frederick Barbarossa
 - 1190  – Third Crusade : Frederick Barbarossa (pictured) , Holy Roman Emperor , drowned in the
Saleph River  in Anatolia .
 - 1692  – Bridget Bishop  became the first person to be executed for witchcraft  in the Salem witch
trials  in colonial Massachusetts .
 - 1878  – The League of Prizren  was officially founded to "struggle in arms to defend the wholeness
of the territories of Albania".
 - 1925  – The United Church of Canada , the country's largest Protestant  denomination, held its
inaugural service at the Mutual Street Arena  in Toronto.
 - 2008  – Sudan Airways Flight 109  crashed on landing at Khartoum International Airport , killing
30 of the 214 occupants on board.
 - Theodor Philipsen  ( b.
 1840)
 - Margarito Bautista  ( b.
 1878)
 - Margaret Abbott  ( d.
 1955)  More anniversaries:
 - June 9
 - June 10
 - June 11
 - Archive
 - By email
 - List of days of the year


....


"

Note that the procedure returns empty results in case of tags without textual content, for example:

CALL apoc.load.htmlPlainText("https://en.wikipedia.org/", {meta:"meta"});
Table 3. Results
value

{ "meta": "" }

Load from runtime generated file

If we have a test.html file with a jQuery script like:

<!DOCTYPE html>
<html>
    <head>
      <script src="https://code.jquery.com/jquery-1.9.1.min.js"></script>
      <script type="text/javascript">
        $(() => {
            var newP = document.createElement("strong");
            var textNode = document.createTextNode("This is a new text node");
            newP.appendChild(textNode);
            document.getElementById("appendStuff").appendChild(newP);
        });
      </script>
    </head>
    <body>
        <div id="appendStuff"></div>
    </body>
</html>

we can read the generated js through the browser config.

Install dependencies

Note that to use the browser config (except with "NONE" value), you have to install additional dependencies which can be downloaded from this link.

For example, with the above file we can execute:

CALL apoc.load.htmlPlainText("test.html", {strong: "strong"}, {browser: "FIREFOX"});
Table 4. Results
Output
{ "body": "This is a new text node " }

If we have to parse a tag from a slow async call, we can use wait config to waiting for 10 second (in this example):

CALL apoc.load.htmlPlainText("test.html", {asyncTag: "#asyncTag"}, {browser: "FIREFOX", wait: 10});

We can also pass an HTML string into the 1st parameter by putting as a config parameter htmlString: true, for example:

CALL apoc.load.htmlPlainText("<!DOCTYPE html> <html> <body> <p class='firstClass'>My first paragraph.</p> </body> </html>",{body:"body"}, {htmlString: true})
YIELD value
RETURN value["body"] as body
Table 5. Results
body ---- " My first paragraph. " ----

Css / jQuery selectors

The jsoup class org.jsoup.nodes.Element provides a set of functions that can be used. Anyway, we can emulate all of them using the appropriate css/jQuery selectors in these ways (except for the last one, we can substitute the with a tag name to search into it instead of everywhere. Furthermore, by removing the selector will be returned the same result):

jsoup function css/jQuery selector description

getElementById(id)

#id

Find an element by ID, including or under this element.

getElementsByTag(tag)

tag

Finds elements, including and recursively under this element, with the specified tag name.

getElementsByClass(className)

.className

Find elements that have this class, including or under this element.

getElementsByAttribute(key)

[key]

Find elements that have a named attribute set.

getElementsByAttributeStarting(keyPrefix)

*[^keyPrefix]

Find elements that have an attribute name starting with the supplied prefix. Use data

to find elements that have HTML5 datasets.

getElementsByAttributeValue(key,value)

*[key=value]

Find elements that have an attribute with the specific value.

getElementsByAttributeValueContaining(key,match)

[key=match]

Find elements that have attributes whose value contains the match string.

getElementsByAttributeValueEnding(key,valueSuffix)

*[class$="test"]

Find elements that have attributes that end with the value suffix.

getElementsByAttributeValueMatching(key,regex)

*[id~=content]

Find elements that have attributes whose values match the supplied regular expression.

getElementsByAttributeValueNot(key,value)

*:not([key="value"])

Find elements that either do not have this attribute, or have it with a different value.

getElementsByAttributeValueStarting(key,valuePrefix)

*[key^=valuePrefix]

Find elements that have attributes that start with the value prefix.

getElementsByIndexEquals(index)

*:nth-child(index)

Find elements whose sibling index is equal to the supplied index.

getElementsByIndexGreaterThan(index)

*:gt(index)

Find elements whose sibling index is greater than the supplied index.

getElementsByIndexLessThan(index)

*:lt(index)

Find elements whose sibling index is less than the supplied index.

getElementsContainingOwnText(searchText)

*:containsOwn(searchText)

Find elements that directly contain the specified string.

getElementsContainingText(searchText)

*:contains('searchText')

Find elements that contain the specified string.

getElementsMatchingOwnText(regex)

*:matches(regex)

Find elements whose text matches the supplied regular expression.

getElementsMatchingText(pattern)

*:matchesOwn(pattern)

Find elements whose text matches the supplied regular expression.

getAllElements()

*

For example, we can execute:

CALL apoc.load.htmlPlainText($url, {nameKey: '#idName'})
Table 6. Results
Output
{
  "h6": [
    {
      "attributes": {
        "id": "idName"
      },
      "text": "test",
      "tagName": "h6"
    }
  ]
}

Html as json list

If, instead of a map of plain text representations, you want to get a map of json list results, you can use the apoc.load.html procedure, which use the same syntax, logic and config parameters as apoc.load.htmlPlainText.