Google Apps Script parse HTML

Parsing HTML using Google Apps Script

This is a sample script for parsing HTML using Google Apps Script. When HTML data is converted to a Google Document, the HTML is parsed during the conversion, and the resulting Document contains its paragraphs, lists and tables. I thought this behavior could be used for parsing HTML with Google Apps Script, which is how I came up with this method.

With the Sheets API, HTML data can be put into a Spreadsheet using the PasteDataRequest. Unfortunately, in that case I couldn't distinguish between the body text and the tables.
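For comparison, the PasteDataRequest approach mentioned above might look roughly like the following sketch. It assumes the Sheets advanced service is enabled. The names buildPasteHtmlRequest, pasteHtmlToSheet, spreadsheetId and sheetId are hypothetical and not part of the original script.

```javascript
// Build the batchUpdate request body that pastes raw HTML at cell A1.
// "sheetId" is the numeric ID of the target sheet (an assumption of this sketch).
function buildPasteHtmlRequest(html, sheetId) {
  return {
    requests: [
      {
        pasteData: {
          html: true, // treat "data" as HTML
          data: html,
          type: "PASTE_NORMAL",
          coordinate: { sheetId: sheetId, rowIndex: 0, columnIndex: 0 }
        }
      }
    ]
  };
}

// Fetch a page and paste its HTML into the sheet (runs only inside Apps Script).
function pasteHtmlToSheet(url, spreadsheetId, sheetId) {
  var html = UrlFetchApp.fetch(url).getContentText();
  Sheets.Spreadsheets.batchUpdate(buildPasteHtmlRequest(html, sheetId), spreadsheetId);
}
```

Everything lands in one grid of cells, which is why body text and tables cannot be told apart afterwards.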

The flow of this method is as follows. In this sample script, the tables from HTML are retrieved.

Flow

  1. Retrieve HTML data using UrlFetchApp.fetch().
  2. Create a new Google Document by converting the HTML data using the Drive API.
    • This is a temporary file.
  3. Retrieve all tables using the Document service of Google Apps Script.
  4. Delete the temporary file.

Sample script

Before you run this script, please enable the Drive API at Advanced Google Services. The script calls Drive.Files.insert and Drive.Files.remove, which are Drive API v2 methods.

function parseTablesFromHTML(url) {
  // Fetch the HTML and convert it to a temporary Google Document.
  var html = UrlFetchApp.fetch(url);
  var docId = Drive.Files.insert(
    { title: "temporalDocument", mimeType: MimeType.GOOGLE_DOCS },
    html.getBlob()
  ).id;

  // Read every table in the converted Document as a 2D array of cell text.
  var tables = DocumentApp.openById(docId)
    .getBody()
    .getTables();
  var res = tables.map(function(table) {
    var values = [];
    for (var row = 0; row < table.getNumRows(); row++) {
      var temp = [];
      var cols = table.getRow(row);
      for (var col = 0; col < cols.getNumCells(); col++) {
        temp.push(cols.getCell(col).getText());
      }
      values.push(temp);
    }
    return values;
  });

  // Delete the temporary file.
  Drive.Files.remove(docId);
  return res;
}

// Please run this function.
function run() {
  var url = "###"; // Set the URL of the HTML page here.
  var res = parseTablesFromHTML(url);
  Logger.log(res);
}

Result

As a test case, when you set https://gist.github.com/tanaikech/f52e391b68473cbf6d4ab16108dcfbbb to url and run the script, the following result can be retrieved.

[
  [
    ["head1_1", "head1_2", "head1_3\n"],
    ["value1_a1", "value1_b1", "value1_c1"],
    ["value1_a2", "value1_b2", "value1_c2"]
  ],
  [
    ["head2_1", "head2_2", "head2_3\n"],
    ["value2_a1", "value2_b1", "value2_c1"],
    ["value2_a2", "value2_b2", "value2_c2"]
  ]
]
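Some cells in the result above carry a trailing newline ("head1_3\n"). If that is unwanted, the values could be cleaned up afterwards. The helper below is a minimal sketch. trimTableCells is a hypothetical name, not part of the original script.

```javascript
// Trim whitespace (including trailing newlines) from every cell.
// "tables" is an array of tables, each table an array of rows of cell text.
function trimTableCells(tables) {
  return tables.map(function (table) {
    return table.map(function (row) {
      return row.map(function (cell) {
        return String(cell).trim();
      });
    });
  });
}
```

It could be applied to the return value of parseTablesFromHTML(url) before logging or writing the values elsewhere.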

Note

  • Using this method, all paragraphs and lists can also be retrieved.
  • This method can also be used with other languages.
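As a sketch of the first note, paragraphs and list items could be read from the converted Document in the same way as the tables. The function name getParagraphAndListTexts is hypothetical. It expects the Body object returned by DocumentApp.openById(docId).getBody() and must be called before the temporary file is deleted.

```javascript
// Collect the text of all paragraphs and list items from a Document body.
// Note: Body.getParagraphs() also returns list items, so the two arrays overlap.
function getParagraphAndListTexts(body) {
  var paragraphs = body.getParagraphs().map(function (p) {
    return p.getText();
  });
  var listItems = body.getListItems().map(function (item) {
    return item.getText();
  });
  return { paragraphs: paragraphs, listItems: listItems };
}
```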


How to parse an HTML string in Google Apps Script without using XmlService?

Parsing an HTML string in Google Apps Script can sometimes prove to be a challenging task, especially if the XmlService is not available as an option. However, there are other methods to achieve this goal. In this article, we will explore various methods to parse an HTML string in Google Apps Script without the use of XmlService.

Method 1: RegExp and split

Here is a step-by-step guide on how to parse an HTML string in Google Apps Script without using XmlService, using only RegExp and split:

  1. Define the HTML string to parse:

var html = "<div><p>Hello world!</p></div>";

  2. Define a regular expression that matches HTML tags:

var regex = /<[^>]+>/g;

This regular expression matches any string that starts with a "<" character, ends with a ">" character, and does not contain any other ">" characters in between.

  3. Use the split() method to split the HTML string into an array of strings, using the regular expression as the delimiter:

var tags = html.split(regex);

This will split the HTML string into an array of strings, where each string is a piece of text between two HTML tags.

  4. Use the filter() method to drop the empty pieces:

var text = tags.filter(function(str) { return str.trim().length > 0; });

This will remove any empty strings from the array, leaving only the text between the HTML tags.

  5. Use the join() method to combine the remaining pieces:

var result = text.join("");

This will join all the text strings into a single string, which is the parsed HTML without any tags.

Here is the complete code example:

var html = "<div><p>Hello world!</p></div>";
var regex = /<[^>]+>/g;
var tags = html.split(regex);
var text = tags.filter(function(str) { return str.trim().length > 0; });
var result = text.join("");

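This code example works for simple HTML strings. Because it only strips tags, it does not decode entities such as &amp;, it keeps the contents of script and style elements as text, and it breaks when a ">" character appears inside an attribute value.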
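To make the behavior of this approach concrete, here is a small sketch that wraps the split-and-filter steps in a function and runs it on two inputs. extractTextPieces is a hypothetical helper name, and the sample strings are made up for illustration.

```javascript
// Split an HTML string on tags and keep the non-empty text pieces.
function extractTextPieces(html) {
  var tagPattern = /<[^>]+>/g;
  return html
    .split(tagPattern)
    .map(function (piece) { return piece.trim(); })
    .filter(function (piece) { return piece.length > 0; });
}

// Nested tags and ordinary attributes are handled, but entities stay encoded.
var listPieces = extractTextPieces('<ul class="items"><li>One</li><li>Two &amp; three</li></ul>');
// listPieces is ["One", "Two &amp; three"]

// A ">" inside an attribute value ends the tag match too early.
var brokenPieces = extractTextPieces('<a title="a > b">link</a>');
// brokenPieces is ['b">link']
```

The second call shows why the pattern is only a rough tag stripper rather than a real parser.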

Method 2: DOMParser API

DOMParser is a browser API. It is not available in server-side Apps Script code (.gs files), so this method applies to client-side JavaScript, for example in a page served by HtmlService. There, you can follow these steps:

  1. Create a new instance of DOMParser:

const parser = new DOMParser();

  2. Use the parseFromString method of the DOMParser API to parse the HTML string. This method takes two arguments: the HTML string to parse and the MIME type of the document being parsed.

const htmlString = "<p>Hello, world!</p>";
const mimeType = "text/html";
const parsedHtml = parser.parseFromString(htmlString, mimeType);

  3. You can now access the parsed HTML as a DOM tree. For example, to get the text content of the p element, you can use the textContent property.

const pElement = parsedHtml.querySelector("p");
const textContent = pElement.textContent; // "Hello, world!"

Here’s the complete code example:

const parser = new DOMParser();
const htmlString = "<p>Hello, world!</p>";
const mimeType = "text/html";
const parsedHtml = parser.parseFromString(htmlString, mimeType);
const pElement = parsedHtml.querySelector("p");
const textContent = pElement.textContent; // "Hello, world!"

You can use this method to parse HTML without using the XmlService API, as long as the code runs in a browser context where DOMParser exists.
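Because DOMParser is a browser API and may be missing where the code runs, one option is to check for it first and fall back to a plain tag-stripping regular expression. The following is a minimal sketch of that idea. htmlToText is a hypothetical name, and the fallback is a swapped-in regex technique, not a DOM parse.

```javascript
// Return the text content of an HTML string.
// Uses DOMParser when the runtime provides it, otherwise strips tags with a regex.
function htmlToText(html) {
  if (typeof DOMParser !== "undefined") {
    var doc = new DOMParser().parseFromString(html, "text/html");
    return doc.body.textContent;
  }
  // Fallback: remove anything that looks like a tag (entities are not decoded).
  return html.replace(/<[^>]*>/g, "");
}

var helloText = htmlToText("<p>Hello, world!</p>");
```

For simple input like this, both branches give the same string. They differ once entities, scripts or malformed markup are involved.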

Method 3: jQuery parseHTML

jQuery depends on a browser DOM, which server-side Apps Script does not provide, so $.parseHTML is usable only in client-side code, for example in a page served by HtmlService. To parse an HTML string there, you can follow these steps:

  1. Load the jQuery library in the page. The Apps Script "Libraries" dialog only accepts Apps Script libraries by script ID, so jQuery cannot be added there. Load it with a script tag instead, for example from https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js.
  2. Create a variable to hold your HTML string. For example:

var htmlString = "<div><p>Hello World!</p></div>";

  3. Parse the string with $.parseHTML:

var parsedHTML = $.parseHTML(htmlString);

  4. You can now manipulate the parsed HTML as DOM elements using jQuery or vanilla JavaScript. For example:

$(parsedHTML).find('p').text('Hello Google Apps Script!');

This will change the text inside the <p> tag to "Hello Google Apps Script!".

Here’s the full code example:

// Client-side code: this assumes jQuery has already been loaded in the page with a
// script tag pointing at https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js.
// jQuery needs a browser DOM, so this does not run in server-side Apps Script.

// Parse the HTML string
function parseHTML() {
  // HTML string to parse
  var htmlString = "<div><p>Hello World!</p></div>";

  // Parse the HTML string into an array of DOM nodes
  var parsedHTML = $.parseHTML(htmlString);

  // Manipulate the parsed HTML
  $(parsedHTML).find('p').text('Hello Google Apps Script!');

  // Log the manipulated HTML
  console.log($(parsedHTML).html());
}

Method 4: Regular Expression and Replace

To parse an HTML string in Google Apps Script without using XmlService, you can use Regular Expression and Replace. Here are the steps to do so:

  1. First, create a regular expression pattern to match the HTML tags in the string. Here is an example pattern:

var pattern = /<[^>]*>/g;

This pattern matches any string that starts with < and ends with >, and has any number of characters in between that are not >.

  2. Use the replace function to remove every match from the string:

var htmlString = "<p>Hello, world!</p>";
var plainText = htmlString.replace(pattern, "");

This code removes all the HTML tags from the htmlString variable and assigns the resulting plain text to the plainText variable.

  3. If you want to preserve only the text inside particular tags, you can modify the regular expression pattern to capture those parts. Here is an example pattern that captures the text inside <p> tags:

var pattern = /<p>([^<]*)<\/p>/g;

This pattern matches any string that starts with <p> and ends with </p>, and captures any number of characters in between that are not <.

  4. You can then use the replace function with a callback function to replace each match with the captured text. Here is an example:

var htmlString = "<p>Hello, world!</p>";
var plainText = htmlString.replace(pattern, function(match, text) { return text; });

This code removes all the <p> tags from the htmlString variable and assigns the resulting plain text to the plainText variable.

These are the basic steps to parse an HTML string in Google Apps Script without using XmlService using Regular Expression and Replace.
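When the goal is to collect the text of each matching element rather than one stripped string, a global pattern can be walked with exec(). The sketch below is one way to do that. extractParagraphs is a hypothetical name, and the pattern allows attributes and inline tags inside the paragraph, which the simpler pattern above does not.

```javascript
// Collect the text of every <p> element in an HTML string.
function extractParagraphs(html) {
  // \b keeps <pre> and similar tags from matching; [\s\S]*? is a lazy "anything".
  var pattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  var results = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    // Strip any inline tags left inside the captured paragraph body.
    results.push(match[1].replace(/<[^>]*>/g, "").trim());
  }
  return results;
}

var paragraphs = extractParagraphs(
  '<div><p class="a">Hello, <b>world</b>!</p><p>Second</p></div>'
);
// paragraphs is ["Hello, world!", "Second"]
```

Like the other regular expression methods, this is a sketch for predictable markup. Nested <p>-like structures or ">" inside attribute values will still confuse it.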

gas-commons / HtmlParser Public archive

HTML Parser for Google Apps Script


project key: 1gMNYu6-SlYdKbfFMSXZz718quQVgll-qKhNobIaJwMVYL_9EgZ9GQlmp

var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText()
var doc = XmlService.parse(html)
var rootElement = doc.getRootElement()
var parser = HtmlParser.of(rootElement)

// Look up a single element by its id.
var element = parser.getElementById('firstHeading')

// Look up elements by class name.
var elements = parser.getElementsByClassName('firstHeading')

// Look up elements by tag name.
var elements = parser.getElementsByTagName('h1')
