Parsing HTML using Google Apps Script
This is a sample script for parsing HTML using Google Apps Script. When HTML data is converted to a Google Document, the HTML is parsed during the conversion, and the resulting Document contains the paragraphs, lists and tables. I thought that this conversion could be used for parsing HTML with Google Apps Script, which led me to this method.
In the Sheets API, HTML data can be put into a Spreadsheet with PasteDataRequest. Unfortunately, in that case I couldn't distinguish between the body and the tables.
The flow of this method is as follows. In this sample script, the tables from HTML are retrieved.
Flow
- Retrieve HTML data using UrlFetchApp.fetch().
- Create a new Google Document by converting the HTML data using Drive API.
- This is a temporary file.
- Retrieve all tables using the Document service of Google Apps Script.
- Delete the temporary file.
Sample script
Before you run this script, please enable Drive API at Advanced Google Services.
function parseTablesFromHTML(url) {
  // Fetch the HTML and convert it to a temporary Google Document.
  var html = UrlFetchApp.fetch(url);
  var docId = Drive.Files.insert(
    { title: "temporalDocument", mimeType: MimeType.GOOGLE_DOCS },
    html.getBlob()
  ).id;

  // Read every table of the converted Document as a 2D array of cell texts.
  var tables = DocumentApp.openById(docId)
    .getBody()
    .getTables();
  var res = tables.map(function(table) {
    var values = [];
    for (var row = 0; row < table.getNumRows(); row++) {
      var temp = [];
      var cols = table.getRow(row);
      for (var col = 0; col < cols.getNumCells(); col++) {
        temp.push(cols.getCell(col).getText());
      }
      values.push(temp);
    }
    return values;
  });

  // Delete the temporary Document.
  Drive.Files.remove(docId);
  return res;
}

// Please run this function.
function run() {
  var url = "###"; // Set the URL to fetch here.
  var res = parseTablesFromHTML(url);
  Logger.log(res);
}
Result
As a test case, when you set https://gist.github.com/tanaikech/f52e391b68473cbf6d4ab16108dcfbbb to url and run the script, the following result can be retrieved.
[
  [
    ["head1_1", "head1_2", "head1_3\n"],
    ["value1_a1", "value1_b1", "value1_c1"],
    ["value1_a2", "value1_b2", "value1_c2"]
  ],
  [
    ["head2_1", "head2_2", "head2_3\n"],
    ["value2_a1", "value2_b1", "value2_c1"],
    ["value2_a2", "value2_b2", "value2_c2"]
  ]
]
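The result above is a list of tables, each a 2D array whose first row holds the headers, and some cells carry a trailing line break. As a minimal sketch of one way to post-process it, the following turns each table into records keyed by header. The helper name tableToRecords and the sample data are assumptions for illustration and are not part of the original script.

```javascript
// Hypothetical helper: convert one table (array of rows) into an array of
// objects keyed by the header row. Cell texts are trimmed, which also removes
// the trailing "\n" seen in the last header cell of the sample result.
function tableToRecords(table) {
  if (table.length === 0) return [];
  var clean = function (cell) { return String(cell).trim(); };
  var headers = table[0].map(clean);
  return table.slice(1).map(function (row) {
    var record = {};
    headers.forEach(function (header, i) {
      // Missing cells become empty strings.
      record[header] = i < row.length ? clean(row[i]) : "";
    });
    return record;
  });
}

// Sample input shaped like the first table of the result shown above.
var sampleTables = [
  [
    ["head1_1", "head1_2", "head1_3\n"],
    ["value1_a1", "value1_b1", "value1_c1"],
    ["value1_a2", "value1_b2", "value1_c2"]
  ]
];
var records = sampleTables.map(tableToRecords);
```

In Apps Script, the same helper could be applied to the value returned by parseTablesFromHTML(url) with res.map(tableToRecords).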
Note
- Using this method, all paragraphs and lists can also be retrieved.
- This method can also be used with other languages.
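As a rough sketch of the first note, the texts of paragraphs and list items could be collected from the converted Document in the same way as the tables. The function name collectParagraphsAndLists is hypothetical. Its body argument is assumed to be the object returned by DocumentApp.openById(docId).getBody() for the temporary Document, and only getParagraphs(), getListItems() and getText() are used.

```javascript
// Hypothetical helper: collect paragraph and list item texts from a Document body.
// `body` is assumed to behave like a DocumentApp Body, for example the value of
// DocumentApp.openById(docId).getBody() for the temporary Document.
function collectParagraphsAndLists(body) {
  var getText = function (element) { return element.getText(); };
  return {
    // In the Document service, getParagraphs() also includes list items.
    paragraphs: body.getParagraphs().map(getText),
    listItems: body.getListItems().map(getText)
  };
}
```

This would need to run before the temporary file is deleted, because the body can no longer be read afterwards.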
How to parse an HTML string in Google Apps Script without using XmlService?
Parsing an HTML string in Google Apps Script can sometimes prove to be a challenging task, especially if the XmlService is not available as an option. However, there are other methods to achieve this goal. In this article, we will explore various methods to parse an HTML string in Google Apps Script without the use of XmlService.
Method 1: RegExp and split
Here is a step-by-step guide on how to parse an HTML string in Google Apps Script without using XmlService, using only RegExp and split:
- Define the HTML string to parse:
var html = "<div><p>Hello world!</p></div>";
- Create a regular expression that matches HTML tags:
var regex = /<[^>]+>/g;
This regular expression matches any string that starts with a «<» character, ends with a «>» character, and does not contain any other «>» characters in between.
- Use the split() method to split the HTML string into an array of strings, using the regular expression as the delimiter:
var tags = html.split(regex);
This will split the HTML string into an array of strings, where each string is a piece of text between two HTML tags.
- Use the filter() method to drop the empty strings from the array:
var text = tags.filter(function(str) { return str.trim().length > 0; });
This will remove any empty strings from the array, leaving only the text between the HTML tags.
- Use the join() method to combine the remaining strings:
var result = text.join("");
This will join all the text strings into a single string, which is the parsed HTML without any tags.
Here is the complete code example:
var html = "<div><p>Hello world!</p></div>";
var regex = /<[^>]+>/g;
var tags = html.split(regex);
var text = tags.filter(function(str) { return str.trim().length > 0; });
var result = text.join(""); // "Hello world!"
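This code example works for simple HTML strings, but it only strips tags and does not parse the structure: text from all elements is merged into one string, character references such as &amp; are not decoded, and the pattern breaks on attribute values that contain «>», as well as on comments and on script or style content.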
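To make that limitation concrete, here is a small sketch that wraps the same split-and-filter steps in a function and runs it on nested markup. The function name extractTextSegments and the sample strings are assumptions made for this illustration.

```javascript
// Hypothetical helper: split on tags, trim each piece, and drop empty pieces.
function extractTextSegments(html) {
  return html
    .split(/<[^>]+>/g)
    .map(function (piece) { return piece.trim(); })
    .filter(function (piece) { return piece.length > 0; });
}

// Nested tags and attributes are stripped, but the structure is lost:
// the pieces of the <p> element come back as separate segments.
var nested = '<div class="box"><h1>Title</h1><p>First <b>bold</b> text</p></div>';
var segments = extractTextSegments(nested);
// segments is ["Title", "First", "bold", "text"]

// An attribute value containing ">" ends the match too early,
// so part of the tag leaks into the output.
var broken = extractTextSegments('<a title="a > b">link</a>');
// broken is ['b">link']
```

Returning the segments as an array instead of joining them keeps a little more information, but there is still no way to tell which element each segment came from.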
Method 2: DOMParser API
DOMParser is a browser API. It is not available in server-side Apps Script code (.gs files), so this method only works in client-side JavaScript, for example in a dialog, sidebar or web app page served with HtmlService. There, you can follow these steps:
- Create a new instance of DOMParser:
const parser = new DOMParser();
- Use the parseFromString method of the DOMParser API to parse the HTML string. This method takes two arguments: the HTML string to parse and the MIME type of the document being parsed.
const htmlString = "<p>Hello, world!</p>";
const mimeType = "text/html";
const parsedHtml = parser.parseFromString(htmlString, mimeType);
- You can now access the parsed HTML as a DOM tree. For example, to get the text content of the p element, you can use the textContent property.
const pElement = parsedHtml.querySelector("p");
const textContent = pElement.textContent; // "Hello, world!"
Here’s the complete code example:
const parser = new DOMParser();
const htmlString = "<p>Hello, world!</p>";
const mimeType = "text/html";
const parsedHtml = parser.parseFromString(htmlString, mimeType);
const pElement = parsedHtml.querySelector("p");
const textContent = pElement.textContent; // "Hello, world!"
You can use this method to parse an HTML string on the client side without using XmlService. The extracted values can then be passed to server-side functions with google.script.run.
Method 3: JQuery Parse HTML
jQuery needs a browser DOM, so like DOMParser it cannot run in server-side Apps Script code. Fetching the library with UrlFetchApp and passing it to eval() does not work there, because jQuery expects a document object. It does work in client-side pages served with HtmlService. To parse an HTML string there using $.parseHTML, you can follow these steps:
- Load the jQuery library in your HTML file with a script tag, for example from https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js.
- Create a variable to hold your HTML string. For example:
var htmlString = "<p>Hello World!</p>";
- Parse the HTML string into an array of DOM nodes:
var parsedHTML = $.parseHTML(htmlString);
- You can now manipulate the parsed HTML using jQuery or vanilla JavaScript. Because the <p> element is a top-level node of the parsed result, wrap the nodes in a container before calling find(). For example:
var $wrapper = $("<div>").append(parsedHTML);
$wrapper.find("p").text("Hello Google Apps Script!");
This will change the text inside the <p> tag to «Hello Google Apps Script!».
Here’s the full code example:
// Client-side code for a page served with HtmlService.
// jQuery must already be loaded with a script tag.
function parseHTML() {
  // HTML string to parse
  var htmlString = "<p>Hello World!</p>";
  // Parse the HTML string into an array of DOM nodes
  var parsedHTML = $.parseHTML(htmlString);
  // Wrap the nodes so that find() can reach the top-level <p>
  var $wrapper = $("<div>").append(parsedHTML);
  // Manipulate the parsed HTML
  $wrapper.find("p").text("Hello Google Apps Script!");
  // Log the manipulated HTML
  console.log($wrapper.html()); // <p>Hello Google Apps Script!</p>
}
Method 4: Regular Expression and Replace
To parse an HTML string in Google Apps Script without using XmlService, you can use Regular Expression and Replace. Here are the steps to do so:
- First, create a regular expression pattern to match the HTML tags in the string. Here is an example pattern:
var pattern = /<[^>]*>/g;
This pattern matches any string that starts with < and ends with >, and has any number of characters in between that are not >.
- Use the replace function to remove every match from the string:
var htmlString = "<p>Hello, world!</p>";
var plainText = htmlString.replace(pattern, ""); // "Hello, world!"
This code removes all the HTML tags from the htmlString variable and assigns the resulting plain text to the plainText variable.
- If you want to preserve some of the text inside the HTML tags, you can modify the regular expression pattern to capture those parts. Here is an example pattern that captures the text inside <p> tags:
var pattern = /<p>([^<]*)<\/p>/g;
This pattern matches any string that starts with <p> and ends with </p>, and captures any number of characters in between that are not <.
- You can then use the replace function with a callback function to replace each match with the captured text. Here is an example code:
var htmlString = "<p>Hello, world!</p>";
var plainText = htmlString.replace(pattern, function(match, text) { return text; });
This code removes all the <p> tags from the htmlString variable and assigns the resulting plain text to the plainText variable.
These are the basic steps to parse an HTML string in Google Apps Script without using XmlService using Regular Expression and Replace.
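As an illustration of the capture-and-callback idea, the sketch below collects the text of every p element into an array instead of building one string. The function name extractParagraphs and the sample markup are assumptions. The pattern also tolerates attributes on the opening tag, which goes slightly beyond the steps described above.

```javascript
// Hypothetical helper: return the plain text of each <p> element.
function extractParagraphs(html) {
  // Match <p>, optionally with attributes, and lazily capture up to the next </p>.
  var paragraphPattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  // Match any remaining inline tags inside the captured text.
  var tagPattern = /<[^>]*>/g;
  var results = [];
  // replace() is used only for its callback; the return value is discarded.
  html.replace(paragraphPattern, function (match, inner) {
    results.push(inner.replace(tagPattern, "").trim());
    return match;
  });
  return results;
}

var sampleHtml = '<div><p class="intro">Hello, <b>world</b>!</p><p>Second paragraph</p></div>';
var paragraphs = extractParagraphs(sampleHtml);
// paragraphs is ["Hello, world!", "Second paragraph"]
```

Like the other regular expression approaches, this is only suitable for simple, predictable markup. It does not decode character references and does not handle comments or script content.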
gas-commons / HtmlParser Public archive
HTML Parser for Google Apps Script
project key: 1gMNYu6-SlYdKbfFMSXZz718quQVgll-qKhNobIaJwMVYL_9EgZ9GQlmp
var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText()
var doc = XmlService.parse(html)
var rootElement = doc.getRootElement()
var parser = HtmlParser.of(rootElement)
var element = parser.getElementById('firstHeading')
var element = parser.getElementById('firstHeading')
var elements = parser.getElementsByClassName('firstHeading')
var elements = parser.getElementsByTagName('h1')