Html to text console

    Option             Default  Description
    uppercaseHeadings  true     By default, headings (<h1>, <h2>, etc.) are uppercased. Set to false to leave headings as they are.
    wordwrap           80       After how many chars a line break should follow in p elements. Set to null or false to disable word-wrapping.

By using the format option, you can specify formatting for ALL elements:

Key             Tags
anchor          a
blockquote      blockquote
heading         h1, h2, h3, h4, h5, h6
horizontalLine  hr
image           img
lineBreak       br
listItem
orderedList     ol
paragraph       p, pre
table           table
text
unorderedList   ul
. .

Each key must be a function which eventually receives elem (the current element), fn (the next formatting function) and options (the options passed to html-to-text).

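var htmlToText = require('html-to-text');

var text = htmlToText.fromString('<h1>Hello World</h1>', {
  format: {
    // Custom formatter for h1..h6: render the children, then frame the result.
    heading: function (elem, fn, options) {
      var h = fn(elem.children, options);
      return '====\n' + h.toUpperCase() + '\n====';
    }
  }
});

console.log(text);
// ====\nHELLO WORLD\n====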
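
var text2 = htmlToText.fromString('<div>Hello</div><div>World</div><div>!</div>', {
  format: {
    // Custom formatter for div: render the children and end with a line break.
    div: function (elem, fn, options) {
      const h = fn(elem.children, options);
      return h + '\n';
    }
  }
});

console.log(text2);
// Hello\nWorld\n!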

html-to-text can also be used as a command line interface. This allows easy validation of your generated text and integration into other systems that do not run on node.js.

html-to-text uses stdin and stdout for data in and output. So you can use html-to-text the following way:

cat example/test.html | html-to-text > test.txt 

All options described above are also available. You can use them like this:

cat example/test.html | html-to-text --tables=#invoice,.address --wordwrap=100 > test.txt 

The tables option has to be declared as a comma-separated list without whitespace.

<html>
<head>
  <meta charset="utf-8">
</head>
<body>
  <table cellpadding="0" cellspacing="0" border="0">
    <tr>
      <td>
        <h2>Paragraphs</h2>
        <p class="normal-space">At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. <a href="www.github.com">Github</a>
        </p>
        <p class="normal-space">At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
        </p>
      </td>
      <td></td>
    </tr>
    <tr>
      <td>
        <hr/>
        <h2>Pretty printed table</h2>
        <table id="invoice">
          <thead>
            <tr>
              <th>Article</th>
              <th>Price</th>
              <th>Taxes</th>
              <th>Amount</th>
              <th>Total</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>
                <p>
                  Product 1<br />
                  <span style="font-size:0.8em">Contains: 1x Product 1</span>
                </p>
              </td>
              <td align="right" valign="top">6,99€</td>
              <td align="right" valign="top">7%</td>
              <td align="right" valign="top">1</td>
              <td align="right" valign="top">6,99€</td>
            </tr>
            <tr>
              <td>Shipment costs</td>
              <td align="right">3,25€</td>
              <td align="right">7%</td>
              <td align="right">1</td>
              <td align="right">3,25€</td>
            </tr>
          </tbody>
          <tfoot>
            <tr>
              <td> </td>
              <td> </td>
              <td colspan="3">to pay: 10,24€</td>
            </tr>
            <tr>
              <td></td>
              <td></td>
              <td colspan="3">Taxes 7%: 0,72€</td>
            </tr>
          </tfoot>
        </table>
      </td>
      <td></td>
    </tr>
    <tr>
      <td>
        <hr/>
        <h2>Lists</h2>
        <ul>
          <li>At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</li>
          <li>At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</li>
        </ul>
        <ol>
          <li>At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</li>
          <li>At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</li>
        </ol>
      </td>
    </tr>
    <tr>
      <td>
        <hr />
        <h2>Column Layout with tables</h2>
        <table class="address">
          <tr>
            <th align="left">Invoice Address</th>
            <th align="left">Shipment Address</th>
          </tr>
          <tr>
            <td align="left">
              <p>
                Mr.<br/>
                John Doe<br/>
                Featherstone Street 49<br/>
                28199 Bremen<br/>
              </p>
            </td>
            <td align="left">
              <p>
                Mr.<br/>
                John Doe<br/>
                Featherstone Street 49<br/>
                28199 Bremen<br/>
              </p>
            </td>
          </tr>
        </table>
      </td>
      <td></td>
    </tr>
    <tr>
      <td>
        <hr/>
        <h2>Mailto formating</h2>
        <p class="normal-space small">
          Some Company<br />
          Some Street 42<br />
          Somewhere<br />
          E-Mail: <a href="mailto:test@example.com">Click here</a>
        </p>
      </td>
    </tr>
  </table>
</body>
</html>

This gets converted to:

PARAGRAPHS
At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd
gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum
dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor
invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos
et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea
takimata sanctus est Lorem ipsum dolor sit amet. Github [www.github.com]

At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd
gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum
dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor
invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos
et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea
takimata sanctus est Lorem ipsum dolor sit amet.

--------------------------------------------------------------------------------

PRETTY PRINTED TABLE
ARTICLE                  PRICE   TAXES   AMOUNT   TOTAL
Product 1                6,99€   7%      1        6,99€
Contains: 1x Product 1
Shipment costs           3,25€   7%      1        3,25€
                                 to pay: 10,24€
                                 Taxes 7%: 0,72€

--------------------------------------------------------------------------------

LISTS
 * At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd
   gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
 * At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd
   gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

 1. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd
    gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
 2. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd
    gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

--------------------------------------------------------------------------------

COLUMN LAYOUT WITH TABLES
INVOICE ADDRESS          SHIPMENT ADDRESS
Mr.                      Mr.
John Doe                 John Doe
Featherstone Street 49   Featherstone Street 49
28199 Bremen             28199 Bremen

--------------------------------------------------------------------------------

MAILTO FORMATING
Some Company
Some Street 42
Somewhere
E-Mail: Click here [test@example.com]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ‘Software’), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


☩ Walking in Light with Christ – Faith, Computing, Diary

How to convert html pages to text in console / terminal on GNU / Linux and FreeBSD

Thursday, 8th December 2011

HTML to Plain Text Conversion on GNU / Linux and FreeBSD

I’m realizing that the more I turn into a full-time GUI user, the less coding or other interesting stuff I do…
I remembered the glorious old times when I was a full-time console user, and a nifty trick I used all the time back in the day came to mind.
Back then I quite often wrote shell scripts which fetched (HTML) webpages and converted the HTML content into plain text (TXT) files.

To fetch a page back in those days I used lynx (a very simple UNIX text browser, which by the way lacks support for any CSS or JavaScript) in combination with html2text (an advanced HTML-to-text converter).

Let’s say I wanted to fetch my personal home page, https://www.pc-freak.net/. I did that via the command:

$ lynx -source https://www.pc-freak.net/ | html2text > pcfreak_page.txt

lynx spits out the content of www.pc-freak.net as HTML source and passes it to html2text, whose output is saved in the plain text file pcfreak_page.txt.
The somewhat more advanced elinks (a lynx-like alternative character-mode WWW browser) provides better support for HTML and even some CSS and JavaScript, so to properly save the content of many pages in a plain text file it’s better to use it instead of lynx. The way to produce .txt files using elinks is identical, e.g.:

$ elinks -source https://www.pc-freak.net/blog/ | html2text > pcfreak_blog_page.txt

By the way, back in the day I was more used to links than to the superior elinks. Nowadays I have both text browsers installed, and when I tested fetching a page with links as in the example above and piping it to html2text, it produced garbage output.

This is the time to mention that it’s not even necessary to have a text browser installed in order to fetch a webpage and convert it to plain text (TXT)! The wget file download tool supports dumping the source as well. For all those who haven’t (yet) tried it and want to test it:

$ wget -qO- https://www.pc-freak.net | html2text

Anyway, of course, for some pages the text inside the HTML tags will not get saved properly with either lynx or elinks, because some of the text might be embedded in CSS or JavaScript that elinks and lynx don’t support. In those cases a GUI browser is useful. You can use the embedded File -> Save As (Text Files) functionality of any browser like Firefox, Epiphany or Opera. Below is a screenshot showing an HTML page which I’m about to save as a plain text file in Mozilla Firefox:

[Screenshot: saving an HTML webpage as plain text in Firefox, Iceweasel, Opera etc. on GNU / Linux and FreeBSD]
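
Coming back to the console for a moment: the lynx, elinks and wget commands shown earlier differ only in the tool that dumps the page source, so they can be wrapped in one small helper. What follows is just a sketch under assumptions: the function name fetch_source is made up for illustration, the order in which the tools are tried is arbitrary, and the flags are simply the ones used in the commands above.

```shell
#!/bin/sh
# Sketch: print the raw HTML source of a URL using whichever tool is installed.
# fetch_source is a hypothetical helper name; the flags are the ones used in
# the commands above (wget -qO-, lynx -source, elinks -source).
fetch_source() {
    url=$1
    if command -v wget >/dev/null 2>&1; then
        wget -qO- "$url"
    elif command -v lynx >/dev/null 2>&1; then
        lynx -source "$url"
    elif command -v elinks >/dev/null 2>&1; then
        elinks -source "$url"
    else
        echo "fetch_source: no wget, lynx or elinks found" >&2
        return 127
    fi
}
```

It would be used the same way as the individual commands, e.g. fetch_source https://www.pc-freak.net/ | html2text > pcfreak_page.txt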

Besides being handy in conjunction with text browsers, html2text is also handy for converting .html pages that already exist on the computer’s hard drive to plain text (.TXT) format.
One might wonder why anyone would ever want to do that. Well, I personally prefer reading plain text documents instead of HTML ones 😉
Converting an HTML file that already exists on the hard drive with html2text is done with the command:

$ html2text index.html >index.txt

To convert a whole directory full of .html (documentation or whatever) files to plain text .TXT, cd into the directory with the HTML files and issue this one-liner bash loop:

$ cd html/
html$ for i in *.html; do html2text "$i" > "${i%.html}.txt"; done
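
For anyone who runs this often, the same loop can be wrapped in a function that quotes file names, so that names containing spaces survive. This is only a sketch, assuming bash: the function name html_dir_to_txt and the CONVERTER variable are hypothetical names introduced here (they are not part of html2text), and CONVERTER simply defaults to the html2text command used above.

```shell
#!/usr/bin/env bash
# Sketch: convert every *.html file in a directory to a .txt file next to it.
# CONVERTER is a hypothetical knob (not part of html2text); it defaults to the
# html2text command used in the post, called as: html2text FILE > OUT.
CONVERTER="${CONVERTER:-html2text}"

html_dir_to_txt() {
    local dir="${1:-.}" f
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue      # the glob matched nothing: skip
        # ${f%.html} strips only a trailing ".html" from the file name
        "$CONVERTER" "$f" > "${f%.html}.txt"
    done
}
```

Calling html_dir_to_txt html/ would then write one .txt file beside each .html file in html/.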

Now lean back and enjoy reading the docs like in the good old hacker days, when .TXT files were fashionable 😉


Echo HTML Into Text File [duplicate]

That depends on how bash was built. In the bash of Solaris 11.4 for instance, \x sequences are expanded by default and echo -e outputs -e (as POSIX currently requires). Use printf instead to get a consistent behaviour.
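
To illustrate the printf suggestion, here is a minimal sketch, assuming a POSIX shell. The function name write_hello_page and the HTML content are placeholders made up for this example. printf expands \n and \t in its format string itself, regardless of how echo was built, and the single quotes keep the shell from touching the ! character.

```shell
#!/bin/sh
# Sketch: write a small HTML file with printf instead of echo -e.
# The format string is single-quoted, so the shell leaves "!" and "\" alone;
# printf itself turns \n and \t into newlines and tabs.
# write_hello_page is a hypothetical helper; $1 is the output file.
write_hello_page() {
    printf '<html>\n\t<body>\n\t\t<p>Hello World!</p>\n\t</body>\n</html>\n' > "$1"
}
```

Calling write_hello_page index.html creates the file. One thing to keep in mind with this approach: a literal % in the format string has to be written as %%.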

You can solve this by using single quotes instead of double quotes. So this should work as expected:

echo '<!DOCTYPE html>\n<html>\n\t<body>\n\t\t<h1>Hello World!</h1>\n\t</body>\n</html>' > index.html

When you use single quotes, bash doesn’t try to interpret special characters and simply preserves the literal string.

You are triggering a history expansion in bash with !. Either turn off history expansions with set +H, use a single-quoted string, or use a here-document to write your HTML:

$ cat >index.html <<'END_HTML'
<!DOCTYPE html>
<html>
    <body>
        <h1>Hello World!</h1>
    </body>
</html>
END_HTML

Or, if you want to write out those encoded tabs and newlines as they are:

$ cat >index.html <<'END_HTML'
<!DOCTYPE html>\n<html>\n\t<body>\n\t\t<h1>Hello World!</h1>\n\t</body>\n</html>
END_HTML

History expansions are not triggered within here-documents in bash.
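
One related detail when choosing a here-document: whether the delimiter is quoted decides if the shell expands variables inside the body. A minimal sketch, assuming a POSIX shell; the variable title and the two file names are made up for illustration.

```shell
#!/bin/sh
title='Hello World!'

# Unquoted delimiter: $title is expanded, while "!" is left alone.
cat > expanded.html <<END_HTML
<h1>$title</h1>
END_HTML

# Quoted delimiter: the body is written out literally, character for character.
cat > literal.html <<'END_HTML'
<h1>$title</h1>
END_HTML
```

After running this, expanded.html contains the heading with the text filled in, while literal.html still contains the characters $title.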
