Doc to html github

geoffstratton/Doc-and-DocX-to-HTML-Converter

Doc/DocX to HTML Converter

Converts Word documents into clean HTML

I have this problem: no matter what my official job or title is, people keep sending me Word documents that they want posted online to match the website's styling.

Yes, you can use Word to convert documents to HTML, but Microsoft's version of "HTML" frequently looks worse than if you just pasted in plain text. And yes, you can save Word documents as plain text, but then to use them on the web you have to add the HTML tags back in.

Finally I got fed up and wrote a converter to produce minimally formatted HTML that I can copy into common web editors like CKEditor or TinyMCE. The operation is linear: you take a Word .doc or .docx file, drag it onto a Windows form, the program invokes Word, converts your .doc/.docx to the cleanest HTML that Word can manage, parses the HTML using Html Agility Pack, and finally spits out a simple HTML document in Notepad that you can copy-paste into whatever web system you’re using.

When building this I had the Microsoft.Office.Interop.Word 12.0 (Word 2007) library referenced from the project. The easiest way to meet this requirement is to install some recent version of Office, but any version of the Microsoft.Office.Interop.Word library that natively handles the .docx format should work.
I also had the very useful Html Agility Pack library, version 1.4.6, referenced. I was using .NET 4.0 and the 4.0 build of the library. Html Agility Pack now lives on GitHub, so you can grab it easily and reference it from your project.

Later I realized a better way to do this might be to invoke the LibreOffice converter on the command line, convert your document to HTML or text, filter it with Python's BeautifulSoup library or sed or Ruby's Nokogiri, and then insert the results straight into the database of your web system. But maybe not: in text, tags like <table> and <ul> would be lost, and LibreOffice's HTML is still pretty ugly.
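The LibreOffice route could be sketched roughly as follows. This is only an illustration and not part of the converter: the function names and the tag whitelist are hypothetical, the soffice binary is assumed to be on the PATH, and the standard library's html.parser stands in for BeautifulSoup so that the sketch has no third-party dependency. It keeps a small set of structural tags and drops every attribute.

```python
import subprocess
from html import escape
from html.parser import HTMLParser
from pathlib import Path

# Hypothetical whitelist: tags worth keeping for a web editor.
KEEP = {"p", "h1", "h2", "h3", "ul", "ol", "li", "table", "tr", "td", "th",
        "b", "i", "strong", "em", "br"}
# Elements that are dropped together with everything inside them.
SKIP_CONTENT = {"style", "script", "head"}


class Cleaner(HTMLParser):
    """Re-emit whitelisted tags without attributes; keep text, drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and tag != "br" and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(raw: str) -> str:
    """Strip attributes and non-whitelisted tags from an HTML string."""
    parser = Cleaner()
    parser.feed(raw)
    parser.close()
    return "".join(parser.out)


def convert_with_libreoffice(doc: Path, outdir: Path) -> str:
    """Run LibreOffice headless, then clean the HTML file it writes."""
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "html",
         "--outdir", str(outdir), str(doc)],
        check=True,
    )
    html_file = outdir / (doc.stem + ".html")
    return clean_html(html_file.read_text(encoding="utf-8", errors="replace"))
```

Because <table> and <ul> are on the whitelist, this variant keeps the structure that a plain-text export would lose; whether the result is clean enough depends on the HTML LibreOffice emits for a given document.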

GNU General Public License v3.0

dmryutov/document2html

Documents to HTML converter

Extension	Text	Styles extraction	Images extraction
HTML/XHTML	Yes	Yes	Yes
XML	Yes	Not applicable	Not applicable
DOCX	Yes	Yes	Yes
DOC	Yes	No	No
RTF	Yes	Yes	Yes
ODT	Yes	Yes	Yes
XLSX	Yes	Yes	Yes
XLS	Yes	Yes	No
CSV	Yes	Not applicable	Not applicable
TXT/MD	Yes	Yes	Yes
JSON	Yes	Not applicable	Not applicable
EPUB	Yes	Yes	Yes
PDF	Yes	No	Yes
PPT	Yes	No	No

cURL for downloading images:

apt-get install libcurl4-openssl-dev or brew install curl

iconv for encoding conversion

sudo apt-get install libc6 or brew install libiconv

Tidy for cleaning and repairing HTML

sudo apt-get install libtidy-dev or brew install tidy-html5

file for determining file extension

getoptpp: command line options parser
lodepng: PNG encoder and decoder
miniz: data compression library
json: JSON parser
pugixml: XML parser

Make sure the Qt (>= 5.6) development libraries are installed:

In Ubuntu/Debian: apt-get install qt5-default qttools5-dev-tools zlib1g-dev
In Fedora: sudo dnf builddep tiled
In Arch Linux: pacman -S qt

In Mac OS X with Homebrew:

brew install qt5
brew link qt5 --force

Now you can compile by running:

qmake (or qmake-qt5 on some systems)
make

To do a shadow build, run qmake from a different directory and point it at document2html.pro, for example:

mkdir build
cd build
qmake ../src/document2html.pro
make

If you have ideas on how to build the project with CMake instead of qmake, please contact me.

document2html -f|-d -o [-si]
document2html -h
document2html -v

Short Flag	Long Flag	Description
-f	--file	Input file
-d	--dir	Input directory
-o	--out	Output directory
-s	--style	Extract styles
-i	--image	Extract images
-h	--help	Display help message
-v	--version	Display package version
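For batch use, the flags in the table could be wrapped in a small Python helper. This is a hedged sketch: build_command and convert are hypothetical names, and it assumes that -f, -d and -o each take their value as the following argument and that the document2html binary is on the PATH.

```python
import subprocess
from pathlib import Path


def build_command(source: Path, out_dir: Path, *, is_dir: bool = False,
                  styles: bool = False, images: bool = False) -> list[str]:
    """Assemble a document2html argument list from the flags in the table."""
    # -d selects an input directory, -f a single input file.
    cmd = ["document2html", "-d" if is_dir else "-f", str(source),
           "-o", str(out_dir)]
    if styles:
        cmd.append("-s")  # extract styles
    if images:
        cmd.append("-i")  # extract images
    return cmd


def convert(source: Path, out_dir: Path, **options) -> None:
    """Run the converter; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(source, out_dir, **options), check=True)
```

For example, convert(Path("report.docx"), Path("out"), styles=True, images=True) would run the converter on one file with both extraction flags set.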

rembish — DOC, PPT and PDF converter (PHP)
PolicyStat — DOCX converter (Python)
python-excel — XLSX and XLS converter (Python)
lvu — RTF converter (C++)
adhocore — TXT/MD converter (PHP)
ahupp — libmagic wrapper (Python)

If you have questions regarding the libraries, I invite you to open an issue on GitHub. Please describe your request, problem, or question in as much detail as possible, and mention the versions of the libraries, compiler, and operating system you are using. Opening an issue on GitHub allows other users and contributors to these libraries to collaborate.

WqyJh/doc2html

Convert document file to html and publish it to github pages.

Why convert to HTML? I prefer to read technical documents in a web browser, where e-books read like blogs and are easier to zoom, copy from, and so on.

Use ebook-convert provided by calibre to convert source document to .htmlz format.
Use unar to unarchive the .htmlz to web files.
Use git to create a local git repo and commit web files.
Use hub to create a remote repo on github.
Use git to push local files to remote gh-pages branch.
Wait a moment, then read it at https://<username>.github.io/<repository>/
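The steps above can be sketched as a list of external commands. This is an illustration under stated assumptions, not the script's actual code: plan_publish, the workdir layout, and the commit message are hypothetical, and it assumes ebook-convert, unar, git, and hub are installed. The function only builds the plan; it executes nothing.

```python
from pathlib import Path


def plan_publish(doc: Path, repo: str, workdir: Path, public: bool = False):
    """Return (cwd, argv) pairs mirroring the listed steps; nothing is run."""
    htmlz = workdir / (doc.stem + ".htmlz")
    site = workdir / "site"  # hypothetical directory for the unpacked web files
    # hub creates a public repository unless -p (private) is given.
    create = ["hub", "create", repo] if public else ["hub", "create", "-p", repo]
    return [
        (workdir, ["ebook-convert", str(doc), str(htmlz)]),
        # -o sets the output directory, -D avoids an extra containing folder.
        (workdir, ["unar", "-D", "-o", str(site), str(htmlz)]),
        (site, ["git", "init"]),
        (site, ["git", "add", "."]),
        (site, ["git", "commit", "-m", "Add converted document"]),
        (site, create),
        (site, ["git", "push", "origin", "HEAD:gh-pages"]),
    ]
```

Each pair could then be passed to subprocess.run(argv, cwd=cwd, check=True) in order, so that a failing step stops the pipeline before anything is pushed.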

Support all formats supported by calibre.

./doc2html.py <doc_path> <username>/<repository>

Convert a.epub and publish it to my GitHub repository book-a, which is created as a private repository. To deploy to a public repository, add the --public flag.

./doc2html.py /path/to/a.epub WqyJh/book-a

Convert b.pdf and publish it.

./doc2html.py b.pdf WqyJh/book-b

A scanned PDF will not be converted, because almost all of its content is images, which are slow to load and hard to read.

How is a scanned PDF detected? A scanned page generally contains a single image and no text, though some carry a few lines of metadata text. A page is treated as scanned if it has fewer characters than --pdf-threshold (default: 100). The whole document is treated as scanned if the ratio of text pages to total pages is below --pdf-rate (default: 0.6).
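The two-level rule can be written down in a few lines. This is a minimal sketch, assuming the per-page character counts have already been extracted by some PDF library; is_scanned is a hypothetical name, and treating an empty document as scanned is an assumption of this sketch rather than documented behaviour.

```python
def is_scanned(chars_per_page, pdf_threshold=100, pdf_rate=0.6):
    """Apply the page-level threshold, then the document-level rate.

    chars_per_page: number of extractable text characters on each page.
    """
    if not chars_per_page:
        return True  # assumption: nothing to read, so treat it as scanned
    # A page counts as a text page unless it has fewer characters than the threshold.
    text_pages = sum(1 for n in chars_per_page if n >= pdf_threshold)
    # The document is scanned when too small a share of its pages are text pages.
    return text_pages / len(chars_per_page) < pdf_rate
```

With the defaults, a three-page file with counts [0, 5, 2000] has one text page out of three and is classified as scanned, while [500, 500, 10] has two out of three and is not.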

If you still want to convert a scanned PDF, use the --pdf-force switch.

./doc2html.py --pdf-force c.pdf WqyJh/book-c

a quick and dirty script to convert a Word (docx) document to html.

bradmontgomery/word2html

Convert a Word Doc to html

This will give you a command-line script, which you can run:

$ word2html /path/to/MyGloriousDoc.docx

This will give you a new file, /path/to/MyGloriousDoc.html , that’s (hopefully) decent-looking html.
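The README names pypandoc and pytidylib as the script's dependencies, so the conversion plausibly looks something like the sketch below. This is not the project's actual code: html_path_for and word_to_html are hypothetical names, and both third-party packages (plus the Pandoc and Tidy programs they wrap) are assumed to be installed.

```python
from pathlib import Path


def html_path_for(docx: Path) -> Path:
    """Output path: same directory and stem as the input, .html suffix."""
    return docx.with_suffix(".html")


def word_to_html(docx: Path) -> Path:
    """Convert a .docx file with Pandoc, tidy the result, write it next to the input."""
    # Third-party imports are kept local so the helper above works without them.
    import pypandoc
    from tidylib import tidy_document

    raw = pypandoc.convert_file(str(docx), "html")
    tidied, _errors = tidy_document(raw)  # returns (document, error text)
    target = html_path_for(docx)
    target.write_text(tidied, encoding="utf-8")
    return target
```

Calling word_to_html(Path("/path/to/MyGloriousDoc.docx")) would write /path/to/MyGloriousDoc.html, matching the behaviour described above.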

This project has NO TESTS! (Feel free to add some if you think it should.)
This was last used with Python 3.9 and the dependency versions listed in requirements.txt.

While this code is MIT-licensed, it uses both pypandoc and pytidylib, each of which depends on other software that may not be MIT-licensed and must be installed for this to work.

pytidylib is available under the MIT license, and Tidy is available under an MIT-like license
pypandoc is available under the MIT license, while Pandoc is released under the GPL.
