geoffstratton/Doc-and-DocX-to-HTML-Converter
Doc/DocX to HTML Converter
Converts Word documents into clean HTML
I have this problem: no matter what my official job or title, people keep sending me Word documents that they want posted online to match the web site styling.
Yes, you can use Word to convert documents to HTML, but Microsoft’s version of "HTML" frequently looks worse than if you just pasted in plain text. And yes, you can save Word documents as plain text, but then to use them on the web you have to add the HTML tags yourself.
Finally I got fed up and wrote a converter to produce minimally formatted HTML that I can copy into common web editors like CKEditor or TinyMCE. The operation is linear: you take a Word .doc or .docx file, drag it onto a Windows form, the program invokes Word, converts your .doc/.docx to the cleanest HTML that Word can manage, parses the HTML using Html Agility Pack, and finally spits out a simple HTML document in Notepad that you can copy-paste into whatever web system you’re using.
- When building this I had the Microsoft.Office.Interop.Word 12.0 (Word 2007) library referenced from the project. The easiest way to meet this requirement is to install some recent version of Office, but any version of the Microsoft.Office.Interop.Word library that natively handles the .docx format should work.
- I had the very useful Html Agility Pack version 1.4.6 library referenced as well. I was using .NET 4.0 and the 4.0 version of the library. Html Agility Pack now lives on GitHub, so you can grab it easily and reference it from your project.
Later I realized a better way to do this might be to invoke the LibreOffice converter on the command line, convert your document to HTML or text, filter it with Python’s BeautifulSoup library or sed or Ruby’s Nokogiri, and then insert the results straight into the database of your web system. But maybe not: in text, tags like <table> and <ul> would be lost, and LibreOffice’s HTML is still pretty ugly.
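The LibreOffice idea above can be sketched in Python. This is only an illustration under stated assumptions, not part of the converter itself: it assumes the `soffice` binary is on the PATH, it uses the standard-library `html.parser` in place of BeautifulSoup so that it runs without extra packages, and the `KEEP` whitelist and the function names are hypothetical.

```python
import html
import subprocess
from html.parser import HTMLParser
from pathlib import Path

# Tags that survive cleaning. This whitelist is an assumption for illustration.
KEEP = {"p", "h1", "h2", "h3", "ul", "ol", "li", "table", "tr", "td", "th", "strong", "em", "a"}
# Elements whose entire content is discarded.
SKIP_CONTENT = {"head", "style", "script"}


class Cleaner(HTMLParser):
    """Drops all attributes (except href on links) and all non-whitelisted tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            href = dict(attrs).get("href") if tag == "a" else None
            self.out.append(f'<a href="{html.escape(href)}">' if href else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(html.escape(data, quote=False))


def clean_html(markup):
    """Return markup reduced to whitelisted, attribute-free tags."""
    cleaner = Cleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.out).strip()


def libreoffice_command(doc_path, out_dir):
    """Build the headless LibreOffice call that writes <name>.html into out_dir."""
    return ["soffice", "--headless", "--convert-to", "html", "--outdir", str(out_dir), str(doc_path)]


def convert(doc_path, out_dir):
    """Run LibreOffice, then clean the HTML it produced."""
    subprocess.run(libreoffice_command(doc_path, out_dir), check=True)
    produced = Path(out_dir) / (Path(doc_path).stem + ".html")
    return clean_html(produced.read_text(encoding="utf-8", errors="replace"))
```

Keeping a whitelist rather than a blacklist means unknown wrapper tags such as `span` or `font` disappear while their text is preserved, which is roughly the "minimally formatted" output described earlier.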
License: GNU General Public License v3.0
dmryutov/document2html
Documents to HTML converter
Extension | Text | Styles extraction | Images extraction |
---|---|---|---|
HTML/XHTML | Yes | Yes | Yes |
XML | Yes | Not applicable | Not applicable |
DOCX | Yes | Yes | Yes |
DOC | Yes | No | No |
RTF | Yes | Yes | Yes |
ODT | Yes | Yes | Yes |
XLSX | Yes | Yes | Yes |
XLS | Yes | Yes | No |
CSV | Yes | Not applicable | Not applicable |
TXT/MD | Yes | Yes | Yes |
JSON | Yes | Not applicable | Not applicable |
EPUB | Yes | Yes | Yes |
PDF | Yes | No | Yes |
PPT | Yes | No | No |
- cURL for downloading images: apt-get install libcurl4-openssl-dev or brew install curl
- iconv for encoding conversion: sudo apt-get install libc6 or brew install libiconv
- Tidy for cleaning and repairing HTML: sudo apt-get install libtidy-dev or brew install tidy-html5
- file for determining file extension
- getoptpp: Command line options parser
- lodepng: PNG encoder and decoder
- miniz: Data compression library
- json: JSON parser
- pugixml: XML parser
Make sure the Qt (>= 5.6) development libraries are installed:
- In Ubuntu/Debian: apt-get install qt5-default qttools5-dev-tools zlib1g-dev
- In Fedora: sudo dnf builddep tiled
- In Arch Linux: pacman -S qt
- In Mac OS X with Homebrew:
  - brew install qt5
  - brew link qt5 --force
Now you can compile by running:
qmake (or qmake-qt5 on some systems)
make
To do a shadow build, you can run qmake from a different directory and point it at document2html.pro, for example:
mkdir build
cd build
qmake ../src/document2html.pro
make
If you have ideas on how to build the project with CMake instead of Qt, please contact me.
document2html -f|-d -o [-si]
document2html -h
document2html -v

Short flag | Long flag | Description |
---|---|---|
-f | --file | Input file |
-d | --dir | Input directory |
-o | --out | Output directory |
-s | --style | Extract styles |
-i | --image | Extract images |
-h | --help | Display help message |
-v | --version | Display package version |

- rembish: DOC, PPT and PDF converter (PHP)
- PolicyStat: DOCX converter (Python)
- python-excel: XLSX and XLS converter (Python)
- lvu: RTF converter (C++)
- adhocore: TXT/MD converter (PHP)
- ahupp: libmagic wrapper (Python)
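For scripted use, the command-line flags listed in the usage section above could be assembled from Python roughly as follows. This is a sketch under assumptions: it presumes a built `document2html` binary on the PATH, that `-f` or `-d` is followed by the input path and `-o` by the output directory, and the helper name `document2html_command` is hypothetical.

```python
import subprocess


def document2html_command(output_dir, input_file=None, input_dir=None, styles=False, images=False):
    """Build an argument list using the -f/-d, -o, -s and -i flags from the usage text."""
    if (input_file is None) == (input_dir is None):
        raise ValueError("pass exactly one of input_file or input_dir")
    cmd = ["document2html"]
    cmd += ["-f", input_file] if input_file is not None else ["-d", input_dir]
    cmd += ["-o", output_dir]
    if styles:
        cmd.append("-s")  # extract styles
    if images:
        cmd.append("-i")  # extract images
    return cmd


def run_document2html(*args, **kwargs):
    """Run the converter and raise if it exits with a non-zero status."""
    subprocess.run(document2html_command(*args, **kwargs), check=True)
```

For example, `document2html_command("out", input_file="report.docx", styles=True, images=True)` yields the arguments for converting one file with both styles and images extracted.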
If you have questions regarding the libraries, I would like to invite you to open an issue on GitHub. Please describe your request, problem, or question in as much detail as possible, and also mention the version of the libraries you are using as well as the version of your compiler and operating system. Opening an issue on GitHub allows other users and contributors to these libraries to collaborate.
WqyJh/doc2html
Convert a document file to HTML and publish it to GitHub Pages.
Why convert to HTML? I prefer to read technical documents in a web browser, where an e-book reads like a blog and is easier to zoom, copy from, and so on.
- Use ebook-convert provided by calibre to convert source document to .htmlz format.
- Use unar to unarchive the .htmlz to web files.
- Use git to create a local git repo and commit web files.
- Use hub to create a remote repo on GitHub.
- Use git to push local files to remote gh-pages branch.
- Wait a moment, then read it at https://<username>.github.io/<repository>/
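The steps above can be sketched as a list of external commands. This is an illustrative plan, not the actual contents of doc2html.py: the intermediate file names (`book.htmlz`, `site`), the commit message, and the function names are assumptions, and the exact options the script passes to each tool may differ.

```python
import subprocess
from pathlib import Path


def publish_plan(doc_path, repo, workdir, public=False):
    """Return (working directory, argv) pairs mirroring the listed steps."""
    work = Path(workdir)
    archive = work / "book.htmlz"   # hypothetical intermediate archive name
    site = work / "site"            # hypothetical directory for the web files
    # hub creates a public repository unless -p (private) is given.
    create = ["hub", "create"] + ([] if public else ["-p"]) + [repo]
    return [
        (work, ["ebook-convert", str(doc_path), str(archive)]),
        # unar may place the files in a subdirectory of `site`; adjust as needed.
        (work, ["unar", "-o", str(site), str(archive)]),
        (site, ["git", "init"]),
        (site, ["git", "add", "."]),
        (site, ["git", "commit", "-m", "Add converted book"]),
        (site, create),
        (site, ["git", "push", "origin", "HEAD:refs/heads/gh-pages"]),
    ]


def run_plan(plan):
    """Execute each command in order, stopping at the first failure."""
    for cwd, argv in plan:
        subprocess.run(argv, cwd=cwd, check=True)
```

Separating the plan from its execution makes it possible to print or inspect the commands before anything touches the network.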
Supports all input formats that calibre supports.
./doc2html.py <doc_path> <username>/<repository>
Convert a.epub and publish it to my GitHub repository book-a, which is created as a private repository. To deploy to a public repository, add the --public flag.
./doc2html.py /path/to/a.epub WqyJh/book-a
Convert b.pdf and publish it.
./doc2html.py b.pdf WqyJh/book-b
A scanned PDF is not converted, because almost all of its content is images, which are slow to load and hard to read.
How is a PDF judged to be scanned? A scanned page generally has a single image and no text, although some have a few lines of metadata text. The --pdf-threshold option (default: 100) sets the minimum character count: a page with fewer characters is treated as scanned. The --pdf-rate option (default: 0.6) sets the minimum ratio of text pages to total pages: a document below that ratio is treated as scanned.
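The two-level rule above can be written as a small function. This is a minimal sketch that assumes the per-page character counts have already been extracted by some PDF library; the function names and the handling of an empty document are assumptions, not taken from doc2html.py.

```python
def is_scanned_page(char_count, threshold=100):
    """A page with fewer characters than the threshold counts as scanned."""
    return char_count < threshold


def is_scanned_pdf(page_char_counts, threshold=100, rate=0.6):
    """A document counts as scanned when too few of its pages carry text."""
    if not page_char_counts:
        # Assumption: a document with no pages has no text, so treat it as scanned.
        return True
    text_pages = sum(1 for n in page_char_counts if not is_scanned_page(n, threshold))
    return text_pages / len(page_char_counts) < rate
```

With the defaults, a four-page document with counts `[1200, 900, 30, 1500]` has three text pages out of four (0.75), so it is converted, while `[0, 12, 40, 800]` has one out of four (0.25) and is skipped.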
If you still want to convert a scanned PDF, use the --pdf-force switch.
./doc2html.py --pdf-force c.pdf WqyJh/book-c
a quick and dirty script to convert a Word (docx) document to html.
bradmontgomery/word2html
Convert a Word Doc to html
This will give you a command-line script, which you can run:
$ word2html /path/to/MyGloriousDoc.docx
This will give you a new file, /path/to/MyGloriousDoc.html , that’s (hopefully) decent-looking html.
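The licensing notes in this README name pypandoc and pytidylib as the dependencies, so the conversion presumably resembles the sketch below. It is an assumption about the general approach rather than the script's actual code: it requires Pandoc and Tidy to be installed, and the function names and Tidy options are illustrative.

```python
from pathlib import Path


def html_path_for(docx_path):
    """Derive the output path by swapping the extension for .html."""
    return Path(docx_path).with_suffix(".html")


def convert(docx_path):
    """Convert a .docx file to tidied HTML written next to the source file."""
    # Imported here so html_path_for stays usable without the dependencies.
    import pypandoc
    from tidylib import tidy_document

    raw_html = pypandoc.convert_file(str(docx_path), "html")
    # tidy_document returns the cleaned document and a string of warnings.
    tidy_html, _warnings = tidy_document(raw_html, options={"indent": 1, "wrap": 0})
    out_path = html_path_for(docx_path)
    out_path.write_text(tidy_html, encoding="utf-8")
    return out_path
```

Calling `convert("/path/to/MyGloriousDoc.docx")` would write `/path/to/MyGloriousDoc.html`, matching the behavior described above.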
- This project has NO TESTS! (Feel free to add some if you think it should.)
- This was last used with python 3.9 and the dependency versions listed in requirements.txt
While this code is MIT-licensed, it uses pypandoc and pytidylib, both of which depend on other software that may not be MIT-licensed and must be installed for this to work.
- pytidylib is available under the MIT license, and Tidy is available under an MIT-like license
- pypandoc is available under the MIT license, while Pandoc is released under the GPL.