Doc to html github

Converts Word documents into clean HTML

geoffstratton/Doc-and-DocX-to-HTML-Converter

Doc/DocX to HTML Converter

I have this problem: no matter what my official job or title, people keep sending me Word documents that they want posted online to match the web site styling.

Yes, you can use Word to convert documents to HTML, but Microsoft’s version of "HTML" frequently looks worse than if you just pasted in plain text. And yes, you can save Word documents as plain text, but then to use them on the web you have to add in the HTML tags.

Finally I got fed up and wrote a converter to produce minimally formatted HTML that I can copy into common web editors like CKEditor or TinyMCE. The operation is linear: you take a Word .doc or .docx file, drag it onto a Windows form, the program invokes Word, converts your .doc/.docx to the cleanest HTML that Word can manage, parses the HTML using Html Agility Pack, and finally spits out a simple HTML document in Notepad that you can copy-paste into whatever web system you’re using.
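The filtering step in that pipeline can be sketched in a few lines. The original program does this in .NET with Html Agility Pack; the sketch below swaps in Python's standard `html.parser` instead, and the tag whitelist (`KEEP`) and the names `WordHtmlCleaner` and `clean_word_html` are assumptions made for illustration, not part of the project.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: structural tags a web editor usually accepts.
KEEP = {"p", "h1", "h2", "h3", "ul", "ol", "li", "table", "tr", "td", "th",
        "strong", "em", "b", "i", "a", "br"}
VOID = {"br"}                                # tags that never get a closing tag
SKIP_CONTENT = {"head", "style", "script"}   # dropped together with their content


class WordHtmlCleaner(HTMLParser):
    """Keep whitelisted tags, drop every attribute except href on links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append(f'<a href="{escape(href)}">')
            else:
                self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside non-whitelisted tags (span, font, o:p) is kept as-is.
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def clean_word_html(html):
    parser = WordHtmlCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `clean_word_html` on Word's output leaves only the whitelisted tags and their text, which is roughly the "minimally formatted HTML" described above.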

  1. When building this I had the Microsoft.Office.Interop.Word 12.0 (Word 2007) library referenced from the project. The easiest way to meet this requirement is to install some recent version of Office, but any version of the Microsoft.Office.Interop.Word library that natively handles the .docx format should work.
  2. I had the very useful Html Agility Pack version 1.4.6 library referenced as well. I was using .NET 4.0 and the 4.0 version of the library. Html Agility Pack now lives on GitHub, so you can grab it easily and reference it from your project.

Later I realized a better way to do this might be to invoke the LibreOffice converter on the command line, convert your document to HTML or text, filter it with Python’s BeautifulSoup library or sed or Ruby’s Nokogiri, and then insert the results straight into the database of your web system. But maybe not: in text, tags like <table> and <ul> would be lost, and LibreOffice’s HTML is still pretty ugly.
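For the LibreOffice route, a minimal sketch of the command-line step might look like the following. It assumes the LibreOffice binary is on the PATH as `soffice` (on some systems it is `libreoffice`); the function names are hypothetical, and the filtering step is left out.

```python
import subprocess
from pathlib import Path


def libreoffice_command(doc_path, out_dir):
    """Build the headless LibreOffice command that converts one document to HTML."""
    # Assumption: the binary is called "soffice" and is on the PATH.
    return ["soffice", "--headless", "--convert-to", "html",
            "--outdir", str(out_dir), str(doc_path)]


def expected_output(doc_path, out_dir):
    """LibreOffice writes <input name>.html into the output directory."""
    return Path(out_dir) / (Path(doc_path).stem + ".html")


def convert(doc_path, out_dir):
    """Run the conversion and return the path of the generated HTML file."""
    subprocess.run(libreoffice_command(doc_path, out_dir), check=True)
    return expected_output(doc_path, out_dir)
```

The returned HTML file would still need to be filtered before it is inserted into a web system.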

GNU General Public License v3.0

Documents to HTML converter

dmryutov/document2html

Extension    Text   Styles extraction   Images extraction
HTML/XHTML   Yes    Yes                 Yes
XML          Yes    Not applicable      Not applicable
DOCX         Yes    Yes                 Yes
DOC          Yes    No                  No
RTF          Yes    Yes                 Yes
ODT          Yes    Yes                 Yes
XLSX         Yes    Yes                 Yes
XLS          Yes    Yes                 No
CSV          Yes    Not applicable      Not applicable
TXT/MD       Yes    Yes                 Yes
JSON         Yes    Not applicable      Not applicable
EPUB         Yes    Yes                 Yes
PDF          Yes    No                  Yes
PPT          Yes    No                  No

cURL for downloading images:

  • In Ubuntu/Debian: apt-get install libcurl4-openssl-dev
  • In Mac OS X with Homebrew: brew install curl

iconv for encoding conversion:

  • In Ubuntu/Debian: sudo apt-get install libc6
  • In Mac OS X with Homebrew: brew install libiconv

Tidy for cleaning and repairing HTML:

  • In Ubuntu/Debian: sudo apt-get install libtidy-dev
  • In Mac OS X with Homebrew: brew install tidy-html5

file for determining the file extension

  • getoptpp — Command line options parser
  • lodepng — PNG encoder and decoder
  • miniz — Data compression library
  • json — JSON parser
  • pugixml — XML parser

Make sure the Qt (>= 5.6) development libraries are installed:

  • In Ubuntu/Debian: apt-get install qt5-default qttools5-dev-tools zlib1g-dev
  • In Fedora: sudo dnf builddep tiled
  • In Arch Linux: pacman -S qt
  • In Mac OS X with Homebrew:
    • brew install qt5
    • brew link qt5 --force

    Now you can compile by running:

    qmake (or qmake-qt5 on some systems)
    make

    To do a shadow build, run qmake from a different directory and point it at document2html.pro, for example:

    mkdir build
    cd build
    qmake ../src/document2html.pro
    make

    If you have ideas on how to build the project with CMake instead of Qt, please contact me.

    document2html -f|-d -o [-si]
    document2html -h
    document2html -v

    Short flag   Long flag   Description
    -f           --file      Input file
    -d           --dir       Input directory
    -o           --out       Output directory
    -s           --style     Extract styles
    -i           --image     Extract images
    -h           --help      Display help message
    -v           --version   Display package version

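To drive the converter from a script, the flags above can be assembled programmatically. This is only a sketch: the helper name `document2html_args` is hypothetical, and it assumes that `-f`, `-d`, and `-o` each take their path as the next argument and that `-s` and `-i` can be combined as `-si`, as the usage line suggests.

```python
import os


def document2html_args(input_path, out_dir, styles=False, images=False):
    """Assemble a document2html command line from the flags in the table."""
    # -d for an existing directory, -f for anything else (a single file).
    source_flag = "-d" if os.path.isdir(input_path) else "-f"
    args = ["document2html", source_flag, input_path, "-o", out_dir]
    # Assumption: the optional flags combine into one argument such as "-si".
    extras = ("s" if styles else "") + ("i" if images else "")
    if extras:
        args.append("-" + extras)
    return args
```

The resulting list can be passed to `subprocess.run` once the binary has been built and is on the PATH.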
    • rembish — DOC, PPT and PDF converter (PHP)
    • PolicyStat — DOCX converter (Python)
    • python-excel — XLSX and XLS converter (Python)
    • lvu — RTF converter (C++)
    • adhocore — TXT/MD converter (PHP)
    • ahupp — libmagic wrapper (Python)

    If you have questions regarding the libraries, I would like to invite you to open an issue on GitHub. Please describe your request, problem, or question in as much detail as possible, and also mention the version of the libraries you are using as well as the version of your compiler and operating system. Opening an issue on GitHub allows other users and contributors to these libraries to collaborate.

    Convert documents to html and deploy to github pages.

    WqyJh/doc2html

    Convert a document file to HTML and publish it to GitHub Pages.

    Why convert to HTML? I prefer to read technical documents in a web browser, where reading an e-book is like reading a blog: it is easier to zoom, copy text, and so on.

    1. Use ebook-convert provided by calibre to convert source document to .htmlz format.
    2. Use unar to unarchive the .htmlz to web files.
    3. Use git to create a local git repo and commit web files.
    4. Use hub to create a remote repo on github.
    5. Use git to push local files to remote gh-pages branch.
    6. Wait a moment, then read it on https://<username>.github.io/<repository>/
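The six steps above can be sketched as a list of commands. This is not the actual doc2html.py code; the function names, the working directory name, the commit message, and the exact options (`unar -D`, `hub create -p`, pushing `HEAD` to `gh-pages`) are assumptions chosen to match the description.

```python
from pathlib import Path


def publish_commands(doc_path, repo, public=False):
    """Return (working directory, command) pairs for steps 1 to 5."""
    stem = Path(doc_path).stem
    htmlz = f"{stem}.htmlz"
    site = f"{stem}_site"  # hypothetical directory for the unpacked web files
    # Repositories are private unless --public is given, hence -p by default.
    create = ["hub", "create", repo] if public else ["hub", "create", "-p", repo]
    return [
        (".", ["ebook-convert", str(doc_path), htmlz]),   # 1. convert to .htmlz
        (".", ["unar", "-D", "-o", site, htmlz]),         # 2. unpack, no extra folder
        (site, ["git", "init"]),                          # 3. local repository
        (site, ["git", "add", "."]),
        (site, ["git", "commit", "-m", "Add converted document"]),
        (site, create),                                   # 4. remote repository
        (site, ["git", "push", "origin", "HEAD:gh-pages"]),  # 5. publish
    ]


def pages_url(repo):
    """Step 6: the GitHub Pages address for a "username/repository" pair."""
    user, name = repo.split("/", 1)
    return f"https://{user}.github.io/{name}/"
```

Each pair could be executed with `subprocess.run(cmd, cwd=cwd, check=True)`, provided calibre, unar, git, and hub are installed.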

    Support all formats supported by calibre.

    ./doc2html.py <doc_path> <username>/<repository>

    Convert a.epub and publish it to my GitHub repository book-a, which is created as a private repository. To deploy to a public repository, add the --public flag.

    ./doc2html.py /path/to/a.epub WqyJh/book-a

    Convert b.pdf and publish it.

    ./doc2html.py b.pdf WqyJh/book-b

    If the PDF file is scanned, it won’t be converted, because almost all of its content is images, which are slow to load and hard to read.

    How do you determine whether a PDF file is scanned? Generally, a scanned page has only one image and no text, though some have a few lines of meta text. Define a --pdf-threshold (default: 100): if a page has fewer characters than this, the page is treated as scanned. Define a --pdf-rate (default: 0.6): if the ratio of text pages to total pages is less than this, the PDF document is treated as scanned.
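The two thresholds can be illustrated with a short sketch. It takes a list of per-page character counts as input (how those counts are extracted from the PDF is left out), and the function names and the handling of an empty document are assumptions rather than the script's actual code.

```python
def is_scanned_page(char_count, threshold=100):
    """A page with fewer characters than the threshold counts as scanned."""
    return char_count < threshold


def is_scanned_pdf(page_char_counts, threshold=100, rate=0.6):
    """A PDF counts as scanned when too few of its pages are text pages."""
    if not page_char_counts:
        return True  # assumption: a document with no pages has no text to convert
    text_pages = sum(
        1 for count in page_char_counts if not is_scanned_page(count, threshold)
    )
    return text_pages / len(page_char_counts) < rate
```

For example, with the defaults a three-page document with character counts 500, 20, and 30 has one text page out of three, which is below 0.6, so it is treated as scanned.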

    If you still want to convert a scanned PDF, use the --pdf-force switch.

    ./doc2html.py --pdf-force c.pdf WqyJh/book-c

    a quick and dirty script to convert a Word (docx) document to html.

    bradmontgomery/word2html

    Convert a Word Doc to html

    This will give you a command-line script, which you can run:

    $ word2html /path/to/MyGloriousDoc.docx 

    This will give you a new file, /path/to/MyGloriousDoc.html, that’s (hopefully) decent-looking HTML.
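A conversion like this can be sketched with the two libraries the project depends on, pypandoc and pytidylib. This is an illustration under assumptions, not the script's actual code: the function names and the Tidy options are hypothetical, and both third-party packages, along with Pandoc and Tidy themselves, must be installed for `convert_docx` to run.

```python
from pathlib import Path


def html_path_for(docx_path):
    """Output path: the same location and name, with an .html extension."""
    return Path(docx_path).with_suffix(".html")


def convert_docx(docx_path):
    """Convert a .docx file to tidied HTML next to the original file."""
    # Imported here so the helper above works without the third-party packages.
    import pypandoc
    from tidylib import tidy_document

    raw_html = pypandoc.convert_file(str(docx_path), "html")
    # Assumed Tidy options: indent the markup and do not wrap long lines.
    tidy_html, _errors = tidy_document(raw_html, options={"indent": 1, "wrap": 0})
    out_path = html_path_for(docx_path)
    out_path.write_text(tidy_html, encoding="utf-8")
    return out_path
```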

    • This project has NO TESTS! (Feel free to add some if you think it should.)
    • This was last used with Python 3.9 and the dependency versions listed in requirements.txt.

    While this code is MIT-licensed, it uses both pypandoc and pytidylib, both of which depend on other software that may not be MIT-licensed and must be installed for this to work.

    • pytidylib is available under the MIT license, and Tidy is available under an MIT-like license
    • pypandoc is available under the MIT license, while Pandoc is released under the GPL.
