Pandoc: Convert documents between formats
Here's another post from my recent migration to Nikola. On my engine, I stored posts in databse in html, but writing Nikola, I wanted to use reStructuredText format. And while I probably could leave out old posts in html, I thought it would be only for the best to find a simple way, and reformat all my posts into rst files.
After some digging around, I found Pandoc. It supports lots of formats between it can convert documents: Markdown, reStructuredText, AsciiDoc, MediaWiki markup, LaTeX, epub, docx, odt, html, PDF. Even though it does not have python API, it still can easily used in python programs. Such program just needs to store source into a file first, run pandoc in subprocess, and then read the output file.
Using subprocess call, I've been able to convert html documents into rst ones.
import subprocess html_code = 'Here's an html code' with open('doc.html', 'w') as doc_html: doc_html.write(html_code) subprocess.call(['pandoc', 'doc.html', '-w', 'rst', '-o', 'doc.rst']) with open('doc.rst', 'r') as doc_rst: rst_code = doc_rst.read()
Few issues were present within rst documents generated this way, but still I was able to convert my html posts into rst within 15 minutes including correcting those issues.