frugal technology, simple living and guerrilla large-appliance repair
Thu, 06 Oct 2016

Converting WordPress posts to files for a static site

I'm exploring ways to take WordPress blogs and semi-automatically convert them into heaps of individual static files for use in blogging systems like Ode that take text files and turn them into HTML either on the fly or via a static-site engine.

I think it's going to take a combination of at least two existing tools plus some scripting on my part to take what those tools create and further process the files for Ode.

I tried two WordPress plugins that didn't work at all: WP Static HTML Output and Static Snapshot.

A third WordPress plugin, Really Static, did not look promising, and I didn't try it.

I tested the HTTrack Website Copier -- there's even a Fedora package for it -- and it pretty much downloaded the entire WordPress blog as a fully baked static site. But it didn't produce files or a file structure compatible with any other blogging software.

Still, I think HTTrack will be valuable for extracting the images from WordPress sites for use in other blogging systems.
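
For reference, the basic HTTrack invocation is simple. This is a minimal sketch -- your-blog-here.com stands in for the real site, and wp-mirror is just a name I picked for the output directory:

httrack "http://your-blog-here.com/" -O wp-mirror -v

The -O switch sets the output directory and -v gives verbose progress output.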

I tried another method, from a post titled Creating a static copy of a dynamic website, that uses wget (which HTTrack also uses) with a ton of command-line switches.

In case the above site disappears, here is what you do:

The command line, in short…

wget -k -K -E -r -l 10 -p -N -F --restrict-file-names=windows -nH http://website.com/

…and the options explained

-k : convert links to relative
-K : keep original versions of files without the conversions made by wget
-E : rename HTML files to .html (if they don't already have an htm(l) extension)
-r : recursive -- of course we want to make a recursive copy
-l 10 : the maximum level of recursion. If you have a really big website you may need a higher number, but 10 levels should be enough.
-p : download all necessary files for each page (CSS, JS, images)
-N : turn on time-stamping.
-F : when input is read from a file, force it to be treated as an HTML file.
-nH : by default, wget puts files in a directory named after the site's hostname. This disables those hostname directories and puts everything in the current directory.
--restrict-file-names=windows : may be useful if you want to copy the files to a Windows PC.

This is a cool exercise, and it pretty much produces what you get with HTTrack. Cool but not useful.

Along these lines but aiming for something that's actually useful, I could use wget and just target the images.

Here's where the good stuff starts

It's not all bad. I just tried a Ruby gem called wp2middleman, which takes the XML you export from WordPress and turns it into individual static files (either HTML- or Markdown-formatted) with a YAML-style header carrying the title, date and tags.

You get the XML from the WordPress Dashboard (under Tools -- Export). Then you process that XML file with wp2middleman.

If you already have Ruby and Ruby Gems set up, getting the gem is as easy as:

gem install wp2middleman

Then you can produce a full filesystem with individually named files with:

wp2mm your_wordpress.xml
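
Each post comes out as its own file, with the YAML-style header the gem generates at the top. It looks roughly like this -- the layout below is illustrative, not copied from an actual export:

---
title: A Sample Post
date: 2016-10-06
tags: wordpress, migration
---

The post body, in Markdown or HTML, follows the header.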

That gets you the files. Not the images. I'd use HTTrack or some similar tool to get those.

That I can work with. "All" I'd have to do is convert the YAML to Ode's title and Indexette date format, rewrite the image links to conform to whatever I have going on my Ode site, and then change the file suffixes from .html or .markdown to .txt.
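
As a rough sketch of just that last step, a shell loop would do the suffix change -- the YAML-to-Ode header rewrite would need its own pass, and the glob patterns are my assumptions about what wp2mm names its output:

for f in *.markdown *.html; do
    [ -e "$f" ] || continue             # skip patterns that matched nothing
    mv "$f" "${f%%.*}.txt"              # strip the old suffix(es), add .txt
done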

I think I can do that.

Update: Getting the images from a WordPress blog with wget is easy. Stack Overflow has it: How do I use Wget to download all Images into a single Folder

There is enough info there to get them into a single folder, or into a directory structure that could make it easier to call the images into your non-WP blog. I did both as a test -- the first command below (with -nd) dumps every image into one directory, while the second (without it) preserves the site's directory structure:

wget -nd -r -l 2 -A jpg,jpeg,png,gif -e robots=off http://your-blog-here.com

wget -r -l 2 -A jpg,jpeg,png,gif -e robots=off http://your-blog-here.com