Tuesday, January 8, 2008

Getting a static version of a wiki

I love using wikis. I use one whenever I need to communicate and log information in a project with at least one other team member. Even when I am the only user, I think the flexibility of a wiki is great, and the information is always available in a familiar format (the web).

But from time to time a project needs some stable documentation, typically at release time or at any other milestone. Since I'm not planning to duplicate information in other formats (like Word documents or Excel sheets), I need a way to extract the information from the wiki in a static format. Another reason to convert a wiki to a static format is to make it available to people who don't have access to the infrastructure, or whom you don't want to force to install that infrastructure just to consult the wiki.

Some wikis can export their contents to another format, like PDF files or static HTML pages, but most of the time that functionality is not easy to use or simply not available. For example, I use JSPWiki a lot. It is a great, very lightweight wiki, not least because it stores its content in plain text files (no need to set up a database, and you can always just read the text files if for some reason the software is not available). But I haven't found a good plugin to extract the wiki as a static website, which is exactly what I want.

For the moment, I use a combination of the following Unix tools to spider a JSPWiki instance and convert it to a static version.

# Start with a clean target directory
rm -rf static-wiki
mkdir static-wiki
cd static-wiki
# Spider the wiki: accept only content pages and assets, reject diffs and page info
# (quote the patterns so the shell does not expand them before wget sees them)
wget -A '*Wiki.jsp*,*.jpg,*.gif,*.css,*.html' -R '*Diff.jsp*,*PageInfo.jsp*' \
-r -l2 -k --http-user=user --http-password=***** http://server/wiki/
cd server/wiki
# Replace the "?" in saved filenames and in the links inside the pages
rename 's/jsp\?/jsp_/' *
find ./* -type f -exec sed -i 's/jsp?/jsp_/g' {} \;

Some remarks:
  • Use wget with accept (-A) and reject (-R) options to filter the correct information. By default wget would fetch all possible links, including edit forms, history pages and all differences.
  • The level of spidering (-l2) might need to be adjusted, depending on how your wiki is structured. I would expect that setting the level higher has no big impact if you use the accept/reject options. Note that wget stays on the starting host by default, so external references (links outside the wiki domain) are not spidered unless you explicitly pass -H.
  • JSPWiki uses only a few JSP pages to show the content of the wiki; the resource to be shown is passed as a parameter in the URL. When wget saves those pages, they get names like "show.jsp?page=SomePage". Unfortunately, a browser will not correctly follow links in this format, so you need to rename all files and rewrite the links in the HTML pages (replacing "?" with "_", for instance).
There should be a better way, but this works just fine...
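One caveat with the script above: the rename command it uses is the Perl version, and on some systems rename follows a different (util-linux) syntax. As a sketch, the same renaming step can also be done with a plain bash loop, so it works regardless of which rename is installed:

```shell
# Rename every saved file containing a "?" so a browser can follow the links,
# e.g. "Wiki.jsp?page=SomePage" becomes "Wiki.jsp_page=SomePage".
# Uses bash's ${var//pattern/replacement} substitution; run it in server/wiki.
for f in *\?*; do
  [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
  mv -- "$f" "${f//\?/_}"
done
```

The sed step from the script is still needed afterwards to rewrite the links inside the HTML pages themselves.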
