Sorry again that this report is a bit late. When there’s code to be written, blogging seems to be secondary 😉
Last week’s work described in short: I implemented RPM repository support. And that was a lot more work I would have imagined. First download ands parse repomd.xml to see checksums of the repository, then download filelists.xml.gz, checksum that against the checksum in repomd.xml, re-download if checksums don’t match, parse the file and find man pages in packages, download those packages, extract the man pages from the packages with rpm2cpio and add them as wiki pages.
For the downloading part I used urlgrabber which handles partial downloads, regets, progress meters, mirrorlists etc. automatically. For XML parsing I used cElementTree. I also needed to implement caching and checksumming for the cached file since the filelists.xml.gz file for Fedora 7 Everything is over 6 megs.
This is what diffstat says about last week’s changes:
action/SisterDiff.py | 27 +-
script/xmlrpc/manimport.py | 508 +++++++++++++++++++++++++++++++++++----------
2 files changed, 418 insertions(+), 117 deletions(-)
I actually tested the repository code last night with Fedora 7 updates-testing repository and it works pretty well. Of course doclifter can’t lift all of the man pages to DocBook XML, but those have to be just passed. The main problem that I have now is performance. It took almost 80 minutes of wall time to import the updates-testing repo and that’s probably the smallest there is in the Fedora world. The main reason for the slowness is XML-RPC which seems to just stop and wait for something in every call for about 30 seconds. And when I have to make about 150-200 calls in a repository import, thats a lot of time waiting combined.
One possible way to save some time would be to group the man pages in maybe groups of 10 pages per XML-RPC call. Right now I do that, but only for man pages in the same RPM package. So if there’s five packages with one man page each, it makes five XML-RPC calls. Also as recommended on #moin-dev I could do some profiling to really see where the wait is. If only I had time for that…
Then about the schedule again: As of today, I’m behind on it. Even though I worked for about 50 hours last week, which was already the “slip week”, that’s still the case. And there are things with “phase 1” I still need to do, like updates handling and info importing. I am a bit annoyed at the situation, but on the other hand I have worked really hard lately, but there just was and is a bit too much to do compared to the schedule. Updates handling, especially with the performance problems, is really important to the whole project and so is info importing, I can’t just ignore those.
So the plan for this week is as follows
- At first I’ll implement updates handling, probably with the Python shelve module, with which I can store name-version-release info in a nice dictionary and read it from a file. Then if used in “update mode”, the script can only import man files from new and updated packages, which will save a lot of time.
- Then I’ll do the info page importing which needs to be done something like this: identify the info files when parsing filelists.xml.gz, then download the upstream source packages from CVS and get the Texinfo sources, convert them to DocBook XML with makeinfo and import to the wiki.
- I would really like to take the time to profile the performance problems, or more specifically, what is causing the idle cpu wait time. I’m just not sure if it’s worth it at this point when I’m behind schedule anyway.
- If I decide that I don’t want to spend the time on profiling, I could at least do some small-scale optimizing by making sure that the XML-RPC calls always have about 10 page adds or something like that, so the total number of the calls could be reduced but the calls themselves wouldn’t still get too big.
The updates handling should take about one working day to do well, the info importing could hopefully be done in about half a day. The optimizing could take anywhere from a few hours to many days, depends on how much time I want to spend on it.
Edit: It’s been so long since I blogged that I forgot to categorize this and it didn’t show up on Fedora Planet etc. So I’m updating the timestamp now, sorry if this causes any problems in anyone’s feed reader.