Archive for the ‘MoinMoin’ Category

The Google Highly Open Participation Contest and MoinMoin

Wednesday, November 28th, 2007

I haven’t blogged in a long time, but this is something I think deserves to be mentioned.

Today Google announced The Google Highly Open Participation Contest, a program for pre-university students. They complete tasks for one of the 10 participating open source projects, and Google pays them up to $500 for it.

MoinMoin, the wiki engine used by Fedora, is one of the open source projects in this program. So if you’re a high school student who’d like to contribute to a really nice open source Python-based wiki engine and earn a few bucks while doing it, go see MoinMoin’s GHOP 2007 page. Remember, there are also tasks other than coding, like translation, marketing, testing etc.!

Weekly report: week 34

Tuesday, August 28th, 2007

Last week was largely about fighting with Texinfo. As I’ve probably said earlier, makeinfo --docbook is supposed to produce DocBook XML. But it doesn’t really work that well. In the best case, it produces something that is well-formed XML but not valid DocBook if you validate it against the DTD, which means Moin can’t show an HTML rendering of it. In the worst case, makeinfo just segfaults.

In hopes of getting better HTML output from the pseudo-DocBook XML Texinfo files, I tried using Moin with the latest DocBook XSL stylesheets, 1.73.1. That resulted in a 4Suite error; see the Moin bug report I filed.

So I joined #docbook and #4suite to report my problems. Michael Smith, one of the DocBook XSL maintainers, was interested and made a workaround that will be in the next release of the stylesheets. See the SVN commit here. Thanks! Uche Ogbuji, a 4Suite developer, was also interested in seeing whether I’d hit a 4Suite bug. He knows the details now, so if it really is a bug, it’ll hopefully get fixed when he has the time to take a look.

In those discussions, Michael convinced me to drop docbook2X and use the DocBook XSL stylesheets for the DocBook refentry -> man source transformation. So I modified the ManSource action to use them via xsltproc. I also implemented caching, so that xsltproc doesn’t have to be called again when the same man source has already been generated. ManSource should handle concurrent requests too, but I haven’t done much testing on that.
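
For the curious, the caching idea is roughly this (a simplified sketch, not the actual ManSource code; the helper names and the exact xsltproc invocation here are just illustrative):

```python
import hashlib
import os
import subprocess

def cache_path(cache_dir, docbook_xml):
    # Cache key: hash of the DocBook source, so identical sources
    # reuse the previous xsltproc output.
    return os.path.join(cache_dir,
                        hashlib.sha1(docbook_xml).hexdigest() + ".man")

def docbook_to_man(docbook_xml, stylesheet, cache_dir):
    """Transform DocBook refentry XML into man source via xsltproc,
    reusing a cached result when one exists. A sketch only: the real
    ManSource action lives inside Moin and also handles concurrency."""
    cached = cache_path(cache_dir, docbook_xml)
    if not os.path.exists(cached):
        result = subprocess.run(["xsltproc", stylesheet, "-"],
                                input=docbook_xml,
                                stdout=subprocess.PIPE, check=True)
        with open(cached, "wb") as f:
            f.write(result.stdout)
    with open(cached, "rb") as f:
        return f.read()
```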

Michael had an idea that some of the pseudo-DocBook that makeinfo --docbook produces could be fixed into valid DocBook with a stylesheet. So I have collected some of these XML files; take a look if you’re interested. You can try validating them with e.g. xmllint --noout --valid xml-file.xml. If Michael has the time to come up with such a stylesheet, that would be great, since right now makeinfo --docbook produces only two valid DocBook XML files from the whole Fedora updates-released repository!

Yesterday I added support to the manimport script for collecting these “well-formed XML but not DocBook” files, so anyone can do the same and take a look at what makeinfo --docbook outputs. I also added logging to a file via the Python logging module, so that you can analyze what happened later on and don’t need to save the output manimport prints to the console.
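
The logging setup is nothing fancy, something along these lines (a sketch; the real manimport wiring may differ a bit):

```python
import logging

def setup_import_logging(logfile):
    """Send manimport's progress messages to a file so a long import run
    can be analyzed afterwards, instead of scrolling by on the console."""
    logger = logging.getLogger("manimport")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```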

What I desperately need is a better way to do Texinfo to DocBook conversions. Either makeinfo --docbook should be fixed or a new converter should be written. Maybe I could report some of these things to the GNU Texinfo people in the hope of a better makeinfo. Ideas?

As this is the final full work week of Summercode, here’s a quick list of what will be and what won’t be done:

As the most important point, the editing/storage format in the wiki will be DocBook, not wiki markup. The main reason is that I won’t have the time to do wiki markup correctly, meaning without loss of information. This is mostly because of the problems I’ve had with external tools and conversions during my project. Having wiki markup is still possible, though. Probably the best way would be to write a DocBook refentry -> wiki markup XSL stylesheet. Information loss could be avoided by inserting Moin macros where the wiki markup isn’t expressive enough, as suggested by Florian Festi. Those macros would then activate themselves when doing the wiki markup -> DocBook XML conversion. But if the users who edit the man pages in the wiki would have to use Moin macros correctly, might it actually be easier for them to keep DocBook as the format?

Also, there won’t be that much testing during Summercode. Of course I’m testing all the time locally, but large scale testing will hopefully happen after Summercode.

(I’ll talk only about man pages here; info pages are also mostly supported, but if makeinfo’s output can’t be fixed, there won’t be many info pages to use…)
Importing man pages from Fedora’s repositories, with support for repository updates and different releases, works. Diffing between the same man page in different releases works. Getting the man source of a man page in the wiki works. Comparing the changes in the wiki to the original man source is not done yet, but that’s on my agenda for this week. There also aren’t index pages yet, so you’ll have to find the man page you want with Moin’s search, but I’d like to add some index page generation to manimport this week too.

There won’t be any kind of official release of my work yet. I’d like to have it all working on my public test instance, though, so that some of the features can be shown. The reason for not really releasing anything is that my work is based on Moin 1.7, which is still a quite fast-moving target. It’ll hopefully be released some time next year, and my goal is to have all of my stuff merged into 1.7 main in time for the release. That will also mean making some of the import code more general, as it’s quite Fedora/RPM specific right now. Still, none of this prevents Fedora from running my man/info wiki stuff; the underlying Moin just won’t be officially stable, so there might be some bugs etc.

Here’s my plan for this week:

  • Going through the info-importing logs I collected yesterday and seeing if there’s something that could be done to improve info importing without touching makeinfo
  • The “diff to upstream source” implementation
  • Index pages
  • Setting everything up to my public test instance
  • Testing

Weekly report: week 33

Tuesday, August 21st, 2007

Last week introduced a lot of code cleanups, some of which will also hopefully save some memory. I also made a fix for one problem we had with my Fedora test environment, but I’m not sure if it’ll go to Moin main branch, as it seems Moin was meant to give up and crash at that point. Then I implemented a man/info page exclusion feature, which can be used to avoid processing files that are known to cause problems with doclifter. I also did some base work for this week’s new features.

Sadly, last week was a bit unproductive for me. I wrote some code on Monday, then Tuesday was more about planning what should be done with the storage/editing format. I decided to use DocBook for now, implement all of the other features around it, and then see if there’s time to do the DocBook -> wiki markup -> DocBook conversions. This seems to be OK with my mentor too. On Wednesday I had a terrible headache the whole day, probably caused by the hot weather we had, so I couldn’t work at all. On Thursday and Friday I was back to coding, but the weekend was busy with other things.

Happily, I made up some of the missed coding hours yesterday with a nearly 12-hour coding spree 😉 I implemented a ManSource action, which produces man source from the DocBook XML that’s in the wiki and gives the man file to the user. Info support will also happen soon.

This week I’m going to implement an action for getting a diff of the wiki version against the upstream man/info source. Then there’s an interesting thing missing: Moin can’t display lists in info pages converted to DocBook, so I guess I should try to add support for those to the XSLT stylesheets Moin uses for rendering DocBook XML as HTML.

After those are done, I should have all the features completed except the wiki markup stuff. I’m starting to think that I should spend the two weeks I have left of Summercode on adding new features rather than on testing. After Summercode is over and I can also accept patches from the community, we could do some testing in the community and fix bugs together. The only thing I actually (still) have doubts about is the updates handling. Sometimes it seems like it doesn’t notice all non-updated packages and does useless work…

But I guess that’s it for this week, back to coding now 🙂

Weekly report: week 32

Tuesday, August 14th, 2007

Last week I implemented support for translated man pages, fixed yet more bugs, and did some serious code cleanups and refactoring as suggested by Thomas Waldmann and pylint. I got a public test wiki from the Fedora Infrastructure project; Paulo Santos and I spent about a day setting it up, with some help from Mike McGrath. Thanks guys 🙂

It took such a long time because the Moin instance runs as the “apache” user while my import script runs as whatever user you happen to be on the console, so permissions have to be set up so that both apache and the script have enough rights to the wiki data directory. Right now there’s some problem with the wiki installation (again), so I won’t give a link yet, especially because Paulo is on holiday; we’ll probably get it fixed once he comes back.

Then, when testing my script there, I hit a serious doclifter problem. With one man page, libbind-getaddrinfo.3, my script seemed to go crazy and eat all the memory on the server. I then tested it locally: same thing. At first I thought it was caused by my script or by Moin, and I spent hours trying to debug them. When I eventually found nothing, I realized it is a doclifter problem / bug. So I had to implement filtering in my script: you can now pass it a file listing the names of the files that doclifter can’t handle, and my script will skip those. I also noticed that some packages have man files as large as 5-6 MB, which took about two hours each to handle, so I made my script skip any files bigger than 1 MB.
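
The filtering itself is simple; conceptually it’s something like this (a sketch with hypothetical names, not the actual script code):

```python
import os

def should_import(path, excluded, max_bytes=1024 * 1024):
    """Decide whether a man file is worth importing: skip files named in
    the exclusion list (known doclifter problems) and anything over 1 MB,
    which would take hours to process."""
    if os.path.basename(path) in excluded:
        return False
    return os.path.getsize(path) <= max_bytes
```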

Wiki storage format and editing

I have done some testing and research on the Moin 1.6 DocBook branch from last year’s GSoC. It seems that I can’t use it as I originally planned, since it has no support for the refentry parts of DocBook. Refentry is the DocBook element for representing Unix man pages, and that’s what doclifter produces. So I don’t have a direct way of converting the DocBook XML to Moin wiki pages. I could probably try to enhance the XSLT stylesheets that do DocBook to Moin conversions; that way we would have man/info pages in wiki markup. But it would probably be much more difficult to do Moin wiki page -> DocBook XML conversions without information loss.

As mentioned, DocBook has elements especially meant for man pages, so man -> DocBook -> man can be done with little if any information loss. But wiki markup has no such elements, so man -> wiki -> man, even with DocBook as an intermediate format, can easily cause a lot of information loss.

With that in mind, here’s my suggestion: we have about three weeks of Summercode Finland left, so I think I should keep DocBook as the storage format and concentrate on exporting changes / patches from the wiki against the upstream man/info files. That way we would hopefully have the ability to edit man/info pages in a wiki and send patches from those edits upstream. The edits would have to be made in DocBook XML, which is probably not as user-friendly as wiki markup, but the basic idea / framework would be there. Then, if it seems possible and someone with more XSLT/XML/DocBook experience than me is interested in helping, DocBook refentry -> wiki -> DocBook refentry conversions could be added later.

Readers, what do you think?

Weekly report: week 31, bugs, connection troubles, Assembly

Monday, August 6th, 2007

Last week didn’t actually go according to my plans, but I did get some major improvements done. I improved repository metadata handling a lot; the code is much cleaner and should also work better now. I did a great deal of testing on updates handling, and it seems to be working now.

Then I hit two serious problems with MoinMoin: XML-RPC PutPage and path problems. For some reason the MM server process and the man/info importer process went into some kind of deadlock at random times when running the import. The server was waiting on accept() and the importer on recv(). I tried to do some debugging etc. but I never figured out exactly what was wrong. So I decided to port my importer to use the PageEditor class in Moin. That was a big change, but I’m happy with the result. This also solved the other problems I had with XML-RPC, mainly the mysterious idle times when the importing kind of stalled for 30 seconds per every page import. I suspect all these things have something to do with network/socket stuff, but I just don’t have the time to start debugging all this when I have more important things to implement here…

So I gained speed, but I also lost something when getting rid of XML-RPC. Mainly I lost the ability to run the importer and all the man -> DocBook XML etc. conversions on a different host than the actual wiki. But since a single man page import now takes about 0.7 seconds and it took about 30 seconds with XML-RPC, it’s all worth it.

When doing the porting to PageEditor, I kept having some really weird problems with Python import statements. Simply put, many of them didn’t work. I spent about a day trying to find out what was wrong and eventually even found the reason: a relative path was added to sys.path, which caused all the problems since I had a somewhat different setup than the default. So I fixed it by adding an absolute path instead, which took maybe 20 characters of Python after a day of work 😀
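
The fix itself is trivial, something like this (a sketch with a hypothetical helper name, not the actual code):

```python
import os
import sys

def add_instance_path(path):
    """Put the absolute form of *path* on sys.path. The bug was exactly
    this: a relative entry breaks imports as soon as the working
    directory changes."""
    abs_path = os.path.abspath(path)
    if abs_path not in sys.path:
        sys.path.insert(0, abs_path)
    return abs_path
```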

On Friday I travelled from my parents’ place to Helsinki, since our student organization visited Assembly. That was fun, but I had no Internet access at all from my student apartment in Helsinki during the whole weekend, so the only coding I could do was a couple of hours at Assembly. Thanks a lot, Sonera! Even with the flaky Internet connection, I managed to work about 40 hours last week. Btw, iirc, the 4k intro at Assembly was won with Python 🙂

This week I’ll probably have to update my blog, then I’ll do some testing with real RPM repositories now that the importer is ported to PageEditor. After all that I’ll hopefully finally get to working with the DocBook branch. Looking forward to that.

Weekly report: week 30

Monday, July 30th, 2007

Last week was pretty much completely about fixing bugs in the info page handling code. I worked about 45 hours on my project. The most significant change was that I decided to use primary.xml too, so that I could reliably find the source RPM corresponding to each binary RPM with info pages. Before, I just tried to “guess” the SRPM file name from the RPM file name, but that wasn’t working too well.
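
Mapping binary packages to their source RPMs from primary.xml is straightforward with ElementTree; a sketch (the element names follow the createrepo metadata format, and the function name is just for illustration):

```python
import xml.etree.ElementTree as ET

# Namespaces used by createrepo's primary.xml.
NS = {"c": "http://linux.duke.edu/metadata/common",
      "rpm": "http://linux.duke.edu/metadata/rpm"}

def srpm_map(primary_xml):
    """Map each binary package name to its source RPM, read from
    primary.xml instead of guessing it from the binary file name."""
    mapping = {}
    root = ET.fromstring(primary_xml)
    for pkg in root.findall("c:package", NS):
        name = pkg.findtext("c:name", namespaces=NS)
        srpm = pkg.findtext("c:format/rpm:sourcerpm", namespaces=NS)
        mapping[name] = srpm
    return mapping
```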

I also noticed that makeinfo --docbook wasn’t working too well for a lot of Texinfo pages, so I had to make sure that my import code doesn’t crash or do anything weird no matter what the output of makeinfo is. Now my code checks the XML data for well-formedness before putting it into the wiki, because badly formed XML data was causing a lot of problems.
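
The well-formedness check itself is just a parse attempt, roughly like this (a sketch, not the exact code):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if makeinfo's output at least parses as XML.
    Badly formed output gets skipped instead of imported."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```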

But after all this, I can finally say that the man/info importing works. That took a lot more time than I expected, mainly because of the problems I had with repository handling and info page handling. The code still needs some testing, especially the updates handling part; I’ll do that today. It also still doesn’t have support for localized man pages, but that shouldn’t be too hard to add.

I have 5 weeks left now before Summercode Finland ends and technically I’m 2 weeks behind schedule which does sound bad, but it’s hopefully not that bad really. This is where last summer’s DocBook branch comes into the picture.

I’ve briefly talked about it with Thomas Waldmann and Alexander Schremmer and the DocBook branch can import DocBook XML so that it looks exactly like a normal wiki page to the user, even while editing, but you can still export a complete DocBook XML document of the page. I’ll base my editing implementation on these features, so that the users can edit the man/info pages normally, then I’ll use the DocBook export to get the pages as DocBook and then run them through docbook2x to produce either man or Texinfo files.

Producing diffs for upstream is still an open question, but it shouldn’t be impossible to store the original man/info files while doing the import and then do diffs with those and the corresponding docbook2x output.
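
The diffing part could be as simple as Python’s difflib, along these lines (a sketch; the function name is mine, not from the code):

```python
import difflib

def upstream_diff(original, edited, name="page"):
    """Unified diff between the stored upstream man/info source and the
    source exported from the wiki, ready to be sent upstream as a patch."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        edited.splitlines(keepends=True),
        fromfile="upstream/" + name, tofile="wiki/" + name))
```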

So this week I’ll do some (hopefully) final touches on the import code and then I’ll start testing the DocBook branch and porting it to my branch.

Weekly report: week 29

Tuesday, July 24th, 2007

I would have liked to write a blog post about how everything on last week’s TODO list is just working. But sadly that’s not quite the case. Updates handling works, though it still needs some testing. Info page importing kind of works, but there’s a problem.

I use makeinfo --docbook to convert the Texinfo files to DocBook XML, but the problem is that it doesn’t always produce well-formed XML, so those wiki pages aren’t rendered cleanly; the only thing shown is the DocBook XML source. I guess I’ll just have to skip the Info pages that are causing problems.

I hope that I’ll find some solution in a couple of days, so that I could finally start doing the editing part. According to my calculations, I’ve been coding for 6 weeks now and there’s about 6 weeks of Summercode left, so I’m pretty much at the halfway point now. I feel like I still have a decent chance of finishing my project in time, it’s going to keep me busy, but still.

Edit: And again I forgot to categorize my post in the first save, sorry…

Weekly report: week 28

Monday, July 16th, 2007

Sorry again that this report is a bit late. When there’s code to be written, blogging seems to be secondary 😉

Last week’s work described in short: I implemented RPM repository support. And that was a lot more work than I would have imagined. First download and parse repomd.xml to get the checksums of the repository metadata, then download filelists.xml.gz, checksum it against the checksum in repomd.xml, re-download if the checksums don’t match, parse the file and find man pages in packages, download those packages, extract the man pages from the packages with rpm2cpio, and add them as wiki pages.

For the downloading part I used urlgrabber, which handles partial downloads, regets, progress meters, mirrorlists etc. automatically. For XML parsing I used cElementTree. I also needed to implement caching, and checksumming for the cached file, since the filelists.xml.gz file for Fedora 7 Everything is over 6 megs.
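
The checksumming of the cached file boils down to hashing it and comparing against the value published in repomd.xml; a sketch (assuming the SHA-1 checksums that repomd.xml calls “sha”):

```python
import hashlib

def cached_file_ok(path, expected_checksum):
    """Verify a cached metadata file (e.g. filelists.xml.gz) against the
    checksum listed in repomd.xml; a mismatch means the cache is stale
    or corrupt and the file should be re-downloaded."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # Hash in chunks so a multi-megabyte file doesn't need to fit
        # in memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest() == expected_checksum
```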

This is what diffstat says about last week’s changes:

action/ | 27 +-
script/xmlrpc/ | 508 +++++++++++++++++++++++++++++++++++----------
2 files changed, 418 insertions(+), 117 deletions(-)

I actually tested the repository code last night with the Fedora 7 updates-testing repository and it works pretty well. Of course doclifter can’t lift all of the man pages to DocBook XML, but those just have to be skipped. The main problem I have now is performance. It took almost 80 minutes of wall time to import the updates-testing repo, and that’s probably the smallest repository there is in the Fedora world. The main reason for the slowness is XML-RPC, which seems to just stop and wait for something in every call for about 30 seconds. And when I have to make about 150-200 calls in a repository import, that’s a lot of waiting combined.

One possible way to save some time would be to group the man pages into batches of maybe 10 pages per XML-RPC call. Right now I do that, but only for man pages in the same RPM package. So if there are five packages with one man page each, it makes five XML-RPC calls. Also, as recommended on #moin-dev, I could do some profiling to really see where the wait is. If only I had time for that…

Then about the schedule again: as of today, I’m behind on it. Even though I worked about 50 hours last week, which was already the “slip week”, that’s still the case. And there are things in “phase 1” I still need to do, like updates handling and info importing. I am a bit annoyed at the situation; on the other hand, I have worked really hard lately, there just was (and is) a bit too much to do compared to the schedule. Updates handling, especially with the performance problems, is really important to the whole project, and so is info importing; I can’t just ignore those.

So the plan for this week is as follows:

  • At first I’ll implement updates handling, probably with the Python shelve module, with which I can store name-version-release info in a nice dictionary and persist it to a file. Then, when used in “update mode”, the script will only import man files from new and updated packages, which will save a lot of time.
  • Then I’ll do the info page importing, which will work something like this: identify the info files when parsing filelists.xml.gz, then download the upstream source packages from CVS, get the Texinfo sources, convert them to DocBook XML with makeinfo, and import them into the wiki.
  • I would really like to take the time to profile the performance problems, or more specifically, what is causing the idle cpu wait time. I’m just not sure if it’s worth it at this point when I’m behind schedule anyway.
  • If I decide that I don’t want to spend the time on profiling, I could at least do some small-scale optimizing by making sure that the XML-RPC calls always carry about 10 page adds each, so the total number of calls would be reduced but the calls themselves still wouldn’t get too big.
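
The shelve-based updates handling from the first point could look something like this (a sketch with hypothetical names; the real script would read the name-version-release info from the repository metadata):

```python
import shelve

def filter_updated(shelf_path, packages):
    """Keep only the packages whose version-release is new or changed
    since the last run. *packages* maps package name -> version-release
    string. The shelf persists what was seen, so "update mode" can skip
    everything that hasn't changed."""
    updated = {}
    with shelve.open(shelf_path) as db:
        for name, vr in packages.items():
            if db.get(name) != vr:
                updated[name] = vr
                db[name] = vr  # remember for the next run
    return updated
```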

The updates handling should take about one working day to do well, the info importing could hopefully be done in about half a day. The optimizing could take anywhere from a few hours to many days, depends on how much time I want to spend on it.

Edit: It’s been so long since I blogged that I forgot to categorize this and it didn’t show up on Fedora Planet etc. So I’m updating the timestamp now, sorry if this causes any problems in anyone’s feed reader.

Weekly report: week 27

Monday, July 9th, 2007

I posted something on Thursday, but here’s “the official” weekly report.

I spent last week mainly on implementing SisterDiff, and it’s now complete. First I had to familiarize myself with Moin’s Action interface. Then I thought I could maybe reuse the functions from another action, LikePages. But after some testing, that didn’t seem to be a reasonable thing to do.

So now SisterDiff is “stand-alone” and actually cleaner that way. I have to say I especially liked RootPage.getPageList() while writing SisterDiff. It has nice mechanisms to filter which pages to list, so I don’t have to do the filtering in the Action. Though passing the filter function to getPageList() meant that I had to brush up on regular expressions and the way they are done in Python.

When validating SisterDiff’s HTML output, I actually found a bug in Moin’s ActionBase class, which I fixed. I also spent about one day making my code PEP 8 compliant and fixing some bugs I found in that process.

Friday and a couple of hours on the weekend were spent planning the RPM handling. I took a look at rpm-python, which is nice, but it can’t extract single files from an RPM package. So I did some tests with rpm2cpio. With cpio it is possible to extract single files, so that’s what I’ll use. I already have some RPM handling code done, but it’s still in its early stages.
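
The rpm2cpio approach boils down to a pipeline like rpm2cpio package.rpm | cpio -id ./path/to/file; in Python it could be sketched like this (hypothetical helper names, and it assumes rpm2cpio and cpio are on PATH):

```python
import subprocess

def rpm_extract_cmds(rpm_path, member):
    # rpm2cpio streams the package payload as a cpio archive; cpio -id
    # then extracts just the wanted member, e.g.
    # "./usr/share/man/man1/ls.1.gz".
    return (["rpm2cpio", rpm_path], ["cpio", "-id", "--quiet", member])

def extract_from_rpm(rpm_path, member, dest_dir):
    """Extract a single file from an RPM package without installing it,
    by piping rpm2cpio into cpio (a sketch, not the actual script)."""
    rpm_cmd, cpio_cmd = rpm_extract_cmds(rpm_path, member)
    rpm2cpio = subprocess.Popen(rpm_cmd, stdout=subprocess.PIPE)
    subprocess.check_call(cpio_cmd, stdin=rpm2cpio.stdout, cwd=dest_dir)
    rpm2cpio.stdout.close()
    return rpm2cpio.wait()
```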

Then something about the schedule: According to my original schedule I should have had phase 1 complete by yesterday, but “rpm mass importing” and info page importing are still missing. So I have to use some of the “slip week” I had scheduled to finish phase 1. Hopefully I’ll get to start phase 2 by the end of this week, though.

Info page importing

This is a problem I still haven’t solved. I took a look at CVS’s source code (since that is one of the non-GNU packages that have Info pages) and it does actually have the Texinfo sources in it. And I could use makeinfo --docbook to make DocBook XML from that. But the Texinfo sources are not in the binary RPM. So it seems I need to use RPM repositories for man page importing and CVS repositories for Info page importing. Interesting…

Mid-week report

Thursday, July 5th, 2007

Just a quick update on what’s going on: SisterDiff is complete. So the only features missing from milestone 1 are “mass importing” of man pages and info page importing. I might leave the whole info page thing until next week, since there’s the problem that I mentioned in my previous post: info pages as such can’t be converted to DocBook XML; you need the Texinfo “masters” for that.

I’ll spend the rest of the week working with RPM importing. It seems that RPM repositories are a better source for the man pages than the CVS repository. If anyone wants to try changing my mind on that, feel free 😉

I’m not sure if I’ll get to updates handling yet this week, but importing an RPM repository’s man files might be a reasonable goal. I’ll probably work some hours on the weekend too.

So phase one may not be quite complete by the end of this week, but almost anyway. I’ll probably spend the first half of next week with the info file importing, updates handling and whatnot. The last part could hopefully be spent on the editing phase then.