Monday, 28 November 2016

Polishing the pipeline, checking out Metacello

Latest Progress

Unfortunately, I didn't get as much done this week as I wanted to. I do feel like I managed to get rid of the code smells I'd found by last week, though. I now distinguish between tool controllers and importers. A tool controller is basically just a wrapper that exposes the command-line services of a third-party tool directly as Pharo messages. An importer (currently, there is only the PDFXMLImporter) is supposed to combine various tools (that is, tool controllers) to get a specific importing task done. Currently, PDFXMLImporter supports PDF-to-XML importing using either a combination of PDFBox and ParseCit, a combination of XPdf (pdftotext) and ParseCit, or just XPdf on its own. In the code, these pipelines are referred to as #pcpdfbox, #pcxpdf, and #xpdf, respectively.
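As a rough illustration of the split described above, the three pipelines can be thought of as named lists of tools. This is only a Playground sketch, not the actual PDFXMLImporter code: the dictionary and the toolsFor block are made up for illustration, and only the three pipeline symbols and the tool names come from the project.

```smalltalk
"Illustrative sketch only: not the real PDFXMLImporter implementation."
| pipelines toolsFor |
pipelines := Dictionary new.
pipelines at: #pcpdfbox put: #('PDFBox' 'ParseCit').
pipelines at: #pcxpdf put: #('XPdf' 'ParseCit').
pipelines at: #xpdf put: #('XPdf').

"Look up the tools for a pipeline, failing loudly on an unknown symbol."
toolsFor := [ :pipeline |
	pipelines
		at: pipeline
		ifAbsent: [ Error signal: 'Unknown pipeline: ' , pipeline printString ] ].

"Evaluates to #('XPdf' 'ParseCit')."
toolsFor value: #pcxpdf
```

Keeping the pipeline names in one lookup like this means an unsupported symbol is rejected in a single place, before any external tool is started.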

Additionally, I started having a look at Metacello configurations to make the project easier to load. Loading the latest version of each package already works fine, but I haven't really gotten into specifying the dependencies yet. There is also a third-party project I heavily depend on (called CommandLine), which is not included in the standard Moose 6.0 image. I'm sure I can use the same Metacello configuration to load its packages as well, which would be very nice.
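For illustration, a Metacello baseline method declaring such a dependency might look roughly like the following. Everything specific in it is a placeholder rather than the project's real setup: the class ConfigurationOfExampleProject, the package name ExampleProject-Core, both repository URLs, and the assumption that CommandLine ships its own ConfigurationOfCommandLine with a #stable version.

```smalltalk
baseline01: spec
	"Sketch of a method on a hypothetical ConfigurationOfExampleProject class.
	 All names and URLs below are placeholders."
	<version: '0.1-baseline'>
	spec for: #common do: [
		spec blessing: #baseline.
		spec repository: 'http://smalltalkhub.com/mc/ExampleUser/ExampleProject/main'.
		"External dependency, loaded through its own configuration (assumed to exist)."
		spec project: 'CommandLine' with: [
			spec
				className: 'ConfigurationOfCommandLine';
				versionString: #stable;
				repository: 'http://smalltalkhub.com/mc/ExampleUser/CommandLine/main' ].
		"The project's own package, which requires CommandLine to be loaded first."
		spec package: 'ExampleProject-Core' with: [ spec requires: #('CommandLine') ] ]
```

With a declaration along these lines, Metacello resolves the load order itself, so CommandLine would be fetched before the package that needs it.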

Next Steps

More or less the same as last week. I might need to implement that text block sorter though, for putting XPdf imports back into the correct order. See "Likely Challenges" for why this might be the case. But basically, this week I want to start analyzing the imported PDFs, and maybe also do some more work on the Metacello configuration. Apart from that, there are always some smaller tasks on my to-do list, like making sure the tool controllers can actually use all of the command-line services provided by the third-party tools, so I might also work on some of these.

Likely Challenges

Also more or less the same as last week. Only addition: I just had a quick glance at the XML files produced by ParseCit. While it detects title, author, affiliation, etc. really well, it doesn't seem to provide any layout information about the remaining text blocks. This information might be important for extracting further features, which means I might need to use both the ParseCit and XPdf pipelines in parallel. This isn't a problem (except maybe for a longer import process for each PDF), but it means that, if I actually need the layout information, I'll have to implement the text block sorter rather soon, so that the text blocks imported by XPdf end up back in the correct order.
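A minimal Playground sketch of what such a sorter could do, under some loud assumptions: each text block is represented here as a Dictionary with #page, #top, #left, and #text keys, and #top is measured downward from the top of the page. Neither the representation nor the ordering rule is taken from the actual project, and the sample data is invented.

```smalltalk
"Illustrative sketch: the block format (#page, #top, #left, #text) is assumed."
| blocks inReadingOrder |
blocks := {
	{ #page -> 1. #top -> 300. #left -> 50. #text -> 'Second paragraph' } asDictionary.
	{ #page -> 2. #top -> 40. #left -> 50. #text -> 'Next page' } asDictionary.
	{ #page -> 1. #top -> 80. #left -> 50. #text -> 'First paragraph' } asDictionary }.

"Order by page, then top-to-bottom, then left-to-right."
inReadingOrder := blocks sorted: [ :a :b |
	(a at: #page) = (b at: #page)
		ifFalse: [ (a at: #page) < (b at: #page) ]
		ifTrue: [
			(a at: #top) = (b at: #top)
				ifFalse: [ (a at: #top) < (b at: #top) ]
				ifTrue: [ (a at: #left) <= (b at: #left) ] ] ].

"Evaluates to #('First paragraph' 'Second paragraph' 'Next page')."
inReadingOrder collect: [ :each | each at: #text ]
```

A plain page/top/left ordering like this would probably not be enough for multi-column papers, where blocks would first have to be grouped by column.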

2 comments:

  1. Hi Silas, I am glad your work is advancing steadily. I would suggest (only if you feel it would not take too much time) using Git to version the project. Then you could have a much simpler process to load all the tools and configuration. To use Git from Pharo you can load GitFileTree from Menu->Tools->Catalog Browser. Then you can specify a Git repo. Best regards, Leonel.

    1. Hi Leonel, great, thanks for the tip! I'll definitely give this a try, it might be a lot more comfortable. Best regards, Silas.
