Monday, 28 November 2016

Polishing the pipeline, checking out Metacello

Latest Progress

Unfortunately, I didn't get as much done as I wanted to this week. I do feel like I managed to get rid of the code smells I had found by last week, though. I now distinguish between tool controllers and importers: a tool controller is basically just a wrapper that makes the command line services of a third-party tool available directly as Pharo messages, while an importer (currently, there is only the PDFXMLImporter) uses various combinations of tools (that is, tool controllers) to get a specific importing task done. At the moment, PDFXMLImporter supports PDF-to-XML importing using either a combination of PDFBox and ParseCit, a combination of XPdf (pdftotext) and ParseCit, or just XPdf on its own. In the code, these pipelines are referred to as #pcpdfbox, #pcxpdf, and #xpdf respectively.
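
Just to illustrate how I imagine driving the importer from a playground, here is a rough sketch; the selectors are placeholders rather than the final API:

    | importer xml |
    importer := PDFXMLImporter new.
    importer pipeline: #pcpdfbox.   "or #pcxpdf, or #xpdf"
    xml := importer importFile: 'papers/Cara15a.pdf' asFileReference.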

Additionally, I started having a look at Metacello configurations, to make it easier to load the project. Loading the latest version of each package already works fine, but I haven't really gotten into specifying the dependencies yet. There is also a third-party project I depend on heavily (called CommandLine), which is not included in the standard Moose 6.0 image. I'm sure I can use the Metacello configuration to load those packages as well, which would be very nice.
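
As a first sketch, the baseline version of the configuration could look roughly like this; the package names and repository URLs are placeholders, and I still have to check the exact name of the CommandLine configuration class:

    baseline01: spec
        <version: '0.1-baseline'>
        spec for: #common do: [
            spec repository: 'http://smalltalkhub.com/mc/...'.   "placeholder repository URL"
            spec
                package: 'ExtendedEggShell-Core';
                package: 'ExtendedEggShell-Importers'
                    with: [ spec requires: #('ExtendedEggShell-Core' 'CommandLine') ].
            spec project: 'CommandLine' with: [
                spec
                    className: 'ConfigurationOfCommandLine';   "assumed class name"
                    repository: 'http://smalltalkhub.com/mc/...';
                    loads: #('default') ] ]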

Next Steps

More or less the same as last week. I might need to implement that text block sorter though, for putting XPdf imports back into the correct order; see "Likely Challenges" for why this might be the case. Basically, this week I want to start analyzing the imported PDFs, and maybe also do some more work on the Metacello configuration. Apart from that, there are always some smaller tasks on my to-do list, like making sure the tool controllers expose all of the command line services provided by the third-party tools, so I might work on some of those as well.

Likely Challenges

Also more or less the same as last week. The only addition: I just had a quick glance at the XMLs imported by ParseCit. While it detects title, author, affiliation, etc. really well, it doesn't seem to provide any layout information about the remaining text blocks. This information might be important for extracting further features, which means I might need to use both the ParseCit and XPdf pipelines in parallel. That isn't a problem in itself (except maybe a longer import process for each PDF), but it means that, if I actually need the layout information, I'll have to implement the text block sorter rather soon, to put the text blocks imported by XPdf back into the correct order.
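
To make that a bit more concrete, here is a rough sketch of how such a sorter could work on the blocks of a single page, assuming each imported text block knows its x and y position (the selector and the accessors are placeholders):

    sortBlocks: textBlocks columnSplitAt: splitX
        "Sketch: restore reading order by assigning each block to the left or
        right column based on its x position, then sorting each column from
        top to bottom by y position."
        | left right |
        left := textBlocks select: [ :each | each x < splitX ].
        right := textBlocks select: [ :each | each x >= splitX ].
        ^ (left sorted: [ :a :b | a y <= b y ]) ,
          (right sorted: [ :a :b | a y <= b y ])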

Monday, 21 November 2016

Assembling the pipeline

Latest Progress

This last week, I added TxtXMLImporter, which supplies txt files to ParseCit and returns the result (it can also file it out). This worked fine on the test data, as well as on the actual txt files I got from PDFs using PDFBox. I then wrapped the usage of PDFBox and ParseCit into a PDFXMLImporter, which, as a next step, should also let the user choose between the different pipelines (including pdftoxml and pdftotext from the original EggShell) that result in a PDF-to-XML conversion.
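
Conceptually, the PDFBox/ParseCit path inside PDFXMLImporter boils down to something like this method sketch; the controller accessors and selectors are placeholders for whatever the final API will be:

    importXmlFrom: aPdfFileReference
        "Sketch of the PDFBox + ParseCit path: first extract plain text from
        the PDF, then hand that text to ParseCit to get the XML."
        | txtFile |
        txtFile := self pdfBoxController extractTextFrom: aPdfFileReference.
        ^ self parseCitController extractXmlFrom: txtFile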

I realized that I had some duplicated code, so I factored it out into the common superclass (PaperImporter). Unfortunately, the resulting class hierarchy doesn't really reflect a good inheritance relationship, so I will want to refactor that.

I also spent some time going through what I have so far and adding some class comments. This is where I came across some code smells and design flaws (like the bad inheritance mentioned above), which I want to get rid of, probably during this week.

Next Steps

First of all, I want to do a little refactoring and make sure the design is as good as I can make it right now. Then I want to add message comments as well, something I have neglected a bit so far. The same goes for tests: now that my pipelines seem to work, I should write some unit tests for them.
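
For the tests, I'm thinking of simple SUnit cases along these lines; the sample file and the importer selectors are placeholders:

    TestCase subclass: #PDFXMLImporterTest
        instanceVariableNames: ''
        classVariableNames: ''
        category: 'ExtendedEggShell-Tests'

    testPcpdfboxPipelineProducesXml
        "The PDFBox + ParseCit pipeline should produce a non-empty XML string
        for a known sample PDF."
        | importer xml |
        importer := PDFXMLImporter new.
        importer pipeline: #pcpdfbox.
        xml := importer importFile: 'testdata/sample.pdf' asFileReference.
        self deny: xml isEmpty.
        self assert: (xml beginsWith: '<')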

As I mentioned above, PDFXMLImporter should also be able to use the other tools available from the original EggShell. This will be one of the next things I want to do.

Once I have assembled all the tools for converting PDFs to XML, I want to come back to what I began a couple of weeks ago: studying the imported XML, analyzing what each pipeline can give me, and what information I might be able to deduce from it. I especially want to focus on the differences between the entire old pipeline and the new pipeline with either PDFBox or pdftotext as the PDF-to-text converter. It's also possible to use multiple pipelines in parallel if that leads to a better result, so that's an approach I want to consider as well.

Since ParseCit can already retrieve a lot of information, I want to start working on taking that data out of the XML and modelling it. I suspect that this might be the easiest way to get a first extended data model rather soon.
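
A first version of that extraction could be as simple as the following sketch. I'm assuming the XML-Parser package (XMLDOMParser) and the element names I think I saw in the ParseCit sample output (title, author); both still need to be verified against real imports:

    modelFromParseCitXml: anXmlString
        "Sketch: pull the header information ParseCit already detects out of
        the XML and put it into a plain Dictionary as a stand-in for a real
        model class."
        | document paper |
        document := XMLDOMParser parse: anXmlString.
        paper := Dictionary new.
        paper at: #titles put:
            ((document allElementsNamed: 'title') collect: [ :each | each contentString ]).
        paper at: #authors put:
            ((document allElementsNamed: 'author') collect: [ :each | each contentString ]).
        ^ paper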

Likely Challenges

As soon as I start analyzing the imported XML data, I suspect it will be a challenge to keep an overview of all the possible pipelines, to find out which ones are best suited for what, and to anticipate each one's drawbacks as early as possible. I then need to make a solid prediction about what further features I might be able to extract, in order to conduct the interviews as soon as possible.

Monday, 14 November 2016

Trying out a new importer

Latest Progress

Last week, I addressed the problem of non-sequential XML import. I talked about it with Leonel and he pointed out that the imported XML offers position information for every text element. This information should be enough to distinguish between left- and right-hand column elements, which means I should be able to put them all back into the correct order.

However, Leonel also mentioned that he had used a different XML importer for a similar purpose, called ParseCit. It takes raw text input and can even extract the title, authors, and citations, using a Conditional Random Field (CRF) approach. I downloaded ParseCit, installed the necessary components, and documented my installation process as well as possible, for future reference. I also ran the tool on the provided sample data, which worked fine.

Note that ParseCit doesn't parse PDFs itself, so I needed a second tool for that. I decided to give Apache PDFBox a try, and so far it looks very good. The tool is simply a runnable jar file that offers a variety of command line utilities, well suited to my needs. I then built a pipeline to be able to use PDFBox from within Pharo, which now works fine as well. The pipeline can also re-output the imported raw text, so that ParseCit can use it as input. It's worth mentioning that Dominik's modified version of XPdf also includes a text-based import, so I'll definitely want to try it with that as well. I will most likely change my text-import pipeline to allow switching between these two tools, so that I can try out both and see which one works better.
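
The PDFBox part of that pipeline is essentially a single shell call. As a rough sketch, using LibC runCommand: as a stand-in for the actual command-line wrapper, assuming PDFBox's ExtractText utility, and with the jar name as a placeholder:

    extractTextFrom: aPdfFileReference to: aTxtFileReference
        "Sketch: ask the PDFBox app jar to dump the PDF's text into a txt file."
        | command |
        command := String streamContents: [ :s |
            s << 'java -jar pdfbox-app.jar ExtractText '
              << aPdfFileReference fullName
              << ' '
              << aTxtFileReference fullName ].
        ^ LibC runCommand: command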

Since I've now gathered a number of tools I need or might want to use later, I organized them all in a GitHub repository. Of course, that means I redistributed them, so I had to spend some time checking all the licenses and reading up on whether and how they allow redistribution. Since everything I'm using so far is published either under the GNU GPL or the Apache license, and my repository is merely an aggregation of distinct programs rather than a combination into a new one, this doesn't seem to be a problem.

Next Steps

Even though I tested ParseCit with the provided sample data and managed to establish the PDF-to-text pipeline, I haven't yet gotten around to putting the two together. This will be my top priority this week. Once that works, I want to modify the PDF-to-XML importing process to include all possible pipelines (i.e. all the different tool chains that deliver an acceptable XML result), in a way that lets me easily choose the path I want to use.

Likely Challenges

As I mentioned, I've never tried ParseCit with actual data. Although there doesn't seem to be a reason why it shouldn't work, you can never really be sure. Since a good-quality XML importer is vital to my project, it's very important to me to have at least one well-working import pipeline. Should ParseCit not work the way I'm hoping, I'll have to spend some time re-ordering the import result of Dominik's pdftoxml.

Monday, 7 November 2016

XML extraction and analysis

Latest Progress

The first thing I did this week was reorganizing the parts of EggShell that are important for my project so far, putting them into different packages and adding these packages to my own repository. Now I have my own independent version of the project, called ExtendedEggShell, which contains all of what I've done with it until now. I then built a small utility that "intercepts" the importing and modelling process at the XML stage and exports the imported XML string into a new file, and ran this for all the example PDFs.
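
The utility itself is only a few lines. Roughly, and with the importer entry point as a placeholder for the actual EggShell class I call into:

    exportXmlForPdfsIn: aDirectoryReference
        "Sketch: run the import up to the XML stage for every PDF in the given
        directory and write the XML string into a file next to the PDF."
        (aDirectoryReference files select: [ :each | each extension = 'pdf' ])
            do: [ :pdfFile |
                | xmlString xmlFile |
                xmlString := EggShellImporter xmlStringFor: pdfFile.   "placeholder entry point"
                xmlFile := pdfFile parent / (pdfFile basenameWithoutExtension , '.xml').
                xmlFile writeStreamDo: [ :stream | stream nextPutAll: xmlString ] ]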

Once I was done with that, I spent some time thinking about how to assess the precision of the data extraction I will be implementing. I don't think it makes sense to build on EggShell's extensive work in that area, since that assessment will just be a necessity, rather than the focus of my work. I know this isn't the most important question right now, but it just happened to cross my mind.

Finally, I did some work towards comparing the PDFs to their XML representation and identifying interesting parts for extraction. I wanted to go through all the papers and, in the XML document, categorize all the parts by adding tags to them: all sections, all paragraphs within these sections, all figures and listings within these paragraphs, and so forth. To get started, I went through all the papers rather quickly, just to identify possible tags, and then organized these tags into something like a meta-model (or meta-meta-model?). While this tagging is a lot of work, I think it might be very helpful in the future, since it should enable me to programmatically identify all the parts of the sample papers. I can use that to build the reference model for testing and accuracy assessment, I can use it to extract parts and compare them while I design the heuristics, and it might even be a good training set for some sort of supervised learning algorithm, in case I, or a future student, want to attempt that.

However, I stumbled across a problem I hadn't expected: as soon as a PDF page contains figures, embedded text fields, and other such elements, the extracted content isn't strictly linear anymore. It does follow a pattern, but it doesn't extract first the complete left-hand column and then the complete right-hand column. This matter is discussed further in the "Problems" section below.

Problems

As I've mentioned above, I ran into problems with the extraction of a certain paper, namely Cara15a.pdf (although I'm most likely going to encounter it with other papers as well). The first page was extracted very well and I was able to tag it the way I wanted to. However, on the second page, the extraction wasn't linear anymore: it jumped between the left- and right-hand columns, and usually some sort of figure was involved. Here's an (approximate) sketch of the extraction sequence I observed:

[Sketch of the observed extraction sequence not reproduced here; it showed the extractor jumping between the two columns around figures.]

I want to see if there is a good reason for this, and if I can still work with it. Maybe I need to have a look at the source code of the extractor tool, but this might be rather difficult, since my experience with C++ is very limited. I will also discuss the matter with Leonel and maybe, if necessary, ask Dominik about it.

Next Steps

First, I need to get this problem out of the way or find a way to deal with it; otherwise, I can hardly continue my work. Once this is taken care of, I want to do the tagging, if it is still possible. As soon as I have one paper completely tagged, I want to make sure the parts can be imported into Pharo and represented in at least some ad-hoc model. Once I have identified all the parts in the XML files, I can try to find out which of them we might be able to detect programmatically. With that knowledge, I can conduct some interviews, especially asking different people about sample queries they might want to make and questions they may want to answer, to see which features really should be extracted.

Likely Challenges

Obviously, solving the problem mentioned above will be an important challenge. I'll either have to find a way to fix the extraction, or I'll have to settle for the parts that are still extractable. Should I be able to solve this problem, I still want to do the tagging, I think it might be very helpful. However, this will be a lot of work, and I'd need to find out if it's actually feasible.