Monday, 7 November 2016

XML extraction and analysis

Latest Progress

The first thing I did this week was to reorganize the parts of EggShell that are relevant to my project so far, putting them into separate packages and adding these packages to my own repository. I now have my own independent version of the project, called ExtendedEggShell, which contains everything I have done with it up to now. I then built a small utility that "intercepts" the importing and modelling process at the XML stage and exports the imported XML string into a new file, and I ran it for all the example PDFs.
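To illustrate, the export step is essentially no more than this (a rough sketch; the selector and argument names here are mine, not EggShell's):

    "Hypothetical hook: before the modelling continues, dump the raw XML string
     the importer produced for a paper into a file next to the PDF."
    exportXml: xmlString forPaperNamed: baseName
        | target |
        target := (baseName , '.xml') asFileReference.
        target ensureDelete.
        target writeStreamDo: [ :stream | stream nextPutAll: xmlString ]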

Once I was done with that, I spent some time thinking about how to assess the precision of the data extraction I will be implementing. I don't think it makes sense to build on EggShell's extensive work in that area, since that assessment will just be a necessity rather than the focus of my work. I know this isn't the most important question right now, but it happened to cross my mind.
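For when I do get there, the assessment itself should boil down to something fairly simple, along these lines (only a sketch; how the extracted and reference parts are represented and compared is still completely open):

    "Sketch: precision and recall of the extracted parts against a hand-built
     reference collection. Assumes both collections contain comparable elements."
    precisionAndRecallOf: extracted against: reference
        | truePositives precision recall |
        truePositives := (extracted select: [ :each | reference includes: each ]) size.
        precision := extracted isEmpty ifTrue: [ 0 ] ifFalse: [ truePositives / extracted size ].
        recall := reference isEmpty ifTrue: [ 0 ] ifFalse: [ truePositives / reference size ].
        ^ precision -> recall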

Finally, I did some work towards comparing the PDFs to their XML representation and identifying interesting parts for extraction. I wanted to go through all the papers and categorize all the parts in the XML documents by adding tags to them: all sections, all paragraphs within these sections, all figures and listings within these paragraphs, and so forth. To get started, I went through all the papers rather quickly, just to identify possible tags, and then organized these tags into something like a meta-model (or meta-meta-model?). While this tagging is a lot of work, I think it could be very helpful later on, since it should enable me to programmatically identify all the parts of the sample papers. I can use that to build the reference model for testing and accuracy assessment, I can use it to extract parts and compare them while I design the heuristics, and it might even make a good training set for some sort of supervised learning algorithm, should I, or a future student, want to attempt that. However, I stumbled across a problem I hadn't expected: as soon as a PDF page contains figures, embedded text fields, and other such elements, the extracted content is no longer strictly linear. It does follow a pattern, but it does not extract the complete left-hand column first and then the complete right-hand column. I discuss this further in the "Problems" section below.
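Extraction order aside, once a paper is tagged, getting at its parts programmatically should be straightforward. Roughly like this, assuming Pharo's XMLParser package and a tag vocabulary that is still only a placeholder:

    "Sketch: load a hand-tagged XML file and count the tagged parts per kind.
     The file name and the tag names are placeholders."
    | document counts |
    document := XMLDOMParser parse: 'Cara15a-tagged.xml' asFileReference contents.
    counts := Dictionary new.
    #('section' 'paragraph' 'figure' 'listing') do: [ :tagName |
        counts at: tagName put: (document allElementsNamed: tagName) size ].
    counts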

Problems

As I've mentioned above, I ran into problems with the extraction of one particular paper, Cara15a.pdf (although I will most likely encounter the same issue with other papers as well). The first page was extracted very well and I was able to tag it the way I wanted to. However, on the second page the extraction wasn't linear anymore: it jumped between the left- and right-hand columns, and usually some sort of figure was involved. Here's an (approximate) sketch of the extraction sequence I observed:

[Sketch of the observed extraction sequence, jumping back and forth between the two columns, omitted here.]

I want to find out whether there is a good reason for this, and whether I can still work with it. I may need to have a look at the source code of the extractor tool, but that might be rather difficult, since my experience with C++ is very limited. I will also discuss the matter with Leonel and, if necessary, ask Dominik about it.
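One thing I can do regardless is make the problem visible from the XML itself: if the extracted text elements carry position information, printing them in document order should show exactly where the sequence jumps. Something like this, though the element and attribute names below are guesses and need to be adapted to the extractor's actual output:

    "Sketch: print every text element in the order it appears in the XML, together
     with its (assumed) page and position attributes, to spot where the reading
     order jumps between the columns."
    | document |
    document := XMLDOMParser parse: 'Cara15a.xml' asFileReference contents.
    (document allElementsNamed: 'text') do: [ :each |
        Transcript
            show: 'p=' , (each attributeAt: 'page' ifAbsent: [ '?' ]);
            show: ' x=' , (each attributeAt: 'x' ifAbsent: [ '?' ]);
            show: ' y=' , (each attributeAt: 'y' ifAbsent: [ '?' ]);
            show: ' | ' , each contentString;
            cr ]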

Next Steps

First, I need to get this problem out of the way or find a way to deal with it; otherwise, I can hardly continue my work. Once this is taken care of, I want to do the tagging, if that is still feasible. As soon as I have one paper completely tagged, I want to make sure its parts can be imported into Pharo and modelled in at least some ad-hoc model. Once I have identified all the parts in the XML files, I can try to find out which of them we might be able to detect programmatically. With that knowledge, I can conduct some interviews, asking different people about sample queries they might want to make and questions they may want to answer, to see which features really should be extracted.
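For the ad-hoc model mentioned above, even something as crude as this would do as a starting point (the class and variable names are placeholders):

    "Placeholder ad-hoc model: one node per tagged part, keeping the kind of part
     (section, paragraph, figure, ...), its text, and its nested parts."
    Object subclass: #EESPaperPart
        instanceVariableNames: 'kind text children'
        classVariableNames: ''
        category: 'ExtendedEggShell-Model'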

Likely Challenges

Obviously, solving the problem mentioned above will be an important challenge. I'll either have to find a way to fix the extraction, or I'll have to settle for the parts that are still extractable. If I manage to solve this problem, I still want to do the tagging, which I think could be very helpful. However, it will be a lot of work, and I'll need to find out whether it's actually feasible.
