Monday, 31 October 2016

Playing with queries and preparing the presentation

Latest progress

I am going to give a first presentation about my project tomorrow morning, so this was my main focus this week. I took a deeper look at what problem we want to solve, how we want to do it, and what we are building on. My presentation will mostly talk about these points.

I still made some minor progress on the actual project. During the meeting with Oscar and Leonel last Tuesday, I got a better idea about how we might tackle the query language. Instead of creating a full domain-specific language right away or trying to build something very general (almost SQL-like), we should start out with a more OCL-like approach. That is, we define the meta-model in a UML diagram and then express queries by chasing through that graph and collecting the information we are interested in. This basically reduces the "DSL" down to a UML model and some accessor methods, though at the price of sometimes rather complicated query expressions. We can take care of that in a later step, for example by providing shortcut methods for more tedious and frequently used queries.

Playing around with these ideas and formulating some possible queries (some of which already work on the current model) gave me a good idea about what the DSL might look like in the end, which is very helpful. However, since our current data model only contains file names, paper titles, and author names, this is as far as I can go with that part right now. Before I can continue working on that, we first have to spend some time on extending the data model.
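
To make this a bit more concrete, here is a rough sketch of what such a navigation-style query could look like on the current model. All class and accessor names (papers, authors, name, title) are my assumptions for illustration, not a fixed API:

```smalltalk
"Hypothetical navigation-style query: titles of all papers
 (co-)authored by a given person. All accessor names are assumptions."
(model papers
    select: [ :paper |
        paper authors anySatisfy: [ :author | author name = 'A. Smith' ] ])
    collect: [ :paper | paper title ]
```

This is exactly the "chasing through the graph" idea: each step is just an accessor, and standard collection protocol (select:, collect:, anySatisfy:) does the rest.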

Next steps

As I have mentioned above, our number one priority right now will be extending the data model. Additional features we might consider extracting include author affiliation, publishing venue, paragraphs, listings, figures, references, keyword lists, etc. First of all, I assume we should create a prioritized list of these features. Then I want to go through that list item by item, identify the features on all sample papers, and from that, try to define heuristic rules for automatic recognition. In a next step, these heuristics should then be assessed in terms of accuracy so we can improve them; I think EggShell should already offer good tools for that, which just need to be adapted.
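
As a first idea of what such a heuristic rule could look like (purely hypothetical and not yet validated against the sample papers), here is a sketch for spotting the keyword list:

```smalltalk
"Hypothetical heuristic: keyword lists often start with a line
 beginning with 'Keywords'. Here, lines is assumed to hold the
 extracted text lines of a paper."
lines
    detect: [ :line | line trimBoth beginsWith: 'Keywords' ]
    ifNone: [ nil ]
```

Real rules will of course need to be more robust (case variants, punctuation, multi-line lists), which is exactly why the accuracy assessment step matters.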

Updated project outlook 

In a first step, we want our data model to contain more parts of the papers. Once it does so, we need to make sure it can be nicely queried. While we don't necessarily need more than some accessor methods for each model entity, it might be useful to provide certain shortcuts for more tedious and frequently used queries.
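
Such a shortcut could, for example, wrap a longer navigation into a single message. A sketch, where the selector and accessor names are my assumptions:

```smalltalk
"Hypothetical convenience method on the model root, wrapping a
 navigation query that would otherwise be repeated everywhere."
papersBy: anAuthorName
    ^ self papers select: [ :paper |
        paper authors anySatisfy: [ :author | author name = anAuthorName ] ]
```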

In a final step, we can tackle the actual visualizations, which should provide various views onto the data at hand. Our goal is that they encourage users to explore them, that is, to explore the data. Through these visualizations and through exploring them, users may be able to answer questions like:
  • Which authors/universities/enterprises/etc. work on which topics? Which venues do they publish at?
  • How do groups of co-authors evolve over time? Do some of them combine to larger groups? Do some groups break apart?
  • How does technology usage evolve over time, with respect to certain communities?

Likely challenges

Once we have identified the most interesting features of the papers, we have to extract them. I intend to use heuristic methods for that, the way EggShell already does for extracting titles and contributors. However, not all of these parts, if any at all, will be easy to extract, especially at a decent precision. So, getting good data extraction for all important features will most likely be the biggest challenge during the next couple of weeks.

Monday, 24 October 2016

Having a look at OCL

Latest Progress

Things have been going rather slowly the last two weeks. I'm still trying to figure out the best way to store the data and query that model. My approach was a more SQL-like one: have a selector method which can take a list of arbitrary attribute names and return only these attributes, provide means for conditional selection (similar to WHERE in SQL), and maybe in a further iteration even implement something like joins. However, there's one problem with that approach: it's very complicated and most likely absolute overkill. Sure, the data model will change over time and the DSL will have to adapt to such changes, but it's still not going to be a completely generalized database, capable of storing any arbitrary type of data. I'll need a simpler meta-model.
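
For illustration, the kind of API I had in mind would have looked roughly like this (all names are hypothetical), which already hints at why it gets complicated so quickly:

```smalltalk
"Rough sketch of the abandoned SQL-like API (all names hypothetical).
 Supporting this generically would require reflective attribute
 lookup, condition evaluation, and eventually join logic."
model
    select: #(title authors)
    where: [ :entry | (entry at: #year) > 2010 ]
```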

Last week, Oscar suggested going for an OCL-like approach: expressing the meta-model as a UML diagram and expressing queries as navigations that chase through the meta-model graph. I think if the initial UML model is well designed, it'll be open for expansion of the data model without breaking anything, which is pretty much what I need. I've been working on some drafts for a UML diagram and on a corresponding implementation, but I'm not yet sure if I'm going in the right direction. So far, I don't yet fully understand how to express queries on such a model. I'm meeting with Oscar tomorrow, to discuss his suggestion, so I hope to be making some more progress this week.
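
One way to express such a draft directly in code is as a Pharo class definition per meta-model entity; the instance variables and category here are just my current guesses, not the final design:

```smalltalk
"Draft meta-model entity as a Pharo class definition;
 everything here is still up for discussion."
Object subclass: #ESPaper
    instanceVariableNames: 'fileName title authors'
    classVariableNames: ''
    category: 'EggShell-Model'
```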

Next Steps

I'm meeting with Oscar tomorrow to discuss the OCL approach, and I will pursue that during the next weeks. I'm also giving a first presentation of my project (what I have done so far, where it is going) on Tuesday, November 1st, so I'm going to be working on that as well.

Likely Challenges

I spent quite a lot of time on my first approach without really achieving anything useful. Although I've learned some things that might be helpful later, I really want to get ahead with my project. It's important for me to find out if the OCL approach is the way to go, and, if so, really think it through and implement it well.

Tuesday, 11 October 2016

Setting up EggShell and running a first example

Latest Progress

I read further on in Pharo By Example, and I guess I do have a basic understanding of the language and environment by now. I'll definitely read it all the way through, but I can most likely do that alongside my work.

What's more important is that EggShell is now up and running on my machine. I decided to work on a Mac (currently running OS X El Capitan), since that's what the software has mainly been developed on and for. It didn't quite work out of the box, but more about that below. I'll most likely try to set it up on Linux as well, but as my main development environment, I'll stick to Mac.

Once I had EggShell running, I went through the first usage example Dominik gives in his thesis. It all worked very well and helped me get a basic understanding of how the tool works. I then spent some time on analyzing exactly how the recovered data is modeled. I'm now working on a first alteration of that model, to fit what I had in mind for querying. Whether that will go well, or whether I'll eventually go back to the original model, remains to be seen.

Solved Problems

As I mentioned above, I first had some trouble with EggShell. When I ran the example in Dominik's thesis, importPdfAsXmlString: would always throw an error. Some debugging showed it was because the imported string came back empty. The PDF transformation is done by a PipableOSProcess, which is passed a Terminal command, so I extracted that command and ran it directly in a Terminal. Here's what that looked like:

Since I'm fairly (or actually completely) new to Mac OS X, I didn't know if libpng was something that should already be installed on my machine, if it was something I should have installed myself, and in any case, how to get it to work now. So I contacted Dominik and showed him the screenshot above. Luckily, he found the problem pretty quickly: apparently, libpng is a Unix package which used to be pre-installed on older versions of OS X. At least with El Capitan, that's no longer the case. Dominik suggested I install Homebrew, which is a very handy package manager for OS X, and use it to install libpng (brew install libpng). So that's what I did, and indeed, it fixed the problem. The libpng package was the only thing missing; after installing it, EggShell worked without any further problems.

Next Steps

I'll continue adapting the data model and see if it's actually a better fit for my purposes than the original one, try to implement some first query methods, and also go deeper into how exactly EggShell works (at least the parts that are important for my work). I'm also not yet familiar with Roassal. Even though visualizations aren't really my focus at this point, it would certainly be good to have a look at the engine already.

Likely Challenges

I'm now at a point where I have to make some important decisions about design and architecture. A good example of that is the data model. My entire DSL will depend on how I design it now. That's why I'm currently re-implementing it - not because it was bad or anything, but because I want to try out a model I designed myself, according to my current ideas of how the DSL might work and look, and see where it fails. That way, I hope to learn some things about the model, the querying, and possibly some wrong ideas at an early stage, reducing the risk of finding major design flaws later on.