Tuesday, 27 September 2016

Getting started

Latest Progress

Yesterday, Leonel and I met with Dominik Seliner. He gave me a detailed look at his code, explained the different parts of EggShell, and pointed out what to bear in mind when attempting to get EggShell to run (how to install it, work with Moose 5.1, use the configuration object to load the software, etc.).
Today, Leonel and I discussed the possible directions of this project and decided that, for now, implementing a domain-specific language (DSL) and extending the data model are the main goals. The project's exact specifications will evolve over its course.

Next Steps

There are a couple of things I want to do in order to get started with my project:
  • Read and work through Pharo by Example and Deep Into Pharo (or at least as much of them as is needed to start my work)
  • Get EggShell up and running on either Linux or OS X, explore and understand the code
  • Set up a stable development environment
  • Read Dominik's thesis and the papers about CommunityExplorer and header metadata extraction
  • Possibly try to get Dominik's modified Xpdf version to run on Windows
  • Possibly search for further relevant papers
  • Possibly get familiar with Roassal
Once everything is up and running, my first priority is going to be the DSL. The goal is to provide methods to query the data model.

Current Project Outlook

As a first step, my project will pick up where Dominik's work left off. I'll attempt to equip EggShell with a domain-specific language (DSL) to query the data model, so that questions about scientific communities can be answered visually in a later step. Another goal is to extend the data model and structure recovery to extract more parts of the PDFs. Whether I'm going to permanently stick to EggShell, re-implement certain parts of it, or develop an entirely new tool remains to be seen.

Likely challenges

  • The modified Xpdf binaries are only available for OS X and Linux, and only the ones for OS X have been tested recently. I'll have to decide which platform I want to work on: either use Linux or OS X, or try to get Xpdf to run on Windows.
  • Since I don't yet have any experience with Pharo or Moose, it might take me some time to get comfortable with them. This will definitely lengthen the process of understanding EggShell well enough to start working with it. I'll have to make sure not to get caught up in less important details of Pharo in the beginning, while still covering as much of it as possible over the course of my project.

Initial situation

A couple of months ago, Dominik Seliner finished his bachelor's thesis on creating a workbench for modelling scientific communities at the Software Composition Group (SCG) at the University of Bern. The result of his research was a tool called EggShell, which extracts data such as title and contributors from the proceedings of events like conventions, published in PDF format, and provides a network visualization of that data. The different parts of the PDFs are identified using heuristics rather than machine learning algorithms. Dominik's thesis focuses heavily on comparing how accurately two different approaches to extracting this data detect the different parts. The visualization is achieved using the Roassal visualization engine, which is implemented in Pharo. Most parts of EggShell are written in a sophisticated Pharo image called Moose. To transform the PDFs into text or XML files, a modified build of Xpdf is used.

The exact project specifications aren't set in stone and are likely to change over the course of my work. Whenever something changes, or more details are established, I'll give an evolved project outlook in my next post.