DFKI-LT - Dissertation SeriesVol. XXXVII
Alexander Volokh: Performance-Oriented Dependency Parsing
price: € 11
Dependency parsing has become very popular among researchers from all NLP areas, because dependency representations contain very valuable easy-to-use information. In the last decade a lot of dependency parsers have been developed, each of them somehow special with its own unique characteristics. In the course of this thesis I have developed yet another parser - MDParser. In this work I discuss the main properties of dependency parsers and motivate MDParser’s development. I present the state of the art in the field of dependency parsing and discuss the shortcomings of the current developments. To my mind the main problem of the current parsers is that the task of dependency parsing is treated independently of what happens before and after it. Therefore the preprocessing steps and the embedding in applications are neglected. However, in practice parsing is rarely done for the sake of parsing itself, but rather in order to use the results in a follow-up application. Additionally, current parsers are accuracy-oriented and focus only on the quality of the results, neglecting other important properties, especially efficiency. The design of MDParser tries to counter all these drawbacks.
The evaluation of some NLP technologies is sometimes as difficult as the task itself. For dependency parsing it was long thought not to be the case, however, some recent works show that the current evaluation possibilities are limited. In this thesis I broadly present and discuss both intrinsic and extrinsic evaluation methodologies for dependency parsing. Both approaches have numerous disadvantages which I demonstrate in my work. The attachment scores, which are the most used metric of the intrinsic evaluation, do not differentiate between different dependency types, are being computed for the same portions of treebanks since many years, and thus often promote overfitting to this particular kind of data, which is especially dangerous because the data contains a certain amount of inconsistencies. The extrinsic evaluation, which evaluates the contribution of parser results to a certain application is also problematic, because the embedding of parsing into applications requires a lot of expertise and implementational effort and it is unclear whether the impact on the result is due to the quality of the results or of the embedding. In the thesis I propose a methodology to account for those weaknesses and combine the strengths of both evaluation methodologies. Finally, I evaluate MDParser and compare it with other state-ofthe-art parsers. The results show that I was able to create the fastest parser currently available, which is able to process plain text, which other parsers usually can not and whose results are only slightly behind the top accuracies in the field. However, I analyse the gap as well as its impact on applications and demonstrate that it is not decisive.
This thesis contains the descriptions of many experiments which I have performed in the course of the last three years in order to achieve the goal of implementing and evaluating MDParser. Many of these experiments were unsuccessful or their results have become obsolete in the course of my work. However, both types of work were essential for achieving the final results, because I have learned a lot from most of the experiments I have done, independently of their success.
One of the central parts of any data-driven parser is its machine learning component. I have had to adopt MDParser to four different machine learning approaches in the course of the time in order to keep up with the current developments. I have started with maximum entropy classification. The first package I have used contained only a very simple training technique, so I had to replace it soon after I was not able to achieve satisfactory results. The second package contained the best possible training method for maximum entropy classifiers, however, the result still was not satisfying. The main problem was that for linear classification a lot of features are required and if one is not able to select good ones for training, then the training is very expensive. Thus I have given up maximum entropy classification and looked for a linear classifier which supported feature selection. The third attempt was a linear regression classifier with regularisation. In regularisation a penalty term is applied to every feature weight, so that only the features with a big weight remain until the end of the learning. This corresponds to an implicit feature selection, since many bad features can not surpass the penalty term and only good ones remain. However, the number of instances was still a problem. That is why I have finally switched to linear support vector machines. With support vector machines not all feature vectors are of equal importance. Only so called support vectors, i.e. instances of data which are involved into drawing the margin between the classes, are relevant and weights only for their features are learned. These numerous changes always required significant code modifications in the system and a lot of work had to be done from scratch. However, I was able to better understand the significance and influence of machine learning on the parser in general, independently of a specific machine learning approach.
Machine learning for data-driven parsing is usually a very expensive task from the computational point of view, because there are millions of instances with millions of features. Even though I have always used linear classification approaches, i.e. there was no need to explore the combinations of individual features, it has always been a lengthy process which required sophisticated hardware. Therefore I have invested a lot of effort in reducing the complexity of the problem. On the one hand I have tried to split the problem into subproblems, whose individual solutions are not as costly as the entire problem. On the other hand I have tried to reduce the number of features by preselecting the most useful ones before training a model. Both approaches have become obsolete with the final machine learning approach which is used in MDParser. The final strategy which is based on support vector machines contains an implicit feature selection mechanism and can work with the entire set of features without any need for prefiltering. Since support vector machines only require support vectors, which are a small fraction of all vectors, the computation is no longer expensive.
One of the most important achievements of this thesis is that I show that the efficiency of dependency parsing is not mainly dependent on the complexity of the underlying parsing strategy, i.e. the worst-case number of necessary states when a processing a sentence. Currently people prefer linear strategies, for instance Nivre’s algorithms over more complex quadratic ones, as for example Covington’s parsing strategy, without analysing other properties. I have used profiling technology in order to compute the exact amount of processing time for every step and component of the algorithm. It turns out that the overwhelming amount of time is spent on feature extraction, namely on string concatenations. However, when processing a sentence with some parsing strategy, the feature extraction is not always necessary in every state and therefore it is not enough to simply compare the number of states necessary for different strategies, but one has to rather compare the number of the actually required feature extractions. This number is similar for both Nivre’s and Covington’s parsing strategies, even if they have a different theoretical complexity. In addition to that, techniques like memoisation, which allow reusability of features, such that expensive string operations do not have to be unnecessarily repeated, can be applied easier to Covington’s algorithm rather then to Nivre’s linear ones and in the end it can even outperform them.
One more very successful piece of work included the experiments on the treebanks. Treebanks are supposed to be of a very high quality, because they are essential for the success of all data-driven approaches which learn from them. I was able to show that it is in fact not the case and treebanks contain a considerable amount of erroneous annotation. Thus some more sophisticated approaches have higher scores partially because they are better at replicating these inconsistencies and thus despite the scores their results are objectively not of that much higher quality. Moreover, the approach I have come up is able not only to detect inconsistencies, but also corrects them.
Furthermore, treebanks for some languages are very big and learning curve experiments show that almost the same result can already be achieved with less amount of data. For other languages the treebanks are too small and much better results could be achieved if more annotated data were available. Since annotation of corpora is an expensive process, it is important to control it as good as possible. A very interesting question here, which I have successfully investigated in my work, is what type of data exactly is useful and should be annotated first, because it is both more beneficial for the performance and requires less effort to be annotated.
Finally, I have used various opportunities to participate in shared tasks for recognising textual entailment in order to apply my parser to concrete tasks. The first task called PETE was specially designed for parser evaluation and provided an interesting data set, which required the parser to correctly recognise a limited set of different dependencies. The other two tasks RTE-6 and RTE-7 were not intended to be used for parser evaluation, however, I have used their data sets for testing parser performance on real-world data. This work allowed me to better understand how the embedding of dependency parsing into applications works. Furthermore, these experiments demonstrate that the performance of parsers for the standard development data of treebanks is different to that for real world data sets. I have already stated that a big shortcoming of the currently predominant evaluation is that the performance of parsers is averaged over all dependency types. Some recent literature tries to address this fact by limiting the evaluation to a subset of dependency types, which are of a particular importance for some task. I have performed a similar analysis for the task of RTE and present my findings in this thesis.
A lot of the work presented in this thesis has been published on various conferences. Thus besides a well-working dependency parser, there are also seven papers, which have been peer-reviewed and accepted on different international conferences, which arise from this work.