
Sandra Kübler

  • skuebler@indiana.edu
  • (812) 855-3268
  • Professor
    Linguistics
  • Departmental Liaison to Cognitive Science
    Linguistics

Field of study

  • Computational Linguistics, computational corpus linguistics, machine learning techniques in computational linguistics, natural language processing

Education

  • Ph.D., University of Tübingen, Germany, 2004

Research interests

  • Machine Learning of Natural Language
  • Corpus Linguistics
  • Part of Speech Tagging
  • Robust Parsing
  • Sentiment Analysis
  • Abusive Language Detection

Representative publications

MaltParser: A language-independent system for data-driven dependency parsing (2007)
Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryigit, Sandra Kübler ...
Natural Language Engineering, 13 (2), 95-135

Parsing unrestricted text is useful for many language technology applications but requires parsing methods that are both robust and efficient. MaltParser is a language-independent system for data-driven dependency parsing that can be used to induce a parser for a new language from a treebank sample in a simple yet flexible manner. Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.

The CoNLL 2007 shared task on dependency parsing (2007)
Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel ...
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

The Conference on Computational Natural Language Learning features a shared task, in which participants train and test their learning systems on the same data sets. In 2007, as in 2006, the shared task has been devoted to dependency parsing, this year with both a multilingual track and a domain adaptation track. In this paper, we define the tasks of the different tracks and describe how the data sets were created from existing treebanks for ten languages. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results.

Dependency parsing (2009)
Sandra Kübler, Ryan McDonald and Joakim Nivre
Synthesis Lectures on Human Language Technologies, 1 (1), 1-127

Dependency-based methods for syntactic parsing have become increasingly popular in natural language processing in recent years. This book gives a thorough introduction to the methods that are most widely used today. After an introduction to dependency grammar and dependency parsing, followed by a formal characterization of the dependency parsing problem, the book surveys the three major classes of parsing models that are in current use: transition-based, graph-based, and grammar-based models. It continues with a chapter on evaluation and one on the comparison of different methods, and it closes with a few words on current trends and future prospects of dependency parsing. The book presupposes a knowledge of basic concepts in linguistics and computer science, as well as some knowledge of parsing methods for constituency-based representations. Table of Contents: Introduction …

SAMAR: Subjectivity and sentiment analysis for Arabic social media (2014)
Muhammad Abdul-Mageed, Mona Diab and Sandra Kübler
Computer Speech & Language, 28 (1), 20-37

SAMAR is a system for subjectivity and sentiment analysis (SSA) for Arabic social media genres. Arabic is a morphologically rich language, which presents significant complexities for standard approaches to building SSA systems designed for the English language. Apart from the difficulties presented by processing social media genres, the Arabic language inherently has a high number of variable word forms, leading to data sparsity. In this context, we address the following four pertinent issues: how to best represent lexical information; whether standard features used for English are useful for Arabic; how to handle Arabic dialects; and whether genre-specific features have a measurable impact on performance. Our results show that using either lemma or lexeme information is helpful, as well as using the two part-of-speech tagsets (RTS and ERTS). However, the results show that we need individualized solutions …

Stylebook for the Tübingen treebank of written German (TüBa-D/Z) (2006)
Heike Telljohann, Erhard W Hinrichs, Sandra Kübler, Heike Zinsmeister and Kathrin Beck
Seminar für Sprachwissenschaft, Universität Tübingen, Tübingen, Germany

This stylebook is an updated version of Telljohann et al. (2015). It describes the design principles and the annotation scheme for the German treebank TüBa-D/Z developed by the Division of Computational Linguistics (Lehrstuhl Prof. Hinrichs) at the Department of Linguistics (Seminar für Sprachwissenschaft, SfS) of the Eberhard Karls Universität Tübingen, Germany. The guidelines focus on the syntactic annotation of written language data taken from the German newspaper 'die tageszeitung' (taz). The treebank comprises 3,816 articles (104,787 sentences) selected from the taz editions between 1989 and 1999. The average sentence length is 18.7 words and the total number of tokens is 1,959,474. Release 11 in July 2017 is the final release of the TüBa-D/Z treebank. Information on how to obtain the data can be found at:

Statistical parsing of morphologically rich languages (SPMRL): what, how and whither (2010)
Reut Tsarfaty, Djamé Seddah, Yoav Goldberg, Sandra Kübler, Marie Candito, Jennifer Foster ...
Computational Linguistics, 1-12

The term Morphologically Rich Languages (MRLs) refers to languages in which significant information concerning syntactic units and relations is expressed at word-level. There is ample evidence that the application of readily available statistical parsing models to such languages is susceptible to serious performance degradation. The first workshop on statistical parsing of MRLs hosts a variety of contributions which show that despite language-specific idiosyncrasies, the problems associated with parsing MRLs cut across languages and parsing frameworks. In this paper we review the current state-of-affairs with respect to parsing MRLs and point out central challenges. We synthesize the contributions of researchers working on parsing Arabic, Basque, French, German, Hebrew, Hindi and Korean to point out shared solutions across languages. The overarching analysis suggests itself as a source of directions for …

The TüBa-D/Z treebank: Annotating German with a context-free backbone (2004)
Heike Telljohann, Erhard Hinrichs and Sandra Kübler
Proceedings of LREC

The purpose of this paper is to describe the TüBa-D/Z treebank of written German and to compare it to the independently developed TIGER treebank (Brants et al., 2002). Both treebanks, TIGER and TüBa-D/Z, use an annotation framework that is based on phrase structure grammar and that is enhanced by a level of predicate-argument structure. The comparison between the annotation schemes of the two treebanks focuses on the different treatments of free word order and discontinuous constituents in German as well as on differences in phrase-internal annotation.

Overview of the SPMRL 2013 shared task: cross-framework evaluation of parsing morphologically rich languages (2013)
Djamé Seddah, Reut Tsarfaty, Sandra Kübler, Marie Candito, Jinho Choi, Richárd Farkas ...
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

This paper reports on the first shared task on statistical parsing of morphologically rich languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs given different representation types. We present and analyze parsing results obtained by the task participants, and then provide an analysis and comparison of the parsers across languages and frameworks, reported for gold input as well as more realistic parsing scenarios.

CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages (2017)
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia ...
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by predicting all of the remaining inflected forms. Both sub-tasks included high, medium, and low-resource conditions. Sub-task 1 received 24 system submissions, while sub-task 2 received 3 system submissions. Following the success of neural sequence-to-sequence models in the SIGMORPHON 2016 shared task, all but one of the submissions included a neural component. The results show that high performance can be achieved with small training datasets, so long as models have appropriate inductive bias or make use of additional unlabeled data or synthetic data. However, different biasing and data augmentation resulted in disjoint sets of inflected forms being predicted correctly, suggesting that there is room for future improvement.

How do treebank annotation schemes influence parsing results? Or how not to compare apples and oranges (2005)
Sandra Kübler
Proceedings of RANLP, 293-300

In the last decade, the Penn treebank has become the standard data set for evaluating parsers. The fact that most parsers are evaluated solely on this specific data set leaves open the question of how much these results depend on the annotation scheme of the treebank. In this paper, we will investigate the influence that different decisions in the annotation schemes of treebanks have on parsing. The investigation uses a comparison of similar treebanks of German, NEGRA and TüBa-D/Z, which are subsequently modified to allow a comparison of the differences. The results show that deleted unary nodes and a flat phrase structure have a negative influence on parsing quality, while a flat clause structure has a positive influence.

Introducing the SPMRL 2014 shared task on parsing morphologically-rich languages (2014)
Djamé Seddah, Sandra Kübler and Reut Tsarfaty
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages (SPMRL-SANCL), 103-109

This first joint meeting on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical English (SPMRL-SANCL) featured a shared task on statistical parsing of morphologically rich languages (SPMRL). The goal of the shared task is to allow participating systems to be trained and tested on comparable data sets, thus providing an objective measure of comparison between state-of-the-art parsing systems on data sets from a range of different languages. The 2014 SPMRL shared task is a continuation and extension of the SPMRL shared task that was co-located with the SPMRL meeting at EMNLP 2013 (Seddah et al., 2013). This paper provides a short overview of the 2014 SPMRL shared task goals, data sets, and evaluation setup. Since SPMRL 2014 largely builds on the infrastructure established for the SPMRL 2013 shared task, we start by reviewing the previous shared task (§ 2) and then proceed to the 2014 SPMRL evaluation settings (§ 3), data sets (§ 4), and a task summary (§ 5). Due to organizational constraints, this overview is published prior to the submission of all system test runs; a more detailed overview, including the description of participating systems and the analysis of their results, will follow as part of (Seddah et al., 2014), once the shared task is completed.

Parsing morphologically rich languages: Introduction to the special issue (2013)
Reut Tsarfaty, Djamé Seddah, Sandra Kübler and Joakim Nivre
Computational Linguistics, 39 (1), 15-22

Parsing is a key task in natural language processing. It involves predicting, for each natural language sentence, an abstract representation of the grammatical entities in the sentence and the relations between these entities. This representation provides an interface to compositional semantics and to the notions of “who did what to whom.” The last two decades have seen great advances in parsing English, leading to major leaps also in the performance of applications that use parsers as part of their backbone, such as systems for information extraction, sentiment analysis, text summarization, and machine translation. Attempts to replicate the success of parsing English for other languages have often yielded unsatisfactory results. In particular, parsing languages with complex word structure and flexible word order has been shown to require non-trivial adaptation. This special issue reports on methods that successfully …

Is it really that difficult to parse German? (2006)
Sandra Kübler, Erhard W Hinrichs and Wolfgang Maier
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), 111-119

This paper presents a comparative study of probabilistic treebank parsing of German, using the Negra and TüBa-D/Z treebanks. Experiments with the Stanford parser, which uses a factored PCFG and dependency model, show that, contrary to previous claims for other parsers, lexicalization of PCFG models boosts parsing performance for both treebanks. The experiments also show that there is a big difference in parsing performance between models trained on the Negra and on the TüBa-D/Z treebanks. Parser performance for the models trained on TüBa-D/Z is comparable to parsing results for English with the Stanford parser trained on the Penn treebank. This comparison at least suggests that German is not harder to parse than its West Germanic neighbor language English.

The Indiana "Cooperative Remote Search Task" (CReST) Corpus (2010)
Kathleen M Eberhard, Hannele Nicholson, Sandra Kübler, Susan Gundersen and Matthias Scheutz

This paper introduces a novel corpus of natural language dialogues obtained from humans performing a cooperative, remote search task (CReST) as it occurs naturally in a variety of scenarios (e.g., search and rescue missions in disaster areas). This corpus is unique in that it involves remote collaborations between two interlocutors who each have to perform tasks that require the other’s assistance. In addition, one interlocutor’s tasks require physical movement through an indoor environment as well as interactions with physical objects within the environment. The multi-modal corpus contains the speech signals as well as transcriptions of the dialogues, which are additionally annotated for dialogue structure, disfluencies, and for constituent and dependency syntax. On the dialogue level, the corpus was annotated for separate dialogue moves, based on the classification developed by Carletta et al. (1997) for coding task-oriented dialogues. Disfluencies were annotated using the scheme developed by Lickley (1998). The syntactic annotation comprises POS annotation, Penn Treebank style constituent annotations, as well as dependency annotations based on the dependencies of pennconverter.

Corpus linguistics and linguistically annotated corpora (2015)
Sandra Kübler and Heike Zinsmeister
Bloomsbury Publishing.

Linguistically annotated corpora are becoming a central part of the corpus linguistics field. One of their main strengths is the level of searchability they offer, but with the annotation come problems of the initial complexity of queries and query tools. This book gives a full, pedagogic account of this burgeoning field. Beginning with an overview of corpus linguistics, its prerequisites and goals, the book then introduces linguistically annotated corpora. It explores the different levels of linguistic annotation, including morphological, parts of speech, syntactic, semantic and discourse-level, as well as advantages and challenges for such annotations. It covers the main annotated corpora for English, the Penn Treebank, the International Corpus of English, and OntoNotes, as well as a wide range of corpora for other languages. In its third part, search strategies required for different types of data are explored. All chapters are accompanied by exercises and by sections on further reading.
