Francis Tyers

ftyers@iu.edu

Assistant Professor
Linguistics

Education

PhD, University of Alicante, 2013

Representative publications

Apertium: a free/open-source platform for rule-based machine translation (2011)
Mikel L Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O’Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz ...
Machine translation, 25 (2), 127-144

Apertium is a free/open-source platform for rule-based machine translation. It is being widely used to build machine translation systems for a variety of language pairs, especially in those cases (mainly with related-language pairs) where shallow transfer suffices to produce good quality translations, although it has also proven useful in assimilation scenarios with more distant pairs involved. This article summarises the Apertium platform: the translation engine, the encoding of linguistic data, and the tools developed around the platform. The present limitations of the platform and the challenges posed for the coming years are also discussed. Finally, evaluation results for some of the most active language pairs are presented. An appendix describes Apertium as a free/open-source project.

CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies (2018)
Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter ...
1-21

Every year, the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2018, one of two tasks was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on test input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. This shared task constitutes a 2nd edition—the first one took place in 2017 (Zeman et al., 2017); the main metric from 2017 has been kept, allowing for easy comparison, also in 2018, and two new main metrics have been used. New datasets added to the Universal Dependencies collection between mid-2017 and the spring of 2018 have contributed to increased difficulty of the task this year. In this overview paper, we define the task and the updated evaluation methodology, describe data preparation, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

Universal Dependencies 2.1 (2017)
Joakim Nivre, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Masayuki Asahara ...

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).

South-East European Times: A parallel corpus of Balkan languages (2010)
Francis M Tyers and Murat Serdar Alperen
Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages, 49-53

This paper describes the creation of a parallel corpus from a multilingual news website translated into eight languages of the Balkans (Albanian, Bulgarian, Croatian, Greek, Macedonian, Romanian, Serbian, and Turkish) and English. The corpus is then applied to the task of machine translation, creating 72 machine translation systems. The performance of these systems is then evaluated and thought is given to where future work might be focussed.

Extracting bilingual word pairs from Wikipedia (2008)
Francis M Tyers and Jacques A Pienaar
Collaboration: interoperability between people in the creation of language resources for less-resourced languages, 19 19-22

A bilingual dictionary or word list is an important resource for many purposes, among them, machine translation. For many language pairs these are either non-existent, or very often unavailable owing to licensing restrictions. We describe a simple, fast and computationally inexpensive method for extracting bilingual dictionary entries from Wikipedia (using the interwiki link system) and assess the performance of this method with respect to four language pairs. Precision was found to be in the 69–92% region, but open to improvement.

The Apertium machine translation platform: five years on (2009)
Mikel L Forcada, Francis M Tyers and Gema Ramírez Sánchez
Proc. of the First Intl. Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes Apertium: a free/open-source machine translation platform (engine, toolbox and data), its history, its philosophy of design, its technology, the community of developers, the research and business based on it, and its prospects and challenges, now that it is five years old.

Finite-state morphological transducers for three Kypchak languages (2014)
Jonathan Washington, Ilnar Salimzyanov and Francis M Tyers
3378-3385

This paper describes the development of free/open-source finite-state morphological transducers for three Turkic languages—Kazakh, Tatar, and Kumyk—representing one language from each of the three commonly distinguished subbranches of the Kypchak branch of Turkic. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST). This paper describes how the development of a transducer for each subsequent closely-related language took less development time. An evaluation is presented which shows that the transducers all have a reasonable coverage—around 90%—on freely available corpora of the languages, and high precision over a manually verified test set.

Free/open-source resources in the Apertium platform for machine translation research and development (2010)
Francis Tyers, Felipe Sánchez-Martínez, Sergio Ortiz-Rojas and Mikel Forcada
The Prague Bulletin of Mathematical Linguistics, 93 67-76

This paper describes the resources available in the Apertium platform, a free/open-source framework for creating rule-based machine translation systems. Resources within the platform take the form of finite-state morphologies for morphological analysis and generation, bilingual transfer lexica, probabilistic part-of-speech taggers and transfer rule files, all in standardised formats. These resources are described and some examples are given of their reuse and recycling in combination with other machine translation systems.

Rule-based augmentation of training data in Breton–French statistical machine translation (2009)
Francis M Tyers, L Dugast and J Park
Proceedings of the 13th Annual Conference of the European Association of Machine Translation, EAMT09, 213-218

This article describes an initial statistical machine translation system between Breton, a Celtic language spoken in France, and French. It also describes a method for leveraging existing resources from an incomplete rule-based machine translation system to improve the coverage and translation quality of the statistical system by generating expanded bilingual vocabulary lists. Results are presented which show that this method improves on both the baseline and the baseline with a lemma-to-lemma bilingual lexicon.

Documentation of the open-source shallow-transfer machine translation platform Apertium (2007)
Mikel L Forcada, Boyan Ivanov Bonev, S Ortiz Rojas, JA Pérez Ortiz, G Ramírez Sánchez, F Sánchez Martínez ...
[Online] Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant. Available: http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf [Accessed 27th April 2014]

This documentation describes the Apertium platform, one of the open-source machine translation systems which originated within the project “Open-Source Machine Translation for the Languages of Spain” (“Traducción automática de código abierto para las lenguas del estado español”). It is a shallow-transfer machine translation system, initially designed for translation between related language pairs, although some of its components have also been used in the deep-transfer architecture (Matxin) that has been developed in the same project for the pair Spanish-Basque. Apertium can translate at present between the pairs Spanish-Galician, Spanish–Catalan, Catalan-Occitan, Catalan-French, and can be used to build translators between other related language pairs, such as Danish-Swedish, Czech–Slovak, etc. Existing machine translation systems available at present for the pairs es–ca and es–gl are mostly commercial or use proprietary technologies, which makes them very hard to adapt to new usages; furthermore, they use different technologies across language pairs, which makes it very difficult to integrate them in a single multilingual content management system.

Developing prototypes for machine translation between two Sámi languages (2009)
Francis M Tyers, Linda Wiechetek and Trond Trosterud
Proc. of the 13th Annual Conf. of the EAMT, EAMT09, 120-128

This paper describes the development of two prototype systems for machine translation between North Sámi and Lule Sámi. Experiments were conducted in rule-based machine translation (RBMT), using the Apertium platform, and statistical machine translation (SMT) using the Moses decoder. The experiments show that both approaches have their advantages and disadvantages, and that they can both make use of pre-existing linguistic resources.

1 Introduction
In this paper we describe the development of two prototype machine translation systems between two Sámi languages, North Sámi (sme) and Lule Sámi (smj), one rule-based (Apertium), and one statistical (Moses). There are other systems which have been developed with marginalised languages in mind (e.g. Lavie, 2008), but, as of writing, these were not available under an open-source licence and thus could not be applied to the task at hand. The content will be split into several sections. The first section will give a general overview of the languages in question, and sketch a typology of MT scenarios for minority languages. The next sections will describe the two machine translation strategies in some detail and will outline how the existing language technology was able to be re-used and integrated. We will follow this by a short evaluation and then some discussion and future work.

Flexible finite-state lexical selection for rule-based machine translation (2012)
Francis M Tyers, Felipe Sánchez-Martínez and Mikel L Forcada
European Association for Machine Translation.

In this paper we describe a module (rule formalism, rule compiler and rule processor) designed to provide flexible support for lexical selection in rule-based machine translation. The motivation and implementation for the system is outlined and an efficient algorithm to compute the best coverage of lexical-selection rules over an ambiguous input sentence is described. We provide a demonstration of the module by learning rules for it on a typical training corpus and evaluating against other possible lexical-selection strategies. The inclusion of the module, along with rules learnt from the parallel corpus provides a small, but consistent and statistically-significant improvement over either using the highest-scoring translation according to a target-language model or using the most frequent aligned translation in the parallel corpus which is also found in the system’s bilingual dictionaries.

Universal dependencies for Turkish (2016)
Umut Sulubacak, Memduh Gokirmak, Francis Tyers, Çağrı Çöltekin, Joakim Nivre and Gülşen Eryiğit
3444-3454

The Universal Dependencies (UD) project was conceived after the substantial recent interest in unifying annotation schemes across languages. With its own annotation principles and abstract inventory for parts of speech, morphosyntactic features and dependency relations, UD aims to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. This paper presents the Turkish IMST-UD Treebank, the first Turkish treebank to be in a UD release. The IMST-UD Treebank was automatically converted from the IMST Treebank, which was also recently released. We describe this conversion procedure in detail, complete with mapping tables. We also present our evaluation of the parsing performances of both versions of the IMST Treebank. Our findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.

Towards a free/open-source universal-dependency treebank for Kazakh (2015)
Francis M Tyers and J Washington
3rd International Conference on Turkic Languages Processing (TurkLang 2015), 276-289

This article describes the first steps towards a free/open-source dependency treebank for Kazakh based on universal dependency (UD) annotation standards. The treebank contains 402 sentences and is based on texts from a range of open-source and public domain sources. This ensures its free availability and extensibility. Texts in the treebank are first morphologically analysed and disambiguated and then annotated manually for dependency structure. In the article we present some issues in dependency syntax for Kazakh and how these are analysed in the universal-dependency framework. Preliminary results for statistical dependency parsing of Kazakh are reported, along with some directions for future research.
