The 11th International Conference on Finite-State Methods and Natural Language Processing
Held at the Gateway, St Andrews, Scotland (UK), July 15-17, 2013.
Ultimately linguistics requires methods capable of dealing with non-regular languages, but the power of finite state methods (FSMs) makes them very attractive for practical reasons, as the existence of this conference attests. FSMs derive their power from their deep roots in the theory of regular languages: the Myhill-Nerode theorem and the minimal (canonical) DFA, where the states of the minimal DFA correspond to the congruence classes of the language. In this talk I will explore the possibility of developing the same sort of theory for the class of context-free languages, and for some standard mildly context-sensitive language classes. The standard view is that this is impossible because of the undecidability of various decision problems associated with CFGs, but this seems too pessimistic.
The first step is to define a canonical set of nonterminals for every context-free language: we discuss two ways of doing this, corresponding approximately to the distinction between deterministic and nondeterministic automata, and these turn out to have surprising connections to the theory of distributional learning. In particular, we can show that the smallest grammar for any context-free language will have nonterminals corresponding to elements of the syntactic concept lattice, a generalisation of the syntactic monoid. Though abstract and theoretical, these results lead to concrete algorithms for learning context-free grammars, and suggest that distributional learning and context-free grammars are more closely related than previously thought.
Pushdown Automata in Statistical
Machine Translation [slides]
This talk will present some recent work investigating pushdown automata (PDA) in the context of statistical machine translation and alignment under synchronous context-free grammars (SCFGs). PDAs can be used to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence, and this presentation will give an overview of general-purpose PDA algorithms for replacement, composition, shortest path, and expansion. HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms, will be described, and the complexity of the HiPDT decoder operations will be compared to decoders based on finite state automata and the widely used hypergraph representations. PDAs have strengths in a particular translation scenario: exact decoding with large SCFGs and relatively smaller language models. This talk is based on recent work with Adrià de Gispert and Gonzalo Iglesias at the University of Cambridge, and Michael Riley and Cyril Allauzen at Google Research.
GAP (Groups, Algorithms, Programming) is a system for computational discrete algebra, with its original focus being computational group theory. Since then, it has been extended with libraries and packages for different branches of discrete mathematics, including automata and language theory. We will explore the different operations of Automata, a package of functions for calculations and transformations with finite-state automata and regular languages. In particular, we will demonstrate the efficiency of Automata when working with very large automata, including reducing them and converting them to rational expressions. Additionally, Automata contains functions implementing the correspondence between automata and finitely generated subgroups of the free group.
Variational inference for tree
automata and tree transducers [slides]
Statistical models are only as good as the data they are trained on and tend to run into trouble when there is either not enough data or insufficient variation within the data, resulting in issues such as overfitting or sparsity. Bayesian inference is one principled approach to mitigating such problems, grounded in the mathematics of Bayesian statistics, which can be used simultaneously for smoothing and for incorporating prior knowledge, thereby augmenting and tempering the data. Such methods have already seen a great deal of success across a wide range of areas within NLP, but unfamiliarity with the basic mathematical underpinnings still limits their application in certain areas.
Assuming only a background in calculus and basic probability theory, the tutorial aims to provide both a high-level intuition for the theoretical justification of the general inference technique and just enough mathematical tools to enable participants to understand and derive such algorithms on their own. We focus on Variational Bayesian methods since they are generally easy to implement and relatively efficient in terms of computational complexity. Furthermore, we choose tree automata as our target model class; this class is quite large, including Finite State Automata as a special case, and is closely related to many other popular formalisms such as Context-Free Grammar and Tree-Adjoining Grammar, so the material will be relevant to a wide range of settings.
The complexity of analyzing unrestricted stochastic context-free grammars [slides]
For unrestricted stochastic context-free grammars (SCFGs), the computational complexity of some of the most basic tasks involved in statistical NLP applications had until recently remained open. We have recently resolved several open problems regarding the complexity of computing key quantities associated with some classic and heavily studied stochastic processes, including multi-type branching processes and stochastic context-free grammars.
One of the key results is the following: we have shown that one can approximate the least fixed point solution for a multivariate system of monotone probabilistic polynomial equations in time polynomial in both the encoding size of the system of equations and in log(1/delta), where delta>0 is the desired additive error bound of the solution. (The model of computation is the standard Turing machine model.)
Our algorithms are based on suitable variants of Newton's method. The algorithms are relatively easy to implement, but their analysis is mathematically quite involved.
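As a toy illustration of the kind of fixed-point computation involved (this sketch is ours, not the authors' implementation; the one-rule grammar and all names are illustrative), Newton's method can be applied to the single-variable system x = p·x² + (1−p), whose least fixed point is the termination probability of the SCFG X → X X (prob p) | a (prob 1−p):

```python
# Toy sketch: Newton's method for the least fixed point of
#   x = p*x^2 + (1 - p),
# the termination probability of the one-rule SCFG
#   X -> X X (prob p)  |  a (prob 1 - p).
# Illustrative only; the general algorithm handles multivariate systems.

def termination_prob(p, tol=1e-12, max_iter=100):
    """Least nonnegative root of p*x^2 + (1-p) - x, via Newton from 0."""
    x = 0.0  # starting below the least fixed point gives monotone convergence
    for _ in range(max_iter):
        f = p * x * x + (1.0 - p) - x
        fprime = 2.0 * p * x - 1.0
        x_next = x - f / fprime
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

print(termination_prob(0.3))  # ≈ 1: subcritical, terminates almost surely
print(termination_prob(0.6))  # ≈ 2/3: supercritical, may grow forever
```

For p ≤ 1/2 the least fixed point is 1 (almost-sure termination); for p > 1/2 it is (1−p)/p, e.g. 2/3 at p = 0.6.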
Among the applications for SCFGs are P-time algorithms for computing their termination probabilities, a.k.a. their "partition function"; computing the "inside probability" of generating a given string; and computing the probability that the SCFG generates a string in a given regular language specified by a DFA (when the SCFG itself has been estimated via EM), as well as P-time algorithms for several other applications. Let us note again that these algorithms all apply to arbitrary SCFGs, including SCFGs with unrestricted occurrences of epsilon-rules, not just to SCFGs already in CNF or other normal forms.
I will survey some of this work in this tutorial.
(This talk describes recent joint work with Alistair Stewart (University of Edinburgh) and Mihalis Yannakakis (Columbia University), that has appeared in papers at STOC'12, ICALP'12, ICALP'13, and CAV'13, and in more recent unpublished works.)
Computing the most probable string
with a probabilistic finite state machine
The problem of finding the consensus / most probable string for a distribution generated by a probabilistic finite automaton or a hidden Markov model arises in a number of natural language processing tasks: it has to be solved in several transducer-related tasks like optimal decoding in speech, or finding the most probable translation of an input sentence. We provide an algorithm which solves these problems in time polynomial in the inverse of the probability of the most probable string, which in practice makes the computation tractable in many cases. We also show that this exact computation compares favourably with the traditional Viterbi computation.
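A minimal sketch of the best-first idea behind such exact computation (an illustration under simplifying assumptions, not the authors' algorithm; the PFA encoding and all names are ours): prefixes are expanded in order of their total forward probability mass, which upper-bounds the probability of every string extending the prefix, so the search can stop once the best complete string found beats every remaining prefix.

```python
import heapq

# Best-first search for the most probable string of a small probabilistic
# finite automaton (PFA). Illustrative sketch only.

def most_probable_string(n_states, init, trans, final, alphabet):
    """init[q]: initial prob; trans[a][q][r]: prob of reading a from q to r;
    final[q]: stopping prob. Returns (string, probability)."""
    heap = [(-sum(init), "", tuple(init))]  # (-prefix mass, prefix, forward vector)
    best_p, best_s = 0.0, None
    while heap:
        neg_mass, prefix, fwd = heapq.heappop(heap)
        if -neg_mass <= best_p:
            break  # no remaining prefix can beat the best complete string
        p_string = sum(f * final[q] for q, f in enumerate(fwd))
        if p_string > best_p:
            best_p, best_s = p_string, prefix
        for a in alphabet:
            nxt = tuple(sum(fwd[q] * trans[a][q][r] for q in range(n_states))
                        for r in range(n_states))
            if sum(nxt) > best_p:  # a prefix's mass bounds all its extensions
                heapq.heappush(heap, (-sum(nxt), prefix + a, nxt))
    return best_s, best_p

# Toy PFA: state 0 reads 'a' to state 1 (0.7) or stops (0.3);
# state 1 reads 'b' to itself (0.5) or stops (0.5).
trans = {"a": [[0.0, 0.7], [0.0, 0.0]],
         "b": [[0.0, 0.0], [0.0, 0.5]]}
s, p = most_probable_string(2, [1.0, 0.0], trans, [0.3, 0.5], "ab")
print(s, p)  # 'a' with probability 0.7 * 0.5 = 0.35
```

Note that the number of prefixes popped before the stopping condition fires is bounded in terms of the inverse of the best string's probability, matching the abstract's complexity claim.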
Stochastic bi-languages to model dialogs
Partially observable Markov decision processes provide an excellent statistical framework for spoken dialog systems: they admit global optimization and deal with uncertainty about user goals. However, putting them into practice entails intractable problems that require efficient, suboptimal approaches. Alternatively, some pattern recognition techniques have also been proposed. In this framework, the joint probability distribution over a semantic language provided by the speech understanding system and the language of actions provided by the dialog manager needs to be estimated. In this work we propose to model this joint probability distribution with stochastic regular bi-languages, which have also been successfully proposed for machine translation purposes. To this end, a Probabilistic Finite State Bi-Automaton is defined in the paper. As an extension of this model we also propose an attributed model that can deal with the task's attribute values. Attribute values are attached to the states in such a way that the usual learning and smoothing techniques can be applied, as shown in the paper. As far as we know, this is the first formally defined approach based on stochastic bi-languages for dealing with dialog tasks.
ZeuScansion: a tool for scansion of English
We present a system based on finite-state technology that is capable of performing metrical scansion of verse written in English. Scansion is the traditional task of analyzing the lines of a poem, marking the stressed and non-stressed elements, and dividing the line into metrical feet. The system’s workflow is composed of several subtasks designed around finite state machines that analyze verse by performing tokenization, part-of-speech tagging, stress placement, and stress-pattern guessing for unknown words. The scanner also classifies its input according to the predominant type of metrical foot found. We also present a brief evaluation of the system using a gold-standard corpus of human-scanned verse, on which a per-syllable accuracy of 86.78% is reached. The program uses open-source components and is released under the GNU GPL license.
A convexity-based generalization of Viterbi
for non-deterministic weighted automata
We propose a novel approach for the max-string problem in acyclic nondeterministic weighted FSAs, based on a convexity-related notion of domination among intermediate results, which can be seen as a generalization of the usual dynamic programming technique for finding the max-path (a.k.a. the Viterbi approximation) in such automata.
input with skipping nested automata
We propose a new kind of finite-state automaton, suitable for structured input whose characters correspond to unranked trees of small depth. As a motivating application, we consider executing morphosyntactic queries on a richly annotated text corpus.
Subrational relations and morphological analysis
We list the major properties of some important classes of subrational relations, mostly to make them easily accessible to computational linguists. We then argue that there are good linguistic reasons for using no class smaller than the class of synchronous regular relations for morphological analysis, and good mathematical reasons for using no class which is larger.
We present a method for probabilistic parsing of German words. Our approach uses a morphological analyzer based on weighted finite-state transducers to segment words into lexical units, and a probabilistic context-free grammar trained on a manually created set of word trees for the parsing step.
morphosyntactic analysis of richly inflected languages —
a Polish example
We consider finite-state optimization of morphosyntactic analysis of richly and ambiguously annotated corpora. We propose a general algorithm which, despite being surprisingly simple, has proved effective in several applications for rule sets whose rules match infrequently.
Finite state morphology
tool for Latvian
The existing Latvian morphological analyzer was developed more than ten years ago. Its main weaknesses are low processing speed on large text corpora, the complexity of adding new entries to the lexical database, and limitations on usage across different platforms. This paper describes the creation of a new Latvian morphology tool. The tool can return the lemma and a morphological analysis for a given word form; it can generate the required word form if a lemma and a form description are given; and it can generate all possible word forms of a given lemma. Because finite-state transducer (FST) technology is used for the morphology tool, the lexicon is easy to extend, the tool can be reused on different platforms, and it performs well.
Modeling graph languages with
grammars extracted via tree decompositions
Work on probabilistic models of natural language tends to focus on strings and trees, but there is increasing interest in more general graph-shaped structures, since they seem better suited to representing natural language semantics, ontologies, and other varieties of knowledge structures. However, while there are relatively simple approaches to defining generative models over strings and trees, doing so has proven more challenging for general graphs. This paper describes a natural generalization of the n-gram model to graphs, making use of Hyperedge Replacement Grammars to define generative models of graph languages.
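For reference, the string special case being generalized — the ordinary bigram model, where the "graph" is just a chain of tokens — can be sketched as follows (the vocabulary and probabilities below are made up for illustration):

```python
# The familiar string bigram model: each token is generated conditioned on
# the previous token, with <s> and </s> marking the chain's endpoints.
# Probabilities are invented for this example.
bigram = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def prob(sentence):
    """Probability of a token list under the bigram model."""
    p, prev = 1.0, "<s>"
    for tok in sentence + ["</s>"]:
        p *= bigram[prev].get(tok, 0.0)  # unseen bigrams get probability 0
        prev = tok
    return p

print(prob(["the", "cat"]))  # 0.6 * 0.5 * 1.0 = 0.3
```

The HRG-based models of the paper replace the "attach the next token to the previous one" step with hyperedge replacement, so that the generated object can be a general graph rather than a chain.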
Finite state methods and description logics
The accepting runs of a finite automaton are represented as concepts in a Description Logic, for various systems of roles computed by finite-state transducers. The representation refines the perspective on regular languages provided by Monadic Second-Order Logic (MSO), under the Büchi-Elgot-Trakhtenbrot theorem. String symbols are structured as sets to succinctly express MSO-sentences, with auxiliary symbols conceived as variables bound by quantifiers.
Using NooJ for semantic annotation of
Italian language corpora in the domain of motion: a cognitive-grounded
In this paper we propose a system to parse and annotate motion constructions expressed in Italian. We used NooJ as a software tool to implement finite-state transducers that recognize the linguistic elements constituting motion events. We describe the model we adopted for the semantic description of events (grounded in Talmy’s Cognitive Semantics theories) and then illustrate how the system works on a domain-specific corpus, the annotation structure our system produces, some annotation structures for example sentences expressing motion, and an attempt to evaluate the system’s performance.
of finite-state automata
We investigate the composition of finite-state automata in a multiprocessor environment, presenting a parallel variant of a widely-used composition algorithm. We provide an approximate upper bound for composition speedup of the parallel variant with respect to serial execution, and empirically evaluate the performance of our implementation with respect to this bound.
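For readers less familiar with the operation being parallelized, here is a serial sketch of the standard on-the-fly product construction over two DFAs (an illustration only; the paper's algorithm, data structures, and automaton format are not shown here):

```python
# Serial sketch of on-the-fly product construction (intersection) of two
# DFAs given as transition dicts. Illustrative only.

def compose(delta1, start1, finals1, delta2, start2, finals2):
    """Build the product DFA reachable from the pair of start states."""
    start = (start1, start2)
    delta, finals = {}, set()
    stack, seen = [start], {start}
    while stack:
        q1, q2 = stack.pop()
        if q1 in finals1 and q2 in finals2:
            finals.add((q1, q2))
        # only symbols both automata can read lead anywhere in the product
        for a in set(delta1.get(q1, {})) & set(delta2.get(q2, {})):
            nxt = (delta1[q1][a], delta2[q2][a])
            delta[((q1, q2), a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return start, delta, finals

def accepts(start, delta, finals, s):
    q = start
    for a in s:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals

d1 = {0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}  # even number of a's (final: 0)
d2 = {0: {"a": 1, "b": 0}, 1: {"a": 1, "b": 0}}  # strings ending in 'a' (final: 1)
start, delta, finals = compose(d1, 0, {0}, d2, 0, {1})
print(accepts(start, delta, finals, "aa"))  # True
print(accepts(start, delta, finals, "ab"))  # False
```

The pair-state exploration loop is the natural unit of work to distribute across processors, which is where the speedup bound discussed in the paper comes in.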
On finite-state tonology
with autosegmental representations
Building finite-state transducers from written autosegmental grammars of tonal languages involves compiling the rules into a notation provided by the finite-state tools. This work tests a simple, human-readable approach to compiling and debugging autosegmental rules using a simple string encoding for autosegmental representations. The proposal is based on brackets that mark the edges of the tone autosegments. The bracket encoding of the autosegments is compact and directly human-readable. The paper also presents a finite-state transducer for transforming a concatenated string of lexemes, where each lexeme (such as "babaa|HH") consists of a segmental substring and a tonal substring, into a chronological master string ("b[a]b[aa]") in which the tone autosegments are associated with their segmental spans.
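A minimal sketch of the bracket encoding on the abstract's own example (this is our illustration, not the paper's transducer; it simply pairs vowel groups with tones left to right and marks each autosegment's span with brackets):

```python
import re

# Turn a lexeme "segments|tones" such as "babaa|HH" into a bracketed
# string in which each tone autosegment's span of tone-bearing vowels is
# marked: "b[a]b[aa]". Tones are matched to vowel groups positionally here;
# the paper's transducer handles the general association.

def bracketize(lexeme):
    segs, tones = lexeme.split("|")
    # split into consonant runs and vowel groups (the tone-bearing units)
    parts = re.split(r"([aeiou]+)", segs)
    out, t = [], 0
    for part in parts:
        if part and part[0] in "aeiou" and t < len(tones):
            out.append("[" + part + "]")  # one autosegment per vowel group
            t += 1
        else:
            out.append(part)
    return "".join(out)

print(bracketize("babaa|HH"))  # -> b[a]b[aa]
```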
A finite-state approach
to translate SNOMED CT terms into Basque using medical prefixes and suffixes
This paper presents a system that generates Basque equivalents of terms describing disorders in SNOMED CT. The task is performed using finite-state transducers and a lexicon of medical prefixes and suffixes. This lexicon is composed of English-Basque translation pairs, and it is used both for identifying the affixes of the English term and for translating them into Basque. The translated affixes are composed using morphotactic rules. We evaluated the system against a gold standard, obtaining promising results (a precision of 0.93). This system is part of a more general system whose aim is the translation of SNOMED CT into Basque.
Syncretism and how to
deal with it in a morphological analyzer:
a German example
Syncretism is the area of the morphology-syntax interface where morphology fails the syntax. Inadequate treatment of it in the design of a morphological analyzer can lead to unbalanced performance of the analyzer either at generation or at analysis. Furthermore, adequate and consistent treatment of syncretism is needed if the analyzer is to be used for language modeling, especially modeling of the syncretism itself. In this paper I show that it is possible to create a morphological analyzer that can be tailored to various intended uses with minimal effort.
Finite state approach to the Kazakh nominal paradigm
This work presents a finite-state approach to the Kazakh nominal paradigm. We describe the development and implementation of a finite-state transducer for the nominal paradigm of Kazakh, an agglutinative language. We consider the morphophonemic constraints that Kazakh synharmonism (vowel and consonant harmony) imposes on letter combinations when affixes are joined, as well as the morphotactics. The resulting Kazakh finite-state transducer performs morphological analysis and generation. Preliminary tests were made on using the morphological analyzer after OCR preprocessing to correct errors in Kazakh texts.
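As a simplified illustration of the kind of harmony constraint involved (our sketch, not the authors' transducer; the consonant-conditioned allomorphs -дар/-дер and -тар/-тер are deliberately omitted), the choice of plural suffix can be conditioned on the stem's last vowel:

```python
# Simplified Kazakh vowel harmony: the plural suffix surfaces as -лар
# after back-vowel stems and -лер after front-vowel stems. Real Kazakh
# morphotactics also condition on the final consonant; this sketch does not.

BACK = set("аоұы")
FRONT = set("әеөүі")

def plural(stem):
    """Pick the plural allomorph from the stem's last vowel."""
    for ch in reversed(stem):
        if ch in BACK:
            return stem + "лар"
        if ch in FRONT:
            return stem + "лер"
    return stem + "лар"  # fallback for vowelless input; illustrative only

print(plural("бала"))  # bala 'child'  -> балалар
print(plural("үй"))    # üy 'house'    -> үйлер
```

In the transducer setting, the same constraint is expressed as a finite-state rule composed with the morphotactics, rather than as procedural code.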