The context-free grammars extended with regular expressions (RE), known as EBNF grammars, are commonly used as they often allow for terser language definitions. Yet for such grammars the notion of syntax tree, and consequently of ambiguity, lacks an agreed definition. The simplified tree structures returned by the existing parsing algorithms do not faithfully represent all the ways a sentence is derivable by means of the REs present in the grammar rules. We contribute a precise definition of the regular parts of the EBNF syntax tree structures, which is aligned with the tree representations that have been adopted by recent algorithms for RE matching. From here, we define a range of possibilities a user may choose from to display the structure of a syntax tree. Such choices range from a flat derivation that hides the RE structure, to a full representation of the RE operators, i.e., union, concatenation and iteration, and also of the sub-REs marked as groups. A consequent notion of representation-dependent ambiguity follows. The above formal models are implemented within a Tomita-like parsing algorithm, i.e., GLR(1), which is combined with a state-of-the-art parser for ambiguous REs. Given an EBNF grammar, our parser-generator tool produces a parser compliant with the user-expressed choices of syntax tree representation. We report the parsing performance also in comparison with existing parsers, and we discuss the tradeoff between the accuracy of the syntax tree representation and the parsing speed.
General parsing with regular expression matching
L. Breveglieri;S. Crespi Reghizzi;A. Morzenti
2023-01-01
Abstract
The context-free grammars extended with regular expressions (RE), known as EBNF grammars, are commonly used as they often allow for terser language definitions. Yet for such grammars the notion of syntax tree, and consequently of ambiguity, lacks an agreed definition. The simplified tree structures returned by the existing parsing algorithms do not faithfully represent all the ways a sentence is derivable by means of the REs present in the grammar rules. We contribute a precise definition of the regular parts of the EBNF syntax tree structures, which is aligned with the tree representations that have been adopted by recent algorithms for RE matching. From here, we define a range of possibilities a user may choose from to display the structure of a syntax tree. Such choices range from a flat derivation that hides the RE structure, to a full representation of the RE operators, i.e., union, concatenation and iteration, and also of the sub-REs marked as groups. A consequent notion of representation-dependent ambiguity follows. The above formal models are implemented within a Tomita-like parsing algorithm, i.e., GLR(1), which is combined with a state-of-the-art parser for ambiguous REs. Given an EBNF grammar, our parser-generator tool produces a parser compliant with the user-expressed choices of syntax tree representation. We report the parsing performance also in comparison with existing parsers, and we discuss the tradeoff between the accuracy of the syntax tree representation and the parsing speed.File | Dimensione | Formato | |
---|---|---|---|
1-s2.0-S2590118422000739-main.pdf
Accesso riservato
:
Publisher’s version
Dimensione
871.35 kB
Formato
Adobe PDF
|
871.35 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.