The context-free grammars extended with regular expressions (RE), known as EBNF grammars, are commonly used as they often allow for terser language definitions. Yet for such grammars the notion of syntax tree, and consequently of ambiguity, lacks an agreed definition. The simplified tree structures returned by the existing parsing algorithms do not faithfully represent all the ways a sentence is derivable by means of the REs present in the grammar rules. We contribute a precise definition of the regular parts of the EBNF syntax tree structures, which is aligned with the tree representations that have been adopted by recent algorithms for RE matching. From here, we define a range of possibilities a user may choose from to display the structure of a syntax tree. Such choices range from a flat derivation that hides the RE structure, to a full representation of the RE operators, i.e., union, concatenation and iteration, and also of the sub-REs marked as groups. A consequent notion of representation-dependent ambiguity follows. The above formal models are implemented within a Tomita-like parsing algorithm, i.e., GLR(1), which is combined with a state-of-the-art parser for ambiguous REs. Given an EBNF grammar, our parser-generator tool produces a parser compliant with the user-expressed choices of syntax tree representation. We report the parsing performance also in comparison with existing parsers, and we discuss the tradeoff between the accuracy of the syntax tree representation and the parsing speed.

General parsing with regular expression matching

L. Breveglieri;S. Crespi Reghizzi;A. Morzenti
2023-01-01

Abstract

The context-free grammars extended with regular expressions (RE), known as EBNF grammars, are commonly used as they often allow for terser language definitions. Yet for such grammars the notion of syntax tree, and consequently of ambiguity, lacks an agreed definition. The simplified tree structures returned by the existing parsing algorithms do not faithfully represent all the ways a sentence is derivable by means of the REs present in the grammar rules. We contribute a precise definition of the regular parts of the EBNF syntax tree structures, which is aligned with the tree representations that have been adopted by recent algorithms for RE matching. From here, we define a range of possibilities a user may choose from to display the structure of a syntax tree. Such choices range from a flat derivation that hides the RE structure, to a full representation of the RE operators, i.e., union, concatenation and iteration, and also of the sub-REs marked as groups. A consequent notion of representation-dependent ambiguity follows. The above formal models are implemented within a Tomita-like parsing algorithm, i.e., GLR(1), which is combined with a state-of-the-art parser for ambiguous REs. Given an EBNF grammar, our parser-generator tool produces a parser compliant with the user-expressed choices of syntax tree representation. We report the parsing performance also in comparison with existing parsers, and we discuss the tradeoff between the accuracy of the syntax tree representation and the parsing speed.
2023
syntax tree in EBNF grammar, GLR parsing, Berry-Sethi algorithm, positional method for regular expression, ambiguous regular expression
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S2590118422000739-main.pdf

Accesso riservato

: Publisher’s version
Dimensione 871.35 kB
Formato Adobe PDF
871.35 kB Adobe PDF   Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1226056
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? 0
social impact