The context-free grammars extended with regular expressions (RE), known as ECF or EBNF grammars, are commonly used as they often allow for terser language definitions. Yet for such grammars the notion of syntax tree, and consequently of ambiguity, lacks an agreed definition. The simplified tree structures returned by the existing parsing algorithms do not faithfully represent all the ways a sentence is derivable by means of the REs present in the grammar rules. We contribute a precise definition of the regular parts in the structures of the EBNF syntax trees, which is aligned with the tree representations that have been adopted by the recent algorithms for RE matching. For an EBNF rule, the finest representation shows all the RE operators as nodes in the sub-tree, while the flat representation simply appends the string generated by the RE to the nonterminal node. A consequent notion of representation-dependent ambiguity follows. The above representations are incorporated into a Tomita-style parsing algorithm, i.e., a GLR(1) parser. To construct such a parser, we follow the positional method of the Berry-Sethi parsers for regular languages. Given an EBNF grammar, our parser-generator produces a parser compliant with the choices expressed by the user about the representation of the syntax trees. We also report the parsing performance, over a few representative sets of languages (benchmarks) for programming and data representation, in comparison with existing parsers for EBNF grammars.

2022 Best Paper Award Runner Up - Journal of Computer Languages

Luca Breveglieri;Stefano Crespi Reghizzi;Angelo Morzenti
2023-01-01

Abstract

The context-free grammars extended with regular expressions (RE), known as ECF or EBNF grammars, are commonly used as they often allow for terser language definitions. Yet for such grammars the notion of syntax tree, and consequently of ambiguity, lacks an agreed definition. The simplified tree structures returned by the existing parsing algorithms do not faithfully represent all the ways a sentence is derivable by means of the REs present in the grammar rules. We contribute a precise definition of the regular parts in the structures of the EBNF syntax trees, which is aligned with the tree representations that have been adopted by the recent algorithms for RE matching. For an EBNF rule, the finest representation shows all the RE operators as nodes in the sub-tree, while the flat representation simply appends the string generated by the RE to the nonterminal node. A consequent notion of representation-dependent ambiguity follows. The above representations are incorporated into a Tomita-style parsing algorithm, i.e., a GLR(1) parser. To construct such a parser, we follow the positional method of the Berry-Sethi parsers for regular languages. Given an EBNF grammar, our parser-generator produces a parser compliant with the choices expressed by the user about the representation of the syntax trees. We also report the parsing performance, over a few representative sets of languages (benchmarks) for programming and data representation, in comparison with existing parsers for EBNF grammars.
2023
syntax tree in EBNF grammar, parse tree, GLR parsing, bottom-up parsing, Berry-Sethi algorithm, positional method for regular expression, regular expression parsing, ambiguous regular expression
File in questo prodotto:
File Dimensione Formato  
COLA_Best Paper 2022 Runner Up Certificate_Borsotti et al.pdf

accesso aperto

: Altro materiale allegato
Dimensione 281.36 kB
Formato Adobe PDF
281.36 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1260889
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact