Bridging Python to Silicon: The SODA Toolchain

Nicolas Bohm Agostini and Serena Curzel, Pacific Northwest National Laboratory, Richland, WA, 99354, USA
Jeff (Jun) Zhang, Harvard University, Cambridge, MA, USA
Ankur Limaye, Pacific Northwest National Laboratory, Richland, WA, 99354, USA
Cheng Tan, Microsoft, Redmond, WA, USA
Vinay Amatya, Marco Minutoli, Vito Giovanni Castellana, and Joseph Manzano, Pacific Northwest National Laboratory, Richland, WA, 99354, USA
David Brooks and Gu-Yeon Wei, Harvard University, Cambridge, MA, USA
Antonino Tumeo, Pacific Northwest National Laboratory, Richland, WA, 99354, USA

Systems performing scientific computing, data analysis, and machine learning tasks have a growing demand for application-specific accelerators that can provide high computational performance while meeting strict size and power requirements. However, the algorithms and applications that need to be accelerated are evolving at a rate that is incompatible with manual design processes based on hardware description languages. Agile hardware design tools based on compiler techniques can help by quickly producing an application-specific integrated circuit (ASIC) accelerator starting from a high-level algorithmic description. We present the software-defined accelerator (SODA) synthesizer, a modular and open-source hardware compiler that provides automated end-to-end synthesis from high-level software frameworks to ASIC implementation, relying on multilevel representations to progressively lower and optimize the input code. Our approach does not require the application developer to write any register-transfer level code, and it is able to reach up to 364 giga floating point operations per second (GFLOPS)/W efficiency (32-bit precision) on typical convolutional neural network operators.

Many applications, from environmental monitoring, to navigation and control, to scientific experiments, require efficient processing of a combination of data analysis, machine learning (ML), and scientific computing algorithms. They need systems that can effectively support each phase of the computation and adapt in real time to changes in the environment, under a variety of energy, performance, area, and latency constraints. All these requirements combined make general-purpose processors no longer a viable solution and render application-specific accelerators a necessity.

Typically, domain experts design and validate their algorithms in high-level programming frameworks (most of which are based on Python). Both algorithmic methods and programming frameworks are evolving quickly, especially in the data science and ML areas, making it extremely difficult to design custom accelerators able to support a wide variety of solutions. At the same time, the conventional hardware design cycle has significant productivity limitations. Manually designing custom accelerators in hardware description languages (HDLs) is complex and time consuming, preventing effective exploration of alternative architectures and often requiring a new design cycle each time new algorithms or models appear. General and automated solutions are needed to quickly transition from the formulation of an algorithm to the implementation of a dedicated accelerator.

In more detail, hardware designers usually extract key computational patterns from the algorithms that need to be accelerated, identify parallelism and data reuse opportunities, and design custom functional units for specific kernels at the register-transfer level (RTL) with an HDL. A common alternative to accelerate this process is to implement the functional units in C/C++ and convert them to HDL through high-level synthesis (HLS) tools, such as Vitis HLS from Xilinx, Catapult C from Siemens, or Stratus HLS from Cadence. In both cases, after functional verification, the HDL kernels are passed to downstream logic synthesis and physical design tools, and finally integrated into a system. This kind of design flow, with part manual coding and part automated processing, is standard practice for designing hardware. However, it still requires tremendous effort, and the quality of the results highly depends on the designers’ expertise. Moreover, the interactions between multiple computer-aided design tools at different levels of abstraction make the design process tedious and error-prone, introducing significant verification overheads and forcing manual propagation of changes across different stages of the design flow.

To address these issues, we developed the software-defined accelerator (SODA) synthesizer, an open-source, modular, and extensible hardware compiler for the generation of highly specialized accelerators from algorithms designed in high-level programming frameworks. The SODA synthesizer is composed of a compiler-based frontend, to interface with high-level programming frameworks and apply high-level optimizations, and a compiler-based backend, to generate Verilog code and interface with external tools that compile the final design (either to application-specific integrated circuits (ASICs) or to field-programmable gate arrays (FPGAs)).

We used typical linear algebra and deep neural network workloads to test the efficiency of the SODA synthesizer, exploring its potential to generate optimized, high-performance hardware designs. Figure 1, for example, shows the SODA implementations of several different layers from the LeNet convolutional neural network model, in the standard GDSII format for ASIC manufacturing. SODA users can quickly evaluate different design points until they reach the desired solution for their performance or area requirements by selecting different command-line options. Such an exploration would require multiple expensive redesigns with traditional HDL- or HLS-based approaches, potentially never reaching the optimal result due to the limited design time available and the lack of integration between the different tools in the flow. SODA, instead, provides a no-human-in-the-loop end-to-end hardware compiler where no modifications to the input code are needed, and its multilevel, modular, extensible design offers new opportunities for exploring further analysis and optimization passes.

![FIGURE 1. ASIC implementations of LeNet layers automatically generated by the SODA synthesizer: with a brief exploration of available compiler options it is possible to reach the desired performance-area tradeoff.](image-url)

SODA FRAMEWORK

Figure 2(a) provides an overview of the SODA synthesizer framework, which can be divided into two parts: a compiler-based frontend and a compiler-based hardware generation engine. The framework accepts input descriptions from high-level Python frameworks, translated by the frontend into a high-level intermediate representation (IR). The frontend exploits the multi-level intermediate representation (MLIR) to perform hardware/software partitioning of the algorithm specifications and architecture-independent optimizations. Subsequently, it generates a low-level IR (LLVM IR) for the hardware generation engine, PandA-Bambu, a state-of-the-art open-source HLS tool which, unlike most commercial alternatives, can also accept LLVM IR as input. Optimizations at all levels of the SODA toolchain are implemented as compiler passes, significantly influencing the generated hardware designs in terms of performance, area, and power. An exhaustive exploration of the design space is made possible by enabling and disabling compiler passes or tuning their options.

SODA-OPT Frontend

SODA-OPT, shown in Figure 2(b), is the high-level compiler frontend of the SODA synthesizer. Its role is to perform search, outlining, optimization, dispatching, and acceleration passes on the input program, preparing it for hardware synthesis targeting FPGAs or ASICs. To implement these functionalities, SODA-OPT leverages and extends the MLIR framework.

MLIR is a framework that allows building reusable, extensible, and modular compiler infrastructure by defining dialects, i.e., self-contained IRs that respect MLIR’s meta-IR syntax. Dialects allow modeling code at different levels of abstraction, enabling the use of specialized representations to facilitate specific compiler optimizations. We refer to dialects that are maintained in-tree, along with the MLIR framework, as built-in dialects. These include abstractions for linear algebra, polyhedral analysis, structured control flow, and others. Several high-level programming frameworks for various domains, such as ML (TensorFlow, ONNX-MLIR, TORCH-MLIR), scientific computing (NPCOMP), and general-purpose languages (e.g., the FLANG frontend for Fortran), have started leveraging MLIR to implement their own specific dialects, optimization passes, and lowering methods to translate their programs into built-in MLIR dialects. Built-in dialects are entry points to the SODA synthesizer, enabling high-level frameworks to leverage our toolchain.

SODA-OPT introduces the soda dialect to partition input applications into an orchestrating host program and custom hardware accelerators. SODA-OPT analysis and transformation passes ingest MLIR inputs from high-level frameworks, identify key code regions, and outline them into separate MLIR modules. Code regions that are selected for hardware acceleration undergo an optimization pipeline with progressive lowerings through different MLIR dialects (linalg → affine → scf → cf → llvm), until they are finally translated into an LLVM IR purposely restructured for hardware synthesis. Instead, the host module is lowered into an LLVM IR file that includes runtime calls to control the generated custom accelerators.

Table 1 summarizes the high-level optimization passes in SODA-OPT, and their benefits for the hardware synthesis process. Traditional HLS design flows expect manual code modifications that restructure the original algorithm (to create internal buffers or apply profitable tiling strategies) or tool-specific pragma annotations (to guide unrolling or provide alias information). Instead, SODA-OPT exploits dedicated and context-specific MLIR dialects to apply systematic high-level transformations. These can expose instruction- and data-level parallelism, perform loop transformations, and apply various other steps, such as buffer hoisting or accumulation on temporary variables. SODA-OPT leverages the linalg dialect to identify operations and separate hardware and software partitions, then it optimizes loops through the affine dialect, and finally performs CSE, DCE, and scalar replacement of aggregates (SRoA) optimizations through the cf, arith, and memref dialects. The optimization pipeline is not monolithic: developers can easily enable, disable, reuse, or tune SODA-OPT passes, providing ample opportunities to enhance them for specific domains and implement automated exploration strategies.
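
The non-monolithic, pass-based structure described above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy three-address IR and the names `cse`, `dce`, and `run_pipeline` are hypothetical and are not SODA-OPT or MLIR APIs, and real MLIR passes operate on a much richer representation.

```python
from typing import Callable, FrozenSet, List, Tuple

# Toy instruction in SSA form: (destination, operator, operands).
Instr = Tuple[str, str, Tuple[str, ...]]
Pass = Callable[[List[Instr]], List[Instr]]


def cse(program: List[Instr]) -> List[Instr]:
    """Common subexpression elimination: keep the first copy of each expression."""
    seen = {}     # (operator, operands) -> name already holding that value
    renamed = {}  # eliminated destination -> surviving name
    out = []
    for dest, op, args in program:
        args = tuple(renamed.get(a, a) for a in args)
        key = (op, args)
        if op != "return" and key in seen:
            renamed[dest] = seen[key]
        else:
            seen[key] = dest
            out.append((dest, op, args))
    return out


def dce(program: List[Instr]) -> List[Instr]:
    """Dead code elimination: drop instructions whose result is never used."""
    live = set()
    kept = []
    for dest, op, args in reversed(program):
        if op == "return" or dest in live:
            kept.append((dest, op, args))
            live.update(args)
    kept.reverse()
    return kept


def run_pipeline(program: List[Instr],
                 passes: List[Tuple[str, Pass]],
                 disabled: FrozenSet[str] = frozenset()) -> List[Instr]:
    """Apply the named passes in order, skipping any that are disabled."""
    for name, transform in passes:
        if name not in disabled:
            program = transform(program)
    return program


PIPELINE = [("cse", cse), ("dce", dce)]

example = [
    ("t0", "mul", ("a", "b")),
    ("t1", "mul", ("a", "b")),    # redundant with t0
    ("t2", "add", ("t0", "t1")),
    ("t3", "sub", ("a", "b")),    # never used
    ("_", "return", ("t2",)),
]

if __name__ == "__main__":
    for instr in run_pipeline(example, PIPELINE):
        print(instr)
```

Disabling a pass, for example `run_pipeline(example, PIPELINE, disabled=frozenset({"cse"}))`, leaves the redundant multiplication in place, which is the kind of knob an automated exploration strategy could toggle.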

Table 1. Summary of high-level optimizations in SODA-OPT. Some are existing MLIR passes, while others are custom, HLS-oriented implementations.

<table>
<thead>
<tr>
<th>Optimization</th>
<th>Benefit for HLS</th>
<th>Passes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single basic block containing the compute intensive part of the kernel</td>
<td>More freedom to schedule operations</td>
<td>Tiling, unrolling</td>
</tr>
<tr>
<td>Increased instruction-level parallelism</td>
<td>Schedule independent arithmetic operations on the same cycle when their inputs are available</td>
<td>Unrolling</td>
</tr>
<tr>
<td>Increased data level parallelism</td>
<td>Schedule operations into different memory units on the same cycle</td>
<td>Tiling, unrolling, temporary buffer allocation</td>
</tr>
<tr>
<td>Avoid unnecessary reads from kernel arguments</td>
<td>Reduce expensive accesses to external memory</td>
<td>Temporary buffer allocation, allocation buffer promotion</td>
</tr>
<tr>
<td>Reuse read results, aggregate on scalars</td>
<td>Save scalar values loaded from memory and intermediate results in registers rather than performing repeated memory accesses</td>
<td>Scalar replacement of aggregates (SRoA)</td>
</tr>
<tr>
<td>Early alias analysis</td>
<td>Schedule memory operations independently on regions that do not alias</td>
<td>Early alias analysis (noalias), outlining pass</td>
</tr>
<tr>
<td>Remove redundant or unnecessary operations</td>
<td>Avoid wasting resources</td>
<td>Common subexpression elimination (CSE), dead code elimination (DCE)</td>
</tr>
</tbody>
</table>
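
To make two of the rows in Table 1 concrete, the following Python sketch contrasts a plain matrix multiplication loop nest with a hand-restructured version in which the innermost loop is fully unrolled and the accumulation happens on a scalar. It is only an illustration of the effect of unrolling and scalar replacement under the assumption of a fixed kernel size of 4; the function names are hypothetical, and SODA-OPT applies such transformations on MLIR rather than on Python source.

```python
def matmul_baseline(a, b, n):
    """Straightforward loop nest: every iteration reads and writes c[i][j]."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_restructured(a, b):
    """Same computation for n = 4 after unrolling and scalar replacement.

    The innermost loop is fully unrolled, so the four multiplications are
    independent operations in a single basic block, and the running sum is
    kept in a scalar instead of being re-read from c on every step.
    """
    n = 4
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a0, a1, a2, a3 = a[i]  # row of a loaded once into scalars
        for j in range(n):
            # Additions keep the original left-to-right order.
            acc = a0 * b[0][j] + a1 * b[1][j] + a2 * b[2][j] + a3 * b[3][j]
            c[i][j] = acc      # single store per output element
    return c


if __name__ == "__main__":
    a = [[float(4 * r + col) for col in range(4)] for r in range(4)]
    b = [[float((r + 1) * (col + 2)) for col in range(4)] for r in range(4)]
    print(matmul_restructured(a, b) == matmul_baseline(a, b, 4))
```

In the restructured version an HLS scheduler is free to place the four multiplications in the same cycle if enough multipliers and memory ports are available, whereas the baseline forces a load, multiply, add, and store sequence on `c[i][j]` for each value of `k`.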

The SODA synthesizer multilevel approach aims at exploiting different abstractions for different transformations. In the current implementation, there are optimization techniques that can be applied both in the frontend and in the backend: this is often the case for basic compiler passes, such as DCE, which are available both in SODA-OPT and in Bambu. Should the two levels interfere with each other in a disruptive way, we would currently intervene and control backend passes on a case-by-case basis.

While the focus of this article is the generation of hardware accelerators, SODA-OPT can be extended to apply optimizations also on the host code generation path: for example, to enable parallel execution of different accelerators, to better use the central processing unit (CPU) cache hierarchy, and to automatically reuse accelerators when possible.

### SODA Synthesizer Backend

The SODA synthesizer backend (Bambu), shown in Figure 2(c), leverages state-of-the-art HLS techniques to generate accelerator designs starting from the low-level LLVM IR produced by the SODA frontend. Bambu has several frontends based on standard compilers (GCC or Clang), builds an internal IR to perform HLS steps (including bitwidth analysis, loop optimizations, resource allocation, scheduling, and binding algorithms), and finally generates the designs in an HDL (Verilog or VHDL). Alongside synthesizable HDL, it can also automatically produce testbenches for verification. Bambu enables the SODA synthesizer to target FPGAs (from Xilinx, Altera, Lattice, NanoXplore) and ASICs. For ASICs, SODA supports Verilog-to-GDSII generation with both commercial (Synopsys Design Compiler) and open-source (OpenROAD flow) logic synthesis tools.

Bambu is optimized to support a wide set of C and C++ constructs, but it can also ingest LLVM IR through its internal Clang frontend; through SODA-OPT, we connect Bambu with MLIR code. The LLVM IR generated after SODA-OPT performed high-level optimizations is explicitly restructured for HLS, resulting in more efficient accelerators when compared to an input obtained through direct MLIR to LLVM IR translation (as will be shown in the experimental evaluation).

Bambu generates designs at the RTL following the finite state machine with datapath (FSMD) model; the generated accelerators can subsequently be integrated in larger system-level designs, with or without microcontrollers driving the execution. Bambu also supports modular synthesis: unlike other HLS tools, it can generate modules representing functions that may be reused or replicated across an entire design and composed in a complex multiaccelerator system before generating the RTL code.

We have extended Bambu with new HLS methodologies that can integrate FSMD modules as processing elements in coarse-grained dataflow designs, and in high-throughput, dynamically scheduled, multithreaded parallel templates. MLIR descriptions are naturally parallel and hierarchical, so it will be possible to instantiate such architectural templates from SODA-OPT. Rather than requiring manual annotations on the input code, we can define the design hierarchy at a higher level of abstraction by exploiting MLIR abstractions, which make it possible to automatically identify independent operations (llvm_) and create task-parallel regions (affine) in the input code. Each region can subsequently be optimized through the SODA-OPT pipeline described in the “SODA-OPT Frontend” section.

### SODA Resource Library and Verification

The resource library is a crucial component for any hardware synthesis toolchain: it contains RTL descriptions of functional units implementing the operations present in the IR (adders, subtractors, multipliers, etc.), with different versions for different data types. The HLS tool then combines functional units together to build the design. To effectively drive the synthesis algorithms, these functional units also need a characterization in terms of performance (e.g., latency of the critical path) and area for each target technology or device. Area and performance estimates, together with related models that describe the area and latency of the interconnections among resources, directly affect many optimization passes and synthesis algorithms: for example, they help decide whether functional units can be chained together by removing intermediate registers, if their combined latency does not exceed the required clock period.
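
The chaining decision mentioned above can be sketched as follows. This is a deliberately simplified greedy model, not Bambu's scheduling algorithm: the function name `chain_operations` is hypothetical, the delays are made-up placeholder values in picoseconds rather than characterization data, and register setup time and interconnect delay are ignored.

```python
def chain_operations(delays_ps, clock_period_ps):
    """Greedily pack a chain of dependent operations into clock cycles.

    delays_ps is a list of (name, combinational_delay_ps) pairs in dependence
    order. Consecutive operations share a cycle (no intermediate register)
    as long as their combined delay fits within the clock period.
    """
    cycles = []
    current, used = [], 0
    for name, delay in delays_ps:
        if delay > clock_period_ps:
            raise ValueError(f"{name} does not fit in one clock period")
        if current and used + delay > clock_period_ps:
            cycles.append(current)  # close the cycle: a register is inserted here
            current, used = [], 0
        current.append(name)
        used += delay
    if current:
        cycles.append(current)
    return cycles


# Placeholder delays, chosen only to make the example readable.
EXAMPLE_CHAIN = [("mul", 1800), ("add", 1100), ("shift", 300)]

if __name__ == "__main__":
    print(chain_operations(EXAMPLE_CHAIN, 5000))  # 200 MHz clock
    print(chain_operations(EXAMPLE_CHAIN, 2000))  # 500 MHz clock
```

With these placeholder numbers, all three operations are chained in one cycle at a 5000 ps period, while at 2000 ps the multiplier occupies a cycle of its own and only the adder and shifter are chained. The sketch shows why accurate per-technology delay estimates matter: a small error in one delay can change where registers are inserted.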

The SODA backend can interface with commercial and open-source logic synthesis tools. We introduced support for the OpenROAD flow and the FreePDK (formerly Nangate) 45-nm cell technology library, providing a completely open-source, end-to-end compiler-based hardware generation flow from high-level programming environments to silicon. We have also extended the characterization process of the functional units in Bambu: we performed logic synthesis of functional units with FreePDK, collecting all the relevant area and performance metrics to build the resource library and model estimates.

The characterization is also relevant for the implementation of floating point units. While Bambu can integrate hand-designed functional units and external intellectual property libraries (e.g., for FPGAs we select FloPoCo [13]), for the ASIC target in SODA we choose to generate floating point units starting from the standard C soft float library (math.h); this makes it easy to support different data types (FP32 and FP64) and, if required, full IEEE 754 compliance. The characterization improves the quality of the generated floating point units: for example, the FP32 multiplier has an overall latency of four cycles at 200 MHz and five cycles at 500 MHz.
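
As a quick worked check of the two latency figures quoted above, the following snippet converts cycle counts to absolute time. The helper name `latency_ns` is introduced here only for illustration.

```python
def latency_ns(cycles: int, clock_mhz: float) -> float:
    """Absolute latency in nanoseconds for a unit taking `cycles` clock cycles."""
    return cycles * 1000.0 / clock_mhz


if __name__ == "__main__":
    print(latency_ns(4, 200))  # FP32 multiplier at 200 MHz
    print(latency_ns(5, 500))  # FP32 multiplier at 500 MHz
```

Four cycles at 200 MHz correspond to 20 ns, while five cycles at 500 MHz correspond to 10 ns: the deeper pipeline at the higher frequency costs one extra cycle but halves the absolute latency.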

Finally, a key component in an end-to-end agile and automated design flow is verification, which assures that the generated designs are functionally correct. Bambu includes a suite of tools that enable automatic testbench generation and validation of results, supporting external open-source and commercial simulators; in the SODA toolchain, we choose to leverage Verilator. We provide Bambu with a set of input values for the synthesized kernel (for example, input arguments of a function) in an XML file. Then, Bambu generates Verilog testbenches and scripts to drive the execution of Verilator. After HLS, Bambu launches the simulation and verifies that the output values from the Verilog kernel correspond to the golden results from the execution of the input code.
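
The golden-reference comparison described above follows a common pattern that can be sketched in plain Python. The sketch does not invoke Bambu or Verilator: `kernel_under_test` is a hypothetical stand-in for the simulated hardware kernel, and `golden_dot` and `verify` are names invented for this example.

```python
import random


def golden_dot(x, y):
    """Reference result, playing the role of executing the original input code."""
    return sum(a * b for a, b in zip(x, y))


def kernel_under_test(x, y):
    """Stand-in for the simulated hardware kernel (no simulator is invoked)."""
    acc = 0
    for i in range(len(x)):
        acc += x[i] * y[i]
    return acc


def verify(reference, candidate, num_tests=100, size=8, seed=0):
    """Compare candidate against reference on random integer vectors.

    Returns (True, None) if all tests match, otherwise (False, index of the
    first failing test). Integer inputs allow an exact comparison; floating
    point kernels would typically need a tolerance instead.
    """
    rng = random.Random(seed)
    for t in range(num_tests):
        x = [rng.randint(-100, 100) for _ in range(size)]
        y = [rng.randint(-100, 100) for _ in range(size)]
        if reference(x, y) != candidate(x, y):
            return False, t
    return True, None


if __name__ == "__main__":
    print(verify(golden_dot, kernel_under_test))
```

A fixed seed keeps the test vectors reproducible, so a failing index can be replayed while debugging.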

**EXPERIMENTAL EVALUATION**

In this section, we present results of our end-to-end hardware generation flow. We first demonstrate the effectiveness of the SODA-OPT high-level optimization pipeline on a set of representative linear algebra benchmarks, and then evaluate the entire toolchain by generating custom ASIC accelerators for classic deep neural network models.

The SODA synthesizer enables the generation of custom accelerators for any algorithm that can be described in MLIR. The linear algebra and ML kernels that we considered in this evaluation could also be executed on traditional templated accelerators (i.e., dot-product, matrix–vector, matrix–matrix engines), and our HLS-based approach could instead be used to generate accelerators for less common computational patterns. Nevertheless, we employ these kernels to demonstrate the effectiveness of our high-level optimization flow because they are broadly used in high-level scientific computing frameworks.


In all following experiments, execution times are obtained through simulation using randomly generated test vectors. Area and power results are obtained after OpenROAD place-and-route. Baseline designs (noopt) are synthesized from MLIR code without high-level optimizations. All designs (baseline or optimized) are synthesized with Bambu -O2 optimizations.

#### Linear Algebra Kernels

Table 2 demonstrates the impact of the SODA-OPT optimization pipeline, applied to feed an optimized and restructured low-level IR to the HLS tool for RTL generation. In these experiments, we generate ASIC accelerators for 14 linear algebra kernels from PolyBench translated from C to MLIR affine, representing common computations performed within scientific computing frameworks.

<table>
<thead>
<tr>
<th>Opt. strategy</th>
<th>No MLIR Opts.</th>
<th>SODA-OPT Pipeline</th>
</tr>
</thead>
<tbody>
<tr>
<td>Kernel size</td>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>three_mm</td>
<td>388</td>
<td>3,087</td>
</tr>
<tr>
<td>two_mm</td>
<td>315</td>
<td>2,475</td>
</tr>
<tr>
<td>gemm</td>
<td>186</td>
<td>1,446</td>
</tr>
<tr>
<td>doitgen</td>
<td>277</td>
<td>4,282</td>
</tr>
<tr>
<td>bicg</td>
<td>129</td>
<td>518</td>
</tr>
<tr>
<td>mvt</td>
<td>130</td>
<td>514</td>
</tr>
<tr>
<td>gemver</td>
<td>283</td>
<td>1,118</td>
</tr>
<tr>
<td>gesummv</td>
<td>162</td>
<td>578</td>
</tr>
<tr>
<td>atax</td>
<td>132</td>
<td>523</td>
</tr>
<tr>
<td>syr2k</td>
<td>186</td>
<td>1,310</td>
</tr>
<tr>
<td>syrk</td>
<td>142</td>
<td>990</td>
</tr>
<tr>
<td>trmm</td>
<td>46</td>
<td>532</td>
</tr>
</tbody>
</table>

Such kernels are common in both scientific computing and ML high-level programming frameworks. We simulate each kernel in isolation, without system-level considerations, to focus on the effects of the optimization pipeline. Kernel Size refers to the size of all the dimensions of input and output tensors.

We compare the performance of accelerators generated by simply lowering the benchmarks to LLVM IR (No MLIR Opts.) against the performance of accelerators generated after performing the SODA-OPT optimizations listed in Table 1 (SODA-OPT Pipeline). In particular, we apply full unrolling on the three innermost nested loops, apply all buffer-related transformations, mark function arguments as not aliasing, and apply CSE, DCE, and SRoA. Providing an optimized and restructured LLVM IR to Bambu results in more performant designs: accelerators generated from the optimized IRs exhibit an average speedup of 18×, with peaks of 60×, over the baseline. Three kernels exhibit only a small performance improvement (syr2k and syrk improve between 2× and 3×, while trmm does not improve). The reason is that these kernels include inner-loop bounds that depend on the induction variables of the outer loops, and the SRoA pass could not perform scalar replacement. This can be solved in the future by introducing an additional optimization pass to simplify index calculations when the loop bounds are known.
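
The loop structure that limits the three kernels mentioned above can be illustrated with a small Python sketch of a trmm-style triangular loop nest. The code is an assumption-laden illustration written for this purpose (the names `trmm_like` and `inner_trip_counts` are hypothetical), not the benchmark source used in the experiments.

```python
def trmm_like(a, b, alpha):
    """Triangular update: the k loop starts at i + 1, so its bounds depend on i."""
    m, n = len(b), len(b[0])
    for i in range(m):
        for j in range(n):
            for k in range(i + 1, m):  # bound depends on the outer induction variable
                b[i][j] += a[k][i] * b[k][j]
            b[i][j] = alpha * b[i][j]
    return b


def inner_trip_counts(m):
    """Number of k iterations executed for each value of the outer index i."""
    return [len(range(i + 1, m)) for i in range(m)]


if __name__ == "__main__":
    print(inner_trip_counts(4))
```

For a kernel size of 4 the inner loop runs 3, 2, 1, and 0 times as `i` increases, so there is no single trip count for the innermost loop. Unlike a rectangular nest such as gemm, the memory indices cannot be reduced to the same constants for every outer iteration without first simplifying the index calculations.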

### Neural Network Models

We used the SODA synthesizer to automatically generate accelerators for relevant operators of the LeNet, MobileNetV2, ResNet-18, and ResNet-50 convolutional neural network models. These models were trained with TensorFlow in 32-bit floating point precision, converted into protobuf files, and translated into built-in MLIR abstractions (tosa and linalg). No modifications to the original high-level models were required. By default, SODA-OPT selects and partitions the input model to create one accelerator for each DNN layer. For the sake of conciseness, and because the same computation patterns are repeated multiple times in the network, we selected a subset of layers for our experiments. We outlined them into isolated kernels, applied selected high-level optimizations or the complete SODA-OPT pipeline, and generated Verilog targeting ASIC technologies. We report execution time, area, power, and efficiency (expressed as FLOPS per Watt) for each experiment. Although the total end-to-end synthesis time from high-level description to GDSII varies depending on the specific kernel, all designs required less than three hours of processing on a node with two AMD EPYC 7282 16-Core CPUs and 256 GB of DDR4 3.2-GHz memory.
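
The efficiency metric can be written out explicitly. The sketch below assumes that efficiency is computed as the number of floating point operations divided by execution time and by average power; the function name and all numeric inputs are hypothetical placeholders, not values from the experiments.

```python
def gflops_per_watt(flop_count, cycles, clock_hz, power_w):
    """Efficiency in GFLOPS/W from operation count, cycle count, clock, and power."""
    seconds = cycles / clock_hz
    flops = flop_count / seconds  # floating point operations per second
    return flops / power_w / 1e9


if __name__ == "__main__":
    # Placeholder inputs: 2e6 operations in 1e6 cycles at 500 MHz, drawing 50 mW.
    print(gflops_per_watt(2_000_000, 1_000_000, 500e6, 0.05))
```

With these placeholder inputs the kernel runs for 2 ms, sustains about 1 GFLOPS, and therefore reaches roughly 20 GFLOPS/W. The formula makes the tradeoff visible: an optimization that halves the cycle count improves efficiency only if it less than doubles the power.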

#### LeNet

In the top part of Table 3, we present runtime, area, and power metrics of LeNet accelerators that cover 98% of its execution time (45-nm technology). Each line in the table corresponds to a single accelerator. We previously showed the final floorplans of these accelerators in the top part of Figure 1. We applied a subset of the available MLIR optimizations at the linear algebra and affine abstractions, observing speedups up to 6.2× and an efficiency between 2.68 and 41.75 giga floating point operations per second (GFLOPS)/W.

#### MobileNetV2

Table 3 also shows results for relevant MobileNetV2 depth-wise convolution (DWC2D) layers, representing 35% of MobileNetV2 inference time. The simplest optimization (leveraging high-level abstractions to propagate alias information automatically) already results in speedups of around 2× and designs reaching an efficiency over 1 GFLOPS/W. All the selected MobileNet layers have the same structure (varying only tensor dimensions and loop bounds), and thus benefit in the same way from the applied optimization, i.e., allowing Bambu to schedule memory operations on different arguments in parallel because the input arguments do not alias.

#### Reusable accelerators

Optimizing entire convolution operations in LeNet and MobileNetV2 does not allow performance increases higher than 2.1×. Instead, applying an appropriate tiling strategy that balances the size of the design against both operation and memory parallelism significantly improves performance. We tile a convolution operation and outline the tile, so that the generated accelerator is invoked multiple times to run a convolutional layer. We also ensure that the same tile can be reused across different layers in deeper networks (35, 14, and 46 convolutional layers in MobileNetV2, ResNet-18, and ResNet-50, respectively). Table 4 shows results for the generated accelerators with and without applying the SODA-OPT optimization pipeline, which provides up to 15.2× speedup with respect to the unoptimized baseline and efficiency between 103 and 364 GFLOPS/W in the 12-/14-nm technology. If we compare the results of the tile approach with what can be achieved by outlining a full convolution, we obtain, for example, that executing the fastest version of the LeNet CONV_04 layer is 14.89× slower than executing the optimized LeNet tile in Table 4 44,800 times (assuming a two-cycle latency for load operations and a one-cycle latency for store operations from a private scratchpad memory with two ports).
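
The idea of outlining a tile and invoking it repeatedly can be sketched in Python. The example tiles only the spatial output dimensions of a single-channel convolution, which is an assumption made to keep it short; the tiling used in the experiments may differ, and the names `conv2d_tile` and `conv2d_tiled` as well as the tile size are hypothetical.

```python
TILE = 2  # output tile edge; an arbitrary choice for this sketch


def conv2d_full(image, kernel):
    """Reference 'valid' 2D convolution (correlation form), single channel."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)] for r in range(oh)]


def conv2d_tile(image, kernel, r0, c0, out):
    """Fixed-size tile: the part that would be outlined as the accelerator."""
    kh, kw = len(kernel), len(kernel[0])
    for r in range(r0, r0 + TILE):
        for c in range(c0, c0 + TILE):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            out[r][c] = acc


def conv2d_tiled(image, kernel):
    """Host-side loop: invoke the same tile once per output block."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    if oh % TILE or ow % TILE:
        raise ValueError("output size must be a multiple of TILE in this sketch")
    out = [[0] * ow for _ in range(oh)]
    calls = 0
    for r0 in range(0, oh, TILE):
        for c0 in range(0, ow, TILE):
            conv2d_tile(image, kernel, r0, c0, out)
            calls += 1
    return out, calls


if __name__ == "__main__":
    image = [[6 * r + c for c in range(6)] for r in range(6)]
    kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    result, calls = conv2d_tiled(image, kernel)
    print(result == conv2d_full(image, kernel), calls)
```

Because the tile has fixed loop bounds, it can be fully unrolled and optimized once, and the same hardware can then serve any layer whose output size is a multiple of the tile, at the cost of one invocation per block.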

Overall, our experimental evaluation demonstrates the effectiveness of an end-to-end modular silicon compiler. The SODA synthesizer allows generating, optimizing, and exploring hardware designs without requiring any RTL code to be written. Optimizations implemented at different abstraction levels across the modular compiler-based toolchain allow iterative improvements of the generated accelerators, with high-quality results in terms of performance and efficiency.

RELATED WORK

Several works have explored generation of custom hardware accelerators starting from high-level programming frameworks, focusing in particular on Python and ML. They typically resort to one of two approaches: either 1) compile and map functions to parameterized modules or architectures; or 2) convert code to imperative languages (C/C++) for HLS, often heavily annotated to work with specific commercial tools.

Approach 1) consists of solutions like VeriGOOD-ML, which maps ML models described in the ONNX format to three substantially different architecture templates for different types of neural networks through the PolyMath compiler. GEMMINI provides a parameterized systolic array generator in Chisel that connects to a RISC-V core; the GEMMINI toolchain then offloads operations from specific layers of ONNX models to the systolic array. TVM’s VTA architecture is a specialized coprocessor for matrix multiplication, generated through HLS for FPGA; the TVM high-level framework can compile ML models into a stream of instructions for VTA. Additional ongoing work on TVM proposes to compile specific deep neural network operators into ASIC leveraging parameterized RTL templates. All these solutions aim at automatically generating ASIC designs, but they remain limited as they only support layers and kernels that have a direct mapping to one of the provided hardware templates. The SODA synthesizer, instead, leverages high-level and lower level (HLS) compiler-based tools. Hence, it provides a more general framework able to generate ASIC designs for virtually any computational pattern, as long as a lowering to MLIR is available. Such automatically generated accelerators lead to less flexible designs with respect to dedicated parameterized templates, but they can provide higher performance and efficiency. To the best of our knowledge, our design flow is the first one to provide a completely automated path from generic high-level code to fully custom ASIC accelerators.

Solutions that implement approach 2) include PyLog, which defines a high-level compilation infrastructure for Python programs and generates annotated C/C++ code that is then fed to Xilinx Vivado HLS for generation of the accelerators. HeteroCL partitions code between general-purpose processor and FPGA, providing a library of functions to insert hardware-specific information in the source code, which is then used to generate annotated C/C++ for HLS tools. ScaleHLS is a tool that facilitates HLS through high-level optimizations implemented in MLIR, potentially allowing to synthesize accelerators starting from high-level programming environments that lower to an MLIR representation; however, it also resorts to writing back annotated C code for Vivado HLS. While all these tools bridge high-level programming frameworks with hardware generators, they have limited flexibility, as they define compilation pipelines that only support specific high-level frameworks and backend HLS tools. Moreover, after applying hardware-related optimizations, they all generate code at a different (higher) level of abstraction, potentially losing a considerable amount of semantic information in the process.

Finally, the Circuit IRs Compiler and Tools (CIRCT) incubator project uses MLIR to build a set of interoperable tools for hardware design. The project focuses on creating relevant circuit-level IR abstractions for RTL generation. Once matured, CIRCT dialects could be merged into the MLIR framework, potentially becoming a building block for hardware compilers.

CONCLUSION

This article presents the SODA synthesizer, a modular, multilevel, end-to-end compiler-based design automation tool that enables the generation of custom accelerators starting from high-level software programming frameworks. The framework is composed of interoperating open-source technologies: SODA-OPT, an extensible high-level frontend and optimizer based on the MLIR framework, and PandA-Bambu HLS, a lower level hardware generator. The toolchain can interface with the OpenROAD Flow to provide a fully open-source path to ASIC generation.

We have shown the effectiveness of compiler-based optimizations on linear algebra kernels and deep neural network models, discussed the impact of the optimizations on the final ASIC designs, and demonstrated how our toolchain allows generating efficient hardware designs without requiring developers to write any RTL code. The SODA toolchain dramatically shortens the hardware design cycle from algorithmic formulation to hardware implementation, considers system-level implications, and enables rapid design space exploration and agile hardware development.

NICOLAS BOHM AGOSTINI is a Ph.D. candidate with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA, and also a Ph.D. intern with Pacific Northwest National Laboratory in the High Performance Computing group, Richland, WA, USA. His research interests include computer architecture, high-performance computing, and compilers targeting the automatic generation of custom accelerators. Agostini received his bachelor’s degree in electrical engineering from the Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil. Contact him at nicolas.agostini@pnnl.gov.

SERENA CURZEL is working toward a Ph.D. degree in information technology at Politecnico di Milano, Milan, Italy, and is also a Ph.D. intern with Pacific Northwest National Laboratory in the High Performance Computing group, Richland, WA, USA. Her research interests include hardware design and high-level synthesis, focusing in particular on novel compiler techniques for design automation and domain-specific optimizations for machine learning accelerators. Curzel received an M.Sc. degree in electronics engineering from Politecnico di Milano. Contact her at serena.curzel@polimi.it.

JEFF (JUN) ZHANG is a postdoctoral fellow in the Architecture, Circuits, and Compilers Group with Harvard University, Cambridge, MA, USA. His research interests include deep learning, computer architecture, and EDA, with particular emphasis on energy-efficient and fault-tolerant design for AI/ML systems and hardware accelerators. Zhang received a Ph.D. degree from the Electrical and Computer Engineering Department, New York University, NY, USA. Contact him at jeffzhang@g.harvard.edu.

ANKUR LIMAYE is a postdoctoral research associate with the High-Performance Computing Group at Pacific Northwest National Laboratory, Richland, WA, USA. His research interests include computer architecture, HW/SW co-design, and workload characterization and performance analysis. Limaye received a Ph.D. degree in electrical and computer engineering from The University of Arizona, Tucson, AZ, USA. Contact him at ankur.limaye@pnnl.gov.

CHENG TAN is with Microsoft, Redmond, WA, USA. He was previously with Cornell University, Ithaca, NY, USA, and Pacific Northwest National Laboratory, Richland, WA, USA. His research interests include many-core architectures, hardware/software co-design, reconfigurable accelerators, and networks-on-chip. Tan received a Ph.D. degree in computer science from the National University of Singapore, Queenstown, Singapore. Contact him at chengtan@microsoft.com.

VINAY AMATYA is a computer scientist with the Pacific Northwest National Laboratory, Richland, WA, USA. His research interests include scalable distributed machine learning algorithms and runtime systems; compiler optimization techniques for heterogeneous platforms; and the application of machine learning and deep learning techniques. Amatya received a Ph.D. degree in computer science from Louisiana State University, Baton Rouge, LA, USA. Contact him at Vinay.amatya@pnnl.gov.

MARCO MINUTOLI is a research scientist with the Data Science and Machine Intelligence Group at Pacific Northwest National Laboratory, Richland, WA, USA. His research focuses on the design of parallel graph algorithms and on HW/SW co-design and high-level synthesis methodologies, including their compilation and optimization pipelines, for the generation of custom computing devices. Minutoli received a Ph.D. degree in computer science from Washington State University, Pullman, WA, USA. Contact him at marco.minutoli@pnnl.gov.

VITO GIOVANNI CASTELLANA is a senior computer scientist in the High-Performance Computing group at Pacific Northwest National Laboratory, Richland, WA, USA, which he joined in 2012. His research interests include design automation and high-level synthesis, parallel programming, Big Data and graph analytics, and compiler technologies. Castellana received a Ph.D. degree in computer science and engineering from Politecnico di Milano, Milan, Italy. Contact him at vitogiovanni.castellana@pnnl.gov.

JOSEPH B. MANZANO is a senior computer scientist with Pacific Northwest National Laboratory in the High-Performance Computing group, Richland, WA, USA. His research interests include compilers, runtime systems, cybersecurity, performance modeling, and benchmarking. Manzano received a Ph.D. degree from the University of Delaware, Newark, DE, USA. Contact him at joseph.manzano@pnnl.gov.

DAVID BROOKS is the Haley Family professor of computer science with the School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. His research interests include resilient and power-efficient computer hardware and software design for high-performance and embedded systems. Brooks received a Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ, USA. He is a Fellow of IEEE. Contact him at dbrooks@g.harvard.edu.

GU-YEON WEI is the Robert and Suzanne Case professor of electrical engineering and computer science with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. His research interests include a broad range of topics from algorithm-hardware co-design for efficient computing systems to sustainable computing. Wei received a Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA. He is also a Samsung research fellow. Contact him at guyeon@seas.harvard.edu.

ANTONINO TUMEO is a chief scientist in the High Performance Computing Group with the Pacific Northwest National Laboratory, Richland, WA, USA. His research interests include hardware-software co-design and synthesis, simulation, and modeling of domain-specific architectures. Tumeo received his Ph.D. degree in computer science and engineering from Politecnico di Milano, Milan, Italy. He is a Senior Member of IEEE and of ACM. Contact him at antonino.tumeo@pnnl.gov.