RE.PUBLIC@POLIMI Research Publications of Politecnico di Milano

ZSMILES: An Approach for Efficient SMILES Storage for Random Access in Virtual Screening

Accordi, Gianmarco;Gadioli, Davide;Seguini, Giorgio;Beccari, Andrea R.;Palermo, Gianluca

2024-01-01

Abstract

Virtual screening is a technique used in drug discovery to select the most promising molecules to test in a lab. It requires a large set of molecules as input, and storing these molecules can become an issue. In fact, extreme-scale high-throughput virtual screening applications require a large dataset of input molecules and produce an even larger dataset as output. These molecular databases occupy tens of TB of storage space, and domain experts frequently sample a small portion of this data. In this context, SMILES is a popular data format for storing large sets of molecules, since it requires significantly less space to represent molecules than other formats (e.g., MOL2, SDF). This paper proposes ZSMILES, an efficient dictionary-based approach to compress SMILES-based datasets. The approach takes advantage of domain knowledge to provide a readable output with separable SMILES, enabling random access. We examine the benefits of storing these datasets using ZSMILES to reduce the cold storage footprint in HPC systems. The main contributions are a custom dictionary-based approach and a data preprocessing step. Experimental results show that ZSMILES leverages domain knowledge to compress 1.13× more than the state of the art in similar scenarios, reaching a compression ratio of up to 0.29. We also tested a CUDA version of ZSMILES targeting NVIDIA GPUs, showing a potential speedup of 7×.
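To make the idea of a dictionary-based, line-separable encoding concrete, the following is a minimal Python sketch. It is not the ZSMILES implementation: the dictionary entries, the replacement symbols, and the function names are all hypothetical, and the sketch assumes that the chosen symbols never occur in the input SMILES and that "compression ratio" means compressed size divided by original size. Each molecule is encoded on its own line, so any single line can be decoded without reading the rest of the file.

```python
# Hypothetical toy dictionary: frequent SMILES substrings mapped to printable
# symbols that this sketch assumes never appear in the input SMILES.
TOY_DICT = {
    "c1ccccc1": "!",  # benzene ring
    "C(=O)": "?",     # carbonyl group
    "CC": ";",        # two aliphatic carbons
}
REVERSE = {code: pattern for pattern, code in TOY_DICT.items()}
# Longer patterns are tried first so the greedy match prefers them.
PATTERNS = sorted(TOY_DICT, key=len, reverse=True)


def compress_line(smiles: str) -> str:
    """Greedily replace dictionary patterns in one SMILES string."""
    out = []
    i = 0
    while i < len(smiles):
        for pattern in PATTERNS:
            if smiles.startswith(pattern, i):
                out.append(TOY_DICT[pattern])
                i += len(pattern)
                break
        else:
            # No pattern matches here: copy the character unchanged.
            out.append(smiles[i])
            i += 1
    return "".join(out)


def decompress_line(packed: str) -> str:
    """Expand every dictionary symbol back to its pattern."""
    return "".join(REVERSE.get(ch, ch) for ch in packed)


def compression_ratio(original, packed) -> float:
    """Compressed size over original size (lower is better, by assumption)."""
    return sum(len(s) for s in packed) / sum(len(s) for s in original)


molecules = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CCO"]
packed = [compress_line(s) for s in molecules]

# Random access: decode only the second molecule, ignoring the others.
second = decompress_line(packed[1])
```

Because every encoded line is independent, sampling a small portion of a large file only requires locating and decoding the lines of interest, which is the property the abstract refers to as random access.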

Year of publication: 2024
Book title: Proceedings of 2024 IEEE International Parallel and Distributed Processing Symposium Workshops
Appears in types: 04.1 Contribution in conference proceedings

Files in this record:

File	Access	Description	Size	Format
ZSMILES_An_Approach_for_Efficient_SMILES_Storage_for_Random_Access_in_Virtual_Screening.pdf	Restricted access	Publisher's version	179.85 kB	Adobe PDF
2404.19391v1.pdf	Open access	arXiv version: pre-print (or pre-refereeing)	310.58 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1272928
