DIANA: A Knowledge-driven Framework for Data-centric AI
Camilla Sancricca
2024-01-01
Abstract
Data analysis plays a key role in companies that adopt machine learning models to support their decision-making processes. Among the phases of a machine learning pipeline, data preparation is essential to obtain high-quality data. Data-centric AI has shifted the focus of such processes to the quality of the data rather than the performance of the machine learning model. Users from different application fields must carry out data preparation, and they frequently struggle to design effective data preparation pipelines when faced with a multitude of data quality errors and data quality improvement techniques. This highlights the need for approaches that simplify the definition of an effective data preparation pipeline. The main goal of my Ph.D. project is to design a framework that supports users in selecting the data preparation tasks to perform in a machine learning pipeline. Using a knowledge-driven approach, we aim to guide both more and less experienced users through an interactive process in which recommendations, explanations, and different levels of autonomy simplify the design of an effective data preparation pipeline.


