Data Interpretation and Correlation
MANENTI, FLAVIO
2011-01-01
Abstract
Whatever their nature and origin, data are the real meeting point of theory and practice. Anyone capable of properly correlating and interpreting data can validate or reject a theory, a model, or a law based on appropriate hypotheses. Data come from many sources: experimental campaigns, laboratory instruments, and industrial plants. The more accurate the data, the easier it is to correlate them into a general relationship. Unfortunately, the scientist's task is never easy, and data correlation and interpretation remain a difficult area, with many open issues and multifaceted problems to solve. Data are always affected by stochastic errors, which make it impossible to estimate parameters with absolute precision; any parameter estimate is therefore intrinsically uncertain. Moreover, data are sometimes affected by bad points caused by systematic errors, human factors, instrumentation failures, and so on. Analyzing such data sets is also problematic, because the bad points must be corrected or removed a priori, before interpretation. It is no coincidence that many practitioners refer to sets of acquired experiments or measurements as raw data, and to sets already analyzed, and thus purged of possible bad points, as data or clean data. These bad points are labeled in different ways depending on their nature, though the scientific literature is not entirely consistent on this issue; the most commonly identified bad point is called an outlier. The key issue is to detect outliers so that their presence does not distort data interpretation or significantly change the parameter estimates. Nevertheless, outlier detection is one of the most complex operations yet to be exhaustively studied by the scientific community, particularly in the case of regressions.

Detecting outliers involves several simultaneous effects and phenomena that make their identification particularly challenging. This chapter deals with these problems and aims to provide the basic knowledge and foundation necessary for proper data set analysis. The fundamentals of statistics for data correlation and interpretation are presented first. The outlier concept is then outlined, and efficient and robust estimators are described and compared. In the final section, these concepts are extended to regression, and hence to linear and nonlinear mathematical models; the combination of raw data and mathematical models makes outlier detection dramatically more difficult.
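To illustrate the kind of robust estimation the abstract refers to, here is a minimal sketch (not taken from the chapter itself) of a common outlier test based on two robust estimators, the median and the median absolute deviation (MAD). The `mad_outliers` function name, the 0.6745 scaling constant, and the 3.5 cutoff are the conventional modified z-score choices, assumed here for illustration rather than drawn from the source.

```python
import statistics

def mad_outliers(data, threshold=3.5):
    """Flag suspected outliers with a modified z-score.

    Uses the median and the median absolute deviation (MAD), which,
    unlike the mean and standard deviation, are barely perturbed by
    a few gross errors in the data set.
    """
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    if mad == 0:  # all points (essentially) identical: nothing to flag
        return []
    # 0.6745 rescales MAD to be comparable with a standard deviation
    # under a normal distribution; 3.5 is a common cutoff.
    return [x for x in data if abs(0.6745 * (x - med) / mad) > threshold]

# A tight cluster of measurements with one gross error:
sample = [9.8, 10.1, 9.9, 10.0, 10.2, 9.7, 25.0]
print(mad_outliers(sample))  # [25.0] — only the bad point is flagged
```

Note that a non-robust rule such as "flag points more than three standard deviations from the mean" can fail here, because the outlier itself inflates both the mean and the standard deviation — the masking effect that motivates robust estimators in the first place.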
File | Size | Format | Access
---|---|---|---
Final - Kirk Othmer Encyclopedia.pdf (Pre-Print / Pre-Refereeing) | 548.2 kB | Adobe PDF | Restricted access
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.