Title: Improving Data Quality: Methodology and Techniques
Abstract. Traditional database design techniques normally include the definition of various types of schema constraints, so that the consistency of data with respect to format and type, as well as to their logical interrelationships, is guaranteed to a large extent. However, these constraints are often not adequate to ensure that the data can be used reliably in a number of value-added applications, including data warehousing, OLAP and data mining. Known problems that limit the potential uses of data include correctness and accuracy ("does this piece of data accurately reflect the reality it represents?"), consistency across multiple sources ("are these independently generated pieces of data consistent with each other?"), object identity ("do these pieces of data represent the same real-world object?"), currency ("is this data stale?") and more. In general, the quality dimension of data is not adequately captured by current schema and database modeling techniques. Furthermore, potential quality problems may not emerge until new types of multiple-source data integration become necessary. As a result, in most data-intensive organizations, data quality control is approached as an afterthought at best, and is often neglected altogether. In either case, its cost for the organization can be prohibitive. In this talk, I will outline a methodology for defining, monitoring and enforcing quality constraints over data. The framework includes techniques for the assessment and monitoring of data quality over time, for the identification of the root causes of poor quality, and for performing data cleaning and process re-engineering. After introducing some basic notions in the area of data quality, we present examples of our approach, and discuss some existing data cleaning techniques.
Perhaps more interestingly, we investigate the use of data warehousing for the collection and periodic analysis of quality meta-data, and we propose the application of data mining techniques for inferring plausible causes of poor quality and for forecasting future quality levels. This work draws on experience gathered while working on data quality issues in the Italian Public Administration. Examples from that case study will be presented.
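The quality dimensions named above can be made concrete with a small sketch. The snippet below is illustrative only and not the methodology described in the talk: it assumes a hypothetical customer table and hand-picked thresholds, and shows three toy checks: completeness of a field, currency (staleness) of records, and object-identity candidates found by crude name normalization.

```python
from datetime import date

# Hypothetical toy records; field names and thresholds are illustrative only.
records = [
    {"id": 1, "name": "ACME Corp.", "email": "info@acme.example",
     "last_updated": date(2023, 1, 10)},
    {"id": 2, "name": "Acme Corp", "email": None,
     "last_updated": date(2020, 5, 3)},
    {"id": 3, "name": "Widgets Ltd", "email": "sales@widgets.example",
     "last_updated": date(2024, 2, 1)},
]

def completeness(rows, field):
    """Fraction of rows with a non-null value for `field`."""
    return sum(r[field] is not None for r in rows) / len(rows)

def stale_ids(rows, as_of, max_age_days=365):
    """Ids of rows not updated within `max_age_days` (a currency check)."""
    return [r["id"] for r in rows
            if (as_of - r["last_updated"]).days > max_age_days]

def normalize(name):
    """Crude matching key for object identity: lowercase, drop punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def duplicate_candidates(rows):
    """Pairs of ids whose normalized names collide,
    i.e. records that may denote the same real-world object."""
    seen, pairs = {}, []
    for r in rows:
        key = normalize(r["name"])
        if key in seen:
            pairs.append((seen[key], r["id"]))
        else:
            seen[key] = r["id"]
    return pairs

print(completeness(records, "email"))          # share of rows with an email
print(stale_ids(records, date(2024, 3, 1)))    # records older than one year
print(duplicate_candidates(records))           # possible same-object pairs
```

Real data cleaning systems replace the naive normalization above with approximate string matching and probabilistic record linkage, but the structure (measure, flag, then investigate root causes) is the same.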
Paolo Missier is a research scientist at Telcordia Technologies (formerly Bellcore), USA. He has worked in several areas, including federated database technology, data warehousing and data mining, new paradigms for large-scale software engineering, distributed object-oriented software architectures, and next-generation services for integrated telecommunication networks. Mr. Missier earned an M.Sc. in Computer Science (Laurea in Scienze dell'Informazione) from Università di Udine, Italy, in 1990 and an M.Sc. in Computer Science from the University of Houston, TX, USA, in 1993.