Foreword
Data cleaning is often the most time-consuming part of data analysis. Although it has long been recognized as a separate topic in Official Statistics (where it is called ‘data editing’) and has also been studied in relation to databases, literature aimed at the larger statistical community is limited. This is why, when the publisher invited us to expand our tutorial “An introduction to data cleaning with R”, which we developed for the useR!2013 conference, into a book, we grabbed the opportunity with both hands.
On the one hand, we felt that some of the methods that have been developed in the Official Statistics community over the last five decades deserved a wider audience. Perhaps this book can help with that. On the other hand, we hope that this book will help bring some (often pre-existing) techniques into the Official Statistics community, as we move from survey-based data sources to administrative and “big” data sources.
For us, it would also be a nice way to systematize our knowledge and the software we have written on this topic. Looking back, we ended up not only writing this book, but also redeveloping and generalizing many of the data cleaning R packages we had written before. One reason for this is that we discovered nice ways to generalize and expand our software and methods; another is that we wished to connect to the recently emerged “tidyverse” style of interfaces to R functionality.