Posted on 2013-11-15, 10:01. Authored by Teodora Sandra Buda, Thomas Cerqueus, Morten Kristiansen, John Murphy.
In a wide range of application areas (e.g. data
mining, approximate query evaluation, histogram construction),
database sampling has proved to be a powerful technique. It
is generally used when the computational cost of processing
large amounts of information is prohibitively high and a faster
but less accurate response is preferred.
Previous sampling techniques achieve this balance; however, the
cost of the sampling process itself should also be evaluated. We
argue that current relational database sampling techniques that
maintain the data integrity of the sample database perform poorly,
and that a faster strategy needs to be devised. In this paper we
propose a very fast sampling method that keeps the referential
integrity of the sample database intact. The method targets the
production environment of a system under development, which
generally consists of large amounts of data that are computationally
costly to analyze. We evaluate our method against previous database
sampling approaches and show that it produces a sample database at
least 300 times faster, with a maximum trade-off of 0.5% in sample
size error.
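For intuition only, the sketch below illustrates what it means for a sample database to preserve referential integrity; it is not the authors' algorithm, and the two-table SQLite schema (customers referenced by orders via customer_id) is hypothetical. The idea shown is the generic one: sample the parent table, then keep only the child rows whose foreign keys point at sampled parents, so every reference in the sample resolves.

    import random
    import sqlite3

    # Hypothetical schema: orders.customer_id references customers.id.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
    """)
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(i, f"customer-{i}") for i in range(1, 101)])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(i, random.randint(1, 100), 10.0 * i)
                      for i in range(1, 501)])

    def sample_with_referential_integrity(conn, fraction, seed=42):
        """Sample a fraction of the parent table, then keep only the
        child rows whose foreign keys reference sampled parents, so the
        sample database violates no referential-integrity constraint."""
        rng = random.Random(seed)
        ids = [row[0] for row in conn.execute("SELECT id FROM customers")]
        kept = rng.sample(ids, max(1, int(len(ids) * fraction)))
        marks = ",".join("?" * len(kept))
        parents = conn.execute(
            f"SELECT * FROM customers WHERE id IN ({marks})", kept).fetchall()
        children = conn.execute(
            f"SELECT * FROM orders WHERE customer_id IN ({marks})",
            kept).fetchall()
        return parents, children

    customers_sample, orders_sample = sample_with_referential_integrity(conn, 0.1)
    print(len(customers_sample), "customers,", len(orders_sample), "orders")

Because the child rows are induced by the sampled parents rather than drawn independently, the achieved sample fraction can drift from the target; a discrepancy of this kind is one way a sample size error such as the 0.5% figure above can arise.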