Posted on 2014-08-05, 08:58. Authored by Teodora Sandra Buda, Thomas Cerqueus, John Murphy, Morten Kristiansen
Database sampling has become a popular approach to handle
large amounts of data in a wide range of application areas such as
data mining or approximate query evaluation. Using database samples is
a potential solution when using the entire database is not cost-effective,
and a balance between the accuracy of the results and the computational
cost of the process applied to the large data set is preferred. Existing
sampling approaches are limited to specific application areas, to
single-table databases, or to random sampling. In this paper, we propose
CoDS: a novel sampling approach targeting relational databases
that ensures that the sample database follows the same distribution for
specific fields as the original database. In particular, it aims to maintain
the distribution between tables. We evaluate the performance of our algorithm
by measuring the representativeness of the sample with respect
to the original database. We compare our approach with two existing
solutions, and we show that our method performs faster and produces
better results in terms of representativeness.
History
Publication
24th International Conference on Database and Expert Systems Applications (DEXA 2013) [Lecture Notes in Computer Science]; 8055, pp. 342-356
Publisher
Springer
Note
peer-reviewed
Other Funding information
SFI
Rights
The original publication is available at www.springerlink.com