Posted on 2013-11-28, 12:22. Authored by Teodora Sandra Buda, John Murphy, Morten Kristiansen.
Managing large amounts of information is one of the most
expensive, time-consuming and non-trivial activities, and it
usually requires expert knowledge. In a wide range of application
areas, such as data mining, histogram construction,
approximate query evaluation, and software validation,
handling exponentially growing databases has become a difficult
challenge, and working with a subset of the data is generally preferred.
As a solution to the current challenges in managing
large amounts of data, database sampling from the available
operational data has proved to be a powerful technique.
However, none of the existing sampling approaches consider
the dependencies between the data in a relational database.
In this paper, we propose a novel approach to constructing
a realistic testing environment by analyzing the
distribution of data in the original database along these dependencies
before sampling, so that the sample database is
representative of the original database.
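To make the idea concrete, the following is a minimal illustrative sketch, not the algorithm described in the paper: it stratifies the sample of a parent table by one of its attributes so the attribute's distribution is preserved, and then keeps only the child rows whose foreign keys reference sampled parents, so referential dependencies remain intact. The table and column names (customers, orders, region, customer_id) are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, rng):
    """Sample `fraction` of rows from each group defined by `key`,
    so the sampled distribution of `key` mirrors the original one."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for group_rows in groups.values():
        k = max(1, round(fraction * len(group_rows)))
        sample.extend(rng.sample(group_rows, k))
    return sample

def sample_with_dependencies(customers, orders, fraction, seed=42):
    """Sample the parent table, then retain only child rows whose
    foreign key points at a sampled parent (referential integrity)."""
    rng = random.Random(seed)
    sampled_customers = stratified_sample(customers, key="region",
                                          fraction=fraction, rng=rng)
    kept_ids = {c["customer_id"] for c in sampled_customers}
    sampled_orders = [o for o in orders if o["customer_id"] in kept_ids]
    return sampled_customers, sampled_orders

# Tiny synthetic example with hypothetical data.
customers = [{"customer_id": i, "region": "EU" if i % 3 else "US"}
             for i in range(30)]
orders = [{"order_id": j, "customer_id": j % 30} for j in range(120)]
sc, so = sample_with_dependencies(customers, orders, fraction=0.2)
print(len(sc), "customers,", len(so), "orders in the sample")
```

Under these assumptions, the sampled customers keep roughly the original EU/US proportions, and every sampled order still references a customer present in the sample, which is the kind of dependency-aware property the abstract argues plain random sampling does not provide.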