Niklas Reje
Synthetic Data Generation; a Comparison of Methods' Limits and Dependencies
Abstract:
Both regulations and the need to find willing participants for surveys mean that
any released data needs some form of privacy preservation.
Privacy preservation, however, always reduces the utility of the data to some degree;
how much depends on the method.
Synthetic data generation aims to be a privacy-preserving alternative that protects the participants by
generating new records that do not correspond to any real individuals or organizations but
still preserve the relationships and information within the original dataset.
For a method to see wide adoption, however, it must prove useful: even a privacy-preserving method
will never be used if it cannot support usable research.
We investigated four methods for synthetic data generation:
Parametric methods, Decision Trees, Saturated Model with Parametric, and Saturated Model with Decision Trees.
We examined how the datasets affect the utility of these methods, together with
restrictions arising from how much data can be released and from time limitations.
We determined that a large number of synthetic datasets, about 10 or more, needs to be released for good utility and
that the more datasets are released, the more stable the inferences become.
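To illustrate why releasing more synthetic datasets tends to stabilize inferences, the following minimal sketch averages one point estimate per synthetic dataset and measures how much the estimates vary between datasets. It assumes the averaging approach commonly used with multiply imputed data; the function name `combine_estimates` and the example numbers are hypothetical and are not taken from the thesis.

```python
from statistics import mean, variance


def combine_estimates(estimates):
    """Combine one point estimate per synthetic dataset.

    Returns (q_bar, b_m): the average of the estimates and the
    between-dataset sample variance. Hypothetical helper for illustration.
    """
    if len(estimates) < 2:
        raise ValueError("need estimates from at least two synthetic datasets")
    q_bar = mean(estimates)    # combined point estimate
    b_m = variance(estimates)  # spread of the estimates across datasets
    return q_bar, b_m


# Hypothetical estimates of the same quantity from m = 5 synthetic datasets.
estimates = [10.0, 12.0, 11.0, 9.0, 13.0]
q_bar, b_m = combine_estimates(estimates)
noise_in_average = b_m / len(estimates)  # shrinks as m grows
```

Under this assumption, the noise that the random generation step adds to the averaged estimate falls roughly as `b_m / m`, which is one way to read the observation that more released datasets give more stable inferences.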
We found that using as many variables as possible in the imputation process of each variable is best for
generating synthetic datasets for general use, but that being selective about which variables are used for each imputation
can be better for specific inferences that match the preserved relationships.
Being selective also helps keep down the time complexity of generating synthetic datasets.
In the comparison with k-anonymity, we found that the results depended heavily on how many variables we included as quasi-identifiers, but
regardless, synthetic data generation could achieve accuracy equal to, and more often better than, that of
k-anonymized datasets.
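The dependence on the choice of quasi-identifiers can be seen in a small sketch. A dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k records, so adding quasi-identifiers can only keep k the same or lower it, which in turn forces more generalization or suppression. The function name `k_of`, the column names, and the records below are hypothetical and serve only as illustration.

```python
from collections import Counter


def k_of(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values on all quasi-identifiers (the k of the dataset)."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Hypothetical, already generalized records.
records = [
    {"age": "30-39", "zip": "114**", "sex": "F", "income": 41000},
    {"age": "30-39", "zip": "114**", "sex": "F", "income": 52000},
    {"age": "30-39", "zip": "114**", "sex": "M", "income": 38000},
    {"age": "40-49", "zip": "115**", "sex": "M", "income": 47000},
    {"age": "40-49", "zip": "115**", "sex": "M", "income": 60000},
    {"age": "40-49", "zip": "115**", "sex": "M", "income": 55000},
]

k_two = k_of(records, ["age", "zip"])           # two quasi-identifiers
k_three = k_of(records, ["age", "zip", "sex"])  # one more quasi-identifier
```

With two quasi-identifiers the example data is 3-anonymous; treating `sex` as a quasi-identifier as well leaves one record unique, so further generalization would be needed to reach the same k.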
We found that Saturated Model with Decision Trees is the best method overall, combining high utility with a stable generation time
regardless of the dataset used.
Decision Trees on their own came second, with results very close to those of the Saturated Model with Decision Trees but
slightly worse for categorical variables.
Third was Saturated Model with Parametric, which often gave good utility, though not for datasets with few categorical variables, and
occasionally had a very long generation time.
Parametric was the worst, with poor utility on all datasets and an unstable generation time that could also be very long.