Article

SYNTHESIZING GEOCODES TO FACILITATE ACCESS TO DETAILED GEOGRAPHICAL INFORMATION IN LARGE-SCALE ADMINISTRATIVE DATA

Journal

Publisher

OXFORD UNIV PRESS INC
DOI: 10.1093/jssam/smaa035

Keywords

CART; Disclosure; DPMPM; Geocode; Synthetic Data

Funding

  1. IAB

This study investigates the use of synthetic data to provide detailed geocoding information to external researchers. The proposed synthesis strategy based on categorical CART models outperforms other methods in terms of preserving analytical validity. Generating additional variables is shown to be a preferred strategy to address the risk-utility trade-off in practice.
We investigate whether generating synthetic data can be a viable strategy for providing access to detailed geocoding information for external researchers, without compromising the confidentiality of the units included in the database. Our work was motivated by a recent project at the Institute for Employment Research in Germany that linked exact geocodes to the Integrated Employment Biographies, a large administrative database containing several million records. We evaluate the performance of three synthesizers regarding the trade-off between preserving analytical validity and limiting disclosure risks: one synthesizer employs Dirichlet Process mixtures of products of multinomials, while the other two use different versions of Classification and Regression Trees (CART). In terms of preserving analytical validity, our proposed synthesis strategy for geocodes based on categorical CART models outperforms the other two. If the risks of the synthetic data generated by the categorical CART synthesizer are deemed too high, we demonstrate that synthesizing additional variables is the preferred strategy to address the risk-utility trade-off in practice, compared to limiting the size of the regression trees or relying on the strategy of providing geographical information only on an aggregated level. We also propose strategies for making the synthesizers scalable for large files, present analytical validity measures and disclosure risk measures for the generated data, and provide general recommendations for statistical agencies considering the synthetic data approach for disseminating detailed geographical information.
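To make the idea of a categorical CART synthesizer concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It treats the geocode as a categorical label (a hypothetical grid-cell ID), grows a small classification tree on two invented predictors using Gini impurity, and draws each synthetic geocode at random from the observed values in the leaf that the record falls into. The function names `gini`, `grow`, and `synthesize`, the toy data, and the `min_leaf` parameter are all assumptions made for illustration; the paper's scalability strategies and its validity and risk measures are not reproduced here.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of categorical labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(rows, labels, min_leaf):
    """Grow a classification tree; each leaf keeps its observed labels."""
    best = None
    for j in range(len(rows[0])):
        for v in sorted(set(r[j] for r in rows)):
            left = [i for i, r in enumerate(rows) if r[j] == v]
            right = [i for i, r in enumerate(rows) if r[j] != v]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            # Size-weighted impurity of the two child nodes.
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, v, left, right)
    # Stop when no admissible split reduces impurity.
    if best is None or best[0] >= len(labels) * gini(labels):
        return {"leaf": list(labels)}
    _, j, v, left, right = best
    return {
        "col": j,
        "val": v,
        "eq": grow([rows[i] for i in left], [labels[i] for i in left], min_leaf),
        "ne": grow([rows[i] for i in right], [labels[i] for i in right], min_leaf),
    }

def synthesize(tree, row, rng):
    """Drop a record down the tree and sample a label from its leaf."""
    node = tree
    while "leaf" not in node:
        node = node["eq"] if row[node["col"]] == node["val"] else node["ne"]
    return rng.choice(node["leaf"])

# Toy data (invented): predictors are (industry, wage band),
# the label is a hypothetical grid-cell ID standing in for a geocode.
rows = ([("retail", "low")] * 4 + [("retail", "high")] * 4
        + [("mining", "low")] * 4 + [("mining", "high")] * 4)
cells = (["A", "A", "A", "B"] + ["A", "A", "B", "A"]
         + ["C", "C", "C", "C"] + ["C", "C", "D", "C"])

rng = random.Random(0)
tree = grow(rows, cells, min_leaf=2)
synthetic = [synthesize(tree, r, rng) for r in rows]
print(list(zip(rows, synthetic)))
```

In this sketch, a larger `min_leaf` yields smaller trees and coarser leaves, which loosely mirrors the abstract's option of limiting tree size to reduce disclosure risk at some cost in analytical validity.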
