An approach to create a synthetic financial transactions dataset based on NDA-protected dataset
This paper outlines an experiment in building an obfuscated version of a proprietary financial transactions dataset. As per industrial requirements, no data from the original dataset should find its way to third parties, so all the fields were generated artificially, including banks, customers (including geographic locations), and particular transactions. However, we set our goal to keeping as many distributions and correlations from the original dataset as possible, with adjustable levels of introduced noise in each of the fields: geography, bank-to-bank trans-action flows, and distributions of volumes/numbers of transactions in various subsections of the dataset. The article could be of use to anyone who may want to produce a publishable dataset, e.g., for the alternative data market, where it’s essential to keep the structure and correlations of the proprietary non-disclosed original dataset.