Presented at Data Science Sydney, April 2018.
Abstract: With companies like Cambridge Analytica in the news, people are understandably worried about how companies store and handle data about them. However, companies need sensitive and personally identifying data to run their services, and they may need to process it across many different systems and environments. This talk describes how to build "look-alike" data sets that share many of the statistical properties of the source data sets they are generated from but no longer contain sensitive data. Using such synthetically generated look-alikes in the many development and testing environments lets the true source data be kept more securely, in fewer locations.
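
As a rough illustration of the look-alike idea, here is a minimal Python sketch. It is not the method presented in the talk, which the abstract does not specify. It assumes the simplest possible model: each numeric column is summarised by a Gaussian and each categorical column by its observed frequencies. Columns are sampled independently, so per-column statistics are roughly preserved but correlations between columns are not. The function names (`fit_column`, `sample_column`, `make_lookalike`), the `drop` parameter, and the example records are all hypothetical.

```python
import random
import statistics
from collections import Counter


def fit_column(values):
    """Summarise one column: Gaussian for numbers, frequency table otherwise."""
    if all(isinstance(v, (int, float)) for v in values):
        # Assumption: a normal distribution is a good enough fit.
        # statistics.stdev needs at least two values.
        return ("numeric", statistics.mean(values), statistics.stdev(values))
    counts = Counter(values)
    return ("categorical", list(counts.keys()), list(counts.values()))


def sample_column(model, n, rng):
    """Draw n synthetic values from a fitted column model."""
    if model[0] == "numeric":
        _, mu, sigma = model
        return [rng.gauss(mu, sigma) for _ in range(n)]
    _, categories, weights = model
    return rng.choices(categories, weights=weights, k=n)


def make_lookalike(rows, drop=(), n=None, seed=0):
    """Build synthetic rows from per-column summaries of `rows` (a list of dicts).

    Columns named in `drop` (direct identifiers) are never copied or modelled;
    they are filled with a fresh synthetic label instead.
    """
    rng = random.Random(seed)
    n = len(rows) if n is None else n
    columns = [c for c in rows[0] if c not in drop]
    samples = {
        c: sample_column(fit_column([r[c] for r in rows]), n, rng)
        for c in columns
    }
    out = []
    for i in range(n):
        row = {c: samples[c][i] for c in columns}
        for d in drop:
            row[d] = f"synthetic-{i}"
        out.append(row)
    return out


# Hypothetical source records, invented for this example.
source = [
    {"name": "Alice", "age": 34, "plan": "basic"},
    {"name": "Bob", "age": 45, "plan": "premium"},
    {"name": "Carol", "age": 29, "plan": "basic"},
    {"name": "Dave", "age": 52, "plan": "basic"},
]
lookalike = make_lookalike(source, drop=("name",), n=1000)
```

A sketch like this only shows the shape of the approach. Categorical values are still drawn from the values observed in the source, so a rare category could itself be identifying, and a realistic generator would also need to model relationships between columns.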