GCHQ Synthetic Data Generator

A utility application used to generate Avro files of test data

Categories: Data
GroupId: uk.gov.gchq.data-gen
ArtifactId: synthetic-data-generator
Last Version: 0.0.4
Type: jar
Project URL: https://github.com/gchq/synthetic-data-generator
Source Code Management: https://github.com/gchq/synthetic-data-generator

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/uk.gov.gchq.data-gen/synthetic-data-generator/ -->
<dependency>
    <groupId>uk.gov.gchq.data-gen</groupId>
    <artifactId>synthetic-data-generator</artifactId>
    <version>0.0.4</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/uk.gov.gchq.data-gen/synthetic-data-generator/
implementation 'uk.gov.gchq.data-gen:synthetic-data-generator:0.0.4'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/uk.gov.gchq.data-gen/synthetic-data-generator/
implementation("uk.gov.gchq.data-gen:synthetic-data-generator:0.0.4")

Buildr:

'uk.gov.gchq.data-gen:synthetic-data-generator:jar:0.0.4'

Ivy:

<dependency org="uk.gov.gchq.data-gen" name="synthetic-data-generator" rev="0.0.4">
  <artifact name="synthetic-data-generator" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='uk.gov.gchq.data-gen', module='synthetic-data-generator', version='0.0.4')
)

SBT:

libraryDependencies += "uk.gov.gchq.data-gen" % "synthetic-data-generator" % "0.0.4"

Leiningen:

[uk.gov.gchq.data-gen/synthetic-data-generator "0.0.4"]

Dependencies

compile (10)

Group / Artifact                                          Type  Version
com.github.javafaker : javafaker                          jar   1.0.1
com.fasterxml.jackson.core : jackson-core                 jar   2.10.0
com.fasterxml.jackson.core : jackson-annotations          jar   2.10.0
com.fasterxml.jackson.core : jackson-databind             jar   2.10.0
com.fasterxml.jackson.datatype : jackson-datatype-jdk8    jar   2.10.0
com.fasterxml.jackson.datatype : jackson-datatype-jsr310  jar   2.10.0
org.slf4j : slf4j-api                                     jar   1.7.28
org.slf4j : slf4j-simple                                  jar   1.7.28
commons-io : commons-io                                   jar   2.6
org.apache.avro : avro                                    jar   1.8.2

Project Modules

There are no modules declared in this project.

Synthetic Data Generator

Ever found yourself scrambling to find test data, only to discover that what you found isn't available in the quantity you need? Or that you can't generate the data with multiple threads, so it takes too long to produce?

Look no further: this data generator fakes classic human resources data about employees. The data structure deliberately includes the kinds of complex, nested types that make computation expensive or difficult, so you can truly test your platform.

This repo provides the code to generate as many Employee records as you want, split over as many Avro files as you desire. You can also optionally define the number of parallel threads used to generate your data.

An Employee object contains the following fields:

class Employee {
    UserId uid;
    String name;
    String dateOfBirth;
    PhoneNumber[] contactNumbers;
    EmergencyContact[] emergencyContacts;
    Address address;
    BankDetails bankDetails;
    String taxCode;
    Nationality nationality;
    Manager[] manager;
    String hireDate;
    Grade grade;
    Department department;
    int salaryAmount;
    int salaryBonus;
    WorkLocation workLocation;
    Sex sex;
}

The manager field is an array of Manager objects, which can themselves be nested several layers deep. In the generated example, manager is nested 3-5 layers deep.
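The nesting described above can be illustrated with a minimal sketch. The Manager class here is a hypothetical stand-in that models only the recursive manager array, not the project's real type, and the helper names buildChain and depthOf are invented for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

public class ManagerChainSketch {
    // Hypothetical stand-in for the project's Manager type; only the nesting is modelled.
    static final class Manager {
        final String uid;
        final Manager[] manager; // null at the top of the chain

        Manager(String uid, Manager[] manager) {
            this.uid = uid;
            this.manager = manager;
        }
    }

    // Builds a chain of exactly `depth` nested managers; returns null when depth <= 0.
    static Manager[] buildChain(int depth) {
        if (depth <= 0) {
            return null;
        }
        return new Manager[] { new Manager("manager-level-" + depth, buildChain(depth - 1)) };
    }

    // Counts the layers in a chain by following the first manager at each level.
    static int depthOf(Manager[] chain) {
        int depth = 0;
        while (chain != null && chain.length > 0) {
            depth++;
            chain = chain[0].manager;
        }
        return depth;
    }

    public static void main(String[] args) {
        // Pick a depth of 3, 4 or 5, matching the range quoted for the generated data.
        int depth = ThreadLocalRandom.current().nextInt(3, 6);
        Manager[] chain = buildChain(depth);
        System.out.println("requested depth " + depth + ", measured depth " + depthOf(chain));
    }
}
```

Any code that walks an Employee record has to recurse or loop through this chain, which is part of what makes the data awkward to process.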

To use the generator you will need git, Maven and JDK 11 installed.

To get started, first clone this repo locally:

git clone https://github.com/gchq/synthetic-data-generator.git

Then cd into the synthetic-data-generator directory and build the codebase:

mvn clean install

Then start the generator:

./createHRData.sh PATH EMPLOYEES FILES [THREADS]

where:

  • PATH is the relative path to generate the files
  • EMPLOYEES is the number of employee records to create
  • FILES is the number of files to spread them over
  • THREADS (optional) specifies the number of threads to use.
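As a rough illustration of how these four arguments fit together, the sketch below parses them in Java. It is not the project's actual entry point: the ArgsSketch and Options names are hypothetical, and falling back to a single thread when THREADS is omitted is an assumption, since the real default is not stated here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ArgsSketch {
    // Hypothetical holder for the parsed command-line values.
    static final class Options {
        final Path path;
        final long employees;
        final int files;
        final int threads;

        Options(Path path, long employees, int files, int threads) {
            this.path = path;
            this.employees = employees;
            this.files = files;
            this.threads = threads;
        }
    }

    // Parses PATH EMPLOYEES FILES [THREADS] and rejects obviously invalid values.
    static Options parse(String[] args) {
        if (args.length < 3 || args.length > 4) {
            throw new IllegalArgumentException("usage: PATH EMPLOYEES FILES [THREADS]");
        }
        Path path = Paths.get(args[0]);
        long employees = Long.parseLong(args[1]);
        int files = Integer.parseInt(args[2]);
        // Assumption: use one thread when THREADS is omitted; the real default may differ.
        int threads = args.length == 4 ? Integer.parseInt(args[3]) : 1;
        if (employees < 0 || files < 1 || threads < 1) {
            throw new IllegalArgumentException("EMPLOYEES must be >= 0; FILES and THREADS must be >= 1");
        }
        return new Options(path, employees, files, threads);
    }

    public static void main(String[] args) {
        Options options = parse(args);
        System.out.println(options.employees + " employees over " + options.files
                + " files in " + options.path + " using " + options.threads + " thread(s)");
    }
}
```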

For example, to generate 1,000,000 employee records, spread over 15 files, running the program with 4 threads, and writing the output files to the relative path data/employee:

./createHRData.sh data/employee 1000000 15 4
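To make the numbers in this example concrete, here is a minimal sketch of one way to spread records over files and run the work on a fixed thread pool. It is an illustration under stated assumptions, not the generator's own code: the SplitSketch class and its methods are hypothetical, the rule that the first files absorb the remainder is an assumption, and the per-file task is a placeholder that writes nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitSketch {
    // Spreads `employees` records over `files` files as evenly as possible.
    // Assumption: the first (employees % files) files each take one extra record.
    static long[] recordsPerFile(long employees, int files) {
        long base = employees / files;
        long remainder = employees % files;
        long[] counts = new long[files];
        for (int i = 0; i < files; i++) {
            counts[i] = base + (i < remainder ? 1 : 0);
        }
        return counts;
    }

    // Runs one task per file on a pool of `threads` workers and returns the total record count.
    static long generateAll(long employees, int files, int threads) {
        long[] counts = recordsPerFile(employees, files);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < files; i++) {
                final long count = counts[i];
                // Placeholder task: a real generator would build and write `count` records here.
                Callable<Long> task = () -> count;
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("generation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] counts = recordsPerFile(1_000_000L, 15);
        System.out.println("first file: " + counts[0] + ", last file: " + counts[14]);
        System.out.println("total: " + generateAll(1_000_000L, 15, 4));
    }
}
```

Under this splitting rule, 1,000,000 records over 15 files gives ten files of 66,667 records and five of 66,666, since 15 x 66,666 = 999,990 leaves 10 records to distribute.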

Versions

Version
0.0.4
0.0.3