Sample data create and import

Data creation and import are split into two tasks:

create sample data (issue #29)
import this data (issue #220)

The import is done via an import file. Decision: keep it simple for the user.

Keeping the focus: the focus is creating sample data for the database, not import/export.

The above points were decided in a phone call with Timo.

This page aims to act as a whiteboard for displaying the current state and for solving this task. The task itself is tracked in the two issues stated above.

Creating sample data

A relatively simple way of generating test data is to take real production data and mix it. For example, the persons Alfred Smith and Deborah Shulz can be changed to Alfred Shulz and Deborah Smith. Of course some security rules have to be added too: emails will get a domain part like example.com or, if used for mailing tests, mailing.test.openpetra.org or something like that. Wolfganguhr 07:04, 12 May 2011 (UTC)

Wolfgang, you are right, it is simple. The data generation product I was looking at also supports that in a very nice way: it can duplicate a database and replace names, addresses and whatever else is needed with those of a randomly generated person. A good thing about it: the data is realistic. But I think the bad thing about it should be noted: taking real production data and changing it would still be a data protection hazard, somewhat irresponsible and legally dangerous. That is why I wanted to avoid it. Obviously there are also some databases on the web with anonymized data, but I think we should avoid those as well. User:Thosteg 15:04, 13 May 2011 (UTC)
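For illustration only, the name mixing discussed above could be sketched roughly as follows. The function name, the record layout and the input addresses are assumptions made for this example; nothing here is taken from OpenPetra code, and the data protection concern raised above still applies to any real input data.

```python
import random


def mix_people(people, seed=None, email_domain="example.com"):
    """Return new records with family names shuffled among the people
    and every e-mail address rebuilt on a safe test domain."""
    rng = random.Random(seed)
    family_names = [person["family_name"] for person in people]
    rng.shuffle(family_names)  # redistribute the family names at random

    mixed = []
    for person, family_name in zip(people, family_names):
        local_part = f"{person['first_name']}.{family_name}".lower()
        mixed.append({
            "first_name": person["first_name"],
            "family_name": family_name,
            # never keep the original address, only the safe domain
            "email": f"{local_part}@{email_domain}",
        })
    return mixed


# Hypothetical input records, not real data
people = [
    {"first_name": "Alfred", "family_name": "Smith", "email": "alfred@production.invalid"},
    {"first_name": "Deborah", "family_name": "Shulz", "email": "deborah@production.invalid"},
]
mixed = mix_people(people, seed=1)
```

With only a few records a shuffle can leave some names unchanged, which is one more reason why this kind of mixing does not anonymize real data reliably.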

Goal: creating sample data for the database.

The sample data should have

many donors
many recipients
many donations

Current line of action

create XML files with people data only, via benerator. Doing this simply means creating XSD files which describe the wanted data files.
define a simple XML file schema from these files (yes, this is circular with the first point)
build an importer for these files for OpenPetra
create a NAnt job to import the data
done!

Import File Format

XML file with the following tag hierarchy:

Partners (v)
  People (r)
    People can later be divided into personnel and partners, based on their giving behaviour
    - first_name
    - family_name
    - address_line
    - postal code
    - country
  Organisations (v)
    Companies (r)
      - name (other organisations: similar)
    Churches  (r)
    Charities (r)
      Projects (r)
        - name
Gifts (r)
  - amount
  - currency (for now: always GBP, later country dependent)
  - recipient

(v) means virtual (it only exists to group sub-partners), while (r) means partners that actually exist.
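As a rough sketch of what a file following this hierarchy might look like, the snippet below builds one with Python's standard library. The root element SampleData, the per-record element names Person, Company and Gift, and the spelling postal_code are assumptions made for this example; the hierarchy above does not fix them.

```python
import xml.etree.ElementTree as ET

PERSON_FIELDS = ("first_name", "family_name", "address_line", "postal_code", "country")


def build_import_tree(people, companies, gifts):
    """Build an XML tree following the tag hierarchy described above."""
    root = ET.Element("SampleData")  # assumed root element
    partners = ET.SubElement(root, "Partners")  # virtual grouping (v)

    people_el = ET.SubElement(partners, "People")
    for person in people:
        person_el = ET.SubElement(people_el, "Person")  # assumed record tag
        for field in PERSON_FIELDS:
            ET.SubElement(person_el, field).text = person[field]

    organisations = ET.SubElement(partners, "Organisations")  # virtual grouping (v)
    companies_el = ET.SubElement(organisations, "Companies")
    for company in companies:
        company_el = ET.SubElement(companies_el, "Company")  # assumed record tag
        ET.SubElement(company_el, "name").text = company["name"]

    gifts_el = ET.SubElement(root, "Gifts")
    for gift in gifts:
        gift_el = ET.SubElement(gifts_el, "Gift")  # assumed record tag
        ET.SubElement(gift_el, "amount").text = str(gift["amount"])
        ET.SubElement(gift_el, "currency").text = gift.get("currency", "GBP")
        ET.SubElement(gift_el, "recipient").text = gift["recipient"]
    return root


# Hypothetical sample values
root = build_import_tree(
    people=[{"first_name": "Alfred", "family_name": "Shulz",
             "address_line": "1 Example Street", "postal_code": "AB1 2CD",
             "country": "GB"}],
    companies=[{"name": "Example Ltd"}],
    gifts=[{"amount": 25, "recipient": "Example Project"}],
)
xml_text = ET.tostring(root, encoding="unicode")
```

Churches, Charities and Projects are left out to keep the sketch short; they would follow the same pattern as Companies.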

Creation is done differently:

People (r)
  Gifts (r)
Organisations (v)
  Companies (r)
  Churches  (r)
  Charities (r)
    Projects (r)
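Because gifts are created nested under people but imported as a separate top-level list, some conversion step is needed between the two layouts. A minimal sketch, assuming plain dictionaries as records and an extra donor field to keep the link to the giving person (the import format above does not list such a field):

```python
def split_people_and_gifts(created_people):
    """Turn the creation layout (gifts nested under each person)
    into the import layout (separate people and gift lists)."""
    people, gifts = [], []
    for person in created_people:
        # copy the person without the nested gifts
        people.append({key: value for key, value in person.items() if key != "gifts"})
        donor = f"{person['first_name']} {person['family_name']}"  # assumed donor reference
        for gift in person.get("gifts", []):
            # GBP is the default for now; a currency given on the gift overrides it
            gifts.append({"donor": donor, "currency": "GBP", **gift})
    return people, gifts


# Hypothetical generated records
created = [
    {"first_name": "Alfred", "family_name": "Shulz",
     "gifts": [{"amount": 25, "recipient": "Example Project"}]},
    {"first_name": "Deborah", "family_name": "Smith", "gifts": []},
]
import_people, import_gifts = split_people_and_gifts(created)
```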

Test data generators / Sample data

There are a number of good test data generators out there; building our own would not have been worthwhile when the aim is just to get something working quickly. I looked at a number of them, with an emphasis on tools that are recommended by people and are open source.

Decision was: look at benerator and generatedata.com.

I looked at benerator and decided to stick with that for now, if it works out.

Idea to be checked: use data from generatedata / geo-database / briandunning together with benerator to compile data into a common format, which is then imported as shown below.

generatedata.com

creates name, address, email ... looks very nice!
creates data for Australia, Belgium, Canada, Netherlands, United States, United Kingdom
but: e.g. the UK postal codes don't seem to be real UK codes. So not only are the codes incorrect, the combination of code and address is incorrect as well. Perhaps this is different for the US. In any case, this would not be a show-stopper, as they look close enough and we just want lots of data anyway.

But:

Have not looked at the code yet.
Have not thought about how we can integrate this with OP

benerator

Has generators for all sorts of information, and can create xml files
is not plain GPL: it has a "GPL v2 with exceptions" licence (???). Should chat with the author.

Lists of test data generators

The majority of the software listed below was extracted from a former page. Criteria for judging: actively maintained + fits the job + documentation (less important)

interest?	Program	creates	area	Output	App-Type	License
*	benerator	creates data / transforms given data to test data		various databases, xml, csv, excel	Framework	GPL / commercial (WARNING! GPL "with exceptions")
*	generatedata.com	Addresses / Cities / Countries	Netherlands, Canada, UK, US	XML, Excel, HTML, CSV, SQL	Webapp (JS,PHP,MySQL)	GPL v2
*	Geographical Places Database	geographical locations (schools, universities, whitehouse, eiffel tower...)		tab delimited	website, download, libraries (various languages), webservice	creative commons attribution
	http://www.briandunning.com/sample-data/	Website with real address and company data (US and Canada) but with fake names. This could be useful with testing map services as well since there are real geographic locations.	US, Canada			free
(still want to briefly check)	DBMonster	generates test data		SQL	Command-Line (Java)	Apache License
	CSV Data generator			CSV?	(Ruby)
	Datagenerator				library / GUI	GPL
	dqMaster			text,xml,db	GUI (extensible)
	Spawner Data Generator	random proper names, terms and connectors		delimited text / SQL	apptype	license
	Test Dictionary				java interface
data at most	Fresh Trash Generator	Random Website, Email, Family and First Names, Phone Number, Company, Birthday (at least some of the resource data might be interesting)	Greek Names and Companies, German Streets		java utility package
nn	google api toolkit	nn			Web API
-	Data Science Toolkit	convert address to coordinates, vv, ip to coordinates etc			Web API / VM
-	fakenamegenerator.com	Names, Addresses from many countries			Website / Web API	proprietary for API (free of charge, but attribution required)
-	.net Fabricator	(no addresses, so not suitable, but seems nice framework)			Framework using .net	MIT
- (com)	GEDIS Studio for Test Data	"Realistic Test Data" (not viewed)		CSV, XML, SQL, or HTML	Windows / Scripting	community edition free of charge / commercial
- (com)	Excel random data generator	Generates sample data, somewhat acclaimed here			MS Excel Plugin	commercial
- (com)	SQL Data Generator	Generates complex sample data (addresses, companies, interaction), a business person liked it on stackoverflow. Would probably be the right thing except it is SQL Server and commercial.			Application for MS SQL Server	commercial
- (com)	Microsoft Visual Studio Database Edition	Generates sample data, and several people pointed to it on stackoverflow.			Part of Visual Studio	commercial
- (com)	Advanced Data Generator				Windows Application	commercial
- (com)	SQL Manager				Windows Application	commercial

List of others (not checked): date of last change + project (checked April 2011)

2011-02 sf dagen
2007-05 sf pharaon
2011-03 sf encapet
2010-08 sf adag
2009-11 sf jrando
2010-05 sf bbf-data-genera

Coding

Some coding has been done already: See csharp\ICT\PetraTools\GenerateSampleData for transforming sample data into family records etc.

Also see the partner import module, which processes CSV and YAML files: csharp\ICT\Petra\Client\lib\MPartner\gui\PartnerImport.ManualCode.cs

Importing sample data

The import is done via an import file. Decision: keep it simple for the user.

Keeping the focus: the focus is creating sample data for the database, not import/export. Import/export is a simple tool, which we put effort into to keep it nice, simple and easy to understand. But in this case, it is a tool for sample data only.

Make the import file as simple as possible for the user, e.g. consciously limit the scope of what the import file can express (one address per person) rather than aiming for a powerful import file.

Data format: This stackoverflow question suggests YAML. I am still undecided: rather YAML than XML, but perhaps simple CSV would fit.
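To make the CSV option concrete, a deliberately limited file (one row and one address per person) could be read as sketched below. The column names mirror the people fields from the import file format; the sample rows and the function name are made up for this example.

```python
import csv
import io

# Hypothetical sample content; one row and one address per person
SAMPLE_CSV = """first_name,family_name,address_line,postal_code,country
Alfred,Shulz,1 Example Street,AB1 2CD,GB
Deborah,Smith,2 Example Road,EF3 4GH,GB
"""


def read_people(csv_text):
    """Parse the CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


csv_people = read_people(SAMPLE_CSV)
```

A flat format like this cannot express nesting such as projects under charities, which fits the idea of consciously limiting what the import file can do.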

Consider data liberation?

Not necessarily - only if useful to keep it simple and make it work quickly.

Intended location of data in OpenPetra

Data	Table
Person	p_family
Address	-
Donations	-

p_family will be used for all data, and p_person will be ignored (this is in line with the attempts to replace p_person by p_family).
