Normally, what I usually see is some automated process for setting up empty databases with up-to-date data structures. This is helpful for integration testing, but is tricky for actual development as well as performance testing.
For example:
- It is difficult to conceptually get your head around writing a query when you cannot see any data in your tables
- You cannot possibly know if your query is slow before you deploying it to production without running it against 'some' data.
In addition, there can be a strict requirement to not let sensitive customer data be available outside certain secure environments and certainly not available to development environments.
Step 1
What you would need to do is go over your database and separate the elements into different criteria:
- Data Structure
- User Management
- Referential Tables
- Primary Tables
- Child Tables
- Mapping Tables
- Sensitive Data
(explanation below)
Data structure management and user management should be, by now, a solved problem. You have systems like Liquibase and Flyway that manage this for you. Essentially, you can use these systems to automatically generate containers which your developers can then use or setup empty local databases on developer machines using SQL scripts.
For user management, MySQL has PAM plugin to manage users via LDAP, but you can manage this through scripts as well.
Referential tables (tables that contain data such as id = 1, status = 'DONE') should also be small enough to be included in this stage as well. You need to identify which tables contain this very basic data and add it to the data structure repository or SQL file.
Step 2
This is where things get a little bit tricky: You need to identify which tables are your Primary 'feed data' tables or Object tables. You then need to identify which tables are the Child tables of those Primary tables. Lastly, you need to identify which tables Map keys across different tables - either Primary to Child (as in multi-to-multi relationships) or Primary to Primary.
Once you have identified these tables, you can discern how much data you would like to keep in your development databases. My recommendation would be to go in these three directions:
- Specify a set number of keys in the Primary tables and then get the data from the Child and Mapping tables based on those keys.
- Specify a specific set of keys from the Primary tables and then get the data from the Child and Mapping tables based on those keys.
- Keep data by a date range for the primary table and then use its keys to populate the Child and Mapping tables.
Make sure that the amount of data is adequate for your needs: not too small and not too large.
Step 3
This separation of table types can now help us with identifying sensitive data. Data structure and also Referential tables, should not have in them sensitive data. Neither should Mapping tables. What would have sensitive data are Primary and Child tables.
Identify the columns where sensitive data maybe kept in those tables and either:
- Empty that data
- Give it a default value (all emails will be test@email.com)
- Obfuscate those values in some way
You can change this the data by either outputting it with those changes into an SQL file or dumping that data into a staging database, changing the sensitive data and then dumping it into an SQL file with a tool.
Ideally, this stage needs to go through a QA process/person before the company releases sensitive data to generally available containers or repositories which keep history of changes.
Conclusion
With taking the time to separate the different elements in a database, you can make it less complicated and you would then be more able to automate parts of the database into your CI/CD process.
No comments:
Post a Comment