Creating Realistic Test Datasets for Financial and Business Apps

April 19, 2021

Fintech is technology used to automate the delivery and use of financial services. While fintech was initially limited to backend applications in the financial industry, consumer fintech applications have boomed in recent years. Mobile banking and other financial apps have become ubiquitous on smartphones. Today, customers want to perform financial transactions wherever they happen to be, without having to visit a brick-and-mortar bank or other financial services office.

There are, however, many other use cases for fintech. For example, crowdfunding platforms such as GoFundMe are fintech applications, as are microfinance lending programs in the developing world. Automated policy management in the insurance industry also belongs to this domain; it falls under insurtech, a subset of fintech.

Fintech applications generally operate in an enterprise environment. As such, testing fintech applications before deployment can be tricky. It is not good practice to run development and test environments against production data, but at the same time, you must test applications using realistic data. You need to figure out where to get this test data.

Customer relationship management (CRM) data is relatively easy to find. Plenty of public data is available, and you can find open machine learning (ML) datasets. However, masked health and financial data (data with personally identifiable information removed) are much harder to find, particularly for banking, financial, or payment systems.

Once you have the data, it’s important to understand best practices for testing your fintech application. In this article, we look at some of these considerations, explain how to mask your data, and suggest some places to find test data.

Why You Should Avoid Testing with Production Data

While many organizations use production data in test environments, this is not advisable. Typically, a test environment does not have the same level of security as a production environment, so it is easier for an intruder to gain access and put your customers’ personal information at risk.

In addition, many users in a test environment don't normally have access to production data. When you put user data in a test environment, suddenly, QA departments, consultants, and contractors have an open door to live data, making the data vulnerable. A study by the Ponemon Institute discusses an outside consultant hired by a financial firm to develop applications. The consultant sold some of the firm's customer data he had access to during the engagement, compromising the firm’s privacy policy and putting personal information at risk.

Another factor to consider is compliance. Depending on your industry, especially the health and financial industries, using production data for testing can have compliance implications and violate privacy regulations.

Why Realistic Data is Important for Testing

While you should avoid testing your fintech application with production data, the data must still be realistic. Financial applications are complex, and test data that doesn’t mirror production data may not expose bugs in an application. These bugs would then only surface once the application goes live, with potentially disastrous results. For example, if no value in the test data contains an apostrophe (a surname such as O'Brien, say), tests of an application that builds SQL queries from that data can miss quoting errors that real customer names would trigger.
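
The apostrophe case can be sketched with Python's built-in sqlite3 module. The table, column, and function names here are hypothetical and chosen only for illustration. A query built by string concatenation passes when the test data holds only "Smith", and fails as soon as the data includes "O'Brien"; a parameterized query handles both.

```python
import sqlite3

def build_test_db():
    """Create an in-memory database with one plain and one apostrophe surname."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT)")
    conn.executemany(
        "INSERT INTO customers (last_name) VALUES (?)",
        [("Smith",), ("O'Brien",)],
    )
    return conn

def find_customer_unsafe(conn, last_name):
    # Builds SQL by string concatenation, so an apostrophe ends the literal early.
    query = "SELECT id FROM customers WHERE last_name = '" + last_name + "'"
    return conn.execute(query).fetchall()

def find_customer_safe(conn, last_name):
    # Parameterized query: the value is bound separately from the SQL text.
    return conn.execute(
        "SELECT id FROM customers WHERE last_name = ?", (last_name,)
    ).fetchall()

conn = build_test_db()
print(find_customer_unsafe(conn, "Smith"))   # the bug stays hidden with this input
print(find_customer_safe(conn, "O'Brien"))   # the safe version handles the apostrophe
try:
    find_customer_unsafe(conn, "O'Brien")
except sqlite3.OperationalError as exc:
    print("unsafe query failed:", exc)
```

A test suite whose data never includes a value like "O'Brien" would report the unsafe function as working.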

Using realistic data also means using data that is a representative subset of the complete dataset. Test data that is not representative can miss defects that surface only in production, because it doesn't cover all the cases that occur in the real world.

Requirements for Realistic Test Data

You can use a variety of techniques to develop test data. It's essential to research various options and select a technique that's right for your organization and application. Most experts agree on several test data requirements.

First, you should consider: Where does your test data come from? What is the domain and the nature of the application? What is the data structure? Does the end-user type it into the system, or is it entered automatically? All these factors influence the data's behavior, so understanding them is essential for accurate testing.

Once you have reviewed these factors, consider the following:

Test data should be a subset of the entire dataset. Using the full dataset slows testing and potentially causes the schedule to slip.
The test data subset should be representative of the full production dataset, as mentioned above. Otherwise, test results will not predict how the application behaves in production.
Similarly, your test routines should include various scenarios representing the breadth of transaction types in your system.
Your test data should be in a secure environment. It should also be masked as an added security layer, since it is available to those who normally don’t have access to production data.
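
One way to get a subset that is both small and representative is stratified sampling: sample the same fraction from each transaction type, and keep at least one record of every type. The following is a minimal sketch using only the Python standard library. The field names, transaction types, and the stratified_subset function are assumptions made for illustration, not part of any particular tool.

```python
import random
from collections import Counter, defaultdict

def stratified_subset(records, key, fraction, seed=0):
    """Sample `fraction` of each group so every value of `key` is represented."""
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    subset = []
    for group_key in sorted(groups):
        members = groups[group_key]
        # Keep at least one record per group so rare types are not lost.
        k = min(len(members), max(1, round(len(members) * fraction)))
        subset.extend(rng.sample(members, k))
    return subset

# Hypothetical data: mostly card payments, a few wire transfers, rare chargebacks.
transactions = (
    [{"id": i, "type": "card_payment"} for i in range(900)]
    + [{"id": 900 + i, "type": "wire_transfer"} for i in range(96)]
    + [{"id": 996 + i, "type": "chargeback"} for i in range(4)]
)

subset = stratified_subset(transactions, key="type", fraction=0.1)
print(Counter(row["type"] for row in subset))
```

A plain random 10% sample of this data could easily contain no chargebacks at all, which is exactly the kind of gap that hides defects until production.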

It is important to involve the business as a whole, including all stakeholders, from the earliest testing stages. Otherwise, the development team may overlook factors important to users on the business side.

Taking these requirements into account helps you perform a safe and successful test cycle.

Masking Test Data

Because access to the test environment is typically given to additional parties — such as contractors or QA testers — who do not have access to production data, it’s important to mask your test data.

Some ways to mask test data are:

Static data masking — This process starts with making a copy of the database then moving it to a separate environment. Then, ideally, reduce the dataset to a subset of the original. Apply data masking rules, then use the masked dataset for testing.
On-the-fly data masking — This process happens automatically. The data is masked when transferred from the production to the test environment. This is useful for organizations that use continuous deployment or continuous delivery techniques.
Dynamic data masking — This is similar to on-the-fly data masking, except that it happens one record at a time, on-demand. It's also valuable for continuous delivery environments. Plenty of data exists in the cloud these days, so dynamic data masking is particularly important for users of Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), or IBM Cloud.
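
The difference between static and dynamic masking can be sketched in a few lines of Python. This is a simplified illustration that uses in-memory lists in place of real databases; the mask_record rule and the field names are hypothetical.

```python
import copy

def mask_record(record):
    """Hypothetical masking rule: hide all but the last four account digits."""
    masked = dict(record)
    account = masked["account"]
    masked["account"] = "*" * (len(account) - 4) + account[-4:]
    return masked

def static_mask(production_rows, keep_every=2):
    """Static masking: copy the data, reduce it to a subset, then mask the copy."""
    snapshot = copy.deepcopy(production_rows)      # 1. copy
    subset = snapshot[::keep_every]                # 2. subset
    return [mask_record(row) for row in subset]    # 3. mask

def dynamic_mask(production_rows):
    """Dynamic masking: mask each record on demand, as it is read."""
    for row in production_rows:
        yield mask_record(row)

production = [
    {"name": "A", "account": "1000000001"},
    {"name": "B", "account": "1000000002"},
    {"name": "C", "account": "1000000003"},
    {"name": "D", "account": "1000000004"},
]

test_rows = static_mask(production)
print(test_rows)                        # two masked rows, stored for later test runs
print(next(dynamic_mask(production)))   # one masked row, produced only when requested
```

In both cases the production rows themselves are left unchanged; only the copy or the record handed to the tester is masked.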

The wide range of data masking techniques includes:

Substitution — This is also called random lookup. Using this technique, an authentic-looking value from another table is substituted for the existing value. This is effective for first and last names, telephone numbers, zip codes, and more.
Shuffling — This is the most common simple data masking technique. It is similar to substitution, except that existing data is simply shuffled within a given column. The problem with this technique is that it may be possible to reverse engineer the algorithm used to shuffle the data and restore the original production data.
Character scrambling — In this method, you randomly rearrange the order of characters within the fields of a given record. This is useful because the original value is hard to recover from the scrambled one, although short or highly recognizable values can sometimes still be guessed.
Number and date variance — This is useful for numerical data. For example, you can apply a variance of +/- 10% for financial data and still have realistic test data. Similarly, you can apply a variance of +/- 365 days for dates, and the masked data is still meaningful for testing. A similar technique is to change a birth date, for example, to the first day of the month or the first day of the year.
Nulling out or deletion — This is also called blanking. This technique involves simply applying a null value to a given field. The problem with this technique is that it doesn't provide realistic data, making it poorly suited for testing.
Masking out — This method involves scrambling or masking out values in certain fields. While somewhat comparable to nulling out, the resulting data is more similar to production data and hence more valuable for testing.
Encryption — This is the most complex approach to data masking. This system often requires a key to view the data. The problem is that if the key falls into the wrong hands, an unauthorized user can access data.
Custom expression — Another approach, when you are working with an SQL database, is to create a custom expression. This gives you a great deal of flexibility. Using a custom expression enables you to do everything you can do with the SELECT command in an SQL statement.
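
Several of the techniques above can be sketched with the Python standard library. These are minimal illustrations, not production-grade masking: the function names, lookup values, and sample data are all assumptions, and the random module used here is not suitable where cryptographic strength is required.

```python
import random
from datetime import date, timedelta

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def substitute(values, lookup):
    """Substitution: replace each value with a random entry from a lookup list."""
    return [rng.choice(lookup) for _ in values]

def shuffle_column(values):
    """Shuffling: reorder the existing values within one column."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

def scramble(value):
    """Character scrambling: rearrange the characters within one field."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

def vary_amount(amount, pct=0.10):
    """Number variance: move an amount by up to +/- pct."""
    return round(amount * (1 + rng.uniform(-pct, pct)), 2)

def vary_date(day, max_days=365):
    """Date variance: move a date by up to +/- max_days."""
    return day + timedelta(days=rng.randint(-max_days, max_days))

def truncate_to_month(day):
    """Set a date to the first day of its month."""
    return day.replace(day=1)

def mask_out(value, visible=4, fill="X"):
    """Masking out: hide all but the last `visible` characters."""
    hidden = max(0, len(value) - visible)
    return fill * hidden + value[hidden:]

names = ["Maria", "John", "Aisha"]
print(substitute(names, ["Alex", "Blake", "Casey", "Devon"]))
print(shuffle_column(names))
print(scramble("Maria"))
print(vary_amount(1250.00))
print(vary_date(date(2021, 4, 19)))
print(truncate_to_month(date(1985, 7, 23)))   # 1985-07-01
print(mask_out("4111111111111111"))           # XXXXXXXXXXXX1111
```

Note that shuffling and scrambling only rearrange existing values, so every original name or character is still present somewhere in the output. That is one reason these two techniques offer weaker protection than substitution.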

Finally, various vendors provide solutions for generating test data. While sometimes pricey, these solutions have the advantage of generating realistic test data without compromising your production data.

Sources of Test Data

Test data is available from a variety of sources, both commercial and free. Here are some examples for your convenience. We imply no endorsement.

EU Open Data Portal provides access to European Union open data for various sectors, including finance.
Santander Customer Transaction Prediction provides sample financial data.
Compuware offers a commercial test data management solution that simplifies the process of test data management.

This is only a sample of potential test data sources. You may find others that better meet your needs. Or, you may find it best to create your own data using the strategies from this article.
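
If you decide to create your own data, a small generator script is often enough to get started. The sketch below uses only the Python standard library. The field names, value ranges, and name list are assumptions for illustration, and the names deliberately include an apostrophe, an accented character, and a multi-word surname so that awkward but realistic values reach your tests.

```python
import random
from datetime import date, timedelta

# Hypothetical value pools; replace them with ones that match your domain.
LAST_NAMES = ["Smith", "O'Brien", "Nguyen", "García", "van der Berg"]
TYPES = ["card_payment", "wire_transfer", "refund"]

def make_transactions(count, seed=0, start=date(2021, 1, 1)):
    """Generate `count` synthetic transactions, repeatable for a given seed."""
    rng = random.Random(seed)
    rows = []
    for i in range(count):
        rows.append({
            "id": i + 1,
            "last_name": rng.choice(LAST_NAMES),
            "type": rng.choice(TYPES),
            "amount": round(rng.uniform(0.01, 5000.00), 2),
            "date": start + timedelta(days=rng.randrange(365)),
        })
    return rows

for row in make_transactions(3):
    print(row)
```

Because the generator is seeded, every run with the same seed produces the same rows, which keeps test failures reproducible.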

Fintech Test Data Best Practices

As with requirements, philosophies on best practices for managing fintech data vary, but there are some common perspectives:

Understand your test data. You may find your data resides across multiple systems and in different formats. Different rules may apply to data based on its type and location. It is essential to capture the end-to-end business process in production and replicate it in the test environment.
As mentioned under requirements, extract a subset of your production data for testing, and ensure you also collect the associated metadata. Create a subset small enough to complete test runs quickly, but large enough to reflect the variety in your production data.
As discussed, ensure you mask sensitive test data. Masked data must still, however, have the look and feel of production data. Be sure that masking is consistent and repeatable.
Automate your test processes. Manual processes slow your test cycle considerably.
Refresh test data regularly. Doing so improves testing efficiency and helps you maintain a consistent test environment that is easier to manage.
Use an integrated toolset. Your testing software’s components must work together seamlessly.
Don’t make duplicate copies of your test data. Create a central repository that all users can access.
Ensure that testing is self-service. Testers in the business should be able to test the application without assistance from IT.
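
The advice above to keep masking consistent and repeatable can be illustrated with a keyed hash. In this sketch, the same customer ID always maps to the same masked token, so records masked in separate tables, or in separate refresh runs, still join correctly. The mask_id function, the "CUST-" prefix, the key, and the sample tables are hypothetical; in practice the key would be stored securely outside the test environment.

```python
import hashlib
import hmac

def mask_id(customer_id, secret_key):
    """Map an ID to the same token every time, for a given secret key."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return "CUST-" + digest.hexdigest()[:12]

SECRET_KEY = b"example-key-not-for-real-use"  # assumption: managed elsewhere in practice

customers = [{"customer_id": "100234", "segment": "retail"}]
transactions = [
    {"customer_id": "100234", "amount": 42.50},
    {"customer_id": "100234", "amount": 17.25},
]

# Mask each table independently, as separate jobs might do.
masked_customers = [
    {**row, "customer_id": mask_id(row["customer_id"], SECRET_KEY)}
    for row in customers
]
masked_transactions = [
    {**row, "customer_id": mask_id(row["customer_id"], SECRET_KEY)}
    for row in transactions
]

# The masked tables still join on customer_id.
known_ids = {row["customer_id"] for row in masked_customers}
print(all(row["customer_id"] in known_ids for row in masked_transactions))  # True
```

Random masking applied separately to each table would break these relationships, and tests that depend on joins would fail for reasons unrelated to the application.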

Organizations can review their processes and improve efficiencies by implementing proper procedures. Ask questions like: What does it take to create test data? What is the cost of creating test data?

Bring together a cross-functional team to fully analyze your process. You may discover hidden costs in your testing process. Implementing more efficient test processes and more effectively managing your test data can reduce testing costs considerably.

GrapeCity and Your Test Data

Now that you know why you should not test with production data and have seen some techniques for masking data, you can create your own financial test datasets while keeping your customer information safe. As you develop and test solutions implementing GrapeCity products, we hope the techniques and best practices in this article will assist you in your tasks.

GrapeCity’s product line provides developers, designers, and architects with a complete selection of quality JavaScript and .NET grids, along with reporting tools, spreadsheets, document APIs, and mobile controls.

Over 40 years of experience developing award-winning solutions brings you full-featured applications with a familiar user interface and intuitive controls. GrapeCity products enable you to design and develop for a variety of platforms and devices. To learn more, explore the GrapeCity website today.

Mike Christie