ETL testing and data warehouse testing are two terms that are often used interchangeably. However, ETL is just one step in the everyday life of a data warehouse (well, okay, it is more likely to be a nightly procedure). But what more is there to data warehouse testing besides ETL, and is it okay to still refer to the non-ETL part as ETL testing? First, we have to understand what the ETL process entails.
The ETL process, as you know, is a data pipeline in which the data flowing through gets altered along the way. Perhaps it is easier to think of it as a manufacturing assembly line. Extract (the E in ETL) involves acquiring components and getting them into your system. Transform (the T in ETL) is the alteration of those components into their final form. Load (the L in ETL) is the delivery of the assembled end product, ready to be used for whatever purpose it serves. As on a real assembly line, the source materials and their quality concerns are known before the process starts and may require inspection, progress can be monitored mid-flow, and the finished products can be inspected for defects. The only real difference is that ETL runs more like a changeover process on the line, and the operators don't need to wear protective gear or worry about personal injury. But let's get back on track and talk about how this applies to testing.
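To make the analogy concrete, here is a minimal sketch of the three stages as separate, individually testable functions. Everything in it is an illustrative assumption: the in-memory source and target, the field names, and the simple transformation rule are stand-ins for whatever systems and rules a real warehouse uses.

```python
# Hypothetical sketch: the three ETL stages as separate, testable functions.
def extract(source_rows):
    """Extract: acquire raw records from a source (here, an in-memory list)."""
    return list(source_rows)

def transform(raw_rows):
    """Transform: reshape each raw record into its warehouse form."""
    transformed = []
    for row in raw_rows:
        transformed.append({
            "customer_id": int(row["id"]),
            "full_name": f'{row["first_name"]} {row["last_name"]}'.strip(),
            "order_total": round(float(row["amount"]), 2),
        })
    return transformed

def load(rows, target):
    """Load: deliver the transformed records to the target store (a list here)."""
    target.extend(rows)
    return len(rows)

# Keeping the stages separate means each one can be exercised by its own tests.
warehouse = []
raw = extract([{"id": "7", "first_name": "Ada", "last_name": "Lovelace", "amount": "19.999"}])
loaded = load(transform(raw), warehouse)
assert loaded == 1 and warehouse[0]["order_total"] == 20.0
```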
Logically, functional ETL testing by Qualitest would consist of testing the Extract, Transform, and Load steps, while non-functional ETL testing would test the performance of those actions. Problems during Load may in fact point to physical problems with hardware devices or storage capacity, which has created a push towards cloud-based implementations and their virtue of being virtual. Similarly, load problems can occur during backup recovery, a load that should involve no transformation at all (and is not to be confused with the ETL Load step). But there are more potential problems that fall outside of, yet are connected to, the actual E, T and L processes.
First of all, before extraction can occur, the extracted data has to come from somewhere. Variations in the types and quality of those source systems may lie at the root of problems encountered later on. Mid-transformation, a record could be flagged as incapable of being transformed; the question is, is that flag accurate? And has the business need been properly conceived and programmed? Next, after the load, testing needs to occur to ensure that everything was loaded correctly. This post-load checking involves so much data that it is impossible to verify every entry, so sample data records must be picked for comparison, and the transformation must be taken into account when comparing them against the source. Possible data problems include truncation, data caching, mistransformation, the same record being echoed multiple times, and inability to transform (which should have been flagged during the initial transformation attempt). In addition, there are some problems that are unique to the initial load of the system.
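A post-load spot check along these lines might look like the hedged sketch below: sample source records, apply the expected transformation, and compare against what actually landed in the warehouse, while also watching for duplicates. The row shapes, the `expected_transform` rule, and the sample size are all assumptions for illustration, not a prescribed implementation.

```python
import random

def expected_transform(src_row):
    # Assumed transformation rule for illustration only.
    return {"id": src_row["id"], "name": src_row["name"].upper()}

def spot_check(source, target, sample_size=100, seed=42):
    """Compare a random sample of source rows against the loaded target."""
    target_by_id = {}
    duplicates = []
    for row in target:
        if row["id"] in target_by_id:
            duplicates.append(row["id"])  # same record echoed multiple times
        target_by_id[row["id"]] = row

    rng = random.Random(seed)  # fixed seed so the check is repeatable
    sample = rng.sample(source, min(sample_size, len(source)))
    mismatches = []
    for src in sample:
        expected = expected_transform(src)
        actual = target_by_id.get(src["id"])
        if actual is None:
            mismatches.append((src["id"], "missing after load"))
        elif actual != expected:
            # catches mistransformation and truncation alike
            mismatches.append((src["id"], f"expected {expected}, got {actual}"))
    return duplicates, mismatches
```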
On top of this, there are problems that can occur entirely outside the extraction, transformation and loading processes. Even if a record is correct after it has been loaded, that does not guarantee the stored record will not become corrupted at a later point. The scheduling software that kicks off the backup and ETL processes may also be defective, whether in frequency, in event triggering, or in the function it performs. The backup procedure itself may introduce some form of data corruption during transcription, in effect an accidental transformation of the data. Or the backup may have a problem when being restored back into the system.
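One way to test for that kind of silent corruption is to fingerprint every row before the backup and again after the restore, then compare. The sketch below assumes rows are dictionaries keyed by an `id` column; the canonical serialization and hash choice are illustrative assumptions rather than a specific product's procedure.

```python
import hashlib

def row_fingerprint(row):
    """Hash a canonical serialization of one row."""
    canonical = "|".join(str(row[key]) for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def table_checksums(rows):
    """Map each row id to its fingerprint."""
    return {row["id"]: row_fingerprint(row) for row in rows}

def verify_restore(before_rows, after_rows):
    """Report rows lost or altered by a backup/restore cycle."""
    before, after = table_checksums(before_rows), table_checksums(after_rows)
    missing = set(before) - set(after)
    corrupted = {rid for rid in before.keys() & after.keys() if before[rid] != after[rid]}
    return missing, corrupted
```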
And let's not forget the reporting side. After all, the whole purpose of building a data warehouse is to be able to pull analyzed reports from the raw data stored within, structured in a way that optimizes the speed of your SQL queries and report-building. The scale of the data warehouse should not pose a performance problem when generating reports; its size should not slow down queries the way it can for databases built to optimize CRUD speed.
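A simple, hedged way to keep an eye on this is to time a representative report query against the warehouse and flag it when it exceeds a budget. In the sketch below, SQLite, the `sales_fact` table, and the two-second budget are all stand-ins for whatever engine, schema, and service-level target apply in practice.

```python
import sqlite3
import time

def report_query_within_budget(connection, budget_seconds=2.0):
    """Time a representative aggregate report query against the warehouse."""
    start = time.perf_counter()
    connection.execute(
        "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"
    ).fetchall()
    elapsed = time.perf_counter() - start
    return elapsed <= budget_seconds, elapsed

# Tiny in-memory stand-in so the check is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [("north", 10.0), ("south", 5.5), ("north", 2.5)])
ok, elapsed = report_query_within_budget(conn)
assert ok, f"report query took {elapsed:.3f}s"
```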
Lastly, on the non-functional side, the whole system could suffer from security problems. Permission levels may not function as expected, and the stored data, or the processes that access it, may not be properly secured. Since data warehouses often contain medical records (which require HIPAA protection) and point-of-sale information, proper security may be necessary to prevent access to credit card numbers, Social Security numbers, birthdates, home addresses, and personal medical information.
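Permission problems lend themselves to negative tests: assert that a role which should not see sensitive columns in fact cannot. The sketch below uses a hypothetical `AccessControl` class as a stand-in for whatever permission layer is actually in place (database grants, column-level security, an API gateway); the role and column names are assumptions for illustration.

```python
SENSITIVE_COLUMNS = {"ssn", "credit_card_number", "birthdate", "home_address"}

class AccessControl:
    """Hypothetical stand-in for the real permission layer."""
    def __init__(self, grants):
        self._grants = grants  # role -> set of readable columns

    def can_read(self, role, column):
        return column in self._grants.get(role, set())

def test_reporting_role_blocked_from_sensitive_columns():
    # The reporting role is granted only non-sensitive, aggregate-friendly columns.
    acl = AccessControl({"reporting": {"customer_id", "order_total", "region"}})
    for column in SENSITIVE_COLUMNS:
        assert not acl.can_read("reporting", column), f"reporting can read {column}"

test_reporting_role_blocked_from_sensitive_columns()
```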
Author Bio:
Scott Andery is an expert marketer and author who specializes in software testing tools and resources.