Data Collection Flow

Data Collection

There is a saying amongst statisticians that "the data you want are not the data you have, and the data you need are not the data you want". The data that are readily available in your existing reports are probably not sufficient to support process analysis and improvement. Good data collection can be difficult, whether mining information systems or gathering data first-hand.

Just remember that the quality of the data is probably the weakest link in the quality of your analysis. In the long run, it's worth the effort to get it right.

Operational Definitions

Collecting data is itself a process: there is a prescribed method, measuring equipment, people to be trained, and so on. The resulting data depend on the way in which they are collected. If the operation of data collection is not defined, then the resulting numbers will be rubbery and subject to dispute. The specification of the way the data are gathered (in other words, how the measurement is taken) is called an Operational Definition.

Data Types


There are several types of data. This is important because the type of data we collect affects the tools we can use for representation and analysis.

The decision tree at left shows the various data types.

As we go down the tree, we find that we can use progressively more powerful and sensitive data analysis tools.

However, category data is invaluable for stratification - see below - because it lets us differentiate between subsets of our data to find out if there are significant differences between them.

Often, category and count data is the best we can get. A customer call was either a complaint or not, the product was either scrapped or not.

However, some data is available as variables data. In lean production and in almost any type of service process, the time taken to complete an operation will be a key measure. However, we often find that in existing reports, the time is hidden, and what is reported is whether the outcome was on time - or not.

This practice discards some very important information.

The golden rule here is: don't turn variables data, if you have it, into discrete data by scoring it as "good" or "bad" against a specification. The truck may have arrived late, but it makes a big difference to our understanding of the process to know whether it was 5 minutes or 2 days late. Record and analyse the difference between the actual and the target, not just the count of on time vs late.
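As a minimal sketch of this rule (the delivery times and target here are made up for illustration), recording the signed deviation from the target preserves exactly the information that an on-time flag throws away:

```python
from datetime import datetime

def lateness_minutes(target, actual):
    """Signed deviation from target in minutes (negative means early)."""
    return (actual - target).total_seconds() / 60

target = datetime(2024, 5, 1, 9, 0)
deliveries = [datetime(2024, 5, 1, 9, 5),   # 5 minutes late
              datetime(2024, 5, 3, 9, 0)]   # 2 days late

# Discrete scoring: both deliveries collapse to the same "late" flag.
late_flags = [actual > target for actual in deliveries]

# Variables data: the 5-minute and 2880-minute cases stay distinguishable.
deviations = [lateness_minutes(target, d) for d in deliveries]
```

Both records score identically as "late", but the deviations (5.0 and 2880.0 minutes) tell very different stories about the process.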


When we are looking to understand and improve a process, we are interested in differences between subsets of the process. This means that when we collect and store our data, we need to tag each observation with the relevant attributes so that we can compare between them during our analysis.

These tags are typically category data under the above typology.

What you want and what you can get will of course depend on your particular process, but typical attributes you might look for include:

  • Date / Time / Day
  • Equipment used
  • Team or operator
  • Product / service type
  • Location / path

If you collect and store the data with these (and other appropriate) tags, it is easy to sum back to the totals. However, if you only get the totals, there's no way back into the details!
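A small sketch of this idea, with illustrative field names: each observation carries its category tags, subtotals can be compared across any tag, and the grand total falls out by simple summation:

```python
from collections import defaultdict

# Each observation is tagged with category data (fields are illustrative).
observations = [
    {"day": "Mon", "team": "A", "minutes_late": 5.0},
    {"day": "Mon", "team": "B", "minutes_late": 40.0},
    {"day": "Tue", "team": "A", "minutes_late": 0.0},
    {"day": "Tue", "team": "B", "minutes_late": 55.0},
]

def subtotal_by(tag, rows):
    """Sum the measure within each level of the given category tag."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[tag]] += row["minutes_late"]
    return dict(totals)

by_team = subtotal_by("team", observations)                 # compare subsets
grand_total = sum(r["minutes_late"] for r in observations)  # sum back up
```

Starting from tagged observations, `subtotal_by("team", ...)` or `subtotal_by("day", ...)` is trivial; starting from the grand total alone, neither is possible.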

There is a bit of a catch-22 in that stratification presupposes that you have some theories about sources of variation in the process, although these theories are nominally developed during the analyse phase. As ever, expect to carry out a couple of PDSA cycles to home in on what you need.


One question that comes up repeatedly is “how many samples do we need to collect?”. The maths of calculating this is pretty straightforward.

The challenges are around the quality of data collected, which is why operational definitions are crucial, and in ensuring that the sample of data is adequately representative of the totality of the process, i.e. that it is a random, unbiased sample. Collecting reasonably random samples turns out to be a lot more complicated than you might expect.
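To illustrate the straightforward maths, here is the standard normal-approximation formula for estimating a proportion to within a margin of error E at roughly 95% confidence, n = z²·p(1−p)/E², sketched with the worst-case assumption p = 0.5:

```python
import math

def sample_size_for_proportion(margin, p=0.5, z=1.96):
    """Minimum n to estimate a proportion to within +/- margin,
    using the normal approximation (z = 1.96 for ~95% confidence).
    p = 0.5 is the conservative worst case."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

n = sample_size_for_proportion(0.05)  # within +/- 5 percentage points
```

This gives the familiar n = 385 for a ±5 point margin. As the surrounding text notes, getting the number is the easy part; getting 385 representative, well-defined observations is the hard part.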

The data that fall easily to hand are unlikely to be a fair representation of all of the data that the process generates. 
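A small illustration of the point, under an assumed population whose values drift upward over the life of the process: the data that fall easily to hand (the earliest records) badly misrepresent the whole, while a random sample does not:

```python
import random

# Assume a measure that drifts upward over the life of the process.
population = list(range(1000))   # population mean is 499.5

convenience = population[:100]   # the data that fall easily to hand
random.seed(1)                   # fixed seed for reproducibility
representative = random.sample(population, 100)  # unbiased random sample

def mean(xs):
    return sum(xs) / len(xs)

# The convenience sample's mean (49.5) sits far below the population's;
# the random sample's mean lands close to 499.5.
```

Real processes rarely hand you a neatly shuffled population, which is why drawing a genuinely random sample is harder than it looks.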


Manual data collection is expensive and error prone. It requires careful design and testing before committing to a major collection exercise. You will probably not get it right on the first iteration, so allow time for a pilot collection and evaluation.

Collecting data from IT systems would appear to be simpler, but in practice you are likely to find that:

  • Data from corporate systems are not structured in a way that facilitates easy access
  • Much of the data you want will be sitting in spreadsheets of highly variable quality
  • The data that are recorded may be defined in a way that differs from your needs, even though the names are the same


The data you collect, whether from IT systems or from manual data collection, needs to be structured and stored in a format that is suitable for reporting and analysis.

Because process improvement information is qualitatively different from financial management information, you may have to implement your own special data storage to support the measurement, analysis, and improvement phases of the improvement project. There are many good reasons you should not do this, but unless your organisation is well advanced in process improvement thinking, the choice may be between an unsatisfactory solution and no solution to this dilemma.

Excel is the lowest common denominator, and for specific projects is often the best tool. Its weaknesses are the relatively small amount of data that it can handle, its lack of rigour in checking formulae and derivations, and the amount of work required either to manually process the data or to set up scripts and macros to handle the data.

Access has a lot to commend it as an interim storage and analysis method that can be easily coupled with an Excel front end. However, most corporate IT departments take a dim view of local Access systems, and you may have to keep a low profile and adopt a do-it-yourself approach.

Workaround solutions are only recommended for transient improvement projects. They are not recommended for ongoing management reporting - implementing an appropriate industrial-strength monitoring and reporting system is often one of the key enablers of performance improvement. 

Note that popular statistical tools such as Minitab do not provide a solution for data storage and retrieval.


One friendly tip - no one has ever overestimated the amount of work required to get the data into decent shape for analysis. Even after you have developed the operational definitions and data collection methods, there is a lot of work to be done before you are ready to analyse and report on the data. We estimate that the relative effort for data manipulation versus data analysis is typically 90/10 for a new project.

It's not just you.