Drowning in data? These four tips will come to your rescue.
As a guy who began his career two decades ago as a data analyst, stringing those three words together (“drowning in data”) in the subject line wasn’t easy…but sadly it can be true. One of my challenges as a consultant is managing the amount of data that I work with. A good majority of my clients have transactional tables that contain millions upon millions of rows. Just querying that amount of data is time consuming, let alone trying to analyze it. If you occasionally (or regularly) find yourself in a similar situation, here are a few techniques that you can use to scale this cumbersome amount of data down:
- Use time constraints: Invariably, I’ll start by trying to introduce some date constraint to limit the data. This can include prior and current year, current year, or current month only. I’ll limit the data as far down as I can get away with to still deliver a meaningful result.
- Summarize: If incorporating a time constraint doesn’t help much, try to summarize the data in some fashion, either by some time dimension (i.e. year, quarter, or month), some character field (i.e. country, customer), or some combination of the two (i.e. year and customer). Summarizing the data can usually help me go from millions of records to thousands.
- Drill down: Although summarizing the data is great for scaling it down to a manageable level, it has its negative impact when trying to do a thorough analysis (once the data is summarized within a query, all the individual rows that comprised it are gone). In these situations, I’ll look to introduce parameter driven drill-down techniques, where the user can select a segment of the data and then drill into all the rows tied to just that one segment (i.e. U.S. only rows).
- Secure filters: Another trick you can use is to apply filters with security behind them. For instance, if a particular sales rep logs in and launches a particular report, unbeknownst to them, they could be accessing a result set with just the rows applicable to them. This technique is a particular favorite of mine, in that it allows me to build one report, have it accommodate all reps, but yet secretly limit the data down to a manageable limit.
Even for those fellow data grinders at heart, there are limits to the amount of data we can work with. However there are ways of scaling that data back, without sacrificing all the analytical challenges we may encounter on a daily basis. Do you agree? If so, what techniques do you use?