Data Terminology

Introduction

The world of analytics and data is full of terminology that can seem baffling to non-specialists. However, everyone working in digital agencies will find a rudimentary understanding of this terminology useful. Here is a list of some of the more important terms (please feel free to add more!).

Econometrics, time series and regression analysis

These are all essentially the same thing. The difference is that some models include historic values of the dependent variable as an explanatory variable (first credit card example), while others don't.

Some regression work is cross-sectional: for example, you may want to compare and model sales of servers across different branches of PC World.
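To make the distinction concrete, here is a minimal sketch in plain Python. All the figures are made up for illustration, and `fit_line` is a hypothetical helper rather than part of any package. The same least-squares fit is used twice: once with last period's sales as the explanatory variable (the time series flavour), and once across branches at a single point in time (the cross-sectional flavour).

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Time series flavour: hypothetical monthly sales, explained by
# the previous month's sales (a lagged dependent variable).
monthly_sales = [50.0, 60.0, 68.0, 74.4, 79.52]
previous_month = monthly_sales[:-1]
current_month = monthly_sales[1:]
ar_intercept, ar_slope = fit_line(previous_month, current_month)

# Cross-sectional flavour: hypothetical branches observed at one point
# in time, with floor area explaining server sales.
floor_area = [1.0, 2.0, 3.0, 4.0]
server_sales = [12.0, 22.0, 32.0, 42.0]
cs_intercept, cs_slope = fit_line(floor_area, server_sales)

print(f"lagged model:          sales = {ar_intercept:.2f} + {ar_slope:.2f} * last month's sales")
print(f"cross-sectional model: sales = {cs_intercept:.2f} + {cs_slope:.2f} * floor area")
```

In practice you would use a statistics package rather than hand-rolled arithmetic, and real data would not fit a straight line this neatly.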

Logistic regression

Used when you are trying to predict a yes/no result. The outcome is coded in the data as a 1 or a 0 (extensions exist for more than two options), and the model outputs a probability score. It is then a case of testing the model against a separate data set to see how well it predicts.

Logistic regression is often used in credit scoring models. It is also the basis of a lot of propensity models.
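The workflow described above (code the outcome as 1 or 0, fit a model that outputs a probability, then test it on separate data) can be sketched roughly as follows. This is an illustration only: the data is invented, the single predictor ("share of credit limit used") is an assumption, and the hand-written gradient descent in the hypothetical `fit_logistic` stands in for whatever a real statistics package would do.

```python
import math


def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.5, steps=5000):
    """Fit p(y = 1) = sigmoid(weight * x + bias) by gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        bias -= learning_rate * sum(errors) / n
    return weight, bias


# Hypothetical training data: share of credit limit used, and whether
# the customer later defaulted (1) or not (0).
train_x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
train_y = [0, 0, 0, 1, 0, 1, 1, 1]
weight, bias = fit_logistic(train_x, train_y)

# Separate (hold-out) data set used only to check the predictions.
holdout_x = [0.15, 0.25, 0.75, 0.85]
holdout_y = [0, 0, 1, 1]
scores = [sigmoid(weight * x + bias) for x in holdout_x]
predicted = [1 if p >= 0.5 else 0 for p in scores]
accuracy = sum(p == y for p, y in zip(predicted, holdout_y)) / len(holdout_y)

for x, p in zip(holdout_x, scores):
    print(f"credit used {x:.2f} -> probability of default {p:.2f}")
print(f"hold-out accuracy: {accuracy:.0%}")
```

The 0.5 cut-off used to turn a probability into a yes/no answer is itself a choice; a credit scoring team might well pick a different threshold.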

CHAID

CHAID stands for CHi-squared Automatic Interaction Detector. It’s what’s known as a decision tree, but that’s not as important as what it actually does: in DM it’s used to select groups of consumers and predict how their responses to some variables affect other variables. It is also used when you have discrete data (e.g. number of fruit and vegetables eaten daily) rather than continuous data (e.g. monthly income).

As an example, in financial services you might use CHAID to investigate which factors are associated with bad debt (e.g. income bracket, number of credit cards, whether the mortgage is more than 4 times salary (yes/no), etc.).
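The "chi-squared" part of the name refers to the test used to decide which variable to split on. The sketch below shows only that one step, on invented customer counts with two hypothetical yes/no predictors (`high_mortgage` and `many_cards`). It is a simplification: a full CHAID implementation also merges similar categories, compares adjusted p-values rather than raw statistics, and repeats the process within each resulting group.

```python
from collections import Counter


def chi_squared(records, predictor, target):
    """Pearson chi-squared statistic for a predictor against a target."""
    observed = Counter((r[predictor], r[target]) for r in records)
    row_totals = Counter(r[predictor] for r in records)
    col_totals = Counter(r[target] for r in records)
    n = len(records)
    stat = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n
            stat += (observed[(row, col)] - expected) ** 2 / expected
    return stat


# Hypothetical customer counts:
# (high_mortgage, many_cards, bad_debt, number_of_customers)
counts = [
    (True, True, True, 25), (True, False, True, 15),
    (False, True, True, 5), (False, False, True, 5),
    (True, True, False, 5), (True, False, False, 5),
    (False, True, False, 15), (False, False, False, 25),
]
records = [
    {"high_mortgage": hm, "many_cards": mc, "bad_debt": bad}
    for hm, mc, bad, n in counts
    for _ in range(n)
]

# Both predictors are yes/no, so their statistics are directly comparable.
chi_scores = {
    name: chi_squared(records, name, "bad_debt")
    for name in ("high_mortgage", "many_cards")
}
best_split = max(chi_scores, key=chi_scores.get)

for name, score in chi_scores.items():
    print(f"{name}: chi-squared = {score:.1f}")
print(f"first split would be on: {best_split}")
```

In this made-up data the mortgage flag separates good and bad debtors far more sharply than the number of cards, so it would form the first branch of the tree.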

SAS and SPSS

These are statistical analysis packages. SPSS was originally produced by the software company SPSS Inc., which is now owned by IBM. SAS originally stood for Statistical Analysis System and is one of the most widely used software programs in CRM. Both are equally valid in different situations.

Proxies

Sometimes you don't have the data you want, but you may have something that is very closely linked to it. For example, repeat purchase is a proxy for loyalty: loyalty is an emotional thing, so you can’t actually measure it directly! That should get you thinking.

Dummy variable

This is a way of putting what you might consider “non-data” things into a model, perhaps events or natural phenomena. The variable takes the value 1 when the event or condition applies and 0 otherwise. For example: St Patrick's Day (for modelling Guinness sales) or whether the temperature rises above 22 deg C (increasing sales of ice cream).
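As a small illustration, the two examples above could be turned into model inputs along these lines. The dates and temperatures are invented, and the function names are hypothetical; the point is simply that each "non-data" event becomes a column of 1s and 0s.

```python
from datetime import date

# Hypothetical daily observations: (date, maximum temperature in deg C).
observations = [
    (date(2023, 3, 16), 9.0),
    (date(2023, 3, 17), 11.5),
    (date(2023, 7, 8), 24.0),
]


def st_patricks_dummy(day):
    """1 on St Patrick's Day (17 March), 0 on every other day."""
    return 1 if (day.month, day.day) == (3, 17) else 0


def hot_day_dummy(temp_c, threshold=22.0):
    """1 when the temperature rises above the threshold, 0 otherwise."""
    return 1 if temp_c > threshold else 0


rows = [
    (day, st_patricks_dummy(day), hot_day_dummy(temp))
    for day, temp in observations
]

for day, is_st_patricks, is_hot in rows:
    print(day.isoformat(), is_st_patricks, is_hot)
```

These columns can then sit alongside ordinary numeric variables (price, advertising spend and so on) in a regression.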

Time series

In statistics, econometrics and mathematical finance, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index or the annual flow volume of the Nile River at Aswan.

Time series analysis comprises methods for analysing time series data in order to extract meaningful statistics and other characteristics of the data.

Time series forecasting is the use of a model to forecast future events based on known past events: to predict data points before they are measured.

An example of time series forecasting in econometrics is predicting the opening price of a stock based on its past performance.
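As a rough sketch of what "forecasting from known past events" means, the snippet below predicts the next value of a series as the average of the last few observations. A moving average is chosen here only because it is about the simplest method there is; it is an assumption of this example, not a recommendation, and the closing values are made up.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next point as the mean of the last `window` points."""
    if len(series) < window:
        raise ValueError("need at least `window` observations")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical daily closing values, measured at uniform intervals.
closing_values = [100.0, 102.0, 101.0, 103.0, 105.0]

forecast = moving_average_forecast(closing_values, window=3)
print(f"forecast for the next day: {forecast:.1f}")
```

Real forecasting models usually also account for trend, seasonality and the uncertainty around the prediction.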