Data Glossary for Business

Tags: Automation, General AI, Data

In our AI Glossary for Business, we discussed terms connected to Artificial Intelligence and how a Machine Learning model learns to make predictions by looking at data. The more data it sees, the more likely it is to find meaningful patterns. Some models, such as generative pre-trained transformers (GPT), are trained on huge amounts of data containing billions of diverse examples. Data lies at the heart of Artificial Intelligence, and understanding data concepts is crucial. In part two of our Glossary for Business, we dive into data-related terms you might come across in Machine Learning projects.

A label (also: target or dependent variable) is the outcome that the model tries to predict. It usually reflects the business question that we want to answer. It can be numerical, e.g. when we want to predict demand, sales, or prices. It can also be categorical, e.g. when we want to know whether a customer will join a loyalty programme or which colour of T-shirt sells best. A label is an answer to a question we are asking a model. Here are some examples of those questions and answers:

Question	Label
Is the customer going to join the loyalty programme?	Yes/No
How big will the sales of white T-shirts be in June?	any non-negative whole number (e.g. 1800, 2000, 2431)
What is the price of the house?	any positive number, possibly with decimals (e.g. 100k, 220k, 600k)

Features (or independent variables) are factors influencing the model’s output. They are the external events and circumstances that impact our business case. If we predict sales in our shop, we probably consider the day of the week, the season, prices and the competition in the area. These are the features that a model looks at, too. It goes through the data we collected and looks at many combinations of feature values and labels to learn how they are connected. In practice, a model usually needs several informative features to find useful patterns. Here are examples of factors that might influence the labels we discussed above:

Question	Features
Is the customer going to join the loyalty programme?	the number of purchases in the last quarter; the amount spent on the last purchase; the value of discounts used in the last quarter
How big will the sales of white T-shirts be in June?	the sales of white T-shirts in the preceding months; public holidays in June; marketing campaigns in June
What is the price of the house?	zipcode of the house; square meterage of the house; number of bedrooms

Numerical variables are features containing numeric values, including whole and decimal numbers. They often represent metrics, such as revenue, prices or quantities. Numerical features must have a logical scale: we have to be able to tell the order of the values and how much bigger one value is than another. Take prices as an example. We know that €20 is less than €40, and that €40 is exactly twice as much. So price is a numerical variable. Here are some other examples:

the amount spent on the last purchase
the value of discounts used in the last quarter
the sales of white T-shirts in the preceding months

Categorical variables are features that can be divided into groups (or categories) with different names assigned. Be mindful that numbers can also represent groups in categorical features. Let’s look at an example of T-shirt sizes. We can assign number 1 to size “S”, number 2 to size “M” and number 3 to size “L”. Even though the feature includes numbers, it’s still a categorical variable. So how do you differentiate them from the numerical ones? For some categorical data, there’s no way of telling the order of values. We can’t say whether “blue” is larger or smaller than “white”. For other categorical variables, it’s possible to detect the order, but not the scale of it. We can tell that “medium” is larger than “small”, but we don’t know by how much. Here are other examples of categorical variables:

public holidays in June
zipcode of the house
marketing campaigns in June
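
To make the distinction concrete, here is a minimal Python sketch of two common ways to turn categories into numbers for a model. It assumes made-up lists of T-shirt colours and sizes, and the names `one_hot` and `SIZE_ORDER` are illustrative rather than part of any library. Unordered categories get one 0/1 column each, while ordered categories get a rank whose gaps carry no meaning.

```python
# Hypothetical example data: colours have no order,
# sizes have an order but no measurable scale.
colours = ["blue", "white", "blue", "black"]
sizes = ["S", "L", "M", "S"]


def one_hot(values):
    """Return one 0/1 entry per distinct category for each value."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


# Ordered categories can be mapped to ranks. The ranks only say
# "M comes after S", not "M is twice as big as S".
SIZE_ORDER = {"S": 1, "M": 2, "L": 3}
size_codes = [SIZE_ORDER[s] for s in sizes]

colour_columns = one_hot(colours)
print(colour_columns[0])  # {'black': 0, 'blue': 1, 'white': 0}
print(size_codes)         # [1, 3, 2, 1]
```

Averaging the size codes would produce a number, but it would not correspond to any real size, which is why these features are still treated as categorical.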

A column in a dataset represents a single variable. Columns and rows are the two most important elements of a tabular dataset. One column usually represents one feature or label. Looking at a single column can give us general statistics about the sample we collected. We might be able to say, for example, which T-shirt size is the most frequent or what the median house price is.
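
As a rough illustration of such column-level statistics, the sketch below uses Python's standard `statistics` module on two hypothetical columns. The values are invented for the example and do not come from a real dataset.

```python
from statistics import median, mode

# Hypothetical columns; the values are made up for illustration.
tshirt_sizes = ["M", "S", "M", "L", "M", "S"]
house_prices = [100_000, 220_000, 600_000, 180_000, 250_000]

# The most frequent value of a categorical column.
most_frequent_size = mode(tshirt_sizes)

# The middle value of a numerical column once it is sorted.
median_price = median(house_prices)

print(most_frequent_size)  # M
print(median_price)        # 220000
```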

A row in a dataset represents a single observation. It usually contains an identifier for the entity we want a prediction for, such as a customer’s name or a shop number. It also includes exactly one value for each column. Those values might differ for every identifier, and the role of the model is to find those differences and connections. Let’s look at an example of a tabular dataset below.

Q: Is the customer going to join the loyalty programme?

Customer ID	number of purchases in the last quarter	amount spent on the last purchase (€)	value of discounts used in the last quarter (€)	Is the customer going to join the loyalty programme with the next purchase? (1 = yes, 0 = no)
00012345	10	100	20	1
00092346	3	240	50	1
00042335	2	40	1.55	0
00071341	3	60	2.30	0
00022243	4	70	5	1
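
For readers who want to see how this table might look to a technical team, here is a minimal Python sketch. It stores the five rows above as dictionaries and separates the feature columns from the label column. The short key names (`purchases`, `last_spend`, `discounts`, `joins`) and the variable names `X` and `y` are illustrative choices, not a fixed standard.

```python
# The five rows from the table above; the short key names are illustrative.
rows = [
    {"customer_id": "00012345", "purchases": 10, "last_spend": 100.0, "discounts": 20.0, "joins": 1},
    {"customer_id": "00092346", "purchases": 3, "last_spend": 240.0, "discounts": 50.0, "joins": 1},
    {"customer_id": "00042335", "purchases": 2, "last_spend": 40.0, "discounts": 1.55, "joins": 0},
    {"customer_id": "00071341", "purchases": 3, "last_spend": 60.0, "discounts": 2.30, "joins": 0},
    {"customer_id": "00022243", "purchases": 4, "last_spend": 70.0, "discounts": 5.0, "joins": 1},
]

FEATURES = ["purchases", "last_spend", "discounts"]
LABEL = "joins"

# One list of feature values per row (observation) ...
X = [[row[name] for name in FEATURES] for row in rows]
# ... and one label per row. The customer ID identifies the row
# but is not used as a feature.
y = [row[LABEL] for row in rows]

print(X[0])  # [10, 100.0, 20.0]
print(y)     # [1, 1, 0, 0, 1]
```

The customer IDs are kept as text so that the leading zeros are preserved.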

Combining this knowledge with our AI Glossary for Business can equip you to navigate the work your technical teams undertake and to understand the risks connected to bad data. If you have any questions about this glossary, comment below and we'll get back to you.

If you'd like to learn how CKDelta can help you build a robust roadmap to extract value from your data, visit our △Discovery page.