In our AI Glossary for Business, we discussed terms connected to Artificial Intelligence and how a Machine Learning model learns to make predictions by looking at data. The more data it sees, the more likely it is to find meaningful patterns. Some models, such as generative pre-trained transformers (GPT), are trained on huge amounts of data containing billions of diverse examples. Data lies at the heart of Artificial Intelligence, and understanding data concepts is crucial. In part two of our Glossary for Business, we dive into data-related terms you might come across in Machine Learning projects.
A label (also: target or dependent variable) is the outcome that the model tries to predict. It usually corresponds to the business question that we want to answer. It can be numerical, e.g. when we want to predict demand, sales, or prices. It can also be categorical, e.g. when we want to know whether a customer will join a loyalty programme or which colour of T-shirt sells best. A label is the answer to a question we are asking a model. Here are some examples of those questions and answers:
Question | Label |
Is the customer going to join the loyalty programme? | Yes/No |
How big will the sales of white T-shirts be in June? | any non-negative whole number (e.g. 1800, 2000, 2431) |
What is the price of the house? | any positive decimal number (e.g. 100k, 220k, 600k) |
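To make the difference between categorical and numerical labels concrete, here is a minimal Python sketch. The variable names, the example values and the helper `label_kind` are illustrative assumptions, not part of any real project or library.

```python
# Hypothetical labels for the three questions above (values are made up).
joins_loyalty_programme = True   # categorical: Yes/No
white_tshirt_sales_june = 2431   # numerical: non-negative whole number
house_price = 220_000.0          # numerical: positive decimal number


def label_kind(label):
    """Return 'categorical' or 'numerical' for a single label value."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(label, (bool, str)):
        return "categorical"
    if isinstance(label, (int, float)):
        return "numerical"
    raise TypeError(f"Unsupported label type: {type(label).__name__}")


for name, value in [
    ("joins_loyalty_programme", joins_loyalty_programme),
    ("white_tshirt_sales_june", white_tshirt_sales_june),
    ("house_price", house_price),
]:
    print(f"{name}: {value!r} -> {label_kind(value)}")
```

A text label such as a T-shirt colour (`"white"`) would also count as categorical in this sketch.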
Features (or independent variables) are the factors that influence the model’s output. They are the external events and circumstances that impact our business case. If we predict sales in our shop, we probably consider the day of the week, the season, prices and the competition in the area. These are the features that a model looks at, too. It first goes through the data we collected and tries to connect the features with the labels. It looks at many combinations of features and labels to understand how they are related. In practice, most Machine Learning algorithms need several features to learn useful patterns. Here are examples of factors that might influence the labels we discussed above:
Question | Features |
Is the customer going to join the loyalty programme? | the number of purchases in the last quarter; the amount spent on the last purchase; the value of discounts used in the last quarter |
How big will the sales of white T-shirts be in June? | the sales of white T-shirts in the preceding months; public holidays in June; marketing campaigns in June |
What is the price of the house? | zipcode of the house; square meterage of the house; number of bedrooms |
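The split between features and a label can be sketched in a few lines of Python. This is only an illustration for the house-price question: the values are made up, and the names `observation`, `LABEL_NAME` and `split_features_and_label` are assumptions introduced for this example.

```python
# One hypothetical observation for the house-price question (values are made up).
observation = {
    "zipcode": "10115",
    "square_metres": 85.0,
    "bedrooms": 3,
    "price": 220_000.0,  # the label: the outcome the model should predict
}

LABEL_NAME = "price"


def split_features_and_label(row, label_name):
    """Separate the inputs (features) from the outcome (label) of one observation."""
    features = {name: value for name, value in row.items() if name != label_name}
    label = row[label_name]
    return features, label


features, label = split_features_and_label(observation, LABEL_NAME)
print("Features:", features)
print("Label:", label)
```

During training a model sees both parts. When it makes a prediction for a new house, it receives only the features and has to estimate the label.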
A column in a dataset represents a single variable. Columns and rows are the two most important elements of a tabular dataset. One column usually represents one feature or the label. Looking at a single column gives us summary statistics about the sample we collected. We might be able to say, e.g. which T-shirt size is the most frequent, or what the median house price is.
A row in a dataset represents a single observation. It contains an identifier of the thing for which we want to get a prediction, such as a customer’s name or a shop number. It also includes exactly one value for each column. Those values differ from one identifier to another, and the role of the model is to find those differences and connections. Let’s look at an example of a tabular dataset below.
Q: Is the customer going to join the loyalty programme?
Customer ID | number of purchases in the last quarter | amount spent on the last purchase (€) | value of discounts used in the last quarter (€) | Is the customer going to join the loyalty programme with the next purchase? |
00012345 | 10 | 100 | 20 | 1 |
00092346 | 3 | 240 | 50 | 1 |
00042335 | 2 | 40 | 1.55 | 0 |
00071341 | 3 | 60 | 2.30 | 0 |
00022243 | 4 | 70 | 5 | 1 |
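As a rough sketch of how columns and rows are used, the example table can be loaded into plain Python and inspected. The shortened column names and the helper `column` are assumptions made for readability. Customer IDs are stored as text so that the leading zeros are kept.

```python
from collections import Counter
from statistics import median

# Shortened, hypothetical column names for the example table above.
COLUMNS = (
    "customer_id",
    "purchases_last_quarter",
    "last_purchase_eur",
    "discounts_last_quarter_eur",
    "joins_programme",  # the label: 1 = joins, 0 = does not join
)

RAW_ROWS = [
    ("00012345", 10, 100.0, 20.0, 1),
    ("00092346", 3, 240.0, 50.0, 1),
    ("00042335", 2, 40.0, 1.55, 0),
    ("00071341", 3, 60.0, 2.30, 0),
    ("00022243", 4, 70.0, 5.0, 1),
]

# Each row becomes a dict with exactly one value per column.
rows = [dict(zip(COLUMNS, values)) for values in RAW_ROWS]


def column(rows, name):
    """Collect the values of one column across all rows."""
    return [row[name] for row in rows]


# Column view: summary statistics of a single variable.
median_spend = median(column(rows, "last_purchase_eur"))
most_common_purchases = Counter(column(rows, "purchases_last_quarter")).most_common(1)[0][0]
joined_share = sum(column(rows, "joins_programme")) / len(rows)

# Row view: all values that belong to one customer.
customer = next(row for row in rows if row["customer_id"] == "00042335")

print("Median amount spent on the last purchase (EUR):", median_spend)  # 70.0
print("Most frequent number of purchases:", most_common_purchases)      # 3
print("Share of customers who joined:", joined_share)                   # 0.6
print("Row for customer 00042335:", customer)
```

The column view answers questions about the whole sample, while the row view shows everything known about a single customer, which is what a model receives when it makes a prediction.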
Combining this knowledge with our AI Glossary for Business can equip you to navigate the work your technical teams undertake and to understand the risks connected to bad data. If you have any questions about this glossary, comment below and we'll get back to you.
If you'd like to learn how CKDelta can help you build a robust roadmap to extract value from your data, visit our △Discovery page.