The potential of big data is almost limitless. It constantly shapes our daily decisions, such as which movie to watch or which road to take to work; it recommends accommodation for the vacation we are about to enjoy or tells us where to find a restaurant for dinner.
Big data also reduces energy costs in buildings, potentially eliminating $200 billion in waste annually. So it was hardly surprising when companies, eager to see their sales increase, started turning to vendors of sales data in the hope that access to large datasets would increase the number of sales opportunities.
On a company level, big data solutions expand a service’s overall data coverage, enrich existing data, attract new customers and increase the company’s popularity and reliability with both existing and potential new customers.
On the surface, buying third-party data might seem like a good way to ensure business success, but if you fail to evaluate the quality of that data, you could end up spending a large amount of money on bad data and putting your company’s reputation at risk. In this blog, we will cover the top five guidelines to consider when evaluating your own or a supplier’s data.
Data is traded as a commodity. A few years ago, buyers were purchasing location data without being able to examine its content. They paid a high price for the entire dataset, and only after they had bought it could they find out whether there were any deficiencies within it.
Today, data buyers are smarter: they want to detect inaccuracies and discrepancies in a dataset and estimate the value of the data before they buy it. Data assessment can be performed in several ways. Here are the top five guidelines to consider when analysing any new data:
- The volume of data – if you are buying a high volume of data, you don’t need to validate every single record in the dataset. Such an analysis would take a great deal of man-hours and computer time. Instead, using well-proven random-sampling theory, you can evaluate the quality of the entire dataset by analysing just a subset of it, and then apply the results of that analysis to the whole dataset (see the sampling sketch after this list).
- Data cleansing and validity – before you load the new dataset into your database, you need to do some pre-processing. It is in this phase that you can detect incomplete, incorrect, inaccurate or irrelevant parts of the data and then replace, modify or delete them, depending on your company’s internal policy. By internal policy we mean your company’s acceptance criteria: how much invalid or inconsistent data is acceptable? How much of the data will the buyer update, and how much will be sent back to the seller to be updated? For example, there is a big difference between updating non-standardised street or city values and filling in missing address elements such as postcodes (a minimal cleansing sketch follows this list).
- Completeness – depending on the type or category of location data you are about to purchase, completeness becomes a very important factor in determining how much the data is worth. For instance, if you want to increase your coverage of restaurants in a particular city, the completeness of fields such as location, website and telephone number is an essential indicator of the data’s value. On the other hand, a dataset of bus stops probably does not require fully populated telephone number, website or rating values (see the completeness sketch after this list).
- Accuracy – having a dataset in which a place’s details are fully complete and valid does not necessarily mean that the data is accurate. Let’s look at four main features of any place:
- Name – for example, say we have two names, McDonald’s and McDonald’s Italy, both referring to the same business in Italy. McDonald’s Italy is inaccurate; it is an unofficial name for the restaurant and will cause duplicate records in your database, leading to inconsistency, and you will end up paying for more records than you actually get (see the duplicate-detection sketch after this list).
- Phone number – the accuracy of a phone number is vital, especially as a phone number cannot be partially accurate. Data that comes in large volumes sometimes contains non-standardised phone numbers, which cause issues when importing them into your current system and, again, lead to duplicate records, inconsistency and so on. Furthermore, having, say, three records of the same business but with different phone numbers does not enrich your database in terms of volume, because you will need to verify that those numbers all belong to that one business. One way to confirm this is by searching data available from other providers such as Google, Apple or Facebook, or by comparing the numbers with those on the official business website (the duplicate-detection sketch after this list also normalises phone numbers).
- Category – accurate category values help us to differentiate businesses by their main activities. You can have two businesses with the same attributes, such as name, phone number, address and website, but in different categories, and both can be valid and accurate: for example, a hotel and a restaurant, or a coffee shop within a hotel. When analysing a new dataset, you should check that there are no records with inaccurate categories, for example McDonald’s shown as a night club.
- Location – details such as street, city, postcode, state, country and coordinates describe the location of a place on the map. The accuracy and precision of all these details are crucial for placing a pin on the map and enabling customers to find a business. Coordinates therefore need to be very precise, while the other address details need to describe the exact location of the place and must be understandable to the most common geocoders, because customers will inevitably use a satellite navigation system to find a location (see the pin-drift sketch after this list).
- Data freshness – data related to places and points of interest is very dynamic. Businesses open, close or update their main information daily. Before engaging with a new data supplier, check when their data was collected and how often it is updated. Additionally, new data can be compared with that of competitors or commonly used data providers. Don’t be surprised if your analysis shows a subset of data that is present only in the seller’s database. That subset should be considered “suspicious”: it may be outdated, or it may refer to newly opened businesses. Either way, it should be further analysed and verified before acquisition (see the last sketch below).
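To make the sampling guideline concrete, here is a minimal Python sketch of Cochran’s sample-size formula with a finite-population correction. The dataset size, confidence level and margin of error are illustrative assumptions, not recommendations:

```python
import math
import random

def sample_size(population, z=1.96, margin=0.03, p=0.5):
    """Cochran's formula with finite-population correction.

    z=1.96 gives 95% confidence; p=0.5 is the most conservative
    assumption about the true defect rate in the dataset.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Illustrative: a 2,000,000-record dataset needs only ~1,067 checks.
records = list(range(2_000_000))      # stand-in for the real records
n = sample_size(len(records))
sample = random.sample(records, n)    # simple random sample to validate
print(f"Validate {n} of {len(records):,} records, then extrapolate.")
```

The defect rate measured on the sample then carries over, within the stated margin of error, to the dataset as a whole.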
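For the cleansing step, a pre-processing pass might look like the sketch below. The field names, the toy postcode rule and the 5% acceptance threshold are all assumptions standing in for your company’s real acceptance criteria:

```python
import re

ACCEPTED_INVALID_RATE = 0.05            # assumed internal criterion
POSTCODE_RE = re.compile(r"^\d{4,5}$")  # toy rule; real rules vary by country

def problems(record):
    """Return a list of issues found in one record."""
    issues = []
    if not record.get("name"):
        issues.append("missing name")
    if not POSTCODE_RE.match(record.get("postcode", "")):
        issues.append("bad postcode")
    return issues

records = [
    {"name": "Trattoria Roma", "postcode": "00184"},
    {"name": "", "postcode": "0018Z"},   # incomplete and invalid
]

invalid = [r for r in records if problems(r)]
rate = len(invalid) / len(records)
action = "accept" if rate <= ACCEPTED_INVALID_RATE else "send back to the seller"
print(f"{rate:.0%} invalid records — {action}")
```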
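Completeness is best scored per category, since a restaurant and a bus stop have different required fields. A sketch, with the field lists chosen purely for illustration:

```python
REQUIRED_FIELDS = {                     # illustrative; tune per data category
    "restaurant": ("location", "website", "phone"),
    "bus_stop":   ("location",),
}

def completeness(record):
    """Fraction of the category's required fields that are populated."""
    required = REQUIRED_FIELDS[record["category"]]
    filled = sum(1 for field in required if record.get(field))
    return filled / len(required)

places = [
    {"category": "restaurant", "location": "41.9,12.5", "phone": "+39 06 123456"},
    {"category": "bus_stop",   "location": "41.8,12.4"},
]
for place in places:
    print(place["category"], f"{completeness(place):.0%} complete")
# restaurant 67% complete (website missing); bus_stop 100% complete
```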
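The duplicate-detection sketch promised above normalises phone numbers and fuzzy-matches names, then flags pairs that agree on both. difflib is in Python’s standard library; phonenumbers is a real, widely used third-party package, though using it here is our suggestion rather than anything a seller provides, and the 0.75 similarity threshold is an arbitrary illustration:

```python
from difflib import SequenceMatcher
import phonenumbers  # pip install phonenumbers

def normalise_phone(raw, region="IT"):
    """Render a free-form number in E.164, e.g. '+390612345678'."""
    parsed = phonenumbers.parse(raw, region)
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

def names_similar(a, b, threshold=0.75):
    """Catch variants like "McDonald's" vs "McDonald's Italy"."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

a = {"name": "McDonald's",       "phone": "06 1234 5678"}
b = {"name": "McDonald's Italy", "phone": "+39 06 1234 5678"}

if (normalise_phone(a["phone"]) == normalise_phone(b["phone"])
        and names_similar(a["name"], b["name"])):
    print("Likely duplicates — merge them, and don't pay for two records.")
```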
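The pin-drift sketch compares the seller’s coordinates with an independent geocode of the same address and measures the distance between the two points. The haversine arithmetic is standard; the reference point is shown as a literal because geocode_address() would be a hypothetical call to whichever geocoder you trust, and the 150 m tolerance is an assumption:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))  # mean Earth radius in metres

MAX_DRIFT_M = 150  # assumed tolerance before a pin counts as misplaced

seller_point = (41.89021, 12.49223)
reference_point = (41.89036, 12.49180)  # would come from geocode_address(...)

drift = haversine_m(*seller_point, *reference_point)
verdict = "OK" if drift <= MAX_DRIFT_M else "flag for review"
print(f"Pin drift: {drift:.0f} m — {verdict}")
```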
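Finally, the “present only in the seller’s database” subset from the freshness guideline can be isolated with a set difference on a normalised key. The key construction below is deliberately crude, and anything this flags still needs verification before acquisition:

```python
def match_key(record):
    """Crude matching key; production matching would be fuzzier."""
    return (record["name"].strip().lower(), record["postcode"])

seller_data = [
    {"name": "Cafe Nero",  "postcode": "00184"},
    {"name": "New Bistro", "postcode": "00185"},
]
reference_data = [  # your own data or another provider's
    {"name": "Cafe Nero", "postcode": "00184"},
]

seller_only = {match_key(r) for r in seller_data} - {match_key(r) for r in reference_data}
print("Suspicious, seller-only records to verify:", seller_only)
# {('new bistro', '00185')} — outdated record, or a genuinely new business?
```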
These are only the top five factors that make up a reasonable test for evaluating a new data supplier, or even your own data. All of the above verification can be carried out by PlaceLab via its various services, and a report can be provided indicating the quality and value of the data.
Additional factors to consider when buying a new dataset are geographical coverage, additional rich attributes (ratings, reviews, hours of operation, etc.), the price of the dataset and the extent of the seller’s support (frequent updates, improvements and corrections to the initial dataset).