One of the top 10 challenges for data buyers is finding a supplier. In fact, according to deltabid, 30% of data buyers say that supplier-related issues are the cause of their biggest headaches.
A couple of years ago, data buyers would pay full price for an entire dataset without being able to look at the content. It was only after acquisition that they would discover the dataset’s deficiencies. Companies were burning money and their data products were failing.
Today, data buyers want to evaluate data content before acquisition, so that they can ask the supplier for an update or a discount, or reject the supplier altogether. To achieve this, they have established a supplier selection process consisting of several phases: finding a credible, low-cost supplier that suits their needs, evaluating the supplier’s data, and maintaining the data over time.
Data buyers are also looking for a healthy relationship with their suppliers and want to ensure that suppliers perform as initially agreed. This continuous process is stressful and time-consuming, but it is also essential for the company’s success. Eager to increase access to highly accurate data, data buyers are relying on software and other tech tools to accelerate and automate the evaluation process.
In this blog post, we will describe the challenges involved in supplier selection through one simple example. Let us assume that a data buyer is considering a new supplier for McDonald’s restaurants in the New York City area and needs to evaluate data provided by Foursquare alongside open source OpenStreetMap (OSM) data.
For demonstration purposes, we took a list of 163 Foursquare McDonald’s locations found on the following link, and 59 OSM McDonald’s locations retrieved using the Overpass API. The retrieved data was compared to the official list of New York City restaurants found on www.mcdonalds.com.
In this analysis, we performed the following quality checks:
- Attribute completeness
- Place validity
- Missing restaurants
- Address verification
- Duplicate records
Attribute completeness analysis focused on the main attributes of any location: name, address, and phone number (NAP), plus website and category. The name field is fully populated in both datasets, with one difference: OSM provides the official name of the restaurant, McDonald’s, while Foursquare adds district names to the name value. So, in the Foursquare database we find values such as McDonald’s Chelsea or McDonalds Theatre District. Although Foursquare names are descriptive and clear, non-standard names make the database unnecessarily complex, which increases the risk of error.
The completeness of the address, phone number, and website attributes across the two datasets is shown below:
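One way a buyer might handle the non-standard names is to separate the brand from the district qualifier before ingestion. The snippet below is a minimal sketch, not the method used in this analysis: the function name `split_name`, the regular expression, and the choice of `McDonald's` as the canonical spelling are all assumptions made for illustration.

```python
import re
from typing import Tuple

# Hypothetical pattern: matches "McDonald's", "McDonald’s" or "McDonalds"
# at the start of the name and captures whatever follows as a qualifier.
_BRAND = re.compile(r"^\s*mcdonald['’]?s\b[\s,\-]*(.*)$", re.IGNORECASE)

CANONICAL_NAME = "McDonald's"  # assumed canonical spelling


def split_name(raw: str) -> Tuple[str, str]:
    """Return (canonical brand, qualifier); names that do not match are returned unchanged."""
    match = _BRAND.match(raw)
    if match is None:
        return raw.strip(), ""
    return CANONICAL_NAME, match.group(1).strip()


chelsea = split_name("McDonald’s Chelsea")
theatre = split_name("McDonalds Theatre District")
plain = split_name("McDonald's")
```

Keeping the qualifier in its own field preserves the descriptive value of the Foursquare names while letting the name column match the official one.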
The chart shows that Foursquare has high completeness of all attributes. On the other hand, although OSM has fully populated coordinates, the completeness of all other attributes is very low. In fact, 60% of OSM restaurants have only a point drawn on the map without any other attribute populated.
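For readers who want to reproduce this kind of check on their own data, attribute completeness can be computed as the share of records with a non-empty value per attribute. This is a minimal sketch under our own assumptions: records are plain dictionaries, the attribute names are illustrative, and the sample rows are made up rather than taken from either dataset.

```python
from typing import Dict, Iterable, Mapping, Sequence


def completeness(records: Iterable[Mapping[str, object]],
                 attributes: Sequence[str]) -> Dict[str, float]:
    """Percentage of records with a non-empty value for each attribute."""
    rows = list(records)
    total = len(rows)
    report = {}
    for attr in attributes:
        # None, empty strings and whitespace-only strings count as missing.
        filled = sum(1 for row in rows if str(row.get(attr) or "").strip())
        report[attr] = 100.0 * filled / total if total else 0.0
    return report


# Made-up sample rows, for illustration only.
sample = [
    {"name": "McDonald's", "address": "123 Example St", "phone": None, "website": ""},
    {"name": "McDonald's", "address": "", "phone": None, "website": None},
    {"name": "McDonald's", "address": "456 Sample Ave", "phone": "555-0100",
     "website": "https://example.com"},
    {"name": "McDonald's", "address": None, "phone": None, "website": None},
]
report = completeness(sample, ["name", "address", "phone", "website"])
```

Treating whitespace-only values as missing matters in practice, because a field that is technically populated but blank would otherwise inflate the completeness score.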
Contact information, such as phone number and website address, is very important to end users as an official source of a restaurant’s information. Our analysis showed that only 8% of OSM locations have the phone number populated, and 30% of the provided websites are inactive (for example, http://www.mcnewyork.com/2292 or http://www.mcnewyork.com/7552). Foursquare provides its own web page for every business, and all of these pages are active. Phone numbers provided on the Foursquare pages were validated against www.mcdonalds.com.
As mentioned above, all McDonald’s locations were compared with the official website, mcdonalds.com, for validity. Interesting findings came out of this analysis.
We were able to validate almost all Foursquare restaurants. Five locations could not be validated due to insufficient data, and another nine were found to be permanently closed.
Please check the examples below:
Because the OSM data is so poorly populated, we were able to validate only 40% of OSM records, namely those with populated address attributes.
According to internet sources, there are more than 300 McDonald’s restaurant locations in New York City. Whether a buyer decides to go with Foursquare or OSM, they will not get a full list of all restaurants in this particular zone.
Location attributes were validated using reliable geocoders. We concluded that both suppliers provide largely accurate address information (when address attributes were populated). Foursquare has several places with incomplete address information, which could therefore be geocoded only to street or city level.
Duplicate data is bad for a business because:
- Storing unnecessary data is expensive
- Inaccurate information about the volume of the data can be misleading
- It can produce inaccurate analysis, which will lead to bad business decisions
- You will pay for more data than you get
When we analysed the lists of locations provided by Foursquare and OSM, we found a significant percentage of duplicate records even in this small data sample: 7.36% of Foursquare records and 6.78% of OSM records are duplicates. These records will need to be cleaned up before they are added to a buyer’s database.
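To illustrate how a duplicate rate like this can be measured, the sketch below groups records by a normalized address and counts every record beyond the first in each group as a duplicate. The matching rule is a deliberately simple assumption of ours (real duplicate detection usually also compares names and coordinates), and the sample records are invented. As a sanity check on the figures above, 7.36% of 163 and 6.78% of 59 would correspond to about 12 and 4 records respectively; that is our inference from the percentages, not a count stated in the analysis.

```python
import re
from collections import Counter
from typing import Iterable, Mapping, Optional, Tuple


def normalize_address(address: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", address.lower())
    return " ".join(cleaned.split())


def duplicate_rate(records: Iterable[Mapping[str, Optional[str]]]) -> Tuple[int, float]:
    """Return (number of extra records, percentage of all records)."""
    rows = list(records)
    # Records without an address cannot be matched by this simple rule.
    keys = [normalize_address(row["address"]) for row in rows if row.get("address")]
    extra = sum(count - 1 for count in Counter(keys).values())
    rate = 100.0 * extra / len(rows) if rows else 0.0
    return extra, rate


# Invented sample records, for illustration only.
sample = [
    {"address": "1 Example Ave, New York"},
    {"address": "1 example ave new york"},
    {"address": "2 Sample St., New York"},
    {"address": None},
]
extra, rate = duplicate_rate(sample)
```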
In our example, we analysed one type of business, a chain of McDonald’s restaurants. OSM and Foursquare assign different labels for these places: fast_food and Fast Food Restaurant respectively. Although both values have the same meaning, differently categorized locations are a challenge for data buyers, who need to align a supplier’s category schema to their own. Now imagine that a buyer tags all places similar to McDonald’s with junk_food. Before a data buyer ingests new data into their system, they will need to remap and convert all fast_food and Fast Food Restaurant values into junk_food. Our example was very simple, but what if you plan to ingest a dataset with different types of businesses and you need to perform remapping of all their categories? This is very time-consuming manual work.
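The remapping step can be expressed as a lookup table keyed by supplier and supplier category. The sketch below is one possible shape for it, assuming the buyer’s `junk_food` tag from the example above; the table and function names are hypothetical, and the `cafe` row is only there to show how an unmapped category surfaces.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical mapping from (supplier, supplier category) to the buyer's schema.
CATEGORY_MAP: Dict[Tuple[str, str], str] = {
    ("osm", "fast_food"): "junk_food",
    ("foursquare", "Fast Food Restaurant"): "junk_food",
}


def remap_category(supplier: str, category: str) -> Optional[str]:
    """Return the buyer's category, or None if no mapping exists yet."""
    return CATEGORY_MAP.get((supplier.strip().lower(), category.strip()))


def split_by_mapping(
    rows: Iterable[Tuple[str, str]],
) -> Tuple[List[Tuple[str, str, str]], List[Tuple[str, str]]]:
    """Separate rows that can be remapped from rows that need manual review."""
    mapped, unmapped = [], []
    for supplier, category in rows:
        target = remap_category(supplier, category)
        if target is None:
            unmapped.append((supplier, category))
        else:
            mapped.append((supplier, category, target))
    return mapped, unmapped


rows = [("OSM", "fast_food"), ("Foursquare", "Fast Food Restaurant"), ("OSM", "cafe")]
mapped, unmapped = split_by_mapping(rows)
```

Collecting the unmapped pairs instead of failing on them gives the buyer a worklist of categories that still need a manual decision, which is where most of the effort goes for datasets with many business types.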
Finding the right data supplier is not easy. Even when you find one that covers the desired places in a certain area, you need to validate the data before you acquire it. Above, we described only a few of the data quality checks that are required during the data evaluation phase.
No data supplier is 100% accurate, so it is very challenging for data buyers to find one that is low cost but also provides highly accurate data. In our example we had two suppliers: Foursquare, which sells its data to other parties, and OpenStreetMap, an open source data provider. If you are on a budget, Foursquare would be the better choice of data supplier because it provides a more accurate and complete dataset.
All of the analyses above were performed in PlaceLab using different PlaceLab services. These reports can be used further in negotiations with, for example, Foursquare or any other data supplier.