Virtually every business has a growing need to gather new data or enrich its existing databases. No doubt your business has also found itself needing additional data that is scattered across various places. Even though we live in the information age and data is readily available, extracting it is not easy, especially from unstructured sources such as websites, documents, or any other kind of free text. You are probably familiar with web scrapers and web crawlers, which extract specific attributes from web pages with some help from the user. In this blog post, we describe PlaceLab’s data extraction service and how it differs from typical web scrapers.
Web scraping is too simple and time-consuming
When you browse through websites such as Booking.com or TripAdvisor.com, you will notice that the business attributes always appear in the same place, as if they were laid out in an invisible table. The name is located at the top, the address is right below, and so on. If you were to use a web scraper to extract business attributes from this type of website, you would need to teach it where the data you need is located. It would then remember the paths to those elements and extract the attributes by following them. Although this is an easy job for a single website, every website has a different structure, so the user would need to repeat the process and teach the scraper a new set of paths for each one. That’s why PlaceLab is not a scraper. It’s smarter than that.
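To make the limitation concrete, here is a minimal sketch of a path-based scraper in Python. The two page snippets, the site names, and the `PATHS` table are all made up for illustration, and the example uses only the standard library on well-formed markup, so treat it as a toy rather than a production scraper.

```python
import xml.etree.ElementTree as ET

# Two hypothetical listing pages holding the same data in different layouts.
SITE_A = (
    "<html><body><div class='listing'>"
    "<h1>Cafe Luna</h1><p class='addr'>12 Main St</p>"
    "</div></body></html>"
)
SITE_B = (
    "<html><body><table><tr>"
    "<td>12 Main St</td><td><b>Cafe Luna</b></td>"
    "</tr></table></body></html>"
)

# Hand-written paths: one set per website, "taught" by the user.
PATHS = {
    "site_a": {"name": ".//div[@class='listing']/h1", "address": ".//p[@class='addr']"},
    "site_b": {"name": ".//td/b", "address": ".//tr/td[1]"},
}


def scrape(html, paths):
    """Follow each stored path and return the text found there (or None)."""
    root = ET.fromstring(html)
    result = {}
    for attribute, path in paths.items():
        node = root.find(path)
        result[attribute] = node.text if node is not None else None
    return result


print(scrape(SITE_A, PATHS["site_a"]))  # paths match this layout
print(scrape(SITE_B, PATHS["site_b"]))  # a second, separately written set of paths
print(scrape(SITE_B, PATHS["site_a"]))  # paths from site A find nothing on site B
```

The paths written for the first site return nothing on the second one, which is exactly the repeated manual work described above.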
Challenges in data extraction encountered by PlaceLab
Let’s say that you need to extract business data from a few hundred or a few thousand different websites. Unfortunately, most of these websites are not well organized and their data is largely unstructured. Essential business information may be buried in the text or may lack properly defined web elements. It would take a lot of man-hours to analyze every single website and find the (hidden) information, or even to teach a tool to extract it. This is where PlaceLab comes into play. We have already spent a lot of time training our machine learning based algorithm to browse through a website and look for a specific attribute by its characteristics and not by its location. PlaceLab doesn’t need paths, and the user doesn’t need to teach it anything. All it needs is a list of websites in which to find the required information, such as “Name”, “Address”, “Category”, “Hours of operation”, “Phone”, “Email”, and “Social Network”. Even when the information is hidden in plain text, PlaceLab will find it and recognize it.
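To illustrate what recognizing an attribute "by its characteristics and not by location" can mean, here is a deliberately simple Python sketch. It is not PlaceLab’s algorithm: it swaps the machine learning model for two hand-written regular expressions, and the patterns, the `find_attributes` function, and the sample text are assumptions made up for this example.

```python
import re

# Simplified, assumed patterns: an email "looks like" name@domain.tld,
# a phone "looks like" a run of digits with optional +, spaces, brackets, dashes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_attributes(text):
    """Find attributes by their shape, wherever they appear in the text."""
    return {
        "Email": EMAIL_RE.findall(text),
        "Phone": [match.strip() for match in PHONE_RE.findall(text)],
    }


# The attributes sit in the middle of a sentence, with no markup or path.
sample = (
    "Visit us any day, or call +1 (555) 010-2030 to book. "
    "Questions? Write to hello@example.com and we will reply."
)
result = find_attributes(sample)
print(result)
```

No path or page structure is involved here, so the same function can be pointed at any text. Hand-written patterns like these only cover attributes with a rigid shape, though. Names, categories, or opening hours are far less regular, which is where a trained model becomes necessary.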
Conclusion
The PlaceLab team deals with these, and more difficult, tasks of extracting data from any unstructured source. This includes websites, documents, or any type of text in which information can sit in all kinds of places, such as the middle of a paragraph or even a quote. What’s more, the size, position, and form of the text don’t matter. Our service combines conventional machine learning and deep learning with heuristics to retrieve target elements from the text. Of course, this is just a small part of PlaceLab’s services, which you can check out on our website.