Would you believe me if I said companies all over the world are waging a secret, invisible data war online? Well, don’t be surprised. This was bound to happen. Oh, and your phones are the unwitting soldiers. And, like all wars, there are going to be severe consequences as well!
Time and again, history has proved that the main reason for all wars was just one thing – brutal competition for power. Be it the world wars, or even the intense Cold War! Retail giants such as Amazon, Flipkart and Walmart, all the way down to tiny start-ups, want to know what their competitors sell. Do you guys remember the insane sting operations run by Michael Scott and his team of loyal salesmen, Dwight Schrute and Jim Halpert, in NBC’s hit show “The Office”? Well, truth be told, in many cases, that is exactly how it used to happen in order to take notes on the prices!
Online, there’s no need to send people anywhere. But big retailers can sell millions of products, so it’s not feasible to have workers browse each item and manually adjust prices. Instead, the companies employ software to scan rival websites and collect prices, a process called “scraping.” From there, the companies can adjust their own prices!
All these retail giants have dedicated teams for scraping! Scraping has become so important that today there are full-fledged companies dedicated to it! One such retail price-optimisation company is Competera. And the start-ups who cannot afford a dedicated internal scraping team rely on companies like these! Competera scrapes pricing data from across the web, ranging from footwear retailers to industrial outfitters, and uses machine-learning algorithms to help its customers decide how much to charge for different products.
So yeah, scraping sounds cruel and sinister, right? But that’s part of how the world wide web works! Google and Bing scrape web pages to index them for their search engines. Academics and journalists also use scraping software to gather the data they require.
However, the interesting thing is that scraping can be a two-way street for the retail companies who employ it. Retailers want to see what their rivals are doing, but they want to prevent rivals from snooping on them; retailers also want to protect intellectual property like product photos and descriptions, which can be scraped and reused without permission by others. As a result, many types of defenses have been deployed against scraping. The most popular technique is showing different prices to real people than to bots. A site may show the price as astronomically high, or as zero, to throw off bots collecting data.
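A crude version of that decoy-price defense keys off the User-Agent string the visitor sends. This is a toy sketch of the idea, not any retailer’s actual implementation; the `price_for` function and the marker list are assumptions for illustration:

```python
def price_for(user_agent: str, real_price: float) -> float:
    """Serve a decoy price when the User-Agent looks like a bot (toy heuristic)."""
    bot_markers = ("bot", "crawler", "spider", "scrapy", "python-requests")
    ua = user_agent.lower()
    if any(marker in ua for marker in bot_markers):
        return 999999.0  # astronomically high decoy to poison scraped data
    return real_price

print(price_for("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0", 49.99))  # 49.99
print(price_for("python-requests/2.31", 49.99))  # 999999.0
```

Of course, as the next section shows, bots can simply lie about who they are, which is why real defenses go far beyond the User-Agent.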
Subsequently, these new defenses create opportunities for new offenses! There are dedicated companies that help their customers mask these bots. One such company is Luminati. Luminati’s service can resemble a botnet, a network of computers running malware that hackers use to launch attacks. Rather than covertly taking over a device, however, Luminati entices device owners to accept its software alongside another app.
This ongoing battle raises one important question: how do you detect bots? From what I learned from the internet, it’s very tricky. Sometimes bots actually tell the sites they’re visiting that they’re bots. When a piece of software accesses a web server, it sends a little information along with its request for the page. Conventional browsers announce themselves as Google Chrome, Microsoft Edge, or another browser. Bots can use this process to tell the server that they’re bots. But they can also lie. One technique for detecting bots is to look at the frequency with which a visitor hits a site. If a visitor makes hundreds of requests per minute, there’s a good chance it’s a bot. Another common practice is to look at a visitor’s internet protocol address. If it comes from a cloud computing service, for example, that’s a hint that it might be a bot and not a regular internet user.
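The request-frequency check described above can be sketched as a sliding one-minute window per IP address. The `RateDetector` class and its limit are assumptions for illustration, not a production system:

```python
from collections import defaultdict, deque

class RateDetector:
    """Flag an IP as a likely bot if it exceeds a per-minute request limit."""

    def __init__(self, limit_per_minute: int = 120):
        self.limit = limit_per_minute
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return True if the IP looks bot-like."""
        window = self.hits[ip]
        window.append(now)
        # Drop timestamps older than 60 seconds
        while window and now - window[0] > 60:
            window.popleft()
        return len(window) > self.limit

detector = RateDetector(limit_per_minute=100)
# Simulate 300 requests from one address, one every 0.2 seconds
flags = [detector.record("203.0.113.7", i * 0.2) for i in range(300)]
print(flags[0], flags[-1])  # False True
```

A human browsing normally never trips the limit, while a scraper hammering product pages does within seconds.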
To detect and fight the bots, Akamai Technologies is doing something cool. Instead of trying to figure out the bots’ behaviour, its algorithms learn human behaviour and let those users through. This is how it works: when you tap a button on your phone, you move the phone ever so slightly. That movement can be detected by the phone’s accelerometer and gyroscope and sent to Akamai’s servers. The presence of minute movement data is a clue that the user is human, and its absence is a clue that the user might be a bot. Although there is no way around this right now, it’s only a matter of time before another round of innovations. So goes the internet bot arms race.
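To make the idea concrete, here is a heavily simplified sketch of that heuristic, entirely my own assumption and not Akamai’s actual algorithm: real taps produce tiny, noisy accelerometer readings, while a bot either sends no sensor data or sends suspiciously flat values.

```python
def looks_human(accel_samples: list[float], threshold: float = 1e-4) -> bool:
    """Toy sensor heuristic: real taps jiggle the phone, so accelerometer
    readings around a tap should show small but nonzero variance."""
    if not accel_samples:
        return False  # no sensor data at all: suspicious
    mean = sum(accel_samples) / len(accel_samples)
    variance = sum((x - mean) ** 2 for x in accel_samples) / len(accel_samples)
    return variance > threshold

print(looks_human([9.81, 9.83, 9.79, 9.85]))  # jittery readings: True
print(looks_human([9.81, 9.81, 9.81, 9.81]))  # perfectly flat: False
print(looks_human([]))                        # no data: False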
There are some companies that scrape their own sites. For example, if a company has two inventories, say one for the warehouse and the other for e-commerce, then the chances of them falling out of sync are very high. Yes, integrating the databases is the most robust option here. But scraping has proven to be faster and, more importantly, cost-effective.
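Once the company has scraped its own e-commerce site, spotting the drift is a simple diff against the warehouse records. The `find_mismatches` helper and the SKU data below are hypothetical:

```python
def find_mismatches(warehouse: dict, scraped_site: dict) -> dict:
    """Compare warehouse prices with prices scraped off the company's own
    e-commerce site; return SKUs where the two disagree."""
    return {
        sku: (warehouse[sku], scraped_site.get(sku))
        for sku in warehouse
        if warehouse[sku] != scraped_site.get(sku)
    }

warehouse = {"SKU-1": 19.99, "SKU-2": 5.49, "SKU-3": 12.00}
site      = {"SKU-1": 19.99, "SKU-2": 5.99, "SKU-3": 12.00}
print(find_mismatches(warehouse, site))  # {'SKU-2': (5.49, 5.99)}
```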
Other scrapers live in a grey area. The airline industry is the best example I can give. Travel price-comparison sites can send business to airlines, and airlines want their flights to show up in those sites’ search results. But when such a site looks up flight information, the airline sometimes must pay a fee to the booking system. Those fees can add up if a large number of bots are constantly checking an airline’s seat and pricing information. Yeah, we’ve all been there!
So to summarise, this so-called secret war on data is real. And scraping data is where the real competition happens. Very few people are aware of this, and I hope this post is a start to help them understand what scraping is and how it works!