Scraping data from sites isn’t as easy as it used to be. Site and server owners are waging a war on web scrapers, trying to stop your scraping efforts every step of the way. From setting request rate limitations to making your bot solve reCAPTCHAs, the anti-bot strategies are getting better every day.
Luckily, our side of the battlefield is continuously expanding its arsenal as well, finding new ways to keep our bots from being detected so we can continue our data scraping efforts.
But one countermeasure that’s often overlooked is adding the right HTTP headers to your scraper. In this short article, you’ll learn what HTTP headers are, why they matter for web scraping, and how you can use them in your own scraper.
What are HTTP headers?
Each Hypertext Transfer Protocol (HTTP) message, both request and response, contains a header section.
This header section contains several HTTP header fields that help identify and add more detail to the message. Header fields come right after the first line of the message: the request line in a request, or the status line in a response.
There are dozens of different header field names like Accept, Cookie, and User-Agent, each with its own descriptive function. You can find a full list of the registered HTTP headers in the registry maintained by the Internet Assigned Numbers Authority (IANA).
When you send an HTTP request to a server, you can add header fields to identify yourself to the server, which in turn helps the server serve you the data you are hoping to gather through your request.
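To make the layout concrete, here is a minimal sketch that assembles the text of a request by hand, so you can see where the header section sits. The host, path, and header values are illustrative placeholders, and build_request is a hypothetical helper rather than part of any library.

```python
# Assemble the raw text of a minimal HTTP/1.1 request to show where headers go.
# The host, path, and header values are placeholders chosen for illustration.
request_line = "GET /products HTTP/1.1"
headers = {
    "Host": "example.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
    "Accept": "text/html",
}


def build_request(request_line, headers):
    """Return the request text: request line, one line per header field, blank line."""
    lines = [request_line]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Lines are separated by CRLF; an empty line marks the end of the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


raw = build_request(request_line, headers)
print(raw)
```

In practice an HTTP library writes this text for you; the point is only that each header field is a simple "Name: value" line following the request line.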
You can see how these HTTP headers are useful tools to help streamline the communication between different devices. Now let’s see how this is all linked to web scraping.
Why are HTTP headers important for web scraping?
Does your web scraper constantly get blocked while trying to request data from target servers? The problem might be in your use of (or, in many cases, lack of) HTTP headers.
You see, these headers tell the server more about who is sending the request. For example, one header field is User-Agent, which identifies the software making the request, typically the application (such as a browser) along with its version and operating system.
Each browser (like Chrome or Safari) sends its own User-Agent string. Based on the browser a request appears to come from, the server may respond with a particular version of the website. But what if the User-Agent field is left empty?
Then the server can’t identify the client sending the request, which means it won’t know what version to serve. Most servers will respond to this in one of two ways:
- The server shows a default version of the page, which might not be the version we were hoping to receive.
- The server automatically blocks all traffic that doesn’t specify the User-Agent.
The first option might be a bit annoying, but it generally won’t stop you from scraping. The second option, however, means your bot and its IP address will get blocked from the site, ending your scraping efforts.
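The two outcomes described above can be sketched from the server’s point of view. This is a simplified, hypothetical policy, not how any particular site works: the respond function, the 403 status for blocked requests, and the "Mobile" check are all assumptions made for illustration.

```python
def respond(request_headers, block_missing_user_agent=True):
    """Return (status_code, page_variant) for a request, under an assumed policy."""
    # HTTP header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in request_headers.items()}
    user_agent = normalized.get("user-agent", "").strip()

    if not user_agent:
        if block_missing_user_agent:
            # Second outcome: refuse traffic that does not identify itself.
            return 403, None
        # First outcome: fall back to a default version of the page.
        return 200, "default"

    # Illustrative only: choose a page variant based on the User-Agent string.
    if "Mobile" in user_agent:
        return 200, "mobile"
    return 200, "desktop"


firefox = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
print(respond({}))
print(respond({}, block_missing_user_agent=False))
print(respond({"User-Agent": firefox}))
```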
That’s just one example of a missing HTTP header causing problems. And with dozens of potential different header fields to fill in, you can see the importance of setting HTTP headers for your web scraper.
Using HTTP headers for web scraping
Maybe you’re looking for a way to scrape your competitors’ websites, or hoping to find an alternative to the Google Autocomplete API, which was deprecated back in 2015. (Google Autocomplete API services still exist, however; you can learn more by visiting this page.)
Whatever your reasons for scraping, you need a web scraper that can go about its business undetected by Google and other sites. And without the correct use of HTTP headers, that’s simply not going to happen.
So how do you do it?
Well, it’s actually not that difficult. You can easily find example strings for almost any HTTP request header field with a quick search in your search engine of choice. Here’s how.
Let’s stick to the User-Agent header field. Say you want your web scraper to pretend like it’s sending a request via the browser Mozilla Firefox. Just type “Firefox user-agent string” into Google Search and before you know it you’ve found your answer, which looks like this:
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
Next, you simply add a header section to your web scraper’s script including a header field titled “User-Agent” followed by this string.
Now, when you send a request to a server, the server will see your User-Agent is specified as Mozilla Firefox and it will serve you the requested information as it would do to any other Mozilla Firefox user.
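As a minimal sketch, this is what that could look like using urllib.request from Python’s standard library (your scraper may well use a different language or HTTP client). The URL is a placeholder, and the request is only built and inspected here, not sent.

```python
import urllib.request

# The Firefox User-Agent string from the example above.
FIREFOX_UA = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"

# If you set nothing, urllib identifies itself with a library default
# of the form "Python-urllib/3.x", which is easy for a server to spot.
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]
print("Default:", default_ua)

# https://example.com/ is a placeholder target, not a real scraping endpoint.
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": FIREFOX_UA},
)

# urllib stores header names capitalized as "User-agent" internally.
print("Custom: ", req.get_header("User-agent"))

# Sending the request requires network access, so it is left commented out:
# with urllib.request.urlopen(req, timeout=10) as response:
#     html = response.read().decode("utf-8", errors="replace")
```

A header set on the Request object takes precedence over the library default, so the server sees the Firefox string instead of "Python-urllib".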
And that’s it!
Of course, User-Agent is only one of many different header fields. Adding realistic values to the other relevant header fields as well will further increase the chances of your bot browsing and scraping away undetected.
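Here is a sketch of how several header fields might be bundled together, again using Python’s urllib.request. The header values are illustrative examples of what a browser could send, not a guaranteed recipe for avoiding detection, and make_request and the URLs are hypothetical names introduced for this example.

```python
import urllib.request

# Illustrative browser-like values; adjust them to match the browser you imitate.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def make_request(url, extra_headers=None):
    """Build a request carrying the shared headers plus any per-request extras."""
    headers = dict(BROWSER_HEADERS)      # copy so the shared defaults stay untouched
    headers.update(extra_headers or {})  # per-request values override the defaults
    return urllib.request.Request(url, headers=headers)


# Placeholder URLs; the request is built and inspected here, not sent.
req = make_request("https://example.com/", {"Referer": "https://example.com/start"})
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Keeping the shared headers in one place makes it easier to keep them consistent with each other, for example so that the Accept value matches what the browser named in User-Agent would plausibly send.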