Web Scraping and the Art of War: 5 Tools That Will Help Your Bot Win



This content originally appeared on Level Up Coding - Medium and was authored by Web Data Central

Terracotta warriors courtesy of MrDm (freepik)

One of Sun Tzu’s most quotable passages from the Art of War is —

“If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle.”

Web scraping is now common practice, and every team that manages a website wants to prevent its data from being harvested. Anti-scraping technologies help them do exactly that.

Make no mistake: as a web scraper or data analyst, you are at war, and as Sun Tzu advised, the path to victory lies in knowing your enemy as well as you know yourself. That is why we compiled this guide: to educate people who need public web data about some of the more aggressive tactics they will face on their quest to extract it.

But before we get into the trenches, let’s start with some basics — the most common anti-scraping technologies are usually solved by using a proxy server.

Image courtesy of Shutterstock

Defeating blocking issues on websites with proxies

A simple example: if you regularly use the same IP address to target a particular website, the target site may block that IP address.

To avoid this, web scraping best practice is to use a proxy server that hides your original IP address. Many proxy providers are available, and you can even find free ones, but those rarely work well on the websites that actually hold valuable data.

Using a proxy IP alone is not enough, though. On most websites, you also have to make sure you are not sending too many requests from a single proxy IP address. When scraping public pages that require no log-in (such as Amazon products or LinkedIn profiles), it is always better to rotate your requests across multiple proxy IP addresses, using a high number of short-lived sessions. You will still occasionally get blocked, but that will not stop you from getting the data eventually.
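As a rough illustration, the rotation pattern described above can be sketched in Python with only the standard library; the proxy addresses and credentials below are placeholders you would replace with ones from your proxy provider:

```python
import random
import urllib.request

# Hypothetical proxy pool; replace with real addresses from your provider.
PROXY_POOL = [
    "http://user:pass@198.51.100.10:8080",
    "http://user:pass@198.51.100.11:8080",
    "http://user:pass@198.51.100.12:8080",
]

def fetch(url, retries=3):
    """Try the URL through a different, randomly chosen proxy on each
    attempt, so repeated requests never hammer one IP address."""
    for _ in range(retries):
        proxy = random.choice(PROXY_POOL)
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        try:
            with opener.open(url, timeout=10) as resp:
                return resp.read()
        except OSError:
            continue  # this proxy failed or was blocked; rotate to another
    raise RuntimeError(f"all {retries} attempts failed for {url}")
```

A proxy-manager tool or unblocking API handles the same rotation for you, but the core idea is no more than this loop.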

The type of proxy network also plays a role. On simple websites, rotating data-center proxy IPs is enough to win the battle, while more advanced eCommerce and social media websites identify these data-center IPs and block them, requiring the use of more expensive residential proxy IPs (IPs of real user devices, or acquired from an ISP).

The last, and most common, issue is a website that blocks entire IP pools from a geographic region, usually by means of a firewall. The solution, as you probably guessed, is to use a proxy server IP from a different country.

As we said, all of the above is pretty basic, and most web scrapers master these best practices within their first few months, either by integrating their scraper code with a proxy manager or by sending requests through a proxy service API. The real challenge begins when you want to collect data at scale, which means sending thousands or even millions of requests to a website, or when you face a website that uses cutting-edge anti-bot measures.

The image is taken from the film “Five Deadly Venoms” (1978) courtesy of kungfukingdom

Bot detection technology

So, without further ado, here are the main themes, along with some specific examples, of the most advanced anti-bot measures and detection tools used by websites that are frequent targets for scraping:

1 — Browser fingerprinting technology

Fingerprinting a user’s browser is an extremely effective approach to identifying unique browsers and tracking internet usage. Originally, browser fingerprinting was used to collect data that helps build a user profile for marketing purposes. Coincidentally, by collecting information about your browser type and version, operating system, plugins, time zone, language, screen resolution, and various other active settings, these tools can also determine whether the user is a real person or an automated bot.

While many consumers use plugins that aim to conceal their unique browser fingerprints in order to prevent remarketing campaigns and targeted advertising, smart web scrapers do the exact opposite: they make sure their browser’s User-Agent request header appears as unique as possible, and if necessary they rotate those strings every few requests so the website doesn’t detect that some generic User-Agent is being used over and over again.

Here is an example of what a unique User-Agent string looks like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0
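Putting the rotation idea into practice takes only a few lines of Python. In this sketch, the User-Agent strings are illustrative examples of real browser headers; in a live project you would keep the pool in sync with current browser release versions:

```python
import random

# Illustrative pool of realistic desktop User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) "
    "Gecko/20100101 Firefox/98.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.5 Safari/605.1.15",
]

def build_headers():
    """Return request headers with a freshly rotated User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Call `build_headers()` every few requests and pass the result as your HTTP headers, so no single User-Agent string dominates your traffic.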

2 — Javascript-based detection

Websites also deploy open-source JavaScript libraries such as the popular FingerprintJS and its alternatives. These libraries employ several layers of bot detection, including:

  • Automation detectors
  • Search engine detector
  • Browser spoofing
  • VM detection

Each of these detectors contributes to a bot-probability score for every visitor, and based on that score the website manager can create rules on when to block or apply other preventive techniques.
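To make the scoring idea concrete, here is a hedged Python sketch of how a site operator might combine detector signals into a bot-probability score and act on it; the detector names, weights, and thresholds are all hypothetical:

```python
# Hypothetical thresholds a site operator might tune.
THRESHOLD_BLOCK = 0.9      # near-certain bot: refuse the request
THRESHOLD_CHALLENGE = 0.6  # suspicious: serve a CAPTCHA instead

def bot_score(signals):
    """Weighted average of per-detector probabilities (each 0.0-1.0)."""
    weights = {
        "automation": 0.4,        # headless/automation framework detected
        "search_engine": 0.1,     # claims to be a crawler like Googlebot
        "browser_spoofing": 0.3,  # reported browser contradicts its behavior
        "vm": 0.2,                # runs inside a virtual machine
    }
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())

def decide(signals):
    """Map a visitor's signals to an action: allow, captcha, or block."""
    score = bot_score(signals)
    if score >= THRESHOLD_BLOCK:
        return "block"
    if score >= THRESHOLD_CHALLENGE:
        return "captcha"
    return "allow"
```

A visitor tripping every detector would be blocked outright, while one tripping only a couple of them might just be challenged with a CAPTCHA, which is exactly the kind of rule flexibility these scores give the website manager.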

3 — Simple custom-made detection techniques

These include CAPTCHAs and other standard mechanisms, used primarily on smaller websites. They are usually less attractive for large websites that generate a lot of sales revenue, because they often block casual human users, reducing the potential buyer traffic on the site.

4 — Advanced custom-made detection techniques

Sophisticated bots require sophisticated bot detection. Advanced behavioral analysis allows organizations not only to separate human traffic from bot traffic but also to be better prepared for the next generation of bots. These techniques rely heavily on reverse engineering, with anti-bot research teams examining hundreds of web scraping algorithms.

A few examples of leading anti-bot service providers are:

5 tools to help you bypass bot detection

Sun Tzu also wrote —

“The whole secret lies in confusing the enemy so that he cannot fathom our real intent.”

So, now that we’ve covered how the simple challenges can be met with proxies, here are some suggestions we gathered from battle-tested web scraping generals on how to deal with each of the advanced anti-bot tools above. You can even approach some of them yourself on this Discord server. As general advice for planning your next campaign, remember this: as long as your bot appears to be a casual user (or users), the website’s anti-bot algorithms will not affect you.

1 — Bot detection tool detection

Funny title, isn’t it? First off, the above-mentioned Discord server offers its members a cool, free tool called Botty Mcbotface that will tell you which advanced bot-detection tools are used on the website you’re targeting. See the example response below:

The screenshot was taken from Scraping Enthusiasts Discord Server

This type of field intelligence is very useful when planning your strategy for scraping a given website.

2 — Browser fingerprint testing

There are plenty of websites with free and paid fingerprint-testing tools that will help you understand how your bot appears to websites that use browser fingerprinting. You should run these tests frequently to prevent issues on new scraping projects.

3 — Stealth browsers

These so-called anonymous or private browsers offer built-in proxies for IP hiding as well as fingerprint management and rotation. They are often used for account automation but also for web scraping. Some of the leading browsers are:

Though very useful in some cases, the downside of many stealth browsers available online is the serious security concern that shrouds them, as many contain potential malware.

4 — Evasion Libraries

You can find great open-source libraries for various programming languages, especially on GitHub. Using them requires some coding skills, of course, but most come with easy-to-use guides for integrating them with your existing scrapers. A few we tested and found useful:

Image courtesy of shai_halud (freepik)

5 — CAPTCHA solvers

Even the best web scrapers run into a CAPTCHA sometimes. To make sure this doesn’t disrupt your scraping, there are several kinds of CAPTCHA-solving services, both human-based and machine-learning-based:
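Whether a human or a model does the solving, most of these services follow the same submit-then-poll pattern: you hand over the challenge, then poll until a solution token comes back. Here is a generic Python sketch of that pattern; the `submit` and `poll` callables are hypothetical stand-ins for whatever API your chosen provider exposes:

```python
import time

def solve_captcha(submit, poll, interval=5.0, timeout=120.0):
    """Generic polling loop used by most CAPTCHA-solving services.

    `submit` sends the challenge and returns a task id;
    `poll(task_id)` returns the solution token once ready, else None.
    """
    task_id = submit()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        token = poll(task_id)
        if token is not None:
            return token  # paste this token into the target page's form
        time.sleep(interval)  # solving takes seconds to minutes
    raise TimeoutError(f"CAPTCHA task {task_id} not solved in time")
```

Because the loop only depends on the two callables, you can swap providers without touching the rest of your scraper.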

6 — Web Scraping APIs***

These are the most comprehensive unblocking tools; they incorporate many of the capabilities above and bring you closer to the hardest-to-scrape web data.

There are several alternatives on the market, which makes it tough to choose the ones that best suit your requirements. However, by testing and benchmarking these tools, and by researching which types of companies use them (from SMBs to Fortune 500 enterprises), we compiled a ranking:

Image courtesy of Bright Data

Web scraping can be a tedious and time-consuming task, but it is worth it if you need the web data. By following the best practices and tactics mentioned above, you can avoid getting your IP blocked and make your web scraping process much smoother. We’d love to hear what other web scraping best practices you follow and to learn about new tools we haven’t tried yet.

*** Some of you might have noticed that the title reads “5 tools to help you…” while the list goes up to 6. That is because web scraping APIs are usually a combination of the previous tools.





