Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, I'm Fabien Vauchelles.
I've been deeply passionate about web scraping for years.
My enthusiasm led me to explore the fascinating world of proxies and anti-bot systems.
I'm also the creator of Scrapoxy.
Scrapoxy is a free and open source proxy manager.
It allows you to manage and route traffic through cloud
providers and proxy services.
It supports major cloud providers such as AWS, Azure, and GCP, as well as proxy vendors like Bright Data, Rayobyte, IPRoyal, and many others.
Since I released version 4 in February 2024, over 650 users have installed Scrapoxy.
They have exchanged one petabyte of data, sent 55 billion requests, and used 6 million proxies.
Scrapoxy is written in Node.js, so it's fully JavaScript, plus Angular for the UI.
But before we dive into our discussion, I'd like to share with you a little story.
Enter Isabella.
Isabella is a brilliant student at an IT school.
She has a lot of energy and a thirst for traveling.
Every year, she embarks on a one month backpacking journey to a random country.
But here is the twist: she plans every detail of the trip herself, and this level of preparation consumes an entire year for just one month of traveling.
Isabella couldn't help but notice there is a gap in the market.
Why wasn't there such a tool in a digital era powered by AI?
This could be her ticket to a successful business.
She realized she needed vast amounts of data.
This data would be used to train a large language model to curate her ultimate trip.
And Isabella is very careful in her approach to business.
Before she starts scraping data, she makes sure to consider all the legal aspects.
She knows it's important not to overwhelm the website by making
too many requests too quickly.
She also respects privacy.
She only collects information that is already public, like reviews, and doesn't take any personal details, like names.
She also doesn't accept any terms and conditions or EULAs, so she isn't bound by any contract.
So now that everything is clear, she is ready to collect the data.
So let me introduce you to the website she chose to scrape: trekky-reviews.com. Let's open the website.
So what's Trekky Reviews all about?
Trekky Reviews is your go-to spot for finding accommodations in any city that you want to visit.
You just have to click on search and you will get accommodations.
Imagine that Isabella lives in Paris: she just clicks on search and she gets 50 accommodations here.
And if she clicks on one accommodation, she gets the name, the contact, the location, the description, and also the reviews.
Isabella is interested in the reviews.
Her goal is to analyze these reviews to extract the overall sentiment about each accommodation.
And now, with large language models, we can tell whether it's a bad or a good accommodation.
Also, something very important.
The website is super secure.
I've put different levels of protection on the website, and Isabella needs to bypass them one by one during this session.
So let's get back to the home page, on level 1.
I will open the Chrome Inspector to show you the structure of the website.
If I click on Preserve log and filter on Doc, to avoid other document reloads, you can see we start with the home page.
So we are on level one.
If I look at the accommodations request, I get all the accommodations in the city of Paris.
And if I click on one accommodation, I get the ID of the accommodation, and if I click on that, in the response, I get all the information inside the HTML.
And if I scroll down, I get the description here, and also the reviews.
That's what Isabella wants to extract.
But clearly Isabella doesn't want to do everything manually.
She doesn't want to make HTTP requests, handle retries, manage errors, and also do the CSS extraction herself.
She wants a framework to manage everything.
And to do that, she will use one of the most used frameworks in
the web scraping industry, Scrapy.
Let me show you Scrapy.
What is Scrapy?
It's a Python framework maintained by a large open source community, and it can handle HTML parsing, retries, errors, concurrency, everything.
How does Scrapy work?
Scrapy runs spiders.
A spider is responsible for all the logic of collecting data from a website.
So we just give the Python class a name, here TrekkyReviewsSpider, and a spider name, trekky-reviews, and we have two methods: start_requests and parse.
The start_requests method is used to define the URLs to collect.
Then, when we get a response, the Scrapy engine calls the parse method with the response, and we just need to extract the data with CSS or XPath selectors, and we can queue new URLs to scrape, request new content, etc.
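To make that concrete, here is a minimal sketch of such a spider. The class name, URL, and CSS selector are only placeholders standing in for the demo site, not the exact ones used in the session.

```python
import scrapy


class TrekkyReviewsSpider(scrapy.Spider):
    # The spider name is what you pass to `scrapy crawl`
    name = "trekky-reviews"

    def start_requests(self):
        # Define the URLs to collect; Scrapy calls parse() with each response
        yield scrapy.Request("https://trekky-reviews.com/level1", callback=self.parse)

    def parse(self, response):
        # Extract data with CSS or XPath selectors...
        for href in response.css("a.accommodation::attr(href)").getall():
            # ...and queue new URLs to keep scraping
            yield response.follow(href, callback=self.parse)
```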
So let's jump to the real spider.
Here is basically the same kind of spider.
We will do all the pagination of the website to collect the 50 items, but instead of going to the home page once and then to page one, page two, page three, every page's navigation will start from the home page.
That's what I'm doing here with the start URL on level 1.
I will request the home page 10 times, and after that we go to page 1, page 2, etc.
So that's what we are doing here: we go to each page for the city of Paris.
And when we get the list of accommodations, we request each accommodation.
And when we get the accommodation's response, we extract the name, email, and reviews with CSS selectors.
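Put together, the spider's logic looks roughly like this. It's a sketch: the selectors, URL patterns, and field names are assumptions standing in for the real ones shown on screen.

```python
import scrapy


class TrekkyReviewsSpider(scrapy.Spider):
    name = "trekky-reviews"

    def start_requests(self):
        # Request the home page once per results page, so every page's
        # navigation starts from the home page.
        for page in range(1, 11):
            yield scrapy.Request(
                "https://trekky-reviews.com/level1",
                callback=self.parse_home,
                cb_kwargs={"page": page},
                dont_filter=True,  # allow requesting the same URL 10 times
            )

    def parse_home(self, response, page):
        # Then navigate to the given results page for the city of Paris
        yield response.follow(f"?city=paris&page={page}", callback=self.parse_list)

    def parse_list(self, response):
        # Request each accommodation found in the listing
        for href in response.css("a.accommodation::attr(href)").getall():
            yield response.follow(href, callback=self.parse_accommodation)

    def parse_accommodation(self, response):
        # Extract the fields with CSS selectors
        yield {
            "name": response.css("h1::text").get(),
            "email": response.css(".email::text").get(),
            "reviews": response.css(".review p::text").getall(),
        }
```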
Now what we can do is run the spider.
Let's run it here.
It's very quick, and I've got the 50 items.
And the cool thing about Scrapy is that you can extract the data in a structured way.
You've got everything here: name, email, reviews.
Now let's go back to the spider and move to level two.
So now, if I start the spider again, I get another error, which is "unknown browser".
It's a very common error, and it happens when a client connects to a server, so a browser or a spider connecting to a web server.
The client sends a lot of information in the HTTP headers, saying who we are.
We are saying that we are Scrapy, and we are also sending the version, and the web server says: hey, wait, it's a scraper, I don't want you to collect data from my website.
The website blocked me.
I need to change this value.
But which value should I use?
I will use the same one as Chrome.
Let's get back here.
In the headers panel, I've got the response and request headers.
And at the end, I've got the User-Agent, so I will use the same one.
I can do that here, add the user agent, and Copilot will autocomplete it for me.
Oh, thanks Copilot!
You say that we are on Windows?
Why not?
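In Scrapy, that can be done by overriding the USER_AGENT setting for this spider; a sketch, where the Chrome version string is just an example (copy the exact value your own browser sends):

```python
class TrekkyReviewsSpider(scrapy.Spider):
    name = "trekky-reviews"

    # Reuse the same User-Agent as a real Chrome browser (value copied from
    # the request headers in the Network tab; yours will differ).
    custom_settings = {
        "USER_AGENT": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
    }
```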
Let's run that again.
Now, I've just bypassed the user agent issue, but I've got another one, which is: the sec-ch-ua header is missing.
So I need to understand what it is.
Let's get back to the request headers.
As you see, I've got the user agent, but I've also got other, additional headers which confirm the request.
It's additional security to confirm it, and it's required by the website.
So if I'm missing these headers, clearly the website will block me.
So I need to add them.
Let's do that: I go to the headers and add sec-ch-ua, sec-ch-ua-mobile, and sec-ch-ua-platform.
We are saying that we are on Windows, because the user agent says we are on Windows.
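A sketch of what that looks like in the spider settings; the client-hint values below are examples and should be copied from your own browser so they stay consistent with the User-Agent:

```python
class TrekkyReviewsSpider(scrapy.Spider):
    name = "trekky-reviews"

    custom_settings = {
        # Client-hint headers that a real Chrome on Windows sends alongside
        # the User-Agent; values copied from the Network tab.
        "DEFAULT_REQUEST_HEADERS": {
            "sec-ch-ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": '"Windows"',
        },
    }
```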
Okay, so let's run that again.
So now, this time, I've bypassed the protections and I've got my 50 items.
That's perfect.
So now let's imagine Isabella has more serious troubles.
Let's jump to level three.
As you can see, I quickly collect a few items, but then I'm blocked with HTTP error 429, which is Too Many Requests.
We are sending too many requests from my laptop.
So I need multiple IP addresses.
It's the same if you're using a server: all requests will come from that single IP address.
So I need to introduce one new concept, which is a proxy.
What is a proxy?
A proxy is a system running on the internet.
It relays requests to web servers, and the server receives the requests from the proxy.
So instead of seeing one IP address sending millions of requests, the server sees a lot of IP addresses each sending a few requests.
With this technique you can bypass a lot of rate-limit protections.
But of course there are different types of proxies.
The first type is the datacenter proxy.
What's a datacenter proxy?
It's a proxy running on AWS, Azure, or GCP.
It's the first serious type of proxy that you can find on the internet.
It's fast, cheap, and reliable.
However, they can be easily identified by anti-bot solutions.
Let me explain.
The IP of the proxy belongs to an IP range, and this IP range is associated with an autonomous system number, an ASN, and the name of the autonomous system can be AMAZON-02, Microsoft, Google, whatever.
With this kind of association, available in IP databases, anti-bot systems can detect the autonomous system number.
And clearly they can block traffic coming from this type of IP address.
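You can check this mapping yourself. For example, with the ipwhois package (one option among many; treat the IP below as a placeholder), an RDAP lookup returns the ASN and its name, which is exactly the kind of information anti-bot IP databases store:

```python
from ipwhois import IPWhois

# Resolve an IP address (a placeholder cloud address here) to its autonomous
# system; anti-bot vendors keep the same kind of mapping in their databases.
result = IPWhois("3.248.0.1").lookup_rdap()
print(result["asn"], result["asn_description"])
# e.g. 16509 AMAZON-02 - Amazon.com, Inc., US
```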
But there is a trick to get around it.
And this trick is known as the ISP proxy, the Internet Service Provider proxy.
So let's talk a little bit about ISP proxies.
These proxies are set up in datacenters, but they don't use IP addresses from the datacenters.
Instead, they rent IP addresses from clean autonomous systems like mobile carriers or internet box providers such as AT&T or Verizon.
They get a bunch of IP addresses for a lot of money, and the proxy will use one of those.
This means that when you're using the proxy, your activity gets mixed in with the carrier's regular IP addresses, keeping you hidden.
And there is a last type of proxy: the residential proxy.
The IP comes from a real device, which can be a laptop or a mobile phone.
How does it work?
When a developer wants to earn money from their application, they have three solutions.
The first solution is to sell subscriptions, like a monthly or annual subscription which unlocks features.
Second, they can add advertising, like an ad at the bottom of the application or a video to watch before unlocking features.
And the last solution is to share the bandwidth of the device, of course with the user's agreement.
And that's where residential proxy networks come from.
This type of proxy is very powerful, because the IP address is the same IP as a real user's.
And there are millions of endpoints available.
So now we will use Scrapoxy, the super proxy aggregator, to manage our proxy strategy.
So let me show you how we can do that.
What we can do is start Scrapoxy with Docker, and in a second I will have Scrapoxy up and ready.
So I just need to log in to Scrapoxy now.
We go to localhost, and I'm already logged in, so I've got one project ready.
You can see that I've got one project here, but there is no connector yet.
I need to add connectors.
Scrapoxy offers a lot of connectors: datacenter connectors, proxy vendor connectors with ISP proxies or residential proxies, also proxy lists or free proxies on the internet, and you can even use hardware.
So let's imagine that you want to use AWS proxies.
You just click on create, and you add your access key and secret key.
Just check the Scrapoxy documentation to get all this information; in two minutes you are up and ready.
I already have my AWS credentials, so we will start a connector.
It's very quick, as you will see.
I just click here.
Let's also jump to the AWS console.
You can see that I don't have any instances, but if I wait two seconds here and refresh, I've already got ten instances up and ready.
Scrapoxy used the AWS API to start all these instances.
And if I go back to Scrapoxy here, on the proxy list, you can see all the proxies.
I've got a lot of information on my 10 instances: the traffic sent and received, the number of requests, and also the IP address.
And I've also got a very cool piece of information, which is the geolocation.
All the IPs are based in the Dublin datacenter here.
I can confirm that with the coverage map.
So everything is in Ireland.
That's perfect.
So now we need to plug Scrapoxy into Scrapy.
That's very easy.
Scrapoxy can handle multiple projects, and in each project you can have a different strategy: strategies by geolocation, strategies by type of proxy, datacenter, ISP, residential, whatever.
So you can set up multiple projects.
And each project has a username and password.
So let's copy and paste this information and plug it into Scrapy.
What I will do is just copy and paste this information and show it.
I will update my middleware list.
So now, instead of directly sending an HTTP request, Scrapy will use a middleware, and this middleware will say: okay, please use Scrapoxy, and here are the credentials for Scrapoxy.
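In its simplest form, the wiring can be sketched like this: Scrapoxy exposes a local proxy endpoint (port 8888 by default), and the project's username and password are passed as proxy credentials. Scrapoxy also ships a dedicated Scrapy middleware; check its documentation for the exact settings, this is just the generic approach.

```python
import scrapy


class TrekkyReviewsSpider(scrapy.Spider):
    name = "trekky-reviews"

    def start_requests(self):
        yield scrapy.Request(
            "https://trekky-reviews.com/level3",
            # Scrapy's built-in HttpProxyMiddleware reads this meta key and
            # routes the request through Scrapoxy with the project credentials.
            meta={"proxy": "http://USERNAME:PASSWORD@localhost:8888"},
        )
```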
So now, if I run the spider again, everything is routed through the AWS instances and Scrapoxy, and I've got my 50 items.
That's cool!
Let's have a look at the proxy list.
Scrapoxy sent all the requests in a round-robin fashion, and I've got a 100 percent success rate.
That's perfect.
Let's get back to our spider to help Isabella.
I jump to level 4 and run the spider again.
This time, I've got an issue: the anti-bot system just detected that the IP address comes from a datacenter.
That's what I explained to you.
So what I can do is use another type of IP address.
Here we have the detected ASN, which is AMAZON-02.
So let's switch to a more advanced proxy.
For that, we will use Smartproxy, which is one of the biggest proxy providers in the world.
They have ISP proxies, residential proxies, even sneaky datacenter proxies.
So those are cool ones.
So I will add that here.
But first, what I will do is stop the instances on Amazon.
I just have to click here, and Scrapoxy orders Amazon to stop everything, because I don't want to pay for that.
And that's a cool feature when you have scraping sessions: Scrapoxy brings autoscale up and autoscale down.
It detects the sessions and just scales, so you only run instances when you are running scraping sessions, and you don't pay for anything else.
It's a huge cost saver.
So let's have a look here.
And yes, Scrapoxy is shutting down all the instances.
That's nice.
So let's get back to Scrapoxy.
I will add a connector in Scrapoxy using ISP proxies, with 10 proxies from the United States.
Perfect, and start it.
Let's jump directly to the coverage map and wait a little bit to get the instances.
And you can see that I don't have any AWS instances anymore; now I've got my proxies running in the US.
So all traffic will be routed there.
So now I run my spider. I didn't touch anything in my spider, I just modified the Scrapoxy configuration.
Everything is now routed through the US, and I will get all my 50 items.
That's cool.
So here, I've got 50 items.
Perfect.
Thank you very much.
So now let's jump to level six.
I will raise the shield.
Yeah.
So now we have very serious anti-bot systems, and you will quickly understand why.
I'm just connecting to the website, and at the first connection the website says: okay, you don't have a fingerprint, so I won't let you pass.
So what is a fingerprint?
So let's get back to the Trekky Reviews website and jump to level 6.
If I just remove the filtering here, yeah, I can see everything.
When I connect to a website as a browser, I download HTML, images, CSS, JavaScript.
So you have a lot of GET requests.
But why do I have POST requests?
A POST request means I'm sending information from my browser to the website, and I'm sending it at a regular interval, as you can see.
What am I sending?
Let's have a look at the payload.
Oh, I'm sending the platform, the time zone, and the real user agent.
So I cannot fake it anymore.
And all this information is gathered with JavaScript.
That's great.
So what the website is doing is executing JavaScript, collecting this information, and sending it to the web server with an AJAX request.
So I can no longer rely on plain HTTP requests.
Now I need real browsers.
So I need a framework to start the browsers and make all the requests.
And of course, I don't want to manage that myself.
I want Scrapy to manage this framework.
So let me introduce you to a cool framework called Playwright.
Playwright is an open source framework for end-to-end testing.
There is also Selenium, perhaps you know it, or Puppeteer, but Playwright is very well suited for use with Scrapy.
You can start Chrome, Firefox, Edge, or Safari, it's maintained by Microsoft, and of course you can execute JavaScript.
It's pretty cool.
So let me show you how you can adapt your code for Playwright.
I will open the same spider, just adapted for Playwright.
It's quite the same.
First, we do all the requests, so the 10 requests on the home page, then we do the pagination for the city of Paris; when we get a list of hotels, we gather the hotel information; and when we get the hotel responses, we extract names and emails.
So it's quite the same.
The only difference is how we collect the data.
Instead of using the default Scrapy download handlers doing HTTP requests, we're saying: okay Scrapy, this time you will use Playwright, and Playwright will handle that.
Internally, Playwright will open a browser and make all the requests.
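With the scrapy-playwright plugin, which is the usual way to let Scrapy drive Playwright, the wiring looks roughly like this:

```python
# settings.py: hand downloads over to Playwright instead of Scrapy's
# default HTTP download handlers.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Then, in the spider, each request that must go through a real browser is simply flagged:

```python
yield scrapy.Request(url, meta={"playwright": True})
```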
So if I run the spider here, not this one, but the Playwright one, what you will see is that Scrapy asks Playwright to open browsers, so 10 browsers, 10 sessions, as before.
It connects to the home page, gathers all the information, executes the JavaScript, and sends the payload; the website says: OK, you are executing JavaScript and I've got the payload, you can download the hotel information.
And that's what we are doing now: we are gathering every hotel's information.
As you can see, it's not so fast, because we are starting Chrome.
It can take a little bit of time, but we are executing JavaScript, and that's the cool part.
So you can see I've got my 50 items.
That's perfect.
But of course, anti-bots don't only check that you are executing JavaScript.
They check the fingerprint: the signals, the information, you are sending to the website.
Let's check that.
So that's what I want to show you on level 7.
I start the spider again here.
Scrapy asks Playwright to open the Chrome browser.
We open 10 sessions, connect to the home page, execute the JavaScript, and send the payload to the web server.
Now the anti-bot checks the consistency of the payload, and every time I try to collect accommodations, I clearly get a big error.
So you can see that I'm not collecting anything, and this error says: inconsistent timezones.
Yes, of course, we are sending information to the web server, and the anti-bot will use it.
Now it's checking that the time zones are consistent: the time zone of the browser, which is Europe/Paris, versus the US IP address we are using, which maps to America/Chicago.
We need to align them.
I could change the IP address, but I won't do that.
I will use an easier solution, which is changing the timezone of the browser.
So I can do that here.
It's very quick.
I just go here, set the timezone option, and say: okay, it's America/New_York.
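With scrapy-playwright, one way to do that is to configure the browser context; the options are passed straight to Playwright's new_context, so any of its options would work here:

```python
# settings.py: create the Playwright browser context with a timezone that
# matches the US proxies, so the JavaScript fingerprint stays consistent.
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "timezone_id": "America/New_York",
    },
}
```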
That's perfect.
So let's run the spider again.
Now, Scrapy asks Playwright to launch Chrome with this modified information.
We connect to the website, the browser gathers the information with the correct time zone, in the US, and sends it.
And as you can see, now I can collect hotels.
So yeah, the different hotels are being collected.
That's perfect.
I will stop there, because it can be very slow, but you understand the concept.
Perfect.
So now let me show you what Isabella can face on very serious websites.
These are protections that you can find on commercial websites and social platforms.
They are very advanced protections.
So let's jump to level eight.
On this level, as you see, I'm gathering information, HTML, CSS, images, JavaScript, and I'm also doing POST requests.
So I'm sending information.
But the URL is not the same; it looks rather encrypted.
So let's have a look at the payload this time.
Okay, I cannot understand the payload.
So the payload is encrypted.
And that's how anti-bot systems protect themselves: they don't want you to know which signals they are tracking, so you cannot emulate them.
Okay, that's fine.
So let's have a look at the source code which generates this payload.
If I open the sources, let's find it.
I've got level 8.
It should be in the JavaScript.
I've got Bootstrap: okay, that's not the anti-bot.
jQuery, no.
But it should be this one.
Okay, so let's have a look.
Oh, that's not nice source code.
There are a lot of variable declarations with encrypted strings, yeah, and I've also got functions with short, very cryptic names.
So I think the code is obfuscated.
Now, if I want to understand the signals, I need to deobfuscate this source code.
But before trying to do that, I want to explain to you how you can obfuscate source code.
We need to understand obfuscation before doing deobfuscation.
So let's look at some techniques of string obfuscation.
Let me simplify things.
As you can see, I've got strings, and these strings are concealed with an r function; the r function is base64 encoding.
I will apply the reverse function, and if I do that, reversing what is called string concealing, I will get the original strings.
But what I can do now is replace each constant by its value.
If I do that, doing constant unfolding, I will get the values inlined in the last lines.
But it's still not really understandable, because the strings are split.
So I need to join all the strings: I will undo the string splitting, unsplitting in fact.
And you can see that it's now a little bit readable.
Let's finish by replacing the bracket notation with the dot notation.
If you do that, undoing the bracket notation, you'll get the correct instructions.
And you can see these four lines are just window.screen.width: I'm gathering the size of the screen.
So that's how you do deobfuscation.
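The talk walks through these steps by hand; just to make the idea concrete, here is the same sequence sketched in Python on a made-up snippet (string concealing reversed with base64, constant unfolding, then bracket-to-dot notation). The real tooling shown next works on the actual JavaScript AST with Babel.

```python
import base64
import re

# Toy obfuscated snippet: string constants are base64-encoded ("concealed")
# and properties are accessed through bracket notation.
obfuscated = 'var a="d2luZG93";var b="c2NyZWVu";var c="d2lkdGg=";var x=this[atob(a)][atob(b)][atob(c)];'

# Step 1: reverse the string concealing by decoding the base64 constants.
constants = dict(re.findall(r'var (\w+)="([^"]+)";', obfuscated))
decoded = {name: base64.b64decode(value).decode() for name, value in constants.items()}

# Step 2: constant unfolding - replace each atob(name) call with its literal value.
code = obfuscated
for name, value in decoded.items():
    code = code.replace(f"atob({name})", f'"{value}"')

# Step 3: replace bracket notation with dot notation.
code = re.sub(r'\["(\w+)"\]', r".\1", code)

print(code)  # ... var x=this.window.screen.width;
```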
So let's jump to our source code.
I will copy and paste this code and put it in an empty file here.
Yeah.
You can see it's obfuscated code, as you may remember.
So I've got a script here, which is a deobfuscator, and it will do the same thing.
So I will use Babel.
Babel is a JavaScript transpiler, and it will do all these string unsplitting and constant unfolding operations.
Here I'm doing constant unfolding, et cetera.
So let's run this script.
If I run this tool, it will create a new file, which is the deobfuscated version.
Let's open it.
Okay, so now it's a little bit more readable.
Of course, the deobfuscator cannot recover the names of the functions, because that information was lost during the obfuscation process, but we can understand what the code is doing.
So here I've got the functions.
Here I've got an encryption function: it's doing RSA encryption.
And since it's asymmetric encryption, I need a key, so I'm assuming that this is the key.
Okay.
Now, what is this function doing?
Oh, it's the function which sends the request, the AJAX request; you can see we are doing a POST request.
So which signals are we sending?
Oh, it should be WebGL stuff.
Okay, so we are sending the vendor and the renderer of the GPU model.
That's perfect.
So we need to create a JSON payload with the vendor and renderer, stringify it, and encrypt it with RSA encryption using this key.
So let's copy and paste this key.
I will take this one.
Perfect.
So I've got a script to do that.
Okay.
It's kind of the same spider, yeah, the Trekky Reviews spider.
We go to the home page, you already know this part.
But when we are on the home page, we send the encrypted payload, the payload we just forged, to this URL.
We create this payload with the build_payload function.
And after that we do all the pagination requests and get all the information.
So let's get back to build_payload.
The build_payload function builds a payload with the signals.
We need the vendor: let's say it's Intel.
The renderer: let's say it's Intel 2.
We need the public key: I will copy and paste that one.
This information is stringified, encrypted with RSA encryption, and sent to the server.
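For reference, forging that payload in Python could look like the sketch below, using the cryptography package. The field values and padding scheme are assumptions based on what the deobfuscated code revealed; the padding in particular has to match whatever the site's JavaScript uses.

```python
import base64
import json

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import padding


def build_payload(public_key_pem: bytes) -> str:
    # Hypothetical signal values; the site checks the WebGL vendor and renderer.
    signals = {"vendor": "Intel", "renderer": "Intel 2"}

    # Load the public key copied from the deobfuscated JavaScript.
    public_key = serialization.load_pem_public_key(public_key_pem)

    # Stringify the payload, encrypt it with RSA, and base64-encode the result.
    ciphertext = public_key.encrypt(
        json.dumps(signals).encode(),
        padding.PKCS1v15(),  # assumption: PKCS#1 v1.5, commonly used by JS crypto libraries
    )
    return base64.b64encode(ciphertext).decode()
```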
So now, if I just run this encrypted-payload spider, you can see I quickly get my 50 items.
So doing deobfuscation can be a very cool way to bypass protections.
It's very interesting, but only in one case: when you need to make millions of requests very quickly.
Otherwise, it's better to use Playwright or more generic approaches, because deobfuscating source code can take a lot of time and a lot of energy.
That's the goal of anti-bots.
Thank you very much.
I hope that you enjoyed the session.
Please download Scrapoxy, and if you can, add a GitHub star; that helps me a lot.
Bye bye.