I had an interesting conversation at work today. We create a lot of integrations, so we recently started to create metadata for them. One piece of that metadata is whether we use an API or scraping to get the data.
But for some reason, we also had two additional possible types: inner api for scrapers that did not need a browser, and mix for scrapers where we collect some of the data with the browser and some without it.
This does not look like data we would want to expose to the outside at all.
The problems scrapers face are generally similar (different captchas, blocked user agents, MFA, frequent unannounced changes, ...), and quite different from the problems of integrations using official APIs. Considering this, I wanted to know why we were keeping track of the different kinds of scrapers this way. What is the use case?
But when I talked to the person maintaining that part of the code, it turned out that as long as we did not open a browser, it was not scraping for him. Faking the HTTP calls that the site itself would make, cookies and all, was the same to him as using an official API with documentation. This genuinely surprised me. Did that person not have enough experience debugging integrations like that yet?
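To make it concrete, this is the kind of call we are talking about. A hypothetical sketch (the URL, cookie, and headers are all made up): replaying the request a site's own front end makes to an undocumented JSON endpoint, dressed up to look like the browser the session cookie came from.

```python
import urllib.request

# Hypothetical example: replay the call the site's front end makes
# to an internal, undocumented endpoint. Nothing here comes from
# documentation -- it was all observed in the browser's dev tools.
req = urllib.request.Request(
    "https://example.com/internal/v2/orders.json",
    headers={
        # Pretend to be the browser the cookie belongs to.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Cookie": "session=abc123",
        "X-Requested-With": "XMLHttpRequest",
    },
)

# No contract, no stability guarantees: if the front end changes,
# this call breaks -- exactly like a browser-based scraper would.
print(req.get_header("Cookie"))
```

There is no documented contract behind that endpoint, which is exactly why it breaks the same way a browser scraper does.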
So I want to declare it here: even if the integration or script does not use a browser, it can still be a scraper. The difference is whether we are getting the data in the form end users consume it or in the form programs are meant to consume it. And it does not matter whether that happens through a browser or through raw HTTP calls.
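The classification I am arguing for can be sketched in a few lines. A minimal, hypothetical model (the names and fields are mine, not our actual metadata schema): whether something is a scraper depends only on whether it consumes an official, documented API, never on whether a browser is involved.

```python
from dataclasses import dataclass


@dataclass
class Integration:
    name: str
    uses_browser: bool
    official_api: bool  # a documented endpoint intended for programs


def is_scraper(integration: Integration) -> bool:
    # Browser or not: consuming data meant for end users is scraping.
    return not integration.official_api


browser_scraper = Integration("shop-ui", uses_browser=True, official_api=False)
http_scraper = Integration("shop-xhr", uses_browser=False, official_api=False)
api_client = Integration("shop-sdk", uses_browser=False, official_api=True)

print(is_scraper(browser_scraper))  # True
print(is_scraper(http_scraper))     # True -- no browser, still a scraper
print(is_scraper(api_client))       # False
```

Note that `uses_browser` never appears in `is_scraper` at all; it may still be useful operational metadata, but it should not drive the API-versus-scraping distinction.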