Web scraping is being accelerated by Artificial Intelligence (AI) | Orchestra (2024)

I’m Hugo Lu — I started my career working in M&A in London before moving to JUUL and falling into data engineering. After a brief stint back in finance, I headed up the Data function at London-based Fintech Codat. I’m now CEO at Orchestra, which is a unified control plane for data operations 🚀

Also check out our Substack ⭐️

Want to see how Orchestra is changing the game by delivering unparalleled cost savings and visibility? Try our Free Tier now.

Introduction

There are around 50 billion pages on the Internet. That’s 50,000,000,000 pieces of content. While somewhat daunting, to the trained Data Engineer this presents a treasure trove of information that can be used for all manner of use-cases.

This was facilitated in the past by web scraping tools such as Beautiful Soup (an HTML parser) and Selenium, a browser automation tool. Something many don’t appreciate, however, is that these tools are extremely old by internet standards. Both were released in 2004 and celebrate their 20th birthdays this year.

Since 2004, there have been multiple HTML parsers that have sought to replicate much of Beautiful Soup’s success. Among these Beautiful Soup alternatives are Scrapy, PyQuery and MechanicalSoup.

Despite the many alternatives, the process of collecting data in a systematic, efficient and repeatable fashion has remained a very tricky problem indeed. Artificial Intelligence, or “AI” as we now call it, offers hope to accelerate not only the development process, but also the adoption of web scraping use-cases in Data.

In this article, we’ll dive into the problem and take a look at how one might solve it using traditional methods, and some pains associated with them. We’ll see how AI-enabled web scraping tools alleviate some of these pains and how Data Engineers can leverage these to their advantage.

The problem of parsing HTML

Every web page is characterised by HTML code — HyperText Markup Language. For any page, you can generally right click on the screen and select “Inspect” to see what’s going down:


HTML code is made up of HTML elements. W3 explains these as:

An HTML element is defined by a start tag, some content, and an end tag:
<tagname> Content goes here… </tagname>

HTML Code necessarily has structure:

An HTML 4 document is composed of three parts:
a line containing HTML version information,
a declarative header section (delimited by the HEAD element),
a body, which contains the document’s actual content. The body may be implemented by the BODY element or the FRAMESET element.

Within the body, as people interested in extracting data, we want access to the elements. Furthermore, we want to retain the structure in terms of hierarchy. For example:

<html>
  <head>
    <title>e-commerce</title>
  </head>
  <body>
    <h1>Welcome!</h1>
    <p>My first paragraph</p>
    <div>Some stuff</div>
    <p>My second paragraph</p>
    <div>Some more stuff</div>
    <p>My third paragraph</p>
    <div>Even more stuff</div>
  </body>
</html>

From this, we might want:

{"elements":[ { "type" : "p1", "content" "My first paragraph", "children" : [ { "type" : "div", "content" "some stuff", "children" :[] } ... ]}

Nested data like this is largely unhelpful for structured data analysis. So with this JSON, we might then go through the process of flattening it into something tabular.
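To sketch that flattening step (assuming the nested structure above has already been parsed into a Python dict; the field names mirror the illustrative JSON rather than any real page), pandas’ json_normalize does most of the heavy lifting:

import pandas as pd

# Hypothetical parsed structure, mirroring the JSON sketch above
parsed = {
    "elements": [
        {
            "type": "p",
            "content": "My first paragraph",
            "children": [{"type": "div", "content": "Some stuff", "children": []}],
        },
        {
            "type": "p",
            "content": "My second paragraph",
            "children": [{"type": "div", "content": "Some more stuff", "children": []}],
        },
    ]
}

# Flatten one level of nesting: one row per child, with the parent's fields repeated
df = pd.json_normalize(
    parsed["elements"],
    record_path="children",
    meta=["type", "content"],
    record_prefix="child.",
    meta_prefix="parent.",
)
print(df)
# Columns: child.type, child.content, child.children, parent.type, parent.content

Each child becomes a row with the parent’s fields repeated alongside it, which is the tabular shape downstream analysis actually wants.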

This leads to the first part of the problem: parsing is complex. HTML has deeply nested structures, and pages become increasingly complex as the layers of nesting multiply.

There are a few additional considerations that make this process non-trivial.

Time or duration

Web scraping often needs to be done for many web pages over long durations. The computation is not difficult per se, but it requires a different kind of infrastructure (long-running, low compute) from many data engineering workloads (spiky, high compute), which means simply running a web scraping job in Python on the same infrastructure as something like Airflow on Kubernetes is unduly expensive.

Use of Browser Automation tools

Sometimes, not all HTML renders when a page loads: you need to click on something before the content you’re after appears. This means a browser automation tool like Selenium is required to “get” the data. Read more on the differences between BS and Selenium here.
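As a rough sketch of that pattern (the URL and element ID here are hypothetical), you drive the browser, trigger the interaction, then hand the rendered HTML to a parser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes a local Chrome install
driver.implicitly_wait(5)    # give dynamic elements a few seconds to appear
try:
    driver.get("https://example.com/some-page")  # placeholder URL

    # Hypothetical button that only reveals the content we want once clicked
    driver.find_element(By.ID, "load-more").click()

    # Once rendered, the full HTML is available and can be parsed as usual
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for item in soup.find_all("div", class_="item"):
        print(item.get_text(strip=True))
finally:
    driver.quit()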

“Changing Schema”

Web pages change, which means the structure of an HTML document changes too. With something like Beautiful Soup, it can be difficult and time-consuming to amend code when the HTML structure changes over time, which makes web scraping as a way of gathering data flaky and unreliable.
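To make that failure mode concrete, here’s a small hypothetical example (the class names are made up): when the site renames a class, find simply returns None, so without an explicit guard the job either crashes with an unhelpful AttributeError or quietly pushes bad data downstream.

from bs4 import BeautifulSoup

# Hypothetical snippet: the site has renamed its 'price' class to 'price-v2'
html = '<div class="price-v2">£99</div>'
soup = BeautifulSoup(html, "html.parser")

element = soup.find("div", class_="price")  # the old selector no longer matches
if element is None:
    # Without this guard, element.text below would raise a bare AttributeError
    raise ValueError("div.price not found: has the page structure changed?")

price = element.text.strip()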

In Data, we can leverage things like Data Contracts or plain old human relationships to ensure this doesn’t happen. That’s a lot harder when you don’t know who’s on the other side of the phone (or who to call).

Legality

Use the internet as a data source — great idea, right? Think of all the incredible bits of data out there you can access freely and legally. Think again.

Many sites like LinkedIn explicitly ban web scraping in their terms of use. There appears to be a fine line between doing something criminal and simply getting banned. Depending on your use-case, you need to be extremely cautious and evaluate whether the site you’re scraping can be scraped in a legal way.

This is not legal advice and should not be used as such.

Proxies for web scraping

A proxy service allows you to route your requests via its hosts. This means the IP address you use, and possibly the browser fingerprint or other components, appear different to the target site.

There are a number of reasons this can be advantageous. It allows you to make multiple concurrent requests (scale, speed), you can get around blanket IP bans (sometimes websites will simply ban AWS IPs), and you can route via a proxy in a specific geographical location, giving you access to content targeted at that geography.
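Mechanically, routing a single request through a proxy is simple. A minimal sketch with requests, where the proxy host and credentials are placeholders you would get from a provider:

import requests

# Placeholder host and credentials: in practice these come from a proxy provider
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

# The target site sees the proxy's IP (and geography), not yours
response = requests.get(
    "https://www.getorchestra.io/integrations",
    proxies=proxies,
    timeout=30,
)
print(response.status_code)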

Running and rotating a pool of healthy proxies like this is typically quite hard for most people to manage on their own.

How to leverage AI to improve web scraping

Using a simple example and OpenAI, we’ll show you how you can scrape some data and how AI makes it easier.

Scraping my own website

For ease of demo, I’ll show you how you can scrape the Orchestra website.


We’re going to scrape my integrations page. Not sure why you would want to do this, but perhaps from a competitive intelligence angle, you might want to keep tabs on my integrations and their progression over time.

Using Python, the script looks a bit like this in Beautiful Soup:

import json
import pandas as pd
from bs4 import BeautifulSoup

# Load HTML content
with open('path_to_your_html_file.html', 'r') as file:
    html_content = file.read()

# Parse HTML using Beautiful Soup
soup = BeautifulSoup(html_content, 'html.parser')

# Find all integration items - adjust the class or tag as per your HTML structure
integration_cards = soup.find_all('div', class_='your_card_class_here')

data = []
for card in integration_cards:
    integration_name = card.find('div', class_='integration_name_class').text.strip()
    status = card.find('div', class_='status_class').text.strip()
    description = card.find('p', class_='description_class').text.strip()
    data.append({
        'Integration Name': integration_name,
        'Status': status,
        'Description': description
    })

# Convert data into a DataFrame and save to Excel
df = pd.DataFrame(data)
df.to_excel('integrations.xlsx', index=False)

# Save data to JSON file
with open('integrations.json', 'w') as json_file:
    json.dump(data, json_file)

We can see that all the problems we mentioned in the previous section apply here. What if I change the name of the integration class from “integration_name_class” to something else? This script will break. Not good.

Furthermore, I still have to spend time writing the code to flatten my JSON. Not ideal, as I would probably rather have a separate process do this than do it on the fly (if I were doing this at scale).

However, with access to the raw HTML, which can be easily downloaded, the content can be passed to OpenAI with a prompt asking for structured data. It is duly returned:

import boto3
import requests
from bs4 import BeautifulSoup
import openai
import pandas as pd
from io import StringIO

# AWS S3 Configuration
s3 = boto3.client('s3')
bucket_name = 'your-bucket-name'

# Function to save webpage to S3
def save_webpage_to_s3(url, file_name):
    response = requests.get(url)
    s3.put_object(Bucket=bucket_name, Key=file_name, Body=response.content)
    return f"s3://{bucket_name}/{file_name}"

# Function to load webpage from S3 and extract data
def extract_data_from_html_s3(file_name):
    obj = s3.get_object(Bucket=bucket_name, Key=file_name)
    html_content = obj['Body'].read().decode('utf-8')

    # Send HTML content to OpenAI for processing
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt="Extract a list of integrations with their statuses and descriptions from this HTML: " + html_content,
        max_tokens=1024
    )

    # Convert the response to DataFrame
    data = response.choices[0].text
    df = pd.read_json(data)

    # Save DataFrame back to S3 as CSV
    csv_buffer = StringIO()
    df.to_csv(csv_buffer)
    s3.put_object(Bucket=bucket_name, Key='integrations.csv', Body=csv_buffer.getvalue())

# Main execution flow
url_to_save = "https://www.getorchestra.io/integrations"
html_file_name = "webpage.html"
save_webpage_to_s3(url_to_save, html_file_name)
extract_data_from_html_s3(html_file_name)

It’s a bit more code, but provided you’re happy with the OpenAI API returning the structure of data you want, there’s much less work to do.

The LLM makes the manual HTML-parsing step redundant, or at least much simpler, and because it works from the page content rather than hard-coded selectors, it can absorb the impact of schema changes over time.

Furthermore, this process is easily scaled. It’s fundamentally just saving down small files and making API calls (which could be done sync or async), which is a classic ELT flow.
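As an illustration (the URL list and bucket name are placeholders), the extract-and-load step scales out naturally with async I/O, landing raw HTML in S3 for a downstream process to parse:

import asyncio
import aiohttp
import boto3

s3 = boto3.client("s3")
bucket_name = "your-bucket-name"  # placeholder

urls = [
    "https://www.getorchestra.io/integrations",
    # ... more pages to scrape
]

async def fetch_and_save(session, url, key):
    # Download the page and land the raw HTML in S3; parsing happens downstream
    async with session.get(url) as response:
        html = await response.text()
    s3.put_object(Bucket=bucket_name, Key=key, Body=html.encode("utf-8"))

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [
            fetch_and_save(session, url, f"raw/page_{i}.html")
            for i, url in enumerate(urls)
        ]
        await asyncio.gather(*tasks)

asyncio.run(main())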

AI-Enabled Web Scraping tools

As we know, LLMs hallucinate, and there’s no guarantee the response you receive from your LLM will always be in the correct format.
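One pragmatic mitigation is to treat the model’s output as untrusted input: parse it, check the fields you expect, and fail loudly rather than loading bad data downstream. A sketch, with hypothetical field names matching the earlier example:

import json

# Hypothetical field names, matching the integrations example above
REQUIRED_KEYS = {"Integration Name", "Status", "Description"}

def parse_llm_response(raw_text: str) -> list:
    # The model may wrap the JSON in prose, or return malformed JSON entirely
    try:
        records = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response was not valid JSON: {exc}") from exc

    if not isinstance(records, list):
        raise ValueError("Expected a JSON list of records")

    for record in records:
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"Record is missing expected keys: {missing}")

    return records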

This is why a whole legion of AI-enabled web scraping APIs is popping up that you can leverage, using the pattern above, to ensure that inconsistent data is someone else’s problem rather than yours.

Nimble

Nimble is an AI-enabled web scraping API that data engineers can use to reliably take data from sources on the internet and drop it into an S3 or GCS bucket.

It handles the gnarlier aspects of “unblocking” (the proxy server problem we mentioned earlier) and wraps some website types in their own models that essentially offer automated parsing (or “Nimble Skills”).

These include the Search Engine Results Page (“SERP”) API, an ecommerce platform focussed API (for accessing pricing data), a maps API (although GMaps has its own API) and an all-purpose web API.

There are some extremely powerful use-cases here around gathering intelligence if you’re a market research company or retail player, and the SERP functionality is incredibly useful for monitoring performance for certain keywords in real-time.


Diffbot

Diffbot is a relatively old company, founded about 12 years ago by Mike Tung. The core value proposition is very similar to what we argued in this article — focussing on scraping the web along two key dimensions: accuracy and scale.


They also appear to have some pretty interesting Natural Language products that can be used for data enrichment (which I mention briefly here). It’s relatively well priced as well IMO, including a free tier (see here).

Others

With AI there are a whole host of other AI-enabled web scraping tools popping up like ScrapeStorm, Octoparse, and others listed here.

A recurring theme of features seems to be the inclusion of an easy-to-use or “Point and click” UI, handling of infrastructure and things like smooth unblocking / proxy networks, and the promise of low-latency scraping if desired.

While many of these solutions are at the cheaper end, it feels like problems of scale, speed, the necessity of proxy networks, and website type all create points of differentiation from which solutions are evaluated. Consider your requirements before choosing a tool.

Conclusion

In this article we outlined the basic problems of using web scraping as a means to collect data in real-time for Data Engineers. We saw how technologies like Beautiful Soup and Selenium made this possible, but that there are still lots of considerations to make before investing in a web scraping project.

Artificial Intelligence has greatly increased the ease, accuracy, and speed of web scraping, and it’s really awesome to see new solutions come up on the market that are API first and can easily be embedded in modular data architectures for genuinely insightful and helpful use-cases.

If you’re interested in reading more about how Data Engineers and Data Teams can use Generative AI, give me a follow on Medium or reach out to me on LinkedIn. Hugo 🚀

Find out more about Orchestra

Orchestra is a platform for getting the most value out of your data as humanly possible. It’s also a feature-rich orchestration tool, and can solve for multiple use-cases and solutions. Our docs are here, but why not also check out our integrations — we manage these so you can get started with your pipelines instantly. We also have a blog, written by the Orchestra team + guest writers, and some whitepapers for more in-depth reads.

