Web Scraping with GPT-4o: An Inefficient Way To Avoid HTML

Check the Code

Every so often, a project calls for some web scraping. For those of us who treat this as a once-in-a-while task, it can be a painful reminder of the convoluted dance between BeautifulSoup, maybe Selenium, and maybe Chromedriver.

Traditional approaches require understanding the specific structure of each website in order to get the exact information you care about. If the website changes its layout, all that work is gone and you may need to start over. And any serious website has security barriers, rate limits, and so on, making you dread the next time you need to scrape the web.

As someone who spent the last five years in deep learning and knows very little about traditional software engineering or front-end dev, I saw a silly possibility: using new SOTA Vision Language Models (VLMs) like GPT-4o to do the web scraping for us.

So I prototyped it in a few hours, gave it a try, and it kind of worked. There are surely applications where this approach is more useful, but even in this case of scraping NYT articles it helped me bypass login and anti-scraping rules, so I appreciated it!

Stage 1: Scrape Web Images with Chromedriver

The first stage uses Chromedriver to open a browser and scroll through a page while capturing minimally overlapping screenshots, so that the entire page is covered. These screenshots are saved to a directory. Using Chromedriver ensures that even dynamic content loaded via JavaScript is captured, providing a complete visual record of the webpage.

A pretty big advantage here is that pointing Chromedriver at a Google Chrome profile with the right login permissions lets you browse and scrape websites that would otherwise put up burdensome permission blocks.
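Here is a minimal sketch of this stage in Python, assuming Selenium 4. The profile path, window size, wait times, and overlap value are illustrative assumptions, not the repo's exact settings:

```python
import os
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def capture_page(url: str, out_dir: str = "screenshots", overlap_px: int = 100) -> None:
    options = Options()
    # Reuse an existing Chrome profile so logged-in sessions carry over
    # (this path is an assumption; adjust for your OS and profile).
    options.add_argument("--user-data-dir=/Users/me/Library/Application Support/Google/Chrome")
    options.add_argument("--profile-directory=Default")
    options.add_argument("--window-size=1280,1600")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(3)  # give JavaScript-rendered content time to load

        os.makedirs(out_dir, exist_ok=True)
        viewport_px = driver.execute_script("return window.innerHeight")
        page_px = driver.execute_script("return document.body.scrollHeight")

        y, idx = 0, 0
        while y < page_px:
            driver.execute_script(f"window.scrollTo(0, {y});")
            time.sleep(1)  # let lazy-loaded images appear
            driver.save_screenshot(os.path.join(out_dir, f"chunk_{idx:03d}.png"))
            y += viewport_px - overlap_px  # minimally overlapping captures
            idx += 1
    finally:
        driver.quit()
```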

Concatenated High Resolution GIF

Stage 2: OCR with GPT-4o VLM

In the second stage, the saved screenshots are sent to the GPT-4o API to extract text. To speed this up, the requests are sent in parallel with a ThreadPoolExecutor. Upon casual inspection, GPT-4o does a pretty good job of extracting the relevant text, which is saved to a series of .txt files. I'd be surprised if GPT-4o doesn't lean on some sort of OCR tool under the hood.
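Here is a minimal sketch of this stage, assuming the official openai Python SDK (v1+) with an OPENAI_API_KEY in the environment; the prompt wording and worker count are illustrative:

```python
import base64
import os
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

OCR_PROMPT = (
    "Transcribe all article text in this screenshot verbatim. "
    "Ignore ads, menus, and navigation elements."
)

def transcribe_screenshot(path: str) -> str:
    # Encode the screenshot as base64 so it can be sent inline to GPT-4o.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def transcribe_all(img_dir: str, txt_dir: str = "chunks") -> None:
    os.makedirs(txt_dir, exist_ok=True)
    paths = sorted(
        os.path.join(img_dir, name)
        for name in os.listdir(img_dir) if name.endswith(".png")
    )
    # pool.map preserves input order, so chunk numbering stays aligned.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for path, text in zip(paths, pool.map(transcribe_screenshot, paths)):
            stem = os.path.splitext(os.path.basename(path))[0]
            with open(os.path.join(txt_dir, stem + ".txt"), "w") as f:
                f.write(text)
```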

Here, there is potential for other kinds of analysis that would certainly be beyond traditional BeautifulSoup-style scraping; for example, if we want the article transcription to include descriptive captions for images, that is something you would need a VLM for.
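For instance, a prompt variant along these lines (wording is purely illustrative, not the repo's actual prompt) would ask the VLM to caption figures inline while transcribing:

```python
# Illustrative prompt variant: transcribe text and caption figures inline.
CAPTION_PROMPT = (
    "Transcribe all article text in this screenshot verbatim. For each photo "
    "or figure, insert a one-sentence descriptive caption in square brackets "
    "at the spot where it appears."
)
```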

Stage 3: Synthesize Full Transcript

The last stage simply takes the directory of .txt chunks, concatenates them in order, and asks GPT-4o to synthesize them into a single verbatim .txt article. Modern LLMs should have no problem with this relatively simple task. Be careful to ask for an exact, verbatim synthesis, and make sure you allow enough output tokens to capture the whole article.
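A minimal sketch of this stage, again assuming the openai SDK; the merge prompt and max_tokens value are illustrative:

```python
import os

from openai import OpenAI

client = OpenAI()

def synthesize_article(txt_dir: str = "chunks", out_path: str = "article.txt") -> None:
    # Concatenate the OCR chunks in filename order.
    chunks = []
    for name in sorted(os.listdir(txt_dir)):
        if name.endswith(".txt"):
            with open(os.path.join(txt_dir, name)) as f:
                chunks.append(f.read())
    combined = "\n\n".join(chunks)

    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=4096,  # leave enough room for the full article
        messages=[{
            "role": "user",
            "content": (
                "The following are overlapping OCR transcripts of one article, "
                "in order. Merge them into a single exact, verbatim article, "
                "removing text duplicated by the overlaps. Do not summarize "
                "or paraphrase.\n\n" + combined
            ),
        }],
    )
    with open(out_path, "w") as f:
        f.write(response.choices[0].message.content)
```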

Below you can see part of the output .html file. Whether you include things like the author and captions can easily be controlled by changing the scraping prompt.

Scraped Article

Performance?

Here are some metrics, calculated by GPT, comparing corresponding chunks of the original NYT article and the version produced by this scraping protocol. If you're looking for perfection, this falls short. But from a computer vision perspective, not too bad!

Analysis and Thoughts

Overall, this was a fun, slightly silly idea to explore. Some initial thoughts:

Let me know what you think at egoodman92@gmail.com. I'm still building up the repo to make it more robust. If there is a much easier way of scraping NYT articles, or you have another use case this might be interesting for, please let me know!

Check out the code!