AI Web Scraping Structured Data: Mastering Data Extraction Techniques (feat. Claude Sonnet 3.5)
Unlock the power of AI: Extract structured data from web pages using Bubble and Claude!
Master web scraping: Learn how to combine web scraping and AI to extract job listing details effortlessly.
Build smarter no-code apps: Discover how to leverage AI and no-code to create powerful data extraction tools in minutes.
Combining Web Scraping with AI for Structured Data Extraction
In this Bubble tutorial video, I'm going to show you how you can combine web scraping with AI to extract structured data from a web page. And the example I'm going to use is a local job board, and I'm going to be extracting, using AI, different elements of the job adverts. Here we go, here's the example. So here's our local job board to me, and I'm going to basically be scraping the whole page, converting it to markdown, and then supplying that markdown into Claude's Sonnet 3.5, and using the LLM, the AI there to extract different parts of the data here. For example, asking when the closing date is and getting that back in some nice structured JSON. But before I launch into that, if you're building an app with Bubble, if you're wanting to build a no-code startup, click the link down below to head over to our website because we've got hundreds of different resources, including tutorial videos, a no-code community, and plenty more to help support you in your journey.
Setting Up Web Scraping with UseScraper
So I'm going to start by building the web scraping aspect of this tool, and I'm going to be using UseScraper. Now in previous videos I've used Paged API, and I thoroughly recommend them. They're a fantastic web scraper, but I just wanted to try something new, so I'm using UseScraper. And how far I've got is I've set up the Bubble API connector, and I've added in the name UseScraper. I put private key in header, authorization, and then I've got my private key with the word bearer in front. How do I know how to do that? Well, it tells me here authorization bearer, and then my private key. Now as of a few months ago, you no longer have to add content type application JSON, that is a default unless you overwrite that value in the API connector.
Configuring the API Call
So let's continue to set this up so that we can see the sort of data that we're getting back. So I'm going to copy the endpoint, go into my API connector, and I'll say scrape web page in markdown. Paste in the endpoint, make it post. How do I know that it's post? Well, because it tells me up here. It then tells me the structure of the data section, or the body section of the API call. So I'm going to copy this, because I don't want to make any mistakes, and paste it in here. And then I'm going to take here and say URL. Notice that I removed the speech marks, I'll explain why in a moment. And then format, I think, is going to be markdown. Let me check. If I go back to the documentation, yes, I can have markdown. And I'm going to use markdown simply because when you supply data to an LLM, like Anthropic's Claude, it can help to have some structure behind. If I was just to do the raw text, it has to do a little bit more interpretation, what's the header, all of that. HTML is going to be way too much detail in there. So I'm just going to supply markdown. I think that's a nice mid-ground, it's the sweet spot.
Setting Up the API Connector
So if I go back to the example I want to scrape, I'm going to copy this, and go back into my API connector. And notice that by using the triangle brackets, I've effectively inserted the merge field or a place for dynamic values. These are places that I can fill data in in a workflow. I don't want it to be private, because it doesn't need to be protected from the user who's taking this action. This doesn't mean that data that goes into this is public on my Bubble app. You need to make sure your privacy rules protect against that sort of thing. That's what privacy rules are for. This is talking about data that I want my users to see. An example that I don't is I don't want them to see my API key, because that's effectively a password to access a service that I'm going to get billed for. So make sure you put your private key in the correct place. You meant you specify to Bubble that your private key should be private. I've also removed the speech marks, because I've just got in the habit of making sure that everything is JSON safe. JSON is a very sensitive syntax. If you insert the wrong speech mark in the wrong place, or the wrong comma in the wrong place, you're going to get an error.
Testing the API Call
So in order to replicate it, I'm going to paste in, add in my speech marks again, and paste into the middle of my web page. I'm going to change this to action, because I want to run this in a workflow. And so let's now initialize and see what happens. The initialization process in Bubble checks that everything you've done up until this point is working. It also instructs Bubble on what to expect back. So here we go, we get back text, and we get back all of this, which I can see, yeah, basically there is mark down there. And we get some of these key bits of metadata, like the posted date. Perfect, that's exactly what I needed. I'm now going to click save, because I want Bubble to learn the format that this data comes back in.
Setting Up the AI API Call
So what's the next step? Well, I think we could do with setting up an API call to Claude, to Anthropic. So I'm going to scroll on up a bit. Here we go, here's my Claude. So again, if you're unsure about this, we can go into the Claude documentation, messages, create message. It's just taking its time. So I've done plenty of videos using Claude as an LLM. Oh, here we go, loads in. So again, I need to copy this into my Bubble app. So I've got my endpoint, it needs to be a post request, I can see that there. I then need to have Anthropic version number in the header, I'm going to point that out to you, and we need to have structure similar to this in data.
Configuring the AI API Call
So if I go back into my app, you can see that I've basically got all this from a previous demo that I've done. So I'm going to change one thing and just say web page content. Actually, it's going to be prompt and web page content. And then I'm going to rename this to extract data from web page with AI. Okay, I'm using the latest version of Claude 3.5 Sonnet in here. Now I'm going to try at the end using Haiku, and that's going to be tons cheaper because I think Haiku can manage this relatively simple task. We're also, and I've looked into this before hitting record, we're going to try using JSON mode. Now JSON mode is a parameter on OpenAI's API. It isn't one that's available on Claude last time I checked, but you can instruct with the prompt to return with JSON. So we're going to try that. This needs to not be private because I want it to run in the workflow, and I don't need to reinitialize it because I'm confident I've not broken anything in here.
Building the User Interface
So let's go and build, in fact I've got that ready to go here, I've got a blank page. So we are going to build a simple form to show this data. So I'm going to say input, and in fact let's wrap this in a group, give it a little bit of nice styling. So this is going to be our URL, and then I'm going to add in a button. Let's just add a bit of contrast to the page. I like that, so that you can see what bit is what. Here we go. And so our button is going to be extract structured data. We're going to build this piece by piece so you can see how each bit slots together. I just like making things a little bit neater. There we go. Right, now we need somewhere to print the results. So I'm going to take just a text box here.
Using Custom States for Temporary Data Storage
I'm going to create a custom state. Now a custom state is like a temporary data storage device. Nothing is being written to the database. If the user refreshes the page, the data is lost, but it's really helpful in kind of this step-by-step debugging. So on my page element, I'm going to add a custom state. Custom states can be added to any element. I've just got in the habit of putting them on the page because then I don't forget where they are. So we'll say markdown, and this is going to be a type text, and then this text box here we're going to say find our page. So there we go. This is our page element. This is that's what I've named the page. If you want to rename the page, you rename it here, and I'll say custom state.
Creating the Workflow for Web Scraping
Right, so now we can create a workflow that links in with the web scraping API call that we've already got set up. So that's out in the workflow, and I'm going to go into plugins and find UseScraper. If you're not sure where they are, when you initialize an API call and you set it as an action, you should see it in this list here as you've labeled it in the API connector. Then I'm going to take my input because that's where I want to put my URL, and I need to make its value JSON safe. That is wrapping it in speech marks, it's making sure any pesky punctuation, or I don't know if that would appear in the URL, but just in case we're going to make it JSON safe. Then I need to save or store temporarily the output. So I'm going to say set state because I'm now going to refer to my custom state, which is my page markdown, and then say the results are step one's text. Okay, and the list that I get here, this is the structured data that came back when I initialized the call in the API connector.
Testing the Web Scraping Workflow
I'm going to do one final thing to help speed things up. I'm going to copy the URL again from this one that I'm using as a demo and paste it into the initial content. So now we can hit preview and see what happens. Okay, let's run the web scraping. So I'm going to click extract structured data, and here we go. We have back all of this beautifully structured, well sorry, it's not structured yet. This is going to be our next step. So far we haven't actually used Claude in order to structure it.
Adding AI Processing to the Workflow
So let's dive back in. In fact, what I'm going to do here is group this in a row so that we've got side by side. I'm going to copy and paste it, add in a bit of a gap, perfect. What have I got this set to fit with the content. I'm just going to change this to, let's say 40%. I basically want to be able to preview both results, each step of the process side by side. Now we need to add to our workflow the AI part. So let's go and edit our workflow, and this time I'm going to go into plugins. Did I call it Anthropic or did I call it Claude? I'll call it Claude. Yeah, here we go. Claude extract data from page with AI. Again, if you see something different, it's because you've labeled it differently in the API Connect. So if you still don't see it, it's because you need to set it to action rather than data, and you need to initialize it.
Configuring the AI Prompt
So in fact, let's just dive back in so you can see what I mean by that. Notice there are no speech marks, so we need to make it JSON safe. And here we go. This is where I've labeled it. And this is what you need to copy basically into your app to make it work. So we want to insert into here both a prompt and the web page data. And a great way of grouping together text and applying a single action to it, which we need to do with JSON safe, is to use arbitrary text. Arbitrary text is also very easy and handy for copying loads of dynamic data around your app in the editor. So I've made it JSON safe. And then I'm going to write a prompt. So I'm going to say here is a web page job. That's bad English. Here is a job advert web page in markdown. Extract. What shall we say? I want to do a couple of bits of data. We shall say extract the contract term and closing date. Extract the contract term and closing date in markdown using this format. Okay, now everything in here is going to be made JSON safe.
Enhancing the AI Prompt
And actually, I'm going to add in some even more really handy tips in here, which is that if we go back to Anthropic and then go on to user guide, then on to prompt engineering. Yeah, here we go. So they provide some really handy tips to strengthen your prompts. And one of those is instructions and formatting using XML tags. So let's go ahead and do that. So this is instruct. And there's nothing special about the term used here. It is simply a way of further kind of reinforcing to the AI what is going on where in the prompt. So I'm going to get rid of that and then close the XML tag instructions. And then I'm going to say format. And so here I'm going to write in a little bit of JSON formatting. And hopefully I'm going to get this right off the top of my head. So we shall say closing, closing date. And again, there's nothing special about the terms I'm using here. I just need to do it in a manner that I think the AI is going to get it right. And so then I suppose I would just put that in there. And then I'm going to structure this so it's even more clear. There we go. So I'm saying closing date and contract term.
Standardizing Date Formats
Okay, I'm also going to add into the instructions here, a way of standardizing the format of the closing date. Now I tested this yesterday, and I think it's going to work, because my issue is that this date structure here means something to me as a human looking at it, but I need to get it into a format that Bubble can understand the date. So I'm going to say format all dates as okay, much more standard, easier for me to do things with it, extract it. So, oh, and now I need to put in the web page. That I just put web page. And here is where I dynamically insert from the result step one, the text.
Testing the Complete Workflow
Okay. Now, in order to test this, I'm going to add in another custom state. I'll say JSON response. And then I'm going to print that here, JSON response. And now if I go back into the workflow, I'm going to save the JSON response, or set the state rather. So the result of step three, that's my AI. And then just because I've used Claude Tuns in the last week, so I know exactly where to look, which is content, first item, text. Okay, cool. I'm excited to try this. So let's refresh. And run the workflow. So we're combining both the web scraping. Ah, I have an error. That's because I reset my API key between videos, which is always a good thing to do. So bear with me, I'm going to change that.
Reviewing the Results
Okay, so I replaced my API key. And look, I get back this beautifully structured data. So the closing date was the 12th of July, and it's picked that up and it's reformatted it, and it's picked up the contract term. But notice that it returns it as text. So this is a challenge that I may well tackle in a future video. I've got an eye on the time. I like keeping these videos super bite sized on the 18th minute effectively here. But I've got you this far. I'll give you some tips. You could either use a plugin that extracts data from JSON later on in the in the workflow. Or you could put this itself through the API connector so that it knows that so that you can get Bubble to the tech that is actually JSON. Now I think I'm going to tackle this in another video. So hit subscribe, hit like if you found this helpful, and I'll see you in the next one.
Get the Complete Bundle for Just $99
Access 3 courses, 390+ tutorials, and a vibrant community to support every step of your app-building journey.
Start building with total confidence
No more delays. With 30+ hours of expert content, you’ll have the insights needed to build effectively.
Find every solution in one place
No more searching across platforms for tutorials. Our bundle has everything you need, with 390+ videos covering every feature and technique.
Dive deep into every detail
Get beyond the basics with comprehensive, in-depth courses & no code tutorials that empower you to create a feature-rich, professional app.
Save over 70%!
Valued at $80
Valued at $85
Valued at $30
Valued at $110
Valued at $45
Can't find what you're looking for?
Search our 300+ Bubble tutorial videos. Start learning no code today!
Have questions?
We have answers!
Find answers to common questions about our membership plans, programs, and more.
We're here to help you launch your no code SaaS. Reach out to the team and we'll double check our vast library for useful content. We'll advise you on how we'd tackle the same problem and there's a good chance we'll record the video to help the wider community.
As a Planet No Code member, you'll receive a discount on our Bubble coaching sessions. Monthly members receive a 10% discount, while Annual members receive a 17.5% discount. To redeem your discount, simply log into your account and book a coaching session through our platform.
Our 8-week intensive mentorship program is designed to provide personalized guidance and support to help you accelerate your startup journey. You'll be matched with a startup expert who will work with you one-on-one to set goals, overcome challenges, and make rapid progress.
To apply for the Mastery Program, simply click the "Request Invitation" button on our pricing page and fill out the application form. Our team will review your application and schedule a call with you to discuss your goals and determine if the program is a good fit for your needs.
We accept all major credit cards, including Visa, Mastercard, American Express, and Discover.
While we don't offer a free trial, we do provide a 14-day money-back guarantee. If you're not completely satisfied with your membership within the first 14 days, simply contact our support team, and we'll issue a full refund.