How To Take Markdown of All The Pages in a Domain (Automation)

CaptureKit Team
markdown automation, website to markdown, capture website content, CaptureKit markdown API

I was recently looking for a way to get Markdown of all the pages in a domain to build an internal knowledge base for my team.

For this, I needed all the URLs in a domain, and each one of them in markdown format.

CaptureKit’s Page extract API helped me build this workflow, which calls the API twice. If you have a similar use case, you can use this automation as is. Just download the blueprint that I share at the very end of this blog.

Here’s the workflow for the whole automation.

Let me show you how you can build it from scratch!

Prefer watching the tutorial? Check out the video below ⬇️

Tools used to build this automation

  1. CaptureKit API (For extracting all the pages + each page’s markdown format)
  2. n8n (To build our workflow)
  3. Google Spreadsheet (Database)

You need access to both the CaptureKit API and n8n. CaptureKit gives you 100 credits to start with. You will find your API key in the dashboard; we will use it in our workflow later.

n8n offers a 14-day free trial. Once you are logged in to both platforms, we can start building the automation from scratch.

Let’s head back to our n8n.

Building The Workflow

The first step in our workflow is the trigger. We are using a scheduled trigger here, but you can use a manual trigger too.

Then, in the next node, our workflow picks the domain for which we want to extract all the page URLs.

Our spreadsheet has two tabs; the workflow reads the domain from the ‘domain’ tab.

The screenshot above was taken after my testing was done: we have collected all the URLs with their respective Markdown in column B.

Here’s the configuration of our Google Sheets node in n8n, along with its output.

We will now use this output as the input to our Page API. To use this API, you need to pass the website URL as one of the parameters, along with the api_key.

You can read more about the API in the docs here, and use the playground to build your request.

But I will show you the configuration of the HTTP Request node that I have used.

This way you get the URLs in that domain. Let’s test this step and see.
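For readers who want to see the shape of this request outside n8n, here is a minimal Python sketch of how the GET URL could be assembled. The endpoint below is a hypothetical placeholder, and the parameter names `url` and `api_key` are assumptions based on the description above, so confirm both against the CaptureKit docs before using it.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: take the real endpoint from the CaptureKit docs.
PAGE_API_ENDPOINT = "https://api.example.com/page"


def build_page_request_url(target_url: str, api_key: str) -> str:
    """Return the GET URL for one Page API call.

    The parameter names `url` and `api_key` are assumptions; check the
    API reference for the exact names.
    """
    # urlencode percent-encodes the target URL so it is safe in a query string.
    query = urlencode({"url": target_url, "api_key": api_key})
    return f"{PAGE_API_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(build_page_request_url("https://example.com", "YOUR_API_KEY"))
```

In n8n, the HTTP Request node does this encoding for you when you add the values as query parameters.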

You can see all the URLs for this domain in the output. Now we will use the Split Out node and send each of them to this API again.
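To make the split-out step concrete, here is a small Python sketch of the same idea: one response that holds many links becomes one item per link. The response shape (a `links` key holding a list of URL strings) is a hypothetical assumption, not the documented API output, and the de-duplication is an extra that the n8n node does not do by itself.

```python
from typing import Any


def split_out_links(response: dict[str, Any]) -> list[dict[str, str]]:
    """Turn one response holding many links into one item per link.

    Assumes (hypothetically) that the links live under response["links"]
    as a list of URL strings.
    """
    seen: set[str] = set()
    items: list[dict[str, str]] = []
    for link in response.get("links", []):
        url = link.strip()
        if url and url not in seen:  # skip blanks and duplicates
            seen.add(url)
            items.append({"url": url})
    return items
```

Each returned item can then be handed to the next step on its own, which is what lets the workflow process one URL at a time.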

A loop cycles through the items, so we can extract the Markdown of each URL one by one.

Okay, now let’s move to the next node, the HTTP Request node. As I said, we will use the page content API again here.
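The one-by-one loop can be sketched in Python as follows. The `fetch_markdown` argument is a hypothetical stand-in for whatever performs the second API call; it is passed in as a function so the loop itself stays easy to test. Recording an error message instead of stopping is my own choice here, not something the n8n workflow does by default.

```python
from typing import Callable


def fetch_markdown_for_all(
    urls: list[str],
    fetch_markdown: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Call `fetch_markdown` once per URL, in order, pairing each URL with its result."""
    rows: list[tuple[str, str]] = []
    for url in urls:
        try:
            rows.append((url, fetch_markdown(url)))
        except Exception as error:  # keep going if a single page fails
            rows.append((url, f"ERROR: {error}"))
    return rows
```

Processing URLs sequentially is slower than sending them all at once, but it keeps the credit usage predictable and makes a failing page easy to spot.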

Here’s the configuration of this node below ⬇️

Finally, we save this Markdown in our database and connect back to the loop to repeat the process for all the URLs.

And finally, we connect the output of the Google Sheets node back to the loop.

This way you can easily convert URLs to Markdown format using this workflow.

And as promised, here is the blueprint for this automation. You can download it and use it in your n8n canvas.
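If you want to try the storage step without a Google Sheet, the sketch below writes the same two-column layout to CSV text instead. It is only a stand-in: the header names `URL` and `Markdown` are hypothetical, and placing the URL in column A is an assumption (the post only states that the Markdown ends up in column B).

```python
import csv
import io


def rows_to_csv(rows: list[tuple[str, str]]) -> str:
    """Serialize (url, markdown) pairs: URL in column A, Markdown in column B."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["URL", "Markdown"])  # hypothetical header names
    writer.writerows(rows)  # multi-line Markdown is quoted automatically
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv([("https://example.com/", "# Title\n\nBody text")]))
```

The `csv` module quotes any cell that contains commas or line breaks, which matters here because Markdown almost always spans multiple lines.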

Additional Resources

  1. How To Convert HTML to Markdown Format
  2. How To Take Screenshot of All The URLs of A Domain
