# Web Crawling

***

### Creating a Scheduled Crawl

Navigate to a knowledge base in Knowledge Management and open the **Import from web** modal.

<figure><img src="/files/IL5TAiQCH8QTV8UGcxxb" alt=""><figcaption></figcaption></figure>

#### Step 1 - Add your start URLs

Enter one or more URLs to crawl. The crawler starts from these pages and follows links outward from them. You can add up to **5 start URLs** per schedule.

Only valid URLs are accepted. Invalid entries are highlighted in red on submit.
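
The checks above can be approximated before submitting the form. This is a minimal sketch, not the product's validator: the rule for "valid" (an absolute `http`/`https` URL with a host) and the function names are assumptions made for illustration.

```python
from urllib.parse import urlparse

MAX_START_URLS = 5  # limit stated in this guide


def is_valid_url(url: str) -> bool:
    """Assumed rule: an absolute http(s) URL that includes a host."""
    try:
        parts = urlparse(url.strip())
    except ValueError:  # e.g. malformed IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_start_urls(urls: list[str]) -> list[str]:
    """Return the invalid entries; raise if there are too many URLs."""
    if len(urls) > MAX_START_URLS:
        raise ValueError(f"at most {MAX_START_URLS} start URLs per schedule")
    return [u for u in urls if not is_valid_url(u)]


print(check_start_urls(["https://example.com/docs", "not a url", "ftp://example.com"]))
# ['not a url', 'ftp://example.com']
```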

#### Step 2 - Enable recurring import

Toggle **"Schedule recurring import"** on. Three new fields appear:

| Field          | Description                                                                      | Default       |
| -------------- | -------------------------------------------------------------------------------- | ------------- |
| **Start date** | The first date the schedule is active. Cannot be in the past.                    | Today         |
| **Time**       | The time of day to fire the crawl. Past times are hidden when today is selected. | 09:00         |
| **Repeat**     | How often the crawl runs.                                                        | Every weekday |

**Repeat options:**

* **Every day** - fires daily at the chosen time
* **Every weekday** - fires Monday-Friday at the chosen time
* **Every week** - fires on the same weekday as the start date, weekly
* **Every month** - fires on the same day-of-month as the start date, monthly
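
To preview which dates a repeat option produces, the rules above can be sketched as follows. The repeat identifiers (`"daily"`, `"weekdays"`, `"weekly"`, `"monthly"`) and the function name are hypothetical, and the sketch assumes that a month without the start date's day-of-month is skipped, as described under Known Behaviours & Gotchas.

```python
import calendar
from datetime import date, timedelta


def fire_dates(start: date, repeat: str, count: int) -> list[date]:
    """First `count` dates on which a schedule starting at `start` fires."""
    out: list[date] = []
    if repeat == "monthly":
        year, month = start.year, start.month
        while len(out) < count:
            # Skip months that do not contain the start date's day-of-month.
            if start.day <= calendar.monthrange(year, month)[1]:
                out.append(date(year, month, start.day))
            month += 1
            if month > 12:
                year, month = year + 1, 1
        return out

    day = start
    while len(out) < count:
        if repeat == "daily":
            fires = True
        elif repeat == "weekdays":
            fires = day.weekday() < 5  # Monday=0 ... Friday=4
        elif repeat == "weekly":
            fires = day.weekday() == start.weekday()
        else:
            raise ValueError(f"unknown repeat option: {repeat}")
        if fires:
            out.append(day)
        day += timedelta(days=1)
    return out


print([d.isoformat() for d in fire_dates(date(2025, 1, 31), "monthly", 3)])
# ['2025-01-31', '2025-03-31', '2025-05-31']
```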

**Timezone** - the schedule defaults to your browser's detected timezone. Click **"Select timezone"** to override it. The timezone is stored as a fixed UTC offset (not a named zone), so it does not shift with daylight saving time changes.
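
The effect of a fixed offset can be illustrated with a small calculation. The example assumes a schedule created during summer in Berlin (UTC+02:00) with a 09:00 crawl time; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def fire_time_utc(local: datetime, offset_hours: float) -> datetime:
    """Interpret a naive local time at a fixed UTC offset and convert it to UTC."""
    fixed = timezone(timedelta(hours=offset_hours))
    return local.replace(tzinfo=fixed).astimezone(timezone.utc)


# The stored offset stays +02:00 all year, so the crawl always fires at 07:00 UTC.
summer = fire_time_utc(datetime(2025, 7, 1, 9, 0), 2)
winter = fire_time_utc(datetime(2025, 12, 1, 9, 0), 2)
print(summer.strftime("%H:%M"), winter.strftime("%H:%M"))  # 07:00 07:00

# In winter Berlin is at UTC+01:00, so the same instant reads 08:00 on the wall clock.
winter_wall = winter.astimezone(timezone(timedelta(hours=1)))
print(winter_wall.strftime("%H:%M"))  # 08:00
```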

#### Step 3 - Configure advanced settings (optional)

Click **Advanced settings** to open the configuration panel:

**Crawl behaviour**

| Setting            | Default | Description                                                    |
| ------------------ | ------- | -------------------------------------------------------------- |
| Use sitemap        | Off     | Seeds the crawl from the site's `sitemap.xml`                  |
| Document parser    | Off     | Saves downloadable files (PDFs, DOCX, etc.) found during crawl |
| Extract images     | Off     | Includes image URLs in the crawled content                     |
| Respect robots.txt | **On**  | Honours the site's `robots.txt` crawl rules                    |
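
As an illustration of what honouring `robots.txt` means, the sketch below filters candidate URLs with Python's standard `urllib.robotparser`. It is not the crawler's actual implementation; the user agent name and the helper function are assumptions.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical user agent name


def allowed_urls(robots_txt: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the given robots.txt permits for USER_AGENT."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


robots = """User-agent: *
Disallow: /private/
"""
print(allowed_urls(robots, [
    "https://example.com/docs",
    "https://example.com/private/report",
]))
# ['https://example.com/docs']
```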

**Limits**

| Setting   | Range   | Default | Notes                                                                                                                      |
| --------- | ------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
| Timeout   | 1–600 s | 30 s    | Per-page request timeout                                                                                                   |
| Max retry | 0–10    | 3       | Retries per page on failure                                                                                                |
| Max depth | 0–10    | **3**   | Link-follow depth from start URL. Toggle off to crawl without a depth limit.                                               |
| Max pages | 1–1000  | **10**  | Total pages per run. Toggle off for unlimited. When sitemap is enabled, one extra page is reserved for the sitemap itself. |
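
If you generate these settings programmatically, the ranges and defaults in the table can be checked up front. The field names below are hypothetical (they are not taken from the API), and `None` is used here to represent a limit that has been toggled off.

```python
from dataclasses import dataclass
from typing import Optional

# (min, max) ranges from the Limits table above
RANGES = {
    "timeout_s": (1, 600),
    "max_retry": (0, 10),
    "max_depth": (0, 10),
    "max_pages": (1, 1000),
}
OPTIONAL = {"max_depth", "max_pages"}  # these limits can be toggled off


@dataclass
class CrawlLimits:
    timeout_s: int = 30
    max_retry: int = 3
    max_depth: Optional[int] = 3   # None = no depth limit
    max_pages: Optional[int] = 10  # None = unlimited pages


def validate(limits: CrawlLimits) -> list[str]:
    """Return one message per setting that is outside its documented range."""
    errors = []
    for name, (low, high) in RANGES.items():
        value = getattr(limits, name)
        if value is None:
            if name not in OPTIONAL:
                errors.append(f"{name} is required")
            continue
        if not low <= value <= high:
            errors.append(f"{name} must be between {low} and {high}")
    return errors


print(validate(CrawlLimits()))                               # []
print(validate(CrawlLimits(timeout_s=0, max_pages=None)))    # ['timeout_s must be between 1 and 600']
```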

#### Step 4 - Run

Click **Run Crawl**. For a one-time import (schedule toggle off), the crawl fires immediately. With a schedule, the first run fires at the chosen start date and time, then repeats automatically.

***

### How Re-Crawls Work

When a scheduled crawl fires, it does not create duplicate documents. Instead:

* Pages already in the knowledge base with a matching URL are **updated in place** (upsert by URL).
* New pages discovered since the last crawl are added as new documents.
* Pages that no longer exist on the site are left in place - they are not deleted automatically.

This keeps documents for existing pages up to date without accumulating duplicate entries, although documents for pages removed from the site stay in the knowledge base.
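
The upsert-by-URL behaviour can be pictured with a plain dictionary keyed by URL. This is a conceptual sketch only; the in-memory store and the function name are assumptions, not the product's storage layer.

```python
def apply_crawl(store: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Merge one crawl run into `store` (URL -> content), upserting by URL."""
    updated = sorted(u for u in crawled if u in store)    # matching URL: updated in place
    added = sorted(u for u in crawled if u not in store)  # newly discovered pages
    store.update(crawled)
    kept = sorted(u for u in store if u not in crawled)   # gone from the site, not deleted
    return {"updated": updated, "added": added, "kept": kept}


store = {
    "https://example.com/a": "old A",
    "https://example.com/gone": "old page",
}
result = apply_crawl(store, {
    "https://example.com/a": "new A",
    "https://example.com/b": "B",
})
print(result["updated"])  # ['https://example.com/a']
print(result["added"])    # ['https://example.com/b']
print(result["kept"])     # ['https://example.com/gone']
```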

***

### Managing Schedules via API

The following REST endpoints are available:

| Method   | Path                                                   | What it does                              |
| -------- | ------------------------------------------------------ | ----------------------------------------- |
| `POST`   | `/knowledge_base/crawl-schedules`                      | Create a new schedule                     |
| `GET`    | `/knowledge_base/crawl-schedules?knowledgeBase=<slug>` | List all schedules for a knowledge base   |
| `GET`    | `/knowledge_base/crawl-schedules/{id}`                 | Fetch a single schedule with full details |
| `PATCH`  | `/knowledge_base/crawl-schedules/{id}`                 | Update config, payload, or active state   |
| `DELETE` | `/knowledge_base/crawl-schedules/{id}`                 | Delete a schedule and cancel its trigger  |

Pausing a schedule (setting `isActive: false` via PATCH) keeps it in the database but stops it from firing. Re-activating it re-registers the cron trigger.
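
As a rough sketch of pausing a schedule from a script, the snippet below builds (but does not send) the `PATCH` request using only the standard library. The base URL and the bearer-token header are placeholders, since this page does not document the host or the authentication scheme; only the paths and the `isActive` field come from the text above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder host, not documented here


def list_schedules_url(knowledge_base_slug: str) -> str:
    """URL for listing all schedules of one knowledge base."""
    query = urlencode({"knowledgeBase": knowledge_base_slug})
    return f"{BASE_URL}/knowledge_base/crawl-schedules?{query}"


def set_schedule_active(schedule_id: str, is_active: bool, token: str) -> Request:
    """Build a PATCH request that pauses or re-activates a schedule."""
    body = json.dumps({"isActive": is_active}).encode("utf-8")
    return Request(
        f"{BASE_URL}/knowledge_base/crawl-schedules/{schedule_id}",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


req = set_schedule_active("abc123", False, "TOKEN")
print(req.get_method(), req.full_url)
# PATCH https://api.example.com/knowledge_base/crawl-schedules/abc123
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out so the sketch can be run without network access.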

***

### Why Was My Scheduled Crawl Skipped?

A scheduled crawl can be skipped for the following reasons:

| Skip Reason                 | What It Means                                                                              | What Happens Next                                                                  |
| --------------------------- | ------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------- |
| `feature_flag_off`          | The feature flag was turned off after the schedule was created.                            | The schedule is preserved. It will resume automatically if the flag is re-enabled. |
| `concurrency_cap`           | The system limit of **2 simultaneous crawls** was already reached when the schedule fired. | The schedule retries automatically on its next scheduled tick.                     |
| `dispatch_error: <message>` | An unexpected error occurred while trying to start the crawl job.                          | Check the error message for details and contact support if the issue persists.     |

> **Note:** Skipped ticks do not delete or disable your schedule. In most cases, the schedule recovers automatically.
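
If you process skip reasons in your own tooling, the three formats in the table can be told apart as sketched below. The function name and the way the reason string reaches you are assumptions; the next-step texts paraphrase the table.

```python
NEXT_STEP = {
    "feature_flag_off": "resumes automatically if the flag is re-enabled",
    "concurrency_cap": "retries automatically on the next scheduled tick",
    "dispatch_error": "check the message and contact support if it persists",
}


def parse_skip_reason(reason: str) -> tuple[str, str, str]:
    """Split a skip reason into (kind, detail, next step)."""
    if reason.startswith("dispatch_error:"):
        detail = reason.split(":", 1)[1].strip()
        return "dispatch_error", detail, NEXT_STEP["dispatch_error"]
    if reason in ("feature_flag_off", "concurrency_cap"):
        return reason, "", NEXT_STEP[reason]
    return "unknown", reason, "not covered by this guide"


print(parse_skip_reason("dispatch_error: queue unavailable")[:2])
# ('dispatch_error', 'queue unavailable')
```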

***

### System Limits

* **Max 5 start URLs** per schedule.
* **Max 2 concurrent scheduled crawls** system-wide across all tenants at any given moment. Schedules that fire while the cap is reached are skipped and retried automatically at their next scheduled tick.
* Concurrency slots have a **6-hour TTL** - if a crawl process crashes without releasing its slot, the slot is automatically reclaimed.
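
The interaction between the concurrency cap and the slot TTL can be modelled with a small in-memory pool. This is an illustrative sketch under stated assumptions, not the system's implementation: the class, its method names, and the injectable clock are all hypothetical.

```python
import time


class SlotPool:
    """At most `cap` slots; a slot that is never released expires after `ttl_s`."""

    def __init__(self, cap: int = 2, ttl_s: float = 6 * 3600, clock=time.monotonic):
        self.cap, self.ttl_s, self.clock = cap, ttl_s, clock
        self.slots: dict[str, float] = {}  # crawl id -> expiry timestamp

    def acquire(self, crawl_id: str) -> bool:
        now = self.clock()
        # Reclaim slots whose TTL has elapsed (e.g. the crawl process crashed).
        self.slots = {k: exp for k, exp in self.slots.items() if exp > now}
        if len(self.slots) >= self.cap:
            return False  # cap reached: this tick is skipped
        self.slots[crawl_id] = now + self.ttl_s
        return True

    def release(self, crawl_id: str) -> None:
        self.slots.pop(crawl_id, None)


now = [0.0]  # fake clock so the example runs instantly
pool = SlotPool(clock=lambda: now[0])
print(pool.acquire("a"), pool.acquire("b"), pool.acquire("c"))  # True True False
now[0] = 6 * 3600  # six hours later, neither slot was released
print(pool.acquire("c"))  # True
```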

***

### Known Behaviours & Gotchas

* **Timezone is a fixed offset.** If your timezone observes DST, the crawl will fire at a shifted wall-clock time after a DST change. To keep wall-clock time consistent, update the schedule's UTC offset manually after the DST transition.
* **Past times are filtered.** When the start date is today, times that have already passed are not shown; the picker automatically advances to the next available future slot.
* **"Every week" fires on the start date's weekday.** If you set a start date on a Wednesday and choose "Every week", the crawl will always fire on Wednesdays.
* **"Every month" fires on the start date's day-of-month.** If you start on the 31st, the crawl does not fire in months with fewer than 31 days.
* **Deleted pages are not removed.** Re-crawls update and add content but do not prune documents for URLs that have disappeared from the site.
