Enterprise Content Ingestion - Distributed Scraper, Delta Re-Index, Unified Search

The Problem

Most AI bots are only as good as their last upload.

A handful of uploaded PDFs, a daily cron that quietly fails on a malformed file, an embedding model locked at the platform level. Customers ask the bot a question, get a stale or wrong answer, and lose trust in the channel. Knowledge ingestion isn't a checkbox feature. It is the substrate the entire AI experience runs on. If it's flaky, the bot is flaky.

30

Max Scraper Replicas

Auto-scales under load. Drops to zero when idle so you don't pay for idle compute.

4

Content Sources Unified

Web pages, PDFs and uploads, YouTube transcripts, SharePoint and OneDrive - one index per site.

1h

Re-Crawl Cadence

Hourly distributed dispatch with jitter. New content searchable within ~60 minutes of publish.

2

Embedding Models

Choose text-embedding-3-large for precision, 3-small for cost. Configurable per site.

Pipeline Architecture

Distributed. Crash-resumable. Observable end to end.

Every stage of the pipeline is independently scaled, isolated, and instrumented. A failure at one stage doesn't cascade backwards or stall the queue.

Simplified Ingestion Pipeline - Public Reference - Not All Components Shown

Sources

Web Pages
Crawled hourly

PDFs & Uploads
Admin / API

YouTube
Transcript ingest

SharePoint
Microsoft Graph

Hourly dispatcher with delta detection (skip unchanged)

Scraper Pool

Replica 1

Replica 2

Replica N
up to 30 - scale-to-zero

Checkpoint resume - retries pick up at last completed page

Per-Job Process

Subprocess Isolation
One bad doc fails alone

Chunking & Cleaning
HTML / PDF / VTT

Dedup / Hash
Skip unchanged blocks

Per-site embedding model (3-large or 3-small)

Embeddings

Azure OpenAI
text-embedding-3-large

Azure OpenAI
text-embedding-3-small

Single ranked index per site - tenant-isolated

Search Index

Azure AI Search
Hybrid keyword + vector retrieval


What's Different

Built for corpora that change every day.

The engineering choices that separate a hardened ingestion pipeline from a one-time PDF upload.

Continuous Re-Crawl & Deltas

Only what changed gets re-embedded. Idle rebuilds are free.

Hourly distributed dispatch with a +/-12 hour jitter window so no two sites collide
Page-level hashing skips unchanged content automatically
Free smart-scan rebuilds don't count against your monthly job quota when nothing changed
No 4 a.m. thundering herd that takes embeddings down for everyone at once
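
One way to picture the jitter: each site gets a stable offset inside the window, so dispatch times spread out instead of stacking up. This is a minimal sketch, assuming the offset is derived from a hash of a site identifier. The function name, the site ID format, and the hashing approach are illustrative assumptions, not a description of the production dispatcher.

```python
import hashlib


def jitter_offset_seconds(site_id: str, window_seconds: int) -> int:
    """Map a site ID to a stable offset in [-window_seconds, +window_seconds]."""
    digest = hashlib.sha256(site_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then fold it into the window.
    value = int.from_bytes(digest[:8], "big")
    span = 2 * window_seconds + 1
    return (value % span) - window_seconds


# Hypothetical site IDs; the same ID always yields the same offset.
window = 12 * 3600
offsets = {site: jitter_offset_seconds(site, window) for site in ("site-a", "site-b")}
```

Because the offset depends only on the site ID, a restart of the dispatcher would not reshuffle the schedule.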

Crash Resilience & Isolation

A single bad document cannot stall the pipeline.

Checkpoint resume - retries pick up at the last completed page, not page 1
Per-job subprocess - a corrupt PDF kills its own process, not the queue
Failed-page logging - the rest of the document keeps indexing
Idempotent inserts - retries don't duplicate vectors

Elastic Scraper Pool

Burst capacity for big sites, no idle cost for quiet ones.

Up to 30 replicas spin up under load
Scale-to-zero when the queue is empty - you only pay for compute that's actually running
Per-site rate limiting protects upstream sources from getting hammered
Robots.txt and sitemap-aware crawling out of the box
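
The scaling behavior above can be sketched as a simple sizing rule: zero replicas for an empty queue, more replicas as the queue grows, capped at 30. The jobs-per-replica figure and the function name are assumptions for illustration only. The real autoscaler's signals are not described here.

```python
import math

MAX_REPLICAS = 30  # ceiling stated above


def desired_replicas(queued_jobs: int, jobs_per_replica: int = 5) -> int:
    """Return how many scraper replicas to run for the current queue depth."""
    if queued_jobs <= 0:
        return 0  # scale-to-zero when the queue is empty
    return min(MAX_REPLICAS, math.ceil(queued_jobs / jobs_per_replica))
```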

Per-Site Embedding Choice

Pick the model that fits each deployment - not the platform default.

text-embedding-3-large for high-precision sites with technical or regulated content
text-embedding-3-small for cost-sensitive deployments with simple FAQs
Per-site selection - never locked at the platform tier
Migration tooling for moving an existing index between models

Content Sources

Four sources. One ranked index per site.

Web, documents, video, and SharePoint - unified into a single retrieval surface so the bot answers across every source at once instead of guessing which silo to look in.

Web Pages

Sitemap-aware crawling with robots.txt respect. Hourly delta detection catches new and updated URLs without manual touch.

All bot plans
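
For readers who want to see what robots.txt respect looks like in practice, here is a minimal sketch using Python's standard-library parser. The user agent string, the rules, and the URLs are made up for the example, and the production crawler may handle robots.txt differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from lines rather than fetched.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```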

PDFs & Uploads

Drag-and-drop PDFs, Word docs, Markdown, and plain text. Per-document subprocess means one bad file never kills the rest.

All bot plans

YouTube Transcripts

Point at a channel or playlist - we ingest captions and chapter metadata. Tutorials and product demos become searchable answers.

All bot plans

SharePoint & OneDrive

Microsoft Graph integration with Sites.Selected scoping. Detects file changes and re-embeds only the deltas.

Bot Enterprise

Architecture Comparison

Typical chat-bot ingestion vs. Velaro.

Most chat platforms treat ingestion as a one-time upload. Here's what changes when you treat it as production infrastructure.

Concern	Typical chat-bot ingestion	Velaro ingestion
Scraper topology	Single-server scraper. One bad page can stall the queue.	Distributed pool, up to 30 replicas, scale-to-zero idle.
Retry behavior	Restart from page 1 on failure. Hours of re-crawling.	Checkpoint resume. Picks up from the last completed page.
Reindex schedule	Daily batch. Up to 24 hours of staleness, full-cost rebuild every night.	Hourly dispatch + delta detection. Free rebuild when nothing changed.
Bad-document blast radius	PDF parsed in the main process. One corrupt file kills the whole job.	Subprocess isolation. One bad file fails alone; the rest keep going.
Embedding model	Platform-wide default. Take it or leave it.	Per-site choice between 3-large and 3-small.
Source coverage	Web only, or document upload only - rarely both ranked together.	Web + PDF + YouTube + SharePoint, unified into one ranked index.

Customer Scenarios

Three places this actually shows up.

Real situations where the ingestion architecture is the difference between a bot that's helpful and a bot that gets turned off.

"Our compliance team updates 200 policy docs in SharePoint every quarter."

SharePoint sync auto-detects which files changed and re-embeds only those files. No manual upload churn, no version drift between SharePoint and the bot, no quarterly project to re-import everything.

"Our marketing team adds 10 new blog posts a week."

Hourly auto-crawl with delta detection picks up new URLs as they go live. The bot can answer questions about new content within an hour of publish, without anyone touching the admin console.

"We have 3,000-page product manuals."

PDF subprocess isolation means a corrupt page in one manual doesn't break ingestion for the other 2,999 pages. The job logs the failed page, moves on, and the rest of the manual is searchable.

What's Included

Bundled with the bot. No surprise per-page fees.

Ingestion is part of the platform, not a metered add-on you discover on your second invoice.

Included with every Bot subscription

Web crawling with delta detection.

PDF, Word, Markdown, and plain text upload.

YouTube transcript ingestion.

Per-site choice between text-embedding-3-large and text-embedding-3-small.

Hourly distributed dispatch with checkpoint resume.

Bot Enterprise tier

SharePoint and OneDrive sync via Microsoft Graph.

Sites.Selected scoping so the bot only sees what you authorize.

Higher per-site replica ceilings for very large corpora.

Index migration tooling for moving between embedding models.

Priority dispatch slot during reindex bursts.

Smart-scan rebuilds don't count against your monthly job quota when nothing has changed. Conversation volume is shared across every bot on your account and is set by your base plan tier. Contact sales if you expect very high volume or are deploying across multiple brands.

Engineering FAQ

Questions we hear from technical buyers.

Specifics that come up in every architecture review.

For web sources, the dispatcher runs hourly with a jittered window. New URLs are typically searchable within ~60 minutes of publish. For PDF and document uploads, ingestion begins immediately on upload and completes in seconds to minutes depending on document size. SharePoint changes are picked up on the next hourly poll.

Every scrape job records a checkpoint at the page level. On retry, the job resumes from the last completed page rather than restarting at page 1. Combined with idempotent vector inserts, this means a crash at page 47 of 100 costs you no duplicate work and roughly one page of re-fetching - not the entire crawl.
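
The resume behavior can be sketched in a few lines. This is an illustration under stated assumptions: the checkpoint is a plain dictionary, the index is a dictionary keyed by URL (standing in for an idempotent upsert), and `fetch` is a caller-supplied function. None of these names come from the product.

```python
from typing import Callable, Dict, List


def run_crawl(
    pages: List[str],
    checkpoint: Dict[str, int],
    index: Dict[str, str],
    fetch: Callable[[str], str],
) -> None:
    """Process pages in order, resuming after the last completed page."""
    start = checkpoint.get("last_completed", -1) + 1
    for i in range(start, len(pages)):
        url = pages[i]
        content = fetch(url)   # may raise; the checkpoint then stays at i - 1
        index[url] = content   # keyed write: a retry overwrites, never duplicates
        checkpoint["last_completed"] = i
```

If `fetch` raises partway through, calling `run_crawl` again with the same checkpoint re-fetches only the page that failed and continues from there.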

PDF and Office document parsing is one of the most common sources of native-code crashes in any ingestion pipeline - corrupt headers, malformed embedded fonts, decompression bombs. When parsing happens in the main worker process, one bad document can take down the whole job and every other document in it. Velaro spawns a per-document subprocess so a crash kills only that document. The page is logged, the job continues, and the rest of the corpus indexes normally.
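
A minimal sketch of the isolation idea, assuming each document is parsed in a child Python process and a non-zero exit code is treated as a failed document. The parser snippet below is a hypothetical stand-in that simulates a crash. It is not the real parser, and the function names are illustrative.

```python
import subprocess
import sys
from typing import List, Tuple

# Hypothetical stand-in for a real parser: exits abnormally on "corrupt" input.
PARSER_SNIPPET = (
    "import os, sys\n"
    "doc = sys.argv[1]\n"
    "if 'corrupt' in doc:\n"
    "    os._exit(139)  # simulate a native-code crash\n"
    "print(doc.upper())\n"
)


def parse_isolated(doc: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Parse one document in its own process; a crash affects only that process."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", PARSER_SNIPPET, doc],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, result.stdout.strip()


def ingest(docs: List[str]) -> Tuple[List[str], List[str]]:
    """Return (parsed outputs, failed documents); failures never stop the loop."""
    indexed, failed = [], []
    for doc in docs:
        ok, output = parse_isolated(doc)
        if ok:
            indexed.append(output)
        else:
            failed.append(doc)
    return indexed, failed
```

The timeout is an extra guard added for the sketch, covering documents that hang rather than crash.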

Yes. Each site picks between text-embedding-3-large (3072 dimensions, higher precision, used for technical docs, regulated content, and large corpora) and text-embedding-3-small (1536 dimensions, lower cost, used for simple FAQ-style sites). The choice is made per site in the admin console - never locked at the platform tier. Migration tooling is included for switching an existing site between models.
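
As an illustration, a per-site setting could be modeled like this. The class and function names are hypothetical. The dimension counts are the ones stated above. The migration check reflects a general property of embeddings: vectors produced by different models are not comparable, so switching models means re-embedding the site.

```python
from dataclasses import dataclass

# Output dimensions for the two models named above.
MODEL_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}


@dataclass(frozen=True)
class SiteEmbeddingConfig:
    site_id: str
    model: str

    def __post_init__(self) -> None:
        if self.model not in MODEL_DIMENSIONS:
            raise ValueError(f"unsupported embedding model: {self.model}")

    @property
    def dimensions(self) -> int:
        return MODEL_DIMENSIONS[self.model]


def needs_migration(current: SiteEmbeddingConfig, target_model: str) -> bool:
    """A model switch requires re-embedding; staying on the same model does not."""
    return current.model != target_model
```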

SharePoint and OneDrive sync uses Microsoft Graph with the Sites.Selected permission model. You grant access to specific sites or document libraries - not the whole tenant. The integration authenticates with OAuth 2.0 against the v2.0 token endpoint (/oauth2/v2.0/token with scope=resourceUrl/.default) and respects the stored scope on every call so a misconfigured request can't escalate to other tenant resources.
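
For orientation, this is roughly what a token request against that endpoint looks like. The sketch assumes the client-credentials grant and Microsoft Graph as the resource URL, neither of which is stated above. It only builds the URL and form body and sends nothing. The function name and placeholder values are illustrative.

```python
from typing import Tuple
from urllib.parse import urlencode


def build_token_request(
    tenant_id: str,
    client_id: str,
    client_secret: str,
    resource_url: str = "https://graph.microsoft.com",  # assumed resource
) -> Tuple[str, str]:
    """Return the token endpoint URL and the form-encoded POST body."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",   # assumed grant type
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{resource_url}/.default",  # the scope form described above
    })
    return url, body
```

With the `.default` scope, the token carries whatever permissions an administrator has already granted to the application, which is what keeps access limited to the selected sites.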

When the scheduled re-crawl runs, the dispatcher hashes each page and compares it to the last known hash. Pages that haven't changed never enter the embedding pipeline - no embedding cost, no quota consumed. A re-crawl across a site where nothing has changed is effectively free. Only pages where the content actually changed count against your monthly job quota.
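
The hash comparison described above can be sketched as follows. SHA-256 and the in-memory dictionary of previous hashes are assumptions made for the example. The actual hash function and storage are not specified here.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Return a hex digest of the page content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_pages(crawled: Dict[str, str], last_hashes: Dict[str, str]) -> List[str]:
    """Return URLs whose content differs from the last known hash, updating the store."""
    changed = []
    for url, text in crawled.items():
        digest = content_hash(text)
        if last_hashes.get(url) != digest:
            changed.append(url)        # only these would go on to embedding
            last_hashes[url] = digest
    return changed
```

A second pass over identical content returns an empty list, which is the case where nothing is re-embedded.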

Yes. Web pages, PDFs, YouTube transcripts, and SharePoint files all land in a single ranked index per site. The bot does one hybrid keyword + vector search across all of them - it doesn't have to guess which source to look in. Results are ranked by relevance to the query regardless of where the content originated.
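
To show how a keyword ranking and a vector ranking can be merged into one list, here is a sketch of Reciprocal Rank Fusion, the method Azure AI Search documents for hybrid queries. Whether this pipeline relies on the service defaults is an assumption, and the constant k = 60 and the document IDs are illustrative.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; documents ranked high in any list score higher."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from two retrievers over the same index.
keyword_hits = ["pdf-12", "web-3", "yt-7"]
vector_hits = ["web-3", "sp-9", "pdf-12"]
merged = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear in both lists rise to the top, whatever source they came from.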

The substrate your AI bot actually runs on.

Want to see the pipeline run on your actual content?