Distributed scraper pool, checkpoint resume, delta detection, and per-job process isolation. Web pages, PDFs, YouTube transcripts, and SharePoint - unified into a single ranked search index per site, kept fresh hour by hour.
The Problem
The typical setup: a handful of uploaded PDFs, a daily cron job that quietly fails on a malformed file, and an embedding model locked at the platform level. Customers ask the bot a question, get a stale or wrong answer, and lose trust in the channel. Knowledge ingestion isn't a checkbox feature. It is the substrate the entire AI experience runs on. If it's flaky, the bot is flaky.
Pipeline Architecture
Every stage of the pipeline is independently scaled, isolated, and instrumented. A failure at one stage doesn't cascade backwards or stall the queue.
What's Different
Seven engineering choices that separate a hardened ingestion pipeline from a one-time PDF upload.
Only what changed gets re-embedded. Idle rebuilds are free.
A single bad document cannot stall the pipeline.
Burst capacity for big sites, no idle cost for quiet ones.
Pick the model that fits each deployment - not the platform default.
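As a rough illustration of the first choice above (re-embed only what changed), here is a minimal sketch of hash-based delta detection. It is not Velaro's implementation: the function names, the whitespace normalization, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of whitespace-normalized page text."""
    # Collapsing whitespace keeps cosmetic reflows from looking like edits
    # (an assumption for this sketch; a real pipeline may normalize differently).
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_deltas(previous: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Classify URLs by comparing stored fingerprints with a fresh crawl.

    previous: URL -> fingerprint saved by the last run (hypothetical store).
    crawled:  URL -> page text fetched in this run.
    """
    current = {url: content_fingerprint(text) for url, text in crawled.items()}
    return {
        "new": sorted(u for u in current if u not in previous),
        "changed": sorted(u for u in current if u in previous and previous[u] != current[u]),
        "removed": sorted(u for u in previous if u not in current),
        "unchanged": sorted(u for u in current if previous.get(u) == current[u]),
    }
```

Only the URLs under `new` and `changed` would need re-embedding. When both lists are empty, the run has nothing to embed.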
Content Sources
Web, documents, video, and SharePoint - unified into a single retrieval surface so the bot answers across every source at once instead of guessing which silo to look in.
Sitemap-aware crawling with robots.txt respect. Hourly delta detection catches new and updated URLs without manual touch.
All bot plans
Drag-and-drop PDFs, Word docs, Markdown, and plain text. Per-document subprocess means one bad file never kills the rest.
All bot plans
Point at a channel or playlist - we ingest captions and chapter metadata. Tutorials and product demos become searchable answers.
All bot plans
Microsoft Graph integration with Sites.Selected scoping. Detects file changes and re-embeds only the deltas.
Bot Enterprise
Architecture Comparison
Most chat platforms treat ingestion as a one-time upload. Here's what changes when you treat it as production infrastructure.
| Concern | Typical chat-bot ingestion | Velaro ingestion |
|---|---|---|
| Scraper topology | Single-server scraper. One bad page can stall the queue. | Distributed pool, up to 30 replicas, scale-to-zero idle. |
| Retry behavior | Restart from page 1 on failure. Hours of re-crawling. | Checkpoint resume. Picks up from the last completed page. |
| Reindex schedule | Daily batch. Up to 24 hours of staleness, full-cost rebuild every night. | Hourly dispatch + delta detection. Free rebuild when nothing changed. |
| Bad-document blast radius | PDF parsed in the main process. One corrupt file kills the whole job. | Subprocess isolation. One bad file fails alone; the rest keep going. |
| Embedding model | Platform-wide default. Take it or leave it. | Per-site choice between text-embedding-3-large and text-embedding-3-small. |
| Source coverage | Web only, or document upload only - rarely both ranked together. | Web + PDF + YouTube + SharePoint, unified into one ranked index. |
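The checkpoint-resume row in the table can be pictured with a small sketch. It assumes a JSON file as the checkpoint store and a caller-supplied `fetch` function; both are hypothetical stand-ins, and the real pipeline may persist progress differently.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> set[str]:
    """Return the set of URLs already completed, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f))


def save_checkpoint(path: str, done: set[str]) -> None:
    """Persist progress atomically: write a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def crawl(urls, fetch, checkpoint_path):
    """Process URLs in order, skipping any that a previous run completed."""
    done = load_checkpoint(checkpoint_path)
    processed = []
    for url in urls:
        if url in done:
            continue
        fetch(url)  # may raise; progress saved so far survives the failure
        done.add(url)
        save_checkpoint(checkpoint_path, done)
        processed.append(url)
    return processed
```

If `fetch` raises on one page, a later call to `crawl` with the same checkpoint path resumes at that page instead of starting over from the first URL.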
Customer Scenarios
Real situations where the ingestion architecture is the difference between a bot that's helpful and a bot that gets turned off.
SharePoint sync auto-detects which files changed and only re-embeds those pages. No manual upload churn, no version drift between SharePoint and the bot, no quarterly project to re-import everything.
Hourly auto-crawl with delta detection picks up new URLs as they go live. The bot can answer questions about new content within an hour of publish, without anyone touching the admin console.
PDF subprocess isolation means a corrupt page in one manual doesn't break ingestion for the other 2,999 pages. The job logs the failed page, moves on, and the rest of the manual is searchable.
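The isolation idea in the last scenario can be sketched as follows, assuming each unit of work is parsed in its own child process. The `WORKER` script is a stand-in that simulates a parser crash for any path containing "corrupt"; a real parser, and the names `parse_isolated` and `ingest`, are hypothetical.

```python
import subprocess
import sys

# Stand-in worker. A real PDF parser would run here; this one exits non-zero
# for any path containing "corrupt" to simulate a crash (demo assumption).
WORKER = """
import sys
path = sys.argv[1]
if "corrupt" in path:
    sys.exit(1)
print("parsed " + path)
"""


def parse_isolated(path: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Parse one document in a child process so a crash cannot take down the job."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", WORKER, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, result.stdout.strip()


def ingest(paths):
    """Run every document through the isolated parser and record failures."""
    succeeded, failed = [], []
    for path in paths:
        ok, _detail = parse_isolated(path)
        (succeeded if ok else failed).append(path)
    return succeeded, failed
```

A failed or hung child affects only its own entry in `failed`; the loop continues with the remaining documents.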
What's Included
Ingestion is part of the platform, not a metered add-on you discover on your second invoice.
Web crawling with delta detection.
PDF, Word, Markdown, and plain text upload.
YouTube transcript ingestion.
Per-site choice between text-embedding-3-large and text-embedding-3-small.
Hourly distributed dispatch with checkpoint resume.
SharePoint and OneDrive sync via Microsoft Graph.
Sites.Selected scoping so the bot only sees what you authorize.
Higher per-site replica ceilings for very large corpora.
Index migration tooling for moving between embedding models.
Priority dispatch slot during reindex bursts.
Smart-scan rebuilds don't count against your monthly job quota when nothing has changed. Conversation volume is shared across every bot on your account and is set by your base plan tier. Contact sales if you expect very high volume or are deploying across multiple brands.
Engineering FAQ
Specifics that come up in every architecture review.
/oauth2/v2.0/token with scope=resourceUrl/.default) and respects the stored scope on every call so a misconfigured request can't escalate to other tenant resources.
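The fragment above appears to describe a token request against the Microsoft identity platform v2.0 endpoint with a `/.default` scope. Assuming this is the OAuth 2.0 client-credentials flow (the source text does not say so explicitly), a request of that shape could be assembled like this. The function name and all argument values are hypothetical, and the sketch only builds the request rather than sending it.

```python
from urllib.parse import urlencode


def build_token_request(tenant_id: str, client_id: str, client_secret: str,
                        resource_url: str) -> tuple[str, str]:
    """Return (token_url, form_body) for a client-credentials token request.

    Assumption: the flow is client credentials; the scope is the resource URL
    followed by "/.default", as in the text above.
    """
    token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{resource_url}/.default",
    })
    return token_url, body
```

The body is sent as `application/x-www-form-urlencoded` in a POST to the token URL. With Sites.Selected, the resulting token should only reach the sites an administrator has explicitly granted to the app.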
30-minute walkthrough. We point the scraper at your sources live and show you the index update in real time.