Enterprise Content Ingestion

The substrate your AI bot actually runs on.

Distributed scraper pool, checkpoint resume, delta detection, and per-job process isolation. Web pages, PDFs, YouTube transcripts, and SharePoint - unified into a single ranked search index per site, kept fresh hour by hour.

  • Hourly Re-Crawl: delta detection, free idle rebuilds
  • Up to 30 Replicas: burst capacity, scale-to-zero idle
  • Process Isolation: one bad PDF can't kill the job
  • Unified Index: Web + PDF + YouTube + SharePoint

Most AI bots are only as good as their last upload.

A handful of uploaded PDFs, a daily cron that quietly fails on a malformed file, an embedding model locked at the platform level. Customers ask the bot a question, get a stale or wrong answer, and lose trust in the channel. Knowledge ingestion isn't a checkbox feature. It is the substrate the entire AI experience runs on. If it's flaky, the bot is flaky.

  • 30 Max Scraper Replicas: auto-scales under load. Drops to zero when idle so you don't pay for idle compute.
  • 4 Content Sources Unified: web pages, PDFs and uploads, YouTube transcripts, SharePoint and OneDrive - one index per site.
  • 1h Re-Crawl Cadence: hourly distributed dispatch with jitter. New content searchable within ~60 minutes of publish.
  • 2 Embedding Models: choose text-embedding-3-large for precision, 3-small for cost. Configurable per site.

Distributed. Crash-resumable. Observable end to end.

Every stage of the pipeline is independently scaled, isolated, and instrumented. A failure at one stage doesn't cascade backwards or stall the queue.

Simplified Ingestion Pipeline (public reference; not all components shown)

  • Sources: web pages (crawled hourly), PDFs and uploads (admin / API), YouTube (transcript ingest), SharePoint (Microsoft Graph)
  • Dispatch: hourly dispatcher with delta detection (skips unchanged content)
  • Scraper Pool: replicas 1 to N, up to 30, scale-to-zero; checkpoint resume so retries pick up at the last completed page
  • Per-Job Processing: subprocess isolation (one bad doc fails alone), chunking and cleaning (HTML / PDF / VTT), dedup by hash (skips unchanged blocks)
  • Embeddings: per-site embedding model, Azure OpenAI text-embedding-3-large or text-embedding-3-small
  • Search Index: Azure AI Search, hybrid keyword + vector retrieval; single ranked index per site, tenant-isolated

Built for corpora that change every day.

Four engineering choices that separate a hardened ingestion pipeline from a one-time PDF upload.

Continuous Re-Crawl & Deltas

Only what changed gets re-embedded. Idle rebuilds are free.

  • Hourly distributed dispatch with a +/-12 hour jitter window so no two sites collide
  • Page-level hashing skips unchanged content automatically
  • Smart-scan rebuilds are free - they don't count against your monthly job quota when nothing has changed
  • No 4 a.m. thundering herd that takes embeddings down for everyone at once
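
The jittered dispatch in the list above can be sketched roughly as follows. This is a minimal illustration, assuming each site gets a stable offset derived from a hash of its ID; the function name `dispatch_offset_seconds`, the site IDs, and the one-hour window used here are hypothetical and not taken from Velaro's scheduler.

```python
import hashlib


def dispatch_offset_seconds(site_id: str, window_seconds: int) -> int:
    """Map a site ID to a stable offset inside the jitter window.

    Hashing (instead of drawing a random number) keeps a site's slot the
    same across scheduler restarts while still spreading sites out.
    """
    digest = hashlib.sha256(site_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


# Hypothetical site IDs and window size, for illustration only.
offsets = {
    site: dispatch_offset_seconds(site, 3600)
    for site in ("site-a", "site-b", "site-c")
}
```

Because the offset depends only on the site ID, the same site always lands in the same slot, which is what prevents every tenant from hitting the embedding service at the same moment.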

Crash Resilience & Isolation

A single bad document cannot stall the pipeline.

  • Checkpoint resume - retries pick up at the last completed page, not page 1
  • Per-job subprocess - a corrupt PDF kills its own process, not the queue
  • Failed-page logging - the rest of the document keeps indexing
  • Idempotent inserts - retries don't duplicate vectors
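
Checkpoint resume and idempotent inserts work together, and the combination can be sketched in a few lines. This is an illustration under stated assumptions: the in-memory dicts stand in for durable checkpoint and vector stores, and `vector_id`, `run_job`, and the `next_page` key are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List


def vector_id(url: str, chunk_index: int) -> str:
    """Deterministic ID: the same chunk always maps to the same key."""
    return hashlib.sha256(f"{url}#{chunk_index}".encode("utf-8")).hexdigest()


def run_job(
    pages: List[str],
    checkpoint: Dict[str, int],
    index: Dict[str, str],
    fetch: Callable[[str], str],
) -> None:
    """Process pages in order, resuming after the last completed page."""
    start = checkpoint.get("next_page", 0)
    for i in range(start, len(pages)):
        text = fetch(pages[i])                # may raise; checkpoint stays put
        index[vector_id(pages[i], 0)] = text  # upsert: a retry overwrites, never duplicates
        checkpoint["next_page"] = i + 1       # record progress only after success
```

If `fetch` raises on a page, calling `run_job` again with the same `checkpoint` re-fetches only that page and continues from there.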

Elastic Scraper Pool

Burst capacity for big sites, no idle cost for quiet ones.

  • Up to 30 replicas spin up under load
  • Scale-to-zero when the queue is empty - you only pay for compute that's actually running
  • Per-site rate limiting protects upstream sources from getting hammered
  • Robots.txt and sitemap-aware crawling out of the box
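
A scaling rule with a 30-replica ceiling and scale-to-zero could look something like the sketch below. The ceiling comes from the text above; the `jobs_per_replica` knob and the function name are hypothetical, and a real autoscaler would also smooth out rapid changes.

```python
import math

MAX_REPLICAS = 30  # ceiling stated in the text


def desired_replicas(queued_jobs: int, jobs_per_replica: int = 5) -> int:
    """Return how many scraper replicas to run for the current queue depth.

    jobs_per_replica is a hypothetical tuning knob, not a documented value.
    """
    if queued_jobs <= 0:
        return 0  # scale to zero when the queue is empty
    return min(MAX_REPLICAS, math.ceil(queued_jobs / jobs_per_replica))
```

With these assumed defaults, an empty queue yields 0 replicas, 12 queued jobs yield 3, and anything past 150 is capped at 30.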

Per-Site Embedding Choice

Pick the model that fits each deployment - not the platform default.

  • text-embedding-3-large for high-precision sites with technical or regulated content
  • text-embedding-3-small for cost-sensitive deployments with simple FAQs
  • Per-site selection - never locked at the platform tier
  • Migration tooling for moving an existing index between models

Four sources. One ranked index per site.

Web, documents, video, and SharePoint - unified into a single retrieval surface so the bot answers across every source at once instead of guessing which silo to look in.

Web Pages

Sitemap-aware crawling with robots.txt respect. Hourly delta detection catches new and updated URLs without manual touch.
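
For readers who want to see what robots.txt and sitemap awareness involves, here is a small sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the URLs, and the user agent string are made up for illustration; this is not Velaro's crawler code.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, parsed offline so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-crawler"  # hypothetical user agent string

allowed = parser.can_fetch(USER_AGENT, "https://example.com/docs/intro")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests
sitemaps = parser.site_maps()           # sitemap URLs to seed the crawl (Python 3.8+)
```

A crawler would skip any URL where `can_fetch` returns False, wait `delay` seconds between requests, and use `sitemaps` to discover new URLs.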

All bot plans

PDFs & Uploads

Drag-and-drop PDFs, Word docs, Markdown, and plain text. Per-document subprocess means one bad file never kills the rest.

All bot plans

YouTube Transcripts

Point at a channel or playlist - we ingest captions and chapter metadata. Tutorials and product demos become searchable answers.

All bot plans

SharePoint & OneDrive

Microsoft Graph integration with Sites.Selected scoping. Detects file changes and re-embeds only the deltas.

Bot Enterprise

Typical chat-bot ingestion vs. Velaro.

Most chat platforms treat ingestion as a one-time upload. Here's what changes when you treat it as production infrastructure.

Concern | Typical chat-bot ingestion | Velaro ingestion
Scraper topology | Single-server scraper. One bad page can stall the queue. | Distributed pool, up to 30 replicas, scale-to-zero idle.
Retry behavior | Restart from page 1 on failure. Hours of re-crawling. | Checkpoint resume. Picks up from the last completed page.
Reindex schedule | Daily batch. Up to 24 hours of staleness, full-cost rebuild every night. | Hourly dispatch + delta detection. Free rebuild when nothing changed.
Bad-document blast radius | PDF parsed in the main process. One corrupt file kills the whole job. | Subprocess isolation. One bad file fails alone; the rest keep going.
Embedding model | Platform-wide default. Take it or leave it. | Per-site choice between 3-large and 3-small.
Source coverage | Web only, or document upload only - rarely both ranked together. | Web + PDF + YouTube + SharePoint, unified into one ranked index.

Three places this actually shows up.

Real situations where the ingestion architecture is the difference between a bot that's helpful and a bot that gets turned off.

"Our compliance team updates 200 policy docs in SharePoint every quarter."

SharePoint sync auto-detects which files changed and re-embeds only those files. No manual upload churn, no version drift between SharePoint and the bot, no quarterly project to re-import everything.

"Our marketing team adds 10 new blog posts a week."

Hourly auto-crawl with delta detection picks up new URLs as they go live. The bot can answer questions about new content within an hour of publish, without anyone touching the admin console.

"We have 3,000-page product manuals."

PDF subprocess isolation means a corrupt page in one manual doesn't break ingestion for the other 2,999 pages. The job logs the failed page, moves on, and the rest of the manual is searchable.

Bundled with the bot. No surprise per-page fees.

Ingestion is part of the platform, not a metered add-on you discover on your second invoice.

Included with every Bot subscription

Web crawling with delta detection.

PDF, Word, Markdown, and plain text upload.

YouTube transcript ingestion.

Per-site choice between text-embedding-3-large and text-embedding-3-small.

Hourly distributed dispatch with checkpoint resume.

Bot Enterprise tier

SharePoint and OneDrive sync via Microsoft Graph.

Sites.Selected scoping so the bot only sees what you authorize.

Higher per-site replica ceilings for very large corpora.

Index migration tooling for moving between embedding models.

Priority dispatch slot during reindex bursts.

Smart-scan rebuilds don't count against your monthly job quota when nothing has changed. Conversation volume is shared across every bot on your account and is set by your base plan tier. Contact sales if you expect very high volume or are deploying across multiple brands.

Questions we hear from technical buyers.

Specifics that come up in every architecture review.

How quickly does new content become searchable?

For web sources, the dispatcher runs hourly with a jittered window. New URLs are typically searchable within ~60 minutes of publication. For PDF and document uploads, ingestion begins immediately on upload and completes in seconds to minutes depending on document size. SharePoint changes are picked up on the next hourly poll.

What happens when a crawl fails partway through?

Every scrape job records a checkpoint at the page level. On retry, the job resumes from the last completed page rather than restarting at page 1. Combined with idempotent vector inserts, this means a crash at page 47 of 100 costs no duplicate work and roughly one page of re-fetching - not the entire crawl.

Why parse each document in its own subprocess?

PDF and Office document parsing is one of the most common sources of native-code crashes in any ingestion pipeline - corrupt headers, malformed embedded fonts, decompression bombs. When parsing happens in the main worker process, one bad document can take down the whole job and every other document in it. Velaro spawns a per-document subprocess so a crash kills only that document. The failure is logged, the job continues, and the rest of the corpus indexes normally.
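
The isolation pattern described above can be sketched with Python's standard `subprocess` module. This is a simplified stand-in, not Velaro's implementation: `PARSER_SNIPPET` fakes a parser that dies on any file name containing "corrupt", and `parse_isolated` and `run_job` are hypothetical names.

```python
import subprocess
import sys
from typing import List, Tuple

# Stand-in for a real parser entry point: exits abnormally on "corrupt" input.
PARSER_SNIPPET = (
    "import os, sys\n"
    "doc = sys.argv[1]\n"
    "if 'corrupt' in doc:\n"
    "    os._exit(1)  # stand-in for a native-code crash\n"
    "print('parsed ' + doc)\n"
)


def parse_isolated(doc: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Parse one document in a child process so a crash cannot reach the caller."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", PARSER_SNIPPET, doc],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, result.stdout.strip()


def run_job(docs: List[str]) -> Tuple[List[str], List[str]]:
    """Return (parsed outputs, failed document names); failures do not stop the job."""
    parsed: List[str] = []
    failed: List[str] = []
    for doc in docs:
        ok, detail = parse_isolated(doc)
        if ok:
            parsed.append(detail)
        else:
            failed.append(doc)  # log the failure and keep going
    return parsed, failed
```

The timeout matters as much as the exit code: a parser stuck on a decompression bomb never crashes, so the parent has to be able to give up on it.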

Can we choose the embedding model per site?

Yes. Each site picks between text-embedding-3-large (3072 dimensions, higher precision, used for technical docs, regulated content, and large corpora) and text-embedding-3-small (1536 dimensions, lower cost, used for simple FAQ-style sites). The choice is made per site in the admin console - never locked at the platform tier. Migration tooling is included for switching an existing site between models.
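
A per-site model setting might be represented along these lines. The dimensions are the defaults quoted above; the `SiteConfig` class and `needs_reindex` helper are hypothetical and only illustrate why switching models calls for migration tooling.

```python
from dataclasses import dataclass

# Default output dimensions for the two models named above.
MODEL_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}


@dataclass(frozen=True)
class SiteConfig:
    site_id: str
    embedding_model: str

    def __post_init__(self) -> None:
        if self.embedding_model not in MODEL_DIMENSIONS:
            raise ValueError(f"unsupported model: {self.embedding_model}")

    @property
    def dimensions(self) -> int:
        return MODEL_DIMENSIONS[self.embedding_model]


def needs_reindex(old: SiteConfig, new: SiteConfig) -> bool:
    """Vectors from different models are not comparable, so a model change means re-embedding."""
    return old.embedding_model != new.embedding_model
```

Vectors produced by one model cannot be searched against vectors from the other, so moving a site between models means re-embedding its content instead of flipping a setting.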

How is SharePoint access scoped?

SharePoint and OneDrive sync uses Microsoft Graph with the Sites.Selected permission model. You grant access to specific sites or document libraries - not the whole tenant. The integration authenticates with OAuth 2.0 (/oauth2/v2.0/token with scope=resourceUrl/.default) and respects the stored scope on every call, so a misconfigured request can't escalate to other tenant resources.
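
To make the token call concrete, the sketch below builds (but does not send) a request to the Microsoft identity platform v2.0 token endpoint. It assumes the client-credentials grant and uses the Microsoft Graph resource URL as the scope; the text above names only the endpoint and the `.default` scope pattern, so the grant type, function name, and placeholder credentials are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed resource URL for Microsoft Graph; ".default" requests the
# permissions already granted to the app registration.
GRAPH_SCOPE = "https://graph.microsoft.com/.default"


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",  # assumption: app-only access
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": GRAPH_SCOPE,
    }).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder values for illustration only.
request = build_token_request("contoso-tenant-id", "app-client-id", "app-secret")
```

With Sites.Selected, the resulting token can reach only the sites an administrator has explicitly granted to the app, which is what keeps the integration off the rest of the tenant.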

What does a re-crawl cost when nothing has changed?

When the scheduled re-crawl runs, the dispatcher hashes each page and compares it to the last known hash. Pages that haven't changed never enter the embedding pipeline - no embedding cost, no quota consumed. A re-crawl across a site where nothing has changed is effectively free. Only pages where the content actually changed count against your monthly job quota.
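
The hash comparison described above is simple enough to sketch directly. This is an illustration, assuming SHA-256 over the page text and an in-memory dict in place of a persistent hash store; `content_hash` and `changed_pages` are hypothetical names.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Hash the cleaned page text; identical text always yields the same digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_pages(crawled: Dict[str, str], known_hashes: Dict[str, str]) -> List[str]:
    """Return URLs whose content differs from the last crawl, updating known_hashes."""
    changed = []
    for url, text in crawled.items():
        digest = content_hash(text)
        if known_hashes.get(url) != digest:
            changed.append(url)         # only these go on to embedding
            known_hashes[url] = digest  # remember the new version
    return changed
```

Running `changed_pages` twice over identical content returns an empty list the second time, which is the "nothing changed, nothing billed" case.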

Can the bot search across all sources at once?

Yes. Web pages, PDFs, YouTube transcripts, and SharePoint files all land in a single ranked index per site. The bot does one hybrid keyword + vector search across all of them - it doesn't have to guess which source to look in. Results are ranked by relevance regardless of where the content originated.
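
One common way to merge a keyword ranking and a vector ranking into a single list is reciprocal rank fusion (RRF). The sketch below shows the idea on made-up document IDs; whether Velaro relies on the search service's built-in fusion or something else is not stated above, so treat this as an illustration of the technique only.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) per list it appears in; ties are
    broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up IDs from different sources, for illustration only.
keyword_hits = ["pdf:manual#12", "web:/pricing", "yt:demo#3"]
vector_hits = ["web:/pricing", "sp:policy.docx#2", "pdf:manual#12"]
merged = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that rank well in both lists rise to the top no matter which source they came from, which is the practical meaning of one ranked index across four sources.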

Want to see the pipeline run on your actual content?

30-minute walkthrough. We point the scraper at your sources live and show you the index update in real time.

Book a Pipeline Walkthrough
Compare AI Knowledge Options