Jens Willmer

Tutorials, projects, dissertations and more..

Data Cleaning for a RAG Ingest Pipeline

If your RAG pipeline ingests dirty data, the answers will be wrong. The embedding model and prompt chain cannot fix what was broken before indexing.

I built this pipeline for a maritime email corpus: thousands of .eml files with PDF attachments, Office documents, images, and ZIP archives, turned into a searchable knowledge base. The examples here are maritime, but the patterns apply to any industry where you ingest unstructured documents into a RAG system. Corporate email, support tickets, compliance archives, internal wikis: the same cleaning problems show up everywhere.

This post covers the techniques that actually mattered, based on a corpus of 6,000+ emails with tens of thousands of attachments.

Cleaning the raw email content

Corporate email is one of the noisiest data sources you can feed into a RAG system. A single email thread might contain the incident report you actually care about, buried under forwarded headers, legal disclaimers, satellite communication blocks, mailto: links, and five previous replies where someone wrote “OK, noted.”

The first step is thread splitting. Regex patterns detect message boundaries across Gmail, Outlook, and generic separators, splitting a forwarded thread into individual messages so each can be cleaned independently. Over 20 boilerplate regex patterns then strip <mailto:> links, [cid:] image references, horizontal separator lines, phone/fax contact blocks, satellite communication headers (common in maritime), address blocks, legal disclaimers, and unsubscribe footers.
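
A minimal sketch of the split-then-strip step, assuming only two boundary styles and four boilerplate patterns. The pattern lists and the function names `split_thread` and `clean_message` are illustrative, not the pipeline's actual rules, which cover far more variants.

```python
import re

# Hypothetical boundary markers; a real corpus needs more variants per client.
THREAD_BOUNDARY = re.compile(
    r"^(?:"
    r"-{2,}[ \t]*(?:Original|Forwarded) [Mm]essage[ \t]*-{2,}"  # Outlook / Gmail forward
    r"|On .{1,200} wrote:"                                      # Gmail-style reply header
    r")[ \t]*$",
    re.MULTILINE,
)

# Hypothetical boilerplate patterns, applied in order.
BOILERPLATE = [
    re.compile(r"<mailto:[^>]*>"),                          # <mailto:...> links
    re.compile(r"\[cid:[^\]]*\]"),                          # inline image references
    re.compile(r"^[-_=*]{5,}[ \t]*$", re.MULTILINE),        # horizontal separator lines
    re.compile(r"^(?:Tel|Phone|Fax|Mob(?:ile)?)[ \t]*[:.].*$",
               re.MULTILINE | re.IGNORECASE),               # phone/fax contact lines
]


def split_thread(body: str) -> list:
    """Split a thread into individual messages at detected boundaries."""
    return [part for part in THREAD_BOUNDARY.split(body) if part.strip()]


def clean_message(text: str) -> str:
    """Remove boilerplate, then tidy the whitespace left behind."""
    for pattern in BOILERPLATE:
        text = pattern.sub("", text)
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)  # trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)                   # runs of blank lines
    return text.strip()
```

Splitting before stripping matters: each message is cleaned on its own, so a signature in reply three cannot swallow content from reply two.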

For the edges that regex cannot handle cleanly, an LLM-based boundary detector identifies the first and last meaningful words in the email body, trimming headers and signatures with fuzzy matching that tolerates formatting drift between email clients.

Messages under 20 words get discarded entirely. Auto-replies and bare acknowledgments like “Noted, thanks” produce low-value chunks that score artificially high once context enrichment is applied, pushing the actual incident descriptions out of search results.

Without this cleanup, embeddings cluster around disclaimer text instead of the actual content. A query about “main engine cylinder crack” should not return ten results whose top match is a legal footer that appears in every email.

Encoding, HTML, and character corruption

Emails arrive from systems worldwide using different character encodings. A single encoding error can cascade: a garbled MIME header prevents attachment extraction, and you lose an entire document from the index.

The pipeline uses a cascading charset fallback that tries encodings in priority order: declared charset, UTF-8, UTF-8-sig (with BOM), ISO-8859-1, then Windows-1252 (also known as CP1252). Only as a last resort does it decode as UTF-8 with replacement characters. UTF-8 BOM bytes are stripped before parsing, and CSV files from legacy systems are truncated at the first NUL byte with line endings normalized.
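
The fallback chain can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the names `decode_with_fallback` and `normalize_legacy_csv` are hypothetical, and the candidate order simply follows the list above. One caveat worth knowing: Python's ISO-8859-1 codec maps every byte to a character, so it never fails, and anything listed after it is only reached if you reorder the chain.

```python
import codecs
from typing import Optional, Tuple

# Order follows the text. ISO-8859-1 accepts any byte sequence in Python,
# so the entries after it are effectively unreachable in this sketch.
FALLBACKS = ("utf-8", "utf-8-sig", "iso-8859-1", "cp1252")


def decode_with_fallback(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used), trying the declared charset first."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]       # strip the UTF-8 BOM up front
    candidates = ([declared] if declared else []) + list(FALLBACKS)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue                           # bad bytes or unknown charset name
    # Last resort: keep going with U+FFFD replacement characters.
    return raw.decode("utf-8", errors="replace"), "utf-8-replace"


def normalize_legacy_csv(raw: bytes) -> bytes:
    """Truncate at the first NUL byte and normalize line endings to LF."""
    raw = raw.split(b"\x00", 1)[0]
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Returning the encoding that succeeded alongside the text makes it easy to log how often each fallback fires.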

HTML-to-text conversion strips <script> and <style> tags, converts <a> tags to markdown links, replaces block elements with newlines, and unescapes HTML entities. Plain text MIME parts are preferred when available, because clean plain text produces better embeddings than HTML-tagged content.
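
Here is one way the conversion could look using only the standard library. It is a sketch, assuming a small set of block tags and a hypothetical `html_to_text` helper; production HTML email is messier than this handles.

```python
import re
from html.parser import HTMLParser

# Assumed set of block-level tags that should introduce a line break.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "table", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # unescapes entities in text
        self.parts = []
        self._skip_depth = 0     # > 0 while inside <script> or <style>
        self._href = None
        self._link_text = None   # list of text pieces while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a" and self._link_text is not None:
            text = "".join(self._link_text).strip()
            # Emit a markdown link when both text and target exist.
            self.parts.append(f"[{text}]({self._href})" if text and self._href else text)
            self._href, self._link_text = None, None
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._link_text is not None:
            self._link_text.append(data)
        else:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line
    return text.strip()
```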

Parsing: routing by complexity

Using the wrong parser for a document either wastes money or produces empty output.

PDFs are classified as simple or complex before parsing. Simple PDFs (clean text layer, no form fields) go to PyMuPDF4LLM: free, local, milliseconds. Complex PDFs (scanned pages, form fields, image-only content) route to Gemini 2.5 Flash via OpenRouter at roughly $0.0025 per page, with page-range pagination and adaptive batch halving when the model truncates its output. Oversized PDFs above a configurable page ceiling get a local peek of the first three pages for classification, then route to summary mode if the content is noise.
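
The routing and batch-halving logic can be sketched without any PDF library. Everything here is an assumption made for illustration: `PageStats`, the 50-character threshold, and the `parse_range` callback (standing in for the cloud call, returning the text plus a flag that says whether the output was truncated).

```python
from typing import Callable, List, NamedTuple, Tuple


class PageStats(NamedTuple):
    text_chars: int        # characters found in the page's text layer
    has_form_fields: bool


def classify_pdf(pages: List[PageStats], min_chars_per_page: int = 50) -> str:
    """Return "complex" if any page has form fields or almost no text layer."""
    for page in pages:
        if page.has_form_fields or page.text_chars < min_chars_per_page:
            return "complex"
    return "simple"


def parse_in_batches(
    total_pages: int,
    parse_range: Callable[[int, int], Tuple[str, bool]],
    batch_size: int = 25,
) -> List[str]:
    """Parse pages [start, end) in batches, halving the batch on truncation."""
    results = []
    start, size = 0, batch_size
    while start < total_pages:
        end = min(start + size, total_pages)
        text, truncated = parse_range(start, end)
        if truncated and end - start > 1:
            size = max(1, (end - start) // 2)   # retry the same range, smaller
            continue
        results.append(text)                    # accept (even a truncated single page)
        start = end
    return results
```

In this sketch the batch size never grows back after a halving. That keeps the loop simple at the cost of a few extra calls on documents with one dense section.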

Office documents are rendered locally. DOCX paragraphs become plain text, tables become GitHub Flavored Markdown with proper delimiters. XLSX sheets become headed sections with GFM tables. CSV files go through the charset fallback chain before table rendering. Only legacy binary formats (.doc, .xls, .ppt) still require a cloud parser.

When a local parser opens a file but extracts zero text (an image-only DOCX, for instance), a specific EmptyContentError triggers fallback to the cloud parser. Generic errors from corrupt files do not trigger the fallback, so you don’t waste API calls on unrecoverable files.
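
The key detail is catching one narrow exception type and letting everything else propagate. A minimal sketch, where `parse_with_fallback` and the two parser callables are hypothetical stand-ins:

```python
from typing import Callable, Tuple


class EmptyContentError(Exception):
    """The file opened fine, but no text could be extracted locally."""


def parse_with_fallback(
    path: str,
    local_parser: Callable[[str], str],
    cloud_parser: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, parser_used). Only empty output triggers the cloud call."""
    try:
        text = local_parser(path)
        if not text.strip():
            raise EmptyContentError(path)
        return text, "local"
    except EmptyContentError:
        # Readable but textless (for example image-only): worth an API call.
        return cloud_parser(path), "cloud"
    # Any other exception (corrupt file, bad archive) propagates to the caller.
```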

Tiered routing cuts parsing costs by 28-42% compared to sending everything through cloud parsers, saving roughly $63-94 per full ingest of the production corpus.


Filtering images and stripping parser artifacts

Emails contain many non-informative images: logos, banners, tracking pixels, signature graphics. Each Vision API call costs roughly $0.01, and in a typical maritime email corpus, 60-80% of embedded images are logos and banners.

A two-stage filter handles this. Heuristic rules hard-skip filenames matching patterns like logo, banner, signature, or icon. Files under 15KB (tracking pixels) or with dimensions under 100px are skipped. Images shorter than 50px with a width-to-height ratio above 3:1 are classified as email separator strips. Only images that pass these checks reach the Vision API, which classifies them as meaningful or decorative. The heuristic filter alone eliminates 30-50% of Vision API calls, saving $19-32 per full ingest.
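
The heuristic stage might look like the sketch below. The thresholds come from the text; the function name, the reason strings, and the reading of "dimensions under 100px" as both width and height under 100 are my assumptions (if either dimension alone counted, the separator-strip rule would never fire).

```python
import re
from typing import Optional

# Match the keyword only when it is not glued to a preceding letter,
# so "company_logo.png" matches but "silicon.jpg" does not.
DECORATIVE_NAME = re.compile(r"(?:^|[^a-z])(?:logo|banner|signature|icon)", re.IGNORECASE)


def skip_reason(filename: str, size_bytes: int, width: int, height: int) -> Optional[str]:
    """Return why an image should be skipped, or None to send it to the Vision API."""
    if DECORATIVE_NAME.search(filename):
        return "decorative_filename"
    if size_bytes < 15 * 1024:
        return "tiny_file"                    # tracking pixels and spacers
    if width < 100 and height < 100:
        return "tiny_dimensions"              # assumption: both sides under 100px
    if 0 < height < 50 and width / height > 3:
        return "separator_strip"
    return None
```

Returning a reason instead of a boolean makes it cheap to audit which rule is responsible for each skipped image.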

Parser artifacts are another source of noise. LLM-based parsers inject placeholder image references like ![](page_3_image_1.jpg) that point to non-existent files. Regex-based stripping removes these while preserving any alt text that carries semantic value. A maintenance command can retroactively apply improved stripping patterns to all archived markdown, so bumping the regex fixes older archives without re-parsing.
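
A sketch of the stripping step, assuming placeholders follow the `page_<n>_image_<m>` naming shown above. The pattern and the `strip_placeholder_images` name are illustrative; other parsers will use other naming schemes.

```python
import re

# Matches ![alt](page_3_image_1.jpg) style placeholders only.
PLACEHOLDER_IMAGE = re.compile(r"!\[([^\]]*)\]\(page_\d+_image_\d+\.(?:jpe?g|png)\)")


def strip_placeholder_images(markdown: str) -> str:
    """Drop placeholder image references, keeping any alt text."""
    return PLACEHOLDER_IMAGE.sub(lambda m: m.group(1).strip(), markdown)
```

Because the function is a pure text transform, it can be rerun over archived markdown whenever the pattern improves.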

Deciding what to embed

Sensor logs, repetitive tables, and empty PDFs waste embedding API calls and pollute the vector space. There is no reason to chunk and embed all of them the same way.

An embedding mode classifier routes each attachment into one of three modes:

  • full: standard chunking with context and embeddings. Default for prose content.
  • summary: one LLM-synthesized summary chunk. Used for bulk numeric or tabular content (digit ratio above 30%, table content above 40%, high repetition scores).
  • metadata_only: one stub chunk from the filename, no LLM calls. For empty or near-empty attachments (under 50 tokens, or very low prose ratio with no headings).

The decision tree uses computed content shape signals: digit ratio, prose ratio, table character percentage, repetition score, short line ratio, heading count. For documents in a medium-confidence band, a single LLM triage call makes the final classification. The triage prompt includes an anti-injection preamble, and the response parser reads only the first character, so crafted attachments cannot steer their own classification.
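
A reduced version of the decision tree, using three of the signals and the thresholds quoted in the list above. This is a sketch: word count stands in for a real token count, the repetition and prose signals are left out, and the single-letter triage codes are hypothetical.

```python
def content_signals(text: str) -> dict:
    """Compute a few cheap shape signals for a document."""
    chars = [c for c in text if not c.isspace()]
    lines = [ln for ln in text.splitlines() if ln.strip()]
    table_chars = sum(len(ln) for ln in lines if ln.lstrip().startswith("|"))
    return {
        "approx_tokens": len(text.split()),   # rough proxy for a tokenizer count
        "digit_ratio": sum(c.isdigit() for c in chars) / (len(chars) or 1),
        "table_ratio": table_chars / (sum(len(ln) for ln in lines) or 1),
    }


def choose_mode(text: str) -> str:
    s = content_signals(text)
    if s["approx_tokens"] < 50:
        return "metadata_only"
    if s["digit_ratio"] > 0.30 or s["table_ratio"] > 0.40:
        return "summary"
    return "full"


# Hypothetical one-letter answer codes for the LLM triage call.
TRIAGE_CODES = {"F": "full", "S": "summary", "M": "metadata_only"}


def parse_triage_response(response: str, default: str = "full") -> str:
    """Read only the first non-space character; ignore everything after it."""
    stripped = response.strip()
    return TRIAGE_CODES.get(stripped[:1].upper(), default)
```

Reading a single character means that whatever else the model echoes back, including text lifted from the attachment, has no influence on the result.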

In the production corpus, 15-25% of attachments are sensor logs or repetitive tables. Summary mode reduces their embedding footprint by 95% while keeping them searchable.

Chunking and context enrichment

Once content is clean and classified, it needs to be split into chunks that respect the embedding model’s token limits and the document’s logical structure.

Token-aware splitting uses tiktoken (matching the embedding model’s tokenizer) to keep chunks within the context window. Markdown content uses a MarkdownTextSplitter that respects heading boundaries; plain text uses a recursive splitter with separators ordered from paragraph breaks down to character level. Splits under 5 words are discarded, since these are typically PDF pagination artifacts like lone page numbers.

Each chunk records its character offset and line numbers in the original document, so the chat UI can highlight the exact source passage when citing a chunk. Heading hierarchy extraction tracks the structural path (e.g., ## Safety > ### Fire Prevention) for additional context.
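
Heading paths fall out of a simple stack walk over the markdown. A minimal sketch, assuming ATX-style headings (`#` prefixes) and a hypothetical `heading_paths` helper; it does not skip headings inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_paths(markdown: str) -> list:
    """Return (line_number, heading_path) for every line, 1-indexed."""
    stack = []    # (level, title) for the currently open headings
    result = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        path = " > ".join(f"{'#' * lvl} {title}" for lvl, title in stack)
        result.append((line_no, path))
    return result
```

A chunk that starts on a given line can then look up its path by line number and carry it as metadata.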

Contextual embedding had more effect on retrieval quality than anything else I tried. An LLM produces a 2-3 sentence summary of each document (type, subject, key entities), which is prepended to every chunk before embedding. A chunk like “The valve was replaced on Tuesday” becomes retrievable for queries about “MV Atlantic valve replacement” because the prepended context supplies the vessel name and subject. Published results on contextual retrieval report 35-67% fewer failed retrievals than naive chunking, with the upper end of that range coming from combining contextual embeddings with BM25 and reranking.

For email threads with multiple messages, a separate thread digest captures the full conversation in a 200-400 word summary chunk. When a user asks “how was the pump failure fixed?”, the digest matches because it connects the resolution in message 5 to the original problem in message 1.


Metadata extraction and inheritance

Chunks from attachments have no context on their own. A PDF inspection report does not know it was attached to an email about “Main Engine Cylinder 3 Crack on MV ATLANTIC.” Without the parent email’s context, attachment chunks produce generic embeddings that miss filtered searches.

Vessel mention extraction uses regex patterns for maritime naming conventions (M/V [NAME], MV [NAME]) with a curated noise filter for non-vessel tokens (TANKERS, SHIP, CORP). Extracted vessel IDs go on every chunk, which makes per-vessel filtering possible during search.
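
A sketch of the extraction, assuming names are one to three capitalized words after the prefix. The pattern, the `extract_vessels` name, and the rule of trimming trailing noise tokens are illustrative; the noise set contains only the three examples above.

```python
import re

# Prefix M/V or MV, then up to three capitalized words on the same line.
VESSEL = re.compile(r"\bM/?V[ \t]+([A-Z][A-Za-z0-9-]*(?: [A-Z][A-Za-z0-9-]*){0,2})")

# Tokens that follow the prefix in company names rather than vessel names.
NOISE = {"TANKERS", "SHIP", "CORP"}


def extract_vessels(text: str) -> list:
    """Return unique, upper-cased vessel names in order of first appearance."""
    found = []
    for match in VESSEL.finditer(text):
        tokens = match.group(1).upper().split()
        while tokens and tokens[-1] in NOISE:   # trim trailing noise words
            tokens.pop()
        if not tokens or tokens[0] in NOISE:
            continue
        name = " ".join(tokens)
        if name not in found:
            found.append(name)
    return found
```

The pattern will over-capture when a capitalized word directly follows the name, which is one reason a curated noise list earns its keep.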

Attachment chunks inherit their parent email’s vessel IDs and topic IDs. A cargo damage report PDF shows up in “Cargo Damage” topic-filtered searches even if the PDF itself never mentions the topic by name. Both the direct-attachment and ZIP-member code paths use the same shared helper to prevent drift.

Topic extraction uses an LLM to assign 1-5 topics per email, with aggressive name normalization and semantic deduplication via cosine similarity (threshold 0.92) to prevent the same concept from fragmenting across entries like “Cargo Damage”, “cargo damage”, and “Damaged Cargo.”
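
The deduplication step can be sketched as a lookup against known topics. The 0.92 threshold is from the text; `resolve_topic`, the normalization rule, and the in-memory dictionary are assumptions, and the vectors would come from whatever embedding model you use.

```python
import math


def normalize_topic(name: str) -> str:
    """Lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_topic(name: str, vector, known: dict, threshold: float = 0.92) -> str:
    """Map a topic to an existing canonical name, or register it as new.

    `known` maps canonical topic name -> embedding vector and is updated in place.
    """
    key = normalize_topic(name)
    if key in known:
        return key                              # exact match after normalization
    best_name, best_sim = None, threshold
    for other, other_vec in known.items():
        sim = cosine(vector, other_vec)
        if sim >= best_sim:
            best_name, best_sim = other, sim
    if best_name is not None:
        return best_name                        # semantic duplicate
    known[key] = vector
    return key
```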

Storage integrity and deduplication

Re-ingesting emails after a parser upgrade, bug fix, or crash recovery must not create duplicate documents or chunks.

Content-addressable IDs make this deterministic. Document IDs are SHA256(source_id + file_hash), and chunk IDs derive from the document ID plus character offsets, so the same content always produces the same IDs. The database enforces UNIQUE(chunk_id) with INSERT OR REPLACE, making chunk insertion idempotent. Foreign key cascading deletes prevent orphaned chunks when a document is re-processed. The progress tracker uses the file content hash (not the file path) to decide what needs processing, so renaming a file does not trigger re-ingest.
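
To make the idea concrete, here is a small SQLite sketch. The table layout, the helper names, and the exact string fed to the chunk hash are assumptions; only the SHA-256 construction, the UNIQUE constraint, INSERT OR REPLACE, and the cascading delete come from the description above.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (doc_id TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS chunks (
    chunk_id TEXT NOT NULL UNIQUE,
    doc_id   TEXT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
    text     TEXT NOT NULL
);
"""


def document_id(source_id: str, file_hash: str) -> str:
    return hashlib.sha256((source_id + file_hash).encode("utf-8")).hexdigest()


def chunk_id(doc_id: str, start: int, end: int) -> str:
    # Assumed format: document ID plus character offsets, colon-separated.
    return hashlib.sha256(f"{doc_id}:{start}:{end}".encode("utf-8")).hexdigest()


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this per connection
    return conn


def upsert_chunk(conn, doc_id: str, start: int, end: int, text: str) -> str:
    """Insert or overwrite a chunk; re-running with the same input is a no-op."""
    cid = chunk_id(doc_id, start, end)
    conn.execute("INSERT OR IGNORE INTO documents (doc_id) VALUES (?)", (doc_id,))
    conn.execute(
        "INSERT OR REPLACE INTO chunks (chunk_id, doc_id, text) VALUES (?, ?, ?)",
        (cid, doc_id, text),
    )
    return cid
```

Note that SQLite ignores foreign keys unless the pragma is enabled on each connection, so a forgotten pragma silently disables the cascade.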

Validation and targeted repair

Bugs will slip through regardless. The last line of defense is a validation suite with 31 automated checks covering referential integrity (orphaned chunks, broken parent references), embedding completeness (missing vectors, NaN values, zero vectors), content quality (empty chunks, missing context summaries), archive consistency (broken URIs, orphaned folders), schema drift, mode invariants, and topic health.
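
Two of those checks are simple enough to sketch in a few lines. The function names and the in-memory data shapes are hypothetical; a real suite would query the database instead.

```python
import math


def check_embeddings(vectors: dict) -> list:
    """vectors maps chunk_id -> list of floats (or None). Returns (chunk_id, problem)."""
    problems = []
    for cid, vec in vectors.items():
        if not vec:
            problems.append((cid, "missing_vector"))
        elif any(math.isnan(x) for x in vec):
            problems.append((cid, "nan_value"))
        elif all(x == 0 for x in vec):
            problems.append((cid, "zero_vector"))
    return problems


def check_orphans(chunk_to_doc: dict, document_ids: set) -> list:
    """Return chunk IDs whose parent document no longer exists."""
    return sorted(cid for cid, doc in chunk_to_doc.items() if doc not in document_ids)
```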

Every step stamps a trail in the archive metadata: which parser, which LLM model, which embedding model produced each artifact. When a check fails, the trail tells you exactly what to fix. “This PDF was parsed by Gemini 2.5 Flash with a 25-page batch” narrows the scope instantly.

The system supports targeted repair over full re-ingest. A regex improvement needs only clean-archive-md (zero API cost). A parser bug on specific emails uses mark-failed plus retry-failed (targeted cost). A chunking strategy change uses reindex-chunks (moderate cost, no re-parsing). Full re-ingest is the last resort, since it costs real money and hours of wall-clock time at production scale.

What held up

Looking back, the techniques that moved retrieval accuracy the most were mundane: stripping boilerplate, fixing encodings, filtering junk images, making sure every chunk carries enough context to be found.

Contextual embedding (prepending a document-level summary to every chunk) had the largest effect on retrieval quality, in line with the 35-67% reduction in failed retrievals reported for contextual retrieval. Tiered parsing and embedding mode classification saved 28-42% on parsing and 95% on embedding costs for non-prose content. Content-addressable IDs and idempotent inserts meant I could iterate on the pipeline without worrying about corrupting production data or burning through API budgets. Automated validation after every ingest caught the data-dependent edge cases that unit tests never will.

None of this is exciting work. But in a RAG system, it is the work that decides whether your users get the right answer.

Things to think about

A few concerns that fall outside the core data cleaning flow but are worth keeping in mind:

  • Error message sanitization. API errors can contain Bearer tokens, API keys, or connection strings in their stack traces. If you store error messages (for debugging or admin UIs), strip credentials before persisting them.
  • ZIP security. Email attachments can include ZIP archives with path traversal attacks (../ in filenames), ZIP bombs (high compression ratios that exhaust memory), or deeply nested archives. Configurable limits on extraction depth, file count, and total size prevent a single malicious attachment from stalling the pipeline. Office formats (.docx, .xlsx, .pptx) are technically ZIP files and need to be detected and routed to their proper parsers instead of being extracted.
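
Both concerns can be sketched with the standard library. The patterns, limits, and function names below are assumptions chosen for illustration. The ZIP check also trusts the sizes declared in the archive headers, which a hostile file can falsify, so a real extractor should enforce the same limits again while streaming bytes out.

```python
import io
import re
import zipfile
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical credential patterns: (regex, replacement).
SECRET_PATTERNS = [
    (re.compile(r"(Bearer\s+)[A-Za-z0-9._~+/=-]+", re.IGNORECASE), r"\1[REDACTED]"),
    (re.compile(r"((?:api[_-]?key|token|password)\s*[=:]\s*)[^\s&'\"]+", re.IGNORECASE),
     r"\1[REDACTED]"),
    (re.compile(r"(://[^/\s:@]+:)[^@\s/]+(@)"), r"\1[REDACTED]\2"),  # user:pass@host
]


def redact_secrets(message: str) -> str:
    """Mask credentials before an error message is stored."""
    for pattern, replacement in SECRET_PATTERNS:
        message = pattern.sub(replacement, message)
    return message


def zip_rejection_reason(
    data: bytes,
    max_files: int = 1000,
    max_total_bytes: int = 200 * 1024 * 1024,
    max_ratio: float = 100.0,
) -> Optional[str]:
    """Return why an archive should be rejected, or None if it looks safe."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(data))
    except zipfile.BadZipFile:
        return "not_a_zip"
    with archive:
        infos = archive.infolist()
        if len(infos) > max_files:
            return "too_many_files"
        total = 0
        for info in infos:
            name = info.filename.replace("\\", "/")
            if name.startswith("/") or ".." in PurePosixPath(name).parts:
                return "path_traversal"
            if info.compress_size and info.file_size / info.compress_size > max_ratio:
                return "suspicious_compression_ratio"
            total += info.file_size
        if total > max_total_bytes:
            return "too_large"
    return None
```

Nested archives would need the same check applied recursively with a depth counter, which this sketch leaves out.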