Engineering Case Study

IFA Fund Report Pipeline: Automating Venture Capital Consolidation

Every six months, a UK IFA practice receives PDF reports from approximately ten venture fund managers. Each covers portfolio company updates, valuations, exits, and write-offs. Each does it differently. The firm’s advisers read every report manually, extract the data into a two-decade-old spreadsheet, and build the client-facing report from there. In a regulated advice environment, a mistake in that process is a compliance event. The pipeline replaces the manual extraction step with a structured five-stage system that classifies every page, extracts every table, renders every thumbnail, and produces a report ready for adviser review, without silently dropping anything it cannot read.

Type: Engineering Case Study

Domain: AI · Financial Services · Document Processing

Stack: .NET 10 · GPT-4o vision · PDFtoImage · SkiaSharp · PuppeteerSharp · Azure OpenAI · Azure Function App

Status: POC complete · Production path defined

~10 fund manager PDFs per cycle

5 pipeline stages

100% numeric accuracy (holdings, valuations, totals)

0 data loss across full document set

The Problem

Every six months, the firm receives PDF reports from approximately ten venture fund managers. Each report covers the same ground: portfolio company updates, valuations, exits, write-offs. Each one does it differently. Different table structures. Different column names. Different narrative styles. Some front-load positive exits. Some bury failures at the back. Some split active, exited, and written-off holdings across separate sections that have to be mentally reassembled before you can understand what you are looking at.

The firm’s team reads every report manually, extracts the numbers and narrative into a spreadsheet, and builds the client-facing report from there. The spreadsheet dates back two decades and is still in daily use.

It works. But it does not scale, and it does not protect against human error. In a regulated advice environment, human error is not a quality problem. A mistake in a client-facing report is a compliance event. Every report goes through multiple manual review cycles before a client sees it. The process consumes hours of adviser time, and it gets worse as the fund count grows.

Why Standard Tooling Fails

Feeding the PDFs to an LLM and asking for a consolidated report is the obvious first attempt, and it does not work reliably enough to trust in a regulated context. The firm had already tried this. The output looked plausible but covered only two of the ten funds and dropped data without explanation. Without an experienced adviser reviewing it by eye, those gaps would have gone undetected.

The problem is structural. Venture fund PDFs are not a consistent input format. They are locked, variously formatted documents: some scanned in-house from post, some fax-originated, some natively digital but copy-protected. Programmatic text extraction fails on a significant proportion of them. Adobe’s PDF Extract API, the commercial solution designed for this, costs approximately £16,000 annually and routes data outside the firm’s Microsoft tenant. Neither is acceptable.

A naive prompt-based pipeline has no way to know what it has missed. There is no checksum, no validation, no confidence signal. The output looks complete even when it is not.

Before a line of code was written, we analysed sample fund manager PDFs and identified a core constraint: not every page contains extractable financial data. Charts, cover pages, disclaimers, and narrative commentary are structurally different from data tables and require different handling. This observation drove the two-pass architecture. Without the classification stage, the pipeline would attempt extraction on every page and produce unreliable output for the majority that contain no structured data.

Architecture

Five stages in order: ingest → classify → extract → render → report. Each has a single, testable responsibility. No stage compensates for a prior stage’s failures; uncertainty is surfaced, not absorbed.

Classify before extracting. Rather than attempting to extract data from every page, the pipeline first classifies each page by content type, then applies extraction only to pages where extraction is tractable. Pages the model cannot reliably read are identified, flagged, and surfaced to the human reviewer with source page references. Not silently dropped. Not guessed at.
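The classify-first rule can be sketched as a small routing function. This is an illustrative Python sketch, not the POC's .NET code: the names `PageClassification` and `route_page` are hypothetical, and the choice to let low-confidence pages proceed while marking them for review is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PageClassification:
    page_number: int
    contains_table: Optional[bool]  # None means the classifier gave no usable answer
    confidence: str                 # "high", "medium", or "low"


def route_page(c: PageClassification) -> Tuple[str, bool]:
    """Return (next_step, needs_review) for one classified page."""
    if c.contains_table is None:
        # Unreadable page: surface it to the reviewer instead of guessing.
        return ("flag", True)
    next_step = "extract" if c.contains_table else "commentary"
    # Assumption: low-confidence pages still proceed but are marked for review.
    return (next_step, c.confidence == "low")
```

Only pages routed to "extract" would incur the cost of the high-resolution pass; everything else still reaches the report through the thumbnail and commentary path.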

Input

PDF batch: ~10 fund manager PDFs per cycle · locked, scanned, fax-originated, and natively digital

Stage 1: Ingestion. Validate PDFs · apply page range · get page counts

Stage 2: Page Classification. 72 DPI render · GPT-4o vision · contains_table: true | false · confidence: high | medium | low

Stage 3: Extraction (table pages only). 200 DPI render · GPT-4o vision · structured JSON: columns + rows + footnotes · null preserved for unreadable cells

Stage 4: Render and Commentary. All pages → 150 DPI thumbnail · non-table pages → GPT-4o commentary (null for charts / TOC / disclaimers / covers)

Stage 5: Report Generation. HTML report + PDF via Puppeteer · tables inline · thumbnails as base64 · error flags for failed pages · run summary

Output

HTML + PDF report, ready for adviser review

Human review → client-facing report

Stage 2: Page Classification

Pages are rendered at 72 DPI and passed to GPT-4o with a classification prompt. The output is a structured JSON response: {"contains_table": bool, "confidence": "high|medium|low"}. Nothing else is extracted at this stage.
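A strict parser for that response shape might look like the following. This is a minimal Python sketch for illustration (the POC is .NET); `parse_classification` is a hypothetical name, and rejecting anything outside the two documented fields' value ranges is an assumption about how strictly the pipeline validates.

```python
import json

ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def parse_classification(raw: str) -> dict:
    """Validate a classification response; raise ValueError on any deviation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    contains_table = data.get("contains_table")
    confidence = data.get("confidence")
    # bool is checked explicitly so that "yes", 1, or null are rejected.
    if not isinstance(contains_table, bool):
        raise ValueError("contains_table must be true or false")
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        raise ValueError("confidence must be high, medium, or low")
    return {"contains_table": contains_table, "confidence": confidence}
```

Raising on a malformed response, rather than defaulting to "no table", keeps a classification failure visible as an error page instead of a silently skipped one.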

72 DPI is sufficient for classification: the model is identifying content structure, not reading individual values. Using a lower resolution for this pass keeps token cost and latency down before the more expensive extraction pass.
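To make the resolution difference concrete, the arithmetic below converts DPI into pixel dimensions. It is a Python sketch that assumes A4 portrait pages (210 × 297 mm); the source does not state the page size, and `pixel_size` and `pixel_ratio` are hypothetical helper names.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # assumption: A4 portrait pages


def pixel_size(dpi, page_mm=A4_MM):
    """Pixel dimensions of a page rendered at the given DPI."""
    width_mm, height_mm = page_mm
    return (round(width_mm / MM_PER_INCH * dpi), round(height_mm / MM_PER_INCH * dpi))


def pixel_ratio(dpi_a, dpi_b):
    """How many times more pixels a render at dpi_a has than one at dpi_b."""
    return (dpi_a / dpi_b) ** 2


for stage, dpi in (("classification", 72), ("thumbnail", 150), ("extraction", 200)):
    width, height = pixel_size(dpi)
    print(f"{stage}: {dpi} DPI -> {width} x {height} px")
```

On A4 this gives 595 × 842 px at 72 DPI, 1240 × 1754 px at 150 DPI, and 1654 × 2339 px at 200 DPI, so an extraction render carries roughly 7.7 times the pixels of a classification render. How that translates into tokens depends on how the model tiles images, which this sketch does not attempt to model.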

Stage 3: Extraction

Only pages classified as containing a table proceed to extraction. Pages are re-rendered at 200 DPI. The higher resolution is necessary for reliable character-level accuracy on financial tables. The extraction prompt returns a typed JSON object with page_type, a tables array (each with title, columns, rows), a separate footnotes array, and an error field.

The prompt explicitly instructs the model to extract the leftmost column and all rows including totals, a detail that turned out to matter in practice when early iterations missed leading columns and summary rows. Null is a valid cell value: a cell returning null means the pipeline could not read that value, and that is preserved rather than estimated.
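The null-preservation rule can be illustrated with a helper that reports unreadable cells without filling them. This is a Python sketch under stated assumptions: rows are represented as arrays aligned with `columns`, the `page_type` value is a placeholder, and every value in `SAMPLE` is made up for the example rather than taken from a fund report.

```python
import json

# Hypothetical extraction result; all values are invented for illustration.
SAMPLE = json.loads("""
{
  "page_type": "table",
  "tables": [
    {
      "title": "Example holdings (made-up values)",
      "columns": ["Company", "Cost", "Valuation"],
      "rows": [
        ["Example Co A", "1,000", "1,250"],
        ["Example Co B", "500", null],
        ["Total", "1,500", null]
      ]
    }
  ],
  "footnotes": [],
  "error": null
}
""")


def unreadable_cells(extraction: dict) -> list:
    """List (table_title, row_index, column_name) for every null cell."""
    found = []
    for table in extraction.get("tables", []):
        columns = table["columns"]
        for row_index, row in enumerate(table["rows"]):
            for col_index, value in enumerate(row):
                if value is None:
                    # Fall back to a positional label if a row is wider than the header.
                    name = columns[col_index] if col_index < len(columns) else f"column {col_index}"
                    found.append((table["title"], row_index, name))
    return found
```

The helper only reports where the nulls are; nothing downstream replaces them, which is what lets the reviewer see exactly which cells need checking against the source page.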

Stage 4: Render and Commentary

All pages are rendered at 150 DPI and saved as thumbnails for inclusion in the report. For non-table pages, GPT-4o is asked to extract any investment commentary present. The model returns null for pages that contain no relevant narrative (charts, tables of contents, disclaimers, covers) and returns 2–4 bullet points for pages with strategy or performance narrative. This gives the reviewer context for pages that were not extracted as tables.

Stage 5: Report Generation

The pipeline produces an HTML report and a PDF converted from it via PuppeteerSharp. The report opens with a summary header: generation timestamp and counts of PDFs processed, pages classified, tables extracted, and error pages. Below that are per-fund cards with one block per source page, each showing a thumbnail, extracted table data or commentary bullets, and any error notice. An error summary at the end lists all pipeline failures for the run. Every page of every source PDF appears somewhere in the output, either as extracted table data or as a thumbnail with any commentary found.
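The summary counts and the "every page appears" guarantee can be checked mechanically. The following Python sketch is one way to do that, not the POC's implementation; `PageResult`, `run_summary`, and the `missing_pages` field are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PageResult:
    pdf_name: str
    page_number: int                 # 1-based page index within the PDF
    tables_extracted: int = 0
    error: Optional[str] = None


def run_summary(results: List[PageResult], page_counts: Dict[str, int]) -> dict:
    """Build the summary header counts and list any page with no result."""
    seen = {(r.pdf_name, r.page_number) for r in results}
    missing = [
        (pdf, page)
        for pdf, count in sorted(page_counts.items())
        for page in range(1, count + 1)
        if (pdf, page) not in seen
    ]
    return {
        "pdfs_processed": len(page_counts),
        "pages_classified": len(results),
        "tables_extracted": sum(r.tables_extracted for r in results),
        "error_pages": sum(1 for r in results if r.error is not None),
        "missing_pages": missing,  # should be empty if every page is accounted for
    }
```

Comparing results against the page counts captured at ingestion is what turns "nothing was dropped" from an expectation into something the run can assert.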

Key Engineering Decisions

Vision over text extraction

The standard alternatives to vision each had a disqualifying constraint. Text-layer libraries such as pdfminer and pdfplumber fail on locked and scanned documents, which make up a significant portion of the fund manager PDFs. Adobe PDF Extract API costs approximately £16,000 annually and routes data outside the client’s Microsoft tenant. Vision-based extraction (render each page to PNG, pass it to a multimodal model) works on locked PDFs, scanned PDFs, fax-originated PDFs, and natively digital PDFs without modification. The rendering library handles the format variation; the model handles the content variation.

.NET over Python

The client environment is Microsoft-first. Production will run as an Azure Function App inside a closed Azure tenant. Building the POC in .NET means no platform port between proof of concept and production; the same codebase travels forward. PDFtoImage and SkiaSharp handle PDF rendering; the OpenAI .NET SDK handles API calls; PuppeteerSharp handles HTML-to-PDF conversion. The full stack installs from NuGet with no separately provisioned tooling. The native components those packages rely on (PDFium, Skia, and the Chromium build that PuppeteerSharp drives) are the one part of the Azure deployment that needs checking.

Standard OpenAI API for POC; Azure OpenAI for production

Azure OpenAI is the production target; data compliance requires it. For the POC, the standard OpenAI API was used for speed: no Azure resource provisioning, no deployment configuration, faster iteration. The switch to Azure OpenAI for production is a configuration change, not an architectural one. The model is the same; only the endpoint and authentication method differ.
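The "configuration change" claim amounts to keeping the endpoint details in one place. The Python sketch below illustrates the idea only; it does not call any SDK, and `ModelEndpoint`, `endpoint_for`, the environment names, and the `auth` labels are all hypothetical. The Azure URL uses a placeholder resource name, and in Azure OpenAI the model is addressed through a deployment name, so the `model` value there is also a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEndpoint:
    base_url: str
    auth: str    # label for the authentication method (illustrative values)
    model: str   # model name, or deployment name on Azure OpenAI


ENDPOINTS = {
    # POC: standard OpenAI API.
    "poc": ModelEndpoint("https://api.openai.com/v1", "api_key", "gpt-4o"),
    # Production: Azure OpenAI resource inside the tenant (placeholder host).
    "production": ModelEndpoint(
        "https://<your-resource>.openai.azure.com", "azure_credentials", "gpt-4o"
    ),
}


def endpoint_for(environment: str) -> ModelEndpoint:
    """Look up the endpoint settings for an environment name."""
    try:
        return ENDPOINTS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
```

Everything downstream of `endpoint_for` (prompts, parsing, report generation) stays the same in both environments, which is the sense in which the switch is not architectural.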

Human review is not optional

The pipeline is explicit about this in its design. The HTML and PDF report is an input to human review, not a deliverable to the client. Every page of every source PDF appears in the output. The adviser’s editing work should be additive, reviewing and approving structured output, not a second full extraction pass to catch what the pipeline missed.

Challenges and Trade-offs

Challenge	Approach
Format variation across fund managers. Column names, table structure, and narrative placement all differ. No common schema exists.	Extract raw table structure (columns + rows) at the page level. Tables from any source format are rendered consistently in the output report.
PDF locking and copy protection. Text-layer extraction is unreliable or impossible for a significant proportion of documents.	Vision extraction via PNG rendering bypasses the text layer entirely. Works on locked, scanned, and fax-originated documents without modification.
Hallucination risk on financial figures. An LLM that invents a cell value not in the source document is a regulatory problem.	Null is a valid and preserved cell value. Explicit prompt instruction not to infer or estimate. Pages where extraction fails are flagged in the output rather than silently dropped.
Non-table pages producing unreliable extraction. Attempting table extraction on charts, covers, or narrative pages produces garbage output.	Binary classification pass before extraction. Non-table pages are excluded from extraction entirely and rendered as thumbnails instead.
DPI trade-off between cost and accuracy. Higher DPI increases token cost and latency.	72 DPI for classification · 200 DPI for extraction · 150 DPI for thumbnails. The two-pass architecture makes this trade-off possible.
Commentary softening and paraphrase. A general “summarise in bullet points” instruction gives the model licence to soften negative figures, drop numbers, and merge temporally distinct facts.	Explicit accuracy rules in the commentary prompt: quote figures exactly, name every company, transcribe proper nouns character-for-character, do not infer causation, do not merge facts across reporting periods.
Footnotes dropped from table pages. Compliance-significant footnotes (fair-value basis, split shareholding notes) were silently omitted from extraction output.	Explicit footnote extraction instruction in the table extraction prompt. Footnotes returned as a separate verbatim array and rendered below each table in the report.

Iteration: Commentary Quality After First-pass Testing

After the initial POC run, the output was tested against the source PDF. The split was clean: tables held up; commentary did not.

What held up: tabular data extraction

Every numeric figure in every extracted table matched the source document exactly: 35+ named holdings across qualifying AIM, qualifying unlisted, and non-qualifying buckets, plus subtotals and totals. Twenty cells of fund performance data across four periods and five metrics. All correct. This is the hardest part technically and the most important part for a regulated workflow.

What failed: commentary

The free-text commentary extracted from narrative pages contained errors that would not be acceptable in a client-facing or FCA-regulated context. The errors fell into four categories.

Editorial softening. The source PDF reported net assets falling £55.5 million against the start of the financial year, with total return per share of −17.0%. The pipeline described this as “a slight decline due to broader market conditions.” A −17% / £55.5m fall is not slight. This type of softening is the most dangerous category of error because it looks coherent. An adviser scanning the summary would not immediately flag it.

Factual synthesis errors. The pipeline stated that Interactive Investor proceeds “are being distributed as a special dividend and reinvested.” The source document records the opposite: the Board explicitly decided not to reinvest the qualifying portion of the proceeds because funds had already been raised via the Offer. The pipeline had synthesised two separate facts into a single sentence that contradicted the source. A similar error linked Surface Transforms’ 21% revenue growth (a historical figure for the year ended December 2021) to a new £100m contract (a forward-looking event from the new financial year), two facts that appeared on the same page, with an inferred causal connection the source document did not make.

Dropped quantitative data. MaxCyte’s 30% revenue growth figure was present in the source. The pipeline output replaced it with “a healthy increase in full-year sales revenues.” The number was not preserved.

Misread proper noun. SulNOx Group, a small-cap holding, was rendered as “SulisOx Group.” A vision-model character-level error on an unfamiliar proper noun with mixed case.

Root cause

All issues traced to a single prompt function. The instruction “extract the key points as 2–4 concise bullet points” gave the model implicit permission to summarise, paraphrase, soften, and compress. There were no constraints on numeric accuracy, no requirement to name companies explicitly, and no instruction against inferring connections between temporally distinct facts.

A secondary issue: the pipeline’s run summary reported “0 errors.” This was technically accurate (no structural or extraction failures occurred) but it implied a clean output when the commentary contained material inaccuracies. The counter measured what the pipeline could observe. It could not detect commentary distortion.

The fix

Three changes were made. The commentary prompt was rewritten with seven explicit accuracy rules: quote all financial figures exactly as they appear on the page; always name the specific company being discussed; transcribe company names, fund names, and brand names character-for-character; do not merge facts from different reporting periods; do not infer causation between separate facts; preserve specific strategic decisions verbatim; cover financial highlights and key metrics pages as well as investment manager narrative.
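The seven rules lend themselves to being kept as data and assembled into the prompt. The Python sketch below shows that pattern; the rule wording follows the list above, but the surrounding prompt text and the name `build_commentary_prompt` are assumptions, since the actual prompt is not reproduced in this case study.

```python
ACCURACY_RULES = [
    "Quote all financial figures exactly as they appear on the page.",
    "Always name the specific company being discussed.",
    "Transcribe company names, fund names, and brand names character-for-character.",
    "Do not merge facts from different reporting periods.",
    "Do not infer causation between separate facts.",
    "Preserve specific strategic decisions verbatim.",
    "Cover financial highlights and key metrics pages as well as investment manager narrative.",
]


def build_commentary_prompt(rules=ACCURACY_RULES) -> str:
    """Assemble a commentary prompt with the accuracy rules as a numbered list."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    # The framing sentences below are illustrative wording, not the POC's prompt.
    return (
        "Extract any investment commentary on this page as 2-4 bullet points. "
        "Return null if the page contains no relevant narrative.\n"
        "Follow these accuracy rules:\n"
        f"{numbered}"
    )
```

Holding the rules in a list makes each one individually reviewable and testable, which matters when a single loosely worded instruction was the root cause of the original failures.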

Footnote extraction was added to the table extraction prompt, returning footnotes as a separate verbatim array rendered below each table in the report. A commentary confidence flag was added to the model’s response and surfaced as a coloured badge in the output. At the same time, the “Errors” counter was renamed “Extraction Errors” to make clear that it counts structural failures only, not commentary accuracy.
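How the confidence flag and the renamed counter might surface in the HTML can be sketched as follows. This Python example is illustrative only: the confidence levels are assumed to mirror the classifier's high, medium, and low, and the colour names, CSS class names, and function names are hypothetical.

```python
import html

# Assumed palette; the source says only that the badge is coloured.
BADGE_COLOURS = {"high": "green", "medium": "amber", "low": "red"}


def confidence_badge(confidence: str) -> str:
    """Render a commentary confidence flag as an HTML badge."""
    colour = BADGE_COLOURS.get(confidence, "grey")  # unknown values get a neutral badge
    label = html.escape(confidence)
    return f'<span class="badge badge-{colour}">Commentary confidence: {label}</span>'


def summary_line(extraction_errors: int) -> str:
    """Label the counter so it cannot be read as a statement about commentary."""
    return f"Extraction Errors: {extraction_errors} (structural failures only)"
```

Keeping the two signals separate means a run can report zero extraction errors and still show a low-confidence badge on a commentary block.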

Issue	Before	After
SulNOx misread	“SulisOx Group”	“SulNOx Group” (correct)
−17% total return softened	“a slight decline”	“−17.0%” quoted exactly
Interactive Investor reinvestment error	“distributed…and reinvested”	Special dividend only, no reinvestment mentioned
Surface Transforms causal conflation	“secured contract, boosting revenues 21%”	“revenues grew by 21%…for FY ended December 2021” (separated)
Access Intelligence unnamed	“Business performance improved…”	“Access Intelligence (−£2.9m) delivered…” (named)
MaxCyte revenue % dropped	“a healthy increase”	“grew by 30% compared to the prior financial year”
Footnotes dropped	Not present	All five footnotes extracted verbatim

Where This Fits

Against the actual fund manager PDFs, the pipeline held. Every table extracted. Every page accounted for. That is the only thing the POC needed to prove.

Moving to production is a configuration change, not a rebuild. Swap the OpenAI endpoint for Azure OpenAI, wrap the pipeline in an Azure Function App triggered by blob upload, and it runs inside the client’s existing Microsoft tenant. The code travels forward unchanged.

Phase 2 is the interesting work: normalising raw extracted tables into a canonical fund record schema, then aggregating across all ten managers into a structured consolidated view. Right now the output is raw: columns and rows in whatever structure a given fund manager chose to use. Normalisation is the step that makes those comparable. The extraction layer is solid enough to build on.
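Phase 2 has not been built, so the following Python sketch is purely hypothetical. It shows one possible shape for normalisation: mapping each manager's column names onto canonical field names and reporting anything unmapped rather than discarding it. The canonical fields and alias lists are invented for illustration.

```python
# Hypothetical canonical fields and aliases; the real schema is not yet defined.
CANONICAL_ALIASES = {
    "company": {"company", "investment", "holding", "portfolio company"},
    "cost": {"cost", "book cost", "cost (£000)"},
    "valuation": {"valuation", "fair value", "value"},
}


def canonical_column(name: str):
    """Return the canonical field for a raw column name, or None if unknown."""
    key = name.strip().lower()
    for canonical, aliases in CANONICAL_ALIASES.items():
        if key in aliases:
            return canonical
    return None


def normalise_table(columns, rows):
    """Map rows to canonical records; also return the columns that did not map."""
    mapping = [canonical_column(c) for c in columns]
    unmapped = [c for c, m in zip(columns, mapping) if m is None]
    records = [
        {m: value for m, value in zip(mapping, row) if m is not None}
        for row in rows
    ]
    return records, unmapped
```

Returning the unmapped columns alongside the records follows the same principle as the rest of the pipeline: anything the system cannot place is surfaced for review, not dropped.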

Off-the-shelf IFA software does not solve this. The market treats VCT consolidation as a niche edge case, and the products that exist were not designed around a fully locked, fax-era PDF estate. A bespoke pipeline built on top of vision LLMs and an existing Azure infrastructure is not the obvious path. It is the one that works.

Tech Stack

IFA Fund Report Pipeline: POC

.NET 10 · PDFtoImage 5.2.1 · SkiaSharp · OpenAI GPT-4o vision · PuppeteerSharp · Azure OpenAI (production target) · Azure Function App (production hosting)

Get in touch

Working with complex document workflows?

Regulated environments have specific constraints: data residency, audit trails, explainability. Most LLM integrations paper over uncertainty rather than surface it. Getting that right requires a different approach to prompt design, architecture, and validation than a standard pipeline.

If you are working on something in this space, I would be interested to hear about it.
