This content originally appeared on DEV Community and was authored by Anmol Baranwal
Extracting structured data from unstructured documents (like PDFs and images) can get tricky fast.
With the rise of foundation models and purpose-built APIs, it's now possible to turn even a messy invoice into clean JSON with just a few lines of code.
So I will compare three different ways to parse documents: using OpenAI’s GPT‑4o, Anthropic’s Claude 3.5 Sonnet and the Invofox API.
I picked Invofox because it's a YC-backed startup built specifically for document parsing. It uses specialized models (proprietary and best-of-LLM) tuned for invoices and other documents, while GPT/Claude are general-purpose LLMs.
You will see real Python code, actual outputs and a breakdown of when to use each tool (pros & cons). At the end, there is a detailed comparison table on features & benchmarks.
You can find the complete code in the GitHub Repository.
🎯 Using GPT-4o (ChatGPT) API
Let's start with OpenAI's GPT-4o. It's capable of understanding text and extracting structured information when prompted correctly, but in the workflow we use here it can't read PDF files directly.
So we first need to extract the text using OCR or a PDF library (Tesseract, pdfplumber or an online tool), then send that text to GPT-4o in an API prompt.
GPT-4o can accept images, and certain interfaces (notably the ChatGPT web UI and some endpoints in Azure OpenAI Service) can accept PDFs as input and extract structured data. Since we are calling the plain Chat Completions API here, we will extract the text ourselves.
You will need an OpenAI API key. Create a .env file and add your key with this convention:
OPENAI_API_KEY=your_api_key
We will use Python for this. Here's how you can try it yourself, step by step.
Step 1: Set up your Python environment
Creating a virtual environment means setting up an isolated space for your Python project where all dependencies are installed locally (and not system-wide). This avoids version conflicts and keeps your global Python installation clean. So let’s create one.
# macOS / Linux:
python3 -m venv env # creates a folder called 'env' with a local Python setup
source env/bin/activate # activates that environment
# Windows:
python -m venv env # same as above
.\env\Scripts\activate # activates it (Windows PowerShell / CMD)
You will know it's active when you see (env) at the beginning of your terminal prompt.
Step 2: Install required packages
We need three libraries:

- pdfplumber: to extract text from PDF invoices
- openai: to use the GPT-4o API
- python-dotenv: loads environment variables from a .env file into Python, useful for managing API keys and secrets

pip install pdfplumber openai python-dotenv
After installing your dependencies, run:
pip freeze > requirements.txt
This writes all installed packages in your virtual environment (with versions) into requirements.txt. You can then use this file later with:

pip install -r requirements.txt

For reference, please add a .gitignore in the root directory to avoid pushing the virtual environment directory.
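A minimal .gitignore for this setup might look like this (adjust to your project):

env/
.env
__pycache__/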
Step 3: Extract text and parse with GPT-4o
Here is the sample invoice PDF that I'm using for the example. It contains the usual fields: invoice number, dates, sender and recipient details, line items and totals, which is exactly what we are going to extract.
Let's write the complete code in a file named openai-main.py.
import os

import openai
import pdfplumber
from dotenv import load_dotenv

load_dotenv()

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text


def parse_invoice_with_openai(invoice_text):
    prompt = (
        "Extract the following fields from this invoice text and return as a JSON object:\n"
        "- Invoice Number\n"
        "- Invoice Date\n"
        "- Due Date\n"
        "- Invoice Status (e.g. unpaid/paid)\n"
        "- Sender Name and Email\n"
        "- Recipient Name and Email\n"
        "- Items (description, quantity, rate)\n"
        "- Total Amount\n"
        "- Memo\n\n"
        "Invoice Text:\n"
        f"{invoice_text}"
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # GPT-4o; swap in another chat model if needed
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=500,
        temperature=0,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    pdf_path = "invoice_sample.pdf"
    invoice_text = extract_text_from_pdf(pdf_path)
    parsed_data = parse_invoice_with_openai(invoice_text)
    print(parsed_data)
Here's a simple explanation:

- extract_text_from_pdf: uses pdfplumber to read each page of the PDF and concatenate the extracted text. This gives you the raw, unstructured invoice content as a string.
- parse_invoice_with_openai: sends the prompt to GPT-4o via the chat completions endpoint, asking it to extract the listed fields. The model processes the prompt and returns a JSON-formatted response as a string.
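Note that the function returns a plain string, not a Python dict. Here's a minimal post-processing sketch (assuming the model may wrap the JSON in Markdown code fences, which it sometimes does):

import json
import re


def parse_json_response(raw: str) -> dict:
    """Strip optional ```json fences and load the model output into a dict."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as err:
        # The model occasionally returns malformed JSON; surface it instead of guessing.
        raise ValueError(f"Model did not return valid JSON: {err}") from err


# Example usage with the script above:
# parsed = parse_json_response(parse_invoice_with_openai(invoice_text))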
Step 4: Output
Here is the JSON response after running the script using python openai-main.py.
{
"Invoice Number": "2-7-25",
"Invoice Date": "July 2, 2025",
"Due Date": "Upon receipt",
"Invoice Status": "UNPAID",
"Sender Name and Email": {
"Name": "Anmol Baranwal",
"Email": "hi@anmolbaranwal.com"
},
"Recipient Name and Email": {
"Name": "Anmol Baranwal",
"Email": "anmolbaranwal09@gmail.com"
},
"Line Items": [
{
"Description": "Testing",
"Quantity": 1,
"Rate": "$50.00",
"Total": "$50.00"
},
{
"Description": "Development",
"Quantity": 1,
"Rate": "$100.00",
"Total": "$100.00"
},
{
"Description": "Blog",
"Quantity": 1,
"Rate": "$50.00",
"Total": "$50.00"
}
],
"Subtotal": "$200.00",
"Total Amount": "$200.00",
"Memo or Notes": "Thank you! This is a sample invoice for testing document parsing with AI models."
}
GPT-4o's output keys aren't consistently named: the line items sometimes come back as "Line Items" and sometimes as something less standardized, while other tools (like Invofox) always use a consistent name such as "lines" for those entries.
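If your downstream code expects a stable key, a small normalization helper can smooth this over (the key variants below are illustrative guesses, not an exhaustive list):

def normalize_line_items(parsed: dict) -> dict:
    """Copy whichever line-item key the model chose into a canonical 'lines' key."""
    candidate_keys = ["lines", "Line Items", "Items", "items", "line_items"]
    for key in candidate_keys:
        if key in parsed:
            parsed["lines"] = parsed.pop(key)
            break
    return parsed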
Here we instruct GPT-4o through the prompt to parse the text. This works reasonably well, since the current model is much stronger at this than earlier OpenAI models.
✅ Pros: Easy to try and flexible. GPT-4o is strong at logic and structured data extraction, so it can correctly identify invoice fields and calculate totals.
⚠️ Cons:
- We still have to engineer prompts and verify the output, which not every team has the time or expertise to do.
- The JSON can be malformed or may miss fields (hallucinations are possible).
- There's no built-in validation or confidence scores (see the validation sketch below).
- GPT requires sending all the text in the prompt (which gets costly for large documents), and outputs vary with prompt style.
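One way to mitigate the malformed-JSON and missing-field problems is to validate the parsed dict against a schema before trusting it. Here's a minimal sketch using Pydantic (an extra dependency; the field names and types are assumptions matching the prompt, and they presume you've already normalized the model's keys, e.g. to snake_case):

from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: float
    rate: str


class Invoice(BaseModel):
    invoice_number: str
    invoice_date: str
    due_date: Optional[str] = None
    total_amount: str
    items: List[LineItem] = []


def validate_invoice(data: dict) -> Invoice:
    try:
        return Invoice(**data)
    except ValidationError as err:
        # Fail loudly so a human can review instead of silently accepting bad data.
        raise ValueError(f"Invoice failed validation: {err}") from err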
GPT-4o is billed per token. The estimated cost for a 1–2 page invoice extraction falls in the $0.005–$0.018 range, depending on how detailed your prompt and output are. You can also use this pricing calculator based on your use case.
It can respond in 1–30s but is subject to load spikes, especially for large prompts.
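If you are calling the API inside a pipeline, it's worth wrapping the call with a simple retry. This is a minimal sketch with exponential backoff; in real code you'd catch the specific exception types (rate limits, timeouts) rather than a bare Exception:

import time


def call_with_retries(fn, attempts=3, base_delay=2.0):
    """Retry a flaky API call with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Example usage (hypothetical wiring into the script above):
# parsed_data = call_with_retries(lambda: parse_invoice_with_openai(invoice_text))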
🎯 Using Claude 3.5 Sonnet API
Anthropic's Claude 3.5 Sonnet model is also capable of parsing structured data from text when prompted correctly. Like GPT-4o, it cannot read PDF files directly via API, so we will first extract the text from an invoice PDF, then pass it to Claude for structured parsing.
You will need an Anthropic API key. Create a .env file and add your key with this convention:
ANTHROPIC_API_KEY=your_api_key
We will use Python again for this setup and follow the same instructions used in the last section.
Step 1: Set up environment and install packages
Just like before, let’s isolate our dependencies in a virtual environment.
# macOS / Linux:
python3 -m venv env
source env/bin/activate
# Windows:
python -m venv env
.\env\Scripts\activate
Once activated, your terminal will show a (env) prefix.
We need the following libraries:

- pdfplumber: to extract text from the PDF
- anthropic: the official SDK to interact with Claude 3.5
- python-dotenv: to load the API key from a .env file
pip install pdfplumber anthropic python-dotenv
If you are following from the last example, we just need to install the anthropic package.
Then export your environment to a requirements.txt file. Make sure to include a .gitignore to avoid committing the virtual environment.
pip freeze > requirements.txt
Step 2: Extract text and parse with Claude 3.5 Sonnet
As Anthropic launches safer and more capable models, they regularly retire older ones, so check the model status page to see which models are deprecated, retired or still active. I will be using claude-3-5-sonnet-20240620, an active version, for this example.
Let's write the complete code in a file named anthropic-main.py. It's very similar to the previous section, and I'm using the same sample invoice PDF.
import os

import anthropic
import pdfplumber
from dotenv import load_dotenv

load_dotenv()

api_key = os.getenv("ANTHROPIC_API_KEY")
client = anthropic.Anthropic(api_key=api_key)
print("API Key loaded:", api_key[:12], "...")


def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text


def parse_invoice_with_claude(invoice_text):
    prompt = (
        "Extract the following fields from this invoice text and return as a JSON object:\n"
        "- Invoice Number\n"
        "- Invoice Date\n"
        "- Due Date\n"
        "- Invoice Status (e.g. unpaid/paid)\n"
        "- Sender Name and Email\n"
        "- Recipient Name and Email\n"
        "- Items (description, quantity, rate)\n"
        "- Total Amount\n"
        "- Memo\n\n"
        "Invoice Text:\n"
        f"{invoice_text}"
    )

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=500,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


if __name__ == "__main__":
    pdf_path = "invoice_sample.pdf"
    invoice_text = extract_text_from_pdf(pdf_path)
    parsed_data = parse_invoice_with_claude(invoice_text)
    print(parsed_data)
Here's a simple explanation:

- extract_text_from_pdf: uses pdfplumber to pull plain text from each page of the PDF.
- parse_invoice_with_claude: sends the text to Claude 3.5 Sonnet with a specific prompt asking for JSON output. Claude returns a stringified JSON block with the requested fields.
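Because Claude returns the JSON as plain text, it sometimes wraps it in explanations or code fences. One trick that usually helps is prefilling the assistant turn so the reply starts at the opening brace. A hedged sketch, reusing client and prompt from the script above:

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=500,
    temperature=0,
    messages=[
        {"role": "user", "content": prompt},
        # Prefill the assistant turn so the model continues straight into the JSON object.
        {"role": "assistant", "content": "{"},
    ],
)
# Re-attach the prefilled brace before parsing.
raw_json = "{" + response.content[0].text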
Step 3: Output
You can run the script using python anthropic-main.py in the terminal. Here's the JSON response:
{
"invoiceNumber": "2-7-25",
"invoiceDate": "July 2, 2025",
"dueDate": "Upon receipt",
"invoiceStatus": "UNPAID",
"senderName": "Anmol Baranwal",
"senderEmail": "hi@anmolbaranwal.com",
"recipientName": "Anmol Baranwal",
"recipientEmail": "anmolbaranwal09@gmail.com",
"items": [
{
"description": "Testing",
"quantity": 1,
"rate": 50.00
},
{
"description": "Development",
"quantity": 1,
"rate": 100.00
},
{
"description": "Blog",
"quantity": 1,
"rate": 50.00
}
],
"totalAmount": 200.00,
"memo": "Thank you! This is a sample invoice for testing document parsing with AI models."
}
✅ Pros:
- Claude 3.5 is very strong at understanding long text and formatting it cleanly.
- Claude 3.5 Sonnet can handle text (and images embedded directly in the prompt) as input.
- In some cases, it handles unusual or long documents slightly better than GPT-4o.
⚠️ Cons:
- Like GPT, Claude requires prompt engineering.
- Like GPT, Claude can sometimes miss fields or make up values (hallucinate).
- It still returns raw JSON text without validation, so you must parse/verify it.
- You still need to extract the text yourself; it doesn't parse raw PDFs.
Claude 3.5 Sonnet is also billed per token. The estimated cost for a 1–2 page invoice extraction falls in the $0.005–$0.018 range, depending on how detailed your prompt and output are. You can also use this pricing calculator based on your use case.
It's exceptionally fast for small prompts (200–300ms), but larger or more complex prompts can raise latency to 10s or more.
🎯 Using the Invofox API
The code-based approaches above (OpenAI & Anthropic) require prompt engineering, so I went looking for a better solution and found several purpose-built tools like Invofox, Google Document AI and Amazon Textract.
What stood out about Invofox is that it’s backed by Y Combinator and has all the features I needed. That gave me the confidence to dig deeper and try it out.
It provides a plug‑and‑play AI-powered document parsing API that makes it super easy to extract data from invoices, receipts, payslips, bank statements, loan/mortgage files and custom document types like bills.
They have some useful built-in features like:
- ✅ Splitter: automatically separates multiple documents contained within a single PDF (such as mixed invoices or statements), grouping pages into logical sub-documents for better extraction and automation. It's configurable via the API during upload and works alongside the classifier for cleaner downstream processing.
- ✅ Classifier: a pretrained AI model that detects document types (invoice, receipt, etc.) so that each document is processed using the correct schema. It's optional and can be enabled per environment or request.
They also use advanced AI models with proprietary algorithms that verify and autocomplete your data. Check API Docs.
Step 1: Sign up for the dashboard
You can sign up for the dashboard to generate an API key.
You can upload documents manually through the dashboard as well, but we will be using the API since it's easier to automate and a better experience overall.
Step 2: Creating the request in Postman
Once you have your API key, you can use Postman to send documents for parsing to Invofox's /uploads endpoint.
Here's how to set it up:
✅ 1. Create a New Request
- Open the Postman desktop application
- Create a collection and add a request
- Set the method to POST
- Use this endpoint: https://api.invofox.com/v1/ingest/uploads
✅ 2. Set the Headers
Go to the Headers tab and add:
- key: accept, value: application/json
- key: x-api-key, value: your_invofox_api_key
You should not manually set Content-Type, as Postman handles it automatically when using form-data. This header tells the server what format the data in your request body is:

- application/json → you're sending raw JSON
- multipart/form-data → you're sending files + form fields
- application/x-www-form-urlencoded → you're sending form-like text fields (like an HTML form)

When you send files using Postman's form-data option, Postman automatically sets the correct Content-Type and boundary values (which are required for multipart/form-data).
If you manually set it like this:
Content-Type: multipart/form-data
You are missing the boundary part, which is something like:
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryxyz
Let's add the body fields.
✅ 3. Add the Body (form-data)
Switch to the Body tab, select form-data and add the following two fields:

- key: files, type: file, value: upload your invoice (invoice_sample.pdf)
- key: info, type: text, value: paste the JSON below
{
  "type": "6840c4511cbcc77119347248",
  "data": {
    "companyActsLike": "issuer"
  }
}
The data field is optional here; it's only needed if you want to pass custom metadata or extra instructions (such as information to influence parsing, verification preferences or to register edge-case scenarios for custom document types).

Beyond standard types (invoice, payslip, bank statement), you can register custom document types in your Invofox dashboard. These custom types get a unique ID like 6840c4511cbcc77119347248 (used in the example), which is what we pass to the API.
By specifying a type ID, you ensure your files are parsed according to the exact schema you set up:
- Your custom JSON structure
- Field names you defined
- Custom validation rules and human review workflows
✅ 4. Send the Request
Click "Send". If everything is set up correctly, you will get a response like the one below.

- importId is the batch ID for this upload (useful for tracking multiple files uploaded together)
- documentId is the ID of the parsed document
{
"accountId": "683edb9d7ded4695232c4979",
"environmentId": "683edb9d7ded4695232c497b",
"importId": "68662d83c3a0849a86a6aa30",
"files": [
{
"id": "68662d83c3a0849a86a6aa33",
"filename": "invoice_sample.pdf",
"documentId": "68662d83c3a0849a86a6aa34"
}
]
}
Step 3: Get Parsed Document
There are two ways. One is to check the Invofox dashboard for the newly parsed document; there, the line items and breakdowns are displayed in a table format, and the GUI provides many options, including filtering the extracted data.
Based on how the workflow is set up, it may be necessary to mark it as completed, as involving a human in the loop ensures the highest accuracy and gives us more control.
The other way (recommended) is to make a GET request to https://api.invofox.com/documents/{documentID} with these headers:

- key: accept, value: application/json
- key: x-api-key, value: your_invofox_api_key
Here is a trimmed JSON response with the original format. It also provides the image of the original invoice in the response and a lot of extra fields compared to the earlier responses of GPT-4o & Claude.
{
"hasClientRequest": false,
"canLock": false,
"canSkip": false,
...
"canEdit": false,
"result": {
"_id": "68662d83c3a0849a86a6aa34",
"account": "683edb9d7ded4695232c4979",
"environment": "683edb9d7ded4695232c497b",
"creator": "683edbcbee083d02af5bf7cf",
"clientData": {},
"type": "6840c4511cbcc77119347248",
"name": "invoice_sample.pdf",
"creation": "2025-07-03T07:13:08.950Z",
"images": [
"https://...-1.png"
],
"original": "https://.../invoice_sample.pdf",
"mimetype": "application/pdf",
"data": {
"documentNumber": {
"value": "2-7-25"
},
"issueDate": {
"value": "2025-07-02"
},
"language": {
"value": "en"
},
"breakdowns": [
{
"taxRate": { "value": 0 },
"taxBaseAmount": { "value": 200 },
"taxAmount": { "value": 0 },
...
"totalAmount": { "value": 200 },
"grossAmount": { "value": 200 }
}
],
"lines": [
{
"description": { "value": "Testing" },
"quantity": { "value": 1 },
...
"totalAmount": { "value": 50 },
"grossAmount": { "value": 50 }
},
{
"description": { "value": "Development" },
"quantity": { "value": 1 },
...
"totalAmount": { "value": 100 },
"grossAmount": { "value": 100 }
},
{
"description": { "value": "Blog" },
"quantity": { "value": 1 },
...
"totalAmount": { "value": 50 },
"grossAmount": { "value": 50 }
}
],
"totalTaxBaseAmount": { "value": 200 },
"totalTaxAmount": { "value": 0 },
"totalAmount": { "value": 200 }
},
"publicState": "approved",
"confidence": "low",
"import": {
"ref": "68662d83c3a0849a86a6aa30",
"file": "68662d83c3a0849a86a6aa33",
"filename": "invoice_sample.pdf",
...
},
...
}
}
Pricing is not public, so potential users must contact their team for a commercial offer, but the product is specifically tuned for production speed and reliability. In my tests, response times were consistently under 5 seconds.
🎯 Python Code using Invofox API
Many developers prefer extracting documents with code, so let’s walk through the same process using the Invofox API with Python. We will keep it brief, with just the code and JSON response.
The overall process is the same as the previous sections, so I'm not repeating that. You can read the docs if you are interested in exploring for yourself.
We need to install requests, a Python library that makes it easy to send HTTP requests (such as GET, POST) and work with web APIs.
pip install requests
We will also use the time module, which is built into every standard Python installation. It provides time-related functions such as delays (time.sleep()) and timestamps; in our case, we use it to pause execution and give the document enough time to be processed.
Let's write the complete code in a file named invofox-main.py.
import json
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()

API_BASE = "https://api.invofox.com"
API_KEY = os.getenv("INVOFOX_API_KEY")
PDF_PATH = "invoice_sample.pdf"

headers = {"accept": "application/json", "x-api-key": API_KEY}

with open(PDF_PATH, "rb") as f:
    files = {"files": f}
    info = {"type": "6840c4511cbcc77119347248", "data": {"companyActsLike": "issuer"}}
    data = {"info": json.dumps(info)}
    resp_upload = requests.post(
        f"{API_BASE}/v1/ingest/uploads", headers=headers, files=files, data=data
    )

upload_result = resp_upload.json()
print("Upload response:", upload_result)

import_id = upload_result.get("importId")
if not import_id:
    raise ValueError("Import ID not found in upload response.")

# wait a moment for processing
time.sleep(2)

resp_import = requests.get(f"{API_BASE}/v1/ingest/imports/{import_id}", headers=headers)
import_info = resp_import.json()
print("Import info:", import_info)

files_info = import_info.get("files", [])
if not files_info or not files_info[0].get("documentIds"):
    raise ValueError("Document IDs not found in import info.")

document_id = files_info[0]["documentIds"][0]
print("Document ID:", document_id)

time.sleep(20)

resp_get = requests.get(f"{API_BASE}/documents/{document_id}", headers=headers)
parsed_doc = resp_get.json()
print("Parsed Document Data:")
print(json.dumps(parsed_doc, indent=2))
Here are all the Invofox API endpoints used:

- POST /v1/ingest/uploads → uploads a PDF invoice with metadata (type & issuer info) and returns an importId
- GET /v1/ingest/imports/{importId} → retrieves import details, including the documentIds generated for the upload
- GET /documents/{documentId} → retrieves the fully parsed invoice data
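The fixed time.sleep(20) is fine for a demo, but processing time varies in practice. Here's a hedged polling sketch; it simply retries the document endpoint until parsed data shows up, which is an assumption about readiness rather than an official status flag:

import time

import requests


def wait_for_document(document_id, headers, api_base="https://api.invofox.com", attempts=15, delay=5.0):
    """Poll the document endpoint until parsed data shows up, or give up."""
    for _ in range(attempts):
        resp = requests.get(f"{api_base}/documents/{document_id}", headers=headers)
        doc = resp.json()
        # Assumption: the document is ready once result.data is populated.
        if doc.get("result", {}).get("data"):
            return doc
        time.sleep(delay)
    raise TimeoutError(f"Document {document_id} was not parsed within the polling window.")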
Here is the JSON response after running the script using python invofox-main.py.
The JSON response is similar to what we got after making a request using Postman. It also provides the image of the original invoice in the response and a lot of useful fields.
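If you only need a handful of fields, you can flatten the nested {"value": ...} structure into something simpler. A small sketch based on the trimmed response above (field availability may vary by document type):

def summarize_invofox_result(parsed_doc: dict) -> dict:
    """Pull a few common fields out of the nested {'value': ...} structure."""
    data = parsed_doc.get("result", {}).get("data", {})
    return {
        "documentNumber": data.get("documentNumber", {}).get("value"),
        "issueDate": data.get("issueDate", {}).get("value"),
        "totalAmount": data.get("totalAmount", {}).get("value"),
        "lines": [
            {
                "description": line.get("description", {}).get("value"),
                "quantity": line.get("quantity", {}).get("value"),
                "totalAmount": line.get("totalAmount", {}).get("value"),
            }
            for line in data.get("lines", [])
        ],
    }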
Results & Comparison
Let's compare their methods in brief.
- API Call
  - GPT-4o/Claude → send extracted text with a prompt
  - Invofox → upload a file (image/PDF) via the API, including in bulk
- Setup
  - GPT/Claude → need to write prompt engineering code
  - Invofox → minimal code, no prompts
- Validation
  - GPT/Claude → you need to verify manually
  - Invofox → built-in validation and confidence scores
- Performance
  - GPT/Claude → limited by token/context window size
  - Invofox → handles multi-page docs via backend OCR and AI
While parsing the invoice, here's what I realized:

- ChatGPT (GPT-4o): good at parsing known fields if prompted clearly. You get a JSON string but must parse and clean it yourself, and errors can creep in if prompts are unclear.
- Claude 3.5 Sonnet: very similar to GPT-4o. It handled the invoice fields about as well, sometimes better at recognizing unfamiliar terms, but we still had to massage the prompt.
- Invofox API: returned the fully parsed invoice JSON out of the box. All fields were correctly extracted and validated, and the output schema was exactly what we needed, with no extra coding.
Comparison Table

Now that we have explored each option, let's compare them side by side. Estimates are based on typical invoice lengths: simple invoices are 1–2 pages and roughly 1,000–2,000 tokens in total.

| | GPT-4o | Claude 3.5 Sonnet | Invofox |
|---|---|---|---|
| Input | Extracted text + prompt | Extracted text + prompt | PDF/image upload (single or bulk) |
| Prompt engineering | Required | Required | Not needed |
| Built-in validation & confidence scores | No | No | Yes |
| Approx. cost per 1–2 page invoice | $0.005–$0.018 | $0.005–$0.018 | Not public (contact sales) |
| Typical response time | 1–30s | ~200–300ms for small prompts, 10s+ for large ones | Under 5s in my tests |

Cost & Execution Time Benchmarks

We covered the pricing structure in each of the sections, and the table above also puts the cost and execution-time numbers side by side so it's easier to make a decision.
You should also acknowledge the ongoing cost and effort involved in upgrading language models. Teams often need to benchmark new models, retest prompts and schemas and adjust output parsing logic whenever a new version is released.
These hidden maintenance costs aren’t always obvious but should be considered. With Invofox, there is no such requirement.
Bottom Line
For quick experiments or one-off tasks, you can use GPT-4o (ChatGPT API) or Claude 3.5 Sonnet to parse invoice text by crafting suitable prompts. They will do a decent job extracting fields as JSON, since current models produce far more structured and cleaner output than earlier generations.
However, for reliable production-grade parsing of invoices or receipts, the Invofox API is superior. It’s specifically built for documents using advanced proprietary models and continual feedback.
You can find the complete code in the GitHub Repository.
That's it.
I hope you learned how to parse documents. Let me know if you have any questions or feedback.
Have a great day! Until next time :)
You can check my work at anmolbaranwal.com. Thank you for reading! 🥰