This content originally appeared on DEV Community and was authored by Quinton
In case you missed the news, OpenAI just released their new reasoning model, GPT-5. There is a lot of hype about its ability to perform reasoning tasks and its potential for software development. But what does it feel like in real-world developer usage?
I am currently using the OpenAI Responses API in my side project, mycaminoguide, an AI agent that works in WhatsApp and gives hikers who are planning, or already walking, the Camino de Santiago a way to ask for advice, the way you would ask an experienced hiking friend. v1 of that app is published, but I want to improve its performance. (If you want to follow along on YouTube as I try out more AI tools to build it, I’d love it if you subscribed or left a thumbs up. Comments, views, and subs really help with motivation!) I’m hoping GPT-5 helps as part of the solution. For my day job, I am also working on some apps that use the Responses API to fetch and analyze customer data. These apps are much more standalone, which makes them good candidates for some simple, real-world performance tests of GPT-5.
Full disclaimer: this is not an amazingly scientific or comprehensive test. I wanted to test real-world, average developer usage in a typical use case that I have.
Is GPT-5 faster than GPT-4.1?
Test 1: end-to-end analysis
The standalone app is written in Python with a Streamlit front end and uses the new Airbyte Embedded MCP to fetch Stripe invoice data on behalf of a customer. This data is passed to OpenAI’s Responses API to perform some analysis. I used Cline with GPT-4.1 to generate most of the code. The complete code of the main app is ~114 lines; you can see it all here in the repo. After fetching a Bearer token to authenticate against Airbyte, I make the call to OpenAI and add a timer to track execution time. The full code I ran is here.
```python
import time

with st.spinner("Fetching customer data via Airbyte Embedded and asking GPT to analyze it..."):
    start_time = time.time()
    resp = openai.responses.create(
        model="gpt-5",
        tools=[
            {
                "type": "mcp",
                "server_label": "airbyte-embedded-mcp",
                "server_url": "https://mcp.airbyte.ai",
                "headers": {
                    "Authorization": f"Bearer {AIRBYTE_BEARER_TOKEN}"
                },
                "require_approval": "never",
            },
        ],
        input=prompt,
    )
    duration = time.time() - start_time

st.success(f"Request completed in {duration:.2f} seconds.")
```
I ran the process using GPT-4.1 three times with the results below.
|  | Test 1 | Test 2 | Test 3 |
| --- | --- | --- | --- |
| OpenAI+Airbyte MCP | 29.85 seconds | 15.38 seconds | 13.56 seconds |
Now, let’s test GPT-5.
First, let’s run the exact same code, with the only change being the model swapped from GPT-4.1 to GPT-5.
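The swap itself is literally one string. In the Test 2 code further down, the model is referenced through a MODEL_NAME constant, so one convenient way to flip it without editing the app is an environment variable; the OPENAI_MODEL name below is just an illustration, not something from the repo.

```python
import os

# e.g. run with OPENAI_MODEL=gpt-5 to test the new model, defaulting to GPT-4.1
MODEL_NAME = os.getenv("OPENAI_MODEL", "gpt-4.1")
```

Here are the results of the three test executions.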
|  | Test 1 | Test 2 | Test 3 |
| --- | --- | --- | --- |
| OpenAI+Airbyte MCP | 77.72 seconds | 57.72 seconds | 83.40 seconds |
Wow! This is not what I expected at all. Chaining an MCP call to return data and passing it to GPT-5 for analysis is significantly slower than doing the same with GPT-4.1. I’m shocked.
Test 2: removing the variance
Perhaps the MCP call is introducing variance? The MCP service calls Airbyte, which proxies a request to a data source - Stripe in my example - on behalf of a customer. This use case is very common for developers building AI apps who want to fetch user data and pass it to an LLM, but there is a good chance any of these calls will take a varying amount of time to return the data.
Let’s separate out the data call and instead inject the data directly into the prompt. I saved this file as streamlit_app_test2.py and added timers to wrap both the MCP call and the OpenAI responses call.
```python
# 1) Fetch invoices from MCP server (outside OpenAI)
fetch_start = time.time()
invoices = fetch_invoices_via_mcp(AIRBYTE_BEARER_TOKEN, STRIPE_CONNECTOR_ID)
fetch_duration = time.time() - fetch_start
st.info(f"MCP fetch completed in {fetch_duration:.2f} seconds.")

if invoices:
    st.success(f"Fetched {len(invoices)} invoices from MCP server.")
else:
    st.info("No invoices returned from MCP server; analysis will proceed with empty data.")

# 2) Build prompt by injecting fetched data
base_prompt = (
    "You are an experienced financial planner and accountant. "
    "Analyze the provided Stripe invoice data and prepare a plan for me to manage my invoices."
)

invoices_for_prompt = simplify_invoices(invoices, limit=50)

new_prompt = (
    f"{base_prompt}\n\n"
    f"Here is a JSON array of invoices (subset of fields, up to 50 rows):\n"
    f"{json.dumps(invoices_for_prompt, ensure_ascii=False)}\n\n"
    "Please provide:\n"
    "- Key insights and trends\n"
    "- Risk assessment (e.g., overdue, large unpaid balances)\n"
    "- A prioritized plan of actions\n"
    "- If useful, include a concise Markdown table summary (e.g., Invoice Number, Amount Due (USD), Status, Due Date)"
)

# 3) Ask OpenAI to analyze the injected data (no MCP tool usage here)
with st.spinner("Analyzing invoice data with the model..."):
    start_time = time.time()
    resp = openai.responses.create(
        model=MODEL_NAME,
        input=new_prompt,
    )
    duration = time.time() - start_time

st.success(f"Request completed in {duration:.2f} seconds.")
```
Here are the results:
First, GPT-4.1
|  | Test 1 | Test 2 | Test 3 |
| --- | --- | --- | --- |
| Proxy Call (seconds) | 6.71 | 6.74 | 6.93 |
| OpenAI call (seconds) | 9.87 | 10.31 | 9.62 |
Next, GPT-5
|  | Test 1 | Test 2 | Test 3 |
| --- | --- | --- | --- |
| Proxy Call (seconds) | 7.85 | 7.01 | 7.11 |
| OpenAI call (seconds) | 80.01 | 46.05 | 48.06 |
The results are pretty clear! GPT-5 is significantly slower when using the Responses API. It’s not even close. And you can see from the proxy calls that there is very little variance that could be attributed to network latency; the Airbyte call and the subsequent Stripe proxy request are pretty fast.
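To put numbers on that spread, here is a quick calculation over the three runs from the tables above (values copied by hand, so treat it as illustrative):

```python
from statistics import mean, pstdev

# Timings in seconds, copied from the Test 2 tables above.
runs = {
    "GPT-4.1 proxy": [6.71, 6.74, 6.93],
    "GPT-4.1 OpenAI": [9.87, 10.31, 9.62],
    "GPT-5 proxy": [7.85, 7.01, 7.11],
    "GPT-5 OpenAI": [80.01, 46.05, 48.06],
}

for label, times in runs.items():
    print(f"{label}: mean {mean(times):.2f}s, stdev {pstdev(times):.2f}s")
```

The proxy calls vary by well under a second, while the GPT-5 response times swing by more than thirty seconds around a mean of roughly 58 seconds.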
Test 3: Does the Responses API introduce the variance?
What if it is not the model at all? Perhaps it is the OpenAI Responses API? Let’s take the prompt from Test 2, which includes the data returned from the Airbyte proxy request, and use ChatGPT to perform the request with both GPT-4.1 and GPT-5.
Here is the complete prompt:
You are an experienced financial planner and accountant. Analyze the provided Stripe invoice data and prepare a plan for me to manage my invoices. In your output, tell me exactly how long in seconds it took you to perform this analysis. this time should be accurate to two decimal places
Here is a JSON array of invoices (subset of fields, up to 50 rows):
[{"id": "in_1RoWV6Jns63UPKMcLkpxPCCo", "customer": "cus_Sk0SaX89BlwYJR", "status": "draft", "amount_due": 30000, "currency": "usd", "due_date": null, "created": 1753391788, "paid": false, "amount_paid": 0}, {"id": "in_1RoWUmJns63UPKMco8KUcu88", "customer": "cus_Sk0TvnAQUT6rOX", "status": "draft", "amount_due": 60000, "currency": "usd", "due_date": null, "created": 1753391768, "paid": false, "amount_paid": 0}, {"id": "in_1RoWUTJns63UPKMch0nprHNh", "customer": "cus_Sk0T8SdwDgSMva", "status": "draft", "amount_due": 20000, "currency": "usd", "due_date": null, "created": 1753391749, "paid": false, "amount_paid": 0}, {"id": "in_1RoWU8Jns63UPKMcalyJvrAu", "customer": "cus_Sk0TM8UmpVhWJB", "status": "draft", "amount_due": 30000, "currency": "usd", "due_date": null, "created": 1753391728, "paid": false, "amount_paid": 0}, {"id": "in_1RoWTJJns63UPKMczVOgarPJ", "customer": "cus_Sk0U3CjIKfiGql", "status": "draft", "amount_due": 20000, "currency": "usd", "due_date": null, "created": 1753391677, "paid": false, "amount_paid": 0}]
Please provide:
- Key insights and trends
- Risk assessment (e.g., overdue, large unpaid balances)
- A prioritized plan of actions
- If useful, include a concise Markdown table summary (e.g., Invoice Number, Amount Due (USD), Status, Due Date)
Here are the results:
ChatGPT Client
|  | Test 1 | Test 2 | Test 3 |
| --- | --- | --- | --- |
| GPT-4o (seconds) | 4.92 | 5.32 | 3.47 |
| GPT-5 (seconds) | 38.82 | 36.78 | 34.62 |
Note: The prompt includes a statement to print execution time. The results in ChatGPT returned almost instantly and then typed out the response to the screen. It appears that ChatGPT calculates execution time based on how long it takes to complete the entire request, which includes printing the response to the screen. I also could only use GPT-4o, not GPT-4.1, in ChatGPT.
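One way to separate "time until the model starts answering" from "time to finish printing the whole answer" is to stream the response and record both the time to the first output token and the total time. A hedged sketch against the Responses API’s streaming mode (the event type string reflects my understanding of the SDK and is worth verifying):

```python
import time
import openai

start = time.time()
first_token_at = None

# Stream events instead of waiting for the complete response object.
stream = openai.responses.create(
    model="gpt-5",
    input=new_prompt,
    stream=True,
)

for event in stream:
    # Output text arrives incrementally as delta events; note when the first one lands.
    if first_token_at is None and getattr(event, "type", "") == "response.output_text.delta":
        first_token_at = time.time()

total = time.time() - start
if first_token_at is not None:
    print(f"First token after {first_token_at - start:.2f}s, full response in {total:.2f}s")
```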
So what’s the verdict?
I’ll be honest, I don’t know what to make of the results. Every single one of my non-scientific, real-world developer tests indicates that GPT-5 is significantly slower than GPT-4.1/4o. I didn’t review the analysis results between the two models; perhaps better reasoning takes longer and the results are more in-depth, but it’s hard for an app developer to justify such a huge difference in performance.
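One knob I haven’t tried yet: GPT-5 is a reasoning model, and the Responses API accepts a reasoning-effort setting that is supposed to trade depth for latency, which might explain, or at least shrink, some of the gap above. The exact parameter shape below is my assumption to verify against the current docs:

```python
# Hedged sketch: ask GPT-5 for the lightest reasoning pass and time it
# against the default-settings numbers above.
resp = openai.responses.create(
    model="gpt-5",
    input=new_prompt,
    reasoning={"effort": "minimal"},  # assumed parameter shape; check the docs
)
```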
Maybe GPT-5 is just getting overwhelmed with usage load at the moment? Next, I’d like to compare GPT-5 and GPT-OSS to take server load out of the equation, but I couldn’t find a clear answer on which actual model GPT-OSS is based. If anyone knows, please let me know, or tell me if you’d even be interested in such a comparison. (I am kind of curious now, and I’d also like to test GPT-5-mini.)
In the meantime, back to coding, probably still with GPT-4. ;)