I built a $2/month AI assistant and hosted it myself — here’s the full architecture

This content originally appeared on DEV Community and was authored by Brian Austin

I got tired of the token-counting anxiety. Every time I used the Claude API directly, I was watching the meter tick: 1,000 tokens here, 5,000 tokens there. A long debugging session could cost $3-4 in a single sitting.

So I built a flat-rate wrapper. Same Claude model underneath. Fixed $2/month. No per-token billing.

Here's how it actually works.

The architecture

User browser
    ↓ HTTPS
Node.js server (VPS, 2GB RAM)
    ↓ Auth middleware (JWT)
Session manager (rate limiting per user)
    ↓ Anthropic SDK
Claude API
    ↑ Response
Streaming back to browser

That's it. The magic isn't in the architecture — it's in the business model.

The key technical pieces

1. Rate limiting per user

The most important component. Without this, one heavy user can burn through your entire API budget in a day.

// Simple in-memory rate limiter
// For production: use Redis
const userLimits = new Map();

function checkRateLimit(userId) {
  const now = Date.now();
  const windowMs = 60 * 60 * 1000; // 1 hour window
  const maxRequests = 50; // per hour

  if (!userLimits.has(userId)) {
    userLimits.set(userId, { count: 0, resetAt: now + windowMs });
  }

  const limit = userLimits.get(userId);

  if (now > limit.resetAt) {
    limit.count = 0;
    limit.resetAt = now + windowMs;
  }

  if (limit.count >= maxRequests) {
    return { allowed: false, resetAt: limit.resetAt };
  }

  limit.count++;
  return { allowed: true };
}
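To put the limiter in front of the chat route, it can be wrapped in Express-style middleware. This is a minimal sketch, not the exact code I run: `rateLimitMiddleware` is a hypothetical helper name, and it assumes `authenticate` has already populated `req.user` and that the JWT payload carries the user ID in a `sub` claim.

```javascript
// Hypothetical helper: wraps a limiter function (such as checkRateLimit)
// in Express-style middleware. Assumes req.user.sub holds the user ID.
function rateLimitMiddleware(check) {
  return function (req, res, next) {
    const result = check(req.user.sub);

    if (result.allowed) return next();

    // Tell the client how long to wait, in whole seconds
    const retryAfterSec = Math.max(1, Math.ceil((result.resetAt - Date.now()) / 1000));
    res.setHeader('Retry-After', String(retryAfterSec));
    res.status(429).json({ error: 'Rate limit exceeded', retryAfterSec });
  };
}
```

Wiring it up would look something like `app.post('/api/chat', authenticate, rateLimitMiddleware(checkRateLimit), handler)`, so rejected requests never reach the Anthropic SDK.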

2. Streaming responses

Users expect real-time output. Nobody wants to wait 10 seconds for a full response to appear at once.

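const Anthropic = require('@anthropic-ai/sdk');
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

app.post('/api/chat', authenticate, async (req, res) => {
  const { message, history = [] } = req.body;

  // Set streaming headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = await anthropic.messages.stream({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 2048,
      // Trim before sending so long conversations don't inflate input tokens
      messages: trimHistory(history).concat([{ role: 'user', content: message }])
    });

    for await (const chunk of stream) {
      // Forward only text deltas; other event types carry no text
      if (chunk.type === 'content_block_delta' && chunk.delta.type === 'text_delta') {
        res.write(`data: ${JSON.stringify({ text: chunk.delta.text })}\n\n`);
      }
    }

    res.write('data: [DONE]\n\n');
  } catch (err) {
    // Headers are already sent, so report the failure inside the stream
    res.write(`data: ${JSON.stringify({ error: 'Upstream error' })}\n\n`);
  } finally {
    res.end();
  }
});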
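On the browser side, `EventSource` only issues GET requests, so a POST endpoint like this one has to be read by hand with `fetch`. Here's a rough sketch of what that could look like. `parseSseBuffer` and `streamChat` are hypothetical names, and `token` and `onText` are supplied by the caller; the parser only handles the `data: <payload>` framing the handler above writes.

```javascript
// Splits buffered SSE text into complete events plus any leftover partial event.
function parseSseBuffer(buffer) {
  const parts = buffer.split('\n\n');
  const rest = parts.pop(); // last piece may be an incomplete event
  const texts = [];
  let done = false;

  for (const part of parts) {
    if (!part.startsWith('data: ')) continue;
    const payload = part.slice('data: '.length);
    if (payload === '[DONE]') { done = true; break; }
    const data = JSON.parse(payload);
    if (typeof data.text === 'string') texts.push(data.text);
  }
  return { texts, rest, done };
}

// Hypothetical browser-side consumer for POST /api/chat.
async function streamChat(message, history, token, onText) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` },
    body: JSON.stringify({ message, history })
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done: streamEnded } = await reader.read();
    if (streamEnded) break;
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSseBuffer(buffer);
    buffer = parsed.rest; // keep the partial event for the next read
    parsed.texts.forEach(onText);
    if (parsed.done) break;
  }
}
```

Network chunks don't line up with event boundaries, which is why the parser hands back the unfinished tail instead of trying to parse it.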

3. Conversation history management

Claude doesn't have memory between API calls. You need to send the full conversation history each time — but you need to trim it or costs explode.

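function trimHistory(history, maxTokenEstimate = 4000) {
  // Rough estimate: 1 token ≈ 4 characters
  const charLimit = maxTokenEstimate * 4;

  let totalChars = 0;
  const trimmed = [];

  // Walk backwards, keep recent messages that fit
  // (assumes each message's content is a plain string)
  for (let i = history.length - 1; i >= 0; i--) {
    const msgChars = history[i].content.length;
    if (totalChars + msgChars > charLimit) break;
    trimmed.unshift(history[i]);
    totalChars += msgChars;
  }

  // The Messages API expects the conversation to open with a user turn,
  // so drop any assistant messages left at the front after trimming
  while (trimmed.length > 0 && trimmed[0].role !== 'user') trimmed.shift();

  return trimmed;
}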

4. Auth with JWT

const jwt = require('jsonwebtoken');

function authenticate(req, res, next) {
  const token = req.headers.authorization?.replace('Bearer ', '');

  if (!token) return res.status(401).json({ error: 'No token' });

  try {
    req.user = jwt.verify(token, process.env.JWT_SECRET);
    next();
  } catch {
    res.status(401).json({ error: 'Invalid token' });
  }
}

The infrastructure cost breakdown

Component                  Monthly cost
VPS (2GB RAM, Hetzner)     €4.51 (~$5)
Anthropic API budget       $40-60
Domain + SSL               ~$1 amortized
Total per month            ~$65
Revenue at 50 users        $100
Profit margin              35%

The model works because most users are occasional users. At $2/month, the typical customer isn't a power user running Claude 8 hours a day; it's a developer who wants AI on demand without commitment.
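The arithmetic behind those numbers is simple enough to sketch. The function name and field names below are illustrative, and the break-even figure treats the monthly cost as fixed, which is a simplification: in practice API spend grows with the number of active users.

```javascript
// Inputs come from the cost breakdown above; names are illustrative.
function unitEconomics({ users, pricePerUser, monthlyCost, apiBudget }) {
  const revenue = users * pricePerUser;
  const profit = revenue - monthlyCost;
  return {
    revenue,
    profit,
    marginPct: Math.round((profit / revenue) * 100),
    // Average API spend each user can consume before the budget is gone
    apiBudgetPerUser: apiBudget / users,
    // Users needed to cover costs, assuming costs stay flat
    breakEvenUsers: Math.ceil(monthlyCost / pricePerUser)
  };
}

const result = unitEconomics({ users: 50, pricePerUser: 2, monthlyCost: 65, apiBudget: 60 });
console.log(result);
// { revenue: 100, profit: 35, marginPct: 35, apiBudgetPerUser: 1.2, breakEvenUsers: 33 }
```

The number that matters most is `apiBudgetPerUser`: at the top of the API budget range, the average user has to stay under about $1.20 of API usage per month for the margin to hold.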

What I'd do differently

Redis for rate limiting instead of in-memory. When the server restarts, in-memory limits reset. Redis survives restarts and scales across multiple instances.
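A minimal sketch of what the Redis version could look like, using the common fixed-window pattern (INCR, then EXPIRE on the first hit). It assumes `redis` is a connected client with promise-based `incr`, `expire`, and `ttl` methods, as ioredis provides; the `ratelimit:` key prefix is my own placeholder.

```javascript
// Fixed-window limiter backed by Redis.
// `redis` is assumed to be a connected client (for example, ioredis).
async function checkRateLimitRedis(redis, userId, maxRequests = 50, windowSec = 3600) {
  const key = `ratelimit:${userId}`; // key naming is an assumption

  const count = await redis.incr(key);

  // First request in this window: start the countdown
  if (count === 1) {
    await redis.expire(key, windowSec);
  }

  if (count > maxRequests) {
    const ttl = await redis.ttl(key); // seconds until the window resets
    return { allowed: false, resetAt: Date.now() + ttl * 1000 };
  }

  return { allowed: true };
}
```

One caveat: INCR and EXPIRE are two separate commands here, so a crash between them would leave a counter with no expiry. Wrapping both in a transaction or a Lua script would close that gap.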

Per-user token budgets tracked in a database. Right now rate limiting is request-based (50 requests/hour). Better: track actual tokens per user per month and enforce a ceiling.
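Here's roughly how that could work, with a `Map` standing in for the database table. The ceiling value and the function names are placeholders. The `usage` object is the `{ input_tokens, output_tokens }` pair the Messages API returns with each response; with the SDK's streaming helper it should be available on the message returned by `stream.finalMessage()`.

```javascript
// In-memory stand-in for a database table keyed by user and calendar month.
const tokenUsage = new Map();

const MONTHLY_TOKEN_CEILING = 500000; // hypothetical ceiling, tune to your API budget

function usageKey(userId, date = new Date()) {
  // Produces keys like "user-1:2026-04" (UTC month)
  return `${userId}:${date.toISOString().slice(0, 7)}`;
}

// Check before calling the API
function hasBudget(userId, date = new Date()) {
  return (tokenUsage.get(usageKey(userId, date)) || 0) < MONTHLY_TOKEN_CEILING;
}

// Record after each response, using the API's reported usage
function recordUsage(userId, usage, date = new Date()) {
  const key = usageKey(userId, date);
  const total = (tokenUsage.get(key) || 0) + usage.input_tokens + usage.output_tokens;
  tokenUsage.set(key, total);
  return total;
}
```

Keying by month means the budget resets on its own when the calendar rolls over, with no cleanup job needed for correctness.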

Model routing — use Claude Haiku for short factual queries, Sonnet for longer reasoning tasks. Haiku costs ~10x less per token. Automatic model selection based on prompt length and complexity could cut API costs by 40%.
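A first pass at routing could be a plain heuristic, something like the sketch below. The thresholds and keywords are untuned placeholders, and the Haiku model ID is an assumption (check Anthropic's model list for current names); the Sonnet ID is the one used in the chat handler earlier.

```javascript
const MODELS = {
  fast: 'claude-3-haiku-20240307',      // assumed Haiku ID
  strong: 'claude-3-5-sonnet-20241022'  // same Sonnet ID as the chat handler
};

// Hypothetical heuristic: placeholder thresholds, not tuned values.
function pickModel(message, history = []) {
  const longPrompt = message.length > 400;
  const longThread = history.length > 6;
  const looksLikeReasoning = /\b(debug|refactor|stack trace|traceback)\b/i.test(message);

  return longPrompt || longThread || looksLikeReasoning ? MODELS.strong : MODELS.fast;
}
```

The result would replace the hardcoded `model` value in the chat handler. Whether the savings land anywhere near 40% depends entirely on how your users' traffic splits between the two buckets.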

Is it worth building vs buying?

If you want to run this yourself: the Anthropic API, a $5/month Hetzner VPS, and about 200 lines of Node.js will get you there. The code above covers ~80% of what you need.

If you just want access without the infra headache: SimplyLouie is what I run for others. $2/month, same Claude model, no server to maintain. Free 7-day trial, card required but not charged until day 8.

What's your setup?

Are you running your own Claude wrapper? Using the raw API with token budgets? Or just paying full price for ChatGPT Plus?

I'm curious what cost-control strategies developers are actually using in production — drop them in the comments.

#claude #ai #webdev #tutorial #discuss




Brian Austin | Sciencx (2026-04-20T22:17:32+00:00) I built a $2/month AI assistant and hosted it myself — here’s the full architecture. Retrieved from https://www.scien.cx/2026/04/20/i-built-a-2-month-ai-assistant-and-hosted-it-myself-heres-the-full-architecture/
