AI / LLM · 7 min read · July 2025

Making AI Remember: Persistent Chat Threads in NeoGPT

I wanted to build a chatbot that actually remembered what you talked about — not just within a session, but across days and devices. NeoGPT was the result. Here's how I made it work.

The problem with stateless chatbots

Most chatbots you interact with have no persistent memory. Close the tab, lose your context. Start a new session, start from scratch. For casual questions that's fine. But for anything where you're building on previous exchanges — debugging, research, long-form writing — it's a real productivity killer.

NeoGPT solves this by treating every conversation as a named thread with its own URL. You can share that URL with someone else or open it on your phone and resume exactly where you left off.

Thread schema in MongoDB

Each thread is a single MongoDB document. The messages array grows as the conversation continues. I keep the full message history in the document — no separate messages collection — because threads are always read and written together.

```javascript
// a thread document looks roughly like this
{
  _id: ObjectId,
  threadId: "t_abc123",    // URL-safe ID
  userId: "...",           // optional auth
  title: "Debug session",  // auto-generated from first message
  createdAt: Date,
  updatedAt: Date,
  messages: [
    { role: "user",      content: "...", timestamp: Date },
    { role: "assistant", content: "...", timestamp: Date },
    // ... up to context limit
  ]
}
```
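Appending a turn is then a single atomic update on that one document. A minimal sketch of what that looks like with the MongoDB Node.js driver — `threads` is a collection handle, and `buildMessage`/`appendMessage` are illustrative names, not NeoGPT's actual code:

```javascript
// `threads` is a MongoDB collection handle, e.g. db.collection('threads')

function buildMessage(role, content) {
  // Shape matches the entries in the messages array above
  return { role, content, timestamp: new Date() };
}

async function appendMessage(threads, threadId, role, content) {
  // $push grows the embedded messages array in one atomic write,
  // so the thread never needs a separate messages collection
  return threads.updateOne(
    { threadId },
    {
      $push: { messages: buildMessage(role, content) },
      $set: { updatedAt: new Date() },
    }
  );
}
```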

In-memory caching with Redis

MongoDB reads are fast, but not fast enough for a chat interface where you want sub-100ms response starts. I cache full thread objects in Redis with a 1-hour TTL. When a user resumes a conversation, the thread loads from Redis in ~5ms. If the cache expires, we fall back to MongoDB and re-warm the cache.

💡 One thing I got wrong initially: I was caching individual messages rather than the whole thread document. That meant 20+ Redis lookups to reconstruct a thread. Caching the whole document and invalidating on update is much simpler and faster.
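A minimal read-through sketch of that pattern — the client handles are injected, and the `thread:<id>` key scheme is my illustration, not necessarily NeoGPT's:

```javascript
// Read-through cache: try Redis first, fall back to MongoDB on a miss,
// then re-warm the cache with a TTL.
// `redis` is a connected node-redis client, `db` a MongoDB database handle.
const THREAD_TTL_SECONDS = 60 * 60; // 1-hour TTL, as described above

const cacheKey = (threadId) => `thread:${threadId}`;

async function loadThread(redis, db, threadId) {
  const cached = await redis.get(cacheKey(threadId));
  if (cached) return JSON.parse(cached); // hot path, ~5ms

  // Cache miss or expired: read from MongoDB and re-warm
  const thread = await db.collection('threads').findOne({ threadId });
  if (thread) {
    await redis.set(cacheKey(threadId), JSON.stringify(thread), {
      EX: THREAD_TTL_SECONDS,
    });
  }
  return thread;
}
```

On every write, you delete (or overwrite) the single `thread:<id>` key — one invalidation per thread, instead of one per message.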

Groq for fast inference

I chose Groq over OpenAI primarily for latency. Groq's hardware runs inference noticeably faster — the first token arrives in ~300ms on most queries. For a chat interface, the difference between 300ms and 1s is huge psychologically. I'm using the mixtral-8x7b-32768 model, which handles multilingual input well.

```javascript
import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

async function generateResponse(messages) {
  const completion = await groq.chat.completions.create({
    model: "mixtral-8x7b-32768",
    messages,
    stream: true,   // stream tokens for perceived speed
    temperature: 0.7,
  });

  return completion;
}
```
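With `stream: true`, the SDK returns an async iterable of chunks. Here's a sketch of draining it — `deltaText` and `streamToClient` are my helper names, and the chunk shape follows the OpenAI-compatible streaming format Groq exposes:

```javascript
// Pull the incremental text out of one streamed chunk (may be empty)
const deltaText = (chunk) => chunk.choices?.[0]?.delta?.content ?? '';

// Forward tokens to the client as they arrive,
// e.g. write = (t) => res.write(t) in an HTTP handler
async function streamToClient(completion, write) {
  for await (const chunk of completion) {
    const text = deltaText(chunk);
    if (text) write(text);
  }
}
```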

Tavily web search — automatic tool invocation

I integrated Tavily so the model can decide on its own when to search the web. The prompt instructs the model to emit a structured JSON block when it needs fresh information. My backend parses that block, calls the Tavily API, injects the results back into the conversation context, and sends a follow-up request to the model. The user sees a smooth 'Searching the web...' indicator while this happens.
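A sketch of that loop. The JSON block format, the `extractSearchRequest`/`searchWeb` names, and the request shape are illustrative — NeoGPT's actual protocol may differ, and you should check Tavily's current docs for the auth scheme:

```javascript
// Find a web_search tool call the model emitted in its reply.
// The {"tool": "web_search", "query": "..."} shape is assumed here.
function extractSearchRequest(text) {
  const match = text.match(/\{[^{}]*"tool"\s*:\s*"web_search"[^{}]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]).query ?? null;
  } catch {
    return null;
  }
}

// Call Tavily's REST search endpoint and flatten the results into a
// plain-text block to inject into the follow-up model request.
async function searchWeb(query) {
  const res = await fetch('https://api.tavily.com/search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ api_key: process.env.TAVILY_API_KEY, query }),
  });
  const { results } = await res.json();
  return results.map((r) => `${r.title}: ${r.content}`).join('\n');
}
```

The backend runs `extractSearchRequest` over each model reply; a non-null result triggers the search, the UI indicator, and the follow-up completion with the results appended to the context.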

URL-based thread routing

Every thread gets a URL like /chat/t_abc123. Next.js App Router handles this with a dynamic [threadId] segment. The client loads the thread from the cache/DB on mount. New threads are created lazily — the URL updates (without a page reload) the moment you send your first message. This means you never lose a conversation even if you forget to name it.
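The lazy-URL part looks roughly like this on the client — `threadUrl`, `promoteToThread`, and the `createThread` API call are placeholder names for illustration:

```javascript
// The dynamic segment lives at app/chat/[threadId]/page.jsx and reads
// params.threadId to load the thread on mount.

const threadUrl = (threadId) => `/chat/${threadId}`;

// Called on the first message of a brand-new chat: persist the thread,
// then swap the URL in place (no navigation, no reload), so a refresh
// or a shared link now points at the saved thread.
async function promoteToThread(createThread, firstMessage) {
  const { threadId } = await createThread(firstMessage); // assumed API call
  window.history.replaceState(null, '', threadUrl(threadId));
  return threadId;
}
```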

What's next

I want to add vector search over past threads so you can ask 'did I talk about X before?' and get semantic search results across all your conversations. That's the next thing I'm building.

by Rahul Chowdhury