What is RAG and Why Use It?
RAG (Retrieval-Augmented Generation) lets a model answer questions from your actual documentation instead of relying only on its training data. You hand it the relevant docs, it reads them, and it answers based on what's in front of it.
This post builds the simplest possible version: no vector database, no embeddings. With a modern small model like GPT-5.4 mini you can pass a whole doc straight into the prompt and let the model answer. That covers a surprising number of real cases, and you can add a vector store later, once your docs outgrow the context window.
Setting Up the Project
Create the project and install what we need:

    mkdir my-rag-project
    cd my-rag-project
    npm init -y
    npm install openai dotenv express

Put your OpenAI key in a .env file:

    OPENAI_API_KEY=your-api-key-here

Set up package.json for ES modules so the imports below work:

    {
      "name": "my-rag-project",
      "version": "1.0.0",
      "type": "module",
      "scripts": {
        "start": "node src/server.js",
        "dev": "node --watch src/server.js"
      },
      "dependencies": {
        "dotenv": "^16.4.5",
        "express": "^5.0.0",
        "openai": "^5.0.0"
      }
    }

Now create the folders and a few sample docs to query. Use printf so the files actually have content on macOS, Linux, and Git Bash:

    mkdir src docs
    printf '# Troubleshooting\n\nIf the API returns a 500, check your API key and retry with backoff.\n' > docs/troubleshooting.md
    printf '# Getting Started\n\nInstall the SDK and set OPENAI_API_KEY before running the server.\n' > docs/getting-started.md
    printf '# API Reference\n\nPOST /ask with a JSON body: { "question": "..." }.\n' > docs/api-reference.md

A First Pass: Answer From One File
Start with the simplest version: read one file, ask the model, return the answer.

    import { OpenAI } from 'openai'
    import fs from 'fs/promises'
    import dotenv from 'dotenv'

    // Load OPENAI_API_KEY from .env into process.env
    dotenv.config()

    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

    async function answerQuestion(question) {
      // The path is relative to where you run node, so run this from the project root
      const docContent = await fs.readFile('./docs/troubleshooting.md', 'utf-8')

      // Large files can exceed the context window. Check your model's limit and split if needed.
      const response = await openai.chat.completions.create({
        model: 'gpt-5.4-mini',
        messages: [
          {
            role: 'system',
            content: 'You answer questions using only the provided documentation.',
          },
          {
            role: 'user',
            content: `Documentation:\n${docContent}\n\nQuestion: ${question}\n\nAnswer using the documentation. If the answer is not there, say so.`,
          },
        ],
        max_completion_tokens: 4096,
      })

      return response.choices[0].message.content
    }

    const answer = await answerQuestion('How do I handle a 500 error?')
    console.log(answer)

Two things changed from the older gpt-4o-mini examples you might have seen. The model is now gpt-5.4-mini (swap in gpt-5.4-nano if you want it cheaper and faster for simple lookups), and the output limit is set with max_completion_tokens, since max_tokens is deprecated. GPT-5.4 mini has a large context window, so small and medium docs fit in one prompt. For exact limits, check the models page.
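Before sending a larger file, a quick size check can tell you whether it is likely to fit. The sketch below is only a rough guide: it assumes about four characters per token, which is a common rule of thumb for English text and not the model's real tokenizer. The helper names estimateTokens and fitsInContext are hypothetical, and the contextLimit value is a placeholder to replace with the figure from the models page.

```javascript
// Rough estimate: about 4 characters per token for English text.
// This is an approximation, not the model's actual tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4)
}

// Hypothetical helper: does the doc plus the question leave room for the answer?
// contextLimit is a placeholder; take the real value from your model's documentation.
// reservedForOutput defaults to the max_completion_tokens used in this post.
function fitsInContext(docContent, question, { contextLimit, reservedForOutput = 4096 }) {
  const promptTokens = estimateTokens(docContent) + estimateTokens(question)
  return promptTokens + reservedForOutput <= contextLimit
}

const doc = '# Troubleshooting\n\nIf the API returns a 500, check your API key and retry with backoff.\n'
const question = 'How do I handle a 500 error?'

// 100000 is an illustrative limit, not the real limit of any specific model
console.log(fitsInContext(doc, question, { contextLimit: 100000 }))
```

If the check fails, that is the signal to split the file or send only part of it. For an exact count you would need a tokenizer that matches the model.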
Picking the Right File Automatically
One file is fine for a demo. Real docs are split across many. So let the model pick the most relevant file first, then answer from it.

    // src/ragService.js
    import { OpenAI } from 'openai'
    import fs from 'fs/promises'
    import dotenv from 'dotenv'

    dotenv.config()

    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

    async function selectRelevantFile(question) {
      const files = await fs.readdir('./docs')
      const fileList = files
        .filter(f => f.endsWith('.md') || f.endsWith('.txt'))
        .map(f => ({ filename: f }))

      const response = await openai.chat.completions.create({
        model: 'gpt-5.4-mini',
        messages: [
          {
            role: 'system',
            content: "You select the most relevant documentation file for a question. Respond in JSON with 'filename' and 'reason' fields.",
          },
          {
            role: 'user',
            content: `Available files: ${JSON.stringify(fileList)}\n\nQuestion: "${question}"\n\nPick the most relevant file and explain why, in JSON.`,
          },
        ],
        response_format: { type: 'json_object' },
      })

      const selection = JSON.parse(response.choices[0].message.content)

      // Only accept a name from the directory listing, so a made-up filename
      // or a path outside ./docs never reaches fs.readFile
      if (!fileList.some(f => f.filename === selection.filename)) {
        throw new Error(`Model picked an unknown file: ${selection.filename}`)
      }

      return selection
    }

    export async function smartRAG(question) {
      // 1. Pick the file
      const fileSelection = await selectRelevantFile(question)
      console.log(`Selected ${fileSelection.filename}: ${fileSelection.reason}`)

      // 2. Read it
      const docContent = await fs.readFile(`./docs/${fileSelection.filename}`, 'utf-8')

      // 3. Answer from it
      const response = await openai.chat.completions.create({
        model: 'gpt-5.4-mini',
        messages: [
          {
            role: 'system',
            content: 'You answer questions using only the provided documentation.',
          },
          {
            role: 'user',
            content: `Documentation from ${fileSelection.filename}:\n${docContent}\n\nQuestion: ${question}\n\nAnswer using this documentation. If the answer is not there, say so.`,
          },
        ],
        max_completion_tokens: 4096,
      })

      return { fileSelection, answer: response.choices[0].message.content }
    }

Wrapping It in an API
Put smartRAG behind a small Express endpoint so anything can call it.
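
    // src/server.js
    import express from 'express'
    import { smartRAG } from './ragService.js'

    const app = express()
    app.use(express.json())

    app.post('/ask', async (req, res) => {
      // req.body is undefined when no JSON body was sent, so default to {}
      const { question } = req.body ?? {}

      // Reject bad input before spending a model call on it
      if (typeof question !== 'string' || question.trim() === '') {
        return res.status(400).json({ error: 'Send a JSON body with a non-empty "question" string' })
      }

      try {
        const result = await smartRAG(question)
        res.json(result)
      } catch (error) {
        console.error('Error:', error)
        res.status(500).json({ error: "Couldn't process your question" })
      }
    })

    app.listen(3000, () => {
      console.log('RAG API running on port 3000')
    })

Your project should look like this: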

    my-rag-project/
    ├── .env
    ├── package.json
    ├── src/
    │   ├── ragService.js
    │   └── server.js
    └── docs/
        ├── api-reference.md
        ├── getting-started.md
        └── troubleshooting.md

Start it:

    npm run dev

Testing It
Send a POST request to http://localhost:3000/ask with a JSON body. With curl:

    curl -X POST http://localhost:3000/ask \
      -H "Content-Type: application/json" \
      -d '{"question": "How do I handle a 500 error?"}'

You get back the file the model chose and its answer:

    {
      "fileSelection": {
        "filename": "troubleshooting.md",
        "reason": "The question is about error handling, which the troubleshooting doc covers"
      },
      "answer": "Check your API key and retry with backoff..."
    }

Postman works the same way: a POST to the same URL, Content-Type: application/json, and the same JSON body as the curl command.
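If you would rather call the endpoint from JavaScript, the same request can be built in a few lines. This is a minimal sketch: buildAskRequest is a hypothetical helper name, and the default base URL assumes the server is running locally on port 3000 as set up in this post.

```javascript
// Hypothetical helper: builds the same request the curl command sends.
function buildAskRequest(question, baseUrl = 'http://localhost:3000') {
  return {
    url: `${baseUrl}/ask`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ question }),
    },
  }
}

const { url, init } = buildAskRequest('How do I handle a 500 error?')
console.log(url, init.body)

// With the server running, pass both to the global fetch (Node 18+):
//   const res = await fetch(url, init)
//   console.log(await res.json())
```

Keeping the request construction separate from the network call makes it easy to check the URL and body without a running server.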
Where This Works (and When to Add a Vector DB)
This whole-file approach goes further than you'd expect:
- Docs that change often. Edit a markdown file and the next question uses the new content. Nothing to rebuild or reindex.
- Internal tools and support. Your team gets answers from your real docs, not the model's training data.
- Prototypes. You can prove the idea in an afternoon without standing up any infrastructure.
It has a ceiling, though. Once your docs are too big to fit in the context window, or you need the model to pull a few relevant passages out of thousands of pages, that's when embeddings and a vector database start to earn their keep. Until then, plain files are usually enough.
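A middle step before reaching for embeddings is to split a large file into sections and send only the ones you need. The sketch below shows one way to do that, assuming your docs are markdown with `#` headings. splitByHeading is a hypothetical helper, and it does not handle lines that start with `#` inside code blocks, so treat it as a starting point.

```javascript
// Hypothetical helper: split a markdown document into sections, one per heading.
// Caveat: a line starting with "# " inside a code block is treated as a heading too.
function splitByHeading(markdown) {
  const sections = []
  let current = null

  for (const line of markdown.split(/\r?\n/)) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line)
    if (match) {
      // A heading line starts a new section
      current = { heading: match[2].trim(), lines: [] }
      sections.push(current)
    } else if (current) {
      current.lines.push(line)
    } else if (line.trim() !== '') {
      // Text before the first heading goes into an untitled section
      current = { heading: '', lines: [line] }
      sections.push(current)
    }
  }

  return sections.map(s => ({ heading: s.heading, body: s.lines.join('\n').trim() }))
}

const sample = '# Troubleshooting\n\nIf the API returns a 500, check your API key and retry with backoff.\n'
console.log(splitByHeading(sample))
```

Each section could then go through the same pick-then-answer flow used for whole files, with the section headings playing the role of the filenames.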
Where to Go Next
When you outgrow plain files, the next step depends on your stack:
- Building a RAG system with MongoDB and Node.js: keyword retrieval if you already run MongoDB, no separate vector database needed.
- Building a RAG system with Pinecone and Node.js: semantic search with embeddings, for large collections that need to match on meaning.

