Making Vector Search Identity-Aware in RAG Systems

Parminder Singh

In a traditional application, access control is enforced at the data layer. Queries are written so that only rows the user is allowed to see are returned. This can be done with joins, views, or row-level security, but the key idea is the same: the database never returns unauthorized data in the first place. In most RAG systems, by contrast, documents and embeddings live in a vector store while access control lives in an external database. The system retrieves the top-K results from the vector index first, and only then applies permission checks in the application layer. This creates challenges at scale.

First, the trust boundary moves. Unauthorized candidates are retrieved before authorization is enforced. Whether the system returns content or only IDs and metadata, enforcement happens too late. The system now relies on application code to do the right thing every time, which weakens zero-trust guarantees.

Second, retrieval quality degrades as the dataset grows. The vector store ranks results globally, not within the user's authorized subset. As that subset becomes sparser relative to the corpus, the probability that the global top-K contains relevant, authorized data drops quickly.

This becomes especially painful in agentic systems. Imagine a tool that retrieves the top 10 chunks for an LLM. If filtering happens after retrieval, those 10 chunks might collapse to 2 or 3 once permissions are applied. The LLM is now reasoning over an incomplete context, even though thousands of relevant, authorized chunks exist deeper in the index. This increases the chance of hallucinations or low-quality answers.

This is the difference between post-filtering and in-index filtering.

In a standard HNSW search, if you fetch K=10 results and then filter them in the service layer, you are doing post-filtering. If a user only has access to 10 percent of the data, it is very likely that most of the top-10 global neighbors will be unauthorized. You return fewer chunks, not because the data does not exist but because the engine never looked far enough.
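
To make the post-filtering pattern concrete, here is a minimal sketch; index.search and is_authorized are hypothetical stand-ins for the vector store client and the permission check:

def post_filtered_retrieval(index, user, query_vec, k=10):
    # Rank globally first: the index knows nothing about the user.
    candidates = index.search(query_vec, k=k)  # hypothetical ANN client call

    # Enforce permissions afterwards, in the application layer.
    authorized = [c for c in candidates if is_authorized(user, c)]

    # With 10 percent authorization, this often returns 1 or 2 chunks
    # instead of k, even though relevant, authorized chunks exist
    # deeper in the index.
    return authorized

Over-fetching (retrieving 5K or 10K candidates and filtering down) softens the problem, but it gives no guarantee of filling the top-K, and the over-fetch factor has to grow as authorization gets sparser.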

When filtering is pushed into the query planner, the engine evaluates access policies during graph traversal. If a candidate fails the security check, it is discarded and the search continues until the LIMIT is satisfied. (This requires an engine that can apply the filter during candidate generation, not after the top-K list is finalized.) The result is that the LLM receives a full set of K results: the most relevant chunks the user is actually allowed to see.
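
For intuition, this is what in-index filtering looks like with an explicit predicate. The table and values are hypothetical; the row-level-security setup below achieves the same effect without the application writing the WHERE clause at all:

-- Hypothetical table with 3-dimensional embeddings.
-- The predicate is evaluated per candidate during the index scan,
-- so traversal continues until 10 authorized rows are found.
SELECT id, content
FROM documents
WHERE acl_group = 'project_alpha'
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 10;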

The core idea is simple: make the vector store identity-aware.

Access attributes like clearance, project, or ownership need to be first-class metadata, enforced during retrieval, not after it. This turns authorization into a ranking constraint instead of a post-filtering step.

From an implementation perspective, this is mostly a schema and policy design problem. The example below uses Postgres with pgvector, but the same logic applies to any engine that supports pre-filtering or in-index filtering.

Postgres is a good fit here because it combines row-level security with vector search, allowing access control and retrieval to happen in the same execution path.

Define security attributes directly in the table schema

CREATE TABLE knowledge_base (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT,
    clearance_required INT DEFAULT 0,  -- minimum clearance needed to read this row
    project_code TEXT,                 -- checked by the deny-list policy below
    embedding vector(1536)
);

ALTER TABLE knowledge_base ENABLE ROW LEVEL SECURITY;

CREATE POLICY p_clearance_check ON knowledge_base
    FOR SELECT
    USING (clearance_required <= current_setting('app.user_clearance')::int);

-- RESTRICTIVE policies are ANDed with the permissive policy above.
-- IS DISTINCT FROM keeps rows with a NULL project_code visible;
-- plain != would silently drop them.
CREATE POLICY r_deny_list ON knowledge_base
    AS RESTRICTIVE
    FOR SELECT
    USING (project_code IS DISTINCT FROM current_setting('app.deny_project'));
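
One thing the snippets assume implicitly is an HNSW index on the embedding column; without it there is no graph to traverse, hnsw.iterative_scan has no effect, and Postgres falls back to an exact sequential scan. A minimal version, matching the <=> (cosine distance) operator used in the retrieval query below:

-- vector_cosine_ops pairs with the <=> operator
CREATE INDEX ON knowledge_base USING hnsw (embedding vector_cosine_ops);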

Before retrieval, bind the user's identity context

import psycopg
from pgvector.psycopg import register_vector

DSN = "postgresql://rag_app@localhost/rag"  # placeholder; use a role that does NOT own the table, so RLS applies

def abac_secure_retrieval(jwt_claims, query_vec, global_deny="NONE"):
    user_clearance = jwt_claims.get('clearance', 0)

    with psycopg.connect(DSN) as conn:
        register_vector(conn)  # adapt query vectors (e.g. numpy arrays) to the vector type
        with conn.cursor() as cur:
            # SET does not accept bind parameters, so bind the identity
            # context via set_config(); is_local=true scopes it to this
            # transaction, like SET LOCAL.
            cur.execute("SELECT set_config('app.user_clearance', %s, true)",
                        (str(user_clearance),))
            cur.execute("SELECT set_config('app.deny_project', %s, true)",
                        (global_deny,))
            # pgvector 0.8+: keep traversing the HNSW graph until LIMIT
            # rows survive the row-level security policies.
            cur.execute("SET LOCAL hnsw.iterative_scan = 'strict_order'")

            cur.execute("""
                SELECT content
                FROM knowledge_base
                ORDER BY embedding <=> %s
                LIMIT 5
            """, (query_vec,))

            return cur.fetchall()
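
A hypothetical call, assuming the JWT has already been verified upstream and embed() is whatever model produced the stored embeddings:

claims = {"sub": "u_123", "clearance": 2}       # parsed from a verified JWT
query_vec = embed("quarterly security review")  # hypothetical embedding helper

for (content,) in abac_secure_retrieval(claims, query_vec, global_deny="PROJECT_X"):
    print(content[:80])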

This results in the following properties.

  • Identity does not leak across layers. The user context exists only for the lifetime of the transaction.
  • Access decisions are auditable at the source. The database enforces policy, not the application.
  • Retrieval quality is preserved under strict authorization. The engine continues traversal until it finds the best authorized neighbors, instead of returning an underfilled result set.