A grounded comparison of pgvector and Pinecone for engineering teams deciding whether to extend PostgreSQL or adopt a dedicated vector database. Covers performance ceilings, operational trade-offs, and the scale tipping points that actually matter.
You already run PostgreSQL. Your application data lives there. Your team knows how to operate it. Now you need vector search for a new feature: semantic search, recommendations, or a RAG pipeline. The question that inevitably comes up in the architecture review: do we add pgvector to our existing database, or spin up Pinecone?
This is not a theoretical question. We have helped over a dozen teams in Bengaluru and across India make this decision, and the answer depends on factors that most comparison articles ignore: your existing connection pooling setup, your DBA's comfort with index tuning, your latency requirements at the 95th percentile, and whether your vectors and relational data need to be queried in the same transaction.
HNSW (Hierarchical Navigable Small World) builds a multi-layered graph where each node connects to its approximate nearest neighbors. In pgvector, you create it with CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops), optionally appending WITH (m = 16, ef_construction = 64) to tune the two build parameters: m (connections per node, default 16) and ef_construction (candidate list size during build, default 64). Higher values improve recall but increase build time and memory usage.
On a 1-million-row table with 1536-dimensional embeddings, building an HNSW index with default parameters takes approximately 25 minutes on a db.r6g.xlarge RDS instance. With pgvector 0.7+, parallel index builds cut this to roughly 8 minutes using 4 workers. The resulting index is about 4.8 GB, and for consistent query performance it needs to sit largely in shared_buffers, memory that is shared with your regular PostgreSQL workload.
IVFFlat partitions vectors into clusters using k-means and searches only the closest clusters at query time. It builds faster than HNSW, roughly 10 minutes for 1 million vectors, and uses less memory since the index structure is simpler. The trade-off is lower recall at the same latency: IVFFlat with probes=10 on 100 lists achieves about 92% recall@10, while HNSW with ef_search=40 reaches 98% recall@10 at similar latency.
IVFFlat's real advantage is predictable resource usage. Each probe adds a fixed amount of work, making query latency more consistent across different data distributions. For applications where 92-95% recall is acceptable, such as product recommendations or content discovery where showing a slightly suboptimal result is invisible to users, IVFFlat with lower memory overhead can be the right engineering choice.
Memory consumption in pgvector is the sum of the raw vector data in the table plus the index structures in shared_buffers. For 1536-dimensional float32 vectors, each row consumes 6,144 bytes for the vector plus roughly 100 bytes of PostgreSQL tuple overhead. At 1 million vectors, the table alone requires 5.8 GB on disk. The HNSW index with m=16 adds approximately 4.8 GB, and you need at least 80% of this in shared_buffers for consistent query performance. Total per million vectors: roughly 10-11 GB of effective memory.
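The sizing arithmetic above can be captured in a few lines. A sketch using the article's figures; the 4.8 GB-per-million HNSW number is the measured value quoted above, and the assumption that halfvec also halves the index size is ours, not the article's:

```python
# Back-of-envelope pgvector memory sizing, using the figures above.
# Actual table size varies with page fill factor and TOAST behaviour.

def pgvector_footprint_gb(n_vectors, dims, half_precision=False,
                          hnsw_gb_per_million=4.8):
    bytes_per_dim = 2 if half_precision else 4   # halfvec (float16) vs float32
    row_bytes = dims * bytes_per_dim + 100       # ~100 B PostgreSQL tuple overhead
    table_gb = n_vectors * row_bytes / 1024**3
    if half_precision:
        hnsw_gb_per_million /= 2                 # assumption: index shrinks roughly in half too
    index_gb = n_vectors / 1_000_000 * hnsw_gb_per_million
    return round(table_gb + index_gb, 1)

print(pgvector_footprint_gb(1_000_000, 1536))                        # → 10.6
print(pgvector_footprint_gb(1_000_000, 1536, half_precision=True))
```

Plugging in your own vector count and dimensionality gives a quick first answer to "does this fit on our current instance".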
This is why pgvector hits a practical ceiling around 5-10 million vectors on typical RDS instances. A db.r6g.4xlarge with 128 GB RAM can hold about 5 million 1536-dimension vectors with HNSW while still leaving room for your application's regular PostgreSQL workload. Beyond that, you either upgrade to larger (and much more expensive) instances or accept slower queries as the index spills to disk.
pgvector 0.7 introduced the halfvec type, storing vectors in float16 instead of float32. This cuts storage and memory per vector in half, meaning 1 million 1536-dimensional vectors consume approximately 3 GB instead of 6 GB for the raw data. On recall benchmarks, halfvec typically loses less than 0.5% recall@10 compared to full-precision vectors for cosine similarity, making it a near-free optimization for most workloads.
Before pgvector 0.7, HNSW index creation was single-threaded, making it painfully slow for large datasets. A 5-million-vector index took over two hours. With parallel builds (set max_parallel_maintenance_workers = 4), the same index builds in approximately 35 minutes. This matters for operational workflows: rebuilding an index after a data migration or schema change no longer requires a multi-hour maintenance window.
Vector similarity queries in pgvector often require setting session-level parameters like SET hnsw.ef_search = 100 for recall tuning. If you run PgBouncer in transaction pooling mode, the default for most production setups, SET commands do not persist across transactions because PgBouncer reassigns connections. This means your ef_search setting reverts to the default between queries, leading to inconsistent recall.
The workaround is to either set ef_search at the PostgreSQL configuration level (ALTER SYSTEM SET hnsw.ef_search = 100) so it applies to all sessions, or use PgBouncer's session pooling mode for vector query connections while keeping transaction pooling for regular application queries. Some teams deploy a separate PgBouncer instance on a different port specifically for vector workloads, avoiding the pooling conflict entirely.
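A further option worth knowing: SET LOCAL scopes a parameter to the current transaction, and since PgBouncer pins a server connection for the duration of a transaction, the setting reliably applies to the query that follows it. A sketch of the statement sequence to run inside one transaction; the items table and embedding column are illustrative names:

```python
# Per-query ef_search tuning that survives PgBouncer transaction pooling.
# set_config(..., is_local => true) is the parameterizable equivalent of
# SET LOCAL: it resets at COMMIT, so it travels with the transaction to
# whichever server connection PgBouncer assigns.

def vector_search_statements(ef_search: int, k: int = 10) -> list[str]:
    """SQL statements to execute inside ONE transaction (BEGIN ... COMMIT)."""
    return [
        f"SELECT set_config('hnsw.ef_search', '{ef_search}', true)",
        f"SELECT id FROM items ORDER BY embedding <=> $1 LIMIT {k}",
    ]

for stmt in vector_search_statements(100):
    print(stmt)
```

This avoids both a server-wide ALTER SYSTEM and a second pooler, at the cost of one extra round trip per search.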
Pinecone Serverless charges separately for storage and for read/write units. At 1 million 1536-dimensional vectors, storage costs approximately $8/month. Read units are priced at $8 per million, and a single top_k=10 query consumes on the order of one read unit against a namespace of this size (more as namespaces grow). At 100,000 queries per day, roughly 3 million per month, expect about $12-24/month in read costs. Total: approximately $20-32/month for 1 million vectors at moderate query volume, significantly cheaper than running any dedicated infrastructure.
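The serverless bill is easy to model. A sketch using the unit prices quoted above; read-unit consumption per query varies with namespace size, so measure yours before trusting any estimate, and check current Pinecone pricing before budgeting:

```python
# Rough monthly cost model for Pinecone Serverless.
# Unit prices are the article's figures; read_units_per_query is an
# input you should measure against your own index.

def serverless_monthly_cost(queries_per_day, read_units_per_query,
                            storage_usd=8.0, usd_per_million_ru=8.0):
    monthly_read_units = queries_per_day * 30 * read_units_per_query
    read_cost = monthly_read_units / 1_000_000 * usd_per_million_ru
    return round(storage_usd + read_cost, 2)

# Example: 100K queries/day at 1 read unit per query
print(serverless_monthly_cost(100_000, 1))  # → 32.0
```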
Pinecone's pod-based architecture (p1 and s1 pods) provides dedicated compute with predictable latency. A p1 pod costs approximately $70/month and supports about 1 million vectors with single-digit millisecond query latency. An s1 pod costs $95/month with higher storage density, roughly 5 million vectors, but 2-3x higher query latency. The advantage over serverless is consistent p99 latency under 15 ms, compared to serverless where cold queries can spike to 200+ ms.
Pure vector search misses exact keyword matches that users expect. If someone searches for 'error code E-4102', vector similarity will return semantically related error documentation but may not surface the exact page about E-4102. Hybrid search combines BM25 keyword matching with vector similarity, typically using Reciprocal Rank Fusion (RRF) to merge the two result sets. This approach improves retrieval precision by 15-25% on benchmarks like BEIR compared to vector-only search.
PostgreSQL has native full-text search with tsvector and ts_rank, making hybrid search straightforward in pgvector. You run both a vector similarity query and a text search query, then combine results using RRF in a single SQL statement. The advantage is transactional consistency: both the keyword index and vector index are updated in the same transaction, so search results are never stale. No external service coordination needed.
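The fusion step itself is small. A minimal RRF sketch in Python; the combined SQL statement computes the same scores, and k=60 is the conventional RRF constant:

```python
# Reciprocal Rank Fusion over two ranked ID lists: each document scores
# 1/(k + rank) per list it appears in, and the sums are sorted descending.

def rrf(vector_hits, keyword_hits, k=60, top_n=10):
    scores = {}
    for ranking in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A document ranked well by BOTH retrievers rises to the top:
print(rrf(["a", "b", "c"], ["b", "d", "a"]))  # → ['b', 'a', 'd', 'c']
```

Note that "b" wins despite being ranked first by only one retriever: appearing high in both lists beats appearing first in one.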
Pinecone supports hybrid search through sparse-dense vectors: you provide both a dense embedding and a sparse BM25-style vector in the same upsert. The sparse vector is typically generated client-side with a SPLADE model or a BM25 encoder (Pinecone's pinecone-text library offers both), and queries can weight the dense and sparse components with an alpha parameter. The limitation is that sparse-vector generation adds complexity to your ingestion pipeline; in exchange, Pinecone handles the fusion internally, which simplifies the query path.
Amazon RDS supports pgvector as a managed extension, but with constraints. You cannot tune PostgreSQL kernel parameters like huge_pages or shared_preload_libraries, which limits HNSW memory optimization on larger instances. RDS also enforces maximum connection limits that can become a bottleneck when vector queries (which hold connections longer due to CPU-intensive similarity computation) compete with regular application queries. A db.r6g.2xlarge maxes out at 1,000 connections, and with vector queries averaging 15-30 ms each, you can saturate the instance at surprisingly low QPS.
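The "surprisingly low QPS" claim is simple arithmetic. Treating the 15-30 ms per query as mostly CPU time (an assumption; similarity computation dominates for in-memory HNSW), the sustainable ceiling is cores divided by per-query CPU time, far below what 1,000 connections suggests:

```python
# Rough CPU-bound throughput ceiling for vector queries.
# Assumes each query burns its wall-clock time as CPU, which is close to
# true for in-memory HNSW search.

def max_sustainable_qps(vcpus, query_ms):
    return int(vcpus / (query_ms / 1000.0))

# db.r6g.2xlarge has 8 vCPUs; vector queries at 15-30 ms each
print(max_sustainable_qps(8, 15))  # → 533 (best case)
print(max_sustainable_qps(8, 30))  # → 266
```

A few hundred QPS of vector search can therefore saturate an instance that comfortably serves thousands of QPS of ordinary OLTP queries.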
Aurora PostgreSQL separates storage from compute, meaning read replicas share the same storage volume and can serve vector queries without data replication lag. This lets you add Aurora read replicas specifically for vector search traffic while keeping your primary instance for transactional writes. Aurora's storage also auto-scales, eliminating the need to pre-provision disk space for vector data growth. The downside is cost: Aurora is roughly 20-30% more expensive than standard RDS for equivalent instance sizes.
Running PostgreSQL on EC2 or GCE gives you full control over kernel parameters, memory allocation, and pgvector compilation flags. You can compile pgvector with AVX-512 SIMD support, which accelerates distance calculations by 2-3x on supported instance types (c6i, m6i families). You can also set huge_pages=on and configure shared_buffers to 40-50% of total RAM instead of the conservative RDS default. The trade-off is taking on backup management, failover configuration, and security patching yourself.
On a db.r6g.xlarge (4 vCPUs, 32 GB RAM) with 1536-dimension vectors and HNSW (m=16, ef_search=100): at 100K vectors, p50 latency is 3 ms and p99 is 8 ms. At 1M vectors, p50 is 7 ms and p99 is 22 ms. At 5M vectors, p50 is 15 ms and p99 is 45 ms, and you start seeing periodic spikes to 100+ ms as the index exceeds shared_buffers. At 10M vectors on this instance size, the index cannot fit in memory and p99 degrades to 200+ ms, making it unsuitable for real-time search.
Pinecone pod-based (p1) delivers consistent p50 latency of 4-6 ms and p99 of 12-18 ms from 100K to 10M vectors, because the index is purpose-built and not sharing resources with other workloads. Pinecone Serverless has more variable latency: warm queries return in 20-40 ms p50, but cold queries (after minutes of inactivity) can take 150-300 ms as compute spins up. At 50M+ vectors, Pinecone pods maintain sub-20ms p99, while pgvector on any reasonable instance size cannot match this without sharding.
The strongest argument for pgvector is not performance. It is the ability to join vector search results with relational data in a single query. When a user searches your e-commerce product catalog, the query can combine vector similarity on product descriptions with SQL filters on price, availability, category, and user permissions, all in one round trip. With Pinecone, you query vectors first, get back IDs, then query your relational database for metadata, adding network latency and coordination complexity.
This matters even more for data freshness. When you insert a new product with pgvector, both the relational data and the vector are available in the same committed transaction. With Pinecone, there is an ingestion pipeline delay, typically 1-10 seconds, where a product exists in your database but is not yet searchable by vector similarity. For applications where immediacy matters, like support ticket routing or real-time content moderation, this gap creates a window of inconsistency that requires careful handling.
Choose pgvector when your vector count stays under 5 million, you need transactional consistency between vectors and relational data, your team already operates PostgreSQL confidently, you need hybrid BM25 + vector search in a single query, and your latency budget is above 20 ms at p99. The operational simplicity of one database instead of two is worth more than most teams initially realize.
Choose Pinecone when your vector count exceeds 10 million or will grow unpredictably, you need sub-10ms p99 latency at scale, your team does not want to tune PostgreSQL memory parameters for vector workloads, your vectors and application data have different scaling patterns, or you are building a multi-tenant SaaS where Pinecone namespaces map cleanly to tenant isolation.
The pattern we recommend for most teams in Bengaluru building their first AI feature: start with pgvector on your existing PostgreSQL instance. Monitor query latency at p95 and index memory usage weekly. Plan a migration to Pinecone (or Qdrant, Weaviate, or Milvus) when p95 latency consistently exceeds your SLA or when vector count approaches 5 million. The migration is straightforward since it only involves the vector data and search queries, not your entire application.
At 1 million vectors with 1536 dimensions and 50K queries/day: pgvector on RDS db.r6g.xlarge costs approximately $280/month (but you may already be paying this for your application database). Pinecone Serverless costs approximately $25-35/month. Pinecone p1 pod costs approximately $70/month. At 10 million vectors: pgvector requires RDS db.r6g.4xlarge at $1,120/month and you are at the performance ceiling. Pinecone Serverless costs approximately $100-150/month. Pinecone p2 pods cost approximately $350/month for consistent latency.
The hidden cost in pgvector is not the instance price, it is the opportunity cost of using your PostgreSQL instance's memory and CPU for vector workloads instead of your core application queries. If adding pgvector means upgrading your RDS instance from xlarge to 2xlarge, the instance price roughly doubles from $280 to $560/month, an incremental $280/month that makes Pinecone look like a bargain. If your PostgreSQL instance already has headroom, pgvector's incremental cost is effectively zero.
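That framing reduces to a one-line comparison. A sketch using the RDS price quoted earlier for db.r6g.xlarge, with the 2xlarge price assumed to be double it:

```python
# Decision arithmetic: pgvector's real cost is the *incremental* RDS
# spend, not the sticker price. Prices from the figures above; the
# 2xlarge price is an assumption (2x the xlarge price).

RDS_XLARGE = 280    # $/month, db.r6g.xlarge
RDS_2XLARGE = 560   # $/month, assumed 2x
PINECONE_P1 = 70    # $/month, p1 pod

def pgvector_incremental(needs_upgrade):
    # With headroom on the existing instance, vectors ride along for free
    return RDS_2XLARGE - RDS_XLARGE if needs_upgrade else 0

print(pgvector_incremental(True) > PINECONE_P1)   # → True: upgrade case favors Pinecone
print(pgvector_incremental(False) < PINECONE_P1)  # → True: headroom case favors pgvector
```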
pgvector handles production workloads well at moderate scale. Companies running 1-5 million vectors with sub-50ms latency requirements use pgvector in production without issues. The practical ceiling depends on your instance size and whether you can dedicate sufficient shared_buffers memory to the HNSW index. Beyond 5-10 million vectors, purpose-built vector databases offer better performance per dollar.
pgvector 0.7 introduced three significant improvements: halfvec (float16 vectors) that halves memory usage with minimal recall loss, parallel HNSW index builds that reduce index creation time by 3-4x, and improved HNSW search performance through better memory access patterns. If you benchmarked pgvector before 0.7 and were disappointed by performance, re-evaluate with the latest version since the improvements are substantial.
At under 500K vectors, Pinecone Serverless costs less than $15/month, which is essentially free compared to the engineering time of optimizing pgvector. However, if you already run PostgreSQL and need transactional consistency between vectors and relational data, pgvector at 500K vectors adds negligible load to your existing instance. The decision at this scale is about architecture preferences, not cost.
pgvector works with PgBouncer, but with caveats. Session-level SET commands for hnsw.ef_search do not persist across transactions in PgBouncer transaction mode. Set ef_search at the PostgreSQL server level instead, or run a separate PgBouncer instance in session mode specifically for vector query connections. We have helped several teams in Bengaluru implement this dual-pool pattern successfully.
Aurora PostgreSQL is better for pgvector when you need read replicas for vector search traffic, since Aurora replicas share the storage volume with zero replication lag for vector data. Standard RDS is 20-30% cheaper and sufficient if you only need a single instance. For teams in India scaling beyond 2 million vectors, Aurora's ability to add dedicated read replicas for vector workloads justifies the premium.
Migrating from pgvector to Pinecone involves exporting vectors from PostgreSQL (SELECT id, embedding FROM items), transforming them into Pinecone's upsert format, and batch uploading. For 5 million vectors, the export takes about 10 minutes and the Pinecone upload takes 30-45 minutes. The application change is swapping your search query from a SQL statement to a Pinecone client call. The metadata filtering logic needs to be re-implemented using Pinecone's filter syntax, which is the most time-consuming part of the migration.
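The export/upload loop is mostly batching plumbing. A sketch with the psycopg and Pinecone client calls left as comments, since exact call signatures depend on your client versions; what is shown executable is the batching logic, which matters because Pinecone upserts are capped at a few hundred vectors per request:

```python
# Batch an iterable of (id, embedding) rows for upload.

def batched(rows, batch_size=100):
    """Yield fixed-size batches from an iterable of rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Illustrative wiring (not executed here):
# rows = cur.execute("SELECT id, embedding FROM items")        # psycopg cursor
# for chunk in batched(rows, 100):
#     index.upsert(vectors=[(str(i), emb) for i, emb in chunk])

print([len(b) for b in batched(range(250), 100)])  # → [100, 100, 50]
```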