Skip to main content
search at depth

PostgreSQL Full-Text Search as a Starting Point

4 min read Chapter 56 of 60

PostgreSQL Full-Text Search as a Starting Point

The Symptom

A startup builds a documentation platform for small teams. The MVP has 500 documents. The team evaluates OpenSearch, spends two weeks setting up a cluster, configuring mappings, building a sync pipeline, and deploying monitoring. The search quality is excellent. The operational burden is disproportionate: 70% of infrastructure management time goes to a search cluster that serves 50 queries per hour.

The Internals

PostgreSQL’s full-text search uses tsvector (a sorted list of lexemes with positions) and tsquery (a search expression) to provide text search with stemming, ranking, and index support.

Capabilities:

  • Stemming via language-specific dictionaries (English, German, etc.)
  • Weighted fields (A, B, C, D weights for title, body, etc.)
  • GIN index for fast tsvector lookups
  • ts_rank and ts_rank_cd for relevance scoring
  • Phrase search with <-> (adjacent) and <N> (N-distance) operators

Limitations:

  • No BM25. Ranking uses term frequency and inverse document frequency but with a simpler formula.
  • No multi-field analysis pipelines. All fields must be concatenated into a single tsvector.
  • No synonyms at query time without application-level expansion.
  • No fuzzy matching (edit distance).
  • No aggregations/faceted search without additional SQL queries.
  • No horizontal scaling beyond PostgreSQL’s replication.

The Implementation

Spring Data JPA Integration

@Entity
@Table(name = "documents")
public class Document {

    @Id
    private String slug;

    private String tenantId;
    private String title;

    @Column(columnDefinition = "text")
    private String body;

    private String contentType;
    private String version;

    @Column(columnDefinition = "tsvector")
    private String searchVector;
}
@Repository
public interface DocumentRepository extends JpaRepository<Document, String> {

    @Query(value = """
        SELECT d.* FROM documents d
        WHERE d.tenant_id = :tenantId
        AND d.search_vector @@ plainto_tsquery('english', :query)
        ORDER BY ts_rank(
            setweight(to_tsvector('english', d.title), 'A') ||
            setweight(to_tsvector('english', d.body), 'B'),
            plainto_tsquery('english', :query)
        ) DESC
        LIMIT :limit
        """, nativeQuery = true)
    List<Document> search(
        @Param("tenantId") String tenantId,
        @Param("query") String query,
        @Param("limit") int limit
    );

    @Query(value = """
        SELECT d.content_type, COUNT(*) as count
        FROM documents d
        WHERE d.tenant_id = :tenantId
        AND d.search_vector @@ plainto_tsquery('english', :query)
        GROUP BY d.content_type
        ORDER BY count DESC
        """, nativeQuery = true)
    List<Object[]> searchWithFacets(
        @Param("tenantId") String tenantId,
        @Param("query") String query
    );
}

Search Vector Maintenance

// Trigger to maintain the search_vector column

@Component
public class SearchVectorMigration implements Flyway callback {

    public String migrationSql() {
        return """
            -- Create the tsvector column
            ALTER TABLE documents ADD COLUMN IF NOT EXISTS
                search_vector tsvector;

            -- GIN index for fast search
            CREATE INDEX IF NOT EXISTS idx_documents_search
                ON documents USING GIN(search_vector);

            -- Trigger to update search_vector on insert/update
            CREATE OR REPLACE FUNCTION update_search_vector()
            RETURNS trigger AS $$
            BEGIN
                NEW.search_vector :=
                    setweight(to_tsvector('english',
                        coalesce(NEW.title, '')), 'A') ||
                    setweight(to_tsvector('english',
                        coalesce(NEW.body, '')), 'B');
                RETURN NEW;
            END;
            $$ LANGUAGE plpgsql;

            CREATE OR REPLACE TRIGGER trg_search_vector
            BEFORE INSERT OR UPDATE OF title, body ON documents
            FOR EACH ROW EXECUTE FUNCTION update_search_vector();

            -- Backfill existing documents
            UPDATE documents SET search_vector =
                setweight(to_tsvector('english',
                    coalesce(title, '')), 'A') ||
                setweight(to_tsvector('english',
                    coalesce(body, '')), 'B');
            """;
    }
}

The Measurement

PostgreSQL FTS vs OpenSearch performance at different data scales:

Metric10K docs (PG)100K docs (PG)1M docs (PG)1M docs (OS)
p50 latency2ms5ms25ms8ms
p99 latency8ms18ms95ms22ms
Relevance (NDCG@5)0.680.680.680.77
Faceted search4ms (SQL)12ms (SQL)85ms (SQL)8ms
Infrastructure cost$0$0$0$3,500/mo

PostgreSQL FTS performs well up to 100K documents. At 1M documents, latency degrades and the relevance gap becomes significant (0.68 vs 0.77 NDCG@5) because PostgreSQL lacks BM25, multi-field analysis, and synonym expansion.

The Decision Rule

Start with PostgreSQL FTS when the document count is below 100,000 and the team does not have search engineering expertise. The zero-infrastructure-cost and zero-synchronization-complexity advantages outweigh the relevance quality gap for small datasets.

Migrate to OpenSearch when any of these conditions are met:

  • Document count exceeds 100,000 and latency requirements are under 20ms p99
  • Search quality requirements demand multi-field analysis, synonyms, or fuzzy matching
  • Faceted search with aggregations becomes a core product feature
  • The search query volume exceeds what a single PostgreSQL replica can serve

The migration path is well-defined: build a CDC pipeline (Chapter 18), create the OpenSearch index with the desired mapping, verify NDCG improvement, switch the search endpoint.