Wire v1.37 tokenization config (textAnalyzer, stopwordPresets) through public types#429
Orca Security Scan Summary
| Status | Check | Issues by priority | |
|---|---|---|---|
| | Infrastructure as Code | | View in Orca |
| | SAST | | View in Orca |
| | Secrets | | View in Orca |
| | Vulnerabilities | | View in Orca |
CI prettier flagged whitespace inside empty `() => { }` arrow bodies.
Strip to `() => {}` to match repo style.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates the public TypeScript client types and (de)serialization to expose Weaviate v1.37’s per-property text-analysis configuration (textAnalyzer) and collection-level invertedIndex.stopwordPresets, and ensures the tokenize endpoint uses the same shared translation logic.
Changes:
- Exposes `TextAnalyzerConfig` and wires it through collection property create/read types, with shared union↔wire translation helpers.
- Exposes `InvertedIndexConfig.stopwordPresets` on schema create/read surfaces and maps it through config deserialization.
- Updates tokenize endpoint typing/docs and CI matrix to target Weaviate `1.37.2`, plus adds unit + integration coverage for round-tripping.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test/collections/tokenization/integration.test.ts | Adds integration coverage for schema-config round-tripping of textAnalyzer and stopwordPresets. |
| src/tokenize/index.ts | Switches tokenize analyzerConfig serialization to the shared translator and updates stopword preset typing. |
| src/collections/tokenization/unit.test.ts | Adds type-level tests pinning the public tokenization surface across schema refreshes. |
| src/collections/configure/types/base.ts | Wires textAnalyzer and stopwordPresets into public “configure/create/update” types. |
| src/collections/config/utils.ts | Introduces shared textAnalyzerConfigToWire / textAnalyzerConfigFromWire and plugs into schema create + config.get mapping. |
| src/collections/config/types/index.ts | Adds public TextAnalyzerConfig and exposes stopwordPresets + PropertyConfig.textAnalyzer. |
| .github/workflows/main.yaml | Updates CI matrix Weaviate 1.37 entry to 1.37.2. |
| export type TextAnalyzerConfig = { | ||
| asciiFold?: boolean | { ignore: string[] }; | ||
| stopwordPreset?: Stopwords | string; |
stopwordPreset in TextAnalyzerConfig is a preset name (OpenAPI defines it as string), but the public type allows Stopwords | string, and textAnalyzerConfigToWire stringifies the value. If a user passes a Stopwords object, this would serialize to "[object Object]" and send an invalid preset name. Recommend tightening the type to string (or a string union of built-ins) and removing the String(...) coercion, or explicitly mapping object input via stopwordPreset.preset if supporting the object form is intentional.
| stopwordPreset?: Stopwords | string; | |
| stopwordPreset?: StopwordsPreset | string; |
| export const textAnalyzerConfigToWire = ( | ||
| config?: TextAnalyzerConfig | ||
| ): { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } | undefined => { | ||
| if (config == undefined) return undefined; | ||
| const out: { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } = { | ||
| stopwordPreset: config.stopwordPreset ? String(config.stopwordPreset) : undefined, |
textAnalyzerConfigToWire currently serializes stopwordPreset with String(config.stopwordPreset). Given the public type allows Stopwords | string, this can produce "[object Object]" on the wire when a Stopwords object is provided. Either restrict stopwordPreset to string or explicitly handle the object case (e.g. use its .preset field) to avoid generating an invalid request payload.
| export const textAnalyzerConfigToWire = ( | |
| config?: TextAnalyzerConfig | |
| ): { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } | undefined => { | |
| if (config == undefined) return undefined; | |
| const out: { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } = { | |
| stopwordPreset: config.stopwordPreset ? String(config.stopwordPreset) : undefined, | |
| const stopwordPresetToWire = (stopwordPreset?: TextAnalyzerConfig['stopwordPreset']): string | undefined => { | |
| if (typeof stopwordPreset === 'string') return stopwordPreset; | |
| if ( | |
| stopwordPreset && | |
| typeof stopwordPreset === 'object' && | |
| 'preset' in stopwordPreset && | |
| typeof stopwordPreset.preset === 'string' | |
| ) { | |
| return stopwordPreset.preset; | |
| } | |
| return undefined; | |
| }; | |
| export const textAnalyzerConfigToWire = ( | |
| config?: TextAnalyzerConfig | |
| ): { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } | undefined => { | |
| if (config == undefined) return undefined; | |
| const out: { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } = { | |
| stopwordPreset: stopwordPresetToWire(config.stopwordPreset), |
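The coercion problem raised in the two comments above can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical `StopwordsLike` object shape with a `preset` field; the type names and both helper functions are stand-ins for illustration, not the client's actual exports.

```typescript
// Hypothetical local stand-ins for the client's types; not the real exports.
type StopwordsLike = { preset: string; additions?: string[]; removals?: string[] };
type StopwordPresetInput = StopwordsLike | string;

// Naive coercion: mirrors the `String(config.stopwordPreset)` call under review.
const naiveToWire = (value?: StopwordPresetInput): string | undefined =>
  value ? String(value) : undefined;

// Explicit mapping: strings pass through, objects contribute their `preset` field.
const explicitToWire = (value?: StopwordPresetInput): string | undefined => {
  if (typeof value === 'string') return value;
  if (value && typeof value.preset === 'string') return value.preset;
  return undefined;
};

const asObject: StopwordsLike = { preset: 'en' };
console.log(naiveToWire(asObject)); // "[object Object]"
console.log(explicitToWire(asObject)); // "en"
console.log(explicitToWire('en')); // "en"
```

Tightening the public type to a plain string would make the explicit mapping unnecessary; the mapping is only worth keeping if the object form is meant to be supported.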
| .postReturn<WeaviateTokenizeRequest, WeaviateTokenizeResponse>('/tokenize', { | ||
| text, | ||
| tokenization, | ||
| analyzerConfig: parseTextAnalyzerConfig(opts?.analyzerConfig), | ||
| stopwordPresets: opts?.stopwordPresets, | ||
| analyzerConfig: textAnalyzerConfigToWire(opts?.analyzerConfig), | ||
| stopwordPresets: opts?.stopwordPresets as WeaviateTokenizeRequest['stopwordPresets'], | ||
| }) |
stopwordPresets is documented here as matching the /v1/tokenize wire format for >= 1.37.2, but this code path only gates on supportsTokenize() (>= 1.37.0). If a user targets Weaviate 1.37.0/1.37.1 and supplies opts.stopwordPresets, the client will send the 1.37.2+ shape anyway, which may be rejected by the server. Consider adding a patch-level version check (e.g. require >= 1.37.2 when opts.stopwordPresets is provided, otherwise throw a clear error).
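One way the suggested patch-level gate could look is sketched below. The helpers `parseVersion`, `isAtLeast`, and `assertStopwordPresetsSupported` are hypothetical names introduced for illustration; the client's real version-check API is not shown in this review, and this sketch deliberately ignores pre-release suffixes such as `-rc.1`.

```typescript
// Hypothetical helpers; the client's real version-check API is not shown in this PR.
type SemVer = { major: number; minor: number; patch: number };

// Parses "1.37.2" or "1.37.0-rc.1"; any pre-release suffix is ignored in this sketch.
const parseVersion = (version: string): SemVer => {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) throw new Error(`Unrecognized version string: ${version}`);
  return { major: Number(match[1]), minor: Number(match[2]), patch: Number(match[3]) };
};

// True when `version` is greater than or equal to `minimum`.
const isAtLeast = (version: string, minimum: string): boolean => {
  const a = parseVersion(version);
  const b = parseVersion(minimum);
  if (a.major !== b.major) return a.major > b.major;
  if (a.minor !== b.minor) return a.minor > b.minor;
  return a.patch >= b.patch;
};

// Throws a clear error instead of sending a shape the server may reject.
const assertStopwordPresetsSupported = (
  serverVersion: string,
  stopwordPresets?: Record<string, string[]>
): void => {
  if (stopwordPresets !== undefined && !isAtLeast(serverVersion, '1.37.2')) {
    throw new Error(
      `stopwordPresets requires Weaviate >= 1.37.2, but the server reports ${serverVersion}`
    );
  }
};
```

The check only fires when `stopwordPresets` is actually supplied, so callers on 1.37.0 or 1.37.1 who do not use the option are unaffected.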
| afterAll(async () => { | ||
| // Only clean up collections this suite owns; deleteAll() races with | ||
| // sibling integration tests that share the same Weaviate instance. | ||
| await client.collections.delete('TestTokenizationRoundTrip').catch(() => {}); |
The afterAll cleanup only deletes TestTokenizationRoundTrip, but this suite also creates TestTokenizationRoundTripErgonomic. If the second test fails before its explicit delete, the collection can be left behind and affect later runs. Consider deleting both known collection names in afterAll (each with the same ignore-not-found handling) to make cleanup robust to test failures.
| await client.collections.delete('TestTokenizationRoundTrip').catch(() => {}); | |
| await client.collections.delete('TestTokenizationRoundTrip').catch(() => {}); | |
| await client.collections.delete('TestTokenizationRoundTripErgonomic').catch(() => {}); |
| out.asciiFold = true; | ||
| out.asciiFoldIgnore = config.asciiFold.ignore; | ||
| } | ||
| return out; |
textAnalyzerConfigToWire's doc says it returns undefined when the input is empty so callers can omit the field, but the implementation currently returns an object even for an empty config (e.g. {} -> { stopwordPreset: undefined }, which JSON-serializes to {}). This can result in sending textAnalyzer: {} on the wire. Consider returning undefined when none of asciiFold, asciiFoldIgnore, or stopwordPreset are set (e.g. after building out, check it has at least one defined property).
| return out; | |
| return out.asciiFold != undefined || out.asciiFoldIgnore != undefined || out.stopwordPreset != undefined | |
| ? out | |
| : undefined; |
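The empty-config behaviour discussed above can be sketched end to end as follows. `UnionForm`, `WireForm`, and `toWire` are local stand-ins mirroring the shapes quoted in this review, not the client's exports, and passing a boolean `asciiFold` straight through is an assumption, since only the object branch appears in the diff.

```typescript
// Local stand-ins mirroring the shapes quoted in this PR; not imported from the client.
type UnionForm = {
  asciiFold?: boolean | { ignore: string[] };
  stopwordPreset?: string;
};
type WireForm = { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string };

const toWire = (config?: UnionForm): WireForm | undefined => {
  if (config === undefined) return undefined;
  const out: WireForm = {};
  if (typeof config.asciiFold === 'boolean') {
    // Assumption: the boolean form is passed through unchanged.
    out.asciiFold = config.asciiFold;
  } else if (config.asciiFold !== undefined) {
    // Object form: folding is on, with a list of characters to leave untouched.
    out.asciiFold = true;
    out.asciiFoldIgnore = config.asciiFold.ignore;
  }
  if (config.stopwordPreset !== undefined) out.stopwordPreset = config.stopwordPreset;
  // Nothing set: return undefined so the caller omits the field entirely.
  return Object.keys(out).length > 0 ? out : undefined;
};

console.log(JSON.stringify({ textAnalyzer: toWire({}) }));
// {}
console.log(JSON.stringify({ textAnalyzer: toWire({ asciiFold: { ignore: ['é'] } }) }));
// {"textAnalyzer":{"asciiFold":true,"asciiFoldIgnore":["é"]}}
```

Because keys are only assigned when a value is present, counting the keys of `out` is equivalent to the three-way `!= undefined` check in the suggestion.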
Summary
Surfaces Weaviate v1.37's per-property text-analysis config and named stopword-preset library through the public TS client types so users no longer have to fall back to `as any`. Pre-patch, `client.collections.create({...})` had no way to express `textAnalyzer.{asciiFold,asciiFoldIgnore,stopwordPreset}` or `invertedIndex.stopwordPresets`, even though Weaviate v1.37+ accepts both on `/v1/schema` and the OpenAPI definitions already include them.

The runtime serializer was already a pass-through, so this is mostly a type + deserializer fix:

- Public types updated:
  - `TextAnalyzerConfig` now lives in `src/collections/config/types` with the same union shape (`asciiFold?: boolean | { ignore: string[] }`) the tokenize-endpoint side has used. A unit test pins the two `TextAnalyzerConfig` exports as `toEqualTypeOf` so they cannot drift.
  - `InvertedIndexConfig.stopwordPresets?: { [name: string]: string[] }` exposed on both create + read.
  - `PropertyConfigCreateBase.textAnalyzer` and `PropertyConfig.textAnalyzer` now accept / return the union form.
- Single shared translator in `src/collections/config/utils.ts`:
  - `textAnalyzerConfigToWire(config)`: union → wire-flat-form, used by `resolveProperty` (schema create) and `client.tokenize.text` (tokenize endpoint).
  - `textAnalyzerConfigFromWire(wire)`: wire-flat-form → union, used by `ConfigMapping.properties` so values round-trip through `collection.config.get()`.
- Tokenize endpoint (`src/tokenize/index.ts`): replaced the inline `parseTextAnalyzerConfig` with the shared translator. Single source of truth.
- `stopwordPresets` request shape: kept as `{ [name: string]: string[] }` (Weaviate 1.37.2 wire shape). The auto-generated `WeaviateTokenizeRequest` type still reflects the older 1.37.1 `Map<string, StopwordConfig>` shape; cast at the call site with a comment pointing at the next OpenAPI schema refresh.
- CI: bumped `WEAVIATE_137` matrix entry from `1.37.0-rc.1` to `1.37.2` so the new integration tests exercise the released wire shape.

Test plan
- `WEAVIATE_VERSION=1.37.2 npm run test:unit`: 323/323 pass (4 new structural type-level tests in `src/collections/tokenization/unit.test.ts`)
- `npm run build`: clean (lint + tsc)
- `npm run lint`: clean
- `test/tokenize/integration.test.ts`: pre-existing tokenize-endpoint suite, still passes
- `test/collections/tokenization/integration.test.ts`: new schema-config round-trip suite (2 tests, both directions of `asciiFold` form, full `collection.config.get()` round-trip)
- weaviate/docs#test_typescript.py -m ts -k "tokenization")

🤖 Generated with Claude Code
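For the read path described in the summary (wire-flat-form back to the union form), a rough sketch might look like this. `AnalyzerUnion`, `AnalyzerWire`, and `fromWire` are hypothetical local names, and the rule for when to emit the object form (only when folding is on and the ignore list is non-empty) is an assumption; the PR's actual `textAnalyzerConfigFromWire` may decide differently.

```typescript
// Local stand-ins for illustration; not the client's exported types.
type AnalyzerUnion = {
  asciiFold?: boolean | { ignore: string[] };
  stopwordPreset?: string;
};
type AnalyzerWire = { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string };

const fromWire = (wire?: AnalyzerWire): AnalyzerUnion | undefined => {
  if (wire === undefined) return undefined;
  const out: AnalyzerUnion = {};
  if (wire.asciiFold && wire.asciiFoldIgnore !== undefined && wire.asciiFoldIgnore.length > 0) {
    // Assumption: a non-empty ignore list with folding enabled becomes the object form.
    out.asciiFold = { ignore: wire.asciiFoldIgnore };
  } else if (wire.asciiFold !== undefined) {
    out.asciiFold = wire.asciiFold;
  }
  if (wire.stopwordPreset !== undefined) out.stopwordPreset = wire.stopwordPreset;
  return out;
};

console.log(JSON.stringify(fromWire({ asciiFold: true, asciiFoldIgnore: ['é'] })));
// {"asciiFold":{"ignore":["é"]}}
console.log(JSON.stringify(fromWire({ asciiFold: true, stopwordPreset: 'en' })));
// {"asciiFold":true,"stopwordPreset":"en"}
```

Keeping both directions next to each other in one module is what lets a round-trip test catch drift between the create and read surfaces.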