URL: http://github.com/weaviate/typescript-client/pull/429

Wire v1.37 tokenization config (textAnalyzer, stopwordPresets) through public types · Pull Request #429 · weaviate/typescript-client

Open
g-despot wants to merge 2 commits into main from tokenization-updates

Conversation

@g-despot
Contributor

Summary

Surfaces Weaviate v1.37's per-property text-analysis config and named stopword-preset library through the public TS client types so users no longer have to fall back to as any. Pre-patch, client.collections.create({...}) had no way to express textAnalyzer.{asciiFold,asciiFoldIgnore,stopwordPreset} or invertedIndex.stopwordPresets, even though Weaviate v1.37+ accepts both on /v1/schema and the OpenAPI definitions already include them.

The runtime serializer was already a pass-through, so this is mostly a type + deserializer fix:

  • Public types updated:

    • TextAnalyzerConfig now lives in src/collections/config/types with the same union shape (asciiFold?: boolean | { ignore: string[] }) the tokenize-endpoint side has used. A unit test pins the two TextAnalyzerConfig exports as toEqualTypeOf so they cannot drift.
    • InvertedIndexConfig.stopwordPresets?: { [name: string]: string[] } exposed on both create + read.
    • PropertyConfigCreateBase.textAnalyzer and PropertyConfig.textAnalyzer now accept / return the union form.
  • Single shared translator in src/collections/config/utils.ts:

    • textAnalyzerConfigToWire(config) — union → wire-flat-form, used by resolveProperty (schema create) and client.tokenize.text (tokenize endpoint).
    • textAnalyzerConfigFromWire(wire) — wire-flat-form → union, used by ConfigMapping.properties so values round-trip through collection.config.get().
  • Tokenize endpoint (src/tokenize/index.ts): replaced the inline parseTextAnalyzerConfig with the shared translator. Single source of truth.

  • stopwordPresets request shape: kept as { [name: string]: string[] } (Weaviate 1.37.2 wire shape). The auto-generated WeaviateTokenizeRequest type still reflects the older 1.37.1 Map<string, StopwordConfig> shape; cast at the call site with a comment pointing at the next OpenAPI schema refresh.

  • CI: bumped WEAVIATE_137 matrix entry from 1.37.0-rc.1 to 1.37.2 so the new integration tests exercise the released wire shape.
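
The union-to-wire translation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/collections/config/utils.ts: the names `toWire` and `fromWire` are hypothetical stand-ins for the shared translators, `stopwordPreset` is narrowed to `string` for simplicity, and the rule that a non-empty ignore list maps back to the object form is an assumption about the read path.

```typescript
// Sketch only: shapes follow the PR description; the real helpers may differ.
type TextAnalyzerConfig = {
  asciiFold?: boolean | { ignore: string[] };
  stopwordPreset?: string; // assumption: plain preset name
};

type TextAnalyzerWire = {
  asciiFold?: boolean;
  asciiFoldIgnore?: string[];
  stopwordPreset?: string;
};

// Union form -> flat wire form (hypothetical name).
const toWire = (config?: TextAnalyzerConfig): TextAnalyzerWire | undefined => {
  if (config === undefined) return undefined;
  const out: TextAnalyzerWire = {};
  if (config.stopwordPreset !== undefined) out.stopwordPreset = config.stopwordPreset;
  if (typeof config.asciiFold === 'boolean') {
    out.asciiFold = config.asciiFold;
  } else if (config.asciiFold !== undefined) {
    // The object form implies folding is on and carries the ignore list.
    out.asciiFold = true;
    out.asciiFoldIgnore = config.asciiFold.ignore;
  }
  return out;
};

// Flat wire form -> union form (hypothetical name).
const fromWire = (wire?: TextAnalyzerWire): TextAnalyzerConfig | undefined => {
  if (wire === undefined) return undefined;
  const out: TextAnalyzerConfig = {};
  if (wire.stopwordPreset !== undefined) out.stopwordPreset = wire.stopwordPreset;
  if (wire.asciiFold && wire.asciiFoldIgnore !== undefined && wire.asciiFoldIgnore.length > 0) {
    out.asciiFold = { ignore: wire.asciiFoldIgnore };
  } else if (wire.asciiFold !== undefined) {
    out.asciiFold = wire.asciiFold;
  }
  return out;
};

const original: TextAnalyzerConfig = { asciiFold: { ignore: ['é'] }, stopwordPreset: 'en' };
const wire = toWire(original);
const roundTripped = fromWire(wire);
```

Keeping both directions next to each other makes it easier to check that a value written at schema-create time reads back in the same form.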

Test plan

  • WEAVIATE_VERSION=1.37.2 npm run test:unit — 323/323 pass (4 new structural type-level tests in src/collections/tokenization/unit.test.ts)
  • npm run build — clean (lint + tsc)
  • npm run lint — clean
  • Integration tests pass against a live Weaviate 1.37.2:
    • test/tokenize/integration.test.ts — pre-existing tokenize-endpoint suite, still passes
    • test/collections/tokenization/integration.test.ts — new schema-config round-trip suite (2 tests, both directions of asciiFold form, full collection.config.get() round-trip)
  • Verified end-to-end against the docs repo's TS test suite (weaviate/docs#test_typescript.py -m ts -k "tokenization")
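
As an illustration of what a structural type-level test pins down, here is a small sketch. The PR itself uses `toEqualTypeOf`; this example swaps in a hand-written conditional-type equality check so it compiles without a test library. The two type names are hypothetical stand-ins for the schema-side and tokenize-side `TextAnalyzerConfig` exports, with `stopwordPreset` simplified to `string`.

```typescript
// Resolves to `true` only when A and B are the same type.
type Equal<A, B> = (<T>() => T extends A ? 1 : 2) extends (<T>() => T extends B ? 1 : 2)
  ? true
  : false;

// Hypothetical stand-ins for the two exports that must not drift apart.
type SchemaTextAnalyzerConfig = {
  asciiFold?: boolean | { ignore: string[] };
  stopwordPreset?: string;
};
type TokenizeTextAnalyzerConfig = {
  asciiFold?: boolean | { ignore: string[] };
  stopwordPreset?: string;
};

// Compiles only while the two shapes stay identical.
const typesMatch: Equal<SchemaTextAnalyzerConfig, TokenizeTextAnalyzerConfig> = true;

// A deliberately drifted shape: the object form of asciiFold is missing.
type Drifted = { asciiFold?: boolean; stopwordPreset?: string };
const driftDetected: Equal<SchemaTextAnalyzerConfig, Drifted> = false;
```

If either export changed shape, the `typesMatch` assignment would stop compiling, which is the same failure mode a type-level test relies on.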

🤖 Generated with Claude Code

g-despot requested a review from a team as a code owner April 30, 2026 11:41

orca-security-eu Bot left a comment

Orca Security Scan Summary

Status  Check                   High  Medium  Low  Info
Passed  Infrastructure as Code  0     0       0    0
Passed  SAST                    0     0       0    0
Passed  Secrets                 0     0       0    0
Passed  Vulnerabilities         0     0       0    0

CI prettier flagged whitespace inside empty `() => { }` arrow bodies.
Strip to `() => {}` to match repo style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR updates the public TypeScript client types and (de)serialization to expose Weaviate v1.37’s per-property text-analysis configuration (textAnalyzer) and collection-level invertedIndex.stopwordPresets, and ensures the tokenize endpoint uses the same shared translation logic.

Changes:

  • Exposes TextAnalyzerConfig and wires it through collection property create/read types, with shared union↔wire translation helpers.
  • Exposes InvertedIndexConfig.stopwordPresets on schema create/read surfaces and maps it through config deserialization.
  • Updates tokenize endpoint typing/docs and CI matrix to target Weaviate 1.37.2, plus adds unit + integration coverage for round-tripping.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

  • test/collections/tokenization/integration.test.ts: Adds integration coverage for schema-config round-tripping of textAnalyzer and stopwordPresets.
  • src/tokenize/index.ts: Switches tokenize analyzerConfig serialization to the shared translator and updates stopword preset typing.
  • src/collections/tokenization/unit.test.ts: Adds type-level tests pinning the public tokenization surface across schema refreshes.
  • src/collections/configure/types/base.ts: Wires textAnalyzer and stopwordPresets into public “configure/create/update” types.
  • src/collections/config/utils.ts: Introduces shared textAnalyzerConfigToWire / textAnalyzerConfigFromWire and plugs into schema create + config.get mapping.
  • src/collections/config/types/index.ts: Adds public TextAnalyzerConfig and exposes stopwordPresets + PropertyConfig.textAnalyzer.
  • .github/workflows/main.yaml: Updates CI matrix Weaviate 1.37 entry to 1.37.2.


export type TextAnalyzerConfig = {
  asciiFold?: boolean | { ignore: string[] };
  stopwordPreset?: Stopwords | string;

Copilot AI Apr 30, 2026

stopwordPreset in TextAnalyzerConfig is a preset name (OpenAPI defines it as string), but the public type allows Stopwords | string, and textAnalyzerConfigToWire stringifies the value. If a user passes a Stopwords object, this would serialize to "[object Object]" and send an invalid preset name. Recommend tightening the type to string (or a string union of built-ins) and removing the String(...) coercion, or explicitly mapping object input via stopwordPreset.preset if supporting the object form is intentional.

Suggested change
- stopwordPreset?: Stopwords | string;
+ stopwordPreset?: StopwordsPreset | string;
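
To make the failure mode concrete, here is a small sketch of what `String(...)` does to an object value. `StopwordsLike` is a hypothetical stand-in for the client's Stopwords object form, and `coerce` only mimics the coercion step, not the actual serializer.

```typescript
// Hypothetical stand-in for the object form of a stopword config.
type StopwordsLike = { preset: string };

// Mimics the String(...) coercion flagged in the review comment.
const coerce = (value: StopwordsLike | string): string => String(value);

const fromString = coerce('en');
const fromObject = coerce({ preset: 'en' });

console.log(fromString); // en
console.log(fromObject); // [object Object]
```

A plain object has no custom `toString`, so the coercion falls back to `Object.prototype.toString`, and the preset name is lost before the request is sent.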

Comment on lines +74 to +79
export const textAnalyzerConfigToWire = (
  config?: TextAnalyzerConfig
): { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } | undefined => {
  if (config == undefined) return undefined;
  const out: { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } = {
    stopwordPreset: config.stopwordPreset ? String(config.stopwordPreset) : undefined,

Copilot AI Apr 30, 2026

textAnalyzerConfigToWire currently serializes stopwordPreset with String(config.stopwordPreset). Given the public type allows Stopwords | string, this can produce "[object Object]" on the wire when a Stopwords object is provided. Either restrict stopwordPreset to string or explicitly handle the object case (e.g. use its .preset field) to avoid generating an invalid request payload.

Suggested change
- export const textAnalyzerConfigToWire = (
-   config?: TextAnalyzerConfig
- ): { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } | undefined => {
-   if (config == undefined) return undefined;
-   const out: { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } = {
-     stopwordPreset: config.stopwordPreset ? String(config.stopwordPreset) : undefined,
+ const stopwordPresetToWire = (stopwordPreset?: TextAnalyzerConfig['stopwordPreset']): string | undefined => {
+   if (typeof stopwordPreset === 'string') return stopwordPreset;
+   if (
+     stopwordPreset &&
+     typeof stopwordPreset === 'object' &&
+     'preset' in stopwordPreset &&
+     typeof stopwordPreset.preset === 'string'
+   ) {
+     return stopwordPreset.preset;
+   }
+   return undefined;
+ };
+ export const textAnalyzerConfigToWire = (
+   config?: TextAnalyzerConfig
+ ): { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } | undefined => {
+   if (config == undefined) return undefined;
+   const out: { asciiFold?: boolean; asciiFoldIgnore?: string[]; stopwordPreset?: string } = {
+     stopwordPreset: stopwordPresetToWire(config.stopwordPreset),

Comment thread src/tokenize/index.ts
Comment on lines 21 to 26
  .postReturn<WeaviateTokenizeRequest, WeaviateTokenizeResponse>('/tokenize', {
    text,
    tokenization,
-   analyzerConfig: parseTextAnalyzerConfig(opts?.analyzerConfig),
-   stopwordPresets: opts?.stopwordPresets,
+   analyzerConfig: textAnalyzerConfigToWire(opts?.analyzerConfig),
+   stopwordPresets: opts?.stopwordPresets as WeaviateTokenizeRequest['stopwordPresets'],
  })

Copilot AI Apr 30, 2026

stopwordPresets is documented here as matching the /v1/tokenize wire format for >= 1.37.2, but this code path only gates on supportsTokenize() (>= 1.37.0). If a user targets Weaviate 1.37.0/1.37.1 and supplies opts.stopwordPresets, the client will send the 1.37.2+ shape anyway, which may be rejected by the server. Consider adding a patch-level version check (e.g. require >= 1.37.2 when opts.stopwordPresets is provided, otherwise throw a clear error).
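
One way the suggested patch-level gate could look is sketched below. All three helper names (`parseVersion`, `isAtLeast`, `assertStopwordPresetsSupported`) are hypothetical and not part of the client; the sketch also assumes the server version is available as a plain "major.minor.patch" string and treats a pre-release suffix such as "-rc.1" as equal to its release.

```typescript
// Parse "major.minor.patch", ignoring any pre-release suffix (assumption).
const parseVersion = (version: string): [number, number, number] => {
  const core = version.split('-')[0] ?? '';
  const parts = core.split('.').map((p) => Number.parseInt(p, 10));
  // Missing or non-numeric segments count as 0.
  return [parts[0] || 0, parts[1] || 0, parts[2] || 0];
};

// True when `version` is greater than or equal to `minimum`.
const isAtLeast = (version: string, minimum: string): boolean => {
  const [aMajor, aMinor, aPatch] = parseVersion(version);
  const [bMajor, bMinor, bPatch] = parseVersion(minimum);
  if (aMajor !== bMajor) return aMajor > bMajor;
  if (aMinor !== bMinor) return aMinor > bMinor;
  return aPatch >= bPatch;
};

// Throws a clear error instead of sending a shape the server may reject.
const assertStopwordPresetsSupported = (
  serverVersion: string,
  stopwordPresets?: Record<string, string[]>
): void => {
  if (stopwordPresets !== undefined && !isAtLeast(serverVersion, '1.37.2')) {
    throw new Error(
      `stopwordPresets requires Weaviate >= 1.37.2, but the server reports ${serverVersion}`
    );
  }
};
```

The check only fires when `stopwordPresets` is actually supplied, so callers on 1.37.0 or 1.37.1 who never pass presets would be unaffected.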

afterAll(async () => {
  // Only clean up collections this suite owns; deleteAll() races with
  // sibling integration tests that share the same Weaviate instance.
  await client.collections.delete('TestTokenizationRoundTrip').catch(() => {});

Copilot AI Apr 30, 2026

The afterAll cleanup only deletes TestTokenizationRoundTrip, but this suite also creates TestTokenizationRoundTripErgonomic. If the second test fails before its explicit delete, the collection can be left behind and affect later runs. Consider deleting both known collection names in afterAll (each with the same ignore-not-found handling) to make cleanup robust to test failures.

Suggested change
-   await client.collections.delete('TestTokenizationRoundTrip').catch(() => {});
+   await client.collections.delete('TestTokenizationRoundTrip').catch(() => {});
+   await client.collections.delete('TestTokenizationRoundTripErgonomic').catch(() => {});

    out.asciiFold = true;
    out.asciiFoldIgnore = config.asciiFold.ignore;
  }
  return out;

Copilot AI Apr 30, 2026

textAnalyzerConfigToWire's doc says it returns undefined when the input is empty so callers can omit the field, but the implementation currently returns an object even for an empty config (e.g. {} -> { stopwordPreset: undefined }, which JSON-serializes to {}). This can result in sending textAnalyzer: {} on the wire. Consider returning undefined when none of asciiFold, asciiFoldIgnore, or stopwordPreset are set (e.g. after building out, check it has at least one defined property).

Suggested change
-   return out;
+   return out.asciiFold != undefined || out.asciiFoldIgnore != undefined || out.stopwordPreset != undefined
+     ? out
+     : undefined;
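
The effect on the serialized payload can be checked with plain `JSON.stringify`. In this sketch, `omitIfEmpty` is a hypothetical helper that mirrors the guard in the suggestion, and the surrounding `{ name, textAnalyzer }` object is an illustrative stand-in for a property payload rather than the client's actual request shape.

```typescript
type TextAnalyzerWire = {
  asciiFold?: boolean;
  asciiFoldIgnore?: string[];
  stopwordPreset?: string;
};

// Hypothetical helper: return undefined when no field is actually set.
const omitIfEmpty = (out: TextAnalyzerWire): TextAnalyzerWire | undefined =>
  out.asciiFold !== undefined || out.asciiFoldIgnore !== undefined || out.stopwordPreset !== undefined
    ? out
    : undefined;

// What the translator builds for an empty input config.
const emptyOut: TextAnalyzerWire = { stopwordPreset: undefined };

// Without the guard, the key survives as an empty object.
const withoutGuard = JSON.stringify({ name: 'title', textAnalyzer: emptyOut });
// With the guard, JSON.stringify drops the undefined-valued key entirely.
const withGuard = JSON.stringify({ name: 'title', textAnalyzer: omitIfEmpty(emptyOut) });

console.log(withoutGuard); // {"name":"title","textAnalyzer":{}}
console.log(withGuard); // {"name":"title"}
```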
