text_chunker

A processor that splits text into chunks according to a chosen strategy. It is typically used to prepare large documents for vector embedding.

# Common config fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  include_code_blocks: false
  keep_reference_links: false

The text is split using one of several strategies, selected with the strategy field.

Fields

strategy

The strategy used to split the text into chunks. This field is required and has no default.

Type: string

| Option | Summary |
| --- | --- |
| markdown | Split text by markdown headers. |
| recursive_character | Split text recursively by characters (defined in separators). |
| token | Split text by tokens. |
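
As a minimal sketch of how a strategy might be selected, the fragment below places the processor in a pipeline and sets the required strategy field explicitly. The surrounding pipeline and processors keys are an assumption about the host configuration; the chunk_size and chunk_overlap values simply restate the documented defaults.

```yaml
# Sketch only: the pipeline/processors wrapper is assumed, not taken from this page.
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character # required, one of the options listed above
        chunk_size: 512               # documented default
        chunk_overlap: 100            # documented default
```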

chunk_size

The maximum size of each chunk.

Type: int

Default: 512

chunk_overlap

The number of characters to overlap between chunks.

Type: int

Default: 100

separators

A list of strings that should be considered as separators between chunks.

Type: array

Default: ["\n\n","\n"," ",""]
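
A custom separator list might look like the sketch below. It assumes the separators are tried in list order, coarsest first, as is common for recursive splitters; this page does not confirm the ordering. The ". " entry is a hypothetical addition for sentence boundaries and is not part of the defaults.

```yaml
# Sketch only: ordering semantics are an assumption.
text_chunker:
  strategy: recursive_character
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n" # paragraph breaks
    - "\n"   # line breaks
    - ". "   # hypothetical extra separator for sentence boundaries
    - " "    # words
    - ""     # fall back to splitting anywhere
```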

length_measure

The method for measuring the length of a string.

Type: string

Default: "runes"

| Option | Summary |
| --- | --- |
| graphemes | Use Unicode graphemes to determine the length of a string. |
| runes | Use the number of codepoints to determine the length of a string. |
| token | Use the number of tokens (using the token_encoding tokenizer) to determine the length of a string. |
| utf8 | Use the number of UTF-8 bytes to determine the length of a string. |

token_encoding

The encoding to use for tokenization.

Type: string

# Examples
token_encoding: cl100k_base
token_encoding: r50k_base
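
To illustrate how token_encoding relates to length_measure, the sketch below measures chunk length in tokens rather than runes. The pairing of recursive_character with a token length measure and the chunk_size and chunk_overlap values are illustrative choices, not recommendations from this page.

```yaml
# Sketch only: sizes are illustrative.
text_chunker:
  strategy: recursive_character
  length_measure: token       # count tokens instead of codepoints
  token_encoding: cl100k_base # tokenizer used for the token count
  chunk_size: 256
  chunk_overlap: 32
```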

allowed_special

A list of special tokens that are allowed in the output.

Type: array

Default: []

disallowed_special

A list of special tokens that are disallowed in the output.

Type: array

Default: ["all"]
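
A possible use of the special-token fields is sketched below. It assumes that entries in allowed_special are exempted from the default disallowed_special value of "all", and that "<|endoftext|>" is a special token in the chosen encoding; neither point is stated on this page, so treat both as assumptions to verify.

```yaml
# Sketch only: the interaction between the two lists is an assumption.
text_chunker:
  strategy: token
  token_encoding: cl100k_base
  allowed_special:
    - "<|endoftext|>" # assumed special token name for this encoding
  # disallowed_special is left at its default of ["all"]
```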

include_code_blocks

Whether to include code blocks in the output.

Type: bool

Default: false

keep_reference_links

Whether to keep reference links in the output.

Type: bool

Default: false
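
Both boolean fields default to false. The sketch below turns them on alongside the markdown strategy, on the assumption that they matter mainly when splitting markdown input; this page does not say which strategies honor them.

```yaml
# Sketch only: which strategies use these flags is an assumption.
text_chunker:
  strategy: markdown
  chunk_size: 512
  chunk_overlap: 100
  include_code_blocks: true  # keep code blocks in the output
  keep_reference_links: true # keep reference links in the output
```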