text_chunker

Splits text into chunks according to a configurable strategy. Typically used to prepare large documents for vector embedding.

# Common config fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  include_code_blocks: false
  keep_reference_links: false

# Advanced config fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  token_encoding: cl100k_base # No default (optional)
  allowed_special: []
  disallowed_special:
    - all
  include_code_blocks: false
  keep_reference_links: false

The `strategy` field selects how text is split: by markdown headers, recursively by separator strings, or by tokens.

Fields

`strategy`

The strategy used to split text into chunks. This field is required and has no default.

Type: string

Option	Summary
`markdown`	Split text by markdown headers.
`recursive_character`	Split text recursively by characters (defined in `separators`).
`token`	Split text by tokens.
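
As a minimal sketch, the fragment below selects the `recursive_character` strategy and leaves every other field at its default. The surrounding `pipeline.processors` wrapper is an assumption about where the processor sits in a larger config; only the `text_chunker` block comes from this reference.

```yaml
# Sketch: pick a strategy and rely on the documented defaults for everything else.
# The pipeline/processors wrapper is assumed surrounding config.
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
```

Swapping the value for `markdown` or `token` selects one of the other two strategies listed above.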

`chunk_size`

The maximum size of each chunk, measured using the method set in `length_measure`.

Type: int

Default: 512

`chunk_overlap`

The number of characters to overlap between chunks.

Type: int

Default: 100
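
To illustrate how `chunk_size` and `chunk_overlap` work together, here is a hedged sketch with illustrative values (1000 and 200 are not recommendations from this reference). The `pipeline.processors` wrapper is assumed surrounding config.

```yaml
# Sketch: larger chunks with a larger overlap than the defaults.
# Values are illustrative only.
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        chunk_size: 1000     # upper bound on the size of each chunk
        chunk_overlap: 200   # text repeated between neighbouring chunks
```

With these settings each chunk is at most 1000 units long, and consecutive chunks may share up to 200 characters. A non-zero overlap helps a passage that straddles a chunk boundary stay retrievable from at least one chunk, at the cost of storing some text twice.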

`separators`

A list of strings to treat as separators between chunks. Used by the `recursive_character` strategy.

Type: array

Default: ["\n\n","\n"," ",""]
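
The following sketch overrides the default separators so that sentence boundaries are preferred over single spaces. It assumes, as is typical for recursive splitters, that separators are tried in the order listed; that ordering is not stated in this reference. The `pipeline.processors` wrapper is also assumed.

```yaml
# Sketch: custom separator list, from coarsest to finest.
# Assumption: entries are tried in order, falling back to the next one
# when a piece is still larger than chunk_size.
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        separators:
          - "\n\n"   # paragraph breaks
          - ". "     # sentence ends (illustrative addition)
          - " "      # words
          - ""       # final fallback, as in the default list
```

Double-quoted YAML strings are used so that `\n` is read as a newline rather than as a backslash followed by `n`.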

`length_measure`

The method for measuring the length of a string.

Type: string

Default: "runes"

Option	Summary
`graphemes`	Use the number of Unicode grapheme clusters to determine the length of a string.
`runes`	Use the number of Unicode code points to determine the length of a string.
`token`	Use the number of tokens (using the `token_encoding` tokenizer) to determine the length of a string.
`utf8`	Use the number of UTF-8 bytes to determine the length of a string.
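
The measures can disagree for the same text. For example, the letter `e` followed by a combining acute accent (U+0065 U+0301) counts as 1 under `graphemes`, 2 under `runes`, and 3 under `utf8`. The sketch below measures length in tokens instead, which is a natural fit when chunks feed a model with a token limit; the value 256 is illustrative and the `pipeline.processors` wrapper is assumed surrounding config.

```yaml
# Sketch: measure chunk length in tokens rather than code points.
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        length_measure: token
        token_encoding: cl100k_base   # tokenizer used for the length measure
        chunk_size: 256               # illustrative value
```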

`token_encoding`

The encoding to use for tokenization.

Type: string

# Examples

token_encoding: cl100k_base

token_encoding: r50k_base

`allowed_special`

A list of special tokens that are allowed in the output.

Type: array

Default: []

`disallowed_special`

A list of special tokens that are disallowed in the output.

Type: array

Default: ["all"]
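
As a hedged sketch, the fragment below permits one special token while leaving `disallowed_special` at its default. `<|endoftext|>` is a special token defined by the `cl100k_base` encoding; how the two lists interact when a token appears in both is not described in this reference, so treat the combination as an assumption to verify. The `pipeline.processors` wrapper is assumed surrounding config.

```yaml
# Sketch: allow a single special token when splitting by tokens.
pipeline:
  processors:
    - text_chunker:
        strategy: token
        token_encoding: cl100k_base
        chunk_size: 256               # illustrative value
        allowed_special:
          - "<|endoftext|>"           # special token from cl100k_base
```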

`include_code_blocks`

Whether to include code blocks in the output.

Type: bool

Default: false

`keep_reference_links`

Whether to keep reference links in the output.

Type: bool

Default: false
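
A minimal sketch that combines the `markdown` strategy with the two boolean fields above, turning both on so that code blocks and reference links are retained. Whether these fields affect strategies other than `markdown` is not stated in this reference; the `pipeline.processors` wrapper is assumed surrounding config.

```yaml
# Sketch: split a markdown document by headers, keeping code blocks
# and reference links in the output.
pipeline:
  processors:
    - text_chunker:
        strategy: markdown
        include_code_blocks: true
        keep_reference_links: true
```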