text_chunker
A processor that chunks and splits text according to a chosen strategy. It is typically used when creating vector embeddings of large documents.
# Common config fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  include_code_blocks: false
  keep_reference_links: false

# Advanced config fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  token_encoding: cl100k_base # No default (optional)
  allowed_special: []
  disallowed_special:
    - all
  include_code_blocks: false
  keep_reference_links: false
The text is split into chunks using one of several strategies, selected with the strategy field.
Fields
strategy
The strategy used to split the text into chunks. This field is required and has no default.
Type: string
Option | Summary |
---|---|
markdown | Split text by markdown headers. |
recursive_character | Split text recursively by characters (defined in separators). |
token | Split text by tokens. |
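As a minimal sketch of how the field might be set in practice, the fragment below selects the recursive_character strategy and spells out the two size fields at their documented defaults. The surrounding pipeline and processors keys are the usual place a processor is configured; the values are illustrative rather than recommended.

```yaml
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character # required, has no default
        chunk_size: 512               # documented default
        chunk_overlap: 100            # documented default
```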
chunk_size
The maximum size of each chunk.
Type: int
Default: 512
chunk_overlap
The number of characters to overlap between chunks.
Type: int
Default: 100
separators
A list of strings that should be considered as separators between chunks.
Type: array
Default: ["\n\n","\n"," ",""]
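The following sketch extends the default list with one extra entry. It assumes separators are tried in the order listed, from coarsest to finest, as is common for recursive splitters; the text above does not state the ordering rule, and the ". " entry is a hypothetical addition, not a default.

```yaml
text_chunker:
  strategy: recursive_character
  separators:
    - "\n\n" # blank line between paragraphs (default)
    - "\n"   # single line break (default)
    - ". "   # end of sentence (illustrative addition, not a default)
    - " "    # space between words (default)
    - ""     # empty string (default)
```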
length_measure
The method for measuring the length of a string.
Type: string
Default: "runes"
Option | Summary |
---|---|
graphemes | Use the number of Unicode graphemes to determine the length of a string. |
runes | Use the number of codepoints to determine the length of a string. |
token | Use the number of tokens (using the token_encoding tokenizer) to determine the length of a string. |
utf8 | Use the number of UTF-8 bytes to determine the length of a string. |
token_encoding
The encoding to use for tokenization.
Type: string
# Examples
token_encoding: cl100k_base
token_encoding: r50k_base
allowed_special
A list of special tokens that are allowed in the output.
Type: array
Default: []
disallowed_special
A list of special tokens that are disallowed in the output.
Type: array
Default: ["all"]
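The tokenizer-related fields can be combined as in the sketch below, which measures length in tokens rather than runes. It assumes that chunk_size and chunk_overlap are then counted in tokens, which the length_measure description suggests but does not state outright; the sizes chosen are illustrative.

```yaml
text_chunker:
  strategy: recursive_character
  length_measure: token       # measure length with the token_encoding tokenizer
  token_encoding: cl100k_base # one of the example encodings listed above
  chunk_size: 256             # assumed to be counted in tokens here
  chunk_overlap: 32           # assumed to be counted in tokens here
  allowed_special: []         # documented default
  disallowed_special:
    - all                     # documented default
```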
include_code_blocks
Whether to include code blocks in the output.
Type: bool
Default: false
keep_reference_links
Whether to keep reference links in the output.
Type: bool
Default: false
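A possible configuration for Markdown input is sketched below. It assumes that include_code_blocks and keep_reference_links take effect under the markdown strategy, which the field names suggest but the descriptions above do not confirm.

```yaml
text_chunker:
  strategy: markdown
  include_code_blocks: true  # default is false
  keep_reference_links: true # default is false
```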