
parquet_decode

Decodes Parquet files into a batch of structured messages.

Introduced in version 4.4.0.

# Config fields, showing default values
label: ""
parquet_decode:
  handle_logical_types: v1

This processor uses https://github.com/parquet-go/parquet-go, which is itself experimental. Therefore, changes could be made to how this processor functions outside of major version releases.

Fields

handle_logical_types

Whether to be smart about decoding logical types. In the Parquet format, logical types are stored as one of the standard physical types with some additional metadata describing the logical type. For example, UUIDs are stored in a FIXED_LEN_BYTE_ARRAY physical type, but there is metadata in the schema denoting that it is a UUID. By default, this logical type metadata is ignored and values are decoded directly from the physical type, which isn’t always desirable. By enabling this option, logical types are given special treatment and decode into more useful values.

The value for this field specifies a version, e.g. v0, v1, and so on. Any given version enables the logical type handling for that version and all versions below it, which allows handling of new logical types to be introduced without breaking existing pipelines. We recommend enabling the newest available version of this feature when creating new pipelines.

Type: string

Default: "v1"

Option summaries:

v1
  No special handling of logical types.

v2
  • TIMESTAMP - decodes as an RFC3339 string describing the time. If the isAdjustedToUTC flag is set to true in the Parquet file, the time zone will be set to UTC. If it is set to false, the time zone will be set to local time.
  • UUID - decodes as a string, e.g. 00112233-4455-6677-8899-aabbccddeeff.
# Examples
handle_logical_types: v2
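
For context, a minimal pipeline sketch that enables the v2 handling might look like the following; the pipeline and processors keys are just the standard top-level config structure and no other fields are required:

pipeline:
  processors:
    # Decode each Parquet file into a batch, with logical type handling enabled.
    - parquet_decode:
        handle_logical_types: v2

With v2 enabled, a TIMESTAMP column decodes to an RFC3339 string rather than the raw physical integer it is stored as, and a UUID column decodes to its canonical string form rather than raw bytes.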

Examples

In this example we consume files from AWS S3 as they’re written, by listening to an SQS queue for upload events. We use the to_the_end scanner so that each file is read into memory in full, which allows a parquet_decode processor to expand each file into a batch of messages. Finally, we write the data out to local files as newline-delimited JSON.

input:
  aws_s3:
    bucket: TODO
    prefix: foos/
    scanner:
      to_the_end: {}
    sqs:
      url: TODO
  processors:
    - parquet_decode: {}

output:
  file:
    codec: lines
    path: './foos/${! meta("s3_key") }.jsonl'
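
The same pattern can be tried without AWS for local testing. A rough sketch, assuming a directory of .parquet files on disk; the file input, glob path, and stdout output here are illustrative and not part of the example above:

input:
  file:
    # Hypothetical local path; any glob of Parquet files works the same way.
    paths: [ './foos/*.parquet' ]
    scanner:
      # Read each file into memory in full, as in the S3 example.
      to_the_end: {}
  processors:
    - parquet_decode: {}

output:
  stdout: {}

Note that each input file still needs to fit in memory, since the to_the_end scanner buffers the whole file before parquet_decode expands it into a batch of messages.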