
Dataset

A Dataset is the intended abstraction for programmers to interact with when writing PZ programs.

Users instantiate a Dataset by specifying a source that either points to a DataReader or an existing Dataset. Users can then perform computations on the Dataset in a lazy fashion by leveraging functions such as filter, sem_filter, sem_add_columns, aggregate, etc. Underneath the hood, each of these operations creates a new Dataset. As a result, the Dataset defines a lineage of computation.
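The lazy, lineage-building behavior can be illustrated with a minimal sketch. This is not PZ's actual implementation; the class and attribute names below are invented purely to show how each operation records a new node rather than doing work:

```python
# Minimal sketch of lazy, lineage-building datasets (illustration only,
# not PZ's actual implementation).

class LazyDataset:
    def __init__(self, source, op=None):
        self.source = source  # parent LazyDataset, or raw data at the root
        self.op = op          # (kind, fn) describing this node's operation

    def filter(self, fn):
        # No work happens here; we only record a new node in the lineage.
        return LazyDataset(self, ("filter", fn))

    def map(self, fn):
        return LazyDataset(self, ("map", fn))

    def run(self):
        # Execution walks the lineage back to the raw data, then replays it.
        if self.op is None:
            return list(self.source)
        rows = self.source.run()
        kind, fn = self.op
        if kind == "filter":
            return [r for r in rows if fn(r)]
        return [fn(r) for r in rows]

ds = LazyDataset([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
# Nothing has executed yet; ds is just the tail of a three-node lineage.
result = ds.run()  # [20, 40]
```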

add_columns

add_columns(
    udf: Callable,
    cols: list[dict] | type[Schema],
    cardinality: Cardinality = Cardinality.ONE_TO_ONE,
    depends_on: str | list[str] | None = None,
    desc: str = "Add new columns via UDF",
) -> Dataset

Add new columns by specifying UDFs.

Examples:

add_columns(
    udf=compute_personal_greeting,
    cols=[
        {'name': 'greeting', 'desc': 'The greeting message', 'type': str},
        {'name': 'age', 'desc': 'The age of the person', 'type': int},
        {'name': 'full_name', 'desc': 'The name of the person', 'type': str},
    ],
)
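A UDF for this example might look like the following sketch. It assumes (not confirmed by this page) that the UDF receives one record as a dict and returns a dict whose keys match the declared column names:

```python
# Hypothetical UDF for the add_columns example above. The record-as-dict
# calling convention and field names are assumptions for illustration.

def compute_personal_greeting(record: dict) -> dict:
    full_name = f"{record['first_name']} {record['last_name']}"
    return {
        "greeting": f"Hello, {full_name}!",
        "age": 2024 - record["birth_year"],  # illustrative only
        "full_name": full_name,
    }

row = {"first_name": "Ada", "last_name": "Lovelace", "birth_year": 1815}
result = compute_personal_greeting(row)
# result["greeting"] == "Hello, Ada Lovelace!"
```

With Cardinality.ONE_TO_ONE (the default), each input record produces exactly one output record.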

average

average() -> Dataset

Apply an average aggregation to this Dataset.

count

count() -> Dataset

Apply a count aggregation to this Dataset.

filter

filter(_filter: Callable, depends_on: str | list[str] | None = None) -> Dataset

Add a user-defined function as a filter to the Dataset. The filter may restrict which items are returned during execution.
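A filter UDF is a plain predicate over a record: it returns True to keep the record and False to drop it. The record-as-dict convention below is an assumption for illustration:

```python
# Hypothetical predicate for filter(); returns True to keep a record.
def is_adult(record: dict) -> bool:
    return record["age"] >= 18

# Usage sketch (not runnable without a configured PZ Dataset):
# ds = ds.filter(is_adult)
```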

limit

limit(n: int) -> Dataset

Limit the Dataset to at most n rows.

map

map(udf: Callable) -> Dataset

Apply a UDF map function.

Examples:

map(udf=clean_column_values)
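A map UDF matching the example above might look like the following sketch; it normalizes the string values of a record (the record-as-dict convention is assumed, not confirmed by this page):

```python
# Hypothetical map UDF: strips and lowercases string values in a record.
def clean_column_values(record: dict) -> dict:
    return {
        k: v.strip().lower() if isinstance(v, str) else v
        for k, v in record.items()
    }

cleaned = clean_column_values({"name": "  Ada ", "age": 36})
# cleaned == {"name": "ada", "age": 36}
```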

project

project(project_cols: list[str] | str) -> Dataset

Project the Dataset to only include the specified columns.

retrieve

retrieve(
    index: Collection | RAGPretrainedModel,
    search_attr: str,
    output_attrs: list[dict] | type[Schema],
    search_func: Callable | None = None,
    k: int = -1,
) -> Dataset

Retrieve the top-k nearest neighbors of the value of the search_attr from the index and use these results to construct the output_attrs field(s).
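The output_attrs argument accepts the same list-of-dicts column specification used by add_columns. A sketch of that form (the field names are invented, and the usage line is a non-runnable illustration):

```python
# Sketch of the column-spec form of output_attrs (same shape as the
# cols argument to add_columns). Field names here are invented.
output_attrs = [
    {
        "name": "relevant_passages",
        "desc": "Passages from the index most relevant to the search value",
        "type": list[str],
    },
]

# Usage sketch (requires a configured index; not runnable as-is):
# ds = ds.retrieve(index=my_index, search_attr="question",
#                  output_attrs=output_attrs, k=5)
```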

run

run(config: QueryProcessorConfig | None = None, **kwargs)

Invoke the QueryProcessor to execute the query. Any kwargs are applied to the QueryProcessorConfig.

sem_add_columns

sem_add_columns(
    cols: list[dict] | type[Schema],
    cardinality: Cardinality = Cardinality.ONE_TO_ONE,
    depends_on: str | list[str] | None = None,
    desc: str = "Add new columns via semantic reasoning",
) -> Dataset

Add new columns by specifying the column names, descriptions, and types. The columns will be computed during the execution of the Dataset.

Examples:

sem_add_columns(
    cols=[
        {'name': 'greeting', 'desc': 'The greeting message', 'type': str},
        {'name': 'age', 'desc': 'The age of the person', 'type': int},
        {'name': 'full_name', 'desc': 'The name of the person', 'type': str},
    ]
)
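Unlike add_columns, no UDF is supplied here: the column specs alone drive semantic computation at execution time. A hedged end-to-end sketch (names invented; the Dataset calls require a configured PZ environment, so they appear as comments):

```python
# Hypothetical column specs for sem_add_columns. The 'name', 'desc', and
# 'type' keys follow the list-of-dicts form shown in this reference.
cols = [
    {"name": "sentiment", "desc": "Overall sentiment of the review", "type": str},
    {"name": "stars", "desc": "Star rating implied by the review, 1-5", "type": int},
]

# Usage sketch (not runnable as-is):
# ds = pz.Dataset(reviews_reader).sem_add_columns(cols)
# output = ds.run()   # columns are computed here, at execution time
```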

sem_filter

sem_filter(_filter: str, depends_on: str | list[str] | None = None) -> Dataset

Add a natural-language description of a filter to the Dataset. The filter may restrict which items are returned during execution.