Dataset
A Dataset is the intended abstraction for programmers to interact with when writing PZ programs.
Users instantiate a Dataset by specifying a source that points to either a DataReader or an existing Dataset. Users can then perform computations on the Dataset in a lazy fashion by leveraging functions such as filter, sem_filter, sem_add_columns, aggregate, etc.
Under the hood, each of these operations creates a new Dataset. As a result, the Dataset defines a lineage of computation.
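The lazy, lineage-building behavior described above can be illustrated with a toy model. This is only a sketch of the idea and not the PZ implementation: ToyDataset, its op labels, and its lineage() helper are hypothetical names invented for the illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class ToyDataset:
    """Toy stand-in for a lazy dataset; NOT the PZ Dataset class."""

    source: ToyDataset | Iterable[dict]  # parent node or raw records
    op: str = "scan"  # label for the operation this node represents
    fn: Callable[[Iterator[dict]], Iterator[dict]] | None = None

    def filter(self, predicate: Callable[[dict], bool]) -> ToyDataset:
        # Returns a new node that points back at self; nothing runs yet.
        return ToyDataset(self, "filter", lambda rows: (r for r in rows if predicate(r)))

    def project(self, cols: list[str]) -> ToyDataset:
        return ToyDataset(self, "project", lambda rows: ({c: r[c] for c in cols} for r in rows))

    def lineage(self) -> list[str]:
        # Walk back through the sources to recover the chain of operations.
        node, ops = self, []
        while isinstance(node, ToyDataset):
            ops.append(node.op)
            node = node.source
        return ops[::-1]

    def run(self) -> list[dict]:
        # Evaluation happens only here, from the root of the lineage outward.
        if isinstance(self.source, ToyDataset):
            rows = iter(self.source.run())
        else:
            rows = iter(self.source)
        return list(self.fn(rows)) if self.fn else list(rows)


people = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 17}]
adults = ToyDataset(people).filter(lambda r: r["age"] >= 18).project(["name"])
print(adults.lineage())  # ['scan', 'filter', 'project']
print(adults.run())  # [{'name': 'Ada'}]
```

Each call returns a new node rather than mutating the old one, so the chain of sources is the record of the computation.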
add_columns
add_columns(
udf: Callable,
cols: list[dict] | type[Schema],
cardinality: Cardinality = Cardinality.ONE_TO_ONE,
depends_on: str | list[str] | None = None,
desc: str = "Add new columns via UDF",
) -> Dataset
Add new columns by specifying UDFs.
Examples:
add_columns(
    udf=compute_personal_greeting,
    cols=[
        {'name': 'greeting', 'desc': 'The greeting message', 'type': str},
        {'name': 'age', 'desc': 'The age of the person', 'type': int},
        {'name': 'full_name', 'desc': 'The name of the person', 'type': str},
    ],
)
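The example references compute_personal_greeting without showing it. A minimal sketch of what such a UDF might look like follows, assuming the UDF receives one record as a dict and returns a dict keyed by the new column names. That calling convention, the input field names (first_name, last_name, age_text), and the matches_specs helper are all assumptions made for illustration, not confirmed PZ behavior.

```python
# The column specs from the example above.
cols = [
    {"name": "greeting", "desc": "The greeting message", "type": str},
    {"name": "age", "desc": "The age of the person", "type": int},
    {"name": "full_name", "desc": "The name of the person", "type": str},
]


def compute_personal_greeting(record: dict) -> dict:
    """Hypothetical UDF body; the input field names are assumptions."""
    full_name = f"{record['first_name']} {record['last_name']}"
    return {
        "greeting": f"Hello, {full_name}!",
        "age": int(record["age_text"]),
        "full_name": full_name,
    }


def matches_specs(values: dict, specs: list[dict]) -> bool:
    """Check that the output has exactly the declared columns and types."""
    return set(values) == {s["name"] for s in specs} and all(
        isinstance(values[s["name"]], s["type"]) for s in specs
    )


row = compute_personal_greeting(
    {"first_name": "Ada", "last_name": "Lovelace", "age_text": "36"}
)
print(row["greeting"])
print(matches_specs(row, cols))
```

Checking the UDF output against the declared specs in a plain unit test like this can catch a misspelled column name before a full run.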
filter
Add a user-defined function (UDF) as a filter on the Set. The filter may restrict which items are returned when the query is later executed.
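As a rough illustration, a filter UDF can be thought of as a predicate over a single record. The sketch below assumes the record arrives as a dict and the function returns a bool; the is_adult name and the age field are hypothetical, and the list comprehension only mimics what the filter would do at execution time.

```python
def is_adult(record: dict) -> bool:
    """Hypothetical filter UDF: keep records whose 'age' is at least 18."""
    return record["age"] >= 18


people = [
    {"name": "Ada", "age": 36},
    {"name": "Bob", "age": 17},
    {"name": "Cy", "age": 18},
]

# Stand-in for execution: only records that pass the predicate survive.
kept = [r for r in people if is_adult(r)]
print([r["name"] for r in kept])
```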
project
Project the Set to only include the specified columns.
retrieve
retrieve(
index: Collection | RAGPretrainedModel,
search_attr: str,
output_attrs: list[dict] | type[Schema],
search_func: Callable | None = None,
k: int = -1,
) -> Dataset
Retrieve the top-k nearest neighbors of the value of the search_attr from the index and use these results to construct the output_attrs field(s).
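To make "top-k nearest neighbors" concrete, here is a small pure-Python sketch that ranks vectors by cosine similarity. It is not how PZ queries a Collection or RAGPretrainedModel index; the in-memory list of (id, vector) pairs, the top_k helper, and the related_ids output field are assumptions chosen to keep the example self-contained.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the ids of the k index entries most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy index: (id, embedding) pairs standing in for a real vector store.
toy_index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]

record = {"query_vec": [1.0, 0.0]}
# Build a hypothetical output field from the retrieved neighbors.
record["related_ids"] = top_k(record["query_vec"], toy_index, k=2)
print(record["related_ids"])  # ['a', 'c']
```

In the toy version the search attribute is already a vector; with a real index, embedding the search_attr value would presumably be handled by the index or by search_func.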
run
Invoke the QueryProcessor to execute the query. kwargs will be applied to the QueryProcessorConfig.
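One common way to apply keyword arguments to a configuration object is to overlay them on a set of defaults. The sketch below shows that pattern with a hypothetical ToyProcessorConfig; its field names (verbose, max_workers) and the toy_run function are invented for illustration and are not claimed to match QueryProcessorConfig or the actual run signature.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyProcessorConfig:
    """Hypothetical config; these field names are not PZ's."""

    verbose: bool = False
    max_workers: int = 1


def toy_run(config: ToyProcessorConfig | None = None, **kwargs) -> ToyProcessorConfig:
    """Overlay kwargs on the config and return the effective settings.

    A real run would go on to execute the query; this sketch stops at
    showing how the overrides are merged.
    """
    base = config or ToyProcessorConfig()
    # dataclasses.replace raises TypeError for names that are not fields.
    return replace(base, **kwargs)


effective = toy_run(max_workers=4)
print(effective)  # ToyProcessorConfig(verbose=False, max_workers=4)
```

Using a frozen dataclass with replace keeps the caller's original config unchanged and rejects misspelled option names early.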
sem_add_columns
sem_add_columns(
cols: list[dict] | type[Schema],
cardinality: Cardinality = Cardinality.ONE_TO_ONE,
depends_on: str | list[str] | None = None,
desc: str = "Add new columns via semantic reasoning",
) -> Dataset
Add new columns by specifying the column names, descriptions, and types. The columns are computed when the Dataset is executed.
Example:
sem_add_columns(
    [
        {'name': 'greeting', 'desc': 'The greeting message', 'type': str},
        {'name': 'age', 'desc': 'The age of the person', 'type': int},
        {'name': 'full_name', 'desc': 'The name of the person', 'type': str},
    ]
)