Quick Start Tutorial
Creating a Dataset
Let's revisit our example from the Getting Started page in more depth, starting with the first two lines:
In this example, we provide pz.Dataset's constructor with the path to a local directory as input. The directory has a flat structure, with one email per file:
Given this flat directory, PZ will create a pz.DataReader, which iterates over the files in the directory at runtime.
What if my data isn't this simple?
That's perfectly fine!
The pz.DataReader class can be subclassed by the user to read data from more complex sources. The user just has to:
- implement the DataReader's __len__() method
- implement the DataReader's __getitem__() method
More details can be found in our user guide for custom DataReaders.
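As a rough illustration of the two methods listed above, here is a minimal sketch of a reader over a nested directory. It is written as a plain Python class rather than a subclass of pz.DataReader, because the base class's constructor arguments are not covered on this page. The class name NestedEmailReader and the directory layout are hypothetical; the returned keys mirror the default "contents" and "filename" keys.

```python
import tempfile
from pathlib import Path


class NestedEmailReader:
    """Hypothetical reader over a nested directory of .txt files.

    In PZ this would subclass pz.DataReader; that step is omitted here
    (assumption) so the sketch runs without palimpzest installed.
    """

    def __init__(self, root: str) -> None:
        # Collect files recursively and sort them so indexing is stable.
        self.paths = sorted(p for p in Path(root).rglob("*.txt") if p.is_file())

    def __len__(self) -> int:
        # Number of records the reader can emit.
        return len(self.paths)

    def __getitem__(self, idx: int) -> dict:
        # One dictionary per file, using the same keys as the default reader.
        path = self.paths[idx]
        return {"contents": path.read_text(encoding="utf-8"), "filename": path.name}


# Demo on a throwaway directory with files at two different depths.
with tempfile.TemporaryDirectory() as root:
    nested = Path(root) / "2001" / "july"
    nested.mkdir(parents=True)
    (nested / "email1.txt").write_text("Subject: vacation", encoding="utf-8")
    (Path(root) / "2001" / "email2.txt").write_text("Subject: budget", encoding="utf-8")

    reader = NestedEmailReader(root)
    records = [reader[i] for i in range(len(reader))]
```

Keeping the file discovery in the constructor means __len__() and __getitem__() stay cheap and consistent with each other. The user guide for custom DataReaders remains the reference for what the real base class expects.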
The pz.DataReader will emit one dictionary per file to the next operator in the program. By default, each dictionary will have two keys, "contents" and "filename", which map to the file's contents and filename, respectively:
import palimpzest as pz
emails = pz.Dataset("emails/")
output = emails.run()
print(output.to_df())
# This produces the following output:
# contents filename
# 0 Message-ID: <1390685.1075853083264.JavaMail.ev... email1.txt
# 1 Message-ID: <19361547.1075853083287.JavaMail.e... email2.txt
# ... ...
# 8 Message-ID: <22163131.1075859380492.JavaMail.e... email9.txt
What is output?
The output in the program above has type pz.DataRecordCollection.
This object contains:
- The data emitted by the PZ program
- The execution stats (i.e. cost, runtime, and quality metrics) for the entire program
We expose the pz.DataRecordCollection.to_df() method to make it easy for users to get the output(s) of their program in a Pandas DataFrame. We will also expose other utility methods for processing execution statistics in the near future.
Computing New Fields
A key feature of PZ is that it lets users compute new fields using semantic operators. To compute new fields, users invoke the sem_add_columns() method with a list of dictionaries defining the field(s) the system should compute:
emails = emails.sem_add_columns([
    {"name": "subject", "type": str, "desc": "the subject of the email"},
    {"name": "date", "type": str, "desc": "the date the email was sent"},
])
- name: the name of the field
- type: the type of the field (one of str, int, float, bool, list[str], ..., list[bool])
- desc: a short natural language description defining what the field represents
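To make the three keys concrete, the sketch below builds a few more field definitions using other types from the list above. The field names and descriptions are made up for illustration, and check_field is a hypothetical helper that is not part of PZ; it only checks that each dictionary carries the name, type, and desc keys before the list is handed to sem_add_columns().

```python
REQUIRED_KEYS = {"name", "type", "desc"}


def check_field(field: dict) -> dict:
    """Hypothetical helper (not part of PZ): ensure a field has all three keys."""
    missing = REQUIRED_KEYS - field.keys()
    if missing:
        raise ValueError(f"field is missing keys: {sorted(missing)}")
    return field


# Illustrative field definitions; names and descriptions are assumptions.
fields = [
    check_field({"name": "sender", "type": str, "desc": "the email address of the sender"}),
    check_field({"name": "recipients", "type": list[str], "desc": "the email addresses of all recipients"}),
    check_field({"name": "num_attachments", "type": int, "desc": "the number of attachments mentioned in the email"}),
    check_field({"name": "is_reply", "type": bool, "desc": "whether the email is a reply to an earlier message"}),
]

# With a pz.Dataset named `emails` in scope, the call would then be:
# emails = emails.sem_add_columns(fields)
```

Because desc is the only guidance the LLM receives about what a field means, it is worth writing it as specifically as the task allows.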
PZ will then use one (or more) LLM(s) to generate the field for every input to the operator (i.e. each email in this example).
But what is sem_add_columns() actually doing to generate the field(s)?
It depends! (and this is where PZ's optimizer comes in handy)
Depending on the difficulty of the task and your preferred optimization objective (e.g. max_quality), PZ will select one implementation from a set of PhysicalOperators to generate your field(s).
PZ can choose from 1,000+ possible implementations of its PhysicalOperators. Each operator uses one (or more) LLMs and may use techniques such as RAG, Mixture-of-Agents, Critique and Refine, etc. to produce a final output.
For a full list of PhysicalOperators in PZ, please consult our documentation on Operators.
Filtering Inputs
PZ also lets users filter inputs using natural language. To apply a semantic filter, users invoke the sem_filter() method with a natural language description of the criteria they are selecting for:
emails = emails.sem_filter("The email is about vacation")
emails = emails.sem_filter("The email was sent in July")
Optimization and Execution
Finally, once we've defined our program in PZ, we can optimize and execute it in order to generate our output:
The pz.Dataset.run() method triggers PZ's execution of the program that has been defined by applying semantic operators to emails. The run() method also takes a number of keyword arguments which can configure the execution of the program.
In particular, users can specify one optimization objective and (optionally) one constraint:
Optimization objectives:
- max_quality=True (maximize output quality)
- min_cost=True (minimize program cost)
- min_time=True (minimize program runtime)
Constraints:
- quality_threshold=<float> (threshold in range [0, 1])
- cost_budget=<float> (cost in US Dollars)
- time_budget=<float> (time in seconds)
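The "one objective, at most one constraint" rule above can be sketched as a small keyword-argument builder. The helper build_run_kwargs is hypothetical and not part of PZ, and the pairing of min_cost with quality_threshold is only an illustrative choice; the keyword names themselves are the ones listed above.

```python
OBJECTIVES = ("max_quality", "min_cost", "min_time")
CONSTRAINTS = ("quality_threshold", "cost_budget", "time_budget")


def build_run_kwargs(objective: str, **constraint: float) -> dict:
    """Hypothetical helper (not part of PZ): one objective, at most one constraint."""
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective!r}")
    if len(constraint) > 1:
        raise ValueError("specify at most one constraint")
    for name, value in constraint.items():
        if name not in CONSTRAINTS:
            raise ValueError(f"unknown constraint: {name!r}")
        # quality_threshold is documented as a value in the range [0, 1].
        if name == "quality_threshold" and not 0.0 <= value <= 1.0:
            raise ValueError("quality_threshold must lie in [0, 1]")
    return {objective: True, **constraint}


# Minimize cost subject to a quality threshold (illustrative values).
kwargs = build_run_kwargs("min_cost", quality_threshold=0.8)

# With a pz.Dataset named `emails` in scope, the call would then be:
# output = emails.run(**kwargs)
```

Passing the keyword arguments directly, for example an objective such as max_quality=True on its own, works the same way; the helper only catches mistakes such as two constraints or an out-of-range threshold before the program starts running.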
More Info on Constraints
PZ can only estimate the cost, quality, and runtime of each physical operator, so constraints are not guaranteed to be met. Furthermore, some constraints may be infeasible (even with perfect estimates).
In any case, PZ will make a best-effort attempt to find the optimal plan for your stated objective and constraint (if present).
To achieve better estimates -- and thus better optimization outcomes -- please read our Optimization User Guide.
In this example we do not provide validation data to PZ. Therefore, output quality is measured relative to the performance of a "champion model", i.e. the model with the highest MMLU score that is available to the optimizer.
In our Optimization User Guide we show you how to:
- provide validation data to improve the optimizer's performance
- override the optimizer if you wish to specify, for example, the specific model to use for a given operation
Optimization: Design Philosophy
The optimizer is meant to help the programmer quickly get to a final program (i.e. a plan).
In the best case, the optimizer can automatically select a plan that meets the developer's needs.
However, in cases where it falls short, we try to make it as easy as possible for developers to iterate on changes to their plan until it achieves satisfactory performance.
Examining Program Output
Finally, once your program finishes executing you can convert its output to a Pandas DataFrame and examine the results:
The cols keyword argument allows you to select which columns should populate your DataFrame (if it is None, then all columns are selected).
As mentioned in a note above, the output is a pz.DataRecordCollection which also contains all of the execution statistics for your program. We can use this to examine the total cost and runtime of our program:
print(f"Total time: {output.execution_stats.total_execution_time:.1f}")
print(f"Total cost: {output.execution_stats.total_execution_cost:.3f}")
What's Next?
Click below to proceed to the Next Steps.