Quick Start Tutorial

Creating a Dataset

Let's revisit our example from the Getting Started page in more depth, starting with the first two lines:

import palimpzest as pz

emails = pz.Dataset("emails/")

In this example, we provide pz.Dataset's constructor with the path to a local directory as input. The directory has a flat structure, with one email per file:

emails
├── email1.txt
├── email2.txt
...
└── email9.txt

Given this flat directory, PZ will create a pz.DataReader, which iterates over the files in the directory at runtime.

What if my data isn't this simple?

That's perfectly fine!

The pz.DataReader class can be subclassed by the user to read data from more complex sources. The user just has to:

implement the DataReader's __len__() method
implement the DataReader's __getitem__() method

More details can be found in our user guide for custom DataReaders.
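
As a rough sketch of the two methods listed above, the class below reads emails from a single JSON Lines file instead of a flat directory. It is written as a plain Python class so that it runs on its own; in a real program it would subclass pz.DataReader, whose constructor arguments are not covered on this page. The class name JsonlEmailReader and the JSONL layout are hypothetical, and the "contents" and "filename" keys simply mirror the default output described in this tutorial.

```python
import json
import os
import tempfile


class JsonlEmailReader:
    """Hypothetical reader: one JSON object per line with 'name' and 'body' keys.

    In a real PZ program this class would subclass pz.DataReader; it is a plain
    class here so that the sketch runs without PZ installed.
    """

    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            # Skip blank lines so a trailing newline does not become a record.
            self.rows = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        # Number of records this reader can emit.
        return len(self.rows)

    def __getitem__(self, idx):
        # Emit one dictionary per record, using the same keys as the default reader.
        row = self.rows[idx]
        return {"contents": row["body"], "filename": row["name"]}


# Build a tiny two-record file to exercise the reader.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "emails.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"name": "email1.txt", "body": "Subject: Vacation"}) + "\n")
        f.write(json.dumps({"name": "email2.txt", "body": "Subject: Budget"}) + "\n")
    reader = JsonlEmailReader(path)
    num_records = len(reader)
    first_record = reader[0]
```

Loading every row in the constructor keeps the sketch short; a reader over a large source would more likely index into the source lazily inside __getitem__().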

The pz.DataReader will emit one dictionary per file to the next operator in the program. By default, each dictionary has two keys, "contents" and "filename", which map to the file's contents and filename, respectively:

import palimpzest as pz

emails = pz.Dataset("emails/")
output = emails.run()

print(output.to_df())

# This produces the following output:
#                                             contents    filename
# 0  Message-ID: <1390685.1075853083264.JavaMail.ev...  email1.txt
# 1  Message-ID: <19361547.1075853083287.JavaMail.e...  email2.txt
#                                                  ...         ...
# 8  Message-ID: <22163131.1075859380492.JavaMail.e...  email9.txt

What is output?

The output in the program above has type pz.DataRecordCollection.

This object contains:

The data emitted by the PZ program
The execution stats (i.e. cost, runtime, and quality metrics) for the entire program

We expose the pz.DataRecordCollection.to_df() method to make it easy for users to get the output(s) of their program in a Pandas DataFrame. We will also expose other utility methods for processing execution statistics in the near future.

Computing New Fields

A key feature of PZ is that it provides users with the ability to compute new fields using semantic operators. To compute new fields, users need to invoke the sem_add_columns() method with a list of dictionaries defining the field(s) the system should compute:

emails = emails.sem_add_columns([
    {"name": "subject", "type": str, "desc": "the subject of the email"},
    {"name": "date", "type": str, "desc": "the date the email was sent"},
])

In order to fully define a field, each dictionary must have the following three keys:

name: the name of the field
type: the type of the field (one of str, int, float, bool, list[str], ..., list[bool])
desc: a short natural language description defining what the field represents
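
Because every field dictionary needs the same three keys, it can be convenient to check a list of definitions before handing it to sem_add_columns(). The helper below is a minimal sketch of such a check; check_field_specs is a hypothetical name and is not part of PZ. It only verifies that the keys are present and that name and desc are non-empty strings, and it deliberately does not validate the type value.

```python
REQUIRED_KEYS = {"name", "type", "desc"}


def check_field_specs(specs):
    """Return a list of human-readable problems found in the field dictionaries."""
    problems = []
    for i, spec in enumerate(specs):
        missing = REQUIRED_KEYS - spec.keys()
        if missing:
            problems.append(f"field {i}: missing {sorted(missing)}")
            continue
        # 'name' and 'desc' are free text, so only check that they are usable strings.
        for key in ("name", "desc"):
            if not isinstance(spec[key], str) or not spec[key].strip():
                problems.append(f"field {i}: '{key}' must be a non-empty string")
    return problems


good_specs = [
    {"name": "subject", "type": str, "desc": "the subject of the email"},
    {"name": "date", "type": str, "desc": "the date the email was sent"},
]
bad_specs = [{"name": "date", "type": str}]  # 'desc' is missing

good_problems = check_field_specs(good_specs)
bad_problems = check_field_specs(bad_specs)
```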

PZ will then use one (or more) LLM(s) to generate the field for every input to the operator (i.e. each email in this example).

But what is sem_add_columns() actually doing to generate the field(s)?

It depends! (and this is where PZ's optimizer comes in handy)

Depending on the difficulty of the task and your preferred optimization objective (e.g. max_quality), PZ will select one implementation from a set of PhysicalOperators to generate your field(s).

PZ can choose from 1,000+ possible implementations of its PhysicalOperators. Each operator uses one (or more) LLMs and may use techniques such as RAG, Mixture-of-Agents, Critique and Refine, etc. to produce a final output.

For a full list of PhysicalOperators in PZ, please consult our documentation on Operators.

Filtering Inputs

PZ also provides users with the ability to filter inputs using natural language. To apply a semantic filter, users invoke the sem_filter() method with a natural language description of the criteria they are selecting for:

emails = emails.sem_filter("The email is about vacation")
emails = emails.sem_filter("The email was sent in July")

Together, these filters keep only the emails that discuss vacation(s) and were sent in the month of July.
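
Chaining two filters behaves like a logical AND: a record survives only if it passes both. The plain-Python sketch below illustrates that semantics only. The toy records and their precomputed labels are made up and stand in for the judgments an LLM would make, and apply_filters is a hypothetical helper, not a PZ function.

```python
# Toy records with precomputed labels standing in for the LLM's judgments.
records = [
    {"filename": "email1.txt", "about_vacation": True, "month": "July"},
    {"filename": "email2.txt", "about_vacation": True, "month": "March"},
    {"filename": "email3.txt", "about_vacation": False, "month": "July"},
]


def about_vacation(record):
    return record["about_vacation"]


def sent_in_july(record):
    return record["month"] == "July"


def apply_filters(records, predicates):
    """Apply each predicate in turn; only records passing all of them remain."""
    for predicate in predicates:
        records = [r for r in records if predicate(r)]
    return records


kept = apply_filters(records, [about_vacation, sent_in_july])
kept_filenames = [r["filename"] for r in kept]
```

Only email1.txt satisfies both predicates, so it is the only record kept.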

Optimization and Execution

Finally, once we've defined our program in PZ, we can optimize and execute it in order to generate our output:

output = emails.run(max_quality=True)

The pz.Dataset.run() method triggers PZ's execution of the program that has been defined by applying semantic operators to emails. The run() method also takes a number of keyword arguments which can configure the execution of the program.

In particular, users can specify one optimization objective and (optionally) one constraint:

Optimization objectives:

max_quality=True (maximize output quality)
min_cost=True (minimize program cost)
min_time=True (minimize program runtime)

Constraints:

quality_threshold=<float> (threshold in range [0, 1])
cost_budget=<float> (cost in US Dollars)
time_budget=<float> (time in seconds)
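
The rule of one objective plus at most one constraint can be enforced before calling run(). The helper below is a small sketch under that assumption; build_run_kwargs is a hypothetical name, and the only keyword names it uses are the ones listed above.

```python
OBJECTIVES = ("max_quality", "min_cost", "min_time")
CONSTRAINTS = ("quality_threshold", "cost_budget", "time_budget")


def build_run_kwargs(objective, **constraint):
    """Build keyword arguments for run(): one objective, at most one constraint."""
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective!r}")
    if len(constraint) > 1:
        raise ValueError("specify at most one constraint")
    for key, value in constraint.items():
        if key not in CONSTRAINTS:
            raise ValueError(f"unknown constraint: {key!r}")
        if key == "quality_threshold" and not 0 <= value <= 1:
            raise ValueError("quality_threshold must be in [0, 1]")
    return {objective: True, **constraint}


kwargs = build_run_kwargs("min_cost", quality_threshold=0.8)
# Intended use, with `emails` defined as earlier in this tutorial:
#   output = emails.run(**kwargs)
```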

More Info on Constraints

PZ can only estimate the cost, quality, and runtime of each physical operator, so constraints are not guaranteed to be met. Furthermore, some constraints may be infeasible (even with perfect estimates).

In any case, PZ will make a best-effort attempt to find the optimal plan for your stated objective and constraint (if present).

To achieve better estimates -- and thus better optimization outcomes -- please read our Optimization User Guide.
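
To make the interplay between an objective, a constraint, and estimates more concrete, here is a toy plan selector. It is not PZ's optimizer: PlanEstimate, pick_plan, and the numbers are all hypothetical placeholders, and falling back to the full set of plans when nothing fits the budget is just one possible reading of "best effort".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    name: str
    cost: float     # estimated cost in US dollars
    time: float     # estimated runtime in seconds
    quality: float  # estimated quality in [0, 1]


def pick_plan(plans, objective, cost_budget=None):
    """Pick the best plan for the objective among those within the cost budget."""
    feasible = [p for p in plans if cost_budget is None or p.cost <= cost_budget]
    # If no plan fits the budget, fall back to all plans (a best-effort choice).
    candidates = feasible or plans
    if objective == "max_quality":
        return max(candidates, key=lambda p: p.quality)
    if objective == "min_cost":
        return min(candidates, key=lambda p: p.cost)
    if objective == "min_time":
        return min(candidates, key=lambda p: p.time)
    raise ValueError(f"unknown objective: {objective!r}")


# Placeholder estimates for two hypothetical plans.
plans = [
    PlanEstimate("large-model-plan", cost=0.50, time=60.0, quality=0.9),
    PlanEstimate("small-model-plan", cost=0.05, time=20.0, quality=0.7),
]

within_budget = pick_plan(plans, "max_quality", cost_budget=0.10)
infeasible_budget = pick_plan(plans, "max_quality", cost_budget=0.01)
```

With a budget of 0.10 only the small-model plan is feasible, so it is chosen despite its lower quality estimate. With a budget of 0.01 no plan is feasible, so the sketch falls back to the highest-quality plan overall. Because the selection is only as good as the estimates, a plan chosen this way can still exceed the budget at runtime.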

In this example we do not provide validation data to PZ. Therefore, output quality is measured relative to the performance of a "champion model", i.e. the model with the highest MMLU score that is available to the optimizer.
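
The idea of measuring quality against a champion model can be sketched in a few lines. Everything below is an assumption made for illustration: the model names and MMLU scores are placeholders rather than real benchmark numbers, and exact-match agreement is just one simple way to compare outputs, not necessarily the metric PZ uses.

```python
# Placeholder scores for hypothetical models; these are not real benchmark numbers.
mmlu_scores = {"model-a": 0.70, "model-b": 0.85, "model-c": 0.60}

# The champion is the available model with the highest MMLU score.
champion = max(mmlu_scores, key=mmlu_scores.get)


def agreement(candidate_outputs, champion_outputs):
    """Fraction of records where the candidate matches the champion exactly."""
    matches = sum(c == r for c, r in zip(candidate_outputs, champion_outputs))
    return matches / len(champion_outputs)


# Toy outputs for one computed field over four records.
champion_outputs = ["July", "July", "August", "July"]
candidate_outputs = ["July", "June", "August", "July"]
quality = agreement(candidate_outputs, champion_outputs)
```

In this toy example the candidate agrees with the champion on three of four records, giving an estimated quality of 0.75. Without validation data, such a score says how closely a plan tracks the champion, not how correct either one is.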

In our Optimization User Guide we show you how to:

provide validation data to improve the optimizer's performance
override the optimizer if you wish to specify, for example, the specific model to use for a given operation

Optimization: Design Philosophy

The optimizer is meant to help the programmer quickly get to a final program (i.e. a plan).

In the best case, the optimizer can automatically select a plan that meets the developer's needs.

However, in cases where it falls short, we try to make it as easy as possible for developers to iterate on changes to their plan until it achieves satisfactory performance.

Examining Program Output

Once your program finishes executing, you can convert its output to a Pandas DataFrame and examine the results:

print(output.to_df(cols=["filename", "date", "subject"]))

The cols keyword argument allows you to select which columns should populate your DataFrame (if it is None, then all columns are selected).

As mentioned in a note above, the output is a pz.DataRecordCollection which also contains all of the execution statistics for your program. We can use this to examine the total cost and runtime of our program:

print(f"Total time: {output.execution_stats.total_execution_time:.1f}")
print(f"Total cost: {output.execution_stats.total_execution_cost:.3f}")

This produces output like:

Total time: 41.7
Total cost: 0.081

What's Next?

Click below to proceed to the Next Steps.