PZ: Optimizing Pipelines of Semantic Operators

Palimpzest (PZ) provides a high-level, declarative interface for composing and executing pipelines of semantic operators. PZ’s optimizer can automatically improve the performance of these pipelines, enabling programmers to focus on the high-level design of their pipelines.

Getting Started#

A stable version of the Palimpzest package is available on PyPI. To install it, run:

$ pip install palimpzest

Alternatively, to install the latest version of the package from source, clone our repository and install it locally:

$ git clone git@github.com:mitdbg/palimpzest.git
$ cd palimpzest
$ pip install .
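
To confirm that either installation route worked, one option is to ask Python for the installed version. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is our own and is not part of Palimpzest, and it assumes the distribution is registered under the name `palimpzest`, as in the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


# Prints a version string if the install succeeded, or None otherwise.
print(installed_version("palimpzest"))
```

If this prints `None`, the package is likely installed in a different Python environment than the one running the check.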

Note: This project is under active development.

Chat Demo#

To try our chat demo, visit the Palimpchat webpage.

Quickstart#

The easiest way to get started with Palimpzest is to run our demo in Colab. The notebook demonstrates the PZ workflow: registering a dataset, composing and executing a pipeline, and accessing the results.

For eager readers, the notebook's code is condensed into the following snippet. We still recommend reading the notebook, as it explains each element of the program in more detail.

import pandas as pd
import palimpzest.datamanager.datamanager as pzdm
from palimpzest.sets import Dataset
from palimpzest.core.lib.fields import Field
from palimpzest.core.lib.schemas import Schema, TextFile
from palimpzest.policy import MinCost, MaxQuality
from palimpzest.query import Execute

# Dataset registration
dataset_path = "testdata/enron-tiny"
dataset_name = "enron-tiny"
pzdm.DataDirectory().register_local_directory(dataset_path, dataset_name)

# Dataset loading
dataset = Dataset(dataset_name, schema=TextFile)

# Schema definition for the fields we wish to compute
class Email(Schema):
    """Represents an email, which in practice is usually from a text file."""
    sender = Field(desc="The email address of the sender")
    subject = Field(desc="The subject of the email")
    date = Field(desc="The date the email was sent")

# Lazy construction of computation to filter for emails about holidays sent in July
dataset = dataset.convert(Email, desc="An email from the Enron dataset")
dataset = dataset.filter("The email was sent in July")
dataset = dataset.filter("The email is about holidays")

# Executing the computation
policy = MinCost()
results, execution_stats = Execute(dataset, policy)

# Writing output to disk
output_df = pd.DataFrame([r.to_dict() for r in results])[["date","sender","subject"]]
output_df.to_csv("july_holiday_emails.csv")
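
The snippet builds its pipeline lazily: the `convert` and `filter` calls only describe the computation, and nothing runs until `Execute` is called. The toy sketch below illustrates that general pattern in plain Python. It is not Palimpzest's implementation: `LazyPipeline`, `run`, the sample emails, and the two predicate functions are hypothetical names introduced for illustration, and ordinary Python predicates stand in for PZ's natural-language filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyPipeline:
    """Toy stand-in for a lazily built pipeline; not part of Palimpzest."""
    source: tuple
    steps: tuple = ()

    def filter(self, predicate):
        # Record the predicate and return a new pipeline; no data is touched yet.
        return LazyPipeline(self.source, self.steps + (predicate,))

    def run(self):
        # Only now are the recorded predicates applied, in order, to each record.
        return [r for r in self.source if all(p(r) for p in self.steps)]


# Made-up sample records for illustration only.
emails = (
    {"sender": "a@example.com", "subject": "Holiday schedule", "date": "2001-07-02"},
    {"sender": "b@example.com", "subject": "Quarterly report", "date": "2001-07-15"},
    {"sender": "c@example.com", "subject": "Holiday party", "date": "2001-12-10"},
)

calls = []  # Logs every predicate invocation so the laziness is observable.


def sent_in_july(email):
    calls.append("july")
    return email["date"][5:7] == "07"


def about_holidays(email):
    calls.append("holiday")
    return "holiday" in email["subject"].lower()


pipeline = LazyPipeline(emails).filter(sent_in_july).filter(about_holidays)
calls_before_run = list(calls)  # Still empty: building the pipeline ran nothing.

results = pipeline.run()
print(calls_before_run)
print([r["sender"] for r in results])
```

Deferring execution in this way is what gives an optimizer room to work: because the full pipeline is known before any record is processed, the system can choose how to execute it. In this toy version the steps simply run in the order they were added.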

Next Steps#

Stay tuned for more walkthroughs and tutorials on how to use PZ! In the meantime, the main content of our documentation can be found below:

Contents#

Research#