fs::dir_tree(path = "_labs/lab02/")
## _labs/lab02/
## └── model-vetiver.qmd
Lab 2: Project Architecture
Lab 2 introduces the vetiver package for modeling in model-vetiver.qmd. The vetiver package is available for both R and Python.
If you’d like to see the same process using R, check out the R version.
Comprehension Questions
1. What are the layers of a three-layer application architecture? What libraries could you use to implement a three-layer architecture in R or Python?
%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%
flowchart TD
Present("Presentation<br>Layer") -.-> Process("Processing<br>Layer")
Process -.-> Data("Data<br>Layer")
subgraph UI["<strong>User Interface</strong>"]
Present
end
subgraph BL["<strong>Business Logic</strong>"]
Process
end
subgraph DS["<strong>Data Storage</strong>"]
Data
end
%% #b5c0c5 is a tint of #6b818c
style Present fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
style Process fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style Data fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
style UI fill:#fbf7ec,stroke:#5B8C5A,color:#1B2A41
style BL fill:#fbf7ec,stroke:#2A6F77,color:#1B2A41
style DS fill:#fbf7ec,stroke:#2A6F77,color:#1B2A41
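One way to picture the three layers in code is as three groups of functions, each talking only to its neighbor. The sketch below is a hypothetical, standard-library-only illustration: the table, the function names, and the numbers are made up, and sqlite3 stands in for the data layer. In a real project the data layer might be a database such as DuckDB, the processing layer an API, and the presentation layer an app built with a framework such as Shiny or Streamlit.

```python
import sqlite3

# --- Data layer: owns storage and answers narrow queries ------------------
def make_data_layer():
    """Create an in-memory table of (made-up) penguin measurements."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE penguins (species TEXT, body_mass_g REAL)")
    con.executemany(
        "INSERT INTO penguins VALUES (?, ?)",
        [("Adelie", 3750.0), ("Adelie", 3800.0), ("Gentoo", 5000.0)],
    )
    return con

def fetch_masses(con, species):
    rows = con.execute(
        "SELECT body_mass_g FROM penguins WHERE species = ?", (species,)
    ).fetchall()
    return [mass for (mass,) in rows]

# --- Processing layer: business logic, no storage or display details ------
def mean_mass(con, species):
    masses = fetch_masses(con, species)
    return sum(masses) / len(masses) if masses else None

# --- Presentation layer: formats results for the user ---------------------
def render(species, mean):
    if mean is None:
        return f"No data for {species}"
    return f"Mean body mass for {species}: {mean:.0f} g"

con = make_data_layer()
print(render("Adelie", mean_mass(con, "Adelie")))
```

Because each layer only calls the one beneath it, the storage backend or the user interface can be swapped without touching the business logic.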
2. What are some questions you should explore to reduce the data requirements for your project?
- Problem Definition
  - What is the minimum viable outcome?
  - Can we solve a simpler version first?
  - What is the cost of being wrong?
- Feature Engineering
  - Which features are actually predictive?
  - Can we derive features from less data?
  - What is redundant?
- Sampling Strategy
  - How much data do we actually need?
  - Can we stratify smartly?
  - What about active learning?
- External Resources
  - What data already exists?
  - Can we use pre-trained models?
  - Are there proxies available?
3. What are some patterns you can use to make big data smaller?
Minimize data movement by doing as much work as possible where the data already sits.
Don’t load all your data at once if you don’t need to: keep a live connection to your database and pull specific pieces of data only when you actually need them.
Working with samples can dramatically speed up your analysis while maintaining accuracy for most types of questions.
If your data has natural groupings, process one group at a time instead of loading everything.
4. Where can you put intermediate artifacts in a data science project?
Flat files (such as .csv files) or duckdb, which loads only the data you need into memory.
5. What does it mean to “take data out of the bundle”?
Take data out of the presentation bundle (i.e., separate it from the code) if it’s updated more frequently than the app code.
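Two of the patterns from question 3 (doing the work where the data sits, and processing one group at a time) can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: Python's built-in sqlite3 module stands in for the database so the code runs anywhere, and the table and values are made up. The same idea applies to a duckdb connection.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, body_mass_g REAL)")
con.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [
        ("Adelie", 3750.0),
        ("Adelie", 3250.0),
        ("Gentoo", 5000.0),
        ("Gentoo", 5400.0),
    ],
)

# Pattern 1: do the work where the data sits. Only one summary row per
# species crosses the connection, not every raw row.
summary = dict(
    con.execute(
        "SELECT species, AVG(body_mass_g) FROM penguins GROUP BY species"
    ).fetchall()
)

# Pattern 2: process one natural group at a time instead of loading the
# whole table into memory.
species_list = [
    s for (s,) in con.execute(
        "SELECT DISTINCT species FROM penguins ORDER BY species"
    ).fetchall()
]
heaviest = {}
for species in species_list:
    rows = con.execute(
        "SELECT body_mass_g FROM penguins WHERE species = ?", (species,)
    ).fetchall()
    heaviest[species] = max(mass for (mass,) in rows)

con.close()
print(summary)
print(heaviest)
```

With four rows the savings are invisible, but the shape of the code is the same when the table holds millions of rows: the query returns a handful of values and the Python process never holds the full table.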
Data preparation
We’ll use the reticulate package below to install the other packages in the model-vetiver.qmd lab.
library(reticulate)
if (!"myenv" %in% virtualenv_list()) {
  virtualenv_create("myenv")
}
use_virtualenv("myenv", required = TRUE)
virtualenv_install(
  envname = "myenv",
  packages = c("palmerpenguins", "pandas", "numpy",
               "scikit-learn", "duckdb", "vetiver",
               "pins"))
Now we can import the Python packages/functions.
from palmerpenguins import load_penguins
from pandas import get_dummies
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import duckdb
Using duckdb
Below we load the penguins data, connect to duckdb, then create the table:
penguins_data = load_penguins()
con = duckdb.connect('my-db.duckdb')
The load_penguins() function puts the penguins data into a DataFrame, which we can then insert into DuckDB before querying:
con.execute("CREATE OR REPLACE TABLE penguins AS SELECT * FROM penguins_data")
<_duckdb.DuckDBPyConnection object at 0x704197cd61f0>
df = con.execute("SELECT * FROM penguins").fetchdf().dropna()
con.close()
Confirm the data:
df.head(3)
  species     island  bill_length_mm  ...  body_mass_g     sex  year
0  Adelie  Torgersen            39.1  ...       3750.0    male  2007
1  Adelie  Torgersen            39.5  ...       3800.0  female  2007
2  Adelie  Torgersen            40.3  ...       3250.0  female  2007

[3 rows x 8 columns]
Create a model
The model is created below (like lab 1).
X = get_dummies(df[['bill_length_mm', 'species', 'sex']], drop_first = True)
y = df['body_mass_g']
model = LinearRegression().fit(X, y)
Print model information
This will print the model information.
print(f"R^2 {model.score(X,y)}")
R^2 0.8555368759537614
print(f"Intercept {model.intercept_}")
Intercept 2169.2697209393996
print(f"Columns {X.columns}")
Columns Index(['bill_length_mm', 'species_Chinstrap', 'species_Gentoo', 'sex_male'], dtype='str')
print(f"Coefficients {model.coef_}")
Coefficients [ 32.53688677 -298.76553447 1094.86739145 547.36692408]
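To see how these numbers turn into a prediction, the sketch below recomputes one by hand from the intercept and coefficients printed above. The predict_body_mass helper is hypothetical and exists only for illustration; in practice model.predict() performs the same arithmetic. Because get_dummies(..., drop_first = True) dropped the first level of each categorical column, Adelie and female are the baseline levels and have no coefficient of their own.

```python
# Values copied from the printed model output above.
INTERCEPT = 2169.2697209393996
COEFS = {
    "bill_length_mm": 32.53688677,
    "species_Chinstrap": -298.76553447,
    "species_Gentoo": 1094.86739145,
    "sex_male": 547.36692408,
}

def predict_body_mass(bill_length_mm, species, sex):
    """Hypothetical helper: intercept + sum(coefficient * feature value)."""
    features = {
        "bill_length_mm": bill_length_mm,
        # Dummy columns are 1 when the level matches, otherwise 0.
        "species_Chinstrap": 1 if species == "Chinstrap" else 0,
        "species_Gentoo": 1 if species == "Gentoo" else 0,
        "sex_male": 1 if sex == "male" else 0,
    }
    return INTERCEPT + sum(COEFS[name] * value for name, value in features.items())

# First row of df.head(3): Adelie, male, bill length 39.1 mm.
prediction = predict_body_mass(39.1, "Adelie", "male")
print(f"Predicted body mass: {prediction:.1f} g")
```

For this penguin the calculation gives roughly 3,989 g, compared with an observed body mass of 3,750 g.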
model -> VetiverModel
The trained model is wrapped with VetiverModel(), a wrapper class that takes our trained machine learning model and adds the metadata and functionality needed for deployment and versioning.
from vetiver import VetiverModel
v = VetiverModel(model, model_name='penguin_model', prototype_data=X)
%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%
flowchart LR
RawMod("<strong>Raw sklearn model</strong>") --> VMod("<strong>VetiverModel()</strong><br>wrapper")
VMod --> ModMeta("<strong>Model + Metadata</strong>")
ModMeta --> Dep("<strong>Deployment Ready</strong>")
subgraph Vet["<em>What it creates</em>"]
InSch("Input Schema")
ModTypeInfo("Model Type Info")
Ver("Versioning")
Api("API Generation")
Ser("Serialization")
end
VMod --> InSch
VMod --> ModTypeInfo
VMod --> Ver
VMod --> Api
VMod --> Ser
%% Style for outputs %%
style RawMod fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
style VMod fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style ModMeta fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style Dep fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
style InSch fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style ModTypeInfo fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style Ver fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style Api fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style Ser fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style Vet fill:#fbf7ec,stroke:#2A6F77,color:#1B2A41
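Conceptually, the wrapper bundles the fitted model with a name and a description of the inputs it expects. The sketch below is a simplified, hypothetical stand-in: ModelBundle, its fields, and toy_model are invented for illustration and are not vetiver's implementation. It only shows how an input schema can be derived from example data.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ModelBundle:
    """Hypothetical stand-in for a model wrapper: model + name + input schema."""
    predict_fn: Callable[[dict], float]
    model_name: str
    prototype_row: dict
    metadata: dict = field(default_factory=dict)

    @property
    def input_schema(self) -> dict:
        # Record the expected type of each input column from the example row.
        return {
            name: type(value).__name__
            for name, value in self.prototype_row.items()
        }

def toy_model(row: dict) -> float:
    # Placeholder for a fitted model's prediction logic.
    return 2000.0 + 30.0 * row["bill_length_mm"]

bundle = ModelBundle(
    predict_fn=toy_model,
    model_name="penguin_model",
    prototype_row={"bill_length_mm": 39.1, "sex_male": True},
    metadata={"required_pkgs": ["scikit-learn"]},
)
print(bundle.model_name, bundle.input_schema)
```

Keeping the schema next to the model is what later lets an API reject requests whose columns or types do not match the training data.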
What are pins boards?
In this version, we’re going to save our pins board to a temporary location (f"{temp_dir}/models"), but in the lab, we’ll create the directory before creating the board.
%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%
flowchart LR
BrdFldr("<strong>board_folder()</strong>") --> FSI("File System Interface")
FSI --> VerStor("<strong>Versioned Storage</strong>")
VerStor --> MetaTr("<strong>Metadata Tracking</strong>")
subgraph BF["<em>What it creates</em>"]
DirStr("Directory Structure")
VerCont("Version Control")
PinMngMnt("Pin Management")
end
FSI --> DirStr
FSI --> VerCont
FSI --> PinMngMnt
%% Style for outputs %%
style BrdFldr fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
style FSI fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style VerStor fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style MetaTr fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
style DirStr fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style VerCont fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style PinMngMnt fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style BF fill:#fbf7ec,stroke:#2A6F77,color:#1B2A41
import tempfile
from pins import board_folder
from vetiver import vetiver_pin_write
temp_dir = tempfile.mkdtemp()  # (1)
model_board = board_folder(f"{temp_dir}/models", allow_pickle_read=True)  # (2)
vetiver_pin_write(model_board, v)
1. Use temp directory
2. allow_pickle_read allows reading Python pickle files (needed for scikit-learn models)
Model Cards provide a framework for transparent, responsible reporting.
Use the vetiver `.qmd` Quarto template as a place to start,
with vetiver.model_card()
Writing pin:
Name: 'penguin_model'
Version: 20260615T105212Z-a6861
board_folder() creates the pins board that sets up a storage interface to a local temporary folder. This creates a standardized structure for storing “pins” (data objects/models).
vetiver_pin_write() receives the model_board, which is our board_folder object pointing to temp_dir/models, and v (our VetiverModel wrapping the LinearRegression).
print(f"Model saved to: {temp_dir}/models")
Model saved to: /tmp/tmpj2z7knl1/models
The print_tree() helper below prints the contents of a directory as a tree:
import os
def print_tree(directory, prefix="", max_depth=None, current_depth=0):
    """Print a tree structure of directory contents."""
    if max_depth is not None and current_depth > max_depth:
        return
    items = sorted(os.listdir(directory))
    for i, item in enumerate(items):
        path = os.path.join(directory, item)
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item}")
        if os.path.isdir(path):
            extension = "    " if is_last else "│   "
            print_tree(path, prefix + extension, max_depth, current_depth + 1)
Below is a tree of the model folder:
print_tree(f"{temp_dir}/models")
└── penguin_model
    └── 20260615T105212Z-a6861
        ├── data.txt
        └── penguin_model.joblib
What is in penguin_model/?
The penguin_model board folder is described below:
└── penguin_model                    (1)
    └── CURRENT-DATE-TIME-aba30/     (2)
        ├── data.txt                 (3)
        └── penguin_model.joblib     (4)
1. pin name directory
2. Version directory: timestamp + hash
3. Model metadata (what, when, how big)
4. Serialized model (the actual trained model + VetiverModel wrapper with all the prediction logic)
The board folder handles versioning automatically and manages metadata about what’s stored.
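The layout described above (a pin name directory, a version directory named with a timestamp and a short hash, a metadata file, and a serialized object) can be imitated with the standard library. The sketch below is an illustration only: write_pin and read_pin are hypothetical helpers, the metadata is written as JSON for simplicity, and none of this is how the pins package is implemented.

```python
import hashlib
import json
import os
import pickle
import tempfile
from datetime import datetime, timezone

def write_pin(board_dir, name, obj):
    """Hypothetical helper: store obj under <board>/<name>/<timestamp>-<hash>/."""
    payload = pickle.dumps(obj)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha256(payload).hexdigest()[:5]
    version = f"{stamp}-{digest}"
    version_dir = os.path.join(board_dir, name, version)
    os.makedirs(version_dir, exist_ok=True)
    # The serialized object itself.
    with open(os.path.join(version_dir, f"{name}.pkl"), "wb") as f:
        f.write(payload)
    # Metadata about the object: what, when, how big.
    meta = {"name": name, "created": stamp, "file_size": len(payload)}
    with open(os.path.join(version_dir, "data.txt"), "w") as f:
        json.dump(meta, f)
    return version

def read_pin(board_dir, name):
    """Load the newest version; timestamp prefixes sort chronologically."""
    latest = sorted(os.listdir(os.path.join(board_dir, name)))[-1]
    with open(os.path.join(board_dir, name, latest, f"{name}.pkl"), "rb") as f:
        return pickle.load(f)

board = tempfile.mkdtemp()
version = write_pin(board, "toy_model", {"intercept": 2169.27})
restored = read_pin(board, "toy_model")
print(version, restored)
```

Writing each version into its own directory means an older model is never overwritten, so any earlier version can still be loaded by name.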
%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%
flowchart LR
Mod("<strong>model</strong>") --> VetMod("<strong>VetiverModel(model)</strong>")
VetMod --> VPW("<strong>vetiver_pin_write()</strong>")
VPW --> Joblib("Serialized <strong>.joblib</strong>")
VPW --> Meta("Metadata <strong>.txt</strong>")
Rep("To<br>reproduce...") --> VPR("<strong>vetiver_pin_read()</strong>")
VPR --> Joblib
VPR --> Meta
VPR --> Recon("Reconstructed<br><strong>VetiverModel()</strong>")
%% Style for outputs %%
style Mod fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
style VetMod fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style VPW fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style Joblib fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
style Meta fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
style Rep fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
style VPR fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style Recon fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
VetiverModel -> VetiverAPI
Finally, we’ll turn the model into an API.
from vetiver import VetiverAPI
app = VetiverAPI(v, check_prototype = True)
VetiverAPI() creates a production-ready web API that wraps the trained model, validates inputs, and returns predictions via HTTP.
%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%
flowchart LR
VetMod("<strong>VetiverModel()</strong>") --> VetAPI("<strong>VetiverAPI()</strong>")
VetAPI --> REST("<strong>REST</strong> API<br>Server")
REST --> HttpEP("<strong>HTTP</strong><br>Endpoints")
HttpEP --> Post("<strong>POST</strong><br>/predict")
HttpEP --> GetPing("<strong>GET</strong><br>/ping")
HttpEP --> GetProto("<strong>GET</strong><br>/prototype")
HttpEP --> GetMeta("<strong>GET</strong><br>/metadata")
%% Style for outputs %%
style VetMod fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
style VetAPI fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style REST fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
style HttpEP fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
style Post fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style GetPing fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style GetProto fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
style GetMeta fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
This makes our model accessible to any application that can make web requests, which we will cover in the next lab.
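The input validation step can be pictured as a function that checks a request body against the expected columns before predicting. The sketch below is hypothetical and uses only the standard library: PROTOTYPE, handle_predict, the response shape, and the toy scoring rule are all invented for illustration. It is not how VetiverAPI is implemented, only a picture of what checking inputs against a prototype means.

```python
import json

# Expected input columns and types (assumed to mirror the columns of X).
PROTOTYPE = {
    "bill_length_mm": float,
    "species_Chinstrap": bool,
    "species_Gentoo": bool,
    "sex_male": bool,
}

def toy_predict(row):
    # Placeholder scoring rule standing in for the fitted model.
    return 2000.0 + 30.0 * row["bill_length_mm"] + 500.0 * row["sex_male"]

def handle_predict(body: str):
    """Validate a JSON request body, then return (status_code, response_json)."""
    row = json.loads(body)
    missing = sorted(set(PROTOTYPE) - set(row))
    extra = sorted(set(row) - set(PROTOTYPE))
    if missing or extra:
        return 422, json.dumps({"missing": missing, "unexpected": extra})
    for name, expected in PROTOTYPE.items():
        value = row[name]
        # Accept ints where floats are expected, but never bools as numbers.
        ok = isinstance(value, expected) or (
            expected is float
            and isinstance(value, int)
            and not isinstance(value, bool)
        )
        if not ok:
            return 422, json.dumps({"error": f"{name} must be {expected.__name__}"})
    return 200, json.dumps({"predict": [toy_predict(row)]})

good = json.dumps({
    "bill_length_mm": 39.1,
    "species_Chinstrap": False,
    "species_Gentoo": False,
    "sex_male": True,
})
print(handle_predict(good))
print(handle_predict(json.dumps({"bill_length_mm": 39.1})))
```

The first call returns a 200 status with a prediction; the second returns a 422 status that lists the missing columns, so a malformed request never reaches the model.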