fs::dir_tree(path = "_labs/lab2/")
## _labs/lab2/
## └── model-vetiver.qmd

Lab 2: Project Architecture
Lab 2 introduces the vetiver package for modeling in model-vetiver.qmd. The vetiver package comes in two flavors: R and Python.
If you’d like to see the same process using R, check out the R version.
Comprehension Questions
1. What are the layers of a three-layer application architecture? What libraries could you use to implement a three-layer architecture in R or Python?
%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%
flowchart TD
Present("Presentation<br>Layer") -.-> Process("Processing<br>Layer")
Process -.-> Data("Data<br>Layer")
subgraph UI["<strong>User Interface</strong>"]
Present
end
subgraph BL["<strong>Business Logic</strong>"]
Process
end
subgraph DS["<strong>Data Storage</strong>"]
Data
end
%% #b5c0c5 is a tint of #6b818c
style UI fill:#b5c0c5
style BL fill:#d8e4ff
style DS fill:#31e981
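For example, the presentation layer could be Shiny (R or Python) or Streamlit, the processing layer dplyr or pandas/scikit-learn, and the data layer DBI/duckdb. As a minimal sketch of the three layers in Python (illustrative function names, with the standard library's sqlite3 standing in for a real database):

```python
import sqlite3

def data_layer(con):
    """Data layer: talk to storage only; return raw rows."""
    return con.execute("SELECT species, mass FROM penguins").fetchall()

def processing_layer(rows):
    """Processing layer: business logic -- mean mass per species."""
    totals = {}
    for species, mass in rows:
        n, s = totals.get(species, (0, 0))
        totals[species] = (n + 1, s + mass)
    return {sp: s / n for sp, (n, s) in totals.items()}

def presentation_layer(stats):
    """Presentation layer: format results for the user."""
    return "\n".join(f"{sp}: {avg:.0f} g" for sp, avg in sorted(stats.items()))

# Wire the layers together with an in-memory database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, mass REAL)")
con.executemany("INSERT INTO penguins VALUES (?, ?)",
                [("Adelie", 3750), ("Adelie", 3800), ("Gentoo", 5000)])
print(presentation_layer(processing_layer(data_layer(con))))
```

Each layer only talks to the one below it, so any layer can be swapped (e.g., sqlite3 for DuckDB) without touching the others.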
2. What are some questions you should explore to reduce the data requirements for your project?
- Problem Definition
- What is the minimum viable outcome?
- Can we solve a simpler version first?
- What is the cost of being wrong?
- Feature Engineering
- Which features are actually predictive?
- Can we derive features from less data?
- What is redundant?
- Sampling Strategy
- How much data do we actually need?
- Can we stratify smartly?
- What about active learning?
- External Resources
- What data already exists?
- Can we use pre-trained models?
- Are there proxies available?
3. What are some patterns you can use to make big data smaller?
Minimize data movement by doing as much work as possible where the data already sits.
Don’t load all your data at once if you don’t need to (keep a live connection to your database and only pull specific pieces of data when you actually need them).
Working with samples can dramatically speed up your analysis while maintaining accuracy for most types of questions.
If your data has natural groupings, process one group at a time instead of loading everything.
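These patterns can be sketched with the standard library's sqlite3 standing in for DuckDB or a remote database (the table and column names below are made up for illustration):

```python
import sqlite3, random

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (grp TEXT, value REAL)")
random.seed(1)
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [(random.choice("abc"), random.random()) for _ in range(10_000)])

# Pattern 1: do the work where the data sits -- aggregate in SQL and
# pull back 3 summary rows instead of 10,000 raw rows.
summary = con.execute(
    "SELECT grp, COUNT(*), AVG(value) FROM readings GROUP BY grp").fetchall()

# Pattern 2: keep a live connection and pull only the slice you need.
slice_a = con.execute(
    "SELECT value FROM readings WHERE grp = 'a' LIMIT 100").fetchall()

# Pattern 3: process one natural group at a time instead of loading everything.
groups = [g for (g,) in con.execute("SELECT DISTINCT grp FROM readings")]
for grp in groups:
    chunk = con.execute(
        "SELECT value FROM readings WHERE grp = ?", (grp,)).fetchall()
    # ...process `chunk`, release it, then move to the next group
```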
4. Where can you put intermediate artifacts in a data science project?
Flat files such as .csv, or DuckDB (which loads only the data you need into memory).
5. What does it mean to “take data out of the bundle”?
Take data out of the presentation bundle (i.e., separate it from the code) if it’s updated more frequently than the app code.
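As a sketch of the idea, the app below reads its data from an external location configured at run time rather than from a file shipped inside the code bundle. The `DATA_DIR` environment variable and `latest.json` file are hypothetical names for illustration, not a pins/vetiver convention:

```python
import os, json, tempfile

# Stand-in for data that lives outside the app bundle and is updated on
# its own schedule (e.g., a pin board, a database, or a shared folder).
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "latest.json"), "w") as f:
    json.dump({"records": 3}, f)

# The app code only knows *where to look*, not the data itself.
os.environ["DATA_DIR"] = data_dir  # hypothetical config, set at deploy time

def load_latest():
    path = os.path.join(os.environ["DATA_DIR"], "latest.json")
    with open(path) as f:
        return json.load(f)

print(load_latest())
```

Because the data path is configuration, the data can be refreshed without redeploying the app code.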
Data preparation
We’ll use the reticulate package below to install the Python packages used in the model-vetiver.qmd lab.
library(reticulate)
use_virtualenv("myenv", required = TRUE)
virtualenv_install(
envname = "myenv",
packages = c("palmerpenguins", "pandas", "numpy",
"scikit-learn", "duckdb", "vetiver",
"pins"))

Now we can import the Python packages/functions.
from palmerpenguins import load_penguins
from pandas import get_dummies
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import duckdb

Using duckdb
Below we load the penguins data, connect to duckdb, then create the table:
penguins_data = load_penguins()
con = duckdb.connect('my-db.duckdb')

The load_penguins() function loads the penguins data into a DataFrame, which we can then insert into DuckDB and query:
con.execute("CREATE OR REPLACE TABLE penguins AS SELECT * FROM penguins_data")
## <_duckdb.DuckDBPyConnection object at 0x140062c70>
df = con.execute("SELECT * FROM penguins").fetchdf().dropna()
con.close()

Confirm the data:
df.head(3)
##   species     island  bill_length_mm  ...  body_mass_g     sex  year
## 0  Adelie  Torgersen            39.1  ...       3750.0    male  2007
## 1  Adelie  Torgersen            39.5  ...       3800.0  female  2007
## 2  Adelie  Torgersen            40.3  ...       3250.0  female  2007
## [3 rows x 8 columns]
Create a model
The model is created below (as in lab 1).
X = get_dummies(df[['bill_length_mm', 'species', 'sex']], drop_first = True)
y = df['body_mass_g']
model = LinearRegression().fit(X, y)

Print model information
This will print the model information.
print(f"R^2 {model.score(X,y)}")
## R^2 0.8555368759537614

print(f"Intercept {model.intercept_}")
## Intercept 2169.2697209393973

print(f"Columns {X.columns}")
## Columns Index(['bill_length_mm', 'species_Chinstrap', 'species_Gentoo', 'sex_male'], dtype='object')

print(f"Coefficients {model.coef_}")
## Coefficients [ 32.53688677 -298.76553447 1094.86739145 547.36692408]
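As a sanity check on the printed output above, a linear model's prediction is just the intercept plus the coefficient-weighted features. For the first penguin in df (a male Adelie with a 39.1 mm bill, so both species dummies are 0 and sex_male is 1):

```python
# Reproduce a prediction by hand from the printed intercept and coefficients.
intercept = 2169.2697209393973
coefs = {"bill_length_mm": 32.53688677, "species_Chinstrap": -298.76553447,
         "species_Gentoo": 1094.86739145, "sex_male": 547.36692408}

# Feature values for the first penguin (male Adelie, bill 39.1 mm).
x = {"bill_length_mm": 39.1, "species_Chinstrap": 0,
     "species_Gentoo": 0, "sex_male": 1}

pred = intercept + sum(coefs[k] * v for k, v in x.items())
print(round(pred, 1))  # roughly 3988.8 g
```

This is close to the observed 3750.0 g for that penguin, which is consistent with the model's R^2 of about 0.86.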
model -> VetiverModel
The model is created with VetiverModel(), which is a wrapper class that takes our trained machine learning model and adds important metadata and functionality for deployment and versioning.
from vetiver import VetiverModel
v = VetiverModel(model, model_name='penguin_model', prototype_data=X)
%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%
flowchart LR
RawMod(["<strong>Raw sklearn model</strong>"]) --> VMod("<strong>VetiverModel()</strong><br>wrapper")
VMod --> ModMeta(["<strong>Model + Metadata</strong>"])
ModMeta --> Dep(["<strong>Deployment Ready</strong>"])
subgraph Vet["<em>What it creates</em>"]
InSch("Input Schema")
ModTypeInfo("Model Type Info")
Ver("Versioning")
Api("API Generation")
Ser("Serialization")
end
VMod --> InSch
VMod --> ModTypeInfo
VMod --> Ver
VMod --> Api
VMod --> Ser
%% Style for outputs %%
style RawMod fill:#d8e4ff
style ModMeta fill:#31e981
style Dep fill:#31e981
What are pins boards?
In this version, we’re going to save our pins board to a temporary location (f"{temp_dir}/models"), but in the lab, we’ll create the directory before creating the board.
%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%
flowchart LR
BrdFldr(["<strong>board_folder()</strong>"]) --> FSI("File System Interface")
FSI --> VerStor(["<strong>Versioned Storage</strong>"])
VerStor --> MetaTr(["<strong>Metadata Tracking</strong>"])
subgraph BF["<em>What it creates</em>"]
DirStr("Directory Structure")
VerCont("Version Control")
PinMngMnt("Pin Management")
end
FSI --> DirStr
FSI --> VerCont
FSI --> PinMngMnt
%% Style for outputs %%
style BrdFldr fill:#d8e4ff
style VerStor fill:#31e981
style MetaTr fill:#31e981
import tempfile
from pins import board_folder
from vetiver import vetiver_pin_write
temp_dir = tempfile.mkdtemp()
model_board = board_folder(f"{temp_dir}/models", allow_pickle_read=True)
vetiver_pin_write(model_board, v)

1. Use a temp directory
2. allow_pickle_read allows reading Python pickle files (needed for scikit-learn models)
Model Cards provide a framework for transparent, responsible reporting. Use the vetiver `.qmd` Quarto template as a place to start, with vetiver.model_card().
## Writing pin:
## Name: 'penguin_model'
## Version: 20251218T105954Z-aba30
board_folder() creates the pins board that sets up a storage interface to a local temporary folder. This creates a standardized structure for storing “pins” (data objects/models). vetiver_pin_write() receives the model_board, which is our board_folder object pointing to temp_dir/models, and v (our VetiverModel wrapping the LinearRegression).
print(f"Model saved to: {temp_dir}/models")
## Model saved to: /var/folders/0x/x5wkbhmx0k74tncn9swz7xpr0000gn/T/tmpmxd3ubm3/models
import os

def print_tree(directory, prefix="", max_depth=None, current_depth=0):
    """Print a tree structure of directory contents."""
    if max_depth is not None and current_depth > max_depth:
        return
    items = sorted(os.listdir(directory))
    for i, item in enumerate(items):
        path = os.path.join(directory, item)
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item}")
        if os.path.isdir(path):
            extension = "    " if is_last else "│   "
            print_tree(path, prefix + extension, max_depth, current_depth + 1)

Below is a tree of the model folder:
print_tree(f"{temp_dir}/models")
## └── penguin_model
##     └── 20251218T105954Z-aba30
##         ├── data.txt
##         └── penguin_model.joblib
What is in penguin_model/?
The penguin_model board folder is described below:
└── penguin_model                  # 1
    └── CURRENT-DATE-TIME-aba30/  # 2
        ├── data.txt              # 3
        └── penguin_model.joblib  # 4

1. Pin name directory
2. Version directory: timestamp + hash
3. Model metadata (what, when, how big)
4. Serialized model (the actual trained model + VetiverModel wrapper with all the prediction logic)
The board folder handles versioning automatically and manages metadata about what’s stored.
%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%
flowchart LR
Mod(["<strong>model</strong>"]) --> VetMod("<strong>VetiverModel(model)</strong>")
VetMod --> VPW("<strong>vetiver_pin_write()</strong>")
VPW --> Joblib(["Serialized <strong>.joblib</strong>"])
VPW --> Meta(["Metadata <strong>.txt</strong>"])
Rep[["To<br>reproduce..."]] --> VPR("<strong>vetiver_pin_read()</strong>")
VPR --> Joblib
VPR --> Meta
VPR --> Recon(["Reconstructed<br><strong>VetiverModel()</strong>"])
%% Style for outputs %%
style Mod fill:#d8e4ff
style Recon fill:#31e981
style Meta fill:#31e981
style Joblib fill:#31e981
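Conceptually, the versioned storage that board_folder() and vetiver_pin_write() manage can be imitated with the standard library. This is only a sketch of the idea (pickle plus timestamped version directories), not pins' actual on-disk format:

```python
import os, pickle, json, tempfile
from datetime import datetime, timezone

def pin_write(board_dir, name, obj):
    """Write obj under board/name/<timestamp>/ with a metadata file."""
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    vdir = os.path.join(board_dir, name, version)
    os.makedirs(vdir, exist_ok=True)
    with open(os.path.join(vdir, f"{name}.pkl"), "wb") as f:
        pickle.dump(obj, f)  # serialized object (pins uses .joblib for models)
    with open(os.path.join(vdir, "data.txt"), "w") as f:
        json.dump({"pin": name, "version": version}, f)  # metadata sidecar
    return version

def pin_read(board_dir, name):
    """Read back the newest version of a pin."""
    versions = sorted(os.listdir(os.path.join(board_dir, name)))
    latest = versions[-1]  # timestamped names sort chronologically
    with open(os.path.join(board_dir, name, latest, f"{name}.pkl"), "rb") as f:
        return pickle.load(f)

board = tempfile.mkdtemp()
pin_write(board, "penguin_model", {"coef": [32.5, -298.8]})
print(pin_read(board, "penguin_model"))
```

The real pins board adds a richer version scheme (timestamp + content hash) and metadata, but the read/write round trip works the same way.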
VetiverModel -> VetiverAPI
Finally, we’ll turn the model into an API.
from vetiver import VetiverAPI
app = VetiverAPI(v, check_prototype = True)

VetiverAPI() creates a production-ready web API that wraps the trained model, validates inputs, and returns predictions via HTTP.
%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%
flowchart LR
VetMod(["<strong>VetiverModel()</strong>"]) --> VetAPI("<strong>VetiverAPI()</strong>")
VetAPI --> REST("<strong>REST</strong> API<br>Server")
REST --> HttpEP("<strong>HTTP</strong><br>Endpoints")
HttpEP --> Post(["<strong>POST</strong><br>/predict"])
HttpEP --> GetPing(["<strong>GET</strong><br>/ping"])
HttpEP --> GetProto(["<strong>GET</strong><br>/prototype"])
HttpEP --> GetMeta(["<strong>GET</strong><br>/metadata"])
%% Style for outputs %%
style VetMod fill:#d8e4ff,stroke-width:1px,rx:10,ry:10,font-size:16px
style Post fill:#31e981,stroke-width:1px,rx:10,ry:10,font-size:16px
style GetPing fill:#31e981,stroke-width:1px,rx:10,ry:10,font-size:16px
style GetProto fill:#31e981,stroke-width:1px,rx:10,ry:10,font-size:16px
style GetMeta fill:#31e981,stroke-width:1px,rx:10,ry:10,font-size:16px
This makes our model accessible to any application that can make web requests, which we will cover in the next lab.
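For instance, a client could build a JSON payload whose fields match the prototype columns (X.columns) from this lab and POST it to the /predict endpoint. The exact wire format is vetiver's to define, so treat this as a sketch rather than the documented request shape:

```python
import json

# One new penguin, keyed by the prototype columns from this lab.
new_penguin = [{
    "bill_length_mm": 41.2,
    "species_Chinstrap": 0,
    "species_Gentoo": 1,
    "sex_male": 1,
}]
body = json.dumps(new_penguin)

# A client would then send it to the running API, e.g. with requests
# (not run here; host/port depend on how the API is served):
# requests.post("http://127.0.0.1:8080/predict", data=body).json()
print(body)
```

Because check_prototype=True, the API validates that incoming fields match this prototype before predicting.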