Lab 2: Project Architecture

Published

2026-07-11

Lab 2 introduces the vetiver package for modeling in model-vetiver.qmd. The vetiver package comes in two flavors: R and Python.

fs::dir_tree(path = "_labs/lab02/")
## _labs/lab02/
## └── model-vetiver.qmd

If you’d like to see the same process using R, check out the R version.

Comprehension Questions

1. What are the layers of a three-layer application architecture? What libraries could you use to implement a three-layer architecture in R or Python?

%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%
flowchart TD
    Present("Presentation<br>Layer") -.-> Process("Processing<br>Layer")
    Process -.-> Data("Data<br>Layer")
    
    subgraph UI["<strong>User Interface</strong>"]
        Present
    end
    
    subgraph BL["<strong>Business Logic</strong>"]
        Process
    end
    
    subgraph DS["<strong>Data Storage</strong>"]
        Data
    end
    
    %% #b5c0c5 is a tint of #6b818c
    style Present fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
    style Process fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style Data fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
    style UI fill:#fbf7ec,stroke:#5B8C5A,color:#1B2A41
    style BL fill:#fbf7ec,stroke:#2A6F77,color:#1B2A41
    style DS fill:#fbf7ec,stroke:#2A6F77,color:#1B2A41
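The layering in the diagram can be sketched as three functions, each calling only the layer directly below it. This is a minimal standard-library illustration, not the lab's code: the function names (`fetch_masses`, `mean_mass`, `render_report`) are hypothetical, and an in-memory SQLite table with three illustrative rows stands in for the data layer.

```python
import sqlite3

# --- Data layer: owns storage and returns plain Python values ---
def fetch_masses(con, species):
    rows = con.execute(
        "SELECT body_mass_g FROM penguins WHERE species = ?", (species,)
    ).fetchall()
    return [r[0] for r in rows]

# --- Processing layer: business logic, no SQL and no formatting ---
def mean_mass(con, species):
    masses = fetch_masses(con, species)
    return sum(masses) / len(masses) if masses else None

# --- Presentation layer: formats results for the user ---
def render_report(con, species):
    avg = mean_mass(con, species)
    if avg is None:
        return f"No data for {species}"
    return f"{species}: mean body mass {avg:.1f} g"

# Illustrative rows only (hypothetical stand-in for a real data store).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, body_mass_g REAL)")
con.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", 3750.0), ("Adelie", 3800.0), ("Adelie", 3250.0)],
)

report = render_report(con, "Adelie")
print(report)
```

Because each layer depends only on the one beneath it, the storage backend or the user interface can be swapped without touching the business logic.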

2. What are some questions you should explore to reduce the data requirements for your project?

Problem Definition
- What is the minimum viable outcome?
- Can we solve a simpler version first?
- What is the cost of being wrong?
Feature Engineering
- Which features are actually predictive?
- Can we derive features from less data?
- What is redundant?
Sampling Strategy
- How much data do we actually need?
- Can we stratify smartly?
- What about active learning?
External Resources
- What data already exists?
- Can we use pre-trained models?
- Are there proxies available?

3. What are some patterns you can use to make big data smaller?

Minimize data movement by doing as much work as possible where the data already sits.
Don’t load all your data at once if you don’t need to (keep a live connection to your database and only pull specific pieces of data when you actually need them).
Working with samples can dramatically speed up your analysis while maintaining accuracy for most types of questions.
If your data has natural groupings, process one group at a time instead of loading everything.
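The four patterns above can be sketched with the standard library alone. The example below uses an in-memory SQLite table as a stand-in for a larger database; the table name `readings`, its columns, and the synthetic values are assumptions made for illustration.

```python
import sqlite3

# Synthetic stand-in data: 3 groups x 1000 values each.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (grp TEXT, value REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(g, float(i)) for g in ("a", "b", "c") for i in range(1000)],
)

# 1. Minimize data movement: the database aggregates and returns 3 rows, not 3000.
group_means = dict(
    con.execute("SELECT grp, AVG(value) FROM readings GROUP BY grp").fetchall()
)

# 2. Pull lazily: keep the connection open and fetch only a small chunk.
cur = con.execute("SELECT value FROM readings WHERE grp = 'a'")
first_chunk = cur.fetchmany(100)

# 3. Sample: estimate the overall mean from a random 10% of the rows.
sample_mean = con.execute(
    "SELECT AVG(value) FROM "
    "(SELECT value FROM readings ORDER BY RANDOM() LIMIT 300)"
).fetchone()[0]

# 4. Process one natural group at a time instead of loading everything.
group_max = {}
groups = [g for (g,) in con.execute("SELECT DISTINCT grp FROM readings").fetchall()]
for grp in groups:
    values = [
        v for (v,) in con.execute("SELECT value FROM readings WHERE grp = ?", (grp,))
    ]
    group_max[grp] = max(values)

print(group_means, len(first_chunk), round(sample_mean, 1), group_max)
```

The sampled mean varies from run to run but should land close to the exact group means, which is the trade-off the sampling pattern relies on.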

4. Where can you put intermediate artifacts in a data science project?

Flat files (such as .csv files) or DuckDB, which loads only the data you need into memory.

5. What does it mean to “take data out of the bundle”?

Take data out of the presentation bundle (i.e., separate it from the code) if it’s updated more frequently than the app code.
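One way to picture this separation: the app reads its data from an external location each time it needs it, so a refreshed file is picked up without redeploying the code. The sketch below is a hypothetical setup; the environment variable `APP_DATA_PATH`, the file name, and the rows are illustrative assumptions.

```python
import csv
import os
import tempfile

def load_data(path):
    """Read rows at call time so refreshed data is picked up without a redeploy."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def write_rows(path, rows):
    """Stand-in for a separate job that refreshes the data file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["species", "body_mass_g"])
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical layout: data lives outside the app's code directory and its
# location is handed to the app through an environment variable.
data_path = os.path.join(tempfile.mkdtemp(), "penguins.csv")
os.environ["APP_DATA_PATH"] = data_path

write_rows(data_path, [{"species": "Adelie", "body_mass_g": 3750.0}])
before = load_data(os.environ["APP_DATA_PATH"])

# The data is refreshed; the app code is unchanged.
write_rows(
    data_path,
    [
        {"species": "Adelie", "body_mass_g": 3750.0},
        {"species": "Gentoo", "body_mass_g": 5000.0},  # illustrative value
    ],
)
after = load_data(os.environ["APP_DATA_PATH"])

print(len(before), len(after))
```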

Data preparation

We’ll use the reticulate package below to install the Python packages used in the model-vetiver.qmd lab.

library(reticulate)
if (!"myenv" %in% virtualenv_list()) {
  virtualenv_create("myenv")
}
use_virtualenv("myenv", required = TRUE)
virtualenv_install(
  envname = "myenv",
  packages = c("palmerpenguins", "pandas", "numpy",
               "scikit-learn", "duckdb", "vetiver",
                "pins"))

Now we can import the Python packages/functions.

from palmerpenguins import load_penguins
from pandas import get_dummies
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import duckdb

Using `duckdb`

duckdb error

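The original lab code returned the errors below: the name penguins referred to a Python module rather than a DataFrame, so the query failed and df was never created.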

Traceback (most recent call last):
  File "<string>", line 1, in <module>
duckdb.duckdb.InvalidInputException: Invalid Input Error: Python Object 
"penguins" of type "module" found on line "<string>:1" not suitable for
replacement scans.
Make sure that "penguins" is either a pandas.DataFrame, 
duckdb.DuckDBPyRelation, pyarrow Table, Dataset, RecordBatchReader, 
Scanner, or NumPy ndarrays with supported format

Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name 'df' is not defined

Below we load the penguins data, connect to duckdb, then create the table:

penguins_data = load_penguins()
con = duckdb.connect('my-db.duckdb')

The load_penguins() function returns the penguins data as a pandas DataFrame, which we can insert into DuckDB before querying:

con.execute("CREATE OR REPLACE TABLE penguins AS SELECT * FROM penguins_data")

<_duckdb.DuckDBPyConnection object at 0x725d697a3f30>

df = con.execute("SELECT * FROM penguins").fetchdf().dropna()
con.close()

Confirm the data:

df.head(3)

  species     island  bill_length_mm  ...  body_mass_g     sex  year
0  Adelie  Torgersen            39.1  ...       3750.0    male  2007
1  Adelie  Torgersen            39.5  ...       3800.0  female  2007
2  Adelie  Torgersen            40.3  ...       3250.0  female  2007

[3 rows x 8 columns]

Create a `model`

The model is created below, as in Lab 1.

X = get_dummies(df[['bill_length_mm', 'species', 'sex']], drop_first = True)
y = df['body_mass_g']

model = LinearRegression().fit(X, y)
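To see where the encoded feature columns come from, here is a simplified pure-Python stand-in for one-hot encoding with the first level dropped. It is not how pandas implements `get_dummies()`; the helper name `dummies_drop_first` and the example rows are assumptions for illustration, and it emits 0/1 integers for readability.

```python
def dummies_drop_first(rows, categorical):
    """One-hot encode the named columns, dropping the first (sorted) level of each."""
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    encoded = []
    for r in rows:
        # Keep non-categorical columns unchanged.
        out = {k: v for k, v in r.items() if k not in categorical}
        for c in categorical:
            # Skip the first level: it becomes the baseline absorbed by the intercept.
            for level in levels[c][1:]:
                out[f"{c}_{level}"] = int(r[c] == level)
        encoded.append(out)
    return encoded

# Illustrative rows (hypothetical values).
rows = [
    {"bill_length_mm": 39.1, "species": "Adelie", "sex": "male"},
    {"bill_length_mm": 46.5, "species": "Chinstrap", "sex": "female"},
    {"bill_length_mm": 50.0, "species": "Gentoo", "sex": "male"},
]
encoded = dummies_drop_first(rows, ["species", "sex"])
print(list(encoded[0]))
```

Dropping the first level (here Adelie and female) avoids perfectly collinear columns, so each remaining coefficient is read as a difference from that baseline.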

Print `model` information

The statements below print the model’s R², intercept, feature columns, and coefficients.

print(f"R^2 {model.score(X,y)}")

R^2 0.8555368759537614

print(f"Intercept {model.intercept_}")

Intercept 2169.2697209393996

print(f"Columns {X.columns}")

Columns Index(['bill_length_mm', 'species_Chinstrap', 'species_Gentoo', 'sex_male'], dtype='str')

print(f"Coefficients {model.coef_}")

Coefficients [  32.53688677 -298.76553447 1094.86739145  547.36692408]
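As a worked example of how these numbers combine, the sketch below recomputes a prediction by hand from the printed intercept and coefficients. The penguin being predicted (a male Gentoo with a 45 mm bill) is a hypothetical input, and `predict_mass` is an illustrative helper, not part of scikit-learn.

```python
# Values copied from the printed model output.
intercept = 2169.2697209393996
coefs = {
    "bill_length_mm": 32.53688677,
    "species_Chinstrap": -298.76553447,
    "species_Gentoo": 1094.86739145,
    "sex_male": 547.36692408,
}

def predict_mass(features):
    """Linear prediction: intercept + sum(coefficient * feature value)."""
    return intercept + sum(coefs[name] * features.get(name, 0) for name in coefs)

# Hypothetical penguin: male Gentoo, 45 mm bill (missing dummies default to 0).
gentoo_male = {"bill_length_mm": 45.0, "species_Gentoo": 1, "sex_male": 1}
pred = predict_mass(gentoo_male)
print(f"{pred:.1f} g")
```

The baseline case (a female Adelie) uses only the intercept and the bill-length term, since all three dummy columns are 0.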

`model` -> `VetiverModel`

The model is wrapped with VetiverModel(), a wrapper class that takes our trained machine learning model and adds the metadata and functionality needed for deployment and versioning.

from vetiver import VetiverModel
v = VetiverModel(model, model_name='penguin_model', prototype_data=X)

%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%

flowchart LR
    RawMod("<strong>Raw sklearn model</strong>") --> VMod("<strong>VetiverModel()</strong><br>wrapper")
    VMod --> ModMeta("<strong>Model + Metadata</strong>")
    ModMeta --> Dep("<strong>Deployment Ready</strong>")
    
    subgraph Vet["<em>What it creates</em>"]
        InSch("Input Schema")
        ModTypeInfo("Model Type Info")
        Ver("Versioning")
        Api("API Generation")
        Ser("Serialization")
    end
    
    VMod --> InSch
    VMod --> ModTypeInfo 
    VMod --> Ver
    VMod --> Api
    VMod --> Ser

    %% Style for outputs %%
    style RawMod fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
    style VMod fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style ModMeta fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style Dep fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
    style InSch fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style ModTypeInfo fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style Ver fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style Api fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style Ser fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style Vet fill:#fbf7ec,stroke:#2A6F77,color:#1B2A41

VetiverModel()
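The "Input Schema" box is the part driven by `prototype_data`. As a conceptual sketch only (vetiver's own implementation is not shown in this lab and may differ), an input prototype can be thought of as a record of expected column names and types that later inputs are checked against. The helpers `build_prototype` and `check_input` are hypothetical names.

```python
def build_prototype(sample_row):
    """Record the expected column names and Python types from one example row."""
    return {name: type(value) for name, value in sample_row.items()}

def check_input(prototype, row):
    """Return a list of problems; an empty list means the row matches the prototype."""
    problems = []
    for name, expected in prototype.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(row[name]).__name__}"
            )
    for name in row:
        if name not in prototype:
            problems.append(f"unexpected column: {name}")
    return problems

# One example row with the same columns as X (values are illustrative).
prototype = build_prototype(
    {"bill_length_mm": 39.1, "species_Chinstrap": 0, "species_Gentoo": 0, "sex_male": 1}
)

ok = check_input(
    prototype,
    {"bill_length_mm": 45.0, "species_Chinstrap": 0, "species_Gentoo": 1, "sex_male": 1},
)
bad = check_input(
    prototype,
    {"bill_length_mm": "45", "species_Gentoo": 1, "sex_male": 1},
)
print(ok, bad)
```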

What are `pins` boards?

In this version, we save our pins board to a temporary location (f"{temp_dir}/models"); in the lab itself, we’ll create the directory before creating the board.

%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%

flowchart LR
    BrdFldr("<strong>board_folder()</strong>") --> FSI("File System Interface")
    FSI --> VerStor("<strong>Versioned Storage</strong>")
    VerStor --> MetaTr("<strong>Metadata Tracking</strong>")
    
    subgraph BF["<em>What it creates</em>"]
        DirStr("Directory Structure")
        VerCont("Version Control")
        PinMngMnt("Pin Management")
    end
    
    FSI --> DirStr
    FSI --> VerCont
    FSI --> PinMngMnt

    %% Style for outputs %%
    style BrdFldr fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
    style FSI fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style VerStor fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style MetaTr fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
    style DirStr fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style VerCont fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style PinMngMnt fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style BF fill:#fbf7ec,stroke:#2A6F77,color:#1B2A41

board_folder()

import tempfile
from pins import board_folder
from vetiver import vetiver_pin_write

temp_dir = tempfile.mkdtemp()
model_board = board_folder(f"{temp_dir}/models", allow_pickle_read=True)
vetiver_pin_write(model_board, v)

1: tempfile.mkdtemp() creates a temporary directory
2: allow_pickle_read=True allows reading Python pickle files (needed for scikit-learn models)

Model Cards provide a framework for transparent, responsible reporting. 
 Use the vetiver `.qmd` Quarto template as a place to start, 
 with vetiver.model_card()
Writing pin:
Name: 'penguin_model'
Version: 20260711T232617Z-a6861

board_folder() creates the pins board: a storage interface to a local folder (here, a temporary one) with a standardized structure for storing “pins” (data objects or models).
vetiver_pin_write() takes two arguments: model_board, the board object pointing to temp_dir/models, and v, the VetiverModel wrapping our LinearRegression.

print(f"Model saved to: {temp_dir}/models")

Model saved to: /tmp/tmp8ybwd3zw/models


import os

def print_tree(directory, prefix="", max_depth=None, current_depth=0):
    """Print a tree structure of directory contents."""
    if max_depth is not None and current_depth > max_depth:
        return
    
    items = sorted(os.listdir(directory))
    for i, item in enumerate(items):
        path = os.path.join(directory, item)
        is_last = i == len(items) - 1
        
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item}")
        
        if os.path.isdir(path):
            extension = "    " if is_last else "│   "
            print_tree(path, prefix + extension, max_depth, current_depth + 1)

Below is a tree of the model folder:

print_tree(f"{temp_dir}/models")

└── penguin_model
    └── 20260711T232617Z-a6861
        ├── data.txt
        └── penguin_model.joblib

What is in `penguin_model/`?

The penguin_model board folder is described below:

└── penguin_model
    └── CURRENT-DATE-TIME-aba30/
        ├── data.txt
        └── penguin_model.joblib

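1: penguin_model/ is the pin name directory
2: CURRENT-DATE-TIME-aba30/ is the version directory (timestamp + hash)
3: data.txt holds the model metadata (what, when, how big)
4: penguin_model.joblib is the serialized model (the actual trained model + VetiverModel wrapper with all the prediction logic)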

The board folder handles versioning automatically and manages metadata about what’s stored.
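To make the folder layout concrete, here is a toy versioned store built from the standard library. It only mimics the name/version/files structure described above and is not how pins works internally: the functions `toy_pin_write` and `toy_pin_read` are hypothetical, pickle stands in for joblib, and the metadata is written as JSON for simplicity.

```python
import hashlib
import json
import os
import pickle
import tempfile
from datetime import datetime, timezone

def toy_pin_write(board_dir, name, obj):
    """Store obj under board_dir/name/<timestamp>-<hash>/ and return the version id."""
    payload = pickle.dumps(obj)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha256(payload).hexdigest()[:5]
    version = f"{stamp}-{digest}"
    version_dir = os.path.join(board_dir, name, version)
    os.makedirs(version_dir, exist_ok=True)
    # Serialized object.
    with open(os.path.join(version_dir, f"{name}.pkl"), "wb") as f:
        f.write(payload)
    # Metadata: what, when, how big.
    meta = {"name": name, "created": stamp, "file_size": len(payload)}
    with open(os.path.join(version_dir, "data.txt"), "w") as f:
        json.dump(meta, f)
    return version

def toy_pin_read(board_dir, name, version=None):
    """Load a stored object; default to the latest version (ids sort by timestamp)."""
    versions = sorted(os.listdir(os.path.join(board_dir, name)))
    version = version or versions[-1]
    # Only unpickle files you trust: loading a pickle can execute arbitrary code.
    with open(os.path.join(board_dir, name, version, f"{name}.pkl"), "rb") as f:
        return pickle.load(f)

board_dir = os.path.join(tempfile.mkdtemp(), "models")
toy_model = {"intercept": 2169.27, "coef": [32.54]}  # stand-in for a fitted model
version = toy_pin_write(board_dir, "penguin_model", toy_model)
restored = toy_pin_read(board_dir, "penguin_model")
print(version, restored == toy_model)
```

The pickle warning in the reader is the reason a flag such as allow_pickle_read has to be switched on explicitly when the board is created.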

%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%

flowchart LR
    Mod("<strong>model</strong>") --> VetMod("<strong>VetiverModel(model)</strong>")
    VetMod --> VPW("<strong>vetiver_pin_write()</strong>")
    VPW --> Joblib("Serialized <strong>.joblib</strong>")
    VPW --> Meta("Metadata <strong>.txt</strong>")
    
    Rep("To<br>reproduce...") --> VPR("<strong>vetiver_pin_read()</strong>")
    VPR --> Joblib
    VPR --> Meta
    VPR --> Recon("Reconstructed<br><strong>VetiverModel()</strong>")

    %% Style for outputs %%
    style Mod fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
    style VetMod fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style VPW fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style Joblib fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
    style Meta fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
    style Rep fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
    style VPR fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style Recon fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff

vetiver_pin_write()

`VetiverModel` -> `VetiverAPI`

Finally, we’ll turn the model into an API.

from vetiver import VetiverAPI
app = VetiverAPI(v, check_prototype = True)

VetiverAPI() creates a production-ready web API that wraps the trained model, validates inputs, and returns predictions via HTTP.

%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'monospace'}}}%%

flowchart LR
    VetMod("<strong>VetiverModel()</strong>") --> VetAPI("<strong>VetiverAPI()</strong>")
    VetAPI --> REST("<strong>REST</strong> API<br>Server")
    REST --> HttpEP("<strong>HTTP</strong><br>Endpoints")
    
    HttpEP --> Post("<strong>POST</strong><br>/predict")
    HttpEP --> GetPing("<strong>GET</strong><br>/ping")
    HttpEP --> GetProto("<strong>GET</strong><br>/prototype")
    HttpEP --> GetMeta("<strong>GET</strong><br>/metadata")

    %% Style for outputs %%
    style VetMod fill:#5B8C5A,stroke:#000000,stroke-width:1px,color:#ffffff
    style VetAPI fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style REST fill:#D2562B,stroke:#000000,stroke-width:1px,color:#ffffff
    style HttpEP fill:#2A6F77,stroke:#000000,stroke-width:1px,color:#ffffff
    style Post fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style GetPing fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style GetProto fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff
    style GetMeta fill:#485466,stroke:#000000,stroke-width:1px,color:#ffffff

VetiverAPI()

This makes our model accessible to any application that can make web requests, which we will cover in the next lab.
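Until then, the endpoint idea can be illustrated with a toy server built on Python's standard library. This is a sketch under stated assumptions, not VetiverAPI itself: the handler class `ToyModelHandler`, the response bodies, and the 422 status for missing columns are hypothetical choices, and the prediction reuses the intercept and coefficients printed earlier.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Values copied from the printed model output.
INTERCEPT = 2169.2697209393996
COEFS = {
    "bill_length_mm": 32.53688677,
    "species_Chinstrap": -298.76553447,
    "species_Gentoo": 1094.86739145,
    "sex_male": 547.36692408,
}

class ToyModelHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, body):
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        # Health check endpoint.
        if self.path == "/ping":
            self._send_json(200, {"ping": "pong"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/predict":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        rows = json.loads(self.rfile.read(length))
        # Validate inputs before predicting.
        missing = sorted({c for r in rows for c in COEFS if c not in r})
        if missing:
            self._send_json(422, {"error": f"missing columns: {missing}"})
            return
        preds = [INTERCEPT + sum(COEFS[c] * r[c] for c in COEFS) for r in rows]
        self._send_json(200, {"predict": preds})

    def log_message(self, *args):
        pass  # silence per-request logging

# Port 0 asks the OS for any free port; the server runs in a background thread.
server = HTTPServer(("127.0.0.1", 0), ToyModelHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

with urllib.request.urlopen(f"{base}/ping") as resp:
    ping = json.load(resp)

# Supplying a body makes urllib send a POST request.
body = [{"bill_length_mm": 45.0, "species_Chinstrap": 0, "species_Gentoo": 1, "sex_male": 1}]
req = urllib.request.Request(
    f"{base}/predict",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

server.shutdown()
server.server_close()
print(ping, result)
```

Any client that can send JSON over HTTP can call the model this way, regardless of the language it is written in.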
