Lab 2: Project Architecture

Published

2025-09-17

Warning

This section is being revised. Thank you for your patience.

Lab 2 introduces the vetiver package for modeling in model-vetiver.qmd. The vetiver package comes in two flavors: R and Python.

fs::dir_tree(path = "_labs/lab2/")
## _labs/lab2/
## └── model-vetiver.qmd

If you’d like to see the same process using R, check out the R version.

Comprehension Questions

  1. What are the layers of a three-layer application architecture? What libraries could you use to implement a three-layer architecture in R or Python?

%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%
flowchart TD
    Present("Presentation<br>Layer") -.-> Process("Processing<br>Layer")
    Process -.-> Data("Data<br>Layer")
    
    subgraph UI["<strong>User Interface</strong>"]
        Present
    end
    
    subgraph BL["<strong>Business Logic</strong>"]
        Process
    end
    
    subgraph DS["<strong>Data Storage</strong>"]
        Data
    end
    
    %% #b5c0c5 is a tint of #6b818c
    style UI fill:#b5c0c5 
    style BL fill:#d8e4ff
    style DS fill:#31e981
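
To make the second half of question 1 concrete, here is a minimal three-layer sketch in Python. The standard library's sqlite3 stands in for the data layer; in a real project the presentation layer might be Shiny or Streamlit, the processing layer pandas or scikit-learn, and the data layer duckdb.

```python
import sqlite3

# --- Data layer: storage and retrieval (sqlite3 standing in for duckdb) ---
def fetch_masses(con):
    return [row[0] for row in con.execute("SELECT body_mass_g FROM penguins")]

# --- Processing layer: business logic, independent of storage and display ---
def mean_mass(masses):
    return sum(masses) / len(masses)

# --- Presentation layer: formatting for the user interface ---
def render(mass):
    return f"Average body mass: {mass:.0f} g"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (body_mass_g REAL)")
con.executemany("INSERT INTO penguins VALUES (?)", [(3750,), (3800,), (3250,)])

report = render(mean_mass(fetch_masses(con)))
print(report)  # Average body mass: 3600 g
```

Because each layer only talks to the one below it, you can swap the storage backend or the user interface without touching the business logic.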
    

  2. What are some questions you should explore to reduce the data requirements for your project?

  • Problem Definition
    • What is the minimum viable outcome?
    • Can we solve a simpler version first?
    • What is the cost of being wrong?
  • Feature Engineering
    • Which features are actually predictive?
    • Can we derive features from less data?
    • What is redundant?
  • Sampling Strategy
    • How much data do we actually need?
    • Can we stratify smartly?
    • What about active learning?
  • External Resources
    • What data already exists?
    • Can we use pre-trained models?
    • Are there proxies available?
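
A cheap way to explore the sampling-strategy questions above is to prototype on a stratified sample before committing to the full dataset. A minimal sketch with the standard library (the records and fraction below are made up for illustration):

```python
import random

random.seed(42)

# Hypothetical records: (species, body_mass_g)
records = [("Adelie", 3700)] * 100 + [("Gentoo", 5000)] * 40 + [("Chinstrap", 3730)] * 20

# Group records by stratum, then draw the same fraction from each group,
# so rare groups are not lost the way they might be in a simple random sample
def stratified_sample(rows, key, frac):
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    sample = []
    for group in groups.values():
        k = max(1, round(len(group) * frac))
        sample.extend(random.sample(group, k))
    return sample

sample = stratified_sample(records, key=lambda r: r[0], frac=0.1)
print(len(sample))  # 16 rows: 10 Adelie + 4 Gentoo + 2 Chinstrap
```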

  3. What are some patterns you can use to make big data smaller?

  • Minimize data movement by doing as much work as possible where the data already sits.

  • Don’t load all your data at once if you don’t need to (keep a live connection to your database and only pull specific pieces of data when you actually need them).

  • Working with samples can dramatically speed up your analysis while maintaining accuracy for most types of questions.

  • If your data has natural groupings, process one group at a time instead of loading everything.
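
The last two patterns can be sketched with the standard library; here sqlite3 stands in for a live database connection, and each species group is pulled and processed separately instead of fetching the whole table into memory:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, body_mass_g REAL)")
con.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", 3750), ("Adelie", 3800), ("Gentoo", 5000), ("Gentoo", 5400)],
)

# Ask the database for the group labels, then pull one group at a time
species = [s for (s,) in con.execute("SELECT DISTINCT species FROM penguins")]
means = {}
for sp in species:
    rows = con.execute(
        "SELECT body_mass_g FROM penguins WHERE species = ?", (sp,)
    ).fetchall()
    means[sp] = sum(m for (m,) in rows) / len(rows)

print(means)  # {'Adelie': 3775.0, 'Gentoo': 5200.0}
```

With duckdb the same pattern applies: keep the connection open, push the `WHERE` filter to the database, and only the requested group crosses into Python.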

  4. Where can you put intermediate artifacts in a data science project?

Flat files (such as .csv) or a duckdb database (which only loads the data you need into memory).
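
As a sketch of the flat-file option: write an intermediate result to a .csv once, then have later steps reload it instead of recomputing it (pins or duckdb would serve the same purpose, with versioning or lazy loading added). The values below are hypothetical.

```python
import csv, os, tempfile

# A hypothetical intermediate artifact: per-species mean body mass
means = {"Adelie": 3706.2, "Gentoo": 5092.4}

path = os.path.join(tempfile.mkdtemp(), "species_means.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["species", "mean_body_mass_g"])
    writer.writerows(means.items())

# A later step reloads the artifact instead of recomputing it
with open(path) as f:
    reloaded = {row["species"]: float(row["mean_body_mass_g"])
                for row in csv.DictReader(f)}

print(reloaded == means)  # True
```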

  5. What does it mean to “take data out of the bundle”?

Take data out of the presentation bundle (i.e., separate it from the code) if it’s updated more frequently than the app code.

Data preparation

We’ll use the reticulate package below to install the Python packages needed for the model-vetiver.qmd lab.

library(reticulate)
use_virtualenv("myenv", required = TRUE)
virtualenv_install(
  envname = "myenv",
  packages = c("palmerpenguins", "pandas", "numpy",
               "scikit-learn", "duckdb", "vetiver",
               "pins"))

Now we can import the Python packages/functions.

from palmerpenguins import load_penguins
from pandas import get_dummies
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import duckdb

Using duckdb

Querying penguins directly originally returned an error, because DuckDB found the palmerpenguins module (not a DataFrame) under that name:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
duckdb.duckdb.InvalidInputException: Invalid Input Error: Python Object 
"penguins" of type "module" found on line "<string>:1" not suitable for
replacement scans.
Make sure that "penguins" is either a pandas.DataFrame, 
duckdb.DuckDBPyRelation, pyarrow Table, Dataset, RecordBatchReader, 
Scanner, or NumPy ndarrays with supported format

A second attempt failed because the df DataFrame had not been created yet:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name 'df' is not defined

Below we load the penguins data, connect to duckdb, then create the table:

penguins_data = load_penguins()
con = duckdb.connect('my-db.duckdb')

The load_penguins() function loads the penguins data into a DataFrame, which we can then insert into DuckDB before querying:

con.execute("CREATE OR REPLACE TABLE penguins AS SELECT * FROM penguins_data")
<_duckdb.DuckDBPyConnection object at 0x140062c70>
df = con.execute("SELECT * FROM penguins").fetchdf().dropna()
con.close()

Confirm the data:

df.head(3)
  species     island  bill_length_mm  ...  body_mass_g     sex  year
0  Adelie  Torgersen            39.1  ...       3750.0    male  2007
1  Adelie  Torgersen            39.5  ...       3800.0  female  2007
2  Adelie  Torgersen            40.3  ...       3250.0  female  2007

[3 rows x 8 columns]

Create a model

The model is created below (as in Lab 1).

X = get_dummies(df[['bill_length_mm', 'species', 'sex']], drop_first = True)
y = df['body_mass_g']

model = LinearRegression().fit(X, y)

model -> VetiverModel

We wrap the model with VetiverModel(), a wrapper class that takes our trained machine learning model and adds important metadata and functionality for deployment and versioning.

from vetiver import VetiverModel
v = VetiverModel(model, model_name='penguin_model', prototype_data=X)

%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%

flowchart LR
    RawMod(["<strong>Raw sklearn model</strong>"]) --> VMod("<strong>VetiverModel()</strong><br>wrapper")
    VMod --> ModMeta(["<strong>Model + Metadata</strong>"])
    ModMeta --> Dep(["<strong>Deployment Ready</strong>"])
    
    subgraph Vet["<em>What it creates</em>"]
        InSch("Input Schema")
        ModTypeInfo("Model Type Info")
        Ver("Versioning")
        Api("API Generation")
        Ser("Serialization")
    end
    
    VMod --> InSch
    VMod --> ModTypeInfo 
    VMod --> Ver
    VMod --> Api
    VMod --> Ser

    %% Style for outputs %%
    style RawMod fill:#d8e4ff
    style ModMeta fill:#31e981
    style Dep fill:#31e981
    

VetiverModel()

What are pins boards?

In this version, we’re going to save our pins board to a temporary location (f"{temp_dir}/models"), but in the lab, we’ll create the directory before creating the board.

%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%

flowchart LR
    BrdFldr(["<strong>board_folder()</strong>"]) --> FSI("File System Interface")
    FSI --> VerStor(["<strong>Versioned Storage</strong>"])
    VerStor --> MetaTr(["<strong>Metadata Tracking</strong>"])
    
    subgraph BF["<em>What it creates</em>"]
        DirStr("Directory Structure")
        VerCont("Version Control")
        PinMngMnt("Pin Management")
    end
    
    FSI --> DirStr
    FSI --> VerCont
    FSI --> PinMngMnt

    %% Style for outputs %%
    style BrdFldr fill:#d8e4ff
    style VerStor fill:#31e981
    style MetaTr fill:#31e981

board_folder()

import tempfile
from pins import board_folder
from vetiver import vetiver_pin_write

temp_dir = tempfile.mkdtemp()
model_board = board_folder(f"{temp_dir}/models", allow_pickle_read=True)
vetiver_pin_write(model_board, v)
1. Use temp directory
2. allow_pickle_read allows reading Python pickle files (needed for scikit-learn models)
Model Cards provide a framework for transparent, responsible reporting. 
 Use the vetiver `.qmd` Quarto template as a place to start, 
 with vetiver.model_card()
Writing pin:
Name: 'penguin_model'
Version: 20251218T105954Z-aba30
  • board_folder() creates a pins board: a storage interface to a local temporary folder with a standardized structure for storing “pins” (data objects/models).

  • vetiver_pin_write() receives model_board (our board_folder object pointing to temp_dir/models) and v (our VetiverModel wrapping the LinearRegression).

print(f"Model saved to: {temp_dir}/models")
Model saved to: /var/folders/0x/x5wkbhmx0k74tncn9swz7xpr0000gn/T/tmpmxd3ubm3/models
import os

def print_tree(directory, prefix="", max_depth=None, current_depth=0):
    """Print a tree structure of directory contents."""
    if max_depth is not None and current_depth > max_depth:
        return
    
    items = sorted(os.listdir(directory))
    for i, item in enumerate(items):
        path = os.path.join(directory, item)
        is_last = i == len(items) - 1
        
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item}")
        
        if os.path.isdir(path):
            extension = "    " if is_last else "│   "
            print_tree(path, prefix + extension, max_depth, current_depth + 1)

Below is a tree of the model folder:

print_tree(f"{temp_dir}/models")
└── penguin_model
    └── 20251218T105954Z-aba30
        ├── data.txt
        └── penguin_model.joblib

What is in penguin_model/?

The penguin_model board folder is described below:

└── penguin_model
    └── CURRENT-DATE-TIME-aba30/
        ├── data.txt
        └── penguin_model.joblib

1. Pin name directory
2. Version directory (timestamp + hash)
3. Model metadata (what, when, how big)
4. Serialized model (the actual trained model + VetiverModel wrapper with all the prediction logic)

The board folder handles versioning automatically and manages metadata about what’s stored.

%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%

flowchart LR
    Mod(["<strong>model</strong>"]) --> VetMod("<strong>VetiverModel(model)</strong>")
    VetMod --> VPW("<strong>vetiver_pin_write()</strong>")
    VPW --> Joblib(["Serialized <strong>.joblib</strong>"])
    VPW --> Meta(["Metadata <strong>.txt</strong>"])
    
    Rep[["To<br>reproduce..."]] --> VPR("<strong>vetiver_pin_read()</strong>")
    VPR --> Joblib
    VPR --> Meta
    VPR --> Recon(["Reconstructed<br><strong>VetiverModel()</strong>"])

    %% Style for outputs %%
    style Mod fill:#d8e4ff
    style Recon fill:#31e981
    style Meta fill:#31e981
    style Joblib fill:#31e981
    

vetiver_pin_write()

VetiverModel -> VetiverAPI

Finally, we’ll turn the model into an API.

from vetiver import VetiverAPI
app = VetiverAPI(v, check_prototype = True)

VetiverAPI() creates a production-ready web API that wraps the trained model, validates inputs, and returns predictions via HTTP.
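
Once the API is served (for example with app.run()), any HTTP client can POST rows shaped like the prototype to the /predict endpoint. The JSON below is a hypothetical request body; the dummy-column names are assumptions based on what get_dummies(..., drop_first = True) produces for this data.

```python
import json

# Hypothetical request body for POST /predict: one row shaped like the
# prototype X (dummy-column names assumed from the get_dummies output)
payload = [{
    "bill_length_mm": 39.1,
    "species_Chinstrap": 0,
    "species_Gentoo": 0,
    "sex_male": 1,
}]

body = json.dumps(payload)
print(body)
```

The server's response would contain the predicted body_mass_g for each row sent.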

%%{init: {'theme': 'neutral', 'look': 'handDrawn', 'themeVariables': { 'fontFamily': 'monospace', "fontSize":"16px"}}}%%

flowchart LR
    VetMod(["<strong>VetiverModel()</strong>"]) --> VetAPI("<strong>VetiverAPI()</strong>")
    VetAPI --> REST("<strong>REST</strong> API<br>Server")
    REST --> HttpEP("<strong>HTTP</strong><br>Endpoints")
    
    HttpEP --> Post(["<strong>POST</strong><br>/predict"])
    HttpEP --> GetPing(["<strong>GET</strong><br>/ping"])
    HttpEP --> GetProto(["<strong>GET</strong><br>/prototype"])
    HttpEP --> GetMeta(["<strong>GET</strong><br>/metadata"])

    %% Style for outputs %%
    style VetMod fill:#d8e4ff,stroke-width:1px,rx:10,ry:10,font-size:16px
    style Post fill:#31e981,stroke-width:1px,rx:10,ry:10,font-size:16px
    style GetPing fill:#31e981,stroke-width:1px,rx:10,ry:10,font-size:16px
    style GetProto fill:#31e981,stroke-width:1px,rx:10,ry:10,font-size:16px
    style GetMeta fill:#31e981,stroke-width:1px,rx:10,ry:10,font-size:16px

VetiverAPI()

This makes our model accessible to any application that can make web requests, which we will cover in the next lab.