glockr: An R Wrapper around Succinct Code Counter

Last week I built glockr, an R package that wraps scc, a “very fast accurate code counter with complexity calculations.” Code metrics are a fascination of mine, and now that code is increasingly written by both machines and humans, I wanted a set of tools I could use for comparison.

Background

Adam Tornhill wrote two books on code metrics: Software Design X-Rays and Your Code as a Crime Scene, 2ed. Both are excellent reads, and were the first exposure I had to any measurements around code complexity.¹ The approach to metrics and data visualizations Tornhill outlines has definitely improved the way I think about writing (and maintaining code). He covers topics like essential and accidental complexity, and how often code is changed (i.e., measured via version control) is usually as important as any metrics.

The R ecosystem has some great packages for assessing code quality, too:²

cloc (‘Count Lines of Code’) developed by Bob Rudis, which wraps the ‘Perl’ command-line utility.³
pkgstats is a ‘static code analysis tool’ from ROpenSci for R packages.
Luke Tierney’s codetools package also has excellent code analysis tools for R.

What does `glockr` do?

glockr wraps the scc command-line tool in an R package (much like cloc). scc is written in pure Go, so it’s much faster than the alternatives.⁴

Install glockr from GitHub:

install.packages('remotes')
remotes::install_github("mjfrigaard/glockr")

We’ll also install gt and dplyr for data manipulation and pretty tables.

install.packages(c('gt', 'dplyr'))

library(glockr)
library(gt)
library(dplyr)

glockr computes many of the same values of the cloc package, but also includes the weighted complexity, unique lines of code (uloc), and Constructive Cost Model (cocomo) estimates.

Verify that scc is installed with:

scc_version()
## [1] "scc version 3.7.0"

The primary function for glockr is scc(). We’ll use the popular rlang package as an example:⁵

pkg_path <- file.path("rlang/")

rlang_scc <- scc(pkg_path) 
dplyr::glimpse(rlang_scc)
## Rows: 11
## Columns: 10
## $ language            <chr> "R", "Markdown", "C"…
## $ files               <int> 163, 53, 70, 72, 9, …
## $ lines               <int> 44571, 16935, 14014,…
## $ code                <int> 28038, 13866, 10960,…
## $ comments            <int> 11152, 0, 872, 1777,…
## $ blanks              <int> 5381, 3069, 2182, 11…
## $ complexity          <int> 2323, 0, 1851, 691, …
## $ weighted_complexity <dbl> 8.285184, 0.000000, …
## $ bytes               <int> 1188357, 496932, 376…
## $ uloc                <int> 23837, 5358, 6931, 4…

The raw counts in the standard output from scc() includes the following columns:

language: Language of the file
files: Number of files
lines: Total physical line count across all files (code + comments + blanks)
code: Lines of source code (excludes blanks and comments)
comments: Number of total comments
blanks: Number of total blank lines
bytes: total file size in bytes across all files in that language

Complexity & Weighted Complexity

The complexity calculation in scc is an approximation of the standard cyclomatic complexity:⁶

“The reason it’s an approximation is that it’s calculated almost for free from a CPU point of view (since its a cheap lookup when counting), whereas a real cyclomatic complexity count would need to parse the code. It gives a reasonable guess in practice though even if it fails to identify recursive methods. The goal was never for it to be exact.

In short when scc is looking through what it has identified as code if it notices what are usually branch conditions it will increment a counter.”

This approximation is an important limitation if your goal is to compare projects written in different languages. However, if your questions on code quality center around 1) “what are the most complex files in this project?” and – by extension – 2) “what code should be refactored?”, these metrics work well.⁷

Below is the output from scc() for the rlang source files (sorted by complexity).

rlang_scc |> 
    dplyr::arrange(desc(complexity)) |> 
    gt::gt() |> 
    gt::tab_header(
      title = "Succinct Code Count (rlang package)", 
      subtitle = "Sorted by complexity")

language	files	lines	code	comments	blanks	complexity	weighted_complexity	bytes	uloc
Succinct Code Count (rlang package)
Sorted by complexity
R	163	44571	28038	11152	5381	2323	8.285184	1188357	23837
C	70	14014	10960	872	2182	1851	16.888686	376444	6931
C Header	72	8210	5315	1777	1118	691	13.000941	270950	4809
Makefile	1	11	7	0	4	2	28.571429	184	8
Markdown	53	16935	13866	0	3069	0	0.000000	496932	5358
YAML	9	735	587	35	113	0	0.000000	17940	449
C++	2	25	21	0	4	0	0.000000	492	17
C++ Header	1	26	21	0	5	0	0.000000	429	21
SVG	10	10	10	0	0	0	0.000000	9687	10
License	1	2	2	0	0	0	0.000000	43	2
TOML	1	2	2	0	0	0	0.000000	34	2

This tells us – not surprisingly – that .R code files are the most complex files in this project.

A more useful output is the most complex files in the rlang package, which we can do using:

r_files <- scc_by_file(pkg_path, include_ext = "r")
r_files |> 
  dplyr::arrange(desc(complexity)) |> 
  dplyr::select(
    c(filename, code, complexity)
  ) |> 
  head(10) |>
  gt::gt() |>
  gt::tab_header(
    title    = "Top 10 most complex .R files",
    subtitle = "raw branch-token count",
    preheader = "rlang package")

filename	code	complexity
Top 10 most complex .R files
raw branch-token count
cnd-abort.R	691	152
standalone-cli.R	432	135
trace.R	807	134
deparse.R	819	111
standalone-vctrs.R	505	107
call.R	451	92
standalone-obj-type.R	206	64
cnd-entrace.R	213	61
utils.R	302	59
expr.R	157	52

If we compare this to the weighted complexity ((complexity / code) * 100), we can see a few of the top files change:

r_files <- scc_by_file(pkg_path, include_ext = "r")
r_files |> 
  dplyr::arrange(desc(weighted_complexity)) |> 
  dplyr::select(
    c(filename, code, complexity, weighted_complexity)
  ) |> 
  head(10) |>
  gt::gt() |>
  gt::tab_header(
    title    = "Top 10 most complex (weighted) .R files",
    subtitle = "Weighted Complexity",
    preheader = "rlang package")

filename	code	complexity	weighted_complexity
Top 10 most complex (weighted) .R files
Weighted Complexity
operators.R	32	11	34.37500
expr.R	157	52	33.12102
standalone-cli.R	432	135	31.25000
standalone-obj-type.R	206	64	31.06796
utils-cli-tree.R	113	35	30.97345
utils-encoding.R	37	11	29.72973
cnd-entrace.R	213	61	28.63850
error-backtrace-empty.R	16	4	25.00000
raw.R	9	2	22.22222
cnd-abort.R	691	152	21.99711

We won’t spend too much time on these measures, because as you’ll see below, scc has quite a few code metrics to choose from.

Unique lines of code (uloc)

The unique lines of code (ULOC) counts unique non-blank lines including comments. The scc author argues – with help⁸ – that this is better than relying on SLOC (source lines of code):

“Compared to SLOC, not only are blank lines discounted, but so are close-brace lines and other repetitive code such as common includes. On the other hand, ULOC counts comments, which require just as much maintenance as the code around them does, while avoiding inflating the result with license headers which appear in every file, for example.”

For comparison, SLOC (the value in the code column) is just the raw count of non-blank, non-comment lines and includes duplicates. uloc is automatically included with both scc() and scc_by_file().

The best way to see the differences is with the two minimal code files:

# pure code, 5 lines, two duplicate pairs
# save as code.R
x <- 1
y <- 2
x <- 1
y <- 2
z <- 3

# same code, plus three unique comments
# save as code_w_comments.R 
# step 1
x <- 1
y <- 2
# step 2
x <- 1
y <- 2
# step 3
z <- 3

Now we can compare these with scc_by_file() and view the lines, code, comments and uloc columns.

fixtures <- c(
  file.path("code.R"),
  file.path("code_w_comments.R")
)
scc_by_file(fixtures) |>
  dplyr::select(c(filename, lines, 
  code, comments, uloc)) |> 
  gt::gt() |>
  gt::tab_header(
    title    = "SLOC vs ULOC",
    subtitle = "Both files contain the same 5 code lines",
    preheader = "Local ULOC comparison")

filename	lines	code	comments	uloc
SLOC vs ULOC
Both files contain the same 5 code lines
code.R	5	5	0	3
code_w_comments.R	9	5	4	7

DRYness percentage

The DRYness % can be added with dryness = TRUE. This is computed locally as uloc / lines, which matches scc’s DRYness % formula, applied per record instead of project-wide.

scc_by_file(fixtures, dryness = TRUE) |>
  dplyr::select(c(filename, lines, 
  code, comments, uloc, dryness)) |> 
  gt::gt() |>
  gt::tab_header(
    title    = "SLOC vs ULOC + DRYness",
    subtitle = "Both files contain the same 5 code lines",
    preheader = "Local SLOC, ULOC & DRYness comparison")

filename	lines	code	comments	uloc	dryness
SLOC vs ULOC + DRYness
Both files contain the same 5 code lines
code.R	5	5	0	3	0.6000000
code_w_comments.R	9	5	4	7	0.7777778

DRYness values close to 1 mean a file or language is mostly unique non-blank content and values closer to 0 mean heavy duplication, comments, or blanks.

COCOMO cost estimates

scc includes a COCOMO 81 model that estimates project effort and cost from SLOC. When cocomo = TRUE, the scc() output becomes a list and the COCOMO output is in the cocomo tibble.

rlang_cocomo <- scc(pkg_path, cocomo = TRUE)
## # A tibble: 11 × 10
##    language   files lines  code comments blanks
##    <chr>      <int> <int> <int>    <int>  <int>
##  1 R            163 44571 28038    11152   5381
##  2 Markdown      53 16935 13866        0   3069
##  3 C             70 14014 10960      872   2182
##  4 C Header      72  8210  5315     1777   1118
##  5 YAML           9   735   587       35    113
##  6 C++            2    25    21        0      4
##  7 C++ Header     1    26    21        0      5
##  8 SVG           10    10    10        0      0
##  9 Makefile       1    11     7        0      4
## 10 License        1     2     2        0      0
## 11 TOML           1     2     2        0      0
## # ℹ 4 more variables: complexity <int>,
## #   weighted_complexity <dbl>, bytes <int>,
## #   uloc <int>
## # A tibble: 3 × 3
##   metric                    project_type value    
##   <chr>                     <chr>        <chr>    
## 1 Estimated Cost to Develop organic      $1,948,3…
## 2 Estimated Schedule Effort organic      17.72 mo…
## 3 Estimated People Required organic      9.77

The scc() output is in the scc tibble:

names(rlang_cocomo)
## [1] "scc"    "cocomo"

We can customize the COCOMO model with the following arguments:

avg_wage: annual salary in local currency
cocomo_project_type: “organic”, “semi-detached”, or “embedded”
eaf: effort adjustment factor
overhead: corporate overhead multiplier
currency_symbol: symbol shown in the value column
auto_print_scc: skip auto-printing the language tibble

NOTE: These arguments only work when cocomo = TRUE

embedded <- scc(pkg_path,
    cocomo = TRUE,         
    avg_wage = 120000L,        
    cocomo_project_type = "embedded",     
    eaf = 1.1,            
    overhead = 2.0,            
    currency_symbol = "$",            
    auto_print_scc = FALSE)  
## # A tibble: 3 × 3
##   metric                    project_type value    
##   <chr>                     <chr>        <chr>    
## 1 Estimated Cost to Develop embedded     $10,525,…
## 2 Estimated Schedule Effort embedded     18.57 mo…
## 3 Estimated People Required embedded     28.35

embedded$cocomo |> 
  gt::gt() |> 
  gt::tab_header(
    title = "COCOMO", 
    subtitle = "embedded project", 
    preheader = "rlang package")

metric	project_type	value
COCOMO
embedded project
Estimated Cost to Develop	embedded	$10,525,311
Estimated Schedule Effort	embedded	18.57 months
Estimated People Required	embedded	28.35

Why not use COCOMO II ???

scc() only gives us raw structural numbers (lines, code, complexity, bytes, etc.), but COCOMO II needs 22 judgement inputs we can’t derive from source code:

The current COCOMO project types ("organic" / "semi-detached" / "embedded") don’t translate to COCOMO II.
COCOMO II uses 5 scale factors to esitmate organizational/process attributes
COCOMO II also uses 17 effort multipliers for product, platform, personnel, and project context
- Every one of these is a six-level human rating about the team, requirements stability, reuse strategy, tool maturity, schedule pressure, etc.
The only one that might map would be CPLX (product complexity), but COCOMO II’s CPLX is architectural/algorithmic complexity, not McCabe’s cyclomatic.

Visualizing code complexity

In Your Code as a Crime Scene, Tornhill covers a collection of methods for visualizing code, but my favorite is his use of circle packing for identifying “hotspots.” These are described below:

“Each circle represents a part of the system. The more complex a module, as measured by lines of code, the larger the circle. We then use color to serve as an attention magnet for the most critical property: change.” - Adam Tornhill (Your Code as a Crime Scene, 2ed)

Tornhill’s Hotspots from Your Code as a Crime Scene, 2ed

The glockr package doesn’t have data on code changes (yet), but we can build circle packer graphs using weighted complexity and DRYness.

We’ll start by making each circle’s area tied to the file’s weighted_complexity, meaning the files that ‘pack more branching per line’ are larger.

Next, the color fill aesthetic will be the file’s DRYness % (uloc / lines), with darker hues meaning more duplication relative to line counts, and lighter hues meaning more unique, non-blank content.

We’ll start with a fresh dataset using scc_by_file(), setting dryness to TRUE.

r_files <- scc_by_file(pkg_path, include_ext = "r", dryness = TRUE)

We can exclude files with zero weighted_complexity because there is nothing to size the circle by:

r_files <- dplyr::filter(r_files, weighted_complexity > 0)
dplyr::glimpse(r_files)
## Rows: 114
## Columns: 14
## $ language            <chr> "R", "R", "R", "R", …
## $ filename            <chr> "c-lib.R", "aaa-topi…
## $ location            <chr> "/Users/mjfrigaard/p…
## $ lines               <int> 413, 93, 355, 551, 2…
## $ code                <int> 312, 65, 280, 63, 23…
## $ comments            <int> 26, 10, 12, 472, 3, …
## $ blanks              <int> 75, 18, 63, 16, 40, …
## $ complexity          <int> 21, 4, 49, 6, 5, 14,…
## $ weighted_complexity <dbl> 6.7307692, 6.1538462…
## $ bytes               <int> 9381, 2708, 10127, 1…
## $ uloc                <int> 244, 63, 227, 333, 1…
## $ dryness             <dbl> 0.5907990, 0.6774194…
## $ generated           <lgl> FALSE, FALSE, FALSE,…
## $ minified            <lgl> FALSE, FALSE, FALSE,…

The packcircles::circleProgressiveLayout() can be used to compute the area-proportional circle size:

library(packcircles)

packing <- packcircles::circleProgressiveLayout(
  r_files$weighted_complexity, sizetype = "area"
)

This creates a dataset with x, y, and radius columns (with the same number of observations as our previous r_files dataset).

dplyr::glimpse(packing)
## Rows: 114
## Columns: 3
## $ x      <dbl> -1.46371800, 1.39958211, 0.052866…
## $ y      <dbl> 0.0000000, 0.0000000, -3.5102885,…
## $ radius <dbl> 1.4637180, 1.3995821, 2.3601744, …

The number of rows in packing is important, because it allows us to bind the filename, weighted_complexity, and dryness columns back to the packing data:

packing <- dplyr::bind_cols(
  packing, # circle size data 
  dplyr::select(r_files, # original columns
    filename, weighted_complexity, dryness)
  )
dplyr::glimpse(packing)
## Rows: 114
## Columns: 6
## $ x                   <dbl> -1.46371800, 1.39958…
## $ y                   <dbl> 0.0000000, 0.0000000…
## $ radius              <dbl> 1.4637180, 1.3995821…
## $ filename            <chr> "c-lib.R", "aaa-topi…
## $ weighted_complexity <dbl> 6.7307692, 6.1538462…
## $ dryness             <dbl> 0.5907990, 0.6774194…

The circleLayoutVertices() function from packcircles creates “a data frame of circle vertices for plotting”

polygons <- packcircles::circleLayoutVertices(packing, npoints = 60)
dplyr::glimpse(polygons)
## Rows: 6,954
## Columns: 3
## $ x  <dbl> 0.00000000, -0.00801840, -0.03198575,…
## $ y  <dbl> 0.0000000, 0.1530002, 0.3043241, 0.45…
## $ id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Now we’re going to perform a positional join using the id from polygons and a new id we’ll create in r_files using dplyr::row_number() (and limit the columns to only filename, weighted_complexity, and dryness).

polygons <- polygons |>
  dplyr::left_join(
    r_files |>
      dplyr::mutate(id = dplyr::row_number()) |>
      dplyr::select(
        id, 
        filename, 
        weighted_complexity, 
        dryness),
    by = "id"
  )
dplyr::glimpse(polygons)
## Rows: 6,954
## Columns: 6
## $ x                   <dbl> 0.00000000, -0.00801…
## $ y                   <dbl> 0.0000000, 0.1530002…
## $ id                  <int> 1, 1, 1, 1, 1, 1, 1,…
## $ filename            <chr> "c-lib.R", "c-lib.R"…
## $ weighted_complexity <dbl> 6.730769, 6.730769, …
## $ dryness             <dbl> 0.590799, 0.590799, …

The x, y, and group aesthetics draw the circles, and fill is the dryness percentage value.
The geom_polygon() is used to connect start and end points (and fill the area inside)
scale_fill_viridis_c() is used for ‘filling’ continuous data, and coord_equal() ensures ensures equal unit length on the x-axis and y-axis.

ggplot2::ggplot(polygons,
                ggplot2::aes(
                  x = x, 
                  y = y, 
                  group = id, 
                  fill = dryness)) +
  ggplot2::geom_polygon(
    colour = "white", 
    linewidth = 0.2) + 
  ggplot2::scale_fill_viridis_c(
    name    = "DRYness\n(uloc / lines)",
    option  = "viridis",
    limits  = c(0, 1)
  ) +
  ggplot2::coord_equal()

We can’t label every file in the circle packer graph, but we can make sure the files with the highest levels of complexity have text labels. We’ll create a top_labels dataset for the top 10 weighted_complexity values.

top_labels <- packing |>
  dplyr::slice_max(weighted_complexity, n = 10)

Given the color palette we’re using, we will make sure our labels have a dark outline (behind white text) with the shadowtext package.

remotes::install_github("GuangchuangYu/shadowtext")

Now we can add the labels and theme_void():

# for the labels
library(shadowtext)
ggplot2::ggplot(polygons,
                ggplot2::aes(
                  x = x, 
                  y = y, 
                  group = id, 
                  fill = dryness)) +
  ggplot2::geom_polygon(
    colour = "white", 
    linewidth = 0.2) +
  # add text labels
  shadowtext::geom_shadowtext(
    data    = top_labels,
    mapping = ggplot2::aes(
      x = x, 
      y = y, 
      label = filename),
    inherit.aes = FALSE,
    size      = 3,
    colour    = "#2B2B2B",     # dark text
    bg.colour = "white",       # white halo so dark text stays legible
    bg.r      = 0.15,
    fontface  = "bold"
  ) +
  ggplot2::scale_fill_viridis_c(
    name    = "DRYness\n(uloc / lines)",
    option  = "viridis",
    limits  = c(0, 1)
  ) +
  ggplot2::coord_equal() +
  ggplot2::theme_void(base_size = 11)

This looks better. I’ll add some finishing touches with lab() and theme():

# for the labels
library(shadowtext)
ggplot2::ggplot(polygons,
                ggplot2::aes(
                  x = x, 
                  y = y, 
                  group = id, 
                  fill = dryness)) +
  ggplot2::geom_polygon(
    colour = "white", 
    linewidth = 0.2) +
  # add text labels
  shadowtext::geom_shadowtext(
    data    = top_labels,
    mapping = ggplot2::aes(x = x, y = y, label = filename),
    inherit.aes = FALSE,
    size      = 3,
    colour    = "#2B2B2B",     # dark text
    bg.colour = "white",       # white halo so dark text stays legible
    bg.r      = 0.15,
    fontface  = "bold"
  ) +
  ggplot2::scale_fill_viridis_c(
    name    = "DRYness\n(uloc / lines)",
    option  = "viridis",
    limits  = c(0, 1)
  ) +
  ggplot2::coord_equal() +
  ggplot2::labs(
    title    = "rlang R files: complexity vs. DRYness",
    subtitle = "circle area = weighted complexity; fill = uloc / lines; top 10 labelled",
    x = "",
    y = "",
    caption  = "scc_by_file(..., dryness = TRUE)"
  ) +
  ggplot2::theme_void(base_size = 11) +
  ggplot2::theme(
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "right"
  )

In the Visualization vignette I cover how to convert this into an interactive graph. In future versions, I’ll add the change (Git) data for more interesting hotspot visualizations.

Footnotes

Codacy has a great overview of code metrics in this blog post..↩︎
Other packages worth noting are sloop and lobstr – these don’t calculate metrics, but have been developed to improve users understanding of R and object oriented programming and are covered in Advanced R, 2nd ed..↩︎
See the cloc command-line utility documentation here.↩︎
Check the performance comparison of scc here.↩︎
This is a downloaded a local version of the source files as of 2026-05-26.↩︎
How scc counts lines of code is outlined in the ‘Features’ section of the README.md↩︎
Read the full description of Complexity Estimates.↩︎
Read this discussion here.↩︎