Cost distribution among software process activities
openstatsware
short course: Good Software Engineering Practice for R Packages
April 18, 2024
From an idea to a production-grade R package
Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.
The idea of creating an R package was born because you realized that copying and pasting R scripts is inefficient, and on top of that, you want to share your helpful R functions with colleagues and the world…
Photo CC0 by ELEVATE on pexels.com
Extensive documentation, huge paperwork, lots of manual work, lots of signatures, …
Photo CC0 by Chevanon Photography on pexels.com
Bad practice!
Why?
Cost distribution among software process activities
Origin of errors in system development
Boehm, B. (1981). Software Engineering Economics. Prentice Hall.
Invest time in
… but in many cases the workflow must be workable for a single developer or a small team.
Photo CC0 by Kateryna Babaieva on pexels.com
Let’s assume that you used some lines of code to create simulated data in multiple projects:
Idea: put the code into a package
Obligation level | Key word | Description |
---|---|---|
Duty | must | “must have” |
Desire | should | “nice to have” |
Intention | may | “optional” |
Purpose and Scope
The R package simulatr is intended to enable the creation of reproducible fake data.
Package Requirements
simulatr must provide a function to generate normally distributed random data for two independent groups. The function must allow flexible definition of the sample size, mean, and standard deviation per group. The reproducibility of the simulated data must be ensured via an optional seed. It should be possible to print the function result. The package may also facilitate graphical presentation of the simulated data.
Useful formats / tools for design docs:
UML Diagram
R package programming
One-off script as starting point:
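The original script is not reproduced here; a one-off script of this kind might look like the following minimal sketch. The sample sizes, means, and standard deviations match the example output shown later in this section; the variable names and the seed value are assumptions chosen for illustration.

```r
# Hypothetical one-off script: simulate two independent normal groups.
set.seed(123)  # arbitrary seed, assumed for illustration
n1 <- 50
n2 <- 50
group1 <- rnorm(n = n1, mean = 5, sd = 3)
group2 <- rnorm(n = n2, mean = 7, sd = 4)

# Combine both groups into one long data frame
simData <- data.frame(
    group = c(rep(1, n1), rep(2, n2)),
    values = c(group1, group2)
)
head(simData)
```

Every project that needs such data would have to copy these lines and adapt the numbers by hand, which is the inefficiency that motivates a package.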
Refactored script:
Almost all functions, arguments, and objects should be self-explanatory due to their names.
Define the result as a list and assign it a class:
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(
        n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2
    )
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}
The output is impractical, e.g., we need to scroll down:
$n1
[1] 50
$n2
[1] 50
$mean1
[1] 5
$mean2
[1] 7
$sd1
[1] 3
$sd2
[1] 4
$data
group values
1 1 6.92867501
2 1 4.27391191
3 1 4.84314168
4 1 6.47817838
5 1 4.33260826
6 1 -1.72191316
7 1 1.08203633
8 1 8.53507748
9 1 8.37867780
10 1 0.22530345
11 1 7.78210236
12 1 2.76337884
13 1 2.93471488
14 1 -0.34749376
15 1 7.27127205
16 1 1.57398466
17 1 5.99260926
18 1 6.87531114
19 1 5.27240331
20 1 7.82986314
21 1 7.24441602
22 1 4.24178078
23 1 4.86116603
24 1 3.43215595
25 1 2.73368044
26 1 4.66923585
27 1 6.76589371
28 1 7.64264403
29 1 0.40013825
30 1 2.22943563
31 1 2.09145210
32 1 8.10204547
33 1 5.80126515
34 1 0.29625804
35 1 4.23228594
36 1 2.47750708
37 1 5.13579141
38 1 9.21317431
39 1 2.41292737
40 1 10.56468496
41 1 6.86999664
42 1 8.34405572
43 1 6.86696222
44 1 8.71159940
45 1 3.23705137
46 1 6.30270645
47 1 -1.81965445
48 1 8.18732195
49 1 1.68561040
50 1 5.69022559
51 2 2.70545804
52 2 4.38127493
53 2 9.36388417
54 2 5.98904517
55 2 3.29543590
56 2 4.26433021
57 2 0.15160899
58 2 2.00679650
59 2 11.02430692
60 2 7.43707591
61 2 0.05354151
62 2 7.02388432
63 2 1.27704771
64 2 6.51298834
65 2 6.95702116
66 2 24.56328828
67 2 8.45460558
68 2 3.47935357
69 2 2.64644833
70 2 0.41833657
71 2 0.10432563
72 2 10.09086912
73 2 3.16091651
74 2 16.26886162
75 2 0.31918042
76 2 13.26124234
77 2 8.03001147
78 2 10.37754004
79 2 10.42970988
80 2 5.30819353
81 2 13.36327886
82 2 11.68057220
83 2 -0.18300430
84 2 5.51915435
85 2 5.15743041
86 2 6.01768963
87 2 11.28639246
88 2 10.55082700
89 2 5.54828808
90 2 9.68504303
91 2 9.12542468
92 2 0.53159052
93 2 12.41983360
94 2 0.96142457
95 2 12.38531380
96 2 3.97646479
97 2 11.26449122
98 2 17.08345606
99 2 4.81729861
100 2 6.31877527
attr(,"class")
[1] "SimulationResult"
Solution: implement the generic function print
Generic function print:
@title, @description, and @typed are tags ("action words") that the roxygen2 package builds into a help page in .Rd format, titled by the text of the @title tag.
#' marks the comment lines that hold this documentation.
#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#' mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
#' print(x)
#'
#' @export
$args
n1 n2 mean1 mean2 sd1 sd2
"50" "50" "5" "7" "3" "4"
$data
# A tibble: 100 × 2
group values
<dbl> <dbl>
1 1 6.93
2 1 4.27
3 1 4.84
4 1 6.48
5 1 4.33
6 1 -1.72
7 1 1.08
8 1 8.54
9 1 8.38
10 1 0.225
# ℹ 90 more rows
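Output of the kind shown above could come from a print method along the following lines. This is a minimal sketch, not the course's implementation: it assumes the function has been extended with an optional seed argument (as the example call in the roxygen2 block suggests), stores the arguments in an args element, and uses head() from base R to shorten the data where the output above uses a tibble.

```r
# Sketch: simulation function extended with an optional seed (assumption).
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2, seed = NULL) {
    if (!is.null(seed)) {
        set.seed(seed)  # makes the simulated data reproducible
    }
    result <- list(
        args = c(n1 = n1, n2 = n2, mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2),
        data = data.frame(
            group = c(rep(1, n1), rep(2, n2)),
            values = c(
                rnorm(n = n1, mean = mean1, sd = sd1),
                rnorm(n = n2, mean = mean2, sd = sd2)
            )
        )
    )
    structure(result, class = "SimulationResult")
}

# S3 print method: shows the arguments and only the first rows of the data.
print.SimulationResult <- function(x, ...) {
    cat("Arguments:\n")
    print(x$args, ...)
    cat("\nData (first 10 of ", nrow(x$data), " rows):\n", sep = "")
    print(head(x$data, 10), ...)
    invisible(x)  # print methods conventionally return their argument invisibly
}

x <- getSimulatedTwoArmMeans(
    n1 = 50, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4, seed = 123
)
print(x)
```

Because the method is named print.SimulationResult, R dispatches to it automatically whenever an object of that class is printed, including when the object is simply typed at the console.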
Photo CC0 by Pixabay on pexels.com
Add assertions to improve the usability and user experience
Tip on assertions
Use the package checkmate to validate input arguments.
Example:
Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.
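A function that raises an assertion error of this kind might look as follows. The sketch assumes that the checkmate package is installed; the body of playWithAssertions is a guess, since only its name and error message appear in the material.

```r
library(checkmate)

# Hypothetical body: only the function name and the error message are known.
playWithAssertions <- function(n1) {
    # n1 must be a single integer-like number of at least 1
    assertInt(n1, lower = 1)
    invisible(n1)
}

playWithAssertions(5)  # passes silently

# A negative value fails the assertion; tryCatch keeps the script running
# and captures the error message instead of stopping.
msg <- tryCatch(
    playWithAssertions(-1),
    error = function(e) conditionMessage(e)
)
print(msg)
```

Checking inputs at the top of a function means that users get a message naming the offending argument, rather than an obscure failure deeper in the code.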
Add three additional results:
Tip on creation time
Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()
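The three calls return the current date or time in different forms. As a small sketch, the creation time could be stored in the result list; the element name creationTime is an assumption.

```r
# Three ways to obtain the current date or time
now <- Sys.time()                         # POSIXct date-time
today <- Sys.Date()                       # Date without time of day
label <- format(Sys.time(), "%B %d, %Y")  # character: month name, day, year

# Hypothetical use: attach the creation time to a result list
result <- list(data = data.frame(values = rnorm(3)))
result$creationTime <- now
str(result$creationTime)
```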
Add an additional result: the t.test result
Add an optional alternative argument and pass it through to t.test:
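One possible shape for this step is sketched below. It assumes a data frame with the columns group and values, as in the simulated data; the helper name addTTest is hypothetical. The alternative argument is validated with match.arg and handed on to t.test unchanged.

```r
# Hypothetical helper: run a two-sample t-test on simulated data.
addTTest <- function(data, alternative = c("two.sided", "less", "greater")) {
    alternative <- match.arg(alternative)  # defaults to "two.sided"
    t.test(
        x = data$values[data$group == 1],
        y = data$values[data$group == 2],
        alternative = alternative
    )
}

set.seed(123)  # arbitrary seed for a reproducible illustration
simData <- data.frame(
    group = c(rep(1, 50), rep(2, 50)),
    values = c(rnorm(50, mean = 5, sd = 3), rnorm(50, mean = 7, sd = 4))
)
tTestResult <- addTTest(simData, alternative = "less")
tTestResult$alternative
```

Using match.arg keeps the set of allowed values in the function signature, so an invalid value is rejected before t.test is called.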
Implement the generic functions print and plot.
Tip on print
Use the plot example function from above and extend it.
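The plot example referred to here is not reproduced in this section. As a stand-in, a plot method for the SimulationResult class might be sketched like this, using only base graphics; the choice of a boxplot is an assumption.

```r
# Sketch of an S3 plot method for objects of class "SimulationResult".
plot.SimulationResult <- function(x, ...) {
    boxplot(
        values ~ group,
        data = x$data,
        xlab = "Group",
        ylab = "Simulated values",
        ...
    )
    invisible(x)
}

# Minimal object with the structure used in this section
x <- structure(
    list(data = data.frame(
        group = rep(1:2, each = 20),
        values = c(rnorm(20, mean = 5, sd = 3), rnorm(20, mean = 7, sd = 4))
    )),
    class = "SimulationResult"
)
plot(x, main = "Simulated two-arm data")
```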
Optional extra tasks:
Implement the generic functions summary and cat
Implement the function kable, known from the package knitr, as a generic. Tip: use
to define kable as generic
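A common S3 pattern for turning an existing function into a generic is sketched below; it is an assumption that the tip refers to something similar. The default method delegates to knitr::kable, so the knitr package must be installed.

```r
# Define kable as an S3 generic that dispatches on the class of x.
kable <- function(x, ...) UseMethod("kable")

# Default method: fall back to the original knitr implementation.
kable.default <- function(x, ...) knitr::kable(x, ...)

# Method for SimulationResult objects: render the simulated data as a table.
kable.SimulationResult <- function(x, ...) knitr::kable(x$data, ...)

# Minimal object for illustration
x <- structure(
    list(data = data.frame(group = c(1, 2), values = c(5.1, 7.3))),
    class = "SimulationResult"
)
kable(x)
```

With the default method in place, existing calls such as kable(head(iris)) keep working, while SimulationResult objects get their own table layout.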
Optional extra task:
Document your functions with Roxygen2