My workflow for plots in research papers
When writing papers, managing evaluation data and plots has been consistently troublesome. While getting to the first draft of a plot is often quick, the challenge arises during the inevitable subsequent revisions. Here’s the workflow I’ve been using to speed up my iteration time.
Issues & Inconveniences
My previous bespoke Excel-based workflow had two categories of issues and inconveniences:
Manual updates:
- Loading data was manual (even from CSVs).
- Every plot had to be created from scratch.
- Plots had to be manually exported one by one on each change.
Reproducibility issues:
- Spreadsheets had to be versioned and tracked separately from code*.
- I often forgot which evaluation run was loaded in a given sheet, especially when returning to a project after a few weeks or months.
- There was no record of label/name updates.
- Excel subtly changes chart sizes when you save/load a sheet.
Most of these issues are different forms of excess user-managed state.
*While spreadsheets can be version-controlled like any other file, no merge algorithm will resolve conflicts, forcing this task on the user.
My workflow
TLDR: Code → data.json → duckdb → ggplot2 → plots/<name>.pdf → LaTeX
(where duckdb and ggplot2 are one-liners in an R Jupyter Notebook cell)
Evaluation Code: Each benchmark writes its metrics as individual lines of JSON (NDJSON), which can be redirected to a file. For example:
import time
import json

def fibonacci_recursive(n):
    if n <= 1:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)

def fibonacci_iterative(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

n = 12

start_time = time.time()
_ = fibonacci_recursive(n)
rec_time = time.time() - start_time
print(json.dumps({"fn": "fib-recursive", "time": rec_time}))

start_time = time.time()
_ = fibonacci_iterative(n)
iter_time = time.time() - start_time
print(json.dumps({"fn": "fib-iterative", "time": iter_time}))
$ python my_benchmark.py >> data.json
$ cat data.json
{"fn": "fib-recursive", "time": 3.695487976074219e-05}
{"fn": "fib-iterative", "time": 3.337860107421875e-06}
If there are multiple benchmarks to run, I use a script that clears data.json and then runs each benchmark, appending its results to the file.
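Such a runner can be sketched in a few lines of Python (the script names in BENCHMARKS are placeholders for your own benchmark files):

```python
# Sketch of a runner that resets data.json and re-runs every benchmark,
# appending each script's NDJSON stdout to the shared results file.
import subprocess
import sys
from pathlib import Path

BENCHMARKS = ["my_benchmark.py"]  # placeholder: list your benchmark scripts here

def run_all(out_path="data.json"):
    out = Path(out_path)
    out.write_text("")  # clear previous results
    for script in BENCHMARKS:
        result = subprocess.run(
            [sys.executable, script], capture_output=True, text=True, check=True
        )
        out.open("a").write(result.stdout)
```

A shell one-liner (`> data.json` once, then `>> data.json` per benchmark) works just as well; the Python version only adds the convenience of failing loudly via `check=True` if any benchmark crashes.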
Loading, Preprocessing, and Plotting Data: I then load and preprocess the JSON on disk with DuckDB. I do this in a Jupyter Notebook running an R kernel; while I prefer Python, ggplot2 is better than anything Python has.
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL json; LOAD json;")

q <- "
SELECT
    CASE
        WHEN json->>'fn' = 'fib-recursive' THEN 'Recursive'
        WHEN json->>'fn' = 'fib-iterative' THEN 'Iterative'
        ELSE json->>'fn'
    END AS fn,
    CAST(json->>'time' AS DOUBLE) * 1000000 AS time
FROM read_ndjson_objects('data.json')
"
df <- dbGetQuery(con, q)
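For sanity-checking the preprocessing outside the notebook, the query above has a rough stdlib-only Python equivalent (the label mapping and seconds-to-microseconds conversion mirror the SQL; no DuckDB required):

```python
import json

# Display-name mapping, mirroring the CASE expression in the SQL query.
LABELS = {"fib-recursive": "Recursive", "fib-iterative": "Iterative"}

def load_results(path="data.json"):
    """Read NDJSON benchmark results, renaming functions and
    converting times from seconds to microseconds."""
    rows = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            rec = json.loads(line)
            rows.append({
                "fn": LABELS.get(rec["fn"], rec["fn"]),
                "time": float(rec["time"]) * 1_000_000,
            })
    return rows
```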
Now, the dataframe can be plotted directly and saved as a PDF at the same time:
library(ggplot2)

p <- ggplot(df, aes(x = fn, y = time)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    labs(title = "Execution Time of Fibonacci Functions",
         x = "Function Type",
         y = "Time (microseconds)")
ggsave("plots/fibonacci.pdf", p, device = "pdf", width = 4.5, height = 3)
p
Each SQL/plot pair gets its own cell. And, for convenience, the notebook starts by clearing the plots directory:
library(ggplot2)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL json; LOAD json;")

if (dir.exists("plots")) {
    unlink("plots", recursive = TRUE)
}
dir.create("plots")
This plots directory can then be copied directly into the LaTeX project/Overleaf.
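In the LaTeX source, each exported PDF is then included as a standard figure; a minimal example (the caption and label are illustrative, not from this workflow):

```latex
\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{plots/fibonacci.pdf}
  \caption{Execution time of the two Fibonacci implementations.}
  \label{fig:fibonacci}
\end{figure}
```

Because the filenames are stable across re-exports, re-copying the plots directory updates every figure without touching the LaTeX.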
Conclusion: The notebook can now be version-controlled, and the data lives in a single, human-readable file that is easy to back up. Better yet, any figure can be regenerated in about 30 seconds after a change to the data, the preprocessing, or the plot code.