You are a Bayesian statistician and Stan expert. Your working directory inside the container is /work. Read /work/CONTEXT.md before doing anything else — it contains all domain context, data structure details, and file layout.
Your task is to investigate the packet capture data, build an initial Stan model for outbound connections per second (CPS), run it, and document your findings clearly.
Step 1: Set up directories
mkdir -p /work/models /work/scripts /work/diagnostics /work/plots
Step 2: Explore the data
Write and run /work/scripts/explore.R. This script must:
- Load all
.Rdsfiles from/data/and combine them with amachine_idcolumn (filename without extension) as described in CONTEXT.md. Cast columns to appropriate types (timestamp to numeric ms, tcp_completeness to integer). - Infer the local workstation IP per machine as described in CONTEXT.md.
- Print per machine: machine_id, number of rows, time span (seconds), local IP.
- Compute per-second time series per machine (see CONTEXT.md for event
definitions), re-indexing seconds within each machine:
tcp_new: new outbound TCP connections per secondudp_new: new outbound UDP flows per secondtcp_end: TCP terminations per second (terminal completeness 15/31/63)udp_end: UDP flow endings per second (last packet of each 4-tuple)
- Print summary statistics for all four series per machine and pooled: n_seconds, mean, sd, median, p90, p99, max, proportion zero-seconds, variance/mean ratio. Note any meaningful differences between machines.
- Save to
/work/diagnostics/cps_data.rdswith columns:machine_id(character),machine(integer index),second(integer, within-machine),minute(integer, within-machine),tcp_new,udp_new,tcp_end,udp_end(all integer).
Run it: Rscript /work/scripts/explore.R
Step 3: Write NOTES.md — Data Exploration section
Create /work/NOTES.md with the following structure (fill in real values from
the exploration output):
# TCP/UDP CPS Modelling — Investigation Notes
---
## Bootstrap
### Data Exploration
- Machines: [list of machine_ids]
- Per machine: capture duration (seconds / minutes), total packets, local IP
- TCP CPS per machine and pooled: mean, sd, median, p99, max, zeros%,
variance/mean (overdispersed: yes/no)
- UDP CPS per machine and pooled: same
- Between-machine differences: note any meaningful rate or overdispersion
differences that suggest machines should be modelled with separate
parameters rather than fully pooled
### Model Design Decisions
[Your reasoning here: choice of likelihood, whether to model TCP and UDP
jointly or separately, prior choices and their rationale, any concerns about
zero-inflation, temporal structure, etc.]
Step 4: Write the Stan model
Write /work/models/m1.stan. Requirements:
- Model TCP CPS and UDP CPS as independent Negative Binomial outcomes
(
neg_binomial_2), one set of parameters per protocol. Start with a fully pooled model (shared mu and phi across machines) as a baseline — the multi-machine structure can be exploited in later iterations. - The Stan data block must include
K(number of machines) andarray[T] int machine(integer index 1..K for each second) so that later iterations can add machine-level parameters without changing the data pipeline. - Parameters:
mu_tcp,mu_udp(mean rates),phi_tcp,phi_udp(overdispersion) - Priors: weakly informative, set relative to the observed means from exploration. Include a comment on each prior explaining the rationale.
- Generated quantities block: simulate posterior predictive draws
tcp_new_rep[t]andudp_new_rep[t]for all T seconds
Begin the file with a comment block:
// Model: m1 — Pooled Negative Binomial for TCP and UDP CPS (multi-machine data)
// Structure: ...
// Priors: ...
// Notes: ...
Step 5: Write the fitting script
Write /work/scripts/fit.R. This script must:
- Load
/work/diagnostics/cps_data.rds - Build the Stan data list (include
T,K,machine,tcp_new,udp_new,tcp_end,udp_endas integer arrays) - Compile and sample from
/work/models/m1.stanusing cmdstanr: 4 chains, 1000 warmup, 1000 sampling, seed=42, show_messages=FALSE - Save the fit object to
/work/diagnostics/fit.rds - Extract and print diagnostics; save to
/work/diagnostics/summary.txt:- Number of divergent transitions
- E-BFMI per chain
- For all parameters: Rhat, bulk ESS, tail ESS (use fit$summary())
- Flag any Rhat > 1.01 or ESS < 400
- Generate and save
/work/plots/ppc_bootstrap.png:- Two-panel density plot (TCP left, UDP right)
- Each panel: observed CPS density (thick dark line) overlaid with 50 semi-transparent posterior predictive density curves (one per posterior draw, sampled from the rep arrays)
- Use ggplot2; label axes clearly; title “m1: Bootstrap PPC”
- Write
"CONTINUE"to/work/diagnostics/status.txt
Step 6: Run the fit
Rscript /work/scripts/fit.R
If there are R or Stan errors, read the error message, fix the script or model, and re-run. Do not proceed until the fit completes successfully.
Step 7: Update NOTES.md — Bootstrap Model section
Append to /work/NOTES.md:
### Model: m1.stan — Independent Negative Binomial
**Structure**: [one sentence description]
**Priors**: [list parameters and their priors with brief rationale]
### Diagnostics
- Divergences: X
- E-BFMI: [list per chain]
- Rhat: max = X (which parameter)
- Bulk ESS: min = X (which parameter)
- Tail ESS: min = X (which parameter)
- Issues: [none / describe any problems]
### PPC Assessment (see plots/ppc_bootstrap.png)
[2–4 sentences: Does the model capture the shape of the observed CPS
distribution? Are tails covered? Is there zero-inflation the model misses?
What does this suggest about the model's suitability?]
### Next Steps
[1–3 bullet points: what to address in iteration 1]