Simulating Data Series

Due to ongoing research efforts, the Simulating Data Series is delayed. I apologize, and I am available via email if you have any data simulation questions.

Introduction (PDF)

I started the process of simulating data while I was learning R during graduate school. Well … okay, I wasn’t truly simulating data; I was relying on the random number generator in Excel. That was until Dr. J.C. Barnes (University of Cincinnati) suggested that I conduct a simulation analysis for my dissertation. By then, I was well versed in R and set forth learning how to conduct basic data simulations to inform my simulation analysis. For my dissertation, the simulation analysis ended up being 100,000 lines of code (with manual specifications for every variable) that took 24 hours to run on the Ohio Supercomputer. But in the end, the results provided clarity for my dissertation: the statistical technique I proposed was not as good as some existing techniques (it got pretty close, though). This was my entry into the rabbit hole of data simulations and simulation analyses.

Initially, the ability to create interrelated datasets from thin air provided a feeling of immense understanding. While it took me three whole days to specify, I was able to simulate a clustered dataset just by reading and rereading the chapter by Singer and Willett (2003) on longitudinal clustering. For the first time in my statistical career, I truly understood how data was generated! That feeling, though, was short-lived for a variety of reasons. Not all data is normally distributed; error terms are not linearly related; estimates are not nice and neat; not all relationships are statistically significant; the list goes on. I realized I actually understood less about data and statistics than I had ever imagined, which made me want to learn more. At this point, I believe the only way forward is through data simulations.

A data simulation is the systematic creation of data following a set of rules defined by the developer (Law & Kelton, 2000; Templ, 2016). Specifically, we define the distributional properties, the relationships, and the error existing within and between the measures we decide to include in the data. We also define the causal structure of the data, as well as any clusters, irregularities, and abnormal cases that exist within it. Consequently, we know exactly what had to occur for the data to be generated, because we defined all of the rules used to create it. This provides us with knowledge of the right answer (i.e., the true effects), permitting us to evaluate our methodological and statistical assumptions and observe the deviations between the estimated effects and the true effects.[1] Data simulations provide us with the ability to (1) test the validity of methodological and statistical techniques, (2) evaluate if and when methodological and statistical assumptions fail, (3) evaluate the bias introduced by violating those assumptions, (4) evaluate if and when bias can be introduced by known and unknown sources of error, (5) evaluate the magnitude of the bias those sources introduce, and (6) – most importantly – develop a comprehensive understanding of the data generating process.
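To make the idea of "knowing the right answer" concrete, here is a minimal sketch of my own (the sample size, true effect, and error distribution are arbitrary choices, not specifications from the series): we define a true effect, simulate data under that rule, and observe how far the estimate deviates from the truth.

```r
# Define the rule, generate the data, then check the estimate against the truth.
set.seed(1234)                   # make the simulation reproducible
n <- 1000                        # sample size (arbitrary)
b <- 0.50                        # the TRUE effect of X on Y, defined by us
X <- rnorm(n)                    # simulated independent variable
E <- rnorm(n)                    # residual error
Y <- b * X + E                   # the rule that generates Y

# Because we defined b ourselves, the deviation of the estimate from 0.50
# tells us how well the model recovers the true effect.
fit <- lm(Y ~ X)
coef(fit)["X"]
```

With a sample this size, the estimated slope should land near the true value of 0.50; shrinking n or inflating the error makes the deviation visibly larger.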

In this series (for both you and me), I hope to accomplish four goals.

  1. First and foremost, I want to explore the bounds of specifying a data simulation. As of drafting this, all of the data simulations I have conducted or am currently conducting either test the validity of methodologies or are designed to generate bias in statistical models. This, however, requires only a limited number of rules to be set and rarely allows the exploration of more complex data specifications. Here, we will explore how complex data simulations can become, which feeds the second goal of the project.
  2. Second, I want to develop a comprehensive knowledge of the data generating process. Commonly, we survey participants or collect raw data, explore the raw data file, clean the data, and analyze the data, while rarely thinking about what processes had to exist for the raw data to be created, or about the differences between the raw data and the analyzed data. Here, we will explore how to create data that looks and functions like real data to better understand how data is generated in reality.
  3. Third, I want to develop an open-access resource where scholars can obtain descriptions and the R code designed to conduct complex data simulations.
  4. Finally, I want to provide a resource to students and scholars to help demonstrate how data simulations can ease the process of learning statistics by requiring us to specify and visualize the data, as well as permitting us to observe when methods or statistical techniques are not appropriate for the data we created.

With the goals of the series outlined, I want to provide a brief overview of how this series will proceed. We will specify our raw data simulations using only base R (no add-on packages), but we will likely encounter a variety of packages used to clean the data and estimate post-simulation models. I will always inform you of the packages required at that stage of the data simulation. By relying only on base R, we will primarily specify every assumption within the data using scalar bivariate or multivariate regression formulas.[2] For example, the formula and the code below – which are identical – can be used to generate a linear relationship between X and Y, where a 1-point change in X corresponds to a b-point change in Y (the intercept is specified differently than by simply including an a term). The complexity of these formulas, however, will increase exponentially.

Y = bX

Y <- b * X
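On its own, that line will not run until b and X are defined, so here is a minimal runnable sketch (the sample size, slope value, and distribution of X are my own arbitrary choices for illustration):

```r
# Minimal runnable version of the formula above.
set.seed(42)
n <- 500                 # number of simulated cases (arbitrary)
b <- 2                   # slope: a 1-point change in X yields a 2-point change in Y
X <- rnorm(n)            # X drawn from a standard normal distribution (arbitrary choice)
Y <- b * X               # the formula exactly as written: deterministic, no error term yet
head(cbind(X, Y))        # inspect the first simulated cases
```

Note that this version is deterministic: every case falls exactly on the line, because no error term has been specified yet.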

Moreover, as indicated above, we determine the function of each variable within a dataset. I will continuously remind you of the assumed function of our modifiers and variables within a dataset, but the table below summarizes some of the labels that will be used during our data simulations.

Labels
  N    Population Size
  n    Sample Size
  ID   Case ID
  CID  Cluster ID

Modifiers
  b    Slope Coefficient
  c    Covariance
  ec   Residual Covariance
  e    Residual Error
  d    Disturbance
  cl   Cluster

Variables
  X    Presumed Independent Variable
  Y    Presumed Dependent Variable
  Z    Unobserved Independent Variable
  C    Observed Collider or Confounder
  L    Unobserved Latent Variable

Error Terms
  COV  Error Causing Covariation Between Variables
  E    Distribution of Residual Error in Variable
  D    Distribution of Disturbance in Latent Variable
  CL   Distribution of Clustering Error in Variable

Now that we have briefly outlined our process, let’s get to the topics that will be covered in this series (at the current moment). Additional topics will be added over the course of the series.[3]

  1. Data Simulation Basics
  2. Advanced Relationships
    • Covarying constructs
    • Confounders and Colliders
    • Dependents
    • Structural Associations (Mediators and Moderators)
    • Latent Constructs (1st order and 2nd order)
  3. Constructing A Dataset
    • Identifying Data Assumptions
    • Cross-Sectional Data (Simple)
    • Cross-Sectional Data (Adding Variables)
    • Clustered Data (Multi-Level Data)
    • Within-Individual Longitudinal Data
    • Between- and Within-Individual Longitudinal Data
    • Panel Data (With and Without Censorship)
    • Multi-level Structural Data
  4. Advanced Dataset Using Assumptions Related to Reality
  5. Conducting a Simulation Analysis

As the final piece of this introduction, please reach out if you have any questions about, or interest in, the Simulating Data Series. I do this primarily for myself, but I am always working to develop strategies to ensure that the series is useful to interested readers. The first entry will follow shortly!

Resources

Law, A. M., & Kelton, W. D. (2000). Simulation modeling and analysis (3rd ed.). New York: McGraw-Hill.

Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford, UK: Oxford University Press.

Templ, M. (2016). Simulation for data science with R. Birmingham, UK: Packt Publishing Ltd.


[1] Although that ability may appear trivial, the number of assumptions we make during an average statistical analysis is astonishing, even for ordinary least squares regression, and each of those assumptions can bias our statistical estimates.

[2] Matrix specifications will be employed in circumstances where Y is predicted by too many constructs to specify scalar formulas practically.

[3] This list is subject to change over the course of the series.

License: Creative Commons Attribution 4.0 International (CC BY 4.0)