{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(GLM-simpsons-paradox)=\n", "# Simpson's paradox\n", "\n", ":::{post} September, 2024\n", ":tags: regression, hierarchical model, linear model, posterior predictive, causal inference, Simpson's paradox \n", ":category: beginner\n", ":author: Benjamin T. Vincent\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) describes a situation where there might be a negative relationship between two variables within a group, but when data from multiple groups are combined, that relationship may disappear or even reverse sign. The gif below (from the Simpson's Paradox [Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox) page) demonstrates this very nicely.\n", "\n", "\n", "\n", "Another way of describing this is that we wish to estimate the causal relationship $x \\rightarrow y$. The seemingly obvious approach of modelling `y ~ 1 + x` will lead us to conclude (in the situation above) that increasing $x$ causes $y$ to decrease (see Model 1 below). However, the relationship between $x$ and $y$ is confounded by a group membership variable $group$. This group membership variable is not included in the model, and so the relationship between $x$ and $y$ is biased. If we now factor in the influence of $group$, in some situations (e.g. the image above) this can lead us to completely reverse the sign of our estimate of $x \\rightarrow y$, now estimating that increasing $x$ causes $y$ to _increase_. \n", "\n", "In short, this 'paradox' (or simply ommitted variable bias) can be resolved by assuming a causal DAG which includes how the main predictor variable _and_ group membership (the confounding variable) influence the outcome variable. We demonstrate an example where we _don't_ incorporate group membership (so our causal DAG is wrong, or in other words our model is misspecified; Model 1). We then show 2 ways to resolve this by including group membership as causal influence upon $x$ and $y$. This is shown in an unpooled model (Model 2) and a hierarchical model (Model 3)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "import graphviz as gr\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import seaborn as sns\n", "import xarray as xr" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "az.style.use(\"arviz-darkgrid\")\n", "figsize = [12, 4]\n", "plt.rcParams[\"figure.figsize\"] = figsize\n", "rng = np.random.default_rng(1234)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate data\n", "\n", "This data generation was influenced by this [stackexchange](https://stats.stackexchange.com/questions/479201/understanding-simpsons-paradox-with-random-effects) question. It will comprise observations from $G=5$ groups. The data is constructed such that there is a negative relationship between $x$ and $y$ within each group, but when all groups are combined, the relationship is positive." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def generate():\n", " group_list = [\"one\", \"two\", \"three\", \"four\", \"five\"]\n", " trials_per_group = 20\n", " group_intercepts = rng.normal(0, 1, len(group_list))\n", " group_slopes = np.ones(len(group_list)) * -0.5\n", " group_mx = group_intercepts * 2\n", " group = np.repeat(group_list, trials_per_group)\n", " subject = np.concatenate(\n", " [np.ones(trials_per_group) * i for i in np.arange(len(group_list))]\n", " ).astype(int)\n", " intercept = np.repeat(group_intercepts, trials_per_group)\n", " slope = np.repeat(group_slopes, trials_per_group)\n", " mx = np.repeat(group_mx, trials_per_group)\n", " x = rng.normal(mx, 1)\n", " y = rng.normal(intercept + (x - mx) * slope, 1)\n", " data = pd.DataFrame({\"group\": group, \"group_idx\": subject, \"x\": x, \"y\": y})\n", " return data, group_list\n", "\n", "\n", "data, group_list = generate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To follow along, it is useful to clearly understand the form of the data. This is [long form](https://en.wikipedia.org/wiki/Wide_and_narrow_data) data (also known as narrow data) in that each row represents one observation. We have a `group` column which has the group label, and an accompanying numerical `group_idx` column. This is very useful when it comes to modelling as we can use it as an index to look up group-level parameter estimates. Finally, we have our core observations of the predictor variable `x` and the outcome `y`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | group | \n", "group_idx | \n", "x | \n", "y | \n", "
---|---|---|---|---|
0 | \n", "one | \n", "0 | \n", "-0.294574 | \n", "-2.338519 | \n", "
1 | \n", "one | \n", "0 | \n", "-4.686497 | \n", "-1.448057 | \n", "
2 | \n", "one | \n", "0 | \n", "-2.262201 | \n", "-1.393728 | \n", "
3 | \n", "one | \n", "0 | \n", "-4.873809 | \n", "-0.265403 | \n", "
4 | \n", "one | \n", "0 | \n", "-2.863929 | \n", "-0.774251 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
95 | \n", "five | \n", "4 | \n", "3.981413 | \n", "0.467970 | \n", "
96 | \n", "five | \n", "4 | \n", "1.889102 | \n", "0.553290 | \n", "
97 | \n", "five | \n", "4 | \n", "2.561267 | \n", "2.590966 | \n", "
98 | \n", "five | \n", "4 | \n", "0.147378 | \n", "2.050944 | \n", "
99 | \n", "five | \n", "4 | \n", "2.738073 | \n", "0.517918 | \n", "
100 rows × 4 columns
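To make the indexing point concrete, here is a minimal sketch (not part of the original analysis) of how the integer `group_idx` column looks up one group-level parameter per observation. The `demo_group_intercepts` array is a made-up stand-in for the kind of group-level parameters that Models 2 and 3 will estimate:

```python
# Hypothetical group-level intercepts, one per group (illustration only)
demo_group_intercepts = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

# Indexing a length-5 array with the 100 integer labels in `group_idx`
# broadcasts each group's parameter out to one value per observation
obs_intercept = demo_group_intercepts[data["group_idx"].to_numpy()]
assert obs_intercept.shape == (100,)
```

This `parameter[group_idx]` lookup is the same pattern the unpooled and hierarchical models below rely on to give each observation the intercept and slope of its own group.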
\n", "