
Python and Pandas and P-values, Oh My! A Statistical Journey

Published at 01:15 AM · 26 min read


Introduction

In this blog post, we will cover the basics of statistics and how to apply them in Python through a practical lens. Specifically, we’ll work through a series of practice problems based on a realistic scenario involving machine learning experiments. We’ll use Python and various statistical techniques to analyze the results and draw meaningful conclusions.

Scenario: Comparing a Novel Machine Learning Method to a Baseline

Imagine you’re a research scientist at a leading AI company, and you’ve developed a new machine learning method that you believe outperforms the current baseline model. To test your hypothesis, you’ve run a series of experiments comparing your novel method to the baseline across different datasets and metrics.

Our task is to analyze the results of these experiments using various statistical techniques and draw conclusions about the performance of your novel method compared to the baseline.

Experiment Results

You have the following data from your experiments: accuracy and resource metrics for your novel methods, the baseline, and a published competitor, collected across 50 experimental runs spanning several dataset types:

Mock Experiment Data Generation
import numpy as np
import pandas as pd


def generate_realistic_ml_data(n_experiments=50, random_seed=42, with_issues=True):
    """
    Generate synthetic ML experiment data in long format, with multiple methods
    and experimental runs.
    """
    np.random.seed(random_seed)

    # Define methods and their characteristics
    methods = {
        "Baseline": {"param_efficiency": 1.0, "variance": 0.02},
        "Novel_A": {  # Our primary novel method
            "param_efficiency": 1.15,  # 15% more parameter efficient
            "variance": 0.02,
        },
        "Novel_B": {  # An alternative approach
            "param_efficiency": 1.10,  # 10% more parameter efficient
            "variance": 0.015,
        },
        "CompetitorX": {  # Published SOTA method
            "param_efficiency": 1.12,
            "variance": 0.02,
        },
    }

    # Generate base experiments
    experiments = []

    for exp_id in range(n_experiments):
        # Base parameters for this experiment
        log_params = np.random.uniform(
            low=np.log10(1e7),  # 10M parameters
            high=np.log10(1e10),  # 10B parameters
        )
        base_params = 10**log_params

        log_tokens = np.random.uniform(
            low=np.log10(1e9),  # 1B tokens
            high=np.log10(1e12),  # 1T tokens
        )
        base_tokens = 10**log_tokens

        dataset_type = np.random.choice(
            ["tabular", "image", "text", "time-series"], p=[0.2, 0.3, 0.4, 0.1]
        )

        # Calculate base theoretical accuracy
        base_loss = 2.0
        param_factor = 0.1 * np.log10(base_params / 1e7)
        token_factor = 0.1 * np.log10(base_tokens / 1e9)
        theoretical_accuracy = 0.85 - (base_loss - param_factor - token_factor) * 0.1

        # Generate results for each method
        for method_name, method_props in methods.items():
            # Some methods might fail occasionally
            if np.random.random() < 0.05:  # 5% failure rate
                continue

            # Add method-specific variation
            method_accuracy = theoretical_accuracy * method_props[
                "param_efficiency"
            ] + np.random.normal(0, method_props["variance"])
            method_accuracy = np.clip(method_accuracy, 0, 1)

            # Vary parameters and tokens slightly between methods
            method_params = base_params * np.random.uniform(0.95, 1.05)
            method_tokens = base_tokens * np.random.uniform(0.95, 1.05)

            experiment = {
                "Experiment_ID": exp_id + 1,
                "Method": method_name,
                "Dataset_Type": dataset_type,
                "Model_Parameters": int(method_params),
                "Training_Tokens": int(method_tokens),
                "Accuracy": method_accuracy,
                "GPU_Memory_GB": np.clip(method_params / 1e8, 8, 80)
                + np.random.normal(0, 2),
                "Compute_Cost_USD": (method_params * method_tokens) / 1e12 * 0.1
                + np.random.lognormal(0, 0.5),
            }
            experiments.append(experiment)

    df_raw = pd.DataFrame(experiments)

    if with_issues:
        # Add some realistic issues
        missing_indices = np.random.choice(len(df_raw), 10, replace=False)
        df_raw.loc[missing_indices, "Training_Tokens"] = np.nan  # type: ignore

        # Add some corrupted values
        corrupted_indices = np.random.choice(len(df_raw), 5, replace=False)
        df_raw.loc[corrupted_indices, "GPU_Memory_GB"] *= -1  # type: ignore

    return df_raw

df = generate_realistic_ml_data(n_experiments=50, random_seed=42, with_issues=True)
df.head()
>>>
   Experiment_ID       Method Dataset_Type  Model_Parameters  Training_Tokens  Accuracy  GPU_Memory_GB  Compute_Cost_USD
0              1     Baseline         text         127054830     7.374991e+11  0.685075       7.531726      9.370285e+06
1              1      Novel_B         text         137348322     6.909820e+11  0.770245       7.073165      9.490523e+06
2              1  CompetitorX         text         130154015     7.194053e+11  0.720278       9.900739      9.363350e+06
3              2     Baseline         text         125806559     2.356161e+10  0.659825       8.244438      2.964213e+05
4              2      Novel_A         text         127942212     2.320614e+10  0.763861       7.416613      2.969055e+05

Data Quality Checks

Before diving into analysis, let’s perform some basic data quality checks:

import numpy as np
import pandas as pd


def check_data_quality(df: pd.DataFrame):
    print("1. Basic Information:")
    print("-" * 80)
    print(df.info())

    print("2. Missing or NaN Values:")
    print("-" * 80)
    print(df.isnull().sum())

    print("3. Basic Statistics:")
    print("-" * 80)
    print(df.describe())

    print("4. Value Ranges Check:")
    print("-" * 80)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    print(df[numeric_cols].describe().loc[["min", "max"]].transpose())
check_data_quality(df)
>>>
RangeIndex: 185 entries, 0 to 184
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Unnamed: 0        185 non-null    int64
 1   Experiment_ID     185 non-null    int64
 2   Method            185 non-null    object
 3   Dataset_Type      185 non-null    object
 4   Model_Parameters  185 non-null    int64
 5   Training_Tokens   175 non-null    float64
 6   Accuracy          185 non-null    float64
 7   GPU_Memory_GB     185 non-null    float64
 8   Compute_Cost_USD  185 non-null    float64
dtypes: float64(4), int64(3), object(2)
memory usage: 13.1+ KB
None

Data Cleaning and Preprocessing

Let’s handle the common data issues we identified:

def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    df_clean = df.copy()

    # 1. Handle missing values: fill missing Training_Tokens with the median
    median_tokens = df_clean["Training_Tokens"].median()
    df_clean["Training_Tokens"] = df_clean["Training_Tokens"].fillna(median_tokens)

    # 2. Fix impossible values: accuracies above 1 and negative GPU memory
    df_clean.loc[df_clean["Accuracy"] > 1, "Accuracy"] = 1.0
    df_clean.loc[df_clean["GPU_Memory_GB"] < 0, "GPU_Memory_GB"] = np.nan
    df_clean["GPU_Memory_GB"] = df_clean["GPU_Memory_GB"].fillna(
        df_clean["GPU_Memory_GB"].median()
    )

    # 3. Handle outliers using the IQR method (replace with the median)
    def remove_outliers(series):
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        return np.where(series.between(lower_bound, upper_bound), series, series.median())

    numerical_cols = ["GPU_Memory_GB", "Compute_Cost_USD"]
    for col in numerical_cols:
        df_clean[col] = remove_outliers(df_clean[col])

    return df_clean

df = clean_dataset(df)

Descriptive Statistics and Data Visualization

First, let’s compute some basic statistical measures and plot some data visualizations using Pandas and Matplotlib.

Mean, Median, Mode

Let’s start with some definitions and calculations for our dataset.

Definitions
  • Mean ($\mu$):

    • The average value of a dataset, calculated by summing all values and dividing by the number of data points.
    • The first moment of a probability distribution, equivalent to the expected value $\mathbb{E}[X] = \sum_{i=1}^{n} x_i P(x_i)$ for discrete distributions or $\int_{-\infty}^{\infty} x f(x)\,dx$ for continuous distributions. For a finite dataset, $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$.
  • Median: The middle value when a dataset is ordered from least to greatest.

  • Mode: The most frequently occurring value in a dataset.

  • Standard Deviation: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$.

    • A measure of the amount of variation or dispersion of a set of values.
    • The square root of the variance (second central moment).
    • Equivalent to $\sqrt{\mathbb{E}[(X - \mu)^2]} = \sqrt{\mathbb{E}[X^2] - (\mathbb{E}[X])^2}$.
  • Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$ (for a finite dataset).

    • The average of the squared differences from the mean.
    • The second central moment of a probability distribution.
    • Defined as $\mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$, measuring the expected squared deviation from the mean.

Now, let’s calculate these statistics for our experimental results:

# Group by method
methods = {name: data for name, data in df.groupby("Method")}

for name, method in methods.items():
    print(name)
    print(method.describe())
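For a more explicit look at each statistic, here is a small sketch (assuming the df and methods objects from above) that computes the mean, median, mode, and sample standard deviation of accuracy for each method:

# Per-method central tendency and spread for the Accuracy column
for name, method_df in methods.items():
    acc = method_df["Accuracy"]
    print(f"{name}:")
    print(f"  Mean:   {acc.mean():.4f}")
    print(f"  Median: {acc.median():.4f}")
    # The mode of a continuous column is rarely meaningful; rounding first makes it interpretable
    print(f"  Mode:   {acc.round(2).mode().iloc[0]:.2f}")
    print(f"  Std:    {acc.std(ddof=1):.4f}  (sample std, ddof=1)")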

Sample vs. Population Statistics: Understanding the Difference

When working with data, we often need to distinguish between population parameters and sample statistics. This distinction is crucial because we rarely have access to an entire population and must make inferences from samples.

The uncorrected sample variance and standard deviation are biased estimators of the population variance and standard deviation that tend to underestimate the true value. To correct for this bias, we divide by $n-1$ instead of $n$ in the sample variance formula. This correction factor is known as Bessel’s correction or the “degrees of freedom” adjustment.

Sample variance (corrected):

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Sample standard deviation (corrected):

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

Note:

In Python, you can calculate the corrected sample standard deviation using the ddof parameter in np.std():

import numpy as np

# Population statistics
population_mean = np.mean(data)
population_var = np.var(data, ddof=0)    # ddof=0 for population
population_std = np.std(data, ddof=0)

# Sample statistics
sample_mean = np.mean(data)
sample_var = np.var(data, ddof=1)        # ddof=1 for sample
sample_std = np.std(data, ddof=1)

Standard Error of the Mean (SEM)

The standard error of the mean (SEM) tells us how precise our sample mean is as an estimate of the population mean. It’s calculated as:

$\text{SEM} = \frac{s}{\sqrt{n}}$

where $s$ is the corrected sample standard deviation and $n$ is the sample size.

This relationship shows us something crucial: our uncertainty about the true population mean decreases with (1) larger sample sizes (the $\sqrt{n}$ term) and (2) less variable data (the $s$ term).

Thus, a smaller SEM indicates that the sample mean is a more precise estimate of the true population mean.

In Python, you can calculate the SEM using the formula above:

import numpy as np

# Calculate the standard error of the mean (SEM)
sample_std = np.std(data, ddof=1)  # ddof=1 for sample standard deviation
sem = sample_std / np.sqrt(len(data))

The SEM is crucial for constructing confidence intervals and performing hypothesis tests.

When we’re dealing with small sample sizes ($n < 30$), we often use the t-distribution instead of the normal distribution for inference. This is because the t-distribution accounts for the additional uncertainty when estimating the population standard deviation from a small sample.
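To see why this matters, here is a small comparison of the two-tailed 95% critical values of the t-distribution and the standard normal for a few sample sizes; the t critical value shrinks toward 1.96 as $n$ grows:

from scipy import stats

# Two-tailed 95% critical values: t-distribution vs. standard normal
z_crit = stats.norm.ppf(0.975)
for n in [5, 10, 30, 100]:
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"n={n:>3}: t critical = {t_crit:.3f}   z critical = {z_crit:.3f}")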

Error Bars in Visualization

Error bars can represent different measures of uncertainty:

  • Standard deviation (SD): the spread of the individual data points.
  • Standard error of the mean (SEM): the precision of the sample mean as an estimate of the population mean.
  • Confidence interval (CI): a range that, under repeated sampling, would contain the true parameter a specified fraction of the time (e.g., 95%).

The choice depends on your message: use SD to show variability in the data, SEM to show how precisely the mean is estimated, and a CI when you want readers to reason about the plausible range of the true value. A quick sketch of all three follows below.
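As a minimal sketch (assuming the methods dictionary built earlier), here is how the three choices can be drawn with Matplotlib's errorbar:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

names = list(methods.keys())
accs = [m["Accuracy"] for m in methods.values()]
means = np.array([a.mean() for a in accs])
sds = np.array([a.std(ddof=1) for a in accs])
sems = np.array([stats.sem(a) for a in accs])
# 95% CI half-widths using the t-distribution
ci95 = np.array([stats.t.ppf(0.975, len(a) - 1) * stats.sem(a) for a in accs])

x = np.arange(len(names))
fig, ax = plt.subplots(figsize=(8, 5))
for offset, err, label in [(-0.2, sds, "SD"), (0.0, sems, "SEM"), (0.2, ci95, "95% CI")]:
    ax.errorbar(x + offset, means, yerr=err, fmt="o", capsize=4, label=label)
ax.set_xticks(x)
ax.set_xticklabels(names)
ax.set_ylabel("Accuracy")
ax.legend()
plt.show()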

Visualizing the Data

Now that we have our basic statistics, let’s create some visualizations to better understand our data.

import matplotlib.pyplot as plt

methods = {name: data for name, data in df.groupby("Method")}

# Matplotlib boxplot
plt.figure(figsize=(10, 6))
plt.boxplot(
    [data["Accuracy"] for data in methods.values()],
    tick_labels=list(methods.keys()),
)
plt.ylabel("Accuracy")
plt.xlabel("Method")
plt.title("Model Accuracy by Method")
plt.show()
Boxplot of Model Accuracy by Method

The boxplot above shows the distribution of model accuracy for each method. It provides a visual summary of the central tendency, spread, and skewness of the data. The line inside the box represents the median, while the box itself spans the interquartile range (IQR). The whiskers extend to the most extreme data points within 1.5 times the IQR from the box. Points outside this range are plotted as outliers. The plot suggests real differences between the methods, but a visual impression alone does not establish statistical significance; we test that formally next.

In our next section on Hypothesis Testing, we’ll use these concepts to determine if the difference between our novel method and the baseline is statistically significant.

Hypothesis Testing

Next, let’s focus on formulating hypotheses and performing a paired t-test to compare the performance of two methods.

Formulating the Hypotheses

In our scenario, we want to test if our novel method significantly outperforms the baseline. We can formulate our hypotheses as follows:

  • Null hypothesis ($H_0$): there is no difference in mean accuracy between the novel method and the baseline.
  • Alternative hypothesis ($H_1$): the novel method’s mean accuracy differs from (or, for a one-sided test, exceeds) the baseline’s.

Student’s t-test

Student’s t-test is a statistical test used to determine whether the difference between the responses of two groups is statistically significant.

Assumptions:

  • The observations are independent of one another.
  • The data in each group are approximately normally distributed (or the sample is large enough for this to matter little).
  • For the standard (pooled) two-sample test, the two groups have roughly equal variances; otherwise, Welch’s t-test is preferred.

t-statistic: The t-statistic measures the difference between the means of two paired samples relative to the variability within the samples. In other words, it is the ratio of the difference between the means to the standard error of that difference.

The formula for the t-statistic is given by:

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

where, for a paired test, $\bar{x}$ is the mean of the per-pair differences, $\mu$ is the hypothesized difference (usually 0), $s$ is the sample standard deviation, and $n$ is the sample size.

It is very similar to the z-score, but the t-statistic is used when the sample size is small or the population standard deviation is unknown (a t-test converges to a z-test as the dataset size increases).

Quick Decision Tree

Is your data paired/matched?
├─ Yes → Use Paired t-test
└─ No → Are you comparing to a known value?
    ├─ Yes → Use One-sample t-test
    └─ No → Use Independent (unpaired) two-sample t-test

Test Types & Their Use Cases

# One-sample t-test
stats.ttest_1samp(data, popmean=expected_value)

When to use:

  • Comparing a sample mean to a known or hypothesized population value
  • Testing if a sample differs from a known standard

Example scenarios:

  • Testing if battery life differs from manufacturer’s claim of 24 hours
  • Checking if product weights match specified target
  • Comparing student test scores against national average
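The decision tree above also mentions the paired and independent variants. Here is a minimal sketch of how those are invoked in SciPy (`before`, `after`, `group1`, and `group2` are placeholder arrays, not variables from this post):

# Paired t-test: matched observations (e.g., the same experiment scored by two methods)
stats.ttest_rel(before, after)

# Independent two-sample t-test: two unrelated groups
stats.ttest_ind(group1, group2)                    # Student's (assumes equal variances)
stats.ttest_ind(group1, group2, equal_var=False)   # Welch's (unequal variances)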

Performing the t-Test

Our experiments are matched by Experiment_ID, which would ordinarily call for a paired t-test (SciPy’s stats.ttest_rel()). Because a few runs failed, however, the two methods don’t have results for exactly the same experiments, so below we perform an independent two-sample t-test instead, first manually and then verifying against SciPy’s stats.ttest_ind().

import numpy as np
from scipy import stats

group1 = methods['Novel_A']['Accuracy']
group2 = methods['Baseline']['Accuracy']
n1, n2 = len(group1), len(group2)
mean1, mean2 = group1.mean(), group2.mean()
std1, std2 = group1.std(ddof=1), group2.std(ddof=1)  # ddof=1 for sample standard deviation
dof = n1 + n2 - 2
pooled_std = np.sqrt(((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / dof)
pooled_sem = pooled_std * np.sqrt(1 / n1 + 1 / n2)
# Calculate t-statistic
t_statistic = (mean1 - mean2) / pooled_sem

# `stats.t.sf` stands for "survival function"
# - it gives you the area under the t-distribution curve to the right of your t-statistic value (1 - CDF).
# In other words, it's P(T > t) for a given t-value.

p_value = stats.t.sf(abs(t_statistic), dof) * 2 # two-tailed test

print(f"Manual t-statistic: {t_statistic:.4f}")
print(f"Manual p-value: {p_value:.4f}")
# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("\nReject the null hypothesis.")
    print("There is significant evidence that the novel method outperforms the baseline.")
else:
    print("\nFail to reject the null hypothesis.")
    print("There is not significant evidence that the novel method outperforms the baseline.")

Calculating and Interpreting the 95% Confidence Interval

A confidence interval combines our point estimate (sample mean) with our uncertainty (SEM). For a 95% confidence interval:

$\text{CI} = \bar{x} \pm t \times \text{SEM}$

where $\bar{x}$ is the sample mean, $t$ is the critical value of the t-distribution with $n-1$ degrees of freedom at the chosen confidence level, and SEM is the standard error of the mean.

import numpy as np
from scipy import stats

def calculate_confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    sem = stats.sem(data)  # np.std(data, ddof=1) / np.sqrt(n)
    dof = n - 1
    t_value = stats.t.ppf((1 + confidence) / 2, dof)
    margin_error = t_value * sem
    ci_lower = mean - margin_error
    ci_upper = mean + margin_error
    return ci_lower, ci_upper

# Calculate the 95% confidence interval for the per-experiment accuracy differences.
# Here `diff` is assumed to hold the Novel_A - Baseline accuracy differences for
# experiments where both methods completed.
confidence_level = 0.95
ci_lower, ci_upper = calculate_confidence_interval(diff, confidence_level)
print(f"\n95% Confidence Interval: ({ci_lower:.4f}, {ci_upper:.4f})")
95% Confidence Interval: (0.0282, 0.0378)

Interpretation

Confidence intervals are often misinterpreted. Recall that the overarching goal is to estimate the true population parameter from a sample. A 95% confidence interval means that if we repeated the sampling procedure many times and computed an interval each time, about 95% of those intervals would contain the true parameter.

Note: The confidence interval does not give the probability that the true parameter lies within the interval. The true parameter is fixed, and the interval is random.
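A small simulation makes this concrete. The sketch below (assuming the calculate_confidence_interval() function above) draws many samples from a population with a known mean and counts how often the 95% interval captures it; the long-run coverage should be close to 95%.

import numpy as np

rng = np.random.default_rng(0)
true_mean, n_sims, covered = 0.5, 1_000, 0

for _ in range(n_sims):
    sample = rng.normal(loc=true_mean, scale=0.1, size=30)
    lo, hi = calculate_confidence_interval(sample, confidence=0.95)
    covered += lo <= true_mean <= hi

print(f"Coverage over {n_sims} simulations: {covered / n_sims:.1%}")  # roughly 95%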

Effect Size and Power Analysis

Effect Size

Effect size is a quantitative measure of the magnitude of the experimental effect. It’s important because statistical significance doesn’t tell you the size of the effect, only that it exists.

For statistical tests based on differences between means (like t-tests), Cohen’s d is a common measure of effect size, which estimates the standardized mean difference (SMD) between two populations. It’s defined as the difference between the means divided by the pooled standard deviation:

$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$

where

$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

is the pooled standard deviation.

In Python, you can calculate Cohen’s d as follows:

def cohens_d(group1, group2):
    diff = group1.mean() - group2.mean()
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
    dof = n1 + n2 - 2
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / dof)
    return diff / pooled_std

novel_acc = methods["Novel_A"]["Accuracy"]
baseline_acc = methods["Baseline"]["Accuracy"]

effect_size = cohens_d(novel_acc, baseline_acc)
print(f"Cohen's d: {effect_size:.4f}")
Cohen's d: 1.1135

Interpretation of Cohen’s d (conventional rules of thumb):

  • |d| ≈ 0.2: small effect
  • |d| ≈ 0.5: medium effect
  • |d| ≈ 0.8 or larger: large effect

Our observed d ≈ 1.11 is therefore a large effect.

Power Analysis

Related to effect size is the statistical power of a hypothesis test, which is the probability of correctly rejecting a false null hypothesis (i.e., correctly detecting an effect that is present). Generally speaking, as your sample size increases, so does the power of your test. It is generally accepted that power should be about 0.8 or greater (i.e., your experimental design should have an 80% chance of detecting a statistically significant effect if it exists).

The main application of statistical power is “power analysis,” which involves calculating the minimum sample size required so that one can be reasonably likely to detect an effect of a given size with a given level of confidence.

Example: Power Analysis for a t-test using statsmodels

Below, we perform a power analysis to determine the sample size needed to achieve 80% power with the observed effect size, assuming $\alpha = 0.05$ for a two-tailed test.

import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

# For t-tests
power_analysis = TTestPower()
power = 0.80  # conventional 80% power
alpha = 0.05  # significance level
effect_size = 1.1135  # our observed effect size
sample_size = 20

# 1. Calculate required sample size
n_needed = power_analysis.solve_power(
    effect_size=effect_size,
    power=power,
    alpha=alpha,
    alternative="two-sided",
)

# 2. Calculate achieved power
achieved_power = power_analysis.solve_power(
    effect_size=effect_size,
    nobs=sample_size,
    alpha=alpha,
    alternative="two-sided",
)

# 3. Calculate detectable effect size
min_effect = power_analysis.solve_power(
    nobs=sample_size, power=power, alpha=alpha, alternative="two-sided"
)

print(f"Required sample size for power={power}: {np.ceil(n_needed)}")
print(f"Achieved power: {achieved_power:.3f}")
print(f"Minimum detectable effect size: {min_effect:.3f}")

Full Code for Hypothesis Testing and Power Analysis

Here’s the full code for hypothesis testing, effect size calculation, and power analysis:

import numpy as np
from scipy import stats


def comprehensive_ttest(
    group1: np.ndarray | list[float],
    group2: np.ndarray | list[float],
    alpha: float = 0.05,
    alternative="two-sided",
):
    """Perform a comprehensive independent two-sample t-test analysis.

    Args:
        group1, group2 : array-like
            The two groups of data to compare
        alpha : float, optional (default=0.05)
            Significance level for hypothesis testing
        alternative : str, optional (default='two-sided')
            The alternative hypothesis, either 'two-sided', 'less', or 'greater'

    Returns:
        dict : A dictionary containing all test statistics and metrics
    """

    #  Convert to numpy arrays
    g1 = np.array(group1)
    g2 = np.array(group2)

    # Sample sizes
    n1 = len(g1)
    n2 = len(g2)

    # Means
    m1 = np.mean(g1)
    m2 = np.mean(g2)
    m_diff = m1 - m2

    # Sample stds
    s1 = np.std(g1, ddof=1)
    s2 = np.std(g2, ddof=1)
    dof = n1 + n2 - 2

    # pooled std
    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / dof)

    # pooled SEM
    sem_p = sp * np.sqrt(1 / n1 + 1 / n2)
    # OR np.sqrt(g1.var(ddof=1) / n1 + g2.var(ddof=1) / n2)

    res = stats.ttest_ind(g1, g2, alternative=alternative)
    t_stat, p_value = res
    CI = res.confidence_interval()

    t_crit = stats.t.ppf(1 - alpha / 2, dof)
    error_margin = t_crit * sem_p
    ci_lower = m_diff - error_margin
    ci_upper = m_diff + error_margin
    # Sanity check against SciPy's interval (matches for the default two-sided, 95% case);
    # compare with a floating-point tolerance rather than exact equality
    assert np.isclose(ci_lower, CI[0]) and np.isclose(ci_upper, CI[1])

    effect_size = m_diff / sp

    return {
        "t_stat": t_stat,
        "p_value": p_value,
        "effice_size": effect_size,
        "dof": dof,
        "mean_diff": m_diff,
        "group1_sd": s1,
        "group2_sd": s2,
        "pooled_sd": sp,
        "confidence_interval": (ci_lower, ci_upper),
    }
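A quick usage sketch with the accuracy groups from earlier:

results = comprehensive_ttest(
    methods["Novel_A"]["Accuracy"],
    methods["Baseline"]["Accuracy"],
    alpha=0.05,
    alternative="two-sided",
)
for key, value in results.items():
    print(f"{key}: {value}")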

Regression Analysis

Next, let’s perform a multiple linear regression to predict the accuracy of the novel method based on the number of model parameters and the number of training tokens.

Regression analysis helps us understand how the dependent variable changes when we vary the independent variables. There are several Python packages that allow you to perform linear regression. Let’s explore some of the most popular options starting with pure numpy and computing the coefficients manually. Then, we’ll use scikit-learn for a more streamlined approach.

Normal Equation Method

The normal equation method allows us to find the coefficients of a linear regression model analytically. The formula for the coefficients is given by:

$\beta = (X^T X)^{-1} X^T y$

where:

  • $X$ is the design matrix of input features (with a leading column of ones when fitting an intercept),
  • $y$ is the vector of target values, and
  • $\beta$ is the vector of regression coefficients.

With $\beta$, you can predict the target variable using the formula:

$y_{\text{pred}} = X \beta$

Let’s implement this method using NumPy:


import numpy as np

from sklearn.linear_model import LinearRegression as SklearnLinearRegression
from sklearn.metrics import r2_score

class LinearRegression:
    def __init__(self, fit_intercept: bool = True):
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> "LinearRegression":
        """Fit the linear regression model using the normal equation method.

        Normal equation: beta = (X^T X)^(-1) X^T y

        where: X -> [1, X] (left-padded with ones for the intercept term)

        Args:
            X (ndarray): The input features with shape (n_samples, n_features).
            y (ndarray): The target values with shape (n_samples,).

        Returns:
            self: The fitted model.
        """
        if self.fit_intercept:
            X = np.column_stack((np.ones(X.shape[0]), X))
        self.coef_ = np.linalg.inv(X.T @ X) @ X.T @ y
        if self.fit_intercept:
            self.intercept_ = self.coef_[0]
            self.coef_ = self.coef_[1:]
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        if self.coef_ is None:
            raise ValueError("Model is not fitted. Please call `fit` before `predict`.")
        if self.fit_intercept and self.intercept_ is not None:
            return X @ self.coef_ + self.intercept_
        return X @ self.coef_

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """Return the coefficient of determination R^2 of the prediction.

        R^2 = 1 - (sum(y - y_pred)^2) / (sum(y - y.mean())^2)
            = 1 - SSR / SST
        """
        y_pred = self.predict(X)
        SSR = np.sum((y - y_pred) ** 2)  # Sum of squared residuals
        SST = np.sum((y - np.mean(y)) ** 2)  # Total sum of squares
        return 1 - SSR / SST

X = method["Novel_A"][["Model_Parameters", "Training_Tokens"]].values
y = method["Novel_A"]["Accuracy"].values

# Could also use sklearn's LinearRegression for comparison
model = LinearRegression().fit(X, y)

print(f"Intercept: {model.intercept_:.4f}")
print("Coefficients:")
for name, coef in zip(X.columns, model.coef_):
    print(f"  {name}: {coef:.4f}")
print(f"R-squared: {model.score(X, y):.4f}") # use r2_score(y, model.predict(X)) for sklearn

NumPy allows you to implement linear regression from scratch using matrix operations. This approach is useful for understanding the underlying mathematics. On the other hand, libraries like scikit-learn provide a more user-friendly interface for common machine learning tasks.
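As a sanity check, here is a short sketch comparing our normal-equation implementation against scikit-learn's LinearRegression on the same X and y (using the imports from the block above):

sk_model = SklearnLinearRegression().fit(X, y)

print(f"sklearn intercept: {sk_model.intercept_:.4f}")
print(f"sklearn coefficients: {sk_model.coef_}")
print(f"sklearn R-squared: {r2_score(y, sk_model.predict(X)):.4f}")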

Evaluating the Regression Model with R-squared

R-squared ($R^2$) is a statistical measure that represents the proportion of variance in the dependent variable (in our case, accuracy) that is predictable from the independent variable(s). It ranges from 0 to 1, where:

  • 0 means the model explains none of the variance in the target, and
  • 1 means the model explains all of it.

The formula for R-squared is:

$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$

where:

  • $y_i$ are the observed values,
  • $\hat{y}_i$ are the predicted values, and
  • $\bar{y}$ is the mean of the observed values.

A few key points about R-squared:

  • Adding more predictors never decreases $R^2$ on the training data, which is why adjusted $R^2$ is often reported for multiple regression.
  • A high $R^2$ does not imply causation or that the model is correctly specified.
  • $R^2$ can be negative on held-out data if the model predicts worse than simply using the mean.

For our regression model, the R-squared value printed by model.score(X, y) above reports this proportion directly.
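To make the formula concrete, here is a small sketch (assuming the X, y, and model objects from above) that computes $R^2$ and adjusted $R^2$ by hand and checks the former against sklearn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_pred = model.predict(X)
ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

n, p = X.shape  # number of samples and number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R-squared: {r2:.4f} (sklearn: {r2_score(y, y_pred):.4f})")
print(f"Adjusted R-squared: {adj_r2:.4f}")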

Visualizing the Regression Model


X = np.log10(methods["Novel_A"]["Model_Parameters"])
y = methods["Novel_A"]["Accuracy"].values

model = LinearRegression().fit(X, y)
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", label="Actual")
x_val = np.array([X.min(), X.max()])[...,  None]
y_val = model.predict(x_val)
r2 = model.score(X.values[...,None], y)

plt.xlabel('Log10(Model Parameters)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Number of Parameters')
equation = f"Accuracy = {model.intercept_:.2f} + {model.coef_[0]:.2f} * log10(Parameters)"

plt.plot(x_val, y_val, color="red", label=f"{equation}  (R²={r2:.2f})")

plt.legend()
plt.tight_layout()
plt.savefig(PLOT_DIR / 'accuracy_vs_parameters.png')
plt.show()
plt.close()
Accuracy vs Number of Parameters

Scenario 2: Understanding AGI Timeline Predictions: A Chi-Square Analysis of Academia vs Industry

Research into AGI has sparked debates about when we might achieve this technological milestone. One interesting question is whether researchers’ work environments influence their predictions about AGI timelines. Let’s use this scenario as a case study for a chi-square analysis.

The Research Question

Do academic and industry researchers differ in their predictions about AGI timelines? The prediction is a categorical variable, which makes this question well suited to a chi-square test. Suppose we surveyed 400 AI researchers (200 from academia and 200 from industry) about when they expect AGI to be achieved. The responses were grouped into four timeline categories: less than 2 years, 2-5 years, 5-10 years, and more than 10 years.

Understanding Chi-Square Analysis

The chi-square test of independence helps us determine whether there’s a significant relationship between two categorical variables. It compares the observed frequencies of categories with the frequencies we would expect if the variables were independent. In our case, the variables are:

  1. Work Environment (Academia vs. Industry)
  2. AGI Timeline Prediction (four timeline categories)

The test works by comparing the observed frequencies in a contingency table with the frequencies we would expect if the variables were independent. Let’s create some sample data to illustrate this:

import numpy as np

# Create example dataset representing a survey of AI researchers
np.random.seed(42)

# Generate sample data
n_researchers = 400  # Total number of surveyed researchers

# The four timeline categories used throughout this section
agi_timelines = ['<2 years', '2-5 years', '5-10 years', '>10 years']

# Create slightly different distributions of predictions for academia vs industry
# to demonstrate a potential relationship
academia_predictions = np.random.choice(
    agi_timelines,
    size=200,
    p=[0.15, 0.25, 0.35, 0.25]  # Academia tends toward longer timelines
)

industry_predictions = np.random.choice(
    agi_timelines,
    size=200,
    p=[0.25, 0.35, 0.25, 0.15]  # Industry tends toward shorter timelines
)

The test statistic is calculated as:

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

where:

  • $O_{ij}$ is the observed frequency in row $i$, column $j$, and
  • $E_{ij}$ is the expected frequency under the assumption of independence.

To calculate the expected frequency for any cell, we use this formula:

$E_{ij} = \frac{\text{Row}_i \text{ Total} \times \text{Col}_j \text{ Total}}{\text{Grand Total}}$

Effect Size with Cramér’s V

Cramér’s V is a measure of association between two categorical variables. It’s based on the chi-square statistic and ranges from 0 to 1, where:

  • 0 indicates no association between the variables, and
  • 1 indicates a perfect association.

The formula for Cramér’s V is:

$V = \sqrt{\frac{\chi^2}{n \times \min(r - 1, c - 1)}}$

where:

  • $\chi^2$ is the chi-square statistic,
  • $n$ is the total number of observations, and
  • $r$ and $c$ are the numbers of rows and columns in the contingency table.

Performing the Chi-Square Test

We can carry this out in Python by first creating a contingency table of observed frequencies and then calculating the chi-square statistic:

from collections import Counter

academic_counts = Counter(academia_predictions)
industry_counts = Counter(industry_predictions)

print("Academia AGI Counts:", academic_counts)
print("Industry AGI Counts:", industry_counts)

test_table = np.array(
    [
        [academic_counts[k] for k in agi_timelines],
        [industry_counts[k] for k in agi_timelines],
    ]
)
print(test_table)
>>>
Academia AGI Counts: Counter({'5-10 years': 59, '2-5 years': 57, '>10 years': 51, '<2 years': 33})
Industry AGI Counts: Counter({'2-5 years': 64, '5-10 years': 57, '<2 years': 49, '>10 years': 30})
[[33 57 59 51]
 [49 64 57 30]]

Then we can calculate the chi-square statistic and Cramer’s V:

import numpy as np
from scipy import stats


def chi2_test(contingency_table: np.ndarray | list[list[int]]) -> dict:
    observed = np.array(contingency_table)

    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    total = row_totals.sum()

    # E_ij = (row_total_i * col_total_j) / grand_total
    expected = np.outer(row_totals, col_totals) / total

    chi2 = np.sum((observed - expected) ** 2 / expected)
    dof = (observed.shape[1] - 1) * (observed.shape[0] - 1)

    p_value = 1 - stats.chi2.cdf(chi2, dof)

    min_dim = min(observed.shape) - 1
    cramers_v = np.sqrt(chi2 / (total * min_dim))
    std_residuals = (observed - expected) / np.sqrt(expected)

    return {
        "ch2": chi2,
        "expected": expected,
        "dof": dof,
        "p_value": p_value,
        "cramers_v": cramers_v,
        "std_residuals": std_residuals,
    }
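Running this on our contingency table, and cross-checking against SciPy's chi2_contingency, looks roughly like this (assuming test_table from above):

from scipy.stats import chi2_contingency

results = chi2_test(test_table)
print(f"Chi-squared: {results['chi2']:.2f}, p-value: {results['p_value']:.4f}")
print(f"Cramer's V: {results['cramers_v']:.3f}")

# Verify against SciPy (no Yates correction is applied for tables larger than 2x2)
chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(test_table)
assert np.isclose(results["chi2"], chi2_scipy)
assert np.isclose(results["p_value"], p_scipy)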

Interpreting the Results

The test gives the following results:

Analysis of AGI Timeline Predictions Across Research Environments
-------------------------------------------------------------

1. Distribution of Predictions:
AGI_Timeline      2-5 years  5-10 years  <2 years  >10 years  All
Work_Environment
Academia                 57          59        33         51  200
Industry                 64          57        49         30  200
All                     121         116        82         81  400

2. Percentage Distribution within Each Environment:
AGI_Timeline      2-5 years  5-10 years  <2 years  >10 years
Work_Environment
Academia               28.5        29.5      16.5       25.5
Industry               32.0        28.5      24.5       15.0

3. Statistical Test Results:
Chi-squared statistic: 9.01
p-value: 0.0292
Degrees of freedom: 3
Cramer's V (effect size): 0.150

The contingency table shows some interesting patterns:

  • Industry researchers are more likely to predict very short timelines (24.5% chose “<2 years” versus 16.5% in academia).
  • Academic researchers are more likely to predict long timelines (25.5% chose “>10 years” versus 15.0% in industry).
  • The middle categories (“2-5 years” and “5-10 years”) are fairly similar across the two groups.

Since our p-value (0.0292) is less than the conventional significance level of 0.05, we can conclude that there is a statistically significant relationship between work environment and AGI timeline predictions, although the effect size (Cramér’s V = 0.150) indicates that the association is fairly weak.

Key Takeaways

  • Start with data quality checks and cleaning; flawed data undermines every downstream analysis.
  • Descriptive statistics and plots give a first picture of the data, but claims about differences need formal hypothesis tests.
  • Report effect sizes and confidence intervals alongside p-values, and use power analysis to plan sample sizes.
  • Match the test to the data: t-tests for comparing means, regression for modeling relationships, and chi-square tests for associations between categorical variables.

