Bakken & Baeck logo

Code translation with ChatGPT | Nico Lutz

A short story on translating code with ChatGPT for fun and educational entertainment

I am currently looking into Bayesian modeling and MCMC - I stumbled upon this blog post by Dan Simpson. It talks about a dataset from a 2019 paper titled: "Interest in non-social novel stimuli as a function of age in rhesus monkeys". This article counters another article from 2016 that suggests that aged Barbary macaques maintained interest in members of their own species while losing interest in novel non-social stimuli. Dan Simpson uses the dataset to fit a Bayesian multilevel model and teach about a residual-vs-fitted plot for a multilevel model.

Nevertheless I thought it would be a nice exercise to go through the whole blog post to understand the math and mental reasoning on my own. Along the way I wanted code up everything in Python. But the blog post is written in R - so why use python now? In the end I didn't want to produce new content - I just wanted to reason about every step and python is the language of my choice to do so. So deal with it. Along the way I made a nice discovery and chatGPT played an important role. In order to follow a long I advise reading the original article and I will use the original headline and code snippets.


As a context the dataset is produced by the following experiment.

"The experiment used 2435 monkeys aged between 4 and 30 and gave them a novel puzzle task (opening a fancy tube with food in it) for twenty minutes over two days. The puzzle was fitted with an activity tracker. Each monkey had two tries at the puzzle over two days. Monkeys had access to the puzzle for around6 20 minutes."

What does the data look like?

Understanding the first few lines is straightforward. Here is an excerpt of original code:

acti_data <- read_csv("activity_data.csv")
activity_2mins <- acti_data |>
  filter(obs<9) |> group_by(subj_id, Day) |>
            active_bins = sum(Activity > 0),
            age = min(age)) |>
  rename(monkey = subj_id, day = Day) |>


The author reads in a data CSV, filters by the number of active intervals and groups by id and age, computes a total activity and active bins and renames the columns. Using python and pandas translating this is straightforward:

import pandas as pd

df = pd.read_csv('./activity_data.csv')
df = df.rename(columns={'subj_id': 'monkey', 'Day':'day'})
activity_2mins = df[df["obs"] < 9]
activity_2mins['total'] = activity_2mins.groupby(['monkey', 'day'])['Activity'].transform(lambda x: sum(x))
activity_2mins['active_bins'] = activity_2mins.groupby(['monkey', 'day'])['Activity'].transform(lambda x: sum(x > 0))
activity_2mins['age_n'] = activity_2mins.groupby(['monkey', 'day'])['age'].transform(lambda x: min(x))

There are just too many monkeys; or Why can’t we just analyze this with regression?

Ok to be fair - this is here for completeness. In the end we will find out that regression is simply not enough and cannot model the underlying data. Still the process is shown below:

"The temptation with this sort of data is to fit a linear regression to it as a first model. In this case, we are using grouping, group-level, and individual-level covariates in the same way."

In R:

fit_lm <- lm(active_bins ~ age*factor(day) + factor(monkey), data = activity_2mins)

After some digging I was able to find out that statsmodels(python) does roughly the same thing as broom(R).

import statsmodels.api as sm
fit_lm = sm.formula.ols(formula='active_bins ~ age*C(day) + C(monkey)', data=activity_2mins).fit()

Both of these will output a lot of regression coefficients. In Bayesian literature this is known as no pooling indicating that there is no dependence between the intercepts for each group. But in this case we treat monkey as a factor variable and simply don't have enough observations.

If we ignore the monkeys, will they go away? or Another attempt at regression

The next chapter tries to model the experiment using complete pooling . We can use statsmodels again but with a different formula.

fit_lm = sm.formula.ols(formula='active_bins ~ age*C(day)', data=activity_2mins).fit()

And here things began to become complicated. In this chapter Dan Simpson tries to show some diagnostic plots in order to highlight that the residuals calculated by the above formula do have some patterns. The blog article has some nice plots to show this. But I wanted to replicate graphs with my python code and couldn't figure out what this R code was doing.

augment(fit_lm_pool) |>
  ggplot(aes(x = .fitted, y = active_bins - .fitted)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +

Or to put it differently I lacked the proper understanding of R packages to fully grasp how the graphs are made. What now ? Leave it ? After all this was just an evening exercise? No - what I did do is ask chatGPT to tell me how to translate this piece of R code into python. And voilà:

import pandas as pd
import seaborn as sns

df = pd.DataFrame({'fitted': fit_lm.fittedvalues,
                   'residuals': fit_lm.resid,
                   'active_bins': fit_lm.model.endog})

sns.scatterplot(x='fitted', y='residuals', data=df)
sns.regplot(x='fitted', y='residuals', data=df, scatter=False)

Turns out chatGPT is pretty good in writing translations for code pieces. The graphs looks exactly as the one in Dan Simpson blog article.

I don't want to repeat anymore what Dan Simpsons did. If you are interested in the difference between no pooling, complete pooling and partial pooling continue reading the original article. I conclude our tech article here - I wanted to share that although I believe the hype on llm is overrated I could highlight a nice strength of large language models, namely translating from R to python. I was able to translate the first few paragraphs, but soon after my limited knowledge of R was exhausted I would have needed to dig deeper in documentation or find someone to explain it to me. A large language model, in this case chatGPT, was able to help me to translate the code.