This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/ToThePastMe on 2025-01-15 16:17:30+00:00.


There is this dataset (which I won't link here, as I don't want my Kaggle and Reddit accounts associated) with a few input features (5-6) used to predict one target value.

But one of the features is basically perfectly linearly correlated with the target (>0.99).

An example would be data from a trucking company with a single model of trucks:

* Target: truck fuel consumption / year
* Features: driver's age, tire type, truck age, DISTANCE TRAVELED / year

Obviously, on average, fuel consumption will be roughly proportional to the number of miles traveled. Normally you'd just use that to compute a new target like fuel/distance.
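A minimal sketch of what that normalization could look like, assuming a pandas DataFrame with hypothetical column names (`fuel_per_year`, `distance_per_year`, `trucks.csv`), since the actual dataset isn't named here:

```python
import pandas as pd

# Hypothetical file and column names, just to illustrate the idea
df = pd.read_csv("trucks.csv")

# Confirm the near-perfect linear relationship between the raw target
# and the dominant feature (~0.99 in the scenario described)
print(df["fuel_per_year"].corr(df["distance_per_year"]))

# Normalize the target so the model has to explain the residual variation
# (fuel efficiency) instead of just re-learning "more miles = more fuel"
df["fuel_per_distance"] = df["fuel_per_year"] / df["distance_per_year"]

# Train against the normalized target, dropping the raw one
X = df.drop(columns=["fuel_per_year", "fuel_per_distance"])
y = df["fuel_per_distance"]
```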

Yet not a single person/notebook did this kind of normalization, so everyone's model gets >0.99 accuracy, because that one feature drowns out everything else.

Is this something other people have noticed? More and more, the code looks fine (data loading, training many types of models), maybe thanks to LLMs, but the decision-making process is often quite bad.