This is an automated archive made by the Lemmit Bot.
The original was posted on /r/machinelearning by /u/ToThePastMe on 2025-01-15 16:17:30+00:00.
There is this dataset (won't link here as I don't want my Kaggle and Reddit accounts associated) with a few input features (5-6) used to predict one target value.
But one of the features is basically perfectly linearly correlated with the target (>0.99).
An example would be data from a trucking company with a single model of trucks:
Target: truck fuel consumption per year
Features: driver's age, tire type, truck age, DISTANCE TRAVELED per year
Obviously, on average, fuel consumption will be roughly linearly proportional to the number of miles traveled. Normally you'd just use that to derive a new target like fuel / distance.
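A minimal sketch of that re-targeting in pandas, with made-up column names since the actual dataset isn't linked:

```python
import pandas as pd

# Hypothetical column names -- stand-ins for the real dataset's columns.
df = pd.read_csv("trucks.csv")

# Derive an intensity-style target: fuel used per unit of distance.
# This removes the near-perfect linear dependence on distance traveled,
# so the remaining features (driver age, tire type, truck age) actually
# have to explain the variance instead of distance doing all the work.
df["fuel_per_distance"] = df["fuel_per_year"] / df["distance_per_year"]

X = df[["driver_age", "tire_type", "truck_age"]]
y = df["fuel_per_distance"]
```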
Yet not a single person/notebook did this kind of normalization. So everyone's model has >0.99 accuracy, because that one feature drowns out everything else.
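For what it's worth, a one-line correlation check (again with hypothetical column names) would flag the issue immediately:

```python
# Correlation of each numeric feature with the raw target; a value near 1.0
# for distance_per_year is the red flag that any model is mostly just
# relearning fuel ≈ rate * distance.
print(df.corr(numeric_only=True)["fuel_per_year"].sort_values(ascending=False))
```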
Has anyone else noticed this? More and more, the code looks fine (data loading, training many types of models), maybe thanks to LLMs, but the decision-making process is often quite bad.