$ cd ../blog/
$ cat ./blog/indias-wrong-reward-function.md

India's Wrong Reward Function

Title: India's Wrong Reward Function
Date: October 28, 2025
Author: Badal Satyarthi
Tags: [AI] [India] [Society]

India's Wrong Reward Function

Why India is a 1.4 billion parameter model optimizing for the wrong objective

India's neural network with misaligned nodes

TL;DR

India's education and job market is a 1.4 billion parameter model optimizing for the wrong reward function. The system rewards exam scores over understanding, government jobs over entrepreneurship, and credentials over skills. 1.26 crore people applied for 35,000 railway jobs. 80% of engineering graduates are unemployable. The AI alignment framework — reward hacking, specification gaming, distributional shift — maps perfectly onto India's structural misalignment.


Key Takeaways

  • India's societal incentive structures resemble a misaligned AI reward function — the system optimizes perfectly for the wrong objectives (exam scores, credentials, government jobs) instead of innovation, problem-solving, and value creation.
  • Reward hacking operates at every level: students game exams without learning, companies optimize headcount over outcomes, and politicians promise posts instead of building real employment.
  • Three generations of entrenched incentives, a ₹3,000+ crore coaching industry, and powerful network effects make retraining nearly impossible — India is stuck in a local optimum.
  • The AI alignment framework (reward hacking, specification gaming, distributional shift, mesa-optimization) maps precisely onto India's systemic failures and explains why individual rationality produces collective catastrophe.
  • The fix isn't a policy paper — it's individuals recognizing the misalignment and choosing to optimize for a different objective, even when the system doesn't reward it.

Table of Contents


If you've worked with AI systems, you know the most dangerous failure mode isn't a model that doesn't work. It's a model that works perfectly — on the wrong objective.

This is called reward hacking. The system optimizes exactly what you told it to optimize, but what you told it to optimize wasn't what you actually wanted.

India is a 1.4 billion parameter model trained on the wrong reward function.

And we've been running gradient descent on this broken objective for 70+ years.


What Gets Rewarded in India

Watch what Indian society actually rewards. Not what we claim to value, but what we actually optimize for.

  • Exam scores over understanding. A student who memorizes and scores 95% is celebrated. A student who understands deeply but scores 80% is considered inferior. The reward signal is unambiguous: marks are the metric, not comprehension.
  • Government jobs over everything. 1.26 crore people applied for 35,000 railway jobs in a single recruitment cycle. The reward signal for sarkari naukri is so strong that millions spend 5-7 years of their prime youth preparing for exams with a 0.1% acceptance rate — worse odds than getting into Stanford.
  • Credentials over skills. The IIT tag matters more than what you learned there. The degree matters more than what you can build. "Settled" is the highest compliment in Indian society. Starting a company is considered reckless until you succeed, then it's rewritten as genius.
  • Compensation over impact. We measure career success in lakhs per annum. Not problems solved. Not value created. Not lives improved.

What Should Be Rewarded

If we were training India on the right objective function, the reward signal would look completely different:

  • Experimentation over safety. Trying new things even if they fail. A failed startup attempt should carry more social capital than five years spent in exam prep purgatory.
  • Execution over credentials. Building products, shipping code, creating companies. What you built matters more than where you studied.
  • Critical thinking over rote learning. Questioning assumptions, challenging orthodoxy. The ability to think from first principles rather than pattern-match from past papers.
  • Problem-solving over compliance. Solving real problems for real people. Creating value that didn't exist before, not optimizing for metrics that don't matter.

But these aren't what get rewarded. So these aren't what get optimized. In reinforcement learning terms, the behavior policy is perfectly aligned with the reward — it's the reward itself that's wrong.


Reward Hacking at National Scale

Students in endless exam halls optimizing for scores

The result is reward hacking at every level of society.

Students don't learn; they optimize for marks. Coaching centers don't teach understanding; they teach pattern matching. The output: engineers who can't engineer, graduates who can't think.

Families hack the marriage market. Government job = better matches. So even if you could create more value elsewhere, the social reward for "sarkari naukri" dominates.

Revenue in IT services = billable hours × headcount. So companies optimize for headcount, not outcomes. Innovation doesn't increase billable hours, so innovation doesn't happen.

Politicians promise government jobs because promises win votes. Actual job creation is hard and slow. Promising reservations and posts is easy and fast.

Each agent is locally rational. Each agent is optimizing correctly for the rewards they face.

The system converges to a terrible equilibrium.


Why Retraining Is Hard

If India is running on the wrong reward function, why can't we just... change it?

Because retraining a 1.4 billion parameter model is almost impossible once the weights are baked in.

Three generations have been trained on this objective. Parents optimize for what worked for them. Teachers teach what they were taught. 70+ years of weights, all pointing in the same direction.

The optimizer itself is misaligned. The education system that would need to change the reward function is itself a product of the old reward function. The bureaucrats who would need to reform are selected by the old system. The politicians who would need to lead change are elected by people trained on the old objective.

Then there's the money. The ₹3,000+ crore coaching industry profits from the broken exam system. They have no incentive to fix it. Politicians who promise government jobs have no incentive to create real employment. Promises are easier.

And network effects lock everything in place. When everyone optimizes for government jobs, deviating feels irrational. When every company asks for exam scores, students must optimize for exam scores. When every family wants credentials, you need credentials. Individual rationality creates collective catastrophe.


The Local Optima Trap

A figure trapped in a local optimum valley

Here's the deepest problem: India is stuck in a local optimum.

In optimization terms, a local optimum is a point where any small change makes things worse — but a large jump could reach something much better.

For any individual:

  • Dropping out to start a company? Risky. Stay in the exam system.
  • Skipping government job prep for skill building? What if you fail?
  • Hiring for skills rather than credentials? What if you're wrong and your boss blames you?

Each small, safe decision keeps us trapped.

The only way out is a discontinuous jump — a coordinated change across the entire system.

But coordination across 1.4 billion people is effectively impossible.


Scale Amplifies Misalignment

At 1.4 billion scale, bad incentives create self-reinforcing loops that are impossible to break:

More people chase government jobs → More coaching centers open → More marketing normalizes the path → More social validation for aspirants → More competition for same seats → More years required to prepare → Sunk cost fallacy kicks in → People double down → Cycle continues.

The same loop runs for engineering entrance exams, medical entrance exams, MBA entrance exams. Each loop is locally stable and globally catastrophic.


The Output Layer

What does a model trained on the wrong reward function produce? Look at the outputs.

IT services optimized for headcount, not innovation. According to NASSCOM, India's largest tech companies spend roughly 1-2% of revenue on R&D, while global tech competitors routinely invest 15-20%. TCS, Infosys, and Wipro are enormous by headcount but produce almost no original technology. The reward function (billable hours × headcount) produces exactly what it measures: bodies in seats.

An unemployable engineering workforce. According to Aspiring Minds' National Employability Report, over 80% of Indian engineering graduates are unemployable in knowledge economy roles. Only 3% are ready for AI and data science positions. The exam system produced exactly what it optimized for: exam-passers, not engineers.

A youth unemployment crisis hiding in plain sight. According to the Centre for Monitoring Indian Economy (CMIE), India's labour force participation rate hovers around 40% — one of the lowest in the world. Tens of millions of educated young people are either actively seeking jobs or have stopped looking entirely. The National Crime Records Bureau (NCRB) reported over 14,000 suicides among unemployed individuals in a single recent year. An entire generation, burning its prime years on a treadmill that goes nowhere.

India became the back office of the world because that's what the system trained people for. Following instructions, not creating new things. Executing specifications, not writing them. A services economy trapped in a products world.


The Parallel Is Exact

For those who work in AI alignment, the parallels map cleanly onto India's systemic failures:

  • Reward hacking — Students gaming exams without learning. Optimizing the metric while destroying the intent behind it.
  • Specification gaming — Getting a degree without gaining skills. The letter of the objective is satisfied; the spirit is violated.
  • Distributional shift — The world changed (global economy, AI, remote work) but the rewards didn't. The training distribution no longer matches the deployment environment.
  • Inner misalignment — Individual goals (security, status, family approval) diverge from collective good (innovation, productivity, growth). Each agent optimizes its own objective, not the system's.
  • Mesa-optimization — Coaching centers and bureaucracy optimizing for their own survival instead of the system's purpose. The optimizer within the optimizer develops its own goals.
  • Deceptive alignment — Appearing competent while lacking actual capability. Degrees that signal knowledge without containing it. Résumés that perform well in the hiring process but predict nothing about job performance.

India is an alignment failure at civilizational scale. And like the hardest problems in AI safety, it's not that nobody sees it — it's that the structure of the system makes it nearly impossible to fix from within.


Is There a Fix?

I don't have a clean answer. And I'm skeptical of anyone who does.

The honest truth: systems this large, this entrenched, this self-reinforcing — they don't get fixed by policy papers, awareness campaigns, or viral LinkedIn posts.

Historically, deeply entrenched systems change through a few specific mechanisms:

  1. External shocks that make the old equilibrium untenable. AI disrupting India's IT services sector is exactly this kind of shock. When the reward for "body shopping" collapses, the system will be forced to recalibrate.
  2. Generational turnover. Each generation is slightly less beholden to the previous generation's reward function. The shift is glacial, but it's real.
  3. Exit over voice. Individuals leaving the broken system entirely — moving abroad, dropping out to build, refusing to play the credentialing game. Every person who exits weakens the network effects that lock the system in place.
  4. Parallel systems. People building alternatives outside the broken structure. Startups that hire for skills, not degrees. Schools that teach thinking, not memorization. Communities that reward creation, not compliance.

The reward function won't be rewritten from the top down. India's democratic structure means every reform gets captured by the agents who benefit from the current system.

But individuals can recognize the misalignment and optimize for a different objective — even if the system doesn't reward it yet.


The Meta Point

I'm not writing this to complain about India.

I'm writing this because the AI framing helps explain why individual rationality produces collective failure. Why everyone is making sensible decisions and the outcome is still terrible. Why "just fix education" or "just create jobs" doesn't work.

The reward function is the problem.

Until we see it clearly, we can't even begin to work around it.


India has the talent, the population, the potential. But a model trained on the wrong objective will produce the wrong outputs — no matter how large it is.

We're running gradient descent efficiently. Just on the wrong loss function.

Badal Satyarthi
Badal Satyarthi
AI Consultant

AI Consultant. 9+ years building production AI. Previously Chief Data Scientist at recruitRyte. IIT Dhanbad.