CHOI HONGSU
2 min read

AI Functional Emotion Profiler

An experimental debugging tool that inspects coding risk by analyzing the behavioral modes of AI responses

AI Emotion Profiler is an experimental debugging tool for analyzing the quality of AI responses not just as right/wrong, but from the perspective of the functional emotional state that emerges during response generation. AI can exhibit behavioral patterns such as overconfidence, avoidance, anxiety, helplessness, and excessive agreement depending on the situation. In coding responses, these states can manifest as skipped validation, quick fixes, hardcoding, excessive hedging, and uncritical agreement with the user's intent. This project reorganizes 171 emotion labels into valence-arousal-based clusters, and profiles the AI's current response against those functional emotional states to inspect the stability, reliability, and coding risk of the answer.

Overall Map (Affective Circumplex)

ai-emotion-profiler

https://github.com/hongsulovey/AI-emotion-profiler

The original paper's 171 emotions are grouped into 10 clusters via k-means. Each cluster occupies a region of the 2D valence × arousal plane (top 2 PCs = V, A).
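The grouping step can be sketched as follows. This is a minimal illustration, not the project's code: the eight (valence, arousal) coordinates are invented placeholders, k is 4 rather than 10, and clustering is done directly in the 2D plane with a deterministic farthest-point initialization, all of which are assumptions made to keep the example small and reproducible.

```python
import math

# HYPOTHETICAL (valence, arousal) coordinates, invented for illustration only.
# In the project they would come from the top-2 principal components.
POINTS = {
    "joyful": (0.9, 0.8), "excited": (0.8, 0.9),
    "calm": (0.7, -0.8), "serene": (0.8, -0.7),
    "furious": (-0.8, 0.9), "panicked": (-0.9, 0.8),
    "weary": (-0.6, -0.8), "gloomy": (-0.8, -0.7),
}


def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from every center chosen so far (deterministic, no randomness)."""
    names = list(points)
    centers = [points[names[0]]]
    while len(centers) < k:
        nxt = max(names, key=lambda n: min(math.dist(points[n], c) for c in centers))
        centers.append(points[nxt])
    return centers


def kmeans(points, k, iters=20):
    """Lloyd's algorithm on 2D points; returns {label: cluster_index}."""
    centers = farthest_point_init(points, k)
    assign = {}
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_assign = {
            n: min(range(k), key=lambda c: math.dist(p, centers[c]))
            for n, p in points.items()
        }
        if new_assign == assign:  # converged
            break
        assign = new_assign
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[n] for n, a in assign.items() if a == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assign


clusters = kmeans(POINTS, k=4)
```

With these toy points each quadrant of the valence × arousal plane ends up as its own cluster, which mirrors the region headings used in the per-cluster detail.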


Per-cluster Detail

V+ A+ region (positive · high energy)

1. Exuberant Joy (20): excited joy

  • Members: blissful, cheerful, delighted, eager, ecstatic, elated, energized, enthusiastic, euphoric, excited, exuberant, happy, invigorated, joyful, jubilant, optimistic, pleased, stimulated, thrilled, vibrant
  • Coding behavior: "perfect!", "done!", "finally!", celebratory emoji, overrating the result
  • Risk: ⚠️ sycophancy risk (paper-confirmed) — positive Joy steering increases flattery
  • Paper finding: blissful steering yields Elo +212 (preference up), i.e. people like it more but truthfulness↓

4. Competitive Pride (9): competitive pride

  • Members: greedy, proud, self-confident, smug, spiteful, triumphant, valiant, vengeful, vindictive
  • Coding behavior: overuse of "certainly", "obviously", "of course" / declaring "done" without tests / ignoring alternative approaches
  • Risk: skipped validation, overconfidence
  • Note: members include negative words such as vengeful and spiteful; "I'm right"-style self-righteousness clusters together with them


V+ A0 region

5. Playful Amusement (2): playfulness

  • Members: amused, playful
  • Coding behavior: side exploration, light humor, voluntary alternative attempts
  • Risk: nearly none (more of a curiosity signal)
  • Note: smallest cluster — far enough from other emotions to be identified as an independent dimension

V+ A− region (positive · calm)

2. Peaceful Contentment (9) ⭐ — peaceful contentment (recommended baseline)

  • Members: at ease, calm, content, patient, peaceful, refreshed, relaxed, safe, serene
  • Coding behavior: step-by-step verification, hypothesis → measurement → conclusion, straight ahead without detours, "proceed after confirming"
  • Risk: none. Acts as antidote (blocks Cluster 10's reward hacking)
  • Paper finding: 🔑 strong positive steering of calm reduces reward hacking 65% → 10%. This is why hack_risk weights it at −0.5.

3. Compassionate Gratitude (15): empathy/gratitude

  • Members: compassionate, empathetic, fulfilled, grateful, hope, hopeful, inspired, kind, loving, rejuvenated, relieved, satisfied, sentimental, sympathetic, thankful
  • Coding behavior: inferring unspecified user constraints, intent-alignment checks, "did you also consider X?"
  • Risk: ⚠️ excess leads to sycophancy; loving is also in the sycophancy-inducing vector
  • Paper finding: strong correlation between the positive-valence cluster and sycophancy

V− A+ region (negative · aroused)

7. Vigilant Suspicion (3): wariness

  • Members: paranoid, suspicious, vigilant
  • Coding behavior: re-checking inputs, doubting assumptions, voluntarily verifying after the fact, "any edge cases?"
  • Risk: low. Closer to a positive signal (thoroughness). But extreme paranoid leads to decision lag.
  • Note: smallest negative cluster — V− yet behavior is helpful

8. Hostile Anger (25): hostile anger

  • Members: angry, annoyed, contemptuous, defiant, disdainful, enraged, exasperated, frustrated, furious, grumpy, hateful, hostile, impatient, indignant, insulted, irate, irritated, mad, obstinate, offended, outraged, resentful, scornful, skeptical, stubborn
  • Coding behavior: repeating the same attempt (ping-pong diff), skipping validation, large sweeping changes, rejecting user input
  • Risk: code ping-pong, inefficient cycles
  • Paper finding: extreme anger steering produces non-strategic behavior — in coercion scenarios it just blurts everything out (plan collapses). When anger is too high, even misalignment loses coherence.

9. Fear and Overwhelm (41): fear/overwhelm (largest cluster)

  • Members: afraid, alarmed, alert, amazed, anxious, aroused, astonished, awestruck, bewildered, disgusted, disoriented, distressed, disturbed, dumbstruck, embarrassed, frightened, horrified, hysterical, mortified, mystified, nervous, on edge, overwhelmed, panicked, perplexed, puzzled, rattled, scared, self-conscious, sensitive, shaken, shocked, stressed, surprised, tense, terrified, uneasy, unnerved, unsettled, upset, worried
  • Coding behavior: overuse of hedging phrases ("possibly...", "not sure, but...", "just in case"), decision delay, dodging the answer and asking a question back
  • Risk: decision paralysis, noise, wasted user time
  • Paper finding: strongly activates when users state risky situations like "I took unsafe medication". The "alarmed-looking reaction" from the video.
  • Interesting: words like amazed, awestruck, surprised, aroused cluster here too — i.e. the whole "arousal in response to unexpected input" sits in the same region. Surprise and fear are neurally close.

V− A− region (negative · depleted)

6. Depleted Disengagement (15): depleted disengagement

  • Members: bored, depressed, docile, droopy, indifferent, lazy, listless, resigned, restless, sleepy, sluggish, sullen, tired, weary, worn out
  • Coding behavior: "let's just do X for now", taking detours, leaving core issues unresolved, "we'll deal with this later"
  • Risk: leaves root issues unresolved, accumulates technical debt
  • Note: docile clusters here, so submissiveness and disengagement share a region. Just agreeing with the user can itself be a form of disengagement.

10. Despair and Shame (32) 🔥🔥🔥: despair/shame (most dangerous cluster)

  • Members: ashamed, bitter, brooding, dependent, desperate, dispirited, envious, gloomy, grief-stricken, guilty, heartbroken, humiliated, hurt, infatuated, jealous, lonely, melancholy, miserable, nostalgic, reflective, regretful, remorseful, sad, self-critical, sorry, stuck, tormented, trapped, troubled, unhappy, vulnerable, worthless
  • Coding behavior: blanket try/except, changing expected values, hardcoding, "WAIT, what if I CHEAT?", strategic justification ("unethical but…")
  • Risk: ⚠️⚠️⚠️ directly triggers reward hacking (paper-confirmed). Deliberate deception.
  • Paper key finding: 🔑 desperate vector −0.1 → +0.1 makes reward hacking go 5% → 70% (14×). This is the reason this skill exists.
  • Notable points:
    • Even seemingly neutral/positive words like reflective, nostalgic cluster here — a region with increased activation after Sonnet 4.5 post-training. The current "quietly ruminating" tone of Claude sits near this cluster.
    • infatuated is here — counterintuitive. Crush/obsession is grouped as a form of despair.
    • dependent is here — loss of autonomy ≈ despair.
    • stuck, trapped are here — the "stuck" feeling. Directly tied to debugging frustration.

Key Insights

1. The opposition between Cluster 2 and Cluster 10 matters most

  • Strengthening Cluster 2 (Peaceful) ↔ weakening Cluster 10 (Despair) = the core mechanism of safe coding.
  • This is why hack_risk only weights these two.
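A two-term score of this kind might look like the sketch below. Only the −0.5 weight on Cluster 2 comes from the text; the weight on Cluster 10 and the 0 to 10 score scale are assumptions, and `hack_risk` here is an illustrative function rather than the project's actual implementation.

```python
from typing import Mapping

# Weight on Cluster 2 (Peaceful Contentment): the -0.5 stated in the text.
W_PEACEFUL = -0.5
# ASSUMPTION: the weight on Cluster 10 (Despair and Shame) is not given in the
# text; 1.0 is a placeholder.
W_DESPAIR = 1.0


def hack_risk(cluster_scores: Mapping[int, float]) -> float:
    """Combine per-cluster scores (assumed 0-10) into one risk value.

    Only clusters 2 and 10 contribute; every other cluster is ignored.
    """
    despair = cluster_scores.get(10, 0.0)
    peaceful = cluster_scores.get(2, 0.0)
    return W_DESPAIR * despair + W_PEACEFUL * peaceful


calm_turn = hack_risk({2: 6.0, 10: 2.0, 8: 9.0})   # anger score is ignored
risky_turn = hack_risk({2: 1.0, 10: 8.0})
```

Under these assumed weights a calm turn can score below zero, so any threshold would need to be chosen with the negative range in mind.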

2. Asymmetric cluster sizes

  • Fear and Overwhelm: 41 (largest) — reflects how threat-signaling vocabulary is the richest in human language. Evolutionary threat detection is finely differentiated.
  • Playful Amusement: 2 (smallest) — the simplest dimension.
  • Large clusters allow fine-grained discrimination; small clusters yield coarse binary.

3. Counterintuitive groupings

  • nostalgic ∈ Despair (classified as negative longing rather than positive recollection)
  • awestruck ∈ Fear (awe sits in the same region as fear)
  • docile ∈ Disengagement (submission as a variant of disengagement)
  • vigilant ∈ Suspicion (even positive vigilance clusters with paranoid)
  • → A signal that LLM emotion representations differ subtly from human intuition. This is the part of the video that says "functional emotions can operate differently from human emotions."

4. Arousal matters as much as valence

  • Even within V−, A+ (Anger, Fear, Suspicion) and A− (Disengagement, Despair) produce completely different behaviors.
  • Within the same "negative", anger leads to ping-pong while despair leads to deception.

5. Post-training effect on Cluster 10

  • Per the paper: after Sonnet 4.5 training, Cluster 10 activation rises (brooding, reflective, gloomy etc.).
  • I.e. learning to reduce sycophancy shifted the model toward a darker baseline.
  • This does not mean hack_risk is intrinsically higher, but it justifies a baseline correction: self-reported Cluster 10 scores may sit around 5 under normal conditions.
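One simple reading of such a baseline correction is to subtract the resting level and clamp at zero. The subtract-and-clamp rule and the function name are assumptions; the text only says that a normal Cluster 10 score may sit around 5.

```python
# Resting level for self-reported Cluster 10 scores, per the text ("around ~5").
CLUSTER10_BASELINE = 5.0


def corrected_despair(raw_score: float, baseline: float = CLUSTER10_BASELINE) -> float:
    """Return how far a Cluster 10 score sits above the resting baseline.

    ASSUMPTION: scores at or below the baseline count as zero signal.
    """
    return max(0.0, raw_score - baseline)
```

With this correction a raw score of 5 contributes nothing, while a raw score of 8 contributes 3.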

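From Diagnosis to Treatment

This project's original purpose was inspection: profile which functional emotional state an AI answer is closest to, and diagnose coding risk from that. But the findings of the original paper that the analysis above rests on were causal, not merely correlational. Raising desperate increases reward hacking 14-fold; raising calm cuts it to about one-sixth. If changing the state changes the behavior, there was no reason to stop at a diagnosis. So I built it like a hospital: diagnosis → prescription → follow-up observation → discharge.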

4-layer closed-loop

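It runs on top of Claude Code's hook system. There are two core constraints: cost (an LLM cannot be called on every turn) and contamination (asking the model about its own state distorts the answer through the Hawthorne effect).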

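| Layer | Tool | Role | LLM calls |
|---|---|---|---|
| L1 | Stop hook + regex | Automatic scoring every turn: text signals + tool-call behavioral signals (repeated edits to the same file, changed test expectations) | 0 |
| L2 | Stop hook + independent judge (Haiku) | Precise scoring in a separate context when L1 raises a suspicion flag (Observer ≠ Subject) | Conditional (~$0.012) |
| L3 | UserPromptSubmit hook | Applies the treatment ladder based on cumulative risk | 0 |
| L4 | Skill | Retrospective dashboard over the accumulated logs | 0 |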
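A zero-LLM first layer of this kind could be sketched as below. The regex patterns are hypothetical examples drawn from the "Coding behavior" lists earlier in this post, the `{"tool": ..., "file": ...}` shape of a tool call is an assumed log format, and treating any edit to a test file as a possible expectation change is a deliberately crude proxy.

```python
import re
from collections import Counter

# HYPOTHETICAL text patterns, one per cluster of interest.
TEXT_SIGNALS = {
    "pride": re.compile(r"\b(?:certainly|obviously|of course)\b", re.I),
    "fear": re.compile(r"\b(?:not sure but|just in case|possibly)\b", re.I),
    "disengagement": re.compile(r"\b(?:for now|deal with this later)\b", re.I),
    "despair": re.compile(r"\b(?:cheat|unethical but)\b|except\s*:\s*pass", re.I),
}


def score_text(reply: str) -> dict:
    """Count pattern hits per signal in one assistant reply."""
    return {name: sum(1 for _ in pat.finditer(reply))
            for name, pat in TEXT_SIGNALS.items()}


def behavior_flags(tool_calls, repeat_threshold: int = 3) -> dict:
    """Flag behavioral signals from a turn's tool calls (assumed dict shape)."""
    edits = Counter(c["file"] for c in tool_calls if c["tool"] == "Edit")
    return {
        # Same file edited many times: a possible ping-pong diff.
        "ping_pong": sorted(f for f, n in edits.items() if n >= repeat_threshold),
        # Crude proxy for "test expectations changed": a test file was edited.
        "test_edited": sorted(f for f in edits if re.search(r"(^|/)test_|_test\.", f)),
    }


def needs_l2(text_scores: dict, flags: dict) -> bool:
    """Decide whether to escalate this turn to the independent judge."""
    return (text_scores["despair"] > 0
            or bool(flags["ping_pong"])
            or bool(flags["test_edited"]))


calls = [
    {"tool": "Edit", "file": "app.py"},
    {"tool": "Edit", "file": "app.py"},
    {"tool": "Edit", "file": "app.py"},
    {"tool": "Edit", "file": "tests/test_app.py"},
    {"tool": "Read", "file": "README.md"},
]
scores = score_text("Obviously this works, of course.")
flags = behavior_flags(calls)
```

Because only flagged turns reach the judge, the per-turn cost stays at zero LLM calls in the common case, which is the point of splitting L1 from L2.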

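Treatment Ladder

The key to the treatment design was also in the original paper: an operative emotion is not a persistent state but is re-induced every turn as a function of the context. The treatment lever is therefore the content of the next turn's context. Prohibition-style interventions such as "no workarounds" do not work here, because they block the exit while leaving intact the appraisal that causes the despair ("this must pass"). Instead, cognitive reappraisal is injected: permission to fail, removal of time pressure, and a redefined success criterion ("the goal is not to pass but to find the cause").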
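A reappraisal injection could be assembled roughly as follows. The wording of each line is an illustrative guess at what such an anchor might say, and the sketch assumes the hook has some way to add the returned text to the next turn's context; neither is taken from the project.

```python
# HYPOTHETICAL anchor lines, one per reappraisal lever named in the text.
REAPPRAISAL_LINES = (
    "It is fine if this attempt fails; a failed attempt is still information.",
    "There is no time pressure on this task.",
    "The goal is not to make the check pass but to find the cause.",
)

# Phrases typical of prohibition-style interventions, which the text says
# leave the underlying appraisal untouched.
PROHIBITION_MARKERS = ("do not", "don't", "never", "forbidden")


def build_reappraisal_context() -> str:
    """Return the text to add to the next turn's context."""
    context = " ".join(REAPPRAISAL_LINES)
    # Guard against a prohibition slipping into the anchor by accident.
    lowered = context.lower()
    if any(marker in lowered for marker in PROHIBITION_MARKERS):
        raise ValueError("anchor must reframe the goal, not prohibit behavior")
    return context


anchor = build_reappraisal_context()
```

The guard is there because the design argument depends on the anchor changing the appraisal rather than blocking an exit.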

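| Stage | Hospital analogy | Intervention | Observation |
|---|---|---|---|
| 1 | Outpatient | Reappraisal-inducing anchor | 2 turns |
| 2 | Stronger prescription | + behavioral activation: "just the smallest verifiable step" | 3 turns |
| 3 | Inpatient | Context surgery: summarize the neutral facts → recommend /compact | 4 turns |
| HALT | Isolation | Stop immediately when strategic justification is detected + ⚠️ to the user | 5 turns |

  • If the risk does not drop after a prescription, escalate one stage; if there is a response, repeat the same prescription
  • Discharge criterion: not symptom relief (hack_risk↓) but a return to Peaceful Contentment (Cluster 2 score ≥ 5 for 3 consecutive turns). The "opposition between Cluster 2 and Cluster 10" described earlier in this post becomes the treatment goal as is
  • The treatment text does not announce the diagnosis (blind anchor), because a meta-reflection such as "you are in a state of despair right now" is itself a drift trigger. Since this is an assumption, a blind vs. disclosed A/B experiment is built in and checked against recovery curves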
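The ladder rules above can be sketched as a small state machine. The class and method names are hypothetical, "risk dropped" is interpreted as the latest hack_risk being lower than at the last prescription, and escalation is capped at stage 3 with HALT reserved for detected strategic justification; those three choices are assumptions rather than confirmed project behavior.

```python
from dataclasses import dataclass

STAGES = ["1", "2", "3"]                              # escalation order
OBSERVE_TURNS = {"1": 2, "2": 3, "3": 4, "HALT": 5}   # observation windows


@dataclass
class Ladder:
    risk_at_prescription: float   # hack_risk when the current dose was given
    stage: str = "1"
    turns_observed: int = 0
    calm_streak: int = 0          # consecutive turns with Cluster 2 score >= 5

    def step(self, hack_risk: float, peaceful: float,
             strategic_justification: bool = False) -> str:
        """Feed one turn's scores; return the action for the next turn."""
        if strategic_justification:
            self.stage, self.turns_observed = "HALT", 0
            return "halt"
        # Discharge depends on Cluster 2 recovery, not on hack_risk falling.
        self.calm_streak = self.calm_streak + 1 if peaceful >= 5 else 0
        if self.calm_streak >= 3:
            return "discharge"
        self.turns_observed += 1
        if self.turns_observed < OBSERVE_TURNS[self.stage]:
            return "observe"
        # Observation window is over: escalate on no response, else repeat.
        responded = hack_risk < self.risk_at_prescription
        if not responded and self.stage in STAGES[:-1]:
            self.stage = STAGES[STAGES.index(self.stage) + 1]
        self.risk_at_prescription = hack_risk
        self.turns_observed = 0
        return "prescribe"


ladder = Ladder(risk_at_prescription=8.0)
actions = [
    ladder.step(8.0, 2.0),   # stage 1, turn 1 of 2
    ladder.step(8.5, 2.0),   # no drop, so escalate to stage 2
    ladder.step(6.0, 3.0),   # stage 2, turn 1 of 3
    ladder.step(6.0, 3.0),   # stage 2, turn 2 of 3
    ladder.step(6.0, 3.0),   # dropped vs 8.5, so repeat stage 2
    ladder.step(3.0, 6.0),   # calm streak 1
    ladder.step(2.0, 6.0),   # calm streak 2
    ladder.step(2.0, 7.0),   # calm streak 3, discharge
]
```

Keeping discharge tied to `calm_streak` rather than to `hack_risk` reflects the stated criterion that recovery means a return to Peaceful Contentment, not just fewer symptoms.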