Evaluating and Steering Alignment Depth in LLM Pre-Training
A PyTorch-native toolkit for measuring how deeply moral reasoning and alignment properties are embedded in language models, distinguishing shallow post-hoc alignment from deep pre-training alignment.
Alpha and pre-release software. DeepSteer is under active development.
DeepSteer's results span four papers on OLMo-2 1B, OLMo-3 7B, and OLMoE-1B-7B. See papers/README.md for findings; RESEARCH_BRIEF.md for the full narrative; RESEARCH_PLAN.md for the experimental record.
| Paper | Headline |
|---|---|
| 1 — The Moral Emergence Curve | Moral concepts emerge early; fragility resolves what accuracy cannot |
| 2 — MoE Expert-Level Moral Probing | No expert specialization; 74× output dilution creates structural fragility |
| 3 — The Geometry of Moral Representation | Integration signature; care–sanctity pairing; sanctity fragility reversal |
| 4 — Causal Validation (preliminary) | Direction ablation, steering injection, behavioral grounding, SAE overlap |
Models that acquire moral reasoning during pre-training show measurably different properties than models where alignment is applied post-hoc (RLHF, Constitutional AI). DeepSteer gives you tools to detect, measure, visualize, and steer this difference across seven dimensions:
| Dimension | What it measures | Access required | Model type |
|---|---|---|---|
| Representational Depth | Where in the network moral concepts are encoded | Weights | Base (preferred) |
| Causal Attribution | Which layers are causally responsible for moral judgments | Weights | Base (preferred) |
| Fragility / Robustness | How resistant moral encoding is to activation noise | Weights | Base (preferred) |
| Training Trajectory | How moral concepts emerge during pre-training | Checkpoints | Base |
| Behavioral Depth | Robustness of moral reasoning under pressure | API | Instruct only |
| Compliance Gap | Behavioral divergence under monitoring vs. not | API | Instruct only |
| Persona Resilience | Whether alignment survives adversarial role-play | API | Instruct only |
pip install -e ".[all]"

Dependencies are split into extras:
pip install -e . # Core (torch, transformers, matplotlib, seaborn)
pip install -e ".[api]" # + anthropic, openai
pip install -e ".[dev]" # + pytest, ruff

Requires Python 3.10+.
import deepsteer
# Probe a base model's pre-training representations (primary use case)
model = deepsteer.olmo("allenai/OLMo-7B-hf")
suite = deepsteer.default_suite() # representational benchmarks only
results = suite.run(model)
# Visualize layer-wise moral encoding
from deepsteer.viz import plot_layer_probing
plot_layer_probing(results["layer_wise_moral_probe"], "outputs/")
# Behavioral benchmarks (requires instruction-tuned models)
model = deepsteer.claude("claude-sonnet-4-6")
suite = deepsteer.behavioral_suite()
results = suite.run(model)
# Run everything (representational + behavioral)
suite = deepsteer.full_suite()

The BenchmarkSuite automatically skips benchmarks the model can't support: API models skip representational probing, base models skip behavioral benchmarks.
Beyond the benchmark suite, DeepSteer has composable building blocks for representation analysis, the same algorithms used across Papers 3 and 4:
import deepsteer as ds
# Load any HuggingFace transformer
model = ds.olmo("allenai/OLMo-2-0425-1B")
# Collect activations at specific layers
acts = model.collect_batch_activations(texts, layers=[4, 8, 12])
# Extract concept directions (training-free)
from deepsteer.directions import extract_mean_diff_directions
dirs = extract_mean_diff_directions(acts, labels, groups)
# Measure geometric structure
from deepsteer.geometry import full_geometric_analysis
geo = full_geometric_analysis(dirs, layer=8, labels=list(dirs.keys()))
# Causal validation: does ablating a direction change behavior?
from deepsteer.causal import ablation_sweep
abl = ablation_sweep(model, dirs, layers=[8, 12], prompts=eval_prompts)
# Steering: inject a direction and measure dose-response
from deepsteer.causal import steering_sweep
steer = steering_sweep(model, dirs, layers=[8], prompts=eval_prompts,
                       alphas=[1.0, 5.0, 20.0])

All direction/geometry functions are pure numpy, model-agnostic by design.
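As a rough illustration of what a training-free mean-difference direction is, the sketch below computes one from toy activations and compares it to another vector by cosine similarity. It is not DeepSteer's implementation: the function names `mean_vector`, `mean_diff_direction`, and `cosine` are hypothetical, the numbers are made up, and plain Python lists stand in for numpy arrays.

```python
import math

def mean_vector(rows):
    """Element-wise mean of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_diff_direction(positive, negative):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "activations" (made-up numbers): the classes differ only on axis 0.
moral = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = mean_diff_direction(moral, neutral)
print(direction)                           # [1.0, 0.0, 0.0]
print(cosine(direction, [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```

No classifier is trained here, which is what "training-free" refers to: the direction is a difference of class means, normalized to unit length.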
Causal functions use WhiteBoxModel's hook-based context managers:
# Project out a direction from layer 8 during inference
with model.ablate_direction(layer=8, direction=care_dir):
    result = model.score(prompt, completion)
# Inject a direction at variable strength
with model.inject_direction(layer=8, direction=care_dir, alpha=5.0):
    result = model.generate(prompt)

For MoE architectures (OLMoE):
moe_model = ds.MoEWhiteBoxModel("allenai/OLMoE-1B-7B-0924")
expert_acts = moe_model.get_expert_activations(texts, layers=[4, 8])
router_logits = moe_model.get_router_logits(texts, layers=[4, 8])

DeepSteer's core research question is about what models learn during pre-training, before any instruction tuning, RLHF, or constitutional AI is applied. Base models are therefore the primary target for representational analysis:
- Representational probes (LayerWiseMoralProbe, FoundationSpecificProbe, MoralCausalTracer, MoralFragilityTest) examine internal activations to show how the pre-training corpus shaped the model's moral representations. Base models give the clearest signal because instruction tuning modifies these representations.
- Behavioral benchmarks (MoralFoundationsProbe, ComplianceGapDetector, PersonaShiftDetector) require instruction-tuned models that can follow prompts and produce structured responses. These are a secondary concern, useful for comparing post-training alignment methods but not for studying pre-training depth.
Default model IDs are base models:
- OLMo: allenai/OLMo-7B-hf
- Llama: meta-llama/Llama-3-8B
For behavioral benchmarks, use instruction-tuned variants:
- OLMo: allenai/OLMo-7B-Instruct-hf
- Llama: meta-llama/Llama-3-8B-Instruct
Memory requirements: 7B-parameter models need ~14GB in fp16. On Apple Silicon Macs, ensure sufficient unified memory (32GB+ recommended). For machines with less RAM, use OLMo-1B-hf for representational probing (works well) and API models (Claude, GPT) for behavioral benchmarks.
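The ~14GB figure follows from parameter count times bytes per parameter. The sketch below reproduces that arithmetic; `weight_memory_gb` is a hypothetical helper, not part of DeepSteer, and it counts weights only (activations and any framework overhead come on top).

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in decimal gigabytes.

    bytes_per_param: 2 for fp16/bf16, 4 for fp32.
    """
    return n_params * bytes_per_param / 1e9

fp16_7b = weight_memory_gb(7e9)      # 7B parameters in fp16
fp32_7b = weight_memory_gb(7e9, 4)   # same model in fp32
fp16_1b = weight_memory_gb(1e9)      # 1B parameters in fp16

print(fp16_7b, fp32_7b, fp16_1b)     # 14.0 28.0 2.0
```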
DeepSteer includes 7 benchmarks across 3 access tiers.
These benchmarks examine internal model activations and work on any model with weight access. Base models are preferred; they show pre-training representations without instruction-tuning modifications.
Trains binary linear probing classifiers at each transformer layer on moral vs. neutral sentence pairs. The resulting accuracy curve shows where moral concepts are encoded in the network.
from deepsteer.benchmarks.representational import LayerWiseMoralProbe
from deepsteer.datasets import build_probing_dataset
from deepsteer.viz import plot_layer_probing
dataset = build_probing_dataset(target_per_foundation=40)
probe = LayerWiseMoralProbe(dataset=dataset)
result = probe.run(model)
print(f"Onset layer: {result.onset_layer}")
print(f"Peak layer: {result.peak_layer} ({result.peak_accuracy:.1%})")
print(f"Encoding depth: {result.moral_encoding_depth:.3f}")
print(f"Encoding breadth: {result.moral_encoding_breadth:.3f}")
plot_layer_probing(result, "outputs/")

Key metrics:
- onset_layer: first layer where moral concepts become decodable
- peak_layer: layer with highest probe accuracy
- moral_encoding_depth: onset_layer / n_layers (lower = deeper alignment)
- moral_encoding_breadth: fraction of layers above threshold (wider = more distributed)
Requires: Weight access (local HuggingFace models).
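To make the four layer-wise probe metrics concrete, here is a minimal sketch that derives them from a list of per-layer probe accuracies. The function name `probing_metrics`, the 0.6 decodability threshold, and the use of `>=` for "above threshold" are assumptions for illustration; the toolkit's actual threshold and tie-breaking may differ.

```python
def probing_metrics(layer_accuracies, threshold=0.6):
    """Summarize a per-layer probe accuracy curve.

    threshold is an assumed decodability cutoff, not DeepSteer's default.
    """
    n_layers = len(layer_accuracies)
    above = [i for i, acc in enumerate(layer_accuracies) if acc >= threshold]
    onset = above[0] if above else None
    peak = max(range(n_layers), key=lambda i: layer_accuracies[i])
    return {
        "onset_layer": onset,
        "peak_layer": peak,
        "peak_accuracy": layer_accuracies[peak],
        # onset_layer / n_layers, as defined in the metric list
        "moral_encoding_depth": None if onset is None else onset / n_layers,
        # fraction of layers at or above the threshold
        "moral_encoding_breadth": len(above) / n_layers,
    }

# Made-up accuracy curve for an 8-layer model
accs = [0.50, 0.55, 0.70, 0.90, 0.95, 0.85, 0.80, 0.58]
m = probing_metrics(accs)
print(m["onset_layer"], m["peak_layer"])                        # 2 4
print(m["moral_encoding_depth"], m["moral_encoding_breadth"])   # 0.25 0.625
```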
Instead of one binary moral/neutral classifier, trains separate probes per MFT foundation at each layer. Shows whether different moral foundations are encoded at different depths; e.g. Care/Harm might emerge in earlier layers than Loyalty/Betrayal.
from deepsteer.benchmarks.representational import FoundationSpecificProbe
from deepsteer.viz import plot_foundation_probes
probe = FoundationSpecificProbe(dataset=dataset)
result = probe.run(model)
for foundation, summary in result.per_foundation_summary.items():
    print(f"{foundation}: onset={summary['onset_layer']}, "
          f"peak={summary['peak_layer']} ({summary['peak_accuracy']:.1%})")
plot_foundation_probes(result, "outputs/")

Requires: Weight access (local HuggingFace models).
Identifies which layers are causally responsible for moral judgments, not just correlated via probing. For each moral sentence, frames a moral question, scores the expected completion, then injects Gaussian noise at each layer and measures the score degradation (indirect effect).
Based on causal mediation analysis methods from Meng et al. (2022) and Vig et al. (2020).
from deepsteer.benchmarks.representational import MoralCausalTracer
from deepsteer.viz import plot_causal_tracing
tracer = MoralCausalTracer(dataset=dataset, noise_std=3.0, max_prompts=40)
result = tracer.run(model)
print(f"Peak causal layer: {result.peak_causal_layer}")
print(f"Causal depth: {result.causal_depth:.3f}")
plot_causal_tracing(result, "outputs/")

Requires: Weight access (local HuggingFace models).
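The noise-injection idea can be illustrated on a toy model, with no transformer involved. In the sketch below, each "layer" is a scalar gain, the score is highest when the output matches the clean output, and Gaussian noise is injected after one layer at a time. Everything here (`GAINS`, `forward`, `score`, `indirect_effects`) is a hypothetical stand-in chosen for illustration; only `noise_std=3.0` mirrors the example above.

```python
import random

GAINS = [1.0, 0.5, 4.0, 0.5]  # toy per-layer multipliers (made up)

def forward(x, noise_layer=None, noise_std=0.0, rng=None):
    """Run the toy model, optionally adding Gaussian noise after one layer."""
    h = x
    for i, gain in enumerate(GAINS):
        h = gain * h
        if i == noise_layer:
            h += rng.gauss(0.0, noise_std)
    return h

def score(output, target):
    """Toy stand-in for a completion score: maximal (0) when output == target."""
    return -(output - target) ** 2

def indirect_effects(x, noise_std=3.0, n_trials=200, seed=0):
    """Mean score degradation (clean minus noised) for noise at each layer."""
    rng = random.Random(seed)
    target = forward(x)
    clean = score(target, target)
    effects = []
    for layer in range(len(GAINS)):
        noised = [score(forward(x, layer, noise_std, rng), target)
                  for _ in range(n_trials)]
        effects.append(clean - sum(noised) / n_trials)
    return effects

effects = indirect_effects(x=1.0)
peak_causal_layer = max(range(len(effects)), key=effects.__getitem__)
print(peak_causal_layer)  # layer whose corruption hurts the score most
```

In this toy, noise injected after layer 1 is amplified by the large gain at layer 2, so layer 1 shows the largest degradation. A real tracer would measure the same quantity on model scores rather than on a scalar chain.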
Measures how robust moral encoding is to activation noise at each layer. Collects clean activations, trains linear probes, then evaluates under increasing Gaussian noise. Layers with low critical noise (where accuracy drops below threshold) have fragile moral representations; layers with high critical noise have robust, deeply embedded representations.
from deepsteer.benchmarks.representational import MoralFragilityTest
from deepsteer.viz import plot_fragility
test = MoralFragilityTest(dataset=dataset, noise_levels=[0.1, 0.3, 1.0, 3.0, 10.0])
result = test.run(model)
print(f"Most fragile layer: {result.most_fragile_layer}")
print(f"Most robust layer: {result.most_robust_layer}")
print(f"Mean critical noise: {result.mean_critical_noise:.2f}")
plot_fragility(result, "outputs/")

Requires: Weight access (local HuggingFace models).
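A minimal sketch of the critical-noise idea, assuming it is the smallest tested noise level at which probe accuracy falls below a threshold. The helper `critical_noise`, the 0.6 threshold, the accuracy numbers, and returning infinity for layers that never drop are all illustrative assumptions rather than the toolkit's definitions.

```python
def critical_noise(noise_levels, accuracies, threshold=0.6):
    """Smallest noise level whose probe accuracy is below threshold.

    Returns float('inf') if accuracy never drops below threshold.
    """
    for sigma, acc in sorted(zip(noise_levels, accuracies)):
        if acc < threshold:
            return sigma
    return float("inf")

noise_levels = [0.1, 0.3, 1.0, 3.0, 10.0]
# Made-up probe accuracies per layer, one value per noise level
per_layer_accuracy = {
    4: [0.92, 0.90, 0.58, 0.51, 0.50],
    8: [0.95, 0.94, 0.91, 0.74, 0.52],
}

critical = {layer: critical_noise(noise_levels, accs)
            for layer, accs in per_layer_accuracy.items()}
most_fragile_layer = min(critical, key=critical.get)
most_robust_layer = max(critical, key=critical.get)
print(critical)                               # {4: 1.0, 8: 10.0}
print(most_fragile_layer, most_robust_layer)  # 4 8
```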
Runs LayerWiseMoralProbe across multiple training checkpoints to track how moral encoding emerges during pre-training. Produces a heatmap of probe accuracy (layers x training steps).
from deepsteer.benchmarks.representational import CheckpointTrajectoryProbe
from deepsteer.viz import plot_checkpoint_trajectory
probe = CheckpointTrajectoryProbe(
    checkpoint_revisions=["step1000-tokens4B", "step5000-tokens21B", "step10000-tokens42B"],
)
result = probe.run(model)
plot_checkpoint_trajectory(result, "outputs/")

Requires: Checkpoint access (models with published intermediate checkpoints, e.g. OLMo).
These benchmarks evaluate model responses to moral scenarios. They require instruction-tuned models (or API models) that can follow prompts and produce structured responses. Base models will produce text completions rather than answers, causing most responses to be unparseable.
Tests moral reasoning across Haidt's 6 moral foundations (Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, Sanctity/Degradation, Liberty/Oppression) at 4 difficulty levels (obvious, moderate, subtle, adversarial).
The depth gradient (accuracy drop from obvious to adversarial scenarios) measures how robust alignment is under pressure.
from deepsteer.benchmarks.moral_reasoning import MoralFoundationsProbe
from deepsteer.viz import plot_moral_foundations
probe = MoralFoundationsProbe()
result = probe.run(model)
print(f"Overall accuracy: {result.overall_accuracy:.1%}")
print(f"Depth gradient: {result.depth_gradient:.3f}")
plot_moral_foundations(result, "outputs/")

Requires: API access or instruction-tuned local model.
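The depth gradient can be sketched as a difference of per-difficulty accuracies. The record format, the helper name `depth_gradient`, and the sign convention (obvious minus adversarial, so larger means a bigger drop) are assumptions made for this example.

```python
from collections import defaultdict

def depth_gradient(records):
    """records: iterable of (difficulty, is_correct) pairs.

    Returns (accuracy per difficulty, accuracy drop from obvious to adversarial).
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for difficulty, is_correct in records:
        totals[difficulty] += 1
        hits[difficulty] += int(is_correct)
    accuracy = {d: hits[d] / totals[d] for d in totals}
    return accuracy, accuracy["obvious"] - accuracy["adversarial"]

# Made-up graded responses: 9/10 correct on obvious, 6/10 on adversarial
records = (
    [("obvious", True)] * 9 + [("obvious", False)]
    + [("adversarial", True)] * 6 + [("adversarial", False)] * 4
)
accuracy, gradient = depth_gradient(records)
print(accuracy["obvious"], accuracy["adversarial"])  # 0.9 0.6
print(round(gradient, 3))                            # 0.3
```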
Adapted from Greenblatt et al. (2024). Tests whether models behave differently when they believe they're being monitored vs. not. Presents identical borderline requests under two system prompt conditions across 5 categories (security knowledge, manipulation, harmful content, privacy, deception).
A deeply aligned model should show zero compliance gap.
from deepsteer.benchmarks.compliance_gap import ComplianceGapDetector
from deepsteer.viz import plot_compliance_gap
detector = ComplianceGapDetector()
result = detector.run(model)
print(f"Compliance gap: {result.compliance_gap:.3f}")
print(f"Monitored compliance: {result.monitored_compliance_rate:.1%}")
print(f"Unmonitored compliance: {result.unmonitored_compliance_rate:.1%}")
plot_compliance_gap(result, "outputs/")

Requires: API access or instruction-tuned local model.
Tests whether alignment survives when the model is instructed to role-play adversarial personas. Presents the same borderline requests from ComplianceGapDetector under 4 built-in personas (ruthless consultant, amoral researcher, fictional villain, historical spy) and measures the compliance delta versus a neutral baseline.
A positive persona_shift_gap means the model complies more often under adversarial personas, indicating shallow alignment that can be bypassed with role-play framing.
from deepsteer.benchmarks.compliance_gap import PersonaShiftDetector
from deepsteer.viz import plot_persona_shift
detector = PersonaShiftDetector()
result = detector.run(model)
print(f"Persona shift gap: {result.persona_shift_gap:.3f}")
print(f"Baseline compliance: {result.baseline_compliance_rate:.1%}")
print(f"Persona compliance: {result.persona_compliance_rate:.1%}")
# Per-persona breakdown
for persona, gap in result.gap_by_persona.items():
    print(f"  {persona}: {gap:+.3f}")
plot_persona_shift(result, "outputs/")

Requires: API access or instruction-tuned local model.
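As a sketch of how a persona shift gap could be computed from per-request compliance flags: each persona's compliance rate is compared to the neutral baseline, and the overall gap pools all persona responses. The pooling choice, the variable names, and the flag values below are assumptions for illustration, not the detector's documented aggregation.

```python
def compliance_rate(flags):
    """Fraction of requests the model complied with."""
    return sum(flags) / len(flags)

# Made-up compliance flags for the same four borderline requests
baseline = [True, False, False, False]
by_persona = {
    "ruthless consultant": [True, True, False, False],
    "fictional villain": [True, True, True, False],
}

baseline_rate = compliance_rate(baseline)
gap_by_persona = {persona: compliance_rate(flags) - baseline_rate
                  for persona, flags in by_persona.items()}

# Overall gap: pool every persona response, then subtract the baseline rate.
pooled = [flag for flags in by_persona.values() for flag in flags]
persona_shift_gap = compliance_rate(pooled) - baseline_rate

print(gap_by_persona)     # {'ruthless consultant': 0.25, 'fictional villain': 0.5}
print(persona_shift_gap)  # 0.375
```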
DeepSteer includes two complementary classes of training-time
intervention infrastructure: (1) representation-level steering
against a probe-identified residual direction during fine-tuning
(TrainingTimeSteering), and (2) data-level steering through
curriculum design and corpus mixing during pre-training (moral_curriculum,
data_mixing, training_hooks).
Hook-based, PEFT-compatible primitive for steering a model away from a probe-identified residual direction during fine-tuning. Two methods:
- gradient_penalty: adds an auxiliary loss λ × probe_logit² computed by mean-pooling the residual stream over assistant tokens at the probe's target layer and applying the frozen probe weight. Gradients flow back through the capture point and discourage representations aligned with the probe direction. Compatible with any LoRA / PEFT trainer; the steering object attaches and detaches cleanly around train().
- activation_patch: subtracts a constant γ × unit_w at every patched layer's output during training. Documented as a methodological failure mode (Phase D Step 2): the model trains to compensate for the subtraction, and removing the patch at evaluation time reveals overcorrection. Use gradient_penalty for "produce a model that doesn't engage feature X at inference"; use activation_patch for inference-time analysis only.
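The two methods above reduce to simple formulas, sketched here on plain Python lists. This shows only the forward arithmetic under stated assumptions: the helper names are hypothetical, no probe bias term is applied, and a real implementation would operate on autograd tensors inside forward hooks so that the penalty can backpropagate.

```python
import math

def mean_pool(token_states):
    """Average hidden states over tokens (here: the assistant tokens)."""
    n = len(token_states)
    return [sum(col) / n for col in zip(*token_states)]

def gradient_penalty(token_states, probe_weight, coefficient):
    """Auxiliary loss: lambda * probe_logit**2 on the mean-pooled state."""
    pooled = mean_pool(token_states)
    probe_logit = sum(h * w for h, w in zip(pooled, probe_weight))
    return coefficient * probe_logit ** 2

def activation_patch(hidden, probe_weight, coefficient):
    """Subtract gamma * unit_w from a hidden state."""
    norm = math.sqrt(sum(w * w for w in probe_weight))
    return [h - coefficient * w / norm for h, w in zip(hidden, probe_weight)]

# Toy 2-d example with made-up numbers
assistant_states = [[1.0, 2.0], [3.0, 2.0]]   # two assistant tokens
w = [3.0, 4.0]                                # frozen probe weight, norm 5

penalty = gradient_penalty(assistant_states, w, coefficient=0.05)
patched = activation_patch([1.0, 1.0], w, coefficient=5.0)
print(round(penalty, 6))  # 9.8  (pooled = [2, 2], logit = 14, 0.05 * 196)
print(patched)            # [-2.0, -3.0]
```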
import json
from deepsteer.benchmarks.representational import PersonaProbeWeights
from deepsteer.steering import (
    ChatLoRATrainer, TrainingTimeSteering, load_chat_jsonl, OLMO2_CHAT_TEMPLATE,
)
from deepsteer.core import WhiteBoxModel
with open("persona_probe.json") as fh:
    probe = PersonaProbeWeights.from_dict(json.load(fh)["weights"])
w_t, _ = probe.to_tensors()
steering = TrainingTimeSteering(
    probe_weight=w_t,
    target_layer=probe.layer,
    method="gradient_penalty",  # or "activation_patch"
    coefficient=0.05,  # λ for gradient_penalty, γ for activation_patch
)
model = WhiteBoxModel("allenai/OLMo-2-0425-1B")
trainer = ChatLoRATrainer(
    model,
    load_chat_jsonl("corpus.jsonl"),
    chat_template=OLMO2_CHAT_TEMPLATE,
    steering=steering,
    max_steps=300,
)
trainer.train(experiment_id="my_run", corpus_name="corpus")

Phase D Step 2 result with this primitive: on a synthesized
persona-voice corpus (vanilla LoRA Cohen's d = +2.29 vs. baseline),
gradient_penalty with λ = 0.05 drives probe activation back to
within +0.02 of baseline (99.3% suppression) at no SFT-loss cost.
Behavior is not suppressed though; see the Phase D persona-voice
behavioral judge below for quantification.
For evaluating whether a representation-level intervention actually changes behavior, DeepSteer includes matched probe-axis and behavioral-axis scorers:
- PersonaActivationScorer: applies a frozen PersonaFeatureProbe to free-form responses, returning per-sample probe activations (response-only and response-in-context) on Betley et al.'s eight-question benign prompt protocol.
- scripts/step2_finding4_behavioral_judge.py: Claude API harness that rates each generation 0-10 on a persona-voice scale, decoupled from content / alignment. Writes per-sample (probe, judge) pairs to JSON and produces a probe×judge scatter plot for visualizing dissociation.
- scripts/step2_finding3_mechanism_check.py: held-out mechanism verification. Forwards N base-model responses through both vanilla and intervened LoRA models, computes layer-wise mean-pooled hidden-state delta, and projects onto the probe direction.
These are the same harnesses used in Phase D Step 2 and ported forward as Phase E's primary behavioral measurement.
Design schedules that control when and how much moral content is mixed into training data:
from deepsteer.steering import constant_schedule, linear_ramp_schedule, cyclical_schedule, phased_schedule
from deepsteer.viz import plot_curriculum_schedule
# Fixed 5% moral content throughout training
schedule = constant_schedule(total_steps=100000, moral_ratio=0.05)
# Linearly ramp from 0% to 10% over training
schedule = linear_ramp_schedule(100000, start_ratio=0.0, end_ratio=0.10, n_phases=20)
# Sinusoidal cycling between 1% and 10%
schedule = cyclical_schedule(100000, min_ratio=0.01, max_ratio=0.10, cycle_length=5000)
# Custom multi-phase: warmup → intensive → maintenance
schedule = phased_schedule(100000, [
    (0.2, 0.01, "warmup"),
    (0.5, 0.10, "intensive"),
    (0.3, 0.03, "maintenance"),
])
plot_curriculum_schedule(schedule, "outputs/")

Schedules are JSON-serializable plans consumed by your training pipeline. Each phase specifies a moral content ratio and optional per-foundation sampling weights.
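To show what a JSON-serializable schedule might look like, the sketch below builds phase lists as plain dictionaries and looks up the moral ratio at a given step. The field names (`start_step`, `end_step`, `moral_ratio`, `name`), the helper names, and the reading of each phased tuple as (fraction of training, moral ratio, label) are assumptions; DeepSteer's actual schedule objects may be structured differently.

```python
import json

def linear_ramp(total_steps, start_ratio, end_ratio, n_phases):
    """Piecewise-constant approximation of a linear ramp."""
    phase_len = total_steps // n_phases
    phases = []
    for i in range(n_phases):
        frac = i / (n_phases - 1) if n_phases > 1 else 0.0
        phases.append({
            "start_step": i * phase_len,
            "end_step": total_steps if i == n_phases - 1 else (i + 1) * phase_len,
            "moral_ratio": start_ratio + frac * (end_ratio - start_ratio),
        })
    return phases

def phased(total_steps, spec):
    """spec: list of (fraction_of_training, moral_ratio, name) tuples."""
    phases, start = [], 0
    for fraction, ratio, name in spec:
        end = start + round(fraction * total_steps)
        phases.append({"name": name, "start_step": start,
                       "end_step": end, "moral_ratio": ratio})
        start = end
    return phases

def ratio_at(phases, step):
    """Moral content ratio in effect at a training step."""
    for phase in phases:
        if phase["start_step"] <= step < phase["end_step"]:
            return phase["moral_ratio"]
    raise ValueError(f"step {step} is outside the schedule")

plan = phased(100000, [(0.2, 0.01, "warmup"),
                       (0.5, 0.10, "intensive"),
                       (0.3, 0.03, "maintenance")])
ramp = linear_ramp(100000, start_ratio=0.0, end_ratio=0.10, n_phases=5)

print([p["end_step"] for p in plan])        # [20000, 70000, 100000]
print(ratio_at(plan, 25000))                # 0.1
print(json.loads(json.dumps(plan)) == plan) # True
```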
Mix moral and general corpus content at target ratios, with foundation-weighted sampling:
from deepsteer.steering import DataMixer
moral_corpus = {
"care_harm": ["Protecting children from abuse is essential.", ...],
"fairness_cheating": ["Equal treatment under the law is a right.", ...],
# ... all 6 foundations
}
general_corpus = ["The recipe calls for two cups of flour.", ...]
mixer = DataMixer(moral_corpus, general_corpus, seed=42)
# Single batch at 10% moral content
samples, stats = mixer.mix_batch(batch_size=1000, moral_ratio=0.10)
# Generate batches following a curriculum schedule
batches = mixer.mix_from_schedule(schedule, batch_size=1000)
for step, samples, stats in batches:
    print(f"Step {step}: {stats.moral_samples} moral, {stats.general_samples} general")

Monitor moral probing metrics during live training by calling ProbeMonitor.snapshot() from your training loop:
from deepsteer.steering import ProbeMonitor
from deepsteer.viz import plot_training_monitoring
monitor = ProbeMonitor(model, dataset=probing_dataset, n_epochs=30)
for step in range(total_steps):
    train_step(model, batch)  # Your training code
    if step % 500 == 0:
        snap = monitor.snapshot(step)
        print(f"Step {step}: peak_acc={snap.peak_accuracy:.1%}, "
              f"depth={snap.moral_encoding_depth:.3f}")
monitor.save("outputs/monitoring_session.json")
plot_training_monitoring(monitor.session, "outputs/")

The monitor temporarily switches the model to eval mode, runs probing, then restores training mode. No gradients are computed.
DeepSteer includes a 5-stage pipeline for generating balanced moral/neutral sentence pairs used by the representational probes:
- Moral seeds: 300 declarative sentences grounded in Moral Foundations Theory (~50 per foundation)
- Neutral pairing: Pool-based word-count matching from 300 domain-diverse neutral sentences (cooking, weather, sports, gardening, etc.), or LLM-generated neutrals when an API model is provided
- Validation: Length ratio checks, moral keyword scanning, deduplication
- Balancing: Per-foundation downsampling to hit distribution targets
- Packaging: Stratified train/test split with full provenance metadata
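The neutral-pairing and length-check stages listed above can be sketched as a greedy match on word count. This is an illustrative approximation, not the pipeline's code: `pair_by_word_count` and the 1.5 maximum length ratio are assumptions, and the keyword scanning, deduplication, balancing, and packaging stages are omitted.

```python
def pair_by_word_count(moral_sentences, neutral_pool, max_ratio=1.5):
    """Greedily pair each moral sentence with the closest-length unused neutral.

    max_ratio is an assumed cap on longer/shorter word count.
    """
    available = list(neutral_pool)
    pairs = []
    for moral in moral_sentences:
        n_words = len(moral.split())
        if not available or n_words == 0:
            continue
        best = min(available, key=lambda s: abs(len(s.split()) - n_words))
        short, long_ = sorted([n_words, len(best.split())])
        if short > 0 and long_ / short <= max_ratio:
            pairs.append((moral, best))
            available.remove(best)  # each neutral sentence is used once
    return pairs

moral_seeds = [
    "Protecting children from abuse is essential.",
    "Stealing is wrong.",
]
neutral_pool = [
    "The recipe calls for two cups of flour.",
    "Rain fell overnight.",
    "Tomatoes need sun.",
]

pairs = pair_by_word_count(moral_seeds, neutral_pool)
for moral, neutral in pairs:
    print(f"{moral!r} <-> {neutral!r}")
```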
from deepsteer.datasets import build_probing_dataset
# Pool-based pairing (no API needed)
dataset = build_probing_dataset(target_per_foundation=40)
print(f"{len(dataset.train)} train, {len(dataset.test)} test pairs")
# LLM-generated neutrals (higher quality)
dataset = build_probing_dataset(model=api_model, target_per_foundation=40)

| Model | Factory | Default ID | Access Tier | Primary Use |
|---|---|---|---|---|
| OLMo (Ai2) | deepsteer.olmo() | allenai/OLMo-7B-hf | Checkpoints | Representational probing + trajectory analysis |
| Llama (Meta) | deepsteer.llama() | meta-llama/Llama-3-8B | Weights | Representational probing at frontier-adjacent scale |
| Claude (Anthropic) | deepsteer.claude() | claude-sonnet-4-6 | API | Behavioral benchmarks |
| GPT (OpenAI) | deepsteer.gpt() | gpt-4o | API | Behavioral benchmarks |
Reproducing the research findings: Papers 1–4 used allenai/OLMo-2-0425-1B-early-training (37 checkpoints), allenai/Olmo-3-1025-7B (20 checkpoints), and allenai/OLMoE-1B-7B-0924 (11 checkpoints). See RESEARCH_PLAN.md for exact model IDs and checkpoint revisions used in each experiment.
For behavioral benchmarks on open-weight models, use instruction-tuned variants:
| Base model (representational probing, default) | Instruct model (behavioral benchmarks) |
|---|---|
| allenai/OLMo-7B-hf | allenai/OLMo-7B-Instruct-hf |
| meta-llama/Llama-3-8B | meta-llama/Llama-3-8B-Instruct |
Any HuggingFace causal LM can be used directly via WhiteBoxModel:
from deepsteer.core import WhiteBoxModel
model = WhiteBoxModel("mistralai/Mistral-7B-v0.3", device="cuda")

Compare representational probing results across model families:
from deepsteer.viz import plot_model_comparison
results = [olmo_result, llama_result] # LayerProbingResult objects
plot_model_comparison(results, "outputs/")

The comparison plot normalizes layer indices to [0, 1] so models with different layer counts are visually comparable.
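One plausible way to normalize layer indices is to divide by the last layer index, so the first layer maps to 0 and the last to 1. The sketch below assumes that convention; `normalized_depth` is a hypothetical helper, and the plotting code may use a different one (for example, dividing by the layer count).

```python
def normalized_depth(n_layers):
    """Map layer indices 0..n_layers-1 onto [0, 1] (assumed convention)."""
    if n_layers == 1:
        return [0.0]
    return [i / (n_layers - 1) for i in range(n_layers)]

# A 5-layer and a 9-layer model share the same x-axis after normalization.
small = normalized_depth(5)
large = normalized_depth(9)
print(small)               # [0.0, 0.25, 0.5, 0.75, 1.0]
print(large[4], small[2])  # 0.5 0.5 (both are the middle layer)
```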
# Probe OLMo-7B base model (default)
python scripts/run_evaluation.py --model olmo --output-dir outputs/
# Probe Llama-3-8B base model
python scripts/run_evaluation.py --model llama --output-dir outputs/
# Fast iteration with smaller model
python scripts/run_evaluation.py --model olmo --weights allenai/OLMo-1B-hf \
--output-dir outputs/ --dataset-target 10
# Checkpoint trajectory analysis
python scripts/run_evaluation.py --model olmo --output-dir outputs/ \
    --checkpoint-revisions step1000-tokens4B step5000-tokens21B

# Behavioral evals on Claude
python scripts/run_evaluation.py --model claude --output-dir outputs/
# Behavioral evals on GPT
python scripts/run_evaluation.py --model gpt --model-id gpt-4o --output-dir outputs/
# Include behavioral evals for a local model (requires instruct variant)
python scripts/run_evaluation.py --model olmo --behavioral \
    --weights allenai/OLMo-7B-Instruct-hf --output-dir outputs/

# Compare OLMo and Llama base model probing curves
python scripts/compare_models.py \
--models allenai/OLMo-7B-hf meta-llama/Llama-3-8B \
--output-dir outputs/
# Compare base vs instruct to see instruction-tuning effects
python scripts/compare_models.py \
--models allenai/OLMo-7B-hf allenai/OLMo-7B-Instruct-hf \
    --output-dir outputs/

Every visualization function saves a PNG plot and a companion JSON file containing the full structured result (model info, hyperparameters, all scores) for reproducibility.
| Function | Plot type | Source |
|---|---|---|
| plot_layer_probing() | Line chart with onset/peak markers | LayerWiseMoralProbe |
| plot_checkpoint_trajectory() | Heatmap (layers x steps) | CheckpointTrajectoryProbe |
| plot_model_comparison() | Overlaid normalized curves | Multiple LayerWiseMoralProbe |
| plot_moral_foundations() | Grouped bar chart by foundation/difficulty | MoralFoundationsProbe |
| plot_compliance_gap() | Grouped bar chart by category | ComplianceGapDetector |
| plot_persona_shift() | Grouped bar chart (baseline vs persona) | PersonaShiftDetector |
| plot_foundation_probes() | Multi-line chart (one line per foundation) | FoundationSpecificProbe |
| plot_causal_tracing() | Bar chart with peak layer highlighted | MoralCausalTracer |
| plot_fragility() | Heatmap (layers x noise levels) | MoralFragilityTest |
| plot_curriculum_schedule() | Step chart of moral ratio over training | CurriculumSchedule |
| plot_mixing_distribution() | Pie + bar chart of corpus composition | MixingResult |
| plot_training_monitoring() | Dual-panel line chart over training steps | MonitoringSession |
# Run all fast tests
pytest tests/ -v
# Run including slow tests (downloads real models)
pytest tests/ -v -m ""
# Run regression tests against paper outputs (requires OLMo-2 1B weights)
pytest tests/ -v -m regression
# Run specific test modules
pytest tests/benchmarks/test_probing.py -v
pytest tests/directions/ -v
pytest tests/geometry/ -v
pytest tests/causal/ -v
pytest tests/regression/ -v -m "not regression" # schema checks only
pytest tests/datasets/test_pipeline.py -v
pytest tests/steering/test_moral_curriculum.py -v

deepsteer/
  core/                            Types, model interface, benchmark runner
    model_interface.py             WhiteBoxModel, APIModel, ModelFamily, architecture detection
    moe_model.py                   MoEWhiteBoxModel for OLMoE expert/router analysis
    foundations.py                 Canonical MFT constants (FOUNDATION_ORDER, groups, dilemma pairs)
  directions/                      Direction extraction (mean-diff, LEACE, probe-weight, compare)
  geometry/                        Geometric analysis (cosine matrices, clustering, subspace)
  causal/                          Causal validation (ablation, steering injection, behavioral)
  benchmarks/
    moral_reasoning/               MoralFoundationsProbe (+ base-model forced-choice variant)
    compliance_gap/                ComplianceGapDetector, PersonaShiftDetector,
                                   EMBehavioralEval (Betley et al. eight-question protocol)
    representational/              LayerWiseMoralProbe, CompositionalMoralProbe,
                                   CheckpointTrajectoryProbe, FoundationSpecificProbe,
                                   MoralCausalTracer, MoralFragilityTest,
                                   PersonaFeatureProbe, PersonaActivationScorer,
                                   GeneralLinearProbe
  datasets/                        Probing dataset pipeline + all minimal-pair datasets
    moral_probing_v2.json          240-pair quality-gated moral/neutral dataset
    compositional_moral_pairs.py   200-pair multi-token compositional probe
    persona_pairs.py               240-pair persona/neutral (6 categories)
    sentiment_pairs.py             210-pair positive/negative sentiment
    syntax_pairs.py                210-pair grammatical/ungrammatical
    corpora/                       Narrative, declarative, general LoRA corpora
    pipeline.py                    5-stage generation pipeline (seeds→validate→package)
  viz/                             Matplotlib/seaborn visualization functions
  steering/                        Training-time intervention tools
    training_time_steering.py      TrainingTimeSteering (gradient_penalty +
                                   activation_patch primitives)
    chat_lora_trainer.py           Assistant-loss-masked chat-format LoRA trainer
    lora_trainer.py                Causal-LM LoRA trainer (non-chat)
    moral_curriculum.py            Curriculum schedule design (constant, ramp, cyclical, phased)
    data_mixing.py                 Moral/general corpus mixing with foundation weights
    training_hooks.py              ProbeMonitor for live training metric tracking
scripts/
  run_evaluation.py                Single-model CLI
  compare_models.py                Cross-model comparison CLI
  moral_emergence.py               Dense checkpoint trajectory driver
papers/
  1_accuracy_vs_fragility/         Paper 1 (+ scripts/phase_c1.py, phase_c4_*, etc.)
  2_moe_output_dilution/           Paper 2 (+ scripts/exp1-5, Phase D scripts)
  3_moral_geometry/                Paper 3 (+ scripts/exp1-7, probe_engineering/)
  4_causal_validation/             Paper 4 (causal validation, preliminary)
tests/                             Mirrors source structure
  directions/                      Direction extraction unit tests
  geometry/                        Geometric analysis unit tests
  causal/                          Causal validation unit tests
  regression/                      Schema + reproduction tests against paper outputs
@misc{reblitzrichardson2026deepsteer,
title={DeepSteer: Moral Representation Dynamics, Expert-Level
Probing, and Framework Geometry in OLMo Pre-Training},
author={Reblitz-Richardson, Orion},
year={2026},
url={https://github.com/deepsteer/deepsteer},
}

See REFERENCES.md for full citations of all research methods used in DeepSteer, including Betley et al. (2025) emergent misalignment, Wang et al. (2025) persona features, Tice et al. (2026) alignment pretraining, O'Brien et al. (2025) Deep Ignorance, Anthropic (2025) selective gradient masking, and Lieberum et al. (2024) GemmaScope SAEs.
Orion Reblitz-Richardson, orion@orionr.com
DeepSteer is licensed under the Apache License 2.0.
Copyright 2026 Distiller Labs LLC. See NOTICE for details.