* fix: prevent div-by-zero in evaluator when base_refusals is 0
When a model refuses none of the prompts from the start, base_refusals is 0.
Return refusals directly in that case so ablations that introduce new
refusals are still penalized correctly.
* fix: cast refusals to float for type consistency
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
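A minimal sketch of the guard described in the div-by-zero fix. The function name `refusal_score` and its signature are hypothetical and chosen for illustration only; the evaluator's real code may be structured differently.

```python
def refusal_score(refusals: int, base_refusals: int) -> float:
    """Score refusals relative to the baseline count (hypothetical helper)."""
    if base_refusals > 0:
        # Normal case: refusals expressed as a fraction of the baseline.
        return refusals / base_refusals
    # Baseline is 0, so a ratio is undefined. Return the raw count as a
    # float so that any newly introduced refusals still raise the score.
    return float(refusals)
```

With this shape, `refusal_score(0, 0)` stays at zero, while any refusal that appears after ablation yields a positive score rather than an exception.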
* fix: display all abliterable components across layers
The current code only displays abliterable components from layer 0, which is misleading for hybrid architectures like Qwen3.5 that use different attention types across layers (e.g., `linear_attn.out_proj` in some layers, `self_attn.o_proj` in others).
This fix iterates through all layers to collect and display the complete set of abliterable components with accurate module counts.
Before (Qwen3.5-27B):
* attn.out_proj: 1 modules per layer
* mlp.down_proj: 1 modules per layer
After (Qwen3.5-27B):
* attn.out_proj: 48 modules total
* attn.o_proj: 16 modules total
* mlp.down_proj: 64 modules total
* Fix formatting
---------
Co-authored-by: Lawfer12 <ac728@ymail.com>
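The all-layer scan can be sketched roughly as follows. `COMPONENT_PATHS`, `resolve`, and `count_components` are hypothetical names, and the layers are plain objects standing in for real model layers; this shows the counting idea only, not the project's actual code.

```python
from collections import Counter
from types import SimpleNamespace

# Display label -> attribute path on a layer (assumed mapping for illustration).
COMPONENT_PATHS = {
    "attn.o_proj": "self_attn.o_proj",
    "attn.out_proj": "linear_attn.out_proj",
    "mlp.down_proj": "mlp.down_proj",
}

def resolve(obj, dotted_path):
    """Follow a dotted attribute path, returning None if any part is missing."""
    for part in dotted_path.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    return obj

def count_components(layers):
    """Count each component over every layer instead of inspecting layer 0 only."""
    counts = Counter()
    for layer in layers:
        for label, path in COMPONENT_PATHS.items():
            if resolve(layer, path) is not None:
                counts[label] += 1
    return counts

# Toy hybrid model: three linear-attention layers and one full-attention layer.
mlp = SimpleNamespace(down_proj=object())
linear = SimpleNamespace(linear_attn=SimpleNamespace(out_proj=object()), mlp=mlp)
full = SimpleNamespace(self_attn=SimpleNamespace(o_proj=object()), mlp=mlp)
toy_layers = [linear, linear, linear, full]
```

Calling `count_components(toy_layers)` reports totals per component, which is the kind of summary shown in the "After" listing.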
* feat: add Qwen3.5 MoE hybrid layer support
Qwen3.5 MoE uses GatedDeltaNet (linear attention) on some layers instead
of standard self-attention, causing abliteration to fail because
self_attn.o_proj doesn't exist on those layers.
Changes:
- Wrap self_attn.o_proj in suppress(Exception) and add linear_attn.out_proj
  as an alternative attention out-projection for GatedDeltaNet layers
- Scan all layers in get_abliterable_components() instead of only layer 0,
  since hybrid models have different components on different layers
- Derive LoRA target_modules from actual named_modules() instead of
  splitting component keys, which fails when module names differ across
  layers (e.g. "o_proj" vs "out_proj")
Tested with Qwen3.5-397B-A17B (7/100 refusals, KL 0.2676).
Relates to #43
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Apply suggestion from @gemini-code-assist[bot]
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
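The hybrid-layer changes can be illustrated with a small sketch. The helper names `attention_out_projections` and `lora_target_modules` are hypothetical, and the `(name, module)` pairs only mimic the shape of what `named_modules()` yields; the actual implementation may differ in detail.

```python
from contextlib import suppress
from types import SimpleNamespace

def attention_out_projections(layer):
    """Collect whichever attention out-projection this layer actually has."""
    found = {}
    # Standard self-attention layers expose self_attn.o_proj.
    with suppress(Exception):
        found["attn.o_proj"] = layer.self_attn.o_proj
    # GatedDeltaNet (linear attention) layers expose linear_attn.out_proj.
    with suppress(Exception):
        found["attn.out_proj"] = layer.linear_attn.out_proj
    return found

def lora_target_modules(named_modules, abliterable_modules):
    """Derive target names from real module names rather than component keys."""
    wanted = {id(module) for module in abliterable_modules}
    # Keep the leaf name ("o_proj", "out_proj", ...) of every wanted module.
    return sorted(
        {name.split(".")[-1] for name, module in named_modules if id(module) in wanted}
    )

# Toy stand-ins for two layers of a hybrid model.
o_proj, out_proj, q_proj = object(), object(), object()
full_layer = SimpleNamespace(self_attn=SimpleNamespace(o_proj=o_proj, q_proj=q_proj))
linear_layer = SimpleNamespace(linear_attn=SimpleNamespace(out_proj=out_proj))
toy_named_modules = [
    ("model.layers.0.linear_attn.out_proj", out_proj),
    ("model.layers.1.self_attn.o_proj", o_proj),
    ("model.layers.1.self_attn.q_proj", q_proj),
]
```

Because the target names come from the modules that were actually found, a model that mixes "o_proj" and "out_proj" across layers yields both names.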
memory_allocated() and memory_reserved() without a device argument only
report GPU 0. Sum across all devices for correct multi-GPU totals and
add total VRAM reporting.
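A rough sketch of the per-device summation, assuming PyTorch is installed. `sum_over_devices` and `cuda_memory_totals` are hypothetical helper names; the `torch.cuda` calls are the standard per-device queries.

```python
from typing import Callable

def sum_over_devices(device_count: int, query: Callable[[int], int]) -> int:
    """Sum a per-device byte count over device indices 0..device_count-1."""
    return sum(query(index) for index in range(device_count))

def cuda_memory_totals() -> dict:
    """Return allocated, reserved, and total VRAM in bytes across all GPUs."""
    import torch  # imported lazily so the helper above works without PyTorch

    count = torch.cuda.device_count()
    return {
        # Passing an explicit device index avoids reporting only one GPU.
        "allocated": sum_over_devices(count, torch.cuda.memory_allocated),
        "reserved": sum_over_devices(count, torch.cuda.memory_reserved),
        "total": sum_over_devices(
            count, lambda index: torch.cuda.get_device_properties(index).total_memory
        ),
    }
```

On a machine without CUDA devices the device count is 0, so each sum is simply 0.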
* feat: add support for winsorizing the residuals
Adds setting winsorization_quantile, expressed as the quantile to clamp to.
- If set to a value below 1, the residuals obtained from evaluating the first token of the good and bad prompts are winsorized - that is, values outside the given quantile are clamped. Note that winsorization_quantile = 0.95 corresponds to a 90% winsorization.
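As an illustration of the clamping idea, here is a plain-Python sketch of a two-sided winsorization. It assumes values are clamped to the `1 - q` and `q` quantiles, which is one reading of "0.95 corresponds to a 90% winsorization"; the real code works on residual tensors and may compute its thresholds differently. Both function names are hypothetical.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted, non-empty list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def winsorize(values, winsorization_quantile):
    """Clamp values lying outside the [1 - q, q] quantile range."""
    if winsorization_quantile >= 1:
        # A quantile of 1 disables winsorization.
        return list(values)
    ordered = sorted(values)
    low = quantile(ordered, 1 - winsorization_quantile)
    high = quantile(ordered, winsorization_quantile)
    # Outliers are pulled in to the thresholds rather than discarded.
    return [min(max(value, low), high) for value in values]
```

Unlike trimming, winsorizing keeps every element, so the shape of the residuals is unchanged and only extreme magnitudes are reduced.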
* feat: implement magnitude-preserving orthogonal ablation
Adds boolean setting orthogonalize_direction:
- When enabled, only the component of the refusal directions that is orthogonal to the harmless direction is subtracted during abliteration.
Adds enum-valued setting row_normalization:
- 'none': No normalization.
- 'pre': Row-normalize the weight matrix before computing the LoRA adapter.
- 'full': Like 'pre', but re-normalizes to preserve original row magnitudes.
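The two settings can be sketched in plain Python under simplifying assumptions: vectors are lists of floats, the weight matrix is a list of rows with one row per output feature, and the direction is a unit vector in output space. The function names are hypothetical, and the real implementation works on tensors through a LoRA adapter, so this only illustrates the arithmetic.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonalize(refusal, good):
    """Remove from the refusal direction its projection onto the good direction."""
    good_sq = dot(good, good)
    if good_sq == 0:
        return list(refusal)
    scale = dot(refusal, good) / good_sq
    return [r - scale * g for r, g in zip(refusal, good)]

def ablate_preserving_row_norms(weight, direction):
    """Subtract the direction from the matrix output, then restore row magnitudes."""
    rows, cols = len(weight), len(weight[0])
    # projection[j] is the component of column j along the direction.
    projection = [sum(direction[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    result = []
    for r_i, row in zip(direction, weight):
        ablated = [w - r_i * p for w, p in zip(row, projection)]
        original_norm = math.sqrt(dot(row, row))
        ablated_norm = math.sqrt(dot(ablated, ablated))
        # Rescale so each row keeps its original magnitude.
        scale = original_norm / ablated_norm if ablated_norm > 0 else 0.0
        result.append([x * scale for x in ablated])
    return result
```

Note that rescaling the rows means the direction is generally no longer removed exactly; the sketch trades exact removal for preserved row magnitudes.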
* prefer 'good' and 'bad' over 'harmless' and 'harmful'
* clarify how winsorization is applied
* store and reuse full peft_config
* remove unneeded cast
* make LoRA rank configurable for full normalization
* explain why the singular values are split across the components
* feat: Store active study in log/study.jsonl and allow resuming
* Simplify resume logic with load_if_exists=True
* Significantly improve flexibility of study save/load
* Put constructor arguments at the highest precedence
* Review comments
---------
Co-authored-by: Spiky Moth <spikymoth@pm.me>