Allen Schmaltz

Photo

Computer Scientist

Founder, Reexpress AI, Inc.

Research Papers

Work in Progress

Allen Schmaltz. 2025. Similarity-Distance-Magnitude Language Models. arXiv preprint arXiv:2510.26183. Code.

This work applies Similarity-Distance-Magnitude (SDM) estimators to the post-training of language models, jointly training sequence-level output verification with next-token prediction. Work in progress: larger-scale experiments and models are in development.

Artificial Intelligence / Natural Language Processing

Allen Schmaltz. 2026. Introspectable, Updatable, and Uncertainty-aware Classification of Language Model Instruction-following. System Demonstration Paper.

To appear in the 1st ACM Conference on AI and Agentic Systems: Demos (ACM CAIS'26 Demos), San Jose, CA, USA.

This system demonstration paper introduces an approach for constructing Similarity-Distance-Magnitude estimators for the task of binary classification of instruction-following of closed-weight language models.

Allen Schmaltz. 2025. Similarity-Distance-Magnitude Activations. arXiv preprint arXiv:2509.12760. Code.

To appear in Findings of the Association for Computational Linguistics: ACL 2026, San Diego, CA, USA.

This work introduces Similarity-Distance-Magnitude (SDM) activation functions and SDM estimators, which are more robust and interpretable estimators of predictive uncertainty than those based on the standard softmax function.

Allen Schmaltz and Danielle Rasooly. 2022. Introspection, Updatability, and Uncertainty Quantification with Transformers: Concrete Methods for AI Safety.

December 2022, ML Safety Workshop, 36th Conference on Neural Information Processing Systems (NeurIPS 2022). Poster.

Allen Schmaltz and Danielle Rasooly. 2022. Approximate Conditional Coverage & Calibration via Neural Model Approximations. arXiv preprint arXiv:2205.14310.

Spotlight talk, July 2022, Workshop on Distribution-Free Uncertainty Quantification at the Thirty-ninth International Conference on Machine Learning (ICML 2022), Baltimore, Maryland.

Allen Schmaltz. 2021. Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition. Computational Linguistics. https://doi.org/10.1162/coli_a_00416. Online Appendix. Code.

Introduces instance-based, metric-learner approximations of neural network models and hard-attention mechanisms that can be constructed with task-specific inductive biases for effective semi-supervised learning (i.e., feature detection). These mechanisms combine to yield effective methods for interpretability-by-exemplar over the representation space of neural models.

This journal article was presented at EMNLP 2021. Of additional note, the pattern of neural network [self-]consistency introduced by this work motivated later uncertainty quantification methods: predictions that are harder to approximate with metric learners (i.e., those for which the approximation and the original model disagree) tend to come from regions of the distribution that are more difficult for the model to predict correctly relative to the ground-truth labels.
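The disagreement signal described above can be illustrated with a minimal, hypothetical sketch (not the paper's implementation): approximate a classifier's predictions with a 1-nearest-neighbor "metric learner" over its representation space, and flag test points where the two disagree as higher-uncertainty predictions. All names and the toy data here are illustrative assumptions.

```python
# Hypothetical sketch: flag disagreement between a model and its
# nearest-neighbor approximation as a simple uncertainty signal.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data: class 0 near the origin, class 1 shifted to (3, 3).
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])

def model_predict(X):
    # Stand-in for the original network: a fixed linear decision rule.
    return (X.sum(axis=1) > 3.0).astype(int)

def knn_predict(X, X_ref, y_ref):
    # 1-NN "metric learner" approximation over the (here: raw input)
    # representation space; real usage would use hidden representations.
    dists = np.linalg.norm(X[:, None, :] - X_ref[None, :, :], axis=2)
    return y_ref[dists.argmin(axis=1)]

model_preds = model_predict(X_test)
approx_preds = knn_predict(X_test, X_train, y_train)

# Points where the approximation and the model disagree are treated as
# harder-to-approximate, and thus higher-uncertainty, predictions.
flagged = np.flatnonzero(model_preds != approx_preds)
print(f"{len(flagged)} of {len(X_test)} test predictions flagged as uncertain")
```

In practice the neighbor search would run over the model's learned representations rather than raw inputs, so that disagreement reflects the model's own notion of similarity.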

Allen Schmaltz. 2019. Learning to Order & Learning to Correct. Harvard University, Ph.D. dissertation, Computer Science.

Allen Schmaltz, Yoon Kim, Alexander Rush, and Stuart Shieber. 2017. Adapting Sequence Models for Sentence Correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2807-2813, Copenhagen, Denmark, September. Association for Computational Linguistics. https://www.aclweb.org/anthology/D17-1298. (Appendix) (.bib)

Allen Schmaltz, Yoon Kim, Alexander M. Rush, and Stuart Shieber. 2016. Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 242-251, San Diego, CA, USA, June. Association for Computational Linguistics. https://www.aclweb.org/anthology/W16-0528. (.bib)

This was the top-ranking system for the Automated Evaluation of Scientific Writing (AESW) Shared Task 2016. The sequence model is among the earliest examples of using a neural-network-based language model (LM) as a classifier, an approach that has since evolved into what is informally referred to as "LLM-as-a-Judge". The model also, in effect, introduced what are now called "self-explanations" or "self-rationalizations" from an LM used as a classifier: the document-level classification decision is determined by the local-level diffs generated by the LM.

Allen Schmaltz, Alexander M. Rush, and Stuart Shieber. 2016. Word Ordering Without Syntax. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2319-2324, Austin, TX, USA, November. Association for Computational Linguistics. https://aclweb.org/anthology/D16-1255. (.bib)

Demonstrated that multi-layer networks can encode hierarchical language structures without explicit human annotations. Prior to this work, the prevailing view in NLP and computational linguistics was that neural language models would need to be trained with human-annotated syntactic structures to model syntax.

Medicine and Public Health

Allen Schmaltz and Andrew L. Beam. 2020. Sharpening the Resolution on Data Matters: A Brief Roadmap for Understanding Deep Learning for Medical Data. The Spine Journal. https://doi.org/10.1016/j.spinee.2020.08.012.

Andrew L. Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, Griffin Weber, Nathan P. Palmer, Xu Shi, Tianxi Cai, and Isaac S. Kohane. 2020. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. In Proceedings of the Pacific Symposium on Biocomputing (PSB) 25, pages 295-306. arXiv:1804.01486.

Public Policy

Allen Schmaltz. 2018. On the Utility of Lay Summaries and AI Safety Disclosures: Toward Robust, Open Research Oversight. In Proceedings of the Second ACL Workshop on Ethics in Natural Language Processing, pages 1-6, New Orleans, LA, USA, June. Association for Computational Linguistics. https://aclweb.org/anthology/W18-0801. (.bib)

Quantitative Social Science

Wenxin Jiang, Gary King, Allen Schmaltz, and Martin A. Tanner. 2019. Ecological Regression with Partial Identification. Political Analysis. https://doi.org/10.1017/pan.2019.19.

Technical Reports

Each of these papers introduced novel methods when it first appeared on arXiv, and each is of lasting interest for the reasons described below.

Allen Schmaltz. 2025. Similarity-Distance-Magnitude Universal Verification. arXiv preprint arXiv:2502.20167. Code.

This was an earlier introduction to SDM activation functions. The approaches for constructing estimators over SDM activations and integrating SDM activations into sequence prediction architectures have been superseded by the simpler formulations in "Similarity-Distance-Magnitude Activations" (2025) and "Similarity-Distance-Magnitude Language Models" (2025).

Allen Schmaltz and Andrew Beam. 2020. Coarse-to-Fine Memory Matching for Joint Retrieval and Classification. arXiv preprint arXiv:2012.02287.

Introduces interpretability-by-exemplar for multi-stage retrieval and classification with a single model, including feature detection via alignment of bi-encoded sequences. Includes a method for beam search through the search graph of bi- and cross-encoded sequences, and an early approach for constraining the output of a retrieval system based on dense matching into the support set.

This is, in effect, an early example of test-time compute with a Transformer language model. Instead of using reinforcement learning, multi-stage search is learned end-to-end via a contrastive loss over bi- and cross-encoded sequences. See these presentation slides from 2021 for a high-level overview.

Allen Schmaltz and Andrew Beam. 2020. Exemplar Auditing for Multi-Label Biomedical Text Classification. arXiv preprint arXiv:2004.03093.

Introduces a loss and inductive bias for a hard-attention mechanism suitable for high-dimensional multi-label classification tasks. This illustrates the generality of the hard-attention mechanism introduced in "Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition", with which it becomes straightforward to model task-specific inductive biases suitable for effective semi-supervised learning.