Computer Scientist
Founder, Reexpress AI, Inc.
Artificial Intelligence / Natural Language Processing
Allen Schmaltz. 2025. Similarity-Distance-Magnitude Language Models. arXiv preprint arXiv:2510.26183. Code.
Abstract. We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which are sequence prediction models fine-tuned to maximize the proportion of generations in the well-calibrated, high-probability region partitioned by a final-layer SDM activation layer used for binary classification of instruction-following. We demonstrate that existing pre-trained decoder-only Transformer LMs can be readily converted into SDM LMs via supervised fine-tuning, using the final-layer SDM activation layer during training to estimate a change-of-base for a supervised next-token loss over a contrastive input encoding scheme, with additional hard negative examples generated online during training. This results in reduced abstentions (i.e., improved statistical efficiency) compared to strong supervised baselines.
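The abstention behavior described in the abstract can be illustrated with a minimal, hypothetical selective-prediction loop. This is a sketch of the general idea only, not the paper's estimator: generations whose calibrated probability of instruction-following falls below a threshold are withheld, and improved statistical efficiency corresponds to a lower abstention rate at a fixed reliability level. The function name, threshold value, and input format are all illustrative assumptions.

```python
def selective_predictions(items, threshold=0.9):
    """Toy selective prediction over (answer, calibrated_probability) pairs.

    Answers whose calibrated probability meets the threshold are emitted;
    the rest are abstentions. A better-calibrated model concentrates more
    mass above the threshold, lowering the abstention rate.
    """
    answered, abstained = [], 0
    for answer, prob in items:
        if prob >= threshold:
            answered.append(answer)
        else:
            abstained += 1
    abstention_rate = abstained / len(items)
    return answered, abstention_rate
```

For example, with three candidate generations at probabilities 0.95, 0.5, and 0.91 and a 0.9 threshold, two are emitted and one is withheld, for an abstention rate of one third.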
Allen Schmaltz. 2025. Similarity-Distance-Magnitude Activations. arXiv preprint arXiv:2509.12760. Code.
Abstract. We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.
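As a rough intuition for how an SDM-style activation differs from a plain softmax, the toy sketch below scales the softmax temperature using placeholder similarity and distance signals, so that outputs sharpen when an input closely matches correctly predicted training examples and flatten when the input is far from the training distribution. The functional form, parameter names, and scaling are purely illustrative assumptions and are not the paper's formulation; see the paper and code linked above for the actual activation.

```python
import numpy as np

def sdm_style_activation_sketch(logits, similarity, distance, eps=1e-8):
    """Illustrative only: a softmax whose temperature grows with distance
    to the training distribution and shrinks with similarity (depth of
    correctly predicted matches into training), so the output distribution
    is sharp near well-matched training data and flat far from it."""
    logits = np.asarray(logits, dtype=float)
    temperature = (1.0 + distance) / (1.0 + similarity + eps)
    z = logits / temperature
    z = z - z.max()  # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()
```

Under this sketch, the same logits yield a more peaked distribution for a high-similarity, low-distance input than for a dissimilar, out-of-distribution one, which is the qualitative behavior the abstract describes.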
Allen Schmaltz and Danielle Rasooly. 2022. Introspection, Updatability, and Uncertainty Quantification with Transformers: Concrete Methods for AI Safety.
December 2022, ML Safety Workshop, 36th Conference on Neural Information Processing Systems (NeurIPS 2022). Poster.
Allen Schmaltz and Danielle Rasooly. 2022. Approximate Conditional Coverage & Calibration via Neural Model Approximations. arXiv preprint arXiv:2205.14310.
Spotlight talk, July 2022, Workshop on Distribution-Free Uncertainty Quantification at the Thirty-ninth International Conference on Machine Learning (ICML 2022), Baltimore, Maryland.
Allen Schmaltz. 2021. Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition. Computational Linguistics. https://doi.org/10.1162/coli_a_00416. Online Appendix. Code.
Introduces instance-based, metric-learner approximations of neural network models and hard-attention mechanisms that can be constructed with task-specific inductive biases for effective semi-supervised learning (i.e., feature detection). These mechanisms combine to yield effective methods for interpretability-by-exemplar over the representation space of neural models.
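A minimal illustration of interpretability-by-exemplar (a generic nearest-neighbor sketch, not the paper's specific mechanism): cache labeled training-set representations and, at inference, match a query representation to its nearest training exemplar, so that each prediction can be audited against a concrete labeled example from training.

```python
import numpy as np

def nearest_exemplar(query, train_reps, train_labels):
    """Return the index, label, and distance of the training exemplar
    whose cached representation is closest (in L2 distance) to the query.
    The matched exemplar serves as a human-inspectable justification
    for the model's behavior on the query."""
    query = np.asarray(query, dtype=float)
    train_reps = np.asarray(train_reps, dtype=float)
    dists = np.linalg.norm(train_reps - query, axis=1)
    idx = int(np.argmin(dists))
    return idx, train_labels[idx], float(dists[idx])
```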
Allen Schmaltz. 2019. Learning to Order & Learning to Correct. Harvard University, Ph.D. dissertation, Computer Science.
Allen Schmaltz, Yoon Kim, Alexander Rush, and Stuart Shieber. 2017. Adapting Sequence Models for Sentence Correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2807-2813, Copenhagen, Denmark, September. Association for Computational Linguistics. https://www.aclweb.org/anthology/D17-1298. (Appendix) (.bib)
Allen Schmaltz, Alexander M. Rush, and Stuart Shieber. 2016. Word Ordering Without Syntax. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2319-2324, Austin, TX, USA, November. Association for Computational Linguistics. https://aclweb.org/anthology/D16-1255. (.bib)
Demonstrated that multi-layer networks can encode hierarchical language structures without explicit human annotations. Prior to this work, the prevailing view in NLP and computational linguistics was that neural language models would need to be trained with human-annotated syntactic structures to model syntax.
Allen Schmaltz, Yoon Kim, Alexander M. Rush, and Stuart Shieber. 2016. Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 242-251, San Diego, CA, USA, June. Association for Computational Linguistics. https://www.aclweb.org/anthology/W16-0528. (.bib)
Medicine and Public Health
Allen Schmaltz and Andrew L. Beam. 2020. Sharpening the Resolution on Data Matters: A Brief Roadmap for Understanding Deep Learning for Medical Data. The Spine Journal. https://doi.org/10.1016/j.spinee.2020.08.012.
Andrew L. Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, Griffin Weber, Nathan P. Palmer, Xu Shi, Tianxi Cai, and Isaac S. Kohane. 2020. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. In Proceedings of the Pacific Symposium on Biocomputing (PSB) 25, pages 295-306. arXiv:1804.01486.
Public Policy
Allen Schmaltz. 2018. On the Utility of Lay Summaries and AI Safety Disclosures: Toward Robust, Open Research Oversight. In Proceedings of the Second ACL Workshop on Ethics in Natural Language Processing, pages 1-6, New Orleans, LA, USA, June. Association for Computational Linguistics. https://aclweb.org/anthology/W18-0801. (.bib)
Quantitative Social Science
Wenxin Jiang, Gary King, Allen Schmaltz, and Martin A. Tanner. 2019. Ecological Regression with Partial Identification. Political Analysis. https://doi.org/10.1017/pan.2019.19.
Technical Reports
Each of these papers introduced novel methods when it first appeared on arXiv, and each remains of lasting interest for the reasons described in the block quotes below.
Allen Schmaltz. 2025. Similarity-Distance-Magnitude Universal Verification. arXiv preprint arXiv:2502.20167. Code.
This was an earlier introduction to SDM activation functions. The approaches for constructing estimators over SDM activations and integrating SDM activations into sequence prediction architectures have been superseded by the simpler formulations in "Similarity-Distance-Magnitude Activations" (2025) and "Similarity-Distance-Magnitude Language Models" (2025).
Allen Schmaltz and Andrew Beam. 2020. Coarse-to-Fine Memory Matching for Joint Retrieval and Classification. arXiv preprint arXiv:2012.02287.
Introduces interpretability-by-exemplar for multi-stage retrieval and classification with a single model, including feature detection via alignment of bi-encoded sequences. It also includes a method for beam search through the search graph of bi- and cross-encoded sequences, and an early approach for constraining the output of a retrieval system based on dense matching into the support set. This is, in effect, an early example of test-time compute with a Transformer language model: instead of using reinforcement learning, multi-stage search is learned end-to-end via a contrastive loss over bi- and cross-encoded sequences. See these presentation slides from 2021 for a high-level overview.
Allen Schmaltz and Andrew Beam. 2020. Exemplar Auditing for Multi-Label Biomedical Text Classification. arXiv preprint arXiv:2004.03093.
Introduces a loss and inductive bias for a hard-attention mechanism suitable for high-dimensional multi-label classification tasks. This illustrates the generality of the hard-attention mechanism introduced in "Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition", with which it becomes straightforward to model task-specific inductive biases suitable for effective semi-supervised learning.