AI Research

Latest research papers from arXiv covering machine learning, computer vision, natural language processing, and more.

Ensemble-size-dependence of deep-learning post-processing methods that minimize an (un)fair score: motivating examples and a proof-of-concept solution

Fair scores reward ensemble forecast members that behave like samples from the same distribution as the verifying observations. They are therefore an attractive choice as loss functions to train data-driven ensemble forecasts or post-processing methods when large training ensembles are either unavai...

Christopher David Roberts

Feb 17, 2026
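The entry above depends on the difference between a standard ensemble score and a fair one. Below is a minimal sketch. It assumes the continuous ranked probability score (CRPS) in its usual kernel form, along with the fair variant that normalizes the ensemble-spread term by M(M-1) instead of M². It checks by Monte Carlo that the standard score's expectation changes with ensemble size M, while the fair score's expectation does not, when the members and the observation come from the same distribution. The function names `crps_ensemble` and `mean_score` and the standard-normal test setup are illustrative assumptions. This is not the paper's post-processing method.

```python
import numpy as np


def crps_ensemble(ens, y, fair=False):
    """Kernel-form CRPS of ensemble(s) `ens` (shape [..., M]) against observation(s) `y` (shape [...])."""
    ens = np.asarray(ens, dtype=float)
    y = np.asarray(y, dtype=float)[..., None]  # add a member axis for broadcasting
    m = ens.shape[-1]
    if fair and m < 2:
        raise ValueError("the fair CRPS needs at least two ensemble members")
    # Accuracy term: mean absolute error of the members against the observation.
    skill = np.abs(ens - y).mean(axis=-1)
    # Spread term: sum of |x_i - x_j| over all ordered member pairs.
    pair_sum = np.abs(ens[..., :, None] - ens[..., None, :]).sum(axis=(-2, -1))
    # Standard CRPS divides by 2*M^2; the fair CRPS divides by 2*M*(M-1) (distinct pairs only).
    denom = 2 * m * (m - 1) if fair else 2 * m * m
    return skill - pair_sum / denom


def mean_score(m, trials=10_000, fair=False, seed=0):
    """Average CRPS when members and observation are all i.i.d. standard normal (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    ens = rng.standard_normal((trials, m))
    obs = rng.standard_normal(trials)
    return float(crps_ensemble(ens, obs, fair=fair).mean())


if __name__ == "__main__":
    for m in (2, 20):
        print(m, round(mean_score(m), 3), round(mean_score(m, fair=True), 3))
```

In this setup the expected standard score is (M+1)/(M√π). That is about 0.85 for M = 2 and about 0.59 for M = 20. The fair score's expectation is 1/√π ≈ 0.56 for any M ≥ 2. A loss based on the standard score therefore penalizes small ensembles for their size alone, whereas the fair score does not.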

Operationalising the Superficial Alignment Hypothesis via Task Complexity

The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge. The SAH, however, lacks a precise definition, which has led to (i) different and seemingly orthogonal arguments suppo...

Tomás Vergara-Browne, Darshan Patil, Ivan Titov

Feb 17, 2026

Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation

Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a fe...

Yuxuan Kuang, Sungjae Park, Katerina Fragkiadaki

Feb 17, 2026

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-lik...

Zhen Wu, Xiaoyu Huang, Lujie Yang

Feb 17, 2026

CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a sc...

Zarif Ikram, Arad Firouzkouhi, Stephen Tu

Feb 17, 2026

Stabilizing Test-Time Adaptation of High-Dimensional Simulation Surrogates via D-Optimal Statistics

Machine learning surrogates are increasingly used in engineering to accelerate costly simulations, yet distribution shifts between training and deployment often cause severe performance degradation (e.g., unseen geometries or configurations). Test-Time Adaptation (TTA) can mitigate such shifts, but ...

Anna Zimmel, Paul Setinek, Gianluca Galletti

Feb 17, 2026

VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation

Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for se...

Hui Ren, Yuval Alaluf, Omer Bar Tal

Feb 17, 2026

Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning

Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimiz...

Oswin So, Eric Yang Yu, Songyuan Zhang

Feb 17, 2026

Developing AI Agents with Simulated Data: Why, what, and how?

As insufficient data volume and quality remain the key impediments to the adoption of modern subsymbolic AI, techniques of synthetic data generation are in high demand. Simulation offers an apt, systematic approach to generating diverse synthetic data. This chapter introduces the reader to the key c...

Xiaoran Liu, Istvan David

Feb 17, 2026

Avey-B

Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style archit...

Devang Acharya, Mohammad Hammoud

Feb 17, 2026

Task-Agnostic Continual Learning for Chest Radiograph Classification

Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study, for the first time, a task-incremental continual learning setting for chest radiogra...

Muthu Subash Kavitha, Anas Zafar, Amgad Muneer

Feb 17, 2026

Decision Quality Evaluation Framework at Pinterest

Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inh...

Yuqi Tian, Robert Paine, Attila Dobi

Feb 17, 2026

The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical direc...

Max Springer, Chung Peng Lee, Blossom Metevier

Feb 17, 2026

Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings

Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry. Conventional encoding methods (e.g., one-hot) often fail to convey ...

Suhyung Jang, Ghang Lee, Jaekun Lee

Feb 17, 2026

This human study did not involve human subjects: Validating LLM simulations as behavioral evidence

A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two stra...

Jessica Hullman, David Broska, Huaman Sun

Feb 17, 2026

Data from arXiv.org • Updated hourly