DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory (and its Loss' Convexity is Dispensable) -- Lay Summary:

How can we improve the likelihood that an LLM trained on human preference data ends up producing outputs aligned with human preferences? By building into the objective function (the core function that any ML algorithm has to optimize) the explicit "knowledge" that the data is human in nature and not just any kind of data. To do so, Direct Preference Optimization (DPO) introduced the use of a popular model of human choice. Extending this very specific model to match the generality that ML requires has proven tricky so far, to the extent that some have opted to discard it, a position whose associated risks are hard to exaggerate in the context of LLMs.

Our paper closes the gap by giving this human choice model a generalization that matches ML's generality. Doing so led us to a surprising result: instead of each applicable objective function corresponding to one specific model of human choice (for example, one that allows abstention or one that does not), each can be mapped to all of them. That the ML objective is in fact "secretly disentangled" in human choice is thought-provoking, but it is not the only surprise we got: the conditions on "applicable objective functions" are in fact weaker than what is usually required of such functions in general ML.

Our findings reveal an extensive, still largely unexplored map for designing the ML task, and they considerably relax the constraints on its design choices when the data comes from humans.