Template-free, interaction-aware profile generation via reinforcement learning — aligning users and items in a shared semantic space.
1Peking University ·
2Microsoft ·
3Zhejiang University ·
4KTH Royal Institute of Technology
*Equal contribution
†Corresponding author
Traditional recommendation systems represent users and items as dense vectors and learn to align them in a shared latent space for relevance estimation. Recent LLM-based recommenders instead leverage natural-language representations that are easier to interpret and integrate with downstream reasoning modules.
This paper studies how to construct effective textual profiles for users and items, and how to align them for recommendation. A central difficulty is that the best profile format is not known a priori: manually designed templates can be brittle and misaligned with task objectives. Moreover, generating user and item profiles independently may produce descriptions that are individually plausible yet semantically inconsistent for a specific user–item pair.
We propose DUET, an interaction-aware profile generator that jointly produces user and item profiles conditioned on both user history and item evidence. DUET follows a three-stage procedure: it first turns raw histories and metadata into compact cues, then expands these cues into paired profile prompts to generate profiles, and finally optimizes the generation policy with reinforcement learning using downstream recommendation performance as feedback.
Key result: Experiments on three real-world datasets (Yelp, Amazon Music, Amazon Books) show that DUET consistently outperforms strong baselines, demonstrating the benefits of template-free profile exploration and joint user–item textual alignment.
Independently generated profiles may amplify incompatible facets of the same user–item pair, obscuring the true relevance signal.
The two profiles focus on incompatible aspects, hiding the shared funk/soul connection and producing a misleading relevance signal.
DUET reconciles both sides into a compatible interpretation, surfacing the shared funk affinity and enabling accurate relevance estimation.
Figure 1. DUET aligns raw user and item data by transforming them into textual profiles within a shared semantic space.
Represent both users and items as natural-language profiles and align them in a shared semantic space, extending the classic vector-based alignment principle to interpretable textual representations fully compatible with LLMs.
Start from cue-based initialization, expand cues into candidate profile prompts, and jointly optimize user and item profiles with downstream RL feedback — no rigid templates or hand-crafted attributes required.
Extensive experiments across three real-world datasets with two backbone LLMs show DUET consistently outperforms all baselines, validating both joint profiling and feedback-driven profile optimization.
A closed-loop framework that transforms raw user–item interaction histories into performance-aligned textual profiles through three learned stages — all realized in a single seq-to-seq forward pass at inference time.
Raw user histories and item metadata are distilled into minimal cues — concise hypotheses highlighting one potential preference or characteristic. These act as lightweight seeds for profile exploration, deliberately underspecified to allow subsequent discovery.
Rather than directly summarizing, the model generates an intermediate constructed_prompt — a natural-language instruction defining format, abstraction level, and attribute selection. Conditioned on this prompt, user and item profiles are generated jointly.
Profiles are consumed by a frozen downstream recommender. The continuous fractional reward Rperf measures prediction accuracy and drives GRPO optimization, reinforcing profile constructions that yield better recommendations.
Figure 2. Overview of the DUET framework. Three stages — Cue-Based Initialization, Joint Exploration via Adaptive Profile Prompt Discovery, and On-Policy Optimization — are unified into a single generation pass. The downstream task environment provides a continuous reward signal for optimizing profile quality.
All three stages — cue extraction, profile prompt construction, and profile generation — are realized in a single sequence-to-sequence forward pass at inference time, introducing no additional latency compared to standard profile generation methods.
Profile generation is treated as an on-policy RL problem where the state is s = {Hu, Hi}, the action is the joint generation sequence, and quality is evaluated solely by functional utility in a fixed recommendation environment — no textual ground-truth profiles required.
Evaluated on Yelp, Amazon Music, and Amazon Books using Qwen3-8B and LLaMA3-8B as both the profile generator and the downstream recommender.
| Method | Yelp | Amazon Music | Amazon Books | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE↓ | RMSE↓ | Acc↑ | F1↑ | MAE↓ | RMSE↓ | Acc↑ | F1↑ | MAE↓ | RMSE↓ | Acc↑ | F1↑ | |
| Qwen3-8B | ||||||||||||
| 10H (History Only) | 1.1235 | 1.9478 | 23.17 | 27.54 | 0.9102 | 1.4021 | 39.26 | 46.58 | 0.9314 | 1.4527 | 37.63 | 45.19 |
| KAR (Xi et al., 2024) | 0.7396 | 1.2184 | 55.34 | 48.67 | 0.7483 | 1.1380 | 58.65 | 60.29 | 0.7098 | 1.0923 | 56.17 | 58.78 |
| RLMRec (Ren et al., 2024) | 0.8197 | 1.3312 | 47.15 | 42.46 | 0.7438 | 1.1069 | 54.89 | 57.65 | 0.7812 | 1.1584 | 52.86 | 55.93 |
| PALR (Yang et al., 2023) | 0.7994 | 1.2876 | 48.53 | 43.19 | 0.6075 | 0.9531 | 57.35 | 56.77 | 0.7485 | 1.1187 | 54.24 | 56.38 |
| LettinGo (Wang et al., 2025) | 0.6632 | 1.1047 | 56.18 | 48.95 | 0.4737 | 0.8834 | 62.37 | 57.09 | 0.5821 | 0.9416 | 59.35 | 60.57 |
| Reason4Rec (Fang et al., 2025) | 0.7028 | 1.1523 | 55.69 | 47.73 | 0.5654 | 0.9635 | 58.69 | 54.67 | 0.6397 | 1.0098 | 58.47 | 56.84 |
| DUET (Ours) | 0.5126 | 0.9485 | 61.23 | 55.18 | 0.3937 | 0.7564 | 67.96 | 63.89 | 0.4612 | 0.9089 | 64.38 | 59.27 |
| LLaMA3-8B | ||||||||||||
| 10H (History Only) | 1.0864 | 1.9532 | 22.09 | 27.30 | 0.7917 | 1.3346 | 38.13 | 46.87 | 0.8064 | 1.3866 | 37.15 | 45.27 |
| KAR (Xi et al., 2024) | 0.6427 | 1.1668 | 54.51 | 47.98 | 0.5726 | 0.9033 | 57.53 | 59.92 | 0.5892 | 0.9614 | 55.87 | 58.21 |
| RLMRec (Ren et al., 2024) | 0.7428 | 1.3572 | 46.74 | 42.11 | 0.6076 | 0.9886 | 53.78 | 57.42 | 0.6226 | 0.9477 | 52.12 | 55.79 |
| PALR (Yang et al., 2023) | 0.7238 | 1.3265 | 47.72 | 43.29 | 0.5823 | 0.9222 | 56.73 | 59.31 | 0.5977 | 0.8855 | 55.06 | 57.62 |
| LettinGo (Wang et al., 2025) | 0.6196 | 1.1289 | 56.03 | 51.24 | 0.5204 | 0.9369 | 61.92 | 59.50 | 0.5543 | 0.7967 | 58.95 | 60.39 |
| Reason4Rec (Fang et al., 2025) | 0.7586 | 1.0418 | 55.80 | 53.00 | 0.5442 | 0.7722 | 60.86 | 54.88 | 0.6029 | 0.8345 | 59.70 | 56.35 |
| DUET (Ours) | 0.5367 | 0.9687 | 60.87 | 54.74 | 0.4680 | 0.8277 | 63.30 | 60.60 | 0.5092 | 0.9500 | 63.42 | 58.12 |
Best values shown in green bold. DUET (highlighted rows) consistently outperforms all baselines across both backbone LLMs and all three datasets.
| Method | Yelp | Amazon Music | Amazon Books | ||||||
|---|---|---|---|---|---|---|---|---|---|
| NDCG@1 | NDCG@5 | NDCG@10 | NDCG@1 | NDCG@5 | NDCG@10 | NDCG@1 | NDCG@5 | NDCG@10 | |
| 10H | 0.1823 | 0.2815 | 0.4928 | 0.1875 | 0.3796 | 0.5153 | 0.1841 | 0.3146 | 0.4263 |
| KAR | 0.2156 | 0.3298 | 0.5412 | 0.3018 | 0.4896 | 0.6015 | 0.2965 | 0.4715 | 0.5834 |
| RLMRec | 0.2419 | 0.3472 | 0.5587 | 0.3371 | 0.5434 | 0.6162 | 0.2748 | 0.4526 | 0.5719 |
| PALR | 0.2494 | 0.3563 | 0.5691 | 0.3395 | 0.5247 | 0.6115 | 0.2627 | 0.4634 | 0.5538 |
| LettinGo | 0.3187 | 0.4685 | 0.5814 | 0.4012 | 0.5674 | 0.6489 | 0.3795 | 0.5189 | 0.6284 |
| Reason4Rec | 0.2575 | 0.3792 | 0.5526 | 0.2928 | 0.5912 | 0.6343 | 0.3013 | 0.4928 | 0.5959 |
| DUET (Ours) | 0.3390 | 0.4873 | 0.6008 | 0.5123 | 0.6165 | 0.7025 | 0.4288 | 0.5638 | 0.6599 |
Ranking evaluation under EASE-based hard negatives. DUET achieves NDCG@10 of 0.7025 on Amazon Music — the strongest result across all methods and cutoffs.
| Configuration | Yelp | Amazon Music | Amazon Books | |||
|---|---|---|---|---|---|---|
| MAE↓ | Acc↑ | MAE↓ | Acc↑ | MAE↓ | Acc↑ | |
| 10H — History Only | 1.1235 | 23.17 | 0.9102 | 39.26 | 0.9314 | 37.63 |
| + Profile Generation | 0.7218 | 55.48 | 0.6597 | 58.67 | 0.6764 | 57.14 |
| + Cue & Strategy Layer | 0.7085 | 55.83 | 0.5708 | 58.91 | 0.6389 | 58.43 |
| + Joint Optimization (LG-style) | 0.6632 | 56.18 | 0.4737 | 62.37 | 0.5821 | 59.35 |
| Full DUET — Cue + Strategy + Joint Opt. | 0.5126 | 61.23 | 0.3937 | 67.96 | 0.4612 | 64.38 |
Each component contributes. Profile generation alone provides the largest accuracy jump. Combining all three stages achieves the best results on every dataset.
| Setting | Yelp | Amazon Music | Amazon Books | |||
|---|---|---|---|---|---|---|
| MAE↓ | Acc↑ | MAE↓ | Acc↑ | MAE↓ | Acc↑ | |
| DUET w/o RL | 0.8283 | 48.53 | 0.7322 | 57.18 | 0.8741 | 51.83 |
| DUET (full, with RL) | 0.5126 | 61.23 | 0.3937 | 67.96 | 0.4612 | 64.38 |
RL is essential. Removing the RL optimization causes Yelp accuracy to drop from 61.23% to 48.53% (−12.7 pp), demonstrating that the gains cannot be attributed to prompt design alone. RL enables adaptive exploration of effective profile construction strategies under real recommendation feedback.
| History Length | Yelp | Amazon Music | Amazon Books | |||
|---|---|---|---|---|---|---|
| MAE↓ | Acc↑ | MAE↓ | Acc↑ | MAE↓ | Acc↑ | |
| 10H + 30 profiles | 0.5126 | 61.23 | 0.3883 | 67.96 | 0.4612 | 65.13 |
| 10H + 50 profiles | 0.4909 | 62.43 | 0.3924 | 67.88 | 0.4553 | 64.62 |
| 10H + 70 profiles | 0.4987 | 61.98 | 0.3937 | 68.22 | 0.4608 | 64.38 |
Moderate history length (30–50 interactions) achieves competitive or best results on most metrics. Excessive histories can introduce noisy signals that slightly degrade performance on Yelp and Amazon Books.
DUET distills fragmented user history and sparse item reviews into semantically aligned profiles — capturing the shared funk/soul connection that raw history alone would miss.
Semantic correspondence: The user profile highlights funk/soul affinity and emphasis on historical significance; the item profile independently characterizes the album as a defining funk-rock work of the 1970s. DUET's joint optimization produces this alignment automatically — without any hard-coded templates or attribute lists.
Figure 4. The highlighted regions demonstrate that user preferences summarized in the user profile align with the key attributes extracted in the item profile. DUET captures the meaningful preference–attribute correspondence that is difficult to recover from individual reviews alone.
Two complementary metrics confirm that DUET profiles exhibit genuine semantic structure rather than serving as incidental textual artifacts.
Embedding-level cosine similarity between generated user and item profiles using all-mpnet-base-v2. Higher values indicate stronger semantic compatibility between modeled user preferences and item characteristics.
Highest across all methods on every dataset.
Token-level overlap between generated profiles and input histories, quantifying how much of the profile is grounded in historical evidence. DUET maintains mid-to-high coverage while achieving superior alignment — the best balance of abstraction and evidence preservation.
Amazon Music results; comparable or better coverage across all datasets.
Users are partitioned into three groups by rating variance (stable → diverse). DUET's performance degrades gradually rather than catastrophically as preference diversity increases — from 71.76% accuracy (stable, Yelp) to 51.13% (diverse), indicating that the framework remains stable under heterogeneous or noisy interaction histories. Amazon Music shows particularly robust behavior, suggesting that music domain preferences are less sensitive to history noise.
If DUET is useful for your research, please consider citing our paper.
@misc{chen2026duetjointexplorationuser, title = {DUET: Joint Exploration of User Item Profiles in Recommendation System}, author = {Yue Chen and Yifei Sun and Lu Wang and Fangkai Yang and Pu Zhao and Minjie Hong and Yifei Dong and Minghua He and Nan Hu and Jianjin Zhang and Zhiwei Dai and Yuefeng Zhan and Weihao Han and Hao Sun and Qingwei Lin and Weiwei Deng and Feng Sun and Qi Zhang and Saravan Rajmohan and Dongmei Zhang}, year = {2026}, eprint = {2604.13801}, archivePrefix = {arXiv}, primaryClass = {cs.IR}, url = {https://arxiv.org/abs/2604.13801}, }