Problems when adapting MISSRec to a different dataset

The error output is as follows:

14 Oct 18:21    INFO  PixedRec2
The number of users: 200001
Average actions of users: 18.82828
The number of items: 96283
Average actions of items: 39.22926107655926
The number of inters: 3765656
The sparsity of the dataset: 99.98044495304563%
Remain Fields: ['user_id', 'item_id_list', 'item_id', 'item_length']
14 Oct 18:21    INFO  [Training]: train_batch_size = [512] negative sampling: [None]
14 Oct 18:21    INFO  [Evaluation]: eval_batch_size = [1024] eval_args: [{'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}]
[INIT DEBUG] all_num_embeddings: 192564
[INIT DEBUG] interest_ratio: 0.5
[INIT DEBUG] num_interest: 96282
[INIT DEBUG] interest_embeddings size: 96283
14 Oct 18:21    INFO  Loading from saved/MISSRec-FHCKM_mm_full-100.pth
14 Oct 18:21    INFO  Transfer [FHCKM_mm_full] -> [PixedRec2
The number of users: 200001
Average actions of users: 18.82828
The number of items: 96283
Average actions of items: 39.22926107655926
The number of inters: 3765656
The sparsity of the dataset: 99.98044495304563%
Remain Fields: ['user_id', 'item_id_list', 'item_id', 'item_length']]
14 Oct 18:21    INFO  Fix encoder parameters.
14 Oct 18:21    INFO  MISSRec(
  (item_embedding): Embedding(96283, 300, padding_idx=0)
  (position_embedding): Embedding(50, 300)
  (trm_model): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=300, out_features=300, bias=True)
          )
          (linear1): Linear(in_features=300, out_features=256, bias=True)
          (dropout): Dropout(p=0.5, inplace=False)
          (linear2): Linear(in_features=256, out_features=300, bias=True)
          (norm1): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
          (norm2): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
          (dropout1): Dropout(p=0.5, inplace=False)
          (dropout2): Dropout(p=0.5, inplace=False)
        )
      )
      (norm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=300, out_features=300, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=300, out_features=300, bias=True)
          )
          (linear1): Linear(in_features=300, out_features=256, bias=True)
          (dropout): Dropout(p=0.5, inplace=False)
          (linear2): Linear(in_features=256, out_features=300, bias=True)
          (norm1): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
          (norm2): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
          (norm3): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
          (dropout1): Dropout(p=0.5, inplace=False)
          (dropout2): Dropout(p=0.5, inplace=False)
          (dropout3): Dropout(p=0.5, inplace=False)
        )
      )
      (norm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
    )
  )
  (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (loss_fct): CrossEntropyLoss()
  (plm_embedding): Embedding(96283, 512, padding_idx=0)
  (img_embedding): Embedding(96283, 512, padding_idx=0)
  (text_adaptor): Linear(in_features=512, out_features=300, bias=True)
  (img_adaptor): Linear(in_features=512, out_features=300, bias=True)
)
Trainable parameters: 29193301.0
14 Oct 18:21    INFO  Trainable parameters: ['fusion_factor', 'item_embedding.weight', 'LayerNorm.weight', 'LayerNorm.bias', 'text_adaptor.weight', 'text_adaptor.bias', 'img_adaptor.weight', 'img_adaptor.bias']
14 Oct 18:21    INFO  Discovering multi-modal user interest before 0-th epoch
adjust batchsize from 2048 (given) to 105910 because the cluster_num = 96282
clustering iter:   0%|                                                                                                                                     | 0/5 [00:01<?, ?it/s]
14 Oct 18:21    INFO  Finish multi-modal interest discovery before 0-th epoch
Train     0:   0%|                                                         | 0/6574 [00:00<?, ?it/s]
============================================================
[DEBUG] item_seq shape: torch.Size([512, 433])
[DEBUG] item_seq dtype: torch.int64
[DEBUG] item_seq max: 94644
[DEBUG] item_seq min: 0
[DEBUG] item_seq unique values count: 7939
[DEBUG] First batch item_seq:
tensor([14190, 19291,  5131, 10240, 19290, 19289, 19288,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0], device='cuda:0')
[DEBUG] plm_embedding size: 96283
[DEBUG] plm_interest_lookup_table size: 96283
[DEBUG] n_items: 96283
[DEBUG] img_embedding size: 96283
[DEBUG] img_interest_lookup_table size: 96283
============================================================


[DEBUG Interest] all_interest_seq shape: torch.Size([512, 866])
[DEBUG Interest] all_interest_seq max: 95750
[DEBUG Interest] all_interest_seq min: 0
[DEBUG Interest] interest_embeddings size: 96283
[DEBUG Interest] unique_interest_seq shape: torch.Size([512, 23])
[DEBUG Interest] unique_interest_seq max: 95750
[DEBUG Interest] unique_interest_seq min: 0
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Train     0:   0%|                                                         | 0/6574 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "finetune.py", line 119, in <module>
    finetune(args.d, props=args.props, mode=args.mode, pretrained_file=args.p, fix_enc=args.f, log_prefix=args.note)
  File "finetune.py", line 88, in finetune
    train_data, valid_data, saved=True, show_progress=config['show_progress']
  File "/root/autodl-tmp/MM23-MISSRec/recbole/trainer/trainer.py", line 338, in fit
    train_loss = self._train_epoch(train_data, epoch_idx, show_progress=show_progress)
  File "/root/autodl-tmp/MM23-MISSRec/trainer.py", line 46, in _train_epoch
    losses = loss_func(interaction)
  File "/root/autodl-tmp/MM23-MISSRec/missrec.py", line 403, in calculate_loss
    seq_output, interest_orthogonal_regularization = self._compute_seq_embeddings(item_seq, item_seq_len)
  File "/root/autodl-tmp/MM23-MISSRec/missrec.py", line 357, in _compute_seq_embeddings
    interest_seq_len=unique_interest_len
  File "/root/autodl-tmp/MM23-MISSRec/missrec.py", line 117, in forward
    dec_input_emb = self.dropout(dec_input_emb)
  File "/root/miniconda3/envs/Paper2Env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/Paper2Env/lib/python3.7/site-packages/torch/nn/modules/dropout.py", line 59, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/root/miniconda3/envs/Paper2Env/lib/python3.7/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ASSERT: device-side assert triggered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ASSERT: device-side assert triggered
/root/autodl-tmp/MM23-MISSRec/script

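Debugging log for this error:

Every other dataset runs fine and only this one fails, so my first guess was an index error caused by the item seq in an Embedding lookup.

At first I focused on index errors in the visual and text embeddings.

I considered three angles: non-numeric indices in the item seq, out-of-range indices into the embeddings, and the item seq file format.

Indices are necessarily numeric, so that was ruled out.

Checking for out-of-range embedding indices took the longest, but it turned up nothing, which means the problem is not in how these two embeddings are read.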
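The range check described above can be sketched in a few lines of plain Python. The helper name `find_out_of_range` is made up for illustration and is not part of MISSRec or RecBole; the numbers are the max/min values and table sizes printed in the debug output above.

```python
def find_out_of_range(lookups):
    """Return the labels of lookups whose index range does not fit the table.

    `lookups` maps a label to (min_index, max_index, table_rows).
    A valid index i must satisfy 0 <= i < table_rows.
    """
    return [
        name
        for name, (lo, hi, rows) in lookups.items()
        if lo < 0 or hi >= rows
    ]


# Values copied from the debug output above.
lookups = {
    "item_seq -> plm_embedding": (0, 94644, 96283),
    "item_seq -> img_embedding": (0, 94644, 96283),
    "all_interest_seq -> interest_embeddings": (0, 95750, 96283),
}
bad = find_out_of_range(lookups)
print(bad)  # [] : every item and interest index fits its table
```

An empty result here matches what the manual check found: the item and interest lookups are within bounds, so the failing index has to come from some other table.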

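For the file format, I considered making the file contents, vector order, and vector dimensions fully identical to the format of the other datasets,

even though logically this could not be the cause.

In the end this also confirmed that the problem is not in the file format.

Since the error message explicitly says it is an indexing problem, I then turned to index errors in the remaining embeddings, still expecting the item seq to be the trigger.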

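Root cause, finally identified:

The position embedding has only 50 rows (position_embedding: Embedding(50, 300)), while the item seq has a maximum length of 433 (item_seq shape [512, 433]), so any position index of 50 or above is out of range.

The core issue is that the code does not explicitly truncate the sequence length; it takes the real maximum length found in the dataset as-is.

The other datasets had all been truncated beforehand, and I had not noticed this...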
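A minimal sketch of the missing truncation step, in plain Python. It assumes the most recent interactions are the ones to keep, which is a common choice for sequential recommendation but is not confirmed by the MISSRec code; `truncate_recent` and the sample history are hypothetical. The limit of 50 comes from the position_embedding size in the model printout; where that limit is configured in the project is not shown in these notes.

```python
def truncate_recent(item_ids, max_len):
    """Keep only the most recent `max_len` items of one user's history."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return list(item_ids[-max_len:])


MAX_LEN = 50  # rows in position_embedding, per the model printout

history = list(range(1, 434))  # hypothetical 433-item history (ids 1..433)
clipped = truncate_recent(history, MAX_LEN)
positions = list(range(len(clipped)))  # position indices 0..len-1

assert len(clipped) == MAX_LEN
assert max(positions) < MAX_LEN  # every position now fits the table
print(clipped[0], clipped[-1])  # 384 433
```

Applying this kind of clipping when the dataset is preprocessed, as was already done for the other datasets, keeps every position index below 50.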

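All I can say is: watch out for this kind of problem in the future.

For stubborn problems, consult Claude more; for everything else, Chat is fine. Claude is stronger at code.

GPU errors are reported asynchronously, so the stack trace usually does not point at the real source of the problem (here it points at a dropout call, while the failed assertion is in an index-select kernel); the only option is to narrow it down step by step.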
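One way to make the traceback more trustworthy is to force synchronous kernel launches with the `CUDA_LAUNCH_BLOCKING` environment variable, so the error surfaces at the call that actually failed. The sketch below assumes it is placed at the very top of the entry script (for example finetune.py), before torch is imported; this was not tried in the run above.

```python
import os

# Must be set before CUDA is initialised, so set it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import the framework only after the variable is set
```

Running a single batch on the CPU is another option: an out-of-range embedding index then raises an ordinary Python exception at the lookup itself.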