The problem is as follows:

14 Oct 18:21    INFO  PixedRec2
The number of users: 200001
Average actions of users: 18.82828
The number of items: 96283
Average actions of items: 39.22926107655926
The number of inters: 3765656
The sparsity of the dataset: 99.98044495304563%
Remain Fields: ['user_id', 'item_id_list', 'item_id', 'item_length']
14 Oct 18:21 INFO [Training]: train_batch_size = [512] negative sampling: [None]
14 Oct 18:21 INFO [Evaluation]: eval_batch_size = [1024] eval_args: [{'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}]
[INIT DEBUG] all_num_embeddings: 192564
[INIT DEBUG] interest_ratio: 0.5
[INIT DEBUG] num_interest: 96282
[INIT DEBUG] interest_embeddings size: 96283
14 Oct 18:21 INFO Loading from saved/MISSRec-FHCKM_mm_full-100.pth
14 Oct 18:21 INFO Transfer [FHCKM_mm_full] -> [PixedRec2
The number of users: 200001
Average actions of users: 18.82828
The number of items: 96283
Average actions of items: 39.22926107655926
The number of inters: 3765656
The sparsity of the dataset: 99.98044495304563%
Remain Fields: ['user_id', 'item_id_list', 'item_id', 'item_length']]
14 Oct 18:21 INFO Fix encoder parameters.
14 Oct 18:21 INFO MISSRec(
(item_embedding): Embedding(96283, 300, padding_idx=0)
(position_embedding): Embedding(50, 300)
(trm_model): Transformer(
(encoder): TransformerEncoder(
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=300, out_features=300, bias=True)
)
(linear1): Linear(in_features=300, out_features=256, bias=True)
(dropout): Dropout(p=0.5, inplace=False)
(linear2): Linear(in_features=256, out_features=300, bias=True)
(norm1): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(norm2): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout1): Dropout(p=0.5, inplace=False)
(dropout2): Dropout(p=0.5, inplace=False)
)
)
(norm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
)
(decoder): TransformerDecoder(
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=300, out_features=300, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=300, out_features=300, bias=True)
)
(linear1): Linear(in_features=300, out_features=256, bias=True)
(dropout): Dropout(p=0.5, inplace=False)
(linear2): Linear(in_features=256, out_features=300, bias=True)
(norm1): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(norm2): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(norm3): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout1): Dropout(p=0.5, inplace=False)
(dropout2): Dropout(p=0.5, inplace=False)
(dropout3): Dropout(p=0.5, inplace=False)
)
)
(norm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
)
)
(LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.5, inplace=False)
(loss_fct): CrossEntropyLoss()
(plm_embedding): Embedding(96283, 512, padding_idx=0)
(img_embedding): Embedding(96283, 512, padding_idx=0)
(text_adaptor): Linear(in_features=512, out_features=300, bias=True)
(img_adaptor): Linear(in_features=512, out_features=300, bias=True)
)
Trainable parameters: 29193301.0
14 Oct 18:21 INFO Trainable parameters: ['fusion_factor', 'item_embedding.weight', 'LayerNorm.weight', 'LayerNorm.bias', 'text_adaptor.weight', 'text_adaptor.bias', 'img_adaptor.weight', 'img_adaptor.bias']
14 Oct 18:21 INFO Discovering multi-modal user interest before 0-th epoch
adjust batchsize from 2048 (given) to 105910 because the cluster_num = 96282
clustering iter: 0%| | 0/5 [00:01<?, ?it/s]
14 Oct 18:21 INFO Finish multi-modal interest discovery before 0-th epoch
Train 0: 0%| | 0/6574 [00:00<?, ?it/s]
============================================================
[DEBUG] item_seq shape: torch.Size([512, 433])
[DEBUG] item_seq dtype: torch.int64
[DEBUG] item_seq max: 94644
[DEBUG] item_seq min: 0
[DEBUG] item_seq unique values count: 7939
[DEBUG] First batch item_seq:
tensor([14190, 19291, 5131, 10240, 19290, 19289, 19288, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0], device='cuda:0')
[DEBUG] plm_embedding size: 96283
[DEBUG] plm_interest_lookup_table size: 96283
[DEBUG] n_items: 96283
[DEBUG] img_embedding size: 96283
[DEBUG] img_interest_lookup_table size: 96283
============================================================


[DEBUG Interest] all_interest_seq shape: torch.Size([512, 866])
[DEBUG Interest] all_interest_seq max: 95750
[DEBUG Interest] all_interest_seq min: 0
[DEBUG Interest] interest_embeddings size: 96283
[DEBUG Interest] unique_interest_seq shape: torch.Size([512, 23])
[DEBUG Interest] unique_interest_seq max: 95750
[DEBUG Interest] unique_interest_seq min: 0
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525541035/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [117,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Train 0: 0%| | 0/6574 [00:01<?, ?it/s]
Traceback (most recent call last):
File "finetune.py", line 119, in <module>
finetune(args.d, props=args.props, mode=args.mode, pretrained_file=args.p, fix_enc=args.f, log_prefix=args.note)
File "finetune.py", line 88, in finetune
train_data, valid_data, saved=True, show_progress=config['show_progress']
File "/root/autodl-tmp/MM23-MISSRec/recbole/trainer/trainer.py", line 338, in fit
train_loss = self._train_epoch(train_data, epoch_idx, show_progress=show_progress)
File "/root/autodl-tmp/MM23-MISSRec/trainer.py", line 46, in _train_epoch
losses = loss_func(interaction)
File "/root/autodl-tmp/MM23-MISSRec/missrec.py", line 403, in calculate_loss
seq_output, interest_orthogonal_regularization = self._compute_seq_embeddings(item_seq, item_seq_len)
File "/root/autodl-tmp/MM23-MISSRec/missrec.py", line 357, in _compute_seq_embeddings
interest_seq_len=unique_interest_len
File "/root/autodl-tmp/MM23-MISSRec/missrec.py", line 117, in forward
dec_input_emb = self.dropout(dec_input_emb)
File "/root/miniconda3/envs/Paper2Env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/envs/Paper2Env/lib/python3.7/site-packages/torch/nn/modules/dropout.py", line 59, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/root/miniconda3/envs/Paper2Env/lib/python3.7/site-packages/torch/nn/functional.py", line 1252, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ASSERT: device-side assert triggered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ASSERT: device-side assert triggered
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ASSERT: device-side assert triggered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ASSERT: device-side assert triggered
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ASSERT: device-side assert triggered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ASSERT: device-side assert triggered
/root/autodl-tmp/MM23-MISSRec/script

Debugging log for this error:

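Every other dataset runs and only this one fails, so my first guess was an indexing error in an embedding lookup, caused by the item sequence.

In the initial phase I focused on index errors in the visual and text embeddings.

I considered three angles: non-numeric indices in the item sequence, out-of-range embedding indices, and the file format of the item sequence.

Indices are necessarily numeric, so that was ruled out.

The out-of-range check took the longest, but it turned up nothing (in the log, item_seq max is 94644 against tables of size 96283), so the problem is not in how these two embeddings are read.

For the file format, I made the file contents, vector order, and vector dimensions fully consistent with the format of the other datasets,

even though logically this could not have been the cause.

In the end this also showed that the problem is not in the file format.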

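The error message explicitly says this is an indexing problem, so I turned to index problems in the remaining embeddings, still expecting the item sequence to be the trigger.

Root cause, finally identified:

The position embedding has only 50 entries (Embedding(50, 300) in the log), while the longest item sequence has length 433 (item_seq shape [512, 433]), so the position index goes out of range.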
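The comparison that exposed this can be written as one small check over every lookup table. The sketch below is plain Python and uses only numbers printed in the log above. The helper name `find_overflowing_lookups` is made up for illustration, and it assumes position indices run from 0 to the sequence length minus 1, which the log does not show directly.

```python
def find_overflowing_lookups(lookups):
    """Return the names of tables whose largest index does not fit.

    lookups: list of (table_name, max_index, table_size) tuples.
    A lookup is valid only if 0 <= index < table_size.
    """
    return [name for name, max_index, size in lookups if max_index >= size]


# Values copied from the log; the position row is an assumption:
# positions are taken to be 0 .. seq_len - 1 with seq_len = 433.
seq_len = 433
lookups = [
    ("item_embedding", 94644, 96283),
    ("plm_embedding", 94644, 96283),
    ("img_embedding", 94644, 96283),
    ("interest_embeddings", 95750, 96283),
    ("position_embedding", seq_len - 1, 50),
]

overflowing = find_overflowing_lookups(lookups)
print(overflowing)
```

Listing every table that is indexed, not only the ones under suspicion, is what would have surfaced the position table early.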

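The core issue is that the code does not deliberately truncate sequences; it uses the true maximum length found in the dataset.

The other datasets had all been truncated beforehand, and I had not noticed that...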
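A minimal sketch of the missing truncation step, assuming the most recent items should be kept and that sequences are right-padded with 0 as in the tensor printed in the log. The function name `truncate_and_pad` and the choice to keep the tail are illustrative assumptions, not the preprocessing actually used for the other datasets.

```python
def truncate_and_pad(item_ids, max_len, pad_id=0):
    """Keep the last max_len items and right-pad with pad_id up to max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    recent = item_ids[-max_len:]  # most recent interactions
    return recent + [pad_id] * (max_len - len(recent))


# A hypothetical user history as long as the longest sequence in this dataset.
long_history = list(range(1, 434))  # 433 items
clipped = truncate_and_pad(long_history, max_len=50)

# A short history, using the first three ids from the logged batch.
short = truncate_and_pad([14190, 19291, 5131], max_len=5)
```

With `max_len=50`, every position index stays below the 50 rows of the position embedding.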

Something to watch out for in future.

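For tricky problems, lean on Claude; for everything else, Chat is fine. Claude is stronger at code.

GPU errors are reported asynchronously, so by default the stack trace usually does not point at the operation that actually failed (here it ends in dropout, not in the failing lookup), and I had to narrow the problem down bit by bit.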
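One standard way to get a more useful trace is to make CUDA kernel launches synchronous through the `CUDA_LAUNCH_BLOCKING` environment variable. It was not used in the run above. The sketch assumes the variable is set before the process makes its first CUDA call, and the helper name is illustrative.

```python
import os


def enable_synchronous_cuda_errors():
    """Request synchronous kernel launches so errors surface at the failing call."""
    # Assumption: this runs before the first CUDA call in the process.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


enable_synchronous_cuda_errors()
# Import torch and build the model only after this point.
```

Setting the same variable on the command line when launching finetune.py has the same effect. Running a single batch on CPU is another option, since an out-of-range embedding index there raises an ordinary Python IndexError at the lookup itself.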