• unknown's avatar
    ndb - bug#15685 · fd80fa2d
    unknown authored
      Error in abort handling in TC when timeout during abort
      
    
    
    ndb/src/kernel/blocks/ERROR_codes.txt:
      New error codes
    ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:
      New error codes
    ndb/src/kernel/blocks/dbtc/DbtcMain.cpp:
      Dont release transaction record to early
    ndb/test/ndbapi/testNodeRestart.cpp:
      Test case
    ndb/test/run-test/daily-basic-tests.txt:
      Test case
    fd80fa2d
ERROR_codes.txt 15.8 KB
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459
Next QMGR 1
Next NDBCNTR 1000
Next NDBFS 2000
Next DBACC 3002
Next DBTUP 4013
Next DBLQH 5042
Next DBDICT 6006
Next DBDIH 7174
Next DBTC 8037
Next CMVMI 9000
Next BACKUP 10022
Next DBUTIL 11002
Next DBTUX 12007
Next SUMA 13001

TESTING NODE FAILURE, ARBITRATION
---------------------------------

911 - 919:
Crash president when he starts to run in ArbitState 1-9.

910: Crash new president after node crash

ERROR CODES FOR TESTING NODE FAILURE, GLOBAL CHECKPOINT HANDLING:
-----------------------------------------------------------------

7000:
Insert system error in master when global checkpoint is idle.

7001:
Insert system error in master after receiving GCP_PREPARE from
all nodes in the cluster.

7002:
Insert system error in master after receiving GCP_NODEFINISH from
all nodes in the cluster.

7003:
Insert system error in master after receiving GCP_SAVECONF from
all nodes in the cluster.

7004:
Insert system error in master after completing global checkpoint with
all nodes in the cluster.

7005:
Insert system error in GCP participant when receiving GCP_PREPARE.

7006:
Insert system error in GCP participant when receiving GCP_COMMIT.

7007:
Insert system error in GCP participant when receiving GCP_TCFINISHED.

7008:
Insert system error in GCP participant when receiving COPY_GCICONF.

5000:
Insert system error in GCP participant when receiving GCP_SAVEREQ.

5007:
Delay GCP_SAVEREQ by 10 secs

7165: Delay INCL_NODE_REQ in starting node yeilding error in GCP_PREPARE

ERROR CODES FOR TESTING NODE FAILURE, LOCAL CHECKPOINT HANDLING:
-----------------------------------------------------------------

7009:
Insert system error in master when local checkpoint is idle.

7010:
Insert system error in master when local checkpoint is in the
state clcpStatus = CALCULATE_KEEP_GCI.

7011:
Stop local checkpoint in the state CALCULATE_KEEP_GCI.

7012:
Restart local checkpoint after stopping in CALCULATE_KEEP_GCI.

Method:
1) Error 7011 in master, wait until report of stopped.
2) Error xxxx in participant to crash it.
3) Error 7012 in master to start again.

7013:
Insert system error in master when local checkpoint is in the
state clcpStatus = COPY_GCI before sending COPY_GCIREQ.

7014:
Insert system error in master when local checkpoint is in the
state clcpStatus = TC_CLOPSIZE before sending TC_CLOPSIZEREQ.

7015:
Insert system error in master when local checkpoint is in the
state clcpStatus = START_LCP_ROUND before sending START_LCP_ROUND.

7016:
Insert system error in master when local checkpoint is in the
state clcpStatus = START_LCP_ROUND after receiving LCP_REPORT.

7017:
Insert system error in master when local checkpoint is in the
state clcpStatus = TAB_COMPLETED.

7018:
Insert system error in master when local checkpoint is in the
state clcpStatus = TAB_SAVED before sending DIH_LCPCOMPLETE.

7019:
Insert system error in master when local checkpoint is in the
state clcpStatus = IDLE before sending CONTINUEB(ZCHECK_TC_COUNTER).

7020:
Insert system error in local checkpoint participant at reception of
COPY_GCIREQ.

7075: Master
Don't send any LCP_FRAG_ORD(last=true)
And crash when all have "not" been sent

8000: Crash particpant when receiving TCGETOPSIZEREQ
8001: Crash particpant when receiving TC_CLOPSIZEREQ
5010: Crash any when receiving LCP_FRAGORD

7021: Crash in  master when receiving START_LCP_REQ
7022: Crash in !master when receiving START_LCP_REQ

7023: Crash in  master when sending START_LCP_CONF
7024: Crash in !master when sending START_LCP_CONF

7025: Crash in  master when receiving LCP_FRAG_REP
7016: Crash in !master when receiving LCP_FRAG_REP

7026: Crash in  master when changing state to LCP_TAB_COMPLETED 
7017: Crash in !master when changing state to LCP_TAB_COMPLETED 

7027: Crash in  master when changing state to LCP_TAB_SAVED
7018: Crash in  master when changing state to LCP_TAB_SAVED

ERROR CODES FOR TESTING NODE FAILURE, FAILURE IN COPY FRAGMENT PROCESS:
-----------------------------------------------------------------------

5002:
Insert node failure in starting node when receiving a tuple copied from the copy node
as part of copy fragment process.
5003:
Insert node failure when receiving ABORT signal.

5004:
Insert node failure handling when receiving COMMITREQ.

5005:
Insert node failure handling when receiving COMPLETEREQ.

5006:
Insert node failure handling when receiving ABORTREQ.

5042:
As 5002, but with specified table (see DumpStateOrd)

These error code can be combined with error codes for testing time-out
handling in DBTC to ensure that node failures are also well handled in
time-out handling. They can also be used to test multiple node failure
handling.


ERROR CODES FOR TESTING TIME-OUT HANDLING IN DBLQH
-------------------------------------------------
5011:
Delay execution of COMMIT signal 2 seconds to generate time-out.

5012 (use 5017):
First delay execution of COMMIT signal 2 seconds to generate COMMITREQ.
Delay execution of COMMITREQ signal 2 seconds to generate time-out.

5013:
Delay execution of COMPLETE signal 2 seconds to generate time-out.

5014 (use 5018):
First delay execution of COMPLETE signal 2 seconds to generate COMPLETEREQ.
Delay execution of COMPLETEREQ signal 2 seconds to generate time-out.

5015:
Delay execution of ABORT signal 2 seconds to generate time-out.

5016: (ABORTREQ only as part of take-over)
Delay execution of ABORTREQ signal 2 seconds to generate time-out.

5031: lqhKeyRef, ZNO_TC_CONNECT_ERROR
5032: lqhKeyRef, ZTEMPORARY_REDO_LOG_FAILURE
5033: lqhKeyRef, ZTAIL_PROBLEM_IN_LOG_ERROR

5034: Don't pop scan queue

5035: Delay ACC_CONTOPCONT

5038: Drop LQHKEYREQ + set 5039
5039: Drop ABORT + set 5003

8048: Make TC not choose own node for simple/dirty read
5041: Crash is receiving simple read from other TC on different node

5100,5101: Drop ABORT req in primary replica
           Crash on "next" ABORT

ERROR CODES FOR TESTING TIME-OUT HANDLING IN DBTC
-------------------------------------------------
8040:
Delay execution of ABORTED signal 2 seconds to generate time-out.

8041:
Delay execution of COMMITTED signal 2 seconds to generate time-out.
8042 (use 8046):
Delay execution of COMMITTED signal 2 seconds to generate COMMITCONF.
Delay execution of COMMITCONF signal 2 seconds to generate time-out.

8043:
Delay execution of COMPLETED signal 2 seconds to generate time-out.

8044 (use 8047):
Delay execution of COMPLETED signal 2 seconds to generate COMPLETECONF.
Delay execution of COMPLETECONF signal 2 seconds to generate time-out.

8045: (ABORTCONF only as part of take-over)
Delay execution of ABORTCONF signal 2 seconds to generate time-out.

ERROR CODES FOR TESTING TIME-OUT HANDLING IN DBTC
-------------------------------------------------

8003: Throw away a LQHKEYCONF in state STARTED
8004: Throw away a LQHKEYCONF in state RECEIVING
8005: Throw away a LQHKEYCONF in state REC_COMMITTING
8006: Throw away a LQHKEYCONF in state START_COMMITTING

8007: Ignore send of LQHKEYREQ in state STARTED
8008: Ignore send of LQHKEYREQ in state START_COMMITTING

8009: Ignore send of LQHKEYREQ+ATTRINFO in state STARTED
8010: Ignore send of LQHKEYREQ+ATTRINFO in state START_COMMITTING

8011: Abort at send of CONTINUEB(ZSEND_ATTRINFO) in state STARTED
8012: Abort at send of CONTINUEB(ZSEND_ATTRINFO) in state START_COMMITTING

8013: Ignore send of CONTINUEB(ZSEND_COMPLETE_LOOP) (should crash eventually)
8014: Ignore send of CONTINUEB(ZSEND_COMMIT_LOOP) (should crash eventually)

8015: Ignore ATTRINFO signal in DBTC in state REC_COMMITTING
8016: Ignore ATTRINFO signal in DBTC in state RECEIVING

8017: Return immediately from DIVERIFYCONF (should crash eventually)
8018: Throw away a COMMITTED signal
8019: Throw away a COMPLETED signal

TESTING TAKE-OVER FUNCTIONALITY IN DBTC
---------------------------------------

8002: Crash when sending LQHKEYREQ
8029: Crash when receiving LQHKEYCONF
8030: Crash when receiving COMMITTED
8031: Crash when receiving COMPLETED
8020: Crash when all COMMITTED has arrived
8021: Crash when all COMPLETED has arrived
8022: Crash when all LQHKEYCONF has arrived

COMBINATION OF TIME-OUT + CRASH
-------------------------------

8023 (use 8024): Ignore LQHKEYCONF and crash when ABORTED signal arrives by setting 8024
8025 (use 8026): Ignore COMMITTED and crash when COMMITCONF signal arrives by setting 8026
8027 (use 8028): Ignore COMPLETED and crash when COMPLETECONF signal arrives by setting 8028

ABORT OF TCKEYREQ
-----------------

8032: No free TC records any more


CMVMI
-----
9000 Set RestartOnErrorInsert to restart -n
9998 Enter endless loop (trigger watchdog)
9999 Crash system immediatly

Test Crashes in handling node restarts
--------------------------------------

7121: Crash after receiving permission to start (START_PERMCONF) in starting
      node.
7122: Crash master when receiving request for permission to start (START_PERMREQ).
7123: Crash any non-starting node when receiving information about a starting node
      (START_INFOREQ)
7124: Respond negatively on an info request (START_INFOREQ)
7125: Stop an invalidate Node LCP process in the middle to test if START_INFOREQ
      stopped by long-running processes are handled in a correct manner.
7126: Allow node restarts for all nodes (used in conjunction with 7025)
7127: Crash when receiving a INCL_NODEREQ message.
7128: Crash master after receiving all INCL_NODECONF from all nodes
7129: Crash master after receiving all INCL_NODECONF from all nodes and releasing
      the lock on the dictionary
7130: Crash starting node after receiving START_MECONF
7131: Crash when receiving START_COPYREQ in master node
7132: Crash when receiving START_COPYCONF in starting node

DICT:
6000  Crash during NR when receiving DICTSTARTREQ
6001  Crash during NR when receiving SCHEMA_INFO
6002  Crash during NR soon after sending GET_TABINFO_REQ

LQH:
5026  Crash when receiving COPY_ACTIVEREQ
5027  Crash when receiving STAT_RECREQ

Test Crashes in handling take over
----------------------------------

7133: Crash when receiving START_TOREQ
7134: Crash master after receiving all START_TOCONF
7135: Crash master after copying table 0 to starting node
7136: Crash master after completing copy of tables
7137: Crash master after adding a fragment before copying it
7138: Crash when receiving CREATE_FRAGREQ in prepare phase
7139: Crash when receiving CREATE_FRAGREQ in commit phase
7140: Crash master when receiving all CREATE_FRAGCONF in prepare phase
7141: Crash master when receiving all CREATE_FRAGCONF in commit phase
7142: Crash master when receiving COPY_FRAGCONF
7143: Crash master when receiving COPY_ACTIVECONF
7144: Crash when receiving END_TOREQ
7145: Crash master after receiving first END_TOCONF
7146: Crash master after receiving all END_TOCONF
7147: Crash master after receiving first START_TOCONF
7148: Crash master after receiving first CREATE_FRAGCONF
7152: Crash master after receiving first UPDATE_TOCONF
7153: Crash master after receiving all UPDATE_TOCONF
7154: Crash when receiving UPDATE_TOREQ
7155: Crash master when completing writing start take over info
7156: Crash master when completing writing end take over info

Test failures in various states in take over functionality
----------------------------------------------------------
7157: Block take over at start take over
7158: Block take over at sending of START_TOREQ
7159: Block take over at selecting next fragment
7160: Block take over at creating new fragment
7161: Block take over at sending of CREATE_FRAGREQ in prepare phase
7162: Block take over at sending of CREATE_FRAGREQ in commit phase
7163: Block take over at sending of UPDATE_TOREQ at end of copy frag
7164: Block take over at sending of END_TOREQ
7169: Block take over at sending of UPDATE_TOREQ at end of copy

5008: Crash at reception of EMPTY_LCPREQ (at master take over after NF)
5009: Crash at sending of EMPTY_LCPCONF (at master take over after NF)

Test Crashes in Handling Graceful Shutdown
------------------------------------------
7065: Crash when receiving STOP_PERMREQ in master
7066: Crash when receiving STOP_PERMREQ in slave
7067: Crash when receiving DIH_SWITCH_REPLICA_REQ
7068: Crash when receiving DIH_SWITCH_REPLICA_CONF


Backup Stuff:
------------------------------------------
10001: Crash on NODE_FAILREP in Backup coordinator
10002: Crash on NODE_FAILREP when coordinatorTakeOver
10003: Crash on PREP_CREATE_TRIG_{CONF/REF} (only coordinator)
10004: Crash on START_BACKUP_{CONF/REF} (only coordinator)
10005: Crash on CREATE_TRIG_{CONF/REF} (only coordinator)
10006: Crash on WAIT_GCP_REF (only coordinator)
10007: Crash on WAIT_GCP_CONF (only coordinator)
10008: Crash on WAIT_GCP_CONF during start of backup (only coordinator)
10009: Crash on WAIT_GCP_CONF during stop of backup (only coordinator)
10010: Crash on BACKUP_FRAGMENT_CONF (only coordinator)
10011: Crash on BACKUP_FRAGMENT_REF (only coordinator)
10012: Crash on DROP_TRIG_{CONF/REF} (only coordinator)
10013: Crash on STOP_BACKUP_{CONF/REF} (only coordinator)
10014: Crash on DEFINE_BACKUP_REQ (participant)
10015: Crash on START_BACKUP_REQ (participant)
10016: Crash on BACKUP_FRAGMENT_REQ (participant)
10017: Crash on SCAN_FRAGCONF (participant)
10018: Crash on FSAPPENDCONF (participant)
10019: Crash on TRIG_ATTRINFO (participant)
10020: Crash on STOP_BACKUP_REQ (participant)
10021: Crash on NODE_FAILREP in participant not becoming coordinator

10022: Fake no backup records at DEFINE_BACKUP_REQ (participant)
10023: Abort backup by error at reception of UTIL_SEQUENCE_CONF (code 300)
10024: Abort backup by error at reception of DEFINE_BACKUP_CONF (code 301)
10025: Abort backup by error at reception of CREATE_TRIG_CONF last (code 302)
10026: Abort backup by error at reception of START_BACKUP_CONF (code 303)
10027: Abort backup by error at reception of DEFINE_BACKUP_REQ at master (code 304)
10028: Abort backup by error at reception of BACKUP_FRAGMENT_CONF at master (code 305)
10029: Abort backup by error at reception of FSAPPENDCONF in slave (FileOrScanError = 5)
10030: Simulate buffer full from trigger execution => abort backup

11001: Send UTIL_SEQUENCE_REF (in master)

5028:  Crash when receiving LQHKEYREQ (in non-master)

Failed Create Table:
--------------------
7173: Create table failed due to not sufficient number of fragment or
      replica records.
3001: Fail create 1st fragment
4007 12001: Fail create 1st fragment
4008 12002: Fail create 2nd fragment
4009 12003: Fail create 1st attribute in 1st fragment
4010 12004: Fail create last attribute in 1st fragment
4011 12005: Fail create 1st attribute in 2nd fragment
4012 12006: Fail create last attribute in 2nd fragment

Drop Table/Index:
-----------------
4001: Crash on REL_TABMEMREQ in TUP
4002: Crash on DROP_TABFILEREQ in TUP
4003: Fail next trigger create in TUP
4004: Fail next trigger drop in TUP
8033: Fail next trigger create in TC
8034: Fail next index create in TC
8035: Fail next trigger drop in TC
8036: Fail next index drop in TC

System Restart:
---------------

5020: Force system to read pages form file when executing prepare operation record
3000: Delay writing of datapages in ACC when LCP is started
4000: Delay writing of datapages in TUP when LCP is started
7070: Set TimeBetweenLcp to min value
7071: Set TimeBetweenLcp to max value
7072: Split START_FRAGREQ into several log nodes
7073: Don't include own node in START_FRAGREQ
7074: 7072 + 7073

Scan:
------

5021: Crash when receiving SCAN_NEXTREQ if sender is own node
5022: Crash when receiving SCAN_NEXTREQ if sender is NOT own node
5023: Drop SCAN_NEXTREQ if sender is own node
5024: Drop SCAN_NEXTREQ if sender is NOT own node
5025: Delay SCAN_NEXTREQ 1 second if sender is NOT own node
5030: Drop all SCAN_NEXTREQ until node is shutdown with SYSTEM_ERROR
      because of scan fragment timeout

Test routing of signals:
-----------------------
4006: Turn on routing of TRANSID_AI signals from TUP
5029: Turn on routing of KEYINFO20 signals from LQH

Ordered index:
--------------

Dbdict:
-------
6003 Crash in participant @ CreateTabReq::Prepare
6004 Crash in participant @ CreateTabReq::Commit
6005 Crash in participant @ CreateTabReq::CreateDrop