Training courses

Kernel and Embedded Linux

Bootlin training courses

Embedded Linux, kernel,
Yocto Project, Buildroot, real-time,
graphics, boot time, debugging...

Bootlin logo

Elixir Cross Referencer

   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
  21
  22
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45
  46
  47
  48
  49
  50
  51
  52
  53
  54
  55
  56
  57
  58
  59
  60
  61
  62
  63
  64
  65
  66
  67
  68
  69
  70
  71
  72
  73
  74
  75
  76
  77
  78
  79
  80
  81
  82
  83
  84
  85
  86
  87
  88
  89
  90
  91
  92
  93
  94
  95
  96
  97
  98
  99
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
 116
 117
 118
 119
 120
 121
 122
 123
 124
 125
 126
 127
 128
 129
 130
 131
 132
 133
 134
 135
 136
 137
 138
 139
 140
 141
 142
 143
 144
 145
 146
 147
 148
 149
 150
 151
 152
 153
 154
 155
 156
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170
 171
 172
 173
 174
 175
 176
 177
 178
 179
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
 191
 192
 193
 194
 195
 196
 197
 198
 199
 200
 201
 202
 203
 204
 205
 206
 207
 208
 209
 210
 211
 212
 213
 214
 215
 216
 217
 218
 219
 220
 221
 222
 223
 224
 225
 226
 227
 228
 229
 230
 231
 232
 233
 234
 235
 236
 237
 238
 239
 240
 241
 242
 243
 244
 245
 246
 247
 248
 249
 250
 251
 252
 253
 254
 255
 256
 257
 258
 259
 260
 261
 262
 263
 264
 265
 266
 267
 268
 269
 270
 271
 272
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 364
 365
 366
 367
 368
 369
 370
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405
 406
 407
 408
 409
 410
 411
 412
 413
 414
 415
 416
 417
 418
 419
 420
 421
 422
 423
 424
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434
 435
 436
 437
 438
 439
 440
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450
 451
 452
 453
 454
 455
 456
 457
 458
 459
 460
 461
 462
 463
 464
 465
 466
 467
 468
 469
 470
 471
 472
 473
 474
 475
 476
 477
 478
 479
 480
 481
 482
 483
 484
 485
 486
 487
 488
 489
 490
 491
 492
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516
 517
 518
 519
 520
 521
 522
 523
 524
 525
 526
 527
 528
 529
 530
 531
 532
 533
 534
 535
 536
 537
 538
 539
 540
 541
 542
 543
 544
 545
 546
 547
 548
 549
 550
 551
 552
 553
 554
 555
 556
 557
 558
 559
 560
 561
 562
 563
 564
 565
 566
 567
 568
 569
 570
 571
 572
 573
 574
 575
 576
 577
 578
 579
 580
 581
 582
 583
 584
 585
 586
 587
 588
 589
 590
 591
 592
 593
 594
 595
 596
 597
 598
 599
 600
 601
 602
 603
 604
 605
 606
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618
 619
 620
 621
 622
 623
 624
 625
 626
 627
 628
 629
 630
 631
 632
 633
 634
 635
 636
 637
 638
 639
 640
 641
 642
 643
 644
 645
 646
 647
 648
 649
 650
 651
 652
 653
 654
 655
 656
 657
 658
 659
 660
 661
 662
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683
 684
 685
 686
 687
 688
 689
 690
 691
 692
 693
 694
 695
 696
 697
 698
 699
 700
 701
 702
 703
 704
 705
 706
 707
 708
 709
 710
 711
 712
 713
 714
 715
 716
 717
 718
 719
 720
 721
 722
 723
 724
 725
 726
 727
 728
 729
 730
 731
 732
 733
 734
 735
 736
 737
 738
 739
 740
 741
 742
 743
 744
 745
 746
 747
 748
 749
 750
 751
 752
 753
 754
 755
 756
 757
 758
 759
 760
 761
 762
 763
 764
 765
 766
 767
 768
 769
 770
 771
 772
 773
 774
 775
 776
 777
 778
 779
 780
 781
 782
 783
 784
 785
 786
 787
 788
 789
 790
 791
 792
 793
 794
 795
 796
 797
 798
 799
 800
 801
 802
 803
 804
 805
 806
 807
 808
 809
 810
 811
 812
 813
 814
 815
 816
 817
 818
 819
 820
 821
 822
 823
 824
 825
 826
 827
 828
 829
 830
 831
 832
 833
 834
 835
 836
 837
 838
 839
 840
 841
 842
 843
 844
 845
 846
 847
 848
 849
 850
 851
 852
 853
 854
 855
 856
 857
 858
 859
 860
 861
 862
 863
 864
 865
 866
 867
 868
 869
 870
 871
 872
 873
 874
 875
 876
 877
 878
 879
 880
 881
 882
 883
 884
 885
 886
 887
 888
 889
 890
 891
 892
 893
 894
 895
 896
 897
 898
 899
 900
 901
 902
 903
 904
 905
 906
 907
 908
 909
 910
 911
 912
 913
 914
 915
 916
 917
 918
 919
 920
 921
 922
 923
 924
 925
 926
 927
 928
 929
 930
 931
 932
 933
 934
 935
 936
 937
 938
 939
 940
 941
 942
 943
 944
 945
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
PPoossttffiixx QQuueeuuee SScchheedduulleerr

-------------------------------------------------------------------------------

DDiissccllaaiimmeerr

Many of the transport-specific configuration parameters discussed in this
document will not show up in "postconf" command output before Postfix version
2.9. This limitation applies to many parameters whose name is a combination of
a master.cf service name such as "relay" and a built-in suffix such as
"_destination_concurrency_limit".

OOvveerrvviieeww

The queue manager is by far the most complex part of the Postfix mail system.
It schedules delivery of new mail, retries failed deliveries at specific times,
and removes mail from the queue after the last delivery attempt. There are two
major classes of mechanisms that control the operation of the queue manager.

Topics covered by this document:

  * Concurrency scheduling, concerned with the number of concurrent deliveries
    to a specific destination, including decisions on when to suspend
    deliveries after persistent failures.
  * Preemptive scheduling, concerned with the selection of email messages and
    recipients for a given destination.
  * Credits, something this document would not be complete without.

CCoonnccuurrrreennccyy sscchheedduulliinngg

The following sections document the Postfix 2.5 concurrency scheduler, after a
discussion of the limitations of the earlier concurrency scheduler. This is
followed by results of medium-concurrency experiments, and a discussion of
trade-offs between performance and robustness.

The material is organized as follows:

  * Drawbacks of the existing concurrency scheduler
  * Summary of the Postfix 2.5 concurrency feedback algorithm
  * Summary of the Postfix 2.5 "dead destination" detection algorithm
  * Pseudocode for the Postfix 2.5 concurrency scheduler
  * Results for delivery to concurrency limited servers
  * Discussion of concurrency limited server results
  * Limitations of less-than-1 per delivery feedback
  * Concurrency configuration parameters

DDrraawwbbaacckkss ooff tthhee eexxiissttiinngg ccoonnccuurrrreennccyy sscchheedduulleerr

From the start, Postfix has used a simple but robust algorithm where the per-
destination delivery concurrency is decremented by 1 after delivery failed due
to connection or handshake failure, and incremented by 1 otherwise. Of course
the concurrency is never allowed to exceed the maximum per-destination
concurrency limit. And when a destination's concurrency level drops to zero,
the destination is declared "dead" and delivery is suspended.

Drawbacks of +/-1 concurrency feedback per delivery are:

  * Overshoot due to exponential delivery concurrency growth with each pseudo-
    cohort(*). This can be an issue with high-concurrency channels. For
    example, with the default initial concurrency of 5, concurrency would
    proceed over time as (5-10-20).

  * Throttling down to zero concurrency after a single pseudo-cohort(*)
    failure. This was especially an issue with low-concurrency channels where a
    single failure could be sufficient to mark a destination as "dead", causing
    the suspension of further deliveries to the affected destination.

(*) A pseudo-cohort is a number of delivery requests equal to a destination's
delivery concurrency.

The revised concurrency scheduler has a highly modular structure. It uses
separate mechanisms for per-destination concurrency control and for "dead
destination" detection. The concurrency control in turn is built from two
separate mechanisms: it supports less-than-1 feedback per delivery to allow for
more gradual concurrency adjustments, and it uses feedback hysteresis to
suppress concurrency oscillations. And instead of waiting for delivery
concurrency to throttle down to zero, a destination is declared "dead" after a
configurable number of pseudo-cohorts reports connection or handshake failure.

SSuummmmaarryy ooff tthhee PPoossttffiixx 22..55 ccoonnccuurrrreennccyy ffeeeeddbbaacckk aallggoorriitthhmm

We want to increment a destination's delivery concurrency when some (not
necessarily consecutive) number of deliveries complete without connection or
handshake failure. This is implemented with positive feedback g(N) where N is
the destination's delivery concurrency. With g(N)=1 feedback per delivery,
concurrency increases by 1 after each positive feedback event; this gives us
the old scheduler's exponential growth in time. With g(N)=1/N feedback per
delivery, concurrency increases by 1 after an entire pseudo-cohort N of
positive feedback reports; this gives us linear growth in time. Less-than-
1 feedback per delivery and integer truncation naturally give us hysteresis, so
that transitions to larger concurrency happen every 1/g(N) positive feedback
events.

We want to decrement a destination's delivery concurrency when some (not
necessarily consecutive) number of deliveries complete after connection or
handshake failure. This is implemented with negative feedback f(N) where N is
the destination's delivery concurrency. With f(N)=1 feedback per delivery,
concurrency decreases by 1 after each negative feedback event; this gives us
the old scheduler's behavior where concurrency is throttled down dramatically
after a single pseudo-cohort failure. With f(N)=1/N feedback per delivery,
concurrency backs off more gently. Again, less-than-1 feedback per delivery and
integer truncation naturally give us hysteresis, so that transitions to lower
concurrency happen every 1/f(N) negative feedback events.

However, with negative feedback we introduce a subtle twist. We "reverse" the
negative hysteresis cycle so that the transition to lower concurrency happens
at the bbeeggiinnnniinngg of a sequence of 1/f(N) negative feedback events. Otherwise, a
correction for overload would be made too late. This makes the choice of f(N)
relatively unimportant, as borne out by measurements later in this document.

In summary, the main ingredients for the Postfix 2.5 concurrency feedback
algorithm are a) the option of less-than-1 positive feedback per delivery to
avoid overwhelming servers, b) the option of less-than-1 negative feedback per
delivery to avoid giving up too fast, c) feedback hysteresis to avoid rapid
oscillation, and d) a "reverse" hysteresis cycle for negative feedback, so that
it can correct for overload quickly.

SSuummmmaarryy ooff tthhee PPoossttffiixx 22..55 ""ddeeaadd ddeessttiinnaattiioonn"" ddeetteeccttiioonn aallggoorriitthhmm

We want to suspend deliveries to a specific destination after some number of
deliveries suffers connection or handshake failure. The old scheduler declares
a destination "dead" when negative (-1) feedback throttles the delivery
concurrency down to zero. With less-than-1 feedback per delivery, this
throttling down would obviously take too long. We therefore have to separate
"dead destination" detection from concurrency feedback. This is implemented by
introducing the concept of pseudo-cohort failure. The Postfix 2.5 concurrency
scheduler declares a destination "dead" after a configurable number of pseudo-
cohorts suffers from connection or handshake failures. The old scheduler
corresponds to the special case where the pseudo-cohort failure limit is equal
to 1.

PPsseeuuddooccooddee ffoorr tthhee PPoossttffiixx 22..55 ccoonnccuurrrreennccyy sscchheedduulleerr

The pseudo code shows how the ideas behind new concurrency scheduler are
implemented as of November 2007. The actual code can be found in the module
qmgr/qmgr_queue.c.

Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

After success:
        fail_cohorts = 0
        Be prepared for feedback > hysteresis, or rounding error
        success += g(concurrency)
        while (success >= 1)            Hysteresis 1
            concurrency += 1            Hysteresis 1
            failure = 0
            success -= 1                Hysteresis 1
        Be prepared for overshoot
        if (concurrency > concurrency limit)
            concurrency = concurrency limit

Safety:
        Don't apply positive feedback unless
            concurrency < busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

After failure:
        if (concurrency > 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts > cohort_failure_limit)
                concurrency = 0
        if (concurrency > 0)
            Be prepared for feedback > hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure < 0)
                concurrency -= 1        Hysteresis 1
                failure += 1            Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency < 1)
                concurrency = 1

RReessuullttss ffoorr ddeelliivveerryy ttoo ccoonnccuurrrreennccyy--lliimmiitteedd sseerrvveerrss

Discussions about the concurrency scheduler redesign started early 2004, when
the primary goal was to find alternatives that did not exhibit exponential
growth or rapid concurrency throttling. No code was implemented until late
2007, when the primary concern had shifted towards better handling of server
concurrency limits. For this reason we measure how well the new scheduler does
this job. The table below compares mail delivery performance of the old +/-
1 feedback per delivery with several less-than-1 feedback functions, for
different limited-concurrency server scenarios. Measurements were done with a
FreeBSD 6.2 client and with FreeBSD 6.2 and various Linux servers.

Server configuration:

  * The mail flow was slowed down with 1 second latency per recipient
    ("smtpd_client_restrictions = sleep 1"). The purpose was to make results
    less dependent on hardware details, by avoiding slow-downs by queue file I/
    O, logging I/O, and network I/O.
  * Concurrency was limited by the server process limit ("default_process_limit
    = 5" and "smtpd_client_event_limit_exceptions = static:all"). Postfix was
    stopped and started after changing the process limit, because the same
    number is also used as the backlog argument to the listen(2) system call,
    and "postfix reload" does not re-issue this call.
  * Mail was discarded with "local_recipient_maps = static:all" and
    "local_transport = discard". The discard action in access maps or header/
    body checks could not be used as it fails to update the in_flow_delay
    counters.

Client configuration:

  * Queue file overhead was minimized by sending one message to a virtual alias
    that expanded into 2000 different remote recipients. All recipients were
    accounted for according to the maillog file. The
    virtual_alias_expansion_limit setting was increased to avoid complaints
    from the cleanup(8) server.
  * The number of deliveries was maximized with
    "smtp_destination_recipient_limit = 2". A smaller limit would cause Postfix
    to schedule the concurrency per recipient instead of domain, which is not
    what we want.
  * Maximum concurrency was limited with "smtp_destination_concurrency_limit =
    20", and initial_destination_concurrency was set to the same value.
  * The positive and negative concurrency feedback hysteresis was 1.
    Concurrency was incremented by 1 at the END of 1/feedback steps of positive
    feedback, and was decremented by 1 at the START of 1/feedback steps of
    negative feedback.
  * The SMTP client used the default 30s SMTP connect timeout and 300s SMTP
    greeting timeout.

IImmppaacctt ooff tthhee 3300ss SSMMTTPP ccoonnnneecctt ttiimmeeoouutt

The first results are for a FreeBSD 6.2 server, where our artificially low
listen(2) backlog results in a very short kernel queue for established
connections. The table shows that all deferred deliveries failed due to a 30s
connection timeout, and none failed due to a server greeting timeout. This
measurement simulates what happens when the server's connection queue is
completely full under load, and the TCP engine drops new connections.

    cclliieenntt sseerrvveerr ffeeeeddbbaacckk  ccoonnnneeccttiioonn ppeerrcceennttaaggee cclliieenntt         ttiimmeedd--oouutt iinn
    lliimmiitt  lliimmiitt  ssttyyllee     ccaacchhiinngg    ddeeffeerrrreedd   ccoonnccuurrrreennccyy    ccoonnnneecctt//
                                                  aavveerraaggee//ssttddddeevv ggrreeeettiinngg

    -------------------------------------------------------------------------
       20     5       1/N         no        9.9   19.4    0.49   198      -

       20     5       1/N        yes       10.3   19.4    0.49   206      -

       20     5   1/sqrt(N)       no       10.4   19.6    0.59   208      -

       20     5   1/sqrt(N)      yes       10.6   19.6    0.61   212      -

       20     5         1         no       10.1   19.5    1.29   202      -

       20     5         1        yes       10.8   19.3    1.57   216      -

    -------------------------------------------------------------------------

    A busy server with a completely full connection queue. N is the client
    delivery concurrency. Failed deliveries time out after 30s without
    completing the TCP handshake. See text for a discussion of results.

IImmppaacctt ooff tthhee 330000ss SSMMTTPP ggrreeeettiinngg ttiimmeeoouutt

The next table shows results for a Fedora Core 8 server (results for RedHat 7.3
are identical). In this case, the artificially small listen(2) backlog argument
does not impact our measurement. The table shows that practically all deferred
deliveries fail after the 300s SMTP greeting timeout. As these timeouts were
10x longer than with the first measurement, we increased the recipient count
(and thus the running time) by a factor of 10 to keep the results comparable.
The deferred mail percentages are a factor 10 lower than with the first
measurement, because the 1s per-recipient delay was 1/300th of the greeting
timeout instead of 1/30th of the connection timeout.

    cclliieenntt sseerrvveerr ffeeeeddbbaacckk  ccoonnnneeccttiioonn ppeerrcceennttaaggee cclliieenntt         ttiimmeedd--oouutt iinn
    lliimmiitt  lliimmiitt  ssttyyllee     ccaacchhiinngg    ddeeffeerrrreedd   ccoonnccuurrrreennccyy    ccoonnnneecctt//
                                                  aavveerraaggee//ssttddddeevv ggrreeeettiinngg

    -------------------------------------------------------------------------
       20     5       1/N         no       1.16   19.8    0.37   -      230

       20     5       1/N        yes       1.36   19.8    0.36   -      272

       20     5   1/sqrt(N)       no       1.21   19.9    0.23   4      238

       20     5   1/sqrt(N)      yes       1.36   20.0    0.23   -      272

       20     5         1         no       1.18   20.0    0.16   -      236

       20     5         1        yes       1.39   20.0    0.16   -      278

    -------------------------------------------------------------------------

    A busy server with a non-full connection queue. N is the client delivery
    concurrency. Failed deliveries complete at the TCP level, but time out
    after 300s while waiting for the SMTP greeting. See text for a discussion
    of results.

IImmppaacctt ooff aaccttiivvee sseerrvveerr ccoonnccuurrrreennccyy lliimmiitteerr

The final concurrency-limited result shows what happens when SMTP connections
don't time out, but are rejected immediately with the Postfix server's
smtpd_client_connection_count_limit feature (the server replies with a 421
status and disconnects immediately). Similar results can be expected with
concurrency limiting features built into other MTAs or firewalls. For this
measurement we specified a server concurrency limit and a client initial
destination concurrency of 5, and a server process limit of 10; all other
conditions were the same as with the first measurement. The same result would
be obtained with a FreeBSD or Linux server, because the "pushing back" is done
entirely by the receiving side.

    cclliieenntt sseerrvveerr ffeeeeddbbaacckk  ccoonnnneeccttiioonn ppeerrcceennttaaggee cclliieenntt         tthheeoorreettiiccaall
    lliimmiitt  lliimmiitt  ssttyyllee     ccaacchhiinngg    ddeeffeerrrreedd   ccoonnccuurrrreennccyy    ddeeffeerr rraattee
                                                  aavveerraaggee//ssttddddeevv

    -------------------------------------------------------------------------
       20     5       1/N         no       16.5   5.17    0.38         1/6

       20     5       1/N        yes       16.5   5.17    0.38         1/6

       20     5   1/sqrt(N)       no       24.5   5.28    0.45         1/4

       20     5   1/sqrt(N)      yes       24.3   5.28    0.46         1/4

       20     5         1         no       49.7   5.63    0.67         1/2

       20     5         1        yes       49.7   5.68    0.70         1/2

    -------------------------------------------------------------------------

    A server with active per-client concurrency limiter that replies with 421
    and disconnects. N is the client delivery concurrency. The theoretical
    defer rate is 1/(1+roundup(1/feedback)). This is always 1/2 with the fixed
    +/-1 feedback per delivery; with the concurrency-dependent feedback
    variants, the defer rate decreases with increasing concurrency. See text
    for a discussion of results.

DDiissccuussssiioonn ooff ccoonnccuurrrreennccyy--lliimmiitteedd sseerrvveerr rreessuullttss

All results in the previous sections are based on the first delivery runs only;
they do not include any second etc. delivery attempts. It's also worth noting
that the measurements look at steady-state behavior only. They don't show what
happens when the client starts sending at a much higher or lower concurrency.

The first two examples show that the effect of feedback is negligible when
concurrency is limited due to congestion. This is because the initial
concurrency is already at the client's concurrency maximum, and because there
is 10-100 times more positive than negative feedback. Under these conditions,
it is no surprise that the contribution from SMTP connection caching is also
negligible.

In the last example, the old +/-1 feedback per delivery will defer 50% of the
mail when confronted with an active (anvil-style) server concurrency limit,
where the server hangs up immediately with a 421 status (a TCP-level RST would
have the same result). Less aggressive feedback mechanisms fare better than
more aggressive ones. Concurrency-dependent feedback fares even better at
higher concurrencies than shown here, but has limitations as discussed in the
next section.

LLiimmiittaattiioonnss ooff lleessss--tthhaann--11 ppeerr ddeelliivveerryy ffeeeeddbbaacckk

Less-than-1 feedback is of interest primarily when sending large amounts of
mail to destinations with active concurrency limiters (servers that reply with
421, or firewalls that send RST). When sending small amounts of mail per
destination, less-than-1 per-delivery feedback won't have a noticeable effect
on the per-destination concurrency, because the number of deliveries to the
same destination is too small. You might just as well use zero per-delivery
feedback and stay with the initial per-destination concurrency. And when mail
deliveries fail due to congestion instead of active concurrency limiters, the
measurements above show that per-delivery feedback has no effect. With large
amounts of mail you might just as well use zero per-delivery feedback and start
with the maximal per-destination concurrency.

The scheduler with less-than-1 concurrency feedback per delivery solves a
problem with servers that have active concurrency limiters. This works only
because feedback is handled in a peculiar manner: positive feedback will
increment the concurrency by 1 at the eenndd of a sequence of events of length 1/
feedback, while negative feedback will decrement concurrency by 1 at the
bbeeggiinnnniinngg of such a sequence. This is how Postfix adjusts quickly for overshoot
without causing lots of mail to be deferred. Without this difference in
feedback treatment, less-than-1 feedback per delivery would defer 50% of the
mail, and would be no better in this respect than the old +/-1 feedback per
delivery.

Unfortunately, the same feature that corrects quickly for concurrency overshoot
also makes the scheduler more sensitive for noisy negative feedback. The reason
is that one lonely negative feedback event has the same effect as a complete
sequence of length 1/feedback: in both cases delivery concurrency is dropped by
1 immediately. As a worst-case scenario, consider multiple servers behind a
load balancer on a single IP address, and no backup MX address. When 1 out of K
servers fails to complete the SMTP handshake or drops the connection, a
scheduler with 1/N (N = concurrency) feedback stops increasing its concurrency
once it reaches a concurrency level of about K, even though the good servers
behind the load balancer are perfectly capable of handling more traffic.

This noise problem gets worse as the amount of positive feedback per delivery
gets smaller. A compromise is to use fixed less-than-1 positive feedback values
instead of concurrency-dependent positive feedback. For example, to tolerate 1
of 4 bad servers in the above load balancer scenario, use positive feedback of
1/4 per "good" delivery (no connect or handshake error), and use an equal or
smaller amount of negative feedback per "bad" delivery. The downside of using
concurrency-independent feedback is that some of the old +/-1 feedback problems
will return at large concurrencies. Sites that must deliver mail at non-trivial
per-destination concurrencies will require special configuration.

CCoonnccuurrrreennccyy ccoonnffiigguurraattiioonn ppaarraammeetteerrss

The Postfix 2.5 concurrency scheduler is controlled with the following
configuration parameters, where "transport_foo" provides a transport-specific
parameter override. All parameter default settings are compatible with earlier
Postfix versions.

    PPaarraammeetteerr nnaammee                                        PPoossttffiixx DDeessccrriippttiioonn
                                                          vveerrssiioonn

    ---------------------------------------------------------------------------
                                                                  Initial per-
    initial_destination_concurrency                          all  destination
    transport_initial_destination_concurrency                2.5  delivery
                                                                  concurrency

                                                                  Maximum per-
    default_destination_concurrency_limit                    all  destination
    transport_destination_concurrency_limit                  all  delivery
                                                                  concurrency

                                                                  Per-
                                                                  destination
                                                                  positive
                                                                  feedback
    default_destination_concurrency_positive_feedback        2.5  amount, per
    transport_destination_concurrency_positive_feedback      2.5  delivery that
                                                                  does not fail
                                                                  with
                                                                  connection or
                                                                  handshake
                                                                  failure

                                                                  Per-
                                                                  destination
                                                                  negative
                                                                  feedback
    default_destination_concurrency_negative_feedback        2.5  amount, per
    transport_destination_concurrency_negative_feedback      2.5  delivery that
                                                                  fails with
                                                                  connection or
                                                                  handshake
                                                                  failure

                                                                  Number of
                                                                  failed
                                                                  pseudo-
                                                                  cohorts after
    default_destination_concurrency_failed_cohort_limit      2.5  which a
    transport_destination_concurrency_failed_cohort_limit    2.5  destination
                                                                  is declared
                                                                  "dead" and
                                                                  delivery is
                                                                  suspended

                                                                  Enable
                                                                  verbose
    destination_concurrency_feedback_debug                   2.5  logging of
                                                                  concurrency
                                                                  scheduler
                                                                  activity

    ---------------------------------------------------------------------------

PPrreeeemmppttiivvee sscchheedduulliinngg

The following sections describe the new queue manager and its preemptive
scheduler algorithm. Note that the document was originally written to describe
the changes between the new queue manager (in this text referred to as nqmgr,
the name it was known by before it became the default queue manager) and the
old queue manager (referred to as oqmgr). This is why it refers to oqmgr every
so often.

This document is divided into sections as follows:

  * The structures used by nqmgr
  * What happens when nqmgr picks up the message - how it is assigned to
    transports, jobs, peers, entries
  * How the entry selection works
  * How the preemption works - what messages may be preempted and how and what
    messages are chosen to preempt them
  * How destination concurrency limits affect the scheduling algorithm
  * Dealing with memory resource limits

TThhee ssttrruuccttuurreess uusseedd bbyy nnqqmmggrr

Let's start by recapitulating the structures and terms used when referring to
queue manager and how it operates. Many of these are partially described
elsewhere, but it is nice to have a coherent overview in one place:

  * Each message structure represents one mail message which Postfix is to
    deliver. The message recipients specify to what destinations is the message
    to be delivered and what transports are going to be used for the delivery.

  * Each recipient entry groups a batch of recipients of one message which are
    all going to be delivered to the same destination (and over the same
    transport).

  * Each transport structure groups everything what is going to be delivered by
    delivery agents dedicated for that transport. Each transport maintains a
    set of queues (describing the destinations it shall talk to) and jobs
    (referencing the messages it shall deliver).

  * Each transport queue (not to be confused with the on-disk active queue or
    incoming queue) groups everything what is going be delivered to given
    destination (aka nexthop) by its transport. Each queue belongs to one
    transport, so each destination may be referred to by several queues, one
    for each transport. Each queue maintains a list of all recipient entries
    (batches of message recipients) which shall be delivered to given
    destination (the todo list), and a list of recipient entries already being
    delivered by the delivery agents (the busy list).

  * Each queue corresponds to multiple peer structures. Each peer structure is
    like the queue structure, belonging to one transport and referencing one
    destination. The difference is that it lists only the recipient entries
    which all originate from the same message, unlike the queue structure,
    whose entries may originate from various messages. For messages with few
    recipients, there is usually just one recipient entry for each destination,
    resulting in one recipient entry per peer. But for large mailing list
    messages the recipients may need to be split to multiple recipient entries,
    in which case the peer structure may list many entries for single
    destination.

  * Each transport job groups everything it takes to deliver one message via
    its transport. Each job represents one message within the context of the
    transport. The job belongs to one transport and message, so each message
    may have multiple jobs, one for each transport. The job groups all the peer
    structures, which describe the destinations the job's message has to be
    delivered to.

The first four structures are common to both nqmgr and oqmgr, the latter two
were introduced by nqmgr.

These terms are used extensively in the text below, feel free to look up the
description above anytime you'll feel you have lost a sense what is what.

WWhhaatt hhaappppeennss wwhheenn nnqqmmggrr ppiicckkss uupp tthhee mmeessssaaggee

Whenever nqmgr moves a queue file into the active queue, the following happens:
It reads all necessary information from the queue file as oqmgr does, and also
reads as many recipients as possible - more on that later, for now let's just
pretend it always reads all recipients.

Then it resolves the recipients as oqmgr does, which means obtaining (address,
nexthop, transport) triple for each recipient. For each triple, it finds the
transport; if it does not exist yet, it instantiates it (unless it's dead).
Within the transport, it finds the destination queue for given nexthop; if it
does not exist yet, it instantiates it (unless it's dead). The triple is then
bound to given destination queue. This happens in qmgr_resolve() and is
basically the same as in oqmgr.

Then for each triple which was bound to some queue (and thus transport), the
program finds the job which represents the message within that transport's
context; if it does not exist yet, it instantiates it. Within the job, it finds
the peer which represents the bound destination queue within this jobs context;
if it does not exist yet, it instantiates it. Finally, it stores the address
from the resolved triple to the recipient entry which is appended to both the
queue entry list and the peer entry list. The addresses for same nexthop are
batched in the entries up to recipient_concurrency limit for that transport.
This happens in qmgr_assign() and apart from that it operates with job and peer
structures it is basically the same as in oqmgr.

When the job is instantiated, it is enqueued on the transport's job list based
on the time its message was picked up by nqmgr. For first batch of recipients
this means it is appended to the end of the job list, but the ordering of the
job list by the enqueue time is important as we will see shortly.

[Now you should have pretty good idea what is the state of the nqmgr after
couple of messages was picked up, what is the relation between all those job,
peer, queue and entry structures.]

HHooww tthhee eennttrryy sseelleeccttiioonn wwoorrkkss

Having prepared all those above mentioned structures, the task of the nqmgr's
scheduler is to choose the recipient entries one at a time and pass them to the
delivery agent for corresponding transport. Now how does this work?

The first approximation of the new scheduling algorithm is like this:

    foreach transport (round-robin-by-transport)
    do
        if transport busy continue
        if transport process limit reached continue
        foreach transport's job (in the order of the transport's job list)
        do
            foreach job's peer (round-robin-by-destination)
                 if peer->queue->concurrency < peer->queue->window
                     return next peer entry.
            done
        done
    done

Now what is the "order of the transport's job list"? As we know already, the
job list is by default kept in the order the message was picked up by the
nqmgr. So by default we get the top-level round-robin transport, and within
each transport we get the FIFO message delivery. The round-robin of the peers
by the destination is perhaps of little importance in most real-life cases
(unless the recipient_concurrency limit is reached, in one job there is only
one peer structure for each destination), but theoretically it makes sure that
even within single jobs, destinations are treated fairly.

[By now you should have a feeling you really know how the scheduler works,
except for the preemption, under ideal conditions - that is, no recipient
resource limits and no destination concurrency problems.]

HHooww tthhee pprreeeemmppttiioonn wwoorrkkss

As you might perhaps expect by now, the transport's job list does not remain
sorted by the job's message enqueue time all the time. The most cool thing
about nqmgr is not the simple FIFO delivery, but that it is able to slip mail
with little recipients past the mailing-list bulk mail. This is what the job
preemption is about - shuffling the jobs on the transport's job list to get the
best message delivery rates. Now how is it achieved?

First I have to tell you that there are in fact two job lists in each
transport. One is the scheduler's job list, which the scheduler is free to play
with, while the other one keeps the jobs always listed in the order of the
enqueue time and is used for recipient pool management we will discuss later.
For now, we will deal with the scheduler's job list only.

So, we have the job list, which is first ordered by the time the jobs' messages
were enqueued, oldest messages first, the most recently picked one at the end.
For now, let's assume that there are no destination concurrency problems.
Without preemption, we pick some entry of the first (oldest) job on the queue,
assign it to delivery agent, pick another one from the same job, assign it
again, and so on, until all the entries are used and the job is delivered. We
would then move onto the next job and so on and on. Now how do we manage to
sneak in some entries from the recently added jobs when the first job on the
job list belongs to a message going to the mailing-list and has thousands of
recipient entries?

The nqmgr's answer is that we can artificially "inflate" the delivery time of
that first job by some constant for free - it is basically the same trick you
might remember as "accumulation of potential" from the amortized complexity
lessons. For example, instead of delivering the entries of the first job on the
job list every time a delivery agent becomes available, we can do it only every
second time. If you view the moments the delivery agent becomes available on a
timeline as "delivery slots", then instead of using every delivery slot for the
first job, we can use only every other slot, and still the overall delivery
efficiency of the first job remains the same. So the delivery 11112222 becomes
1.1.1.1.2.2.2.2 (1 and 2 are the imaginary job numbers, . denotes the free
slot). Now what do we do with free slots?

As you might have guessed, we will use them for sneaking the mail with little
recipients in. For example, if we have one four-recipient mail followed by four
one recipients mail, the delivery sequence (that is, the sequence in which the
jobs are assigned to the delivery slots) might look like this: 12131415. Hmm,
fine for sneaking in the single recipient mail, but how do we sneak in the mail
with more than one recipient? Say if we have one four-recipient mail followed
by two two-recipient mails?

The simple answer would be to use delivery sequence 12121313. But the problem
is that this does not scale well. Imagine you have mail with thousand
recipients followed by mail with hundred recipients. It is tempting to suggest
the delivery sequence like 121212...., but alas! Imagine there arrives another
mail with say ten recipients. But there are no free slots anymore, so it can't
slip by, not even if it had just only one recipients. It will be stuck until
the hundred-recipient mail is delivered, which really sucks.

So, it becomes obvious that while inflating the message to get free slots is
great idea, one has to be really careful of how the free slots are assigned,
otherwise one might corner himself. So, how does nqmgr really use the free
slots?

The key idea is that one does not have to generate the free slots in a uniform
way. The delivery sequence 111...1 is no worse than 1.1.1.1, in fact, it is
even better as some entries are in the first case selected earlier than in the
second case, and none is selected later! So it is possible to first
"accumulate" the free delivery slots and then use them all at once. It is even
possible to accumulate some, then use them, then accumulate some more and use
them again, as in 11..1.1 .

Let's get back to the one hundred recipient example. We now know that we could
first accumulate one hundred free slots, and only after then to preempt the
first job and sneak the one hundred recipient mail in. Applying the algorithm
recursively, we see the hundred recipient job can accumulate ten free delivery
slots, and then we could preempt it and sneak in the ten-recipient mail... Wait
wait wait! Could we? Aren't we overinflating the original one thousand
recipient mail?

Well, despite it looks so at the first glance, another trick will allow us to
answer "no, we are not!". If we had said that we will inflate the delivery time
twice at maximum, and then we consider every other slot as a free slot, then we
would overinflate in case of the recursive preemption. BUT! The trick is that
if we use only every n-th slot as a free slot for n>2, there is always some
worst inflation factor which we can guarantee not to be breached, even if we
apply the algorithm recursively. To be precise, if for every k>1 normally used
slots we accumulate one free delivery slot, than the inflation factor is not
worse than k/(k-1) no matter how many recursive preemptions happen. And it's
not worse than (k+1)/k if only non-recursive preemption happens. Now, having
got through the theory and the related math, let's see how nqmgr implements
this.

Each job has so called "available delivery slot" counter. Each transport has a
transport_delivery_slot_cost parameter, which defaults to
default_delivery_slot_cost parameter which is set to 5 by default. This is the
k from the paragraph above. Each time k entries of the job are selected for
delivery, this counter is incremented by one. Once there are some slots
accumulated, job which requires no more than that number of slots to be fully
delivered can preempt this job.

[Well, the truth is, the counter is incremented every time an entry is selected
and it is divided by k when it is used. But for the understanding it's good
enough to use the above approximation of the truth.]

OK, so now we know the conditions which must be satisfied so one job can
preempt another one. But what job gets preempted, how do we choose what job
preempts it if there are several valid candidates, and when does all this
exactly happen?

The answer for the first part is simple. The job whose entry was selected the
last time is so called current job. Normally, it is the first job on the
scheduler's job list, but destination concurrency limits may change this as we
will see later. It is always only the current job which may get preempted.

Now for the second part. The current job has certain amount of recipient
entries, and as such may accumulate at maximum some amount of available
delivery slots. It might have already accumulated some, and perhaps even
already used some when it was preempted before (remember a job can be preempted
several times). In either case, we know how many are accumulated and how many
are left to deliver, so we know how many it may yet accumulate at maximum.
Every other job which may be delivered by less than that number of slots is a
valid candidate for preemption. How do we choose among them?

The answer is - the one with maximum enqueue_time/recipient_entry_count. That
is, the older the job is, the more we should try to deliver it in order to get
best message delivery rates. These rates are of course subject to how many
recipients the message has, therefore the division by the recipient (entry)
count. No one shall be surprised that message with n recipients takes n times
longer to deliver than message with one recipient.

Now let's recap the previous two paragraphs. Isn't it too complicated? Why
don't the candidates come only among the jobs which can be delivered within the
number of slots the current job already accumulated? Why do we need to estimate
how much it has yet to accumulate? If you found out the answer, congratulate
yourself. If we did it this simple way, we would always choose the candidate
with least recipient entries. If there were enough single recipient mails
coming in, they would always slip by the bulk mail as soon as possible, and the
two and more recipients mail would never get a chance, no matter how long they
have been sitting around in the job list.

This candidate selection has interesting implication - that when we choose the
best candidate for preemption (this is done in qmgr_choose_candidate()), it may
happen that we may not use it for preemption immediately. This leads to an
answer to the last part of the original question - when does the preemption
happen?

The preemption attempt happens every time next transport's recipient entry is
to be chosen for delivery. To avoid needless overhead, the preemption is not
attempted if the current job could never accumulate more than
transport_minimum_delivery_slots (defaults to default_minimum_delivery_slots
which defaults to 3). If there is already enough accumulated slots to preempt
the current job by the chosen best candidate, it is done immediately. This
basically means that the candidate is moved in front of the current job on the
scheduler's job list and decreasing the accumulated slot counter by the amount
used by the candidate. If there is not enough slots... well, I could say that
nothing happens and the another preemption is attempted the next time. But
that's not the complete truth.

The truth is that it turns out that it is not really necessary to wait until
the jobs counter accumulates all the delivery slots in advance. Say we have
ten-recipient mail followed by two two-recipient mails. If the preemption
happened when enough delivery slot accumulate (assuming slot cost 2), the
delivery sequence becomes 11112211113311. Now what would we get if we would
wait only for 50% of the necessary slots to accumulate and we promise we would
wait for the remaining 50% later, after we get back to the preempted job? If we
use such slot loan, the delivery sequence becomes 11221111331111. As we can
see, it makes it no considerably worse for the delivery of the ten-recipient
mail, but it allows the small messages to be delivered sooner.

The concept of these slot loans is where the transport_delivery_slot_discount
and transport_delivery_slot_loan come from (they default to
default_delivery_slot_discount and default_delivery_slot_loan, whose values are
by default 50 and 3, respectively). The discount (resp. loan) specifies how
many percent (resp. how many slots) one "gets in advance", when the number of
slots required to deliver the best candidate is compared with the number of
slots the current slot had accumulated so far.

And it pretty much concludes this chapter.

[Now you should have a feeling that you pretty much understand the scheduler
and the preemption, or at least that you will have it after you read the last
chapter couple more times. You shall clearly see the job list and the
preemption happening at its head, in ideal delivery conditions. The feeling of
understanding shall last until you start wondering what happens if some of the
jobs are blocked, which you might eventually figure out correctly from what had
been said already. But I would be surprised if your mental image of the
scheduler's functionality is not completely shattered once you start wondering
how it works when not all recipients may be read in-core. More on that later.]

HHooww ddeessttiinnaattiioonn ccoonnccuurrrreennccyy lliimmiittss aaffffeecctt tthhee sscchheedduulliinngg aallggoorriitthhmm

The nqmgr uses the same algorithm for destination concurrency control as oqmgr.
Now what happens when the destination limits are reached and no more entries
for that destination may be selected by the scheduler?

From user's point of view it is all simple. If some of the peers of a job can't
be selected, those peers are simply skipped by the entry selection algorithm
(the pseudo-code described before) and only the selectable ones are used. If
none of the peers may be selected, the job is declared a "blocker job". Blocker
jobs are skipped by the entry selection algorithm and they are also excluded
from the candidates for preemption of current job. Thus the scheduler
effectively behaves as if the blocker jobs didn't exist on the job list at all.
As soon as at least one of the peers of a blocker job becomes unblocked (that
is, the delivery agent handling the delivery of the recipient entry for given
destination successfully finishes), the job's blocker status is removed and the
job again participates in all further scheduler actions normally.

So the summary is that the users don't really have to be concerned about the
interaction of the destination limits and scheduling algorithm. It works well
on its own and there are no knobs they would need to control it.

From a programmer's point of view, the blocker jobs complicate the scheduler
quite a lot. Without them, the jobs on the job list would be normally delivered
in strict FIFO order. If the current job is preempted, the job preempting it is
completely delivered unless it is preempted itself. Without blockers, the
current job is thus always either the first job on the job list, or the top of
the stack of jobs preempting the first job on the job list.

The visualization of the job list and the preemption stack without blockers
would be like this:

    first job->    1--2--3--5--6--8--...    <- job list
    on job list    |
                   4    <- preemption stack
                   |
    current job->  7

In the example above we see that job 1 was preempted by job 4 and then job 4
was preempted by job 7. After job 7 is completed, remaining entries of job 4
are selected, and once they are all selected, job 1 continues.

As we see, it's all very clean and straightforward. Now how does this change
because of blockers?

The answer is: a lot. Any job may become blocker job at any time, and also
become normal job again at any time. This has several important implications:

 1. The jobs may be completed in arbitrary order. For example, in the example
    above, if the current job 7 becomes blocked, the next job 4 may complete
    before the job 7 becomes unblocked again. Or if both 7 and 4 are blocked,
    then 1 is completed, then 7 becomes unblocked and is completed, then 2 is
    completed and only after that 4 becomes unblocked and is completed... You
    get the idea.

    [Interesting side note: even when jobs are delivered out of order, from
    single destination's point of view the jobs are still delivered in the
    expected order (that is, FIFO unless there was some preemption involved).
    This is because whenever a destination queue becomes unblocked (the
    destination limit allows selection of more recipient entries for that
    destination), all jobs which have peers for that destination are unblocked
    at once.]

 2. The idea of the preemption stack at the head of the job list is gone. That
    is, it must be possible to preempt any job on the job list. For example, if
    the jobs 7, 4, 1 and 2 in the example above become all blocked, job 3
    becomes the current job. And of course we do not want the preemption to be
    affected by the fact that there are some blocked jobs or not. Therefore, if
    it turns out that job 3 might be preempted by job 6, the implementation
    shall make it possible.

 3. The idea of the linear preemption stack itself is gone. It's no longer true
    that one job is always preempted by only one job at one time (that is
    directly preempted, not counting the recursively nested jobs). For example,
    in the example above, job 1 is directly preempted by only job 4, and job 4
    by job 7. Now assume job 7 becomes blocked, and job 4 is being delivered.
    If it accumulates enough delivery slots, it is natural that it might be
    preempted for example by job 8. Now job 4 is preempted by both job 7 AND
    job 8 at the same time.

Now combine the points 2) and 3) with point 1) again and you realize that the
relations on the once linear job list became pretty complicated. If we extend
the point 3) example: jobs 7 and 8 preempt job 4, now job 8 becomes blocked
too, then job 4 completes. Tricky, huh?

If I illustrate the relations after the above mentioned examples (but those in
point 1)), the situation would look like this:

                                v- parent

    adoptive parent ->    1--2--3--5--...      <- "stack" level 0
                          |     |
    parent gone ->        ?     6              <- "stack" level 1
                         / \
    children ->         7   8   ^- child       <- "stack" level 2

                          ^- siblings

Now how does nqmgr deal with all these complicated relations?

Well, it maintains them all as described, but fortunately, all these relations
are necessary only for purposes of proper counting of available delivery slots.
For purposes of ordering the jobs for entry selection, the original rule still
applies: "the job preempting the current job is moved in front of the current
job on the job list". So for entry selection purposes, the job relations remain
as simple as this:

    7--8--1--2--6--3--5--..   <- scheduler's job list order

The job list order and the preemption parent/child/siblings relations are
maintained separately. And because the selection works only with the job list,
you can happily forget about those complicated relations unless you want to
study the nqmgr sources. In that case the text above might provide some helpful
introduction to the problem domain. Otherwise I suggest you just forget about
all this and stick with the user's point of view: the blocker jobs are simply
ignored.

[By now, you should have a feeling that there is more things going under the
hood than you ever wanted to know. You decide that forgetting about this
chapter is the best you can do for the sake of your mind's health and you
basically stick with the idea how the scheduler works in ideal conditions, when
there are no blockers, which is good enough.]

DDeeaalliinngg wwiitthh mmeemmoorryy rreessoouurrccee lliimmiittss

When discussing the nqmgr scheduler, we have so far assumed that all recipients
of all messages in the active queue are completely read into the memory. This
is simply not true. There is an upper bound on the amount of memory the nqmgr
may use, and therefore it must impose some limits on the information it may
store in the memory at any given time.

First of all, not all messages may be read in-core at once. At any time, only
qmgr_message_active_limit messages may be read in-core at maximum. When read
into memory, the messages are picked from the incoming and deferred message
queues and moved to the active queue (incoming having priority), so if there is
more than qmgr_message_active_limit messages queued in the active queue, the
rest will have to wait until (some of) the messages in the active queue are
completely delivered (or deferred).

Even with the limited amount of in-core messages, there is another limit which
must be imposed in order to avoid memory exhaustion. Each message may contain
huge amount of recipients (tens or hundreds of thousands are not uncommon), so
if nqmgr read all recipients of all messages in the active queue, it may easily
run out of memory. Therefore there must be some upper bound on the amount of
message recipients which are read into the memory at the same time.

Before discussing how exactly nqmgr implements the recipient limits, let's see
how the sole existence of the limits themselves affects the nqmgr and its
scheduler.

The message limit is straightforward - it just limits the size of the lookahead
the nqmgr's scheduler has when choosing which message can preempt the current
one. Messages not in the active queue simply are not considered at all.

The recipient limit complicates more things. First of all, the message reading
code must support reading the recipients in batches, which among other things
means accessing the queue file several times and continuing where the last
recipient batch ended. This is invoked by the scheduler whenever the current
job has space for more recipients, subject to transport's refill_limit and
refill_delay parameters. It is also done any time when all in-core recipients
of the message are dealt with (which may also mean they were deferred) but
there are still more in the queue file.

The second complication is that with some recipients left unread in the queue
file, the scheduler can't operate with exact counts of recipient entries. With
unread recipients, it is not clear how many recipient entries there will be, as
they are subject to per-destination grouping. It is not even clear to what
transports (and thus jobs) the recipients will be assigned. And with messages
coming from the deferred queue, it is not even clear how many unread recipients
are still to be delivered. This all means that the scheduler must use only
estimates of how many recipients entries there will be. Fortunately, it is
possible to estimate the minimum and maximum correctly, so the scheduler can
always err on the safe side. Obviously, the better the estimates, the better
results, so it is best when we are able to read all recipients in-core and turn
the estimates into exact counts, or at least try to read as many as possible to
make the estimates as accurate as possible.

The third complication is that it is no longer true that the scheduler is done
with a job once all of its in-core recipients are delivered. It is possible
that the job will be revived later, when another batch of recipients is read in
core. It is also possible that some jobs will be created for the first time
long after the first batch of recipients was read in core. The nqmgr code must
be ready to handle all such situations.

And finally, the fourth complication is that the nqmgr code must somehow impose
the recipient limit itself. Now how does it achieve it?

Perhaps the easiest solution would be to say that each message may have at
maximum X recipients stored in-core, but such solution would be poor for
several reasons. With reasonable qmgr_message_active_limit values, the X would
have to be quite low to maintain reasonable memory footprint. And with low X
lots of things would not work well. The nqmgr would have problems to use the
transport_destination_recipient_limit efficiently. The scheduler's preemption
would be suboptimal as the recipient count estimates would be inaccurate. The
message queue file would have to be accessed many times to read in more
recipients again and again.

Therefore it seems reasonable to have a solution which does not use a limit
imposed on per-message basis, but which maintains a pool of available recipient
slots, which can be shared among all messages in the most efficient manner. And
as we do not want separate transports to compete for resources whenever
possible, it seems appropriate to maintain such recipient pool for each
transport separately. This is the general idea, now how does it work in
practice?

First we have to solve little chicken-and-egg problem. If we want to use the
per-transport recipient pools, we first need to know to what transport(s) is
the message assigned. But we will find that out only after we read in the
recipients first. So it is obvious that we first have to read in some
recipients, use them to find out to what transports is the message to be
assigned, and only after that we can use the per-transport recipient pools.

Now how many recipients shall we read for the first time? This is what
qmgr_message_recipient_minimum and qmgr_message_recipient_limit values control.
The qmgr_message_recipient_minimum value specifies how many recipients of each
message we will read for the first time, no matter what. It is necessary to
read at least one recipient before we can assign the message to a transport and
create the first job. However, reading only qmgr_message_recipient_minimum
recipients even if there are only few messages with few recipients in-core
would be wasteful. Therefore if there is less than qmgr_message_recipient_limit
recipients in-core so far, the first batch of recipients may be larger than
qmgr_message_recipient_minimum - as large as is required to reach the
qmgr_message_recipient_limit limit.

Once the first batch of recipients was read in core and the message jobs were
created, the size of the subsequent recipient batches (if any - of course it's
best when all recipients are read in one batch) is based solely on the position
of the message jobs on their corresponding transports' job lists. Each
transport has a pool of transport_recipient_limit recipient slots which it can
distribute among its jobs (how this is done is described later). The subsequent
recipient batch may be as large as the sum of all recipient slots of all jobs
of the message permits (plus the qmgr_message_recipient_minimum amount which
always applies).

For example, if a message has three jobs, first with 1 recipient still in-core
and 4 recipient slots, second with 5 recipient in-core and 5 recipient slots,
and third with 2 recipients in-core and 0 recipient slots, it has 1+5+2=7
recipients in-core and 4+5+0=9 jobs' recipients slots in total. This means that
we could immediately read 2+qmgr_message_recipient_minimum more recipients of
that message in core.

The above example illustrates several things which might be worth mentioning
explicitly: first, note that although the per-transport slots are assigned to
particular jobs, we can't guarantee that once the next batch of recipients is
read in core, that the corresponding amounts of recipients will be assigned to
those jobs. The jobs lend its slots to the message as a whole, so it is
possible that some jobs end up sponsoring other jobs of their message. For
example, if in the example above the 2 newly read recipients were assigned to
the second job, the first job sponsored the second job with 2 slots. The second
notable thing is the third job, which has more recipients in-core than it has
slots. Apart from sponsoring by other job we just saw it can be result of the
first recipient batch, which is sponsored from global recipient pool of
qmgr_message_recipient_limit recipients. It can be also sponsored from the
message recipient pool of qmgr_message_recipient_minimum recipients.

Now how does each transport distribute the recipient slots among its jobs? The
strategy is quite simple. As most scheduler activity happens on the head of the
job list, it is our intention to make sure that the scheduler has the best
estimates of the recipient counts for those jobs. As we mentioned above, this
means that we want to try to make sure that the messages of those jobs have all
recipients read in-core. Therefore the transport distributes the slots "along"
the job list from start to end. In this case the job list sorted by message
enqueue time is used, because it doesn't change over time as the scheduler's
job list does.

More specifically, each time a job is created and appended to the job list, it
gets all unused recipient slots from its transport's pool. It keeps them until
all recipients of its message are read. When this happens, all unused recipient
slots are transferred to the next job (which is now in fact now first such job)
on the job list which still has some recipients unread, or eventually back to
the transport pool if there is no such job. Such transfer then also happens
whenever a recipient entry of that job is delivered.

There is also a scenario when a job is not appended to the end of the job list
(for example it was created as a result of second or later recipient batch).
Then it works exactly as above, except that if it was put in front of the first
unread job (that is, the job of a message which still has some unread
recipients in queue file), that job is first forced to return all of its unused
recipient slots to the transport pool.

The algorithm just described leads to the following state: The first unread job
on the job list always gets all the remaining recipient slots of that transport
(if there are any). The jobs queued before this job are completely read (that
is, all recipients of their message were already read in core) and have at
maximum as many slots as they still have recipients in-core (the maximum is
there because of the sponsoring mentioned before) and the jobs after this job
get nothing from the transport recipient pool (unless they got something before
and then the first unread job was created and enqueued in front of them later -
in such case the also get at maximum as many slots as they have recipients in-
core).

Things work fine in such state for most of the time, because the current job is
either completely read in-core or has as much recipient slots as there are, but
there is one situation which we still have to take care of specially. Imagine
if the current job is preempted by some unread job from the job list and there
are no more recipient slots available, so this new current job could read only
batches of qmgr_message_recipient_minimum recipients at a time. This would
really degrade performance. For this reason, each transport has extra pool of
transport_extra_recipient_limit recipient slots, dedicated exactly for this
situation. Each time an unread job preempts the current job, it gets half of
the remaining recipient slots from the normal pool and this extra pool.

And that's it. It sure does sound pretty complicated, but fortunately most
people don't really have to care how exactly it works as long as it works.
Perhaps the only important things to know for most people are the following
upper bound formulas:

Each transport has at maximum

    max(
    qmgr_message_recipient_minimum * qmgr_message_active_limit
    + *_recipient_limit + *_extra_recipient_limit,
    qmgr_message_recipient_limit
    )

recipients in core.

The total amount of recipients in core is

    max(
    qmgr_message_recipient_minimum * qmgr_message_active_limit
    + sum( *_recipient_limit + *_extra_recipient_limit ),
    qmgr_message_recipient_limit
    )

where the sum is over all used transports.

And this terribly complicated chapter concludes the documentation of nqmgr
scheduler.

[By now you should theoretically know the nqmgr scheduler inside out. In
practice, you still hope that you will never have to really understand the last
or last two chapters completely, and fortunately most people really won't.
Understanding how the scheduler works in ideal conditions is more than good
enough for vast majority of users.]

CCrreeddiittss

  * Wietse Venema designed and implemented the initial queue manager with per-
    domain FIFO scheduling, and per-delivery +/-1 concurrency feedback.
  * Patrik Rak designed and implemented preemption where mail with fewer
    recipients can slip past mail with more recipients in a controlled manner,
    and wrote up its documentation.
  * Wietse Venema initiated a discussion with Patrik Rak and Victor Duchovni on
    alternatives for the +/-1 feedback scheduler's aggressive behavior. This is
    when K/N feedback was reviewed (N = concurrency). The discussion ended
    without a good solution for both negative feedback and dead site detection.
  * Victor Duchovni resumed work on concurrency feedback in the context of
    concurrency-limited servers.
  * Wietse Venema then re-designed the concurrency scheduler in terms of the
    simplest possible concepts: less-than-1 concurrency feedback per delivery,
    forward and reverse concurrency feedback hysteresis, and pseudo-cohort
    failure. At this same time, concurrency feedback was separated from dead
    site detection.
  * These simplifications, and their modular implementation, helped to develop
    further insights into the different roles that positive and negative
    concurrency feedback play, and helped to identify some worst-case
    scenarios.