1. 01 May, 2023 16 commits
    • LoongArch: crypto: Add crc32 and crc32c hw acceleration · 2f164822
      Min Zhou authored
      With a blatant copy of some MIPS bits we introduce the crc32 and crc32c
      hw accelerated module to LoongArch.
      
      LoongArch has provided these instructions to calculate crc32 and crc32c:
              * crc.w.b.w    crcc.w.b.w
              * crc.w.h.w    crcc.w.h.w
              * crc.w.w.w    crcc.w.w.w
              * crc.w.d.w    crcc.w.d.w
      
      So we can make use of these instructions to improve the performance of
      calculation for crc32(c) checksums.
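
      For comparison, the software path that these instructions replace can
      be sketched as a bit-at-a-time CRC. This is a minimal portable sketch,
      not the kernel's code: the names `crc_bitwise`, `crc32_sw` and
      `crc32c_sw` are hypothetical, and it assumes the conventional reflected
      polynomials (0xEDB88320 for crc32, 0x82F63B78 for crc32c) with the
      usual initial and final inversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time reflected CRC. `poly` is the reflected polynomial. */
static uint32_t crc_bitwise(uint32_t crc, const void *buf, size_t len,
                            uint32_t poly)
{
    const uint8_t *p = buf;

    crc = ~crc;                               /* initial inversion */
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)           /* one step per input bit */
            crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
    return ~crc;                              /* final inversion */
}

/* Hypothetical wrappers; the polynomials are the standard reflected forms. */
uint32_t crc32_sw(const void *buf, size_t len)
{
    return crc_bitwise(0, buf, len, 0xEDB88320u);
}

uint32_t crc32c_sw(const void *buf, size_t len)
{
    return crc_bitwise(0, buf, len, 0x82F63B78u);
}
```

      With this convention, the ASCII string "123456789" yields 0xCBF43926
      for crc32 and 0xE3069283 for crc32c, the standard check values. The
      hardware instructions consume 1, 2, 4 or 8 bytes per step instead of
      one bit per loop iteration.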
      
      As can be seen from the following test results, the crc32(c)
      instructions improve performance by roughly 58% to 59%.
      
                        Software implementation  Hardware acceleration
        Buffer size     time cost (seconds)      time cost (seconds)    Accel.
         100 KB                0.000845                 0.000534        59.1%
           1 MB                0.007758                 0.004836        59.4%
          10 MB                0.076593                 0.047682        59.4%
         100 MB                0.756734                 0.479126        58.5%
        1000 MB                7.563841                 4.778266        58.5%
      Signed-off-by: Min Zhou <zhoumin@loongson.cn>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Add checksum optimization for 64-bit system · 69e3a6aa
      Bibo Mao authored
      LoongArch is a 64-bit platform that supports 8-byte memory accesses,
      but the generic checksum functions use 4-byte memory accesses. Add an
      8-byte memory access optimization for the checksum functions on
      LoongArch. The code comes from arm64.
      
      When network hw checksum is disabled, iperf performance improves about
      10% with this patch.
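
      The idea of summing 8 bytes per load instead of 4 can be sketched in
      portable C as follows. This is an illustration only, not the arm64 or
      LoongArch code: the function names `csum_by_4` and `csum_by_8` are
      hypothetical, and the result is the folded 16-bit ones' complement sum
      (not inverted) in native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry, as ones' complement arithmetic requires. */
static uint64_t add_carry64(uint64_t sum, uint64_t w)
{
    sum += w;
    return sum + (sum < w);
}

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Baseline: one 4-byte load per step. */
uint16_t csum_by_4(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;
    uint32_t w;

    for (; len >= 4; p += 4, len -= 4) {
        memcpy(&w, p, 4);
        sum = add_carry64(sum, w);
    }
    if (len) {
        w = 0;
        memcpy(&w, p, len);        /* zero-padded tail */
        sum = add_carry64(sum, w);
    }
    return fold64(sum);
}

/* Optimized variant: one 8-byte load per step. */
uint16_t csum_by_8(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0, w;

    for (; len >= 8; p += 8, len -= 8) {
        memcpy(&w, p, 8);
        sum = add_carry64(sum, w);
    }
    if (len) {
        w = 0;
        memcpy(&w, p, len);        /* zero-padded tail */
        sum = add_carry64(sum, w);
    }
    return fold64(sum);
}
```

      Both functions return the same value for any buffer, because
      2^16 is congruent to 1 modulo 65535 and so wider words fold to the
      same 16-bit sum; the 8-byte variant simply needs half as many loads.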
      Signed-off-by: Bibo Mao <maobibo@loongson.cn>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Optimize memory ops (memset/memcpy/memmove) · 8941e93c
      WANG Rui authored
      To optimize memset()/memcpy()/memmove() and so on, we use a jump table
      to dispatch cases for short data lengths; and for long data lengths, we
      split the destination into head part (first 8 bytes), tail part (last 8
      bytes) and middle part. The head part and tail part may be at unaligned
      addresses, while the middle part is always aligned (the middle part is
      allowed to overlap the head/tail part). In this way, the first and last
      8 bytes may be unaligned accesses, but we can make sure the data in the
      middle is processed at an aligned destination address.
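
      The head/middle/tail split described above can be sketched for memset
      in portable C. This is a simplified illustration, not the assembly in
      the patch: `memset_sketch` is a hypothetical name, a plain byte loop
      stands in for the jump table used for short lengths, and memcpy() of
      8 bytes stands in for a single 8-byte store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *memset_sketch(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    /* Replicate the fill byte into all 8 bytes of a 64-bit word. */
    uint64_t v = 0x0101010101010101ull * (unsigned char)c;

    if (n < 8) {
        /* Short case: the patch dispatches through a jump table here. */
        for (size_t i = 0; i < n; i++)
            d[i] = (unsigned char)c;
        return dst;
    }

    memcpy(d, &v, 8);            /* head: first 8 bytes, may be unaligned */
    memcpy(d + n - 8, &v, 8);    /* tail: last 8 bytes, may be unaligned  */

    /* Middle: 8-byte aligned stores; may overlap the head and the tail. */
    unsigned char *p   = (unsigned char *)(((uintptr_t)d + 8) & ~(uintptr_t)7);
    unsigned char *end = (unsigned char *)(((uintptr_t)d + n) & ~(uintptr_t)7);
    for (; p < end; p += 8)
        memcpy(p, &v, 8);

    return dst;
}
```

      The aligned loop starts at the first 8-byte boundary after `dst` and
      stops at the last boundary at or before `dst + n`; the bytes outside
      that range are exactly the ones already written by the head and tail
      stores.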
      
      We have tested micro-bench[1] on a Loongson-3C5000 16-core machine (2.2GHz):
      
      1. memset
      
      | length | src offset | dst offset | speed before | speed after | %       |
      |--------|------------|------------|--------------|-------------|---------|
      | 8      | 0          | 0          | 696.191      | 1518.785    | 118.16% |
      | 8      | 0          | 1          | 696.325      | 1518.937    | 118.14% |
      | 50     | 0          | 0          | 969.976      | 8053.902    | 730.32% |
      | 50     | 0          | 1          | 970.034      | 8058.475    | 730.74% |
      | 300    | 0          | 0          | 5876.612     | 16544.703   | 181.53% |
      | 300    | 0          | 1          | 5030.849     | 16549.011   | 228.95% |
      | 1200   | 0          | 0          | 11797.077    | 16752.137   | 42.00%  |
      | 1200   | 0          | 1          | 5687.141     | 16645.233   | 192.68% |
      | 4000   | 0          | 0          | 15723.27     | 16761.557   | 6.60%   |
      | 4000   | 0          | 1          | 5906.114     | 16732.316   | 183.30% |
      | 8000   | 0          | 0          | 16751.403    | 16770.002   | 0.11%   |
      | 8000   | 0          | 1          | 5995.449     | 16754.07    | 179.45% |
      
      2. memcpy
      
      | length | src offset | dst offset | speed before | speed after | %       |
      |--------|------------|------------|--------------|-------------|---------|
      | 8      | 0          | 0          | 696.2        | 1670.605    | 139.96% |
      | 8      | 0          | 1          | 696.325      | 1671.138    | 139.99% |
      | 50     | 0          | 0          | 969.974      | 8724.999    | 799.51% |
      | 50     | 0          | 1          | 970.032      | 8730.138    | 799.98% |
      | 300    | 0          | 0          | 5564.662     | 16272.652   | 192.43% |
      | 300    | 0          | 1          | 4670.436     | 14972.842   | 220.59% |
      | 1200   | 0          | 0          | 10740.23     | 16751.728   | 55.97%  |
      | 1200   | 0          | 1          | 5027.741     | 14874.564   | 195.85% |
      | 4000   | 0          | 0          | 15122.367    | 16737.642   | 10.68%  |
      | 4000   | 0          | 1          | 5536.918     | 14890.397   | 168.93% |
      | 8000   | 0          | 0          | 16505.453    | 16553.543   | 0.29%   |
      | 8000   | 0          | 1          | 5821.619     | 14841.804   | 154.94% |
      
      3. memmove
      
      | length | src offset | dst offset | speed before | speed after | %       |
      |--------|------------|------------|--------------|-------------|---------|
      | 8      | 0          | 0          | 982.693      | 1670.568    | 70.00%  |
      | 8      | 0          | 1          | 983.023      | 1671.174    | 70.00%  |
      | 50     | 0          | 0          | 1230.87      | 8727.625    | 609.06% |
      | 50     | 0          | 1          | 1232.515     | 8730.138    | 608.32% |
      | 300    | 0          | 0          | 6490.375     | 16296.993   | 151.09% |
      | 300    | 0          | 1          | 4282.687     | 14972.842   | 249.61% |
      | 1200   | 0          | 0          | 11742.755    | 16752.546   | 42.66%  |
      | 1200   | 0          | 1          | 5039.338     | 14872.951   | 195.14% |
      | 4000   | 0          | 0          | 15467.786    | 16737.09    | 8.21%   |
      | 4000   | 0          | 1          | 5009.905     | 14890.542   | 197.22% |
      | 8000   | 0          | 0          | 16489.664    | 16553.273   | 0.39%   |
      | 8000   | 0          | 1          | 5823.786     | 14858.646   | 155.14% |
      
      * speed: MB/s
      * length: bytes
      * %: improvement, (speed after / speed before - 1) * 100
      
      [1] https://github.com/heiher/mem-bench
      Signed-off-by: WANG Rui <wangrui@loongson.cn>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Provide kernel fpu functions · 2b3bd32e
      Huacai Chen authored
      Provide kernel_fpu_begin()/kernel_fpu_end() to allow the kernel itself
      to use the FPU. They can be used by other kernel components, e.g., the
      AMDGPU graphics driver for DCN.
      Reported-by: WANG Xuerui <kernel@xen0n.name>
      Tested-by: WANG Xuerui <kernel@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Relay BCE exceptions to userland as SIGSEGV with si_code=SEGV_BNDERR · c23e7f01
      WANG Xuerui authored
      SEGV_BNDERR was introduced initially for supporting the Intel MPX, but
      fell into disuse after the MPX support was removed. The LoongArch
      bounds-checking instructions behave very differently than MPX, but
      overall the interface is still kind of suitable for conveying the
      information to userland when bounds-checking assertions trigger, so we
      wouldn't have to invent more UAPI. Specifically, when the BCE triggers,
      a SEGV_BNDERR is sent to userland, with si_addr set to the out-of-bounds
      address or value (in asrt{gt,le}'s case), and one of si_lower or
      si_upper set to the configured bound depending on the faulting
      instruction. The other bound is set to either 0 or ULONG_MAX to resemble
      a range with both lower and upper bounds.
      
      Note that it is possible to have si_addr == si_lower in case of a
      failing asrtgt or {ld,st}gt, because those instructions test for strict
      greater-than relationship. This should not pose a problem for userland,
      though, because the faulting PC is available for the application to
      associate back to the exact instruction for figuring out the
      expectation.
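
      The mapping from a failing bounds check to the reported fields can be
      modelled in a few lines of C. This is a user-space model of the
      description above, not the kernel's trap handler: `bce_kind`,
      `bnderr_info` and `bce_check` are hypothetical names, and the choice
      of 0 as the filler lower bound and ULONG_MAX as the filler upper bound
      is an assumption based on the wording of this commit message.

```c
#include <limits.h>
#include <stdbool.h>

/* BCE_GT models asrtgt and {ld,st}gt; BCE_LE models asrtle and {ld,st}le. */
enum bce_kind { BCE_GT, BCE_LE };

/* The three values that would be reported as si_addr/si_lower/si_upper. */
struct bnderr_info {
    unsigned long addr;
    unsigned long lower;
    unsigned long upper;
};

/*
 * Returns true and fills *out if the assertion fails:
 *   BCE_GT asserts value >  bound (strict), so it fails when value <= bound;
 *   BCE_LE asserts value <= bound, so it fails when value > bound.
 */
bool bce_check(enum bce_kind kind, unsigned long value, unsigned long bound,
               struct bnderr_info *out)
{
    bool failed = (kind == BCE_GT) ? (value <= bound) : (value > bound);

    if (!failed)
        return false;

    out->addr = value;
    if (kind == BCE_GT) {
        out->lower = bound;        /* configured bound */
        out->upper = ULONG_MAX;    /* filler to form a range */
    } else {
        out->lower = 0;            /* filler to form a range */
        out->upper = bound;        /* configured bound */
    }
    return true;
}
```

      Under this model, a failing BCE_GT check with value equal to the bound
      reports addr == lower, which is the corner case noted above.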
      
      Example exception context generated by a faulting `asrtgt.d t0, t1`
      (assert t0 > t1 or BCE) with t0=100 and t1=200:
      
      > pc 00005555558206a4 ra 00007ffff2d854fc tp 00007ffff2f2f180 sp 00007ffffbf9fb80
      > a0 0000000000000002 a1 00007ffffbf9fce8 a2 00007ffffbf9fd00 a3 00007ffff2ed4558
      > a4 0000000000000000 a5 00007ffff2f044c8 a6 00007ffffbf9fce0 a7 fffffffffffff000
      > t0 0000000000000064 t1 00000000000000c8 t2 00007ffffbfa2d5e t3 00007ffff2f12aa0
      > t4 00007ffff2ed6158 t5 00007ffff2ed6158 t6 000000000000002e t7 0000000003d8f538
      > t8 0000000000000005 u0 0000000000000000 s9 0000000000000000 s0 00007ffffbf9fce8
      > s1 0000000000000002 s2 0000000000000000 s3 00007ffff2f2c038 s4 0000555555820610
      > s5 00007ffff2ed5000 s6 0000555555827e38 s7 00007ffffbf9fd00 s8 0000555555827e38
      >    ra: 00007ffff2d854fc
      >   ERA: 00005555558206a4
      >  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
      >  PRMD: 00000007 (PPLV3 +PIE -PWE)
      >  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
      >  ECFG: 0007181c (LIE=2-4,11-12 VS=7)
      > ESTAT: 000a0000 [BCE] (IS= ECode=10 EsubCode=0)
      >  PRID: 0014c010 (Loongson-64bit, Loongson-3A5000)
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Tweak the BADV and CPUCFG.PRID lines in show_regs() · 325a38b5
      WANG Xuerui authored
      Use ISA manual names for BADV and CPUCFG.PRID lines in show_regs(), for
      stylistic consistency with the other lines already touched.
      
      While at it, also include current CPU's full name in show_regs() output.
      It may be more helpful for developers looking at the resulting dumps,
      because multiple distinct CPU models may share the same PRID. Not having
      this info available may hide problems only found on some but not all of
      the models sharing one specific PRID.
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Humanize the ESTAT line when showing registers · 98b90ede
      WANG Xuerui authored
      Example output looks like:
      
      [   xx.xxxxxx] ESTAT: 00001000 [INT] (IS=12 ECode=0 EsubCode=0)
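
      The fields in the example line can be recovered from the raw value
      with a few shifts and masks. The sketch below is illustrative and not
      the kernel's printing code: `estat_fields` and `estat_decode` are
      hypothetical names, and the bit positions (IS in bits 12:0, ECode in
      bits 21:16, EsubCode in bits 30:22) are assumed from the LoongArch
      ISA manual.

```c
#include <stdint.h>

struct estat_fields {
    uint32_t is;        /* interrupt status bitmask, bits 12:0 (assumed) */
    uint32_t ecode;     /* exception code, bits 21:16 (assumed)          */
    uint32_t esubcode;  /* exception sub-code, bits 30:22 (assumed)      */
};

struct estat_fields estat_decode(uint32_t estat)
{
    struct estat_fields f;

    f.is       = estat & 0x1fffu;
    f.ecode    = (estat >> 16) & 0x3fu;
    f.esubcode = (estat >> 22) & 0x1ffu;
    return f;
}
```

      For 0x00001000 this gives an IS mask with only bit 12 set and
      ECode 0, matching the `IS=12 ECode=0 EsubCode=0` text in the example.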
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Humanize the ECFG line when showing registers · 5e3e784d
      WANG Xuerui authored
      Example output looks like:
      
      [   xx.xxxxxx]  ECFG: 00071c1c (LIE=2-4,10-12 VS=7)
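
      The `LIE=2-4,10-12` part is a bitmask rendered as ranges of set bits.
      One way to produce such text is sketched below; it is not the code
      from the patch. `format_bit_ranges` is a hypothetical helper, and the
      field positions used in the usage note (LIE in bits 12:0, VS in bits
      18:16) are assumed from the LoongArch ISA manual.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write the set bits of `mask` into `out` as comma-separated ranges.
 * Returns the number of characters written (excluding the terminator);
 * stops early if `out` is too small.
 */
size_t format_bit_ranges(uint32_t mask, char *out, size_t cap)
{
    size_t used = 0;

    if (cap)
        out[0] = '\0';

    for (int bit = 0; bit < 32; bit++) {
        if (!((mask >> bit) & 1u))
            continue;

        int first = bit;
        /* Extend the run while the next bit is also set. */
        while (bit + 1 < 32 && ((mask >> (bit + 1)) & 1u))
            bit++;

        int n = (first == bit)
            ? snprintf(out + used, cap - used, "%s%d",
                       used ? "," : "", first)
            : snprintf(out + used, cap - used, "%s%d-%d",
                       used ? "," : "", first, bit);
        if (n < 0 || (size_t)n >= cap - used)
            break;                 /* output truncated */
        used += (size_t)n;
    }
    return used;
}
```

      Applied to `0x00071c1c & 0x1fff` this produces `2-4,10-12`, and
      `(0x00071c1c >> 16) & 7` gives the VS value 7 shown in the example.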
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Humanize the EUEN line when showing registers · 9718d96c
      WANG Xuerui authored
      Example output looks like:
      
      [   xx.xxxxxx]  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Humanize the PRMD line when showing registers · ce7f0b18
      WANG Xuerui authored
      Example output looks like:
      
      [   xx.xxxxxx]  PRMD: 00000004 (PPLV0 +PIE -PWE)
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Humanize the CRMD line when showing registers · efada2af
      WANG Xuerui authored
      Example output looks like:
      
      [   xx.xxxxxx]  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
      
      Some initial machinery for this pretty-printing format has been included
      in this patch as well.
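
      A format of this kind can be produced with one snprintf() call over
      the decoded bit fields. The sketch below is not the machinery added
      by the patch: `crmd_format`, `flag` and `mat_name` are hypothetical
      names, the bit positions (PLV 1:0, IE 2, DA 3, PG 4, the two memory
      access type fields at 6:5 and 8:7, WE 9) and the type names
      SUC/CC/WUC are assumed from the LoongArch ISA manual, and the field
      labels are copied from the example output above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* '+' if the given bit is set, '-' otherwise. */
static char flag(uint32_t v, int bit)
{
    return ((v >> bit) & 1u) ? '+' : '-';
}

/* Memory access type names (assumed encoding); 3 is treated as unknown. */
static const char *mat_name(uint32_t v)
{
    static const char *const names[] = { "SUC", "CC", "WUC", "?" };

    return names[v & 3u];
}

/* Returns what snprintf() returns: the length the full text needs. */
int crmd_format(uint32_t crmd, char *out, size_t cap)
{
    return snprintf(out, cap,
                    "PLV%u %cIE %cDA %cPG DACF=%s DACM=%s %cWE",
                    (unsigned)(crmd & 3u),
                    flag(crmd, 2), flag(crmd, 3), flag(crmd, 4),
                    mat_name(crmd >> 5), mat_name(crmd >> 7),
                    flag(crmd, 9));
}
```

      For the value 0x000000b0 from the example, this yields
      `PLV0 -IE -DA +PG DACF=CC DACM=CC -WE`.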
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Fix format of CSR lines during show_regs() · 05fa8d49
      WANG Xuerui authored
      Use uppercase CSR names throughout for consistency with the manual
      wording, and right-align the keys. The "CSR" part is inferrable from
      context, hence dropped for more horizontal space.
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Print symbol info for $ra and CSR.ERA only for kernel-mode contexts · 863b3795
      WANG Xuerui authored
      Otherwise the addresses wouldn't make sense at all.
      
      While at it, align the "map keys" to maintain right-alignment with the
      "estat:" line too; also swap the ERA and ra lines so all CSRs are shown
      together.
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Print GPRs with ABI names when showing registers · f6a79b60
      WANG Xuerui authored
      Show PC (CSR.ERA) in place of $zero, and also show the syscall restart
      flag (conveniently stuffed in regs[0]) if non-zero.
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Define regular names for BCE/WATCH/HVC/GSPR exceptions · aa552254
      WANG Xuerui authored
      Define them according to the ISA manual, in order to enable matching the
      sub-exceptions for humanization purposes later.
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Clean up the architectural interrupt definitions · 9e36fa42
      WANG Xuerui authored
      While interrupts are assigned ECodes `64 + interrupt number`, all
      existing use sites of interrupt numbers want the 64 subtracted.
      Re-arrange the definitions so that the actual interrupt number is used
      everywhere, and make EXCCODE_INT_END inclusive as it is more intuitive
      that way.
      
      While at it, according to the asm/loongarch.h definitions, the total
      number of architectural interrupts should be 14, but various other
      places indicate otherwise (13 or 15). Those places have been adjusted
      to 14 as well for consistency.
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
  2. 26 Apr, 2023 1 commit
  3. 23 Apr, 2023 9 commits
  4. 22 Apr, 2023 2 commits
  5. 21 Apr, 2023 12 commits