VICE / Patches / #350 Profiler Support

Oskar Linde - 2023-01-16

Here's an additive patch that adds the non-sc 6502 core as well. With this the profiler should work for the main CPU on:

x64
x64sc
x64dtv
x128 (8502 CPU)
xplus4
xvic
xcbm2
xcbm5x0
xpet (6502 CPU)
vsid

not implemented yet
xscpu64 (65C816)
xpet (w/ SuperPET 6809 CPU)
x128 (when running in Z80 mode)

Also, the profiler is not yet banking aware (only samples the 16-bit addresses the CPU sees), so will only display the correct disassembly when the bank configuration is compatible with the runtime one.

Last edit: Oskar Linde 2023-01-16

profiler-non-sc.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpz - 2023-01-16

very cool!

BTW, this should also get support for the remote interface (because at the end, you want to profile right inside vscode, right? :=))

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oskar Linde - 2023-01-16

Here’s an open issue. It would be nice to support banking. For example, if the program during profiling swaps in the KERNAL ROM, executes a function (e.g. CHROUT), returns and swaps the ROM out, and then executes another function stored in the RAM under the KERNAL ROM it would be nice to differentiate between them and have the monitor output the correct disassembly depending on execution context.

In the current implementation, the execution context is defined purely based on the call stack (a sequence of 16-bit PC pointers) so the monitor doesn’t know that the ROM for example was swapped in during a specific execution execution.

There are two high-level approaches:

Augment the execution context information with the current memory bank configuration. This requires very little additional storage per profiling context (basically 1-2 bytes), but needs some handling for what to do when the bank configuration changes during a context. For the latter I’d probably simply create a linked-list structure of such contexts.

Augment every single PC pointer with bank information creating wide/far pointers. This could be a pretty disruptive change to the monitor, but has a few advantages. For example you could potentially support applying a label specifically to a ROM memory region (e.g. with 0x02 being the ROM bank prefix):

add_label 02FFD2 .CHROUT

Both of these approaches have challenges. I'd lean towards #1 as being the most straightforward to implement.

For this first one, there doesn’t seem to be a standardized way to query the current bank configuration. Most memory models have an (int) mem_config < NUM_CONFIGS <= 256, but there is no API to query this variable, and no API for the monitor to mem_peek_with_config(int mem_config, uint16_t addr). There is an API mem_bank_peek(int bank, uint16_t adds, ...), but that’s not really useful because the current bank number is both not queryable and is also not generally well-defined for a particular bank configuration.

So to implement this, I’d suggest maybe adding a

int mem_get_bank_config()

to each memory interface

For c64mem, that one would basically be a value 0-31:

return (((~pport.dir | pport.data) & 0x7) | (export.exrom << 3) | (export.game << 4));

(i.e. what’s computed by mem_pla_config_changed() today)

And also a (mostly) stateless

uint8_t mem_peek_with_config(int mem_config, uint16_t addr)

for monitor access.

This still doesn't address banking settings internal to cartridges, but that's starting to get pretty niche. If necessary, it could be handled by augmenting the config value like the hack below for the 65816 PB register...

for c128mem.c the config would be identical to what mmu_update_config() computes and things should work fine.

For scpu64mem it seems tricky to implement proper support without the monitor having 24-bit pointers. One potential option (or maybe hack) would be to effectively bake the program bank (PB) register into the memory config value (expanding it to 16 bits). This is in behavior actually similar to what the current monitor interface does (mem_bank_peek) by adding the current WDC65816_REGS_GET_PBR() value to the provided address, but that seems technically wrong: While the PB register makes sense for program memory (e.g. disassembly), for raw memory dumps you are just as likely (if not more) to want to use the DB register instead...

Anyway, this hack would work for 65816 profiling since the PB register never auto-increments during program execution (only by JML, JSL, RTL), so it's effectively constant for an execution context.

This being said, I'd consider holding off on supporting the 65816 until the monitor gets full 24-bit support. E.g. labels, memory and disassembly listings all lack 24-bit support today.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

As an experiment I implemented approach 1 (for one of the machines). See the example below:

This program runs with the kernal ROM swapped out and a custom IRQ handler at $2403 (context [6])

When viewing context [6], the memory bank configuration (Bank) column shows two different bank configurations (5 and 6) in effect during execution. The ~middle 3 lines (with the call to context [7]) are running with configuration 6 – the KERNAL ROM is mapped in and the kernal routine at $FF9F (SCNKEY) is called.

When viewing context [7] at the bottom, the monitor correctly reads the disassembly from KERNAL ROM instead of RAM even though the ROM is currently swapped out.

What makes this less clean than ideal:

The memory bank config value is not the same as the already existing "bank" value. There are now monitor interfaces for both. This feels a bit overcomplicated, but I don't have a good solution.
The memory config value is currently unique to the profiling instructions. It would be nice if there was a more consistent/standardized way to refer to an unambiguous memory location in general. E.g. just like the current .C:xxxx prefix for memspace indicating the computer memory, there could maybe be a .C:yy:xxxx format for accessing ram under a specific memory bank configuration. This is where approach (2) in the previous message might have some advantages by creating a virtual wider global and non-overlapping address space. (Where each memory location has a unique address rather than the non-unique address you get with a memory configuration prefix)
The profiler (and monitor) now directly queries mem_get_current_bank_config() to capture bank switch changes. It's probably clearer for the memory subsystem to tell the profiler of any bank switches. This is an easy change.

(C:$2418) prof graph
                                               Total      %          Self      %
                                       ------------- ------ ------------- ------
    [1] START                              4,003,228 100.0%             0   0.0%
      [2] e39a -> e422                     3,939,194  98.4%             6   0.0%
        [3] a7e7 -> a7ed                   3,939,188  98.4%             0   0.0%
          [4] a7e7 -> a7ed                 3,939,188  98.4%             0   0.0%
            [5] 0836 -> 2460               3,939,188  98.4%             0   0.0%
                   ...
>     [6] IRQ  -> 2403                        64,034   1.6%        48,100   1.2%
        [7] 2433 -> ff9f                      15,934   0.4%        15,934   0.4%



(C:$2418) prof context 6

Context    [6] IRQ  -> 2403                               

   Parent     [1] START                                      

   Entered        258 times
   Exited         257 times
   Total       64,034 cycles       65.0 ms
   Self        48,100 cycles       48.8 ms
   Average        249 cycles        252 us*
        * Averages are not accurate since enters differs from exits.

   Memory Banking Configs: 5 6

       Cycles      %         Times Context Bank Address  Disassembly
------------- ------ ------------- ------- ---- -------  -------------------------
          516   0.8%           258           5*.C:2403   D8          CLD
          774   1.2%           258           5 .C:2404   48          PHA
          516   0.8%           258           5 .C:2405   18          CLC
          774   1.2%           258           5 .C:2406   A5 02       LDA $02
          516   0.8%           258           5 .C:2408   69 FE       ADC #$FE
          774   1.2%           258           5 .C:240a   85 02       STA $02
          774   1.2%           258           5 .C:240c   A5 03       LDA $03
          516   0.8%           258           5 .C:240e   69 FE       ADC #$FE
          774   1.2%           258           5 .C:2410   85 03       STA $03
        1,032   1.6%           258           5 .C:2412   68          PLA
          774   1.2%           258           5 .C:2413   48          PHA
          516   0.8%           258           5 .C:2414   8A          TXA
          774   1.2%           258           5 .C:2415   48          PHA
          516   0.8%           258           5 .C:2416   98          TYA
          774   1.2%           258           5 .C:2417   48          PHA
          771   1.2%           257           5>.C:2418   A5 04       LDA $04
          771   1.2%           257           5 .C:241a   48          PHA
          771   1.2%           257           5 .C:241b   A5 12       LDA $12
          514   0.8%           257           5 .C:241d   A0 01       LDY #$01
        1,542   2.4%           257           5 .C:241f   91 02       STA ($02),Y
          771   1.2%           257           5 .C:2421   A5 13       LDA $13
          514   0.8%           257           5 .C:2423   88          DEY
        1,542   2.4%           257           5 .C:2424   91 02       STA ($02),Y
        1,542   2.4%           257           5 .C:2426   EE 00 04    INC $0400
        1,285   2.0%           257           5 .C:2429   E6 A2       INC $A2
          771   1.2%           257           5 .C:242b   A6 01       LDX $01
          771   1.2%           257           5 .C:242d   86 04       STX $04
          514   0.8%           257           5 .C:242f   A2 3E       LDX #$3E
          771   1.2%           257           5 .C:2431   86 01       STX $01
       17,470  27.3%           256   [7]     6 .C:2433   20 9F FF    JSR $FF9F
          771   1.2%           257           6 .C:2436   A6 04       LDX $04
          771   1.2%           257           6 .C:2438   86 01       STX $01
        1,028   1.6%           257           5 .C:243a   AE 0D DC    LDX $DC0D
          514   0.8%           257           5 .C:243d   A0 00       LDY #$00
        1,285   2.0%           257           5 .C:243f   B1 02       LDA ($02),Y
          771   1.2%           257           5 .C:2441   85 13       STA $13
          514   0.8%           257           5 .C:2443   C8          INY
        1,285   2.0%           257           5 .C:2444   B1 02       LDA ($02),Y
          771   1.2%           257           5 .C:2446   85 12       STA $12
        1,028   1.6%           257           5 .C:2448   68          PLA
          771   1.2%           257           5 .C:2449   85 04       STA $04
        1,028   1.6%           257           5 .C:244b   68          PLA
          514   0.8%           257           5 .C:244c   A8          TAY
        1,028   1.6%           257           5 .C:244d   68          PLA
          514   0.8%           257           5 .C:244e   AA          TAX
        1,028   1.6%           257           5 .C:244f   68          PLA
          771   1.2%           257           5 .C:2450   48          PHA
          514   0.8%           257           5 .C:2451   18          CLC
          771   1.2%           257           5 .C:2452   A5 02       LDA $02
          514   0.8%           257           5 .C:2454   69 02       ADC #$02
          771   1.2%           257           5 .C:2456   85 02       STA $02
          771   1.2%           257           5 .C:2458   A5 03       LDA $03
          514   0.8%           257           5 .C:245a   69 01       ADC #$01
          771   1.2%           257           5 .C:245c   85 03       STA $03
        1,028   1.6%           257           5 .C:245e   68          PLA
        1,542   2.4%           257           5 .C:245f   40          RTI
                                                 ...
        1,806   2.8%           258           5 IRQ 



(C:$2418) prof context 7

Context    [7] 2433 -> ff9f                               

   Parent     [6] IRQ  -> 2403                               

   Entered        257 times
   Exited         257 times
   Total       15,934 cycles       16.2 ms
   Self        15,934 cycles       16.2 ms
   Average         62 cycles       62.9 us

   Memory Banking Config: 6

       Cycles      %         Times Context Address  Disassembly
------------- ------ ------------- ------- -------  -------------------------
          514   3.2%           257        .C:ea87   A9 00       LDA #$00
        1,028   6.5%           257        .C:ea89   8D 8D 02    STA $028D
          514   3.2%           257        .C:ea8c   A0 40       LDY #$40
          771   4.8%           257        .C:ea8e   84 CB       STY $CB
        1,028   6.5%           257        .C:ea90   8D 00 DC    STA $DC00
        1,028   6.5%           257        .C:ea93   AE 01 DC    LDX $DC01
          514   3.2%           257        .C:ea96   E0 FF       CPX #$FF
          771   4.8%           257        .C:ea98   F0 61       BEQ $EAFB
                                            ...
        1,028   6.5%           257        .C:eafb   F0 29       BEQ $EB26
                                            ...
          771   4.8%           257        .C:eb26   A4 CB       LDY $CB
          771   4.8%           257        .C:eb28   84 C5       STY $C5
        1,028   6.5%           257        .C:eb2a   AC 8D 02    LDY $028D
        1,028   6.5%           257        .C:eb2d   8C 8E 02    STY $028E
          514   3.2%           257        .C:eb30   E0 FF       CPX #$FF
          771   4.8%           257        .C:eb32   F0 0E       BEQ $EB42
                                            ...
          514   3.2%           257        .C:eb42   A9 7F       LDA #$7F
        1,028   6.5%           257        .C:eb44   8D 00 DC    STA $DC00
        1,542   9.7%           257        .C:eb47   60          RTS
                                            ...
          771   4.8%           257       *.C:ff9f   4C 87 EA    JMP $EA87
(C:$2418)

Last edit: Oskar Linde 2023-01-17

gpz - 2023-03-26

assigned_to: gpz
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpz - 2023-03-29

Sorry for not coming back to this earlier... So whats the status on this? :)

(I also lean towards approach 1))

I think its acceptable to merge only for 6502, provided it works on all of them. There shouldn't be too many cases of banking either (if we ignore things like cartridges for a while, which are an entire different beast), depending on what needs to be done to support it, it might be OK to merge even if not all things are hooked up yet. Would be good to have a current patch to look at :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oskar Linde - 2023-04-24

Hi, likewise sorry for the delay. Here's the diff for the approach 1 implementation. This is incremental to the other two patches in this thread. If this looks ok I'm happy to quickly implement support in the same way for the other 6502-based machines.

banking.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpz - 2023-04-24

so i will have to apply all 3 patches?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oskar Linde - 2023-04-25

Yes – but give me some time to rebase the patches onto the latest head and squash them.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpz - 2023-04-25

ok. In that case you can also just make it one patch :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oskar Linde - 2023-04-25

Here's one single patch with everything. Should apply cleanly on the latest head. Configure with --enable-profiling. banking config support per above is only implemented on the x64 and x64sc cores, for the rest it's been stubbed-out (doesn't break anything, just makes profiling traces across bank changes a bit harder to read).

profiling_squished.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpz - 2023-04-28

will have to play a bit with this... codestyle etc looks ok so far :) One thing that is missing in this patch would be the documentation update (vice.texi). Also, is it really necessary to have a configure switch for this? Does it have a notable performance impact even when profiling is disabled?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oskar Linde - 2023-04-29

Thanks – will add the documentation.
The compiler switch was mainly a precaution. I think the runtime perf impact of having it disabled but compiled in is relatively minimal but I haven't benchmarked it. Any tips/best practices for benchmarking vice emu performance?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpz - 2023-04-29

I dont know to be honest... i'd say if it doesnt produce a noticable difference when running VICE in warpmode, then its just fine to have it unconditionally enabled.

I noted another small "problem" - there are a bunch of "printf modifier" style warnings in the monitor code now - please try to fix those (if you dont see them, we can do this later)

And i have another question - how does this deal with different CPUs? IE the disk drives? (its 3am here now, i didnt have time to actually check - will do tomorrow :=))

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Oskar Linde - 2023-04-29
  
  I noted another small "problem" - there are a bunch of "printf modifier" style warnings in the monitor code now - please try to fix those (if you dont see them, we can do this later)
  
  Interesting. My compiler didn't produce any warnings – I'll look into it.
  
  And i have another question - how does this deal with different CPUs? IE the disk drives?
  
  Currently it only profiles maincpu, other CPUs are not affected. In the future it wouldn't be terribly difficult to enable profiling of other CPUs – either by maintaining multiple parallel profiling contexts (one per CPU) or exclusive selection of one particular CPU to profile, but I didn't have an application for this and also no good test cases.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpz - 2023-04-29

another "problem" - the formatting of the "help" output in the monitor is a bit unfortunate - it seems to wrap around in the default monitor window size
Edit: that might be related to what font is being used(?) - probably a good idea to not use 80, but 79 or 78 chars width for the hardcoded formatting

Last edit: gpz 2023-04-29

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Oskar Linde - 2023-04-29
  
  Ok, I'll reformat to a smaller column width. I don't think it was a problem on my platform.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Another... guess you can call it request =) It would be very nice, if the profiler output would use the literal label names instead of addresses. Other than that, it all looks very nice to me!

I am running the testbench on x64sc now, just to see if this might break something.

Label names should be supported. See sample below.

I also have another patch I can share that adds comments on branch statements in disassembly, e.g.

JSR $FF9F ; -> .SCNKEY

(C:$2570) prof disass .__ashlqi3

Function .__ashlqi3                                                            

   Contexts [791][792][793][794][636][639][643][795][796][797][798]

   Entered        859 times
   Exited         859 times
   Total       39,004 cycles       39.6 ms
   Self        39,004 cycles       39.6 ms
   Average         45 cycles       46.1 us

       Cycles      %         Times  Address  Disassembly
------------- ------ -------------  -------  -------------------------
        2,577   6.6%           859*.C:bad8   4C DD BA    JMP $BADD
        6,186  15.9%         3,093 .C:badb   CA          DEX
        6,186  15.9%         3,093 .C:badc   0A          ASL A
        7,904  20.3%         3,952 .C:badd   E0 00       CPX #$00
       10,997  28.2%         3,952 .C:badf   D0 FA       BNE $BADB
        5,154  13.2%           859 .C:bae1   60          RTS

(C:$2570) prof func .AStar__search
                                          Callers
        Total      %          Self      %         Callees
------------- ------ ------------- ------ |---|---|------
    1,427,699 100.0%     1,165,087  81.6% .main                                  
    1,427,699 100.0%     1,165,087  81.6%     .AStar__search                          
      221,968  15.5%       161,836  11.3%         .call_indir                            
       25,088   1.8%        25,088   1.8%         .to_coord                              
       15,180   1.1%        15,180   1.1%         .__ashlqi3                              
          376   0.0%           376   0.0%         .mul40                                  

(C:$2570) prof graph 637
                                               Total      %          Self      %
                                       ------------- ------ ------------- ------
  [637] 3a4b -> .AStar__search             1,427,699 100.0%     1,165,087  81.6%
    [638] 0ba1 -> .mul40                         188   0.0%           188   0.0%
    [639] 0f78 -> .__ashlqi3                  14,993   1.1%        14,993   1.1%
    [640] 0e63 -> .call_indir                221,968  15.5%       161,836  11.3%
      [641] 87bf -> .passable                 60,132   4.2%        60,132   4.2%
    [642] 0de0 -> .to_coord                   25,088   1.8%        25,088   1.8%
    [643] 0c42 -> .__ashlqi3                     187   0.0%           187   0.0%
    [644] 0bbf -> .mul40                         188   0.0%           188   0.0%

(C:$2570) prof flat
        Total      %          Self      %
------------- ------ ------------- ------
    2,799,594  34.6%     2,799,594  34.6% .wait_for_trigger                      
    1,427,699  17.7%     1,165,087  14.4% .AStar__search                          
      627,428   7.8%       627,428   7.8% 3ba3                                   
    8,021,387  99.2%       510,768   6.3% .main                                  
      474,002   5.9%       474,002   5.9% .__mulhi3                              
      638,021   7.9%       366,541   4.5% .shadowcast_scan                        
      196,109   2.4%       196,109   2.4% .__umodhi3                              
      193,200   2.4%       193,200   2.4% .__mulsi3                              
    2,982,990  36.9%       176,779   2.2% .linebreak

Oskar Linde - 2023-04-30

Thanks for checking this out! I've attached a new version with the following changes:

Added vice.texi documentation for the monitor commands

Changed linebreak column for profiler help to ~76

I've not been able to get my compiler to emit any format string warnings (building using llvm/clang for arm64, -Wpedantic -Wformat-pedantic). Can you share what warnings you get?

I tested benchmarking in warp mode with and without the --enable-profiling flag for a workload. 3 repeats:

With: --disable-profiling 6.12 s 6.04 s 6.13 s With: --enable-profiling 6.13 6.05 6.08

So the performance impact of enabling profiling support seems to be in the noise. If you think this is sufficient, I'm happy to remove the configure option.

profiler-230430.patch
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oskar Linde - 2023-04-30

Hmm, with profiling turned on from boot, 58 million samples collected in ~6 seconds, I got the virtually the same warp execution timings.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpz - 2023-05-02

I am running the testbench on x64sc now, just to see if this might break something.

That looks good....

Hmm, with profiling turned on from boot, 58 million samples collected in ~6 seconds, I got the virtually the same warp execution timings.

It should probably be enabled unconditionally then.... I'd much prefer this, since IF this patch breaks something, this will increase the chance someone notices it a lot

I've not been able to get my compiler to emit any format string warnings (building using llvm/clang for arm64, -Wpedantic -Wformat-pedantic). Can you share what warnings you get?

Working one something else right now... i will post them tomorrow or so. In any case, have a look at src/vice.h - clang can probably use similar attributes as gcc (VICE_ATTR_PRINTF and friends at the bottom of the file), so if you define those macros so clang recognizes them, you should get those warning too

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oskar Linde - 2023-05-02

Thanks! I had to install GCC to get the warnings to show up. This latest patch should address them and also removes the configure option.

profiler-230502.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ok some minor nitpicking left... first, i see two warnings still:

In Datei, eingebunden von ../../src/6510core.c:50,
                 von drivecpu.c:434:
../../src/profiler.h: In Funktion »drivecpu_execute«:
../../src/profiler.h:6:13: Warnung: Variable »maincpu_profiling« wird nicht verwendet [-Wunused-variable]
    6 | extern bool maincpu_profiling;
      |             ^~~~~~~~~~~~~~~~~

and

monitor.c: In Funktion »mon_backtrace«:
monitor.c:1166:25: Warnung: Format »%x« erwartet Argumenttyp »unsigned int«, aber Argument 2 hat Typ »int« [-Wformat=]
 1166 |             mon_out("%04x", addr_src - 2);
      |                      ~~~^   ~~~~~~~~~~~~
      |                         |            |
      |                         unsigned int int
      |                      %04x

(if you don't get these, don't bother, i'll fix them here before committing)

then, in scpu64mem.c the patch removed

#include "vice.h"

Please don't do that, every file must include this :)

The same applies to other .c files you added, they should first include vice.h, followed by the standard C headers (everything in <>), and after that the vice internal includes in alphabetical order.

And last not least, the newly added .c and .h files should have the same GPL header comment as other files.

After those changes i'll commit this, and i could also post in the csdb forum and have some ppl test it - it should come in handy for doing final touches at the demos for X :)

Oskar Linde - 2023-05-07

Thanks! Here's a new patch that should address all the above.

Sorry about the removed vice.h include. That was not intentional – my editor keeps warning that that header is unused and pops up correction suggestions and I must have accidentally accepted thinking I probably added the header myself to debug something...

profiler-230507.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Profiler Support

Versatile Commodore Emulator

Version

Searches

Help

#350 Profiler Support

Discussion