CUSE performance improvement #519

JamesWekel · 2022-11-10T03:32:17Z

Fish,

Here is a proposed performance improvement for the CUSE instruction.

Before the change, the performance test reported:

         1,000,000 iterations of CUSE  took  10,295,109 microseconds
         1,000,000 iterations of CUSE  took  10,727,350 microseconds
         1,000,000 iterations of CUSE  took  11,088,763 microseconds
         1,000,000 iterations of CUSE  took  10,898,278 microseconds
         1,000,000 iterations of CUSE  took  10,432,426 microseconds
         1,000,000 iterations of CUSE  took  85,796,364 microseconds

and after the change:

         1,000,000 iterations of CUSE  took     415,661 microseconds
         1,000,000 iterations of CUSE  took     542,513 microseconds
         1,000,000 iterations of CUSE  took     296,749 microseconds
         1,000,000 iterations of CUSE  took     139,859 microseconds
         1,000,000 iterations of CUSE  took     505,759 microseconds
         1,000,000 iterations of CUSE  took     557,832 microseconds

The CUSE instruction performance is very dependent upon operand strings and substring length. CUSE-02-performance test shows performance increases between 95 - 99%. This increase is probably not representative of real-world instruction usage.

Thank you for your review and updates. I'm sure that there is something that I've missed.

Jim

Fish-Git · 2022-11-11T16:41:08Z

Thanks, James! I'll take a look at it and get back to you.

JamesWekel · 2022-11-11T18:43:18Z

Fish,

Performance has improved, but I'm not exactly happy with the implementation. I optimized special cases which leads to a long implementation. The solution needs a second pair of eyes!

Jim

Fish-Git

I'm seeing some serious issues with your code, James. :(

The first thing I noticed was the following (similar code which appears in several other places):

/* update mainstore addresses; check for page boundary */
if ( (uintptr_t) m1 & PAGEFRAME_PAGEMASK )
    m1--;
else
    m1 = MADDRL((ea1 + (ss_scan_index-1)) & ADDRESS_MAXWRAP( regs ), 1, b1, regs, ACCTYPE_READ, regs->psw.pkey );

Please explain how the above if statement is checking for whether a page is being, or is about to be, crossed?!

The PAGEFRAME_PAGEMASK constant is #defined (for e.g. z/Arch) as 0xFFFFFFFFFFFFF000ULL, which is the mask you apply to an address to get the start of the page (i.e. page number) for that address. Only in the highly unusual case where m1 was pointing to low core (i.e. to the very first page of storage, i.e. address 0) would that if statement ever be false! As coded, it is almost certainly always going to be true!

Another minor issue is you declare (and set) two variables called m1prior and m2prior, but never use them for anything.

Finally, you #define a constant called CUSE_INTERRUPT_POINT and an associated variable called ruptcnt and use them to decide whether or not the registers should be updated (with the comment update GPRs if scanned 1024 bytes - could get interrupt).

But yet you never bother to check for an interrupt anywhere (which is actually a good thing in this case because you shouldn't ever be checking for an interrupt!).

I suspect you read in the manual that CUSE was an interruptible instruction and so thought you should probably have code for that. But here's a hint: an instruction (in Hercules) can never be interrupted while it's executing! (unless you specifically do so of course, which you rarely, if ever, need to do.)

The only time you really need to MAYBE check for an interrupt is at the END of the instruction, once it has completed, and you're about to return from your function. For a good illustrative example of what I'm talking about, take a look at the code for the LPSWE instruction:

hyperion/esame.c

Lines 5219 to 5261 in d0ccfbc

    
           /*-------------------------------------------------------------------*/ 
        
           /* B2B2 LPSWE - Load PSW Extended                                [S] */ 
        
           /*-------------------------------------------------------------------*/ 
        
           DEF_INST(load_program_status_word_extended) 
        
           { 
        
           int     b2;                             /* Base of effective addr    */ 
        
           U64     effective_addr2;                /* Effective address         */ 
        
           QWORD   qword; 
        
           int     rc; 
        
               S(inst, regs, b2, effective_addr2); 
        
               /* All control instructions are restricted in transaction mode */ 
        
               TRAN_INSTR_CHECK( regs ); 
        
               PRIV_CHECK(regs); 
        
               DW_CHECK(effective_addr2, regs); 
        
           #if defined(_FEATURE_ZSIE) 
        
               if(SIE_STATE_BIT_ON(regs, IC1, LPSW)) 
        
                   longjmp(regs->progjmp, SIE_INTERCEPT_INST); 
        
           #endif /*defined(_FEATURE_ZSIE)*/ 
        
               /* Perform serialization and checkpoint synchronization */ 
        
               PERFORM_SERIALIZATION (regs); 
        
               PERFORM_CHKPT_SYNC (regs); 
        
               /* Fetch new PSW from operand address */ 
        
               ARCH_DEP(vfetchc) ( qword, 16-1, effective_addr2, b2, regs ); 
        
               /* Set the breaking event address register */ 
        
               SET_BEAR_REG(regs, regs->ip - 4); 
        
               /* Load updated PSW */ 
        
               if ( ( rc = ARCH_DEP(load_psw) ( regs, qword ) ) ) 
        
                   regs->program_interrupt (regs, rc); 
        
               /* Perform serialization and checkpoint synchronization */ 
        
               PERFORM_SERIALIZATION (regs); 
        
               PERFORM_CHKPT_SYNC (regs); 
        
               RETURN_INTCHECK(regs); 
        
           } /* end DEF_INST(load_program_status_word_extended) */

Notice that it ends with a RETURN_INTCHECK(regs); statement. That causes an immediate jump to our instruction execution logic that checks for an interrupt for that CPU. Otherwise, when an instruction finishes executing, it simply returns, which goes on to the next instruction.

Checking for interrupts in Hercules is an expensive thing so we purposely don't do it after every instruction. Instead, we always execute a certain fixed number of instructions in a loop, and only check for an interrupt at the end of that loop:

hyperion/cpu.c

Lines 2001 to 2024 in d0ccfbc

    
           fastest_no_txf_loop: 
        
               if (INTERRUPT_PENDING( regs )) 
        
                   ARCH_DEP( process_interrupt )( regs ); 
        
           enter_fastest_no_txf_loop: 
        
               ip = INSTRUCTION_FETCH( regs, 0 ); 
        
               PROCESS_TRACE( regs, ip, enter_fastest_no_txf_loop ); 
        
               EXECUTE_INSTRUCTION( current_opcode_table, ip, regs ); 
        
               regs->instcount++; 
        
               UPDATE_SYSBLK_INSTCOUNT( 1 ); 
        
               for (i=0; i < MAX_CPU_LOOPS/2; i++) 
        
               { 
        
                   UNROLLED_EXECUTE( current_opcode_table, regs ); 
        
                   UNROLLED_EXECUTE( current_opcode_table, regs ); 
        
               } 
        
               regs->instcount   +=     (i * 2); 
        
               UPDATE_SYSBLK_INSTCOUNT( (i * 2) ); 
        
               /* Perform automatic instruction tracing if it's enabled */ 
        
               do_automatic_tracing(); 
        
               goto fastest_no_txf_loop;

(Note: MAX_CPU_LOOPS is currently hard coded at 256)

I've temporarily stopped my review at this point in order to report on the above issues with your code. I have not finished with my review yet. For example, I have not reviewed your actual compare_until_substring_equal function yet, nor whether your helper functions are actually doing what they're supposed to be doing. I will get around to eventually doing that, and will report back to you with the results of my analysis/review once it's completed. In the mean time, I'm afraid I'm going to have to reject your pull request at this time until you can correct the issues I've already identified.

In order to help you, I would like to offer you the below patch for your consideration. You might consider applying it to your existing code, and then compare (review) the changes I've made to see whether you like them or not:

CUSE.patch.txt (some proposed corrections)

NOTE! I'm a human being too and subject to making mistakes just like anyone else! The above patch could well contain serious bugs of its own! I don't believe it does (I believe it's good), but the burden of review falls squarely on your shoulders should you wish to use it!

I hope that helps you!

Fish-Git · 2022-11-15T08:46:21Z

James (@JamesWekel),

Additionally, one more extremely minor required change (in addition to the previously mentioned):

For the CUSE-02-performance.tst script, I would suggest using "300" (the maximum allowed) for your runtest timeout value. Currently you have 500 coded, which is invalid, resulting (prior to my recent fix) in a default value of 30 seconds being used instead (which isn't long enough for the the current version of Hercules that doesn't have your fix).

Interestingly, even 300 seconds (5 minutes) isn't long enough for my system! But since I use a helper script which uses a runtest timeout factor of 2.0, I can manage to squeak by. When I tested current Hercules on my system (without your changes), here are the timings I got:

(current unfixed Hyperion):

         1,000,000 iterations of CUSE  took  23,531,501 microseconds
         1,000,000 iterations of CUSE  took  25,053,053 microseconds
         1,000,000 iterations of CUSE  took  25,340,175 microseconds
         1,000,000 iterations of CUSE  took  25,488,301 microseconds
         1,000,000 iterations of CUSE  took  23,953,078 microseconds
         1,000,000 iterations of CUSE  took 193,951,645 microseconds

So apparently your system is much faster than mine! :)

Anyway... I hope I didn't scare you off this project with my previous critique of your code. :(

You just need to make a few changes and you should be good to go. I finished reviewing the rest of your code (all of your helper functions as well as the primary CUSE instruction function itself too) and everything else looks good!

So just make those suggested fixes I mentioned (about detecting when you've crossed over to a new page) and I'll be happy to accept your pull request.

Thanks!

JamesWekel · 2022-11-15T19:07:58Z

Fish,

Sorry for the late reply; busy with other life items.

Thank you! Thank you! Thank you! How the (beep) (beep) did I code PAGEFRAME_PAGEMASK when I thought I had coded PAGEFRAME_BYTEMASK!! I definitely needed to have a second pair of eyes as I wasn't seeing what was on the page.

The current CUSE implementation includes the following interrupt coding:

        /* update GPRs if we just crossed half page - could get rupt */
        if ((addr1 & 0x7FF) == 0 || (addr2 & 0x7FF) == 0)
        {
            /* Update R1 and R2 to point to next bytes to compare */
            SET_GR_A( r1, regs, addr1 );
            SET_GR_A( r2, regs, addr2 );

            /* Set R1+1 and R2+1 to remaining operand lengths */
            SET_GR_A( r1+1, regs, len1 );
            SET_GR_A( r2+1, regs, len2 );
        }

and yes, I "read in the manual that CUSE was an interruptible instruction". So, it must be necessary but, with the optimizations, I would never know when a half page was crossed, therefore I just changed when the registers were saved. I now have some more reading on Hercules interrupts!

Thank you for the CUSE patch. I do like the simpler look for page crossing checks. It is much more explicit in the code; not relying on a comment and how page offset bits are going to change when crossing a page. To make the code even more explicit, I've added the following macro

#undef  MAINSTOR_PAGEBASE
#define MAINSTOR_PAGEBASE( _ma )  ((BYTE*) ((uintptr_t) ( _ma ) & PAGEFRAME_PAGEMASK))

to further simplify the look of crossed page checks. I'll do another review this afternoon. You should have a commit tonight.

I've changed CUSE-02-performance.tst to have a timeout of 300. My performance tests were done using a 11th Gen Intel Core i5-1135G7 @ 2.40GHz.

Anyway... I hope I didn't scare you off this project with my previous critique of your code.

Not at all. :-) I'm working on the TRTR optimization and should have a pull request to you on the weekend.

Thanks again for the review and the second pair of eyes,

Jim

Fish-Git · 2022-11-15T19:48:27Z

Thank you! Thank you! Thank you! How the (beep) (beep) did I code PAGEFRAME_PAGEMASK when I thought I had coded PAGEFRAME_BYTEMASK!! I definitely needed to have a second pair of eyes as I wasn't seeing what was on the page.

It happens. :) Which is why it's always a good idea to have someone else review one's code. Hell, I'm constantly re-reviewing (and re-re-reviewing) my own code all the time! It also helps doing so at a much later point in time too (e.g. several weeks or months later). You'd be surprised how often during re-reviewing months later that you see things you never noticed before when you first wrote it.

The current CUSE implementation includes the following interrupt coding:
[...]
So, it must be necessary ...

Yeah, I saw that myself too. But you need to keep in mind that Hercules's current implementation of a given instruction/function isn't necessarily "gospel" (sacrosanct). A good programmer should always question existing code by asking oneself "Is this right?", "Is this necessary?", "Is this the best way to do this?". As it turns out in this particular case (as I explained in my earlier comment) it's not actually needed. It's a complete waste of time. The only time an interrupt can occur is if/when the instruction purposely exits (ends), which is obviously not occurring in this case.

I now have some more reading on Hercules interrupts!

A basic understanding certainly wouldn't hurt, but it's not critical. I wouldn't recommend spending a lot of time on it unless you were interested in examining that part of Hercules's functionality. For most instruction coding however, it's not that critical. I only tried explaining it so you could see (understand) how silly and unnecessary such code was.

Anyway... I hope I didn't scare you off this project with my previous critique of your code.

Not at all. :-)

Good! I was slightly worried. Glad we're still cool. :)

I'm working on the TRTR optimization and should have a pull request to you on the weekend.

COOL! I look forward to seeing it! You're doing great work, James! MUCH appreciated!

Thanks again for the review and the second pair of eyes,

Not a problem.

Fish-Git · 2022-11-15T20:04:12Z

My performance tests were done using a 11th Gen Intel Core i5-1135G7 @ 2.40GHz.

My system has 3.0 GHz Xeon X5570s. I have no idea which "generation" they are. (I don't concern myself with such things.)

I think maybe the reason my numbers might be slower than yours is because:

My system's processors are older than yours.
I build Hercules with Visual Studio 2008.

I'm guessing you're using a much newer version of Visual Studio than me (2022? 2019?) and thus are using a version of Microsoft's compiler that is better able to leverage the newer instructions/features that your more modern processor has (whereas mine, having older processors and an older compiler, must resort to slightly less efficient instructions common with older processors).

But that's just a guess.

ONE of these days I'll upgrade my system to newer hardware and then maybe at that time I'll upgrade my Visual Studio as well. But right now I'm still poor and money is tight, so I'll probably be sticking with what I have for a few more years yet. (It's a nice system! I really like it! It's a Dell Precision T7500)

wrljet · 2022-11-15T20:26:00Z

You can compare almost any combination of CPUs on this site:

https://cpu.userbenchmark.com/Compare/Intel-Xeon-X5570-vs-Intel-Core-i5-1135G7/m12230vsm1286124

Bill

JamesWekel · 2022-11-16T03:59:10Z

Fish,

Let's try this again with a cross page fixes!

For my current environment, I'm using a 'small' Intel NUC with Debian as a server with "gcc (Debian 10.2.1-6) 10.2.1" for Hercules development. I access the environment using Visual Studio Code with remote explorer from a really old Intel NUC with a slow dual core i3-4010U CPU running Windows 10. I have Visual Studio 2022 on the I3 NUC just to check that I haven't broken the VS project files. The server NUC is newer (faster CPU and memory) so that's probably where I'm getting better results.

With the fixed cross page checks, the performance test now reports:

     1,000,000 iterations of CUSE  took     487,620 microseconds
     1,000,000 iterations of CUSE  took     544,157 microseconds
     1,000,000 iterations of CUSE  took     327,191 microseconds
     1,000,000 iterations of CUSE  took     176,237 microseconds
     1,000,000 iterations of CUSE  took     547,471 microseconds
     1,000,000 iterations of CUSE  took     608,836 microseconds

Hope your final review doesn't find any more major errors!

Jim

Fish-Git · 2022-11-16T04:29:36Z

Hope your final review doesn't find any more major errors!

Nope! Looks good to me! Pull Request accepted and merged!

Thanks again, James!

JamesWekel added 2 commits November 9, 2022 16:10

CUSE Performance Improvement

fe76dfa

Fix vcxproj.filters

a2967b5

JamesWekel requested review from Fish-Git, ivan-w, Juergen-Git and Peter-J-Jansen as code owners November 10, 2022 03:32

Fish-Git requested changes Nov 13, 2022

View reviewed changes

JamesWekel added 2 commits November 15, 2022 20:34

Major error: fix cross page checks

95ef421

Minor: reduce tst timeout to 300

e50e679

Fish-Git merged commit 30f9598 into SDL-Hercules-390:develop Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUSE performance improvement #519

CUSE performance improvement #519

JamesWekel commented Nov 10, 2022 •

edited by Fish-Git

Loading

Fish-Git commented Nov 11, 2022

JamesWekel commented Nov 11, 2022

Fish-Git left a comment •

edited

Loading

Fish-Git commented Nov 15, 2022

JamesWekel commented Nov 15, 2022 •

edited by Fish-Git

Loading

Fish-Git commented Nov 15, 2022

Fish-Git commented Nov 15, 2022 •

edited

Loading

wrljet commented Nov 15, 2022

JamesWekel commented Nov 16, 2022

Fish-Git commented Nov 16, 2022

	/-------------------------------------------------------------------/
	/* B2B2 LPSWE - Load PSW Extended [S] */
	/-------------------------------------------------------------------/
	DEF_INST(load_program_status_word_extended)
	{
	int b2; /* Base of effective addr */
	U64 effective_addr2; /* Effective address */
	QWORD qword;
	int rc;

	S(inst, regs, b2, effective_addr2);

	/* All control instructions are restricted in transaction mode */
	TRAN_INSTR_CHECK( regs );
	PRIV_CHECK(regs);
	DW_CHECK(effective_addr2, regs);

	#if defined(_FEATURE_ZSIE)
	if(SIE_STATE_BIT_ON(regs, IC1, LPSW))
	longjmp(regs->progjmp, SIE_INTERCEPT_INST);
	#endif /defined(_FEATURE_ZSIE)/

	/* Perform serialization and checkpoint synchronization */
	PERFORM_SERIALIZATION (regs);
	PERFORM_CHKPT_SYNC (regs);

	/* Fetch new PSW from operand address */
	ARCH_DEP(vfetchc) ( qword, 16-1, effective_addr2, b2, regs );

	/* Set the breaking event address register */
	SET_BEAR_REG(regs, regs->ip - 4);

	/* Load updated PSW */
	if ( ( rc = ARCH_DEP(load_psw) ( regs, qword ) ) )
	regs->program_interrupt (regs, rc);

	/* Perform serialization and checkpoint synchronization */
	PERFORM_SERIALIZATION (regs);
	PERFORM_CHKPT_SYNC (regs);

	RETURN_INTCHECK(regs);

	} /* end DEF_INST(load_program_status_word_extended) */

	fastest_no_txf_loop:

	if (INTERRUPT_PENDING( regs ))
	ARCH_DEP( process_interrupt )( regs );

	enter_fastest_no_txf_loop:

	ip = INSTRUCTION_FETCH( regs, 0 );
	PROCESS_TRACE( regs, ip, enter_fastest_no_txf_loop );
	EXECUTE_INSTRUCTION( current_opcode_table, ip, regs );
	regs->instcount++;
	UPDATE_SYSBLK_INSTCOUNT( 1 );

	for (i=0; i < MAX_CPU_LOOPS/2; i++)
	{
	UNROLLED_EXECUTE( current_opcode_table, regs );
	UNROLLED_EXECUTE( current_opcode_table, regs );
	}
	regs->instcount += (i * 2);
	UPDATE_SYSBLK_INSTCOUNT( (i * 2) );

	/* Perform automatic instruction tracing if it's enabled */
	do_automatic_tracing();
	goto fastest_no_txf_loop;

CUSE performance improvement #519

CUSE performance improvement #519

Conversation

JamesWekel commented Nov 10, 2022 • edited by Fish-Git Loading

Fish-Git commented Nov 11, 2022

JamesWekel commented Nov 11, 2022

Fish-Git left a comment • edited Loading

Choose a reason for hiding this comment

Fish-Git commented Nov 15, 2022

JamesWekel commented Nov 15, 2022 • edited by Fish-Git Loading

Fish-Git commented Nov 15, 2022

Fish-Git commented Nov 15, 2022 • edited Loading

wrljet commented Nov 15, 2022

JamesWekel commented Nov 16, 2022

Fish-Git commented Nov 16, 2022

JamesWekel commented Nov 10, 2022 •

edited by Fish-Git

Loading

Fish-Git left a comment •

edited

Loading

JamesWekel commented Nov 15, 2022 •

edited by Fish-Git

Loading

Fish-Git commented Nov 15, 2022 •

edited

Loading