IGNORANTUS.COM

Total: 25

  • AES Optimization on Tilera TILE-Gx
    pte0 0x0000ff00 pte1 0x000000ff rk 43 in 4 out 4

    All that's well and good, but still we're only seeing 5% faster than the previous version.

    Only 32 bits and nothing more

    Looking at the generated code, it's easy to see that gcc does not like the 32-bit memory operations. It'll use ld4s and clear the upper half afterwards. I'm truly puzzled by the generated code here. This seems to be a compiler bug. Given a uint32 array, two loads to a simple intrinsic will generate ld4u. Adding the xor gives ld4s + v4int_l, which is a disaster. We avoid that problem by enforcing the correct ld4u load operations in the ROUND macros, as follows:

        #define ROUND_T(RKINDX) \
            pte0 = (uint32_t *)tblidxb3((uint64_t)pte0, s0); \
            pte1 = (uint32_t *)tblidxb2((uint64_t)pte1, s1); \
            pte2 = (uint32_t *)tblidxb1((uint64_t)pte2, s2); \
            pte3 = (uint32_t *)tblidxb0((uint64_t)pte3, s3); \
            t0 = ld4u(pte0) ^ ld4u(pte1) ^ ld4u(pte2) ^ ld4u(pte3) ^ rk[RKINDX]; \
            pte0 = (uint32_t *)tblidxb3((uint64_t)pte0, s1); \
            pte1 = (uint32_t *)tblidxb2((uint64_t)pte1, s2); \
            pte2 = (uint32_t *)tblidxb1((uint64_t)pte2, s3); \
            pte3 = (uint32_t *)tblidxb0((uint64_t)pte3, s0); \
            t1 = ld4u(pte0) ^ ld4u(pte1) ^ ld4u(pte2) ^ ld4u(pte3) ^ rk[RKINDX + 1]; \
            pte0 = (uint32_t *)tblidxb3((uint64_t)pte0, s2); \
            pte1 = (uint32_t *)tblidxb2((uint64_t)pte1, s3); \
            pte2 = (uint32_t *)tblidxb1((uint64_t)pte2, s0); \
            pte3 = (uint32_t *)tblidxb0((uint64_t)pte3, s1); \
            t2 = ld4u(pte0) ^ ld4u(pte1) ^ ld4u(pte2) ^ ld4u(pte3) ^ rk[RKINDX + 2]; \
            pte0 = (uint32_t *)tblidxb3((uint64_t)pte0, s3); \
            pte1 = (uint32_t *)tblidxb2((uint64_t)pte1, s0); \
            pte2 = (uint32_t *)tblidxb1((uint64_t)pte2, s1); \
            pte3 = (uint32_t *)tblidxb0((uint64_t)pte3, s2); \
            t3 = ld4u(pte0) ^ ld4u(pte1) ^ ld4u(pte2) ^ ld4u(pte3) ^ rk[RKINDX + 3];

        #define ROUND_S(RKINDX) \
            pte0 = (uint32_t *)tblidxb3((uint64_t)pte0, t0); \
            pte1 = (uint32_t *)tblidxb2((uint64_t)pte1, t1); \
            pte2 = (uint32_t *)tblidxb1((uint64_t)pte2, t2); \
            pte3 = (uint32_t *)tblidxb0((uint64_t)pte3, t3); \
            s0 = ld4u(pte0) ^ ld4u(pte1) ^ ld4u(pte2) ^ ld4u(pte3) ^ rk[RKINDX]; \
            pte0 = (uint32_t *)tblidxb3((uint64_t)pte0, t1); \
            pte1 = (uint32_t *)tblidxb2((uint64_t)pte1, t2); \
            pte2 = (uint32_t *)tblidxb1((uint64_t)pte2, t3); \
            pte3 = (uint32_t *)tblidxb0((uint64_t)pte3, t0); \
            s1 = ld4u(pte0) ^ ld4u(pte1) ^ ld4u(pte2) ^ ld4u(pte3) ^ rk[RKINDX + 1]; \
            pte0 = (uint32_t *)tblidxb3((uint64_t)pte0, t2); \
            pte1 = (uint32_t *)tblidxb2((uint64_t)pte1, t3); \
            pte2 = (uint32_t *)tblidxb1((uint64_t)pte2, t0); \
            pte3 = (uint32_t *)tblidxb0((uint64_t)pte3, t1); \
            s2 = ld4u(pte0) ^ ld4u(pte1) ^ ld4u(pte2) ^ ld4u(pte3) ^ rk[RKINDX + 2]; \
            pte0 = (uint32_t *)tblidxb3((uint64_t)pte0, t3); \
            pte1 = (uint32_t *)tblidxb2((uint64_t)pte1, t0); \
            pte2 = (uint32_t *)tblidxb1((uint64_t)pte2, t1); \
            pte3 = (uint32_t *)tblidxb0((uint64_t)pte3, t2); \
            s3 = ld4u(pte0) ^ ld4u(pte1) ^ ld4u(pte2) ^ ld4u(pte3) ^ rk[RKINDX + 3];

    This certainly improves the situation, but what more can be done to make this truly efficient?

    Mmmmm, registers

    First there's

    Original URL path: http://www.ignorantus.com/tilegx_aes/ (2016-04-24)
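
The difference between the two load flavours mentioned in the excerpt can be modelled in portable C. This is only a sketch, not the TILE-Gx code from the page: the helper names (`load32_zero_extend`, `load32_sign_extend`, `xor4_zero_extend`, `xor4_sign_extend_then_mask`) are hypothetical, and plain C casts stand in for the ld4u and ld4s instructions. It shows why a sign-extending load needs an extra step to clear the upper half of the 64-bit register, while a zero-extending load does not.

```c
#include <stdint.h>
#include <string.h>

/* Models a zero-extending 32-bit load into a 64-bit register (ld4u-like). */
static uint64_t load32_zero_extend(const uint32_t *p)
{
    return (uint64_t)*p;
}

/* Models a sign-extending 32-bit load into a 64-bit register (ld4s-like). */
static uint64_t load32_sign_extend(const uint32_t *p)
{
    int32_t s;
    memcpy(&s, p, sizeof s);      /* reinterpret the bits as signed */
    return (uint64_t)(int64_t)s;  /* upper 32 bits copy the sign bit */
}

/* Four table words XORed with a round key: the result is already a clean
 * 32-bit value, so no fix-up is needed. */
static uint64_t xor4_zero_extend(const uint32_t *a, const uint32_t *b,
                                 const uint32_t *c, const uint32_t *d,
                                 uint32_t rk)
{
    return load32_zero_extend(a) ^ load32_zero_extend(b) ^
           load32_zero_extend(c) ^ load32_zero_extend(d) ^ rk;
}

/* Same computation with sign-extending loads: the upper half may contain
 * garbage, so one more operation is needed to clear it. */
static uint64_t xor4_sign_extend_then_mask(const uint32_t *a, const uint32_t *b,
                                           const uint32_t *c, const uint32_t *d,
                                           uint32_t rk)
{
    uint64_t x = load32_sign_extend(a) ^ load32_sign_extend(b) ^
                 load32_sign_extend(c) ^ load32_sign_extend(d) ^ rk;
    return x & 0xffffffffu;       /* the extra "clear the upper half" step */
}
```

Both XOR helpers return the same value; the second one just pays for an extra masking operation per result, which is the kind of overhead the excerpt complains about.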

  • SRTP SHA1 Optimization
    keylen memset pad keylen 0 sizeof pad keylen for i 0 i 64 i pad i 0x5c SHA1 ProcessBlocks pado pad 1

    Next, we make the ProcessBlocks function, which should resemble the OpenSSL inner loop, only adjusted so my current compiler can make an optimal loop. Your mileage may vary.

    static void SHA1 ProcessBlocks const uint32 t inhash uint32 t outhash const uint8 t msg int cnt const uint32 t W uint32 t A B C D E T uint32 t h0 h1 h2 h3 h4 uint32 t XX0 XX1 XX2 XX3 XX4 XX5 XX6 XX7 XX8 XX9 XX10 XX11 XX12 XX13 XX14 XX15 W const uint32 t msg h0 inhash 0 h1 inhash 1 h2 inhash 2 h3 inhash 3 h4 inhash 4 while 1 A h0 B h1 C h2 D h3 E h4 BODY 00 15 0 A B C D E T W 0 BODY 16 19 16 C D E T A B X 0 W 0 W 2 W 8 W 13 BODY 20 31 20 E T A B C D X 4 W 4 W 6 W 12 X 1 BODY 32 39 32 E T A B C D X 0 X 2 X 8 X 13 BODY 40 59 40 C D E T A B X 8 X 10 X 0 X 5 BODY 60 79 60 A B C D E T X 12 X 14 X 4 X 9 h0 E h1 T h2 A h3 B h4 C if cnt break W 16 outhash 0 h0 outhash 1 h1 outhash 2 h2 outhash 3 h3 outhash 4 h4

    Then we make a SHA1 HMAC function with sensible arguments, and the Process function to sum it all up:

    void SHA1 HMAC const uint8 t msg int msglen uint8 t dst int dstlen const uint32 t padi const uint32 t pado uint8 t digest SHA1HashSize SHA1 Process padi msg msglen digest SHA1HashSize SHA1 Process pado digest SHA1HashSize dst dstlen static void SHA1 Process const uint32 t pad const uint8 t msg unsigned msglen uint8 t dst int dstlen int block64 rest uint8 t buf 64 uint32 t hash 5 hash 0 pad 0 hash 1 pad 1 hash 2 pad 2 hash 3 pad 3 hash 4 pad 4 rest msglen 63 block64 msglen 6 if block64 SHA1 ProcessBlocks hash msg block64 msg msglen 63 if rest memcpy buf msg rest buf rest 0x80 memset buf rest 0 64 rest if rest 56 SHA1 ProcessBlocks hash buf 1 memset buf 0 60 else memset buf 0 60 buf 0 0x80 uint32 t buf 60 msglen 8 512 SHA1 ProcessBlocks hash buf 1 memcpy dst hash dstlen

    Nothing to it. That's all pretty neat, but I'm interested in squeezing out every last cycle of this. There's little to be done in the ProcessBlocks function, but those static initializations and mem operations bother me. With a little creative use of pointers and global variables, most of these

    Original URL path: http://www.ignorantus.com/srtp_sha1/ (2016-04-24)
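
The tail handling in the Process function of the excerpt (the 0x80 marker, zero fill, and a bit length of msglen * 8 + 512) follows the standard SHA-1 padding rule. The sketch below restates that rule in portable C under some assumptions: the names `sha1_build_tail` and `hmac_pass_bits` are hypothetical, the length is written byte by byte so it does not depend on host endianness, and a 128-byte output buffer is used instead of reusing one 64-byte block as the excerpt appears to do.

```c
#include <stdint.h>
#include <string.h>

/* Builds the padded tail of a SHA-1 message.
 * tail/rest: the last (msglen % 64) bytes of the message, rest < 64.
 * total_bits: length in bits of everything hashed so far, including any
 *             prefix block such as the 64-byte HMAC pad.
 * out: receives one or two 64-byte blocks; returns how many were used. */
static int sha1_build_tail(const uint8_t *tail, unsigned rest,
                           uint64_t total_bits, uint8_t out[128])
{
    /* The 0x80 marker plus the 8-byte length must fit after the data. */
    int blocks = (rest < 56) ? 1 : 2;
    unsigned end = (unsigned)blocks * 64;

    memset(out, 0, 128);
    memcpy(out, tail, rest);
    out[rest] = 0x80;

    /* Big-endian 64-bit bit count in the last 8 bytes of the final block. */
    for (int i = 0; i < 8; i++)
        out[end - 1 - i] = (uint8_t)(total_bits >> (8 * i));
    return blocks;
}

/* Bit length of one HMAC pass: the message plus the 64-byte pad block that
 * was hashed ahead of it (this is where the "+ 512" comes from). */
static uint64_t hmac_pass_bits(unsigned msglen)
{
    return ((uint64_t)msglen + 64) * 8;
}
```

For a 3-byte message inside an HMAC pass, `hmac_pass_bits(3)` is 536, matching msglen * 8 + 512.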

  • SRTP AES Optimization
    ... Te2[(t1 >> 8) & 0xff] ^ Te3[t2 & 0xff] ^ rk[19];

        /* round 5: */
        t0 = Te0[s0 >> 24] ^ Te1[(s1 >> 16) & 0xff] ^ Te2[(s2 >> 8) & 0xff] ^ Te3[s3 & 0xff] ^ rk[20];
        t1 = Te0[s1 >> 24] ^ Te1[(s2 >> 16) & 0xff] ^ Te2[(s3 >> 8) & 0xff] ^ Te3[s0 & 0xff] ^ rk[21];
        t2 = Te0[s2 >> 24] ^ Te1[(s3 >> 16) & 0xff] ^ Te2[(s0 >> 8) & 0xff] ^ Te3[s1 & 0xff] ^ rk[22];
        t3 = Te0[s3 >> 24] ^ Te1[(s0 >> 16) & 0xff] ^ Te2[(s1 >> 8) & 0xff] ^ Te3[s2 & 0xff] ^ rk[23];
        /* round 6: */
        s0 = Te0[t0 >> 24] ^ Te1[(t1 >> 16) & 0xff] ^ Te2[(t2 >> 8) & 0xff] ^ Te3[t3 & 0xff] ^ rk[24];
        s1 = Te0[t1 >> 24] ^ Te1[(t2 >> 16) & 0xff] ^ Te2[(t3 >> 8) & 0xff] ^ Te3[t0 & 0xff] ^ rk[25];
        s2 = Te0[t2 >> 24] ^ Te1[(t3 >> 16) & 0xff] ^ Te2[(t0 >> 8) & 0xff] ^ Te3[t1 & 0xff] ^ rk[26];
        s3 = Te0[t3 >> 24] ^ Te1[(t0 >> 16) & 0xff] ^ Te2[(t1 >> 8) & 0xff] ^ Te3[t2 & 0xff] ^ rk[27];
        /* round 7: */
        t0 = Te0[s0 >> 24] ^ Te1[(s1 >> 16) & 0xff] ^ Te2[(s2 >> 8) & 0xff] ^ Te3[s3 & 0xff] ^ rk[28];
        t1 = Te0[s1 >> 24] ^ Te1[(s2 >> 16) & 0xff] ^ Te2[(s3 >> 8) & 0xff] ^ Te3[s0 & 0xff] ^ rk[29];
        t2 = Te0[s2 >> 24] ^ Te1[(s3 >> 16) & 0xff] ^ Te2[(s0 >> 8) & 0xff] ^ Te3[s1 & 0xff] ^ rk[30];
        t3 = Te0[s3 >> 24] ^ Te1[(s0 >> 16) & 0xff] ^ Te2[(s1 >> 8) & 0xff] ^ Te3[s2 & 0xff] ^ rk[31];
        /* round 8: */
        s0 = Te0[t0 >> 24] ^ Te1[(t1 >> 16) & 0xff] ^ Te2[(t2 >> 8) & 0xff] ^ Te3[t3 & 0xff] ^ rk[32];
        s1 = Te0[t1 >> 24] ^ Te1[(t2 >> 16) & 0xff] ^ Te2[(t3 >> 8) & 0xff] ^ Te3[t0 & 0xff] ^ rk[33];
        s2 = Te0[t2 >> 24] ^ Te1[(t3 >> 16) & 0xff] ^ Te2[(t0 >> 8) & 0xff] ^ Te3[t1 & 0xff] ^ rk[34];
        s3 = Te0[t3 >> 24] ^ Te1[(t0 >> 16) & 0xff] ^ Te2[(t1 >> 8) & 0xff] ^ Te3[t2 & 0xff] ^ rk[35];
        /* round 9: */
        t0 = Te0[s0 >> 24] ^ Te1[(s1 >> 16) & 0xff] ^ Te2[(s2 >> 8) & 0xff] ^ Te3[s3 & 0xff] ^ rk[36];
        t1 = Te0[s1 >> 24] ^ Te1[(s2 >> 16) & 0xff] ^ Te2[(s3 >> 8) & 0xff] ^ Te3[s0 & 0xff] ^ rk[37];
        t2 = Te0[s2 >> 24] ^ Te1[(s3 >> 16) & 0xff] ^ Te2[(s0 >> 8) & 0xff] ^ Te3[s1 & 0xff] ^ rk[38];
        t3 = Te0[s3 >> 24] ^ Te1[(s0 >> 16) & 0xff] ^ Te2[(s1 >> 8) & 0xff] ^ Te3[s2 & 0xff] ^ rk[39];

        out[0] = (Te2[t0 >> 24] & 0xff000000) ^ (Te3[(t1 >> 16) & 0xff] & 0x00ff0000) ^ (Te0[(t2 >> 8) & 0xff] & 0x0000ff00) ^ (Te1[t3 & 0xff] & 0x000000ff) ^ rk[40];
        out[1] = (Te2[t1 >> 24] & 0xff000000) ^ (Te3[(t2 >> 16) & 0xff] & 0x00ff0000) ^ (Te0[(t3 >> 8) & 0xff] & 0x0000ff00) ^ (Te1[t0 & 0xff] & 0x000000ff) ^ rk[41];
        out[2] = (Te2[t2 >> 24] & 0xff000000) ^ (Te3[(t3 >> 16) & 0xff] & 0x00ff0000) ^ (Te0[(t0 >> 8) & 0xff] & 0x0000ff00) ^ (Te1[t1 & 0xff] & 0x000000ff) ^ rk[42];
        out[3] = (Te2[t3 >> 24] & 0xff000000) ^ (Te3[(t0 >> 16) & 0xff] & 0x00ff0000) ^ (Te0[(t1 >> 8) & 0xff] & 0x0000ff00) ^ (Te1[t2 & 0xff] & 0x000000ff) ^ rk[43];

    Just this simple step gives a marked improvement on my target, probably not so on modern systems with more cache. With the basics covered, we can move on to the interesting stuff.

    The interesting stuff: AES XOR reduction

    This little nugget is present in the SRTP AES specification: 4.1.1 AES in Counter

    Original URL path: http://www.ignorantus.com/srtp_aes/ (2016-04-24)
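
For readers who have not seen the Te0..Te3 tables before, here is a minimal sketch of where they come from and how one output word of a round is formed. It is not taken from the archived page: the helper names (`aes_init_tables`, `aes_round_column`, `sbox_byte`) are hypothetical, and the tables are generated at run time with a slow brute-force S-box computation purely to keep the example self-contained. Real implementations normally ship the tables as constants.

```c
#include <stdint.h>

static uint32_t Te0[256], Te1[256], Te2[256], Te3[256];

/* Multiply by x in GF(2^8) modulo the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
}

/* General GF(2^8) multiplication (shift-and-add). */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* AES S-box entry: multiplicative inverse followed by the affine map. */
static uint8_t sbox_byte(uint8_t x)
{
    uint8_t inv = 0;                       /* inverse of 0 is defined as 0 */
    for (int c = 1; c < 256 && x != 0; c++) {
        if (gmul(x, (uint8_t)c) == 1) {
            inv = (uint8_t)c;
            break;
        }
    }
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

/* Each table entry fuses SubBytes with one column of MixColumns. */
static void aes_init_tables(void)
{
    for (int i = 0; i < 256; i++) {
        uint32_t s  = sbox_byte((uint8_t)i);
        uint32_t s2 = xtime((uint8_t)s);   /* 2 * s */
        uint32_t s3 = s2 ^ s;              /* 3 * s */
        Te0[i] = (s2 << 24) | (s  << 16) | (s  << 8) | s3;
        Te1[i] = (s3 << 24) | (s2 << 16) | (s  << 8) | s;
        Te2[i] = (s  << 24) | (s3 << 16) | (s2 << 8) | s;
        Te3[i] = (s  << 24) | (s  << 16) | (s3 << 8) | s2;
    }
}

/* One output word of a full round: four lookups and a round-key XOR.
 * The order of a, b, c, d implements ShiftRows. */
static uint32_t aes_round_column(uint32_t a, uint32_t b, uint32_t c,
                                 uint32_t d, uint32_t rk)
{
    return Te0[a >> 24] ^ Te1[(b >> 16) & 0xff] ^
           Te2[(c >> 8) & 0xff] ^ Te3[d & 0xff] ^ rk;
}
```

After `aes_init_tables()`, a line such as the t0 assignment in round 5 corresponds to `aes_round_column(s0, s1, s2, s3, rk[20])`.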

  • MultiQuake - Quake for Tilera TILE-Gx CPU
    and output. To pass arguments to the Quakes, please modify the quake main function in sys_tilegx.c. Should probably merge those two files, but I already had a similar main_tilegx.c in the Doom port that I could recycle.

    Output is done via an external viewer. A file /tmp/doomscreen is mmap'ed as a width*height*4 buffer, which the viewer converts to the correct format and sends to the output. This viewer is copyrighted, so I can't supply the code for that. A minor problem is that there's no frame synchronization apart from the Quake internal timer stuff. This doesn't seem to be a big issue; it runs very smoothly on the Cisco SX80.

    The GitHub version does have one jarring bug: if screen size is different from 320x200, it'll crash when the player is under water. So to avoid that, I made some custom TILE-Gx based scaling routines in vid_tilegx.c. The annoying part is that this'll scale everything, including GUI and fonts. However, the fun part is running many Quakes, and then the scaling won't be used. I tried adding EPX routines for 2x and 3x scaling. While they were nice in Doom, they're not as nice in Quake. Oh well, the code's there, so either use it or don't.

    Input is done by remote control. Yeah, it's terrible, but I got no keyboard driver on this box. If/when it's added, some more remapping must be done in sys_tilegx.c.

    Console output is mostly disabled. The amount of output is staggering when there's 30 running at once. Now it just prints out which demo is playing on which core. Since there's only 3 demos in the shareware

    Original URL path: http://www.ignorantus.com/multiquake/ (2016-04-24)
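
The shared-file output scheme described in the excerpt can be sketched with standard POSIX calls. This is an illustration only, not the port's actual code: the function names `map_screen` and `unmap_screen` are hypothetical, and it assumes the producer creates and sizes the file itself and that each pixel is one 32-bit word (width * height * 4 bytes).

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Maps a file as a shared width*height*4 byte frame buffer.
 * Returns NULL on failure. */
static uint32_t *map_screen(const char *path, int width, int height)
{
    size_t size = (size_t)width * (size_t)height * 4;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    /* Make sure the file is large enough to back the whole mapping. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    /* MAP_SHARED makes writes visible to another process (the viewer)
     * that maps the same file. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                    /* the mapping stays valid after close */
    return (p == MAP_FAILED) ? NULL : (uint32_t *)p;
}

/* Releases a mapping created by map_screen. */
static void unmap_screen(uint32_t *screen, int width, int height)
{
    munmap(screen, (size_t)width * (size_t)height * 4);
}
```

A renderer would call `map_screen` once at startup and then write each finished frame into the returned buffer; as the excerpt notes, nothing in this scheme synchronizes the writer with the viewer.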

  • MultiDoom - Doom for Tilera TILE64 and TILE-Gx CPUs
    but we never got that to work properly. Doom for Tilera TILE64 PCIe Card using iLib: This is the old iLib-based version. I'm guessing that the TMC version should work on the TILE64 too, but I haven't tried it since I don't have a TILE64 card anymore. Feel free to try it. SDL MultiDoom for TILE64 PCIe, iLib (30 December 2009). Doom for Tilera TILE-Gx

    Original URL path: http://www.ignorantus.com/multidoom/ (2016-04-24)

  • TANDBERG Secrets
    where you can see each participant on a separate wall. The hidden command menu magic star enables the 60 fps starfield. Based on 68k assembler code written by Carl Henrik Aaby. TANDBERG 6000 MXP Hidden Menu: The 770/880/990 MXP, Edge 75/85/95 MXP, 3000 MXP and 6000 MXP all run F series software and should have the same hidden menu. TANDBERG 6000 Classic Hidden Menu: Same here

    Original URL path: http://www.ignorantus.com/tandberg/ (2016-04-24)

  • PCTVNet HomePilot
    on who wrote what. Some of the things I did write were, in no particular order: driver for AD1816 (loosely based on the Linux driver, included); port of MikMod, so you can play modules on AD1816; Midi player; Wav player; sampler for AD1816; Text-TV decoder; QMan program/memory manager; temperature, speed control and hw control server for AMD Elan CPU; smart card controller for Classic; assorted pip hacks for both

    Original URL path: http://www.ignorantus.com/pctvnet/ (2016-04-24)

  • Triumph Amiga demos and source code
    The Archive:
      Cyberspace Oldschool: Scoopex Triumph coop demo released at The Gathering 2004
      64k intro: 64k intro released at KinderGarden 1998
      Eclipse: Demo released at The Gathering 1997
      Dreamscape: Demo released at The Gathering 1996
      Dreamscape Remix 1.2: Patch for better graphics
      2K intro: 2k intro released at KinderGarden 1996
      Speed: 64k intro released at The Gathering 1995
      Space Battle: Space Battle animation released at The Gathering 1994
      IntuiTracker

    Original URL path: http://www.ignorantus.com/triumph/ (2016-04-24)