archive-com.com » COM » I » IGNORANTUS.COM

  • Ignorantus
    is too conservative use of multipliers, leading to complicated data shuffling before and/or after the multiplications. SSE2 multipliers are inherently cheap to use, so let's try to maximize their usage instead.

    A Look at Halide's SSE2 3x3 Box Filter (24 August 2012)
    A look at the SSE2 3x3 box filter used as example code in the Halide language specification. I get significantly better results using normal C code and SSE2 intrinsics. The code is also comprehensible.

    AES Optimization on Tilera TILE-Gx (23 December 2011)
    A TILE-Gx core can issue 3 instructions in parallel, given a set of strict restrictions. This paper explores how to exploit that in an AES encryption routine using TILE-Gx intrinsics.

    SRTP SHA1 Optimization (13 December 2010)
    Calculating SHA1 hashes on SRTP packets can be quite costly on low-end CPUs. Since lengths etc. are static, let's try to strip out the code that actually does SHA1 calculation in OpenSSL and make it as fast as possible. Tests are performed on a Freescale MPC8270 CPU.

    SRTP AES Optimization (2 April 2010)
    A feature in the SRTP specification makes it possible to reduce the CPU cost of AES encryption and decryption by 30%.

    Projects

    MultiQuake: Quake for Tilera TILE-Gx CPU (5 May 2014)
    Port of the original Quake to Tilera TILE-Gx CPUs. It runs natively on the TILE-Gx and does not need a host CPU. Number of Quakes possible to run in parallel is only limited by your screen size. Has custom intrinsics-based TILE-Gx 2x 5x scalers.

    MultiDoom: Doom for Tilera TILE64 and TILE-Gx CPUs (2 June 2009)
    Port of SDL Doom 1.10 to Tilera TILE64 and TILE-Gx for use on Tilera's PCI Express cards. Number of Dooms possible to run in

    Original URL path: http://www.ignorantus.com/ (2016-04-24)

  • OpenSSL aes_core.c Replacement for EZchip TILE-Gx
        ... &rk[2]); s3 = b ^ __insn_ld4u(&rk[3]);

    What to do about unaligned stores? Not much, really. Either use an array of st1u, or define GX_STORE_ALIGNED if the destination buffer is 8-byte aligned.

    AES Encrypt

    Let's start off by aligning the Te0-3 tables to 1k boundaries so tblidx can be used:

        static const u32 __attribute__((aligned(1024))) Te0[256]
        static const u32 __attribute__((aligned(1024))) Te1[256]
        static const u32 __attribute__((aligned(1024))) Te2[256]
        static const u32 __attribute__((aligned(1024))) Te3[256]

    Define the encrypt round macros:

        #define ROUND_E_T(I0, I1, I2, I3) \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, s0); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, s1); \
            pt2 = (uint32_t *)__insn_tblidxb1((uint64_t)pt2, s2); \
            pt3 = (uint32_t *)__insn_tblidxb0((uint64_t)pt3, s3); \
            t0 = __insn_ld4u(pt0) ^ __insn_ld4u(pt1) ^ __insn_ld4u(pt2) ^ __insn_ld4u(pt3) ^ __insn_ld4u(&rk[I0]); \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, s1); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, s2); \
            pt2 = (uint32_t *)__insn_tblidxb1((uint64_t)pt2, s3); \
            pt3 = (uint32_t *)__insn_tblidxb0((uint64_t)pt3, s0); \
            t1 = __insn_ld4u(pt0) ^ __insn_ld4u(pt1) ^ __insn_ld4u(pt2) ^ __insn_ld4u(pt3) ^ __insn_ld4u(&rk[I1]); \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, s2); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, s3); \
            pt2 = (uint32_t *)__insn_tblidxb1((uint64_t)pt2, s0); \
            pt3 = (uint32_t *)__insn_tblidxb0((uint64_t)pt3, s1); \
            t2 = __insn_ld4u(pt0) ^ __insn_ld4u(pt1) ^ __insn_ld4u(pt2) ^ __insn_ld4u(pt3) ^ __insn_ld4u(&rk[I2]); \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, s3); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, s0); \
            pt2 = (uint32_t *)__insn_tblidxb1((uint64_t)pt2, s1); \
            pt3 = (uint32_t *)__insn_tblidxb0((uint64_t)pt3, s2); \
            t3 = __insn_ld4u(pt0) ^ __insn_ld4u(pt1) ^ __insn_ld4u(pt2) ^ __insn_ld4u(pt3) ^ __insn_ld4u(&rk[I3]);

        #define ROUND_E_S(I0, I1, I2, I3) \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, t0); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, t1); \
            pt2 = (uint32_t *)__insn_tblidxb1((uint64_t)pt2, t2); \
            pt3 = (uint32_t *)__insn_tblidxb0((uint64_t)pt3, t3); \
            s0 = __insn_ld4u(pt0) ^ __insn_ld4u(pt1) ^ __insn_ld4u(pt2) ^ __insn_ld4u(pt3) ^ __insn_ld4u(&rk[I0]); \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, t1); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, t2); \
            pt2 = (uint32_t *)__insn_tblidxb1((uint64_t)pt2, t3); \
            pt3 = (uint32_t *)__insn_tblidxb0((uint64_t)pt3, t0); \
            s1 = __insn_ld4u(pt0) ^ __insn_ld4u(pt1) ^ __insn_ld4u(pt2) ^ __insn_ld4u(pt3) ^ __insn_ld4u(&rk[I1]); \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, t2); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, t3); \
            pt2 = (uint32_t *)__insn_tblidxb1((uint64_t)pt2, t0); \
            pt3 = (uint32_t *)__insn_tblidxb0((uint64_t)pt3, t1); \
            s2 = __insn_ld4u(pt0) ^ __insn_ld4u(pt1) ^ __insn_ld4u(pt2) ^ __insn_ld4u(pt3) ^ __insn_ld4u(&rk[I2]); \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, t3); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, t0); \
            pt2 = (uint32_t *)__insn_tblidxb1((uint64_t)pt2, t1); \
            pt3 = (uint32_t *)__insn_tblidxb0((uint64_t)pt3, t2); \
            s3 = __insn_ld4u(pt0) ^ __insn_ld4u(pt1) ^ __insn_ld4u(pt2) ^ __insn_ld4u(pt3) ^ __insn_ld4u(&rk[I3]);

    Load the input data:

        uint64_t in0 = __insn_ldna(in);
        uint64_t in1 = __insn_ldna(in + 8);
        uint64_t in2 = __insn_ldna(in + 15);
        uint64_t a = __insn_revbytes(__insn_dblalign(in0, in1, in));
        uint64_t b = __insn_revbytes(__insn_dblalign(in1, in2, in));
        s0 = (a >> 32) ^ __insn_ld4u(rk);
        s1 = a ^ __insn_ld4u(&rk[1]);
        s2 = (b >> 32) ^ __insn_ld4u(&rk[2]);
        s3 = b ^ __insn_ld4u(&rk[3]);

    Do the rounds using the macros:

        /* rounds 1-9 */
        ROUND_E_T( 4,  5,  6,  7);
        ROUND_E_S( 8,  9, 10, 11);
        ROUND_E_T(12, 13, 14, 15);
        ROUND_E_S(16, 17, 18, 19);
        ROUND_E_T(20, 21, 22, 23);
        ROUND_E_S(24, 25, 26, 27);
        ROUND_E_T(28, 29, 30, 31);
        ROUND_E_S(32, 33, 34, 35);
        ROUND_E_T(36, 37, 38, 39);
        if (key->rounds > 10) {
            /* rounds 10-11 */
            ROUND_E_S(40, 41, 42, 43);
            ROUND_E_T(44, 45, 46, 47);
            if (key->rounds > 12) {
                /* rounds 12-13 */
                ROUND_E_S(48, 49, 50, 51);
                ROUND_E_T(52, 53, 54, 55);
            }
        }
        rk += key->rounds << 2;

    Use the method described above for the last round and store things, now with added support for the GX_STORE_ALIGNED macro:

        /* apply last round and map cipher state to byte array block */
        /* Exploit pairs in Te tables: 0: yppx, 1: xypp, 2: pxyp, 3: ppxy */
        pt2 = (uint32_t *)__insn_tblidxb3((uint64_t)pt2, t0);
        pt3 = (uint32_t *)__insn_tblidxb2((uint64_t)pt3, t1);
        pt0 = (uint32_t *)__insn_tblidxb1((uint64_t)pt0, t2);
        pt1 = (uint32_t *)__insn_tblidxb0((uint64_t)pt1, t3);
        s0 = __insn_mm(__insn_v2int_l(__insn_ld4u(pt2), __insn_ld4u(pt3)),
                       __insn_v2int_l(__insn_ld4u(pt0), __insn_ld4u(pt1)), 32, 63);
        pt2 = (uint32_t *)__insn_tblidxb3((uint64_t)pt2, t1);
        pt3 = (uint32_t *)__insn_tblidxb2((uint64_t)pt3, t2);
        pt0 = (uint32_t *)__insn_tblidxb1((uint64_t)pt0, t3);
        pt1 = (uint32_t *)__insn_tblidxb0((uint64_t)pt1, t0);
        s1 = __insn_mm(__insn_v2int_l(__insn_ld4u(pt2), __insn_ld4u(pt3)),
                       __insn_v2int_l(__insn_ld4u(pt0), __insn_ld4u(pt1)), 32, 63);
        #if defined(GX_STORE_ALIGNED)
        s0 = __insn_revbytes(__insn_v2packh(s0, s1) ^ __insn_v4int_l(__insn_ld4u(&rk[0]), __insn_ld4u(&rk[1])));
        __insn_st_add(out, s0, 8);
        #else
        s0 = __insn_v2packh(s1, s0) ^ __insn_v4int_l(__insn_ld4u(&rk[1]), __insn_ld4u(&rk[0]));
        s1 = s0 >> 32;
        PUTU32(out, s0);
        PUTU32(out + 4, s1);
        #endif

    Same for s2 and s3, obviously (not shown).

    AES Decrypt

    First align the Td0-3 tables in the same manner as the Te tables (not shown). Define the decrypt round macros:

        #define ROUND_D_T(I0, I1, I2, I3) \
            pt0 = (uint32_t *)__insn_tblidxb3((uint64_t)pt0, s0); \
            pt1 = (uint32_t *)__insn_tblidxb2((uint64_t)pt1, s3); \
            pt2 = (uint32

    Original URL path: http://www.ignorantus.com/tilegx_aes_openssl/ (2016-04-24)
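The excerpt aligns each 256-entry table to 1k so that tblidx can build the lookup address directly. As a rough, portable sketch of why that alignment matters: a 1 KiB-aligned table of 32-bit words has bits 9..2 of its base address free, so dropping a byte into those bits yields the address of the matching entry. The helper `tblidx_emul` and the table `demo_table` below are hypothetical stand-ins written for this illustration; the bit layout is an assumption about how the tblidxb0-3 instructions behave, not a copy of the TILE-Gx intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* 256 entries * 4 bytes = 1024 bytes: with 1 KiB alignment, bits 9..2 of the
   base address are zero and can carry the table index. */
static _Alignas(1024) const uint32_t demo_table[256] = {
    [0] = 0xdeadbeefu, [255] = 0x12345678u
};

/* Hypothetical portable stand-in for tblidxb0..tblidxb3 (assumed semantics):
   replace bits 9..2 of ptr with byte number n of value. */
static const uint32_t *tblidx_emul(const uint32_t *ptr, uint64_t value, unsigned n)
{
    assert(n < 4);
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t idx = (uintptr_t)((value >> (8 * n)) & 0xff);
    return (const uint32_t *)((p & ~(uintptr_t)0x3fc) | (idx << 2));
}

/* Returns 1 if the emulated lookup agrees with plain array indexing for
   every possible byte value, 0 otherwise. */
static int tblidx_matches_indexing(void)
{
    const uint32_t *pt = demo_table;
    for (unsigned b = 0; b < 256; b++) {
        pt = tblidx_emul(pt, (uint64_t)b << 24, 3); /* byte 3 selects the entry */
        if (pt != &demo_table[b])
            return 0;
    }
    return 1;
}
```

Because the pointer keeps its upper bits between calls, the same pointer variable can be reused for every lookup into one table, which matches how pt0-pt3 are recycled in the round macros.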

  • Bilinear Picture Scaling on Tilera TILE-Gx
    vmul0 v00 insn v2packh 0 v00 insn v2shrui v00 8 roundval v01 insn v2packh 0 v01 insn v2shrui v01 8 roundval r00 insn v1mulu v00 hmul1 insn v1mulu v01 hmul0 r00 r00 insn v2shrui r00 8 roundval v00 insn v1mulu p02 vmul1 insn v1mulu p12 vmul0 v01 insn v1mulu p03 vmul1 insn v1mulu p13 vmul0 v00 insn v2packh 0 v00 insn v2shrui v00 8 roundval v01 insn v2packh 0 v01 insn v2shrui v01 8 roundval r01 insn v1mulu v00 hmul3 insn v1mulu v01 hmul2 r01 r01 insn v2shrui r01 8 roundval v00 insn v1mulu p20 vmul3 insn v1mulu p30 vmul2 v01 insn v1mulu p21 vmul3 insn v1mulu p31 vmul2 v00 insn v2packh 0 v00 insn v2shrui v00 8 roundval v01 insn v2packh 0 v01 insn v2shrui v01 8 roundval r10 insn v1mulu v00 hmul1 insn v1mulu v01 hmul0 r10 r10 insn v2shrui r10 8 roundval v00 insn v1mulu p22 vmul3 insn v1mulu p32 vmul2 v01 insn v1mulu p23 vmul3 insn v1mulu p33 vmul2 v00 insn v2packh 0 v00 insn v2shrui v00 8 roundval v01 insn v2packh 0 v01 insn v2shrui v01 8 roundval r11 insn v1mulu v00 hmul3 insn v1mulu v01 hmul2 r11 r11 insn v2shrui r11 8 roundval

        /* pack and store */
        __insn_st_add(dp0, __insn_v2packh(r01, r00), 8);
        __insn_st_add(dp1, __insn_v2packh(r11, r10), 8);

    There's one small caveat left: what to do about the right and bottom edges, since fetches stop at x+1, y+1? Also, the position calculation might glide a pixel across the picture since it's not 100% precise. Either insert more conditional moves, or allocate a slightly larger bitmap and duplicate the last column/row twice. No point in adding more code in the inner loop. Let's make a pad function that has to be called on the source bitmap before scaling anything. The bitmap library already allocates slightly larger bitmaps than needed to cover this.

        static void pad_right_lower(bmp_argb *bp)
        {
            uint32_t *dst = (uint32_t *)(bp->argb + bp->w * 4);
            uint32_t *src = dst - 1;
            int str4 = bp->stride >> 2;

            for (int y = 0; y < bp->h; y++) {
                dst[0] = *src;
                dst[1] = *src;
                src += str4;
                dst += str4;
            }

            uint32_t *dst0 = (uint32_t *)(bp->argb + bp->stride * bp->h);
            uint32_t *dst1 = dst0 + str4;
            src = dst0 - str4;
            for (int x = 0; x < bp->w; x++) {
                *dst0++ = *src;
                *dst1++ = *src++;
            }
        }

    The bilinear scaler is now ready for use. The following code will load, scale and save a picture:

        int main(int argc, char *argv[])
        {
            bmp_argb *src;

            if (argc != 5) {
                printf("Usage: %s <input pic> <output pic> <sizex> <sizey>\n", argv[0]);
                return 1;
            }

            src = bmp_load(argv[1], false);
            if (src == NULL)
                return 1;
            pad_right_lower(src);

            unsigned int xs = atoi(argv[3]);
            unsigned int ys = atoi(argv[4]);
            bmp_argb *dst = bmp_alloc(xs, ys);
            if (dst == NULL) {
                bmp_free(src);
                return 1;
            }

            printf("Scale picture from %dx%d to %dx%d\n", src->w, src->h, dst->w, dst->h);
            scale_bilinear(src, dst, 0, dst->h);

            bmp_save(dst, argv[2], false);
            bmp_free(src);
            bmp_free(dst);
            return 0;
        }

    A Parallel Single Pass Bilinear Scaler

    Writing a parallel version using the parallelization library presented earlier is very easy. Just stuff the arguments into a struct and make a wrapper function that the thread can call. Since I'm interested in trying out different combinations of cores and lines, add a lines argument to the scale function.

        typedef struct {
            bmp_argb *src;
            bmp_argb *dst;
            unsigned int startrow;
            unsigned int height;
        } sc_info;

        static void scale_bilinear_thread(void *data, unsigned int len)
        {
            sc_info *sc = data;
            scale_bilinear(sc->src, sc->dst, sc->startrow, sc->height);
        }

        void scale_bilinear_parallel(bmp_argb *src, bmp_argb *dst, unsigned int lines)
        {
            sc_info sc;

            sc.src = src;
            sc.dst = dst;
            for (unsigned int startrow = 0; startrow < dst->h; startrow += lines) {
                sc.startrow = startrow;
                sc.height = dst->h - startrow < lines ? dst->h - startrow : lines;
                par_sendjob(scale_bilinear_thread, &sc, sizeof(sc));
            }
            par_wait();
        }

    scale_bilinear_parallel() can replace the call to scale_bilinear() in the main function. Let's make it a bit more interesting and loop through a set of lines and cores to make some nice graphs:

        int main(int argc, char *argv[])
        {
            bmp_argb *src;

            if (argc != 5) {
                printf("Usage: %s <input pic> <output pic> <sizex> <sizey>\n", argv[0]);
                return 1;
            }

            int rc = par_init(35);
            if (rc != 0)
                return rc;

            src = bmp_load(argv[1], false);
            if (src == NULL)
                return 1;
            pad_right_lower(src);

            int xs = atoi(argv[3]);
            int ys = atoi(argv[4]);
            bmp_argb *dst = bmp_alloc(xs, ys);
            if (dst == NULL) {
                bmp_free(src);
                return 1;
            }

            printf("Scale %d %d to %d %d\n\n", src->w, src->h, dst->w, dst->h);
            printf("    1        2        4        8        16       32       35\n");
            for (int lines = 2; lines < 128; lines *= 2) {
                printf("%03d ", lines);
                fflush(stdout);
                for (int cores2 = 1; cores2 < 128; cores2 *= 2) {
                    int cores = cores2 == 64 ? 35 : cores2;
                    par_set_cores(cores);
                    uint64_t t0 = get_cycle_count();
                    for (int i = 0; i < cores * 30; i++)
                        scale_bilinear_parallel(src, dst, lines);
                    uint64_t t1 = get_cycle_count();
                    uint64_t frametime = (t1 - t0) / (cores * 30);
                    printf("%08lld ", (long long int)frametime);
                    fflush(stdout);
                }
                printf("\n");
            }

            bmp_save(dst, argv[2], false);
            bmp_free(src);
            bmp_free(dst);
            return 0;
        }

    Test Run 1

    Let's compile it and do some test runs:

        user@mainframe$ tile-cc -O3 -Wall -std=c99 -o main main.c par.c bmp.c scaler.c -ltmc -lpthread
        user@mainframe$ scp main root@lastv36:/tmp
        lastv36:/tmp# ./main pics/ebu3325.bmp output.bmp 1280 720
        Loading bitmap pics/ebu3325.bmp
        Scale 1920 1080 to 1280 720

            1        2        4        8        16       32       35
        002 18079820 09691598 05350570 02977762 02053148 01647987 01628302
        004 18798604 09744640 05606594 02957624 02094006 01697045 01696185
        008 18732162 09989230 05213762 03044871 02122177 01719981 01732858
        016 17703481 09793327 05282825 02999953 02207199 01878886 01918319
        032 17733139

    Original URL path: http://www.ignorantus.com/tilegx_bilinear/ (2016-04-24)
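The parallel scaler in the excerpt hands out the destination picture in strips of `lines` rows, with a shorter strip at the bottom when the height is not a multiple of `lines`. The following is a minimal standalone sketch of just that splitting step. `row_job` and `split_rows` are hypothetical names made up for this illustration; the real code sends each strip to a worker through the author's parallelization library instead of collecting them in an array.

```c
#include <assert.h>

/* Hypothetical job descriptor, mirroring the startrow/height pair in the excerpt. */
typedef struct {
    unsigned int startrow;
    unsigned int height;
} row_job;

/* Split dst_h rows into strips of `lines` rows; the last strip may be shorter.
   Stores at most max_jobs entries and returns the number of strips needed. */
static unsigned int split_rows(unsigned int dst_h, unsigned int lines,
                               row_job *jobs, unsigned int max_jobs)
{
    unsigned int n = 0;

    assert(lines > 0);
    for (unsigned int startrow = 0; startrow < dst_h; startrow += lines) {
        unsigned int left = dst_h - startrow;
        if (n < max_jobs) {
            jobs[n].startrow = startrow;
            jobs[n].height = left < lines ? left : lines;
        }
        n++;
    }
    return n;
}
```

For a 720-row destination and 128 lines per job this gives six strips, the last one covering rows 640 to 719. Smaller strips balance load better across cores but cost more job dispatches, which is the trade-off the benchmark table explores.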

  • Raytracing on Tilera TILE-Gx
    int i for i 0 i OBJNUM i vec3 t v tg obj i poseye v vec3 sub ray pos tg obj i pos float rv tg obj i rv rv norm2 v tg obj i radsq if dotsq v ul rv 0 0f dotsq v bl rv 0 0f dotsq v br rv 0 0f dotsq v ur rv 0 0f break if i OBJNUM clear block

    Gcc's optimizer actually manages to do a decent job of sorting the normalizes and such here, so no need to mess it up further. Some of the vars used between frames are precalculated by the controller, as indicated. This is not a perfect solution. It will fail on short edges and completely remove very small spheres. So, to avoid that, a crap bounding box is added which the spheres bounce around in.

    Communication and Control

    1080 is not divisible by 16, so for simplicity the screen size is rounded up to 1920x1088. That gives a total of 1920*1088/(16*16) = 8160 jobs to be distributed. That's a reasonably low count, and perf says that very little time is wasted on the UDN calls. Unfortunately, top will show 100% load on all cores that are waiting for jobs in tmc_udn0_receive(). The resulting main controller loop is very simple:

        for (int y = 0; y < rh; y += BLOCKY) {
            for (int x = 0; x < rw; x += BLOCKX) {
                // wait for a core to be ready
                uint32_t coreready = tmc_udn0_receive();
                // and give it the next job
                tmc_udn_send_2(tmc_udn_header_from_cpu(coreready), UDN0_DEMUX_TAG, (x << 16) | y, (uint64_t)dst);

    And ditto for the tracer cores:

        while (true) {
            // say we're ready
            tmc_udn_send_1(header, UDN0_DEMUX_TAG, coreid);
            // wait for a job
            uint32_t xy = tmc_udn0_receive();
            if (xy == 0xffff)
                exit(0);
            ptr = (uint8_t *)tmc_udn0_receive();
            // and do it
            render(tglob, ptr, xy >> 16, xy & 0xffff);
        }

    To fork or not to fork

    The parallelization code from the example used fork(). That's cozy, but it would be more convenient if all cores shared the same data. The scene data is only modified by the controller, but must be read by all the tracers. Pointers to the output buffers have to be distributed to the tracers, and do not like to be passed across multiple processes. Luckily, Mr. T. Ringstad had some Tilera code available that uses pthreads instead of fork. Integrating that into the system was pretty simple and didn't change the execution speed considerably.

    Measuring time used

    Mr. Ringstad also helped me measure performance using assorted tools. Let's see how that worked out. Since using the UDN network gives 100% load all the time, get_cycle_count() is used before and after the main loop and added up. This will not be 100% accurate, since there's no message that says a core is finished. But there's little point in cluttering up the communication interface further

    Original URL path: http://www.ignorantus.com/tilegx_raytracer/ (2016-04-24)
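The job arithmetic in the excerpt can be checked with a few lines of plain C. This is only a sketch: the helper names `job_count`, `pack_xy`, `unpack_x` and `unpack_y` are invented here, and the 16x16 block size is taken from the job-count formula quoted in the text. The packing mirrors the single 32-bit word the controller sends, with x in the upper half and y in the lower half.

```c
#include <assert.h>
#include <stdint.h>

enum { BLOCKX = 16, BLOCKY = 16 }; /* block size implied by the 16x16 divisor */

/* Number of render jobs for a screen whose size is a multiple of the block size. */
static uint32_t job_count(uint32_t w, uint32_t h)
{
    return (w / BLOCKX) * (h / BLOCKY);
}

/* Pack a block origin into one 32-bit word: x in bits 31..16, y in bits 15..0. */
static uint32_t pack_xy(uint32_t x, uint32_t y)
{
    assert(x <= 0xffff && y <= 0xffff);
    return (x << 16) | y;
}

static uint32_t unpack_x(uint32_t xy) { return xy >> 16; }
static uint32_t unpack_y(uint32_t xy) { return xy & 0xffff; }
```

With the rounded-up 1920x1088 screen, `job_count` returns 120 * 68 = 8160, the figure quoted above. Packing both coordinates into one word keeps each job message down to two words: the block position and the destination pointer.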

  • RGB to YUV conversion on Tilera TILE-Gx
    insn v1ddotpus rgb89 conv 1 128 while cnt

    Stay Forever

    How does this look with the latest compiler release, 4.1.4.152692? Compile the code with -save-temps and look at the generated assembler (.s) file. Do a double facepalm, since the output looks really, really terrible. There's probably numerous easy ways to make it look better. I did it the hard way and wrote a nifty awk script for reformatting the code. Feed the script the code and look at the output from the inner loop part:

    branch into L4 L9 addi r7 r7 16 L4 addi r9 r0 16 addi r8 r0 24 ld r10 r0 addi r11 r0 8 ld r12 r9 addi r14 r0 32 ld r13 r8 v1ddotpu r27 r10 r6 ld r11 r11 v1ddotpu r29 r12 r6 addi r9 r0 40 addi r15 r0 48 ld r14 r14 addi r8 r0 56 v1ddotpu r16 r13 r6 v1ddotpu r17 r11 r6 ld r9 r9 ld r15 r15 v1ddotpus r30 r12 r5 ld r8 r8 v1ddotpus r25 r11 r5 v1ddotpus r28 r13 r5 addi r24 r1 8 v1ddotpus r26 r10 r5 addi r22 r3 8 shufflebytes r16 r29 r20 v4packsc r28 r28 r30 shufflebytes r17 r27 r19 v4packsc r26 r25 r26 v1ddotpus lr r8 r4 or r17 r17 r16 v1ddotpus r25 r9 r4 v2shli r28 r28 1 v1ddotpus r31 r15 r4 v1adduc r17 r17 r18 v1ddotpus r30 r14 r4 v2shli r26 r26 1 st r1 r17 v2packh r26 r28 r26 v4packsc r30 r25 r30 v1ddotpus r29 r15 r5 v1ddotpus r25 r8 r5 v4packsc lr lr r31 v1ddotpus r27 r9 r5 v2shli lr lr 1 v1ddotpu r17 r14 r6 v2shli r16 r30 1 v1ddotpus r28 r14 r5 v2packh r16 lr r16 v1ddotpu r8 r8 r6 v4packsc r25 r25 r29 v1ddotpu r15 r15 r6 v4packsc r14 r27 r28 v1ddotpu r9 r9 r6 v1addi r26 r26 128 v1ddotpus r13 r13 r4 st r2 r26 v1ddotpus r12 r12 r4 v1addi r16 r16 128 v1ddotpus r11 r11 r4 v2shli r14 r14 1 v1ddotpus r10 r10 r4 v4packsc r12 r13 r12 shufflebytes r8 r15 r20 v2shli r13 r25 1 shufflebytes r9 r17 r19 v4packsc r10 r11 r10 or r8 r9 r8 st r3 r16 addi r23 r2 8 v1adduc r8 r8 r18 v2packh r14 r13 r14 v2shli r12 r12 1 v2shli r3 r10 1 st r24 r8 v2packh r3 r12 r3 v1addi r8 r14 128 v1addi r3 r3 128 st r23 r8 cmpeq r21 r32 r7 addi r0 r0 64 st r22 r3 addi r1 r1 16 addi r2 r2 16 move r3 r7 beqzt r21 L9

    As usual, gcc does not manage to use ld_add or st_add, but otherwise it looks promising. It does appear to look further ahead than earlier versions, so shuffling blocks around is probably not necessary. Let's add ld_add/st_add:

    L11 ld add r10 r0 8 addxi r4 r4 1 ld add r11 r0 8 v1ddotpus r22 r10 r6 v1ddotpu r8 r10 r7 ld add r12 r0 8 v1ddotpus r9

    Original URL path: http://www.ignorantus.com/tilegx_rgb2yuv/ (2016-04-24)

  • YUV to RGB Conversion on Tilera TILE-Gx
        ... __insn_v2int_h(gugv, gugv);
        gugvl = __insn_v2int_l(gugv, gugv);

    Now we have all the parts needed to calculate r, g and b values for row 0:

        r0 = __insn_v2packuc(__insn_v2shrsi(__insn_v2addsc(y0h, rvh), 6), __insn_v2shrsi(__insn_v2addsc(y0l, rvl), 6));
        g0 = __insn_v2packuc(__insn_v2shrsi(__insn_v2subsc(y0h, gugvh), 6), __insn_v2shrsi(__insn_v2subsc(y0l, gugvl), 6));
        b0 = __insn_v2packuc(__insn_v2shrsi(__insn_v2addsc(y0h, buh), 6), __insn_v2shrsi(__insn_v2addsc(y0l, bul), 6));

    These results are planar, and that's not very useful. We need to shuffle it around before storing:

        zrl = __insn_v1int_l(0, r0);
        zrh = __insn_v1int_h(0, r0);
        gbl = __insn_v1int_l(g0, b0);
        gbh = __insn_v1int_h(g0, b0);
        __insn_st_add(dst0, __insn_v2int_l(zrl, gbl), 8);
        __insn_st_add(dst0, __insn_v2int_h(zrl, gbl), 8);
        __insn_st_add(dst0, __insn_v2int_l(zrh, gbh), 8);
        __insn_st_add(dst0, __insn_v2int_h(zrh, gbh), 8);

    Then we just repeat the last two steps with y1 for row 1:

        r0 = __insn_v2packuc(__insn_v2shrsi(__insn_v2addsc(y1h, rvh), 6), __insn_v2shrsi(__insn_v2addsc(y1l, rvl), 6));
        g0 = __insn_v2packuc(__insn_v2shrsi(__insn_v2subsc(y1h, gugvh), 6), __insn_v2shrsi(__insn_v2subsc(y1l, gugvl), 6));
        b0 = __insn_v2packuc(__insn_v2shrsi(__insn_v2addsc(y1h, buh), 6), __insn_v2shrsi(__insn_v2addsc(y1l, bul), 6));
        zrl = __insn_v1int_l(0, r0);
        zrh = __insn_v1int_h(0, r0);
        gbl = __insn_v1int_l(g0, b0);
        gbh = __insn_v1int_h(g0, b0);
        __insn_st_add(dst1, __insn_v2int_l(zrl, gbl), 8);
        __insn_st_add(dst1, __insn_v2int_h(zrl, gbl), 8);
        __insn_st_add(dst1, __insn_v2int_l(zrh, gbh), 8);
        __insn_st_add(dst1, __insn_v2int_h(zrh, gbh), 8);

    And we're done. Or are we? Of course not. Looking at the generated code, there's a flurry of redundant "movei rxx, 0" instructions. Instead of loading zero outside the loop, it does it before it's used in every single case. So it has to be done manually. I do that in the code below.

    Skyscrapers

    A complete routine would look something like the code below. Gcc now manages to maintain base pointers without all the usual fuzz, so that code is kept as simple as possible.

        void yuv420_to_argb8888(uint8_t *srcy, uint8_t *srcu, uint8_t *srcv, uint32_t sy, uint32_t suv,
                                int width, int height, uint32_t *rgb, uint32_t srgb)
        {
            int x, y;
            uint8_t *srcy0, *srcu0, *srcv0;
            uint64_t *dst0, *dst1;
            uint64_t r0, g0, b0;
            uint64_t y0, y1, u0, v0;
            uint64_t y0l, y0h, y1l, y1h;
            uint64_t rv, gu, gv, bu;
            uint64_t rvh, rvl, buh, bul;
            uint64_t gugv, gugvh, gugvl;
            uint64_t zrl, zrh, gbl, gbh;
            uint64_t zero;

            zero = 0;
            for (y = 0; y < height; y += 2) {
                dst0 = (uint64_t *)(rgb + y * srgb);
                dst1 = (uint64_t *)(rgb + y * srgb + srgb);
                srcy0 = srcy + y * sy;
                srcu0 = srcu + (y / 2) * suv;
                srcv0 = srcv + (y / 2) * suv;
                for (x = 0

    Original URL path: http://www.ignorantus.com/tilegx_yuv2rgb/ (2016-04-24)

  • YUV to RGB Conversion using SSE2
        u01 = _mm_sub_epi16(_mm_unpackhi_epi16(u0, u0), uvsub);
        v0 = _mm_unpacklo_epi8(v0, zero);
        v00 = _mm_sub_epi16(_mm_unpacklo_epi16(v0, v0), uvsub);
        v01 = _mm_sub_epi16(_mm_unpackhi_epi16(v0, v0), uvsub);

        // common factors on both rows
        rv00 = _mm_mullo_epi16(facrv, v00);
        rv01 = _mm_mullo_epi16(facrv, v01);
        gu00 = _mm_mullo_epi16(facgu, u00);
        gu01 = _mm_mullo_epi16(facgu, u01);
        gv00 = _mm_mullo_epi16(facgv, v00);
        gv01 = _mm_mullo_epi16(facgv, v01);
        bu00 = _mm_mullo_epi16(facbu, u00);
        bu01 = _mm_mullo_epi16(facbu, u01);

    Now it's trivial to calculate the r, g, b planar values by summing things together as specified and shifting down by 6 (the multiplier we used on the factors):

        r00 = _mm_srai_epi16(_mm_add_epi16(y00r0, rv00), 6);
        r01 = _mm_srai_epi16(_mm_add_epi16(y01r0, rv01), 6);
        g00 = _mm_srai_epi16(_mm_sub_epi16(_mm_sub_epi16(y00r0, gu00), gv00), 6);
        g01 = _mm_srai_epi16(_mm_sub_epi16(_mm_sub_epi16(y01r0, gu01), gv01), 6);
        b00 = _mm_srai_epi16(_mm_add_epi16(y00r0, bu00), 6);
        b01 = _mm_srai_epi16(_mm_add_epi16(y01r0, bu01), 6);

    The remaining challenge is saturating and packing the results into chunky pixels efficiently. Luckily, we have just the instructions for the job:

        r00 = _mm_packus_epi16(r00, r01);        // rrrr.. saturated
        g00 = _mm_packus_epi16(g00, g01);        // gggg.. saturated
        b00 = _mm_packus_epi16(b00, b01);        // bbbb.. saturated

        r01 = _mm_unpacklo_epi8(r00, zero);      // 0r0r..
        gbgb = _mm_unpacklo_epi8(b00, g00);      // gbgb..
        rgb0123 = _mm_unpacklo_epi16(gbgb, r01); // 0rgb0rgb..
        rgb4567 = _mm_unpackhi_epi16(gbgb, r01); // 0rgb0rgb..

        r01 = _mm_unpackhi_epi8(r00, zero);
        gbgb = _mm_unpackhi_epi8(b00, g00);
        rgb89ab = _mm_unpacklo_epi16(gbgb, r01);
        rgbcdef = _mm_unpackhi_epi16(gbgb, r01);

    We're just about done, just store the finished pixels first:

        _mm_store_si128(dstrgb128r0++, rgb0123);
        _mm_store_si128(dstrgb128r0++, rgb4567);
        _mm_store_si128(dstrgb128r0++, rgb89ab);
        _mm_store_si128(dstrgb128r0++, rgbcdef);

    That concludes the work necessary for row 0. Repeat the last three steps for row 1, replacing the y values and target pointer, and we're done.

    Mystery Maze

    A complete routine would look something like this. You might want to put some more work into calculating the strides. The code is fast enough for all practical purposes I have.

        void yuv420_to_argb8888(uint8_t *yp, uint8_t *up, uint8_t *vp, uint32_t sy, uint32_t suv,
                                int width, int height, uint32_t *rgb, uint32_t srgb)
        {
            __m128i y0r0, y0r1, u0, v0;
            __m128i y00r0, y01r0, y00r1, y01r1;
            __m128i u00, u01, v00, v01;
            __m128i rv00, rv01, gu00, gu01, gv00, gv01, bu00, bu01;
            __m128i r00, r01, g00, g01, b00, b01;
            __m128i rgb0123, rgb4567, rgb89ab, rgbcdef;
            __m128i gbgb;
            __m128i ysub, uvsub;
            __m128i zero, facy, facrv, facgu, facgv, facbu;
            __m128i *srcy128r0, *srcy128r1;
            __m128i *dstrgb128r0, *dstrgb128r1;
            __m64 *srcu64, *srcv64;
            int x, y;

            ysub = _mm_set1_epi32(0x00100010);
            uvsub = _mm_set1_epi32(0x00800080);
            facy = _mm_set1_epi32(0x004a004a);
            facrv = _mm_set1_epi32(0x00660066);
            facgu = _mm_set1_epi32(0x00190019);
            facgv = _mm_set1_epi32(0x00340034);
            facbu = _mm_set1_epi32(0x00810081);

    Original URL path: http://www.ignorantus.com/yuv2rgb_sse2/ (2016-04-24)
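For checking a vectorized converter like the one excerpted above, a one-pixel scalar reference is handy. The sketch below uses the constants visible in the excerpt (ysub 0x10, uvsub 0x80, facy 0x4a, facrv 0x66, facgu 0x19, facgv 0x34, facbu 0x81, scaled by 64). It assumes the luma term is (y - ysub) * facy, which the truncated excerpt does not show, and the function names `clamp_u8` and `yuv_to_rgb_ref` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate to 0..255, the job _mm_packus_epi16 does per lane. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Scalar reference for one pixel, using fixed-point factors scaled by 64. */
static void yuv_to_rgb_ref(uint8_t y, uint8_t u, uint8_t v,
                           uint8_t *r, uint8_t *g, uint8_t *b)
{
    int yy = 74 * ((int)y - 16); /* assumed: facy = 0x4a applied after ysub = 0x10 */
    int uu = (int)u - 128;       /* uvsub = 0x80 */
    int vv = (int)v - 128;

    assert(r && g && b);
    /* Division by 64 stands in for the arithmetic shift by 6: after clamping
       to 0..255 both produce the same byte, and division is portable for
       negative intermediates. */
    *r = clamp_u8((yy + 102 * vv) / 64);           /* facrv = 0x66 */
    *g = clamp_u8((yy - 25 * uu - 52 * vv) / 64);  /* facgu = 0x19, facgv = 0x34 */
    *b = clamp_u8((yy + 129 * uu) / 64);           /* facbu = 0x81 */
}
```

With only 6 fractional bits, video white (Y = 235, U = V = 128) comes out as 253 rather than 255. That small loss is the price of keeping every intermediate inside a signed 16-bit lane for in-range video levels.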

  • A look at Halide's SSE2 3x3 Box Filter
    epi16 mm add epi16 r00 r10 r20 one third r00 r10 r10 r20 rown width dst width 3

    Now we got a couple of code sets to compare. This gives the following instruction count:

        Box filter 3x3 instruction count, 8192*8192 image

                 Halide      Vertical
        mul      17301504    8390656
        add      34603008    16781312
        loadu    17825792    16781312
        loada    34078720    8390656
        store    17301504    8388608

    That's pretty sweet. Unfortunately, this goes all the way to hell when measuring execution time. It's 4 5 times slower on the i7-950. Crap. Let's try the other extreme instead.

    Horizontal Cool

    The other extreme is of course a fully horizontal routine, where we read three lines in parallel and completely ignore the overlap. Actually, I wrote this version first due to earlier experiences in the yuv converter, but never mind.

        void fast_blur_horiz(const uint16_t *in, uint16_t *blurred, int width, int height)
        {
            int x, y;
            __m128i one_third;
            __m128i *dst;

            one_third = _mm_set1_epi16(21846);
            dst = (__m128i *)blurred;
            for (y = 0; y < height; y++) {
                const uint16_t *row0, *rowp, *rown;
                row0 = in + y * width;
                rowp = row0 - width;
                rown = row0 + width;
                for (x = 0; x < width; x += 8) {
                    __m128i s0, s1, s2;
                    __m128i r0, r1, r2;
                    s0 = _mm_loadu_si128((__m128i *)(row0 - 1));
                    s1 = _mm_loadu_si128((__m128i *)(row0 + 1));
                    s2 = _mm_load_si128((__m128i *)row0);
                    r0 = _mm_mulhi_epi16(_mm_add_epi16(_mm_add_epi16(s0, s1), s2), one_third);
                    s0 = _mm_loadu_si128((__m128i *)(rowp - 1));
                    s1 = _mm_loadu_si128((__m128i *)(rowp + 1));
                    s2 = _mm_load_si128((__m128i *)rowp);
                    r1 = _mm_mulhi_epi16(_mm_add_epi16(_mm_add_epi16(s0, s1), s2), one_third);
                    s0 = _mm_loadu_si128((__m128i *)(rown - 1));
                    s1 = _mm_loadu_si128((__m128i *)(rown + 1));
                    s2 = _mm_load_si128((__m128i *)rown);
                    r2 = _mm_mulhi_epi16(_mm_add_epi16(_mm_add_epi16(s0, s1), s2), one_third);
                    _mm_store_si128(dst++, _mm_mulhi_epi16(_mm_add_epi16(_mm_add_epi16(r0, r1), r2), one_third));
                    row0 += 8;
                    rowp += 8;
                    rown += 8;
                }
            }
        }

    That's pretty and readable, but the instruction count is a complete disaster:

        Box filter 3x3 instruction count, 8192*8192 image

                 Halide      Horiz       Vertical
        mul      17301504    33554432    8390656
        add      34603008    67108864    16781312
        loadu    17825792    50331648    16781312
        loada    34078720    25165824    8390656
        store    17301504    8388608     8388608

    Fortunately, it's a complete win when we consider execution times. It's about 31% faster than the original code on the i7-950. Neat.

    Mental Overdrive

    Can we do any better? Actually, yes. We can read one more line and have two destinations to make it 36% faster:

        void fast_blur_horiz2d(const uint16_t *in, uint16_t *blurred, int width, int height)
        {
            int x, y;
            __m128i one_third;
            __m128i *dst0, *dst1;

            one_third = _mm_set1_epi16(21846);
            dst0 = (__m128i *)blurred;
            dst1 = (__m128i *)(blurred + width);
            for (y = 0; y < height; y += 2) {
                const uint16_t *row0, *row1, *row2, *row3;
                row1 = in + y * width;
                row0 = row1 - width;
                row2 = row1 + width;
                row3 = row2 + width;
                for (x = 0; x < width; x += 8) {
                    __m128i s00, s01, s02

    Original URL path: http://www.ignorantus.com/box_sse2/ (2016-04-24)
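The box filter excerpt divides by three with a multiply-high by the constant 21846, which is 65536/3 rounded up. The scalar model below imitates one 16-bit lane of `_mm_mulhi_epi16` so the trick can be checked without SSE2 hardware. `mulhi16`, `div3` and `div3_exact_for_nonnegative` are hypothetical helper names for this sketch, and only non-negative inputs are considered.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of _mm_mulhi_epi16: the high 16 bits of a signed
   16x16-bit product. Only used with non-negative operands here, so the
   right shift is well defined. */
static int16_t mulhi16(int16_t a, int16_t b)
{
    assert(a >= 0 && b >= 0);
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}

/* x / 3 via multiplication by 21846 = ceil(65536 / 3). */
static int16_t div3(int16_t x)
{
    return mulhi16(x, 21846);
}

/* Returns 1 if div3 equals integer division by 3 for every value 0..32767. */
static int div3_exact_for_nonnegative(void)
{
    for (int32_t x = 0; x <= 32767; x++) {
        if (div3((int16_t)x) != x / 3)
            return 0;
    }
    return 1;
}
```

The rounding error of the constant is 2/3 of a unit in 65536, which stays too small to flip a result for any input below 32768. So within the positive signed 16-bit range the multiply gives the same answer as a true division by three.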


