# IBM

# C/C++ Language Extensions for Cell Broadband Engine Architecture

Version 2.6

**CBEA JSRE Series** 

Cell Broadband Engine Architecture Joint Software Reference Environment Series



© Copyright International Business Machines Corporation, Sony Computer Entertainment Incorporated, Toshiba Corporation 2002 - 2008

All Rights Reserved
Printed in the United States of America August 2008

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.

IBM, IBM Logo, ibm.com, PowerPC, PowerPC Architecture

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.

Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

IBM Systems and Technology Group, 2070 Route 52, Bldg. 330, Hopewell Junction, NY 12533-6351

The IBM home page can be found at ibm.com

The IBM semiconductor solutions home page can be found at ibm.com/chips

August 25, 2008



# **Table of Contents**

| Section | Page |
|---|---|
| List of Tables | ix |
| List of Figures | xiii |
| About This Document | xv |
| Audience | xv |
| Version History | xv |
| Related Documentation | xx |
| Conventions Used in This Document | xxi |
| 1. Data Types and Programming Directives | 1 |
| 1.1. Data Types | 1 |
| 1.1.1. Fundamental Data Types | 1 |
| 1.1.2. Mapping of PPU Data Types to SPU Data Types | 1 |
| 1.1.3. Mapping of SPU Data Types to PPU Data Types | 2 |
| 1.2. Header Files | 2 |
| 1.2.1. Header File Contents | 2 |
| 1.2.2. Single Token Typedefs | 2 |
| 1.3. Alignment | 3 |
| 1.3.1. Default Data Type Alignments | 3 |
| 1.3.2. `__align_hint` | 3 |
| 1.4. Operating on Vector Types | 4 |
| 1.4.1. sizeof() Operator | 4 |
| 1.4.2. Assignment Operator | 4 |
| 1.4.3. Address Operator | 4 |
| 1.4.4. Pointer Arithmetic and Pointer Dereferencing | 4 |
| 1.4.5. Type Casting | 5 |
| 1.4.6. Vector Literals | 5 |
| 1.5. Restrict Type Qualifier | 7 |
| 1.6. SPU Programmer Directed Branch Prediction | 7 |
| 1.7. Inline Assembly | 8 |
| 1.8. Target Definitions | 8 |
| 2. SPU Low-Level Specific and Generic Intrinsics | 9 |
| 2.1. Specific Intrinsics | 9 |
| 2.1.1. Specific Casting Intrinsics | 12 |
| 2.2. Generic Intrinsics and Built-ins | 13 |
| 2.2.1. Mapping Intrinsics with Scalar Operands | 13 |
| 2.2.2. Implicit Conversion of Arguments of Intrinsics | 14 |
| 2.2.3. Notations and Conventions | 14 |
| 2.3. Constant Formation Intrinsics | 15 |
| spu_splats: Splat Scalar to Vector | 15 |
| 2.4. Conversion Intrinsics | 16 |
| spu_convtf: Convert Integer Vector to Vector Float | 16 |
| spu_convts: Convert Vector Float to Signed Integer Vector | 16 |
| spu_convtu: Convert Vector Float to Unsigned Integer Vector | 16 |
| spu_extend: Extend Vector | 17 |
| spu_roundtf: Round Vector Double to Vector Float | 17 |
| 2.5. Arithmetic Intrinsics | 17 |
| spu_add: Vector Add | 17 |
| spu_addx: Vector Add Extended | 18 |
| spu_genb: Vector Generate Borrow | 18 |
| spu_genbx: Vector Generate Borrow Extended | 18 |
| spu_genc: Vector Generate Carry | 19 |
| spu_gencx: Vector Generate Carry Extended | 19 |
| spu_madd: Vector Multiply and Add | 19 |
| spu_mhhadd: Vector Multiply High High and Add | 19 |
| spu_msub: Vector Multiply and Subtract | 19 |
| spu_mul: Vector Multiply                                                        | 20 |
| spu_mulh: Vector Multiply High                                                  | 20 |
| spu_mule: Vector Multiply Even                                                  | 21 |
| spu_mulo: Vector Multiply Odd                                                   | 21 |
| spu_mulsr: Vector Multiply and Shift Right                                      | 21 |
| spu_nmadd: Negative Vector Multiply and Add                                     | 22 |
| spu_nmsub: Negative Vector Multiply and Subtract                                | 22 |
| spu_re: Vector Floating-Point Reciprocal Estimate                               | 22 |
| spu_rsqrte: Vector Floating-Point Reciprocal Square Root Estimate               | 22 |
| spu_sub: Vector Subtract                                                        | 23 |
| spu_subx: Vector Subtract Extended                                              | 23 |
| 2.6. Byte Operation Intrinsics                                                  | 24 |
| spu_absd: Vector Absolute Difference                                            | 24 |
| spu_avg: Average of Two Vectors                                                 | 24 |
| spu_sumb: Sum Bytes into Shorts                                                 | 24 |
| 2.7. Compare, Branch and Halt Intrinsics                                        | 24 |
| spu_bisled: Branch Indirect and Set Link If External Data                       | 24 |
| spu_cmpabseq: Vector Compare Absolute Equal                                     | 25 |
| spu_cmpabsgt: Vector Compare Absolute Greater Than                              | 25 |
| spu_cmpeq: Vector Compare Equal                                                 | 25 |
| spu_cmpgt: Vector Compare Greater Than                                          | 27 |
| spu_hcmpeq: Halt If Compare Equal                                               | 28 |
| spu_hcmpgt: Halt If Compare Greater Than                                        | 28 |
| spu_testsv: Vector Test Special Value                                           | 28 |
| 2.8. Bits and Mask Intrinsics                                                   | 29 |
| spu_cntb: Vector Count Ones for Bytes                                           | 29 |
| spu_cntlz: Vector Count Leading Zeros                                           | 29 |
| spu_gather: Gather Bits from Elements                                           | 29 |
| spu_maskb: Form Select Byte Mask                                                | 30 |
| spu_maskh: Form Select Halfword Mask                                            | 30 |
| spu_maskw: Form Select Word Mask                                                | 31 |
| spu_sel: Select Bits                                                            | 31 |
| spu_shuffle: Shuffle Two Vectors of Bytes                                       | 32 |
| 2.9. Logical Intrinsics                                                         | 33 |
| spu_and: Vector Bit-Wise AND                                                    | 33 |
| spu_andc: Vector Bit-Wise AND with Complement                                   | 34 |
| spu_eqv: Vector Bit-Wise Equivalent                                             | 34 |
| spu_nand: Vector Bit-Wise Complement of AND                                     | 35 |
| spu_nor: Vector Bit-Wise Complement of OR                                       | 35 |
| spu_or: Vector Bit-Wise OR                                                      | 36 |
| spu_orc: Vector Bit-Wise OR with Complement                                     | 37 |
| spu_orx: OR Word Across                                                         | 37 |
| spu_xor: Vector Bit-Wise Exclusive OR                                           | 37 |
| 2.10. Shift and Rotate Intrinsics                                               | 38 |
| spu_rl: Vector Rotate Left by Bits                                              | 38 |
| spu_rlmask: Vector Rotate Left and Mask by Bits                                 | 39 |
| spu_rlmaska: Vector Rotate Left and Mask Algebraic by Bits                      | 40 |
| spu_rlmaskqw: Quadword Rotate Left and Mask by Bits                             | 41 |
| spu_rlmaskqwbyte: Quadword Rotate Left and Mask by Bytes                        | 42 |
| spu_rlmaskqwbytebc: Quadword Rotate Left and Mask by Bytes from Bit Shift Count | 42 |
| spu_rlqw: Quadword Rotate Left by Bits                                          | 43 |
| spu_rlqwbyte: Quadword Rotate Left by Bytes                                     | 44 |
| spu_rlqwbytebc: Quadword Rotate Left by Bytes from Bit Shift Count              | 45 |
| spu_sl: Vector Shift Left by Bits                                               | 45 |
| spu_slqw: Quadword Shift Left by Bits                                           | 46 |
| spu_slqwbyte: Quadword Shift Left by Bytes                                      | 46 |
| spu_slqwbytebc: Quadword Shift Left by Bytes from Bit Shift Count               | 47 |
| spu_sr: Vector Shift Right by Bits                                              | 48 |
| spu_sra: Vector Shift Right Algebraic by Bits                                   | 48 |
| spu_srqw: Quadword Shift Right by Bits                                          | 49 |
| spu_srqwbyte: Quadword Shift Right by Bytes                                     | 51 |
| spu_srqwbytebc: Quadword Shift Right by Bytes from Bit Shift Count              | 52 |
| 2.11. Control Intrinsics                                                                 | 52 |
| spu_idisable: Disable Interrupts                                                         | 52 |
| spu_ienable: Enable Interrupts                                                           | 53 |
| spu_mffpscr: Move from Floating-Point Status and Control Register                        | 53 |
| spu_mfspr: Move from Special Purpose Register                                            | 53 |
| spu_mtfpscr: Move to Floating-Point Status and Control Register                          | 54 |
| spu_mtspr: Move to Special Purpose Register                                              | 54 |
| spu_dsync: Synchronize Data                                                              | 54 |
| spu_stop: Stop and Signal                                                                | 54 |
| spu_sync: Synchronize                                                                    | 55 |
| 2.12. Channel Control Intrinsics                                                         | 55 |
| spu_readch: Read Word Channel                                                            | 56 |
| spu_readchqw: Read Quadword Channel                                                      | 56 |
| spu_readchcnt: Read Channel Count                                                        | 56 |
| spu_writech: Write Word Channel                                                          | 57 |
| spu_writechqw: Write Quadword Channel                                                    | 57 |
| 2.13. Scalar Intrinsics                                                                  | 57 |
| spu_extract: Extract Vector Element from Vector                                          | 57 |
| spu_insert: Insert Scalar into Specified Vector Element                                  | 58 |
| spu_promote: Promote Scalar to Vector                                                    | 59 |
| 3. Composite Intrinsics                                                                  | 61 |
| spu_mfcdma32: Initiate DMA to/from 32-Bit Effective Address                              | 61 |
| spu_mfcdma64: Initiate DMA to/from 64-Bit Effective Address                              | 61 |
| spu_mfcstat: Read MFC Tag Status                                                         | 62 |
| 4. Programming Support for MFC Input and Output                                          | 63 |
| 4.1. Structures                                                                          | 63 |
| mfc_list_element: DMA List Element for MFC List DMA                                      | 63 |
| 4.2. MFC Macro Definitions                                                               | 63 |
| 4.3. Effective Address Utilities                                                         | 64 |
| mfc_ea2h: Extract Higher 32 Bits from Effective Address                                  | 64 |
| mfc_ea2l: Extract Lower 32 Bits from Effective Address                                   | 64 |
| mfc_hl2ea: Concatenate Higher 32 Bits and Lower 32 Bits                                  | 64 |
| mfc_ceil128: Round Up Value to Next Multiple of 128                                      | 64 |
| 4.4. MFC Tag Manager                                                                     | 64 |
| mfc_tag_reserve: Reserve a Tag for Exclusive Use                                         | 65 |
| mfc_tag_release: Release a Tag from Exclusive Use                                        | 65 |
| mfc_multi_tag_reserve: Reserve a Group of Tags for Exclusive Use                         | 65 |
| mfc_multi_tag_release: Release a Group of Tags from Exclusive Use                        | 65 |
| 4.5. MFC DMA Commands                                                                    | 65 |
| mfc_put: Move Data from Local Storage to Effective Address                               | 66 |
| mfc_putb: Move Data from Local Storage to Effective Address with Barrier                 | 66 |
| mfc_putf: Move Data from Local Storage to Effective Address with Fence                   | 66 |
| mfc_get: Move Data from Effective Address to Local Storage                               | 66 |
| mfc_getf: Move Data from Effective Address to Local Storage with Fence                   | 67 |
| mfc_getb: Move Data from Effective Address to Local Storage with Barrier                 | 67 |
| 4.6. MFC List DMA Commands                                                               | 67 |
| mfc_putl: Move Data from Local Storage to Effective Address Using MFC List               | 68 |
| mfc_putlb: Move Data from Local Storage to Effective Address Using MFC List with Barrier | 68 |
| mfc_putlf: Move Data from Local Storage to Effective Address Using MFC List with Fence   | 68 |
| mfc_getl: Move Data from Effective Address to Local Storage Using MFC List               | 68 |
| mfc_getlb: Move Data from Effective Address to Local Storage Using MFC List with Barrier | 69 |
| mfc_getlf: Move Data from Effective Address to Local Storage Using MFC List with Fence   | 69 |
| 4.7. MFC Atomic Update Commands                                                          | 69 |
| mfc_getllar: Get Lock Line and Create Reservation                                        | 69 |
| mfc_putllc: Put Lock Line If Reservation for Effective Address Exists                    | 70 |
| mfc_putlluc: Put Lock Line Unconditional                                                 | 70 |
| mfc_putqlluc: Put Queued Lock Line Unconditional                                         | 70 |
| 4.8. MFC Synchronization Commands                                                        | 71 |
| mfc_sndsig: Send Signal                                                                  | 71 |
| mfc_sndsigb: Send Signal with Barrier                                                    | 71 |
| mfc_sndsigf: Send Signal with Fence                                                         | 71      |
| mfc_barrier: Enqueue mfc_barrier Command into DMA Queue or Stall When Queue Is Full         |         |
| mfc_eieio: Enqueue mfc_eieio Command into DMA Queue or Stall When Queue Is Full             | 72      |
| mfc_sync: Enqueue mfc_sync Command into DMA Queue or Stall When Queue Is Full               | 72      |
| 4.9. MFC SL1 Storage Control Commands                                                       | 72      |
| mfc_sdcrt: SL1 Data Cache Range Touch                                                       | 73      |
| mfc_sdcrtst: SL1 Data Cache Range Touch for Store                                           | 73      |
| mfc_sdcrz: SL1 Data Cache Range Set to Zero                                                 | 73      |
| mfc_sdcrst: SL1 Data Cache Range Store                                                      | 73      |
| mfc_sdcrf: SL1 Data Cache Range Flush                                                       | 74      |
| 4.10. MFC DMA Status                                                                        | 74      |
| mfc_stat_cmd_queue: Check the Number of Available Entries in the MFC DMA Queue              | 74      |
| mfc_write_tag_mask: Set Tag Mask to Select MFC Tag Groups to Be Included in Query Operation | 75      |
| mfc_read_tag_mask: Read Tag Mask Indicating MFC Tag Groups to Be Included in Query Operation | 75      |
| mfc_write_tag_update: Request That Tag Status Be Updated                                    | 75      |
| mfc_write_tag_update_immediate: Request That Tag Status Be Immediately Updated              | 75      |
| mfc_write_tag_update_any: Request That Tag Status Be Updated for Any Enabled Comp… with No Outstanding Operation | 75 |
| mfc_write_tag_update_all: Request That Tag Status Be Updated When All Enabled Tag C… Have No Outstanding Operation | 76 |
| mfc_stat_tag_update: Check Availability of Tag Status Update Request Channel                | 76      |
| mfc_read_tag_status: Wait for an Updated Tag Status                                         | 76      |
| mfc_read_tag_status_immediate: Wait for the Updated Status of Any Enabled Tag Group         |         |
| mfc_read_tag_status_any: Wait for No Outstanding Operation of Any Enabled Tag Group         |         |
| mfc_read_tag_status_all: Wait for No Outstanding Operation of All Enabled Tag Groups        | 76      |
| mfc_stat_tag_status: Check Availability of MFC_RdTagStat Channel                            | 77      |
| mfc_read_list_stall_status: Read List DMA Stall-and-Notify Status                           | 77      |
| mfc_stat_list_stall_status: Check Availability of List DMA Stall-and-Notify Status          | 77      |
| mfc_write_list_stall_ack: Acknowledge Tag Group Containing Stalled DMA List Command         |         |
| mfc_read_atomic_status: Read Atomic Command Status                                          | 77      |
| mfc_stat_atomic_status: Check Availability of Atomic Command Status                         | 78      |
| 4.11. MFC Multisource Synchronization Request                                               | 78      |
| mfc_write_multi_src_sync_request: Request Multisource Synchronization                       | 78      |
| mfc_stat_multi_src_sync_request: Check the Status of Multisource Synchronization            | 78      |
| 4.12. SPU Signal Notification                                                               | 78      |
| spu_read_signal1: Atomically Read and Clear Signal Notification 1 Channel                   | 79      |
| spu_stat_signal1: Check If Pending Signals Exist on Signal Notification 1 Channel           | 79      |
| spu_read_signal2: Atomically Read and Clear Signal Notification 2 Channel                   | 79      |
| spu_stat_signal2: Check If Pending Signals Exist on Signal Notification 2 Channel           | 79      |
| 4.13. SPU Mailboxes                                                                         | 79      |
| spu_read_in_mbox: Read Next Data Entry in SPU Inbound Mailbox                               | 79      |
| spu_stat_in_mbox: Get the Number of Data Entries in SPU Inbound Mailbox                     | 80      |
| spu_write_out_mbox: Send Data to SPU Outbound Mailbox                                       | 80      |
| spu_stat_out_mbox: Get Available Capacity of SPU Outbound Mailbox                           | 80      |
| spu_write_out_intr_mbox: Send Data to SPU Outbound Interrupt Mailbox                        | 80      |
| spu_stat_out_intr_mbox: Get Available Capacity of SPU Outbound Interrupt Mailbox            | 80      |
| 4.14. SPU Decrementer                                                                       | 80      |
| spu_read_decrementer: Read Current Value of Decrementer                                     | 80      |
| spu_write_decrementer: Load a Value to Decrementer                                          | 81      |
| 4.15. SPU Event                                                                             | 81      |
| spu_read_event_status: Read Event Status or Stall Until Status Is Available                 | 81      |
| spu_stat_event_status: Check Availability of Event Status                                   | 81      |
| spu_write_event_mask: Select Events to Be Monitored by Event Status                         | 82      |
| spu_write_event_ack: Acknowledge Events                                                     | 82      |
| spu_read_event_mask: Read Event Status Mask                                                 | 82      |
| 4.16. SPU State Management                                                                  | 82      |
| spu_read_machine_status: Read Current SPU Machine Status                                    | 82      |
| spu_write_srr0: Write to SPU SRR0                                                           | 82      |
| spu_read_srr0: Read SPU SRR0                                                                | 82      |
| 4.17. Interrupt-Safe Critical Sections                                                                          | 83         |
| mfc_begin_critical_section: Begin Interrupt-Safe Critical Section                                               | 83         |
| mfc_end_critical_section: End Interrupt-Safe Critical Section                                                   | 83         |
| 5. Mapping SPU and PPU Vector Multimedia Extension Intrinsics                                                   | 85         |
| 5.1. Mapping of PPU VMX Intrinsics to SPU Intrinsics                                                            | 85         |
| 5.1.1. One-to-One Mapped Intrinsics                                                                             | 85         |
| 5.1.2. PPU VMX Intrinsics That Are Difficult to Map to SPU Intrinsics                                           | 86         |
| 5.2. Mapping of SPU Intrinsics to PPU VMX Intrinsics                                                            | 86         |
| 5.2.1. One-to-One Mapped Intrinsics                                                                             | 86         |
| 5.2.2. SPU Intrinsics That Are Difficult to Map to PPU VMX Intrinsics                                           | 87         |
| 6. PPU-Specific Intrinsics                                                                                      | 89         |
| cctph: Change Thread Priority to High                                                                           | 89         |
| cctpl: Change Thread Priority to Low                                                                            | 89         |
| cctpm: Change Thread Priority to Medium                                                                         | 89         |
| cntlzd: Count Leading Doubleword Zeros                                                                          | 90         |
| cntlzw: Count Leading Word Zeros                                                                                | 90         |
| db10cyc: Delay 10 Cycles at Dispatch                                                                            | 90         |
| db12cyc: Delay 12 Cycles at Dispatch                                                                            | 90         |
| db16cyc: Delay 16 Cycles at Dispatch                                                                            | 91         |
| db8cyc: Delay 8 Cycles at Dispatch                                                                              | 91         |
| dcbf: Data Cache Block Flush                                                                                    | 91         |
| dcbst: Data Cache Block Store                                                                                   | 91         |
| dcbt: Data Cache Block Touch                                                                                    | 92         |
| dcbt_TH1000: Set Up Streaming Data                                                                              | 92         |
| dcbt_TH1010: Start or Stop Streaming Data                                                                       | 92         |
| dcbtst: Data Cache Block Touch for Store                                                                        | 93         |
| dcbz: Data Cache Block Set to Zero                                                                              | 93         |
| eieio: Enforce In-Order Execution of I/O                                                                        | 93         |
| fabs: Double Absolute Value                                                                                     | 94         |
| fabsf: Float Absolute Value                                                                                     | 94         |
| fcfid: Convert Doubleword to Double                                                                             | 94         |
| fctid: Convert Double to Doubleword                                                                             | 94         |
| fctidz: Convert Double to Doubleword with Round Towards Zero                                                    | 95         |
| fctiw: Convert Double to Word                                                                                   | 95         |
| fctiwz: Convert Double to Word with Round Towards Zero                                                          | 95         |
| fmadd: Double Fused Multiply and Add                                                                            | 95         |
| fmadds: Float Fused Multiply and Add                                                                            | 96         |
| fmsub: Double Fused Multiply and Subtract                                                                       | 96         |
| fmsubs: Float Fused Multiply and Subtract                                                                       | 96         |
| fmul: Double Multiply                                                                                           | 96         |
| fmuls: Float Multiply                                                                                           | 97         |
| fnabs: Double Negative                                                                                          | 97         |
| fnabsf: Float Negative                                                                                          | 97         |
| fnmadd: Double Fused Negative Multiply and Add                                                                  | 97         |
| fnmadds: Float Fused Negative Multiply and Add                                                                  | 98         |
| fnmsub: Double Fused Negative Multiply and Subtract                                                             | 98         |
| fnmsubs: Float Fused Negative Multiply and Subtract                                                             | 98         |
| fres: Float Reciprocal Estimate                                                                                 | 98         |
| frsp: Round to Single Precision                                                                                 | 99         |
| frsqrte: Double Reciprocal Square Root Estimate                                                                 | 99         |
| fsel: Floating-Point Select of Double                                                                           | 99         |
| fsels: Floating-Point Select of Float                                                                           | 99         |
| fsqrt: Double Square Root                                                                                       | 100        |
| fsqrts: Float Square Root                                                                                       | 100        |
| icbi: Instruction Cache Block Invalidate                                                                        | 100        |
| isync: Instruction Sync                                                                                         | 100        |
| ldarx: Load Doubleword with Reserved                                                                            | 101        |
| ldbrx: Load Reversed Doubleword                                                                                 | 101        |
| lhbrx: Load Reversed Halfword                                                                                   | 101        |
| lwarx: Load Word with Reserved                                                                                  | 101        |
| lwbrx: Load Reversed Word                                                                                       | 102        |
| lwsync: Light Weight Sync                                                                                       | 102        |
| mffs: Move from Floating-Point Status and Control Register  | 102        |
| mfspr: Move from Special Purpose Register                   | 103        |
| mftb: Move from Time Base                                   | 103        |
| mtfsb0: Reset Bit of FPSCR                                  | 103        |
| mtfsb1: Set Bit of FPSCR                                    | 103        |
| mtfsf: Set Fields in FPSCR                                  | 104        |
| mtfsfi: Set Field of FPSCR                                  | 104        |
| mtspr: Move to Special Purpose Register                     | 104        |
| mulhd: Multiply Doubleword, High Part                       | 104        |
| mulhdu: Multiply Doubleword Unsigned, High Part             | 105        |
| mulhw: Multiply Word, High Part                             | 105        |
| mulhwu: Multiply Unsigned Word, High Part                   | 105        |
| nop: No Operation                                           | 105        |
| protected_stream_count: Set the Number of Blocks to Stream  | 105        |
| protected_stream_go: Start All Streams                      | 106        |
| protected_stream_set: Set Up a Stream                       | 106        |
| protected_stream_stop: Stop a Stream                        | 106        |
| protected_stream_stop_all: Stop All Streams                 | 106        |
| protected_unlimited_stream_set: Set Up an Unlimited Stream  | 106        |
| rldcl: Rotate Left Doubleword Then Clear Left               | 106        |
| rldcr: Rotate Left Doubleword Then Clear Right              | 107        |
| rldic: Rotate Left Doubleword Immediate Then Clear          | 107        |
| rldicl: Rotate Left Doubleword Immediate Then Clear Left    | 107        |
| rldicr: Rotate Left Doubleword Immediate Then Clear Right   | 108        |
| rldimi: Rotate Left Doubleword Immediate Then Mask Insert   | 108        |
| rlwimi: Rotate Left Word Immediate Then Mask Insert         | 108        |
| rlwinm: Rotate Left Word Immediate Then AND with Mask       | 109        |
| rlwnm: Rotate Left Word Then AND with Mask                  | 109        |
| setflm: Save and Set the FPSCR                              | 109        |
| stdbrx: Store Reversed Doubleword                           | 109        |
| stdcx: Store Doubleword Conditional                         | 109        |
| sthbrx: Store Reversed Halfword                             | 110        |
| stwbrx: Store Reversed Word                                 | 110        |
| stwcx: Store Word Conditional                               | 111        |
| sync: Sync                                                  | 111        |
| 7. PPU Vector Multimedia Extension Intrinsics               | 113        |
| vec_extract: Extract Vector Element from Vector             | 114        |
| vec_insert: Insert Scalar into Specified Vector Element     | 115        |
| vec_lvlx: Load Vector Left Indexed                          | 116        |
| vec_lvlxl: Load Vector Left Indexed Last                    | 117        |
| vec_lvrx: Load Vector Right Indexed                         | 118        |
| vec_lvrxl: Load Vector Right Indexed Last                   | 119        |
| vec_stvlx: Store Vector Left Indexed                        | 120        |
| vec_stvlxl: Store Vector Left Indexed Last                  | 121        |
| vec_stvrx: Store Vector Right Indexed                       | 122        |
| vec_stvrxl: Store Vector Right Indexed Last                 | 123        |
| vec_promote: Promote Scalar to Vector                       | 124        |
| vec_splats: Splat Scalar to Vector                          | 124        |
| 8. SPU C and C++ Standard Libraries and Language Support    | 125        |
| 8.1. Standard Libraries                                     | 125        |
| 8.1.1. C Standard Library                                   | 125        |
| 8.1.2. C++ Standard Library                                 | 128        |
| 8.2. Non-Supported Language Features                        | 129        |
| 9. Floating-Point Arithmetic on the SPU                     | 131        |
| 9.1. Properties of Floating-Point Data Type Representations | 131        |
| 9.2. Floating-Point Environment                             | 132        |
| 9.2.1. Rounding Modes                                       | 132        |
| 9.2.2. Floating-Point Exceptions                            | 132        |
| 9.2.3. Other Floating-Point Constants in math.h             | 134        |
| 9.3. Floating-Point Operations                              | 134        |
| 9.3.1. Floating-Point Conversions                           | 134        |
| 9.3.2. Overall Behavior of C Operators and Standard Library Math Functions | 135 |
| 9.3.3. Floating-Point Expression Special Cases              | 136        |
| 9.3.4. Specific Behavior of Standard Math Functions         | 137        |
| 10. Operator Overloading for Vector Data Types              | 139        |
| 10.1. Supported Types                                       | 139        |
| 10.2. Vector Subscripting                                   | 139        |
| 10.3. Unary Operators                                       | 139        |
| 10.4. Binary Operators                                      | 140        |
| 10.5. Relational Operators                                  | 140        |
| Index                                                       | 141        |

# **List of Tables**

| Table 1-1: Vector Data Types Table 1-2: Non-identical Mapping of PPU VMX Data Types to SPU Data Types Table 1-3: Non-identical Mapping of SPU Data Types to PPU VMX Data Types Table 1-4: Single Token Vector Data Types Table 1-5: Default Data Type Alignments Table 1-6: Vector Pointer Types and Matching Base Element Pointer Types Table 1-7: Vector Literal Format and Description Table 1-8: Alternate Vector Literal Format and Description Table 1-8: Alternate Vector Literal Format and Description Table 2-9: Assembly Instructions for which No Specific Intrinsic Exists Table 2-10: Specific Intrinsics Not Accessible Through Generic Intrinsics Table 2-11: Specific Casting Intrinsics Table 2-11: Specific Casting Intrinsics Table 2-12: Possible Uses of Immediate Load Instructions for Various Values of Constant Table 2-13: Splat Scalar to Vector Table 2-14: Convert Integer Vector to Vector Float Table 2-14: Convert Integer Vector to Vector Float Table 2-16: Convert Vector Float to Unsigned Integer Vector Table 2-17: Extend Vector Table 2-17: Extend Vector Table 2-19: Vector Add Table 2-19: Vector Add Extended Table 2-20: Vector Add Extended Table 2-21: Vector Generate Borrow Table 2-21: Vector Generate Carry Table 2-22: Vector Generate Carry Table 2-24: Vector Generate Carry Extended Table 2-25: Vector Multiply and Add Table 2-26: Vector Multiply and Subtract Table 2-28: Vector Multiply Sund Subtract Table 2-29: Vector Multiply Sund Subtract Table 2-30: Vector Multiply Didn Table 2-31: Vector Multiply Odd Table 2-32: Vector Multiply Odd Table 2-33: Negative Vector Multiply and Subtract Table 2-36: Vector Subtract Extended Table 2-37: Vector Subtract Extended Table 2-38: Vector Subtract Extended Table 2-39: Vector Absolute Difference Table 2-40: Average of Two Vectors | 1 2 2 2 2 3 3 5 6 6 9 10 13 b 14 15 16 16 16 17 17 17 18 18 18 19 19 20 20 20 20 21 21 21 21 22 22 22 22 22 23 23 24 24 |
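
| Table 2-41: Sum Bytes into Shorts                                                                         | 24             |
|-----------------------------------------------------------------------------------------------------------|----------------|
| Table 2-42: Branch Indirect and Set Link If External Data                                                 | 25             |
| Table 2-43: Vector Compare Absolute Equal                                                                 | 25             |
| Table 2-44: Vector Compare Absolute Greater Than                                                          | 25             |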


| Table 2-45: Vector Compare Equal                                                                          | 26             |
|-----------------------------------------------------------------------------------------------------------|----------------|
| Table 2-46: Vector Compare Greater Than                                                                   | 27             |
| Table 2-47: Halt If Compare Equal                                                                         | 28             |
| Table 2-48: Halt If Compare Greater Than                                                                  | 28             |
| Table 2-49: Vector Test Special Value                                                                     | 28             |
| Table 2-50: Special Value Bit Flag Mnemonics                                                              | 29             |
| Table 2-51: Vector Count Ones for Bytes                                                                   | 29             |
| Table 2-52: Vector Count Leading Zeros                                                                    | 29             |
| Table 2-53: Gather Bits from Elements                                                                     | 30             |
| Table 2-54: Form Select Byte Mask                                                                         | 30             |
| Table 2-55: Form Select Halfword Mask                                                                     | 30             |
| Table 2-56: Form Select Word Mask                                                                         | 31             |
| Table 2-57: Select Bits                                                                                   | 31             |
| Table 2-58: Shuffle Two Vectors of Bytes                                                                  | 32             |
| Table 2-59: Vector Bit-Wise AND                                                                           | 33             |
| Table 2-60: Vector Bit-Wise AND with Complement                                                           | 34             |
| Table 2-61: Vector Bit-Wise Equivalent                                                                    | 34             |
| Table 2-62: Vector Bit-Wise Complement of AND                                                             | 35             |
| Table 2-63: Vector Bit-Wise Complement of OR                                                              | 35             |
| Table 2-64: Vector Bit-Wise OR                                                                            | 36             |
| Table 2-65: Vector Bit-Wise OR with Complement                                                            | 37             |
| Table 2-66: OR Word Across                                                                                | 37             |
| Table 2-67: Vector Bit-Wise Exclusive OR                                                                  | 38             |
| Table 2-68: Vector Rotate Left by Bits                                                                    | 38             |
| Table 2-69: Vector Rotate Left and Mask by Bits                                                           | 39             |
| Table 2-70: Vector Rotate Left and Mask Algebraic by Bits                                                 | 40             |
| Table 2-71: Quadword Rotate Left and Mask by Bits                                                         | 41             |
| Table 2-72: Quadword Rotate Left and Mask by Bytes                                                        | 42             |
| Table 2-73: Quadword Rotate Left and Mask by Bytes from Bit Shift Count                                   | 43             |
| Table 2-74: Quadword Rotate Left by Bits                                                                  | 43             |
| Table 2-75: Quadword Rotate Left by Bytes                                                                 | 44             |
| Table 2-76: Quadword Rotate Left by Bytes from Bit Shift Count                                            | 45             |
| Table 2-77: Vector Shift Left by Bits                                                                     | 45             |
| Table 2-78: Quadword Shift Left by Bits                                                                   | 46             |
| Table 2-79: Quadword Shift Left by Bytes                                                                  | 46             |
| Table 2-80: Quadword Shift Left by Bytes from Bit Shift Count                                             | 47             |
| Table 2-81: Vector Shift Right by Bits                                                                    | 48             |
| Table 2-82: Vector Shift Right Algebraic by Bits                                                          | 49             |
| Table 2-83: Quadword Shift Right by Bits                                                                  | 49             |
| Table 2-84: Quadword Shift Right by Bytes                                                                 | 51             |
| Table 2-85: Quadword Shift Right by Bytes from Bit Shift Count                                            | 52             |
| Table 2-86: Disable Interrupts                                                                            | 52             |
| Table 2-87: Enable Interrupts                                                                             | 53             |
| Table 2-88: Move from Floating-Point Status and Control Register                                          | 53             |
| Table 2-89: Move from Special Purpose Register                                                            | 53             |
| Table 2-90: Move to Floating-Point Status and Control Register                                            | 54             |
| Table 2-91: Move to Special Purpose Register                                                              | 54             |
| Table 2-92: Synchronize Data                                                                              | 54             |
| Table 2-93: Stop and Signal                                                                               | 54             |
| Table 2-94: Synchronize                                                                                   | 55             |
| Table 2-95: SPU Channel Numbers                                                                           | 55             |
| Table 2-96: MFC Channel Numbers                                                                           | 55             |
| Table 2-97: Read Word Channel                                                                             | 56             |
| Table 2-98: Read Quadword Channel                                                                         | 56             |
| Table 2-99: Read Channel Count                                                                            | 56             |
| Table 2-100: Write Word Channel                                                                           | 57             |
| Table 2-101: Write Quadword Channel                                                                       | 57             |
| Table 2-102: Extract Vector Element from Vector                                                           | 58             |
| Table 2-103: Insert Scalar into Specified Vector Element                                                  | 59             |
| Table 2-104: Promote Scalar to Vector                                                                     | 59             |
| Table 3-105: Initiate DMA to/from 32-Bit Effective Address                                                | 61             |
| Table 3-106: Initiate DMA to/from 64-Bit Effective Address                                                | 61             |
| Table 3-107: Read MFC Tag Status                                                         | 62       |
| Table 4-108: MFC Macro Definitions                                                       | 63       |
| Table 4-109: MFC Tag Manager Mnemonics                                                   | 65       |
| Table 4-110: MFC DMA Command Mnemonics                                                   | 66       |
| Table 4-111: MFC List DMA Command Mnemonics                                              | 67       |
| Table 4-112: MFC Atomic Update Command Mnemonics                                         | 69       |
| Table 4-113: MFC Synchronization Command Mnemonics                                       | 71       |
| Table 4-114: MFC SL1 Storage Control Command Mnemonics                                   | 72       |
| Table 4-115: MFC Write Tag Update Conditions                                             | 75       |
| Table 4-116: Read Atomic Command Status or Stall Until Status Is Available               | 77       |
| Table 4-117: MFC Event Bit-Fields                                                        | 81       |
| Table 5-118: PPU VMX Intrinsics That Map One-to-One with SPU Intrinsics                  | 86       |
| Table 5-119: PPU VMX Intrinsics That Are Difficult to Map to SPU Intrinsics              | 86       |
| Table 5-120: SPU Intrinsics That Map One-to-One with PPU VMX Intrinsics                  | 87       |
| Table 5-121: SPU Intrinsics That Are Difficult to Map to PPU VMX Intrinsics              | 87       |
| Table 6-122: Change Thread Priority to High                                              | 89       |
| Table 6-123: Change Thread Priority to Low                                               | 89       |
| Table 6-124: Change Thread Priority to Medium                                            | 89       |
| Table 6-125: Count Leading Doubleword Zeros                                              | 90       |
| Table 6-126: Count Leading Word Zeros                                                    | 90       |
| Table 6-127: Delay 10 Cycles at Dispatch                                                 | 90       |
| Table 6-128: Delay 12 Cycles at Dispatch                                                 | 90       |
| Table 6-129: Delay 16 Cycles at Dispatch                                                 | 91       |
| Table 6-130: Delay 8 Cycles at Dispatch                                                  | 91       |
| Table 6-131: Data Cache Block Flush                                                      | 91       |
| Table 6-132: Data Cache Block Store                                                      | 91       |
| Table 6-133: Data Cache Block Touch                                                      | 92       |
| Table 6-134: Set Up Streaming Data                                                       | 92       |
| Table 6-135: Start or Stop Streaming Data                                                | 93       |
| Table 6-136: Data Cache Block Touch for Store                                            | 93       |
| Table 6-137: Data Cache Block Set to Zero                                                | 93       |
| Table 6-138: Enforce In-Order Execution of I/O                                           | 93       |
| Table 6-139: Double Absolute Value                                                       | 94       |
| Table 6-140: Float Absolute Value                                                        | 94       |
| Table 6-141: Convert Doubleword to Double                                                | 94       |
| Table 6-142: Convert Double to Doubleword                                                | 94       |
| Table 6-143: Convert Double to Doubleword with Round Towards Zero                        | 95       |
| Table 6-144: Convert Double to Word                                                      | 95       |
| Table 6-145: Convert Double to Word with Round Towards Zero                              | 95       |
| Table 6-146: Double Fused Multiply and Add                                               | 95       |
| Table 6-147: Float Fused Multiply and Add                                                | 96       |
| Table 6-148: Double Fused Multiply and Subtract                                          | 96       |
| Table 6-149: Float Fused Multiply and Subtract                                           | 96       |
| Table 6-150: Double Multiply                                                             | 96       |
| Table 6-151: Float Multiply                                                              | 97       |
| Table 6-152: Double Negative                                                             | 97       |
| Table 6-153: Float Negative                                                              | 97       |
| Table 6-154: Double Fused Negative Multiply and Add                                      | 97       |
| Table 6-155: Float Fused Negative Multiply and Add                                       | 98       |
| Table 6-156: Double Fused Negative Multiply and Subtract                                 | 98       |
| Table 6-157: Float Fused Negative Multiply and Subtract                                  | 98       |
| Table 6-158: Float Reciprocal Estimate                                                   | 98       |
| Table 6-159: Round to Single Precision                                                   | 99       |
| Table 6-160: Double Reciprocal Square Root Estimate                                      | 99       |
| Table 6-161: Floating-Point Select of Double                                             | 99       |
| Table 6-162: Floating-Point Select of Float                                              | 99       |
| Table 6-163: Double Square Root                                                          | 100      |
| Table 6-164: Float Square Root                                                           | 100      |
| Table 6-165: Instruction Cache Block Invalidate                                          | 100      |
| Table 6-166: Instruction Sync                                                            | 100      |
| Table 6-167: Load Doubleword with Reserved                                               | 101      |
| Table 6-168: Load Reversed Doubleword                                      | 101 |
| Table 6-169: Load Reversed Halfword                                        | 101 |
| Table 6-170: Load Word with Reserved                                       | 101 |
| Table 6-171: Load Reversed Word                                            | 102 |
| Table 6-172: Light Weight Sync                                             | 102 |
| Table 6-173: Move from Floating-Point Status and Control Register          | 102 |
| Table 6-174: Move from Special Purpose Register                            | 103 |
| Table 6-175: Move from Time Base                                           | 103 |
| Table 6-176: Reset Bit of FPSCR                                            | 103 |
| Table 6-177: Set Bit of FPSCR                                              | 103 |
| Table 6-178: Set Fields in FPSCR                                           | 104 |
| Table 6-179: Set Field of FPSCR                                            | 104 |
| Table 6-180: Move to Special Purpose Register                              | 104 |
| Table 6-181: Multiply Doubleword, High Part                                | 104 |
| Table 6-182: Multiply Double Unsigned Word, High Part                      | 105 |
| Table 6-183: Multiply Word, High Part                                      | 105 |
| Table 6-184: Multiply Unsigned Word, High Part                             | 105 |
| Table 6-185: No Operation                                                  | 105 |
| Table 6-186: Rotate Left Doubleword Then Clear Left                        | 106 |
| Table 6-187: Rotate Left Doubleword Then Clear Right                       | 107 |
| Table 6-188: Rotate Left Doubleword Immediate Then Clear                   | 107 |
| Table 6-189: Rotate Left Doubleword Immediate Then Clear Left              | 107 |
| Table 6-190: Rotate Left Doubleword Immediate Then Clear Right             | 108 |
| Table 6-191: Rotate Left Doubleword Immediate Then Mask Insert             | 108 |
| Table 6-192: Rotate Left Word Immediate Then Mask Insert                   | 108 |
| Table 6-193: Rotate Left Word Immediate Then AND with Mask                 | 109 |
| Table 6-194: Rotate Left Word Then AND with Mask                           | 109 |
| Table 6-195: Save and Set the FPSCR                                        | 109 |
| Table 6-196: Store Reversed Doubleword                                     | 109 |
| Table 6-197: Store Doubleword Conditional                                  | 110 |
| Table 6-198: Store Reversed Halfword                                       | 110 |
| Table 6-199: Store Reversed Word                                           | 110 |
| Table 6-200: Store Word Conditional                                        | 111 |
| Table 6-201: Sync                                                          | 111 |
| Table 7-202: Stream Control Operators That Have Been Deprecated on the PPU | 113 |
| Table 7-203: Extract Vector Element from Vector                            | 114 |
| Table 7-204: Insert Scalar into Specified Vector Element                   | 115 |
| Table 7-205: Load Vector Left Indexed                                      | 116 |
| Table 7-206: Load Vector Left Indexed Last                                 | 117 |
| Table 7-207: Load Vector Right Indexed                                     | 118 |
| Table 7-208: Load Vector Right Indexed Last                                | 119 |
| Table 7-209: Store Vector Left Indexed                                     | 120 |
| Table 7-210: Store Vector Left Indexed Last                                | 121 |
| Table 7-211: Store Vector Right Indexed                                    | 122 |
| Table 7-212: Store Vector Right Indexed Last                               | 123 |
| Table 7-213: Promote Scalar to Vector                                      | 124 |
| Table 7-214: Splat Scalar to Vector                                        | 124 |
| Table 8-215: C Library Header Files                                        | 125 |
| Table 8-216: Fastest Minimum-Width Integer Types                           | 126 |
| Table 8-217: Vector Formats                                                | 127 |
| Table 8-218: C++ Library Header Files                                      | 128 |
| Table 8-219: New and Traditional C++ Library Header Files                  | 129 |
| Table 9-220: Values for Floating-Point Type Properties                     | 131 |
| Table 9-221: Rounding Mode for Two Bits of FLT_ROUNDS                      | 132 |
| Table 9-222: Macros for Double Precision Rounding Modes                    | 132 |
| Table 9-223: Macros for Single Precision Floating-Point Exceptions         | 133 |
| Table 9-224: Macros for Double Precision Floating-Point Exceptions         | 133 |
| Table 9-225: Floating-Point Constants                                      | 134 |
| Table 10-226: Integer Vector Types                                         | 139 |
| Table 10-227: Floating-Point Vector Types                                  | 139 |
| Table 10-228: Valid Types for Specified Unary Operators                    | 139 |
| Table 10-229: Valid Types for Specified Binary Operators                   | 140 |
| Table 10-230: Valid Types for Specified Relational Operators               | 140 |

| List of Figures                                               |     |
|---------------------------------------------------------------|-----|
| Figure 1-1: Big-Endian Byte/Element Ordering for Vector Types | xxi |
| Figure 2-2: Shuffle Pattern                                   | 32  |





## **About This Document**

This document describes language extension specifications that allow software developers to access hardware features that are not easily accessible from a high-level language, such as C or C++, to obtain the best performance from a Synergistic Processor Unit (SPU) and a PowerPC<sup>®</sup> Processor Unit (PPU) of the Cell Broadband Engine™. This document also includes function specifications that facilitate communication between the SPUs and the PPU, and it lists a minimal set of standard library functions that must be provided as part of a standard SPU programming environment.

#### **Audience**

This document is intended for system and application programmers who want to write SPU and PPU programs for a CBEA-compliant processor.

# **Version History**

This section describes significant changes made to each version of this document.

| Version Number & Date        | Changes                                                                                                                                                                                |
|------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| v. 2.6<br>August 25, 2008    | Specified the MFC I/O macro definitions that are used in spu_mfcio.h (TWG_RFC00114-1).                                                                                                  |
|                              | Added a section describing functions that implement the SL1 storage control commands (TWG_RFC00124-0).                                                                                 |
|                              | Changed incorrect table settings that caused information not to display (TWG_RFC00126-0: CORRECTION NOTICE).                                                                           |
|                              | Added a set of generic SPU shift right intrinsics (TWG_RFC00127-1).                                                                                                                    |
|                              | Corrected effective address calculations for <code>vec_extract</code> , <code>vec_insert</code> , and <code>vec_promote</code> (TWG_RFC00128-0).                                       |
|                              | Changed the specification for vector subscripting (TWG_RFC00129-1).                                                                                                                    |
|                              | Extended spu_mfcio.h to include additional functions so that interrupt-safe application software can be written (TWG_RFC00130-2).                                                      |
|                              | Changed the description for spu_sl (TWG_RFC00131-0: CORRECTION NOTICE).                                                                                                                |
|                              | Support for MFC utility functions is no longer optional (TWG_RFC00132-0).                                                                                                              |
|                              | Clarified the definitions of data-type mappings for PPU and SPU (TWG_RFC00134-0: CORRECTION NOTICE).                                                                                   |
|                              | Added si_from_int and si_from_uint to the parameters of si intrinsics in the implementation of the generic intrinsics that accept scalar arguments (TWG_RFC00135-2: CORRECTION NOTICE). |
|                              | Corrected the pseudocode for the spu_rlmaska intrinsic (TWG_RFC00136-1: CORRECTION NOTICE).                                                                                            |
| v. 2.5<br>September 14, 2007 | Corrected miscellaneous documentation errors (TWG_RFC00102-1: CORRECTION NOTICE).                                                                                                      |
|                              | Added six new PPU intrinsics to simplify streaming data prefetch (TWG_RFC00103-0 as amended by TWG_RFC00103-1).                                                                        |
|                              | Described special behaviors for some of the missing classification macros (TWG_RFC00104-0).                                                                                            |
|                              | Changed the return/argument types of several PPU intrinsics (TWG_RFC00105-0).                                                                                                          |
|                              | Changed the descriptive names of the SPU rotate and shift intrinsics (TWG_RFC00106-2).                                                                                                 |
|                              | Changed the descriptive names of several intrinsics (TWG_RFC00107-2: CORRECTION NOTICE).                                                                                              |
|                            | Added a section describing the MFC tag manager (TWG_RFC00109-2).                                                                                                                                                     |
|                            | Eliminated unnecessary spaces from several headings (TWG_RFC00111-0: CORRECTION NOTICE).                                                                                                                             |
|                            | Specified the SPU "fastest minimum-width integer" typedefs in a way that conforms with the implementations for both spu-gcc and spuxlc (TWG_RFC00117-0).                                                             |
|                            | Clarified the mapping of intrinsics between SPU and VMX (TWG_RFC00118-1).                                                                                                                                            |
|                            | Corrected the implementation specification of the <code>mfc_hl2ea</code> function so that it matches the implementation in <code>spu_mfcio.h</code> (TWG_RFC00119-0: CORRECTION NOTICE).                          |
|                            | Made miscellaneous editorial changes.                                                                                                                                                                                |
| v. 2.4<br>March 8, 2007    | Added support for enhanced double precision SPU instructions (TWG_RFC00071-0).                                                                                                                                       |
|                            | Specified use of vector data types with standard C/C++ operators (TWG_RFC00082-1).                                                                                                                                   |
|                            | Made it explicit that the vector keyword on the SPU is the same as the vector keyword on the PPU (TWG_RFC00096-0).                                                                                                   |
|                            | Provided a predefined macro for use by compilers that are targeted to a processor that supports the SPU's optional enhanced double precision instructions (TWG_RFC00097-0).                                          |
|                            | Added the "volatile" qualifier to dmalist arguments in intrinsics (TWG_RFC00100-0).                                                                                                                                  |
|                            | Corrected various organizational, grammatical, and spelling issues (TWG_RFC00093-0: CORRECTION NOTICE and TWG_RFC00094-0: CORRECTION NOTICE).                                                                        |
|                            | Specified the kinds of variables to which the aligned attribute applies (TWG_RFC00098-0).                                                                                                                            |
|                            | Corrected the specification of isnan() so that it applies only to single precision (TWG_RFC00099-0: CORRECTION NOTICE).                                                                                              |
|                            | Corrected various minor errors (TWG_RFC00101-0: CORRECTION NOTICE).                                                                                                                                                  |
| v. 2.3<br>December 4, 2006 | Corrected the function parameter ordering of the PPU <code>__stwbrx</code> intrinsic (TWG_RFC00074-0: CORRECTION NOTICE).                                                                                            |
|                            | Corrected the type of element initializers used to initialize a vector of signed/unsigned char (TWG_RFC00075-0: CORRECTION NOTICE).                                                                                  |
|                            | Noted that the use of double-precision contracted operations is permitted by default unless prohibited by the FP_CONTRACT pragma or the no-fast-double compiler option (TWG_RFC00076-0).                             |
|                            | Added PPU data types and programming directives to Chapter 1, and changed title from "SPU Data Types and Program Directives" to "Data Types and Programming Directives" (TWG_RFC00077-1).                            |
|                            | Removed the <code>__fre</code>, <code>__frsqrtes</code>, and <code>__popcntb</code> intrinsics, and added the <code>__frsqrte</code> intrinsic (TWG_RFC00078-3).                                                      |
|                            | Specified that the floating-point environment provides support for both double-precision elements and all four single-precision elements. Also, updated information for FLT_ROUNDS (TWG_RFC00079-1).               |
|                            | Added a new chapter, "PPU VMX Intrinsics", that specifies a set of intrinsic functions making the underlying PPU VMX instruction set accessible from the C programming language (TWG_RFC00081-1 and TWG_RFC00092-0). |
|                            | Added 32-bit ABI support to the PPU intrinsic functions, changed function arguments to provide a consistent high-level interface, and corrected several typographical errors (TWG_RFC00083-1).                       |
|                            | Changed the return type of the <code>__fctiw</code> and <code>__fctiwz</code> PPU intrinsic functions, changed the descriptive names of these and other similar conversion intrinsics, and removed the <code>__stfiwx</code> intrinsic function (TWG_RFC00089-1).                                                                                                   |
|                            | Identified deprecated PPU VMX operations and recommendations for suitable PPU intrinsic function alternatives (TWG_RFC00090-0).                                                                                                                                                                                                                                     |
|                            | Identified non-supported language features and specified that C++ exception handling is not supported on the SPU (TWG_RFC00091-0).                                                                                                                                                                                                                                  |
| v. 2.2<br>October 11, 2006 | Applied the changes made in the following requests: TWG_RFC00056-0, TWG_RFC00057-0, TWG_RFC00058-2, TWG_RFC00061-1, TWG_RFC00060-1, TWG_RFC00062-0, TWG_RFC00066-2, TWG_RFC00067-2, TWG_RFC00068-0, TWG_RFC00070-1, TWG_RFC00072-0, and TWG_RFC00073-0.                                                                                                             |
|                            | Changed document title because its contents are no longer limited to the SPU. Changed the sections "About this Document" and "Audience" accordingly. Applied TWG_RFC00053-0, TWG_RFC00054-1, and TWG_RFC00055-0.                                                                                                                                                    |
|                            | Replaced uses of a protected name by references to the document <i>AltiVec™ Technology Programming Interface Manual</i> per TWG_RFC00050-1 and TWG_RFC00052-0.                                                                                                                                                                                                      |
|                            | Corrected several operand errors related to spu_sub, which is the arithmetic intrinsic for vector subtraction (TWG_RFC00046-0: CORRECTION NOTICE).                                                                                                                                                                                                                  |
|                            | Corrected various documentation errors; for example, changed sample code demonstrating how to restore the Stack Pointer Information register as a result of invoking the longjmp function (TWG_RFC00047-0: CORRECTION NOTICE).                                                                                                                                      |
|                            | Specified that alternate vector syntax for vector literals is optional rather than mandatory (TWG_RFC00050).                                                                                                                                                                                                                                                        |
| v. 2.1<br>October 20, 2005 | Added a sub-section called "Malloc Heap" to the C library section of the "C and C++ Standard Libraries" chapter. This section is related to an attempt to define a standard process for memory heap initialization and stack management (TWG_RFC00024-3).                                                                                                           |
|                            | In the "SPU and Vector Multimedia Extension Intrinsics" chapter, clarified which intrinsic mappings are required according to this specification and which are not because a straightforward mapping does not exist. Provided additional explanations regarding the intrinsics that are difficult to map (TWG_RFC00034-1: CORRECTION NOTICE).                       |
|                            | Corrected the description of the si_stqx instruction (TWG_RFC00035-0: CORRECTION NOTICE).                                                                                                                                                                                                                                                                           |
|                            | Corrected various documentation errors; for example, changed several descriptions in the "Alternate Vector Literal Format and Description" table (TWG_RFC00036-0: CORRECTION NOTICE, TWG_RFC00041-0: CORRECTION NOTICE, TWG_RFC00045-0: CORRECTION NOTICE).                                                                                                       |
|                            | Changed "Broadband Processor Architecture" to "Cell Broadband Engine Architecture", and changed "BPA" to "CBEA" (TWG_RFC00037-0: CORRECTION NOTICE).                                                                                                                                                                                                                |
|                            | Deleted several references to BE revisions DD1.0 and DD2.0 (TWG_RFC00040-0: CORRECTION NOTICE).                                                                                                                                                                                                                                                                     |
|                            | Added a new chapter describing MFC I/O intrinsics; these intrinsics facilitate MFC programming by defining a common set of utility functions (TWG_RFC00043-2).                                                                                                                                                                                                      |
| v. 2.0<br>July 11, 2005    | Deleted several sections in the "About This Document" chapter. Changed two entries in the Write Word Channel table from si_wrch(channel, si_to_int(a)) to si_wrch(channel, si_from_int(a)). Clarified that the syntax for vector type specifiers does not allow the use of a typedef name as a type specifier. (All changes per TWG_RFC00032-0: CORRECTION NOTICE.) |
| v. 1.9<br>June 10, 2005    | Added new chapter describing C and C++ Libraries (TWG_RFC00018-5). |
|                            | Added new chapter describing SPU floating-point arithmetic (TWG_RFC00027-1). |
|                             | Changed "Broadband Engine" or "BE" to "a processor compliant with the Broadband Processor Architecture" or "a processor compliant with BPA"; changed VMX to Vector Multimedia Extension; changed Synergistic Processing Element to Synergistic Processor Element; and changed Synergistic Processing Unit to Synergistic Processor Unit. Defined a PPU as a PowerPC Processor Unit on first major instance. Corrected several book references and changed copyright page so that trademark owners were specified. (All changes per TWG_RFC00031-0: CORRECTION NOTICE.)  Made miscellaneous changes to the "About This Document" section. |
| v. 1.8<br>May 12, 2005      | Added new channel number for multisource synchronization requests (TWG_RFC00023-1). |
|                             | Corrected example describing loading of misaligned vectors.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                             | Changed PU to PPU and SPC to SPE; changed "PU-to-SPU" (mailboxes) and "SPU-to-PU" to "inbound" and "outbound" respectively (TWG_RFC00028-1: CORRECTION NOTICE).                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|                             | Changed the name of spu_mulhh to spu_mule (TWG_RFC00021-0).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                             | Updated channel names to coincide with BPA channel names (TWG_RFC00029-1).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| v. 1.7<br>July 16, 2004     | Clarified that channel intrinsics must not be reordered with respect to other channel commands or volatile local-storage memory accesses (TWG_RFC00007-1).                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                             | Warned that compliant compilers may ignore __align_hint intrinsics (TWG_RFC00008-1). |
|                             | Added an additional SPU instruction, orx (TWG_RFC00010-0).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                             | Added mnemonics for channels that support reading the event mask and tag mask (TWG_RFC00011-0).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|                             | Specified that spu_ienable and spu_idisable intrinsics do not have return values (TWG_RFC00013-0).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                             | Moved paragraph beginning "This intrinsic is considered volatile" from spu_mfspr intrinsic to spu_mtfpscr (TWG_RFC00014-0).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                             | Changed the descriptions for si_lqd and si_stqd intrinsics (TWG_RFC00015-1).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                             | Provided new descriptions of various rotation-and-mask intrinsics, specifically: spu_rlmask, spu_rlmaska, spu_rlmaskqw, spu_rlmaskqwbyte, and spu_rlmaskqwbytebc. These descriptions include pseudo-code examples (TWG_RFC00016-1).                                                                                                                                                                                                                                                                                                                                                                                                      |
|                             | Made miscellaneous editorial changes.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| v. 1.6<br>March 12, 2004    | Made miscellaneous editorial changes.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| v. 1.5<br>February 25, 2004 | Changed formatting of document so that it reflects the typographic conventions mentioned in the "About This Document" section. Made miscellaneous editorial changes.                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                             | Changed some of the parameter types for spu_mfcdma32 and spu_mfcdma64, as requested in TWG_RFC00002.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                             | Inserted new specifications for the vector literal format, as requested in TWG_RFC00003.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| v. 1.4<br>January 20, 2004  | Changed document to new format, including front matter. Made miscellaneous editorial changes.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| v. 1.3<br>November 4, 2003  | Added enable/disable interrupt intrinsics.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| v. 1.2<br>September 2, 2003 | Changed parameter types of spu_sel intrinsic to be compatible with Vector Multimedia Extension's vec_sel.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                            | Added si_stopd specific intrinsic. |
|                            | Corrected tables for spu_genb and spu_genc generic intrinsics. |
| v. 1.1<br>June 15, 2003    | Made changes to support RFC 24. Added isolation control channel 64. |
|                            | Made changes to support RFC 33. Removed spu_addc, spu_addsc, spu_subb, and spu_subsb. Added spu_addx, spu_subx, spu_genc, spu_gencx, spu_genb, and spu_genbx. |
| v. 1.0<br>April 28, 2003   | Made minor corrections. |
| v. 0.9<br>March 7, 2003    | Added new intrinsics to support new or modified instructions. These include: fscrrd, fscrwr, stop, dfma, mpyhhau, mpyhhu, rotqmbybi, iret, lqr, and stqr. Also added intrinsics to support new feature bits for iret, bisled, bihnz, and sync. |
| v. 0.8<br>January 23, 2003 | Improved documentation of specific intrinsics. Completely defined parameter ordering and immediate sizes. |
|                            | Defined new global (spu_intrinsics.h) and compiler-specific (spu_internals.h) header files. Specified that single-token vector types and channel enumerants are declared in spu_intrinsics.h. |
|                            | Added specific pointer casting intrinsics. |
|                            | Added standardized \_\_SPU\_\_ conditional compilation control. |
|                            | Changed specific convert intrinsics to use unbiased scale parameters, like the generic intrinsics. |
|                            | Specified that the bisled target function does not observe the standard calling convention with respect to volatile registers. |
| v. 0.7<br>November 18, 2002 | Specified that gcc-style inline assembly is required. |
|                            | Specified that __builtin_expect is required. |
|                            | Added bisled specific and generic intrinsics. |
|                            | Added __align_hint intrinsic. |
|                            | Specified that the restrict type qualifier is required. |
|                            | Specified that out-of-range scale factors on generic conversion intrinsics return an error. |
| v. 0.6<br>September 24, 2002 | Changed document title to include C++. |
|                            | Made miscellaneous clarifications and typing corrections. |
|                            | Changed spu_eqv to return the same vector type as its inputs. |
|                            | Changed spu_and, spu_or, and spu_xor to accept immediate values of the same type as the elements of parameter *a*. |
|                            | Added specific casting intrinsics. |
|                            | Changed default action on out-of-range immediate values for specific intrinsics to issuing an error. |
|                            | Added documentation of the __builtin_expect builtin. |
|                            | Completed SPU-to-Vector Multimedia Extension intrinsic mapping section. |
| v. 0.5<br>August 27, 2002  | Edited discussion of Vector Multimedia Extension-to-SPU intrinsic mapping. |
|                            | Removed appendices. |
|                            | Added support for 32-bit read and write channel intrinsics. Renamed quadword channel read and write to readchqw and writechqw. |
| v. 0.4<br>August 5, 2002 | Corrected the instruction mapping for spu_promote and spu_extract. |
|                         | Specified that the instruction mapping for the generic intrinsics spu_re and spu_rsqrte includes the FI (floating-point interpolate) instruction. |
|                         | Renamed spu_splat to spu_splats (scalar splat) to avoid confusion with vec_splat.                                                        |
|                         | Added documentation about the size of the immediate intrinsic forms.                                                                     |
|                         | Changed all vector signed long to vector signed long long.                                                                               |
|                         | Changed count to unsigned for spu_sl, spu_slqw, spu_slqwbyte, and spu_slqwbytebc.                                                        |
|                         | Changed count to signed for spu_rl, spu_rlmask and spu_rlmaska.                                                                          |
|                         | Specified that the return value of spu_cntlz is an unsigned value.                                                                       |
|                         | Corrected description of spu_gather intrinsic.                                                                                           |
|                         | Edited mapping documentation of scalars for spu_and, spu_or, and spu_xor.                                                                |
|                         | Removed vector input forms of spu_hcmpeq and spu_hcmpgt.                                                                                 |
| v. 0.3<br>July 16, 2002 | Added fsmbi to literal constructor instructions. Added fsmbi (immediate form) to spu_maskb intrinsic.                                    |
|                         | Added vector forms to compare and halt (spu_hcmpeq and spu_hcmpgt) intrinsics.                                                           |
|                         | Added qword data type as the only vector type accepted by specific intrinsics.                                                           |
|                         | Added typedefs for the vector types as the basic types used for code portability.                                                        |
|                         | Merged all spu_splat generic intrinsics into a single intrinsic.                                                                         |
|                         | Dropped spu_load, spu_store, and spu_insertctl generic intrinsics.                                                                       |
| v. 0.2<br>July 9, 2002  | Incorporated changes and suggestions from Peng. |
|                         | Changed vector long types to vector long long. |
| v. 0.1<br>June 21, 2002 | First version of the language extension specification. Initial specification based on the Tobey compiler intrinsics specification.       |

# **Related Documentation**

The following table provides a list of references and supporting materials for this document:

| Document Title                                                                                     | Version     | Date         |
|----------------------------------------------------------------------------------------------------|-------------|--------------|
| ISO/IEC 9899:1999, Programming Languages – C (C Standard)                                          |             |              |
| ISO/IEC 14882:2003, Programming Languages – C++ (C++ Standard)                                     |             |              |
| IEEE 754: Standard for Binary Floating-Point Arithmetic                                            |             |              |
| Synergistic Processor Unit Instruction Set Architecture (SPU ISA) (public Cell document)           | 1.2         | January 2007 |
| Cell Broadband Engine Architecture (public Cell document)                                          | 1.01        | October 2006 |
| Tool Interface Standard (TIS) Executable and Linking Format (ELF) Specification                    | 1.2         | May 1995     |
| Tool Interface Standard (TIS) DWARF Debugging Information Format Specification                     | 2.0         | May 1995     |
| PowerPC Architecture Book, Book II: PowerPC Virtual Environment Architecture (public IBM document) | 2.02        | January 2005 |
| AltiVec™ Technology Programming Interface Manual (PIM) (public Freescale Semiconductor document)   |             | June 1999    |



# **Conventions Used in This Document**

#### **Bit Notation**

Standard bit notation is used throughout this document. Bits and bytes are numbered in ascending order from left to right. Thus, for a 4-byte word, bit 0 is the most significant bit and bit 31 is the least significant bit, as shown in the following figure:



(Figure: a 4-byte word, with bit 0 (MSB) at the left and bit 31 (LSB) at the right.)

MSB = Most significant bit

LSB = Least significant bit

Notation for bit encoding is as follows:

- Hexadecimal values are preceded by 0x. For example: 0x0A00.
- Binary values in sentences appear in single quotation marks. For example: '1010'.
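Because this numbering runs opposite to the usual shift-count convention, a small helper can make the mapping explicit. The following is a minimal sketch in portable C; the function name `bit_be` is hypothetical and not part of this specification. It returns bit *n* of a 32-bit word, where bit 0 is the most significant bit:

```c
#include <stdint.h>

/* Return bit n (0..31) of a 32-bit word using big-endian bit numbering:
   bit 0 is the most significant bit, bit 31 the least significant bit. */
static unsigned bit_be(uint32_t word, unsigned n)
{
    return (unsigned)((word >> (31u - n)) & 1u);
}
```

For the hexadecimal example above, 0x0A00 is '0000 1010 0000 0000' in its low 16 bits, so bits 20 and 22 of the 32-bit word are set.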

#### **Byte Ordering and Element Numbering**

Byte ordering and element numbering are always displayed in big-endian order, as shown in Figure 1-1.

Figure 1-1: Big-Endian Byte/Element Ordering for Vector Types

| Byte       | 0 (MSB) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 (LSB) |
|------------|---------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----------|
| doubleword | 0       | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1  | 1  | 1  | 1  | 1  | 1        |
| word       | 0       | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 2  | 2  | 3  | 3  | 3  | 3        |
| halfword   | 0       | 0 | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5  | 5  | 6  | 6  | 7  | 7        |
| char       | 0       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15       |

Each cell gives the number of the element that occupies that byte.
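The layout above can be expressed directly in code. The following minimal sketch in portable C (the helper names `put_word_be` and `first_byte` are hypothetical) writes a 32-bit value into word element *i* of a 16-byte buffer, most significant byte first, and computes the byte offset at which element *i* begins for a given element size:

```c
#include <stdint.h>

/* Store v as word element i (0..3) of a 16-byte quadword, most significant
   byte first, following the big-endian element numbering of Figure 1-1. */
static void put_word_be(uint8_t q[16], unsigned i, uint32_t v)
{
    for (unsigned b = 0; b < 4; b++)
        q[4 * i + b] = (uint8_t)(v >> (24 - 8 * b));
}

/* Byte number (0..15) where element i begins, for elements of
   elem_size bytes (1 = char, 2 = halfword, 4 = word, 8 = doubleword). */
static unsigned first_byte(unsigned i, unsigned elem_size)
{
    return i * elem_size;
}
```

For example, word 1 occupies bytes 4 through 7, so after `put_word_be(q, 1, 0x11223344)` byte 4 holds 0x11 and byte 7 holds 0x44.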

#### **Other Conventions**

The following typographic conventions are used throughout this document:

| Convention                | Meaning                                                                                                                                                                                                                                                                       |
|---------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| courier                   | Indicates programming code and literals, such as processing instructions, register names, data types, events, and file names. Also indicates function and macro names. This convention is only used where it facilitates comprehension, especially in narrative descriptions. |
| courier + italics         | Indicates arguments, parameters, and variables. This convention is only used where it facilitates comprehension, especially in narrative descriptions.                                                                                                                        |
| italics (without courier) | Indicates emphasis. Except when hyperlinked, book references are in italics. When a term is first defined, it is often in italics.                                                                                                                                            |
| blue                      | Indicates a hyperlink (color printers or online only).                                                                                                                                                                                                                        |



# 1. Data Types and Programming Directives

This chapter specifies PPU Vector Multimedia Extension (VMX) and SPU vector data types, operations on these data types, programming directives, and predefined macro target definitions.

Any conflict between the requirements described here for PPU VMX data types and the *AltiVec™ Technology Programming Interface Manual* is unintentional.

The vector keyword and the \_\_vector keyword have the same properties, defined in the AltiVec™ Technology Programming Interface Manual. The \_\_vector keyword is preferred for code portability because it is always defined.

# 1.1. Data Types

In this section, a set of fundamental vector data types is introduced to the C language, and several mappings are described that relate PPU and SPU data types to one another.

#### 1.1.1. Fundamental Data Types

The fundamental vector data types that are supported by the PPU and SPU are shown in Table 1-1. All of these data types are 128 bits long and contain from 2 to 16 elements, depending on the corresponding element data type.

Table 1-1: Vector Data Types

| Vector Data Type          | Content                                                                                                                           | SPU/PPU |
|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------|---------|
| vector unsigned char      | 16 8-bit unsigned chars                                                                                                           | Both    |
| vector signed char        | 16 8-bit signed chars                                                                                                             | Both    |
| vector unsigned short     | 8 16-bit unsigned halfwords                                                                                                       | Both    |
| vector signed short       | 8 16-bit signed halfwords                                                                                                         | Both    |
| vector unsigned int       | 4 32-bit unsigned words                                                                                                           | Both    |
| vector signed int         | 4 32-bit signed words                                                                                                             | Both    |
| vector unsigned long long | 2 64-bit unsigned doublewords                                                                                                     | SPU     |
| vector signed long long   | 2 64-bit signed doublewords                                                                                                       | SPU     |
| vector float              | 4 32-bit single-precision floats                                                                                                  | Both    |
| vector double             | 2 64-bit double-precision floats                                                                                                  | SPU     |
| qword                     | quadword (16-byte), used exclusively as an input/output to a specific intrinsic function. See section "2.1. Specific Intrinsics". | SPU     |
| vector bool char          | 16 8-bit bools - 0 (false) 255 (true)                                                                                             | PPU     |
| vector bool short         | 8 16-bit bools - 0 (false) 65535 (true)                                                                                           | PPU     |
| vector bool int           | 4 32-bit bools - 0 (false) 2<sup>32</sup> - 1 (true)                                                                              | PPU     |
| vector pixel              | 8 16-bit unsigned halfwords, 1/5/5/5 pixel                                                                                        | PPU     |

The syntax for vector type specifiers does not allow the use of a typedef name as a type specifier. For example, the following declaration is not allowed:

```
typedef signed short int16;
vector int16 data;
```

#### 1.1.2. Mapping of PPU Data Types to SPU Data Types

Not all PPU vector data types are supported on the SPU. The PPU vector data types that do not map identically to SPU data types are shown in Table 1-2.



Table 1-2: Non-identical Mapping of PPU VMX Data Types to SPU Data Types

| PPU VMX Data Type | Maps to SPU Data Type              |
|-------------------|------------------------------------|
| vector bool char  | vector unsigned char               |
| vector bool short | vector unsigned short              |
| vector bool int   | vector unsigned int                |
| vector pixel      | vector unsigned short <sup>1</sup> |

<sup>1</sup> Because vector pixel and vector bool short are mapped to the same base vector type (vector unsigned short), the overloaded functions for vec\_unpackh and vec\_unpackl cannot be uniquely resolved.

#### 1.1.3. Mapping of SPU Data Types to PPU Data Types

Not all SPU data types are supported by the PPU VMX. The SPU data types that do not map identically to PPU vector data types are shown in Table 1-3.

Table 1-3: Non-identical Mapping of SPU Data Types to PPU VMX Data Types

| SPU Data Type             | Maps to PPU VMX Data Type |
|---------------------------|---------------------------|
| vector unsigned long long | vector bool char          |
| vector signed long long   | vector bool short         |
| vector double             | vector bool int           |

# 1.2. Header Files

There are separate system header files for the SPU and PPU that include typedefs and other information required by this specification.

#### 1.2.1. Header File Contents

The SPU system header file, <code>spu\_intrinsics.h</code>, defines common enumerations and typedefs. These include the single token vector types and MFC channel mnemonic enumerations (see Table 1-4 and Table 2-96, respectively). In addition, <code>spu\_intrinsics.h</code> will include a compiler-specific header file, <code>spu\_internals.h</code>, that contains any implementation-specific definitions.

The PPU system header file, altivec.h, defines typedefs and keywords and also includes any implementation-specific definitions. The PPU system header file, vec\_types.h, defines typedefs required by the language extension features defined in this specification.

#### 1.2.2. Single Token Typedefs

To improve code portability, single token typedefs are provided for the vector keyword data types. These typedefs, which are shown in Table 1-4, are defined in spu\_intrinsics.h on the SPU and in vec\_types.h on the PPU. Besides simplifying type declarations, the single token types serve as class names for extending generic intrinsics or for mapping between PPU VMX intrinsics and/or SPU intrinsics.

Table 1-4: Single Token Vector Data Types

| Vector Keyword Data Type  | Single Token Typedef | SPU/PPU |
|---------------------------|----------------------|---------|
| vector unsigned char      | vec_uchar16          | Both    |
| vector signed char        | vec_char16           | Both    |
| vector unsigned short     | vec_ushort8          | Both    |
| vector signed short       | vec_short8           | Both    |
| vector unsigned int       | vec_uint4            | Both    |
| vector signed int         | vec_int4             | Both    |
| vector unsigned long long | vec_ullong2          | SPU     |
| vector signed long long  | vec_llong2           | SPU     |
| vector float             | vec_float4           | Both    |
| vector double            | vec_double2          | SPU     |
| vector bool char         | vec_bchar16          | PPU     |
| vector bool short        | vec_bshort8          | PPU     |
| vector bool int          | vec_bint4            | PPU     |
| vector pixel             | vec_pixel8           | PPU     |
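To check the size and element counts from Tables 1-1 and 1-4 on an ordinary host compiler, the following sketch defines stand-ins for a few single token typedefs using the GCC/Clang `vector_size` extension. This is an assumption for illustration only: these are *not* the definitions from spu\_intrinsics.h or vec\_types.h, and `VEC_ELEMS` is a hypothetical helper macro.

```c
#include <stdint.h>

/* Host stand-ins built with the GCC/Clang vector_size extension; they only
   mimic the 128-bit size and element layout of the real typedefs. */
typedef uint8_t  vec_uchar16 __attribute__((vector_size(16)));
typedef int16_t  vec_short8  __attribute__((vector_size(16)));
typedef uint32_t vec_uint4   __attribute__((vector_size(16)));
typedef float    vec_float4  __attribute__((vector_size(16)));
typedef double   vec_double2 __attribute__((vector_size(16)));

/* Number of elements in a vector variable: total size / element size. */
#define VEC_ELEMS(v) (sizeof(v) / sizeof((v)[0]))
```

Every type is 16 bytes, and the element count is 16 divided by the element size, giving the 2 to 16 elements described in section 1.1.1.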

# 1.3. Alignment

#### 1.3.1. Default Data Type Alignments

Table 1-5 shows the size and default alignment of the various data types.

Table 1-5: Default Data Type Alignments

| Data Type | Size | Alignment       |
|-----------|------|-----------------|
| char      | 1    | byte            |
| short     | 2    | halfword        |
| int       | 4    | word            |
| long      | 4    | word/doubleword |
| long long | 8    | doubleword      |
| float     | 4    | word            |
| double    | 8    | doubleword      |
| pointer   | 4    | word            |
| vector    | 16   | quadword        |

The aligned attribute will be provided by implementations to align static, global, and local variables, as well as static and non-static data members. The aligned attribute will not guarantee alignment of variables allocated using malloc or operator new. Implementations will support at least 128-byte alignment.

In the following declaration statement, the floating-point scalar factor will be aligned on a quadword boundary:

```
float factor __attribute__ ((aligned (16)));
```
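Because the aligned attribute does not apply to storage obtained from malloc, heap data that must be quadword aligned needs a different mechanism. The following sketch assumes a C11 hosted environment and uses the standard `aligned_alloc` function as one such option; the helper name `alloc_floats_aligned` is hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

/* Static storage: the aligned attribute places this scalar on a quadword boundary. */
static float factor __attribute__((aligned(16)));

/* Heap storage: the aligned attribute has no effect on malloc, so this sketch
   uses C11 aligned_alloc, whose size argument should be a multiple of align. */
static float *alloc_floats_aligned(size_t count, size_t align)
{
    size_t bytes = count * sizeof(float);
    bytes = (bytes + align - 1) / align * align;   /* round up to a multiple of align */
    return aligned_alloc(align, bytes);
}
```

The returned pointer is released with `free` as usual.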

#### 1.3.2. \_\_align\_hint

The \_\_align\_hint intrinsic is provided to improve data access through pointers and to provide compilers the additional information that is needed to support auto-vectorization. This built-in function is available only on the SPU.

Although \_\_align\_hint is defined as an intrinsic, it behaves like a directive, because it never generates any code itself. Its syntax is:

```
__align_hint(ptr, base, offset)
```

The \_\_align\_hint intrinsic informs the compiler that the pointer ptr points to data with a base alignment of base and an offset of offset from that base. The base alignment has to be a power of 2; a base alignment of zero implies that the pointer has no known alignment. The alignment offset has to be either zero or less than base.

The \_\_align\_hint intrinsic is not intended to specify pointers that are not naturally aligned. Specifying pointers that are not naturally aligned results in data objects straddling quadword boundaries. If a programmer specifies alignment incorrectly, incorrect programs might result.



**Programming Note:** Although compliant compiler implementations must provide the \_\_align\_hint intrinsic, compilers may ignore these hints.
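The argument rules above can be checked mechanically. The following host-side sketch encodes them in two hypothetical helpers (neither is part of this specification): one validates a (base, offset) pair, and one tests whether a given pointer actually satisfies a hint, since an incorrect hint may produce an incorrect program:

```c
#include <stdbool.h>
#include <stdint.h>

/* Rules for __align_hint(ptr, base, offset): base is 0 (no known alignment)
   or a power of 2, and offset is either 0 or less than base. */
static bool align_hint_args_valid(unsigned base, unsigned offset)
{
    bool base_ok = (base == 0) || ((base & (base - 1)) == 0);
    bool offset_ok = (offset == 0) || (offset < base);
    return base_ok && offset_ok;
}

/* True if ptr really lies offset bytes past a base-aligned address. */
static bool ptr_matches_hint(const void *ptr, unsigned base, unsigned offset)
{
    if (base == 0)
        return true;                      /* no alignment is claimed */
    return ((uintptr_t)ptr % base) == offset;
}
```

For example, for a quadword-aligned float array buf, the pointer buf + 1 matches the hint (16, 4) but not (16, 0).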

# 1.4. Operating on Vector Types

This section describes the C/C++ operators and operations that are required to act on vector data types. These operators are the sizeof() operator, the assignment operator (=), and the address operator (&). Many other standard C/C++ operators are also extended for vector data types. The overloading of these operators for vector data types is described in "10. Operator Overloading for Vector Data Types".

The operations on vector data types are pointer operations and type casting operations.

#### 1.4.1. sizeof() Operator

The operation sizeof() on a vector type always returns 16.

#### 1.4.2. Assignment Operator

If either the left or right side of an assignment expression has a vector type, both sides of the expression have to be of the same vector type. Thus, the expression a = b is valid and represents assignment if a and b are of the same type or if neither variable is a vector type. Otherwise, the expression is invalid, and the compiler reports the inconsistency as an error.

#### 1.4.3. Address Operator

The operation &a is valid when a is a vector type. The result of the operation is a pointer to vector a.

#### 1.4.4. Pointer Arithmetic and Pointer Dereferencing

The usual pointer arithmetic involving a pointer to a vector type can be performed. For example, assuming p is a pointer to a vector type, p+1 is the pointer to the next vector following p.

Dereferencing the vector pointer p implies a 128-bit vector load from or store to the address obtained by masking the 4 least significant bits of p. When a vector is misaligned, the 4 least significant bits of its address are nonzero. Although vectors are 16-byte aligned (see section "1.3. Alignment"), it nevertheless might be desirable to load or store a vector that is misaligned. A misaligned vector can be loaded in several ways using generic intrinsics (see section "2.2. Generic Intrinsics and Built-ins").
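To make the masking and merging concrete, the following portable sketch models the technique on the host with plain byte arrays. It is an illustration under stated assumptions, not SPU code: `load_misaligned16` is a hypothetical helper, and the comments name the SPU generic intrinsics (used in section "1.8. Target Definitions") whose effect each step imitates. Like the SPU sequence, it always reads both aligned quadwords, so the memory 32 bytes past the aligned base must be readable.

```c
#include <stdint.h>
#include <string.h>

/* Read 16 bytes starting at a possibly misaligned address p by loading the
   two aligned quadwords that p straddles and merging them with byte shifts. */
static void load_misaligned16(const uint8_t *p, uint8_t out[16])
{
    /* A vector dereference uses the address with its 4 low bits masked off. */
    const uint8_t *base = (const uint8_t *)((uintptr_t)p & ~(uintptr_t)15);
    unsigned shift = (unsigned)((uintptr_t)p & 15);
    uint8_t qw0[16], qw1[16];

    memcpy(qw0, base, 16);        /* aligned quadword holding p[0] */
    memcpy(qw1, base + 16, 16);   /* next aligned quadword */
    for (unsigned i = 0; i < 16; i++) {
        /* like spu_slqwbyte(qw0, shift): qw0 shifted left by shift bytes */
        uint8_t hi = (i + shift < 16) ? qw0[i + shift] : 0;
        /* like spu_rlmaskqwbyte(qw1, shift - 16): qw1 shifted right by 16 - shift bytes */
        uint8_t lo = (i + shift >= 16) ? qw1[i + shift - 16] : 0;
        out[i] = (uint8_t)(hi | lo);  /* like spu_or */
    }
}
```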

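A misaligned floating-point vector can be loaded on the SPU by loading the two aligned quadwords that it straddles and merging them with quadword byte shifts, in the same way as the misaligned quadword load shown in section "1.8. Target Definitions".

Storing to a misaligned floating-point vector on the SPU requires a read-modify-write of both quadwords that the destination straddles, as in the following example:

```
void store_misaligned_vector_float (vector float flt, vector float *ptr)
{
    vector float qw0, qw1;
    vector unsigned int mask;
    int shift = (unsigned int)(ptr) & 15;   /* misalignment in bytes */

    qw0 = *ptr;                             /* aligned quadword holding the first byte */
    qw1 = *(ptr+1);                         /* next aligned quadword */
    mask = (vector unsigned int)spu_rlmaskqwbyte(spu_splats((unsigned char)0xFF), -shift);
    flt = spu_rlqwbyte(flt, -shift);        /* rotate data right by shift bytes */
    *(ptr) = spu_sel(qw0, flt, mask);       /* keep bytes before the target */
    *(ptr+1) = spu_sel(flt, qw1, mask);     /* keep bytes after the target */
}
```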
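The read-modify-write store can be modeled on the host in the same way. The following sketch uses byte arrays in place of quadwords; `store_misaligned16` is a hypothetical helper, and the comments indicate which SPU intrinsic step each loop imitates (rotate right by the misalignment, build a byte mask, then select). Like the SPU sequence, it rewrites both aligned quadwords, leaving bytes outside the 16-byte target unchanged.

```c
#include <stdint.h>
#include <string.h>

/* Write 16 bytes v to a possibly misaligned address p by updating the two
   aligned quadwords that p straddles. */
static void store_misaligned16(uint8_t *p, const uint8_t v[16])
{
    uint8_t *base = (uint8_t *)((uintptr_t)p & ~(uintptr_t)15);
    unsigned shift = (unsigned)((uintptr_t)p & 15);
    uint8_t qw0[16], qw1[16], rot[16], mask[16];

    memcpy(qw0, base, 16);
    memcpy(qw1, base + 16, 16);
    for (unsigned i = 0; i < 16; i++) {
        rot[i] = v[(i + 16 - shift) % 16];      /* like spu_rlqwbyte(v, -shift) */
        mask[i] = (i >= shift) ? 0xFF : 0x00;   /* all-ones shifted right by shift bytes */
    }
    for (unsigned i = 0; i < 16; i++) {
        /* like spu_sel(qw0, rot, mask) and spu_sel(rot, qw1, mask) */
        qw0[i] = (uint8_t)((qw0[i] & ~mask[i]) | (rot[i] & mask[i]));
        qw1[i] = (uint8_t)((rot[i] & ~mask[i]) | (qw1[i] & mask[i]));
    }
    memcpy(base, qw0, 16);
    memcpy(base + 16, qw1, 16);
}
```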



#### 1.4.5. Type Casting


Pointers to vector types and non-vector types may be cast back and forth to each other. For the purpose of aliasing, a vector type is treated as an array of its corresponding element type, as shown in Table 1-6. If a pointer is cast to a pointer to a vector type, it is the programmer's responsibility to ensure that the address is 16-byte aligned. Vector types that are applicable only on the PPU do not have an underlying scalar type.

Table 1-6: Vector Pointer Types and Matching Base Element Pointer Types

| Vector Pointer Type (vector T*) | Base Element Pointer Type (T*) | SPU/PPU |
|---------------------------------|--------------------------------|---------|
| vector unsigned char*           | unsigned char*                 | Both    |
| vector signed char*             | signed char*                   | Both    |
| vector unsigned short*          | unsigned short*                | Both    |
| vector signed short*            | signed short*                  | Both    |
| vector unsigned int*            | unsigned int*                  | Both    |
| vector signed int*              | signed int*                    | Both    |
| vector unsigned long long*      | unsigned long long*            | SPU     |
| vector signed long long*        | signed long long*              | SPU     |
| vector float*                   | float*                         | Both    |
| vector double*                  | double*                        | SPU     |

Casts from one vector type to another vector type have to be explicit and are done using normal C-language casts. None of these casts performs any data conversion; the bit pattern of the result is the same as the bit pattern of the argument that is cast.
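The following host-side sketch illustrates both rules with GCC/Clang generic vector types standing in for vector float and vector unsigned int (an assumption: these are not the SPU or VMX types, although GCC documents the same no-conversion behavior for casts between equal-sized vector types). It reinterprets float bits without conversion and checks the 16-byte alignment required before casting an element pointer to a vector pointer. The names `vfloat4`, `vuint4`, `float_bits`, and `is_quadword_aligned` are hypothetical.

```c
#include <stdint.h>

/* Host stand-ins for vector float and vector unsigned int. */
typedef float    vfloat4 __attribute__((vector_size(16)));
typedef uint32_t vuint4  __attribute__((vector_size(16)));

/* Vector-to-vector cast: the 128-bit pattern is reused unchanged. */
static vuint4 float_bits(vfloat4 f)
{
    return (vuint4)f;
}

/* Before casting float* to vfloat4*, the address must be 16-byte aligned. */
static int is_quadword_aligned(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}
```

For example, the element 1.0f reappears as the IEEE 754 bit pattern 0x3F800000, not as the integer 1.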

Casts between vector types and scalar types are illegal. On the SPU, the <code>spu\_extract</code>, <code>spu\_insert</code>, and <code>spu\_promote</code> generic intrinsics or the specific casting intrinsics may be used to efficiently achieve the same results (see section "2.1.1. Specific Casting Intrinsics"). On the PPU, the <code>vec\_lde</code> and <code>vec\_ste</code> intrinsics may be used to copy between scalar and vector types.

#### 1.4.6. Vector Literals

As shown in Table 1-7, a vector literal is written as a parenthesized vector type followed by a curly-braced set of constant expressions. If a vector literal is used as an argument to a macro, the literal has to be enclosed in parentheses; in all other cases, the literal can be used without enclosing parentheses. Each element of the vector is initialized to the corresponding expression, and elements for which no expression is specified default to 0. Vector literals may be used either in initialization statements or as constants in executable statements. The syntax for vector initialization and for vector compound literals is the same as the corresponding array syntax, except that designators, which do not exist for vector elements, cannot be used. The initializer acts as an array of 2, 4, 8, or 16 elements, depending on the size of the underlying element type. For example, the following two initializations are valid and equivalent:

```
vector signed int v1[] = {{0, 1, 2, 3}, {4, 5, 6, 7}};
vector signed int v2[] = {0, 1, 2, 3, 4, 5, 6, 7};
```



The following two struct initializers are also valid and equivalent:

```
struct stypy {
    int i;
    vector signed int t;
} v3 = {1, {0, 1, 2, 3}}, v4 = {1, 0, 1, 2, 3};
```

The following types on both the SPU and PPU cannot be initialized using a vector literal: qword, vector bool char, vector bool short, vector bool int, and vector pixel. They can be created by using the intrinsics or by casting to these vector types.

Table 1-7: Vector Literal Format and Description

| Notation                                          | Represents                                   | SPU/PPU |
|---------------------------------------------------|----------------------------------------------|---------|
| (vector unsigned char) {unsigned char, ...}           | A set of 16 unsigned 8-bit quantities.       | Both    |
| (vector signed char) {signed char, ...}               | A set of 16 signed 8-bit quantities.         | Both    |
| (vector unsigned short) {unsigned short, ...}         | A set of 8 unsigned 16-bit quantities.       | Both    |
| (vector signed short) {signed short, ...}             | A set of 8 signed 16-bit quantities.         | Both    |
| (vector unsigned int) {unsigned int, ...}             | A set of 4 unsigned 32-bit quantities.       | Both    |
| (vector signed int) {signed int, ...}                 | A set of 4 signed 32-bit quantities.         | Both    |
| (vector unsigned long long) {unsigned long long, ...} | A set of 2 unsigned 64-bit quantities.       | SPU     |
| (vector signed long long) {signed long long, ...}     | A set of 2 signed 64-bit quantities.         | SPU     |
| (vector float) {float, ...}                           | A set of 4 32-bit floating-point quantities. | Both    |
| (vector double) {double, ...}                         | A set of 2 64-bit floating-point quantities. | SPU     |
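The zero-fill rule and the use of a literal as a constant in an executable statement can be tried on a host compiler. The sketch below assumes GCC/Clang generic vectors as a stand-in for vector signed int, since those compilers accept the same curly-brace compound-literal form; `vint4`, `partial_literal`, and `add_one` are hypothetical names:

```c
/* Host stand-in for vector signed int (GCC/Clang vector extension). */
typedef int vint4 __attribute__((vector_size(16)));

/* Only two expressions are given, so elements 2 and 3 default to 0. */
static vint4 partial_literal(void)
{
    return (vint4){7, 8};
}

/* A vector literal used as a constant inside an executable statement. */
static vint4 add_one(vint4 v)
{
    return v + (vint4){1, 1, 1, 1};
}
```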

An alternate format may also be supported which corresponds to the syntax specified in the *AltiVec™ Technology Programming Interface Manual*. This format consists of a parenthesized vector type followed by a parenthesized set of constant expressions. See Table 1-8.

Table 1-8: Alternate Vector Literal Format and Description

| Notation                                                            | Represents                                                                                     | SPU/PPU |
|---------------------------------------------------------------------|------------------------------------------------------------------------------------------------|---------|
| (vector unsigned char)(unsigned int)                                | A set of 16 unsigned 8-bit quantities that all have the value specified by the integer.        | Both    |
| (vector unsigned char)(unsigned int, ..., unsigned int)             | A set of 16 unsigned 8-bit quantities specified by the 16 integers.                            | Both    |
| (vector signed char)(signed int)                                    | A set of 16 signed 8-bit quantities that all have the value specified by the integer.          | Both    |
| (vector signed char)(signed int, ..., signed int)                   | A set of 16 signed 8-bit quantities specified by the 16 integers.                              | Both    |
| (vector unsigned short)(unsigned int)                               | A set of 8 unsigned 16-bit quantities that all have the value specified by the integer.        | Both    |
| (vector unsigned short)(unsigned int, ..., unsigned int)            | A set of 8 unsigned 16-bit quantities specified by the 8 integers.                             | Both    |
| (vector signed short)(signed int)                                   | A set of 8 signed 16-bit quantities that all have the value specified by the integer.          | Both    |
| (vector signed short)(signed int, ..., signed int)                  | A set of 8 signed 16-bit quantities specified by the 8 integers.                               | Both    |
| (vector unsigned int)(unsigned int)                                 | A set of 4 unsigned 32-bit quantities that all have the value specified by the integer.        | Both    |
| (vector unsigned int)(unsigned int, ..., unsigned int)              | A set of 4 unsigned 32-bit quantities specified by the 4 integers.                             | Both    |
| (vector signed int)(signed int)                                     | A set of 4 signed 32-bit quantities that all have the value specified by the integer.          | Both    |
| (vector signed int)(signed int, ..., signed int)                    | A set of 4 signed 32-bit quantities specified by the 4 integers.                               | Both    |
| (vector unsigned long long)(unsigned long long)                     | A set of 2 unsigned 64-bit quantities that all have the value specified by the long integer.   | SPU     |
| (vector unsigned long long)(unsigned long long, unsigned long long) | A set of 2 unsigned 64-bit quantities specified by the 2 long integers.                        | SPU     |
| (vector signed long long)(signed long long)                         | A set of 2 signed 64-bit quantities that all have the value specified by the long integer.     | SPU     |
| (vector signed long long)(signed long long, signed long long)       | A set of 2 signed 64-bit quantities specified by the 2 long integers.                          | SPU     |
| (vector float)(float)                                               | A set of 4 32-bit floating-point quantities that all have the value specified by the float.    | Both    |
| (vector float)(float, float, float, float)                          | A set of 4 32-bit floating-point quantities specified by the 4 floats.                         | Both    |
| (vector double)(double)                                             | A set of 2 64-bit double-precision quantities that all have the value specified by the double. | SPU     |
| (vector double)(double, double)                                     | A set of 2 64-bit quantities specified by the 2 doubles.                                       | SPU     |

# 1.5. Restrict Type Qualifier

The restrict type qualifier, which is specified in the C99 language specification, is intended to help the compiler generate better code by ensuring that all access to a given object is obtained through a particular pointer. When a pointer uses the restrict type qualifier, the pointer is restrict-qualified. For example:

```
void *memcpy(void * restrict s1, const void * restrict s2, size_t n);
```

In the above prototype, both pointers, s1 and s2, are restrict-qualified. Therefore, the compiler can safely assume that the source and destination objects will not overlap, allowing for a more efficient implementation.
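The same idea applies to user-written functions. The following minimal sketch in standard C99 (the function name `scale_floats` is hypothetical) restrict-qualifies both pointers, so the compiler may assume that the destination and source arrays do not overlap and can reorder or vectorize the loads and stores accordingly:

```c
#include <stddef.h>

/* dst and src are restrict-qualified: the caller promises that the two
   arrays do not overlap, so loads and stores may be reordered freely. */
static void scale_floats(float *restrict dst, const float *restrict src,
                         float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

Calling such a function with overlapping arrays breaks the promise and results in undefined behavior.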

# 1.6. SPU Programmer Directed Branch Prediction

Branch prediction can be significantly improved by using feedback-directed optimization. However, feedback-directed optimization is not always practical in situations where typical data sets do not exist. Instead, on the SPU, programmer-directed branch prediction is provided using an enhanced version of GCC's \_\_builtin\_expect function.

```
int __builtin_expect(int exp, int value)
```

Programmers can use \_\_builtin\_expect to provide the compiler with branch prediction information. The return value of \_\_builtin\_expect is the value of the <code>exp</code> argument, which has to be an integral expression. For dynamic prediction, the <code>value</code> argument can be either a compile-time constant or a variable. The \_\_builtin\_expect function assumes that <code>exp</code> equals <code>value</code>.



```
if (__builtin_expect(cond1, cond2)) {
    /* ... */
}
cond2 = cond1;  /* predict that next branch is the same as the
                   previous */
```
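For comparison, the following sketch shows static prediction with a compile-time constant, which GCC and Clang accept on ordinary hosts; the variable-valued form described above is the SPU enhancement and is not assumed here. The function `sum_checked` is hypothetical; the error branch is marked as expected to be false:

```c
#include <stddef.h>

/* Sum non-negative values; a negative value is treated as a rare error,
   so the branch is annotated as expected to be false (0). */
static int sum_checked(const int *a, size_t n, int *out)
{
    int s = 0;
    for (size_t i = 0; i < n; i++) {
        if (__builtin_expect(a[i] < 0, 0))   /* predicted not taken */
            return -1;
        s += a[i];
    }
    *out = s;
    return 0;
}
```

The hint affects only code layout and prediction, never the result: __builtin\_expect returns the value of its first argument.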

Compilers may require limiting the complexity of the expression argument because multiple branches could be generated. When this situation occurs, the compiler has to issue a warning if the program's branch expectations are ignored.

Implementation of this extension is not required for the PPU, because the PPU supports only static prediction for branches.

# 1.7. Inline Assembly

Occasionally, a programmer might not be able to achieve the desired low-level programming result by using only C/C++ language constructs and intrinsic functions. To handle these situations, inline assembly might be necessary, and therefore compilers have to provide it. The inline assembly syntax has to match the AT&T assembly syntax implemented by GCC.

The .balignl directive may be used within the inline assembly to ensure the known alignment that is needed to achieve effective dual-issue by the hardware.

# 1.8. Target Definitions

Compilers must define \_\_SPU\_\_ when code is being compiled for the SPU, and \_\_PPU\_\_ when code is being compiled for the PPU. The availability of these definitions enables the development of code that can be conditionally compiled for either target.

As an example, the following code supports misaligned quadword loads. The \_\_SPU\_\_ and \_\_PPU\_\_ defines are used to conditionally select which code to use. The code that is selected will be different depending on the processor target.

```
vector unsigned char load_qword_unaligned(vector unsigned char *ptr)
{
      vector unsigned char qw0, qw1, qw;
#ifdef __SPU__
      unsigned int shift;
#endif
      qw0 = *ptr;
      qw1 = *(ptr+1);
#ifdef __SPU__
      shift = (unsigned int)(ptr) & 15;
      qw = spu_or(spu_slqwbyte(qw0, shift),
                   spu_rlmaskqwbyte(qw1, (signed)(shift - 16)));
#elif defined(__PPU__) /* PPU */
      qw = vec_perm(qw0, qw1, vec_lvsl(0, ptr));
#else
# error "This code can only be compiled for PPU or the SPU"
#endif
      return (qw);
}
```

When compiling for an SPU implementation that supports the optional enhanced double-precision instructions, \_\_SPU\_EDP\_\_ will also be defined. The enhanced double-precision instructions include DFCEQ, DFCMEQ, DFCMGT, and DFTSV.
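The three macros can be combined in a single conditional. The following sketch, built around a hypothetical `target_name` function, reports which target the code was compiled for. On an ordinary host compiler neither \_\_SPU\_\_ nor \_\_PPU\_\_ is defined, so the final branch is taken instead of an `#error`:

```c
/* Select a description of the compilation target at compile time. */
static const char *target_name(void)
{
#if defined(__SPU__)
#  if defined(__SPU_EDP__)
    return "SPU (enhanced double precision)";
#  else
    return "SPU";
#  endif
#elif defined(__PPU__)
    return "PPU";
#else
    return "other";
#endif
}
```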



# 2. SPU Low-Level Specific and Generic Intrinsics

This chapter describes the minimal set of basic intrinsics and built-ins that make the underlying Instruction Set Architecture (ISA) and Synergistic Processor Element (SPE) hardware accessible from the C programming language. There are three types of intrinsics:

- Specific
- Generic
- Built-ins

Intrinsics may be implemented either internally within the compiler or as macros. However, if an intrinsic is implemented as a macro, restrictions apply with respect to vector literals being passed as arguments. For more details, see section "1.4.6. Vector Literals".

The instruction set may vary among SPU implementations. If an instruction is not supported by the SPU implementation for which the intrinsic is being compiled, special handling shall occur. For specific intrinsics, an error is generated if the targeted SPU does not support the corresponding instruction. For generic intrinsics, an alternate instruction mapping will be generated that achieves an equivalent operation.

Throughout this section, intrinsics which may generate special handling are indicated by a dagger (†).

# 2.1. Specific Intrinsics

Specific intrinsics are *specific* in the sense that they have a one-to-one mapping with a single SPU assembly instruction. All specific intrinsics are named using the SPU assembly instruction prefixed by the string si\_. For example, the specific intrinsic that implements the stop assembly instruction is named si\_stop.

A specific intrinsic exists for nearly every assembly instruction. However, the functionality provided by several of the assembly instructions is better provided by the C/C++ language; therefore, for these instructions no specific intrinsic has been provided. Table 2-9 describes the assembly instructions that have no corresponding specific intrinsic.

Table 2-9: Assembly Instructions for which No Specific Intrinsic Exists

| Instruction Type              | SPU Instructions                                                                                                                                                                            |
|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Branch Instructions           | br, bra, brsl, brasl, bi, bid, bie, bisl, bisld, bisle, brnz, brz, brhnz, brhz, bizd, bize, binz, binzd, binze, bihz, bihzd, bihze, bihnzd, and bihnze (excluding bisled, bisledd, bislede) |
| Branch Hint Instructions      | hbr, hbrp, hbra, and hbrr                                                                                                                                                                   |
| Interrupt Return Instructions | iret, iretd, and irete                                                                                                                                                                      |

All specific intrinsics are accessible through generic intrinsics, except for the specific intrinsics shown in Table 2-10. The intrinsics that are not accessible fall into three categories:

- Instructions that are generated using basic variable referencing (that is, using vector and scalar loads and stores)
- Instructions that are used for immediate vector construction
- Instructions that have limited usefulness and are not expected to be used except in rare conditions



Table 2-10: Specific Intrinsics Not Accessible Through Generic Intrinsics

| Instruction/Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | Usage                                      | Assembly Mapping |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------|------------------|
| Generate Controls for Sub-Quadword Insertion                                                                                                                                                                                                                                                                                                                                                                                                                                                       |                                            |                  |
| si_cbd: Generate Controls for Byte Insertion (d-form)                                                                                                                                                                                                                                                                                                                                                                                                                                              |                                            |                  |
| An effective address is computed by adding the value in the signed 7-bit immediate imm to word element 0 of a. The rightmost 4 bits of the effective address are used to determine the position of the addressed byte within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a byte (byte element 3) at the indicated position within a quadword. The pattern is returned in quadword d. | d = si_cbd(a, imm) | CBD d, imm(a) |
| si_cbx: Generate Controls for Byte Insertion (x-form) | | |
| An effective address is computed by adding the value of word element 0 of a to word element 0 of b. The rightmost 4 bits of the effective address are used to determine the position of the addressed byte within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a byte (byte element 3) at the indicated position within a quadword. The pattern is returned in quadword d. | d = si_cbx(a, b) | CBX d, a, b |
| si_cdd: Generate Controls for Doubleword Insertion (d-form) | | |
| An effective address is computed by adding the value in the signed 7-bit immediate imm to word element 0 of a. The rightmost 4 bits of the effective address are used to determine the position of the addressed doubleword within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a doubleword (doubleword element 0) at the indicated position within a quadword. The pattern is returned in quadword d. | d = si_cdd(a, imm) | CDD d, imm(a) |
| si_cdx: Generate Controls for Doubleword Insertion (x-form) | | |
| An effective address is computed by adding the value of word element 0 of a to word element 0 of b. The rightmost 4 bits of the effective address are used to determine the position of the addressed doubleword within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a doubleword (doubleword element 0) at the indicated position within a quadword. The pattern is returned in quadword d. | d = si_cdx(a, b) | CDX d, a, b |
| si_chd: Generate Controls for Halfword Insertion (d-form) | | |
| An effective address is computed by adding the value in the signed 7-bit immediate imm to word element 0 of a. The rightmost 4 bits of the effective address are used to determine the position of the addressed halfword within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a halfword (halfword element 1) at the indicated position within a quadword. The pattern is returned in quadword d. | d = si_chd(a, imm) | CHD d, imm(a) |
| si_chx: Generate Controls for Halfword Insertion (x-form) | | |
| An effective address is computed by adding the value of word element 0 of a to word element 0 of b. The rightmost 4 bits of the effective address are used to determine the position of the addressed halfword within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a halfword (halfword element 1) at the indicated position within a quadword. The pattern is returned in quadword d. | d = si_chx(a, b) | CHX d, a, b |



| Instruction/Description                                                                                                                                                                                                                                                                                                                                                                                                                                              | Usage               | Assembly Mapping       |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|------------------------|
| si_cwd: Generate Controls for Word Insertion (d-form)                                                                                                                                                                                                                                                                                                                                                                                                                |                     |                        |
| An effective address is computed by adding the value in the signed 7-bit immediate imm to word element 0 of a. The rightmost 4 bits of the effective address are used to determine the position of the addressed word within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a word (word element 0) at the indicated position within a quadword. The pattern is returned in quadword d. | d = si_cwd(a, imm) | CWD d, imm(a) |
| si_cwx: Generate Controls for Word Insertion (x-form) | | |
| An effective address is computed by adding the value of word element 0 of a to word element 0 of b. The rightmost 4 bits of the effective address are used to determine the position of the addressed word within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a word (word element 0) at the indicated position within a quadword. The pattern is returned in quadword d. | d = si_cwx(a, b) | CWX d, a, b |
| Constant Formation Intrinsics | | |
| si_il: Immediate Load Word | | |
| The 16-bit signed immediate value imm is sign-extended to 32 bits and placed into each of the 4 word elements of quadword d. | d = si_il(imm) | IL d, imm |
| si_ila: Immediate Load Address | | |
| The 18-bit immediate value imm is placed in the rightmost bits of each of the 4 word elements of quadword d. The upper 14 bits of each word are set to 0. | d = si_ila(imm) | ILA d, imm |
| si_ilh: Immediate Load Halfword | | |
| The 16-bit signed immediate value imm is placed in each of the 8 halfword elements of quadword d. | d = si_ilh(imm) | ILH d, imm |
| si_ilhu: Immediate Load Halfword Upper | | |
| The 16-bit signed immediate value imm is placed into the leftmost 16 bits of each of the 4 word elements of quadword d. The rightmost 16 bits are set to 0. | d = si_ilhu(imm) | ILHU d, imm |
| si_iohl: Immediate Or Halfword Lower | | |
| The 16-bit immediate value imm is prepended with zeros and ORed with each of the 4 word elements of quadword a. The result is returned in quadword d. | d = si_iohl(a, imm) | rt ← a<br>IOHL rt, imm<br>d ← rt |
| No Operation Intrinsics | | |
| si_lnop: No Operation (load) | | |
| A no-operation is performed on the load pipeline. | si_lnop() | LNOP |
| si_nop: No Operation (execute) | | |
| A no-operation is performed on the execute pipeline. | si_nop() | NOP rt<sup>1</sup> |
| Memory Load and Store Intrinsics | | |
| si_lqa: Load Quadword (a-form) | | |
| An effective address is determined by the sign-extended 18-bit value imm, with the 4 least significant bits forced to zero. The quadword at this effective address is returned in quadword d. | d = si_lqa(imm) | LQA d, imm |
| si_lqd: Load Quadword (d-form) | | |
| An effective address is computed by zeroing the 4 least significant bits of the sign-extended 14-bit immediate value imm, adding imm to word element 0 of quadword a, and forcing the 4 least significant bits of the result to zero. The quadword at this effective address is then returned in quadword d. | d = si_lqd(a, imm) | LQD d, imm(a) |



| Instruction/Description                                                                                                                                                                                                                                                                                                                                                     | Usage              | Assembly Mapping |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|------------------|
| si_lqr: Load Quadword Instruction Relative (a-form) | | |
| An effective address is computed by forcing the 2 least significant bits of the signed 18-bit immediate value imm to zero, adding this value to the address of the instruction, and forcing the 4 least significant bits of the result to zero. The quadword at this effective address is then returned in quadword d. | d = si_lqr(imm) | LQR d, imm |
| si_lqx: Load Quadword (x-form) | | |
| An effective address is computed by adding word element 0 of quadword a to word element 0 of quadword b and forcing the 4 least significant bits to zero. The quadword at this effective address is then returned in quadword d. | d = si_lqx(a, b) | LQX d, a, b |
| si_stqa: Store Quadword (a-form) | | |
| An effective address is determined by the sign-extended 18-bit value imm, with the 4 least significant bits forced to zero. The quadword a is stored at this effective address. | si_stqa(a, imm) | STQA a, imm |
| si_stqd: Store Quadword (d-form) | | |
| An effective address is computed by zeroing the 4 least significant bits of the sign-extended 14-bit immediate value imm, adding imm to word element 0 of quadword b, and forcing the 4 least significant bits to zero. The quadword a is then stored at this effective address. | si_stqd(a, b, imm) | STQD a, imm(b) |
| si_stqr: Store Quadword Instruction Relative (a-form) | | |
| An effective address is computed by forcing the 2 least significant bits of the signed 18-bit immediate value imm to zero, adding this value to the address of the instruction, and forcing the 4 least significant bits of the result to zero. The quadword a is then stored at this effective address. | si_stqr(a, imm) | STQR a, imm |
| si_stqx: Store Quadword (x-form) | | |
| An effective address is computed by adding word element 0 of quadword b to word element 0 of quadword c and forcing the 4 least significant bits to zero. The quadword a is then stored at this effective address. | si_stqx(a, b, c) | STQX a, b, c |
| Control Intrinsics | | |
| si_stopd: Stop and Signal with Dependencies | | |
| Execution of the SPU is stopped and a signal type of 0x3FFF is delivered after all register dependencies are met. This intrinsic is considered volatile with respect to all instructions and will not be reordered with any other instructions. | si_stopd(a, b, c) | STOPD a, b, c |

<sup>1</sup> The false target parameter rt is optimally chosen depending on the register usage of neighboring instructions.
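The generate-controls intrinsics in Table 2-10 are typically combined with si_shufb and the quadword load and store intrinsics to update a single scalar in local store. The sequence is: load the containing quadword, build the insertion pattern, shuffle the value in, and store the quadword back.

The sketch below is a host-side model in portable C of what these instructions compute. It does not use the SPU intrinsics. It assumes the SPU ISA shuffle semantics:

- Pattern bytes 0x00 to 0x0F select from the first operand.
- Pattern bytes 0x10 to 0x1F select from the second operand.
- Bytes 0x80 and above produce the constants 0x00, 0xFF, or 0x80.

The names quad, gen_insert_controls, shufb, and store_byte are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Host-side model of a quadword: 16 bytes, element 0 leftmost (big-endian). */
typedef struct { uint8_t b[16]; } quad;

/* Model of CBD/CHD/CWD/CDD: size is 1, 2, 4, or 8 bytes. */
static quad gen_insert_controls(uint32_t ea, unsigned size)
{
    quad pat;
    unsigned pos = (ea & 0xFu) & ~(size - 1u); /* aligned slot within the quadword */
    unsigned pref = size < 4 ? 4 - size : 0;   /* source element: byte 3, halfword 1, word/doubleword 0 */
    for (unsigned i = 0; i < 16; i++)
        pat.b[i] = (uint8_t)(0x10 + i);        /* keep bytes of the second operand */
    for (unsigned k = 0; k < size; k++)
        pat.b[pos + k] = (uint8_t)(pref + k);  /* take bytes from the first operand */
    return pat;
}

/* Model of SHUFB: each control byte selects from the 32-byte concatenation a:b. */
static quad shufb(quad a, quad b, quad c)
{
    quad r;
    for (unsigned i = 0; i < 16; i++) {
        uint8_t s = c.b[i];
        if ((s & 0xC0) == 0x80)      r.b[i] = 0x00;
        else if ((s & 0xE0) == 0xC0) r.b[i] = 0xFF;
        else if ((s & 0xE0) == 0xE0) r.b[i] = 0x80;
        else r.b[i] = (s & 0x10) ? b.b[s & 0x0F] : a.b[s & 0x0F];
    }
    return r;
}

/* Read-modify-write byte store, modelling LQD, CBD, SHUFB, STQD on a
   hypothetical local-store image ls. */
static void store_byte(uint8_t *ls, uint32_t ea, uint8_t value)
{
    uint32_t qa = ea & ~0xFu;                   /* quadword loads/stores ignore the low 4 bits */
    quad q, v = {{0}};
    memcpy(q.b, ls + qa, 16);                   /* LQD: load the containing quadword */
    v.b[3] = value;                             /* value in byte element 3 */
    q = shufb(v, q, gen_insert_controls(ea, 1)); /* CBD + SHUFB: merge the byte */
    memcpy(ls + qa, q.b, 16);                   /* STQD: write the quadword back */
}
```

On an SPU, the same store can be written roughly as q = si_lqd(p, 0); q = si_shufb(v, q, si_cbd(p, 0)); si_stqd(q, p, 0);. Here p holds the address in word element 0 and v holds the byte in byte element 3.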

Specific intrinsics accept only the following types of arguments:

- Immediate literals, as an explicit constant expression or as a symbolic address
- Enumerations
- qword arguments

Arguments of other types must be cast to qword.
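The immediate-literal arguments have limited width. Per Table 2-10, si_il, si_ilh, si_ilhu, and si_iohl take 16 bits and si_ila takes 18 bits. A full 32-bit constant is therefore commonly formed from two intrinsics, for example qword k = si_iohl(si_ilhu(0x1234), 0x5678);.

The sketch below models one word element of each constant-formation intrinsic in portable C, so the arithmetic can be checked on a host. The model_* and form_constant names are hypothetical. The real intrinsics replicate the result into all four word elements.

```c
#include <stdint.h>

/* Host-side models of one word element of each intrinsic's result. */
static uint32_t model_il(int16_t imm)    { return (uint32_t)(int32_t)imm; }   /* sign-extend */
static uint32_t model_ila(uint32_t imm)  { return imm & 0x3FFFFu; }           /* 18 bits, upper 14 zero */
static uint32_t model_ilhu(uint16_t imm) { return (uint32_t)imm << 16; }      /* upper halfword, lower zero */
static uint32_t model_iohl(uint32_t a, uint16_t imm) { return a | imm; }      /* OR zero-extended imm */

/* Build an arbitrary 32-bit constant as an ilhu/iohl pair would. */
static uint32_t form_constant(uint32_t value)
{
    return model_iohl(model_ilhu((uint16_t)(value >> 16)),
                      (uint16_t)(value & 0xFFFFu));
}
```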

For complete details on the specific instructions, see the Synergistic Processor Unit Instruction Set Architecture.

#### 2.1.1. Specific Casting Intrinsics

When using specific intrinsics, it may be necessary to cast from scalar types to the qword data type, or from the qword data type to scalar types. Similar to casting between vector data types, specific cast intrinsics have no effect on an argument that is stored in a register. All specific casting intrinsics are of the following form:

d = casting\_intrinsic(a)



See Table 2-11 for additional details about the specific casting intrinsics.

Table 2-11: Specific Casting Intrinsics

| Casting Intrinsic | d                  | a                  | Description                                                    |
|-------------------|--------------------|--------------------|----------------------------------------------------------------|
| si_to_char        | signed char        | qword              | Cast byte element 3 of qword a to signed char d.               |
| si_to_uchar       | unsigned char      | qword              | Cast byte element 3 of qword a to unsigned char d.             |
| si_to_short       | short              | qword              | Cast halfword element 1 of qword a to short d.                 |
| si_to_ushort      | unsigned short     | qword              | Cast halfword element 1 of qword a to unsigned short d.        |
| si_to_int         | int                | qword              | Cast word element 0 of qword a to int d.                       |
| si_to_uint        | unsigned int       | qword              | Cast word element 0 of qword a to unsigned int d.              |
| si_to_ptr         | void *             | qword              | Cast word element 0 of qword a to a void pointer d.            |
| si_to_llong       | long long          | qword              | Cast doubleword element 0 of qword a to long long d.           |
| si_to_ullong      | unsigned long long | qword              | Cast doubleword element 0 of qword a to unsigned long long d.  |
| si_to_float       | float              | qword              | Cast word element 0 of qword a to float d.                     |
| si_to_double      | double             | qword              | Cast doubleword element 0 of qword a to double d.              |
| si_from_char      | qword              | signed char        | Cast signed char a to byte element 3 of qword d.               |
| si_from_uchar     | qword              | unsigned char      | Cast unsigned char a to byte element 3 of qword d.             |
| si_from_short     | qword              | short              | Cast short a to halfword element 1 of qword d.                 |
| si_from_ushort    | qword              | unsigned short     | Cast unsigned short a to halfword element 1 of qword d.        |
| si_from_int       | qword              | int                | Cast int a to word element 0 of qword d.                       |
| si_from_uint      | qword              | unsigned int       | Cast unsigned int a to word element 0 of qword d.              |
| si_from_ptr       | qword              | void *             | Cast void pointer a to word element 0 of qword d.              |
| si_from_llong     | qword              | long long          | Cast long long a to doubleword element 0 of qword d.           |
| si_from_ullong    | qword              | unsigned long long | Cast unsigned long long a to doubleword element 0 of qword d.  |
| si_from_float     | qword              | float              | Cast float a to word element 0 of qword d.                     |
| si_from_double    | qword              | double             | Cast double a to doubleword element 0 of qword d.              |

Because the casting intrinsics do not perform data conversion, casting from a scalar type to a *qword* type results in portions of the quadword being undefined.
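The element positions in Table 2-11 all fall inside word element 0, the leftmost word of the quadword (the SPU's preferred slot). The following portable C sketch, which is not SPU code, models a few of these casts. The type `qword_model` and the `model_*` functions are hypothetical; the model stores the quadword big-endian (byte 0 is the most significant byte) and zero-fills the bytes that the real casts leave undefined, purely so that its output is deterministic.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit quadword; b[0] is the most
   significant byte, as on the big-endian SPU. */
typedef struct { uint8_t b[16]; } qword_model;

/* Models si_from_int: a goes into word element 0 (bytes 0..3).
   Bytes 4..15 are undefined on the SPU; zeroed here for determinism. */
static qword_model model_from_int(int32_t a) {
    qword_model q;
    uint32_t u = (uint32_t)a;
    memset(q.b, 0, sizeof q.b);
    q.b[0] = (uint8_t)(u >> 24);
    q.b[1] = (uint8_t)(u >> 16);
    q.b[2] = (uint8_t)(u >> 8);
    q.b[3] = (uint8_t)u;
    return q;
}

/* Models si_from_char: a goes into byte element 3. */
static qword_model model_from_char(signed char a) {
    qword_model q;
    memset(q.b, 0, sizeof q.b);
    q.b[3] = (uint8_t)a;
    return q;
}

/* Models si_to_int: reads word element 0 (bytes 0..3). */
static int32_t model_to_int(qword_model q) {
    uint32_t u = ((uint32_t)q.b[0] << 24) | ((uint32_t)q.b[1] << 16) |
                 ((uint32_t)q.b[2] << 8) | (uint32_t)q.b[3];
    return (int32_t)u;
}

/* Models si_to_short: reads halfword element 1 (bytes 2..3). */
static int16_t model_to_short(qword_model q) {
    return (int16_t)(uint16_t)(((unsigned)q.b[2] << 8) | q.b[3]);
}
```

Because halfword element 1 and byte element 3 lie inside word element 0, `model_to_short(model_from_int(0x12345678))` yields the low halfword, 0x5678, in this model.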

# 2.2. Generic Intrinsics and Built-ins

Generic intrinsics are operations that map to one or more specific intrinsics. The mapping of a generic intrinsic to a specific intrinsic depends on the input arguments to the intrinsic. Built-ins are similar to generic intrinsics; however, unlike generic intrinsics, built-ins map to more than one SPU instruction. All generic intrinsics and built-ins are prefixed by the string spu\_. For example, the generic intrinsic that implements the stop assembly instruction is named spu\_stop.

#### 2.2.1. Mapping Intrinsics with Scalar Operands

Intrinsics with scalar arguments are introduced for SPU instructions with immediate fields. For example, the intrinsic function vector signed int spu\_add(vector signed int, int) translates to an AI (Add Word Immediate) assembly instruction when the scalar operand is a literal that fits the instruction's immediate field.

Depending on the assembly instruction, immediate values are either 7, 10, 16, or 18 bits in length. The action performed for out-of-range immediate values depends on the type of intrinsic. By default, immediate-form specific intrinsics with an out-of-range immediate value are flagged as an error. Compilers may provide an option to issue a warning for out-of-range immediate values and use only the specified number of least significant bits for the out-of-range argument.
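The range check and the "least significant bits" rule can be sketched in C as follows. This is a hypothetical model, not compiler code: `fits_signed` and `truncate_signed` are illustrative names, and the sketch assumes a signed immediate field such as the 10-bit field of AI. An unsigned field (for example, the 18-bit field of ILA) would be zero-extended instead of sign-extended.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if v fits in an n-bit signed immediate field (1 <= n <= 63). */
static bool fits_signed(int64_t v, unsigned n) {
    int64_t lo = -((int64_t)1 << (n - 1));
    int64_t hi = ((int64_t)1 << (n - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Keeps only the n least significant bits of v and sign-extends them,
   modeling how an out-of-range value lands in a signed n-bit field. */
static int64_t truncate_signed(int64_t v, unsigned n) {
    uint64_t mask = ((uint64_t)1 << n) - 1;
    uint64_t low = (uint64_t)v & mask;          /* low n bits */
    uint64_t sign = (uint64_t)1 << (n - 1);     /* sign bit of the field */
    return (int64_t)(low ^ sign) - (int64_t)sign;
}
```

For a 10-bit field, 511 fits but 512 does not, and truncating 1000 keeps the bit pattern 0x3E8, which reads back as -24.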

Generic intrinsics support the full range of scalar operands, regardless of whether the scalar operand can be represented within the instruction's immediate field. Consider the following example:

```
d = spu_and(a, b);   /* a: vector unsigned int, b: int */
```

Depending on argument b, different instructions are generated:

- If *b* is a literal constant within the range supported by one of the immediate forms, the immediate instruction form is generated. For example, if *b* equals 1, then ANDI d, a, 1 is generated.
- If b is a literal constant that is out of range but can be folded and implemented using an alternate immediate instruction form, the alternate immediate instruction is generated. For example, if b equals 0x30003, then ANDHI d, a, 3 is generated. In this context, "alternate immediate instruction form" means an immediate instruction form having a smaller data element size.
- If b is a literal constant that can be constructed using one or two immediate load instructions followed by the non-immediate form of the instruction, the appropriate instructions will be used. Immediate load instructions include IL, ILH, ILHU, ILA, IOHL, and FSMBI. Table 2-12 shows possible uses of the immediate load instructions for various constants b.

Table 2-12: Possible Uses of Immediate Load Instructions for Various Values of Constant b

| Constant b                       | Generates Instructions                  |
|----------------------------------|-----------------------------------------|
| -6000                            | IL b, -6000<br>AND d, a, b              |
| 131074 (0x20002)                 | ILH b, 2<br>AND d, a, b                 |
| 131072 (0x20000)                 | ILHU b, 2<br>AND d, a, b                |
| 134000 (0x20B70)                 | ILA b, 134000<br>AND d, a, b            |
| 262780 (0x4027C)                 | ILHU b, 4<br>IOHL b, 636<br>AND d, a, b |
| (0xFFFFFFFF, 0x0, 0x0, 0xFFFFFFFF) | FSMBI b, 0xF00F<br>AND d, a, b          |

- If b is a variable (non-literal) integer, the compiler generates code that splats the integer across the entire vector, followed by the non-immediate form of the instruction. For example, if b is an integer of unknown value, the shuffle pattern (0x00010203, 0x00010203, 0x00010203, 0x00010203) is stored in the constant area at "CONST\_AREA, offset", and the following instructions are generated:

```
LQD   pattern, CONST_AREA, offset
SHUFB b, b, b, pattern
AND   d, a, b
```
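The choice among the immediate-load sequences in Table 2-12 can be sketched as a classification of a 32-bit constant that is replicated in every word slot. The enum and function names are hypothetical, and the order of the checks is an assumption chosen so that the sketch reproduces every integer row of Table 2-12; an actual compiler may apply different rules, and the FSMBI case is not modeled.

```c
#include <stdint.h>

/* Hypothetical labels for the sequences that appear in Table 2-12. */
enum load_seq { SEQ_IL, SEQ_ILH, SEQ_ILHU, SEQ_ILA, SEQ_ILHU_IOHL };

/* Picks an immediate-load sequence for a 32-bit constant v. */
static enum load_seq choose_load_sequence(int32_t v) {
    uint32_t u = (uint32_t)v;
    uint32_t hi = u >> 16;
    uint32_t lo = u & 0xFFFFu;

    if (v >= -32768 && v <= 32767)
        return SEQ_IL;          /* fits the 16-bit signed field of IL */
    if (hi == lo)
        return SEQ_ILH;         /* same halfword in both halves */
    if (lo == 0)
        return SEQ_ILHU;        /* only the upper halfword is nonzero */
    if (u < (1u << 18))
        return SEQ_ILA;         /* fits the 18-bit unsigned field of ILA */
    return SEQ_ILHU_IOHL;       /* load upper halfword, then OR in lower */
}
```

Checking ILH and ILHU before ILA matters: 131074 and 131072 also fit in 18 bits, but Table 2-12 lists ILH and ILHU for them.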

#### 2.2.2. Implicit Conversion of Arguments of Intrinsics

There is no implicit conversion of arguments that have a vector type. Arguments of scalar type are converted according to the rules specified in the C/C++ standards. Consider, for example,

```
d = spu_insert(a, b, element);
```

Scalar a is inserted into the element of vector b that is specified by the element parameter. If b is a vector double, a is converted to double, element is converted to int, and the result d is a vector double.

#### 2.2.3. Notations and Conventions

The remaining documentation describing the generic intrinsics uses the following rules and naming conventions:



- The table associated with each generic intrinsic specifies the supported input types.
- For intrinsics with scalar operands, only the immediate form of the instruction is shown. The other forms can be deduced in accordance with the rules discussed in section "2.2.1. Mapping Intrinsics with Scalar Operands".
- Some intrinsics, whether specific or generic, map to assembly instructions that do not uniquely specify all input and output registers. Instead, an input register also serves as the output register. Examples of these assembly instructions include ADDX, DFMS, MPYHHA, and SFX. For these intrinsics, the notation rt ← c indicates that a register-to-register copy (copy c to rt) might be required to satisfy the semantics of the intrinsic, depending on the inputs and outputs. No copy is generated if input c is the same as output d.
- Generic intrinsics that do not map to specific intrinsics are identified by the acronym "N/A" (not applicable) in the Specific Intrinsics column of the respective table.

# 2.3. Constant Formation Intrinsics

#### spu\_splats: Splat Scalar to Vector

d = spu\_splats(a)

A single scalar value is replicated across all elements of a vector of the same type. The result is returned in vector d.
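For a non-literal scalar, Table 2-13 maps spu\_splats to SHUFB with a replicating pattern. The portable C sketch below models that for a 32-bit int held in the preferred slot. `model_shufb` and `model_splat_int` are hypothetical helpers; the model covers only pattern bytes below 0x80, which select bytes from the 32-byte concatenation of the two inputs, and ignores the special pattern values for which SHUFB produces constant bytes.

```c
#include <stdint.h>

/* Hypothetical model of SHUFB restricted to pattern bytes below 0x80:
   the low 5 bits of each pattern byte select one byte from the 32-byte
   concatenation of a (indices 0..15) and b (indices 16..31). */
static void model_shufb(const uint8_t a[16], const uint8_t b[16],
                        const uint8_t pattern[16], uint8_t d[16]) {
    for (int i = 0; i < 16; i++) {
        unsigned idx = pattern[i] & 0x1Fu;
        d[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
}

/* Splats an int held in the preferred slot (bytes 0..3) into all four
   word slots using the repeated pattern 0x00010203. */
static void model_splat_int(int32_t x, uint8_t d[16]) {
    uint8_t a[16] = {0};
    uint8_t pattern[16];
    uint32_t u = (uint32_t)x;

    a[0] = (uint8_t)(u >> 24);   /* big-endian: most significant first */
    a[1] = (uint8_t)(u >> 16);
    a[2] = (uint8_t)(u >> 8);
    a[3] = (uint8_t)u;
    for (int i = 0; i < 16; i++)
        pattern[i] = (uint8_t)(i % 4);   /* 00 01 02 03 00 01 02 03 ... */
    model_shufb(a, a, pattern, d);
}
```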

Table 2-13: Splat Scalar to Vector

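| d                         | a                            | Specific Intrinsics | Assembly Mapping |
|---------------------------|------------------------------|---------------------|------------------|
| vector unsigned char      | unsigned char                | N/A                 | SHUFB d, a, a, pattern |
| vector signed char        | signed char                  | N/A                 | SHUFB d, a, a, pattern |
| vector unsigned short     | unsigned short               | N/A                 | SHUFB d, a, a, pattern |
| vector signed short       | signed short                 | N/A                 | SHUFB d, a, a, pattern |
| vector unsigned int       | unsigned int                 | N/A                 | SHUFB d, a, a, pattern |
| vector signed int         | signed int                   | N/A                 | SHUFB d, a, a, pattern |
| vector unsigned long long | unsigned long long           | N/A                 | SHUFB d, a, a, pattern |
| vector signed long long   | signed long long             | N/A                 | SHUFB d, a, a, pattern |
| vector float              | float                        | N/A                 | SHUFB d, a, a, pattern |
| vector double             | double                       | N/A                 | SHUFB d, a, a, pattern |
| vector unsigned char      | unsigned char (literal)      | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector signed char        | signed char (literal)        | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector unsigned short     | unsigned short (literal)     | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector signed short       | signed short (literal)       | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector unsigned int       | unsigned int (literal)       | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector signed int         | signed int (literal)         | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector unsigned long long | unsigned long long (literal) | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector signed long long   | signed long long (literal)   | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector float              | float (literal)              | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |
| vector double             | double (literal)             | N/A                 | IL d, a<br>or ILA d, a<br>or ILH d, a&0xFFFF<br>or ILHU d, a>>16<br>or ILHU d, a>>16; IOHL d, a<br>or FSMBI d, a |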



# 2.4. Conversion Intrinsics

#### spu\_convtf: Convert Integer Vector to Vector Float

d = spu\_convtf(a, scale)

Each element of vector a is converted to a floating-point value and divided by  $2^{\text{scale}}$ . The allowable range for scale is 0 to 127. Values outside this range are flagged as an error and compilation is terminated. The result is returned in vector d.
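A per-element model of spu\_convtf in portable C follows. The function names are hypothetical. The model scales in double precision (where $2^{-\text{scale}}$ is exact for every allowed scale) and then rounds once to float using the C default rounding mode; the SPU's own rounding of values that are not exactly representable may differ, so treat the last bit as approximate.

```c
#include <stdint.h>

/* Computes 2^-scale exactly in double; scale is 0..127. */
static double pow2_neg(unsigned scale) {
    double f = 1.0;
    for (unsigned i = 0; i < scale; i++)
        f *= 0.5;                /* exact: halving a power of two */
    return f;
}

/* Models spu_convtf for one signed int element. */
static float model_convtf_s(int32_t a, unsigned scale) {
    return (float)((double)a * pow2_neg(scale));
}

/* Models spu_convtf for one unsigned int element. */
static float model_convtf_u(uint32_t a, unsigned scale) {
    return (float)((double)a * pow2_neg(scale));
}
```

With scale = 16 the conversion reads its input as 16.16 fixed point; for example, 98304 (1.5 × 65536) becomes 1.5f.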

Table 2-14: Convert Integer Vector to Vector Float

| d            | a                   | scale                        | Specific Intrinsics    | Assembly Mapping  |
|--------------|---------------------|------------------------------|------------------------|-------------------|
| vector float | vector unsigned int | unsigned int (7-bit literal) | d = si_cuflt(a, scale) | CUFLT d, a, scale |
| vector float | vector signed int   | unsigned int (7-bit literal) | d = si_csflt(a, scale) | CSFLT d, a, scale |

#### spu\_convts: Convert Vector Float to Signed Integer Vector

d = spu\_convts(a, scale)

Each element of vector a is scaled by $2^{\text{scale}}$, and the result is converted to a signed integer. If the intermediate result is greater than $2^{31}-1$, the result saturates to $2^{31}-1$. If the intermediate result is less than $-2^{31}$, the result saturates to $-2^{31}$. The allowable range for scale is 0 to 127. Values outside this range are flagged as an error and compilation is terminated. The results are returned in the corresponding elements of vector d.
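The scale-then-saturate behavior can be sketched per element in portable C. The helper names are hypothetical; the sketch assumes that in-range values are truncated toward zero, as a C float-to-integer conversion does, and it does not model NaN inputs.

```c
#include <stdint.h>

/* Computes 2^scale exactly in double; scale is 0..127. */
static double pow2(unsigned scale) {
    double f = 1.0;
    for (unsigned i = 0; i < scale; i++)
        f *= 2.0;
    return f;
}

/* Models spu_convts for one element: scale, saturate, then convert. */
static int32_t model_convts(float a, unsigned scale) {
    double x = (double)a * pow2(scale);   /* exact in double */
    if (x > 2147483647.0)
        return INT32_MAX;                 /* saturate at 2^31 - 1 */
    if (x < -2147483648.0)
        return INT32_MIN;                 /* saturate at -2^31 */
    return (int32_t)x;                    /* C conversion truncates toward zero */
}
```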

Table 2-15: Convert Vector Float to Signed Integer Vector

| d                 | a            | scale                        | Specific Intrinsics    | Assembly Mapping  |
|-------------------|--------------|------------------------------|------------------------|-------------------|
| vector signed int | vector float | unsigned int (7-bit literal) | d = si_cflts(a, scale) | CFLTS d, a, scale |

#### spu\_convtu: Convert Vector Float to Unsigned Integer Vector

d = spu\_convtu(a, scale)

Each element of vector a is scaled by $2^{\text{scale}}$, and the result is converted to an unsigned integer. If the intermediate result is greater than $2^{32}-1$, the result saturates to $2^{32}-1$. If the intermediate result is negative, the result saturates to zero. The allowable range for scale is 0 to 127. Values outside this range are flagged as an error and compilation is terminated. The results are returned in the corresponding elements of vector d.
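The unsigned variant differs only in its saturation bounds. As in the signed sketch, the names are hypothetical, in-range values are assumed to be truncated toward zero, and NaN inputs are not modeled.

```c
#include <stdint.h>

/* Computes 2^scale exactly in double; scale is 0..127. */
static double pow2(unsigned scale) {
    double f = 1.0;
    for (unsigned i = 0; i < scale; i++)
        f *= 2.0;
    return f;
}

/* Models spu_convtu for one element: scale, saturate, then convert. */
static uint32_t model_convtu(float a, unsigned scale) {
    double x = (double)a * pow2(scale);   /* exact in double */
    if (x > 4294967295.0)
        return UINT32_MAX;                /* saturate at 2^32 - 1 */
    if (x < 0.0)
        return 0;                         /* negative values saturate to zero */
    return (uint32_t)x;                   /* truncates toward zero */
}
```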

Table 2-16: Convert Vector Float to Unsigned Integer Vector

| d                   | a            | scale                        | Specific Intrinsics    | Assembly Mapping  |
|---------------------|--------------|------------------------------|------------------------|-------------------|
| vector unsigned int | vector float | unsigned int (7-bit literal) | d = si_cfltu(a, scale) | CFLTU d, a, scale |



#### spu\_extend: Extend Vector

d = spu\_extend(a)

For a fixed-point vector a, each odd element of vector a is sign-extended and returned in the corresponding element of vector d. For a floating-point vector a, each even element of a is extended to a double and returned in the corresponding element of d.
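The element selection can be modeled in portable C with plain arrays, numbering elements from the left (element 0 is the most significant), as in the rest of this document. The function names are hypothetical, and only the signed char and float rows of Table 2-17 are shown.

```c
#include <stdint.h>

/* Models spu_extend(vector signed char): halfword i of d receives the
   sign-extended odd byte element 2*i + 1 of a. */
static void model_extend_s8(const int8_t a[16], int16_t d[8]) {
    for (int i = 0; i < 8; i++)
        d[i] = (int16_t)a[2 * i + 1];
}

/* Models spu_extend(vector float): doubleword i of d receives the even
   float element 2*i of a, widened to double (always exact). */
static void model_extend_f32(const float a[4], double d[2]) {
    for (int i = 0; i < 2; i++)
        d[i] = (double)a[2 * i];
}
```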

Table 2-17: Extend Vector

| d                       | a                   | Specific Intrinsics | Assembly Mapping |
|-------------------------|---------------------|---------------------|------------------|
| vector signed short     | vector signed char  | d = si_xsbh(a)      | XSBH d, a        |
| vector signed int       | vector signed short | d = si_xshw(a)      | XSHW d, a        |
| vector signed long long | vector signed int   | d = si_xswd(a)      | XSWD d, a        |
| vector double           | vector float        | d = si_fesd(a)      | FESD d, a        |

#### spu\_roundtf: Round Vector Double to Vector Float

d = spu\_roundtf(a)

Each doubleword element of vector a is rounded to a single-precision floating-point value and placed in the corresponding even element of vector d. Zeros are placed in the odd elements of d.
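The resulting layout can be sketched in portable C. `model_roundtf` is a hypothetical name, and the sketch rounds with the current C rounding mode, which may differ from the mode the SPU applies.

```c
/* Models spu_roundtf: a[i] is rounded to float in element 2*i of d,
   and the odd elements of d are set to zero. */
static void model_roundtf(const double a[2], float d[4]) {
    for (int i = 0; i < 2; i++) {
        d[2 * i] = (float)a[i];   /* rounds using the C default mode */
        d[2 * i + 1] = 0.0f;
    }
}
```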

Table 2-18: Round Vector Double to Vector Float

| d            | a             | Specific Intrinsics | Assembly Mapping |
|--------------|---------------|---------------------|------------------|
| vector float | vector double | d = si_frds(a)      | FRDS d, a        |

# 2.5. Arithmetic Intrinsics

#### spu\_add: Vector Add

d = spu\_add(a, b)

Each element of vector a is added to the corresponding element of vector b. If b is a scalar, the scalar value is replicated for each element and then added to a. Overflows and carries are not detected, and no saturation is performed. The results are returned in the corresponding elements of vector d.
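The wraparound and scalar-replication behavior can be modeled in portable C for the vector unsigned short form. The function name is hypothetical.

```c
#include <stdint.h>

/* Models spu_add(vector unsigned short, unsigned short): the scalar b
   is replicated into all eight halfwords before the add. */
static void model_add_u16(const uint16_t a[8], uint16_t b, uint16_t d[8]) {
    for (int i = 0; i < 8; i++)
        d[i] = (uint16_t)(a[i] + b);   /* wraps modulo 2^16, no saturation */
}
```

Adding 2 to an element holding 0xFFFF therefore yields 1, not 0xFFFF.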

Table 2-19: Vector Add

| d                     | a                     | b                             | Specific Intrinsics | Assembly Mapping |
|-----------------------|-----------------------|-------------------------------|---------------------|------------------|
| vector signed int     | vector signed int     | vector signed int             | d = si_a(a, b)      | A d, a, b        |
| vector unsigned int   | vector unsigned int   | vector unsigned int           | d = si_a(a, b)      | A d, a, b        |
| vector signed short   | vector signed short   | vector signed short           | d = si_ah(a, b)     | AH d, a, b       |
| vector unsigned short | vector unsigned short | vector unsigned short         | d = si_ah(a, b)     | AH d, a, b       |
| vector signed int     | vector signed int     | 10-bit signed int (literal)   | d = si_ai(a, b)     | AI d, a, b       |
| vector unsigned int   | vector unsigned int   | 10-bit signed int (literal)   | d = si_ai(a, b)     | AI d, a, b       |
| vector signed int     | vector signed int     | int                           | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |  |
| vector unsigned int   | vector unsigned int   | unsigned int                  | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |  |
| vector signed short   | vector signed short   | 10-bit signed short (literal) | d = si_ahi(a, b)    | AHI d, a, b      |
| vector unsigned short | vector unsigned short | 10-bit signed short (literal) | d = si_ahi(a, b)    | AHI d, a, b      |
| vector signed short   | vector signed short   | short                         | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |  |
| vector unsigned short | vector unsigned short | unsigned short                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |  |
| vector float          | vector float          | vector float                  | d = si_fa(a, b)     | FA d, a, b       |
| vector double         | vector double         | vector double                 | d = si_dfa(a, b)    | DFA d, a, b      |

#### spu\_addx: Vector Add Extended

d = spu\_addx(a, b, c)

Each element of vector a is added to the corresponding element of vector b and to the least significant bit of the corresponding element of vector c. The result is returned in the corresponding element of vector d.

Table 2-20: Vector Add Extended

| d                   | a                   | b                   | c                   | Specific Intrinsics  | Assembly Mapping                  |
|---------------------|---------------------|---------------------|---------------------|----------------------|-----------------------------------|
| vector signed int   | vector signed int   | vector signed int   | vector signed int   | d = si_addx(a, b, c) | rt ← c<br>ADDX rt, a, b<br>d ← rt |
| vector unsigned int | vector unsigned int | vector unsigned int | vector unsigned int | d = si_addx(a, b, c) | rt ← c<br>ADDX rt, a, b<br>d ← rt |

#### spu\_genb: Vector Generate Borrow

d = spu\_genb(a, b)

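Each element of vector *b* is subtracted from the corresponding element of vector *a*. The resulting borrow out is placed in the least significant bit of the corresponding element of vector *d*. Following the SPU convention, this bit is 0 if a borrow occurs (the unsigned value of the element of *b* is greater than that of *a*) and 1 otherwise. The remaining bits of *d* are set to 0.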

Table 2-21: Vector Generate Borrow

| d                   | a                   | b                   | Specific Intrinsics | Assembly Mapping |
|---------------------|---------------------|---------------------|---------------------|------------------|
| vector signed int   | vector signed int   | vector signed int   | d = si_bg(b, a)     | BG d, b, a       |
| vector unsigned int | vector unsigned int | vector unsigned int | d = si_bg(b, a)     | BG d, b, a       |

#### spu\_genbx: Vector Generate Borrow Extended

d = spu\_genbx(a, b, c)

Each element of vector b is subtracted from the corresponding element of vector a. An additional 1 is subtracted from the result if the least significant bit of the corresponding element of vector c is 0. If the result is less than 0, a 0 is placed in the corresponding element of vector d; otherwise, a 1 is placed in the corresponding element of d.
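Chaining spu\_genb and spu\_genbx propagates a borrow across 32-bit words, which is enough to compare two 64-bit unsigned values. The scalar sketch below follows the SPU convention of the BG and BGX instructions, in which a 1 means that no borrow occurred (the difference is not negative) and a 0 means that a borrow occurred. The `model_*` names and `ge64` are hypothetical, and on the SPU the borrow word would first have to be moved into the word slot of the high half; this scalar model skips that step.

```c
#include <stdint.h>

/* Models spu_genb for one word slot: 1 when a - b does not borrow
   (a >= b as unsigned values), 0 when it does. */
static uint32_t model_genb(uint32_t a, uint32_t b) {
    return a >= b ? 1u : 0u;
}

/* Models spu_genbx for one word slot: an extra 1 is subtracted when
   the least significant bit of c is 0 (an incoming borrow). */
static uint32_t model_genbx(uint32_t a, uint32_t b, uint32_t c) {
    if (c & 1u)
        return a >= b ? 1u : 0u;   /* a - b >= 0 */
    return a > b ? 1u : 0u;        /* a - b - 1 >= 0 */
}

/* Hypothetical helper: returns 1 if x >= y as 64-bit unsigned values,
   feeding the low-word borrow into the high-word computation. */
static int ge64(uint64_t x, uint64_t y) {
    uint32_t lo = model_genb((uint32_t)x, (uint32_t)y);
    return (int)model_genbx((uint32_t)(x >> 32), (uint32_t)(y >> 32), lo);
}
```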

Table 2-22: Vector Generate Borrow Extended

| d                   | a                   | b                   | c                   | Specific Intrinsics | Assembly Mapping                 |
|---------------------|---------------------|---------------------|---------------------|---------------------|----------------------------------|
| vector signed int   | vector signed int   | vector signed int   | vector signed int   | d = si_bgx(b, a, c) | rt ← c<br>BGX rt, b, a<br>d ← rt |
| vector unsigned int | vector unsigned int | vector unsigned int | vector unsigned int | d = si_bgx(b, a, c) | rt ← c<br>BGX rt, b, a<br>d ← rt |



#### spu\_genc: Vector Generate Carry

d = spu\_genc(a, b)

Each element of vector a is added to the corresponding element of vector b. The resulting carry out is placed in the least significant bit of the corresponding element of vector d. The remaining bits of d are set to 0.
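A typical use of the carry is multiword addition: spu\_genc produces the carry out of the low words, and spu\_addx adds it into the high words. The scalar sketch below builds a 64-bit add from 32-bit operations; the names are hypothetical, and on the SPU the carry would first have to be moved into the word slot of the high half, a step this scalar model skips.

```c
#include <stdint.h>

/* Models spu_genc for one word slot: carry out of a + b. */
static uint32_t model_genc(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b) < a ? 1u : 0u;
}

/* Models spu_addx for one word slot: a + b + lsb(c), wrapping. */
static uint32_t model_addx(uint32_t a, uint32_t b, uint32_t c) {
    return a + b + (c & 1u);
}

/* Hypothetical helper: 64-bit add built from 32-bit halves. */
static uint64_t add64(uint64_t x, uint64_t y) {
    uint32_t xlo = (uint32_t)x, ylo = (uint32_t)y;
    uint32_t xhi = (uint32_t)(x >> 32), yhi = (uint32_t)(y >> 32);
    uint32_t lo = xlo + ylo;                   /* low words, wrapping */
    uint32_t carry = model_genc(xlo, ylo);     /* carry into the high word */
    uint32_t hi = model_addx(xhi, yhi, carry);
    return ((uint64_t)hi << 32) | lo;
}
```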

Table 2-23: Vector Generate Carry

| d                   | a                   | b                   | Specific Intrinsics | Assembly Mapping |
|---------------------|---------------------|---------------------|---------------------|------------------|
| vector signed int   | vector signed int   | vector signed int   | d = si_cg(a, b)     | CG d, a, b       |
| vector unsigned int | vector unsigned int | vector unsigned int | d = si_cg(a, b)     | CG d, a, b       |

#### spu\_gencx: Vector Generate Carry Extended

d = spu\_gencx(a, b, c)

Each element of vector a is added to the corresponding element of vector b and the least significant bit of the corresponding element of vector c. The resulting carry out is placed in the least significant bit of the corresponding element of vector d. The remaining bits of d are set to 0.

Table 2-24: Vector Generate Carry Extended

| d                   | a                   | b                   | c                   | Specific Intrinsics | Assembly Mapping                 |
|---------------------|---------------------|---------------------|---------------------|---------------------|----------------------------------|
| vector signed int   | vector signed int   | vector signed int   | vector signed int   | d = si_cgx(a, b, c) | rt ← c<br>CGX rt, a, b<br>d ← rt |
| vector unsigned int | vector unsigned int | vector unsigned int | vector unsigned int | d = si_cgx(a, b, c) | rt ← c<br>CGX rt, a, b<br>d ← rt |

#### spu\_madd: Vector Multiply and Add

d = spu\_madd(a, b, c)

Each element of vector a is multiplied by the corresponding element of vector b and added to the corresponding element of vector c. The result is returned in the corresponding element of vector d. For integer multiply-and-adds, the odd elements of vectors a and b are sign-extended to 32-bit integers prior to multiplication.
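The integer form can be modeled per word slot in portable C, with elements numbered from the left. The function name is hypothetical; the even halfwords of a and b are ignored, and the final addition wraps modulo $2^{32}$ in this model.

```c
#include <stdint.h>

/* Models spu_madd(vector signed short, vector signed short,
   vector signed int): word i of d = a[2i+1] * b[2i+1] + c[i]. */
static void model_madd_s16(const int16_t a[8], const int16_t b[8],
                           const int32_t c[4], int32_t d[4]) {
    for (int i = 0; i < 4; i++) {
        /* odd halfwords, sign-extended to 32 bits before multiplying */
        int32_t p = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        d[i] = (int32_t)((uint32_t)p + (uint32_t)c[i]);  /* wrapping add */
    }
}
```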

Table 2-25: Vector Multiply and Add

| d                 | a                   | b                   | c                 | Specific Intrinsics   | Assembly Mapping                  |
|-------------------|---------------------|---------------------|-------------------|-----------------------|-----------------------------------|
| vector signed int | vector signed short | vector signed short | vector signed int | d = si_mpya(a, b, c)  | MPYA d, a, b, c                   |
| vector float      | vector float        | vector float        | vector float      | d = si_fma(a, b, c)   | FMA d, a, b, c                    |
| vector double     | vector double       | vector double       | vector double     | d = si_dfma(a, b, c)  | rt ← c<br>DFMA rt, a, b<br>d ← rt |

#### spu\_mhhadd: Vector Multiply High High and Add

d = spu\_mhhadd(a, b, c)

Each even element of vector a is multiplied by the corresponding even element of vector b, the 32-bit result is added to the corresponding element of vector c, and the result is returned in the corresponding element of vector d.



Table 2-26: Vector Multiply High High and Add

| d                   | a                     | b                     | c                   | Specific Intrinsics     | Assembly Mapping                     |
|---------------------|-----------------------|-----------------------|---------------------|-------------------------|--------------------------------------|
| vector signed int   | vector signed short   | vector signed short   | vector signed int   | d = si_mpyhha(a, b, c)  | rt ← c<br>MPYHHA rt, a, b<br>d ← rt  |
| vector unsigned int | vector unsigned short | vector unsigned short | vector unsigned int | d = si_mpyhhau(a, b, c) | rt ← c<br>MPYHHAU rt, a, b<br>d ← rt |

#### spu\_msub: Vector Multiply and Subtract

d = spu\_msub(a, b, c)

Each element of vector a is multiplied by the corresponding element of vector b, and the corresponding element of vector c is subtracted from the product. The result is returned in the corresponding element of vector d.

Table 2-27: Vector Multiply and Subtract

| d             | a             | b             | c             | Specific Intrinsics  | Assembly Mapping                  |
|---------------|---------------|---------------|---------------|----------------------|-----------------------------------|
| vector float  | vector float  | vector float  | vector float  | d = si_fms(a, b, c)  | FMS d, a, b, c                    |
| vector double | vector double | vector double | vector double | d = si_dfms(a, b, c) | rt ← c<br>DFMS rt, a, b<br>d ← rt |

#### spu\_mul: Vector Multiply

d = spu\_mul(a, b)

Each element of vector a is multiplied by the corresponding element of vector b and returned in the corresponding element of vector d.

Table 2-28: Vector Multiply

| d             | a             | b             | Specific Intrinsics | Assembly Mapping |
|---------------|---------------|---------------|---------------------|------------------|
| vector float  | vector float  | vector float  | d = si_fm(a, b)     | FM d, a, b       |
| vector double | vector double | vector double | d = si_dfm(a, b)    | DFM d, a, b      |

#### spu\_mulh: Vector Multiply High

d = spu\_mulh(a, b)

Each even element of vector a is multiplied by the next (odd) element of vector b. The product is shifted left by 16 bits and stored in the corresponding element of vector d. Bits shifted out at the left are discarded. Zeros are shifted in at the right.

Table 2-29: Vector Multiply High

| d                 | a                   | b                   | Specific Intrinsics | Assembly Mapping |
|-------------------|---------------------|---------------------|---------------------|------------------|
| vector signed int | vector signed short | vector signed short | d = si_mpyh(a, b)   | MPYH d, a, b     |



#### spu\_mule: Vector Multiply Even

d = spu\_mule(a, b)

Each even element of vector a is multiplied by the corresponding even element of vector b, and the 32-bit result is returned in the corresponding element of vector d.

Table 2-30: Vector Multiply Even

| d                   | a                     | b                     | Specific Intrinsics | Assembly Mapping |
|---------------------|-----------------------|-----------------------|---------------------|------------------|
| vector signed int   | vector signed short   | vector signed short   | d = si_mpyhh(a, b)  | MPYHH d, a, b    |
| vector unsigned int | vector unsigned short | vector unsigned short | d = si_mpyhhu(a, b) | MPYHHU d, a, b   |

#### spu\_mulo: Vector Multiply Odd

d = spu\_mulo(a, b)

Each odd element of vector a is multiplied by the corresponding odd element of vector b. If b is a scalar, the scalar value is replicated for each element and then multiplied by a. The 32-bit products are returned in the corresponding elements of vector d.
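The SPU has no single 32 × 32-bit integer multiply, so spu\_mulo and spu\_mulh are commonly combined to form one. Writing $a = a_h 2^{16} + a_l$ and $b = b_h 2^{16} + b_l$, the product modulo $2^{32}$ is $a_l b_l + (a_h b_l + a_l b_h) 2^{16}$: one unsigned odd multiply plus two multiply-high results. The portable C sketch below models one 32-bit word slot; the names are hypothetical, and because only the low 16 bits of each spu\_mulh product survive the left shift, the sketch uses unsigned arithmetic even though Table 2-29 lists signed operands.

```c
#include <stdint.h>

/* Models spu_mulo on unsigned halfwords for one word slot:
   low halfword of a times low halfword of b (full 32-bit product). */
static uint32_t model_mulo_u(uint32_t a, uint32_t b) {
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* Models spu_mulh for one word slot: high halfword of a times low
   halfword of b, shifted left by 16 (bits shifted out are discarded). */
static uint32_t model_mulh(uint32_t a, uint32_t b) {
    return ((a >> 16) * (b & 0xFFFFu)) << 16;
}

/* Hypothetical helper: 32-bit product modulo 2^32 from 16-bit multiplies. */
static uint32_t mul32(uint32_t a, uint32_t b) {
    return model_mulo_u(a, b) + model_mulh(a, b) + model_mulh(b, a);
}
```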

Table 2-31: Vector Multiply Odd

| d                   | a                     | b                             | Specific Intrinsics | Assembly Mapping |
|---------------------|-----------------------|-------------------------------|---------------------|------------------|
| vector signed int   | vector signed short   | vector signed short           | d = si_mpy(a, b)    | MPY d, a, b      |
| vector signed int   | vector signed short   | 10-bit signed short (literal) | d = si_mpyi(a, b)   | MPYI d, a, b     |
| vector signed int   | vector signed short   | signed short                  | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |  |
| vector unsigned int | vector unsigned short | vector unsigned short         | d = si_mpyu(a, b)   | MPYU d, a, b     |
| vector unsigned int | vector unsigned short | 10-bit signed short (literal) | d = si_mpyui(a, b)  | MPYUI d, a, b    |
| vector unsigned int | vector unsigned short | unsigned short                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |  |

#### spu\_mulsr: Vector Multiply and Shift Right

d = spu\_mulsr(a, b)

Each odd element of vector *a* is multiplied by the corresponding odd element of vector *b*. The leftmost 16 bits of the resulting 32-bit product are sign-extended and returned in the corresponding 32-bit element of vector *d*.
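One element of spu\_mulsr can be modeled in portable C as follows. `model_mulsr` is a hypothetical name; the shift is done on the unsigned bit pattern so that the sketch does not rely on the implementation-defined behavior of right-shifting a negative signed value.

```c
#include <stdint.h>

/* Models one element of spu_mulsr: the leftmost 16 bits of the signed
   32-bit product, sign-extended to 32 bits. */
static int32_t model_mulsr(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;             /* full 32-bit product */
    uint16_t upper = (uint16_t)((uint32_t)p >> 16);  /* leftmost 16 bits */
    return (int16_t)upper;                           /* sign-extend */
}
```

For example, 16384 × 16384 = 0x10000000, whose leftmost halfword is 0x1000, so the model returns 4096.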

Table 2-32: Vector Multiply and Shift Right

| d                 | a                   | b                   | Specific Intrinsics | Assembly Mapping |
|-------------------|---------------------|---------------------|---------------------|------------------|
| vector signed int | vector signed short | vector signed short | d = si_mpys(a, b)   | MPYS d, a, b     |
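A portable model makes the "leftmost 16 bits, sign-extended" rule concrete. The names `v8s16`, `v4s32`, and `model_mulsr` are hypothetical host-side stand-ins, not SDK names. The upper halfword is extracted from an unsigned copy of the product so that the sketch does not rely on implementation-defined signed right shifts.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SPU vectors; element 0 is leftmost. */
typedef struct { int16_t h[8]; } v8s16;  /* vector signed short */
typedef struct { int32_t w[4]; } v4s32;  /* vector signed int   */

/* Model of spu_mulsr (MPYS): multiply the odd halfword elements, keep
   the leftmost 16 bits of each 32-bit product, and sign-extend them. */
static v4s32 model_mulsr(v8s16 a, v8s16 b)
{
    v4s32 d;
    for (int i = 0; i < 4; i++) {
        int32_t p = (int32_t)a.h[2 * i + 1] * (int32_t)b.h[2 * i + 1];
        /* Bits 31..16 of the product, taken from an unsigned copy. */
        int32_t hi = (int32_t)(((uint32_t)p >> 16) & 0xFFFFu);
        if (hi & 0x8000)        /* sign-extend the 16-bit field */
            hi -= 0x10000;
        d.w[i] = hi;
    }
    return d;
}
```

Because the low 16 bits are simply discarded, the result rounds toward negative infinity: 1 × -1 yields -1, not 0.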



### spu\_nmadd: Negative Vector Multiply and Add

d = spu\_nmadd(a, b, c)

Each element of vector a is multiplied by the corresponding element in vector b and then added to the corresponding element of vector c. The result is negated and returned in the corresponding element of vector d.

Table 2-33: Negative Vector Multiply and Add

| d             | a             | b             | c             | Specific Intrinsics     | Assembly Mapping                   |
|---------------|---------------|---------------|---------------|-------------------------|------------------------------------|
| vector double | vector double | vector double | vector double | d = si_dfnma(a, b, c)   | rt ← c<br>DFNMA rt, a, b<br>d ← rt |
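The element-wise formula d = -(a × b + c) can be sketched as a portable model. The type `v2f64` and the function `model_nmadd` are hypothetical host-side names; the model uses ordinary C arithmetic and makes no attempt to reproduce the rounding behavior of the DFNMA instruction.

```c
/* Hypothetical stand-in for vector double; element 0 is leftmost. */
typedef struct { double e[2]; } v2f64;

/* Model of spu_nmadd: d = -(a*b + c) for each element.
   Hardware rounding is not modeled. */
static v2f64 model_nmadd(v2f64 a, v2f64 b, v2f64 c)
{
    v2f64 d;
    for (int i = 0; i < 2; i++)
        d.e[i] = -(a.e[i] * b.e[i] + c.e[i]);
    return d;
}
```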

### spu\_nmsub: Negative Vector Multiply and Subtract

d = spu\_nmsub(a, b, c)

Each element of vector a is multiplied by the corresponding element in vector b. The result is subtracted from the corresponding element in c and returned in the corresponding element of vector d.

Table 2-34: Negative Vector Multiply and Subtract

| d             | a             | b             | c             | Specific Intrinsics     | Assembly Mapping                   |
|---------------|---------------|---------------|---------------|-------------------------|------------------------------------|
| vector float  | vector float  | vector float  | vector float  | d = si_fnms(a, b, c)    | FNMS d, a, b, c                    |
| vector double | vector double | vector double | vector double | d = si_dfnms(a, b, c)   | rt ← c<br>DFNMS rt, a, b<br>d ← rt |
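Equivalently, each element of d is c - a × b, the negation of a × b - c. A minimal portable model for the vector float case follows; `v4f32` and `model_nmsub` are hypothetical host-side names, and hardware rounding is not modeled.

```c
/* Hypothetical stand-in for vector float; element 0 is leftmost. */
typedef struct { float e[4]; } v4f32;

/* Model of spu_nmsub: d = c - a*b for each element. */
static v4f32 model_nmsub(v4f32 a, v4f32 b, v4f32 c)
{
    v4f32 d;
    for (int i = 0; i < 4; i++)
        d.e[i] = c.e[i] - a.e[i] * b.e[i];
    return d;
}
```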

### spu\_re: Vector Floating-Point Reciprocal Estimate

d = spu\_re(a)

For each element of vector a, an estimate of its floating-point reciprocal is computed, and the result is returned in the corresponding element of vector d. The resulting estimate is accurate to 12 bits.

Table 2-35: Vector Floating-Point Reciprocal Estimate

| d            | a            | Specific Intrinsics                | Assembly Mapping         |
|--------------|--------------|------------------------------------|--------------------------|
| vector float | vector float | t = si_frest(a)<br>d = si_fi(a, t) | FREST d, a<br>FI d, a, d |
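Because the estimate carries only about 12 correct bits, code that needs close to full single precision commonly refines it with one Newton-Raphson step, y1 = y0 (2 - x y0), which roughly doubles the number of correct bits. The portable sketch below shows that step in plain C; `refine_recip` is a hypothetical helper, and the constant 0.3333f merely stands in for an spu\_re result for x = 3.

```c
/* One Newton-Raphson step for 1/x, starting from an estimate y0
   (for example, an element returned by spu_re). Hypothetical helper;
   each step roughly doubles the number of correct bits, up to the
   limit of single precision. */
static float refine_recip(float x, float y0)
{
    return y0 * (2.0f - x * y0);
}
```

Starting from 0.3333f (an error of about 3e-5), one step brings the result within about 1e-7 of 1/3.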

### spu\_rsqrte: Vector Floating-Point Reciprocal Square Root Estimate

d = spu\_rsqrte(a)

For each element of vector a, an estimate of its floating-point reciprocal square root is computed, and the result is returned in the corresponding element of vector d. The resulting estimate is accurate to 12 bits.

Table 2-36: Vector Floating-Point Reciprocal Square Root Estimate

| d            | a            | Specific Intrinsics                  | Assembly Mapping           |
|--------------|--------------|--------------------------------------|----------------------------|
| vector float | vector float | t = si_frsqest(a)<br>d = si_fi(a, t) | FRSQEST d, a<br>FI d, a, d |
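As with the reciprocal estimate, a 12-bit reciprocal square root estimate is commonly refined with a Newton-Raphson step, here y1 = y0 (1.5 - 0.5 x y0²). The sketch below is a portable illustration; `refine_rsqrt` is a hypothetical helper, and 0.4999f stands in for an spu\_rsqrte result for x = 4.

```c
/* One Newton-Raphson step for 1/sqrt(x), starting from an estimate y0
   (for example, an element returned by spu_rsqrte). Hypothetical
   helper; each step roughly doubles the number of correct bits. */
static float refine_rsqrt(float x, float y0)
{
    return y0 * (1.5f - 0.5f * x * y0 * y0);
}
```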

### spu\_sub: Vector Subtract

d = spu\_sub(a, b)

Each element of vector b is subtracted from the corresponding element of vector a. If a is a scalar, the scalar value is replicated for each element of a, and then b is subtracted from the corresponding element of a. Overflows and carries are not detected. The results are returned in the corresponding elements of vector d.

Table 2-37: Vector Subtract

| d                     | a                             | b                     | Specific Intrinsics                                           | Assembly Mapping |
|-----------------------|-------------------------------|-----------------------|---------------------------------------------------------------|------------------|
| vector signed short   | vector signed short           | vector signed short   | d = si_sfh(b, a)                                              | SFH d, b, a      |
| vector unsigned short | vector unsigned short         | vector unsigned short | d = si_sfh(b, a)                                              | SFH d, b, a      |
| vector signed int     | vector signed int             | vector signed int     | d = si_sf(b, a)                                               | SF d, b, a       |
| vector unsigned int   | vector unsigned int           | vector unsigned int   | d = si_sf(b, a)                                               | SF d, b, a       |
| vector signed int     | 10-bit signed int (literal)   | vector signed int     | d = si_sfi(b, a)                                              | SFI d, b, a      |
| vector unsigned int   | 10-bit signed int (literal)   | vector unsigned int   | d = si_sfi(b, a)                                              | SFI d, b, a      |
| vector signed int     | int                           | vector signed int     | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned int   | unsigned int                  | vector unsigned int   | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed short   | 10-bit signed short (literal) | vector signed short   | d = si_sfhi(b, a)                                             | SFHI d, b, a     |
| vector unsigned short | 10-bit signed short (literal) | vector unsigned short | d = si_sfhi(b, a)                                             | SFHI d, b, a     |
| vector signed short   | short                         | vector signed short   | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned short | unsigned short                | vector unsigned short | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector float          | vector float                  | vector float          | d = si_fs(a, b)                                               | FS d, a, b       |
| vector double         | vector double                 | vector double         | d = si_dfs(a, b)                                              | DFS d, a, b      |
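The integer rows map to the "subtract from" instructions, which compute their second operand minus their first; that is why the operands appear as (b, a) in the specific intrinsics. The portable model below shows the resulting wraparound behavior and the scalar-a form for vector unsigned short. The names `v8u16`, `model_sub`, and `model_sub_scalar` are hypothetical host-side stand-ins.

```c
#include <stdint.h>

/* Hypothetical stand-in for vector unsigned short; element 0 is leftmost. */
typedef struct { uint16_t h[8]; } v8u16;

/* Model of spu_sub: d = a - b per element, wrapping modulo 2^16
   (borrows are not detected). */
static v8u16 model_sub(v8u16 a, v8u16 b)
{
    v8u16 d;
    for (int i = 0; i < 8; i++)
        d.h[i] = (uint16_t)(a.h[i] - b.h[i]);
    return d;
}

/* Scalar-a form: replicate the scalar into every element, then subtract. */
static v8u16 model_sub_scalar(uint16_t a, v8u16 b)
{
    v8u16 va;
    for (int i = 0; i < 8; i++)
        va.h[i] = a;
    return model_sub(va, b);
}
```

For example, 3 - 5 in an unsigned halfword element yields 65534.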

### spu\_subx: Vector Subtract Extended

d = spu\_subx(a, b, c)

Each element of vector b is subtracted from the corresponding element of vector a. An additional 1 is subtracted from the result if the least significant bit of the corresponding element of vector c is 0. The final result is returned in the corresponding element of vector d.

Table 2-38: Vector Subtract Extended

| d                   | a                   | b                   | c                   | Specific Intrinsics | Assembly Mapping                 |
|---------------------|---------------------|---------------------|---------------------|---------------------|----------------------------------|
| vector signed int   | vector signed int   | vector signed int   | vector signed int   | d = si_sfx(b, a, c) | rt ← c<br>SFX rt, b, a<br>d ← rt |
| vector unsigned int | vector unsigned int | vector unsigned int | vector unsigned int | d = si_sfx(b, a, c) | rt ← c<br>SFX rt, b, a<br>d ← rt |
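In other words, the LSB of each element of c acts as an inverted borrow-in: 1 means "no borrow" and 0 means "borrow". The portable sketch below models that rule and then uses it, as an illustration only, to subtract 64-bit values held as 32-bit halves. The names `v4u32`, `model_subx`, and `sub64_via_subx` are hypothetical, and the borrow indicator for the low word is computed directly in C.

```c
#include <stdint.h>

/* Hypothetical stand-in for vector unsigned int; element 0 is leftmost. */
typedef struct { uint32_t w[4]; } v4u32;

/* Model of spu_subx: d = a - b, minus an extra 1 when the LSB of c is 0. */
static v4u32 model_subx(v4u32 a, v4u32 b, v4u32 c)
{
    v4u32 d;
    for (int i = 0; i < 4; i++)
        d.w[i] = a.w[i] - b.w[i] - ((c.w[i] & 1u) ? 0u : 1u);
    return d;
}

/* Hypothetical illustration: 64-bit x - y from 32-bit halves.
   c = 1 means no borrow out of the low word (x_lo >= y_lo). */
static uint64_t sub64_via_subx(uint64_t x, uint64_t y)
{
    uint32_t xl = (uint32_t)x, yl = (uint32_t)y;
    uint32_t xh = (uint32_t)(x >> 32), yh = (uint32_t)(y >> 32);
    v4u32 a = {{xh}}, b = {{yh}}, c = {{xl >= yl ? 1u : 0u}};
    uint32_t hi = model_subx(a, b, c).w[0];
    uint32_t lo = xl - yl;
    return ((uint64_t)hi << 32) | lo;
}
```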



# 2.6. Byte Operation Intrinsics

### spu\_absd: Vector Absolute Difference

d = spu\_absd(a, b)

Each element of vector a is subtracted from the corresponding element of vector b, and the absolute value of the result is returned in the corresponding element of vector d.

Table 2-39: Vector Absolute Difference

| d                    | a                    | b                    | Specific Intrinsics | Assembly Mapping |
|----------------------|----------------------|----------------------|---------------------|------------------|
| vector unsigned char | vector unsigned char | vector unsigned char | d = si_absdb(a, b)  | ABSDB d, a, b    |
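Because the elements are unsigned, the absolute difference is simply the larger value minus the smaller one, so it never wraps. A minimal portable model follows; `v16u8` and `model_absd` are hypothetical host-side names.

```c
#include <stdint.h>

/* Hypothetical stand-in for vector unsigned char; element 0 is leftmost. */
typedef struct { uint8_t b[16]; } v16u8;

/* Model of spu_absd: |a - b| per byte, computed without wraparound. */
static v16u8 model_absd(v16u8 a, v16u8 b)
{
    v16u8 d;
    for (int i = 0; i < 16; i++)
        d.b[i] = (uint8_t)(a.b[i] > b.b[i] ? a.b[i] - b.b[i]
                                           : b.b[i] - a.b[i]);
    return d;
}
```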

### spu\_avg: Average of Two Vectors

d = spu\_avg(a, b)

Each element of vector a is added to the corresponding element of vector b, and 1 is added to the sum. The 9-bit result is shifted right by 1 bit and placed in the corresponding element of vector d; that is, each element of d is (a + b + 1) >> 1, the average rounded up.

Table 2-40: Average of Two Vectors

| d                    | a                    | b                    | Specific Intrinsics | Assembly Mapping |
|----------------------|----------------------|----------------------|---------------------|------------------|
| vector unsigned char | vector unsigned char | vector unsigned char | d = si_avgb(a, b)   | AVGB d, a, b     |
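The key detail is that the intermediate sum needs 9 bits, so a byte-wide add would overflow for large inputs. The portable model below forms the sum in an unsigned int; `v16u8` and `model_avg` are hypothetical host-side names.

```c
#include <stdint.h>

/* Hypothetical stand-in for vector unsigned char; element 0 is leftmost. */
typedef struct { uint8_t b[16]; } v16u8;

/* Model of spu_avg: (a + b + 1) >> 1 per byte. The sum is formed in an
   unsigned int, so the 9-bit intermediate cannot overflow. */
static v16u8 model_avg(v16u8 a, v16u8 b)
{
    v16u8 d;
    for (int i = 0; i < 16; i++)
        d.b[i] = (uint8_t)(((unsigned)a.b[i] + b.b[i] + 1u) >> 1);
    return d;
}
```

For example, the average of 255 and 255 is 255, and the average of 1 and 2 rounds up to 2.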

### spu\_sumb: Sum Bytes into Shorts

d = spu\_sumb(a, b)

The four elements of b within each 32-bit word slot are summed, and each sum is returned in the corresponding even element of vector d. Likewise, the four elements of a within each word slot are summed, and each sum is returned in the corresponding odd element of d.

Table 2-41: Sum Bytes into Shorts

| d                     | a                    | b                    | Specific Intrinsics | Assembly Mapping |
|-----------------------|----------------------|----------------------|---------------------|------------------|
| vector unsigned short | vector unsigned char | vector unsigned char | d = si_sumb(a, b)   | SUMB d, a, b     |
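The interleaving of the two sums is easy to get backwards, so a portable model may help. Word slot w of each input covers bytes 4w through 4w+3; its b sum lands in halfword 2w and its a sum in halfword 2w+1. A sum can reach 4 × 255 = 1020, which is why d has halfword elements. The names `v16u8`, `v8u16`, and `model_sumb` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SPU vectors; element 0 is leftmost. */
typedef struct { uint8_t  b[16]; } v16u8;  /* vector unsigned char  */
typedef struct { uint16_t h[8];  } v8u16;  /* vector unsigned short */

/* Model of spu_sumb: per word slot, even halfword = sum of b's bytes,
   odd halfword = sum of a's bytes. */
static v8u16 model_sumb(v16u8 a, v16u8 b)
{
    v8u16 d;
    for (int w = 0; w < 4; w++) {
        unsigned sa = 0, sb = 0;
        for (int k = 0; k < 4; k++) {
            sa += a.b[4 * w + k];
            sb += b.b[4 * w + k];
        }
        d.h[2 * w] = (uint16_t)sb;      /* even element: from b */
        d.h[2 * w + 1] = (uint16_t)sa;  /* odd element: from a  */
    }
    return d;
}
```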

# 2.7. Compare, Branch and Halt Intrinsics

### spu\_bisled: Branch Indirect and Set Link If External Data

(void) spu\_bisled(func)

(void) spu\_bisled\_d(func)

(void) spu\_bisled\_e(func)

The count value of channel 0 (event status) is examined. If it is zero, execution continues with the next sequential instruction. If it is nonzero, the function func is called. The parameter func is the name of, or pointer to, a parameter-less function with no return value. If func is called, the spu\_bisled\_d and spu\_bisled\_e forms of the intrinsic also perform one of the following actions:

• Disable interrupts: use spu\_bisled\_d

• Enable interrupts: use spu\_bisled\_e

Because the BISLED instruction is assumed to behave as a synchronous software interrupt, the target function func must treat all volatile registers as non-volatile. Consequently, standard calling conventions are not observed. See the SPU Application Binary Interface Specification for additional details about standard calling conventions.

With respect to branch prediction, it is assumed that func is not called. Therefore, no branch hint instruction is inserted as a result of the spu\_bisled() intrinsic.

Table 2-42: Branch Indirect and Set Link If External Data

| Generic Intrinsic Form | func            | Specific Intrinsics            | Assembly Mapping   |
|------------------------|-----------------|--------------------------------|--------------------|
| spu_bisled             | void (*func) () | si_bisled(si_from_ptr(func))   | BISLED \$LR, func  |
| spu_bisled_d           | void (*func) () | si_bisledd(si_from_ptr(func))  | BISLEDD \$LR, func |
| spu_bisled_e           | void (*func) () | si_bislede(si_from_ptr(func))  | BISLEDE \$LR, func |

### spu\_cmpabseq: Vector Compare Absolute Equal

d = spu\_cmpabseq(a, b)

The absolute value of each element of vector a is compared with the absolute value of the corresponding element of vector b. If the absolute values are equal, all bits of the corresponding element of vector d are set to one; otherwise, all bits of the corresponding element of d are set to zero.

Table 2-43: Vector Compare Absolute Equal

| d                         | a             | b             | Specific Intrinsics | Assembly Mapping           |
|---------------------------|---------------|---------------|---------------------|----------------------------|
| vector unsigned int       | vector float  | vector float  | d = si_fcmeq(a, b)  | FCMEQ d, a, b              |
| vector unsigned long long | vector double | vector double | d = si_dfcmeq(a, b) | DFCMEQ d, a, b<sup>†</sup> |
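The result is a per-element mask rather than a boolean, which suits later bitwise selection. The portable model below covers the vector float row; `v4f32`, `v4u32`, `absf`, and `model_cmpabseq` are hypothetical host-side names. It uses the host's IEEE float comparison and does not attempt to reproduce SPU single-precision behavior for special values.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SPU vectors; element 0 is leftmost. */
typedef struct { float    e[4]; } v4f32;  /* vector float        */
typedef struct { uint32_t w[4]; } v4u32;  /* vector unsigned int */

static float absf(float x) { return x < 0.0f ? -x : x; }

/* Model of spu_cmpabseq: all ones where |a| == |b|, else all zeros. */
static v4u32 model_cmpabseq(v4f32 a, v4f32 b)
{
    v4u32 d;
    for (int i = 0; i < 4; i++)
        d.w[i] = (absf(a.e[i]) == absf(b.e[i])) ? 0xFFFFFFFFu : 0u;
    return d;
}
```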

### spu\_cmpabsgt: Vector Compare Absolute Greater Than

d = spu\_cmpabsgt(a, b)

The absolute value of each element of vector a is compared with the absolute value of the corresponding element of vector b. If the absolute value of the element of a is greater than that of the element of b, all bits of the corresponding element of vector d are set to one; otherwise, all bits of the corresponding element of d are set to zero.

Table 2-44: Vector Compare Absolute Greater Than

| d                         | a             | b             | Specific Intrinsics | Assembly Mapping           |
|---------------------------|---------------|---------------|---------------------|----------------------------|
| vector unsigned int       | vector float  | vector float  | d = si_fcmgt(a, b)  | FCMGT d, a, b              |
| vector unsigned long long | vector double | vector double | d = si_dfcmgt(a, b) | DFCMGT d, a, b<sup>†</sup> |
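Unlike spu\_cmpgt, this comparison ignores the sign of both operands, so -3 compares as greater than 2. A portable model of the vector float row follows; `v4f32`, `v4u32`, `absf`, and `model_cmpabsgt` are hypothetical host-side names, and SPU-specific handling of special values is not modeled.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SPU vectors; element 0 is leftmost. */
typedef struct { float    e[4]; } v4f32;  /* vector float        */
typedef struct { uint32_t w[4]; } v4u32;  /* vector unsigned int */

static float absf(float x) { return x < 0.0f ? -x : x; }

/* Model of spu_cmpabsgt: all ones where |a| > |b|, else all zeros. */
static v4u32 model_cmpabsgt(v4f32 a, v4f32 b)
{
    v4u32 d;
    for (int i = 0; i < 4; i++)
        d.w[i] = (absf(a.e[i]) > absf(b.e[i])) ? 0xFFFFFFFFu : 0u;
    return d;
}
```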

### spu\_cmpeq: Vector Compare Equal

d = spu\_cmpeq(a, b)

Each element of vector a is compared with the corresponding element of vector b. If b is a scalar, the scalar value is first replicated for each element, and then a and b are compared. If the operands are equal, all bits of the corresponding element of vector d are set to one. If they are unequal, all bits of the corresponding element of d are set to zero.

Table 2-45: Vector Compare Equal

| d                         | a                     | b                           | Specific Intrinsics                                           | Assembly Mapping |
|---------------------------|-----------------------|-----------------------------|---------------------------------------------------------------|------------------|
| vector unsigned char      | vector signed char    | vector signed char          | d = si_ceqb(a, b)                                             | CEQB d, a, b     |
| vector unsigned char      | vector unsigned char  | vector unsigned char        | d = si_ceqb(a, b)                                             | CEQB d, a, b     |
| vector unsigned short     | vector signed short   | vector signed short         | d = si_ceqh(a, b)                                             | CEQH d, a, b     |
| vector unsigned short     | vector unsigned short | vector unsigned short       | d = si_ceqh(a, b)                                             | CEQH d, a, b     |
| vector unsigned int       | vector signed int     | vector signed int           | d = si_ceq(a, b)                                              | CEQ d, a, b      |
| vector unsigned int       | vector unsigned int   | vector unsigned int         | d = si_ceq(a, b)                                              | CEQ d, a, b      |
| vector unsigned int       | vector float          | vector float                | d = si_fceq(a, b)                                             | FCEQ d, a, b     |
| vector unsigned char      | vector signed char    | 10-bit signed int (literal) | d = si_ceqbi(a, b)                                            | CEQBI d, a, b    |
| vector unsigned char      | vector unsigned char  | 10-bit signed int (literal) | d = si_ceqbi(a, b)                                            | CEQBI d, a, b    |
| vector unsigned char      | vector signed char    | signed char                 | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned char      | vector unsigned char  | unsigned char               | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned short     | vector signed short   | 10-bit signed int (literal) | d = si_ceqhi(a, b)                                            | CEQHI d, a, b    |
| vector unsigned short     | vector unsigned short | 10-bit signed int (literal) | d = si_ceqhi(a, b)                                            | CEQHI d, a, b    |
| vector unsigned short     | vector signed short   | signed short                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned short     | vector unsigned short | unsigned short              | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned int       | vector signed int     | 10-bit signed int (literal) | d = si_ceqi(a, b)                                             | CEQI d, a, b     |
| vector unsigned int       | vector unsigned int   | 10-bit signed int (literal) | d = si_ceqi(a, b)                                             | CEQI d, a, b     |
| vector unsigned int       | vector signed int     | signed int                  | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned int       | vector unsigned int   | unsigned int                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned long long | vector double         | vector double               | d = si_dfceq(a, b)                                            | DFCEQ d, a, b    |
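The scalar form behaves as if b were first replicated into every element. The portable model below shows both forms for vector signed short; `v8s16`, `v8u16`, `model_cmpeq`, and `model_cmpeq_scalar` are hypothetical host-side names, not SDK functions.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SPU vectors; element 0 is leftmost. */
typedef struct { int16_t  h[8]; } v8s16;  /* vector signed short   */
typedef struct { uint16_t h[8]; } v8u16;  /* vector unsigned short */

/* Model of spu_cmpeq: all ones where a == b, else all zeros. */
static v8u16 model_cmpeq(v8s16 a, v8s16 b)
{
    v8u16 d;
    for (int i = 0; i < 8; i++)
        d.h[i] = (a.h[i] == b.h[i]) ? 0xFFFFu : 0u;
    return d;
}

/* Scalar b: replicate it into every element, then compare. */
static v8u16 model_cmpeq_scalar(v8s16 a, int16_t b)
{
    v8s16 vb;
    for (int i = 0; i < 8; i++)
        vb.h[i] = b;
    return model_cmpeq(a, vb);
}
```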



### spu\_cmpgt: Vector Compare Greater Than

d = spu\_cmpgt(a, b)

Each element of vector a is compared with the corresponding element of vector b. If b is a scalar, the scalar value is replicated for each element and then a and b are compared. If the element of a is greater than the corresponding element of b, all bits of the corresponding element of vector d are set to one; otherwise, all bits of the corresponding element of d are set to zero.

Table 2-46: Vector Compare Greater Than

| d                         | a                     | b                           | Specific Intrinsics                                           | Assembly Mapping          |
|---------------------------|-----------------------|-----------------------------|---------------------------------------------------------------|---------------------------|
| vector unsigned char      | vector signed char    | vector signed char          | d = si_cgtb(a, b)                                             | CGTB d, a, b              |
| vector unsigned char      | vector signed char    | 10-bit signed int (literal) | d = si_cgtbi(a, b)                                            | CGTBI d, a, b             |
| vector unsigned char      | vector signed char    | signed char                 | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                           |
| vector unsigned char      | vector unsigned char  | vector unsigned char        | d = si_clgtb(a, b)                                            | CLGTB d, a, b             |
| vector unsigned char      | vector unsigned char  | 10-bit signed int (literal) | d = si_clgtbi(a, b)                                           | CLGTBI d, a, b            |
| vector unsigned char      | vector unsigned char  | unsigned char               | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                           |
| vector unsigned short     | vector signed short   | vector signed short         | d = si_cgth(a, b)                                             | CGTH d, a, b              |
| vector unsigned short     | vector signed short   | 10-bit signed int (literal) | d = si_cgthi(a, b)                                            | CGTHI d, a, b             |
| vector unsigned short     | vector signed short   | signed short                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                           |
| vector unsigned short     | vector unsigned short | vector unsigned short       | d = si_clgth(a, b)                                            | CLGTH d, a, b             |
| vector unsigned short     | vector unsigned short | 10-bit signed int (literal) | d = si_clgthi(a, b)                                           | CLGTHI d, a, b            |
| vector unsigned short     | vector unsigned short | unsigned short              | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                           |
| vector unsigned int       | vector signed int     | vector signed int           | d = si_cgt(a, b)                                              | CGT d, a, b               |
| vector unsigned int       | vector signed int     | 10-bit signed int (literal) | d = si_cgti(a, b)                                             | CGTI d, a, b              |
| vector unsigned int       | vector signed int     | signed int                  | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                           |
| vector unsigned int       | vector unsigned int   | vector unsigned int         | d = si_clgt(a, b)                                             | CLGT d, a, b              |
| vector unsigned int       | vector unsigned int   | 10-bit signed int (literal) | d = si_clgti(a, b)                                            | CLGTI d, a, b             |
| vector unsigned int       | vector unsigned int   | unsigned int                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                           |
| vector unsigned int       | vector float          | vector float                | d = si_fcgt(a, b)                                             | FCGT d, a, b              |
| vector unsigned long long | vector double         | vector double               | d = si_dfcgt(a, b)                                            | DFCGT d, a, b<sup>†</sup> |
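The table maps signed element types to the CGT family and unsigned element types to the logical CLGT family, so the same bit pattern can compare differently: 0x80 is -128 as a signed char but 128 as an unsigned char. The portable sketch below shows the difference; `v16s8`, `v16u8`, `model_cmpgt_s8`, and `model_cmpgt_u8` are hypothetical host-side names.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SPU vectors; element 0 is leftmost. */
typedef struct { int8_t  b[16]; } v16s8;  /* vector signed char   */
typedef struct { uint8_t b[16]; } v16u8;  /* vector unsigned char */

/* Signed byte compare (the CGTB case). */
static v16u8 model_cmpgt_s8(v16s8 a, v16s8 b)
{
    v16u8 d;
    for (int i = 0; i < 16; i++)
        d.b[i] = (a.b[i] > b.b[i]) ? 0xFF : 0x00;
    return d;
}

/* Unsigned (logical) byte compare (the CLGTB case). */
static v16u8 model_cmpgt_u8(v16u8 a, v16u8 b)
{
    v16u8 d;
    for (int i = 0; i < 16; i++)
        d.b[i] = (a.b[i] > b.b[i]) ? 0xFF : 0x00;
    return d;
}
```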



### spu\_hcmpeq: Halt If Compare Equal

(void) spu\_hcmpeq(a, b)

The contents of a and b are compared. If they are equal, execution is halted.

Table 2-47: Halt If Compare Equal

| a            | b                           | Specific Intrinsics                      | Assembly Mapping<sup>1,2</sup> |
|--------------|-----------------------------|------------------------------------------|--------------------------------|
| int          | int (non-literal)           | si_heq(si_from_int(a), si_from_int(b))   | HEQ rt, a, b                   |
| unsigned int | unsigned int (non-literal)  | si_heq(si_from_uint(a), si_from_uint(b)) | HEQ rt, a, b                   |
| int          | 10-bit signed int (literal) | si_heqi(si_from_int(a), b)               | HEQI rt, a, b                  |
| unsigned int | 10-bit signed int (literal) | si_heqi(si_from_uint(a), b)              | HEQI rt, a, b                  |

<sup>1</sup> Immediate values that cannot be represented as a 10-bit signed value are constructed in a way similar to the method described in section "2.2.1. Mapping Intrinsics with Scalar Operands".

### spu\_hcmpgt: Halt If Compare Greater Than

(void) spu\_hcmpgt(a, b)

The contents of a and b are compared. If a is greater than b, execution is halted.

Table 2-48: Halt If Compare Greater Than

| a            | b                           | Specific Intrinsics                        | Assembly Mapping<sup>1,2</sup> |
|--------------|-----------------------------|--------------------------------------------|--------------------------------|
| int          | int (non-literal)           | si_hgt(si_from_int(a), si_from_int(b))     | HGT rt, a, b                   |
| unsigned int | unsigned int (non-literal)  | si_hlgt(si_from_uint(a), si_from_uint(b))  | HLGT rt, a, b                  |
| int          | 10-bit signed int (literal) | si_hgti(si_from_int(a), b)                 | HGTI rt, a, b                  |
| unsigned int | 10-bit signed int (literal) | si_hlgti(si_from_uint(a), b)               | HLGTI rt, a, b                 |

<sup>1</sup> Immediate values that cannot be represented as 10-bit signed values are constructed in a way similar to the method described in section "2.2.1. Mapping Intrinsics with Scalar Operands".

<sup>2</sup> The false target parameter rt is optimally chosen depending on the register usage of neighboring instructions.

### spu\_testsv: Vector Test Special Value

d = spu\_testsv(a, values)

Each element of vector a is compared with the set of special values specified by values. If any one of the specified comparisons is true, all ones are placed in the corresponding element of vector d. If none of the tests are true, zeros are placed in the corresponding element of vector d.

Table 2-49: Vector Test Special Value

| d                         | a             | values                       | Specific Intrinsics     | Assembly Mapping   |
|---------------------------|---------------|------------------------------|-------------------------|--------------------|
| vector unsigned long long | vector double | 7-bit unsigned int (literal) | d = si_dftsv(a, values) | DFTSV d, a, values |

The set of bit flag mnemonics that can be used to specify a set of special values to be tested is shown in Table 2-50. These mnemonics are defined in spu\_intrinsics.h.



Table 2-50: Special Value Bit Flag Mnemonics

| Mnemonic            | Value | Description                                        |
|---------------------|-------|----------------------------------------------------|
| SPU_SV_NEG_DENORM   | 0x01  | Test for a negative denormalized number            |
| SPU_SV_POS_DENORM   | 0x02  | Test for a positive denormalized number            |
| SPU_SV_NEG_ZERO     | 0x04  | Test for a negative zero                           |
| SPU_SV_POS_ZERO     | 0x08  | Test for a positive zero                           |
| SPU_SV_NEG_INFINITY | 0x10  | Test for a negative infinity                       |
| SPU_SV_POS_INFINITY | 0x20  | Test for a positive infinity                       |
| SPU_SV_NAN          | 0x40  | Test for a Not-a-Number, both signalling and quiet |
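The flags in Table 2-50 correspond to fields of the IEEE 754 binary64 encoding: an all-zero exponent with a nonzero fraction is a denormalized number, an all-zero exponent with a zero fraction is a zero, and an all-ones exponent is an infinity (zero fraction) or a NaN (nonzero fraction). The portable sketch below models spu\_testsv on that basis. The flag macros are copied locally from Table 2-50 so that the sketch compiles without spu\_intrinsics.h; `v2f64`, `v2u64`, `sv_class`, and `model_testsv` are hypothetical host-side names.

```c
#include <stdint.h>
#include <string.h>

/* Local copies of the Table 2-50 values (host-side sketch only). */
#define SPU_SV_NEG_DENORM   0x01
#define SPU_SV_POS_DENORM   0x02
#define SPU_SV_NEG_ZERO     0x04
#define SPU_SV_POS_ZERO     0x08
#define SPU_SV_NEG_INFINITY 0x10
#define SPU_SV_POS_INFINITY 0x20
#define SPU_SV_NAN          0x40

typedef struct { double   e[2]; } v2f64;  /* vector double             */
typedef struct { uint64_t d[2]; } v2u64;  /* vector unsigned long long */

/* Return the Table 2-50 flag describing x, or 0 for a normal number.
   Assumes the host double is IEEE 754 binary64. */
static unsigned sv_class(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    int neg = (int)(bits >> 63);
    unsigned exp = (unsigned)((bits >> 52) & 0x7FFu);
    uint64_t frac = bits & ((UINT64_C(1) << 52) - 1);

    if (exp == 0x7FFu)
        return frac ? SPU_SV_NAN
                    : (neg ? SPU_SV_NEG_INFINITY : SPU_SV_POS_INFINITY);
    if (exp == 0u)
        return frac ? (neg ? SPU_SV_NEG_DENORM : SPU_SV_POS_DENORM)
                    : (neg ? SPU_SV_NEG_ZERO : SPU_SV_POS_ZERO);
    return 0u;
}

/* Model of spu_testsv: all ones where any requested test is true. */
static v2u64 model_testsv(v2f64 a, unsigned values)
{
    v2u64 d;
    for (int i = 0; i < 2; i++)
        d.d[i] = (sv_class(a.e[i]) & values) ? ~UINT64_C(0) : 0u;
    return d;
}
```

For example, passing SPU\_SV\_NEG\_ZERO | SPU\_SV\_POS\_ZERO tests for a zero of either sign.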

# 2.8. Bits and Mask Intrinsics

### spu\_cntb: Vector Count Ones for Bytes

d = spu\_cntb(a)

For each element of vector a, the number of ones is counted, and the count is placed in the corresponding element of vector d.

Table 2-51: Vector Count Ones for Bytes

| d                    | a                    | Specific Intrinsics | Assembly Mapping |
|----------------------|----------------------|---------------------|------------------|
| vector unsigned char | vector unsigned char | d = si_cntb(a)      | CNTB d, a        |
| vector unsigned char | vector signed char   | d = si_cntb(a)      | CNTB d, a        |
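A portable population-count model for each byte follows. It clears the lowest set bit on each pass, so the loop runs once per one bit. The names `v16u8` and `model_cntb` are hypothetical host-side stand-ins.

```c
#include <stdint.h>

/* Hypothetical stand-in for vector unsigned char; element 0 is leftmost. */
typedef struct { uint8_t b[16]; } v16u8;

/* Model of spu_cntb: number of one bits in each byte. */
static v16u8 model_cntb(v16u8 a)
{
    v16u8 d;
    for (int i = 0; i < 16; i++) {
        unsigned x = a.b[i], n = 0;
        while (x) {         /* clear the lowest set bit each pass */
            x &= x - 1u;
            n++;
        }
        d.b[i] = (uint8_t)n;
    }
    return d;
}
```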

### spu\_cntlz: Vector Count Leading Zeros

d = spu\_cntlz(a)

For each element of vector a, the number of leading zeros is counted, and the resulting count is placed in the corresponding element of vector d.

Table 2-52: Vector Count Leading Zeros

| d                   | a                   | Specific Intrinsics | Assembly Mapping |
|---------------------|---------------------|---------------------|------------------|
| vector unsigned int | vector signed int   | d = si_clz(a)       | CLZ d, a         |
| vector unsigned int | vector unsigned int | d = si_clz(a)       | CLZ d, a         |
| vector unsigned int | vector float        | d = si_clz(a)       | CLZ d, a         |
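A portable model of the per-word count follows. It defines the count for a zero word as 32, which is why it uses a loop rather than a compiler builtin such as GCC's `__builtin_clz`, whose result is undefined for zero. The names `v4u32`, `clz32`, and `model_cntlz` are hypothetical; for the vector float row, the same count applies to the element's bit pattern.

```c
#include <stdint.h>

/* Hypothetical stand-in for vector unsigned int; element 0 is leftmost. */
typedef struct { uint32_t w[4]; } v4u32;

/* Leading zeros of a 32-bit word; returns 32 when x == 0. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    for (uint32_t m = 0x80000000u; m != 0 && (x & m) == 0; m >>= 1)
        n++;
    return n;
}

/* Model of spu_cntlz: leading-zero count of each word element. */
static v4u32 model_cntlz(v4u32 a)
{
    v4u32 d;
    for (int i = 0; i < 4; i++)
        d.w[i] = clz32(a.w[i]);
    return d;
}
```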

### spu\_gather: Gather Bits from Elements

d = spu\_gather(a)

The rightmost bit (LSB) of each element of vector a is gathered, concatenated, and returned in the rightmost bits of element 0 of vector d. For a byte vector, 16 bits are gathered; for a halfword vector, 8 bits are gathered; and for a word vector, 4 bits are gathered. The remaining bits of element 0 of d and all other elements of that vector are zeroed.



Table 2-53: Gather Bits from Elements

| d                   | a                     | Specific Intrinsics | Assembly Mapping |
|---------------------|-----------------------|---------------------|------------------|
| vector unsigned int | vector unsigned char  | d = si_gbb(a)       | GBB d, a         |
| vector unsigned int | vector signed char    | d = si_gbb(a)       | GBB d, a         |
| vector unsigned int | vector unsigned short | d = si_gbh(a)       | GBH d, a         |
| vector unsigned int | vector signed short   | d = si_gbh(a)       | GBH d, a         |
| vector unsigned int | vector unsigned int   | d = si_gb(a)        | GB d, a          |
| vector unsigned int | vector signed int     | d = si_gb(a)        | GB d, a          |
| vector unsigned int | vector float          | d = si_gb(a)        | GB d, a          |
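A common use is turning a compare mask into a scalar bit mask. The portable model below covers the byte case and assumes, following the left-to-right element numbering, that element 0 supplies the most significant of the 16 gathered bits. The names `v16u8`, `v4u32`, and `model_gather_u8` are hypothetical host-side stand-ins.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SPU vectors; element 0 is leftmost. */
typedef struct { uint8_t  b[16]; } v16u8;  /* vector unsigned char */
typedef struct { uint32_t w[4];  } v4u32;  /* vector unsigned int  */

/* Model of spu_gather for byte vectors: the LSB of byte i becomes bit
   (15 - i) of word element 0. All other bits of d are zero. */
static v4u32 model_gather_u8(v16u8 a)
{
    v4u32 d = {{0u, 0u, 0u, 0u}};
    for (int i = 0; i < 16; i++)
        d.w[0] = (d.w[0] << 1) | (a.b[i] & 1u);
    return d;
}
```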

### spu\_maskb: Form Select Byte Mask

d = spu\_maskb(a)

For each of the least significant 16 bits of a, each bit is replicated 8 times, producing a 128-bit vector mask that is returned in vector d.

Table 2-54: Form Select Byte Mask

| d                    | a                             | Specific Intrinsics             | Assembly Mapping |
|----------------------|-------------------------------|---------------------------------|------------------|
| vector unsigned char | unsigned short                | d = si_fsmb(si_from_ushort(a))  | FSMB d, a        |
| vector unsigned char | signed short                  | d = si_fsmb(si_from_short(a))   | FSMB d, a        |
| vector unsigned char | unsigned int                  | d = si_fsmb(si_from_uint(a))    | FSMB d, a        |
| vector unsigned char | signed int                    | d = si_fsmb(si_from_int(a))     | FSMB d, a        |
| vector unsigned char | 16-bit unsigned int (literal) | d = si_fsmbi(a)                 | FSMBI d, a       |

### spu\_maskh: Form Select Halfword Mask

d = spu\_maskh(a)

For each of the least significant 8 bits of a, each bit is replicated 16 times, producing a 128-bit vector mask that is returned in vector d.

Table 2-55: Form Select Halfword Mask

| Return Type (d)       | Argument Type (a) | Specific Intrinsics            | Assembly Mapping |
|-----------------------|-------------------|--------------------------------|------------------|
| vector unsigned short | unsigned char     | d = si_fsmh(si_from_uchar(a))  | FSMH d, a        |
| vector unsigned short | signed char       | d = si_fsmh(si_from_char(a))   | FSMH d, a        |
| vector unsigned short | unsigned short    | d = si_fsmh(si_from_ushort(a)) | FSMH d, a        |
| vector unsigned short | signed short      | d = si_fsmh(si_from_short(a))  | FSMH d, a        |
| vector unsigned short | unsigned int      | d = si_fsmh(si_from_uint(a))   | FSMH d, a        |
| vector unsigned short | signed int        | d = si_fsmh(si_from_int(a))    | FSMH d, a        |



### spu\_maskw: Form Select Word Mask

d = spu\_maskw(a)

For each of the least significant 4 bits of a, each bit is replicated 32 times, producing a 128-bit vector mask that is returned in vector d.
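A word-sized version of the same idea, again a hypothetical portable model rather than the SPU intrinsic, shows how the low 4 bits of a expand into one all-ones or all-zeros word per bit and that higher bits are ignored. We assume bit 3 maps to word 0, by analogy with the byte form.

```c
#include <stdint.h>

/* Hypothetical model of spu_maskw (FSM). Only the low 4 bits of a are
 * used: bit 3 fills word 0 (leftmost) and bit 0 fills word 3. */
void model_maskw(uint32_t a, uint32_t d[4])
{
    for (int i = 0; i < 4; i++)
        d[i] = ((a >> (3 - i)) & 1u) ? 0xFFFFFFFFu : 0u;
}
```

A mask built this way has the form that spu\_sel expects as its pattern operand for word elements.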

Table 2-56: Form Select Word Mask

| Return Type (d)     | Argument Type (a) | Specific Intrinsics           | Assembly Mapping |
|---------------------|-------------------|-------------------------------|------------------|
| vector unsigned int | unsigned char     | d = si_fsm(si_from_uchar(a))  | FSM d, a         |
| vector unsigned int | signed char       | d = si_fsm(si_from_char(a))   | FSM d, a         |
| vector unsigned int | unsigned short    | d = si_fsm(si_from_ushort(a)) | FSM d, a         |
| vector unsigned int | signed short      | d = si_fsm(si_from_short(a))  | FSM d, a         |
| vector unsigned int | unsigned int      | d = si_fsm(si_from_uint(a))   | FSM d, a         |
| vector unsigned int | signed int        | d = si_fsm(si_from_int(a))    | FSM d, a         |

### spu\_sel: Select Bits

d = spu\_sel(a, b, pattern)

For each bit in the 128-bit vector pattern, the corresponding bit from either vector a or vector b is selected. If the bit is 0, the bit from a is selected; otherwise, the bit from b is selected. The result is returned in vector d.
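The selection rule can be modeled byte by byte in portable C. This sketch uses a hypothetical function name and treats each 128-bit operand as 16 bytes (n = 16); because the selection is purely bit-wise, the element type does not affect the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of spu_sel (SELB) over n bytes: each result bit comes
 * from b where the pattern bit is 1 and from a where it is 0. */
void model_sel(const uint8_t *a, const uint8_t *b, const uint8_t *pattern,
               uint8_t *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = (uint8_t)((a[i] & ~pattern[i]) | (b[i] & pattern[i]));
}
```

With the real intrinsics, `spu_sel(y, x, spu_cmpgt(x, y))` is one way to form a per-element maximum: the pattern bits are 1 exactly where x > y, so those elements are taken from x (the b operand).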

Table 2-57: Select Bits

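| Return Type (d)           | Argument Type (a)         | Argument Type (b)         | Argument Type (pattern)   | Specific Intrinsics        | Assembly Mapping      |
|---------------------------|---------------------------|---------------------------|---------------------------|----------------------------|-----------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char      | vector unsigned char      | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector signed char        | vector signed char        | vector signed char        | vector unsigned char      | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector unsigned short     | vector unsigned short     | vector unsigned short     | vector unsigned short     | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector signed short       | vector signed short       | vector signed short       | vector unsigned short     | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector unsigned int       | vector unsigned int       | vector unsigned int       | vector unsigned int       | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector signed int         | vector signed int         | vector signed int         | vector unsigned int       | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector float              | vector float              | vector float              | vector unsigned int       | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector unsigned long long | vector unsigned long long | vector unsigned long long | vector unsigned long long | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector signed long long   | vector signed long long   | vector signed long long   | vector unsigned long long | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |
| vector double             | vector double             | vector double             | vector unsigned long long | d = si_selb(a, b, pattern) | SELB d, a, b, pattern |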



### spu\_shuffle: Shuffle Two Vectors of Bytes

d = spu\_shuffle(a, b, pattern)

Each byte of pattern is examined, and a result byte is produced as shown in Figure 2-2. The result is returned in the corresponding byte of vector d.

Figure 2-2: Shuffle Pattern

| Value in the Byte of pattern (in binary) | Resulting Byte                                                    |
|------------------------------------------|-------------------------------------------------------------------|
| 10xxxxxx                                 | 0x00                                                              |
| 110xxxxx                                 | 0xFF                                                              |
| 111xxxxx                                 | 0x80                                                              |
| otherwise                                | The byte of the 32-byte concatenation of a and b (a first) addressed by the rightmost 5 bits of pattern |
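The decoding in Figure 2-2 can be written out as a portable C model, shown here as a sketch with a hypothetical function name. Indices 0 to 15 select bytes of a and indices 16 to 31 select bytes of b; in the "otherwise" case (0xxxxxxx) only the rightmost 5 bits of the pattern byte matter.

```c
#include <stdint.h>

/* Hypothetical model of spu_shuffle (SHUFB), following Figure 2-2. */
void model_shuffle(const uint8_t a[16], const uint8_t b[16],
                   const uint8_t pattern[16], uint8_t d[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t p = pattern[i];
        if ((p & 0xC0) == 0x80)          /* 10xxxxxx -> 0x00 */
            d[i] = 0x00;
        else if ((p & 0xE0) == 0xC0)     /* 110xxxxx -> 0xFF */
            d[i] = 0xFF;
        else if ((p & 0xE0) == 0xE0)     /* 111xxxxx -> 0x80 */
            d[i] = 0x80;
        else {                           /* 0xxxxxxx: index into a then b */
            unsigned idx = p & 0x1Fu;
            d[i] = idx < 16 ? a[idx] : b[idx - 16];
        }
    }
}
```

For example, a pattern whose bytes are 3, 2, 1, 0, 7, 6, 5, 4, and so on reverses the byte order within each word of a, and pattern bytes of 0x80 zero-fill the corresponding result bytes.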

Table 2-58: Shuffle Two Vectors of Bytes

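| Return Type (d)           | Argument Type (a)         | Argument Type (b)         | Argument Type (pattern) | Specific Intrinsics         | Assembly Mapping       |
|---------------------------|---------------------------|---------------------------|-------------------------|-----------------------------|------------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char      | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector signed char        | vector signed char        | vector signed char        | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector unsigned short     | vector unsigned short     | vector unsigned short     | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector signed short       | vector signed short       | vector signed short       | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector unsigned int       | vector unsigned int       | vector unsigned int       | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector signed int         | vector signed int         | vector signed int         | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector unsigned long long | vector unsigned long long | vector unsigned long long | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector signed long long   | vector signed long long   | vector signed long long   | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector float              | vector float              | vector float              | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |
| vector double             | vector double             | vector double             | vector unsigned char    | d = si_shufb(a, b, pattern) | SHUFB d, a, b, pattern |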



# 2.9. Logical Intrinsics

### spu\_and: Vector Bit-Wise AND

d = spu\_and(a, b)

Each bit of vector *a* is logically ANDed with the corresponding bit of vector *b*. If *b* is a scalar, the scalar value is first replicated for each element, and then *a* and *b* are ANDed. The results are returned in the corresponding bit of vector *d*.
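When b is a scalar, it is replicated at the element width of a, not per byte. The hypothetical portable model below uses four 32-bit words, so a scalar of 0xFF keeps only the low byte of each word; with a vector unsigned char operand the same value would leave every byte unchanged.

```c
#include <stdint.h>

/* Hypothetical model of spu_and(a, b) where a has four 32-bit words and b is
 * a scalar: b is replicated into every word before the AND. */
void model_and_scalar_u32(const uint32_t a[4], uint32_t b, uint32_t d[4])
{
    for (int i = 0; i < 4; i++)
        d[i] = a[i] & b;   /* the same b is used for every element */
}
```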

Table 2-59: Vector Bit-Wise AND

| Return Type (d)           | Argument Type (a)         | Argument Type (b)           | Specific Intrinsics                                           | Assembly Mapping |
|---------------------------|---------------------------|-----------------------------|---------------------------------------------------------------|------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char        | d = si_and(a, b)                                              | AND d, a, b      |
| vector signed char        | vector signed char        | vector signed char          | d = si_and(a, b)                                              | AND d, a, b      |
| vector unsigned short     | vector unsigned short     | vector unsigned short       | d = si_and(a, b)                                              | AND d, a, b      |
| vector signed short       | vector signed short       | vector signed short         | d = si_and(a, b)                                              | AND d, a, b      |
| vector unsigned int       | vector unsigned int       | vector unsigned int         | d = si_and(a, b)                                              | AND d, a, b      |
| vector signed int         | vector signed int         | vector signed int           | d = si_and(a, b)                                              | AND d, a, b      |
| vector unsigned long long | vector unsigned long long | vector unsigned long long   | d = si_and(a, b)                                              | AND d, a, b      |
| vector signed long long   | vector signed long long   | vector signed long long     | d = si_and(a, b)                                              | AND d, a, b      |
| vector float              | vector float              | vector float                | d = si_and(a, b)                                              | AND d, a, b      |
| vector double             | vector double             | vector double               | d = si_and(a, b)                                              | AND d, a, b      |
| vector unsigned char      | vector unsigned char      | 10-bit signed int (literal) | d = si_andbi(a, b)                                            | ANDBI d, a, b    |
| vector signed char        | vector signed char        | 10-bit signed int (literal) | d = si_andbi(a, b)                                            | ANDBI d, a, b    |
| vector unsigned char      | vector unsigned char      | unsigned char               | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed char        | vector signed char        | signed char                 | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned short     | vector unsigned short     | 10-bit signed int (literal) | d = si_andhi(a, b)                                            | ANDHI d, a, b    |
| vector signed short       | vector signed short       | 10-bit signed int (literal) | d = si_andhi(a, b)                                            | ANDHI d, a, b    |
| vector unsigned short     | vector unsigned short     | unsigned short              | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed short       | vector signed short       | signed short                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned int       | vector unsigned int       | 10-bit signed int (literal) | d = si_andi(a, b)                                             | ANDI d, a, b     |
| vector signed int         | vector signed int         | 10-bit signed int (literal) | d = si_andi(a, b)                                             | ANDI d, a, b     |
| vector unsigned int       | vector unsigned int       | unsigned int                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed int         | vector signed int         | signed int                  | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |



### spu\_andc: Vector Bit-Wise AND with Complement

d = spu\_andc(a, b)

Each bit of vector a is ANDed with the complement of the corresponding bit of vector b. The result is returned in the corresponding bit of vector d.
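One use of AND with complement is clearing selected bits. As a hypothetical portable sketch, assuming IEEE 754 single precision, clearing the sign bit of a float's bit pattern yields its absolute value; on the SPU the corresponding call would be something like `spu_andc(x, (vector float)spu_splats(0x80000000u))`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one word of spu_andc: a AND (NOT b). */
static uint32_t model_andc_word(uint32_t a, uint32_t b)
{
    return a & ~b;
}

/* Clears the sign bit of x's IEEE 754 bit pattern, which is what
 * spu_andc(x, sign_mask) would do to each float element. */
float abs_via_andc(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = model_andc_word(bits, 0x80000000u);
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```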

Table 2-60: Vector Bit-Wise AND with Complement

| Return Type (d)           | Argument Type (a)         | Argument Type (b)         | Specific Intrinsics | Assembly Mapping |
|---------------------------|---------------------------|---------------------------|---------------------|------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char      | d = si_andc(a, b)   | ANDC d, a, b     |
| vector signed char        | vector signed char        | vector signed char        | d = si_andc(a, b)   | ANDC d, a, b     |
| vector unsigned short     | vector unsigned short     | vector unsigned short     | d = si_andc(a, b)   | ANDC d, a, b     |
| vector signed short       | vector signed short       | vector signed short       | d = si_andc(a, b)   | ANDC d, a, b     |
| vector unsigned int       | vector unsigned int       | vector unsigned int       | d = si_andc(a, b)   | ANDC d, a, b     |
| vector signed int         | vector signed int         | vector signed int         | d = si_andc(a, b)   | ANDC d, a, b     |
| vector unsigned long long | vector unsigned long long | vector unsigned long long | d = si_andc(a, b)   | ANDC d, a, b     |
| vector signed long long   | vector signed long long   | vector signed long long   | d = si_andc(a, b)   | ANDC d, a, b     |
| vector float              | vector float              | vector float              | d = si_andc(a, b)   | ANDC d, a, b     |
| vector double             | vector double             | vector double             | d = si_andc(a, b)   | ANDC d, a, b     |

### spu\_eqv: Vector Bit-Wise Equivalent

d = spu\_eqv(a, b)

Each bit of vector a is compared with the corresponding bit of vector b. The corresponding bit of vector d is set to 1 if the bits in a and b are equivalent; otherwise, the bit is set to 0.
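Bit-wise equivalence is the complement of exclusive OR. A one-byte portable model (hypothetical name, not the SPU intrinsic) makes the identity explicit:

```c
#include <stdint.h>

/* Hypothetical one-byte model of spu_eqv: a result bit is 1 where the
 * corresponding bits of a and b are equal, i.e. NOT (a XOR b). */
uint8_t model_eqv_byte(uint8_t a, uint8_t b)
{
    return (uint8_t)~(a ^ b);
}
```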

Table 2-61: Vector Bit-Wise Equivalent

| Return Type (d)           | Argument Type (a)         | Argument Type (b)         | Specific Intrinsics | Assembly Mapping |
|---------------------------|---------------------------|---------------------------|---------------------|------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char      | d = si_eqv(a, b)    | EQV d, a, b      |
| vector signed char        | vector signed char        | vector signed char        | d = si_eqv(a, b)    | EQV d, a, b      |
| vector unsigned short     | vector unsigned short     | vector unsigned short     | d = si_eqv(a, b)    | EQV d, a, b      |
| vector signed short       | vector signed short       | vector signed short       | d = si_eqv(a, b)    | EQV d, a, b      |
| vector unsigned int       | vector unsigned int       | vector unsigned int       | d = si_eqv(a, b)    | EQV d, a, b      |
| vector signed int         | vector signed int         | vector signed int         | d = si_eqv(a, b)    | EQV d, a, b      |
| vector unsigned long long | vector unsigned long long | vector unsigned long long | d = si_eqv(a, b)    | EQV d, a, b      |
| vector signed long long   | vector signed long long   | vector signed long long   | d = si_eqv(a, b)    | EQV d, a, b      |
| vector float              | vector float              | vector float              | d = si_eqv(a, b)    | EQV d, a, b      |
| vector double             | vector double             | vector double             | d = si_eqv(a, b)    | EQV d, a, b      |



### spu\_nand: Vector Bit-Wise Complement of AND

d = spu\_nand(a, b)

Each bit of vector a is ANDed with the corresponding bit of vector b. The complement of the result is returned in the corresponding bit of vector d.

Table 2-62: Vector Bit-Wise Complement of AND

| Return Type (d)           | Argument Type (a)         | Argument Type (b)         | Specific Intrinsics | Assembly Mapping |
|---------------------------|---------------------------|---------------------------|---------------------|------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char      | d = si_nand(a, b)   | NAND d, a, b     |
| vector signed char        | vector signed char        | vector signed char        | d = si_nand(a, b)   | NAND d, a, b     |
| vector unsigned short     | vector unsigned short     | vector unsigned short     | d = si_nand(a, b)   | NAND d, a, b     |
| vector signed short       | vector signed short       | vector signed short       | d = si_nand(a, b)   | NAND d, a, b     |
| vector unsigned int       | vector unsigned int       | vector unsigned int       | d = si_nand(a, b)   | NAND d, a, b     |
| vector signed int         | vector signed int         | vector signed int         | d = si_nand(a, b)   | NAND d, a, b     |
| vector unsigned long long | vector unsigned long long | vector unsigned long long | d = si_nand(a, b)   | NAND d, a, b     |
| vector signed long long   | vector signed long long   | vector signed long long   | d = si_nand(a, b)   | NAND d, a, b     |
| vector float              | vector float              | vector float              | d = si_nand(a, b)   | NAND d, a, b     |
| vector double             | vector double             | vector double             | d = si_nand(a, b)   | NAND d, a, b     |

### spu\_nor: Vector Bit-Wise Complement of OR

d = spu\_nor(a, b)

Each bit of vector a is ORed with the corresponding bit of vector b. The complement of the result is returned in the corresponding bit of vector d.
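Because NOR of a value with itself is its complement, `spu_nor(a, a)` can be used to express a bit-wise NOT. The hypothetical one-word portable model below checks both that identity and an ordinary NOR.

```c
#include <stdint.h>

/* Hypothetical one-word model of spu_nor: NOT (a OR b). */
uint32_t model_nor_word(uint32_t a, uint32_t b)
{
    return ~(a | b);
}
```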

Table 2-63: Vector Bit-Wise Complement of OR

| Return Type (d)           | Argument Type (a)         | Argument Type (b)         | Specific Intrinsics | Assembly Mapping |
|---------------------------|---------------------------|---------------------------|---------------------|------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char      | d = si_nor(a, b)    | NOR d, a, b      |
| vector signed char        | vector signed char        | vector signed char        | d = si_nor(a, b)    | NOR d, a, b      |
| vector unsigned short     | vector unsigned short     | vector unsigned short     | d = si_nor(a, b)    | NOR d, a, b      |
| vector signed short       | vector signed short       | vector signed short       | d = si_nor(a, b)    | NOR d, a, b      |
| vector unsigned int       | vector unsigned int       | vector unsigned int       | d = si_nor(a, b)    | NOR d, a, b      |
| vector signed int         | vector signed int         | vector signed int         | d = si_nor(a, b)    | NOR d, a, b      |
| vector unsigned long long | vector unsigned long long | vector unsigned long long | d = si_nor(a, b)    | NOR d, a, b      |
| vector signed long long   | vector signed long long   | vector signed long long   | d = si_nor(a, b)    | NOR d, a, b      |
| vector float              | vector float              | vector float              | d = si_nor(a, b)    | NOR d, a, b      |
| vector double             | vector double             | vector double             | d = si_nor(a, b)    | NOR d, a, b      |



### spu\_or: Vector Bit-Wise OR

d = spu\_or(a, b)

Each bit of vector a is logically ORed with the corresponding bit of vector b. If b is a scalar, the scalar value is first replicated for each element, and then a and b are ORed. The result is returned in the corresponding bit of vector d.

Table 2-64: Vector Bit-Wise OR

| Return Type (d)           | Argument Type (a)         | Argument Type (b)           | Specific Intrinsics                                           | Assembly Mapping |
|---------------------------|---------------------------|-----------------------------|---------------------------------------------------------------|------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char        | d = si_or(a, b)                                               | OR d, a, b       |
| vector signed char        | vector signed char        | vector signed char          | d = si_or(a, b)                                               | OR d, a, b       |
| vector unsigned short     | vector unsigned short     | vector unsigned short       | d = si_or(a, b)                                               | OR d, a, b       |
| vector signed short       | vector signed short       | vector signed short         | d = si_or(a, b)                                               | OR d, a, b       |
| vector unsigned int       | vector unsigned int       | vector unsigned int         | d = si_or(a, b)                                               | OR d, a, b       |
| vector signed int         | vector signed int         | vector signed int           | d = si_or(a, b)                                               | OR d, a, b       |
| vector unsigned long long | vector unsigned long long | vector unsigned long long   | d = si_or(a, b)                                               | OR d, a, b       |
| vector signed long long   | vector signed long long   | vector signed long long     | d = si_or(a, b)                                               | OR d, a, b       |
| vector float              | vector float              | vector float                | d = si_or(a, b)                                               | OR d, a, b       |
| vector double             | vector double             | vector double               | d = si_or(a, b)                                               | OR d, a, b       |
| vector unsigned char      | vector unsigned char      | 10-bit signed int (literal) | d = si_orbi(a, b)                                             | ORBI d, a, b     |
| vector signed char        | vector signed char        | 10-bit signed int (literal) | d = si_orbi(a, b)                                             | ORBI d, a, b     |
| vector unsigned char      | vector unsigned char      | unsigned char               | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed char        | vector signed char        | signed char                 | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned short     | vector unsigned short     | 10-bit signed int (literal) | d = si_orhi(a, b)                                             | ORHI d, a, b     |
| vector signed short       | vector signed short       | 10-bit signed int (literal) | d = si_orhi(a, b)                                             | ORHI d, a, b     |
| vector unsigned short     | vector unsigned short     | unsigned short              | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed short       | vector signed short       | signed short                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned int       | vector unsigned int       | 10-bit signed int (literal) | d = si_ori(a, b)                                              | ORI d, a, b      |
| vector signed int         | vector signed int         | 10-bit signed int (literal) | d = si_ori(a, b)                                              | ORI d, a, b      |
| vector unsigned int       | vector unsigned int       | unsigned int                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed int         | vector signed int         | signed int                  | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |



### spu\_orc: Vector Bit-Wise OR with Complement

d = spu\_orc(a, b)

Each bit of vector a is ORed with the complement of the corresponding bit of vector b. The result is returned in the corresponding bit of vector d.

Table 2-65: Vector Bit-Wise OR with Complement

| Return Type (d)           | Argument Type (a)         | Argument Type (b)         | Specific Intrinsics | Assembly Mapping |
|---------------------------|---------------------------|---------------------------|---------------------|------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char      | d = si_orc(a, b)    | ORC d, a, b      |
| vector signed char        | vector signed char        | vector signed char        | d = si_orc(a, b)    | ORC d, a, b      |
| vector unsigned short     | vector unsigned short     | vector unsigned short     | d = si_orc(a, b)    | ORC d, a, b      |
| vector signed short       | vector signed short       | vector signed short       | d = si_orc(a, b)    | ORC d, a, b      |
| vector unsigned int       | vector unsigned int       | vector unsigned int       | d = si_orc(a, b)    | ORC d, a, b      |
| vector signed int         | vector signed int         | vector signed int         | d = si_orc(a, b)    | ORC d, a, b      |
| vector unsigned long long | vector unsigned long long | vector unsigned long long | d = si_orc(a, b)    | ORC d, a, b      |
| vector signed long long   | vector signed long long   | vector signed long long   | d = si_orc(a, b)    | ORC d, a, b      |
| vector float              | vector float              | vector float              | d = si_orc(a, b)    | ORC d, a, b      |
| vector double             | vector double             | vector double             | d = si_orc(a, b)    | ORC d, a, b      |

### spu\_orx: OR Word Across

d = spu\_orx(a)

The four word elements of vector a are logically ORed. The result is returned in word element 0 of vector d. All other elements (1, 2, 3) of d are assigned a value of zero.
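A portable C model of OR Word Across (hypothetical name; words as an array with element 0 leftmost) is sketched below. A typical use is an "is any bit set?" test: element 0 of d is nonzero exactly when some word of a has a bit set.

```c
#include <stdint.h>

/* Hypothetical model of spu_orx (ORX): OR the four words of a into
 * element 0 of d and zero elements 1 to 3. */
void model_orx(const uint32_t a[4], uint32_t d[4])
{
    d[0] = a[0] | a[1] | a[2] | a[3];
    d[1] = d[2] = d[3] = 0u;
}
```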

Table 2-66: OR Word Across

| Return Type (d)     | Argument Type (a)   | Specific Intrinsics | Assembly Mapping |
|---------------------|---------------------|---------------------|------------------|
| vector unsigned int | vector unsigned int | d = si_orx(a)       | ORX d, a         |
| vector signed int   | vector signed int   | d = si_orx(a)       | ORX d, a         |

### spu\_xor: Vector Bit-Wise Exclusive OR

d = spu\_xor(a,b)

Each bit of vector a is exclusive-ORed with the corresponding bit of vector b. If b is a scalar, the scalar value is first replicated for each element, and then a and b are exclusive-ORed. The result is returned in the corresponding bit of vector d.
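With a scalar operand, the value is replicated into every element before the exclusive OR. The hypothetical portable byte model below XORs every byte with 0x80, which flips each top bit; that maps signed byte values onto unsigned byte values in the same order (-128 becomes 0x00, 127 becomes 0xFF).

```c
#include <stdint.h>

/* Hypothetical model of spu_xor(a, b) on sixteen bytes with a scalar b:
 * b is replicated into every byte before the XOR. */
void model_xor_scalar_u8(const uint8_t a[16], uint8_t b, uint8_t d[16])
{
    for (int i = 0; i < 16; i++)
        d[i] = (uint8_t)(a[i] ^ b);
}
```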



Table 2-67: Vector Bit-Wise Exclusive OR

| Return Type (d)           | Argument Type (a)         | Argument Type (b)           | Specific Intrinsics                                           | Assembly Mapping |
|---------------------------|---------------------------|-----------------------------|---------------------------------------------------------------|------------------|
| vector unsigned char      | vector unsigned char      | vector unsigned char        | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector signed char        | vector signed char        | vector signed char          | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector unsigned short     | vector unsigned short     | vector unsigned short       | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector signed short       | vector signed short       | vector signed short         | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector unsigned int       | vector unsigned int       | vector unsigned int         | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector signed int         | vector signed int         | vector signed int           | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector unsigned long long | vector unsigned long long | vector unsigned long long   | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector signed long long   | vector signed long long   | vector signed long long     | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector float              | vector float              | vector float                | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector double             | vector double             | vector double               | d = si_xor(a, b)                                              | XOR d, a, b      |
| vector unsigned char      | vector unsigned char      | 10-bit signed int (literal) | d = si_xorbi(a, b)                                            | XORBI d, a, b    |
| vector signed char        | vector signed char        | 10-bit signed int (literal) | d = si_xorbi(a, b)                                            | XORBI d, a, b    |
| vector unsigned char      | vector unsigned char      | unsigned char               | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed char        | vector signed char        | signed char                 | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned short     | vector unsigned short     | 10-bit signed int (literal) | d = si_xorhi(a, b)                                            | XORHI d, a, b    |
| vector signed short       | vector signed short       | 10-bit signed int (literal) | d = si_xorhi(a, b)                                            | XORHI d, a, b    |
| vector unsigned short     | vector unsigned short     | unsigned short              | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed short       | vector signed short       | signed short                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector unsigned int       | vector unsigned int       | 10-bit signed int (literal) | d = si_xori(a, b)                                             | XORI d, a, b     |
| vector signed int         | vector signed int         | 10-bit signed int (literal) | d = si_xori(a, b)                                             | XORI d, a, b     |
| vector unsigned int       | vector unsigned int       | unsigned int                | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |
| vector signed int         | vector signed int         | signed int                  | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |                  |

# 2.10. Shift and Rotate Intrinsics

### spu\_rl: Vector Rotate Left by Bits

d = spu\_rl(a, count)

Each element of vector a is rotated left by the number of bits specified by the corresponding element of vector count. Bits rotated out of the left end of the element are rotated in at the right end. Only as many low-order bits of count as the element size requires are used: the 4 least significant bits for halfword elements and the 5 least significant bits for word elements. If count is a scalar, the same rotate count is applied to every element.

The results are returned in the corresponding elements of vector d.
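The per-element behavior can be modeled on a host in portable C. This is a minimal sketch under stated assumptions: `model_rl_h` and `model_rl_w` are hypothetical helper names, not SPU intrinsics, and each call models a single halfword or word element rather than a whole vector.

```c
#include <stdint.h>

/* Hypothetical model of one halfword element of spu_rl:
 * only the 4 least significant bits of count are used. */
static uint16_t model_rl_h(uint16_t h, int count)
{
    unsigned n = (unsigned)count & 0xFu;
    /* With n in 0..15 both shifts are well defined; n == 0 yields h. */
    return (uint16_t)(((unsigned)h << n) | ((unsigned)h >> (16 - n)));
}

/* Hypothetical model of one word element of spu_rl:
 * only the 5 least significant bits of count are used. */
static uint32_t model_rl_w(uint32_t w, int count)
{
    unsigned n = (unsigned)count & 0x1Fu;
    if (n == 0)
        return w;                       /* w >> 32 would be undefined in C */
    return (w << n) | (w >> (32 - n));
}
```

Because only the low bits of *count* are examined, a count of -1 behaves like 31 for word elements, which rotates each word right by one bit.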

Table 2-68: Vector Rotate Left by Bits

| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned short | vector unsigned short | vector signed short | d = si_roth(a, count) | ROTH d, a, count |
| vector signed short | vector signed short | vector signed short | d = si_roth(a, count) | ROTH d, a, count |
| vector unsigned int | vector unsigned int | vector signed int | d = si_rot(a, count) | ROT d, a, count |
| vector signed int | vector signed int | vector signed int | d = si_rot(a, count) | ROT d, a, count |
| vector unsigned short | vector unsigned short | 7-bit signed int (literal) | d = si_rothi(a, count) | ROTHI d, a, count |
| vector signed short | vector signed short | 7-bit signed int (literal) | d = si_rothi(a, count) | ROTHI d, a, count |
| vector unsigned short | vector unsigned short | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed short | vector signed short | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector unsigned int | vector unsigned int | 7-bit signed int (literal) | d = si_roti(a, count) | ROTI d, a, count |
| vector signed int | vector signed int | 7-bit signed int (literal) | d = si_roti(a, count) | ROTI d, a, count |
| vector unsigned int | vector unsigned int | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed int | vector signed int | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |

### spu\_rlmask: Vector Rotate Left and Mask by Bits

```
d = spu_rlmask(a, count)
```

This function uses an element-wise rotate-left-and-mask operation to perform a logical shift right (LSR), by bits, of each element of vector *a*, where *count* holds the negated value, or values, of the desired right-shift amounts. (The *count* parameter can be either a vector or a scalar, as shown in Table 2-69.) For example, if scalar *count* is -5, each element of *a* is shifted right by 5 bits. The following pseudocode shows the effect of this function more precisely:

```
For (each halfword element h in vector a) {
    int bitshift = -count & 0x1F;
    h = (bitshift & 0x10)? 0: LSR(h,bitshift);
}

For (each word element w in vector a) {
    int bitshift = -count & 0x3F;
    w = (bitshift & 0x20)? 0: LSR(w,bitshift);
}
```

The results are returned in the corresponding elements of vector *d*.
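The pseudocode above translates directly into a host-side C model of a single element. This is a sketch, not the SPU implementation: `model_rlmask_h` and `model_rlmask_w` are hypothetical names, and the negation is done in unsigned arithmetic so that a count of `INT_MIN` does not overflow.

```c
#include <stdint.h>

/* Hypothetical models of one element of spu_rlmask, mirroring the
 * pseudocode above. count holds the negated right-shift amount. */
static uint16_t model_rlmask_h(uint16_t h, int count)
{
    unsigned bitshift = (0u - (unsigned)count) & 0x1Fu;  /* 5 bits of -count */
    return (bitshift & 0x10u) ? 0 : (uint16_t)(h >> bitshift);
}

static uint32_t model_rlmask_w(uint32_t w, int count)
{
    unsigned bitshift = (0u - (unsigned)count) & 0x3Fu;  /* 6 bits of -count */
    return (bitshift & 0x20u) ? 0 : w >> bitshift;
}
```

For halfwords, a count of -n with n from 0 to 15 matches an ordinary unsigned `h >> n`, while shift amounts from 16 to 31 produce 0.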

Table 2-69: Vector Rotate Left and Mask by Bits

| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned short | vector unsigned short | vector signed short | d = si_rothm(a, count) | ROTHM d, a, count |
| vector signed short | vector signed short | vector signed short | d = si_rothm(a, count) | ROTHM d, a, count |
| vector unsigned int | vector unsigned int | vector signed int | d = si_rotm(a, count) | ROTM d, a, count |
| vector signed int | vector signed int | vector signed int | d = si_rotm(a, count) | ROTM d, a, count |
| vector unsigned short | vector unsigned short | 7-bit signed int (literal) | d = si_rothmi(a, count) | ROTHMI d, a, count |
| vector signed short | vector signed short | 7-bit signed int (literal) | d = si_rothmi(a, count) | ROTHMI d, a, count |
| vector unsigned short | vector unsigned short | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed short | vector signed short | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector unsigned int | vector unsigned int | 7-bit signed int (literal) | d = si_rotmi(a, count) | ROTMI d, a, count |
| vector signed int | vector signed int | 7-bit signed int (literal) | d = si_rotmi(a, count) | ROTMI d, a, count |
| vector unsigned int | vector unsigned int | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed int | vector signed int | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |



### spu\_rlmaska: Vector Rotate Left and Mask Algebraic by Bits

```
d = spu_rlmaska(a, count)
```

This function uses an element-wise rotate-left-and-mask operation to perform an arithmetic shift right (ASR), by bits, of each element of vector *a*, where *count* holds the negated value, or values, of the desired right-shift amounts. (The *count* parameter can be either a vector or a scalar, as shown in Table 2-70.) For example, if scalar *count* is -5, each element of *a* is shifted right by 5 bits, and copies of the sign bit fill the vacated high-order bits. The following pseudocode shows the effect of this function more precisely:

```
For (each halfword element h in vector a) {
   int bitshift = -count & 0x1F;
   h = (bitshift & 0x10)? ASR(h,15) : ASR(h,bitshift);
}

For (each word element w in vector a) {
   int bitshift = -count & 0x3F;
   w = (bitshift & 0x20)? ASR(w,31) : ASR(w,bitshift);
}
```

The results are returned in the corresponding elements of vector *d*.
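Because C's `>>` on a negative signed value is implementation-defined, a portable host model has to build the arithmetic shift explicitly. The sketch below does that on raw bit patterns, so it covers both the signed and unsigned rows of Table 2-70; `asr16`, `asr32`, `model_rlmaska_h` and `model_rlmaska_w` are hypothetical helper names.

```c
#include <stdint.h>

/* Arithmetic shift right of a 16-bit pattern by n (0..15) bits:
 * logical shift, then fill vacated high bits with the sign bit. */
static uint16_t asr16(uint16_t x, unsigned n)
{
    uint16_t r = (uint16_t)(x >> n);
    if ((x & 0x8000u) && n != 0)
        r |= (uint16_t)(0xFFFFu << (16 - n));
    return r;
}

/* Same for a 32-bit pattern, n in 0..31. */
static uint32_t asr32(uint32_t x, unsigned n)
{
    uint32_t r = x >> n;
    if ((x & 0x80000000u) && n != 0)
        r |= 0xFFFFFFFFu << (32 - n);
    return r;
}

/* Hypothetical models of one element of spu_rlmaska (see pseudocode above). */
static uint16_t model_rlmaska_h(uint16_t h, int count)
{
    unsigned bitshift = (0u - (unsigned)count) & 0x1Fu;
    return (bitshift & 0x10u) ? asr16(h, 15) : asr16(h, bitshift);
}

static uint32_t model_rlmaska_w(uint32_t w, int count)
{
    unsigned bitshift = (0u - (unsigned)count) & 0x3Fu;
    return (bitshift & 0x20u) ? asr32(w, 31) : asr32(w, bitshift);
}
```

Shift amounts of 16 or more (halfwords) or 32 or more (words) saturate to a full sign fill, so a negative element becomes all ones and a non-negative one becomes 0.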

Table 2-70: Vector Rotate Left and Mask Algebraic by Bits

| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned short | vector unsigned short | vector signed short | d = si_rotmah(a, count) | ROTMAH d, a, count |
| vector signed short | vector signed short | vector signed short | d = si_rotmah(a, count) | ROTMAH d, a, count |
| vector unsigned int | vector unsigned int | vector signed int | d = si_rotma(a, count) | ROTMA d, a, count |
| vector signed int | vector signed int | vector signed int | d = si_rotma(a, count) | ROTMA d, a, count |
| vector unsigned short | vector unsigned short | 7-bit signed int (literal) | d = si_rotmahi(a, count) | ROTMAHI d, a, count |
| vector signed short | vector signed short | 7-bit signed int (literal) | d = si_rotmahi(a, count) | ROTMAHI d, a, count |
| vector unsigned short | vector unsigned short | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed short | vector signed short | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector unsigned int | vector unsigned int | 7-bit signed int (literal) | d = si_rotmai(a, count) | ROTMAI d, a, count |
| vector signed int | vector signed int | 7-bit signed int (literal) | d = si_rotmai(a, count) | ROTMAI d, a, count |
| vector unsigned int | vector unsigned int | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed int | vector signed int | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |



### spu\_rlmaskqw: Quadword Rotate Left and Mask by Bits

```
d = spu_rlmaskqw(a, count)
```

This function uses a rotate-and-mask-quadword-by-bits operation to perform a quadword logical shift right (LSR) of up to 7 bits, where *count* holds the negated value of the desired right-shift amount. For example, if *count* is -5, vector *a* is shifted right by 5 bits. The following pseudocode shows the effect of this function more precisely:

```
qword spu_rlmaskqw(qword a, int count)
{
    int bitshift = -count & 0x7;
    return LSR(a, bitshift);
}
```

The resulting quadword is returned in vector d.
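A host-side sketch of this function needs some 128-bit representation. The one below assumes a hypothetical `qw128` struct whose `hi` member holds bytes 0-7 (the most significant half in the SPU's big-endian register layout) and whose `lo` member holds bytes 8-15; `qw_lsr` and `model_rlmaskqw` are likewise hypothetical names.

```c
#include <stdint.h>

/* Hypothetical 128-bit quadword: hi = bytes 0-7, lo = bytes 8-15. */
typedef struct { uint64_t hi, lo; } qw128;

/* Logical shift right by n bits; shifts of 128 or more yield zero. */
static qw128 qw_lsr(qw128 a, unsigned n)
{
    qw128 r = {0, 0};
    if (n == 0)
        return a;
    if (n < 64) {
        r.hi = a.hi >> n;
        r.lo = (a.lo >> n) | (a.hi << (64 - n));
    } else if (n < 128) {
        r.lo = a.hi >> (n - 64);
    }
    return r;
}

/* Model of spu_rlmaskqw: only the 3 low bits of -count are used. */
static qw128 model_rlmaskqw(qw128 a, int count)
{
    unsigned bitshift = (0u - (unsigned)count) & 0x7u;
    return qw_lsr(a, bitshift);
}
```

Under this model a count of -9 shifts by only 1 bit, which is why the function is limited to shifts of up to 7 bits.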

Table 2-71: Quadword Rotate Left and Mask by Bits

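| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned char | vector unsigned char | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector signed char | vector signed char | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector unsigned short | vector unsigned short | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector signed short | vector signed short | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector unsigned int | vector unsigned int | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector signed int | vector signed int | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector unsigned long long | vector unsigned long long | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector signed long long | vector signed long long | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector float | vector float | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector double | vector double | int (literal) | d = si_rotqmbii(a, count) (count = 7-bit immediate) | ROTQMBII d, a, count |
| vector unsigned char | vector unsigned char | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector signed char | vector signed char | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector unsigned short | vector unsigned short | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector signed short | vector signed short | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector unsigned int | vector unsigned int | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector signed int | vector signed int | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector unsigned long long | vector unsigned long long | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector signed long long | vector signed long long | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector float | vector float | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |
| vector double | vector double | int (non-literal) | d = si_rotqmbi(a, si_from_int(count)) | ROTQMBI d, a, count |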



### spu\_rlmaskqwbyte: Quadword Rotate Left and Mask by Bytes

```
d = spu_rlmaskqwbyte(a, count)
```

This function uses a rotate-and-mask-quadword-by-bytes operation to perform a quadword logical shift right (LSR) by bytes, where *count* holds the negated value of the desired byte right-shift amount. For example, if *count* is -5, vector *a* is shifted right by 5 bytes. The following pseudocode shows the effect of this function more precisely:

```
qword spu_rlmaskqwbyte(qword a, int count)
{
    int bitshift = (-count << 3) & 0xF8;
    return LSR(a, bitshift);
}
```

The resulting quadword is returned in vector *d*.
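The byte variant can be sketched on the same hypothetical `qw128` layout (`hi` = bytes 0-7, `lo` = bytes 8-15). The sketch computes `8 * (-count mod 32)` in unsigned arithmetic, because left-shifting a negative `int` as in the pseudocode is undefined in C; `qw_lsr` and `model_rlmaskqwbyte` are hypothetical names.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } qw128;   /* hypothetical: hi = bytes 0-7 */

/* Logical shift right by n bits; shifts of 128 or more yield zero. */
static qw128 qw_lsr(qw128 a, unsigned n)
{
    qw128 r = {0, 0};
    if (n == 0)
        return a;
    if (n < 64) {
        r.hi = a.hi >> n;
        r.lo = (a.lo >> n) | (a.hi << (64 - n));
    } else if (n < 128) {
        r.lo = a.hi >> (n - 64);
    }
    return r;
}

/* Model of spu_rlmaskqwbyte: bit shift = 8 * (-count mod 32). */
static qw128 model_rlmaskqwbyte(qw128 a, int count)
{
    unsigned bitshift = ((0u - (unsigned)count) << 3) & 0xF8u;
    return qw_lsr(a, bitshift);
}
```

In this model, byte shifts from 16 to 31 clear the entire quadword.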

Table 2-72: Quadword Rotate Left and Mask by Bytes

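| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned char | vector unsigned char | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector signed char | vector signed char | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector unsigned short | vector unsigned short | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector signed short | vector signed short | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector unsigned int | vector unsigned int | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector signed int | vector signed int | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector unsigned long long | vector unsigned long long | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector signed long long | vector signed long long | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector float | vector float | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector double | vector double | int (literal) | d = si_rotqmbyi(a, count) (count = 7-bit immediate) | ROTQMBYI d, a, count |
| vector unsigned char | vector unsigned char | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector signed char | vector signed char | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector unsigned short | vector unsigned short | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector signed short | vector signed short | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector unsigned int | vector unsigned int | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector signed int | vector signed int | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector unsigned long long | vector unsigned long long | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector signed long long | vector signed long long | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector float | vector float | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |
| vector double | vector double | int (non-literal) | d = si_rotqmby(a, si_from_int(count)) | ROTQMBY d, a, count |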

### spu\_rlmaskqwbytebc: Quadword Rotate Left and Mask by Bytes from Bit Shift Count

```
d = spu_rlmaskqwbytebc(a, count)
```

This function uses a rotate-and-mask-quadword-by-bytes-from-bit-shift-count operation to perform a quadword logical shift right (LSR) by bytes, where bits 24-28 of *count* (the bits selected by the mask 0xF8, with bit 0 being the most significant bit of the 32-bit word) hold the negated value of the desired byte right-shift amount. For example, if the bit shift count is -10, then count & 0xF8 is 0xF0, its negation masked with 0xF8 is 0x10 (16 bits), and vector *a* is shifted right by 2 bytes. The following pseudocode shows the effect of this function more precisely:

```
qword spu_rlmaskqwbytebc(qword a, int count)
{
    int bitshift = -(count & 0xF8) & 0xF8;
    return LSR(a, bitshift);
}
```

The resulting quadword is returned in vector *d*.



The following example shows typical usage of this function together with spu\_rlmaskqw; it computes a vector *d* equal to vector *a* logically shifted right by *n* bits, for *n* from 0 to 127:

```
d = spu_rlmaskqwbytebc(a,7-n);
d = spu_rlmaskqw(d,-n);
```
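The two-step idiom shown above can be checked on a host with a small model. This sketch assumes the same hypothetical `qw128` layout used for the other quadword functions (`hi` = bytes 0-7); `model_rlmaskqwbytebc`, `model_rlmaskqw` and `shift_right_bits` are hypothetical names. The first call shifts by whole bytes (8 times the integer part of n/8) and the second by the remaining n mod 8 bits.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } qw128;   /* hypothetical: hi = bytes 0-7 */

/* Logical shift right by n bits; shifts of 128 or more yield zero. */
static qw128 qw_lsr(qw128 a, unsigned n)
{
    qw128 r = {0, 0};
    if (n == 0)
        return a;
    if (n < 64) {
        r.hi = a.hi >> n;
        r.lo = (a.lo >> n) | (a.hi << (64 - n));
    } else if (n < 128) {
        r.lo = a.hi >> (n - 64);
    }
    return r;
}

/* Model of spu_rlmaskqwbytebc: uses only bits 24-28 (mask 0xF8) of count. */
static qw128 model_rlmaskqwbytebc(qw128 a, int count)
{
    unsigned bitshift = (0u - ((unsigned)count & 0xF8u)) & 0xF8u;
    return qw_lsr(a, bitshift);
}

/* Model of spu_rlmaskqw: only the 3 low bits of -count are used. */
static qw128 model_rlmaskqw(qw128 a, int count)
{
    return qw_lsr(a, (0u - (unsigned)count) & 0x7u);
}

/* The idiom from the text: logical shift right by n bits, 0 <= n <= 127. */
static qw128 shift_right_bits(qw128 a, int n)
{
    qw128 d = model_rlmaskqwbytebc(a, 7 - n);
    return model_rlmaskqw(d, -n);
}
```

Within this model, `shift_right_bits(a, n)` agrees with a direct 128-bit logical shift for every *n* from 0 to 127, and a bit shift count of -10 passed to `model_rlmaskqwbytebc` shifts by exactly 2 bytes.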

Table 2-73: Quadword Rotate Left and Mask by Bytes from Bit Shift Count

| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned char | vector unsigned char | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector signed char | vector signed char | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector unsigned short | vector unsigned short | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector signed short | vector signed short | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector unsigned int | vector unsigned int | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector signed int | vector signed int | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector unsigned long long | vector unsigned long long | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector signed long long | vector signed long long | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector float | vector float | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |
| vector double | vector double | int | d = si_rotqmbybi(a, si_from_int(count)) | ROTQMBYBI d, a, count |

### spu\_rlqw: Quadword Rotate Left by Bits

d = spu\_rlqw(a, count)

Vector *a* is rotated to the left by the number of bits specified by the 3 least significant bits of *count*. Bits rotated out of the left end of the vector are rotated in on the right. The result is returned in vector *d*.
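A minimal host-side sketch of this rotation, assuming a hypothetical `qw128` struct in which `hi` holds bytes 0-7 and `lo` holds bytes 8-15; `model_rlqw` is a hypothetical name, not an SPU intrinsic.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } qw128;   /* hypothetical: hi = bytes 0-7 */

/* Model of spu_rlqw: rotate left by the 3 least significant bits of count. */
static qw128 model_rlqw(qw128 a, int count)
{
    unsigned n = (unsigned)count & 0x7u;
    qw128 r;
    if (n == 0)
        return a;                     /* avoids an undefined shift by 64 */
    r.hi = (a.hi << n) | (a.lo >> (64 - n));
    r.lo = (a.lo << n) | (a.hi >> (64 - n));
    return r;
}
```

Since only 3 bits are used, counts of 1 and 9 give the same result, and a count of -1 rotates left by 7 bits.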

Table 2-74: Quadword Rotate Left by Bits

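| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned char | vector unsigned char | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector signed char | vector signed char | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector unsigned short | vector unsigned short | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector signed short | vector signed short | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector unsigned int | vector unsigned int | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector signed int | vector signed int | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector unsigned long long | vector unsigned long long | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector signed long long | vector signed long long | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector float | vector float | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector double | vector double | int (literal) | d = si_rotqbii(a, count) (count = 7-bit immediate) | ROTQBII d, a, count |
| vector unsigned char | vector unsigned char | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector signed char | vector signed char | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector unsigned short | vector unsigned short | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector signed short | vector signed short | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector unsigned int | vector unsigned int | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector signed int | vector signed int | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector unsigned long long | vector unsigned long long | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector signed long long | vector signed long long | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector float | vector float | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |
| vector double | vector double | int (non-literal) | d = si_rotqbi(a, si_from_int(count)) | ROTQBI d, a, count |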



### spu\_rlqwbyte: Quadword Rotate Left by Bytes

d = spu\_rlqwbyte(a, count)

Vector *a* is rotated to the left by the number of bytes specified by the 4 least significant bits of *count*. Bytes rotated out of the left end of the vector are rotated in on the right. The result is returned in vector *d*.
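The byte rotation can be sketched with a general 128-bit rotate helper. As before, `qw128` (with `hi` = bytes 0-7 in big-endian order), `qw_rotl` and `model_rlqwbyte` are hypothetical names chosen for illustration.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } qw128;   /* hypothetical: hi = bytes 0-7 */

/* Rotate a quadword left by n bits, n taken modulo 128. */
static qw128 qw_rotl(qw128 a, unsigned n)
{
    qw128 r;
    n &= 127u;
    if (n >= 64) {                    /* rotating by 64 swaps the halves */
        uint64_t t = a.hi;
        a.hi = a.lo;
        a.lo = t;
        n -= 64;
    }
    if (n == 0)
        return a;
    r.hi = (a.hi << n) | (a.lo >> (64 - n));
    r.lo = (a.lo << n) | (a.hi >> (64 - n));
    return r;
}

/* Model of spu_rlqwbyte: rotate left by the 4 low bits of count, in bytes. */
static qw128 model_rlqwbyte(qw128 a, int count)
{
    return qw_rotl(a, ((unsigned)count & 0xFu) * 8u);
}
```

For example, with bytes 00 01 ... 0F, a count of 1 moves byte 00 to the end, a count of 16 leaves the vector unchanged, and a count of -1 (low bits 15) acts as a one-byte rotate to the right.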

Table 2-75: Quadword Rotate Left by Bytes

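| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned char | vector unsigned char | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector signed char | vector signed char | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector unsigned short | vector unsigned short | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector signed short | vector signed short | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector unsigned int | vector unsigned int | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector signed int | vector signed int | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector unsigned long long | vector unsigned long long | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector signed long long | vector signed long long | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector float | vector float | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector double | vector double | int (literal) | d = si_rotqbyi(a, count) (count = 7-bit immediate) | ROTQBYI d, a, count |
| vector unsigned char | vector unsigned char | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector signed char | vector signed char | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector unsigned short | vector unsigned short | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector signed short | vector signed short | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector unsigned int | vector unsigned int | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector signed int | vector signed int | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector unsigned long long | vector unsigned long long | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector signed long long | vector signed long long | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector float | vector float | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |
| vector double | vector double | int (non-literal) | d = si_rotqby(a, si_from_int(count)) | ROTQBY d, a, count |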



### spu\_rlqwbytebc: Quadword Rotate Left by Bytes from Bit Shift Count

d = spu\_rlqwbytebc(a, count)

Vector *a* is rotated to the left by the number of bytes specified by bits 24-28 of *count*. Bytes rotated out of the left end of the vector are rotated in at the right. The result is returned in vector *d*.
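A host-side sketch follows, again assuming the hypothetical `qw128` layout (`hi` = bytes 0-7). Masking *count* with 0xF8 gives the rotation in bits, always a whole number of bytes, and since a 16-byte rotation is the identity, the model reduces it modulo 128 bits. The helper `rotate_left_bits` is a hypothetical illustration of pairing this function with spu\_rlqw, in the spirit of the shift idiom given for spu\_rlmaskqwbytebc; it is not taken from the specification.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } qw128;   /* hypothetical: hi = bytes 0-7 */

/* Rotate a quadword left by n bits, n taken modulo 128. */
static qw128 qw_rotl(qw128 a, unsigned n)
{
    qw128 r;
    n &= 127u;
    if (n >= 64) {                    /* rotating by 64 swaps the halves */
        uint64_t t = a.hi;
        a.hi = a.lo;
        a.lo = t;
        n -= 64;
    }
    if (n == 0)
        return a;
    r.hi = (a.hi << n) | (a.lo >> (64 - n));
    r.lo = (a.lo << n) | (a.hi >> (64 - n));
    return r;
}

/* Model of spu_rlqwbytebc: bits 24-28 of count (mask 0xF8) give whole bytes. */
static qw128 model_rlqwbytebc(qw128 a, int count)
{
    return qw_rotl(a, (unsigned)count & 0xF8u);
}

/* Model of spu_rlqw: rotate by the 3 low bits of count. */
static qw128 model_rlqw(qw128 a, int count)
{
    return qw_rotl(a, (unsigned)count & 0x7u);
}

/* Hypothetical pairing: rotate left by an arbitrary bit count n. */
static qw128 rotate_left_bits(qw128 a, int n)
{
    return model_rlqw(model_rlqwbytebc(a, n), n);
}
```

In this model a bit count of 10 rotates by 1 byte, a count of 7 leaves the vector unchanged, and the pairing reproduces a direct rotation for every *n* from 0 to 127.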

Table 2-76: Quadword Rotate Left by Bytes from Bit Shift Count

| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned char | vector unsigned char | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector signed char | vector signed char | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector unsigned short | vector unsigned short | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector signed short | vector signed short | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector unsigned int | vector unsigned int | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector signed int | vector signed int | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector unsigned long long | vector unsigned long long | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector signed long long | vector signed long long | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector float | vector float | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |
| vector double | vector double | int | d = si_rotqbybi(a, si_from_int(count)) | ROTQBYBI d, a, count |

### spu\_sl: Vector Shift Left by Bits

d = spu\_sl(a, count)

Each element of vector *a* is shifted left by the number of bits specified by the corresponding element in vector *count*. If *count* is a scalar, its value is first replicated into every element, and then *a* is shifted.

Bits shifted out of the left end of the element are discarded, and zeros are shifted in at the right. Only the low-order bits of *count* are used, and how many depends on the element size: for halfword elements, the 5 least significant bits of *count* are used; for word elements, the 6 least significant bits are used. The results are returned in the corresponding elements of vector *d*.
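A per-element host model makes the consequence of using 5 or 6 count bits explicit: shift amounts at or beyond the element width, within those bits, produce zero. This is a sketch; `model_sl_h` and `model_sl_w` are hypothetical helper names, not SPU intrinsics.

```c
#include <stdint.h>

/* Hypothetical model of one halfword element of spu_sl: only the 5 least
 * significant bits of count are used; shifts of 16..31 give zero. */
static uint16_t model_sl_h(uint16_t h, unsigned count)
{
    unsigned n = count & 0x1Fu;
    return (n > 15) ? 0 : (uint16_t)((unsigned)h << n);
}

/* Hypothetical model of one word element of spu_sl: only the 6 least
 * significant bits of count are used; shifts of 32..63 give zero. */
static uint32_t model_sl_w(uint32_t w, unsigned count)
{
    unsigned n = count & 0x3Fu;
    return (n > 31) ? 0 : w << n;
}
```

Note the contrast with plain C: `w << 32` on a 32-bit `uint32_t` is undefined behavior, whereas the model, following the description above, yields zero.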

Table 2-77: Vector Shift Left by Bits

| d | a | count | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| vector unsigned short | vector unsigned short | vector unsigned short | d = si_shlh(a, count) | SHLH d, a, count |
| vector signed short | vector signed short | vector unsigned short | d = si_shlh(a, count) | SHLH d, a, count |
| vector unsigned int | vector unsigned int | vector unsigned int | d = si_shl(a, count) | SHL d, a, count |
| vector signed int | vector signed int | vector unsigned int | d = si_shl(a, count) | SHL d, a, count |
| vector unsigned short | vector unsigned short | 7-bit unsigned int (literal) | d = si_shlhi(a, count) | SHLHI d, a, count |
| vector signed short | vector signed short | 7-bit unsigned int (literal) | d = si_shlhi(a, count) | SHLHI d, a, count |
| vector unsigned short | vector unsigned short | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed short | vector signed short | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector unsigned int | vector unsigned int | 7-bit unsigned int (literal) | d = si_shli(a, count) | SHLI d, a, count |
| vector signed int | vector signed int | 7-bit unsigned int (literal) | d = si_shli(a, count) | SHLI d, a, count |
| vector unsigned int | vector unsigned int | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed int | vector signed int | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |



### spu\_slqw: Quadword Shift Left by Bits

d = spu\_slqw(a, count)

Vector *a* is shifted left by the number of bits specified by the 3 least significant bits of *count*. Bits shifted out of the left end of the vector are discarded, and zeros are shifted in at the right. The result is returned in vector *d*.
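The following sketch models `spu_slqw` in portable C so the bit-level behavior can be run on any host. It is an illustration under stated assumptions, not the SPU intrinsic. `model_slqw` is a hypothetical helper. The quadword is assumed to be held as 16 bytes, with index 0 as the leftmost (most significant) byte, matching the SPU's big-endian register layout.

```c
#include <stdint.h>

/* Hypothetical host-side model of spu_slqw (not an SPU API).
   a[0] / d[0] is the leftmost, most significant byte of the quadword. */
static void model_slqw(const uint8_t a[16], unsigned int count, uint8_t d[16])
{
    unsigned int s = count & 7u;  /* only the 3 least significant bits are used */

    for (int i = 0; i < 16; i++) {
        /* Bits leaving this byte on the left are dropped by the uint8_t cast. */
        unsigned int hi = (unsigned int)a[i] << s;
        /* Bits entering from the right come from the next byte (zeros past the end). */
        unsigned int lo = (i < 15) ? (unsigned int)a[i + 1] >> (8u - s) : 0u;
        d[i] = (uint8_t)(hi | lo);
    }
}
```

For example, take `a[0] = 0x01`, `a[1] = 0x80`, and a *count* of 9. Only the low three bits of *count* are used, giving a 1-bit shift, so `d[0]` becomes `0x03`.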

Table 2-78: Quadword Shift Left by Bits

| Return/Argument Types | | | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| d | a | count | Specific Intrinsics | Assembly Mapping |
| vector unsigned char | vector unsigned char | unsigned int (literal) | d = si_shlqbii(a, count)<br>(count = 7-bit immediate) | SHLQBII d, a, count |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |
| vector unsigned char | vector unsigned char | unsigned int (non-literal) | d = si_shlqbi(a, si_from_uint(count)) | SHLQBI d, a, count |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |

### spu\_slqwbyte: Quadword Shift Left by Bytes

d = spu\_slqwbyte(a, count)

Vector *a* is shifted left by the number of bytes specified by the 5 least significant bits of *count*. Bytes shifted out of the left end of the vector are discarded, and zeros are shifted in at the right. The result is returned in vector *d*.
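The byte variant can be sketched the same way on a host. `model_slqwbyte` is a hypothetical helper, and index 0 is assumed to be the leftmost byte. Because a 5-bit *count* can reach 31, counts of 16 or more shift every byte out and leave an all-zero result.

```c
#include <stdint.h>

/* Hypothetical host-side model of spu_slqwbyte (not an SPU API).
   a[0] / d[0] is the leftmost byte; shifting left moves bytes toward index 0. */
static void model_slqwbyte(const uint8_t a[16], unsigned int count, uint8_t d[16])
{
    unsigned int n = count & 31u;  /* 5 least significant bits: 0..31 bytes */

    for (unsigned int i = 0; i < 16; i++)
        d[i] = (i + n < 16u) ? a[i + n] : 0u;  /* zeros enter at the right */
}
```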

Table 2-79: Quadword Shift Left by Bytes

| Return/Argument Types | | | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| d | a | count | Specific Intrinsics | Assembly Mapping |
| vector unsigned char | vector unsigned char | unsigned int (literal) | d = si_shlqbyi(a, count)<br>(count = 7-bit immediate) | SHLQBYI d, a, count |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |
| vector unsigned char | vector unsigned char | unsigned int (non-literal) | d = si_shlqby(a, si_from_uint(count)) | SHLQBY d, a, count |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |

### spu\_slqwbytebc: Quadword Shift Left by Bytes from Bit Shift Count

d = spu\_slqwbytebc(a, count)

Vector *a* is shifted left by the number of bytes specified by bits 24-28 of *count*. Bytes shifted out of the left end of the vector are discarded, and zeros are shifted in at the right. The result is returned in vector *d*.
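In this document's bit numbering, bit 0 is the most significant bit. Bits 24-28 of a 32-bit *count* are therefore the value `(count >> 3) & 0x1F`. In effect, *count* is read as a number of bits, and its 3 least significant bits are ignored. The hypothetical helper below models this behavior, assuming index 0 holds the leftmost byte.

```c
#include <stdint.h>

/* Hypothetical host-side model of spu_slqwbytebc (not an SPU API).
   a[0] / d[0] is the leftmost byte of the quadword. */
static void model_slqwbytebc(const uint8_t a[16], unsigned int count, uint8_t d[16])
{
    /* Bits 24-28 (bit 0 = MSB) of a 32-bit word are value bits 7..3. */
    unsigned int n = (count >> 3) & 31u;

    for (unsigned int i = 0; i < 16; i++)
        d[i] = (i + n < 16u) ? a[i + n] : 0u;  /* zeros enter at the right */
}
```

This form ignores exactly the 3 bits that `spu_slqw` uses. Applying both with the same *count* should therefore give a full quadword shift by *count* bits for counts below 128. That composition follows from the two descriptions; it is not stated as a documented idiom.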

Table 2-80: Quadword Shift Left by Bytes from Bit Shift Count

| Return/Argument Types     |                           |              | Specific Intrinsics                        | Assembly Mapping     |
|---------------------------|---------------------------|--------------|--------------------------------------------|----------------------|
| d                         | a                         | count        | Specific Intrinsics                        | Assembly Mapping     |
| vector unsigned char      | vector unsigned char      | unsigned int | d = si_shlqbybi(a,<br>si_from_uint(count)) | SHLQBYBI d, a, count |
| vector signed char        | vector signed char        |              |                                            |                      |
| vector unsigned short     | vector unsigned short     |              |                                            |                      |
| vector signed short       | vector signed short       |              |                                            |                      |
| vector unsigned int       | vector unsigned int       |              |                                            |                      |
| vector signed int         | vector signed int         |              |                                            |                      |
| vector unsigned long long | vector unsigned long long |              |                                            |                      |
| vector signed long long   | vector signed long long   |              |                                            |                      |
| vector float              | vector float              |              |                                            |                      |
| vector double             | vector double             |              |                                            |                      |



### spu\_sr: Vector Shift Right by Bits

d = spu\_sr(a, count)

Each element of vector *a* is shifted right by the number of bits specified by the corresponding element in vector *count*. If *count* is a scalar, the scalar value is first replicated for each element, and then *a* is shifted.

Bits shifted out of the right end of the element are discarded, and zeros are shifted in at the left. Only a limited number of bits of *count* are used, depending on the element size: the 5 least significant bits for halfword elements and the 6 least significant bits for word elements. The result is returned in the corresponding elements of vector *d*.
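The per-element rule can be sketched in plain C. The helpers below (`model_sr_half`, `model_sr_word`, `model_sr_word_vec`) are hypothetical host-side models, not SPU APIs. They spell out one consequence of the masking rule above: when the masked count is at least the element width, every bit is shifted out and the element becomes zero. The scalar-count helper mirrors the replication step by applying one count to all four word elements.

```c
#include <stdint.h>

/* Hypothetical per-element models of spu_sr (not SPU APIs). */

/* Halfword elements: 5 count bits, so shifts of 16..31 clear the element. */
static uint16_t model_sr_half(uint16_t a, unsigned int count)
{
    unsigned int s = count & 31u;
    return (s >= 16u) ? 0u : (uint16_t)(a >> s);
}

/* Word elements: 6 count bits, so shifts of 32..63 clear the element. */
static uint32_t model_sr_word(uint32_t a, unsigned int count)
{
    unsigned int s = count & 63u;
    return (s >= 32u) ? 0u : (a >> s);
}

/* Scalar count: the same count is applied to each of the four word elements. */
static void model_sr_word_vec(const uint32_t a[4], unsigned int count, uint32_t d[4])
{
    for (int i = 0; i < 4; i++)
        d[i] = model_sr_word(a[i], count);
}
```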

Table 2-81: Vector Shift Right by Bits

| Return/Argument Types | | | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| d | a | count | Specific Intrinsics | Assembly Mapping |
| vector unsigned short | vector unsigned short | vector unsigned short | rt = si_sfhi(count, 0)<br>d = si_rothm(a, rt) | SFHI rt, count, 0<br>ROTHM d, a, rt |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | vector unsigned int | rt = si_sfi(count, 0)<br>d = si_rotm(a, rt) | SFI rt, count, 0<br>ROTM d, a, rt |
| vector signed int | vector signed int | | | |
| vector unsigned short | vector unsigned short | 7-bit unsigned int (literal) | d = si_rothmi(a, -count) | ROTHMI d, a, -count |
| vector signed short | vector signed short | | | |
| vector unsigned short | vector unsigned short | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed short | vector signed short | unsigned int | | |
| vector unsigned int | vector unsigned int | 7-bit unsigned int (literal) | d = si_rotmi(a, -count) | ROTMI d, a, -count |
| vector signed int | vector signed int | | | |
| vector unsigned int | vector unsigned int | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed int | vector signed int | unsigned int | | |

### spu\_sra: Vector Shift Right Algebraic by Bits

d = spu\_sra(a, count)

Each element of vector *a* is shifted right by the number of bits specified by the corresponding element in vector *count*. If *count* is a scalar, the scalar value is first replicated for each element, and then *a* is shifted.

Bits shifted out of the right end of the element are discarded, and copies of the sign bit are shifted in at the left. Only a limited number of bits of *count* are used, depending on the element size: the 5 least significant bits for halfword elements and the 6 least significant bits for word elements. The result is returned in the corresponding elements of vector *d*.
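The algebraic form can be sketched the same way. The helpers below are hypothetical host-side models. The sign fill is written out explicitly because right-shifting a negative signed value is implementation-defined in C, and the conversion back to a signed type assumes two's complement. A masked count at or above the element width leaves only copies of the sign bit, which gives the same result as shifting by the width minus one.

```c
#include <stdint.h>

/* Hypothetical per-element models of spu_sra (not SPU APIs). */

/* Word elements: 6 count bits; shifts of 32..63 leave only sign bits. */
static int32_t model_sra_word(int32_t a, unsigned int count)
{
    unsigned int s = count & 63u;
    if (s > 31u)
        s = 31u;                        /* same result as shifting out every bit */
    uint32_t u = (uint32_t)a;
    uint32_t fill = (u & 0x80000000u) ? ~(UINT32_MAX >> s) : 0u;
    return (int32_t)((u >> s) | fill);  /* assumes two's complement */
}

/* Halfword elements: 5 count bits; shifts of 16..31 leave only sign bits. */
static int16_t model_sra_half(int16_t a, unsigned int count)
{
    unsigned int s = count & 31u;
    if (s > 15u)
        s = 15u;
    uint16_t u = (uint16_t)a;
    uint16_t fill = (u & 0x8000u) ? (uint16_t)~(0xFFFFu >> s) : 0u;
    return (int16_t)(uint16_t)((u >> s) | fill);
}
```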



Table 2-82: Vector Shift Right Algebraic by Bits

| Return/Argument Types | | | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| d | a | count | Specific Intrinsics | Assembly Mapping |
| vector unsigned short | vector unsigned short | vector unsigned short | rt = si_sfhi(count, 0)<br>d = si_rotmah(a, rt) | SFHI rt, count, 0<br>ROTMAH d, a, rt |
| vector signed short | vector signed short | vector unsigned short | | |
| vector unsigned int | vector unsigned int | vector unsigned int | rt = si_sfi(count, 0)<br>d = si_rotma(a, rt) | SFI rt, count, 0<br>ROTMA d, a, rt |
| vector signed int | vector signed int | vector unsigned int | | |
| vector unsigned short | vector unsigned short | 7-bit unsigned int (literal) | d = si_rotmahi(a, -count) | ROTMAHI d, a, -count |
| vector signed short | vector signed short | | | |
| vector unsigned short | vector unsigned short | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed short | vector signed short | unsigned int | | |
| vector unsigned int | vector unsigned int | 7-bit unsigned int (literal) | d = si_rotmai(a, -count) | ROTMAI d, a, -count |
| vector signed int | vector signed int | | | |
| vector unsigned int | vector unsigned int | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". | |
| vector signed int | vector signed int | unsigned int | | |

### spu\_srqw: Quadword Shift Right by Bits

d = spu\_srqw(a, count)

Vector *a* is shifted right by the number of bits specified by the 3 least significant bits of *count*. Bits shifted out of the right end of the vector are discarded, and zeros are shifted in at the left. The result is returned in vector *d*.
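The right-shift counterpart can also be modeled on a host. `model_srqw` is a hypothetical helper, not an SPU API, and index 0 is assumed to hold the leftmost (most significant) byte.

```c
#include <stdint.h>

/* Hypothetical host-side model of spu_srqw (not an SPU API).
   a[0] / d[0] is the leftmost, most significant byte of the quadword. */
static void model_srqw(const uint8_t a[16], unsigned int count, uint8_t d[16])
{
    unsigned int s = count & 7u;  /* only the 3 least significant bits are used */

    for (int i = 0; i < 16; i++) {
        /* Bits leaving this byte on the right are simply dropped. */
        unsigned int lo = (unsigned int)a[i] >> s;
        /* Bits entering from the left come from the previous byte (zeros at index 0). */
        unsigned int hi = (i > 0) ? ((unsigned int)a[i - 1] << (8u - s)) & 0xFFu : 0u;
        d[i] = (uint8_t)(lo | hi);
    }
}
```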

Table 2-83: Quadword Shift Right by Bits

| Return/Argument Types | | | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| d | a | count | Specific Intrinsics | Assembly Mapping |
| vector unsigned char | vector unsigned char | unsigned int (literal) | d = si_rotqmbii(a, -count) | ROTQMBII d, a, -count |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |
| vector unsigned char | vector unsigned char | unsigned int (non-literal) | rt = si_sfi(si_from_uint(count), 0)<br>d = si_rotqmbi(a, rt) | SFI rt, count, 0<br>ROTQMBI d, a, rt |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |





### spu\_srqwbyte: Quadword Shift Right by Bytes

d = spu\_srqwbyte(a, count)

Vector *a* is shifted right by the number of bytes specified by the 5 least significant bits of *count*. Bytes shifted out of the right end of the vector are discarded, and zeros are shifted in at the left. The result is returned in vector *d*.
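A host-side sketch of the byte form uses the hypothetical helper `model_srqwbyte`, with index 0 assumed to be the leftmost byte. A 5-bit count of 16 or more shifts every byte out, leaving zero.

```c
#include <stdint.h>

/* Hypothetical host-side model of spu_srqwbyte (not an SPU API).
   a[0] / d[0] is the leftmost byte; shifting right moves bytes toward index 15. */
static void model_srqwbyte(const uint8_t a[16], unsigned int count, uint8_t d[16])
{
    unsigned int n = count & 31u;  /* 5 least significant bits: 0..31 bytes */

    for (unsigned int i = 0; i < 16; i++)
        d[i] = (i >= n) ? a[i - n] : 0u;  /* zeros enter at the left */
}
```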

Table 2-84: Quadword Shift Right by Bytes

| Return/Argument Types | | | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| d | a | count | Specific Intrinsics | Assembly Mapping |
| vector unsigned char | vector unsigned char | unsigned int (literal) | d = si_rotqmbyi(a, -count) | ROTQMBYI d, a, -count |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |
| vector unsigned char | vector unsigned char | unsigned int (non-literal) | rt = si_sfi(si_from_uint(count), 0)<br>d = si_rotqmby(a, rt) | SFI rt, count, 0<br>ROTQMBY d, a, rt |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |



### spu\_srqwbytebc: Quadword Shift Right by Bytes from Bit Shift Count

d = spu\_srqwbytebc(a, count)

Vector *a* is shifted right by the number of bytes specified by bits 24-28 of *count*. Bytes shifted out of the right end of the vector are discarded, and zeros are shifted in at the left. The result is returned in vector *d*.
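Because bit 0 is the most significant bit, bits 24-28 of *count* amount to `(count >> 3) & 0x1F`. In other words, *count* is read as a bit count rounded down to whole bytes. The hypothetical helper below models the result directly, assuming index 0 is the leftmost byte.

```c
#include <stdint.h>

/* Hypothetical host-side model of spu_srqwbytebc (not an SPU API).
   a[0] / d[0] is the leftmost byte of the quadword. */
static void model_srqwbytebc(const uint8_t a[16], unsigned int count, uint8_t d[16])
{
    /* Bits 24-28 (bit 0 = MSB) of a 32-bit word are value bits 7..3. */
    unsigned int n = (count >> 3) & 31u;

    for (unsigned int i = 0; i < 16; i++)
        d[i] = (i >= n) ? a[i - n] : 0u;  /* zeros enter at the left */
}
```

The table's mapping uses `SFI rt, count, 7` instead of simply negating *count*. Presumably the added 7 makes the byte count that ROTQMBYBI derives from the negated value round the same way as `count >> 3`. The model avoids that detail by computing the byte count directly.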

Table 2-85: Quadword Shift Right by Bytes from Bit Shift Count

| Return/Argument Types | | | Specific Intrinsics | Assembly Mapping |
|---|---|---|---|---|
| d | a | count | Specific Intrinsics | Assembly Mapping |
| vector unsigned char | vector unsigned char | unsigned int | rt = si_sfi(si_from_uint(count), 7)<br>d = si_rotqmbybi(a, rt) | SFI rt, count, 7<br>ROTQMBYBI d, a, rt |
| vector signed char | vector signed char | | | |
| vector unsigned short | vector unsigned short | | | |
| vector signed short | vector signed short | | | |
| vector unsigned int | vector unsigned int | | | |
| vector signed int | vector signed int | | | |
| vector unsigned long long | vector unsigned long long | | | |
| vector signed long long | vector signed long long | | | |
| vector float | vector float | | | |
| vector double | vector double | | | |

# 2.11. Control Intrinsics

### spu\_idisable: Disable Interrupts

(void) spu\_idisable()

Asynchronous interrupts are disabled.

This intrinsic is considered volatile with respect to all other instructions; thus, the BID instruction will not be reordered with any other instructions.

Table 2-86: Disable Interrupts

| Specific Intrinsics | Assembly Mapping |
|---------------------|------------------|
| N/A | position dependent:<br>ILA t, next_inst<br>BID t<br>next_inst: |
| | position independent:<br>BRSL t, next_inst<br>next_inst:<br>AI t, t, 8<br>BID t |



### spu\_ienable: Enable Interrupts

(void) spu\_ienable()

Asynchronous interrupts are enabled.

This intrinsic is considered volatile with respect to all other instructions; thus, the BIE instruction will not be reordered with any other instructions.

Table 2-87: Enable Interrupts

| Specific Intrinsics | Assembly Mapping |
|---------------------|------------------|
| N/A | position dependent:<br>ILA t, next_inst<br>BIE t<br>next_inst: |
| | position independent:<br>BRSL t, next_inst<br>next_inst:<br>AI t, t, 8<br>BIE t |
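A typical use of the two interrupt-control intrinsics is to bracket a short read-modify-write that an interrupt handler might also touch. The sketch assumes that the SPU compiler predefines `__SPU__` and provides these intrinsics through `spu_intrinsics.h`. On other hosts, the macros fall back to hypothetical no-ops so the code still builds. `pending_events` and `add_pending_events` are illustrative names only.

```c
/* Critical-section sketch around spu_idisable()/spu_ienable().
   Assumption: __SPU__ is defined only when compiling for the SPU. */
#ifdef __SPU__
#include <spu_intrinsics.h>
#define ENTER_CRITICAL() spu_idisable()
#define LEAVE_CRITICAL() spu_ienable()
#else
/* Hypothetical host stand-ins: no interrupts to mask, so do nothing. */
#define ENTER_CRITICAL() ((void)0)
#define LEAVE_CRITICAL() ((void)0)
#endif

/* Hypothetical counter that an interrupt handler might also update. */
static volatile unsigned int pending_events = 0;

/* Read-modify-write that must not be split by an asynchronous interrupt. */
static unsigned int add_pending_events(unsigned int n)
{
    unsigned int total;

    ENTER_CRITICAL();
    total = pending_events + n;
    pending_events = total;
    LEAVE_CRITICAL();
    return total;
}
```

`spu_ienable()` enables interrupts unconditionally. This pattern therefore assumes that interrupts were enabled on entry; nested critical sections would need the previous state tracked separately.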

### spu\_mffpscr: Move from Floating-Point Status and Control Register

d = spu\_mffpscr()

The floating-point status and control register (FPSCR) is read, and its contents are returned in *d*. Unused bits of the FPSCR are forced to zero.

This intrinsic is considered volatile with respect to the floating-point instructions and will not be reordered with respect to these instructions. The floating-point instructions include: cflts, cfltu, csflt, cuflt, dfa, dfm, dfma, dfms, dfnma, dfnms, dfs, fa, fceq, fcgt, fcmeq, fcmgt, fesd, fi, fm, fma, fms, fnms, frds, frest, frsqest, and fscrwr.

Table 2-88: Move from Floating-Point Status and Control Register

| Return/Argument Types | Specific Intrinsics | Assembly Mapping |
|-----------------------|---------------------|------------------|
| d                     | Specific Intrinsics | Assembly Mapping |
| vector unsigned int   | d = si_fscrrd()     | FSCRRD d         |

### spu\_mfspr: Move from Special Purpose Register

d = spu\_mfspr(register)

The Special Purpose Register specified by the enumeration constant *register* is read, and its contents are returned in *d*.

Table 2-89: Move from Special Purpose Register

| Return/Argument Types |             | Specific Intrinsics                | Assembly Mapping  |
|-----------------------|-------------|------------------------------------|-------------------|
| d                     | register    | Specific Intrinsics                | Assembly Mapping  |
| unsigned int          | enumeration | d = si_to_uint(si_mfspr(register)) | MFSPR d, register |



### spu\_mtfpscr: Move to Floating-Point Status and Control Register

(void) spu\_mtfpscr(a)

The argument *a* is written to the floating-point status and control register (FPSCR).

This intrinsic is considered volatile with respect to the floating-point instructions, and it will not be reordered with respect to these instructions.

Table 2-90: Move to Floating-Point Status and Control Register

| Return/Argument Types | Specific Intrinsics | Assembly Mapping          |
|-----------------------|---------------------|---------------------------|
| a                     | Specific Intrinsics | Assembly Mapping          |
| vector unsigned int   | si_fscrwr(a)        | FSCRWR rt<sup>1</sup>, a  |

<sup>1</sup> The false target parameter *rt* is optimally chosen depending on register usage of neighboring instructions.

# spu\_mtspr: Move to Special Purpose Register

(void) spu\_mtspr(register, a)

The argument *a* is written to the Special Purpose Register specified by the enumeration constant register.

Table 2-91: Move to Special Purpose Register

| register    | a            | Specific Intrinsics                 | Assembly Mapping  |
|-------------|--------------|-------------------------------------|-------------------|
| enumeration | unsigned int | si_mtspr(register, si_from_uint(a)) | MTSPR register, a |

# spu\_dsync: Synchronize Data

(void) spu\_dsync()

All earlier store instructions are forced to complete before proceeding. This function ensures that all stores to local storage are visible to the MFC or PPU.

This intrinsic is considered volatile with respect to the store and MFC write instructions, and it will not be reordered with respect to these instructions. The store and MFC instructions include: stqa, stqd, stqr, stqx, and wrch.

Table 2-92: Synchronize Data

| Specific Intrinsics | Assembly Mapping |
|---------------------|------------------|
| si_dsync()          | DSYNC            |

# spu\_stop: Stop and Signal

(void) spu\_stop(type)

Execution of the SPU program is stopped. The address of the stop instruction is placed into the least significant bits of the SPU NPC register. The signal type is written to the SPU status register, and the PPU is interrupted.

This intrinsic is considered volatile with respect to all instructions, and it will not be reordered with any other instructions.

Table 2-93: Stop and Signal

| Specific Intrinsics | type                          | Assembly Mapping |
|---------------------|-------------------------------|------------------|
| si_stop(type)       | unsigned int (14-bit literal) | STOP type        |



# spu\_sync: Synchronize

```
(void) spu_sync()
(void) spu_sync_c()
```

The processor waits until all pending store instructions have been completed before fetching the next sequential instruction. The <code>spu\_sync\_c</code> form of the intrinsic also performs channel synchronization prior to the instruction synchronization. This operation must be used following a store instruction that modifies the instruction stream.

These synchronization intrinsics are considered volatile with respect to all instructions, and they will not be reordered with any other instructions.

Table 2-94: Synchronize

| Generic Intrinsic Form | Specific Intrinsics | Assembly Mapping |
|------------------------|---------------------|------------------|
| spu_sync               | si_sync()           | SYNC             |
| spu_sync_c             | si_syncc()          | SYNCC            |

# 2.12. Channel Control Intrinsics

The channel control intrinsics each take a *channel* number as an input. Channel numbers are literal unsigned integer values in the range from 0 to 127. Table 2-95 and Table 2-96 show the respective SPU and MFC channel numbers and their associated mnemonics. For additional details on the channels, see the *Cell Broadband Engine Architecture*.

The channel intrinsics must never be reordered with respect to other channel commands or volatile local-storage memory accesses.

The MFC channels are only valid for SPUs within a CBEA-compliant system. MFC and SPU channel enumerants are defined in <code>spu\_intrinsics.h</code>.

Table 2-95: SPU Channel Numbers

| Channel Number | Mnemonic          | Description                                                   |
|----------------|-------------------|---------------------------------------------------------------|
| 0              | SPU_RdEventStat   | Read event status with mask applied.                          |
| 1              | SPU_WrEventMask   | Write event mask.                                             |
| 2              | SPU_WrEventAck    | Write end of event processing.                                |
| 3              | SPU_RdSigNotify1  | Signal notification 1.                                        |
| 4              | SPU_RdSigNotify2  | Signal notification 2.                                        |
| 7              | SPU_WrDec         | Write decrementer count.                                      |
| 8              | SPU_RdDec         | Read decrementer count.                                       |
| 11             | SPU_RdEventMask   | Read event mask.                                              |
| 13             | SPU_RdMachStat    | Read SPU run status.                                          |
| 14             | SPU_WrSRR0        | Write SPU machine state save/restore register 0 (SRR0).       |
| 15             | SPU_RdSRR0        | Read SPU machine state save/restore register 0 (SRR0).        |
| 28             | SPU_WrOutMbox     | Write outbound mailbox contents.                              |
| 29             | SPU_RdInMbox      | Read inbound mailbox contents.                                |
| 30             | SPU_WrOutIntrMbox | Write outbound interrupt mailbox contents (interrupting PPU). |

Table 2-96: MFC Channel Numbers

| Channel Number | Mnemonic            | Description                                                                   |
|----------------|---------------------|-------------------------------------------------------------------------------|
| 9              | MFC_WrMSSyncReq     | Write multisource synchronization request.                                    |
| 12             | MFC_RdTagMask       | Read tag mask.                                                                |
| 16             | MFC_LSA             | Write local memory address command parameter.                                 |
| 17             | MFC_EAH             | Write high order DMA effective address command parameter.                     |
| 18             | MFC_EAL             | Write low order DMA effective address command parameter.                      |
| 19             | MFC_Size            | Write DMA transfer size command parameter.                                    |
| 20             | MFC_TagID           | Write tag identifier command parameter.                                       |
| 21             | MFC_Cmd             | Write and enqueue DMA command with associated class ID.                       |
| 22             | MFC_WrTagMask       | Write tag mask.                                                               |
| 23             | MFC_WrTagUpdate     | Write request for conditional/unconditional tag status update.                |
| 24             | MFC_RdTagStat       | Read tag status with mask applied.                                            |
| 25             | MFC_RdListStallStat | Read DMA list stall-and-notify status.                                        |
| 26             | MFC_WrListStallAck  | Write DMA list stall-and-notify acknowledge.                                  |
| 27             | MFC_RdAtomicStat    | Read completion status of last completed immediate MFC atomic update command. |

# spu\_readch: Read Word Channel

d = spu\_readch(channel)

The word channel that is specified by *channel* is read, and the contents are placed in *d*. If the channel does not exist, a value of zero is returned.

Table 2-97: Read Word Channel

| d            | channel     | Specific Intrinsics              | Assembly Mapping |
|--------------|-------------|----------------------------------|------------------|
| unsigned int | enumeration | d = si_to_uint(si_rdch(channel)) | RDCH d, channel  |

# spu\_readchqw: Read Quadword Channel

d = spu\_readchqw(channel)

The quadword channel that is specified by *channel* is read, and the contents are placed in vector *d*. If the channel does not exist, a value of zero is returned.

Table 2-98: Read Quadword Channel

| d                   | channel     | Specific Intrinsics  | Assembly Mapping |
|---------------------|-------------|----------------------|------------------|
| vector unsigned int | enumeration | d = si_rdch(channel) | RDCH d, channel  |

# spu\_readchcnt: Read Channel Count

d = spu\_readchcnt(channel)

A Read Count operation is performed on the channel that is specified by *channel*, and the count is placed in *d*. If the channel does not exist, a value of zero is returned in *d*.

Table 2-99: Read Channel Count

| d            | channel     | Specific Intrinsics    | Assembly Mapping  |
|--------------|-------------|------------------------|-------------------|
| unsigned int | enumeration | d = si_rchcnt(channel) | RCHCNT d, channel |



# spu\_writech: Write Word Channel

(void) spu\_writech(channel, a)

The contents of scalar a are written to the channel that is specified by the enumeration constant channel.

Table 2-100: Write Word Channel

| channel     | a            | Specific Intrinsics               | Assembly Mapping |
|-------------|--------------|-----------------------------------|------------------|
| enumeration | int          | si_wrch(channel, si_from_int(a))  | WRCH channel, a  |
| enumeration | unsigned int | si_wrch(channel, si_from_uint(a)) | WRCH channel, a  |

# spu\_writechqw: Write Quadword Channel

(void) spu\_writechqw(channel, a)

The contents of vector a are written to the channel that is specified by the enumeration constant channel.

Table 2-101: Write Quadword Channel

| channel     | a                         | Specific Intrinsics | Assembly Mapping |
|-------------|---------------------------|---------------------|------------------|
| enumeration | vector unsigned char      | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector signed char        | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector unsigned short     | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector signed short       | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector unsigned int       | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector signed int         | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector unsigned long long | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector signed long long   | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector float              | si_wrch(channel, a) | WRCH channel, a  |
| enumeration | vector double             | si_wrch(channel, a) | WRCH channel, a  |

# 2.13. Scalar Intrinsics

All of the previous intrinsic functions perform operations only on vector data types. This section describes special utility intrinsics that allow programmers to efficiently coerce scalars to vectors, or vectors to scalars. With the aid of these intrinsics, programmers can use intrinsic functions to perform operations between vectors and scalars without having to revert to assembly language. This is especially important when there is a need to perform an operation that cannot be conveniently expressed in C, such as shuffling bytes.

# spu\_extract: Extract Vector Element from Vector

d = spu\_extract(a, element)

The element that is specified by <code>element</code> is extracted from vector <code>a</code> and returned in <code>d</code>. Depending on the size of the element, only a limited number of the least significant bits of the <code>element</code> index are used. For 1-, 2-, 4-, and 8-byte elements, only 4, 3, 2, and 1 of the least significant bits of the element index are used, respectively.



Table 2-102: Extract Vector Element from Vector

| d                  | a                         | element           | Specific Intrinsics | Assembly Mapping<sup>1</sup>                             |
|--------------------|---------------------------|-------------------|---------------------|----------------------------------------------------------|
| unsigned char      | vector unsigned char      | int (non-literal) | N/A                 | ROTQBY d, a, element<br>ROTMI d, d, -24                  |
| signed char        | vector signed char        | int (non-literal) | N/A                 | ROTQBY d, a, element<br>ROTMAI d, d, -24                 |
| unsigned short     | vector unsigned short     | int (non-literal) | N/A                 | SHLI t, element, 1<br>ROTQBY d, a, t<br>ROTMI d, d, -16  |
| signed short       | vector signed short       | int (non-literal) | N/A                 | SHLI t, element, 1<br>ROTQBY d, a, t<br>ROTMAI d, d, -16 |
| unsigned int       | vector unsigned int       | int (non-literal) | N/A                 | SHLI t, element, 2<br>ROTQBY d, a, t                     |
| signed int         | vector signed int         | int (non-literal) | N/A                 | SHLI t, element, 2<br>ROTQBY d, a, t                     |
| unsigned long long | vector unsigned long long | int (non-literal) | N/A                 | SHLI t, element, 3<br>ROTQBY d, a, t                     |
| signed long long   | vector signed long long   | int (non-literal) | N/A                 | SHLI t, element, 3<br>ROTQBY d, a, t                     |
| float              | vector float              | int (non-literal) | N/A                 | SHLI t, element, 2<br>ROTQBY d, a, t                     |
| double             | vector double             | int (non-literal) | N/A                 | SHLI t, element, 3<br>ROTQBY d, a, t                     |
| unsigned char      | vector unsigned char      | int (literal)     | N/A                 | ROTQBYI d, a, element-3                                  |
| signed char        | vector signed char        | int (literal)     | N/A                 | ROTQBYI d, a, element-3                                  |
| unsigned short     | vector unsigned short     | int (literal)     | N/A                 | ROTQBYI d, a, 2*(element-1)                              |
| signed short       | vector signed short       | int (literal)     | N/A                 | ROTQBYI d, a, 2*(element-1)                              |
| unsigned int       | vector unsigned int       | int (literal)     | N/A                 | ROTQBYI d, a, 4*element                                  |
| signed int         | vector signed int         | int (literal)     | N/A                 | ROTQBYI d, a, 4*element                                  |
| unsigned long long | vector unsigned long long | int (literal)     | N/A                 | ROTQBYI d, a, 8*element                                  |
| signed long long   | vector signed long long   | int (literal)     | N/A                 | ROTQBYI d, a, 8*element                                  |
| float              | vector float              | int (literal)     | N/A                 | ROTQBYI d, a, 4*element                                  |
| double             | vector double             | int (literal)     | N/A                 | ROTQBYI d, a, 8*element                                  |

<sup>1</sup> If the specified element is a known value (literal) and specifies the preferred (scalar) element, no instructions are produced. For 1-byte elements, the scalar element is 3. For 2-byte elements, the scalar element is 1. For 4- and 8-byte elements, the scalar element is 0. Sign extension may still be performed if a subsequent operation requires the resulting scalar to be cast to a larger data type. This sign extension may be deferred until the subsequent operation.
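To make the index rules concrete, the following host-side C model mimics the non-literal mappings in Table 2-102 for byte and word elements. It is a sketch, not SPU code: `qword_model`, `effective_index`, `rotqby_model`, and the `extract_*_model` functions are hypothetical names introduced here, and a 16-byte array stands in for a 128-bit register with byte 0 as the most significant byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of a 128-bit SPU register.
   Byte 0 is the most significant byte, as on the big-endian SPU. */
typedef struct { uint8_t b[16]; } qword_model;

/* Element index actually used for a given element size:
   1-byte -> 4 LSBs, 2-byte -> 3 LSBs, 4-byte -> 2 LSBs, 8-byte -> 1 LSB. */
static unsigned effective_index(unsigned elem_size, int element)
{
    assert(elem_size == 1 || elem_size == 2 || elem_size == 4 || elem_size == 8);
    return (unsigned)element & (16u / elem_size - 1u);
}

/* Model of ROTQBY: rotate left by the low 4 bits of n bytes. */
static qword_model rotqby_model(qword_model a, unsigned n)
{
    qword_model d;
    for (unsigned i = 0; i < 16; i++)
        d.b[i] = a.b[(i + n) & 15u];
    return d;
}

/* Byte extract: ROTQBY brings the selected byte to byte 0. On the SPU,
   ROTMI then moves it into the byte preferred slot; the model just reads it. */
static uint8_t extract_u8_model(qword_model a, int element)
{
    return rotqby_model(a, effective_index(1, element)).b[0];
}

/* Word extract: SHLI t, element, 2 ; ROTQBY d, a, t.
   The selected word lands in the preferred slot (bytes 0..3). */
static uint32_t extract_u32_model(qword_model a, int element)
{
    qword_model d = rotqby_model(a, effective_index(4, element) * 4u);
    return ((uint32_t)d.b[0] << 24) | ((uint32_t)d.b[1] << 16) |
           ((uint32_t)d.b[2] << 8)  |  (uint32_t)d.b[3];
}
```

With bytes 0 through 15 holding the values 0 through 15, `extract_u32_model(v, 5)` returns the same word as element 1 (0x04050607), because only the two least significant bits of the index are used.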

# spu\_insert: Insert Scalar into Specified Vector Element

d = spu\_insert(a, b, element)

Scalar *a* is inserted into the element of vector *b* that is specified by the *element* parameter, and the modified vector is returned in *d*. All other elements of *b* are unmodified. Depending on the size of the element, only a limited number of the least significant bits of the *element* index are used. For 1-, 2-, 4-, and 8-byte elements, only 4, 3, 2, and 1 of the least significant bits of the *element* index are used, respectively.

Table 2-103: Insert Scalar into Specified Vector Element

| d                         | a                  | b                         | element           | Specific Intrinsics | Assembly Mapping<sup>1</sup>                          |
|---------------------------|--------------------|---------------------------|-------------------|---------------------|-------------------------------------------------------|
| vector unsigned char      | unsigned char      | vector unsigned char      | int (non-literal) | N/A                 | CBD t, 0(element)<br>SHUFB d, a, b, t                 |
| vector signed char        | signed char        | vector signed char        | int (non-literal) | N/A                 | CBD t, 0(element)<br>SHUFB d, a, b, t                 |
| vector unsigned short     | unsigned short     | vector unsigned short     | int (non-literal) | N/A                 | SHLI t, element, 1<br>CHD t, 0(t)<br>SHUFB d, a, b, t |
| vector signed short       | signed short       | vector signed short       | int (non-literal) | N/A                 | SHLI t, element, 1<br>CHD t, 0(t)<br>SHUFB d, a, b, t |
| vector unsigned int       | unsigned int       | vector unsigned int       | int (non-literal) | N/A                 | SHLI t, element, 2<br>CWD t, 0(t)<br>SHUFB d, a, b, t |
| vector signed int         | signed int         | vector signed int         | int (non-literal) | N/A                 | SHLI t, element, 2<br>CWD t, 0(t)<br>SHUFB d, a, b, t |
| vector float              | float              | vector float              | int (non-literal) | N/A                 | SHLI t, element, 2<br>CWD t, 0(t)<br>SHUFB d, a, b, t |
| vector unsigned long long | unsigned long long | vector unsigned long long | int (non-literal) | N/A                 | SHLI t, element, 3<br>CDD t, 0(t)<br>SHUFB d, a, b, t |
| vector signed long long   | signed long long   | vector signed long long   | int (non-literal) | N/A                 | SHLI t, element, 3<br>CDD t, 0(t)<br>SHUFB d, a, b, t |
| vector double             | double             | vector double             | int (non-literal) | N/A                 | SHLI t, element, 3<br>CDD t, 0(t)<br>SHUFB d, a, b, t |
| vector unsigned char      | unsigned char      | vector unsigned char      | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector signed char        | signed char        | vector signed char        | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector unsigned short     | unsigned short     | vector unsigned short     | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector signed short       | signed short       | vector signed short       | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector unsigned int       | unsigned int       | vector unsigned int       | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector signed int         | signed int         | vector signed int         | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector float              | float              | vector float              | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector unsigned long long | unsigned long long | vector unsigned long long | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector signed long long   | signed long long   | vector signed long long   | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |
| vector double             | double             | vector double             | int (literal)     | N/A                 | LQD pat, CONST_AREA<br>SHUFB d, a, b, pat             |

<sup>1</sup> If the specified element is a known value (literal), a shuffle pattern can be loaded from the constant area. The contents of the pattern depend on the size of the element and the element being replaced.
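The non-literal word mapping in Table 2-103 builds a shuffle control with CWD and then merges the scalar into *b* with SHUFB. The host-side sketch below models those two steps for 32-bit elements. All names are hypothetical, and only SHUFB control bytes 0x00 through 0x1F are modeled (the special control values 0x80 and above are not).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit register model; byte 0 is most significant. */
typedef struct { uint8_t b[16]; } qword_model;

/* SHUFB d, a, b, t: control 0x00..0x0F selects a byte of a,
   0x10..0x1F selects a byte of b. */
static qword_model shufb_model(qword_model a, qword_model b, qword_model t)
{
    qword_model d;
    for (int i = 0; i < 16; i++) {
        unsigned sel = t.b[i];
        assert(sel < 0x20u);  /* special controls are not modeled */
        d.b[i] = sel < 16u ? a.b[sel] : b.b[sel - 16u];
    }
    return d;
}

/* CWD t, 0(offset): identity pattern selecting b, except that the word at
   (offset & 0xC) selects bytes 0..3 of a, the preferred slot. */
static qword_model cwd_model(unsigned offset)
{
    qword_model t;
    for (unsigned i = 0; i < 16; i++)
        t.b[i] = (uint8_t)(0x10u + i);
    unsigned w = offset & 0xCu;
    for (unsigned k = 0; k < 4; k++)
        t.b[w + k] = (uint8_t)k;
    return t;
}

/* spu_insert for 32-bit elements: SHLI t, element, 2 ; CWD t, 0(t) ;
   SHUFB d, a, b, t. The scalar a sits in the preferred slot. */
static qword_model insert_u32_model(uint32_t a, qword_model b, int element)
{
    qword_model qa = {{0}};
    qa.b[0] = (uint8_t)(a >> 24);
    qa.b[1] = (uint8_t)(a >> 16);
    qa.b[2] = (uint8_t)(a >> 8);
    qa.b[3] = (uint8_t)a;
    return shufb_model(qa, b, cwd_model((unsigned)element << 2));
}

/* Read word i (0..3) of a register model. */
static uint32_t word_at(qword_model q, unsigned i)
{
    const uint8_t *p = &q.b[4u * (i & 3u)];
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

Because CWD only looks at bits selecting a word within the quadword, an element index of 5 inserts into the same word as an index of 1, consistent with the two-bit rule above.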

# spu\_promote: Promote Scalar to Vector

d = spu\_promote(a, element)

Scalar *a* is promoted to a vector containing *a* in the element that is specified by the *element* parameter, and the vector is returned in *d*. All other elements of the vector are undefined. Depending on the size of the element/scalar, only a limited number of the least significant bits of the *element* index are used. For 1-, 2-, 4-, and 8-byte elements, only 4, 3, 2, and 1 of the least significant bits of the *element* index are used, respectively.

Table 2-104: Promote Scalar to Vector

| d                         | a                  | element           | Specific Intrinsics | Assembly Mapping<sup>1</sup>                           |
|---------------------------|--------------------|-------------------|---------------------|--------------------------------------------------------|
| vector unsigned char      | unsigned char      | int (non-literal) | N/A                 | SFI t, element, 3<br>ROTQBY d, a, t                    |
| vector signed char        | signed char        | int (non-literal) | N/A                 | SFI t, element, 3<br>ROTQBY d, a, t                    |
| vector unsigned short     | unsigned short     | int (non-literal) | N/A                 | SFI t, element, 1<br>SHLI t, t, 1<br>ROTQBY d, a, t    |
| vector signed short       | signed short       | int (non-literal) | N/A                 | SFI t, element, 1<br>SHLI t, t, 1<br>ROTQBY d, a, t    |
| vector unsigned int       | unsigned int       | int (non-literal) | N/A                 | SFI t, element, 0<br>SHLI t, t, 2<br>ROTQBY d, a, t    |
| vector signed int         | signed int         | int (non-literal) | N/A                 | SFI t, element, 0<br>SHLI t, t, 2<br>ROTQBY d, a, t    |
| vector float              | float              | int (non-literal) | N/A                 | SFI t, element, 0<br>SHLI t, t, 2<br>ROTQBY d, a, t    |
| vector unsigned long long | unsigned long long | int (non-literal) | N/A                 | SFI t, element, 0<br>SHLI t, t, 3<br>ROTQBY d, a, t    |
| vector signed long long   | signed long long   | int (non-literal) | N/A                 | SFI t, element, 0<br>SHLI t, t, 3<br>ROTQBY d, a, t    |
| vector double             | double             | int (non-literal) | N/A                 | SFI t, element, 0<br>SHLI t, t, 3<br>ROTQBY d, a, t    |
| vector unsigned char      | unsigned char      | int (literal)     | N/A                 | ROTQBYI d, a, (3-element)                              |
| vector signed char        | signed char        | int (literal)     | N/A                 | ROTQBYI d, a, (3-element)                              |
| vector unsigned short     | unsigned short     | int (literal)     | N/A                 | ROTQBYI d, a, 2*(1-element)                            |
| vector signed short       | signed short       | int (literal)     | N/A                 | ROTQBYI d, a, 2*(1-element)                            |
| vector unsigned int       | unsigned int       | int (literal)     | N/A                 | ROTQBYI d, a, -4*element                               |
| vector signed int         | signed int         | int (literal)     | N/A                 | ROTQBYI d, a, -4*element                               |
| vector float              | float              | int (literal)     | N/A                 | ROTQBYI d, a, -4*element                               |
| vector unsigned long long | unsigned long long | int (literal)     | N/A                 | ROTQBYI d, a, -8*element                               |
| vector signed long long   | signed long long   | int (literal)     | N/A                 | ROTQBYI d, a, -8*element                               |
| vector double             | double             | int (literal)     | N/A                 | ROTQBYI d, a, -8*element                               |

<sup>1</sup> If the specified element is of known value (literal) and specifies the preferred (scalar) element, no instructions are produced. For 1-byte elements, the scalar element is 3. For 2-byte elements, the scalar element is 1. For 4- and 8-byte elements, the scalar element is 0.
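The following host-side sketch models the non-literal word mapping of Table 2-104 (SFI, SHLI, ROTQBY). The names are hypothetical. The model starts the scalar register with zeros outside the preferred slot, whereas on the SPU those bytes, and therefore the other elements of *d*, are undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit register model; byte 0 is most significant. */
typedef struct { uint8_t b[16]; } qword_model;

/* ROTQBY: rotate left by the low 4 bits of n bytes. */
static qword_model rotqby_model(qword_model a, unsigned n)
{
    qword_model d;
    for (unsigned i = 0; i < 16; i++)
        d.b[i] = a.b[(i + n) & 15u];
    return d;
}

/* spu_promote for 32-bit scalars. The scalar starts in the preferred slot
   (bytes 0..3); rotating left by -4*element bytes moves it to word
   (element & 3). */
static qword_model promote_u32_model(uint32_t a, int element)
{
    qword_model q = {{0}};            /* zeros here; undefined on the SPU */
    q.b[0] = (uint8_t)(a >> 24);
    q.b[1] = (uint8_t)(a >> 16);
    q.b[2] = (uint8_t)(a >> 8);
    q.b[3] = (uint8_t)a;
    unsigned t = 0u - (unsigned)element;  /* SFI t, element, 0 */
    t <<= 2;                              /* SHLI t, t, 2       */
    return rotqby_model(q, t);            /* ROTQBY d, a, t     */
}

/* Read word i (0..3) of a register model. */
static uint32_t word_at(qword_model q, unsigned i)
{
    const uint8_t *p = &q.b[4u * (i & 3u)];
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

Rotating by the negated, scaled index is the inverse of the extract rotation, which is why the literal forms in the table use -4\*element and -8\*element.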



# 3. Composite Intrinsics

This chapter describes several composite intrinsics that have practical use for a wide variety of SPU programs. Composite intrinsics are intrinsics that can be constructed from a series of low-level intrinsics; in this context, "low-level" means generic or specific. Because of the complexity of these operations, their frequency of use, and their scheduling constraints, these services are provided as intrinsics.

The composite intrinsics described here are DMA intrinsics, which rely heavily on the channel control intrinsics.

# spu\_mfcdma32: Initiate DMA to/from 32-Bit Effective Address

spu\_mfcdma32(ls, ea, size, tagid, cmd)

A DMA transfer of *size* bytes is initiated from local storage to system memory or from system memory to local storage. The effective address that is specified by *ea* is a 32-bit virtual memory address. The local-storage address is specified by the *ls* parameter. The DMA request is issued using the specified *tagid*. The type and direction of the DMA, bandwidth reservation, and class ID are encoded in the *cmd* parameter. For additional details about the commands and restrictions on the size of supported DMA operations, see the *Cell Broadband Engine Architecture*.

Table 3-105: Initiate DMA to/from 32-Bit Effective Address

| ls              | ea           | size         | tagid        | cmd          | Assembly Mapping                                                                                                                                 |
|-----------------|--------------|--------------|--------------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| volatile void * | unsigned int | unsigned int | unsigned int | unsigned int | spu_writech(MFC_LSA, ls)<br>spu_writech(MFC_EAL, ea)<br>spu_writech(MFC_Size, size)<br>spu_writech(MFC_TagID, tagid)<br>spu_writech(MFC_Cmd, cmd) |
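The expansion in Table 3-105 is a fixed sequence of channel writes. The host-side sketch below records that sequence instead of writing hardware channels. `mock_writech`, `chan_log`, and `mfcdma32_model` are hypothetical names; the channel numbers are copied from Table 2-96. The write to MFC_Cmd comes last because that channel enqueues the command using the parameters written before it.

```c
#include <assert.h>
#include <stdint.h>

/* MFC channel numbers from Table 2-96 (host-side copies for this model). */
enum { MFC_LSA = 16, MFC_EAL = 18, MFC_Size = 19, MFC_TagID = 20, MFC_Cmd = 21 };

/* Hypothetical stand-in for spu_writech(): records each (channel, value). */
typedef struct { unsigned channel; uint32_t value; } chan_write;

static chan_write chan_log[16];
static unsigned chan_count;

static void mock_writech(unsigned channel, uint32_t value)
{
    assert(chan_count < 16u);
    chan_log[chan_count].channel = channel;
    chan_log[chan_count].value = value;
    chan_count++;
}

/* Model of the spu_mfcdma32 expansion: command parameters first,
   then MFC_Cmd, which enqueues the DMA command. */
static void mfcdma32_model(uint32_t ls, uint32_t ea, uint32_t size,
                           uint32_t tagid, uint32_t cmd)
{
    mock_writech(MFC_LSA, ls);
    mock_writech(MFC_EAL, ea);
    mock_writech(MFC_Size, size);
    mock_writech(MFC_TagID, tagid);
    mock_writech(MFC_Cmd, cmd);
}
```

Calling `mfcdma32_model(0x1000, 0x20000000, 128, 3, cmd)` logs exactly five writes, in the order of the table.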

# spu\_mfcdma64: Initiate DMA to/from 64-Bit Effective Address

spu\_mfcdma64(ls, eahi, ealow, size, tagid, cmd)

A DMA transfer of *size* bytes is initiated from local storage to system memory or from system memory to local storage. The effective address that is specified by the concatenation of *eahi* and *ealow* is a 64-bit virtual memory address. The local-storage address is specified by the *ls* parameter. The DMA request is issued using the specified *tagid*. The type and direction of the DMA, bandwidth reservation, and class ID are encoded in the *cmd* parameter. For additional details about the commands and restrictions on the size of supported DMA operations, see the *Cell Broadband Engine Architecture*.

Table 3-106: Initiate DMA to/from 64-Bit Effective Address

| ls              | eahi         | ealow        | size         | tagid        | cmd          | Assembly Mapping                                                                                                                                                               |
|-----------------|--------------|--------------|--------------|--------------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| volatile void * | unsigned int | unsigned int | unsigned int | unsigned int | unsigned int | spu_writech(MFC_LSA, ls)<br>spu_writech(MFC_EAH, eahi)<br>spu_writech(MFC_EAL, ealow)<br>spu_writech(MFC_Size, size)<br>spu_writech(MFC_TagID, tagid)<br>spu_writech(MFC_Cmd, cmd) |
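Before calling spu_mfcdma64, a program must respect the size and alignment restrictions that the text defers to the *Cell Broadband Engine Architecture*. The predicate below is a sketch of those restrictions as they are commonly summarized; treat the rules as an assumption to check against the CBEA. `dma_args_ok` is a hypothetical name, and this sketch deliberately rejects zero-length transfers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed DMA rules (verify against the CBEA):
   - 1, 2, 4, or 8 bytes: naturally aligned, and the low 4 bits of the
     local-storage and effective addresses must match;
   - otherwise: a multiple of 16 bytes, at most 16 KB, with both
     addresses 16-byte aligned. */
static bool dma_args_ok(uint32_t ls, uint64_t ea, uint32_t size)
{
    if (size == 1u || size == 2u || size == 4u || size == 8u)
        return (ls % size) == 0u && (ls & 15u) == (uint32_t)(ea & 15u);
    if (size == 0u || size > 16384u || (size & 15u) != 0u)
        return false;
    return (ls & 15u) == 0u && (ea & 15u) == 0u;
}
```

Because the small-transfer case requires matching low address bits, natural alignment of *ls* implies natural alignment of the effective address as well.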

# spu\_mfcstat: Read MFC Tag Status

d = spu\_mfcstat(type)

The current MFC tag status is read and logically ANDed with the current tag mask, and the result is returned in *d*. The type of read to be performed is specified by the *type* parameter. If *type* is 0, the function reads and immediately returns the current MFC tag status. If *type* is 1, the function blocks until at least one of the tag groups enabled by the tag mask has no outstanding MFC commands, and if *type* is 2, it blocks until all of the enabled tag groups have no outstanding MFC commands.

Table 3-107: Read MFC Tag Status

| d            | type         | Assembly Mapping                                                     |
|--------------|--------------|----------------------------------------------------------------------|
| unsigned int | unsigned int | spu_writech(MFC_WrTagUpdate, type)<br>d = spu_readch(MFC_RdTagStat) |
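The three *type* values can be modeled as conditions on the masked tag status. In this host-side sketch, `tag_update_ready` and `tag_status_value` are hypothetical names; bit *n* of `complete` is assumed to be set when tag group *n* has no outstanding commands, following the usual `1 << tag` mask convention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Would a tag status read of the given type return now?
   type 0 = immediate, 1 = any enabled group done, 2 = all enabled groups done. */
static bool tag_update_ready(unsigned type, uint32_t complete, uint32_t mask)
{
    uint32_t stat = complete & mask;
    switch (type) {
    case 0: return true;            /* immediate: never blocks */
    case 1: return stat != 0u;      /* any enabled tag group complete */
    case 2: return stat == mask;    /* all enabled tag groups complete */
    default:
        assert(!"invalid tag update type");
        return false;
    }
}

/* Value returned in d: the tag status ANDed with the tag mask. */
static uint32_t tag_status_value(uint32_t complete, uint32_t mask)
{
    return complete & mask;
}
```

Completed tag groups outside the mask never affect the result, which is why a program typically writes the tag mask before requesting the update.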



# 4. Programming Support for MFC Input and Output

Several MFC utility functions are described in this chapter. These functions can be implemented either as macro definitions or as built-in functions within the compiler. To access these functions, programmers must include the header file <code>spu\_mfcio.h</code>.

For each function listed in the sections below, the function usage is shown, followed by a brief description and the function implementation.

# 4.1. Structures

A principal data structure is the MFC List DMA. The elements in this list are described below.

# mfc\_list\_element: DMA List Element for MFC List DMA

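```
#include <stdint.h>  /* uint64_t */

typedef struct mfc_list_element {
  uint64_t notify : 1;     /* stall-and-notify bit */
  uint64_t reserved : 16;  /* set to zero */
  uint64_t size : 15;      /* list element transfer size */
  uint64_t eal : 32;       /* low word of the 64-bit effective address */
} mfc_list_element_t;
```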

The mfc\_list\_element structure is one element of the MFC List DMA array. The structure consists of several bit-fields: *notify* is the stall-and-notify bit, *reserved* is set to zero, *size* is the list element transfer size, and *eal* is the low word of the 64-bit effective address.
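A common use of the list element is to describe a transfer larger than MFC_MAX_DMA_SIZE (see Table 4-108) as a sequence of chunks. The sketch below fills such a list on any host. `build_dma_list` is a hypothetical helper, the caller is assumed to pass a total size that already satisfies the DMA size rules, and because bit-field allocation order is implementation-defined, the in-memory layout on a host may differ from the SPU's even though the field values match. `uint64_t` bit-fields are accepted by GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct mfc_list_element {
  uint64_t notify : 1;
  uint64_t reserved : 16;
  uint64_t size : 15;
  uint64_t eal : 32;
} mfc_list_element_t;

#define MFC_MAX_DMA_SIZE 16384u

/* Split a contiguous transfer of 'total' bytes starting at low EA word 'eal'
   into list elements of at most MFC_MAX_DMA_SIZE bytes each.
   Returns the number of elements written, or 0 if 'max' is too small. */
static size_t build_dma_list(mfc_list_element_t *list, size_t max,
                             uint32_t eal, uint32_t total)
{
    size_t n = 0;
    while (total > 0u) {
        if (n == max)
            return 0;
        uint32_t chunk = total < MFC_MAX_DMA_SIZE ? total : MFC_MAX_DMA_SIZE;
        list[n].notify = 0;      /* no stall-and-notify */
        list[n].reserved = 0;
        list[n].size = chunk;
        list[n].eal = eal;
        eal += chunk;
        total -= chunk;
        n++;
    }
    return n;
}
```

A 40000-byte transfer becomes three elements of 16384, 16384, and 7232 bytes, each starting where the previous one ended.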

# 4.2. MFC Macro Definitions

Several macro definitions are included in spu\_mfcio.h, in addition to the MFC function declarations described later in this chapter. These definitions, which are provided to facilitate and promote programming uniformity, are listed in Table 4-108, below. This table also includes example values for each of the macros. These values are listed for clarification only and may differ as long as the differences remain compliant with the Cell Broadband Engine Architecture.

Table 4-108: MFC Macro Definitions

| Mnemonic                  | Value  | Description                                                                    |
|---------------------------|--------|--------------------------------------------------------------------------------|
| MFC_MIN_DMA_SIZE          | 16     | Minimum DMA size that can be accommodated without special alignment or padding |
| MFC_MAX_DMA_SIZE          | 16384  | Maximum DMA size that is supported by a single DMA request                     |
| MFC_MIN_DMA_LIST_ELEMENTS | 1      | Minimum number of DMA list elements in a single DMA list command               |
| MFC_MAX_DMA_LIST_ELEMENTS | 2048   | Maximum number of DMA list elements in a single DMA list command               |
| MFC_BARRIER_ENABLE        | 0x0001 | Command modifier 'b'; tag-specific barrier                                     |
| MFC_FENCE_ENABLE          | 0x0002 | Command modifier 'f'; tag-specific fence                                       |
| MFC_LIST_ENABLE           | 0x0004 | Command modifier 'l'; list command                                             |
| MFC_RESULT_ENABLE         | 0x0010 | Command modifier 'r'; performance hint for DMA operations                      |



The hexadecimal DMA command modifiers, which are described in Table 4-108, can be OR'd with the base DMA command types MFC\_PUT\_CMD, MFC\_GET\_CMD, and MFC\_SNDSIG\_CMD to construct the various command permutations.
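As a sketch of this composition, the opcodes listed in Tables 4-110, 4-111, and 4-113 can be reproduced by ORing modifiers onto a base command. The constants below are copied from Tables 4-108, 4-110, and 4-113 so that the code builds on a host, and make_cmd is a hypothetical helper, not part of spu\_mfcio.h.

```c
#include <stdint.h>

/* Example values copied from Tables 4-108, 4-110 and 4-113. */
#define MFC_BARRIER_ENABLE 0x0001
#define MFC_FENCE_ENABLE   0x0002
#define MFC_LIST_ENABLE    0x0004
#define MFC_PUT_CMD        0x0020
#define MFC_GET_CMD        0x0040
#define MFC_SNDSIG_CMD     0x00A0

/* Hypothetical helper: OR the requested modifiers onto a base command.
   It does not reject combinations that the architecture does not define. */
static uint32_t make_cmd(uint32_t base, int list, int barrier, int fence)
{
  uint32_t cmd = base;
  if (list)    cmd |= MFC_LIST_ENABLE;
  if (barrier) cmd |= MFC_BARRIER_ENABLE;
  if (fence)   cmd |= MFC_FENCE_ENABLE;
  return cmd;
}
```

For example, make_cmd(MFC_GET_CMD, 1, 0, 1) yields 0x46, the getlf opcode in Table 4-111.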

# 4.3. Effective Address Utilities

A frequent requirement for MFC programming is to manipulate effective addresses. This section describes several functions for performing the most common operations.

#### mfc\_ea2h: Extract Higher 32 Bits from Effective Address

```
(uint32_t) mfc_ea2h(uint64_t ea)
```

The higher 32 bits are extracted from the 64-bit effective address ea.

#### Implementation

```
(uint32_t)((uint64_t)(ea)>>32)
```

#### mfc\_ea2l: Extract Lower 32 Bits from Effective Address

```
(uint32_t) mfc_ea2l(uint64_t ea)
```

The lower 32 bits are extracted from the 64-bit effective address ea.

#### Implementation

```
(uint32_t)(ea)
```

#### mfc\_hl2ea: Concatenate Higher 32 Bits and Lower 32 Bits

```
(uint64_t) mfc_hl2ea(uint32_t high, uint32_t low)
```

The higher 32 bits, high, and the lower 32 bits, low, are concatenated to form a 64-bit effective address.
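A portable C sketch of the three conversions in this section, assuming a host compiler, is shown below. ea_high, ea_low, and hl_to_ea are hypothetical stand-ins for mfc\_ea2h, mfc\_ea2l, and mfc\_hl2ea; the SPU header may implement these operations differently (for example, with SPU intrinsics).

```c
#include <stdint.h>

/* Hypothetical host equivalent of mfc_ea2h. */
static uint32_t ea_high(uint64_t ea) { return (uint32_t)(ea >> 32); }

/* Hypothetical host equivalent of mfc_ea2l. */
static uint32_t ea_low(uint64_t ea) { return (uint32_t)ea; }

/* Hypothetical host equivalent of mfc_hl2ea: high word above low word. */
static uint64_t hl_to_ea(uint32_t high, uint32_t low)
{
  return ((uint64_t)high << 32) | low;
}
```

Splitting an address and concatenating the two halves again returns the original effective address.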

#### Implementation

#### mfc\_ceil128: Round Up Value to Next Multiple of 128

```
(uint32_t) mfc_ceil128(uint32_t value)
(uint64_t) mfc_ceil128(uint64_t value)
(uintptr_t) mfc_ceil128(uintptr_t value)
```

The argument value is rounded up to the next multiple of 128; a value that is already a multiple of 128 is returned unchanged.

#### Implementation

```
(value + 127) & ~127
```

#### Example

```
volatile char buf[256];
volatile void *ptr = (volatile void*)mfc_ceil128((uintptr_t)buf);
```
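The rounding expression can be checked with a host compiler. CEIL128 below is a hypothetical macro that copies the implementation expression, and align128 is a hypothetical helper built on it. Because ~127 has type int, it is converted to the unsigned type of the other operand, giving a mask with all upper bits set, so the same expression works for 32-bit and 64-bit values. Values larger than the greatest multiple of 128 representable in the type wrap around to 0.

```c
#include <stdint.h>

/* Hypothetical copy of the mfc_ceil128 implementation expression. */
#define CEIL128(value) (((value) + 127) & ~127)

/* Return the first 128-byte-aligned address inside buf; buf must be at
   least 127 bytes larger than the aligned region that is needed. */
static void *align128(void *buf)
{
  return (void *)CEIL128((uintptr_t)buf);
}
```

This is the same technique as the example above: a 256-byte buffer always contains a 128-byte-aligned 128-byte region.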

# 4.4. MFC Tag Manager

This section describes functions that facilitate interoperability through a cooperative use of tag identifiers. Applications, libraries, and tools that initiate DMAs should use these functions to reserve a tag ID or a set of IDs.



An implementation of the tag manager is not required to make all 32 architected tag IDs available for user allocation. Some tags may be pre-allocated and used by the operating environment.

These functions are provided in a system library; therefore, they do not require explicit library linking by the programmer.

MFC tag manager mnemonics are listed in Table 4-109. These mnemonics are defined in spu\_mfcio.h.

Table 4-109: MFC Tag Manager Mnemonics

| Mnemonic        | Value     | Description                                             |
|-----------------|-----------|---------------------------------------------------------|
| MFC_TAG_VALID   | 0x00000000 | The specified tag or tag group release was successful.  |
| MFC_TAG_INVALID | 0xFFFFFFFF | The tag or tag group reservation or tag release failed. |

#### mfc\_tag\_reserve: Reserve a Tag for Exclusive Use

```
(uint32_t) mfc_tag_reserve(void)
```

Reserve a tag for exclusive use. This routine returns an available tag ID in the range 0 to 31 and marks the tag as reserved. If no tags are available, MFC\_TAG\_INVALID is returned, indicating that all tags have already been reserved.

#### mfc\_tag\_release: Release a Tag from Exclusive Use

```
(uint32_t) mfc_tag_release(uint32_t tag)
```

Release the specified tag from exclusive use. After it is released, it is available for future reservation. Upon successful release, MFC\_TAG\_VALID is returned. If the specified tag is not in the range 0 to 31 or if it was not reserved, no action is taken and MFC\_TAG\_INVALID is returned.

#### mfc\_multi\_tag\_reserve: Reserve a Group of Tags for Exclusive Use

```
(uint32_t) mfc_multi_tag_reserve(uint32_t number_of_tags)
```

Reserve a sequential group of tags for exclusive use. The number of tags to be reserved is specified by the *number\_of\_tags* parameter. This routine returns the first tag ID in a sequential list of available tags and marks them as reserved. The reserved group of tags is in the range of IDs starting from the returned tag ID through the returned tag ID + *number\_of\_tags* - 1.

If the number of tags requested exceeds the number of available sequential tags, MFC\_TAG\_INVALID is returned, indicating that the request could not be performed.

#### mfc\_multi\_tag\_release: Release a Group of Tags from Exclusive Use

```
(uint32_t) mfc_multi_tag_release(uint32_t first_tag, uint32_t number_of_tags)
```

Release a sequential group of tags from exclusive use. The sequential group of tags is the range of tag IDs starting from *first\_tag* through *first\_tag* + *number\_of\_tags* - 1. Upon successful release, the tags become available for future reservation, and MFC\_TAG\_VALID is returned. If the specified tags were not previously reserved, no action is taken, and MFC\_TAG\_INVALID is returned.
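To make the reservation and release rules concrete, here is a small bitmap model of a tag manager. It is a hypothetical sketch, not the system library's implementation: the model\_\* functions and TAG\_VALID/TAG\_INVALID are stand-ins, the model assumes that all 32 tags are available (a real tag manager may pre-allocate some), and it uses a first-fit search.

```c
#include <stdint.h>

#define TAG_INVALID 0xFFFFFFFFu  /* stands in for MFC_TAG_INVALID */
#define TAG_VALID   0x00000000u  /* stands in for MFC_TAG_VALID */

static uint32_t reserved_mask;   /* bit n set => tag n is reserved */

/* Bits for count consecutive tags starting at first (first + count <= 32). */
static uint64_t tag_bits(uint32_t first, uint32_t count)
{
  uint64_t group = (count == 32) ? 0xFFFFFFFFull : ((1ull << count) - 1);
  return group << first;
}

/* Reserve count consecutive tags; return the first ID or TAG_INVALID. */
static uint32_t model_multi_tag_reserve(uint32_t count)
{
  if (count == 0 || count > 32) return TAG_INVALID;
  for (uint32_t first = 0; first + count <= 32; first++) {
    uint64_t bits = tag_bits(first, count);
    if ((reserved_mask & bits) == 0) {
      reserved_mask |= (uint32_t)bits;
      return first;
    }
  }
  return TAG_INVALID;
}

/* Release count tags starting at first; fail unless all were reserved. */
static uint32_t model_multi_tag_release(uint32_t first, uint32_t count)
{
  if (count == 0 || first > 31 || count > 32 - first) return TAG_INVALID;
  uint64_t bits = tag_bits(first, count);
  if ((reserved_mask & bits) != bits) return TAG_INVALID;
  reserved_mask &= ~(uint32_t)bits;
  return TAG_VALID;
}

static uint32_t model_tag_reserve(void) { return model_multi_tag_reserve(1); }
static uint32_t model_tag_release(uint32_t tag) { return model_multi_tag_release(tag, 1); }
```

Releasing a tag that is not reserved leaves the bitmap unchanged and reports failure, which mirrors the behavior described for mfc\_tag\_release.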

# 4.5. MFC DMA Commands

This section describes functions that implement the various MFC DMA commands. See the *Cell Broadband Engine Architecture* for a description of the DMA commands, including restrictions on the size of the supported operations.

MFC DMA command mnemonics are listed in Table 4-110. MFC command enumerants are defined in spu\_mfcio.h.



Table 4-110: MFC DMA Command Mnemonics

| Mnemonic     | Opcode | Command |
|--------------|--------|---------|
| MFC_PUT_CMD  | 0x0020 | put     |
| MFC_PUTB_CMD | 0x0021 | putb    |
| MFC_PUTF_CMD | 0x0022 | putf    |
| MFC_GET_CMD  | 0x0040 | get     |
| MFC_GETB_CMD | 0x0041 | getb    |
| MFC_GETF_CMD | 0x0042 | getf    |

#### mfc\_put: Move Data from Local Storage to Effective Address

Data is moved from local storage to system memory. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier.
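As an illustration, the sketch below models what an mfc\_put call passes to spu\_mfcdma64, assuming that it follows the same pattern as the mfc\_eieio implementation in Section 4.8: the effective address is split into high and low words as mfc\_ea2h and mfc\_ea2l do, and the class identifiers are packed above the opcode as (tid<<24) | (rid<<16) | MFC\_PUT\_CMD. dma_request_t, cmd_word, and model_put are hypothetical names, and the local-storage address is modeled as a 32-bit integer rather than a pointer.

```c
#include <stdint.h>

#define MFC_PUT_CMD 0x0020  /* Table 4-110 */

/* Hypothetical record of the arguments of one spu_mfcdma64 request. */
typedef struct {
  uint32_t ls, eah, eal, size, tag, cmd_word;
} dma_request_t;

/* Pack tid, rid and opcode as in the mfc_eieio implementation:
   tid in bits 24-31, rid in bits 16-23, opcode in the low 16 bits. */
static uint32_t cmd_word(uint32_t tid, uint32_t rid, uint32_t cmd)
{
  return (tid << 24) | (rid << 16) | cmd;
}

/* Host model of mfc_put: split ea and build the command word. */
static dma_request_t model_put(uint32_t ls, uint64_t ea, uint32_t size,
                               uint32_t tag, uint32_t tid, uint32_t rid)
{
  dma_request_t r = { ls, (uint32_t)(ea >> 32), (uint32_t)ea, size, tag,
                      cmd_word(tid, rid, MFC_PUT_CMD) };
  return r;
}
```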

#### Implementation

#### mfc\_putb: Move Data from Local Storage to Effective Address with Barrier

Data is moved from local storage to system memory. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. This command and all subsequent commands with the same tag ID as this command are locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

#### mfc\_putf: Move Data from Local Storage to Effective Address with Fence

Data is moved from local storage to system memory. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. This command is locally ordered with respect to all previously issued commands within the same tag group and command queue.
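The barrier and fence rules described for mfc\_putb and mfc\_putf can be summarized as a predicate over the command queue. The sketch below is a hypothetical model (queued_cmd_t and must_wait are illustrative names, not MFC interfaces) of when an earlier command must complete before a later command in the same queue may start. It covers only the tag-specific 'b' and 'f' modifiers, not the mfc\_barrier command.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical queue entry: tag group plus ordering modifiers. */
typedef struct {
  uint32_t tag;
  bool fence;    /* 'f' modifier */
  bool barrier;  /* 'b' modifier */
} queued_cmd_t;

/* True if the earlier command q[j] must complete before the later
   command q[i] (j < i) may start. */
static bool must_wait(const queued_cmd_t *q, int i, int j)
{
  if (q[i].tag != q[j].tag) return false;          /* rules are tag-specific */
  if (q[i].fence || q[i].barrier) return true;     /* i is ordered itself */
  for (int k = j + 1; k < i; k++)                  /* intervening barrier */
    if (q[k].tag == q[i].tag && q[k].barrier) return true;
  return false;
}
```

A fenced command waits only for earlier commands in its own tag group, while a barrier also holds back every later command in that group.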

#### Implementation

#### mfc\_get: Move Data from Effective Address to Local Storage



Data is moved from system memory to local storage. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier.

#### Implementation

#### mfc\_getf: Move Data from Effective Address to Local Storage with Fence

Data is moved from system memory to local storage. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. This command is locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

#### mfc\_getb: Move Data from Effective Address to Local Storage with Barrier

Data is moved from system memory to local storage. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. This command and all subsequent commands with the same tag ID as this command are locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

# 4.6. MFC List DMA Commands

This section describes utility functions that can be used to manage the MFC List DMA. See the *Cell Broadband Engine Architecture* for a description of the DMA commands, including restrictions on the size of the supported operations.

MFC List DMA command mnemonics are listed in Table 4-111. MFC command enumerants are defined in spu\_mfcio.h.

Table 4-111: MFC List DMA Command Mnemonics

| Mnemonic      | Opcode | Command |
|---------------|--------|---------|
| MFC_PUTL_CMD  | 0x0024 | putl    |
| MFC_PUTLB_CMD | 0x0025 | putlb   |
| MFC_PUTLF_CMD | 0x0026 | putlf   |
| MFC_GETL_CMD  | 0x0044 | getl    |
| MFC_GETLB_CMD | 0x0045 | getlb   |
| MFC_GETLF_CMD | 0x0046 | getlf   |
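As a sketch of a typical list DMA, the code below builds one list element per row of a two-dimensional array in main memory, gathering a fixed number of bytes from each row. It assumes that list\_size is given in bytes (the number of elements times the size of one element) and that all rows lie in the same 4 GB region, because each element carries only the low word of the effective address. list_element_t and build_strided_list are hypothetical host-side names; row_bytes should also satisfy the DMA size rules in the *Cell Broadband Engine Architecture*.

```c
#include <stdint.h>

/* Host-side mirror of mfc_list_element_t (hypothetical name). */
typedef struct list_element {
  uint64_t notify   : 1;
  uint64_t reserved : 16;
  uint64_t size     : 15;
  uint64_t eal      : 32;
} list_element_t;

#define MAX_LIST_ELEMENTS 2048u  /* example value of MFC_MAX_DMA_LIST_ELEMENTS */

/* One element per row: rows are pitch bytes apart in main memory and each
   element transfers row_bytes bytes. Returns the assumed list_size in bytes,
   or 0 if the request does not fit in a single list. */
static uint32_t build_strided_list(list_element_t *list, uint64_t base_ea,
                                   uint32_t row_bytes, uint32_t pitch,
                                   uint32_t rows)
{
  if (rows == 0 || rows > MAX_LIST_ELEMENTS || row_bytes > 16384) return 0;
  for (uint32_t i = 0; i < rows; i++) {
    list[i].notify = 0;
    list[i].reserved = 0;
    list[i].size = row_bytes;
    list[i].eal = (uint32_t)(base_ea + (uint64_t)i * pitch);
  }
  return rows * (uint32_t)sizeof(list_element_t);
}
```

On the SPU, the returned value would be the list\_size argument of a list command such as mfc\_getl, with base\_ea supplying the high word through the ea argument.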



#### mfc\_putl: Move Data from Local Storage to Effective Address Using MFC List

Data is moved from local storage to system memory using the MFC list. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, list is the DMA list address, list\_size is the DMA list size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier.

#### Implementation

#### mfc\_putlb: Move Data from Local Storage to Effective Address Using MFC List with Barrier

Data is moved from local storage to system memory using the MFC list. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, list is the DMA list address, list\_size is the DMA list size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. This command and all subsequent commands with the same tag ID as this command are locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

#### mfc\_putlf: Move Data from Local Storage to Effective Address Using MFC List with Fence

Data is moved from local storage to system memory using the MFC list. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, list is the DMA list address, list\_size is the DMA list size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. This command is locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

#### mfc\_getl: Move Data from Effective Address to Local Storage Using MFC List

Data is moved from system memory to local storage using the MFC list. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, list is the DMA list address, list\_size is the DMA list size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier.

#### Implementation



#### mfc\_getlb: Move Data from Effective Address to Local Storage Using MFC List with Barrier

Data is moved from system memory to local storage using the MFC list. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, list is the DMA list address, list\_size is the DMA list size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. This command and all subsequent commands with the same tag ID as this command are locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

#### mfc\_getlf: Move Data from Effective Address to Local Storage Using MFC List with Fence

```
(void) mfc_getlf(volatile void *ls, uint64_t ea, volatile mfc_list_element_t *list,
      uint32_t list_size, uint32_t tag, uint32_t tid, uint32_t rid)
```

Data is moved from system memory to local storage using the MFC list. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, list is the DMA list address, list\_size is the DMA list size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. This command is locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

# 4.7. MFC Atomic Update Commands

This section describes utility functions that can be used to manage the MFC Atomic DMA. See the *Cell Broadband Engine Architecture* for a description of the DMA commands, including restrictions on the size of the supported operations.

MFC Atomic DMA command mnemonics are listed in Table 4-112. MFC command enumerants are defined in spu\_mfcio.h.

Table 4-112: MFC Atomic Update Command Mnemonics

| Mnemonic         | Opcode | Command  |
|------------------|--------|----------|
| MFC_GETLLAR_CMD  | 0x00D0 | getllar  |
| MFC_PUTLLC_CMD   | 0x00B4 | putllc   |
| MFC_PUTLLUC_CMD  | 0x00B0 | putlluc  |
| MFC_PUTQLLUC_CMD | 0x00B8 | putqlluc |

#### mfc\_getllar: Get Lock Line and Create Reservation

```
(void) mfc_getllar(volatile void *ls, uint64_t ea, uint32_t tid, uint32_t rid)
```

The lock line is obtained and a reservation is created. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: *ls* is the 128-byte-aligned local-storage address, *ea* is the effective address in system memory, *tid* is the transfer class identifier, and *rid* is the replacement class identifier.



The mfc\_getllar command does not have a tag ID. The command is immediately executed by the MFC. The transfer size is fixed at 128 bytes. A call to mfc\_read\_atomic\_status() must follow this function to verify completion of the command.

#### Implementation

#### mfc\_putllc: Put Lock Line If Reservation for Effective Address Exists

```
(void) mfc_putllc(volatile void *ls, uint64_t ea, uint32_t tid, uint32_t rid)
```

The lock line is put if a reservation for the effective address exists. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the 128-byte-aligned local-storage address, ea is the effective address in system memory, tid is the transfer class identifier, and rid is the replacement class identifier.

The mfc\_putllc command does not have a tag ID and is immediately executed by the MFC. The transfer size is fixed at 128 bytes. A call to mfc\_read\_atomic\_status() must follow this command to verify completion of the command.
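On the SPU, an atomic update typically repeats mfc\_getllar, a modification of the local copy, and mfc\_putllc until the status reported by mfc\_read\_atomic\_status() no longer indicates a failed conditional put. The host-side model below illustrates only that retry loop. lock_line_t, reservation_t, and the model\_\* functions are hypothetical, and a version counter stands in for the hardware's reservation tracking.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-byte lock line in system memory. */
typedef struct {
  uint8_t  data[128];
  uint32_t version;   /* incremented by every successful store to the line */
} lock_line_t;

/* Hypothetical model of the reservation held by one SPE. */
typedef struct {
  bool     valid;
  uint32_t version;   /* line version seen when the reservation was created */
} reservation_t;

/* Models mfc_getllar: copy the line to local storage and reserve it. */
static void model_getllar(uint8_t *ls, const lock_line_t *line, reservation_t *r)
{
  memcpy(ls, line->data, sizeof line->data);
  r->valid = true;
  r->version = line->version;
}

/* Models mfc_putllc: store only if the reservation is still intact.
   Returns true on success; in this model the reservation is cleared either way. */
static bool model_putllc(const uint8_t *ls, lock_line_t *line, reservation_t *r)
{
  bool ok = r->valid && r->version == line->version;
  if (ok) {
    memcpy(line->data, ls, sizeof line->data);
    line->version++;
  }
  r->valid = false;
  return ok;
}

/* Add delta to the first 32-bit word of the line, retrying until the
   conditional put succeeds. Returns the new value. */
static uint32_t atomic_add_first_word(lock_line_t *line, reservation_t *r,
                                      uint32_t delta)
{
  uint8_t buf[128];
  uint32_t v;
  do {
    model_getllar(buf, line, r);
    memcpy(&v, buf, sizeof v);
    v += delta;
    memcpy(buf, &v, sizeof v);
  } while (!model_putllc(buf, line, r));
  return v;
}
```

A store by another party between the two steps invalidates the reservation, so the conditional put fails and the loop starts over with fresh data.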

#### Implementation

#### mfc\_putlluc: Put Lock Line Unconditional

```
(void) mfc_putlluc(volatile void *ls, uint64_t ea, uint32_t tid, uint32_t rid)
```

The lock line is put regardless of the existence of a previously made reservation. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the 128-byte-aligned local-storage address, ea is the effective address in system memory, tid is the transfer class identifier, and rid is the replacement class identifier.

This command does not have a tag ID and is immediately executed by the MFC. The transfer size is fixed at 128 bytes. A call to mfc\_read\_atomic\_status() must follow this function to verify completion of the command.

#### Implementation

#### mfc\_putqlluc: Put Queued Lock Line Unconditional

The lock line is put through the MFC command queue, regardless of the existence of a previously made reservation. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the 128-byte-aligned local-storage address, ea is the effective address in system memory, tid is the transfer class identifier, and rid is the replacement class identifier.

Transfer size is fixed at 128 bytes. This command is functionally equivalent to the mfc\_putlluc command. The difference between the two commands is the order in which the commands are executed and the way that completion is determined. mfc\_putlluc is performed immediately; in contrast, mfc\_putqlluc is placed into the MFC command queue, along with other MFC commands. Because this command is queued, it is executed independently of any pending immediate mfc\_getllar, mfc\_putllc, or mfc\_putlluc commands. To determine if this command has been performed, a program must wait for a tag-group completion.

#### Implementation



# 4.8. MFC Synchronization Commands

This section describes functions that implement the MFC synchronization commands, including signal notification and storage ordering. See the *Cell Broadband Engine Architecture* for a description of the DMA commands, including restrictions on the size of the supported operations.

MFC synchronization command mnemonics are listed in Table 4-113. MFC command enumerants are defined in spu\_mfcio.h.

Table 4-113: MFC Synchronization Command Mnemonics

| Mnemonic        | Opcode | Command  |
|-----------------|--------|----------|
| MFC_SNDSIG_CMD  | 0x00A0 | sndsig   |
| MFC_SNDSIGB_CMD | 0x00A1 | sndsigb  |
| MFC_SNDSIGF_CMD | 0x00A2 | sndsigf  |
| MFC_BARRIER_CMD | 0x00C0 | barrier  |
| MFC_EIEIO_CMD   | 0x00C8 | mfceieio |
| MFC_SYNC_CMD    | 0x00CC | mfcsync  |

#### mfc\_sndsig: Send Signal

An mfc\_sndsig command is enqueued into the DMA queue, or is stalled when the DMA queue is full. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. Transfer size is fixed at 4 bytes.
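Assuming the general CBEA rule for transfers smaller than 16 bytes (the local-storage and effective addresses must have the same offset within a 16-byte quadword), the 4-byte signal value has to be placed in the matching slot of a quadword-aligned local buffer. signal_slot is a hypothetical helper that selects that slot.

```c
#include <stdint.h>

/* Hypothetical helper: return the 4-byte slot of a 16-byte-aligned local
   buffer whose offset within the quadword equals that of the 4-byte-aligned
   target effective address ea. */
static uint32_t *signal_slot(uint32_t qw[4], uint64_t ea)
{
  return &qw[(ea & 15) >> 2];
}
```

On the SPU, the value would be stored through the returned pointer, and that pointer would then be passed as the ls argument of mfc\_sndsig.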

#### Implementation

#### mfc\_sndsigb: Send Signal with Barrier

An mfc\_sndsigb command is enqueued into the DMA queue, or is stalled when the DMA queue is full. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. Transfer size is fixed at 4 bytes. This command and all subsequent commands with the same tag ID as this command are locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

#### mfc\_sndsigf: Send Signal with Fence

An mfc\_sndsigf command is enqueued into the DMA queue, or is stalled when the DMA queue is full. The arguments to this function correspond to the arguments of the spu\_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. Transfer size is fixed at 4 bytes. This command is locally ordered with respect to all previously issued commands within the same tag group and command queue.

#### Implementation

#### mfc\_barrier: Enqueue mfc\_barrier Command into DMA Queue or Stall When Queue Is Full

```
(void) mfc_barrier(uint32_t tag)
```

An mfc\_barrier command is enqueued into the DMA queue, or the command is stalled when the DMA queue is full. tag is the DMA tag. An mfc\_barrier command guarantees that MFC commands preceding the barrier will be executed before the execution of MFC commands following it, regardless of the tag of preceding or subsequent MFC commands.

#### Implementation

```
spu_mfcdma32(0, 0, 0, tag, MFC_BARRIER_CMD)
```

#### mfc\_eieio: Enqueue mfc\_eieio Command into DMA Queue or Stall When Queue Is Full

```
(void) mfc_eieio (uint32_t tag, uint32_t tid, uint32_t rid)
```

An mfc\_eieio command is enqueued into the DMA queue, or the command is stalled when the DMA queue is full. tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. Do not use this command to order commands solely within a single SPE; the mfc\_eieio command is designed for inter-processor and inter-device synchronization. This command creates a large load on the memory system.

#### Implementation

```
spu_mfcdma32(0, 0, 0, tag, ((tid<<24) | (rid<<16) | MFC_EIEIO_CMD))
```

#### mfc\_sync: Enqueue mfc\_sync Command into DMA Queue or Stall When Queue Is Full

```
(void) mfc_sync (uint32_t tag)
```

An mfc\_sync command is enqueued into the DMA queue, or the command is stalled when the DMA queue is full; tag is the DMA tag. This function must not be used to order commands solely within a single SPE; the mfc\_sync command is designed for inter-processor and inter-device synchronization. This command creates a large load on the memory system.

#### Implementation

```
spu_mfcdma32(0, 0, 0, tag, MFC_SYNC_CMD)
```

# 4.9. MFC SL1 Storage Control Commands

This section describes functions that implement the various SL1 storage control commands. See the *Cell Broadband Engine Architecture* document for a description of the storage control commands, including details on implementations without an SL1.

SL1 storage control command mnemonics are listed in Table 4-114. Storage control command enumerants are defined in spu\_mfcio.h.

Table 4-114: MFC SL1 Storage Control Command Mnemonics

| Mnemonic        | Opcode | Command |
|-----------------|--------|---------|
| MFC_SDCRT_CMD   | 0x0080 | sdcrt   |
| MFC_SDCRTST_CMD | 0x0081 | sdcrtst |
| MFC_SDCRZ_CMD   | 0x0089 | sdcrz   |
| MFC_SDCRST_CMD  | 0x008D | sdcrst  |
| MFC_SDCRF_CMD   | 0x008F | sdcrf   |

#### mfc\_sdcrt: SL1 Data Cache Range Touch

A hint is provided to the MFC that the SPU will probably issue a DMA get command from the range of addresses specified by the system memory effective address, ea, and transfer size, size. The parameter tag is the DMA tag; tid is the transfer class identifier; and rid is the replacement class identifier.

The mfc\_sdcrt function reduces the latency of subsequent MFC DMA get and put commands by prefetching data into an SL1 associated with the issuing SPE.

#### Implementation

#### mfc\_sdcrtst: SL1 Data Cache Range Touch for Store

A hint is provided to the MFC that the SPU will probably issue a DMA put command to the range of addresses specified by the system memory effective address, ea, and transfer size, size. The parameter tag is the DMA tag; tid is the transfer class identifier; and rid is the replacement class identifier.

The mfc\_sdcrtst function reduces the latency of subsequent MFC DMA get and put commands by prefetching data into an SL1 associated with the issuing SPE.

#### Implementation

#### mfc\_sdcrz: SL1 Data Cache Range Set to Zero

The range of storage specified by the effective address, *ea*, and transfer size, *size*, is set to zero. The parameters *tag*, *tid*, and *rid* specify the DMA tag, the transfer class identifier, and replacement class identifier, respectively.

#### Implementation

#### mfc\_sdcrst: SL1 Data Cache Range Store



A data block in the associated SL1 cache, or in any other processor cache, is written to main storage if 1) the block is in memory-coherence-required storage; 2) it is considered modified; and 3) it is within the address range specified by the effective address, *ea*, and transfer size, *size*. The data block can remain in the processor cache, but it is no longer considered modified.

If the data block in the associated SL1 cache is not in memory-coherence-required storage and is considered modified, only the modified block in the SL1 cache associated with the issuing SPU is written to main storage. The data block can remain in the cache, but it is no longer considered modified.

The parameters tag, tid, and rid specify the DMA tag, the transfer class identifier, and replacement class identifier, respectively.

#### Implementation

#### mfc\_sdcrf: SL1 Data Cache Range Flush

A data block in the associated SL1 cache, or in any other processor cache, is written to main storage if 1) the block is in memory-coherence-required storage; 2) it is considered modified; and 3) it is within the address range specified by the effective address, *ea*, and transfer size, *size*. All data blocks in the effective address range are invalidated.

If the data block in the associated SL1 cache is not in memory-coherence-required storage and is considered modified, only the modified block in the SL1 cache associated with the issuing SPU is written to main storage. All data blocks in the effective address range are invalidated.

The parameters tag, tid, and rid specify the DMA tag, the transfer class identifier, and replacement class identifier, respectively.

#### Implementation

# 4.10. MFC DMA Status

This section describes functions that can be used to check the completion of MFC commands or the status of entries in the MFC DMA queue.

#### mfc\_stat\_cmd\_queue: Check the Number of Available Entries in the MFC DMA Queue

```
(uint32_t) mfc_stat_cmd_queue(void)
```

The number of available entries in the MFC DMA queue is checked. This information can be used to avoid stalling the execution of an SPU program, which happens when a DMA command is issued to a full queue. The queue holds 16 entries; a return value of 0 means that the queue is full.

#### Implementation

```
spu_readchcnt(MFC_Cmd)
```



#### mfc\_write\_tag\_mask: Set Tag Mask to Select MFC Tag Groups to Be Included in Query Operation

```
(void) mfc_write_tag_mask (uint32_t mask)
```

A tag mask is set to select the MFC tag groups to be included in the query operation, where mask is the DMA tag-group query mask. Each bit of mask corresponds to one tag group; tag 0 is mapped to the least significant bit (LSB).
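A query mask for several tag groups can be built by setting one bit per tag ID. tag_mask_of is a hypothetical helper; on the SPU, its result would be passed to mfc\_write\_tag\_mask() before one of the tag-status read functions described in this section.

```c
#include <stdint.h>

/* Hypothetical helper: bit n of the result selects tag group n (tag 0 = LSB).
   IDs outside 0-31 are ignored. */
static uint32_t tag_mask_of(const uint32_t *tags, int count)
{
  uint32_t mask = 0;
  for (int i = 0; i < count; i++)
    if (tags[i] < 32)
      mask |= 1u << tags[i];
  return mask;
}
```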

#### Implementation

```
spu_writech(MFC_WrTagMask, mask)
```

#### mfc\_read\_tag\_mask: Read Tag Mask Indicating MFC Tag Groups to Be Included in Query Operation

```
(uint32_t) mfc_read_tag_mask(void)
```

The tag mask is read to identify the MFC tag groups to be included in the query operation. Each bit of the mask corresponds to one tag group; tag 0 is mapped to the LSB. The result represents a DMA tag-group query mask.

#### Implementation

```
spu_readch(MFC_RdTagMask)
```

#### mfc\_write\_tag\_update: Request That Tag Status Be Updated

```
(void) mfc_write_tag_update(uint32_t ts)
```

A request is sent to the MFC to update tag status, where *ts* specifies a tag-status update condition shown in Table 4-115. Condition enumerants are defined in spu\_mfcio.h.

This function must precede a tag-status read with the mfc\_read\_tag\_status() function. A tag-status update request should be performed after setting the tag-group mask with the mfc\_write\_tag\_mask() function.

Table 4-115: MFC Write Tag Update Conditions

| Number | Mnemonic                 | Description                                                                                 |
|--------|--------------------------|---------------------------------------------------------------------------------------------|
| 0      | MFC_TAG_UPDATE_IMMEDIATE | Update immediately, unconditionally.                                                        |
| 1      | MFC_TAG_UPDATE_ANY       | Update tag status if or when any enabled tag group has "no outstanding operation" status.   |
| 2      | MFC_TAG_UPDATE_ALL       | Update tag status if or when all enabled tag groups have "no outstanding operation" status. |
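The three update conditions can be modeled on the host to show when a tag-status read completes and what it returns. In this hypothetical sketch (model_tag_status and the UPDATE\_\* constants are illustrative names mirroring Table 4-115), idle has bit n set when tag group n has no outstanding operation, and bits for groups outside the mask read as 0, as described for mfc\_read\_tag\_status.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the numbers in Table 4-115. */
enum { UPDATE_IMMEDIATE = 0, UPDATE_ANY = 1, UPDATE_ALL = 2 };

/* Returns true and stores the status word if a tag-status read would
   complete now; returns false if the read would block. */
static bool model_tag_status(uint32_t mask, uint32_t idle, int cond,
                             uint32_t *status)
{
  uint32_t done = idle & mask;   /* disabled groups read as 0 */
  bool ready;
  switch (cond) {
  case UPDATE_IMMEDIATE: ready = true;           break;
  case UPDATE_ANY:       ready = done != 0;      break;
  case UPDATE_ALL:       ready = done == mask;   break;
  default:               return false;
  }
  if (ready) *status = done;
  return ready;
}
```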

#### Implementation

```
spu_writech(MFC_WrTagUpdate, ts)
```

#### mfc\_write\_tag\_update\_immediate: Request That Tag Status Be Immediately Updated

```
(void) mfc_write_tag_update_immediate(void)
```

A request is sent to immediately update tag status.

#### Implementation

```
spu_writech(MFC_WrTagUpdate, MFC_TAG_UPDATE_IMMEDIATE)
```

#### mfc\_write\_tag\_update\_any: Request That Tag Status Be Updated for Any Enabled Completion with No Outstanding Operation

```
(void) mfc_write_tag_update_any(void)
```

A request is sent to update tag status when any enabled MFC tag group has a "no operation outstanding" status.

#### Implementation

```
spu_writech(MFC_WrTagUpdate, MFC_TAG_UPDATE_ANY)
```



#### mfc\_write\_tag\_update\_all: Request That Tag Status Be Updated When All Enabled Tag Groups Have No Outstanding Operation

```
(void) mfc_write_tag_update_all(void)
```

A request is sent to update tag status when all enabled MFC tag groups have a "no operation outstanding" status.

#### Implementation

```
spu_writech(MFC_WrTagUpdate, MFC_TAG_UPDATE_ALL)
```

#### mfc\_stat\_tag\_update: Check Availability of Tag Status Update Request Channel

```
(uint32_t) mfc_stat_tag_update(void)
```

The availability of the Tag Status Update Request channel is checked. The result has one of the following values:

- 0: The Tag Status Update Request channel is not yet available.
- 1: The Tag Status Update Request channel is available.

#### Implementation

```
spu_readchcnt(MFC_WrTagUpdate)
```

#### mfc\_read\_tag\_status: Wait for an Updated Tag Status

```
(uint32_t) mfc_read_tag_status(void)
```

The status of the tag groups is requested. Unless the tag update is set to MFC\_TAG\_UPDATE\_IMMEDIATE, this call can block. Each bit of the returned value indicates the status of one tag group; tag 0 is mapped to the LSB. A set bit means that the tag group has no outstanding operation (that is, its commands have completed) and is not masked by the query.

Only the status of the tag groups that are enabled at the time of the tag-group status update is valid. The bit positions that correspond to tag groups disabled at the time of the update are set to 0.

#### Implementation

```
spu_readch(MFC_RdTagStat)
```

#### mfc\_read\_tag\_status\_immediate: Wait for the Updated Status of Any Enabled Tag Group

```
(uint32_t) mfc_read_tag_status_immediate(void)
```

A request is sent to immediately update tag status. The processor waits for the status to be updated.

#### Implementation

```
spu_mfcstat(MFC_TAG_UPDATE_IMMEDIATE)
```

#### mfc\_read\_tag\_status\_any: Wait for No Outstanding Operation of Any Enabled Tag Group

```
(uint32_t) mfc_read_tag_status_any(void)
```

A request is sent to update tag status when any enabled MFC tag-group completion has a "no operation outstanding" status. The processor waits for the status to be updated.

#### Implementation

```
spu_mfcstat(MFC_TAG_UPDATE_ANY)
```

### mfc\_read\_tag\_status\_all: Wait for No Outstanding Operation of All Enabled Tag Groups

```
(uint32_t) mfc_read_tag_status_all(void)
```

A request is sent to update tag status when all enabled MFC tag groups have a "no operation outstanding" status. The processor waits for the status to be updated.

#### Implementation

```
spu_mfcstat(MFC_TAG_UPDATE_ALL)
```

### mfc\_stat\_tag\_status: Check Availability of MFC\_RdTagStat Channel

```
(uint32_t) mfc_stat_tag_status(void)
```

The availability of the MFC\_RdTagStat channel is checked, and one of the following values is returned:

- 0: The status is not yet available.
- 1: The status is available.

This function can be used to avoid the channel stall that occurs when the MFC\_RdTagStat channel is read before a status is available.

#### Implementation

```
spu_readchcnt(MFC_RdTagStat)
```

### mfc\_read\_list\_stall\_status: Read List DMA Stall-and-Notify Status

```
(uint32_t) mfc_read_list_stall_status(void)
```

The list DMA stall-and-notify status is read and returned. If the status is not yet available, the program stalls until it is.

#### Implementation

```
spu_readch(MFC_RdListStallStat)
```

### mfc\_stat\_list\_stall\_status: Check Availability of List DMA Stall-and-Notify Status

```
(uint32_t) mfc_stat_list_stall_status(void)
```

The availability of the List DMA stall-and-notify status is checked, and one of the following values is returned:

- 0: The status is not yet available.
- 1: The status is available.

#### Implementation

```
spu_readchcnt(MFC_RdListStallStat)
```

### mfc\_write\_list\_stall\_ack: Acknowledge Tag Group Containing Stalled DMA List Commands

```
(void) mfc_write_list_stall_ack(uint32_t tag)
```

An acknowledgment is sent for a prior stall-and-notify event (see mfc\_read\_list\_stall\_status and mfc\_stat\_list\_stall\_status). The tag argument is the DMA tag.

#### Implementation

```
spu_writech(MFC_WrListStallAck, tag)
```

### mfc\_read\_atomic\_status: Read Atomic Command Status

```
(uint32_t) mfc_read_atomic_status(void)
```

The atomic command status is read and returned. If the status is not yet available, the program stalls until it is. The result is the binary value of bits 29 through 31 and is one of the atomic command status values shown in Table 4-116. The status enumerants are defined in spu\_mfcio.h.

Table 4-116: Read Atomic Command Status or Stall Until Status Is Available

| Status | Mnemonic           | Description                                         |
|--------|--------------------|-----------------------------------------------------|
| 1      | MFC_PUTLLC_STATUS  | The mfc_putllc command failed (reservation lost).   |
| 2      | MFC_PUTLLUC_STATUS | The mfc_putlluc command was completed successfully. |
| 4      | MFC_GETLLAR_STATUS | The mfc_getllar command was completed successfully. |

#### Implementation

```
spu_readch(MFC_RdAtomicStat)
```
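The following host-side sketch classifies a status value according to Table 4-116. The constants are local copies of the table values under hypothetical ATOMIC\_ names, and real SPU code should use the enumerants from spu\_mfcio.h. The sketch assumes the architecture's big-endian bit numbering, in which bits 29 through 31 are the three least significant bits of the 32-bit value.

```c
#include <stdint.h>

/* Local copies of the Table 4-116 values (hypothetical names; SPU code
 * should use the enumerants defined in spu_mfcio.h). */
enum {
    ATOMIC_PUTLLC_FAILED = 1, /* mirrors MFC_PUTLLC_STATUS  */
    ATOMIC_PUTLLUC_DONE  = 2, /* mirrors MFC_PUTLLUC_STATUS */
    ATOMIC_GETLLAR_DONE  = 4  /* mirrors MFC_GETLLAR_STATUS */
};

typedef enum {
    KIND_PUTLLC_FAILED,
    KIND_PUTLLUC_DONE,
    KIND_GETLLAR_DONE,
    KIND_NOT_IN_TABLE  /* any value that Table 4-116 does not list */
} atomic_status_kind;

/* Bits 29..31 in big-endian bit numbering are the three least
 * significant bits of the 32-bit status value. */
static atomic_status_kind classify_atomic_status(uint32_t status)
{
    switch (status & 0x7u) {
    case ATOMIC_PUTLLC_FAILED: return KIND_PUTLLC_FAILED;
    case ATOMIC_PUTLLUC_DONE:  return KIND_PUTLLUC_DONE;
    case ATOMIC_GETLLAR_DONE:  return KIND_GETLLAR_DONE;
    default:                   return KIND_NOT_IN_TABLE;
    }
}
```

In a lock-line update loop, a KIND\_PUTLLC\_FAILED result indicates that the reservation was lost. In that case, the mfc\_getllar and mfc\_putllc sequence would typically be retried.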

### mfc\_stat\_atomic\_status: Check Availability of Atomic Command Status

```
(uint32_t) mfc_stat_atomic_status(void)
```

The availability of the atomic command status is checked, and one of the following values is returned:

- 0: An atomic DMA command has not yet completed.
- 1: An atomic DMA command has completed and the status is available.

#### Implementation

```
spu_readchcnt(MFC_RdAtomicStat)
```

# 4.11. MFC Multisource Synchronization Request

The *Cell Broadband Engine Architecture* describes the MFC Multisource Synchronization Facility. In that document, a cumulative ordering is broadly defined as an ordering of storage accesses performed by multiple processors or units with respect to another processor or unit. In this section, several functions are described that can be used to achieve a cumulative ordering across local and main storage address domains.

### mfc\_write\_multi\_src\_sync\_request: Request Multisource Synchronization

```
(void) mfc_write_multi_src_sync_request(void)
```

A request is sent to start tracking outstanding transfers sent to the associated MFC. When the requested synchronization is complete, the channel count of the MFC Multisource Synchronization Request channel is reset to one.

#### Implementation

```
spu_writech(MFC_WrMSSyncReq, 0)
```

### mfc\_stat\_multi\_src\_sync\_request: Check the Status of Multisource Synchronization

```
(uint32_t) mfc_stat_multi_src_sync_request(void)
```

The channel count of the MFC Multisource Synchronization Request channel is read, and one of the following values is returned:

- 0: Outstanding transfers are being tracked.
- 1: The synchronization requested by mfc\_write\_multi\_src\_sync\_request is complete.

#### Implementation

```
spu_readchcnt(MFC_WrMSSyncReq)
```

# 4.12. SPU Signal Notification

This section describes functions that can be used to read signals from other processors and other devices in the system.

### spu\_read\_signal1: Atomically Read and Clear Signal Notification 1 Channel

```
(uint32_t) spu_read_signal1(void)
```

The Signal Notification 1 channel is read, and any bits that are set are atomically reset. The signal value is returned. If no signals are pending, this function stalls the SPU until a signal is issued.

#### Implementation

```
spu_readch(SPU_RdSigNotify1)
```

### spu\_stat\_signal1: Check If Pending Signals Exist on Signal Notification 1 Channel

```
(uint32_t) spu_stat_signal1(void)
```

A check is made to determine whether any pending signals exist on the Signal Notification 1 channel. One of the following values is returned:

- 0: No signals are pending.
- 1: Signals are pending.

#### Implementation

```
spu_readchcnt(SPU_RdSigNotify1)
```

### spu\_read\_signal2: Atomically Read and Clear Signal Notification 2 Channel

```
(uint32_t) spu_read_signal2(void)
```

The Signal Notification 2 channel is read, and any bits that are set are atomically reset. The signal value is returned. If no signals are pending, this function stalls the SPU until a signal is issued.

#### Implementation

```
spu_readch(SPU_RdSigNotify2)
```

### spu\_stat\_signal2: Check If Pending Signals Exist on Signal Notification 2 Channel

```
(uint32_t) spu_stat_signal2(void)
```

A check is made to determine whether pending signals exist on the Signal Notification 2 channel. One of the following values is returned:

- 0: No signals are pending.
- 1: Signals are pending.

#### Implementation

```
spu_readchcnt(SPU_RdSigNotify2)
```

# 4.13. SPU Mailboxes

This section describes functions that can be used to manage SPU Mailboxes.

### spu\_read\_in\_mbox: Read Next Data Entry in SPU Inbound Mailbox

```
(uint32_t) spu_read_in_mbox(void)
```

The next data entry in the SPU Inbound Mailbox queue is read and returned. If the queue is empty, the call stalls until an entry is available. The meaning of the mailbox data is application-specific; each application can define it as needed.

#### Implementation

```
spu_readch(SPU_RdInMbox)
```

### spu\_stat\_in\_mbox: Get the Number of Data Entries in SPU Inbound Mailbox

```
(uint32_t) spu_stat_in_mbox(void)
```

The number of data entries in the SPU Inbound Mailbox is returned. If the returned value is nonzero, the mailbox contains data entries that have not been read by the SPU.

#### Implementation

```
spu_readchcnt(SPU_RdInMbox)
```

### spu\_write\_out\_mbox: Send Data to SPU Outbound Mailbox

```
(void) spu_write_out_mbox (uint32_t data)
```

The application-specific value data is written to the SPU Outbound Mailbox. If the SPU Outbound Mailbox is full, the call stalls until space is available.

#### Implementation

```
spu_writech(SPU_WrOutMbox, data)
```

### spu\_stat\_out\_mbox: Get Available Capacity of SPU Outbound Mailbox

```
(uint32_t) spu_stat_out_mbox(void)
```

The available capacity of the SPU Outbound Mailbox is returned. A value of zero indicates that the mailbox is full.

#### Implementation

```
spu_readchcnt(SPU_WrOutMbox)
```
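Because spu\_write\_out\_mbox stalls when the mailbox is full, code that must not stall can check spu\_stat\_out\_mbox first. The following host-side mock sketches that pattern. The mock\_ functions, try\_send\_out\_mbox, and the one-entry capacity are hypothetical stand-ins chosen for illustration. They are not the SPU channel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mock of an outbound mailbox; capacity is illustrative. */
#define MOCK_MBOX_CAPACITY 1u

static uint32_t mock_mbox[MOCK_MBOX_CAPACITY];
static uint32_t mock_mbox_count = 0;

/* Stand-in for spu_stat_out_mbox: remaining capacity (0 means full). */
static uint32_t mock_stat_out_mbox(void)
{
    return MOCK_MBOX_CAPACITY - mock_mbox_count;
}

/* Stand-in for spu_write_out_mbox. It is only called when capacity is
 * available, so the mock never has to model the stall. */
static void mock_write_out_mbox(uint32_t data)
{
    mock_mbox[mock_mbox_count++] = data;
}

/* Non-blocking send: write only when the status reports free capacity. */
static bool try_send_out_mbox(uint32_t data)
{
    if (mock_stat_out_mbox() == 0)
        return false;  /* a real write here would stall */
    mock_write_out_mbox(data);
    return true;
}
```

The same check-then-access pattern applies to the other stat functions in this chapter, such as spu\_stat\_in\_mbox before spu\_read\_in\_mbox.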

### spu\_write\_out\_intr\_mbox: Send Data to SPU Outbound Interrupt Mailbox

```
(void) spu_write_out_intr_mbox (uint32_t data)
```

The application-specific value data is written to the SPU Outbound Interrupt Mailbox. If the SPU Outbound Interrupt Mailbox is full, the call stalls until space is available.

#### Implementation

```
spu_writech(SPU_WrOutIntrMbox, data)
```

### spu\_stat\_out\_intr\_mbox: Get Available Capacity of SPU Outbound Interrupt Mailbox

```
(uint32_t) spu_stat_out_intr_mbox(void)
```

The available capacity of the SPU Outbound Interrupt Mailbox is returned. A value of zero indicates that the mailbox is full.

#### Implementation

```
spu_readchcnt(SPU_WrOutIntrMbox)
```

# 4.14. SPU Decrementer

This section describes functions that use the SPU 32-bit decrementer.

### spu\_read\_decrementer: Read Current Value of Decrementer

```
(uint32_t) spu_read_decrementer(void)
```

The current value of the decrementer is read and returned.

#### Implementation

```
spu_readch(SPU_RdDec)
```

### spu\_write\_decrementer: Load a Value to Decrementer

```
(void) spu_write_decrementer(uint32_t count)
```

The value count is loaded into the decrementer.

#### Implementation

```
spu_writech(SPU_WrDec, count)
```
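Because the decrementer counts down, the ticks elapsed between two readings are the earlier reading minus the later one. The sketch below assumes the 32-bit counter wraps modulo 2^32. Under that assumption, plain unsigned subtraction also gives the right answer if the counter wraps past zero once between readings. The helper name is hypothetical. Converting ticks to time requires the decrementer frequency of the particular system, which is not specified here.

```c
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two decrementer readings.
 * The decrementer counts down, so the earlier reading is normally larger.
 * Unsigned 32-bit subtraction is modulo 2^32, so a single wrap from 0 to
 * 0xFFFFFFFF between the two readings is handled correctly. */
static inline uint32_t decrementer_elapsed(uint32_t earlier, uint32_t later)
{
    return earlier - later;
}
```

On an SPU, a typical pattern is to load a known value with spu\_write\_decrementer. The decrementer is then read with spu\_read\_decrementer before and after the code being measured, and both readings are passed to the helper.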

# 4.15. SPU Event

This section describes several functions that can be used to monitor SPU events. See the *Cell Broadband Engine Architecture* for a description of the SPU Event Facility.

The bit-fields of the Event Status, the Event Mask, and the Event Ack are shown in Table 4-117. Bit-field names are defined in spu\_mfcio.h.

Table 4-117: MFC Event Bit-Fields

| Bits   | Field Name                        | Description                                    |
|--------|-----------------------------------|------------------------------------------------|
| 0x1000 | MFC_MULTI_SRC_SYNC_EVENT          | Multisource synchronization event              |
| 0x0800 | MFC_PRIV_ATTN_EVENT               | SPU privileged attention event                 |
| 0x0400 | MFC_LLR_LOST_EVENT                | Lock-line reservation lost event               |
| 0x0200 | MFC_SIGNAL_NOTIFY_1_EVENT         | SPU Signal Notification 1 available event      |
| 0x0100 | MFC_SIGNAL_NOTIFY_2_EVENT         | SPU Signal Notification 2 available event      |
| 0x0080 | MFC_OUT_MBOX_AVAILABLE_EVENT      | SPU Outbound Mailbox available event           |
| 0x0040 | MFC_OUT_INTR_MBOX_AVAILABLE_EVENT | SPU Outbound Interrupt Mailbox available event |
| 0x0020 | MFC_DECREMENTER_EVENT             | SPU decrementer event                          |
| 0x0010 | MFC_IN_MBOX_AVAILABLE_EVENT       | SPU Inbound Mailbox available event            |
| 0x0008 | MFC_COMMAND_QUEUE_AVAILABLE_EVENT | MFC SPU command queue available event          |
| 0x0002 | MFC_LIST_STALL_NOTIFY_EVENT       | MFC DMA List command stall-and-notify event    |
| 0x0001 | MFC_TAG_STATUS_UPDATE_EVENT       | MFC tag-group status update event              |
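The bit-fields in Table 4-117 are combined with bitwise OR to form an event mask, and they are tested individually in an event status value. The sketch below uses local copies of a few table values under hypothetical EV\_ names; real SPU code would use the names from spu\_mfcio.h. It shows one way to service pending events one at a time. On an SPU, the mask would be written with spu\_write\_event\_mask and the status read with spu\_read\_event\_status. Each serviced bit would then be passed to spu\_write\_event\_ack.

```c
#include <stdint.h>

/* Local copies of selected Table 4-117 bit-fields (hypothetical names). */
#define EV_SIGNAL_NOTIFY_1 0x0200u /* MFC_SIGNAL_NOTIFY_1_EVENT   */
#define EV_DECREMENTER     0x0020u /* MFC_DECREMENTER_EVENT       */
#define EV_IN_MBOX         0x0010u /* MFC_IN_MBOX_AVAILABLE_EVENT */
#define EV_TAG_STATUS      0x0001u /* MFC_TAG_STATUS_UPDATE_EVENT */

/* Example mask: monitor inbound mail and tag-group status updates. */
#define EV_MONITORED (EV_IN_MBOX | EV_TAG_STATUS)

/* Isolate the lowest set bit of `pending` so that events can be serviced
 * one at a time; returns 0 when no event bit is set. */
static inline uint32_t next_event(uint32_t pending)
{
    return pending & (~pending + 1u);
}

/* Remove an event bit from the local pending set once it is serviced. */
static inline uint32_t clear_event(uint32_t pending, uint32_t event)
{
    return pending & ~event;
}
```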

### spu\_read\_event\_status: Read Event Status or Stall Until Status Is Available

```
(uint32_t) spu_read_event_status(void)
```

The event status is read and returned. If the status is not yet available, the call stalls until it is. Events that have been reported but not acknowledged continue to be reported until they are acknowledged.

The return value is the value of the SPU Read Event Status channel.

#### Implementation

```
spu_readch(SPU_RdEventStat)
```

### spu\_stat\_event\_status: Check Availability of Event Status

```
(uint32_t) spu_stat_event_status(void)
```

The event status is checked, and one of the following values is returned:

- 0: No enabled events occurred.
- 1: Enabled events are pending.

#### Implementation

```
spu_readchcnt(SPU_RdEventStat)
```

### spu\_write\_event\_mask: Select Events to be Monitored by Event Status

```
(void) spu_write_event_mask (uint32_t mask)
```

The events to be monitored by the event status are selected. The mask argument is the event mask, formed from the bit-fields in Table 4-117.

#### Implementation

```
spu_writech(SPU_WrEventMask, mask)
```

### spu\_write\_event\_ack: Acknowledge Events

```
(void) spu_write_event_ack (uint32_t ack)
```

This function acknowledges that the corresponding events are being serviced by software. The status of the acknowledged events is reset, and the events are resampled. The ack argument specifies the events being acknowledged, using the bit-fields in Table 4-117.

#### Implementation

```
spu_writech(SPU_WrEventAck, ack)
```

### spu\_read\_event\_mask: Read Event Status Mask

```
(uint32_t) spu_read_event_mask(void)
```

The current Event Status Mask is read, and the mask is returned.

#### Implementation

```
spu_readch(SPU_RdEventMask)
```

# 4.16. SPU State Management

This section describes functions that relate to interrupts. See the *Cell Broadband Engine Architecture* for a description of the SPU Machine Status channel and the SPU interrupt-related channels.

### spu\_read\_machine\_status: Read Current SPU Machine Status

```
(uint32_t) spu_read_machine_status(void)
```

The current SPU machine status is read, and the status is returned.

#### Implementation

```
spu_readch(SPU_RdMachStat)
```

### spu\_write\_srr0: Write to SPU SRR0

```
(void) spu_write_srr0(uint32_t srr0)
```

The value of srr0 is written to the SPU state save/restore register 0 (SRR0).

#### Implementation

```
spu_writech(SPU_WrSRR0, srr0)
```

### spu\_read\_srr0: Read SPU SRR0

```
(uint32_t) spu_read_srr0(void)
```

The SPU state save/restore register 0 (SRR0) is read, and its value is returned.

#### Implementation

```
spu_readch(SPU_RdSRR0)
```



# 4.17. Interrupt-Safe Critical Sections

The Cell Broadband Engine Architecture supports asynchronous events that are serviced by an SPE interrupt handler. To allow an interrupt handler to use the MFC facilities safely, application software must guard critical sections so that they cannot be disrupted by MFC actions that the interrupt handler performs.

Critical sections that need to be guarded include:

1. Initiating an MFC command (see sections 3, 4.5, 4.6, 4.7, and 4.8). Initiating an MFC command requires a sequence of up to six channel writes. This sequence must not be interrupted by a handler that initiates an MFC command or writes to any of these channels. Even if the interrupt handler does not explicitly initiate a DMA, it could do so inadvertently if it accesses data through a software-managed cache, uses a software instruction cache, or calls an overlaid code section.
2. Waiting for tag group completion (see section 4.10). The sequence of writing to the MFC\_WrTagUpdate channel followed by reading the MFC\_RdTagStat channel must not be interrupted by a handler that also waits for tag group completion.
3. Atomic update command sequence (see sections 4.7 and 4.10). The iterative sequence of getting a lock-line reservation (mfc\_getllar), putting the lock line (mfc\_putllc, mfc\_putlluc, and mfc\_putqlluc), and testing the atomic command status (mfc\_read\_atomic\_status and mfc\_stat\_atomic\_status) must not be interrupted by a handler that performs atomic update command sequences.

A critical code section is the sequence of instructions executed between a call to the mfc\_begin\_critical\_section function and a call to the mfc\_end\_critical\_section function. The critical section is interrupt safe only if SPU\_MFCIO\_INTERRUPT\_SAFE is defined before spu\_mfcio.h is included.

### mfc\_begin\_critical\_section: Begin Interrupt-Safe Critical Section

```
(uint32_t) mfc_begin_critical_section(void)
```

An interrupt-safe critical section is begun. The current machine status is returned so that the interrupt-enable state can be restored when the critical section ends.

#### Implementation

```
/* mach_stat is the return value of the function */
#ifdef SPU_MFCIO_INTERRUPT_SAFE
  mach_stat = spu_readch(SPU_RdMachStat);
  spu_idisable();
#else
  mach_stat = 0;
#endif
```

### mfc\_end\_critical\_section: End Interrupt-Safe Critical Section

```
(void) mfc_end_critical_section(uint32_t mach_stat)
```

An interrupt-safe critical section is ended. The mach\_stat parameter is the machine status returned when the critical section was begun.

#### Implementation

```
#ifdef SPU_MFCIO_INTERRUPT_SAFE
  if ((mach_stat) & 1) spu_ienable();
#endif
```
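The begin and end functions save and restore the interrupt state rather than unconditionally re-enabling interrupts, and this is what makes nested critical sections safe. The following host-side mock illustrates that property. The simulated machine-status variable and the mock\_ functions are hypothetical stand-ins for spu\_readch(SPU\_RdMachStat), spu\_idisable, and spu\_ienable. As the implementation above implies, the mock assumes that bit 0 (the least significant bit) of the machine status reports whether interrupts are enabled.

```c
#include <stdint.h>

/* Simulated SPU machine status: bit 0 set means interrupts are enabled.
 * (Hypothetical stand-in for the SPU_RdMachStat channel.) */
static uint32_t sim_mach_stat = 1u;

/* Mirrors mfc_begin_critical_section: save the status, then disable. */
static uint32_t mock_begin_critical_section(void)
{
    uint32_t saved = sim_mach_stat;  /* like spu_readch(SPU_RdMachStat) */
    sim_mach_stat &= ~UINT32_C(1);   /* like spu_idisable() */
    return saved;
}

/* Mirrors mfc_end_critical_section: re-enable only if interrupts were
 * enabled when the matching begin call was made. */
static void mock_end_critical_section(uint32_t saved)
{
    if (saved & 1u)
        sim_mach_stat |= 1u;         /* like spu_ienable() */
}

static int mock_interrupts_enabled(void)
{
    return (int)(sim_mach_stat & 1u);
}
```

With nested calls, the inner end call leaves interrupts disabled, because the inner begin call found them already disabled. Only the outer end call re-enables them.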





# 5. Mapping SPU and PPU Vector Multimedia Extension Intrinsics

Function mapping techniques can be used to increase the portability of source code written with SPU intrinsics or PPU Vector Multimedia Extension (VMX) intrinsics. By including the appropriate portability headers, SPU intrinsics can be used on the PPU, or VMX intrinsics can be used on the SPU. This chapter describes a minimal mapping between the two sets of intrinsics.

For many intrinsic functions, an efficient one-to-one mapping between the architectures exists. For some functions, only a less efficient one-to-many instruction mapping might exist; for others, no straightforward mapping exists because a mapping is either impractical or impossible to implement. This document identifies only the one-to-one mappings between the SPU and PPU. For the SPU and PPU intrinsic functions that have no straightforward mapping, the difficulty of mapping them is explained.

The mappings between SPU and PPU VMX intrinsics are defined in two header files: vmx2spu.h and spu2vmx.h. The former maps PPU VMX intrinsics to generic SPU intrinsics, and the latter maps generic SPU intrinsics to PPU VMX intrinsics. The functions that are defined in these two header files may be implemented as overloaded inline functions. To facilitate implementation, the vector data types are also mapped.

For the SPU, the header file vec\_types.h defines single-token vector data types corresponding to the single-token vector data types that are available on the PPU. These data types are listed in Table 1-4. The PPU VMX types vec\_bchar16, vec\_bshort8, vec\_bint4, and vec\_pixel8 will be defined for SPU as described in Table 1-2. For the PPU, the header spu2vmx.h defines single-token vector data types corresponding to the vector data types that are available on the SPU. These data types are also listed in Table 1-4. The SPU types vec\_llong2, vec\_ullong2, and vec\_double2 will be defined for PPU as described in Table 1-3.

The following guidelines describe how to write code that uses these intrinsics and that is portable between the SPU and PPU:

- Always use the single-token vector typedefs described in Table 1-4.
- Only use the intrinsics that are mapped in spu2vmx.h or vmx2spu.h.
- When using SPU intrinsics, include the headers in the following way:

```
#ifdef __SPU__
#include <spu_intrinsics.h>
#else
#include <spu2vmx.h>
#endif
```

- When using PPU VMX intrinsics, include the headers in the following way:

```
#include <vec_types.h>
#ifdef __PPU__
#include <altivec.h>
#else
#include <vmx2spu.h>
#endif
```

# 5.1. Mapping of PPU VMX Intrinsics to SPU Intrinsics

This section lists the one-to-one mapping of PPU VMX intrinsics to SPU intrinsics. It also lists those PPU VMX intrinsics that are difficult to map to SPU intrinsics.

# 5.1.1. One-to-One Mapped Intrinsics

The PPU VMX intrinsics that map one-to-one with the generic SPU intrinsics are shown in Table 5-118.



Table 5-118: PPU VMX Intrinsics That Map One-to-One with SPU Intrinsics

| Generic PPU VMX Intrinsic | Maps to SPU Intrinsic | Applicable Data Type(s)              |
|---------------------------|-----------------------|--------------------------------------|
| vec_add                   | spu_add               | halfword, word, and float (not byte) |
| vec_addc                  | spu_genc              | All                                  |
| vec_and                   | spu_and               | All                                  |
| vec_andc                  | spu_andc              | All                                  |
| vec_avg                   | spu_avg               | unsigned char                        |
| vec_cmpeq                 | spu_cmpeq             | All                                  |
| vec_cmpgt                 | spu_cmpgt             | All                                  |
| vec_cmplt                 | spu_cmpgt             | All (requires parameter reordering)  |
| vec_ctf                   | spu_convtf            | All                                  |
| vec_cts                   | spu_convts            | All                                  |
| vec_ctu                   | spu_convtu            | All                                  |
| vec_madd                  | spu_madd              | All                                  |
| vec_mule                  | spu_mule              | halfword (not byte)                  |
| vec_mulo                  | spu_mulo              | halfword (not byte)                  |
| vec_nmsub                 | spu_nmsub             | All                                  |
| vec_nor                   | spu_nor               | All                                  |
| vec_or                    | spu_or                | All                                  |
| vec_re                    | spu_re                | All                                  |
| vec_rl                    | spu_rl                | halfword, word (not byte)            |
| vec_rsqrte                | spu_rsqrte            | All                                  |
| vec_sel                   | spu_sel               | All                                  |
| vec_sub                   | spu_sub               | halfword, word, float                |
| vec_subc                  | spu_genb              | All                                  |
| vec_xor                   | spu_xor               | All                                  |
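Table 5-118 notes that vec\_cmplt maps to spu\_cmpgt with its parameters reordered, because a < b is the same comparison as b > a. The scalar model below illustrates this for four 32-bit elements. The arrays stand in for vector types, and the model\_ function names are hypothetical; they are not the definitions in vmx2spu.h. Each result element is all ones when the comparison is true and zero otherwise, which matches the element-mask convention of the vector compare intrinsics.

```c
#include <stdint.h>

/* Element-wise a > b, producing an all-ones mask for true elements
 * (scalar stand-in for spu_cmpgt on four 32-bit elements). */
static void model_cmpgt(const int32_t a[4], const int32_t b[4], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (a[i] > b[i]) ? UINT32_C(0xFFFFFFFF) : 0u;
}

/* vec_cmplt(a, b) expressed through the cmpgt model with swapped operands. */
static void model_cmplt(const int32_t a[4], const int32_t b[4], uint32_t out[4])
{
    model_cmpgt(b, a, out);
}
```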

# 5.1.2. PPU VMX Intrinsics That Are Difficult to Map to SPU Intrinsics

The PPU VMX intrinsics that are shown in Table 5-119 are not likely to be mapped to generic SPU intrinsics because a straightforward mapping does not exist.

Table 5-119: PPU VMX Intrinsics That Are Difficult to Map to SPU Intrinsics

| Generic PPU VMX Intrinsic(s) | Explanation                                                                                                                                                                                                                        |
|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| vec_unpackh, vec_unpackl     | These functions cannot be mapped without creating additional SPU data types. A mapping of pixel and bool short vector types to an unsigned short (as described in Table 1-2) will cause an overloaded function selection conflict. |
| vec_mfvscr, vec_mtvscr       | Support of the VSCR register is difficult because the SPU does not support IEEE rounding modes on single-precision floating-point operations.                                                                                      |
| vec_step                     | Mapping requires specific compiler support that is not mandated by this specification.                                                                                                                                             |

# 5.2. Mapping of SPU Intrinsics to PPU VMX Intrinsics

This section lists the one-to-one mapping of SPU intrinsics to PPU VMX intrinsics. It also lists those SPU intrinsics that are difficult to map to PPU VMX intrinsics.

# 5.2.1. One-to-One Mapped Intrinsics

Many of the generic SPU intrinsics map one-to-one with PPU VMX intrinsics. These mappings are shown in Table 5-120.



Table 5-120: SPU Intrinsics That Map One-to-One with PPU VMX Intrinsics

| Generic SPU Intrinsic | Maps to PPU VMX Intrinsic | Applicable Data Type(s)                     |
|-----------------------|---------------------------|---------------------------------------------|
| spu_add               | vec_add                   | vector/vector (no scalar operands)          |
| spu_and               | vec_and                   | vector/vector (no scalar operands)          |
| spu_andc              | vec_andc                  | All                                         |
| spu_avg               | vec_avg                   | All                                         |
| spu_cmpeq             | vec_cmpeq                 | vector/vector (no scalar operands)          |
| spu_cmpgt             | vec_cmpgt                 | vector/vector (no scalar operands)          |
| spu_convtf            | vec_ctf                   | Limited scale range (5 bits)                |
| spu_convts            | vec_cts                   | Limited scale range (5 bits)                |
| spu_convtu            | vec_ctu                   | Limited scale range (5 bits)                |
| spu_genb              | vec_subc                  | All                                         |
| spu_genc              | vec_addc                  | All                                         |
| spu_madd              | vec_madd                  | float                                       |
| spu_mule              | vec_mule                  | All                                         |
| spu_mulo              | vec_mulo                  | Halfword vector/vector (no scalar operands) |
| spu_nmsub             | vec_nmsub                 | float                                       |
| spu_nor               | vec_nor                   | All                                         |
| spu_or                | vec_or                    | vector/vector (no scalar operands)          |
| spu_re                | vec_re                    | All                                         |
| spu_rl                | vec_rl                    | vector/vector (no scalar operands)          |
| spu_rsqrte            | vec_rsqrte                | All                                         |
| spu_sel               | vec_sel                   | All                                         |
| spu_sub               | vec_sub                   | vector/vector (no scalar operands)          |
| spu_xor               | vec_xor                   | vector/vector (no scalar operands)          |

# 5.2.2. SPU Intrinsics That Are Difficult to Map to PPU VMX Intrinsics

The generic SPU intrinsics that are shown in Table 5-121 are not likely to be mapped to PPU VMX intrinsics because a straightforward mapping does not exist.

Table 5-121: SPU Intrinsics That Are Difficult to Map to PPU VMX Intrinsics

| Generic SPU Intrinsic(s) | Explanation |
|--------------------------|-------------|
| spu_bisled, spu_bislede, spu_bisledi, spu_idisable, spu_ienable | Event handling and interrupt handling on the SPU cannot be precisely mapped. |
| spu_readch, spu_readchqw, spu_readchcnt, spu_writech, spu_writechqw | Specific channel functionality cannot be easily supported on the PPU, nor would it generally be desirable to do so. Whereas some channel sequences could be mapped, most would require special programmer insight and direction. |
| spu_mfcdma32, spu_mfcdma64, spu_mfcstat | The mapping of DMA transactions typically is not needed because the PPU has full memory access. Nevertheless, these intrinsics could be used to perform memory synchronization that might not be precisely mappable. |
| spu_sync, spu_sync_c, spu_dsync | These intrinsics could be mapped to one of the PPU sync instructions, but the results might not be what was intended. |
| spu_convts, spu_convtu, spu_convtf | The full dynamic range of scale factors is not easily supported. Vector Multimedia Extension provides a 5-bit scale factor; the SPU has an 8-bit scale factor. Some implementations might support only the 5-bit range provided by the direct mapping of the equivalent intrinsics. |
| spu_hcmpeq, spu_hcmpgt | The halt instruction might be mappable to an exit function, but this will not work in all environments. |
| spu_stop, spu_stopd | It is not always appropriate to stop execution of the PPU. |



# 6. PPU-Specific Intrinsics

This chapter specifies a minimal set of specific intrinsics to make the underlying PPU instruction set accessible from the C programming language. Except for \_\_setflm, each of these intrinsics has a one-to-one assembly language mapping, unless compiled for a 32-bit ABI in which the high and low halves of a 64-bit doubleword are maintained in separate registers. In this latter situation, the corresponding 32-bit intrinsic might generate a sequence of instructions. In other instances, a corresponding 32-bit implementation cannot be supported.

The PPU intrinsics will be declared in the system header file, ppu\_intrinsics.h. They may be either defined within this header as macros or implemented internally within the compiler.

Some intrinsics take a literal value of either 3, 4, 5, 6, 8, or 10 bits in length. By default, a call to an intrinsic with an out-of-range literal is reported by the compiler as an error. Compilers may provide an option to issue a warning for out-of-range literal values and use only the specified number of least significant bits for the out-of-range argument.

Unless otherwise noted, the intrinsics have no specific ordering; the compiler can optimize them and schedule them like normal operations.

### \_\_cctph: Change Thread Priority to High

```
(void) __cctph()
```

The current thread priority is changed to high priority. This intrinsic will not be reordered by the compiler.

Table 6-122: Change Thread Priority to High

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | cctph            |

### \_\_cctpl: Change Thread Priority to Low

```
(void) __cctpl()
```

The current thread priority is changed to low priority. This intrinsic will not be reordered by the compiler.

Table 6-123: Change Thread Priority to Low

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | cctpl            |

#### \_\_cctpm: Change Thread Priority to Medium

(void) \_\_cctpm()

The current thread priority is changed to medium priority. This intrinsic will not be reordered by the compiler.

Table 6-124: Change Thread Priority to Medium

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | cctpm            |



#### \_\_cntlzd: Count Leading Doubleword Zeros

d = \_\_cntlzd(a)

The number of leading zeros in the doubleword a is returned in d.

Table 6-125: Count Leading Doubleword Zeros

| Return/Argument Types |                    | Assembly Mapping |            |
|-----------------------|--------------------|------------------|------------|
| d                     | a                  | 64-bit ABI       | 32-bit ABI |
| unsigned int          | unsigned long long | cntlzd d, a      | cntlzw hi_cnt, a_hi<br>cntlzw lo_cnt, a_lo<br>rlwinm mask, hi_cnt, 26, 0, 5<br>srawi mask, mask, 31<br>and lo_cnt, lo_cnt, mask<br>add d, hi_cnt, lo_cnt |
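The 32-bit ABI sequence in Table 6-125 can be checked on any host with a portable C model. This is a sketch for illustration, not the compiler's implementation. The names model_cntlzw and model_cntlzd_32bit_abi are hypothetical, the doubleword is passed as its high and low words, and each statement mirrors one instruction of the sequence.

```c
#include <stdint.h>

/* Portable stand-in for cntlzw: leading zeros in a 32-bit word (32 for 0). */
static uint32_t model_cntlzw(uint32_t x)
{
    uint32_t n = 0;
    if (x == 0)
        return 32;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Step-by-step model of the 32-bit ABI sequence for __cntlzd (Table 6-125). */
static uint32_t model_cntlzd_32bit_abi(uint32_t a_hi, uint32_t a_lo)
{
    uint32_t hi_cnt = model_cntlzw(a_hi);            /* cntlzw hi_cnt, a_hi */
    uint32_t lo_cnt = model_cntlzw(a_lo);            /* cntlzw lo_cnt, a_lo */
    /* rlwinm mask, hi_cnt, 26, 0, 5: rotate left by 26 and keep the six most
       significant bits; the value-32 bit of hi_cnt lands in the sign bit. */
    uint32_t mask = ((hi_cnt << 26) | (hi_cnt >> 6)) & UINT32_C(0xFC000000);
    /* srawi mask, mask, 31: replicate the sign bit, giving all ones only
       when hi_cnt == 32, that is, when a_hi is zero. */
    mask = (mask & UINT32_C(0x80000000)) ? UINT32_C(0xFFFFFFFF) : 0;
    lo_cnt &= mask;                                  /* and lo_cnt, lo_cnt, mask */
    return hi_cnt + lo_cnt;                          /* add d, hi_cnt, lo_cnt */
}
```

When a_hi is zero, the mask is all ones and the result is 32 plus the leading zeros of a_lo. Otherwise, lo_cnt is masked away and the result is simply hi_cnt.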

#### \_\_cntlzw: Count Leading Word Zeros

d = \_\_cntlzw(a)

The number of leading zeros in the word a is returned in d.

Table 6-126: Count Leading Word Zeros

| Return/Argument Types |              | Assembly Mapping |
|-----------------------|--------------|------------------|
| d                     | a            |                  |
| unsigned int          | unsigned int | cntlzw d, a      |

#### \_\_db10cyc: Delay 10 Cycles at Dispatch

(void) \_\_db10cyc()

The current thread is blocked at dispatch for 10 cycles. This intrinsic will not be reordered by the compiler.

Table 6-127: Delay 10 Cycles at Dispatch

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | db10cyc          |

#### \_\_db12cyc: Delay 12 Cycles at Dispatch

(void) \_\_db12cyc()

The current thread is blocked at dispatch for 12 cycles. This intrinsic will not be reordered by the compiler.

Table 6-128: Delay 12 Cycles at Dispatch

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | db12cyc          |



#### \_\_db16cyc: Delay 16 Cycles at Dispatch

(void) \_\_db16cyc()

The current thread is blocked at dispatch for 16 cycles. This intrinsic will not be reordered by the compiler.

Table 6-129: Delay 16 Cycles at Dispatch

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | db16cyc          |

#### \_\_db8cyc: Delay 8 Cycles at Dispatch

(void) \_\_db8cyc()

The current thread is blocked at dispatch for 8 cycles. This intrinsic will not be reordered by the compiler.

Table 6-130: Delay 8 Cycles at Dispatch

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | db8cyc           |

#### \_\_dcbf: Data Cache Block Flush

(void) \_\_dcbf(pointer)

The cache block that contains the argument pointer is flushed and removed from the cache.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-131: Data Cache Block Flush

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| pointer               |                  |
| void*                 | dcbf base, index |

#### \_\_dcbst: Data Cache Block Store

(void) \_\_dcbst(pointer)

The cache block that contains the argument pointer is written to main memory. This intrinsic will not be reordered by the compiler.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-132: Data Cache Block Store

| Return/Argument Types | Assembly Mapping  |
|-----------------------|-------------------|
| pointer               |                   |
| void*                 | dcbst base, index |



#### \_\_dcbt: Data Cache Block Touch

```
(void) __dcbt(pointer)
```

The processor receives a hint that the cache block that contains the argument pointer will soon be loaded. This intrinsic will not be reordered by the compiler.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-133: Data Cache Block Touch

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| pointer               |                  |
| void*                 | dcbt base, index |

#### \_\_dcbt\_TH1000: Set Up Streaming Data

```
(void) __dcbt_TH1000(eatrunc, d, ug, id)
```

A data stream with the identifier id is set up, starting at the cache block that contains the effective address eatrunc. The argument d specifies the direction of the stream: true for forward and false for backward. The argument ug specifies whether the stream is unlimited in length. This intrinsic will not be reordered by the compiler.

The effective address for this instruction is calculated as:

```
(((unsigned long long) eatrunc) & ~0x7F) | ((d & 1) << 6) | ((ug & 1) << 5) | (id & 0xF)
```

The base and index arguments for the assembly mapping are calculated from the above effective address.

Table 6-134: Set Up Streaming Data

| Return/Argument Types |      |      |     | Assembly Mapping    |
|-----------------------|------|------|-----|---------------------|
| eatrunc               | d    | ug   | id  |                     |
| void*                 | bool | bool | int | dcbt base, index, 8 |
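To make the bit layout concrete, the following portable C sketch computes the same effective address as the formula above. It does not issue a dcbt instruction. The helper name th1000_ea is hypothetical, and the address is passed as a uint64_t rather than a void* so that the sketch runs on any host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the effective address that __dcbt_TH1000 encodes
   for "dcbt base, index, 8". The address is taken as uint64_t, not void*,
   so the sketch can run on any host; no dcbt is issued. */
static uint64_t th1000_ea(uint64_t eatrunc, bool d, bool ug, unsigned int id)
{
    return (eatrunc & ~UINT64_C(0x7F))      /* start of the 128-byte block */
         | ((uint64_t)(d & 1) << 6)          /* direction */
         | ((uint64_t)(ug & 1) << 5)         /* unlimited length */
         | (uint64_t)(id & 0xF);             /* stream ID, 0 to 15 */
}
```

For example, th1000_ea(0x10000ABC, true, false, 3) clears the low seven address bits, giving 0x10000A80, and then ORs in the direction bit (0x40) and the ID, which yields 0x10000AC3.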

#### \_\_dcbt\_TH1010: Start or Stop Streaming Data

```
(void) __dcbt_TH1010(go, s, unitcnt, t, u, id)
```

The processor receives a hint about the data stream identified by id. If go is set, the program will soon load from all nascent data streams that have been completely described, and it will probably no longer load from any other nascent data streams; all other arguments are ignored in this case. If s is '10', the stream associated with id stops, and all other arguments except id are ignored. If s is '11', all stream IDs are stopped and all other arguments are ignored. The argument unitcnt specifies the number of units in the data stream. The argument t indicates whether the program's need for each block of the data stream is likely to be transient. The argument u indicates whether the data stream is unlimited, in which case unitcnt is ignored. This intrinsic will not be reordered by the compiler.

The effective address for this instruction is calculated as:

```
(((unsigned long long) go & 1) << 31)
| ((s & 0x3) << 29)
| ((unitcnt & 0x3FF) << 7)
| ((t & 1) << 6)
| ((u & 1) << 5)
| (id & 0xF)
```

The base and index arguments for the assembly mapping are calculated from the above effective address.



Table 6-135: Start or Stop Streaming Data

| Return/Argument Types |     |         |      |      |     | Assembly Mapping     |
|-----------------------|-----|---------|------|------|-----|----------------------|
| go                    | s   | unitcnt | t    | u    | id  |                      |
| bool                  | int | int     | bool | bool | int | dcbt base, index, 10 |
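The same approach applies to \_\_dcbt\_TH1010. The following sketch is a portable model of the effective-address formula above, not the intrinsic itself. The helper th1010_ea is hypothetical, and s is passed as the integer value of the two-bit field (2 for '10', 3 for '11').

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the effective address that __dcbt_TH1010 encodes
   for "dcbt base, index, 10", following the formula above. */
static uint64_t th1010_ea(bool go, unsigned int s, unsigned int unitcnt,
                          bool t, bool u, unsigned int id)
{
    return ((uint64_t)(go & 1) << 31)             /* start described streams */
         | ((uint64_t)(s & 0x3) << 29)             /* stop field: 2 = '10', 3 = '11' */
         | ((uint64_t)(unitcnt & 0x3FF) << 7)      /* number of units */
         | ((uint64_t)(t & 1) << 6)                /* transient */
         | ((uint64_t)(u & 1) << 5)                /* unlimited, unitcnt ignored */
         | (uint64_t)(id & 0xF);                   /* stream ID */
}
```

For instance, th1010_ea(false, 2, 0, false, false, 5) encodes a request to stop stream 5. By contrast, th1010_ea(true, 0, 0, false, false, 0) sets only the go bit.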

#### \_\_dcbtst: Data Cache Block Touch for Store

(void) \_\_dcbtst(pointer)

The processor receives a hint that the cache block that contains the argument *pointer* will soon be stored. This intrinsic will not be reordered by the compiler.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-136: Data Cache Block Touch for Store

| Return/Argument Types | Assembly Mapping   |
|-----------------------|--------------------|
| pointer               |                    |
| void*                 | dcbtst base, index |

#### \_\_dcbz: Data Cache Block Set to Zero

(void) \_\_dcbz(pointer)

The cache block that contains the argument pointer is set to zero. If the block is already in the cache, it is zeroed in place; otherwise, a cache block for the address is created and filled with zeros. This intrinsic will not be reordered by the compiler.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-137: Data Cache Block Set to Zero

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| pointer               |                  |
| void*                 | dcbz base, index |

#### \_\_eieio: Enforce In-Order Execution of I/O

(void) \_\_eieio()

A memory barrier is created, which provides an ordering function for the storage accesses caused by *Load, Store*, \_\_dcbz(), \_\_eciwx(), and \_\_ecowx() instructions executed by the processor executing the \_\_eieio() instruction. The memory barrier and ordering function are described in section 1.7.1 of *PowerPC Architecture Book, Book II: PowerPC Virtual Environment Architecture*, version 2.02.

Table 6-138: Enforce In-Order Execution of I/O

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | eieio            |



#### \_\_fabs: Double Absolute Value

d = \_\_fabs(a)

The absolute value of the argument a is returned in d with the sign bit set to zero.

Table 6-139: Double Absolute Value

| Return/Argument Types |        | Assembly Mapping |
|-----------------------|--------|------------------|
| d                     | a      |                  |
| double                | double | fabs d, a        |

#### \_\_fabsf: Float Absolute Value

d = \_\_fabsf(a)

The absolute value of the argument a is returned in d with the sign bit set to zero.

Table 6-140: Float Absolute Value

| Return/Argument Types |       | Assembly Mapping |
|-----------------------|-------|------------------|
| d                     | a     |                  |
| float                 | float | fabs d, a        |

#### \_\_fcfid: Convert Doubleword to Double

d = \_\_fcfid(a)

The doubleword integer in a is converted to a double-precision floating-point value and returned in d.

Table 6-141: Convert Doubleword to Double

| Return/Argument Types |           | Assembly Mapping |
|-----------------------|-----------|------------------|
| d                     | a         |                  |
| double                | long long | fcfid d, a       |

#### \_\_fctid: Convert Double to Doubleword

d = \_\_fctid(a)

The double a is converted to a doubleword integer and returned in d. The conversion uses the current rounding mode.

Table 6-142: Convert Double to Doubleword

| Return/Argument Types |        | Assembly Mapping |
|-----------------------|--------|------------------|
| d                     | a      |                  |
| long long             | double | fctid d, a       |



#### \_\_fctidz: Convert Double to Doubleword with Round Towards Zero

d = \_\_fctidz(a)

The double a is converted to a doubleword integer and returned in d. This function always rounds towards zero.

Table 6-143: Convert Double to Doubleword with Round Towards Zero

| Return/Argument Types |        | Assembly Mapping |
|-----------------------|--------|------------------|
| d                     | a      |                  |
| long long             | double | fctidz d, a      |

#### \_\_fctiw: Convert Double to Word

d = \_\_fctiw(a)

The double a is converted to a word integer and returned in d. The conversion uses the current rounding mode.

Table 6-144: Convert Double to Word

| Return/Argument Types |        | Assembly Mapping                                                   |
|-----------------------|--------|--------------------------------------------------------------------|
| d                     | a      |                                                                    |
| int                   | double | fctiw tmp, a<br>stfiwx tmp, r1, tempspace<br>lwzx d, r1, tempspace |

#### \_\_fctiwz: Convert Double to Word with Round Towards Zero

d = \_\_fctiwz(a)

The double a is converted to a word integer and returned in d. This function always rounds towards zero.

Table 6-145: Convert Double to Word with Round Towards Zero

| Return/Argument Types |        | Assembly Mapping                                                    |
|-----------------------|--------|---------------------------------------------------------------------|
| d                     | a      |                                                                     |
| int                   | double | fctiwz tmp, a<br>stfiwx tmp, r1, tempspace<br>lwzx d, r1, tempspace |

#### \_\_fmadd: Double Fused Multiply and Add

d = \_\_fmadd(a, b, c)

The argument a is multiplied by the argument b, and the argument c is added to that product. The resulting value $(a \times b + c)$ is returned in d.

Table 6-146: Double Fused Multiply and Add

| Return/Argument Types |        |        |        | Assembly Mapping |
|-----------------------|--------|--------|--------|------------------|
| d                     | a      | b      | c      |                  |
| double                | double | double | double | fmadd d, a, b, c |



#### \_\_fmadds: Float Fused Multiply and Add

d = \_\_fmadds(a, b, c)

The argument a is multiplied by the argument b, and the argument c is added to that product. The resulting value $(a \times b + c)$ is returned in d.

Table 6-147: Float Fused Multiply and Add

| Return/Argument Types |       |       |       | Assembly Mapping  |
|-----------------------|-------|-------|-------|-------------------|
| d                     | a     | b     | c     |                   |
| float                 | float | float | float | fmadds d, a, b, c |

#### \_\_fmsub: Double Fused Multiply and Subtract

d = \_\_fmsub(a, b, c)

The argument a is multiplied by the argument b, and the argument c is subtracted from that product. The resulting value $(a \times b - c)$ is returned in d.

Table 6-148: Double Fused Multiply and Subtract

| Return/Argument Types |        |        |        | Assembly Mapping |
|-----------------------|--------|--------|--------|------------------|
| d                     | a      | b      | c      |                  |
| double                | double | double | double | fmsub d, a, b, c |

#### \_\_fmsubs: Float Fused Multiply and Subtract

d = \_\_fmsubs(a, b, c)

The argument a is multiplied by the argument b, and the argument c is subtracted from that product. The resulting value $(a \times b - c)$ is returned in d.

Table 6-149: Float Fused Multiply and Subtract

| Return/Argument Types |       |       |       | Assembly Mapping  |
|-----------------------|-------|-------|-------|-------------------|
| d                     | a     | b     | c     |                   |
| float                 | float | float | float | fmsubs d, a, b, c |

#### \_\_fmul: Double Multiply

d = \_\_fmul(a, b)

The doubles a and b are multiplied, and their product $(a \times b)$ is returned in d.

Table 6-150: Double Multiply

| Return/Argument Types |        |        | Assembly Mapping |
|-----------------------|--------|--------|------------------|
| d                     | a      | b      |                  |
| double                | double | double | fmul d, a, b     |



#### \_\_fmuls: Float Multiply

d = \_\_fmuls(a, b)

The floats a and b are multiplied, and their product $(a \times b)$ is returned in d.

Table 6-151: Float Multiply

| Return/Argument Types |       |       | Assembly Mapping |
|-----------------------|-------|-------|------------------|
| d                     | a     | b     |                  |
| float                 | float | float | fmuls d, a, b    |

#### \_\_fnabs: Double Negative

d = \_\_fnabs(a)

The negative absolute value of the argument a is returned in d. The sign bit is set to 1.

Table 6-152: Double Negative

| Return/Argument Types |        | Assembly Mapping |
|-----------------------|--------|------------------|
| d                     | a      |                  |
| double                | double | fnabs d, a       |
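The sign-bit behavior of \_\_fabs and \_\_fnabs can be modeled portably by editing the bit image of a double. The sketch below assumes IEEE 754 binary64 doubles whose sign is the most significant bit of the 64-bit image, as on mainstream hosts. The names model_fabs and model_fnabs are hypothetical, and the sketch is not the intrinsic itself.

```c
#include <stdint.h>
#include <string.h>

/* Portable models of the sign-bit behavior of __fabs and __fnabs.
   Assumes IEEE 754 binary64 doubles whose sign is the top bit of the
   64-bit image, as on mainstream hosts. */
static double model_fabs(double a)
{
    uint64_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits &= ~(UINT64_C(1) << 63);   /* clear the sign bit */
    memcpy(&a, &bits, sizeof bits);
    return a;
}

static double model_fnabs(double a)
{
    uint64_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits |= UINT64_C(1) << 63;      /* set the sign bit */
    memcpy(&a, &bits, sizeof bits);
    return a;
}
```

Because only the sign bit changes, zeros keep their magnitude. For example, model_fabs(-0.0) yields +0.0 and model_fnabs(0.0) yields -0.0.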

#### \_\_fnabsf: Float Negative

d = \_\_fnabsf(a)

The negative absolute value of the argument a is returned in d. The sign bit is set to 1.

Table 6-153: Float Negative

| Return/Argument Types |       | Assembly Mapping |
|-----------------------|-------|------------------|
| d                     | a     |                  |
| float                 | float | fnabs d, a       |

#### \_\_fnmadd: Double Fused Negative Multiply and Add

d = \_\_fnmadd(a, b, c)

The arguments a and b are multiplied, and the argument c is added to their product. The sum is negated, and the resulting value $-(a \times b + c)$ is returned in d.

Table 6-154: Double Fused Negative Multiply and Add

| Return/Argument Types |        |        |        | Assembly Mapping  |
|-----------------------|--------|--------|--------|-------------------|
| d                     | a      | b      | c      |                   |
| double                | double | double | double | fnmadd d, a, b, c |



#### \_\_fnmadds: Float Fused Negative Multiply and Add

d = \_\_fnmadds(a, b, c)

The arguments a and b are multiplied, and the argument c is added to their product. The sum is negated, and the resulting value $-(a \times b + c)$ is returned in d.

Table 6-155: Float Fused Negative Multiply and Add

| Return/Argument Types |       |       |       | Assembly Mapping   |
|-----------------------|-------|-------|-------|--------------------|
| d                     | a     | b     | c     |                    |
| float                 | float | float | float | fnmadds d, a, b, c |

#### \_\_fnmsub: Double Fused Negative Multiply and Subtract

```
d = __fnmsub(a, b, c)
```

The arguments a and b are multiplied, and the argument c is subtracted from their product. The difference is negated, and the resulting value $-(a \times b - c)$ is returned in d.

Table 6-156: Double Fused Negative Multiply and Subtract

| Return/Argument Types |        |        |        | Assembly Mapping  |
|-----------------------|--------|--------|--------|-------------------|
| d                     | a      | b      | c      |                   |
| double                | double | double | double | fnmsub d, a, b, c |

#### \_\_fnmsubs: Float Fused Negative Multiply and Subtract

```
d = __fnmsubs(a, b, c)
```

The arguments a and b are multiplied, and the argument c is subtracted from their product. The difference is negated, and the resulting value $-(a \times b - c)$ is returned in d.

Table 6-157: Float Fused Negative Multiply and Subtract

| Return/Argument Types |       |       |       | Assembly Mapping   |
|-----------------------|-------|-------|-------|--------------------|
| d                     | a     | b     | c     |                    |
| float                 | float | float | float | fnmsubs d, a, b, c |

#### \_\_fres: Float Reciprocal Estimate

d = \_\_fres(a)

An estimate of the reciprocal of the argument a is returned in d. The estimate is correct to a precision of one part in 256 of the reciprocal. Beyond this precision, the value is indeterminate; the results of executing this instruction may vary between implementations and between different executions on the same implementation.

Table 6-158: Float Reciprocal Estimate

| Return/Argument Types |       | Assembly Mapping |
|-----------------------|-------|------------------|
| d                     | a     |                  |
| float                 | float | fres d, a        |
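Because \_\_fres returns only an estimate, a common pattern is to refine it with Newton-Raphson iterations. Each step roughly squares the relative error. The following portable sketch shows one step. The helper refine_recip is hypothetical, and on the PPU the initial estimate x0 would come from \_\_fres(a).

```c
/* One Newton-Raphson step for 1/a from an estimate x0 (for example, the
   value returned by __fres): x1 = x0 * (2 - a * x0). If x0 = (1/a)(1 + e),
   then x1 = (1/a)(1 - e*e). */
static float refine_recip(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}
```

For example, an estimate with a relative error of 1/256 improves after one step to an error of about 1/65536, which is close to full single precision.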



#### \_\_frsp: Round to Single Precision

d = \_\_frsp(a)

The argument a is rounded to single precision and returned in d.

Table 6-159: Round to Single Precision

| Return/Argument Types |        | Assembly Mapping |
|-----------------------|--------|------------------|
| d                     | a      |                  |
| float                 | double | frsp d, a        |

#### \_\_frsqrte: Double Reciprocal Square Root Estimate

d = \_\_frsqrte(a)

An estimate of the reciprocal of the square root of the argument a is returned in d.

The estimate is correct to a precision of one part in 32 of the reciprocal of the square root. Beyond this precision, the value is indeterminate; the results of executing this instruction may vary between implementations and between different executions on the same implementation.

Table 6-160: Double Reciprocal Square Root Estimate

| Return/Argument Types |        | Assembly Mapping |
|-----------------------|--------|------------------|
| d                     | a      |                  |
| double                | double | frsqrte d, a     |
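The coarser \_\_frsqrte estimate is usually refined with Newton-Raphson steps for the reciprocal square root. Each step leaves a relative error of roughly 1.5 times the square of the previous error. Starting from 1/32, that is about 0.0015 after one step and on the order of 10⁻⁶ after two. The following is a portable sketch. The helper refine_rsqrt is hypothetical, and y0 would come from \_\_frsqrte(a) on the PPU.

```c
/* One Newton-Raphson step for 1/sqrt(a) from an estimate y0 (for example,
   the value returned by __frsqrte): y1 = y0 * (1.5 - 0.5 * a * y0 * y0). */
static double refine_rsqrt(double a, double y0)
{
    return y0 * (1.5 - 0.5 * a * y0 * y0);
}
```

For a = 4.0 and an estimate 0.515625 (that is, 0.5 × (1 + 1/32)), one step gives about 0.49926. A second step gives about 0.4999984.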

#### \_\_fsel: Floating-Point Select of Double

d = \_\_fsel(a, b, c)

The argument b is returned in d if the argument a is greater than or equal to 0.0; otherwise c is returned.

Table 6-161: Floating-Point Select of Double

| Return/Argument Types |        |        |        | Assembly Mapping |
|-----------------------|--------|--------|--------|------------------|
| d                     | a      | b      | c      |                  |
| double                | double | double | double | fsel d, a, b, c  |
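\_\_fsel is often used to replace a floating-point branch with a selection. The sketch below is a portable model of the documented behavior, not the intrinsic. The names model_fsel and fsel_max are hypothetical. On the PPU, the body of fsel_max would instead be written as \_\_fsel(x - y, x, y).

```c
/* Portable model of the selection made by __fsel and __fsels: b when
   a >= 0.0, otherwise c. With C comparison rules, -0.0 >= 0.0 is true
   and any comparison with a NaN is false, so in this model a NaN selects c. */
static double model_fsel(double a, double b, double c)
{
    return (a >= 0.0) ? b : c;
}

/* Hypothetical use: a branch-free maximum chosen by the sign of x - y. */
static double fsel_max(double x, double y)
{
    return model_fsel(x - y, x, y);
}
```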

#### \_\_fsels: Floating-Point Select of Float

d = \_\_fsels(a, b, c)

The argument b is returned in d if the argument a is greater than or equal to 0.0; otherwise c is returned.

Table 6-162: Floating-Point Select of Float

| Return/Argument Types |       |       |       | Assembly Mapping |
|-----------------------|-------|-------|-------|------------------|
| d                     | a     | b     | c     |                  |
| float                 | float | float | float | fsel d, a, b, c  |



#### \_\_fsqrt: Double Square Root

d = \_\_fsqrt(a)

The square root of the argument a is returned in d.

Table 6-163: Double Square Root

| Return/Argument Types |        | Assembly Mapping |
|-----------------------|--------|------------------|
| d                     | a      |                  |
| double                | double | fsqrt d, a       |

#### \_\_fsqrts: Float Square Root

d = \_\_fsqrts(a)

The square root of the argument a is returned in d.

Table 6-164: Float Square Root

| Return/Argument Types |       | Assembly Mapping |
|-----------------------|-------|------------------|
| d                     | a     |                  |
| float                 | float | fsqrts d, a      |

#### \_\_icbi: Instruction Cache Block Invalidate

(void) \_\_icbi(pointer)

The instruction cache block that contains the argument *pointer* is invalidated, if such a block is in the cache. This intrinsic will not be reordered by the compiler.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-165: Instruction Cache Block Invalidate

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| pointer               |                  |
| void*                 | icbi base, index |

#### \_\_isync: Instruction Sync

(void) \_\_isync()

The processor waits until all previous instructions have finished. The \_\_isync() intrinsic ensures that all previously issued icbi instructions have been performed.

Table 6-166: Instruction Sync

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | isync            |



#### \_\_ldarx: Load Doubleword with Reserved

d = \_\_ldarx(pointer)

The reserved address of the processor is set to the value of *pointer*. A doubleword from the address in *pointer* is returned in *d*.

The base and index arguments for the assembly mapping are calculated from pointer.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-167: Load Doubleword with Reserved

| Return/Argument Types |         | Assembly Mapping     |
|-----------------------|---------|----------------------|
| d                     | pointer |                      |
| unsigned long long    | void*   | ldarx d, base, index |

#### \_\_ldbrx: Load Reversed Doubleword

d = \_\_ldbrx(pointer)

A doubleword from the address in pointer is loaded in reversed endian order into d and returned.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-168: Load Reversed Doubleword

| Return/Argument Types |         | Assembly Mapping     |                                                      |
|-----------------------|---------|----------------------|------------------------------------------------------|
| d                     | pointer | 64-bit ABI           | 32-bit ABI                                           |
| unsigned long long    | void*   | ldbrx d, base, index | lwbrx d_lo, base, index<br>lwbrx d_hi, base, index+4 |
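On the big-endian PPU, a byte-reversed load makes the byte at the lowest address the least significant byte of the result. The portable sketch below models that behavior independently of host byte order. It also shows why the two lwbrx loads of the 32-bit ABI mapping produce the same doubleword as ldbrx. The names model_lwbrx and model_ldbrx_32bit_abi are hypothetical.

```c
#include <stdint.h>

/* Portable model of a byte-reversed word load (lwbrx): the byte at the
   lowest address becomes the least significant byte of the result. */
static uint32_t model_lwbrx(const void *pointer)
{
    const unsigned char *p = pointer;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Models the 32-bit ABI mapping of __ldbrx in Table 6-168: the reversed
   word at pointer is the low half of d; the one at pointer+4, the high half. */
static uint64_t model_ldbrx_32bit_abi(const void *pointer)
{
    const unsigned char *p = pointer;
    uint64_t d_lo = model_lwbrx(p);       /* lwbrx d_lo, base, index   */
    uint64_t d_hi = model_lwbrx(p + 4);   /* lwbrx d_hi, base, index+4 */
    return (d_hi << 32) | d_lo;
}
```

For the bytes 01 02 03 04 05 06 07 08 in increasing address order, both forms yield 0x0807060504030201.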

#### \_\_lhbrx: Load Reversed Halfword

d = \_\_lhbrx(pointer)

A halfword from the address in pointer is loaded in reversed endian order into d and returned.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-169: Load Reversed Halfword

| Return/Argument Types |         | Assembly Mapping     |
|-----------------------|---------|----------------------|
| d                     | pointer |                      |
| unsigned short        | void*   | lhbrx d, base, index |

#### \_\_lwarx: Load Word with Reserved

d = \_\_lwarx(pointer)

The reserved address of the processor is set to the value of pointer. A word from the address in pointer is returned in d.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-170: Load Word with Reserved

| Return/Argument Types |         | Assembly Mapping     |
|-----------------------|---------|----------------------|
| d                     | pointer |                      |
| unsigned              | void*   | lwarx d, base, index |



#### \_\_lwbrx: Load Reversed Word

d = \_\_lwbrx(pointer)

A word from the address in *pointer* is loaded in reversed endian order into *d* and returned.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-171: Load Reversed Word

| Return/Argument Types |         | Assembly Mapping     |
|-----------------------|---------|----------------------|
| d                     | pointer |                      |
| unsigned              | void*   | lwbrx d, base, index |

#### \_\_lwsync: Light Weight Sync

```
(void) __lwsync()
```

A memory barrier is created, providing an ordering function for the storage accesses caused by prior *Load, Store*, and \_\_dcbz() instructions that are executed by the processor executing \_\_lwsync(). The memory barrier and ordering function are described in section 1.7.1 of *PowerPC Architecture Book, Book II: PowerPC Virtual Environment Architecture*, version 2.02.

Table 6-172: Light Weight Sync

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | lwsync           |

#### \_\_mffs: Move from Floating-Point Status and Control Register

d = \_\_mffs()

The current Floating-Point Status and Control Register is returned in *d*. This intrinsic will not be reordered by the compiler.

Table 6-173: Move from Floating-Point Status and Control Register

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| d                     |                  |
| double                | mffs d           |
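Because \_\_mffs returns the FPSCR inside a double, code typically reinterprets the bit image to read individual fields. The following portable sketch relies on two facts about the mffs instruction. The FPSCR occupies the low-order 32 bits of the returned value, and the rounding-mode field RN is its two least significant bits. The helpers fpscr_from_mffs and fpscr_rounding_mode are hypothetical, as is the test helper mffs_image_from_bits, which stands in for \_\_mffs on non-PPU hosts.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the FPSCR image held in the low-order word of the
   double returned by __mffs. */
static uint32_t fpscr_from_mffs(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)bits;
}

/* RN, the two least significant FPSCR bits: 0 = round to nearest,
   1 = toward zero, 2 = toward +infinity, 3 = toward -infinity. */
static unsigned int fpscr_rounding_mode(uint32_t fpscr)
{
    return fpscr & 0x3u;
}

/* Test helper for non-PPU hosts: builds a double from a 64-bit image. */
static double mffs_image_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

On the PPU, fpscr_rounding_mode(fpscr_from_mffs(\_\_mffs())) would yield the current rounding mode.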



#### \_\_mfspr: Move from Special Purpose Register

d = \_\_mfspr(spr)

The contents of the special purpose register specified by spr are returned in d. This intrinsic will not be reordered by the compiler.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-174: Move from Special Purpose Register

| Return/Argument Types |                               | Assembly Mapping |
|-----------------------|-------------------------------|------------------|
| d                     | spr                           |                  |
| unsigned long long    | 10-bit unsigned int (literal) | mfspr d, spr     |

#### \_\_mftb: Move from Time Base

d = \_\_mftb()

The time base register is returned in *d*. This intrinsic will not be reordered by the compiler.

Table 6-175: Move from Time Base

| Return/Argument Types | Assembly Mapping |                                                                               |
|-----------------------|------------------|-------------------------------------------------------------------------------|
| d                     | 64-bit ABI       | 32-bit ABI                                                                    |
| unsigned long long    | mftb d           | retry:<br>mftbu d_hi<br>mftb d_lo<br>mftbu tmp<br>cmp d_hi, tmp<br>bne retry |
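The 32-bit ABI sequence in Table 6-175 re-reads the upper half of the time base. This check catches a carry out of the lower half between the two reads, which would otherwise produce a torn 64-bit value. The portable sketch below models that loop. The reader callbacks and the simulated time base (sim_tb, sim_upper, sim_lower) are hypothetical stand-ins for mftbu and mftb, used only for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the mftbu and mftb instructions. */
typedef uint32_t (*tb_half_reader)(void);

/* Models the 32-bit ABI sequence for __mftb in Table 6-175. */
static uint64_t model_mftb_32bit_abi(tb_half_reader read_upper,
                                     tb_half_reader read_lower)
{
    uint32_t hi, lo, tmp;
    do {
        hi = read_upper();   /* mftbu d_hi */
        lo = read_lower();   /* mftb  d_lo */
        tmp = read_upper();  /* mftbu tmp  */
    } while (hi != tmp);     /* cmp d_hi, tmp; bne retry */
    return ((uint64_t)hi << 32) | lo;
}

/* Simulated time base for illustration only: each read advances it by
   0x40, and it starts just below a carry into the upper word. */
static uint64_t sim_tb = UINT64_C(0x00000001FFFFFFC0);

static uint32_t sim_upper(void)
{
    uint32_t v = (uint32_t)(sim_tb >> 32);
    sim_tb += 0x40;
    return v;
}

static uint32_t sim_lower(void)
{
    uint32_t v = (uint32_t)sim_tb;
    sim_tb += 0x40;
    return v;
}
```

With this simulation, the first attempt reads an upper word of 1 and, after the carry, a lower word of 0. Those two reads would give the torn value 0x100000000. Because the second upper read returns 2, the loop retries and returns 0x2000000C0.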

#### \_\_mtfsb0: Reset Bit of FPSCR

(void) \_\_mtfsb0(bt)

Bit bt of the Floating-Point Status and Control Register (FPSCR) is set to 0. This intrinsic will not be reordered by the compiler. It also acts as a barrier to floating-point operations.

Table 6-176: Reset Bit of FPSCR

| Return/Argument Types        | Assembly Mapping |
|------------------------------|------------------|
| bt                           |                  |
| 5-bit unsigned int (literal) | mtfsb0 bt        |

#### \_\_mtfsb1: Set Bit of FPSCR

(void) \_\_mtfsb1(bt)

Bit *bt* of the Floating-Point Status and Control Register (FPSCR) is set to 1. This intrinsic will not be reordered by the compiler. It also acts as a barrier to floating-point operations.

Table 6-177: Set Bit of FPSCR

| Return/Argument Types        | Assembly Mapping |
|------------------------------|------------------|
| bt                           |                  |
| 5-bit unsigned int (literal) | mtfsb1 bt        |



#### \_\_mtfsf: Set Fields in FPSCR

```
(void) __mtfsf(flm, b)
```

The fields of the Floating-Point Status and Control Register (FPSCR) that are selected by the 8-bit mask flm are set to the corresponding fields of b. This intrinsic will not be reordered by the compiler. It also acts as a barrier to floating-point operations.

Table 6-178: Set Fields in FPSCR

| Return/Argument Types        |        | Assembly Mapping |
|------------------------------|--------|------------------|
| flm                          | b      |                  |
| 8-bit unsigned int (literal) | double | mtfsf flm, b     |
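Each bit of flm selects one 4-bit FPSCR field. The most significant bit of flm selects field 0, and the least significant bit selects field 7. The portable sketch below expands flm into a 32-bit mask and applies it to a 32-bit FPSCR image. The names mtfsf_mask and model_mtfsf are hypothetical. This is a simplified model: it ignores special cases such as the FEX and VX bits in field 0, which the hardware computes rather than copies from b.

```c
#include <stdint.h>

/* Hypothetical helper: expands the 8-bit field mask flm into a mask over a
   32-bit FPSCR image. The most significant bit of flm selects field 0 (the
   four most significant FPSCR bits); the least significant selects field 7. */
static uint32_t mtfsf_mask(unsigned int flm)
{
    uint32_t mask = 0;
    for (int field = 0; field < 8; field++) {
        if ((flm >> (7 - field)) & 1u)
            mask |= UINT32_C(0xF) << (28 - 4 * field);
    }
    return mask;
}

/* Simplified model of the update: selected fields are copied from b_low (the
   low-order word of the double b) and the others are kept. */
static uint32_t model_mtfsf(uint32_t fpscr, unsigned int flm, uint32_t b_low)
{
    uint32_t mask = mtfsf_mask(flm);
    return (fpscr & ~mask) | (b_low & mask);
}
```

For example, flm = 0x01 touches only field 7, the least significant four bits, which contain the rounding-mode control.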

#### \_\_mtfsfi: Set Field of FPSCR

```
(void) __mtfsfi(bf, u)
```

Field bf of the FPSCR is set to the argument u. This intrinsic will not be reordered by the compiler. It also acts as a barrier to floating-point operations.

Table 6-179: Set Field of FPSCR

| Return/Argument Types        |                              | Assembly Mapping |
|------------------------------|------------------------------|------------------|
| bf                           | u                            |                  |
| 3-bit unsigned int (literal) | 4-bit unsigned int (literal) | mtfsfi bf, u     |
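The effect of \_\_mtfsfi on a 32-bit FPSCR image can be sketched portably as a masked field replacement. The function model_mtfsfi is hypothetical. Like any simple model, it does not reproduce bits that the hardware computes, such as FEX and VX.

```c
#include <stdint.h>

/* Hypothetical model of __mtfsfi(bf, u) on a 32-bit FPSCR image: field bf
   (0 = most significant four bits, 7 = least significant) is replaced by u. */
static uint32_t model_mtfsfi(uint32_t fpscr, unsigned int bf, unsigned int u)
{
    unsigned int shift = 28 - 4 * (bf & 0x7u);
    uint32_t mask = UINT32_C(0xF) << shift;
    return (fpscr & ~mask) | ((uint32_t)(u & 0xFu) << shift);
}
```

For example, model_mtfsfi(fpscr, 7, 1) writes 0b0001 into field 7. The two low-order bits of that field are the rounding-mode control RN, and the value 1 selects round toward zero.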

#### \_\_mtspr: Move to Special Purpose Register

```
(void) __mtspr(spr, value)
```

The special purpose register specified by spr is set to the argument value. This intrinsic will not be reordered by the compiler.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-180: Move to Special Purpose Register

| Return/Argument Types         |                    | Assembly Mapping |
|-------------------------------|--------------------|------------------|
| spr                           | value              |                  |
| 10-bit unsigned int (literal) | unsigned long long | mtspr spr, value |

# \_\_mulhd: Multiply Doubleword, High Part

```
d = \underline{\quad} mulhd(a, b)
```

The high part of the signed product of the doubleword arguments a and b is returned in d.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-181: Multiply Doubleword, High Part

| d         | a         | b         | Assembly Mapping |
|-----------|-----------|-----------|------------------|
| long long | long long | long long | mulhd d, a, b    |



### \_\_mulhdu: Multiply Double Unsigned Word, High Part

```
d = __mulhdu(a, b)
```

The high part of the unsigned product of the doubleword arguments a and b is returned in d.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-182: Multiply Double Unsigned Word, High Part

| d                  | a                  | b                  | Assembly Mapping |
|--------------------|--------------------|--------------------|------------------|
| unsigned long long | unsigned long long | unsigned long long | mulhdu d, a, b   |
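Where a portable reference is useful, for example to check results on a host machine, the high part of a 64-bit product can be computed from four 32-bit by 32-bit partial products, and the signed form can be derived from the unsigned one. The functions below are hypothetical models written for illustration, not the intrinsics themselves.

```c
#include <stdint.h>

/* Hypothetical model of __mulhdu: high 64 bits of the 128-bit unsigned
 * product, assembled from four 32x32->64-bit partial products. */
static uint64_t mulhdu_model(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t hi_hi = a_hi * b_hi;

    /* Middle column: three values below 2^32 each, so no overflow. */
    uint64_t mid = (lo_lo >> 32) + (uint32_t)lo_hi + (uint32_t)hi_lo;
    return hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + (mid >> 32);
}

/* Hypothetical model of __mulhd: the signed high part equals the unsigned
 * high part, minus b if a is negative and minus a if b is negative,
 * with all arithmetic modulo 2^64. */
static int64_t mulhd_model(int64_t a, int64_t b)
{
    uint64_t hi = mulhdu_model((uint64_t)a, (uint64_t)b);
    if (a < 0)
        hi -= (uint64_t)b;
    if (b < 0)
        hi -= (uint64_t)a;
    /* Out-of-range unsigned-to-signed conversion is implementation-defined;
     * GCC and Clang wrap modulo 2^64. */
    return (int64_t)hi;
}
```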

### \_\_mulhw: Multiply Word, High Part

```
d = __mulhw(a, b)
```

The high part of the signed product of the word arguments a and b is returned in d.

Table 6-183: Multiply Word, High Part

| d   | a   | b   | Assembly Mapping |
|-----|-----|-----|------------------|
| int | int | int | mulhw d, a, b    |

### \_\_mulhwu: Multiply Unsigned Word, High Part

```
d = __mulhwu(a, b)
```

The high part of the unsigned product of the word arguments a and b is returned in d.

Table 6-184: Multiply Unsigned Word, High Part

| d            | a            | b            | Assembly Mapping |
|--------------|--------------|--------------|------------------|
| unsigned int | unsigned int | unsigned int | mulhwu d, a, b   |
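For the word forms, a portable model only needs a 64-bit intermediate: form the full product and keep its upper 32 bits. The sketch below is illustrative; `mulhw_model` and `mulhwu_model` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical model of __mulhw: upper 32 bits of the signed product. */
static int32_t mulhw_model(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;
    /* Right-shifting a negative value is implementation-defined in C;
     * GCC and Clang perform an arithmetic (sign-propagating) shift. */
    return (int32_t)(product >> 32);
}

/* Hypothetical model of __mulhwu: upper 32 bits of the unsigned product. */
static uint32_t mulhwu_model(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * b;
    return (uint32_t)(product >> 32);
}
```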

### \_\_nop: No Operation

```
(void) __nop()
```

The preferred nop instruction is generated. This intrinsic will not be reordered by the compiler.

Table 6-185: No Operation

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | nop              |

### \_\_protected\_stream\_count: Set the Number of Blocks to Stream

```
(void) __protected_stream_count(COUNT, ID)
```

Set the number of units in the data stream corresponding to stream ID. This intrinsic is an alias for \_\_dcbt\_TH1010(0, 0, COUNT, 0, 0, ID). The compiler will not reorder this intrinsic.



### \_\_protected\_stream\_go: Start All Streams

```
(void) __protected_stream_go()
```

Start all of the completely described streams. This intrinsic is an alias for \_\_dcbt\_TH1010(1, 0, 0, 0, 0, 0). The compiler will not reorder this intrinsic.

### \_\_protected\_stream\_set: Set Up a Stream

```
(void) __protected_stream_set(D, ADDR, ID)
```

Set up the ID stream to start at ADDR and run in the direction of D. When D is 1, the direction is backwards (decrementing), and when D is 3, the direction is forwards (incrementing). The stream is started by setting the count and then calling \_\_protected\_stream\_go. This intrinsic is an alias for \_\_dcbt\_TH1000(ADDR, (D>>1), 0, ID). The compiler will not reorder this intrinsic.

### \_\_protected\_stream\_stop: Stop a Stream

```
(void) __protected_stream_stop(ID)
```

Stop the ID stream. This intrinsic is an alias for \_\_dcbt\_TH1010(0, 2, 0, 0, 0, ID). The compiler will not reorder this intrinsic.

### \_\_protected\_stream\_stop\_all: Stop All Streams

```
(void) __protected_stream_stop_all()
```

Stop all data streams. This intrinsic is an alias for \_\_dcbt\_TH1010(0, 3, 0, 0, 0, 0). The compiler will not reorder this intrinsic.

### \_\_protected\_unlimited\_stream\_set: Set Up an Unlimited Stream

```
(void) __protected_unlimited_stream_set(D, ADDR, ID)
```

Set up the ID stream to start at ADDR and run for an unlimited count in the direction of D. When D is 1, the direction is backwards (decrementing), and when D is 3, the direction is forwards (incrementing). The stream is started by calling \_\_protected\_stream\_go. This intrinsic is an alias for \_\_dcbt\_TH1000(ADDR, (D>>1), 1, ID). The compiler will not reorder this intrinsic.

### \_\_rldcl: Rotate Left Doubleword Then Clear Left

```
d = __rldcl(a, b, mb)
```

The value in the argument a is rotated left by the number of bits specified by the low-order 6 bits of the argument b. A mask is generated having 1-bits from bit mb through bit 63, and 0-bits elsewhere. The rotated data ANDed with the generated mask is returned in d.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-186: Rotate Left Doubleword Then Clear Left

| d                  | a                  | b                  | mb                           | Assembly Mapping  |
|--------------------|--------------------|--------------------|------------------------------|-------------------|
| unsigned long long | unsigned long long | unsigned long long | 6-bit unsigned int (literal) | rldcl d, a, b, mb |



### \_\_rldcr: Rotate Left Doubleword Then Clear Right

d = \_\_rldcr(a, b, me)

The value in the argument a is rotated left by the number of bits specified by the low-order 6 bits of the argument b. A mask is generated having 1-bits from bit 0 through bit me and 0-bits elsewhere. The rotated data ANDed with the generated mask is returned in d.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-187: Rotate Left Doubleword Then Clear Right

| d                  | a                  | b                  | me                           | Assembly Mapping  |
|--------------------|--------------------|--------------------|------------------------------|-------------------|
| unsigned long long | unsigned long long | unsigned long long | 6-bit unsigned int (literal) | rldcr d, a, b, me |
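The two variable-rotate forms can be modelled in portable C as a rotate followed by a mask. In the sketch below, bits are numbered as in the PowerPC architecture (bit 0 is the most significant bit), so "bits mb through 63" is all-ones shifted right by mb and "bits 0 through me" is all-ones shifted left by 63 - me. `rotl64`, `rldcl_model`, and `rldcr_model` are hypothetical helpers, not part of the intrinsics API.

```c
#include <stdint.h>

/* Rotate left; only the low-order 6 bits of n are used. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Hypothetical model of __rldcl(a, b, mb): keep bits mb..63. */
static uint64_t rldcl_model(uint64_t a, uint64_t b, unsigned mb)
{
    return rotl64(a, (unsigned)(b & 63)) & (~0ULL >> (mb & 63));
}

/* Hypothetical model of __rldcr(a, b, me): keep bits 0..me. */
static uint64_t rldcr_model(uint64_t a, uint64_t b, unsigned me)
{
    return rotl64(a, (unsigned)(b & 63)) & (~0ULL << (63 - (me & 63)));
}
```

For example, `rldcl_model(x, n, 0)` is a plain variable rotate, and `rldcl_model(x, 0, 32)` clears the upper 32 bits of x.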

### \_\_rldic: Rotate Left Doubleword Immediate Then Clear

d = \_\_rldic(a, sh, mb)

The value in the argument a is rotated leftwards by the number of bits specified by the argument sh. A mask is generated having 1-bits from bit mb through bit 63-sh and 0-bits elsewhere. The rotated data ANDed with the generated mask is returned in d.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-188: Rotate Left Doubleword Immediate Then Clear

| d                  | a                  | sh                           | mb                           | Assembly Mapping   |
|--------------------|--------------------|------------------------------|------------------------------|--------------------|
| unsigned long long | unsigned long long | 6-bit unsigned int (literal) | 6-bit unsigned int (literal) | rldic d, a, sh, mb |
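A portable sketch of \_\_rldic follows. It needs a general mask helper because, when mb is greater than 63 - sh, the mask wraps around: it then consists of 1-bits from bit mb through bit 63 and from bit 0 through bit 63 - sh. Bits use PowerPC numbering (bit 0 is the most significant), and `rotl64`, `mask64`, and `rldic_model` are hypothetical helpers.

```c
#include <stdint.h>

/* Rotate left; only the low-order 6 bits of n are used. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Mask with 1-bits from bit mb through bit me (bit 0 = MSB), wrapping
 * around when mb > me. Both arguments must be in 0..63. */
static uint64_t mask64(unsigned mb, unsigned me)
{
    uint64_t from_mb = ~0ULL >> mb;         /* 1s from bit mb to bit 63 */
    uint64_t to_me = ~0ULL << (63 - me);    /* 1s from bit 0 to bit me  */
    return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

/* Hypothetical model of __rldic(a, sh, mb). */
static uint64_t rldic_model(uint64_t a, unsigned sh, unsigned mb)
{
    sh &= 63;
    mb &= 63;
    return rotl64(a, sh) & mask64(mb, 63 - sh);
}
```

With mb equal to 0, the mask keeps bits 0 through 63 - sh, so `rldic_model(a, sh, 0)` equals `a << sh`.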

### \_\_rldicl: Rotate Left Doubleword Immediate Then Clear Left

d = \_\_rldicl(a, sh, mb)

The value in the argument a is rotated leftwards by the number of bits specified by the argument sh. A mask is generated having 1-bits from bit mb through bit 63 and 0-bits elsewhere. The rotated data ANDed with the generated mask is returned in d.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-189: Rotate Left Doubleword Immediate Then Clear Left

| d                  | a                  | sh                           | mb                           | Assembly Mapping    |
|--------------------|--------------------|------------------------------|------------------------------|---------------------|
| unsigned long long | unsigned long long | 6-bit unsigned int (literal) | 6-bit unsigned int (literal) | rldicl d, a, sh, mb |



### \_\_rldicr: Rotate Left Doubleword Immediate Then Clear Right

d = \_\_rldicr(a, sh, me)

The value in the argument a is rotated leftwards by the number of bits specified by the argument sh. A mask is generated having 1-bits from bit 0 through bit me and 0-bits elsewhere. The rotated data ANDed with the generated mask is returned in d.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-190: Rotate Left Doubleword Immediate Then Clear Right

| d                  | a                  | sh                           | me                           | Assembly Mapping    |
|--------------------|--------------------|------------------------------|------------------------------|---------------------|
| unsigned long long | unsigned long long | 6-bit unsigned int (literal) | 6-bit unsigned int (literal) | rldicr d, a, sh, me |
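The immediate clear-left and clear-right forms are the usual way to express 64-bit shifts and field extraction: for 0 < n < 64, rotating left by 64 - n and clearing bits 0 through n - 1 is a logical right shift by n, and rotating left by n and keeping bits 0 through 63 - n is a left shift by n. The sketch below models both in portable C with PowerPC bit numbering; `rotl64`, `rldicl_model`, and `rldicr_model` are hypothetical helpers.

```c
#include <stdint.h>

/* Rotate left; only the low-order 6 bits of n are used. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Hypothetical model of __rldicl(a, sh, mb): keep bits mb..63. */
static uint64_t rldicl_model(uint64_t a, unsigned sh, unsigned mb)
{
    return rotl64(a, sh) & (~0ULL >> (mb & 63));
}

/* Hypothetical model of __rldicr(a, sh, me): keep bits 0..me. */
static uint64_t rldicr_model(uint64_t a, unsigned sh, unsigned me)
{
    return rotl64(a, sh) & (~0ULL << (63 - (me & 63)));
}
```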

### \_\_rldimi: Rotate Left Doubleword Immediate Then Mask Insert

```
d = __rldimi(a, b, sh, mb)
```

A mask is generated with 1-bits from bit *mb* through bit 63-*sh*, and 0-bits elsewhere. The value in *a* is ANDed with the complement of this mask, zeroing out just the bits inside the range *mb* through 63-*sh*. The argument *b* is rotated left by *sh* bits and the result is ANDed with the mask, zeroing out all bits outside the range *mb* through 63-*sh*. The two masked values are combined with inclusive OR and returned in *d*.

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-191: Rotate Left Doubleword Immediate Then Mask Insert

| d                  | a                  | b                  | sh                           | mb                           | Assembly Mapping               |
|--------------------|--------------------|--------------------|------------------------------|------------------------------|--------------------------------|
| unsigned long long | unsigned long long | unsigned long long | 6-bit unsigned int (literal) | 6-bit unsigned int (literal) | mr d, a<br>rldimi d, b, sh, mb |
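The insert operation can be checked against a portable model: build the mask, take the rotated bits of b inside it and the bits of a outside it. The sketch below uses PowerPC bit numbering and lets the mask wrap when mb is greater than 63 - sh; `rotl64`, `mask64`, and `rldimi_model` are hypothetical helpers, not the intrinsic.

```c
#include <stdint.h>

/* Rotate left; only the low-order 6 bits of n are used. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* 1-bits from bit mb through bit me (bit 0 = MSB), wrapping if mb > me. */
static uint64_t mask64(unsigned mb, unsigned me)
{
    uint64_t from_mb = ~0ULL >> mb;
    uint64_t to_me = ~0ULL << (63 - me);
    return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

/* Hypothetical model of __rldimi(a, b, sh, mb). */
static uint64_t rldimi_model(uint64_t a, uint64_t b, unsigned sh, unsigned mb)
{
    sh &= 63;
    mb &= 63;
    uint64_t mask = mask64(mb, 63 - sh);
    return (rotl64(b, sh) & mask) | (a & ~mask);
}
```

For example, `rldimi_model(a, byte, 8, 48)` replaces bits 48 through 55 of a (the second-lowest byte) with the low byte of `byte`.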

### \_\_rlwimi: Rotate Left Word Immediate Then Mask Insert

d = \_\_rlwimi(a, b, sh, mb, me)

A mask is generated with 1-bits from bit mb through bit me, and 0-bits elsewhere. The value in a is ANDed with the complement of this mask, zeroing out just the bits inside the range mb through me. The argument b is rotated left by sh bits and the result is ANDed with the mask, zeroing out all bits outside the range mb through me. The two masked values are combined with inclusive OR and returned in d.

Table 6-192: Rotate Left Word Immediate Then Mask Insert

| d            | a            | b            | sh                           | mb                           | me                           | Assembly Mapping                   |
|--------------|--------------|--------------|------------------------------|------------------------------|------------------------------|------------------------------------|
| unsigned int | unsigned int | unsigned int | 5-bit unsigned int (literal) | 5-bit unsigned int (literal) | 5-bit unsigned int (literal) | mr d, a<br>rlwimi d, b, sh, mb, me |
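A portable model of the word insert is sketched below, with bits numbered 0 (most significant) through 31 within the word. When mb is greater than me the mask wraps around, covering bits mb through 31 and bits 0 through me. `rotl32`, `mask32`, and `rlwimi_model` are hypothetical helpers written for illustration.

```c
#include <stdint.h>

/* Rotate a 32-bit value left; only the low-order 5 bits of n are used. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* 1-bits from bit mb through bit me (bit 0 = MSB), wrapping if mb > me. */
static uint32_t mask32(unsigned mb, unsigned me)
{
    uint32_t from_mb = 0xFFFFFFFFu >> mb;
    uint32_t to_me = 0xFFFFFFFFu << (31 - me);
    return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

/* Hypothetical model of __rlwimi(a, b, sh, mb, me). */
static uint32_t rlwimi_model(uint32_t a, uint32_t b, unsigned sh,
                             unsigned mb, unsigned me)
{
    uint32_t mask = mask32(mb & 31, me & 31);
    return (rotl32(b, sh) & mask) | (a & ~mask);
}
```

For example, `rlwimi_model(a, byte, 8, 16, 23)` replaces bits 16 through 23 of a with the low byte of `byte`.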



### \_\_rlwinm: Rotate Left Word Immediate Then AND with Mask

d = \_\_rlwinm(a, sh, mb, me)

A mask is generated with 1-bits from bit *mb* through bit *me*, and 0-bits elsewhere. The value in *a* is rotated left by *sh* bits, then ANDed with this mask, and returned in *d*.

Table 6-193: Rotate Left Word Immediate Then AND with Mask

| d            | a            | sh                           | mb                           | me                           | Assembly Mapping        |
|--------------|--------------|------------------------------|------------------------------|------------------------------|-------------------------|
| unsigned int | unsigned int | 5-bit unsigned int (literal) | 5-bit unsigned int (literal) | 5-bit unsigned int (literal) | rlwinm d, a, sh, mb, me |

### \_\_rlwnm: Rotate Left Word Then AND with Mask

d = \_\_rlwnm(a, b, mb, me)

The argument a is rotated left by the number of bits specified by the low-order 5 bits of the argument b. A mask is generated having 1-bits from bit mb through bit me, and 0-bits elsewhere. The rotated data ANDed with the generated mask is returned in d.

Table 6-194: Rotate Left Word Then AND with Mask

| d            | a            | b            | mb                           | me                           | Assembly Mapping      |
|--------------|--------------|--------------|------------------------------|------------------------------|-----------------------|
| unsigned int | unsigned int | unsigned int | 5-bit unsigned int (literal) | 5-bit unsigned int (literal) | rlwnm d, a, b, mb, me |
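The sketch below models \_\_rlwinm and \_\_rlwnm on 32-bit values in portable C. Bits are numbered PowerPC style (bit 0 is the most significant), and when mb is greater than me the mask wraps around, consisting of 1-bits from bit mb through bit 31 and from bit 0 through bit me. `rotl32`, `mask32`, `rlwinm_model`, and `rlwnm_model` are hypothetical helpers, not part of the intrinsics API.

```c
#include <stdint.h>

/* Rotate a 32-bit value left; only the low-order 5 bits of n are used. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* 1-bits from bit mb through bit me (bit 0 = MSB), wrapping if mb > me. */
static uint32_t mask32(unsigned mb, unsigned me)
{
    uint32_t from_mb = 0xFFFFFFFFu >> mb;
    uint32_t to_me = 0xFFFFFFFFu << (31 - me);
    return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

/* Hypothetical model of __rlwinm(a, sh, mb, me): immediate rotate count. */
static uint32_t rlwinm_model(uint32_t a, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(a, sh) & mask32(mb & 31, me & 31);
}

/* Hypothetical model of __rlwnm(a, b, mb, me): rotate count taken from b. */
static uint32_t rlwnm_model(uint32_t a, uint32_t b, unsigned mb, unsigned me)
{
    return rotl32(a, (unsigned)(b & 31)) & mask32(mb & 31, me & 31);
}
```

For example, `rlwinm_model(x, 8, 24, 31)` extracts the most significant byte of x, and `rlwinm_model(x, 4, 0, 27)` equals `x << 4`.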

### \_\_setflm: Save and Set the FPSCR

d = \_\_setflm(a)

The Floating-Point Status and Control Register is set to *a*, and the previous contents of that register are returned in *d*. This intrinsic will not be reordered by the compiler. It also acts as a barrier for floating-point operations.

Table 6-195: Save and Set the FPSCR

| d      | a      | Assembly Mapping        |
|--------|--------|-------------------------|
| double | double | mffs d<br>mtfsf 0xFF, a |

### \_\_stdbrx: Store Reversed Doubleword

(void) \_\_stdbrx(pointer, b)

The argument b is stored in reversed endian order into the doubleword located at the argument pointer.

The *base* and *index* arguments for the assembly mapping are calculated from *pointer*.

Table 6-196: Store Reversed Doubleword

| pointer | b                  | Assembly Mapping (64-bit ABI) | Assembly Mapping (32-bit ABI)                           |
|---------|--------------------|-------------------------------|---------------------------------------------------------|
| void*   | unsigned long long | stdbrx b, base, index         | stwbrx b_lo, base, index<br>stwbrx b_hi, base, index+4 |

### \_\_stdcx: Store Doubleword Conditional

d = \_\_stdcx(pointer, b)

If the reserved address of the processor is the value in the argument pointer, b is stored into the doubleword at the argument pointer, and the value of 1 is returned in d. Otherwise, the store is not performed, and the value of 0 is returned in d.

The base and index arguments for the assembly mapping are calculated from pointer.

The stdcx. instruction reports its result in cr0.eq, the equal bit of condition register field 0 (CR0).

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in two separate registers.

Table 6-197: Store Doubleword Conditional

| d    | pointer | b                  | Assembly Mapping                  |
|------|---------|--------------------|-----------------------------------|
| bool | void*   | unsigned long long | stdcx. b, base, index; d = cr0.eq |

### \_\_sthbrx: Store Reversed Halfword

(void) \_\_sthbrx(pointer, b)

The argument *b* is stored in reversed endian order into the halfword located at the argument *pointer*.

The base and index arguments for the assembly mapping are calculated from pointer.

Table 6-198: Store Reversed Halfword

| pointer | b              | Assembly Mapping      |
|---------|----------------|-----------------------|
| void*   | unsigned short | sthbrx b, base, index |

### \_\_stwbrx: Store Reversed Word

(void) \_\_stwbrx(pointer, b)

The argument b is stored in reversed endian order into the word located at the argument pointer.

The *base* and *index* arguments for the assembly mapping are calculated from *pointer*.

Table 6-199: Store Reversed Word

| pointer | b        | Assembly Mapping      |
|---------|----------|-----------------------|
| void*   | unsigned | stwbrx b, base, index |
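Because the PPU is big-endian, a byte-reversed store writes the least significant byte of the value at the lowest address, that is, in little-endian order. The sketch below models \_\_sthbrx, \_\_stwbrx, and \_\_stdbrx portably by writing the bytes one at a time, so the result does not depend on the host's own byte order. `store_reversed` is a hypothetical helper, not one of the intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the byte-reversed stores: write the low nbytes
 * (2, 4, or 8) of value with the least significant byte at pointer[0]. */
static void store_reversed(void *pointer, uint64_t value, size_t nbytes)
{
    unsigned char *p = pointer;
    for (size_t i = 0; i < nbytes; i++)
        p[i] = (unsigned char)(value >> (8 * i));
}
```

A typical use is writing little-endian file or device formats from PPU code; with nbytes equal to 2, 4, or 8 the memory image matches what \_\_sthbrx, \_\_stwbrx, or \_\_stdbrx would produce.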



### \_\_stwcx: Store Word Conditional

d = \_\_stwcx(pointer, b)

If the reserved address of the processor is the value in the argument pointer, b is stored into the word at the argument pointer, and the value of 1 is returned in d. Otherwise, the store is not performed, and the value of 0 is returned in d.

The base and index arguments for the assembly mapping are calculated from pointer.

The stwcx. instruction reports its result in cr0.eq, the equal bit of condition register field 0 (CR0).

Table 6-200: Store Word Conditional

| d    | pointer | b        | Assembly Mapping                  |
|------|---------|----------|-----------------------------------|
| bool | void*   | unsigned | stwcx. b, base, index; d = cr0.eq |
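On the PPU, a conditional store is normally paired with a load-and-reserve (the lwarx or ldarx instruction) in a retry loop: if the store returns 0, the reservation was lost and the update is recomputed from a fresh load. The sketch below shows the same retry structure using standard C11 atomics as a portable stand-in. It is a different mechanism from the PPU intrinsics, and `fetch_and_add` is a hypothetical name.

```c
#include <stdatomic.h>

/* Retry-loop pattern analogous to load-reserve / store-conditional.
 * A failed compare-exchange plays the role of a conditional store that
 * returns 0; like a lost reservation, the weak form may also fail
 * spuriously, so the loop simply tries again. */
static unsigned fetch_and_add(_Atomic unsigned *counter, unsigned increment)
{
    unsigned old = atomic_load(counter);
    /* On failure, the current value is written back into old. */
    while (!atomic_compare_exchange_weak(counter, &old, old + increment))
        ;
    return old;
}
```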

### \_\_sync: Sync

(void) \_\_sync()

A memory barrier is created, providing an ordering function for all instructions executing on the same processor. The memory barrier and ordering function are described in section 1.7.1 of PowerPC Architecture Book, Book II: PowerPC Virtual Environment Architecture, version 2.02.

Table 6-201: Sync

| Return/Argument Types | Assembly Mapping |
|-----------------------|------------------|
| none                  | sync             |





# 7. PPU Vector Multimedia Extension Intrinsics

This chapter describes intrinsics that make the underlying PPU Vector Multimedia Extension (VMX) instruction set accessible from the C and C++ programming languages. Section 4.4 of the *AltiVec™ Technology Programming Interface Manual* defines most of the generic intrinsics for the PPU VMX instruction set; the exceptions are a few new intrinsics, which are specified in this chapter. The new intrinsics are in two different categories: intrinsics for extracting vector elements and intrinsics for inserting vector elements.

The PPU VMX intrinsics will be declared in the system header file altivec.h. These intrinsics may be either defined as macros within this header or implemented internally within the compiler.

For data prefetches, the \_\_dcbt, \_\_dcbtst, \_\_dcbt\_TH1000, and \_\_dcbt\_TH1010 intrinsics should be used. The related stream control operations defined in the *AltiVec™ Technology Programming Interface Manual*, which are listed below, have been deprecated on the PPU and will execute as a NOP.

Table 7-202: Stream Control Operators That Have Been Deprecated on the PPU

| Stream Control Operator | Description                                  |
|-------------------------|----------------------------------------------|
| vec_dss(a)              | Vector Data Stream Stop                      |
| vec_dssall()            | Vector Stream Stop All                       |
| vec_dst(a,b,c)          | Vector Stream Touch                          |
| vec_dstst(a,b,c)        | Vector Data Stream Touch for Store Transient |



### vec\_extract: Extract Vector Element from Vector

d = vec\_extract(a, element)

The element that is specified by <code>element</code> is extracted from vector <code>a</code> and returned in scalar <code>d</code>. Depending on the size of the element, only a limited number of the least significant bits of the <code>element</code> index are used. Specifically for 1-, 2-, and 4-byte elements, only four, three, and two of the least significant bits are used, respectively.

Table 7-203: Extract Vector Element from Vector

| d              | a                     | element | Assembly Mapping<sup>1</sup>                                                                   |
|----------------|-----------------------|---------|------------------------------------------------------------------------------------------------|
| unsigned char  | vector unsigned char  | int     | EA=memaddr + (element&0xF)<br>stvebx a, 0, EA<br>lbzx d, 0, EA                                 |
| signed char    | vector signed char    | int     | EA=memaddr + (element&0xF)<br>stvebx a, 0, EA<br>lbzx d, 0, EA<br>extsb d, d                   |
| unsigned short | vector unsigned short | int     | EA=memaddr + ((element&0x7)<<1)<br>stvehx a, 0, EA<br>lhzx d, 0, EA                            |
| signed short   | vector signed short   | int     | EA=memaddr + ((element&0x7)<<1)<br>stvehx a, 0, EA<br>lhzx d, 0, EA<br>extsh d, d              |
| unsigned int   | vector unsigned int   | int     | EA=memaddr + ((element&0x3)<<2)<br>stvewx a, 0, EA<br>lwzx d, 0, EA                            |
| signed int     | vector signed int     | int     | EA=memaddr + ((element&0x3)<<2)<br>stvewx a, 0, EA<br>lwzx d, 0, EA<br>extsw d, d<sup>2</sup> |
| float          | vector float          | int     | EA=memaddr + ((element&0x3)<<2)<br>stvewx a, 0, EA<br>lfsx d, 0, EA                            |

<sup>1</sup> memaddr is the address of a temporary memory location which is 16-byte aligned.

<sup>2</sup> The sign extension from word to doubleword can be omitted if the processor is running in 32-bit mode.
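The following sketch mirrors the 4-byte-element mapping above in plain C, without altivec.h: the vector is copied to a 16-byte-aligned temporary that stands in for memaddr, and the scalar is read back from offset ((element & 0x3) << 2), so only the two least significant bits of element matter. The vector is represented as an array of four elements in memory order, and `extract_u32` is a hypothetical helper that models the semantics only.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of vec_extract for four 32-bit elements. */
static uint32_t extract_u32(const uint32_t v[4], int element)
{
    _Alignas(16) unsigned char tmp[16];   /* stands in for memaddr */
    uint32_t d;

    memcpy(tmp, v, sizeof tmp);                          /* copy the vector  */
    memcpy(&d, tmp + ((element & 0x3) << 2), sizeof d);  /* read one element */
    return d;
}
```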



### vec\_insert: Insert Scalar into Specified Vector Element

d = vec\_insert(a, b, element)

Scalar a is inserted into the element of vector b that is specified by the element parameter, and the modified vector is returned in d. All other elements of b are unmodified. Depending on the size of the element, only a limited number of the least significant bits of the element index are used. Specifically for 1-, 2-, and 4-byte elements, only four, three, and two of the least significant bits are used, respectively.

Table 7-204: Insert Scalar into Specified Vector Element

| d                     | a              | b                     | element | Assembly Mapping<sup>1</sup>                                                                   |
|-----------------------|----------------|-----------------------|---------|------------------------------------------------------------------------------------------------|
| vector unsigned char  | unsigned char  | vector unsigned char  | int     | EA=memaddr + (element&0xF)<br>stbx a, 0, EA<br>lvebx d, 0, EA<br>vperm d, d, a, pattern        |
| vector signed char    | signed char    | vector signed char    | int     | EA=memaddr + (element&0xF)<br>stbx a, 0, EA<br>lvebx d, 0, EA<br>vperm d, d, a, pattern        |
| vector unsigned short | unsigned short | vector unsigned short | int     | EA=memaddr + ((element&0x7)<<1)<br>sthx a, 0, EA<br>lvehx d, 0, EA<br>vperm d, d, a, pattern   |
| vector signed short   | signed short   | vector signed short   | int     | EA=memaddr + ((element&0x7)<<1)<br>sthx a, 0, EA<br>lvehx d, 0, EA<br>vperm d, d, a, pattern   |
| vector unsigned int   | unsigned int   | vector unsigned int   | int     | EA=memaddr + ((element&0x3)<<2)<br>stwx a, 0, EA<br>lvewx d, 0, EA<br>vperm d, d, a, pattern   |
| vector signed int     | signed int     | vector signed int     | int     | EA=memaddr + ((element&0x3)<<2)<br>stwx a, 0, EA<br>lvewx d, 0, EA<br>vperm d, d, a, pattern   |
| vector float          | float          | vector float          | int     | EA=memaddr + ((element&0x3)<<2)<br>stfsx a, 0, EA<br>lvewx d, 0, EA<br>vperm d, d, a, pattern  |

<sup>1</sup> memaddr is the address of a temporary memory location which is 16-byte aligned.
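A scalar model of the semantics, for 2-byte elements, is sketched below: copy b to d, then overwrite the selected element with a, using only the three least significant bits of element. The vector is represented as an array of eight elements in memory order, and `insert_u16` is a hypothetical helper, not part of altivec.h.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of vec_insert for eight 16-bit elements. */
static void insert_u16(uint16_t d[8], uint16_t a, const uint16_t b[8], int element)
{
    memcpy(d, b, 8 * sizeof b[0]);   /* all other elements come from b */
    d[element & 0x7] = a;            /* index wraps modulo 8           */
}
```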



### vec\_lvlx: Load Vector Left Indexed

d = vec\_lvlx(a, b)

Let EA be the effective address formed from the sum of the contents of a and the contents of b, and let eb be the value of the four least significant bits of EA. The (16 - eb) bytes addressed by EA are loaded into the leftmost (16 - eb) byte elements of d, and the rightmost eb bytes of d are set to zero.

Table 7-205: Load Vector Left Indexed

| d                     | a                 | b                       | Assembly Mapping |
|-----------------------|-------------------|-------------------------|------------------|
| vector unsigned char  | any integral type | unsigned char *         | lvlx d, a, b     |
| vector unsigned char  | any integral type | vector unsigned char *  | lvlx d, a, b     |
| vector signed char    | any integral type | signed char *           | lvlx d, a, b     |
| vector signed char    | any integral type | vector signed char *    | lvlx d, a, b     |
| vector bool char      | any integral type | vector bool char *      | lvlx d, a, b     |
| vector unsigned short | any integral type | unsigned short *        | lvlx d, a, b     |
| vector unsigned short | any integral type | vector unsigned short * | lvlx d, a, b     |
| vector signed short   | any integral type | signed short *          | lvlx d, a, b     |
| vector signed short   | any integral type | vector signed short *   | lvlx d, a, b     |
| vector bool short     | any integral type | vector bool short *     | lvlx d, a, b     |
| vector pixel          | any integral type | vector pixel *          | lvlx d, a, b     |
| vector unsigned int   | any integral type | unsigned int *          | lvlx d, a, b     |
| vector unsigned int   | any integral type | vector unsigned int *   | lvlx d, a, b     |
| vector signed int     | any integral type | signed int *            | lvlx d, a, b     |
| vector signed int     | any integral type | vector signed int *     | lvlx d, a, b     |
| vector bool int       | any integral type | vector bool int *       | lvlx d, a, b     |
| vector float          | any integral type | float *                 | lvlx d, a, b     |
| vector float          | any integral type | vector float *          | lvlx d, a, b     |
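The byte-level sketch below models vec\_lvlx in portable C. A byte buffer assumed to start on a 16-byte boundary stands in for memory, and an offset ea into it stands in for the effective address, so ea & 0xF equals the four least significant bits of EA. `lvlx_model` is a hypothetical helper for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of vec_lvlx: the bytes from ea up to the end of its
 * quadword go to the left of d; the remaining eb bytes of d are zero. */
static void lvlx_model(unsigned char d[16], const unsigned char *mem, size_t ea)
{
    size_t eb = ea & 0xF;
    memset(d, 0, 16);
    memcpy(d, mem + ea, 16 - eb);
}
```

With an aligned address (eb equal to 0) the model loads the full quadword, which is also what an ordinary aligned vector load returns.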



### vec\_lvlxl: Load Vector Left Indexed Last

d = vec\_lvlxl(a, b)

Let EA be the effective address formed from the sum of the contents of a and the contents of b, and let eb be the value of the four least significant bits of EA. The (16 - eb) bytes addressed by EA are loaded into the leftmost (16 - eb) bytes of d, and the rightmost eb bytes of d are set to zero. vec\_lvlxl also provides a hint that the quadword in memory addressed by EA will probably not be needed again by the program in the near future.

Table 7-206: Load Vector Left Indexed Last

| d                     | a                 | b                       | Assembly Mapping |
|-----------------------|-------------------|-------------------------|------------------|
| vector unsigned char  | any integral type | unsigned char *         | lvlxl d, a, b    |
| vector unsigned char  | any integral type | vector unsigned char *  | lvlxl d, a, b    |
| vector signed char    | any integral type | signed char *           | lvlxl d, a, b    |
| vector signed char    | any integral type | vector signed char *    | lvlxl d, a, b    |
| vector bool char      | any integral type | vector bool char *      | lvlxl d, a, b    |
| vector unsigned short | any integral type | unsigned short *        | lvlxl d, a, b    |
| vector unsigned short | any integral type | vector unsigned short * | lvlxl d, a, b    |
| vector signed short   | any integral type | signed short *          | lvlxl d, a, b    |
| vector signed short   | any integral type | vector signed short *   | lvlxl d, a, b    |
| vector bool short     | any integral type | vector bool short *     | lvlxl d, a, b    |
| vector pixel          | any integral type | vector pixel *          | lvlxl d, a, b    |
| vector unsigned int   | any integral type | unsigned int *          | lvlxl d, a, b    |
| vector unsigned int   | any integral type | vector unsigned int *   | lvlxl d, a, b    |
| vector signed int     | any integral type | signed int *            | lvlxl d, a, b    |
| vector signed int     | any integral type | vector signed int *     | lvlxl d, a, b    |
| vector bool int       | any integral type | vector bool int *       | lvlxl d, a, b    |
| vector float          | any integral type | float *                 | lvlxl d, a, b    |
| vector float          | any integral type | vector float *          | lvlxl d, a, b    |



### vec\_lvrx: Load Vector Right Indexed

d = vec\_lvrx(a, b)

Let EA be the effective address formed from the sum of the contents of a and the contents of b, and let eb be the value of the four least significant bits of EA. If eb is not equal to zero (that is, EA is not quadword-aligned), then the eb bytes in memory addressed by (EA - eb) are loaded into the rightmost eb bytes of d, and the leftmost (16 - eb) bytes of d are set to zero. If eb is equal to zero (that is, EA is quadword-aligned), then the contents of d are set to zero.

Table 7-207: Load Vector Right Indexed

| Return/Argument Types |                   |                         | Assembly Mapping |
|-----------------------|-------------------|-------------------------|------------------|
| d                     | a                 | b                       | Assembly Mapping |
| vector unsigned char  | any integral type | unsigned char *         |                  |
| vector unsigned char  | any integral type | vector unsigned char *  |                  |
| vector signed char    | any integral type | signed char *           |                  |
| vector signed char    | any integral type | vector signed char *    |                  |
| vector bool char      | any integral type | vector bool char *      |                  |
| vector unsigned short | any integral type | unsigned short *        |                  |
| vector unsigned short | any integral type | vector unsigned short * |                  |
| vector signed short   | any integral type | signed short *          |                  |
| vector signed short   | any integral type | vector signed short *   | lvrx d, a, b     |
| vector bool short     | any integral type | vector bool short *     |                  |
| vector pixel          | any integral type | vector pixel *          |                  |
| vector unsigned int   | any integral type | unsigned int *          |                  |
| vector unsigned int   | any integral type | vector unsigned int *   |                  |
| vector signed int     | any integral type | signed int *            |                  |
| vector signed int     | any integral type | vector signed int *     |                  |
| vector bool int       | any integral type | vector bool int *       |                  |
| vector float          | any integral type | float *                 |                  |
| vector float          | any integral type | vector float *          |                  |
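
The left and right indexed loads are commonly paired to load 16 bytes from an address that is not quadword-aligned. vec\_lvlx at EA returns the bytes up to the next 16-byte boundary, and vec\_lvrx at EA + 16 returns the bytes that follow that boundary. OR-ing the two results therefore yields the 16 bytes starting at EA. The following sketch models those semantics in portable C so that they can be checked on any host; it does not use the intrinsics. The `qword_t` type and the `model_*` functions are hypothetical. Byte offsets into a buffer stand in for effective addresses, and the buffer is assumed to begin on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector register (byte 0 is leftmost). */
typedef struct { uint8_t b[16]; } qword_t;

/* Model of vec_lvlx: the (16 - eb) bytes from EA up to the next 16-byte
 * boundary go into the leftmost bytes of the result; the rest are zero.
 * mem is assumed to start on a 16-byte boundary, so ea is an offset into it. */
static qword_t model_lvlx(const uint8_t *mem, size_t len, size_t ea)
{
    qword_t d = {{0}};
    size_t eb = ea & 0xF;
    assert(ea + (16 - eb) <= len);
    memcpy(d.b, mem + ea, 16 - eb);
    return d;
}

/* Model of vec_lvrx: the eb bytes at (EA - eb) go into the rightmost eb
 * bytes of the result; the leftmost (16 - eb) bytes are zero. When EA is
 * quadword-aligned (eb == 0), the whole result is zero. */
static qword_t model_lvrx(const uint8_t *mem, size_t len, size_t ea)
{
    qword_t d = {{0}};
    size_t eb = ea & 0xF;
    if (eb != 0) {
        assert(ea <= len);
        memcpy(d.b + (16 - eb), mem + (ea - eb), eb);
    }
    return d;
}

/* Bytewise OR, standing in for vec_or. */
static qword_t model_or(qword_t x, qword_t y)
{
    for (int i = 0; i < 16; i++)
        x.b[i] |= y.b[i];
    return x;
}

/* Unaligned 16-byte load: left part from EA, right part from EA + 16. */
static qword_t model_load_unaligned(const uint8_t *mem, size_t len, size_t ea)
{
    return model_or(model_lvlx(mem, len, ea), model_lvrx(mem, len, ea + 16));
}
```

With the intrinsics themselves, the same pairing would presumably be written as `vec_or(vec_lvlx(0, p), vec_lvrx(16, p))` for a pointer `p`. This form is inferred from the (a, b) argument order in Table 7-207; it is not a requirement of this specification.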



#### vec\_lvrxl: Load Vector Right Indexed Last

d = vec\_lvrxl(a, b)

Let EA be the effective address formed from the sum of the contents of a and the contents of b, and let eb be the value of the four least significant bits of EA. If eb is not equal to zero (that is, EA is not quadword-aligned), then the eb bytes in memory addressed by (EA - eb) are loaded into the rightmost eb bytes of d, and the leftmost (16 - eb) bytes of d are set to zero. If eb is equal to zero (that is, EA is quadword-aligned), then the contents of d are set to zero. vec\_lvrxl provides a hint that the quadword in memory addressed by EA will probably not be needed again by the program in the near future.

Table 7-208: Load Vector Right Indexed Last

| Return/Argument Types |                   |                         | Assembly Mapping |
|-----------------------|-------------------|-------------------------|------------------|
| d                     | a                 | b                       | Assembly Mapping |
| vector unsigned char  | any integral type | unsigned char *         |                  |
| vector unsigned char  | any integral type | vector unsigned char *  |                  |
| vector signed char    | any integral type | signed char *           |                  |
| vector signed char    | any integral type | vector signed char *    |                  |
| vector bool char      | any integral type | vector bool char *      |                  |
| vector unsigned short | any integral type | unsigned short *        |                  |
| vector unsigned short | any integral type | vector unsigned short * |                  |
| vector signed short   | any integral type | signed short *          |                  |
| vector signed short   | any integral type | vector signed short *   | lvrxl d, a, b    |
| vector bool short     | any integral type | vector bool short *     |                  |
| vector pixel          | any integral type | vector pixel *          |                  |
| vector unsigned int   | any integral type | unsigned int *          |                  |
| vector unsigned int   | any integral type | vector unsigned int *   |                  |
| vector signed int     | any integral type | signed int *            |                  |
| vector signed int     | any integral type | vector signed int *     |                  |
| vector bool int       | any integral type | vector bool int *       |                  |
| vector float          | any integral type | float *                 |                  |
| vector float          | any integral type | vector float *          |                  |



#### vec\_stvlx: Store Vector Left Indexed

(void) vec\_stvlx(a, b, c)

Let EA be the effective address formed from the sum of the contents of b and the contents of c, and let eb be the value of the four least significant bits of EA. Store the (16 - eb) leftmost bytes of a into the memory addressed by EA.

Table 7-209: Store Vector Left Indexed

| Return/Argument Types |                   |                         | Assembly Mapping |
|-----------------------|-------------------|-------------------------|------------------|
| a                     | b                 | c                       | Assembly Mapping |
| vector unsigned char  | any integral type | unsigned char *         |                  |
| vector unsigned char  | any integral type | vector unsigned char *  |                  |
| vector signed char    | any integral type | signed char *           |                  |
| vector signed char    | any integral type | vector signed char *    |                  |
| vector bool char      | any integral type | vector bool char *      |                  |
| vector unsigned short | any integral type | unsigned short *        |                  |
| vector unsigned short | any integral type | vector unsigned short * |                  |
| vector signed short   | any integral type | signed short *          | stvlx a, b, c    |
| vector signed short   | any integral type | vector signed short *   |                  |
| vector bool short     | any integral type | vector bool short *     |                  |
| vector pixel          | any integral type | vector pixel *          |                  |
| vector unsigned int   | any integral type | unsigned int *          |                  |
| vector unsigned int   | any integral type | vector unsigned int *   |                  |
| vector signed int     | any integral type | signed int *            |                  |
| vector signed int     | any integral type | vector signed int *     |                  |
| vector bool int       | any integral type | vector bool int *       |                  |
| vector float          | any integral type | float *                 |                  |
| vector float          | any integral type | vector float *          |                  |



#### vec\_stvlxl: Store Vector Left Indexed Last

(void) vec\_stvlxl(a, b, c)

Let EA be the effective address formed from the sum of the contents of b and the contents of c, and let eb be the value of the four least significant bits of EA. Store the (16 - eb) leftmost bytes of a into the memory addressed by EA. vec\_stvlxl provides a hint that the quadword in memory addressed by EA will probably not be needed again by the program in the near future.

Table 7-210: Store Vector Left Indexed Last

| Return/Argument Types |                   |                         | Assembly Mapping |
|-----------------------|-------------------|-------------------------|------------------|
| a                     | b                 | c                       | Assembly Mapping |
| vector unsigned char  | any integral type | unsigned char *         |                  |
| vector unsigned char  | any integral type | vector unsigned char *  |                  |
| vector signed char    | any integral type | signed char *           |                  |
| vector signed char    | any integral type | vector signed char *    |                  |
| vector bool char      | any integral type | vector bool char *      |                  |
| vector unsigned short | any integral type | unsigned short *        |                  |
| vector unsigned short | any integral type | vector unsigned short * |                  |
| vector signed short   | any integral type | signed short *          |                  |
| vector signed short   | any integral type | vector signed short *   | stvlxl a, b, c   |
| vector bool short     | any integral type | vector bool short *     |                  |
| vector pixel          | any integral type | vector pixel *          |                  |
| vector unsigned int   | any integral type | unsigned int *          |                  |
| vector unsigned int   | any integral type | vector unsigned int *   |                  |
| vector signed int     | any integral type | signed int *            |                  |
| vector signed int     | any integral type | vector signed int *     |                  |
| vector bool int       | any integral type | vector bool int *       |                  |
| vector float          | any integral type | float *                 |                  |
| vector float          | any integral type | vector float *          |                  |



#### vec\_stvrx: Store Vector Right Indexed

(void) vec\_stvrx(a, b, c)

Let EA be the effective address formed from the sum of the contents of b and the contents of c, and let eb be the value of the four least significant bits of EA. Store the eb rightmost bytes of a into the memory addressed by (EA - eb). If eb is zero (that is, EA is 16-byte aligned), no memory is stored.

Table 7-211: Store Vector Right Indexed

| Return/Argument Types |                   |                         | Assembly Mapping |
|-----------------------|-------------------|-------------------------|------------------|
| a                     | b                 | c                       | Assembly Mapping |
| vector unsigned char  | any integral type | unsigned char *         |                  |
| vector unsigned char  | any integral type | vector unsigned char *  |                  |
| vector signed char    | any integral type | signed char *           |                  |
| vector signed char    | any integral type | vector signed char *    |                  |
| vector bool char      | any integral type | vector bool char *      |                  |
| vector unsigned short | any integral type | unsigned short *        |                  |
| vector unsigned short | any integral type | vector unsigned short * |                  |
| vector signed short   | any integral type | signed short *          |                  |
| vector signed short   | any integral type | vector signed short *   | stvrx a, b, c    |
| vector bool short     | any integral type | vector bool short *     |                  |
| vector pixel          | any integral type | vector pixel *          |                  |
| vector unsigned int   | any integral type | unsigned int *          |                  |
| vector unsigned int   | any integral type | vector unsigned int *   |                  |
| vector signed int     | any integral type | signed int *            |                  |
| vector signed int     | any integral type | vector signed int *     |                  |
| vector bool int       | any integral type | vector bool int *       |                  |
| vector float          | any integral type | float *                 |                  |
| vector float          | any integral type | vector float *          |                  |
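
The left and right indexed stores can be combined in the same way to write 16 bytes to an address that is not quadword-aligned. vec\_stvlx at EA stores the leading bytes of a up to the next 16-byte boundary. vec\_stvrx at EA + 16 stores the remaining bytes just past that boundary. The memory on either side of the 16 target bytes is left untouched. The sketch below is a portable model of those semantics, not an implementation of the intrinsics. `qword_t` and the `model_*` functions are hypothetical, and buffer offsets stand in for effective addresses, with offset 0 assumed to be 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector register (byte 0 is leftmost). */
typedef struct { uint8_t b[16]; } qword_t;

/* Model of vec_stvlx: store the (16 - eb) leftmost bytes of a at EA. */
static void model_stvlx(qword_t a, uint8_t *mem, size_t len, size_t ea)
{
    size_t eb = ea & 0xF;
    assert(ea + (16 - eb) <= len);
    memcpy(mem + ea, a.b, 16 - eb);
}

/* Model of vec_stvrx: store the eb rightmost bytes of a at (EA - eb);
 * nothing is stored when EA is 16-byte aligned (eb == 0). */
static void model_stvrx(qword_t a, uint8_t *mem, size_t len, size_t ea)
{
    size_t eb = ea & 0xF;
    if (eb != 0) {
        assert(ea <= len);
        memcpy(mem + (ea - eb), a.b + (16 - eb), eb);
    }
}

/* Unaligned 16-byte store: left part at EA, right part via EA + 16. */
static void model_store_unaligned(qword_t a, uint8_t *mem, size_t len, size_t ea)
{
    model_stvlx(a, mem, len, ea);
    model_stvrx(a, mem, len, ea + 16);
}
```

With the intrinsics, the corresponding pair would presumably be `vec_stvlx(v, 0, p)` followed by `vec_stvrx(v, 16, p)`, following the (a, b, c) argument order in Tables 7-209 and 7-211. Because each instruction writes only the bytes on its side of the boundary, no read-modify-write of the surrounding quadwords is needed.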



#### vec\_stvrxl: Store Vector Right Indexed Last

(void) vec\_stvrxl(a, b, c)

Let EA be the effective address formed from the sum of the contents of b and the contents of c, and let eb be the value of the four least significant bits of EA. Store the eb rightmost bytes of a into the memory addressed by (EA - eb). If eb is zero (that is, EA is 16-byte aligned), no memory is stored. vec\_stvrxl provides a hint that the quadword in memory addressed by EA will probably not be needed again by the program in the near future.

Table 7-212: Store Vector Right Indexed Last

| Return/Argument Types |                   |                         | Assembly Mapping |
|-----------------------|-------------------|-------------------------|------------------|
| a                     | b                 | c                       | Assembly Mapping |
| vector unsigned char  | any integral type | unsigned char *         |                  |
| vector unsigned char  | any integral type | vector unsigned char *  |                  |
| vector signed char    | any integral type | signed char *           |                  |
| vector signed char    | any integral type | vector signed char *    |                  |
| vector bool char      | any integral type | vector bool char *      |                  |
| vector unsigned short | any integral type | unsigned short *        |                  |
| vector unsigned short | any integral type | vector unsigned short * |                  |
| vector signed short   | any integral type | signed short *          |                  |
| vector signed short   | any integral type | vector signed short *   | stvrxl a, b, c   |
| vector bool short     | any integral type | vector bool short *     |                  |
| vector pixel          | any integral type | vector pixel *          |                  |
| vector unsigned int   | any integral type | unsigned int *          |                  |
| vector unsigned int   | any integral type | vector unsigned int *   |                  |
| vector signed int     | any integral type | signed int *            |                  |
| vector signed int     | any integral type | vector signed int *     |                  |
| vector bool int       | any integral type | vector bool int *       |                  |
| vector float          | any integral type | float *                 |                  |
| vector float          | any integral type | vector float *          |                  |



#### vec\_promote: Promote Scalar to Vector

d = vec\_promote(a, element)

Scalar *a* is promoted to a vector containing *a* in the element that is specified by the *element* parameter, and the result is returned in vector *d*. All other elements of *d* are undefined. Depending on the size of *a*, only a limited number of the least significant bits of the *element* index are used. Specifically, for 1-, 2-, and 4-byte elements, only the four, three, and two least significant bits are used, respectively.

Table 7-213: Promote Scalar to Vector

| Return/Argument Types |                |         | Assembly Mapping<sup>1</sup>                                       |
|-----------------------|----------------|---------|--------------------------------------------------------------------|
| d                     | a              | element | Assembly Mapping<sup>1</sup>                                       |
| vector unsigned char  | unsigned char  | int     | EA=memaddr + (element&0xF)<br>stbx a, 0, EA<br>lvebx d, 0, EA       |
| vector signed char    | signed char    | int     | EA=memaddr + (element&0xF)<br>stbx a, 0, EA<br>lvebx d, 0, EA       |
| vector unsigned short | unsigned short | int     | EA=memaddr + ((element&0x7)<<1)<br>sthx a, 0, EA<br>lvehx d, 0, EA  |
| vector signed short   | signed short   | int     | EA=memaddr + ((element&0x7)<<1)<br>sthx a, 0, EA<br>lvehx d, 0, EA  |
| vector unsigned int   | unsigned int   | int     | EA=memaddr + ((element&0x3)<<2)<br>stwx a, 0, EA<br>lvewx d, 0, EA  |
| vector signed int     | signed int     | int     | EA=memaddr + ((element&0x3)<<2)<br>stwx a, 0, EA<br>lvewx d, 0, EA  |
| vector float          | float          | int     | EA=memaddr + ((element&0x3)<<2)<br>stfsx a, 0, EA<br>lvewx d, 0, EA |

<sup>1</sup> memaddr is the address of a temporary memory location that is 16-byte aligned.
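
The element-index masking in Table 7-213 can be restated as a small helper. The helper computes the byte offset, within the 16-byte temporary, at which the scalar is stored. The element load instructions (lvebx, lvehx, lvewx) place the loaded value in the vector element that corresponds to the low four bits of EA. This offset therefore also identifies the element of d that receives a. The `promote_byte_offset` and `promote_element_index` functions are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset, within the 16-byte aligned temporary,
 * at which vec_promote stores a scalar of elem_size bytes (1, 2, or 4).
 * Only the low 4, 3, or 2 bits of element are used (masks 0xF, 0x7, 0x3),
 * and the masked index is scaled by the element size (<<0, <<1, <<2). */
static size_t promote_byte_offset(size_t elem_size, int element)
{
    assert(elem_size == 1 || elem_size == 2 || elem_size == 4);
    unsigned lanes = 16u / (unsigned)elem_size;        /* 16, 8, or 4 elements */
    unsigned index = (unsigned)element & (lanes - 1u); /* keep the low log2(lanes) bits */
    return (size_t)index * elem_size;
}

/* Element of d that receives a: the element load places the value in the
 * element that corresponds to the low four bits of EA. */
static unsigned promote_element_index(size_t elem_size, int element)
{
    return (unsigned)(promote_byte_offset(elem_size, element) / elem_size);
}
```

For example, with 4-byte elements an element value of 6 selects element 2 (byte offset 8), the same element that the values 2 and 10 select.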

#### vec\_splats: Splat Scalar to Vector

d = vec\_splats(a)

The scalar value *a* is replicated across all elements of a vector of the corresponding type, and the result is returned in vector *d*.

Table 7-214: Splat Scalar to Vector

| Return/Argument Types |                                         | Assembly Mapping                                                                  |
|-----------------------|-----------------------------------------|-----------------------------------------------------------------------------------|
| d                     | a                                       | Assembly Mapping                                                                  |
| vector unsigned char  | unsigned char                           | store a into 16-byte-aligned memory at EA<br>lvebx tmp, 0, EA<br>vspltb d, tmp, 0 |
| vector signed char    | signed char                             | store a into 16-byte-aligned memory at EA<br>lvebx tmp, 0, EA<br>vspltb d, tmp, 0 |
| vector unsigned short | unsigned short                          | store a into 16-byte-aligned memory at EA<br>lvehx tmp, 0, EA<br>vsplth d, tmp, 0 |
| vector signed short   | signed short                            | store a into 16-byte-aligned memory at EA<br>lvehx tmp, 0, EA<br>vsplth d, tmp, 0 |
| vector unsigned int   | unsigned int                            | store a into 16-byte-aligned memory at EA<br>lvewx tmp, 0, EA<br>vspltw d, tmp, 0 |
| vector signed int     | signed int                              | store a into 16-byte-aligned memory at EA<br>lvewx tmp, 0, EA<br>vspltw d, tmp, 0 |
| vector float          | float                                   | store a into 16-byte-aligned memory at EA<br>lvewx tmp, 0, EA<br>vspltw d, tmp, 0 |
| vector unsigned char  | unsigned char (5-bit unsigned literal)  | vspltisb d, a                                                                     |
| vector signed char    | signed char (5-bit unsigned literal)    | vspltisb d, a                                                                     |
| vector unsigned short | unsigned short (5-bit unsigned literal) | vspltish d, a                                                                     |
| vector signed short   | signed short (5-bit unsigned literal)   | vspltish d, a                                                                     |
| vector unsigned int   | unsigned int (5-bit unsigned literal)   | vspltisw d, a                                                                     |
| vector signed int     | signed int (5-bit unsigned literal)     | vspltisw d, a                                                                     |
| vector float          | float (5-bit unsigned literal)          | vspltisw d, a                                                                     |
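
The memory-based mapping in Table 7-214 works because the temporary is 16-byte aligned. The element load therefore places the scalar in element 0, and the splat instruction can always use index 0. The following portable model traces that sequence for a 32-bit scalar. `vec_u32_t` and the `model_*` functions are hypothetical. The initial contents of the temporary are an arbitrary pattern that stands in for the elements the hardware leaves undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a vector of four 32-bit elements (element 0 first). */
typedef struct { uint32_t w[4]; } vec_u32_t;

/* Model of lvewx: the word loaded from EA lands in element (EA & 0xF) >> 2;
 * the remaining elements are undefined on hardware and are simply left as
 * they were in prior here. */
static vec_u32_t model_lvewx(uint32_t word, unsigned ea_low4, vec_u32_t prior)
{
    prior.w[(ea_low4 & 0xFu) >> 2] = word;
    return prior;
}

/* Model of vspltw d, tmp, idx: replicate element idx into all four elements. */
static vec_u32_t model_vspltw(vec_u32_t tmp, unsigned idx)
{
    assert(idx < 4);
    vec_u32_t d;
    for (int i = 0; i < 4; i++)
        d.w[i] = tmp.w[idx];
    return d;
}

/* Model of the memory-based vec_splats mapping for a 32-bit scalar. */
static vec_u32_t model_splats_u32(uint32_t a)
{
    /* Arbitrary pattern standing in for the undefined elements of tmp. */
    vec_u32_t undefined = {{0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu}};
    /* The temporary is 16-byte aligned, so the low four bits of EA are 0 ... */
    vec_u32_t tmp = model_lvewx(a, 0u, undefined);
    /* ... so the scalar is in element 0, the index that vspltw uses. */
    return model_vspltw(tmp, 0u);
}
```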



# 8. SPU C and C++ Standard Libraries and Language Support

This chapter describes differences between the implementations of the C and C++ standard libraries on the SPU and the corresponding ISO/IEC standards. It also identifies common language features that are specifically not supported on the SPU.

### 8.1. Standard Libraries

The C and C++ standard libraries that are required for the SPU are based on the Standard C Library described in ISO/IEC Standard 9899:1999 and the C++ Standard Library described in ISO/IEC Standard 14882:2003. However, neither library is required to be a fully compliant implementation of the respective ISO/IEC standard.

The proposed differences from ISO/IEC-compliant implementations have two causes: 1) the SPU does not have the same system resources and operating system support that are available to most stand-alone processors; and 2) the SPU hardware does not fully support the IEEE floating-point standard. Because of the SPU's limited operating system support, library functions that require system calls, thread facilities, or file input/output (I/O) may not be supported. Because of the differences in floating-point behavior, the results of single-precision floating-point functions will probably be less accurate than the Standard defines, and floating-point exceptions will be less reliable. Nevertheless, in most cases the standard library functions that are provided should execute quickly.

The minimum C and C++ library features that must be provided for the SPU are described in the following sections.

#### 8.1.1. C Standard Library

This section describes the minimum requirements of a compliant C standard library implementation.

#### **Library Contents**

All of the entities required in the C standard library must be declared and defined within the library header files listed in Table 8-215. Differences between the contents of these header files and the header files that comprise the ISO Standard Library are identified in the table. For a detailed description of the particular entities, see the ISO/IEC C Standard listed in the "Related Documentation" section.

Table 8-215: C Library Header Files

| Header Name | Description                                                                                                                                                                                                                                                                                                                                                                                                                           |
|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| assert.h    | Enforce assertions when functions execute. The assert macro reports assertion failures using the special debug printf (described later in this chapter).                                                                                                                                                                                                                                                                              |
| complex.h   | Perform complex arithmetic.                                                                                                                                                                                                                                                                                                                                                                                                           |
| ctype.h     | Classify characters. The functions declared in this header use only the "C" locale.                                                                                                                                                                                                                                                                                                                                                   |
| errno.h     | Test error codes reported by library functions.                                                                                                                                                                                                                                                                                                                                                                                       |
| fenv.h      | Control IEEE-style floating-point arithmetic. Macros for single- and double-precision exceptions are described in "9.2.2. Floating-Point Exceptions". |
| float.h     | Test floating-point type properties. These properties are specified in section "9.1. Properties of Floating-Point Data Type Representations".                                                                                                                                                                                                                                                                                         |
| inttypes.h  | Convert various integer types.                                                                                                                                                                                                                                                                                                                                                                                                        |
| iso646.h    | Program in ISO 646 variant character sets.                                                                                                                                                                                                                                                                                                                                                                                            |
| limits.h    | Test integer type properties. The macro MB_LEN_MAX is defined as 1.                                                                                                                                                                                                                                                                                                                                                                   |
| locale.h    | Not available.                                                                                                                                                                                                                                                                                                                                                                                                                        |
| math.h      | Compute common mathematical functions. The floating-point behavior of these functions will adhere to the specifications described in section "9.3. Floating-Point Operations". Although not specified or required, corresponding vector versions of the math functions may be added to the library to take advantage of the many high-performance SIMD (single instruction, multiple data) instructions provided by the SPU hardware. |
| setjmp.h    | Execute nonlocal goto statements. |
| signal.h    | Not available.                                                                                                                                                                                           |
| stdarg.h    | Access a varying number of arguments.                                                                                                                                                                    |
| stdbool.h   | Define a convenient Boolean type name and constants.                                                                                                                                                     |
| stddef.h    | Define several useful types and macros. The type wchar_t is not defined. |
| stdint.h    | Define various integer types with size constraints. SIG_ATOMIC_MAX and SIG_ATOMIC_MIN are not defined, nor are WCHAR_MAX, WCHAR_MIN, WINT_MAX, or WINT_MIN. |
| stdio.h     | Not available, except for printf, which is provided for debugging. (See section "Debug printf()".)                                                                                                       |
| stdlib.h    | Perform a variety of operations. The functions getenv, mblen, mbstowcs, mbtowc, system, wcstombs, and wctomb are not defined. The type wchar_t and the macro MB_CUR_MAX are also not defined.            |
| string.h    | Manipulate several kinds of strings. The function strxfrm uses only the "C" locale.                                                                                                                      |
| tgmath.h    | Declare various type-generic math functions. Single-precision functions declared in this header adhere to the same specifications described for the corresponding functions that are declared in math.h. |
| time.h      | Not available.                                                                                                                                                                                           |
| wchar.h     | Not available.                                                                                                                                                                                           |
| wctype.h    | Not available.                                                                                                                                                                                           |

#### **Fastest Minimum-Width Integer Types**

The typedefs named int_fastN_t and uint_fastN_t designate the fastest signed and unsigned integer types with a width of at least N bits. These typedefs are defined as shown in Table 8-216. Their sizes are not guaranteed to equal those of the corresponding types defined for the PPU.

Table 8-216: Fastest Minimum-Width Integer Types

| Types                      | Size (in bits) |
|----------------------------|----------------|
| int_fast8_t/uint_fast8_t   | 32             |
| int_fast16_t/uint_fast16_t | 32             |
| int_fast32_t/uint_fast32_t | 32             |
| int_fast64_t/uint_fast64_t | 64             |
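
The widths in Table 8-216 are not guaranteed to match those on the PPU. Data that is shared between PPU and SPU code (for example, a structure transferred by DMA) is therefore safer declared with the exact-width types from stdint.h. The fast types are better kept to local computation that relies on the INT_FASTN_MIN and INT_FASTN_MAX limits rather than on an assumed width. The following sketch illustrates that convention; `shared_msg_t` and `clamp_to_fast8` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical message layout shared by PPU and SPU code. The exact-width
 * types have the same size on both sides; int_fastN_t might not, because
 * its width (Table 8-216) is an implementation choice. */
typedef struct {
    uint32_t command;
    int32_t  status;
    uint64_t ea;        /* for example, an effective address */
    uint8_t  flags[16];
} shared_msg_t;

/* Fails to compile if the layout ever differs from the expected 32 bytes. */
static_assert(sizeof(shared_msg_t) == 32, "shared_msg_t layout changed");

/* Fast types suit local arithmetic, provided the code relies on the
 * INT_FAST8_MIN/INT_FAST8_MAX limits instead of an assumed width. */
static int_fast8_t clamp_to_fast8(long v)
{
    if (v > INT_FAST8_MAX)
        return (int_fast8_t)INT_FAST8_MAX;
    if (v < INT_FAST8_MIN)
        return (int_fast8_t)INT_FAST8_MIN;
    return (int_fast8_t)v;
}
```

Under Table 8-216, clamp\_to\_fast8 on the SPU would clamp to a 32-bit range, while a PPU or host build might clamp to 8 bits. The code is correct in both cases because it never hard-codes the width.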

#### Debug printf()

A printf() function will be provided for application debugging. The implementation of this function depends on the particular services provided by the underlying operating system. Although detailed specifications for this function are not mandated by this document, a full-featured implementation is recommended. Such an implementation would include all of the usual output format conversion specifiers required by the C standard. In addition, conversion specifiers of the type described in the AltiVec<sup>TM</sup> Technology Programming Interface Manual are recommended to handle vector output formatting. Output conversion specifiers take the following form:

```
%[<flags>][<width>][<precision>][<size>]<conversion>
```

where

```
<flags>         ::= <flag-char> | <flags><flag-char>
<flag-char>     ::= <std-flag-char> | <c-sep>
<std-flag-char> ::= '-' | '+' | '0' | '#' | ' '
<c-sep>         ::= ',' | ';' | ':' | '_'
<width>         ::= <decimal-integer> | '*'
<precision>     ::= '.' <width> | '.' | '.*'
<size>          ::= 'hh' | 'h' | 'l' | 'll' | 'L' | <vector-size>
<vector-size>   ::= 'v' | 'vhh' | 'vh' | 'vl' | 'vll' | 'vL' | 'hhv'
                  | 'hv' | 'lv' | 'llv' | 'Lv'
<conversion>    ::= <char-conv> | <str-conv> | <fp-conv> | <int-conv>
                  | <byte-conv> | <misc-conv>
<char-conv>     ::= 'c'
<str-conv>      ::= 's'
<fp-conv>       ::= 'e' | 'E' | 'f' | 'F' | 'g' | 'G'
<int-conv>      ::= 'd' | 'i' | 'u' | 'p' | 'o' | 'x' | 'X'
<byte-conv>     ::= 'uc' | 'co' | 'cx' | 'cX'
<misc-conv>     ::= 'n' | '%'
```

The extensions to the C standard output conversion specification that support vector types are the separator flag characters (`<c-sep>`), the vector sizes (`<vector-size>`), and the byte conversions (`<byte-conv>`). Vector types are formatted using the conversions shown in Table 8-217. String conversions (`<str-conv>`) and miscellaneous conversions (`<misc-conv>`) are not defined for vectors, and neither is the 'p' integer conversion. The default separator (`<c-sep>`) is a space, except for the character conversion (`<char-conv>`), which has no separator.
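
The separator rules can be illustrated on a host that lacks the SPU debug printf with a small formatter. It renders a vector signed int the way `%vd` is described: four elements separated by a space, or by the `<c-sep>` flag character when one is given, as in `%,vd`. It renders a vector char the way `%vc` is described: 16 contiguous characters. Both functions are hypothetical sketches that assume the flag character simply replaces the default separator. The exact output of an SPU printf implementation, including how width and precision apply to each element, may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of "%vd" for a vector signed int: four elements printed
 * in decimal, separated by sep (' ' by default, or the <c-sep> flag character
 * ',', ';', ':' or '_'). Returns the snprintf result. */
static int format_vsint(char *buf, size_t size, const int v[4], char sep)
{
    assert(sep == ' ' || sep == ',' || sep == ';' || sep == ':' || sep == '_');
    return snprintf(buf, size, "%d%c%d%c%d%c%d",
                    v[0], sep, v[1], sep, v[2], sep, v[3]);
}

/* Hypothetical sketch of "%vc": 16 one-byte elements printed as contiguous
 * characters with no separator; out must hold 17 bytes. */
static void format_vchar(char out[17], const unsigned char v[16])
{
    memcpy(out, v, 16);
    out[16] = '\0';
}
```

Under this model, the elements {1, -2, 3, 40} render as `1 -2 3 40` by default and as `1,-2,3,40` with the ',' flag.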

Table 8-217: Vector Formats

| Vector Size | Conversion                  | Description |
|-------------|-----------------------------|-------------|
| v           | `<char-conv>`               | A vector is printed as a vector char, consisting of 16 one-byte elements. The 'c' conversion prints contiguous ASCII characters. |
| v           | `<int-conv>`, `<byte-conv>` | With the 'uc' conversion, a vector is printed as a vector unsigned char, consisting of 16 one-byte elements. Similarly, the 'co', 'cx', and 'cX' conversions print either a vector unsigned char or a qword, in octal or hexadecimal format. For all other integer conversions, a vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, X) format, either as a vector unsigned int or as a vector signed int, consisting of 4 four-byte elements. |
| v           | `<fp-conv>`                 | A vector is printed in a signed decimal fractional representation, either in standard decimal notation (f or F) or with a decimal power-of-ten exponent (e, E, g, G). The representation is printed as a vector float, containing 4 four-byte elements. |
| vhh or hhv  | `<int-conv>`                | A vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, X) format, either as a vector unsigned char or as a vector signed char, consisting of 16 one-byte elements. |
| vh or hv    | `<int-conv>`                | A vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, X) format, either as a vector unsigned short or as a vector signed short, consisting of 8 two-byte elements. |
| vl or lv    | `<int-conv>`                | A vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, X) format, as a vector unsigned int or as a vector signed int, consisting of 4 four-byte elements. |
| vll or llv  | `<int-conv>`                | A vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, X) format, as a vector unsigned long long or as a vector signed long long, consisting of 2 eight-byte elements. |
| vL or Lv    | `<fp-conv>`                 | A vector is printed in a signed decimal fractional representation, either in standard decimal notation (f or F) or with a decimal power-of-ten exponent (e, E, g, G). The representation is printed as a vector double, consisting of 2 eight-byte elements. |

#### **Malloc Heap**

The malloc heap is defined to begin at `_end` and to extend to the end of the stack. The memory heap may be enlarged by a heap-extending function. Such a function negatively adjusts the Available Stack Size element of the current Stack Pointer Information register and of all saved SP registers found in the sequence of Back Chain quadwords.

Whenever the malloc heap is enlarged, code should verify that the enlarged malloc heap does not extend into the currently used stack. If it does, the operation should fail.

Implementations of setjmp/longjmp are also affected by the use of heap-extending functions. When restoring the Stack Pointer Information register as a result of invoking the longjmp function, the function must detect any change to the Available Stack Size between setjmp and longjmp, and it must correct the saved Stack Pointer Information register, SP\_set (the register saved at the last setjmp call), using the current Stack Pointer Information register, SP.

#### 8.1.2. C++ Standard Library

This section describes the minimum contents of the C++ standard library.

As with the C library, the C++ library header files declare or define the contents of the C++ library. Table 8-218 lists the header files that comprise the core of the C++ standard library. Differences between the contents of the C++ header files and the header files that comprise the ISO Standard Library are noted in this table.

Table 8-218: C++ Library Header Files

| Header Name | Description                                                                                                 |
|-------------|-------------------------------------------------------------------------------------------------------------|
| algorithm   | Define numerous templates that implement useful algorithms.                                                 |
| bitset      | Define a template class that administers sets of bits.                                                      |
| complex     | Define a template class that supports complex arithmetic.                                                   |
| deque       | Define a template class that implements a deque container.                                                  |
| exception   | Not available.                                                                                              |
| fstream     | Not available.                                                                                              |
| functional  | Define several templates that help construct predicates for the templates defined in algorithm and numeric. |
| iomanip     | Not available.                                                                                              |
| ios         | Not available.                                                                                              |
| iosfwd      | Not available.                                                                                              |
| iostream    | Not available.                                                                                              |
| istream     | Not available.                                                                                              |
| iterator    | Define several templates that help define and manipulate iterators.                                         |
| limits      | Test numeric type properties.                                                                               |
| list        | Define a template class that implements a doubly linked list container.                                     |
| locale      | Not available.                                                                                              |
| map         | Define template classes that implement associative containers that map keys to values.                      |
| memory      | Define several templates that allocate and free storage for various container classes.                      |
| new         | Declare several functions that allocate and free storage.                                                   |
| numeric     | Define several templates that implement useful numeric functions.                                           |
| ostream     | Not available.                                                                                              |
| queue       | Define a template class that implements a queue container.                                                  |
| set         | Define template classes that implement associative containers.                                              |
| slist       | Define a template class that implements a singly linked list container.                                     |
| sstream     | Not available.                                                                                              |
| stack       | Define a template class that implements a stack container.                                                  |
| stdexcept   | Not available.                                                                  |
| streambuf   | Not available.                                                                  |
| string      | Define a template class that implements a string container.                     |
| strstream   | Not available.                                                                  |
| typeinfo    | Not available.                                                                  |
| utility     | Define several templates of general utility.                                    |
| valarray    | Define several classes and template classes that support value-oriented arrays. |
| vector      | Define a template class that implements a vector container.                     |

The C++ standard library contains new-style C++ header files that correspond to 18 traditional C header files, 12 of which are available. Both the new-style and the traditional-style forms of each available header file are included in the library. These header files are listed in Table 8-219.

Table 8-219: New and Traditional C++ Library Header Files

| New-Style Header Name | Traditional Header Name | Description                                                  |
|-----------------------|-------------------------|--------------------------------------------------------------|
| cassert               | assert.h                | Enforce assertions when functions execute. <sup>1</sup>      |
| cctype                | ctype.h                 | Classify characters. <sup>1</sup>                            |
| cerrno                | errno.h                 | Test error codes reported by library functions. <sup>1</sup> |
| cfloat                | float.h                 | Test floating-point type properties.                         |
| ciso646               | iso646.h                | Program in ISO 646 variant character sets.                   |
| climits               | limits.h                | Test integer type properties. <sup>1</sup>                   |
| clocale               | locale.h                | Not available.                                               |
| cmath                 | math.h                  | Compute common mathematical functions. <sup>1</sup>          |
| csetjmp               | setjmp.h                | Execute nonlocal goto statements.                            |
| csignal               | signal.h                | Not available.                                               |
| cstdarg               | stdarg.h                | Access a varying number of arguments.                        |
| cstddef               | stddef.h                | Define several useful types and macros. <sup>1</sup>         |
| cstdio                | stdio.h                 | Not available.                                               |
| cstdlib               | stdlib.h                | Perform a variety of operations. <sup>1</sup>                |
| cstring               | string.h                | Manipulate several kinds of strings. <sup>1</sup>            |
| ctime                 | time.h                  | Not available.                                               |
| cwchar                | wchar.h                 | Not available.                                               |
| cwctype               | wctype.h                | Not available.                                               |

<sup>1</sup> See Table 8-215: C Library Header Files, for specific implementation limitations.

# 8.2. Non-Supported Language Features

C and C++ implementations should comply with the language features prescribed in the respective ISO/IEC standards, as much as possible. However, certain features are specifically not supported because of SPU architecture limitations. Currently, the only non-supported feature is C++ exception handling.





# 9. Floating-Point Arithmetic on the SPU

Annex F of the C99 language standard (ISO/IEC 9899) specifies support for the IEC 60559 floating-point standard. This chapter describes differences from Annex F and IEC 60559 that apply to SPU compilers and libraries.

Floating-point behavior is essentially dictated by the SPU hardware. For single precision, the hardware provides an extended single-precision number range. Denorm arguments are treated as 0, and NaN (not-a-number) and Infinity are not supported. The only supported rounding mode is truncation (round towards 0), and exceptions apply only to certain extended-range floating-point instructions. For double precision, the hardware provides the standard IEEE number range, but again, denorm arguments are treated as 0. IEEE exceptions are detected and accumulated in the FPSCR register, but the IEEE rules for propagation of NaNs are not implemented in the architecture. (For details, see the *Synergistic Processor Unit Instruction Set Architecture*.) These and other IEEE differences affect almost every aspect of floating-point computation, including data-type properties, rounding modes, exception status, error reporting, and expression evaluation. The particular effects of these differences on the compiler and libraries are described in the following sections.

## 9.1. Properties of Floating-Point Data Type Representations

The properties of floating-point data type representations are declared as macros in float.h. Table 9-220 lists these macros and the corresponding values that are applicable for the SPU.

Table 9-220: Values for Floating-Point Type Properties

| Macro           | Value                                            |
|-----------------|--------------------------------------------------|
| FLT_DIG         | 6                                                |
| FLT_EPSILON     | 0x1p-23f (1.19209290E-07f)                       |
| FLT_MANT_DIG    | 24                                               |
| FLT_MAX_10_EXP  | 38                                               |
| FLT_MAX_EXP     | 129                                              |
| FLT_MIN_10_EXP  | -37                                              |
| FLT_MIN_EXP     | -125                                             |
| FLT_MAX         | 0x1.FFFFFEp128f (6.80564693E+38f)                |
| FLT_MIN         | 0x1p-126f (1.17549436E-38f)                      |
| FLT_ROUNDS      | Initialized to 16 (to nearest for both elements) |
| FLT_EVAL_METHOD | 0 (no promotions occur)                          |
| FLT_RADIX       | 2                                                |
| DBL_DIG         | 15                                               |
| DBL_EPSILON     | 0x1p-52 (2.2204460492503131E-016)                |
| DBL_MANT_DIG    | 53                                               |
| DBL_MAX_10_EXP  | 308                                              |
| DBL_MAX_EXP     | 1024                                             |
| DBL_MIN_10_EXP  | -307                                             |
| DBL_MIN_EXP     | -1021                                            |
| DBL_MAX         | 0x1.FFFFFFFFFFFFFp1023 (1.7976931348623157E+308) |
| DBL_MIN         | 0x1p-1022 (2.2250738585072014E-308)              |
| DECIMAL_DIG     | 17                                               |



# 9.2. Floating-Point Environment

The macros defined within fenv.h control the directed-rounding control mode and floating-point exception status flags for floating-point operations.

#### 9.2.1. Rounding Modes

Whereas the C language specification requires that all floating-point data types use the same rounding modes, the SPU hardware supports different rounding modes for single- and double-precision arithmetic. On the SPU, the rounding mode for single precision is round-towards-zero, and the default rounding mode for double precision is round-to-nearest.

According to the C99 standard, the rounding mode for floating-point addition is characterized by the implementation-defined value of FLT\_ROUNDS. On the SPU, this macro is only used for double precision. Single-precision rounding mode is always truncation. (See Table 9-220.)

FLT\_ROUNDS evaluates to a 5-bit value that represents the rounding modes of both double-precision elements. The highest bit is always 1, the next two bits hold the rounding mode for element 0, and the two lowest bits hold the rounding mode for element 1. Table 9-221 lists the rounding mode represented by each two-bit field.

Table 9-221: Rounding Mode for Two Bits of FLT\_ROUNDS

| Two-Bit Field | Rounding Mode                |
|---------------|------------------------------|
| 00            | Round to nearest even        |
| 01            | Round toward zero (truncate) |
| 10            | Round toward +infinity       |
| 11            | Round toward -infinity       |

Because the SPU hardware only supports rounding towards zero for single precision, some single-precision math functions will necessarily deviate from the C99 standard. The standard library math functions and macros that deviate are described later, in section "9.3.2. Overall Behavior of C Operators and Standard Library Math Functions".

Table 9-222 lists the macros that can be used to set the double precision rounding modes for element 0 and element 1. The macros for element 0 and element 1 may be used together with a bitwise OR to set the rounding mode for both elements, or the macros can be used separately to set the rounding mode for only that element.

Table 9-222: Macros for Double Precision Rounding Modes

| Macro           | Comment                                  |
|-----------------|------------------------------------------|
| FE_TONEAREST    | Set element 0 to round to nearest even   |
| FE_TOWARDZERO   | Set element 0 to round towards zero      |
| FE_UPWARD       | Set element 0 to round towards +infinity |
| FE_DOWNWARD     | Set element 0 to round towards -infinity |
| FE_TONEAREST_1  | Set element 1 to round to nearest even   |
| FE_TOWARDZERO_1 | Set element 1 to round towards zero      |
| FE_UPWARD_1     | Set element 1 to round towards +infinity |
| FE_DOWNWARD_1   | Set element 1 to round towards -infinity |

#### 9.2.2. Floating-Point Exceptions

Table 9-223 and Table 9-224 list the macros for floating-point exceptions that will be defined in fenv.h. Because of the restricted behavior of the SPU floating-point hardware, single-precision library functions can have an undefined effect on these exception flags. Moreover, hardware traps will not result from any raised exception.



Table 9-223: Macros for Single Precision Floating-Point Exceptions

| Macro                | Comment                                     |
|----------------------|---------------------------------------------|
| FE_OVERFLOW_SNGL     | Overflow exception for element 0            |
| FE_UNDERFLOW_SNGL    | Underflow exception for element 0           |
| FE_DIFF_SNGL         | Different from IEEE exception for element 0 |
| FE_DIVBYZERO_SNGL    | Divide by zero exception for element 0      |
| FE_OVERFLOW_SNGL_1   | Overflow exception for element 1            |
| FE_UNDERFLOW_SNGL_1  | Underflow exception for element 1           |
| FE_DIFF_SNGL_1       | Different from IEEE exception for element 1 |
| FE_DIVBYZERO_SNGL_1  | Divide by zero exception for element 1      |
| FE_OVERFLOW_SNGL_2   | Overflow exception for element 2            |
| FE_UNDERFLOW_SNGL_2  | Underflow exception for element 2           |
| FE_DIFF_SNGL_2       | Different from IEEE exception for element 2 |
| FE_DIVBYZERO_SNGL_2  | Divide by zero exception for element 2      |
| FE_OVERFLOW_SNGL_3   | Overflow exception for element 3            |
| FE_UNDERFLOW_SNGL_3  | Underflow exception for element 3           |
| FE_DIFF_SNGL_3       | Different from IEEE exception for element 3 |
| FE_DIVBYZERO_SNGL_3  | Divide by zero exception for element 3      |
| FE_ALL_EXCEPT_SNGL   | Bitwise OR of all macros for element 0      |
| FE_ALL_EXCEPT_SNGL_1 | Bitwise OR of all macros for element 1      |
| FE_ALL_EXCEPT_SNGL_2 | Bitwise OR of all macros for element 2      |
| FE_ALL_EXCEPT_SNGL_3 | Bitwise OR of all macros for element 3      |

Table 9-224: Macros for Double Precision Floating-Point Exceptions

| Macro               | Comment                                       |
|---------------------|-----------------------------------------------|
| FE_OVERFLOW_DBL     | Overflow exception for element 0              |
| FE_UNDERFLOW_DBL    | Underflow exception for element 0             |
| FE_INEXACT_DBL      | ISO/IEC inexact for element 0                 |
| FE_INVALID_DBL      | ISO/IEC invalid for element 0                 |
| FE_NC_NAN_DBL       | Possibly non-compliant NaN for element 0      |
| FE_NC_DENORM_DBL    | Possibly non-compliant denormal for element 0 |
| FE_OVERFLOW_DBL_1   | Overflow exception for element 1              |
| FE_UNDERFLOW_DBL_1  | Underflow exception for element 1             |
| FE_INEXACT_DBL_1    | ISO/IEC inexact for element 1                 |
| FE_INVALID_DBL_1    | ISO/IEC invalid for element 1                 |
| FE_NC_NAN_DBL_1     | Possibly non-compliant NaN for element 1      |
| FE_NC_DENORM_DBL_1  | Possibly non-compliant denormal for element 1 |
| FE_ALL_EXCEPT_DBL   | Bitwise OR of all macros for element 0        |
| FE_ALL_EXCEPT_DBL_1 | Bitwise OR of all macros for element 1        |
| FE_ALL_EXCEPT       | Bitwise OR of all macros from this table      |

The floating-point environment variables defined in the C99 specification only apply to double-precision.

The pragma FENV\_ACCESS will be used to inform the compiler whether the program intends to control and test floating-point status. If the pragma is on, the compiler will take appropriate action to ensure that code transformations preserve the behavior specified in this document.



#### 9.2.3. Other Floating-Point Constants in math.h

Several additional floating-point constants are defined in math.h. These constants are used by functions to report various domain and range errors. Many have a non-standard definition for the SPU. A description of these particular constants is shown in Table 9-225.

Table 9-225: Floating-Point Constants

| Macro                                             | Description                                                                                                                                                                              |
|---------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| HUGE_VAL                                          | Infinity                                                                                                                                                                                 |
| HUGE_VALF                                         | FLT_MAX                                                                                                                                                                                  |
| HUGE_VALL                                         | Infinity                                                                                                                                                                                 |
| INFINITY<br>NAN                                   | Double precision adheres to the IEEE definition. These macros are not used for single-precision operations.                                                                              |
| FP_INFINITE<br>FP_NAN<br>FP_NORMAL<br>FP_SUBNORMAL<br>FP_ZERO | For single precision, `fpclassify()` only returns the FP_NORMAL and FP_ZERO classes; FP_NAN, FP_INFINITE, and FP_SUBNORMAL are never generated. |
| FP_FAST_FMA<br>FP_FAST_FMAF<br>FP_FAST_FMAL | These are defined to indicate that the `fma` function executes more quickly than a multiply and an add of float and double operands. |
| FP_ILOGB0<br>FP_ILOGBNAN | `FP_ILOGB0` is the value returned by `ilogb(x)` and `ilogbf(x)` if `x` is zero or a denorm number. Its value is `INT_MIN`.<br>`FP_ILOGBNAN` is the value returned by `ilogb(x)` if `x` is a NaN. This does not apply to the single-precision case of `ilogbf`. Its value is `INT_MAX`. |
| MATH_ERRNO<br>MATH_ERREXCEPT                      | These will expand to the integer constants 1 and 2, respectively.                                                                                                                        |
| math_errhandling                                  | Expands to an expression that has type int and the value MATH_ERRNO, MATH_ERREXCEPT, or the bitwise OR of both. The value of math_errhandling is constant for the duration of a program. |

# 9.3. Floating-Point Operations

This section specifies floating-point data conversions and describes the overall behavior of C operators and standard library functions. It also describes several special cases where floating-point results might vary from the IEEE standard. Lastly, it describes the behavior of several individual math functions.

#### 9.3.1. Floating-Point Conversions

This section provides specifications for the four types of floating-point data conversions: 1) conversions from integers to floating-point; 2) conversions from floating-point to integer; 3) conversions between floating-point precisions; and 4) conversions between floating-point and string.

#### **Integer to Floating-Point Conversions**

Conversions from integers to floats will adhere to the following rules:

- A single-precision conversion from integer to float produces a result within the extended single-precision floating-point range. See Table 9-220 for details about this range.
- A single-precision conversion from integer to float rounds towards zero.
- A double-precision conversion from integer to float produces a result within the C99 standard double-precision floating-point range.
- A double-precision conversion from integer to float rounds according to the rounding mode indicated by the value of FLT\_ROUNDS.



#### **Floating-Point to Integer Conversions**

Conversions from floats to integers will have the following behavior:

- When converting from a float to an integer, exceptions are raised for overflow, underflow, and IEEE noncompliant result.
- Overflow and underflow exceptions are raised when converting from a double to an integer. If a double-precision value is infinite or NaN, or if the integral part of the floating value exceeds the range of the integer type, an "invalid" floating-point exception is raised, and the resulting value is unspecified. An "inexact" floating-point exception is raised by the hardware when a conversion involves an integral floating-point value that is outside the range of the integer data type.

#### **Conversions Between Floating-Point Precisions**

To achieve maximum performance, compilers only perform conversion from float to double and from double to float within the IEEE standard range. These conversions will comply with the IEEE standard, except for denormal inputs, which are forced to zero. Conversion of numbers outside of the IEEE standard range is unspecified. Conversions with NaNs, infinities, or denormal results are also unspecified.

#### **Conversions Between Floating-Point and Strings**

Conversions between floating-point and string values will adhere to both the extended single-precision floating-point range and the IEEE standard double-precision floating-point range.

#### 9.3.2. Overall Behavior of C Operators and Standard Library Math Functions

Library functions and compilers will obey the same general rules with respect to rounding and overflow. These rules differ, however, depending on whether the code is single precision or double precision.

#### **Single-Precision Code**

For single precision, the C operators (+, -, \*, and /) and the standard library math functions will have the following behavior:

- If the operation produces a value with a magnitude greater than the largest positive representable extended-range single-precision number, the result will be FLT\_MAX with appropriate sign, and the overflow flag will be raised.
- For all operators and standard functions, except the negate operator and the fabsf() and copysignf() functions, an argument with a denormal value will be treated as +0.0.
- Except for the negate operator and the fabsf() and copysignf() functions, operators and standard functions will never return a denormal value or -0.0.
- The negate operator and the fabsf() and copysignf() functions must be implemented such that only the sign bit is changed.
- Expressions will be evaluated using the round-towards-zero mode. Implementations that depend on other rounding directions for algorithm correctness will produce incorrect results and therefore cannot be used.
- The overflow flag will be set when FLT\_MAX is returned in place of a value whose magnitude is too large. Because infinity is undefined for single precision, FLT\_MAX will be used to signal infinity in situations where infinity would otherwise be generated on an IEEE 754-compliant system. This substitution allows common trigonometric identities to keep working.
- NaN is not supported and does not need to be copied from any input parameter.
- By default, compilers may perform optimizations for single-precision floating-point arithmetic that assume 1) that NaNs are never given as arguments; and, 2) that ±Inf will never be generated as a result.
- Compilers can assume that floating-point operations will not generate user-visible traps, such as division by zero, overflow, and underflow.
- Constant expressions that are evaluated at compile time will produce the same result as they would if they were evaluated at runtime. For example,

```
/* Each literal lies within the SPU's extended single-precision range, but their product does not. */
float x = 6.0e38f * 8.1e30f;
```

will be evaluated as FLT\_MAX.
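The saturation rule can be sketched on an ordinary IEEE host. On such a host 6.0e38f already overflows to infinity, so the sketch computes the exact product in double and clamps it. The following names are hypothetical. `SPU_EXT_MAX` is set to (2 - 2^-23) * 2^128, about 6.8e38, which is the largest extended-range magnitude and is assumed to be the value FLT\_MAX denotes on the SPU. `spu_saturate` models only saturation and the overflow flag, not round-towards-zero.

```c
#include <stdbool.h>

/* Assumption: largest magnitude of the SPU extended single-precision
 * range, (2 - 2^-23) * 2^128, written as a hex double literal. */
static const double SPU_EXT_MAX = 0x1.fffffep128;

/* Hypothetical helper: clamp an exactly computed result to +/-SPU_EXT_MAX
 * and raise a software overflow flag, mirroring the rules above. */
static double spu_saturate(double exact, bool *overflow)
{
    if (exact > SPU_EXT_MAX) {
        *overflow = true;
        return SPU_EXT_MAX;
    }
    if (exact < -SPU_EXT_MAX) {
        *overflow = true;
        return -SPU_EXT_MAX;
    }
    return exact;
}
```

For example, `spu_saturate(6.0e38 * 8.1e30, &flag)` returns `SPU_EXT_MAX` and sets `flag`, matching the constant-expression example above.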

Compilers may use single-precision contracted operations, such as Floating Reciprocal Absolute Square Root Estimate (frsqest) or Floating Multiply and Add (fma), unless explicitly prohibited by the FP\_CONTRACT pragma or a no-fast-float compiler option. When contracted operations are used, errno does not need to be set.
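Contraction matters because a fused multiply-add rounds once, while separate multiply and add operations round twice. The sketch below shows the difference on an IEEE host. The fused version is emulated in double. That emulation is exact for the inputs used here, because the product of two floats fits in a double, but it is only an illustration and not how an SPU compiler implements fma. Both function names are hypothetical.

```c
/* Unfused: round a*b to float, then round the sum.
 * volatile keeps the compiler from contracting or folding the product. */
static float mul_add_unfused(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Fused emulation: a float*float product is exact in double
 * (24 + 24 significant bits fit in 53), so for the inputs used
 * here the only rounding is the final conversion to float. */
static float mul_add_fused(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the unfused form loses the 2^-24 term when it rounds the product and returns 0. The fused form keeps that term and returns 2^-24.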

#### **Double-Precision Code**

For double-precision floating-point, the C operators and standard library math functions will be compliant with the IEEE standard, with the following exceptions:

- When a NaN is produced as a result of an operation, it will always be a QNaN.
- Except for the negate operator and the fabs() and copysign() functions, denormal values will only be supported as results. A denormal operand is treated as 0 with the same sign as the denormal operand.
- The default rounding mode for double precision is rounding to nearest.
- Compilers may use double-precision contracted operations, such as Double Floating Multiply and Add (dfma), unless explicitly prohibited by the FP\_CONTRACT pragma or a no-fast-double compiler option. When contracted operations are used, errno does not need to be set.

#### 9.3.3. Floating-Point Expression Special Cases

The C99 standard describes several standard expression transformations that might fail to produce the required effect on the SPU:

•  $x/2 \rightarrow x*0.5$ 

Valid for this particular value because 2 is an exact power of 2, so 0.5 is exactly representable. It is invalid in general. For example, x/10 -> x\*0.1 fails because the constant 0.1 is not exactly representable in any finite base-2 floating-point system.

•  $x*1 \rightarrow x \text{ and } x/1 \rightarrow x$ 

Invalid when: 1) x is an SNaN or a non-default QNaN (double precision only); 2) x is a denormal number; or, 3) x is -0.0 (single precision only).

• x/x -> 1.0

Invalid for single precision when x is zero or a denormal, and invalid for double precision when x is zero, a denormal, Inf, or NaN.

• x-y -> -(y-x)

Invalid when the result is zero, because the two forms can produce zeros of opposite sign. For double precision under rounding toward +infinity or -infinity, nonzero results might also differ by 1 ULP.

• x-x -> 0.0

Always valid for single precision, but invalid for double precision when x is either NaN or Inf. It is also invalid for double precision under rounding toward -infinity, in which case the result will be -0.0.

• 0\*x -> 0.0

Always valid for single precision, but invalid for double precision when x is a NaN, Inf, negative number, or -0.

• x+0 -> x

Invalid in single precision if x is a denormal operand or -0. Invalid in double precision if x = -0 under round-to-nearest, round toward +infinity, or truncation. Also invalid in double precision if x is an SNaN or a non-default QNaN, or if x is a denormal number, in which case x+0 becomes a zero with the appropriate sign.

• x-0 -> x

Valid for single precision, except if x is a denormal operand or -0. Invalid for double precision if x is an SNaN or a non-default QNaN, if x is a denormal number, or if x is +0 and the rounding mode is rounding toward -infinity. In this last case, x-0 = +0-0 = -0. For any normalized operand, the result is valid even when rounding toward -infinity.

• -x -> 0-x

Invalid for single precision when x = 0.0 or a denormal. Invalid for double precision in the following cases: 1) for NaNs, the value of -x is undefined, and the result can differ for every NaN; 2) if x is +0 and the rounding mode is round-to-nearest-even, round toward +infinity, or truncation, because then 0-x = +0 while -x = -0.

•  $x!=x \rightarrow false$ 

Always valid for single precision. For double precision, a NaN compares unordered even with itself, so x!=x -> true when x is NaN.

• x==x -> true

Always valid for single precision. For double precision, a NaN compares unordered even with itself, so x==x -> false when x is NaN.

• x<y -> isless(x,y), x<=y -> islessequal(x,y), x>y -> isgreater(x,y), and x>=y -> isgreaterequal(x,y)

Valid, except for exception-flag side effects. Under double precision, the relational operators can set the invalid flag when x or y is NaN, whereas the is\* macros do not. The FENV\_ACCESS pragma can change the invalid flag behavior.
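Several of the double-precision caveats above can be observed directly on any host whose double arithmetic follows IEEE 754 in round-to-nearest mode, which the text gives as the SPU's double-precision default. Denormal handling differs from the SPU, so the sketch avoids denormal inputs. The helper names are hypothetical. `volatile` is used only to stop the compiler from constant-folding the operands.

```c
#include <math.h>
#include <stdbool.h>

/* True if a and b compare equal AND have the same sign bit, so that
 * +0.0 and -0.0 are told apart. */
static bool same_value_and_sign(double a, double b)
{
    return a == b && (signbit(a) != 0) == (signbit(b) != 0);
}

/* x/2 -> x*0.5: 0.5 is exact, so both forms round the same exact value. */
static bool div2_ok(double x) { return x / 2.0 == x * 0.5; }

/* x/10 -> x*0.1: 0.1 is not exactly representable in binary. */
static bool div10_ok(double x)
{
    volatile double tenth = 0.1;
    return x / 10.0 == x * tenth;
}

/* x+0 -> x: fails for x = -0.0, because -0.0 + 0.0 is +0.0. */
static bool add_zero_ok(double x)
{
    volatile double zero = 0.0;
    return same_value_and_sign(x + zero, x);
}

/* -x -> 0-x: fails for x = +0.0, because -(+0.0) is -0.0 but 0.0 - 0.0 is +0.0. */
static bool neg_ok(double x)
{
    volatile double zero = 0.0;
    return same_value_and_sign(-x, zero - x);
}

/* x==x -> true: fails for NaN, which compares unordered with itself. */
static bool self_equal_ok(double x) { return x == x; }
```

For instance, `div10_ok(3.0)` is false because 3.0/10 rounds to the double nearest 0.3, while 3.0\*0.1 rounds to the next double above it.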

#### 9.3.4. Specific Behavior of Standard Math Functions

This section describes the specific behavior of various floating-point functions declared in math.h. As noted, the SPU hardware has a direct effect on the behavior of floating-point functions. Because of the many differences between strict IEEE behavior and the hardware behavior, the standard math functions do not need to provide rigorous checks for exception situations and out-of-range conditions. Consequently, the results of many functions are redefined. The following is a list of differences:

- The function nanf() will return zero.
- The isfinite() macro will always return a nonzero value for single precision.
- The isinf() macro will always return zero for single precision.
- The isnan() macro will always return zero for single precision.
- Unlike the C99 standard specifications, the single-precision functions nearbyintf(), lrintf(), llrintf(), and fmaf() round towards zero.
- Trig, hyperbolic, exponential, logarithmic, and gamma functions do not need to set the inexact flag when values are rounded.
- The boundary cases with a NaN argument will not be supported for single precision because NaN is not a valid argument.
- nextafterf(subnormal,y) will never raise an underflow flag. The functions nextafterf() and nexttowardf() will succeed when incrementing past the IEEE maximal float value.
- The following boundary cases will not be supported for single precision because infinity is not a valid argument: atanf(±inf), atan2f(±y, ±inf), atan2f(±inf,x), atan2f(±inf,±inf), acoshf(+inf), asinhf(±inf), atanhf(±1), atanhf(±inf), coshf(±inf), sinhf(±inf), tanhf(±inf), expf(±inf), exp2f(±inf), expm1f(±inf), frexpf(±inf,&exp), ldexpf(±inf,exp), logf(+inf), log10f(+inf), log1pf(+inf), log2f(+inf), logbf(±inf), modff(±inf,iptr), scalbnf(±inf,n), cbrtf(±inf), fabsf(±inf), hypotf(±inf,y), powf(-1,±inf), powf(x,±inf), powf(±inf,y), sqrtf(±inf), erff(±inf), erfcf(±inf), lgammaf(±inf), tgammaf(+inf), ceilf(±inf), floorf(±inf), nearbyintf(±inf), roundf(±inf), rintf(±inf), lrintf(±inf), llrintf(±inf), lroundf(±inf), llroundf(±inf), and copysignf(±inf).
- For single precision, the following boundary cases will produce a non-IEEE-compliant result: acosf(|x|>1), asinf(|x|>1), acoshf(x<1.0), atanhf(|x|>1), tgammaf(x<0), fmodf(x,0), ldexpf(x,BIG\_INT), logf(±0), logf(x<0), log10f(±0), log10f(x<0), log1pf(-1), log1pf(x<-1), log2f(±0), log2f(x<0), logbf(±0), powf(±0,y), and tgammaf(±0).
- For single precision, the following boundary cases will not return NaN: cosf(±inf), sinf(±inf), tanf(±inf), tgammaf(-inf), fmodf(±inf,y), nextafterf(x,±inf), fmaf(±inf|0,0|±inf,z), and fmaf(±inf,0,-+inf).
- Section "9.3.1. Floating-Point Conversions" describes the behavior of implicit conversions when a single precision value is passed as an argument to a double precision function or when a single precision variable is assigned the result of a double-precision function.
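Because single-precision overflow yields ±FLT\_MAX and isinf() always returns zero for single precision, code that needs to notice a saturated single-precision result can compare against FLT\_MAX instead. This is a sketch under that assumption. The helper `is_saturated_f` is hypothetical, and it cannot distinguish a genuine FLT\_MAX result from an overflow. On a non-SPU host it compiles unchanged, but FLT\_MAX there is the standard IEEE value.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: treat +/-FLT_MAX as "overflowed", since on the SPU
 * single-precision overflow saturates instead of producing infinity. */
static bool is_saturated_f(float x)
{
    return x == FLT_MAX || x == -FLT_MAX;
}
```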



# 10. Operator Overloading for Vector Data Types

Operator overloading is a syntactic feature in which common operators, such as '+' or '-', have different implementations depending on the types of their operands. This section describes the vector data types that can be used with certain standard C/C++ operators and the behavior of those operators.

# 10.1. Supported Types

Operator overloading is valid on the vector data types listed in Table 10-226 and Table 10-227.

Table 10-226: Integer Vector Types

| Type                      | SPU/PPU |
|---------------------------|---------|
| vector signed char        | Both    |
| vector unsigned char      | Both    |
| vector signed short       | Both    |
| vector unsigned short     | Both    |
| vector signed int         | Both    |
| vector unsigned int       | Both    |
| vector signed long long   | SPU     |
| vector unsigned long long | SPU     |

Table 10-227: Floating-Point Vector Types

| Type          | SPU/PPU |
|---------------|---------|
| vector float  | Both    |
| vector double | SPU     |

# 10.2. Vector Subscripting

Vector subscripts are used to insert an element into a vector or to extract an element from a vector.

Given E1[E2], where E1 has a vector type with base type T and E2 has an integer type or enumeration type, the result is equivalent to:

When the value of E2 does not designate a valid element of E1, the behavior is undefined. Vector subscripts can be used on a vector declared with the register storage class.
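Subscripting can be tried on a host compiler with GCC's generic vector extension (`__attribute__((vector_size(16)))`), which also lets a vector be indexed like an array. This is an assumption-laden stand-in: `v4si` plays the role of `vector signed int`, and the helper names are hypothetical. An SPU or PPU compiler implementing this specification would use the `vector` keyword types directly.

```c
/* GCC/Clang generic vector type used as a stand-in for "vector signed int". */
typedef int v4si __attribute__((vector_size(16)));

/* Extract: E1[E2] used as an rvalue. i must be in 0..3; otherwise the
 * behavior is undefined, as stated above. */
static int v4si_get(v4si v, int i)
{
    return v[i];
}

/* Insert: E1[E2] used as an lvalue, returning the updated vector. */
static v4si v4si_set(v4si v, int i, int x)
{
    v[i] = x;
    return v;
}
```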

# 10.3. Unary Operators

Given OP E1, where E1 is a vector type  $\mathtt{T}$  with  $\mathtt{N}$  elements and OP is one of the operators in Table 10-228, the result has a value equivalent to:

$$(T) \{ OP E1[0], \ldots, OP E1[N-1] \}$$

Table 10-228: Valid Types for Specified Unary Operators

| OP | Integer Vector Types | Floating-Point Vector Types |
|----|----------------------|-----------------------------|
| ++ | yes                  | yes                         |
| -- | yes                  | yes                         |
| +  | yes                  | yes                         |
| -  | yes                  | yes                         |
| ~  | yes                  | no                          |
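The element-wise rule `(T) { OP E1[0], ..., OP E1[N-1] }` can be seen with the same host stand-in, GCC's generic vector extension. GCC documents unary minus and `~` on these types but does not list `++`, `--`, or unary `+`, so the sketch covers only the first two. The typedef and function names are hypothetical. As in the table, `~` is applied only to the integer vector.

```c
/* Host stand-ins for "vector signed int" and "vector float". */
typedef int   v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

/* Each operator is applied to every element independently. */
static v4si negate_i(v4si v)     { return -v; }
static v4si complement_i(v4si v) { return ~v; }  /* integer vectors only */
static v4sf negate_f(v4sf v)     { return -v; }
```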

# 10.4. Binary Operators

Given E1 OP E2, where E1 and E2 have equivalent vector types  $\mathtt{T}$  with  $\mathtt{N}$  elements and OP is one of the operators in Table 10-229, the result has a value equivalent to:

```
(T) { E1[0] OP E2[0], ..., E1[N-1] OP E2[N-1] }
```

For the assignment operators, E1 shall be a modifiable lvalue, and the result value will be assigned to the object it designates.

Table 10-229: Valid Types for Specified Binary Operators

| OP     | Integer Vector Types | Floating-Point Vector Types |
|--------|----------------------|-----------------------------|
| + +=   | yes                  | yes                         |
| - -=   | yes                  | yes                         |
| * *=   | yes                  | yes                         |
| / /=   | yes                  | yes                         |
| % %=   | yes                  | no                          |
| & &=   | yes                  | no                          |
| \| \|= | yes                  | no                          |
| ^ ^=   | yes                  | no                          |
| << <<= | yes                  | no                          |
| >> >>= | yes                  | no                          |
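The binary rule `(T) { E1[0] OP E2[0], ..., E1[N-1] OP E2[N-1] }`, including the compound assignment forms, can be illustrated with GCC's generic vector extension as a host stand-in for `vector signed int`. Both operands have the same vector type here, matching the text's requirement, and the shift count is a vector as well. The typedef and the function name are hypothetical.

```c
/* Host stand-in for "vector signed int". */
typedef int v4si __attribute__((vector_size(16)));

/* Per element: r[i] = ((a[i] * b[i]) + (a[i] % b[i])) << 1 */
static v4si combine(v4si a, v4si b)
{
    const v4si one = {1, 1, 1, 1};
    v4si r = a * b;   /* binary * */
    r += a % b;       /* compound +=, with % (integer vectors only) */
    r <<= one;        /* compound <<= with a vector shift count */
    return r;
}
```

With a = {7, 8, 9, 10} and b = {2, 3, 4, 5}, the products are {14, 24, 36, 50}, the remainders are {1, 2, 1, 0}, and the shifted sums are {30, 52, 74, 100}.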

# 10.5. Relational Operators

Given E1 OP E2, where E1 and E2 have equivalent vector types  $\mathtt{T}$  with  $\mathtt{N}$  elements and OP is one of the operators in Table 10-230, the result has a value equivalent to:

```
((E1[0] OP E2[0]) & ... & (E1[N-1] OP E2[N-1]))
```

Table 10-230: Valid Types for Specified Relational Operators

| OP  | Integer Vector Types | Floating-Point Vector Types |
|-----|----------------------|-----------------------------|
| ==  | yes                  | yes                         |
| !=  | yes                  | yes                         |
| <   | yes                  | yes                         |
| >   | yes                  | yes                         |
| <=  | yes                  | yes                         |
| >=  | yes                  | yes                         |
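On a host compiler, GCC's generic vector comparisons behave differently from the rule above. They return a per-element mask (-1 or 0) rather than a single value. The sketch therefore reduces that mask explicitly to obtain `(E1[0] < E2[0]) & ... & (E1[N-1] < E2[N-1])`. It is written for `<` on a stand-in for `vector float`. The typedefs and the `all_lt` name are hypothetical.

```c
/* Host stand-ins for "vector signed int" and "vector float". */
typedef int   v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

/* GCC yields a mask vector for a < b; AND its elements together so the
 * result is 1 only if the relation holds for every element. */
static int all_lt(v4sf a, v4sf b)
{
    v4si m = a < b;
    return (m[0] & m[1] & m[2] & m[3]) != 0;
}
```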



# Index

| $\boldsymbol{A}$                                      | write word channel (spu_writech)5/                                              |
|-------------------------------------------------------|---------------------------------------------------------------------------------|
|                                                       | common intrinsic operations – compare, branch and halt                          |
| alignment                                             | branch indirect and set link if external data                                   |
| align_hint3                                           | (spu_bisled)24                                                                  |
| AltiVec compatibility6                                | halt if compare equal (spu_hcmpeq)28                                            |
| c                                                     | halt if compare greater than (spu_hcmpgt)28                                     |
| C                                                     | vector compare absolute equal (spu_cmpabseq)25, 28                              |
| C library header files125                             | vector compare absolute greater than (spu_cmpabsgt)                             |
| C standard library125                                 | 25                                                                              |
| C++ library header files128                           | vector compare equal (spu_cmpeq)25                                              |
| C++ standard library128                               | vector compare greater than (spu_cmpgt)27                                       |
| common intrinsic operations – arithmetic              | common intrinsic operations - constant formation                                |
| negative vector multiply and add (spu_nmadd)22        | intrinsics                                                                      |
| negative vector multiply and subtract (spu_nmsub)22   | splat scalar to vector (spu_splats)15                                           |
| vector add (spu_add)17                                | common intrinsic operations – control                                           |
| vector add extended (spu_addx)18                      | disable interrupts (spu_idisable)52                                             |
| vector floating-point reciprocal estimate (spu_re)22  | enable interrupts (spu_ienable)53                                               |
| vector floating-point reciprocal square root estimate | move from floating-point status and control register                            |
| (spu_rsqrte)22                                        | (spu_mffpscr)53                                                                 |
| vector generate borrow (spu_genb)18                   | move from special purpose register (spu_mfspr)53                                |
| vector generate borrow extended (spu_genbx)18         | move to floating-point status and control register                              |
| vector generate carry (spu_genc)19                    | (spu_mtfpscr)54                                                                 |
| vector generate carry extended (spu_gencx)19          | move to special purpose register (spu_mtspr)54                                  |
| vector multiply (spu_mul)20                           | stop and signal (spu_stop)54                                                    |
| vector multiply and add (spu_madd)19                  | synchronize (spu_sync)55                                                        |
| vector multiply and shift right (spu_mulsr)21         | synchronize data (spu_dsync)54                                                  |
| vector multiply and subtract (spu_msub)20             | common intrinsic operations – conversion                                        |
| vector multiply even (spu_mule)21                     | convert integer vector to vector float (spu_convtf)16                           |
| vector multiply high (spu_mulh)20                     | convert vector float to signed integer vector                                   |
| vector multiply high high and add (spu_mhhadd)19      | (spu_convts)16                                                                  |
| vector multiply odd (spu_mulo)21                      | convert vector float to unsigned integer vector                                 |
| vector subtract (spu_sub)23                           | (spu_convtu)16                                                                  |
| vector subtract extended (spu_subx)23                 | extend vector (spu_extend)                                                      |
| common intrinsic operations – bits and masking        | round vector double to vector float (spu_roundtf)17                             |
| form select byte mask (spu_maskb)30                   | common intrinsic operations – logical                                           |
| form select byte mask (spu_mask)30                    | OR word across (spu_orx)37                                                      |
| form select word mask (spu_maskw)31                   | vector bit-wise AND (spu_and)33                                                 |
| gather bits from elements (spu_gather)29              | vector bit-wise AND with complement (spu_andc)34                                |
| select bits (spu_sel)31                               | vector bit-wise complement of AND (spu_nand)35                                  |
| shuffle two vectors of bytes (spu_shuffle)32          | vector bit-wise complement of OR (spu_nor)35                                    |
| vector count leading zeros (spu_snume)29              | vector bit-wise equivalent (spu_eqv)34                                          |
| vector count reading zeros (spu_cntz)29               | vector bit-wise equivalent (spu_eqv)34 vector bit-wise exclusive OR (spu_xor)37 |
|                                                       | vector bit-wise exclusive OK (spu_xor)37 vector bit-wise OR (spu_or)36          |
| common intrinsic operations – bytes                   |                                                                                 |
| average of two vectors (spu_avg)                      | vector bit-wise OR with complement (spu_orc)37                                  |
| sum bytes into shorts (spu_sumb)24                    | common intrinsic operations – scalar                                            |
| vector absolute difference (spu_absd)24               | extract vector element from vector (spu_extract)57                              |
| common intrinsic operations – channel control         | insert scalar into specified vector element (spu_insert)                        |
| read channel count (spu_readchent)                    | 58                                                                              |
| read quadword channel (spu_readchqw)56                | promote scalar to vector (spu_promote)59                                        |
| read word channel (spu_readch)                        | common intrinsic operations – shift and rotate                                  |
| write quadword channel (spu_writechqw)57              | quadword rotate left by bits (spu_rlqw)43                                       |



| quadword rotate left and mask by bits (spu_rlmaskqw)41 | macros for double precision rounding modes132         |
|--------------------------------------------------------|-------------------------------------------------------|
|                                                        | macros for single precision floating-point exceptions |
| quadword rotate left and mask by bytes                 |                                                       |
| (spu_rlmaskqwbyte)42                                   | rounding mode for two bits of FLT_ROUNDS132           |
| quadword rotate left and mask by bytes from bit shift  | rounding modes                                        |
| count (spu_rlmaskqwbytebc)42                           | floating-point operations                             |
| quadword rotate left by bytes (spu_rlqwbyte)44         | conversions134                                        |
| quadword rotate left by bytes from bit shift count     | conversion between floating-point and strings135      |
| (spu_rlqwbytebc)45                                     | conversions between floating-point precision135       |
| quadword shift left by bits (spu_slqw)46               | floating-point to integer conversions135              |
| quadword shift left by bytes (spu_slqwbyte)46          | integer to floating-point conversions134              |
| quadword shift left by bytes from bit shift count      | G                                                     |
| (spu_slqwbytebc)47                                     | U                                                     |
| quadword shift right by bits (spu_srqw)49              | generate controls for sub-quadword insertion          |
| quadword shift right by bytes (spu_srqwbyte)51         | si_cbd10                                              |
| quadword shift right by bytes from bit shift count     | si_cbx10                                              |
| spu_srqwbytebc)52                                      | si_cdd10                                              |
| vector rotate left and mask algebraic by bits          | si_cdx10                                              |
| (spu_rlmaska)40                                        | si_chd10                                              |
| vector rotate left and mask by bits (spu_rlmask)39     | si_chx10                                              |
| vector rotate left by bits (spu_rl)38                  | si_cwd11                                              |
| vector shift left by bits (spu_sl)45                   | si_cwx                                                |
| vector shift right algebraic by bits (spu_sra)48       | _                                                     |
| vector shift right by bits (spu_sr)48                  | Н                                                     |
| composite intrinsics (DMA)61                           | header files2                                         |
| spu_mfcdma32                                           | neader thes                                           |
| spu_mfcdma6461                                         | I                                                     |
| spu_mfcstat                                            | inline assembly8                                      |
| constant formation intrinsics                          | interrupt-safe critical sections 83                   |
|                                                        |                                                       |
| si_il                                                  | interrupt-safe critical sections functions            |
| si_ila                                                 | begin interrupt-safe critical section                 |
| si_ilh                                                 | (mfc_begin_critical_section)83                        |
| si_ilhu11                                              | end interrupt-safe critical section                   |
| si_iohl11                                              | (mfc_end_critical_section)83                          |
| control intrinsics                                     | intrinsics                                            |
| si_stopd12                                             | arithmetic                                            |
| D                                                      | bits and mask29                                       |
|                                                        | byte operation24                                      |
| data types                                             | channel control55                                     |
| default alignments3                                    | compare, branch and halt24                            |
| restrict type qualifier7                               | composite (DMA)61                                     |
| single token vector2                                   | constant formation11, 15                              |
| type casting5                                          | control12, 52                                         |
| vector1                                                | conversion16                                          |
| vector literals5                                       | generic and built-ins                                 |
| debug printf()126                                      | logical intrinsics                                    |
|                                                        | low-level specific and generic9                       |
| F                                                      | mapping with scalar operands                          |
| floating-point arithmetic on the SPU131                | scalar                                                |
| floating-point environment                             | shift and rotate                                      |
| exceptions                                             | specific                                              |
| floating-point constants                               | specific casting                                      |
| macros for double precision floating-point exceptions  | specific intrinsics not accessible through generic    |
|                                                        | intrinsics                                            |
| 133                                                    | mumsics10                                             |

| M                                                         | MFC DMA mnemonics                             | 66          |
|-----------------------------------------------------------|-----------------------------------------------|-------------|
| malloc heap127                                            | MFC DMA status                                | 74          |
| •                                                         | MFC DMA status functions                      |             |
| mapping PPU VMX data types to SPU data types2             | acknowledge tag group containing stalled I    | DMA list    |
| PPU VMX intrinsics that are difficult to map to SPU       | commands (mfc_write_list_stall_ack)           | 77          |
| intrinsics86                                              | check availability of atomic command statu    | ıs          |
| PPU VMX intrinsics that map one-to-one with SPU           | (mfc_stat_atomic_status)                      | 78          |
| = <del>-</del>                                            | check availability of list DMA stall-and-no   | tify status |
| intrinsics                                                | (mfc_stat_list_stall_status)                  | 77          |
| SPU data types to PPU VMX data types2                     | check availability of MFC_RdTagStat char      |             |
| SPU intrinsics that are difficult to map to PPU VMX       | (mfc_stat_tag_status)                         |             |
| intrinsics                                                | check availability of tag status update reque | est channel |
| SPU intrinsics that map one-to-one with PPU VMX           | (mfc_stat_tag_update)                         |             |
| intrinsics                                                | check the number of available entries in the  |             |
| with scalar operands                                      | DMA queue (mfc_stat_cmd_queue)                | 74          |
| memory load and store intrinsics                          | read atomic command status (mfc_read_ato      |             |
| si_lqa11                                                  |                                               |             |
| si_lqd11                                                  | read list DMA stall-and-notify status         |             |
| si_lqr12                                                  | (mfc_read_list_stall_status)                  | 77          |
| si_lqx12                                                  | read tag mask indicating MFC tag groups to    |             |
| si_stqa12                                                 | included in query operation (mfc_read_ta      |             |
| si_stqd12                                                 | request that tag status be immediately upda   | _           |
| si_stqr12                                                 | (mfc_write_tag_update_immediate)              |             |
| si_stqx12                                                 | request that tag status be updated            |             |
| MFC atomic update commands69                              | (mfc_write_tag_update)                        | 75          |
| get lock line and create reservation (mfc_getllar)69      | request that tag status be updated for any en |             |
| put lock line if reservation for effective address exists | completion with no outstanding operation      |             |
| (mfc_putllc)70                                            | (mfc_write_tag_update_any)                    |             |
| put lock line unconditional (mfc_putlluc)70               | request that tag status be updated when all   |             |
| MFC DMA commands                                          | groups have no outstanding operation          | chabica tag |
  - move data from effective address to local storage (mfc_get), 66
  - move data from effective address to local storage using MFC list (mfc_getl), 68
  - move data from effective address to local storage using MFC list with barrier (mfc_getlb), 69
  - move data from effective address to local storage using MFC list with fence (mfc_getlf), 69
  - move data from effective address to local storage with barrier (mfc_getb), 67
  - move data from effective address to local storage with fence (mfc_getf), 67
  - move data from local storage to effective address (mfc_put), 66
  - move data from local storage to effective address using MFC list (mfc_putl), 68
  - move data from local storage to effective address using MFC list with barrier (mfc_putlb), 68
  - move data from local storage to effective address using MFC list with fence (mfc_putlf), 68
  - move data from local storage to effective address with barrier (mfc_putb), 66
  - move data from local storage to effective address with fence (mfc_putf), 66
  - (mfc_write_tag_update_all), 74
  - set tag mask to select MFC tag groups to be in query operation (mfc_write_tag_mask)
  - wait for an updated tag status (mfc_read_ta
  - wait for no outstanding operation of all enabled tag groups (mfc_read_tag_status_all)
  - wait for no outstanding operation of any enabled tag group (mfc_read_tag_status_any)
  - wait for the updated status of any enabled tag group (mfc_read_tag_status_immediate)
- MFC macro definitions, 03
- MFC multisource synchronization functions
  - check the status of multisource synchronization (mfc_stat_multi_src_sync_request)
  - request multisource synchronization (mfc_write_multi_src_sync_request), 70
- MFC multisource synchronization request
- MFC SL1 storage control commands, 72
- MFC SL1 storage control functions
  - SL1 data cache range flush (mfc_sdcrf)
  - SL1 data cache range set to zero (mfc_sdcr
  - SL1 data cache range store (mfc_sdcrst)
  - SL1 data cache range touch (mfc_sdcrt)
  - SL1 data cache range touch for store (mfc_sdcrtst), 73



- MFC structures
  - DMA list element for MFC list DMA (mfc_list_element), 63
- MFC synchronization commands, 71
- MFC synchronization functions
  - enqueue mfc_barrier command into DMA queue or stall when queue is full (mfc_barrier), 72
  - enqueue mfc_eieio command into DMA queue or stall when queue is full (mfc_eieio), 72
  - enqueue mfc_sync command into DMA queue or stall when queue is full (mfc_sync), 72
  - send signal (mfc_sndsig), 71
  - send signal with barrier (mfc_sndsigb), 71
  - send signal with fence (mfc_sndsigf), 71
- MFC tag manager, 64
- MFC tag manager functions
  - put queued lock line unconditional (mfc_putqlluc), 70
  - release a group of tags from exclusive use (mfc_multi_tag_release), 65
  - release a tag from exclusive use (mfc_tag_release), 65
  - reserve a group of tags for exclusive use (mfc_multi_tag_reserve), 65
  - reserve a tag for exclusive use (mfc_tag_reserve), 65
- MFC tag manager mnemonics, 65
- MFC utility functions
  - concatenate higher 32 bits and lower 32 bits (mfc_hl2ea), 64
  - extract higher 32 bits from effective address (mfc_ea2h), 64
  - extract lower 32 bits from effective address (mfc_ea2l), 64
  - round up value to next multiple of 128 (mfc_ceil128), 64

**N**

- new and traditional C++ library header files, 129
- no operation intrinsics
  - si_lnop, 11
  - si_nop, 11
- non-supported language features, 129

**O**

- operator overloading for vector data types, 139
- operators
  - address, 4
  - assignment, 4
  - sizeof(), 4

**P**

- pointers
  - arithmetic and pointer dereferencing, 4
- PPU intrinsics
  - change thread priority to high (cctph), 89
  - change thread priority to low (cctpl), 89
  - change thread priority to medium (cctpm)
  - convert double to word (fctiw)
  - convert double to doubleword (fctid), 94
  - convert double to doubleword with round towards zero (fctidz), 95
  - convert double to word with round towards zero (fctiwz), 95
  - convert doubleword to double (fcfid), 94
  - count leading doubleword zeros (cntlzd)
  - count leading word zeros (cntlzw)
  - data cache block flush (dcbf)
  - data cache block set to zero (dcbz)
  - data cache block store (dcbst)
  - data cache block touch (dcbt)
  - data cache block touch for store (dcbtst)
  - delay 10 cycles at dispatch (db10cyc)
  - delay 12 cycles at dispatch (db12cyc)
  - delay 16 cycles at dispatch (db16cyc)
  - delay 8 cycles at dispatch (_db8cyc)
  - double absolute value (fabs)
  - double fused multiply and add (fmadd)
  - double fused multiply and subtract (fmsub)
  - double fused negative multiply and add (fnmac
  - double fused negative multiply and subtract (fnmsub), 98
  - double multiply (fmul)
  - double negative (fnabs)
  - double reciprocal square root estimate (frsqrte)
  - double square root (fsqrt)
  - enforce in-order execution of I/O (eieio)
  - float absolute value (fabsf)
  - float fused multiply and add (fmadds)
  - float fused multiply and subtract (fmsubs)
  - float fused negative multiply and add (fnmadd
  - float fused negative multiply and subtract (fnm
  - float multiply (fmuls)
  - float negative (fnabsf)
  - float reciprocal estimate (fres)
  - float square root (fsqrts)
  - floating-point select of double (fsel)
  - floating-point select of float (fsels)
  - instruction cache block invalidate (_icbi)
  - instruction sync (_isync), 100
  - light weight sync (_lwsync)
  - load doubleword with reserved (_ldarx)
  - load reversed doubleword (_ldbrx)
  - load reversed halfword (_lhbrx)
  - load reversed word (_lwbrx)
  - load word with reserved (_lwarx)
  - move from floating-point status and control register (mffs)
  - move from special purpose register (mfspr)
  - move from time base (mftb)



  - move to special purpose register (mtspr), 104
  - multiply double unsigned word, high part (mulhdu), 105
  - multiply doubleword, high part (mulhd), 104
  - multiply unsigned word, high part (mulhwu), 105
  - multiply word, high part (mulhw), 105
  - no operation (nop), 105
  - reset bit of FPSCR (mtfsb0), 103
  - rotate left doubleword immediate then clear (rldic), 107
  - rotate left doubleword immediate then clear left (rldicl), 107
  - rotate left doubleword immediate then clear right (rldicr), 108
  - rotate left doubleword immediate then mask insert (rldimi)
  - rotate left doubleword then clear left (rldcl), 106
  - rotate left doubleword then clear right (rldcr), 107
  - rotate left immediate then mask insert (rlwimi), 108
  - rotate left word immediate then AND with mask (rlwinm)
  - rotate left word then AND with mask (rlwnm), 109
  - round to single precision (frsp), 99
  - save and set the FPSCR (setflm), 109
  - set bit of FPSCR (mtfsb1), 103
  - set field of FPSCR (mtfsfi), 104
  - set fields in FPSCR (mtfsf), 104
  - set the number of blocks to stream (protected_stream_count), 105
  - set up a stream (protected_stream_set), 106
  - set up an unlimited stream (protected_unlimited_stream_set), 106
  - set up streaming data (dcbt_TH1000), 92
  - start all streams (protected_stream_go), 106
  - start or stop streaming data (dcbt_TH1010), 92
  - stop a stream (protected_stream_stop), 106
  - stop all streams (protected_stream_stop_all), 106
  - store doubleword conditional (_stdcx), 109
  - store reversed doubleword (_stdbrx), 109
  - store reversed halfword (sthbrx), 110
  - store reversed word (_stwbrx), 110
  - store word conditional (_stwcx), 111
  - sync (sync), 111
- PPU VMX intrinsics
  - extract vector element from vector (vec_extract), 114
  - insert scalar into specified vector element (vec_insert)
  - load vector left indexed (vec_lvlx), 116
  - load vector left indexed last (vec_lvlxl), 117
  - load vector right indexed (vec_lvrx), 118
  - promote scalar to vector (vec_promote), 124
  - splat scalar to vector (vec_splats), 124
  - store vector left indexed (vec_stvlx), 120
  - store vector left indexed last (vec_stvlxl), 121
  - store vector right indexed (vec_stvrx), 122
  - store vector right indexed last (vec_stvrxl), 123
  - stream control operators that have been deprecated on the PPU, 113
- programmer directed branch prediction, 7
- programming support for MFC input and output, 63

**R**

- restrict type qualifier, 7

**S**

- SPU decrementer, 80
- SPU decrementer functions
  - load a value to decrementer (spu_write_decrementer), 81
  - read current value of decrementer (spu_read_decrementer), 80
- SPU event, 81
- SPU event functions
  - acknowledge events (spu_write_event_ack), 82
  - check availability of event status (spu_stat_event_status), 81
  - read event status mask (spu_read_event_mask), 82
  - read event status or stall until status is available (spu_read_event_status), 81
  - select events to be monitored by event status (spu_write_event_mask), 82
- SPU mailbox functions
  - get available capacity of SPU outbound interrupt mailbox (spu_stat_out_intr_mbox), 80
  - get available capacity of SPU outbound mailbox (spu_stat_out_mbox), 80
  - get the number of data entries in SPU inbound mailbox (spu_stat_in_mbox), 80
  - read next data entry in SPU inbound mailbox (spu_read_in_mbox)
  - send data to SPU outbound interrupt mailbox (spu_write_out_intr_mbox), 80
  - send data to SPU outbound mailbox (spu_write_out_mbox), 80
- SPU mailboxes
- SPU signal notification
- SPU signal notification functions
  - atomically read and clear signal notification 1 channel (spu_read_signal1)
  - atomically read and clear signal notification 2 channel (spu_read_signal2)
  - check if pending signals exist on signal notification 1 channel (spu_stat_signal1)
  - check if pending signals exist on signal notification 2 channel (spu_stat_signal2)
- SPU state management functions



  - read current SPU machine status (spu_read_machine_status)
  - read SPU SRR0 (spu_read_srr0), 82
  - write to SPU SRR0 (spu_write_srr0), 82
- SPU target definition, 8

**V**

- vector literals
  - alternate format (for AltiVec compatibility), 6
  - standard format, 6

**End of Document**