1. Computer Architecture: Static Instruction Scheduling
Prof. Onur Mutlu, Carnegie Mellon University

2. A Note on This Lecture
These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 21: Static Instruction Scheduling. Video of that lecture: http://www.youtube.com/watch?v=XdDUn2WtkRg

3. Higher (uArch) Level Simulation
Goal: Get an idea of the impact of an optimization on performance (or another metric) quickly.
Idea: Simulate the cycle-level behavior of the processor without modeling the logic required to enable execution (i.e., no need for control and data path).
Upside:
+ Fast: enables faster exploration of techniques and the design space.
+ Flexible: can change the modeled microarchitecture.
Downside:
-- Inaccuracy: the cycle count may not be accurate.
-- Cannot provide cycle time (not a goal either, however).
-- Still need a logic-level implementation of the final design.

4. Review: Systolic Architectures
Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs to achieve high throughput without increasing memory bandwidth requirements.
Differences from pipelining:
- The array structure can be non-linear and multi-dimensional.
- PE connections can be multidirectional (and of different speeds).
- PEs can have local memory and execute kernels (rather than a piece of an instruction).

5. Review: Systolic Architectures
H. T. Kung, "Why Systolic Architectures?," IEEE Computer, 1982.
Memory is the heart; the PEs are cells. Memory pulses data through the cells.

6. Pipeline Parallel Programming Model
[Figure.]

7. Review: Decoupled Access/Execute
Motivation: Tomasulo's algorithm was considered too complex to implement in the 1980s, before HPS and the Pentium Pro.
Idea: Decouple operand access and execution via two separate instruction streams that communicate via ISA-visible queues.
Smith, "Decoupled Access/Execute Computer Architectures," ISCA 1982; ACM TOCS 1984.

8. Review: Decoupled Access/Execute
Advantages:
+ The execute stream can run ahead of the access stream and vice versa.
+ If A takes a cache miss, E can perform useful work.
+ If A hits in the cache, it supplies data to the lagging E.
+ Queues reduce the number of required registers.
+ Limited out-of-order execution without wakeup/select complexity.
Disadvantages:
-- Compiler support is needed to partition the program and manage the queues; it determines the amount of decoupling.
-- Branch instructions require synchronization between A and E.
-- Multiple instruction streams (can be done with a single one, though).
9. Today: Static Scheduling
Enabler of better static scheduling: block enlargement.
Predicated execution. Loop unrolling. Trace. Superblock. Hyperblock. Block-structured ISA.

10. Static Instruction Scheduling (with a Slight Focus on VLIW)

11. Key Questions
Q1. How do we find independent instructions to fetch/execute?
Q2. How do we enable more compiler optimizations? E.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination, ...
Q3. How do we increase the instruction fetch rate, i.e., gain the ability to fetch more instructions per cycle?
A: Enabling the compiler to optimize across a larger number of instructions that will execute straight-line (without branches getting in the way) eases all of the above.

12. Review: Loop Unrolling
Idea: Replicate the loop body multiple times within an iteration.
+ Reduces loop maintenance overhead (induction variable increment, loop condition test).
+ Enlarges the basic block (and the analysis scope); enables code optimization and scheduling opportunities.
-- What if the iteration count is not a multiple of the unroll factor? (Extra code is needed to handle this.)
-- Increases code size.
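The transformation is easiest to see in code. Below is a minimal Python sketch of an unroll factor of 4, including the remainder loop the slide warns about; the function names and data are illustrative.

    def sum_rolled(a):
        s = 0
        for i in range(len(a)):       # one test + increment per element
            s += a[i]
        return s

    def sum_unrolled(a):
        s = 0
        n = len(a)
        i = 0
        while i + 4 <= n:             # one test + increment per four elements
            s += a[i]
            s += a[i + 1]
            s += a[i + 2]
            s += a[i + 3]
            i += 4
        while i < n:                  # remainder loop: handles n % 4 leftovers
            s += a[i]
            i += 1
        return s

    assert sum_rolled(list(range(10))) == sum_unrolled(list(range(10))) == 45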
13. VLIW: Finding Independent Operations
Within a basic block, there is limited instruction-level parallelism. To find multiple instructions to execute in parallel, the compiler needs to consider multiple basic blocks.
Problem: Moving an instruction above a branch is unsafe because the instruction is not guaranteed to be executed.
Idea: Enlarge blocks at compile time by finding the frequently executed paths.
Trace scheduling. Superblock scheduling. Hyperblock scheduling.

14. Safety and Legality in Code Motion
Two characteristics of speculative code motion:
Safety: whether or not spurious exceptions may occur.
Legality: whether or not the result will always be correct.
Four possible types of code motion. [Figure.]

15. Code Movement Constraints
Downward: when moving an operation from a BB to one of its destination BBs,
- all the other destination basic blocks should still be able to use the result of the operation;
- the other source BBs of the destination BB should not be disturbed.
Upward: when moving an operation from a BB to its source BBs,
- register values required by the other destination BBs must not be destroyed;
- the movement must not cause new exceptions.

16. Trace Scheduling
Trace: a frequently executed path in the control-flow graph (it has multiple side entrances and multiple side exits).
Idea: Find independent operations within a trace to pack into VLIW instructions.
Traces are determined via profiling. The compiler adds fix-up code for correctness (if a side entrance or side exit of a trace is exercised at runtime, the corresponding fix-up code is executed).

17. Trace Scheduling (II)
There may be conditional branches out of the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances). These control-flow transitions are ignored during trace scheduling. After scheduling, fix-up/bookkeeping code is inserted to ensure the correct execution of off-trace code.
Fisher, "Trace scheduling: A technique for global microcode compaction," IEEE TC, 1981.

18. Trace Scheduling Idea
[Figure.]

19. Trace Scheduling (III)
[Figure.] What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?

20. Trace Scheduling (IV)
[Figure.]

21. Trace Scheduling (V)
What bookkeeping is required when Instr 5 moves above the side entrance in the trace?

22. Trace Scheduling (VI)
[Figure.]

23. Trace Scheduling Fixup Code Issues
Sometimes instructions need to be copied more than once to ensure correctness on all paths (see C in the figure).

24. Trace Scheduling Overview
Trace selection:
- select a seed block (the highest-frequency basic block);
- extend the trace along the highest-frequency edges, forward (successor of the last block of the trace) and backward (predecessor of the first block of the trace);
- don't cross a loop back edge;
- bound max_trace_length heuristically.
Trace scheduling:
- build a data precedence graph for the whole trace;
- perform list scheduling and allocate registers;
- add compensation code to maintain semantic correctness.
Speculative code motion (upward): move an instruction above a branch if it is safe. (A sketch of trace selection follows.)
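To make the trace-selection heuristic concrete, here is a small Python sketch under assumed data structures: per-block profile counts, per-edge frequencies, and a precomputed set of loop back edges. All names are hypothetical, not from any real compiler.

    def select_trace(freq, succs, preds, edge_freq, back_edges, max_len=8):
        seed = max(freq, key=freq.get)            # highest-frequency basic block
        trace = [seed]
        visited = {seed}
        b = seed
        # grow forward from the last block of the trace
        while len(trace) < max_len and succs.get(b):
            nxt = max(succs[b], key=lambda s: edge_freq[(b, s)])
            if nxt in visited or (b, nxt) in back_edges:
                break                             # don't revisit or cross a back edge
            trace.append(nxt); visited.add(nxt); b = nxt
        b = seed
        # grow backward from the first block of the trace
        while len(trace) < max_len and preds.get(b):
            prv = max(preds[b], key=lambda p: edge_freq[(p, b)])
            if prv in visited or (prv, b) in back_edges:
                break
            trace.insert(0, prv); visited.add(prv); b = prv
        return trace

    # tiny if-then-else diamond: the hot path A -> B -> D is selected
    succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
    freq = {"A": 100, "B": 90, "C": 10, "D": 100}
    edge_freq = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
    print(select_trace(freq, succs, preds, edge_freq, back_edges=set()))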
"description": "Trace Selection. select seed block (the highest frequency basic block) extend trace (along the highest frequency edges) forward (successor of the last block of the trace) backward (predecessor of the first block of the trace) don’t cross loop back edge. bound max_trace_length heuristically. Trace Scheduling. build data precedence graph for a whole trace. perform list scheduling and allocate registers. add compensation code to maintain semantic correctness. Speculative Code Motion (upward) move an instruction above a branch if safe.", "width": "800" } 25 Data Precedence Graph { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/25/Data+Precedence+Graph.jpg", "name": "Data Precedence Graph", "description": "Data Precedence Graph", "width": "800" } 26 List Scheduling Assign priority to each instructionInitialize ready list that holds all ready instructions Ready = data ready and can be scheduled Choose one ready instruction I from ready list with the highest priority Possibly using tie-breaking heuristics Insert I into schedule Making sure resource constraints are satisfied Add those instructions whose precedence constraints are now satisfied into the ready list { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/26/List+Scheduling+Assign+priority+to+each+instruction.jpg", "name": "List Scheduling Assign priority to each instruction", "description": "Initialize ready list that holds all ready instructions. Ready = data ready and can be scheduled. Choose one ready instruction I from ready list with the highest priority. Possibly using tie-breaking heuristics. Insert I into schedule. Making sure resource constraints are satisfied. Add those instructions whose precedence constraints are now satisfied into the ready list.", "width": "800" } 27 Instruction Prioritization HeuristicsNumber of descendants in precedence graph Maximum latency from root node of precedence graph Length of operation latency Ranking of paths based on importance Combination of above { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/27/Instruction+Prioritization+Heuristics.jpg", "name": "Instruction Prioritization Heuristics", "description": "Number of descendants in precedence graph. Maximum latency from root node of precedence graph. Length of operation latency. Ranking of paths based on importance. Combination of above.", "width": "800" } 28 VLIW List Scheduling Assign PrioritiesCompute Data Ready List - all operations whose predecessors have been scheduled. Select from DRL in priority order while checking resource constraints Add newly ready operations to DRL and repeat for next instruction 4-wide VLIW Data Ready List 1 {1} 6 3 4 5 {2,3,4,5,6} 9 2 7 8 {2,7,8,9} 12 10 11 {10,11,12} 13 {13} { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/28/VLIW+List+Scheduling+Assign+Priorities.jpg", "name": "VLIW List Scheduling Assign Priorities", "description": "Compute Data Ready List - all operations whose predecessors have been scheduled. Select from DRL in priority order while checking resource constraints. Add newly ready operations to DRL and repeat for next instruction. 4-wide VLIW. Data Ready List. 1. {1} 6. 3. 4. 5. {2,3,4,5,6} 9. 2. 7. 8. {2,7,8,9} 12. 10. 11. {10,11,12} 13. 
{13}", "width": "800" } 29 Trace Scheduling Example (I) { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/29/Trace+Scheduling+Example+%28I%29.jpg", "name": "Trace Scheduling Example (I)", "description": "Trace Scheduling Example (I)", "width": "800" } 30 Trace Scheduling Example (II) { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/30/Trace+Scheduling+Example+%28II%29.jpg", "name": "Trace Scheduling Example (II)", "description": "Trace Scheduling Example (II)", "width": "800" } 31 Trace Scheduling Example (III) { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/31/Trace+Scheduling+Example+%28III%29.jpg", "name": "Trace Scheduling Example (III)", "description": "Trace Scheduling Example (III)", "width": "800" } 32 Trace Scheduling Example (IV) { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/32/Trace+Scheduling+Example+%28IV%29.jpg", "name": "Trace Scheduling Example (IV)", "description": "Trace Scheduling Example (IV)", "width": "800" } 33 Trace Scheduling Example (V) { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/33/Trace+Scheduling+Example+%28V%29.jpg", "name": "Trace Scheduling Example (V)", "description": "Trace Scheduling Example (V)", "width": "800" } 34 Trace Scheduling TradeoffsAdvantages + Enables the finding of more independent instructions  fewer NOPs in a VLIW instruction Disadvantages -- Profile dependent -- What if dynamic path deviates from trace  lots of NOPs in the VLIW instructions -- Code bloat and additional fix-up code executed -- Due to side entrances and side exits -- Infrequent paths interfere with the frequent path -- Effectiveness depends on the bias of branches -- Unbiased branches  smaller traces  less opportunity for finding independent instructions { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/34/Trace+Scheduling+Tradeoffs.jpg", "name": "Trace Scheduling Tradeoffs", "description": "Advantages. + Enables the finding of more independent instructions  fewer NOPs in a VLIW instruction. Disadvantages. -- Profile dependent. -- What if dynamic path deviates from trace  lots of NOPs in the VLIW instructions. -- Code bloat and additional fix-up code executed. -- Due to side entrances and side exits. -- Infrequent paths interfere with the frequent path. -- Effectiveness depends on the bias of branches. -- Unbiased branches  smaller traces  less opportunity for finding independent instructions.", "width": "800" } 35 Superblock SchedulingTrace: multiple entry, multiple exit block Superblock: single-entry, multiple exit block A trace with side entrances are eliminated Infrequent paths do not interfere with the frequent path + More optimization/scheduling opportunity than traces + Eliminates “difficult” bookkeeping due to side entrances Hwu+, “The Superblock: An Effective Technique for VLIW and superscalar compilation,” J of SC 1991. { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/35/Superblock+Scheduling.jpg", "name": "Superblock Scheduling", "description": "Trace: multiple entry, multiple exit block. Superblock: single-entry, multiple exit block. A trace with side entrances are eliminated. 
36. Can You Do This with a Trace?
Original code: opA: mul r1,r2,3, then a branch (frequency 99 to opC, 1 to opB); opB: add r2,r2,1 falls through into opC: mul r3,r2,3.
After superblock formation: the frequent path {opA, opC} becomes a superblock; the infrequent path through opB gets its own tail-duplicated copy opC': mul r3,r2,3.
After common subexpression elimination: since r2 is unchanged on the frequent path, opC inside the superblock becomes mov r3,r1; the infrequent path still executes opB and opC': mul r3,r2,3.

37. Superblock Scheduling Shortcomings
-- Still profile-dependent.
-- No single frequently executed path if there is an unbiased branch; this reduces the size of superblocks.
-- Code bloat and additional fix-up code executed, due to side exits.

38. Hyperblock Scheduling
Idea: Use predication support to eliminate unbiased branches and increase the size of superblocks.
Hyperblock: a single-entry, multiple-exit block with internal control flow eliminated using predication (if-conversion).
Advantages:
+ Reduces the effect of unbiased branches on scheduling block size.
Disadvantages:
-- Requires predicated execution support, with all the disadvantages of predicated execution.

39. Hyperblock Formation (I)
Steps: 1. block selection; 2. tail duplication; 3. if-conversion.
Block selection: select a subset of BBs for inclusion in the hyperblock. This is a difficult problem, handled with a weighted cost/benefit function (height overhead, resource overhead, dependency overhead, branch elimination benefit), weighted by frequency.
Mahlke et al., "Effective Compiler Support for Predicated Execution Using the Hyperblock," MICRO 1992. [Figure: example CFG (BB1-BB6) annotated with edge frequencies.]
40. Hyperblock Formation (II)
Tail duplication: same as with superblock formation. [Figure: BB6 is duplicated as BB6' so that the side entrance into the frequent path is removed; edge frequencies are updated accordingly.]

41. Hyperblock Formation (III)
If-convert (predicate) the intra-hyperblock branches. [Figure: BB1 computes p1,p2 = CMPP; BB2 executes if p1 and BB3 executes if p2; the branch between them is eliminated.]
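A pure-Python stand-in for what if-conversion does to the branch between BB2 and BB3: both sides execute, each guarded by a predicate, and the resulting block has no internal control flow. Real predicated ISAs guard each instruction; the conditional expression here is just the closest Python analogue, and the operations are illustrative.

    def before(x, a, b):
        if x > 0:              # unbiased branch: BB2 vs. BB3
            r = a + 1          # BB2
        else:
            r = b * 2          # BB3
        return r               # BB4 (join)

    def after(x, a, b):
        p1 = x > 0             # p1, p2 = CMPP
        p2 = not p1
        t2 = a + 1             # BB2's work, executed under p1
        t3 = b * 2             # BB3's work, executed under p2
        r = t2 if p1 else t3   # predicated commit of the live-out value
        return r               # both sides ran; control flow is gone

    assert before(3, 10, 20) == after(3, 10, 20)
    assert before(-1, 10, 20) == after(-1, 10, 20)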
42. Can We Do Better?
A hyperblock is still profile dependent, requires fix-up code, and requires predication support.
Alternative: single-entry, single-exit enlarged blocks, i.e., a block-structured ISA. It optimizes multiple paths (and can use predication to enlarge blocks), and needs no fix-up code (duplication instead of fix-up).

43. Block Structured ISA
Blocks (larger than single instructions) are atomic (all-or-none) operations: either all of the block is committed or none of it.
The compiler enlarges blocks by combining basic blocks with their control-flow successors. Branches within the enlarged block are converted to "fault" operations: if the fault operation evaluates to true, the block is discarded and the target of the fault is fetched.
Melvin and Patt, "Enhancing Instruction Scheduling with a Block-Structured ISA," IJPP 1995.

44. Block Structured ISA (II)
Advantages:
+ Larger atomic blocks → larger units can be fetched from the I-cache.
+ Aggressive compiler optimizations (e.g., reordering) can be enabled within atomic blocks (no side entries or exits).
+ Can explicitly represent dependencies among operations within an enlarged block.
Disadvantages:
-- "Fault operations" can lead to wasted work (atomicity).
-- Code bloat (multiple copies of the same basic block exist in the binary and possibly in the I-cache).
-- Need to predict which enlarged block comes next.
Optimizations: within an enlarged block, the compiler can perform optimizations that cannot normally or easily be performed across basic blocks.

45. Block Structured ISA (III)
Hao et al., "Increasing the instruction fetch rate via block-structured instruction set architectures," MICRO 1996.

46. Superblock vs. BS-ISA
Superblock: single-entry, multiple-exit code block; not atomic; the compiler inserts fix-up code on superblock side exits.
BS-ISA blocks: single-entry, single-exit; atomic; must roll back to the beginning of the block on a fault.

47. Superblock vs. BS-ISA (II)
Superblock:
+ No ISA support needed.
-- Optimizes for only one frequently executed path.
-- Not good if the dynamic path deviates from the profiled path → missed opportunity to optimize another path.
Block-structured ISA:
+ Enables optimization of multiple paths and their dynamic selection.
+ Dynamic prediction chooses the next enlarged block; can adapt to changes in frequently executed paths at run time.
+ Atomicity can enable more aggressive code optimization.
-- Code bloat becomes severe as more blocks are combined.
-- Requires "next enlarged block" prediction, plus ISA and hardware support.
-- More wasted work on a "fault" due to the atomicity requirement.

48. Summary: Larger Code Blocks
[Figure.]

49. Summary and Questions
Trace, superblock, hyperblock, block-structured ISA: how many entries and how many exits does each of them have? What are the corresponding benefits and downsides?
Common benefits: enable and enlarge the scope of code optimizations; reduce fetch breaks; increase fetch rate.
Common downsides: code bloat (code size increase); wasted work if control flow deviates from the enlarged block's path.

50. IA-64: A Complicated VLIW
Recommended reading: Huck et al., "Introducing the IA-64 Architecture," IEEE Micro 2000.
{ "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/50/IA-64%3A+A+Complicated+VLIW.jpg", "name": "IA-64: A Complicated VLIW", "description": "Recommended reading: Huck et al., Introducing the IA-64 Architecture, IEEE Micro 2000.", "width": "800" } 51 EPIC – Intel IA-64 ArchitectureGets rid of lock-step execution of instructions within a VLIW instruction Idea: More ISA support for static scheduling and parallelization Specify dependencies within and between VLIW instructions (explicitly parallel) + No lock-step execution + Static reordering of stores and loads + dynamic checking -- Hardware needs to perform dependency checking (albeit aided by software) -- Other disadvantages of VLIW still exist Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro, Sep/Oct 2000. { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/51/EPIC+%E2%80%93+Intel+IA-64+Architecture.jpg", "name": "EPIC – Intel IA-64 Architecture", "description": "Gets rid of lock-step execution of instructions within a VLIW instruction. Idea: More ISA support for static scheduling and parallelization. Specify dependencies within and between VLIW instructions (explicitly parallel) + No lock-step execution. + Static reordering of stores and loads + dynamic checking. -- Hardware needs to perform dependency checking (albeit aided by software) -- Other disadvantages of VLIW still exist. Huck et al., Introducing the IA-64 Architecture, IEEE Micro, Sep/Oct 2000.", "width": "800" } 52 IA-64 Instructions IA-64 “Bundle” (~EPIC Instruction)Total of 128 bits Contains three IA-64 instructions Template bits in each bundle specify dependencies within a bundle \ IA-64 Instruction Fixed-length 41 bits long Contains three 7-bit register specifiers Contains a 6-bit field for specifying one of the 64 one-bit predicate registers { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/52/IA-64+Instructions+IA-64+Bundle+%28%7EEPIC+Instruction%29.jpg", "name": "IA-64 Instructions IA-64 Bundle (~EPIC Instruction)", "description": "Total of 128 bits. Contains three IA-64 instructions. Template bits in each bundle specify dependencies within a bundle. \ IA-64 Instruction. Fixed-length 41 bits long. Contains three 7-bit register specifiers. Contains a 6-bit field for specifying one of the 64 one-bit predicate registers.", "width": "800" } 53 IA-64 Instruction Bundles and GroupsGroups of instructions can be executed safely in parallel Marked by “stop bits” Bundles are for packaging Groups can span multiple bundles Alleviates recompilation need somewhat { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://slideplayer.com/9261405/27/images/53/IA-64+Instruction+Bundles+and+Groups.jpg", "name": "IA-64 Instruction Bundles and Groups", "description": "Groups of instructions can be executed safely in parallel. Marked by stop bits Bundles are for packaging. Groups can span multiple bundles. 
54. Template Bits
The template bits specify two things.
Stop information: the boundary of independent instructions.
Functional unit information: where each instruction should be routed.

55. Non-Faulting Loads and Exception Propagation
ld.s fetches speculatively from memory, i.e., any exception due to ld.s is suppressed. If ld.s r1 did not cause an exception, then chk.s r1 is a NOP; otherwise a branch is taken (to execute some compensation code).
[Figure: unsafe code motion hoists ld r1=[a] above the branch as ld.s r1=[a]; chk.s r1 remains at the original point before use=r1, with recovery code ld r1=[a].]

56. Non-Faulting Loads and Exception Propagation in IA-64
Load data can be speculatively consumed prior to the check; the "speculation" status is propagated with the speculated data. Any instruction that uses a speculative result also becomes speculative itself (i.e., its exceptions are suppressed). chk.s checks the entire dataflow sequence for exceptions.
[Figure: both ld.s r1=[a] and use=r1 are hoisted above the branch; chk.s use remains at the original point, with recovery code ld r1=[a]; use=r1.]

57. Aggressive ST-LD Reordering in IA-64
ld.a starts monitoring for any store to the same address as the advanced load. If no aliasing has occurred since the ld.a, ld.c is a NOP. If aliasing has occurred, ld.c re-loads from memory.
[Figure: ld r1=[x] is hoisted above a potentially aliasing st [?] as ld.a r1=[x]; ld.c r1=[x] remains before use=r1.]
58. Aggressive ST-LD Reordering in IA-64
[Figure: both the advanced load and its use are hoisted; ld.a r1=[x] and use=r1 move above the potentially aliasing st [?], and chk.a remains at the original point with recovery code ld r1=[x]; use=r1.]
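A toy Python model of the ld.a/ld.c protocol from the two slides above (the hardware structure it abstracts is IA-64's ALAT). Memory and the monitoring table are plain Python objects, and the function names mirror the instruction mnemonics rather than any real API; chk.a differs only in branching to recovery code instead of silently re-loading, so the hoisted use can be redone too.

    memory = {}
    alat = set()                       # addresses with a live advanced load

    def ld_a(addr):
        alat.add(addr)                 # start monitoring stores to addr
        return memory.get(addr, 0)     # the speculative (advanced) load

    def st(addr, value):
        memory[addr] = value
        alat.discard(addr)             # a store to the address kills the entry

    def ld_c(addr, spec_value):
        if addr in alat:
            return spec_value          # no alias occurred: ld.c is a NOP
        return memory.get(addr, 0)     # alias occurred: re-load from memory

    st(100, 1)
    r1 = ld_a(100)                     # load hoisted above a may-alias store
    st(100, 2)                         # the store turns out to alias
    r1 = ld_c(100, r1)                 # check detects it and re-loads
    assert r1 == 2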



1. Steven Bogaerts, Assistant Professor of Computer Science, Wittenberg University, Springfield, OH. Joshua Stough, Assistant Professor of Computer Science, Washington and Lee University, Lexington, VA. http://cs.wlu.edu/~stough/SC12

2. Getting Python
Easy! www.python.org, Download, Python 2.7.3.
2.x or 3.x? 3.x has some changes to the base language (not backwards compatible): better handling of Unicode, exception chaining, and more. Many third-party libraries still support only 2.x, and most current Linux distributions and Macs use 2.x as the default, so we'll stick with 2.x here.

3. Why Python?
Simple syntax (as we'll demonstrate). No variable declarations, and variables can hold any type. Automatic garbage collection; no explicit memory management. This allows consideration of interesting problems sooner.
Students definitely need to learn the concepts Python brushes over, but not necessarily in the first course or two. For instance, what is the meaning of each const in the following?
const string & foo(const int * const p) const;

4. Reasons for the crash course: so you can follow the rest of our presentation, and to demonstrate the kinds of concepts you can consider early on with Python in CS1. See pythonCrashCourse.py.

5. Parallelism in Python. Our purpose: for learning, not for all-out speed. Options: pprocess, Celery, MPI4Py, Parallel Python, and the multiprocessing module.

6. The multiprocessing module: comparatively simple, with good documentation. Comes with Python 2.6+. Does not work in IDLE: edit with any editor, then run at a terminal. You might need to set the PYTHONPATH environment variable to your Python installation's Lib directory; you could use a batch file:
SET PYTHONPATH="C:\Program Files\Python\2.7.3\Lib"
"C:\Program Files\Python\2.7.3\python.exe"
Then use Python's import command to load a file. So how do we teach parallelism with the multiprocessing module?
{ "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://images.slideplayer.com/4/1466080/slides/slide_6.jpg", "name": "Comparatively simple Good documentation Comes with Python 2.6+ Does not work in IDLE Edit with any editor, then run at terminal Might need to set PYTHONPATH environment variable to your Python installations Lib directory Could use a batch file: SET PYTHONPATH= C:\Program Files\Python\2.7.3\Lib C:\Program Files\Python\2.7.3\python.exe Then use Python import command to load a file So how do we teach the parallelism with the multiprocessing module", "description": "Comparatively simple Good documentation Comes with Python 2.6+ Does not work in IDLE Edit with any editor, then run at terminal Might need to set PYTHONPATH environment variable to your Python installations Lib directory Could use a batch file: SET PYTHONPATH= C:\Program Files\Python\2.7.3\Lib C:\Program Files\Python\2.7.3\python.exe Then use Python import command to load a file So how do we teach the parallelism with the multiprocessing module", "width": "800" } 7 Using the Python Multiprocessing Module { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://images.slideplayer.com/4/1466080/slides/slide_7.jpg", "name": "Using the Python Multiprocessing Module", "description": "Using the Python Multiprocessing Module", "width": "800" } 8 First attempt: Fall 2009 Tried parallelism too early in the semester! (about 1/3 of the way through the course) Introduction of some concepts needed better organization Fall 2010 and again in Fall 2011 Concepts introduced much later (about 3/4 of the way through the course) Now a smooth integration with the rest of the course Students having this CS1 experience (and related experiences in CS2, etc.) have shown strong understanding of parallelism before beginning our Sequential and Parallel Algorithms course { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://images.slideplayer.com/4/1466080/slides/slide_8.jpg", "name": "First attempt: Fall 2009 Tried parallelism too early in the semester.", "description": "(about 1/3 of the way through the course) Introduction of some concepts needed better organization Fall 2010 and again in Fall 2011 Concepts introduced much later (about 3/4 of the way through the course) Now a smooth integration with the rest of the course Students having this CS1 experience (and related experiences in CS2, etc.) 
have shown strong understanding of parallelism before beginning our Sequential and Parallel Algorithms course.

9. Yes, it is a new topic, and yes, a little something might need to be cut. We ended up shifting concepts that are also covered in other courses: our CS2 covers writing classes in great detail, so much less of that is now in CS1. But parallelism also serves as a great complement to the rest of CS1 (and other courses, in different ways), and a great medium to study and review core CS1 topics.

10. We do some non-Python introduction first: the world is obviously parallel. Big-picture descriptions of some applications. Physical activities: low-level, a binary adder; higher-level, card sorting. Terminology and history. Communication: shared memory vs. message passing.
11 All materials are on the website, and students follow along on their own computers. Big picture on slides: an overview at the start, a cheat sheet when done. Heavily-commented code illustrates the details, with some completed examples and some exercises. We pause after each section for students to fill in "Key Ideas" sections.

12 Process: a running program; it keeps track of the current instruction and data. On a single-core processor only one process actually runs at a time; many processes are active at once, and the OS goes from one to another via a context switch. Threads: a process can contain multiple threads, things that can/should happen at the same time. On a multi-core processor, multiple threads of a given process can run at the same time.

13 Tuples: the comma is required for a length-1 tuple, and optional after the last element for length > 1. Keyword arguments: for example, func(y=14, x=27). from random import randint: randint(low, high) includes both low and high. from time import time, sleep: time() gives the current time in seconds; call it a second time and subtract for elapsed time. sleep(seconds) sleeps for that amount of time.
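The warm-up items above, as a small runnable sketch (Python 2.7 or 3; the function func and the particular values are ours, for illustration):

    from random import randint
    from time import time, sleep

    def func(x, y):                # illustrative function for keyword arguments
        return x - y

    single = (42,)                 # comma required for a length-1 tuple
    pair = (27, 14)                # comma optional after the last element
    print(func(y=14, x=27))        # keyword arguments in any order: prints 13
    print(randint(1, 6))           # both endpoints included: one of 1..6
    start = time()                 # current time in seconds
    sleep(0.5)                     # pause for half a second
    print(time() - start)          # subtract two readings for elapsed time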
14 from multiprocessing import *. Create and start a process: procVar = Process(target=funcNoParen, args=tupleOfArgs), then procVar.start(). The target is the function itself, written without parentheses. Get process info: current_process().pid and current_process().name; the name is the one specified by the name=___ argument at process creation.

15 Only one process can acquire a given lock at a time; any other process that tries will sleep until the lock is released. Use a lock to control access to stdout and other shared resources: lockVar = Lock(), pass lockVar to all processes that need it, then lockVar.acquire() and lockVar.release().

16 queueVar = Queue(); pass queueVar to all processes that need it. queueVar.put(dataToSend) sends; dataToReceive = queueVar.get() receives, and the process will sleep until there's something to get. The first data put into the queue is the first data taken out by get (FIFO). procVar.join() makes the current process sleep until the procVar process completes. The sketch below pulls these three slides together.
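A minimal runnable sketch of slides 14-16 together (the worker function greet and the Worker-N names are ours, for illustration; the if __name__ == '__main__' guard is required for multiprocessing on Windows):

    from multiprocessing import Process, Lock, Queue, current_process

    def greet(lockVar, queueVar, n):
        lockVar.acquire()            # only one process may hold the lock at a time
        print('hello from %s, pid %d' %
              (current_process().name, current_process().pid))
        lockVar.release()            # let the next process print
        queueVar.put(n * n)          # send a result back to the parent

    if __name__ == '__main__':
        lockVar = Lock()
        queueVar = Queue()
        procs = [Process(target=greet, args=(lockVar, queueVar, i),
                         name='Worker-%d' % i) for i in range(4)]
        for p in procs:
            p.start()
        results = [queueVar.get() for _ in procs]  # get() sleeps until data arrives
        for p in procs:
            p.join()                 # sleep until each worker completes
        print(sorted(results))       # prints [0, 1, 4, 9]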
17 When would a process sleep? When it calls the time.sleep function; when waiting for a process to finish (procVar.join()); when waiting to acquire a lock (lockVar.acquire()); and when waiting for something to be put in the queue (queueVar.get()).

18 Using the Python Multiprocessing Module

19 First day: sort a deck of cards, and show me how; in pairs, with precise, simple steps. "If you can't describe what you are doing as a process, you don't know what you're doing." (W. E. Deming) This introduces variable assignment (take that card…), conditionals, expressions (comparison), loops, and (potentially) functional abstraction (find the min); see the short sketch after the quote below. Much later, during search/sorting/complexity, they're ready and know O(N^2) sorting.

20 "Whenever there is a hard job to be done I assign it to a lazy man; he is sure to find an easy way of doing it." (W. Chrysler)
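In the spirit of the card exercise on slide 19, a selection sort sketch (the function names are ours; this is the O(N^2) algorithm the students act out):

    def find_min_index(cards, start):
        # functional abstraction: find the smallest card from position start on
        smallest = start
        for i in range(start + 1, len(cards)):
            if cards[i] < cards[smallest]:   # comparison expression
                smallest = i
        return smallest

    def selection_sort(cards):
        for i in range(len(cards)):          # loop over hand positions
            m = find_min_index(cards, i)
            cards[i], cards[m] = cards[m], cards[i]  # take that card: swap
        return cards

    print(selection_sort([7, 3, 9, 1, 5]))   # prints [1, 3, 5, 7, 9]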
21 Pool/map: easy, great for data parallelism (parallel[Hello|SumPrimes|MontePi|Integration|MergesortPool].py). from multiprocessing import Pool; mypool = Pool(processes=N); mypool.map(myfunc, args), where args is a list of arguments to evaluate with myfunc. myfunc can accept only one argument (wrap multiple values in a tuple if needed). Process/Pipe: data/task parallelism (parallel[Quicksort|Mergesort].py). parentConn, childConn = Pipe(); the connection is duplex, so both ends can send and receive. A sketch of both follows.
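A minimal sketch of both patterns (the function names myfunc and child are ours, for illustration):

    from multiprocessing import Pool, Pipe, Process

    def myfunc(n):                   # Pool.map target: exactly one argument
        return n * n

    def child(conn):
        conn.send('hello from child')          # duplex: either end can send
        print('child received: ' + conn.recv())
        conn.close()

    if __name__ == '__main__':
        mypool = Pool(processes=4)
        print(mypool.map(myfunc, [1, 2, 3, 4]))  # data parallelism: [1, 4, 9, 16]
        mypool.close()
        mypool.join()

        parentConn, childConn = Pipe()           # duplex by default
        p = Process(target=child, args=(childConn,))
        p.start()
        print('parent received: ' + parentConn.recv())
        parentConn.send('hello from parent')
        p.join()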
22 Obviously: the multiprocessing docs, http://docs.python.org/library/multiprocessing.html. Our code: http://cs.wlu.edu/~stough/SC12/. CS1 quotes: http://www.cs.cmu.edu/~pattis/quotations.html. Jokes: http://www.phy.ilstu.edu/~rfm/107f07/epmjokes.html. Distributed computing using multiprocessing: http://eli.thegreenplace.net/2012/01/24/distributed-computing-in-python-with-multiprocessing/. Various options for PDC in Python: http://wiki.python.org/moin/ParallelProcessing, http://wiki.python.org/moin/DistributedProgramming, and http://code.google.com/p/distributed-python-for-scripting/




