Partitioned Global Address Space Programming with Unified Parallel C (UPC)
Kathy Yelick, Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory; EECS Professor, UC Berkeley

Berkeley Lab's Computing Sciences organization spans the NERSC Facility, Computational Research (computational science, applied mathematics, and computer science), and the ESnet Facility, with opportunities for summer internships, joint projects, etc.

NERSC Represents a Broad HPC Workload including Data and Simulation
NERSC provides computing for science: 4,500 users and 600 projects, roughly 65% from universities and 30% from labs, producing 1,500 publications per year! Its systems are designed for science: Hopper, a 1.3 petaflop Cray system; an 8 PB filesystem; a 250 PB archive; and several systems for genomics, astronomy, visualization, etc. Across a workload of ~650 applications: 75% use Fortran, 45% C/C++, and 10% Python; 85% use MPI, 25% of them with OpenMP; 10% use PGAS or global objects; 70% use checkpointing for resilience; about 25% is industrial use. These figures are self-reported, and likely low.

HPC: From Vector Supercomputers to Massively Parallel Systems
The effect of the "Killer Micros" on high-end computing is most obvious in data from the Top500 list, the fastest (reported) systems in the world. Twenty years ago these HPC systems were dominated by fully customized vector supercomputers; today they are almost 100% networks of x86 microprocessors, and the main hardware innovations for HPC have been the high-speed interconnects that tie the processors together. In the same period, the number of machines placed in industry went from about 25% to 50%, and we know this is underreported because HPC use is often considered a competitive advantage, so companies do not report it. The programming model changed just as dramatically. This is an oversimplification, but the vector supercomputers of the '80s and early '90s were programmed by "annotating" serial programs, whereas machines built as networks of microprocessors required substantial investments in completely rethinking algorithms and software for parallelism.

Parallel Programming Problem: Histogram
Consider the problem of computing a histogram: a large number of "words" streams in from somewhere, and you want to count the number of words with a given property.
- In shared memory: make an array of counts (A's, B's, C's, ..., Y's, Z's); each processor works on a subset of the stream and locks each entry before incrementing it.
- In distributed memory: the array is huge and spread out; each processor has a substream and sends a "+1" message to the appropriate processor, and that processor "receives" it.
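To make the shared-memory version concrete, here is a minimal C/pthreads sketch (an illustration, not the slides' code): one lock per count-array entry, with each thread walking its own subset of the stream and locking an entry before incrementing it. The word list, first-letter bucketing, and thread count are illustrative assumptions.

```c
/* Shared-memory histogram sketch: an array of counts with one lock per
   entry; each thread works on a subset of the stream. */
#include <ctype.h>
#include <pthread.h>
#include <stdio.h>

#define NBUCKETS 26
#define NTHREADS 4

static long counts[NBUCKETS];
static pthread_mutex_t locks[NBUCKETS];

/* Illustrative stand-in for the incoming word stream. */
static const char *words[] = {"apple", "banana", "cherry", "avocado",
                              "blueberry", "apricot", "zucchini", "cabbage"};
#define NWORDS (sizeof words / sizeof words[0])

/* Each thread processes a strided subset of the stream. */
static void *worker(void *arg) {
    long id = (long)arg;
    for (size_t i = (size_t)id; i < NWORDS; i += NTHREADS) {
        int b = tolower((unsigned char)words[i][0]) - 'a';
        if (b < 0 || b >= NBUCKETS) continue;  /* skip non-letters */
        pthread_mutex_lock(&locks[b]);    /* lock the entry...      */
        counts[b]++;                      /* ...increment its count */
        pthread_mutex_unlock(&locks[b]);  /* ...and release it      */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int b = 0; b < NBUCKETS; b++)
        pthread_mutex_init(&locks[b], NULL);
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    for (int b = 0; b < NBUCKETS; b++)
        if (counts[b]) printf("%c: %ld\n", 'a' + b, counts[b]);
    return 0;
}
```

Note how close this stays to the serial loop: the only additions are the locks, which is exactly the "just annotate" convenience the comparison below describes.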
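The distributed-memory version might look like the following MPI sketch (again an illustration, not the slides' code). The count array is spread across ranks by ownership; each rank sends a "+1" message carrying a bucket index to the owner. The TAG_DONE protocol is my assumption: a per-rank "no more increments" message is one simple answer to the "when do I say receive?" problem raised in the comparison below.

```c
/* Distributed-memory histogram sketch: rank r owns buckets b with
   b % nprocs == r; non-owners send "+1" messages to the owner. */
#include <mpi.h>
#include <stdio.h>

#define NBUCKETS 26
#define TAG_INC  1
#define TAG_DONE 2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int me, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    long local[NBUCKETS] = {0};   /* my slice of the distributed array */

    /* Illustrative substream: every rank counts the same few letters. */
    const char stream[] = "abacazb";
    for (const char *p = stream; *p; p++) {
        int b = *p - 'a';
        int owner = b % np;
        if (owner == me)
            local[b]++;   /* my own bucket: increment directly */
        else              /* someone else's bucket: send a "+1" message */
            MPI_Send(&b, 1, MPI_INT, owner, TAG_INC, MPI_COMM_WORLD);
    }
    /* Tell every other rank I have no more increments for it.
       (Blocking sends are fine here only because the messages are small
       and eagerly buffered; a robust version would use MPI_Isend.) */
    for (int r = 0; r < np; r++)
        if (r != me) MPI_Send(&me, 1, MPI_INT, r, TAG_DONE, MPI_COMM_WORLD);

    /* Receive increments until all other ranks have said DONE. */
    int done = 0, msg;
    MPI_Status st;
    while (done < np - 1) {
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_DONE) done++;
        else local[msg]++;
    }

    for (int b = 0; b < NBUCKETS; b++)
        if (b % np == me && local[b])
            printf("rank %d: %c = %ld\n", me, 'a' + b, local[b]);
    MPI_Finalize();
    return 0;
}
```

All communication here is explicit (cost transparency), but even this toy problem needed an ownership convention, a message protocol, and a termination scheme.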
These two styles trade off convenience against scalability:
- Shared memory. Advantages: convenience; you can share data structures; just annotate loops; the result stays close to the serial code. Disadvantages: no locality control; it does not scale; race conditions.
- Message passing. Advantages: scalability; locality control; communication is all explicit in the code (cost transparency). Disadvantages: you need to rethink the entire application and its data structures; lots of tedious pack/unpack code; and for some problems you don't know when to say "receive".

PGAS Languages
- Global address space: a thread may directly read or write remote data, which hides the distinction between shared and distributed memory.
- Partitioned: data is designated as local or global, and the model does not hide this distinction, which is critical for locality and scaling.
[Figure: threads p0, p1, ..., pn, each with private variables (l; x: 1, x: 5, x: 7; y: 0) and a partition (g) of the global address space.]
UPC, CAF, and Titanium use static parallelism (one thread per processor) and do not virtualize processors; X10, Chapel, and Fortress are PGAS but not static (they have dynamic threads).
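A minimal UPC sketch of this model (an illustration, assuming any conforming UPC compiler, e.g. Berkeley UPC's upcc): a shared array lives in the global address space, each element has an owning thread (its affinity), and upc_forall uses an affinity expression to keep work local.

```c
/* PGAS in UPC: "shared" data is globally addressable but partitioned;
   THREADS is fixed at startup (static parallelism, no virtualization). */
#include <upc.h>
#include <stdio.h>

#define N 64

shared int x[N];   /* global array; x[i] has affinity to thread i % THREADS */

int main(void) {
    int i;
    /* upc_forall is a parallel loop: the fourth clause is an affinity
       expression, so the thread that owns x[i] executes iteration i. */
    upc_forall (i = 0; i < N; i++; &x[i])
        x[i] = i * i;

    upc_barrier;   /* wait for all THREADS threads */

    /* Global address space: any thread may read remote data directly;
       here thread 0 reads an element it (probably) does not own, and the
       compiler/runtime turns that access into communication. */
    if (MYTHREAD == 0)
        printf("x[%d] = %d (affinity: thread %d)\n",
               N - 1, x[N - 1], (N - 1) % THREADS);
    return 0;
}
```

The partitioning is visible in the source (the affinity expression), which is what lets PGAS programs control locality the way message-passing code does while keeping shared-memory-style syntax.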