

This research project has been supported by a Marie Curie Early European Industrial Doctorates Fellowship of the European Community's Seventh Framework Programme under contract number PITN-GA-2012-316596-ICE-DIP.


CERNopenlab




**Overview**

- ICE-DIP
- CERN
- LHCb Trigger and Data Acquisition (TDAQ)
- Computing at CERN
- Explicit vectorization
- UME framework




### **ICE-DIP**

#### ICE-DIP 2013-2017: The Intel-CERN European Doctorate Industrial Program

A public-private partnership to research solutions for next generation data acquisition networks, offering research training to five Early Stage Researchers in ICT



#### Research topics:

- Silicon photonics systems
- Next generation data acquisition networks
- High speed configurable logic
- Computing solutions for high performance data filtering


# **ICE-DIP Projects**

| Theme                           | WP  | ESR                     | Challenge                                                                               | Research                                                                                  |
|---------------------------------|-----|-------------------------|-----------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| Silicon photonics               | WP1 | ESR1 (Santa Clara, US)  | Need affordable, high-throughput, radiation-tolerant links                              | Design, manufacture and stress-test a Si-photonics link                                   |
| Reconfigurable logic            | WP2 | ESR2 (Munich, Germany)  | Reconfigurable logic is used where potentially more programmable CPUs could be proposed | A hybrid CPU/FPGA data pre-processing system                                              |
| DAQ networks                    | WP3 | ESR3 (Gdańsk, Poland)   | Bursts in traffic are not handled well by off-the-shelf networking equipment            | Loss-less throughput up to multiple Tbit/s with new protocols                             |
| High performance data filtering | WP4 | ESR4 (Munich, Germany)  | Accelerators need network data, but have very limited networking capabilities           | Direct data access for accelerators (network-bus-devices-memory)                          |
| High performance data filtering | WP4 | ESR5 (Paris, France)    | Benefits of new computing architectures are rarely fully exploited by software          | Find and exploit parallelization opportunities and ensure forward scaling in DAQ networks |





### **Standard Model**

#### **BOHR MODEL**




#### STANDARD MODEL









- HQ in Geneva (Switzerland)
- 61 years of existence
- 21 member states (Israel since 2014), 45 associate states, 17 cooperating states, 7 observers
- ~14000 people




### LHC







**ATLAS**: A Toroidal LHC ApparatuS

CMS: Compact Muon Solenoid

ALICE: A Large Ion Collider Experiment

LHCb: Large Hadron Collider Beauty

### **Experiments:**

ACE, AEGIS, **ALICE**, ALPHA, AMS, ASACUSA, **ATLAS**, ATRAP, AWAKE, BASE, CAST, CLOUD, **CMS**, COMPASS, DIRAC, ISOLDE, **LHCb**, **LHCf**, **MoEDAL**, NA61/SHINE, NA62, NA63, nTOF, OSQAR, **TOTEM**, UA9







### **LHCb**

**VELO**: Collision point localization

Inner/Outer Tracker: Trajectories and momentum

**RICH:** *Particle identification* 

**SPD, PS, ECAL, HCAL:** *Hadron, electron, photon identification* 

**MUON:** *Particle identification* 

# **Trigger System**




#### **Tasks:**

- **Bandwidth reduction**
- Data buffering

#### Some features:

- Hierarchical structure
- ASIC (L0)
- FPGA (L1)
- Non-standard solutions!

| Level     | Event frequency | Bandwidth |
|-----------|-----------------|-----------|
| Front-end | 40 MHz          | 4 TB/s    |
| L0        | 1 MHz           | 100 GB/s  |
| L1        | 40 kHz          | 4 GB/s    |
| HLT       | 400 Hz          | 40 MB/s   |
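The bandwidth column is consistent with event frequency multiplied by a fixed event size. The sketch below checks that arithmetic, assuming the roughly 100 KB event size quoted in the LHCb schedule table of this talk and decimal units (1 KB = 1000 bytes). The function names are hypothetical and only illustrate the calculation.

```cpp
#include <cassert>

// Bandwidth in bytes per second for a given event rate (Hz) and event size (bytes).
// Assumes every event has the same size, which is a simplification.
inline double bandwidthBytesPerSecond(double eventRateHz, double eventSizeBytes) {
    assert(eventRateHz >= 0.0 && eventSizeBytes >= 0.0);
    return eventRateHz * eventSizeBytes;
}

// Rate reduction factor between two trigger levels,
// for example front-end input versus HLT output.
inline double reductionFactor(double inputRateHz, double outputRateHz) {
    assert(outputRateHz > 0.0);
    return inputRateHz / outputRateHz;
}
```

With these assumptions, `bandwidthBytesPerSecond(40e6, 100e3)` gives 4e12 bytes/s (4 TB/s) at the front-end, and `reductionFactor(40e6, 400.0)` shows that the full trigger chain has to reject all but one event in 100000.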

### **Readout Network**






# **Computing at CERN**






Przemysław Karpiński - CERN Openlab, ICE-DIP




# **Software for LHCb**

| Application<br>AppConfig | Simulation:<br>Gauss | Digitization:<br>Boole                                                            | Alignment                     | Analysis<br>(Python):<br>Bender | Analysis<br>repository:<br>Erasmus | Event<br>presentation:<br>Panoramix | Trigger:<br>Moore,<br>L0App | Monitoring and control:               |  |
|--------------------------|----------------------|-----------------------------------------------------------------------------------|-------------------------------|---------------------------------|------------------------------------|-------------------------------------|-----------------------------|---------------------------------------|--|
|                          | DecFiles             | 20010                                                                             | Reconstruction<br>:<br>Brunel |                                 | Analysis:<br>DaVinci               |                                     |                             | Orwell (Calo)<br>Panoptech<br>(Rich), |  |
|                          |                      |                                                                                   |                               | Anal                            | ysis                               | Stripping                           | Hlt                         | Vetra (Velo, ST)                      |  |
| Component<br>Libraries   |                      |                                                                                   |                               |                                 | Ρ                                  | hys                                 |                             |                                       |  |
|                          |                      |                                                                                   |                               | Rec                             |                                    |                                     |                             |                                       |  |
|                          |                      |                                                                                   |                               | Lb                              | com                                |                                     |                             |                                       |  |
| Frameworks               | LI                   | LHCbSys [Data_Dictionary, Event_Model, Detector_Description, Conditions_Database] |                               |                                 |                                    |                                     | Online                      |                                       |  |
|                          |                      |                                                                                   | Gau                           | ıdi (GaudiPytl                  | non)                               |                                     |                             |                                       |  |
|                          |                      |                                                                                   | $\rangle$                     |                                 |                                    |                                     |                             |                                       |  |

### LHCb Schedule




| Phase    | Years     |
|----------|-----------|
| LHC test | 2008      |
| R1       | 2009-2012 |
| LS1      | 2013-2014 |
| R2       | 2015-2017 |
| LS2      | 2018-2019 |
| R3       | 2020-2022 |

| Run | Collision energy | Bunch spacing | Luminosity        | Event frequency | Event size | Generated data (limit) | Stored data |
|-----|------------------|---------------|-------------------|-----------------|------------|------------------------|-------------|
| R1  | 8 TeV            | 50 ns         | 4e32 cm^-2 s^-1   | 40 MHz          | 100 KB     | 4 TB/s                 | 40 MB/s     |
| R2  | 13 TeV           | 25 ns         | 4e32 cm^-2 s^-1   | 40 MHz          | 100 KB     | 4 TB/s                 | 2 GB/s      |
| R3  | 14 TeV           | 25 ns         | 2e33 cm^-2 s^-1   | 30 MHz          | > 100 KB   | 4 TB/s                 | > 2 GB/s    |

#### LS1 + R2: Simplifications in L0 and L1

#### LS2 + R3:

- HLT moved from cavern to surface
- Complete elimination of L0 and L1 (Full Software Trigger)



- Multiple "big" frameworks
- Code developed by physicists
- Code developed in a hurry
- Detector systems specific knowledge
- Development criteria change over time
- High robustness & efficiency requirements



# Manycore architectures

- Time and energy **costs**?
- Programmability?
  - Deployment model and scalability?
  - Performance tuning methodology?
- Future of MIC?






# Work in progress

| Activity                                 | Status                                       | Measurables                                                                                                                                         |
|------------------------------------------|----------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| VCL library port for KNC                 | Completed?                                   | Good cooperation with VCL author Agner Fog (TUD, Copenhagen)<br>Code publicly available (GPL)<br>Measurements gathered<br>Article in review         |
| LLVM as large code optimization platform | Hardware manufacturers need to put in effort | Possible methodology for both industry and academia                                                                                                 |
| HEP benchmarking suite                   | Under development                            | New benchmark suite and algorithm library for HEP, publicly available (permissive license)                                                          |
| Blog on Many-core                        | Continuous work                              | <u>cern.ch/manycore</u>                                                                                                                             |


# Wrong questions asked?

- How do I measure performance?
  - Do you know what the metric is?
- How do I increase performance?
  - What are your hot-spots?
- How do I make my solution scalable?
  - What is your definition of scaling?

# **Better questions?**


- How do I increase performance with minimal effort?
  - #pragma ...
  - Compile with -O3 and fast-math options
  - Use a faster library
- How do I choose a proper metric?
  - Measure throughput
  - Measure latency
  - Measure memory utilization
- How do I create a specification of my software?
  - Code IS the specification
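Throughput and latency can both be derived from one timed loop. The following is a minimal sketch, not a full benchmarking harness: `Measurement` and `measure` are hypothetical names, and the result is only meaningful if the compiler cannot optimize the measured work away.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

struct Measurement {
    double totalSeconds;      // wall-clock time for all iterations
    double latencySeconds;    // mean time per call
    double throughputPerSec;  // calls completed per second
};

// Runs `work` `iterations` times and derives mean latency and throughput.
template <typename Work>
Measurement measure(Work work, std::size_t iterations) {
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;  // monotonic, suitable for intervals
    const auto start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;

    Measurement m;
    m.totalSeconds = elapsed.count();
    m.latencySeconds = m.totalSeconds / static_cast<double>(iterations);
    // Guard against a zero reading on clocks with coarse resolution.
    m.throughputPerSec = m.totalSeconds > 0.0
        ? static_cast<double>(iterations) / m.totalSeconds
        : 0.0;
    return m;
}
```

In a serial loop like this one, mean latency is simply the reciprocal of throughput. The two metrics only diverge once work is batched, pipelined or run in parallel, which is why the choice of metric has to be made explicitly.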

# VCL and VCLKNC

Vec16f: Vector of 16 single precision floating point values

```
class Vec16f {
protected:
    __m512 zmm; // Float vector
public:
    // Default constructor:
    Vec16f() {
    }
    // Constructor to broadcast the same value into all elements:
    Vec16f(float f) {
        zmm = _mm512_set1_ps(f);
    }
    // Constructor to build from all elements:
    Vec16f(float f0, float f1, float f2, float f3, float f4, float f5, float f6, float f7,
           float f8, float f9, float f10, float f11, float f12, float f13, float f14, float f15) {
        zmm = _mm512_set_ps(f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, f15);
    }
    // Constructor to convert from type __m512 used in intrinsics:
    Vec16f(__m512 const & x) {
        zmm = x;
    }
    // ... remaining members omitted
};
```

SIMD vector abstraction layer

- Based on the VCL library by Agner Fog (TUD, Copenhagen), www.agner.org
  - Invaluable learning materials!
- Classes hiding SSE, AVXx
- VCLKNC extension for IMCI (KNC): https://bitbucket.org/veclibknc/vclknc
- GPL, proprietary licensing possible
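To illustrate what "classes hiding SSE, AVXx" means in practice, here is a minimal sketch of a 4-lane wrapper in the same spirit. `Vec4fSketch` is a hypothetical class written for this example and is not part of VCL or VCLKNC. It uses SSE intrinsics where the compiler reports SSE support and falls back to a plain loop elsewhere, so user code stays identical on both paths.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical 4-lane float vector; the instruction set is an implementation detail.
class Vec4fSketch {
public:
    // Broadcast one value into all four lanes.
    explicit Vec4fSketch(float f) {
#if SKETCH_HAS_SSE
        xmm = _mm_set1_ps(f);
#else
        for (int i = 0; i < 4; ++i) lanes[i] = f;
#endif
    }

    // Load four values from memory (no alignment requirement).
    explicit Vec4fSketch(const float* p) {
        assert(p != nullptr);
#if SKETCH_HAS_SSE
        xmm = _mm_loadu_ps(p);
#else
        for (int i = 0; i < 4; ++i) lanes[i] = p[i];
#endif
    }

    // Lane-wise addition.
    Vec4fSketch& operator+=(const Vec4fSketch& other) {
#if SKETCH_HAS_SSE
        xmm = _mm_add_ps(xmm, other.xmm);
#else
        for (int i = 0; i < 4; ++i) lanes[i] += other.lanes[i];
#endif
        return *this;
    }

    // Store the four lanes to memory (no alignment requirement).
    void store(float* p) const {
        assert(p != nullptr);
#if SKETCH_HAS_SSE
        _mm_storeu_ps(p, xmm);
#else
        for (int i = 0; i < 4; ++i) p[i] = lanes[i];
#endif
    }

private:
#if SKETCH_HAS_SSE
    __m128 xmm;      // hardware register type
#else
    float lanes[4];  // scalar emulation
#endif
};

inline Vec4fSketch operator+(Vec4fSketch a, const Vec4fSketch& b) {
    a += b;
    return a;
}
```

Code written against such a class, for example `Vec4fSketch sum = v + Vec4fSketch(10.0f);`, contains no intrinsics, which is what allows a port such as VCLKNC to swap the instruction set underneath.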

# VCL: KNC vs XEON






### Conclusions:


- Explicit vectorization makes vectorization straightforward
- Intrinsics are not that complicated (but tricky sometimes)
- KNC core microarchitecture is not that bad
- Can we get higher frequency?
- Use floats instead of doubles!
- Throughput is promising.



# Law Of Diminishing Returns

Suppose, for example, that **1 kilogram** of seed applied to a certain plot of land produces **one ton** of crop, that **2 kg of seed produces 1.5 tons**, and that **3 kg of seed produces 1.75 tons**.
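Read as marginal yields, the numbers in this example make the effect explicit: each additional kilogram of seed adds half as much crop as the previous one.

$$
\Delta_1 = 1\ \text{t}, \qquad \Delta_2 = 1.5 - 1 = 0.5\ \text{t}, \qquad \Delta_3 = 1.75 - 1.5 = 0.25\ \text{t}
$$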






# We need systematic revolution

1) Write the simplest code that solves your problem
   - We ALWAYS underestimate complexity!
   - Not sure what is the proper SPECIFICATION before writing code down
   - Early optimization is overkill
2) Evaluate the cost of optimization and the cost of redesign
   - Cost metric depends on project requirements!
   - You already know the cost of the initial implementation
3) Identify "hot spots" and optimize them
   - A hot spot is not only a function: it can be an algorithm or a structure
4) Repeat 2) until it is REASONABLE!
5) Write version 2 and start from the beginning
   - Don't be afraid to do that! Now you have knowledge you didn't have at stage 1)
   - Some components can and should be re-used


## **UME: basic structure**





# UME – Unified Multi/Manycore Environment

### SIMD abstraction layer:

| Vector width            | Type alias         | Definition              |
|-------------------------|--------------------|-------------------------|
| 256 bit, integer        | `SIMDVector32_8i`  | `SIMDVec<int8_t, 32>`   |
| 256 bit, integer        | `SIMDVector32_8u`  | `SIMDVec<uint8_t, 32>`  |
| 256 bit, integer        | `SIMDVector16_16i` | `SIMDVec<int16_t, 16>`  |
| 256 bit, integer        | `SIMDVector16_16u` | `SIMDVec<uint16_t, 16>` |
| 256 bit, integer        | `SIMDVector8_32i`  | `SIMDVec<int32_t, 8>`   |
| 256 bit, integer        | `SIMDVector8_32u`  | `SIMDVec<uint32_t, 8>`  |
| 256 bit, integer        | `SIMDVector4_64i`  | `SIMDVec<int64_t, 4>`   |
| 256 bit, integer        | `SIMDVector4_64u`  | `SIMDVec<uint64_t, 4>`  |
| 256 bit, floating point | `SIMDVector8_32f`  | `SIMDVec<float, 8>`     |
| 256 bit, floating point | `SIMDVector4_64f`  | `SIMDVec<double, 4>`    |
| 512 bit, integer        | `SIMDVector64_8i`  | `SIMDVec<int8_t, 64>`   |
| 512 bit, integer        | `SIMDVector64_8u`  | `SIMDVec<uint8_t, 64>`  |
| 512 bit, integer        | `SIMDVector32_16i` | `SIMDVec<int16_t, 32>`  |
| 512 bit, integer        | `SIMDVector32_16u` | `SIMDVec<uint16_t, 32>` |
| 512 bit, integer        | `SIMDVector16_i32` | `SIMDVec<int32_t, 16>`  |
| 512 bit, integer        | `SIMDVector16_u32` | `SIMDVec<uint32_t, 16>` |
| 512 bit, integer        | `SIMDVector8_i64`  | `SIMDVec<int64_t, 8>`   |
| 512 bit, integer        | `SIMDVector8_u64`  | `SIMDVec<uint64_t, 8>`  |

- VCL, VC, Boost::SIMD
- Library selection at compile time
- Uniform interface chosen after analysis of libraries
  - Vector symmetry problems resolved by emulation
- Possible to "plug-in" other libraries
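The combination of compile-time library selection and emulation can be sketched as follows. This is an illustration under assumptions, not UME's actual implementation: the `sketch` namespace and the `SKETCH_USE_VENDOR_BACKEND` macro are hypothetical, and only the scalar emulation path is written out. The alias names follow the ones listed above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical backend switch; a real build would wrap VCL, Vc or Boost.SIMD here.
#if defined(SKETCH_USE_VENDOR_BACKEND)
#error "vendor backend is not part of this sketch"
#else
namespace sketch {

// Scalar emulation: always available, so every <type, length> combination
// exists even when no backend library provides it natively.
template <typename T, std::size_t N>
class SIMDVec {
public:
    // Broadcast one value into all lanes.
    explicit SIMDVec(T value) { lanes_.fill(value); }

    // Lane-wise addition.
    SIMDVec& operator+=(const SIMDVec& rhs) {
        for (std::size_t i = 0; i < N; ++i) lanes_[i] += rhs.lanes_[i];
        return *this;
    }

    // Read a single lane.
    T operator[](std::size_t i) const {
        assert(i < N);
        return lanes_[i];
    }

    static constexpr std::size_t length() { return N; }

private:
    std::array<T, N> lanes_;
};

}  // namespace sketch
#endif

// User code only ever sees the uniform aliases.
typedef sketch::SIMDVec<float, 8> SIMDVector8_32f;
typedef sketch::SIMDVec<std::uint32_t, 8> SIMDVector8_32u;
```

Because the aliases are the only names user code depends on, replacing the emulated class with a wrapper around a real library is a build-configuration change rather than a source change.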



### Unified Multi/Manycore Environment (UME)

Next steps:

- "Other" abstraction layers
- Integrated benchmarking capabilities
  - Performance evaluation & cost evaluation
- Microbenchmarking platform characteristics
  - Canonical models of microarchitectures
  - Before or even during application compilation
- Canonical design of HEP algorithms
  - Ability to select parameters of the algorithm based on the platform specifics
  - Ability to re-use the algorithm for other applications (e.g.: Hough Transform, Kalman Filter)
  - Canonical algorithm FORCES data structures layout!!!
- Autotuning based on runtime information
  - "Real" autotuning is difficult; instead, we can gather runtime data and re-compile

### **UME: Three runs**

#### Static identification:


- Identify hardware parameters:
  - Memory/core hierarchy
  - Memory and cross-core latencies
  - Single core performance
- Dump config file and recompile

#### **Dynamic identification:**

- Run domain specific microbenchmarks
- Select final software configuration:
- Dump final config file

#### Compile application for optimal set of SW components
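The static identification run could look roughly like the sketch below: probe a few platform characteristics, then emit them as preprocessor definitions that the next compilation picks up. Everything here is an assumption made for illustration. `StaticConfig`, `identifyStatic`, `dumpConfig` and the `SKETCH_*` macro names are hypothetical, and the single timing loop is a much cruder probe than the microbenchmarks described in the talk.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

struct StaticConfig {
    unsigned hardwareThreads;  // 0 means the platform could not report it
    double nsPerElement;       // cost of a streaming sum, per element
};

// Probe the platform: thread count plus a crude single-core timing.
inline StaticConfig identifyStatic(std::size_t elements) {
    assert(elements > 0);
    StaticConfig cfg;
    cfg.hardwareThreads = std::thread::hardware_concurrency();

    std::vector<float> data(elements, 1.0f);
    volatile float sum = 0.0f;  // volatile keeps the loop from being removed
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < elements; ++i) {
        sum = sum + data[i];
    }
    const std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    (void)sum;

    cfg.nsPerElement = elapsed.count() / static_cast<double>(elements);
    return cfg;
}

// Render the configuration as a header fragment for the next compilation.
inline std::string dumpConfig(const StaticConfig& cfg) {
    std::ostringstream out;
    out << "#define SKETCH_HW_THREADS " << cfg.hardwareThreads << "\n"
        << "#define SKETCH_NS_PER_ELEMENT " << cfg.nsPerElement << "\n";
    return out.str();
}
```

Writing the result of `dumpConfig(identifyStatic(1 << 20))` to a header and recompiling corresponds to the "dump config file and recompile" step. The dynamic run would repeat the same pattern with domain-specific microbenchmarks.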

# **Optimization methodology**

### Step 1: Write your algorithms using UME

- No need to know about underlying hardware
- Don't worry about OS specific stuff
- Focus on performance


### Step 2: Tune your software

- Static tuning allows selection of the best libraries and some algorithm parameters (compile-time information)
- Dynamic tuning allows tuning for the domain and for specific data (runtime information)

### Step 3: Identify hot spots and specialize your algorithm

- Some tools for performance assessment are integrated
- Specialize for HW/OS data intensity



#### Hough transform (line detection):




 $r = x\cos(\theta) + y\sin(\theta)$ 





### Scalar vs. SIMD

#### Scalar version:


```
uint32_t value = inputArray[y*mWIDTH + x];
if(value == 0)
{
    count++;
    // for this pixel, traverse every theta column of the accumulator
    SCALAR_FLOAT_T currTheta = 0.0;
    for(uint32_t thetaCoord = 0; thetaCoord < mWIDTH; thetaCoord++)
    {
        // r = x*cos(theta) + y*sin(theta)
        SCALAR_FLOAT_T currR = (SCALAR_FLOAT_T)x * cos(currTheta) + (SCALAR_FLOAT_T)y * sin(currTheta);
        // normalize r and scale it to a row index in accumulator space
        uint32_t rCoord = (uint32_t)((SCALAR_FLOAT_T)mHEIGHT * ((currR + mR_MAX)*mR_RANGE_INV ));
        // cast one vote
        mAccu[rCoord*mWIDTH + thetaCoord]++;
        currTheta += DELTA_THETA;
    }
}
```
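The fragment above depends on members that are not shown on the slide. A self-contained sketch of the same voting scheme is given below, under stated assumptions: theta spans [0, 2π) as in the comment of the SIMD listing, `rMax` is taken as the image diagonal, a non-zero pixel marks a feature (the convention of the SIMD listing), and `HoughAccumulator` and `houghScalar` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical accumulator: votes[rBin * thetaBins + thetaBin], row-major.
struct HoughAccumulator {
    std::size_t thetaBins;
    std::size_t rBins;
    std::vector<std::uint32_t> votes;
};

inline HoughAccumulator houghScalar(const std::vector<std::uint8_t>& image,
                                    std::size_t width, std::size_t height,
                                    std::size_t thetaBins, std::size_t rBins) {
    assert(image.size() == width * height);
    assert(thetaBins > 0 && rBins > 0);

    const double twoPi = 2.0 * std::acos(-1.0);
    const double dTheta = twoPi / static_cast<double>(thetaBins);
    // Assumption: |r| never exceeds the image diagonal.
    const double rMax = std::sqrt(static_cast<double>(width * width + height * height));
    const double rRangeInv = 1.0 / (2.0 * rMax);

    HoughAccumulator acc{thetaBins, rBins,
                         std::vector<std::uint32_t>(thetaBins * rBins, 0)};

    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            if (image[y * width + x] == 0) continue;  // not a feature pixel
            for (std::size_t t = 0; t < thetaBins; ++t) {
                const double theta = static_cast<double>(t) * dTheta;
                // r = x*cos(theta) + y*sin(theta)
                const double r = static_cast<double>(x) * std::cos(theta)
                               + static_cast<double>(y) * std::sin(theta);
                // Map r from [-rMax, rMax] onto a row index, clamping the upper edge.
                std::size_t rBin = static_cast<std::size_t>(
                    static_cast<double>(rBins) * ((r + rMax) * rRangeInv));
                if (rBin >= rBins) rBin = rBins - 1;
                ++acc.votes[rBin * thetaBins + t];
            }
        }
    }
    return acc;
}
```

Each feature pixel casts exactly one vote per theta bin, so the total number of votes equals the number of feature pixels times `thetaBins`. Collinear pixels vote for the same (r, theta) cell, which is what turns line detection into peak finding.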


#### SIMD version:


```
uint32_t value = inputPtr[y*inputArray.PADDED_WIDTH + x];
if(value != 0) // Drop round if all elements are 0
{
    // for every pixel traverse the thetas ranging <0:2*PI>
    theta_vec = VEC_THETA_INITIALIZER; // horizontal coordinate in accumulator space
    for(uint32_t k = 0; k < mAccu->VECTOR_WIDTH; k++)
    {
        UME::SIMD::SIMDVector8_32f cos_theta_vec = cos(theta_vec);
        UME::SIMD::SIMDVector8_32f sin_theta_vec = sin(theta_vec);
        UME::SIMD::SIMDVector8_32f cos_part = ((float)x)*cos_theta_vec;
        UME::SIMD::SIMDVector8_32f sin_part = ((float)y)*sin_theta_vec;
        r_vec = cos_part + sin_part;

        temp0 = (float) mAccu->HEIGHT * ((r_vec + R_MAX) * R_RANGE_INV );
        r_vec_i = truncateToInt(temp0); // vertical coordinate in accumulator space
        r_vec_u = UME::SIMD::SIMDVector8_32u(abs(r_vec_i));

        // store the offsets (the computation of r_theta_offset is not part of this excerpt)
        r_theta_offset.storeAligned(accu_r_theta_offsets);

        // gather from accumulator
        accu_vec.gather((uint64_t)(accuPtr), accu_r_theta_offsets);
        accu_vec += VEC_INIT_UNIT_I; // incrementing accumulator
        accu_vec.scatter((uint64_t)(accuPtr), accu_r_theta_offsets);

        theta_vec += 8*dTheta;
    }
}
```
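The gather, increment and scatter steps of the SIMD listing can be emulated lane by lane in portable code, which makes the data flow easier to follow. This is a sketch under assumptions rather than the UME implementation: `voteBlocked` and `kLanes` are hypothetical names, theta spans [0, 2π), and the caller supplies an `rMax` that is at least as large as the distance of the pixel from the origin.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 8;  // matches the 8x32f vectors used in the talk

// Adds the votes of one feature pixel (x, y) to a row-major accumulator,
// processing kLanes consecutive theta bins per iteration.
inline void voteBlocked(std::vector<std::uint32_t>& accu,
                        std::size_t thetaBins, std::size_t rBins,
                        double rMax, std::uint32_t x, std::uint32_t y) {
    assert(thetaBins % kLanes == 0);
    assert(accu.size() == thetaBins * rBins);
    assert(rMax > 0.0);
    const double dTheta = 2.0 * std::acos(-1.0) / static_cast<double>(thetaBins);

    for (std::size_t base = 0; base < thetaBins; base += kLanes) {
        std::size_t offsets[kLanes];
        std::uint32_t gathered[kLanes];

        // Per-lane accumulator offsets (the role of r_theta_offset in the listing).
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            const std::size_t t = base + lane;
            const double theta = static_cast<double>(t) * dTheta;
            const double r = x * std::cos(theta) + y * std::sin(theta);
            assert(r + rMax >= 0.0);
            std::size_t rBin = static_cast<std::size_t>(
                static_cast<double>(rBins) * ((r + rMax) / (2.0 * rMax)));
            if (rBin >= rBins) rBin = rBins - 1;
            offsets[lane] = rBin * thetaBins + t;
        }

        // Gather, increment, scatter: three separate passes, as on vector hardware.
        for (std::size_t lane = 0; lane < kLanes; ++lane) gathered[lane] = accu[offsets[lane]];
        for (std::size_t lane = 0; lane < kLanes; ++lane) gathered[lane] += 1;
        for (std::size_t lane = 0; lane < kLanes; ++lane) accu[offsets[lane]] = gathered[lane];
    }
}
```

Separating gather from scatter is only safe because every lane addresses a different theta column, so the offsets within one vector never collide. If two lanes could hit the same accumulator cell, the scatter would silently drop a vote.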





# **HT benchmark results**



#### Key notes:

- Benchmarks 1 to 3: purely scalar
- Benchmark 4: explicit SIMD (8x32f vectors used)
- Exactly the same benchmark code for SSE2, AVX and AVX2
- Exactly the same benchmark code regardless of library selection
- Actual speedup depends on input data

