

# DRC/LVS Development Best Practices

Learning from GF180 PDK optimization



## This is **not** a Banana!





## The GF180 DRC/LVS

Project link (efabless fork)

<https://github.com/efabless/globalfoundries-pdk-libs-gf180mcu_fd_pv>

#### Highlights

- Nice Python wrapper
- Large test suite
- Modular
- Clean structure
- References to design manual

#### Troubles

- Performance issues (>10 h runtime, >40 GB memory for a medium-size layout with 410k standard cells)



### Performance Killers





- Pile-up of memory for intermediate results
- Use of flat mode (fixed)
  - Little trust in the other modes?
- Inefficient implementation of certain rules
- After optimization (large test case):
  - Runtime 10 h → 1 h
  - Memory 40 GB → <4 GB
  - Runs on a single CPU and consumer hardware

atorkmabrains commented on Nov 30, 2022:

> @klayoutmatthias That's impressive.



## Debugging Techniques

#### Recent features make debugging easier

#### profile

When used at the beginning of a script, `profile` prints the executed operations sorted by CPU time, together with the process memory delta of each.

```
Operations by execution time
Operation                                 # calls  Time (s)  Memory (k)
"enclosing" in: sky130A_mr.drc:395                   72.930
"-" in: sky130A_mr.drc:397                           70.440
"enclosing" in: sky130A_mr.drc:396                   62.270
"enclosing" in: sky130A_mr.drc:388                   41.010
"&" in: sky130A_mr.drc:290                           38.710       32784
"space" in: sky130A_mr.drc:374                       26.070
"enclosing" in: sky130A_mr.drc:449                   22.150
"-" in: sky130A_mr.drc:419                           19.950       32784
"enclosing" in: sky130A_mr.drc:384                   16.750
"width" in: sky130A_mr.drc:368                       16.380
"enclosing" in: sky130A_mr.drc:421                   15.030
"enclosing" in: sky130A_mr.drc:418                   14.380     -131136
"-" in: sky130A_mr.drc:289                           14.310
"space" in: sky130A_mr.drc:435                       13.240
"width" in: sky130A_mr.drc:428                       13.190
"interacting" in: sky130A_mr.drc:299                 12.760
"interacting" in: sky130A_mr.drc:290                 12.570
"width" in: sky130A_mr.drc:265                       11.260
"without length" in: sky130A_mr.drc:294
```

Negative memory deltas indicate memory returned to the system by the garbage collector.
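The listing is sorted by individual statements. When one operation type appears many times, summing the time per operation shows where to look first. The following plain-Ruby sketch assumes a simplified one-entry-per-line layout of the listing (name, location, time, optional memory delta); the real `profile` output layout may differ, and `time_by_operation` is a hypothetical helper, not part of KLayout.

```ruby
# Sketch: aggregate a profile listing by operation name.
# Assumes each entry is on one line:  "<op>" in: <file>:<line>  <time>  [<memory delta>]
PROFILE = <<~TEXT
  "enclosing" in: sky130A_mr.drc:395   72.930
  "-" in: sky130A_mr.drc:397           70.440
  "enclosing" in: sky130A_mr.drc:396   62.270
  "&" in: sky130A_mr.drc:290           38.710   32784
  "space" in: sky130A_mr.drc:374       26.070
TEXT

# Operation name, source location and CPU time of one entry
ENTRY = /\A"(?<op>[^"]+)" in: (?<loc>\S+)\s+(?<time>\d+\.\d+)/

def time_by_operation(text)
  totals = Hash.new(0.0)
  text.each_line do |line|
    m = ENTRY.match(line.strip)
    next unless m # skip headers and entries without a time value
    totals[m[:op]] += m[:time].to_f
  end
  # Most expensive operation types first
  totals.sort_by { |_op, t| -t }.to_h
end

time_by_operation(PROFILE).each { |op, t| printf("%-12s %8.3f s\n", op, t) }
```

Aggregating this way can show, for example, that several "enclosing" checks together dominate the runtime even if no single one of them tops the list.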

#### new\_target

Allows sending intermediate results to a separate layout file for easy inspection

```
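report("L1 over L2 overlap")

l1 = input(1, 0)
l2 = input(2, 0)

# Additional output channel for intermediate results
debug = new_target("debug.gds")

l1and2 = l1 & l2

# Dump the intermediate result to debug.gds, layer 1/0
l1and2.output(debug, 1, 0)

# Regular check: output goes to the report database
l1and2.width(1.0.um).output("l1 over l2 < 1µm")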
```



### Choice of Modes

**flat** (default)

- Simple, predictable, single CPU, vanilla implementation
- Memory proportional to the number of objects
- Only for small designs or quick checks

**tiled**

- Operations work on tiles
- Parallelization along tiles, good scaling
- Heap allocation for single tiles only
- Results / intermediate layers are flat → large memory footprint possible
- Range-limited (border specification needed)
- **Useful for flat layouts**

**deep**

- Hierarchical processing where possible (local computation done once per cell)
- Can be very fast, but also slow (requires skillful use)
- Results / intermediate layers are hierarchical → small memory footprint possible
- Scales with $\text{cores}^{0.5}$
- Preferred solution for big hierarchical layouts
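Why tiled mode needs a border specification can be seen in a small model. The sketch below is a 1D simplification under stated assumptions: shapes are points on a line, the check is a space check between neighbors, and `Tile`, `space_violations` and `tiled_space_violations` are hypothetical names. Each tile reads input from its core enlarged by the border but keeps only results it owns, so a border at least as large as the check distance reproduces the flat result. (In a KLayout deck the corresponding settings are, as far as I know, `tiles` and `tile_borders`.)

```ruby
# A tile owns [core_lo, core_hi) but reads input from the core enlarged by `border`
Tile = Struct.new(:core_lo, :core_hi, :border) do
  def read_lo
    core_lo - border
  end

  def read_hi
    core_hi + border
  end

  def owns?(x)
    x >= core_lo && x < core_hi
  end
end

# Flat reference: neighboring points closer than min_space
def space_violations(xs, min_space)
  xs.sort.each_cons(2).select { |a, b| b - a < min_space }
end

def tiled_space_violations(xs, min_space, tile_size, border)
  lo, hi = xs.min, xs.max
  result = []
  t = lo
  while t <= hi
    tile = Tile.new(t, t + tile_size, border)
    visible = xs.select { |x| x >= tile.read_lo && x <= tile.read_hi }
    space_violations(visible, min_space).each do |pair|
      # Keep a result only in the tile that owns it (avoids duplicates)
      result << pair if tile.owns?(pair.first)
    end
    t += tile_size
  end
  result.sort
end

xs = [0.0, 9.5, 10.5, 30.0]
p space_violations(xs, 2.0)                     # flat reference
p tiled_space_violations(xs, 2.0, 10.0, 2.0)    # border >= check distance
p tiled_space_violations(xs, 2.0, 10.0, 0.0)    # no border: misses the pair across the tile edge
```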



### Deep Mode in a Nutshell

Computing `A OP B` for a "local" operation OP:

```
compute(A, B, OP, dist):

  for subject in shapes of A:
    intruders = shapes of B with distance to subject < dist
    results = OP.compute(subject, intruders)
    store results
```
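The pseudocode can be turned into a runnable flat sketch. It assumes shapes are axis-aligned boxes `[x1, y1, x2, y2]` (a hypothetical representation) and uses a brute-force intruder search where a real engine would use a spatial index. OP is passed as a block and only ever sees one subject plus the intruders within `dist`, which is what makes the operation "local".

```ruby
# Distance between two boxes [x1, y1, x2, y2] (0 if they touch or overlap)
def box_distance(a, b)
  dx = [b[0] - a[2], a[0] - b[2], 0].max
  dy = [b[1] - a[3], a[1] - b[3], 0].max
  Math.hypot(dx, dy)
end

# Generic "local" operation: for each subject of A, collect the B shapes
# closer than dist and hand both to OP (the block). Brute-force search.
def compute(a_shapes, b_shapes, dist)
  results = []
  a_shapes.each do |subject|
    intruders = b_shapes.select { |b| box_distance(subject, b) < dist }
    results.concat(yield(subject, intruders))
  end
  results
end

# Example OP: a separation check reporting (subject, intruder) pairs closer than 1.0
a = [[0, 0, 1, 1], [10, 0, 11, 1]]
b = [[1.5, 0, 2.5, 1], [20, 0, 21, 1]]
violations = compute(a, b, 1.0) { |s, intr| intr.map { |i| [s, i] } }
p violations  # prints [[[0, 0, 1, 1], [1.5, 0, 2.5, 1]]]
```

The work is proportional to the number of subjects times the intruders each one has to look at, which is why the choice of the first operand matters.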

#### Hierarchical treatment

- Compute cell neighborhoods ("contexts")
- Collect intruders per context (→ minimum set of configurations)
- For the results, keep common core inside cell, propagate specific results to parent cells
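The "compute once per context" idea can be illustrated with a small cache. This is a strongly simplified, hypothetical model (`CELLS`, `INSTANCES`, `hierarchical_compute` are made-up names): a cell's context is the set of nearby intruders expressed in the cell's local coordinates, and the local result is computed once per distinct (cell, context) pair. Unlike the real deep mode, it flattens all results to the top level instead of keeping the common part in the cell and propagating only the specific parts.

```ruby
# Hypothetical model: cells hold boxes [x1, y1, x2, y2] in local coordinates,
# instances place a cell at an (x, y) offset, intruders are in top-level coordinates.
CELLS = { "INV" => [[0, 0, 1, 1]] }
INSTANCES = [["INV", 0, 0], ["INV", 10, 0], ["INV", 20, 0]]
INTRUDERS = [[21.5, 0, 22.5, 1]]

def box_distance(a, b)
  dx = [b[0] - a[2], a[0] - b[2], 0].max
  dy = [b[1] - a[3], a[1] - b[3], 0].max
  Math.hypot(dx, dy)
end

def shift(box, dx, dy)
  [box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy]
end

def hierarchical_compute(cells, instances, intruders, dist)
  cache = {}
  computations = 0
  results = []
  instances.each do |cell, ox, oy|
    # Context: intruders near this instance, in the cell's local coordinates
    context = intruders.map { |b| shift(b, -ox, -oy) }
                       .select { |b| cells[cell].any? { |s| box_distance(s, b) < dist } }
                       .sort
    key = [cell, context]
    unless cache.key?(key)
      computations += 1 # local computation done once per distinct context
      cache[key] = cells[cell].flat_map do |s|
        context.select { |b| box_distance(s, b) < dist }.map { |b| [s, b] }
      end
    end
    # Shift the cached local result back to top-level coordinates
    results.concat(cache[key].map { |s, b| [shift(s, ox, oy), shift(b, ox, oy)] })
  end
  [results, computations]
end

results, computations = hierarchical_compute(CELLS, INSTANCES, INTRUDERS, 1.0)
puts "#{computations} computations for #{INSTANCES.size} instances"
p results
```

Two of the three instances share the same (empty) context, so only two local computations are needed.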




## Deep Mode Best Practices

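- Watch for hierarchy degradation
  - Results may be propagated up the hierarchy, destroying it over time
  - The log reports flat vs. hierarchical polygon counts per operation; the closer the two numbers, the less hierarchy is left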

```
"-" in: sky130A_mr.drc:294
Polygons (raw): 682248 (flat) 329 (hierarchical)
Elapsed: 0.020s Memory: 4391.00M
```
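Log entries like the one above can be scanned automatically for degradation. The sketch assumes the log format shown here (an operation header line followed by a `Polygons (raw)` line); the second entry in the sample log is made up to show a degraded case, and the threshold of 2.0 is an arbitrary, hypothetical choice.

```ruby
# Sample log: the first entry is from the deck, the second is hypothetical
LOG = <<~TEXT
  "-" in: sky130A_mr.drc:294
  Polygons (raw): 682248 (flat) 329 (hierarchical)
  Elapsed: 0.020s Memory: 4391.00M
  "sized" in: example.drc:10
  Polygons (raw): 1000 (flat) 900 (hierarchical)
TEXT

HEADER = /\A"(?<op>[^"]+)" in: (?<loc>\S+)/
COUNTS = /Polygons \(raw\): (?<flat>\d+) \(flat\)\s+(?<hier>\d+) \(hierarchical\)/

# Returns [operation, flat/hierarchical ratio] per log entry
def compression_report(log)
  current = nil
  report = []
  log.each_line do |line|
    if (m = HEADER.match(line))
      current = "#{m[:op]} @ #{m[:loc]}"
    elsif (m = COUNTS.match(line)) && current
      report << [current, m[:flat].to_i.fdiv(m[:hier].to_i)]
    end
  end
  report
end

compression_report(LOG).each do |where, ratio|
  flag = ratio < 2.0 ? "  <-- little hierarchy left" : ""
  printf("%-28s flat/hier = %10.1f%s\n", where, ratio, flag)
end
```

A high ratio (here roughly 2000) means the hierarchy is still doing its job; a ratio close to 1 means the layer has become practically flat.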

- Complexity is determined by the <u>first</u> operand
  - Fewer shapes, less work
  - The more hierarchical, the better
  - The first operand can "pull" B shapes down the hierarchy
- Beware of pre-merge
  - Some operations are not "local" and need pre-merge, e.g. "interacting"
  - Pre-merge forms large polygons, potentially high up in the hierarchy → spoils hierarchical performance

For details see: <https://www.klayout.de/drc_function_internals.html#drc_function_details>



## KLayout is not Calibre!






- Immediate execution vs. operation graph
  - KLayout executes each statement immediately instead of building an operation graph first
  - Layer == variable, value == layer geometry
  - Memory allocation == variable lifetime → use "forget" or reset the variable
  - Intermediate results allocate memory too (cleaned up by the garbage collector)

```
d = a - (b & c)   # "b & c" is an intermediate result: avoid duplicating such expressions
```

- No optimization of dead execution branches

```
c = empty & a.interacting(b)   # computed even though not needed
```
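Because a deck is plain Ruby, every expression is evaluated eagerly. The sketch below makes this visible with a hypothetical `FakeLayer` class (not a KLayout class) that counts operations: a duplicated sub-expression is computed twice, a shared intermediate result once, and a dead branch is computed unless it is guarded manually. The `is_empty?` name mirrors the KLayout layer method, but here it is only part of the stand-in.

```ruby
# Global operation counter for the stand-in layers
OPS = { count: 0 }

# Hypothetical stand-in for a DRC layer: a list of shape ids
class FakeLayer
  attr_reader :ids

  def initialize(ids)
    @ids = ids
  end

  def &(other)
    OPS[:count] += 1
    FakeLayer.new(ids & other.ids)
  end

  def -(other)
    OPS[:count] += 1
    FakeLayer.new(ids - other.ids)
  end

  # Simplified: "interacting" modeled as shared ids
  def interacting(other)
    OPS[:count] += 1
    FakeLayer.new(ids & other.ids)
  end

  def is_empty?
    ids.empty?
  end
end

a, b, c = FakeLayer.new([1, 2, 3]), FakeLayer.new([2, 3]), FakeLayer.new([3, 4])
empty = FakeLayer.new([])

# Duplicated sub-expression: "b & c" is computed twice
OPS[:count] = 0
d = a - (b & c)
e = a & (b & c)
duplicated = OPS[:count]

# Shared intermediate result: "b & c" is computed once
OPS[:count] = 0
bc = b & c
d2 = a - bc
e2 = a & bc
shared = OPS[:count]

# Dead branch: the right-hand side is evaluated although the result is empty anyway
OPS[:count] = 0
dead = empty & a.interacting(b)
unguarded = OPS[:count]

# A manual guard skips the work
OPS[:count] = 0
res = empty.is_empty? ? empty : empty & a.interacting(b)
guarded = OPS[:count]

puts "duplicated: #{duplicated}, shared: #{shared}, unguarded: #{unguarded}, guarded: #{guarded}"
```

In a real deck, a shared intermediate layer such as `bc` can additionally be released with `forget` once it is no longer needed.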

- No selection of input layers based on what is needed
- No parallelization
- Pro: allows loops, conditionals and direct per-shape manipulations
- No hierarchy manipulation
  - Except for variant formation for non-isotropic transformations and grid snap operations



#### Pitfalls we have seen

The "drc" function is more generic, but not better than its simple equivalents

```
a.drc(space < 0.2.um)   # generic "drc" form
a.space(0.2.um)         # simple equivalent
```

Same result, but performance is better with "space"

- Edge "width" != polygon "width"
  - Edge "width" only refers to relative orientation of the edges, but treats edges separately (→ potential long-distance interactions)
  - Polygon "width" is a single-polygon operation on pre-merged polygons

```
a.edges.width(0.5.um)   # edge-based
a.width(0.5.um)         # polygon-based
```

Similar results, but the edge-based version is better with large clusters of polygons, while the polygon-based version is better with large distances

- "+" (join) may be better than "|" (or)
  - "+" simply collects the shapes, "|" merges the shapes → this may give large polygons high up in the hierarchy and eats CPU time
  - For "local" operations, fragmented input is better → use "+"



### "+" (join) vs. "|" (or)

`poly.or(comp)` (same as `poly | comp`)

- Gives a single giant polygon over the memory area

`poly.join(comp)` (same as `poly + comp`)

- Leaves the original polygons in the hierarchy
- Executes much faster on operations that do not pre-merge
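The difference can be reproduced in a 1D model where shapes are `[lo, hi]` intervals (a hypothetical simplification of 2D polygons; `join_shapes` and `merge_or` are made-up names). "join" only collects the shapes, while "or" merges everything that overlaps or touches, so a regular array of overlapping poly and comp shapes collapses into one interval spanning the whole array.

```ruby
# "join": just collect the shapes of both inputs
def join_shapes(a, b)
  a + b
end

# "or": merge all overlapping or touching intervals
def merge_or(a, b)
  merged = []
  (a + b).sort.each do |lo, hi|
    if merged.any? && lo <= merged.last[1]
      merged.last[1] = [merged.last[1], hi].max
    else
      merged << [lo, hi]
    end
  end
  merged
end

# A regular array of overlapping poly and comp shapes, like a memory block
poly = (0...100).map { |i| [i * 2.0, i * 2.0 + 1.5] }
comp = (0...100).map { |i| [i * 2.0 + 1.0, i * 2.0 + 2.5] }

puts join_shapes(poly, comp).size   # 200 small shapes
p merge_or(poly, comp)              # one shape spanning the whole array
```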



### Optimization Example I

**Rule**: Max transistor channel length  $\leq$  20 µm

Initial implementation (concept):

```
channel_edges = poly.edges & comp
channel_not_too_wide = channel_edges.width(20.001.um)
error = channel_edges -
   channel_edges.interacting(channel_not_too_wide.edges)
```



*(Figure: `channel_edges`)*

#### **Observation**

Slow execution on standard logic layouts, even worse in deep mode



### Analysis

```
channel_edges = poly.edges & comp
channel_not_too_wide = channel_edges.width(20.001.um)
error = channel_edges -
   channel_edges.interacting(channel_not_too_wide.edges)
```



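Explanation: edge "width" treats each edge separately, so with a 20 µm range every channel edge interacts with many other channel edges within 20 µm, not just with the opposite edge of its own gate. On dense standard-cell layouts this adds up to a very large number of interactions.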





### Optimized Version

Rewriting the rule as a polygon width check limits the range to the polygon area
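The effect of the range can be estimated with a 1D model of a standard-cell row. The sketch assumes gates are `[left, right]` spans (hypothetical `Gate` struct, hypothetical pitch of 1.0 and channel length of 0.2, plus one 25 µm long gate). The edge-based variant pairs every left edge with every right edge within range, regardless of which gate it belongs to; the polygon-based variant only measures each gate against itself. Both flag the same gate, but the number of pairs differs a lot.

```ruby
# 1D model: a gate spans [left, right]; its length is right - left
Gate = Struct.new(:left, :right)

# Edge-based: every left edge is paired with every right edge to its right
# within range, regardless of which gate it belongs to.
def edge_based(gates, range)
  pairs = []
  gates.each do |g1|
    gates.each do |g2|
      d = g2.right - g1.left
      pairs << [g1, g2] if d > 0 && d < range
    end
  end
  ok = pairs.map(&:first).uniq # gates whose left edge found a partner in range
  [gates - ok, pairs.size]
end

# Polygon-based: each gate is measured against its own opposite edge only
def polygon_based(gates, range)
  errors = gates.reject { |g| g.right - g.left < range }
  [errors, gates.size]
end

# 50 short gates at a pitch of 1.0 plus one gate that is too long (hypothetical)
gates = (0...50).map { |i| Gate.new(i * 1.0, i * 1.0 + 0.2) } + [Gate.new(50.0, 75.0)]

err_e, n_e = edge_based(gates, 20.001)
err_p, n_p = polygon_based(gates, 20.001)
puts "edge-based:    #{err_e.size} error(s), #{n_e} edge pairs"
puts "polygon-based: #{err_p.size} error(s), #{n_p} measurements"
```

In this model each edge sees roughly range / pitch partners, so the edge-based work grows with the gate density, while the polygon-based work stays at one measurement per gate.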



*(Figure: `channel_edges`)*

**Effect** 



Execution time drops from 50 s (medium-size sample) to practically zero



### Optimization Example II

**Rule**: NMOS distance to p-tap $\leq$ 20 µm

Initial implementation (concept):

```
nmos = ncomp.outside(nwell)
ptap = pcomp.outside(nwell)
error = ptap.not_interacting(nmos.sized(20.um))
```



#### Observation

Slow execution on standard logic layouts, even worse in deep mode



### Optimized Version

Turning the check around optimizes it

Explanation:

- ptap has fewer shapes than nmos
- ptap is localized → pre-merge of "sized" does not spoil the hierarchy and is quick

Effect: `nmos.not_interacting(...)` has more primary (subject) shapes, but has to deal with fewer intruder shapes

```
nmos = ncomp.outside(nwell)
ptap = pcomp.outside(nwell)
error = nmos.not_interacting(ptap.sized(20.um))
```
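A 1D model shows why sizing the sparse layer is cheaper. The sketch assumes NMOS devices and p-taps are `[lo, hi]` intervals along a row (hypothetical positions and helper names `sized`, `merge_intervals`, `not_interacting`). Pre-merging the sized p-taps leaves a few localized regions, while pre-merging sized NMOS devices on a dense row would collapse them into one giant region.

```ruby
def sized(shapes, d)
  shapes.map { |lo, hi| [lo - d, hi + d] }
end

# Pre-merge: join overlapping or touching intervals
def merge_intervals(shapes)
  shapes.sort.each_with_object([]) do |(lo, hi), out|
    if !out.empty? && lo <= out.last[1]
      out.last[1] = [out.last[1], hi].max
    else
      out << [lo, hi]
    end
  end
end

def interacts?(a, b)
  a[0] <= b[1] && b[0] <= a[1]
end

def not_interacting(subjects, intruders)
  subjects.reject { |s| intruders.any? { |i| interacts?(s, i) } }
end

nmos = (0...60).map { |i| [i * 2.0, i * 2.0 + 1.0] } # dense NMOS row
ptap = [[10.0, 11.0], [60.0, 61.0]]                  # sparse p-taps

# Optimized direction: only the sparse layer is sized and pre-merged
tap_regions = merge_intervals(sized(ptap, 20.0))
errors = not_interacting(nmos, tap_regions)

p tap_regions                               # two localized regions
puts "#{errors.size} NMOS devices too far from a p-tap"
p merge_intervals(sized(nmos, 20.0))        # sized NMOS: one giant region
```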

**Effect** 

Execution time drops by a factor of 10





### Wrap-up

- Prefer deep mode
- Keep in mind the basic concepts of deep mode
  - First argument should have low complexity
  - Beware of large regions formed by pre-merge
  - Avoid hierarchy degradation
- Use profiling; focus on the most expensive operations
- Look at the intermediate results
- Rethink your rule implementation & try alternatives
- LVS: needs hierarchical device recognition layers for schematic / layout correspondence
  - Avoid hierarchy degradation (specifically pre-merge driven)



#### Homework

What are the input / output formats in different tools?

| Input / Output  | KLayout                                               | Others                           |
|-----------------|-------------------------------------------------------|----------------------------------|
| Layout          | GDS2 / OASIS etc.                                     | GDS2 / OASIS etc.                |
| DRC / LVS decks | Ruby language, tool specific, but follows conventions | Proprietary, copyright protected |
| Error DB        | Tool specific, documented                             | Proprietary                      |
| LVS database    | Tool specific, (documented)                           | Proprietary                      |

- Need help from the community to enable some features?
  - YES, preferably in the form of test cases, benchmarks, user stories and problem statements
  - YES, in the form of scripted prototypes
  - (Under certain conditions) C++ core features
- Is a common (open source) database a solution to some of the (open) questions?
  - In part, where no open standards exist
  - But after all, there is no "silver bullet". IMHO we get more value if we focus on making the best use of what we have and on seamless integration



#### Vision: The Open Source Growth Cycle



**Provide** test cases, regression tests, benchmarks, use cases, user stories, feedback & defect reports



# Thank you for Listening!