12:43PM EDT - The first talk being live blogged at Hot Chips today is from Tesla, who are showing off their compute and redundancy solution for a fully self-driving car. We assume this means a Level 5 car, so it will be interesting to see what is mentioned.
12:56PM EDT - Looks like we're going to start in a minute
01:00PM EDT - Presented by a former AMD architect who worked on Bulldozer and Zen
01:01PM EDT - FSD = Full Self-Driving
01:01PM EDT - Needed Custom hardware to run CNN very fast
01:01PM EDT - Level 5 is a tough target
01:01PM EDT - 100W was a limit for the computer
01:01PM EDT - FSD needed to be retrofitted into HW2.x cars
01:02PM EDT - Cooling in those cars is limited
01:02PM EDT - HW2.x was pre-FSD
01:02PM EDT - Looked at the market, nothing suitable to meet perf levels at the power constraints and form factor constraints
01:02PM EDT - Tesla had to design its own chip to meet those goals
01:03PM EDT - Dual Redundant SoCs
01:03PM EDT - Redundant Power Supplies
01:03PM EDT - Backwards compatible
01:03PM EDT - Overlapping camera field with redundant paths
01:03PM EDT - Four of the cameras are on the blue supply, four on the green supply
01:03PM EDT - All info goes to both SoCs
01:04PM EDT - both can process it all independently
01:04PM EDT - rich sensor suite
01:04PM EDT - Cameras, Radar, GPS, Maps, IMUs, Ultrasonic, Wheel Ticks, Steering Angle
01:05PM EDT - The two SoCs each come up with a plan. The plans are compared, and when they agree, the action is taken by the master and validated by the slave SoC, and the cycle repeats
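A minimal sketch of the compare-then-act step described above. The names (`Plan`, `plans_agree`, `control_step`), the plan fields, and the tolerance are all hypothetical; the talk did not say how plans are represented or compared. Skipping the frame on disagreement is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan representation; not from the talk."""
    steering_deg: float
    accel_mps2: float


def plans_agree(a: Plan, b: Plan, tol: float = 0.5) -> bool:
    """Treat two plans as matching if every field is within `tol` (assumed)."""
    return (abs(a.steering_deg - b.steering_deg) <= tol
            and abs(a.accel_mps2 - b.accel_mps2) <= tol)


def control_step(plan_a: Plan, plan_b: Plan) -> Optional[Plan]:
    """Return the plan to act on, or None to skip this frame on disagreement."""
    if plans_agree(plan_a, plan_b):
        return plan_a  # SoC A acts as master in this sketch
    return None
```

For example, `control_step(Plan(1.0, 0.2), Plan(1.1, 0.2))` returns the first plan, while a large steering mismatch returns `None`.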
01:05PM EDT - As many TOPs as possible for Tesla workloads; 50 TOPs was the minimum bar
01:05PM EDT - High utilization for batch size of one (video)
01:06PM EDT - Ended up with sub-40W/chip. Best in class power efficiency for inference
01:06PM EDT - Leading latency results. Safety and security get special processors
01:06PM EDT - Samsung 14FF
01:06PM EDT - 260mm2, 6b transistors
01:06PM EDT - AECQ100
01:07PM EDT - 12x A72 CPUs on right, 1x GPU
01:07PM EDT - Two Neural Network Accelerators, a from-scratch design. Everything else was industry IP
01:07PM EDT - Dual NNAs, each one is 96x96 MACs, can do 36.8 TOPs per NNA
01:08PM EDT - 32MB SRAM per instance, bandwidth optimized
01:08PM EDT - lots of programs can be resident in SRAMs
01:08PM EDT - simple programming model
01:08PM EDT - Built for 2 GHz+
01:08PM EDT - 72 TOPs for whole SoC at 2 GHz
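The per-NNA figure can be checked with quick arithmetic. This sketch assumes the usual convention that one MAC counts as two operations (a multiply and an add), which the notes do not state explicitly.

```python
MACS_PER_NNA = 96 * 96   # MAC array size from the talk
OPS_PER_MAC = 2          # assumption: multiply + add counted separately
CLOCK_HZ = 2e9           # 2 GHz design target from the talk
NNAS_PER_SOC = 2


def peak_tops(macs: int, clock_hz: float, ops_per_mac: int = OPS_PER_MAC) -> float:
    """Peak tera-operations per second for a MAC array at a given clock."""
    return macs * ops_per_mac * clock_hz / 1e12


per_nna = peak_tops(MACS_PER_NNA, CLOCK_HZ)
per_soc = per_nna * NNAS_PER_SOC
print(f"{per_nna:.2f} TOPs per NNA, {per_soc:.2f} TOPs per SoC")
```

This reproduces the 36.8 TOPs per NNA quoted earlier. Doubling it gives slightly more than the 72 TOPs quoted for the SoC; the notes do not say how that figure was rounded.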
01:08PM EDT - 14 months from Arch to Tape out
01:08PM EDT - First silicon success
01:08PM EDT - Took some calculated risks on the design
01:09PM EDT - Simulation challenges
01:09PM EDT - Needed to get it right
01:09PM EDT - Used Verilator, 50x faster than commercial simulators
01:10PM EDT - NNA Design Motivation. Solve a convolutional neural network
01:10PM EDT - 99.7% of operations are MACs
01:10PM EDT - Speeding up MACs makes Quantization/pooling more perf sensitive
01:11PM EDT - Dedicated Quantization and Pooling HW to speed things
01:13PM EDT - 8-bit MULs with 30-bit ADDs
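A rough sketch of why narrow multiplies are paired with a much wider accumulator: each 8-bit product is small, but a dot product sums many of them. The function name `int8_dot` and the signed treatment of operands are assumptions; the default accumulator width follows the 30-bit figure in these notes.

```python
def int8_dot(activations, weights, acc_bits=30):
    """Dot product of signed 8-bit values with a bounded signed accumulator.

    Raises OverflowError if the running sum leaves the accumulator range.
    """
    if len(activations) != len(weights):
        raise ValueError("operand lengths differ")
    lo = -(1 << (acc_bits - 1))
    hi = (1 << (acc_bits - 1)) - 1
    acc = 0
    for a, w in zip(activations, weights):
        if not (-128 <= a <= 127 and -128 <= w <= 127):
            raise ValueError("operand outside signed 8-bit range")
        acc += a * w  # one multiply-accumulate (MAC)
        if not lo <= acc <= hi:
            raise OverflowError("accumulator overflow")
    return acc
```

Under these assumptions, the largest product is 128 x 128 = 16,384, so tens of thousands of worst-case products fit before a 30-bit signed accumulator overflows.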
01:15PM EDT - Going over the slide. Basic MatMul stuff
01:20PM EDT - Control flow is extremely important for perf and power
01:20PM EDT - Most energy is spent moving instructions and data around
01:21PM EDT - FSD eliminates DRAM reads/writes
01:21PM EDT - Minimise SRAM reads
01:21PM EDT - Optimized MAC switching power
01:21PM EDT - Single clock domain
01:21PM EDT - DVFS power/clock distribution
01:22PM EDT - For inference, when you are done with a layer, it can be destroyed and not kept
01:22PM EDT - Instruction Set - here are all the operations
01:23PM EDT - Limited OoO support
01:24PM EDT - Instructions are 32B to 256B (256B = Convolution in one instruction)
01:24PM EDT - NNA Microarchitecture
01:25PM EDT - 32MB SRAM with one port per bank
01:25PM EDT - 256B of read bw, 128B of write bw
01:25PM EDT - 1TB/s bw in SRAM
01:27PM EDT - Programmable SIMD unit with 3-cycle
01:28PM EDT - FP16 and INT data types
01:28PM EDT - Predication support for all instructions
01:29PM EDT - Max pooling and average pooling
01:29PM EDT - custom pooling hardware required
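For reference, a plain-Python sketch of the two pooling operations named above, over non-overlapping windows. The function name `pool2d` and the 2x2 window default are illustrative assumptions; this says nothing about how the dedicated hardware implements them.

```python
def pool2d(fmap, size=2, mode="max"):
    """Pool a 2D feature map (list of rows) over non-overlapping size x size windows."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(window))
            elif mode == "avg":
                out_row.append(sum(window) / len(window))
            else:
                raise ValueError(f"unknown mode: {mode}")
        out.append(out_row)
    return out


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(pool2d(fmap, mode="max"))  # [[6, 8], [14, 16]]
print(pool2d(fmap, mode="avg"))  # [[3.5, 5.5], [11.5, 13.5]]
```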
01:30PM EDT - 2.5x perf over HW2.5 platform for 1.25x power
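Taken at face value, those two ratios imply the performance-per-watt improvement. A one-line check, with `perf_per_watt_gain` as an illustrative helper name:

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative perf/W improvement given relative performance and power."""
    return perf_ratio / power_ratio


gain = perf_per_watt_gain(2.5, 1.25)
print(f"{gain:.1f}x perf per watt vs HW2.5")  # 2.0x
```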
01:30PM EDT - Module cost lowered by 20%
01:31PM EDT - Q&A
01:31PM EDT - Q: Dual redundant SoCs. Insight into dual aspect? Are you sharing the load? A: The software folks have the flexibility to use it either way. We primarily designed for safety.
01:32PM EDT - Q: 2 instances of the Convolution Engine. Why 2? A: Goal of bandwidth to achieve with 96x96 x2. Sweet spot for physical design and area.
01:32PM EDT - Q: 37 TOPs? A: INT8
01:33PM EDT - Q: Custom model or public? A: Custom
01:35PM EDT - Q: Why SoC rather than PCIe card? A: Automotive has to go through a rigorous life cycle. PCIe card wouldn't work.
01:35PM EDT - Q: Logging? A: Yes
01:36PM EDT - Q: What if the two SoCs don't agree? A: We have a high framerate. But a dropped frame doesn't affect perf.
01:37PM EDT - Q: Raw TOPs? A: Yes
01:38PM EDT - Q: Cooling? A: Depends on the car platform. Air or water. But reducing power was key for this platform
01:38PM EDT - That's a wrap. Break time, next up is NVIDIA Multi-Chip