How the new Apple M1 Chip actually compares to x86

After watching reviews of the new Apple MacBook with the M1 chip, it seems to wipe the floor with x86. However, the story is not as straightforward as the reviewers claim, and they get basic CPU and SoC architecture information wrong. SoCs are very interesting: long before Apple managed to get good performance from its own ARM implementations, back when Snapdragon was competitive, there were already things that set these chips apart. Although the M1 did manage to match AMD Zen clock for clock and core for core, the use cases are very different, and the Ryzen still pulled ahead in some tests. The best part about the benchmarks is that Intel was not able to play dirty tricks to make AMD seem slow (we are talking about software compatible with Apple's ARM chips, so it was likely designed by Apple, who don't care about Intel being ahead, only that the chips do what they want).

The first thing to understand is that the x86_64 architecture which Intel and AMD use is of the CISC type, while ARM, and therefore the Apple M1, is of the RISC type. CISC stands for complex instruction set computer, while RISC stands for reduced instruction set computer. The biggest difference is that a RISC design uses fewer and simpler instructions than a CISC design: typically fixed-length instructions that work on registers, with separate load and store instructions for memory. For instance, it was theorised during the early modern days of computing that only 23 instructions are needed to perform any task.

A common mix-up is what "64 bit" means. It refers to the width of the registers and addresses the CPU works with, not to the size of an instruction or the number of instructions. A 32-bit MicroBlaze processor (a RISC CPU) uses fixed 32-bit instructions: a 6-bit opcode, 5-bit register numbers, and either a third register or a 16-bit immediate value. So a simple instruction like ADD A B packs the opcode and the register numbers into a single 32-bit word, and the data itself lives in the registers. 64-bit ARM works the same way, with fixed 32-bit instructions operating on 64-bit registers. x86_64 instructions instead vary in length from 1 to 15 bytes, so an ADD between two registers takes only a few bytes, while more complex instructions with prefixes and memory operands take more. The same applies to the well-known AVX-512, which works on registers 512 bits wide. The instruction only carries register numbers (there are 32 such registers, so 5 bits each), so if you had a MUL A B to multiply the numbers packed into two 512-bit registers, A and B would be register numbers, not the data.
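To make the "instruction size is not the register size" point concrete, here is a minimal sketch that packs a 64-bit ARM register-to-register ADD into one 32-bit word. The function name `encode_add_x` is my own, and it only covers the plain `ADD Xd, Xn, Xm` form with no shift, so treat it as an illustration and not as an assembler.

```python
def encode_add_x(rd: int, rn: int, rm: int) -> int:
    """Encode AArch64 `ADD Xd, Xn, Xm` (64-bit registers, no shift) as a 32-bit word."""
    for reg in (rd, rn, rm):
        # Register number 31 has a special meaning in this form, so the sketch skips it.
        if not 0 <= reg <= 30:
            raise ValueError("this sketch only accepts register numbers 0..30")
    base = 0x8B000000  # fixed opcode bits for 64-bit ADD (shifted register, shift = 0)
    # Each register number needs only 5 bits: Rm at bit 16, Rn at bit 5, Rd at bit 0.
    return base | (rm << 16) | (rn << 5) | rd


word = encode_add_x(0, 1, 2)      # add x0, x1, x2
print(f"{word:#010x}")            # 0x8b020020
print(word.bit_length() <= 32)    # True: the whole instruction fits in 32 bits
```

The three operands are 64-bit registers, yet the instruction that names them is 4 bytes long, because it only stores which registers to use.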

The second part, not so well known, is that Intel and AMD aim to execute multiple instructions per cycle (threads help to keep the units busy). Wide units count too: a single AVX-512 instruction works on 512 bits at once, which is as much data as four 128-bit operations. On a simple RISC CPU it is possible for a single operation to take more than one cycle (like the FPU in a super simple CPU), but the pipeline does not skip through instructions; it goes through them one by one and typically completes one instruction per cycle. This is the part the reviewers get totally wrong about RISC vs CISC CPUs. How many cycles an instruction takes has nothing to do with RISC or CISC. CISC aims to do more work per instruction, while classic RISC focuses on keeping the pipeline simple. Modern high-performance cores of both kinds, the M1 included, are superscalar and work on several instructions per cycle.
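As a rough illustration of what a wide vector unit buys you, the sketch below models one 512-bit register as a list of 32-bit lanes and adds two of them in a single call. The names `simd_add`, `VECTOR_BITS` and `LANE_BITS` are made up for this example, and plain Python obviously does not run this in one cycle; it only shows how much work one such instruction describes.

```python
def simd_add(a, b, lane_bits=32):
    """Lane-wise add with wrap-around, the way a SIMD unit treats each lane."""
    if len(a) != len(b):
        raise ValueError("both vectors need the same number of lanes")
    mask = (1 << lane_bits) - 1
    # Every lane is independent, which is what lets hardware do them all at once.
    return [(x + y) & mask for x, y in zip(a, b)]


VECTOR_BITS = 512
LANE_BITS = 32
lanes = VECTOR_BITS // LANE_BITS   # how many 32-bit values fit in one 512-bit register

a = list(range(lanes))
b = [10] * lanes
print(lanes)           # 16
print(simd_add(a, b))  # sixteen sums from one "instruction"
```

A scalar core would need sixteen separate ADD instructions for the same result, which is why instruction counts alone say little about how much work gets done per cycle.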

Because of the simplicity and the reduced number of units in RISC CPUs, they take less power. The original ARM CPU famously ran without its power supply connected: in one incident its designers measured no current draw at all because the supply was not hooked up, and the chip was running on the small current leaking in through its signal pins. Fewer units also make the cores smaller and simpler, so they use less power, produce less heat and have lower latencies, which helps a lot on a lower-clocked CPU. By contrast, CISC CPUs consume more power because they have extra hardware units to perform tasks in fewer cycles. The best example is the FPU. A CPU with an FPU can perform a floating point operation in hardware at its native width in a handful of cycles, and a pipelined unit can start a new one every cycle. If a simple RISC core omits the FPU (or even a hardware multiplier), a MUL A B (multiply A and B) has to be converted into a series of logical and arithmetic instructions done by the ALU (arithmetic logic unit), which does simple operations like additions and bit shifts. Although doing multiplications in software can be 20x or more slower, it has the advantage of not being tied to the precision of the hardware: software can use as many digits as it needs, or work in decimal instead of binary. Hardware does it fast, but its precision is fixed (going beyond it costs extra cycles), and this is why financial calculations are done with integer or decimal arithmetic, not simple floating point operations.
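Both ideas in the paragraph above are easy to try out. The sketch below multiplies using only the kind of operations a bare ALU has (shifts, adds and bit tests), and then shows why money is handled with decimal arithmetic instead of binary floating point. The function name `shift_add_multiply` is my own, and it handles non-negative integers only.

```python
from decimal import Decimal


def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts, adds and bit tests."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")
    result = 0
    while b:
        if b & 1:        # lowest bit of b set: add the current shifted copy of a
            result += a
        a <<= 1          # next bit of b is worth twice as much
        b >>= 1
    return result


print(shift_add_multiply(13, 11))   # 143

# Binary floating point cannot represent 0.1 exactly, so the sum is slightly off.
print(0.1 + 0.2 == 0.3)             # False

# Decimal arithmetic done in software keeps the cents exact.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))   # True
```

The loop runs once per bit of `b`, which is where the many extra cycles come from when a core has no hardware multiplier.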

CPU design isn't set in stone. We have all these units to accelerate tasks, and each unit adds a bunch of instructions that allow us to use it. All this adds to the core's design, which is why instruction sets keep growing: more instructions let us perform tasks in fewer steps. The more units a core has, the more circuits are powered on when the core is in use, so doing a simple addition tends to use more power on CISC than on RISC. Processor development has come so far that every processor is now a mixture of both types. Let's take a look at some example designs. They aren't accurate, but I am trying to show which units are available in the different architectures and how they affect each architecture's performance and flexibility, including power use.

This is an example of a multithreaded x86 CPU containing a bunch of units

Your bog-standard multithreaded x86 CPU. Multithreading is called Hyper-Threading by Intel and SMT by AMD on Ryzen. In this sketch you have two pipelines and a single decoder that determines which instructions get executed and when: instructions that need the same unit have to wait for one another, while ones that don't can execute at the same time. It is a novel idea for efficiency, but SPARC CPUs managed 8 threads per core, and specific workloads could gain up to an 8x increase in speed on that RISC CPU. x86 was very late to the multi-core, multi-thread concept.
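The "wait if you need the same unit, run together if you don't" rule can be played with in a toy model. This is a deliberately simplified sketch and not how any real x86 scheduler works: each thread is a list of unit names, each thread issues at most one instruction per cycle, and each unit accepts one instruction per cycle. The function name `run_smt` is my own.

```python
def run_smt(thread_a, thread_b):
    """Return how many cycles two threads need when they share one set of units."""
    queues = [list(thread_a), list(thread_b)]
    cycles = 0
    while any(queues):
        busy = set()
        # Alternate which thread gets first pick so neither one starves.
        order = (0, 1) if cycles % 2 == 0 else (1, 0)
        for t in order:
            # Issue the thread's next instruction only if its unit is free this cycle.
            if queues[t] and queues[t][0] not in busy:
                busy.add(queues[t].pop(0))
        cycles += 1
    return cycles


print(run_smt(["ALU", "ALU"], ["FPU", "FPU"]))   # 2: different units, no waiting
print(run_smt(["ALU", "ALU"], ["ALU", "ALU"]))   # 4: same unit, the threads take turns
print(run_smt(["ALU", "FPU"], ["ALU", "FPU"]))   # 3: partial overlap
```

When the two threads want different units they finish as fast as one thread alone, and when they fight over the same unit there is no gain at all, which is why the benefit of multithreading depends so much on the workload.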

This is an example of a RISC CPU. It is very simple, with fewer units, but a lot cheaper and more efficient.

Much, much simpler than the x86 CISC-based CPU. It has fewer units, so it uses less power and has fewer instructions. It is a lot cheaper because a lot less silicon is used, but if you want to perform a more complex task, expect to wait several cycles where CISC did it in one. This is why you do not blend on ARM.

This is an example of a GPU. It has many copies of the same units so that it is good at exactly what those units are for, like number crunching. Graphics operations require a lot of matrix operations. The shader contains only a few ALUs and FPUs.

GPUs evolved from tricks to get number crunching fast for display, where accuracy isn't a concern. Displays are never accurate: their pixels, colour and other properties, and our eyes too, don't match up exactly, so why bother taking forever to get it right? This gave birth to a specialised processor that has only what is needed to do lots of floating point math fast and in parallel, and to handle static assets too. Textures are just files of numbers that need to be translated. Components in a modern GPU operate asynchronously to each other. If you play a higher-end graphical game on a modern compute-orientated GPU, you may notice graphical details loading in late, so you see the world with simple graphics first. Textures are streamed in asynchronously with priorities, so some things arrive late when the GPU and its memory are fully utilised. The different units on a GPU also come from the stages of rendering. The ROPs handle the final pixels that will be displayed and can perform tasks like anti-aliasing on them. Upscalers such as Nvidia's DLSS and other adaptive-resolution techniques approximate the pixels in between without needing the compute power to render at the larger resolution, although DLSS itself runs on the GPU's tensor cores rather than the ROPs. Some modern features are built on units that already exist in a GPU, but older GPUs that never receive the update do not benefit even if their hardware would be compatible. The tessellation unit in DX11 and newer GPUs is an example of a design requirement in DX11, as using the shaders for the task would slow down rendering of the current workload. That's why older GPUs cannot run DX11; in principle the work could be done in the shaders, but at a high performance penalty.

AMD Bulldozer, being faster and more efficient if properly utilised, beats Intel by a mile in logic tasks. My 8-core 4GHz AMD Piledriver is 8x faster than my patched i7-7700HQ at compressing files (same software, drive pool).

The AMD Bulldozer architecture was the result of analysing which tasks were most common and aiming the design at those. AMD got a lot of flak even though it did what Apple did with the M1, optimising for the most common tasks, only it took a very small step instead of a large one. The media and enthusiast community were heavily biased against AMD, all because the Bulldozer line did poorly in number-crunching tasks. Of course it did badly when you ran a 3D physics simulator game, where a 4.5GHz Bulldozer was only as fast as a 2.6GHz first-gen Xeon core for core, but for the most common tasks Bulldozer outperformed Intel by a huge margin. In compression, compilation and the general tasks you would do, the AMD was as fast as or faster than Intel. AMD traded number-crunching performance for logic and the most common tasks partly because GPUs are much faster at number crunching while using much less power. Apparently trying to do the right thing gets you killed in today's world. The AMD CPU was more power efficient than Intel with the same number of cores (8) for code compiles and other tasks that kept the cores fully utilised without delays. It was called a heater because it used 50% more power than an Intel quad core, even though it had 8 cores. It was unfairly compared and, worse still, Intel crippled AMD underhandedly by having software run unoptimised code if it detected an AMD CPU. Today, people still find this CPU performs decently, while the old Sandy Bridge has become too slow. Even my first-gen i-series Xeons are running at almost twice their clocks.

The Apple M1 chip, taking all the lessons learnt above and pushing it as far as possible.

With this, one might ask: why the CISC approach? The reason is that in the earlier days of Intel, they built on the emulated processor concept (microcode). This meant that a CPU was a general-purpose design on which we could emulate any architecture we want, a bit like FPGAs but with complex logic instead of many, many simple slices. Making the logic very complex allowed the CISC CPU to perform well for any task and to be updated. Intel never does that today for profit reasons, as upgrading an old CPU's microcode to be better and support newer instructions (even if slower than on new CPUs) would mean fewer people buying new CPUs. As this trend continued and Intel didn't want to innovate and deliver decent processors with decent features at decent prices, everyone just started looking elsewhere. Sadly, many fail to keep up with the news that Intel is falling way behind. They had some interesting concepts, and it would have been nice to build an Intel Skull Canyon NUC cluster, but those days are gone as both Apple and AMD have taken the lead in their respective audiences and markets with their own efforts.

The RISC approach was to minimise the hardware design and maximise the software design. Well-written software will run well on both RISC and CISC CPUs, but badly written software will run fine on CISC and poorly on RISC. Some coding conventions that you are not supposed to use can also improve performance on a RISC CPU. Minimising the hardware design made each piece of silicon smaller, cheaper, less power hungry, and quicker to design. Taking from the lessons of the past, Apple set out to create an SoC full of many different units working asynchronously. Why have CPU cores with compute units for number crunching and waste power when you could have a dedicated core with just that number-crunching unit alone? Apple's approach was to provide dedicated compute units for each type of common workload, from rendering to media encoding, and to keep the SoC asynchronous as well so that the CPU itself stays simpler. One workload that hasn't yet been tested is graphical/video rendering, AI, gaming and streaming all at the same time, to see whether the M1 would allow all of that to happen at full speed, though its power use in that scenario might be quite high.

So far Apple's M1 SoC looks very promising as a portable chip replacing Intel, but it remains to be seen whether true multitasking as I described above can be done safely without performance losses and excessive heat, and with pricing way lower than what they had when they used Intel. Intel has been pushing planned obsolescence, unlike AMD, and Apple does it even better by having a more selective audience, so their chips not getting a microcode update would mean they are obsolete in features by the next release. If only Apple didn't use slave labour, it would have my recommendation. While Intel is not a choice currently, AMD is, and I still see people with outdated information recommending Intel when they should be recommending AMD.
