Last month Intel released their new line of enterprise-class x86 server processors, the Xeon 7500-series “Nehalem-EX” processors. This is very significant, as their existing enterprise x86 processors (7400-series) were getting quite old and were not particularly competitive. The new Xeons provide much higher computational performance, as well as many enhancements for reliability, availability, and serviceability (RAS). They are immediately available in 4-socket configurations and will also be appearing in 8-socket configurations.
With a product this complex, it’s very difficult to cover every aspect of the new design. I will be focusing primarily on the performance of the new processors, with a particular focus on HPC as that is the market with which I’m most familiar.
To the best of my knowledge, the Xeon 7500s are some of the most diverse processors released under the same name. Their core counts range from 4 to 8, with clock speeds ranging from 1.87GHz to 2.67GHz and L3 cache ranging from 12MB to 24MB. This makes the decision of which processor to purchase more difficult than ever before, as one can’t easily determine which processor is “best”. You have to carefully evaluate your application and requirements, as well as the capabilities of each model.
Introduction and Disclaimer
No matter which processor model you choose, all offer great performance. They are built upon the Nehalem architecture, which launched in April 2009. Those processors (Xeon 5500-series) have performed very well over the last year, and have reclaimed many performance crowns from AMD’s Opteron processors – it’s a good architecture to build from. With improved features such as quad-channel DDR3 memory, the new Xeon 7500-series will be even faster.
Since I’m focusing primarily on the HPC space, make a mental note that the X7542 is the model Intel has designed for HPC workloads.
It’s worth noting that this write-up is simply an analysis of the processor architecture and features. I haven’t gotten my hands on a benchmark system yet, but some fairly solid conclusions can be drawn without verified performance results. Still, the models vary enough that unless you know the exact performance characteristics of your application(s), you’ll have to run benchmarks to know which processor is actually the “best”. Hopefully this analysis will help you know what to look for as you consider your options.
The L3 cache available on each model is one of the factors that varies widely across the product line. Not all the top-end models have the most cache, and the lowest-end model doesn’t have the least (the E7520 has more than the E7530). Most of the processor features vary in this manner, so you must determine which features are most important. Even if cost is not a factor, the most expensive model may not be the fastest for your application.
To see some of the implications of the cache quantities, it’s helpful to examine the amount of L3 cache per core. Although each core has access to the full L3 cache, all cores must share this fixed amount of storage. The plot below assumes the cache is equally divided among all cores:
Suddenly, the low-end E7520 looks attractive with the most cache per core, and the high-end X7550 looks like it might be one of the worst choices (only the E7530 has less L3 cache per core). You probably aren’t lucky enough that your application runs entirely in cache, but if it does, take this seriously. With the low cache per core on the X7550, you may not be able to use all the processor cores without the application spilling into main memory and running significantly slower.
Although the E7520 is not a particularly attractive processor once all factors are considered, this is where it might shine. With two to four times the cache per core of most other processors on the market, the E7520 will allow new applications to run entirely from cache – applications that never fit until now. This processor has significantly fewer processor cycles available than the other models, but when an application is running in cache almost no processor cycles will be wasted.
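The cache-per-core comparison is just a division of each model's L3 size by its core count. The short Python sketch below reproduces it for the three models singled out above. The core counts and cache sizes are assumptions taken from Intel's published specifications rather than figures stated in this article, so check them against the spec sheets before relying on them.

```python
# L3 cache per core, assuming the cache is divided equally among all cores.
# ASSUMPTION: the (cores, L3 size in MB) pairs come from Intel's published
# specifications, not from the article text. Verify before use.
MODELS = {
    "E7520": (4, 18),
    "E7530": (6, 12),
    "X7550": (8, 18),
}


def l3_per_core(cores, l3_mb):
    """Return MB of L3 cache per core under an even split."""
    return l3_mb / cores


cache_per_core = {name: l3_per_core(*spec) for name, spec in MODELS.items()}

# Print the models from most to least cache per core.
for name, mb in sorted(cache_per_core.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {mb:.2f} MB of L3 per core")
```

Under these assumed figures the E7520 comes out on top and the E7530 at the bottom, with the X7550 just above it, which matches the ordering described above.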
Processor Cores and Computational Throughput
Other features with some caveats are the processor core counts and thread counts. All of these Xeons (with the exception of the X7542) support hyperthreading, which allows two threads to be assigned to each processor core. While this does result in speedups for many applications, it’s notorious for offering no advantage or actually reducing performance of HPC applications (thus it has been disabled on the X7542).
In the plot above, the bars represent core counts and the points represent thread counts. The core counts are fairly straightforward, with one 4-core model, four 6-core models and three 8-core models. With hyperthreading enabled, each core appears to your operating system and applications as two logical processors, doubling the apparent core count.
Where it becomes less straightforward is the overall computational capacity of each processor. Even the lowest-end server will be endowed with 16 processor cores (up to 32 threads), so your application had better be parallel. The total capacity provided by the server will be a combination of the number of processor cores and their clock speed. Processor clock speeds are no longer increasing these days (they rarely even reach 3GHz), but processors are becoming more intelligent:
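As a rough illustration of the thread arithmetic, the sketch below counts the logical processors an operating system would see in a four-socket server. The function name and parameters are hypothetical, chosen only for this example.

```python
def logical_processors(sockets, cores_per_socket, hyperthreading=True):
    """Logical processors visible to the OS (hypothetical helper).

    Hyperthreading exposes two threads per physical core; without it,
    each core is seen as a single logical processor.
    """
    threads_per_core = 2 if hyperthreading else 1
    return sockets * cores_per_socket * threads_per_core


# Four sockets of the single 4-core model, hyperthreading enabled.
lowest_end = logical_processors(sockets=4, cores_per_socket=4)

# Four sockets of the 6-core X7542, which has hyperthreading disabled.
x7542_server = logical_processors(sockets=4, cores_per_socket=6, hyperthreading=False)

print(lowest_end, x7542_server)
```

Note that the X7542 server ends up with fewer logical processors than the lowest-end configuration, even though it has more physical cores.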
A new feature in Intel’s latest lines of processors, dubbed Turbo Boost, enables processors to selectively overclock themselves in certain situations. In the plot above, the bars represent the base clock speed and the points represent the maximum clock speed the processor may choose to run at. In short, a processor will increase its clock speed if doing so will not exceed the wattage/heat specs for that model. Don’t expect all the cores to be running at full speed while 100% loaded. The points in the plot are the very fastest possible, and you won’t reach them (but they do provide an upper bound).
There is quite a bit to be gleaned from this plot. The first five processors tend towards the low-end, although the E7540 is very close to the X7550. The E7520 doesn’t have any turbo boost at all. The X7542 is by far the fastest model. However, the X7550 and X7560 clocks are impressive when you consider that they have 8 cores. This will be very clear in the next plot.
The two L models in the middle are the low-wattage parts and thus have fairly low clock speeds. But note that they will go almost as fast as the top-end models when conditions allow. Judging from the amount of turbo boost they allow, I would expect these two models to be a bit more liberal than the others when overclocking.
Continuing with the idea of determining total computational capacity, consider the total number of processor cycles provided by each model. This is simply the number of cores multiplied by the clock speed:
This is the most straightforward plot of the bunch – about as linear as you could hope for. As in the plots above, the bars represent the base cycles and the points represent the theoretical maximum number of cycles with full turbo boost on every core.
A comparison of cores vs clock speed could be made between any models, but consider the top end. Although the X7542, X7550 and X7560 are in the same range, the X7542 provides its cycles in just 6 processor cores, while the other two run 8 cores at lower clock speeds. This is the type of difference that requires benchmarking: does your application perform better with 6 cores at 2.67GHz (X7542) or with 8 cores at 2GHz (X7550)?
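The cycle count behind this comparison is simply cores multiplied by base clock speed. A minimal Python sketch using the two configurations quoted above (turbo boost is ignored, and the function name is made up for illustration):

```python
def total_gcycles(cores, clock_ghz):
    """Billions of processor cycles per second across all cores at base clock."""
    return cores * clock_ghz


x7542 = total_gcycles(cores=6, clock_ghz=2.67)  # 6 cores at 2.67GHz
x7550 = total_gcycles(cores=8, clock_ghz=2.0)   # 8 cores at 2GHz

print(f"X7542: {x7542:.2f} billion cycles/s")
print(f"X7550: {x7550:.2f} billion cycles/s")
```

The two totals are nearly identical, so the raw cycle count cannot settle the question. Only a benchmark of your own application can show whether fewer, faster cores or more, slower cores is the better fit.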
Despite the simplicity of the plot above, it’s very important to recognize that the raw number of cycles a processor provides rarely translates into real performance. Every processor architecture has its own efficiency – the amount of useful work actually accomplished during each processor cycle. These efficiencies vary widely depending on the design, and will vary widely within the Xeon 7500-series due to the variety of capabilities each provides.
All of the features within each processor must be considered to determine efficiency, but inter-processor communication is also vital:
Intel’s QuickPath Interconnect (QPI) was introduced last year with the Xeon 5500-series “Nehalem” processors. Much like AMD’s HyperTransport, QPI transforms Intel-based servers into NUMA machines. This greatly increases performance, but also increases the complexity of memory operations and communication between processors.
Four-socket Xeon 7500-series servers will be fully connected, with each processor having a QPI link to each of the three other processors. The speed of the interconnect will be determined by the processor model, ranging from 4.8 to 6.4 billion transfers per second (GT/s).
But before making your judgement, keep in mind that QPI transfers are not typically necessary for a single process running on one core. The transfers become necessary when multiple processes must communicate with one another or when a process needs access to memory attached to another processor. Thus, we have to look at how much bandwidth is available to each processor core:
These values correlate with the number of processor cores. As core counts increase from 4 to 6, and 6 to 8, the QPI simply isn’t able to keep up. The 8-core processors are not able to provide as much communication bandwidth to each core, so performance may suffer if your application requires a lot of communication. All but the most communication-intensive applications should fare better with more cores, but the efficiency of each core will be lower. You’ll be able to make a much more educated decision if you know the communication to computation ratio of your application(s).
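To see why per-core bandwidth falls as core counts rise, divide the QPI transfer rate by the number of cores sharing it. The sketch below uses the 4.8 and 6.4 GT/s endpoints quoted above. The pairing of each speed with a core count is illustrative only, not a statement about any particular model.

```python
def qpi_per_core(qpi_gts, cores):
    """QPI transfer rate (GT/s) available to each core under an even split."""
    return qpi_gts / cores


# Illustrative pairings of link speed and core count (assumptions, not
# per-model specifications).
slow_link_4_cores = qpi_per_core(4.8, 4)
fast_link_6_cores = qpi_per_core(6.4, 6)
fast_link_8_cores = qpi_per_core(6.4, 8)

for label, value in [
    ("4.8 GT/s shared by 4 cores", slow_link_4_cores),
    ("6.4 GT/s shared by 6 cores", fast_link_6_cores),
    ("6.4 GT/s shared by 8 cores", fast_link_8_cores),
]:
    print(f"{label}: {value:.2f} GT/s per core")
```

Even with the fastest link, eight cores each receive less than four cores do on the slowest link, which is the effect described above.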
Few datacenters, if any, can ignore the power consumption of their servers. It’s naive to simply look at the overall power consumption of each server – you have to balance the power it consumes with the performance it provides. There is some variation between the new Xeons:
The above plot is a bit different from the others, as the bars represent the power consumed by each processor core (in watts). The points represent the number of watts consumed per billion processor cycles. Examining the bars is the naive approach, as this ignores performance per watt (but it’s worth looking at once). Examining the points reveals the true performance per watt of each model.
Obviously, the cheapest low-end model consumes the most power. The energy-efficient L7555 scores the best rating, but the supposedly efficient L7545 doesn’t look very attractive. If your servers do not frequently run idle, you’ll see about the same power efficiency from the E7540, X7542, X7550 or X7560.
Although the HPC-oriented X7542 appears to be poor when looking at the bar, it’s very competitive when looking at the point (watts per billion cycles). This is because it has fewer cores, but runs them at a very high clock speed. Just another reason why you must carefully examine the data before making your choice.
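The two power metrics can be sketched as follows. The 130 W figure used for both processors is an assumption based on Intel's published thermal design power ratings, not a number from this article, and rated power is only a stand-in for what a processor actually draws under load.

```python
def watts_per_core(tdp_watts, cores):
    """Rated power divided evenly across cores (the bars)."""
    return tdp_watts / cores


def watts_per_gcycle(tdp_watts, cores, clock_ghz):
    """Rated power per billion cycles per second at base clock (the points)."""
    return tdp_watts / (cores * clock_ghz)


ASSUMED_TDP = 130  # watts; assumption, verify against Intel's spec sheets

# X7542: 6 cores at 2.67GHz. X7550: 8 cores at 2GHz.
x7542_core = watts_per_core(ASSUMED_TDP, 6)
x7550_core = watts_per_core(ASSUMED_TDP, 8)
x7542_cycle = watts_per_gcycle(ASSUMED_TDP, 6, 2.67)
x7550_cycle = watts_per_gcycle(ASSUMED_TDP, 8, 2.0)

print(f"X7542: {x7542_core:.2f} W/core, {x7542_cycle:.2f} W per billion cycles")
print(f"X7550: {x7550_core:.2f} W/core, {x7550_cycle:.2f} W per billion cycles")
```

Under that assumption the X7542 looks noticeably worse per core, yet the two models are essentially tied per billion cycles.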
The final decision will always come down to price:
Similar to power consumption, it’s wiser to look at performance per dollar than just the raw price of each model. The results are fairly easy to comprehend. The E7520 is the cheapest, with the E7530 and X7542 being very cost-effective. The rest of the models cost quite a bit more, with the fast, power-efficient L7555 being the most costly.
The total price for one of these four-socket servers will be in the range of $10,000 to $30,000. The quantity of memory will be a factor, but for most installations the processor cost will make up the majority of the total server price. The cost differential between some of the models might appear daunting, but remember to factor in the overall cost-effectiveness of these systems. You may be able to replace 20 existing servers with just one of these four-socket Xeon 7500 servers.
Unfortunately, the upshot of all these various factors is that there can’t be a conclusion. None of the processors is better than the others in every possible way. The new Xeon 7500-series do provide exceptional performance, but only you can determine the most cost-effective model for your application(s).
Given the details above, these seem to be the pros and cons of each model:
E7520: Low cost and high L3 cache/core make this model attractive for cache-bound applications. But its low core count, low processor frequency, lack of turbo boost and low power-efficiency will make this model unattractive for most applications.
E7530: With the second-lowest price per cycle, good QPI speeds and a decent number of processor cycles available, this model is acceptable. But with the lowest L3 cache per core, the E7530 must be chosen with attention to that detail.
E7540: This model fares well in almost every category, so it should be a popular choice. Its cache size and QPI speed are among the highest of any model. But processor frequency and power efficiency are still towards the bottom end, so the higher cost of this model will make it less attractive to some.
L7545: While billed as a power-efficient processor, the L7545 appears to be in the bottom half for power vs performance. With specs very similar to the E7540, it will probably perform just a bit better. But its high price will turn away power users, and its low power-efficiency will turn away power-conscious users.
L7555: This processor has the absolute best performance per watt, but the highest price per cycle. The L7555 will perform exceptionally well, but only if you can afford it. The low QPI speed is worth noting before you purchase it.
X7542: Abundant L3 cache, excellent processor frequency, good QPI speed and a low price per cycle will make this model popular. It’s billed as the model for HPC, which is a price-sensitive market, so this makes sense. But it is also fairly power efficient, so I would not be surprised if this becomes one of the most popular models.
X7550: This is another model that will probably only fit in because it costs less. Its L3 cache per core is near the bottom – only beating the E7530. Processor frequency and QPI speeds are good, considering it’s an 8-core processor. But it doesn’t offer anything to beat its siblings except that it costs less than the X7560.
X7560: It’s usually safe to assume that the highest-end model is the one to purchase if you have the budget. While that is generally true for this model, the lower QPI transfers per core will be a liability for communication-intensive applications. It offers the most cache, the most processor cycles and the second best power efficiency, so it’s a good choice.
One use-case that is not addressed here, and has been suggested over the last few years, is that many applications simply cannot use the number of cores available with the latest processors. An increasing number of users may choose to use fewer than the full number of cores on each processor. This will result in increased cache/core, increased processor frequency (due to turbo boost) and increased QPI transfers/core, but will reduce the total processor cycles provided by each processor (the cycles on unused cores must be discarded). Users in this category will have to perform their own analysis and may not be able to use any of the conclusions listed in this article.
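For readers in that category, the trade-off can be sketched by recomputing the per-core figures for the cores actually in use. The example below takes an 8-core, 2GHz part such as the X7550. Its 18MB L3 size and 6.4 GT/s QPI speed are assumptions drawn from Intel's published specifications, and any turbo boost gain is deliberately left out.

```python
def per_core_view(l3_mb, qpi_gts, clock_ghz, cores_used):
    """Per-core resources and total cycles when only `cores_used` cores run.

    Ignores turbo boost, so total_gcycles is a lower bound when cores sit idle.
    """
    return {
        "l3_mb_per_core": l3_mb / cores_used,
        "qpi_gts_per_core": qpi_gts / cores_used,
        "total_gcycles": cores_used * clock_ghz,
    }


# ASSUMPTION: 18MB of L3 and 6.4 GT/s QPI for an 8-core 2GHz part.
all_cores = per_core_view(l3_mb=18, qpi_gts=6.4, clock_ghz=2.0, cores_used=8)
half_cores = per_core_view(l3_mb=18, qpi_gts=6.4, clock_ghz=2.0, cores_used=4)

for label, view in [("8 of 8 cores", all_cores), ("4 of 8 cores", half_cores)]:
    print(label, view)
```

Halving the active cores doubles the cache and QPI share of each remaining core but also halves the total cycles, so the choice only pays off when the application is limited by cache or communication rather than raw computation.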
The final decision is further confused by AMD’s new Opteron processors, which appear to be some of the most cost-effective processors ever released. I expect serious users will have to very carefully examine all their options (both AMD Opteron 6100-series and Intel Xeon 7500-series). Mission critical servers may not be able to use the Opterons, as they are not equipped with the same caliber of RAS features. But AMD has strongly established themselves with the Opteron line of processors, and it’s unwise to ignore them. The Opteron 6100-series model lineup is less complex than the Xeon 7500-series, but I’ve posted a full write-up. After that will come real-world benchmark results.
Data analyzed with R:
R Development Core Team (2009). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.