Conf42 Observability 2024 - Online

Understanding Data Center Engineering Systems: A Concise Guide for Developers. Key Insights to Consider

Abstract

Unlock the secrets of data center engineering! Join me at Conf42 Observability to demystify complex systems, optimize performance, and bridge development with operations.

Summary

  • Today I will talk about a topic that is not commonly covered at such conferences. We'll discuss core infrastructure, and data center infrastructure in particular. Of course, all engineering topics are deep and complex, and it's impossible to cover them all.
  • Developers, IT architects, and other field professionals often lack a solid understanding of core IT infrastructure. This gap in knowledge can be fatal, especially for critical systems. The more critical the system is, the deeper the infrastructure awareness should be.
  • The three pillars of any data center are electricity, cooling, and connectivity. Without a reliable power supply, servers and other hardware cannot function. In 2023, data centers consumed around eight gigawatts of power.
  • Data centers consume huge amounts of electricity, and that electricity is transformed 100% into heat and nothing else. Cooling techniques fall into two main groups: air-based and non-air-based. Free cooling is my favorite cooling method.
  • Free cooling is one of the most efficient cooling technologies. It provides the best power usage effectiveness, or PUE, value. PUE is a ratio that describes how efficiently a data center uses energy. Ideally it equals one, but achieving this perfect scenario is impossible in the real world.
  • The three pillars of any data center are electricity, cooling, and connectivity. Today we took a virtual tour of the data center electricity infrastructure. We also briefly covered data center servers, including GPUs. The more critical your system is, the deeper your infrastructure awareness should be.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Today I will talk about a topic that is not commonly covered at conferences like this one. We'll discuss core infrastructure, and data center infrastructure in particular. Of course, all engineering topics are deep and complex, and it's impossible to cover them all. I just want to open this box slightly, have a look inside, and hope that, in case of need, you will know where to look deeper.
But before that, let me introduce myself. My name is Egor, and I have been leading critical IT infrastructure for more than 15 years now. I am the head of IT infrastructure at one of the world's biggest marketplaces, and before that I led infrastructure transformation at one of Europe's largest banks and IT ecosystems.
Developers, IT architects, and other field professionals often lack a solid understanding of core IT infrastructure. This gap in knowledge can be fatal, especially for critical systems. So today we'll try to bridge that gap and talk about why it's important to understand one of the main parts of core infrastructure: the data center level. We'll discuss data center infrastructure setup and operations, and the main idea of this talk is to invite you on a virtual tour of a data center's main systems and let you see how these systems work.
Imagine hosting your systems on core infrastructure the same way you would send your child to daycare. You could choose the cheapest or closest kindergarten, but would your child be safe and sound there? As a responsible parent, you would normally research and select the best option to ensure your child's safety and comfort. In the same way, you should investigate and choose the best infrastructure environment for your systems to provide the required level of reliability and resilience. That's why it's important to understand what core infrastructure your systems are hosted on. We in the infrastructure department call this infrastructure awareness: a deep understanding of all the engineering and IT systems behind the product level.
You may ask whether infrastructure awareness is important for every kind of IT system, and the answer is: it depends on how critical the system is. The more critical the system, the deeper the infrastructure awareness should be. If you are working on a personal pet project, you can run it on your local PC, and if something fails, it's not a big deal: only you as a developer are affected, and you can fix it at your own pace. However, if you are responsible for a corporate application used by many people and tied to business processes, the stakes get higher. Even if a small back-office application for ordering lunches goes down for a week, it will be inconvenient but not disastrous; still, it is a much greater issue than a pet project failure. And if a customer-facing marketplace application fails, the impact of downtime will be huge: critical business processes will stop, and even the very existence of the business can be at risk. Many root causes of critical failures live at the infrastructure level: a rack or server fails, connectivity fails, and so on. That's why we should think about core infrastructure very carefully. This is exactly what I mean by infrastructure awareness. So first of all, you need to define how critical your system is and decide what to do in case of possible failures. Usually, system architects limit themselves to the platform level: for example, they plan several instances so that if one fails, the rest will bear the load.
If you deal with non-critical systems, it's enough to stop at this level, and that's it. But if the system is critical, you must drill down further. There are many more questions to be answered. For example: where are your servers and virtual servers hosted? What kind of cloud provider is used, if any? Are the servers your own? What is the quality of the data centers they are hosted in, and in what geographical regions and countries are those data centers located? And many, many more.
One of the most important elements of core infrastructure, which developers and architects often overlook, is the data center. I'll give you a moment to try and figure out what systems it includes. The three pillars of any data center are electricity, cooling, and connectivity. Let's elaborate a bit on each. Electricity is the lifeblood of a data center: obviously, without a reliable power supply, servers and other hardware cannot function. Data centers often have multiple power sources, including backup generators and uninterruptible power supplies, providing continuous operation during mains power outages. Then comes cooling. As you may know, data centers generate a significant amount of heat. Effective cooling systems remove heat energy from the IT equipment and prevent hardware failures caused by overheating; later on we'll talk about this in more detail. Basically, the goal is to remove heat. And then there is connectivity, without which a data center is completely useless; that one is a must. All the other engineering systems you may have heard of, like fire suppression, physical access systems, and automation, are actually optional.
Data centers come in different sizes, but how do we measure them? Do we need to know the area of the building, or the number of racks or units? No, no, and no. The funny thing is that data centers are measured in kilowatts. In other words, we use power, in watts, to understand the capacity of a data center; of course, more often we speak about thousands of kilowatts, or megawatts. A small data center has a power capacity of up to 3 MW, a medium-sized one three to 10 MW, and we consider anything above 10 MW a large data center. For context, 1 MW is enough to power about 200 homes. And what about XXL size? When we speak about hyperscalers such as Amazon, Google, Meta, and so on, their data center capacity reaches a gigawatt. To give you an idea of how big this number is, let me just say that, all in all, in 2023 data centers consumed around eight gigawatts of power, so you can see that half of this amount is consumed by the hyperscalers.
As we have discovered, data centers need power, and I like to visualize it as a tree-like structure with a trunk and numerous branches and leaves. It is a fitting metaphor for how electricity is redistributed from one central source to numerous consumers, ensuring that power reaches every endpoint device. Electricity comes from the local power grid over high-voltage transmission lines, and on its way to data center consumers the high voltage is transformed to medium and then to low, usable voltage. From the transformer, electricity goes through the primary distribution panel, which you can see on the screen, to be further distributed to the IT equipment load. Electricity also always goes through a UPS, which stands for uninterruptible power supply, for backup power in case of grid failure. On the screen, on top of the racks, you can see a distribution busbar with hanging red distribution outlets that are connected to the server racks.
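To make the sizing figures above concrete, here is a minimal Python sketch that classifies a data center by its power capacity and estimates how many homes the same power could supply, using the numbers from the talk (up to 3 MW small, 3 to 10 MW medium, above 10 MW large, around a gigawatt for hyperscalers, and roughly 200 homes per megawatt). The function names and the exact hyperscale cutoff are illustrative assumptions, not an established standard.

    # Illustrative sketch only: the thresholds and the ~200 homes/MW figure come from the talk;
    # the 1,000 MW hyperscale cutoff is an assumption made for this example.
    def classify_data_center(capacity_mw: float) -> str:
        """Classify a data center by its power capacity in megawatts."""
        if capacity_mw <= 3:
            return "small"
        if capacity_mw <= 10:
            return "medium"
        if capacity_mw < 1000:
            return "large"
        return "hyperscale"  # the realm of Amazon, Google, Meta, and so on

    def equivalent_homes(capacity_mw: float, homes_per_mw: int = 200) -> int:
        """Rough number of homes that the same amount of power could supply."""
        return int(capacity_mw * homes_per_mw)

    for mw in (2, 7, 50, 1000):
        print(f"{mw} MW -> {classify_data_center(mw)}, ~{equivalent_homes(mw)} homes")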
That's what a diesel generator looks like. In case of a prolonged grid failure, it will take on the IT load, and the data center will keep working for as long as needed; a diesel generator can operate even for months.
Now we know that data centers consume huge amounts of electricity, and I have a tricky question for you: what is all that electricity actually used for? I'll give you a moment to make a guess. The answer is heat. Electricity is 100% transformed into heat and into nothing else. To get rid of the produced heat, we need to take it away somehow, and there are many ways to do it. Cooling techniques fall into two main groups: air-based and non-air-based. Air cooling, used by the vast majority of data centers, includes common air conditioning systems, while non-air cooling systems use water, oil, or solid materials. Air conditioners are the most common air cooling method in data centers; they function the same way as the ones we have at home, just bigger. Chillers are the second most common cooling system, using water or water-based solutions to transfer heat from the IT equipment halls to the outside. They are more energy efficient than air conditioners but require more complex installation and maintenance. Adiabatic cooling involves chambers or mats in which water evaporates and cools the air; this technique is normally used in addition to other cooling methods. Exotic methods include Peltier elements, which rely on the thermoelectric effect, and underwater data centers like Microsoft's Project Natick.
There is a method that stands out from the other cooling techniques. It's called free cooling, and it uses normal outside air as it is: free cooling just runs outdoor air through the IT equipment and does nothing more. If the air is warm, we simply need to run more of it through the servers. As free cooling requires nothing but fans and outside air, this technique is extremely energy efficient. To be honest, free cooling is my favorite cooling method, one in which I have quite broad experience, and I am always eager to discuss it and answer any questions about it.
As I said, free cooling is one of the most efficient cooling technologies, and it provides the best power usage effectiveness, or PUE, value, which is one of the most important metrics for understanding the efficiency of a data center. Many companies worship PUE as a sacred animal, so let's try to figure out why it's so important. What is PUE? Power usage effectiveness is a ratio that describes how efficiently a data center uses energy, specifically how much energy is used by the computing equipment in contrast to the cooling and other overhead that supports it. Ideally it equals one, which means that all energy is used for IT equipment without any wastage. However, achieving this perfect scenario is impossible in the real world. There are different methodologies for calculating PUE: for example, you can calculate an instantaneous PUE, while a more comprehensive assessment is to calculate an annual average PUE. I have just said that it is impossible to achieve a PUE equal to one, but do you think it is possible to achieve a PUE of less than one? The answer is yes, but for this we need to switch from the terms of power to the terms of money. For example, a few data centers in the USA and Europe use heating pipes and sell hot water to nearby towns. In this case, if we calculate their PUE in terms of money, it will be even less than one, because they profit from selling this hot water.
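As a rough illustration of the PUE methodologies just mentioned, here is a minimal Python sketch that computes an instantaneous PUE from a single pair of power readings and an annual average PUE from a series of energy readings. The sample numbers and variable names are hypothetical, invented for the example.

    # PUE = total facility energy / energy used by IT equipment.
    # The ideal value is 1.0 (no overhead); real data centers are always above it.
    def instantaneous_pue(total_facility_kw: float, it_equipment_kw: float) -> float:
        """PUE from a single snapshot of power readings."""
        return total_facility_kw / it_equipment_kw

    def annual_average_pue(total_facility_kwh: list, it_equipment_kwh: list) -> float:
        """PUE from energy consumed over a year, e.g. twelve monthly meter readings."""
        return sum(total_facility_kwh) / sum(it_equipment_kwh)

    # Hypothetical sample: a free-cooled hall with very little cooling overhead.
    print(instantaneous_pue(total_facility_kw=1100.0, it_equipment_kw=1000.0))  # 1.1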
Now, to more tangible things. Many IT professionals are well familiar with servers and racks, but there are also those who are not, and this is an opportunity to take a quick tour of a data hall and get an overall impression of its setup. For many of you it's obvious, but I'll mention it anyway: the data center servers are mounted in racks, which are standardized frames or cabinets that host multiple servers and network equipment. This is how they look, and here is the view from another angle. Each rack contains many units, allowing servers to be stacked vertically. This setup makes operation and maintenance easier and maximizes space efficiency. As you can see, network ports and drive slots are located at the front of the servers so that a data center engineer can easily access them.
Now we are witnessing the rise of ML and AI technologies, so more and more data center capacity is allocated to GPUs. GPUs, though physically small, consume significant amounts of electricity: one GPU node can consume more than 8 kilowatts of power. That means they produce the same huge amount of heat. To address this problem, data centers have to upgrade their electrical and cooling infrastructure to support the increased power and cooling demand.
To wind up, let us quickly go through today's takeaways. The three pillars of any data center are electricity, cooling, and connectivity. Today we took a virtual tour of the data center electricity infrastructure. We made an overview of DC cooling technologies, with a special focus on free cooling. We also briefly covered data center servers, including GPUs. But if you take away only one thing from today's talk, let it be this: the more critical your system is, the deeper your infrastructure awareness should be. I hope this short data center infrastructure tour was useful. If you are interested in any of the topics discussed above, especially free cooling, you are always welcome to contact me directly or refer to my articles on HackerNoon. Thank you very much.
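To put the GPU power figures from the tour into perspective, here is a small back-of-the-envelope Python sketch. The more-than-8-kW-per-node figure and the "all electricity becomes heat" rule come from the talk; the four-node rack layout is a purely hypothetical assumption.

    # Back-of-the-envelope sketch based on the figures mentioned in the talk.
    KW_PER_GPU_NODE = 8.0   # lower bound for one GPU node, per the talk
    NODES_PER_RACK = 4      # hypothetical rack layout, purely for illustration

    rack_power_kw = KW_PER_GPU_NODE * NODES_PER_RACK
    rack_heat_kw = rack_power_kw  # electricity is 100% transformed into heat

    print(f"Such a rack draws at least {rack_power_kw:.0f} kW "
          f"and emits at least {rack_heat_kw:.0f} kW of heat to remove.")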
...

Egor Karitskiy

Head of IT infrastructure @ Wildberries

Egor Karitskiy's LinkedIn account


