Transcript
Hello, everyone, and thank you for joining me today at the Conf42 DevOps conference.
My name is Stefan and I work as a DevOps engineer at Analog Devices.
I'm excited to present to you today an automated way of testing on hardware
across a wide range of platforms that run different Linux distributions.
System level testing plays an essential role for quality assurance
in the fast evolving landscape of software and hardware integration.
A robust testing infrastructure is required to ensure that individual
software components work together.
Key attributes like automation, rigorous testing, and coverage
are some of the must-haves for achieving reliable integration.
By implementing these best practices, organizations can deliver high quality
systems that meet user expectations.
A reliable test infrastructure also accelerates test execution, but
most importantly, it reduces the probability of finding bugs in the production phase.
The Linux distribution I was talking about is called Analog Devices Kuiper and it
is a free, open source distro customized for Analog Devices signal chains.
It comes pre-equipped with essential components like device drivers, pre-built
boot files for FPGA and Raspberry Pi based solutions, as well as a
wide range of development utilities, libraries, and project examples.
The system supports multiple hardware platforms: AMD and Intel FPGA based
platforms, Raspberry Pi, and also NXP.
To streamline software management, Kuiper Linux incorporates a
custom Linux package repository, simplifying software component
installation and updates.
This is included by default in the image, ensuring ease of use for both
testing and production environments.
For more details regarding how Kuiper Linux releases got optimized, please
check "Refining the release strategy of a custom Linux distro", which is also
presented at this edition of Conf42 DevOps by one of my colleagues, Andrea.
The continuous integration flow for individual software components is handled
using classic CI tools, such as Jenkins, Azure Pipelines, or GitHub Actions.
The CI builds software components across various operating systems
and the resulting output binaries, which might be Linux packages,
Windows installers, or just archives, are stored in GitHub releases,
a package manager, or internal servers.
This streamlined approach not only improves efficiency, but
also ensures traceability, version control, and artifact accessibility,
making it easier to integrate and deploy them into larger systems.
The testing process on hardware closely mirrors the one for software.
The workflow typically begins with specific boot files being
written onto the hardware boards to configure them for testing.
Tests are executed in parallel across multiple hardware setups,
reducing overall testing time and increasing efficiency.
Because the popular testing frameworks available on the market are either designed
for application level testing or are specific to a single hardware platform,
we worked on creating our own testing framework, called Hardware Test Harness.
It is designed to unify testing across a wide range of hardware platforms
enabling consistent execution regardless of the hardware type.
Builds are triggered by pushes or pull requests in GitHub.
Multiple repositories are monitored, such as HDL, Linux, libraries, or applications,
and changes to any of them will trigger the entire testing process.
Let's see what happens after the build passes.
The resulting binaries are saved on internal servers, in this case JFrog Artifactory.
And then the main Jenkins job is triggered.
The Jenkins job begins by downloading the binaries from Artifactory and
writing them onto the hardware platforms connected to each agent.
After this, the automated tests are distributed and executed.
The tests are mostly written in Python and have specific tags that
indicate hardware compatibility.
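To illustrate the idea, here is a minimal pytest-based sketch of such tagging; the marker name, the --hw option, and the board names are hypothetical, not the actual Hardware Test Harness implementation:

```python
# conftest.py -- minimal sketch of hardware-compatibility tags (illustrative only).
import pytest

def pytest_addoption(parser):
    # The agent tells pytest which board is physically attached, e.g. --hw=zynq-zc706.
    parser.addoption("--hw", action="store", default=None,
                     help="board connected to this agent")

def pytest_collection_modifyitems(config, items):
    board = config.getoption("--hw")
    skip = pytest.mark.skip(reason="test not tagged for the connected board")
    for item in items:
        marker = item.get_closest_marker("hardware")
        # Skip any test whose 'hardware' tag does not list the connected board.
        if marker and board not in marker.args:
            item.add_marker(skip)

# test_example.py -- a test declaring which platforms it is compatible with.
@pytest.mark.hardware("zynq-zc706", "raspberrypi4")
def test_boot_banner():
    assert True  # placeholder body; real tests talk to the board over its interfaces
```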
Let's see now why we need an intermediate Artifactory server.
This server is required for several reasons.
The main one is that the hardware test harness is distributed across
multiple physical locations, requiring a centralized system to share binaries.
Another reason is to handle concurrency.
Since multiple repositories are involved, this will act as a buffer.
And it also ensures better organization and versioning of binaries,
making the process more scalable.
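As a rough illustration (host, repository, and credentials below are placeholders), binaries can be uploaded to and fetched from Artifactory with plain HTTP calls against its REST interface:

```python
# Sketch of pushing a build artifact to Artifactory and pulling it back on a test agent.
import requests

ARTIFACTORY = "https://artifactory.example.com/artifactory"   # placeholder host
REPO = "hw-test-binaries"                                      # placeholder repository
AUTH = ("ci-user", "api-key")                                  # placeholder credentials

def upload(local_path: str, remote_path: str) -> None:
    # Artifactory deploys a file with a simple HTTP PUT to the target path.
    with open(local_path, "rb") as f:
        resp = requests.put(f"{ARTIFACTORY}/{REPO}/{remote_path}", data=f, auth=AUTH)
    resp.raise_for_status()

def download(remote_path: str, local_path: str) -> None:
    resp = requests.get(f"{ARTIFACTORY}/{REPO}/{remote_path}", auth=AUTH, stream=True)
    resp.raise_for_status()
    with open(local_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```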
Another question may be why the Jenkins agents need to be physical machines.
The primary reason here is that these agents directly control the
hardware setups, including power management, UART connections, Ethernet,
and in some cases the JTAG interface.
Some of the build servers are also physical machines, because builds require
significant computational resources or paid licenses, as in the Xilinx or MATLAB cases.
A good framework should be adaptable to different types of
hardware and testing scenarios.
It should be modular enough to accommodate changes, such as adding
new DUTs or modifying test cases.
Because multiple repositories and multiple Jenkins agents are involved,
a test manager is necessary to queue, distribute, and execute tests.
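Conceptually, the queueing part of such a test manager could be sketched like this; the data structures and names are invented for illustration and do not reflect the real implementation:

```python
# Toy dispatcher: queue test requests from several repositories and hand each one to an
# agent that has the matching board attached and currently free.
from collections import deque
from dataclasses import dataclass

@dataclass
class TestRequest:
    repo: str          # repository that triggered the run, e.g. "linux" or "hdl"
    board: str         # target platform, e.g. "zynq-zc706"
    artifact_url: str  # where the boot files were uploaded

AGENTS = {                              # agent name -> boards physically attached to it
    "agent-1": {"zynq-zc706", "raspberrypi4"},
    "agent-2": {"de10-nano"},
}

def dispatch(pending: deque, busy_boards: set) -> None:
    for _ in range(len(pending)):
        req = pending.popleft()
        agent = next((name for name, boards in AGENTS.items()
                      if req.board in boards and req.board not in busy_boards), None)
        if agent is None:
            pending.append(req)         # no free matching board yet, keep it queued
            continue
        busy_boards.add(req.board)
        print(f"{agent}: run {req.repo} tests on {req.board} using {req.artifact_url}")
```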
But let's first go through some implementation details.
Jenkins was used as the main tool.
It can be hosted independently without relying on external servers, integrates
easily with tools like GitHub or Artifactory, and benefits from a
large online community for support.
Additionally, features such as Jenkins Shared Libraries, the dynamic scripting language, and
the resource locking mechanism proved to be extremely useful in this context.
Another tool used is Nebula.
This tool was developed by us and consists of a collection of
Python scripts that manage hardware connections, such as sending UART
commands, configuring Ethernet IPs, sending files through SSH, and so on.
In the event of a hardware setup failure, we can physically reboot the system
and bring it back online using a Power Distribution Unit and USB SD card muxes.
Both of them are also controlled through Python by Nebula.
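The function names, PDU endpoint, and device paths below are invented, but they give an idea of the kind of low-level operations such scripts wrap:

```python
# Illustration only -- not Nebula's actual API.
import subprocess
import time
import requests    # used here for a hypothetical HTTP-controlled PDU
import serial      # pyserial, for the UART console

def uart_command(port: str, cmd: str, baud: int = 115200) -> str:
    # Send one command over the serial console and return whatever the board prints back.
    with serial.Serial(port, baudrate=baud, timeout=5) as console:
        console.write((cmd + "\n").encode())
        return console.read(4096).decode(errors="replace")

def power_cycle(pdu_host: str, outlet: int) -> None:
    # Placeholder PDU calls: real units expose SNMP, telnet, or vendor-specific APIs.
    requests.post(f"http://{pdu_host}/outlet/{outlet}/off", timeout=10)
    time.sleep(5)
    requests.post(f"http://{pdu_host}/outlet/{outlet}/on", timeout=10)

def copy_boot_files(board_ip: str, files: list) -> None:
    # Push new boot files to a running board over SSH/SCP (user and path are placeholders).
    subprocess.run(["scp", *files, f"analog@{board_ip}:/boot/"], check=True)
```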
As we added more hardware setups, we needed a tool to keep track
of all of the devices under test.
And that's where Netbox comes in.
It is a free, open source tool, originally designed for modeling
and documenting network racks, but it also fits our use case.
We use it to generate Nebula configuration files, YAMLs that contain information
about each DUT, such as the platform, the board that is plugged into it,
Ethernet and serial addresses, PDU outlet, and USB connections.
The NetBox configuration needs to be updated only when new DUTs
are added, removed, or rearranged.
All data stored in NetBox is backed up automatically in Artifactory and it can
be restored if something goes wrong.
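As a sketch of how such YAMLs could be generated (the NetBox URL, token, role name, and custom fields below are placeholders), the pynetbox client can be used to query each DUT and dump its details:

```python
# Pull DUT records from NetBox and emit a Nebula-style YAML config (illustrative only).
import pynetbox
import yaml

nb = pynetbox.api("https://netbox.example.com", token="REDACTED")

duts = []
for device in nb.dcim.devices.filter(role="dut"):
    duts.append({
        "name": device.name,
        "platform": str(device.platform) if device.platform else None,
        # Per-DUT wiring details (PDU outlet, USB mux, UART port) would typically
        # live in custom fields on the device record.
        **(device.custom_fields or {}),
    })

with open("nebula_config.yaml", "w") as f:
    yaml.safe_dump({"duts": duts}, f, sort_keys=False)
```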
The Jenkins Shared Library is a very good way to centralize Groovy code
and reuse it in multiple Jenkins pipelines.
It contains definitions for common functions and pipeline steps that can be
shared across different Jenkinsfiles,
improving in this way reusability, modularity, and consistency.
We use it in multiple pipeline stages to update agent tools such as Nebula, to
send files to hardware setups, or to run tests and collect results.
This structured approach ensures an efficient process of updating
and maintaining the same pipeline functionality across
all the test harness instances.
By combining continuous integration with continuous testing, the
resulting diagram will look like this.
Behind it are over 100 CI pipelines, implemented in Azure Pipelines,
GitHub Actions, or Jenkins, and about 15 physical build servers.
For most of them, besides the build status, the test results from hardware testing
are returned back to the GitHub pull request.
Some software components, such as libraries, are tested
individually on hardware.
If all of the tests pass, the corresponding binaries are stored in
Artifactory or the Linux package repository.
For other components, the build artifacts are first stored on internal
servers and tested afterwards.
In some cases, Linux packages are created automatically at each push
and saved into the package repository.
Ideally, once all the software components are packed as Linux
packages, the packages generated by each CI run will be uploaded
automatically to a testing environment,
so they can be installed on Kuiper for further testing or just used
internally as pre-release versions.
On the other side, whenever there are changes in the Kuiper
sources, new Docker containers are created and used by other CIs.
This ensures that everything is consistently built and validated
across multiple environments.
Let's see how results visualization was handled.
The easiest way was to manually verify status of Jenkins pipelines.
Of course, this method didn't give us any details about which stages
failed, and on which hardware setup.
So we switched to the Blue Ocean view.
It looks a bit better.
For those of you who don't know, Blue Ocean is a Jenkins plugin that offers
a good visualization of parallel stages.
In this case, we could see the status of all the stages on all
the hardware setups, making it a bit easier to know exactly what failed.
But I still couldn't see all the details.
So the next step was to convert results into XML format and use the JUnit
Jenkins plugin for visualization.
However, even this method requires logging into Jenkins and visually
inspecting the results at every run.
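A minimal sketch of that conversion, using only the standard library and made-up result data, could look like this:

```python
# Turn raw results into JUnit-style XML that the Jenkins JUnit plugin can render.
import xml.etree.ElementTree as ET

results = [
    {"name": "test_boot", "setup": "zynq-zc706", "passed": True},
    {"name": "test_dma_loopback", "setup": "zynq-zc706", "passed": False, "error": "timeout"},
]

suite = ET.Element("testsuite", name="hardware-tests",
                   tests=str(len(results)),
                   failures=str(sum(not r["passed"] for r in results)))
for r in results:
    case = ET.SubElement(suite, "testcase", classname=r["setup"], name=r["name"])
    if not r["passed"]:
        ET.SubElement(case, "failure", message=r.get("error", "failed"))

ET.ElementTree(suite).write("results.xml", encoding="utf-8", xml_declaration=True)
```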
So we started to use even more powerful tools:
Logstash for processing results, Elasticsearch for storing them in a
database, and Kibana for generating graphs.
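In our setup Logstash does the processing, but as a simplified illustration (host, index name, and fields are placeholders), a processed result ends up as a document in Elasticsearch roughly like this:

```python
# Index one processed test result into Elasticsearch so Kibana can chart it.
import datetime
import requests

ES = "http://elasticsearch.example.com:9200"   # placeholder host

doc = {
    "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "repo": "linux",
    "board": "zynq-zc706",
    "test": "test_dma_loopback",
    "status": "failed",
    "jenkins_build": 1234,
}

# POSTing to the _doc endpoint creates a document with an auto-generated id.
resp = requests.post(f"{ES}/hw-test-results/_doc", json=doc, timeout=10)
resp.raise_for_status()
```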
At this point, we also created a web page with multiple dashboards and added
the ability to create and apply filters.
The new implementation eliminates the need to go through individual
artifacts to check results.
But developers still needed to look over the graphs to check the
status before pushing their changes.
Actually, in all of the above cases, developers still needed
to manually check the results.
Even if the results were shown in tables, graphs, or dashboards, it was not feasible
to handle a big number of repositories and pull requests in this way, so we
needed to close the loop completely.
An important step in the implementation was to bind hardware test results
back to GitHub pull requests.
The main challenge here was to ensure that private data from our internal
build and test environment, such as internal IPs, Jenkins links, or any
other sensitive information, is not exposed in public repositories.
At the same time, it was very important to provide sufficient
information about the build status and testing results to aid developers.
To achieve this, multiple tools were used to parse the results, merge them into
the same tables as the build statuses, securely tunnel SSH connections,
and post summaries via Gist.
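The exact tooling is internal, but the core of the loop can be sketched with two GitHub REST calls: publish a sanitized summary as a secret Gist, then attach a commit status that links to it (the token, repository, SHA, and summary text below are placeholders):

```python
import requests

GH = "https://api.github.com"
HEADERS = {"Authorization": "Bearer <token>", "Accept": "application/vnd.github+json"}

summary = "Hardware tests: 124 passed, 2 failed"   # already stripped of internal links/IPs

# 1. Post the summary as a secret Gist.
gist = requests.post(f"{GH}/gists", headers=HEADERS, json={
    "public": False,
    "files": {"hw-test-summary.md": {"content": summary}},
}).json()

# 2. Report a commit status on the pull request's head SHA, linking to that Gist.
requests.post(f"{GH}/repos/<org>/<repo>/statuses/<sha>", headers=HEADERS, json={
    "state": "failure",                 # or "success" when everything passes
    "context": "hardware-tests",
    "description": "2 hardware tests failed",
    "target_url": gist["html_url"],
})
```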
With this system in place, we were finally able to enable the "require CI
status checks to pass" setting in the GitHub repositories.
This ensures that only changes that pass the builds and don't break
any tests are allowed to be merged, increasing the overall stability
and reliability of the code base.
We can dive deeper into this topic by checking "Secure integration of private
testing infrastructure with public GitHub repositories", presented by my
colleague Bianca at the same edition of Conf42 DevOps.
The final step needed to achieve a fully automated testing framework was
to implement a mechanism for recovering hardware setups from bad states.
One common issue arises when boot files produced by CIs are faulty.
In this case, hardware setups can hang during the boot process and
remain stuck in an unstable state.
The framework detects these failures and attempts to recover the affected
boards through various methods.
As part of this process, we maintain a set of golden files, a reliable
baseline of boot files that is overwritten with the latest set
of files that passes successfully.
They serve as a fallback option, allowing us to get the hardware
systems back up and running, and of course, to be prepared for the
next set of files to be tested.
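In very simplified form, and with the board-specific helpers reduced to placeholders, the recovery step amounts to restoring the golden files and power cycling the board:

```python
# Simplified recovery sketch; paths and helpers are placeholders, not the real framework.
import shutil
import time

GOLDEN_DIR = "/var/lib/test-harness/golden/zynq-zc706"   # last known-good boot files

def write_boot_files(sd_mount: str, source_dir: str) -> None:
    # With the USB SD-card mux switched to the host, the card appears as a normal
    # mounted filesystem and the golden files can simply be copied over.
    shutil.copytree(source_dir, sd_mount, dirs_exist_ok=True)

def power_cycle(setup: dict) -> None:
    print(f"power cycling PDU outlet {setup['pdu_outlet']}")  # placeholder for real PDU control

def recover(setup: dict) -> None:
    # Called by the framework once it detects that a board hung during boot.
    write_boot_files(setup["sd_mount"], GOLDEN_DIR)
    power_cycle(setup)
    time.sleep(60)   # give the board time to come back up before re-checking it
```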
However, the rare scenario where the hardware setups are physically
damaged remains the only situation that requires manual intervention.
This recovery mechanism ensures that the testing framework remains resilient,
minimizes downtime, and increases the efficiency of testing.
Now that I have gone through all the details, let's see what
the overall design looks like.
On the left side, you can see the triggering mechanisms, multiple
Jenkinsfiles, and the Jenkins server.
The server manages the testing request queue, ensuring efficient resource
allocation, and it also merges results from all test harness instances
and prepares them to be published.
Then there are the Jenkins agents.
By deploying agents inside Docker containers, we have successfully
connected multiple hardware setups to a single physical machine,
optimizing resource usage.
The Test Harness supports tests written in different programming
languages, such as Python, C, or MATLAB.
Hardware boards are locked only when tests are running on them; otherwise,
they remain accessible for remote connections, allowing team members to
perform debugging and development work.
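The locking mechanism itself is not described here, but a toy version of the idea, using a simple per-board lock file, would look something like this:

```python
# Toy per-board lock: a test run holds an exclusive lock only while it is using the board.
import fcntl
import os
from contextlib import contextmanager

LOCK_DIR = "/var/lock/test-harness"   # placeholder location

@contextmanager
def board_lock(board: str):
    os.makedirs(LOCK_DIR, exist_ok=True)
    with open(os.path.join(LOCK_DIR, f"{board}.lock"), "w") as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX)   # blocks until the board is free
        try:
            yield
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)

# The board is only held for the duration of the test run.
with board_lock("zynq-zc706"):
    print("running tests on zynq-zc706")
```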
Test results are very well structured and presented clearly, ensuring
that any defects are identified and addressed in early stages.
This system increases efficiency, maintains stability, and
enhances the overall reliability of the testing process.
But let's see how the hardware setup looks in real life.
This is how the prototype looked in the early stages.
There were just a few hardware boards connected to each other
and lying around on a desk.
At that time, we were experimenting with using a Raspberry Pi as the
Jenkins agent and adding support for multiple platforms.
And this is how Test Harness looks now.
In conclusion, we have managed to implement a very complex testing
framework that can be triggered from multiple GitHub repositories, as well
as Jenkins, Cron, or even manually.
Hardware setups remain accessible for remote connection, allowing team members
to perform debugging and development.
It supports multiple platforms and can run tests written in different languages.
Resources got optimized by using Jenkins agents inside Docker containers, and there
is a robust recovery mechanism in place.
Test results are well structured and bound back to GitHub, ensuring that
bugs are found as early as possible.
The presented testing framework is highly efficient, flexible, and
robust, designed for complex workflows.
It ensures software and hardware integration, streamlining the testing
process and enhancing stability.
Thank you all for listening. If you have any related questions or need more
details, don't hesitate to contact me.
Have a nice day, bye.