Discover the world of cryptographic checksums. Learn what they are, how they work, how to use them, and what role they play in data integrity and cybersecurity.
Historically, the only way to install new software on your device was to buy a physical copy at a brick-and-mortar shop. Software programs were usually delivered on a floppy disk or a CD, which was carefully packed in a sealed box to guarantee its authenticity and integrity. A few decades later, with most of our life gone digital, getting a new software program has become easier than ever. Now, you can go to a reputable website, select the software you need, click on download and you’re done.
But what if the software you just downloaded isn’t the same as the one described on the website or, even worse, it’s infected with malware? With ransomware costs continuing to rise, is there a quick way to verify the integrity of a file (and, thus, avoid the disastrous consequences of malware infection)? Of course there is! It’s called a checksum.
But what is a checksum and how does it work? In this first article (which is part of a series on checksums), we’re going to discover how a string of cryptic random alphanumeric characters has become the digital version of the integrity seals used in the past, and an effective shield protecting you and your organization from cybersecurity threats. You’ll learn:
- What a checksum is,
- what it does,
- where it’s used, and
- why it’s beneficial both for developers and users alike.
What Is a Checksum? A Brief Definition
A checksum is a value (usually in the form of a short string of letters and numbers) that enables you to verify whether the original data has been modified during storage or transmission. Checksums are sometimes referred to as hash values, which are values generated by cryptographic algorithms that work like digital fingerprints of files. (One file, one fingerprint that is, for all practical purposes, unique.) These values are generated based on the input and are stored and/or transmitted with the file.
But it’s important to note that while all file hash values can serve as checksums, not all checksums are hash digests. It’s kind of like how all ice creams are desserts but not all desserts are ice cream. In general, checksums are used to catch accidental changes to data (i.e., errors), whereas hash values are used for identifying other (potentially malicious) changes. Hash values also tend to be larger in size than checksums (more on sizes later).
In this article, we’ll mainly focus on understanding checksums from the perspective of hashing algorithms and digests. But semantics aside, the main takeaway here is that if the file changes, even only slightly, the resulting fingerprint (checksum) created is completely different. Depending on how checksums are being used, this mismatch will result in:
- Displaying a message that warns you the values don’t match,
- Preventing you from installing corrupted or altered software,
- Preventing you from uploading corrupted or infected files to your server, and
- Preventing undetected and unauthorized file modifications.
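To make the "slightly different file, completely different fingerprint" idea concrete, here is a minimal sketch using Python's standard hashlib module. The two sample messages and the helper name sha256_hex are made up for illustration; any pair of inputs that differ by a single character would show the same effect.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Two hypothetical messages that differ by exactly one character.
original = b"Transfer 100 dollars to Alice"
modified = b"Transfer 900 dollars to Alice"

digest_original = sha256_hex(original)
digest_modified = sha256_hex(modified)

# Count the hex positions at which the two digests differ.
differing = sum(a != b for a, b in zip(digest_original, digest_modified))

print(digest_original)
print(digest_modified)
print(f"{differing} of {len(digest_original)} hex characters differ")
```

Running it prints two 64-character strings that have little in common, even though the inputs are nearly identical. That is the property that makes a mismatch easy to spot.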
If you’ve ever downloaded software from the internet, you’ve probably already seen a checksum without even knowing it. It’s usually a weird-looking string of seemingly random characters (typically accompanied by terms like “MD5” or “SHA”) that’s placed somewhere near the download button.
In some cases, it’s also followed by a short message inviting you to verify the downloaded file by comparing the checksum you calculate with the one shown on the page.
Before we go into all the technicalities of a checksum to learn how it works, let’s have a look at a few examples of cryptographic algorithms used to generate checksums.
Checksum Algorithms: A Few Examples
There are several algorithms that can be used to create checksum values, and the choice of which one to use depends on the purpose. The most common ones are:
- MD5 (Message-Digest Algorithm 5). Designed in 1991, it takes an input and produces a 128-bit (i.e., 16 bytes) checksum, which displays as 32 hexadecimal digits. Because it is vulnerable to collision attacks, it’s not as secure as the SHA (secure hash algorithm) families.
- SHA-1 (Secure Hash Algorithm-1). Published by NIST (the National Institute of Standards and Technology), this hashing algorithm takes an input and produces a 160-bit (i.e., 20 bytes) output hash value that serves as a checksum. This value displays as a 40-digit hexadecimal string. SHA-1 has been considered insecure since 2005.
- SHA-2 family (Secure Hash Algorithm-2). Approved and recommended by NIST, it’s a family of widely used algorithms whose output size depends on the variant:
  - SHA-224 and SHA-512/224 produce a 224-bit (i.e., 28 bytes) checksum displayed as 56 hexadecimal digits,
  - SHA-256 and SHA-512/256 produce a 256-bit (i.e., 32 bytes) checksum displayed as 64 hexadecimal digits,
  - SHA-384 produces a 384-bit (i.e., 48 bytes) checksum displayed as 96 hexadecimal digits, and
  - SHA-512 produces a 512-bit (i.e., 64 bytes) checksum displayed as 128 hexadecimal digits.
- SHA-3 family (Secure Hash Algorithm-3). This family also includes several algorithms, but they’re based on a new cryptographic approach (the sponge construction), which makes them very different from the previous ones:
  - SHA3-224, which produces a 224-bit (i.e., 28 bytes) checksum composed of 56 hexadecimal characters,
  - SHA3-256, which produces a 256-bit (i.e., 32 bytes) checksum that displays as a 64 hexadecimal character output,
  - SHA3-384, which produces a 384-bit (i.e., 48 bytes) checksum composed of 96 hexadecimal characters, and
  - SHA3-512, which generates a 512-bit (i.e., 64 bytes) checksum displayed as a 128 hexadecimal digit output.
- CRC (cyclic redundancy check algorithms). Very similar to “traditional” checksums, they’re commonly used for error detection and identification of accidental changes to data in digital networks and storage devices (e.g., in Ethernet and Wi-Fi packets). Based on cyclic codes, CRCs use polynomial division to determine their values. The most common CRCs are:
  - CRC-16, which generates a checksum of 16 bits (i.e., 2 bytes) displayed as a 4-character hexadecimal string,
  - CRC-32, which produces a checksum of 32 bits (i.e., 4 bytes) that’s composed of 8 hexadecimal digits, and
  - CRC-64, which generates a checksum of 64 bits (i.e., 8 bytes) that displays as a 16 hexadecimal digit string.
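If you'd like to check the output sizes listed above for yourself, the following sketch prints them using Python's standard hashlib and zlib modules. It covers only the algorithms that hashlib guarantees on every platform, so SHA-512/224, SHA-512/256, CRC-16 and CRC-64 are left out. The sample input and the helper name digest_sizes are assumptions made for illustration.

```python
import hashlib
import zlib

data = b"hello checksum"  # arbitrary sample input

# Hash functions from the list above that hashlib guarantees everywhere.
ALGORITHMS = [
    "md5", "sha1", "sha224", "sha256", "sha384", "sha512",
    "sha3_224", "sha3_256", "sha3_384", "sha3_512",
]


def digest_sizes(payload: bytes) -> dict:
    """Map each algorithm name to the bit length of the digest it produces."""
    return {
        name: hashlib.new(name, payload).digest_size * 8
        for name in ALGORITHMS
    }


for name, bits in digest_sizes(data).items():
    hex_digits = bits // 4  # one hexadecimal digit encodes 4 bits
    print(f"{name:<9} {bits:>3} bits  {hex_digits:>3} hex digits")

# CRC-32 lives in zlib and returns an unsigned 32-bit integer.
crc = zlib.crc32(data)
print(f"crc32      32 bits  value {crc:08x}")
```

The number of hexadecimal digits is always the bit length divided by four, which is why a 256-bit digest displays as 64 characters.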
To read about the ins and outs of cryptographic checksum algorithms, check our latest hash algorithms comparison article “Hash Algorithm Comparison: MD5, SHA-1, SHA-2 & SHA-3”.
Now that you have a better idea of what a checksum looks like, we can move on and see how it works.
How Does a Checksum Work?
To generate the checksum, the input data is broken into smaller blocks of equal size, and each block goes through the algorithm’s multiple rounds of operations.
The resulting checksum always has the same length, regardless of the size of the original file. For example, if you run the whole of “The Lord of the Rings” (1,178 pages) through a cryptographic algorithm and then process only the author’s name, “Tolkien,” through the same algorithm, you’ll end up with two different checksums of exactly the same length.
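The sketch below illustrates both points with Python's hashlib: the input is fed to the algorithm piece by piece, and the output length stays the same whatever the input size. Note that chunk_size here is only the size of each read and is an arbitrary choice; the algorithm's own internal block size is handled by hashlib. The repeated sentence is a made-up stand-in for a long book, and stream_sha256 is a hypothetical helper name.

```python
import hashlib
import io


def stream_sha256(stream, chunk_size: int = 64 * 1024) -> str:
    """Hash a binary stream chunk by chunk so it never has to fit in memory."""
    hasher = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        hasher.update(chunk)
    return hasher.hexdigest()


# Stand-in for a very long book: a few megabytes of repeated text.
long_text = b"One Ring to rule them all. " * 200_000
short_text = b"Tolkien"

long_digest = stream_sha256(io.BytesIO(long_text))
short_digest = stream_sha256(io.BytesIO(short_text))

print(len(long_text), "bytes ->", long_digest)
print(len(short_text), "bytes ->", short_digest)
```

Both lines end with a 64-character digest, even though one input is several megabytes and the other is seven bytes.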
We’ve already mentioned that even a minimal change to the input file will result in a totally different checksum. This makes checksums handy for:
- Identifying changes in a file and other data, and
- Comparing two or more files to verify if they have the same content.
But how does this verification work? Let’s have a look at a few practical examples.
1. Checking Data Integrity During the Download Process
Say, you’re a developer who creates a program and you want to protect the integrity of its code. Before uploading it to your website, you’ll generate a unique checksum using a cryptographic algorithm (we’ll talk more about how to generate a checksum in our next article of this series). Next:
- You’ll upload the program to your website together with the checksum.
- A user downloads the program onto their device.
- The user verifies whether the checksum of the downloaded code matches the original one. This comparison, which can be done manually but often happens automatically in the background on the user’s device, involves calculating the checksum value of the downloaded file and comparing it with the value provided by the download website. If the original and calculated checksums don’t match, it means the file has changed in some way (either unintentionally corrupted or modified by a malicious actor).
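The verification step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names file_sha256 and verify_download, the file name installer.bin, and the choice of SHA-256 are all hypothetical, and a real download page may publish a different algorithm.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 64 * 1024) -> str:
    """Calculate the SHA-256 checksum of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_download(path: Path, published_checksum: str) -> bool:
    """Return True if the file's checksum matches the published one."""
    expected = published_checksum.strip().lower()  # tolerate case and whitespace
    return file_sha256(path) == expected


# Demo with a temporary file standing in for a downloaded installer.
with tempfile.TemporaryDirectory() as tmp:
    installer = Path(tmp) / "installer.bin"  # hypothetical file name
    installer.write_bytes(b"pretend this is the program")
    published = hashlib.sha256(b"pretend this is the program").hexdigest()

    print("Intact file matches:", verify_download(installer, published))

    installer.write_bytes(b"pretend this is the program!")  # simulated change
    print("Altered file matches:", verify_download(installer, published))
```

A match only tells you the file is the same one the checksum was calculated from, so the published checksum itself needs to come from a source you trust.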
2. Checking Data Integrity During Transfers (e.g., Checking File Transmissions via Email)
Let’s consider another example. Your boss digitally signs a Microsoft Office Word attachment. In the process, the file’s unique hash value is calculated and included in the digital signature appended to the document.
Your boss sends you an email with the digitally signed attachment. When you receive the email, your system will automatically verify the digital signature, recalculate the hash value of the attachment, and compare it with the hash value included in the signature. If they match, it means the document is authentic and hasn’t been tampered with or altered since it was signed.
To learn more about digital signatures and digital certificates, check our article “What Is a Digital Certificate? PKI Digital Certificates Explained.”
3. Checking Data Integrity in File Storage (e.g., Comparing Saved Files)
Say you have three Word documents saved on a USB stick. Each one is hundreds of pages long, which makes manually comparing their contents impractical. You know that two of them are duplicate files saved under different names (but you don’t remember which ones), and you want to get rid of one of them to save space on your USB stick.
To deal with this issue using checksums, you can do the following:
- Calculate the checksum of each file.
- Compare the three checksums to see which ones match.
- Delete one of the redundant files once you’ve found the two matching files that share the same content.
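The three steps above can be sketched as a small Python helper that groups files by checksum. The function name find_duplicates and the file names are made up for illustration, and the demo uses tiny temporary files rather than real Word documents. For very large files you would read in chunks instead of loading each file whole.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicates(paths):
    """Group files by SHA-256 checksum; return the groups with 2+ files."""
    groups = defaultdict(list)
    for path in paths:
        path = Path(path)
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)
    return [sorted(files) for files in groups.values() if len(files) > 1]


# Demo: three hypothetical documents, two of which share the same content.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "report.docx").write_bytes(b"quarterly figures")
    (folder / "report_final.docx").write_bytes(b"quarterly figures")
    (folder / "notes.docx").write_bytes(b"meeting notes")

    for group in find_duplicates(folder.iterdir()):
        print("Same content:", [p.name for p in group])
```

The helper only reports which files match. Deciding which copy to delete is left to you.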
This sounds pretty handy, right? The process of creating and comparing checksums, like in the examples mentioned above, is sometimes called fixity checking. The use of checksums for data integrity and the fixity checking process are part of NDSA’s (National Digital Stewardship Alliance) digital preservation good practices, but there are many other ways checksums can be used in today’s digital world. Let’s explore the most important ones.
What Can You Do With a Checksum?
Checksum values can be used for many different applications like:
- Password storage. Saving only the passwords’ hash values rather than the plaintext passwords is much more secure, especially when each password is salted and processed with a deliberately slow algorithm designed for passwords. This way, in case of a data breach, the hacker will just get ahold of a whole set of gibberish hexadecimal strings (i.e., not the plaintext passwords themselves), making things much more difficult.
- Software/code integrity. As described in one of the examples above, checksums help to detect unauthorized data manipulation through the fixity checking process.
- Malware protection. Once again, referring to the previously mentioned examples, comparing checksum values helps confirm that the downloaded document/code/file is identical to the one that was published, meaning it has not been damaged or tampered with (e.g., infected with malware) along the way.
- Copyright image protection. Checksums can help identify exact copies of a copyrighted image and show when a file has been altered. Remember: A slightly modified input (in this case a picture) will return an entirely different checksum value, so only an unmodified copy will produce a matching checksum.
- Email malware protection. When the sender’s cryptographic checksum doesn’t match the recipient’s, it means that the email has been tampered with (e.g., malware has been injected into the email’s attachment). This is an easy and secure way to identify suspicious emails and/or attachments.
- Spam protection. Used by many email providers, a checksum-based spam filter calculates a checksum for each incoming message. It then looks for the same checksum in a database that includes the hash values of all the messages already classified as spam. If a match is found, the email is automatically sent to the user’s spam folder.
- ISO integrity. Before installing Ubuntu or any other operating system from a downloaded ISO, you can turn on the image checksum option when burning the CD or DVD. This will enable you to compare the downloaded ISO’s checksum with that of the CD or DVD you just created and help you avoid installing a corrupted or infected ISO.
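As an illustration of the password storage item in the list above, here is a minimal sketch using Python's standard library. Instead of a bare checksum, it uses PBKDF2 (a salted, deliberately slow derivation built on SHA-256), which is a common way to make stolen values harder to crack. The function names and the iteration count are assumptions chosen for this example, not a recommendation for any particular system.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor for this demo; tune for real use


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key); only these are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for each password
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, derived


def check_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Re-derive the key from the attempt and compare it in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored)


salt, stored = hash_password("correct horse battery staple")
print("Stored value:", stored.hex())
print("Right password:", check_password("correct horse battery staple", salt, stored))
print("Wrong password:", check_password("guess123", salt, stored))
```

Because each password gets its own salt, two users with the same password end up with different stored values.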
These are just a few examples of how a checksum value can be used to check the integrity of data. But what are the benefits of using them? If we look at the list above, some are already clearly evident; but there’s more, and this is what we’re going to discover next.
Why Integrity Matters
Using checksums can be beneficial for developers, organizations and users alike as they can increase:
- Trust — proving that your software is authentic and untainted. Adding a checksum near your software download button will take your trustworthiness to a different level. Customers will be able to verify the authenticity and integrity of your codes and, as a result, their trust in you as a developer will grow.
- Security — protecting your customers and organization from malware and data breaches. Being able to immediately identify data manipulation through the fixity checking process — and mitigating risks of data breaches with the help of checksums — will easily boost your security and data protection levels.
- Reputation — showing your customers that you care about their online safety. Guaranteeing a safe customer experience through extended use of checksums will enable you to further build your reputation as a reliable and highly skilled developer. As you can imagine, this goes a long way in increasing customer confidence.
- Revenue and software distribution — signing your code or software with a signing certificate. Nowadays, a high number of software distribution platforms require codes to be signed with a signing certificate issued by an approved CA (certificate authority) prior to distribution. This allows users to verify the author’s identity and the software’s integrity, once again thanks to checksums. As a result, you, as a developer, will reach as many customers as possible.
- Credibility and customer satisfaction — protecting your customers from spam. As an email provider, you’ll be able to immediately and automatically detect spam messages using a checksum-based spam filter. Taking into account that last year, Spamcop reported a staggering 38,658,552 emails were submitted as spam — equivalent to an average of 1.2 per second — it’s easy to imagine the positive impact that the filter can have on your credibility and customer satisfaction.
As you can see, the power of checksums is amazing. No matter if you’re a user or a developer, you can enjoy their unquestionable benefits as soon as you start using them.
Wrapping Up the Topic ‘What Is a Checksum?’
In today’s digital environment, checksums apply to a wide range of use cases and play an important part in organizations’ data protection and cybersecurity strategies. This article gave you a first, basic glimpse into the world of checksums. You now know what a checksum is, how it works, what it’s used for, and why it matters in today’s digital world.
There is much more to say and explore in the world of checksums, though. We’ll do that in our next two articles of the series — where you’ll learn step by step how to check a file checksum and an MD5 checksum.
Stay tuned then and get ready to unleash the power of checksums!