Entropy

Once again I have run into entropy in heuristic detections. BitDefender has a detection routine for executables with suspicious entropy ("randomness of data"), that is, a detection for compressed or encrypted data. F-Secure and G Data use the same engine and thus also detect encrypted contents. When I looked at a recent Sinowal sample a couple of months ago, I noticed the zero blocks added to the compressed part, and now I thought I would give it a try.

Peter Kleissner, Software Development Guru

Text Files

If you look at the entropy (the byte distribution) of text files, you can clearly make out the most commonly used letters. I simply used the hex editor HxD to generate these statistics diagrams. Clearly visible, "e" is the most commonly used letter in English. Antivirus products today all have detection for text based on exactly that pattern, the occurrences of the printable characters.
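The statistics diagram boils down to counting how often each byte value occurs. Below is a minimal sketch of that count; the class name ByteHistogram and the helpers mostCommonLetter and printableRatio are made up for illustration and are not how HxD or any antivirus engine actually does it.

```java
import java.nio.charset.StandardCharsets;

final class ByteHistogram {
    /** Counts how often each of the 256 possible byte values occurs. */
    static int[] count(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        return counts;
    }

    /** Most frequent ASCII letter, ignoring case; '?' if there are no letters. */
    static char mostCommonLetter(byte[] data) {
        int[] counts = count(data);
        char best = '?';
        int bestCount = 0;
        for (char c = 'a'; c <= 'z'; c++) {
            int n = counts[c] + counts[Character.toUpperCase(c)];
            if (n > bestCount) {
                bestCount = n;
                best = c;
            }
        }
        return best;
    }

    /** Share of bytes that are printable ASCII (0x20..0x7E) or tab, LF, CR. */
    static double printableRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int printable = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if ((v >= 0x20 && v <= 0x7E) || v == '\t' || v == '\n' || v == '\r') {
                printable++;
            }
        }
        return (double) printable / data.length;
    }

    public static void main(String[] args) {
        byte[] text = "Here we see the entropy of text".getBytes(StandardCharsets.US_ASCII);
        System.out.println("most common letter: " + mostCommonLetter(text));
        System.out.println("printable ratio:    " + printableRatio(text));
    }
}
```

A printable ratio close to 1 together with a letter distribution dominated by a few characters is the kind of pattern meant above; where a real product draws the line is not known here.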

Random data

The next picture shows the entropy of a file with encrypted or compressed data. You can clearly see that all byte values are used nearly equally, which would be unusual for an executable file. Branches are always the most used opcodes. In the first picture we only have 0.6 % and 0.7 % branches, in the second (better) one 2.5 % and 1.3 %. The share of zeros is also significantly higher, at more than 20 % in a legitimate executable, while it is only 10 % in a non-legitimate one. That means that by checking the entropy you can find out whether a file contains random data. Obviously, relying on that heuristic alone would lead to many false positives, as installers and self-extractors all have their data compressed.
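Two of the numbers used here, the share of zero bytes and how evenly the byte values are spread, are easy to compute. The following is only a sketch: the class ByteStats and its method names are invented for this example, and it deliberately contains no threshold, because the 20 % and 10 % above are observations from two files, not calibrated limits.

```java
final class ByteStats {
    /** Share of bytes in the buffer that are 0x00 (0.0 to 1.0). */
    static double zeroRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / data.length;
    }

    /**
     * Share of the single most frequent byte value. For perfectly evenly
     * used byte values this is 1/256 (about 0.39 %); executables and text
     * have a few dominant values and therefore a much larger share.
     */
    static double maxByteShare(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int[] counts = new int[256];
        int max = 0;
        for (byte b : data) {
            int n = ++counts[b & 0xFF];
            if (n > max) {
                max = n;
            }
        }
        return (double) max / data.length;
    }

    public static void main(String[] args) {
        byte[] sample = {0, 0, 0x74, 0x05, (byte) 0xE8, 0, 0x75, 0};
        System.out.println("zero ratio:     " + zeroRatio(sample));
        System.out.println("max byte share: " + maxByteShare(sample));
    }
}
```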

Sinowal

Here is the hex dump of a Sinowal sample (address 24200h of 2d4f0001_8bed65eaf97719b7ab48eb63dbf03cab), so you can get an idea of what it looks like:

000000F0  25 1D 5C 97 86 F6 AC C0 58 32 C7 35 2D 8D 49 8D  %.\.....X2.5-.I.
00000100  11 34 D0 89 09 AC C1 E9 B4 05 41 CB 24 A4 96 46  .4........A.$..F
00000110  A2 B0 DA 8F 6B 3B A7 AD 41 48 81 02 3F 27 AA 1B  ....k;..AH..?'..
00000120  5C DE 0E 1E 14 5F 0E C3 0F 40 F7 13 45 73 56 77  \...._...@..EsVw
00000130  1A 35 7D A6 F9 50 BE 46 30 5C BD 4F 05 6F 3C 6B  .5}..P.F0\.O.o<k
00000140  08 C5 C6 59 83 03 3A 3C E8 AC 20 DA C8 F9 10 63  ...Y..:<.. ....c
00000150  F5 77 8E A9 16 17 3A ED 3F 16 E3 1D 02 38 D3 86  .w....:.?....8..
00000160  17 38 6B F3 18 DB 95 DC 03 7A 02 75 EB AC C3 3A  .8k......z.u...:
00000170  A9 A5 8D B8 E4 39 BC 05 F6 07 12 72 AC D9 1C A7  .....9.....r....
00000180  00 B8 21 2A 60 BA 59 12 A1 20 98 4C E9 F3 CC 20  ..!*`.Y.. .L... 
00000190  00 FA 9C 63 93 44 3E A4 51 AF 2A 67 62 65 60 6B  ...c.D>.Q.*gbe`k
000001A0  7E 71 2A 4A FA EA B6 5D C6 99 BC 39 7A DB 78 1D  ~q*J...]...9z.x.
000001B0  C0 4A D5 92 82 72 96 CE F1 77 CC 38 57 2D 13 E3  .J...r...w.8W-..
000001C0  70 54 DF 47 1A 54 00 00 00 00 00 00 00 00 00 00  pT.G.T..........
000001D0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000001E0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000001F0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000200  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000210  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000220  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000230  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000240  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000250  00 00 00 00 00 00 00 00 00 00 00 00 00 00 91 00  ................
00000260  00 00 1E 01 00 00 CA 49 DA 43 48 9C C8 26 54 1A  .......I.CH..&T.
00000270  26 42 D1 D9 7A 25 AC 0A A9 10 56 8A 75 45 F3 36  &B..z%....V.uE.6
00000280  3B 9F E8 28 F6 7D 52 F6 01 05 B5 10 D5 78 5A 8D  ;..(.}R......xZ.
00000290  CF 17 E7 3F 60 CB 9C 8B 8E 6F FC C8 3E 11 2E 8B  ...?`....o..>...
000002A0  DF 94 AD 20 E3 57 E9 6A 7B A8 FE B5 8F A7 3D 3C  ... .W.j{.....=<
000002B0  15 26 3F 89 03 8F DB 7D 4A 1C 8F F3 EA B7 B5 02  .&?....}J.......
000002C0  EF 5E B8 58 51 85 01 49 4A 47 6C 34 93 54 11 1E  .^.XQ..IJGl4.T..
000002D0  4A 26 99 B5 D1 0A 60 53 FB A1 E2 D6 EF 91 D9 75  J&....`S.......u
000002E0  8C 1E E3 68 0E AF 0C E3 2F 75 40 70 BB 4D B9 24  ...h..../u@p.M.$
000002F0  B6 DE 4A E8 15 0B 65 00 00 00 00 00 00 00 00 00  ..J...e.........
00000300  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000310  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000320  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000330  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000340  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000350  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000360  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000370  00 00 00 00 00 00 00 00 00 00 00 00 B2 02 00 00  ................
00000380  60 03 00 00 9F 1E 4A AF E2 56 3B B1 26 2B 31 62  `.....J..V;.&+1b
00000390  B6 14 1C F8 C1 73 4C A7 DE C0 74 B9 0E E1 64 5A  .....sL...t...dZ
000003A0  32 AD 70 0B 60 51 92 1B 75 EF B1 02 6C DA 8A 1B  2.p.`Q..u...l...
000003B0  FB 94 8D 54 5E FA 89 61 EB BD 51 37 D5 ED EB 8B  ...T^..a..Q7....
000003C0  D4 64 71 74 91 8F BE 48 3A 06 FE 1C 4C 47 81 45  .dqt...H:...LG.E
000003D0  FD E1 53 BC 94 2E 48 91 EC 1B 72 F2 D2 E8 5F 59  ..S...H...r..._Y
000003E0  59 FF 74 B3 18 B4 D4 FF E4 BD 4E 29 44 E5 D8 DB  Y.t.......N)D...
000003F0  3D 1F B1 C7 FD 6C 55 6A B8 81 AE 11 68 0E 16 11  =....lUj....h...
00000400  AA 73 8F 0B 32 15 B3 28 9C F8 AE 5E 11 6E 38 7B  .s..2..(...^.n8{
00000410  78 47 D2 58 F9 CF 2B 8C 9E 84 3A 12 77 6D C9 94  xG.X..+...:.wm..
00000420  1F 39 26 3B B5 B4 F3 72 0B 5E 0A 98 C5 67 48 9F  .9&;...r.^...gH.
00000430  88 3D E7 50 9F 58 90 30 DB FC 05 48 F1 FC 37 CE  .=.P.X.0...H..7.
00000440  40 B8 61 B9 7C 93 2E EA 40 67 E7 91 ED A8 33 D0  @.a.|...@g....3.
00000450  AB F1 13 CA 0F 8F F2 91 CC E2 27 BE A5 61 00 63  ..........'..a.c
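The pattern in the dump, stretches of high-entropy bytes interrupted by long zero blocks, can be located with a simple scan for runs of zero bytes. This is a rough sketch only: the class ZeroRuns and the minLength parameter are made up for illustration, and a sensible minimum run length would have to be chosen by looking at real samples.

```java
import java.util.ArrayList;
import java.util.List;

final class ZeroRuns {
    /**
     * Finds all runs of consecutive zero bytes that are at least
     * minLength bytes long. Each result entry is {offset, length}.
     */
    static List<int[]> find(byte[] data, int minLength) {
        List<int[]> runs = new ArrayList<>();
        int runStart = -1; // -1 means "not inside a zero run"
        // Iterate one step past the end so a run at the very end is closed too.
        for (int i = 0; i <= data.length; i++) {
            boolean isZero = i < data.length && data[i] == 0;
            if (isZero) {
                if (runStart < 0) {
                    runStart = i;
                }
            } else if (runStart >= 0) {
                int length = i - runStart;
                if (length >= minLength) {
                    runs.add(new int[] {runStart, length});
                }
                runStart = -1;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        byte[] data = {1, 0, 0, 0, 2, 0, 3, 0, 0, 0, 0};
        for (int[] run : find(data, 3)) {
            System.out.printf("zero run at offset %d, length %d%n", run[0], run[1]);
        }
    }
}
```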

What does it mean

On Stack Overflow there is a very good answer with an idea for estimating the entropy of a file:

A simpler solution: gzip the file. Use the ratio of file sizes: (size-of-gzipped)/(size-of-original) as measure of randomness (i.e. entropy).

This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare entropy of different sources.
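The gzip idea can be tried in a few lines with the JDK's own java.util.zip.GZIPOutputStream. This is a minimal sketch working on an in-memory buffer; the class name GzipRatio and the method compressionRatio are chosen for this example only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

final class GzipRatio {
    /**
     * Returns (size of gzipped data) / (size of original data).
     * Values near or above 1.0 suggest data that is already compressed
     * or encrypted; values near 0.0 suggest very redundant data.
     */
    static double compressionRatio(byte[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("empty input has no ratio");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(data);
        } catch (IOException e) {
            // Not expected for an in-memory stream, but must be handled.
            throw new UncheckedIOException(e);
        }
        return (double) out.size() / data.length;
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[65536];
        byte[] random = new byte[65536];
        new Random(1).nextBytes(random);
        System.out.println("zeros:  " + compressionRatio(zeros));
        System.out.println("random: " + compressionRatio(random));
    }
}
```

The ratio can exceed 1.0 slightly for incompressible input, because gzip adds a header and trailer of its own.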

The program PEiD also reports an entropy value, but on its forums the algorithm is described as covering only specific sections of the executable, as "voodoo", and as sometimes returning bogus results. I have seen the values 5.93 (Not Packed) and 7.58 (Packed) in my tests (the latter was a Sinowal sample, 7e92c6fa53628fb457f54cb3e0e0e3ee). The algorithm in PEiD is probably the well-known one using the counted byte values and log2; here is the implementation from EntropyCalculator.java of the (open source) sample Hexer plugin:

	/**
	* Counts how often each byte value appears in a range of bytes.
	*
	* @param data The input buffer.
	* @param start Index into the buffer where the counting starts.
	* @param length Number of bytes to count.
	*
	* @return Array with 256 entries that say how often each byte value appeared
	* in the requested input buffer range.
	**/
	private static int[] countByteDistribution(byte[] data, int start, int length)
	{
		final int[] countedData = new int[256];
		
		for (int i=start;i<start + length;i++)
		{
			countedData[data[i] & 0xFF]++;
		}
		
		return countedData;
	}
	
	/**
	* Calculates the log2 of a value.
	**/
	private static double log2(double d)
	{
		return Math.log(d)/Math.log(2.0);
	}
	
	/**
	* Calculates the entropy of a sub-array.
	*
	* @param data The input data.
	* @param start Index into the input data buffer where the entropy calculation begins.
	* @param length Number of bytes to consider during entropy calculation.
	*
	* @return Entropy of the sub-array.
	**/
	private static double calculateEntropy(byte[] data, int start, int length)
	{
		double entropy = 0;
		
		final int[] countedData = countByteDistribution(data, start, length);
		
		for (int i=0;i<256;i++)
		{
			final double p_x = 1.0 * countedData[i] / length;
			
			if (p_x > 0)
			{
				entropy += -p_x * log2(p_x);
			}
		}
		
		return entropy;
	}
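To get a feeling for the scale of this value (0 to 8 bits per byte) and for what zero blocks do to it, here is a small self-contained sketch. It reimplements the same count-and-log2 formula in a class of its own; the names PaddingEffect and padWithZeros are made up for illustration. That zero blocks in a sample like Sinowal are there to pull this number down is presumably the idea, but the code only demonstrates the arithmetic.

```java
import java.util.Arrays;
import java.util.Random;

final class PaddingEffect {
    /** Shannon entropy of the buffer in bits per byte (0.0 to 8.0). */
    static double entropy(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / data.length;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    /** Returns a copy of data followed by zeroCount zero bytes. */
    static byte[] padWithZeros(byte[] data, int zeroCount) {
        // Arrays.copyOf fills the added tail with zeros.
        return Arrays.copyOf(data, data.length + zeroCount);
    }

    public static void main(String[] args) {
        byte[] random = new byte[4096];
        new Random(42).nextBytes(random);
        byte[] padded = padWithZeros(random, 4096);
        System.out.println("random data:          " + entropy(random));
        System.out.println("same data plus zeros: " + entropy(padded));
    }
}
```

Appending as many zero bytes as there is random data lowers the value from close to 8 to roughly 5, which is below the 5.93 that PEiD reported as "Not Packed" above.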

References