How ephemeral metadata may cause real problems

The most dangerous data leaks are the ones people don’t even know about.

Calling Captain Obvious…come in, Captain Obvious: Which IT threat brings the most danger to enterprises, SMBs, governments, and individuals?

The answer, of course, is data breaches. Now: Which data breaches are the hardest to prevent? The ones people don’t know about.

Today we are talking about something most people don’t know or think much about: metadata, which is information about a file rather than information shown in a file. Metadata can turn a normal digital document into compromising intel.

Document metadata

Let’s start our deep dive with a bit of theory. American law defines three categories of metadata:

1. App metadata is added to the file by the application used to create the document. This type of metadata keeps edits introduced by the user, including change logs and comments.

2. System metadata includes the name of the author, file name and size, changes, and so forth.

3. Embedded metadata might be formulae in Excel cells, hyperlinks, and associated files. EXIF metadata typical of graphic files also belongs to this category.

Here’s a classic example of the trouble metadata may bring: the UK government’s 2003 report on Iraq’s supposed weapons of mass destruction. The .doc version of the report included metadata on the authors (or, more precisely, the people who made the last 10 edits). This information raised some flags about the quality, authenticity, and credibility of the report.

According to the BBC follow-up story, as a result of noticing the original file’s metadata, the government chose to use the .pdf version of the report instead, because it contained less metadata.

A $20 million (doctored) file

Another curious metadata-powered eye-opener involved a client of Venable, an American law firm, back in 2015. Venable was contacted by a company whose vice president had recently resigned. Shortly after his exit, the firm lost a contract with a government organization to a competitor — a competitor working with the former VP.

The company accused its former VP of misuse of trade secrets, saying that’s how he won the government contract. In their defense, the defendant and his new firm provided as evidence a similar commercial offer prepared for a foreign government. They claimed it was created for another client before the contract pitch in the United States, and thus it did not violate the former VP’s non-compete agreement with the plaintiff.

But the defendants failed to consider that the metadata in their evidence contained a time stamp abnormality. System metadata showed that the file was last saved before it was last printed, which, as an expert affirmed, could not happen. The time stamp of the last print belongs to app metadata, and it is saved in the document only when the file itself is saved. If a document is printed but not saved afterwards, the new print date is not written to the metadata.
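The consistency check the expert relied on can be sketched in a few lines. This is only an illustration, and it assumes a modern Office Open XML file (.docx, .xlsx, .pptx), where the print and save time stamps live in docProps/core.xml inside the ZIP container; the source does not say which format the disputed file used. The function name printed_after_saved is our own.

```python
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime

# XML namespaces used by docProps/core.xml in Office Open XML packages
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dcterms": "http://purl.org/dc/terms/",
}


def _parse(ts):
    # Office usually writes time stamps such as 2015-03-01T09:30:00Z
    return datetime.fromisoformat(ts.strip().replace("Z", "+00:00"))


def printed_after_saved(path_or_file):
    """Return True if the last-printed stamp is later than the last-saved stamp.

    Returns None when either stamp is missing, so "no data" is not
    mistaken for "no anomaly".
    """
    with zipfile.ZipFile(path_or_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    printed = root.findtext("cp:lastPrinted", namespaces=NS)
    modified = root.findtext("dcterms:modified", namespaces=NS)
    if not printed or not modified:
        return None
    return _parse(printed) > _parse(modified)
```

A True result only flags the file for a closer look; on its own it does not prove tampering.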

Further evidence of forgery was the document’s date of creation on the corporate server: The document was created after the lawsuit was brought to court. Moreover, the defendants were accused of tampering with the time stamp of the last edit in the .olm files (that extension is used for Microsoft Outlook for Mac files).

The metadata evidence was enough for the court to rule in favor of the plaintiffs, eventually awarding them $20 million and slapping the defendants with millions more in sanctions.

Hidden files

Microsoft Office files offer a rich tool set for collecting private data. For example, footnotes to text can include additional information not intended for public use. The built-in revision tracking in Word could also be of use for a spy. If you choose the “Show final” option (or “No markup,” or similar, depending on your version of Word), tracked changes will disappear from the screen, yet they will remain in the files, waiting for some observant reader.

Also, there are notes to slides in PowerPoint presentations, hidden columns in Excel sheets, and more.
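To see that hidden revisions really do stay in the file, you can look inside a .docx yourself. The sketch below assumes the usual Office Open XML layout (main text in word/document.xml, comments in word/comments.xml) and the conventional w: namespace prefix that Word writes; it does a rough text search, not a full XML parse, and the function name is ours.

```python
import re
import zipfile


def find_revision_traces(path_or_file):
    """Roughly count tracked insertions/deletions left in a .docx file."""
    with zipfile.ZipFile(path_or_file) as package:
        names = package.namelist()
        xml = package.read("word/document.xml").decode("utf-8")
    return {
        # "[ >]" keeps us from matching longer tags such as <w:insideH> or <w:delText>
        "insertions": len(re.findall(r"<w:ins[ >]", xml)),
        "deletions": len(re.findall(r"<w:del[ >]", xml)),
        "has_comments": "word/comments.xml" in names,
    }
```

Nonzero counts mean the file still carries its edit history, whatever view mode the author chose before sending it.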

Ultimately, attempts to hide data without knowing how to do it properly tend not to work. A great example here is a court document published on CBSLocal, referring to the case of United States vs. Rod Blagojevich, ex-governor of Illinois. This is a motion for the court to issue a trial subpoena to Barack Obama, dated 2010.

Some parts of text are hidden by black boxes. However, if you copy and paste a text block into any text editor, you can read the text in its entirety.

Black boxes in a PDF may be useful to hide information in print, but this measure can be easily bypassed in a digital format.

Files inside of files

Data from external files embedded in a document is a completely different story.

To show a real example, we searched through some documents on .gov websites, and picked the US Department of Education’s tax report for the 2010 financial year to examine.

We downloaded the file and disabled read-only protection (which did not require a password). There is a seemingly normal graph on page 41. We selected “Change data” in the graph’s context menu, eventually opening an embedded Microsoft Excel source file containing all source data.

Here is a report in a Word file, containing an embedded Excel file with an abundance of source data for this and some other graphs.

It should go without saying that such embedded files might contain anything, including loads of private information; whoever published the document must have assumed that data was inaccessible.
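Before publishing a document, it is easy to check whether it carries such passengers. The following sketch assumes an Office Open XML file (.docx or .pptx), which is a ZIP archive that conventionally stores embedded objects under an embeddings folder such as word/embeddings/. Older binary .doc files are laid out differently and are not covered here. The function name is ours.

```python
import zipfile


def list_embedded_files(path_or_file):
    """Return the names of objects embedded in an Office Open XML document."""
    with zipfile.ZipFile(path_or_file) as package:
        return [
            item.filename
            for item in package.infolist()
            if "/embeddings/" in item.filename and not item.is_dir()
        ]
```

Any .xlsx entry in the result is a complete workbook that a reader can extract and open, not just the chart drawn from it.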

Harvesting metadata

The process of collecting metadata from documents belonging to an organization of interest may be automated with the help of software such as ElevenPaths’ FOCA (Fingerprinting Organizations with Collected Archives).

FOCA can find and download required document formats (for example, .docx and .pdf), analyze their metadata, and find out many things about the organization, such as the server-side software they use, usernames, and more.
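To give a feel for the analysis step only, here is a minimal sketch that pulls a few revealing fields from a single Office Open XML file you already have on disk. It is not how FOCA works internally (the source does not describe that), and it deliberately does no crawling or downloading. It assumes the standard docProps/core.xml and docProps/app.xml parts; the function name is ours.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces of the two standard property parts in Office Open XML packages
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}


def summarize_metadata(path_or_file):
    """Collect author names and software details from a local Office file."""
    found = {}
    with zipfile.ZipFile(path_or_file) as package:
        names = set(package.namelist())
        if "docProps/core.xml" in names:
            core = ET.fromstring(package.read("docProps/core.xml"))
            found["creator"] = core.findtext("dc:creator", namespaces=NS)
            found["last_modified_by"] = core.findtext("cp:lastModifiedBy", namespaces=NS)
        if "docProps/app.xml" in names:
            app = ET.fromstring(package.read("docProps/app.xml"))
            found["application"] = app.findtext("ep:Application", namespaces=NS)
            found["company"] = app.findtext("ep:Company", namespaces=NS)
    return found
```

Run over a folder of your own organization’s published documents, a loop around this function shows roughly what an outsider could learn from them.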

We must insert a serious warning here: Analyzing websites with such tools, even for the sake of research, might be taken very seriously by website owners or even qualify as cybercrime.

Documented oddities

Here are a couple of metadata peculiarities not all IT experts are familiar with. Take the NTFS file system used by Windows.

Fact 1. If you delete a file from a folder and immediately save a new file with the same name in the same folder, the date of creation will be the same as that of the file you deleted.

Fact 2. In addition to other metadata, NTFS keeps the date of the last access to the file. However, if you open the file and then check out the date stamp of last access in the file properties, the date remains the same.

You might think those oddities are just bugs, but they are in essence documented features. In the first case, we are talking about tunneling, which is required to enable backward software compatibility. By default, this effect lasts for 15 seconds, during which the new file gets the creation time stamp associated with the previous file (you can change the interval in system settings or disable tunneling entirely in the registry). Actually, the default interval was sufficient for me to stumble across tunneling twice in a week just doing my job.
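If you want to watch tunneling happen, a small experiment is enough. The sketch below creates a file, deletes it, re-creates it under the same name, and returns both creation time stamps for comparison. It assumes Windows with an NTFS volume and default tunneling settings; on other systems the value read from st_ctime is not a creation time at all, so the result means nothing there. The function and file names are ours.

```python
import os
import time


def _created(path):
    info = os.stat(path)
    # st_birthtime exists on newer Python versions for Windows;
    # older ones expose the creation time as st_ctime (Windows only).
    return getattr(info, "st_birthtime", info.st_ctime)


def recreate_and_compare(directory, name="tunnel_test.txt", pause=1.0):
    """Create, delete, and re-create a file; return both creation stamps."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("first version")
    first = _created(path)

    time.sleep(pause)  # stay well under the 15-second tunneling window
    os.remove(path)

    with open(path, "w") as handle:
        handle.write("second version")
    second = _created(path)
    os.remove(path)
    return first, second
```

With tunneling active, the two values are expected to match even though the second file was created later; raising pause above the tunneling interval should make them differ.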

The second case is also documented: Starting with Windows 7, for the sake of performance Microsoft disabled automated time-stamping for the time of last access. You can enable this feature in the registry. However, enabling it does not repair anything retroactively; for the period when it was off, the file system simply did not keep the correct date stamps (as proven by a low-level disk editor).

We hope computer forensics experts are aware of these peculiarities.

By the way, file metadata can be altered using built-in OS tools, native apps, and special software. That means you can’t rely on metadata as evidence in a court of law unless it’s accompanied by things like mailing service and server logs.
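How little effort that takes is easy to show. The sketch below uses only Python’s standard library to set a file’s access and modification time stamps to an arbitrary moment; the function name backdate is ours.

```python
import os
from datetime import datetime, timezone


def backdate(path, when):
    """Set a file's access and modification times to the given datetime."""
    stamp = when.timestamp()
    os.utime(path, (stamp, stamp))  # (access time, modification time)


# Example: make a file look as if it was last edited in early 2003
# backdate("report.doc", datetime(2003, 2, 3, 12, 0, tzinfo=timezone.utc))
```

No special privileges are needed beyond write access to the file, which is exactly why time stamps alone make weak evidence.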

Metadata: Security

A built-in feature in Microsoft Office called Document Inspector (File → Info → Inspect Document in Word 2016) shows a user the data contained in a file. To an extent, this data can be deleted on request, although embedded data (as in the Department of Education report cited above) cannot. Users should take care when inserting graphs and diagrams.

Adobe Acrobat has a similar ability to remove metadata from files.

In any case, security systems should manage leak prevention. For example, we have the DLP (Data Loss Prevention) module in Kaspersky Total Security for Business, Kaspersky Security for mail servers, and Kaspersky Security for collaboration platforms. These products can filter confidential metadata such as change logs, comments, and embedded objects.

Of course, the ideal (read: unachievable) method to prevent leaks entirely is having responsible, aware, and well-trained staff.
