Abraham Lincoln is often quoted as saying that if he had only six hours to cut down a tree, he would spend the first four sharpening his axe. Those in the security community often relate this statement to the information gathering and reconnaissance phases of breaking into another system. In the reconnaissance phase a potential intruder will spend a great deal of time learning everything they can about their target before attempting any sort of exploitation, because the information gathered here is often crucial to finding weaknesses in a system or its users. In this article I'm going to discuss the importance of metadata as it relates to reconnaissance. I'll cover what it is, how it's stored, and how attackers can extract it to find out more about you or your network. Finally, I'll provide some defensive tips to help you ensure that you aren't leaking the wrong kinds of information to the world via file metadata.
The general phrase associated with metadata is "data about data". That's because metadata is generally information that describes a document rather than its content. It is not a summary of the document, but data along the lines of the document's author, the date it was created, the system it was created on, and more. The types of metadata you will see vary widely and depend on the file type itself. For instance, image files can contain metadata that shows the GPS coordinates of the location where the picture was taken.
The biggest problem with document metadata is that it's typically populated automatically, without any intervention from the user. This means that when you publish a document online you might also be publishing your full name, company name, title, phone number, extension, address, computer name, and so on. Using this information, a potential attacker would already have a good head start on building a social engineering attack against you. Taking things one step further, imagine a scenario in which you have created a PDF file with an outdated version of Adobe Acrobat. The metadata included with this PDF could easily detail the version of Acrobat it was created with. An attacker attempting to compromise your system could do a quick search on that version of Acrobat, find an exploit for it, and that's all they would need. In short, inadvertent data disclosure via metadata poses both direct and indirect threats to system security.
Just because metadata can disclose information inadvertently doesn't mean it's inherently evil. Many of the search and indexing functions within an operating system rely upon metadata so that files can be found and accessed quickly. The key here is to be aware of what metadata exists in your documents and to control it accordingly.
Before we look at some defensive strategies for preventing inadvertent data disclosure, I want to provide a couple of examples of methods that can be used to extract metadata. The methods used to examine metadata depend on the type of file you are extracting it from. In some cases, such as that of a Microsoft Word document, viewing some of the metadata tied to the document can be as simple as right-clicking the file and viewing its properties. The figure below shows the metadata tied to a Word document.
Figure 1: Examining Metadata from a Word Document
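The same properties can also be read without opening Word at all. The following is a minimal sketch, assuming the document is in the newer .docx format, which is a ZIP archive that normally stores these fields in a part named docProps/core.xml. The function name read_core_properties and the selection of fields are my own choices for illustration, and older binary .doc files would need a different approach.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of a .docx file.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative selection of fields; a real document may contain more or fewer.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "title": "dc:title",
}


def read_core_properties(docx):
    """Return the non-empty core properties of a .docx path or file-like object."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found


# Example call (the file name is hypothetical):
#   print(read_core_properties("report.docx"))
```

Nothing here requires Word or any special privileges, which is the point: anyone who can download the file can read these fields.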
We can take things a step further by extracting metadata from an unlikely source. Consider a simple JPG image file taken with a cell phone. Most people never realize that, unless the feature is disabled, every picture they take has the GPS coordinates of the photo's location embedded in the image itself. This data, among other things, can be easily extracted with the Jhead tool, which is available as a free download for both Windows and Linux.
Executing Jhead is as simple as it gets. You run the program from a command prompt, specifying an image file as a command line argument, and Jhead will output the available metadata to the screen. In the case of Figure 2, I simply took a photograph with my phone while location services were set to "On" (enabled by default).
Figure 2: The amount of information embedded in this image is astonishing
This is something to consider when you are posting pictures to Facebook or a personal blog. Anybody could extract this data and find out where you live or work.
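Turning the embedded coordinates into something you can paste into a mapping site takes only a little arithmetic. EXIF stores latitude and longitude as degrees, minutes, and seconds together with a hemisphere reference (N, S, E, or W). The sketch below shows the conversion to decimal degrees. The function name and the sample coordinates are made up for illustration and are not taken from the photograph in Figure 2.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative in decimal notation.
    return -value if ref in ("S", "W") else value


# Hypothetical coordinates, invented for this example.
latitude = dms_to_decimal(38, 53, 23.0, "N")
longitude = dms_to_decimal(77, 0, 32.0, "W")
print(f"{latitude:.6f}, {longitude:.6f}")
```

A pair of decimal values like this is all a mapping service needs to drop a pin on the spot where the picture was taken.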
One of the most popular tools for an automated approach to finding and extracting metadata is FOCA. FOCA stands for Fingerprinting Organizations with Collected Archives, and it has become such a popular tool for extracting metadata that you will often hear information security professionals refer to metadata as FOCA data, much as people will often call a photograph a Polaroid. A free version of FOCA is available for download, and there is also a commercial version.
FOCA is a pretty powerful tool with a lot of different options, but I want to show how someone would use its basic feature set to search a domain for documents containing metadata. In order to do this you will first need to download and install FOCA and create a new project from the File menu. This project will need to be centered on a particular target domain. Once the project is created FOCA will use a list of search engines to search the domain for particular file types known to contain usable metadata and present those on screen to you. In the case of Figure 3, I've done a simple search on the WindowSecurity.com domain.
Figure 3: The WindowSecurity.com domain contains a great many indexed documents that may contain metadata.
Once the search process is completed you can zero in on particular files that look interesting. First download the file by right-clicking it and clicking Download, then extract its metadata by right-clicking it again and selecting Extract Metadata. This will place all of the extracted data in an easy-to-read format in the left pane of the FOCA window. An example output is shown in Figure 4.
Figure 4: FOCA extracts just about everything you would want to know.
Although we just viewed an individual file, the real power of FOCA comes when you enumerate a domain fully and extract all of the metadata it has to offer. Once done, FOCA will break down the metadata, giving you a list of all of the usernames, e-mail addresses, software versions, etc. that it finds. For an attacker, that is effectively a road map for exploitation.
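To make the idea of that breakdown concrete, here is a rough sketch of the aggregation step. It is not how FOCA itself is implemented. The record layout (author, last_modified_by, software) and every sample value are assumptions invented purely for illustration.

```python
from collections import Counter


def summarize_metadata(records):
    """Count the users and software versions seen across a set of documents."""
    users = Counter()
    software = Counter()
    for record in records:
        for field in ("author", "last_modified_by"):
            if record.get(field):
                users[record[field]] += 1
        if record.get("software"):
            software[record["software"]] += 1
    return users, software


# Hypothetical per-document metadata, invented for this example.
sample = [
    {"author": "jsmith", "software": "Acrobat 8.0"},
    {"author": "jsmith", "last_modified_by": "mjones", "software": "Word 2007"},
    {"author": "mjones", "software": "Acrobat 8.0"},
]

users, software = summarize_metadata(sample)
print(users.most_common())     # likely usernames, most frequent first
print(software.most_common())  # software versions, most frequent first
```

Even a tally this simple tells an attacker which account names to try and which software versions are worth researching.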
Protecting Against Metadata Extraction
If this article has struck a chord with you, then there is a good chance you have started going through your company's web servers looking for documents that may contain data that you don't necessarily want exposed to the public. Rightfully so! Every organization should be conscious of the information it is inadvertently leaking, which is why I want to provide a few tips on controlling document metadata.
Auditing Web Servers
The best way to ensure you aren't leaking information unintentionally is to regularly audit your public-facing web infrastructure using a tool like FOCA. This tool is easy enough to use that the audit can be done by web administrators and not necessarily by folks from an already overburdened security team. A quarterly or semiannual run-through should be enough to make sure you aren't giving away too much to a potential attacker.
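If you also have file system access to the web server, a complementary check from the inside is to inventory the published files whose types commonly carry metadata. The sketch below is one way that might look. The extension list is an illustrative assumption rather than a complete one, and the path in the usage comment is hypothetical.

```python
import os
from collections import defaultdict

# Assumption: an illustrative, incomplete list of metadata-prone file types.
METADATA_EXTENSIONS = {
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf", ".jpg", ".jpeg",
}


def inventory_documents(web_root):
    """Group files under web_root by extension, keeping only metadata-prone types."""
    found = defaultdict(list)
    for folder, _subfolders, filenames in os.walk(web_root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension in METADATA_EXTENSIONS:
                found[extension].append(os.path.join(folder, name))
    return dict(found)


# Example call (the path is hypothetical):
#   for extension, paths in inventory_documents("/var/www/html").items():
#       print(extension, len(paths))
```

The resulting list is a starting point for review, not a verdict: each file still has to be inspected to see what it actually contains.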
Controlling Mobile Devices
The reality of the modern era is that most people are walking around with a computer in their pocket, running every application as an administrator. The last thing you want to do is take a picture of a new component of one of your products and accidentally give away the location of one of your suppliers via GPS coordinates hidden in the metadata of a JPG. At an individual level you should make sure that your mobile device isn't configured to embed such data by disabling what some phone providers refer to as "Location Services". At an organizational level you should enforce this same configuration on all company-issued devices and make sure users are properly educated on the potential impact of this issue and how to configure their personal phones accordingly.
Just Remove It
The simplest way to ensure information isn't leaked inadvertently is to remove metadata from documents before posting them in an accessible location. In a lot of cases this will need to be done differently for each application. Microsoft Office and Open Office both provide guidance on this at their respective support sites. Additionally, there are quite a few free tools that can remove metadata from images, PDFs, and similar file types in bulk. Just about any tool will do here.
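To show what such a tool does under the hood, here is a minimal sketch for the .docx case. It assumes the file is a ZIP archive with its core properties in docProps/core.xml, and it writes a copy in which that part is replaced by an empty properties element. The function name is my own. Other parts of a document (application properties, comments, tracked changes) can also carry information, so treat this as an illustration and not a substitute for the vendor guidance mentioned above.

```python
import zipfile

# An empty core properties element with the usual namespace declarations.
BLANK_CORE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:dcterms="http://purl.org/dc/terms/" '
    'xmlns:dcmitype="http://purl.org/dc/dcmitype/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>'
)


def strip_core_properties(src, dst):
    """Copy a .docx from src to dst, blanking only docProps/core.xml."""
    with zipfile.ZipFile(src) as original, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as cleaned:
        for item in original.infolist():
            data = original.read(item.filename)
            if item.filename == "docProps/core.xml":
                data = BLANK_CORE.encode("utf-8")
            cleaned.writestr(item.filename, data)


# Example call (both file names are hypothetical):
#   strip_core_properties("report.docx", "report-clean.docx")
```

Writing to a separate output file keeps the original intact, so you can compare the two before publishing the cleaned copy.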
Although we tend to focus on securing the most critical devices in our networks to keep the bad guys out, it's often the small things that let them in. Many attackers will tell you that extracted metadata can prove to be the most important asset they have for breaking into a network, or even for knowing where to go once they get in. In this article I've discussed what metadata is, its importance, how it can be extracted, and some basic tips for ensuring you aren't leaking too much information by way of publicly accessible metadata.