The Curious Case of the Enron Emails

Many months ago I was faced with a normal problem: my EC2 instance’s disk was full. But why? Why was it running out of space so quickly, and what was actually eating the disk? This is the story of my investigation, plus a final word on security.

It all started with a page: a message telling me an instance had a full disk. Well, time to investigate. (I’m replaying this a tad with a Vagrant instance, as it was so long ago.)

First let’s bust out some bash:

ubuntu# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  7.4G  252K 100% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            492M   12K  492M   1% /dev
tmpfs           100M  332K   99M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            497M     0  497M   0% /run/shm
none            100M     0  100M   0% /run/user

Well, we’ve got a problem, so let’s find the root cause. I poked around a bit by hand and came across a file with an odd name.

257M Feb  2  2012 enron_mongo.tar.bz2

Well, that’s unusual. Where in the world did that come from? Many find, sort and various other commands later, I also found a file called messages.bson, which turned out to be the contents of enron_mongo.tar.bz2.

TIP: du -xak .|sort -n|tail -50 can be used to find large files quickly. Try running it from a specific folder or from /.
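If you would rather see only regular files (du -a also lists directories, which tend to crowd the top of the output), here is a minimal sketch of the same idea as a bash function. The name largest_files is made up for this post, and the sizes it prints are in KiB as reported by du -k.

```shell
#!/usr/bin/env bash
# largest_files DIR [COUNT]  (hypothetical helper, not a standard command)
# Print the COUNT largest regular files under DIR, smallest first,
# with sizes in KiB. -xdev keeps find on a single filesystem,
# much like the -x flag in the du tip above.
largest_files() {
    local dir="${1:-.}" count="${2:-20}"
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null \
        | sort -n \
        | tail -n "$count"
}

# Example usage (uncomment to run; scanning / can take a while):
# largest_files / 20
```

The biggest offender ends up on the last line, so it stays visible even when the list scrolls.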

1.4G May 30 01:21 messages.bson

Around this time we were working on analyzing many open source projects and testing out various methods for finding data. Many of these instances weren’t exactly stable, but it was early days, so hard drives filled up and things crashed here and there. What was unusual was that I now appeared to have a copy of the Enron emails on my server. There were no suspicious commands in the server’s shell history, so I set about finding out where the emails came from.

My first idea was to search GitHub, since we had been scanning various repos. The first few results quickly showed that this data is a common MongoDB test dataset, and they led me to one repo in particular: https://github.com/mongodb/mongo-hadoop.

Since this appears to be a Gradle project, I started by looking at the build.gradle file, https://github.com/mongodb/mongo-hadoop/blob/master/build.gradle, where we find a few interesting items.

test {
    dependsOn 'jar', 'testsJar', ':startCluster', ':downloadEnronEmails'
}
....
project(":examples/enron") {
    uploadArchives.onlyIf { false }
    dependencies {
        compile project(':core')
    }
}

Inside functions.gradle we will also find:

task downloadEnronEmails() << {
    extract(dataHome, dataHome,
            downloadFile('https://s3.amazonaws.com/mongodb-enron-email/enron_mongo.tar.bz2'))
}

Well, it looks like we’ve found where this came from, and it dates back to our very early days of testing Gradle. The more important point here is to remember that build tools can do almost anything, so be careful when running them. I’ll leave that discussion for another time.
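In the meantime, here is a rough sketch of one cheap check you could run on a fresh checkout before building it: grep the Gradle scripts for URLs. The function name list_build_urls is my own invention, not part of Gradle, and it only catches URLs written out literally in *.gradle files, so treat it as a first look rather than an audit.

```shell
#!/usr/bin/env bash
# list_build_urls DIR  (hypothetical helper, not part of Gradle)
# Print every literal http(s) URL found in *.gradle files under DIR,
# as file:line:url. Exits non-zero when nothing matches.
list_build_urls() {
    local dir="${1:-.}"
    grep -rnoE --include='*.gradle' "https?://[^'\" )]+" "$dir"
}

# Example usage (uncomment to run inside a checkout):
# list_build_urls .
#
# Gradle's --dry-run flag is a useful companion: it lists the tasks a
# command would run without executing their actions, e.g.
# gradle test --dry-run
```

Run against the mongo-hadoop scripts quoted above, a check like this should surface the S3 URL in functions.gradle, and a dry run of the test task should show downloadEnronEmails among its dependencies.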