Tuesday, May 19, 2015

Inverse Geographic Mapping: Introduction

As I mentioned in my previous post, I got started working on the Kaggle forest cover type competition while taking a data science course on Coursera. I got an introduction to many different machine learning techniques while working on my submissions for that competition (read more about some of my process here.) While exploring the forest cover data set I had an idea for a side project that I would like to describe here.

To start, I will give a short description of the forest cover data set I used (find the official data summary here). The data was collected in four regions of the national forest in Colorado; each sample corresponds to a 30m by 30m patch. There are 12 different fields given for each sample, plus the forest type classification (given only in the training data set; this field gives the type of forest there, for example aspen or Douglas-fir). None of these fields contains information to directly locate the samples in the real world or relative to each other (e.g. no latitude or longitude). The goal of the competition was to take those 12 fields and use them to predict the forest type. For my side project, though, only 3 fields are of primary interest, with 2 more fields of supporting interest.

  • The three primary fields are:
    • Horizontal distance to hydrology (distance to nearest surface water features)
    • Horizontal distance to fire points (distance to nearest wildfire ignition points)
    • Horizontal distance to roadways
    I refer to the places these distances are in reference to (i.e. the surface water features, fire ignition points and roadways) as reference points.
  • The two fields of supporting interest are:
    • Elevation
    • Vertical distance to hydrology
    These two fields help me group neighborhoods of sample points by checking whether the body of water they are closest to is at the same (or nearly the same) elevation. Elevation is also a helpful field for plotting the data to see if the results are reasonable.
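The grouping idea in the last bullet can be sketched in a few lines of Python. This is only a rough illustration, not the code I actually used: the field names, the sample values, and the 5 m bin width are all made up, and it assumes that a positive vertical distance to hydrology means the sample sits above the water.

```python
from collections import defaultdict


def water_elevation(elevation, vertical_distance_to_hydrology):
    """Estimate the elevation of the nearest water feature.

    Assumption: a positive vertical distance means the sample is above the water.
    """
    return elevation - vertical_distance_to_hydrology


def group_by_water_elevation(samples, bin_width=5.0):
    """Bucket sample ids by estimated water elevation, in bins of bin_width meters."""
    groups = defaultdict(list)
    for sample in samples:
        level = water_elevation(
            sample["elevation"], sample["vertical_distance_to_hydrology"]
        )
        groups[round(level / bin_width)].append(sample["id"])
    return dict(groups)


# Made-up values, for illustration only (not taken from the data set).
samples = [
    {"id": 1, "elevation": 2600, "vertical_distance_to_hydrology": 10},
    {"id": 2, "elevation": 2595, "vertical_distance_to_hydrology": 4},
    {"id": 3, "elevation": 2800, "vertical_distance_to_hydrology": 30},
]
groups = group_by_water_elevation(samples)
print(groups)
```

Samples 1 and 2 land in the same bin because their estimated water elevations (2590 and 2591) are nearly identical, while sample 3 (2770) ends up on its own. Fixed bins are a simplification: two samples just either side of a bin edge get split, so a real version would probably want a smarter clustering step.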

While exploring the data I noticed that when the primary distance-based fields are plotted against elevation, the graphs look like landscape features (hills and valleys; see figure below). That led me to the question guiding this side project: Can I map backwards from the data to create an x, y coordinate system that accurately locates the data points relative to where they are in actual space?

To look at this on a smaller scale, if I look at a couple of samples together I might have data something like:

Id | Elevation | Vertical Distance To Hydrology | Horizontal Distance To Hydrology | Horizontal Distance To Road | Horizontal Distance To Fire Points

This tells me nothing about how close the sample locations were to each other. Because the distances for each sample are relatively similar, we might guess they are physically close, but we can't be sure from just this data. If I can successfully make an inverse mapping, it would tell me approximately where those points (and others) were located relative to each other. This could be useful if I were not sure what type of forest cover a particular sample had but (for example) could determine that it was surrounded by samples that had lodgepole pine: chances are that sample also has lodgepole pine.
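To make that last idea concrete, here is a minimal sketch of a nearest-neighbor vote, assuming the inverse mapping has already produced x, y coordinates for each sample. The function name, the coordinates, and the choice of k are hypothetical; this is a plain k-nearest-neighbors majority vote, not necessarily what I will end up using.

```python
import math
from collections import Counter


def predict_cover_from_neighbors(target_xy, labeled_points, k=3):
    """Guess a cover type by majority vote among the k nearest labeled samples.

    labeled_points is a list of ((x, y), cover_type) pairs in the
    reconstructed coordinate system (hypothetical here).
    """
    nearest = sorted(
        labeled_points, key=lambda point: math.dist(target_xy, point[0])
    )[:k]
    votes = Counter(cover_type for _, cover_type in nearest)
    return votes.most_common(1)[0][0]


# Made-up coordinates in meters, for illustration only.
labeled = [
    ((0, 0), "lodgepole pine"),
    ((30, 0), "lodgepole pine"),
    ((0, 30), "aspen"),
    ((300, 300), "aspen"),
]
guess = predict_cover_from_neighbors((10, 10), labeled, k=3)
print(guess)
```

With these made-up points, two of the three nearest neighbors are lodgepole pine, so that is the guess for the unlabeled sample.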

That's it for this post. Stay tuned for more later! (update: The next post: Initial Analysis)