Friday, July 3, 2015

Inverse Geographic Mapping: Local Process

This is another post exploring the attempt to map from a set of distance features back to an x, y coordinate system. If you haven't already, you may want to start by reading the Introductory post in the series and work your way through. This post assumes knowledge of what was presented in the previous posts in the series.

The Process:

I will be looking here at a relatively simple version of the problem (just one set of reference points and a small number of sample points). The process I use here will become a component of the process for more complex versions.

I will start with data that looks like:

Id | Horizontal_Distance_To_Fire_Points | Horizontal_Distance_To_Hydrology | Horizontal_Distance_To_Roadways
0  | 3212.763650 | 174.497793  | 5159.179288
1  | 3135.697623 | 783.345951  | 5433.928096
2  | 5247.581151 | 3504.840499 | 7829.386517
3  | 1532.853280 | 2141.195728 | 4107.445635
4  | 1824.806584 | 2114.739118 | 4413.043552

Which comes from samples that were originally positioned like:

Strategy

My strategy will be to develop a cost function for fitting coordinates to the samples and use gradient descent to try to get the cost function to zero (or close to it); hopefully that will result in a good set of coordinates.

Gradient Descent

If you are unfamiliar with gradient descent, think of the function being minimized as describing a landscape. If you were placed randomly on an unknown landscape while blindfolded (and didn't have to worry about falling down cliffs), how might you go about trying to find the lowest point? One option would be the following procedure:

  1. Determine what direction would result in the steepest downhill step
  2. If every direction is uphill then stop
  3. Otherwise take a step in the steepest downhill direction
  4. Go back to step 1

When the procedure ends, you will be at a local low point. Under some conditions you might be able to guarantee that it is the lowest point on the landscape, but often that will not be the case. If it is low enough and you don't care if it is the absolute lowest you can stop and be happy, otherwise you might start at different places and repeat the procedure, picking the smallest of the local minima you find. That's the general idea of the gradient descent algorithm - it uses calculus to find the steepest direction to move and algebra to take its steps.
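To make the procedure concrete, here is a minimal numeric sketch in Python. The function being minimized, the step size and the stopping tolerance are all placeholder choices for illustration, not values used later in the series.

```python
import numpy as np

def gradient_descent(grad, x0, step_size=0.1, tol=1e-8, max_iter=10000):
    """Repeatedly step in the steepest downhill direction until the steps become tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = step_size * grad(x)
        if np.linalg.norm(step) < tol:  # effectively at a local low point
            break
        x = x - step
    return x

# Toy example: minimize f(x, y) = (x - 3)^2 + (y + 1)^2, whose minimum is at (3, -1).
grad_f = lambda p: np.array([2 * (p[0] - 3), 2 * (p[1] + 1)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # -> approximately [ 3. -1.]
```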

Cost Function

In this context, the cost function should give me an idea of how well my chosen x, y coordinates fit the data I was given. The cost of a perfect fit should be 0 and no cost should be negative. It is also helpful if the cost function is differentiable (so the gradient can be found). It is possible to come up with multiple cost functions that will work for our purposes; some may be simpler or better performing than others. I chose a cost function that met the above requirements and kept the calculus and algebra relatively straightforward.

To motivate the cost function I use, let's start with an example of an individual point and reference point:
Suppose my first sample point has horizontal distance to fire of 1500 and my hypothesis is that the sample location is at (100, 200) and the nearest fire point is at (800, 1200). Using those points, I can find the hypothesized distance to fire by:

sqrt((800 - 100)^2 + (1200 - 200)^2) = sqrt(700^2 + 1000^2) ≈ 1220.6

The amount the hypothesis is off by is 1500 - 1220.6 = 279.4. This is the basic idea I want to capture in the cost function, but I have to be a little careful to ensure I don't get a negative value (e.g. if the hypothesized distance were 1700 then 1500 - 1700 = -200). One way I could deal with this is by taking the absolute value of the difference, but that is a little complicated to deal with when it comes to the calculus. The option I chose is to square the difference: this satisfies our requirements and keeps the calculus and algebra reasonable, though it makes the cost function value less intuitive.

The overall cost function takes the idea above and applies it to all of the hypothesized sample/reference point locations. Written out, the part of the cost function related to the fire point would look like:

cost_fire = sum over samples i of (D_fire,i - sqrt((x_i - x_fire)^2 + (y_i - y_fire)^2))^2

where D_fire,i is the given distance to fire for sample i, (x_i, y_i) is the hypothesized location of sample i, and (x_fire, y_fire) is the hypothesized location of the fire reference point.

To get the total cost I would need to also add in the parts corresponding to the water and road points. In practice I also multiply by a scaling factor related to the number of points in the set. This helps make costs comparable across sets of different sizes.
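As a rough sketch of what this looks like in code, here is one way to write it, assuming the hypothesis packs the sample coordinates followed by the fire, water and road reference coordinates into one flat vector, and using the number of distance terms as the scaling factor. Both of those are illustrative choices rather than a record of exactly what I run.

```python
import numpy as np

def unpack(params, n_samples):
    """Flat hypothesis vector -> (n_samples x 2 sample coords, 3 x 2 reference coords)."""
    points = params.reshape(-1, 2)
    return points[:n_samples], points[n_samples:]  # samples, then fire/water/road references

def cost(params, distances):
    """Scaled sum of squared differences between given and hypothesized distances.

    distances has shape (n_samples, 3): the fire, water and road distance columns.
    """
    n_samples = distances.shape[0]
    samples, refs = unpack(params, n_samples)
    # Hypothesized distance from every sample to every reference point.
    hypothesized = np.linalg.norm(samples[:, None, :] - refs[None, :, :], axis=2)
    diff = distances - hypothesized
    # Scale by the number of terms so costs are comparable across sets of different sizes.
    return np.sum(diff ** 2) / diff.size
```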

Implementation

Initially I implemented the gradient descent algorithm in Python. It worked okay, but there are a couple of finer points that can be tricky to tune well. Eventually I switched over to using the fmin_bfgs function from the scipy.optimize library, which works similarly to gradient descent. This left me with less control for fine tuning, but the time saved by not having to fine tune as much let me focus on other parts of the problem.
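For reference, the call looks roughly like this, reusing the cost and unpack sketches from the previous section; the random starting guess and its scale are arbitrary, and since no gradient function is passed in, scipy estimates it numerically.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

# distances, cost and unpack are as in the cost function sketch above.
n_samples = distances.shape[0]
x0 = np.random.randn((n_samples + 3) * 2) * 1000.0  # random starting coordinates
result = fmin_bfgs(cost, x0, args=(distances,))
sample_coords, ref_coords = unpack(result, n_samples)
```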

Results

After running the gradient descent algorithm, I can compare the plots of the hypothesized points to the true points. Since the hypothesized points may differ from the originals by a translation, rotation and/or reflection (but not a stretch!) and still be a correct fit, when comparing the plots I may need to apply a translation, rotation and/or reflection. To start with, for simplicity I translate both sets of points so that the fire reference point is at (0, 0). In the plots below the original points are circles and the hypothesized points are Xs.

It looks like the points match up pretty well but need a rotation to really fit - no reflection appears to be necessary this time. After rotating an appropriate amount I get:

Looks like a good fit!
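For completeness, the alignment used for the comparison can be sketched like this; in keeping with the description above, the rotation angle (and whether to reflect) is chosen by eye from the plots rather than fit automatically.

```python
import numpy as np

def align(points, fire_ref, angle, reflect=False):
    """Translate so the fire reference point is at (0, 0), then optionally reflect and rotate."""
    shifted = points - fire_ref
    if reflect:
        shifted = shifted * np.array([1.0, -1.0])  # flip across the x axis
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    return shifted @ rotation.T  # rotate each (x, y) row by the chosen angle

# Example usage (angle chosen by eye):
# aligned_true = align(true_points, true_fire, angle=0.0)
# aligned_hyp = align(hyp_points, hyp_fire, angle=0.6)
```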

Friday, June 5, 2015

Inverse Geographic Mapping: Experimental Setup

This is another post exploring the attempt to map from a set of distance features back to an x, y coordinate system. If you haven't already, you may want to start by reading the Introductory post in the series and work your way through. This post assumes knowledge of what was presented in the two previous posts in the series.

Experimental Setup

Because the data I have from Kaggle does not include any x, y coordinate system, it may be difficult to discern whether my approach is effective. In practice there are things that can be done with the Kaggle data that might provide some indication of correctness, but it will be simpler to create my own data sets while doing the initial development and testing of my approach. Below I will describe a couple of steps in setting up my practice data.

The basic setup:

The first step is to randomly generate a set of reference and sample points. Sample points are generated with x and y position values from a normal distribution with mean 0 and standard deviation 1000, and reference points with position values from a normal distribution with mean 0 and standard deviation 2000. At different times I generate different numbers of practice points, but for demonstration, here is a graph of 10 test points with one each of the three reference point types.
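In code, the generation step is just a couple of draws from numpy's normal distribution; a minimal sketch (the seed and counts here are arbitrary):

```python
import numpy as np

np.random.seed(0)

n_samples = 10
samples = np.random.normal(loc=0.0, scale=1000.0, size=(n_samples, 2))  # sample point x, y positions
refs = np.random.normal(loc=0.0, scale=2000.0, size=(3, 2))             # fire, water, road reference points
```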

The distance between each sample point and each reference point is calculated to get the "horizontal distance to..." fields:

Id | Horizontal_Distance_To_Fire_Points | Horizontal_Distance_To_Hydrology | Horizontal_Distance_To_Roadways
0  | 3095.843540 | 2182.240619 | 3553.646848
1  | 4156.900497 | 1200.977046 | 4765.299586
2  | 643.074660  | 4513.132292 | 4881.010825
3  | 1820.176689 | 4114.332954 | 5843.656280
4  | 2504.993651 | 3740.113119 | 6036.146318

(The distance fields for the first 5 sample points in the generated set.)
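The calculation itself is just the Euclidean distance from every sample to every reference point; here is a sketch that builds a table in the same shape as above, continuing from the generation sketch (the column order is an assumption of the sketch):

```python
import numpy as np
import pandas as pd

# samples is (n_samples, 2) and refs is (3, 2), as in the generation sketch above.
distances = np.linalg.norm(samples[:, None, :] - refs[None, :, :], axis=2)

practice = pd.DataFrame(distances, columns=[
    "Horizontal_Distance_To_Fire_Points",
    "Horizontal_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
])
practice.index.name = "Id"
```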

The practice data is now in the form of the Kaggle data. I can run it through my process, take the x and y coordinates that are output, and compare them to the original positions of the practice data. There will be no way for the process to determine correct orientation or absolute position, but if it works properly it should find points in the same locations as the originals, up to a rotation and/or reflection and a translation.

Further Steps:

The setup above is about as simple as I can make it and is a good starting place. As I continue to develop the process I will also need to consider how it handles situations where there is more than one of each reference point type. For example, what happens when there are 3 water points? Does it matter if they are far apart/close together/in a line? What happens near the boundaries when a sample point is close to the same distance from 2 or more of the water points? It is straightforward to extend my practice set generator to include additional reference points, and I will likely include an example of that when I get to exploring the process at that level.

I may need to extend the practice set generation even further as I continue to iterate between getting the process running on the practice data and seeing how it performs on the actual data. The next step for the blog though is to start looking at the basics of how my current approach works with just the simple practice setup.

Monday, June 1, 2015

Inverse Geographic Mapping: Initial Analysis

Initial Analysis:

(The introductory post)

While the plot in the introductory post made it seem like the data might allow for a mapping from the distance fields to an x, y coordinate system, I wanted to double check the idea before proceeding. To do so I looked at the dimensions of the inputs and outputs for the proposed mapping. If the dimension of the information being input is smaller than the dimension I would like the inverse process to output, that would suggest the idea is untenable.

Looking again at a couple of sample points:
Id | Elevation | Vertical Distance To Hydrology | Horizontal Distance To Hydrology | Horizontal Distance To Road | Horizontal Distance To Fire Points
2  | 2590 | -6 | 212 | 390 | 6225
5  | 2595 | -1 | 153 | 391 | 6172

Each sample point contributes three pieces of individual data directly related to re-creating x, y coordinates for the points: the three horizontal distance fields. If I have a group of n sample points with the same 3 reference points, then there will be 3n pieces of information. For the output I will need 2n pieces of information for the sample points (their x, y coordinates), plus 6 total for the x, y coordinates of the 3 reference points. So if 3n is at least as big as 2n + 6 it seems plausible that we might have enough information to perform the inversion - this should be the case for n at least 6. In practice it is a bit more complicated than that since these are not linear systems, but for me it was enough justification to give it a try.

While proceeding, I will make the assumption that the reference points can be treated as discrete points. For example, if a set of samples have a pond as their closest water source, my assumption is that the distances were all measured to the same location, rather than each being measured to whichever point on the pond edge is closest to the sample location. I am not certain this assumption is warranted, nor, if it is not, whether I will be able to adjust my algorithm to compensate.

My next post should be a description of my experimental setup to test my process in a controlled system.

Tuesday, May 19, 2015

Inverse Geographic Mapping: Introduction

As I mentioned in my previous post, I got started working on the Kaggle forest cover type competition while taking a data science course on Coursera. I got an introduction to many different machine learning techniques while working on my submissions for that competition (read more about some of my process here.) While exploring the forest cover data set I had an idea for a side project that I would like to describe here.

To start I will give a little description of the forest cover data set I used (find the official data summary here.) The data was collected in four regions of the national forest in Colorado; each sample corresponds to a 30m by 30m patch. There are 12 different fields given for each sample, plus the forest type classification (given only in the training data set; this field gives the type of forest there, for example aspen or Douglas-fir). None of these fields contain information to directly locate the samples in the real world or relative to each other (e.g. no latitude or longitude). The goal for the competition was to take those 12 fields and use them to predict the forest type. For my side project, though, only 3 fields are of primary interest, with 2 more fields being of supporting interest.

  • The three primary fields are:
    • Horizontal distance to hydrology (distance to nearest surface water features)
    • Horizontal distance to fire points (distance to nearest wildfire ignition points)
    • Horizontal distance to roadways
    I refer to the places these distances are in reference to (i.e. the surface water features, fire ignition points and roadways) as reference points.
  • The two fields of supporting interest are:
    • Elevation
    • Vertical distance to hydrology
    Making use of them helps me group neighborhoods of sample points by checking whether the body of water they are closest to sits at the same (or nearly the same) elevation (a sketch of this check follows the list). Elevation is also a helpful field for plotting the data to see whether the results are reasonable.
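As an illustration of that grouping idea, here is a sketch. It assumes the vertical distance field is measured as sample elevation minus water elevation (a sign convention I have not verified), and the file path is just a placeholder for the Kaggle training data.

```python
import pandas as pd

df = pd.read_csv("train.csv")  # placeholder path to the Kaggle training data

# Assumed convention: Vertical_Distance_To_Hydrology = sample elevation - water surface elevation.
df["Water_Elevation"] = df["Elevation"] - df["Vertical_Distance_To_Hydrology"]

# Samples whose implied water elevation is (nearly) the same plausibly share a water source.
groups = df.groupby(df["Water_Elevation"].round(-1))  # bucket to the nearest 10
```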

While exploring the data I noticed that when the primary distance-based fields are plotted against elevation, the graphs look like landscape features (hills and valleys - see figure below.) That led me to the question guiding this side project: Can I map backwards from the data to create an x, y coordinate system accurately locating the data points relative to where they are in actual space?

To look at this on a smaller scale, if I look at a couple of samples together I might have data something like:

Id | Elevation | Vertical Distance To Hydrology | Horizontal Distance To Hydrology | Horizontal Distance To Road | Horizontal Distance To Fire Points
2  | 2590 | -6 | 212 | 390 | 6225
5  | 2595 | -1 | 153 | 391 | 6172

This tells me nothing about how close the sample locations were to each other. Because the distances for each sample are relatively similar, we might guess they are physically close, but we can't be sure from just this data. If I can successfully make an inverse mapping, it would tell me approximately where those points (and others) were located relative to each other. This could be useful if I was not sure what type of forest cover a particular sample had but could determine (for example) that it was surrounded by samples that had lodgepole pine: chances are that sample also has lodgepole pine.

That's it for this post. Stay tuned for more later! (update: The next post: Initial Analysis)