Monday, March 7, 2016

The Call of the Pika

Some of the most interesting places for researchers to study pika are at the edges of suitable habitat - places where conditions often aren't ideal and where, if there are any pika at all, there probably won't be many. Detecting pika at such locations can be time intensive, and even with a lot of time spent it is still possible to miss an animal that is there. The initial goal of the pika audio project is to help with situations like that. If successful, it will let a researcher leave a recording device in the field for days at a time and then use our program to automatically analyze the audio and identify any pika calls that are present. It still would not guarantee finding every pika that is there, but it could be a good way to increase monitoring coverage without significantly increasing human-hours.

With that in mind, our goal is to be able to efficiently analyze large quantities of audio to find pika calls. How do we do that? Let's start by looking at what a pika call looks like. A spectrogram is a common tool for examining audio graphically: it shows how strong the different pitches (frequencies) in a signal are over time. Here is a spectrogram (also called a sonogram) of a pika call:

Audio of the pika call:
For this pika call there are four main pitch bands (along with a few bands of lesser intensity) that appear to be fairly evenly spaced from each other throughout the short call.
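If you would like to generate a spectrogram like this yourself, here is a minimal sketch using scipy and matplotlib. The file name and FFT settings are placeholders for demonstration, not the settings we use in the project:

    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, audio = wavfile.read("pika_call.wav")   # sample rate (Hz) and raw samples
    if audio.ndim > 1:
        audio = audio[:, 0]                       # keep a single channel

    freqs, times, intensity = spectrogram(audio, fs=rate, nperseg=512, noverlap=256)

    plt.pcolormesh(times, freqs, intensity, shading="auto")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.title("Spectrogram")
    plt.show()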
For comparison, here is the spectrogram of a song sparrow song:

(thanks to Matt Goff for the song sparrow audio)
Audio of the song sparrow song:

Although the different time scales of the two spectrograms make them a little tricky to compare directly, it is easy to see that the pika spectrogram is distinguishable from the song sparrow spectrogram. In my next post I will go into some of the details of how we use features of the pika call spectrogram to pull pika calls out of a larger audio file.

Friday, February 26, 2016

Back in the Habit?

Last summer I went to a training that Cascades Pika Watch put on in Tacoma. Their organization works to get citizen scientists out gathering useful data on pika. Pika are remarkable little relatives of the rabbit: they primarily live in high-elevation rocky areas, don't like getting too warm, are quite cute, and surprisingly manage to live through alpine winters under the snow without hibernating. They are a species of concern because they have been losing a lot of previously suitable habitat as the climate has shifted.

While at the training, I learned that one of the organizers had been working on a side project to identify pika from audio recordings. This could provide a useful tool for monitoring for pika, and in the long run might be able to be expanded to facilitate other interesting analysis. That sounded like a wonderful opportunity to blend my interests in natural history and data analysis, so I asked if I could help out and have been working on it off and on since then.

I intend to describe the details and approach of the pika audio project over the next few posts. I haven't finished the inverse geographic mapping project I had been posting about previously; it has simply moved to the back burner for the moment.

Friday, July 3, 2015

Inverse Geographic Mapping: Local Process

Previous posts: Introduction, Initial Analysis, Experimental Setup

This is another post exploring the attempt to map from a set of distance features back to an x, y coordinate system. If you haven't already, you may want to start by reading the Introductory post in the series and work your way through from there. This post will assume knowledge of what was presented in the previous posts in the series.

The Process:

I will be looking here at a relatively simple version of the problem (just one set of reference points and a small number of sample points). The process I use here will become a component of the process for more complex versions.

I will start with data that looks like:

Id | Horizontal_Distance_To_Fire_Points | Horizontal_Distance_To_Hydrology | Horizontal_Distance_To_Roadways
0  | 3212.763650 | 174.497793  | 5159.179288
1  | 3135.697623 | 783.345951  | 5433.928096
2  | 5247.581151 | 3504.840499 | 7829.386517
3  | 1532.853280 | 2141.195728 | 4107.445635
4  | 1824.806584 | 2114.739118 | 4413.043552

These values come from samples that were originally positioned like this:

Strategy

My strategy will be to develop a cost function for fitting coordinates to the samples and use gradient descent to try to get the cost function to zero (or close to it), with the hope that doing so will result in a good set of coordinates.

Gradient Descent

If you are unfamiliar with gradient descent, it can be thought of as though the function being minimized describes a landscape. If you were placed randomly on an unknown landscape and were blindfolded (and didn't have to worry about falling down cliffs), how might you go about trying to find the lowest point? One option would be the following procedure:

  1. Determine what direction would result in the steepest downhill step
  2. If every direction is uphill then stop
  3. Otherwise take a step in the steepest downhill direction
  4. Go back to step 1

When the procedure ends, you will be at a local low point. Under some conditions you might be able to guarantee that it is the lowest point on the landscape, but often that will not be the case. If it is low enough and you don't care whether it is the absolute lowest, you can stop and be happy; otherwise you might start at different places, repeat the procedure, and pick the smallest of the local minima you find. That's the general idea of the gradient descent algorithm - it uses calculus to find the steepest direction to move and algebra to take its steps.
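As a rough illustration, here is a minimal sketch of that procedure in Python. The numerical gradient, step size, and stopping rule are placeholder choices for demonstration, not the settings used in the project:

    import numpy as np

    def numerical_gradient(f, x, eps=1e-6):
        """Approximate the gradient of f at x with central differences."""
        grad = np.zeros_like(x)
        for i in range(len(x)):
            step = np.zeros_like(x)
            step[i] = eps
            grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
        return grad

    def gradient_descent(f, x0, learning_rate=0.01, tol=1e-8, max_iters=10000):
        """Repeatedly step in the steepest downhill direction until nearly flat."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iters):
            grad = numerical_gradient(f, x)
            if np.linalg.norm(grad) < tol:   # gradient essentially zero: no downhill direction left
                break
            x = x - learning_rate * grad     # take a step in the steepest downhill direction
        return x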

Cost Function

In this context, the cost function should give me an idea of how well my chosen x, y coordinates fit the data I was given. The cost of a perfect fit should be 0 and no cost should be negative. It is also helpful if the cost function is differentiable (so the gradient can be found). It is possible to come up with multiple cost functions that will work for our purposes; some may be simpler or better performing than others. I chose a cost function that met the above requirements and kept the calculus and algebra relatively straightforward.

To motivate the cost function I use, let's start with an example of an individual point and reference point:
Suppose my first sample point has horizontal distance to fire of 1500 and my hypothesis is that the sample location is at (100, 200) and the nearest fire point is at (800, 1200). Using those points, I can find the hypothesized distance to fire by:
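    distance = √((800 − 100)² + (1200 − 200)²) = √(700² + 1000²) = √1,490,000 ≈ 1220.6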

The amount the hypothesis is off by is 1500 - 1220.6 = 279.4. This is the basic idea I want to capture in the cost function, but I have to be a little careful to ensure I don't get a negative value (e.g. if the hypothesized distance were 1700, then 1500 - 1700 = -200). One way I could deal with this is by taking the absolute value of the difference, but that is a little awkward when it comes to the calculus. The option I chose is to square the difference: this satisfies the requirements and keeps the calculus and algebra reasonable, though it makes the cost function value less intuitive.

The overall cost function takes the idea above and applies it to all of the hypothesized sample/reference point locations. Written out, the part of the cost function related to the fire point would look like:
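Using (x_i, y_i) for the hypothesized location of sample i, (x_f, y_f) for the hypothesized fire point, and d_i for the given distance to fire for sample i, one way to write that piece is:

    cost_fire = Σ_i ( √((x_i − x_f)² + (y_i − y_f)²) − d_i )²

That is, the squared difference from the example above, summed over all of the sample points.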

To get the total cost I would need to also add in the parts corresponding to the water and road points. In practice I also multiply by a scaling factor related to the number of points in the set. This helps make costs comparable across sets of different sizes.

Implementation

Initially I implemented the gradient descent algorithm in Python. It worked okay, but there are a couple of finer points that can be tricky to tune well. Eventually I switched over to using the fmin_bfgs function from the scipy.optimize library, which works similarly to gradient descent. This left me with less control for fine-tuning, but the time saved by not having to fine-tune as much allowed me to focus on other parts of the problem.
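To give a sense of what that looks like, here is a minimal sketch of the fmin_bfgs approach for the simple case of one reference point per type. The data layout, variable names, and random starting guess are illustrative assumptions rather than the project's actual code:

    import numpy as np
    from scipy.optimize import fmin_bfgs

    def cost(params, distances):
        """Sum of squared differences between hypothesized and given distances,
        scaled by the number of sample points.

        distances: array of shape (n, 3) holding the fire, water, and road columns.
        params: flat array holding the 3 reference (x, y) pairs followed by the
        n sample (x, y) pairs.
        """
        n = distances.shape[0]
        refs = params[:6].reshape(3, 2)        # hypothesized fire, water, road locations
        samples = params[6:].reshape(n, 2)     # hypothesized sample locations
        total = 0.0
        for j in range(3):                     # one term per reference point type
            hyp = np.sqrt(((samples - refs[j]) ** 2).sum(axis=1))
            total += ((hyp - distances[:, j]) ** 2).sum()
        return total / n                       # scale so sets of different sizes are comparable

    def fit_coordinates(distances, seed=0):
        """Find hypothesized reference and sample coordinates for the given distances."""
        n = distances.shape[0]
        rng = np.random.default_rng(seed)
        x0 = rng.normal(scale=1000.0, size=6 + 2 * n)   # random starting guess
        best = fmin_bfgs(cost, x0, args=(distances,), disp=False)
        return best[:6].reshape(3, 2), best[6:].reshape(n, 2)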

Results

After running the gradient descent algorithm, I can compare the plots of the hypothesized points to the true points. Since the hypothesized points can differ from the true points by a translation, rotation, and/or reflection (but not a stretch!) and still be a correct fit, I may need to apply such a transformation when comparing the plots. To start with, for simplicity I translate both sets of points so that the fire reference point is at (0, 0). In the plots below the original points are circles and the hypothesized points are Xs.

It looks like the points match up pretty well but need a rotation to really fit - no reflection appears to be necessary this time. After rotating an appropriate amount I get:

Looks like a good fit!
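For reference, the rotation (and, when needed, reflection) used to line the two point sets up can be done with small helpers like these. The details here, including picking the angle by inspection, are illustrative assumptions rather than the project's actual code:

    import numpy as np

    def rotate(points, angle):
        """Rotate an (n, 2) array of points about the origin by angle (radians)."""
        c, s = np.cos(angle), np.sin(angle)
        rotation = np.array([[c, -s], [s, c]])
        return points @ rotation.T

    def reflect_x(points):
        """Reflect points across the x-axis (flip the sign of y)."""
        return points * np.array([1.0, -1.0])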

Friday, June 5, 2015

Inverse Geographic Mapping: Experimental Setup

Previous posts: Introduction, Initial Analysis

This is another post exploring the attempt to map from a set of distance features back to an x, y coordinate system. If you haven't already, you may want to start by reading the Introductory post in the series and work your way through from there. This post will assume knowledge of what was presented in the two previous posts in the series.

Experimental Setup

Because the data I have from Kaggle does not include any x, y coordinate system, it may be difficult to discern whether my approach is effective. In practice there are things that can be done with the Kaggle data that might provide some indication of correctness, but it will be simpler to create my own data sets while doing the initial development and testing of my approach. Below I will describe a couple of steps in setting up my practice data.

The basic setup:

The first step is to randomly generate a set of reference and sample points. Sample points are generated with x and y position values from a normal distribution with mean 0 and standard deviation 1000, and reference points with position values from a normal distribution with mean 0 and standard deviation 2000. At different times I generate different numbers of practice points, but for demonstration, here is a graph of 10 test points with one of each of the three reference point types.

The distance between each sample point and each reference point is calculated to get the "horizontal distance to..." fields:

Id | Horizontal_Distance_To_Fire_Points | Horizontal_Distance_To_Hydrology | Horizontal_Distance_To_Roadways
0  | 3095.843540 | 2182.240619 | 3553.646848
1  | 4156.900497 | 1200.977046 | 4765.299586
2  | 643.074660  | 4513.132292 | 4881.010825
3  | 1820.176689 | 4114.332954 | 5843.656280
4  | 2504.993651 | 3740.113119 | 6036.146318
(The distance fields for the first 5 sample points in the generated set.)
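Here is a minimal sketch of how a practice set like this can be generated and turned into the distance fields. The function name and the use of pandas are illustrative choices on my part; only the distributions and column names come from the description above:

    import numpy as np
    import pandas as pd

    def make_practice_set(n_samples=10, seed=0):
        """Generate random sample/reference points and the distance fields."""
        rng = np.random.default_rng(seed)
        samples = rng.normal(0, 1000, size=(n_samples, 2))   # sample (x, y) positions
        refs = rng.normal(0, 2000, size=(3, 2))              # fire, water, road (x, y)

        # Distance from every sample point to every reference point.
        dists = np.sqrt(((samples[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2))
        data = pd.DataFrame(dists, columns=[
            "Horizontal_Distance_To_Fire_Points",
            "Horizontal_Distance_To_Hydrology",
            "Horizontal_Distance_To_Roadways",
        ])
        data.index.name = "Id"
        return data, samples, refs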

The practice data is now in the same form as the Kaggle data. I can run it through my process, take the x and y coordinates that are output, and compare them to the original positions of the practice data. There will be no way for the process to determine correct orientation or absolute position, but if it works properly it should find points in the same locations as the originals up to a rotation and/or reflection and a translation.

Further Steps:

The setup above is about as simple as I can make it and is a good starting place. As I continue to develop the process I will also need to consider how it handles situations where there is more than one of each reference point type. For example, what happens when there are 3 water points? Does it matter if they are far apart/close together/in a line? What happens near the boundaries when a sample point is close to the same distance from 2 or more of the water points? It is straightforward to extend my practice set generator to include additional reference points, and I will likely include an example of that when I get to exploring the process at that level.

I may need to extend the practice set generation even further as I continue to iterate between getting the process running on the practice data and seeing how it performs on the actual data. The next step for the blog though is to start looking at the basics of how my current approach works with just the simple practice setup.

Monday, June 1, 2015

Inverse Geographic Mapping: Initial Analysis

Initial Analysis:

(Previous post: Inverse Geographic Mapping: Introduction)

While the plot in the introductory post made it seem like the data might allow for a mapping from the distance fields to an x, y coordinate system, I wanted to double check the idea before proceeding. To do so I looked at the dimensions of the inputs and outputs for the proposed mapping. If the dimension of the information being input is smaller than the dimension I would like the inverse process to output, that would suggest the idea is untenable.

Looking again at a couple of sample points:
Id | Elevation | Vertical Distance To Hydrology | Horizontal Distance To Hydrology | Horizontal Distance To Road | Horizontal Distance To Fire Points
2  | 2590 | -6 | 212 | 390 | 6225
5  | 2595 | -1 | 153 | 391 | 6172

Each sample point contributes three pieces of individual data directly related to re-creating x, y coordinates for the points: the three horizontal distance fields. If I have a group of n sample points with the same 3 reference points, then there will be 3n pieces of information. For the output I will need 2n pieces of information for the sample points (their x, y coordinates), plus 6 total for the x, y coordinates of the 3 reference points. So if 3n is at least as big as 2n + 6 it seems plausible that we might have enough information to perform the inversion - this should be the case for n at least 6. In practice it is a bit more complicated than that since these are not linear systems, but for me it was enough justification to give it a try.

While proceeding, I will make the assumption that the reference points can be treated as discrete points. For example, if a set of points has a pond as its closest water source, my assumption is that the distances were all measured to the same location, rather than each being measured to whichever point on the pond edge is closest to the sample location. I am not certain whether this assumption is warranted and, if it is not, whether I will be able to adjust my algorithm to compensate.

My next post should be a description of my experimental setup to test my process in a controlled system.

Tuesday, May 19, 2015

Inverse Geographic Mapping: Introduction

As I mentioned in my previous post, I got started working on the Kaggle forest cover type competition while taking a data science course on Coursera. I got an introduction to many different machine learning techniques while working on my submissions for that competition (read more about some of my process here.) While exploring the forest cover data set I had an idea for a side project that I would like to describe here.

To start I will give a little description of the forest cover data set I used (find the official data summary here.) The data was collected in four regions of a national forest in Colorado; each sample corresponds to a 30 m by 30 m patch. There are 12 different fields given for each sample, plus the forest type classification (given only in the training data set; this field gives the type of forest there, for example aspen or Douglas-fir). None of these fields contains information to directly locate the samples in the real world or relative to each other (e.g. no latitude or longitude). The goal for the competition was to take those 12 fields and use them to predict what the forest type would be. For my side project, though, only 3 fields are of primary interest, with 2 more fields being of supporting interest.

  • The three primary fields are:
    • Horizontal distance to hydrology (distance to nearest surface water features)
    • Horizontal distance to fire points (distance to nearest wildfire ignition points)
    • Horizontal distance to roadways
    I refer to the places these distances are in reference to (i.e. the surface water features, fire ignition points and roadways) as reference points.
  • The two fields of supporting interest are:
    • Elevation
    • Vertical distance to hydrology
    Making use of these helps me group neighborhoods of sample points by checking whether the body of water they are closest to is at the same (or nearly the same) elevation - see the sketch just after this list. Elevation is also a helpful field for plotting the data to check whether results are reasonable.
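As a rough sketch of that grouping idea (the sign convention of the vertical distance field is my assumption, not something stated in the data summary): if the vertical distance measures how far a sample sits above its nearest water, then

    water elevation ≈ Elevation − Vertical Distance To Hydrology

and sample points whose estimated water elevations agree are candidates for sharing the same body of water.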

While exploring the data I noticed that when the primary distance-based fields are plotted against elevation, the graphs look like landscape features (hills and valleys - see figure below.) That led me to the question guiding this side project: can I map backwards from the data to create an x, y coordinate system accurately locating the data points relative to where they are in actual space?

To look at this on a smaller scale, if I look at a couple of samples together I might have data something like:

Id | Elevation | Vertical Distance To Hydrology | Horizontal Distance To Hydrology | Horizontal Distance To Road | Horizontal Distance To Fire Points
2  | 2590 | -6 | 212 | 390 | 6225
5  | 2595 | -1 | 153 | 391 | 6172

This tells me nothing about how close the sample locations were to each other. Because the distances for each sample are relatively similar, we might guess they are physically close, but we can't be sure from just this data. If I can successfully make an inverse mapping, it would tell me approximately where those points (and others) were located relative to each other. This could be useful if I was not sure what type of forest cover a particular sample had but (for example) could determine that it was surrounded by samples that had lodgepole pine: Chances are that sample might also have lodgepole pine.

That's it for this post. Stay tuned for more later! (update: The next post: Initial Analysis)

Thursday, September 18, 2014

Getting Started

In which the obligatory introductory post is found:

My more recent interest in working with data and machine learning was rekindled a couple of years ago when a friend and I started thinking about creating devices that would record audio in the wild (or backyards or ...) and a program that would analyze the audio to identify what bird species were present. Further goals for this project might include mapping the location of each call and identifying the type of call (i.e. is it a territorial song, companion call, juvenile begging, alarm, etc.) The basic identification tool could of course be useful for cataloging the birds in an area (useful information for research and home use). The extended goals could allow for things like mapping individual bird territories, identifying nesting success, and potentially more.

Getting started, I thought that working on locating sounds from a stereo signal might be an accessible beginning: it would not be too difficult to generate a set of training data, and, at least in an ideal world, the mathematics of locating a sound source is approachable. Of course, outside in the real world things can get messy, and my naive approaches had problems.

It was time to start moving beyond naive approaches. I started reading more about machine learning, including some papers on recent progress on the bird call identification problem (it appears that the basic idea of the project has been implemented by at least a couple of companies at this point), and I had the opportunity to take the University of Washington's introductory data science course on Coursera this summer. That course was very helpful in getting me using some of the tools and techniques available (in particular, I have mainly been playing with various classifiers in Python's scikit-learn), and, as one of the assignments, I began working on a Kaggle competition.

Initially, much of my posting here will likely revolve around my work on the Kaggle forest cover type competition. I also have an ongoing write-up of some of my results on my website. My intention is to post about some of the things I have tried and learned, both as a way of keeping track for myself and in the hope of being helpful to others who might run into similar problems along the way.