Kyle Stratis, Software Developer in Tallahassee, FL

A MongoDB Optimization

Recently at Homes.com, one of my coworkers was charged with speeding up a batch process that we were required to run at a scheduled interval. No big deal, but he was stuck. The process required a number of steps at each typical 'stage': identifying the data we needed to pull, pulling it, transforming it, and writing the transformed data back to Mongo. When he was talking through the process, I realized it would be a perfect use case for Mongo's aggregation framework. I offered to help, drawing on the experience with the aggregation framework I had gained while working on NASDANQ, and immediately got to work designing an aggregation pipeline to handle this process.

Original Solution

This batch process exists to update mean and median home value data for a given area. A rough overview is laid out below:

  • Query a large collection (> 1 TB) for the two fields that identify the area in which a property resides.
  • Join those two fields into a single ID.
  • Use that ID to pull location data from another collection (> 25 GB) on a separate database.
  • For each property in the area, pull the price from yet another collection.
  • Load the prices into an array in our script, then iterate over it to find the mean and median.

By our estimates, if we could run this process to completion with no interruption, it would take ~104 days to finish. This is unacceptable for obvious reasons.

Attempt the First

We ran into an architectural issue early on: MongoDB's aggregation pipeline doesn't support working across multiple databases, and the collections we needed were split between a few of them. Luckily, we were able to move the collections onto a single database so we could start testing the pipeline. We started with a 9-stage monster; the multiple collections we had to match on required multiple $match stages plus a stage to perform what was essentially a join. Because the data is held in memory at each stage, and each stage is limited to 100 MB of memory, we first switched on the allowDiskUse option so the pipeline could at least run. And run it did. Our DBA team notified us that we were spiking memory usage to unacceptable levels while the pipeline was running.
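
For anyone who hasn't flipped that switch before, here is a minimal sketch of passing allowDiskUse through the MongoDB Perl driver; the connection string, namespace, and pipeline stages are placeholders rather than our real ones:

use strict;
use warnings;
use MongoDB;

# Hypothetical connection and namespace; the relevant piece is the options
# hashref passed as the second argument to aggregate().
my $client     = MongoDB->connect('mongodb://localhost');
my $collection = $client->ns('real_estate.property_values');

my @pipeline = (
    { '$match' => { state => 'FL' } },    # stand-in stages
    { '$sort'  => { value => 1 } },
);

my $cursor = $collection->aggregate(
    \@pipeline,
    { allowDiskUse => 1 },    # let memory-hungry stages spill to disk
);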

We reduced the pipeline to 7 stages: we take location data as input to the pipeline, match it against the 25 GB collection, use $lookup to join in the 1 TB collection holding the ID fields on one of those fields, project the fields we actually want, unwind, redact, sort by value, group by location (2 fields) while taking an average, and sort again. The pipeline looked like this:

$match->$lookup->$project->$unwind->$redact
                                      |
                                      V
       VALUES<-$sort<-$avg<-$group<-$sort
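
To make the diagram concrete, here is roughly the shape of that pipeline as a Perl data structure; the field and collection names (location_id, property_id, value, properties) and the $redact condition are stand-ins, not our real schema:

my @location_ids = ( 'FL-32301', 'FL-32304' );    # example inputs

my $pipeline = [
    # limit to the locations fed into this run
    { '$match' => { location_id => { '$in' => \@location_ids } } },

    # join in the large collection holding the ID fields
    { '$lookup' => {
        from         => 'properties',
        localField   => 'property_id',
        foreignField => 'property_id',
        as           => 'property_docs',
    } },

    # keep only the fields we care about
    { '$project' => { location_id => 1, 'property_docs.value' => 1 } },

    # one document per joined property
    { '$unwind' => '$property property_docs' =~ s/ //gr },

    # drop joined documents we don't want (stand-in condition)
    { '$redact' => {
        '$cond' => [ { '$gt' => [ '$property_docs.value', 0 ] }, '$$KEEP', '$$PRUNE' ],
    } },

    { '$sort' => { 'property_docs.value' => 1 } },

    # group by location and take the average value
    { '$group' => {
        _id       => '$location_id',
        avg_value => { '$avg' => '$property_docs.value' },
    } },

    { '$sort' => { avg_value => 1 } },
];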
                                                       

This failed at $lookup. Why? For many locations, we were joining in nearly 1 million documents per month of data, and we wanted multiple years of that data. Across all locations, this not only failed to solve our performance issues, it failed to run at all.

Attempt the Second

Our first thought was to add the unique identifier (a combination of two fields from different collections) to the 1 TB collection, but that was not workable given the size of the collection. Instead, we project a concatenated version of the two fields we use as a UID and use $lookup to join in the 25 GB collection on that field, because it is much faster to make this change against the smaller collection. Simultaneously, we tested the performance of sorting in our ETL code versus within the pipeline itself; since the in-code sorts took trivial time, we could remove the sort stages from the pipeline. At this point, processing all IDs for a single month was projected to take 10 days, and we needed multiple years of data, which works out to worse performance than the original solution.
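
A sketch of what that projection plus join might look like, again with made-up names (region_code and market_code standing in for the two UID fields, locations for the 25 GB collection):

my $pipeline = [
    # build the combined UID on the fly instead of storing it in the 1 TB collection
    { '$project' => {
        value => 1,
        uid   => { '$concat' => [ '$region_code', '-', '$market_code' ] },
    } },

    # join the smaller collection on that concatenated field
    { '$lookup' => {
        from         => 'locations',
        localField   => 'uid',
        foreignField => 'uid',
        as           => 'location',
    } },

    # ...remaining stages ($unwind, $group with $avg, etc.) as before...
];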

However, we found something interesting when we ran a single identifier at a time: one identifier for one month took about a minute, but one identifier over 3 years took only 10 minutes. I mention this because it demonstrates how nicely aggregation pipeline performance scales as your dataset grows. And remember how I said all IDs for one month was projected to take 10 days? In light of these results showing that pipeline runtime does not scale linearly with data size, we ran the pipeline and found that a single month for all identifiers took about 14 hours. That is a big improvement, but not enough, so we looked for more optimizations.

Attempt the Third

This was a smaller change, but on its own it made a big difference in processing time. We re-architected our data so that we had a temporary collection holding a single month of data; we generally process one month at a time, regardless of the overall span we want. This cut the time in half from the previous attempt: a single month for all identifiers now took only 7 hours, and we were no longer querying the full 1 TB collection when we only wanted a small piece of it anyway. Creating the temporary collections is trivial and is done by our counterparts on the DBA team as part of their data-loading procedures.

Attempt the Fourth

After seeing these results, management was fully convinced of the need to redesign how we stored this data. Let this be a lesson: hard data is very convincing. Now each month's data lives in its own collection, and when we load data, the location data is added to these monthly collections as well, avoiding costly joins via $lookup. Surprisingly, adding this information did not affect our overall data-preloading times in testing. The location data was also indexed for quicker querying. All of this allowed us to go from a 7-stage aggregation pipeline to 3: we start with the split collections, project the fields we are interested in (location, value, etc.), group on location while taking an average and pushing all individual values into an array (for sorting and finding the median in code), and output to a temp collection. If we want to process a period longer than a month, we rinse and repeat.

For all identifiers in each month, our processing time went from 7 hours to 8 minutes. The queries made against the generated collections to fetch the computed averages plus the arrays of individual values (used to calculate the median in code) added a minute per output collection when done serially. Being primarily ETL pipeline builders, we do nothing serially. We tested with 7 workers, and the added processing time dropped from a minute to 30 seconds. In production, we have a worker pool that numbers in the hundreds, so this additional time was acceptable. If we assume a conservative 7.5 minutes to process a month of data, then projecting out to 3 years we should see a runtime of around 4.5 hours. We decided we were happy with that, especially considering the original process was projected to take 104 days.
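
As a rough illustration of that fan-out (not our actual ETL tooling), here is a sketch using Parallel::ForkManager; the collection names, namespace, and where the median ends up are all hypothetical:

use strict;
use warnings;
use MongoDB;
use Parallel::ForkManager;

# Hypothetical: one output collection per processed month, each holding the
# grouped averages plus the per-location arrays of individual values.
my @output_collections = qw(values_2017_01 values_2017_02 values_2017_03);

my $pm = Parallel::ForkManager->new(7);    # 7 workers, as in our test

for my $coll_name (@output_collections) {
    $pm->start and next;                   # fork; parent moves on to the next month

    # In the child: connect, read each grouped document, and compute the
    # median from its sorted values array.
    my $coll = MongoDB->connect('mongodb://localhost')->ns("stats.$coll_name");
    my $docs = $coll->find({});
    while ( my $doc = $docs->next ) {
        my @sorted = sort { $a <=> $b } @{ $doc->{values} };
        my $median = @sorted % 2
            ? $sorted[ $#sorted / 2 ]
            : ( $sorted[ @sorted / 2 - 1 ] + $sorted[ @sorted / 2 ] ) / 2;
        # ...store $median wherever the real process keeps it...
    }

    $pm->finish;
}
$pm->wait_all_children;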

$project->$avg->$out
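
The $avg in that diagram lives inside the $group stage. Here is a sketch of the three stages, with the same hypothetical field names as before and a made-up output collection name:

my $pipeline = [
    # the monthly collections already carry location data, so no $lookup needed
    { '$project' => { location_id => 1, value => 1 } },

    # one document per location: the average plus every individual value,
    # so the median can be computed in code afterwards
    { '$group' => {
        _id       => '$location_id',
        avg_value => { '$avg'  => '$value' },
        values    => { '$push' => '$value' },
    } },

    # write the results to a temporary collection for the workers to query
    { '$out' => 'values_2017_01' },
];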

Conclusion

I learned a lot of lessons from this little project, and wanted to distill them here as a sort of tl;dr to ensure that they can be passed on to the reader.

  • Management is convinced by data. Run your proposed changes and show them graphs or numbers demonstrating the performance improvements. You're all working toward the same goal.
  • ABB. Always Be Building. In my case, my work on NASDANQ gave me the knowledge I needed to hear a teammate’s struggles, identify a use case, and implement a plan to alleviate those issues.
  • Standups suck, but they’re useful. Hearing my teammate’s struggles with this code during a standup meeting allowed me to assist him and come up with a workable solution.
  • More generally, communication is important. Not only was understanding my teammate’s needs important, but this project required constant contact between our DBA team, me and my teammate (who was running many of our tests), and management of our team, the DBA team, and the larger team we report to.
  • And, finally, MongoDB’s aggregation pipelines are incredibly powerful. They’re worth learning and getting familiar with if you work with sizable datasets at all.

On Imposter Syndrome

Note: This was first written in response to an excellent discussion on dev.to. I strongly recommend reading through the whole discussion whether or not you feel like you're suffering from imposter syndrome; there is some great advice and experience shared there.

I started in this industry as a hobbyist with no formal CS background. I studied neuroscience and was well on my way to an academic career when I decided (with the assistance of a friend who from then on became my career sherpa) to turn my hobby into a full-time job. I was speedily doing code challenges, TA'ing classes, running my thesis experiments, and writing my thesis, all while applying to every job I could find.

After over 100 submissions met with either rejections or no response, I finally had an interview! Some time after, I was plopped into my first software job in a new city with no idea what I was doing. I struggled with imposter syndrome for almost a year. How did I conquer it? Sometimes it's hard to say, and sometimes I wonder if I truly ever did. A few different things, in my mind, helped me with this:

  • Realizing this is a very meritocratic industry, despite how HR departments treat things. This becomes most clear when other developers start coming to you with questions, and you can comfortably answer them. This comes with the next point:

  • PRACTICE. I have always felt (and still do) that I have something to prove because I don't have a degree. I do a lot of coding and reading about coding in my free time. It helps that it's a passion of mine, but I make sure to maintain balance with other things in my life, such as exercise, pleasure reading, etc. This is what gives you the knowledge to start answering the questions your teammates have, and as you do it more and more often (it will come as you practice, because the questions will naturally start to seem easy to you), your own confidence will build along with the confidence others have in you. It's a nice positive feedback loop.

  • Work on actual projects for your practice - yes you should take online courses, yes you should do coding practice, but you should contribute to open source, build projects that are interesting to you, etc. This skill transfers over - and I actually have a story about an instance in which this happened to me just over the last week, but it’s a long one that should probably be its own post.

  • Jump into the deep end. This is always a leap of faith - once you get past the drowning feeling, you’ll find yourself swimming. Challenge yourself - take the initiative to rewrite your company’s testing procedures, learn big data processing by building a big data app, deploy your first single page app, do some embedded programming - then break it all or have others break it, and fix it. And break it. And fix it. It will then become routine and old hat - then comes the knowledge, the confidence, and it starts replacing that feeling of being an imposter.

I think the key takeaway here is building confidence, and constantly pushing your limits. Failure doesn’t exist - you will always learn something and become better, and as you build that understanding and the confidence I talked about above, you’ll welcome ‘failure’ and success: failure is learning, success is demonstrating what you learned.

As for my personal example - I’ve been in the industry for 3 years now, am building out my own startup (remember when I said to jump into the deep end?), and am loving every minute of it.

Hash Traversal in Perl

Today, I wanted to talk about hash traversal in Perl, with the aim of flattening a complex multi-level hash into a simpler single-level hash. Hash traversal is a perfect way to get some recursion practice in, so that is the approach we will use here. Let's outline how this might look:

  1. Inputs
  2. Iterate over original hash
  3. Recursion!
  4. Flattening

We can choose to flatten in any way we want. Since this was originally conceived to work with MongoDB queries (specifically, to assist in accessing values in complex objects), we will join all the keys, down to the final key, with a period (.).

Inputs

Since we will be writing a recursive function, we know we will at least have to take as input the next level of hash available (we will be doing a depth-first traversal). Because we need to keep track of every level we descend into for flattening, we should also carry the keys along through each call. Finally, a hash reference that we will call output needs to be passed into each recursive call; it will be loaded with our flattened key-value pairs as we unwind the stack.

This will be a special, unnamed positional parameter; on the initial call your program makes to flatten_hash, pass it a reference to an empty hash. So the parameters we pass to the function are the (initially empty) output hash reference, your input hash, and an initially empty arrayref that will keep track of the keys that point to subhashes.

# Named inputs: original_hash, keys_list
sub flatten_hash {
    my %output = %{shift @_};
    my %args = @_;
}

Iteration

Naturally, we need to iterate over the keys of the hash we are working on. Instead of the more common foreach my $key (keys %hash) idiom, we will use the each built-in. There are arguments against using each, including the fact that its internals can get confused if the hash is modified while being iterated over. Luckily, that doesn't apply to our use case, so each is fine here. Use caution with it in general, though.

each returns to us the key and value of each element of a hash. We will unpack those results as we iterate, and also set up the recursive call and the list of previous keys.

while (my ($key, $value) = each(%{$args{original_hash}})) {
    my @data_address = defined($args{keys_list}) ? @{$args{keys_list}} : ();
    push(@data_address, $key);
}

Notice the ternary operator - if the keys_list parameter is defined, we will dereference it and set it to @data_address, otherwise we will initialize @data_address as an empty array. We will then push the current key we’re on to the end of the @data_address array.

Recursion

We looked a little bit at the $key part of the each call’s results, but let’s take a look at the $value return. If we’re examining a multi-level hash, there will be occasions where $value’s type is a hashref, instead of a scalar. That indicates to us that there is more to explore – we have arrived at our recursive case.

if (ref($value) eq 'HASH') {
    %output = flatten_hash(\%output, original_hash => $value, keys_list => \@data_address);
}

Flattening

If $value is not a reference to a hash, then we’re in our base case (you can expand the recursive case check to include arrays, but I’ll leave that as an exercise to the reader) - $value is an actual scalar value.

else {
    my $addr = join('.', @data_address);
    $output{$addr} = $value;
}

Here, we join all the keys leading to this particular piece of data with a period; you can use whatever delimiter your application needs. We then add the combined key to the %output hash and assign it the value stored in $value. All that is left is to return %output, which you can see in the final code below:

sub flatten_hash {
    my %output = %{shift @_};
    my %args = @_;
    while (my ($key, $value) = each(%{$args{original_hash}})) {
        my @data_address = defined($args{keys_list}) ? @{$args{keys_list}} : ();
        push(@data_address, $key);

        if (ref($value) eq 'HASH') {
            %output = flatten_hash(\%output, original_hash => $value, keys_list => \@data_address);
        }
        else {
            my $addr = join('.', @data_address);
            $output{$addr} = $value;
        }
    }
    return %output;
}

Here is a quick test script that shows how to use this method and what its output looks like.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $test_hash = {
    a => 'a',
    b => {
        d => 'd',
        e => 'e',
        f => {
            g => 'g',
            h => 'h',
        },
    },
    c => 'c',
};

my %empty;
my %res = flatten_hash(\%empty, original_hash => $test_hash, keys_list => []);
print "Test hash:\n";
print Dumper($test_hash);
print "Result\n";
print Dumper(\%res);
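
For reference, the flattened result dumped at the end should look something like this (key order will vary between runs, since Perl hashes are unordered):

$VAR1 = {
          'a' => 'a',
          'c' => 'c',
          'b.d' => 'd',
          'b.e' => 'e',
          'b.f.g' => 'g',
          'b.f.h' => 'h'
        };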

This code grew out of modifications I made to an excellent Stack Overflow answer, which gives a basic template for traversing a hash with a callback in Perl.

Hash Slices in Perl

Recently, I had a bug at work that could be solved by simply stripping certain characters from the keys of a hash. This seems easy at first: iterate through the keys with the keys operator and run a regex substitution on each one, something like the following (with the goal of stripping periods):

my %my_hash = ('al.pha' => 'a', 'beta' => 'b');

s/\.*//g foreach keys %my_hash;

For those deeply familiar with Perl, the error probably seems obvious: keys doesn't return references to the keys in the hash for you to manipulate in place; it returns copies of the keys. Those copies are what we iterate through with foreach and strip the period from.

If we use Data::Dumper on %my_hash, we will be greeted with the hash in its original state:

$VAR1 = {
          'al.pha' => 'a',
          'beta' => 'b'
        };

But with a little code reorganization, we can instead print the default variable within the foreach after the substitution:

foreach (keys %my_hash) {
    s/\.*//g;
    print "Default var: $_\n";
}
print Dumper(keys %my_hash);

This is our output:

Default var: alpha
Default var: beta
$VAR1 = 'al.pha';
$VAR2 = 'beta';
So the keys stored in $_ were correctly stripped of periods, but the actual keys in the hash were unaffected. Let's continue down this path (using the substitution operator s/// with a foreach). If you have used Perl for any appreciable amount of time, you know TIMTOWTDI (pronounced "Tim Toady": there is more than one way to do it) is one of the language's guiding philosophies, and there are a number of ways to tackle a problem like this. We will keep using keys, though. Since keys returns copies of a hash's keys, we can load them into a new array:

my @new_keys = keys %my_hash;

Easy enough. Now we can iterate over this and use the substitution operator to substitute in place:

s/\.*//g foreach @new_keys;

Here comes the Perl magic. Can you see everything that is accomplished in this one line?

@my_hash{@new_keys} = delete @my_hash{keys %my_hash};

Here we finally make use of hash slices, twice in fact. Both times we reference %my_hash here, it's in list context, as are the keys we are addressing; this technique lets us operate on or assign to a list of keys all at once. So @my_hash{keys %my_hash}, with the hash in its current state, gives us a list of all the values in %my_hash. Why are we calling delete on those, though? delete has a useful property beyond removing entries: it also returns whatever it deletes.

So not only do we delete the old keys from the hash, we also load their values into the appropriate new keys, all in one line! We can also guarantee the values land on the correct keys: keys returns entries in the same (arbitrary) order for the same hash, as long as the hash hasn't been modified between calls to keys.

To put it all together, we have three lines (any wizards want to make it even shorter?) that, in effect, will run a regex substitution on all keys of a hash, seemingly in place.

my @new_keys = keys %my_hash;
s/\.*//g foreach @new_keys;
@my_hash{@new_keys} = delete @my_hash{keys %my_hash};

And the results:

$VAR1 = {
          'alpha' => 'a',
          'beta' => 'b'
        };

Success!

A Short Update

Well, it's been quite a while, hasn't it? I know this site doesn't get a ton of traffic, but it probably still makes sense to explain the long gap in posts. Around this time last year, I got engaged. After the initial excitement leveled out, panic decided to set in. How will I pay for the wedding? How will I continue to pay for my student loans? I eventually decided to begin looking for other jobs, putting this blog on the backburner.

I looked all over the state of Florida, not wanting to be too far from where my new fiancée and I were to be married. Like my initial job search, this one was met with a lot of auto-rejections for not having a CS degree. After many, many applications, lots of coding challenges, and working through the first few levels of the Google Secret Interview challenge (which was an absolute blast and may become fodder for another post in the future), I finally got an interview locally, this time with homes.com.

The first two interviews were over the phone, first with the recruiter, then with the hiring manager. The second interview described the job in more depth: the team works with large amounts of data, a number of databases, and, of all languages, Perl. The manager liked my excitement for the job (backend data engineering? Too interesting for me to decline!), as well as the fact that I didn't have a CS degree. Score!

Perl

As a digression, let's talk a little about Perl and my experience with it. Having had Internet access since the late '90s, I was familiar with Perl from the CGI scripts that were common on websites at the time. The job as described to me was quite different from that, so I wasn't too worried that I'd be working on some hugely outdated presentation codebase that hadn't been meaningfully updated since 2001, but some grad school experience with Perl still made me nervous. My first real experience with the language was writing a data-parsing script for a project in grad school, which also happened to be my first experience with regular expressions. You can guess how well that went.

With that memory in mind, I was pretty nervous going in. So, leading up to my technical interview, I bought Learning Perl (no ref) and Intermediate Perl (no ref) and began reading. I made it through Learning Perl before my in-person interview, but decided not to rely on my still-wobbly Perl legs for it.

The Interview

This was actually my first real technical interview. I remember being very nervous, and it took a lot of time (and questions) for me to get an accurate view of the actual problem. I was shaking the entire time, and by the time I left (this was mid-September in Florida) I was sweating all over and sure I had failed. I did eventually get to the solution (in Python-ish pseudocode), but not the way I would have if I had done it at home rather than surrounded by three strangers. I felt like a failure and didn't want to tell my fiancée. I did, and she was nothing but supportive.

The Call

The next week I got the call: they were extending me an offer, and I took it. I submitted my 2 weeks' notice to my employer at the time, which was nerve-wracking in itself because I had built quite a friendship with many of my coworkers. They were very supportive and understanding, and we had a nice lunch at Tijuana Flats to send me off.

The Job

I started with homes.com at the end of September and have really been enjoying it. Perl is no longer an untameable monster, and it really can be quite elegant (and fast). It's also very odd and quirky, but I enjoy the bit of challenge that provides. The people I work with are great, I've again built new friendships that have been very rewarding, and I've learned a lot of new skills. I also participated in Digital Ocean Hacktoberfest and got a great tee shirt out of some fun bug-hunting for a few great open source projects.

The Future

So what am I working on now, and what does the future have in store? Right now I'm starting back up with a traditional-ish CS program to hone some skills I've been lacking; I've been especially interested in machine intelligence, so that is one of the goals. Eventually, I'd like to work on another M.S., this time in computer science. I was also recruited to work on an Android app called WeatherGirl with a few other local developers who are all learning Android development, something I've been trying to get started on for a while now.

On top of all that, I am in the very early stages of writing the life story of my paternal grandparents. My grandmother came to America from Greece as a refugee, then later went back to Greece to find her man at a time when her sisters were being put into arranged marriages. On my last visit with them, I learned there was a lot I still didn't know, despite reading my great uncle's book Eleni (non ref) and growing up with them. I hope to keep their incredible story alive for my children and, eventually, their children.

And come October, I'll be married, and who knows what is in store for us after that? It's a very exciting time! With everything going on, it might sound like I'll be unable to blog much, but on the contrary, I plan to bring my posting volume back up. I have a few posts already in varying stages of completion, so something should find its way up soon. Until then!