Tips and tricks for navigating large codebases

Introduction

One thing I believe is lacking in modern university CS education is how to navigate extremely large codebases and how to make sense of them quickly.

One possible reason this is not taught is that the majority of CS students will end up working in small/medium sized companies where they will be focused on one program, one code base, which previous employees will know well enough to explain all parts of at a high level to the newcomer.

Another potential reason (likely the real one, but the one university professors don’t want to admit) is that the teachers themselves have never had to deal with any code bases larger than 100K lines, much less had to deal with code written by others (besides their own students), but that’s a discussion for another article ☺

Back to my point, however: when one joins the likes of a Microsoft, Amazon, Google, Facebook or any one of the handful of what I truly consider to be “large tech companies”, it is easy to get lost very quickly and to fail to be productive.

If the interns I’ve mentored in the past are any evidence, no one seems to teach them how to deal with the firehose of code that suddenly gets pointed at them. So in the hopes this will help future generations of coders, here’s my take on it.


Lost in Code

There are many situations in which you may find yourself having to navigate a large foreign codebase. Some examples:

  1. The most senior person on your team may have just been around for less than a year, especially in a company with a high churn rate, and they themselves may not know large parts of the systems the team is tasked with maintaining because they just haven’t had to deal with them yet. Now that a new feature needs to be added to a legacy system, guess who the boss picked to do it?
  2. You are dealing with another team’s code and they don’t have the time to (or don’t want to) explain it to you, yet you want/need to understand it.
  3. You are working on something that is cross-cutting like performance or efficiency which requires you to constantly be working in other people’s code and bouncing around on a weekly (or daily) basis.

This generally tends to leave you at a disadvantage when compared to working on the same code for a long time.

So how does one go about simply finding relevant code and grokking (understanding) this same code without wasting a lot of time?

Here’s how.

1. Don’t Panic

Every good nature survivalist will tell you the first rule of survival in the wild is “Don’t Panic”.

So what if it’s 2 am, the servers are burning, you’re bleeding money on the floor, all the senior engineers are out on the annual company trip to Lake Tahoe and you got stuck doing support instead?

Now is not the time to curl up in a ball and wish you had dropped out of college to raise goats with your hippie sister in Wyoming. Goats be damned! There’s code to be read.

2. Get the relevant code in front of your face.

The first thing you need to be good at is quickly finding the right piece of code and getting it in front of your eyeballs. This may sound dumb, but I’ve wasted many minutes (and sometimes more) reading up on code that was not the piece of code I was interested in.

There are two things to do before investing a lot of time reading part of a large program: first, find the code that is relevant, and second, make sure it is the right piece of code before diving deep.

Identify odd phrases and grep for them in code

One of the simplest, yet most powerful, ways of figuring out what piece of code is causing some behavior or driving some piece of UI to show up is to grep for odd phrases you can identify.

If you don’t know how to grep inside a repo, I encourage you to look it up. If you are using git this great article should give you some good tips on doing just that. Don’t know how to write regular expressions? You should. That’s another highly valuable skill and there are lots of good tutorials on the web to help you. But I digress…

“All events are logged in the Pacific (PST) timezone.” written on a web page is a line of pure gold in my opinion. How many times do you think a long phrase with a specific capitalization like that is repeated in a 2 million line codebase? Probably a few, but that at least narrows down the search to a few files (and the file names should be your second clue to quickly discard unrelated code).

“CatShouldBeADogException thrown” is another piece of gold. Custom exceptions (especially in Java programs) are likely only thrown from 4 or 5 places. With a little bit of added context on whatever you’re dealing with, like the rest of the stack trace, or the 2 preceding log lines that say you’re entering another method, you can very quickly narrow things down and have the relevant lines of code in front of your face.

Log line messages are another good avenue for finding code. “Starting to process log lines with x lines” is a less desirable phrase to search for, though. The x is dynamic, meaning what you should really grep for is “Starting to process log lines”. Grepping for anything more is pointless, since in the source the rest of the message is likely a format placeholder or a variable tacked onto a quoted string inside a print statement, not the literal text you saw in the log.
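Here is a minimal sketch of that kind of search in Python, for the days you don't have grep handy. The function name find_phrase, the file extensions and the sample phrase are all my own assumptions for illustration; the point is just that you search for the fixed part of the message as a plain substring, not as a regular expression.

```python
import os

def find_phrase(root, phrase, extensions=(".py", ".java", ".js", ".html")):
    """Return (path, line_number, line) for each source line containing phrase."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip files that are not source code (extension list is an assumption).
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    # Plain substring match: no regex, so odd punctuation is safe.
                    if phrase in line:
                        hits.append((path, number, line.strip()))
    return hits
```

Calling find_phrase(".", "Starting to process log lines") lists every file and line where the static part of the message lives. Searching for the full text you saw in the log, dynamic number included, would most likely come back empty.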

In general these can be good starting points. But what if that doesn’t get you to where you want to go?

Blame stalking

So you know who is working on a feature, but you don’t know where the code is that they are working on? Well, that’s easy, just blame stalk them!

What do I mean by blame stalking?

Whenever someone commits a piece of code in any source control repository (git, mercurial, svn, etc…) their name or login is associated with the commit.

If you know who is working on a particular feature, finding the code becomes as easy as just looking for files they’ve touched in the repo!

This is a gold mine if you are curious about what someone else is working on. And if you think they are hot stuff and want to see an example of a true master at work (hey, that’s a great way to learn sometimes!), this is the method for you.

Ways to do this depend on the source control system, but with git, for example, one might run git log --name-only --author=imperio59 to see all of my commits (including the files each one modified).
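If the person has been busy, that output gets long. A small sketch like the one below can boil it down to the files they touch most often. It assumes you saved the output of git log --name-only --pretty=format: --author=<login> into a string (the empty format leaves only file names and blank lines); the function name most_touched_files is a hypothetical helper of mine.

```python
from collections import Counter

def most_touched_files(name_only_log, top=5):
    """Tally file paths from `git log --name-only --pretty=format:` output."""
    counts = Counter(
        line.strip()
        for line in name_only_log.splitlines()
        if line.strip()  # blank lines only separate commits, so skip them
    )
    # Most frequently modified files first.
    return counts.most_common(top)
```

The files at the top of that list are usually where the feature lives.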

Start at the entry point

Sometimes there is no good way of knowing where to go. You’ve tried to grep for odd phrases, you’ve tried blame stalking. You’re stuck. It might be time to slog it out and begin at the beginning.

Look for the program’s entry point (which can be educational for other reasons anyway) and follow the trail. This can take quite a while though, so use it as a last resort.

Modify the code now found and make sure that was the right piece of code

Before you invest minutes or hours reading this 10,000 line monster of a spaghetti and meatball class you just found, it might be a good idea to make sure it’s the right one.

Add a print statement. Run the code and make sure your print prints. The best line for this is echo(“”) with your own last name between the quotes. It’s highly unlikely the program is already printing out your last name, so grepping for it in logs or output will be easy.

Alternately you can also repeat some phrase over and over (copy/paste) like “FOOFOOFOOFOOFOOFOOFOOFOO”. This is my friend’s favorite line to print for debugging code and it stands out well in log output too (except he childishly likes to replace the F with a P. Kids these days, you know?)
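In Python, the trick might look like the sketch below. The function suspect_function is a made-up stand-in for whatever code you think is responsible, and the output is captured in memory here only so the check can be shown in one place; in real life you would grep your logs or terminal output for the marker.

```python
import contextlib
import io

MARKER = "FOOFOOFOOFOOFOOFOOFOOFOO"

def suspect_function(value):
    # Temporary debugging line: delete it before you commit.
    print(f"{MARKER} suspect_function called with {value!r}")
    return value * 2

# Run the code path you care about and capture everything it prints.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    suspect_function(21)

# If the marker shows up, this really is the code that runs.
marker_was_hit = MARKER in buffer.getvalue()
```

If the marker never shows up, you have just saved yourself from reading the wrong 10,000 lines.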

3. Read some code

Ok you’ve found relevant code you want to read up on. Now what?

Time to read, my friend! But not just any reading, no. We are going to read so we can understand what we are reading. There is a difference, see. It does us no good to stupidly scan line by line, not getting half of what we are reading, hoping we will come to some magical understanding, or that through deep, repeated readings we will just “get it”.

There are in my opinion three main things which can impede your understanding of others’ code:

  1. You don’t understand some convention of how the code is written. (Naming conventions, style conventions…)
  2. You don’t understand what a function does because you’ve never come across it before.
  3. You don’t understand something in the coding language itself.

Let’s go over how to deal with each of these:

1. Coding conventions, style guides and people inventing their own stuff

There aren’t that many conventions people use when naming variables, classes or methods. Depending on the language you are dealing with, some naming conventions may be part of the language itself (as in the variable or function is named in that style because it has to be).

For naming conventions I find people usually stick to the basics. Good old Hungarian notation or a subset of it is still around. Sometimes people add a lowercase m to every member variable of a class, to differentiate it from local variables. Some people enforce writing memberVariablesInCamelCase while local_variables_have_underscores. Sometimes private function names _startWithAnUnderscore while publicMethodsDont.

Whatever the convention is, figure it out, and fast. If you’re lucky there’s a published code style guide on an internal wiki. Read it. Read the code. Read the style guide again.
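When there is no style guide to read, one rough way to spot the convention is to tally the identifiers in a file by shape. The sketch below is only an illustration: the style names and regular expressions are my own guesses at the common patterns mentioned above, and anything that doesn't clearly fit (a single lowercase word, for instance) lands in "other".

```python
import re
from collections import Counter

# Checked in order: the first pattern that matches wins.
STYLES = [
    ("m_prefix_member", re.compile(r"^m[A-Z][A-Za-z0-9]*$")),
    ("leading_underscore", re.compile(r"^_[A-Za-z0-9_]+$")),
    ("snake_case", re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")),
    ("camelCase", re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$")),
]

def classify(identifier):
    """Return the name of the first style whose pattern matches."""
    for style, pattern in STYLES:
        if pattern.match(identifier):
            return style
    return "other"

def tally_styles(identifiers):
    """Count how many identifiers fall into each style."""
    return Counter(classify(name) for name in identifiers)
```

Feed it the identifiers from a few files and the dominant style, plus the prefixes that carry meaning, tends to jump out.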

Code conventions aren’t evil. They’re there to speed up your reading and understanding of code by helping you classify types of variables and functions quickly. But they only speed you up if you know the convention fairly well; otherwise they cloud your understanding by adding a bit of extra friction every time you come across one. (“Now why the heck did they call this mServiceManager instead of serviceManager?!”)

Lastly, sometimes people invent their own style guide and don’t publish it anywhere. In my experience that means one of two things. Either the person who wrote the code was in a hurry and never came back to write a style guide or document it anywhere (be prepared to find some dumb bugs lying around), or the person was too self-important to write a style guide or explain what they were doing (one of those “My code doesn’t need unit tests because it has no bugs” types, yeah right!), in which case the code is going to be over-engineered, bloated and buggy.

Either way, some non-standard, non-documented coding convention is likely a sign there may be trouble brewing on the horizon sailor, so hang on to your hat.

2. Utility functions, random methods and bad naming

The next thing that will slow you down when reading code is functions you have never heard of.

If you step back and look at a language, it is composed of simple building blocks which you should understand: creating new variables, assigning them values, doing some conditional logic, maybe a few loops and if you’re unlucky some binary arithmetic. But assuming you know your language well, the next thing to know is what function calls you read are doing.

For the sake of this article let’s classify these into two types of functions:

  1. Utility functions (also called library functions)
  2. “Business logic” functions.

Utility functions or library functions are a class of function which are re-used by almost anything. Need to format a date? There’s a utility for that. Parsing some JSON? There’s another one for that.

Usually but not always, these methods are part of the standard set of libraries for the language. Sometimes they will be part of some 3rd party library (open source or not).

If you have never seen a particular one of these, it’s a good use of time to read its documentation. What input does it take? What output does it produce? Does it mutate the input or merely use it? Are there any side-effects to calling the function? How slow/fast is the function going to be?
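To see why the “does it mutate the input?” question matters, here is a small Python illustration using two built-ins that look almost identical at the call site: sorted() and list.sort(). The variable names are just for the example.

```python
scores = [3, 1, 2]

# sorted() returns a new list and leaves its input alone.
ordered = sorted(scores)
scores_after_sorted = list(scores)  # snapshot: still in the original order

# list.sort() sorts in place and returns None.
result = scores.sort()
```

After this runs, ordered holds the sorted copy, scores itself has been reordered by sort(), and result is None. Mixing the two up is a classic bug that thirty seconds with the documentation prevents.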

The other sub-class of utility functions is library functions specific to the codebase you’re in. These are the ones you see on every 5th line of the code you are reading, and that seem to be used for everything. Again, it’s wise to spend some time understanding them.

Next is business logic functions. These are the meat to your code’s potatoes. They do the stuff you came here to read about.

Read them. But not every line. Unless you need to.

Let me explain.

Whenever I feel I need to go read up on a function that is a part of the program I am reading, I have two choices. I can either decide to deeply understand every part of it, or take it at face value.

Taking it at face value means I will read the name of the function, the name of the input variables, scan over the code to get a high level understanding of what it does, figure out what it returns or does to other parts of the program (or both, which is a code smell in my mind) and move on.

Unless I am refactoring something that impacts this function, I will not read every line. I don’t need to! As soon as I get what it does, BOOM, I’m moving on to the next one!

There is too little time in the day for me to know what every one of those 15,000 lines does. I just wanted to know the title and read the back of the book, not read the full story, thank you very much.

There is one exception to this and that is code written by people who (by evidence of the code they wrote) must be monkeys.

If every variable is named “thingA, thingB, thingC”, or worse half of the methods are in a folder named “util”, you’re in for a rough ride. These people generally do not know how to code professionally to begin with, or how to organize lots of code well and should likely not have been hired in the first place.

Code is self documenting most of the time assuming you name things properly. If you don’t name things properly, it’s a mess.

This might also be the time to update your resume and start writing back to all those recruiter e-mails you’ve been getting on LinkedIn. If more than 30% of the code you maintain is monkey code, you’re probably not getting paid enough to maintain it to begin with. (After all you’re reading this article, which means you care about software and your job. The same can’t be said for the monkey who wrote that piece of code you’re now trying to debug at 2 am while he’s counting sheep in a cabin in Lake Tahoe on the company dime).

Finally, unit tests can be very helpful for grokking code.

If you happen to notice that a function you are reading has a corresponding unit test, it’s probably faster to understand what the function does by reading the test.

Assuming the test was not written by a monkey.
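As a sketch of what I mean, take the made-up function below (normalize_username is hypothetical, not from any real codebase). You could puzzle over the one-liner in its body, or you could read the three test names and know exactly what it promises.

```python
import unittest

def normalize_username(raw):
    """Hypothetical business-logic function: trim, lowercase, drop a leading '@'."""
    cleaned = raw.strip().lower()
    return cleaned[1:] if cleaned.startswith("@") else cleaned

class NormalizeUsernameTest(unittest.TestCase):
    def test_strips_whitespace_and_lowercases(self):
        self.assertEqual(normalize_username("  Alice "), "alice")

    def test_drops_leading_at_sign(self):
        self.assertEqual(normalize_username("@Bob"), "bob")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalize_username(""), "")
```

A well-named test doubles as a spec: inputs on one side, expected outputs on the other, edge cases included.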

3. You don’t understand something in the coding language itself

Oh my god! What?! You should be ashamed of yourself!

… or not.

I mean come on. So what? You don’t know, so you look it up. Don’t feel bad about yourself, educate yourself!

How do you think the best programmers got to be so good? They just read about software and code. A lot. Over and over until it is second nature.

So you’ve never come across the volatile keyword in Java but suddenly every other file has a member variable declared volatile? Well now would be a good time to go read up on Java concurrency and threading and get some more knowledge. It will come in very handy.

So you’ve never seen someone bind arguments to a function in JavaScript (like foo.bind(this, arg1))? Now is probably the time to learn a little bit more about closures in JS.

The fact of the matter is if you thought learning was over the moment you graduated, someone lied to you (or you lied to yourself). College (or whatever form of primary education on programming you got) gives you the basics. The rest you will have to learn on your own, depending on what you specialize in, and what you come across in your day job.

Plus, the tech industry is constantly moving, innovating, reinventing itself. Software today is not written like software ten years ago. Or even five years ago really.

Don’t get frustrated with yourself if you don’t know something. Look it up, become smarter, improve your own skills. It’s the only way to go in the long run.

Conclusion

So now you know how to find relevant code in a large foreign codebase, and hopefully you’ve learned a few things about how to quickly understand it and not get stuck too.

In closing I would say one last thing: be a good code citizen.

Code can be like a big city at times. Some parts of it are nice and clean. Some parts have broken windows and signs held up with duct tape.

Do your part by improving things you come across which are in a bad state, even if they might not strictly speaking belong to you (but do get permission or code review from the owner before checking in a fix, of course!).

After all you never know, this might be the piece of code that pages you at 2 in the morning one day. And you’ll be glad you were the one who refactored it when it does.

 
