Balaji Vajjala's Blog

A DevOps Blog from Trenches

Speeding Up NuGet.Server

(Get the source code at https://github.com/themotleyfool/NuGet)

Last time I wrote about creating a LINQ provider for Lucene.Net, and today I’ll talk about integrating that provider with NuGet. The existing server part of the NuGet codebase is a drop-in replacement for local file-system based feeds. I wanted to preserve that turnkey advantage while improving the performance of various queries.

To make sure that my improvements were up to snuff, I set up a private mirror of all packages on nuget.org, which turned out to be 44,193 packages at the time, for a total size of over 20 GB.

If you try hitting ~/api/v2/Packages on stock NuGet.Server, you’ll find that your request just spins and spins. And spins. In fact, it took so long that I gave up waiting for the application to initialize. In the background, the server is finding all *.nupkg files in ~/Packages and calculating a hash of the contents. Needless to say, it can take a while to run a checksum algorithm over 20 GB of data.

Switching over to my custom lucene branch, the first time the site is started, it scans the Packages folder and finds all packages that haven’t been indexed by Lucene. The site homepage helpfully tells you the current status, such as “Indexing 2113 of 44193 new packages.” An ajax timer refreshes the info every few seconds so progress can be easily tracked.

The packages don’t begin to appear in the feed until they’ve all been indexed, so on this first run the experience isn’t much better than stock NuGet.Server.

Incremental Indexing

The real improvements come after the initial index has been built.

[celdredge@localhost]$ appcmd recycle apppool nuget
"nuget" successfully recycled

[celdredge@localhost]$ time wget -O /dev/null http://localhost/api/v2/Packages

(snip)

real    0m3.230s
user    0m0.062s
sys     0m0.125s

This means that you don’t have to worry much about IIS shutting down the application during idle times. The index gets loaded and ready to go in a matter of seconds: a vast improvement over stock NuGet.Server.

While that happens, a background thread scans the Packages folder to see what might have changed while the application was stopped. New, modified, and deleted packages are synchronized with the Lucene index. The synchronization process takes about 25 seconds to scan 44,193 package files split across 6,180 folders and calculate the differences with the Lucene index. That’s pretty fast.

After the application finishes this initial scan, a FileSystemWatcher monitors the Packages folder to synchronize any changes in real time. This allows the index to stay in sync when new packages appear, even if they are copied into the folder instead of using nuget push.
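To illustrate the idea, here is a minimal sketch of that kind of wiring; this is not the actual NuGet.Server code, the path is made up, and the handler bodies stand in for the real Lucene add/update/delete operations:

using System;
using System.IO;

class PackageWatcher
{
    static void Main()
    {
        // Watch the package folder (illustrative path) for *.nupkg changes,
        // including packages dropped into nested folders.
        var watcher = new FileSystemWatcher(@"C:\NuGet\Packages", "*.nupkg")
        {
            IncludeSubdirectories = true,
            NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite
        };

        // In the real project these handlers would queue Lucene index
        // operations; here they just log what would happen.
        watcher.Created += (s, e) => Console.WriteLine("add to index: " + e.FullPath);
        watcher.Changed += (s, e) => Console.WriteLine("update index: " + e.FullPath);
        watcher.Deleted += (s, e) => Console.WriteLine("remove from index: " + e.FullPath);
        watcher.Renamed += (s, e) => Console.WriteLine("rename in index: " + e.FullPath);

        watcher.EnableRaisingEvents = true; // start delivering events
        Console.ReadLine();                 // keep the process alive
    }
}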

Superfast Search

All sorts of complex queries are possible, and they execute in very reasonable time. I used LINQPad to construct various test queries, like this one that finds packages whose id contains lucene but does not start with lucene:

from p in Packages
where p.Id.Contains("Lucene")
where !p.Id.StartsWith("Lucene")
where p.IsLatestVersion
orderby p.Id descending
select p

Query successful (00:00.136)

136ms is pretty respectable, IMO.

Another advantage to using Lucene is how queries are analyzed. Term queries match various word forms, so a query like build will match packages that use words like build, builds, building, built, and so on. It is also possible to search for exact phrases, such as “glue them back together”. That query matches only the one package containing the exact phrase, whereas on nuget.org you’ll get all kinds of results.

Other Features

The Tab Completion API Endpoints introduced in NuGet 2.0 have been implemented, bringing fast results to users of the Package Manager Console.

Conclusion

It has taken a substantial amount of time and effort to implement Lucene.Net.Linq and integrate it with NuGet.Server, but the results have proven to be worth the investment.

Lucene.Net.Linq is still a young project, but it has matured quickly and is now available on nuget.org. There are a few other OSS projects that attempt to do what it does, but I think it is already one of the best.

Binaries of NuGet.Server + Lucene can be downloaded from https://github.com/themotleyfool/NuGet/downloads.

NuGet Aside for Octopress

I just finished an aside for Octopress that lists the top N downloaded packages for which you are an author. It also adds a link to your NuGet gallery profile if you have one. The style is basically the same as that of the GitHub aside.

Since there is no official way to publish third-party add-ons for Octopress yet, I created a GitHub repository with the required files and setup instructions in the README.

Configuration Management Strategies

I just watched the “To Package or Not to Package” video from DevOps Days Mountain View. The discussion was great, and there were some moments of hilarity. If you haven’t watched it yet, check it out.

Stephen Nelson Smith, I salute you, sir.

I’m quite firmly in the “Let your CM tool handle your config files” camp. To explain why, I think it’s worth briefly examining the evolution of configuration management strategies.

In order to keep this post as vague and heady as possible, no distinction between “system” and “application” configurations shall be made.

What is a configuration file?

Configuration files are text files that control the behavior of programs on a machine. That’s it. They are usually read once, when a program is started from a prompt or init script. A process restart or HUP is typically required for changes to take effect.

What is configuration management, really?

When thinking about configuration management, especially across multiple machines, it is easy to equate the task to file management. Configs do live in files, after all. Packages are remarkably good at file management, so it’s natural to want to use them.

However, the task goes well beyond that.

An important attribute of an effective management strategy, config or otherwise, is that it reduces the amount of complexity (aka work) that humans need to deal with. But what is the work that we’re trying to avoid?

Dependency Analysis and Runtime Configuration

Two tasks that systems administrators concern themselves with are dependency analysis and runtime configuration.

Within the context of a single machine, dependency analysis usually concerns software installation. Binaries depend on libraries and scripts depend on binaries. When building things from source, headers and compilers are needed. Keeping the details of all this straight is no small task. Packages capture these relationships in their metadata, the construction of which is painstaking and manual. Modern Linux distributions can be described as collections of packages and the metadata that binds them. Go out and hug a package maintainer today.

Within the context of infrastructure architecture, dependency analysis involves stringing together layers of services and making individual software components act in concert. A typical web application might depend on database, caching, and email relay services being available on a network. A VPN or WiFi service might rely on PKI, Radius, LDAP and Kerberos services.

Runtime configuration is the process of taking all the details gathered from dependency analysis and encoding them into the system. Appropriate software needs to be installed, configuration files need to be populated, and kernels need to be tuned. Processes need to be started, and of course, it should all still work after a reboot.

Manual Configuration

Once upon a time, all systems were configured manually. This strategy is the easiest to understand, but the hardest one to execute. It typically happens in development and small production environments where configuration details are small enough to fit into a wiki or spreadsheet. As a network’s size and scope increase, management efforts become massive, time-consuming, and prone to human error. Details end up in the heads of a few key people and reproducibility is abysmal. This is obviously unsustainable.

Scripting

The natural progression away from manual configuration was custom scripting. Scripting reduced management complexity by automating things using languages like Bash and Perl. Tutorial and documentation instructions like “add the following line to your /etc/sshd_config” were turned into automated scripts that grepped, sed’ed, appended, and clobbered. These scripts were typically very brittle and would produce the desired outcome only on their first run.
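A hypothetical example of the style, with made-up settings; run it twice and the appended line shows up twice:

#!/bin/bash
# The scripting era, sketched: edit sshd_config in place, then append.
sed -i 's/^PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
echo 'AllowUsers admin' >> /etc/ssh/sshd_config
# Nothing checks whether the line already exists, and nothing restarts
# sshd, so the script is neither idempotent nor complete.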

File Distribution

File distribution was the next logical tactic. In this scheme, master copies of important configuration files are kept in a centralized location and distributed to machines. Distribution is handled in various ways. RDIST, NFS mounts, scp-on-a-for-loop, and rsync pulls are all popular methods.

This is nice for a lot of reasons. Centralization enables version control and reduces the time it takes to make changes across large groups of hosts. Like scripting, file distribution lowers the chance of human error by automating repetitive tasks.

However, these methods have their drawbacks. NFS mounts introduce single points of failure and brittleness. Push based methods miss hosts that happen to be down for maintenance. Pulling via rsync on a cron is better, but lacks the ability to notify services when files change.

Managing configs with packages falls into this category, and is attractive for a number of reasons. Packages can be written to take actions in their post-install sections, creating a way to restart services. It’s also pretty handy to be able to query package managers to see installed versions. However, you still need a way to manage config content, as well as initiate their installation in the first place.
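As an illustration, a hypothetical spec fragment for a config RPM might restart the affected service from its post-install section:

%post
# Runs after the packaged config file is laid down (sketch only).
/sbin/service sshd restart >/dev/null 2>&1 || :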

Declarative Syntax

In this scheme, autonomous agents run on hosts under management. The word autonomous is important, because it stresses that the machines manage themselves by interpreting policy remotely set by administrators. The policy could state any number of things about installed software and configuration files.

Policy written as code is run through an agent, letting the manipulation of packages, configuration files, and services all be handled by the same process. Brittle scripts behaving badly are eliminated by exploiting the idempotent nature of a declarative interface.
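As a sketch of what such a policy can look like, here it is in Puppet syntax (any declarative CM tool works along the same lines; resource names are illustrative):

# Install the package, manage its config from a template, and keep the
# service running; the notify relationship restarts sshd only when the
# file content actually changes.
package { 'openssh-server':
  ensure => installed,
}

file { '/etc/ssh/sshd_config':
  ensure  => file,
  content => template('ssh/sshd_config.erb'),
  require => Package['openssh-server'],
  notify  => Service['sshd'],
}

service { 'sshd':
  ensure => running,
  enable => true,
}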

When they first encounter it, many administrators perceive this approach as overly complex and confusing. I believe this is because they have equated configuration management with file management for so long. After the initial learning curve and some tooling, management is dramatically simplified: administrators spend their time on policy definition rather than implementation.

Configuration File Content Management

This is where things get interesting. We have programs under our command running on every node in an infrastructure, so what should we have them do about configuration files?

“Copy this file from its distribution point” is very common, since it allows for versioning of configuration files. Packaging configs also accomplishes this, and lets you make declarations about dependencies. But how are the contents of the files determined?

It’s actually possible to do this by hand. Information can be gathered from wikis, spreadsheets, grey matter, and sticky notes. Configuration files can then be assembled by engineers, distributed, and manually modified as an infrastructure changes.

File generation is a much better idea. Information about the nodes in an infrastructure can be encoded into a database, then fed into templates by small utility programs that handle various aspects of dependency analysis. When a change is made, such as adding or removing a node from a cluster, configurations concerning themselves with that cluster can be updated with ease.
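A hypothetical template fragment makes this concrete (ERB syntax here; the @app_nodes list would be supplied from the node database by the generation tool):

# Load balancer upstream block, regenerated whenever the cluster changes.
upstream app_servers {
<% @app_nodes.each do |node| -%>
    server <%= node['ipaddress'] %>:8080;
<% end -%>
}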

Local Configuration Generation

The logic that generates configuration files has to be executed somewhere. This is often done on the machine responsible for hosting the file distribution. A better place is directly on the nodes that need the configurations. This eliminates the need for distribution entirely.

Modifications to the node database now end up in all the correct places during the next agent run. Packaging the configs is completely unnecessary, since they don’t need to be moved from anywhere. Management complexity is reduced by eliminating the task entirely. Instead of worrying about file versioning, all that needs to be ensured is code correctness and the accuracy of the database.

Don’t edit config files. Instead, edit the truth.


Subversion hot-backup change in 1.6.11

An important notice to users of the hot-backup.py utility, which ships with Subversion.

I found our nightly backup of subversion was failing with the following error:

svnadmin: Can't open file '/pathtorepo/db/fsfs.conf': No such file or directory

What was troubling me was:

  1. How come a file is missing from my svn repository? Nothing has changed (as far as I know… :)
  2. The hot-backup.py script hasn’t changed much, so how come my version has changed?

So I looked up Subversion’s change log at http://svn.apache.org/repos/asf/subversion/tags/1.6.12/CHANGES (1.6.12 being the latest release), checked which version was installed on my svn machine, and found that 1.6.12 was indeed installed. In the 1.6.11 release notes you will find:

* make 'svnadmin hotcopy' copy the fsfs config file (r905303)

In addition, I took a look at the hot-backup.py change log at http://svn.apache.org/viewvc/subversion/trunk/tools/backup/hot-backup.py.in?view=log and found that the fsfs file has indeed been included in the hot-backup script since the 1.6.11 version of the file (see the link above).

Googling the fsfs.conf subject led me to http://comments.gmane.org/gmane.comp.version-control.subversion.user/97647, which describes the exact same issue.

How do we solve this issue?

  1. Create a test repository with svnadmin create /tmp/svntest, which will create fsfs.conf under /tmp/svntest/db/fsfs.conf
  2. Copy fsfs.conf to your svnroot/db directory and voilà, you have the fsfs.conf (what this file does is a different topic); see the commands sketched below
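In shell form (the repository path is illustrative):

# Create a throwaway repository just to obtain a fresh fsfs.conf ...
svnadmin create /tmp/svntest
# ... then copy it into the real repository's db directory.
cp /tmp/svntest/db/fsfs.conf /path/to/your/repo/db/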

Please note:

1. svnadmin upgrade does not add this file, so unless you are using an old version of the hot-backup.py script, your backups will fail (believe me, I tried).

2. If you update Subversion, don’t forget to run svnadmin upgrade /path/to/your/repo, or you miss the whole point of upgrading.

So I learned that Subversion had been upgraded (which, again, does not mean your repository was upgraded!). But when? Considering that I am running CentOS and didn’t have to compile SVN from source, instead of checking the creation/update times of the Subversion binaries I used RPM to tell me when Subversion was installed, and there it was:

[root@dev ~]# rpm -qi subversion
Name        : subversion                   Relocations: (not relocatable)
Version     : 1.6.12                       Vendor: Dag Apt Repository
Release     : 0.1.el5.rf                   Build Date: Tue 22 Jun 2010 12:55:11 PM IDT
Install Date: Mon 19 Jul 2010 12:36:54 AM IDT      Build Host: lisse.hasselt.wieers.com
Group       : Development/Tools             Source RPM: subversion-1.6.12-0.1.el5.rf.src.rpm
Size        : 21247326                         License: BSD
Signature   : DSA/SHA1, Tue 22 Jun 2010 04:46:18 PM IDT, Key ID a20e52146b8d79e6
Packager    : Dag Wieers 

I think the #1 lesson learned here is: before you upgrade, read the release notes and see if the change impacts your environment in any way; then you can upgrade.

Hope you find this useful.

Learning Programming

I was tutoring someone in web app development recently, and the monumental task in front of him really hit me. He was trying to learn and use nine new languages at the same time.

In our case it was:

  • Ruby
  • Rails
  • MySQL
  • Bash (command line usage)
  • HTML
  • CSS
  • JavaScript
  • jQuery
  • Git
  • (Capistrano, YAML, nginx, ??)

Even though some of these aren’t true languages in the traditional sense, they appear this way to newcomers since they are each a new syntax to learn.

If you slowly built up these skills over 15 years, they are clearly separate concepts in your mind. But for a newcomer trying to use them, it’s not even clear which one is which.

Is that a Ruby method or a Rails method?

Is “script/server” a shell command? Is “ls” part of Rails?

Is this file html, js, or css? (actually a mix of all three)

He made a comment along the lines of “wouldn’t it be great if you could build an entire web app in one language”, and I started thinking about it.

GWT (Google Web Toolkit), ActiveRecord, CoffeeScript, and Heroku are all steps in this direction. You could classify them generally as trying to “eliminate a language in the stack” or as letting you do a piece of the stack in a language you already know.

Obviously there is a trade-off here between power and simplicity, but I’m wondering: would it be possible or desirable to get an entire web app down to just one language? If not one, how few could you use?

Btw, I think there are benefits for seasoned developers here as well. I remember Lars Rasmussen (co-creator of Google Maps and Wave) mentioning something to this effect at Google I/O in 2009: that GWT allowed him to spend his mental CPU cycles at a higher level and be more creative (not having to worry about cross-browser CSS or JS). So the benefits of higher abstraction may not be only for newcomers.

Functional Groovy switch statement

In the previous post I showed how to replace chained if-else statements in Groovy with one concise switch. That was done for the special case of an if-statement where every branch was evaluated using the same condition function. Today I want to generalize that technique by allowing different conditions.

Suppose your code looks like this:

if (param % 2 == 0) {
    'even'
} else if (param % 3 == 0) {
    'threeven'
} else if (0 < param) {
    'positive'
} else {
    'negative'
}

As long as every condition operates on the same parameter, you can replace the entire chain with a switch. In this scenario param becomes the switch value and the conditions become case values of Closure type. The only thing we need to do is override the Closure.isCase() method, as I described in the previous post. The safest way to do it is to create a category class:

class CaseCategory {
    // Treat a case value of Closure type as a predicate: the case
    // matches when the closure returns true for the switch value.
    static boolean isCase(Closure casePredicate, Object switchParameter) {
        casePredicate.call switchParameter
    }
}

Now we can replace the if-statement with the following switch:

use (CaseCategory) {
    switch (param) {
        case { it % 2 == 0 } : return 'even'
        case { it % 3 == 0 } : return 'threeven'
        case { 0 < it }      : return 'positive'
        default              : return 'negative'
    }
}

We can actually go further and extract in-line closures:

def even = {
    it % 2 == 0
}
def threeven = {
    it % 3 == 0
}
def positive = {
    0 < it
}

After which the code becomes even more readable:

use (CaseCategory) {
    switch (param) {
        case even     : return 'even'
        case threeven : return 'threeven'
        case positive : return 'positive'
        default       : return 'negative'
    }
}

Nothing new under the Sun

Every generation of software developers needs its own fad. For my generation it was Agile, for the generation before it was OOP, and before that it was another big thing. Gerald Weinberg, one of the most influential people in our industry, blogged yesterday about this issue. With over 50 years of experience in software development, he knows what he is talking about. Read his blog post; he has a very good point.

P.S. I’m wondering what will be the next big thing. Will it be Cloud or Big Data?

Multimethods in Groovy

Every time I switch from Groovy to Java I have to remind myself that some things that seem so natural and work as expected in Groovy don’t work in Java. One such difference is method dispatch. Groovy supports multiple dispatch, while Java does not. Therefore the following code works differently in Groovy and Java:

public class A {
    public void foo(A a) { System.out.println("A/A"); }
    public void foo(B b) { System.out.println("A/B"); }
}
public class B extends A {
    public void foo(A a) { System.out.println("B/A"); }
    public void foo(B b) { System.out.println("B/B"); }
}
public class Main {
    public static void main(String[] args) {
        A a = new A();
        A b = new B();
        a.foo(a);
        b.foo(b);
    }
}

$ java Main
A/A
B/A

$ groovy Main.groovy
A/A
B/B

Java selects among overloads at compile time, based on the static types of the arguments: since b is declared as A, b.foo(b) resolves to foo(A), and virtual dispatch on the receiver then picks B’s implementation, printing B/A. Groovy also dispatches on the runtime types of the arguments, so the same call lands in foo(B) and prints B/B.

Reversing Groovy switch statement

Recently I’ve been working on Groovy code that had many methods with long multi-branch conditionals like this:

def parse(message, options) {
    if (options.contains('A')) {
        parseARule message
    } else if (options.contains(2)) {
        parseSmallDigitRule message
    ...
    } else if (options.contains(something)) {
        parseSomeRule message
    } else {
        parseSomeOtherRule message
    }
}

Although this code works, it is hard to see which branch is called under which condition. It would be much better if we could replace this code with something like Lisp’s cond macro. The best candidate for such a task in Groovy is the switch statement. If we could only refactor the code above to something like the following, it would significantly improve readability:

def parse(message, options) {
    switch (options) {
        case 'A' : return parseARule(message)
        case 2   : return parseSmallDigitRule(message)
        ...
        case ... : return parseSomeRule(message)
        default  : return parseSomeOtherRule(message)
    }
}

Unfortunately, this code doesn’t work out of the box in Groovy, but it works if we do some metaprogramming.

The switch statement works a bit differently in Groovy than in Java. Instead of equals(), it uses the isCase() method to match the case value against the switch value. The default implementation of isCase() falls back to equals(), but some classes, including Collection, override this behaviour. That’s why in Groovy you can do things like this:

switch (value) {
    case ['A','E','I','O','U'] : return 'vowel'
    case 0..9                  : return 'digit'
    case Date                  : return 'date'
    default                    : return 'something else'
}

For our purposes we need some sort of reverse switch, where a collection is used as the switch value and Strings and Integers are used as case values. To do this we need to override the default implementation of the isCase() method on the String and Integer classes. That’s not possible in Java, but it is very easy in Groovy. You can change a method implementation globally, by replacing it in the corresponding meta class, or locally, with the help of categories. Let’s create a category that swaps the object and subject of the isCase() method:

class CaseCategory {
    static boolean isCase(String string, Collection col) {
        reverseCase(string, col)
    }
    static boolean isCase(Integer integer, Collection col) {
        reverseCase(integer, col)
    }
    // Add more overloaded methods here if needed

    private static boolean reverseCase(left, right) {
        right.isCase(left)
    }
}

Now we can use this category to achieve the goal we stated at the beginning of this post:

def parse(message, options) {
    use (CaseCategory) {
        switch (options) {
            case 'A' : return parseARule(message)
            case 2   : return parseSmallDigitRule(message)
            ...
            case ... : return parseSomeRule(message)
            default  : return parseSomeOtherRule(message)
        }
    }
}

If you are comfortable with global method replacement, you can amend the String and Integer meta classes instead. In that case you don’t need to wrap the switch statement in a use block.
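A sketch of that global variant; it affects all Groovy code in the process, so use it with care:

// Globally reverse isCase() for Strings and Integers: a scalar case
// value now matches when the collection switch value contains it.
String.metaClass.isCase = { Collection col -> col.contains(delegate) }
Integer.metaClass.isCase = { Collection col -> col.contains(delegate) }

def options = ['A', 2]
switch (options) {
    case 'A' : println 'A rule'; break
    case 2   : println 'small digit rule'; break
    default  : println 'some other rule'
}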

Mount remote dirs via ssh with sshfs / fuse

Well, there is nothing like a simple, innovative solution to save the day. sshfs has been around for quite a while, but I never really needed it until now…

Use Case:

We moved Subversion from server A to server B, and we wanted to be able to utilize the same backup scripts we were using. One (not very elegant) way was to mount the remote location via NFS, which has its issues: from time to time you will hit stale NFS handles and the like, so in almost all cases that is out of the question.

A neater solution is to mount a specific directory over SSH, run svnhotbackup, and close the share. I took this to another level by doing it over a VDSL connection, which worked like a charm. So how do we do this?

If you are on Ubuntu (see install snippet below):

sudo apt-get install sshfs

Add fuse to your /etc/modules (edit /etc/modules and add the word fuse on a line of its own):

vi /etc/modules ...
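From there, a typical session looks something like the following sketch (the server name, mount point, and paths are illustrative):

# Load the fuse module now instead of waiting for a reboot.
sudo modprobe fuse
# Mount the remote Subversion directory from server B over SSH.
sshfs svnuser@serverB:/var/svn /mnt/svn
# Run the hot backup against the mounted path.
hot-backup.py /mnt/svn/repo /backups
# Unmount the share when done.
fusermount -u /mnt/svn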