<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="http://www.sleberknight.com/blog/roller-ui/styles/rss.xsl" media="screen"?><rss version="2.0" 
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:atom="http://www.w3.org/2005/Atom" >
<channel>
  <title>Scott Leberknight&apos;s Weblog</title>
  <link>http://www.sleberknight.com/blog/sleberkn/</link>
      <atom:link rel="self" type="application/rss+xml" href="http://www.sleberknight.com/blog/sleberkn/feed/entries/rss?cat=%2FDevelopment" />
    <description>Some Day I&apos;ll Have More Time...</description>
  <language>en-us</language>
  <copyright>Copyright 2025</copyright>
  <lastBuildDate>Sun, 6 Jul 2025 19:35:31 +0000</lastBuildDate>
  <generator>Apache Roller (incubating) 4.0 (20071120033321:dave)</generator>
        <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/adding_a_mockwebserver_junit_jupiter</guid>
    <title>Adding a MockWebServer JUnit Jupiter Extension</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/adding_a_mockwebserver_junit_jupiter</link>
        <pubDate>Sat, 5 Jul 2025 19:19:05 +0000</pubDate>
    <category>Development</category>
    <category>mock</category>
    <category>jupiter</category>
    <category>junit</category>
    <category>web</category>
    <category>software</category>
    <category>testing</category>
    <category>java</category>
            <description>&lt;style type=&quot;text/css&quot;&gt;
.prettyprint {
background-color:#EFEFEF;
border:1px solid #CCCCCC;
font-size:small;
overflow:auto;
padding:5px;
}
&lt;/style&gt;

&lt;p&gt;In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/making_http_client_tests_cleaner&quot;&gt;last post&lt;/a&gt;, I used several utilities in the &lt;a href=&quot;https://github.com/kiwiproject/kiwi-test&quot;&gt;kiwi-test&lt;/a&gt; library to clean up and remove boilerplate from tests using OkHttp&apos;s &lt;code&gt;MockWebServer&lt;/code&gt;. But there&apos;s something else we can do to remove even more boilerplate from tests. The tests in the previous two blogs have the same code in the &lt;code&gt;@BeforeEach&lt;/code&gt; and &lt;code&gt;@AfterEach&lt;/code&gt; methods to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a new &lt;code&gt;MockWebServer&lt;/code&gt; instance and set an instance field&lt;/li&gt;
&lt;li&gt;Get the base &lt;code&gt;URI&lt;/code&gt; for the server where tests can send requests&lt;/li&gt;
&lt;li&gt;Close the server after each test completes&lt;/li&gt;
&lt;/ul&gt;
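
&lt;p&gt;As a rough sketch, that boilerplate looks like this in every test class (assuming OkHttp 4.x, where &lt;code&gt;MockWebServer&lt;/code&gt; implements &lt;code&gt;Closeable&lt;/code&gt;):&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
private MockWebServer server;
private URI baseUri;

@BeforeEach
void setUp() throws IOException {
    // create and start a fresh server, and capture its base URI
    server = new MockWebServer();
    server.start();
    baseUri = server.url(&quot;/&quot;).uri();
}

@AfterEach
void tearDown() throws IOException {
    // shut the server down so each test starts clean
    server.close();
}
&lt;/pre&gt;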

&lt;p&gt;This setup and teardown logic can be extracted into a &lt;a href=&quot;https://junit.org/junit5/docs/current/user-guide/#extensions&quot;&gt;JUnit Jupiter extension&lt;/a&gt; that will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before each test, create a new &lt;code&gt;MockWebServer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Provide methods to get the server instance and the base &lt;code&gt;URI&lt;/code&gt; of the server&lt;/li&gt;
&lt;li&gt;After each test, close the server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is one implementation:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
package org.kiwiproject.test.okhttp3.mockwebserver;

// imports...

public class MockWebServerExtension implements BeforeEachCallback, AfterEachCallback {

    @Getter
    @Accessors(fluent = true)
    private MockWebServer server;

    @Getter
    @Accessors(fluent = true)
    private URI uri;

    public MockWebServerExtension() {
        this(new MockWebServer());
    }

    public MockWebServerExtension(MockWebServer server) {
        this.server = KiwiPreconditions.requireNotNull(server, &quot;server must not be null&quot;);
    }

    @Override
    public void beforeEach(ExtensionContext context) throws IOException {
        // start the server provided via the constructor, so any customization is preserved
        server.start();
        uri = server.url(&quot;/&quot;).uri();
    }

    @Override
    public void afterEach(ExtensionContext context) {
        KiwiIO.closeQuietly(server);
    }
}
&lt;/pre&gt;

&lt;p&gt;This implementation provides two constructors. The no-arg constructor creates a &lt;code&gt;MockWebServer&lt;/code&gt; instance for you, while the one-arg constructor lets you supply your own instance with any customization your tests need, for example to support TLS.&lt;/p&gt;
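
&lt;p&gt;For instance, a test could pass in a server configured for HTTPS. Here is a sketch; the &lt;code&gt;sslSocketFactory()&lt;/code&gt; helper is assumed to build an &lt;code&gt;SSLSocketFactory&lt;/code&gt; from test certificates (e.g. using okhttp-tls):&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
@RegisterExtension
final MockWebServerExtension serverExtension = new MockWebServerExtension(newHttpsServer());

// sketch: sslSocketFactory() is assumed to create an SSLSocketFactory from test certificates
private static MockWebServer newHttpsServer() {
    var server = new MockWebServer();
    server.useHttps(sslSocketFactory(), false);
    return server;
}
&lt;/pre&gt;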

&lt;p&gt;It also provides the &lt;code&gt;server()&lt;/code&gt; and &lt;code&gt;uri()&lt;/code&gt; methods to easily get the &lt;code&gt;MockWebServer&lt;/code&gt; instance and the base &lt;code&gt;URI&lt;/code&gt; for use in your tests. Note these methods are generated using Lombok, though they would be easy enough to write manually.&lt;/p&gt;

&lt;p&gt;Using the extension in tests is straightforward. You add a &lt;code&gt;MockWebServerExtension&lt;/code&gt; instance field and annotate it with &lt;code&gt;@RegisterExtension&lt;/code&gt; (note that JUnit requires the field to be non-private):&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
@RegisterExtension
final MockWebServerExtension serverExtension = new MockWebServerExtension();
&lt;/pre&gt;

&lt;p&gt;For convenience, you can also declare a &lt;code&gt;MockWebServer&lt;/code&gt; field:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
private MockWebServer server;
&lt;/pre&gt;

&lt;p&gt;Then in your test&apos;s &lt;code&gt;@BeforeEach&lt;/code&gt; method, you initialize the &lt;code&gt;server&lt;/code&gt; field, which can then be referenced in tests.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
@BeforeEach
void setUp() {
    server = serverExtension.server();
    
    // additional initialization code...
}
&lt;/pre&gt;

&lt;p&gt;Alternatively, you can get the server in each test using the extension&apos;s &lt;code&gt;server()&lt;/code&gt; method:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
@Test
void someTest() {
    var server = serverExtension.server();
    
    // test code...
}
&lt;/pre&gt;

&lt;p&gt;Since the extension takes care of closing the server, you don&apos;t need to have a custom &lt;code&gt;@AfterEach&lt;/code&gt; method to do that.&lt;/p&gt;

&lt;p&gt;Now, you can write a complete test that uses the extension like the following:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
class MathApiTest {

    @RegisterExtension
    final MockWebServerExtension serverExtension = new MockWebServerExtension();
    
    private MathApiClient mathClient;
    private Client client;
    private MockWebServer server;
    
    @BeforeEach
    void setUp() {
        // Create the Jersey client
        client = ClientBuilder.newBuilder()
                .connectTimeout(500, TimeUnit.MILLISECONDS)
                .readTimeout(500, TimeUnit.MILLISECONDS)
                .build();
        
        server = serverExtension.server();
        var baseUri = serverExtension.uri();
        mathClient = new MathApiClient(client, baseUri);
    }

    @AfterEach
    void tearDown() {
        // Close the Jersey client
        client.close();
    }

    @Test
    void shouldAdd() {
        server.enqueue(new MockResponse()
                .setResponseCode(200)
                .setHeader(HttpHeaders.CONTENT_TYPE, &quot;text/plain&quot;)
                .setBody(&quot;42&quot;));

        assertThat(mathClient.add(40, 2)).isEqualTo(42);

        var recordedRequest = takeRequiredRequest(server);

        assertThatRecordedRequest(recordedRequest)
                .isGET()
                .hasPath(&quot;/math/add/40/2&quot;)
                .hasNoBody();
    }

    // ...more tests...
}
&lt;/pre&gt;

&lt;p&gt;This test&apos;s &lt;code&gt;@BeforeEach&lt;/code&gt; method gets the &lt;code&gt;MockWebServer&lt;/code&gt; and the base &lt;code&gt;URI&lt;/code&gt; directly from the &lt;code&gt;MockWebServerExtension&lt;/code&gt;. So the only initialization logic it needs to do is to create a Jersey client and an instance of the class being tested, &lt;code&gt;MathApiClient&lt;/code&gt;. As mentioned earlier, the test doesn&apos;t need to close the server in the &lt;code&gt;@AfterEach&lt;/code&gt; method, so all it needs to do is close the Jersey client.&lt;/p&gt;

&lt;p&gt;Each test is then the same as in the previous post, where we used &lt;code&gt;RecordedRequests&lt;/code&gt; and &lt;code&gt;RecordedRequestAssertions&lt;/code&gt; from &lt;code&gt;kiwi-test&lt;/code&gt; to keep the test code clean.&lt;/p&gt;

&lt;p&gt;And that&apos;s all there is to it! The extension code shown above provides what you need in the majority of testing situations. But you don&apos;t need to create your own or copy this code if you don&apos;t want to. &lt;a href=&quot;https://github.com/kiwiproject/kiwi-test&quot;&gt;kiwi-test&lt;/a&gt; version &lt;a href=&quot;https://github.com/kiwiproject/kiwi-test/milestone/53?closed=1&quot;&gt;3.9.0&lt;/a&gt; adds its own &lt;code&gt;MockWebServerExtension&lt;/code&gt;. It is very similar to the extension shown here, but adds a few additional features, such as the ability to specify a &quot;server customizer&quot;: a &lt;code&gt;Consumer&amp;lt;MockWebServer&amp;gt;&lt;/code&gt; that lets you customize the server, for example to add TLS support and allow only HTTP/1.1 and HTTP/2:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
@RegisterExtension
final MockWebServerExtension serverExtension = new MockWebServerExtension(svr -&gt; {
    svr.setProtocols(List.of(Protocol.HTTP_2, Protocol.HTTP_1_1));
    svr.useHttps(getSocketFactory(), false);
});
&lt;/pre&gt;

&lt;p&gt;It also provides a &lt;code&gt;uri(path)&lt;/code&gt; method that lets you easily get a &lt;code&gt;URI&lt;/code&gt; relative to the base &lt;code&gt;URI&lt;/code&gt; of the server:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
var statusURI = serverExtension.uri(&quot;/status&quot;);
&lt;/pre&gt;

&lt;h3&gt;Wrapping Up&lt;/h3&gt;

&lt;p&gt;Using a JUnit extension like the &lt;code&gt;MockWebServerExtension&lt;/code&gt; shown here is one more thing you can do to eliminate boilerplate code in your tests. It can also provide the flexibility needed by different tests by allowing customization of the &lt;code&gt;MockWebServer&lt;/code&gt;.&lt;/p&gt;

</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/making_http_client_tests_cleaner</guid>
    <title>Making HTTP Client Tests Cleaner with MockWebServer and kiwi-test</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/making_http_client_tests_cleaner</link>
        <pubDate>Sun, 9 Mar 2025 19:50:04 +0000</pubDate>
    <category>Development</category>
    <category>web</category>
    <category>mock</category>
    <category>testing</category>
    <category>java</category>
    <category>junit</category>
    <category>jupiter</category>
    <category>software</category>
            <description>&lt;style type=&quot;text/css&quot;&gt;
.prettyprint {
background-color:#EFEFEF;
border:1px solid #CCCCCC;
font-size:small;
overflow:auto;
padding:5px;
}
&lt;/style&gt;

&lt;p&gt;
In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/testing_http_client_code_with&quot;&gt;previous blog&lt;/a&gt;, I showed using &lt;code&gt;MockWebServer&lt;/code&gt; (part of &lt;a href=&quot;https://square.github.io/okhttp/&quot;&gt;OkHttp&lt;/a&gt;) to test HTTP client code. The test code was pretty clean and simple, but there are a few &lt;em&gt;minor&lt;/em&gt; annoyances:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The boilerplate invocation to get the &lt;code&gt;URI&lt;/code&gt; of the &lt;code&gt;MockWebServer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Having to deal with &lt;code&gt;InterruptedException&lt;/code&gt; using &lt;code&gt;takeRequest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Needing to assert that the &lt;code&gt;RecordedRequest&lt;/code&gt; returned from &lt;code&gt;takeRequest&lt;/code&gt; is not &lt;code&gt;null&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Wrapping the assertions on a &lt;code&gt;RecordedRequest&lt;/code&gt; in &lt;code&gt;assertAll&lt;/code&gt; versus having an AssertJ-style fluent API&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I fully admit these are all minor. However, the more I used &lt;code&gt;MockWebServer&lt;/code&gt; the more I wanted to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reduce boilerplate code&lt;/li&gt;
&lt;li&gt;Not need to deal with &lt;code&gt;InterruptedException&lt;/code&gt; in tests&lt;/li&gt;
&lt;li&gt;Not have to null-check the &lt;code&gt;RecordedRequest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Have a fluent assertion API for &lt;code&gt;RecordedRequest&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition, there&amp;#39;s another &amp;quot;gotcha&amp;quot;: if you use the no-argument &lt;code&gt;takeRequest()&lt;/code&gt; method, your tests might &lt;em&gt;never end&lt;/em&gt;. From the Javadoc, &lt;code&gt;takeRequest()&lt;/code&gt; &amp;quot;will block until the request is available, &lt;em&gt;possibly forever&lt;/em&gt;&amp;quot; (emphasis mine). It happened to me a few times before I actually read the Javadoc! After that, I decided to only use the &lt;code&gt;takeRequest&lt;/code&gt; overload that accepts a timeout, which fixes the &amp;quot;never ends&amp;quot; problem. But whichever overload you use, both throw &lt;code&gt;InterruptedException&lt;/code&gt;, which you need to handle (unless you are using Kotlin, in which case you don&amp;#39;t need to worry about it).&lt;/p&gt;
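
&lt;p&gt;For example, without any helpers, a test that wants a bounded wait ends up with code like this (a sketch):&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
RecordedRequest recordedRequest;
try {
    // wait briefly instead of (possibly) forever; returns null on timeout
    recordedRequest = server.takeRequest(10, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new IllegalStateException(e);
}
assertThat(recordedRequest).isNotNull();
&lt;/pre&gt;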

&lt;p&gt;To resolve the above &amp;quot;problems&amp;quot; I added several test utilities to &lt;a href=&quot;https://github.com/kiwiproject/kiwi-test&quot;&gt;kiwi-test&lt;/a&gt; in release &lt;a href=&quot;https://github.com/kiwiproject/kiwi-test/releases/tag/v3.5.0&quot;&gt;3.5.0&lt;/a&gt; last July:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;MockWebServers&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MockWebServerAssertions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RecordedRequests&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RecordedRequestAssertions&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;mockwebservers&quot;&gt;MockWebServers&lt;/h3&gt;

&lt;p&gt;This currently contains only two overloaded methods named &lt;code&gt;uri&lt;/code&gt;. These are convenience methods to get the URI for a &lt;code&gt;MockWebServer&lt;/code&gt;, either with or without a path. For example, instead of:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
this.baseUri = server.url(&amp;quot;/math&amp;quot;).uri();
&lt;/pre&gt;

&lt;p&gt;you can do this:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
this.baseUri = MockWebServers.uri(server, &amp;quot;/math&amp;quot;);
&lt;/pre&gt;

&lt;p&gt;And with a static import for &lt;code&gt;MockWebServers&lt;/code&gt;, the code is even shorter.&lt;/p&gt;

&lt;p&gt;Are these methods really worth it for such a small amount of boilerplate? Maybe, maybe not. Once I had written similar code a few dozen times, I decided it was worth having methods that accomplished the same thing.&lt;/p&gt;

&lt;p&gt;Generally, I use these methods in &lt;code&gt;@BeforeEach&lt;/code&gt; methods and store the value in a field, so that all tests can easily access it. Sometimes you don&amp;#39;t need to store it in a field, but instead just pass it to the HTTP client:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
var baseUri = MockWebServers.uri(server, &amp;quot;/math&amp;quot;);
this.mathClient = new MathApiClient(client, baseUri);
&lt;/pre&gt;

&lt;p&gt;In this example, the &lt;code&gt;mathClient&lt;/code&gt; is stored in a field and each test uses it.&lt;/p&gt;

&lt;h3 id=&quot;mockwebserverassertions&quot;&gt;MockWebServerAssertions&lt;/h3&gt;

&lt;p&gt;This class is a starting point for assertions on a &lt;code&gt;MockWebServer&lt;/code&gt;. It contains a few static factory methods to start from, one named &lt;code&gt;assertThat&lt;/code&gt; and one named &lt;code&gt;assertThatMockWebServer&lt;/code&gt;. The reason for the second one is to avoid conflicts with AssertJ&amp;#39;s &lt;code&gt;Assertions#assertThat&lt;/code&gt; methods. It provides a way to assert the number of requests made to the &lt;code&gt;MockWebServer&lt;/code&gt; and has several other methods to assert on &lt;code&gt;RecordedRequest&lt;/code&gt;. For example, assuming you use a static import:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
assertThatMockWebServer(server)
        .hasRequestCount(1)
        .recordedRequest()
        .isGET()
        .hasPath(&amp;quot;/status&amp;quot;);
&lt;/pre&gt;

&lt;p&gt;This code verifies that exactly one request was made, then uses the &lt;code&gt;recordedRequest()&lt;/code&gt; method to get the &lt;code&gt;RecordedRequest&lt;/code&gt;, and finally makes assertions that the request was a &lt;code&gt;GET&lt;/code&gt; with path &lt;code&gt;/status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you want to verify more than one request, you can use the &lt;code&gt;hasRecordedRequest&lt;/code&gt; method. The following code verifies that two requests were made, and checks each one in the &lt;code&gt;Consumer&lt;/code&gt; that is passed to &lt;code&gt;hasRecordedRequest&lt;/code&gt;:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
var path1 = &amp;quot;...&amp;quot;;
var path2 = &amp;quot;...&amp;quot;;
var requestBody = &amp;quot;{ ... }&amp;quot;;

assertThatMockWebServer(server)
        .hasRequestCount(2)
        .hasRecordedRequest(recordedRequest1 -&amp;gt; {
            assertThat(recordedRequest1.getMethod()).isEqualTo(&amp;quot;GET&amp;quot;);
            assertThat(recordedRequest1.getPath()).isEqualTo(path1);
        })
        .hasRecordedRequest(recordedRequest2 -&amp;gt; {
            assertThat(recordedRequest2.getMethod()).isEqualTo(&amp;quot;POST&amp;quot;);
            assertThat(recordedRequest2.getPath()).isEqualTo(path2);
            assertThat(recordedRequest2.getBody().readUtf8()).isEqualTo(requestBody);
        });
&lt;/pre&gt;

&lt;h3 id=&quot;recordedrequests&quot;&gt;RecordedRequests&lt;/h3&gt;

&lt;p&gt;While &lt;code&gt;MockWebServers&lt;/code&gt; and &lt;code&gt;MockWebServerAssertions&lt;/code&gt; are useful, &lt;code&gt;RecordedRequests&lt;/code&gt; and &lt;code&gt;RecordedRequestAssertions&lt;/code&gt; (discussed below) are the tools I use most when writing HTTP client tests.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RecordedRequests&lt;/code&gt; contains several methods to get a &lt;code&gt;RecordedRequest&lt;/code&gt; from a &lt;code&gt;MockWebServer&lt;/code&gt;. The method to use depends on whether there &lt;em&gt;must be&lt;/em&gt; a request, or whether there &lt;em&gt;may or may not be&lt;/em&gt; a request. If a request is required, you can use &lt;code&gt;takeRequiredRequest&lt;/code&gt;:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
var recordedRequest = takeRequiredRequest(server);

// make assertions on the RecordedRequest instance
&lt;/pre&gt;

&lt;p&gt;But if it&amp;#39;s possible that there might not be a request, you can use either &lt;code&gt;takeRequestOrEmpty&lt;/code&gt; or &lt;code&gt;takeRequestOrNull&lt;/code&gt;. The former returns &lt;code&gt;Optional&amp;lt;RecordedRequest&amp;gt;&lt;/code&gt; while the latter returns a (possibly &lt;code&gt;null&lt;/code&gt;) &lt;code&gt;RecordedRequest&lt;/code&gt;. For example, if some business logic makes a request but only when certain requirements are met, a test can use one of these two methods:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
// work with an Optional&amp;lt;RecordedRequest&amp;gt;
var maybeRequest = takeRequestOrEmpty(server);
assertThat(maybeRequest).isEmpty();

// or with a RecordedRequest directly
var request = takeRequestOrNull(server);
assertThat(request).isNull();
&lt;/pre&gt;

&lt;p&gt;But wait, there&amp;#39;s more. Not much, but there is another method &lt;code&gt;assertNoMoreRequests&lt;/code&gt; that does what you expect: it verifies the &lt;code&gt;MockWebServer&lt;/code&gt; does not contain any additional requests. So, once you have checked one or more requests, you can call it to verify the client didn&amp;#39;t do anything else unexpected:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
// get and assert one or more RecordedRequest

// now, verify there weren&amp;#39;t any additional requests
assertNoMoreRequests(server);
&lt;/pre&gt;

&lt;p&gt;As mentioned in the introduction, the &lt;code&gt;MockWebServer#takeRequest()&lt;/code&gt; method blocks, &lt;em&gt;possibly forever&lt;/em&gt;. &lt;code&gt;RecordedRequests&lt;/code&gt; avoids this problem by assuming all requests should already have been made by the time you want to get a request and make assertions on it.&lt;/p&gt;

&lt;p&gt;Under the hood, all &lt;code&gt;RecordedRequests&lt;/code&gt; methods call &lt;code&gt;takeRequest(timeout: Long, unit: TimeUnit)&lt;/code&gt; (it&amp;#39;s Kotlin, so the argument name is first and the type is second) and only wait 10 milliseconds before giving up. They handle &lt;code&gt;InterruptedException&lt;/code&gt; by catching it, re-interrupting the current thread, and throwing an &lt;code&gt;UncheckedInterruptedException&lt;/code&gt; (from the &lt;a href=&quot;https://github.com/kiwiproject/kiwi&quot;&gt;kiwi&lt;/a&gt; library). This allows for cleaner test code without needing to catch &lt;code&gt;InterruptedException&lt;/code&gt; or declare a &lt;code&gt;throws&lt;/code&gt; clause. So, your test code can just do this without worrying about timeouts:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
var recordedRequest = RecordedRequests.takeRequiredRequest(server);
&lt;/pre&gt;
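
&lt;p&gt;A minimal sketch of how such a method can be implemented (the actual kiwi-test code differs in its details; &lt;code&gt;UncheckedInterruptedException&lt;/code&gt; comes from kiwi as described above):&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
public static RecordedRequest takeRequiredRequest(MockWebServer server) {
    try {
        // all requests should already have been made, so a short wait is enough
        var request = server.takeRequest(10, TimeUnit.MILLISECONDS);
        assertThat(request)
                .describedAs(&amp;quot;expected a request, but none was made&amp;quot;)
                .isNotNull();
        return request;
    } catch (InterruptedException e) {
        // re-interrupt the thread, then rethrow as an unchecked exception
        Thread.currentThread().interrupt();
        throw new UncheckedInterruptedException(e);
    }
}
&lt;/pre&gt;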

&lt;h3 id=&quot;recordedrequestassertions&quot;&gt;RecordedRequestAssertions&lt;/h3&gt;

&lt;p&gt;You use the methods in &lt;code&gt;RecordedRequests&lt;/code&gt; to get one or more &lt;code&gt;RecordedRequest&lt;/code&gt; to make assertions on. You can use &lt;code&gt;RecordedRequestAssertions&lt;/code&gt; to make these assertions in a fluent-style API like AssertJ. If you don&amp;#39;t like the AssertJ assertion chaining style, you can skip this section and move on with life. But if you like AssertJ, read on.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RecordedRequestAssertions&lt;/code&gt; contains several static methods to start from, and a number of assertion methods to check things like the request method, path, URI, and body. For example, suppose you are using the &amp;quot;Math API&amp;quot; from the previous blog and want to test addition. You can do this:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
assertThatRecordedRequest(recordedRequest)
        .isGET()
        .hasPath(&amp;quot;/math/add/40/2&amp;quot;)
        .hasNoBody();
&lt;/pre&gt;

&lt;p&gt;Here you are checking that a &lt;code&gt;GET&lt;/code&gt; request was made to the server with path &lt;code&gt;/math/add/40/2&lt;/code&gt;, and that there was no request body (since &lt;code&gt;GET&lt;/code&gt; requests should in general not have one).&lt;/p&gt;

&lt;p&gt;You can also verify the request body. Suppose you have a &amp;quot;User API&amp;quot; to perform various actions. To test a request sent to the &amp;quot;Create User&amp;quot; endpoint, you can write a test like this:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
@Test
void shouldCreateUser() {
    var id = RandomGenerator.getDefault().nextLong(1, 501);
    var responseEntity = User.newWithRedactedPassword(id, &amp;quot;s_white&amp;quot;, &amp;quot;Shaun White&amp;quot;);

    server.enqueue(new MockResponse()
            .setResponseCode(201)
            .setHeader(HttpHeaders.CONTENT_TYPE, &amp;quot;application/json&amp;quot;)
            .setHeader(HttpHeaders.LOCATION, UriBuilder.fromUri(baseUri).path(&amp;quot;/users/{id}&amp;quot;).build(id))
            .setBody(JSON_HELPER.toJson(responseEntity)));

    var newUser = new User(null, &amp;quot;s_white&amp;quot;, &amp;quot;snowboarding&amp;quot;, &amp;quot;Shaun White&amp;quot;);
    var createdUser = apiClient.create(newUser);

    assertAll(
            () -&amp;gt; assertThat(createdUser.id()).isEqualTo(id),
            () -&amp;gt; assertThat(createdUser.username()).isEqualTo(&amp;quot;s_white&amp;quot;),
            () -&amp;gt; assertThat(createdUser.password()).isEqualTo(User.REDACTED_PASSWORD)
    );

    var recordedRequest = RecordedRequests.takeRequiredRequest(server);

    assertThatRecordedRequest(recordedRequest)
            .isPOST()
            .hasHeader(&amp;quot;Accept&amp;quot;, &amp;quot;application/json&amp;quot;)
            .hasPath(&amp;quot;/users&amp;quot;)
            .hasBody(JSON_HELPER.toJson(newUser));
            
    RecordedRequests.assertNoMoreRequests(server);
}
&lt;/pre&gt;

&lt;p&gt;This test does the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a sample User entity&lt;/li&gt;
&lt;li&gt;Set up the response that the &lt;code&gt;MockWebServer&lt;/code&gt; should return&lt;/li&gt;
&lt;li&gt;Call the &lt;code&gt;create&lt;/code&gt; method on the &amp;quot;User API&amp;quot; client&lt;/li&gt;
&lt;li&gt;Make some assertions on the returned &lt;code&gt;User&lt;/code&gt; object&lt;/li&gt;
&lt;li&gt;Get the recorded request from &lt;code&gt;MockWebServer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Check the request&lt;/li&gt;
&lt;li&gt;Verify that there are no more requests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To check the request, we verify that the request was a &lt;code&gt;POST&lt;/code&gt; to &lt;code&gt;/users&lt;/code&gt;, that it contains the required &lt;code&gt;Accept&lt;/code&gt; header, and that it has the expected body. If the API is using JSON, then instead of doing the Object-to-JSON conversion manually, you can use &lt;code&gt;hasJsonBodyWithEntity&lt;/code&gt;:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
assertThatRecordedRequest(recordedRequest)
        .isPOST()
        .hasHeader(&amp;quot;Accept&amp;quot;, &amp;quot;application/json&amp;quot;)
        .hasPath(&amp;quot;/users&amp;quot;)
        .hasJsonBodyWithEntity(newUser);
&lt;/pre&gt;

&lt;p&gt;This will use a default &lt;a href=&quot;https://github.com/kiwiproject/kiwi&quot;&gt;kiwi&lt;/a&gt; &lt;code&gt;JsonHelper&lt;/code&gt; instance. If you need control over the JSON serialization, you can use one of the overloaded &lt;code&gt;hasJsonBodyWithEntity&lt;/code&gt; methods, which accept either a &lt;code&gt;JsonHelper&lt;/code&gt; or a Jackson &lt;code&gt;ObjectMapper&lt;/code&gt;. For example:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
ObjectMapper mapper = customObjectMapper();

assertThatRecordedRequest(recordedRequest)
        .isPOST()
        .hasHeader(&amp;quot;Accept&amp;quot;, &amp;quot;application/json&amp;quot;)
        .hasPath(&amp;quot;/users&amp;quot;)
        .hasJsonBodyWithEntity(newUser, mapper);
&lt;/pre&gt;

&lt;p&gt;There are various other methods in &lt;code&gt;RecordedRequestAssertions&lt;/code&gt; as well, for example methods to check the TLS version or whether there is a failure, perhaps because the inbound request was truncated. But the assertions in the above examples handle most of the use cases I&amp;#39;ve needed when writing HTTP client tests.&lt;/p&gt;

&lt;h3 id=&quot;wrapping-up&quot;&gt;Wrapping Up&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/kiwiproject/kiwi-test&quot;&gt;kiwi-test&lt;/a&gt; library contains test utilities for making HTTP client testing with &lt;code&gt;MockWebServer&lt;/code&gt; just a bit less tedious, with a little less boilerplate, and provides AssertJ-style fluent assertions for &lt;code&gt;RecordedRequest&lt;/code&gt;. You can use these utilities to write cleaner and less &amp;quot;boilerplate-y&amp;quot; tests.&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/testing_http_client_code_with</guid>
    <title>Testing HTTP Client Code with MockWebServer</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/testing_http_client_code_with</link>
        <pubDate>Thu, 16 Jan 2025 02:09:18 +0000</pubDate>
    <category>Development</category>
    <category>web</category>
    <category>mock</category>
    <category>jupiter</category>
    <category>junit</category>
    <category>software</category>
    <category>java</category>
    <category>testing</category>
            <description>&lt;p&gt;When testing HTTP client code, it can be challenging to verify your application&apos;s behavior. For example, if you have an HTTP client that makes calls to some third-party API, or even to another service that you control, you want to make sure that you are sending the correct requests and handling the responses properly. There are various libraries available to help, and many times the library or framework you&apos;re using provides some kind of test support.&lt;/p&gt;

&lt;p&gt;For example, I&apos;ve used Dropwizard to create REST-based web services for a number of years. Dropwizard uses Jersey, which is the reference implementation of Jakarta RESTful Web Services (formerly known as JAX-RS). Dropwizard provides a way to test HTTP client implementations by creating a resource within your test that acts as a &quot;test double&quot; of the real server you are trying to simulate. When the test executes, a real HTTP server is started that can respond to real HTTP requests. No mocking, which is important since mocks can&apos;t easily simulate all the various things that can happen with HTTP requests.&lt;/p&gt;

&lt;p&gt;Suppose you have an HTTP client that uses Jersey &lt;code&gt;Client&lt;/code&gt; to call a &quot;Math API&quot;. For now, you only care about adding two numbers, so your client looks like:&lt;/p&gt;

&lt;style type=&quot;text/css&quot;&gt;
.prettyprint {
background-color:#EFEFEF;
border:1px solid #CCCCCC;
font-size:small;
overflow:auto;
padding:5px;
}
&lt;/style&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
public class MathApiClient {

    private final Client client;
    private final URI baseUri;

    public MathApiClient(Client client, URI baseUri) {
        this.client = client;
        this.baseUri = baseUri;
    }

    public int add(int a, int b) {
        var response = client.target(baseUri)
                .path(&quot;/math/add/{a}/{b}&quot;)
                .resolveTemplate(&quot;a&quot;, a)
                .resolveTemplate(&quot;b&quot;, b)
                .request()
                .get();

        return response.readEntity(Integer.class);
    }
}
&lt;/pre&gt;

&lt;p&gt;You want to design the client for easy testing, so the constructor accepts a Jersey &lt;code&gt;Client&lt;/code&gt; and a &lt;code&gt;URI&lt;/code&gt;, which lets you easily change the target server location. That&apos;s important, since you need to be able to provide the URI of the test server.&lt;/p&gt;

&lt;p&gt;Here&apos;s an example of a Math API test class using Dropwizard&apos;s integration testing support:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
@ExtendWith(DropwizardExtensionsSupport.class)
class DropwizardMathApiClientTest {

    @Path(&quot;/math&quot;)
    public static class MathStubResource {
        @GET
        @Path(&quot;/add/{a}/{b}&quot;)
        @Produces(MediaType.TEXT_PLAIN)
        public Response add(@PathParam(&quot;a&quot;) int a, @PathParam(&quot;b&quot;) int b) {
            var answer = a + b;
            return Response.ok(answer).build();
        }
    }

    private static final DropwizardClientExtension CLIENT_EXTENSION =
            new DropwizardClientExtension(new MathStubResource());

    private MathApiClient mathClient;
    private Client client;

    @BeforeEach
    void setUp() {
        client = ClientBuilder.newBuilder()
                .connectTimeout(500, TimeUnit.MILLISECONDS)
                .readTimeout(500, TimeUnit.MILLISECONDS)
                .build();
        var baseUri = CLIENT_EXTENSION.baseUri();
        mathClient = new MathApiClient(client, baseUri);
    }

    @AfterEach
    void tearDown() {
        client.close();
    }

    @Test
    void shouldAdd() {
        assertThat(mathClient.add(40, 2)).isEqualTo(42);
    }
}
&lt;/pre&gt;

&lt;p&gt;In this code, it&apos;s the &lt;code&gt;DropwizardClientExtension&lt;/code&gt; that provides all the real HTTP server functionality. You provide it the stub resource (a new &lt;code&gt;MathStubResource&lt;/code&gt; instance) and it takes care of starting a real application that responds to HTTP requests as you defined in the stub resource. Then you write tests that use the &lt;code&gt;MathApiClient&lt;/code&gt;, make assertions as you normally would, and so on.&lt;/p&gt;

&lt;p&gt;This works great, but there are some downsides. First, there is no way to (easily) verify the HTTP &lt;em&gt;requests&lt;/em&gt; that the HTTP client made. The client makes the HTTP request and handles the response, but unless it provides some way to access the requests it has made, there&apos;s not really any way to verify this. You can add code into the stub resource to capture the requests, and provide a way for test code to access them, but that adds complexity to the stub resource.&lt;/p&gt;
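
&lt;p&gt;For example, a request-capturing stub might record each request path in a field that tests can inspect afterward. This is a hypothetical sketch (the class and field names are mine, and the JAX-RS annotations are omitted to keep it minimal):&lt;/p&gt;

```java
// Hypothetical request-capturing stub (names are mine). The JAX-RS
// annotations (@Path, @GET, @PathParam) are omitted here; in a real stub
// resource the add method would be annotated just like MathStubResource.
public class CapturingMathStub {

    // The most recent request path, recorded so tests can verify it later.
    static volatile String lastRequestPath;

    public int add(int a, int b) {
        lastRequestPath = "/math/add/" + a + "/" + b;
        return a + b;
    }
}
```

&lt;p&gt;A test could then assert on the captured path after calling the client, but every additional detail you want to capture (headers, bodies, query parameters) means more code in the stub.&lt;/p&gt;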

&lt;p&gt;Second, while testing the &quot;happy path&quot; is straightforward, things quickly become more difficult if you want to test errors, invalid input, and other &quot;not happy path&quot; scenarios. For example, let&apos;s say you want to test how your client responds when it receives an error response such as a &lt;code&gt;400 Bad Request&lt;/code&gt; or &lt;code&gt;500 Internal Server Error&lt;/code&gt;. How can you do this? One way is &quot;magic input&quot; where the server responds with a &lt;code&gt;400&lt;/code&gt; when you provide one set of input (e.g., whenever &lt;code&gt;a&lt;/code&gt; is &lt;code&gt;84&lt;/code&gt;) and a &lt;code&gt;500&lt;/code&gt; when you provide a different input (e.g., whenever &lt;code&gt;a&lt;/code&gt; is &lt;code&gt;142&lt;/code&gt;). Depending on the number of error cases you want to test, the stub resource code can quickly get complicated with conditionals. Another way is to use some kind of &quot;flag&quot; field inside the test stub resource class, where each test can &quot;record&quot; the response it wants. But this starts to become a &quot;mini-framework&quot; as you need more and more features.&lt;/p&gt;
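
&lt;p&gt;To make the &quot;magic input&quot; approach concrete, here is a hypothetical sketch of the dispatch logic such a stub might contain, using the example values above (the class and method names are mine):&lt;/p&gt;

```java
// Hypothetical "magic input" dispatch (names are mine): certain reserved
// input values trigger error responses, everything else succeeds. Each new
// error scenario adds another branch to this conditional chain.
public class MagicInputMathStub {

    // Returns the HTTP status code the stub should respond with for input a.
    public int statusCodeFor(int a) {
        if (a == 84) {
            return 400;  // simulate a Bad Request
        }
        if (a == 142) {
            return 500;  // simulate an Internal Server Error
        }
        return 200;      // normal response
    }
}
```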

&lt;p&gt;Something else you can do is to create &lt;em&gt;separate&lt;/em&gt; tests with different stub resources for different scenarios. But again, this can get out of control quickly if your HTTP client has a lot of methods and you want to test each one thoroughly.&lt;/p&gt;

&lt;p&gt;Despite these shortcomings, you can still write good HTTP tests using what Dropwizard (and other similar libraries) provides. I&apos;ve used the Dropwizard test support for the vast majority of HTTP client testing over the past few years. But I&apos;ve recently come across the excellent &lt;code&gt;MockWebServer&lt;/code&gt; from OkHttp. &lt;em&gt;Basically, it is like a combination of a real HTTP server to test against and a mocking library such as Mockito.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To test HTTP clients using &lt;code&gt;MockWebServer&lt;/code&gt;, you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Record the responses you want to receive&lt;/li&gt;
&lt;li&gt;Run your HTTP client code&lt;/li&gt;
&lt;li&gt;Make assertions about the result from the client (if any)&lt;/li&gt;
&lt;li&gt;Verify the client made the expected &lt;em&gt;requests&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is very similar to using a mocking library like Mockito, except that &lt;code&gt;MockWebServer&lt;/code&gt; lets you test against the full HTTP/HTTPS request/response lifecycle in a realistic manner. So, rewriting the above test to use &lt;code&gt;MockWebServer&lt;/code&gt; looks like:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
class OkHttpMathApiClientTest {

    private MathApiClient mathClient;
    private Client client;
    private MockWebServer server;

    @BeforeEach
    void setUp() throws URISyntaxException {
        client = ClientBuilder.newBuilder()
                .connectTimeout(500, TimeUnit.MILLISECONDS)
                .readTimeout(500, TimeUnit.MILLISECONDS)
                .build();

        server = new MockWebServer();
        var baseUri = server.url(&quot;/&quot;).uri();

        mathClient = new MathApiClient(client, baseUri);
    }

    @AfterEach
    void tearDown() throws IOException {
        client.close();
        server.close();
    }

    @Test
    void shouldAdd() throws InterruptedException {
        server.enqueue(new MockResponse()
                .setResponseCode(200)
                .setHeader(HttpHeaders.CONTENT_TYPE, &quot;text/plain&quot;)
                .setBody(&quot;42&quot;));

        assertThat(mathClient.add(40, 2)).isEqualTo(42);

        var recordedRequest = server.takeRequest(1, TimeUnit.SECONDS);
        assertThat(recordedRequest).isNotNull();

        assertAll(
                () -&amp;gt; assertThat(recordedRequest.getMethod()).isEqualTo(&quot;GET&quot;),
                () -&amp;gt; assertThat(recordedRequest.getPath()).isEqualTo(&quot;/math/add/40/2&quot;),
                () -&amp;gt; assertThat(recordedRequest.getBodySize()).isZero()
        );
    }
}
&lt;/pre&gt;

&lt;p&gt;In this test, we first &lt;em&gt;record&lt;/em&gt; the response (or responses) we want to receive by calling &lt;code&gt;enqueue&lt;/code&gt; with a &lt;code&gt;MockResponse&lt;/code&gt;. Don&apos;t let the &quot;Mock&quot; in the name fool you, though, since this just tells &lt;code&gt;MockWebServer&lt;/code&gt; the response you want. It will take care of returning a real HTTP response from a real HTTP server. The next line in the test is the same as in the Dropwizard example above, where we call the HTTP client and assert the result. But after that, &lt;code&gt;MockWebServer&lt;/code&gt; lets you get the requests that the client code made using &lt;code&gt;takeRequest&lt;/code&gt;, so you can verify that it sent exactly what it should have, with the expected path, query parameters, headers, body, etc.&lt;/p&gt;

&lt;p&gt;One advantage of using &lt;code&gt;MockWebServer&lt;/code&gt; is that it is really easy to record different responses and test how your client responds. For example, suppose the Math API returns a &lt;code&gt;400&lt;/code&gt; response if you provide two numbers that add up to a number higher than the maximum value of a Java &lt;code&gt;int&lt;/code&gt;, or a &lt;code&gt;500&lt;/code&gt; response if there is a server error. Here are a few tests for those situations:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
@Test
void shouldThrowIllegalArgumentException_ForInvalidInput() throws InterruptedException {
    server.enqueue(new MockResponse()
            .setResponseCode(400)
            .setHeader(HttpHeaders.CONTENT_TYPE, &quot;text/plain&quot;)
            .setBody(&quot;overflow&quot;));

    assertThatIllegalArgumentException()
            .isThrownBy(() -&amp;gt; mathClient.add(Integer.MAX_VALUE, 1))
            .withMessage(&quot;Invalid arguments: overflow&quot;);

    var recordedRequest = server.takeRequest(1, TimeUnit.SECONDS);
    assertThat(recordedRequest).isNotNull();

    assertAll(
            () -&amp;gt; assertThat(recordedRequest.getMethod()).isEqualTo(&quot;GET&quot;),
            () -&amp;gt; assertThat(recordedRequest.getPath()).isEqualTo(&quot;/math/add/&quot; + Integer.MAX_VALUE + &quot;/1&quot;)
    );
}

@Test
void shouldThrowIllegalStateException_ForServerError() throws InterruptedException {
    server.enqueue(new MockResponse()
            .setResponseCode(500)
            .setHeader(HttpHeaders.CONTENT_TYPE, &quot;text/plain&quot;)
            .setBody(&quot;Server error: can&apos;t add right now&quot;));

    assertThatIllegalStateException()
            .isThrownBy(() -&amp;gt; mathClient.add(2, 2))
            .withMessage(&quot;Unknown error: Server error: can&apos;t add right now&quot;);

    var recordedRequest = server.takeRequest(1, TimeUnit.SECONDS);
    assertThat(recordedRequest).isNotNull();

    assertAll(
            () -&amp;gt; assertThat(recordedRequest.getMethod()).isEqualTo(&quot;GET&quot;),
            () -&amp;gt; assertThat(recordedRequest.getPath()).isEqualTo(&quot;/math/add/2/2&quot;)
    );
}
&lt;/pre&gt;

&lt;p&gt;Each test defines the response(s) that &lt;code&gt;MockWebServer&lt;/code&gt; should send. This makes it possible to create clean, self-contained test code that is easy to understand and change.&lt;/p&gt;

&lt;p&gt;To make these tests pass, we should update the original implementation with some error handling code:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
public int add(int a, int b) {
    var response = client.target(baseUri)
            .path(&quot;/math/add/{a}/{b}&quot;)
            .resolveTemplate(&quot;a&quot;, a)
            .resolveTemplate(&quot;b&quot;, b)
            .request()
            .get();

    if (successful(response)) {
        return response.readEntity(Integer.class);
    } else if (clientError(response)) {
        throw new IllegalArgumentException(&quot;Invalid arguments: &quot; + response.readEntity(String.class));
    }

    throw new IllegalStateException(&quot;Unknown error: &quot; + response.readEntity(String.class));
}
&lt;/pre&gt;
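
&lt;p&gt;The &lt;code&gt;successful&lt;/code&gt; and &lt;code&gt;clientError&lt;/code&gt; helpers aren&apos;t shown here. A minimal sketch, assuming they simply check the status code family (the real versions would accept the JAX-RS &lt;code&gt;Response&lt;/code&gt; rather than a bare status code):&lt;/p&gt;

```java
// Hypothetical implementations of the successful and clientError helpers
// (not shown in the post). This sketch assumes they just check the HTTP
// status code family: 2xx means success, 4xx means a client error.
public class StatusCheckers {

    public static boolean successful(int statusCode) {
        return statusCode / 100 == 2;
    }

    public static boolean clientError(int statusCode) {
        return statusCode / 100 == 4;
    }
}
```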

&lt;p&gt;The code examples (adding two numbers) I&apos;ve used are simple. In &quot;real life&quot; you are probably calling more complicated and expansive APIs, and need to test various success and failure scenarios. To recap, some of the advantages of using &lt;code&gt;MockWebServer&lt;/code&gt; in your HTTP client tests are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can record different responses for each test (similar to setting up mock objects, e.g., Mockito)&lt;/li&gt;
&lt;li&gt;You can avoid having to implement &quot;stub&quot; resources that are a &quot;shadow API&quot; of the remote API&lt;/li&gt;
&lt;li&gt;You can avoid complicating &quot;stub&quot; resources with logic that provides different responses based on inputs or other signals&lt;/li&gt;
&lt;li&gt;You can verify the &lt;em&gt;requests&lt;/em&gt; that were made, like how you verify method calls with mocking (e.g., Mockito)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are other things you can do with &lt;code&gt;MockWebServer&lt;/code&gt;; for example, you can throttle responses to simulate a slow network and test timeout and retry behavior. You can also test with and without HTTPS, require client certificates, and customize the supported protocols. These are all things you could build in custom code, but it&apos;s much nicer when they come out of the box.&lt;/p&gt;
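
&lt;p&gt;As a rough sketch (assuming okhttp3&apos;s &lt;code&gt;MockWebServer&lt;/code&gt; is on the classpath; the class and method names here are mine), simulating a slow response might look like this, using &lt;code&gt;setBodyDelay&lt;/code&gt; to pause before the body is sent:&lt;/p&gt;

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.TimeUnit;
import okhttp3.mockwebserver.MockResponse;
import okhttp3.mockwebserver.MockWebServer;

// A sketch of simulating a slow network (class and method names are mine).
// setBodyDelay makes the server pause before sending the body; throttleBody
// can similarly trickle the body out a few bytes at a time.
public class SlowResponseExample {

    // Returns how many milliseconds it took to read the delayed response body.
    public static long timeSlowResponse() throws Exception {
        try (MockWebServer server = new MockWebServer()) {
            server.enqueue(new MockResponse()
                    .setResponseCode(200)
                    .setBodyDelay(1, TimeUnit.SECONDS)  // pause before the body
                    .setBody("42"));

            var url = new URL(server.url("/math/add/40/2").toString());
            var conn = (HttpURLConnection) url.openConnection();

            long start = System.nanoTime();
            byte[] body = conn.getInputStream().readAllBytes();
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

            if (!new String(body).equals("42")) {
                throw new IllegalStateException("unexpected body");
            }
            return elapsedMillis;
        }
    }
}
```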

&lt;p&gt;To sum up, &lt;code&gt;MockWebServer&lt;/code&gt; makes it simple to write tests for HTTP client code, allowing you to test the &quot;happy path&quot; and various failure scenarios, and provides support for more advanced testing situations such as when requiring client certificate authentication or simulating network slowness.&lt;/p&gt;

</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/junit_pioneer_presentation_slides</guid>
    <title>JUnit Pioneer Presentation Slides</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/junit_pioneer_presentation_slides</link>
        <pubDate>Fri, 5 Mar 2021 17:07:49 +0000</pubDate>
    <category>Development</category>
    <category>junit</category>
    <category>software</category>
    <category>java</category>
    <category>jupiter</category>
    <category>testing</category>
            <description>&lt;p&gt;Recently I&apos;ve been using JUnit Pioneer, which is an extension library for JUnit Jupiter (JUnit 5). It contains a lot of useful annotations that are really easy to use in tests, for example to generate a range of numbers for input into a parameterized test. This is a presentation about Pioneer that I gave on March 4, 2021.&lt;/p&gt;

&lt;p&gt;&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/q7ESf4x7CvqX8v&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/scottleber/junit-pioneer&quot; title=&quot;JUnit Pioneer&quot; target=&quot;_blank&quot;&gt;JUnit Pioneer&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;//www.slideshare.net/scottleber&quot; target=&quot;_blank&quot;&gt;Scott Leberknight&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;&lt;/p&gt;

&lt;p&gt;In case the embedded slideshow doesn&#8217;t work properly here is a &lt;a href=&quot;https://www.slideshare.net/scottleber/junit-pioneer/&quot; target=&quot;_blank&quot;&gt;link&lt;/a&gt; to the slides (opens in a new window/tab).&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/unit_testing_presentation_slides</guid>
    <title>Unit Testing Presentation Slides</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/unit_testing_presentation_slides</link>
        <pubDate>Thu, 11 Jul 2019 03:12:52 +0000</pubDate>
    <category>Development</category>
    <category>testing</category>
    <category>development</category>
    <category>junit</category>
    <category>software</category>
            <description>&lt;p&gt;We have several interns this summer, and each Friday we&apos;re doing a short presentation on a different software development topic. On June 28, I gave a short presentation on (unit) testing. This presentation is very light on code, and heavier on philosophy. I shared the slides on SlideShare and have embedded them below.&lt;/p&gt;

&lt;p&gt;&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/R6AW6ku316gqx&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/scottleber/unit-testing-153856579&quot; title=&quot;Unit Testing&quot; target=&quot;_blank&quot;&gt;Unit Testing&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/scottleber&quot; target=&quot;_blank&quot;&gt;Scott Leberknight&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;&lt;/p&gt;

&lt;p&gt;In case the embedded slideshow doesn&#8217;t work properly here is a &lt;a href=&quot;https://www.slideshare.net/scottleber/unit-testing-153856579&quot; target=&quot;_blank&quot;&gt;link&lt;/a&gt; to the slides (opens in a new window/tab).&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/sdkman_presentation_slides</guid>
    <title>SDKMAN! Presentation Slides</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/sdkman_presentation_slides</link>
        <pubDate>Sun, 7 Apr 2019 19:29:08 +0000</pubDate>
    <category>Development</category>
    <category>jdk</category>
    <category>java</category>
    <category>sdk</category>
            <description>&lt;p&gt;I&#8217;ve been using SDKMAN! for a while now to make it really easy to install and manage multiple versions of various SDKs like Java, Kotlin, Groovy, and so on. I recently gave a mini-talk on SDKMAN! and have embedded the slides below.&lt;/p&gt;

&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/edcaPm3tyd8IG8&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/scottleber/sdkman-138942408&quot; title=&quot;SDKMAN!&quot; target=&quot;_blank&quot;&gt;SDKMAN!&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/scottleber&quot; target=&quot;_blank&quot;&gt;Scott Leberknight&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;

&lt;p&gt;In case the embedded slideshow doesn&#8217;t work properly here is a &lt;a href=&quot;https://www.slideshare.net/scottleber/sdkman-138942408&quot; target=&quot;_blank&quot;&gt;link&lt;/a&gt; to the slides (opens in a new window/tab).&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/junit_5_presentation_slides</guid>
    <title>JUnit 5 Presentation Slides</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/junit_5_presentation_slides</link>
        <pubDate>Mon, 13 Aug 2018 12:36:08 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>assertj</category>
    <category>junit</category>
    <category>testing</category>
            <description>&lt;p&gt;I just gave a short presentation on &lt;a href=&quot;https://junit.org/junit5/&quot;&gt;JUnit 5&lt;/a&gt; at my company, &lt;a href=&quot;https://www.fortitudetec.com&quot;&gt;Fortitude Technologies&lt;/a&gt;. JUnit 5 adds a bunch of useful features for developer testing such as parameterized tests, a more flexible extension model, and a lot more. Plus, it aims to provide a cleaner separation between the testing &lt;em&gt;platform&lt;/em&gt; that IDEs and build tools like Maven and Gradle use, versus the developer testing APIs. It also provides an easy migration path from JUnit 4 (or earlier) by letting you run JUnit 3, 4, &lt;em&gt;and&lt;/em&gt; 5 tests in the same project. Here are the slides:&lt;/p&gt;

&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/zPokgsfF8gLdBY&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen&gt;
&lt;/iframe&gt;

&lt;div style=&quot;margin-bottom:5px&quot;&gt;
&lt;strong&gt;&lt;a href=&quot;//www.slideshare.net/scottleber/junit-5-108430614&quot; title=&quot;JUnit 5&quot; target=&quot;_blank&quot;&gt;JUnit 5&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;https://www.slideshare.net/scottleber&quot; target=&quot;_blank&quot;&gt;Scott Leberknight&lt;/a&gt;&lt;/strong&gt;
&lt;/div&gt;

&lt;p&gt;In case the embedded slide show does not display properly, here is a link to the &lt;a href=&quot;https://www.slideshare.net/scottleber/junit-5-108430614&quot;&gt;slides&lt;/a&gt; on Slideshare. The sample code for the presentation is on GitHub &lt;a href=&quot;https://github.com/sleberknight/junit5-presentation-code&quot;&gt;here&lt;/a&gt;.
&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/process_api_improvements_in_jdk9</guid>
    <title>Process API Improvements in JDK9</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/process_api_improvements_in_jdk9</link>
        <pubDate>Tue, 4 Apr 2017 12:32:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>jdk9</category>
            <description>&lt;p&gt;Over the past year, several microservices I have worked on responded to specific events and then executed native OS processes, for example launching custom C++ applications, Python scripts, etc. In addition to simply launching processes, those services also needed to obtain information about running processes upon request, or shut down processes upon receiving shutdown events. A lot of what the services were doing was controlling native processes in response to specific external events, whether via JMS queues, Kafka topics, or even XML files dropped in specific directories.&lt;/p&gt;

&lt;p&gt;Since the microservices were implemented in Java, I had to use the less-than-stellar &lt;code&gt;Process&lt;/code&gt; API, which provides only the most basic support. Even though a few additional features were added in Java 8 - such as being able to check if a process is alive using &lt;code&gt;Process#isAlive&lt;/code&gt; and waiting for process exit with a timeout - you still cannot obtain a handle to a running process by its process ID, nor can you even get the process ID of a &lt;code&gt;Process&lt;/code&gt; object. As a result of these limitations, I wrote a bunch of utilities that basically call out to native programs like &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;pgrep&lt;/code&gt; to gather information on running processes, child processes for a specific process ID, and so on. Even worse, to simply find the process ID for a &lt;code&gt;Process&lt;/code&gt; instance I used reflection to directly access the private &lt;code&gt;pid&lt;/code&gt; field in the &lt;code&gt;java.lang.UNIXProcess&lt;/code&gt; class (which first required checking that we were actually dealing with a &lt;code&gt;UNIXProcess&lt;/code&gt; instance, by comparing the class name as a string, since &lt;code&gt;UNIXProcess&lt;/code&gt; is package-private and thus you cannot obtain its &lt;code&gt;Class&lt;/code&gt; instance).&lt;/p&gt;

&lt;p&gt;Most people writing and talking about Java 9 are excited about things like the new module system in &lt;a href=&quot;http://openjdk.java.net/projects/jigsaw/&quot;&gt;Project Jigsaw&lt;/a&gt;; the &lt;a href=&quot;http://openjdk.java.net/jeps/222&quot;&gt;Java shell/REPL&lt;/a&gt;; the &lt;a href=&quot;http://openjdk.java.net/jeps/110&quot;&gt;HTTP/2 client&lt;/a&gt;; convenience &lt;a href=&quot;http://openjdk.java.net/jeps/269&quot;&gt;factory methods&lt;/a&gt; for collections; and so on. But I am maybe even more excited about the &lt;a href=&quot;http://openjdk.java.net/jeps/102&quot;&gt;process API improvements&lt;/a&gt;, since it means I can get rid of a lot of the hackery I used to obtain process information. Some of the information you can now obtain from a &lt;code&gt;Process&lt;/code&gt; instance includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether the process supports normal termination (i.e. any of the &quot;non-forcible&quot; kill signals in Linux)&lt;/li&gt;
&lt;li&gt;The process ID (i.e. the &quot;pid&quot;), and yes it&apos;s about time&lt;/li&gt;
&lt;li&gt;A handle to the &lt;em&gt;current&lt;/em&gt; process&lt;/li&gt;
&lt;li&gt;A handle to the &lt;em&gt;parent&lt;/em&gt; process, if one exists&lt;/li&gt;
&lt;li&gt;A stream of handles to the direct children of the process&lt;/li&gt;
&lt;li&gt;A stream of handles to the descendants (direct children, their children, and so on recursively)&lt;/li&gt;
&lt;li&gt;A stream of handles to &lt;em&gt;all&lt;/em&gt; processes visible to the current process&lt;/li&gt;
&lt;li&gt;Process metadata such as the full command line, arguments, start instant, owning user, and total CPU duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, to obtain the process ID (written as a unit test, and using &lt;a href=&quot;http://joel-costigliola.github.io/assertj/index.html&quot;&gt;AssertJ&lt;/a&gt; assertions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@Test
public void getPid() throws IOException {
    ProcessBuilder builder = new ProcessBuilder(&quot;/bin/sleep&quot;, &quot;5&quot;);
    Process proc = builder.start();
    assertThat(proc.pid()).isGreaterThan(0);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or, to obtain all sorts of different process metadata using &lt;code&gt;ProcessHandle&lt;/code&gt; (which is also new in JDK 9 via the &lt;code&gt;info()&lt;/code&gt; method in &lt;code&gt;Process&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@Test
public void processInfo() throws IOException {
    ProcessBuilder builder = new ProcessBuilder(&quot;/bin/sleep&quot;, &quot;5&quot;);
    Process proc = builder.start();
    ProcessHandle.Info info = proc.info();
    assertThat(info.arguments().orElse(new String[] {})).containsExactly(&quot;5&quot;);
    assertThat(info.command().orElse(null)).isEqualTo(&quot;/bin/sleep&quot;);
    assertThat(info.commandLine().orElse(null)).isEqualTo(&quot;/bin/sleep 5&quot;);
    assertThat(info.user().orElse(null)).isEqualTo(System.getProperty(&quot;user.name&quot;));
    assertThat(info.startInstant().orElse(null)).isLessThanOrEqualTo(Instant.now());
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note in the above test that &lt;em&gt;every&lt;/em&gt; method in the &lt;code&gt;ProcessHandle.Info&lt;/code&gt; returns an &lt;code&gt;Optional&lt;/code&gt;, which is the reason for the &lt;code&gt;orElse&lt;/code&gt; in the assertions. Another thing that I really needed - and thankfully JDK 9 now provides - is the ability to get a handle to an &lt;em&gt;existing&lt;/em&gt; process by its process ID using the &lt;code&gt;ProcessHandle#of&lt;/code&gt; method. Here is a simple example as a unit test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@Test
public void getProcessHandleForExistingProcess() throws IOException {
    ProcessBuilder builder = new ProcessBuilder(&quot;/bin/sleep&quot;, &quot;5&quot;);
    Process proc = builder.start();
    long pid = proc.pid();

    ProcessHandle handle = ProcessHandle.of(pid).orElseThrow(IllegalStateException::new);
    assertThat(handle.pid()).isEqualTo(pid);
    assertThat(handle.info().commandLine().orElse(null)).isEqualTo(&quot;/bin/sleep 5&quot;);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As with the &lt;code&gt;ProcessHandle.Info&lt;/code&gt; methods, &lt;code&gt;ProcessHandle#of&lt;/code&gt; returns an &lt;code&gt;Optional&lt;/code&gt; so again that is the reason for the &lt;code&gt;orElseThrow&lt;/code&gt;. In a real application you might take some other action if the returned &lt;code&gt;Optional&lt;/code&gt; is empty, or maybe you just throw an exception as the above test does.&lt;/p&gt;

&lt;p&gt;As a last example, here is a test that launches a &lt;code&gt;sleep&lt;/code&gt; process, then streams all visible processes and finds the launched &lt;code&gt;sleep&lt;/code&gt; process:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@Test
public void allProcesses() throws IOException {
    ProcessBuilder builder = new ProcessBuilder(&quot;/bin/sleep&quot;, &quot;5&quot;);
    builder.start();

    String sleep = ProcessHandle.allProcesses()
            .map(handle -&amp;gt; handle.info().command().orElse(String.valueOf(handle.pid())))
            .filter(cmd -&amp;gt; cmd.equals(&quot;/bin/sleep&quot;))
            .findFirst()
            .orElse(null);
    assertThat(sleep).isNotNull();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above test, since &lt;code&gt;allProcesses&lt;/code&gt; returns a &lt;code&gt;Stream&lt;/code&gt; we can use normal Java 8 stream API features like &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, and so on. In this example, we first map (transform) the &lt;code&gt;ProcessHandle&lt;/code&gt; to the command (i.e. &quot;/bin/sleep&quot;) or the process ID if the command &lt;code&gt;Optional&lt;/code&gt; is empty. Next we filter on whether the command equals &lt;code&gt;/bin/sleep&lt;/code&gt; and call &lt;code&gt;findFirst&lt;/code&gt; which returns an &lt;code&gt;Optional&lt;/code&gt;, and finally use &lt;code&gt;orElse&lt;/code&gt; to return &lt;code&gt;null&lt;/code&gt; if the returned &lt;code&gt;Optional&lt;/code&gt; was empty. Of course the above test can fail if, for example, there already happens to be a &lt;code&gt;/bin/sleep 5&lt;/code&gt; process executing in the operating system, but we won&apos;t worry about that here.&lt;/p&gt;

&lt;p&gt;One last piece of information that might be needed is the &lt;em&gt;current&lt;/em&gt; process, i.e. a process needs to get a handle to itself. You can now accomplish this easily by calling &lt;code&gt;ProcessHandle.current()&lt;/code&gt;. The Javadoc notes that you cannot use the returned handle to destroy the current process, and says to use &lt;code&gt;System#exit&lt;/code&gt; instead.&lt;/p&gt;
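
&lt;p&gt;Here is a minimal sketch (the class name is mine; note that in the released JDK 9 API the process ID accessor is &lt;code&gt;pid()&lt;/code&gt;):&lt;/p&gt;

```java
public class CurrentProcessExample {

    public static void main(String[] args) {
        // Every process can obtain a handle to itself.
        ProcessHandle current = ProcessHandle.current();
        System.out.println("My pid is " + current.pid());

        // info() works on the current process just like on any other handle.
        current.info().command().ifPresent(cmd -> System.out.println("Command: " + cmd));
    }
}
```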

&lt;p&gt;In addition to the process information shown in the above examples, there is also a new &lt;code&gt;onExit&lt;/code&gt; method that returns a &lt;code&gt;CompletableFuture&lt;/code&gt; that &quot;provides the ability to trigger dependent functions or actions that may be run synchronously or asynchronously upon process termination&quot; according to the Javadoc. The following example uses the native &lt;code&gt;cmp&lt;/code&gt; program to compare two files, and upon exit applies a lambda expression to check whether the exit value is zero (meaning the two files are identical). Finally, it uses the &lt;code&gt;Future#get&lt;/code&gt; method with a 1 second timeout (to avoid blocking indefinitely) to obtain the result:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Process proc = new ProcessBuilder(&quot;/usr/bin/cmp&quot;, &quot;/tmp/file1.txt&quot;, &quot;/tmp/file2.txt&quot;).start();
Future&amp;lt;Boolean&amp;gt; areIdentical = proc.onExit().thenApply(proc1 -&amp;gt; proc1.exitValue() == 0);
if (areIdentical.get(1, TimeUnit.SECONDS)) { ... }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So a big thanks to the Java team at Oracle (I can&apos;t believe I just thanked Oracle) for adding these new features! In the &quot;real world&quot; where systems are heterogeneous and need to integrate in myriad ways, having a much more featureful and robust process API helps a lot for any system that needs to launch, monitor, and destroy native processes.&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/aws_lambda_presentation_slides</guid>
    <title>AWS Lambda Presentation Slides</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/aws_lambda_presentation_slides</link>
        <pubDate>Mon, 20 Mar 2017 12:15:00 +0000</pubDate>
    <category>Development</category>
    <category>aws</category>
    <category>serverless</category>
    <category>lambda</category>
            <description>&lt;p&gt;A few months ago I gave a short presentation to &lt;a href=&quot;https://www.fortitudetec.com&quot;&gt;my company&lt;/a&gt; on &lt;a href=&quot;https://aws.amazon.com/lambda/&quot;&gt;AWS Lambda&lt;/a&gt;, which is basically a &quot;serverless&quot; framework that lets you deploy and run code in Amazon&apos;s cloud without managing, provisioning, or administering any servers whatsoever. Here are the slides:&lt;/p&gt;

&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/cgkGyi5XkpCxhw&quot; width=&quot;595&quot; height=&quot;485&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/scottleber/aws-lambda-73153540&quot; title=&quot;AWS Lambda&quot; target=&quot;_blank&quot;&gt;AWS Lambda&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a target=&quot;_blank&quot; href=&quot;//www.slideshare.net/scottleber&quot;&gt;Scott Leberknight&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/testing_http_clients_using_spark</guid>
    <title>Testing HTTP Clients Using Spark, Revisited</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/testing_http_clients_using_spark</link>
        <pubDate>Tue, 14 Mar 2017 20:09:22 +0000</pubDate>
    <category>Development</category>
    <category>sparkjava</category>
    <category>testing</category>
    <category>java</category>
            <description>&lt;p&gt;In a previous &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/testing_http_clients_using_the&quot;&gt;post&lt;/a&gt; I described the very small &lt;a href=&quot;https://github.com/sleberknight/sparkjava-testing/&quot;&gt;sparkjava-testing&lt;/a&gt; library I created to make it really simple to test HTTP client code using the &lt;a href=&quot;http://sparkjava.com/&quot;&gt;Spark&lt;/a&gt; micro-framework. It is basically one simple JUnit 4 rule (&lt;code&gt;SparkServerRule&lt;/code&gt;) that spins up a Spark HTTP server before tests run, and shuts it down once tests have executed. It can be used either as a &lt;code&gt;@ClassRule&lt;/code&gt; or as a &lt;code&gt;@Rule&lt;/code&gt;. Using &lt;code&gt;@ClassRule&lt;/code&gt; is normally what you want to do; it starts an HTTP server before any test has run, and shuts it down after &lt;em&gt;all&lt;/em&gt; tests have finished.&lt;/p&gt;

&lt;p&gt;In that post I mentioned that I needed to do an &quot;incredibly awful hack&quot; to reset the Spark HTTP server to non-secure mode so that, if tests run securely using a test keystore, other tests can also run either non-secure or secure, possibly with a different keystore. I also said the reason I did that was because &quot;there is no way I found to easily reset security&quot;. The reason for all that nonsense was that I was using the &lt;em&gt;static&lt;/em&gt; methods on the &lt;code&gt;Spark&lt;/code&gt; class such as &lt;code&gt;port&lt;/code&gt;, &lt;code&gt;secure&lt;/code&gt;, &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;post&lt;/code&gt;, and so on. Using the static methods also implies only one server instance across &lt;em&gt;all&lt;/em&gt; tests, which is also not so great.&lt;/p&gt;

&lt;p&gt;Well, it turns out I didn&apos;t really dig deep enough into Spark&apos;s features, because there is a really simple way to spin up separate and independent Spark server instances. You simply call the static &lt;code&gt;Service#ignite&lt;/code&gt; method, which returns a new instance of &lt;code&gt;Service&lt;/code&gt;. You then configure the &lt;code&gt;Service&lt;/code&gt; however you want, e.g. change the port, add routes, filters, set the server to run securely, etc. Here&apos;s an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Service http = Service.ignite();
http.port(56789);
http.get(&quot;/hello&quot;, (req, resp) -&amp;gt; &quot;Hello, Spark service!&quot;);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So now you can create as many servers as you want. This is exactly what is needed for the &lt;code&gt;SparkServerRule&lt;/code&gt;, which has been refactored to use &lt;code&gt;Service#ignite&lt;/code&gt; to get separate servers for each test. It now has only one constructor, which takes a &lt;code&gt;ServiceInitializer&lt;/code&gt; that can be used to do whatever configuration you need: add routes, filters, etc. Since &lt;code&gt;ServiceInitializer&lt;/code&gt; is a &lt;code&gt;@FunctionalInterface&lt;/code&gt; you can simply supply a lambda expression, which makes it cleaner. Here is a simple example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@ClassRule
public static final SparkServerRule SPARK_SERVER = new SparkServerRule(http -&amp;gt; {
    http.get(&quot;/ping&quot;, (request, response) -&amp;gt; &quot;pong&quot;);
    http.get(&quot;/health&quot;, (request, response) -&amp;gt; &quot;healthy&quot;);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is a rule that, before any test is run, spins up a Spark server on the default port &lt;code&gt;4567&lt;/code&gt; with two GET routes, and shuts the server down after all tests have completed. To do things like change the port and IP address in addition to adding routes, you just call the appropriate methods on the &lt;code&gt;Service&lt;/code&gt; instance (in the example above, the &lt;code&gt;http&lt;/code&gt; object passed to the lambda). Here&apos;s an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@ClassRule
public static final SparkServerRule SPARK_SERVER = new SparkServerRule(https -&amp;gt; {
    https.ipAddress(&quot;127.0.0.1&quot;);
    https.port(56789);
    URL resource = Resources.getResource(&quot;sample-keystore.jks&quot;);
    https.secure(resource.getFile(), &quot;password&quot;, null, null);
    https.get(&quot;/ping&quot;, (request, response) -&amp;gt; &quot;pong&quot;);
    https.get(&quot;/health&quot;, (request, response) -&amp;gt; &quot;healthy&quot;);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this example, tests will be able to access a server with two secure (https) endpoints at IP &lt;code&gt;127.0.0.1&lt;/code&gt; on port &lt;code&gt;56789&lt;/code&gt;. So that&apos;s it. On the off chance someone was actually using this rule other than me, the migration path is really simple. You just need to configure the &lt;code&gt;Service&lt;/code&gt; instance passed in the &lt;code&gt;SparkServerRule&lt;/code&gt; constructor as shown above. Now, each server is totally independent which allows tests to run in parallel (assuming they&apos;re on different ports). And better, I was able to remove the hack where I used reflection to go under the covers of Spark and manipulate fields, etc. So, test away on that HTTP client code!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog was originally published on the &lt;a href=&quot;http://www.fortitudetec.com&quot;&gt;Fortitude Technologies&lt;/a&gt; blog &lt;a href=&quot;http://www.fortitudetec.com/blogs/2017/03/14/testing-http-clients-using-spark-revisited&quot;&gt;here&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/testing_http_clients_using_the</guid>
    <title>Testing HTTP Clients Using the Spark Micro Framework</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/testing_http_clients_using_the</link>
        <pubDate>Wed, 7 Dec 2016 14:25:00 +0000</pubDate>
    <category>Development</category>
    <category>dropwizard</category>
    <category>testing</category>
    <category>sparkjava</category>
    <category>java</category>
            <description>&lt;p&gt;Testing HTTP client code can be a hassle. Your tests either need to run against a live HTTP server, or you somehow need to figure out how to send mock requests, which is generally not easy in most libraries that I have used. The tests should also be fast, meaning you need a lightweight server that starts and stops quickly. Spinning up heavyweight web or application servers, or relying on a specialized test server, is generally error-prone, adds complexity, and slows tests down. In projects I&apos;m working on lately we are using &lt;a href=&quot;http://dropwizard.io&quot;&gt;Dropwizard&lt;/a&gt;, which provides first-class support for testing JAX-RS resources and clients via JUnit rules. For example, it provides &lt;a href=&quot;http://www.dropwizard.io/1.0.3/docs/manual/testing.html#testing-client-implementations&quot;&gt;DropwizardClientRule&lt;/a&gt;, a JUnit rule that lets you implement JAX-RS resources as test doubles and starts and stops a simple Dropwizard application containing those resources. This works great if you are already using Dropwizard, but if not then a great alternative is &lt;a href=&quot;http://sparkjava.com/&quot;&gt;Spark&lt;/a&gt;. Even if you are using Dropwizard, Spark can still work well as a test HTTP server.&lt;/p&gt;

&lt;p&gt;Spark is self-described as a &quot;micro framework for creating web applications in Java 8 with minimal effort&quot;. You can create the stereotypical &quot;Hello World&quot; in Spark like this (shamelessly copied from Spark&apos;s web site):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import static spark.Spark.get;

public class HelloWorld {
    public static void main(String[] args) {
        get(&quot;/hello&quot;, (req, res) -&amp;gt; &quot;Hello World&quot;);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can run this code and visit &lt;code&gt;http://localhost:4567/hello&lt;/code&gt; in a browser or using a client tool like curl or &lt;a href=&quot;https://github.com/jkbrzt/httpie&quot;&gt;httpie&lt;/a&gt;. Spark is a perfect fit for creating HTTP servers in tests (whether you call them unit tests, integration tests, or something else is up to you; I will just call them tests here). I have created a very simple library, &lt;a href=&quot;https://github.com/sleberknight/sparkjava-testing/&quot;&gt;sparkjava-testing&lt;/a&gt;, that contains a JUnit rule for spinning up a Spark server for functional testing of HTTP clients. This library consists of one JUnit rule, the &lt;code&gt;SparkServerRule&lt;/code&gt;. You can annotate this rule with &lt;code&gt;@ClassRule&lt;/code&gt; or just &lt;code&gt;@Rule&lt;/code&gt;. Using &lt;code&gt;@ClassRule&lt;/code&gt; will start a Spark server &lt;em&gt;one time&lt;/em&gt; before any test is run. Then your tests run, making requests to the HTTP server, and finally once all tests have finished the server is shut down. If you need true isolation between every single test, annotate the rule with &lt;code&gt;@Rule&lt;/code&gt; and a test Spark server will be started before &lt;em&gt;each&lt;/em&gt; test and shut down after &lt;em&gt;each&lt;/em&gt; test, meaning each test runs against a fresh server. (The &lt;code&gt;SparkServerRule&lt;/code&gt; is a JUnit 4 rule mainly because JUnit 5 is still in milestone releases, and because I have not actually used JUnit 5.)&lt;/p&gt;
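&lt;p&gt;As an aside, you don&apos;t strictly need Spark for this spin-up/shut-down pattern; the JDK ships a small HTTP server in &lt;code&gt;com.sun.net.httpserver&lt;/code&gt;. Here is a minimal, self-contained sketch of the same lifecycle a rule manages for you (the &lt;code&gt;TinyTestServer&lt;/code&gt; class and its method names are made up for illustration and are not part of sparkjava-testing):&lt;/p&gt;

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class TinyTestServer {

    // Start a throwaway server on an ephemeral port (port 0 lets the OS choose)
    public static HttpServer start() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/ping", exchange -> {
            byte[] body = "pong".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Tiny helper to GET a URL and return the response body as a string
    public static String get(String uri) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
        try (Scanner scanner = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            return scanner.useDelimiter("\\A").next();
        }
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start();                  // like a rule's "before" phase
        int port = server.getAddress().getPort();
        System.out.println(get("http://localhost:" + port + "/ping")); // prints "pong"
        server.stop(0);                               // like a rule's "after" phase
    }
}
```

&lt;p&gt;Binding to port &lt;code&gt;0&lt;/code&gt; lets the OS pick a free ephemeral port, which conveniently avoids port conflicts between parallel test runs.&lt;/p&gt;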

&lt;p&gt;To declare a class rule with a test Spark server with two endpoints, you can do this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@ClassRule
public static final SparkServerRule SPARK_SERVER = new SparkServerRule(() -&amp;gt; {
    get(&quot;/ping&quot;, (request, response) -&amp;gt; &quot;pong&quot;);
    get(&quot;/healthcheck&quot;, (request, response) -&amp;gt; &quot;healthy&quot;);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;SparkServerRule&lt;/code&gt; constructor takes a &lt;code&gt;Runnable&lt;/code&gt; that defines the routes the server should respond to. In this example there are two HTTP GET routes, &lt;code&gt;/ping&lt;/code&gt; and &lt;code&gt;/healthcheck&lt;/code&gt;. You can of course implement the other HTTP verbs such as POST and PUT. You can then write tests using whatever client library you want. Here is an example test using a JAX-RS client:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@Test
public void testSparkServerRule_HealthcheckRequest() {
    client = ClientBuilder.newBuilder().build();
    Response response = client.target(URI.create(&quot;http://localhost:4567/healthcheck&quot;))
            .request()
            .get();
    assertThat(response.getStatus()).isEqualTo(200);
    assertThat(response.readEntity(String.class)).isEqualTo(&quot;healthy&quot;);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above test, &lt;code&gt;client&lt;/code&gt; is a JAX-RS &lt;code&gt;Client&lt;/code&gt; instance (it is an instance variable which is closed after each test). I&apos;m using &lt;a href=&quot;http://joel-costigliola.github.io/assertj/&quot;&gt;AssertJ&lt;/a&gt; assertions in this test. The main thing to note is that your client code must be parameterizable, so that the local Spark server URI can be injected instead of the actual production URI. When using the JAX-RS client as in this example, this means you need to be able to supply the test server URI to the &lt;code&gt;Client#target&lt;/code&gt; method. Spark runs on port 4567 by default, so the client in the test uses that port.&lt;/p&gt;
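&lt;p&gt;To make &quot;parameterizable&quot; concrete, here is a hypothetical client sketch. The &lt;code&gt;HealthClient&lt;/code&gt; name is made up for illustration, and it uses the JDK&apos;s &lt;code&gt;java.net.http.HttpClient&lt;/code&gt; (Java 11+) rather than JAX-RS so it has no external dependencies; the idea of injecting the base URI is the same either way:&lt;/p&gt;

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// A client whose base URI is injected, so tests can point it at a local test server
public class HealthClient {

    private final HttpClient client = HttpClient.newHttpClient();
    private final URI baseUri;

    public HealthClient(URI baseUri) {
        this.baseUri = baseUri;  // production passes the real URI; tests pass localhost
    }

    public String checkHealth() throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(baseUri.resolve("/healthcheck"))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

&lt;p&gt;A test would then construct the client with &lt;code&gt;URI.create(&quot;http://localhost:4567&quot;)&lt;/code&gt;, while production wiring supplies the real service URI.&lt;/p&gt;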

&lt;p&gt;The &lt;code&gt;SparkServerRule&lt;/code&gt; has two other constructors: one that accepts a port in addition to the routes, and another that takes a &lt;code&gt;SparkInitializer&lt;/code&gt;. To start the test server on a different port, you can do this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@ClassRule
public static final SparkServerRule SPARK_SERVER = new SparkServerRule(6543, () -&amp;gt; {
    get(&quot;/ping&quot;, (request, response) -&amp;gt; &quot;pong&quot;);
    get(&quot;/healthcheck&quot;, (request, response) -&amp;gt; &quot;healthy&quot;);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can use the constructor that takes a &lt;code&gt;SparkInitializer&lt;/code&gt; to customize the Spark server; for example, in addition to changing the port you can also set the IP address and make the server secure. The &lt;code&gt;SparkInitializer&lt;/code&gt; is a &lt;code&gt;@FunctionalInterface&lt;/code&gt; with one method, &lt;code&gt;init()&lt;/code&gt;, so you can use a lambda expression. For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@ClassRule
public static final SparkServerRule SPARK_SERVER = new SparkServerRule(
        () -&amp;gt; {
            Spark.ipAddress(&quot;127.0.0.1&quot;);
            Spark.port(9876);
            URL resource = Resources.getResource(&quot;sample-keystore.jks&quot;);
            String file = resource.getFile();
            Spark.secure(file, &quot;password&quot;, null, null);
        },
        () -&amp;gt; {
            get(&quot;/ping&quot;, (request, response) -&amp;gt; &quot;pong&quot;);
            get(&quot;/healthcheck&quot;, (request, response) -&amp;gt; &quot;healthy&quot;);
        });
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first argument is the initializer. It sets the IP address and port, then loads a sample keystore and calls the &lt;code&gt;Spark#secure&lt;/code&gt; method to make the test server accept HTTPS connections. You might want to customize settings if running tests in parallel, specifically the port, to ensure parallel tests do not encounter port conflicts.&lt;/p&gt;

&lt;p&gt;The last thing to note is that &lt;code&gt;SparkServerRule&lt;/code&gt; resets the port, IP address, and secure settings to the default values (&lt;code&gt;4567&lt;/code&gt;, &lt;code&gt;0.0.0.0&lt;/code&gt;, and non-secure, respectively) when it shuts down the Spark server. If you use the &lt;code&gt;SparkInitializer&lt;/code&gt; to customize other settings (for example the server thread pool, static file location, before/after filters, etc.) those will not be reset, as they are not currently supported by &lt;code&gt;SparkServerRule&lt;/code&gt;. Finally, resetting to non-secure mode required an incredibly awful hack because there is no way I found to easily reset security - you cannot just pass a bunch of &lt;code&gt;null&lt;/code&gt; values to the &lt;code&gt;Spark#secure&lt;/code&gt; method as it will throw an exception, and there is no &lt;code&gt;unsecure&lt;/code&gt; method, probably because the server was not intended to set and reset things a bunch of times like we want to do in test scenarios. If you&apos;re interested, go look at the code for the &lt;code&gt;SparkServerRule&lt;/code&gt; in the &lt;a href=&quot;https://github.com/sleberknight/sparkjava-testing/&quot;&gt;sparkjava-testing&lt;/a&gt; repository, but prepare thyself and get some cleaning supplies ready to wash away the dirty feeling you&apos;re sure to have after seeing it.&lt;/p&gt;

&lt;p&gt;The ability to use &lt;code&gt;SparkServerRule&lt;/code&gt; to quickly and easily set up test HTTP servers, along with the ability to customize the port and IP address and run securely in tests, has worked very well for my testing needs thus far. Note that unlike the above toy examples, you can implement more complicated logic in the routes, for example to return a 200 or a 404 for a GET request depending on a path parameter or request parameter value. But at the same time, don&apos;t implement extremely complex logic either. Most times I simply create separate routes when I need the test server to behave differently, for example to test various error conditions. Or, I might even choose to implement separate JUnit test &lt;em&gt;classes&lt;/em&gt; for different server endpoints, so that each test focuses on only one endpoint and its various success and failure conditions. As is many times the case, the &lt;em&gt;context&lt;/em&gt; will determine the best way to implement your tests. &lt;em&gt;Happy testing!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog was originally published on the &lt;a href=&quot;http://www.fortitudetec.com&quot;&gt;Fortitude Technologies&lt;/a&gt; blog &lt;a href=&quot;http://www.fortitudetec.com/blogs/2016/12/05/testing-http-clients-using-the-spark-micro-framework&quot;&gt;here&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_digging</guid>
    <title>Towards More Functional Java - Digging into Nested Data Structures</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_digging</link>
        <pubDate>Mon, 14 Nov 2016 13:15:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>functional</category>
    <category>refactoring</category>
            <description>&lt;p&gt;In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using2&quot;&gt;last post&lt;/a&gt; we saw an example that used a generator combined with a filter to find the first available port in a specific range. It returned an &lt;code&gt;Optional&lt;/code&gt; to model the case when no open ports are found, as opposed to returning &lt;code&gt;null&lt;/code&gt;. In this example, we&apos;ll look at how to use Java 8 streams to dig into a nested data structure and find objects of a specific type. We&apos;ll use map and filter operations on the stream, and also introduce a new concept, the flat-map.&lt;/p&gt;

&lt;p&gt;In the original, pre-Java 8 code that I was working on in a project, the data structure was a three-level nested structure that was marshaled into Java objects from an XML file based on a schema from an external web service. The method needed to find objects of a specific type at the bottom level. For this article, to keep things simple we will work with a simple class structure in which class &lt;code&gt;A&lt;/code&gt; contains a collection of class &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;B&lt;/code&gt; contains a collection of class &lt;code&gt;C&lt;/code&gt;. The &lt;code&gt;C&lt;/code&gt; class is a base class, and there are several subclasses &lt;code&gt;C1&lt;/code&gt;, &lt;code&gt;C2&lt;/code&gt;, and &lt;code&gt;C3&lt;/code&gt;. In pseudo-code the classes look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class A {
  List&amp;lt;B&amp;gt; bs = []
}

class B {
  List&amp;lt;C&amp;gt; cs = []
}

class C {}
class C1 extends C {}
class C2 extends C {}
class C3 extends C {}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The goal here is to find the first C2 instance, given an instance of A. The pre-Java 8 code looks like the following:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public C2 findFirstC2(A a) {
    for (B b : a.getBs()) {
        for (C c : b.getCs()) {
            if (c instanceof C2) {
                return (C2) c;
            }
        }
    }
    return null;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this code, I made the assumption that the collections are always non-null. The original code I was working on did not make that assumption, and was more complicated as a result. We will revisit the more complicated case later. This code is pretty straightforward: two loops and a conditional, plus an early exit if we find an instance of &lt;code&gt;C2&lt;/code&gt;, and return &lt;code&gt;null&lt;/code&gt; if we exit the loops without having found anything.&lt;/p&gt;

&lt;p&gt;Refactoring to streams using Java 8&apos;s stream API is not too bad, though we need to introduce the &lt;a href=&quot;http://martinfowler.com/articles/collection-pipeline/flat-map.html&quot;&gt;flat-map&lt;/a&gt; concept. Martin Fowler&apos;s simple explanation is better than any I would come up with so I will repeat it here: &quot;Map a function over a collection and flatten the result by one-level&quot;. In our example, each &lt;code&gt;B&lt;/code&gt; has a collection of &lt;code&gt;C&lt;/code&gt;. The flat-map operation over a &lt;em&gt;collection&lt;/em&gt; of &lt;code&gt;B&lt;/code&gt; will basically return a stream of &lt;em&gt;all&lt;/em&gt; &lt;code&gt;C&lt;/code&gt; for &lt;em&gt;all&lt;/em&gt; &lt;code&gt;B&lt;/code&gt;. For example, if there are two &lt;code&gt;B&lt;/code&gt; instances in the collection, the first having 3 &lt;code&gt;C&lt;/code&gt; and the second having 5 &lt;code&gt;C&lt;/code&gt;, then the flat-map operation returns a stream of 8 &lt;code&gt;C&lt;/code&gt; instances (effectively combining the 3 from the first &lt;code&gt;B&lt;/code&gt; and the 5 from the second &lt;code&gt;B&lt;/code&gt;, and &lt;em&gt;flattening&lt;/em&gt; the result by one level). With the new flat-map tool in our tool belts, here is the Java 8 code using the stream API:&lt;/p&gt;
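&lt;p&gt;To make the 3-plus-5 counting concrete, here is a tiny self-contained sketch that uses lists of strings as stand-ins for the &lt;code&gt;B&lt;/code&gt; and &lt;code&gt;C&lt;/code&gt; collections:&lt;/p&gt;

```java
import java.util.List;
import java.util.stream.Collectors;

public class FlatMapDemo {

    // Two "B"-like lists: the first holds 3 elements, the second holds 5
    public static List<String> flattened() {
        List<List<String>> bs = List.of(
                List.of("c1", "c2", "c3"),
                List.of("c4", "c5", "c6", "c7", "c8"));

        // flatMap turns a stream of lists into one flat stream of their elements
        return bs.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(flattened().size()); // prints 8 (3 + 5)
    }
}
```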

&lt;pre&gt;&lt;code&gt;public Optional&amp;lt;C2&amp;gt; findFirstC2(A a) {
    return a.getBs().stream()
            .flatMap(b -&amp;gt; b.getCs().stream())
            .filter(C2.class::isInstance)
            .map(C2.class::cast)
            .findFirst();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above code, we first stream over the collection of &lt;code&gt;B&lt;/code&gt;. Next is where we apply the &lt;code&gt;flatMap&lt;/code&gt; method to get a stream of &lt;em&gt;all&lt;/em&gt; &lt;code&gt;C&lt;/code&gt;. The one somewhat tricky thing about the Java 8 &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#flatMap-java.util.function.Function-&quot;&gt;flatMap&lt;/a&gt; method is that the mapper function must return a &lt;em&gt;stream&lt;/em&gt;. In our example, we use &lt;code&gt;b.getCs().stream()&lt;/code&gt; as the mapper function, thus returning a stream of &lt;code&gt;C&lt;/code&gt;. From then on we can apply the filter and map operations, and close out with &lt;code&gt;findFirst&lt;/code&gt;, a short-circuiting terminal operation (it stops at the first &lt;code&gt;C2&lt;/code&gt; it finds) that returns an &lt;code&gt;Optional&lt;/code&gt; which either contains a value or is empty.&lt;/p&gt;

&lt;p&gt;If you have read any of my previous posts, you won&apos;t be surprised that I prefer the functional-style of the Java 8 stream API, for the same reasons I&apos;ve listed previously (e.g. declarative code, no explicit loops or conditionals, etc.). And as we&apos;ve seen before in previous posts, we can make the above example generic very easily:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public &amp;lt;T extends C&amp;gt; Optional&amp;lt;T&amp;gt; findFirst(A a, Class&amp;lt;T&amp;gt; clazz) {
    return a.getBs().stream()
            .flatMap(b -&amp;gt; b.getCs().stream())
            .filter(clazz::isInstance)
            .map(clazz::cast)
            .findFirst();
}
&lt;/code&gt;&lt;/pre&gt;
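&lt;p&gt;For the curious, here is the generic version fleshed out into a runnable whole. The nested classes below are throwaway stand-ins for the &lt;code&gt;A&lt;/code&gt;/&lt;code&gt;B&lt;/code&gt;/&lt;code&gt;C&lt;/code&gt; hierarchy sketched earlier, just enough to exercise the method:&lt;/p&gt;

```java
import java.util.List;
import java.util.Optional;

public class GenericFinderDemo {

    // Minimal stand-ins for the A/B/C hierarchy from the article
    static class C {}
    static class C1 extends C {}
    static class C2 extends C {}

    static class B {
        final List<C> cs;
        B(List<C> cs) { this.cs = cs; }
        List<C> getCs() { return cs; }
    }

    static class A {
        final List<B> bs;
        A(List<B> bs) { this.bs = bs; }
        List<B> getBs() { return bs; }
    }

    // The generic finder: flatten all C across all B, keep instances of clazz, take the first
    public static <T extends C> Optional<T> findFirst(A a, Class<T> clazz) {
        return a.getBs().stream()
                .flatMap(b -> b.getCs().stream())
                .filter(clazz::isInstance)
                .map(clazz::cast)
                .findFirst();
    }

    public static void main(String[] args) {
        A a = new A(List.of(
                new B(List.<C>of(new C1(), new C1())),
                new B(List.of(new C1(), new C2()))));

        System.out.println(findFirst(a, C2.class).isPresent()); // prints "true"
        System.out.println(findFirst(a, C1.class).isPresent()); // prints "true"
    }
}
```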

&lt;p&gt;Of course, it is also not difficult to make the imperative version with loops generic, using the &lt;code&gt;isAssignableFrom&lt;/code&gt; and &lt;code&gt;cast&lt;/code&gt; methods in the &lt;code&gt;Class&lt;/code&gt; class. And you can even make it just as short by removing the braces, as in the following:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public &amp;lt;T extends C&amp;gt; T findFirst(A a, Class&amp;lt;T&amp;gt; clazz) {
    for (B b : a.getBs())
        for (C c : b.getCs())
            if (clazz.isAssignableFrom(c.getClass()))
                return clazz.cast(c);
    return null;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I never omit the braces even for one liners, because I believe it is a great way to introduce bugs (remember &lt;a href=&quot;https://nakedsecurity.sophos.com/2014/02/24/anatomy-of-a-goto-fail-apples-ssl-bug-explained-plus-an-unofficial-patch/&quot;&gt;goto fail&lt;/a&gt; a few years ago?). Braces or no braces, why prefer the more functional style to the imperative style? Some is obviously personal preference, and what you are used to. Clearly if you are used to and comfortable with reading imperative code, it won&apos;t be an issue to read the above code. But the same goes for functional style, i.e. once you learn the basic concepts (map, filter, reduce, flat-map, etc.) it becomes very easy to quickly see what code is doing (and what is intended).&lt;/p&gt;

&lt;p&gt;One other thing is that instead of using &lt;code&gt;stream()&lt;/code&gt;, you can easily switch to &lt;code&gt;parallelStream()&lt;/code&gt;, which automatically parallelizes the code. Counter-intuitively, though, &lt;code&gt;parallelStream()&lt;/code&gt; will not always make code faster; for small collections it will probably make it slower due to the overhead of splitting up and coordinating the work. But if things like transformation or filtering take a significant amount of time, then parallelizing the code can produce significant performance improvements. Unfortunately there are no hard rules, and whether parallelizing speeds the code up depends on various and sundry factors.&lt;/p&gt;
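&lt;p&gt;The switch really is one call, and the sequential and parallel pipelines produce the same result; only the execution strategy differs. A quick sketch:&lt;/p&gt;

```java
import java.util.stream.IntStream;

public class ParallelDemo {

    // Sum of squares of 1..n; same pipeline either way, only the execution differs
    public static long sumOfSquares(int n, boolean parallel) {
        IntStream stream = IntStream.rangeClosed(1, n);
        if (parallel) {
            stream = stream.parallel();  // the one-call switch to a parallel stream
        }
        return stream.mapToLong(i -> (long) i * i).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000, false)); // prints 333833500
        System.out.println(sumOfSquares(1_000, true));  // prints 333833500 (same value)
    }
}
```

&lt;p&gt;If you want to know whether the parallel version actually helps in your situation, measure it rather than guessing.&lt;/p&gt;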

&lt;p&gt;The examples above were very simple. The original code was more complex because it did not make any assumptions about nullability of the original argument or the nested collections. Here is the code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public C2 findFirstC2(A a) {
    if (a == null || a.getBs() == null) {
        return null;
    }

    for (B b : a.getBs()) {
        List&amp;lt;C&amp;gt; cs = b.getCs();
        if (cs == null) {
            continue;
        }

        for (C c : cs) {
            if (c instanceof C2) {
                return (C2) c;
            }
        }
    }
    return null;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This code is more difficult to read than the original code due to the additional null-checking conditionals. There are two loops, three conditionals, a loop continuation, and a short-circuit return form within a nested loop. So what does this look like using the Java 8 stream API? Here is one solution:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public Optional&amp;lt;C2&amp;gt; findFirstC2(A a) {
    return Optional.ofNullable(a)
            .map(A::getBs)
            .orElseGet(Lists::newArrayList)
            .stream()
            .flatMap(this::toStreamOfC)
            .filter(C2.class::isInstance)
            .map(C2.class::cast)
            .findFirst();
}

private Stream&amp;lt;? extends C&amp;gt; toStreamOfC(B b) {
    return Optional.ofNullable(b.getCs())
            .orElseGet(Lists::newArrayList)
            .stream();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That looks like a lot, so let&apos;s see what is going on. The main difference is that we need to account for possible &lt;code&gt;null&lt;/code&gt; values. For that the code uses the &lt;code&gt;Optional#ofNullable&lt;/code&gt; method which unsurprisingly returns an &lt;code&gt;Optional&lt;/code&gt;. We are also using map operations on the &lt;code&gt;Optional&lt;/code&gt; objects, which returns an empty &lt;code&gt;Optional&lt;/code&gt; if it was originally empty, otherwise it applies the operation. We are also using the &lt;code&gt;Optional#orElseGet&lt;/code&gt; method to ensure we are always operating on non-null collections, for example if &lt;code&gt;a.getBs()&lt;/code&gt; returns &lt;code&gt;null&lt;/code&gt; then the first &lt;code&gt;orElseGet&lt;/code&gt; provides a new &lt;code&gt;ArrayList&lt;/code&gt;. In this manner, the code always works the same way whether the intermediate collections are null or not. Instead of embedding a somewhat complicated map operation in the &lt;code&gt;flatMap&lt;/code&gt; I extracted the &lt;code&gt;toStreamOfC&lt;/code&gt; method, and then used a method reference. When writing code in functional style, often it helps to extract helper methods, even if that ends up creating &lt;em&gt;more&lt;/em&gt; code because, in the end, the code is more easily understood.&lt;/p&gt;
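&lt;p&gt;The pipeline above needs nothing beyond the JDK if you swap Guava&apos;s &lt;code&gt;Lists::newArrayList&lt;/code&gt; for &lt;code&gt;Collections::emptyList&lt;/code&gt;. Here is a runnable sketch with minimal stand-in classes (again made up just for illustration), exercising the null cases:&lt;/p&gt;

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class NullSafeDemo {

    static class C {}
    static class C2 extends C {}

    static class B {
        final List<C> cs;
        B(List<C> cs) { this.cs = cs; }
        List<C> getCs() { return cs; }  // may return null
    }

    static class A {
        final List<B> bs;
        A(List<B> bs) { this.bs = bs; }
        List<B> getBs() { return bs; }  // may return null
    }

    public static Optional<C2> findFirstC2(A a) {
        return Optional.ofNullable(a)
                .map(A::getBs)
                .orElseGet(Collections::emptyList)   // null-safe: empty list when absent
                .stream()
                .flatMap(NullSafeDemo::toStreamOfC)
                .filter(C2.class::isInstance)
                .map(C2.class::cast)
                .findFirst();
    }

    private static Stream<? extends C> toStreamOfC(B b) {
        return Optional.ofNullable(b.getCs())
                .orElseGet(Collections::emptyList)   // null-safe: empty stream when absent
                .stream();
    }

    public static void main(String[] args) {
        System.out.println(findFirstC2(null).isPresent());                        // false
        System.out.println(findFirstC2(new A(null)).isPresent());                 // false
        System.out.println(findFirstC2(new A(List.of(new B(null)))).isPresent()); // false
        A a = new A(List.of(new B(null), new B(List.of(new C(), new C2()))));
        System.out.println(findFirstC2(a).isPresent());                           // true
    }
}
```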

&lt;p&gt;The code in this more complex example illustrates the declarative nature of the functional style. Once you are familiar with the functional primitives (like map, flat-map, filter, and so on) reading this code is quite easy and fast, because it reads like a specification of the problem. Like reading code, writing code in the functional style takes some practice and getting used to, but once you get the hang of it, I think you will find you can often write the code faster. The main difference when writing code in functional style is that I do more thinking about what exactly I am trying to do before just slinging code. Until next time, &lt;em&gt;auf Wiedersehen&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog was originally published on the &lt;a href=&quot;http://www.fortitudetec.com&quot;&gt;Fortitude Technologies&lt;/a&gt; blog &lt;a href=&quot;http://www.fortitudetec.com/blogs/2016/11/11/towards-more-functional-java-dig-data-structures&quot;&gt;here&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using2</guid>
    <title>Towards More Functional Java using Generators and Filters</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using2</link>
        <pubDate>Wed, 12 Oct 2016 12:30:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>refactoring</category>
    <category>functional</category>
            <description>&lt;p&gt;Last time we saw how to use &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using1&quot;&gt;lambdas as predicates&lt;/a&gt;, and specifically how to use them with the Java 8 &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/Collection.html#removeIf-java.util.function.Predicate-&quot;&gt;Collection#removeIf&lt;/a&gt; method in order to remove elements from a map based on the predicate. In this article we will use a predicate to filter elements from a stream, and combine it with a generator to find the first open port in a specific range. The use case is a (micro)service-based environment where each new service binds to the first open port it finds in a specific port range. For example, suppose we need to limit the port range of each service to the dynamic and/or private ports (49152 to 65535, as &lt;a href=&quot;http://www.iana.org/assignments/port-numbers&quot;&gt;defined by IANA&lt;/a&gt;). Basically we want to choose a port at random in the dynamic port range and bind to that port if it is open, otherwise repeat the process until we find an open port or we have tried more than a pre-defined number of times.&lt;/p&gt;

&lt;p&gt;The original pre-Java 8 code to accomplish this looked like the following:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public Integer findFreePort() {
    int assignedPort = -1;
    int count = 1;
    while (count &amp;lt;= MAX_PORT_CHECK_ATTEMPTS) {
        int checkPort = MIN_PORT + random.nextInt(PORTS_IN_RANGE);
        if (portChecker.isAvailable(checkPort)) {
            assignedPort = checkPort;
            break;
        }
        count++;
    }
    return assignedPort == -1 ? null : assignedPort;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are a few things to note here. First, the method&apos;s return type is &lt;code&gt;Integer&lt;/code&gt; rather than &lt;code&gt;int&lt;/code&gt; so that it can return &lt;code&gt;null&lt;/code&gt; to indicate that it could not find an open port (as opposed to throwing an exception, which might or might not be better). Second, there are two mutable variables, &lt;code&gt;assignedPort&lt;/code&gt; and &lt;code&gt;count&lt;/code&gt;, which are used to store the open port (if found) and to track the number of attempts made, respectively. Third, the &lt;code&gt;while&lt;/code&gt; loop executes so long as the maximum number of attempts has not been exceeded. Fourth, a conditional inside the loop uses a port checker object to determine port availability, breaking out of the loop if an open port is found. Finally, a ternary expression is used to check the &lt;code&gt;assignedPort&lt;/code&gt; variable and return either &lt;code&gt;null&lt;/code&gt; or the open port.&lt;/p&gt;

&lt;p&gt;Taking a step back, all this code really does is loop until an open port is found, or until the maximum attempts has been exceeded. It then returns &lt;code&gt;null&lt;/code&gt; (if no open port was found) or the open port as an &lt;code&gt;Integer&lt;/code&gt;. There are two mutable variables, a loop, a conditional inside the loop with an early break, and another conditional (via the ternary) to determine the return value. I&apos;m sure there are a few ways this code could be improved without using Java 8 streams. For example, we could simply return the open port from the conditional inside the loop and return null if we exit the loop without finding an open port, thereby eliminating the &lt;code&gt;assignedPort&lt;/code&gt; variable. Even so it still contains a loop with a conditional and an early exit condition. And some people really dislike early returns and only want to see one return statement at the end of a method (I don&apos;t generally have a problem with early exits from methods, so long as the method is relatively short). Not to mention returning null when a port is not found forces a null check on callers; if a developer isn&apos;t paying attention or this isn&apos;t documented, perhaps they omit the null check causing a &lt;code&gt;NullPointerException&lt;/code&gt; somewhere downstream.&lt;/p&gt;
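&lt;p&gt;As a concrete sketch of that intermediate cleanup, here is the early-return variant. The constants and the class name are stand-ins, and the port checker is replaced with an &lt;code&gt;IntPredicate&lt;/code&gt; so the snippet is self-contained; note that it still needs a loop, a conditional, and a &lt;code&gt;null&lt;/code&gt; return:&lt;/p&gt;

```java
import java.util.Random;
import java.util.function.IntPredicate;

public class PortFinder {
    // Stand-in values; the real constants live elsewhere in the service
    static final int MIN_PORT = 49152;
    static final int PORTS_IN_RANGE = 65535 - MIN_PORT + 1;
    static final int MAX_PORT_CHECK_ATTEMPTS = 5;
    static final Random random = new Random();

    // Early-return variant: the assignedPort variable is gone,
    // but the loop, the conditional, and the null return remain
    static Integer findFreePort(IntPredicate isAvailable) {
        for (int count = 1; count <= MAX_PORT_CHECK_ATTEMPTS; count++) {
            int checkPort = MIN_PORT + random.nextInt(PORTS_IN_RANGE);
            if (isAvailable.test(checkPort)) {
                return checkPort;
            }
        }
        return null;  // callers must remember to null-check this
    }
}
```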

&lt;p&gt;Refactoring this to use the Java 8 stream API can be done relatively simply. In order to accomplish this we want to do the following, starting with &lt;em&gt;generating&lt;/em&gt; a sequence of random ports. For each randomly generated port, &lt;em&gt;filter&lt;/em&gt; on open ports and return the &lt;em&gt;first&lt;/em&gt; open port we find. If no open ports are found after &lt;em&gt;limiting&lt;/em&gt; our attempts to a pre-determined maximum, we want to return something that clearly indicates no open port was found, i.e. that the open port is &lt;em&gt;empty&lt;/em&gt;. I chose the terminology here very specifically, to correspond to both general functional programming concepts as well as to the Java 8 API methods we can use.&lt;/p&gt;

&lt;p&gt;Here is the code using the Java 8 APIs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public OptionalInt findFreePort() {
    IntSupplier randomPorts = () -&amp;gt; MIN_PORT + random.nextInt(PORTS_IN_RANGE);
    return IntStream.generate(randomPorts)
            .limit(MAX_PORT_CHECK_ATTEMPTS)
            .filter(portChecker::isAvailable)
            .findFirst();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Without any explanation you can probably read the above code and tell generally what it does, because we are &lt;em&gt;declaring&lt;/em&gt; what should happen, as opposed to listing the explicit instructions for how to do it. But let&apos;s dive in and look at the specifics anyway. The refactored method returns an &lt;code&gt;OptionalInt&lt;/code&gt; to indicate the presence or absence of a value; &lt;code&gt;OptionalInt&lt;/code&gt; is just the version of the &lt;code&gt;Optional&lt;/code&gt; class specialized for primitive integers. This better matches the semantics we&apos;d like, which is to clearly indicate to a caller that there may or may not be a value present. Next, we are using the &lt;code&gt;generate&lt;/code&gt; method to create an &lt;em&gt;infinite&lt;/em&gt; sequence of random values, using the specified &lt;code&gt;IntSupplier&lt;/code&gt; (which is a specialization of &lt;code&gt;Supplier&lt;/code&gt; for primitive integers). Suppliers do exactly what they say they do - supply a value, and in this case a random integer in a specific range. Note the supplier is defined using a lambda expression.&lt;/p&gt;

&lt;p&gt;The infinite sequence is truncated (&lt;em&gt;limited&lt;/em&gt;) using the &lt;code&gt;limit&lt;/code&gt; method, which turns it into a &lt;em&gt;finite&lt;/em&gt; sequence. The final two pieces are the &lt;code&gt;filter&lt;/code&gt; and &lt;code&gt;findFirst&lt;/code&gt; methods. The &lt;code&gt;filter&lt;/code&gt; method uses a method reference to the &lt;code&gt;isAvailable&lt;/code&gt; method on the &lt;code&gt;portChecker&lt;/code&gt; object; a method reference is just a shortcut for a lambda expression that simply passes its argument along to the referenced method. Finally, we use &lt;code&gt;findFirst&lt;/code&gt;, which the Javadocs describe as a &quot;short-circuiting terminal operation&quot;: a terminal operation ends the stream pipeline and produces a result, and &quot;short-circuiting&quot; means it can stop processing as soon as a matching element is found, even when the stream is infinite. The short-circuiting behavior is basically the same as the &lt;code&gt;break&lt;/code&gt; statement in the original pre-Java 8 code.&lt;/p&gt;
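&lt;p&gt;To see the &lt;code&gt;OptionalInt&lt;/code&gt; semantics from the caller&apos;s side, here is a hypothetical, self-contained sketch (the port checker is again replaced by an &lt;code&gt;IntPredicate&lt;/code&gt; stand-in). The caller must explicitly decide what absence means, so there is no way to stumble into an accidental &lt;code&gt;NullPointerException&lt;/code&gt;:&lt;/p&gt;

```java
import java.util.OptionalInt;
import java.util.Random;
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

public class FreePorts {
    // Stand-in constants for illustration
    static final int MIN_PORT = 49152;
    static final int PORTS_IN_RANGE = 65535 - MIN_PORT + 1;
    static final int MAX_PORT_CHECK_ATTEMPTS = 5;
    static final Random random = new Random();

    static OptionalInt findFreePort(IntPredicate isAvailable) {
        return IntStream.generate(() -> MIN_PORT + random.nextInt(PORTS_IN_RANGE))
                .limit(MAX_PORT_CHECK_ATTEMPTS)
                .filter(isAvailable)
                .findFirst();
    }

    public static void main(String[] args) {
        // The caller chooses: throw, supply a default, or act only when present
        int port = findFreePort(p -> true)
                .orElseThrow(() -> new IllegalStateException("no open port found"));
        System.out.println("binding to port " + port);

        findFreePort(p -> false)
                .ifPresent(p -> System.out.println("never printed"));
    }
}
```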

&lt;p&gt;So now we have a more functional version that finds free ports with no mutable variables and a more semantically correct return type. As in several previous articles in this ad-hoc series, the same common patterns (i.e. map, filter, collect/reduce) recur in a slightly different form. Instead of a map operation to transform an existing stream, we are &lt;em&gt;generating&lt;/em&gt; a stream from scratch, limiting to a finite number of attempts, filtering the items we want to accept, and then using a short-circuiting terminal operation to return the value found, or an empty value as an &lt;code&gt;OptionalInt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As you can probably tell, I am biased toward the functional version for various reasons such as the declarative nature of the code, no explicit looping or variable mutation, and so on. In this case I think the more functional version is much more readable (though I am 100% sure there will be people who vehemently disagree, and that&apos;s OK). In addition, because we are using what are effectively building blocks (generators, map, filter, reduce/collect, etc.) we can much more easily extract a generic method that finds the first element satisfying a filtering condition, given a supplier and a limit. For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public &amp;lt;T&amp;gt; Optional&amp;lt;T&amp;gt; findFirst(long maxAttempts,
                                 Supplier&amp;lt;T&amp;gt; generator,
                                 Predicate&amp;lt;T&amp;gt; condition) {
    return Stream.generate(generator)
            .limit(maxAttempts)
            .filter(condition)
            .findFirst();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we have a re-usable method that can accept any generator and any predicate. For example, suppose you want to find the first random number over two billion if it occurs within 10 attempts, or else default to 42 (naturally). Assuming you have a random number generator object &lt;code&gt;rand&lt;/code&gt;, then you could call the &lt;code&gt;findFirst&lt;/code&gt; method like this, making use of the &lt;code&gt;orElse&lt;/code&gt; method on &lt;code&gt;Optional&lt;/code&gt; to provide a default value (note the lambda parameter must be named something other than &lt;code&gt;value&lt;/code&gt;, since a lambda parameter cannot shadow a local variable that is in scope):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Integer value = findFirst(10, rand::nextInt, n -&amp;gt; n &amp;gt; 2_000_000_000).orElse(42);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So as I mentioned in the last article on predicates, there is a separation of concerns achieved by using the functional style that simply is not possible using traditional control structures such as the &lt;code&gt;while&lt;/code&gt; loop and explicit &lt;code&gt;if&lt;/code&gt; conditional as in the first example of this article. (*) Essentially, the functional style is &lt;em&gt;composable&lt;/em&gt; using basic building blocks, which is another huge win. Because of this composability, in general you tend to write less code, and the code that you do write tends to be more focused on the business logic you are actually trying to perform. And when you do see the same pattern repeated several times, it is much easier to extract the commonality using the functional style building blocks as we did to create the generic &lt;code&gt;findFirst&lt;/code&gt; method in the last example. To paraphrase Yoda, once you start down the path to the functional side, forever will it dominate your destiny. Unlike the dark side of the Force, however, the functional side is much better and nicer. Until next time, &lt;em&gt;arrivederci&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can find all the sample code used in this blog and the others in this series on my GitHub in the &lt;a href=&quot;https://github.com/sleberknight/java8-blog-code&quot;&gt;java8-blog-code&lt;/a&gt; repository.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;(*) Yes, you can simulate functional programming using anonymous inner classes prior to Java 8, or you can use a library like &lt;a href=&quot;https://github.com/google/guava&quot;&gt;Guava&lt;/a&gt; and use its functional programming idioms. In general this tends to be verbose and you end up with more complicated and awkward-looking code. As the Guava team &lt;a href=&quot;https://github.com/google/guava/wiki/FunctionalExplained&quot;&gt;explains&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Excessive use of Guava&apos;s functional programming idioms can lead to
verbose, confusing, unreadable, and inefficient code. These are by
far the most easily (and most commonly) abused parts of Guava, and
when you go to preposterous lengths to make your code &quot;a one-liner,&quot;
the Guava team weeps&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;This blog was originally published on the &lt;a href=&quot;http://www.fortitudetec.com&quot;&gt;Fortitude Technologies&lt;/a&gt; blog &lt;a href=&quot;http://www.fortitudetec.com/blogs/2016/10/11/towards-more-functional-java-using-generators-and-filters&quot;&gt;here&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using1</guid>
    <title>Towards More Functional Java using Lambdas as Predicates</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using1</link>
        <pubDate>Tue, 13 Sep 2016 11:45:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>refactoring</category>
            <description>&lt;p&gt;Previously I &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using&quot;&gt;showed&lt;/a&gt; an example that transformed a map of query parameters into a SOLR search string. The pre-Java 8 code used a traditional &lt;code&gt;for&lt;/code&gt; loop with a conditional and used a &lt;code&gt;StringBuilder&lt;/code&gt; to incrementally build a string. The Java 8 code streamed over the map entries, mapping (transforming) each entry to a string of the form &lt;code&gt;&quot;key:value&quot;&lt;/code&gt; and finally used a &lt;code&gt;Collector&lt;/code&gt; to join those query fragments together. This is a common pattern in functional-style code, in which a for loop transforms one collection of objects into a collection of different objects, optionally filters some of them out, and optionally reduce the collection to a single element. These are common patterns in the functional style - map, filter, reduce, etc. You can almost always replace a for loop with conditional filtering and reduction into a Java 8 stream with map, filter, and reduce (collect) operations.&lt;/p&gt;

&lt;p&gt;But in addition to the stream API, Java 8 also introduced some nice new API methods that make certain things much simpler. For example, suppose we have the following method to remove all map entries for a given set of keys. In the example code, &lt;code&gt;dataCache&lt;/code&gt; is a &lt;code&gt;ConcurrentMap&lt;/code&gt; and &lt;code&gt;deleteKeys&lt;/code&gt; is the set of keys we want to remove from that cache. Here is the original code I came across:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void deleteFromCache(Set&amp;lt;String&amp;gt; deleteKeys) {
    Iterator&amp;lt;Map.Entry&amp;lt;String, Object&amp;gt;&amp;gt; iterator = dataCache.entrySet().iterator();
    while (iterator.hasNext()) {
        Map.Entry&amp;lt;String, Object&amp;gt; entry = iterator.next();
        if (deleteKeys.contains(entry.getKey())) {
            iterator.remove();
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, you could argue there are better ways to do this, e.g. iterate the delete keys and remove each mapping using the &lt;code&gt;Map#remove(Object key)&lt;/code&gt; method. For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void deleteFromCache(Set&amp;lt;String&amp;gt; deleteKeys) {
    for (String deleteKey : deleteKeys) {
        dataCache.remove(deleteKey);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The code using the &lt;code&gt;for&lt;/code&gt; loop certainly seems cleaner than using the &lt;code&gt;Iterator&lt;/code&gt; in this case, though both are functionally equivalent. Can we do better? Java 8 introduced the &lt;code&gt;removeIf&lt;/code&gt; method as a default method, not in &lt;code&gt;Map&lt;/code&gt; but instead in the &lt;code&gt;Collection&lt;/code&gt; interface. This new method &quot;removes all of the elements of this collection that satisfy the given predicate&quot;, to quote from the Javadocs. This method accepts one argument, a &lt;code&gt;Predicate&lt;/code&gt;, which is a functional interface introduced in Java 8 and which can therefore be implemented with a lambda expression. Let&apos;s first implement this as a regular old anonymous inner class, which you can still do even in Java 8. It looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void deleteFromCache(Set&amp;lt;String&amp;gt; deleteKeys) {
    dataCache.entrySet().removeIf(new Predicate&amp;lt;Map.Entry&amp;lt;String, Object&amp;gt;&amp;gt;() {
        @Override
        public boolean test(Map.Entry&amp;lt;String, Object&amp;gt; entry) {
            return deleteKeys.contains(entry.getKey());
        }
    });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, we first get the map&apos;s entry set via the &lt;code&gt;entrySet&lt;/code&gt; method and call &lt;code&gt;removeIf&lt;/code&gt; on it, supplying a &lt;code&gt;Predicate&lt;/code&gt; that tests whether the set of &lt;code&gt;deleteKeys&lt;/code&gt; contains the entry key. If this test returns true, the entry is removed. Since &lt;code&gt;Predicate&lt;/code&gt; is annotated with &lt;code&gt;@FunctionalInterface&lt;/code&gt; it can be implemented with a lambda expression, a method reference, or a constructor reference, according to &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/lang/FunctionalInterface.html&quot;&gt;the Javadoc&lt;/a&gt;. So let&apos;s take the first step and convert the anonymous inner class into a lambda expression:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void deleteFromCache(Set&amp;lt;String&amp;gt; deleteKeys) {
    dataCache.entrySet().removeIf((Map.Entry&amp;lt;String, Object&amp;gt; entry) -&amp;gt;
        deleteKeys.contains(entry.getKey()));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above, we&apos;ve replaced the anonymous class with a lambda expression that takes a single &lt;code&gt;Map.Entry&lt;/code&gt; argument. But, Java 8 can &lt;em&gt;infer&lt;/em&gt; the argument types of lambda expressions, so we can remove the explicit (and a bit noisy) type declarations, leaving us with the following cleaner code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void deleteFromCache(Set&amp;lt;String&amp;gt; deleteKeys) {
    dataCache.entrySet().removeIf(entry -&amp;gt; deleteKeys.contains(entry.getKey()));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This code is quite a bit nicer than the original code using an explicit &lt;code&gt;Iterator&lt;/code&gt;. But what about compared to the second code example that looped through the keys using a simple &lt;code&gt;for&lt;/code&gt; loop, and calling &lt;code&gt;remove&lt;/code&gt; to remove each element? The lines of code really aren&apos;t that different, so assuming they are functionally equivalent then perhaps it is just a style preference. The explicit for loop is a traditional imperative style, whereas the &lt;code&gt;removeIf&lt;/code&gt; has a more functional flavor to it. If you look at the actual implementation of &lt;code&gt;removeIf&lt;/code&gt; in the &lt;code&gt;Collection&lt;/code&gt; interface, it actually uses an &lt;code&gt;Iterator&lt;/code&gt; under the covers, just as with the first example in this post.&lt;/p&gt;

&lt;p&gt;So practically there is no difference in functionality. But, &lt;code&gt;removeIf&lt;/code&gt; could &lt;em&gt;theoretically&lt;/em&gt; be implemented for certain types of collections to perform the operation in parallel, and perhaps only for collections over a certain size where it can be shown that parallelizing the operation has benefits. But this simple example is really more about &lt;em&gt;separation of concerns&lt;/em&gt;, i.e. separating the logic of traversing the collection from the logic that determines whether or not an element is removed.&lt;/p&gt;

&lt;p&gt;For example, if a code base needs to remove elements from collections in many different places, chances are good that similar traversal logic will end up intertwined with removal logic throughout the code base. In contrast, using &lt;code&gt;removeIf&lt;/code&gt; leaves only the removal logic in those locations - and the removal logic is really your business logic. And, if at some later point in time the traversal logic in the Java collections framework were to be improved somehow, e.g. parallelized for large collections, then &lt;em&gt;all&lt;/em&gt; the locations using that function &lt;em&gt;automatically&lt;/em&gt; receive the same benefit, whereas code that combines the traversal and remove logic using explicit &lt;code&gt;Iterator&lt;/code&gt; or loops would not.&lt;/p&gt;
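&lt;p&gt;That separation also makes the removal rule independently testable. As a small sketch (the class and method names here are illustrative, not from the original code), the predicate can be extracted into a named factory method and exercised without building a real cache or writing any traversal code:&lt;/p&gt;

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Predicate;

public class CacheCleaner {
    private final ConcurrentMap<String, Object> dataCache = new ConcurrentHashMap<>();

    // The removal rule (the business logic) as a named, reusable predicate
    static Predicate<Map.Entry<String, Object>> keyIn(Set<String> deleteKeys) {
        return entry -> deleteKeys.contains(entry.getKey());
    }

    public void put(String key, Object value) {
        dataCache.put(key, value);
    }

    // Traversal is delegated entirely to removeIf
    public void deleteFromCache(Set<String> deleteKeys) {
        dataCache.entrySet().removeIf(keyIn(deleteKeys));
    }

    public int size() {
        return dataCache.size();
    }
}
```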

&lt;p&gt;In this case, and many others, I&apos;d argue the separation of concerns is a much better reason to prefer functional style to imperative style. Separation of concerns leads to better, cleaner code and easier code &lt;em&gt;re-use&lt;/em&gt; precisely since those concerns can be implemented separately, and also tested separately, which results in not only cleaner production code but also cleaner test code. All of which leads to more maintainable code, which means new features and enhancements to existing code can be accomplished faster and with less chance of breaking existing code. Until the next post in this ad-hoc series on Java 8 features and a functional style, happy coding!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog was originally published on the &lt;a href=&quot;http://www.fortitudetec.com&quot;&gt;Fortitude Technologies&lt;/a&gt; blog &lt;a href=&quot;http://www.fortitudetec.com/blogs/2016/9/12/towards-more-functional-java-using-lambdas-as-predicates&quot;&gt;here&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using</guid>
    <title>Towards more functional Java using Streams and Lambdas</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/towards_more_functional_java_using</link>
        <pubDate>Tue, 23 Aug 2016 12:30:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>refactoring</category>
            <description>&lt;p&gt;In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/reduce_java_boilerplate_using_try&quot;&gt;last post&lt;/a&gt; I showed how the Java 7 try-with-resources feature reduces boilerplate code, but probably more importantly how it removes errors related to unclosed resources, thereby eliminating an entire class of errors. In this post, the first in an ad-hoc series on Java 8 features, I&apos;ll show how the stream API can reduce the lines of code, but also how it can make the code more readable, maintainable, and less error-prone.&lt;/p&gt;

&lt;p&gt;The following code is from a simple back-end service that lets us query metadata about messages flowing through various systems. It takes a map of key-value pairs and creates a Lucene query that can be submitted to SOLR to obtain results. It is primarily used by developers to verify behavior in a distributed system, and it does not support very sophisticated queries, since it only ANDs the key-value pairs together to form the query. For example, given a parameter map containing the &lt;code&gt;(key, value)&lt;/code&gt; pairs &lt;code&gt;(lastName, Smith)&lt;/code&gt; and &lt;code&gt;(firstName, Bob)&lt;/code&gt;, the method would generate the query &lt;code&gt;&quot;lastName:Smith AND firstName:Bob&quot;&lt;/code&gt;. As I said, not very sophisticated.&lt;/p&gt;

&lt;p&gt;Here is the original code (where &lt;code&gt;AND&lt;/code&gt;, &lt;code&gt;COLON&lt;/code&gt;, and &lt;code&gt;DEFAULT_QUERY&lt;/code&gt; are constants):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public String buildQueryString(Map&amp;lt;String, String&amp;gt; parameters) {
    int count = 0;
    StringBuilder query = new StringBuilder();

    for (Map.Entry&amp;lt;String, String&amp;gt; entry : parameters.entrySet()) {
        if (count &amp;gt; 0) {
            query.append(AND);
        }
        query.append(entry.getKey());
        query.append(COLON);
        query.append(entry.getValue());
        count++;
    }

    if (parameters.size() == 0) {
        query.append(DEFAULT_QUERY);
    }

    return query.toString();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The core business logic should be very simple, since we only need to iterate the parameter map, join the keys and values with a colon, and finally join them together. But the code above, while not terribly hard to understand, has a lot of noise. First off, it uses two mutable variables (&lt;code&gt;count&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt;) that are modified within the &lt;code&gt;for&lt;/code&gt; loop. The first thing in the loop is a conditional that is needed to determine whether we need to append the &lt;code&gt;AND&lt;/code&gt; constant, as we only want to do that after the first key-value pair is added to the query. Next, joining the keys and values is done by concatenating them, one by one, to the &lt;code&gt;StringBuilder&lt;/code&gt; holding the query. Finally the count must be incremented so that in subsequent loop iterations, we properly include the &lt;code&gt;AND&lt;/code&gt; delimiter. After the loop there is another conditional which appends &lt;code&gt;DEFAULT_QUERY&lt;/code&gt; if there are no parameters, and then we finally convert the &lt;code&gt;StringBuilder&lt;/code&gt; to a &lt;code&gt;String&lt;/code&gt; and return it.&lt;/p&gt;

&lt;p&gt;Here is the &lt;code&gt;buildQueryString&lt;/code&gt; method after refactoring it to use the Java 8 stream API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public String buildQueryString(Map&amp;lt;String, String&amp;gt; parameters) {
    if (parameters.isEmpty()) {
        return DEFAULT_QUERY;
    }

    return parameters.entrySet().stream()
            .map(entry -&amp;gt; String.join(COLON, entry.getKey(), entry.getValue()))
            .collect(Collectors.joining(AND));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This code does the exact same thing, but in only 6 lines of code (counting the &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;collect&lt;/code&gt; lines as separate even though technically they are part of the stream call chain) instead of 15. But just measuring lines of code isn&apos;t everything. The main difference here is the lack of mutable variables, no external iteration via explicit looping constructs, and no conditional statements other than the empty check which short circuits and returns &lt;code&gt;DEFAULT_QUERY&lt;/code&gt; when there are no parameters. The code reads like a functional declaration of what we want to accomplish: stream over the parameters, convert each (key, value) to &lt;code&gt;&quot;key:value&quot;&lt;/code&gt; and join them all together using the delimiter &lt;code&gt;AND&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The specific Java 8 features we&apos;ve used here start with the &lt;code&gt;stream()&lt;/code&gt; method to convert the map entry set to a Java 8 &lt;code&gt;java.util.stream.Stream&lt;/code&gt;. We then use the &lt;code&gt;map&lt;/code&gt; operation on the stream, which applies a function (&lt;code&gt;String.join&lt;/code&gt;) to each element (&lt;code&gt;Map.Entry&lt;/code&gt;) in the stream. Finally, we use the &lt;code&gt;collect&lt;/code&gt; method to &lt;em&gt;reduce&lt;/em&gt; the elements using the &lt;code&gt;joining&lt;/code&gt; collector into the resulting string that is the actual query we wanted to build. In the &lt;code&gt;map&lt;/code&gt; method we&apos;ve also made use of a &lt;em&gt;lambda expression&lt;/em&gt; to specify exactly what transformation to perform on each map entry.&lt;/p&gt;
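&lt;p&gt;Putting it together as a runnable sketch: the constant values below are assumptions, since the post doesn&apos;t show them, and a &lt;code&gt;LinkedHashMap&lt;/code&gt; is used so the output order is deterministic:&lt;/p&gt;

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class QueryBuilder {
    // Assumed values -- the real constants are not shown in the post
    static final String AND = " AND ";
    static final String COLON = ":";
    static final String DEFAULT_QUERY = "*:*";

    static String buildQueryString(Map<String, String> parameters) {
        if (parameters.isEmpty()) {
            return DEFAULT_QUERY;
        }
        // map each entry to "key:value", then join the fragments with AND
        return parameters.entrySet().stream()
                .map(entry -> String.join(COLON, entry.getKey(), entry.getValue()))
                .collect(Collectors.joining(AND));
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("lastName", "Smith");
        params.put("firstName", "Bob");
        System.out.println(buildQueryString(params));  // lastName:Smith AND firstName:Bob
    }
}
```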


&lt;p&gt;By removing explicit iteration and mutable variables, the code is more readable, in that a developer seeing this code for the first time will have an easier and quicker time understanding what it does. Note that much of the &lt;em&gt;how&lt;/em&gt; it does things has been removed, for example the iteration is now implicit via the &lt;code&gt;Stream&lt;/code&gt;, and the &lt;code&gt;joining&lt;/code&gt; collector now does the work of inserting a delimiter between the elements. You&apos;re now &lt;em&gt;declaring&lt;/em&gt; what you want to happen, instead of having to explicitly perform all the tedium yourself. This is more of a &lt;em&gt;functional style&lt;/em&gt; than most Java developers are used to, and at first it can be a bit jarring, but the more you practice and get used to it, the more you&apos;ll probably like it, and you&apos;ll find yourself able to read and write this style of code much more quickly than traditional code with lots of loops and conditionals. Generally there is also less code than when using traditional looping and control structures, which is another benefit for maintenance. I won&apos;t go so far as to say Java 8 is a functional language like Clojure or Haskell - since it isn&apos;t - but code like this has a more functional &lt;em&gt;flavor&lt;/em&gt; to it.&lt;/p&gt;

&lt;p&gt;There is now a metric ton of content on the internet related to Java 8 streams, but in case this is all new to you, or you&apos;re just looking for a decent place to begin learning more in-depth, the &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html&quot;&gt;API documentation&lt;/a&gt; for the &lt;code&gt;java.util.stream&lt;/code&gt; package is a good place to start. Venkat Subramaniam&apos;s &lt;a href=&quot;https://pragprog.com/book/vsjava8/functional-programming-in-java&quot;&gt;Functional Programming in Java&lt;/a&gt; is another good resource, and at less than 200 pages can be digested pretty quickly. And for more on lambda expressions, the &lt;a href=&quot;https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html&quot;&gt;Lambda Expressions&lt;/a&gt; tutorial in the official Java Tutorials is a decent place to begin. In the next post, we&apos;ll see another example where a simple Java 8 API addition combined with a lambda expression simplifies code, making it more readable and maintainable. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog was originally published on the &lt;a href=&quot;http://www.fortitudetec.com&quot;&gt;Fortitude Technologies&lt;/a&gt; blog &lt;a href=&quot;http://www.fortitudetec.com/blogs/2016/8/22/java8-query-builder&quot;&gt;here&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/reduce_java_boilerplate_using_try</guid>
    <title>Reduce Java boilerplate using try-with-resources</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/reduce_java_boilerplate_using_try</link>
        <pubDate>Thu, 11 Aug 2016 12:12:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
            <description>&lt;p&gt;Java 8 has been out for a while, and Java 7 has been out even longer. But even so, many people still unfortunately are not taking advantage of some of the new features, many of which make reading and writing Java code much more pleasant. For example, Java 7 introduced some relatively simple things like strings in switch statements, underscores in numeric literals (e.g. &lt;code&gt;1_000_000&lt;/code&gt; is easier to read and see the magnitude than just &lt;code&gt;1000000&lt;/code&gt;), and the try-with-resources statement. Java 8 went a lot further and introduced lambda expressions, the streams API, a new date/time API based on the Joda Time library, &lt;code&gt;Optional&lt;/code&gt;, and more.&lt;/p&gt;

&lt;p&gt;In this blog and in a few subsequent posts, I will take a simple snippet of code from a real project, and show what the code looked like originally and what it looked like after refactoring it to be more readable and maintainable. To start, this blog will actually tackle the try-with-resources statement introduced in Java 7. Many people even in 2016 still seem not to be aware of this statement, which not only makes the code less verbose, but also eliminates an entire class of errors resulting from failure to close I/O or other resources.&lt;/p&gt;

&lt;p&gt;Without further ado (whatever ado actually means), here is a method that was used to check port availability when starting up services.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public boolean isPortAvailable(final int port) {
    ServerSocket serverSocket = null;
    DatagramSocket dataSocket = null;

    try {
        serverSocket = new ServerSocket(port);
        serverSocket.setReuseAddress(true);
        dataSocket = new DatagramSocket(port);
        dataSocket.setReuseAddress(true);
        return true;
    } catch (IOException e) {
        return false;
    } finally {
        if (dataSocket != null) {
            dataSocket.close();
        }

        if (serverSocket != null) {
            try {
                serverSocket.close();
            } catch (IOException e) {
                // ignored
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The core logic for the above code is pretty simple: open a &lt;code&gt;ServerSocket&lt;/code&gt; and a &lt;code&gt;DatagramSocket&lt;/code&gt;, and if both open without throwing an exception, the port is available. It&apos;s all the extra boilerplate code and exception handling that makes the code so lengthy and error-prone, because we need to make sure to close the sockets in the &lt;code&gt;finally&lt;/code&gt; block, being careful to first check they are not null. For good measure, the &lt;code&gt;ServerSocket#close&lt;/code&gt; method throws yet another &lt;code&gt;IOException&lt;/code&gt;, which we simply ignore but are required to catch nonetheless. All of that extra code obscures the simple logic at the core of the method.&lt;/p&gt;

&lt;p&gt;Here&apos;s the refactored version which makes use of the try-with-resources statement from Java 7.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public boolean isPortAvailable(final int port) {
    try (ServerSocket serverSocket = new ServerSocket(port); 
         DatagramSocket dataSocket = new DatagramSocket(port)) {
        serverSocket.setReuseAddress(true);
        dataSocket.setReuseAddress(true);
        return true;
    } catch (IOException e) {
        return false;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can hopefully see, this code has the same core logic, but much less of the boilerplate. There is not only less code (7 lines instead of 22), but the code is much more readable since only the core logic remains. We are still catching the &lt;code&gt;IOException&lt;/code&gt; that can be thrown by the &lt;code&gt;ServerSocket&lt;/code&gt; and &lt;code&gt;DatagramSocket&lt;/code&gt; constructors, but we no longer need to deal with the routine closing of those socket resources. The try-with-resources statement does that task for us, automatically closing any resources opened in the declaration statement that immediately follows the &lt;code&gt;try&lt;/code&gt; keyword.&lt;/p&gt;
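&lt;p&gt;One detail worth knowing: when multiple resources are declared, they are closed in the &lt;em&gt;reverse&lt;/em&gt; of their declaration order, so above, &lt;code&gt;dataSocket&lt;/code&gt; closes before &lt;code&gt;serverSocket&lt;/code&gt;, just as in the original &lt;code&gt;finally&lt;/code&gt; block. A tiny sketch with a made-up recording resource demonstrates the order:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderDemo {
    static final List<String> closed = new ArrayList<>();

    // A trivial AutoCloseable that records when it is closed
    static class Resource implements AutoCloseable {
        final String name;
        Resource(String name) { this.name = name; }
        @Override public void close() { closed.add(name); }
    }

    public static void main(String[] args) {
        try (Resource first = new Resource("first");
             Resource second = new Resource("second")) {
            // both resources are open here
        }
        System.out.println(closed);  // [second, first]
    }
}
```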

&lt;p&gt;The one catch is that the declared resources must implement the &lt;code&gt;AutoCloseable&lt;/code&gt; interface (the older &lt;code&gt;Closeable&lt;/code&gt; interface extends it, so &lt;code&gt;Closeable&lt;/code&gt; resources work too). Since the Java APIs make extensive use of &lt;code&gt;Closeable&lt;/code&gt; and &lt;code&gt;AutoCloseable&lt;/code&gt;, most things you&apos;ll want to use can be handled via try-with-resources. Classes that don&apos;t implement &lt;code&gt;AutoCloseable&lt;/code&gt; cannot be used directly in try-with-resources statements. For example, if you are unfortunate enough to still need to deal with XML using the old-school &lt;code&gt;XMLStreamReader&lt;/code&gt;, you are out of luck since it doesn&apos;t implement &lt;code&gt;Closeable&lt;/code&gt; or &lt;code&gt;AutoCloseable&lt;/code&gt;. I generally fix those types of things by creating a small wrapper/decorator class, e.g. &lt;code&gt;CloseableXMLStreamReader&lt;/code&gt;, but sometimes it simply isn&apos;t worth the trouble unless you are using it in many different places.&lt;/p&gt;
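
&lt;p&gt;Here is a minimal sketch of that wrapper/decorator idea. The class name &lt;code&gt;CloseableXMLStreamReader&lt;/code&gt; comes from the text above, but the implementation details are my own illustration, not a published class:&lt;/p&gt;

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical decorator that makes XMLStreamReader usable in try-with-resources
public class CloseableXMLStreamReader implements AutoCloseable {

    private final XMLStreamReader delegate;

    public CloseableXMLStreamReader(XMLStreamReader delegate) {
        this.delegate = delegate;
    }

    public XMLStreamReader reader() {
        return delegate;
    }

    @Override
    public void close() throws XMLStreamException {
        delegate.close();
    }

    // Example usage: find the name of the root element in an XML string
    public static String rootElementName(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (CloseableXMLStreamReader closeable = new CloseableXMLStreamReader(
                factory.createXMLStreamReader(new StringReader(xml)))) {
            XMLStreamReader reader = closeable.reader();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootElementName("<blog><entry/></blog>")); // blog
    }
}
```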

&lt;p&gt;For more information on try-with-resources, the Java tutorial on Oracle&apos;s website has a more in-depth article &lt;a href=&quot;http://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html&quot;&gt;here&lt;/a&gt;. In subsequent posts, I&apos;ll show some before/after code that makes use of Java 8 features such as the stream API and lambda expressions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog was originally published on the &lt;a href=&quot;http://www.fortitudetec.com&quot;&gt;Fortitude Technologies&lt;/a&gt; blog &lt;a href=&quot;http://www.fortitudetec.com/blogs/2016/8/8/java-try-with-resources&quot;&gt;here&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/slides_for_restful_web_services</guid>
    <title>Slides for RESTful Web Services with Jersey presentation</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/slides_for_restful_web_services</link>
        <pubDate>Tue, 10 Jun 2014 10:50:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>rest</category>
    <category>jersey</category>
            <description>&lt;p&gt;While teaching a course on web development which included Ruby on Rails and Java segments, we used &lt;a href=&quot;https://jersey.java.net&quot; title=&quot;Jersey&quot;&gt;Jersey&lt;/a&gt; to expose a simple web services which the Rails application consumed. I put together a presentation on Jersey that I recently gave. Here are the slides:&lt;/p&gt;

&lt;p&gt;&lt;iframe src=&quot;http://www.slideshare.net/slideshow/embed_code/35586008&quot; width=&quot;427&quot; height=&quot;356&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px 1px 0; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;https://www.slideshare.net/scottleber/jersey-35586008&quot; title=&quot;RESTful Web Services with Jersey&quot; target=&quot;_blank&quot;&gt;RESTful Web Services with Jersey&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;http://www.slideshare.net/scottleber&quot; target=&quot;_blank&quot;&gt;Scott Leberknight&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/slides_for_httpie_presentation</guid>
    <title>Slides for httpie presentation</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/slides_for_httpie_presentation</link>
        <pubDate>Mon, 9 Jun 2014 09:40:00 +0000</pubDate>
    <category>Development</category>
    <category>curl</category>
    <category>rest</category>
    <category>httpie</category>
            <description>&lt;p&gt;I&apos;ve used &lt;a href=&quot;http://curl.haxx.se&quot; title=&quot;cURL&quot;&gt;cURL&lt;/a&gt; for a long time but I can never seem to remember all the various flags and settings. Recently I came across &lt;a href=&quot;http://httpie.org&quot; title=&quot;httpie&quot;&gt;httpie&lt;/a&gt; which is a simple command line tool for accessing HTTP resources. Here are the presentation slides:&lt;/p&gt;

&lt;p&gt;&lt;iframe src=&quot;http://www.slideshare.net/slideshow/embed_code/35585955&quot; width=&quot;427&quot; height=&quot;356&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px 1px 0; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;https://www.slideshare.net/scottleber/htt-pie-minitalk&quot; title=&quot;httpie&quot; target=&quot;_blank&quot;&gt;httpie&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;http://www.slideshare.net/scottleber&quot; target=&quot;_blank&quot;&gt;Scott Leberknight&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/building_a_distributed_lock_revisited</guid>
    <title>Building a Distributed Lock Revisited: Using Curator&apos;s InterProcessMutex</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/building_a_distributed_lock_revisited</link>
        <pubDate>Mon, 30 Dec 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>curator</category>
    <category>distributed-computing</category>
    <category>java</category>
    <category>zookeeper</category>
            <description>&lt;p&gt;Last summer I wrote a series of blogs introducing &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt;, which is a distributed coordination service used in many open source projects like &lt;a href=&quot;http://hadoop.apache.org&quot;&gt;Hadoop&lt;/a&gt;, &lt;a href=&quot;http://hbase.apache.org&quot;&gt;HBase&lt;/a&gt;, and &lt;a href=&quot;http://storm-project.net&quot;&gt;Storm&lt;/a&gt; to manage clusters of machines. The &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part4&quot;&gt;fifth blog&lt;/a&gt; described how to use ZooKeeper to implement a distributed lock. In that blog I explained that the goals of a distributed lock are &quot;to build a mutually exclusive lock between processes that could be running on different machines, possibly even on different networks or different data centers&quot;. I also mentioned that one significant benefit is that &quot;clients know nothing about each other; they only know they need to use the lock to access some shared resource, and that they should not access it unless they own the lock.&quot; That blog described how to use the ZooKeeper &lt;code&gt;WriteLock&lt;/code&gt; &quot;recipe&quot; that comes with ZooKeeper in the contrib modules to build a synchronous &lt;code&gt;BlockingWriteLock&lt;/code&gt; with easier semantics in which you simply call a &lt;code&gt;lock()&lt;/code&gt; method to acquire the lock, and call &lt;code&gt;unlock()&lt;/code&gt; to release the lock. Earlier in the series, we learned how to connect to ZooKeeper in the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part2&quot;&gt;Group Membership Example blog&lt;/a&gt; using a &lt;code&gt;Watcher&lt;/code&gt; and a &lt;code&gt;CountDownLatch&lt;/code&gt; to block until the &lt;code&gt;SyncConnected&lt;/code&gt; event was received. 
All that code wasn&apos;t terribly complex, but it also was fairly low-level, especially if you include the need to block until a connection event is received and the non-trivial implementation of the &lt;code&gt;WriteLock&lt;/code&gt; recipe.&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part5&quot;&gt;wrap-up blog&lt;/a&gt; I mentioned the &lt;a href=&quot;http://curator.apache.org/&quot;&gt;Curator&lt;/a&gt; project, originally open sourced by Netflix and later donated by them to Apache. The Curator wiki describes Curator as &quot;a set of Java libraries that make using Apache ZooKeeper much easier&quot;. In this blog we&apos;ll see how to use Curator to implement a distributed lock, without needing to write any of our own wrapper code for obtaining a connection or to implement the lock itself. In the distributed lock blog we saw how sequential ephemeral child nodes (e.g. &lt;code&gt;child-lock-node-0000000000&lt;/code&gt;, &lt;code&gt;child-lock-node-0000000001&lt;/code&gt;, &lt;code&gt;child-lock-node-0000000002&lt;/code&gt;, etc.) are created under a persistent parent lock node. The client holding the lock on the child with the lowest sequence number owns the lock. We saw several potential gotchas: first, how does a client know whether it successfully created a child node in the case of a partial failure, i.e. a (temporary) connection loss, and how does it know which child node it created, i.e. the child with which sequence number? I noted that a solution was to embed the ZooKeeper session ID in the child node such that the client can easily identify the child node it created. Jordan Zimmerman (the creator of Curator) was kind enough to post a comment to that blog noting that using the session ID is &quot;not ideal&quot; because it &quot;prevents the same ZK connection from being used in multiple threads for the same lock&quot;. He said &quot;It&apos;s much better to use a GUID. This is what Curator uses.&quot;&lt;/p&gt;

&lt;p&gt;Second, we noted that distributed lock clients should watch only the immediately preceding child node rather than the parent node in order to prevent a &quot;herd effect&quot; in which every client is notified for every single child node event, when in reality each client only needs to care about the child immediately preceding it. Curator handles both of these cases and adds other goodies such as a retry policy for connecting to ZooKeeper. So without further comment, let&apos;s see how to use a distributed lock in Curator.&lt;/p&gt;

&lt;p&gt;First, we&apos;ll need to get an instance of &lt;code&gt;CuratorFramework&lt;/code&gt; - this is an interface that represents a higher level abstraction API for working with ZooKeeper. It provides automatic connection management including retry operations, a fluent-style API, as well as a bunch of recipes you can use out-of-the-box for distributed data structures like locks, queues, leader election, etc. We can use the &lt;code&gt;CuratorFrameworkFactory&lt;/code&gt; and a &lt;code&gt;RetryPolicy&lt;/code&gt; of our choosing to get one.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;String hosts = &quot;host-1:2181,host-2:2181,host-3:2181&quot;;
int baseSleepTimeMillis = 1000;
int maxRetries = 3;

RetryPolicy retryPolicy = new ExponentialBackoffRetry(baseSleepTimeMillis, maxRetries);
CuratorFramework client = CuratorFrameworkFactory.newClient(hosts, retryPolicy);
client.start();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above code we first create a retry policy - in this case an &lt;code&gt;ExponentialBackoffRetry&lt;/code&gt; using a base sleep time of 1000 milliseconds and up to 3 retries. Then we can use &lt;code&gt;CuratorFrameworkFactory.newClient()&lt;/code&gt; to obtain an instance of &lt;code&gt;CuratorFramework&lt;/code&gt;. Finally we need to call &lt;code&gt;start()&lt;/code&gt; (note we&apos;ll need to call &lt;code&gt;close()&lt;/code&gt; when we&apos;re done with the client). Now that we have a client instance, we can use an implementation of &lt;code&gt;InterProcessLock&lt;/code&gt; to create our distributed lock. The simplest one is &lt;code&gt;InterProcessMutex&lt;/code&gt;, which is a &lt;em&gt;re-entrant&lt;/em&gt; mutual exclusion lock that works across JVMs by using ZooKeeper to hold the lock.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;InterProcessLock lock = new InterProcessMutex(client, lockPath);
lock.acquire();
try {
  // do work while we hold the lock
} catch (Exception ex) {
  // handle exceptions as appropriate
} finally {
  lock.release();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above code simply creates an &lt;code&gt;InterProcessMutex&lt;/code&gt; for a specific lock path (&lt;code&gt;lockPath&lt;/code&gt;), acquires the lock, does some work, and releases the lock. In this case &lt;code&gt;acquire()&lt;/code&gt; will block until the lock becomes available. In many cases blocking indefinitely won&apos;t be a Good Thing, and Curator provides an overloaded version of &lt;code&gt;acquire()&lt;/code&gt; which requires a maximum time to wait for the lock and returns &lt;code&gt;true&lt;/code&gt; if the lock is obtained within the time limit and &lt;code&gt;false&lt;/code&gt; otherwise.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;InterProcessLock lock = new InterProcessMutex(client, lockPath);
if (lock.acquire(waitTimeSeconds, TimeUnit.SECONDS)) {
  try {
    // do work while we hold the lock
  } catch (Exception ex) {
    // handle exceptions as appropriate
  } finally {
    lock.release();
  }
} else {
  // we timed out waiting for lock, handle appropriately
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The above code demonstrates using the timeout version of &lt;code&gt;acquire&lt;/code&gt;. The code is slightly more complex since you need to check whether the lock is acquired or whether we timed out waiting for the lock. Regardless of which version of &lt;code&gt;acquire()&lt;/code&gt; you use, you&apos;ll need to &lt;code&gt;release()&lt;/code&gt; the lock in a &lt;code&gt;finally&lt;/code&gt; block. The final piece is to remember to close the client when you&apos;re done with it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;client.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And that&apos;s pretty much it for using Curator&apos;s &lt;code&gt;InterProcessMutex&lt;/code&gt; to implement a distributed lock. All the complexity in handling connection management, partial failures, the &quot;herd effect&quot;, automatic retries, and so on is handled by the higher level Curator APIs. To paraphrase &lt;a href=&quot;https://www.google.com/#q=stu%20halloway&quot;&gt;Stu Halloway&lt;/a&gt;, you should always understand at least one layer beneath the one you&apos;re working at - in this case you should have a decent understanding of how ZooKeeper works under the covers and some of the potential issues of distributed computing. But having said that, go ahead and use Curator to work at a higher level of abstraction and gain the benefits of all the distributed computing experience at Netflix as well as Yahoo (which created ZooKeeper). And last, Happy New Year 2014!&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase5</guid>
    <title>Handling Big Data with HBase Part 6: Wrap-up</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase5</link>
        <pubDate>Fri, 20 Dec 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>hadoop</category>
    <category>distributed-computing</category>
    <category>hbase</category>
            <description>&lt;p&gt;&lt;em&gt;This is the sixth and final blog in an introduction to &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;Apache HBase&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase4&quot;&gt;fifth part&lt;/a&gt;, we learned the basics of schema design in HBase and several techniques you can use to make scanning and filtering data more efficient. We also saw two different basic design schemes (&quot;wide&quot; and &quot;tall&quot;) for storing information about the same entity, and briefly touched upon more advanced topics like adding full-text search and secondary indexes. In this part, we&apos;ll wrap up by summarizing the main points and then listing the (many) things we didn&apos;t cover in this introduction to HBase series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;HBase is a distributed database providing inherent scalability, performance, and fault-tolerance across potentially massive clusters of commodity servers. It provides the means to store and efficiently scan large swaths of data. We&apos;ve looked at the HBase shell for basic interaction, covered the high-level HBase architecture and looked at using the Java API to create, get, scan, and delete data. We also considered how to design tables and row keys for efficient data access.&lt;/p&gt;

&lt;p&gt;One thing you certainly noticed when working with the HBase Java API is that it is much lower level than other data APIs you might be used to working with, for example JDBC or JPA. You get the basics of CRUD plus scanning data, and that&apos;s about it. In addition, you work directly with byte arrays which is about as low-level as it gets when you&apos;re trying to retrieve information from a datastore.&lt;/p&gt;

&lt;p&gt;If you are considering whether to use HBase, you should really think hard about how large the data is, i.e. does your app need to be able to accommodate ever-growing volumes of data? If it does, then you need to think carefully about what that data looks like and what the most likely data access patterns will be, as this will drive your schema design and data access patterns. For example, if you are designing a schema for a weather collection project, you will want to consider using a &quot;tall&quot; schema design such that the sensor readings for each sensor are split across rows as opposed to a &quot;wide&quot; design in which you keep adding columns to a column family in a single row. Unlike relational models in which you work hard to normalize data and then use SQL as a flexible way to join the data in various ways, with HBase you need to think much more up-front about the data access patterns, because retrieval by row key and table scans are the only two ways to access data. In other words, there is no joining across multiple HBase tables and projecting out the columns you need. When you retrieve data, you want to only ask HBase for the exact data you need.&lt;/p&gt;

&lt;h1&gt;Things We Didn&apos;t Cover&lt;/h1&gt;

&lt;p&gt;Now let&apos;s discuss a few things we didn&apos;t cover. First, coprocessors were a major addition to HBase in version 0.92, and were inspired by Google adding coprocessors to its Bigtable data store. You can, at a high level, think of coprocessors like triggers or stored procedures in relational databases. Basically you can have either trigger-like functionality via observers, or stored-procedure functionality via RPC endpoints. This allows many new things to be accomplished in an elegant fashion, for example maintaining secondary indexes via observing changes to data.&lt;/p&gt;

&lt;p&gt;We showed basic API usage, but there is more advanced usage possible with the API. For example, you can batch data and provide much more advanced filtering behavior than a simple paging filter like we showed. There is also the concept of &lt;em&gt;counters&lt;/em&gt;, which allows you to do atomic increments of numbers without requiring the client to perform explicit row locking. And if you&apos;re not really into Java, there are external APIs available via Thrift and REST gateways. There&apos;s even a C/C++ client available, and there are DSLs for Groovy, Jython, and Scala. These are all discussed on the HBase wiki.&lt;/p&gt;

&lt;p&gt;Cluster setup and configuration was not covered at all, nor was performance tuning. Obviously these are hugely important topics and the references below are good starting places. With HBase you not only need to worry about tuning HBase configuration, but also tuning Hadoop (or more specifically, the HDFS file system). For these topics definitely start with the HBase References Guide and also check out HBase: The Definitive Guide by Lars George.&lt;/p&gt;

&lt;p&gt;We also didn&apos;t cover how to Map/Reduce with HBase. Essentially you can use Hadoop&apos;s Map/Reduce framework to access HBase tables and perform tasks like aggregation in a Map/Reduce-style.&lt;/p&gt;

&lt;p&gt;Last there is security (which I suppose should be expected to come last for a developer, right?) in HBase. There are two types of security I&apos;m referring to here: first is access to HBase itself in order to create, read, update, and delete data, e.g. via requiring Kerberos authentication to connect to HBase. The second type of security is ACL-based access restrictions. As of this writing, HBase lets you restrict access via ACLs at the table and column family level. However, &lt;a href=&quot;https://communities.intel.com/community/datastack/blog/2013/10/29/hbase-cell-security&quot;&gt;HBase Cell Security&lt;/a&gt; describes how cell-level security features similar to those in &lt;a href=&quot;http://accumulo.apache.org&quot;&gt;Apache Accumulo&lt;/a&gt; are being added to HBase (tracked in &lt;a href=&quot;https://issues.apache.org/jira/browse/HBASE-8496&quot;&gt;this issue&lt;/a&gt;) and are scheduled for release in version 0.98 (the current version as of this writing is 0.96).&lt;/p&gt;

&lt;h1&gt;Goodbye!&lt;/h1&gt;

&lt;p&gt;With this background, you can now consider whether HBase makes sense on future projects with Big Data and high scalability requirements. I hope you found this series of posts useful as an introduction to HBase.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;HBase web site, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;http://hbase.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase wiki, &lt;a href=&quot;http://wiki.apache.org/hadoop/Hbase&quot;&gt;http://wiki.apache.org/hadoop/Hbase&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase Reference Guide &lt;a href=&quot;http://hbase.apache.org/book/book.html&quot;&gt;http://hbase.apache.org/book/book.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hbase-definitive-guide&quot;&gt;http://bit.ly/hbase-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google Bigtable Paper, &lt;a href=&quot;http://labs.google.com/papers/bigtable.html&quot;&gt;http://labs.google.com/papers/bigtable.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop web site, &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;http://hadoop.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hadoop-definitive-guide&quot;&gt;http://bit.ly/hadoop-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fallacies of Distributed Computing, &lt;a href=&quot;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&quot;&gt;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase lightning talk slides, &lt;a href=&quot;http://www.slideshare.net/scottleber/hbase-lightningtalk&quot;&gt;http://www.slideshare.net/scottleber/hbase-lightningtalk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sample code, &lt;a href=&quot;https://github.com/sleberknight/basic-hbase-examples&quot;&gt;https://github.com/sleberknight/basic-hbase-examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase4</guid>
    <title>Handling Big Data with HBase Part 5: Data Modeling (or, Life without SQL)</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase4</link>
        <pubDate>Wed, 18 Dec 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>distributed-computing</category>
    <category>hbase</category>
    <category>hadoop</category>
    <category>java</category>
            <description>&lt;p&gt;&lt;em&gt;This is the fifth of a series of blogs introducing &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;Apache HBase&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase3&quot;&gt;fourth part&lt;/a&gt;, we saw the basics of using the Java API to interact with HBase to create tables, retrieve data by row key, and do table scans. This part will discuss how to design schemas in HBase.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;HBase has nothing similar to a rich query capability like SQL from relational databases. Instead, it forgoes this capability and others like relationships, joins, etc. to focus on providing scalability with good performance and fault-tolerance. So when working with HBase you need to design the row keys and table structure in terms of rows and column families to match the data access patterns of your application. This is the complete opposite of what you do with relational databases, where you start out with a normalized database schema and separate tables, and then use SQL to perform joins to combine data in the ways you need. With HBase you design your tables specific to how they will be accessed by applications, so you need to think much more up-front about how data is accessed. You are much closer to the bare metal with HBase than with relational databases, which abstract implementation details and storage mechanisms. However, for applications needing to store massive amounts of data and have inherent scalability, performance characteristics and tolerance to server failures, the potential benefits can far outweigh the costs.&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase3&quot;&gt;last part on the Java API&lt;/a&gt;, I mentioned that when scanning data in HBase, the row key is critical since it is the primary means to restrict the rows scanned; there is no rich query language like SQL as in relational databases. Typically you create a scan using start and stop row keys and optionally add filters to further restrict the rows and columns data returned. In order to have some flexibility when scanning, the row key should be designed to contain the information you need to find specific subsets of data. In the blog and people examples we&apos;ve seen so far, the row keys were designed to allow scanning via the most common data access patterns. For the blogs, the row keys were simply the posting date. This would permit scans in ascending order of blog entries, which is probably not the most common way to view blogs; you&apos;d rather see the most recent blogs first. So a better row key design would be to use a reverse order timestamp, which you can get using the formula &lt;code&gt;(Long.MAX_VALUE - timestamp)&lt;/code&gt;, so scans return the most recent blog posts first. This makes it easy to scan specific time ranges, for example to show all blogs in the past week or month, which is a typical way to navigate blog entries in web applications.&lt;/p&gt;
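
&lt;p&gt;The reverse-order timestamp trick can be sketched in a couple of lines (the timestamp values below are just illustrative):&lt;/p&gt;

```java
public class ReverseTimestampDemo {

    // Row key component: later timestamps produce smaller values, so the most
    // recent entries sort first in HBase's ascending row key order
    public static long reverseTimestamp(long timestamp) {
        return Long.MAX_VALUE - timestamp;
    }

    public static void main(String[] args) {
        long older = 1386979200000L; // an earlier posting time (illustrative)
        long newer = 1387238400000L; // a later posting time (illustrative)

        // The newer post has the smaller key component, so it scans first
        System.out.println(reverseTimestamp(newer) < reverseTimestamp(older)); // true
    }
}
```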

&lt;p&gt;For the &lt;code&gt;people&lt;/code&gt; table examples, we used a composite row key composed of last name, first name, middle initial, and a (unique) person identifier to distinguish people with the exact same name, separated by dashes. For example, Brian M. Smith with identifier 12345 would have row key &lt;code&gt;smith-brian-m-12345&lt;/code&gt;. Scans for the &lt;code&gt;people&lt;/code&gt; table can then be composed using start and end rows designed to retrieve people with specific last names, last names starting with specific letter combinations, or people with the same last name and first name initial. For example, if you wanted to find people whose first name begins with &lt;code&gt;B&lt;/code&gt; and last name is &lt;code&gt;Smith&lt;/code&gt; you could use the start row key &lt;code&gt;smith-b&lt;/code&gt; and stop row key &lt;code&gt;smith-c&lt;/code&gt; (the start row key is inclusive while the stop row key is exclusive, so the stop key &lt;code&gt;smith-c&lt;/code&gt; ensures all Smiths with first name starting with the letter &quot;B&quot; are included). You can see that HBase supports the notion of partial keys, meaning you do not need to know the exact key, which provides more flexibility when creating scans. You can combine partial key scans with filters to retrieve only the specific data needed, thus optimizing data retrieval for the data access patterns specific to your application.&lt;/p&gt;
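
&lt;p&gt;Since HBase row keys sort lexicographically by bytes, the partial-key scan above can be sketched with plain string comparisons. The &lt;code&gt;rowKey&lt;/code&gt; and &lt;code&gt;inScanRange&lt;/code&gt; helpers are hypothetical illustrations, but they mirror HBase&apos;s inclusive-start/exclusive-stop scan semantics:&lt;/p&gt;

```java
public class PeopleRowKeyDemo {

    // Hypothetical helper building the composite row key described in the text,
    // e.g. rowKey("Smith", "Brian", "M", "12345") -> "smith-brian-m-12345"
    public static String rowKey(String last, String first, String middle, String id) {
        return (last + "-" + first + "-" + middle + "-" + id).toLowerCase();
    }

    // A row is returned by a scan if startKey <= row < stopKey (lexicographically):
    // the start key is inclusive and the stop key is exclusive
    public static boolean inScanRange(String row, String startKey, String stopKey) {
        return row.compareTo(startKey) >= 0 && row.compareTo(stopKey) < 0;
    }

    public static void main(String[] args) {
        String brian = rowKey("Smith", "Brian", "M", "12345");
        String alice = rowKey("Smith", "Alice", "K", "67890");

        // Partial-key scan: all Smiths whose first name starts with "b"
        System.out.println(inScanRange(brian, "smith-b", "smith-c")); // true
        System.out.println(inScanRange(alice, "smith-b", "smith-c")); // false
    }
}
```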

&lt;p&gt;So far the examples have involved only single tables containing one type of information and no related information. HBase does not have foreign key relationships like in relational databases, but because it supports rows having up to millions of columns, one way to design tables in HBase is to encapsulate related information in the same row - a &quot;wide&quot; table design. It is called a &quot;wide&quot; design since you are storing all information related to a row together in as many columns as there are data items. In our blog example, you might want to store comments for each blog. The &quot;wide&quot; way to design this would be to include a column family named &lt;code&gt;comments&lt;/code&gt; and then add columns to the &lt;code&gt;comments&lt;/code&gt; family where the qualifiers are the comment timestamps; the comment columns would look like &lt;code&gt;comments:20130704142510&lt;/code&gt; and &lt;code&gt;comments:20130707163045&lt;/code&gt;. Even better, when HBase retrieves columns it returns them in sorted order, just like row keys. So in order to display a blog entry and its comments, you can retrieve all the data from one row by asking for the &lt;code&gt;content&lt;/code&gt;, &lt;code&gt;info&lt;/code&gt;, and &lt;code&gt;comments&lt;/code&gt; column families. You could also add a filter to retrieve only a specific number of comments, adding pagination to them.&lt;/p&gt;
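
&lt;p&gt;Because columns within a family come back sorted by qualifier, a &lt;code&gt;TreeMap&lt;/code&gt; makes a reasonable stand-in to illustrate the ordering of the timestamp-based comment qualifiers (the qualifiers come from the example above; the &lt;code&gt;TreeMap&lt;/code&gt; simulation and comment text are mine):&lt;/p&gt;

```java
import java.util.TreeMap;

public class CommentsColumnDemo {

    // Simulates a row's "comments" column family: HBase returns column
    // qualifiers in sorted (byte-lexicographic) order, just like row keys,
    // so timestamp-based qualifiers come back oldest-first regardless of
    // insertion order
    public static String firstQualifier() {
        TreeMap<String, String> comments = new TreeMap<String, String>();
        comments.put("comments:20130707163045", "Great follow-up!");
        comments.put("comments:20130704142510", "Nice post!");
        return comments.firstKey();
    }

    public static void main(String[] args) {
        System.out.println(firstQualifier()); // comments:20130704142510
    }
}
```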

&lt;p&gt;The &lt;code&gt;people&lt;/code&gt; table column families could also be redesigned to store contact information such as separate addresses, phone numbers, and email addresses in column families allowing all of a person&apos;s information to be stored in one row. This kind of design can work well if the number of columns is relatively modest, as blog comments and a person&apos;s contact information would be. If instead you are modeling something like an email inbox, financial transactions, or massive amounts of automatically collected sensor data, you might choose instead to spread a user&apos;s emails, transactions, or sensor readings across multiple rows (a &quot;tall&quot; design) and design the row keys to allow efficient scanning and pagination. For an inbox the row key might look like &lt;code&gt;&amp;lt;user_id&amp;gt;-&amp;lt;reversed_email_timestamp&amp;gt;&lt;/code&gt; which would permit easily scanning and paginating a user&apos;s inbox, while for financial transactions the row key might be &lt;code&gt;&amp;lt;user_id&amp;gt;-&amp;lt;reversed_transaction_timestamp&amp;gt;&lt;/code&gt;. This kind of design can be called &quot;tall&quot; since you are spreading information about the same thing (e.g. readings from the same sensor, transactions in an account) across multiple rows, and is something to consider if there will be an ever-expanding amount of information, as would be the case in a scenario involving data collection from a huge network of sensors.&lt;/p&gt;

&lt;p&gt;Designing row keys and table structures is a key part of working with HBase, and will continue to be, given the fundamental architecture of HBase. There are other things you can do to add alternative schemes for data access within HBase. For example, you could implement full-text searching via Apache Lucene either within rows or external to HBase (search Google for HBASE-3529). You can also create (and maintain) secondary indexes to permit alternate row key schemes for tables; for example in our &lt;code&gt;people&lt;/code&gt; table the composite row key consists of the name and a unique identifier. But if we desire to access people by their birth date, telephone area code, email address, or any number of other ways, we could add secondary indexes to enable that form of interaction. Note, however, that adding secondary indexes is not something to be taken lightly; every time you write to the &quot;main&quot; table (e.g. &lt;code&gt;people&lt;/code&gt;) you will need to also update all the secondary indexes! (Yes, this is something that relational databases do very well, but remember that HBase is designed to accommodate a lot more data than traditional RDBMSs were.)&lt;/p&gt;

&lt;h1&gt;Conclusion to Part 5&lt;/h1&gt;

&lt;p&gt;In this part of the series, we got an introduction to schema design in HBase (without relations or SQL). Even though HBase lacks some features found in traditional RDBMS systems, such as foreign keys and referential integrity, multi-row transactions, multiple indexes, and so on, many applications that need HBase&apos;s inherent strengths, such as massive scale, can benefit from using it. As with anything complex, there are tradeoffs to be made. In the case of HBase, you give up some richness in schema design and query flexibility, but you gain the ability to scale to massive amounts of data by (more or less) simply adding servers to your cluster.&lt;/p&gt;

&lt;p&gt;In the next and last part of this series, we&apos;ll wrap up and mention a few (of the many) things we didn&apos;t cover in these introductory blogs.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;HBase web site, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;http://hbase.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase wiki, &lt;a href=&quot;http://wiki.apache.org/hadoop/Hbase&quot;&gt;http://wiki.apache.org/hadoop/Hbase&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase Reference Guide, &lt;a href=&quot;http://hbase.apache.org/book/book.html&quot;&gt;http://hbase.apache.org/book/book.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hbase-definitive-guide&quot;&gt;http://bit.ly/hbase-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google Bigtable Paper, &lt;a href=&quot;http://labs.google.com/papers/bigtable.html&quot;&gt;http://labs.google.com/papers/bigtable.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop web site, &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;http://hadoop.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hadoop-definitive-guide&quot;&gt;http://bit.ly/hadoop-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fallacies of Distributed Computing, &lt;a href=&quot;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&quot;&gt;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase lightning talk slides, &lt;a href=&quot;http://www.slideshare.net/scottleber/hbase-lightningtalk&quot;&gt;http://www.slideshare.net/scottleber/hbase-lightningtalk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sample code, &lt;a href=&quot;https://github.com/sleberknight/basic-hbase-examples&quot;&gt;https://github.com/sleberknight/basic-hbase-examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase3</guid>
    <title>Handling Big Data with HBase Part 4: The Java API</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase3</link>
        <pubDate>Mon, 16 Dec 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>distributed-computing</category>
    <category>hadoop</category>
    <category>java</category>
    <category>hbase</category>
            <description>&lt;p&gt;&lt;em&gt;This is the fourth of an introductory series of blogs on &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;Apache HBase&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase2&quot;&gt;third part&lt;/a&gt;, we saw a high level view of HBase architecture. In this part, we&apos;ll use the HBase Java API to create tables, insert new data, and retrieve data by row key. We&apos;ll also see how to set up a basic table scan which restricts the columns retrieved and uses a filter to page the results.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Having just learned about HBase&apos;s high-level architecture, let&apos;s now look at the Java client API, since it is how your applications interact with HBase. As mentioned earlier, you can also interact with HBase via several flavors of RPC technologies such as Apache Thrift, plus a REST gateway, but we&apos;re going to concentrate on the native Java API. The client API provides both DDL (data definition language) and DML (data manipulation language) semantics, very much like what you find in SQL for relational databases. Suppose we are going to store information about people in HBase and want to start by creating a new table. The following listing shows how to create a new table using the &lt;code&gt;HBaseAdmin&lt;/code&gt; class.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf(&quot;people&quot;));
tableDescriptor.addFamily(new HColumnDescriptor(&quot;personal&quot;));
tableDescriptor.addFamily(new HColumnDescriptor(&quot;contactinfo&quot;));
tableDescriptor.addFamily(new HColumnDescriptor(&quot;creditcard&quot;));
admin.createTable(tableDescriptor);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;people&lt;/code&gt; table defined in preceding listing contains three column families: &lt;code&gt;personal&lt;/code&gt;, &lt;code&gt;contactinfo&lt;/code&gt;, and &lt;code&gt;creditcard&lt;/code&gt;. To create a table you create an &lt;code&gt;HTableDescriptor&lt;/code&gt; and add one or more column families by adding &lt;code&gt;HColumnDescriptor&lt;/code&gt; objects. You then call &lt;code&gt;createTable&lt;/code&gt; to create the table. Now we have a table, so let&apos;s add some data. The next listing shows how to use the &lt;code&gt;Put&lt;/code&gt; class to insert data on John Doe, specifically his name and email address (omitting proper error handling for brevity).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, &quot;people&quot;);
Put put = new Put(Bytes.toBytes(&quot;doe-john-m-12345&quot;));
put.add(Bytes.toBytes(&quot;personal&quot;), Bytes.toBytes(&quot;givenName&quot;), Bytes.toBytes(&quot;John&quot;));
put.add(Bytes.toBytes(&quot;personal&quot;), Bytes.toBytes(&quot;mi&quot;), Bytes.toBytes(&quot;M&quot;));
put.add(Bytes.toBytes(&quot;personal&quot;), Bytes.toBytes(&quot;surname&quot;), Bytes.toBytes(&quot;Doe&quot;));
put.add(Bytes.toBytes(&quot;contactinfo&quot;), Bytes.toBytes(&quot;email&quot;), Bytes.toBytes(&quot;john.m.doe@gmail.com&quot;));
table.put(put);
table.flushCommits();
table.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above listing we instantiate a &lt;code&gt;Put&lt;/code&gt;, providing the unique row key to the constructor. We then add values, each of which must include the column family, column qualifier, and value, all as &lt;em&gt;byte arrays&lt;/em&gt;. As you probably noticed, the HBase API&apos;s utility &lt;code&gt;Bytes&lt;/code&gt; class is used a lot; it provides methods to convert to and from &lt;code&gt;byte[]&lt;/code&gt; for primitive types and strings. (A static import for the &lt;code&gt;toBytes()&lt;/code&gt; method would cut out a lot of boilerplate.) We then put the data into the table, flush the commits to ensure locally buffered changes take effect, and finally close the table. Updating data is done via the &lt;code&gt;Put&lt;/code&gt; class in exactly the same manner as just shown. Unlike relational databases, in which updates must write entire rows even if only one column changed, in HBase if you only need to update a single column then that is all you specify in the &lt;code&gt;Put&lt;/code&gt; and HBase will update only that column. There is also a &lt;code&gt;checkAndPut&lt;/code&gt; operation, essentially a form of optimistic concurrency control: the operation puts the new data only if the current values are what the client says they should be.&lt;/p&gt;
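&lt;p&gt;To give a feel for what the &lt;code&gt;Bytes&lt;/code&gt; utility is doing, here is a rough standard-library sketch (not HBase&apos;s actual implementation, though the behavior is similar): strings become UTF-8 byte arrays, and primitives become fixed-width byte arrays:&lt;/p&gt;

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BytesSketch {
    // Standard-library approximations of HBase's Bytes.toBytes overloads:
    // strings encode to UTF-8, longs to 8 big-endian bytes.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    public static void main(String[] args) {
        System.out.println(toBytes("doe-john-m-12345").length); // 16 UTF-8 bytes
        System.out.println(toBytes(42L).length);                // always 8 bytes
    }
}
```

&lt;p&gt;Since every cell in HBase is just bytes, conversions like these (in both directions) appear throughout client code, which is why a static import of &lt;code&gt;toBytes()&lt;/code&gt; pays off.&lt;/p&gt;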

&lt;p&gt;Retrieving the row we just created is accomplished using the &lt;code&gt;Get&lt;/code&gt; class, as shown in the next listing. (From this point forward, listings will omit the boilerplate code to create a configuration, instantiate the &lt;code&gt;HTable&lt;/code&gt;, and the flush and close calls.)&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Get get = new Get(Bytes.toBytes(&quot;doe-john-m-12345&quot;));
get.addFamily(Bytes.toBytes(&quot;personal&quot;));
get.setMaxVersions(3);
Result result = table.get(get);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The code in the previous listing instantiates a &lt;code&gt;Get&lt;/code&gt; instance supplying the row key we want to find. Next we use &lt;code&gt;addFamily&lt;/code&gt; to instruct HBase that we only need data from the &lt;code&gt;personal&lt;/code&gt; column family, which also cuts down the amount of work HBase must do when reading information from disk. We also specify that we&apos;d like up to three versions of each column in our result, perhaps so we can list historical values of each column. Finally, calling &lt;code&gt;get&lt;/code&gt; returns a &lt;code&gt;Result&lt;/code&gt; instance which can then be used to inspect all the column values returned.&lt;/p&gt;

&lt;p&gt;In many cases you need to find more than one row. HBase lets you do this by scanning rows, as shown in the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase1&quot;&gt;second part&lt;/a&gt;, which demonstrated scanning in an HBase shell session. The corresponding class is the &lt;code&gt;Scan&lt;/code&gt; class. You can specify various options, such as the start and stop row keys, which columns and column families to include, and the maximum number of versions to retrieve. You can also add filters, which allow you to implement custom filtering logic to further restrict which rows and columns are returned. A common use case for filters is pagination. For example, we might want to scan through all people whose last name is Smith one page (e.g. 25 people) at a time. The next listing shows how to perform a basic scan.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Scan scan = new Scan(Bytes.toBytes(&quot;smith-&quot;));
scan.addColumn(Bytes.toBytes(&quot;personal&quot;), Bytes.toBytes(&quot;givenName&quot;));
scan.addColumn(Bytes.toBytes(&quot;contactinfo&quot;), Bytes.toBytes(&quot;email&quot;));
scan.setFilter(new PageFilter(25));
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    // ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above listing we create a new &lt;code&gt;Scan&lt;/code&gt; that starts from the row key &lt;code&gt;smith-&lt;/code&gt; and we then use &lt;code&gt;addColumn&lt;/code&gt; to restrict the columns returned (thus reducing the amount of disk transfer HBase must perform) to &lt;code&gt;personal:givenName&lt;/code&gt; and &lt;code&gt;contactinfo:email&lt;/code&gt;. A &lt;code&gt;PageFilter&lt;/code&gt; is set on the scan to limit the number of rows scanned to 25. (An alternative to using the page filter would be to specify a stop row key when constructing the &lt;code&gt;Scan&lt;/code&gt;.) We then get a &lt;code&gt;ResultScanner&lt;/code&gt; for the &lt;code&gt;Scan&lt;/code&gt; just created, and loop through the results performing whatever actions are necessary. Since the only method in HBase to retrieve multiple rows of data is scanning by sorted row keys, how you design the row key values is very important. We&apos;ll come back to this topic later.&lt;/p&gt;
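&lt;p&gt;One detail worth knowing when paging with &lt;code&gt;PageFilter&lt;/code&gt;: to fetch the next page you typically start a new scan just after the last row you saw. Since start rows are inclusive, a common trick (sketched below in plain Java) is to append a zero byte to the last row key, producing the smallest key that sorts strictly after it:&lt;/p&gt;

```java
import java.util.Arrays;

public class ScanPaging {
    // Sketch of the usual "next page" trick when paging scans: the smallest
    // possible row key strictly after lastRow is lastRow with a single zero
    // byte appended, so the next Scan starts there.
    static byte[] nextStartRow(byte[] lastRow) {
        byte[] next = Arrays.copyOf(lastRow, lastRow.length + 1);
        next[lastRow.length] = 0; // trailing 0x00 byte
        return next;
    }

    public static void main(String[] args) {
        byte[] last = "smith-anne-44".getBytes();
        byte[] next = nextStartRow(last);
        System.out.println(next.length - last.length); // exactly one byte longer
    }
}
```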

&lt;p&gt;You can also delete data in HBase using the &lt;code&gt;Delete&lt;/code&gt; class, the analogue of the &lt;code&gt;Put&lt;/code&gt; class. With it you can delete all columns in a row (thus deleting the row itself), delete a column family, delete individual columns, or some combination of those.&lt;/p&gt;

&lt;h1&gt;Connection Handling&lt;/h1&gt;

&lt;p&gt;In the above examples not much attention was paid to connection handling and RPCs (remote procedure calls). HBase provides the &lt;code&gt;HConnection&lt;/code&gt; class, which offers functionality similar to a connection pool for sharing connections; for example, you use its &lt;code&gt;getTable()&lt;/code&gt; method to get a reference to an &lt;code&gt;HTable&lt;/code&gt; instance. Instances of &lt;code&gt;HConnection&lt;/code&gt; are obtained from the &lt;code&gt;HConnectionManager&lt;/code&gt; class. Just as with avoiding network round trips in web applications, effectively managing the number of RPCs and the amount of data returned is important, and something to consider when writing HBase applications.&lt;/p&gt;

&lt;h1&gt;Conclusion to Part 4&lt;/h1&gt;

&lt;p&gt;In this part we used the HBase Java API to create a &lt;code&gt;people&lt;/code&gt; table, insert a new person, and retrieve the newly inserted person&apos;s information. We also used the &lt;code&gt;Scan&lt;/code&gt; class to scan the &lt;code&gt;people&lt;/code&gt; table for people with last name &quot;Smith&quot;, showed how to restrict the data retrieved, and finally used a filter to limit the number of results.&lt;/p&gt;

&lt;p&gt;In the next part, we&apos;ll learn how to deal with the absence of SQL and relations when modeling schemas in HBase.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;HBase web site, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;http://hbase.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase wiki, &lt;a href=&quot;http://wiki.apache.org/hadoop/Hbase&quot;&gt;http://wiki.apache.org/hadoop/Hbase&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase Reference Guide, &lt;a href=&quot;http://hbase.apache.org/book/book.html&quot;&gt;http://hbase.apache.org/book/book.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hbase-definitive-guide&quot;&gt;http://bit.ly/hbase-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google Bigtable Paper, &lt;a href=&quot;http://labs.google.com/papers/bigtable.html&quot;&gt;http://labs.google.com/papers/bigtable.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop web site, &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;http://hadoop.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hadoop-definitive-guide&quot;&gt;http://bit.ly/hadoop-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fallacies of Distributed Computing, &lt;a href=&quot;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&quot;&gt;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase lightning talk slides, &lt;a href=&quot;http://www.slideshare.net/scottleber/hbase-lightningtalk&quot;&gt;http://www.slideshare.net/scottleber/hbase-lightningtalk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sample code, &lt;a href=&quot;https://github.com/sleberknight/basic-hbase-examples&quot;&gt;https://github.com/sleberknight/basic-hbase-examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase2</guid>
    <title>Handling Big Data with HBase Part 3: Architecture Overview</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase2</link>
        <pubDate>Fri, 13 Dec 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>distributed-computing</category>
    <category>java</category>
    <category>hadoop</category>
    <category>hbase</category>
            <description>&lt;p&gt;&lt;em&gt;This is the third blog in a series of introductory blogs on &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;Apache HBase&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase1&quot;&gt;second part&lt;/a&gt;, we saw how to interact with HBase via the shell. In this part, we&apos;ll look at the HBase architecture from a bird&apos;s eye view.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;HBase is a &lt;em&gt;distributed&lt;/em&gt; database, meaning it is designed to run on a cluster of dozens to possibly thousands of servers or more. As a result it is more complicated to install than a single RDBMS running on a single server, and all the typical problems of distributed computing come into play, such as coordination and management of remote processes, locking, data distribution, network latency, and the number of round trips between servers. Fortunately HBase makes use of several mature technologies, such as Apache Hadoop and Apache ZooKeeper, to solve many of these issues. The figure below shows the major architectural components in HBase.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://www.sleberknight.com/blog/sleberkn/resource/hbase-architecture.png&quot; alt=&quot;HBase Architecture&quot; title=&quot;HBase Architecture&quot; width=&quot;800&quot;/&gt;&lt;/p&gt;

&lt;p&gt;In the above figure you can see there is a single HBase master node and multiple region servers. (Note that it is possible to run HBase in a multiple master setup, in which there is a single active master.) HBase tables are partitioned into multiple regions with each region storing a range of the table&apos;s rows, and multiple regions are assigned by the master to a region server.&lt;/p&gt;

&lt;p&gt;HBase is a &lt;em&gt;column-oriented&lt;/em&gt; data store, meaning it stores data by columns rather than by rows. This makes certain data access patterns much less expensive than with traditional row-oriented relational database systems. For example, in HBase if there is no data for a given column family, it simply does not store anything at all; contrast this with a relational database which must store &lt;code&gt;null&lt;/code&gt; values explicitly. In addition, when retrieving data in HBase, you should only ask for the specific column families you need; because there can literally be millions of columns in a given row, you need to make sure you ask only for the data you actually need.&lt;/p&gt;

&lt;p&gt;HBase utilizes &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part&quot;&gt;ZooKeeper&lt;/a&gt; (a distributed coordination service) to manage region assignments to region servers, and to recover from region server crashes by loading the crashed region server&apos;s regions onto other functioning region servers.&lt;/p&gt;

&lt;p&gt;Regions contain an in-memory data store (MemStore) and a persistent data store (HFile), and all regions on a region server share a reference to the write-ahead log (WAL) which is used to store new data that hasn&apos;t yet been persisted to permanent storage and to recover from region server crashes. Each region holds a specific range of row keys, and when a region exceeds a configurable size, HBase automatically splits the region into two child regions, which is the key to scaling HBase.&lt;/p&gt;

&lt;p&gt;As a table grows, more and more regions are created and spread across the entire cluster. When clients request a specific row key or scan a range of row keys, HBase tells them the regions on which those keys exist, and the clients then communicate directly with the region servers where those regions exist. This design minimizes the number of disk seeks required to find any given row, and optimizes HBase toward disk transfer when returning data. This is in contrast to relational databases, which might need to do a large number of disk seeks before transferring data from disk, even with indexes.&lt;/p&gt;
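&lt;p&gt;Conceptually, mapping a row key to the region that holds it is a &quot;floor&quot; lookup over the sorted region start keys. The following plain-Java sketch (illustrative only, not the actual HBase client code) shows the idea:&lt;/p&gt;

```java
import java.util.TreeMap;

public class RegionLookup {
    public static void main(String[] args) {
        // Sketch: regions cover sorted, contiguous row-key ranges; mapping a
        // row key to its region is a floor lookup on the region start keys.
        // (A raw TreeMap keeps the sketch free of generics.)
        TreeMap regions = new TreeMap();
        regions.put("", "region-1");   // keys from "" up to "h"
        regions.put("h", "region-2");  // keys from "h" up to "q"
        regions.put("q", "region-3");  // keys from "q" onward
        System.out.println(regions.floorEntry("doe-john-m-12345").getValue());
        System.out.println(regions.floorEntry("smith-anne-44").getValue());
    }
}
```

&lt;p&gt;Clients cache this mapping, which is why they can talk directly to the right region server without going through the master on every request.&lt;/p&gt;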

&lt;p&gt;The HDFS component is the Hadoop Distributed Filesystem, a distributed, fault-tolerant and scalable filesystem which guards against data loss by dividing files into blocks and spreading them across the cluster; it is where HBase actually stores data. Strictly speaking the persistent storage can be anything that implements the Hadoop &lt;code&gt;FileSystem&lt;/code&gt; API, but usually HBase is deployed onto Hadoop clusters running HDFS. In fact, when you first download and install HBase on a single machine, it uses the local filesystem until you change the configuration!&lt;/p&gt;

&lt;p&gt;Clients interact with HBase via one of several available APIs, including a native Java API as well as a REST-based interface and several RPC interfaces (Apache Thrift, Apache Avro). There are also DSLs for working with HBase from Groovy, Jython, and Scala.&lt;/p&gt;

&lt;h1&gt;Conclusion to Part 3&lt;/h1&gt;

&lt;p&gt;In this part, we got a pretty high level view of HBase architecture. In the next part, we&apos;ll dive into some real code and show the basics of working with HBase via its native Java API.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;HBase web site, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;http://hbase.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase wiki, &lt;a href=&quot;http://wiki.apache.org/hadoop/Hbase&quot;&gt;http://wiki.apache.org/hadoop/Hbase&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase Reference Guide, &lt;a href=&quot;http://hbase.apache.org/book/book.html&quot;&gt;http://hbase.apache.org/book/book.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hbase-definitive-guide&quot;&gt;http://bit.ly/hbase-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google Bigtable Paper, &lt;a href=&quot;http://labs.google.com/papers/bigtable.html&quot;&gt;http://labs.google.com/papers/bigtable.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop web site, &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;http://hadoop.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hadoop-definitive-guide&quot;&gt;http://bit.ly/hadoop-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fallacies of Distributed Computing, &lt;a href=&quot;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&quot;&gt;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase lightning talk slides, &lt;a href=&quot;http://www.slideshare.net/scottleber/hbase-lightningtalk&quot;&gt;http://www.slideshare.net/scottleber/hbase-lightningtalk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sample code, &lt;a href=&quot;https://github.com/sleberknight/basic-hbase-examples&quot;&gt;https://github.com/sleberknight/basic-hbase-examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase1</guid>
    <title>Handling Big Data with HBase Part 2: First Steps</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase1</link>
        <pubDate>Thu, 12 Dec 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>hbase</category>
    <category>distributed-computing</category>
    <category>hadoop</category>
            <description>&lt;p&gt;&lt;em&gt;This is the second in a series of blogs that introduce &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;Apache HBase&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase&quot;&gt;first blog&lt;/a&gt;, we introduced HBase at a high level. In this part, we&apos;ll see how to interact with HBase via its command line shell.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let&apos;s take a look at what working with HBase is like at the command line. HBase comes with a JRuby-based shell that lets you define and manage tables, execute CRUD operations on data, scan tables, and perform maintenance, among other things. When you&apos;re in the shell, just type &lt;code&gt;help&lt;/code&gt; to get an overall help page. You can get help on specific commands or groups of commands as well, using syntax like &lt;code&gt;help &amp;lt;group&amp;gt;&lt;/code&gt; and &lt;code&gt;help &amp;lt;command&amp;gt;&lt;/code&gt;. For example, &lt;code&gt;help &apos;create&apos;&lt;/code&gt; provides help on creating new tables. While HBase is deployed in production on clusters of servers, you can download it and get a standalone installation up and running in literally minutes. The first thing to do is fire up the HBase shell. The following listing shows a shell session in which we create a &lt;code&gt;blog&lt;/code&gt; table, list the available tables in HBase, add a blog entry, retrieve that entry, and scan the &lt;code&gt;blog&lt;/code&gt; table.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/hbase shell
HBase Shell; enter &apos;help&amp;lt;RETURN&amp;gt;&apos; for list of supported commands.
Type &quot;exit&amp;lt;RETURN&amp;gt;&quot; to leave the HBase Shell
Version 0.96.0-hadoop2, r1531434, Fri Oct 11 15:28:08 PDT 2013

hbase(main):001:0&amp;gt; create &apos;blog&apos;, &apos;info&apos;, &apos;content&apos;
0 row(s) in 6.0670 seconds

=&amp;gt; Hbase::Table - blog

hbase(main):002:0&amp;gt; list
TABLE
blog
fakenames
my-table
3 row(s) in 0.0300 seconds

=&amp;gt; [&quot;blog&quot;, &quot;fakenames&quot;, &quot;my-table&quot;]

hbase(main):003:0&amp;gt; put &apos;blog&apos;, &apos;20130320162535&apos;, &apos;info:title&apos;, &apos;Why use HBase?&apos;
0 row(s) in 0.0650 seconds

hbase(main):004:0&amp;gt; put &apos;blog&apos;, &apos;20130320162535&apos;, &apos;info:author&apos;, &apos;Jane Doe&apos;
0 row(s) in 0.0230 seconds

hbase(main):005:0&amp;gt; put &apos;blog&apos;, &apos;20130320162535&apos;, &apos;info:category&apos;, &apos;Persistence&apos;
0 row(s) in 0.0230 seconds

hbase(main):006:0&amp;gt; put &apos;blog&apos;, &apos;20130320162535&apos;, &apos;content:&apos;, &apos;HBase is a column-oriented...&apos;
0 row(s) in 0.0220 seconds

hbase(main):007:0&amp;gt; get &apos;blog&apos;, &apos;20130320162535&apos;
COLUMN             CELL
 content:          timestamp=1386556660599, value=HBase is a column-oriented...
 info:author       timestamp=1386556649116, value=Jane Doe
 info:category     timestamp=1386556655032, value=Persistence
 info:title        timestamp=1386556643256, value=Why use HBase?
4 row(s) in 0.0380 seconds

hbase(main):008:0&amp;gt; scan &apos;blog&apos;, { STARTROW =&amp;gt; &apos;20130300&apos;, STOPROW =&amp;gt; &apos;20130400&apos; }
ROW                COLUMN+CELL
 20130320162535    column=content:, timestamp=1386556660599, value=HBase is a column-oriented...
 20130320162535    column=info:author, timestamp=1386556649116, value=Jane Doe
 20130320162535    column=info:category, timestamp=1386556655032, value=Persistence
 20130320162535    column=info:title, timestamp=1386556643256, value=Why use HBase?
1 row(s) in 0.0390 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above listing we first create the &lt;code&gt;blog&lt;/code&gt; table having column families &lt;code&gt;info&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt;. After listing the tables and seeing our new &lt;code&gt;blog&lt;/code&gt; table, we put some data in the table. The &lt;code&gt;put&lt;/code&gt; commands specify the table, the unique row key, the column key composed of the column family and a qualifier, and the value. For example, &lt;code&gt;info&lt;/code&gt; is the column family while &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;author&lt;/code&gt; are qualifiers and so &lt;code&gt;info:title&lt;/code&gt; specifies the column &lt;code&gt;title&lt;/code&gt; in the &lt;code&gt;info&lt;/code&gt; family with value &quot;Why use HBase?&quot;. The &lt;code&gt;info:title&lt;/code&gt; is also referred to as a column key. Next we use the &lt;code&gt;get&lt;/code&gt; command to retrieve a single row and finally the &lt;code&gt;scan&lt;/code&gt; command to perform a scan over rows in the &lt;code&gt;blog&lt;/code&gt; table for a specific range of row keys. As you might have guessed, by specifying start row &lt;code&gt;20130300&lt;/code&gt; (inclusive) and end row &lt;code&gt;20130400&lt;/code&gt; (exclusive) we retrieve all rows whose row key falls within that range; in this &lt;code&gt;blog&lt;/code&gt; example this equates to all blog entries in March 2013 since the row keys are the time when an entry was published.&lt;/p&gt;
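&lt;p&gt;The start/stop row semantics are plain lexicographic comparison, which the following small Java sketch makes explicit (&lt;code&gt;STARTROW&lt;/code&gt; inclusive, &lt;code&gt;STOPROW&lt;/code&gt; exclusive):&lt;/p&gt;

```java
public class KeyRange {
    // A scan returns rows in [startRow, stopRow): the start row is inclusive
    // and the stop row is exclusive, using lexicographic ordering.
    static boolean inRange(String key, String start, String stop) {
        boolean atOrAfterStart = key.compareTo(start) >= 0;
        boolean beforeStop = stop.compareTo(key) > 0;
        if (atOrAfterStart) {
            return beforeStop;
        }
        return false;
    }

    public static void main(String[] args) {
        // The March 2013 entry falls inside [20130300, 20130400).
        System.out.println(inRange("20130320162535", "20130300", "20130400"));
        // An exact match on the stop row is excluded.
        System.out.println(inRange("20130400", "20130300", "20130400"));
    }
}
```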

&lt;p&gt;An important characteristic of HBase is that you define column families up front, but can then add any number of columns within a family, each identified by its column qualifier. HBase is optimized to store the columns of a family together on disk, allowing for more efficient storage since columns that don&apos;t exist don&apos;t take up any space, unlike in an RDBMS where null values must actually be stored. Rows are defined by the columns they contain; if there are no columns then the row, logically, does not exist. Continuing the above example in the following listing, we delete some specific columns from a row.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;hbase(main):009:0&amp;gt;  delete &apos;blog&apos;, &apos;20130320162535&apos;, &apos;info:category&apos;
0 row(s) in 0.0490 seconds

hbase(main):010:0&amp;gt; get &apos;blog&apos;, &apos;20130320162535&apos;
COLUMN             CELL
 content:          timestamp=1386556660599, value=HBase is a column-oriented...
 info:author       timestamp=1386556649116, value=Jane Doe
 info:title        timestamp=1386556643256, value=Why use HBase?
3 row(s) in 0.0260 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As shown just above, you can delete a specific column from a table as we deleted the &lt;code&gt;info:category&lt;/code&gt; column. You can also delete all columns within a row and thereby delete the row using the &lt;code&gt;deleteall&lt;/code&gt; shell command. To update column values, you simply use the &lt;code&gt;put&lt;/code&gt; command again. By default HBase retains up to three versions of a column value, so if you &lt;code&gt;put&lt;/code&gt; a new value into &lt;code&gt;info:title&lt;/code&gt;, HBase will retain both the old and new version.&lt;/p&gt;

&lt;p&gt;The commands issued in the above examples show how to create, read, update, and delete data in HBase. Data retrieval comes in only two flavors: retrieving a row using &lt;code&gt;get&lt;/code&gt; and retrieving multiple rows via &lt;code&gt;scan&lt;/code&gt;. When retrieving data in HBase you should take care to retrieve only the information you actually require. Since HBase retrieves data from each column family separately, if you only need data for one column family, then you can specify to retrieve only that bit of information. In the next listing we retrieve only the blog titles for a specific row key range that equate to March through April 2013.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;hbase(main):011:0&amp;gt; scan &apos;blog&apos;, { STARTROW =&amp;gt; &apos;20130300&apos;, STOPROW =&amp;gt; &apos;20130500&apos;, COLUMNS =&amp;gt; &apos;info:title&apos; }
ROW                COLUMN+CELL
 20130320162535    column=info:title, timestamp=1386556643256, value=Why use HBase?
1 row(s) in 0.0290 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So by setting row key ranges, restricting the columns we need, and restricting the number of versions to retrieve, you can optimize data access patterns in HBase. Of course in the above examples, all this is done from the shell, but you can do the same things, and much more, using the HBase APIs.&lt;/p&gt;

&lt;h1&gt;Conclusion to Part 2&lt;/h1&gt;

&lt;p&gt;In this second part of the HBase introductory series, we saw how to use the shell to create tables, insert data, retrieve data by row key, and saw a basic scan of data via row key range. You also saw how you can delete a specific column from a table row.&lt;/p&gt;

&lt;p&gt;In the next blog, we&apos;ll get an overview of HBase&apos;s high level architecture.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;HBase web site, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;http://hbase.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase wiki, &lt;a href=&quot;http://wiki.apache.org/hadoop/Hbase&quot;&gt;http://wiki.apache.org/hadoop/Hbase&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase Reference Guide, &lt;a href=&quot;http://hbase.apache.org/book/book.html&quot;&gt;http://hbase.apache.org/book/book.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hbase-definitive-guide&quot;&gt;http://bit.ly/hbase-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google Bigtable Paper, &lt;a href=&quot;http://labs.google.com/papers/bigtable.html&quot;&gt;http://labs.google.com/papers/bigtable.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop web site, &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;http://hadoop.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hadoop-definitive-guide&quot;&gt;http://bit.ly/hadoop-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fallacies of Distributed Computing, &lt;a href=&quot;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&quot;&gt;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase lightning talk slides, &lt;a href=&quot;http://www.slideshare.net/scottleber/hbase-lightningtalk&quot;&gt;http://www.slideshare.net/scottleber/hbase-lightningtalk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sample code, &lt;a href=&quot;https://github.com/sleberknight/basic-hbase-examples&quot;&gt;https://github.com/sleberknight/basic-hbase-examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase</guid>
    <title>Handling Big Data with HBase Part 1: Introduction</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/handling_big_data_with_hbase</link>
        <pubDate>Tue, 10 Dec 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>distributed-computing</category>
    <category>hbase</category>
    <category>hadoop</category>
            <description>&lt;p&gt;&lt;em&gt;This is the first in a series of blogs that will introduce &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;Apache HBase&lt;/a&gt;. This blog provides a brief introduction to HBase. In later blogs you will see how the HBase shell can be used for quick and dirty data access via the command line, learn about the high-level architecture of HBase, learn the basics of the Java API, and learn how to live without SQL when designing HBase schemas.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the past few years we have seen a veritable explosion in ways to store and retrieve data, with the so-called NoSQL databases leading the charge and creating many of these new persistence choices. These alternatives have become popular in large part due to the rise of Big Data, led by companies such as Google, Amazon, Twitter, and Facebook, which have amassed vast amounts of data that must be stored, queried, and analyzed. But more and more companies now collect massive amounts of data and need to use it effectively to fuel their business. For example, social networks need to analyze large social graphs of people and recommend whom to link to next, while almost every large website now has a recommendation engine that tries to suggest ever more things you might want to purchase. As these businesses collect more data, they need a way to scale up easily without rewriting entire systems.&lt;/p&gt;

&lt;p&gt;Since the 1970s, relational database management systems (RDBMS) have dominated the data landscape. But as businesses collect, store and process more and more data, relational databases are harder and harder to scale. At first you might go from a single server to a master/slave setup, and add caching layers in front of the database to relieve load as more and more reads/writes hit the database. When performance of queries begins to degrade, usually the first thing to be dropped is indexes, followed quickly by denormalization to avoid joins as they become more costly. Later you might start to precompute (or materialize) the most costly queries so that queries then effectively become key lookups and perhaps distribute data in huge tables across multiple database shards. At this point if you step back, many of the key benefits of RDBMSs have been lost &#8212; referential integrity, ACID transactions, indexes, and so on. Of course, the scenario just described presumes you become very successful, very fast and need to handle more data with continually increasing data ingestion rates. In other words, you need to be the next Twitter.&lt;/p&gt;

&lt;p&gt;Or do you? Maybe you are working on an environment monitoring project that will deploy a network of sensors around the world, and all these sensors will produce huge amounts of data. Or maybe you are working on DNA sequencing. If you know, or think, you are going to have massive data storage requirements, where the number of rows runs into the billions and the number of columns potentially into the millions, you should consider alternative databases such as HBase. These new databases are designed from the ground up to scale horizontally across clusters of commodity servers, as opposed to vertical scaling, where you buy the next larger server (until there is no bigger one to buy, anyway).&lt;/p&gt;

&lt;h1&gt;Enter HBase&lt;/h1&gt;

&lt;p&gt;HBase is a database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns. It is designed to run on a cluster of commodity servers and to automatically scale as more servers are added, while retaining the same performance. In addition, it is fault tolerant precisely because data is divided across servers in the cluster and stored in a redundant file system such as the Hadoop Distributed File System (HDFS). When (not if) servers fail, your data is safe, and the data is automatically re-balanced over the remaining servers until replacements are online. HBase is a strongly consistent data store; changes you make are immediately visible to all other clients.&lt;/p&gt;

&lt;p&gt;HBase is modeled after Google&apos;s Bigtable, which was described in a paper written by Google in 2006 as a &quot;sparse, distributed, persistent multi-dimensional sorted map.&quot; So if you are used to relational databases, then HBase will at first seem foreign. While it has the concept of tables, they are not like relational tables, nor does HBase support the typical RDBMS concepts of joins, indexes, ACID transactions, etc. But even though you give those features up, you automatically and transparently gain scalability and fault-tolerance. HBase can be described as a key-value store with automatic data versioning.&lt;/p&gt;

&lt;p&gt;You can CRUD (create, read, update, and delete) data just as you would expect. You can also perform &lt;em&gt;scans&lt;/em&gt; of HBase table rows, which are always stored in ascending order by row key and are returned in that order when you scan. Each row consists of a unique, sorted row key (think primary key in RDBMS terms) and an arbitrary number of columns, each column residing in a column family and having one or more versioned values. Values are simply byte arrays, and it&apos;s up to the application to transform these byte arrays as necessary when storing and displaying them. HBase does not attempt to hide this column-oriented data model from developers, and the Java APIs are decidedly lower-level than other persistence APIs you might have worked with. For example, JPA (the Java Persistence API) and even JDBC are much more abstracted than what you find in the HBase APIs. You are working with bare metal when dealing with HBase.&lt;/p&gt;
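&lt;p&gt;Because every cell value is just a byte array, encoding and decoding is the application&apos;s responsibility. Here is a small illustration using only the JDK (the HBase client provides a &lt;code&gt;Bytes&lt;/code&gt; utility class that plays this role for common types):&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;

public class ByteValues {

    // What actually lives in an HBase cell is an uninterpreted byte[];
    // these helpers mirror what Bytes.toBytes / Bytes.toString do for strings.
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] cell = encode("Why use HBase?");  // the stored form
        System.out.println(decode(cell));         // prints Why use HBase?
    }
}
```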

&lt;h1&gt;Conclusion to Part 1&lt;/h1&gt;

&lt;p&gt;In this introductory blog we&apos;ve learned that HBase is a non-relational, strongly consistent, distributed key-value store with automatic data versioning. It is horizontally scalable by adding servers to a cluster, and it provides fault tolerance so data is not lost when (not if) servers fail. We&apos;ve also discussed a bit about how data is organized within HBase tables: each row has a unique row key, some number of column families, and an arbitrary number of columns within a family. In the next blog, we&apos;ll take first steps with HBase by showing interaction via the HBase shell.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;HBase web site, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;http://hbase.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase wiki, &lt;a href=&quot;http://wiki.apache.org/hadoop/Hbase&quot;&gt;http://wiki.apache.org/hadoop/Hbase&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase Reference Guide &lt;a href=&quot;http://hbase.apache.org/book/book.html&quot;&gt;http://hbase.apache.org/book/book.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hbase-definitive-guide&quot;&gt;http://bit.ly/hbase-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google Bigtable Paper, &lt;a href=&quot;http://labs.google.com/papers/bigtable.html&quot;&gt;http://labs.google.com/papers/bigtable.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop web site, &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;http://hadoop.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hadoop-definitive-guide&quot;&gt;http://bit.ly/hadoop-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fallacies of Distributed Computing, &lt;a href=&quot;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&quot;&gt;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;HBase lightning talk slides, &lt;a href=&quot;http://www.slideshare.net/scottleber/hbase-lightningtalk&quot;&gt;http://www.slideshare.net/scottleber/hbase-lightningtalk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sample code, &lt;a href=&quot;https://github.com/sleberknight/basic-hbase-examples&quot;&gt;https://github.com/sleberknight/basic-hbase-examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part5</guid>
    <title>Distributed Coordination With ZooKeeper Part 6: Wrapping Up</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part5</link>
        <pubDate>Tue, 16 Jul 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>hadoop</category>
    <category>distributed-computing</category>
    <category>java</category>
    <category>zookeeper</category>
            <description>&lt;p&gt;&lt;em&gt;This is the sixth (and last) in a series of blogs that introduce &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part4&quot;&gt;fifth blog&lt;/a&gt;, we implemented a distributed lock, dealing with the issues of partial failure due to connection loss and the &quot;herd effect&quot; along the way.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this final blog in the series you&apos;ll learn a few tips for administering and tuning ZooKeeper, and we&apos;ll introduce the Curator and Exhibitor frameworks.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Administration and Tuning&lt;/h1&gt;

&lt;p&gt;As with any complex distributed system, &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt; provides administrators plenty of knobs to control its behavior. Several important properties include the &lt;code&gt;tickTime&lt;/code&gt; (the fundamental unit of time in ZooKeeper, measured in milliseconds); the &lt;code&gt;initLimit&lt;/code&gt;, which is the time in ticks to allow followers to initially connect and sync to the leader; the &lt;code&gt;syncLimit&lt;/code&gt;, which is the time in ticks a follower may lag behind the leader before being dropped from the ensemble; and the &lt;code&gt;dataDir&lt;/code&gt; and &lt;code&gt;dataLogDir&lt;/code&gt;, which are the directories where ZooKeeper stores the in-memory database snapshots and the transaction log, respectively.&lt;/p&gt;
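&lt;p&gt;Pulling those properties together, a minimal &lt;code&gt;zoo.cfg&lt;/code&gt; for a three-node ensemble might look like the following sketch (the hostnames and paths are placeholders, and the numeric values are common starting points rather than recommendations):&lt;/p&gt;

```
# fundamental time unit, in milliseconds
tickTime=2000
# ticks allowed for followers to connect and sync to the leader
initLimit=10
# ticks a follower may fall behind the leader before being dropped
syncLimit=5
# snapshot and transaction log directories (ideally on separate devices)
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/datalog
clientPort=2181
# ensemble members: server.N=host:quorumPort:leaderElectionPort
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```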

&lt;p&gt;Next, we&apos;ll cover just a few things you will want to be aware of when running a ZooKeeper ensemble in production.&lt;/p&gt;

&lt;p&gt;First, when creating a ZooKeeper ensemble you should run each node on a &lt;em&gt;dedicated&lt;/em&gt; server, meaning the only thing the server does is run an instance of ZooKeeper. The main reason you want to do this is to avoid any contention with other processes for both network and disk I/O. If you run other I/O and/or CPU-intensive processes on the same machines you are running a ZooKeeper node, you will likely see connection timeouts and other issues due to contention. I&apos;ve seen this happen in production systems, and as soon as the ZooKeeper nodes were moved to their own dedicated machines, the connection loss problems disappeared.&lt;/p&gt;

&lt;p&gt;Second, start with a three node ensemble and monitor the usage of those machines, for example using Ganglia and Nagios, to determine if your ensemble needs additional machines. Remember also to maintain an &lt;em&gt;odd&lt;/em&gt; number of machines in the ensemble, so that there can be a majority when nodes commit write operations and when they need to vote for a new leader. Another really useful tool is &lt;a href=&quot;https://github.com/phunt/zktop&quot;&gt;zktop&lt;/a&gt;, which is very similar to the &lt;code&gt;top&lt;/code&gt; command on *nix systems. It is a simple, quick and dirty way to easily start monitoring your ensemble.&lt;/p&gt;

&lt;p&gt;Third, watch out for session timeouts and adjust the &lt;code&gt;tickTime&lt;/code&gt; appropriately; for example, if your network experiences heavy traffic you might increase the &lt;code&gt;tickTime&lt;/code&gt; to 5 seconds.&lt;/p&gt;

&lt;p&gt;The above three tips are by no means the end of the story when it comes to administering and tuning ZooKeeper. For more in-depth information on setting up, running, administering, and monitoring a ZooKeeper ensemble, see the &lt;a href=&quot;http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html&quot;&gt;ZooKeeper Administrator&apos;s Guide&lt;/a&gt; on the ZooKeeper web site. Another resource is Kathleen Ting&apos;s &lt;a href=&quot;http://www.infoq.com/presentations/Misconfiguration-ZooKeeper&quot;&gt;Building an Impenetrable ZooKeeper&lt;/a&gt; presentation, which I attended at Strange Loop 2012 and which provides many very useful tips for running a ZooKeeper ensemble.&lt;/p&gt;

&lt;h1&gt;Getting a Curator&lt;/h1&gt;

&lt;p&gt;So far we&apos;ve seen everything ZooKeeper provides out of the box. But when using ZooKeeper in production, you may quickly find that building recipes like distributed locks and other distributed data structures is harder than it looks, because you must be aware of the many different kinds of problems that can arise &#8212; recall the connection loss and herd effect issues when constructing the distributed lock. You need to know when you can safely handle an exception and retry an operation. For example, if an idempotent operation fails during an automatic client failover event, you can simply retry the operation. The raw ZooKeeper library does not do much exception handling for you, and you need to implement retry logic yourself.&lt;/p&gt;

&lt;p&gt;Helpfully, Netflix uses ZooKeeper and developed a framework named &lt;code&gt;Curator&lt;/code&gt;, which it open sourced and later donated to Apache. The &lt;a href=&quot;http://curator.incubator.apache.org/&quot;&gt;Curator&lt;/a&gt; wiki page describes it as &quot;a set of Java libraries that make using Apache ZooKeeper much easier&quot;. While ZooKeeper comes bundled with the &lt;code&gt;ZooKeeper&lt;/code&gt; Java client, using it to develop &lt;em&gt;correct&lt;/em&gt; distributed data structures can be difficult and makes the code much harder to understand, due to problems such as connection loss and the &quot;herd effect&quot; we saw in the previous blog.&lt;/p&gt;

&lt;p&gt;Once you have a good understanding of ZooKeeper basics, check out Curator. It provides a client that replaces (wraps) the &lt;code&gt;ZooKeeper&lt;/code&gt; class; a framework that contains a high-level API and improved connection and exception handling, along with built-in retry logic in the form of retry policies. Last, it provides a bunch of recipes that implement distributed data structures including locks, barriers, queues, and more. Curator even provides useful testing servers to run a single embedded ZooKeeper server or a test ensemble in unit tests.&lt;/p&gt;

&lt;p&gt;Even better, Netflix also created &lt;a href=&quot;http://curator.incubator.apache.org/exhibitor.html&quot;&gt;Exhibitor&lt;/a&gt;, which is a &quot;supervisor&quot; for your ZooKeeper ensemble. It provides features such as monitoring, backups, a web-based interface for znode exploration, and a RESTful API.&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In this series of blogs you were introduced to ZooKeeper; took a test drive in the ZooKeeper shell; worked with ZooKeeper&apos;s Java API to build a group membership application as well as a distributed lock; and toured the architecture and implementation details of ZooKeeper. If nothing else, remember that ZooKeeper is like a filesystem, except distributed and replicated. It allows you to build distributed coordination and data structures, and it is highly available, reliable, and fast due to its leader/follower design with no single point of failure, in-memory reads, and writes funneled through the leader to maintain sequential consistency. Last, it provides clients with (mostly) transparent and automatic session failover in case of server failure. After becoming comfortable with ZooKeeper, be sure to have a look at the Apache Curator framework (created and donated by Netflix) and also the Exhibitor monitoring application.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Source code for these blogs, &lt;a href=&quot;https://github.com/sleberknight/zookeeper-samples&quot;&gt;https://github.com/sleberknight/zookeeper-samples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Presentation on ZooKeeper, &lt;a href=&quot;http://www.slideshare.net/scottleber/apache-zookeeper&quot;&gt;http://www.slideshare.net/scottleber/apache-zookeeper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;ZooKeeper web site, &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;http://zookeeper.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;ZooKeeper Administrator&apos;s Guide &lt;a href=&quot;http://zookeeper.apache.org/doc/current/zookeeperAdmin.html&quot;&gt;http://zookeeper.apache.org/doc/current/zookeeperAdmin.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Projects powered by ZooKeeper, &lt;a href=&quot;https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy&quot;&gt;https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fallacies of Distributed Computing, &lt;a href=&quot;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&quot;&gt;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop web site, &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;http://hadoop.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hadoop-definitive-guide&quot;&gt;http://bit.ly/hadoop-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Apache Blur (incubating) web site, &lt;a href=&quot;http://incubator.apache.org/blur/&quot;&gt;http://incubator.apache.org/blur/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Apache Curator, &lt;a href=&quot;http://curator.incubator.apache.org/&quot;&gt;http://curator.incubator.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Netflix Exhibitor, &lt;a href=&quot;https://github.com/Netflix/exhibitor/wiki&quot;&gt;https://github.com/Netflix/exhibitor/wiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;zktop, &lt;a href=&quot;https://github.com/phunt/zktop&quot;&gt;https://github.com/phunt/zktop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Building an Impenetrable ZooKeeper, &lt;a href=&quot;http://www.infoq.com/presentations/Misconfiguration-ZooKeeper&quot;&gt;http://www.infoq.com/presentations/Misconfiguration-ZooKeeper&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part4</guid>
    <title>Distributed Coordination With ZooKeeper Part 5: Building a Distributed Lock</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part4</link>
        <pubDate>Thu, 11 Jul 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>distributed-computing</category>
    <category>zookeeper</category>
    <category>java</category>
    <category>hadoop</category>
            <description>&lt;p&gt;&lt;em&gt;This is the fifth in a series of blogs that introduce &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part3&quot;&gt;fourth blog&lt;/a&gt;, you saw a high-level view of ZooKeeper&apos;s architecture and data consistency guarantees. In this blog, we&apos;ll use all the knowledge we&apos;ve gained thus far to implement a distributed lock.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You&apos;ve now seen how to interact with &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt; and learned about its architecture and consistency model. Let&apos;s now use that knowledge to build a distributed lock. The goals are to build a mutually exclusive lock between processes that could be running on different machines, possibly even on different networks or different data centers. This also has the benefit that clients know nothing about each other; they only know they need to use the lock to access some shared resource, and that they should not access it unless they own the lock.&lt;/p&gt;

&lt;p&gt;To build the lock, we&apos;ll create a persistent znode that will serve as the parent. Clients wishing to obtain the lock will create sequential, ephemeral child znodes under the parent znode. The lock is owned by the client process whose child znode has the lowest sequence number. In Figure 2, there are three children of the &lt;code&gt;lock-node&lt;/code&gt; and &lt;code&gt;child-1&lt;/code&gt; owns the lock at this point in time, since it has the lowest sequence number. After &lt;code&gt;child-1&lt;/code&gt; is removed, the lock is relinquished and then the client who owns &lt;code&gt;child-2&lt;/code&gt; owns the lock, and so on.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2 - Parent lock znode and child znodes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://www.sleberknight.com/blog/sleberkn/resource/dist-lock-nodes-small.png&quot; alt=&quot;Distributed Lock Nodes&quot; title=&quot;Distributed Lock Node&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The algorithm for clients to determine whether they own the lock is straightforward, on the surface anyway. A client creates a new sequential, ephemeral znode under the parent lock znode. The client then gets the children of the lock node and sets a watch on the lock node. If the child znode the client created has the lowest sequence number, the lock is acquired, and the client can perform whatever actions are necessary with the resource the lock is protecting. If its child znode does not have the lowest sequence number, the client waits for the watch to trigger a watch event, then performs the same logic of getting the children, setting a watch, and checking for lock acquisition via the lowest sequence number. The client continues this process until the lock is acquired.&lt;/p&gt;

&lt;p&gt;While this doesn&apos;t sound too bad, there are a few potential gotchas. First, how would the client know that it successfully created the child znode if there is a partial failure (e.g. due to connection loss) during znode creation? The solution is to embed the client&apos;s ZooKeeper session ID in the child znode name, for example &lt;code&gt;child-&amp;lt;sessionId&amp;gt;-&lt;/code&gt;; a failed-over client that retains the same session (and thus session ID) can easily determine whether the child znode was created by looking for its session ID amongst the child znodes. Second, in our earlier algorithm, every client sets a watch on the parent lock znode. But this has the potential to create a &quot;herd effect&quot; &#8212; if every client is watching the parent znode, then every client is notified when any change is made to the children, regardless of whether a client would be able to own the lock. If there are a small number of clients this probably doesn&apos;t matter, but if there are a large number it has the potential to cause a spike in network traffic. The solution is for each client to watch only the child znode immediately preceding its own. For example, the client owning &lt;code&gt;child-9&lt;/code&gt; need only watch the child immediately preceding it, which is most likely &lt;code&gt;child-8&lt;/code&gt; but could be an earlier child if the client owning the 8th znode somehow died. Then, notifications are sent only to the client that can actually take ownership of the lock.&lt;/p&gt;
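&lt;p&gt;The ordering logic at the heart of the algorithm can be shown in isolation. In this self-contained sketch the znode names and helper methods are illustrative (only the zero-padded sequence suffix is something ZooKeeper itself appends): given the current children, a client either learns it owns the lock or learns which single predecessor znode to watch:&lt;/p&gt;

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class LockOrder {

    // Sequential znodes end in a zero-padded counter, e.g.
    // "child-sessionA-0000000007". The suffix is what ZooKeeper appends.
    static int sequenceOf(String child) {
        return Integer.parseInt(child.substring(child.lastIndexOf('-') + 1));
    }

    // Returns null if `mine` holds the lock (lowest sequence number);
    // otherwise returns the single predecessor znode this client should
    // watch, which avoids the herd effect.
    static String predecessorToWatch(List<String> children, String mine) {
        List<String> sorted = children.stream()
                .sorted(Comparator.comparingInt(LockOrder::sequenceOf))
                .collect(Collectors.toList());
        int index = sorted.indexOf(mine);
        return index == 0 ? null : sorted.get(index - 1);
    }

    public static void main(String[] args) {
        List<String> children = List.of(
                "child-sessionB-0000000009",
                "child-sessionA-0000000007",
                "child-sessionC-0000000008");
        // sessionA has the lowest sequence, so it owns the lock (null);
        // sessionB watches only its immediate predecessor, sessionC.
        System.out.println(predecessorToWatch(children, "child-sessionA-0000000007"));
        System.out.println(predecessorToWatch(children, "child-sessionB-0000000009"));
    }
}
```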

&lt;p&gt;Fortunately for us, ZooKeeper comes with a lock &quot;recipe&quot; in the contrib modules called &lt;code&gt;WriteLock&lt;/code&gt;. &lt;code&gt;WriteLock&lt;/code&gt; implements a distributed lock using the above algorithm and takes into account partial failure and the herd effect. It uses an asynchronous callback model via a &lt;code&gt;LockListener&lt;/code&gt; instance, whose &lt;code&gt;lockAcquired&lt;/code&gt; method is called when the lock is acquired and &lt;code&gt;lockReleased&lt;/code&gt; method is called when the lock is released. We can build a synchronous lock class on top of &lt;code&gt;WriteLock&lt;/code&gt; by blocking until the lock is acquired. Listing 6 shows how we use a &lt;code&gt;CountDownLatch&lt;/code&gt; to block until the &lt;code&gt;lockAcquired&lt;/code&gt; method is called. (Sample code for this blog is available on GitHub at &lt;a href=&quot;https://github.com/sleberknight/zookeeper-samples&quot;&gt;https://github.com/sleberknight/zookeeper-samples&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Listing 6 - Creating BlockingWriteLock on top of WriteLock&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public class BlockingWriteLock {
  private String path;
  private WriteLock writeLock;
  private CountDownLatch signal = new CountDownLatch(1);

  public BlockingWriteLock(ZooKeeper zookeeper,
          String path, List&amp;lt;ACL&amp;gt; acls) {
    this.path = path;
    this.writeLock =
        new WriteLock(zookeeper, path, acls, new SyncLockListener());
  }

  public void lock() throws InterruptedException, KeeperException {
    writeLock.lock();
    signal.await();
  }

  public void unlock() {
    writeLock.unlock();
  }

  class SyncLockListener implements LockListener {
    @Override public void lockAcquired() {
      signal.countDown();
    }

    @Override public void lockReleased() { /* ignored */ }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can then use the &lt;code&gt;BlockingWriteLock&lt;/code&gt; as shown in Listing 7.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Listing 7 - Using BlockingWriteLock&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;BlockingWriteLock lock =
  new BlockingWriteLock(zooKeeper, path, ZooDefs.Ids.OPEN_ACL_UNSAFE);
try {
  lock.lock();
  // do something while we own the lock
} catch (Exception ex) {
  // handle appropriately
} finally {
  lock.unlock();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can take this a step further, wrapping the try/catch/finally logic and creating a class that takes commands which implement an interface. For example, you can create a &lt;code&gt;DistributedLockOperationExecutor&lt;/code&gt; class that implements a &lt;code&gt;withLock&lt;/code&gt; method that takes a &lt;code&gt;DistributedLockOperation&lt;/code&gt; instance as an argument, as shown in Listing 8.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Listing 8 - Wrapping the BlockingWriteLock try/catch/finally logic&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;DistributedLockOperationExecutor executor =
  new DistributedLockOperationExecutor(zooKeeper);
executor.withLock(lockPath, ZooDefs.Ids.OPEN_ACL_UNSAFE,
  new DistributedLockOperation() {
    @Override public Object execute() {
      // do something while we have the lock
    }
  });
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The nice thing about wrapping try/catch/finally logic in &lt;code&gt;DistributedLockOperationExecutor&lt;/code&gt; is that when you call &lt;code&gt;withLock&lt;/code&gt; you eliminate boilerplate code and you cannot possibly forget to unlock the lock.&lt;/p&gt;
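&lt;p&gt;The executor itself is just the try/finally from Listing 7 moved behind a single method. Since the real class needs a ZooKeeper ensemble to run, this sketch demonstrates the same pattern against &lt;code&gt;java.util.concurrent.locks.Lock&lt;/code&gt; instead (the class and interface names here are stand-ins for illustration, not the actual API from the listings):&lt;/p&gt;

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LockOperationExecutor {

    // Stand-in for DistributedLockOperation in Listing 8.
    interface LockOperation<T> {
        T execute();
    }

    // Acquire the lock, run the operation, and always release the lock
    // in a finally block, so callers cannot forget to unlock.
    static <T> T withLock(Lock lock, LockOperation<T> operation) {
        lock.lock();
        try {
            return operation.execute();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        Lock lock = new ReentrantLock();
        String result = withLock(lock, () -> "did something while locked");
        System.out.println(result);
    }
}
```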

&lt;h1&gt;Conclusion to Part 5&lt;/h1&gt;

&lt;p&gt;In this fifth blog on ZooKeeper, you implemented a distributed lock and saw some of the potential problems that should be avoided such as partial failure on connection loss, and the &quot;herd effect&quot;.  We took our initial distributed lock and cleaned it up a bit, which resulted in a synchronous implementation using the &lt;code&gt;DistributedLockOperationExecutor&lt;/code&gt; and &lt;code&gt;DistributedLockOperation&lt;/code&gt; which ensures proper connection handling and lock release.&lt;/p&gt;

&lt;p&gt;In the next (and final) blog, we&apos;ll briefly touch on administration and tuning ZooKeeper and introduce the Apache Curator framework, and finally summarize what we&apos;ve learned.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Source code for these blogs, &lt;a href=&quot;https://github.com/sleberknight/zookeeper-samples&quot;&gt;https://github.com/sleberknight/zookeeper-samples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Presentation on ZooKeeper, &lt;a href=&quot;http://www.slideshare.net/scottleber/apache-zookeeper&quot;&gt;http://www.slideshare.net/scottleber/apache-zookeeper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;ZooKeeper web site, &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;http://zookeeper.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part3</guid>
    <title>Distributed Coordination With ZooKeeper Part 4: Architecture from 30,000 Feet</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part3</link>
        <pubDate>Mon, 8 Jul 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>distributed-computing</category>
    <category>java</category>
    <category>hadoop</category>
    <category>zookeeper</category>
            <description>&lt;p&gt;&lt;em&gt;This is the fourth in a series of blogs that introduce &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part2&quot;&gt;third blog&lt;/a&gt;, you implemented a group membership example using the ZooKeeper Java API. In this blog, we&apos;ll get an overview of ZooKeeper&apos;s architecture.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that we&apos;ve test driven &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt; in the shell and Java code, let&apos;s take a bird&apos;s eye view of the ZooKeeper architecture and expand on the core concepts discussed earlier. As previously mentioned, ZooKeeper is essentially a distributed, hierarchical filesystem comprised of znodes, which can be either persistent or ephemeral. Persistent znodes can have children and continue to exist after the client session that created them expires or disconnects. Ephemeral znodes, in contrast, cannot have children and are automatically destroyed as soon as the session in which they were created is closed. Both persistent and ephemeral znodes can have associated data, though the data must be less than 1MB (per znode). Any znode can optionally be sequential, in which case ZooKeeper maintains a monotonically increasing counter whose value is automatically appended to the znode name upon creation; each sequence number is guaranteed to be unique. Finally, all znode operations (reads and writes) are atomic; they either succeed or fail, and there is never a partial application of an operation. For example, if a client tries to set data on a znode, the operation will either set the data in its entirety, or no data will be changed at all.&lt;/p&gt;

&lt;p&gt;A key element of ZooKeeper&apos;s architecture is the ability to set watches on read operations such as &lt;code&gt;exists&lt;/code&gt;, &lt;code&gt;getChildren&lt;/code&gt;, and &lt;code&gt;getData&lt;/code&gt;. Write operations (i.e. &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, &lt;code&gt;setData&lt;/code&gt;) on znodes trigger any watches previously set on those znodes, and watchers are notified via a &lt;code&gt;WatchedEvent&lt;/code&gt;. How clients respond to events is entirely up to them, but setting watches and receiving notifications at some later point in time results in an event-driven, decoupled architecture. Suppose client A sets a watch on a znode. At some point in the future, when client B performs a write operation on the znode client A is watching, a &lt;code&gt;WatchedEvent&lt;/code&gt; is generated and client A is called back via the &lt;code&gt;process&lt;/code&gt; method of its &lt;code&gt;Watcher&lt;/code&gt;. Clients A and B are completely independent and need not know anything about each other, so long as each knows its own responsibilities in relation to specific znodes.&lt;/p&gt;

&lt;p&gt;It is important to remember that watches are &lt;em&gt;one-time notifications&lt;/em&gt; about changes to a znode. If a client receives a &lt;code&gt;WatchedEvent&lt;/code&gt; notification, it &lt;em&gt;must&lt;/em&gt; register a new &lt;code&gt;Watcher&lt;/code&gt; if it wants to be notified about future updates. Between receipt of the notification and registration of the new &lt;code&gt;Watcher&lt;/code&gt;, other clients can perform write operations on the znode that the watching client will never hear about. In other words, in a high write volume environment it is entirely possible for a client to miss updates during the time it takes to process an event and re-register a watch. Clients should assume updates can be missed, and should not rely on having a complete history of every event that occurs to a given znode.&lt;/p&gt;
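
&lt;p&gt;As a sketch of the re-registration pattern (this code is not from the original posts; it assumes an already-connected &lt;code&gt;ZooKeeper&lt;/code&gt; handle), a &lt;code&gt;Watcher&lt;/code&gt; can set a fresh watch from inside its own callback. Even so, writes that land before the new watch is registered go unobserved.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Sketch: re-register a data watch on every notification.
public void watchData(final ZooKeeper zk, final String path)
        throws KeeperException, InterruptedException {
  zk.getData(path, new Watcher() {
    @Override
    public void process(WatchedEvent event) {
      try {
        // Set a new one-time watch; updates between the event
        // and this call are not observed.
        watchData(zk, path);
      } catch (Exception e) {
        // handle or log; omitted in this sketch
      }
    }
  }, null /* stat */);
}
&lt;/code&gt;&lt;/pre&gt;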

&lt;p&gt;ZooKeeper implements the hierarchical filesystem via an &quot;ensemble&quot; of servers. Figure 1 shows a three server ensemble with multiple clients reading and one client writing. The basic idea is that the filesystem state is replicated on each server in the ensemble, both on disk and in memory. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1 - ZooKeeper Ensemble&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://www.sleberknight.com/blog/sleberkn/resource/zk-architecture.png&quot; alt=&quot;ZooKeeper Architecture&quot; title=&quot;ZooKeeper Architecture&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In Figure 1 you can see that one of the servers in the ensemble acts as the leader, while the rest are followers. When an ensemble is first started, a &lt;em&gt;leader election&lt;/em&gt; is held. The election is complete once a simple majority of followers have synchronized their state with the newly elected leader. After leader election, all write requests are routed through the leader, and changes are broadcast to all followers - this is termed &lt;em&gt;atomic broadcast&lt;/em&gt;. Once a majority of followers have persisted the change (to disk and memory), the leader commits the change and notifies the client of a successful update. Because only a majority of followers is required for a successful update, followers can lag the leader, which means ZooKeeper is an &lt;em&gt;eventually consistent&lt;/em&gt; system. Thus, different clients reading the same znode at the same moment can receive different answers. Every write is assigned a globally unique, sequentially ordered identifier called a &lt;code&gt;zxid&lt;/code&gt;, or ZooKeeper transaction id, which guarantees a global order to all updates in a ZooKeeper ensemble. In addition, because &lt;em&gt;all&lt;/em&gt; writes go through the leader, write throughput does not scale as more nodes are added.&lt;/p&gt;
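
&lt;p&gt;As an aside on &lt;code&gt;zxid&lt;/code&gt; ordering: internally a zxid is a 64-bit number whose high 32 bits identify the leader epoch and whose low 32 bits are a counter within that epoch. The small class below (an illustration of that layout, not API from these posts) shows why comparing zxids as plain longs yields the global update order.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative sketch of the zxid layout: leader epoch in the
// high 32 bits, per-epoch counter in the low 32 bits.
public class ZxidDemo {

  static long zxid(long epoch, long counter) {
    return (epoch &amp;lt;&amp;lt; 32) | counter;
  }

  public static void main(String[] args) {
    long a = zxid(1, 7); // seventh update under leader epoch 1
    long b = zxid(2, 0); // first update after a new leader election
    // Any update in a later epoch orders after all earlier updates.
    System.out.println(a &amp;lt; b); // true
  }
}
&lt;/code&gt;&lt;/pre&gt;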

&lt;p&gt;This leader/follower architecture is not a master/slave setup, however, since the leader is not a single point of failure. If the leader dies, a new leader election takes place and a new leader is elected; this is typically very fast and does not noticeably degrade performance. Because leader election and writes both require a simple majority of servers, ZooKeeper ensembles should contain an odd number of machines: in a five-node ensemble any two machines can fail and ZooKeeper remains available, while a six-node ensemble also tolerates only two failures, because if three nodes fail the remaining three are not a majority of the original six.&lt;/p&gt;
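
&lt;p&gt;The majority arithmetic can be made concrete in a few lines of Java. This is purely illustrative (the class and method names are invented for this sketch): a quorum is a strict majority, floor(n/2) + 1, so an n-server ensemble tolerates n minus quorum failures.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative only: how many server failures an n-server
// ensemble can tolerate while still keeping a quorum.
public class QuorumMath {

  // A quorum is a strict majority: n / 2 + 1 (integer division).
  static int tolerableFailures(int n) {
    int quorum = n / 2 + 1;
    return n - quorum;
  }

  public static void main(String[] args) {
    System.out.println(tolerableFailures(5)); // 2
    System.out.println(tolerableFailures(6)); // also 2
  }
}
&lt;/code&gt;&lt;/pre&gt;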

&lt;p&gt;All client read requests are served directly from the memory of the server they are connected to, which makes reads very fast. In addition, clients have no knowledge about the server they are connected to and do not know if they are connected to a leader or follower. Because reads are from the in-memory representation of the filesystem, read throughput increases as servers are added to an ensemble. But recall that write throughput is limited by the leader, so you cannot simply add more and more ZooKeepers forever and expect performance to increase.&lt;/p&gt;

&lt;h1&gt;Data Consistency&lt;/h1&gt;

&lt;p&gt;With ZooKeeper&apos;s leader/follower architecture in mind, let&apos;s consider what guarantees it makes regarding data consistency.&lt;/p&gt;

&lt;h2&gt;Sequential Updates&lt;/h2&gt;

&lt;p&gt;ZooKeeper guarantees that updates are made to the filesystem in the order they are received from clients. Since all writes route through the leader, the global order is simply the order in which the leader receives write requests.&lt;/p&gt;

&lt;h2&gt;Atomicity&lt;/h2&gt;

&lt;p&gt;All updates either succeed or fail, just like transactions in ACID-compliant relational databases. ZooKeeper, as of version 3.4.0, supports transactions as a thin wrapper around the &lt;code&gt;multi&lt;/code&gt; operation, which performs a list of operations (instances of the &lt;code&gt;Op&lt;/code&gt; class) and either all operations succeed or none succeed. So if you need to ensure that multiple znodes are updated at the same time, for example if two znodes are part of a graph, then you can use &lt;code&gt;multi&lt;/code&gt; or the transaction wrapper around &lt;code&gt;multi&lt;/code&gt;.&lt;/p&gt;
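
&lt;p&gt;As an illustrative sketch (the znode paths and method name here are invented, and a connected &lt;code&gt;ZooKeeper&lt;/code&gt; handle is assumed), updating two znodes atomically with &lt;code&gt;multi&lt;/code&gt; looks roughly like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void linkNodes(ZooKeeper zk)
        throws KeeperException, InterruptedException {
  List&amp;lt;Op&amp;gt; ops = Arrays.asList(
      Op.create(&quot;/graph/node-a&quot;, null /* data */,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
      Op.setData(&quot;/graph&quot;, &quot;updated&quot;.getBytes(), -1 /* any version */));
  // Either both operations are applied, or neither is; a failure
  // surfaces as a KeeperException.
  zk.multi(ops);
}
&lt;/code&gt;&lt;/pre&gt;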

&lt;h2&gt;Consistent client view&lt;/h2&gt;

&lt;p&gt;Consistent client view means that a client will see the same view of the system, regardless of which server it is connected to. The official ZooKeeper documentation calls this &quot;single system image&quot;. So, if a client fails over to a different server during a session, it will never see an older view of the system than it has previously seen. A server will not accept a connection from a client until it has caught up with the state of the server to which the client was previously connected.&lt;/p&gt;

&lt;h2&gt;Durability&lt;/h2&gt;

&lt;p&gt;If an update succeeds, ZooKeeper guarantees it has been persisted and will survive server failures, even if all ZooKeeper ensemble nodes were forcefully killed at the same time! (Admittedly this would be an extreme situation, but the update would survive such an apocalypse.)&lt;/p&gt;

&lt;h2&gt;Eventual consistency&lt;/h2&gt;

&lt;p&gt;Because followers may lag the leader, ZooKeeper is an eventually consistent system. But ZooKeeper limits the amount of time a follower can lag the leader, and a follower will take itself offline if it falls too far behind. Clients can force their server to catch up with the leader by calling the asynchronous &lt;code&gt;sync&lt;/code&gt; command. Although &lt;code&gt;sync&lt;/code&gt; is asynchronous, the server will not process operations issued after the &lt;code&gt;sync&lt;/code&gt; until it has caught up with the leader.&lt;/p&gt;

&lt;h1&gt;Conclusion to Part 4&lt;/h1&gt;

&lt;p&gt;In this fourth blog on ZooKeeper you saw a bird&apos;s eye view of ZooKeeper&apos;s architecture, and learned about its data consistency guarantees. You also learned that ZooKeeper is an &lt;em&gt;eventually consistent&lt;/em&gt; system.&lt;/p&gt;

&lt;p&gt;In the next blog, we&apos;ll dive back into some code and use what we&apos;ve learned so far to build a distributed lock.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Source code for these blogs, &lt;a href=&quot;https://github.com/sleberknight/zookeeper-samples&quot;&gt;https://github.com/sleberknight/zookeeper-samples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Presentation on ZooKeeper, &lt;a href=&quot;http://www.slideshare.net/scottleber/apache-zookeeper&quot;&gt;http://www.slideshare.net/scottleber/apache-zookeeper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;ZooKeeper web site, &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;http://zookeeper.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part2</guid>
    <title>Distributed Coordination With ZooKeeper Part 3: Group Membership Example</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part2</link>
        <pubDate>Tue, 2 Jul 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>hadoop</category>
    <category>distributed-computing</category>
    <category>zookeeper</category>
            <description>&lt;p&gt;&lt;em&gt;This is the third in a series of blogs that introduce &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part1&quot;&gt;second blog&lt;/a&gt;, you took a test drive of ZooKeeper using its command-line shell. In this blog, we&apos;ll re-implement the group membership example using the ZooKeeper Java API.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt; is implemented in Java, and its native API is also Java. ZooKeeper also provides a C language API, and the distribution provides contrib modules for Perl, Python, and RESTful clients. The ZooKeeper APIs come in two flavors, synchronous and asynchronous; which one you use depends on the situation. For example, you might choose the asynchronous Java API if you are implementing a Java application that processes a large number of child znodes independently of one another; in that case the asynchronous API lets you launch all the independent tasks in parallel. On the other hand, if you are implementing simple tasks that perform sequential operations in ZooKeeper, the synchronous API is easier to use and may be a better fit.&lt;/p&gt;

&lt;p&gt;For our group membership example, we&apos;ll use the synchronous Java API. The first thing we need to do is connect to ZooKeeper and get an instance of &lt;code&gt;ZooKeeper&lt;/code&gt;, which is the main client API through which you perform operations like creating znodes, setting data on znodes, listing znodes, and so on. The &lt;code&gt;ZooKeeper&lt;/code&gt; constructor launches a separate thread to connect, and returns immediately. As a result, you need to watch for the &lt;code&gt;SyncConnected&lt;/code&gt; event which indicates when the connection has been established. Listing 1 shows code to connect to ZooKeeper, in which we use a &lt;code&gt;CountDownLatch&lt;/code&gt; to block until we&apos;ve received the connected event. (Sample code for this blog is available on GitHub at &lt;a href=&quot;https://github.com/sleberknight/zookeeper-samples&quot;&gt;https://github.com/sleberknight/zookeeper-samples&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Listing 1 - Connecting to ZooKeeper&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public ZooKeeper connect(String hosts, int sessionTimeout)
        throws IOException, InterruptedException {
  final CountDownLatch connectedSignal = new CountDownLatch(1);
  ZooKeeper zk = new ZooKeeper(hosts, sessionTimeout, new Watcher() {
    @Override
    public void process(WatchedEvent event) {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connectedSignal.countDown();
      }
    }
  });
  connectedSignal.await();
  return zk;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The next thing we need to do is create a znode for the group. As in the test drive, this znode should be persistent, so that it hangs around regardless of whether any clients are connected or not. Listing 2 shows creating a group znode.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Listing 2 - Creating the group znode&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void createGroup(String groupName)
        throws KeeperException, InterruptedException {
  String path = &quot;/&quot; + groupName;
  zk.create(path,
            null /* data */,
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.PERSISTENT);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note in Listing 2 that we prepended a leading slash to the group name since ZooKeeper requires that all paths be absolute. The &lt;code&gt;create&lt;/code&gt; operation takes arguments for the path, a &lt;code&gt;byte[]&lt;/code&gt; for data which is optional, a list of ACLs (access control list) to control who can access the znode, and finally the type of znode, in this case persistent. Creating the group member znodes is almost identical to creating the group znode, except we need to create an ephemeral, sequential znode. Let&apos;s also say that we need to store some information about each member, so we&apos;ll set data on the member znodes. This is shown in Listing 3.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Listing 3 - Creating group member znodes with data&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public String joinGroup(String groupName, String memberName, byte[] data)
        throws KeeperException, InterruptedException {
  String path = &quot;/&quot; + groupName + &quot;/&quot; + memberName + &quot;-&quot;;
  String createdPath = zk.create(path,
          data,
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL_SEQUENTIAL);
  return createdPath;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now that we can create the group and allow members to join it, it would be nice to have some way to monitor the group membership. To do this we&apos;ll first list the children of the group znode and set a watch on it; whenever the watch triggers an event, we&apos;ll query ZooKeeper for the group&apos;s (updated) members, as shown in Listing 4. This process continues in an infinite loop, hence the class name &lt;code&gt;ListGroupForever&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Listing 4 - Listing a group&apos;s members indefinitely&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public class ListGroupForever {
  private ZooKeeper zooKeeper;
  private Semaphore semaphore = new Semaphore(1);

  public ListGroupForever(ZooKeeper zooKeeper) {
    this.zooKeeper = zooKeeper;
  }

  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ConnectionHelper().connect(args[0]);
    new ListGroupForever(zk).listForever(args[1]);
  }

  public void listForever(String groupName)
          throws KeeperException, InterruptedException {
    semaphore.acquire();
    while (true) {
      list(groupName);
      semaphore.acquire();
    }
  }

  private void list(String groupName)
          throws KeeperException, InterruptedException {
    String path = &quot;/&quot; + groupName;
    List&amp;lt;String&amp;gt; children = zooKeeper.getChildren(path, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
          semaphore.release();
        }
      }
    });
    if (children.isEmpty()) {
      System.out.printf(&quot;No members in group %s\n&quot;, groupName);
      return;
    }
    Collections.sort(children);
    System.out.println(children);
    System.out.println(&quot;--------------------&quot;);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;ListGroupForever&lt;/code&gt; class in Listing 4 has some interesting characteristics. The &lt;code&gt;listForever&lt;/code&gt; method loops infinitely and uses a semaphore to block until changes occur to the group node. The &lt;code&gt;list&lt;/code&gt; method calls &lt;code&gt;getChildren&lt;/code&gt; to actually retrieve the child nodes from ZooKeeper, and critically sets a &lt;code&gt;Watcher&lt;/code&gt; to watch for changes of type &lt;code&gt;NodeChildrenChanged&lt;/code&gt;. When the &lt;code&gt;NodeChildrenChanged&lt;/code&gt; event occurs, the watcher releases the semaphore, which permits &lt;code&gt;listForever&lt;/code&gt; to re-acquire the semaphore and then retrieve and display the updated group znodes. This process continues until &lt;code&gt;ListGroupForever&lt;/code&gt; is terminated.&lt;/p&gt;

&lt;p&gt;To round out the example, we&apos;ll create a method to delete the group. As shown in the test drive, ZooKeeper doesn&apos;t permit znodes that have children to be deleted, so we first need to delete all the children, and then delete the group (parent) znode. This is shown in Listing 5.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Listing 5 - Deleting a group&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void delete(String groupName)
        throws KeeperException, InterruptedException {
  String path = &quot;/&quot; + groupName;
  try {
    List&amp;lt;String&amp;gt; children = zk.getChildren(path, false);
    for (String child : children) {
      zk.delete(path + &quot;/&quot; + child, -1);
    }
    zk.delete(path, -1);
  }
  catch (KeeperException.NoNodeException e) {
    System.out.printf(&quot;Group %s does not exist\n&quot;, groupName);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When deleting a group, we passed &lt;code&gt;-1&lt;/code&gt; to the &lt;code&gt;delete&lt;/code&gt; method to unconditionally delete the znodes. We could also have passed in a version, so that if we have the correct version number, the znode is deleted but otherwise we receive an optimistic locking violation in the form of a &lt;code&gt;BadVersionException&lt;/code&gt;.&lt;/p&gt;
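
&lt;p&gt;For completeness, here is a sketch of a version-checked delete (not from the original post; it assumes a connected &lt;code&gt;ZooKeeper&lt;/code&gt; handle and an invented method name), which fails with &lt;code&gt;BadVersionException&lt;/code&gt; if another client modified the znode in the meantime:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public void deleteIfUnchanged(ZooKeeper zk, String path)
        throws KeeperException, InterruptedException {
  // Read the current version, then delete only if it still matches.
  Stat stat = zk.exists(path, false /* no watch */);
  if (stat != null) {
    zk.delete(path, stat.getVersion());
  }
}
&lt;/code&gt;&lt;/pre&gt;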

&lt;h1&gt;Conclusion to Part 3&lt;/h1&gt;

&lt;p&gt;In this third blog on ZooKeeper, we implemented a group membership example using the Java API. You saw how to connect to ZooKeeper; how to create persistent, ephemeral, and sequential znodes; how to list znodes and set watches to receive events; and finally how to delete znodes.&lt;/p&gt;

&lt;p&gt;In the next blog, we&apos;ll back off from the code level and get an overview of ZooKeeper&apos;s architecture. &lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Source code for these blogs, &lt;a href=&quot;https://github.com/sleberknight/zookeeper-samples&quot;&gt;https://github.com/sleberknight/zookeeper-samples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Presentation on ZooKeeper, &lt;a href=&quot;http://www.slideshare.net/scottleber/apache-zookeeper&quot;&gt;http://www.slideshare.net/scottleber/apache-zookeeper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;ZooKeeper web site, &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;http://zookeeper.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part1</guid>
    <title>Distributed Coordination With ZooKeeper Part 2: Test Drive</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part1</link>
        <pubDate>Fri, 28 Jun 2013 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>zookeeper</category>
    <category>java</category>
    <category>hadoop</category>
    <category>distributed-computing</category>
            <description>&lt;p&gt;&lt;em&gt;This is the second in a series of blogs that introduce &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt;. In the &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part&quot;&gt;first blog&lt;/a&gt;, you got an introduction to ZooKeeper and its core concepts. In this blog, you&apos;ll take a brief test drive of ZooKeeper using its command line shell. This is a really fast and convenient way to get up and running with ZooKeeper immediately.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To get an idea of some of the basic building blocks in &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt;, let&apos;s take a test drive. ZooKeeper comes with a command-line shell that lets you connect to and interact with the service. The following listing shows connecting to the shell, listing the znodes at the root level, and creating a znode named &lt;code&gt;/sample-group&lt;/code&gt;, which will serve as a parent znode for some other znodes that we&apos;ll create in a moment. All paths in ZooKeeper must be &lt;em&gt;absolute&lt;/em&gt; and begin with a &lt;code&gt;/&lt;/code&gt;. The first argument to the &lt;code&gt;create&lt;/code&gt; command is the path, while the second is the data associated with the znode. Note also that when a connection is established, the default watcher sends the &lt;code&gt;SyncConnected&lt;/code&gt; event, which you can see in the listing below.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ./zkCli.sh
Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper]
[zk: localhost:2181(CONNECTED) 1] create /sample-group a-sample-group
Created /sample-group
[zk: localhost:2181(CONNECTED) 2] ls /
[sample-group, zookeeper]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At this point we want to create some child znodes under &lt;code&gt;/sample-group&lt;/code&gt;. ZooKeeper znodes can be either &lt;em&gt;persistent&lt;/em&gt; or &lt;em&gt;ephemeral&lt;/em&gt;. Persistent znodes are permanent and, once created, stick around until they are explicitly deleted. On the other hand, ephemeral znodes exist only as long as the client that created them is alive; once the client goes away for any reason, all ephemeral znodes it created are automatically destroyed. As you might imagine, if we want to build a group membership service for a distributed system, each client (which is a group member) should indicate its status via an ephemeral znode, so that if it dies, the znode representing its membership is destroyed, indicating the client is no longer a member of the group. When we created the group, we created a persistent znode. To create an ephemeral znode we use the &lt;code&gt;-e&lt;/code&gt; option. In addition, maybe we&apos;d like to know the order in which clients joined our group. ZooKeeper znodes can be automatically and uniquely ordered by their parent. In the shell we use &lt;code&gt;-s&lt;/code&gt; to indicate we want to create the child znode as a &lt;em&gt;sequential&lt;/em&gt; znode. Note also that we named the child nodes &lt;code&gt;/sample-group/child-&lt;/code&gt; in each case. When creating sequential znodes, it is typical to end the name with a dash, to which a unique, monotonically increasing integer is automatically appended.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[zk: localhost:2181(CONNECTED) 3] create -s -e /sample-group/child- data-1
Created /sample-group/child-0000000000
[zk: localhost:2181(CONNECTED) 4] create -s -e /sample-group/child- data-2
Created /sample-group/child-0000000001
[zk: localhost:2181(CONNECTED) 5] create -s -e /sample-group/child- data-3
Created /sample-group/child-0000000002
&lt;/code&gt;&lt;/pre&gt;
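
&lt;p&gt;The zero-padded suffixes in the listing above can be illustrated in a couple of lines of Java. (This sketch assumes the 10-digit padding seen in the shell output; the class name is invented.)&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative: the server appends a 10-digit, zero-padded counter.
public class SequenceNames {

  static String sequentialName(String prefix, int counter) {
    return prefix + String.format(&quot;%010d&quot;, counter);
  }

  public static void main(String[] args) {
    // The third child created under /sample-group/child-
    System.out.println(sequentialName(&quot;/sample-group/child-&quot;, 2));
    // prints /sample-group/child-0000000002
  }
}
&lt;/code&gt;&lt;/pre&gt;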

&lt;p&gt;Now let&apos;s set a watch on the &lt;code&gt;/sample-group&lt;/code&gt; znode in order to receive change notifications whenever a child znode is added or removed. Setting the watch lets us monitor the group for changes and react accordingly. For example, if we are building a distributed search engine and a server in the search cluster dies, we need to know about that event and move the data held by the (now dead) server across the remaining servers, assuming the data is stored redundantly such as in Hadoop. This is exactly what the Apache Blur distributed search engine does in order to ensure data is not lost and that the cluster continues operating when one or more servers is lost. In ZooKeeper you set watches on read operations, for example when listing a znode or getting its data. We&apos;ll list the children under &lt;code&gt;/sample-group&lt;/code&gt; and set a watch, indicated by using &lt;code&gt;true&lt;/code&gt; as the second argument.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[zk: localhost:2181(CONNECTED) 6] ls /sample-group true
[child-0000000001, child-0000000002, child-0000000000]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now if we create another child znode, the watch event will fire and notify us that a &lt;code&gt;NodeChildrenChanged&lt;/code&gt; event occurred.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[zk: localhost:2181(CONNECTED) 7] create -s -e /sample-group/child- data-4

WATCHER::

WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/sample-group
Created /sample-group/child-0000000003
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The event does not tell us what actually changed, however. To get the updated list of children we need to again list the contents of &lt;code&gt;/sample-group&lt;/code&gt;. In addition, watchers are &lt;em&gt;one-time events&lt;/em&gt;, and clients must &lt;em&gt;re-register&lt;/em&gt; the watch to continue receiving change notifications. So if we now create another child znode, no watch will fire.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[zk: localhost:2181(CONNECTED) 8] create -s -e /sample-group/child- data-5
Created /sample-group/child-0000000004
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To finish off our test drive, let&apos;s delete our test group.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[zk: localhost:2181(CONNECTED) 9] delete /sample-group
Node not empty: /sample-group
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Oops. ZooKeeper won&apos;t allow znodes to be deleted if they have children. In addition, updates (including deletes) are conditional on a specific version; this is a form of optimistic locking that ensures a client update succeeds only if it supplies the current version of the data. Otherwise the update fails with a &lt;code&gt;BadVersionException&lt;/code&gt;. You can bypass the version check by passing &lt;code&gt;-1&lt;/code&gt;, which tells ZooKeeper to perform the update unconditionally. So in order to delete our group, we first delete all the child znodes and then delete the group znode, all unconditionally.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[zk: localhost:2181(CONNECTED) 10] delete /sample-group/child-0000000000 -1
[zk: localhost:2181(CONNECTED) 11] delete /sample-group/child-0000000001 -1
[zk: localhost:2181(CONNECTED) 12] delete /sample-group/child-0000000002 -1
[zk: localhost:2181(CONNECTED) 13] delete /sample-group/child-0000000003 -1
[zk: localhost:2181(CONNECTED) 14] delete /sample-group/child-0000000004 -1
[zk: localhost:2181(CONNECTED) 15] delete /sample-group -1                 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In addition to the shell, ZooKeeper also provides commands referred to as the &quot;four letter words&quot;. You issue the commands via telnet or nc (netcat). For example, let&apos;s ask ZooKeeper how it&apos;s feeling.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ echo &quot;ruok&quot; | nc localhost 2181
imok
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can also use the &lt;code&gt;stat&lt;/code&gt; command to get basic statistics on ZooKeeper.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ echo &quot;stat&quot; | nc localhost 2181
Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Clients:
 /0:0:0:0:0:0:0:1%0:63888[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/0/157
Received: 338
Sent: 337
Connections: 1
Outstanding: 0
Zxid: 0xb
Mode: standalone
Node count: 17
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this test drive, we&apos;ve seen some basic but important aspects of ZooKeeper. We created persistent and sequential ephemeral znodes, set a watch and received a change notification event when a znode&apos;s children changed, and deleted znodes. We also saw how znodes can have associated data. When building real systems you obviously won&apos;t be using the command line shell to implement behavior, however, so let&apos;s translate this simple group membership example into Java code.&lt;/p&gt;

&lt;h1&gt;Conclusion to Part 2&lt;/h1&gt;

&lt;p&gt;In this second part of the ZooKeeper series of blogs, you took a test drive using the command-line shell available in ZooKeeper. You created both persistent and ephemeral znodes. You created the ephemeral znodes as children of the persistent znode, and made them sequential as well so that ZooKeeper maintains a monotonically increasing, unique order. Finally you saw how to delete znodes and use a few of the &quot;four letter words&quot; to check ZooKeeper&apos;s status.&lt;/p&gt;

&lt;p&gt;In the next blog, we&apos;ll recreate the group example you&apos;ve just seen using the ZooKeeper Java API.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Source code for these blogs, &lt;a href=&quot;https://github.com/sleberknight/zookeeper-samples&quot;&gt;https://github.com/sleberknight/zookeeper-samples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Presentation on ZooKeeper, &lt;a href=&quot;http://www.slideshare.net/scottleber/apache-zookeeper&quot;&gt;http://www.slideshare.net/scottleber/apache-zookeeper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;ZooKeeper web site, &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;http://zookeeper.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part</guid>
    <title>Distributed Coordination With ZooKeeper Part 1: Introduction</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/distributed_coordination_with_zookeeper_part</link>
        <pubDate>Tue, 25 Jun 2013 10:29:24 +0000</pubDate>
    <category>Development</category>
    <category>hadoop</category>
    <category>distributed-computing</category>
    <category>java</category>
    <category>zookeeper</category>
            <description>&lt;p&gt;&lt;em&gt;This is the first in a series of blogs that introduce &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt;. This blog provides an introduction to ZooKeeper and its core concepts and use cases. In later blogs you will test drive ZooKeeper, see some examples of the Java API, learn about its architecture, build a distributed data structure which can be used across independent processes and machines, and finally get a brief introduction to a higher-level API on top of ZooKeeper.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Consider a distributed system with multiple servers, each of which is responsible for holding data and performing operations on that data. This could be a distributed search engine, a distributed build system, or even something like Hadoop which has both a distributed file system and a Map/Reduce data processing framework that operates on the data in the file system. How would you determine which servers are alive and operating at any given moment in time? Or, how would you determine which servers are available to process a build in a distributed build system? Or for a distributed search system how would you know which servers are available to hold data and handle search requests? Most importantly, how would you do these things reliably in the face of the difficulties of distributed computing such as network failures, bandwidth limitations, variable latency connections, security concerns, and anything else that can go wrong in a networked environment, perhaps even across multiple data centers?&lt;/p&gt;

&lt;p&gt;These and similar questions are the focus of Apache ZooKeeper, which is a fast, highly available, fault tolerant, distributed coordination service. Using ZooKeeper you can build reliable, distributed data structures for group membership, leader election, coordinated workflow, and configuration services, as well as generalized distributed data structures like locks, queues, barriers, and latches.&lt;/p&gt;

&lt;p&gt;Many well-known and successful projects already rely on ZooKeeper. Just a few of them include HBase, Hadoop 2.0, Solr Cloud, Neo4J, Apache Blur (incubating), and Accumulo.&lt;/p&gt;

&lt;h1&gt;Core Concepts&lt;/h1&gt;

&lt;p&gt;ZooKeeper is a distributed, hierarchical file system that facilitates loose coupling between clients and provides an eventually consistent view of its znodes, which are like files and directories in a traditional file system.  It provides basic operations such as creating, deleting, and checking existence of znodes. It provides an event-driven model in which clients can watch for changes to specific znodes, for example if a new child is added to an existing znode. ZooKeeper achieves high availability by running multiple ZooKeeper servers, called an ensemble, with each server holding an in-memory copy of the distributed file system to service client read requests. Each server also holds a persistent copy on disk.&lt;/p&gt;
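
&lt;p&gt;To make the znode model concrete, here is a toy, in-memory sketch of a hierarchical namespace with one-shot child watches. This is an illustration only, not the real ZooKeeper API (we&apos;ll see the actual Java API in a later blog); the class and method names here are made up for the example.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Toy in-memory model of a znode namespace; illustration only, not the ZooKeeper API.
public class ZnodeTree {
    private final Map<String, byte[]> nodes = new TreeMap<>();
    private final Map<String, List<Consumer<String>>> childWatches = new HashMap<>();

    public ZnodeTree() {
        nodes.put("/", new byte[0]); // the root znode always exists
    }

    private static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i == 0 ? "/" : path.substring(0, i);
    }

    // Create a znode under an existing parent, firing any one-shot child watches.
    public void create(String path, byte[] data) {
        if (!nodes.containsKey(parentOf(path))) {
            throw new IllegalStateException("parent does not exist: " + path);
        }
        nodes.put(path, data);
        List<Consumer<String>> watchers = childWatches.remove(parentOf(path));
        if (watchers != null) {
            watchers.forEach(w -> w.accept(path)); // watches fire once, like ZooKeeper's
        }
    }

    public boolean exists(String path) {
        return nodes.containsKey(path);
    }

    public void delete(String path) {
        nodes.remove(path);
    }

    // Register a one-shot watch that fires when a child is added under parent.
    public void watchChildren(String parent, Consumer<String> watcher) {
        childWatches.computeIfAbsent(parent, k -> new ArrayList<>()).add(watcher);
    }
}
```

&lt;p&gt;The one-shot watch mirrors real ZooKeeper behavior: a watch fires once and must be re-registered if the client wants further notifications.&lt;/p&gt;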

&lt;p&gt;One of the servers is elected as the leader, and all other servers are followers. The leader is responsible for all writes and for broadcasting changes to followers. Once a majority of followers commit a change successfully, the write succeeds and the data is durable even if the leader subsequently fails. This makes ZooKeeper an eventually consistent system: the followers may lag the leader by some small amount of time, so clients might not always see the most up-to-date information. Importantly, the leader is not a master as in a master/slave architecture and thus is not a single point of failure; rather, if the leader dies, the remaining followers hold an election for a new leader, and the new leader takes over where the old one left off.&lt;/p&gt;

&lt;p&gt;Each client connects to ZooKeeper, passing in the list of servers in the ensemble. The client tries servers from that list at random until a connection is established. Once connected, ZooKeeper creates a session with the client-specified timeout period. The ZooKeeper client automatically sends periodic heartbeats to keep the session alive if no operations are performed for a while, and it automatically handles failover: if the server a client is connected to fails, the client detects this and reconnects to a different server in the ensemble. The nice thing is that the same client session is retained across this failover event. However, during failover it is possible that client operations could fail, so, as with almost all ZooKeeper operations, client code must be vigilant, detect errors, and deal with them as necessary.&lt;/p&gt;

&lt;h1&gt;Partial Failure&lt;/h1&gt;

&lt;p&gt;One of the fallacies of distributed computing is that the network is reliable. Having worked for the past few years on a project with multiple Hadoop, Apache Blur, and ZooKeeper clusters comprising hundreds of servers, I can definitely say from experience that the network is not reliable. Simply put, things break, and you cannot assume the network is 100% reliable all the time. When designing distributed systems, you must keep this in mind and handle failure modes you would not even consider when building software for a single server. For example, suppose a client sends an update to a server, but the network connection is lost briefly before the response is received. You need to ask several questions in this case. Did the message get through to the server? If it did, did the operation actually complete successfully? Is it safe to retry an operation when you don&apos;t know whether it ever reached the server or whether it failed there? In other words, is the operation idempotent? You need to consider questions like these when building distributed systems. ZooKeeper cannot prevent network problems or partial failures, but once you are aware of the kinds of problems that can arise, you are much better prepared to deal with them when (not if) they occur. ZooKeeper does provide certain guarantees regarding data consistency and atomicity that can aid you when building systems, as you will see later.&lt;/p&gt;
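
&lt;p&gt;The idempotency question above can be captured in code. The sketch below is a hypothetical retry helper, not part of ZooKeeper or any library: it retries an operation only when the caller has declared it idempotent, since retrying a non-idempotent operation whose first attempt may already have reached the server risks applying it twice.&lt;/p&gt;

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retry an operation only when it is safe (idempotent) to do so.
public class Retry {
    public static <T> T call(Callable<T> op, boolean idempotent, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (!idempotent) {
                    // The first attempt may already have taken effect on the server,
                    // so blindly retrying could apply the operation twice.
                    throw e;
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

&lt;p&gt;Real systems would also add backoff between attempts and distinguish connection loss from application errors, but the core decision stays the same: only retry when the operation is idempotent.&lt;/p&gt;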

&lt;h1&gt;Conclusion to Part 1&lt;/h1&gt;

&lt;p&gt;In this blog we&apos;ve learned that ZooKeeper is a distributed coordination service that facilitates loose coupling between distributed components. It is implemented as a distributed, hierarchical file system and you can use it to build distributed data structures such as locks, queues, and so on. In the next blog, we&apos;ll take a test drive of ZooKeeper using its command line shell.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Presentation on ZooKeeper, &lt;a href=&quot;http://www.slideshare.net/scottleber/apache-zookeeper&quot;&gt;http://www.slideshare.net/scottleber/apache-zookeeper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;ZooKeeper web site, &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;http://zookeeper.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Projects powered by ZooKeeper, &lt;a href=&quot;https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy&quot;&gt;https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fallacies of Distributed Computing, &lt;a href=&quot;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&quot;&gt;http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop web site, &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;http://hadoop.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hadoop: The Definitive Guide, &lt;a href=&quot;http://bit.ly/hadoop-definitive-guide&quot;&gt;http://bit.ly/hadoop-definitive-guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Apache Blur (incubating) web site, &lt;a href=&quot;http://incubator.apache.org/blur/&quot;&gt;http://incubator.apache.org/blur/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/hadoop_presentation_at_nova_dc</guid>
    <title>Hadoop Presentation at NOVA/DC Java Users Group</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/hadoop_presentation_at_nova_dc</link>
        <pubDate>Tue, 10 May 2011 01:02:46 +0000</pubDate>
    <category>Development</category>
    <category>hadoop</category>
    <category>java</category>
    <category>hive</category>
            <description>&lt;p&gt;Last Thursday (on Cinco de Mayo) I gave a presentation on &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;Hadoop&lt;/a&gt; and &lt;a href=&quot;http://hive.apache.org/&quot;&gt;Hive&lt;/a&gt; at the &lt;a href=&quot;http://www.meetup.com/dc-jug/&quot;&gt;Nova/DC Java Users Group&lt;/a&gt;. As several people asked about getting the slides, I&apos;ve shared them &lt;a href=&quot;http://www.slideshare.net/scottleber/hadoop-7904044&quot;&gt;here&lt;/a&gt; on Slideshare. I also posted the presentation sample code on Github at &lt;a href=&quot;https://github.com/sleberknight/basic-hadoop-examples&quot;&gt;basic-hadoop-examples&lt;/a&gt;.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/what_s_in_jdk_7</guid>
    <title>What&apos;s in JDK 7 Lightning Talk Slides</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/what_s_in_jdk_7</link>
        <pubDate>Sat, 16 Apr 2011 11:06:24 +0000</pubDate>
    <category>Development</category>
    <category>jdk7</category>
    <category>java</category>
            <description>&lt;p&gt;Yesterday at the &lt;a href=&quot;http://www.nearinfinity.com&quot;&gt;Near Infinity&lt;/a&gt; 2011 Spring Conference I gave a talk on CoffeeScript (see &lt;a href=&quot;http://www.nearinfinity.com/blogs/scott_leberknight/coffeescript_slides.html&quot;&gt; here&lt;/a&gt;) and a very short lightning talk on what exactly is in JDK 7. You can find the slides for the JDK 7 talk &lt;a href=&quot;http://www.slideshare.net/scottleber/wtf-is-in-javajdkwtf7&quot;&gt;here&lt;/a&gt; if you&apos;re interested.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/coffeescript_slides</guid>
    <title>CoffeeScript Slides</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/coffeescript_slides</link>
        <pubDate>Fri, 15 Apr 2011 15:30:14 +0000</pubDate>
    <category>Development</category>
    <category>coffeescript</category>
    <category>javascript</category>
            <description>&lt;p&gt;Today is the &lt;a href=&quot;http://www.nearinfinity.com&quot;&gt;Near Infinity&lt;/a&gt; Spring Conference. We have one conference in the fall and one in the spring for all our developers as well as invited guests. Today I gave a presentation on &lt;a href=&quot;http://coffeescript.org&quot;&gt;CoffeeScript&lt;/a&gt; and shared the slides &lt;a href=&quot;http://www.slideshare.net/scottleber/coffeescript-7642999&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/introducing_rjava</guid>
    <title>Introducing RJava</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/introducing_rjava</link>
        <pubDate>Fri, 1 Apr 2011 00:00:00 +0000</pubDate>
    <category>Development</category>
    <category>jruby</category>
    <category>java</category>
    <category>ruby</category>
            <description>&lt;p&gt;You&#8217;ve no doubt heard about JRuby, which lets you run Ruby code on the JVM. This is nice, but wouldn&#8217;t it be nicer if you could write Java code on a Ruby VM? This would let you take advantage of the power of Ruby 1.9&#8217;s new YARV (Yet Another Ruby VM) interpreter while letting you write code in a statically-typed language. Without further ado, I&#8217;d like to introduce &lt;strong&gt;RJava&lt;/strong&gt;, which does just that!&lt;/p&gt;

&lt;p&gt;RJava lets you write code in Java and run it on a Ruby VM! And you still get the full benefit of the Java compiler to ensure your code is 100% correct. Of course with Java you also get checked exceptions and proper interfaces and abstract classes to ensure compliance with your design. You no longer need to worry about whether an object responds to a random message, because the Java compiler will enforce that it does.&lt;/p&gt;

&lt;p&gt;You get all this and more but on the power and flexibility of a Ruby VM. And because Java does not support closures, you are ensured that everything is properly designed since you&#8217;ll be able to define interfaces and then implement anonymous inner classes just like you&#8217;re used to doing! Even when JDK 8 arrives sometime in the future with lambdas, you can rest assured that they will be statically typed.&lt;/p&gt;

&lt;p&gt;As a first example, let&#8217;s see how you could filter a collection in RJava to find only the even numbers from one to ten. In Ruby you&#8217;d probably write something like this:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
evens = (1..10).find_all { |n| n % 2 == 0 }
&lt;/pre&gt;

&lt;p&gt;With RJava, you&#8217;d write this:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
List&amp;lt;Integer&amp;gt; evens = new ArrayList&amp;lt;Integer&amp;gt;();
for (int i = 1; i &amp;lt;= 10; i++) {
  if (i % 2 == 0) {
    evens.add(i);
  }
}
&lt;/pre&gt;

&lt;p&gt;This example shows the benefits of declaring variables with specific types, how you can use interfaces (e.g. List in the example) when declaring variables, and shows how you also get the benefits of Java generics to ensure your collections are always type-safe. Without any doubt you know that &#8220;evens&#8221; is a List containing Integers and that &#8220;i&#8221; is an int, so you can sleep soundly knowing your code is correct. You can also see Java&#8217;s powerful &#8220;for&#8221; loop at work here, to easily traverse from 1 to 10, inclusive. Finally, you saw how to effectively use Java&#8217;s braces to organize code to clearly show blocks, and semi-colons ensure you always know where lines terminate.&lt;/p&gt;

&lt;p&gt;I&#8217;ve just released &lt;a href=&quot;https://github.com/sleberknight/rjava&quot; onclick=&quot;alert(&apos;April Fools!&apos;); return false;&quot;&gt;RJava&lt;/a&gt; on GitHub, so go check it out. Please download RJava today and give it a try and let me know what you think!&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/database_backed_refreshable_beans_with</guid>
    <title>Database-Backed Refreshable Beans with Groovy and Spring 3</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/database_backed_refreshable_beans_with</link>
        <pubDate>Sat, 30 Oct 2010 11:57:05 +0000</pubDate>
    <category>Development</category>
    <category>spring</category>
    <category>groovy</category>
            <description>&lt;p&gt;In 2009 I published a &lt;a href=&quot;http://www.ibm.com/developerworks/views/java/libraryview.jsp?search_by=groovier+spring&quot;&gt;two-part&lt;/a&gt; series of articles on IBM developerWorks entitled &lt;a href=&quot;http://www.ibm.com/developerworks/java/library/j-groovierspring1.html&quot;&gt;Groovier&lt;/a&gt; &lt;a href=&quot;http://www.ibm.com/developerworks/java/library/j-groovierspring2.html&quot;&gt;Spring&lt;/a&gt;. The articles showed how Spring supports implementing beans in Groovy whose behavior can be changed at runtime via the &quot;refreshable beans&quot; feature. This feature essentially detects when a Spring bean backed by a Groovy script has changed, recompiles it, and replaces the old bean with the new one. This feature is pretty powerful in certain scenarios, for example in PDF generation; mail or any kind of template generation; and as a way to implement runtime modifiable business rules. One specific use case I showed was how to implement PDF generation where the Groovy scripts reside in a database, allowing you to change how PDFs are generated by simply updating Groovy scripts in your database.&lt;/p&gt;

&lt;p&gt;In order to load Groovy scripts from a database, I showed how to implement custom &lt;code&gt;ScriptFactoryPostProcessor&lt;/code&gt; and &lt;code&gt;ScriptSource&lt;/code&gt; classes. The &lt;code&gt;CustomScriptFactoryPostProcessor&lt;/code&gt; extends the default Spring &lt;code&gt;ScriptFactoryPostProcessor&lt;/code&gt; and overrides the &lt;code&gt;convertToScriptSource&lt;/code&gt; method to recognize a database-based script, e.g. you could specify a script source of &lt;code&gt;database:com/nearinfinity/demo/GroovyPdfGenerator.groovy&lt;/code&gt;. There is also &lt;code&gt;DatabaseScriptSource&lt;/code&gt; that implements the &lt;code&gt;ScriptSource&lt;/code&gt; interface and which knows how to load Groovy scripts from a database.&lt;/p&gt;

&lt;p&gt;In order to put these pieces together, you need to do a bit of configuration. In the articles I used Spring 2.5.x which was &lt;i&gt;current at the time in early 2009&lt;/i&gt;. The configuration looked like this:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
&amp;lt;bean id=&quot;dataSource&quot;
  class=&quot;org.springframework.jdbc.datasource.DriverManagerDataSource&quot;&amp;gt;
    &amp;lt;!-- set data source props, e.g. driverClassName, url, username, password... --&amp;gt;
&amp;lt;/bean&amp;gt;

&amp;lt;bean id=&quot;scriptFactoryPostProcessor&quot;
  class=&quot;com.nearinfinity.spring.scripting.support.CustomScriptFactoryPostProcessor&quot;&amp;gt;
    &amp;lt;property name=&quot;dataSource&quot; ref=&quot;dataSource&quot;/&amp;gt;
&amp;lt;/bean&amp;gt;

&amp;lt;lang:groovy id=&quot;pdfGenerator&quot;
  script-source=&quot;database:com/nearinfinity/demo/DemoGroovyPdfGenerator.groovy&quot;&amp;gt;
    &amp;lt;lang:property name=&quot;companyName&quot; value=&quot;Database Groovy Bookstore&quot;/&amp;gt;
&amp;lt;/lang:groovy&amp;gt;
&lt;/pre&gt;

&lt;p&gt;In Spring 2.5.x this works because the &lt;code&gt;&amp;lt;lang:groovy&amp;gt;&lt;/code&gt; tag looks for a Spring bean with id &quot;scriptFactoryPostProcessor&quot;; if one exists it uses it, and if not it creates one. In the above configuration we created our own &quot;scriptFactoryPostProcessor&quot; bean for &lt;code&gt;&amp;lt;lang:groovy&amp;gt;&lt;/code&gt; tags to utilize. So all&apos;s well...until you move to Spring 3.x, at which point the above configuration no longer works. This was pointed out to me by Jo&#227;o from Brazil, who tried the sample code in the articles with Spring 3.x, and it did not work. After trying a bunch of things, we eventually determined that in Spring 3.x the &lt;code&gt;&amp;lt;lang:groovy&amp;gt;&lt;/code&gt; tag looks for a &lt;code&gt;ScriptFactoryPostProcessor&lt;/code&gt; bean whose id is &quot;org.springframework.scripting.config.scriptFactoryPostProcessor&quot;, not just &quot;scriptFactoryPostProcessor.&quot; Once you figure this out, it is easy to change the above configuration to:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
&amp;lt;bean id=&quot;org.springframework.scripting.config.scriptFactoryPostProcessor&quot;
  class=&quot;com.nearinfinity.spring.scripting.support.CustomScriptFactoryPostProcessor&quot;&amp;gt;
    &amp;lt;property name=&quot;dataSource&quot; ref=&quot;dataSource&quot;/&amp;gt;
&amp;lt;/bean&amp;gt;

&amp;lt;lang:groovy id=&quot;pdfGenerator&quot;
  script-source=&quot;database:com/nearinfinity/demo/DemoGroovyPdfGenerator.groovy&quot;&amp;gt;
    &amp;lt;lang:property name=&quot;companyName&quot; value=&quot;Database Groovy Bookstore&quot;/&amp;gt;
&amp;lt;/lang:groovy&amp;gt;
&lt;/pre&gt;

&lt;p&gt;Then, everything works as expected and the Groovy scripts can reside in your database and be automatically reloaded when you change them. So if you download the article sample code as-is, it will work since the bundled Spring version is 2.5.4, but if you update to Spring 3.x then you&apos;ll need to modify the configuration in applicationContext.xml for example #7 (EX #7) as shown above to change the &quot;scriptFactoryPostProcessor&quot; bean to be &quot;org.springframework.scripting.config.scriptFactoryPostProcessor.&quot; Note there is a scheduled JIRA issue &lt;a href=&quot;https://jira.springframework.org/browse/SPR-5106&quot;&gt;SPR-5106&lt;/a&gt; that will make the &lt;code&gt;ScriptFactoryPostProcessor&lt;/code&gt; mechanism pluggable, so that you won&apos;t need to extend the default &lt;code&gt;ScriptFactoryPostProcessor&lt;/code&gt; and replace the default bean, etc. But until then, this hack continues to work pretty well.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/rack_lightning_talk</guid>
    <title>Rack Lightning Talk</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/rack_lightning_talk</link>
        <pubDate>Thu, 21 Oct 2010 20:32:38 +0000</pubDate>
    <category>Development</category>
    <category>rack</category>
    <category>ruby</category>
    <category>middleware</category>
            <description>&lt;p&gt;I gave a short lightning talk on &lt;a href=&quot;http://rack.rubyforge.org/&quot;&gt;Rack&lt;/a&gt; tonight at the &lt;a href=&quot;http://novarug.org/&quot;&gt;NovaRUG&lt;/a&gt;. It&apos;s on Slideshare &lt;a href=&quot;http://www.slideshare.net/scottleber/rack-5521616&quot;&gt;here&lt;/a&gt;. Rack is really cool because it makes creating modular functionality easy. For example, if you want exceptions mailed to you, you can use the Rack::MailExceptions middleware, or if you want responses compressed, you can add one line of code to a Rails app to use Rack::Deflater. Cool.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/adding_each_line_method_to</guid>
    <title>Missing the each_line method in FakeFS version 0.2.1? Add it!</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/adding_each_line_method_to</link>
        <pubDate>Thu, 6 May 2010 23:21:28 +0000</pubDate>
    <category>Development</category>
    <category>rspec</category>
    <category>ruby</category>
    <category>fakefs</category>
            <description>&lt;p&gt;Recently we have been using the excellent &lt;a href=&quot;http://github.com/defunkt/fakefs&quot;&gt;FakeFS&lt;/a&gt; (fake filesystem) gem in some specs to test code that reads and writes files on the filesystem. We are using the latest &lt;em&gt;release&lt;/em&gt; version of this gem, which is 0.2.1 as I am writing this. Some of the code under test uses the &lt;code&gt;IO&lt;/code&gt; &lt;code&gt;each_line&lt;/code&gt; method to iterate lines in relatively largish files. But we found out quickly that this is a problem, since in version 0.2.1 the &lt;code&gt;FakeFS::File&lt;/code&gt; class does not extend &lt;code&gt;StringIO&lt;/code&gt; and so you don&apos;t get all its methods such as &lt;code&gt;each_line&lt;/code&gt;. (The &lt;a href=&quot;http://github.com/defunkt/fakefs/blob/master/lib/fakefs/file.rb&quot;&gt;version on master in GitHub&lt;/a&gt; as I write this does extend &lt;code&gt;StringIO&lt;/code&gt;, but it is not yet released as a formal version.)&lt;/p&gt;

&lt;p&gt;As an example, suppose we have the following code that prints out the size of each line in a file as stars (asterisks):&lt;/p&gt;

&lt;style type=&quot;text/css&quot;&gt;
.prettyprint {
background-color:#EFEFEF;
border:1px solid #CCCCCC;
font-size:small;
overflow:auto;
padding:5px;
}
&lt;/style&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
def lines_to_stars(file_path)
  File.open(file_path, &apos;r&apos;).each_line { |line| puts &apos;*&apos; * line.size }
end
&lt;/pre&gt;

&lt;p&gt;Let&apos;s say we use FakeFS to create a fake file like this:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
require &apos;fakefs/safe&apos;
require &apos;stringio&apos;

FakeFS.activate!

File.open(&apos;/tmp/foo.txt&apos;, &apos;w&apos;) do |f|
  f.write &quot;The quick brown fox jumped over the lazy dog\n&quot;
  f.write &quot;The quick red fox jumped over the sleepy cat\n&quot;
  f.write &quot;Jack be nimble, Jack be quick, Jack jumped over the candle stick\n&quot;
  f.write &quot;Twinkle, twinkle little star, how I wonder what you are\n&quot;
  f.write &quot;The End.&quot;
end
&lt;/pre&gt;

&lt;p&gt;So far, so good. But now if we call &lt;code&gt;lines_to_stars&lt;/code&gt; we get an error:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
NoMethodError: undefined method `each_line&apos; for #&amp;lt;FakeFS::File:0x000001012c22b8&amp;gt;
&lt;/pre&gt;

&lt;p&gt;Oops. No &lt;code&gt;each_line&lt;/code&gt;. If you don&apos;t want to use an unreleased version of the gem, you can add &lt;code&gt;each_line&lt;/code&gt; onto FakeFS::File using the following code:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
module FakeFS
  class File
    def each_line
      File.readlines(self.path).each { |line| yield line }
    end
  end
end
&lt;/pre&gt;

&lt;p&gt;Basically all it does is define &lt;code&gt;each_line&lt;/code&gt; so that it reads all the lines from a (fake) file on the (fake) filesystem and then yields them up one by one, so code under test that iterates over a file works as expected. Now calling &lt;code&gt;lines_to_stars&lt;/code&gt; gives a nice bar chart of the line sizes represented by stars:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
********************************************
********************************************
***************************************************************
*******************************************************
********
&lt;/pre&gt;

&lt;p&gt;Since we&apos;re using RSpec, to make this work nicely we added the above code that defines &lt;code&gt;each_line&lt;/code&gt; into a file named &lt;code&gt;fakefs.rb&lt;/code&gt; in the &lt;code&gt;spec/support&lt;/code&gt; directory, since &lt;code&gt;spec_helper&lt;/code&gt; requires supporting files in the &lt;code&gt;spec/support&lt;/code&gt; directory and its subdirectories. So now all our specs automatically get the &lt;code&gt;each_line&lt;/code&gt; behavior when using FakeFS.&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/hibernate_performance_tuning_part_2</guid>
    <title>Hibernate Performance Tuning Part 2 Article Published</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/hibernate_performance_tuning_part_2</link>
        <pubDate>Mon, 21 Dec 2009 14:23:06 +0000</pubDate>
    <category>Development</category>
    <category>performance</category>
    <category>java</category>
    <category>orm</category>
    <category>hibernate</category>
            <description> &lt;p&gt;I&apos;ve just published the second article of a two-part series in the December 2009 &lt;a href=&quot;http://www.nofluffjuststuff.com/home/magazine_subscribe?id=10&quot;&gt;NFJS Magazine&lt;/a&gt; on Hibernate Performance Tuning. Here&apos;s the abstract:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tuning performance in Hibernate applications is all about reducing the number of database queries or eliminating them entirely using caching. In the first article in this two part series, you saw how to tune object retrieval using eager fetching techniques to optimize queries and avoid lazy-loads. In this second and final article, I&#8217;ll show you how inheritance strategy affects performance, how to eliminate queries using the Hibernate second-level cache, and show some simple but effective tools you can use to monitor and profile your applications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you are using Hibernate and want to know more about how inheritance affects performance, how to use the second-level cache, and some simple monitoring and profiling techniques, check it out and let me know what you think. Note that NFJS Magazine does require a subscription.&lt;/p&gt;

</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/making_cobertura_reports_show_groovy</guid>
    <title>Making Cobertura Reports Show Groovy Code with Maven</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/making_cobertura_reports_show_groovy</link>
        <pubDate>Tue, 15 Dec 2009 23:43:37 +0000</pubDate>
    <category>Development</category>
    <category>cobertura</category>
    <category>java</category>
    <category>groovy</category>
    <category>maven</category>
            <description>&lt;p&gt;A recent project started out life as an all-Java project that used Maven as the build tool. Initially we used &lt;a href=&quot;http://www.atlassian.com/software/clover/&quot;&gt;Atlassian Clover&lt;/a&gt; to measure unit test coverage. Clover is a great product, but unfortunately it only works with Java code because it operates at the Java source level. (This was the case as of Spring 2009, and I haven&apos;t checked since then.) As we started migrating existing code from Java to Groovy and writing new code in Groovy, we started to lose data about unit test coverage because Clover does not understand Groovy code. To remedy this problem we switched from Clover to &lt;a href=&quot;http://cobertura.sourceforge.net/&quot;&gt;Cobertura&lt;/a&gt;, which instruments at the bytecode level and thus works with Groovy code. Theoretically it would also work with any JVM-based language, but I&apos;m not sure whether it could handle something like Clojure.&lt;/p&gt;

&lt;p&gt;In any case, we only cared about Groovy so Cobertura was a good choice. With the &lt;a href=&quot;http://mojo.codehaus.org/cobertura-maven-plugin/&quot;&gt;Cobertura Maven&lt;/a&gt; plugin we quickly found a problem, which was that even though the code coverage was running, the reports only showed coverage for Java code, not Groovy. This blog shows you how to display coverage on Groovy code  when using Maven and the Cobertura plugin. In other words, I&apos;ll show how to get Cobertura reports to link to the real Groovy source code in Maven, so you can navigate Cobertura reports as you normally would.&lt;/p&gt;

&lt;p&gt;The core problem is pretty simple, though it took me a while to figure out how to fix it. Seems to be pretty standard in Maven: I know what I want to do, but finding out how to do it is the &lt;i&gt;really&lt;/i&gt; hard part. The only thing you need to do is tell Maven about the Groovy source code and where it lives. The way I did this is to use the Codehaus &lt;a href=&quot;http://mojo.codehaus.org/build-helper-maven-plugin/&quot;&gt;build-helper-maven-plugin&lt;/a&gt; which has an add-source goal. The add-source goal does just what you would expect; it adds a specified directory (or directories) as a source directory in your Maven build. Here&apos;s how you use it in your Maven pom.xml file:&lt;/p&gt;

&lt;style type=&quot;text/css&quot;&gt;
.prettyprint {
background-color:#EFEFEF;
border:1px solid #CCCCCC;
font-size:small;
overflow:auto;
padding:5px;
}
&lt;/style&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
&amp;lt;plugin&amp;gt;
    &amp;lt;groupId&amp;gt;org.codehaus.mojo&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;build-helper-maven-plugin&amp;lt;/artifactId&amp;gt;
    &amp;lt;executions&amp;gt;
        &amp;lt;execution&amp;gt;
            &amp;lt;phase&amp;gt;generate-sources&amp;lt;/phase&amp;gt;
            &amp;lt;goals&amp;gt;
                &amp;lt;goal&amp;gt;add-source&amp;lt;/goal&amp;gt;
            &amp;lt;/goals&amp;gt;
            &amp;lt;configuration&amp;gt;
                &amp;lt;sources&amp;gt;
                    &amp;lt;source&amp;gt;src/main/groovy&amp;lt;/source&amp;gt;
                &amp;lt;/sources&amp;gt;
            &amp;lt;/configuration&amp;gt;
        &amp;lt;/execution&amp;gt;
    &amp;lt;/executions&amp;gt;
&amp;lt;/plugin&amp;gt;
&lt;/pre&gt;

&lt;p&gt;In the above code snippet, we&apos;re  using the &quot;build-helper-maven-plugin&quot; to add the src/main/groovy directory. That&apos;s pretty much it. Run Cobertura as normal, view the reports, and you should now see coverage on Groovy source code as well as Java.&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/hibernate_performance_tuning_part_11</guid>
    <title>Hibernate Performance Tuning Part 1 Article Published</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/hibernate_performance_tuning_part_11</link>
        <pubDate>Tue, 1 Dec 2009 19:39:51 +0000</pubDate>
    <category>Development</category>
    <category>performance</category>
    <category>java</category>
    <category>orm</category>
    <category>hibernate</category>
            <description> &lt;p&gt;I&apos;ve just published an article in the November 2009 &lt;a href=&quot;http://www.nofluffjuststuff.com/home/magazine_subscribe?id=9&quot;&gt;NFJS Magazine&lt;/a&gt; on Hibernate Performance Tuning. Here&apos;s the abstract:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Many developers treat Hibernate like a &quot;black box&quot; and assume it will simply &quot;Do the Right Thing&quot; when it comes to all things related to the underlying database. This is a faulty assumption because, while Hibernate is great at the mechanics of database interaction, it cannot and will likely not ever be able to figure out the specific details of your domain model and discern the most efficient and best performing data access strategies. In this first article of a two-part series, I&apos;ll show you how to achieve better performance in your Hibernate applications by focusing on tuning object retrieval, which forms the basis of your &quot;fetch plan&quot; for finding and storing objects in the database.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you are using Hibernate and want to know more about how to change how objects are fetched from the database, check it out and let me know what you think. Note that NFJS Magazine does require a subscription.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/can_java_be_saved</guid>
    <title>Can Java Be Saved?</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/can_java_be_saved</link>
        <pubDate>Mon, 9 Nov 2009 15:37:25 +0000</pubDate>
    <category>Development</category>
    <category>python</category>
    <category>closure</category>
    <category>java</category>
    <category>c#</category>
    <category>groovy</category>
    <category>clojure</category>
            <description>&lt;h3&gt;Java and Evolution&lt;/h3&gt;

&lt;p&gt;The Java language has been around for a pretty long time, and in my view is now a stagnant language. I don&apos;t consider it &lt;a href=&quot;http://codemonkeyism.com/java-dead/&quot;&gt;dead&lt;/a&gt; because I believe it will be around for probably decades if not longer. But it appears to have reached its evolutionary peak, and it doesn&apos;t look like it&apos;s going to evolve any further. This is not due to problems inherent in the language itself. Instead the problem seems to lie with Java&apos;s stewards (Sun and the JCP) and their unwillingness to evolve the language to keep it current and modern, and, more importantly, their goal of keeping backward compatibility at all costs. It&apos;s not just Sun: the large corporations with correspondingly large investments in Java, like IBM and Oracle, aren&apos;t exactly champing at the bit to improve Java either. I don&apos;t know whether they think it needs improvement at all. So really, the ultra-conservative attitude toward change and evolution is the problem with Java, from my admittedly limited view of things.&lt;/p&gt;

&lt;p&gt;That&apos;s why I don&apos;t hate Java. But I do hate the way it has been treated by the people charged with improving it. It is clear many in the Java community want things like closures and a native property syntax, but instead we got Project Coin. This, to me, is really sad. It is a shame that things like closures and native properties were not addressed in Java/JDK/whatever-it-is-called 7.&lt;/p&gt;

&lt;h3&gt;Why Not?&lt;/h3&gt;

&lt;p&gt;I want to know why Java can&apos;t be improved. We have concrete examples that it is possible to change a major language in major ways, even in ways that break backward compatibility, in order to evolve and improve. Out with the old, in with the new. Microsoft showed with C# that you can successfully evolve a language over time in major ways. For example, C# has always had a property syntax, but it now also has many features found in dynamically typed and functional languages, such as type inference and, effectively, closures. With LINQ it introduced functional concepts. When C# added generics it did so correctly and retained the type information in the compiled IL, whereas Java used type erasure and simply dropped the types from the compiled bytecode. There is a great irony here: though C# began life about five or six years after Java, it has not only caught up but surpassed Java in most if not all ways, and has continued to evolve while Java has become stagnant.&lt;/p&gt;

&lt;p&gt;C# is not the only example. Python 3 is a major overhaul of the Python language, and it introduced breaking changes that are not backwards compatible. I believe they provide a migration tool to assist you should you want to move from the 2.x series to version 3 and beyond. Microsoft has done this kind of thing as well. I remember when they made Visual Basic conform to the .NET platform and introduced some rather gut-wrenching (for VB developers anyway) changes, and they also provided a tool to aid the transition. One more recent example is Objective-C, which has experienced a resurgence in importance mainly because of the iPhone. Objective-C has been around since the 1980s, longer than Java, C#, Ruby, Python, and the rest. Apple has made improvements to Objective-C, and it now sports a way to define and synthesize properties and most recently added blocks (effectively closures). If a language that pre-dates Java can evolve (Python also pre-dates Java, by the way), I just don&apos;t get why Java can&apos;t.&lt;/p&gt;

&lt;p&gt;While it is certainly possible to remain on older versions of software, forcing yourself to upgrade can be a Good Thing, because it ensures you don&apos;t get the &quot;COBOL Syndrome&quot; where you end up with nothing but binaries that have to run on a specific hardware platform forever and you are trapped until you rewrite or you go out of business. The other side of this, of course, is that organizations don&apos;t have infinite time, money, and resources to update every single application. Sometimes this too can be good, because it forces you to triage older systems, and possibly consolidate or outright eliminate them if they have outlived their usefulness. In order to facilitate large transitions, I believe it is very important to use tools that help automate the upgrade process, e.g. tools that analyze code and fix it if possible (reporting all changes in a log) and which provide warnings and guidance when a simple fix isn&apos;t possible.&lt;/p&gt;

&lt;h3&gt;The JVM Platform&lt;/h3&gt;

&lt;p&gt;Before I get into the changes I&apos;d make to Java so that it doesn&apos;t feel like I&apos;m developing in a straitjacket while typing masses of unnecessary boilerplate code, I want to say that I think the JVM is a great place to be. Obviously the JVM itself facilitates developing all kinds of languages, as evidenced by the huge number of languages that run on the JVM. The most popular and most interesting ones these days are probably JRuby, Scala, Groovy, and Clojure, though there are probably hundreds more. So I suppose you could make an argument that Java doesn&apos;t need to evolve any more because we can simply use a more modern language that runs on the JVM.&lt;/p&gt;

&lt;p&gt;The main problem I have with that argument is simply that there is already a ton of Java code out there, and there are many organizations who are simply not going to allow other JVM-based languages; they&apos;re going to stick with Java for the long haul, right or wrong. This means that even if you can manage to convince someone to try writing that shiny new web app using Scala and its Lift framework, JRuby on Rails, Grails, or Clojure, chances are at some point you&apos;ll also need to maintain or enhance existing large Java codebases. Wouldn&apos;t you like to be able to first upgrade to a version of Java that has closures, native property syntax, method/property handles, etc.?&lt;/p&gt;

&lt;p&gt;Next I&apos;ll lay out my top three choices for making Java much better immediately.&lt;/p&gt;

&lt;h3&gt;Top Three Java Improvements&lt;/h3&gt;

&lt;p&gt;If given the chance to change just three things about Java to make it better, I would choose these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove checked exceptions&lt;/li&gt;
&lt;li&gt;Add closures&lt;/li&gt;
&lt;li&gt;Add formal property support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think these three changes alone would make coding in Java much, much better. Let&apos;s see how.&lt;/p&gt;

&lt;h4&gt;Remove Checked Exceptions&lt;/h4&gt;

&lt;p&gt;By removing checked exceptions you eliminate a ton of boilerplate try/catch blocks that do nothing except log a message or wrap and re-throw as a RuntimeException, throws clauses polluting the API all over the place, and, worst of all, empty catch blocks that can cause very subtle and evil bugs. With unchecked exceptions, developers still have the option to catch exceptions that they can actually handle. It would be interesting to see how many times in a typical Java codebase people actually handle exceptions and do something at the point of exception, or whether they simply punt it away for the caller to handle, who in turn also punts, and so forth all the way up the call stack until some global handler catches it or the program crashes. If I were a betting man, I&apos;d bet a lot of money that for most applications, developers punt the vast majority of the time. So why force people to handle something they cannot possibly handle?&lt;/p&gt;
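&lt;p&gt;To illustrate the punting pattern, here is a minimal sketch; the &lt;code&gt;readConfig&lt;/code&gt; and &lt;code&gt;doRead&lt;/code&gt; methods are hypothetical, purely for illustration:&lt;/p&gt;

```java
import java.io.IOException;

public class PuntExample {

    // Typical boilerplate: catch a checked exception we cannot actually
    // handle, wrap it, and re-throw as an unchecked RuntimeException.
    static String readConfig(boolean fail) {
        try {
            return doRead(fail);
        } catch (IOException e) {
            // Nothing useful to do here; punt up the call stack.
            throw new RuntimeException("Unable to read config", e);
        }
    }

    // Hypothetical low-level method that declares a checked exception
    static String doRead(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("boom");
        }
        return "config-value";
    }

    public static void main(String[] args) {
        System.out.println(readConfig(false));
    }
}
```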

&lt;h4&gt;Add Closures&lt;/h4&gt;

&lt;p&gt;I specifically listed removing checked exceptions first, because to me it is the first step to being able to have a closure/block syntax that isn&apos;t totally horrendous. If you remove checked exceptions, then adding closures would seem to be much easier since you don&apos;t need to worry at all about what exceptions could possibly be thrown and there is obviously no need to declare exceptions. Closures/blocks would lead to better ability to handle collections, for example as in Groovy but in Java you would still have types (note I&apos;m also using a literal property syntax here):&lt;/p&gt;

&lt;style type=&quot;text/css&quot;&gt;
.prettyprint {
background-color:#EFEFEF;
border:1px solid #CCCCCC;
font-size:small;
overflow:auto;
padding:5px;
}
&lt;/style&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
// Find all people whose last name is &quot;Smith&quot;
List&amp;lt;Person&amp;gt; peeps = people.findAll { Person person -&gt; person.lastName.equals(&quot;Smith&quot;); }
&lt;/pre&gt;

&lt;p&gt;or&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
// Create a list of names by projecting the name property of a bunch of Person objects
List&amp;lt;String&amp;gt; names = people.collect { Person person -&gt; person.name; }
&lt;/pre&gt;

&lt;p&gt;Not quite as clean as Groovy but still much better than the for loops that would traditionally be required (or trying to shoehorn a functional style into Java using the &lt;a href=&quot;http://commons.apache.org/collections/&quot;&gt;Jakarta Commons Collections&lt;/a&gt; or &lt;a href=&quot;http://code.google.com/p/google-collections/&quot;&gt;Google Collections&lt;/a&gt;). Removal of checked exceptions would allow, as mentioned earlier, the block syntax to not have to deal with declaring exceptions all over the place. Having to declare checked exceptions in blocks makes the syntax worse instead of better, at least judging by the various closure proposals for Java/JDK/whatever 7, none of which got included. Requiring types in the blocks is still annoying, especially once you get used to Ruby and Groovy, but it would be passable.&lt;/p&gt;
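&lt;p&gt;For comparison, here is a sketch of the for-loop versions of the two snippets above that you would write in Java today (the minimal &lt;code&gt;Person&lt;/code&gt; class here is hypothetical, just enough to make the example self-contained):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

public class LoopExample {

    // Minimal hypothetical Person class for the example
    static class Person {
        final String firstName;
        final String lastName;

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }

        String getName() { return firstName + " " + lastName; }
        String getLastName() { return lastName; }
    }

    // Find all people whose last name is "Smith" -- the for-loop version
    static List<Person> findSmiths(List<Person> people) {
        List<Person> result = new ArrayList<Person>();
        for (Person person : people) {
            if (person.getLastName().equals("Smith")) {
                result.add(person);
            }
        }
        return result;
    }

    // Project the name of each Person -- the for-loop version
    static List<String> collectNames(List<Person> people) {
        List<String> names = new ArrayList<String>();
        for (Person person : people) {
            names.add(person.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<Person> people = new ArrayList<Person>();
        people.add(new Person("John", "Smith"));
        people.add(new Person("Jane", "Doe"));
        System.out.println(findSmiths(people).size());
        System.out.println(collectNames(people));
    }
}
```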

&lt;h4&gt;Native Property Syntax&lt;/h4&gt;

&lt;p&gt;The third change should do essentially what Groovy does for properties, but should introduce a &quot;property&quot; keyword (i.e. don&apos;t rely on whether someone accidentally put an access modifier in there, as Groovy does). The syntax could be very clean:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
property String firstName;
property String lastName;
property Date dateOfBirth;
&lt;/pre&gt;

&lt;p&gt;The compiler could automatically generate the appropriate getter/setter for you like Groovy does. This obviates the need to manually code the getter/setter. Like Groovy you should be able to override either or both. It de-clutters code enormously and removes a ton of lines of silly getter/setter code (plus JavaDocs if you are actually still writing them for every get/set method). Then you could reference properties as you would expect: person.name is the &quot;getter&quot; and person.name = &quot;Fred&quot; is the &quot;setter.&quot; Much cleaner syntax, way less boilerplate code. By the way, if someone used the word &quot;property&quot; in their code, i.e. as a variable name, it is just not that difficult to rename refactor, especially with all the advanced &lt;a href=&quot;http://www.jetbrains.com/idea/&quot;&gt;IDEs&lt;/a&gt; in the Java community that do this kind of thing in their sleep.&lt;/p&gt;
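&lt;p&gt;For contrast, here is roughly what the hypothetical &lt;code&gt;firstName&lt;/code&gt; and &lt;code&gt;lastName&lt;/code&gt; property declarations above expand to in plain Java today (dateOfBirth omitted for brevity):&lt;/p&gt;

```java
// The boilerplate a "property" keyword would generate for you: a private
// field plus a hand-written getter and setter for each property.
public class Person {

    private String firstName;
    private String lastName;

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
}
```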

&lt;p&gt;Lots of other things could certainly be done, but if just these three were done I think Java would be much better off, and maybe it would even come into the 21st century like Objective-C. (See the very long but very good &lt;a href=&quot;http://arstechnica.com/apple/reviews/2009/08/mac-os-x-10-6.ars&quot;&gt;Ars Technica Snow Leopard review&lt;/a&gt; for information on Objective-C&apos;s new &lt;a href=&quot;http://arstechnica.com/apple/reviews/2009/08/mac-os-x-10-6.ars/10#blocks&quot;&gt;blocks&lt;/a&gt; feature.)&lt;/p&gt;

&lt;h3&gt;Dessert Improvements&lt;/h3&gt;

&lt;p&gt;If (as I suspect they certainly will :-) ) Sun/Oracle/whoever takes my suggestions and makes these changes and improves Java, then I&apos;m sure they&apos;ll want to add in a few more for dessert. After the main course which removes checked exceptions, adds closures, and adds native property support, dessert includes the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove type-erasure and clean up generics&lt;/li&gt;
&lt;li&gt;Add property/method handles&lt;/li&gt;
&lt;li&gt;String interpolation&lt;/li&gt;
&lt;li&gt;Type inference&lt;/li&gt;
&lt;li&gt;Remove &quot;new&quot; keyword&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Clean Up Generics&lt;/h4&gt;

&lt;p&gt;Generics should simply not remove type information when compiled. If you&apos;re going to have generics in the first place, do it correctly and stop worrying about backward compatibility. Keep type information in the bytecode, allow reflection on it, and allow me to instantiate a &quot;new T()&quot; where T is some type passed into a factory method, for example. I think an improved generics implementation could basically copy the way C# does it and be done.&lt;/p&gt;
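&lt;p&gt;To make the &quot;new T()&quot; point concrete, here is a sketch of the workaround erasure forces on us today: the factory cannot say &lt;code&gt;new T()&lt;/code&gt;, so callers must pass a Class token to smuggle the type through to runtime (the &lt;code&gt;newInstance&lt;/code&gt; helper below is hypothetical):&lt;/p&gt;

```java
public class Factory {

    // Erasure workaround: "new T()" will not compile because T is erased,
    // so the caller supplies a Class<T> token to keep the type at runtime.
    static <T> T newInstance(Class<T> type) {
        try {
            return type.newInstance();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = newInstance(StringBuilder.class);
        System.out.println(sb.length());
    }
}
```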

&lt;h4&gt;Property/Method Handles&lt;/h4&gt;

&lt;p&gt;Property/method &lt;a href=&quot;http://blogs.sun.com/jrose/entry/method_handles_in_a_nutshell&quot;&gt;handles&lt;/a&gt; would allow you to reference a property or method directly. They would make code that currently must use plain strings strongly typed and refactoring-safe, and much nicer (IDEs like IntelliJ already know how to search in text and strings, but they can never be perfect). For example, a particular pet peeve of mine, and I&apos;m sure of a lot of other developers, is writing Criteria queries in Hibernate. You are forced to reference properties as simple strings. If the lastName property is changed to surname then you better make sure to catch all the places the String &quot;lastName&quot; is referenced. So you could replace code like this:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
session.createCriteria(Person.class)
	.add(Restrictions.eq(&quot;lastName&quot;, &quot;Smith&quot;))
	.addOrder(Order.asc(&quot;firstName&quot;))
	.list();
&lt;/pre&gt;

&lt;p&gt;with this using method/property handles:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
session.createCriteria(Person.class)
	.add(Restrictions.eq(Person.lastName, &quot;Smith&quot;))
	.addOrder(Order.asc(Person.firstName))
	.list();
&lt;/pre&gt;

&lt;p&gt;Now the code is strongly-typed and refactoring-safe. JPA 2.0 tries mightily to overcome having strings in the new criteria query API with its metamodel. But I find it pretty much appalling to even look at, what with having to create or code-generate a separate &quot;metamodel&quot; class which you reference like &quot;_Person.lastName&quot; or some similar awful way. This metamodel class lives only to represent properties on your real model object for the sole purpose of making JPA 2.0 criteria queries strongly typed. It just isn&apos;t worth it and is total overkill. In fact, it reminds me of the bad old days of rampant over-engineering in Java (which apparently is still alive and well in many circles, but I try to avoid it as best I can). The right thing is to fix the language, not to invent something that adds yet more boilerplate and more complexity to an already overcomplicated ecosystem.&lt;/p&gt;

&lt;p&gt;Method handles could also be used to make calling methods using reflection much cleaner than it currently is, among other things. Similarly it would make accessing properties via reflection easier and cleaner. And with only unchecked exceptions you would not need to catch the four or five kinds of exceptions reflective code can throw.&lt;/p&gt;
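&lt;p&gt;As a concrete illustration of the current state, here is a sketch of reading a &quot;property&quot; reflectively today, along with the pile of checked exceptions that must be caught just to invoke a method (the &lt;code&gt;getProperty&lt;/code&gt; helper is hypothetical):&lt;/p&gt;

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class ReflectionExample {

    // Invoke a no-argument method by name. Note the checked exceptions
    // that reflective code forces you to catch or declare.
    static Object getProperty(Object target, String methodName) {
        try {
            Method method = target.getClass().getMethod(methodName);
            return method.invoke(target);
        } catch (NoSuchMethodException e) {
            throw new RuntimeException(e);
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        } catch (InvocationTargetException e) {
            throw new RuntimeException(e.getCause());
        }
    }

    public static void main(String[] args) {
        System.out.println(getProperty("hello", "length"));
    }
}
```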

&lt;h4&gt;String Interpolation&lt;/h4&gt;

&lt;p&gt;String interpolation is like the sorbet that you get at fancy restaurants to cleanse your palate. This would seem to be a no-brainer to add. You could make code like:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
log.error(&quot;The object of type  [&quot;
    + foo.getClass().getName()
    + &quot;] and identifier [&quot;
    + foo.getId()
    + &quot;] does not exist.&quot;, cause);
&lt;/pre&gt;

&lt;p&gt;turn into this much more palatable version (using the native property syntax I mentioned earlier):&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
log.error(&quot;The object of type [${foo.class.name}] and identifier [${foo.id}] does not exist.&quot;, cause);
&lt;/pre&gt;
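&lt;p&gt;Short of real interpolation, the closest plain Java gets today is &lt;code&gt;String.format&lt;/code&gt; with positional placeholders; a sketch (the &lt;code&gt;describeMissing&lt;/code&gt; helper is just for illustration):&lt;/p&gt;

```java
public class FormatExample {

    // The nearest current approximation to interpolation: positional
    // %s placeholders instead of embedded ${...} expressions.
    static String describeMissing(Object foo, Object id) {
        return String.format(
                "The object of type [%s] and identifier [%s] does not exist.",
                foo.getClass().getName(), id);
    }

    public static void main(String[] args) {
        System.out.println(describeMissing("some object", 42L));
    }
}
```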

&lt;h4&gt;Type Inference&lt;/h4&gt;

&lt;p&gt;I&apos;d also suggest adding type inference, if only for local variables like C# does. Why do we have to repeat ourselves? Instead of writing:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
Person person = new Person();
&lt;/pre&gt;

&lt;p&gt;why can&apos;t we just write:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
var person = new Person();
&lt;/pre&gt;

&lt;p&gt;I have to believe the compiler and all the tools are smart enough to infer the type from the &quot;new Person()&quot;. Especially since other strongly-typed JVM languages like Scala do exactly this kind of thing.&lt;/p&gt;

&lt;h4&gt;Eliminate &quot;new&quot;&lt;/h4&gt;

&lt;p&gt;Last but not least, and actually not the last thing I can think of but definitely the last I&apos;m writing about here, let&apos;s get rid of the &quot;new&quot; keyword and either go with Ruby&apos;s new &lt;i&gt;method&lt;/i&gt; or Python&apos;s constructor syntax, like so:&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
// Ruby-like new method
var person = Person.new()

// or Python-like construction
var person = Person()
&lt;/pre&gt;

&lt;p&gt;This one came to me recently after hearing &lt;a href=&quot;http://en.wikipedia.org/wiki/Bruce_Eckel&quot;&gt;Bruce Eckel&lt;/a&gt; give an excellent talk on language evolution and archaeology. He had a ton of really interesting examples of why things are the way they are, and how Java and other languages like C++ evolved from C. One example was the reason for &quot;new&quot; in Java. In C++ you can allocate objects on the stack or the heap, so there is a stack-based constructor syntax that does not use &quot;new&quot; while the heap-based constructor syntax uses the &quot;new&quot; operator. Even though Java only has heap-based object allocation, it retained the &quot;new&quot; keyword, which is not only boilerplate code but also makes the entire process of object construction inflexible: you cannot change anything about it, nor can you easily add hooks into the object creation process.&lt;/p&gt;

&lt;p&gt;I am not an expert at all in the low-level details, and Bruce obviously knows what he is &lt;a href=&quot;http://www.amazon.com/Thinking-C-2-Practical-Programming/dp/0130353132/&quot;&gt;talking&lt;/a&gt; &lt;a href=&quot;http://www.amazon.com/Thinking-Java-4th-Bruce-Eckel/dp/0131872486/&quot;&gt;about&lt;/a&gt; way more than I do, but I can say that I believe the Ruby and Python syntaxes are not only nicer but more internally consistent, especially in the Ruby case because there is no special magic or sauce going on. In Ruby, new is just a method, on a class, just like everything else.&lt;/p&gt;

&lt;h3&gt;Conclusion to this Way Too Long Blog Entry&lt;/h3&gt;

&lt;p&gt;I did not actually set out to write a blog entry whose length is worthy of a &lt;a href=&quot;http://blogs.tedneward.com/&quot;&gt;Ted Neward&lt;/a&gt; blog. It just turned out that way. (And I do in fact like reading Ted&apos;s long blogs!) Plus, I found out that &lt;a href=&quot;http://en.wikipedia.org/wiki/Speculative_fiction&quot;&gt;speculative fiction&lt;/a&gt; can be pretty fun to write, since I don&apos;t think any of these things are going to make it into Java anytime soon, if ever, and I&apos;m sure there are lots of people in the Java world who hate things like Ruby and won&apos;t agree anyway.&lt;/p&gt;
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/several_must_have_firebug_related</guid>
    <title>Several Must Have Firebug-Related Firefox Extensions</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/several_must_have_firebug_related</link>
        <pubDate>Mon, 28 Sep 2009 12:54:20 +0000</pubDate>
    <category>Development</category>
    <category>firebug</category>
    <category>firefox</category>
            <description>&lt;p&gt;Last week while doing the usual (web development stuff) I discovered a few Firefox extensions I didn&apos;t even know I was missing until I found them by accident. The &quot;accident&quot; happened while adding Firebug to a Firefox that was running in a VMWare Fusion Windows virtual machine on which I was testing in, gasp, Windows. I went to find add-ons and searched for Firebug. And up came not only &lt;a href=&quot;http://getfirebug.com/&quot;&gt;Firebug&lt;/a&gt; but also results for &lt;a href=&quot;http://www.softwareishard.com/blog/firecookie/&quot;&gt;Firecookie&lt;/a&gt;, &lt;a href=&quot;http://robertnyman.com/firefinder/&quot;&gt;Firefinder&lt;/a&gt;, &lt;a href=&quot;http://robertnyman.com/inline-code-finder/&quot;&gt;Inline Code Finder for Firebug&lt;/a&gt;, and &lt;a href=&quot;http://tools.sitepoint.com/codeburner/firefox&quot;&gt;CodeBurner for Firebug&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Of course everyone doing web development uses Firebug (or really should anyway) since it rules. But these other extensions provide some &lt;i&gt;really&lt;/i&gt; nice functionality and complement Firebug perfectly. Here&apos;s a quick run down:&lt;/p&gt;

&lt;h3&gt;Firecookie&lt;/h3&gt;

&lt;p&gt;Firecookie lets you see all the cookies for a site, add new ones, remove existing cookies, etc. It gives useful information about each cookie like the name, value, raw value (if URI-encoded), domain, size, path, expiration, and security. Very cool.&lt;/p&gt;

&lt;style type=&quot;text/css&quot;&gt;
img.border {
   margin-bottom: 10px;
   border: 1px solid #021a40;
}
&lt;/style&gt;

&lt;img class=&quot;border&quot; src=&quot;http://www.sleberknight.com/blog/sleberkn/resource/firecookie.png&quot; alt=&quot;Firecookie Firefox Add-On&quot; title=&quot;Firecookie Firefox Add-On&quot;/&gt;

&lt;h3&gt;Firefinder&lt;/h3&gt; 

&lt;p&gt;Firefinder for Firebug lets you search for elements on a page using either CSS expressions or an XPath query. In the list of matching elements, you can expand each result, inspect the element by clicking the &quot;Inspect&quot; link, or click &quot;FriendlyFire&quot; which will copy the content you&apos;re looking at and post it up to &lt;a href=&quot;http://jsbin.com/&quot;&gt;JS Bin&lt;/a&gt;. (Be careful with this one if you have code you&apos;d rather not have going up over the wire to a different web site.) Firefinder also puts a dashed border around each matching element it found. As you hover over search results, it highlights the matching element in the page. This is really useful when you want to find all elements matching a CSS expression or when you&apos;d like to use XPath to find specific elements. Nice.&lt;/p&gt;

&lt;img class=&quot;border&quot; src=&quot;http://www.sleberknight.com/blog/sleberkn/resource/firefinder.png&quot; alt=&quot;Firefinder Firefox Add-On&quot; title=&quot;Firefinder Firefox Add-On&quot;/&gt;

&lt;h3&gt;Inline Code Finder for Firebug&lt;/h3&gt;

&lt;p&gt;The Inline Code Finder does just that. It finds inline CSS styles, JavaScript links, and inline events, and reports the number of each in its results pane. Even better, it highlights each of these problems on the page you are viewing with a thick red border, and as you hover over them it shows you what the problem is in a nice tooltip. This is really helpful for writing more unobtrusive JavaScript and avoiding inline styles. For older sites, or sites that weren&apos;t designed with &quot;unobtrusivity&quot; in mind, though, be warned that there might be a lot of red on the page!&lt;/p&gt;

&lt;img class=&quot;border&quot; src=&quot;http://www.sleberknight.com/blog/sleberkn/resource/inlinecodefinder.png&quot; alt=&quot;Inline Code Finder Firefox Add-On&quot; title=&quot;Inline Code Finder Firefox Add-On&quot;/&gt;

&lt;h3&gt;CodeBurner for Firebug&lt;/h3&gt;

&lt;p&gt;CodeBurner for Firebug provides an inline HTML and CSS reference within Firebug. It allows you to search for HTML elements or CSS styles and shows a definition and an example. It also provides links to the awesome &lt;a href=&quot;http://www.sitepoint.com/&quot;&gt;Sitepoint&lt;/a&gt; reference and even to the Sitepoint live demos of the feature you are learning about. It is so unbelievably useful to have HTML and CSS references directly within Firebug that it isn&apos;t even funny. Thanks, Sitepoint.&lt;/p&gt;

&lt;img class=&quot;border&quot; src=&quot;http://www.sleberknight.com/blog/sleberkn/resource/codeburner.png&quot; alt=&quot;CodeBurner Firefox Add-On&quot; title=&quot;CodeBurner Firefox Add-On&quot;/&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/sorting_collections_in_hibernate_using</guid>
    <title>Sorting Collections in Hibernate Using SQL in @OrderBy</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/sorting_collections_in_hibernate_using</link>
        <pubDate>Tue, 15 Sep 2009 12:40:00 +0000</pubDate>
    <category>Development</category>
    <category>hibernate</category>
    <category>database</category>
    <category>java</category>
            <description>&lt;p&gt;When you have collections of associated objects in domain objects, you generally want to specify some kind of default sort order. For example, suppose I have domain objects &lt;code&gt;Timeline&lt;/code&gt; and &lt;code&gt;Event&lt;/code&gt;:&lt;/p&gt;

&lt;style type=&quot;text/css&quot;&gt;
.code {
background-color:#EFEFEF;
border:1px solid #CCCCCC;
font-size:small;
overflow:auto;
padding:5px;
}
&lt;/style&gt;

&lt;pre class=&quot;code&quot;&gt;
@Entity
class Timeline {

    @Required 
    String description

    @OneToMany(mappedBy = &quot;timeline&quot;)
    @javax.persistence.OrderBy(&quot;startYear, endYear&quot;)
    Set&amp;lt;Event&amp;gt; events
}

@Entity
class Event {

    @Required
    Integer startYear

    Integer endYear

    @Required
    String description

    @ManyToOne
    Timeline timeline
}
&lt;/pre&gt;

&lt;p&gt;In the above example I&apos;ve used the standard JPA (Java Persistence API) &lt;code&gt;@OrderBy&lt;/code&gt; annotation, which allows you to specify the order of a collection of objects via object properties, in this example a &lt;code&gt;@OneToMany&lt;/code&gt; association. I&apos;m ordering first by &lt;code&gt;startYear&lt;/code&gt; in ascending order and then by &lt;code&gt;endYear&lt;/code&gt;, also in ascending order. This is all well and good, but note that I&apos;ve specified that only the start year is required. (The &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/20070928&quot;&gt;@Required&lt;/a&gt; annotation is a custom Hibernate Validator annotation which does exactly what you would expect.) How are the events ordered when you have several events that start in the same year but some of them have no end year? The answer is that it depends on how your database sorts null values by default. Under Oracle 10g nulls will come last. For example, if two events both start in 2001 and one of them has no end year, here is how they are ordered:&lt;/p&gt;

&lt;pre class=&quot;code&quot;&gt;
2001 2002  Some event
2001 2003  Other event
2001       Event with no end year
&lt;/pre&gt;

&lt;p&gt;What if you want to control how null values are ordered so they come first rather than last? In Hibernate there are several ways you could do this. First, you could use the Hibernate-specific &lt;code&gt;@Sort&lt;/code&gt; annotation to perform in-memory (i.e. not in the database) sorting, using natural sorting or sorting using a &lt;code&gt;Comparator&lt;/code&gt; you supply. For example, assume I have an &lt;code&gt;EventComparator&lt;/code&gt; helper class that implements &lt;code&gt;Comparator&lt;/code&gt;. I could change &lt;code&gt;Timeline&lt;/code&gt;&apos;s collection of events to look like this:&lt;/p&gt;

&lt;pre class=&quot;code&quot;&gt;
@OneToMany(mappedBy = &quot;timeline&quot;)
@org.hibernate.annotations.Sort(type = SortType.COMPARATOR, comparator = EventComparator.class)
Set&amp;lt;Event&amp;gt; events
&lt;/pre&gt;
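&lt;p&gt;For reference, a hypothetical &lt;code&gt;EventComparator&lt;/code&gt; implementing a nulls-first ordering in memory might look something like this (with a minimal stand-in for the Event entity to keep the sketch self-contained):&lt;/p&gt;

```java
import java.util.Comparator;

// A sketch of the in-memory sort: order by startYear ascending, then by
// endYear ascending with null end years sorted first.
public class EventComparator implements Comparator<EventComparator.Event> {

    // Minimal stand-in for the Event entity from the mapping above
    static class Event {
        final Integer startYear;
        final Integer endYear;

        Event(Integer startYear, Integer endYear) {
            this.startYear = startYear;
            this.endYear = endYear;
        }
    }

    public int compare(Event a, Event b) {
        int byStart = a.startYear.compareTo(b.startYear);
        if (byStart != 0) {
            return byStart;
        }
        if (a.endYear == null) {
            return (b.endYear == null) ? 0 : -1;  // nulls first
        }
        if (b.endYear == null) {
            return 1;
        }
        return a.endYear.compareTo(b.endYear);
    }

    public static void main(String[] args) {
        EventComparator c = new EventComparator();
        System.out.println(c.compare(new Event(2001, null), new Event(2001, 2002)));
    }
}
```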

&lt;p&gt;Using &lt;code&gt;@Sort&lt;/code&gt; will perform sorting in-memory once the collection has been retrieved from the database. While you can certainly do this and implement arbitrarily complex sorting logic, it&apos;s probably better to sort in the database when you can. So we now turn to &lt;i&gt;Hibernate&apos;s&lt;/i&gt; &lt;code&gt;@OrderBy&lt;/code&gt; annotation, which lets you specify a &lt;i&gt;SQL fragment&lt;/i&gt; describing how to perform the sort. For example, you can change the events mapping to:&lt;/p&gt;

&lt;pre class=&quot;code&quot;&gt;
@OneToMany(mappedBy = &quot;timeline&quot;)
@org.hibernate.annotations.OrderBy(&quot;start_year, end_year&quot;)
Set&amp;lt;Event&amp;gt; events
&lt;/pre&gt;

&lt;p&gt;This sort order is the same as using the JPA &lt;code&gt;@OrderBy&lt;/code&gt; with the &quot;startYear, endYear&quot; sort order. But since you write actual SQL in Hibernate&apos;s &lt;code&gt;@OrderBy&lt;/code&gt; you can take advantage of whatever features your database has, at the possible expense of portability across databases. As an example, Oracle 10g supports a syntax like &quot;order by start_year, end_year nulls first&quot; to order null end years before non-null end years. You could also say &quot;order by start_year, end_year nulls last&quot; which sorts null end years last, as you would expect. This syntax is probably not portable, so another trick you can use is the NVL function, which is supported in a bunch of databases. You can rewrite &lt;code&gt;Timeline&lt;/code&gt;&apos;s collection of events like so:&lt;/p&gt;

&lt;pre class=&quot;code&quot;&gt;
@OneToMany(mappedBy = &quot;timeline&quot;)
@org.hibernate.annotations.OrderBy(&quot;start_year, nvl(end_year, start_year)&quot;)
Set&amp;lt;Event&amp;gt; events
&lt;/pre&gt;

&lt;p&gt;The expression &quot;nvl(end_year, start_year)&quot; simply says to use &lt;code&gt;end_year&lt;/code&gt; as the sort value if it is not null, and &lt;code&gt;start_year&lt;/code&gt; if it is null. So for sorting purposes you end up treating &lt;code&gt;end_year&lt;/code&gt; the same as &lt;code&gt;start_year&lt;/code&gt; when &lt;code&gt;end_year&lt;/code&gt; is null. In the contrived example earlier, applying the nvl-based sort using Hibernate&apos;s &lt;code&gt;@OrderBy&lt;/code&gt; to specify SQL sorting criteria, you now end up with the events sorted like this:&lt;/p&gt;

&lt;pre class=&quot;code&quot;&gt;
2001       Event with no end year
2001 2002  Some event
2001 2003  Other event
&lt;/pre&gt;

&lt;p&gt;Which is what you wanted in the first place. So if you need more complex sorting logic than what you can get out of the standard JPA &lt;code&gt;@javax.persistence.OrderBy&lt;/code&gt;, try one of the Hibernate sorting options, either &lt;code&gt;@org.hibernate.annotations.Sort&lt;/code&gt; or &lt;code&gt;@org.hibernate.annotations.OrderBy&lt;/code&gt;. Adding a SQL fragment into your domain class isn&apos;t necessarily the most &lt;i&gt;elegant&lt;/i&gt; thing in the world, but it might be the most &lt;i&gt;pragmatic&lt;/i&gt; thing.&lt;/p&gt;
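&lt;p&gt;As an aside, if you ever do need this null handling &lt;i&gt;in memory&lt;/i&gt; (for example when sorting with a custom comparator via Hibernate&apos;s &lt;code&gt;@Sort&lt;/code&gt;), the equivalent of the nvl-based ordering is straightforward to express in Java. This is just an illustrative sketch; the stripped-down &lt;code&gt;Event&lt;/code&gt; class here is an assumption, not the real mapped entity:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class EventNvlSort {

    // Stripped-down stand-in for the mapped Event entity (assumed fields).
    static class Event {
        final Integer startYear;
        final Integer endYear; // null when the event has no end year

        Event(Integer startYear, Integer endYear) {
            this.startYear = startYear;
            this.endYear = endYear;
        }
    }

    // Mirrors "order by start_year, nvl(end_year, start_year)": a null
    // end_year sorts as if it were equal to start_year.
    static final Comparator<Event> BY_START_THEN_NVL_END =
            Comparator.<Event, Integer>comparing(e -> e.startYear)
                      .thenComparing(e -> e.endYear != null ? e.endYear : e.startYear);

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(2001, 2003));  // Other event
        events.add(new Event(2001, 2002));  // Some event
        events.add(new Event(2001, null));  // Event with no end year
        events.sort(BY_START_THEN_NVL_END);
        // The event with the null end year now sorts first, as with nvl.
        for (Event e : events) {
            System.out.println(e.startYear + " " + (e.endYear == null ? "" : e.endYear));
        }
    }
}
```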
</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/groovification</guid>
    <title>Groovification</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/groovification</link>
        <pubDate>Mon, 4 May 2009 17:25:40 +0000</pubDate>
    <category>Development</category>
    <category>gmaven</category>
    <category>groovy</category>
    <category>java</category>
            <description>&lt;p&gt;Last week I tweeted about groovification, which is defined thusly:&lt;/p&gt;

&lt;p&gt;&lt;i&gt;groovification.&lt;/i&gt; noun. the process of converting java source code into groovy source code (usually done to make development more fun)&lt;/p&gt;

&lt;p&gt;On my main day-to-day project, we&apos;ve been writing unit tests in Groovy for quite a while now, and recently we decided to start implementing new code in Groovy rather than Java. The reason for doing this is to gain more flexibility in development, to make testing easier (i.e. in terms of the ability to mock dependencies in a trivial fashion), to eliminate a lot of Java boilerplate code and thus write less code, and of course to make development more fun. It&apos;s not that I hate Java; it&apos;s that Java simply isn&apos;t innovating anymore, and hasn&apos;t for a while. It isn&apos;t adding features I no longer want to live without, such as closures and the ability to do metaprogramming when I need to, and it isn&apos;t removing features that I don&apos;t want, such as checked exceptions. If I know, for a fact, that I can handle an exception, I&apos;ll handle it appropriately. Otherwise, when there&apos;s nothing I can do anyway, I want to let the damn thing propagate up and just show a generic error message to the user, log the error, and send the admin team an email with the problem details.&lt;/p&gt;

&lt;p&gt;This being, for better or worse, a Maven project, we&apos;ve had some interesting issues with mixed compilation of Java and Groovy code. The &lt;a href=&quot;http://groovy.codehaus.org/&quot;&gt;GMaven plugin&lt;/a&gt; is easy to install and works well, but it currently has some outstanding issues with Groovy stub generation; specifically, it cannot properly handle &lt;a href=&quot;http://jira.codehaus.org/browse/MGROOVY-108&quot;&gt;generics&lt;/a&gt; or &lt;a href=&quot;http://jira.codehaus.org/browse/MGROOVY-109&quot;&gt;enums&lt;/a&gt;. (Maybe someone will be less lazy than me and help them fix it instead of complaining about it.) Since many of our classes use generics, e.g. in service classes that return domain objects, we are currently not generating stubs. We&apos;ll convert existing classes and any other necessary dependencies to Groovy as we make updates to Java classes, and we are implementing new code in Groovy. Especially in the web controller code, this becomes trivial since the controllers generally depend on other Java and/or Groovy code, but no other classes depend on the controllers. So starting in the web tier seems to be a good choice. Groovy, combined with implementing controllers using the Spring @MVC annotation-based controller configuration style (i.e. no XML configuration), is making the controllers &lt;i&gt;really&lt;/i&gt; thin, lightweight, and easy to read, implement, and test.&lt;/p&gt;

&lt;p&gt;I estimate it will take a while to fully convert all the existing Java code to Groovy code. The point here is that we are doing it piecemeal rather than trying to do it all at once. Also, whenever we convert a Java file to a Groovy one, there are a few basics to make the classes Groovier without going totally overboard and spending loads of time. Once you&apos;ve used &lt;a href=&quot;http://www.jetbrains.com/idea/&quot;&gt;IntelliJ&apos;s&lt;/a&gt; move refactoring to move the .java file to the Groovy source tree (since we have src/main/java and src/main/groovy), you can use IntelliJ&apos;s handy-dandy &quot;Rename to Groovy&quot; refactoring. In IntelliJ 8.1 you need to use the &quot;Search - Find Action&quot; menu option or keystroke, type &quot;Rename to...&quot;, and select &quot;Rename to Groovy&quot;, since they goofed in version 8 and that option was left off a menu somehow. Once that&apos;s done you can do a few simple things to make the class a bit more groovy. First, get rid of all the semicolons. Second, replace getter/setter code with direct property access. Third, replace for loops with &quot;each&quot;-style internal iterators when you don&apos;t need the loop index, and with &quot;eachWithIndex&quot; when you do. Finally, get rid of redundant modifiers like the &quot;public&quot; on &quot;public class&quot;, since public is the Groovy default. That&apos;s not too much at once, doesn&apos;t take long, and makes your code Groovier. Over time you can do more groovification if you like.&lt;/p&gt;

&lt;p&gt;The most common gotchas I&apos;ve found have to do with code that uses anonymous or inner classes, since Groovy doesn&apos;t support those Java language features. In that case you can either make a non-public named class (which, unlike in Java, is OK to put in the same Groovy file as long as it&apos;s not public) or you can refactor the code some other way (using your creativity and expertise, since we are not &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/thinking_matters&quot;&gt;monkeys&lt;/a&gt;, right?). This can sometimes be a pain, especially if you are using a lot of them. So it goes. (And yes, that is a &lt;a href=&quot;http://en.wikipedia.org/wiki/Slaughterhouse-Five&quot;&gt;Slaughterhouse Five&lt;/a&gt; reference.)&lt;/p&gt;
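&lt;p&gt;To make the anonymous class gotcha concrete, here is an illustrative sketch (plain Java, not code from our project) of the kind of idiom that needs the named-class refactoring before the file can become Groovy:&lt;/p&gt;

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class AnonymousClassGotcha {

    // The common Java idiom: an anonymous Comparator. Groovy (at the time
    // of this post) cannot compile this, so conversion needs a refactoring.
    static List<String> sortJavaStyle(List<String> names) {
        Collections.sort(names, new Comparator<String>() {
            public int compare(String a, String b) {
                return a.length() - b.length();
            }
        });
        return names;
    }

    // The Groovy-friendly refactoring: a named, non-public class, which can
    // live in the same Groovy file as long as it isn't public.
    static class ByLength implements Comparator<String> {
        public int compare(String a, String b) {
            return a.length() - b.length();
        }
    }

    static List<String> sortRefactored(List<String> names) {
        Collections.sort(names, new ByLength());
        return names;
    }
}
```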

&lt;p&gt;Happy groovification!&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/thinking_matters</guid>
    <title>Thinking Matters</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/thinking_matters</link>
        <pubDate>Thu, 30 Apr 2009 16:10:30 +0000</pubDate>
    <category>Development</category>
    <category>programming</category>
    <category>cylon</category>
    <category>thinking</category>
    <category>logic</category>
            <description>&lt;p&gt;Aside from the fact that &lt;a href=&quot;http://www.forbes.com/2009/04/29/java-oracle-sun-technology-internet-infrastructure-java.html&quot;&gt;Oracle&apos;s Java Problem&lt;/a&gt; contains all kinds of factual and other errors (see the comments on the post) this sentence caught my eye in particular when referring to Java being &quot;quite hard to work with&quot; - &quot;Then, as now, you needed to be a highly trained programmer to make heads or tails of the language.&quot;&lt;/p&gt;

&lt;p&gt;What&apos;s the issue here? That Java is hard to work with? Perhaps more specifically, not just Java but perhaps the artificial complexity in developing &quot;Enterprise&quot; applications in Java? Nope. The problem is that this type of thinking epitomizes the attitude that business people and other &quot;professionals&quot; tend to have about software development in general, in that they believe it is or should be easy and that it is always the tools and rogue programmers that are the problem. Thus, with more and better tools, they reason, there won&apos;t be a need for skilled developers and the monkey-work of actually programming could be done by, well, &lt;a href=&quot;http://www.newtechusa.com/PPI/main.asp&quot;&gt;monkeys&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I believe software development is one of the hardest activities humans currently do, and yes I suppose I do have some bias since I am a developer. Also contrary to what many people think, there is both art and engineering involved, and any given problem can be solved in an almost infinite variety of ways. Unlike more established disciplines that have literally been around for hundreds or thousands of years (law, medicine, accounting, architecture, certain branches of engineering like civil, etc.), the software industry hasn&apos;t even reached the century mark yet! As a result there isn&apos;t any kind of consensus whatsoever about a completely standardized &quot;body of knowledge&quot; and thus there isn&apos;t an industry-recognized set of standard exams and boards like you find in the medical and law professions for example. (That topic is for a future post.)&lt;/p&gt;

&lt;p&gt;One thing that is certain is that software development involves logic, and thus people who can solve problems using logic will always be needed, whether the primary medium stays in textual format (source code) or whether it evolves into some different representation like &lt;a href=&quot;http://martinfowler.com/bliki/IntentionalSoftware.html&quot;&gt;Intentional Software&lt;/a&gt; is trying to do. So the statement from the article that &quot;you needed to be a highly trained programmer to make heads or tails of the language&quot; is always going to be true in software development. More generally, highly skilled people are needed in any complex endeavor, and attempts to dumb down complex things will likely not succeed in any area, not just software development. Would you trust someone to perform surgery on you so long as they have a &quot;Dummies Guide to Surgery&quot; book? Or someone to represent you in court who stayed at a Holiday Inn Express last night?&lt;/p&gt;

&lt;p&gt;I hypothesize that things are becoming more complex as time moves on, not less. I also propose that unless we actually succeed in building &lt;a href=&quot;http://www.scifi.com/battlestar/&quot;&gt;Cylons&lt;/a&gt; who end up wiping us all out or enslaving us, we will never reach a point where we don&apos;t need people to actually think and use logic to solve problems. So even though many business-types would love to be able to hire a bunch of monkeys and pay them $0.01 per day to develop software, those who actually realize that highly skilled people are an asset and help their bottom line, and treat them as such, are the ones who will come out on top, because they will smash their competitors who think of software/IT purely as a cost center and not a profit center.&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/running_visualvm_on_a_32</guid>
    <title>Running VisualVM on a 32-bit Macbook Pro</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/running_visualvm_on_a_32</link>
        <pubDate>Wed, 1 Apr 2009 11:03:44 +0000</pubDate>
    <category>Development</category>
    <category>visualvm</category>
    <category>osx</category>
    <category>macbook</category>
    <category>java</category>
            <description>&lt;p&gt;If you want/need to run &lt;a href=&quot;https://visualvm.dev.java.net/&quot;&gt;VisualVM&lt;/a&gt; on a &lt;i&gt;32-bit&lt;/i&gt; Macbook Pro you&apos;ll need to do a couple of things. First, download and install Soy Latte, using &lt;a href=&quot;http://landonf.bikemonkey.org/static/soylatte/&quot;&gt;these instructions&lt;/a&gt; - this gets you a Java 6 JDK/JRE on your 32-bit Macbook Pro. Second, download VisualVM and extract it wherever, e.g. /usr/local/visualvm. If you now try to run VisualVM you&apos;ll get the following error message:&lt;/p&gt;

&lt;pre class=&quot;code&quot;&gt;
$ ./visualvm
./..//platform9/lib/nbexec: line 489: /System/Library/Frameworks/JavaVM.framework/
Versions/1.6/Home/bin/java: Bad CPU type in executable
&lt;/pre&gt;

&lt;p&gt;Oops. After looking at the bin/visualvm script I noticed it is looking for an environment variable named &quot;jdkhome&quot;. So the third step is to export &apos;jdkhome&apos; so it points to wherever you installed Soy Latte:&lt;/p&gt;

&lt;pre class=&quot;code&quot;&gt;
export jdkhome=/usr/local/soylatte16-i386-1.0.3
&lt;/pre&gt;

&lt;p&gt;Now run the bin/visualvm script from the command line. Oh, almost forgot to mention that you should also have &lt;a href=&quot;http://developer.apple.com/opensource/tools/X11.html&quot;&gt;X11&lt;/a&gt; installed, which it will be by default on Mac OS X Leopard. Now if all went well, you should have VisualVM up and running!&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/missing_aop_target_packages_in</guid>
    <title>Missing aop &apos;target&apos; packages in Spring 3.0.0.M1 zip file</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/missing_aop_target_packages_in</link>
        <pubDate>Thu, 15 Jan 2009 18:46:43 +0000</pubDate>
    <category>Development</category>
    <category>soylatte</category>
    <category>spring</category>
    <category>openjdk</category>
    <category>ivy</category>
            <description>&lt;p&gt;Today I was mucking around with the Spring 3.0.0.M1 source release I downloaded as a ZIP file. I wanted to simply get the sample PetClinic up and running and be able to load Spring as a project in IntelliJ. Note Spring now requires Java 6 to build, so if you&apos;re using an older 32-bit Macbook Pro you&apos;ll need to install JDK 6. I used &lt;a href=&quot;http://landonf.bikemonkey.org/static/soylatte/&quot;&gt;these instructions&lt;/a&gt; generously provided by Landon Fuller to install Soy Latte, which is a Java 6 port for Mac OS X (Tiger and Leopard). So I went to run the &quot;ant jar package&quot; command (after first setting up &lt;a href=&quot;http://ant.apache.org/ivy/&quot;&gt;Ivy&lt;/a&gt; since that is how Spring now manages dependencies) and everything went well until I got a compilation exception. There unfortunately wasn&apos;t any nice error message about why the compile failed.&lt;/p&gt;

&lt;p&gt;So next I loaded up the Spring project in IntelliJ and tried to compile from there. Aha! It tells me that the org.springframework.aop.target package is missing as well as the org.springframework.aop.framework.autoproxy.target package, and of course all the classes in those packages were also missing. I was fairly sure I didn&apos;t accidentally delete those two packages in the source code, so I checked the spring-framework-3.0.0.M1.zip file to be sure. Sure enough those two &apos;target&apos; packages are not present in the source code in the zip file. The resolution is to go grab the missing files from the Spring 3.0.0.M1 subversion &lt;a href=&quot;https://src.springframework.org/svn/spring-framework/tags/spring-framework-3.0.0.M1/&quot;&gt;repository&lt;/a&gt; and put them in the correct place in the source tree. The better resolution is to do an export of the 3.0.0.M1 tag from the Subversion repo directly, rather than be lazy like I was and download the zip file.&lt;/p&gt;

&lt;p&gt;I still am wondering why the &apos;target&apos; packages were missing, however. My guess is that whatever build process builds the zip file for distribution excluded directories named &apos;target&apos; since &apos;target&apos; is a common output directory name in build systems like Ant and Maven and usually should be excluded since it contains generated artifacts. If that assumption is correct and &lt;i&gt;all&lt;/i&gt; directories named &apos;target&apos; were excluded, then unfortunately the two aop subpackages named &apos;target&apos; got mistakenly excluded which caused a bit of head-scratching as to why Spring wouldn&apos;t compile.&lt;/p&gt;
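&lt;p&gt;To illustrate the guess, a &lt;i&gt;hypothetical&lt;/i&gt; Ant packaging fragment (not Spring&apos;s actual build file) with a blanket exclude would look something like this, and would happily swallow any source package named &apos;target&apos; along with the build output:&lt;/p&gt;

```xml
<!-- Hypothetical distribution packaging; illustrative only. -->
<zip destfile="spring-framework-3.0.0.M1.zip">
    <fileset dir="${checkout.dir}">
        <!-- Intended to strip Ant/Maven build output, but this pattern also
             matches source dirs like org/springframework/aop/target and
             org/springframework/aop/framework/autoproxy/target. -->
        <exclude name="**/target/**"/>
    </fileset>
</zip>
```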

</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/groovy_spring_groovier_spring</guid>
    <title>Groovy + Spring = Groovier Spring</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/groovy_spring_groovier_spring</link>
        <pubDate>Tue, 6 Jan 2009 23:50:46 +0000</pubDate>
    <category>Development</category>
    <category>java</category>
    <category>dynamic</category>
    <category>spring</category>
    <category>groovy</category>
            <description>&lt;p&gt;If you&apos;re into Groovy and Spring, check out my &lt;a href=&quot;http://www.ibm.com/developerworks/views/java/libraryview.jsp?search_by=groovier+spring&quot;&gt;two-part&lt;/a&gt; series on &lt;a href=&quot;http://www.ibm.com/developerworks/&quot;&gt;IBM developerWorks&lt;/a&gt; on using Groovy together with Spring&apos;s dynamic language support for potentially more flexible (and interesting) applications. In &lt;a href=&quot;http://www.ibm.com/developerworks/java/library/j-groovierspring1.html&quot;&gt;Part 1&lt;/a&gt; I show how to easily integrate Groovy &lt;i&gt;scripts&lt;/i&gt; (i.e. .groovy files containing one or more classes) into Spring-based applications. In &lt;a href=&quot;http://www.ibm.com/developerworks/java/library/j-groovierspring2.html&quot;&gt;Part 2&lt;/a&gt; I show how to use the &quot;refreshable beans&quot; feature in Spring to automatically and transparently reload Spring beans implemented in Groovy from pretty much anywhere including a relational database, and why you might actually are to do something like that!&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">http://www.sleberknight.com/blog/sleberkn/entry/iphone_bootcamp_summary</guid>
    <title>iPhone Bootcamp Summary</title>
    <dc:creator>sleberkn</dc:creator>
    <link>http://www.sleberknight.com/blog/sleberkn/entry/iphone_bootcamp_summary</link>
        <pubDate>Fri, 5 Dec 2008 17:08:33 +0000</pubDate>
    <category>Development</category>
    <category>nerd</category>
    <category>apple</category>
    <category>big</category>
    <category>cocoa</category>
    <category>ranch</category>
    <category>bootcamp</category>
    <category>iphone</category>
            <description>&lt;p&gt;So, after having actually written a blog entry covering &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/iphone_bootcamp_logs&quot;&gt;each day&lt;/a&gt; of the &lt;a href=&quot;http://www.bignerdranch.com/classes/iphone.shtml&quot;&gt;iPhone bootcamp&lt;/a&gt; at &lt;a href=&quot;http://www.bignerdranch.com/&quot;&gt;Big Nerd Ranch&lt;/a&gt;, I thought a more broad summary would be in order. (That, and I&apos;m sitting in the airport waiting for my flight this evening.) Anyway, the iPhone bootcamp was my second BNR class (I took the Cocoa bootcamp last April and wrote a summary blog about it &lt;a href=&quot;http://www.sleberknight.com/blog/sleberkn/entry/20080410&quot;&gt;here&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;As with the Cocoa bootcamp, I had a great time and learned a ton about iPhone development. I met a lot of really cool and interesting people with a wide range of backgrounds and experiences. This seems to be a trend at BNR, that the people who attend are people who have a variety of knowledge and experience, and bring totally different perspectives to the class. The students who attend are also highly motivated people in general, which, when combined with excellent instruction and great lab coding exercises all week, makes for a great learning environment.&lt;/p&gt;

&lt;p&gt;Another interesting thing that happens at BNR is that in this environment, you somehow don&apos;t burn out and can basically write code all day every day and many people keep at it into the night hours. I think this is due to the way the BNR classes combine short, targeted lecture with lots and lots and lots of hands-on coding. In addition, taking an afternoon hike through untouched nature really helps to refresh you and keep energy levels up. (Maybe if more companies, and the USA for that matter, encouraged this kind of thing people would actually be &lt;i&gt;more&lt;/i&gt; productive rather than less.) And because of the diversity of the students, every meal combines good food with interesting conversation.&lt;/p&gt;

&lt;p&gt;So, thanks to our instructors Joe and Brian for a great week of learning and to all the students for making it a great experience. Can&apos;t wait to take the OpenGL bootcamp sometime in the future.&lt;/p&gt;</description>          </item>
  </channel>
</rss>