Reliable Distributed Communication with JGroups

We were looking for an alternative to Java Messaging Service (JMS) for a project requirement – I had had my fair share of unpleasant experiences with JMS from the past. With JMS, most of which had to do with housekeeping involved in the event of a broker outage.

Issue at hand

The solution we were working on was a legacy web-based application, which was being enabled for a cluster. Scalability, it seems, was not designed into the system. The solution, therefore, demanded making the system scalable with a quick-turnaround, minimal code changes, and of course, if not positive, then no negative impact on the application throughput.

The main challenge was the presence of a programmatic cache (using core Java data structures), for some specific requirements, which of course, needed to be made cluster-aware. Data consistency is the primary concern in such cases, because the application pages can be serviced by any potentially any node.

In this post, I want to talk about how we achieved the same – since such a scenario of different nodes needing to “talk” can arise in many a applications of the present day – as distributed computing is becoming more and more popular.

Analysis

When we say cluster-aware we imply that the clustered application, might need to talk to its peers at various junctures. It might or might not decide to perform an operation based on those interactions – but the conversation does need to occur. There are of course several options today to facilitate the same – JMS, Hazelcast, JGroups, etc. JMS requires the presence of an active broker process at all times. The housekeeping involved in broker outages, especially when message persistence is enabled (for reliability), becomes a problem. Solutions like Hazelcast required us to modify the existing code so as to use the data structures provided by them. This wasn’t feasible because we wanted to minimize the code changes and rather work out a solution on top of the existing code base.

We settled for JGroups because it seemed to address all the concerns:

  • there is no active process required beforehand, but rather, each node either joins an existing communication session, or else, starts one of its own. This means that there’s no single point of failure (SPOF)
  • message reliability, ordering and flow control are all configurable
  • code changes could be contained, because it involved minor modifications.

JGroups is a reliable IP  toolkit that is based upon a flexible protocol stack. IP , or broadcast, implies sending a message to multiple recipients. Traditionally, multicast communication is UDP (User Datagram Protocol) based, and is unreliable. For instance, think of a streaming video from YouTube – where once the streaming starts, little does it matter that a few frames are missed or jagged. Such are typically unreliable UDP based transmissions.

JGroups addresses the shortcomings of traditional IP multicast by adding aspects like reliability and membership to it.

Flexible protocol stack implies that there are options to tailor JGroups behaviour according to the application requirements. For example, aspects like Failure Detection, Encryption, etc., can be configured.

Design

Conceptually, the design was very simple: the caches on each node needed to “talk” to other nodes, whenever a significant operation occurred. Since we were dealing with user-defined entities, these operations mostly involved some kind of update of these entities.

Since each of the node needed a JGroups handle in order to be able to talk, we invoked a local JGroups session on each. Through the concept of groups and membership, a JGroups session looks for other active sessions when it comes alive. Consequently, members belonging to the same group are aware of all the peers. For example:
Suppose node n1 comes up as the very first node, and initializes a JGroups session, creating a group called ProjectXGroup. Any subsequent node, say, n2, n3, also does the same, but in its case, it find an active ProjectXGroup already in place, and hence joins it, rather than creating a new session. Thus, each node (peer) is aware of all the members in that form the group. JGroups, by design, provides failure detection (FD) – that any node that goes down is removed from the list of active members, thus ensuring that at any given point of time, all the group members are aware of the current state of the group.

Any local update of one or more entities required the change to be propagated to all the peers. We identified that this should be a 3-step process. Given an entity EntityX, the following needed to be done on all the nodes:

  • Lock EntityX
  • Update ExtityX
  • Unlock EntityX

We also needed to ensure that failure to perform any of these operations should result in a rollback, which also needed to be taken care of. For any given operation, we subdivided it into smaller tasks. An example of a task could be LockTask, UpdateTask, UnlockTask, etc. It is these tasks that would be the language of communication. Typically a task would have taskID, information about the entities involved, status flag (SCHEDULED, SUCCESSFUL, FAILED..), etc.
task design

JGroups has options to either broadcast a message or unicast it. Since our implementation required that caches on all the peers be consistent, a sender broadcasts a task to all the peers. Each peer, would then perform the task, and report the outcome back to the sender. Continuing with our earlier example, given a task T1 which the involves the following sub-tasks on an entity known as E1:

  1. Lock E1
  2. Update E1
  3. Unlock E1

Let’s consider the first sub-task — Lock E1:
This would require E1 to be locked on all the nodes, and therefore we invoke the LockTask, with information about E1. Now, suppose n2 is a node which is catering to the user at a given point of time, then n2 would first lock E1 locally, and then broadcast this LockTask to the group.  Since it is a broadcast, all the nodes (including n2) would receive and process the LockTask (and lock E1 locally). We can easily identify the sender and not perform the task there (since the task was issued to other nodes only after the sender had performed it). This means that n2 would ignore the task, but n1, n3..nn would need to perform it.

conversations in T1

Once the given task is processed, in order to report the outcome back, we set the appropriate status flags in the original Task object and send it back to the sender. This time, however, we set it as a unicast communication (by specifying the recipient), since we know who is this message directed to. What it boils down to in terms of JGroups is: after performing the task n1, n3,…nn would all get the handle to the JGroups session and can use the same LockTask instance to report back the outcome of a given task. Upon receiving responses from all the peers, and if the responses are positive, the sender can then proceed with other operations (sub-tasks) — Update E1, Unlock E1…etc.

Exceptional Scenarios

There are various junctures where the above conversation could fail, and thus, all the exceptional scenarios need to be taken care of. There were several failure scenarios that we handled. To list a few:

  • No response (or response timeout) from one or more peers
  • Failure of subtask in the sender node
  • Failure of subtask in any peer, etc.

Summary

We discussed how JGroups can be an apt solution in situations where reliable distributed communication is necessary. We also saw an example where JGroups was used for enabling intra-node communication for broadcast and unicast services.

Think.

One of the best mails I received was from a CEO (or to put it mildly — from the head of a startup I was working for) that he had sent to all the ’employees’. It was about a seemingly small fact which I had unconsciously felt on several occasions, but never actively thought about.

It went something like this:
Between the time when you’re given a problem, say an issue to troubleshoot, or a feature to implement, and the time that you actually start implementing or troubleshooting it: there is a something called “THINKING” required. I see a lot of you nodding when I tell you about an issue, or the next set of features to implement — but then when I see the approach or the implemented feature — the “THINKING” part seems to be lacking, or in some cases, absent. ‘Thinking’ is something that cannot be done on the fly but deserves a conscious time and effort!

The actual mail tone would have been more euphemistic that I can ever get, but the point I’m trying to highlight in this post is: I see it happening on so many occasions. That too in very senior developers, managers, etc.
There is almost negligible time devoted to understanding the need or cause of a feature or issue — and there is a restlessness to pounce to a solution. I think it might be driven by the assumption that one would be sort of ‘one-up’ or the ‘apple of customers’ eyes’ or that one-true-deserving ‘pat on the back’ awardee. But such myopic approaches tend to bite back in the long run.

anyString() v/s anyString() in Mocking frameworks

So I was trying out mocking frameworks for the first time a few days ago, and there’s an interesting observation: even though EasyMock, PowerMock, and Mockito and said to be brothers-in-arms, there are a few nuances to take care of. One of them is the following.

There are convenience methods (called matchers) provided by both Mockito and EasyMock to define a message signature that can accept any value of a given parameter type. It is claimed, that these any implementations (e.g. anyObject, anyString, etc.) are interchangeable, but that does not seem to be the case.

Consider a very simple scenario, where one would want to mock a login() implementation. That is to say, success (true in this case) must be returned, regardless of the parameters passed.

class N {
        public boolean login(String username, String password) {
            return doLogin(username, password);
        }
        private boolean doLogin(String u, String p){
            //validate login
            //...
            //...
            return true;
        }
}

Now that the method is defined, let’s take a look at the test case where we would mock is call.

@Test
public void testMockLogin() throws Exception {
    // mocking only specific methods
    N n = createPartialMock(N.class, "doLogin", String.class, String.class);
    boolean expected = true;

    expectPrivate(n, "doLogin", anyString(), anyString()).andReturn(expected);
    replay(n);

    boolean actual = n.login("foo", "bar");

    verify(n);
    assertEquals("Expected and actual did not match", expected, actual);
}

Easy, peasy! Except for line no. 7, where anyString() is used via the import org.mockito.Matchers.anyString, causing the following exception:

java.lang.AssertionError: 
  Unexpected method call N.doLogin("foo", "bar"):
    N.doLogin("", ""): expected: 1, actual: 0
    at org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:44)
    at org.powermock.api.easymock.internal.invocationcontrol.EasyMockMethodInvocationControl.invoke(EasyMockMethodInvocationControl.java:91)
    at org.powermock.core.MockGateway.doMethodCall(MockGateway.java:124)
    at org.powermock.core.MockGateway.methodCall(MockGateway.java:185)
    at com.pugmarx.mock.N.doLogin(N.java)
    ...

The same test case passes when the anyString() implementation of Mockito is replaced by the implementation provided by EasyMock (via org.easymock.EasyMock.anyString).
If we dig a bit, we realize that one of the reasons could be that EasyMock.anyString returns null, as against the empty string returned by mockit.Matchers.anyString.

[Thank you for troig from the wonderful StackOverflow community for helping me on this query.]

Not guilty!

We have been recruiting a lot lately. I try to stick to the basics, and sometimes get surprised by the responses I receive. A lot of seemingly successful candidates — who can take any J2EE question you throw at them — fumble when the convenience of their favourite data structures are taken away from them. For example, a simple Map implementation seems to give them a tough time.

I am surprised by one of the answers I receive quite a lot when asked about data structures. That, “I don’t remember — it was a long time ago” (that they were ‘taught’ about it). In other words: they plead “not guilty!”. Now, even though we are not recruiting for a programmer position, so to speak — such an answer does spark a bit of outrage within me. And then I ponder: what are we doing in our colleges? Are we getting too driven by the shiny-white things like BigData, BigAnalysis and IoT and what not — that we don’t really worry much about the basics of computer science. First of all: is the essence of computer science even put forth to the students? That it’s a science…that one has to develop a scientific mindset! I believe, that’s something above all the big college tag/big technology/big offer/big robot that one develops.

Even after college, do they ever go back and revisit the basics they had been ‘taught’? Or at least think about them?

One of the candidate we came across recently could not proceed on a small programming problem because we took away the luxury of being able to use HashMap from him. When I asked why couldn’t he create a Map of his own, the response was: “I am not sure what language are Java Maps written in.”

Adafruit WebIDE

If you’re looking for the best place to start tinkering with a RaspberryPi (RPi) or BeagleBone (BB), I would highly recommend Adafruit’s WebIDE. As the names suggests — it’s a web-based IDE, and facilitates physical programming (for RPi — manipulating the GPIO) on your device.

Picture of Adafruit webIDE in action
Adafruit’s WebIDE for RPi and BB

The installation steps are clearly laid-out, and mostly smooth (for RPi I remember having to take care of a few easy-to-fix issues). Once it’s up and running — it’s a delight to work with!

Here’s a big shout of “Thanks” to Adafruit!

Kindle-ing

I’m highly impressed by Kindle. Not just the device — I don’t have one — but the whole idea around it; and what Amazon has done with it. I’m in awe! No, seriously.
While we were busy reading about gadget wars, and tech-websites busy bombarding us with feature comparisons of Kindle, and Nook and what-not — Amazon did something very smart: it created a Kindle plugin for every platform known to mankind…Android, iOS, Windows..you name it (ok maybe not all, but you get the idea!). It was blessing for people like me who might want to read something at their own convenience. All my devices are synched..so that takes care of remembering the page number/bookmarks etc.

At this point however I must confess that the intent of this particular post was not exactly blowing Kindle’s trumpet..but rather to talk about a nifty little extension that they created for Chrome, which allows me to send any webpage to my Kindle library in Kindle format with just a click (aka ‘automagically’)!

kindlebutton

I consider it as one of the coolest things I’ve been enlightened with recently — gives me the flexibility of not just bookmarking a webpage — I have thousands of those will-read-it-someday ones, but rather going several steps further and making that webpage/article available on all my Kindle-app devices. This means that I have the convenience of (re-)visiting those articles/bookmarking them/reading them in oddest of places — all at my disposal! That’s übercool and so very thoughtful!

About Mavericks

One of the best things Apple has done with OSX Mavericks is the feature of keeping the extended desktop (external monitor) separate. In the sense that, unlike before, it’s somewhat disconnected from the main desktop. This means that it’s now possible to have maximised applications on each of the desktops — and thus fixing the limitation in the previous versions of OSX where an app could be maximised on either of the screen. It beats me on how did that feature seep-in in the first place.
Anyway, so again, unlike before, AppleTV now allows you to use the connected monitor as a secondary desktop as well. Finally, someone seems to have put brains into these pesky little flaws.