The #hashtag #muddle

What’s with the hashtags? No, really, why are people still using them to the point of banal abuse? I think, and that’s true for so many things… it all started with good intent. In the not-so-savvy times, maybe several years ago, hashtags had a purpose. You click on one — that succinct hashtag — and voilà! Depending on the social network you’re on, you see a filtered list of all the posts that share the same keyword. Life was good, and hashtags were serving humanity well — the usage was pretty much in sync with the intent.

But then, it seems, there was an upsurge of clueless, mostly uncoached marketing personnel who thought that hashtags were a great way to make their posts (or some otherwise irrelevant part thereof) stand out. They started using them #everywhere… #literally!
The result was a plethora of hashtags so generic and irrelevant that they defeated the purpose altogether. One is better off Googling instead.

For instance, I recently came across a LinkedIn post of an IT company that went something like:

#Technology #disruption ….in #Finance and #Banking sectors..

Not sure what the poster is trying to achieve by tagging ‘Technology’ and ‘disruption’. Or, for that matter, ‘Finance’ and ‘Banking’. What is the motive? Is this some unique concept? All these terms have been around since the beginning of the Internet, so one cannot possibly hashtag them and hope to lead the user to a filtered set of posts from a “niche” domain or a concept that’s being highlighted. Au contraire, you’re hindering the user experience by letting these basic, nominal terms stand out!

Not very far from them is a clan of DSLR-wielding, picture-watermarking nouveau riche “photographers” who are quite active on Flickr and Instagram. These folks think that every picture has to be tagged with every known tag even remotely related to the post! Yes, monetization could be a motive in a few cases, but I doubt it applies to the majority. Anyway, this lot can be discounted here because, even though most of them have a technical background, not all of them do.

Coming back to the ‘IT crowd’, here is an example of decent usage:

[Image: a post whose hashtags are limited to relevant terms such as #IoT and #Ethereum]

Notice how the hashtags have been restricted to terms that matter. Yes, IoT is a buzzing domain. Yes, Ethereum is an upcoming platform. By tagging them, and hence enabling the user to click on them (or SEO bots to index them), you’re pointing the interested reader in the right direction.
And that, my friend, should be the intent!


A Spark learning.

About a month back, I’d done something I was not very proud of — a piece of code that I was not very happy about — and I had decided to get back to it when time permitted.

The scenario was close to the typical Word Count problem: the task required counting words at specific indices and then printing the unified count per word. Of course, there were many other things to consider in the final solution, since a streaming context was involved, but those are outside the purview of what I’m trying to highlight here.

An abridged problem statement could be put as:

Given a comma-separated stream of lines, pick the words at indices j and k from each line, and print the cumulative count of all the words at jth and kth positions.

So the ‘crude’ solution to isolate all the words was:

  1. Implement a PairFunction<String, String, Integer> to get all the words at a given index
  2. Use the above function to get all occurrences of words at index j, and get a Pair (Word -> Count) RDD
  3. Use the same function to get all occurrences of words at index k, and get another Pair (Word -> Count) RDD
  4. Use the RDD union() operator to combine the above two RDDs, and get a unified RDD
  5. Do other operations (reduceByKey, etc. …)
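
For reference, here’s a minimal sketch of what that two-pass version looks like (the variable names are illustrative, not the original code; Constants.J_INDEX and K_INDEX are simply the two indices of interest):

// One PairFunction per index; each mapToPair below is a full pass over the stream
PairFunction<String, String, Integer> wordAtJ =
        line -> new Tuple2<>(line.split(",")[Constants.J_INDEX], 1);
PairFunction<String, String, Integer> wordAtK =
        line -> new Tuple2<>(line.split(",")[Constants.K_INDEX], 1);

JavaPairDStream<String, Integer> jWords = lines.mapToPair(wordAtJ);   // pass #1 (step 2)
JavaPairDStream<String, Integer> kWords = lines.mapToPair(wordAtK);   // pass #2 (step 3)

// Step 4: combine the two streams; Step 5: reduce as usual
JavaPairDStream<String, Integer> unified = jWords.union(kWords);
JavaPairDStream<String, Integer> counts = unified.reduceByKey((a, b) -> a + b);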

As is apparent, anyone would cringe at this approach, especially due to the two passes (#2, #3) over the entire data set — even though it gets the work done!
So I decided to revisit this piece, armed with some additional knowledge of what Spark offers.

One useful tool is the flatMap operation that Spark’s Java 8 API offers. By Spark’s definition:

flatMap is a DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream

Given our requirement, this was exactly what was needed: create two records (one each for the jth and kth index word) from each incoming line. This would, of course, benefit us in that we get the final (unified) RDD in just a single pass over the incoming stream of lines.

I went ahead with a flatMapToPair implementation, like so:

JavaPairDStream<String, Integer> unified = lines.flatMapToPair((s) -> {
        String[] a = s.split(",");
        // emit one (word, 1) pair for each of the two indices of interest
        List<Tuple2<String, Integer>> apFreq = new ArrayList<>();
        apFreq.add(new Tuple2<>(a[Constants.J_INDEX], 1));
        apFreq.add(new Tuple2<>(a[Constants.K_INDEX], 1));
        return apFreq.iterator();
});
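
The downstream steps stay the same as before; roughly:

// A single reduceByKey on the unified stream gives the cumulative count per word
JavaPairDStream<String, Integer> counts = unified.reduceByKey((a, b) -> a + b);
counts.print();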

To further validate the benefits, I ran some tests* with datasets ranging from 1M to 100M records, and the benefits of the flatMap approach became more and more pronounced as the data grew bigger.

The observations were as follows:

[Chart: time spent in the union vs. flatMap stage for 1M, 10M, and 100M records]

As we can see, whilst the difference is ~2s for 1 million records, it almost doubles by the time we reach 10M, and more than doubles around the 100M mark.
It’s therefore obvious that production systems (e.g. a real-time analytics solution), where the data volume is much higher, need to be cautious about the choice of each operation (transformation, filtering, or any other action), as these govern the inter-stage as well as the overall throughput of a Spark application.

 


* Test conditions:
– Performed on a 3-Node (m4.large) Spark cluster on AWS, using Spark 2.2.0 on Hadoop 2.7
– Considers only the time spent on a particular stage (union or flatMap), available via Spark UI
– Each reading is an average of time taken in 3 separate runs

A ‘Kafka > Storm > Kafka’ Topology gotcha

If you’re trying to make a Kafka Storm topology work, and are getting baffled by your recipient topic not receiving any damn thing, here’s the secret:

  • The default org.apache.storm.kafka.bolt.KafkaBolt implementation expects to read its outgoing message from a single, named field of the upstream (Bolt/Spout) tuple
  • If you’re tying your KafkaBolt directly to a KafkaSpout, you’ve got to use the spout’s internal field name: str
  • However, if you have an upstream Bolt doing some filtering, then make sure that you tie the name of your ONLY output field (the value) to the KafkaBolt

Let me break it down a little bit more for the larger good.

Consider a very basic Storm topology where we read raw messages from a Kafka Topic (say, raw_records), enrich/cleanse them (in a Bolt), and publish these enriched/filtered records on another Kafka Topic (say, filtered_records).

Given that the final publisher (the guy that talks to filtered_records) is a KafkaBolt, it needs a way to find out which tuple field the values are available under. And that field name is what you need to specify, based on the upstream bolt or spout.

So, the declared output field of the upstream Bolt would be something like:

@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declare(new Fields(new String[]{"output"}));
}

Note the output field named “output”; this is the field the KafkaBolt will read the message value from.
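
For completeness, here’s a rough sketch of the emitting side of such an upstream bolt (assuming a BaseBasicBolt, with enrich() standing in for whatever cleansing/filtering logic you have):

@Override
public void execute(Tuple input, BasicOutputCollector collector) {
    String raw = input.getString(0);       // raw record coming off the KafkaSpout
    String filtered = enrich(raw);         // enrich(): hypothetical enrichment/cleansing logic
    collector.emit(new Values(filtered));  // lands in the single declared field, "output"
}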

Now, the only thing to take care of in the KafkaBolt is using this field name in the mapper configuration, like so:

KafkaBolt bolt = new KafkaBolt()
        // newProps(): helper building the Kafka producer Properties (broker URL, serializers, etc.)
        .withProducerProperties(newProps(BROKER_URL, OUTPUT_TOPIC))
        .withTopicSelector(new DefaultTopicSelector(OUTPUT_TOPIC))
        // Kafka key read from field "key" (not emitted upstream here), message from field "output"
        .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper("key", "output"));

The default message field name is “message” (and the default key field is “key”), so you could as well use the no-arg constructor of FieldNameBasedTupleToKafkaMapper by naming your upstream output field “message”.

If, however, you have a scenario where you want to pass both the key and the value from the upstream, for example:

@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declare(new Fields(new String[]{"word","count"}));
}

Note that we’ve specified the key field here as “word” (and the value field as “count”).

Then, obviously, we need to use these (modified) field names downstream, like so:

KafkaBolt bolt = new KafkaBolt()
        .withProducerProperties(newProps(BROKER_URL, OUTPUT_TOPIC))
        .withTopicSelector(new DefaultTopicSelector(OUTPUT_TOPIC))
        // Kafka key read from the "word" field, message from the "count" field
        .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper("word", "count"));

Update (2017-08-23): Added the scenario where a modified key name can be used.

(Good) Introduction to ML

The Machine Learning (ML) bandwagon is on, and I do not want to be left behind. There are plenty of YouTube videos, online courses, books, and people with ready advice on what ML is all about. After having skimmed through several of them, I came across this video on YouTube, and honestly, it is the best one I have seen thus far.

Many thanks to Ron Bekkerman, and LinkedIn for putting this on YouTube for the larger good!

The glorious ‘threat/reward’ model

We have witnessed this since our childhood. Right from our homes, to our schools, colleges, and finally in our jobs. The glorious ‘threat/reward’ model! As the name suggests, it’s an approach where a certain set of actions leads (or is known to lead) to a certain reward (candy, toys, perks, H1B, and of course “that-irresistible-promotion”). On the flipside, non-compliance with a given, pre-defined set of laid-out steps leads to ‘threats’, or the consequences of those threats (read: no candy, no perks, …and… well, you get the point.)
