Big Data and Hadoop - Drivers of the Future

Big Data and Hadoop are the buzzwords these days. Every service provider is positioning itself as a champion in the Big Data and Hadoop space. Yet organizations are finding it difficult to understand the technology and turn it into a workable solution, because the field is still in its infancy and carries real complexity.

Let’s start by answering a fundamental question: “does my organization really need Big Data solutions?” To answer this, one needs to understand the organization’s current data and analytical practices and critically evaluate future requirements. Are the current data, analytical processes and systems too inefficient to deliver results in a fast and actionable manner? Or will they become inefficient and insufficient in the near future due to fast-growing requirements and new data challenges such as web logs, social media, and audio/video data? If the answer to either question is yes, then the organization needs to start thinking seriously about a Big Data strategy and road map.

The first baby step towards drawing up a Big Data strategy is to understand it from both an IT/data and an analytical point of view. Organizations need to decide whether the strategy is an efficiency booster, a path to new capabilities and discoveries, or both. At the foundation level, either goal will require investment in IT and in skill sets. The investment in IT may be controlled by using Apache Hadoop and other open-source platforms, but training and skill development will surely be an ongoing journey. To achieve the latter goal (new capabilities and discoveries), organizations need to develop the Big Data strategy not only from an IT point of view but mainly from a business and analytical point of view. It’s like a baby learning not only how to walk but also where to head.

If we look at the market landscape of Big Data and Hadoop, there are numerous players providing solutions for different aspects of Big Data deployment. These players can be grouped into a few broad categories.

  1. Players that focus on core technology development, such as HBase, Hive and Hadoop HDFS.
  2. Service companies that provide IT solutions, such as setting up a Hadoop platform and helping integrate legacy systems with Hadoop.
  3. Companies that develop frameworks and applications that run on top of Hadoop.
  4. BI outsourcing companies that add value by running BI jobs with shorter turnaround times.
  5. Companies that use analytics and data mining to generate new insights from huge amounts of data, hitherto not feasible on legacy RDBMS systems; this is where the highest value lies.

Affine Analytics provides solutions in two of the above categories: running BI on top of Hadoop, and performing predictive analytics on data that was previously too large and unmanageable to work with, including discovering hidden patterns in it. We use Big Data to take predictive-analytics accuracy to the next level by deploying machine learning and other advanced techniques that Big Data makes feasible.

At Affine we are developing Big Data capability in, quite literally, a big way. Besides having practically every analyst trained on the Big Data platform, we also have an in-house Hadoop Analytical Lab, “Hal” (a Hindi word meaning solution), which increases the efficiency of our operations multifold and gives us the capability to mine unstructured and semi-structured data to generate new insights that help our clients’ businesses take faster and better decisions.

Affine is currently working on telecom CDR data to generate insights that were not known earlier and to improve the performance of existing strategies. This requires mining huge amounts of data using Big Data platforms such as Hadoop and Hive. Affine is using call detail records and voice/data transaction data to better manage churn, devise strategies to increase ARPU, increase usage of VAS, and more.
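To give a flavor of what this looks like in practice, here is a minimal sketch of the kind of Hive aggregation involved: rolling raw call records up into per-subscriber usage features that can feed churn or ARPU models. The table and column names (cdr_raw, subscriber_id, call_seconds, data_mb, vas_flag) and the connection details are hypothetical placeholders, not Affine’s actual schema or code.

```python
# Minimal, illustrative sketch (not production code): aggregate raw CDRs
# stored in Hive into per-subscriber usage features for churn/ARPU modelling.
# Table/column names and host details below are hypothetical placeholders.
from pyhive import hive

QUERY = """
SELECT
    subscriber_id,
    COUNT(*)          AS num_events,
    SUM(call_seconds) AS total_talk_time_sec,
    SUM(data_mb)      AS total_data_mb,
    AVG(vas_flag)     AS vas_usage_rate
FROM cdr_raw
WHERE event_month = '2013-06'
GROUP BY subscriber_id
"""

# Connect to a HiveServer2 instance and run the aggregation on the cluster.
conn = hive.Connection(host="hive-server.example.com", port=10000)
cursor = conn.cursor()
cursor.execute(QUERY)

# Inspect a few rows; in practice the result would be written back to a Hive
# table and consumed by downstream churn / ARPU / VAS-usage models.
for row in cursor.fetchmany(5):
    print(row)
```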

Ashish Maheshwari

Director – Client Delivery, Affine Analytics

For feedback and comments, I can be reached at ashish.maheshwari@affineanalytics.com


Affine Transformations 101: The Analytics Scientist Spiderweb

Analytics is the buzzword these days. Businesses are increasingly realizing the need to use analytics, or, for some, even the need to be seen using analytics. A recent article in the Harvard Business Review identifies data scientists, or analyticians, as having the sexiest job of the 21st century.

But what does it take to become a good analytician? How is their DNA different from the rest? Do they eat differently / follow a separate exercise regimen?

At Affine Analytics, we believe we have identified the secret ingredients to creating successful analyticians[1].

[Figure: the Analytics Scientist Spiderweb]

Business Knowledge – As George Clooney’s character rightly says in “Up in the Air” – “Before you try to revolutionize my business, I’d like to know that you know my business” – one should not approach a problem without proper knowledge of the business. It is of utmost importance to appreciate the why and how of a business problem, and one should do proper groundwork before settling on an approach. Every business is unique in its own way and needs to be understood thoroughly before attempting to solve the problem. Do not blindly fit the hypotheses learnt from one problem onto another. The more you know, the easier it is!

 

Problem Engineering – Don’t engineer a problem, but engineer a solution to an existing business problem.

There is a misconception that analytical solutions necessarily require applying new and advanced statistical procedures. Not always: what is always needed is the ability to take a business problem in its rawest form, break it down into logical pieces, and then view and solve each piece in a systematic manner.

Common sense, coupled with an understanding of the business context, is the first and foremost requirement for getting started on the road to becoming an impactful analytics scientist.

The next logical step in solution engineering is to find creative solution approaches. To do that, keep in mind the end goal of what needs to be achieved. Don’t take the door-to-door salesman approach of selling your standard product; find what the end goal is and create a solution that achieves it. Sound knowledge of statistical techniques is needed here. Superficial knowledge can help you in coffee-table discussions, but it doesn’t work here; a deep understanding of the pros and cons of each method gets you to the optimal solution approach.

Innovation is also a key driver here. It is required not just at the apex of the hierarchical pyramid but right down to the lowest level. Innovation can range from creating an automated business suite for a retailer to a completely different way of creating a variable or a metric.

 

Curiosity & Skepticism – “Necessity is the mother of invention,” said Plato, but it is more curiosity than necessity. Of late, new things are born simply because people are inquisitive. Whenever a person is faced with a challenge (something they haven’t seen or solved before), the curious devil inside wakes up and doesn’t sleep until they have become an expert on that topic. Curiosity makes one productive, and work becomes fun.

Managers and leaders, take note: curiosity can create the passion, or the more commonly used (and abused) “fire in the belly”.

Analyze the approach from various angles to increase confidence in your results. The more you look into the problem, the more you will get out of it. Be critical of your own findings, and use multiple approaches and techniques to verify unintuitive results.

 

Math – Mathematics forms the basis of the analytics industry, and every data scientist is expected to have a good grip on the subject. Mathematical intuition, especially an appreciation of numbers in general and of variable trends, matters more in analytics than knowledge of machine learning techniques alone. Master the basics of all the techniques and you will rule the world. A basic understanding of the principles governing numbers changes the way one looks at a variable.

 

Technology – If math forms the foundation of analytics, then technology enables us to construct a proper structure on top of it. One can’t give meaning to the math behind an analysis without knowing the right tools; great analysts, and even managers, cannot work without them. Knowledge of MS-Excel and R is a must in this field. Though technology is evolving rapidly, one can deliver high-quality analytics with a proper command of these two tools.

 

Writing – Your story, however great it may be, is worthless unless you are able to sell it. Presentation is the key to creating “impact” from an analysis. If people cannot draw any meaning from your presentation, then your analysis holds no value; if the client cannot understand it, forget about them implementing it. Create decks and reports that make proper sense. Simple decks and reports with systematic storyboards are what create the impact. If someone stumbles upon your deck in the future and is able to understand it without much background, then you will have carved a good analytics scientist out of yourself!


[1] Affine’s leadership brings together a few decades of analytics experience analyzing diverse problems across multiple business functions and industries.

Vineet Kumar & Krishna Agarwal

Process Orientation in Analytics

Before you read this post in its entirety, let me make sure we are on the same page about the very definition and boundaries of analytics. To me, analytics is a way of problem solving that relies predominantly on one thing – looking for repeatable, reliable and meaningful patterns in data to better understand the problem at hand – and hence developing a decision and action strategy whose outcome is more predictable. I’m not getting into defining, or segmenting, analytics itself into multiple types – that’s for another post.

If I may assume, in a rather non-analytical manner, that the hypothesis stated above – that analytically driven decisions have more predictable outcomes – is true, then we have a solid case to ask the next question: “If analytics is so important, should not all decisions be driven by analytics?” I’m convinced, and I’m sure most of you are too, that corporations around the world already agree that analytics delivers more predictable results and hence is critical for growth and sustainability.

That leads us to the next big question: “If analytics is so important, or rather mandatory, then are organizations developing their analytical capability in the right manner? Are they performing their various analyses in a manner that is consistent across time and space (read: business functions)?” The answer, if not a resounding no, is at least a muted acknowledgement of a lack of maturity from a vast majority of consumers of analytics.

Answer these to convince yourself: “Does your marketing team know what aspects of a product the R&D team found important in the conjoint analysis they did two years ago?” Sample another one: “Does your analytics manager know the hypotheses tested by his or her predecessor and, more importantly, the ones that were rejected?” And one more: “Are you sure the vendor you outsourced your analytics project to – or, for that matter, your internal analytics team – has done it in the right manner?”

Unfortunately, while each individual project is done with great rigor and in most cases the end result is also great, the rigor, the process and the methodologies change from analyst to analyst and from analysis to analysis. This “adhocism” engenders uncertainty and a colossal waste of potential synergies across analyses.

It goes without saying that there is immense value in having a consistent way of performing various analyses across time and across functions. Some of the value-adds that come to mind immediately are a higher degree of reliability in results, consistency in the interpretation of results by different functions, a greater amount of cross-learning owing to a common shared code of conduct, and the availability of insights well past project completion.

However, the one benefit that stands tall above the rest is the reduction in errors – given the magnitude of the decisions taken on the basis of these analyses, the cost of an error can be very high.

To quote an experience at Affine Analytics with a campaign analysis project for a large internet company: given the number of campaigns they run, the number of response models to be built was rather large (one for each “type” of campaign). Taking cognizance of the scale and the need for repeatability, we first defined a framework of sorts (while still keeping in mind the flexibility needed for individual models). Using this framework, we were able to create huge efficiencies (a 40-50% reduction in model development time from the first set of models to the third) and also ensure error-free delivery from the first model to the last.
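As an illustration of what such a framework might look like in code, here is a minimal sketch: a single, fixed recipe that every campaign response model goes through, so that results stay comparable from one model to the next. The feature list, the choice of logistic regression, and the target column name are assumptions made for the example, not the actual framework used on that engagement.

```python
# Illustrative sketch of a repeatable response-modelling recipe; the model
# choice (logistic regression), column names and split settings are
# assumptions for the example, not the framework described above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def build_response_model(df, features, target="responded", test_size=0.3, seed=42):
    """Fit and evaluate one campaign response model with a fixed recipe."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target],
        test_size=test_size, random_state=seed, stratify=df[target],
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, {"rows": len(df), "test_auc": auc}


# The same recipe is applied to every campaign "type", so each new model
# inherits the structure and validation checks of the previous ones:
# results = {ctype: build_response_model(data[ctype], FEATURES)
#            for ctype in campaign_types}
```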

That said, there is a definite need to be wary of bureaucratizing processes. Overdone, it will end up stifling creativity, which is equally important in the development of analytics solutions. Organizations need to strike a fine balance between consistency and creativity by designing & following processes that are minimalistic in nature but comprehensive in assuring quality.

In my next post, I will start looking back at the analytical exercises I have been part of, primarily over the last two years at Affine, and see where process helped make them more meaningful, reliable and robust. We will also briefly talk about aspects of Affine’s own analytical framework, A4.

To end, I quote John Updike: “Creativity is merely a plus name for regular activity. Any activity becomes creative when the doer cares about doing it right, or better.”

Manas Agrawal

CEO

 

Affine Transformations 101: Overpowering the Predictive Power Greed called “Overfitting”!

The term overfitting originates from the way predictive models are built – they are “fitted” to match historical data. The fit can be poor – called underfitting – in which case the predictions are far away from most of the actual data points. Or it can be too close – called overfitting – in which case we are also force-fitting the noise rather than capturing the true underlying structure. As obvious as it may sound, many analysts and forecasters completely ignore this problem and hence develop not predictive models but chaotic ones.

Overfitting usually happens in cases when the data is limited and noisy, but the main (de)motivation behind building overfitted models is the urge to build super-predictive models. Combine these ingredients and you have a recipe for predictive chaos.

Here’s a simple example to illustrate overfitting:

[Figure: a simple example illustrating overfitting]
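Since the original chart is not reproduced here, the same idea can be sketched in a few lines of Python: sample a smooth curve with noise, fit polynomials of increasing degree, and compare errors on held-out points. The specific curve and polynomial degrees are arbitrary choices made for illustration.

```python
# Toy illustration of underfitting vs. overfitting: fit polynomials of
# increasing degree to noisy samples of a smooth curve and compare the
# error on training points with the error on held-out points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.25, size=x.size)  # signal + noise

train = rng.choice(x.size, size=20, replace=False)
test = np.setdiff1d(np.arange(x.size), train)

for degree in (1, 3, 15):  # too stiff, reasonable, far too flexible
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x)
    train_rmse = np.sqrt(np.mean((pred[train] - y[train]) ** 2))
    test_rmse = np.sqrt(np.mean((pred[test] - y[test]) ** 2))
    print(f"degree {degree:2d}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")

# The degree-15 fit hugs the noisy training points (low train error) but does
# worse on the held-out points: it has fitted the noise, not the structure.
```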

An overfitted model will score high on various statistical tests and measures, but it scores those extra points by cheating – by fitting the noise rather than the true underlying structure. This might make it easier to sell the model to the client, but has the potential to hurt their business.

At Affine, we perform multiple diagnostic checks, during both the model training and testing phases, to ensure that our models are free of overfitting. Our in-house multi-tier validation framework leverages bagging (bootstrap aggregation), where models are built as well as validated on multiple bootstrap samples (pulled with and without replacement). The number of bootstrap samples may vary from 20 to 50 depending on, among other things, the underlying statistical model and the sample size. Multiple model parameters and performance metrics are validated for consistency across these bagged samples and summarized into a final validation report that lets our analytic scientists take a call on overfitting as well as take measures to get rid of it.
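A stripped-down sketch of the bootstrap-validation idea is shown below. It illustrates the general technique rather than the in-house framework itself, and the model choice and inputs are assumptions: the same model is refit on many resampled copies of the training data, and the stability of its coefficients and out-of-sample performance across those copies is what flags overfitting.

```python
# Illustrative bootstrap (bagging-style) validation sketch, not Affine's
# actual framework: refit the same model on many resampled training sets and
# check how stable its coefficients and out-of-sample AUC are across them.
# X and y are assumed to be NumPy arrays; the model choice is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def bootstrap_validate(X, y, n_samples=30, replace=True, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs, aucs = [], []
    for _ in range(n_samples):
        # Draw a bootstrap sample (with replacement) or an 80% subsample.
        idx = (rng.choice(n, size=n, replace=True) if replace
               else rng.choice(n, size=int(0.8 * n), replace=False))
        oob = np.setdiff1d(np.arange(n), idx)  # rows left out of this draw
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        coefs.append(model.coef_.ravel())
        aucs.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))
    coefs = np.asarray(coefs)
    return {
        "auc_mean": float(np.mean(aucs)),
        "auc_std": float(np.std(aucs)),  # a large spread hints at overfitting
        "coef_instability": np.std(coefs, axis=0)
                            / (np.abs(np.mean(coefs, axis=0)) + 1e-9),
    }
```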

‘Overfitting Diagnostic Check’ is just one of the many checkpoints through which we build fair, robust and, more importantly, business-ready, “longer shelf-life” models for our clients. What motivates our analytic scientists to perform these checks, which others may call overhead? Simple – skepticism, curiosity and a mindset that forces them to be critical of their own work in pursuit of continued excellence.

Watch out this space for more Affine Transformations…

~VK