Investor Retention Analytics

Business Challenge

The $200 billion+ Indian mutual fund industry is experiencing a period of significant growth and increasing competition. With a six-fold increase in Assets Under Management (AUM) over the last 10 years, Asset Management Companies (AMCs) face growing competitive pressure on their share of wallet. The challenge lies not only in acquiring new investors, but also in retaining long-standing (vintage) investors and offsetting redemptions. In India, an AMC’s distribution network comprises Independent Financial Advisors (IFAs), in-house distributors, and banks. Typically, distributors rely on delivery channels such as IFAs, and expect these delivery-side agents to understand their territories well through their relationships with investors.

Incidentally, AUM is unevenly distributed across the country, primarily because of the varying penetration of IFAs in new locations, which is one reason distributors depend on the delivery channels. Among the traditional penetration channels, such as IFAs and banks, IFAs play a crucial role in fund distribution because they interact with investors on a regular basis. However, it is observed that between the post-purchase phase and the redemption or exit of an investor from the system, ineffective investor management results in considerable losses for AMCs, because investor attrition leads to higher costs for an AMC than gaining new customers does. The existing reactive approaches result in significant and measurable missed opportunities for investment and redemption control. The proposed solution for more effective investor management is to empower IFAs and other channels through effective forward guidance, that is, insights into investor behavior such as the next best product to recommend and an investor’s propensity to churn out.

Product Introduction

Machine learning based predictive analytics platforms leverage an organization’s data by applying sophisticated analysis techniques to mine hidden patterns and insights, enabling the organization to build optimization strategies for customer behavior analysis, product attribute analysis, and marketing models, among others. In the mutual fund industry, predictive analytics can be applied to financial transaction data, demographic factors, and scheme-level features to address business concerns such as growth in net assets under management over a time period and customer acquisition and retention, using efficient analytics platforms for prediction and optimization problems.

The purpose of this white paper is to illustrate a few select capabilities of Karvy’s predictive analytics domain in assisting AMCs with the analysis and management of investor behavior. Specifically designed for the mutual fund industry, RedemptionSECURE is a product engineered with advanced machine learning platforms to intelligently provide mutual fund distributors with insights into their investors’ behavior patterns from near real-time data, in order to reduce redemptions. The solution process includes multi-dimensional client segmentation, precision investor targeting, identifying redemption targets, and periodic consolidated analysis reports.

Today, industries are moving towards data-driven mechanization across all phases, from production and delivery to service optimization. Most businesses recognize the need to deploy effective machine learning predictive platforms to analyze huge amounts of data and provide prescriptive solutions for automation and strategy. Predictive analytics is extensively used in predicting customer survival, estimating customer lifetime value, demand forecasting, pricing, and product recommendations. However, the analytics that many financial service providers use are slow, require substantial manual effort with a low level of automation, and cannot handle data at the scale needed for relevant outputs, leading to suboptimal outcomes.

In the mutual fund industry, predictive analytics plays a key role in providing data-driven decisions for managing the resources under an AMC. The growth in investors’ holdings determines the growth of an AMC’s net assets under management, and it is adversely affected by redemptions. The attributes that trigger a redemption by an investor are complex to identify and analyze. These attributes include the investor’s financial transaction patterns, market conditions and sentiment, macroeconomic variables, scheme-level features, and demographic factors. Predicting redemption behavior therefore requires a sophisticated platform that can capture the multiple factors affecting it. Big data predictive analytics using an advanced machine learning platform can analyze these massive amounts of transaction data and other time-trend variables at a macro level. Such a platform can investigate these factors on near real-time data and provide highly accurate predictions of future redeeming investors at a granular level (investor level).

What is Big Data?

Big data describes the vast amount of company-owned information, continuously generated and maintained in computer systems and cloud environments, that can be mined from internal sources. The internal sources include information such as financial transaction details, demographic data from KYC documentation, and agent / distributor details. Big data also includes external sources, such as public and other sources that capture macroeconomic indicators, social media, and third-party databases. These data, whether raw, unstructured, or structured, need to be analyzed to derive valuable insights. Today, firms compete to capitalize on this trend.

Solution Design: Predicting the Length of Relationship (LOR) of Investors

RedemptionSECURE is a cognitive-computing and user-behavior based platform whose foundation is built on an input-based reactive architecture leveraging advanced algorithms. The product has been deployed specifically for predicting the Length of Relationship (LOR) of an investor, that is, the total time period an investor stays with an Asset Management Company (AMC). The time period is calculated by taking the difference between the investor’s first investment day and the full-redemption transaction day. The LOR is affected by several parameters, including behavioral factors such as the emotional responses of an investor to the short-term and long-term market environment. Predicting it therefore needs a highly cognitive-computing platform that can incorporate the multiple variables identified as simultaneously affecting the study variable, LOR.
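The LOR calculation described above can be sketched as follows. This is a minimal illustration rather than the platform’s actual implementation: the function name is hypothetical, and counting the LOR in whole completed months is an assumption made here to match the month-based axis used later in the paper.

```python
from datetime import date


def length_of_relationship_months(first_investment: date, full_redemption: date) -> int:
    """Whole months between the first investment and the full redemption.

    Assumption: LOR is counted in completed months (not stated in the paper).
    """
    if full_redemption < first_investment:
        raise ValueError("full redemption cannot precede the first investment")
    months = (full_redemption.year - first_investment.year) * 12 + (
        full_redemption.month - first_investment.month
    )
    # The last month only counts once its day-of-month has been reached.
    if full_redemption.day < first_investment.day:
        months -= 1
    return months


# Hypothetical investor: first invested on 15 March 2012, fully redeemed on 10 July 2014.
lor = length_of_relationship_months(date(2012, 3, 15), date(2014, 7, 10))
print(lor)
```

For investors who have not yet fully redeemed, the same difference taken up to the current date gives the months completed so far, which is what a prediction is compared against.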

The existing system for database management captures several variables such as financial transactions, demographic factors, scheme-level features, and agent information. The data is highly heterogeneous and provides a massive amount of information, so advanced algorithms are needed to examine it in clusters or groups. Our machine learning based models convert an a priori understanding of investor behavior into a set of mathematical equations and then take advantage of data mining methods to generate new features, which have been further developed to provide predictive solutions. The process started by analyzing investors’ financial transaction behavior data (including amounts invested and the dates of investment), redemption and purchase transactions, demographic factors, macroeconomic variables, Net Asset Value (NAV) changes, and AUM changes for the years 2009 to 2016. Initially, the distribution of the length of relationships was analyzed in detail through descriptive analytics to identify clusters and groups among the investors in terms of demographic and investment behavior patterns.

A Length of Relationship (LOR) distribution map was developed, as illustrated below, which was studied to analyze the patterns and the time-series nature of the investor churning-out trends.

Figure 1: Length of Relationship (LOR) Distribution

  • X-Axis: Number of Months (Length of Relationship)
  • Y-Axis: Number of Investors (’000)

Observations: A significant number of investors churn out between the 20th and 30th months. Similar spikes were observed between the 70th and 90th months and beyond the 97th month. The same number of investors churn out at different lengths of relationship; such patterns are shown through linear trend lines.

Through descriptive analytics, it was identified that the time period in which an investor is likely to churn out of the system can be predicted by analyzing the associated factors that are highly correlated with the length of relationship. A classification-algorithm based predictive model dependent on these factors was developed to predict the churning-out period of an investor with a high level of accuracy. Additionally, once the model was deployed, it was cross-validated on several samples to assess how well it fits. The platform integrates near real-time data, analyzes the historical behavior patterns, and periodically predicts the likelihood of investors churning out in future time periods. Channeled into a robust intervention mechanism, these insights can help prevent such investors from churning out before their redemptions take place.
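The paper does not state which cross-validation scheme was used. As one common possibility, the sketch below shows a plain k-fold split of sample indices, where each fold is held out once for testing while the model is fitted on the rest. The function name and the choice of k-fold are assumptions made purely for illustration.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples once.

    Assumption: a simple contiguous k-fold split; the paper does not specify
    the validation scheme actually used.
    """
    if not 1 < k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train_idx, test_idx in folds:
    print(len(train_idx), "training samples,", len(test_idx), "held-out samples")
```

In practice the investor records would usually be shuffled (or split by time period) before folding, so that each held-out sample is representative.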

Methodology Framework


The heterogeneous investor data is collected from the database and pre-processed to remove irregularities. The raw data generated in the databases is time-series data in cross-sectional format, and it cannot be applied directly to develop predictive models. The issues with the raw data are as follows:

  • Disturbances due to noise
  • Missing observations in the data points
  • Different scales of the data
  • Extreme outliers
  • Imbalanced distribution of data
  • Sparsity in the data

These irregularities in the data significantly reduce the predictive power of the decision-making algorithms. For example, suppose a high correlation is found between investors’ third and eleventh transactions and the length of their relationships, except for a group of roughly 6% of investors who exhibit extremely aberrant investment behavior. It is vital that the model identifies such extreme outliers (the ~6% of investors) and redirects these observations to other behavior predictive models that can explain the group behavior of such investors.
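The idea of separating extreme outliers so that they can be routed to a different model can be sketched as below. The paper does not say how outliers were detected; the modified z-score (based on the median absolute deviation) and the 3.5 cutoff are assumptions chosen for illustration, and the investment amounts are made up.

```python
from statistics import median


def split_outliers(values, threshold=3.5):
    """Split values into (regular, outliers) using the modified z-score.

    Assumption: a MAD-based rule stands in for whatever outlier
    detection the platform actually applies.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be called an outlier.
        return list(values), []
    regular, outliers = [], []
    for v in values:
        score = 0.6745 * abs(v - med) / mad
        (outliers if score > threshold else regular).append(v)
    return regular, outliers


# Hypothetical investment amounts; the last one is aberrant.
amounts = [1000, 1200, 900, 1100, 1050, 50000]
regular, outliers = split_outliers(amounts)
print(regular)
print(outliers)
```

Observations in the second list would then be scored by a separate model fitted to that group, instead of distorting the main one.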

Feature Extraction:

One of the most important steps in any data mining process is the extraction of features from the raw data. These features are orthogonal in nature, correlated to the study variable, and have the potential to explain the variance of the study variable. From our raw data for analyzing the length of relationship, several features were generated through feature engineering using mathematical and statistical algorithms. Scaling of the variables was another important step in the feature generation due to the high variance in the amounts invested by the investors.
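To show why scaling matters when invested amounts vary widely, the sketch below applies a log transform followed by standardization to zero mean and unit variance. The specific scaling method is an assumption, since the paper only states that scaling was performed, and the amounts are invented.

```python
import math
from statistics import mean, pstdev


def log_standardize(amounts):
    """Log-transform amounts, then rescale to zero mean and unit variance.

    Assumption: log + z-score scaling is used here only as an example.
    """
    logged = [math.log1p(a) for a in amounts]  # log1p also handles zero amounts
    mu = mean(logged)
    sigma = pstdev(logged)
    if sigma == 0:
        return [0.0 for _ in logged]
    return [(x - mu) / sigma for x in logged]


# Hypothetical investments spanning four orders of magnitude.
scaled = log_standardize([500, 5_000, 50_000, 5_000_000])
print([round(s, 3) for s in scaled])
```

After the transform, a very large investment no longer dominates the feature, while the ordering of investors by amount is preserved.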

Variable Selection:

Through the feature engineering process, several features are extracted from the raw data. The generated features contain valuable, previously undiscovered information that explains the nature of the trend in the study variable.

Classifier Design (Predictive Model):

The classifier model has been developed to predict the Length of Relationship for existing investors. The algorithm predicts, with high accuracy, the number of months after which an investor is likely to churn out. For example, if the algorithm predicts that a particular investor will stay in the system for 28 months, and the investor has completed 26 months so far, the investor is expected to churn out in another two months. These results will be generated on a monthly basis to identify the churning-out investors.
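The arithmetic in the example above, and the monthly identification of churning-out investors, can be sketched as follows. The function names, the three-month horizon, and the investor IDs are hypothetical and serve only to illustrate the logic.

```python
def months_until_churn(predicted_lor_months, months_completed):
    """Months left before the predicted churn-out; 0 if already past it."""
    return max(predicted_lor_months - months_completed, 0)


def flag_churning_investors(predictions, horizon_months=3):
    """Return IDs of investors expected to churn out within the horizon.

    predictions maps investor ID -> (predicted LOR, months completed).
    The horizon is an assumed parameter, not a value from the paper.
    """
    flagged = []
    for investor_id, (predicted, completed) in predictions.items():
        if months_until_churn(predicted, completed) <= horizon_months:
            flagged.append(investor_id)
    return sorted(flagged)


# Hypothetical monthly run; INV-001 mirrors the 28 vs. 26 month example.
monthly_run = {
    "INV-001": (28, 26),
    "INV-002": (60, 12),
    "INV-003": (24, 24),
}
print(months_until_churn(28, 26))
at_risk = flag_churning_investors(monthly_run)
print(at_risk)
```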

Figure 2: The transaction behavior of a particular investor

  • X-Axis: Nth Transaction (explains the transaction count)
  • Y-Axis: Number of Days (the difference between a particular transaction day and the first day of transaction)
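The series plotted in Figure 2 can be derived from an investor’s transaction dates as sketched below. The function name and the sample dates are hypothetical; the calculation follows the axis definitions given above.

```python
from datetime import date


def days_since_first(transaction_dates):
    """Pair each transaction count N with the days elapsed since the first one."""
    ordered = sorted(transaction_dates)
    if not ordered:
        return []
    first = ordered[0]
    return [(n, (d - first).days) for n, d in enumerate(ordered, start=1)]


# Hypothetical investor with three transactions in 2015.
series = days_since_first([date(2015, 1, 10), date(2015, 2, 10), date(2015, 4, 1)])
print(series)
```

Each pair is one point on the chart: the transaction count on the X-axis and the elapsed days on the Y-axis.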

Algorithm Overview

During the research phase of the product, we used a spectrum of machine learning techniques, such as neural networks, support vector machines, decision trees, and random forests, to analyze the fitness of a predictive algorithm for a particular segment of investors. In general, the type of classifier (predictive algorithm) depends on the nature of the data, the complexity of the problem, and other specific considerations. Our response variable has more than 90 levels, which cannot be handled by simple multi-level classifiers. Hence, we need a powerful classifier that can handle multiple levels with higher accuracy.

Deep learning based on neural network methods is often the algorithm of choice for complex predictive solutions, as deep learning algorithms perform efficiently on a number of diverse problems such as multi-level classification. Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.

The deep learning algorithms, which are based on neural networks, are illustrated through the working of a single neuron in a network. Modeled on the human nervous system, a neuron collates all its inputs and performs an operation on them; similarly, in the network, information is passed through several possible paths to identify the best fit for a certain type of input.

The classifier is designed in such a way that it discovers the heterogeneous nature of investors from the data. It takes input variables such as financial transactions, demographic factors, and scheme-level features, which are run through the intermediate (hidden) layers between input and output; these layers help the neural network learn the complicated relationships in the data. The final output is extracted from the previous two layers. For example, for a classification problem with 5 classes, the output layer will have 5 neurons.
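The flow from inputs through a hidden layer to an output layer with one neuron per class can be sketched as a single forward pass. This is a toy illustration, not the product’s network: the layer sizes, the ReLU and softmax functions, the random untrained weights, and the four input features are all assumptions.

```python
import math
import random


def dense(inputs, weights, biases):
    """Each neuron collates its inputs as a weighted sum plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw output scores into class probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights; a real model would learn these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
hidden_w, hidden_b = random_layer(4, 8, rng)  # 4 input features -> 8 hidden neurons
out_w, out_b = random_layer(8, 5, rng)        # 8 hidden neurons -> 5 output classes

features = [0.2, -1.3, 0.7, 1.1]  # hypothetical scaled investor features
hidden = relu(dense(features, hidden_w, hidden_b))
probs = softmax(dense(hidden, out_w, out_b))
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(len(probs), predicted_class)
```

With a response variable of more than 90 levels, as described earlier, the same structure would simply have one output neuron per LOR level.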

Delivery Framework

Karvy can act as a service provider and a reliable manpower provider, offering staff augmentation as well as managed services for key clients. By leveraging its resource strength in analytics and data management services, Karvy can provide its clients with need-based staffing or end-to-end service across the delivery cycle of major program rollouts.

About Authors:

  • Sarang Venukala is a senior data scientist with Karvy Analytics, where he combines his experience in machine learning, macroeconomic modeling, econometrics, and financial analytics to design and develop predictive and prescriptive analytics solutions, predominantly for the BFSI sector. He holds a Master of Science in Econometrics.

Karvy Analytics

115 Broadway, Suite 1506
New York, NY 10006
Tel: 212 267 4334
Fax: 212 267 4335
Registered Address
"Karvy House", 46 Avenue 4,
Street No. 1, Banjara Hills,
Hyderabad 500 034
Tel No:(+91-40) 23312454, 23320751
Fax No:(+91-40) 23311968