IDRC is in touch with a Mumbai-based multinational bank back office, which also serves as the technology, processing, and support arm of the global bank. A requirement for a consulting assignment has been shared with IDRC, for which the analyst needs to hold a Computer Science PhD and 4 years of related work experience.
This particular group is looking to use specialized modeling and statistical techniques to achieve objectives such as the following:
Case 1: Optimize server loads by analyzing user cycles and patterns (the peaks, and the maximum, minimum, and average load schedules), thus fitting more users into the system, reducing the number of required servers, and hence reducing costs.

Case 2: Improve the distribution/implementation of the global network architecture. A straight-through process involving fetching market data, placing the order, its execution, and registration/accounting may require information to travel across continents, several times over. Can this be improved, leading to lower latency and bandwidth usage, allowing the system to take on more traffic, and hence reducing costs?

Case 3: Optimize the size of in-memory databases and online analytical processing (OLAP) systems, trading off their size against their processing capabilities.
…understand such things at the application, system, and enterprise levels.
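As a rough illustration of the kind of analysis behind Case 1, the sketch below derives hourly concurrency from user sessions and estimates the server count from the peak. The session data, per-server capacity, and function names are hypothetical assumptions for illustration, not drawn from the assignment:

```python
from collections import Counter
from math import ceil

def servers_needed(sessions, capacity_per_server):
    """Estimate server count from hourly concurrency peaks.

    sessions: list of (start_hour, end_hour) user sessions, hours in 0..24.
    capacity_per_server: assumed max concurrent users per server.
    """
    load = Counter()
    for start, end in sessions:
        for hour in range(start, end):
            load[hour] += 1                      # concurrent users in this hour
    peak = max(load.values())
    avg = sum(load.values()) / len(load)
    return peak, avg, ceil(peak / capacity_per_server)

# Hypothetical usage pattern: two daily peaks plus a few always-on users.
sessions = [(9, 12)] * 40 + [(14, 18)] * 35 + [(0, 24)] * 5
peak, avg, n = servers_needed(sessions, capacity_per_server=20)
# peak -> 45 concurrent users, so 3 servers of capacity 20 suffice
```

Provisioning to the analyzed peak rather than to a naive per-user allocation is what lets more users share fewer servers.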
IDRC is looking for one analyst, who will be involved from the early stage of this long-term project. Exceptions to the education and work-experience requirements cannot be made.
Candidates meeting the requirements can expect top-decile-in-category remuneration. Candidates can write to us at email@example.com with a cover letter, mentioning their current remuneration. The cover letter is essential; without it, there is a strong chance the application will not be considered.
Candidates are strongly encouraged to create a descriptive profile on LinkedIn and use it to ‘follow’ IDRC on LinkedIn. This will keep you posted on our activities, including job posts. We regularly review the list of IDRC followers to match them with relevant promotions and campaigns. (To do so, click the button at the top right of the IDRC LinkedIn page.)
Note that due to capacity constraints, only short-listed candidates will be contacted.
Advances in Noisy Text Analytics
Knowing your customers has always been part and parcel of running a business, a natural consequence of living and working in a community. But big firms such as “e-tailers” and Airtel have no chance of knowing every single one of their customers. So the idea of gathering huge amounts of information and analysing it to pick out trends indicative of customers’ wants and needs — data mining — has long been trumpeted as a way to return to the intimacy of the small-town general store.
Labor-intensive manual text mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%) is currently stored as text, text mining is believed to have a high commercial potential value. Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.
The challenge of exploiting the large proportion of enterprise information that originates in “unstructured” form has been recognized for decades. It was recognized in the earliest definition of Business Intelligence (BI), as early as an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:
“…utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the ‘action points’ in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points.”
Yet as management information systems developed starting in the 1960s, and as BI emerged in the ’80s and ’90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in “unstructured” documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text Data Mining:
“For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.”
Hearst’s 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.
For many years, data mining’s claims were greatly exaggerated. Customer-loyalty cards, which allow retailers to gather information not just about what is selling, but who is buying it, sound like a great idea. Yet Safeway eliminated its customer-loyalty card when it realised it was gathering mountains of data without being able to use it. Then there was the famous story that Wal-Mart had discovered that sales of nappies (diapers) and beer were highly correlated, as young fathers dropped in at its stores on their way home from work to pick up supplies of the former, and decided to stock up on the latter at the same time. Wal-Mart, the story goes, then put the two items side-by-side on its shelves, and sales rocketed. Alas, the whole story is a myth, an illustration of data mining’s hypothetical possibilities, not the reality.
In recent years, however, improvements in both hardware and software, and the rise of the World Wide Web, have enabled data mining to start delivering on its promises. Richard Neale of Business Objects, a software company based in San Jose, California, tells the story of a British supermarket that was about to discontinue a line of expensive French cheeses which were not selling well. But data mining showed that the few people who were buying the cheeses were among the supermarket’s most profitable customers — so it was worth keeping the cheeses to retain their custom.
Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semi-structured information from noisy unstructured text data. While text analytics is a mature and growing field of great value because of the huge amounts of data being produced, the processing of noisy text is gaining importance because many common applications produce noisy text data. Noisy unstructured text data is found in informal settings such as online chat, text messages, e-mails, message boards, newsgroups, blogs, wikis, and web pages. Text produced by processing spontaneous speech with automatic speech recognition, or printed and handwritten text with optical character recognition (OCR), also contains processing noise. Text produced under such circumstances is typically highly noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing letter-case information, pause-filling words such as “um” and “uh”, and other texting and speech disfluencies. Such text can be seen in large amounts in contact centers, chat rooms, OCR of text documents, short message service (SMS) text, etc. Documents in historical language can also be considered noisy with respect to today’s knowledge of the language; such text carries important historical, religious, and ancient medical knowledge. The nature of the noisy text produced in all these contexts warrants moving beyond traditional text analysis techniques.
Techniques for noisy text analysis
Missing punctuation and the use of non-standard words can often hinder standard natural language processing tools such as part-of-speech tagging and parsing. Techniques both to learn from noisy data and to process it are only now being developed.
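One basic technique of this kind is dictionary-based spelling correction: map each out-of-vocabulary token to its nearest in-vocabulary word by edit distance. A minimal sketch follows; the vocabulary and distance threshold are illustrative assumptions, and real systems would use frequency and context as well:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(token, vocabulary, max_dist=2):
    """Map a noisy token to its closest in-vocabulary word, if close enough."""
    if token in vocabulary:
        return token
    best = min(vocabulary, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else token

# Hypothetical domain vocabulary, e.g. from a contact-center glossary.
vocab = {"delivery", "order", "refund", "account"}
correct("delivry", vocab)   # -> "delivery"
```

The threshold keeps genuinely unknown tokens (names, product codes) from being forced onto a dictionary word.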
Sources of noisy text data

Contact centers: This is a general term for help desks, information lines, and customer service centers operating in domains ranging from computer sales and support to mobile phones to apparel. On average, a person in the developed world interacts at least once a week with a contact center agent, and a typical agent handles over a hundred calls per day. Contact centers operate in various modes such as voice, online chat, and e-mail, and the industry produces gigabytes of data in the form of e-mails, chat logs, voice conversation transcriptions, customer feedback, etc. The bulk of contact center data is voice conversations; transcribing these with state-of-the-art automatic speech recognition results in text with a 30-40% word error rate. Even written modes of communication, such as online chat between customers and agents and interactions over e-mail, tend to be noisy. Analysis of contact center data is essential for customer relationship management, customer satisfaction analysis, call modeling, customer profiling, agent profiling, etc., and it requires sophisticated techniques to handle poorly written text.
Printed documents: Many libraries, government organizations, and national defence organizations have vast repositories of hard-copy documents. To retrieve and process their content, these documents must be processed using optical character recognition. In addition to printed text, they may also contain handwritten annotations. OCRed text can be highly noisy depending on the font size, print quality, etc.; word error rates range from 2-3% to as high as 50-60%. Handwritten annotations are particularly hard to decipher, and error rates can be quite high in their presence.
Short Message Service (SMS): Language usage in computer-mediated discourse, such as chats, e-mails, and SMS texts, differs significantly from the standard form of the language. The urge towards shorter messages that are faster to type, combined with the need for semantic clarity, shapes the structure of this non-standard form, known as the texting language.
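A common first step for texting language is lookup-based normalization, combined with squeezing the exaggerated letter runs typical of SMS. The abbreviation table below is a small illustrative assumption; production systems learn such mappings from parallel noisy/clean data rather than hand-coding them:

```python
import re

# Hypothetical abbreviation table; real systems learn these from data.
TEXTING = {"u": "you", "r": "are", "gr8": "great",
           "pls": "please", "2moro": "tomorrow"}

def normalize_sms(message):
    """Expand known texting-language tokens and squeeze letter runs."""
    out = []
    for tok in message.lower().split():
        tok = re.sub(r"(.)\1{2,}", r"\1\1", tok)   # "sooo" -> "soo"
        out.append(TEXTING.get(tok, tok))
    return " ".join(out)

normalize_sms("pls call me 2moro u r gr8")
# -> "please call me tomorrow you are great"
```

The normalized output can then be fed to standard NLP tools (taggers, parsers) that expect conventional spelling.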
At IDRC we employ market-leading models, tools, and techniques to achieve tangible returns on analytics. For information, write to us at firstname.lastname@example.org.
IDRC is building a game-theoretical decision-support framework for consumer good segment clients. This framework provides an analytical backing and a quantitative measurement to the decisions and predicts their impact in competitive scenarios.
The model includes the following marketing levers:
new product launch in an existing product category.
Imagine a scenario where IDRC’s client, an FMCG (fast-moving consumer goods) company, is faced with a competitor’s action (the use of one of the above marketing levers) in a particular product category.
IDRC’s game theory framework calibrates to historical data and works out our client’s optimal response to the competitor’s action. We answer questions like which combinations of the above marketing levers are the best responses with respect to the following objectives:
maximizing profit over a time horizon for the product category, or
maximizing market share.
Further, it works out the equilibrium-level prediction. For example, in the case of a price war, what is the expected equilibrium price?
The basic analytical model can also be run through numerical techniques to simulate more complex scenarios. Techniques such as agent-based models can be integrated to simulate and visualize convergence to the outcomes.
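As an illustrative sketch of finding such an equilibrium numerically, the code below iterates mutual best responses in a simple two-firm price war with linear differentiated demand. The demand form and all parameter values are assumptions for illustration, not IDRC’s calibrated client model:

```python
def best_response(p_rival, a=10.0, b=2.0, c=1.0, cost=1.0):
    """Profit-maximizing price against the rival's price, assuming
    linear demand q_i = a - b*p_i + c*p_rival and unit cost `cost`.
    Setting d/dp_i of (p_i - cost)*q_i to zero gives this formula."""
    return (a + c * p_rival + b * cost) / (2 * b)

def price_war_equilibrium(p0=5.0, p1=5.0, tol=1e-9):
    """Iterate mutual best responses until prices stop moving,
    i.e. until neither firm wants to deviate (a Nash equilibrium)."""
    while True:
        n0, n1 = best_response(p1), best_response(p0)
        if abs(n0 - p0) < tol and abs(n1 - p1) < tol:
            return n0, n1
        p0, p1 = n0, n1

price_war_equilibrium()
# with these parameters both prices converge to 4.0,
# matching the closed form (a + b*cost) / (2b - c) = 12 / 3
```

The same iteration generalizes to richer agent-based settings where best responses are computed by simulation rather than in closed form.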
IDRC is an analytics, quant-modeling, and data-modeling focused, industry-agnostic consulting firm. For more information, write to us at email@example.com. For frequent updates, ‘follow’ us on LinkedIn.