Consider these recent headlines:

  1. 13th Aug 2017 – Elon Musk-backed OpenAI’s bot defeats champion in Defense of the Ancients 2 (Dota 2)
  2. 15th Mar 2016 – Google’s AlphaGo AI beats world champion in Go

These headlines point to the massive proliferation of Artificial Intelligence in general and Machine Learning in particular in our everyday world.

What is Machine Learning?

At a very high level, Machine Learning is the ability of computers to learn from data rather than being explicitly programmed for each task. They do so using algorithms that analyze large amounts of data to find patterns and make predictions, including predictions about human behavior. Everyday examples include spam filters that keep junk mail out of your inbox, Amazon recommending products based on your interests and browsing history, and speech recognition in Google Now.

Work on adding human-like intelligence to machines goes back to the 1940s, with Alan Turing’s code-breaking machines and his 1948 report “Intelligent Machinery”. Progress over the following decades was uneven, and the field only reached mainstream adoption around 2015. By then, fast processors, falling bandwidth costs, better storage, and the explosion of big data meant that the ground was ripe for Artificial Intelligence to flourish.

Deep learning within machine learning

Deep learning is a subset of machine learning. It typically requires very large volumes of data and a great deal of processing power to deliver results. In many cases it is used alongside other machine learning techniques. For example, a security breach detection program that needs to reduce its number of false positives can use both.

Deep learning performs its computations with complex, multi-layered neural networks, which can further improve the practical value of AI applications.

Use case of Machine Learning in action – Text (Document) Classification

Company profile

The client is a leading global information services and publishing company. It is more than 180 years old and employs 15,000 people globally.

The problem

The company’s growing volume of publications prompted it to contact CIGNEX Datamatics to automate the classification of all publications within the organization. The solution needed to scale with future demand, be accurate, and keep delivering high performance over time. The company wanted a self-learning system that could reach a classification accuracy of at least 80%. The solution also needed to integrate with the existing systems and be offered end to end.

The solution construct

a) Execution flow

  1. Data acquisition: The first step was acquiring the right set of data. This used external RBS filings and data extracted from XML files.
  2. Parsing: The next step prepared the data by parsing filter tags and metadata. It also did basic pre-processing (noise word removal, tokenization, and stemming) to get the data ready.
  3. External feature: The third step introduced the external feature set, i.e. business rules and business analysis, so that the solution performs as stakeholders expect.
  4. Hypothesis/modeling: Once the above three steps are done, the model is built and tested.
  5. Unlabeled doc classification: The model is applied to unlabeled documents, which gives insight into how to classify them correctly.
  6. Evaluation: In this stage of hypothesis testing, the model’s accuracy, or its difference from the existing classification, is measured.
  7. Optimization: With the above in place, it becomes easier to see whether the model is delivering the expected outcomes or needs further fine-tuning. This also verifies the accuracy of the model selected in step 4.
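
To make the pre-processing in step 2 more concrete, here is a minimal Python sketch of noise word removal, tokenization, and stemming. It is an illustration only, not the code used in the project: the stop-word list, the suffix rules, and the function names are all assumptions, and a production system would more likely rely on an established stemmer.

```python
import re

# Hypothetical stop-word ("noise word") list; a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on"}

# Suffixes removed by this toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop noise words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


sample = "The company is filing audited reports for the regulators"
print(preprocess(sample))
# ['company', 'fil', 'audit', 'report', 'regulator']
```

The crude stem "fil" shows why simple suffix stripping is only a starting point: it is good enough to group word variants together, but the stems are not always readable words.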

b) Technology stack and components used

  1. Custom coding and Apache Tika, used for:
  • Identifying documents
  • Converting XML data into text
  • Processing header data
  2. DL4J, Naïve Bayes, TensorFlow, and Decision Tree, used for:
  • Classifying with neural networks
  • Providing a training set
  • Creating a model and testing it for relevance and accuracy
  • Tuning the model with the available parameters
  • Validating the output against the model, fine-tuning, and iterating further
  3. Reviewer and custom programming, used for:
  • Capturing and assessing the accuracy of the results obtained from the model
  • Reviewing outliers and tuning the model
  • Classifying the unclassified documents through the external feature set
  • Enhancing the training set
  • Reviewing the performance trend
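
As a rough illustration of one of the classifiers named above, the sketch below implements a small multinomial Naïve Bayes text classifier in plain Python. It is written from scratch for readability and does not use the DL4J or TensorFlow APIs; the class name, the category labels, and the tiny training set are hypothetical and chosen only to show the train-then-predict cycle.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        """Count documents per class and word occurrences per class."""
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, tokens):
        """Return the class with the highest log-probability for the tokens."""
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocabulary)
        for label, doc_count in self.class_doc_counts.items():
            # log P(class) + sum over tokens of log P(token | class)
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, already pre-processed training documents and their categories.
training_docs = [
    ["quarterly", "earnings", "revenue", "profit"],
    ["annual", "revenue", "dividend", "profit"],
    ["court", "ruling", "appeal", "statute"],
    ["statute", "regulation", "court", "compliance"],
]
training_labels = ["finance", "finance", "legal", "legal"]

model = NaiveBayesTextClassifier().fit(training_docs, training_labels)
print(model.predict(["revenue", "profit", "forecast"]))  # finance
print(model.predict(["court", "appeal"]))                # legal
```

The smoothing term keeps an unseen word such as "forecast" from driving a class probability to zero, which matters when new publications contain vocabulary the training set never saw.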

Key takeaways

  • The use case must be genuinely suited to Machine Learning
  • An in-depth understanding of the data set is needed
  • When building a Machine Learning solution, it is important to keep the larger picture of the overall solution in focus
  • The accuracy of the training data set is very important for building a good model
  • Optimal model performance can be observed at a 60:40 or 70:30 ratio of training set to test set
  • Pre-processing and rigorous cleaning of the data are necessary
  • Scalability needs to be part of the solution architecture for the long-term efficacy of the solution
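
The training-to-test ratio and the 80% accuracy target mentioned earlier can be sketched in a few lines of Python. This is a hedged example under stated assumptions: the function names, the fixed random seed, and the sample labels are hypothetical, and the numbers are not results from the project.

```python
import random


def split_dataset(examples, train_fraction=0.7, seed=42):
    """Shuffle a copy of the examples and split it into train and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    if not expected:
        raise ValueError("expected labels must not be empty")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Hypothetical labeled examples: (document id, category).
examples = [(f"doc-{i}", "finance" if i % 2 else "legal") for i in range(10)]

train_set, test_set = split_dataset(examples, train_fraction=0.7)
print(len(train_set), len(test_set))  # 7 3

# Hypothetical predictions compared with the existing classification.
predicted = ["finance", "legal", "legal", "finance", "legal"]
expected = ["finance", "legal", "finance", "finance", "legal"]
score = accuracy(predicted, expected)
print(score, score >= 0.80)  # 0.8 True
```

Passing train_fraction=0.6 gives the 60:40 split instead, so both ratios can be compared on the same data with the same seed.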

Interested in learning more about how big data analytics and machine learning from CIGNEX Datamatics can improve your operational efficiency? Connect with us for a quick consultation.