For large enterprises, analysing logs on a daily basis is a difficult task. Generally, developers and management analyse the logs manually, on an as-needed basis.
If logs could be analysed up front and all stakeholders received a preventive report based on that analysis, it would help them prevent errors and act on issues in time.
More often than not, this is done with tools like Graylog2 rather than pig+hadoop.
Hi, don't you think that handling terabytes/petabytes of logs (which a company like Amadeus, which owns a GDS, and others have) in a cluster environment, using MapReduce so that execution time is lower, is a proper use case?
Later on, Titan and Faunus could be used to enhance this model and make it more useful. But maybe I am wrong, or maybe I am thinking in the wrong direction, so any further inputs/ideas are most welcome.
A lot of post-mortem analysis and large-scale ETL happens on hadoop and pig - which is what pig is meant for in the y! context. The real trouble with pig+hadoop is that they handle only moderately unstructured data, still operating on row+column style inputs.
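To make the "row+column style inputs" point concrete, here is a minimal sketch of the kind of batch job a Pig script (LOAD, FILTER, GROUP, FOREACH...GENERATE, STORE) compiles down to: a Hadoop MapReduce job that counts ERROR lines per hour. The tab-separated log layout (timestamp, level, message) and the paths are assumptions for illustration only.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlyErrorCount {

    // Assumes log lines shaped as "<ISO timestamp>\t<level>\t<message>" (hypothetical format).
    public static class ErrorMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text hour = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split("\t", 3);
            if (cols.length == 3 && "ERROR".equals(cols[1])) {
                // Keep "yyyy-MM-ddTHH" as the grouping key.
                hour.set(cols[0].length() >= 13 ? cols[0].substring(0, 13) : cols[0]);
                context.write(hour, ONE);
            }
        }
    }

    // Sums the per-hour counts; also usable as a combiner.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hourly-error-count");
        job.setJarByClass(HourlyErrorCount.class);
        job.setMapperClass(ErrorMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/2013-08-01
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /reports/errors
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

That per-hour report is exactly the post-mortem, batch-oriented shape of output: useful for reporting, but it only tells you about the error burst after the job has run.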
Most of the preventive, up-front analysis of unstructured data is covered by Splunk on the commercial side (which will use hadoop YARN inside, very soon) and Graylog2/Elasticsearch on the open-source side. I have seen tools like Esper used to do complex event processing on incoming logs immediately, while hdfs spools them onto disk for deeper inspection later.
Batch systems always come second to CEP layers or index+search layers with alerting mechanisms, particularly when time is of the essence.
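As a rough illustration of that up-front CEP + alerting idea (a sketch, not anyone's production setup), here is what it looks like against the pre-7.x Esper client API; the LogEvent bean, the 100-errors-per-minute threshold, and the println "alert" are all hypothetical stand-ins.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

public class LogAlerting {

    // Illustrative event bean for a parsed log line.
    public static class LogEvent {
        private final String level;
        private final String message;
        public LogEvent(String level, String message) {
            this.level = level;
            this.message = message;
        }
        public String getLevel() { return level; }
        public String getMessage() { return message; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("LogEvent", LogEvent.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Fire whenever more than 100 ERROR events arrive within a sliding 60-second window.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
            "select count(*) as errors from LogEvent(level='ERROR').win:time(60 sec) "
            + "having count(*) > 100");

        stmt.addListener(new UpdateListener() {
            @Override
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                // A real deployment would page someone or post to a dashboard here,
                // while the raw lines are still spooled to HDFS for later batch analysis.
                System.out.println("ALERT: error burst, count=" + newEvents[0].get("errors"));
            }
        });

        // Events would normally come from a log-shipping pipeline; this loop is a stand-in.
        for (int i = 0; i < 200; i++) {
            engine.getEPRuntime().sendEvent(new LogEvent("ERROR", "example failure " + i));
        }
    }
}
```

The alert fires while the burst is still happening, which is the "act on issues in time" part; the batch job above then answers the slower, deeper questions afterwards.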