Building Blocks Technologies Proprietary
Java Developer -- Web Crawling and More!
Objective of this effort -- Automated collection of data from the following sources:
Job Postings
White Papers
Websites from companies/projects in the Blockchain space
Using Crunchbase data as a guide
Using Google search results
Websites for companies listed on Etherscan with ERC-20-based projects
Current status:
The basic crawler, as checked into Git, works "OK" for Job Postings. Several enhancements and fixes are still needed, as listed in the following Task List:
Fix encoding
Get proxy option working in an automated fashion
Fix PDF parsing issue
Enhance to include automated assessment and summary of performance
Develop a one-off solution for pages that require JavaScript
Parameterize crawl for, and livecareer to filter good data from noise
Develop a cloud-based deployment and execution solution
Integrate the NLP/ML solution into the main codebase to help filter noise and control crawler behavior
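The "Fix encoding" task does not say what is going wrong, so the sketch below is only a guess at one common cause: page bytes being decoded with a default charset instead of the one the server declares. It reads the charset from a Content-Type header value and falls back to UTF-8. The class and method names (CharsetResolver, fromContentType, decode) are hypothetical and are not taken from the existing crawler code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the existing crawler: picks a charset
// from a Content-Type header value such as "text/html; charset=ISO-8859-1".
final class CharsetResolver {

    // Returns the declared charset, or UTF-8 when none is declared
    // or the declared name is invalid or unsupported on this JVM.
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw response body using the resolved charset.
    static String decode(byte[] body, String contentType) {
        return new String(body, fromContentType(contentType));
    }
}
```

Pages that declare their charset only in an HTML meta tag, or not at all, would need a further detection step that this sketch does not attempt.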
The basic crawler is written in Java and started from the Crawler4J project on GitHub
Extensive modifications were made to Crawler4J to control crawler behavior
The current state of the code can be found here
NLP rules, blacklists and whitelists have been incorporated into the crawler to try to ensure the final data is "desirable"
End-to-end execution of the full batch of crawling instructions is currently controlled via a Bash script
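As a rough illustration of the blacklist/whitelist idea mentioned above, the following minimal sketch decides whether a URL should be crawled. It assumes the whitelist holds allowed host names and the blacklist holds substrings that mark a URL as noise. The actual rules in the crawler may be structured quite differently, and the class name UrlFilter and its method are hypothetical.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter, not the crawler's real implementation.
final class UrlFilter {
    private final Set<String> whitelistHosts;  // assumed: exact host names to allow
    private final Set<String> blacklistTerms;  // assumed: lowercase substrings to reject

    UrlFilter(Set<String> whitelistHosts, Set<String> blacklistTerms) {
        this.whitelistHosts = whitelistHosts;
        this.blacklistTerms = blacklistTerms;
    }

    // True only if the host is whitelisted and no blacklist term appears in the URL.
    boolean shouldVisit(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null || !whitelistHosts.contains(host.toLowerCase(Locale.ROOT))) {
            return false;
        }
        String lower = url.toLowerCase(Locale.ROOT);
        for (String term : blacklistTerms) {
            if (lower.contains(term)) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this could be called from the crawler's per-URL visit decision, so that rejected pages are never fetched rather than being filtered after download.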