Building Blocks Technologies Proprietary
Java Developer -- Web Crawling and More!
Objective of this effort -- Automated collection of data from the following sources:
- Job Postings
- White Papers
- Websites from companies/projects in the Blockchain space
  - Using Crunchbase data as a guide
  - Using results of Google search
- Websites for companies listed on Etherscan with ERC-20 based projects
Current status:
The basic crawler, as checked into Git, works "OK" for Job Postings. A number of enhancements and fixes are still needed, as listed in the Tasks below.
Tasks:
- Fix encoding
- Get the proxy option working in an automated fashion
- Fix the PDF parsing issue
- Enhance the crawler to include automated assessment and summary of performance
- Develop a one-off solution for pages that require JavaScript
- Parameterize the crawl for recruit.net, indeed.com and livecareer to filter good data from noise
- Develop a cloud-based deployment and execution solution
- Integrate the NLP/ML solution into the code base proper to help filter noise and control crawler behavior
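As an illustration of the "Fix encoding" task: the brief does not say what the encoding problem actually is, so the sketch below assumes one common cause, namely decoding every page with a fixed charset instead of the one the server declares. The class name CharsetResolver and its methods are hypothetical and are not part of Crawler4J or the existing code base.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper (not from the existing code base): picks the charset
// declared in an HTTP Content-Type header and falls back to UTF-8.
final class CharsetResolver {
    private CharsetResolver() {}

    /** Returns the declared charset, or UTF-8 if none is declared or it is unsupported. */
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes a raw response body using the charset declared in the header. */
    static String decode(byte[] body, String contentType) {
        return new String(body, fromContentType(contentType));
    }
}
```

A page served as "text/html; charset=ISO-8859-1" would then be decoded as Latin-1 rather than UTF-8. Pages that declare their charset only in an HTML meta tag would need an additional check that this sketch does not cover.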
Notes:
- The basic crawler is written in Java and started from Crawler4J, taken from GitHub
- Many modifications were made to Crawler4J to control crawler behavior
- The current state of the code can be found here
- NLP rules, blacklists and whitelists have been incorporated into the crawler to try to ensure the final data is "desirable"
- End-to-end execution of the full batch of crawling instructions is currently controlled via a Bash script
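The blacklist/whitelist idea mentioned in the notes above can be sketched roughly as follows. This is a minimal illustration and not the project's actual filter: the class name UrlFilter, its constructor arguments and the example patterns are all assumptions. In a Crawler4J-based crawler, a check like this would typically be called from the shouldVisit override of the WebCrawler subclass.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical filter (not the project's real one): a URL is visited only if
// its host is whitelisted and no blacklist pattern matches the URL.
final class UrlFilter {
    private final List<String> allowedHosts;
    private final List<Pattern> blockedPatterns;

    UrlFilter(List<String> allowedHosts, List<Pattern> blockedPatterns) {
        this.allowedHosts = new ArrayList<>(allowedHosts);
        this.blockedPatterns = new ArrayList<>(blockedPatterns);
    }

    boolean shouldVisit(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);

        // Whitelist: exact host or any subdomain of an allowed host.
        boolean allowed = false;
        for (String a : allowedHosts) {
            if (h.equals(a) || h.endsWith("." + a)) {
                allowed = true;
                break;
            }
        }
        if (!allowed) {
            return false;
        }

        // Blacklist: reject if any pattern is found anywhere in the URL.
        for (Pattern p : blockedPatterns) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a filter built with the allowed host "indeed.com" and a blocked pattern "/login" would accept https://www.indeed.com/viewjob?jk=123 and reject both https://www.indeed.com/login and any URL on another domain. Checking the whitelist before the blacklist keeps the more expensive regex matching off URLs that would be dropped anyway.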