Java Developer -- Web Crawling and More!

Objective of this effort -- Automated collection of data from the following sources:

  1. Job Postings

  2. White Papers

  3. Websites from companies/projects in the Blockchain space

    • Using Crunchbase data as a guide

    • Using Google search results

  4. Websites for companies listed on Etherscan with ERC-20 based projects

Current status:

The basic crawler, as checked into Git, works "OK" for Job Postings. A number of enhancements and fixes are still needed, as listed in the following Task List.


  1. Fix encoding

  2. Get proxy option working in an automated fashion

  3. Fix PDF parsing issue

  4. Add automated assessment and summary of performance

  5. Develop a one-off solution for pages that require JavaScript

  6. Parameterize crawl for, and livecareer to filter good data from noise

  7. Develop a cloud-based deployment and execution solution

  8. Integrate the NLP/ML solution into the main code base to help filter noise and control crawler behavior


  • The basic crawler is written in Java and is based on Crawler4J from GitHub

  • Many modifications were made to Crawler4J to control crawler behavior

  • The current state of the code can be found here

  • NLP rules, blacklists, and whitelists have been incorporated into the crawler to help ensure the final data is "desirable"

  • End-to-end execution of the full batch of crawling instructions is currently controlled via a Bash script
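
For candidates unfamiliar with this kind of filtering, the blacklist/whitelist idea mentioned above can be sketched roughly as follows. This is only an illustration and not the project's actual code: the class name UrlFilter, its constructor arguments, and the sample patterns are all hypothetical. In a Crawler4J-based crawler, a check like this would typically be called from a WebCrawler subclass's shouldVisit override.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the project: decides whether a URL
// should be crawled based on regex blacklists and whitelists.
final class UrlFilter {
    private final List<Pattern> whitelist;
    private final List<Pattern> blacklist;

    UrlFilter(List<String> whitelistRegexes, List<String> blacklistRegexes) {
        this.whitelist = compileAll(whitelistRegexes);
        this.blacklist = compileAll(blacklistRegexes);
    }

    // Compile each regex once, case-insensitively, so matching is cheap per URL.
    private static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    // Blacklist wins over whitelist. An empty whitelist means
    // "allow anything that is not blacklisted".
    boolean shouldVisit(String url) {
        for (Pattern p : blacklist) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        if (whitelist.isEmpty()) {
            return true;
        }
        for (Pattern p : whitelist) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }
}
```

With a whitelist of "/jobs/" and a blacklist of "\\.(css|js|png)$", a URL such as https://example.com/jobs/123 would be accepted, while https://example.com/jobs/style.css and https://example.com/about would be rejected. Letting the blacklist take precedence is one possible design choice; the real crawler may order these checks differently.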

