
Building Blocks Technologies Proprietary

Java Developer -- Web Crawling and More!

Objective of this effort -- Automated collection of data from the following sources:

  1. Job Postings

  2. White Papers

  3. Websites from companies/projects in the Blockchain space

    • Using Crunchbase data as a guide

    • Using the results of Google searches

  4. Websites for companies listed on Etherscan with ERC-20 based projects

Current status:

The basic crawler, as checked into Git, works "OK" for Job Postings. A number of enhancements and fixes are still needed, as listed in the Task List below.

Tasks:

  1. Fix encoding

  2. Get proxy option working in an automated fashion

  3. Fix PDF parsing issue

  4. Enhance the crawler to include an automated assessment and summary of its performance

  5. Develop a one-off solution for pages that require JavaScript

  6. Parameterize the crawl for recruit.net, indeed.com and livecareer to separate good data from noise

  7. Develop a cloud-based deployment and execution solution

  8. Integrate the NLP/ML solution into the code-base-proper to help filter noise and control crawler behavior
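
As an illustration of Task 1, the sketch below shows one common way to avoid garbled text when decoding fetched pages: read the charset from the HTTP Content-Type header and fall back to UTF-8 when none is usable. The brief does not say what the actual encoding problem is, so this is only an assumption about the cause. The class and method names (CharsetSketch, charsetFrom, decode) are hypothetical and not taken from the existing code base.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the existing crawler.
public class CharsetSketch {
    // Matches the charset parameter in a header such as "text/html; charset=ISO-8859-1".
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in the Content-Type value, or UTF-8 if it is
    // missing, malformed, or not supported by this JVM.
    public static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: fall through to the default.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw response body with the declared (or fallback) charset
    // instead of relying on the platform default.
    public static String decode(byte[] body, String contentType) {
        return new String(body, charsetFrom(contentType));
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1, "text/html; charset=ISO-8859-1"));
    }
}
```

Pages that declare their charset only in an HTML meta tag, or that declare it incorrectly, would need an additional detection step that this sketch does not cover.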

Notes:

  • The basic crawler is written in Java and started from the Crawler4J project on GitHub

  • Lots of modifications were made to Crawler4J to control crawler behavior

  • The current state of the code can be found here

  • NLP rules, blacklists and whitelists have been incorporated into the crawler to try to ensure the final data is "desirable"

  • End-to-end execution of the full batch of crawling instructions is currently controlled via a Bash script
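
To make the whitelist/blacklist idea in the notes above concrete, here is a minimal sketch of a URL filter in plain Java. It is not the project's actual rule set: the class name UrlFilterSketch, the host whitelist, and the blocked URL fragments are all hypothetical. In a Crawler4J-based crawler, a check like this would typically be called from the crawler's shouldVisit override.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical filter, not part of the existing crawler.
public class UrlFilterSketch {
    private final List<String> allowedHosts;     // whitelist of registered domains
    private final List<String> blockedFragments; // blacklist of URL substrings

    public UrlFilterSketch(List<String> allowedHosts, List<String> blockedFragments) {
        this.allowedHosts = allowedHosts;
        this.blockedFragments = blockedFragments;
    }

    // A URL is visited only if its host is whitelisted (exact match or subdomain)
    // and the URL contains none of the blacklisted fragments.
    public boolean shouldVisit(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        if (host == null) {
            return false;
        }
        String lowerHost = host.toLowerCase(Locale.ROOT);
        boolean allowed = false;
        for (String domain : allowedHosts) {
            if (lowerHost.equals(domain) || lowerHost.endsWith("." + domain)) {
                allowed = true;
                break;
            }
        }
        if (!allowed) {
            return false;
        }
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        for (String fragment : blockedFragments) {
            if (lowerUrl.contains(fragment)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Example lists only; the real lists live in the project's configuration.
        UrlFilterSketch filter = new UrlFilterSketch(
                Arrays.asList("indeed.com", "recruit.net"),
                Arrays.asList("/login", "/account"));
        System.out.println(filter.shouldVisit("https://www.indeed.com/jobs?q=blockchain"));
        System.out.println(filter.shouldVisit("https://www.indeed.com/login"));
    }
}
```

Keeping the lists as constructor arguments rather than constants would let the Bash driver script pass different lists per site, which fits the goal of parameterizing the crawl.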
