
digikey parts slurper

fetch http://www.digikey.com/product-search/en?FV= (the top-level product-search page)

grep the output for catfilterlink, then for each matching line:

remove the beginning of the line up to and including the opening href quote

remove from the closing quote (inclusive) to the end of the line

this produces the list of subsection URLs
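The grep/trim steps above can be sketched with grep and sed. This is a dry run against an inline sample of the markup; the real page is the saved output of the fetch step, and catfilterlink is the CSS class Digi-Key used at the time:

```shell
# Hypothetical sample of the relevant markup; the real page has many
# such lines carrying the "catfilterlink" class.
cat > sample.html <<'EOF'
<a class="catfilterlink" href="/product-search/en/capacitors/ceramic/131083">Ceramic Capacitors</a>
<a class="catfilterlink" href="/product-search/en/resistors/chip/131084">Chip Resistors</a>
EOF

# keep only the href target: delete up to and including the opening quote,
# then delete from the closing quote (inclusive) to end of line
grep catfilterlink sample.html \
  | sed -e 's/^.*href="//' -e 's/".*$//' > urls.txt

cat urls.txt
```

Each line of urls.txt is then a subsection URL to crawl.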
grabbing FVs

We need the FVs to crawl each subsection. Fetch all of the URLs gathered above with Results per Page set to 500; the CSV download is capped at 500 results per fetch, so there is no point increasing this value.

  • <input type="hidden" name="FV" value="fff40000,fff80000">

also grab the total page count

  • <a class="Last" href="/product-search/en/undefined-category/undefined-family/0/page/8">Last</a>

The /page/8 suffix is the total page count; pages start from 1.

Grab the FV value and page count, and store them for each of the above URLs.
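Pulling both values out of a saved subsection page can be sketched the same way; the sample markup below is inline for illustration, and the attribute values vary per category:

```shell
# Hypothetical saved page containing the two lines we care about.
cat > page.html <<'EOF'
<input type="hidden" name="FV" value="fff40000,fff80000">
<a class="Last" href="/product-search/en/undefined-category/undefined-family/0/page/8">Last</a>
EOF

# FV value: strip up to value=" then from the closing quote onward
FV=$(grep 'name="FV"' page.html | sed -e 's/^.*value="//' -e 's/".*$//')

# page count: everything after /page/ up to the closing quote
PAGES=$(grep 'class="Last"' page.html | sed -e 's#^.*/page/##' -e 's/".*$//')

echo "$FV $PAGES"
```

Store the FV/page-count pair per URL; both are needed for the crawl loop.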

crawl individual pages

curl with a valid user agent. I used --user-agent "Chrome/1.0", but vary it to avoid rate limiters.

curl.exe -o page%1.csv -L -v -G "http://www.digikey.com/product-search/download.csv?FV=fff40008%2Cfff801b9&mnonly=0&newproducts=0&ColumnSort=0&page=%1&stock=0&pbfree=0&rohs=0&quantity=0&ptm=0&fid=0&pageSize=500" --digest --user-agent "Chrome/1.0"
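To fetch every page of a category, loop from 1 to the stored page count. This dry-run sketch only prints one curl command per page into fetch.sh (pipe or source it to actually run them); the FV and page count are example values, the sleep between fetches is an assumption to stay under rate limiters, and whether the remaining query parameters from the command above can really be dropped is untested:

```shell
FV="fff40008%2Cfff801b9"   # comma already URL-encoded as %2C
PAGES=3                    # from the "Last" link for this category
BASE="http://www.digikey.com/product-search/download.csv"

p=1
while [ "$p" -le "$PAGES" ]; do
  echo "curl -o page$p.csv -L -G '$BASE?FV=$FV&page=$p&pageSize=500'" \
       "--user-agent 'Chrome/1.0' && sleep 2"
  p=$((p + 1))
done > fetch.sh

cat fetch.sh
```

Generating the script first also makes it easy to resume a partial crawl by deleting already-completed lines.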

The response has 4 bytes at the front that we don't want, so strip them with a simple byteskip program:

 

#include <stdio.h>
#include <stdlib.h>

int main(int argc,char*argv[])
{
	FILE *fp,*ofp;

	if( argc < 4 ) {
		fprintf(stderr,"usage: %s infile outfile offset\n",argv[0]);
		exit(-1);
	}

	fp =fopen( argv[1],  "rb");
	if( fp == NULL ) {
		fprintf(stderr,"Couldn't open input file %s\n",argv[1]);
		exit(-2);
	}

	unsigned long length ;

	fseek(fp,0,SEEK_END);

	length = ftell( fp ) ;


	if( length == 0 ) {
		
		fclose( fp );

		fprintf(stderr,"zero length file %s\n",argv[1]);
		exit(-3);
	}

	unsigned long offset;

	//skip offset
	offset = strtoul (argv[3], NULL, 0);

	if( offset >= length ){
		
		fclose( fp );

		fprintf(stderr,"offset %lu is outside the length of file %s\n", offset, argv[1]);
		exit(-5);
	}

	// set to skip position
	fseek(fp,offset,SEEK_SET);

	unsigned char *buffer = NULL;

	buffer = (unsigned char *)malloc( length - offset );
	if( buffer == NULL ) {
		
		fclose(fp);

		fprintf(stderr,"Couldn't allocate output buffer of %lu bytes\n", length - offset );
		exit(-6);
	}

	// read the remainder of the file into the buffer.
	if( fread(buffer,1,length - offset ,fp ) != (length-offset) ) {
		fclose(fp);
		free( buffer );
		fprintf(stderr,"Couldn't read input file %s\n", argv[1] );
		exit(-7);
	}

	// open output file for writing.
	ofp = fopen( argv[2],  "wb");
	
	if( ofp == NULL ) {
		fclose(fp);
		
		free( buffer );
		buffer = NULL;
		fprintf(stderr,"Couldn't open output file %s\n",argv[2]);
		exit(-8);
	}

	if( fwrite(buffer,1,length-offset,ofp) != (length-offset) ) {
		fclose(fp);
		fclose(ofp);
		free( buffer );
		fprintf(stderr,"Couldn't write output file %s\n", argv[2]);
		exit(-9);
	}

	free( buffer );

	fclose(fp);
	fclose(ofp);


	return 0;
}

Process all the files (from a Windows cmd prompt):

for %a in (*.csv) do byteskip %a o%a 4
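On a Unix-ish shell the same 4-byte skip works without the C program: tail -c +5 starts output at byte 5, i.e. drops the first 4 bytes. Demonstrated here on a dummy file rather than a real download:

```shell
# dummy CSV with 4 junk bytes in front standing in for a real download
printf 'JUNKDigi-Key Part Number,Manufacturer\n' > page1.csv

for f in page1.csv; do        # in practice: for f in *.csv
  tail -c +5 "$f" > "o$f"     # +5 = start at byte 5, skipping 4 bytes
done

cat opage1.csv
```

The o-prefixed output names match what the batch loop above produces.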

I used one of the online CSV to MySQL converters, but most of them can't handle the variations in CSV. To create the initial schema for each table I converted one CSV to XLS by importing it into Google Docs, re-exporting it as an XLS, then importing that into phpMyAdmin; that makes the base schema.

Rename the table in phpMyAdmin or via the mysql tool.

Then do the final import with the csvtosql tool (in progress).
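Until the csvtosql tool is finished, MySQL's LOAD DATA can import the stripped CSVs into an existing table. A sketch only: the table name capacitors_ceramic, the database name partsdb, and the assumption that the table's columns match the CSV header are all hypothetical here.

```shell
# write the import statement; IGNORE 1 LINES skips the CSV header row
cat > import.sql <<'EOF'
LOAD DATA LOCAL INFILE 'opage1.csv'
INTO TABLE capacitors_ceramic
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
EOF

# then run it against the database (commented out here):
# mysql --local-infile=1 -u user -p partsdb < import.sql
cat import.sql
```

OPTIONALLY ENCLOSED BY '"' is what lets this cope with the quoted, comma-containing fields that trip up the online converters.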

projects/digikey_partsdb.txt · Last modified: 2013/10/13 10:11 by charliex