projects:digikey_partsdb

projects:digikey_partsdb [2013/10/12 08:34] – created charliex
projects:digikey_partsdb [2013/10/13 10:11] (current) charliex
Line 1: Line 1:
 +=== digikey parts slurper ===
 +
 +fetch www.digikey.com/product-search/en?FV=
 +
 +grep for **catfilterlink** 
 +
 +remove everything from the start of the line through the opening **"** of the URL
 +
 +remove everything from the closing **"** of the URL to the end of the line
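The grep-and-trim steps above could also be done in code. This is only a minimal sketch: it assumes the URL is the first double-quoted string on a line containing catfilterlink, and the function name `extract_category_url` is made up for illustration.

```c
#include <string.h>

/* Copy the first double-quoted string from a line that mentions
 * "catfilterlink" into out.
 * Assumption: that first quoted string is the category URL.
 * Returns 1 on success, 0 if the line does not qualify or out is too small. */
int extract_category_url(const char *line, char *out, size_t out_size)
{
    const char *start, *end;
    size_t len;

    if (strstr(line, "catfilterlink") == NULL)
        return 0;                   /* the grep step: skip other lines */

    start = strchr(line, '"');      /* drop everything through the opening quote */
    if (start == NULL)
        return 0;
    start++;

    end = strchr(start, '"');       /* drop everything from the closing quote on */
    if (end == NULL)
        return 0;

    len = (size_t)(end - start);
    if (len + 1 > out_size)
        return 0;                   /* caller's buffer is too small */

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Call it once per line of the fetched page and keep every result for which it returns 1.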
 +
 +== produces the following info ==
  
   * http://www.digikey.com/product-search/en/audio-products/accessories/720980
Line 859: Line 870:
   * http://www.digikey.com/product-search/en/undefined-category/miscellaneous/752
   * http://www.digikey.com/product-search/en/undefined-category/undefined-family/0
-  + 
 +== grabbing FV's == 
 + 
 +We need the FV values to crawl each subsection. Fetch each of the URLs above with Results per Page set to 500. The CSV download is capped at 500 results per fetch, so there is no point increasing this value.
 + 
 +  <input type=hidden name=FV value=fff40000,fff80000> 
 + 
 +also grab the total page count  
 + 
 +  * <a class="Last" href="/product-search/en/undefined-category/undefined-family/0/page/8">Last</a> 
 + 
 +The number after page/ (8 in this example) is the total page count. Pages are numbered from 1.
 + 
 +Grab the FV value and the page count, and store both for each of the URLs above.
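Pulling the two values out of the fetched HTML might look like the sketch below. It assumes the markup matches the two samples shown above exactly (unquoted FV value, a link with class="Last"). The function names `parse_fv` and `parse_last_page` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Copy the FV value from a hidden input such as
 *   <input type=hidden name=FV value=fff40000,fff80000>
 * Returns 1 on success, 0 if the field is missing or out is too small. */
int parse_fv(const char *html, char *out, size_t out_size)
{
    static const char key[] = "name=FV value=";
    const char *start = strstr(html, key);
    size_t len;

    if (start == NULL)
        return 0;
    start += strlen(key);

    /* The value is unquoted in the sample, so it ends at '>' or whitespace. */
    len = strcspn(start, "> \t\r\n");
    if (len == 0 || len + 1 > out_size)
        return 0;

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}

/* Return the number after "/page/" in the "Last" link.
 * Returns 0 when no such link is found; the caller decides what that means. */
unsigned long parse_last_page(const char *html)
{
    const char *link = strstr(html, "class=\"Last\"");
    const char *page;
    unsigned long count = 0;

    if (link == NULL)
        return 0;
    page = strstr(link, "/page/");
    if (page == NULL)
        return 0;
    if (sscanf(page, "/page/%lu", &count) != 1)
        return 0;
    return count;
}
```

Store the FV string and the page count next to each category URL so the crawl step can loop over pages 1 to the page count.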
 + 
 +== crawl individual pages == 
 + 
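 +Fetch each page with curl and a valid user agent. I used --user-agent "Chrome/1.0", but vary it to avoid rate limiters. In the command below, %1 is the page number passed as the first batch-file argument.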
 + 
 +<code> 
 +rem inside a batch file a literal percent sign is written as %%, so the encoded comma %2C becomes %%2C
 +curl.exe -o page%1.csv -L -v -G "http://www.digikey.com/product-search/download.csv?FV=fff40008%%2Cfff801b9&mnonly=0&newproducts=0&ColumnSort=0&page=%1&stock=0&pbfree=0&rohs=0&quantity=0&ptm=0&fid=0&pageSize=500" --digest --user-agent "Chrome/1.0"
 +</code> 
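If the crawl is driven from code rather than a batch file, the download URL for each FV value and page could be assembled as in this sketch. The query string is copied from the curl command above. The function name `build_csv_url` is hypothetical, and only the comma is percent-encoded, on the assumption that FV values contain nothing else that needs escaping.

```c
#include <stdio.h>
#include <string.h>

/* Write the CSV download URL for one page into out.
 * fv is the raw value from the hidden input, e.g. "fff40008,fff801b9".
 * Commas are encoded as %2C to match the curl command.
 * Returns 1 on success, 0 if out is too small. */
int build_csv_url(const char *fv, unsigned long page, char *out, size_t out_size)
{
    static const char prefix[] =
        "http://www.digikey.com/product-search/download.csv?FV=";
    size_t used;
    int n;

    n = snprintf(out, out_size, "%s", prefix);
    if (n < 0 || (size_t)n >= out_size)
        return 0;
    used = (size_t)n;

    /* Append the FV value, encoding each comma as %2C. */
    for (; *fv != '\0'; fv++) {
        if (*fv == ',') {
            if (used + 3 >= out_size)
                return 0;
            memcpy(out + used, "%2C", 3);
            used += 3;
        } else {
            if (used + 1 >= out_size)
                return 0;
            out[used++] = *fv;
        }
    }

    /* Append the fixed parameters and the page number. */
    n = snprintf(out + used, out_size - used,
                 "&mnonly=0&newproducts=0&ColumnSort=0&page=%lu&stock=0"
                 "&pbfree=0&rohs=0&quantity=0&ptm=0&fid=0&pageSize=500",
                 page);
    if (n < 0 || (size_t)n >= out_size - used)
        return 0;
    return 1;
}
```

Call it for every page from 1 up to the stored page count and hand each URL to curl or another HTTP client.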
 + 
 + 
 + 
 +The response has 4 bytes at the front that we don't want, so strip them with a simple byteskip script or piece of code.
 + 
 +<code>  
 + 
 +#include <stdio.h>
 +#include <stdlib.h>
 +
 +/* byteskip: copy infile to outfile, dropping the first <offset> bytes */
 +int main(int argc,char*argv[])
 +{
 +    FILE *fp,*ofp;
 +    long length;
 +    unsigned long offset;
 +    size_t count;
 +    unsigned char *buffer = NULL;
 +
 +    if( argc < 4 ) {
 +        fprintf(stderr,"%s usage : infile outfile offset\n",argv[0]);
 +        exit(-1);
 +    }
 +
 +    fp = fopen( argv[1], "rb");
 +    if( fp == NULL ) {
 +        fprintf(stderr,"Couldnt open input file %s\n",argv[1]);
 +        exit(-2);
 +    }
 +
 +    // find the file length
 +    fseek(fp,0,SEEK_END);
 +    length = ftell( fp );
 +
 +    // ftell returns -1 on error, so treat that like an empty file
 +    if( length <= 0 ) {
 +        fclose( fp );
 +        fprintf(stderr,"empty or unreadable file %s\n",argv[1]);
 +        exit(-3);
 +    }
 +
 +    // skip offset
 +    offset = strtoul (argv[3], NULL, 0);
 +
 +    if( offset >= (unsigned long)length ) {
 +        fclose( fp );
 +        fprintf(stderr,"offset is outside file length %s at %lu\n",argv[1], offset);
 +        exit(-5);
 +    }
 +
 +    // set to skip position
 +    fseek(fp,(long)offset,SEEK_SET);
 +
 +    // number of bytes left after the skipped prefix
 +    count = (size_t)((unsigned long)length - offset);
 +
 +    buffer = (unsigned char *)malloc( count );
 +    if( buffer == NULL ) {
 +        fclose(fp);
 +        fprintf(stderr,"Couldnt allocate output buffer of %lu bytes\n", (unsigned long)count );
 +        exit(-6);
 +    }
 +
 +    // read everything after the offset
 +    if( fread(buffer,1,count,fp ) != count ) {
 +        fclose(fp);
 +        free( buffer );
 +        fprintf(stderr,"Couldnt read input file %s\n", argv[1] );
 +        exit(-7);
 +    }
 +
 +    // open output file for writing.
 +    ofp = fopen( argv[2], "wb");
 +    if( ofp == NULL ) {
 +        fclose(fp);
 +        free( buffer );
 +        fprintf(stderr,"Couldnt open output file %s\n",argv[2]);
 +        exit(-8);
 +    }
 +
 +    if( fwrite(buffer,1,count,ofp) != count ) {
 +        fclose(fp);
 +        fclose(ofp);
 +        free( buffer );
 +        fprintf(stderr,"Couldnt write output file %s\n", argv[2]);
 +        exit(-9);
 +    }
 +
 +    free( buffer );
 +
 +    fclose(fp);
 +    fclose(ofp);
 +
 +    return 0;
 +}
 +</code>  
 + 
 +Process all the files. 
 + 
 +<code> 
 +for %a in (*.csv) do byteskip %a o%a 4 
 +</code> 
 + 
 +I used one of the online CSV to MySQL converters, but most of them can't handle the variations in the CSV. To create the initial schema for each table, I converted one CSV to XLS by importing it into Google Docs and re-exporting it as XLS, then imported that file into phpMyAdmin. That produces the base schema.
 +
 +Rename the table in phpMyAdmin or with the mysql tool.
 +
 +Then do the final import with the csvtosql tool (in progress).
 + 
 + 
projects/digikey_partsdb.1381592079.txt.gz · Last modified: 2013/10/12 08:34 by charliex