projects:digikey_partsdb

Differences

This shows you the differences between two versions of the page.

projects:digikey_partsdb [2013/10/12 08:41] charliex
projects:digikey_partsdb [2013/10/13 10:10] charliex
Line 873:
 == grabbing FV's ==
  
-we need the FV's to crawl each subsection.
+We need the FVs to crawl each subsection. Grab all the above URLs and make sure Results per Page = 500. The CSV download is capped at 500 results per fetch, so there is no point increasing this value.
  
   * <input type=hidden name=FV value=fff40000,fff80000>
  
-grab the value and store for each of the above URL's
+Also grab the total page count:
 + 
 +  * <a class="Last" href="/product-search/en/undefined-category/undefined-family/0/page/8">Last</a> 
 + 
 +The 8 in "page/8" is the total page count; pages start from 1.
 + 
 +Grab the FV value and page count, and store them for each of the above URLs.
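A rough sketch of pulling both values out of a fetched page, assuming the markup looks exactly like the two snippets above. The function names `extract_fv` and `extract_last_page` are made up for illustration, and plain string search like this will break if the site changes its HTML.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy the value of the hidden FV input into out.
   Returns 0 on success, -1 if it is missing or does not fit. */
int extract_fv(const char *html, char *out, size_t outsz)
{
    const char *marker = "name=FV value=";
    const char *p = strstr(html, marker);
    size_t n;

    if (p == NULL || outsz == 0)
        return -1;
    p += strlen(marker);
    if (*p == '"')                      /* tolerate a quoted value */
        p++;
    n = strcspn(p, "\"> \t\r\n");       /* value ends at quote, '>' or space */
    if (n == 0 || n >= outsz)
        return -1;
    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}

/* Return the page number from the class="Last" link, or -1 if not found. */
long extract_last_page(const char *html)
{
    const char *p = strstr(html, "class=\"Last\"");
    const char *tag_end;
    char *num_end;
    long pages;

    if (p == NULL)
        return -1;
    tag_end = strchr(p, '>');           /* end of the opening <a ...> tag */
    p = strstr(p, "/page/");
    if (p == NULL || (tag_end != NULL && p > tag_end))
        return -1;
    p += strlen("/page/");
    pages = strtol(p, &num_end, 10);
    if (num_end == p || pages < 1)
        return -1;
    return pages;
}
```

Call both on each downloaded category page and store the results next to the URL. A return of -1 from `extract_last_page` only means no "Last" link was found; treating that as a single-page category is a guess that should be checked against the real pages.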
  
 == crawl individual pages ==
 +
 +Run curl with a valid user agent. I used --user-agent "Chrome/1.0", but vary it to avoid rate limiters.
 +
 +<code>
 +curl.exe -o page%1.csv -L -v -G "http://www.digikey.com/product-search/download.csv?FV=fff40008%2Cfff801b9&mnonly=0&newproducts=0&ColumnSort=0&page=%1&stock=0&pbfree=0&rohs=0&quantity=0&ptm=0&fid=0&pageSize=500" --digest --user-agent "Chrome/1.0"
 +</code>
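To fetch every page of a category, the same URL has to be rebuilt for each page number and stored FV value. A minimal sketch of that step follows; `build_csv_url` is a hypothetical helper, and the query string is copied from the curl command above with only FV and page substituted.

```c
#include <stdio.h>
#include <string.h>

/* Build the CSV download URL for one page of one category.
   fv is the raw value from the page, e.g. "fff40008,fff801b9";
   commas are percent-encoded as %2C, as in the curl command.
   Returns 0 on success, -1 if a buffer is too small. */
int build_csv_url(const char *fv, long page, char *out, size_t outsz)
{
    char enc[128];
    size_t j = 0;
    const char *s;
    int n;

    for (s = fv; *s != '\0'; s++) {
        if (*s == ',') {
            if (j + 3 >= sizeof enc)
                return -1;
            memcpy(enc + j, "%2C", 3);
            j += 3;
        } else {
            if (j + 1 >= sizeof enc)
                return -1;
            enc[j++] = *s;
        }
    }
    enc[j] = '\0';

    n = snprintf(out, outsz,
        "http://www.digikey.com/product-search/download.csv"
        "?FV=%s&mnonly=0&newproducts=0&ColumnSort=0&page=%ld"
        "&stock=0&pbfree=0&rohs=0&quantity=0&ptm=0&fid=0&pageSize=500",
        enc, page);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return 0;
}
```

Loop the page number from 1 up to the stored page count and hand each URL to curl, ideally with a pause between requests.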
 +
 +
 +
 +The response has 4 bytes at the front that we don't want, so strip them with a simple byteskip script or piece of code.
 +
 +<code> 
 +
 +#include <stdio.h>
 +#include <stdlib.h>
 +
 +int main(int argc,char*argv[])
 +{
 + FILE *fp,*ofp;
 +
 + if( argc < 4 ) {
 + fprintf(stderr,"usage: %s infile outfile offset\n",argv[0]);
 + exit(-1);
 + }
 +
 + fp =fopen( argv[1],  "rb");
 + if( fp == NULL ) {
 + fprintf(stderr,"Couldnt open input file %s\n",argv[1]);
 + exit(-2);
 + }
 +
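 + unsigned long length ;
 + long end;
 +
 + // seek to the end to find the file length; ftell returns -1 on error
 + if( fseek(fp,0,SEEK_END) != 0 || (end = ftell( fp )) < 0 ) {
 + fclose( fp );
 + fprintf(stderr,"Couldn't get length of %s\n",argv[1]);
 + exit(-4);
 + }
 +
 + length = (unsigned long)end;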
 +
 +
 + if( length == 0 ) {
 +
 + fclose( fp );
 +
 + fprintf(stderr,"zero length file %s\n",argv[1]);
 + exit(-3);
 + }
 +
 + unsigned long offset;
 +
 + //skip offset
 + offset = strtoul (argv[3], NULL, 0);
 +
 + if( offset >= length ){
 +
 + fclose( fp );
 +
 + fprintf(stderr,"offset is outside file length %s at %lu\n",argv[1], offset);
 + exit(-5);
 + }
 +
 + // set to skip position
 + fseek(fp,offset,SEEK_SET);
 +
 + unsigned char *buffer = NULL;
 +
 + buffer = (unsigned char *)malloc( length - offset );
 + if( buffer == NULL ) {
 +
 + fclose(fp);
 +
 + fprintf(stderr,"Couldn't allocate output buffer of %lu bytes\n", length - offset );
 + exit(-6);
 + }
 +
 + // read whole buffer.
 + if( fread(buffer,1,length - offset ,fp ) != (length-offset) ) {
 + fclose(fp);
 + free( buffer );
 + fprintf(stderr,"Couldn't read input file %s\n", argv[1] );
 + exit(-7);
 +
 + }
 +
 + // open output file for writing.
 + ofp = fopen( argv[2],  "wb");
 +
 + if( ofp == NULL ) {
 + fclose(fp);
 +
 + free( buffer );
 + buffer = NULL;
 + fprintf(stderr,"Couldnt open output file %s\n",argv[2]);
 + exit(-8);
 + }
 +
 + if( fwrite(buffer,1,length-offset,ofp) != (length-offset) ) { 
 + fclose(fp);
 + fclose(ofp);
 + free( buffer );
 + fprintf(stderr,"Couldn't write output file %s\n", argv[2]);
 + exit(-9);
 + }
 +
 + free( buffer );
 +
 + fclose(fp);
 + fclose(ofp);
 +
 +
 + return 0;
 +}
 +</code> 
 +
 +Process all the files.
 +
 +<code>
 +for %a in (*.csv) do byteskip %a o%a 4
 +</code>
 +
 +I used one of the online CSV-to-MySQL converters, but most of them can't handle the variations in the CSV files. To create the initial schema for each table, I converted one CSV to XLS by importing it into Google Docs and re-exporting it as XLS, then imported that into phpMyAdmin, which creates the base schema.<br>
 +
 +Rename the table in phpMyAdmin.<br>
 +<br>
 +Then do the final import with csvtosql.<br>
  
projects/digikey_partsdb.txt · Last modified: 2013/10/13 10:11 by charliex