projects:digikey_partsdb

Differences

This shows you the differences between two versions of the page.

projects:digikey_partsdb [2013/10/12 08:41] charliex
projects:digikey_partsdb [2013/10/13 10:11] (current) charliex
Line 873: Line 873:
 == grabbing FV's ==
  
-we need the FV's to crawl each subsection.
 +We need the FVs to crawl each subsection. Grab all of the above URLs, making sure Results per Page = 500. The CSV download is capped at 500 results per fetch, so there is no point increasing this value.
  
   * <input type=hidden name=FV value=fff40000,fff80000>
  
-grab the value and store for each of the above URL's
 +Also grab the total page count:
 +
 +  * <a class="Last" href="/product-search/en/undefined-category/undefined-family/0/page/8">Last</a>
 +
 +The trailing page/8 in the href is the total page count; pages are numbered from 1.
 +
 +Grab the FV value and the page count, and store them for each of the above URLs.
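The two values described above could be pulled out of a saved listing page with plain string searches. This is only a minimal sketch: the function names `extract_fv` and `extract_page_count` are hypothetical, and it assumes the markup looks exactly like the two samples shown (an unquoted or quoted `name=FV value=` attribute, and a link carrying `class="Last"`). A listing with a single page may have no Last link at all, in which case the caller would probably want to treat -1 as one page.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the FV value that follows "name=FV value=" into out.
   Returns 0 on success, -1 if the marker is missing or out is too small. */
int extract_fv(const char *html, char *out, size_t outsize)
{
    const char *marker = "name=FV value=";
    const char *p = strstr(html, marker);
    size_t n;

    if (p == NULL)
        return -1;
    p += strlen(marker);
    if (*p == '"')                      /* tolerate a quoted attribute value */
        p++;
    n = strcspn(p, "\"> \t\r\n");       /* value ends at quote, '>' or whitespace */
    if (n == 0 || n >= outsize)
        return -1;
    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}

/* Return the page number taken from the "Last" link, or -1 if not found.
   Finds class="Last", then the "/page/" segment inside that same tag. */
long extract_page_count(const char *html)
{
    const char *p = strstr(html, "class=\"Last\"");
    const char *end;
    char *numend;
    long pages;

    if (p == NULL)
        return -1;
    end = strchr(p, '>');               /* end of the opening <a ...> tag */
    p = strstr(p, "/page/");
    if (p == NULL || (end != NULL && p > end))
        return -1;
    p += strlen("/page/");
    pages = strtol(p, &numend, 10);
    if (numend == p || pages < 1)
        return -1;
    return pages;
}
```

Called on the two sample lines above, `extract_fv` yields `fff40000,fff80000` and `extract_page_count` yields 8.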
  
 == crawl individual pages ==
 +
 +Fetch each page with curl using a valid user agent. I used --user-agent "Chrome/1.0", but vary it to avoid rate limiters.
 +
 +<code>
 +curl.exe -o page%1.csv -L -v -G "http://www.digikey.com/product-search/download.csv?FV=fff40008%2Cfff801b9&mnonly=0&newproducts=0&ColumnSort=0&page=%1&stock=0&pbfree=0&rohs=0&quantity=0&ptm=0&fid=0&pageSize=500" --digest --user-agent "Chrome/1.0"
 +</code>
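Since the stored FV value and page count drive one fetch per page, the download URL can be generated rather than edited by hand. The sketch below is an assumption-laden illustration, not the author's tooling: `encode_fv` and `build_download_url` are hypothetical names, the query string is copied from the curl command above, and only the comma in the FV value is percent-encoded (as `%2C`, matching that command).

```c
#include <stdio.h>
#include <string.h>

/* Percent-encode the commas in an FV value ("a,b" -> "a%2Cb").
   Returns 0 on success, -1 if out is too small. */
int encode_fv(const char *fv, char *out, size_t outsize)
{
    size_t o = 0;

    if (outsize == 0)
        return -1;
    for (; *fv != '\0'; fv++) {
        if (*fv == ',') {
            if (o + 3 >= outsize)       /* need room for "%2C" plus the NUL */
                return -1;
            memcpy(out + o, "%2C", 3);
            o += 3;
        } else {
            if (o + 1 >= outsize)
                return -1;
            out[o++] = *fv;
        }
    }
    out[o] = '\0';
    return 0;
}

/* Build the CSV download URL for one page (pages start from 1).
   Returns 0 on success, -1 on a bad page number or a too-small buffer. */
int build_download_url(const char *fv, long page, char *out, size_t outsize)
{
    char encoded[128];
    int n;

    if (page < 1 || encode_fv(fv, encoded, sizeof encoded) != 0)
        return -1;
    n = snprintf(out, outsize,
        "http://www.digikey.com/product-search/download.csv"
        "?FV=%s&mnonly=0&newproducts=0&ColumnSort=0&page=%ld"
        "&stock=0&pbfree=0&rohs=0&quantity=0&ptm=0&fid=0&pageSize=500",
        encoded, page);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    return 0;
}
```

A caller would loop `page` from 1 up to the stored page count and hand each URL to curl.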
 +
 +
 +
 +The response has 4 bytes at the front that we don't want, so strip them with a simple byteskip script or piece of code.
 +
 +<code> 
 +
 +#include <stdio.h>
 +#include <stdlib.h>
 +
 +int main(int argc,char*argv[])
 +{
 + FILE *fp,*ofp;
 +
 + if( argc < 4 ) {
 + fprintf(stderr,"%s usage : infile outfile offset\n",argv[0]);
 + exit(-1);
 + }
 +
 + fp =fopen( argv[1],  "rb");
 + if( fp == NULL ) {
 + fprintf(stderr,"Couldnt open input file %s\n",argv[1]);
 + exit(-2);
 + }
 +
 + unsigned long length ;
 +
 + fseek(fp,0,SEEK_END);
 +
 + length = ftell( fp ) ;
 +
 +
 + if( length == 0 ) {
 +
 + fclose( fp );
 +
 + fprintf(stderr,"zero length file %s\n",argv[1]);
 + exit(-3);
 + }
 +
 + unsigned long offset;
 +
 + //skip offset
 + offset = strtoul (argv[3], NULL, 0);
 +
 + if( offset >= length ){
 +
 + fclose( fp );
 +
 + fprintf(stderr,"offset is outside file length %s at %lu\n",argv[1], offset);
 + exit(-5);
 + }
 +
 + // set to skip position
 + fseek(fp,offset,SEEK_SET);
 +
 + unsigned char *buffer = NULL;
 +
 + buffer = (unsigned char *)malloc( length - offset );
 + if( buffer == NULL ) {
 +
 + fclose(fp);
 +
 + fprintf(stderr,"Couldnt allocate output buffer %lu\n", length - offset );
 + exit(-6);
 + }
 +
 + // read whole buffer.
 + if( fread(buffer,1,length - offset ,fp ) != (length-offset) ) {
 + fclose(fp);
 + fprintf(stderr,"Couldnt read input file %s\n", argv[1] );
 + exit(-7);
 +
 + }
 +
 + // open output file for writing.
 + ofp = fopen( argv[2],  "wb");
 +
 + if( ofp == NULL ) {
 + fclose(fp);
 +
 + free( buffer );
 + buffer = NULL;
 + fprintf(stderr,"Couldnt open output file %s\n",argv[2]);
 + exit(-8);
 + }
 +
 + if( fwrite(buffer,1,length-offset,ofp) != (length-offset) ) { 
 + fclose(fp);
 + fclose(ofp);
 + fprintf(stderr,"Couldnt write output file %s\n", argv[2]);
 + exit(-9);
 + }
 +
 + free( buffer );
 +
 + fclose(fp);
 + fclose(ofp);
 +
 +
 + return 0;
 +}
 +</code> 
 +
 +Process all the files.
 +
 +<code>
 +for %a in (*.csv) do byteskip %a o%a 4
 +</code>
 +
 +I used one of the online CSV to MySQL converters, but most of them can't handle the variations in the CSV. To create the initial schema for each table, I converted one CSV to XLS by importing it into Google Docs and re-exporting it as XLS, then imported that into phpMyAdmin. That gives the base schema.
 +
 +Rename the table in phpMyAdmin or with a MySQL tool.
 +
 +Then do the final import with the csvtosql tool (in progress).
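The csvtosql tool itself is not shown here, so the following is only a guess at one piece such an importer would need: splitting a CSV record into fields. It assumes that the "variations" include double-quoted fields containing commas and doubled quotes, which is a common reason simple converters fail; the source does not say which variations actually occur. `split_csv_line` is a hypothetical name, and the function does not handle fields that span multiple lines.

```c
#include <stddef.h>

/* Split one CSV record in place into at most max fields.
   Fields may be wrapped in double quotes; inside quotes, "" is a literal
   quote and commas are data. Stores pointers to the NUL-terminated fields
   in fields[] and returns how many were found. Modifies line. */
size_t split_csv_line(char *line, char **fields, size_t max)
{
    size_t count = 0;
    char *r = line;                     /* read position */

    while (count < max) {
        char *w = r;                    /* write position, never ahead of r */
        int more;

        fields[count++] = w;
        if (*r == '"') {                /* quoted field */
            r++;
            while (*r != '\0') {
                if (*r == '"') {
                    if (r[1] == '"') {  /* escaped quote */
                        *w++ = '"';
                        r += 2;
                    } else {            /* closing quote */
                        r++;
                        break;
                    }
                } else {
                    *w++ = *r++;
                }
            }
        }
        /* unquoted field, or anything trailing a closing quote */
        while (*r != '\0' && *r != ',' && *r != '\r' && *r != '\n')
            *w++ = *r++;
        more = (*r == ',');
        if (*r != '\0')
            r++;
        *w = '\0';
        if (!more)
            break;
    }
    return count;
}
```

Each returned field could then be escaped and written into an INSERT statement for the table created from the base schema.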
 +
 +
  
projects/digikey_partsdb.1381592462.txt.gz · Last modified: 2013/10/12 08:41 by charliex