A parser for the nakolesah.ru database
Phew, I finally got the nakolesah parser into a sane state: it now grabs the tire selection by car. For anyone interested, a link to the script is at the end of the post.
Something in it will still have to change. I don't much like the current logic, which is built on GET requests (the browser, by contrast, gets all its information by calling an asp script with the various parameters passed in a POST request). I use POST only at the very end; ideally the browser's behaviour should be copied completely, but I didn't have the time to dig into it.
I also dislike the crutch in the form of a function that rewrites car model names. While parsing nakolesah I ran into a problem (relevant only for GET requests): the names of makes and modifications in the drop-down lists differ from those in the page addresses. For example:
    sub TransformModel ($$) {
        my ($brand, $car_model) = @_;
        $car_model =~ s/-//g if $brand !~ /Saab|Jaguar|Nissan|Honda|Citroen|MG|Mercedes|Mazda|Ford/i;
        $car_model =~ s/[-+]/_/g if $brand !~ /Citroen/i;
        if ($brand =~ /Nissan/i) {
            $car_model =~ s/Z/350z/i;
            $car_model =~ s/GT_R/GTR/i;
        }
        $car_model = 'navigaror_1' if $brand =~ m#Lincoln#i and $car_model eq 'Navigator';
        $car_model = 'Du%D1%81ato' if $brand =~ m#Fiat#i and $car_model =~ /dusato/i;
        if ($brand =~ /Chery/i) {
            $car_model = 'c_eastar' if $car_model eq 'CrossEastar';
            $car_model = $brand . '_' . $car_model if $car_model =~ /kimo|qq\d?/i;
        }
        return $car_model;
    }
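One way to make this kind of crutch less painful is to keep the one-off exceptions in a lookup table and leave only the generic rule in code. The sketch below is just an illustration of that idea, not what the script does: the `%SLUG_OVERRIDE` table and the `model_slug` function are hypothetical names, the table holds only the three fixed special cases from the function above, and the per-brand dash rules are deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical override table: lower-cased "brand/model" => URL form.
# The entries are the three fixed special cases from TransformModel.
my %SLUG_OVERRIDE = (
    'lincoln/navigator' => 'navigaror_1',
    'fiat/dusato'       => 'Du%D1%81ato',
    'chery/crosseastar' => 'c_eastar',
);

# Return the URL form of a model name: an explicit override if one exists,
# otherwise the generic rule (dashes and pluses become underscores).
sub model_slug {
    my ($brand, $car_model) = @_;
    my $key = lc "$brand/$car_model";
    return $SLUG_OVERRIDE{$key} if exists $SLUG_OVERRIDE{$key};
    (my $slug = $car_model) =~ s/[-+]/_/g;
    return $slug;
}

print model_slug('Lincoln', 'Navigator'), "\n";   # navigaror_1
print model_slug('Mazda',   'CX-7'),      "\n";   # CX_7
```

With a table like this, a newly discovered mismatch between a drop-down name and a page address becomes one more line of data instead of one more regex branch.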
A full dump takes about 12 hours in sequential mode (the script runs in a single thread; a multi-threaded client wasn't really needed, and I had no time to bolt one on just for fun). If anyone decides to do the downloading and parsing themselves, I'd suggest making four copies of the script and splitting the car makes into four groups (the nakolesah database holds 61 makes at the moment). You can use the ready-made split that I left in the code:
    # next if $brand !~ /Rover|FAW|Volkswagen|Ferrari|Jaguar|Smart|Suzuki|gaz|Bentley|Peugeot|Pontiac|Honda|Maybach|vaz|Infiniti|Buick|Subaru/i;
    # next if $brand !~ /Lancia|Opel|Daihatsu|Hummer|Kia|Fiat|Nissan|Saturn|Mini|Hyundai|Renault|Citroen|Lincoln|Chevrolet|Dodge/i;
    # next if $brand !~ /Chery|Mazda|Ford|uaz|Acura|Porsche|Lotus|Volvo|Toyota|Skoda|Cadillac|Scion|Saab|Mercury|Daewoo/i;
    # next if $brand !~ /Chrysler|BMW|Isuzu|MG|Mercedes|GMC|Seat|Maserati|Mitsubishi|Jeep|Lexus|Audi|Lifan|Geely/i;
In each of the four copies, uncomment one of the ranges. It is better to give the output files different names, since by default the output goes to a file named script_name.xml (although you can also pass the output file name with a switch at launch).
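Instead of keeping four edited copies, the group could also be picked with a command-line switch. This is only a sketch of that idea: the `--group` and `--out` switches, the default file name and the helper functions are my assumptions for the example and are not the switches the script actually accepts. The four patterns are the ones from the commented-out lines.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# The four brand groups from the commented-out lines in the script.
my @GROUPS = (
    qr/Rover|FAW|Volkswagen|Ferrari|Jaguar|Smart|Suzuki|gaz|Bentley|Peugeot|Pontiac|Honda|Maybach|vaz|Infiniti|Buick|Subaru/i,
    qr/Lancia|Opel|Daihatsu|Hummer|Kia|Fiat|Nissan|Saturn|Mini|Hyundai|Renault|Citroen|Lincoln|Chevrolet|Dodge/i,
    qr/Chery|Mazda|Ford|uaz|Acura|Porsche|Lotus|Volvo|Toyota|Skoda|Cadillac|Scion|Saab|Mercury|Daewoo/i,
    qr/Chrysler|BMW|Isuzu|MG|Mercedes|GMC|Seat|Maserati|Mitsubishi|Jeep|Lexus|Audi|Lifan|Geely/i,
);

# Parse the hypothetical --group N and --out FILE switches.
sub parse_args {
    my @argv = @_;
    my ($group, $out) = (1, undef);
    GetOptionsFromArray(\@argv, 'group=i' => \$group, 'out=s' => \$out)
        or die "usage: $0 [--group 1..4] [--out FILE]\n";
    die "--group must be between 1 and 4\n" if $group < 1 || $group > @GROUPS;
    $out //= "nakolesah_part$group.xml";   # made-up default name
    return ($group, $out);
}

# True if the brand belongs to the chosen group (1-based).
sub brand_in_group {
    my ($brand, $group) = @_;
    return $brand =~ $GROUPS[$group - 1] ? 1 : 0;
}

my ($group, $out) = parse_args(@ARGV);
for my $brand (qw(Honda Kia Toyota Audi)) {
    next if !brand_in_group($brand, $group);
    print "group $group -> $out: would process $brand\n";
}
```

Four runs with `--group 1` through `--group 4` would then cover all 61 makes, each writing to its own file.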
Along the way I wrote a little script to validate the output of the nakolesah.ru parser, and once again rejoiced at the beauty of Perl regexes:
    m|<(\w+)\s?\w*=?"?\w*"?>\s*</\1>$|ig
In a single line it checks whether the tags are filled in (all the ones I export), and it understands tags both with and without attributes. The validator for the nakolesah.ru dump can be downloaded together with the parser.
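To show what that regex actually catches, here is a small self-contained sketch. The `empty_tag_lines` helper and the sample lines are made up for the example (the real element names in the dump may differ), and the `g` flag is dropped because each line is tested only once.

```perl
use strict;
use warnings;

# The empty-tag pattern from the post: an opening tag, optionally with one
# attribute, followed only by whitespace and its own closing tag.
my $EMPTY_TAG = qr{<(\w+)\s?\w*=?"?\w*"?>\s*</\1>$}i;

# Return the 1-based numbers of the lines that end in an empty element.
sub empty_tag_lines {
    my @lines = @_;
    my @bad;
    for my $i (0 .. $#lines) {
        push @bad, $i + 1 if $lines[$i] =~ $EMPTY_TAG;
    }
    return @bad;
}

# Made-up sample lines, not taken from the real dump.
my @sample = (
    '<model>Accord</model>',
    '<tire></tire>',
    '<disk type="alloy"> </disk>',
    '<disk type="alloy">R16</disk>',
);
print join(',', empty_tag_lines(@sample)), "\n";   # 2,3
```

The backreference `\1` is what does the work: the closing tag has to repeat the name captured from the opening tag, so a filled element never matches.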
For fun, a little showing off (something to be nostalgic about later):
- size of the clean database in XML (no blank lines):
    $ wc -l nakolesah.ru_full_base_4.12.2009.xml
    550657 nakolesah.ru_full_base_4.12.2009.xml
- 577 car models
As promised, here is the link to download the parser-grabber for nakolesah.ru (the output validator is in the same archive): nakolesah.ru_parser + validator
Good luck to everyone!
18 comments 


Good day! Apparently they changed the design and the sizes are no longer parsed. Could you fix this, paid or free? Thank you)
Which sizes exactly? Let's get specific right away, so it's easier to understand what is going on.
The script picks up the cars perfectly, but it fails to pick up the matching wheel and tire sizes, so the resulting XML looks like this:
....
I can't say what the cause was, since all the information downloaded normally for me.
For me the redirect doesn't work. It prints:
Use of uninitialized value $redir_url in concatenation (.) or string at /home/digbox/data/www/digbox.ru/cgi-bin/nakolesah_ru_parser.pl line 152.
Could you help sort it out?
Does it fail right on the first run? Add the following at line 152:
    exit;
and let me know the result.
It outputs the following:
1|#||4|54|pageRedirect||%2fselect%2ftiresbyauto%2facura%2fcl%2f2003%2f32i.aspx|
As I understand it, it recognizes the redirect URL but doesn't follow it.
That isn't hard to fix. It simply didn't recognize the redirect link, because the format in which the link is returned has changed.
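For illustration only, here is a minimal sketch of how a redirect target could be pulled out of a response like the one quoted above. It is not the pattern from the script: the `redirect_target` helper is a hypothetical name, and the field layout (an empty field between `pageRedirect` and the encoded URL) is inferred from that single sample line. Checking the result with `defined` also avoids the "uninitialized value" warning when nothing matches.

```perl
use strict;
use warnings;

# Extract and URL-decode the target of a pageRedirect record.
# Returns undef if the body does not contain one.
sub redirect_target {
    my ($body) = @_;
    return undef unless defined $body
        && $body =~ /\|pageRedirect\|[^|]*\|([^|]+)\|/;
    # Decode %XX escapes by hand to stay within core Perl.
    (my $url = $1) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $url;
}

my $body = '1|#||4|54|pageRedirect||%2fselect%2ftiresbyauto%2facura%2fcl%2f2003%2f32i.aspx|';
my $redir_url = redirect_target($body);
if (defined $redir_url) {
    print "$redir_url\n";   # /select/tiresbyauto/acura/cl/2003/32i.aspx
} else {
    warn "no redirect found in response\n";
}
```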
In line 150, the search pattern should be replaced:
with
Thank you very much, it worked)
but I rejoiced too soon (it still won't pull out the sizes, same as before (
Most likely it's not only the format of the returned links that has changed but also the way the tire/wheel information is served. To get it working again, a lot will have to change in the page-parsing function.
I can share the fixed parser, or the database, or a parser I wrote myself ... skype:
What's your email? Or ICQ?
http://www.dimio.org/about
If someone could fix the parser, please help.
icq: 308037667
skype: viperstp
Could someone share the details after all: why aren't the sizes pulled out, and what in the code needs to change?
Above, someone left their contact details and wrote that they had fixed everything for the current conditions.
Hello, if anyone has a parser in PHP, please share, it's needed urgently ((( My ICQ is 202 716. We're on DLE Engine (it's PHP)