url = 'https://raw.githubusercontent.com/manuvazquez/sproc/main/samples/2018-2021_20samples.parquet'
output_file = 'download_sample.parquet'
file(url, output_file)
download
Some care is needed to avoid errors like
certificate verify failed: unable to get local issuer certificate
when downloading from https://contrataciondelsectorpublico.gob.es/.
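One common way to address this kind of error, sketched here under assumptions, is to give the HTTP client an SSL context that trusts a CA bundle containing the missing issuer certificate. Whether this library does so is not stated here; `make_context` and its `cafile` argument are hypothetical names.

```python
import ssl


def make_context(cafile=None):
    """Build a verifying SSL context.

    `cafile` is a hypothetical path to a PEM bundle holding the issuer
    certificate; when it is None the system default certificates are used.
    """
    return ssl.create_default_context(cafile=cafile)


context = make_context()
# The context would then accompany the request, for instance
# urllib.request.urlopen(url, timeout=2.0, context=context)
```

Certificate verification stays enabled in this sketch; only the set of trusted authorities changes.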
A function to download a file from the internet.
file
file (url:str, output_file:str|pathlib.Path|None, timeout:float=2.0)
Downloads a file
| | Type | Default | Details |
|---|---|---|---|
| url | str | | URL for the file to be downloaded |
| output_file | str \| pathlib.Path \| None | | Name of the local file to be saved; if None its content is returned |
| timeout | float | 2.0 | How long to wait for a response |
| Returns | None \| bytes | | Content of the file or None if output_file was passed |
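The behaviour described in the table (return the bytes when `output_file` is None, otherwise write them to disk and return None) can be illustrated with a minimal sketch based on the standard library. This is not the library's implementation: the name `fetch_sketch` and the choice of `urllib.request` are assumptions made for illustration.

```python
import pathlib
import urllib.request


def fetch_sketch(url, output_file=None, timeout=2.0):
    """Return the content at `url`, or save it to `output_file` and return None."""
    # `timeout` bounds how long to wait for the server to respond
    with urllib.request.urlopen(url, timeout=timeout) as response:
        content = response.read()
    if output_file is None:
        return content
    pathlib.Path(output_file).write_bytes(content)
    return None
```

Since `urllib.request.urlopen` also accepts `file://` URLs, the sketch can be tried on a local file without any network access.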
A sample file (from this repository) is downloaded
Let us check that it is readable
import pandas as pd
pd.read_parquet(output_file).head()
id | summary | title | updated | ContractFolderStatus | deleted_on | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | ContractFolderStatusCode | LocatedContractingParty | ProcurementProject | ... | LegalDocumentReference | TechnicalDocumentReference | LocatedContractingParty | TenderingProcess | |||||||||||||||
Party | Name | TypeCode | ... | ID | Attachment | ID | Attachment | ParentLocatedParty | ParticipationRequestReceptionPeriod | TenderSubmissionDeadlinePeriod | |||||||||||||
PartyIdentification | PartyName | ... | ExternalReference | ExternalReference | ParentLocatedParty | EndDate | EndTime | ||||||||||||||||
ID | Name | ... | URI | URI | ParentLocatedParty | ||||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||||
... | Name | ||||||||||||||||||||||
zip | file name | entry | |||||||||||||||||||||
some.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 2018-01-02 08:01:52.024000+00:00 | 1284/17 | RES | L02000047 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2017-11-02 | 23:59:00 | 2017-11-02 23:59:00+00:00 | NaT |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 2018-01-02 08:02:24.833000+00:00 | 1282/17 | RES | L02000047 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2017-11-02 | 23:59:00 | 2017-11-02 23:59:00+00:00 | NaT | ||
451 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1281/17, Entidad: Diputación Provi... | Refuerzo de firme en la VP 4013 Melgar de Arri... | 2018-01-02 08:02:51.744000+00:00 | 1281/17 | RES | L02000047 | Diputación Provincial de Valladolid | Refuerzo de firme en la VP 4013 Melgar de Arri... | 3.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2017-11-02 | 23:59:00 | 2017-11-02 23:59:00+00:00 | NaT | ||
450 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: VI/17/04-015; Órgano de Contrat... | Obras de edificación en el barrio de Pumarabul... | 2018-01-02 08:02:56.115000+00:00 | VI/17/04-015 | EV | <NA> | Consejería de Servicios y Derechos Sociales | Edificación de 36 VPP, garaje y trasteros en e... | 3.0 | ... | Pliego_Clausulas_Administrativas_VI-17-04-015.pdf | http://www.asturias.es/Proveedores/FICHEROS/ES... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2017-12-11 14:00:00+00:00 | NaT | ||
449 | https://contrataciondelestado.es/sindicacion/P... | Id Licitación: PcPG/2017/194222, Órgano de Con... | Suministro de gas natural canalizado y gas nat... | 2018-01-02 09:10:49.572000+00:00 | PcPG/2017/194222 | ADJ | A12017369 | Consellería de Economía, Emprego e Industria | Suministro de gas natural canalizado y gas nat... | 1.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2017-09-29 23:59:00+00:00 | NaT |
5 rows × 41 columns
aggregate_2018 = 'https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_2018.zip'
# file(aggregate_2018, 'foo.zip')
A convenience function leveraging `file` to read YAML files
yaml_to_dict
yaml_to_dict (url:str, timeout:float=2.0)
Read YAML data from a URL
| | Type | Default | Details |
|---|---|---|---|
| url | str | | URL for the file to be downloaded |
| timeout | float | 2.0 | How long to wait for a response |
| Returns | dict | | YAML data |
An example
yaml_to_dict('https://raw.githubusercontent.com/manuvazquez/sproc/main/samples/PLACE.yaml')
{'id': ['id', nan, nan, nan, nan, nan, nan],
'summary': ['summary', nan, nan, nan, nan, nan, nan],
'title': ['title', nan, nan, nan, nan, nan, nan],
'updated': ['updated', nan, nan, nan, nan, nan, nan],
'Número de Expediente': ['ContractFolderStatus',
'ContractFolderID',
nan,
nan,
nan,
nan,
nan],
'Estado': ['ContractFolderStatus',
'ContractFolderStatusCode',
nan,
nan,
nan,
nan,
nan],
'ID': ['ContractFolderStatus',
'LocatedContractingParty',
'Party',
'PartyIdentification',
'ID',
nan,
nan],
'Nombre': ['ContractFolderStatus',
'LocatedContractingParty',
'Party',
'PartyName',
'Name',
nan,
nan],
'URL perfil de contratante': ['ContractFolderStatus',
'LocatedContractingParty',
'BuyerProfileURIID',
nan,
nan,
nan,
nan],
'Ubicación orgánica': ['ContractFolderStatus',
'LocatedContractingParty',
'ParentLocatedParty',
'PartyName',
'Name',
nan,
nan],
'Objeto del Contrato': ['ContractFolderStatus',
'ProcurementProject',
'Name',
nan,
nan,
nan,
nan],
'Tipo de Contrato': ['ContractFolderStatus',
'ProcurementProject',
'TypeCode',
nan,
nan,
nan,
nan],
'Subtipo': ['ContractFolderStatus',
'ProcurementProject',
'SubTypeCode',
nan,
nan,
nan,
nan],
'Valor estimado del contrato': ['ContractFolderStatus',
'ProcurementProject',
'BudgetAmount',
'EstimatedOverallContractAmount',
nan,
nan,
nan],
'Presupuesto base sin impuestos': ['ContractFolderStatus',
'ProcurementProject',
'BudgetAmount',
'TaxExclusiveAmount',
nan,
nan,
nan],
'Presupuesto base con impuestos': ['ContractFolderStatus',
'ProcurementProject',
'BudgetAmount',
'TotalAmount',
nan,
nan,
nan],
'Clasificación CPV': ['ContractFolderStatus',
'ProcurementProject',
'RequiredCommodityClassification',
'ItemClassificationCode',
nan,
nan,
nan],
'Código de Subentidad Nacional': ['ContractFolderStatus',
'ProcurementProject',
'RealizedLocation',
'CountrySubentityCode',
nan,
nan,
nan],
'Subentidad Territorial': ['ContractFolderStatus',
'ProcurementProject',
'RealizedLocation',
'CountrySubentity',
nan,
nan,
nan],
'Población': ['ContractFolderStatus',
'ProcurementProject',
'RealizedLocation',
'Address',
'CityName',
nan,
nan],
'Plazo de Ejecución (Comienzo)': ['ContractFolderStatus',
'ProcurementProject',
'PlannedPeriod',
'StartDate',
nan,
nan,
nan],
'Plazo de Ejecución (Fin)': ['ContractFolderStatus',
'ProcurementProject',
'PlannedPeriod',
'EndDate',
nan,
nan,
nan],
'Plazo de Ejecución (Duración)': ['ContractFolderStatus',
'ProcurementProject',
'PlannedPeriod',
'DurationMeasure',
nan,
nan,
nan],
'Nümero de Lote': ['ContractFolderStatus',
'ProcurementProjectLot',
'ID',
nan,
nan,
nan,
nan],
'Objeto': ['ContractFolderStatus',
'ProcurementProjectLot',
'ProcurementProject',
'Name',
nan,
nan,
nan],
'Importe sin impuestos': ['ContractFolderStatus',
'ProcurementProjectLot',
'ProcurementProject',
'BudgetAmount',
'TaxExclusiveAmount',
nan,
nan],
'Importe con impuestos': ['ContractFolderStatus',
'ProcurementProjectLot',
'ProcurementProject',
'BudgetAmount',
'TotalAmount',
nan,
nan],
'Clasificación CPV (Lote)': ['ContractFolderStatus',
'ProcurementProjectLot',
'ProcurementProject',
'RequiredCommodityClassification',
'ItemClassificationCode',
nan,
nan],
'Tipo de Procedimiento': ['ContractFolderStatus',
'TenderingProcess',
'ProcedureCode',
nan,
nan,
nan,
nan],
'Presentación de Oferta (Fecha)': ['ContractFolderStatus',
'TenderingProcess',
'TenderSubmissionDeadlinePeriod',
'EndDate',
nan,
nan,
nan],
'Presentación de Oferta (Hora)': ['ContractFolderStatus',
'TenderingProcess',
'TenderSubmissionDeadlinePeriod',
'EndTime',
nan,
nan,
nan],
'Presentación de Oferta (Observaciones)': ['ContractFolderStatus',
'TenderingProcess',
'TenderSubmissionDeadlinePeriod',
'Description',
nan,
nan,
nan],
'Presentación de Solicitudes (Fecha)': ['ContractFolderStatus',
'TenderingProcess',
'ParticipationRequestReceptionPeriod',
'EndDate',
nan,
nan,
nan],
'Presentación de Solicitudes (Hora)': ['ContractFolderStatus',
'TenderingProcess',
'ParticipationRequestReceptionPeriod',
'EndTime',
nan,
nan,
nan],
'Presentación de Solicitudes (Observaciones)': ['ContractFolderStatus',
'TenderingProcess',
'ParticipationRequestReceptionPeriod',
'Description',
nan,
nan,
nan],
'Tipo de Anuncio': ['ContractFolderStatus',
'ValidNoticeInfo',
'NoticeTypeCode',
nan,
nan,
nan,
nan],
'Medio de Publicación': ['ContractFolderStatus',
'ValidNoticeInfo',
'AdditionalPublicationStatus',
'PublicationMediaName',
nan,
nan,
nan],
'Fecha de Publicación': ['ContractFolderStatus',
'ValidNoticeInfo',
'AdditionalPublicationStatus',
'AdditionalPublicationDocumentReference',
'IssueDate',
nan,
nan],
'Resultado': ['ContractFolderStatus',
'TenderResult',
'ResultCode',
nan,
nan,
nan,
nan],
'Fecha del Acuerdo': ['ContractFolderStatus',
'TenderResult',
'AwardDate',
nan,
nan,
nan,
nan],
'Número de Licitadores Participantes': ['ContractFolderStatus',
'TenderResult',
'ReceivedTenderQuantity',
nan,
nan,
nan,
nan],
'Identificador (+ Tipo: mod schemeName)': ['ContractFolderStatus',
'TenderResult',
'WinningParty',
'PartyIdentification',
'ID',
nan,
nan],
'Nombre del Adjudicatario': ['ContractFolderStatus',
'TenderResult',
'WinningParty',
'PartyName',
'Name',
nan,
nan],
'Lote': ['ContractFolderStatus',
'TenderResult',
'AwardedTenderedProject',
'ProcurementProjectLotID',
nan,
nan,
nan],
'Importe total ofertado (sin impuestos)': ['ContractFolderStatus',
'TenderResult',
'AwardedTenderedProject',
'LegalMonetaryTotal',
'TaxExclusiveAmount',
nan,
nan],
'Importe total ofertado (con impuestos)': ['ContractFolderStatus',
'TenderResult',
'AwardedTenderedProject',
'LegalMonetaryTotal',
'PayableAmount',
nan,
nan],
'Pliego de cláusulas administrativas': ['ContractFolderStatus',
'LegalDocumentReference',
'ID',
nan,
nan,
nan,
nan],
'Pliego de cláusulas administrativas (URI)': ['ContractFolderStatus',
'LegalDocumentReference',
'Attachment',
'ExternalReference',
'URI',
nan,
nan],
'Pliego de Prescripciones técnicas': ['ContractFolderStatus',
'TechnicalDocumentReference',
'ID',
nan,
nan,
nan,
nan],
'Pliego de Prescripciones técnicas (URI)': ['ContractFolderStatus',
'TechnicalDocumentReference',
'Attachment',
'ExternalReference',
'URI',
nan,
nan],
'Anexos a los pliegos': ['ContractFolderStatus',
'AdditionalDocumentReference',
'ID',
nan,
nan,
nan,
nan],
'Anexos a los pliegos (URI)': ['ContractFolderStatus',
'AdditionalDocumentReference',
'Attachment',
'ExternalReference',
'URI',
nan,
nan],
'Presentación de Oferta': ['ContractFolderStatus',
'TenderingProcess',
'TenderSubmissionDeadlinePeriod',
nan,
nan,
nan,
nan]}
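Judging from the output above, each human-readable (Spanish) field name maps to a list of nested element names that is padded with `nan` up to a fixed length of seven. A small sketch of how that padding could be stripped to obtain the actual path follows; `strip_padding` is a hypothetical helper, not part of the library, and the two entries are copied from the output above.

```python
import math


def strip_padding(path):
    """Drop the nan entries that pad every path to the same length."""
    return tuple(p for p in path if not (isinstance(p, float) and math.isnan(p)))


nan = float('nan')
# Two entries copied from the dictionary printed above
mapping = {
    'id': ['id', nan, nan, nan, nan, nan, nan],
    'Nombre': ['ContractFolderStatus', 'LocatedContractingParty', 'Party',
               'PartyName', 'Name', nan, nan],
}
paths = {name: strip_padding(path) for name, path in mapping.items()}
print(paths['Nombre'])
```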
URLs are produced from the given date, `from_date`, onwards.
make_urls
make_urls (base_url:str, base_filename:str, from_date:datetime.datetime)
Assemble URLs for files of a given kind that are to be downloaded
| | Type | Details |
|---|---|---|
| base_url | str | URL to the server including the hosting directory |
| base_filename | str | File name with neither date information nor extension |
| from_date | datetime | The starting date |
| Returns | list | List of tuples (URL, file name) |
As an example, let us assemble the URLs of all the outsiders files from November 2019 on.
import sproc.structure
sproc.structure.tables['outsiders']
{'base_url': 'https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/',
'base_filename': 'PlataformasAgregadasSinMenores_',
'naming_filename': 'outsiders.yaml'}
sproc.structure.tables['outsiders']['base_url'], sproc.structure.tables['outsiders']['base_filename']
('https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/',
'PlataformasAgregadasSinMenores_')
import datetime
urls_filenames = make_urls(
    sproc.structure.tables['outsiders']['base_url'],
    sproc.structure.tables['outsiders']['base_filename'],
    from_date=datetime.datetime(2019, 11, 1))
# urls_filenames[:3]
urls_filenames
[('https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_201912.zip',
'PlataformasAgregadasSinMenores_201912.zip'),
('https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_2020.zip',
'PlataformasAgregadasSinMenores_2020.zip'),
('https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_2021.zip',
'PlataformasAgregadasSinMenores_2021.zip'),
('https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_2022.zip',
'PlataformasAgregadasSinMenores_2022.zip'),
('https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_202301.zip',
'PlataformasAgregadasSinMenores_202301.zip')]
From the eve of New Year 2022 (that is, December 31, 2021)
make_urls(
    sproc.structure.tables['outsiders']['base_url'],
    sproc.structure.tables['outsiders']['base_filename'],
    from_date=datetime.datetime(2022, 1, 1) - datetime.timedelta(days=1))
[('https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_2022.zip',
'PlataformasAgregadasSinMenores_2022.zip'),
('https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_202301.zip',
'PlataformasAgregadasSinMenores_202301.zip')]
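The naming pattern visible in the two outputs above (monthly files `YYYYMM.zip` for the months following the starting date, yearly files `YYYY.zip` for whole years in between, and monthly files again for the most recent year) can be reproduced with a short sketch. It is inferred from those outputs only and is not the library's actual implementation; the name `assemble_urls_sketch` and the `last_month` parameter, which stands in for the most recent month with a published file, are assumptions.

```python
import datetime


def assemble_urls_sketch(base_url, base_filename, from_date, last_month):
    """Hypothetical re-creation of the file naming seen in the outputs."""
    stems = []
    # Months of the starting year that come after the month of `from_date`
    end_first = last_month.month if from_date.year == last_month.year else 12
    for month in range(from_date.month + 1, end_first + 1):
        stems.append(f'{from_date.year}{month:02d}')
    # Whole years strictly between the starting year and the final one
    for year in range(from_date.year + 1, last_month.year):
        stems.append(str(year))
    # Months of the final year, up to and including `last_month`
    if last_month.year > from_date.year:
        for month in range(1, last_month.month + 1):
            stems.append(f'{last_month.year}{month:02d}')
    return [(f'{base_url}{base_filename}{stem}.zip', f'{base_filename}{stem}.zip')
            for stem in stems]
```

With `last_month` set to January 2023, the sketch yields the same file names as both outputs above.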
A function to actually download all the files of a given `kind` from a certain `date`.
from_date
from_date (kind:str, date:datetime.datetime, output_directory:str|pathlib.Path=Path('/home/runner/work/sproc/sproc'))
Downloads all the files of a given kind from a certain moment in time
| | Type | Default | Details |
|---|---|---|---|
| kind | str | | One of ‘outsiders’, ‘insiders’, or ‘minors’ |
| date | datetime | | The starting date |
| output_directory | str \| pathlib.Path | /home/runner/work/sproc/sproc | Output directory; defaults to the current one |
| Returns | list | | File name of every downloaded file |
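A minimal sketch of how such a bulk download could be organised: walk over (URL, file name) pairs, skip files already present in the output directory, and fetch the rest. This is an illustration rather than the library's code. The name `download_missing`, the injectable `fetch` callable, the skipping of existing files and the choice to return only the newly fetched file names are all assumptions.

```python
import pathlib


def download_missing(urls_filenames, output_directory, fetch):
    """Fetch every file that is not yet in `output_directory`.

    `fetch` is any callable that maps a URL to the bytes found there.
    """
    output_directory = pathlib.Path(output_directory)
    downloaded = []
    for url, filename in urls_filenames:
        target = output_directory / filename
        if target.exists():
            # Nothing to do for files that are already on disk
            print(f'"{filename}" already exists')
            continue
        target.write_bytes(fetch(url))
        downloaded.append(filename)
    return downloaded
```

Passing the fetching step as a parameter keeps the loop easy to try without network access; with the `file` function documented above one could pass `fetch=lambda url: file(url, None)`.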
Let us make a new directory…
import pathlib
output_directory = pathlib.Path.cwd().parent / 'downloads'
output_directory.mkdir(exist_ok=True)
print(output_directory)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/downloads
…into which files are to be downloaded
try:
    dl_files = from_date('outsiders', datetime.datetime(2021, 10, 1), output_directory=output_directory)
except Exception:
    # the server may refuse the connection, e.g., after banning the client
    print('can\'t download...most likely due to banning...')
Downloading raw data: 100%|██████████| 4/4 [00:00<00:00, 315.46it/s]
"PlataformasAgregadasSinMenores_202111.zip" already exists
"PlataformasAgregadasSinMenores_202112.zip" already exists
"PlataformasAgregadasSinMenores_2022.zip" already exists
"PlataformasAgregadasSinMenores_202301.zip" already exists