What scraping taught me about the importance of semantic HTML
My talk was accepted to the online edition of the TECH(K)NOW conference. This is the draft of the content that I am going to present there (and it will be the place for the finished version after I record the talk).
What is semantic
!! What is the semantic use of HTML + description + a couple of examples
!! A working definition of semantics is “enough meaning to result in an action” [citation https://www.oreilly.com/library/view/mining-the-social/9781449368180/ch08.html]
Why scraping
Problems noticed while scraping
!! Have examples of differently structured HTML that you noticed at work !! Are there any examples from my work? Check the code
ng tags are bad, ARIA attributes are good
The first advice that I got
Use right semantic element
Button for search (BAD): #FiltersDiv input[name=go]
Button to accept invitation (BAD): input[value="Accept invitation"]
Link for file download (BAD): div.action-header form:not(.checkoutForm):not(.buy-link)
Navigation (BAD): ul.pagination a.endless_page_link
Div instead of a table
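To illustrate why the right element matters to a scraper, here is a minimal sketch using only Python's standard library. The page fragment is hypothetical (it is not taken from any of the sites above): it contains one real button element and one div that only behaves like a button through a click handler. A scraper that asks for buttons finds the first one and has no way of knowing about the second.

```python
from html.parser import HTMLParser

# Hypothetical fragment: a real <button> next to a clickable <div>.
PAGE = """
<form id="FiltersDiv">
  <input name="q" type="search">
  <button type="submit">Search</button>
  <div class="btn" onclick="go()">Search</div>
</form>
"""


class ButtonFinder(HTMLParser):
    """Collects the text labels of <button> elements."""

    def __init__(self):
        super().__init__()
        self.in_button = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self.in_button = True
            self.labels.append("")

    def handle_endtag(self, tag):
        if tag == "button":
            self.in_button = False

    def handle_data(self, data):
        # Only text inside a <button> counts as a button label.
        if self.in_button:
            self.labels[-1] += data


finder = ButtonFinder()
finder.feed(PAGE)
print(finder.labels)  # ['Search']
```

The clickable div is invisible to this lookup, which is exactly the situation that forces selectors like the ones listed above.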
Use descriptive class and id names
Error message (BAD): div.py5.max-width-2.mx-auto.big.line-height-3 > div
Error message (GOOD): div.alert-warning
Error message (GOOD): div[data-ref="account.billing.no-billing-access"]
Error message (GOOD): Button to accept invitation input[value="Accept invitation"]
Total (BAD): div.nobr
Next page (BAD): table[role="presentation"] > tbody > tr > td > a > i.cnqr-arrow-ww
Next page link (BAD): body > table > tbody > tr:nth-child(2) > td > table > tbody > tr > td > table > tbody > tr > td > center > form > table > tbody > tr:nth-child(1) > td > table.TblBgColor > tbody > tr:nth-child(2) > td > table > tbody > tr > td > font > b + a
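The difference between the BAD and GOOD selectors can be sketched in a few lines of Python. The two page versions below are hypothetical, and so is the message text; only the data-ref value comes from the list above. The helper `find_by_ref` is my own illustrative name. The lookup by a descriptive attribute keeps working after the layout and the styling classes change, while a positional or style-based selector would not.

```python
from html.parser import HTMLParser

REF = "account.billing.no-billing-access"

# Two hypothetical versions of the same page: the wrapper elements and
# styling classes change, the descriptive data-ref attribute does not.
OLD = (
    '<div class="py5 big">'
    '<div data-ref="account.billing.no-billing-access">No billing access</div>'
    "</div>"
)
NEW = (
    "<section>"
    '<p class="alert-warning" data-ref="account.billing.no-billing-access">'
    "<strong>No billing access</strong></p>"
    "</section>"
)


class RefFinder(HTMLParser):
    """Collects the text of the element carrying a given data-ref value.

    Simplification: assumes the matched element contains no void
    elements such as <br>, which have no end tag.
    """

    def __init__(self, ref):
        super().__init__()
        self.ref = ref
        self.depth = 0  # > 0 while inside the matching element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("data-ref") == self.ref:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def find_by_ref(html, ref):
    finder = RefFinder(ref)
    finder.feed(html)
    return "".join(finder.text).strip()


print(find_by_ref(OLD, REF))  # No billing access
print(find_by_ref(NEW, REF))  # No billing access
```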
random class name changes
aria elements missing
Non-URL-based navigation (don't hide links)
Navigation button (BAD): div.responsive-wrapper__container.team-mngmt > div.tabs__list > span:nth-child(2)
Navigation button (BAD): div.selectedState:not(.helpBtn)
Navigation buttons (BAD): #invoices-info > div.c-row > a
Navigation buttons (BAD): div.selectedState:not(.helpBtn) -> div.item.settings -> a.row.billing
We have nav and button elements
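As a rough sketch of what becomes possible when navigation is marked up with nav and real links, the following Python snippet collects every link target inside a nav element. The page fragment, its URLs, and the `openBilling()` handler are hypothetical. The script-driven div at the end offers no URL to collect, so a scraper (or any other non-visual client) cannot follow it.

```python
from html.parser import HTMLParser

# Hypothetical fragment: semantic navigation plus a script-only "link".
PAGE = """
<nav aria-label="Account">
  <a href="/settings">Settings</a>
  <a href="/settings/billing">Billing</a>
</nav>
<div class="item settings" onclick="openBilling()">Billing</div>
"""


class NavLinks(HTMLParser):
    """Collects href values of <a> elements that sit inside <nav>."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth:
            self.nav_depth -= 1


parser = NavLinks()
parser.feed(PAGE)
print(parser.links)  # ['/settings', '/settings/billing']
```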
Why is semantic HTML important
!! accessibility
!! There are cognitive impairments that make websites browsable only in reader mode: https://alistapart.com/article/accessibility-for-vestibular/
!! Screen readers: what screen readers use (https://alistapart.com/article/conversational-semantics/): links, interaction, header traversal
!! configurability
!! semantic web
!! include in the speech: the web is for the user -> from the standards
Clean up your Twitter: https://calumryan.com/articles/cleanup-your-twitter-timeline
Standards
HTML Design Principles
This document describes how HTML itself is designed, but it is also good reading for people writing for the web: https://www.w3.org/TR/html-design-principles/
Degrade Gracefully
Even if the markup is not recognized (by old or less capable user agents), it should degrade gracefully (maybe by providing additional semantic information in the elements) -> it means that no new element is introduced without thinking about this
Examples: Proposed new multimedia elements like <canvas> or <video> allow fallback content. Older user agents will show "fallback" while user agents supporting canvas or video will show the multimedia content.
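The fallback behaviour can be imitated with a toy "user agent" written in Python. This is only an illustration under my own assumptions: `ToyAgent`, `render`, and the file name talk.webm are hypothetical, and real browsers are far more involved. The point is that an agent which does not know the video element simply ignores the tag and shows whatever is inside it, so the fallback text still reaches the user.

```python
from html.parser import HTMLParser

# Hypothetical markup with fallback content inside <video>.
MARKUP = """<p>Talk recording:</p>
<video src="talk.webm">Download the <a href="talk.webm">recording</a>.</video>"""


class ToyAgent(HTMLParser):
    """A toy user agent that renders text and may or may not know <video>."""

    def __init__(self, supports_video):
        super().__init__()
        self.supports_video = supports_video
        self.in_video = 0
        self.output = []

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self.in_video += 1
            if self.supports_video:
                src = dict(attrs).get("src")
                self.output.append("[video player: %s]" % src)

    def handle_endtag(self, tag):
        if tag == "video" and self.in_video:
            self.in_video -= 1

    def handle_data(self, data):
        # A supporting agent replaces the fallback with the player;
        # an older agent ignores the unknown tag and shows its content.
        if self.in_video and self.supports_video:
            return
        self.output.append(data)


def render(markup, supports_video):
    agent = ToyAgent(supports_video)
    agent.feed(markup)
    # Collapse whitespace the way a text renderer roughly would.
    return " ".join("".join(agent.output).split())


print(render(MARKUP, supports_video=False))
# Talk recording: Download the recording.
print(render(MARKUP, supports_video=True))
# Talk recording: [video player: talk.webm]
```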
Priority of Constituencies
On the web, the end users are the priority. Citation: "In case of conflict, consider users over authors over implementors over specifiers over theoretical purity. In other words costs or difficulties to the user should be given more weight than costs to authors; which in turn should be given more weight than costs to implementors; which should be given more weight than costs to authors of the spec itself, which should be given more weight than those proposing changes for theoretical reasons alone. Of course, it is preferred to make things better for multiple constituencies at once."
Structure over presentation
Citation: "HTML should allow separation of content and presentation. For this reason, markup that expresses structure is usually preferred to purely presentational markup."
Media independence
Features should, when possible, work across different platforms, devices, and media. This should not be taken to mean that a feature should be omitted just because some media or platforms can't support it. For example, interactive features should not be omitted merely because they can not be represented in a printed document.
HTML should be accessible
Citation: Design features to be accessible to users with disabilities. Access by everyone regardless of ability is essential. This does not mean that features should be omitted entirely if not all users can make full use of them, but alternate mechanisms should be provided.
The image in an img may not be visible to blind users, but that is a reason to provide alternate text, not to leave out images.
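Alternate text is also easy to check automatically. The sketch below, using only the Python standard library, lists images that have no alt attribute at all. The page fragment and file names are hypothetical. An empty alt="" is deliberately not reported, since it is the usual way to mark a purely decorative image.

```python
from html.parser import HTMLParser

# Hypothetical fragment with three images.
PAGE = """
<img src="chart.png" alt="Bar chart of monthly invoices">
<img src="divider.png" alt="">
<img src="logo.png">
"""


class AltChecker(HTMLParser):
    """Collects the src of every <img> that lacks an alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            self.missing.append(attrs.get("src"))


checker = AltChecker()
checker.feed(PAGE)
print(checker.missing)  # ['logo.png']
```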
The backbone of accessibility is semantic HTML, which at its heart means using HTML elements as they were intended to be used. https://www.customerservant.com/in-reply-to-jgmac1106-regarding-indieweb-and-a11y/
Extending the semantic HTML
ARIA
Microformats
Microformats are a way to embed the semantics of the information in HTML in a backward-compatible way [citation from https://www.oreilly.com/library/view/mining-the-social/9781449368180/ch08.html]
Example of calendar:
RSVP
GEO
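To show how microformats turn ordinary HTML into data, here is a small Python sketch that reads an event marked up with microformats2 class names (h-event, p-name, dt-start). The event itself is made up: the date is a placeholder, not the real conference date. The parser is a deliberately simplified illustration, not the full microformats2 parsing algorithm, and it assumes p-* properties contain plain text without nested elements.

```python
from html.parser import HTMLParser

# Hypothetical event; the date is a placeholder.
EVENT = """
<div class="h-event">
  <span class="p-name">What scraping taught me about semantic HTML</span>
  <time class="dt-start" datetime="2020-06-01T10:00">1 June, 10:00</time>
</div>
"""


class EventParser(HTMLParser):
    """Reads p-* text properties and dt-* datetime properties."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self.current = None  # name of the p-* property being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for cls in (attrs.get("class") or "").split():
            if cls.startswith("dt-") and "datetime" in attrs:
                # Prefer the machine-readable value over the visible text.
                self.props[cls[3:]] = attrs["datetime"]
            elif cls.startswith("p-"):
                self.current = cls[2:]
                self.props[self.current] = ""

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current:
            self.props[self.current] += data


parser = EventParser()
parser.feed(EVENT)
print(parser.props)
# {'name': 'What scraping taught me about semantic HTML', 'start': '2020-06-01T10:00'}
```

A browser that knows nothing about microformats still shows the same readable text, which is the backward compatibility mentioned above.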
Older browsers
!! progressive enhancement !! HTML can also not be recognized in older browsers !! lang attribute (IE6 and IE7) !! video
Good practices
!! good use -> microformats and IndieWeb !! HTML as an API